6 Włodzimierz Bielecki, Krzysztof Kraska
Similarly to general-purpose software development, embedded system development requires programming languages, debuggers, compilers, linkers, and other programming tools. Approved as an IEEE standard, the SystemC language is an example of a tool that enables the implementation of both the software and hardware parts of embedded systems.
The optimal implementation of software components designed for multiprocessor embedded systems is critical for their performance and power consumption. However, poor data locality is a common feature of many existing numerical applications [6]. Such applications are often represented with affine loops in which the considerable quantities of data stored in arrays exceed the size of a fast but small cache memory. In inefficient code, referenced data must be fetched into the cache from external memory even though it could be reused many times. Because cache is expensive, cache memories often operate at the full speed of a processor, while cheaper but more capacious external memory modules operate at a several times lower frequency; hence the time to access data located in cache memory is significantly shorter. Improvement in data locality can be obtained by means of high-level program transformations. Increasing data locality within a program improves the utilization of the fast data cache and limits accesses to the slower memory modules at a lower level of the memory hierarchy. Finally, it yields a general performance improvement for the software parts of embedded systems.
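As a minimal illustration of such a high-level transformation (not taken from the paper; the array size and access pattern are assumptions for the example), loop interchange can turn a stride-N traversal of a row-major C array into a stride-1 traversal, so that each fetched cache line is fully consumed before eviction:

```c
#include <stddef.h>

#define N 512

/* Column-major traversal of a row-major C array: consecutive iterations
 * of the inner loop touch elements N*sizeof(double) bytes apart, so most
 * accesses miss the cache and data is refetched from external memory. */
double sum_poor_locality(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[j][i];          /* stride-N access pattern */
    return s;
}

/* After loop interchange the inner loop walks consecutive elements,
 * so every word of each fetched cache line is used before eviction. */
double sum_good_locality(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[j][i];          /* stride-1 access pattern */
    return s;
}
```

Both versions compute the same result; only the order of memory accesses, and hence the cache behavior, differs.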
A new method of extracting synchronization-free slices (SFS) in arbitrarily nested loops was introduced in [1]. The method enables us to extract more parallel threads than other methods. The well-known technique invented by Wolfe [3] estimates data reuse factors; it makes it possible to adopt an order of loop execution that increases data locality in a program. In relation to the method of extracting synchronization-free slices, the estimation of data locality is a necessary activity to obtain improved performance for a program executed on a target architecture. The SFS method extracts the maximal number of parallel threads; however, any target embedded architecture consists of a fixed number of CPU cores, usually smaller than the number of threads extracted. Hence, it is necessary to adjust the level of parallelism in a program to the target architecture [10]. Our previous research conducted on parallel computers indicates that the extraction of synchronization-free slices, as well as applying the tiling and array contraction techniques within an individual thread, can considerably increase the performance of a parallel program. For example, the results of the experiments performed for the Livermore Loops Kernel 1 (hydro fragment) [5] and the matrix multiplication algorithm [6] indicate considerable gains in execution time (Figure 1a and Figure 1b) [2].
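To make the tiling technique mentioned above concrete, a generic tiled (blocked) matrix multiplication can be sketched as follows. This is an illustrative sketch, not the paper's benchmark code; the matrix size N and tile size B are assumed tuning parameters, with B chosen so that three B×B tiles fit in the cache:

```c
#include <stddef.h>

#define N 128   /* matrix dimension (assumed for illustration)          */
#define B 32    /* tile size, tuned so three BxB tiles fit in the cache */

/* Tiled matrix multiplication: c += a * b.
 * Each BxB tile of b and c is reused B times while it remains
 * cache-resident, instead of being refetched from external memory
 * on every pass over the arrays. */
void matmul_tiled(const double a[N][N], const double b[N][N],
                  double c[N][N]) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t jj = 0; jj < N; jj += B)
                /* intra-tile loops: all data touched here is cached */
                for (size_t i = ii; i < ii + B; i++)
                    for (size_t k = kk; k < kk + B; k++) {
                        double aik = a[i][k];
                        for (size_t j = jj; j < jj + B; j++)
                            c[i][j] += aik * b[k][j];
                    }
}
```

The transformation changes only the iteration order, not the result, which is what makes it a pure locality optimization.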
In contrast, the example of a simple code in Figure 2, executed on the same target architecture, proves that under certain circumstances the extraction of parallel slices can limit the performance of a program: the execution time of the parallel code (8 seconds) was about 30% greater than that of the corresponding sequential code (6 seconds). It can be noticed that the parallel code has a decreased spatial-reuse factor value for the reference to the array a[], caused to a large extent by the need to maintain coherence between the caches of the multiple processors.
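This kind of coherence-induced slowdown is commonly known as false sharing: threads on different cores write to distinct elements of a[] that happen to occupy the same cache line, so the line ping-pongs between caches. The following sketch (a hypothetical pthreads example, not the code of Figure 2; the thread count and 64-byte line size are assumptions) contrasts a falsely shared counter array with one padded to cache-line granularity:

```c
#include <pthread.h>

#define NTHREADS   4
#define CACHE_LINE 64

/* Adjacent long counters fall into one cache line, so concurrent writes
 * by different cores force coherence traffic (false sharing), even
 * though no element is logically shared between threads. */
static long shared_hits[NTHREADS];

/* Padded version: each counter occupies its own cache line. */
static struct { long v; char pad[CACHE_LINE - sizeof(long)]; }
    padded_hits[NTHREADS];

struct arg { int id; long iters; int padded; };

static void *worker(void *p) {
    struct arg *a = p;
    for (long i = 0; i < a->iters; i++) {
        if (a->padded)
            padded_hits[a->id].v++;   /* private line: no ping-pong   */
        else
            shared_hits[a->id]++;     /* shared line: coherence traffic */
    }
    return 0;
}

/* Runs the counting loop on NTHREADS threads and returns the total. */
long run(long iters, int padded) {
    pthread_t th[NTHREADS];
    struct arg args[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        shared_hits[t] = 0;
        padded_hits[t].v = 0;
        args[t] = (struct arg){ t, iters, padded };
        pthread_create(&th[t], 0, worker, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], 0);
    long total = 0;
    for (int t = 0; t < NTHREADS; t++)
        total += padded ? padded_hits[t].v : shared_hits[t];
    return total;
}
```

Both variants compute the same total; only the padded layout avoids the cache-line invalidations, which is why padding or data-layout transformations can restore the spatial-reuse factor lost in a naive parallelization.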