10 Włodzimierz Bielecki, Krzysztof Kraska
405/440/460 embedded systems development and the related IBM RISCWatch v6.0i debugger [8]. Cache utilization figures were obtained from the DCU statistics (sim readdcu) of the simulator.
The following simulator configuration was used for the experiments:
— 2 x PowerPC405 processors with
— 16KB two-way set-associative DataCache-L1 (8 words/32 bytes per cache line)
— no DataCache-L2.
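The cache geometry implied by this configuration can be worked out directly. The following C sketch is illustrative only: the helper names are hypothetical, and the set-index mapping (line address modulo the number of sets) is the conventional one, assumed here rather than taken from the simulator documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets in a set-associative cache:
 * total capacity divided by (associativity * line size). */
static unsigned cache_num_sets(unsigned cache_bytes, unsigned ways,
                               unsigned line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    assert(cache_bytes % (ways * line_bytes) == 0);
    return cache_bytes / (ways * line_bytes);
}

/* Set selected by a byte address, ASSUMING the conventional mapping
 * set = (address / line size) mod number of sets. */
static unsigned cache_set_index(uint32_t addr, unsigned num_sets,
                                unsigned line_bytes)
{
    assert(num_sets > 0 && line_bytes > 0);
    return (unsigned)((addr / line_bytes) % num_sets);
}
```

For a 16KB two-way cache with 32-byte lines this gives 256 sets, so under the assumed mapping two addresses 8192 bytes apart fall into the same set and compete for its two ways.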
The sources used in the experiments were developed in a manner representative of embedded software development, using a cross-platform development environment composed of an Intel PC workstation and the target executable architecture [8]. The examined C sources were compiled on Fedora 4 Linux x86 to the PowerPC Embedded ABI file format with the gcc-3.3.1 compiler and executed in the target system environment using the MC-ISS software simulator. Due to the limitations of the target architecture, two data-processing threads were extracted in the sources. Iterations of a parallel loop were assigned to threads according to the static scheduling policy, i.e., each thread was assigned one half of the consecutive loop iterations [12].
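The static assignment of consecutive iterations to threads can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the type `iter_range` and the function `static_chunk` are hypothetical names, and the handling of a remainder (extra iterations go to the lowest-numbered threads) is an assumption that does not matter when the iteration count is even and two threads are used.

```c
#include <assert.h>

/* Half-open range [begin, end) of loop iterations owned by one thread. */
struct iter_range {
    int begin;
    int end;
};

/* Static (block) scheduling: thread `tid` out of `nthreads` receives one
 * contiguous chunk of the `n` iterations. If n is not divisible by
 * nthreads, the first (n % nthreads) threads get one extra iteration. */
static struct iter_range static_chunk(int n, int nthreads, int tid)
{
    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);

    int base = n / nthreads;
    int extra = n % nthreads;
    struct iter_range r;

    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}
```

With 256 iterations and two threads, thread 0 would execute iterations 0 to 127 and thread 1 iterations 128 to 255.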
Table 4 shows the results achieved for the matrix multiplication code simulated in the MC-ISS embedded software simulator.
Table 4. The experimental results of DCU utilization for the matrix multiplication code (N=256, B=8)

| RISCWatch STATUS | Sequential CPU0 | Sequential CPU1 | Parallel SFS CPU0 | Parallel SFS CPU1 | Parallel SFS with Blocking CPU0 | Parallel SFS with Blocking CPU1 | Parallel SFS with Blocking & Array Contraction CPU0 | Parallel SFS with Blocking & Array Contraction CPU1 |
|---|---|---|---|---|---|---|---|---|
| DCU total accesses | 31852424 | N/A | 127014104 | 127044104 | 1450122% | 1450122% | 119846472 | 119846472 |
| DCU misses | 2160751 | N/A | 8634538 | 8634538 | 317789 | 317789 | 317789 | 317789 |
| Misses/total [%] | 6.8% | N/A | 6.8% | 6.8% | 0.22% | 0.22% | 0.27% | 0.27% |
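To illustrate what the blocking parameter B in Table 4 refers to, the following C sketch shows a generic tiled matrix multiplication restricted to a range of rows. It is an assumption-laden illustration, not the code measured in the experiments: the function name `matmul_blocked`, the `int` element type, the row-major layout, and the loop order are all chosen here for the example. Restricting the outer loop to a row range reflects the idea that each thread can work on its own rows of the result without synchronization.

```c
#include <assert.h>
#include <stddef.h>

static int min_int(int x, int y)
{
    return x < y ? x : y;
}

/* Accumulates C += A * B for rows [row_begin, row_end) of C, visiting the
 * data in blk x blk tiles so that a tile can stay in the cache while it is
 * reused. All matrices are n x n, row-major; C must be zeroed by the
 * caller before the first call. */
static void matmul_blocked(int n, int blk, int row_begin, int row_end,
                           const int *a, const int *b, int *c)
{
    assert(n > 0 && blk > 0);
    assert(0 <= row_begin && row_begin <= row_end && row_end <= n);

    for (int ii = row_begin; ii < row_end; ii += blk)
        for (int kk = 0; kk < n; kk += blk)
            for (int jj = 0; jj < n; jj += blk)
                /* Multiply one tile of A by one tile of B. */
                for (int i = ii; i < min_int(ii + blk, row_end); i++)
                    for (int k = kk; k < min_int(kk + blk, n); k++) {
                        int aik = a[(size_t)i * n + k];
                        for (int j = jj; j < min_int(jj + blk, n); j++)
                            c[(size_t)i * n + j] += aik * b[(size_t)k * n + j];
                    }
}
```

In a two-thread setting, one thread could call the function with rows 0 to N/2 and the other with rows N/2 to N; the two calls write disjoint rows of C.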
Table 5 shows the results obtained for the Livermore loop Kernel 1 (hydro fragment) code executed in the MC-ISS embedded software simulator.
Table 5. The experimental results of DCU utilization for the Kernel 1 (loop=100; array_size=8192*sizeof(int))

| RISCWatch STATUS | Sequential CPU0 | Sequential CPU1 | Parallel CPU0 | Parallel CPU1 | Parallel SFS CPU0 | Parallel SFS CPU1 | Parallel SFS with Array Contraction CPU0 | Parallel SFS with Array Contraction CPU1 |
|---|---|---|---|---|---|---|---|---|
| DCU total accesses | 11527637 | N/A | 5800687 | 5800687 | 11576472 | 11576472 | 8399916 | 8399916 |
| DCU misses | 309546 | N/A | 155799 | 155799 | 5130 | 5130 | 5131 | 5131 |
| Misses/total [%] | 2.69% | N/A | 2.69% | 2.69% | 0.04% | 0.04% | 0.06% | 0.06% |
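For readers unfamiliar with the benchmark, the loop body of Livermore Kernel 1 (hydro fragment) has the well-known form sketched below. The sketch is illustrative and is not the source used in the experiments: the function name `hydro_fragment` and the explicit iteration range are hypothetical, and the `int` element type is assumed from the `sizeof(int)` in the caption of Table 5. Each iteration writes only `x[k]` and reads only `y` and `z`, so disjoint iteration ranges can be executed by different threads without synchronization, provided `x` does not overlap `y` or `z`.

```c
#include <assert.h>

/* Livermore Kernel 1 (hydro fragment) over iterations [begin, end):
 *   x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11])
 * z must hold at least end + 11 elements. */
static void hydro_fragment(int begin, int end, int q, int r, int t,
                           int *x, const int *y, const int *z)
{
    assert(0 <= begin && begin <= end);

    for (int k = begin; k < end; k++)
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
}
```

Calling the function once over the whole range, or twice over the two halves of the range, produces the same contents of `x`.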
After the extraction of synchronization-free slices, the examined sources achieved the same DCU misses/total ratio in the first case (matrix multiplication, 6.8% before and after) and a much better ratio in the second case (Kernel 1, 2.69% reduced to 0.04%).
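The misses/total ratio reported in the tables is the number of DCU misses divided by the number of DCU accesses, expressed as a percentage. A small helper, with the hypothetical name `miss_ratio_percent`, makes the arithmetic explicit and can be used to check the reported figures against the raw counters.

```c
#include <assert.h>

/* Miss ratio in percent: 100 * misses / total accesses. */
static double miss_ratio_percent(unsigned long misses, unsigned long total)
{
    assert(total > 0 && misses <= total);
    return 100.0 * (double)misses / (double)total;
}
```

For example, the sequential matrix multiplication counters (2160751 misses out of 31852424 accesses) give a ratio of roughly 6.8%, and the Kernel 1 Parallel SFS counters (5130 misses out of 11576472 accesses) give roughly 0.04%, in agreement with the tables.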