Increasing data locality of parallel programs executed in embedded systems 9
Table 2. Temporal-reuse factors for the Livermore Loops Kernel 1 (hydro fragment)
| Reference | Temporal, k | Temporal, l | Spatial, k | Spatial, l | Self-reuse R_k | Self-reuse R_l | Cumulative self-reuse R_k* | Cumulative self-reuse R_l* | Data footprint F_k* | Data footprint F_l* |
|---|---|---|---|---|---|---|---|---|---|---|
| x[k] | 1 | loop | 32 | 1 | 32 | loop | 32 | 32*loop | n/32 | n/128 |
| y[k] | 1 | loop | 32 | 1 | 32 | loop | 32 | 32*loop | n/32 | n/128 |
| z[k+10] | 1 | loop | 32 | 1 | 32 | loop | 32 | 32*loop | n/32 | n/128 |
| z[k+11] | 1 | loop | 32 | 1 | 32 | loop | 32 | 32*loop | n/32 | n/128 |
| Σ | | | | | 128 | 4*loop | 128 | 128*loop | n/8 | n/32 |
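The Σ row of Table 2 can be checked by adding the per-reference entries over the four references. The following is a minimal sketch of that arithmetic only; the function name `sum_row` and the sample values of `n` and `loop` are hypothetical and do not come from the paper.

```python
from fractions import Fraction

# The four array references of Kernel 1 listed in Table 2.
REFERENCES = ("x[k]", "y[k]", "z[k+10]", "z[k+11]")


def sum_row(n, loop):
    """Sum the per-reference factors of Table 2 over all references.

    Every reference has the same entries, so each total is simply
    4 times the per-reference value.
    """
    per_reference = {
        "R_k": 32,
        "R_l": loop,
        "R_k*": 32,
        "R_l*": 32 * loop,
        "F_k*": Fraction(n, 32),
        "F_l*": Fraction(n, 128),
    }
    return {name: len(REFERENCES) * value for name, value in per_reference.items()}


# Hypothetical problem size and repetition count, chosen only for illustration.
totals = sum_row(n=1024, loop=10)
for name, value in totals.items():
    print(name, value)
```

With these inputs the totals correspond to 128, 4*loop, 128, 128*loop, n/8 and n/32, which are the entries of the Σ row.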
Table 3. Spatial-reuse factors for the Livermore Loops Kernel 1 (hydro fragment)
| Reference | Self-spatial reuse, k | Self-spatial reuse, l | Group-temporal reuse, k | Group-temporal reuse, l | Cumulative group reuse, k | Cumulative group reuse, l |
|---|---|---|---|---|---|---|
| z[k+11] | 32 | 1 | 1 | 1 | 32 | 32 |
| z[k+10] | 32 | 1 | 1 | loop | 32 | 32*loop |
The value F_i* = 196608 with a 4-byte array element size gives 768 KB. After applying the tiling technique and splitting the data into blocks with a side of B = 32 elements, the data footprint decreased so that the data could be placed entirely in DataCache-L1:
$$F_i^{*} = \frac{6 B^{2}}{128}; \quad B = 32; \quad F_i^{*} = 48.$$
The value F_i* = 48 with a 4-byte array element size gives 192 B. Note that DataCache-L1 was shared between 2 parallel threads.
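The byte sizes quoted above follow from multiplying the footprint value by the element size. Below is a minimal sketch of that arithmetic; the helper names are hypothetical, and reading 196608 as the same expression 6*N^2/128 evaluated at a side of N = 2048 is an assumption made here, not a statement from the text.

```python
from fractions import Fraction

ELEMENT_SIZE_BYTES = 4  # array element size stated in the text


def tiled_footprint(block_side):
    """Evaluate 6 * B^2 / 128, the footprint expression quoted in the text."""
    return Fraction(6 * block_side ** 2, 128)


def footprint_bytes(footprint_value, element_size=ELEMENT_SIZE_BYTES):
    """Convert a footprint value to bytes for the given element size."""
    return footprint_value * element_size


untiled = 196608              # footprint value before tiling, from the text
tiled = tiled_footprint(32)   # B = 32 elements per block side

print(footprint_bytes(untiled) // 1024, "KB")
print(footprint_bytes(tiled), "B")

# Assumption (not from the text): the untiled value matches the same
# expression evaluated for a side of 2048 elements.
assert tiled_footprint(2048) == untiled
```

Under these numbers, tiling reduces the footprint from 768 KB to 192 B.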
In the case of the Livermore Loops Kernel 1 (hydro fragment), the self-reuse factors are identical for the source with fine-grained parallelism and the source with synchronization-free slices extracted. There is also group reuse between the references z[k+11] and z[k+10], sorted so that the reuse distance between adjacent references is lexicographically nonnegative. Both references also have self-temporal and self-spatial reuse factors. The group-spatial reuse factor equals one, since the self-spatial reuse factor is already present. To take reuse between references into account, a generalized data reuse factor for the outermost loop l is computed by dividing the data footprint by the cumulative group reuse factor, which finally gives:
$$\frac{n/32}{32 \cdot loop} = \frac{n}{32^{2} \cdot loop}.$$
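The division above can be reproduced numerically. The sketch below assumes the footprint n/32 and the cumulative group reuse factor 32*loop from the tables; the function names and the sample values of `n` and `loop` are hypothetical. It also shows why z[k+11] and z[k+10] form a reuse group: the element read through z[k+11] in iteration k is read through z[k+10] one iteration later.

```python
from fractions import Fraction


def generalized_reuse_factor(n, loop, factor=32):
    """Divide the data footprint n/32 by the cumulative group reuse 32*loop."""
    footprint = Fraction(n, factor)
    cumulative_group_reuse = factor * loop
    return footprint / cumulative_group_reuse


def group_reuse_distance(leading_offset=11, trailing_offset=10):
    """Number of k iterations between z[k+11] and z[k+10] touching the same element."""
    return leading_offset - trailing_offset


# Hypothetical problem size and repetition count, chosen only for illustration.
n, loop = 4096, 4
result = generalized_reuse_factor(n, loop)
print(result)
print(group_reuse_distance())
```

The result equals n/(32^2 * loop) for any positive `n` and `loop`, and the reuse distance between the two references is one iteration of k, which is nonnegative as required by the ordering described above.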
Experiments were performed using the software simulator IBM PowerPC Multi-Core Instruction Set Simulator v1.29 (MC-ISS) [7], intended for the PowerPC