Włodzimierz Bielecki, Krzysztof Kraska
Ensuring optimal performance for a program whose level of parallelism is limited by the capabilities of a target architecture requires iterative estimation of data locality for different ways of agglomerating slices, in combination with different types of scheduling (space-time mapping) and the application of well-known techniques for improving data locality. Choosing the best combination of the above-mentioned elements according to a generalized data locality factor assures the optimal performance of a program.
For both the Livermore Loops Kernel 1 and the matrix multiplication algorithm, the level of parallelism was adjusted to the target architecture by calling omp_set_num_threads(omp_get_num_procs()) from the OpenMP library. The number of threads created for the target architecture (2 x Intel Pentium 4 with Hyper-Threading technology) was 4. The size of a DataCache-L1 line for this processor is 128 bytes, which corresponds to 32 array elements. Since the #pragma omp parallel for directive in both programs omitted the type of iteration scheduling, the compiler applied the static scheduling policy, assigning a contiguous block of ¼ of the iterations to each thread [12].
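The measured source is not reproduced here; the following is a minimal sketch of the setup just described, assuming the standard i-j-k loop order implied by Table 1 (k innermost) and 4-byte float elements (32 per 128-byte DataCache-L1 line). The function name matmul and the array declarations are illustrative only; the array names z, x, y follow Table 1.

```c
#include <omp.h>

#define N 2048

/* 4-byte elements, so one 128-byte cache line holds 32 of them. */
static float z[N][N], x[N][N], y[N][N];

void matmul(void)
{
    /* One thread per logical processor: 4 on the 2 x Pentium 4 HT machine. */
    omp_set_num_threads(omp_get_num_procs());

    /* No schedule clause, so the compiler falls back to its static policy,
     * handing each of the 4 threads a contiguous block of N/4 i-iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                z[i][j] += x[i][k] * y[k][j];
}
```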
The data locality factors for the matrix multiplication algorithm and the Livermore Loops Kernel 1, calculated with the method of [3], are presented in Table 1, Table 2 and Table 3.
Table 1. Reuse factors for the matrix multiplication algorithm

| Reference | Temporal reuse (k, j, i) | Spatial reuse (k, j, i) | Self-reuse (Rk, Rj, Ri) | Cumulative self-reuse (Rk', Rj', Ri') | Data footprint (Fk', Fj', Fi') |
|-----------|--------------------------|-------------------------|--------------------------|----------------------------------------|--------------------------------|
| z[i][j]   | N, 1, 1                  | 1, 32, 1                | N, 32, 1                 | N, 32N, 32N                            | 1, N/32, N²/128                |
| x[i][k]   | 1, N, 1                  | 32, 1, 1                | 32, N, 1                 | 32, 32N, 32N                           | N/32, N/32, N²/128             |
| y[k][j]   | 1, 1, N/4                | 1, 32, 1                | 1, 32, N/4               | 1, 32, 8N                              | N, N²/32, N²/32                |
| Σ         |                          |                         | N+33, N+64, N/4+2        | N+33, 64N+32, 72N                      | N+N/32+1, N²/32+N/16, 6N²/128  |
In the case of the matrix multiplication there are no separate references to the same array (hence no group reuse), so only the self-reuse factors were calculated. For arrays of size N × N, where N = 2048, the data footprint of the outermost loop, Fi', considerably exceeds the size of DataCache-L1, causing accesses to a slower level of the memory hierarchy:

    Fi' = 6N²/128 ;  N = 2048  ⇒  Fi' = 196608 .
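The value 6N²/128 is just the sum of the Fi' column of Table 1; a worked version of the arithmetic, under our reading that footprints are counted in 128-byte cache lines:

```latex
% Column sum of F_{i'} in Table 1: each thread's i-range covers N/4 rows,
% hence N^2/128 lines for z and for x, while all of y (N^2/32 lines) is read.
F_{i'} = \underbrace{\frac{N^2}{128}}_{z[i][j]}
       + \underbrace{\frac{N^2}{128}}_{x[i][k]}
       + \underbrace{\frac{N^2}{32}}_{y[k][j]}
       = \frac{6N^2}{128},
\qquad
F_{i'}\big|_{N=2048} = \frac{6 \cdot 2048^{2}}{128} = 196608.
```

If this count is indeed in 128-byte lines, it corresponds to roughly 24 MB of touched data, several orders of magnitude more than a Pentium 4 L1 data cache can hold.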