Włodzimierz Bielecki, Krzysztof Kraska

Ensuring the optimal performance of a program whose level of parallelism is limited to the capabilities of a target architecture requires an iterative estimation of data locality for different ways of agglomerating slices, in combination with different types of scheduling (space-time mapping) and with the application of well-known techniques for improving data locality. Choosing the best combination of the above-mentioned elements according to a generalized data locality factor assures the optimal performance of the program.
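A minimal sketch of such an exhaustive selection over agglomeration and scheduling variants is given below. The types agglomeration_t and schedule_t and the function locality_factor() are hypothetical placeholders introduced only for illustration; the actual estimation of the locality factors follows the method of [3].

#include <stddef.h>

/* Hypothetical descriptors of one slice-agglomeration variant and one
   scheduling (space-time mapping) variant. */
typedef struct { int id; } agglomeration_t;
typedef struct { int id; } schedule_t;

/* Hypothetical evaluator: returns the generalized data locality factor of
   the program variant built from a given agglomeration and schedule. */
double locality_factor(const agglomeration_t *a, const schedule_t *s);

/* Examine every combination and keep the one with the best (here: the
   largest) generalized data locality factor. */
void select_best_variant(const agglomeration_t *aggs, size_t na,
                         const schedule_t *scheds, size_t ns,
                         size_t *best_a, size_t *best_s)
{
    double best = -1.0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < ns; j++) {
            double f = locality_factor(&aggs[i], &scheds[j]);
            if (f > best) {
                best = f;
                *best_a = i;
                *best_s = j;
            }
        }
    }
}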

For both the Livermore Loops Kernel 1 and the matrix multiplication algorithm, the level of parallelism was adjusted to the target architecture by calling the function omp_set_num_threads(omp_get_num_procs()) from the OpenMP library. The number of threads created for the target architecture (2 × Intel Pentium 4 with Hyper-Threading technology) was 4. The size of the DataCache-L1 line of this processor was 128 bytes, which corresponds to 32 array elements. Since the #pragma omp parallel for directive in both programs did not specify the type of iteration scheduling, the compiler applied the static scheduling policy, allocating a quarter of the consecutive iterations to each thread [12].
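A minimal sketch of this setup for the matrix multiplication is shown below. It assumes the standard i-j-k loop order with the outermost i-loop parallelized and 4-byte array elements, so that 32 elements occupy one 128-byte cache line; it illustrates the configuration only and is not the benchmark code itself.

#include <omp.h>

#define N 2048

/* 4-byte elements: 32 elements per 128-byte DataCache-L1 line. */
static float x[N][N], y[N][N], z[N][N];

void matmul(void)
{
    /* Match the number of threads to the target architecture:
       4 logical processors on 2 x Pentium 4 with Hyper-Threading. */
    omp_set_num_threads(omp_get_num_procs());

    int i, j, k;
    /* No schedule clause is given, so the compiler applies the static
       policy and each of the 4 threads executes N/4 consecutive
       i-iterations. */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                z[i][j] += x[i][k] * y[k][j];
}

The references z[i][j], x[i][k] and y[k][j] appearing in this loop nest are the ones analyzed in Table 1.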

The data locality factors for the matrix multiplication algorithm and the Livermore Loops Kernel 1, calculated using the method from [3], are presented in Tables 1, 2 and 3.

Table 1. Reuse factors for the matrix multiplication algorithm

Reference | Temporal reuse (k, j, i) | Spatial reuse (k, j, i) | Self-reuse (Rk, Rj, Ri) | Cumulative self-reuse (Rk', Rj', Ri') | Data footprint (Fk', Fj', Fi')
----------|--------------------------|-------------------------|-------------------------|----------------------------------------|-------------------------------
z[i][j]   | N, 1, 1                  | 1, 32, 1                | N, 32, 1                | N, 32N, 32N                            | 1, N/32, N²/128
x[i][k]   | 1, N, 1                  | 32, 1, 1                | 32, N, 1                | 32, 32N, 32N                           | N/32, N/32, N²/128
y[k][j]   | 1, 1, N/4                | 1, 32, 1                | 1, 32, N/4              | 1, 32, 8N                              | N, N²/32, N²/32
Σ         |                          |                         | N+33, N+64, N/4+2       | N+33, 64N+32, 72N                      | N+N/32+1, N²/32+N/16, 6·N²/128

In the case of the matrix multiplication there are no multiple references to the same array, therefore only the self-reuse factors were calculated. For arrays of size N × N, where N = 2048, the data footprint for the outermost loop, Fi', considerably exceeds the size of the DataCache-L1, causing accesses to a slower level of the memory hierarchy:

Fi' = 6·N²/128 ;   N = 2048 ;   Fi' = 196608 .
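The quoted value follows from summing the Fi' column of Table 1 and evaluating the result at N = 2048; the short check below is given for clarity:

F_{i'} = \frac{N^2}{128} + \frac{N^2}{128} + \frac{N^2}{32} = \frac{6 N^2}{128},
\qquad
F_{i'} \Big|_{N = 2048} = \frac{6 \cdot 2048^2}{128} = 6 \cdot 32768 = 196608 .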