8002824277

8002824277



Włodzimierz Bielecki, Krzysztof Kraska

The assurance of the optimal performance for a program with the level of paral-lelism limited to the possibility of a target architecture reąuires the iterative estimation of data locality for different ways of the slices agglomeration in combination with dif-ferent types of scheduling (space-time mapping) and applying well-known techniąues for improving data locality. The choice of the best combination of the above-mentioned elements under a generalized data locality factor will assure the optimal performance of a program.

For both of the Livermore Loops Kernel 1 and the matrix multiplication algorithm, the level of parallelism was adjusted to a target architecture by applying the function omp_set_num_threads ( omp_get_num_procs ( ) ) from the OpenMP library. The number of threads created for the target architecture (2 x Intel Pentium 4 with Hyper-Threading technology) was 4. The size of the DataCache-Ll linę for the processor was 128-bytes which corresponds to 32 array elements. Since the directive #pragma parallel in the both programs omitted the type of the iteration scheduling, the compiler applied the scheduling of the static policy allocating V* consecutive iterations to each thread [12].

The data locality factors for the matrix multiplication algorithm and the Livermore Loops Kernel 1 calculated with the method [3] are presented in Table 1, Table 2 and Table 3.

Table 1. Reuse factors for the matrix multiplication algorithm

Reuse factors

Reference

Temporal

Spatial

Self-reuse

Cumulative self reuse

Data footprint

k

j

i

k

j

i

Rk

Rj

Ri

Rk’

Rj’

Ri’

Fk*

Fj’

Fi’

z[i][j]

N

1

1

1

32

1

N

32

1

N

32N

32N

1

N/32

N2/128

x[i][k|

1

N

1

32

1

1

32

N

1

32

32N

32N

N/32

N/32

N2/128

ylklUl

1

1

N/4

1

32

1

1

32

N/4

1

32

8N

N

N2/32

N2/32

Z

N + 33

N + 64

N/4+ 2

N+33

64N + 32

72N

N+N/32 + 1

o

Z

+

Z

Z

In the case of the matrix multiplication, there are no separate references to the same array, therefore only the self-reuse factors were calculated. For arrays of size Nx N, where N=2048, the data footprint for the outermost loop, Fi*, considerably exceeded the size of DataCache-Ll causing an access to slower memory level:

,    6*N2

F =- ; N = 2048 ; F = 196608 .

128



Wyszukiwarka

Podobne podstrony:
10 Włodzimierz Bielecki, Krzysztof Kraska 405/440/460 embedded systems development and the related I
12 Włodzimierz Bielecki, Krzysztof Kraska Mathematica package. The CodeGenerationModule is responsib
6 Włodzimierz Bielecki, Krzysztof Kraska Similarly to the Computer software development, the embedde
Spis treści Włodzimierz Bielecki, Krzysztof Kraska INCREASING DATA LOCALITY OF PARALLEL PROGRAMS EXE
Increasing data locality of parallel programs executed in embedded Systems Włodzimierz Bielecki, Krz
118 Krzysztof Waliszewski The microeconomic benefits of personal planning and financial management h
114 Krzysztof Waliszewski The origination of a new subdiscipline in financial science is connected w
18 [230]    Wierzcholski* Krzysztof, Czajkowski Andrzej:Influence of pressure on the
3 [26] Wierzcholski Krzysztof: About the system of Solutions of the difference eąuation Monograph (i
[92]    Wierzcholski Krzysztof: Estimation and integration of the basie equations for

więcej podobnych podstron