High performance computing systems Lab 4
Dept. of Computer Architecture
Faculty of ETI
Gdansk University of Technology
Paweł Czarnul
For this exercise, study support for multithreading offered by MPI. This refers to the possibility of calling MPI functions from multiple threads started within a process.
Namely, an MPI implementation may support one of the following thread support levels (listed from the weakest to the strongest):
1. MPI_THREAD_SINGLE – no thread support; only a single thread will execute,
2. MPI_THREAD_FUNNELED – the process may be multithreaded, but only the thread that initialized MPI may call MPI functions,
3. MPI_THREAD_SERIALIZED – any thread may call MPI functions, but only one at a time,
4. MPI_THREAD_MULTIPLE – any thread may call MPI functions at any time, with no restrictions.
Instead of MPI_Init, MPI_Init_thread should be called to initialize MPI together with thread support. The program requests a certain level of thread support and MPI returns the level it actually provides.
Study the MPI specification for details.
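As a minimal illustration (a sketch using only the standard MPI C interface, not part of the lab program), MPI_Init_thread can be used as follows; the comparison with < is valid because the thread level constants are ordered from weakest to strongest:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    // request the highest level of thread support; MPI reports what it actually provides
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("Only thread support level %d is provided\n", provided);
    MPI_Finalize();
    return 0;
}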
The application presented here is an extended version of the program from lab 1. It computes pi in parallel using a series known since the 17th century:

pi/4 = 1/1 - 1/3 + 1/5 - 1/7 + 1/9 - ...   (1)

A similar approach to lab 1 was adopted, but in this case the program requests MPI_THREAD_MULTIPLE from MPI. Each process then creates a certain number (THREADNUM) of threads that calculate subsums. Successive terms of the above series are assigned in a round-robin fashion to the threads of process 0, the threads of process 1, ..., the threads of process (n-1), then again to the threads of process 0, the threads of process 1, ..., and so on.
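For example, with 2 processes and THREADNUM=2 each thread advances by proccount*THREADNUM = 4 terms: thread 0 of process 0 handles 1/1, 1/9, 1/17, ...; thread 1 of process 0 handles -1/3, -1/11, -1/19, ...; thread 0 of process 1 handles 1/5, 1/13, ...; and thread 1 of process 1 handles -1/7, -1/15, ... (this corresponds to the start and step values set in the source code below).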
Note that each thread sends its result to the master process independently of the other threads.
Note that you need to use an MPI implementation that supports MPI_THREAD_MULTIPLE.
Use the KASK cluster for that:
Compile the application as follows:
/usr/mpi/gcc/mvapich2-1.4.1/bin/mpicc program41.c -lm -lpthread
and run as follows:
time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 2 -hostfile ./machines ./a.out
where the machines file should include n01.
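The machines file used by mpirun_rsh is a plain text host list, typically one node name per line; for the runs below it can be assumed (check the cluster documentation for the exact format) to contain simply:
n01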
Sample results:
THREADNUM=1
[pczarnul2@n01 ~]$ /usr/mpi/gcc/mvapich2-1.4.1/bin/mpicc program41.c -lm -lpthread
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 1 -hostfile ./machines ./a.out
Received result 0.785398 from thread 0 process 0pi=3.141593
real    0m9.878s
user    0m0.016s
sys     0m0.019s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 2 -hostfile ./machines ./a.out
Received result 5.891106 from thread 0 process 0
Received result -5.105708 from thread 0 process 1pi=3.141593
real    0m5.294s
user    0m0.014s
sys     0m0.021s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 4 -hostfile ./machines ./a.out
Received result 3.379040 from thread 0 process 0
Received result -2.674728 from thread 0 process 1
Received result 2.512067 from thread 0 process 2
Received result -2.430980 from thread 0 process 3pi=3.141593
real    0m3.737s
user    0m0.018s
sys     0m0.017s
Now with THREADNUM=2
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 1 -hostfile ./machines ./a.out
Received result 5.891106 from thread 0 process 0
Received result -5.105708 from thread 1 process 0pi=3.141593
real    0m5.254s
user    0m0.017s
sys     0m0.023s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 2 -hostfile ./machines ./a.out
Received result -2.674728 from thread 0 process 0
Received result 3.379040 from thread 1 process 0
Received result 2.512067 from thread 0 process 1
Received result -2.430980 from thread 1 process 1pi=3.141593
real    0m3.518s
user    0m0.022s
sys     0m0.017s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 4 -hostfile ./machines ./a.out
Received result 2.151846 from thread 0 process 0
Received result -1.474313 from thread 1 process 0
Received result 1.331611 from thread 0 process 1
Received result -1.266250 from thread 1 process 1
Received result 1.227194 from thread 0 process 2
Received result -1.200415 from thread 1 process 2
Received result 1.180455 from thread 0 process 3
Received result -1.164730 from thread 1 process 3pi=3.141593
real    0m2.731s
user    0m0.014s
sys     0m0.020s
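For reference, the single-threaded run (1 process, 1 thread) took about 9.9 s, while 4 processes with 2 threads each took about 2.7 s, i.e. a speedup of roughly 3.6. Runs with the same total number of threads (2 processes x 1 thread vs 1 process x 2 threads, or 4 x 1 vs 2 x 2) give very similar times, so in this example threads within a process and separate processes perform comparably.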
Source code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define THREADNUM 2
#define RESULT 1
pthread_t thread[THREADNUM];
pthread_attr_t attr;
int startValue[THREADNUM];
double precision=1000000000;
//2000000000;
int step;
double pilocal=0;
int myrank,proccount;
void *Calculate(void *args) {
    int start = *((int *)args); // start from this term index
    int mine, sign;
    int count = 0;
    double pi = 0; // subsum computed by this thread
    // each thread performs computations on its part of the series
    mine = start * 2 + 1;
    sign = (((mine - 1) / 2) % 2) ? -1 : 1;
    for (; mine < precision;) {
        /*
        if (!(count % 1000000)) {
            printf("\nProcess %d %d %d", myrank, sign, mine);
            fflush(stdout);
        }
        */
        pi += sign / (double)mine;
        mine += 2 * step;
        sign = (((mine - 1) / 2) % 2) ? -1 : 1;
        count++;
    }
    // each thread sends its subsum to the master process on its own
    MPI_Send(&pi, 1, MPI_DOUBLE, 0, RESULT, MPI_COMM_WORLD);
    /*
    // now update our local process pi value
    pilocal += pi;
    */
    return NULL;
}
int main(int argc, char **argv) {
    double pi_final = 0;
    int i, j;
    int threadsupport;
    void *threadstatus;
    MPI_Status status;

    // initialize MPI together with thread support
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &threadsupport);
    if (threadsupport != MPI_THREAD_MULTIPLE) {
        printf("\nThe implementation does not support MPI_THREAD_MULTIPLE, it supports level %d\n", threadsupport);
        MPI_Finalize();
        exit(-1);
    }
    // find out my rank
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    // find out the number of processes in MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &proccount);
    // now distribute the required precision
    if (precision < proccount) {
        printf("Precision smaller than the number of processes - try again.");
        MPI_Finalize();
        return -1;
    }
    // now start the threads in each process
    // define the threads as joinable
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    // initialize the step value
    step = proccount * THREADNUM;
    for (i = 0; i < THREADNUM; i++) {
        // initialize the start value
        startValue[i] = myrank * THREADNUM + i;
        // launch a thread for calculations
        pthread_create(&thread[i], &attr, Calculate, (void *)(&(startValue[i])));
    }
    if (!myrank) { // receive results from all the threads
        double resulttemp;
        for (i = 0; i < proccount; i++)
            for (j = 0; j < THREADNUM; j++) {
                // note: messages from the threads of process i may arrive in any order,
                // so j identifies the received message, not necessarily the sending thread
                MPI_Recv(&resulttemp, 1, MPI_DOUBLE, i, RESULT, MPI_COMM_WORLD, &status);
                printf("\nReceived result %f from thread %d process %d", resulttemp, j, i);
                fflush(stdout);
                pi_final += resulttemp;
            }
    }
    // now synchronize (join) the threads
    for (i = 0; i < THREADNUM; i++)
        pthread_join(thread[i], &threadstatus);
    /*
    // now merge the numbers to rank 0
    MPI_Reduce(&pilocal, &pi_final, 1,
               MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    */
    if (!myrank) {
        pi_final *= 4;
        printf("pi=%f", pi_final);
    }
    // shut down MPI
    MPI_Finalize();
    return 0;
}