High performance computing systems Lab 4
Dept. of Computer Architecture
Faculty of ETI
Gdansk University of Technology
Paweł Czarnul
For this exercise, study support for multithreading offered by MPI. This refers to the possibility of calling MPI functions from multiple threads started within a process.
Namely, an MPI implementation may support one of the following thread support levels (listed from the weakest to the strongest):
1. MPI_THREAD_SINGLE – no thread support; only a single thread will execute,
2. MPI_THREAD_FUNNELED – the process may be multithreaded, but only the thread that initialized MPI may call MPI functions,
3. MPI_THREAD_SERIALIZED – any thread may call MPI functions, but only one at a time,
4. MPI_THREAD_MULTIPLE – any thread may call MPI functions at any time, with no restrictions.
Instead of MPI_Init, MPI_Init_thread should be called to initialize MPI together with thread support. The program requests a certain level of thread support and MPI returns the level it actually provides.
Study the MPI specification for details.
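As a minimal illustration (a sketch using only the standard MPI C interface, not part of the lab program), MPI_Init_thread can be used as follows; the comparison with < is valid because the thread level constants are ordered from weakest to strongest:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    // request the highest level of thread support; MPI reports what it actually provides
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("Only thread support level %d is provided\n", provided);
    MPI_Finalize();
    return 0;
}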
The application presented here is an extended version of the program from lab 1. It computes pi in parallel using a series known since the 17th century:

pi/4 = 1/1 - 1/3 + 1/5 - 1/7 + 1/9 - ...   (1)

A similar approach to lab 1 was adopted, but in this case the program requests MPI_THREAD_MULTIPLE from MPI. Each process then creates a certain number (THREADNUM) of threads that calculate subsums. Successive terms of the above series are assigned in a round-robin fashion to the threads of process 0, the threads of process 1, ..., the threads of process (n-1), then again to the threads of process 0, the threads of process 1, ..., and so on.
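For example, with 2 processes and THREADNUM=2 each thread advances by proccount*THREADNUM = 4 terms: thread 0 of process 0 handles 1/1, 1/9, 1/17, ...; thread 1 of process 0 handles -1/3, -1/11, -1/19, ...; thread 0 of process 1 handles 1/5, 1/13, ...; and thread 1 of process 1 handles -1/7, -1/15, ... (this corresponds to the start and step values set in the source code below).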
Note that each thread sends its result to the master process independently of the other threads.
Note that you need to use an MPI implementation that supports MPI_THREAD_MULTIPLE.
Use the KASK cluster for that:
Compile the application as follows:
/usr/mpi/gcc/mvapich2-1.4.1/bin/mpicc program41.c -lm -lpthread
and run as follows:
time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 2 -hostfile ./machines ./a.out
where the machines file should include n01.
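The machines file used by mpirun_rsh is a plain text host list, typically one node name per line; for the runs below it can be assumed (check the cluster documentation for the exact format) to contain simply:
n01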
Sample results:
THREADNUM=1
[pczarnul2@n01 ~]$ /usr/mpi/gcc/mvapich2-1.4.1/bin/mpicc program41.c -lm -lpthread
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 1 -hostfile ./machines ./a.out
Received result 0.785398 from thread 0 process 0pi=3.141593
real    0m9.878s
user    0m0.016s
sys     0m0.019s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 2 -hostfile ./machines ./a.out
Received result 5.891106 from thread 0 process 0
Received result -5.105708 from thread 0 process 1pi=3.141593
real    0m5.294s
user    0m0.014s
sys     0m0.021s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 4 -hostfile ./machines ./a.out
Received result 3.379040 from thread 0 process 0
Received result -2.674728 from thread 0 process 1
Received result 2.512067 from thread 0 process 2
Received result -2.430980 from thread 0 process 3pi=3.141593
real    0m3.737s
user    0m0.018s
sys     0m0.017s
Now with THREADNUM=2
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 1 -hostfile ./machines ./a.out
Received result 5.891106 from thread 0 process 0
Received result -5.105708 from thread 1 process 0pi=3.141593
real    0m5.254s
user    0m0.017s
sys     0m0.023s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 2 -hostfile ./machines ./a.out
Received result -2.674728 from thread 0 process 0
Received result 3.379040 from thread 1 process 0
Received result 2.512067 from thread 0 process 1
Received result -2.430980 from thread 1 process 1pi=3.141593
real    0m3.518s
user    0m0.022s
sys     0m0.017s
[pczarnul2@n01 ~]$ time /usr/mpi/gcc/mvapich2-1.4.1/bin/mpirun_rsh -np 4 -hostfile ./machines ./a.out
Received result 2.151846 from thread 0 process 0
Received result -1.474313 from thread 1 process 0
Received result 1.331611 from thread 0 process 1
Received result -1.266250 from thread 1 process 1
Received result 1.227194 from thread 0 process 2
Received result -1.200415 from thread 1 process 2
Received result 1.180455 from thread 0 process 3
Received result -1.164730 from thread 1 process 3pi=3.141593
real    0m2.731s
user    0m0.014s
sys     0m0.020s
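For reference, the single-threaded run (1 process, 1 thread) took about 9.9 s, while 4 processes with 2 threads each took about 2.7 s, i.e. a speedup of roughly 3.6. Runs with the same total number of threads (2 processes x 1 thread vs 1 process x 2 threads, or 4 x 1 vs 2 x 2) give very similar times, so in this example threads within a process and separate processes perform comparably.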
Source code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define THREADNUM 2
#define RESULT 1
pthread_t thread[THREADNUM];
pthread_attr_t attr;
int startValue[THREADNUM];
double precision=1000000000;
//2000000000;
int step;
double pilocal=0;
int myrank,proccount;
void *Calculate(void *args) {
    int start = *((int *)args); // start from this term index
    int mine, sign;
    int count = 0;
    double pi = 0; // subsum computed by this thread
    // each thread performs computations on its part of the series
    mine = start * 2 + 1;
    sign = (((mine - 1) / 2) % 2) ? -1 : 1;
    for (; mine < precision;) {
        /*
        if (!(count % 1000000)) {
            printf("\nProcess %d %d %d", myrank, sign, mine);
            fflush(stdout);
        }
        */
        pi += sign / (double)mine;
        mine += 2 * step;
        sign = (((mine - 1) / 2) % 2) ? -1 : 1;
        count++;
    }
    // each thread sends its subsum to the master process on its own
    MPI_Send(&pi, 1, MPI_DOUBLE, 0, RESULT, MPI_COMM_WORLD);
    /*
    // now update our local process pi value
    pilocal += pi;
    */
    return NULL;
}
int main(int argc, char **argv) {
    double pi_final = 0;
    int i, j;
    int threadsupport;
    void *threadstatus;
    MPI_Status status;

    // initialize MPI together with thread support
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &threadsupport);
    if (threadsupport != MPI_THREAD_MULTIPLE) {
        printf("\nThe implementation does not support MPI_THREAD_MULTIPLE, it supports level %d\n", threadsupport);
        MPI_Finalize();
        exit(-1);
    }
    // find out my rank
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    // find out the number of processes in MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &proccount);
    // now distribute the required precision
    if (precision < proccount) {
        printf("Precision smaller than the number of processes - try again.");
        MPI_Finalize();
        return -1;
    }
    // now start the threads in each process
    // define the threads as joinable
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    // initialize the step value
    step = proccount * THREADNUM;
    for (i = 0; i < THREADNUM; i++) {
        // initialize the start value
        startValue[i] = myrank * THREADNUM + i;
        // launch a thread for calculations
        pthread_create(&thread[i], &attr, Calculate, (void *)(&(startValue[i])));
    }
    if (!myrank) { // receive results from all the threads
        double resulttemp;
        for (i = 0; i < proccount; i++)
            for (j = 0; j < THREADNUM; j++) {
                // note: messages from the threads of process i may arrive in any order,
                // so j identifies the received message, not necessarily the sending thread
                MPI_Recv(&resulttemp, 1, MPI_DOUBLE, i, RESULT, MPI_COMM_WORLD, &status);
                printf("\nReceived result %f from thread %d process %d", resulttemp, j, i);
                fflush(stdout);
                pi_final += resulttemp;
            }
    }
    // now synchronize (join) the threads
    for (i = 0; i < THREADNUM; i++)
        pthread_join(thread[i], &threadstatus);
    /*
    // now merge the numbers to rank 0
    MPI_Reduce(&pilocal, &pi_final, 1,
               MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    */
    if (!myrank) {
        pi_final *= 4;
        printf("pi=%f", pi_final);
    }
    // shut down MPI
    MPI_Finalize();
    return 0;
}