ORNL is managed by UT-Battelle for the US Department of Energy
Crash Course In Message Passing Interface
Adam Simpson
NCCS User Assistance
Interactive job on mars

$ qsub -I -A UT-NTNLEDU -l walltime=02:00:00,size=16
$ module swap PrgEnv-cray PrgEnv-gnu
$ cd /lustre/medusa/$USER
Once in the interactive job you may submit MPI jobs using aprun
Terminology

• Process
  – Instance of a program being executed
  – Has its own virtual address space (memory)
  – Often referred to as a task
• Message
  – Data to be sent from one process to another
• Message Passing
  – The act of transferring a message from one process to another
Why message passing?

• Traditional distributed memory systems
  – Pass data between processes running on different nodes

[Figure: four nodes connected by a network; each node contains a CPU, its own memory, and a network interface]
Why message passing?

• Modern hybrid shared/distributed memory systems
  – Need support for inter/intra node communication

[Figure: four nodes connected by a network; each node contains multiple CPUs sharing a single memory and a network interface]
What is MPI?

• A message passing library specification
  – Message Passing Interface
  – Community driven
  – Forum composed of academic and industry researchers
• De facto standard for distributed message passing
  – Available on nearly all modern HPC resources
• Currently on 3rd major specification release
• Several open and proprietary implementations
  – Primarily C/Fortran implementations
    • OpenMPI, MPICH, MVAPICH, Cray MPT, Intel MPI, etc.
    • Bindings available for Python, R, Ruby, Java, etc.
What about Threads and Accelerators?

• MPI complements shared memory models
• Hybrid method (see the sketch below)
  – MPI between nodes, threads/accelerators on node
• Accelerator support
  – Pass accelerator memory addresses to MPI functions
  – DMA from interconnect to GPU memory
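To make the hybrid method concrete, here is a minimal sketch, assuming an MPI implementation that supports MPI_THREAD_FUNNELED and a compiler with OpenMP enabled (e.g. -fopenmp for gcc); MPI_Init_thread is standard MPI:

#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, provided;

    /* FUNNELED: only the main thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads handle on-node work; MPI spans nodes */
    #pragma omp parallel
    printf("Rank %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}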
MPI Functions

• MPI consists of hundreds of functions
• Most users will only use a handful
• All functions are prefixed with MPI_
• C functions return an integer error code (see the sketch below)
  – MPI_SUCCESS if no error
• Fortran subroutines take an integer error code as their last argument
  – MPI_SUCCESS if no error
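A sketch of checking that error code. By default most implementations abort on error, so the handler is first switched to MPI_ERRORS_RETURN; both the handler call and MPI_Error_string are standard MPI:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, err, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);  /* readable message */
        fprintf(stderr, "MPI_Comm_rank failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}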
More terminology

• Communicator
  – An object that represents a group of processes that can communicate with each other
  – MPI_Comm type
  – Predefined communicator: MPI_COMM_WORLD
• Rank
  – Within a communicator each process is given a unique integer ID
  – Ranks start at 0 and are incremented contiguously
• Size
  – The total number of ranks in a communicator
Basic functions

int MPI_Init( int *argc, char ***argv )

• argc
  – Pointer to the number of arguments
• argv
  – Pointer to the argument vector
• Initializes the MPI environment
• Must be called before any other MPI call
Basic functions

int MPI_Finalize( void )

• Cleans up the MPI environment
• Call right before exiting the program
• Calling MPI functions after MPI_Finalize is undefined
Basic functions

int MPI_Comm_rank( MPI_Comm comm, int *rank )

• comm is an MPI communicator
  – Usually MPI_COMM_WORLD
• rank
  – Will be set to the rank of the calling process in the communicator comm
Basic functions

int MPI_Comm_size( MPI_Comm comm, int *size )

• comm is an MPI communicator
  – Usually MPI_COMM_WORLD
• size
  – Will be set to the number of ranks in the communicator comm
Hello MPI_COMM_WORLD

#include "stdio.h"
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d total\n", rank, size);

    MPI_Finalize();
    return 0;
}
Hello MPI_COMM_WORLD

• Compile
  – mpicc wrapper used to link in libraries and includes
    • On a Cray system use cc instead
  – Uses a standard C compiler, such as gcc, under the hood

$ cc hello_mpi.c -o hello.exe
Hello MPI_COMM_WORLD

• Run
  – mpirun -n # ./a.out launches # copies of a.out
    • On a Cray system use aprun -n # ./a.out
    • Ability to control mapping of processes onto hardware

$ aprun -n 3 ./hello.exe
Hello from rank 0 of 3 total
Hello from rank 2 of 3 total
Hello from rank 1 of 3 total

Note the order: output is not guaranteed to arrive in rank order.
MPI_Datatype

• Many MPI functions require a datatype
  – Not all datatypes are contiguous in memory
  – User defined datatypes are possible (see the sketch below)
• Built in types for all intrinsic C types
  – MPI_INT, MPI_FLOAT, MPI_DOUBLE, …
• Built in types for all intrinsic Fortran types
  – MPI_INTEGER, MPI_REAL, MPI_COMPLEX, …
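A minimal sketch of a user defined datatype using the standard MPI_Type_contiguous, which bundles four MPI_INTs into one unit; it assumes at least two ranks, and the buffer contents are illustrative:

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, data[4] = {0, 1, 2, 3};
    MPI_Datatype quad;   /* four contiguous MPI_INTs */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(4, MPI_INT, &quad);
    MPI_Type_commit(&quad);   /* must commit before use */

    if (rank == 0)
        MPI_Send(data, 1, quad, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(data, 1, quad, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Type_free(&quad);
    MPI_Finalize();
    return 0;
}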
Point to Point

• Point to Point routines
  – Involve two and only two processes
  – One process explicitly initiates the send operation
  – One process explicitly initiates the receive operation
  – Several send/receive flavors available
    • Blocking/non-blocking
    • Buffered/non-buffered
    • Combined send-receive
  – Basis for more complicated messaging
Point to Point

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

buf       Initial address of send buffer
count     Number of elements to send
datatype  Datatype of each element in send buffer
dest      Rank of destination
tag       Integer tag used by receiver to identify message
comm      Communicator

• Returns after send buffer is ready to reuse
  – Message may not have been delivered on the destination rank
Point to Point

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)

buf       Initial address of receive buffer
count     Maximum number of elements that can be received
datatype  Datatype of each element in receive buffer
source    Rank of source
tag       Integer tag used to identify message
comm      Communicator
status    Struct containing information on the received message

• Returns after receive buffer is ready to reuse
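The status struct can then be inspected. A minimal two-rank sketch using the standard MPI_SOURCE and MPI_TAG fields, the MPI_ANY_SOURCE/MPI_ANY_TAG wildcards, and MPI_Get_count:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, buff[8] = {0}, nreceived;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(buff, 3, MPI_INT, 1, 7, MPI_COMM_WORLD);  /* 3 of 8 */
    } else if (rank == 1) {
        /* Accept a message from any sender with any tag */
        MPI_Recv(buff, 8, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &nreceived);
        printf("source=%d tag=%d count=%d\n",
               status.MPI_SOURCE, status.MPI_TAG, nreceived);
    }

    MPI_Finalize();
    return 0;
}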
Point to Point: try 1

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, send_buff, recv_buff, tag;
    MPI_Status status;

    tag = 5;
    send_buff = 10;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0)
        other_rank = 1;
    else
        other_rank = 0;

    /* Both ranks post a blocking receive first */
    MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, tag,
             MPI_COMM_WORLD, &status);
    MPI_Send(&send_buff, 1, MPI_INT, other_rank, tag,
             MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
Point to Point: try 1

• Compile
$ cc dead.c -o lock.exe

• Run
$ aprun -n 2 ./lock.exe
…
…
… deadlock
Deadlock

• The problem
  – Both processes are waiting to receive a message
  – Neither process ever sends a message
  – Deadlock as both processes wait forever
  – Easier to deadlock than one might think
    • Buffering can mask logical deadlocks
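The combined send-receive flavor mentioned earlier offers another way out: MPI_Sendrecv (standard MPI) pairs the two operations in one call, so neither rank has to choose an ordering. A minimal sketch of the two-rank exchange from try 1, rewritten:

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other_rank = (rank == 0) ? 1 : 0;

    /* Send and receive in a single call; MPI schedules the transfers */
    MPI_Sendrecv(&rank, 1, MPI_INT, other_rank, 5,
                 &recv_buff, 1, MPI_INT, other_rank, 5,
                 MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}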
Point to Point: Fix 1

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0){
        /* Rank 0 receives first, then sends */
        other_rank = 1;
        MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5,
                 MPI_COMM_WORLD, &status);
        MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
    }
    else {
        /* Rank 1 sends first, then receives */
        other_rank = 0;
        MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
        MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5,
                 MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}
Non blocking P2P

• Non blocking Point to Point functions
  – Allow Send and Receive to not block on CPU
    • Return before buffers are safe to reuse
  – Can be used to prevent deadlock situations
  – Can be used to overlap communication and computation (see the sketch after Fix 2)
  – Calls prefixed with "I", because they return immediately
Non blocking P2P

int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

buf       Initial address of send buffer
count     Number of elements to send
datatype  Datatype of each element in send buffer
dest      Rank of destination
tag       Integer tag used by receiver to identify message
comm      Communicator
request   Object used to keep track of the status of the send operation
Non blocking P2P

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request)

buf       Initial address of receive buffer
count     Maximum number of elements that can be received
datatype  Datatype of each element in receive buffer
source    Rank of source
tag       Integer tag used to identify message
comm      Communicator
request   Object used to keep track of the status of the receive operation
Non blocking P2P

int MPI_Wait(MPI_Request *request, MPI_Status *status)

• request
  – The request you want to wait to complete
• status
  – Status struct containing information on the completed request
• Will block until the specified request operation is complete
• MPI_Wait, or similar, must be called if a request is used
Point to Point: Fix 2

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Request send_req, recv_req;
    MPI_Status send_stat, recv_stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0)
        other_rank = 1;
    else
        other_rank = 0;

    /* Both calls return immediately, so neither rank can block the other */
    MPI_Irecv(&recv_buff, 1, MPI_INT, other_rank, 5,
              MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&rank, 1, MPI_INT, other_rank, 5,
              MPI_COMM_WORLD, &send_req);

    /* Buffers are not safe to reuse until the waits complete */
    MPI_Wait(&recv_req, &recv_stat);
    MPI_Wait(&send_req, &send_stat);

    MPI_Finalize();
    return 0;
}
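The same non blocking calls also enable the communication/computation overlap mentioned earlier: independent work can be placed between posting an operation and waiting on it. A minimal two-rank sketch, where work() is a hypothetical function that does not touch the message buffers:

#include "mpi.h"

/* Hypothetical computation that does not touch the message buffers */
void work(void) { /* ... */ }

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Request recv_req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other_rank = (rank == 0) ? 1 : 0;

    MPI_Irecv(&recv_buff, 1, MPI_INT, other_rank, 5,
              MPI_COMM_WORLD, &recv_req);
    MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);

    work();  /* the receive may progress while we compute */

    MPI_Wait(&recv_req, &status);  /* recv_buff safe to use after this */

    MPI_Finalize();
    return 0;
}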
Collectives

• Collective routines
  – Involve all processes in the communicator
  – All processes in the communicator must participate
  – Serve several purposes
    • Synchronization
    • Data movement
    • Reductions
  – Several routines originate or terminate at a single process known as the "root"
Collectives

int MPI_Barrier( MPI_Comm comm )

• Blocks each process in comm until all processes have reached it
• Used to synchronize processes in comm
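A common use is lining ranks up before timing a region; a minimal sketch using the standard MPI_Wtime:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    double t0, t1;

    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);  /* all ranks start the clock together */
    t0 = MPI_Wtime();

    /* ... region to be timed ... */

    t1 = MPI_Wtime();
    printf("Region took %f seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}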
Collectives

Broadcast

[Figure: the root's buffer is copied to every rank, 0 through 3]
Collectives

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)

buf       Initial address of send buffer
count     Number of elements to send
datatype  Datatype of each element in send buffer
root      Rank of the process that will broadcast buf
comm      Communicator
Collectives

Scatter

[Figure: the root's buffer is split into pieces and one piece is sent to each rank, 0 through 3]
Collectives

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm)

sendbuf    Initial address of send buffer on the root process
sendcount  Number of elements to send to each process
sendtype   Datatype of each element in send buffer
recvbuf    Initial address of receive buffer on each process
recvcount  Number of elements in receive buffer
recvtype   Datatype of each element in receive buffer
root       Rank of the process that scatters sendbuf
comm       Communicator
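A minimal scatter sketch, assuming it is launched with 4 ranks so each receives one element:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, recv_val;
    int send_vals[4] = {10, 20, 30, 40};  /* only significant on root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each of the 4 ranks receives one element of send_vals */
    MPI_Scatter(send_vals, 1, MPI_INT, &recv_val, 1, MPI_INT,
                0, MPI_COMM_WORLD);

    printf("Rank %d received %d\n", rank, recv_val);

    MPI_Finalize();
    return 0;
}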
Collectives

Gather

[Figure: each rank, 0 through 3, sends its buffer to the root, which collects the pieces in rank order]
Collectives

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm)

sendbuf    Initial address of send buffer on each process
sendcount  Number of elements to send
sendtype   Datatype of each element in send buffer
recvbuf    Initial address of receive buffer on the root process
recvcount  Number of elements to receive from each process
recvtype   Datatype of each element in receive buffer
root       Rank of the process that gathers the data
comm       Communicator
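A matching gather sketch, again assuming 4 ranks:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, recv_vals[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Root collects one element from every rank, ordered by rank */
    MPI_Gather(&rank, 1, MPI_INT, recv_vals, 1, MPI_INT,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Gathered: %d %d %d %d\n",
               recv_vals[0], recv_vals[1], recv_vals[2], recv_vals[3]);

    MPI_Finalize();
    return 0;
}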
Collectives

Reduce

[Figure: values from ranks 0 through 3 are combined at the root with an operation such as +, -, or *]
Collectives

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

sendbuf   Initial address of send buffer
recvbuf   Buffer to receive the reduced result on the root rank
count     Number of elements in sendbuf
datatype  Datatype of each element in send buffer
op        Reduce operation (MPI_MAX, MPI_MIN, MPI_SUM, …)
root      Rank of the process that receives the reduced value in recvbuf
comm      Communicator
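A minimal reduce sketch that sums every rank's ID at the root:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Combine every rank's value with MPI_SUM; result lands on root */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}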
Collectives: Example

#include "stdio.h"
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, root, bcast_data;
    root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only the root sets the value before the broadcast */
    if(rank == root)
        bcast_data = 10;

    MPI_Bcast(&bcast_data, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("Rank %d has bcast_data = %d\n", rank, bcast_data);

    MPI_Finalize();
    return 0;
}
Collectives: Example

$ cc collectives.c -o coll.exe
$ aprun -n 4 ./coll.exe
Rank 0 has bcast_data = 10
Rank 1 has bcast_data = 10
Rank 3 has bcast_data = 10
Rank 2 has bcast_data = 10