Crash Course In Message Passing Interface


ORNL is managed by UT-Battelle for the US Department of Energy

Crash Course In Message Passing Interface

Adam Simpson

NCCS User Assistance


Interactive job on mars

$ qsub -I -AUT-NTNLEDU -lwalltime=02:00:00,size=16
$ module swap PrgEnv-cray PrgEnv-gnu
$ cd /lustre/medusa/$USER

Once in the interactive job you may submit MPI jobs using aprun


Terminology

• Process
  – Instance of program being executed
  – Has own virtual address space (memory)
  – Often referred to as a task

• Message
  – Data to be sent from one process to another

• Message Passing
  – The act of transferring a message from one process to another



Why message passing?

• Traditional distributed memory systems
  – Pass data between processes running on different nodes

[Figure: four nodes, each with its own CPU, memory, and network interface, connected by a network]



Why message passing?

• Modern hybrid shared/distributed memory systems
  – Need support for inter/intra node communication

[Figure: four nodes, each with multiple CPUs sharing memory and a network interface, connected by a network]


What is MPI?

• A message passing library specification
  – Message Passing Interface
  – Community driven
  – Forum composed of academic and industry researchers

• De facto standard for distributed message passing
  – Available on nearly all modern HPC resources

• Currently on 3rd major specification release

• Several open and proprietary implementations
  – Primarily C/Fortran implementations
    • OpenMPI, MPICH, MVAPICH, Cray MPT, Intel MPI, etc.
    • Bindings available for Python, R, Ruby, Java, etc.


What about Threads and Accelerators?

• MPI complements shared memory models
• Hybrid method
  – MPI between nodes, threads/accelerators on node
• Accelerator support
  – Pass accelerator memory addresses to MPI functions
  – DMA from interconnect to GPU memory
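Not from the slides: a rough sketch of the hybrid method, using MPI_Init_thread to request a thread support level and OpenMP for the on-node threads. The use of OpenMP and the MPI_THREAD_FUNNELED level are illustrative assumptions, not something the deck prescribes.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request MPI_THREAD_FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Typical layout: one (or a few) MPI ranks per node, OpenMP threads within */
    #pragma omp parallel
    {
        printf("Rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}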


MPI Functions

• MPI consists of hundreds of functions
• Most users will only use a handful
• All functions prefixed with MPI_
• C functions return integer error
  – MPI_SUCCESS if no error
• Fortran subroutines take integer error as last argument
  – MPI_SUCCESS if no error
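Not from the slides: a minimal sketch of checking the C return code against MPI_SUCCESS. MPI_Error_string is one way to turn the code into readable text; note that with the default error handler most errors abort before a code is ever returned, so explicit checks like this are optional.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int err = MPI_Init(&argc, &argv);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);   /* translate error code to text */
        fprintf(stderr, "MPI_Init failed: %s\n", msg);
        return 1;
    }

    MPI_Finalize();
    return 0;
}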


More terminology

• Communicator
  – An object that represents a group of processes that can communicate with each other
  – MPI_Comm type
  – Predefined communicator: MPI_COMM_WORLD

• Rank
  – Within a communicator each process is given a unique integer ID
  – Ranks start at 0 and are incremented contiguously

• Size
  – The total number of ranks in a communicator


Basic functions

int MPI_Init( int *argc, char ***argv )

• argc
  – Pointer to the number of arguments
• argv
  – Pointer to argument vector
• Initializes MPI environment
• Must be called before any other MPI call


Basic functions

int MPI_Finalize( void )

• Cleans up MPI environment
• Call right before exiting program
• Calling MPI functions after MPI_Finalize is undefined


Basic functions

int MPI_Comm_rank(MPI_Comm comm, int *rank)

• comm is an MPI communicator
  – Usually MPI_COMM_WORLD
• rank
  – Will be set to the rank of the calling process in the communicator comm


Basic functions

int MPI_Comm_size(MPI_Comm comm, int *size)

• comm is an MPI communicator
  – Usually MPI_COMM_WORLD
• size
  – Will be set to the number of ranks in the communicator comm


Hello MPI_COMM_WORLD

#include "stdio.h"
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d total\n", rank, size);

    MPI_Finalize();
    return 0;
}


Hello MPI_COMM_WORLD

• Compile
  – mpicc wrapper used to link in libraries and includes
    • On a Cray system use cc instead
  – Uses a standard C compiler, such as gcc, under the hood

$ cc hello_mpi.c -o hello.exe


Hello MPI_COMM_WORLD

• Run
  – mpirun -n # ./a.out launches # copies of a.out
    • On a Cray system use aprun -n # ./a.out
    • Ability to control mapping of processes onto hardware

$ aprun -n 3 ./hello.exe
Hello from rank 0 of 3 total
Hello from rank 2 of 3 total
Hello from rank 1 of 3 total


Note the order: output from the ranks may appear in any order, not in rank order.


MPI_Datatype

• Many MPI functions require a datatype
  – Not all datatypes are contiguous in memory
  – User defined datatypes possible
• Built in types for all intrinsic C types
  – MPI_INT, MPI_FLOAT, MPI_DOUBLE, …
• Built in types for all intrinsic Fortran types
  – MPI_INTEGER, MPI_REAL, MPI_COMPLEX, …
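The deck only notes that user defined datatypes are possible; as an illustrative sketch (not part of the slides), a block of three consecutive ints can be registered as a new type with MPI_Type_contiguous and then used anywhere an MPI_Datatype is expected. The type name is an assumption.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Describe three consecutive ints as a single user-defined element */
    MPI_Datatype three_ints;
    MPI_Type_contiguous(3, MPI_INT, &three_ints);
    MPI_Type_commit(&three_ints);   /* commit before using it in communication */

    /* three_ints can now be passed as the datatype argument of send/recv calls */

    MPI_Type_free(&three_ints);     /* release the type when done */
    MPI_Finalize();
    return 0;
}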


Point to Point

• Point to Point routines
  – Involves two and only two processes
  – One process explicitly initiates send operation
  – One process explicitly initiates receive operation
  – Several send/receive flavors available
    • Blocking/non-blocking
    • Buffered/non-buffered
    • Combined send-receive
  – Basis for more complicated messaging


Point to Point

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

buf       Initial address of send buffer
count     Number of elements to send
datatype  Datatype of each element in send buffer
dest      Rank of destination
tag       Integer tag used by receiver to identify message
comm      Communicator

• Returns after send buffer is ready to reuse
  – Message may not be delivered on destination rank


Point to Point

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

buf       Initial address of receive buffer
count     Maximum number of elements that can be received
datatype  Datatype of each element in receive buffer
source    Rank of source
tag       Integer tag used to identify message
comm      Communicator
status    Struct containing information on received message

• Returns after receive buffer is ready to reuse
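Not from the slides: a small illustration of what the status struct provides. Assume this fragment runs inside an initialized MPI program with a matching send posted elsewhere; the variable names are assumptions.

MPI_Status status;
int recv_buff[8], nreceived;

/* Accept a message from any rank with any tag */
MPI_Recv(recv_buff, 8, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

MPI_Get_count(&status, MPI_INT, &nreceived);   /* elements actually received */
printf("Got %d ints from rank %d with tag %d\n",
       nreceived, status.MPI_SOURCE, status.MPI_TAG);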


Point to Point: try 1

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, tag, send_buff, recv_buff;
    MPI_Status status;

    tag = 5;
    send_buff = 10;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0)
        other_rank = 1;
    else
        other_rank = 0;

    /* Both ranks receive first, then send */
    MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&send_buff, 1, MPI_INT, other_rank, tag, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}


Point to Point: try 1

• Compile
$ cc dead.c -o lock.exe

• Run
$ aprun -n 2 ./lock.exe
… … … deadlock


Deadlock

• The problem
  – Both processes are waiting to receive a message
  – Neither process ever sends a message
  – Deadlock as both processes wait forever
  – Easier to deadlock than one might think
    • Buffering can mask logical deadlocks


Point to Point: Fix 1

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0){
        /* Rank 0 receives first, then sends */
        other_rank = 1;
        MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &status);
        MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
    }
    else {
        /* Rank 1 sends first, then receives */
        other_rank = 0;
        MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
        MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}


Non blocking P2P

• Non blocking Point to Point functions
  – Allow Send and Receive to not block on CPU
    • Return before buffers are safe to reuse
  – Can be used to prevent deadlock situations
  – Can be used to overlap communication and computation
  – Calls prefixed with "I", because they return immediately


Non blocking P2P

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

buf       Initial address of send buffer
count     Number of elements to send
datatype  Datatype of each element in send buffer
dest      Rank of destination
tag       Integer tag used by receiver to identify message
comm      Communicator
request   Object used to keep track of status of the send operation


Non blocking P2P

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

buf       Initial address of receive buffer
count     Maximum number of elements that can be received
datatype  Datatype of each element in receive buffer
source    Rank of source
tag       Integer tag used to identify message
comm      Communicator
request   Object used to keep track of status of the receive operation


Non blocking P2P

int MPI_Wait(MPI_Request *request, MPI_Status *status)

• request
  – The request you want to wait to complete
• status
  – Status struct containing information on completed request
• Will block until specified request operation is complete
• MPI_Wait, or similar, must be called if request is used


Point to Point: Fix 2

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Request send_req, recv_req;
    MPI_Status send_stat, recv_stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0)
        other_rank = 1;
    else
        other_rank = 0;

    /* Post non blocking receive and send; both calls return immediately */
    MPI_Irecv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &send_req);

    /* Wait until both operations have completed */
    MPI_Wait(&recv_req, &recv_stat);
    MPI_Wait(&send_req, &send_stat);

    MPI_Finalize();
    return 0;
}


Collectives

• Collective routines
  – Involves all processes in communicator
  – All processes in communicator must participate
  – Serve several purposes
    • Synchronization
    • Data movement
    • Reductions
  – Several routines originate or terminate at a single process known as the "root"


Collectives

int MPI_Barrier( MPI_Comm comm )

• Blocks the calling process until all processes in comm have reached it
• Used to synchronize processes in comm
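Not from the slides: a small sketch of one common use of MPI_Barrier, lining ranks up before and after a timed section. The use of MPI_Wtime for timing is an illustrative choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks start timing together */
    start = MPI_Wtime();

    /* ... work to be timed would go here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest rank */
    elapsed = MPI_Wtime() - start;

    if(rank == 0)
        printf("Elapsed time: %f seconds\n", elapsed);

    MPI_Finalize();
    return 0;
}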


Collectives

Broadcast

[Figure: the root rank's buffer is copied to ranks 0–3]


Collectives

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

buf       Initial address of buffer (sent from root, received on all other ranks)
count     Number of elements to send
datatype  Datatype of each element in send buffer
root      Rank of node that will broadcast buf
comm      Communicator


Collectives

Scatter

[Figure: the root rank's send buffer is split into equal pieces, one delivered to each of ranks 0–3]


Collectives

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

sendbuf    Initial address of send buffer on root node
sendcount  Number of elements to send to each rank
sendtype   Datatype of each element in send buffer
recvbuf    Initial address of receive buffer on each node
recvcount  Number of elements in receive buffer
recvtype   Datatype of each element in receive buffer
root       Rank of node that scatters sendbuf
comm       Communicator
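Not from the slides: a minimal MPI_Scatter sketch in which rank 0 distributes one int to each rank. It assumes exactly 4 ranks, and the array contents are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, my_value;
    int values[4] = {10, 20, 30, 40};   /* only meaningful on the root rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Root (rank 0) sends one int from values[] to each rank, including itself */
    MPI_Scatter(values, 1, MPI_INT, &my_value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d received %d\n", rank, my_value);

    MPI_Finalize();
    return 0;
}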


Collectives

Gather

[Figure: each of ranks 0–3 sends its buffer to the root rank, which collects them in rank order]


Collectives

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

sendbuf    Initial address of send buffer on each node
sendcount  Number of elements to send
sendtype   Datatype of each element in send buffer
recvbuf    Initial address of receive buffer on root node
recvcount  Number of elements received from each rank
recvtype   Datatype of each element in receive buffer
root       Rank of node that gathers the data into recvbuf
comm       Communicator
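Not from the slides: a minimal MPI_Gather sketch, the mirror of the scatter example above. Each rank contributes one int and rank 0 collects them in rank order; the receive array size assumes at most 4 ranks.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, gathered[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_value = 100 * rank;   /* each rank contributes one value */

    /* Rank 0 receives one int from every rank, stored in rank order */
    MPI_Gather(&my_value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if(rank == 0)
        for(int i = 0; i < size; i++)
            printf("gathered[%d] = %d\n", i, gathered[i]);

    MPI_Finalize();
    return 0;
}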


Collectives

Reduce

[Figure: values from ranks 0–3 are combined element-wise with a reduction operation (the figure shows +, -, *) and the result is delivered to the root rank]


Collectives

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

sendbuf   Initial address of send buffer
recvbuf   Buffer to receive reduced result on root rank
count     Number of elements in sendbuf
datatype  Datatype of each element in send buffer
op        Reduce operation (MPI_MAX, MPI_MIN, MPI_SUM, …)
root      Rank of node that will receive reduced value in recvbuf
comm      Communicator
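Not from the slides: a short MPI_Reduce sketch that sums one contribution from each rank onto rank 0; the values being summed are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;   /* each rank contributes its rank plus one */

    /* Sum the local values from all ranks; the result lands only on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if(rank == 0)
        printf("Sum over all ranks = %d\n", total);

    MPI_Finalize();
    return 0;
}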


Collectives: Example

#include "stdio.h"
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, root, bcast_data;
    root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only the root sets the value before the broadcast */
    if(rank == root)
        bcast_data = 10;

    MPI_Bcast(&bcast_data, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("Rank %d has bcast_data = %d\n", rank, bcast_data);

    MPI_Finalize();
    return 0;
}


Collectives: Example

$ cc collectives.c -o coll.exe
$ aprun -n 4 ./coll.exe
Rank 0 has bcast_data = 10
Rank 1 has bcast_data = 10
Rank 3 has bcast_data = 10
Rank 2 has bcast_data = 10
