Page 1: Crash Course In Message Passing Interface

ORNL is managed by UT-Battelle for the US Department of Energy

Crash Course In Message Passing Interface

Adam Simpson

NCCS User Assistance

Page 2: Crash Course In Message Passing Interface

Interactive job on mars

$ qsub -I -AUT-NTNLEDU -lwalltime=02:00:00,size=16
$ module swap PrgEnv-cray PrgEnv-gnu
$ cd /lustre/medusa/$USER

Once in the interactive job you may submit MPI jobs using aprun
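
For example, a 16-process run inside the interactive job would look like this (a.out is just a placeholder executable name):

$ aprun -n 16 ./a.out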

Page 3: Crash Course In Message Passing Interface

Terminology

• Process
  – Instance of program being executed
  – Has own virtual address space (memory)
  – Often referred to as a task

• Message
  – Data to be sent from one process to another

• Message Passing
  – The act of transferring a message from one process to another

Page 4: Crash Course In Message Passing Interface

Why message passing?

• Traditional distributed memory systems

[Diagram: four nodes, each with its own CPU, memory, and network interface, connected by a network]

Page 5: Crash Course In Message Passing Interface

Why message passing?

• Traditional distributed memory systems

[Diagram: the same four-node picture, with one CPU, its memory, and its network interface grouped and labeled "Node"]

Page 6: Crash Course In Message Passing Interface

Why message passing?

• Traditional distributed memory systems
  – Pass data between processes running on different nodes

[Diagram: four nodes exchanging data over the network]

Page 7: Crash Course In Message Passing Interface

Why message passing?

• Modern hybrid shared/distributed memory systems

[Diagram: four nodes, each with several CPUs sharing one memory and one network interface; one group labeled "Node"]

Page 8: Crash Course In Message Passing Interface

Why message passing?

• Modern hybrid shared/distributed memory systems
  – Need support for inter/intra node communication

[Diagram: multi-CPU nodes communicating both within a node and across nodes over the network]

Page 9: Crash Course In Message Passing Interface

What is MPI?

• A message passing library specification
  – Message Passing Interface
  – Community driven
  – Forum composed of academic and industry researchers
• De facto standard for distributed message passing
  – Available on nearly all modern HPC resources
• Currently on 3rd major specification release
• Several open and proprietary implementations
  – Primarily C/Fortran implementations
    • OpenMPI, MPICH, MVAPICH, Cray MPT, Intel MPI, etc.
    • Bindings available for Python, R, Ruby, Java, etc.

Page 10: Crash Course In Message Passing Interface

What about Threads and Accelerators?

• MPI complements shared memory models
• Hybrid method
  – MPI between nodes, threads/accelerators on node
• Accelerator support
  – Pass accelerator memory addresses to MPI functions
  – DMA from interconnect to GPU memory
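
The slides do not include code for the hybrid method, but a minimal sketch, assuming OpenMP provides the on-node threads, might look like this: MPI_Init_thread requests a threading level, MPI carries data between ranks, and the threads share each rank's memory.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request FUNNELED: only the main thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threads share this rank's memory; MPI moves data between ranks */
    #pragma omp parallel
    printf("Rank %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}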

Page 11: Crash Course In Message Passing Interface

MPI Functions

• MPI consists of hundreds of functions
• Most users will only use a handful
• All functions prefixed with MPI_
• C functions return an integer error code
  – MPI_SUCCESS if no error
• Fortran subroutines take integer error as last argument
  – MPI_SUCCESS if no error
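
A minimal sketch of checking a return code in C, assuming MPI has already been initialized (the choice of MPI_Abort on failure is illustrative):

int size;
int err = MPI_Comm_size(MPI_COMM_WORLD, &size);
if (err != MPI_SUCCESS) {
    /* Give up on every rank in the communicator */
    MPI_Abort(MPI_COMM_WORLD, err);
}

Note that the default error handler aborts on error, so explicit checks like this mainly matter when a custom error handler has been installed.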

Page 12: Crash Course In Message Passing Interface

More terminology

• Communicator
  – An object that represents a group of processes that can communicate with each other
  – MPI_Comm type
  – Predefined communicator: MPI_COMM_WORLD
• Rank
  – Within a communicator each process is given a unique integer ID
  – Ranks start at 0 and are incremented contiguously
• Size
  – The total number of ranks in a communicator

Page 13: Crash Course In Message Passing Interface

Basic functions

int MPI_Init( int *argc, char ***argv )

• argc
  – Pointer to the number of arguments
• argv
  – Pointer to argument vector
• Initializes MPI environment
• Must be called before any other MPI call

Page 14: Crash Course In Message Passing Interface

Basic functions

int MPI_Finalize( void )

• Cleans up MPI environment
• Call right before exiting program
• Calling MPI functions after MPI_Finalize is undefined

Page 15: Crash Course In Message Passing Interface

Basic functions

int MPI_Comm_rank(MPI_Comm comm, int *rank)

• comm is an MPI communicator
  – Usually MPI_COMM_WORLD
• rank
  – Will be set to the rank of the calling process in the communicator comm

Page 16: Crash Course In Message Passing Interface

Basic functions

int MPI_Comm_size(MPI_Comm comm, int *size)

• comm is an MPI communicator
  – Usually MPI_COMM_WORLD
• size
  – Will be set to the number of ranks in the communicator comm

Page 17: Crash Course In Message Passing Interface

Hello MPI_COMM_WORLD

#include "stdio.h"
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d total\n", rank, size);

    MPI_Finalize();
    return 0;
}

Page 18: Crash Course In Message Passing Interface

Hello MPI_COMM_WORLD

• Compile
  – mpicc wrapper used to link in libraries and includes
    • On a Cray system use cc instead
  – Uses standard C compiler, such as gcc, under the hood

$ cc hello_mpi.c -o hello.exe

Page 19: Crash Course In Message Passing Interface

Hello MPI_COMM_WORLD

• Run
  – mpirun -n # ./a.out launches # copies of a.out
    • On a Cray system use aprun -n # ./a.out
    • Ability to control mapping of processes onto hardware

$ aprun -n 3 ./hello.exe
Hello from rank 0 of 3 total
Hello from rank 2 of 3 total
Hello from rank 1 of 3 total

Page 20: Crash Course In Message Passing Interface

Hello MPI_COMM_WORLD

$ aprun -n 3 ./hello.exe
Hello from rank 0 of 3 total
Hello from rank 2 of 3 total
Hello from rank 1 of 3 total

• Note the order: ranks run independently, so their output may appear in any order

Page 21: Crash Course In Message Passing Interface

MPI_Datatype

• Many MPI functions require a datatype
  – Not all datatypes are contiguous in memory
  – User defined datatypes possible
• Built in types for all intrinsic C types
  – MPI_INT, MPI_FLOAT, MPI_DOUBLE, …
• Built in types for all intrinsic Fortran types
  – MPI_INTEGER, MPI_REAL, MPI_COMPLEX, …
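
The slide mentions user defined datatypes; here is a hedged sketch (not from the deck) using MPI_Type_vector to describe one column of a 4x4 row-major array of doubles:

double a[4][4];
MPI_Datatype column_t;

/* 4 blocks of 1 double each, with a stride of 4 doubles between blocks */
MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column_t);
MPI_Type_commit(&column_t);

/* Illustrative use: send column 0 of a to rank 1 with tag 0 */
MPI_Send(&a[0][0], 1, column_t, 1, 0, MPI_COMM_WORLD);

MPI_Type_free(&column_t);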

Page 22: Crash Course In Message Passing Interface

Point to Point

• Point to Point routines
  – Involves two and only two processes
  – One process explicitly initiates send operation
  – One process explicitly initiates receive operation
  – Several send/receive flavors available
    • Blocking/non-blocking
    • Buffered/non-buffered
    • Combined send-receive
  – Basis for more complicated messaging

Page 23: Crash Course In Message Passing Interface

Point to Point

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

buf        Initial address of send buffer
count      Number of elements to send
datatype   Datatype of each element in send buffer
dest       Rank of destination
tag        Integer tag used by receiver to identify message
comm       Communicator

• Returns after send buffer is ready to reuse
  – Message may not be delivered on destination rank

Page 24: Crash Course In Message Passing Interface

Point to Point

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

buf        Initial address of receive buffer
count      Maximum number of elements that can be received
datatype   Datatype of each element in receive buffer
source     Rank of source
tag        Integer tag used to identify message
comm       Communicator
status     Struct containing information on received message

• Returns after receive buffer is ready to reuse

Page 25: Crash Course In Message Passing Interface

Point to Point: try 1

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, tag, send_buff, recv_buff;
    MPI_Status status;

    tag = 5;
    send_buff = 10;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0)
        other_rank = 1;
    else
        other_rank = 0;

    MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&send_buff, 1, MPI_INT, other_rank, tag, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Page 26: Crash Course In Message Passing Interface

Point to Point: try 1

• Compile
$ cc dead.c -o lock.exe

• Run
$ aprun -n 2 ./lock.exe
… … … deadlock

Page 27: Crash Course In Message Passing Interface

Deadlock

• The problem
  – Both processes are waiting to receive a message
  – Neither process ever sends a message
  – Deadlock as both processes wait forever
  – Easier to deadlock than one might think
    • Buffering can mask logical deadlocks

Page 28: Crash Course In Message Passing Interface

Point to Point: Fix 1

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0){
        other_rank = 1;
        MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &status);
        MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
    }
    else {
        other_rank = 0;
        MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
        MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}
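
The earlier slide also listed "Combined send-receive" as a flavor; as a hedged alternative to ordering the calls by hand, each rank could issue a single MPI_Sendrecv (using the same variables as Fix 1):

/* Matched send and receive in one call; MPI handles the ordering */
MPI_Sendrecv(&rank, 1, MPI_INT, other_rank, 5,
             &recv_buff, 1, MPI_INT, other_rank, 5,
             MPI_COMM_WORLD, &status);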

Page 29: Crash Course In Message Passing Interface

Non blocking P2P

• Non blocking Point to Point functions
  – Allow Send and Receive to not block on CPU
    • Return before buffers are safe to reuse
  – Can be used to prevent deadlock situations
  – Can be used to overlap communication and computation
  – Calls prefixed with "I", because they return immediately

Page 30: Crash Course In Message Passing Interface

Non blocking P2P

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

buf        Initial address of send buffer
count      Number of elements to send
datatype   Datatype of each element in send buffer
dest       Rank of destination
tag        Integer tag used by receiver to identify message
comm       Communicator
request    Object used to keep track of status of send operation

Page 31: Crash Course In Message Passing Interface

Non blocking P2P

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

buf        Initial address of receive buffer
count      Maximum number of elements that can be received
datatype   Datatype of each element in receive buffer
source     Rank of source
tag        Integer tag used to identify message
comm       Communicator
request    Object used to keep track of status of receive operation

Page 32: Crash Course In Message Passing Interface

Non blocking P2P

int MPI_Wait(MPI_Request *request, MPI_Status *status)

• request
  – The request you want to wait to complete
• status
  – Status struct containing information on completed request
• Will block until specified request operation is complete
• MPI_Wait, or similar, must be called if request is used

Page 33: Crash Course In Message Passing Interface

Point to Point: Fix 2

#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, other_rank, recv_buff;
    MPI_Request send_req, recv_req;
    MPI_Status send_stat, recv_stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank==0)
        other_rank = 1;
    else
        other_rank = 0;

    MPI_Irecv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &send_req);

    MPI_Wait(&recv_req, &recv_stat);
    MPI_Wait(&send_req, &send_stat);

    MPI_Finalize();
    return 0;
}
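
As a hedged variant not shown in the slides, the two MPI_Wait calls can be replaced by a single MPI_Waitall over an array of requests:

MPI_Request reqs[2];
MPI_Status stats[2];

MPI_Irecv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &reqs[1]);

/* Blocks until both the receive and the send have completed */
MPI_Waitall(2, reqs, stats);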

Page 34: Crash Course In Message Passing Interface

Collectives

• Collective routines
  – Involves all processes in communicator
  – All processes in communicator must participate
  – Serve several purposes
    • Synchronization
    • Data movement
    • Reductions
  – Several routines originate or terminate at a single process known as the "root"

Page 35: Crash Course In Message Passing Interface

Collectives

int MPI_Barrier( MPI_Comm comm )

• Blocks each process in comm until all processes reach it
• Used to synchronize processes in comm
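
A minimal sketch of a common barrier use, assuming an already-initialized program: lining ranks up before timing a region with MPI_Wtime.

double t0, t1;

MPI_Barrier(MPI_COMM_WORLD);   /* all ranks start the clock together     */
t0 = MPI_Wtime();
/* ... work to be timed ... */
MPI_Barrier(MPI_COMM_WORLD);   /* all ranks have finished before we stop */
t1 = MPI_Wtime();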

Page 36: Crash Course In Message Passing Interface

Collectives

Broadcast

[Diagram: data in the root's buffer is copied to Rank 0, Rank 1, Rank 2, and Rank 3]

Page 37: Crash Course In Message Passing Interface

Collectives

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

buf        Initial address of send buffer
count      Number of elements to send
datatype   Datatype of each element in send buffer
root       Rank of node that will broadcast buf
comm       Communicator

Page 38: Crash Course In Message Passing Interface

Collectives

Scatter

[Diagram: the root's buffer is split into pieces, one piece delivered to each of Rank 0 through Rank 3]

Page 39: Crash Course In Message Passing Interface

Collectives

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

sendbuf    Initial address of send buffer on root node
sendcount  Number of elements to send to each process
sendtype   Datatype of each element in send buffer
recvbuf    Initial address of receive buffer on each node
recvcount  Number of elements in receive buffer
recvtype   Datatype of each element in receive buffer
root       Rank of node that will scatter sendbuf
comm       Communicator
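
A minimal scatter sketch (not from the deck), assuming one MPI_INT per rank and illustrative variable names:

int rank, size;
int sendbuf[64];   /* only significant on the root; assumes size <= 64 */
int recvbuff;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (rank == 0)
    for (int i = 0; i < size; i++)
        sendbuf[i] = i;

/* Every rank, including the root, receives sendbuf[rank] into recvbuff */
MPI_Scatter(sendbuf, 1, MPI_INT, &recvbuff, 1, MPI_INT, 0, MPI_COMM_WORLD);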

Page 40: Crash Course In Message Passing Interface

Collectives

Gather

[Diagram: one piece from each of Rank 0 through Rank 3 is collected into the root's buffer]

Page 41: Crash Course In Message Passing Interface

Collectives

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

sendbuf    Initial address of send buffer on each node
sendcount  Number of elements to send
sendtype   Datatype of each element in send buffer
recvbuf    Initial address of receive buffer on root node
recvcount  Number of elements received from each process
recvtype   Datatype of each element in receive buffer
root       Rank of node that will gather the data
comm       Communicator
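
The matching gather sketch, again with one MPI_INT per rank and illustrative names:

int rank;
int recvbuf[64];   /* only significant on the root; assumes <= 64 ranks */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* Each rank contributes its rank number; the root ends up with 0,1,2,... */
MPI_Gather(&rank, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);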

Page 42: Crash Course In Message Passing Interface

Collectives

Reduce

[Diagram: values from Rank 0 through Rank 3 are combined element-wise with an operation (e.g. +, *) and the result is placed on the root]

Page 43: Crash Course In Message Passing Interface

Collectives

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

sendbuf    Initial address of send buffer
recvbuf    Buffer to receive reduced result on root rank
count      Number of elements in sendbuf
datatype   Datatype of each element in send buffer
op         Reduce operation (MPI_MAX, MPI_MIN, MPI_SUM, …)
root       Rank of node that will receive reduced value in recvbuf
comm       Communicator
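
A minimal reduction sketch (not from the deck): summing every rank's ID onto the root with MPI_SUM.

int rank, sum;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* Only the root (rank 0) receives the total in sum */
MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0)
    printf("Sum of ranks = %d\n", sum);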

Page 44: Crash Course In Message Passing Interface

Collectives: Example

#include "stdio.h"
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, root, bcast_data;
    root = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank == root)
        bcast_data = 10;

    MPI_Bcast(&bcast_data, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("Rank %d has bcast_data = %d\n", rank, bcast_data);

    MPI_Finalize();
    return 0;
}

Page 45: Crash Course In Message Passing Interface

Collectives: Example

$ cc collectives.c -o coll.exe
$ aprun -n 4 ./coll.exe
Rank 0 has bcast_data = 10
Rank 1 has bcast_data = 10
Rank 3 has bcast_data = 10
Rank 2 has bcast_data = 10