paral_IO


    1/46

    MPI IO

    Timothy H. Kaiser, [email protected]

    2/46

    Purpose

    Introduce parallel IO

    Introduce MPI-IO

    Not give an exhaustive survey

    Explain why you would want to use it

    Explain why you wouldn't want to use it

    Give a nontrivial and useful example

    3/46

    References

    http://www.nersc.gov/nusers/resources/software/libs/io/mpiio.php

    Parallel I/O for High Performance Computing. John May

    Using MPI-2: Advanced Features of the Message Passing Interface.
    William Gropp, Ewing Lusk, and Rajeev Thakur

    4/46

    What & Why of parallel IO

    Same motivation as for going parallel initially

    You have lots of data

    You want to do things fast

    Parallel IO will (hopefully) enable you to move large amounts of data to/from disk quickly

    5/46

    What & Why of parallel IO

    Parallel implies that some number (or all) of your processors (simultaneously) participate in an IO operation

    Good parallel IO shows speedup as you add processors

    I write about 300 Mbytes/second, others faster

    6/46

    A Motivating Example

    Earthquake Model E3d

    Finite difference simulation with the grid distributed across N processors

    On BlueGene we run at sizes of 7509 x 7478 x 250 = 14,021,250,000 cells,
    or 56 GBytes per volume (4 bytes per cell); we output 3 velocity volumes per dump

    For a restart file we write 14 volumes

    7/46

    Simple (nonMPI) Parallel IO

    Each processor dumps its portion of the grid to a separate unique file

    char* unique(char *name,int myid) {
        static char unique_str[40];
        int i;
        for(i=0;i<40;i++)
            unique_str[i]=(char)0;
        /* append the rank as a 5-digit, zero-padded suffix; the loop bound and
           format are reconstructed from the Fortran version on the next slide */
        sprintf(unique_str,"%s%5.5d",name,myid);
        return unique_str;
    }
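    For context, a sketch of how such a helper gets used: every rank opens and writes
    its own file. The fopen call, the "dump_" prefix, and the my_cells buffer are
    illustrative and not part of the original program.

    /* hypothetical usage: rank 3 ends up writing the file dump_00003 */
    float my_cells[1000];                        /* stand-in for this rank's grid data */
    FILE *fp = fopen(unique("dump_",myid),"w");
    if (fp != NULL) {
        fwrite(my_cells, sizeof(float), 1000, fp);
        fclose(fp);
    }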

    8/46

    Simple (nonMPI) Parallel IO

    module stuff
    contains
      function unique(name,myid)
        character (len=*) name
        character (len=20) unique
        character (len=80) temp
        write(temp,"(a,i5.5)") trim(name),myid
        unique=temp
        return
      end function unique
    end module

    9/46

    Why not just do this?

    Might write thousands of files

    Could be very slow

    Output is dependent on the number of processors

    We might want the data in a single file

    10/46

    MPI-IO to the rescue

    MPI has over 55 calls related to file input and output

    Available in most modern MPI libraries

    Can produce exceptional results

    Supports striping

    A collection of distributed files can look like one

    We will look at outputs to a single file

    11/46

    Why not?

    Some functionality might not be available
        3d data types

    More likely to have/introduce bugs
        Memory leaks

    File system overload
        Just hangs

    12/46

    Why not?

    More complex than normal output

    Need support from the file system for good performance

    Have seen 200 bytes/second, NOT megabytes

    Have run out of file locks

    13/46

    Our Real World Example

    We have a 3d volume of some data v distributed across N processors

    The size and distribution are input and not the same on each processor

    We are outputting some function of v, V=f(v)

    Each processor writes its values to a common file

    14/46

    15/46

    Special Considerations

    We are calculating our output on the fly

    Create a buffer

    Fill the buffer and write

    Different processors will have a different number of writes

    16/46

    Special Considerations

    We want to use a collective write operation

    Each process must call the collective write the same number of times

    Each process must determine how many writes it needs to do

    The total number of writes is the max

    Some processors might call write with no data (a sketch of this bookkeeping follows)
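    A rough sketch of that bookkeeping, assuming a fixed-size staging buffer; the names
    my_count, buffer_len, and the helper items_in_buffer() are illustrative, not taken
    from the program shown later:

    /* each rank needs ceil(my_count/buffer_len) flushes of its staging buffer     */
    int my_writes = (my_count + buffer_len - 1) / buffer_len;
    int max_writes;
    MPI_Allreduce(&my_writes, &max_writes, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    /* every rank makes exactly max_writes collective calls; a rank that has       */
    /* run out of data keeps calling, but with a count of 0                        */
    for (int w = 0; w < max_writes; w++) {
        int n = (w < my_writes) ? items_in_buffer(w) : 0;   /* hypothetical helper */
        MPI_File_write_at_all(fh, my_offset, buf, n, MPI_INT, &status);
        my_offset += n;
    }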

    17/46

    Procedure #1

    Allocate a temporary output buffer

    Open the file

    Set the view of the file to the beginning

    Process 0 writes the file header (36 bytes)

    18/46

    19/46

    Procedure #3

    Loop over the grid
        Fill buffer
        If buffer is full
            Write it
            Adjust offset
            do_call_max=do_call_max-1

    Call write with no data until do_call_max=0

    20/46

    The MPI-IO Routines

    MPI_File_open(MPI_COMM_WORLD,fname,(MPI_MODE_RDWR|MPI_MODE_CREATE),MPI_INFO_NULL,&fh);

    MPI_File_set_view(fh,disp,MPI_INT,filetype,"native",MPI_INFO_NULL);

    MPI_File_write_at(fh, 0, header, hl, MPI_INT,&status);

    MPI_File_write_at_all(fh, offset, ptr, i2, MPI_INT,&status);

    MPI_File_close(&fh);
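    To see how these calls fit together, here is a minimal, self-contained sketch in
    which every rank writes one contiguous block of ints to a shared file. The file
    name simple.out and the identity view are made up for the sketch; the real program
    that follows uses a derived filetype and a header.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, vals[100];
        MPI_File fh;
        MPI_Status status;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 100; i++) vals[i] = rank;      /* something to write */

        MPI_File_open(MPI_COMM_WORLD, "simple.out",
                      MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
        /* identity view: the file is treated as a stream of MPI_INTs from byte 0 */
        MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
        /* with an MPI_INT etype the offset is counted in ints, not bytes         */
        offset = (MPI_Offset)rank * 100;
        MPI_File_write_at_all(fh, offset, vals, 100, MPI_INT, &status);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }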

    21/46

    MPI_File_open

    Synopsis: Opens a file

    int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh)

    Input Parameters

    comm

    communicator (handle)

    filename

    name of file to open (string)

    amode

    file access mode (integer)

    info

    info object (handle)

    Output Parameters

    fh

    file handle (handle)


    22/46

    MPI_File_set_view

    Synopsis: Sets the file view

    int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,MPI_Datatype filetype, char *datarep, MPI_Info info)

    Input Parameters

    fh

    file handle (handle)

    disp

    displacement (nonnegative integer)

    etype

    elementary datatype (handle)

    filetype

    filetype (handle)

    datarep

    data representation (string)

    info

    info object (handle)
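    As a small illustration of how disp, etype, and filetype interact (a block-cyclic
    layout invented for this sketch, not taken from the talk; assumes MPI_Init has
    already been called):

    /* P ranks interleave blocks of B ints: rank r owns blocks r, r+P, r+2P, ...  */
    #define B 64
    int P, r, nblocks = 10;
    int buf[10*B];                       /* exactly nblocks*B ints to write       */
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    /* nblocks blocks of B ints each, start-to-start stride of P*B ints           */
    MPI_Type_vector(nblocks, B, P*B, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "cyclic.out",
                  MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
    /* disp is in bytes: rank r's pattern starts r blocks into the file           */
    MPI_File_set_view(fh, (MPI_Offset)r*B*sizeof(int), MPI_INT, filetype,
                      "native", MPI_INFO_NULL);
    /* offset 0 is now the first int visible through this rank's view;            */
    /* one full tile of the pattern is written here                               */
    MPI_File_write_at_all(fh, 0, buf, nblocks*B, MPI_INT, &status);
    MPI_File_close(&fh);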

    23/46

    MPI_File_write_at

    Synopsis: Write using explicit offset, not collective

    int MPI_File_write_at(MPI_File fh, MPI_Offset offset, void *buf,int count, MPI_Datatype datatype, MPI_Status *status)

    Input Parameters

    fh

    file handle (handle)

    offset

    file offset (nonnegative integer)

    buf

    initial address of buffer (choice)

    count

    number of elements in buffer (nonnegative integer)

    datatype

    datatype of each buffer element (handle)

    Output Parameters

    status

    status object (Status)
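    For instance, the header write from Procedure #1 (36 bytes, i.e. nine 4-byte ints)
    might look like the sketch below; what goes into the header is illustrative here,
    the actual program writes hl ints:

    /* only rank 0 writes the header at the very start of the file */
    int header[9] = {0};
    header[0] = nx;  header[1] = ny;  header[2] = nz;   /* e.g. the global grid size */
    if (myid == 0)
        MPI_File_write_at(fh, 0, header, 9, MPI_INT, &status);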

    24/46

    MPI_File_write_at_all

    Synopsis: Collective write using explicit offset

    int MPI_File_write_at_all(MPI_File fh, MPI_Offset offset, void *buf,int count, MPI_Datatype datatype, MPI_Status *status)

    Input Parameters

    fh

    file handle (handle)

    offset

    file offset (nonnegative integer)

    buf

    initial address of buffer (choice)

    count

    number of elements in buffer (nonnegative integer)

    datatype

    datatype of each buffer element (handle)

    Output Parameters

    status

    status object (Status)

    25/46

    MPI_File_close

    Synopsis: Closes a file

    int MPI_File_close(MPI_File *fh)

    Input Parameters

    fh

    file handle (handle)

    26/46

    The Data type Routines

    MPI_Type_create_subarray(3,gsizes,lsizes,istarts,MPI_ORDER_C,old_type,new_type);

    MPI_Type_contiguous(sx,old_type,&VECT);

    MPI_Type_struct(sz,blocklens,indices,old_types,&TWOD);

    MPI_Type_commit(&TWOD);

    Our preferred routine creates a 3d description

    On some platforms we need to fake it

    27/46

    MPI_Type_create_subarray

    Synopsis: Creates a datatype describing a subarray of an N dimensional array

    int MPI_Type_create_subarray(int ndims, int *array_of_sizes, int *array_of_subsizes,
        int *array_of_starts, int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

    Input Parameters

    ndims

    number of array dimensions (positive integer)

    array_of_sizes

    number of elements of type oldtype in each dimension of the full array (array of positive integers)

    array_of_subsizes

    number of elements of type oldtype in each dimension of the subarray (array of positive integers)

    array_of_starts

    starting coordinates of the subarray in each dimension (array of nonnegative integers)

    order

    array storage order flag (state)

    oldtype

    old datatype (handle)

    Output Parameters

    newtype

    new datatype (handle)
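    A small worked example of the arguments, using a 2d array for brevity (invented
    for this sketch, not taken from the talk):

    /* a 100 x 50 global array of ints; this rank owns a 25 x 10 patch             */
    /* whose upper-left corner sits at global indices (50, 20)                     */
    int gsizes[2] = {100, 50};
    int lsizes[2] = { 25, 10};
    int starts[2] = { 50, 20};
    MPI_Datatype patch;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C, MPI_INT, &patch);
    MPI_Type_commit(&patch);
    /* used as the filetype in MPI_File_set_view, "patch" makes the file look      */
    /* like just this rank's 25 x 10 piece of the global array                     */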

    28/46

    MPI_Type_contiguous

    Synopsis: Creates a contiguous datatype

    int MPI_Type_contiguous( int count,MPI_Datatype old_type,MPI_Datatype *newtype)

    Input Parameters

    count

    replication count (nonnegative integer)

    oldtype

    old datatype (handle)

    Output Parameter

    newtype

    new datatype (handle)
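    For example, the x-row type built by the "fake it" version of mysubgrid0 near the
    end of the talk is just sx consecutive ints (a sketch):

    MPI_Datatype VECT;
    MPI_Type_contiguous(sx, MPI_INT, &VECT);   /* one x-row of this rank's subgrid */
    MPI_Type_commit(&VECT);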

    29/46

    MPI_Type_struct

    Synopsis: Creates a struct datatype

    int MPI_Type_struct( int count, int blocklens[], MPI_Aint indices[],
        MPI_Datatype old_types[], MPI_Datatype *newtype )

    Input Parameters

    count

    number of blocks (integer) -- also number of entries in arrays array_of_types ,

    array_of_displacements and array_of_blocklengths

    blocklens

    number of elements in each block (array)

    indices

    byte displacement of each block (array)

    old_types

    type of elements in each block (array of handles to datatype objects)

    Output Parameter

    newtype

    new datatype (handle)

    30/46

    Our Program...

    MPI_Init(&argc,&argv);
    MPI_Comm_rank( MPI_COMM_WORLD, &myid);
    MPI_Comm_size( MPI_COMM_WORLD, &numprocs);
    MPI_Get_processor_name(name,&resultlen);
    printf("process %d running on %s\n",myid,name);

    /* we read and broadcast the global grid size (nx,ny,nz) */
    if(myid == 0) {
        if(argc != 4){
            printf("the grid size is not on the command line assuming 100 x 50 x 75\n");
            gblsize[0]=100; gblsize[1]=50; gblsize[2]=75;
        }
        else {
            gblsize[0]=atoi(argv[1]); gblsize[1]=atoi(argv[2]); gblsize[2]=atoi(argv[3]);
        }
    }
    MPI_Bcast(gblsize,3,MPI_INT,0,MPI_COMM_WORLD);
    /********** a ***********/

    31/46

    /* the routine three takes the number of processors and returns a 3d
       decomposition or topology. this is simply a factoring of the number of
       processors into 3 integers stored in comp */
    three(numprocs,comp);

    /* the routine mpDecomposition takes the processor topology and the global
       grid dimensions and maps the grid to the topology. mpDecomposition returns
       the number of cells a processor holds and the starting coordinates for its
       portion of the grid */
    if(myid == 0 ) {
        printf("input mpDecomposition %5d%5d%5d%5d%5d%5d\n",
               gblsize[1],gblsize[2],gblsize[0], comp[1], comp[2], comp[0]);
    }
    mpDecomposition( gblsize[1],gblsize[2],gblsize[0],comp[1],comp[2],comp[0],myid,dist);
    printf(" out mpDecomposition %5d%5d%5d%5d%5d%5d%5d\n",myid,dist[0],dist[1],dist[2],
           dist[3],dist[4],dist[5]);
    /********** b ***********/

    32/46

    Example Distribution

    Global size 50 x 200 x 100 on 8 processors

    33/46

    Back to our program...

    /* global grid size */
    nx=gblsize[0]; ny=gblsize[1]; nz=gblsize[2];

    /* amount that i hold */
    sx=dist[0]; sy=dist[1]; sz=dist[2];

    /* my grid starts here */
    x0=dist[3]; y0=dist[4]; z0=dist[5];
    /********** c ***********/

    34/46

    /* allocate memory for our volume */
    vol=getArrayF3D((long)sy,(long)0,(long)0,(long)sz,(long)0,(long)0,
                    (long)sx,(long)0,(long)0);

    /* fill the volume with numbers 1 to global grid size */
    /* the program from which this example was derived, e3d, holds its data as a
       collection of vertical planes. plane number increases with y. that is why
       we loop on y with the outermost loop. */
    k=1+(x0+nx*z0+(nx*nz)*y0);            /* 1-based global index of my first cell */
    /* the loop body below is a reconstructed sketch (the listing is cut off);
       it steps k through this processor's cells in file order, assuming vol can
       be indexed directly as vol[y][z][x] */
    for (ltmp=0;ltmp<sy;ltmp++) {         /* my y planes, outermost                */
        for (mtmp=0;mtmp<sz;mtmp++) {     /* then z rows                           */
            for (ntmp=0;ntmp<sx;ntmp++) { /* then x, fastest                       */
                vol[ltmp][mtmp][ntmp]=(FLT)k;
                k++;
            }
            k=k+(nx-sx);                  /* jump to the start of my next x row    */
        }
        k=k+nx*(nz-sz);                   /* jump to the start of my next z plane  */
    }

    35/46

    /* create a file name based on the grid size and open the file */
    /* (the name-building loop is cut off in the source; the open and header write
       below are the calls listed on the MPI-IO routines slide) */
    ierr=MPI_File_open(MPI_COMM_WORLD,fname,(MPI_MODE_RDWR|MPI_MODE_CREATE),
                       MPI_INFO_NULL,&fh);

    /* process 0 writes the hl-int header at the start of the file */
    if(myid == 0)
        ierr=MPI_File_write_at(fh, 0, header, hl, MPI_INT,&status);
    /********** 01 ***********/

    36/46

    /* we create a description of the layout of the data */
    /* more on this later */
    printf("mysubgrid0 %5d%5d%5d%5d%5d%5d%5d%5d%5d%5d\n",myid,nx,ny,nz,sx,sy,sz,x0,y0,z0);
    mysubgrid0(nx, ny, nz, sx, sy, sz, x0, y0, z0, MPI_INT, &disp, &filetype);

    /* length of the header */
    disp=disp+(4*hl);

    /* every processor "moves" past the header */
    ierr=MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);
    /********** 02 ***********/

    37/46

    /* we are going to create the data on the fly */
    /* so we allocate a buffer for it */
    t3=MPI_Wtime();
    isize=sx*sy*sz;
    buf_size=NUM_VALS*sizeof(FLT);
    if( isize < NUM_VALS) {
        buf_size=isize*sizeof(FLT);
    }
    else {
        buf_size=NUM_VALS*sizeof(FLT);
    }
    ptr=(FLT*)malloc(buf_size);
    offset=0;

    /* find the max and min of isize over all processors */
    ierr=MPI_Allreduce ( &isize, &max_size, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    ierr=MPI_Allreduce ( &isize, &min_size, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    /********** 03 ***********/

    38/46

    /* find out how many times each processor will dump its buffer */
    i=0; i2=0; do_call=0; sample=1;
    grid_l=y0+sy; grid_m=z0+sz; grid_n=x0+sx;

    /* could just do division but that would be too easy */
    for(l = y0; l < grid_l; l = l + sample) {
    for(m = z0; m < grid_m; m = m + sample) {
    for(n = x0; n < grid_n; n = n + sample) {
        i++;
        i2++;
        if(i == isize || i2 == NUM_VALS){
            do_call++;
            i2=0;
        }
    }}}

    /* get the maximum number of times a processor will dump its buffer */
    ierr= MPI_Allreduce ( &do_call, &do_call_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    /********** 04 ***********/
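    The division that comment alludes to is just a ceiling divide; when sample is 1
    the loop above could be replaced by this sketch:

    /* number of buffer flushes = ceil(isize/NUM_VALS), using integer arithmetic */
    do_call = (isize + NUM_VALS - 1) / NUM_VALS;
    ierr = MPI_Allreduce(&do_call, &do_call_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);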

    39/46

    /* finally we start to write the data */
    i=0; i2=0;

    /* we loop over our grid filling the output buffer */
    for(l = y0; l < grid_l; l = l + sample) {
    for(m = z0; m < grid_m; m = m + sample) {
    for(n = x0; n < grid_n; n = n + sample) {
        ptr[i2] = getS3D(vol,l, m, n,y0,z0,x0);
        i++;
        i2++;
    /********** 05 ***********/

    40/46

    /* when we have all our data or the buffer is full we write */
        if(i == isize || i2 == NUM_VALS){
            t5=MPI_Wtime();
            t7++;
            if((isize == max_size) && (max_size == min_size)) {
                /* as long as every processor has data to write we use the collective version */
                /* the collective version of the write is MPI_File_write_at_all */
                ierr=MPI_File_write_at_all(fh, offset, ptr, i2, MPI_INT,&status);
                do_call_max=do_call_max-1;
            }
            else {
                /* if only I have data to write then we use MPI_File_write_at */
                /*ierr=MPI_File_write_at(fh, offset, ptr, i2, MPI_INT,&status);*/
                /* Wait! Why was that line commented out? Why are we using MPI_File_write_at_all? */
                /* Answer: Some versions of MPI work better using MPI_File_write_at_all */
                /* What happens if some processors are done writing and don't call this? */
                /* Answer: See below. */
                ierr=MPI_File_write_at_all(fh, offset, ptr, i2, MPI_INT,&status);
                do_call_max=do_call_max-1;
            }
            offset=offset+i2;
            i2=0;
            t6=MPI_Wtime();
            dt[5]=dt[5]+(t6-t5);
        }
    }}}    /* close the three grid loops opened on the previous slide */
    /********** 06 ***********/

    41/46

    /* Here is where we fix the problem of unmatched calls to MPI_File_write_at_all. */
    /* If a processor is done with its writes and others still have data to write,  */
    /* the finished processor just calls MPI_File_write_at_all with 0 values to     */
    /* write. All processors call MPI_File_write_at_all the same number of times,   */
    /* so everyone is happy.                                                         */
    while(do_call_max > 0) {
        ierr=MPI_File_write_at_all(fh, (MPI_Offset)0, (void *)0, 0, MPI_INT,&status);
        do_call_max=do_call_max-1;
    }

    /* We finally close the file */
    ierr=MPI_File_close(&fh);
    /*********
    ierr=MPI_Info_free(&fileinfo);
    *********/
    MPI_Finalize();
    exit(0);
    /********** 07 ***********/

    42/46

    Our output:

    vista --rawtype int --minmax 0 1000000 --skip 12 -x 640 -y 480 --outformat png
          --fov 30 bonk --raw 100 50 200 -r .5 .25 1.0 -g 0.9 0.9 0.9 1.0
          -a 0.002 --opacity 0.01

    43/46

    Source: http://peloton.sdsc.edu/~tkaiser/mpiio/mpiio.c

    44/46

    void mpDecomposition(int l, int m, int n, int nx, int ny, int nz, int node, int *dist)
    {
        int nnode, mnode, rnode;
        int grid_n,grid_n0,grid_m,grid_m0,grid_l,grid_l0;
        /* x decomposition */
        rnode = node%nx;
        mnode = (n%nx);
        nnode = (n/nx);
        grid_n = (rnode < mnode) ? (nnode + 1) : (nnode);
        grid_n0 = rnode*nnode;
        grid_n0 += (rnode < mnode) ? (rnode) : (mnode);
        /* z decomposition */
        rnode = (node%(nx*nz))/nx;
        mnode = (m%nz);
        nnode = (m/nz);
        grid_m = (rnode < mnode) ? (nnode + 1) : (nnode);
        grid_m0 = rnode*nnode;
        grid_m0 += (rnode < mnode) ? (rnode) : (mnode);
        /* y decomposition */
        rnode = node/(nx*nz);
        mnode = (l%ny);
        nnode = (l/ny);
        grid_l = (rnode < mnode) ? (nnode + 1) : (nnode);
        grid_l0 = rnode*nnode;
        grid_l0 += (rnode < mnode) ? (rnode) : (mnode);
        dist[0]=grid_n; dist[1]=grid_l; dist[2]=grid_m;
        dist[3]=grid_n0; dist[4]=grid_l0; dist[5]=grid_m0;
    }

    /* the routine mpDecomposition takes the processor topology (nx,ny,nz) and the
       global grid dimensions (l,m,n) and maps the grid to the topology.
       mpDecomposition returns the number of cells a processor holds, dist[0:2], and
       the starting coordinates for its portion of the grid, dist[3:5] */

    45/46

    void mysubgrid0(int nx, int ny, int nz, int sx, int sy, int sz, int x0, int y0, int z0,
                    MPI_Datatype old_type, MPI_Offset *location, MPI_Datatype *new_type)
    {
        MPI_Datatype VECT;
    #define BSIZE 5000
        int blocklens[BSIZE];
        MPI_Aint indices[BSIZE];
        MPI_Datatype old_types[BSIZE];
        MPI_Datatype TWOD;
        int i;
        if(myid == 0)printf("using mysubgrid version 1\n");
        if(sz > BSIZE)mpi_check(-1,"sz > BSIZE, increase BSIZE and recompile");
        ierr=MPI_Type_contiguous(sx,old_type,&VECT);
        ierr=MPI_Type_commit(&VECT);
        /* the rest of the listing is cut off; a reconstructed sketch follows:      */
        /* describe one xz plane as sz rows of VECT, rows nx elements apart in the  */
        /* file (the byte stride assumes an int-sized old_type)                     */
        for (i=0;i<sz;i++) {
            blocklens[i]=1;
            indices[i]=(MPI_Aint)i*nx*sizeof(int);
            old_types[i]=VECT;
        }
        ierr=MPI_Type_struct(sz,blocklens,indices,old_types,&TWOD);
        ierr=MPI_Type_commit(&TWOD);
        /* the full routine goes on to stack sy copies of TWOD, nx*nz elements      */
        /* apart, to form *new_type, and sets *location to the byte offset of this  */
        /* processor's first cell                                                   */
    }

    46/46

    void mysubgrid0(int nx, int ny, int nz, int sx, int sy, int sz, int x0, int y0, int z0,
                    MPI_Datatype old_type, MPI_Offset *location, MPI_Datatype *new_type)
    {
        int gsizes[3],lsizes[3],istarts[3];
        gsizes[2]=nx;  gsizes[1]=nz;  gsizes[0]=ny;
        lsizes[2]=sx;  lsizes[1]=sz;  lsizes[0]=sy;
        istarts[2]=x0; istarts[1]=z0; istarts[0]=y0;
        if(myid == 0)printf("using mysubgrid version 2\n");
        ierr=MPI_Type_create_subarray(3,gsizes,lsizes,istarts,MPI_ORDER_C,old_type,new_type);
        ierr=MPI_Type_commit(new_type);
        *location=0;
    }

    /* This one is actually preferred. It uses a single call to the MPI routine
       MPI_Type_create_subarray with the grid description as input. What we get back
       is a data type that is a 3d strided volume. Unfortunately,
       MPI_Type_create_subarray does not work for 3d arrays for some versions of MPI,
       in particular LAM. */