Hybrid OpenMP-MPI Turbulent Boundary Layer Code Over 32k Cores

Presentation at EuroMPI in Santorini, Sept 2011


Juan A. Sillero, Guillem Borrell, Javier Jiménez (Universidad Politécnica de Madrid) and Robert D. Moser (U. Texas Austin)

FUNDED BY: CICYT, ERC, INCITE, & UPM

[Figure: T/NT interface; axes x/δ99, y/δ99, z/δ99]

Outline

  • Motivations
  • Numerical approach
  • Computational setup & domain decomposition
  • Node topology
  • Code scaling
  • IO performance
  • Conclusions

Motivations

  • Differences between internal and external flows:
      • Internal: pipes and channels
      • External: boundary layers
  • Effect of large-scale intermittency on the turbulent structures
  • Energy consumption optimization: skin friction is generated at the interface between the vehicle and the boundary layer
  • Separation of scales: three-layer structure (inner, logarithmic and outer), achieved only at high Reynolds number
  • Important advantages of simulations over experiments

Motivations: Some underlying physics

[Figure: sections of internal flows (duct and pipe, turbulent throughout) and external flows (turbulent and non-turbulent regions)]

  • Skin friction (drag): 5% of world energy consumption

Numerical Approach

  • Incompressible Navier-Stokes equations + boundary conditions
  • Variables u, v, w, p on a staggered grid
  • Equation terms: non-linear terms, linear viscous terms, linear pressure-gradient terms
  • Time integration: semi-implicit RK-3
  • Spatial discretization:
      • Compact finite differences (X & Y)
      • Pseudo-spectral
  • Simens et al. JCP 228, 4218 (2009); Jimenez et al. JFM 657, 335 (2010)
  • Fractional step method
  • Inlet conditions using the [Lund et al.] recycling scheme
  • Linear systems solved using LU decomposition
  • Poisson equation for pressure solved using a direct method
  • 2nd-order time accuracy and 4th-order compact finite differences (CFD)

Computational setup & domain decomposition

  • Blue Gene/P node: 4 x PowerPC 450 cores, 2 GB RAM (DDR2)
  • INCITE project (ANL)
  • PRACE project (Jugene), Tier-0
  • Plane-to-plane decomposition:
      • XY planes: 63 MB
      • ZY planes: 11 MB
  • 16 R*8 buffers

Computational setup & domain decomposition: new parallelization strategy + hybrid OpenMP-MPI

  • Global transposes: change the memory layout
      • Collective communications: MPI_ALLTOALLV
      • Messages are single precision (R*4)
      • About 40% of the total time (when using the torus network)
  • 4 OpenMP threads
      • Static scheduling through private indexes: maximises data locality, good load balance
      • Loop blocking in Y
      • Tridiagonal linear system (LS) LU solver, tuned for Blue Gene/P

Computational setup & domain decomposition: program flow

[Flowchart: the total available nodes (MPI_COMM_WORLD) are divided into BL1 (first MPI group) and BL2 (second MPI group). Each group runs: initialization, read field, compute Navier-Stokes RHS, time integration, Poisson equation (mass conservation). The groups exchange the minimum time step and send/receive the inlet plane, with explicit synchronization, before MPI_FINALIZE and end of program.]

  • Create 2 MPI groups (MPI_GROUP_INCL)
  • Groups are created from 2 lists of ranks
  • Split the global communicator into 2 local ones
  • Each group runs independently
  • Some global operations:
      • Time step: MPI_ALLREDUCE
      • Inlet conditions: SEND/RECEIVE

Node topology

  • How to map virtual processes onto physical processors?
  • Predefined vs custom mapping: custom is twice as fast
  • 8192 nodes: BL1 = 512, BL2 = 7680
  • The 3D torus network is lost: CommBL1 ∪ CommBL2 = MPI_COMM_WORLD

[Figure: balance between communication and computation]

Code scaling

[Figure: time per message [s] against size of message [bytes], across nodes (MPI) and within node (OpenMP); time [s] against millions of points per node (node occupation); annotated values: 7 MB, 2 kB]

  • 40% communication, 8% transposes, 52% computation
  • Linear weak scaling

IO performance

  • Checkpoint of the simulation: 0.5 TBytes (R*4)
  • Every 3 hours (12-hour run)
  • Velocity {u,v,w} and pressure {p} fields (4x84 GB + 4x7.2 GB)
  • Correlation files {u}
  • Different strategies for IO:
      • Serial IO: discarded
      • Parallel collective IO:
          • Posix calls
          • SIONLIB library (Juelich)
          • HDF5 (GPFS & PVFS2)
  • HDF5 tuning for Blue Gene/P:
      • GPFS & PVFS2 (cache OFF & ON respectively)
      • Cache OFF, write: 2 GB/sec (5-15 minutes)
      • Cache ON, write: 16 GB/sec (25-60 seconds)
      • Forcing the file system block size in GPFS: 16 GB/sec

Conclusions

  • Turbulent boundary layer code ported to hybrid OpenMP-MPI
  • Memory optimized for Blue Gene/P: 0.5 GB/core
  • Excellent linear weak scalability up to 8k nodes
  • Big impact on performance from custom node topologies
  • Parallel collective IO (HDF5): read 22 GB/sec, write 16 GB/sec

[Figure: low-pressure isosurface at high Reynolds numbers]
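
The global transposes mentioned in the parallelization slides (switching between XY-plane and ZY-plane layouts with MPI_ALLTOALLV) can be illustrated with a small, MPI-free sketch. This is not the authors' code. It assumes that the XY layout gives each rank a contiguous slab of z indices and the ZY layout gives each rank a contiguous slab of x indices; the function names `block_range`, `alltoallv_layout` and `simulate_transpose`, as well as the toy grid sizes, are hypothetical and chosen only for illustration. The sketch computes the per-destination send counts and displacements that a call like MPI_ALLTOALLV would need, and emulates the exchange with plain lists to check that every grid point ends up on exactly one rank.

```python
"""Pure-Python sketch of a slab-to-slab global transpose (no MPI required)."""


def block_range(n, parts, i):
    """Half-open index range [start, stop) of block i when n items are split
    as evenly as possible into `parts` contiguous blocks."""
    base, extra = divmod(n, parts)
    start = i * base + min(i, extra)
    return start, start + base + (1 if i < extra else 0)


def alltoallv_layout(nx, ny, nz, nranks, rank):
    """Send counts and displacements (in elements) for `rank` when moving from
    a z-slab layout (whole XY planes per rank) to an x-slab layout
    (whole ZY planes per rank). Assumed layouts, see the lead-in."""
    z0, z1 = block_range(nz, nranks, rank)
    counts = []
    for dest in range(nranks):
        x0, x1 = block_range(nx, nranks, dest)
        # Block sent to `dest`: its x-range, every y, this rank's z-range.
        counts.append((x1 - x0) * ny * (z1 - z0))
    displs = [sum(counts[:d]) for d in range(nranks)]
    return counts, displs


def simulate_transpose(nx, ny, nz, nranks):
    """Emulate the exchange with lists. Returns, for each rank, the list of
    (x, y, z) points it owns after the transpose."""
    # Pack: every source rank groups its points by destination rank.
    mailbox = [[[] for _ in range(nranks)] for _ in range(nranks)]
    for src in range(nranks):
        z0, z1 = block_range(nz, nranks, src)
        for dest in range(nranks):
            x0, x1 = block_range(nx, nranks, dest)
            mailbox[src][dest] = [(x, y, z)
                                  for z in range(z0, z1)
                                  for y in range(ny)
                                  for x in range(x0, x1)]
    # Exchange and unpack: each destination gathers what every source sent.
    return [[p for src in range(nranks) for p in mailbox[src][dest]]
            for dest in range(nranks)]


if __name__ == "__main__":
    nx, ny, nz, nranks = 8, 3, 6, 4  # toy sizes, not the production grid
    counts, displs = alltoallv_layout(nx, ny, nz, nranks, rank=0)
    print("rank 0 send counts:", counts)
    print("rank 0 displacements:", displs)
```

Since the slides state that messages are single precision (R*4), the message size in bytes for each destination would be the element count multiplied by 4. Uneven block sizes, as handled by `block_range`, are the reason a variable-count collective is needed rather than a fixed-size all-to-all.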