Juan A. Sillero, Guillem Borrell, Javier Jiménez (Universidad Politécnica de Madrid)
and Robert D. Moser (U. Texas Austin)
[Figure: turbulent/non-turbulent (T/NT) interface of a boundary layer; axes x/δ99, y/δ99, z/δ99]
Hybrid OpenMP-MPI turbulent boundary layer code over 32k cores
FUNDED BY: CICYT, ERC, INCITE, & UPM
Outline
★ Motivations
★ Numerical approach
★ Computational setup & domain decomposition
★ Node topology
★ Code scaling
★ IO performance
★ Conclusions
Motivations
• Differences between internal and external flows:
  • Internal: pipes and channels
  • External: boundary layers
• Effect of large-scale intermittency on the turbulent structures
• Energy consumption optimization:
  • Skin friction is generated at the vehicle/boundary-layer interface
• Separation of scales:
  • Three-layer structure: inner, logarithmic, and outer
  • Achieved only at high Reynolds number
• Important advantages of simulations over experiments
Motivations: Underlying physics
[Figure: cross-sections of internal flows (duct and pipe: turbulent throughout) vs. an external flow (boundary layer with a turbulent/non-turbulent interface)]
• Skin friction (drag): about 5% of world energy consumption
Numerical Approach
• Incompressible Navier-Stokes equations + boundary conditions:
  $\partial_t \mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\nabla p + \nu\nabla^2\mathbf{u}, \qquad \nabla\cdot\mathbf{u} = 0$
• Semi-implicit RK-3 time integration: explicit non-linear terms; implicit linear viscous and pressure-gradient terms
• Spatial discretization: compact finite differences (X & Y) and pseudo-spectral (Z); see the illustration below
• Staggered grid for u, v, w, and p
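As an illustration of the compact finite differences mentioned above, here is the classic fourth-order Padé scheme for the first derivative; the actual stencils and boundary closures used in the code may differ:

```latex
% Fourth-order compact (Pade) first derivative on a uniform grid of spacing h.
% Illustrative scheme only; not necessarily the exact stencil of the code.
\frac{1}{4}\,f'_{i-1} + f'_i + \frac{1}{4}\,f'_{i+1}
  \;=\; \frac{3}{2}\,\frac{f_{i+1} - f_{i-1}}{2h}
```

Each derivative evaluation thus reduces to a tridiagonal linear system, which is why tridiagonal LU solvers dominate the per-node work later in the talk.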
Numerical Approach
Simens et al., JCP 228, 4218 (2009); Jiménez et al., JFM 657, 335 (2010)
✦ Fractional-step method (see the sketch below)
✦ Inlet conditions generated with the recycling scheme of Lund et al.
✦ Linear systems solved using LU decomposition
✦ Poisson equation for the pressure solved with a direct method
✦ 2nd-order accuracy in time, 4th-order compact finite differences in space
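A minimal sketch of a generic fractional-step substep, to show where the direct Poisson solve enters; the notation (u*, φ) is ours, and the actual RK-3 substeps in the code may arrange the terms differently:

```latex
% Generic fractional-step substep (illustrative notation; not the code's exact RK-3 arrangement)
\begin{align*}
\mathbf{u}^{*}   &= \mathbf{u}^{n} + \Delta t\,\big[-(\mathbf{u}\cdot\nabla)\mathbf{u} + \nu\nabla^{2}\mathbf{u}\big]^{n} & &\text{(predictor, no pressure)}\\
\nabla^{2}\phi   &= \nabla\cdot\mathbf{u}^{*}/\Delta t & &\text{(pressure Poisson equation, direct solve)}\\
\mathbf{u}^{n+1} &= \mathbf{u}^{*} - \Delta t\,\nabla\phi & &\text{(projection onto divergence-free fields)}
\end{align*}
```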
Computational setup & domain decomposition
• Plane-to-plane decomposition: XY planes ≈ 63 MB, ZY planes ≈ 11 MB (16 R*8 buffers)
• Blue Gene/P node: 4 PowerPC 450 cores, 2 GB RAM (DDR2)
• INCITE project (ANL); PRACE Tier-0 project (Jugene)
Computational setup & domain decomposition
• New parallelization strategy: hybrid OpenMP-MPI
Computational setup & domain decomposition
• Global transposes (see the sketch below):
  • Change the memory layout between the XY- and ZY-plane decompositions
  • Collective communications: MPI_ALLTOALLV
  • Messages are single precision (R*4)
  • About 40% of the total time (when using the torus network)
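A minimal C sketch of the global transpose pattern. The function and variable names, the even chunk split, and the CHUNK size are ours; the production code's packing and counts depend on its actual plane dimensions:

```c
#include <mpi.h>
#include <stdlib.h>

/* Global transpose between two plane decompositions using MPI_Alltoallv.
 * All names and sizes are illustrative. */
void global_transpose(const float *sendbuf, float *recvbuf,
                      int nranks, MPI_Comm comm)
{
    int *sendcounts = malloc(nranks * sizeof(int));
    int *recvcounts = malloc(nranks * sizeof(int));
    int *sdispls    = malloc(nranks * sizeof(int));
    int *rdispls    = malloc(nranks * sizeof(int));

    /* Hypothetical even split: each rank exchanges CHUNK floats with
     * every other rank (real counts follow the plane dimensions). */
    enum { CHUNK = 4096 };
    for (int r = 0; r < nranks; ++r) {
        sendcounts[r] = CHUNK;
        recvcounts[r] = CHUNK;
        sdispls[r]    = r * CHUNK;
        rdispls[r]    = r * CHUNK;
    }

    /* Messages are single precision (R*4), halving the traffic. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_FLOAT,
                  recvbuf, recvcounts, rdispls, MPI_FLOAT, comm);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
}
```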
• 4 OpenMP threads per MPI process, static scheduling:
  • Private indexes per thread
  • Maximise data locality
  • Good load balance
• Loop blocking in Y
• Tridiagonal linear systems solved by LU decomposition, tuned for Blue Gene/P (see the sketch below)
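A C sketch of the per-node pattern: a statically scheduled OpenMP loop over independent grid lines, each solving a tridiagonal system with the Thomas algorithm (an LU sweep without pivoting). Names, sizes, and the scratch layout are ours:

```c
#include <omp.h>

/* Solve one tridiagonal system (a: sub-, b: main, c: super-diagonal)
 * in place with the Thomas algorithm; cp/dp are thread-private scratch. */
static void thomas(int n, const double *a, const double *b,
                   const double *c, double *d, double *cp, double *dp)
{
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; ++i) {            /* forward elimination (L solve) */
        double m = 1.0 / (b[i] - a[i] * cp[i - 1]);
        cp[i] = c[i] * m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) * m;
    }
    d[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; --i)         /* back substitution (U solve) */
        d[i] = dp[i] - cp[i] * d[i + 1];
}

/* Batch of nlines independent tridiagonal solves, one per grid line,
 * split statically across the 4 OpenMP threads of a Blue Gene/P node. */
void solve_lines(int nlines, int n, const double *a, const double *b,
                 const double *c, double *rhs, double *cp, double *dp)
{
    #pragma omp parallel for schedule(static) num_threads(4)
    for (int l = 0; l < nlines; ++l) {
        int t = omp_get_thread_num();
        /* private indexes + per-thread scratch keep the data local */
        thomas(n, a, b, c, rhs + (long)l * n,
               cp + (long)t * n, dp + (long)t * n);
    }
}
```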
Computational setup & domain decomposition
• Create 2 MPI groups (MPI_GROUP_INCL):
  • Groups created from 2 lists of ranks
  • Split the global communicator into 2 local ones
• Each group (one per boundary layer) runs independently; a few operations stay global:
  • Time step: MPI_ALLREDUCE
  • Inlet conditions: SEND/RECEIVE (see the sketch below)
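A C sketch of the two-communicator setup, assuming illustrative rank lists (the first 512 ranks for BL1, the rest for BL2, as in the 8192-node example later) and a global time-step reduction; all variable names are ours:

```c
#include <mpi.h>
#include <stdlib.h>

/* Split MPI_COMM_WORLD into two disjoint communicators via explicit
 * groups, as on the slide. Group sizes are illustrative. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Group world_grp, bl_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    /* Two rank lists: BL1 = [0, n1), BL2 = [n1, size). */
    int n1 = 512;
    int in_bl1 = (rank < n1);
    int nlocal = in_bl1 ? n1 : size - n1;
    int *ranks = malloc(nlocal * sizeof(int));
    for (int i = 0; i < nlocal; ++i)
        ranks[i] = in_bl1 ? i : n1 + i;

    /* Each process builds the group it belongs to; disjoint groups
     * yield two independent communicators from one collective call. */
    MPI_Group_incl(world_grp, nlocal, ranks, &bl_grp);
    MPI_Comm bl_comm;
    MPI_Comm_create(MPI_COMM_WORLD, bl_grp, &bl_comm);

    /* Each boundary layer advances on bl_comm, but the time step stays
     * global: a min-reduction over ALL ranks of MPI_COMM_WORLD. */
    double dt_local = 1e-3, dt;   /* illustrative local CFL estimate */
    MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    free(ranks);
    MPI_Group_free(&bl_grp);
    MPI_Group_free(&world_grp);
    MPI_Comm_free(&bl_comm);
    MPI_Finalize();
    return 0;
}
```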
Node topology
How do we map virtual processes onto physical processors?
• Predefined vs. custom mappings: the custom mapping is twice as fast
• Example: 8192 nodes, BL1 = 512, BL2 = 7680
• The 3D torus network is lost when splitting Comm_BL1 ∪ Comm_BL2 = MPI_COMM_WORLD, which motivates a custom mapping (see the sketch below)
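A hypothetical sketch of how such a custom mapping could be generated. Blue Gene systems accepted a map file listing physical coordinates per MPI rank; we assume the common "x y z t" line format here, and the torus dimensions and slab split between BL1 and BL2 are purely illustrative:

```c
#include <stdio.h>

/* Write a hypothetical map file assigning BL1 ranks to one contiguous
 * block of the torus and BL2 ranks to the rest, so each sub-communicator
 * keeps nearest-neighbour links. Format assumed: one "x y z t" per rank. */
int main(void)
{
    const int X = 8, Y = 32, Z = 32, T = 4;  /* illustrative 8192-node torus */
    FILE *f = fopen("custom.map", "w");
    if (!f) return 1;

    /* Ranks are emitted node by node, 4 cores (t) per node; because BL1
     * ranks come first, the first 512 nodes form a contiguous block
     * inside the x = 0 slab, and BL2 fills the remainder. */
    for (int x = 0; x < X; ++x)
        for (int y = 0; y < Y; ++y)
            for (int z = 0; z < Z; ++z)
                for (int t = 0; t < T; ++t)
                    fprintf(f, "%d %d %d %d\n", x, y, z, t);
    fclose(f);
    return 0;
}
```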
Node topology
[Figure: the custom mapping balances communication and computation]
Code scaling
[Figure: time per message vs. message size in bytes (across nodes, MPI: messages ≈ 7 MB; within node, OpenMP: ≈ 2 kB) and time vs. millions of points per node (node occupation)]
• Time breakdown: 40% communication, 8% transposes, 52% computation
• Linear weak scaling
IO Performance
• Checkpoint of the simulation: 0.5 TB (R*4), every 3 hours (12-hour runs)
  • Velocity {u,v,w} and pressure {p} fields (4 × 84 GB + 4 × 7.2 GB)
  • Correlation files {u}
• Different strategies for IO:
  • Serial IO: discarded
  • Parallel collective IO: POSIX calls, SIONlib library (Juelich), HDF5 (GPFS & PVFS2)
• HDF5 tuning for Blue Gene/P, GPFS & PVFS2 (cache OFF & ON, respectively):
  • Cache OFF, write: 2 GB/s (5-15 minutes)
  • Cache ON, write: 16 GB/s (25-60 seconds)
  • Forcing the file-system block size in GPFS: 16 GB/s
  • See the HDF5 sketch below
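A minimal C sketch of a parallel collective HDF5 write, assuming a 1D decomposition; the dataset name, sizes, and the file-system tuning (block size, caching) shown on the slide are not reproduced here:

```c
#include <hdf5.h>
#include <mpi.h>

/* Collective parallel write of a 1D float array with HDF5 over MPI-IO.
 * Dataset name and sizes are illustrative. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const hsize_t nlocal = 1024;             /* elements per rank */
    hsize_t nglobal = nlocal * (hsize_t)size;
    float buf[1024];
    for (hsize_t i = 0; i < nlocal; ++i) buf[i] = (float)rank;

    /* File access property list: open the file through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("field.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataspace + per-rank hyperslab selection. */
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate2(file, "u", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t offset = nlocal * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                        &nlocal, NULL);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

    /* Transfer property list: request a COLLECTIVE write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```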
Conclusions
✴ Turbulent boundary layer code ported to hybrid OpenMP-MPI
✴ Memory optimized for Blue Gene/P: 0.5 GB/core
✴ Excellent linear weak scaling up to 8k nodes
✴ Large performance gains from custom node topologies
✴ Parallel collective IO (HDF5): read 22 GB/s, write 16 GB/s
[Figure: low-pressure isosurfaces at high Reynolds number]