Juan A. Sillero, Guillem Borrell, Javier Jiménez (Universidad Politécnica de Madrid)
and Robert D. Moser (U. Texas Austin)
[Figure: turbulent/non-turbulent (T/NT) interface of a boundary layer; axes x/δ99, y/δ99, z/δ99]
Hybrid OpenMP-MPI turbulent boundary layer code over 32k cores
FUNDED BY: CICYT, ERC, INCITE, & UPM
Outline
★ Motivations
★ Numerical approach
★ Computational setup & domain decomposition
★ Node topology
★ Code scaling
★ IO performance
★ Conclusions
Motivations
• Differences between internal and external flows:
  • Internal: pipes and channels
  • External: boundary layers
• Effect of large-scale intermittency on the turbulent structures
• Energy consumption optimization:
  • Skin friction is generated at the vehicle/boundary-layer interface
• Separation of scales:
  • Three-layer structure: inner, logarithmic, and outer
  • Achieved only at high Reynolds number
• Important advantages of simulations over experiments
Motivations: Underlying physics
[Figure: cross-sections of internal flows (duct and pipe: turbulent throughout) vs. an external flow (boundary layer with a turbulent/non-turbulent interface)]
• Skin friction (drag): about 5% of world energy consumption
Numerical Approach
• Incompressible Navier-Stokes equations + boundary conditions:
  $\partial_t \mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\nabla p + \nu\nabla^2\mathbf{u}, \qquad \nabla\cdot\mathbf{u} = 0$
• Semi-implicit RK-3 time integration: explicit non-linear terms; implicit linear viscous and pressure-gradient terms
• Spatial discretization: compact finite differences (X & Y) and pseudo-spectral (Z); see the illustration below
• Staggered grid for u, v, w, and p
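As an illustration of the compact finite differences mentioned above, here is the classic fourth-order Padé scheme for the first derivative; the actual stencils and boundary closures used in the code may differ:

```latex
% Fourth-order compact (Pade) first derivative on a uniform grid of spacing h.
% Illustrative scheme only; not necessarily the exact stencil of the code.
\frac{1}{4}\,f'_{i-1} + f'_i + \frac{1}{4}\,f'_{i+1}
  \;=\; \frac{3}{2}\,\frac{f_{i+1} - f_{i-1}}{2h}
```

Each derivative evaluation thus reduces to a tridiagonal linear system, which is why tridiagonal LU solvers dominate the per-node work later in the talk.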
Numerical Approach
Simens et al., JCP 228, 4218 (2009); Jiménez et al., JFM 657, 335 (2010)
✦ Fractional-step method (see the sketch below)
✦ Inlet conditions generated with the recycling scheme of Lund et al.
✦ Linear systems solved using LU decomposition
✦ Poisson equation for the pressure solved with a direct method
✦ 2nd-order accuracy in time, 4th-order compact finite differences in space
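A minimal sketch of a generic fractional-step substep, to show where the direct Poisson solve enters; the notation (u*, φ) is ours, and the actual RK-3 substeps in the code may arrange the terms differently:

```latex
% Generic fractional-step substep (illustrative notation; not the code's exact RK-3 arrangement)
\begin{align*}
\mathbf{u}^{*}   &= \mathbf{u}^{n} + \Delta t\,\big[-(\mathbf{u}\cdot\nabla)\mathbf{u} + \nu\nabla^{2}\mathbf{u}\big]^{n} & &\text{(predictor, no pressure)}\\
\nabla^{2}\phi   &= \nabla\cdot\mathbf{u}^{*}/\Delta t & &\text{(pressure Poisson equation, direct solve)}\\
\mathbf{u}^{n+1} &= \mathbf{u}^{*} - \Delta t\,\nabla\phi & &\text{(projection onto divergence-free fields)}
\end{align*}
```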
Computational setup & domain decomposition
• Plane-to-plane decomposition: XY planes ≈ 63 MB, ZY planes ≈ 11 MB (16 R*8 buffers)
• Blue Gene/P node: 4 PowerPC 450 cores, 2 GB RAM (DDR2)
• INCITE project (ANL); PRACE Tier-0 project (Jugene)
Computational setup & domain decomposition
• New parallelization strategy: hybrid OpenMP-MPI
Computational setup & domain decomposition
• Global transposes (see the sketch below):
  • Change the memory layout between the XY- and ZY-plane decompositions
  • Collective communications: MPI_ALLTOALLV
  • Messages are single precision (R*4)
  • About 40% of the total time (when using the torus network)
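A minimal C sketch of the global transpose pattern. The function and variable names, the even chunk split, and the CHUNK size are ours; the production code's packing and counts depend on its actual plane dimensions:

```c
#include <mpi.h>
#include <stdlib.h>

/* Global transpose between two plane decompositions using MPI_Alltoallv.
 * All names and sizes are illustrative. */
void global_transpose(const float *sendbuf, float *recvbuf,
                      int nranks, MPI_Comm comm)
{
    int *sendcounts = malloc(nranks * sizeof(int));
    int *recvcounts = malloc(nranks * sizeof(int));
    int *sdispls    = malloc(nranks * sizeof(int));
    int *rdispls    = malloc(nranks * sizeof(int));

    /* Hypothetical even split: each rank exchanges CHUNK floats with
     * every other rank (real counts follow the plane dimensions). */
    enum { CHUNK = 4096 };
    for (int r = 0; r < nranks; ++r) {
        sendcounts[r] = CHUNK;
        recvcounts[r] = CHUNK;
        sdispls[r]    = r * CHUNK;
        rdispls[r]    = r * CHUNK;
    }

    /* Messages are single precision (R*4), halving the traffic. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_FLOAT,
                  recvbuf, recvcounts, rdispls, MPI_FLOAT, comm);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
}
```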
• 4 OpenMP threads per MPI process, static scheduling:
  • Private indexes per thread
  • Maximise data locality
  • Good load balance
• Loop blocking in Y
• Tridiagonal linear systems solved by LU decomposition, tuned for Blue Gene/P (see the sketch below)
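A C sketch of the per-node pattern: a statically scheduled OpenMP loop over independent grid lines, each solving a tridiagonal system with the Thomas algorithm (an LU sweep without pivoting). Names, sizes, and the scratch layout are ours:

```c
#include <omp.h>

/* Solve one tridiagonal system (a: sub-, b: main, c: super-diagonal)
 * in place with the Thomas algorithm; cp/dp are thread-private scratch. */
static void thomas(int n, const double *a, const double *b,
                   const double *c, double *d, double *cp, double *dp)
{
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; ++i) {            /* forward elimination (L solve) */
        double m = 1.0 / (b[i] - a[i] * cp[i - 1]);
        cp[i] = c[i] * m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) * m;
    }
    d[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; --i)         /* back substitution (U solve) */
        d[i] = dp[i] - cp[i] * d[i + 1];
}

/* Batch of nlines independent tridiagonal solves, one per grid line,
 * split statically across the 4 OpenMP threads of a Blue Gene/P node. */
void solve_lines(int nlines, int n, const double *a, const double *b,
                 const double *c, double *rhs, double *cp, double *dp)
{
    #pragma omp parallel for schedule(static) num_threads(4)
    for (int l = 0; l < nlines; ++l) {
        int t = omp_get_thread_num();
        /* private indexes + per-thread scratch keep the data local */
        thomas(n, a, b, c, rhs + (long)l * n,
               cp + (long)t * n, dp + (long)t * n);
    }
}
```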
Computational setup & domain decomposition
• Create 2 MPI groups (MPI_GROUP_INCL):
  • Groups created from 2 lists of ranks
  • Split the global communicator into 2 local ones
• Each group (one per boundary layer) runs independently; a few operations stay global:
  • Time step: MPI_ALLREDUCE
  • Inlet conditions: SEND/RECEIVE (see the sketch below)
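A C sketch of the two-communicator setup, assuming illustrative rank lists (the first 512 ranks for BL1, the rest for BL2, as in the 8192-node example later) and a global time-step reduction; all variable names are ours:

```c
#include <mpi.h>
#include <stdlib.h>

/* Split MPI_COMM_WORLD into two disjoint communicators via explicit
 * groups, as on the slide. Group sizes are illustrative. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Group world_grp, bl_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    /* Two rank lists: BL1 = [0, n1), BL2 = [n1, size). */
    int n1 = 512;
    int in_bl1 = (rank < n1);
    int nlocal = in_bl1 ? n1 : size - n1;
    int *ranks = malloc(nlocal * sizeof(int));
    for (int i = 0; i < nlocal; ++i)
        ranks[i] = in_bl1 ? i : n1 + i;

    /* Each process builds the group it belongs to; disjoint groups
     * yield two independent communicators from one collective call. */
    MPI_Group_incl(world_grp, nlocal, ranks, &bl_grp);
    MPI_Comm bl_comm;
    MPI_Comm_create(MPI_COMM_WORLD, bl_grp, &bl_comm);

    /* Each boundary layer advances on bl_comm, but the time step stays
     * global: a min-reduction over ALL ranks of MPI_COMM_WORLD. */
    double dt_local = 1e-3, dt;   /* illustrative local CFL estimate */
    MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    free(ranks);
    MPI_Group_free(&bl_grp);
    MPI_Group_free(&world_grp);
    MPI_Comm_free(&bl_comm);
    MPI_Finalize();
    return 0;
}
```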
Node topology
How do we map virtual processes onto physical processors?
• Predefined vs. custom mappings: the custom mapping is twice as fast
• Example: 8192 nodes, BL1 = 512, BL2 = 7680
• The 3D torus network is lost when splitting Comm_BL1 ∪ Comm_BL2 = MPI_COMM_WORLD, which motivates a custom mapping (see the sketch below)
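A hypothetical sketch of how such a custom mapping could be generated. Blue Gene systems accepted a map file listing physical coordinates per MPI rank; we assume the common "x y z t" line format here, and the torus dimensions and slab split between BL1 and BL2 are purely illustrative:

```c
#include <stdio.h>

/* Write a hypothetical map file assigning BL1 ranks to one contiguous
 * block of the torus and BL2 ranks to the rest, so each sub-communicator
 * keeps nearest-neighbour links. Format assumed: one "x y z t" per rank. */
int main(void)
{
    const int X = 8, Y = 32, Z = 32, T = 4;  /* illustrative 8192-node torus */
    FILE *f = fopen("custom.map", "w");
    if (!f) return 1;

    /* Ranks are emitted node by node, 4 cores (t) per node; because BL1
     * ranks come first, the first 512 nodes form a contiguous block
     * inside the x = 0 slab, and BL2 fills the remainder. */
    for (int x = 0; x < X; ++x)
        for (int y = 0; y < Y; ++y)
            for (int z = 0; z < Z; ++z)
                for (int t = 0; t < T; ++t)
                    fprintf(f, "%d %d %d %d\n", x, y, z, t);
    fclose(f);
    return 0;
}
```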
Node topology
[Figure: the custom mapping balances communication and computation]
Code scaling
[Figure: time per message vs. message size in bytes (across nodes, MPI: messages ≈ 7 MB; within node, OpenMP: ≈ 2 kB) and time vs. millions of points per node (node occupation)]
• Time breakdown: 40% communication, 8% transposes, 52% computation
• Linear weak scaling
IO Performance
• Checkpoint of the simulation: 0.5 TB (R*4), every 3 hours (12-hour runs)
  • Velocity {u,v,w} and pressure {p} fields (4 × 84 GB + 4 × 7.2 GB)
  • Correlation files {u}
• Different strategies for IO:
  • Serial IO: discarded
  • Parallel collective IO: POSIX calls, SIONlib library (Juelich), HDF5 (GPFS & PVFS2)
• HDF5 tuning for Blue Gene/P, GPFS & PVFS2 (cache OFF & ON, respectively):
  • Cache OFF, write: 2 GB/s (5-15 minutes)
  • Cache ON, write: 16 GB/s (25-60 seconds)
  • Forcing the file-system block size in GPFS: 16 GB/s
  • See the HDF5 sketch below
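A minimal C sketch of a parallel collective HDF5 write, assuming a 1D decomposition; the dataset name, sizes, and the file-system tuning (block size, caching) shown on the slide are not reproduced here:

```c
#include <hdf5.h>
#include <mpi.h>

/* Collective parallel write of a 1D float array with HDF5 over MPI-IO.
 * Dataset name and sizes are illustrative. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const hsize_t nlocal = 1024;             /* elements per rank */
    hsize_t nglobal = nlocal * (hsize_t)size;
    float buf[1024];
    for (hsize_t i = 0; i < nlocal; ++i) buf[i] = (float)rank;

    /* File access property list: open the file through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("field.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataspace + per-rank hyperslab selection. */
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate2(file, "u", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t offset = nlocal * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                        &nlocal, NULL);
    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

    /* Transfer property list: request a COLLECTIVE write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```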
Conclusions
✴ Turbulent boundary layer code ported to hybrid OpenMP-MPI
✴ Memory optimized for Blue Gene/P: 0.5 GB/core
✴ Excellent linear weak scaling up to 8k nodes
✴ Large performance gains from custom node topologies
✴ Parallel collective IO (HDF5): read 22 GB/s, write 16 GB/s
[Figure: low-pressure isosurfaces at high Reynolds number]