
Efficient Tridiagonal Solvers for ADI Methods and Fluid Simulation

Nikolai Sakharnykh - NVIDIA

San Jose Convention Center, San Jose, CA | September 21, 2010


    Introduction

Tridiagonal solvers are a very popular technique in both compute and graphics applications

Applied in Alternating Direction Implicit (ADI) methods

Two different examples are covered in this talk:

3D Fluid Simulation for research and science

2D Depth-of-Field Effect for graphics and games


    Outline

    Tridiagonal Solvers Overview

    Introduction

    Gauss Elimination

    Cyclic Reduction

    Fluid Simulation in 3D domain

    Depth-of-Field Effect in 2D domain


    Tridiagonal Solvers

Need to solve many independent tridiagonal systems

Matrix sizes and configurations are problem-specific
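For reference (standard notation, not copied from the slide), each tridiagonal system of size N has the form

a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \quad i = 1, \dots, N, \quad a_1 = c_N = 0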


    Gauss Elimination (Sweep)

    Forward sweep:

    Backward substitution:

    The fastest serial approach

    O(N) complexity

    Optimal number of operations
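The sweep recurrences, in the same a, b, c, d notation (the standard Thomas-algorithm form, not copied from the slides):

Forward sweep: c'_1 = c_1 / b_1, \; d'_1 = d_1 / b_1; \quad c'_i = \frac{c_i}{b_i - a_i c'_{i-1}}, \; d'_i = \frac{d_i - a_i d'_{i-1}}{b_i - a_i c'_{i-1}}, \quad i = 2, \dots, N

Backward substitution: x_N = d'_N; \quad x_i = d'_i - c'_i x_{i+1}, \quad i = N-1, \dots, 1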


    Cyclic Reduction

Eliminate unknowns using a linear combination of neighboring equations

After one reduction step the system decomposes into two independent systems of half the size

The reduction step can be done by N threads in parallel
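In the standard formulation (same a, b, c, d notation as above; not copied from the slide), equation i is combined with its neighbors i-1 and i+1 using k_1 = a_i / b_{i-1} and k_2 = c_i / b_{i+1}:

a'_i = -k_1 a_{i-1}, \quad b'_i = b_i - k_1 c_{i-1} - k_2 a_{i+1}, \quad c'_i = -k_2 c_{i+1}, \quad d'_i = d_i - k_1 d_{i-1} - k_2 d_{i+1}

The new equation couples only x_{i-2}, x_i, x_{i+2}, so even- and odd-indexed unknowns separate into two independent half-size systems.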


    Cyclic Reduction

Parallel Cyclic Reduction (PCR)

Apply reduction to the new systems and repeat: O(log N) steps

For some special cases fewer steps can be used

Cyclic Reduction (CR)

Forward reduction, backward substitution

Complexity O(N), but requires more operations than Sweep


    Outline

    Tridiagonal Solvers Overview

    Fluid Simulation in 3D domain

    Problem statement, applications

    ADI numerical method

    GPU implementation details, optimizations

    Performance analysis and results comparisons

    Depth-of-Field Effect in 2D domain


    Problem Statement

Viscous incompressible fluid in a 3D domain

New challenge: complex dynamic boundaries

Eulerian coordinates: velocity and temperature

Boundary condition types: free, injection, no-slip


    Applications

Blood flow simulation in the heart

[Figure: capture the heart's movement, simulate the flow inside the heart]


    Applications

    Sea and ocean simulations

Static boundaries

Additional parameters


    Definitions

Equation of state

Describes the relation between pressure, density, and temperature

Example (ideal gas): P = ρ R T, where R is the gas constant for air

Notation: density ρ, velocity u, temperature T, pressure P


    Governing equations

    Continuity equation

    For incompressible fluids:

    Navier-Stokes equations:

    Dimensionless form, use equation of state

    Reynolds number (= inertia/viscosity ratio)
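In a standard dimensionless form (notation assumed here, not copied from the slide), the incompressible continuity and Navier-Stokes equations read

\nabla \cdot \mathbf{u} = 0, \qquad \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\,\mathbf{u} = -\nabla p + \frac{1}{Re}\,\nabla^2 \mathbf{u}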


    Governing equations

    Energy equation:

    Dimensionless form, use equation of state

Notation: heat capacity ratio, Prandtl number, dissipative function


    ADI numerical method

    Alternating Direction Implicit

[Diagram: three sweeps per step: X direction with Y, Z fixed; Y direction with X, Z fixed; Z direction with X, Y fixed]


    ADI method iterations

Use global iterations for the whole system of equations

Some equations are non-linear:

Use local iterations to approximate the non-linear terms

[Diagram: starting from the previous time step, each global iteration solves the X-direction equations, then the Y-direction equations, then the Z-direction equations, and finally updates all variables]


    Discretization

Use a regular grid and an implicit finite difference scheme

This yields a tridiagonal system coupling the unknowns at i-1, i, i+1

Independent system for each fixed pair (j, k)


Tridiagonal systems

Need to solve lots of tridiagonal systems

Sizes of systems may vary across the grid

[Figure: a grid line crossing a complex boundary splits into several independent systems (system 1, system 2, system 3)]


    Implementation details

// loop labels follow the ADI iterations slide (global iterations, directions, local iterations)
for each global iteration {
    for each direction (X, Y, Z) {
        for each local iteration {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}


    GPU implementation

Store all data arrays entirely on the GPU in linear memory

Reduce the amount of memory transfers to a minimum

Map 3D arrays to linear memory: (X, Y, Z) maps to Z + Y * dimZ + X * dimZ * dimY, so Z is the fastest-changing dimension

Main tasks:

Build matrices

Solve systems

Additional routines for non-linear updates

Merge, copy 3D arrays with masks
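A minimal sketch of this indexing (function and parameter names are illustrative, not from the talk's code):

// (X, Y, Z) -> linear index, with Z the fastest-changing dimension,
// so neighboring Z cells are adjacent in memory.
__host__ __device__ inline int linearIndex(int x, int y, int z, int dimY, int dimZ)
{
    return z + dimZ * (y + dimY * x);
}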


    Building matrices

    Input data:

    Previous/non-linear 3D layers

    Each thread computes:

    Coefficients of a tridiagonal matrix

    Right-hand side vector

Use C++ templates for direction and equation

[Figure: each thread fills the coefficients a, b, c and the right-hand side d of its row]
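A hypothetical sketch of the build step: one thread per (Y, Z) pair fills the coefficients and right-hand side of its system along X. The array names and the implicit-diffusion-style coefficients are placeholders; the actual kernels are templated on direction and equation.

__global__ void buildMatricesX(const float* u,                 // previous/non-linear 3D layer
                               float* a, float* b, float* c, float* d,
                               int dimX, int dimY, int dimZ,
                               float dt, float dx)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int z = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= dimY || z >= dimZ) return;

    float r = dt / (dx * dx);                                   // placeholder coefficient
    for (int x = 0; x < dimX; ++x) {
        int idx = z + dimZ * (y + dimY * x);                    // Z fastest-changing
        a[idx] = (x == 0)        ? 0.0f : -r;
        b[idx] = 1.0f + 2.0f * r;
        c[idx] = (x == dimX - 1) ? 0.0f : -r;
        d[idx] = u[idx];                                        // right-hand side
    }
}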


    Building matrices performance

Poor Z direction performance compared to X/Y

Each thread accesses a contiguous memory region, so accesses across threads are uncoalesced, causing lots of cache misses

[Chart: build and build + solve time (sec) on Tesla C2050, single precision, for the X, Y, and Z directions; the Z direction is far slower. Table (build kernel): requests per load are 2 for X and Y but 32 for Z.]


    Building matrices optimization

    Run Z phase in transposed XZY space

    Better locality for memory accesses

    Additional overhead on transpose

[Diagram: X and Y local iterations run in XYZ space; the input arrays are transposed to XZY space, the Z local iterations run there, and the output array is transposed back]


    Building matrices - optimization

    Tridiagonal solver time dominates over transpose

Transpose takes a smaller share of the total time as the number of local iterations grows

[Chart: total time (sec) on Tesla C2050, single precision, for X dir, Y dir, Z dir, and the optimized Z dir (transpose + build + solve); the transposed Z direction is about 2.5x faster. Table (Z-direction build kernel): requests per load drop from 32 (original) to 2 (transposed).]


    Problem analysis

Large number of tridiagonal systems

Grid size squared on each step: 16K for a 128^3 grid

Relatively small system sizes

At most the grid size: 128 for a 128^3 grid

1 thread per system is a suitable approach for this problem

Enough systems to utilize the GPU


    Solving tridiagonal systems

    Matrix layout is crucial for performance

Sequential layout: a0 a1 a2 a3 | a0 a1 a2 a3, each system's coefficients stored contiguously; PCR and CR friendly; matches the ADI Z direction

Interleaved layout: a0 a0 a0 a0 | a1 a1 ..., coefficients interleaved across systems so threads 1, 2, 3 read adjacent elements; Sweep friendly; matches the ADI X, Y directions
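A minimal sketch of the one-thread-per-system sweep on the interleaved layout (assumed storage: element i of system s at i * numSystems + s; names are illustrative, not from the talk's code):

__global__ void sweepInterleaved(const float* a, const float* b, const float* c,
                                 const float* d, float* x, float* cp, float* dp,
                                 int n, int numSystems)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per system
    if (s >= numSystems) return;

    // Forward sweep (Thomas algorithm); cp/dp hold the modified coefficients.
    cp[s] = c[s] / b[s];
    dp[s] = d[s] / b[s];
    for (int i = 1; i < n; ++i) {
        int idx  = i * numSystems + s;
        int prev = idx - numSystems;
        float m = 1.0f / (b[idx] - a[idx] * cp[prev]);
        cp[idx] = c[idx] * m;
        dp[idx] = (d[idx] - a[idx] * dp[prev]) * m;
    }

    // Backward substitution.
    x[(n - 1) * numSystems + s] = dp[(n - 1) * numSystems + s];
    for (int i = n - 2; i >= 0; --i) {
        int idx = i * numSystems + s;
        x[idx] = dp[idx] - cp[idx] * x[idx + numSystems];
    }
}

At every step of both loops, consecutive threads touch consecutive addresses, which is what makes the interleaved layout Sweep friendly.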


    Tridiagonal solvers performance

[Chart: 3D ADI test setup, solve time (ms) on Tesla C2050 versus system size N = 64-256 for PCR, CR, SWEEP, and SWEEP_T; SWEEP uses the interleaved matrix layout. PCR/CR implementations from "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens.]


    Choosing tridiagonal solver

    Application for 3D ADI method

For the X, Y directions matrices are interleaved by default

Z is interleaved as well if done in the transposed space

The Sweep solver seems like the best choice

Although one can experiment with PCR for the Z direction


Performance analysis on Fermi

L1/L2 effect on performance

Using 48K L1 instead of 16K gives a 10-15% speed-up

Turning L1 off reduces performance by 10%

L1 really helps with misaligned accesses and spatial reuse

Sweep effective bandwidth on Tesla C2050: Single Precision = 119 GB/s, 82% of peak; Double Precision = 135 GB/s, 93% of peak


    Performance benchmark

    CPU configuration:

Intel Core i7, 4 cores @ 2.8 GHz; use all 4 cores through OpenMP (3-4x speed-up vs 1 core)

    GPU configurations:

    NVIDIA Tesla C1060

    NVIDIA Tesla C2050

    Measure Build + Solve time for all iterations


    Test cases

Test          Grid size         X dir   Y dir   Z dir
Cube          128 x 128 x 128   10K     10K     10K
Heart Volume  96 x 160 x 128    13K     8K      8K
Sea Area      160 x 128 x 128   18K     20K     6K


Performance results: Cube

Same number of systems for X/Y/Z, all sizes are equal

[Charts: systems/ms in single and double precision for the X, Y, Z directions and all directions combined, on Core i7, Tesla C1060, and Tesla C2050; Tesla C2050 achieves 5.9x and 9.5x speed-ups over the Core i7]


Performance results: Heart Volume

More systems for X, sizes vary smoothly

[Charts: systems/ms in single and double precision for the X, Y, Z directions and all directions combined, on Core i7, Tesla C1060, and Tesla C2050; Tesla C2050 achieves 8.1x and 11.4x speed-ups over the Core i7]


Performance results: Sea Area

Lots of small systems of different sizes for X/Y

[Charts: systems/ms in single and double precision for the X, Y, Z directions and all directions combined, on Core i7, Tesla C1060, and Tesla C2050; Tesla C2050 achieves 6.5x and 9.6x speed-ups over the Core i7]


    Future work

Current main limitation is memory capacity

Lots of 3D arrays: grid data, time layers, tridiagonal matrices

Multi-GPU and clusters enable high-resolution grids

Split the grid along the current direction

Assign each partition to a GPU

Solve systems and update common data between partitions

[Diagram: one ADI step split across GPU 1-4 along the current direction]


    References

http://code.google.com/p/cmc-fluid-solver/

Tridiagonal Solvers on the GPU and Applications to Fluid Simulation, GPU Technology Conference, 2009, Nikolai Sakharnykh

A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids, Moscow University Computational Mathematics and Cybernetics, 2007, Paskonov V.M., Berezin S.B., Korukhova E.S.


    Outline

    Tridiagonal Solvers Overview

    Fluid Simulation in 3D domain

    Depth-of-Field Effect in 2D domain

    Problem statement

    ADI numerical method

    Analysis and hybrid approach

    Shared memory optimization

    Visual results


    Problem statement

    ADI method application for 2D problems

    Real-time Depth-Of-Field simulation

    Using diffusion equation to blur the image

Now we need to solve tridiagonal systems in a 2D domain

    Different setup, different methods for GPU


    Numerical method

    Alternating Direction Implicit method

X pass: solve along x for all fixed y; Y pass: solve along y for all fixed x
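A standard directional split of the diffusion equation used for diffusion DOF (β is the local conductivity derived from the circle of confusion; the notation is assumed, not copied from the slide):

X pass, for each fixed y: (u^* - u^n) / Δt = ∂/∂x ( β ∂u^*/∂x )

Y pass, for each fixed x: (u^{n+1} - u^*) / Δt = ∂/∂y ( β ∂u^{n+1}/∂y )

Each pass discretizes to one tridiagonal system per image row or column.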


    GPU implementation

    Matrix layout

    Sequential for X direction

    Interleaved for Y direction

    Use hybrid algorithm

Start with Parallel Cyclic Reduction (PCR)

Subdivide our systems into smaller ones

    Finish with Gauss Elimination (Sweep)

    Solve each new system by 1 thread
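As a worked example (the width 1920 is an assumed figure, not from the slides): after k PCR steps a system of size N splits into 2^k independent systems of size N / 2^k, so with 3 PCR steps a row of N = 1920 pixels becomes 8 systems of 240 unknowns, each handled by one sweep thread.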


    Hybrid approach

    Benefits of each additional PCR step

Improves memory layout for Sweep in the X direction

Subdivides systems for a more efficient Sweep

But additional overhead for each step

Increases the total number of operations for solving a system

Need to choose an optimal number of PCR steps

For the DOF effect at high resolution: 3 steps for the X direction


    Shared memory optimization

Shared memory is used as a temporary transpose buffer

Load a 32x16 area into shared memory

Compute 4 iterations inside shared memory

Store the 32x16 area back to global memory

Gives a 20% speed-up

[Diagram: a 32x16 shared memory tile]
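A minimal sketch of this shared-memory staging (the tile size matches the slide; the per-iteration update is a placeholder and the names are illustrative), launched with 32x16 thread blocks:

#define TILE_W 32
#define TILE_H 16

__global__ void tiledIterations(float* data, int width, int height)
{
    __shared__ float tile[TILE_H][TILE_W + 1];       // +1 pad to avoid bank conflicts

    int gx = blockIdx.x * TILE_W + threadIdx.x;
    int gy = blockIdx.y * TILE_H + threadIdx.y;
    bool inside = (gx < width) && (gy < height);

    if (inside)
        tile[threadIdx.y][threadIdx.x] = data[gy * width + gx];
    __syncthreads();

    for (int it = 0; it < 4; ++it) {
        // placeholder for the per-iteration update performed in shared memory
        __syncthreads();
    }

    if (inside)
        data[gy * width + gx] = tile[threadIdx.y][threadIdx.x];
}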


Diffusion simulation idea

Diffuse temperature (= color) on a 2D plate with non-uniform heat conductivity (= circle of confusion)

Key features:

No color bleeding

Support for spatially varying, arbitrary circles of confusion

[Figure: out-of-focus regions use a large diffusion radius; in-focus regions use radius = 0 and stay sharp]


    Depth-of-Field in games

    From M

    THQ a




    References

    Pixar article about Diffusion DOF:

    http://graphics.pixar.com/library/DepthOfField/paper.pdf

    Cyclic reduction methods on CUDA:

    http://www.idav.ucdavis.edu/func/return_pdf?pub_id=978

    Upcoming sample in NVIDIA SDK


    Summary

10x speed-ups for 3D ADI methods in complex areas

Great Fermi performance in double precision

Achieved real-time performance in graphics and games

Realistic depth-of-field effect

Tridiagonal solver performance is problem-specific
