
Efficient Tridiagonal Solvers for ADI Methods and Fluid Simulation

Nikolai Sakharnykh - NVIDIA

San Jose Convention Center, San Jose, CA | September 21, 2010


    Introduction

Tridiagonal solvers are a very popular technique in both compute and graphics applications

Applied in Alternating Direction Implicit (ADI) methods

Two different examples are covered in this talk:

3D Fluid Simulation for research and science

2D Depth-of-Field Effect for graphics and games


    Outline

    Tridiagonal Solvers Overview

    Introduction

    Gauss Elimination

    Cyclic Reduction

    Fluid Simulation in 3D domain

    Depth-of-Field Effect in 2D domain


    Tridiagonal Solvers

Need to solve many independent tridiagonal systems

Matrix sizes and configurations are problem-specific
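For reference (standard notation, not copied from the slide), each tridiagonal system of size N has the form

a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \quad i = 1, \dots, N, \quad a_1 = c_N = 0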


    Gauss Elimination (Sweep)

    Forward sweep:

    Backward substitution:

    The fastest serial approach

    O(N) complexity

    Optimal number of operations
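The sweep recurrences, in the same a, b, c, d notation (the standard Thomas-algorithm form, not copied from the slides):

Forward sweep: c'_1 = c_1 / b_1, \; d'_1 = d_1 / b_1; \quad c'_i = \frac{c_i}{b_i - a_i c'_{i-1}}, \; d'_i = \frac{d_i - a_i d'_{i-1}}{b_i - a_i c'_{i-1}}, \quad i = 2, \dots, N

Backward substitution: x_N = d'_N; \quad x_i = d'_i - c'_i x_{i+1}, \quad i = N-1, \dots, 1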


    Cyclic Reduction

Eliminate unknowns using a linear combination of neighboring equations

After one reduction step the system decomposes into two independent systems of half the size

The reduction step can be done by N threads in parallel
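In the standard formulation (same a, b, c, d notation as above; not copied from the slide), equation i is combined with its neighbors i-1 and i+1 using k_1 = a_i / b_{i-1} and k_2 = c_i / b_{i+1}:

a'_i = -k_1 a_{i-1}, \quad b'_i = b_i - k_1 c_{i-1} - k_2 a_{i+1}, \quad c'_i = -k_2 c_{i+1}, \quad d'_i = d_i - k_1 d_{i-1} - k_2 d_{i+1}

The new equation couples only x_{i-2}, x_i, x_{i+2}, so even- and odd-indexed unknowns separate into two independent half-size systems.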


    Cyclic Reduction

Parallel Cyclic Reduction (PCR)

Apply reduction to the new systems and repeat: O(log N) steps

For some special cases fewer steps can be used

Cyclic Reduction (CR)

Forward reduction, backward substitution

Complexity O(N), but requires more operations than Sweep


    Outline

    Tridiagonal Solvers Overview

    Fluid Simulation in 3D domain

    Problem statement, applications

    ADI numerical method

    GPU implementation details, optimizations

    Performance analysis and results comparisons

    Depth-of-Field Effect in 2D domain


    Problem Statement

Viscous incompressible fluid in a 3D domain

New challenge: complex dynamic boundaries

Eulerian coordinates: velocity and temperature

Boundary condition types: free, injection, no-slip


    Applications

Blood flow simulation in the heart

[Figure: capture the heart's movement, simulate the flow inside the heart]


    Applications

    Sea and ocean simulations

Static boundaries

Additional parameters


    Definitions

Equation of state

Describes the relation between pressure, density, and temperature

Example (ideal gas): P = ρ R T, where R is the gas constant for air

Notation: density ρ, velocity u, temperature T, pressure P


    Governing equations

    Continuity equation

    For incompressible fluids:

    Navier-Stokes equations:

    Dimensionless form, use equation of state

    Reynolds number (= inertia/viscosity ratio)
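In a standard dimensionless form (notation assumed here, not copied from the slide), the incompressible continuity and Navier-Stokes equations read

\nabla \cdot \mathbf{u} = 0, \qquad \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\,\mathbf{u} = -\nabla p + \frac{1}{Re}\,\nabla^2 \mathbf{u}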


    Governing equations

    Energy equation:

    Dimensionless form, use equation of state

Notation: heat capacity ratio, Prandtl number, dissipative function


    ADI numerical method

    Alternating Direction Implicit

[Diagram: three sweeps per step: X direction with Y, Z fixed; Y direction with X, Z fixed; Z direction with X, Y fixed]


    ADI method iterations

Use global iterations for the whole system of equations

Some equations are non-linear:

Use local iterations to approximate the non-linear terms

[Diagram: starting from the previous time step, each global iteration solves the X-direction equations, then the Y-direction equations, then the Z-direction equations, and finally updates all variables]


    Discretization

Use a regular grid and an implicit finite difference scheme

This yields a tridiagonal system coupling the unknowns at i-1, i, i+1

Independent system for each fixed pair (j, k)


Tridiagonal systems

Need to solve lots of tridiagonal systems

Sizes of systems may vary across the grid

[Figure: a grid line crossing a complex boundary splits into several independent systems (system 1, system 2, system 3)]


    Implementation details

// loop labels follow the ADI iterations slide (global iterations, directions, local iterations)
for each global iteration {
    for each direction (X, Y, Z) {
        for each local iteration {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}


    GPU implementation

Store all data arrays entirely on the GPU in linear memory

Reduce the amount of memory transfers to a minimum

Map 3D arrays to linear memory: (X, Y, Z) maps to Z + Y * dimZ + X * dimZ * dimY, so Z is the fastest-changing dimension

Main tasks:

Build matrices

Solve systems

Additional routines for non-linear updates

Merge, copy 3D arrays with masks
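A minimal sketch of this indexing (function and parameter names are illustrative, not from the talk's code):

// (X, Y, Z) -> linear index, with Z the fastest-changing dimension,
// so neighboring Z cells are adjacent in memory.
__host__ __device__ inline int linearIndex(int x, int y, int z, int dimY, int dimZ)
{
    return z + dimZ * (y + dimY * x);
}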


    Building matrices

    Input data:

    Previous/non-linear 3D layers

    Each thread computes:

    Coefficients of a tridiagonal matrix

    Right-hand side vector

Use C++ templates for direction and equation

[Figure: each thread fills the coefficients a, b, c and the right-hand side d of its row]
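A hypothetical sketch of the build step: one thread per (Y, Z) pair fills the coefficients and right-hand side of its system along X. The array names and the implicit-diffusion-style coefficients are placeholders; the actual kernels are templated on direction and equation.

__global__ void buildMatricesX(const float* u,                 // previous/non-linear 3D layer
                               float* a, float* b, float* c, float* d,
                               int dimX, int dimY, int dimZ,
                               float dt, float dx)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int z = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= dimY || z >= dimZ) return;

    float r = dt / (dx * dx);                                   // placeholder coefficient
    for (int x = 0; x < dimX; ++x) {
        int idx = z + dimZ * (y + dimY * x);                    // Z fastest-changing
        a[idx] = (x == 0)        ? 0.0f : -r;
        b[idx] = 1.0f + 2.0f * r;
        c[idx] = (x == dimX - 1) ? 0.0f : -r;
        d[idx] = u[idx];                                        // right-hand side
    }
}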


    Building matrices performance

Poor Z direction performance compared to X/Y

Each thread accesses a contiguous memory region, so accesses across threads are uncoalesced, causing lots of cache misses

[Chart: build and build + solve time (sec) on Tesla C2050, single precision, for the X, Y, and Z directions; the Z direction is far slower. Table (build kernel): requests per load are 2 for X and Y but 32 for Z.]


    Building matrices optimization

    Run Z phase in transposed XZY space

    Better locality for memory accesses

    Additional overhead on transpose

[Diagram: X and Y local iterations run in XYZ space; the input arrays are transposed to XZY space, the Z local iterations run there, and the output array is transposed back]


    Building matrices - optimization

    Tridiagonal solver time dominates over transpose

Transpose takes a smaller share of the total time as the number of local iterations grows

[Chart: total time (sec) on Tesla C2050, single precision, for X dir, Y dir, Z dir, and the optimized Z dir (transpose + build + solve); the transposed Z direction is about 2.5x faster. Table (Z-direction build kernel): requests per load drop from 32 (original) to 2 (transposed).]


    Problem analysis

Large number of tridiagonal systems

Grid size squared on each step: 16K for a 128^3 grid

Relatively small system sizes

At most the grid size: 128 for a 128^3 grid

1 thread per system is a suitable approach for this problem

Enough systems to utilize the GPU


    Solving tridiagonal systems

    Matrix layout is crucial for performance

Sequential layout: a0 a1 a2 a3 | a0 a1 a2 a3, each system's coefficients stored contiguously; PCR and CR friendly; matches the ADI Z direction

Interleaved layout: a0 a0 a0 a0 | a1 a1 ..., coefficients interleaved across systems so threads 1, 2, 3 read adjacent elements; Sweep friendly; matches the ADI X, Y directions
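A minimal sketch of the one-thread-per-system sweep on the interleaved layout (assumed storage: element i of system s at i * numSystems + s; names are illustrative, not from the talk's code):

__global__ void sweepInterleaved(const float* a, const float* b, const float* c,
                                 const float* d, float* x, float* cp, float* dp,
                                 int n, int numSystems)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per system
    if (s >= numSystems) return;

    // Forward sweep (Thomas algorithm); cp/dp hold the modified coefficients.
    cp[s] = c[s] / b[s];
    dp[s] = d[s] / b[s];
    for (int i = 1; i < n; ++i) {
        int idx  = i * numSystems + s;
        int prev = idx - numSystems;
        float m = 1.0f / (b[idx] - a[idx] * cp[prev]);
        cp[idx] = c[idx] * m;
        dp[idx] = (d[idx] - a[idx] * dp[prev]) * m;
    }

    // Backward substitution.
    x[(n - 1) * numSystems + s] = dp[(n - 1) * numSystems + s];
    for (int i = n - 2; i >= 0; --i) {
        int idx = i * numSystems + s;
        x[idx] = dp[idx] - cp[idx] * x[idx + numSystems];
    }
}

At every step of both loops, consecutive threads touch consecutive addresses, which is what makes the interleaved layout Sweep friendly.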


    Tridiagonal solvers performance

[Chart: 3D ADI test setup, solve time (ms) on Tesla C2050 versus system size N = 64-256 for PCR, CR, SWEEP, and SWEEP_T; SWEEP uses the interleaved matrix layout. PCR/CR implementations from "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens.]


    Choosing tridiagonal solver

    Application for 3D ADI method

For the X, Y directions matrices are interleaved by default

Z is interleaved as well if done in the transposed space

The Sweep solver seems like the best choice

Although one can experiment with PCR for the Z direction


Performance analysis on Fermi

L1/L2 effect on performance

Using 48K L1 instead of 16K gives a 10-15% speed-up

Turning L1 off reduces performance by 10%

L1 really helps with misaligned accesses and spatial reuse

Sweep effective bandwidth on Tesla C2050: Single Precision = 119 GB/s, 82% of peak; Double Precision = 135 GB/s, 93% of peak


    Performance benchmark

    CPU configuration:

Intel Core i7, 4 cores @ 2.8 GHz; use all 4 cores through OpenMP (3-4x speed-up vs 1 core)

    GPU configurations:

    NVIDIA Tesla C1060

    NVIDIA Tesla C2050

    Measure Build + Solve time for all iterations


    Test cases

Test          Grid size         X dir   Y dir   Z dir
Cube          128 x 128 x 128   10K     10K     10K
Heart Volume  96 x 160 x 128    13K     8K      8K
Sea Area      160 x 128 x 128   18K     20K     6K


Performance results: Cube

Same number of systems for X/Y/Z, all sizes are equal

[Charts: systems/ms in single and double precision for the X, Y, Z directions and all directions combined, on Core i7, Tesla C1060, and Tesla C2050; Tesla C2050 achieves 5.9x and 9.5x speed-ups over the Core i7]


Performance results: Heart Volume

More systems for X, sizes vary smoothly

[Charts: systems/ms in single and double precision for the X, Y, Z directions and all directions combined, on Core i7, Tesla C1060, and Tesla C2050; Tesla C2050 achieves 8.1x and 11.4x speed-ups over the Core i7]


Performance results: Sea Area

Lots of small systems of different sizes for X/Y

[Charts: systems/ms in single and double precision for the X, Y, Z directions and all directions combined, on Core i7, Tesla C1060, and Tesla C2050; Tesla C2050 achieves 6.5x and 9.6x speed-ups over the Core i7]


    Future work

Current main limitation is memory capacity

Lots of 3D arrays: grid data, time layers, tridiagonal matrices

Multi-GPU and clusters enable high-resolution grids

Split the grid along the current direction

Assign each partition to a GPU

Solve systems and update common data between partitions

[Diagram: one ADI step split across GPU 1-4 along the current direction]


    References

http://code.google.com/p/cmc-fluid-solver/

Tridiagonal Solvers on the GPU and Applications to Fluid Simulation, GPU Technology Conference, 2009, Nikolai Sakharnykh

A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids, Moscow University Computational Mathematics and Cybernetics, 2007, Paskonov V.M., Berezin S.B., Korukhova E.S.


    Outline

    Tridiagonal Solvers Overview

    Fluid Simulation in 3D domain

    Depth-of-Field Effect in 2D domain

    Problem statement

    ADI numerical method

    Analysis and hybrid approach

    Shared memory optimization

    Visual results


    Problem statement

    ADI method application for 2D problems

    Real-time Depth-Of-Field simulation

    Using diffusion equation to blur the image

Now we need to solve tridiagonal systems in a 2D domain

    Different setup, different methods for GPU


    Numerical method

    Alternating Direction Implicit method

X pass: solve along x for all fixed y; Y pass: solve along y for all fixed x
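A standard directional split of the diffusion equation used for diffusion DOF (β is the local conductivity derived from the circle of confusion; the notation is assumed, not copied from the slide):

X pass, for each fixed y: (u^* - u^n) / Δt = ∂/∂x ( β ∂u^*/∂x )

Y pass, for each fixed x: (u^{n+1} - u^*) / Δt = ∂/∂y ( β ∂u^{n+1}/∂y )

Each pass discretizes to one tridiagonal system per image row or column.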


    GPU implementation

    Matrix layout

    Sequential for X direction

    Interleaved for Y direction

    Use hybrid algorithm

Start with Parallel Cyclic Reduction (PCR)

Subdivide our systems into smaller ones

    Finish with Gauss Elimination (Sweep)

    Solve each new system by 1 thread
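As a worked example (the width 1920 is an assumed figure, not from the slides): after k PCR steps a system of size N splits into 2^k independent systems of size N / 2^k, so with 3 PCR steps a row of N = 1920 pixels becomes 8 systems of 240 unknowns, each handled by one sweep thread.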


    Hybrid approach

    Benefits of each additional PCR step

Improves memory layout for Sweep in the X direction

Subdivides systems for a more efficient Sweep

But additional overhead for each step

Increases the total number of operations for solving a system

Need to choose an optimal number of PCR steps

For the DOF effect at high resolution: 3 steps for the X direction


    Shared memory optimization

Shared memory is used as a temporary transpose buffer

Load a 32x16 area into shared memory

Compute 4 iterations inside shared memory

Store the 32x16 area back to global memory

Gives a 20% speed-up

[Diagram: a 32x16 shared memory tile]
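A minimal sketch of this shared-memory staging (the tile size matches the slide; the per-iteration update is a placeholder and the names are illustrative), launched with 32x16 thread blocks:

#define TILE_W 32
#define TILE_H 16

__global__ void tiledIterations(float* data, int width, int height)
{
    __shared__ float tile[TILE_H][TILE_W + 1];       // +1 pad to avoid bank conflicts

    int gx = blockIdx.x * TILE_W + threadIdx.x;
    int gy = blockIdx.y * TILE_H + threadIdx.y;
    bool inside = (gx < width) && (gy < height);

    if (inside)
        tile[threadIdx.y][threadIdx.x] = data[gy * width + gx];
    __syncthreads();

    for (int it = 0; it < 4; ++it) {
        // placeholder for the per-iteration update performed in shared memory
        __syncthreads();
    }

    if (inside)
        data[gy * width + gx] = tile[threadIdx.y][threadIdx.x];
}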


Diffusion simulation idea

Diffuse temperature (= color) on a 2D plate with non-uniform heat conductivity (= circle of confusion)

Key features:

No color bleeding

Support for spatially varying, arbitrary circles of confusion

[Figure: out-of-focus regions use a large diffusion radius; in-focus regions use radius = 0 and stay sharp]


    Depth-of-Field in games

    From M

    THQ a




    References

    Pixar article about Diffusion DOF:

    http://graphics.pixar.com/library/DepthOfField/paper.pdf

    Cyclic reduction methods on CUDA:

    http://www.idav.ucdavis.edu/func/return_pdf?pub_id=978

    Upcoming sample in NVIDIA SDK


    Summary

10x speed-ups for 3D ADI methods in complex areas

Great Fermi performance in double precision

Achieved real-time performance in graphics and games

Realistic depth-of-field effect

Tridiagonal solver performance is problem-specific
