8/14/2019 2015_GTC2010
1/51
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Nikolai Sakharnykh - NVIDIA
San Jose Convention Center, San Jose, CA | September 21, 2010
Introduction
Tridiagonal solvers are a very popular technique in both compute and graphics applications
Applications in Alternating Direction Implicit (ADI) methods
2 different examples will be covered in this talk:
3D Fluid Simulation for research and science
2D Depth-of-Field Effect for graphics and games
Outline
Tridiagonal Solvers Overview
Introduction
Gauss Elimination
Cyclic Reduction
Fluid Simulation in 3D domain
Depth-of-Field Effect in 2D domain
Tridiagonal Solvers
Need to solve many independent tridiagonal systems
Matrix sizes and configurations are problem-specific
Gauss Elimination (Sweep)
Forward sweep:
Backward substitution:
The fastest serial approach
O(N) complexity
Optimal number of operations
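The sweep described above is the classic Thomas algorithm. A minimal serial sketch (the function name is mine; in the talk's GPU setting one thread runs one such sweep per system):

```cpp
#include <vector>

// Serial sweep (Thomas algorithm) for one tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  with a[0] = c[n-1] = 0.
// The forward sweep eliminates the sub-diagonal; backward substitution
// recovers x. O(N) operations: the optimal count for a serial solver.
std::vector<double> sweep(std::vector<double> a, std::vector<double> b,
                          std::vector<double> c, std::vector<double> d) {
    const int n = static_cast<int>(b.size());
    for (int i = 1; i < n; ++i) {            // forward sweep
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; --i)         // backward substitution
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

The data dependency in both loops is what makes this solver inherently serial per system; the parallelism comes from solving many systems at once.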
Cyclic Reduction
Eliminate unknowns using linear combinations of neighboring equations
After one reduction step a system is decomposed into two independent systems of half the size
The reduction step can be done by N threads in parallel
Cyclic Reduction
Parallel Cyclic Reduction (PCR)
Apply reduction to the new systems and repeat: O(log N) steps
For some special cases fewer steps can be used
Cyclic Reduction (CR)
Forward reduction, backward substitution
Complexity O(N), but requires more operations than Sweep
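To make the reduction step concrete, here is a serial model of PCR (a sketch; on the GPU the inner loop over rows runs as one thread per row):

```cpp
#include <vector>

// Serial model of Parallel Cyclic Reduction (PCR). Each step combines
// every equation with its neighbors at distance `stride`; after O(log N)
// steps every unknown is fully decoupled. This CPU sketch just loops
// where a GPU would launch one thread per row.
std::vector<double> pcr(std::vector<double> a, std::vector<double> b,
                        std::vector<double> c, std::vector<double> d) {
    const int n = static_cast<int>(b.size());
    for (int stride = 1; stride < n; stride *= 2) {
        std::vector<double> na(n), nb(n), nc(n), nd(n);
        for (int i = 0; i < n; ++i) {
            const bool up = (i - stride >= 0), dn = (i + stride < n);
            const double k1 = up ? a[i] / b[i - stride] : 0.0;
            const double k2 = dn ? c[i] / b[i + stride] : 0.0;
            // Linear combination eliminates x[i-stride] and x[i+stride].
            nb[i] = b[i] - (up ? k1 * c[i - stride] : 0.0)
                         - (dn ? k2 * a[i + stride] : 0.0);
            nd[i] = d[i] - (up ? k1 * d[i - stride] : 0.0)
                         - (dn ? k2 * d[i + stride] : 0.0);
            na[i] = up ? -k1 * a[i - stride] : 0.0;
            nc[i] = dn ? -k2 * c[i + stride] : 0.0;
        }
        a.swap(na); b.swap(nb); c.swap(nc); d.swap(nd);
    }
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = d[i] / b[i];  // fully decoupled
    return x;
}
```

After the first step the even-indexed and odd-indexed equations form two independent systems, which is exactly the decomposition the slide describes.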
Outline
Tridiagonal Solvers Overview
Fluid Simulation in 3D domain
Problem statement, applications
ADI numerical method
GPU implementation details, optimizations
Performance analysis and results comparisons
Depth-of-Field Effect in 2D domain
Problem Statement
Viscous incompressible fluid in a 3D domain
New challenge: complex dynamic boundaries
Eulerian coordinates: velocity and temperature
Boundary condition types: free, injection, no-slip
Applications
Blood flow simulation in the heart
Capture heart movement
Simulate flow inside the heart
Applications
Sea and ocean simulations
Static boundaries
Additional parameters
Definitions
Equation of state
Describes the relation between pressure, density, and temperature
Example (ideal gas): p = ρRT
Notation: ρ density, u velocity, T temperature, p pressure, R gas constant for air
Governing equations
Continuity equation
For incompressible fluids: ∇·u = 0
Navier-Stokes equations:
Dimensionless form, using the equation of state
Re: Reynolds number (= inertia/viscosity ratio)
Governing equations
Energy equation:
Dimensionless form, using the equation of state
Notation: heat capacity ratio, Prandtl number, dissipative function
ADI numerical method
Alternating Direction Implicit
X sweep: fixed Y, Z; Y sweep: fixed X, Z; Z sweep: fixed X, Y
ADI method iterations
Use global iterations for the whole system of equations
Some equations are non-linear:
Use local iterations to approximate the non-linear terms
Global iteration loop: previous time step → solve X-direction equations → solve Y-direction equations → solve Z-direction equations → update all variables
Discretization
Use a regular grid and an implicit finite difference scheme
This yields a tridiagonal system along each grid line:
a_i x_(i-1) + b_i x_i + c_i x_(i+1) = d_i
Independent system for each fixed pair (j, k)
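As a generic illustration of where the tridiagonal structure comes from (the talk's actual scheme for the Navier-Stokes and energy equations is more involved), discretizing the 1D diffusion model problem $u_t = u_{xx}$ with backward Euler gives

```latex
-r\,u_{i-1}^{n+1} + (1 + 2r)\,u_i^{n+1} - r\,u_{i+1}^{n+1} = u_i^{n},
\qquad r = \frac{\Delta t}{\Delta x^2}
```

i.e. a tridiagonal system with coefficients $a_i = c_i = -r$, $b_i = 1 + 2r$ and right-hand side $d_i = u_i^n$, one such system per grid line.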
Tridiagonal systems
Need to solve lots of tridiagonal systems
Sizes of systems may vary across the grid
(Figure: one grid line crossing cells marked O, I, B, producing systems 1, 2, 3 of different lengths)
Implementation details
for each global iteration {
    for each direction X, Y, Z {
        for each local iteration {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}
GPU implementation
Store all data arrays entirely on the GPU in linear memory
Reduce the amount of memory transfers to a minimum
Map 3D arrays to linear memory: (X, Y, Z) → Z + Y * dimZ + X * dimY * dimZ
Z is the fastest-changing dimension
Main tasks:
Build matrices
Solve systems
Additional routines for non-linear updates
Merge, copy 3D arrays with masks
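The mapping can be written as a small helper (a sketch; the name idx3d is mine):

```cpp
#include <cstddef>

// (X, Y, Z) -> linear offset with Z as the fastest-changing dimension,
// matching the slide's formula Z + Y * dimZ + X * dimY * dimZ.
// Threads that differ only in Z then touch adjacent addresses,
// which is the coalescing-friendly case.
inline std::size_t idx3d(std::size_t x, std::size_t y, std::size_t z,
                         std::size_t dimY, std::size_t dimZ) {
    return z + y * dimZ + x * dimY * dimZ;
}
```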
Building matrices
Input data:
Previous/non-linear 3D layers
Each thread computes:
Coefficients a, b, c of a tridiagonal matrix
Right-hand side vector d
Use C++ templates for direction and equation
Building matrices performance
Poor Z direction performance compared to X/Y
Each thread accesses a contiguous memory region, so accesses across a warp are uncoalesced, with lots of cache misses
(Figure: build and build + solve times on Tesla C2050, single precision, for the X, Y, and Z directions; Z is by far the slowest)
Build kernel memory profile, requests per load: X: 2-3, Y: 2-3, Z: 32
Building matrices optimization
Run Z phase in transposed XZY space
Better locality for memory accesses
Additional overhead on transpose
(Pipeline: X local iterations → Y local iterations → transpose input arrays from XYZ to XZY → Z local iterations in transposed space → transpose the output array back)
Building matrices - optimization
Tridiagonal solver time dominates over transpose
Transpose takes a smaller share of total time as local iterations increase
(Figure: total time on Tesla C2050, single precision, split into transpose and build + solve for X, Y, Z, and optimized Z; the transposed Z phase is about 2.5x faster)
Build kernel in the Z direction, requests per load: original: 32, transposed: 2-3
Problem analysis
Large number of tridiagonal systems
Grid size squared on each step: 16K for 128^3
Relatively small system sizes
At most the grid size: 128 for 128^3
1 thread per system is a suitable approach for this problem
Enough systems to utilize the GPU
Solving tridiagonal systems
Matrix layout is crucial for performance
Sequential layout: a0 a1 a2 a3 | a0 a1 a2 a3 (system 1, system 2: each system is one contiguous block)
PCR and CR friendly; used for the ADI Z direction
Interleaved layout: a0 a0 a0 a0 a1 a1 ... (element i of all systems stored together, so threads 1, 2, 3 read neighboring addresses)
Sweep friendly; used for the ADI X, Y directions
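The two layouts differ only in the index formula; sketched below (helper names are mine):

```cpp
#include <cstddef>

// Addressing element i of system s, for numSys systems of size n.

// Sequential: each system occupies one contiguous block.
inline std::size_t seqIdx(std::size_t s, std::size_t i, std::size_t n) {
    return s * n + i;
}

// Interleaved: element i of every system is stored contiguously, so with
// one thread per system (Sweep) neighboring threads read neighboring
// addresses, a coalesced access pattern.
inline std::size_t intIdx(std::size_t s, std::size_t i, std::size_t numSys) {
    return i * numSys + s;
}
```

With interleaved storage, systems s and s+1 sit one element apart at every step of the sweep, which is what makes the one-thread-per-system solver bandwidth-efficient.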
Tridiagonal solvers performance
(Figure: solve time in ms on Tesla C2050 vs system size N = 64 to 256 for PCR, CR, SWEEP, and SWEEP_T in a 3D ADI test setup; SWEEP uses the interleaved matrix layout)
PCR/CR implementation from "Fast tridiagonal solvers on the GPU" by Y. Zhang, J. Cohen, J. Owens
Choosing tridiagonal solver
Application to the 3D ADI method:
For the X, Y directions matrices are interleaved by default
Z is interleaved as well if done in transposed space
The Sweep solver seems like the best choice
Although one can experiment with PCR for the Z direction
Performance analysis on Fermi
L1/L2 effect on performance:
Using 48K L1 instead of 16K gives a 10-15% speed-up
Turning L1 off reduces performance by 10%
L1 really helps with misaligned accesses and spatial reuse
Sweep effective bandwidth on Tesla C2050:
Single precision: 119 GB/s, 82% of peak
Double precision: 135 GB/s, 93% of peak
Performance benchmark
CPU configuration:
Intel Core i7, 4 cores @ 2.8 GHz
Use all 4 cores through OpenMP (3-4x speed-up vs 1 core)
GPU configurations:
NVIDIA Tesla C1060
NVIDIA Tesla C2050
Measure Build + Solve time for all iterations
Test cases
Test         | Grid size       | X dir | Y dir | Z dir
Cube         | 128 x 128 x 128 | 10K   | 10K   | 10K
Heart Volume | 96 x 160 x 128  | 13K   | 8K    | 8K
Sea Area     | 160 x 128 x 128 | 18K   | 20K   | 6K
(Images: Cube, Heart Volume, and Sea Area domains)
Performance results: Cube
Same number of systems for X/Y/Z, all sizes are equal
(Figure: throughput in systems/ms, single and double precision, for Core i7, Tesla C1060, and Tesla C2050 across the X, Y, Z, and all directions; GPU speed-ups over the CPU of 5.9x and 9.5x)
Performance results: Heart Volume
More systems for X, sizes vary smoothly
(Figure: throughput in systems/ms, single and double precision, for Core i7, Tesla C1060, and Tesla C2050; GPU speed-ups over the CPU of 8.1x and 11.4x)
Performance results: Sea Area
Lots of small systems of different sizes for X/Y
(Figure: throughput in systems/ms, single and double precision, for Core i7, Tesla C1060, and Tesla C2050; GPU speed-ups over the CPU of 6.5x and 9.6x)
Future work
The current main limitation is memory capacity
Lots of 3D arrays: grid data, time layers, tridiagonal matrices
Multi-GPU and clusters enable high-resolution grids:
Split the grid along the current direction
Assign each partition to a GPU
Solve systems and update, communicating between partitions
(Diagram: one ADI step with the grid partitioned across GPU 1-4 along the step direction)
References
http://code.google.com/p/cmc-fluid-solver/
"Tridiagonal Solvers on the GPU and Applications to Fluid Simulation", GPU Technology Conference, 2009, Nikolai Sakharnykh
"A dynamic visualization system for multiprocessor computers with common memory and its application for numerical modeling of the turbulent flows of viscous fluids", Moscow University Computational Mathematics and Cybernetics, 2007, Paskonov V.M., Berezin S.B., Korukhova E.S.
37/51
Outline
Tridiagonal Solvers Overview
Fluid Simulation in 3D domain
Depth-of-Field Effect in 2D domain
Problem statement
ADI numerical method
Analysis and hybrid approach
Shared memory optimization
Visual results
Problem statement
ADI method application for 2D problems
Real-time Depth-Of-Field simulation
Using diffusion equation to blur the image
Now need to solve tridiagonal systems in a 2D domain
Different setup, different methods for the GPU
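The blur model (following the Pixar diffusion-DOF reference cited at the end of this talk; the symbol β is my labeling) treats the image I as heat diffusing with a spatially varying conductivity derived from the circle of confusion:

```latex
\frac{\partial I}{\partial t} = \nabla \cdot \bigl(\beta(x, y)\, \nabla I\bigr)
```

One implicit ADI half-step in x (then analogously in y) turns each image row into a tridiagonal system; where β = 0 the pixel is in focus and stays sharp.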
Numerical method
Alternating Direction Implicit method
X sweep: for all fixed y
Y sweep: for all fixed x
GPU implementation
Matrix layout:
Sequential for the X direction
Interleaved for the Y direction
Use a hybrid algorithm:
Start with Parallel Cyclic Reduction (PCR)
Subdivide the systems into smaller ones
Finish with Gauss Elimination (Sweep)
Solve each new system with 1 thread
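A serial model of this hybrid scheme, combining a few PCR reduction steps with a final sweep (a sketch; the function name and the residue-class bookkeeping are mine):

```cpp
#include <vector>

// Hybrid tridiagonal solver model: p PCR steps split one size-n system
// into 2^p independent systems (rows with equal index mod 2^p), each of
// which is then solved by a Gauss elimination sweep. On the GPU each
// subsystem would get one thread; here everything runs serially.
std::vector<double> hybridSolve(std::vector<double> a, std::vector<double> b,
                                std::vector<double> c, std::vector<double> d,
                                int pcrSteps) {
    const int n = static_cast<int>(b.size());
    int stride = 1;
    for (int step = 0; step < pcrSteps; ++step, stride *= 2) {
        std::vector<double> na(n), nb(n), nc(n), nd(n);
        for (int i = 0; i < n; ++i) {
            const bool up = (i - stride >= 0), dn = (i + stride < n);
            const double k1 = up ? a[i] / b[i - stride] : 0.0;
            const double k2 = dn ? c[i] / b[i + stride] : 0.0;
            nb[i] = b[i] - (up ? k1 * c[i - stride] : 0.0)
                         - (dn ? k2 * a[i + stride] : 0.0);
            nd[i] = d[i] - (up ? k1 * d[i - stride] : 0.0)
                         - (dn ? k2 * d[i + stride] : 0.0);
            na[i] = up ? -k1 * a[i - stride] : 0.0;
            nc[i] = dn ? -k2 * c[i + stride] : 0.0;
        }
        a.swap(na); b.swap(nb); c.swap(nc); d.swap(nd);
    }
    // Sweep each subsystem {r, r+stride, r+2*stride, ...} independently.
    std::vector<double> x(n);
    for (int r = 0; r < stride && r < n; ++r) {
        for (int i = r + stride; i < n; i += stride) {   // forward sweep
            const double m = a[i] / b[i - stride];
            b[i] -= m * c[i - stride];
            d[i] -= m * d[i - stride];
        }
        const int last = r + ((n - 1 - r) / stride) * stride;
        x[last] = d[last] / b[last];
        for (int i = last - stride; i >= 0; i -= stride)  // back-substitute
            x[i] = (d[i] - c[i] * x[i + stride]) / b[i];
    }
    return x;
}
```

With pcrSteps = 0 this degenerates to a plain sweep; each extra PCR step doubles the number of independent (and shorter) subsystems, which is the trade-off the next slide discusses.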
Hybrid approach
Benefits of each additional PCR step:
Improves the memory layout for Sweep in the X direction
Subdivides systems for a more efficient Sweep
But there is additional overhead for each step:
Increases the total number of operations for solving a system
Need to choose an optimal number of PCR steps
For the DOF effect at high resolution: 3 steps for the X direction
Shared memory optimization
Shared memory is used as a temporary transpose buffer:
Load a 32x16 area into shared memory
Compute 4 iterations inside shared memory
Store the 32x16 area back to global memory
Gives a 20% speed-up
(Diagram: 32x16 tile staged through shared memory)
Diffusion simulation idea
Diffuse temperature (= color) on a 2D plate with non-uniform heat conductivity (= circles of confusion)
Key features:
No color bleeding
Support for spatially varying, arbitrary circles of confusion
(Figure: out of focus: large radius; in focus: radius = 0, sharp)
Depth-of-Field in games
(Screenshots from Metro 2033, THQ and 4A Games)
References
Pixar article about Diffusion DOF:
http://graphics.pixar.com/library/DepthOfField/paper.pdf
Cyclic reduction methods on CUDA:
http://www.idav.ucdavis.edu/func/return_pdf?pub_id=978
Upcoming sample in NVIDIA SDK
Summary
10x speed-ups for 3D ADI methods in complex areas
Great Fermi performance in double precision
Achieved real-time performance in graphics and games:
Realistic depth-of-field effect
Tridiagonal solver performance is problem-specific