
IT 15076

Examensarbete 30 hp, December 2015

Accelerating Fluids Simulation Using SPH and Implementation on GPU

Aditya Hendra

Institutionen för informationsteknologi
Department of Information Technology


Teknisk-naturvetenskaplig fakultet, UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web page: http://www.teknat.uu.se/student

Abstract

Accelerating Fluids Simulation Using SPH and Implementation on GPU

Aditya Hendra

Fluid simulation is usually done with CFD methods, which offer high precision but need days, weeks or months of computation on desktop CPUs, which limits their practical use in industrial control systems. In order to reduce the computation time, the Smoothed Particle Hydrodynamics (SPH) method is used. SPH is commonly used to simulate fluids in the computer graphics field, especially in gaming. It offers faster computation at the cost of lower accuracy. The goal of this work is to determine the feasibility of using the SPH method with GPU parallel programming to provide fluid simulation that is fast enough for real-time feedback control simulation. A previous master thesis on accelerating fluid simulation using the SPH method was done by Ann Johansson at ABB. Her Matlab implementation on an Intel i7 at 2.4 GHz needs 7089 seconds to compute a water-jet simulation with 40000 particles and a time step of 0.006 seconds. Our work utilizes GPU parallel programs implemented in Fluidsv3, an open-source software used as the base code. With CUDA C/C++ and an Nvidia GTX980, we need 18 seconds to compute a water-jet simulation using 1000000 particles and a time step of 0.0001 seconds. Currently, our work lacks a validation method to measure the accuracy of the fluid simulation, and more work needs to be done on this. However, it only takes 80 msec to compute one iteration, which opens an opportunity to use it together with real-time systems, such as a feedback control system, that have a period of 100 msec. This means it could model industrial processes that utilize water, such as the cooling process in a hot rolling mill. The next question, which is not addressed in this study, is how to satisfy application-dependent needs such as simulation accuracy, required parameters, simulation duration in real time, etc.

Printed by: Reprocentralen ITC
IT 15076
Examiner: Wang Yi
Subject reviewer: Stefan Engblom
Supervisor: Kateryna Mishchenko


Acknowledgement

I would like to express my gratitude to ABB's project team, especially my thesis supervisor Kateryna Mishchenko, Markus Lindgren and Lokman Hosain, for their trust and guidance from the start to the end of the thesis work.

Furthermore, I would like to thank Rama Hoetzlein for making his work on Fluidsv3 freely available on the web. I also want to thank my thesis reviewer Stefan Engblom from Uppsala University for his helpful guidance in writing.

My sincere thanks also go to my master program coordinator at Uppsala University, Philipp Rümmer, who has helped me with many things during my education there.

Last but not least, I am very grateful to the Swedish Institute for providing me a full scholarship for my master education in Sweden. Without it, this work would not have been possible.


Contents

Acknowledgement
Acronyms
1 Introduction
  1.1 Project Purpose and Goal
  1.2 Scope
2 Background
  2.1 Fluids Simulation with Smoothed-Particle Hydrodynamics
    2.1.1 Navier-Stokes Equation
    2.1.2 Smoothed Particle Hydrodynamics (SPH)
  2.2 GPU Parallel Programming
    2.2.1 General-Purpose computing on Graphics Processing Units (GPGPU)
    2.2.2 CUDA
3 SPH for Fluids Simulation
  3.1 Acceleration of MATLAB code using GPUs
  3.2 GPU Implementation of Fluids Simulation using SPH Method
    3.2.1 SPH Implementation with Fluidsv3
    3.2.2 Alternative SPH Implementation
  3.3 Other GPGPU Platform
4 GPGPU Fluids (Water-Jet) Simulation
  4.1 GPU Performance Benchmarking
  4.2 Water-Jet Model
  4.3 Simulation Modification to Fluidsv3
  4.4 Visual Correctness Modification
    4.4.1 Fluidsv3 Original Boundary Handling
    4.4.2 Ihmsen et al. Boundary Handling
  4.5 Parallel Programming Optimization
    4.5.1 Reduce-Malloc
    4.5.2 Graphics-Interoperability
    4.5.3 GPU Shared-Memory
  4.6 Predictive-Corrective Incompressible SPH (PCISPH)
  4.7 Known Limitation of Water-Spray Simulation SPH Method for Fluids Simulation
    4.7.1 Particles' Scaling
    4.7.2 Incompressibility and Simulation Stability
    4.7.3 Initial Speed vs. Time Step
    4.7.4 Obtaining Parameters Value
    4.7.5 Simulation Correctness
5 Discussion
  5.1 Performance Benchmark
  5.2 Unused Optimization Technique
  5.3 Different Simulation Parameter to Simulation Speed
6 Conclusion
  6.1 Accelerated Fluids Simulation
  6.2 GPU Shared-Memory
  6.3 PCISPH
  6.4 Future Works
References
Appendix A: Fluidsv3's Algorithm
Appendix B: Water-Jet Spray's Algorithm


Acronyms

CFD     Computational fluid dynamics
CUDA    Compute Unified Device Architecture
GLEW    OpenGL Extension Wrangler Library
GPGPU   General-Purpose computing on Graphics Processing Units
GPU     Graphics Processor Unit
HPC     High Performance Computing
PCISPH  Predictive-Corrective Incompressible SPH
SIMD    Single Instruction Multiple Data
SM      Streaming Multiprocessors
SPH     Smoothed Particle Hydrodynamics


1. Introduction

Fluids is a term for liquid and gaseous substances, which are common in many industrial applications. At ABB Ltd, optimization of process models related to fluids is frequently done, and hot rolling mill model optimization is one of them.

Hot rolling is a metal forming process that makes a very thin sheet of metal while retaining specific metal properties. Figure 1.1 shows a typical hot rolling mill setting. A heated metal block is processed through several rolling mill steps to reach the appropriate thickness before being cooled down in a high capacity cooler and coiled into a finished product.

Figure 1.1. Hot-Rolling Mill Illustration

One of the many things that could affect the quality of a hot rolling mill's end products is the cooling process. Cooling is the last part of the whole rolling process and is done by the run-out table. A run-out table moves the hot metal sheets under water-spray jets to cool these sheets down to a certain temperature.

It is important to have a better understanding of how the cooling process affects the metal quality in order to obtain end products with the correct properties. One method to achieve this is to use a complete computer simulation of the cooling process with a feedback control system. However, a reasonably fast fluid simulation is needed to be able to incorporate it into a real-time control system. This means we have to select computational methods that are fast and provide good results.

The first step towards such a computer simulation is to have an accelerated fluid simulation of a water-spray jet which still has good enough accuracy. Fluid simulation is typically done using Computational fluid dynamics (CFD) methods, which require a lot of computation time, making them unsuitable for our aims.

One method that is fast and provides good enough results is Smoothed Particle Hydrodynamics (SPH). SPH is commonly used for fluid simulation in the gaming industry, where computation speed is very important.


SPH is an interpolation or estimation method. It provides an approximation of the numerical equations of fluid dynamics by substituting the fluid with a set of particles. Using SPH, the fluid's movement is depicted by moving particles.

All particles' positions need to be computed in each iteration. However, there is no specific order in which to compute the particles' positions: particle A could be computed before particle B, or vice versa, as long as all particles' positions have been computed before the next iteration starts.

This process suits parallel programming implementations quite well, such as General-Purpose computing on Graphics Processing Units (GPGPU), where the particles' positions can be computed in parallel.

A GPU was originally used for graphics manipulation and image processing, especially for gaming visualisation. Technology improvements made it possible to exploit it for general computing.

The computation power behind a GPU comes from hundreds or thousands of parallel computation cores. Individually, each of these cores is much slower and simpler than a CPU core: an Intel i7 has a 4 GHz core clock while an Nvidia GTX980 has a 1.1 GHz core clock. Intel i7 cores are also equipped with many built-in hardware instruction sets that are not available in GPU cores. However, their sheer number makes up for the GPU cores' lower speed and can provide a higher throughput than the CPU's fast cores.

A common analogy is to say that a CPU is like a Ferrari and a GPU is like a city bus. A bus is much slower than a Ferrari, but it can carry a lot of passengers. Of course, the Ferrari will arrive first, but if we wait until the bus arrives and use that amount of time for the Ferrari to go back and forth to carry more passengers, then we see that in that time the bus carries more passengers; in terms of computing performance, the GPU has a higher throughput. A Ferrari has more luxurious features than a bus; likewise, a CPU has more advanced instruction sets than a GPU, but if those instruction sets are not used in the simulation then they are not necessary.

Importantly, SPH implementations using GPU parallel programming have been used to simulate many fluid applications in [1], [2], [3], [4], [5]. However, for our particular needs we want to do a specific implementation of fluid simulation – water-jet sprays using SPH with a GPU parallel programming implementation for faster computation.

1.1 Project Purpose and Goal

This thesis work is a continuation of the master thesis work done by Ann Johansson at ABB for her master thesis at Uppsala University. In [6], Johansson looked at various methods which are generally suitable for video games and chose SPH as a method to simulate spray water jets in Matlab.

The purpose of this master thesis is to serve as a milestone towards a comprehensive computer simulation. The comprehensive simulation will simulate a complete run-out table in a rolling mill, with multiple jets cooling the rolled material, using feedback control running in real time.

The goal of this master thesis is to ascertain the feasibility and drawbacks of using the SPH method with a GPU parallel programming implementation for accelerated fluid simulation that could be used in real-time feedback control simulation.

The project can be divided into three problem statements:

1. Is it feasible to use GPU parallel programming to provide a water-spray jet simulation fast enough to be incorporated into a real-time feedback control simulation? If not, what is the best performance we could achieve? What should be done to get a real-time control simulation?

2. What is the tradeoff between simulation speed and accuracy? What are the physics parameters related to this?

3. What are the benefits/drawbacks of using GPUs for fluid simulations, when compared to a CPU solution?

This report is divided into six sections. Section 2 discusses the basics of SPH and GPU parallel programming. Section 3 covers SPH implementations for fluid simulation. Section 4 is devoted to the GPU implementation of the fluid simulation, its optimization results and limitations. Section 5 discusses the simulation results. Section 6 states the conclusions of the work and suggested future work.

1.2 Scope

The scope of this work assumes the following:

1. Any physical model development and improvement is not part of the project.

2. The physics model is provided by ABB and is not the primary contribution of this thesis.

3. We only look into a single-GPU implementation using the Nvidia GTX 980 gaming graphics card.

4. We only do the implementation on a stand-alone workstation using Visual Studio 2010 and 2013 as the IDE.

5. We use an open source fluid simulation code, Fluidsv3 [7], as our base code. Fluidsv3 is an open source fluid simulator for the CPU and GPU using the Smoothed Particle Hydrodynamics (SPH) method. Fluidsv3 is written and copyrighted by Rama Hoetzlein with an attribution-ZLib license, which grants us the right to use his software for any purpose, including commercial applications, as long as the original author is acknowledged.

2. Background

This section describes the theory and methodology used as a foundation of the project. It describes fluid simulation with Smoothed Particle Hydrodynamics (SPH) and GPGPU (General-Purpose computing on Graphics Processing Units) parallel programming.

2.1 Fluids Simulation with Smoothed-Particle Hydrodynamics

2.1.1 Navier-Stokes Equation

Commonly, fluid simulations are done using the Navier-Stokes equations, which describe the motion of fluids [8].

The Navier-Stokes equations for an incompressible fluid are:

\rho \left( \frac{\partial u}{\partial t} + u \cdot \nabla u \right) = -\nabla p + \mu \nabla^2 u + f,    (2.1)

\nabla \cdot u = 0,    (2.2)

where ρ is the density, u is the velocity, p is the pressure, µ is the viscosity, t is the time and f is the sum of external forces, e.g. gravity. ∇ · u is the divergence of the velocity.

There are different numerical methods to solve the Navier-Stokes equations, which can be divided into two main categories: grid-based Eulerian methods and mesh-free Lagrangian methods. The grid-based Eulerian methods, such as the Finite Volume method, the Finite Element method, and the Finite Difference method, give high accuracy at the expense of more computation time. The mesh-free or Lagrangian methods, such as Smoothed Particle Hydrodynamics (SPH), need less computation time but give lower accuracy, which is hopefully still acceptable.

2.1.2 Smoothed Particle Hydrodynamics (SPH)

In this project the mesh-free SPH method is chosen to compute the fluid simulation. It provides an approximation of the numerical equations of fluid dynamics by substituting the fluid with a set of particles. The method was originally developed by Gingold and Monaghan [9] in 1977 and independently by Lucy [10] in 1977. Originally, it was used to solve astrophysical problems.

Gingold and Monaghan later improved the SPH algorithm to conserve linear and angular momentum, which attests to the similarities between SPH and molecular dynamics [11].

SPH is an interpolation or estimation method. According to it, a scalar physical quantity A of any particle is interpolated at location r by a weighted sum of contributions from all neighbouring particles within a radius h:

A_S(r) = \sum_j m_j \frac{A_j}{\rho_j} W(r - r_j, h),    (2.3)

where:
j is the index over all the neighbouring particles,
m_j is the mass of particle j,
r_j is its position,
\rho_j is the density,
A_j is the scalar physical quantity of fluid particle j, and
W(r - r_j, h) is the smoothing kernel with a radius h.

The gradient of A is computed as follows:

\nabla A_S(r) = \sum_j m_j \frac{A_j}{\rho_j} \nabla W(r - r_j, h),    (2.4)

where \nabla A_S(r) is the gradient of the scalar A at position r. The second derivative of the function A is:

\nabla^2 A_S(r) = \sum_j m_j \frac{A_j}{\rho_j} \nabla^2 W(r - r_j, h),    (2.5)

where \nabla^2 A_S(r) is the Laplacian of the scalar A at position r.

The derivation of equation (2.3) is given thoroughly in [6], and Kelager gives the derivation steps for equations (2.4) and (2.5) in [12]. We use these equations to evaluate several scalar quantities which affect the fluid dynamics, such as the density, pressure and viscosity described in equation (2.1).

An elaborate explanation of the application of SPH to fluid simulation is given in [13]. The usage of particles guarantees mass conservation, so equation (2.2) can be omitted; moreover, since the particles move with the fluid, the convective term u \cdot \nabla u in equation (2.1) can be omitted too. In the end, the mesh-free or Lagrangian form of the Navier-Stokes equation for an incompressible fluid takes the form:

\rho \frac{\partial u}{\partial t} = -\nabla p + \mu \nabla^2 u + f^{\mathrm{external}},    (2.6)

where -\nabla p is the pressure term, \mu \nabla^2 u is the viscosity term and f^{\mathrm{external}} is the external force. The sum of these three force density fields determines the movement of the particles. For each particle i, we get:


F_i = -\nabla p_i + \mu \nabla^2 u_i + f_i^{\mathrm{external}},

F_i = \rho \frac{\partial u}{\partial t},

a_i = \frac{\partial u_i}{\partial t} = \frac{F_i}{\rho_i},    (2.7)

where u_i is the velocity, F_i is the force density field, \rho_i is the density field, and a_i is the acceleration of particle i, respectively.

The pressure term -\nabla p has the following representation:

f_i^{\mathrm{pressure}} = -\nabla p(r_i) = -\sum_j m_j \frac{p_i + p_j}{2\rho_j} \nabla W(r_i - r_j, h),    (2.8)

where f_i^{\mathrm{pressure}} is the pressure force for particle i. In [14] it is suggested to compute p as:

p = k(\rho - \rho_0),    (2.9)

where \rho_0 is the rest density, k is the gas stiffness constant and \rho is derived from equation (2.3) with \rho as the scalar quantity:

\rho(i) = \sum_j m_j \frac{\rho_j}{\rho_j} W(r_i - r_j, h) = \sum_j m_j W(r_i - r_j, h).    (2.10)

The viscosity term \mu \nabla^2 u is:

f_i^{\mathrm{viscosity}} = \mu \nabla^2 u(r_i) = \mu \sum_j m_j \frac{u_j - u_i}{\rho_j} \nabla^2 W(r_i - r_j, h),    (2.11)

where f_i^{\mathrm{viscosity}} is the viscosity force for particle i.

The smoothing kernels W introduced in equations (2.8), (2.10) and (2.11) have the following representations, according to [13].

The smoothing kernel for the density equation (2.10) is

W_{\mathrm{poly6}}(r,h) = \frac{315}{64 \pi h^9} \begin{cases} (h^2 - \|r\|^2)^3 & 0 \le \|r\| \le h \\ 0 & \text{otherwise}, \end{cases}    (2.12)

where h is the smoothing kernel radius and \|r\| is the norm of r.

The smoothing kernel for the pressure equation (2.8) is

W_{\mathrm{spiky}}(r,h) = \frac{15}{\pi h^6} \begin{cases} (h - \|r\|)^3 & 0 \le \|r\| \le h \\ 0 & \text{otherwise}, \end{cases}

\nabla W_{\mathrm{spiky}}(r,h) = -\frac{45}{\pi h^6} \begin{cases} \frac{r}{\|r\|} (h - \|r\|)^2 & 0 \le \|r\| \le h \\ 0 & \text{otherwise}, \end{cases}    (2.13)

for pressure we use the gradient of the W_{\mathrm{spiky}} smoothing kernel.

The smoothing kernel for the viscosity equation (2.11) is

W_{\mathrm{viscosity}}(r,h) = \frac{15}{2 \pi h^3} \begin{cases} -\frac{\|r\|^3}{2h^3} + \frac{\|r\|^2}{h^2} + \frac{h}{2\|r\|} - 1 & 0 \le \|r\| \le h \\ 0 & \text{otherwise}, \end{cases}

\nabla^2 W_{\mathrm{viscosity}}(r,h) = \frac{45}{\pi h^6} \begin{cases} (h - \|r\|) & 0 \le \|r\| \le h \\ 0 & \text{otherwise}, \end{cases}    (2.14)

for viscosity we use the Laplacian of the W_{\mathrm{viscosity}} smoothing kernel.

Simulating fluid movement using the SPH method is an iterative process. Each iteration computes an estimate of every particle's position at the next time step. First, the components of F_i, namely f_i^{pressure} (equation (2.8)), f_i^{viscosity} (equation (2.11)) and f_i^{external} (gravity), are computed. Second, the three components are summed to get F_i. Third, F_i is used to compute a_i using equation (2.7). Last, a_i is used to obtain the particles' positions by solving the ODE using leap-frog integration.

The full procedure of the Navier-Stokes computations using the SPH method for each particle i is stated in Algorithm 1.


Algorithm 1 Navier-Stokes SPH Algorithm

1: Compute density

\rho(i) = \rho(r_i) = \sum_j m_j W_{\mathrm{poly6}}(r_i - r_j, h)    (2.15)

2: Compute pressure from density

p_i = p(r_i) = k(\rho(r_i) - \rho_0).    (2.16)

3: Compute pressure force from the pressure interaction between neighbouring particles

f_i^{\mathrm{pressure}} = -\nabla p(r_i) = -\sum_j m_j \frac{p_i + p_j}{2\rho_j} \nabla W_{\mathrm{spiky}}(r_i - r_j, h).    (2.17)

4: Compute viscosity force between neighbouring particles

f_i^{\mathrm{viscosity}} = \mu \nabla^2 u(r_i) = \mu \sum_j m_j \frac{u_j - u_i}{\rho_j} \nabla^2 W_{\mathrm{viscosity}}(r_i - r_j, h).    (2.18)

5: Sum the pressure force, viscosity force and external force (e.g. gravity)

F_i = f_i^{\mathrm{pressure}} + f_i^{\mathrm{viscosity}} + f_i^{\mathrm{external}}.    (2.19)

6: Compute the acceleration

a_i = \frac{F_i}{\rho_i}.    (2.20)

7: Solve the ODE using leap-frog integration to get u_t, the velocity at time t, for each particle by using the following steps:

(a) The velocity at time t + \frac{1}{2}\Delta t is computed by

u_{t+\frac{1}{2}\Delta t} = u_{t-\frac{1}{2}\Delta t} + \Delta t \, a_t,    (2.21)

(b) then the position at time t + \Delta t is computed by

r_{t+\Delta t} = r_t + \Delta t \, u_{t+\frac{1}{2}\Delta t}.    (2.22)

(c) The velocity at time 0 - \frac{1}{2}\Delta t is computed by

u_{0-\frac{1}{2}\Delta t} = u_0 - \frac{1}{2}\Delta t \, a_0,    (2.23)

(d) and u_t is approximated by the average of u_{t-\frac{1}{2}\Delta t} and u_{t+\frac{1}{2}\Delta t},

u_t \approx \frac{u_{t-\frac{1}{2}\Delta t} + u_{t+\frac{1}{2}\Delta t}}{2},    (2.24)

where \Delta t is the integration time step.
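To make Algorithm 1 concrete, the following is a minimal CUDA sketch of steps 1 and 2 (density and pressure). It is not the Fluidsv3 implementation: for clarity the neighbour search is a brute-force O(N^2) loop instead of Fluidsv3's spatial grid, and the Particle struct, kernel names and parameters are illustrative assumptions only.

// Minimal CUDA sketch of Algorithm 1, steps 1-2 (density and pressure).
#include <cuda_runtime.h>

struct Particle {
    float3 pos;       // particle position r_i
    float  density;   // rho_i, eq. (2.15)
    float  pressure;  // p_i,   eq. (2.16)
};

__device__ float wPoly6(float r2, float h)       // W_poly6 kernel, eq. (2.12)
{
    float h2 = h * h;
    if (r2 > h2) return 0.0f;
    float diff = h2 - r2;
    float h9   = h2 * h2 * h2 * h2 * h;          // h^9
    return (315.0f / (64.0f * 3.14159265f * h9)) * diff * diff * diff;
}

__global__ void computeDensityPressure(Particle* p, int n, float mass,
                                        float h, float k, float restDensity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per particle
    if (i >= n) return;

    float rho = 0.0f;
    for (int j = 0; j < n; ++j) {                    // sum over all particles, eq. (2.15)
        float dx = p[i].pos.x - p[j].pos.x;
        float dy = p[i].pos.y - p[j].pos.y;
        float dz = p[i].pos.z - p[j].pos.z;
        rho += mass * wPoly6(dx * dx + dy * dy + dz * dz, h);
    }
    p[i].density  = rho;
    p[i].pressure = k * (rho - restDensity);         // eq. (2.16)
}

The remaining steps of Algorithm 1 (pressure force, viscosity force and the leap-frog update) follow the same one-thread-per-particle pattern, each typically in its own kernel.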


2.2 GPU Parallel Programming

This section discusses how the GPU became prevalent in parallel programming and the high-level technology behind it.

2.2.1 General-Purpose computing on Graphics Processing Units (GPGPU)

Originally, the GPU was intensively used in the gaming industry due to its high throughput. Each of the GPU's cores processes data for different pixels in parallel. GPUs keep evolving to have more and faster computation cores. This attracts applications in different industries which utilize repetitive and parallel computations.

GPU utilization for non-graphical applications started in 2003 by adopting the already existing high-level graphics APIs and shading languages such as DirectX, OpenGL and Cg [15]. However, this approach has several shortcomings. Firstly, the users need to have extensive knowledge about computer graphics APIs and GPU architecture. Secondly, the computation problem has to be represented in vertex coordinates, textures and shaders. Thirdly, some basic programming features such as random memory reads and writes were not available. Lastly, no GPU at that time could provide double precision floating point, which is important for some applications.

To solve those shortcomings, new programming frameworks such as the Compute Unified Device Architecture (CUDA) from Nvidia and OpenCL from the Khronos Group were developed. Their main goal is to provide more suitable tools to harness the GPU's potential for general applications. These new frameworks make GPU programming a real general purpose programming language.

2.2.2 CUDA

In 2007, Nvidia launched a new GPGPU framework designed for general purpose programming, known as CUDA [16]. CUDA is a framework that gives programmers the possibility to run general purpose parallel programming applications written in C, C++, Fortran, or OpenCL on every Nvidia GPU launched since 2007.

The CUDA toolkit accommodates a comprehensive software development environment in the C or C++ languages, which makes it possible to adopt almost all C or C++ language capabilities.

In CUDA, every parallel program is written in a special function called a kernel. A CUDA kernel is called from the CPU (host) and executed on the GPU (device). Kernel functions are executed by a set of threads in a Single Instruction Multiple Data (SIMD) fashion. SIMD enables a single instruction to be executed by many processing threads at the same time. Each thread handles its own input and output for the computation.

Each kernel execution creates a logical structure of threads in a hierarchical order. It starts with a thread grid, consisting of many thread blocks, while each block consists of many threads. The number of blocks and the number of threads per block need to be defined before kernel execution.

Threads, thread blocks and thread grids, with their corresponding hierarchical memory structure, are presented in Figure 2.1. Each thread has its own local memory, which is the GPU register file, and threads in a thread block can access their block's shared memory. Threads in each grid that execute the same kernel function have access to the application's global memory.

Figure 2.1. Hierarchy of threads, thread blocks and thread grids with their corresponding memory allocation in CUDA [15]

CUDA provides a unique ID for each thread within its block and a unique ID for each block within its grid. Additionally, a unique grid ID is provided when more than one grid is created at the same time. It is possible to access an arbitrary thread within each grid by combining the block's ID and the thread's ID. For example:

index = (blockIdx.x * blockDim.x) + threadIdx.x.    (2.25)


The variable index is used to access a unique thread index in each block withinone grid.
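As a small illustration of this indexing scheme (not taken from Fluidsv3; the array, its size n and the kernel name are hypothetical), each thread computes its global index with equation (2.25) and processes one array element:

// Illustrative only: one thread per array element, indexed with eq. (2.25).
__global__ void scaleArray(float* data, int n, float factor)
{
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;  // eq. (2.25)
    if (index < n)                  // guard: the last block may be partially full
        data[index] *= factor;
}

// Host-side launch: 256 threads per block (a multiple of 32) and enough
// blocks to cover all n elements.
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// scaleArray<<<blocks, threads>>>(d_data, n, 2.0f);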

In each GPU there are many computation cores; Nvidia calls them CUDA cores and groups them into sets called Streaming Multiprocessors (SM). The SIMD execution – a single instruction executed by many processing threads – is done in a subset of an SM called a warp, which consists of 32 CUDA threads. This number comes from Nvidia's existing GPU architecture design.

All threads in a block are placed in its warps. Every thread in a warp has a sequential index, so the first 32 threads, the threads with index 0 to 31, will be in the same warp, the second 32 threads will be in the next warp, and so on.

Every warp is executed asynchronously with respect to the other warps. The threads in a warp are executed in parallel at the same time, so if one thread delays its execution then the whole warp is delayed. For this reason it is good practice to have the threads in a warp access consecutive memory addresses, so that GPU global memory accesses can be reduced by utilizing the coalesced data loaded from it.

It is common practice to use a multiple of 32 as the number of threads in a block. If we assign 16 threads to a block, then each block will still be counted as one warp of 32 threads with 16 threads inactive, which reduces the overall throughput.

Each Nvidia GPU architecture has its own hardware limitations that could affect the choice of the appropriate number of threads in a block. The limitations we usually look into are:

– the maximum number of threads in a block
– the maximum number of active blocks an SM could handle at the same time
– the allocated shared memory in an SM
– the allocated register size in an SM
– the number of CUDA cores in an SM.

Shared memory is a block of memory that has lower latency than the GPU's global memory. The amount is limited per SM and it is usually used to store data which are repeatedly used by the threads in a block, so that the threads do not need to access global memory, which has higher latency.

The amount of shared memory assigned to a block affects the number of active blocks in an SM. For example, if we allocate 128 threads in a block and want to use 256 bytes of shared memory for each block, then the number of active blocks per SM is

active_blocks = total_shared_memory / 256,    (2.26)

because the available shared memory has to be divided among all the active blocks. If the number of active blocks is small, this could result in a small number of active warps that an SM can handle at the same time. The warp scheduler will execute a warp that has all the computation data for every thread in that warp. If one thread in a warp is still waiting for data, then that warp will not be executed and will be scheduled for the next iteration.
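A minimal sketch matching the numbers used in the example above, assuming 128 threads per block and 256 bytes (64 floats) of statically allocated shared memory per block; the kernel itself is illustrative and not part of Fluidsv3:

// Illustrative kernel: 256 B of shared memory per block, launched with 128 threads.
__global__ void sharedMemoryExample(const float* in, float* out, int n)
{
    __shared__ float cache[64];                      // 64 * 4 B = 256 B per block

    int tid    = threadIdx.x;
    int global = blockIdx.x * blockDim.x + tid;

    if (tid < 64 && global < n)
        cache[tid] = in[global];                     // stage frequently reused data once
    __syncthreads();                                 // all threads wait for the staging loads

    if (global < n)
        out[global] = in[global] + cache[tid % 64];  // low-latency reuse from shared memory
}
// Launched as: sharedMemoryExample<<<numBlocks, 128>>>(d_in, d_out, n);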

The GPU's high throughput comes from utilizing all computation cores at the same time. Therefore, we want to have a sufficient number of active blocks, or more precisely warps, in an SM, so that when one warp is not ready there are other warps ready to be executed. In short, to maximize the performance we want to keep the CUDA cores in an SM as "busy" as possible instead of idle.

CUDA cores are the computation "engines". Each executing thread in a warp is processed by a core. Each SM has a limited number of cores, therefore there is a limited number of warps being processed at the same time within an SM. A GPU with more and faster cores gives a higher throughput.

In this project the Nvidia GTX 980 gaming graphics card is used. Compared to the most advanced Nvidia High Performance Computing (HPC) GPU card, the Nvidia Tesla K80, the GTX 980 costs just 10-15% of its price. The Nvidia GTX 980's peak processing power is 4612 Giga Floating Point Operations Per Second (GFLOPS) for single precision and 144 GFLOPS for double precision, while the Nvidia Tesla K80's peak processing power is 8740 GFLOPS for single precision and 2910 GFLOPS for double precision.

Figure 2.2 shows the diagram of one Nvidia GTX980 Streaming Multiprocessor (SM). The Nvidia GTX980 has 16 SMs, each with 128 CUDA cores distributed over 4 warp schedulers, and 96KB of shared memory.


Figure 2.2. The Nvidia GTX980 architecture codename is Maxwell, therefore its Streaming Multiprocessor (SM) is called SMM. SMM block diagram [17]


3. SPH for Fluids Simulation

This section discusses one possible implementation of SPH for fluid simulation using Matlab. The motivation for using a different implementation platform for SPH fluid simulation is given as well.

3.1 Acceleration of MATLAB code using GPUs

One of ABB's in-house implementations of SPH is a modification of a master thesis work on SPH [6]. It is written in Matlab, which uses the CPU's computational power. For this acceleration effort we use an Intel i7 quad-core at 4 GHz with 32 GB of RAM and MATLAB R2015a.

According to the Matlab guidelines [18], there are several methods by which GPU computational power can be used in Matlab in order to reduce the computation time.

The first method is to allocate the variables with a high computation load on the GPU using the gpuArray command, and let Matlab manage the rest. This approach needs 14x the computation time of ABB's in-house implementation, due to the inefficient code structure and the too frequent memory copying between GPU and CPU at each iteration.

The second method is to collect all computationally intensive pieces of the code into a MATLAB function and run this function on the GPU. Unfortunately, there are some limitations on the structure and contents of the code. Firstly, the code should use only Matlab functions that are implemented on the GPU [18]. Secondly, the code should not use indexing operations and all computations should be done element-wise. With these restrictions, there are only a few lines of code that could be run on the GPU, resulting in a result similar to the previous method.

The third method is to call CUDA kernel files directly from Matlab. However, this method does not provide any benefit in terms of effort. In order to fully benefit from the effort of using CUDA kernels, the whole implementation should be in a C/C++ environment. Therefore, we do not attempt this method.

The fourth method is code refactoring. The code branches and loops are replaced by matrix operations where possible.

Table 3.1 shows the result of the code refactoring when computing 500 iterations of the fluid simulation. The speed improvement is significant, but it is not fast enough to be used in real-time feedback control systems. There are three main Matlab functions in the implementation: DoubleDensityRelax, ViscousImpulse, and TemperatureCompute.

Table 3.1 Matlab Refactored Benchmark for 500 iterations

Matlab Function         Original Execution Time (s)   Refactored Code Execution Time (s)   Speed Improvement
DoubleDensityRelax      11800                         800                                  14x
TemperatureCompute      6003                          425                                  14x
ViscousImpulse          6560                          6560                                 1x

As shown in Table 3.1, the code refactoring improves the speed of computing DoubleDensityRelax and TemperatureCompute by 14x. There is no improvement for ViscousImpulse, because its loop involves a sequential process where each iteration uses the previous iteration's data.

To animate the fluids, another Matlab code dealing with the visualization is used. This code adds another 6387 seconds, with no possibility of refactoring because it consists of one for-loop. This contributes to a total duration of 15293 seconds to simulate 500 iterations of the fluid simulation, or an average of 30.5 seconds per iteration. This is obviously longer than the period of a real-time feedback control system, which is less than a second. Therefore, the Matlab implementation is not suitable and a different approach is required.

3.2 GPU Implementation of Fluids Simulation using SPH Method

Section 2.2 describes the motivation to use the GPU's high throughput to solve an extensive computation problem, specifically a problem that has high arithmetic intensity and can be reformulated as parallel computation code. An arithmetic problem that is associative and commutative is one example of an ideal problem for a GPU such as Nvidia's, because the computation can be done in any order, which is not the case for subtraction or division operations.

In SPH, the fluid is approximated using a number of particles. In each iteration every particle's position is computed based on its distance to its neighbouring particles. However, there is no order in which the particles have to be computed: particle A could be computed first or simultaneously with particle B without changing the result, as long as all particles have been computed before the next iteration starts. This means that a multi-core parallel programming framework can be used to compute the particles simultaneously.

Some of the first implementations of SPH on the GPU, using OpenGL and Cg, were done in [19] and [20]. Yan et al. [1] gave an alternative SPH algorithm that uses a non-uniform particle model, which means the particles can split or merge. It incorporates an adaptive surface tension model for better stability.


After Nvidia launched CUDA, fluid simulation using the SPH method on the GPU became more common, with promising speed-up results compared to the CPU. Hérault et al. [2] distributed the SPH implementation into three parts: neighbor-list construction, force computation, and Euler integration. The speed-up is different for each part: force computation had the largest speed-up of 207 times compared to the CPU, while neighbor-list construction was 15.1 times and Euler integration 23.8 times faster. Gao et al. [21] measured the simulation frame rate improvement in their work and managed to get a speed-up of nearly 140 times with a simulation of 102000 particles.

3.2.1 SPH Implementation with Fluidsv3

Based on the literature research we conclude that a GPU implementation is more efficient than a CPU implementation for fluid simulation with SPH, especially in terms of simulation speed. To shorten the learning curve, we decided to use Fluidsv3, an open-source fluid simulation using SPH. The Fluidsv3 [7] source code is provided as-is and it simulates ocean waves using the SPH method. The code is written in C/C++ with the CUDA Toolkit.

We changed the implementation code for our needs and optimized it to reduce the simulation time, which is the main topic of discussion in Section 4.

3.2.2 Alternative SPH Implementation

During the implementation phase, we found several other SPH implementations using GPGPU frameworks that are of interest and could be investigated in the future:

– The Open Worm project [4], an open source project which aims to create a virtual C. elegans nematode in a computer using OpenCL.
– The DualSphysics research group [5], a research group of several universities in Europe. They provide CPU and GPU (CUDA) implementations of SPH for research and applications in fluid simulation.
– Other computer graphics research groups, at ETH Zürich [22] and the University of Freiburg [23], have done research and developed different solutions using SPH methods.

3.3 Other GPGPU Platform

The two biggest GPU producers are Nvidia and AMD. Using the Fluidsv3 code means using the CUDA toolkit, and using the CUDA toolkit means we have to use an Nvidia GPU. Nevertheless, this is not the only way to implement GPU parallel programming.

OpenCL supports more heterogeneous parallel programming hardware (CPU, GPU, FPGA, etc.) as long as it has a multi-core structure. It is an open question whether to choose CUDA, OpenCL or another GPU parallel programming language, and it is not discussed in this work.


4. GPGPU Fluids (Water-Jet) Simulation

This section describes the modifications to the original Fluidsv3 implementation and our efforts to further improve its performance. An overview of the original Fluidsv3 algorithm is presented in Appendix A.

4.1 GPU Performance Benchmarking

To improve the simulation's speed, the GPU's performance during the simulation is analyzed. The results are used to look for parts of the implementation that could be improved to achieve faster simulations.

There are two tools from Nvidia that can be used to analyse the performance of a CUDA program: Nvidia Nsight Performance Analysis and Nvidia Visual Profiler [24], [25]. Both tools offer similar performance analysis capabilities but have some differences. Nsight Performance Analysis is a plug-in tool for Microsoft Visual Studio or Eclipse and can only be executed from either one of the two IDEs. Nvidia Visual Profiler is a stand-alone tool that can be used to analyse an executable file regardless of the IDE used.

Another difference is that Nvidia Visual Profiler offers a "Guided Analysis" mode that helps to automatically find the CUDA kernels of the program to be optimized and suggests how to do it. Nsight Performance Analysis offers a more elaborate benchmark of the application. It analyses not only the CUDA kernels but also the program's interaction with system APIs and visualization APIs such as DirectX and OpenGL.

We use Nsight Performance Analysis because of its elaborate benchmarks and since Nvidia Visual Profiler's Guided Analysis mode is too general for our problem.

Nsight Performance Analysis has two main benchmark tools: Trace and Profiler-CUDA. Trace is used to measure the GPU utilization and OpenGL activity, while Profiler-CUDA is used to get the performance report of the CUDA kernels executed in the application. More information about both tools is found in [24] and [25].

The experiment under consideration is the computation of fluid particles which simulate a water-jet spray. The experiment starts with no particles and is terminated when the number of particles reaches 1 million.

There are three measurements that we include in the report:

1. GPU Utilization: the percentage of the overall benchmark time during which the GPU was utilized. A higher value means the GPU idles less during the simulation. This information is provided by Nsight Performance Analysis - Trace.

2. GFLOPS: a weighted sum of all executed single precision operations per second. A higher value means that the GPU spends more time in computation, which also means that the algorithm uses the GPU computation power more efficiently. The Nvidia GTX 980 has a maximum computation power of 4612 GFLOPS for single precision operations. This information is provided by Nsight Performance Analysis - Profiler-CUDA.

3. Iteration Time: the time consumed to compute one simulation iteration, in milliseconds. A lower value means more computation iterations per second, which also means a faster simulation. A low iteration time is important for our optimization method. This information is provided by Fluidsv3's internal function.

4.2 Water-Jet Model

This work is about a specific case of accelerated fluid simulation: the water-jet spray. This model is chosen because of its use in simulating the coolant agent in hot rolling mills, which is one of ABB's areas of interest.

In the real world, water-jet sprays have many applications, such as fountains, shower sprays, cleaning sprays, fire extinguishers, etc. For hot rolling mill cooling, pure water with adjustable velocity and volume is sprayed from many inlets onto the hot metal surface in order to cool it down. To simulate this, several parameters are introduced: the initial speed to adjust the velocity, the inlet radius to adjust the volume, and the number of inlets to decide how many inlets to include in the simulation.

In order to simulate the flow of a water-jet spray, a new layer of particles is added at the initial position in each time step. The shape and number of particles used as a new layer are constant, based on the simulation configuration. The layers gradually build up and simulate the flow of water as computed by the SPH method.
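A possible host-side sketch of such an inlet emitter is shown below. It is an illustration under our own assumptions (the particle type, function names and disc-shaped inlet are hypothetical), not the actual Fluidsv3 emitter code: at every time step a disc of particles with the configured inlet radius and initial speed is appended at the inlet position.

// Illustrative emitter: append one layer of particles per time step.
#include <vector>

struct EmittedParticle { float x, y, z; float vx, vy, vz; };

void emitLayer(std::vector<EmittedParticle>& particles,
               float inletX, float inletY, float inletZ,
               float inletRadius, float spacing, float initialSpeed)
{
    // Fill a disc of radius inletRadius with particles spaced 'spacing' apart.
    for (float dx = -inletRadius; dx <= inletRadius; dx += spacing) {
        for (float dz = -inletRadius; dz <= inletRadius; dz += spacing) {
            if (dx * dx + dz * dz > inletRadius * inletRadius) continue;
            EmittedParticle p;
            p.x  = inletX + dx;  p.y  = inletY;          p.z  = inletZ + dz;
            p.vx = 0.0f;         p.vy = -initialSpeed;   p.vz = 0.0f;   // sprayed downwards
            particles.push_back(p);
        }
    }
}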

Although in the real world the water could be sprayed in the opposite direction, against gravity, this work only simulates water sprayed in the direction of gravity.

4.3 Simulation Modification to Fluidsv3

Figure 4.1 shows Fluidsv3's original application: a continuous ocean-wave simulation.

The simulation model under consideration is the water-jet spray, and in order to obtain that visualization, the following modifications are required:


Figure 4.1. Fluidsv3's continuous ocean-wave simulation

• Particle initialization: the number of particles and the particles' positions. Fluidsv3 initializes all fluid particles at the beginning of the simulation, while we add particles layer by layer at every time step to simulate sprayed fluid. Figure 4.2 shows zoomed-in layers of particles.

Figure 4.2. Three layers of initialized particles

• Simulation boundary. Fluidsv3 simulates a continuous ocean wave in a simulation space shaped like a box. The water-jet spray model under consideration has a single solid boundary at the bottom.

• Reusing deleted particles' memory. If particles bounce outside the simulation area, they are deleted and the memory space they used is cleared so that it can be used for new particles.

• Changing the particles' initial speed. Our fluid simulation is a water-jet spray which injects fluid from top to bottom along the Y-axis. The initialized particles are given an initial speed.

In summary, Fluidsv3's simulation parameters are modified according to Table 6.3 in the Appendix.


4.4 Visual Correctness Modification

Figure 4.3 shows the result of the modification: four inlets of water-spray jets hitting a solid surface simultaneously.

Figure 4.3. The upper view of the original boundary handling method

The first step is to verify the correctness of the model from a physical standpoint. Figure 4.4 shows the bottom side of the simulation model, where the fluid is penetrating the boundary. This is not physically correct, since particles are passing through the surface, which is the steel strip (not shown).

Figure 4.4. Hidden boundary particles' diameter set to 0.00015 meter

31

Page 32: Accelerating Fluids Simulation Using SPH and ...uu.diva-portal.org/smash/get/diva2:885188/FULLTEXT01.pdf · Accelerating Fluids Simulation Using SPH and Implementation on ... work

4.4.1 Fluidsv3 Original Boundary Handling

Fluidsv3's original boundary handling uses hidden boundary particles. The hidden particles' diameter acts as "a damping area" that brakes the fluid's movement. Fluid particles receive an opposing force whenever they enter this area. In other words, particles are repelled when they penetrate the hidden boundary particles. The deeper the penetration, the stronger the opposing force.
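The following is a generic penalty-style boundary force of the kind described above, written as a CUDA device function. It is a simplified sketch, not the exact Fluidsv3 code; the stiffness and damping constants are assumed tuning parameters.

// Generic penalty boundary force: the deeper a particle is inside the damping
// area around the boundary, the stronger the opposing (repelling) force.
__device__ float3 boundaryForce(float3 pos, float3 vel,
                                float3 boundaryPoint, float3 normal,
                                float dampingRadius,
                                float stiffness, float damping)
{
    // Signed distance from the particle to the boundary plane along its normal.
    float dist = (pos.x - boundaryPoint.x) * normal.x +
                 (pos.y - boundaryPoint.y) * normal.y +
                 (pos.z - boundaryPoint.z) * normal.z;

    float penetration = dampingRadius - dist;          // > 0 inside the damping area
    if (penetration <= 0.0f) return make_float3(0.0f, 0.0f, 0.0f);

    float vn = vel.x * normal.x + vel.y * normal.y + vel.z * normal.z;
    float magnitude = stiffness * penetration - damping * vn;   // repel and brake
    return make_float3(magnitude * normal.x,
                       magnitude * normal.y,
                       magnitude * normal.z);
}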

Equation (2.22) shows that the particles' position at time step t + ∆t depends on their velocity at time step t + ½∆t, which is highly affected by the initial velocity u_0 in equation (2.23). As the iterations go on, the value of u_0 accumulates in u_{t+½∆t} and hence affects how far the particles move in each iteration, which in the end decides whether the particles penetrate the solid surface boundary or not.

An experiment using a u_0 of 20 m/s and a hidden particles' diameter of 0.03 x 0.005 (simscale value) meter results in some particles' positions being computed almost beyond the damping area; thus the opposing force is not strong enough and particles leak through the surface. Figure 4.4 shows fluid passing the solid surface (not drawn).

A simple fix for this case is to use a bigger hidden particles' diameter. Using a hidden particles' diameter of 0.06 (0.0003 meter in reality) prevents the particles from penetrating the surface. Figure 4.5 shows that the particles do not penetrate the solid boundary when we use the bigger hidden particles' diameter. The side penetration occurs because those fluid particles are beyond the boundary area.

A bigger hidden particles' diameter provides a bigger "braking area", so that the particles have more "space" to reduce their velocity and thus there is no boundary penetration.

Figure 4.5. Hidden boundary particles' diameter set to 0.0003 meter

4.4.2 Ihmsen et al. Boundary Handling

The solution depends on many parameters, such as the initialization speed or the gas stiffness constant. For example, if we want to use an initial speed of 60 m/s, which might be too big for a water-jet, an even bigger hidden particles' diameter is required to prevent particles from penetrating the boundary. Clearly, a better boundary handling mechanism can improve the simulation.

After some literature review, we decided to try a new boundary handling method proposed by Ihmsen et al. [26]. This method was originally used with PCISPH [27] but can also be used for SPH.

Figure 4.6 shows that this method gives a fluid splashing or repelling effect, which usually exists when a high speed fluid hits a solid surface, compared to Figure 4.3, which does not visualize any repelling effect. Figure 4.7 shows that the particles do not penetrate the boundary even though a hidden particles' diameter of 0.03 (0.00015 meter in reality) and an initial speed of 60 m/s are used.

Figure 4.6. The upper view of the Ihmsen et al. boundary handling

Figure 4.7. The side view of the Ihmsen et al. boundary handling

Observing the visualization results and the ability to use a wider range of initial speeds or gas stiffness constants, we conclude that the Ihmsen et al. boundary method is better than Fluidsv3's original method in terms of the physical correctness of the simulation. Thus, this method is used for further code optimization.

4.5 Parallel Programming Optimization

We use the Nsight Performance Analysis tool mentioned in Section 4.1 to gauge the GPU's performance during our simulation, in order to make the simulation run faster.

The water-jet simulation has an iteration time of 203.54 msec and a GPU utilization of 36%, which means that most of the GPU time is spent idling, waiting for its turn.

In order to improve the GPU utilization and reduce the computation time for each iteration, the following methods were implemented one by one.

4.5.1 Reduce-Malloc

Nsight Performance Analysis shows that most of the GPU time is spent waiting for cudaMalloc and cudaMemcpy to finish their operations, making these operations the bottleneck.

The Nvidia GTX 980 has 4 GB of RAM which, based on Fluidsv3's data structure, is big enough to store the data of 22 million particles. By allocating GPU memory for 22 million particles at the beginning of the simulation, cudaMalloc only needs to be called once. The cudaMemcpy operations are also called only once, to copy the initialization particle data from the host (CPU) to the device (GPU).
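A minimal sketch of this idea is shown below, assuming a simplified set of per-particle arrays; the buffer names are illustrative and Fluidsv3's actual data structure contains more fields.

    // Sketch: allocate device buffers once, for the maximum particle count,
    // instead of re-allocating in every iteration. Names are illustrative only.
    #include <cuda_runtime.h>
    #include <cstdio>

    struct DeviceBuffers {
        float3 *pos;    // particle positions
        float3 *vel;    // particle velocities
        float  *press;  // particle pressures
        float  *dens;   // particle densities
    };

    bool allocateOnce(DeviceBuffers &buf, size_t maxParticles) {
        // One cudaMalloc per array, done a single time at start-up.
        if (cudaMalloc(&buf.pos,   maxParticles * sizeof(float3)) != cudaSuccess) return false;
        if (cudaMalloc(&buf.vel,   maxParticles * sizeof(float3)) != cudaSuccess) return false;
        if (cudaMalloc(&buf.press, maxParticles * sizeof(float))  != cudaSuccess) return false;
        if (cudaMalloc(&buf.dens,  maxParticles * sizeof(float))  != cudaSuccess) return false;
        return true;
    }

    int main() {
        DeviceBuffers buf;
        const size_t maxParticles = 22000000;   // upper bound chosen from the available GPU memory
        if (!allocateOnce(buf, maxParticles)) { printf("allocation failed\n"); return 1; }
        // ... copy the initial particle layer once with cudaMemcpy, then run all
        //     iterations reusing the same buffers, with no further allocations ...
        cudaFree(buf.pos); cudaFree(buf.vel); cudaFree(buf.press); cudaFree(buf.dens);
        return 0;
    }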

This results in an improvement of the GPU utilization to 75.6 % and a reduction of the iteration time to 95.18 msec.

4.5.2 Graphics-Interoperability

Fluidsv3 uses OpenGL to render the animation, and during this process it needs to copy the data from computation memory to OpenGL memory.

In [28], graphics interoperability means that the same GPU memory buffer is used to store data for rendering and for computation results. This method removes the need to copy data from computation memory to rendering memory.
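The sketch below illustrates the mechanism with the CUDA graphics-interoperability API; an OpenGL context, GLEW initialisation and an existing vertex buffer object are assumed, and the function names around the API calls are illustrative.

    // Sketch: map an OpenGL vertex buffer into CUDA so that kernels write particle
    // positions directly into the memory that the renderer draws from.
    #include <GL/glew.h>
    #include <cuda_gl_interop.h>

    cudaGraphicsResource *posResource = nullptr;

    void registerPositionVBO(GLuint vbo) {
        // Register the OpenGL buffer with CUDA once, after the VBO has been created.
        cudaGraphicsGLRegisterBuffer(&posResource, vbo, cudaGraphicsMapFlagsNone);
    }

    float3 *mapPositionsForCompute() {
        // Map the buffer before launching kernels; CUDA receives a raw device pointer.
        float3 *dPos = nullptr;
        size_t numBytes = 0;
        cudaGraphicsMapResources(1, &posResource, 0);
        cudaGraphicsResourceGetMappedPointer((void **)&dPos, &numBytes, posResource);
        return dPos;   // kernels write positions here; no extra device-to-device copy is needed
    }

    void unmapPositionsForRendering() {
        // Unmap after the kernels finish so that OpenGL can use the buffer for drawing.
        cudaGraphicsUnmapResources(1, &posResource, 0);
    }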

Using this approach, the GPU utilization is improved to 87.2 % and the iteration time is reduced to 78.73 msec.

4.5.3 GPU Shared-Memory

Nsight Performance Analysis also shows that there are two CUDA kernel functions that consume most of the GPU computation resources, computePressure (19 %) and computeForce (64 %). Improving the computation speed of these functions should improve the total simulation speed. For the improvement and benchmark try-out we choose the kernel function computePressure because it has a simpler code structure than computeForce.

Figure 4.8 shows the warp issue efficiency of the computePressure kernel function, which is the GPU's ability to issue instructions during the kernel's execution cycles [24]. Instructions are executed in a warp, which is a set of threads. A higher percentage means that more instructions are executed, but computePressure has a warp issue efficiency of only 54.03 %. This means that 45.97 % of all computation cycles do not issue any instruction.

Figure 4.9 shows that memory dependency, which is the condition where a memory store or load cannot be done because resources are either not available or fully utilized, is the cause of 50.12 % of the kernel's inability to run instructions. Here, the question is which operations are causing this memory dependency problem.

Figure 4.8. Warps Issue Efficiency for computePressure CUDA kernel

Of the variety of data that the computePressure kernel needs to retrieve from the GPU's global memory, the particles' position data are requested most frequently. The kernel needs to compute every particle's distance to its neighbouring particles. The more neighbours a particle has, the more data accesses the kernel performs for that particle.

All particles used in the SPH method are put into a grid of equally sized cubes, as mentioned in Appendix A. Figure 4.10 restates figure 6.2 in the Appendix, which depicts a 3x3x3 grid structure where each grid cell is represented by a coloured circle. Each grid cell contains a different number of particles.


Figure 4.9. Issue Stall Reasons for computePressure CUDA kernel

Figure 4.10. Grid Neighbours. The yellow circle represents the centre grid and the rest are the neighbouring grids. The dotted line is for visualization purposes only and does not represent any information.


All particles in the centre grid, the yellow circle, have the same neighbouring particles located in the neighbouring grids. Therefore, all particles in the centre grid request the same neighbouring particles' positions, which results in redundant data requests to the GPU's global memory.

If we can store these often requested data in a "closer" place than the GPU's global memory, then we can provide the data to the CUDA kernel "faster". As a result, this improves the warp efficiency and allows more computations per cycle [29]. This "closer" memory place is called GPU shared-memory.

All Nvidia GPUs that support CUDA have shared memory. It is a memory that resides on the GPU chip itself. It has roughly 100 times lower latency than uncached GPU global memory, which resides outside the GPU chip, provided that there is no bank conflict between the threads [30].

Goswami et al. [31] used shared memory for their SPH fluids simulation on an Nvidia GPU. Although the paper did not compare the results of using versus not using shared memory, it was decided to use shared memory in the SPH computations. The goal is to investigate whether any further computational improvements are possible, since our implementation differs from Goswami's.

In order to reduce the data traffic to the global memory, all the neighbouring particles' data have to be placed in the GPU shared-memory so that the centre grid's particles can access them there. However, this can only be done by introducing extra lines of code to copy the required data into the shared-memory. Additionally, it is important to keep track of the memory offsets in such a way that each centre grid particle accesses the correct shared-memory offset. Accessing the wrong offset results in an unstable or wrong simulation, or even run-time errors.
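A condensed sketch of the idea is given below for a density-style kernel: one thread block handles one grid cell and the neighbouring particles' positions are staged through shared memory in tiles. The helper arrays (cellStart, cellCount, neighbourCells) and the assumption that a cell never holds more particles than the block size are simplifications; Fluidsv3's real data structures differ.

    // Sketch: cache neighbouring particle positions in shared memory so that all
    // threads of a block (one block per grid cell) reuse the same loaded tile.
    // Assumes blockDim.x == TILE and cellCount[cell] <= TILE (simplifications).
    #define TILE 256

    __global__ void computeDensityShared(const float3 *pos, float *density,
                                         const int *cellStart, const int *cellCount,
                                         const int *neighbourCells,   // 27 cell ids per cell
                                         float h2, float poly6Coeff, float mass)
    {
        __shared__ float3 tile[TILE];

        int cellId  = blockIdx.x;                      // one thread block per grid cell
        int count   = cellCount[cellId];
        int i       = cellStart[cellId] + threadIdx.x; // particle handled by this thread
        bool active = threadIdx.x < count;
        float3 pi   = active ? pos[i] : make_float3(0.f, 0.f, 0.f);
        float  sum  = 0.f;

        for (int n = 0; n < 27; ++n) {                 // 3x3x3 neighbouring cells
            int nc = neighbourCells[cellId * 27 + n];
            if (nc < 0) continue;
            int nStart = cellStart[nc], nCount = cellCount[nc];
            for (int base = 0; base < nCount; base += TILE) {
                if (base + (int)threadIdx.x < nCount)
                    tile[threadIdx.x] = pos[nStart + base + threadIdx.x];  // cooperative load
                __syncthreads();
                int loaded = min(TILE, nCount - base);
                if (active)
                    for (int k = 0; k < loaded; ++k) {                     // reuse the cached tile
                        float dx = pi.x - tile[k].x, dy = pi.y - tile[k].y, dz = pi.z - tile[k].z;
                        float r2 = dx*dx + dy*dy + dz*dz;
                        if (r2 < h2) { float d = h2 - r2; sum += d*d*d; }
                    }
                __syncthreads();
            }
        }
        if (active) density[i] = mass * poly6Coeff * sum;
    }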

Our project uses an Nvidia GTX980, which has 96 KB of shared memory per streaming multiprocessor (SM), where each SM can handle 32 thread blocks simultaneously. A particle's position is a vector of the float3 data type, which means that each stored position requires 12 bytes of memory. With this limited memory space we need extra code to repeatedly re-stage the necessary particle data.

The benchmark shows that the GPU utilization goes up to 95 %, but the iteration time increases to 219.2 msec. The number of operations of the computePressure kernel function drops from 747 GFLOPS to 85 GFLOPS.

This is caused by the extra code, which is not parallelized and introduces a lot of sequential loops. The code makes the GPU busier, hence the increase of the GPU utilization to 95 %. Nevertheless, since the code is not executed in parallel, the number of operations drops to 85 GFLOPS.


4.6 Predictive-Corrective Incompressible SPH (PCISPH)

In [7], Fluidsv3's author suggested PCISPH, a different SPH method, to improve the fluids simulation's speed. It is an SPH method that enforces fluid incompressibility [27]. The PCISPH method predicts density fluctuations which are then corrected using pressure forces. These prediction and correction steps are repeated several times to obtain the lowest density fluctuations in order to enforce fluid incompressibility. The algorithm is described in Algorithm 7 in the Appendix.

PCISPH's authors stated the possibility of using a time-step 35x bigger than for SPH while retaining the simulation's visualization. This is an interesting feature since the time-step determines how much time the simulation advances in one iteration. A bigger time-step means that in each computation iteration the simulation predicts a longer duration of how the fluids would behave in the real world. For example, 10 iterations with a time-step of 0.001 second equal 0.01 second of simulated real-world time, but with a bigger time-step of 0.01 second, 10 iterations equal 0.1 second of simulated real-world time.

However, we did not succeed in implementing PCISPH with a bigger time-step, and because of time limitations we did not investigate this further. Instead, we obtain a different fluid speed visualization when using PCISPH.

Table 6.4 in the Appendix shows the simulation configuration with PCISPH, which is the same as SPH's configuration except for the initial speed value. Furthermore, instead of a gas-stiffness constant, PCISPH uses the pcisphDelta parameter.

Figure 4.11 shows the visualization effect of PCISPH with 0.5 million particles. The simulation shows fluids that progress at a faster speed compared to the original SPH in Figure 4.12.

Figure 4.11. PCISPH Simulation when it reaches 0.5 million particles

Appendix B explains the water-jet spray algorithm using SPH and PCISPH as a whole.


Figure 4.12. SPH Simulation when it reaches 0.5 million particles

4.7 Known Limitations of the Water-Spray Simulation Using the SPH Method

There are several known limitations in the current water-jet spray implementation. This section describes them and how they affect the simulation.

4.7.1 Particles' Scaling

Fluidsv3's original simulation uses the SimScale parameter to scale up the distance between particles for better visualization. The problem here is that changing the parameter does not rescale the visualized particle size. This can result in a wrong visualization, such as a fluid that appears to consist of very dense particles even though it does not.

4.7.2 Incompressibility and Simulation Stability

Equation (2.9), which is used to compute p, is restated as

p = k(ρ −ρ0). (4.1)

In equation (4.1), p is the pressure of a particle and ρ0 is the fluid's expected rest density during the simulation. The water is modeled with the Navier-Stokes equations as an artificially incompressible fluid, meaning that the fluid should have the same density in every part of it and at every time step. The particles move at each time step and change the fluid density. Therefore, the equation above is used to enforce the incompressibility: when ρ < ρ0 the particles experience an attraction force, and if ρ > ρ0 the particles experience a repelling force.

Here, the question is how big k should be. A bigger k gives a stronger attraction-repelling force and makes the fluid's ρ approach ρ0 faster, so that particles are not congested in the same place for too many time steps. Fewer congested particles within the same smooth-radius area mean fewer neighbouring particles and less computation per iteration, and vice versa.

However, if too big a k is used then a much smaller time step is required, otherwise the simulation will be unstable. A bigger k also gives a bigger force, which results in a bigger particle movement in one time step. If a particle moves further than the smoothing kernel radius in one time step then the attraction force is not computed, and is thus lost. If many particles experience this, the fluid particles will look as if they are blown apart.

4.7.3 Initial Speed vs. Time Step

Since we simulate a water-jet spray, the fluid's initial speed is chosen at our discretion.

In the simulation, the initial speed value is accumulated in the particles' velocity, which determines how far a layer of fluid particles moves in one time step. This makes the initial speed very influential for the particles' movement.

In each time step a new layer of fluid particles is added at the same simulation coordinates. It is important to have an initial speed that fits the time step. Using an initial speed and a time step which are too small results in bloating particle fluids instead of a flowing fluid. Figure 4.13 shows such a case.

On the other hand, using an initial speed and time step which are too big results in a big gap between the layers of particles. This is also not correct, since there is then no attraction force between the layers of particles. Figure 4.14 shows such a case.

Therefore, it is important to have an initial speed that fits the time step in order to produce a stable and working simulation. However, such a combination of initial speed and time step is obtained experimentally, which is cumbersome, not ideal and affects the usefulness of the simulation.
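A rough rule of thumb, which we state here without validation, is to require that a newly emitted layer travels in the order of one particle spacing d per time step, i.e. ∆t ≈ d / u0. With, for example, d ≈ 2 · 0.001 m (two particle radii) and u0 = 20 m/s this gives ∆t ≈ 0.0001 s, which is of the same order as the time step used in our configuration. Such an estimate can only serve as a starting point for the experiments.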

4.7.4 Obtaining Parameter Values

The simulation uses many adjustable parameters which determine the simulation's result and stability. Parameters such as the smoothing kernel radius, the viscosity, the particle radius, and PCISPH's delta obtained their values by trial and error.

The boundary handling method based on [26] prevents any particles from penetrating the solid surface for the current simulation setting. This does not mean that it will work by default for any arbitrary initial speed. The effectiveness of the boundary handling method also depends on the diameter of the solid boundary particles and on the gas stiffness constant.


Figure 4.13. Bloating fluid particles

Figure 4.14. Too big gaps between layers of particles

4.7.5 Simulation Correctness

SPH has been used in many fluids simulations, and its accuracy has been tested in different test cases, for example in [39] for the case studies creation of waves by landslides, dam-break propagation over wet beds and wave-structure interaction, or in [37] for a hydraulic jump test case.

However, no measurement of the water-jet simulation accuracy is implemented in this work and no proper physical validation is done. Instead, the correctness of our implementation is subjectively assessed by visual analysis.


5. Discussion

During the implementation, different methods related to the SPH algorithm and to parallel programming optimization are used. The methods related to the SPH algorithm are Ihmsen boundary handling and PCISPH. The methods related to GPU parallel programming are reduce-malloc, graphics-interoperability and GPU shared-memory. This section discusses the comparative results.

5.1 Performance Benchmark

There are six simulation configurations that explain the motivation of using the related SPH and parallel programming methods:

1. Fluidsv3 original boundary handler
2. Ihmsen et al. boundary handler
3. Config. 2 and Reduce-Malloc
4. Config. 3 and Graphics-Interoperability
5. Config. 4 and GPU Shared-Memory
6. Config. 3 and PCISPH algorithm

Table 5.1 shows the comparative benchmarks: GPU utilization, last iteration's execution time, and GFLOPS, which is the number of single precision floating point operations per second – our GPU's theoretical peak is 4612 GFLOPS.

Table 5.1 Optimization Benchmark

Configuration                   GPU Utilization (%)   Last Iteration's Execution Time (msec)   GFLOPS at One Million Particles
1. Original-boundary-handler    38.1                  212.64                                   745
2. Ihmsen-boundary-handler      36                    203.54                                   720
3. Reduce-Malloc                75.6                  95.18                                    728
4. GPU Interoperability         87.2                  78.73                                    747
5. GPU Shared-Memory            95                    219.2                                    85
6. PCISPH                       89                    153.4                                    826

The iteration's execution time is the main figure of merit of our benchmark, and we want to make it as small as possible while retaining the simulation's stability and physical correctness. The execution time depends on the GPU utilization, the GPU's GFLOPS, and the algorithm's complexity.


The GPU utilization increases from config 1 to 4 while at the same time the iteration time is reduced. However, although config 5 has an even higher GPU utilization, its iteration time is the worst.

The reason for this is that the GPU's GFLOPS is reduced significantly. Configs 1 to 4 have similar GFLOPS, but config 5 has a significantly lower value. This is because the extra code used to implement GPU shared-memory outweighs the benefit of using shared-memory, thereby reducing the overall parallel programming throughput.

Config 6 shows PCISPH's performance, which has higher GPU utilization and GFLOPS than config 4 but needs more time to compute an iteration (lower speed). This is because the PCISPH method has higher algorithmic complexity and demands more computation per iteration, as seen in Algorithm 9 in the Appendix.

Based on the last iteration's execution time we decide that Config 4 is the most appropriate for our problem. Although Config 6 offers a different fluid speed as seen in Figure 4.11, we subjectively judge that Config 4 has the better visualization with its "splash" effect as seen in Figure 4.7.

5.2 Unused Optimization Techniques

Using the same GPU for computation and rendering results in better performance compared to using two separate GPUs. Two separate GPUs, one with lower performance for display rendering and another with higher performance for computation, make the overall simulation 2x slower. This is because the Peripheral Component Interconnect Express (PCIe) bandwidth used to transfer data between two different GPUs is lower than that of the GPU-to-display-port path.

There are several common parallel programming best practices that are not used because of the nature of our implementation. One of them is CUDA streams. A CUDA stream is a sequence of operations that are executed on the device in a specific order [28]. CUDA streams could improve parallel performance by executing multiple streams at the same time in order to overlap data transfers between the host (CPU's memory) and the device (GPU's memory).
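To illustrate the technique, even though it is not used in our implementation, the sketch below splits an array into chunks and processes them in two streams so that the copy of one chunk overlaps with the kernel working on the other. The host array is assumed to be allocated with cudaMallocHost (pinned memory); the kernel and names are illustrative.

    // Sketch of overlapping transfers and computation with CUDA streams.
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n, float f) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= f;
    }

    void processInStreams(float *hData, int n) {        // hData must be pinned host memory
        const int chunks = 2, chunkSize = n / chunks;
        cudaStream_t stream[chunks];
        float *dData;
        cudaMalloc(&dData, n * sizeof(float));
        for (int c = 0; c < chunks; ++c) cudaStreamCreate(&stream[c]);

        for (int c = 0; c < chunks; ++c) {
            int off = c * chunkSize;
            cudaMemcpyAsync(dData + off, hData + off, chunkSize * sizeof(float),
                            cudaMemcpyHostToDevice, stream[c]);          // copy chunk c in
            scale<<<(chunkSize + 255) / 256, 256, 0, stream[c]>>>(dData + off, chunkSize, 2.0f);
            cudaMemcpyAsync(hData + off, dData + off, chunkSize * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[c]);          // copy result back
        }
        for (int c = 0; c < chunks; ++c) { cudaStreamSynchronize(stream[c]); cudaStreamDestroy(stream[c]); }
        cudaFree(dData);
    }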

Our implementation has one big portion of computation – the CUDA kernel functions, which have to be executed in consecutive order. It does not have the multiple memory copies between device and host that usually exist in a CUDA streaming process.

Some operations, such as dot products and atomic additions, are not optimized because they are executed infrequently in each iteration.

Other techniques, such as CUDA loop unrolling or using a bigger or smaller number of threads per block, do not improve the performance.


5.3 Effect of Different Simulation Parameters on Simulation Speed

The SPH implementation uses many parameters that can be changed and that affect the simulation's speed, as mentioned in Table 6.3.

Some parameters can be changed to reduce the computation workload, such as using a smaller smoothing radius, a bigger gas stiffness constant or a bigger viscosity constant. These configurations reduce the number of neighbouring particles involved in computing the force in each iteration, thus improving speed. However, whether this results in a correct simulation is debatable, since the effect of changing one parameter is not linear.

On the other hand, using a bigger inlet radius, increasing the number of inlets, decreasing the time step, decreasing the particle radius or using a bigger simulation space increases the number of particles used, hence more computation workload for each iteration and a slower simulation. Nevertheless, this does not necessarily result in stable simulations.


6. Conclusion

6.1 Accelerated Fluids Simulation

A GPU has many computation cores. Each core has low throughput, but their combination gives a high computation throughput.

Therefore, getting the most out of the GPU's computation ability by reducing unnecessary processing or idle time is very important to improve the simulation's speed on a GPU. We achieve this by reducing the cudaMalloc and cudaMemcpy calls and by eliminating the copying of rendering data.

Table 6.1 shows the comparison of our CUDA work with Johansson's Matlab model, which we use to address the problem statements mentioned in Section 1.1:

Table 6.1 Simulation's Result Comparison

Description                                             Johansson    Our Work
Time Step (s)                                           0.006        0.0001    0.0005    0.001
Viscosity (Pa.s)                                        3.5          0.35
RestDensity (kg/m3)                                     998.29       998.29
ParticleRadius (m)                                      0.01         0.001
InletRadius (m)                                         0.15         0.02
SmoothRadius (m)                                        1.3          0.005
GasStiffness (m2/s2)                                    3            20        2         2
InitialVelocity (m/s)                                   5            20        20        5
HeightAboveSurface (m)                                  2.5          0.175
Processor                                               Intel i7     Nvidia GTX980
Speed (Ghz)                                             2.4          1.1
# of Cores                                              4            2048
Memory (GB)                                             12           4
Total Particles                                         40000        1000000
Total time (s)                                          7089         16.63     14.77     9.36
Average execution time of last ten iterations (msec)    -            75.22     61.75     45.9
Real World Time (s)                                     3            0.0384    0.192     0.384
# of Iterations                                         500          384
Particle Increment                                      276          2608


1. Is it feasible to use GPU parallel programming for real-time feedback control simulation?
Compared to Johansson's Matlab work, our work using GPU parallel programming provides faster computation. Using a time-step of 0.0001 second, we can compute a fluids simulation with up to 1000000 particles in 16.63 seconds, while Johansson's work using 40000 particles needed 7089 seconds to compute. The average execution time of the last ten iterations is 75.22 msec. This means that it is feasible to incorporate this implementation into a real-time feedback control simulation, provided that the control system has a cycle period of 100 msec.

2. What is the trade-off between simulation speed and accuracy?
According to Table 6.1, to maintain simulation stability we need to change several parameters when changing the time-step. We see that using a bigger time-step reduces both the total simulation time and the average execution time of the last ten iterations. Using a bigger time-step also increases the duration of real-world time, where real-world time = time-step * number of iterations. Since SPH is an estimation method, by using a smaller time-step we are estimating the fluid movement over a smaller time frame. However, since this work does not implement any scientific method to measure correctness, it is not clear whether using a smaller time frame results in higher accuracy.

3. What are the benefits/drawbacks of using GPUs for fluid simulations, when compared to a CPU solution?
Because of the GPU's high throughput, we get a faster simulation compared to the Matlab-CPU implementation, provided we use a good parallel algorithm. However, our implementation of GPU shared-memory forces us to use sequential code, which reduces the GPU's throughput. Therefore, not every kind of implementation is suitable for parallel programming. Another drawback of using a gaming GPU is the limitation to single-precision floating point computation, while a CPU can compute in double precision. Using an HPC GPU could solve this, although with a significant increase in equipment cost.

6.2 GPU Shared-Memory

The current implementation's data structure permits the GPU's global memory, a Dynamic Random Access Memory (DRAM) module, to benefit from the memory coalescing technique. A single access to global memory retrieves the required data plus several bytes at the following offsets; how many bytes depends on the hardware specification. If the next set of required data is actually stored at these following offsets, the data have already been fetched from the global memory and the process can read them from the level 1 or level 2 cache instead of from the global memory. This reduces the need to request data from the global memory.
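The difference between a well coalesced and a poorly coalesced access pattern can be illustrated with the following sketch; the data layouts are illustrative and not Fluidsv3's exact ones.

    // Sketch: consecutive threads reading consecutive floats (structure of arrays)
    // are served by few memory transactions, while a large stride between the
    // elements read by neighbouring threads (array of structures) needs many more.
    struct ParticleAoS { float x, y, z, pressure, density; };   // 20-byte stride

    __global__ void coalescedRead(const float *posX, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = posX[i];          // thread i reads element i: coalesced
    }

    __global__ void stridedRead(const ParticleAoS *p, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = p[i].x;           // 20-byte stride between threads: poorly coalesced
    }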

The purpose of using GPU shared-memory is to further reduce the data requests to global memory. Instead of letting the cache memory manage the coalesced data, if any, we dynamically store the necessary data in shared-memory.

We need to collect and keep the data addresses before the data can be copied from the global memory to the GPU's shared-memory.

In CUDA, the computation and data access for one particle is done by a single thread. However, shared-memory is allocated per thread block, as shown in figure 2.1; therefore all particles computed by the threads in a block are served by the same shared-memory.

Particles, or threads, in a thread block that access the same neighbouring data constitute redundancies that are to be removed. Therefore, we only keep the unique data addresses of each thread block in every iteration.

The current algorithm used to get these unique addresses is neither associative nor commutative, and thus not suitable for a parallel programming implementation. Therefore, we write this code in serial, which results in suboptimal GPU parallel computation. The results obtained show that the benefit of accessing data from shared-memory is smaller than the extra time introduced by the extra code.

The suboptimal parallel computation prevents us from getting a clear answer to the question of whether GPU shared-memory is more efficient than the memory coalescing technique. Therefore, to use the shared memory in an optimal way, the SPH algorithm implemented in Fluidsv3 should be revised.

6.3 PCISPH

Fluidsv3 uses the basic implementation of SPH, but that is not the only SPH method. PCISPH is a novel method implementing SPH on the GPU. It needs more computation to complete one iteration, but the fluid simulation progresses faster compared to the original SPH method used in Fluidsv3.

6.4 Future Work

There are several things which could improve the work even further:

– Simulation correctness. There is a need to devise and implement a validation method to measure the correctness of the simulation compared to the real world.

– GPU shared-memory overhead code. A big portion of the current GPU shared-memory overhead code is executed serially. Parallelizing this piece of code will improve the performance. In order to do this, a change in the data structure or the SPH algorithm is needed. However, whether such an implementation would be better than the memory coalescing technique is an open question.

– Adaptive particle splitting. Another approach to improve fluid simulation using SPH is adaptive particle splitting. Research on this has been done by Xiong et al. [32] and Zhang et al. [33]. With this approach, the number of particles in the simulation can be reduced or increased whenever necessary. Reducing the number of particles means less computation effort and hence a faster simulation.

– Visualization. In [7], Fluidsv3's author suggested a feature to render the particles into an ISO-surface fluid, which would make the particles look like a real fluid. Of course this would need more computation power from the GPU, so the trade-off needs to be considered. The current implementation does not draw the solid surface boundary or the simulation space boundary. These add-ons could also improve the simulation's display.

– Adaptive simulation parameters. There are simulation parameters that depend on each other and should be determined automatically before the simulation starts instead of being found by manual experimentation. One example is that the particles' initialization speed affects the value of the time step. The time step should be pre-determined based on how many seconds a particle needs to travel a "particle distance" given an initialization speed. Whether this would result in an accurate simulation or not is debatable.

– Different SPH implementation. We use Fluidsv3 as our SPH base code for the fluids simulation, but it is not the only SPH implementation. DualSPHysics is another implementation of fluids simulation using SPH [34], [35]. Whether it is better suited for the water-jet spray simulation is an open question.

– Multi-GPU implementation. Algorithm optimization is one way to improve the performance of a simulation, but at some point the performance will be limited by the GPU hardware capability. A multi-GPU implementation is another way to scale up the simulation in number of particles, in speed, or in both. Rustico et al. presented a multi-GPU SPH implementation of fluids simulation in [36].


References

[1] H. Yan, Z. Wang, J. He, X. Chen, C. Wang, and Q. Peng, "Real-time fluid simulation with adaptive SPH," Computer Animation and Virtual Worlds, vol. 20, no. 2-3, pp. 417–426, Jun. 2009. [Online]. Available: http://doi.wiley.com/10.1002/cav.300

[2] A. Hérault, G. Bilotta, and R. A. Dalrymple, "SPH on GPU with CUDA," Journal of Hydraulic Research, vol. 48, no. sup1, pp. 74–79, Jan. 2010. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/00221686.2010.9641247

[3] "Sph-flow," 2015, [Accessed 30-August-2015]. [Online]. Available: http://www.sph-flow.com/index.html

[4] "Openworm," 2015, [Accessed 10-July-2015]. [Online]. Available: http://www.openworm.org/

[5] "Dualsphysics," 2015, [Accessed 10-July-2015]. [Online]. Available: http://www.dual.sphysics.org/

[6] A. Johansson, "Video Games Fluid Flow Simulations Towards Automation: Smoothed Particle Hydrodynamics," Master's thesis, Uppsala University, 2014. [Online]. Available: http://www.diva-portal.org/smash/record.jsf?pid=diva2:703754

[7] R. C. Hoetzlein, "Fast fixed-radius nearest neighbors: Interactive million-particle fluids," GPU Technology Conference, 2014. [Online]. Available: http://fluids3.com

[8] K. Erleben, J. Sporring, K. Henriksen, and H. Dohlmann, Physics-Based Animation. Charles River Media, Inc., 2005.

[9] R. Gingold and J. Monaghan, "Smoothed particle hydrodynamics: theory and application to non-spherical stars," Monthly Notices of the Royal Astronomical Society, vol. 181, pp. 375–389, 1977.

[10] L. B. Lucy, "A Numerical Approach to Testing the Fission Hypothesis," The Astronomical Journal, vol. 82, no. 12, pp. 1013–1024, December 1977.

[11] J. J. Monaghan, "Smoothed particle hydrodynamics," Reports on Progress in Physics, vol. 68, no. 8, pp. 1703–1759, Aug. 2005. [Online]. Available: http://stacks.iop.org/0034-4885/68/i=8/a=R01?key=crossref.c562820df517a049ca7f11d0aefe49b4

[12] M. Kelager, "Lagrangian fluid dynamics using smoothed particle hydrodynamics," 2006. [Online]. Available: http://www.glowinggoo.com/sph/bin/kelager.06.pdf

[13] M. Müller, D. Charypar, and M. Gross, "Particle-based fluid simulation for interactive applications," in Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2003, pp. 154–159. [Online]. Available: http://dl.acm.org/citation.cfm?id=846298


[14] M. Desbrun and M.-P. Gascuel, "Smoothed particles: A new paradigm for animating highly deformable bodies," in Computer Animation and Simulation'96: Proceedings of the Eurographics Workshop in Poitiers, France, August 31–September 1, 1996. Springer Science & Business Media, 2012, p. 61.

[15] Nvidia Corporation, "Nvidia Fermi Compute Architecture Whitepaper," Nvidia Corporation, Tech. Rep., 2009. [Online]. Available: http://www.nvidia.com/content/pdf/fermi_white_papers/nvidiafermicomputearchitecturewhitepaper.pdf

[16] Nvidia Corporation, "CUDA C Programming Guide," http://docs.nvidia.com/cuda/cuda-c-programming-guide/, 2015, [Online; accessed 20-March-2015].

[17] Nvidia Corporation, "GeForce GTX980 Whitepaper," Nvidia Corporation, Tech. Rep., 2014. [Online]. Available: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

[18] "Matlab parallel computing toolbox user's guide," 2015, [Accessed 06-August-2015]. [Online]. Available: http://se.mathworks.com/help/pdf_doc/distcomp/distcomp.pdf

[19] A. Kolb and N. Cuntz, "Dynamic particle coupling for gpu-based fluid simulation," in Proc. of the 18th Symposium on Simulation Technique, 2005, pp. 722–727.

[20] T. Harada, S. Koshizuka, and Y. Kawaguchi, "Smoothed particle hydrodynamics on gpus," in Computer Graphics International. SBC Petropolis, 2007, pp. 63–70.

[21] X. Gao, Z. Wang, H. Wan, and X. Long, "Accelerate Smoothed Particle Hydrodynamics Using GPU," in Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on. IEEE, 2010, pp. 399–402. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5713129

[22] "Eth zürich computer graphics laboratory," 2015, [Accessed 30-August-2015]. [Online]. Available: https://graphics.ethz.ch/

[23] "University of freiburg, computer science department," 2015, [Accessed 30-August-2015]. [Online]. Available: http://cg.informatik.uni-freiburg.de/index.htm

[24] "Nvidia nsight visual studio edition 4.6 user guide," 2015, [Accessed 26-July-2015]. [Online]. Available: http://docs.nvidia.com/nsight-visual-studio-edition/4.6/Nsight_Visual_Studio_Edition_User_Guide.htm

[25] "Nvidia profiler documentation guide," 2015, [Accessed 26-July-2015]. [Online]. Available: http://docs.nvidia.com/cuda/profiler-users-guide/index.html

[26] M. Ihmsen, N. Akinci, M. Gissler, and M. Teschner, "Boundary handling and adaptive time-stepping for PCISPH," in Workshop on Virtual Reality Interaction and Physical Simulation VRIPHYS, 2010. [Online]. Available: http://cg.informatik.uni-freiburg.de/intern/seminar/rigidFluidCoupling_2010_VRIPHYS_boundaryHandling.pdf

[27] B. Solenthaler and R. Pajarola, "Predictive-corrective incompressible SPH," ACM Transactions on Graphics, vol. 28, no. 3, Jul. 2009. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1531326.1531346

[28] J. Sanders, CUDA by Example: An Introduction to General-Purpose GPU Programming. Upper Saddle River, NJ: Addison-Wesley, 2011.

[29] "Cuda toolkit documentation – shared memory," 2015, [Accessed 30-August-2015]. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory

[30] "Using shared memory in cuda c/c++," 2015, [Accessed 30-August-2015]. [Online]. Available: http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/

[31] P. Goswami, P. Schlegel, B. Solenthaler, and R. Pajarola, "Interactive SPH simulation and rendering on the GPU," in Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2010, pp. 55–64. [Online]. Available: http://dl.acm.org/citation.cfm?id=1921437

[32] Q. Xiong, B. Li, and J. Xu, "GPU-accelerated adaptive particle splitting and merging in SPH," Computer Physics Communications, vol. 184, no. 7, pp. 1701–1707, Jul. 2013. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0010465513000787

[33] Y. Zhang, B. Solenthaler, and R. Pajarola, "Adaptive Sampling and Rendering of Fluids on the GPU." [Online]. Available: http://www.ifi.uzh.ch/vmml/publications/older-puclications/SPH_GPU_PBG08.pdf

[34] A. J. C. Crespo, J. M. Domínguez, M. G. Gesteira, A. Barreiro, B. D. Rogers, S. Longshaw, and R. Canelas, "User Guide for DualSPHysics code," SPHysics project, v2.0 edition, 2012. [Online]. Available: http://dual.sphysics.org/files/3313/8753/4540/DualSPHysics_v3.0_GUIDE.pdf

[35] M. Gomez-Gesteira, B. Rogers, A. Crespo, R. Dalrymple, M. Narayanaswamy, and J. Dominguez, "SPHysics - development of a free-surface fluid solver - Part 1: Theory and formulations," Computers & Geosciences, vol. 48, pp. 289–299, Nov. 2012. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0098300412000854

[36] E. Rustico, G. Bilotta, A. Herault, C. Del Negro, and G. Gallo, "Advances in Multi-GPU Smoothed Particle Hydrodynamics Simulations," IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 43–52, Jan. 2014.

[37] D. López, "Smoothed particle hydrodynamics model applied to hydraulic structures: a hydraulic jump test case," Journal of Hydraulic Research, vol. 48, no. extra, p. 000, 2009. [Online]. Available: http://www.iahr.net/site/e_library/journals/jhr/default_journalhydraulicresearch2.asp?varVol=48&varIss=extra

[38] M. Gomez-Gesteira, B. Rogers, A. Crespo, R. Dalrymple, M. Narayanaswamy, and J. Dominguez, "SPHysics - development of a free-surface fluid solver - Part 1: Theory and formulations," Computers & Geosciences, vol. 48, pp. 289–299, Nov. 2012. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0098300412000854

[39] M. Gomez-Gesteira, A. Crespo, B. Rogers, R. Dalrymple, J. Dominguez, and A. Barreiro, "SPHysics - development of a free-surface fluid solver - Part 2: Efficiency and test cases," Computers & Geosciences, vol. 48, pp. 300–307, Nov. 2012. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0098300412000842


Appendix A. Fluidsv3's Algorithm

In this appendix Fluidsv3's code structure and algorithm are discussed. The original code is due to [7]. Algorithm 2 describes Fluidsv3's main algorithm structure. The original Fluidsv3 code has more functions and procedures than are described in the algorithm, but for simplicity only the functions and procedures that are modified are included. The modified functions are related either to the work's implementations, to the SPH algorithm or to the optimizations. The other functions support the simulation and do not need to be modified. Two Fluidsv3 procedures are described in Algorithm 3 and Algorithm 4.

Algorithm 2 Fluidsv3 Main Algorithm
1: OpenGL Initialization
2: CUDA and Rendering Initialization
3: while Program is running do
4:   procedure DISPLAY
5:     if Program Start || GUI data changed then
6:       procedure SETUP-SIMULATION
7:       end procedure
8:     end if
9:     procedure RUN-SIMULATION
10:      procedure RUN-SIMULATION-CPU
11:      end procedure
12:      procedure RUN-SIMULATION-CUDA
13:      end procedure
14:    end procedure
15:    Draw-Simulation-Particles
16:    Draw-Simulation-GUI
17:  end procedure
18: end while
19: Free-allocated-memory

Algorithm 2 shows that Fluidsv3 starts with the OpenGL initialization, by loading the OpenGL Extension Wrangler Library (GLEW) context. Subsequently, the CUDA device is initialized to make sure that we have a working Nvidia GPU, and the parameters for rendering are declared.


The simulation is done within the DISPLAY callback procedure. Here, there are two important procedures that deal with the computation. The first is the setup-simulation procedure, which is described in Algorithm 3. The second is the run-simulation procedure, which consists of two similar procedures: run-simulation-CPU and run-simulation-CUDA. The work focuses on the GPU implementation, therefore only the run-simulation-CUDA procedure is described in Algorithm 4. After the computation, the OpenGL rendering process draws the resulting particle positions and the simulation's GUI data.

Algorithm 3 Setup Simulation Procedure
1: SetupDefaultParams
2: AllocateParticles
3: SetupSPHKernels
4: SetupParticleSpacing
5: SetupAddVolume
6: SetupGridAllocate
7: FluidSetupCUDA
8: FluidParamCUDA
9: TransferToCUDA

The setup simulation procedure is called at the beginning of the program execution or whenever any simulation parameter is changed from the GUI. Algorithm 3 describes the functions inside the setup simulation procedure.

First, SetupDefaultParams imports the static parameters defined in an XML file. Using an XML file to store parameters eliminates the necessity to rebuild the code every time we need to change parameters. Fluidsv3 uses tinyXML, an open-source library, for this feature.

Second, AllocateParticles allocates the memory space in the CPU's DRAM for every particle's computation data such as position, color, velocity, pressure, density, acceleration, grid cell, etc. This means that the allocated memory space depends on the number of particles and the simulation area; more particles or a bigger simulation area needs more DRAM space.

Third, SetupSPHKernels pre-computes the constant parts of the SPH kernel functions. Equations (2.12), (2.13) and (2.14) describe the SPH kernel functions, and each of them has a constant part that can be pre-computed to reduce the computation demand within each simulation iteration.

Fourth, SetupParticleSpacing computes the space between particles from the particle mass and density.

Fifth, SetupAddVolume initializes every particle's position, according to the simulation space area and the number of particles, in a rectangular cuboid shape.

Sixth, SetupGridAllocate determines the number of grid cells and the size of each cell with respect to the simulation area.
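As an illustration of the pre-computation mentioned in the third step, the constants of the widely used Poly6, Spiky-gradient and viscosity-Laplacian kernels can be computed once from the smoothing radius h; the coefficients below follow the standard Müller et al. formulation and may differ in details from equations (2.12)-(2.14).

    // Sketch: pre-compute the constant factors of the SPH kernels at set-up time.
    #include <cmath>

    struct KernelConstants {
        double poly6;      // 315 / (64 * pi * h^9)
        double spikyGrad;  // -45 / (pi * h^6)
        double viscLap;    //  45 / (pi * h^6)
    };

    KernelConstants setupSPHKernels(double h) {
        const double pi = 3.141592653589793;
        KernelConstants k;
        k.poly6     = 315.0 / (64.0 * pi * std::pow(h, 9));
        k.spikyGrad = -45.0 / (pi * std::pow(h, 6));
        k.viscLap   =  45.0 / (pi * std::pow(h, 6));
        return k;
    }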


In SPH, fluids are represented as particles and it is important to know the positions of all neighbouring particles within the distance of the smoothing radius. In order to do this in a simple way, the simulation space is divided into grid cells of the same cube size with unique ids, and particles are assigned to their respective cells based on the particles' x, y, and z positions.
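A minimal sketch of this assignment is shown below; gridMin, cellSize and gridRes stand in for Fluidsv3's actual grid parameters, and the cell edge is assumed to be at least the smoothing radius.

    // Sketch: compute a particle's unique linear grid cell id from its position.
    __device__ int cellIndexOf(float3 p, float3 gridMin, float cellSize, int3 gridRes) {
        int gx = (int)((p.x - gridMin.x) / cellSize);
        int gy = (int)((p.y - gridMin.y) / cellSize);
        int gz = (int)((p.z - gridMin.z) / cellSize);
        // clamp to the grid so that particles on the boundary still get a valid cell
        gx = max(0, min(gx, gridRes.x - 1));
        gy = max(0, min(gy, gridRes.y - 1));
        gz = max(0, min(gz, gridRes.z - 1));
        return (gz * gridRes.y + gy) * gridRes.x + gx;
    }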

Figure 6.1 shows an example of a small simulation grid arrangement. If the cell's edge length is defined to be at least the smoothing radius, then the neighbouring cells contain all the particles that are within the smoothing radius. This makes the neighbour search process easy, because we only need to check particles located in a particle's own cell and in the neighbouring cells.

Figure 6.1. Simulation grid; each small cube represents a unique grid cell. The number of cells in the x, y, and z directions depends on the simulation space.

Figure 6.2 depicts all the neighbouring cell positions needed to get the positions of the neighbouring particles. It is composed of a 3x3x3 grid structure; the dotted line is for visualization purposes only and does not represent any distance.

Fluidsv3 also pre-computes the adjacency table, which lists a cell's offsets for the 3x3x3 grid. For example, if the simulation space grid is represented by Fig. 6.1 then the adjacency table for grid index 0 is shown in Table 6.2. The actual grid offsets depend on the number of cells in the x, y and z directions.

Figure 6.2. Grid Neighbours. The yellow circle represents the centre grid and the rest are the neighbouring grids. The dotted line is for visualization purposes only and does not represent any information.

Seventh, FluidSetupCUDA decides how many thread blocks are needed for each CUDA kernel. This is based on the total number of particles and the number of threads in a block. It also allocates the GPU memory for each piece of particle information such as velocity, pressure, density, etc.

Eighth, FluidParamCUDA copies simulation related parameters such as particle mass, density, smoothing radius, gravity, viscosity, etc. to their respective GPU variables.

Ninth, TransferToCUDA copies the particle information to the GPU and hands over the computation to the GPU.

Algorithm 4 Run Simulation CUDA Procedure
1: InsertParticlesCUDA
2: PrefixSumCellsCUDA
3: CountingSortFullCUDA
4: ComputePressureCUDA
5: ComputeForceCUDA
6: AdvanceCUDA
7: TransferFromCUDA

Algorithm 4 describes the steps of the GPU computation. Steps one to six are done by CUDA kernels; therefore we have six CUDA kernels for the original Fluidsv3 simulation.

First, InsertParticlesCUDA assigns each particle to its respective grid cell.


Table 6.2 Adjacency Table

adj  offset    adj  offset    adj  offset
0    0         9    36        18   72
1    1         10   37        19   73
2    2         11   38        20   74
3    6         12   42        21   78
4    7         13   43        22   79
5    8         14   44        23   80
6    12        15   48        24   84
7    13        16   49        25   85
8    14        17   50        26   86

Second, PrefixSumCellsCUDA computes the prefix sum of the number of particles over the sequence of grid cells.

Third, CountingSortFullCUDA sorts the particles based on their grid cell id and saves the particles belonging to the same cell in sequential memory locations. It also stores the number of particles in each cell.

Fourth, ComputePressureCUDA computes the pressure for each particle based on equation (2.8).

Fifth, ComputeForceCUDA computes the force for each particle based on equation (2.19).

Sixth, AdvanceCUDA moves each particle's position based on equation (2.22).

Seventh, TransferFromCUDA copies the particle position and color data back to the CPU for rendering.
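The sketch below indicates how a per-particle kernel such as ComputePressureCUDA can traverse the neighbouring cells with the pre-computed adjacency offsets (cf. Table 6.2); cellOfParticle, cellStart, cellCount and adjTable are simplified stand-ins for the real data structures, and the accumulated quantity is only a placeholder for the actual SPH sums.

    // Sketch: visit all particles in the 27 neighbouring cells using the adjacency table.
    __global__ void forEachNeighbour(const int *cellOfParticle, const int *cellStart,
                                     const int *cellCount, const int *adjTable,
                                     const float3 *pos, float *result,
                                     int numParticles, int numCells)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numParticles) return;
        int cell = cellOfParticle[i];
        float acc = 0.f;
        for (int n = 0; n < 27; ++n) {
            int nc = cell + adjTable[n];                 // neighbouring cell id via table offset
            if (nc < 0 || nc >= numCells) continue;      // skip cells outside the grid
            for (int j = cellStart[nc]; j < cellStart[nc] + cellCount[nc]; ++j) {
                float dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y, dz = pos[i].z - pos[j].z;
                acc += dx*dx + dy*dy + dz*dz;            // placeholder for the actual SPH kernel sum
            }
        }
        result[i] = acc;
    }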


Appendix B. Water-Jet Spray's Algorithm

For the water-jet spray main algorithm, Algorithm 2 is slightly modified into Algorithm 5 by adding one procedure to handle Predictive-Corrective Incompressible SPH (PCISPH). PCISPH is an incompressible fluid simulation method based on SPH [27]. This method offers a different simulation visualization and speed.

Algorithm 5 Fluidsv3 Water-Jet Main Algorithm
1: OpenGL Initialization
2: CUDA and Rendering Initialization
3: while Program is running do
4:   procedure DISPLAY
5:     if Program Start || GUI data changed then
6:       procedure SETUP SIMULATION
7:       end procedure
8:     end if
9:     procedure RUN SIMULATION
10:      procedure RUN SIMULATION CPU
11:      end procedure
12:      procedure RUN SIMULATION CUDA
13:      end procedure
14:      procedure RUN SIMULATION CUDA PCISPH
15:      end procedure
16:    end procedure
17:    FluidAddVolumeCUDA
18:    FluidGPUunMap
19:    Draw Simulation Particles
20:    Draw Simulation GUI
21:  end procedure
22: end while
23: Free allocated memory

FluidAddVolumeCUDA is added to reuse memory space. During the simulation, particles that go beyond the simulation area are deleted but their memory is still occupied. If the unused memory space is big enough for a new layer of particles then the simulation uses that block of memory. Otherwise, it uses the next empty memory space for the new layer. If the simulation runs out of memory then the spray stops until there is enough memory.

Another new function is FluidGPUunMap. It unmaps the GPU buffer from the CUDA pointer so that it can be used for rendering.

Algorithm 6 explains the setup-simulation procedure for the water-jet spray simulation.

Algorithm 6 Water-Jet Setup Simulation Procedure
1: SetupDefaultParams
2: SetupGPUInterop
3: SetupSpacing
4: CountParticle
5: AllocateParticles
6: SetupAddVolume
7: SetupGridAllocate
8: FluidSetupCUDA
9: FluidParamCUDA
10: TransferToCUDA

First, SetupDefaultParams imports the simulation parameters from an XML file. Some parameters are added for our specific simulation, for example the particles' initialization speed, the number of water inlets, the inlet radius, the flat surface height, the PCISPH delta, and the maximum number of particles.

During the simulation, the number of particles increases at every time step. This means dynamic memory allocation at each time step. To minimize dynamic memory allocation, it was decided to allocate memory for the maximum number of particles possible. This number is obtained experimentally and depends on the GPU's memory and the code's data structure.

Second, SetupGPUInterop allocates two GPU memory buffers for rendering. One buffer is used to store each particle's position and the other is used for its color. These buffers are then mapped to CUDA pointers to be used in the GPU computation. By using the same memory area for rendering and computation we remove any memory copying from the CUDA computation to the rendering process, thus improving the simulation speed. The buffers' size depends on the maximum number of particles.

Third, SetupSpacing gives the distance between the particles' centre points based on the fluid rest density and the particle mass. The space is computed so that there is some overlap between particles.

Fourth, CountParticle computes the number of particles needed for every layer of a water-jet spray inlet. This number depends on the radius of the inlet, the fluid rest density and the space between particles.

Fifth, AllocateParticles allocates space in the CPU's memory for the initial particles' position, speed and color. This only takes a small amount of space since it is only required to save the information of one layer of particles for the initialization.

Sixth, SetupAddVolume uses the allocated space to store the initialization particles' position, speed and color.

Seventh, SetupGridAllocate determines the number of grid cells. It uses the original implementation.

Eighth, FluidSetupCUDA decides the number of thread blocks needed for every CUDA kernel. It also allocates the GPU memory for each piece of particle information. If more memory space than the GPU's available space is requested, this part fails at run-time.

Ninth, FluidParamCUDA copies simulation related parameters such as particle mass, density, etc. to their respective GPU variables.

Tenth, TransferToCUDA copies the initialization particles' position, speed and color to their respective GPU variables.

The water-jet spray simulation uses a similar algorithm to Algorithm 4 for the CUDA SPH simulation, without the need to copy the particle position and color data for rendering. Another difference is that in the AdvanceCuda function, particles that move beyond the simulation area are marked as unused particles. A flat surface boundary is added, but not drawn, to simulate a flat layer of metal.

Instead of the original Fluidsv3 boundary handling method, a different method due to [26] is used. There, a novel boundary handling scheme for PCISPH is presented, and it is mentioned that it can also be used for SPH.

One reason for using this boundary handling is to be able to use a bigger gas-stiffness constant, as mentioned in equation (2.16), or a higher initial speed, while keeping the other parameters the same and without any particles leaking through the boundary. A big gas-stiffness constant is necessary to make the fluid as incompressible as possible, especially if a high initial velocity is used.

Algorithm 7 describes the steps of the PCISPH procedure. PCISPH is due to [27]; the algorithm is adapted to Fluidsv3's implementation of SPH. The first four functions and the last one are the same as before; the fifth and sixth are the implementation of PCISPH. Algorithm 8 and Algorithm 9 describe the PCISPH implementation.

Algorithm 7 Run Simulation CUDA PCISPH Procedure
1: InsertParticlesCUDA
2: PrefixSumCellsCUDA
3: CountingSortFullCUDA
4: ComputePressureCUDA
5: pcisphComputeInit
6: PCISPHLoop
7: AdvanceCUDA


Algorithm 8 pcisphComputeInit
1: for each particle i do
2:   compute viscosity and external forces F_i^{v,ext}(t)
3:   initialize pressure p(t) = 0.0
4:   initialize pressure force F_i^P(t) = 0.0
5: end for

Algorithm 9 PCISPHLoop
1: while iter < 3 or ρ*_err,i > η do
2:   for each particle i do
3:     predict velocity v*_i(t + ∆t)
4:     predict position x*_i(t + ∆t)
5:   end for
6:   for each particle i do
7:     density ρ*_i(t + ∆t)
8:     density variation ρ*_err,i(t + ∆t)
9:     update pressure p_i(t) += δ · ρ*_err,i(t + ∆t)
10:  end for
11:  for each particle i do
12:    compute pressure force F_i^P(t)
13:  end for
14: end while

Table 6.3 Water-Jet Simulation's Parameters

Particles Parameter                     Environment Parameter
Time Step (s)        0.0001             MaxParticle          22000000
Viscosity (Pa.s)     0.35               NumInlet             4
RestDensity (kg/m3)  998.29             InitSpeed (m/s)      -20.0
ParticleRadius (m)   0.001              VolMin (m)           -80, 00, -80
DampHeight (m)       0.03               VolMax (m)           80, 140, 80
SmoothRadius (m)     0.005              InitMin (m)          00, 65, 00
IntStiff             20                 InletRadius (m)      4.0
PlaneGravDir (m/s2)  0, -9.82, 0        SurfaceHeight (m)    30.00
SimScale             0.005              SurfaceLength (m)    40.00


These parameters are kept in an XML file. DampHeight is the diameter of the hidden boundary particles, IntStiff is the gas stiffness constant and SimScale is the scaling factor of the simulation space dimensions.

VolMin and VolMax are the minimum and maximum coordinates of the simulation area in the x, y and z directions. The simulation space is always a rectangular parallelepiped; to get the length of each edge we subtract VolMin from VolMax. As follows from the table, the simulation space dimension is 160 x 140 x 160 units in the x, y and z directions.

InitMin is the height of the water spray above the origin; InitMax is an unused original Fluidsv3 parameter. SurfaceHeight is the height of the solid surface above the origin, and SurfaceLength is the length of its edge.

Every measurement unit represents a meter multiplied by the SimScale in real space. Thus, the real simulation space dimension is (160 * 0.005) meter in length, (140 * 0.005) meter in height, and (160 * 0.005) meter in width. The same holds for DampHeight, SurfaceHeight, SurfaceLength, InitMin, and InletRadius. The simulation scales up the dimensions for visualization purposes.

Table 6.4 Simulation's Parameters for the PCISPH Method

Particles' Parameter                    Environment's Parameter
Time Step (s)        0.0001             MaxParticle          22000000
Viscosity (Pa.s)     0.35               NumInlet             4
RestDensity (kg/m3)  998.29             InitSpeed (m/s)      -10.0
ParticleRadius (m)   0.001              VolMin (m)           -80, 00, -80
DampHeight (m)       0.06               VolMax (m)           80, 140, 80
SmoothRadius (m)     0.005              InitMin (m)          00, 65, 00
pcisphDelta          5.1                InletRadius (m)      4.0
PlaneGravDir (m/s2)  0, -9.82, 0        SurfaceHeight (m)    30.00
SimScale             0.005              SurfaceLength (m)    40.00
