
POLITECHNIKA WROCŁAWSKA

WYDZIAŁ INFORMATYKI I ZARZĄDZANIA

GPGPU driven simulations of zero-temperature 1D Ising model with Glauber dynamics

Daniel Kosalla

FINAL THESIS

under supervision of Dr inż. Dariusz Konieczny

Wrocław 2013


Acknowledgments:

Dr inż. Dariusz Konieczny


Contents

1. Motivation
2. Target
3. Scope of work
4. Theoretical background and proposed model
   4.1. Ising model
   4.2. Historic methods
   4.3. Updating
   4.4. Simulations
   4.5. Distributions
   4.6. Updating
   4.7. Algorithm
5. General Purpose Graphic Processing Units
   5.1. History of General Purpose GPUs
   5.2. CUDA Architecture
6. CPU Simulations
   6.1. Sequential algorithm
   6.2. Random number generation on CPU
   6.3. CPU performance
7. GPU Simulations - thread per simulation
   7.1. Thread per simulation
   7.2. Running the simulation
   7.3. Solution space
   7.4. Random Number Generators
   7.5. Thread per simulation - static memory
   7.6. Comparison of static and dynamic memory use
8. GPU Simulations - thread per spin
   8.1. Thread per spin approach
   8.2. Concurrent execution
   8.3. Thread communication
   8.4. Race conditions with shared memory
   8.5. Thread per spin approach - reduction
   8.6. Thread per spin approach - flags
   8.7. Thread-per-spin performance
   8.8. Thread-per-spin vs thread-per-simulation performance
9. Bond density for some W0 values
10. Conclusions
11. Future work
Appendix
A. Sequential algorithm - CPU
B. Thread per simulation - no optimizations
C. Thread per simulation - static memory
D. Thread per spin - no optimizations
E. Thread per spin - parallel reduction
F. Thread per spin - update flag


1. Motivation

In the presence of recent developments in SCM (Single Chain Magnets) [1-4], the issue of criticality in 1D Ising-like magnet chains has turned out to be a promising field of study [5-8]. Some practical applications have already been suggested [2]. Unfortunately, the details of the general mechanism driving these changes in the real world are yet to be discovered. Traditionally, Monte Carlo simulations of the Ising model were conducted on CPUs1. However, the arrival of powerful GPGPUs2 started a new trend in scientific computations, enabling more detailed and even faster calculations.

2. Target

The following document describes the developed GPGPU applications capable of producing insights into the underlying physical problem, examines different approaches to conducting Monte Carlo simulations on GPGPUs, and compares the developed parallel GPGPU algorithms with a sequential CPU-based approach.

3. Scope of work

The scope of this document includes the development of 5 parallel GPGPU algorithms, namely:

• Thread-per-simulation algorithm

• Thread-per-simulation algorithm with static memory

• Thread-per-spin algorithm

• Thread-per-spin algorithm with flags

• Thread-per-spin algorithm with reduction

1 CPU - Central Processing Unit
2 GPGPU - General Purpose Graphics Processing Unit


4. Theoretical background and proposed model

4.1. Ising model

Although it was initially proposed by Wilhelm Lenz [12], it was Ernst Ising [10] who developed a mathematical model for ferromagnetic phenomena. The Ising model is usually represented by means of a lattice of spins - discrete variables {−1, 1} representing the magnetic dipole moments of molecules in the material. The spins interact with their neighbours, which may cause a phase transition of the whole lattice.

4.2. Historic methods

A Monte Carlo simulation (MC) of the Ising model consists of a sequence of lattice updates. Traditionally, either all spins (synchronous) or a single spin (sequential) are updated in each iteration, producing the lattice state for future iterations. The update methods are based on so-called dynamics that describe the spin interactions.

4.3. Updating

The idea of a partially synchronous updating scheme has been suggested [5-7]. This c-synchronous mode has a fixed fraction of spins being updated in one time step. However, one can imagine that the number of updated spins/molecules (often referred to as cL, where L denotes the size of the chain and c ∈ (0, 1]) changes as the simulation progresses. If so, then it is either linked to some characteristics of the system or may be expressed with some probability distribution (described in subsection 4.5). This approach of changing the c parameter can be applied while choosing spins randomly as well as in clusters (subsection 4.6), but only the latter will be considered in this document.

4.4. Simulations

In the proposed model, cL sequential updating is used with c drawn from the provided distribution. The considered environment consists of a one-dimensional array of L spins si = ±1. The index of each spin is denoted by i = 1, 2, ..., L. Periodic boundary conditions are assumed, i.e. sL+1 = s1.

It has been shown in [8] that the system under synchronous Glauber dynamics reaches one of two absorbing states - ferromagnetic or antiferromagnetic. Therefore, let us introduce the density of bonds (ρ) as an order parameter:

\rho = \frac{\sum_{i=1}^{L} \left(1 - s_i s_{i+1}\right)}{2L} \qquad (4.1)
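For illustration, the order parameter (4.1) maps directly onto a short loop over the chain; below is a minimal C sketch (the function name and the ±1 spin representation are illustrative assumptions, not the thesis code):

/* Minimal sketch of the order parameter (4.1) for a chain of +/-1 spins. */
double bond_density(const short *s, int L) {
    int active = 0;                          /* number of active bonds    */
    for (int i = 0; i < L; i++) {
        /* periodic boundary: the neighbour of s[L-1] wraps to s[0] */
        active += (1 - s[i] * s[(i + 1) % L]) / 2;
    }
    return (double)active / L;               /* equals rho from Eq. (4.1) */
}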


As stated in [8], phase transitions in synchronous updating modes and in the c-sequential mode [7] ought to be rather continuous (in cases different than c = 1 for the latter). A smooth phase transition can be observed in Figure 4.1.

Figure 4.1. The average density of active bonds in the stationary state ⟨ρst⟩ as a function of W0 for c = 0.9 and several lattice sizes L.

Source: [7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki. Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012)

The system is considered at low temperatures (T), and therefore T = 0 can be assumed. The Metropolis algorithm can be considered a special case of zero-temperature Glauber dynamics for 1/2 spins. Each spin is flipped (si → −si) with rate W(δE) per unit time. For T = 0:

W(\delta E) =
\begin{cases}
1 & \text{if } \delta E < 0, \\
W_0 & \text{if } \delta E = 0, \\
0 & \text{if } \delta E > 0
\end{cases}
\qquad (4.3)
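Expressed in code, the rate (4.3) is just a three-way branch; a minimal C sketch (the function name and the integer δE argument are illustrative assumptions):

/* Minimal sketch of the zero-temperature flip rate (4.3). */
double flip_rate(int dE, double W0) {
    if (dE < 0)  return 1.0;  /* energy decreases: always flip           */
    if (dE == 0) return W0;   /* energy unchanged: flip with rate W0     */
    return 0.0;               /* energy would increase: never at T = 0   */
}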

In the case of T = 0, the ordering parameter W0 ∈ [0, 1] (e.g. the Glauber rate W0 = 1/2 or the Metropolis rate W0 = 1) is assumed to be constant. One can imagine that even the W0 parameter could in fact be changed during the simulation process, but that is out of the scope of the proposed model.

The system starts in the fully antiferromagnetic state (ρ = ρaf = 1). After each time step, changes are applied to the system and the next time step is evaluated. After a predetermined number of time steps, the state of the system is investigated. If the chain has reached the ferromagnetic state (ρ = ρf = 0), or if a sufficiently large number of time steps has been inconclusive, the whole simulation is shut down.

4.5. Distributions

During the simulation, c will not be fixed in time but will rather vary over [0, 1] according to the triangular continuous probability distribution [9] presented in Figure 4.2.
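For reference, the density of a triangular distribution [9] on an interval [a, b] with mode m is given by the standard textbook formula below; the mapping of a, b and m to the MIU and SIGMA parameters used later is an implementation detail of Listing 7:

f(x) =
\begin{cases}
\dfrac{2(x-a)}{(b-a)(m-a)} & \text{for } a \le x < m, \\
\dfrac{2(b-x)}{(b-a)(b-m)} & \text{for } m \le x \le b, \\
0 & \text{otherwise.}
\end{cases}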

While studying different initial conditions for the simulations, the distributions are adjusted so that their peak values lie in the range [0, 1].


Figure 4.2. c can take any value in the interval [0, 1] but is most likely to be around c = 1/2. Other values are possible, but their probabilities are inversely proportional to their distance from c = 1/2.

This is due to the fact that a value of c = 0.5 (as presented in the plot) means that in each time step half of the spins get updated.

4.6. Updating

The following algorithms make use of the triangular probability distribution to assign an appropriate c value before each time step. One Monte Carlo Step (MCS) is counted after (on average) L spin updates.

4.7. Algorithm

Transforming the above-mentioned rules into a set of instructions yields the following description and pseudocode (below):

Update cL consecutive spins starting from a randomly chosen one. Each change is saved to a new array rather than the old one. After each Stop, the updated spins are saved and a new time step can be started.

1. Assign a c value according to the given distribution

2. Choose a random value of i ∈ [0, L)

3. max = i + cL

4. si is the i-th spin:

• if si+1 = si−1 :

– s′i = si+1 = si−1

• otherwise:

– flip si with probability W0


5. if i < max:

• i = i + 1

• go to step 4

6. Stop


5. General Purpose Graphic Processing Units

5.1. History of General Purpose GPUs

Traditionally, in a desktop computer the GPU is a highly specialized electronic circuit designed to efficiently handle 2D and 3D graphics. In 1992 Silicon Graphics released the OpenGL library. OpenGL was meant as a standardised, platform-independent interface for writing 3D graphics. By the mid 1990s an increasing demand for 3D applications appeared in the consumer market.

It was NVIDIA who developed the GeForce 256 and branded it as "the world's first GPU"3. The GeForce 256, although one of many graphics accelerators, was the one that showed a very rapid advance in the field, incorporating features such as transform and lighting computations directly on the graphics processor. The release of GPUs capable of handling programmable pipelines attracted researchers to explore the possibility of using graphics processors outside their original use scheme. Although the GPUs of the early 2000s were programmable only in a way that enabled pixel manipulation, researchers noticed that these manipulations could actually represent any kind of operations, and the pixels could virtually represent any kind of data.

In late 2006 NVIDIA revealed the GeForce 8800 GTX, the first GPU built with the CUDA Architecture. The CUDA Architecture enables the programmer to use every arithmetic logic unit4 on the GPU (as opposed to the early days of GPGPU, when access to the ALUs was granted only via the restricted and complicated interfaces of OpenGL and DirectX). The new family of GPUs that started with the 8800 GTX was built with IEEE-compliant ALUs capable of single-precision floating-point arithmetic. Moreover, the new ALUs were equipped not only with an extended set of instructions that could be used in general purpose computing but also allowed arbitrary read and write operations to device memory.

A few months after the launch of the 8800 GTX, NVIDIA published a compiler that took standard C extended with some additional keywords and transformed it into fully featured GPU code capable of general purpose processing. It is important to stress that currently used CUDA C is by far easier to use than OpenGL/DirectX. Programmers do not have to disguise their data as graphics and can use industry-standard C or even other languages like C#, Java or Python (via appropriate bindings).

CUDA is now used in various fields of science ranging from medical imaging and fluid dynamics to environmental science, offering enormous, several-orders-of-magnitude speedups5. GPUs are not only faster than CPUs in terms of data computed per unit time (e.g. FLOPS6) but also in terms of power and cost efficiency.

3 http://www.nvidia.com/page/geforce256.html
4 ALU - Arithmetic Logic Unit
5 http://www.nvidia.com/object/cuda-apps-flash-new-changed.html

5.2. CUDA Architecture

The underlying architecture of CUDA is driven by design decisions connected with the GPU's primary purpose, that is, graphics processing. Graphics processing is usually a highly parallel process; therefore, the GPU also works in a parallel fashion. An important distinction can be made between the logical and the physical layer of the GPU architecture.

The programmer decomposes a computational problem into atomic processes (threads) that can be executed simultaneously. This partition usually results in the creation of hundreds, thousands or even millions of threads. For the programmer's convenience, threads can be organized inside blocks, which in turn are part of grids. Both blocks and grids are 3-dimensional structures. These spatial dimensions are introduced for easier problem decomposition. As mentioned before, the GPU is meant for graphics processing, which is usually related to processing 2D or 3D sets of data.
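As a minimal illustration of this logical structure (a sketch, not code from the thesis), each thread can combine its block and thread coordinates into a unique global index:

#include <cstdio>

// Every thread derives a unique global index from the grid/block coordinates
// (1D case shown; the y and z dimensions work analogously).
__global__ void whoAmI(void) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main(void) {
    whoAmI<<<2, 4>>>();       // a grid of 2 blocks, each with 4 threads
    cudaDeviceSynchronize();  // wait for the device-side printf output
    return 0;
}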

This grouping is associated not only with the logical decomposition of problems but also with the physical structure of the GPU. A basic unit of execution on the GPU is the warp. A warp consists of 32 threads, and each thread in a warp belongs to the same block. If the block is bigger than the warp size, its threads are divided between several warps. The warps are executed on execution units called Streaming Multiprocessors (SMs). Each SM executes several warps (not necessarily from the same block). Physically, each SM consists of 8 streaming processors (SPs, CUDA cores) and 32 "basic" ALUs. The 8 SPs spend 4 clock cycles executing the same processor instruction, enabling the 32 threads in a warp to execute in parallel. Each of the threads in a warp can (and usually does) have different data supplied to it, forming what is known as the SIMD7 architecture.

6 FLOPS - Floating Point Operations Per Second
7 SIMD - Single Instruction, Multiple Data


Figure 5.1. Grid of thread blocks (source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/)

CUDA also provides a rich memory hierarchy available to every thread. Each of the memory spaces has its own characteristics. The fastest and smallest memory is the per-thread local (register-based) memory; it is out of reach of the CUDA programmer and is used automatically. Each thread in a block can make use of shared memory. This memory can be accessed by different threads in the block and is usually the main medium of inter-thread communication. The slowest memory spaces (but available to every thread) are called global, constant and texture memory, respectively; each of them has a different size and purpose, but they are all persistent across kernel launches by the same application.


Figure 5.2. CUDA memory hierarchy (source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/)


6. CPU Simulations

6.1. Sequential algorithm

The baseline for the presented algorithms will be the sequential, CPU-based code. The simulation itself is executed by the algorithm presented in Listing 1.

Listing 1. Sequential algorithm for CPU

while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // If lattice is in ferromagnetic state, simulation can stop
        break;
    }
    float C = TRIANGLE_DISTRIBUTION(MIU, SIGMA);
    first_i = (int)(LATTICE_SIZE * randomUniform());
    last_i  = (int)(first_i + (C * LATTICE_SIZE));
    is_lattice_updated = FALSE;  // reset the update flag for this time step
    for (int i = 0; i < LATTICE_SIZE; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if (( first_i <= i && i <= last_i )
            || ( last_i >= LATTICE_SIZE
                 && ( last_i % LATTICE_SIZE >= i || i >= first_i ) )
        ) {
            int left  = MOD((i-1), LATTICE_SIZE);
            int right = MOD((i+1), LATTICE_SIZE);
            // Neighbours are the same (and different than the current spin)
            if ( LATTICE[left] == LATTICE[right] ) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // Otherwise randomly flip the spin
            else if ( W0 > randomUniform() ) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}

The code runs the simulation with the initial conditions given by MAX_MCS, LATTICE_SIZE and LATTICE. LATTICE is an array initialized in the antiferromagnetic state (which can be represented by a sequence of alternating ones and zeroes). To explore the solution space (the combinations of W0, MIU and SIGMA), each simulation is run one after another.

C/C++'s % operator is in fact the remainder of division and not the modulo operator in the mathematical sense. The most prominent difference is that -1 % LATTICE_SIZE == -1, whereas MOD((-1), LATTICE_SIZE) == LATTICE_SIZE-1. Therefore, the MOD(x, N) macro is used while accessing the current spin's neighbours (Listing 2).

Listing 2. Modulo function-like macro

#define MOD(x, N) (((x < 0) ? ((x % N) + N) : x) % N)
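A quick check of that difference (an illustrative snippet, not part of the thesis code):

#include <stdio.h>

#define MOD(x, N) (((x < 0) ? ((x % N) + N) : x) % N)

int main(void) {
    printf("%d\n", -1 % 10);     /* prints -1: C remainder          */
    printf("%d\n", MOD(-1, 10)); /* prints  9: mathematical modulo  */
    return 0;
}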

6.2. Random number generation on CPU

The CPU code uses the GSL8-based Mersenne Twister9. The usage of the GSL-supplied MT is shown in Listing 3.

Listing 3. GSL's Mersenne Twister setup

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
// ...
const gsl_rng_type * T;
gsl_rng * r;
// ...
double randomUniform() {
    return gsl_rng_uniform(r);
}
// ...
int main(int argc, char *argv[]) {
    gsl_rng_env_setup();
    T = gsl_rng_mt19937;
    r = gsl_rng_alloc(T);
    long seed = time(NULL) * getpid();
    gsl_rng_set(r, seed);
    // simulation
    // randomUniform() calls
}

6.3. CPU performance

The tests of the CPU code were conducted on a quad-core AMD Phenom(tm) II X4 945 processor with 4 GB of RAM. Simulations occupied only one core at a time. The results presented in Figure 6.1 will be used as a baseline for further comparisons (with the respective MAX_MCS values).

8 GSL - GNU Scientific Library, http://www.gnu.org/software/gsl/
9 http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html


Figure 6.1. Execution times of CPU simulations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.


7. GPU Simulations - thread per simulation

7.1. Thread per simulation

CUDA provides a C/C++-like language for executing code on the GPU (CUDA C). The code is compiled by the CUDA compiler, which via specific language extensions (e.g. __device__, __host__) can distinguish the parts to be executed by the CPU (host), by the GPU (device), or launched from the host and run on the device (__global__).
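An illustrative (hypothetical) use of these qualifiers:

// Callable from device code only:
__device__ int next_index(int i) { return i + 1; }

// Compiled for both the CPU and the GPU:
__host__ __device__ int twice(int x) { return 2 * x; }

// Launched from the host, executed on the device:
__global__ void fill(int *out) {
    out[threadIdx.x] = twice(next_index(threadIdx.x));
}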

Listing 4. Thread per simulation algorithm

while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // stop when lattice is in ferromagnetic state
        break;
    }
    float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
    float W0 = Z / (float)MAX_Z;
    first_i = (int)(LATTICE_SIZE * RANDOM(&state[BLOCK_ID])) + THREAD_LATTICE_INDEX;
    last_i  = (int)(first_i + (C * LATTICE_SIZE)) + THREAD_LATTICE_INDEX;
    is_lattice_updated = FALSE;  // reset the update flag for this time step
    for (int i = THREAD_LATTICE_INDEX; i < LATTICE_SIZE + THREAD_LATTICE_INDEX; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if (( first_i <= i && i <= last_i )
            || ( last_i >= LATTICE_SIZE + THREAD_LATTICE_INDEX
                 && ( last_i % (LATTICE_SIZE + THREAD_LATTICE_INDEX) >= i
                      || i >= first_i ) )
        ) {
            int left  = MOD((i-1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            int right = MOD((i+1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            // If neighbours are the same
            if ( LATTICE[left] == LATTICE[right] ) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // ... otherwise randomly flip the spin
            else if ( W0 > RANDOM(&state[BLOCK_ID]) ) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if ( LATTICE[i] != NEXT_STEP_LATTICE[i] ) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}


7.2. Running the simulation

In order for the CUDA compiler and then the GPU to execute the code correctly, the programmer has to follow some conventions regarding the program structure. For instance, functions to be executed on the GPU have to be prefixed with the __global__ or __device__ keyword. Moreover, a call to a GPU function has to be made with the <<<gridDim, blockDim>>> syntax. The framework for executing code on the GPU is shown in Listing 5.

Listing 5. Exemplary foundation of GPU-executed code

// Imports
// Helper definitions etc.
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // Code to be executed by GPU
    while (monte_carlo_steps < MAX_MCS) {
        if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
            // stop when lattice is in ferromagnetic state
            break;
        }
        // Rest of the simulation code
    }
}
// ...
int main(int argc, char *argv[]) {
    // Initializations ...
    generate_kernel<<<gridDim, blockDim>>>(
        devMTGPStates,
        DEV_LATTICES,
        DEV_NEXT_STEP_LATTICES,
        DEV_MCS_NEEDED,
        DEV_BOND_DENSITY
    );
    // Obtaining results
    // Cleanup
}

7.3. Solution space

An important difference from the CPU version is the use of (X, Y, Z), which denote the position of the thread in the logical structure provided by the CUDA architecture. Threads can be organized inside 3D structures called blocks and indexed using a "Cartesian" combination of {x, y, z}; inside a kernel these coordinates are referenced with threadIdx.{x,y,z}, and the position of the block within the grid with blockIdx.{x,y,z}. The grid is also a 3D structure, and its dimensions can be referenced inside a kernel with gridDim.{x,y,z}. This structuring is provided for the programmer's convenience and is related to GPUs being devices meant for 2D and 3D graphics processing, where such "Cartesian" decomposition is quite natural. Although blocks and grids are logical structures, they are associated with physical properties of GPUs. This fact can (and should, whenever possible) be used for problem decomposition in order to optimize runtime performance.

Here, (X, Y, Z) correspond to (MIU, SIGMA, W0), which are distributed over (blockIdx.x, blockIdx.y, threadIdx.x). This was done in order to keep a relatively small number of threads in the block (see subsection 7.4). By this convention each thread can calculate its own set of values of (MIU, SIGMA, W0). Listing 6 shows how a thread can map its coordinates onto the initial parameters of the simulation. For instance, a thread with blockIdx == (100,100,0) will be executing simulations for MIU=1.0 and SIGMA=0.5 if MIU_SIZE=100 and SIGMA_SIZE=200.

Listing 6. Simulation parameters computation for each thread

#define MIU_START 0.0
#define MIU_END 1.0
#define MIU_SIZE 10
#define SIGMA_START 0.0
#define SIGMA_END 1.0
#define SIGMA_SIZE 10
// ...
#define X blockIdx.x
#define Y blockIdx.y
#define Z threadIdx.x
#define MAX_X MIU_SIZE
#define MAX_Y SIGMA_SIZE
// ...
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    float C = TRIANGLE_DISTRIBUTION(X / MAX_X, Y / MAX_Y);
    // ...
}
int main(int argc, char *argv[]) {
    dim3 blockDim(W0_SIZE, 1, 1);
    dim3 gridDim(MIU_SIZE, SIGMA_SIZE, 1);
    // ...
    generate_kernel<<<gridDim, blockDim>>>(
        // ...
    );
    // ...
}


7.4. Random Number Generators

An important part of every Monte Carlo simulation is randomness. In order for the simulation to converge to the actual result, the quality of the Random Number Generator (RNG) must be high. The de facto standard for scientific MC simulations is the Mersenne Twister10 [13]. There is a version of MT19937 optimized for GPGPU usage11 that was included in CUDA as part of the cuRAND library12. There are, however, some limitations of the built-in MTGP32 generator (a host-side setup sketch follows the list):

• 1 MTGP state per block

• Up to 256 threads per state

• Up to 200 states using included, pre-generated sequences
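Below is a hedged sketch of the corresponding host-side setup, based on the cuRAND device API documentation; NUM_BLOCKS and the seeding choice are assumptions, not values taken from the thesis:

#include <ctime>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>

#define NUM_BLOCKS 100  // one MTGP state per block; at most 200 states

int main(void) {
    curandStateMtgp32 *devMTGPStates;
    mtgp32_kernel_params *devKernelParams;

    cudaMalloc((void **)&devMTGPStates, NUM_BLOCKS * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devKernelParams, sizeof(mtgp32_kernel_params));

    // Copy the pre-generated MTGP32 parameter sets to the device ...
    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
    // ... and initialize one generator state per block.
    curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213,
                                devKernelParams, NUM_BLOCKS, time(NULL));

    // devMTGPStates can now be passed to a kernel; up to 256 threads per
    // block may call curand_uniform(&state[blockIdx.x]) concurrently.
    return 0;
}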

The MT is called with curand_uniform(state) and returns a floating point number in the range (0, 1]; the values are uniformly distributed over this range. To transform this sequence of uniformly distributed numbers into a triangle-distributed one, a special function (function-like macro) can be used (Listing 7).

Listing 7. Transformation of uniform into triangle distribution

#define TRIANGLE_DISTRIBUTION(miu, sigma) ({       \
    float start = max(miu - sigma, 0.0);           \
    float end   = min(miu + sigma, 1.0);           \
    float rand  = (                                \
        curand_uniform(&state[BLOCK_ID])           \
        + curand_uniform(&state[BLOCK_ID])         \
    ) / 2.0;                                       \
    ((end - start) * rand) + start;                \
})

10 MT19937, MT
11 MTGP, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/index.html
12 http://docs.nvidia.com/cuda/curand/device-api-overview.html

7.5. Thread per simulation - static memory

In the algorithm presented in Listing 4 the memory usage is not optimized at all. The memory is not only allocated in the global memory space, but also each time the program is run the host's memory has to be allocated and copied to the device. Listing 8 shows the inefficient memory allocations that occur in the thread-per-simulation algorithm from subsection 7.1.

Listing 8. Dynamic allocation of memory

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
)
// ...
short * DEV_LATTICES;
short * DEV_NEXT_STEP_LATTICES;
CUDA_CALL(cudaMalloc(
    &DEV_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
CUDA_CALL(cudaMalloc(
    &DEV_NEXT_STEP_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
// ...
generate_kernel<<<grid_size, block_size>>>(
    devMTGPStates,
    DEV_LATTICES,
    DEV_NEXT_STEP_LATTICES,
    DEV_MCS_NEEDED,
    DEV_BOND_DENSITY
);

If the memory is allocated inside the kernel code, the need for time-consuming copying between host and device disappears. It is possible to statically allocate memory in the device code (Listing 9).

Listing 9. Static memory allocation inside kernel

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    short LATTICE_1[LATTICE_SIZE];
    short LATTICE_2[LATTICE_SIZE];
    short * LATTICE = LATTICE_1;
    short * NEXT_STEP_LATTICE = LATTICE_2;
    // ...
}

7.6. Comparison of static and dynamic memory use

Although quite simple, this optimization does in fact improve the performance of the simulations. The results of the static vs. dynamic memory allocation are illustrated in Figure 7.1.

All of the empirical tests of the GPU code were done on a GeForce GTX 570 GPU with an Intel i7 CPU.


Figure 7.1. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.

Figure 7.2 shows the results for a range of 1 up to 60 000 concurrent simulations. The static memory approach is faster than dynamic memory in every trial conducted. Moreover, as seen in Figure 7.2, static memory tends to maintain its speedup rather than lose "velocity", as is the case for the dynamic memory approach (compare the fitted curves above 40 000 concurrent simulations).


Figure 7.2. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.


8. GPU Simulations - thread per spin

8.1. Thread per spin approach

The CUDA C Best Practices Guide13 encourages the use of multiple threads for optimal utilization of the GPU cores. In this spirit, one can apply an approach where each spin is represented by a single thread and each simulation takes up an entire block. This idea is presented in Listing 10.

Listing 10. Thread per spin algorithm

while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if ( BOND_DENSITY(LATTICE) == 0.0 ) {
            // If lattice is ferromagnetic, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i  = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if (( first_i <= threadIdx.x && threadIdx.x <= last_i )
        || ( last_i >= LATTICE_SIZE
             && ( last_i % LATTICE_SIZE >= threadIdx.x
                  || threadIdx.x >= first_i ) )
    ) {
        short left  = MOD((threadIdx.x-1), LATTICE_SIZE);
        short right = MOD((threadIdx.x+1), LATTICE_SIZE);
        // Neighbours are the same
        if ( LATTICE[left] == LATTICE[right] ) {
            NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[left];
        }
        // Otherwise randomly flip the spin
        else if ( W0 > curand_uniform(&state[BLOCK_ID]) ) {
            NEXT_STEP_LATTICE[threadIdx.x] = FLIP_SPIN(LATTICE[threadIdx.x]);
        }
        atomicAdd(&lattice_update_counter, 1);
    }
}

13 http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/


8.2. Concurrent execution

The approach presented in Listing 10 shows some new features of CUDA, namely __syncthreads(), which can be used to synchronize the execution of threads. It ensures that all threads in a block will be executing the same instruction after passing the __syncthreads() call. Exactly LATTICE_SIZE*MIU_SIZE*SIGMA_SIZE*W0_SIZE threads are launched, and each block is exactly LATTICE_SIZE threads long (Listing 11).

Listing 11. Grid and block sizes

dim3 blockDim(LATTICE_SIZE, 1, 1);
dim3 gridDim(MIU_SIZE, SIGMA_SIZE, W0_SIZE);

Each block runs a single simulation instance, and every thread in the block executes the same code. This introduces a problem: every thread would execute initialization code such as setting up W0, C, MIU etc. Some of these values (like C) are random; therefore, running this code multiple times would produce different results. A situation where even one spin of the simulation is evaluated according to a different W0 value is unacceptable. A correct initial setup can be obtained by evaluating the initialization in only one thread (Listing 12).

Listing 12. Initialization by a single thread

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        LATTICE = LATTICE_1;
        NEXT_STEP_LATTICE = LATTICE_2;
        SWAP = NULL;
        lattice_update_counter = 0;
        monte_carlo_steps = 0;
        W0 = Z / (float)MAX_Z;
    }
    __syncthreads();
    // ...
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

Concurrent execution by multiple threads makes the initialization of LATTICE easier and faster: all of the threads update their own values. The block's threads access memory in bulk and without conflicts, which is a potential source of speedup (Listing 13).

Listing 13. LATTICE initialization

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        // Initialization
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

8.3. Thread communication

To ensure thread cooperation inside a simulation, block-level communication is needed. It can be obtained by means of shared memory. Shared memory is a type of memory residing on-chip. It is about 100x faster14 than uncached global memory, and it is accessible to every thread in the block.

Listing 14 illustrates the definition of __shared__ resources inside a kernel. The CUDA compiler automatically allocates the on-chip memory for __shared__ variables only once (though the kernel is executed by every thread). All of the threads in a block access the same place in on-chip memory when accessing __shared__ data.

Listing 14. Shared memory definitions

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    __shared__ unsigned short LATTICE_1[LATTICE_SIZE];
    __shared__ unsigned short LATTICE_2[LATTICE_SIZE];
    __shared__ unsigned short first_i, last_i;
    __shared__ unsigned long long int lattice_update_counter;
    __shared__ unsigned long monte_carlo_steps;
    __shared__ float W0;

    __shared__ unsigned short * LATTICE;
    __shared__ unsigned short * NEXT_STEP_LATTICE;
    __shared__ unsigned short * SWAP;

    // Initialization of LATTICE pointers, lattice_update_counter etc.

    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

14 http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/

8.4. Race conditions with shared memory

The issue of race conditions arises when multiple threads try to write to shared memory. Writing on a GPU is usually not an atomic operation; it actually consists of 3 different operations15, e.g. the incrementation of a number consists of:

1. Reading the value

2. Incrementing the value

3. Writing the new value

During the time required to perform these steps, other threads can interrupt the execution. Fortunately, CUDA does provide the programmer with a set of atomic*() functions. atomic*() ensures that any number of threads requesting a read or write of the same memory location will be served properly.

The code presented in Listing 15 shows how to perform the lattice_update_counter incrementation so as to ensure the correctness of the results.

Listing 15. Atomic add

while (monte_carlo_steps < MAX_MCS) {
    // ...
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if (( first_i <= threadIdx.x && threadIdx.x <= last_i )
        || ( last_i >= LATTICE_SIZE
             && ( last_i % LATTICE_SIZE >= threadIdx.x
                  || threadIdx.x >= first_i ) )
    ) {
        // ...
        atomicAdd(&lattice_update_counter, 1);
    }
}

8.5. Thread per spin approach - reduction

Reduction is the process of decreasing the number of elements. This "definition", although vague, means that, having multiple elements of some sort, we apply some process to reduce the number of input elements.

15 http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions


The code presented in Listing 16 computes the bond density of LATTICE in each iteration. Moreover, this is done sequentially by only one thread, which can be inefficient.

Listing 16. Unoptimized iteration initialization

while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if ( BOND_DENSITY(LATTICE) == 0.0 ) {
            // If lattice is in ferromagnetic state, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i  = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    // ...
}

Let us recall that the bond density ρ = 0 iff16 all spins are either in the SPIN_UP or the SPIN_DOWN position. From this simple observation we can conclude that, if LATTICE is in the ferromagnetic state, the sum of all LATTICE elements (represented in code by means of 0s and 1s) is equal to 0 (all elements being zeroes) or LATTICE_SIZE (all elements being ones). Moreover, we can make use of multiple GPU threads and reuse the existing NEXT_STEP_LATTICE array, as it is not needed between iterations. In the algorithm presented in Listing 17 the sum of the LATTICE elements is calculated in log2(L) steps. In the case of L = 64 it takes 6 iterations, after which the summation result is stored in NEXT_STEP_LATTICE[0].

Listing 17. Parallel reduction

for (int i = LATTICE_SIZE / 2; i != 0; i /= 2) {
    if (threadIdx.x < i) {
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[(threadIdx.x + i)];
    }
    __syncthreads();
}

The approach from Listing 17 can be extended to the process of calculating the actual BOND_DENSITY(LATTICE). This method (again using NEXT_STEP_LATTICE as an auxiliary array) is presented in Listing 18.

Listing 18. Parallel reduction to calculate BOND_DENSITY(LATTICE)

__syncthreads();
NEXT_STEP_LATTICE[threadIdx.x] = 2 * abs(LATTICE[threadIdx.x]
                                         - LATTICE[(threadIdx.x + 1) % LATTICE_SIZE]);
__syncthreads();
for (int i = LATTICE_SIZE / 2; i > 0; i /= 2) {
    if (threadIdx.x < i) {
        // Use NEXT_STEP_LATTICE as cache array
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
    }
    __syncthreads();
}

16 iff - if and only if

8.6. Thread per spin approach - flags

A possible way of optimizing the performance is to avoid calculating the bond density during each execution of the while loop altogether. If during an update iteration none of the spins were updated, then LATTICE was simply shallow-copied into NEXT_STEP_LATTICE. One can suspect that this behavior is caused by the lattice being in a stationary state. If the stationary state in question is one of the ferromagnetic states, the simulation can be stopped.

Listing 19 introduces a new variable: lattice_update_counter_iter. This variable holds the information about how many spins were actually changed during the simulation iteration. If a change did occur, then BOND_DENSITY(LATTICE) will not be executed at all: the condition lattice_update_counter_iter == 0 is evaluated first, and if it is not satisfied then (because C's && operator short-circuits) the part after && is never reached. If, however, no change occurred (lattice_update_counter_iter == 0) and the lattice is in the ferromagnetic state (BOND_DENSITY(LATTICE) == 0.0), the simulation should stop. Unfortunately, break; applies only to the thread with threadIdx.x == 0. In order to have the other threads stop their work, we can reuse the check already performed by each thread before starting the actual work, that is, set monte_carlo_steps = MAX_MCS. In this way we prevent the other threads from executing (and potentially interfering with the results).

Listing 19. Thread per spin with flags

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // Shared memory definitions
    __shared__ unsigned int lattice_update_counter_iter;
    // Simulation initialization
    if (threadIdx.x == 0) {
        // ...
        lattice_update_counter_iter = 0;
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        __syncthreads();
        if (threadIdx.x == 0) {
            // Iteration initialization
            if ( lattice_update_counter_iter == 0
                 && BOND_DENSITY(LATTICE) == 0.0 ) {
                // If ferromagnetic, simulation can stop
                monte_carlo_steps = MAX_MCS;
                break;
            }
            lattice_update_counter_iter = 0;
        }
        __syncthreads();
        NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
        if (( first_i <= threadIdx.x && threadIdx.x <= last_i )
            || ( last_i >= LATTICE_SIZE
                 && ( last_i % LATTICE_SIZE >= threadIdx.x
                      || threadIdx.x >= first_i ) )
        ) {
            // Iteration update
        }
        if (NEXT_STEP_LATTICE[threadIdx.x] != LATTICE[threadIdx.x]) {
            atomicAdd(&lattice_update_counter_iter, 1);
        }
    }
}

8.7. Thread-per-spin performance

As seen in Figure 8.1, each improvement over the basic thread-per-spin method introduces some speedup. Noteworthy is the performance gap between the use of reduction and flags. Apparently, using a flag to avoid the per-iteration calculation of BOND_DENSITY(LATTICE) is significantly faster than the version equipped with the highly optimized BOND_DENSITY(LATTICE) algorithm.


Figure 8.1. Speedups of the thread-per-spin variants with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.

8.8. Thread-per-spin vs thread-per-simulation performance

The tests presented in Figure 8.2 and Figure 8.3 show a comparison between the suggested approaches. The unoptimized thread-per-spin approach turns out to be faster than thread-per-simulation in every test under 20 000 concurrent simulations. Threads on the GPU do not have as powerful a processor at their disposal as those run on the CPU. This leads to the conclusion that most of the tasks conducted on the GPU should be split into separate threads to parallelize the execution, even at the expense of increased communication time. However, above the 20 000 simulations threshold, the overhead caused by the huge number of threads and RNG instances makes thread-per-spin perform worse than the thread-per-simulation approaches.


Figure 8.2. Execution times of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.

Figure 8.3. Execution times of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.


As in the case of the execution times, the speedups of thread-per-simulation are also greater than those of the thread-per-spin approach (Figure 8.4 and Figure 8.5). For massive amounts of concurrent threads, thread-per-spin simulations perform relatively well, gaining speedups of about 8-9x. Thread-per-simulation, on the other hand, shows an impressive speedup of up to 28x. For a low number of threads, the thread-per-spin approach shows a better speedup below the 20 000 threshold. However, for bigger simulations (25 000 and more) thread-per-simulation shows more promising results.

Figure 8.4. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.


Figure 8.5. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 runs conducted. The fitted curves are 4th-degree polynomials.

9. Bond density for some W0 values

The calculations made during this project helped develop some insight into how the triangular distribution can affect the phase transition. Some exemplary bond density (ρ) plots are presented in Figure 9.1 and Figure 9.2.


Figure 9.1. Bond density after 10^6 MCS, W0 = 0.9, MIU = [0, 0.25, 0.5, ..., 1] and SIGMA = [0, 0.25, 0.5, ..., 1]

Figure 9.2. Bond density after 10^6 MCS, W0 = 0.6, MIU = [0, 0.25, 0.5, ..., 1] and SIGMA = [0, 0.25, 0.5, ..., 1]


10. Conclusions

CUDA does in fact expose an easy-to-use environment for harnessing the power of present-day GPGPUs. The realization of this project helped in speeding up complex and time-consuming calculations that take days on high-end CPUs. It also helped in getting to know the CUDA compiler and its most useful libraries.

Another important (although not previously mentioned) element of this study was the usage of scripting languages. Technologies such as Python17 enable easy work distribution across GPGPU workstations, harvesting the results, processing the data18 and plotting19 the results for easy pattern recognition and presentation. Unfortunately, the GPU architecture requires the programmer to really know the underlying hardware and various programming techniques in order to obtain optimal performance.

11. Future work

In the future, the developed CUDA program could be used to drive a fully featured study of the physical phenomena described in section 4. In order to do that, more detailed data has to be gathered, including improved data resolution and a higher number of averages.

17 http://www.python.org/
18 http://www.numpy.org/
19 http://matplotlib.org/


References

[1] C. Coulon, et al. Glauber dynamics in a single-chain magnet: From theory to real systems, Phys. Rev. B 69 (2004)

[2] L. Bogani, et al. Single chain magnets: where to from here?, J. Mater. Chem., 18 (2008)

[3] H. Miyasaka, et al. Slow Dynamics of the Magnetization in One-Dimensional Coordination Polymers: Single-Chain Magnets, Inorg. Chem., 48 (2009)

[4] R. O. Kuzian, et al. Ca2Y2Cu5O10: the first frustrated quasi-1D ferromagnet close to criticality, Phys. Rev. Lett., 109 (2012)

[5] K. Sznajd-Weron and S. Krupa. Inflow versus outflow zero-temperature dynamics in one dimension, Phys. Rev. E 74, 031109 (2006)

[6] F. Radicchi, D. Vilone, and H. Meyer-Ortmanns. Phase Transition between Synchronous and Asynchronous Updating Algorithms, J. Stat. Phys. 129, 593 (2007)

[7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki. Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012)

[8] I. G. Yi and B. J. Kim. Phase transition in a one-dimensional Ising ferromagnet at zero temperature using Glauber dynamics with a synchronous updating mode, Phys. Rev. E 83, 033101 (2011)

[9] M. Evans, N. Hastings, B. Peacock. Statistical Distributions, 3rd ed., New York: Wiley, pp. 187-188 (2000)

[10] E. Ising. Beitrag zur Theorie des Ferromagnetismus, Z. Phys. 31: 253-258 (1925)

[11] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller. Equations of State Calculations by Fast Computing Machines, Journal of Chemical Physics 21 (6): 1087-1092 (1953)

[12] W. Lenz. Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern, Physikalische Zeitschrift 21: 613-615 (1920)

[13] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3-30 (1998)
