1
GTC Workshop Japan 2011 (2011 7 22 日) Multi-GPU Computing of Large Scale Phase-Field Simulation for Dendritic Solidification on TSUBAME 2.0 Takashi Shimokawabe 1) , Takayuki Aoki 1) , Tomohiro Takaki 2) , and Akinori Yamanaka 1) 1) Tokyo Institute of Technology (2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550) 2) Kyoto Institute of Technology (Gosyokaido-cyo, Matsugasaki, Sakyo-ku, Kyoto, 606-8585) Key Words: CUDA, Phase-field method, Multi-GPU computing, Dendritic solidification 1 Introduction The mechanical properties and performance of metal mate- rials depend on the intrinsic microstructures in these materials. In materials science and related areas, the phase-field method is widely used as one of the powerful computational methods to simulate the formation of complex microstructures such as den- drites during solidification and phase transformation of metals. The phase-field model consists of a large number of complex nonlinear terms comparing with other stencil computation. As a result, the phase-field simulations require large-scale compu- tation, such as multi-GPU computing, to perform realistic three- dimensional simulations in the typical scales of the microstruc- tural pattern. 2 Multi-GPU implementation and results In this simulation, the time integration of the phase field and the solute concentration are carried out by the second-order finite difference scheme for space with the first-order forward Euler- type finite difference method for time on a three-dimensional regular computational grid. We decompose the whole computational domain in both the y- and z-directions (2D decomposition) and allocate each sub domain to a GPU since 3D decomposition tends to degrade the GPU performance that we are able to achieve. Similar to conven- tional multi-CPU calculations with MPI, multi-GPU computing requires boundary data exchanges between sub domains. Be- cause a GPU cannot directly access to the global memory of the other GPUs, host CPUs are used to bridge GPUs for the data ex- change between the neighbor GPUs. This process is composed of the following three steps: (1) the data transfer from GPU to CPU using the CUDA runtime library, (2) the data exchange be- tween nodes with the MPI library, and (3) the data transfer back from CPU to GPU with the CUDA runtime library. Figure 1 demonstrates the dendritic growth during the binary alloy solidification using the mesh size of 768 × 1632 × 3264 on 1156 GPUs of TSUBAME 2.0. To evaluate the performance of our multi-GPU code for the phase-field simulations, we use the TSUBAME 2.0 supercom- puter at the Tokyo Institute of Technology. The TSUBAME 2.0 supercomputer in Tokyo Institute of Technology is equipped with 4224 NVIDIA Tesla M2050 GPUs. Each node of TSUB- AME 2.0 has three Tesla M2050 attached to the PCI Express Fig. 1: Dendritic growth in the binary alloy solidification bus 2.0 ×16 (8 GB/s), two QDR InfiniBand and two sockets of the Intel CPU Xeon X5670 (Westmere-EP) 2.93 GHz 6-core. Figure 2 shows that the performance of the phase-field simula- tion by the multi-GPU computing using TSUBAME 2.0 in both double- and single-precision. The performance of 1.017 PFlops is achieved for the mesh size of 4, 096 × 6, 500 × 10, 400 using 4,000 GPUs with 16,000 CPU cores of TSUBAME 2.0 in single precision. Number of GPUs 10 2 10 3 10 Performance [TFlops] 1 10 2 10 3 10 GPU, Single Precision GPU, Double Precision Fig. 2: Performance of multi-GPU computation REFERENCES [1] T. Aoki, S. Ogawa, and A. Yamanaka. Multiple-GPU scal- ability of phase-field simulation for dendritic solidification. Progress in Nuclear Science and Technology, in press.

Multi-GPU Computing of Large Scale Phase-Field Simulation

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

GTC Workshop Japan 2011 (2011年 7月 22日)

Multi-GPU Computing of Large Scale Phase-FieldSimulation for Dendritic Solidification on TSUBAME 2.0

Takashi Shimokawabe1), Takayuki Aoki1), Tomohiro Takaki2), and Akinori Yamanaka1)

1) Tokyo Institute of Technology (2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8550)

2) Kyoto Institute of Technology (Gosyokaido-cyo, Matsugasaki, Sakyo-ku, Kyoto, 606-8585)

Key Words: CUDA, Phase-field method, Multi-GPU computing, Dendritic solidification

1 Introduction

The mechanical properties and performance of metal mate-rials depend on the intrinsic microstructures in these materials.In materials science and related areas, the phase-field method iswidely used as one of the powerful computational methods tosimulate the formation of complex microstructures such as den-drites during solidification and phase transformation of metals.

The phase-field model consists of a large number of complexnonlinear terms comparing with other stencil computation. Asa result, the phase-field simulations require large-scale compu-tation, such as multi-GPU computing, to perform realistic three-dimensional simulations in the typical scales of the microstruc-tural pattern.

2 Multi-GPU implementation and results

In this simulation, the time integration of the phase field andthe solute concentration are carried out by the second-order finitedifference scheme for space with the first-order forward Euler-type finite difference method for time on a three-dimensionalregular computational grid.

We decompose the whole computational domain in both they- and z-directions (2D decomposition) and allocate each subdomain to a GPU since 3D decomposition tends to degrade theGPU performance that we are able to achieve. Similar to conven-tional multi-CPU calculations with MPI, multi-GPU computingrequires boundary data exchanges between sub domains. Be-cause a GPU cannot directly access to the global memory of theother GPUs, host CPUs are used to bridge GPUs for the data ex-change between the neighbor GPUs. This process is composedof the following three steps: (1) the data transfer from GPU toCPU using the CUDA runtime library, (2) the data exchange be-tween nodes with the MPI library, and (3) the data transfer backfrom CPU to GPU with the CUDA runtime library.

Figure 1 demonstrates the dendritic growth during the binaryalloy solidification using the mesh size of 768×1632×3264 on1156 GPUs of TSUBAME 2.0.

To evaluate the performance of our multi-GPU code for thephase-field simulations, we use the TSUBAME 2.0 supercom-puter at the Tokyo Institute of Technology. The TSUBAME2.0 supercomputer in Tokyo Institute of Technology is equippedwith 4224 NVIDIA Tesla M2050 GPUs. Each node of TSUB-AME 2.0 has three Tesla M2050 attached to the PCI Express

Fig. 1: Dendritic growth in the binary alloy solidification

bus 2.0 ×16 (8 GB/s), two QDR InfiniBand and two socketsof the Intel CPU Xeon X5670 (Westmere-EP) 2.93 GHz 6-core.Figure 2 shows that the performance of the phase-field simula-tion by the multi-GPU computing using TSUBAME 2.0 in bothdouble- and single-precision. The performance of 1.017PFlopsis achieved for the mesh size of 4, 096× 6, 500× 10, 400 using4,000 GPUs with 16,000 CPU cores of TSUBAME 2.0 in singleprecision.

Number of GPUs10 210 310

Perf

orm

ance

[TFl

ops]

1

10

210

310GPU, Single Precision

GPU, Double Precision

Performance of Phase-field method on TSUBAME 2.0

Fig. 2: Performance of multi-GPU computation

REFERENCES

[1] T. Aoki, S. Ogawa, and A. Yamanaka. Multiple-GPU scal-ability of phase-field simulation for dendritic solidification.Progress in Nuclear Science and Technology, in press.