Journal of Physics: Conference Series 180 (2009) 012043 (Open Access)
doi:10.1088/1742-6596/180/1/012043



GPU accelerated computing – from hype to mainstream, the rebirth of vector computing

Satoshi Matsuoka1, Takayuki Aoki1, Toshio Endo1, Akira Nukada1, Toshihiro Kato2 and Atushi Hasegawa3
1 Tokyo Institute of Technology, 2 NEC Corporation, 3 NEC Informatec Systems Ltd.

matsu@is.titech.ac.jp

Abstract. Acceleration technologies, in particular GPUs and Cell, are receiving considerable attention in modern-day HPC. Compared to classic accelerators and traditional CPUs, these devices not only exhibit higher compute density but also sport significant memory bandwidth and vector-like capabilities to stream data at 100 GB/s or more. The latter qualifies such accelerators as a rebirth of vector computing. With large-scale deployments of GPUs, such as Tokyo Tech's TSUBAME 1.2 supercomputer housing 680 GPUs in a 100-Teraflops-scale machine, we can demonstrate that even in a massively parallel setting GPUs can scale, both in dense linear algebra codes and in vector-oriented CFD codes. In both cases, however, careful algorithmic developments, especially latency hiding, are important to maximize performance.

1. Introduction

1.1 Commoditization of HPC and its acceleration, but a niche still remains
Acceleration technologies such as Cell [1], GPUs [2], ClearSpeed [3] and MDGrape [4] are receiving considerable attention in modern-day HPC. To be sure, acceleration via specialized hardware or processors is not new: various application areas such as graphics, multimedia, networking and cryptography are typically accelerated in PCs and embedded devices, and in the past HPC architectures often offered vector acceleration options (e.g. in the CM-5 and the Fujitsu AP1000), but these all failed or faded away. The recent renewed attention to acceleration in HPC, however, is driven by commodity hardware intended for other uses now being applied to typical HPC application scenarios, achieving great leaps in performance with regard to cost, power, size, etc. Let us first review how acceleration and commoditization have interplayed in HPC. Later in the article we will see a concrete embodiment of modern acceleration in our latest TSUBAME 1.2 supercomputer at the Tokyo Institute of Technology, which is the largest open-science GPU-accelerated machine to date, and we will discuss its various performance characteristics.

Commoditization fundamentally changed the position of HPC, from being a niche market to occupying 20% of the entire server market. The first sign of this came long ago, around 1980, with the availability of the 8087 acceleration option for the 8086 processor, where IEEE 754 double-precision floating point came into actual being, leading to a series of x86 processors with increasing FP and memory performance as multimedia and other needs became more prevalent in


PCs. The first Beowulf cluster, called Wiglaf, was born in 1994 at Caltech/NASA, signifying the start of the commoditization of the networks and HPC software stacks needed to build a complete supercomputer; from then on, cluster technology has matured to the extent that we can build petascale machines embodying tens of thousands of CPU cores. Such proliferation of commodity technology in HPC has, over time, enabled its mainstream adoption, not only saving costs in building large-scale machines but also realizing integration and smooth growth paths across the whole HPC ecosystem, from simple PCs and small few-node clusters all the way up to petascale machines. It is now commonplace to develop and run the same application both on one's office machine and on the largest supercomputers using thousands of CPU cores, with just the size of the problem being different.

However, even today various vendors, including IBM, Cray, NEC and Fujitsu, sell custom-made supercomputers, and more are under development. Given the cost and other technical advantages of commodity clusters, such "dinosaurs" should have disappeared a long time ago; however, if one observes benchmarks such as the TOP500 or the HPC Challenge, such machines dominate the top ranks. There are several reasons for this, including the ones below.
(1) High-performance, highly efficient vector processing: although the vector processing capabilities in

x86 processors (SSE) and others (e.g. AltiVec on PowerPC) have dramatically improved FP performance for single-threaded applications, they are still no match for dedicated vector machines such as the NEC SX. There are several reasons for this, including much wider vectors (8~16 vs. 2~4) and deeper vector pipelines, augmented by significantly greater memory bandwidth as well as random-access capability. Although in the past increases in both processor and memory clock rates helped to improve the single-thread performance of commodity CPUs, such increases have virtually come to a halt since around 2005, when multicore parallelism became the dominant methodology for increasing CPU performance. As such, expensive vector machines are still preferred for applications such as CFD, where the algorithm is memory bound, or time-dependent codes, where there is inherent serialization in the algorithm itself.

(2) Power/installation/maintenance efficiency: even if one were to overcome the inefficiency of PC clusters by exploiting more parallelism, various costs such as physical power, space, failures and maintenance may become overwhelming. Modern large supercomputers consume more than a megawatt of power, greatly reducing their installation possibilities due to the costs involved. In such situations a dedicated design that reduces the number of extra components, and thus improves failure rates, power consumption, etc., may be desirable.

Accelerators may solve the above two problems while being merely additive to commodity clusters, thus preserving their cost advantage while addressing the niche advantages of dedicated supercomputers. Supercomputers such as LANL's Roadrunner and Tokyo Tech's TSUBAME, which are clusters augmented with extensive commodity acceleration, already exist. In fact Roadrunner, based on a combination of AMD Opteron and IBM/Sony/Toshiba PowerXCell processors, became the first supercomputer ever to achieve a Petaflop in 2008.

The differences between the failed acceleration attempts in HPC in the past and the possible success in modern times are twofold. Firstly, modern-day accelerators are designed as part of the commodity PC ecosystem. Even customized hardware accelerators such as MDGrape and ClearSpeed are designed around standard fast I/O buses such as PCI-Express, and fit within standard high-performance cluster nodes as hosts, where the bulk of the code still runs on the PC while time-critical bottlenecks are executed on the accelerators. Cell and GPUs are more fundamentally commodity, in that they were developed with high-end embedded devices and multimedia gaming as target application areas, with an abundance of both hardware and software in the accompanying PC ecosystem. As such, leveraging this ecosystem has exactly the same characteristics as leveraging commodity clusters, with a possibly similar success scenario.


Secondly, as discussed above, up until 2005 the continued exponential speedup of the single-thread performance of CPUs meant that any speedup from acceleration was quickly caught up with and exceeded; this trend has now hit a roadblock, and extensive acceleration should sustain its advantage over conventional CPUs, in terms of the resources required, for a long time. For example, on a single node of TSUBAME sporting 8 Opteron 880s (2.4 GHz x 16 cores), the best 3-D FFT performance utilizing all 16 cores has been approximately 20 Gigaflops, in both single and double precision, using the most optimized code available. Contrastingly, a single GPU card within a Tesla S1070 records over 150 Gigaflops in single precision, and 40 Gigaflops in double precision, using the NU-FFT [5] we developed. Moreover, when we parallelize across nodes, we need more than 16 nodes (= 256 cores) just to recover the parallelization overhead and match the performance of a single GPU. This difference is expected to worsen, as GPU performance is expected to skyrocket both in terms of pure flops and memory bandwidth, and most importantly to excel in cost and flops/watt as a result.

1.2 Commodity vector acceleration – the rebirth of vector computing
Overall, it is fair to state that GPUs and Cell are receiving the most attention as the accelerators of choice, primarily because of their commodity nature, but also because of their high compute density as well as memory bandwidth. The IBM/Toshiba/Sony PowerXCell 8i sports 102.8 GigaFlops of double-precision peak performance as well as 25 GB/s of memory bandwidth. GPU absolute performance is more dramatic, with both the latest AMD FireStream and NVIDIA Tesla GPUs sporting over 1 TFlops of (albeit single-precision) performance and over 100 GB/s of memory bandwidth, using a multithreaded SIMD-vector processor array architecture augmented by a fast point-to-point memory interconnect. Such memory bandwidth effectively rivals that of the most expensive NEC SX series.

Figure 1. The GPU as a "rebirth of vectors" in HPC, supporting both high computational density and high memory access frequency.

Such a situation is depicted in figure 1. Here the Y-axis is the compute density of an application and the X-axis is its memory access frequency. HPC applications naturally sit far from the origin, and can be categorized as (a) those with high computational density, such as N-body or dense linear algebra, situated towards the upper part of the graph, and (b) those with high memory access frequency, such as CFD and FFT, situated towards the right. Standard CPU performance is by and large mediocre, appropriate for applications near the center, and does not efficiently support (a) or (b). Modern accelerators such as GPUs and Cell, by contrast, excel at both (a) and (b) because of their


architectural properties described above. As such, modern accelerators could serve as general HPC accelerators for various types of codes, not just specific ones as in the past.

In fact, both GPUs and Cell can be regarded as a rebirth of vector computing. Although different from traditional architectures with deep vector pipelining, both architectures are nonetheless quite amenable to the workloads where vector computers excelled in the past. Indeed, modern algorithms intended to run on cache-oriented standard CPUs often do not work well, due to their reliance on short memory-access latency and on locality enabled by extensive caching and the associated blocking. Instead, classic algorithms designed for long-vector-length "streaming" memory access work quite well. Our recent bandwidth-intensive FFT work [5] is a manifestation of this, in that we principally employ the classic vector-oriented multi-row FFT algorithm to compute the FFT along the Y- and Z-axes (but not along the X-axis; for details refer to [5]).
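To make the streaming-access point concrete, below is a minimal CUDA sketch of one radix-2 butterfly stage of a Y-axis FFT over an NX x NY slab stored with X contiguous (our own illustration, not the kernel of [5]; input is assumed bit-reversed along Y and NY a power of two). Because consecutive threads walk along X, every load and store is a long coalesced stream, exactly the access pattern the classic multi-row vector formulation was designed for.

```
#include <cuComplex.h>
#include <math_constants.h>

// One butterfly stage along Y; launch once per stage with half = 1, 2, 4, ..., NY/2:
//   dim3 grid((nx + 255) / 256, ny);
//   butterflyY<<<grid, 256>>>(d_data, nx, ny, half);
__global__ void butterflyY(cuFloatComplex *data, int nx, int ny, int half)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // contiguous in memory -> coalesced
    int y = blockIdx.y;                              // candidate "upper" butterfly partner
    if (x >= nx || (y & half) || y + half >= ny) return;

    float ang = -CUDART_PI_F * (y & (half - 1)) / half;          // twiddle for this pair
    cuFloatComplex w = make_cuFloatComplex(cosf(ang), sinf(ang));
    cuFloatComplex a = data[(size_t)y * nx + x];
    cuFloatComplex b = data[(size_t)(y + half) * nx + x];
    cuFloatComplex t = cuCmulf(w, b);
    data[(size_t)y * nx + x]          = cuCaddf(a, t);
    data[(size_t)(y + half) * nx + x] = cuCsubf(a, t);
}
```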

2. Acceleration in action: TSUBAME 1.2 exemplars
As we have seen, accelerators show great promise of becoming the next dominant resource providing the majority of the compute capability in HPC, serving both dense computations and vector-friendly streaming codes. This is not to say, however, that this will occur automatically. With commodity clusters it took years of continued research and development effort (still continuing to date) to mature to the point of being usable in production, especially at large scale. Similarly, since the development of commodity accelerators has focused on their principal market, i.e. non-HPC single-node applications, considerable R&D as well as large-scale deployments are required for them to become truly mainstream in HPC.

Here we present a concrete example of the largest high-performance GPU deployment to date on an open science supercomputer, TSUBAME 1.2 at the Tokyo Institute of Technology, and some early results in scaling representative large parallel benchmark codes.

Figure 2. The TSUBAME 1.2 GPU accelerated supercomputer.

The original TSUBAME 1.0 was deployed as a 10,480-CPU-core cluster at the Tokyo Institute of Technology GSIC Center in the spring of 2006 [6]. In addition to the 5,240 2.4 GHz dual-core AMD Opteron CPUs in 655 nodes (8-socket Sun X4600 nodes), providing approximately 50.4 Teraflops of peak FP performance, it also sported 360 ClearSpeed CSX600 Advance accelerator boards, one card per node for about half of the nodes, providing approximately 30 Teraflops of additional compute power. Over the years additional ClearSpeed boards as well as Intel "Harpertown" Xeon nodes were added, pushing the total peak performance up to approximately 105 Teraflops.

TSUBAME 1.2 was implemented similarly, as a follow-on upgrade to the existing TSUBAME, adding 170 NVIDIA Tesla S1070 units, each embodying four Tesla GPU cards, for a total of 680 cards. This boosted the single-precision FP performance of TSUBAME to nearly 900


Teraflops, and the double-precision performance to approximately 160 Teraflops. The added power and thermal load on TSUBAME is only about 150 kW, roughly 1/8th to 1/10th that of TSUBAME itself.

Speedups of various libraries and applications using GPUs have been quite astounding on a single GPU, as exemplified by the FFT speedup described in Section 1.1. We have utilized such capabilities, for example, to greatly speed up sophisticated applications such as all-to-all 3-D protein docking, achieving speed equivalent to our earlier published result on BlueGene/L [7] with almost four times better energy efficiency.

Here we present our current results on two multi-node large-scale benchmark codes: one dense linear algebra code (Linpack) and one sparse finite-difference code (the multi-node Himeno benchmark).

2.1 Linpack on a CPU-GPU-ClearSpeed heterogeneous configuration
One of the major issues in heterogeneous supercomputers equipped with a large batch of GPUs/accelerators is how users develop parallel programs that effectively use the hybrid computing resources, especially tightly coupled programs. Here we take the High Performance Linpack (HPL) benchmark [8] and describe how it is heterogeneously accelerated on TSUBAME 1.2. For implementation details refer to our previous paper [9].

2.1.1 High Performance Linpack (HPL). HPL is a well-known MPI-based parallel benchmark used in the Top500 ranking, solving a set of dense linear equations using a direct method. It employs a block-based solver algorithm so as to harness a highly tuned underlying Level 3 BLAS library; in particular, a fast matrix-multiply (DGEMM) function is the key to obtaining good HPL performance. Parallelization is based on a two-dimensional block-cyclic distribution of the matrix, where each MPI process owns a sub-matrix of almost the same size; as such, HPL is designed for homogeneous environments.

Thus there are several challenging issues in porting HPL onto TSUBAME 1.2:
(1) All the processors (CPUs, GPUs and ClearSpeed boards) should be used cooperatively for the DGEMM kernel computation (intra-node heterogeneity). On TSUBAME the ratios of theoretical peak computation performance for CPUs, GPUs and ClearSpeed are roughly 35%, 33% and 32%, respectively; therefore omitting any one type of processor causes heavy performance degradation.

(2) We also have to consider inter-node heterogeneity: only 312 out of the 648 nodes have two Tesla GPUs each and the rest have none, due to the configuration constraints of the Tesla S1070.

(3) Finally, PCI-e/PCI-X communication overhead should not be underestimated, since the matrix data is basically allocated in host memory.

2.1.2 Kernel functions and matrix block size. In HPL the most time-consuming part of the overall benchmark is the DGEMM function calls that multiply an (M' x B) matrix by a (B x N') matrix, where B is a tuneable block-size parameter. To maintain a favorable communication-to-computation ratio, B should be sufficiently large.
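As a rough, back-of-the-envelope illustration (the sub-matrix extents below are assumed for illustration, not measured TSUBAME values): since the matrix resides in host memory, a delegated DGEMM must move the M' x B and B x N' panels plus the M' x N' block of C across PCI-Express while performing 2*M'*N'*B flops, so the flops performed per byte transferred grows roughly linearly with B.

```
#include <stdio.h>

int main(void)
{
    const double Mp = 50000.0, Np = 50000.0;   /* assumed local sub-matrix extents */
    const int    Bs[] = { 288, 576, 1152, 2304 };
    for (int i = 0; i < 4; ++i) {
        double B     = Bs[i];
        double flops = 2.0 * Mp * Np * B;
        /* A and B panels in, C block in and out, all in double precision (8 bytes) */
        double bytes = 8.0 * (Mp * B + B * Np + 2.0 * Mp * Np);
        printf("B = %4d : %6.1f flops per byte transferred\n", Bs[i], flops / bytes);
    }
    return 0;
}
```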

As the CPU BLAS library we use GotoBLAS [10]. For the GPUs we use a matrix-multiply function written from scratch by our group, in order to achieve maximum efficiency and at the same time allow asynchronous double buffering to hide the data-transfer latency, a feature not available in NVIDIA's native CUBLAS. Its on-board performance is about 80 GFlops, and we are able to almost completely hide the transfer latency to effectively achieve this speed. For ClearSpeed, the CSXL BLAS library by ClearSpeed Inc. (DGEMM speed 63 GFlops) is used. Through preliminary experiments we found that B = 1152 is the optimal block size. A sketch of the double-buffering idea follows.
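The sketch below shows the double-buffering structure under our own simplifying assumptions (a naive placeholder kernel, column-major storage, the C block resident on the GPU, pinned host buffers); it is not the authors' hand-tuned implementation. The B x N' panel is streamed to the GPU in column chunks on two alternating CUDA streams, so the PCI-Express transfer of one chunk overlaps with the matrix multiply on the previous one.

```
#include <cuda_runtime.h>

// Naive column-major DGEMM placeholder: C(M x n) += A(M x K) * B(K x n).
__global__ void dgemm_naive(const double *A, const double *B, double *C,
                            int M, int n, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= n) return;
    double acc = 0.0;
    for (int k = 0; k < K; ++k)
        acc += A[row + (size_t)k * M] * B[k + (size_t)col * K];
    C[row + (size_t)col * M] += acc;
}

// hostB must be pinned (cudaHostAlloc) for the copies to truly overlap with compute.
void dgemm_double_buffered(const double *hostB, const double *devA, double *devBbuf[2],
                           double *devC, int M, int N, int K, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int j = 0, buf = 0; j < N; j += chunk, buf ^= 1) {
        int n = (j + chunk <= N) ? chunk : N - j;
        // Stage the next column chunk of the B panel while the other stream still computes.
        cudaMemcpyAsync(devBbuf[buf], hostB + (size_t)j * K, sizeof(double) * (size_t)K * n,
                        cudaMemcpyHostToDevice, s[buf]);
        dim3 block(16, 16), grid((n + 15) / 16, (M + 15) / 16);
        dgemm_naive<<<grid, block, 0, s[buf]>>>(devA, devBbuf[buf], devC + (size_t)j * M,
                                                M, n, K);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```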

2.1.3 Coping with heterogeneity. On TSUBAME 1.2 we are faced with both intra- and inter-node heterogeneity, as described above. We load-balance the DGEMM kernels running on the different CPUs/accelerators based on the following strategies; by contrast, non-kernel computations such as pivoting and row exchanges are simply done on the x86 CPUs.

Intra-node heterogeneity: we conceptually regard all the hybrid processors on each node as a farm that provides a kernel-computation facility for processes. Each process may delegate some fraction of its DGEMM computation to accelerators, while the remaining fraction is computed by the CPU-based BLAS, as in figure 3.

Inter-node heterogeneity: with the above model, inter-node heterogeneity appears as an imbalance of DGEMM performance among nodes. To keep the performance per process almost identical, we configure the number of processes per node to reflect each node's compute capability. (A toy sketch of the proportional intra-node split is given after figure 3.)

Figure 3. Mapping between processes and processors (CPUs, GPUs, ClearSpeed). The left figure shows a node with GPUs; the right one shows a node without GPUs.
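A toy sketch of that proportional split follows. The GPU and ClearSpeed DGEMM rates are the ones quoted in the text; the CPU rate and the column count are placeholders, not measured TSUBAME figures. The columns of a delegated update are divided among the CPU BLAS, the GPUs and the ClearSpeed board in proportion to their sustained throughput, so all devices finish at roughly the same time.

```
#include <stdio.h>

int main(void)
{
    const char  *dev[]    = { "CPU cores (GotoBLAS)", "GPU 0", "GPU 1", "ClearSpeed" };
    const double gflops[] = { 60.0, 80.0, 80.0, 63.0 };   /* CPU figure is assumed */
    const int    ndev     = 4;
    const int    Ncols    = 11520;                        /* columns of the trailing update */

    double total = 0.0;
    for (int i = 0; i < ndev; ++i) total += gflops[i];

    int assigned = 0;
    for (int i = 0; i < ndev; ++i) {
        /* the last device takes the remainder so every column is covered exactly once */
        int cols = (i == ndev - 1) ? Ncols - assigned
                                   : (int)(Ncols * gflops[i] / total);
        printf("%-22s -> %5d columns\n", dev[i], cols);
        assigned += cols;
    }
    return 0;
}
```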

2.1.4 Other techniques

We observed that PCI-e/PCI-X communication consumes a considerable CPU load; as a result, conducting CPU-accelerator communication and computing DGEMM on the same cores incurs considerable performance degradation. Instead, we assign several dedicated cores solely to communication (the black cores in figure 3). Although we lose considerable performance through this loss of compute cores (over 10 Teraflops for TSUBAME 1.2), it is an overall win once acceleration is taken into account.
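The core dedication can be pictured with the following minimal Pthreads skeleton (ours, with an assumed core numbering, not the production code): one thread is pinned to a reserved core and handles all CPU-accelerator transfers, while the BLAS worker threads are pinned to the remaining cores, so PCI-e/PCI-X traffic never steals DGEMM cycles.

```
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *comm_thread(void *arg)
{
    (void)arg;
    pin_to_core(15);   /* assumed: the last core of the node is reserved for communication */
    /* ... service loop: dequeue transfer requests and drive the PCI-e/PCI-X copies ... */
    return NULL;
}

int main(void)   /* launch skeleton only */
{
    pthread_t t;
    pthread_create(&t, NULL, comm_thread, NULL);
    /* DGEMM worker threads would be created and pinned to cores 0..14 here */
    pthread_join(t, NULL);
    return 0;
}
```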

In the original code, DGEMM calls are sometimes fragmented when communication and DGEMM overlap. We have eliminated such fragmentation by using Pthreads.

We have also implemented overlapping between pivot-row communication and DGEMM computation, which was not in the original code but is effective for accelerated machines due to the inherent parallelism between CPUs and accelerators.

2.1.5 Performance evaluation. In running the whole-machine Linpack we used 648 of the 655 TSUBAME nodes, each of which has eight dual-core Opteron 880 processors and a ClearSpeed accelerator, with 312 of those nodes also embodying two Tesla GPUs. Additionally, another production Xeon cluster called TSUBASA was integrated and used cooperatively; there we used 80 nodes, each with two quad-core Xeon E5440 processors. Combining all these hybrid resources we achieved 87.01 TFlops. Since the total peak speed is 163 TFlops, the efficiency is 53%, partly due to the performance losses mentioned above (assigning dedicated CPU cores to support multiple accelerators) as well as compromises such as the block size (B = 1152). Figure 4 compares peak speed and Linpack speed for each category of processor. Interestingly, we observe that all the processors show similar efficiency (48% to 56%), so the resulting share of runtime performance largely matches that of the theoretical peak performance. The GPU is the most efficient (56%) in a more careful comparison, but it is unclear whether this is inherent or due to compromises weighing in favour of the GPU, and we are conducting follow-up research to clarify this. The figure also shows electrical power consumption: while the accelerators provide 66% of the computation performance, they consume only about 15% of the power of the entire machine.


Figure 4. Performance ratios for each kind of processor. Peak speed (double precision), contribution to Linpack speed, and electrical power consumption are shown.

Figure 5. History of TSUBAME in the Top500.

Figure 5 shows the history of our Linpack experiments on TSUBAME, in particular the achieved Teraflops and rank in the Top500. Ever since the original TSUBAME was installed in 2006, processor resources have been steadily increased through system upgrades, reflected in a continuous improvement in every Top500 list since its inception. Since the performance with Opteron CPUs alone was 38.18 TFlops, acceleration technology improved the overall TSUBAME Linpack performance by a factor of 2.3.

2.2 Himeno CFD benchmark acceleration over multiple GPU nodes
The Himeno CFD benchmark is a simple 3-D CFD code that solves a Poisson equation using finite-difference Jacobi iterations. Although production-quality CFD codes are substantially more complex and employ more sophisticated methods, the Himeno benchmark is extremely bandwidth demanding and serves as a benchmark that measures a worst-case scaling scenario for bandwidth-intensive codes. Himeno benchmark scores are published for a variety of architectures on the benchmark website [11]. On a standard x86 CPU the benchmark is completely memory bound and the score is approximately 1 GB/s; contrastingly, a single-GPU implementation on an NVIDIA CUDA GPU has reported scores of over 70 GB/s (figure 6), by utilizing the high-bandwidth device memory as well as shared memory.


Figure 6. The Riken Himeno benchmark and its parallelization using CUDA GPUs.

The Himeno benchmark's main loop involves 34 FP calculations, 31 memory reads and 1 memory write per grid point. By optimization we can reduce this to 13 reads and 1 write, which is still significant and very much memory bound. There are four grid sizes, the smallest S being (65 x 65 x 129) and the largest XL being (513 x 513 x 1025). For our benchmark we slightly modified XL so that it becomes (1025 x 513 x 513), i.e. with the prolonged axis along X instead of Z, for ease of programming.
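For reference, a simplified CUDA version of that update is sketched below, with the benchmark's uniform coefficient arrays folded into scalar constants (a = 1, 1, 1, 1/6; b = 0; c = 1); the shared-memory blocking and other optimizations described next are omitted, so this is illustrative rather than tuned.

```
#define IDX(i, j, k) ((size_t)(i) * ny * nz + (size_t)(j) * nz + (k))

__global__ void himeno_jacobi(const float *p, const float *bnd, const float *wrk1,
                              float *wrk2, int nx, int ny, int nz, float omega)
{
    const float A0 = 1.0f, A1 = 1.0f, A2 = 1.0f, A3 = 1.0f / 6.0f;   // benchmark constants
    const float B = 0.0f, C = 1.0f;

    int k = blockIdx.x * blockDim.x + threadIdx.x;   // innermost, contiguous index
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.z + 1;                          // one X plane per grid layer
    if (i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1) return;

    float s0 = A0 * p[IDX(i + 1, j, k)] + A1 * p[IDX(i, j + 1, k)] + A2 * p[IDX(i, j, k + 1)]
             + B * (p[IDX(i + 1, j + 1, k)] - p[IDX(i + 1, j - 1, k)]
                  - p[IDX(i - 1, j + 1, k)] + p[IDX(i - 1, j - 1, k)])
             + B * (p[IDX(i, j + 1, k + 1)] - p[IDX(i, j - 1, k + 1)]
                  - p[IDX(i, j + 1, k - 1)] + p[IDX(i, j - 1, k - 1)])
             + B * (p[IDX(i + 1, j, k + 1)] - p[IDX(i - 1, j, k + 1)]
                  - p[IDX(i + 1, j, k - 1)] + p[IDX(i - 1, j, k - 1)])
             + C * p[IDX(i - 1, j, k)] + C * p[IDX(i, j - 1, k)] + C * p[IDX(i, j, k - 1)]
             + wrk1[IDX(i, j, k)];

    float ss = (s0 * A3 - p[IDX(i, j, k)]) * bnd[IDX(i, j, k)];
    wrk2[IDX(i, j, k)] = p[IDX(i, j, k)] + omega * ss;   // omega = 0.8 in the benchmark
}
```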

For parallelization across GPUs we use a one-dimensional block distribution along the X-axis across multiple nodes. Within a GPU we extensively parallelize and optimize the code using shared memory and coalesced memory access. Across the GPUs we overlap the MPI communication between nodes with the Jacobi iteration computation within each node's GPU.
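The overlap is organized roughly as in the following sketch (variable names and the trivial 3-point stand-in stencil are ours, not the actual implementation): each node owns a slab of X planes; on every iteration the two boundary planes are staged into pinned host buffers and exchanged with the neighbouring ranks via non-blocking MPI, while a second CUDA stream updates the interior planes, and the two halo-dependent planes are finished once the exchange completes.

```
#include <mpi.h>
#include <cuda_runtime.h>

// 3-point average along X; a stand-in for the full 19-point Himeno stencil.
__global__ void smooth_x(float *out, const float *in, int i_begin, int i_end, size_t plane)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;  // index within one plane
    int    i   = blockIdx.y + i_begin;                           // X plane for this block row
    if (idx >= plane || i >= i_end) return;
    out[i * plane + idx] = (in[(i - 1) * plane + idx] + in[i * plane + idx]
                            + in[(i + 1) * plane + idx]) / 3.0f;
}

// h_send / h_recv are pinned (cudaHostAlloc) buffers of 2 * plane floats each.
void jacobi_step_overlapped(float *d_in, float *d_out, float *h_send, float *h_recv,
                            int nx_local, size_t plane, int left, int right,
                            cudaStream_t s_comp, cudaStream_t s_copy)
{
    MPI_Request req[4];
    size_t bytes = plane * sizeof(float);

    // Stage this node's boundary planes (i = 1 and i = nx_local-2) to the host ...
    cudaMemcpyAsync(h_send,         d_in + 1 * plane,              bytes, cudaMemcpyDeviceToHost, s_copy);
    cudaMemcpyAsync(h_send + plane, d_in + (nx_local - 2) * plane, bytes, cudaMemcpyDeviceToHost, s_copy);

    // ... while the interior planes (i = 2 .. nx_local-3) are updated on the compute stream.
    dim3 grid((unsigned)((plane + 255) / 256), nx_local - 4);
    smooth_x<<<grid, 256, 0, s_comp>>>(d_out, d_in, 2, nx_local - 2, plane);

    // Exchange halos with the left/right neighbours once the staging copies have landed.
    cudaStreamSynchronize(s_copy);
    MPI_Irecv(h_recv,         (int)plane, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(h_recv + plane, (int)plane, MPI_FLOAT, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(h_send,         (int)plane, MPI_FLOAT, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(h_send + plane, (int)plane, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // Upload the received halos (i = 0 and i = nx_local-1), then finish the boundary planes.
    cudaMemcpyAsync(d_in,                          h_recv,         bytes, cudaMemcpyHostToDevice, s_copy);
    cudaMemcpyAsync(d_in + (nx_local - 1) * plane, h_recv + plane, bytes, cudaMemcpyHostToDevice, s_copy);
    cudaStreamSynchronize(s_copy);
    dim3 grid1((unsigned)((plane + 255) / 256), 1);
    smooth_x<<<grid1, 256, 0, s_comp>>>(d_out, d_in, 1, 2, plane);
    smooth_x<<<grid1, 256, 0, s_comp>>>(d_out, d_in, nx_local - 2, nx_local - 1, plane);
    cudaStreamSynchronize(s_comp);
}
```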

On TSUBAME 1.2, CPU-GPU communication bandwidth is greatly sacrificed by being based on PCI-E Gen1 x8 instead of the more modern Gen2 x16, and furthermore by two GPU cards sharing a single PCI-E interface; as such, in the worst case we have only 1/8th of the bandwidth of a standard Gen2 x16 implementation, and care must be taken to hide the latency effectively. Under these circumstances, communication of the boundary region on every Jacobi iteration takes approximately 8.2 ms. If the GPU compute granularity is greater than this, we can effectively hide the latency. However, as we increase the number of GPUs while maintaining the problem size (strong scaling), the boundary area adjacent to the next GPU does not change; only the size of the X-axis segment gets smaller, shortening the compute time.

Figure 7 (left) shows the result. We observe that the latency-hiding technique greatly improves the scalability of the code, achieving over 700 Gigaflops on 32 GPUs; at that point the aggregate memory bandwidth reaches 1.1 TB/s. Reflecting on the theoretical compute models, we observe that there is still a little room for improvement, but the benchmark is approaching the hardware limits.
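A quick consistency check using only figures quoted in this section: at 709 GFLOPS, the optimized kernel's 13 reads plus 1 write of 4-byte values per 34 flops imply on the order of 1.1-1.2 TB/s of aggregate memory traffic, in line with the bandwidth figure above.

```
#include <stdio.h>

int main(void)
{
    double flops_rate      = 709e9;        /* measured aggregate rate on 32 GPU nodes */
    double bytes_per_point = 14.0 * 4.0;   /* 13 reads + 1 write, single precision    */
    double flops_per_point = 34.0;
    printf("approx. aggregate bandwidth: %.2f TB/s\n",
           flops_rate / flops_per_point * bytes_per_point / 1e12);
    return 0;
}
```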

By doubling the problem size to XXL (figure 7 (right), 2049 x 513 x 513) we achieve a considerable improvement in scalability: scaling from 16 to 32 GPUs yields a 1.79 times speedup. Further details of the benchmark and its implementation will be the subject of another paper.


[Figure 7 data, GFLOPS versus number of GPU nodes (4 / 8 / 16 / 32). Left graph, size XL: 152 / 292 / 524 / 709 with latency hiding versus 132 / 224 / 337 / 455 without. Right graph, size XL: 152 / 292 / 524 / 709 versus size XXL: 304 / 584 / 1049 (at 8 / 16 / 32 nodes).]

Figure 7. The Riken Himeno benchmark parallelized across nodes on TSUBAME 1.2. The left graph shows the effect of latency hiding as we scale to 32 GPU nodes; the right graph shows the effect of doubling the problem size to achieve larger granularity and thus better scalability.

3. Conclusion and future work
Commodity acceleration is finally here, and may in fact come to dominate the computing capability of future supercomputers, as one of its major properties is the rebirth of vector computing, which had been largely abandoned in commodity clusters. There are already numbers of successful application ports with dramatic speedups reported in the literature, and large-scale multi-node applications and deployments are starting to materialize, with significant speedups and/or power savings being reported. TSUBAME 1.2, one of the first large-scale deployments of GPUs in a production supercomputer, is allowing us to understand the attractions as well as the limitations of the approach, and what technical problems still lie ahead. Even as we speak, many GPU-enabled clusters are being planned or deployed, but without sufficient software support they are reminiscent of early clusters.

As future work we plan to conduct extensive research to help GPUs become the dominant compute resource. This research includes various systems issues as well as algorithmic and application issues, in preparation for the deployment of a petascale TSUBAME 2.0, anticipated in 2010.

References
[1] "Cell Broadband Engine Technology and Systems", IBM Systems Journal, 51-5, May 2007.
[2] Owens J D, Houston M, Luebke D, Green S, Stone J E and Phillips J C, "GPU Computing", Proc. IEEE, 96-5, May 2008, pp 879-899.
[3] ClearSpeed Inc., "ClearSpeed Whitepaper: CSX Processor Architecture", http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf, Feb 2007.
[4] Taiji M, "MDGRAPE-3 chip: a 165 Gflops Application Specific LSI for Molecular Dynamics Simulations", Proc. Hot Chips 16, IEEE Computer Society Press (CD-ROM), 2004.
[5] Akira Nukada, Yasuhiko Ogata, Toshio Endo and Satoshi Matsuoka, "Bandwidth Intensive 3-D FFT kernel for GPUs using CUDA", Proc. ACM/IEEE Supercomputing 2008 (SC2008), Austin, Texas, IEEE Press, Nov 2008.
[6] Satoshi Matsuoka, "The Road to TSUBAME and Beyond", Chapter 14 of Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Computational Science Series, pp 289-310, 2008.
[7] Akira Nukada, Yuichiro Hourai, Akira Nishida and Yutaka Akiyama, "High Performance 3D Convolution for Protein Docking on IBM Blue Gene", Parallel and Distributed Processing and Applications, Springer LNCS Vol. 4742, pp 958-969, 2007.
[8] A Petitet, R Whaley, J Dongarra and A Cleary, HPL – a portable implementation of the high-performance Linpack benchmark for distributed-memory computers, http://www.netlib.org/benchmark/hpl/
[9] Toshio Endo and Satoshi Matsuoka, "Massive Supercomputing Coping with Heterogeneity of Modern Accelerators", IEEE International Parallel & Distributed Processing Symposium (IPDPS 2008), IEEE Press, April 2008.
[10] K Goto, GotoBLAS, http://www.tacc.utexas.edu/resources/software/
[11] The Riken Himeno CFD Benchmark, http://accc.riken.jp/HPC/HimenoBMT/index_e.html

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

10

GPU accelerated computingmdashfrom hype to mainstream the rebirth of vector computing

Satoshi Matsuoka1 Takayuki Aoki1 Toshio Endo1 Akira Nukada1 Toshihiro Kato2 and Atushi Hasegawa3 1Tokyo Institute of Technology 2NEC Corporation 3NEC Informatec Systems Ltd

matsuistitechacjp

Abstract Acceleration technologies in particular GPUs and Cell are receiving considerable attention in modern-day HPC Compared to classic accelerators and traditional CPUs these devices not only exhibit higher compute density but also sport significant memory bandwidth and vector-like capabilities to stream data at bandwidth of 100 GBs or more The latter qualifies such accelerators as a rebirth of vector computing With large-scale deployments of GPUs such as Tokyo Techrsquos TSUBAME 12 supercomputer facilitating 680 GPUs in a 100-Teraflops scale supercomputer we can demonstrate that even under a massively parallel setting GPUs can scale both in dense linear algebra codes as well as vector-oriented CFD codes In both cases however careful algorithmic developments especially latency hiding are important to maximize their performance

1 Introduction

11 Commoditization of HPC and its acceleration but niche still remains Acceleration technologies such as Cell[1] GPU[2] ClearSpeed[3] MD Grape[4] are receiving considerable attention in modern-day HPC By all means acceleration via specialized hardware or processors is not new mdash various application areas such as graphics multimedia networks cryptography etc typically are accelerated in PCs and embedded devices Also in the past HPC architectures typically had vector acceleration options in CM-5 Fujitsu AP1000 etc but all failed or faded away However recently renewed attention to acceleration in HPC is driven by commodity hardware intended for other uses now being used in typical HPC application scenarios and achieving great leaps in performance with regard to cost power size etc Let us first review how acceleration and commoditization interplayed in HPC Later on in the article we will see concretization of modern acceleration in our latest TSUBAME 12 supercomputer at the Tokyo Institute of Technology which is the largest open science GPU accelerated machine to date and we will discuss its various performance characteristics

Commoditization in HPC fundamentally changed its position from being a niche market to occupying 20 of all the server markets The first sign of this was actually a long time ago historically around 1980 with the availability of the 8087 acceleration option in the 8086 processor where IEEE 754 double floating point came into actual being leading to a series of x86 processors with increasing FP and memory performance as multimedia and other needs became more prevalent in

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

ccopy 2009 IOP Publishing Ltd 1

PCs The first Beowulf cluster called Wigraf was born in 1994 at CaltechNASA signifying the start of commoditization of networks and the HPC software stacks to build a complete supercomputermdashfrom then on cluster technology has matured to the extent that we can build petascale machines embodying tens of thousands of CPU cores Such wide proliferation of commodity technology in HPC over time has allowed its widespread proliferation and mainstream adoption not only allowing one to save costs in building large-scale machines but also realizing integration and smooth growth paths in the whole HPC ecosystem from simple PCs and small few node clusters all the way up to petascale machines It is now commonplace for someone to develop and run the same application in both his office machine and the largest supercomputer using thousands of CPU coresmdashwith just the size of the problem being different

However even today various vendors including IBM Cray NEC Fujitsu etc sell custom-made supercomputers and more are in future development Given the cost and other technical advantages of commodity clusters such ldquodinosaursrdquo should have disappeared a long time ago mdash however if one would observe various benchmarks such as the TOP500 or the HPC challenge such machines dominate the top ranks There are several reasons for this including the ones below (1) High-performance highly efficient vector processing although vector processing capabilities in

x86 processors (SSE) and others (eg AltiVec on PowerPC) have dramatically improved their FP performances for single-thread applications they are still no match for dedicated vector machines such as the NEC SX There are several reasons for this including much wider vectors (8~16 vs 2~4) and deeper vector pipelines augmented by significantly greater memory bandwidth as well as random access capabilities Although in the past increases in both the processor and the memory clock rate helped to improved single-thread performance of commodity CPUs such an increase has virtually come to a halt since around 2005 when multicore parallelism became the dominant methodology for increasing CPU performance As such expensive vector machines are still preferred for applications such as CFD where the algorithm is memory bound or time-dependent code where there is inherent serialization in the algorithm itself

(2) Powerinstallationmaintenance efficiency even if one would overcome the inefficiency of PC

clusters by utilizing more parallelism various costs such as physical power space failure and maintenance may become overwhelming Modern large supercomputers consume more than a Megawatt of power greatly reducing their installation possibilities due to the costs involved In such situations a dedicated design that would reduce the extra components and thus improve the failure rates power consumption etc may be desirable

Accelerators may solve the above two problems while being merely additive to commodity clusters

thus preserving their cost advantage while addressing the niche advantages of dedicated supercomputers Already supercomputers as LANLrsquos Roadrunner and Tokyo Techrsquos TSUBAME exist which are clusters augmented with extensive commodity acceleration In fact Roadrunner based on a combination of AMD Opteron and IBMSonyToshiba PowerXCell became the first supercomputer ever to achieve Petaflops in 2008

The differences between the failed acceleration in HPC in the past versus the possible success in modern times are twofold Firstly modern-day accelerators are designed as part of the commodity PC ecosystem Even customized hardware accelerators such as MDGrape and ClearSpeed are designed with standard fast IO buses such as PCI-Express and fit within standard high-performance cluster nodes as hosts where the bulk of the code will still run on the PC while time-critical bottlenecks are executed on accelerators Cell and GPUs are more fundamentally commodity in that they had been developed with high-end embedded devices and multimedia gaming as target application areas with an abundance of both hardware and software in the accompanying PC ecosystem As such leveraging this ecosystem has exactly the same characteristics as leveraging commodity clusters with a possible similar success scenario

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

2

Secondly as discussed above up until 2005 due to the continued exponential speedup of single-thread performance of CPUs any acceleration speedups were quickly caught up and exceeded however this is now hitting a roadblock as shown above and extensive acceleration would sustain its advantage over conventional CPUs in terms of resources required for a long time For example on a single node of TSUBAME sporting 8 Opteron 280s (24 Ghz x 16 cores) the best 3-D FFT performance utilizing all 16 cores has been approximately 20 Gigaflops both single and double precision using the most optimized code available Contrastingly a single GPU card within the Tesla 1070p records over 150 Gigaflops in single precision and 40 Gigaflops using the NU-FFT [5] we developed Moreover when we parallelize across nodes we need more than 16 nodes (= 256 cores) to recover the parallelization overhead incurred to match a single GPU performance This difference is expected to worsen as GPU performance is expected to skyrocket both in terms of pure flops and memory bandwidth and most importantly excel in cost and flopswatt performance as a result

12 Commodity vector accelerationmdashthe rebirth of vector computing Overall it is fair to state that GPUs and Cell are receiving the most attention as accelerators of

choice primarily because of their commodity nature but also because of their high compute as well as memory bandwidth density The IBMToshibaSony PowerXCell 8i sports 1028 GigaFlops double precision peak performance as well as 25 GBs memory bandwidth GPU absolute performance is more dramatic with both the latest AMD FireStream and NVIDIA Tesla GPU sporting over 1 TFlops of (albeit single precision) performance and over 100 GBs memory bandwidth using multithreaded SIMD-vector processor array architecture augmented by point-to-point fast memory interconnect Such memory bandwidth effectively rivals the performance of the most expensive NEC SX series

Figure 1 GPU as ldquorebirth of vectorsrdquo in HPC mdash supporting both high computational density as well as high memory access frequency

Such a situation is depicted in figure 1 Here the Y-axis is the compute density of an application and

the X-axis is the memory access frequency HPC apps would be naturally situated farther away from the axis but can be categorized as (a) those with high computational density such as N-body or dense linear algebra situated towards the upper part of the graph and (b) those with high memory access frequency such as CFD and FFT situated towards the right Standard CPU performances are by and large mediocre appropriate for applications near the center and do not efficiently support (a) or (b) Modern accelerators such as GPUs and Cell contrastingly excel at both (a) and (b) because of their

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

3

architectural properties as described above As such modern accelerators could be general HPC accelerators for various types of codes not just specific ones as in the past

In fact both GPUs and Cell can be regarded as a rebirth of vector computing Although different from traditional architectures with deep vector pipelining nonetheless both architectures are quite amenable to where vector computers excelled in the past In fact modern algorithms intended to run on cache-oriented standard CPUs often do not work well due to their requirements for shorter memory access latency and locality enabled by extensive caching and associated blocking Instead classic algorithms intended for long vector-length ldquostreamingrdquo memory access work quite well Our recent bandwidth-intensive FFT work [5] is a manifestation of this in that we principally employ the classic vector-oriented multi-row FFT algorithm to compute the FFT for Y- and Z-axis (but not for the X-axismdashfor details refer to [5])

2 Acceleration in actionmdashTSUBAME 12 exemplars As we have seen accelerators exhibit high promise in becoming the next dominant resources in providing the majority of the compute capabilities in HPC serving both dense computations as well as vector-friendly streaming codes This is not to say however that such will occur automatically With commodity clusters it took years of continued research and development effort (still continuing to date) to mature to be useable in production especially at large scale Similarly since development of commodity accelerators has focused on their principal market ie non-HPC single-node applications considerable RampD as well as large-scale deployments are required for them to become truly mainstream in HPC

Here we present a concrete example of the largest high-performance GPU deployment to date on an open science supercomputer TSUBAME 12 at the Tokyo Institute of Technology and some of the early results in scaling representative large parallel benchmark codes

Figure 2 TSUBAME 12 GPU accelerated supercomputer

The original TSUBAME 10 was deployed as a 10480-CPU core cluster at the Tokyo Institute of

Technology GSIC Center in the spring of 2006 [6] In addition to the 5120 24 Ghz dual-core AMD Opteron CPUs in 655 nodes (8 socket Sun x4600 nodes) providing approximately 504 Gigaflops of peak FP performance it also sported 360 ClearSpeed CS600 Advance Accelerators one card per node for about half of the nodes providing approximately 30 Teraflops of additional compute power Over the years additional ClearSpeeds as well as Intel ldquoHarpertownrdquo Xeon nodes were added pushing the total performance up to approximately 105 Teraflops

TSUBAME 12 was implemented similarly as a follow-on upgrade to the existing TSUBAME adding 170 NVIDIA Tesla s1070 units each embodying four Tesla GPU cards for a total of 680 cards This allowed the single-precision FP performance of TSUBAME to be boosted to nearly 900

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

4

Teraflops and double-precision to approximately 160 Teraflops The added power and thermal load on TSUBAME is only about 150 KW which is about 18~110th of TSUBAME itself

Speedups of various libraries and applications using GPUs have been quite astounding on a single GPU as exemplified by the FFT speedup demonstrated in Section 2 We have utilized such capabilities for example to greatly speed up sophisticated applications such as all-to-all 3-D protein docking achieving equivalent speed to our earlier published result in [7] on BlueGeneL with almost four times better energy efficiency

Here we mention our current results on two multi-node large-scale benchmarks one dense linear algebra (Linpack) and the other sparse finite difference (multi-node Himeno) benchmark codes

21 Linpack on CPU-GPU-ClearSpeed heterogeneous configuration One of the major issues in heterogeneous supercomputers equipped with a batch of GPUsaccelerators is how users develop parallel programs that effectively use hybrid computing resources especially tightly coupled programs Here we take the High Performance Linpack (HPL) benchmark [8] and describe how it is heterogeneously accelerated on TSUBAME12 For more detailed implementation refer to our previous paper [9] 211 High Performance Linpack (HPL) HPL is a well known MPI based parallel benchmark used in the Top500 ranking solving a set of dense linear equations using a direct method It employs a block-based solver algorithm so as to harness the highly tuned underlying Level 3 BLAS librarymdashin particular using a fast matrix-multiply (DGEMM) function is the key to obtain good HPL performance Parallelization is achieved based on two-dimensional block cyclic distribution of the matrix where each MPI process possesses a sub-matrix of the almost same sizemdashas such HPL is designed for homogeneous environments

Thus there are several challenging issues to port HPL onto TSUBAME12 (1) All the processors CPUs GPUs ClearSpeed boards should be used cooperatively for DGEMM

kernel computation (intra-node heterogeneity) On TSUBAME the ratios of theoretical peak computation performance for CPUs GPUs and ClearSpeed are 35 33 32 therefore omitting any type of processor that causes heavy performance degradation

(2) We also have to consider inter-node heterogeneity which is that only 312 out of 648 nodes have two Tesla GPUs and the rest have none due to the configuration restrictions of Tesla 1070p

(3) Finally PCI-ePCI-X communication overhead should not be underestimated since the matrix data is basically allocated on host memory

212 Kernel functions and matrix block size In HPL the most time consuming part in the overall benchmark is the DGEMM function calls that multiply the (Mprime x B) matrix and (B x Nprime) matrix where B is a tuneable block size parameter To maintain a favorable communication-computation ratio B should be sufficiently large

As a CPU BLAS library we use GotoBLAS [10] For GPUs we use a matrix multiply function written from scratch by our group in order to achieve maximum efficient and at the same time allow asynchronous double buffering to hide the data transfer latency a feature not available in NVIDIArsquos native CUBLAS Its on-board performance is about 80 GFlops and we are able to almost completely hide the latency to effectively achieve this speed For ClearSpeed the CSXL BLAS library by ClearSpeed Inc (DGEMM speed is 63 GFlops) is used Through preliminary experiments we found that B = 1152 would be the optimal block size 213 Coping with heterogeneity In TSUBAME 12 we are faced with intra- and inter-node heterogeneity as described above We conduct load balancing of the DGEMM kernels running on different CPUsaccelerators based on the following strategies By contrast non-kernel computations such as pivoting or row exchange are simply done on x86 CPUs

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

5

Intra-node heterogeneity We conceptually regard all hybrid processors on each node as a farm that provides kernel computation facility for processes Here each process may delegate some fraction of DGEMM computation to accelerators while other fractions may be computed by the CPU-based BLAS as in figure 3

Inter-node heterogeneity With the above model inter-node heterogeneity appears as imbalance of the DGEMM performances among nodes To keep the performance per process almost identical we configure the number of processes per node to reflect their compute capabilities

Figure 3 Mapping between processes and processors (CPUs GPUs ClearSpeed) The left figure shows a node with GPUs the right one shows a node without GPU

214 Other techniques

We observed that PCI-ePCI-X communication consumes considerable CPU load and as a result conducting CPU-accelerator communication and computing DGEMM on the same cores incurs considerable performance degradation Instead we assign several dedicated cores solely for communication (black cores in figure 3) Although we lose considerable performance by such loss of cores (over 10 Teraflops for TSUBAME12) it becomes an overall win when acceleration is considered

In the original code DGEMM calls are sometimes fragmented when communication and DGEMM overlap We have eliminated such fragmentation by using Pthreads

We have implemented and added overlapping between pivot rows communication and DGEMM computation which was not in the original code but is effective for accelerated machines due to the inherent parallelism between CPUs and accelerators

215 Performance evaluation In running the whole-machine Linpack we used 648 of the 655 TSUBAME nodes each of which has eight dual-core Opteron 880 processors and a ClearSpeed accelerator and 312 nodes embodying two Tesla GPUs Additionally another production Xeon cluster called TSUBASA was integrated and used cooperativelymdashthere we used 80 nodes each of which has two quad-core Xeon E5440 processors Combining all these hybrid resources we have achieved 8701 TFlops Since the total peak speed is 163 TFlops the efficiency is 53 partly due to performance loss we mentioned above assigning dedicated CPU to support multiple accelerator as well as compromises made such as the block size (B=1152) Figure 4 compares peak speed and Linpack speed for each category of processors Interestingly we observe here that all the processors show similar efficiency (48 to 56) so the resulting share of runtime performance largely matches that of the theoretical peak performance Here GPU is the most efficient (56) in a more careful comparison but it is unclear whether this is inherent or due to compromise weighing in favour of GPU and we are conducting follow-up research to clarify this The figure also shows electrical power consumption we observe that while accelerators provide 66 of the computation performance their power consumption is only 15 of the entire machine

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

6

Figure 4 Performance ratio of each kind of processor Peak speed (double-precision) contribution to Linpack speed and electrical power consumption are shown

Figure 5 History of TSUBAME in the Top500

Figure 5 shows the history of our Linpack experiments on the TSUBAME in particular the

achieved Teraflops and rank in the Top500 Ever since the original TSUBAME was installed in 2006 processor resources have been steadily increased due to system upgrade which is represented by continuous improvement for every Top500 since its inception Since the performance only with Opteron CPUs was 3818 TFlops acceleration technology improved the TSUBAME system performance by a factor of 23

22 Himeno CFD benchmark acceleration over multiple GPU nodes The Himeno CFD benchmark is a simple 3-D CFD code that solves a Poisson equation using finite difference Jacobi iterations Although production-quality CFD code would be substantially more complex and employ sophisticated methods the Himeno benchmark nonetheless is extremely bandwidth demanding and serves as a benchmark that measures the worst-case scaling scenario for bandwidth intensive codes Himeno benchmark scores are published for a variety of architectures on the benchmark website [11] On a standard x86 CPU the benchmark is completely memory bound and the score would be approximately 1 GBs Contrastingly a single GPU implementation on a NVIDIA CUDA GPU has seen reported scores of over 70 GBs (figure 6) by utilizing high bandwidth device memory as well as shared memory

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

7

Figure 6 Riken Himeno benchmark and parallelization using CUDA GPUs

The Himeno benchmarkrsquos main loop involves 34 FP calculations 31 memory reads and 1 memory

write By optimization we can reduce this to 13 reads and 1 writemdashstill significant and very much memory bound There are four grid sizes the smallest S being (65 x 65 x 129) up to the largest XL being (513 x 513 x 1025) For our benchmark we slightly modified XL so that it would be (1025 x 513 x 513) ie will have a prolonged X-axis instead of Z for ease of programming

For parallelization across GPUs we conduct one-dimensional block distribution along the X-axis across multiple nodes Within a GPU we extensively parallelize and optimize the code using shared memory and coalesced memory access Across the GPUs we overlap MPI communication across nodes and Jacobi iteration computation within a GPU in each node

On TSUBAME12 CPU GPU communication bandwidth is greatly sacrificed by being based on PCI-E Gen1 x8 instead of the more modern Gen2 x16 and furthermore by two GPU cards sharing a single PCI-E lanemdashas such in the worst case we have only 18th the bandwidth of a standard Gen2 x16 implementation and care must be taken to hide the latency effectively Under this circumstance communication of the boundary region on every Jacobi iteration takes approximately 82 ms If the GPU compute granularity is greater than this we would have effectively hid the latency However as we increase the number of GPUs while maintaining the problems size (strong scaling) the boundary area adjacent to the next GPU will not change and only the size of the X-axis segment will get smaller shortening the compute time

Figure 7 (left) shows the result We observe that latency hiding technique greatly improves the scalability of the code achieving over 700 Gigaflops on 32 CPUs At that point we achieve 11 TBs in memory bandwidth Reflecting upon the theoretical compute models we observe that there is still a little bit of room for improvement but the benchmark is approaching the hardware limits

By doubling the problem size to XXL (figure 7 (right) 2049 x 513 x 513) we achieve considerable improvement in scalability scaling from 16 to 32 GPUs enables 179 times speedup Further details of the benchmark and the implementation will be the subject of another paper

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

8

152

292

524

709

132

224

337

455

0

100

200

300

400

500

600

700

800

4 8 16 32

GFL

OPS

152

292

524

709

304

584

1049

0

200

400

600

800

1000

1200

4 8 16 32

GFL

OPS

GPU nodes

Left bar with latency hiding right bar without GPU nodes

Left bar size XL right bar size XXL

Figure 7 Riken Himeno benchmark parallelized across nodes on TSUBAME 12 The left graph shows the effect of latency hiding as we scale to 32 GPU nodes The right graph shows the effect of doubling the problem size to achieve larger granularity and thus better scalability

3 Conclusion and future work Commodity acceleration is finally here and in fact might dominate the computing aspects of future supercomputers as one of its major properties is the rebirth of vector computing which has been largely punted in commodity clusters Already there are numbers of successful application ports with dramatic speedups reported in the literature and large-scale multi-node applications and deployments are starting to materialize where significant speedups andor power savings are being reported TSUBAME12 one of the first large-scale deployments of GPUs in a production supercomputer is allowing us to understand the attractions as well as limitations of the approach and what technical problems still lie ahead Even as we speak many GPU-enabled clusters are being planneddeployed but without sufficient software support they are reminiscent of early clusters

As future work we plan to conduct extensive research to help GPUs become the dominant compute resource This research includes various systems issues as well as algorithmic and application issues in preparation for deployment of a petascale TSUBAME 20 anticipated to be deployed in 2010

References
[1] "Cell Broadband Engine Technology and Systems", IBM Systems Journal, 51-5, May 2007.
[2] Owens J D, Houston M, Luebke D, Green S, Stone J E and Phillips J C, "GPU Computing", Proc. IEEE, 96-5, May 2008, pp. 879-899.
[3] ClearSpeed Inc., "ClearSpeed Whitepaper: CSX Processor Architecture", http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf, Feb. 2007.
[4] Taiji M, "MDGRAPE-3 chip: a 165 Gflops Application Specific LSI for Molecular Dynamics Simulations", Proc. Hot Chips 16, IEEE Computer Society Press (CD-ROM), 2004.
[5] Akira Nukada, Yasuhiko Ogata, Toshio Endo and Satoshi Matsuoka, "Bandwidth Intensive 3-D FFT Kernel for GPUs using CUDA", Proc. ACM/IEEE Supercomputing 2008 (SC2008), Austin, Texas, the IEEE Press, Nov. 2008.
[6] Satoshi Matsuoka, "The Road to TSUBAME and Beyond", Chapter 14 of Petascale Computing: Algorithms and Applications, Chapman & Hall/CRC Computational Science Series, pp. 289-310, 2008.
[7] Akira Nukada, Yuichiro Hourai, Akira Nishida and Yutaka Akiyama, "High Performance 3D Convolution for Protein Docking on IBM Blue Gene", Parallel and Distributed Processing and Applications, Springer LNCS Vol. 4742, pp. 958-969, 2007.
[8] A Petitet, R Whaley, J Dongarra and A Cleary, "HPL - a Portable Implementation of the High-Performance Linpack Benchmark for Distributed Computers", http://www.netlib.org/benchmark/hpl/


[9] Toshio Endo and Satoshi Matsuoka, "Massive Supercomputing Coping with Heterogeneity of Modern Accelerators", IEEE International Parallel & Distributed Processing Symposium (IPDPS 2008), the IEEE Press, April 2008.
[10] K Goto, GotoBLAS, http://www.tacc.utexas.edu/resources/software/
[11] The Riken Himeno CFD Benchmark, http://accc.riken.jp/HPC/HimenoBMT/index_e.html


PCs The first Beowulf cluster called Wigraf was born in 1994 at CaltechNASA signifying the start of commoditization of networks and the HPC software stacks to build a complete supercomputermdashfrom then on cluster technology has matured to the extent that we can build petascale machines embodying tens of thousands of CPU cores Such wide proliferation of commodity technology in HPC over time has allowed its widespread proliferation and mainstream adoption not only allowing one to save costs in building large-scale machines but also realizing integration and smooth growth paths in the whole HPC ecosystem from simple PCs and small few node clusters all the way up to petascale machines It is now commonplace for someone to develop and run the same application in both his office machine and the largest supercomputer using thousands of CPU coresmdashwith just the size of the problem being different

However even today various vendors including IBM Cray NEC Fujitsu etc sell custom-made supercomputers and more are in future development Given the cost and other technical advantages of commodity clusters such ldquodinosaursrdquo should have disappeared a long time ago mdash however if one would observe various benchmarks such as the TOP500 or the HPC challenge such machines dominate the top ranks There are several reasons for this including the ones below (1) High-performance highly efficient vector processing although vector processing capabilities in

x86 processors (SSE) and others (eg AltiVec on PowerPC) have dramatically improved their FP performances for single-thread applications they are still no match for dedicated vector machines such as the NEC SX There are several reasons for this including much wider vectors (8~16 vs 2~4) and deeper vector pipelines augmented by significantly greater memory bandwidth as well as random access capabilities Although in the past increases in both the processor and the memory clock rate helped to improved single-thread performance of commodity CPUs such an increase has virtually come to a halt since around 2005 when multicore parallelism became the dominant methodology for increasing CPU performance As such expensive vector machines are still preferred for applications such as CFD where the algorithm is memory bound or time-dependent code where there is inherent serialization in the algorithm itself

(2) Powerinstallationmaintenance efficiency even if one would overcome the inefficiency of PC

clusters by utilizing more parallelism various costs such as physical power space failure and maintenance may become overwhelming Modern large supercomputers consume more than a Megawatt of power greatly reducing their installation possibilities due to the costs involved In such situations a dedicated design that would reduce the extra components and thus improve the failure rates power consumption etc may be desirable

Accelerators may solve the above two problems while being merely additive to commodity clusters

thus preserving their cost advantage while addressing the niche advantages of dedicated supercomputers Already supercomputers as LANLrsquos Roadrunner and Tokyo Techrsquos TSUBAME exist which are clusters augmented with extensive commodity acceleration In fact Roadrunner based on a combination of AMD Opteron and IBMSonyToshiba PowerXCell became the first supercomputer ever to achieve Petaflops in 2008

The differences between the failed acceleration in HPC in the past versus the possible success in modern times are twofold Firstly modern-day accelerators are designed as part of the commodity PC ecosystem Even customized hardware accelerators such as MDGrape and ClearSpeed are designed with standard fast IO buses such as PCI-Express and fit within standard high-performance cluster nodes as hosts where the bulk of the code will still run on the PC while time-critical bottlenecks are executed on accelerators Cell and GPUs are more fundamentally commodity in that they had been developed with high-end embedded devices and multimedia gaming as target application areas with an abundance of both hardware and software in the accompanying PC ecosystem As such leveraging this ecosystem has exactly the same characteristics as leveraging commodity clusters with a possible similar success scenario

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

2

Secondly as discussed above up until 2005 due to the continued exponential speedup of single-thread performance of CPUs any acceleration speedups were quickly caught up and exceeded however this is now hitting a roadblock as shown above and extensive acceleration would sustain its advantage over conventional CPUs in terms of resources required for a long time For example on a single node of TSUBAME sporting 8 Opteron 280s (24 Ghz x 16 cores) the best 3-D FFT performance utilizing all 16 cores has been approximately 20 Gigaflops both single and double precision using the most optimized code available Contrastingly a single GPU card within the Tesla 1070p records over 150 Gigaflops in single precision and 40 Gigaflops using the NU-FFT [5] we developed Moreover when we parallelize across nodes we need more than 16 nodes (= 256 cores) to recover the parallelization overhead incurred to match a single GPU performance This difference is expected to worsen as GPU performance is expected to skyrocket both in terms of pure flops and memory bandwidth and most importantly excel in cost and flopswatt performance as a result

12 Commodity vector accelerationmdashthe rebirth of vector computing Overall it is fair to state that GPUs and Cell are receiving the most attention as accelerators of

choice primarily because of their commodity nature but also because of their high compute as well as memory bandwidth density The IBMToshibaSony PowerXCell 8i sports 1028 GigaFlops double precision peak performance as well as 25 GBs memory bandwidth GPU absolute performance is more dramatic with both the latest AMD FireStream and NVIDIA Tesla GPU sporting over 1 TFlops of (albeit single precision) performance and over 100 GBs memory bandwidth using multithreaded SIMD-vector processor array architecture augmented by point-to-point fast memory interconnect Such memory bandwidth effectively rivals the performance of the most expensive NEC SX series

Figure 1 GPU as ldquorebirth of vectorsrdquo in HPC mdash supporting both high computational density as well as high memory access frequency

Such a situation is depicted in figure 1 Here the Y-axis is the compute density of an application and

the X-axis is the memory access frequency HPC apps would be naturally situated farther away from the axis but can be categorized as (a) those with high computational density such as N-body or dense linear algebra situated towards the upper part of the graph and (b) those with high memory access frequency such as CFD and FFT situated towards the right Standard CPU performances are by and large mediocre appropriate for applications near the center and do not efficiently support (a) or (b) Modern accelerators such as GPUs and Cell contrastingly excel at both (a) and (b) because of their

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

3

architectural properties as described above As such modern accelerators could be general HPC accelerators for various types of codes not just specific ones as in the past

In fact both GPUs and Cell can be regarded as a rebirth of vector computing Although different from traditional architectures with deep vector pipelining nonetheless both architectures are quite amenable to where vector computers excelled in the past In fact modern algorithms intended to run on cache-oriented standard CPUs often do not work well due to their requirements for shorter memory access latency and locality enabled by extensive caching and associated blocking Instead classic algorithms intended for long vector-length ldquostreamingrdquo memory access work quite well Our recent bandwidth-intensive FFT work [5] is a manifestation of this in that we principally employ the classic vector-oriented multi-row FFT algorithm to compute the FFT for Y- and Z-axis (but not for the X-axismdashfor details refer to [5])

2 Acceleration in actionmdashTSUBAME 12 exemplars As we have seen accelerators exhibit high promise in becoming the next dominant resources in providing the majority of the compute capabilities in HPC serving both dense computations as well as vector-friendly streaming codes This is not to say however that such will occur automatically With commodity clusters it took years of continued research and development effort (still continuing to date) to mature to be useable in production especially at large scale Similarly since development of commodity accelerators has focused on their principal market ie non-HPC single-node applications considerable RampD as well as large-scale deployments are required for them to become truly mainstream in HPC

Here we present a concrete example of the largest high-performance GPU deployment to date on an open science supercomputer TSUBAME 12 at the Tokyo Institute of Technology and some of the early results in scaling representative large parallel benchmark codes

Figure 2 TSUBAME 12 GPU accelerated supercomputer

The original TSUBAME 10 was deployed as a 10480-CPU core cluster at the Tokyo Institute of

Technology GSIC Center in the spring of 2006 [6] In addition to the 5120 24 Ghz dual-core AMD Opteron CPUs in 655 nodes (8 socket Sun x4600 nodes) providing approximately 504 Gigaflops of peak FP performance it also sported 360 ClearSpeed CS600 Advance Accelerators one card per node for about half of the nodes providing approximately 30 Teraflops of additional compute power Over the years additional ClearSpeeds as well as Intel ldquoHarpertownrdquo Xeon nodes were added pushing the total performance up to approximately 105 Teraflops

TSUBAME 12 was implemented similarly as a follow-on upgrade to the existing TSUBAME adding 170 NVIDIA Tesla s1070 units each embodying four Tesla GPU cards for a total of 680 cards This allowed the single-precision FP performance of TSUBAME to be boosted to nearly 900

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

4

Teraflops and double-precision to approximately 160 Teraflops The added power and thermal load on TSUBAME is only about 150 KW which is about 18~110th of TSUBAME itself

Speedups of various libraries and applications using GPUs have been quite astounding on a single GPU as exemplified by the FFT speedup demonstrated in Section 2 We have utilized such capabilities for example to greatly speed up sophisticated applications such as all-to-all 3-D protein docking achieving equivalent speed to our earlier published result in [7] on BlueGeneL with almost four times better energy efficiency

Here we mention our current results on two multi-node large-scale benchmarks one dense linear algebra (Linpack) and the other sparse finite difference (multi-node Himeno) benchmark codes

21 Linpack on CPU-GPU-ClearSpeed heterogeneous configuration One of the major issues in heterogeneous supercomputers equipped with a batch of GPUsaccelerators is how users develop parallel programs that effectively use hybrid computing resources especially tightly coupled programs Here we take the High Performance Linpack (HPL) benchmark [8] and describe how it is heterogeneously accelerated on TSUBAME12 For more detailed implementation refer to our previous paper [9] 211 High Performance Linpack (HPL) HPL is a well known MPI based parallel benchmark used in the Top500 ranking solving a set of dense linear equations using a direct method It employs a block-based solver algorithm so as to harness the highly tuned underlying Level 3 BLAS librarymdashin particular using a fast matrix-multiply (DGEMM) function is the key to obtain good HPL performance Parallelization is achieved based on two-dimensional block cyclic distribution of the matrix where each MPI process possesses a sub-matrix of the almost same sizemdashas such HPL is designed for homogeneous environments

Thus there are several challenging issues to port HPL onto TSUBAME12 (1) All the processors CPUs GPUs ClearSpeed boards should be used cooperatively for DGEMM

kernel computation (intra-node heterogeneity) On TSUBAME the ratios of theoretical peak computation performance for CPUs GPUs and ClearSpeed are 35 33 32 therefore omitting any type of processor that causes heavy performance degradation

(2) We also have to consider inter-node heterogeneity which is that only 312 out of 648 nodes have two Tesla GPUs and the rest have none due to the configuration restrictions of Tesla 1070p

(3) Finally PCI-ePCI-X communication overhead should not be underestimated since the matrix data is basically allocated on host memory

212 Kernel functions and matrix block size In HPL the most time consuming part in the overall benchmark is the DGEMM function calls that multiply the (Mprime x B) matrix and (B x Nprime) matrix where B is a tuneable block size parameter To maintain a favorable communication-computation ratio B should be sufficiently large

As a CPU BLAS library we use GotoBLAS [10] For GPUs we use a matrix multiply function written from scratch by our group in order to achieve maximum efficient and at the same time allow asynchronous double buffering to hide the data transfer latency a feature not available in NVIDIArsquos native CUBLAS Its on-board performance is about 80 GFlops and we are able to almost completely hide the latency to effectively achieve this speed For ClearSpeed the CSXL BLAS library by ClearSpeed Inc (DGEMM speed is 63 GFlops) is used Through preliminary experiments we found that B = 1152 would be the optimal block size 213 Coping with heterogeneity In TSUBAME 12 we are faced with intra- and inter-node heterogeneity as described above We conduct load balancing of the DGEMM kernels running on different CPUsaccelerators based on the following strategies By contrast non-kernel computations such as pivoting or row exchange are simply done on x86 CPUs

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

5

Intra-node heterogeneity We conceptually regard all hybrid processors on each node as a farm that provides kernel computation facility for processes Here each process may delegate some fraction of DGEMM computation to accelerators while other fractions may be computed by the CPU-based BLAS as in figure 3

Inter-node heterogeneity With the above model inter-node heterogeneity appears as imbalance of the DGEMM performances among nodes To keep the performance per process almost identical we configure the number of processes per node to reflect their compute capabilities

Figure 3 Mapping between processes and processors (CPUs GPUs ClearSpeed) The left figure shows a node with GPUs the right one shows a node without GPU

214 Other techniques

We observed that PCI-ePCI-X communication consumes considerable CPU load and as a result conducting CPU-accelerator communication and computing DGEMM on the same cores incurs considerable performance degradation Instead we assign several dedicated cores solely for communication (black cores in figure 3) Although we lose considerable performance by such loss of cores (over 10 Teraflops for TSUBAME12) it becomes an overall win when acceleration is considered

In the original code DGEMM calls are sometimes fragmented when communication and DGEMM overlap We have eliminated such fragmentation by using Pthreads

We have implemented and added overlapping between pivot rows communication and DGEMM computation which was not in the original code but is effective for accelerated machines due to the inherent parallelism between CPUs and accelerators

215 Performance evaluation In running the whole-machine Linpack we used 648 of the 655 TSUBAME nodes each of which has eight dual-core Opteron 880 processors and a ClearSpeed accelerator and 312 nodes embodying two Tesla GPUs Additionally another production Xeon cluster called TSUBASA was integrated and used cooperativelymdashthere we used 80 nodes each of which has two quad-core Xeon E5440 processors Combining all these hybrid resources we have achieved 8701 TFlops Since the total peak speed is 163 TFlops the efficiency is 53 partly due to performance loss we mentioned above assigning dedicated CPU to support multiple accelerator as well as compromises made such as the block size (B=1152) Figure 4 compares peak speed and Linpack speed for each category of processors Interestingly we observe here that all the processors show similar efficiency (48 to 56) so the resulting share of runtime performance largely matches that of the theoretical peak performance Here GPU is the most efficient (56) in a more careful comparison but it is unclear whether this is inherent or due to compromise weighing in favour of GPU and we are conducting follow-up research to clarify this The figure also shows electrical power consumption we observe that while accelerators provide 66 of the computation performance their power consumption is only 15 of the entire machine

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

6

Figure 4 Performance ratio of each kind of processor Peak speed (double-precision) contribution to Linpack speed and electrical power consumption are shown

Figure 5 History of TSUBAME in the Top500

Figure 5 shows the history of our Linpack experiments on the TSUBAME in particular the

achieved Teraflops and rank in the Top500 Ever since the original TSUBAME was installed in 2006 processor resources have been steadily increased due to system upgrade which is represented by continuous improvement for every Top500 since its inception Since the performance only with Opteron CPUs was 3818 TFlops acceleration technology improved the TSUBAME system performance by a factor of 23

22 Himeno CFD benchmark acceleration over multiple GPU nodes The Himeno CFD benchmark is a simple 3-D CFD code that solves a Poisson equation using finite difference Jacobi iterations Although production-quality CFD code would be substantially more complex and employ sophisticated methods the Himeno benchmark nonetheless is extremely bandwidth demanding and serves as a benchmark that measures the worst-case scaling scenario for bandwidth intensive codes Himeno benchmark scores are published for a variety of architectures on the benchmark website [11] On a standard x86 CPU the benchmark is completely memory bound and the score would be approximately 1 GBs Contrastingly a single GPU implementation on a NVIDIA CUDA GPU has seen reported scores of over 70 GBs (figure 6) by utilizing high bandwidth device memory as well as shared memory

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

7

Figure 6 Riken Himeno benchmark and parallelization using CUDA GPUs

The Himeno benchmarkrsquos main loop involves 34 FP calculations 31 memory reads and 1 memory

write By optimization we can reduce this to 13 reads and 1 writemdashstill significant and very much memory bound There are four grid sizes the smallest S being (65 x 65 x 129) up to the largest XL being (513 x 513 x 1025) For our benchmark we slightly modified XL so that it would be (1025 x 513 x 513) ie will have a prolonged X-axis instead of Z for ease of programming

For parallelization across GPUs we conduct one-dimensional block distribution along the X-axis across multiple nodes Within a GPU we extensively parallelize and optimize the code using shared memory and coalesced memory access Across the GPUs we overlap MPI communication across nodes and Jacobi iteration computation within a GPU in each node

On TSUBAME12 CPU GPU communication bandwidth is greatly sacrificed by being based on PCI-E Gen1 x8 instead of the more modern Gen2 x16 and furthermore by two GPU cards sharing a single PCI-E lanemdashas such in the worst case we have only 18th the bandwidth of a standard Gen2 x16 implementation and care must be taken to hide the latency effectively Under this circumstance communication of the boundary region on every Jacobi iteration takes approximately 82 ms If the GPU compute granularity is greater than this we would have effectively hid the latency However as we increase the number of GPUs while maintaining the problems size (strong scaling) the boundary area adjacent to the next GPU will not change and only the size of the X-axis segment will get smaller shortening the compute time

Figure 7 (left) shows the result We observe that latency hiding technique greatly improves the scalability of the code achieving over 700 Gigaflops on 32 CPUs At that point we achieve 11 TBs in memory bandwidth Reflecting upon the theoretical compute models we observe that there is still a little bit of room for improvement but the benchmark is approaching the hardware limits

By doubling the problem size to XXL (figure 7 (right) 2049 x 513 x 513) we achieve considerable improvement in scalability scaling from 16 to 32 GPUs enables 179 times speedup Further details of the benchmark and the implementation will be the subject of another paper

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

8

152

292

524

709

132

224

337

455

0

100

200

300

400

500

600

700

800

4 8 16 32

GFL

OPS

152

292

524

709

304

584

1049

0

200

400

600

800

1000

1200

4 8 16 32

GFL

OPS

GPU nodes

Left bar with latency hiding right bar without GPU nodes

Left bar size XL right bar size XXL

Figure 7 Riken Himeno benchmark parallelized across nodes on TSUBAME 12 The left graph shows the effect of latency hiding as we scale to 32 GPU nodes The right graph shows the effect of doubling the problem size to achieve larger granularity and thus better scalability

3 Conclusion and future work Commodity acceleration is finally here and in fact might dominate the computing aspects of future supercomputers as one of its major properties is the rebirth of vector computing which has been largely punted in commodity clusters Already there are numbers of successful application ports with dramatic speedups reported in the literature and large-scale multi-node applications and deployments are starting to materialize where significant speedups andor power savings are being reported TSUBAME12 one of the first large-scale deployments of GPUs in a production supercomputer is allowing us to understand the attractions as well as limitations of the approach and what technical problems still lie ahead Even as we speak many GPU-enabled clusters are being planneddeployed but without sufficient software support they are reminiscent of early clusters

As future work we plan to conduct extensive research to help GPUs become the dominant compute resource This research includes various systems issues as well as algorithmic and application issues in preparation for deployment of a petascale TSUBAME 20 anticipated to be deployed in 2010

References [1] ldquoCell Broadband Engine Technology and Systemsrdquo IBM Systems Journal 51-5 May 2007 [2] Owens JD Houston M Luebke D Green S Stone JE Phillips JC ldquoGPU Computingrdquo Proc

IEEE 96-5 May 2008 pp 879-899 [3] ClearSpeed Inc ldquoClearSpeed Whitepaper CSX Processor Architecturerdquo

httpwwwclearspeedcomdocsresourcesClearSpeed_Architecture_Whitepaper_Feb07v2pdf Feb 2007

[4] Taiji M ldquoMDGRAPE-3 chip a 165 Gflops Application Specific LSI for Molecular Dynamics Simulationsrdquo Proc Hot Chips 16 IEEE Computer Society Press (CD-ROM) 2004

[5] Akira Nukada Yasuhiko Ogata Toshio Endo and Satoshi Matsuoka ldquoBandwidth Intensive 3-D FFT kernel for GPUs using CUDArdquo Proc ACMIEEE Supercomputing 2008 (SC2008) Austin Texas the IEEE Press Nov 2008

[6] Satoshi Matsuoka Petascale Computing Algorithms and Applications --- Chapter 14 The Road to TSUBAME and Beyond Chapman amp Hall CRC Computational Science Series pp289-310 2008

[7] Akira Nukada Yuichiro Hourai Akira Nishida and Yutaka Akiyama ldquoldquoHigh Performance 3D Convolution for Protein Docking on IBM Blue Generdquo Parallel and Distributed Processing and Applications Springer LNCS Vol 4742 pp 958-969 2007

[8] A Petitet R Whaley J Dongarra and A Cleary HPL ndash a portable implementation of the high-performance Linpack benchmark for distributed computers httpwwwnetliborgbenchmarkhpl

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

9

[9] Toshio Endo and Satoshi Matsuoka Massive Supercomputing Coping with Heterogeneity of Modern Accelerators IEEE International Parallel amp Distributed Processing Symposium (IPDPS 2008) the IEEE Press April 2008

[10] K Goto Goto BLAS httpwwwtaccutexaseduresourcessoftware [11] The Riken Himeno CFD Benchmark httpacccrikenjpHPCHimenoBMTindex_ehtml

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

10

Secondly as discussed above up until 2005 due to the continued exponential speedup of single-thread performance of CPUs any acceleration speedups were quickly caught up and exceeded however this is now hitting a roadblock as shown above and extensive acceleration would sustain its advantage over conventional CPUs in terms of resources required for a long time For example on a single node of TSUBAME sporting 8 Opteron 280s (24 Ghz x 16 cores) the best 3-D FFT performance utilizing all 16 cores has been approximately 20 Gigaflops both single and double precision using the most optimized code available Contrastingly a single GPU card within the Tesla 1070p records over 150 Gigaflops in single precision and 40 Gigaflops using the NU-FFT [5] we developed Moreover when we parallelize across nodes we need more than 16 nodes (= 256 cores) to recover the parallelization overhead incurred to match a single GPU performance This difference is expected to worsen as GPU performance is expected to skyrocket both in terms of pure flops and memory bandwidth and most importantly excel in cost and flopswatt performance as a result

12 Commodity vector accelerationmdashthe rebirth of vector computing Overall it is fair to state that GPUs and Cell are receiving the most attention as accelerators of

choice primarily because of their commodity nature but also because of their high compute as well as memory bandwidth density The IBMToshibaSony PowerXCell 8i sports 1028 GigaFlops double precision peak performance as well as 25 GBs memory bandwidth GPU absolute performance is more dramatic with both the latest AMD FireStream and NVIDIA Tesla GPU sporting over 1 TFlops of (albeit single precision) performance and over 100 GBs memory bandwidth using multithreaded SIMD-vector processor array architecture augmented by point-to-point fast memory interconnect Such memory bandwidth effectively rivals the performance of the most expensive NEC SX series

Figure 1 GPU as ldquorebirth of vectorsrdquo in HPC mdash supporting both high computational density as well as high memory access frequency

Such a situation is depicted in figure 1 Here the Y-axis is the compute density of an application and

the X-axis is the memory access frequency HPC apps would be naturally situated farther away from the axis but can be categorized as (a) those with high computational density such as N-body or dense linear algebra situated towards the upper part of the graph and (b) those with high memory access frequency such as CFD and FFT situated towards the right Standard CPU performances are by and large mediocre appropriate for applications near the center and do not efficiently support (a) or (b) Modern accelerators such as GPUs and Cell contrastingly excel at both (a) and (b) because of their

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

3

architectural properties as described above As such modern accelerators could be general HPC accelerators for various types of codes not just specific ones as in the past

In fact both GPUs and Cell can be regarded as a rebirth of vector computing Although different from traditional architectures with deep vector pipelining nonetheless both architectures are quite amenable to where vector computers excelled in the past In fact modern algorithms intended to run on cache-oriented standard CPUs often do not work well due to their requirements for shorter memory access latency and locality enabled by extensive caching and associated blocking Instead classic algorithms intended for long vector-length ldquostreamingrdquo memory access work quite well Our recent bandwidth-intensive FFT work [5] is a manifestation of this in that we principally employ the classic vector-oriented multi-row FFT algorithm to compute the FFT for Y- and Z-axis (but not for the X-axismdashfor details refer to [5])

2 Acceleration in actionmdashTSUBAME 12 exemplars As we have seen accelerators exhibit high promise in becoming the next dominant resources in providing the majority of the compute capabilities in HPC serving both dense computations as well as vector-friendly streaming codes This is not to say however that such will occur automatically With commodity clusters it took years of continued research and development effort (still continuing to date) to mature to be useable in production especially at large scale Similarly since development of commodity accelerators has focused on their principal market ie non-HPC single-node applications considerable RampD as well as large-scale deployments are required for them to become truly mainstream in HPC

Here we present a concrete example of the largest high-performance GPU deployment to date on an open science supercomputer TSUBAME 12 at the Tokyo Institute of Technology and some of the early results in scaling representative large parallel benchmark codes

Figure 2 TSUBAME 12 GPU accelerated supercomputer

The original TSUBAME 10 was deployed as a 10480-CPU core cluster at the Tokyo Institute of

Technology GSIC Center in the spring of 2006 [6] In addition to the 5120 24 Ghz dual-core AMD Opteron CPUs in 655 nodes (8 socket Sun x4600 nodes) providing approximately 504 Gigaflops of peak FP performance it also sported 360 ClearSpeed CS600 Advance Accelerators one card per node for about half of the nodes providing approximately 30 Teraflops of additional compute power Over the years additional ClearSpeeds as well as Intel ldquoHarpertownrdquo Xeon nodes were added pushing the total performance up to approximately 105 Teraflops

TSUBAME 12 was implemented similarly as a follow-on upgrade to the existing TSUBAME adding 170 NVIDIA Tesla s1070 units each embodying four Tesla GPU cards for a total of 680 cards This allowed the single-precision FP performance of TSUBAME to be boosted to nearly 900

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

4

Teraflops and double-precision to approximately 160 Teraflops The added power and thermal load on TSUBAME is only about 150 KW which is about 18~110th of TSUBAME itself

Speedups of various libraries and applications using GPUs have been quite astounding on a single GPU as exemplified by the FFT speedup demonstrated in Section 2 We have utilized such capabilities for example to greatly speed up sophisticated applications such as all-to-all 3-D protein docking achieving equivalent speed to our earlier published result in [7] on BlueGeneL with almost four times better energy efficiency

Here we mention our current results on two multi-node large-scale benchmarks one dense linear algebra (Linpack) and the other sparse finite difference (multi-node Himeno) benchmark codes

21 Linpack on CPU-GPU-ClearSpeed heterogeneous configuration One of the major issues in heterogeneous supercomputers equipped with a batch of GPUsaccelerators is how users develop parallel programs that effectively use hybrid computing resources especially tightly coupled programs Here we take the High Performance Linpack (HPL) benchmark [8] and describe how it is heterogeneously accelerated on TSUBAME12 For more detailed implementation refer to our previous paper [9] 211 High Performance Linpack (HPL) HPL is a well known MPI based parallel benchmark used in the Top500 ranking solving a set of dense linear equations using a direct method It employs a block-based solver algorithm so as to harness the highly tuned underlying Level 3 BLAS librarymdashin particular using a fast matrix-multiply (DGEMM) function is the key to obtain good HPL performance Parallelization is achieved based on two-dimensional block cyclic distribution of the matrix where each MPI process possesses a sub-matrix of the almost same sizemdashas such HPL is designed for homogeneous environments

Thus there are several challenging issues to port HPL onto TSUBAME12 (1) All the processors CPUs GPUs ClearSpeed boards should be used cooperatively for DGEMM

kernel computation (intra-node heterogeneity) On TSUBAME the ratios of theoretical peak computation performance for CPUs GPUs and ClearSpeed are 35 33 32 therefore omitting any type of processor that causes heavy performance degradation

(2) We also have to consider inter-node heterogeneity which is that only 312 out of 648 nodes have two Tesla GPUs and the rest have none due to the configuration restrictions of Tesla 1070p

(3) Finally PCI-ePCI-X communication overhead should not be underestimated since the matrix data is basically allocated on host memory

212 Kernel functions and matrix block size In HPL the most time consuming part in the overall benchmark is the DGEMM function calls that multiply the (Mprime x B) matrix and (B x Nprime) matrix where B is a tuneable block size parameter To maintain a favorable communication-computation ratio B should be sufficiently large

As a CPU BLAS library we use GotoBLAS [10] For GPUs we use a matrix multiply function written from scratch by our group in order to achieve maximum efficient and at the same time allow asynchronous double buffering to hide the data transfer latency a feature not available in NVIDIArsquos native CUBLAS Its on-board performance is about 80 GFlops and we are able to almost completely hide the latency to effectively achieve this speed For ClearSpeed the CSXL BLAS library by ClearSpeed Inc (DGEMM speed is 63 GFlops) is used Through preliminary experiments we found that B = 1152 would be the optimal block size 213 Coping with heterogeneity In TSUBAME 12 we are faced with intra- and inter-node heterogeneity as described above We conduct load balancing of the DGEMM kernels running on different CPUsaccelerators based on the following strategies By contrast non-kernel computations such as pivoting or row exchange are simply done on x86 CPUs

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

5

Intra-node heterogeneity We conceptually regard all hybrid processors on each node as a farm that provides kernel computation facility for processes Here each process may delegate some fraction of DGEMM computation to accelerators while other fractions may be computed by the CPU-based BLAS as in figure 3

Inter-node heterogeneity With the above model inter-node heterogeneity appears as imbalance of the DGEMM performances among nodes To keep the performance per process almost identical we configure the number of processes per node to reflect their compute capabilities

Figure 3 Mapping between processes and processors (CPUs GPUs ClearSpeed) The left figure shows a node with GPUs the right one shows a node without GPU

214 Other techniques

We observed that PCI-ePCI-X communication consumes considerable CPU load and as a result conducting CPU-accelerator communication and computing DGEMM on the same cores incurs considerable performance degradation Instead we assign several dedicated cores solely for communication (black cores in figure 3) Although we lose considerable performance by such loss of cores (over 10 Teraflops for TSUBAME12) it becomes an overall win when acceleration is considered

In the original code DGEMM calls are sometimes fragmented when communication and DGEMM overlap We have eliminated such fragmentation by using Pthreads

We have implemented and added overlapping between pivot rows communication and DGEMM computation which was not in the original code but is effective for accelerated machines due to the inherent parallelism between CPUs and accelerators

215 Performance evaluation In running the whole-machine Linpack we used 648 of the 655 TSUBAME nodes each of which has eight dual-core Opteron 880 processors and a ClearSpeed accelerator and 312 nodes embodying two Tesla GPUs Additionally another production Xeon cluster called TSUBASA was integrated and used cooperativelymdashthere we used 80 nodes each of which has two quad-core Xeon E5440 processors Combining all these hybrid resources we have achieved 8701 TFlops Since the total peak speed is 163 TFlops the efficiency is 53 partly due to performance loss we mentioned above assigning dedicated CPU to support multiple accelerator as well as compromises made such as the block size (B=1152) Figure 4 compares peak speed and Linpack speed for each category of processors Interestingly we observe here that all the processors show similar efficiency (48 to 56) so the resulting share of runtime performance largely matches that of the theoretical peak performance Here GPU is the most efficient (56) in a more careful comparison but it is unclear whether this is inherent or due to compromise weighing in favour of GPU and we are conducting follow-up research to clarify this The figure also shows electrical power consumption we observe that while accelerators provide 66 of the computation performance their power consumption is only 15 of the entire machine

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

6

Figure 4 Performance ratio of each kind of processor Peak speed (double-precision) contribution to Linpack speed and electrical power consumption are shown

Figure 5 History of TSUBAME in the Top500

Figure 5 shows the history of our Linpack experiments on the TSUBAME in particular the

achieved Teraflops and rank in the Top500 Ever since the original TSUBAME was installed in 2006 processor resources have been steadily increased due to system upgrade which is represented by continuous improvement for every Top500 since its inception Since the performance only with Opteron CPUs was 3818 TFlops acceleration technology improved the TSUBAME system performance by a factor of 23

22 Himeno CFD benchmark acceleration over multiple GPU nodes The Himeno CFD benchmark is a simple 3-D CFD code that solves a Poisson equation using finite difference Jacobi iterations Although production-quality CFD code would be substantially more complex and employ sophisticated methods the Himeno benchmark nonetheless is extremely bandwidth demanding and serves as a benchmark that measures the worst-case scaling scenario for bandwidth intensive codes Himeno benchmark scores are published for a variety of architectures on the benchmark website [11] On a standard x86 CPU the benchmark is completely memory bound and the score would be approximately 1 GBs Contrastingly a single GPU implementation on a NVIDIA CUDA GPU has seen reported scores of over 70 GBs (figure 6) by utilizing high bandwidth device memory as well as shared memory

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

7

Figure 6 Riken Himeno benchmark and parallelization using CUDA GPUs

The Himeno benchmarkrsquos main loop involves 34 FP calculations 31 memory reads and 1 memory

write By optimization we can reduce this to 13 reads and 1 writemdashstill significant and very much memory bound There are four grid sizes the smallest S being (65 x 65 x 129) up to the largest XL being (513 x 513 x 1025) For our benchmark we slightly modified XL so that it would be (1025 x 513 x 513) ie will have a prolonged X-axis instead of Z for ease of programming

For parallelization across GPUs we conduct one-dimensional block distribution along the X-axis across multiple nodes Within a GPU we extensively parallelize and optimize the code using shared memory and coalesced memory access Across the GPUs we overlap MPI communication across nodes and Jacobi iteration computation within a GPU in each node

On TSUBAME12 CPU GPU communication bandwidth is greatly sacrificed by being based on PCI-E Gen1 x8 instead of the more modern Gen2 x16 and furthermore by two GPU cards sharing a single PCI-E lanemdashas such in the worst case we have only 18th the bandwidth of a standard Gen2 x16 implementation and care must be taken to hide the latency effectively Under this circumstance communication of the boundary region on every Jacobi iteration takes approximately 82 ms If the GPU compute granularity is greater than this we would have effectively hid the latency However as we increase the number of GPUs while maintaining the problems size (strong scaling) the boundary area adjacent to the next GPU will not change and only the size of the X-axis segment will get smaller shortening the compute time

Figure 7 (left) shows the result We observe that latency hiding technique greatly improves the scalability of the code achieving over 700 Gigaflops on 32 CPUs At that point we achieve 11 TBs in memory bandwidth Reflecting upon the theoretical compute models we observe that there is still a little bit of room for improvement but the benchmark is approaching the hardware limits

By doubling the problem size to XXL (figure 7 (right) 2049 x 513 x 513) we achieve considerable improvement in scalability scaling from 16 to 32 GPUs enables 179 times speedup Further details of the benchmark and the implementation will be the subject of another paper

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

8

152

292

524

709

132

224

337

455

0

100

200

300

400

500

600

700

800

4 8 16 32

GFL

OPS

152

292

524

709

304

584

1049

0

200

400

600

800

1000

1200

4 8 16 32

GFL

OPS

GPU nodes

Left bar with latency hiding right bar without GPU nodes

Left bar size XL right bar size XXL

Figure 7 Riken Himeno benchmark parallelized across nodes on TSUBAME 12 The left graph shows the effect of latency hiding as we scale to 32 GPU nodes The right graph shows the effect of doubling the problem size to achieve larger granularity and thus better scalability

3 Conclusion and future work Commodity acceleration is finally here and in fact might dominate the computing aspects of future supercomputers as one of its major properties is the rebirth of vector computing which has been largely punted in commodity clusters Already there are numbers of successful application ports with dramatic speedups reported in the literature and large-scale multi-node applications and deployments are starting to materialize where significant speedups andor power savings are being reported TSUBAME12 one of the first large-scale deployments of GPUs in a production supercomputer is allowing us to understand the attractions as well as limitations of the approach and what technical problems still lie ahead Even as we speak many GPU-enabled clusters are being planneddeployed but without sufficient software support they are reminiscent of early clusters

As future work we plan to conduct extensive research to help GPUs become the dominant compute resource This research includes various systems issues as well as algorithmic and application issues in preparation for deployment of a petascale TSUBAME 20 anticipated to be deployed in 2010

References [1] ldquoCell Broadband Engine Technology and Systemsrdquo IBM Systems Journal 51-5 May 2007 [2] Owens JD Houston M Luebke D Green S Stone JE Phillips JC ldquoGPU Computingrdquo Proc

IEEE 96-5 May 2008 pp 879-899 [3] ClearSpeed Inc ldquoClearSpeed Whitepaper CSX Processor Architecturerdquo

httpwwwclearspeedcomdocsresourcesClearSpeed_Architecture_Whitepaper_Feb07v2pdf Feb 2007

[4] Taiji M ldquoMDGRAPE-3 chip a 165 Gflops Application Specific LSI for Molecular Dynamics Simulationsrdquo Proc Hot Chips 16 IEEE Computer Society Press (CD-ROM) 2004

[5] Akira Nukada Yasuhiko Ogata Toshio Endo and Satoshi Matsuoka ldquoBandwidth Intensive 3-D FFT kernel for GPUs using CUDArdquo Proc ACMIEEE Supercomputing 2008 (SC2008) Austin Texas the IEEE Press Nov 2008

[6] Satoshi Matsuoka Petascale Computing Algorithms and Applications --- Chapter 14 The Road to TSUBAME and Beyond Chapman amp Hall CRC Computational Science Series pp289-310 2008

[7] Akira Nukada Yuichiro Hourai Akira Nishida and Yutaka Akiyama ldquoldquoHigh Performance 3D Convolution for Protein Docking on IBM Blue Generdquo Parallel and Distributed Processing and Applications Springer LNCS Vol 4742 pp 958-969 2007

[8] A Petitet R Whaley J Dongarra and A Cleary HPL ndash a portable implementation of the high-performance Linpack benchmark for distributed computers httpwwwnetliborgbenchmarkhpl

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

9

[9] Toshio Endo and Satoshi Matsuoka Massive Supercomputing Coping with Heterogeneity of Modern Accelerators IEEE International Parallel amp Distributed Processing Symposium (IPDPS 2008) the IEEE Press April 2008

[10] K Goto Goto BLAS httpwwwtaccutexaseduresourcessoftware [11] The Riken Himeno CFD Benchmark httpacccrikenjpHPCHimenoBMTindex_ehtml

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

10

architectural properties as described above As such modern accelerators could be general HPC accelerators for various types of codes not just specific ones as in the past

In fact both GPUs and Cell can be regarded as a rebirth of vector computing Although different from traditional architectures with deep vector pipelining nonetheless both architectures are quite amenable to where vector computers excelled in the past In fact modern algorithms intended to run on cache-oriented standard CPUs often do not work well due to their requirements for shorter memory access latency and locality enabled by extensive caching and associated blocking Instead classic algorithms intended for long vector-length ldquostreamingrdquo memory access work quite well Our recent bandwidth-intensive FFT work [5] is a manifestation of this in that we principally employ the classic vector-oriented multi-row FFT algorithm to compute the FFT for Y- and Z-axis (but not for the X-axismdashfor details refer to [5])

2 Acceleration in actionmdashTSUBAME 12 exemplars As we have seen accelerators exhibit high promise in becoming the next dominant resources in providing the majority of the compute capabilities in HPC serving both dense computations as well as vector-friendly streaming codes This is not to say however that such will occur automatically With commodity clusters it took years of continued research and development effort (still continuing to date) to mature to be useable in production especially at large scale Similarly since development of commodity accelerators has focused on their principal market ie non-HPC single-node applications considerable RampD as well as large-scale deployments are required for them to become truly mainstream in HPC

Here we present a concrete example of the largest high-performance GPU deployment to date on an open science supercomputer TSUBAME 12 at the Tokyo Institute of Technology and some of the early results in scaling representative large parallel benchmark codes

Figure 2 TSUBAME 12 GPU accelerated supercomputer

The original TSUBAME 10 was deployed as a 10480-CPU core cluster at the Tokyo Institute of

Technology GSIC Center in the spring of 2006 [6] In addition to the 5120 24 Ghz dual-core AMD Opteron CPUs in 655 nodes (8 socket Sun x4600 nodes) providing approximately 504 Gigaflops of peak FP performance it also sported 360 ClearSpeed CS600 Advance Accelerators one card per node for about half of the nodes providing approximately 30 Teraflops of additional compute power Over the years additional ClearSpeeds as well as Intel ldquoHarpertownrdquo Xeon nodes were added pushing the total performance up to approximately 105 Teraflops

TSUBAME 12 was implemented similarly as a follow-on upgrade to the existing TSUBAME adding 170 NVIDIA Tesla s1070 units each embodying four Tesla GPU cards for a total of 680 cards This allowed the single-precision FP performance of TSUBAME to be boosted to nearly 900

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

4

Teraflops and double-precision to approximately 160 Teraflops The added power and thermal load on TSUBAME is only about 150 KW which is about 18~110th of TSUBAME itself

Speedups of various libraries and applications using GPUs have been quite astounding on a single GPU as exemplified by the FFT speedup demonstrated in Section 2 We have utilized such capabilities for example to greatly speed up sophisticated applications such as all-to-all 3-D protein docking achieving equivalent speed to our earlier published result in [7] on BlueGeneL with almost four times better energy efficiency

Here we mention our current results on two multi-node large-scale benchmarks one dense linear algebra (Linpack) and the other sparse finite difference (multi-node Himeno) benchmark codes

21 Linpack on CPU-GPU-ClearSpeed heterogeneous configuration One of the major issues in heterogeneous supercomputers equipped with a batch of GPUsaccelerators is how users develop parallel programs that effectively use hybrid computing resources especially tightly coupled programs Here we take the High Performance Linpack (HPL) benchmark [8] and describe how it is heterogeneously accelerated on TSUBAME12 For more detailed implementation refer to our previous paper [9] 211 High Performance Linpack (HPL) HPL is a well known MPI based parallel benchmark used in the Top500 ranking solving a set of dense linear equations using a direct method It employs a block-based solver algorithm so as to harness the highly tuned underlying Level 3 BLAS librarymdashin particular using a fast matrix-multiply (DGEMM) function is the key to obtain good HPL performance Parallelization is achieved based on two-dimensional block cyclic distribution of the matrix where each MPI process possesses a sub-matrix of the almost same sizemdashas such HPL is designed for homogeneous environments

Thus there are several challenging issues to port HPL onto TSUBAME12 (1) All the processors CPUs GPUs ClearSpeed boards should be used cooperatively for DGEMM

kernel computation (intra-node heterogeneity) On TSUBAME the ratios of theoretical peak computation performance for CPUs GPUs and ClearSpeed are 35 33 32 therefore omitting any type of processor that causes heavy performance degradation

(2) We also have to consider inter-node heterogeneity which is that only 312 out of 648 nodes have two Tesla GPUs and the rest have none due to the configuration restrictions of Tesla 1070p

(3) Finally PCI-ePCI-X communication overhead should not be underestimated since the matrix data is basically allocated on host memory

212 Kernel functions and matrix block size In HPL the most time consuming part in the overall benchmark is the DGEMM function calls that multiply the (Mprime x B) matrix and (B x Nprime) matrix where B is a tuneable block size parameter To maintain a favorable communication-computation ratio B should be sufficiently large

As a CPU BLAS library we use GotoBLAS [10] For GPUs we use a matrix multiply function written from scratch by our group in order to achieve maximum efficient and at the same time allow asynchronous double buffering to hide the data transfer latency a feature not available in NVIDIArsquos native CUBLAS Its on-board performance is about 80 GFlops and we are able to almost completely hide the latency to effectively achieve this speed For ClearSpeed the CSXL BLAS library by ClearSpeed Inc (DGEMM speed is 63 GFlops) is used Through preliminary experiments we found that B = 1152 would be the optimal block size 213 Coping with heterogeneity In TSUBAME 12 we are faced with intra- and inter-node heterogeneity as described above We conduct load balancing of the DGEMM kernels running on different CPUsaccelerators based on the following strategies By contrast non-kernel computations such as pivoting or row exchange are simply done on x86 CPUs

SciDAC 2009 IOP PublishingJournal of Physics Conference Series 180 (2009) 012043 doi1010881742-65961801012043

5

Intra-node heterogeneity We conceptually regard all hybrid processors on each node as a farm that provides kernel computation facility for processes Here each process may delegate some fraction of DGEMM computation to accelerators while other fractions may be computed by the CPU-based BLAS as in figure 3

Inter-node heterogeneity With the above model inter-node heterogeneity appears as imbalance of the DGEMM performances among nodes To keep the performance per process almost identical we configure the number of processes per node to reflect their compute capabilities

Figure 3 Mapping between processes and processors (CPUs GPUs ClearSpeed) The left figure shows a node with GPUs the right one shows a node without GPU

214 Other techniques

We observed that PCI-ePCI-X communication consumes considerable CPU load and as a result conducting CPU-accelerator communication and computing DGEMM on the same cores incurs considerable performance degradation Instead we assign several dedicated cores solely for communication (black cores in figure 3) Although we lose considerable performance by such loss of cores (over 10 Teraflops for TSUBAME12) it becomes an overall win when acceleration is considered

In the original code DGEMM calls are sometimes fragmented when communication and DGEMM overlap We have eliminated such fragmentation by using Pthreads

We have implemented and added overlapping between pivot rows communication and DGEMM computation which was not in the original code but is effective for accelerated machines due to the inherent parallelism between CPUs and accelerators

215 Performance evaluation In running the whole-machine Linpack we used 648 of the 655 TSUBAME nodes each of which has eight dual-core Opteron 880 processors and a ClearSpeed accelerator and 312 nodes embodying two Tesla GPUs Additionally another production Xeon cluster called TSUBASA was integrated and used cooperativelymdashthere we used 80 nodes each of which has two quad-core Xeon E5440 processors Combining all these hybrid resources we have achieved 8701 TFlops Since the total peak speed is 163 TFlops the efficiency is 53 partly due to performance loss we mentioned above assigning dedicated CPU to support multiple accelerator as well as compromises made such as the block size (B=1152) Figure 4 compares peak speed and Linpack speed for each category of processors Interestingly we observe here that all the processors show similar efficiency (48 to 56) so the resulting share of runtime performance largely matches that of the theoretical peak performance Here GPU is the most efficient (56) in a more careful comparison but it is unclear whether this is inherent or due to compromise weighing in favour of GPU and we are conducting follow-up research to clarify this The figure also shows electrical power consumption we observe that while accelerators provide 66 of the computation performance their power consumption is only 15 of the entire machine

Figure 4. Performance ratio of each kind of processor. Peak speed (double precision), contribution to Linpack speed, and electrical power consumption are shown.

Figure 5. History of TSUBAME in the Top500.

Figure 5 shows the history of our Linpack experiments on TSUBAME, in particular the achieved Teraflops and the rank in the Top500. Ever since the original TSUBAME was installed in 2006, processor resources have been steadily increased through system upgrades, reflected in continuous improvement in every Top500 list since its inception. Since the performance with the Opteron CPUs alone was 38.18 TFlops, acceleration technology improved the TSUBAME system performance by a factor of 2.3.

2.2 Himeno CFD benchmark acceleration over multiple GPU nodes
The Himeno CFD benchmark is a simple 3-D CFD code that solves a Poisson equation using finite-difference Jacobi iterations. Although a production-quality CFD code would be substantially more complex and employ more sophisticated methods, the Himeno benchmark is extremely bandwidth demanding and serves as a benchmark that measures the worst-case scaling scenario for bandwidth-intensive codes. Himeno benchmark scores are published for a variety of architectures on the benchmark website [11]. On a standard x86 CPU the benchmark is completely memory bound, and the score would be approximately 1 GB/s. Contrastingly, a single-GPU implementation on an NVIDIA CUDA GPU has seen reported scores of over 70 GB/s (figure 6) by utilizing the high-bandwidth device memory as well as shared memory.

Figure 6. Riken Himeno benchmark and parallelization using CUDA GPUs.

The Himeno benchmark's main loop involves 34 floating-point calculations, 31 memory reads and 1 memory write per grid point. By optimization we can reduce this to 13 reads and 1 write, which is still significant and very much memory bound. There are four grid sizes, from the smallest S (65 x 65 x 129) up to the largest XL (513 x 513 x 1025). For our benchmark we slightly modified XL so that it becomes (1025 x 513 x 513), i.e., it has a prolonged X-axis instead of Z, for ease of programming.
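
With roughly 14 four-byte accesses remaining for every 34 flops, about 0.6 flops per byte, even the optimized loop stays memory bound, so the 70 GB/s quoted above would correspond to only a few tens of GFlops per GPU. The kernel below is a deliberately simplified 7-point Jacobi sweep, not the full 19-point Himeno loop with its coefficient and boundary arrays; it only illustrates the shared-memory staging of an xy-tile that removes the redundant in-plane global reads. The name jacobi7 and the tile sizes BX, BY are illustrative choices for this sketch.

```cuda
#include <cuda_runtime.h>

#define BX 32
#define BY 8

// Simplified 7-point Jacobi sweep (NOT the full Himeno loop): each block
// stages one (BX+2) x (BY+2) tile of an xy-plane in shared memory so the four
// in-plane neighbours are read on-chip; only the centre and the z-1 / z+1
// neighbours come from device memory.
__global__ void jacobi7(const float* __restrict__ p, float* __restrict__ pnew,
                        int nx, int ny, int nz)
{
  __shared__ float tile[BY + 2][BX + 2];

  int i = blockIdx.x * BX + threadIdx.x + 1;   // interior point in x
  int j = blockIdx.y * BY + threadIdx.y + 1;   // interior point in y
  int k = blockIdx.z + 1;                      // one xy-plane per grid z-slice
  bool active = (i < nx - 1) && (j < ny - 1);  // padding threads sit out
  size_t idx = ((size_t)k * ny + j) * nx + i;  // only dereferenced when active

  int ti = threadIdx.x + 1, tj = threadIdx.y + 1;
  if (active) {
    tile[tj][ti] = p[idx];
    if (threadIdx.x == 0)                       tile[tj][0]      = p[idx - 1];
    if (threadIdx.x == BX - 1 || i == nx - 2)   tile[tj][ti + 1] = p[idx + 1];
    if (threadIdx.y == 0)                       tile[0][ti]      = p[idx - nx];
    if (threadIdx.y == BY - 1 || j == ny - 2)   tile[tj + 1][ti] = p[idx + nx];
  }
  __syncthreads();   // reached by every thread, including inactive ones

  if (active)
    pnew[idx] = (1.0f / 6.0f) *
                (tile[tj][ti - 1] + tile[tj][ti + 1] +
                 tile[tj - 1][ti] + tile[tj + 1][ti] +
                 p[idx - (size_t)nx * ny] + p[idx + (size_t)nx * ny]);
}

// Launch example for one sweep over the interior of an nx x ny x nz grid:
//   dim3 block(BX, BY);
//   dim3 grid((nx - 2 + BX - 1) / BX, (ny - 2 + BY - 1) / BY, nz - 2);
//   jacobi7<<<grid, block>>>(d_p, d_pnew, nx, ny, nz);
```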

For parallelization across GPUs, we conduct a one-dimensional block distribution along the X-axis across multiple nodes. Within a GPU, we extensively parallelize and optimize the code using shared memory and coalesced memory access. Across GPUs, we overlap the MPI communication between nodes with the Jacobi iteration computation within each node's GPU.

On TSUBAME 1.2, CPU-GPU communication bandwidth is greatly sacrificed by being based on PCI-E Gen1 x8 instead of the more modern Gen2 x16, and furthermore by two GPU cards sharing a single PCI-E connection. As such, in the worst case we have only 1/8th the bandwidth of a standard Gen2 x16 implementation, and care must be taken to hide the latency effectively. Under this circumstance, communication of the boundary region on every Jacobi iteration takes approximately 8.2 ms. If the GPU compute granularity is greater than this, we can effectively hide the latency. However, as we increase the number of GPUs while maintaining the problem size (strong scaling), the boundary area adjacent to the next GPU does not change and only the size of the X-axis segment gets smaller, shortening the compute time.
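
The sketch below shows the general shape of such an overlap for one distributed Jacobi sweep; it is illustrative rather than the actual TSUBAME code. The launch_jacobi wrapper, the jacobi_sweep function, the pinned host halo buffers and the slab layout (nx_local interior yz-planes plus two halo planes per GPU) are assumptions made for the example: the interior planes, which need no remote data, are updated on one CUDA stream while the boundary planes wait for the PCI-e copies and the MPI halo exchange.

```c
#include <mpi.h>
#include <cuda_runtime.h>

// Assumed wrapper around the Jacobi kernel: updates planes [i_begin, i_end)
// of the local slab (one plane = ny*nz floats) on the given CUDA stream.
void launch_jacobi(float* d_pnew, const float* d_p,
                   int i_begin, int i_end, int ny, int nz, cudaStream_t s);

// One distributed Jacobi sweep. The slab on each GPU holds nx_local interior
// yz-planes plus two halo planes (indices 0 and nx_local+1); h_send / h_recv
// are pinned host buffers of two planes each.
void jacobi_sweep(float* d_p, float* d_pnew,
                  float* h_send, float* h_recv,
                  int nx_local, int ny, int nz,
                  int left, int right,          // neighbour ranks or MPI_PROC_NULL
                  cudaStream_t s_halo, cudaStream_t s_inner, MPI_Comm comm)
{
  size_t plane = (size_t)ny * nz, bytes = plane * sizeof(float);
  MPI_Request req[4];

  // Stage this GPU's two boundary planes (i = 1 and i = nx_local) on the host.
  cudaMemcpyAsync(h_send,         d_p + plane,
                  bytes, cudaMemcpyDeviceToHost, s_halo);
  cudaMemcpyAsync(h_send + plane, d_p + (size_t)nx_local * plane,
                  bytes, cudaMemcpyDeviceToHost, s_halo);

  // The interior planes (2 .. nx_local-1) need no remote data: start them now,
  // so this kernel runs while the halo exchange is in flight.
  launch_jacobi(d_pnew, d_p, 2, nx_local, ny, nz, s_inner);

  // Exchange halos over MPI once the device-to-host copies have finished.
  cudaStreamSynchronize(s_halo);
  MPI_Irecv(h_recv,         (int)plane, MPI_FLOAT, left,  0, comm, &req[0]);
  MPI_Irecv(h_recv + plane, (int)plane, MPI_FLOAT, right, 1, comm, &req[1]);
  MPI_Isend(h_send,         (int)plane, MPI_FLOAT, left,  1, comm, &req[2]);
  MPI_Isend(h_send + plane, (int)plane, MPI_FLOAT, right, 0, comm, &req[3]);
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

  // Push the received halo planes to the device, then update the two boundary
  // planes that depend on them; both steps are ordered behind each other on s_halo.
  cudaMemcpyAsync(d_p,                                  h_recv,
                  bytes, cudaMemcpyHostToDevice, s_halo);
  cudaMemcpyAsync(d_p + (size_t)(nx_local + 1) * plane, h_recv + plane,
                  bytes, cudaMemcpyHostToDevice, s_halo);
  launch_jacobi(d_pnew, d_p, 1, 2, ny, nz, s_halo);
  launch_jacobi(d_pnew, d_p, nx_local, nx_local + 1, ny, nz, s_halo);

  cudaStreamSynchronize(s_inner);
  cudaStreamSynchronize(s_halo);
  // The caller swaps d_p and d_pnew before the next iteration.
}
```

Whether the boundary exchange is actually hidden then comes down to the interior kernel taking longer than the copies plus the MPI time, which is exactly the granularity condition discussed above.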

Figure 7 (left) shows the result. We observe that the latency hiding technique greatly improves the scalability of the code, achieving over 700 Gigaflops on 32 GPUs. At that point we achieve 1.1 TB/s in memory bandwidth. Reflecting upon the theoretical compute models, we observe that there is still a little room for improvement, but the benchmark is approaching the hardware limits.

By doubling the problem size to XXL (figure 7 (right), 2049 x 513 x 513), we achieve considerable improvement in scalability: scaling from 16 to 32 GPUs yields a 1.79 times speedup. Further details of the benchmark and the implementation will be the subject of another paper.

[Figure 7 data, in GFLOPS:]
GPU nodes                      4      8      16     32
XL, with latency hiding       152    292    524    709
XL, without latency hiding    132    224    337    455
XXL, with latency hiding       -     304    584   1049

Figure 7. Riken Himeno benchmark parallelized across nodes on TSUBAME 1.2. The left graph shows the effect of latency hiding as we scale to 32 GPU nodes; the right graph shows the effect of doubling the problem size to achieve larger granularity and thus better scalability.

3. Conclusion and future work
Commodity acceleration is finally here, and in fact it might dominate the computing aspects of future supercomputers, as one of its major properties is the rebirth of vector computing, which had largely been abandoned in commodity clusters. Already there are a number of successful application ports with dramatic speedups reported in the literature, and large-scale multi-node applications and deployments are starting to materialize where significant speedups and/or power savings are being reported. TSUBAME 1.2, one of the first large-scale deployments of GPUs in a production supercomputer, is allowing us to understand the attractions as well as the limitations of the approach, and what technical problems still lie ahead. Even as we speak, many GPU-enabled clusters are being planned or deployed, but without sufficient software support they are reminiscent of early clusters.

As future work, we plan to conduct extensive research to help GPUs become the dominant compute resource. This research includes various systems issues as well as algorithmic and application issues, in preparation for the deployment of a petascale TSUBAME 2.0, anticipated in 2010.

References
[1] "Cell Broadband Engine Technology and Systems", IBM Systems Journal 51-5, May 2007
[2] Owens J D, Houston M, Luebke D, Green S, Stone J E and Phillips J C, "GPU Computing", Proc. IEEE 96-5, May 2008, pp 879-899
[3] ClearSpeed Inc., "ClearSpeed Whitepaper: CSX Processor Architecture", http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf, Feb 2007
[4] Taiji M, "MDGRAPE-3 chip: a 165 Gflops Application Specific LSI for Molecular Dynamics Simulations", Proc. Hot Chips 16, IEEE Computer Society Press (CD-ROM), 2004
[5] Akira Nukada, Yasuhiko Ogata, Toshio Endo and Satoshi Matsuoka, "Bandwidth Intensive 3-D FFT kernel for GPUs using CUDA", Proc. ACM/IEEE Supercomputing 2008 (SC2008), Austin, Texas, the IEEE Press, Nov 2008
[6] Satoshi Matsuoka, Petascale Computing: Algorithms and Applications, Chapter 14: The Road to TSUBAME and Beyond, Chapman & Hall/CRC Computational Science Series, pp 289-310, 2008
[7] Akira Nukada, Yuichiro Hourai, Akira Nishida and Yutaka Akiyama, "High Performance 3D Convolution for Protein Docking on IBM Blue Gene", Parallel and Distributed Processing and Applications, Springer LNCS Vol. 4742, pp 958-969, 2007
[8] A Petitet, R Whaley, J Dongarra and A Cleary, HPL - a portable implementation of the high-performance Linpack benchmark for distributed computers, http://www.netlib.org/benchmark/hpl

[9] Toshio Endo and Satoshi Matsuoka, "Massive Supercomputing Coping with Heterogeneity of Modern Accelerators", IEEE International Parallel & Distributed Processing Symposium (IPDPS 2008), the IEEE Press, April 2008
[10] K Goto, Goto BLAS, http://www.tacc.utexas.edu/resources/software
[11] The Riken Himeno CFD Benchmark, http://accc.riken.jp/HPC/HimenoBMT/index_e.html
