GA 676598
EUROPEAN CENTER OF EXCELLENCE - A H2020 E-INFRASTRUCTURE
GTC2018
Porting Quantum ESPRESSO to GPU Accelerated Systems
Pietro Bonfà, Fabio Affinito, Carlo Cavazzoni
CINECA, Casalecchio di Reno, Italy
https://www.nvidia.com/en-us/data-center/tesla-k80/
What is QuantumESPRESSO
Porting strategy
Benchmarks
Conclusions
Outlook
What is QuantumESPRESSO
● QUANTUM ESPRESSO is an initiative coordinated by the QUANTUM ESPRESSO Foundation, with the participation of SISSA, CINECA, ICTP and EPFL, and with many partners in Europe and worldwide.
● QUANTUM ESPRESSO is not a single application for quantum simulations; it is rather a distribution of packages performing different tasks and destined to be interoperable.
● Free as in freedom (GPLv2) and open development.
● Runs on anything from standalone workstations to massively parallel systems.
● Large scientific user base, vehicle for new methods and new algorithms.
  ○ V6.2.1 → 70400 downloads
  ○ >50 contributors
  ○ 1600+ registered users
  ○ ~ 500k lines, Fortran (& C)
● Simplify transition of new science to HPC systems.
$ ./configure && make all
[Figure: posts per month on the mailing list]
QE Libraries
Some of the time-consuming workloads of many packages are already encapsulated in a number of libraries, namely LAXLib, FFTXlib and KS_Solvers.
Underlying external libraries: FFTW, MKL, ESSL, ...
Clues from profiling
PWscf (CPU version) running on a single KNL node with 64 MPI processes (best time to solution).
Porting strategy
Past and present QE GPU ports
Porting effort carried out by MaX and supported by NVIDIA.
Timeline, 2012 to 2018:
● CUDA C based plugin for QE 5.x (pw.x) developed by F. Spiga and I. Girotto.
● Independent CUDA Fortran based port of QE 6.1 (pw.x) developed by F. Spiga and NVIDIA. Provides best performance, most used features implemented.
QE v5.4: CUDA C Plugin
✓✓ Self contained
● BLAS → PHIGEMM
● LAPACK → MAGMA
● 3 CUDA C kernels + cuFFT
✓ Good performance
F. Spiga: http://www.tcm.phy.cam.ac.uk/~mdt26/esdg_slides/spiga_may13.pdf
✗ Boilerplate code (interface + kernel)
QE v6.1: CUDA Fortran
✓ Single programming language: Fortran + CUDA Fortran
● BLAS → cuBLAS
● LAPACK → Custom GPU Eigensolver (outperforms MAGMA)
● CUF Kernel directives and CUDA Fortran kernels
✓✓ Very good performance
For a detailed description of the code and the benchmarks see: http://www.dcs.warwick.ac.uk/pmbs/pmbs17/PMBS17/
✗ Diverged from master branch
✗ Only selected features implemented
New Porting Strategy
Language: CUDA Fortran, leveraging the existing v6.1 code.
Programming model: explicit and directive based.
Plan:
1. Preserve modularity.
2. Maintain alignment with master branch. Maintain “hackability”.
3. Leave user experience intact.
4. General GPU architecture solutions.
5. Performance, of course.
New Porting Strategy
Application: pw.x
Legend: A = Accelerated, W = Working, U = Unavailable, B = Broken

Feature                     v5.4    v6.1    v6.3
Total Energy (K points)     A       A       A
Forces                      W       A       W
Stress                      W       A       W
Collinear magnetism         B (?)   A       A
Non-collinear magnetism     U       U       A
Gamma trick                 A       W (*)   A
US PP                       A       A       A
PAW PP                      ?       A (*)   A (*)
DFT+U                       W (?)   U       W
All other functionalities   W (?)   U (*)   W
New Porting Strategy
● Libraries
● Global Variables
● Memory Allocation
Libraries
● Full API support
● Unit testing
● Target best performance: CUDA Fortran, explicit CUDA API (concurrency, hardware specific options).
Libraries - FFTXlib
● Many small 3D FFTs (10^1 → 10^3)
● Overlap of communication and computation
● Batched work
[Figure: pipeline for 8 bands split into two batches of 4 bands; stages per batch: 1D FFT, Scatter, Alltoall, 2D FFT.]
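The batching idea can be illustrated with a small scheduling sketch in plain Fortran. This is not FFTXlib code: the band count and batch size are the ones shown in the figure, while the module `fft_batching` and the routines `num_batches` and `batch_bounds` are hypothetical names introduced here. The prints stand in for the real FFT and MPI calls and only show how the communication of one batch could be interleaved with the computation of the next.

```fortran
module fft_batching
  implicit none
contains
  ! Number of batches needed to cover nbnd bands.
  integer function num_batches(nbnd, batch_size)
    integer, intent(in) :: nbnd, batch_size
    num_batches = (nbnd + batch_size - 1) / batch_size
  end function num_batches

  ! First and last band index of batch ib (1-based); the last batch may be shorter.
  subroutine batch_bounds(nbnd, batch_size, ib, first, last)
    integer, intent(in)  :: nbnd, batch_size, ib
    integer, intent(out) :: first, last
    first = (ib - 1) * batch_size + 1
    last  = min(ib * batch_size, nbnd)
  end subroutine batch_bounds
end module fft_batching

program batched_schedule
  use fft_batching
  implicit none
  integer, parameter :: nbnd = 8, batch_size = 4   ! values from the figure
  integer :: nb, step, first, last

  nb = num_batches(nbnd, batch_size)
  ! Software pipeline: at each step the current batch is computed while the
  ! previous batch is communicated (placeholder prints, no real FFT or MPI).
  do step = 1, nb + 1
     if (step <= nb) then
        call batch_bounds(nbnd, batch_size, step, first, last)
        print '(a,i0,a,i0,a,i0)', 'step ', step, ': 1D FFT on bands ', first, '-', last
     end if
     if (step > 1) then
        call batch_bounds(nbnd, batch_size, step - 1, first, last)
        print '(a,i0,a,i0,a,i0)', 'step ', step, ': Alltoall for bands ', first, '-', last
     end if
  end do
end program batched_schedule
```

In an actual GPU port the two branches of each step would be issued asynchronously (for example on separate CUDA streams) so that they really run concurrently; the sketch only fixes the order of operations.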
Global Variables
Home-brewed managed memory:
1. Prioritize data encapsulation efforts.
2. Enforce a simple and effective update scheme for global variables.
3. Can provide asynchronous updates (not implemented yet).
4. General data duplication scheme.
5. Saves performance on old hardware.

    USE us,      ONLY : nqx, dq, spline_ps
    USE us_gpum, ONLY : tab_d, tab_d2y_d
    !
    implicit none
    !
    if (lmaxkb.lt.0) return
    call start_clock ('init_us_2')
    call using_tab_d(READ)                           ! <- sync. here
    if (spline_ps) call using_tab_d2y_d(READWRITE)   ! <- and here
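One way to picture the update scheme behind calls such as `using_tab_d(READ)` is a pair of "out of date" flags, one per copy of the data. The sketch below is an assumption about how such a scheme can work, not the QE implementation: it is plain Fortran, the "device" array is an ordinary host array, and the names `tab_sync`, `ACC_READ`, `ACC_READWRITE`, `tab_ood`, `tab_d_ood` and `ncopies` are hypothetical.

```fortran
module tab_sync
  implicit none
  integer, parameter :: ACC_READ = 0, ACC_READWRITE = 1   ! hypothetical access modes
  real    :: tab(4)    = 0.0       ! host copy
  real    :: tab_d(4)  = 0.0       ! stand-in for the device copy (host array here)
  logical :: tab_ood   = .false.   ! host copy is out of date
  logical :: tab_d_ood = .false.   ! device copy is out of date
  integer :: ncopies   = 0         ! transfers performed so far
contains
  ! Declare that the host copy is about to be accessed.
  subroutine using_tab(mode)
    integer, intent(in) :: mode
    if (tab_ood) then
       tab = tab_d                 ! device -> host transfer in a real port
       tab_ood = .false.
       ncopies = ncopies + 1
    end if
    if (mode == ACC_READWRITE) tab_d_ood = .true.   ! device copy becomes stale
  end subroutine using_tab

  ! Declare that the device copy is about to be accessed.
  subroutine using_tab_d(mode)
    integer, intent(in) :: mode
    if (tab_d_ood) then
       tab_d = tab                 ! host -> device transfer in a real port
       tab_d_ood = .false.
       ncopies = ncopies + 1
    end if
    if (mode == ACC_READWRITE) tab_ood = .true.     ! host copy becomes stale
  end subroutine using_tab_d
end module tab_sync

program sync_demo
  use tab_sync
  implicit none
  call using_tab(ACC_READWRITE)    ! host code is going to modify tab
  tab = 1.0
  call using_tab_d(ACC_READ)       ! synchronization happens here
  call using_tab_d(ACC_READ)       ! already up to date: no transfer
  print '(a,i0)', 'transfers: ', ncopies
  if (ncopies /= 1 .or. any(tab_d /= 1.0)) error stop 'unexpected sync state'
end program sync_demo
```

The demo prints `transfers: 1`: data moves only when a copy is both stale and about to be used, which is the property that lets untouched CPU code paths keep working without extra transfers.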
Memory allocation
● pw.x allocates many scratch variables. This substantially impacts the performance of the accelerated version of the subroutines.
● At the same time, GPU memory is limited.
QE GPU v6.1:

    USE some_module, ONLY : work
    !
    implicit none
    !
    ! Nested test: Fortran does not guarantee short-circuit evaluation of .and.
    IF( ALLOCATED( work ) ) THEN
       IF( SIZE( work ) < lwork ) DEALLOCATE( work )
    END IF
    IF( .not. ALLOCATED( work ) ) ALLOCATE( work( max_lwork ) )
    [...]
QE GPU v6.3:

    USE buffer_module, ONLY : gpu_buffer
    !
    implicit none
    !
    REAL, POINTER :: work(:)
    CALL gpu_buffer%lock_buffer(work, 10, ierr)
    [...]
    CALL gpu_buffer%release_buffer(work, ierr)
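The lock/release pattern can be sketched as a small pool of reusable work arrays. What follows is a minimal host-only illustration in standard Fortran, not the QE `buffer_module`: only the names `gpu_buffer`, `lock_buffer` and `release_buffer` come from the slide, the second argument is assumed to be the requested size, and the types `slot_t` and `pool_t`, the limit `max_slots` and the counter `nalloc` are hypothetical.

```fortran
module buffer_module
  implicit none
  integer, parameter :: max_slots = 8      ! arbitrary pool capacity

  type :: slot_t
     real, pointer :: data(:) => null()    ! storage owned by the pool
     logical       :: locked = .false.     ! .true. while lent to a caller
  end type slot_t

  type :: pool_t
     type(slot_t) :: slots(max_slots)
     integer      :: nalloc = 0            ! ALLOCATE calls performed so far
   contains
     procedure :: lock_buffer
     procedure :: release_buffer
  end type pool_t

  type(pool_t), save :: gpu_buffer

contains

  ! Hand out a work array of n elements, reusing existing storage when possible.
  subroutine lock_buffer(this, p, n, ierr)
    class(pool_t), intent(inout) :: this
    real, pointer, intent(out)   :: p(:)
    integer, intent(in)          :: n
    integer, intent(out)         :: ierr
    integer :: i, pick

    p => null()
    ierr = 1
    if (n < 1) return
    pick = 0
    do i = 1, max_slots
       if (this%slots(i)%locked) cycle
       if (pick == 0) pick = i             ! fallback: first free slot
       if (associated(this%slots(i)%data)) then
          if (size(this%slots(i)%data) >= n) then
             pick = i                      ! free slot that is already big enough
             exit
          end if
       end if
    end do
    if (pick == 0) return                  ! every slot is in use
    if (associated(this%slots(pick)%data)) then
       if (size(this%slots(pick)%data) < n) deallocate(this%slots(pick)%data)
    end if
    if (.not. associated(this%slots(pick)%data)) then
       allocate(this%slots(pick)%data(n))
       this%nalloc = this%nalloc + 1
    end if
    this%slots(pick)%locked = .true.
    p => this%slots(pick)%data(1:n)
    ierr = 0
  end subroutine lock_buffer

  ! Give the array back to the pool; the storage is kept for later reuse.
  subroutine release_buffer(this, p, ierr)
    class(pool_t), intent(inout) :: this
    real, pointer, intent(inout) :: p(:)
    integer, intent(out)         :: ierr
    integer :: i

    ierr = 1
    if (.not. associated(p)) return
    do i = 1, max_slots
       if (.not. this%slots(i)%locked) cycle
       if (size(this%slots(i)%data) < size(p)) cycle
       if (associated(p, this%slots(i)%data(1:size(p)))) then
          this%slots(i)%locked = .false.
          p => null()
          ierr = 0
          return
       end if
    end do
  end subroutine release_buffer
end module buffer_module

program buffer_demo
  use buffer_module, only : gpu_buffer
  implicit none
  real, pointer :: work(:)
  integer :: ierr, it

  do it = 1, 3                             ! e.g. three calls of the same routine
     call gpu_buffer%lock_buffer(work, 10, ierr)
     if (ierr /= 0) error stop 'lock failed'
     work = real(it)
     call gpu_buffer%release_buffer(work, ierr)
     if (ierr /= 0) error stop 'release failed'
  end do
  print '(a,i0)', 'allocations performed: ', gpu_buffer%nalloc
end program buffer_demo
```

The demo reports a single allocation for three lock/release cycles, which is the point of the scheme: repeated calls reuse the same storage instead of paying for an allocation each time, while released buffers remain available to other routines. A device version would presumably hold device-resident arrays in the slots.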
Recap
● Libraries
● Global Variables
● Memory Allocation
✓ Self contained
✓ Single programming language: Fortran + CUDA Fortran
✓ Aligned with official develop branch
❓ Performance...
Benchmarks
Benchmark systems
Compute units
Piz Daint XC50 @ CSCS (Q4 2016, 0.5 + 4.7 TFLOPs)
  Model: Xeon E5-2690 v3 (HSW) @ 2.60 GHz
  Cores: 1x12 = 12
  Accelerators: 1 x P100
  RAM: 64 GB/node
Galileo @ CINECA (Q1 2015, 0.6 + 2x2.9 TFLOPs)
  Model: Xeon E5-2630 v3 (HSW) @ 2.40 GHz
  Cores: 2x8 = 16
  Accelerators: 2 x K80
  RAM: 128 GB/node
Marconi @ CINECA (Q3 2016, 1.3 TFLOPs)
  Model: Xeon E5-2697 v4 (BDW) @ 2.30 GHz
  Cores: 2x18 = 36
  RAM: 128 GB/node
Benchmark systems
Compute units
Piz Daint XC50 @ CSCS (Q4 2016, 0.5 + 4.7 TFLOPs)
  Aries routing and communications ASIC, and Dragonfly network topology.
Galileo @ CINECA (Q1 2015, 0.6 + 2x2.7 TFLOPs)
  Infiniband network, with OFED v1.5.3, capable of a maximum bandwidth of 40 Gbit/s between each pair of nodes.
Marconi @ CINECA (Q3 2016, 1.3 TFLOPs)
  Intel Omnipath, 100 Gb/s. Fat Tree OPA (2:1 oversubscription tapering at the level of the core switches only).
[Figure: node diagrams showing the CPU, GPU and NIC arrangement of each system]
Benchmark details
● Total time for the iterative solution of the KS equation is compared for the CPU and the GPU versions of pw.x.
● Best time to solution per compute unit(s) is reported.
● Optimal execution parameters for v6.1 and v6.3 may differ.
[Figure: pw.x workflow with stages Initialization, Iterations for electronic ground state, Forces and Stress, Structural optimization]
C70
Very small test case, gamma trick.
  number of atoms/cell      = 280
  number of atomic types    = 1
  number of electrons       = 1120
  number of Kohn-Sham states = 672
  kinetic-energy cutoff     = 45 Ry
  charge density cutoff     = 450 Ry
  convergence threshold     = 1.0E-08
  Dense grid:  1685364 G-vectors, FFT dimensions: ( 225, 128, 240)
  Smooth grid: 426442 G-vectors, FFT dimensions: ( 144, 81, 150)
Iterations to reach convergence: 16
C70 results:
1. Speedup GPU vs CPU ~ 1.5x
2. v6.1 is missing gamma trick
3. CPU version scales better
4. At saturation GPU still faster
AuSurf
Small test case, 2 k-points.
  number of atoms/cell      = 112
  number of atomic types    = 1
  number of electrons       = 1232
  number of Kohn-Sham states = 800
  kinetic-energy cutoff     = 25 Ry
  charge density cutoff     = 200 Ry
  convergence threshold     = 1.0E-06
  Dense grid:  2158381 G-vectors, FFT dimensions: ( 180, 90, 288)
  Smooth grid: 763307 G-vectors, FFT dimensions: ( 125, 64, 200)
Iterations to reach convergence: 21±1
AuSurf results:
1. Speedup GPU vs CPU > 2x
2. v6.1 allocates more memory
3. CPU and GPU versions both scaling well.
4. v6.3 on GPUs is significantly slower than v6.1.
Ta2O5
Large test case, 26 k-points.
  number of atoms/cell      = 96
  number of atomic types    = 2
  number of electrons       = 544
  number of Kohn-Sham states = 326
  kinetic-energy cutoff     = 130 Ry
  charge density cutoff     = 520 Ry
  convergence threshold     = 1.0E-08
  Dense grid:  3645397 G-vectors, FFT dimensions: ( 200, 180, 216)
Iterations to reach convergence: [45, 49, 50, 51, 52]
Ta2O5 results:
1. Speedup GPU vs CPU > 2x
2. v6.1 allocates more memory
3. CPU and GPU versions both scaling well.
4. v6.3 on GPUs is significantly slower than v6.1.
Porting status
QE 6.3 GPU:
✓ is aligned with the develop branch of the community,
✓ passes all 186 tests of the feature testing suite,
✓ is undergoing integration with the main project,
✓ provides good performance, generally better than 2x (far from saturation),
✓ is ready for alpha release.
Collaboration and support from: J. Romero, M. Marić, M. Fatica, E. Phillips (NVIDIA), F. Spiga (ARM), A. Chandran (FZJ), I. Girotto (ICTP), P. Giannozzi (Univ. Udine), P. Delugas, S. De Gironcoli (SISSA).
Conclusions
● Preserved modularity
  ○ For code maintainability
  ○ For simpler development and debugging
● Preserved all functionalities
  ○ Same user experience
  ○ Various levels of acceleration for the various functionalities
● Preserved (promoted?) data encapsulation
Images: from, and modified from, www.nvidia.com/en-us/data-center/tesla-k80
Outlook and perspectives
● Investigate performance degradation from v6.1 to v6.3
  ○ How much is coming from missing components?
  ○ Impact of directive based programming model?
● More benchmarking on different HW combinations.
● More code validation, initialization and forces ported to CUDA Fortran.
● Prepare first alpha release.
THANK YOU FOR YOUR ATTENTION!
Credits: icons made by freepik from flaticon