A Method for Linking PGI CUDA Fortran with GPU-Optimized Libraries

Department of Electronics and Control Engineering, National Institute of Technology, Numazu College

Tomohiro Degawa

2014/9/26, Prometech Simulation Conference 2014

Contents

• GPGPU in Fortran and the libraries bundled with CUDA C

• Interoperation between C and Fortran

• Using cuSPARSE and applying it to the finite element method

• Using cuRAND


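Introduction

• GPGPU has spread and its user base has broadened
  • Demand is growing from Fortran users, who hold large code assets
  • CUDA Fortran appeared

• Performance has risen with each hardware generation
  • Know-how for speeding up code and using GPUs efficiently has accumulated
  • The field is moving from its early days toward maturity

• State of CUDA Fortran support
  • There is a delay between each new CUDA release and its support in CUDA Fortran
  • It is unclear how to use the libraries bundled with CUDA C

• This talk organizes how to use the libraries bundled with CUDA C from CUDA Fortran and presents application examples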


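PGI CUDA Fortran

• An extension of Fortran for NVIDIA GPUs

• Available in the Fortran compiler sold by PGI
  • Supported in version 10.0 and later
  • As of September 26, 2014, the latest version is 14.9 (released September 8)

• It builds on CUDA C, but new CUDA features cannot be used until the Fortran compiler supports them

• Good balance between the effort invested and the gain obtained (performance improvement)
  • Knowledge of parallel computing alone is enough to obtain reasonable performance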


CUDA Fortran sample (vector addition)

add.f90 (CPU version)

    module kernel
      implicit none
    contains
      ! c = a + b, computed sequentially on the CPU
      subroutine add_CPU(a,b,c,n)
        implicit none
        integer,value :: n
        real :: a(n),b(n),c(n)
        integer :: i
        do i=1,n
          c(i) = a(i)+b(i)
        end do
      end subroutine add_CPU
    end module kernel

    program add
      use kernel
      implicit none
      integer,parameter :: n = 512
      real,allocatable :: a(:), b(:), c(:)

      allocate(a(n)); a = 1.0
      allocate(b(n)); b = 2.0
      allocate(c(n)); c = 0.0

      call add_CPU(a, b, c, n)

      deallocate(a)
      deallocate(b)
      deallocate(c)
    end program add


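CUDA Fortran sample (vector addition)

add.cuf (GPU version)

    module kernel
      implicit none
    contains
      ! c = a + b, one thread per element
      attributes(global) subroutine add_GPU(a,b,c,n)
        implicit none
        integer,value :: n
        real :: a(n),b(n),c(n)
        integer :: i
        ! map block and thread IDs to the array index (1-based)
        i = (blockIdx%x-1)*blockDim%x + threadIdx%x
        c(i) = a(i)+b(i)
      end subroutine add_GPU
    end module kernel

    program add
      use kernel
      implicit none
      integer,parameter :: n = 512
      ! the device attribute places the arrays in GPU memory
      real,allocatable,device :: a(:), b(:), c(:)

      allocate(a(n)); a = 1.0
      allocate(b(n)); b = 2.0
      allocate(c(n)); c = 0.0

      ! 1 block of n threads, so exactly n threads run
      call add_GPU<<<1,n>>>(a, b, c, n)

      deallocate(a)
      deallocate(b)
      deallocate(c)
    end program add

• Because the memory attribute is declared explicitly, the compiler can easily tell where each array lives

• Block and thread IDs are mapped to array indices

• The code can be more concise than CUDA C (as long as error handling is left out)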


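Libraries bundled with CUDA C

• cuBLAS: dense linear algebra

• cuSPARSE: sparse linear algebra

• cuFFT: Fourier transforms

• cuRAND: random number generation

• Thrust: sort, reduction, scan

• NPP: image and signal processing

• cuDNN: neural networks

and others

• For cuBLAS and cuFFT, information on how to use them from CUDA Fortran is easy to find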



• cuBLAS
  • A Fortran module is provided in PGI 11.8 and later
  • Available with use :: cublas
  • There is no longer any need to write the interfaces yourself
  • Softek's website explains the usage in detail

http://www.softek.co.jp/SPG/Pgi/TIPS/public/accel/cublas40.html



• cuFFT
  • Softek's website explains the usage
    • How to write the interfaces
    • How to call the functions, and so on

http://www.softek.co.jp/SPG/Pgi/TIPS/public/accel/cufft.html



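• For cuBLAS and cuFFT, information on how to use them from CUDA Fortran is easy to find

• What about the other libraries?
  • Give up? Implement them yourself?
  • An individual effort will never come close to the speed of an optimized library

• We want to use the libraries from CUDA Fortran
  • Mixing C and Fortran has been common practice for a long time
  • Fortran 2003 strengthened interoperability with C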


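Interoperation between C and Fortran

• If C functions can be called from Fortran, the libraries bundled with CUDA can also be called from CUDA Fortran

• Mixed-language programming (compiler-directive style, before Fortran 2003)

main.f90

    program main
      implicit none
      interface
        subroutine c_routine()
          !DEC$ ATTRIBUTES C :: c_routine
        end subroutine c_routine
      end interface

      call c_routine()
    end program main

routine.c

    #include <stdio.h>
    /* extern "C" is only valid when compiling as C++ (e.g. with nvcc) */
    #ifdef __cplusplus
    extern "C" {
    #endif
    void c_routine(void){
        printf("greetings from C\n");
    }
    #ifdef __cplusplus
    }
    #endif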



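• Strengthened interoperability with C
  • One of the major features of Fortran 2003
  • The iso_c_binding intrinsic module
    • Defines kind values for variables that correspond to C types
    • Provides a derived type and functions for handling C pointers
  • Used as use :: iso_c_binding

• Kind values for variables that correspond to C types
  • integer(kind)
    • c_int: int
    • c_long: long int
    • c_long_long: long long int
    • c_intptr_t: for storing a memory address (pointer variable)
  • real(kind)
    • c_float: float
    • c_double: double

• Derived type (and functions) for handling C pointers
  • type(c_ptr): corresponds to a C pointer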



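• Describing how a C function is called with bind(C)
  • bind(C, name="C function name")
  • Specified when the argument list of a function or subroutine is described inside an interface block

    interface
      subroutine fortran_name(dummy_arguments) bind(C,name="c_function_name")
        implicit none
        ! declare the type and name of each dummy argument here
      end subroutine fortran_name
    end interface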


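Interoperation between C and Fortran

main.f90

    program main
      implicit none
      interface
        subroutine c_routine(a, b) bind(C,name="c_routine")
          use :: iso_c_binding
          integer(c_int),value :: a  ! with value: passed by value
          integer(c_int)       :: b  ! without value: passed as a pointer
        end subroutine c_routine
      end interface

      integer(c_int) :: a=1, b=2
      call c_routine(a,b)
    end program main

routine.c

    #include <stdio.h>
    /* extern "C" is only valid when compiling as C++ (e.g. with nvcc) */
    #ifdef __cplusplus
    extern "C" {
    #endif
    /* a function returning void is treated as a subroutine on the Fortran side */
    void c_routine(int a, int *b){
        printf("%d, %d\n", a, *b);
    }
    #ifdef __cplusplus
    }
    #endif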


• Defining enumerations
  • Corresponds to the C enum type
  • Defines a group of integers that carry a meaning
  • A tag name cannot be given to the enumeration

C (enum type)

    typedef enum {
        CUSPARSE_INDEX_BASE_ZERO=0,
        CUSPARSE_INDEX_BASE_ONE=1
    } cusparseIndexBase_t;

Fortran (enum,bind(c))

    enum,bind(c) !:: cusparseIndexBase_t
      enumerator :: CUSPARSE_INDEX_BASE_ZERO = 0
      enumerator :: CUSPARSE_INDEX_BASE_ONE = 1
    end enum


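Libraries bundled with CUDA

• Use cuSPARSE and cuRAND from CUDA Fortran through the Fortran 2003 features above

1. Create a module file
  • Describe the interfaces of the functions to be used
  • Describe the constants defined by the library as enumerations

2. In the program, use the module and call the functions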


cuSPARSE

• Try calling the cuSPARSE matrix-vector product
  • The sparse matrix is stored in CSR format

• Enumerations to define
  • cusparseIndexBase_t, cusparseMatrixType_t, cusparseOperation_t, cusparseStatus_t

• Interfaces to describe
  • cuSparseDcsrmv
  • helper functions
    • cuSparseCreate, cuSparseDestroy, cuSparseMatDescr, cuSparseDestroyMatDescr, cuSparseSetMatIndexBase, cuSparseGetMatIndexBase
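To make concrete what the CSR matrix-vector product computes, the following is a minimal host-side C sketch of y = alpha*A*x + beta*y for the non-transposed case. It is an illustration under stated assumptions, not the cuSPARSE implementation: the name csr_mv_ref and its argument order are hypothetical, and only the CSR layout (values, row pointers, column indices, index base) follows the slides.

```c
/* Reference CSR matrix-vector product on the host (illustrative only):
 *   y = alpha * A * x + beta * y
 * val[]    : nonzero values, stored row by row
 * rowptr[] : n_row+1 entries; row i owns val[rowptr[i]-base .. rowptr[i+1]-base-1]
 * colidx[] : column index of each nonzero, offset by base
 * base     : 0 or 1, as in CUSPARSE_INDEX_BASE_ZERO / CUSPARSE_INDEX_BASE_ONE
 * csr_mv_ref is a hypothetical name, not part of cuSPARSE. */
void csr_mv_ref(int n_row, double alpha,
                const double *val, const int *rowptr, const int *colidx,
                int base, const double *x, double beta, double *y)
{
    for (int i = 0; i < n_row; ++i) {
        double sum = 0.0;
        for (int k = rowptr[i] - base; k < rowptr[i + 1] - base; ++k)
            sum += val[k] * x[colidx[k] - base];
        y[i] = alpha * sum + beta * y[i];
    }
}
```

With one-based indexing, a 3-by-3 tridiagonal (1, -2, 1) matrix would be passed as val = {-2, 1, 1, -2, 1, 1, -2}, rowptr = {1, 3, 6, 8} and colidx = {1, 2, 1, 2, 3, 2, 3}.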


Interface for cuSPARSE

    module cusparse
      use iso_c_binding
      implicit none

      enum,bind(c) !:: cusparseIndexBase_t
        enumerator :: CUSPARSE_INDEX_BASE_ZERO = 0
        enumerator :: CUSPARSE_INDEX_BASE_ONE = 1
      end enum

      enum,bind(c) !:: cusparseMatrixType_t
        enumerator :: CUSPARSE_MATRIX_TYPE_GENERAL = 0
        enumerator :: CUSPARSE_MATRIX_TYPE_SYMMETRIC = 1
        enumerator :: CUSPARSE_MATRIX_TYPE_HERMITIAN = 2
        enumerator :: CUSPARSE_MATRIX_TYPE_TRIANGULAR = 3
      end enum

      enum,bind(c) !:: cusparseOperation_t
        enumerator :: CUSPARSE_OPERATION_NON_TRANSPOSE = 0
        enumerator :: CUSPARSE_OPERATION_TRANSPOSE = 1
        enumerator :: CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE = 2
      end enum

      enum,bind(c) !:: cusparseStatus_t
        enumerator :: CUSPARSE_STATUS_SUCCESS = 0
        enumerator :: CUSPARSE_STATUS_NOT_INITIALIZED = 1
        enumerator :: CUSPARSE_STATUS_ALLOC_FAILED = 2
        enumerator :: CUSPARSE_STATUS_INVALID_VALUE = 3
        enumerator :: CUSPARSE_STATUS_ARCH_MISMATCH = 4
        enumerator :: CUSPARSE_STATUS_MAPPING_ERROR = 5
        enumerator :: CUSPARSE_STATUS_EXECUTION_FAILED = 6
        enumerator :: CUSPARSE_STATUS_INTERNAL_ERROR = 7
        enumerator :: CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED = 8
      end enum


Interface for cuSPARSE (continued)

      interface
        integer(c_int) function cuSparseCreate(handle) &
            bind(C,name="cusparseCreate")
          use iso_c_binding
          implicit none
          type(c_ptr) :: handle        ! pass the address
        end function cuSparseCreate

        integer(c_int) function cuSparseDestroy(handle) &
            bind(C,name="cusparseDestroy")
          use iso_c_binding
          implicit none
          type(c_ptr),value :: handle  ! pass the value
        end function cuSparseDestroy
      end interface
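The two interfaces above differ only in whether handle carries the value attribute. The reason is visible on the C side, where cusparseCreate takes a pointer to the handle (cusparseHandle_t *) and cusparseDestroy takes the handle itself. The following toy library is a minimal sketch of that opaque-handle pattern; toyContext, toyHandle_t, toyCreate and toyDestroy are hypothetical names invented for illustration and are not part of any CUDA library.

```c
#include <stdlib.h>

/* Toy library with an opaque handle, mirroring the create/destroy pattern.
 * All names here are hypothetical. */
struct toyContext { int initialized; };
typedef struct toyContext *toyHandle_t;   /* the handle is itself a pointer */

/* Create writes a new handle through a pointer-to-handle, so the matching
 * Fortran dummy argument is type(c_ptr) WITHOUT value (passed by reference). */
int toyCreate(toyHandle_t *handle)
{
    *handle = malloc(sizeof(struct toyContext));
    if (*handle == NULL) return 1;
    (*handle)->initialized = 1;
    return 0;
}

/* Destroy receives the handle itself, so the matching Fortran dummy
 * argument is type(c_ptr),value (passed by value). */
int toyDestroy(toyHandle_t handle)
{
    if (handle == NULL) return 1;
    free(handle);
    return 0;
}
```

Declaring the create-side argument with value by mistake would hand the library an uninitialized pointer instead of the address to write into, which is why the distinction matters in every helper interface.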



Interface for cuSPARSE (continued)

      interface
        integer(c_int) function cuSparseMatDescr(descr) &
            bind(C,name="cusparseCreateMatDescr")
          use iso_c_binding
          implicit none
          type(c_ptr),intent(inout) :: descr     ! pass the address
        end function cuSparseMatDescr

        integer(c_int) function cuSparseDestroyMatDescr(descr) &
            bind(C,name="cusparseDestroyMatDescr")
          use iso_c_binding
          implicit none
          type(c_ptr),value,intent(in) :: descr  ! pass the value
        end function cuSparseDestroyMatDescr
      end interface



Interface for cuSPARSE (continued)

      interface
        integer(c_int) function cuSparseSetMatIndexBase(descr,INDEX_BASE) &
            bind(C,name="cusparseSetMatIndexBase")
          use iso_c_binding
          implicit none
          type(c_ptr)   ,value            :: descr  ! pass the value
          integer(c_int),value,intent(in) :: INDEX_BASE
        end function cuSparseSetMatIndexBase

        integer(c_int) function cuSparseGetMatIndexBase(descr) &
            bind(C,name="cusparseGetMatIndexBase")
          use iso_c_binding
          implicit none
          type(c_ptr),value :: descr                ! pass the value
        end function cuSparseGetMatIndexBase
      end interface


Interface for cuSPARSE (continued)

      interface cuSparsecsrmv
        integer(c_int) function cuSparseDcsrmv &
            (handle, OPERATION, N_Row, N_Col, N_Nonzero, coef1, descr, Mat_val, &
             Mat_rowptr, Mat_colidx, x, coef2, y) bind(C,name="cusparseDcsrmv")
          use iso_c_binding
          implicit none
          ! y = coef1*OPERATION(A)*x + coef2*y
          ! coef1 and coef2 have no value attribute: the C function takes them by pointer
          ! device arrays are assumed-size (*) so that a plain device address is passed,
          ! as an interoperable bind(C) interface requires
          type(c_ptr)   ,value ,intent(in   ) :: handle
          integer(c_int),value ,intent(in   ) :: OPERATION
          integer(c_int),value ,intent(in   ) :: N_Row
          integer(c_int),value ,intent(in   ) :: N_Col
          integer(c_int),value ,intent(in   ) :: N_Nonzero
          real(c_double)       ,intent(in   ) :: coef1
          type(c_ptr)   ,value ,intent(in   ) :: descr
          real(c_double),device,intent(in   ) :: Mat_val(*)
          integer(c_int),device,intent(in   ) :: Mat_rowptr(*)
          integer(c_int),device,intent(in   ) :: Mat_colidx(*)
          real(c_double),device,intent(in   ) :: x(*)
          real(c_double)       ,intent(in   ) :: coef2
          real(c_double),device,intent(inout) :: y(*)
        end function cuSparseDcsrmv
      end interface
    end module cusparse


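Implementing the Bi-CGStab method

• An iterative method for systems of linear equations

• Unlike the SOR method, no recurrence formula over the matrix entries has to be constructed; it is enough to be able to compute matrix-vector products

• The CUDA samples include an implementation of the CG method, so it was changed to Bi-CGStab and ported to CUDA Fortran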

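Pseudo-code

    Compute $r_0 = b - [A]x_0$. Set $r_0^* = r_0$, $c2_{-1} = 0$.
    For $j = 0, 1, \ldots$, until $\|r_j\| / \|b\| < \varepsilon$, Do
        $p_j = r_j + c2_{j-1}\,(p_{j-1} - c3_{j-1}[A]p_{j-1})$
        $c1_j = (r_0^*, r_j) / (r_0^*, [A]p_j)$
        $t_j = r_j - c1_j[A]p_j$
        $c3_j = ([A]t_j, t_j) / ([A]t_j, [A]t_j)$
        $x_{j+1} = x_j + c1_j p_j + c3_j t_j$
        $r_{j+1} = t_j - c3_j[A]t_j$
        $c2_j = (r_0^*, r_{j+1}) / \{c3_j\,(r_0^*, [A]p_j)\}$
    EndDo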
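The pseudo-code can be sketched on the host in plain C, which may help when checking a GPU port against a reference. This is a minimal sketch, not the author's implementation: the names bicgstab_ref, tridiag_mv and dot are hypothetical, the matrix-vector product is hard-coded to the tridiagonal (1, -2, 1) test matrix that the slides use later, and the guard against a vanishing t vector is an added assumption.

```c
#include <math.h>
#include <stdlib.h>

/* y = A*x for the n-by-n tridiagonal matrix with -2 on the diagonal
 * and 1 on both off-diagonals (hypothetical helper). */
static void tridiag_mv(int n, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        y[i] = -2.0 * x[i];
        if (i > 0)     y[i] += x[i - 1];
        if (i < n - 1) y[i] += x[i + 1];
    }
}

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* Host-side Bi-CGStab following the pseudo-code. x holds the initial guess
 * on entry and the approximate solution on exit. Returns the number of
 * iterations performed, or -1 if memory allocation fails. */
int bicgstab_ref(int n, const double *b, double *x, double tol, int max_iter)
{
    double *w = calloc((size_t)6 * n, sizeof(double));
    if (!w) return -1;
    double *r = w, *rs = w + n, *p = w + 2 * n;
    double *t = w + 3 * n, *Ap = w + 4 * n, *At = w + 5 * n;
    double c1, c2 = 0.0, c3 = 0.0;
    int j;

    tridiag_mv(n, x, r);                         /* r0 = b - A*x0, r0* = r0 */
    for (int i = 0; i < n; ++i) { r[i] = b[i] - r[i]; rs[i] = r[i]; }
    double bb  = dot(n, b, b);
    double rrs = dot(n, rs, r);                  /* (r0*, r_j) */

    for (j = 0; j < max_iter && dot(n, r, r) > tol * tol * bb; ++j) {
        for (int i = 0; i < n; ++i)              /* p_j (equals r_0 when j = 0) */
            p[i] = r[i] + c2 * (p[i] - c3 * Ap[i]);
        tridiag_mv(n, p, Ap);
        double rsAp = dot(n, rs, Ap);
        c1 = rrs / rsAp;
        for (int i = 0; i < n; ++i) t[i] = r[i] - c1 * Ap[i];
        tridiag_mv(n, t, At);
        double AtAt = dot(n, At, At);
        c3 = (AtAt > 0.0) ? dot(n, At, t) / AtAt : 0.0;  /* guard: t = 0 */
        for (int i = 0; i < n; ++i) {
            x[i] += c1 * p[i] + c3 * t[i];
            r[i]  = t[i] - c3 * At[i];
        }
        rrs = dot(n, rs, r);                     /* (r0*, r_{j+1}) */
        c2  = (c3 != 0.0) ? rrs / (c3 * rsAp) : 0.0;
    }
    free(w);
    return j;
}
```

Each vector update in the loop body corresponds to one cublasDaxpy or cublasDscal call, each dot to cublasDdot, and each tridiag_mv to one cuSparseDcsrmv call in the CUDA Fortran program.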


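Creating and compiling the project

• Create a project in Visual Studio
  • Creating the project and writing the program are the same as in ordinary programming

• Set the compile options
  • Set Enable CUDA Fortran to Yes
  • Add the libraries to be used, such as cusparse.lib and cublas.lib, to Additional Dependencies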


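Bi-CGStab method

• Tridiagonal matrix derived from the one-dimensional Poisson equation
  • The grid spacing is set to 1 for simplicity
  • b is set so that the solution x becomes 1, 2, 3, ..., N−1, N

\[
\begin{pmatrix}
-2 & 1 & & & \\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1 \\
 & & & 1 & -2
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{N-1} \\ x_N \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{N-1} \\ b_N \end{pmatrix}
\]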
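As a quick check of this setup, the right-hand side that produces the solution x_i = i can be worked out by hand from the (1, -2, 1) stencil. The derivation below is an added illustration; it uses only the matrix and the target solution stated on the slide.

```latex
\begin{aligned}
b_1 &= -2 \cdot 1 + 1 \cdot 2 = 0, \\
b_i &= (i-1) - 2i + (i+1) = 0 \qquad (2 \le i \le N-1), \\
b_N &= (N-1) - 2N = -(N+1).
\end{aligned}
```

So b is zero everywhere except in its last entry; for N = 1024, the value used in the program, b_N = -1025.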


Bi-CGStab method (program)

    program cusparse_bicgstab
      use iso_c_binding
      use cudafor
      use cusparse
      use cublas
      implicit none

      integer :: i

      real(c_double),allocatable :: b(:)
      real(c_double),allocatable :: x(:)
      real(c_double),allocatable :: r(:)
      real(c_double),allocatable :: zerovector(:)

      ! variables for the tridiagonal matrix (CSR format)
      real(c_double),allocatable :: Aval(:)
      integer(c_int),allocatable :: Arowptr(:)
      integer(c_int),allocatable :: Acolidx(:)
      integer(c_int),parameter :: N = 1024
      integer(c_int) :: Nnz
      integer,parameter :: indexbase = CUSPARSE_INDEX_BASE_ONE
      integer :: colstart

      type(c_ptr) :: Handle_cuSparse
      type(c_ptr) :: descr
      integer(c_int) :: stat

      type(cublasHandle) :: Handle_cuBlas

      real(c_double),allocatable,device :: dev_Aval(:)
      integer(c_int),allocatable,device :: dev_Arowptr(:)
      integer(c_int),allocatable,device :: dev_Acolidx(:)
      real(c_double),allocatable,device :: dev_b(:)
      real(c_double),allocatable,device :: dev_x(:)
      real(c_double),allocatable,device :: dev_r(:)

      real(c_double),allocatable,device :: dev_dr(:)
      real(c_double),allocatable,device :: dev_dp(:)

      ! variables for Bi-CGStab
      real(c_double) :: bb,rr
      real(c_double) :: coef1,coef2,coef3
      real(c_double) :: At,Att,AtAt,rsAp,rrs
      real(c_double),allocatable,device :: dev_p (:)
      real(c_double),allocatable,device :: dev_rs(:)
      real(c_double),allocatable,device :: dev_t (:)
      real(c_double),allocatable,device :: dev_At(:)
      real(c_double),allocatable,device :: dev_Ap(:)

      integer :: ite


Bi-CGStab method (program, continued)

      ! generate the tridiagonal matrix
      Nnz = 2 + (N-2)*3 + 2   ! number of nonzero elements
      allocate(Aval(Nnz))
      allocate(Acolidx(Nnz))
      allocate(Arowptr(N+1))

      i=1   ! row 1
      Arowptr(i) = indexbase
      Acolidx(1) = i  ; Aval(1) =-2d0
      Acolidx(2) = i+1; Aval(2) = 1d0
      colstart = 3

      do i=2,N-1   ! rows 2 to N-1
        Arowptr(i) = 2 + (i-2)*3 + indexbase
        Acolidx(colstart  ) = i-1; Aval(colstart  ) = 1d0
        Acolidx(colstart+1) = i  ; Aval(colstart+1) =-2d0
        Acolidx(colstart+2) = i+1; Aval(colstart+2) = 1d0
        colstart = colstart + 3
      end do

      i=N   ! row N
      Arowptr(i) = 2 + (i-2)*3 + indexbase
      Acolidx(colstart  ) = i-1; Aval(colstart  ) = 1d0
      Acolidx(colstart+1) = i  ; Aval(colstart+1) =-2d0

      ! end marker of the row pointer array
      Arowptr(N+1) = Nnz + indexbase

      allocate(zerovector(N)); zerovector=0d0

      allocate(x(N)); x=(/(i, i=1,N)/)
      allocate(b(N),source = zerovector)
      allocate(r(N),source = zerovector)

      ! allocate device arrays and copy the host data in one step
      allocate(dev_Aval(Nnz)   ,source = Aval      )
      allocate(dev_Acolidx(Nnz),source = Acolidx   )
      allocate(dev_Arowptr(N+1),source = Arowptr   )
      allocate(dev_x (N)       ,source = x         )
      allocate(dev_b (N)       ,source = b         )
      allocate(dev_r (N)       ,source = r         )
      allocate(dev_p (N)       ,source = zerovector)
      allocate(dev_rs(N)       ,source = zerovector)
      allocate(dev_t (N)       ,source = zerovector)
      allocate(dev_At(N)       ,source = zerovector)
      allocate(dev_Ap(N)       ,source = zerovector)

      stat = cuBlasCreate(Handle_cuBlas)
      stat = cuSparseCreate(Handle_cuSparse)
      stat = cuSparseMatDescr(descr)
      ! cuSparseSetMatType needs an interface written in the same way as
      ! cuSparseSetMatIndexBase; it does not appear on the interface slides
      stat = cuSparseSetMatType(descr,CUSPARSE_MATRIX_TYPE_GENERAL)
      stat = cuSparseSetMatIndexBase(descr,CUSPARSE_INDEX_BASE_ONE)


Bi-CGStab method (program, continued)

      ! build the right-hand side; x is then solved for from this b and A
      ! b() = 1*A(,)*x() + 0*b()
      stat = cuSparseDcsrmv(Handle_cuSparse,&
               CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, Nnz,&
               1d0, descr, dev_Aval, dev_Arowptr, dev_Acolidx,&
               dev_x, 0d0, dev_b)
      dev_x = 0d0   ! the initial guess of the solution is 0
      bb = cublasDdot(N, dev_b, 1, dev_b, 1)

      ! r() = b()-A(,)*x()
      !   r() = -A(,)*x()
      !   r() = b() + r()
      stat = cuSparseDcsrmv(Handle_cuSparse,&
               CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, Nnz,&
               -1d0, descr, dev_Aval, dev_Arowptr, dev_Acolidx,&
               dev_x, 0d0, dev_r)
      call cublasDaxpy(N, 1d0, dev_b, 1, dev_r, 1)
      rr = cublasDdot (N, dev_r, 1, dev_r, 1)

      ! do not iterate if the relative error is already within the tolerance
      if(rr/bb<=1d-24) stop

      ! set the shadow residual r0*
      dev_rs=dev_r
      rrs = rr

      coef1 = 0d0
      coef2 = 0d0
      coef3 = 0d0

      ite=0

      ! p = r + coef2*(p - coef3*Ap) -> p() = r() because coef2 = 0
      dev_p=dev_r

      ! Ap() = A(,)*p()
      ! rsAp = rs()*Ap()
      stat = cuSparseDcsrmv(Handle_cuSparse,&
               CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, Nnz,&
               1d0, descr, dev_Aval, dev_Arowptr, dev_Acolidx,&
               dev_p, 0d0, dev_Ap)
      rsAp = cublasDdot(N, dev_rs, 1, dev_Ap, 1)
      coef1 = rrs/rsAp

      ! t = r - coef1*Ap
      !   t() = r()
      !   t() = t() - coef1*Ap()
      dev_t=dev_r
      call cublasDaxpy(N, -coef1, dev_Ap, 1, dev_t, 1)

      !   At() = A(,)*t()
      !  Att   = At()*t()
      ! AtAt   = At()*At()
      stat = cuSparseDcsrmv(Handle_cuSparse,&
               CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, Nnz,&
               1d0, descr, dev_Aval, dev_Arowptr, dev_Acolidx,&
               dev_t, 0d0, dev_At)
      Att   = cublasDdot(N, dev_At, 1, dev_t , 1)
      AtAt  = cublasDdot(N, dev_At, 1, dev_At, 1)
      coef3 = Att/AtAt


Bi-CGStab method (program, continued)

      ! x = x + coef1*p + coef3*t
      !   x() = x() + coef1*p()
      !   x() = x() + coef3*t()
      ! r = t - coef3*At
      !   r() = t()
      !   r() = r() - coef3*At()
      call cublasDaxpy(N,  coef1, dev_p , 1, dev_x, 1)
      call cublasDaxpy(N,  coef3, dev_t , 1, dev_x, 1)
      dev_r=dev_t
      call cublasDaxpy(N, -coef3, dev_At, 1, dev_r, 1)

      rr   = cublasDdot(N, dev_r, 1, dev_r , 1)
      rrs  = cublasDdot(N, dev_r, 1, dev_rs, 1)
      coef2= rrs/(rsAp*coef3)

      print *,"iteration = ",ite, "residual = ",sqrt(rr)

      ! start of the iteration
      BiCGStab:do ite=1,2**20

        ! p = r + coef2*(p - coef3*Ap)
        !   Ap() = -coef3*Ap()
        !   p()  = p()+Ap()
        !   p()  = coef2*p()
        !   p()  = p()+r()
        call cublasDscal(N, -coef3, dev_Ap, 1)
        call cublasDaxpy(N,   1d0, dev_Ap, 1, dev_p, 1)
        call cublasDscal(N,  coef2, dev_p , 1)
        call cublasDaxpy(N,   1d0, dev_r , 1, dev_p, 1)

        ! Ap() = A(,)*p()
        ! rsAp = rs()*Ap()
        stat = cuSparseDcsrmv(Handle_cuSparse,&
                 CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, Nnz,&
                 1d0, descr, dev_Aval, dev_Arowptr, dev_Acolidx,&
                 dev_p, 0d0, dev_Ap)
        rsAp = cublasDdot(N, dev_rs, 1, dev_Ap, 1)
        coef1= rrs/rsAp

        ! t = r - coef1*Ap
        !   t() = r()
        !   t() = t() - coef1*Ap()
        dev_t=dev_r
        call cublasDaxpy(N, -coef1, dev_Ap, 1, dev_t, 1)

        !   At() = A(,)*t()
        !  Att   = At()*t()
        ! AtAt   = At()*At()
        stat = cuSparseDcsrmv(Handle_cuSparse,&
                 CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, Nnz,&
                 1d0, descr, dev_Aval, dev_Arowptr, dev_Acolidx,&
                 dev_t, 0d0, dev_At)
        Att   = cublasDdot(N, dev_At, 1, dev_t , 1)
        AtAt  = cublasDdot(N, dev_At, 1, dev_At, 1)
        coef3 = Att/AtAt


Bi-CGStab method (program, continued)

        ! x = x + coef1*p + coef3*t
        !   x() = x() + coef1*p()
        !   x() = x() + coef3*t()
        ! r = t - coef3*At
        !   r() = t()
        !   r() = r() - coef3*At()
        call cublasDaxpy(N,  coef1, dev_p , 1, dev_x, 1)
        call cublasDaxpy(N,  coef3, dev_t , 1, dev_x, 1)
        dev_r=dev_t
        call cublasDaxpy(N, -coef3, dev_At, 1, dev_r, 1)
        rr = cublasDdot(N, dev_r, 1, dev_r , 1)

        print*,"iteration = ",ite,"residual = ",sqrt(rr)
        if(rr<1d-24) exit BiCGStab

        rrs   = cublasDdot(N, dev_r, 1, dev_rs, 1)
        coef2 = rrs/(rsAp*coef3)

      end do BiCGStab

      ! copy the solution back to the host and print it
      x=dev_x
      print *,x

      stat = cuSparseDestroy(Handle_cuSparse)
      stat = cuBlasDestroy(Handle_cuBlas)

      deallocate(zerovector)

      deallocate(x)
      deallocate(b)
      deallocate(r)

      deallocate(dev_Aval   )
      deallocate(dev_Acolidx)
      deallocate(dev_Arowptr)
      deallocate(dev_x )
      deallocate(dev_b )
      deallocate(dev_r )
      deallocate(dev_p )
      deallocate(dev_rs)
      deallocate(dev_t )
      deallocate(dev_At)
      deallocate(dev_Ap)

    end program cusparse_bicgstab


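Execution results

Approximate solution

    0.9999999999876991  1.999999999975400  2.999999999963098
    3.999999999950792   4.999999999938501  5.999999999926186
    ...
    1018.999999999952   1019.999999999959  1020.999999999967
    1021.999999999981   1022.999999999983  1023.999999999991

[Figure: convergence history; horizontal axis: number of iterations, vertical axis: relative residual]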

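Kernel-free implementation of the finite element method

• If matrix-vector products can be computed and systems of linear equations can be solved, finite element computations can be carried out
  • The coefficient matrices are assembled on the CPU
  • The finite element computation itself only calls functions in the order given by the flowchart

• Even a beginner in GPU programming can (or should be able to) draw out GPU performance without writing a kernel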


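Governing equations

• Continuity equation and Navier-Stokes equations for incompressible flow

\[
\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0
\]

\[
\frac{\partial u}{\partial t} + \frac{\partial uu}{\partial x} + \frac{\partial uv}{\partial y}
= -\frac{\partial p}{\partial x}
+ \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)
\]

\[
\frac{\partial v}{\partial t} + \frac{\partial vu}{\partial x} + \frac{\partial vv}{\partial y}
= -\frac{\partial p}{\partial y}
+ \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right)
\]

p : pressure

Re : Reynolds number

t : time

u, v : velocity in the x and y directions

x, y : spatial coordinates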


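Fractional Step method

• A solution method that decouples velocity and pressure

Intermediate velocity:

\[
\frac{\tilde{u} - u^n}{\Delta t}
= -\left(\frac{\partial u^n u^n}{\partial x} + \frac{\partial v^n u^n}{\partial y}\right)
+ \frac{1}{Re}\left(\frac{\partial^2 u^n}{\partial x^2} + \frac{\partial^2 u^n}{\partial y^2}\right)
\]

\[
\frac{\tilde{v} - v^n}{\Delta t}
= -\left(\frac{\partial u^n v^n}{\partial x} + \frac{\partial v^n v^n}{\partial y}\right)
+ \frac{1}{Re}\left(\frac{\partial^2 v^n}{\partial x^2} + \frac{\partial^2 v^n}{\partial y^2}\right)
\]

Pressure Poisson equation:

\[
\frac{\partial^2 p^{n+1}}{\partial x^2} + \frac{\partial^2 p^{n+1}}{\partial y^2}
= \frac{1}{\Delta t}\left(\frac{\partial \tilde{u}}{\partial x} + \frac{\partial \tilde{v}}{\partial y}\right)
\]

Velocity correction:

\[
u^{n+1} = \tilde{u} - \Delta t\,\frac{\partial p^{n+1}}{\partial x}, \qquad
v^{n+1} = \tilde{v} - \Delta t\,\frac{\partial p^{n+1}}{\partial y}
\]

$\tilde{u}, \tilde{v}$ : intermediate velocity in the x and y directions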


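Finite element equations

• Formulation by the Galerkin method

Intermediate velocity:

\[
\tilde{u} = u^n + \Delta t\, m^{-1}\left\{ -[G_x]u^n u^n - [G_y]v^n u^n + \frac{1}{Re}[D]u^n \right\}
\]

\[
\tilde{v} = v^n + \Delta t\, m^{-1}\left\{ -[G_x]u^n v^n - [G_y]v^n v^n + \frac{1}{Re}[D]v^n \right\}
\]

Pressure:

\[
[D]\,p^{n+1} = \frac{1}{\Delta t}\left([G_x]\tilde{u} + [G_y]\tilde{v}\right)
\]

Velocity correction:

\[
u^{n+1} = \tilde{u} - \Delta t\, m^{-1}[G_x]\,p^{n+1}, \qquad
v^{n+1} = \tilde{v} - \Delta t\, m^{-1}[G_y]\,p^{n+1}
\]

m : lumped mass matrix

[D] : diffusion matrix

[G_x], [G_y] : gradient matrices in the x and y directions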


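Replacing the computational procedure

• The computational steps of the finite element equations and of the Bi-CGStab method are replaced by combinations of
  • scalar multiple of a vector
  • element-wise sum of vectors
  • element-wise product of vectors
  • inner product of vectors
  • matrix-vector product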


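Replacing the computational procedure (continued)

    Procedure                               cuBLAS/cuSPARSE function
    Scalar multiple of a vector             cublasDscal
    Element-wise sum of vectors             cublasDaxpy
    Element-wise product of vectors (*1)    cublasDgbmv
    Inner product of vectors                cublasDdot
    Matrix-vector product                   cusparseDcsrmv

*1
• cuBLAS and cuSPARSE have no function for the element-wise product of vectors
• The band matrix-vector product cublasDgbmv is used instead, computing it as the product of a diagonal matrix and a vector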
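The trick in note *1 works because a diagonal matrix is a band matrix with no sub-diagonals and no super-diagonals, so its band storage is just the vector of diagonal entries. The following host-side C sketch illustrates this under stated assumptions: gbmv_ref is a hypothetical stand-in that follows the BLAS band-storage convention for the non-transposed, unit-stride case, and elementwise_product is an illustrative wrapper; neither is the cuBLAS routine itself.

```c
/* Reference band matrix-vector product y = alpha*A*x + beta*y for an
 * n-by-n band matrix with kl sub-diagonals and ku super-diagonals.
 * BLAS band storage (column-major, 0-based here): element a(i,j) is
 * stored at ab[(ku + i - j) + j*lda]. gbmv_ref is a hypothetical name. */
void gbmv_ref(int n, int kl, int ku, double alpha, const double *ab, int lda,
              const double *x, double beta, double *y)
{
    for (int i = 0; i < n; ++i) {
        int jlo = (i - kl > 0) ? i - kl : 0;          /* first column in the band */
        int jhi = (i + ku < n - 1) ? i + ku : n - 1;  /* last column in the band  */
        double sum = 0.0;
        for (int j = jlo; j <= jhi; ++j)
            sum += ab[(ku + i - j) + j * lda] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}

/* Element-wise product c = a .* b: treat a as the diagonal of a matrix,
 * i.e. a band matrix with kl = ku = 0 and leading dimension 1. */
void elementwise_product(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; ++i) c[i] = 0.0;  /* c is read by the beta term */
    gbmv_ref(n, 0, 0, 1.0, a, 1, b, 0.0, c);
}
```

With kl = ku = 0 the inner loop runs over a single column, so the cost is one multiplication per element, the same as a dedicated element-wise kernel would need.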


Replacing the intermediate velocity computation

\[
\tilde{u} = u^n + \Delta t\, m^{-1}\left\{ -[G_x]u^n u^n - [G_y]v^n u^n + \frac{1}{Re}[D]u^n \right\}
\]

    uu  ← cublasDgbmv(u^n, u^n)
    vu  ← cublasDgbmv(v^n, u^n)
    Guu ← cusparseDcsrmv([G_x], uu)
    Gvu ← cusparseDcsrmv([G_y], vu)
    Du  ← cusparseDcsrmv([D], u^n)
    ũ   ← cublasDgbmv(u^n, m)
    ũ   ← cublasDaxpy(ũ, −Δt×Guu)
    ũ   ← cublasDaxpy(ũ, −Δt×Gvu)
    ũ   ← cublasDaxpy(ũ, Δt/Re×Du)
    ũ   ← cublasDgbmv(ũ, m^{−1})


Replacing the velocity correction

\[
u^{n+1} = \tilde{u} - \Delta t\, m^{-1}[G_x]\,p^{n+1}
\]

    Gp      ← cusparseDcsrmv([G_x], p^{n+1})
    Gp      ← cublasDgbmv(Gp, m^{−1})
    ũ       ← cublasDaxpy(ũ, −Δt×Gp)
    u^{n+1} ← ũ


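Writing the finite element program

• Programming and GPU porting were done by
  • a fifth-year student (at the time) of Numazu College
  • Programming experience
    • C: 1 year
    • Java: half a year
  • No experience with numerical computation

• Computing environment
  • CPU: Intel Xeon W3565
  • GPU: NVIDIA Tesla C2075
  • CUDA 5.0, 5.5
  • PGI CUDA Fortran 14.6

Time required for each step

    Step                                        Time
    Writing the finite element program (C)      3 months (6 hours/week, interrupted for 2 weeks by regular exams)
    Porting to the GPU (CUDA 5.0)               1 week
    Porting to CUDA Fortran                     1 day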


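Computational conditions

• Reynolds number: Re = Ud/ν = 100
• Cylinder diameter: d = 1
• Inflow velocity: U = 1
• Time step: Δt = 0.01
• End time: t = 1000

• Linear triangular elements
  • Number of elements: 14400
  • Number of nodes: 7414

[Figure: computational domain around the cylinder with uniform inflow U and axes x, y; domain dimensions 62d and 16d]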

Computational results

• Pressure and velocity distributions (t = 600)
  • The Kármán vortex street shed into the wake of the cylinder is captured


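Computation time

• The run time of a program that took about one week on the CPU was reduced to 8 hours

• Even a beginner in GPU programming can port the finite element method to the GPU simply by replacing each operation written in C with a kernel
  • The resulting processing is considerably redundant

    Platform              Run time
    CPU                   1 week
    GPU (CUDA C)          8 hours
    GPU (CUDA Fortran)    8 hours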


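cuRAND

• As with cuSPARSE, define the enumerations and describe the interfaces

• How to call cuRAND from CUDA Fortran is described in some English-language presentation materials
  • For example, Fatica, M., CUDA Libraries and CUDA Fortran, http://www.nvidia.com/content/PDF/isc‐2011/Fatica.pdf

• A search turns up almost no information in Japanese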


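Monte Carlo method

• As a sample, compute π

• Scatter points inside a square using random numbers and count the points that fall inside the circle

• If the number of points is infinite and no two points coincide, π is obtained exactly

• Area of the circle: $\pi r^2$

• Area of the square: $4r^2$

• Area ratio: $a = \pi r^2 / 4r^2$

• $\pi = 4a$
  • The area ratio a between the circle and the square is approximated by the ratio of point counts

[Figure: circle of radius r inscribed in a square of side 2r]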
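The estimate can be tried on the host before involving cuRAND. The sketch below is an illustration under stated assumptions rather than the program from the slides: lcg_uniform is a hypothetical stand-in for the GPU random number generator (a 64-bit linear congruential generator with Knuth's MMIX constants), and estimate_pi is a hypothetical name. As in the CUDA Fortran program, points are drawn in the unit square and tested against the quarter circle, which has the same area ratio a as the full circle in its bounding square.

```c
#include <stdint.h>

/* 64-bit linear congruential generator; the top 53 bits are mapped to a
 * double in [0,1). Hypothetical host-side stand-in for cuRAND output. */
static double lcg_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Estimate pi from n random points in the unit square: the fraction with
 * x^2 + y^2 <= 1 approximates the area ratio a, and pi is about 4*a. */
double estimate_pi(uint64_t n, uint64_t seed)
{
    uint64_t state = seed, inside = 0;
    for (uint64_t i = 0; i < n; ++i) {
        double x = lcg_uniform(&state);
        double y = lcg_uniform(&state);
        if (x * x + y * y <= 1.0) ++inside;
    }
    return 4.0 * (double)inside / (double)n;
}
```

The statistical error of such an estimate shrinks only in proportion to the inverse square root of the number of points, which is why many samples, and therefore a fast generator, are needed.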


Interface for cuRAND

    module curand
      use iso_c_binding
      implicit none

      enum,bind(c) ! :: curandStatus_t
        enumerator :: CURAND_STATUS_SUCCESS = 0
        enumerator :: CURAND_STATUS_VERSION_MISMATCH = 100
        enumerator :: CURAND_STATUS_NOT_INITIALIZED = 101
        enumerator :: CURAND_STATUS_ALLOCATION_FAILED = 102
        enumerator :: CURAND_STATUS_TYPE_ERROR = 103
        enumerator :: CURAND_STATUS_OUT_OF_RANGE = 104
        enumerator :: CURAND_STATUS_LENGTH_NOT_MULTIPLE = 105
        enumerator :: CURAND_STATUS_DOUBLE_PRECISION_REQUIRED = 106
        enumerator :: CURAND_STATUS_LAUNCH_FAILURE = 201
        enumerator :: CURAND_STATUS_PREEXISTING_FAILURE = 202
        enumerator :: CURAND_STATUS_INITIALIZATION_FAILED = 203
        enumerator :: CURAND_STATUS_ARCH_MISMATCH = 204
        enumerator :: CURAND_STATUS_INTERNAL_ERROR = 999
      end enum

      enum,bind(c) ! :: curandRngType_t
        enumerator :: CURAND_RNG_TEST = 0
        enumerator :: CURAND_RNG_PSEUDO_DEFAULT = 100
        enumerator :: CURAND_RNG_PSEUDO_XORWOW = 101
        enumerator :: CURAND_RNG_PSEUDO_MRG32K3A = 121
        enumerator :: CURAND_RNG_PSEUDO_MTGP32 = 141
        enumerator :: CURAND_RNG_PSEUDO_MT19937 = 142
        enumerator :: CURAND_RNG_PSEUDO_PHILOX4_32_10 = 161
        enumerator :: CURAND_RNG_QUASI_DEFAULT = 200
        enumerator :: CURAND_RNG_QUASI_SOBOL32 = 201
        enumerator :: CURAND_RNG_QUASI_SCRAMBLED_SOBOL32 = 202
        enumerator :: CURAND_RNG_QUASI_SOBOL64 = 203
        enumerator :: CURAND_RNG_QUASI_SCRAMBLED_SOBOL64 = 204
      end enum


Interface for cuRAND (continued)

      interface
        integer(c_int) function curandCreateGenerator(generator, rng_type) &
            bind(C,name="curandCreateGenerator")
          use iso_c_binding
          implicit none
          type(c_ptr)         ,intent(inout) :: generator  ! pass the address
          integer(c_int),value,intent(in   ) :: rng_type
        end function curandCreateGenerator

        integer(c_int) function curandDestroyGenerator(generator) &
            bind(C,name="curandDestroyGenerator")
          use iso_c_binding
          implicit none
          type(c_ptr),value,intent(in) :: generator         ! pass the value
        end function curandDestroyGenerator
      end interface


Interface for cuRAND (continued)

      ! generic name: the single- or double-precision version is selected
      ! from the type of the output array
      interface curandGenerateUniform
        integer(c_int) function curandGenerateUniform(generator, output, num) &
            bind(C,name="curandGenerateUniform")
          use iso_c_binding
          implicit none
          type(c_ptr)      ,value ,intent(in   ) :: generator
          ! assumed-size (*) so that a plain device address is passed
          real(c_float)    ,device,intent(inout) :: output(*)
          integer(c_size_t),value ,intent(in   ) :: num
        end function curandGenerateUniform

        integer(c_int) function curandGenerateUniformDouble(generator, output, num) &
            bind(C,name="curandGenerateUniformDouble")
          use iso_c_binding
          implicit none
          type(c_ptr)      ,value ,intent(in   ) :: generator
          real(c_double)   ,device,intent(inout) :: output(*)
          integer(c_size_t),value ,intent(in   ) :: num
        end function curandGenerateUniformDouble
      end interface

    end module curand


Computing π by the Monte Carlo method

    program montecarlo
      use cudafor
      use iso_c_binding
      use curand
      implicit none

      integer(c_size_t),parameter :: N=1024*1024
      integer :: i,inside
      real(8) :: pi
      real(c_double),allocatable :: x(:)
      real(c_double),allocatable :: y(:)

      ! variables for cuRAND
      type(c_ptr) :: generator
      real(c_double),allocatable,device :: dev_x(:)
      real(c_double),allocatable,device :: dev_y(:)
      integer(c_int) :: stat

      stat = curandCreateGenerator &
               (generator,CURAND_RNG_PSEUDO_MTGP32)

      ! generate the point coordinates on the GPU
      allocate(dev_x(N))
      allocate(dev_y(N))
      stat = curandGenerateUniform(generator, dev_x, N)
      stat = curandGenerateUniform(generator, dev_y, N)

      ! copy the coordinates back to the host
      allocate(x(N), source=dev_x)
      allocate(y(N), source=dev_y)

      ! count the points inside the circle
      inside = 0
      do i=1,N
        if( (x(i)**2+y(i)**2) <= 1d0) inside=inside+1
      end do

      pi = 4d0*dble(inside)/N
      print *,pi, "error=",abs(pi-acos(-1d0))/acos(-1d0)

      stat = curandDestroyGenerator(generator)
      deallocate(x)
      deallocate(y)
      deallocate(dev_x)
      deallocate(dev_y)

    end program montecarlo


Computing π by the Monte Carlo method (results)

    Number of points    π          Relative error
    2^1                 4.00000    2.73×10^−1
    2^2                 4.00000    2.73×10^−1
    2^3                 4.00000    2.73×10^−1
    2^4                 3.50000    1.14×10^−1
    2^5                 3.50000    1.14×10^−1
    2^6                 3.31250    5.44×10^−2
    2^7                 3.31250    5.44×10^−2
    2^8                 3.17188    9.64×10^−3
    2^9                 3.17969    1.21×10^−2
    2^10                3.12500    5.28×10^−3
    2^11                3.10938    1.03×10^−2
    2^12                3.10938    1.03×10^−2
    2^13                3.12354    5.75×10^−3
    2^14                3.12915    3.96×10^−3
    2^15                3.13025    3.61×10^−3
    2^16                3.14325    5.27×10^−4
    2^17                3.13980    5.70×10^−4
    2^18                3.14346    5.95×10^−4
    2^19                3.14624    1.48×10^−3
    2^20                3.14131    8.94×10^−5


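Conclusion

• The iso_c_binding intrinsic module was introduced with the aim of using the libraries bundled with CUDA C from Fortran

• A Fortran module for using cuSPARSE was created
  • It was used to implement the Bi-CGStab method and the finite element method
  • Even beginners in numerical computation and GPU programming were able to port a CPU program to the GPU in a realistic amount of time
  • The speed-up is reasonable, but some of the processing is redundant

• A Fortran module for using cuRAND was created
  • As a trial, it was applied to computing π by the Monte Carlo method

• With iso_c_binding, the libraries bundled with CUDA C and CUDA Fortran can be linked efficiently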

