Parallel Programming and CUDA

Seokjoon Yun (윤석준), Principal Researcher

2014-04-23

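What This Talk Covers

1. Understanding parallel programming
2. Parallel programming techniques
2.1. Parallel programming inside a core
2.2. Thread-level parallel programming
2.3. Process-level parallel programming
2.4. Parallel programming with GPGPU
3. What is CUDA?
4. CUDA case studies
5. Recent topics in parallelization
6. Q&A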


Moore's Law

"The processing power of a microchip doubles every 18 months."

Gordon Moore, co-founder, Intel

Intel Core i3 / i5 / i7

As of January 2010 (Clarkdale)
- i3 : 3.3 GHz
- i5 : 3.6 GHz
- i7 : 3.2 GHz

As of February 2014 (Haswell)
- i3 : 2.4 GHz
- i5 : 2.9 GHz
- i7 : 3.3 GHz


Multicore Processor

           i7   i5   i3
Cores       4    4    2
Threads     8    4    4


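Parallel Programming

$$\sum_{k=0}^{10000} k$$

When parallel processing is possible

- Tasks that do not affect one another

$$\sum_{k=0}^{10000} k = \sum_{k=0}^{1000} k + \sum_{k=1001}^{2000} k + \cdots + \sum_{k=9001}^{10000} k$$

When parallel processing is not possible

- A loop iteration needs the value computed by an earlier iteration

Fibonacci sequence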
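The contrast between the two cases can be sketched in C++. This is only an illustration under stated assumptions: `chunked_sum` and `fibonacci` are hypothetical names, `n` is assumed to be divisible by `chunks`, and the OpenMP pragma is a hint that is simply ignored (the loop runs serially with the same result) when OpenMP is not enabled.

```cpp
#include <cstdint>
#include <vector>

// Sum of k for k = 0..n, split into `chunks` independent partial sums.
// Each chunk writes only its own slot, so the chunks can run in any order
// or at the same time. Assumes n is divisible by chunks.
std::int64_t chunked_sum(int n, int chunks) {
    std::vector<std::int64_t> partial(chunks, 0);
    const int width = n / chunks;
    #pragma omp parallel for
    for (int c = 0; c < chunks; ++c) {
        const int lo = (c == 0) ? 0 : c * width + 1;  // 0, 1001, 2001, ...
        const int hi = (c + 1) * width;               // 1000, 2000, ...
        for (int k = lo; k <= hi; ++k) partial[c] += k;
    }
    std::int64_t total = 0;  // final combine step is serial
    for (std::int64_t p : partial) total += p;
    return total;
}

// Fibonacci: each iteration needs the two previous results, so the loop
// carries a dependency and cannot be cut into independent chunks this way.
std::int64_t fibonacci(int n) {
    std::int64_t prev = 0, curr = 1;  // F(0), F(1)
    for (int i = 0; i < n; ++i) {
        const std::int64_t next = prev + curr;
        prev = curr;
        curr = next;
    }
    return prev;  // F(n)
}
```

Calling `chunked_sum(10000, 10)` reproduces the ten-way split shown above, while `fibonacci` has no comparable decomposition because iteration `i` cannot start before iteration `i - 1` finishes.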


Parallel Programming Techniques

① Parallel programming inside a core

② Thread-level parallel programming

③ Process-level parallel programming

④ Parallel programming with GPGPU


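Parallel Programs Inside a Core

SIMD (Single Instruction Multiple Data)

int a = 1 + 2;
int b = 3 + 4;
int c = 5 + 6;
int d = 7 + 8;

SISD (one addition per instruction)

a = 1 + 2
b = 3 + 4
c = 5 + 6
d = 7 + 8

SIMD (four additions in one instruction)

[a b c d] = [1 3 5 7] + [2 4 6 8]

• Only basic arithmetic operations are available

• Data must be placed in separately allocated memory

• The number of registers is limited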
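The four additions above can be written as a single SIMD instruction. The sketch below assumes an x86 CPU with SSE2 and falls back to a plain loop elsewhere; `add4` is a hypothetical helper name, not something from the talk.

```cpp
#include <array>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 integer intrinsics
#endif

// Adds two groups of four 32-bit integers element by element.
std::array<int, 4> add4(const std::array<int, 4>& x, const std::array<int, 4>& y) {
    std::array<int, 4> out{};
#if defined(__SSE2__)
    // Load both operands into 128-bit registers (unaligned loads are allowed).
    const __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x.data()));
    const __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y.data()));
    // One instruction performs all four 32-bit additions.
    const __m128i sum = _mm_add_epi32(vx, vy);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), sum);
#else
    // Scalar fallback (SISD): four separate additions.
    for (int i = 0; i < 4; ++i) out[i] = x[i] + y[i];
#endif
    return out;
}
```

With the slide's values, `add4({1, 3, 5, 7}, {2, 4, 6, 8})` yields `a, b, c, d` in one step. The explicit load and store calls correspond to the point about placing data in dedicated memory before the SIMD unit can use it.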


Thread-Level Parallel Programs

• OpenMP
• pthreads
• Parallel Patterns Library (PPL)
• Multithreaded programs

Loop parallelization, section parallelization
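The two OpenMP styles named here can be sketched as follows. The function names `scale` and `min_and_max` are made up for this example, and the input vector of `min_and_max` is assumed to be non-empty. Without OpenMP enabled at compile time the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <vector>

// Loop parallelization: iterations are independent, so OpenMP may hand
// different ranges of i to different threads.
void scale(std::vector<float>& v, float factor) {
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) v[i] *= factor;
}

// Section parallelization: two unrelated tasks, each of which may run on
// its own thread. Each section writes a different variable.
void min_and_max(const std::vector<int>& v, int& lo, int& hi) {
    lo = v.front();
    hi = v.front();
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int x : v) if (x < lo) lo = x;
        }
        #pragma omp section
        {
            for (int x : v) if (x > hi) hi = x;
        }
    }
}
```

Loop parallelization splits one task by data, whereas section parallelization runs different tasks side by side; which one fits depends on whether the work is one large loop or several independent steps.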


Process-Level Parallel Programs

• MPI (Message Passing Interface)
• HPF (High Performance Fortran)
• PVM (Parallel Virtual Machine)


GPGPU Parallel Programs



Floating-Point Operations per Second for the CPU and GPU

● GTX 770 : 3.2 TFLOPS

● i7 : 141 GFLOPS

About 22.7x (3.2 TFLOPS / 141 GFLOPS)



Memory Bandwidth for the CPU and GPU

● GTX 770 : 224.3 GB/s

● i7 : 25.6 GB/s

8.8x


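What is CUDA? (Compute Unified Device Architecture)

• Supported by NVIDIA GPUs produced since the GeForce 8800 GTX (November 2006)

• GeForce : general-purpose graphics cards
• Quadro : dedicated to high-performance graphics work
• Tesla : dedicated to high-performance computing, without graphics output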


CPU vs GPU

The GPU Devotes More Transistors to Data Processing


CUDA Hardware Architecture

• SM (Streaming Multiprocessor)

• CUDA Core (Streaming Processor)


CUDA Data Parallel Threading Model

• Block : the unit of work, assigned to an SM

• Warp : the group of threads (32) that an SM executes together

• Grid : a collection of blocks

• Thread : parallelism within a block, executed on a CUDA core
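How a thread finds its own data item, and how a block is cut into warps, comes down to a little integer arithmetic. The host-side C++ sketch below mimics that arithmetic for a one-dimensional launch; the function names are hypothetical, and the parameters stand in for the CUDA built-ins `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```cpp
const int kWarpSize = 32;  // warp size on CUDA hardware

// Global index of a thread in a 1-D launch:
// block index * threads per block + thread index within the block.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Number of warps needed to hold one block (rounded up).
int warps_per_block(int block_dim) {
    return (block_dim + kWarpSize - 1) / kWarpSize;
}

// Threads doing useful work in the last warp of a block.
int active_threads_in_last_warp(int block_dim) {
    const int rem = block_dim % kWarpSize;
    return rem == 0 ? kWarpSize : rem;
}
```

For a block of 1000 threads this gives 32 warps, with only 8 active threads in the last one, which is one reason block sizes are usually chosen as multiples of 32.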


Structure of CUDA Memory

• Register
- Memory on the processor chip
- Holds the local variables of a function (local arrays generally end up in off-chip device memory instead)
- The fastest memory

• Shared Memory
- Memory on the processor chip
- Shared by the threads of a block running on an SM
- About as fast as an L1 cache

• Constant Memory
- Read-only cache
- Write (from DRAM) : 400 ~ 600 cycles
- Read : on par with a register on a cache hit

• Global Memory
- DRAM mounted on the video card
- Read/Write : 400 ~ 600 cycles

• Texture Memory
- Global memory with cached reads
- Read-only once it has been set up


CUDA Streaming

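• Data transfer between host DRAM and GPU DRAM is slow compared with the computation speed

• When large data sets are processed, data transfer takes a larger share of the total time than computation

• Split the transfer and the computation into small chunks and process them one after another, so that the transfer of one chunk can overlap the computation of another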
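The benefit of chunking can be estimated with a simple timing model. The sketch below is an idealized two-stage pipeline (transfer, then compute) and not a measurement: it assumes equal-sized chunks, perfect overlap between the transfer of one chunk and the computation of the previous one, and no per-chunk overhead. The function names are hypothetical.

```cpp
#include <algorithm>

// No streaming: transfer everything, then compute everything.
double serial_time(double transfer, double compute) {
    return transfer + compute;
}

// Streaming with `chunks` equal pieces. The first chunk must be transferred
// and the last chunk must be computed without overlap; in between, the
// slower of the two stages sets the pace.
double pipelined_time(double transfer, double compute, int chunks) {
    const double t = transfer / chunks;  // transfer time per chunk
    const double c = compute / chunks;   // compute time per chunk
    return t + c + (chunks - 1) * std::max(t, c);
}
```

With 8 time units of transfer and 4 of computation, four chunks bring the total from 12 down to 9 in this model. The total can never drop below the larger of the two stages, so streaming hides the cheaper stage rather than removing the transfer cost.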


CUDA Library

• cuRAND : random number generator (CUDA Random Number Generation library)

• cuFFT : FFT library (CUDA Fast Fourier Transform library)

• cuBLAS : linear algebra library (CUDA Basic Linear Algebra Subroutines library)


CUDA Programming Example

CPU version

float fResult[1024][1000];
float fData[1024][1000];

for (int i = 0; i < 1024; i++)
{
    for (int j = 0; j < 1000; j++)
    {
        for (int k = 0; k < 33; k++)
        {
            fResult[i][j] += Calc(fData[i][j], k);
        }
    }
}

CUDA version

// Kernel function : called by the CPU, runs on the GPU
// Device function : runs on the GPU
// Host function   : runs on the CPU
// Atomic operations : mutual exclusion

int main()
{
    …
    float *dev_fResult, *dev_fData;
    int iSizeData = 1024 * 1000 * sizeof(float);

    // Allocate device memory
    cudaMalloc((void**)&dev_fResult, iSizeData);
    cudaMemset(dev_fResult, 0, iSizeData);
    cudaMalloc((void**)&dev_fData, iSizeData);
    cudaMemcpy(dev_fData, fData, iSizeData, cudaMemcpyHostToDevice);

    // Launch the kernel : 1024 blocks of 1000 threads each
    KernelFunc<<<1024, 1000>>>(dev_fResult, dev_fData);

    // Allocate host memory and copy the result back
    float* fResult = new float[1024 * 1000];
    cudaMemcpy(fResult, dev_fResult, iSizeData, cudaMemcpyDeviceToHost);

    cudaFree(dev_fResult);
    cudaFree(dev_fData);
    …
}

// Calc must be declared __device__ to be callable from the kernel
__global__ void KernelFunc(float* i_fResult, float* i_fData)
{
    float fResult = 0.0f;
    // One thread per element : block index * threads per block + thread index
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float fData = i_fData[index];
    for (int k = 0; k < 33; k++)
    {
        fResult += Calc(fData, k);
    }
    i_fResult[index] = fResult;
}


CUDA Compute capability (version)

http://en.wikipedia.org/wiki/CUDA

Compute capability (version) and example GPU
- 1.0 : 8800 GTX
- 1.1 : 8400M GT
- 1.2 : GTS 350M
- 1.3 : GTX 280
- 2.x : GTX 550
- 3.0 : GTX 770
- 3.5 : GTX TITAN
- 5.0 : GTX 750

Technical specifications
- Maximum dimensionality of grid of thread blocks : 2 (1.x), 3 (2.x and later)
- Maximum x-, y-, or z-dimension of a grid of thread blocks : 65535 (1.x, 2.x), 2^31 - 1 (3.0 and later)
- Maximum dimensionality of thread block : 3
- Maximum x- or y-dimension of a block : 512 (1.x), 1024 (2.x and later)
- Maximum z-dimension of a block : 64
- Maximum number of threads per block : 512 (1.x), 1024 (2.x and later)
- Warp size : 32
- Maximum number of resident blocks per multiprocessor : 8 (1.x, 2.x), 16 (3.0, 3.5), 32 (5.0)
- Maximum number of resident warps per multiprocessor : 24 (1.0, 1.1), 32 (1.2, 1.3), 48 (2.x), 64 (3.0 and later)
- Maximum number of resident threads per multiprocessor : 768 (1.0, 1.1), 1024 (1.2, 1.3), 1536 (2.x), 2048 (3.0 and later)
- Number of 32-bit registers per multiprocessor : 8 K (1.0, 1.1), 16 K (1.2, 1.3), 32 K (2.x), 64 K (3.0 and later)
- Maximum number of 32-bit registers per thread : 128 (1.x), 63 (2.x, 3.0), 255 (3.5, 5.0)
- Maximum amount of shared memory per multiprocessor : 16 KB (1.x), 48 KB (2.x, 3.0, 3.5), 64 KB (5.0)




CUDA 6.0 : Unified Memory

http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6













CUDA Case Study : Success 1

• Beam pattern generation

- 1000 bearings x 1000 frequency bins

- 100 sensors (a beam pattern is generated for each bearing)

- No data transfer

CPU : 32 s (OpenMP)

CUDA : 0.26 s (123x faster)


CUDA Case Study : Success 2

• Applying the beam pattern to target signals

- 192 bearings x 1000 frequency bins

- 16 targets

- Data read/write : 192 x 1000 x floatComplex x 2 (LOFAR/DEMON)

CPU : 900 ms (OpenMP)

CUDA : 14 ms (64x faster)


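CUDA Case Study : Failure

• Cubic Spline Interpolation

No performance gain over the CPU implementation

(Figure : input points … Xi(14), Xi(15), Xi(16) … and output points Xo(n), Xo(n+1))

- Computing the variables needed at each input point : a sequential operation

- Computing the output values : can be parallelized, but finding the matching input position for each output point takes longer than in the sequential version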
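One plausible reading of the second point is sketched below in host-side C++. It is an interpretation, not the original implementation: the function names are hypothetical, the input points `xi` are assumed to be sorted with at least two entries, and the output points `xo` are assumed to be sorted as well. A sequential loop can resume its search where the previous output point left off, while an independent (parallel) worker has to search from scratch for every output point.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential version: because xo is sorted, the interval index j only ever
// moves forward, so all outputs together cost one pass over xi.
std::vector<std::size_t> intervals_sequential(const std::vector<double>& xi,
                                              const std::vector<double>& xo) {
    std::vector<std::size_t> idx(xo.size());
    std::size_t j = 0;
    for (std::size_t n = 0; n < xo.size(); ++n) {
        while (j + 2 < xi.size() && xi[j + 1] <= xo[n]) ++j;
        idx[n] = j;  // xi[j] <= xo[n] < xi[j + 1], clamped to the last interval
    }
    return idx;
}

// Independent version: what each parallel worker would have to do for its
// own output point, a full binary search with no shared state.
std::size_t interval_independent(const std::vector<double>& xi, double x) {
    const auto it = std::upper_bound(xi.begin(), xi.end(), x);
    const std::size_t j =
        (it == xi.begin()) ? 0 : static_cast<std::size_t>(it - xi.begin()) - 1;
    return std::min(j, xi.size() - 2);  // clamp to the last interval
}
```

Both functions return the same interval indices; the difference is only in cost. Per output point the sequential walk is amortized constant time, while the independent search is logarithmic in the number of input points, which may offset the gain from running the outputs in parallel.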


Recent Topics in Parallelization

• Raspberry-Pi

- 700 MHz ARM11 CPU

- Broadcom VideoCore IV GPU

- 256 MB RAM

- 10.13 GFLOPS

- i7 : 141 GFLOPS

- Supercomputer 'Chundoong' (천둥) : 107 TFLOPS

http://www.raspberrypi.org


Recent Topics in Parallelization

• Multi-GPU Motherboard

http://prod.danawa.com/info/?pcode=2466508&cate1=861&cate2=875&cate3=968&cate4=0


