More Modern GPU

Preferred Networks, Inc.

Preferred Infrastructure, Inc.

12/3 2015, PFI/PFN

GPU/CUDA
- GPU/CUDA
- GPU, CPU
TitanX 6TFlops (3072 cores), Xeon 0.8TFlops (18 cores)
CPU 24 SIMD
GPU 2
- GPU
GPU GB/s

GPU
CPU
GPU/CUDA GPU again
HW
CPU

CUDA
- CUDA [Okuta 2015]
- cupy (chainer)
numpy, GPU
- modern gpu/thrust
STL
CUDA

cupy (chainer.cuda.cupy)
- chainer, numpy
- Numpy, gpu
- CUDA
cuda
cuda
  - Chainer
Nvidia kernel load
- Chainer Meetup

Modern GPU
- MIMD (Multiple Instruction, Multiple Data)
  - PEZY-SC, datacenter as a computer
- SIMD (Single Instruction, Multiple Data)
  - SSE
- GPU: MIMD + SIMD, SIMT
SM: MIMD
SM: warp (16, 32), SIMD
  - gather/scatter
SM

10030002900

(Prefix-)Scan
- Scan over an array X[0, n)
X[i] := X[0] + X[1] + ... + X[i-1] (exclusive scan)
X[i] := X[0] + X[1] + ... + X[i] (inclusive scan)
- Scan in parallel steps
For all X[i]: X[i] += X[i-1]
For all X[i]: X[i] += X[i-2]
For all X[i]: X[i] += X[i-4]
...
- Example: X[7]
X[7] += X[6]
X[7] += X[5] (= X[4] + X[5])
- O(log n) steps
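
The step-doubling scan above can be sketched on the CPU as follows. This is a minimal illustration rather than GPU code: the inner loop over i stands in for threads that would run in parallel, and the function names `inclusive_scan_doubling` and `exclusive_scan_doubling` are hypothetical. It assumes each round reads the values left by the previous round, which is why two buffers are used.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan by step doubling: in round d = 1, 2, 4, ... every element
// with i >= d adds the value d positions to its left. Each round reads only
// the previous round's values (double buffering), which is what parallel
// threads would see between two synchronization points.
std::vector<long long> inclusive_scan_doubling(std::vector<long long> x) {
    std::vector<long long> next(x.size());
    for (std::size_t d = 1; d < x.size(); d *= 2) {
        for (std::size_t i = 0; i < x.size(); ++i) {  // conceptually parallel over i
            next[i] = (i >= d) ? x[i] + x[i - d] : x[i];
        }
        x.swap(next);
    }
    return x;
}

// Exclusive scan obtained by shifting the inclusive result right by one.
std::vector<long long> exclusive_scan_doubling(const std::vector<long long>& x) {
    const std::vector<long long> inc = inclusive_scan_doubling(x);
    std::vector<long long> out(x.size(), 0);
    for (std::size_t i = 1; i < x.size(); ++i) out[i] = inc[i - 1];
    return out;
}
```

For n elements the outer loop runs about log2(n) rounds, matching the O(log n) step count, although the total number of additions is larger than in a sequential scan.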

Modern GPU
- GPU
- Modern GPU
Nvidia, 2013
http://nvlabs.github.io/moderngpu/

Merge (1/2)
- Merge sorted arrays X, Y
- X = {1, 3, 3, 5, 7, 9, 10, 10, 11, 13, 15}
- Y = {0, 2, 3, 3, 7, 8, 8, 9, 10, 11, 14}
- Z = {0, 1, 2, 3, 3, 3, 3, 5, 7, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 13, 14, 15}
- n = |X|, m = |Y|: O(n+m)

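Merge (2/2)
// X (length n) and Y (length m) are sorted inputs; Z receives the merged output
int ix = 0, iy = 0, iz = 0;
while (iz < n + m) {
    // take from X when Y is exhausted, or when X has elements left and comp holds
    if (iy >= m || (ix < n && comp(X[ix], Y[iy]))) Z[iz++] = X[ix++];
    else Z[iz++] = Y[iy++];
}
// comp is the comparison function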

[Figure: merge grid for X = 0 1 3 5 6 6 7 9 10 and Y = 1 2 2 4 7 8 9 9 10]

[Figure labels: A, B; Z[iz++] = X[ix++]; Z[iz++] = Y[iy++]]
X[4]=6 < Y[4]=7

[Figure: merge grid for X and Y, partitioned into Z[0, 4), Z[4, 8), Z[8, 12), Z[12, 16)]
- X = 0 1 3 5 6 6 7 9 10
- Y = 1 2 2 4 7 8 9 9 10
- Chunks of 4 output elements (inputs from X | inputs from Y):
  - 0 1 | 1 2
  - 3 5 | 2 4
  - 6 6 7 | 7
  - 9 | 8 9 9
  - 10 | 10

[Figure: merge grid for X = 0 1 3 5 6 6 7 9 10 and Y = 1 2 2 4 7 8 9 9 10]
Search for i such that X[i] < Y[8 - i]
Z
#pragma unroll

TitanX 288GB/32700/
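
The partitioning idea from the preceding slides (cut the output Z into fixed-size chunks, then search along each cut for how many elements come from X) can be sketched on the CPU as below. This is an illustrative sketch under stated assumptions, not the Modern GPU library's code: the names `merge_path` and `partitioned_merge` are hypothetical, ties are assumed to go to X, and the loop over chunks stands in for independent threads or thread blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Binary search along the cut `diag`: returns how many of the first `diag`
// merged outputs come from x (the remaining diag - result come from y).
std::size_t merge_path(const std::vector<int>& x, const std::vector<int>& y,
                       std::size_t diag) {
    std::size_t lo = diag > y.size() ? diag - y.size() : 0;
    std::size_t hi = std::min(diag, x.size());
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        // Compare x[mid] with the y element on the same cut.
        if (x[mid] <= y[diag - 1 - mid]) lo = mid + 1;  // ties go to x (assumption)
        else hi = mid;
    }
    return lo;
}

// Merge x and y by splitting the output into chunks of `chunk` elements.
// Each chunk only needs its own input ranges, so chunks are independent.
std::vector<int> partitioned_merge(const std::vector<int>& x,
                                   const std::vector<int>& y, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t total = x.size() + y.size();
    std::vector<int> z(total);
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, total);
        std::size_t ix = merge_path(x, y, begin);
        const std::size_t ix_end = merge_path(x, y, end);
        std::size_t iy = begin - ix;
        const std::size_t iy_end = end - ix_end;
        for (std::size_t iz = begin; iz < end; ++iz) {  // ordinary sequential merge
            if (iy >= iy_end || (ix < ix_end && x[ix] <= y[iy])) z[iz] = x[ix++];
            else z[iz] = y[iy++];
        }
    }
    return z;
}
```

With the slide's X and Y and a chunk size of 4, the cuts at 4, 8, 12 and 16 take 2, 4, 7 and 8 elements from X respectively, which reproduces the per-chunk inputs listed earlier.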

- Bulk Insert, Bulk Delete
- Segmented Vector Reduction
Reduction
- DB Join
Outer, Inner, Left-, Right- Join
- MapReduce
- Modern GPU, Thrust, CUB

MapReduce on GPU
- Map + Shuffle (Sort) + Reduce on GPU
- MapReduce
Shuffle
GPU, GPU
- Map: D -> [K, V]
- Shuffle: [K, V] -> [K, [V]]
- Reduce: [V] -> Z
- Overall: [D] -> [K, Z]
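
The three type signatures above can be made concrete with a small CPU-side word count. This is a sketch only: the names `KV`, `map_doc` and `map_reduce` are hypothetical, the shuffle is assumed to be implemented as a sort by key (as the "Shuffle (Sort)" bullet suggests), and whitespace splitting is a simplification chosen for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// Map: D -> [K, V]. Emits (word, 1) for every whitespace-separated word.
std::vector<KV> map_doc(const std::string& doc) {
    std::vector<KV> out;
    std::istringstream in(doc);
    std::string word;
    while (in >> word) out.emplace_back(word, 1);
    return out;
}

// [D] -> [K, Z]: map every document, shuffle by sorting, then reduce each key.
std::vector<KV> map_reduce(const std::vector<std::string>& docs) {
    std::vector<KV> pairs;
    for (const std::string& d : docs) {
        const std::vector<KV> kv = map_doc(d);
        pairs.insert(pairs.end(), kv.begin(), kv.end());
    }
    // Shuffle: sorting by key makes equal keys adjacent, [K, V] -> [K, [V]].
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const KV& a, const KV& b) { return a.first < b.first; });
    // Reduce: sum each run of equal keys, [V] -> Z.
    std::vector<KV> result;
    for (std::size_t i = 0; i < pairs.size();) {
        std::size_t j = i;
        int sum = 0;
        while (j < pairs.size() && pairs[j].first == pairs[i].first) sum += pairs[j++].second;
        result.emplace_back(pairs[i].first, sum);
        i = j;
    }
    return result;
}
```

Sorting is the only step that moves data between keys; the map and the per-key reduction each work on independent pieces, which is what makes the pattern a candidate for GPU execution.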

GPU
1.
   1. !isAlpha(right) && isAlpha(left)
2.
   1. Segmented Reduction
   2. Reduce
   3. KR, Reduce
3.
4. Segmented Reduction
   1.
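
Two ingredients named in the list above, the word-boundary test `!isAlpha(right) && isAlpha(left)` and a segmented reduction, can be sketched on the CPU as follows. This is an assumption-laden illustration, not the author's implementation: the function names are hypothetical, the end of the text is treated as a non-alphabetic character, and the reduction shown (summing per-character ones to get word lengths) is chosen only because it is easy to check.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// flags[i] = 1 when a word ends at position i: text[i] (left) is alphabetic
// and text[i + 1] (right) is not. Every position is tested independently,
// so this loop could run with one thread per character.
std::vector<int> word_end_flags(const std::string& text) {
    std::vector<int> flags(text.size(), 0);
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool left = is_alpha(text[i]);
        const bool right = (i + 1 < text.size()) && is_alpha(text[i + 1]);
        flags[i] = (!right && left) ? 1 : 0;
    }
    return flags;
}

// Segmented reduction (sum): emits one total per segment, where a segment
// closes at each flagged position. Values after the last flag are dropped.
std::vector<int> segmented_sum(const std::vector<int>& values,
                               const std::vector<int>& end_flags) {
    std::vector<int> sums;
    int acc = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        acc += values[i];
        if (end_flags[i]) {
            sums.push_back(acc);
            acc = 0;
        }
    }
    return sums;
}

// Example use: the length of every word, as a segmented sum of 0/1 values.
std::vector<int> word_lengths(const std::string& text) {
    std::vector<int> ones(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) ones[i] = is_alpha(text[i]) ? 1 : 0;
    return segmented_sum(ones, word_end_flags(text));
}
```

The number of words is simply the sum of the flags, which is itself a plain reduction.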

- https://github.com/hillbig/gpuexperiments

- 700MB
- Titan X
- 106,888,008
- 1,252,268
- GPU 1.67, CPU->GPU 0.2
0.1%
- CPU 14.80
- 10

input=300000000 wordCount=45788064
distinctWord=1129243 words=45788064
0 2528465 the
1 1564080 of
2 1219248 and
3 986168 in
4 862412 a
5 862356 to
6 507386 is
7 484451 The
8 445334 was
9 336005 for
10 334510 s
11 316207 as
12 295183 by
13 282728 with
14 281566 on
15 241960 that
16 235218 doc
17 221649 from
18 193797 at
19 189947 his
20 157175 an

- nvcc + thrust
template, nvcc
1
10

- GPU
cupy, thrust, cub, CUDA
Go channel, go routine
Datacenter as a computer, 1 (TFlops in 1 chip)