  • More Modern GPU

    [email protected]

    Preferred Networks, Inc.

    Preferred Infrastructure, Inc.

    12/3 2015 PFI/PFN

  • GPU/CUDA

    l GPU/CUDA

    l GPUCPU

    TitanX 6 TFlops (3072 cores), Xeon 0.8 TFlops (18 cores)

    CPU24SIMD

    GPU2

    l GPU

    GPUGB/s

    l


  • GPU

    l

    CPU

    GPU/CUDAGPUagain

    l

    HW

    l

    CPU

  • CUDA

    l CUDA [Okuta 2015]

    l cupy (chainer)

    numpyGPU

    l modern gpu/thrust

    STL

    CUDA


  • cupy (chainer.cuda.cupy)

    l chainernumpy

    l Numpygpu

    l CUDA

    l

    cuda

    cuda

    u Chainer

    Nvidiakernelload

    l Chainer Meetup


  • ModernGPU

    l MIMD (Multiple Instruction, Multiple Data)

    u

    u PEZY-SC, datacenter as a computer

    SIMD (Single Instruction, Multiple Data)

    u

    u SSE

    l GPU: MIMD + SIMD (SIMT)

    Across SMs: MIMD

    Within an SM: warps of 16 to 32 threads, SIMD

    u gather/scatter

    SM

  • l

    10030002900

    l


  • (Prefix-)Scan

    l Scan over X[0, n)

    X[i] := X[0] + X[1] + ... + X[i-1] (exclusive scan)

    X[i] := X[0] + X[1] + ... + X[i] (inclusive scan)

    l Scan

    For all X[i]: X[i] += X[i-1]

    For all X[i]: X[i] += X[i-2]

    For all X[i]: X[i] += X[i-4]

    ...

    l X[7]

    X[7] += X[6]

    X[7] += X[5] (=X[4]+X[5])

    l O(log n) steps
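The doubling scheme on this slide can be sketched on the CPU as follows. This is a minimal illustration, not the implementation from the talk or from any library: the function name `inclusive_scan_doubling` is hypothetical, and a second buffer stands in for the assumption that all parallel updates of one pass read the values left by the previous pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum by stride doubling (1, 2, 4, ...).
// Every pass reads only the previous pass's values, as parallel threads would,
// so the number of passes is about log2(n).
std::vector<int> inclusive_scan_doubling(std::vector<int> x) {
    std::vector<int> next(x.size());
    for (std::size_t stride = 1; stride < x.size(); stride *= 2) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            // Elements closer than `stride` to the front are already final.
            next[i] = (i >= stride) ? x[i] + x[i - stride] : x[i];
        }
        x.swap(next);
    }
    return x;
}

// Usage sketch: eight ones, so the last element collects all eight values.
inline void scan_example() {
    const std::vector<int> r = inclusive_scan_doubling({1, 1, 1, 1, 1, 1, 1, 1});
    assert(r[7] == 8);
}
```

With 8 elements the loop runs three passes (strides 1, 2 and 4), which matches the X[7] walk-through above.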

  • Modern GPU

    l GPU

    l Modern GPU

    Nvidia, 2013

    http://nvlabs.github.io/moderngpu/

    l

    1.

    2.

    3.

    l 1


  • Merge (1/2)

    l X, Y

    l X = {1, 3, 3, 5, 7, 9, 10, 10, 11, 13, 15}

    l Y = {0, 2, 3, 3, 7, 8, 8, 9, 10, 11, 14}

    l Z = {0, 1, 2, 3, 3, 3, 3, 5, 7, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 13, 14, 15}

    l n = |X|, m = |Y|: O(n+m)

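  • Merge (2/2)

    // Inputs X (length n) and Y (length m), output Z (length n+m)

    ix = 0, iy = 0, iz = 0

    while (iz < n+m) {

    // take from X when Y is exhausted, or when X has elements left and comp holds
    if (iy >= m || (ix < n && comp(X[ix], Y[iy]))) Z[iz++] = X[ix++]

    else Z[iz++] = Y[iy++]

    }

    // comp is the comparison function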

  • X = 0 1 3 5 6 6 7 9 10

    Y = 1 2 2 4 7 8 9 9 10

    A: Z[iz++] = X[ix++]

    B: Z[iz++] = Y[iy++]

    X[4]=6 < Y[4]=7

  • l

    X = 0 1 3 5 6 6 7 9 10

    Y = 1 2 2 4 7 8 9 9 10

    Z[0, 4), Z[4, 8), Z[8, 12), Z[12, 16)

  • l X = 0 1 3 5 6 6 7 9 10

    l Y = 1 2 2 4 7 8 9 9 10

    l 4

    l 0 1 1 2

    l 3 5 2 4

    l 6 6 7 7

    l 9 8 9 9

    l 10 10

  • l

    X = 0 1 3 5 6 6 7 9 10

    Y = 1 2 2 4 7 8 9 9 10

    Find i such that X[i] < Y[8 - i]
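The chunked merge and the diagonal binary search on these slides can be sketched in plain C++ as below. This is an illustration under stated assumptions, not the Modern GPU library code: the names `merge_path` and `chunked_merge` are hypothetical, ties are given to X (the convention of `std::merge`), and a sequential loop stands in for the GPU threads that would each own one chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// How many of the first `diag` merged outputs come from X.
// Binary search along one diagonal of the merge grid; ties go to X.
std::size_t merge_path(const std::vector<int>& X, const std::vector<int>& Y,
                       std::size_t diag) {
    assert(diag <= X.size() + Y.size());
    std::size_t lo = diag > Y.size() ? diag - Y.size() : 0;
    std::size_t hi = std::min(diag, X.size());
    while (lo < hi) {
        const std::size_t mid = (lo + hi) / 2;
        // X[mid] is not greater than its diagonal partner in Y,
        // so X[mid] is among the first `diag` outputs: move right.
        if (!(Y[diag - 1 - mid] < X[mid])) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Merge X and Y in independent chunks of `chunk` outputs each.
// Each chunk finds its own start position, so chunks do not depend on each other.
std::vector<int> chunked_merge(const std::vector<int>& X,
                               const std::vector<int>& Y, std::size_t chunk) {
    assert(chunk > 0);
    const std::size_t total = X.size() + Y.size();
    std::vector<int> Z(total);
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, total);
        std::size_t ix = merge_path(X, Y, begin);
        std::size_t iy = begin - ix;
        for (std::size_t iz = begin; iz < end; ++iz) {
            const bool take_x =
                iy >= Y.size() || (ix < X.size() && !(Y[iy] < X[ix]));
            Z[iz] = take_x ? X[ix++] : Y[iy++];
        }
    }
    return Z;
}
```

For the X and Y of the slides and a chunk size of 4, the chunks start at (ix, iy) = (0, 0), (2, 2), (4, 4), (7, 5) and (8, 8), which reproduces the per-chunk split 0 1 | 1 2, 3 5 | 2 4, 6 6 7 | 7, 9 | 8 9 9, 10 | 10.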

  • l

    l

    Z

    #pragma unroll

    l

    TitanX 288GB/32700/


  • l Bulk Insert, Bulk Delete

    l Segmented Vector Reduction

    Reduction

    l DB Join

    Outer, Inner, Left-, Right- Join

    l MapReduce

    l Modern GPU, Thrust, Cub

  • GPU MapReduce

    l Map + Shuffle (Sort) + Reduce on GPU

    l MapReduce

    Shuffle

    GPUGPU

    l Map: D -> [K, V]

    l Shuffle: [K, V] -> [K, [V]]

    l Reduce: [V] -> Z

    l Overall: [D] -> [K, Z]
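The three stages and their types can be sketched on the CPU as follows. This is a hedged illustration only: the type alias `KV` and the functions `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical names, the document D is assumed to be a list of tokens, and a sort stands in for the Shuffle step as the slide suggests. On a GPU the sort and the reduce-by-key would be library primitives.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// Map: D -> [K, V]. Each token of the document emits (token, 1).
std::vector<KV> map_phase(const std::vector<std::string>& doc) {
    std::vector<KV> out;
    out.reserve(doc.size());
    for (const std::string& token : doc) out.emplace_back(token, 1);
    return out;
}

// Shuffle: sorting by key places equal keys next to each other,
// which is the flat form of the grouping [K, [V]].
void shuffle_phase(std::vector<KV>& kvs) {
    std::stable_sort(kvs.begin(), kvs.end(),
                     [](const KV& a, const KV& b) { return a.first < b.first; });
}

// Reduce: sum each run of equal keys, giving [K, Z].
std::vector<KV> reduce_phase(const std::vector<KV>& sorted) {
    assert(std::is_sorted(sorted.begin(), sorted.end(),
                          [](const KV& a, const KV& b) { return a.first < b.first; }));
    std::vector<KV> out;
    for (const KV& kv : sorted) {
        if (!out.empty() && out.back().first == kv.first) out.back().second += kv.second;
        else out.push_back(kv);
    }
    return out;
}
```

Keeping the grouped values flat and sorted, rather than building nested lists, is what lets the Reduce step run as a segmented reduction over one contiguous array.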

  • GPU

    l

    1.

    1. (!isAlpha(right) && isAlpha(left))

    2.

    1. Segmented Reduction

    2. Reduce

    3. KRReduce

    3.

    4. Segmented Reduction

    1.

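The boundary test in step 1 can be sketched as below. This is an assumption-laden CPU illustration, not the code from the talk: `is_alpha`, `word_ends` and `split_words` are hypothetical names, and "left" and "right" are taken to mean the characters just before and just after a candidate boundary. Each position is tested independently of the others, which is what makes the step suitable for one GPU thread per character.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// Offsets one past the last letter of every word:
// the character on the left is a letter and the one on the right is not.
std::vector<std::size_t> word_ends(const std::string& text) {
    std::vector<std::size_t> ends;
    for (std::size_t i = 1; i <= text.size(); ++i) {
        const bool left = is_alpha(text[i - 1]);
        const bool right = i < text.size() && is_alpha(text[i]);  // end of text counts as non-letter
        if (!right && left) ends.push_back(i);
    }
    return ends;
}

// Recover the words by walking back from each end to the start of its letter run.
std::vector<std::string> split_words(const std::string& text) {
    std::vector<std::string> out;
    for (std::size_t end : word_ends(text)) {
        std::size_t start = end;
        while (start > 0 && is_alpha(text[start - 1])) --start;
        assert(start < end);  // a word end always follows at least one letter
        out.push_back(text.substr(start, end - start));
    }
    return out;
}
```

The resulting words would then be sorted and counted per run of equal keys, as in the later steps of the slide.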

  • l https://github.com/hillbig/gpuexperiments


  • l 700MB

    l Titan X

    l 106,888,008

    l 1,252,268

    l GPU 1.67, CPU->GPU 0.2

    0.1%

    l CPU 14.80

    l 10


  • input=300000000 wordCount=45788064 distinctWord=1129243 words=45788064

    0 2528465 the
    1 1564080 of
    2 1219248 and
    3 986168 in
    4 862412 a
    5 862356 to
    6 507386 is
    7 484451 The
    8 445334 was
    9 336005 for
    10 334510 s
    11 316207 as
    12 295183 by
    13 282728 with
    14 281566 on
    15 241960 that
    16 235218 doc
    17 221649 from
    18 193797 at
    19 189947 his
    20 157175 an

  • l nvcc + thrust

    templatenvcc

    1

    l

    10


  • l GPU

    cupy, thrust, cub, CUDA

    l

    Go: channel, goroutine

    l

    Datacenter as a computer1 (TFlops in 1 chip)


  • Copyright 2015-

    Preferred Networks All Rights Reserved.

