yaSpMV: Yet Another SpMV Framework on GPUs

Page 1: yaSpMV: Yet Another SpMV Framework on GPUs

yaSpMV: Yet Another SpMV Framework on GPUs

Shengen Yan, Chao Li, Yunquan Zhang, Huiyang Zhou

Page 2: yaSpMV: Yet Another SpMV Framework on GPUs

Introduction
• Sparse Matrix-Vector Multiplication (SpMV)
  – SpMV is a very important linear algebra kernel
  – The serial implementation is quite simple:

    // y = y + A*x, where A is stored in the CSR format.
    for (i = 0; i < m; ++i) {
        double y0 = y[i];
        for (k = rowptr[i]; k < rowptr[i+1]; ++k)
            y0 += value[k] * x[column_index[k]];
        y[i] = y0;
    }

  – Much work has gone into optimizing it on both CPUs and GPUs
    • Many formats have been proposed.

Page 3: yaSpMV: Yet Another SpMV Framework on GPUs

Introduction
• Parallel implementation: two challenges
  – Bandwidth
    • The upper bound of the flop:byte ratio is 0.25
  – Load imbalance
    • Different numbers of non-zeros in different rows
    • Worse on GPUs

\[
A = \begin{bmatrix}
0 & 0 & 3 & 0 & 0 & 0 & 6 & 9 \\
0 & 0 & 5 & 1 & 0 & 0 & 4 & 0 \\
0 & 0 & 0 & 0 & 7 & 2 & 3 & 5 \\
4 & 7 & 0 & 0 & 1 & 3 & 8 & 4
\end{bmatrix}
\]

Page 4: yaSpMV: Yet Another SpMV Framework on GPUs

Executive Summary
• BCCOO format
  – addresses the bandwidth challenge
• Customized efficient segmented scan/sum
  – addresses the load imbalance problem
  – very efficient
• Results (GTX 680)
  – vs. CUSPARSE V5.0
    • up to 229% improvement, 65% on average
  – vs. clSpMV
    • up to 195% improvement, 70% on average

Page 5: yaSpMV: Yet Another SpMV Framework on GPUs

Outline
• Introduction
• Formats for SpMV
  – addressing the bandwidth challenge
• Efficient Segmented Sum/Scan for SpMV
• Auto-Tuning Framework
• Experimentation
• Conclusions

Page 6: yaSpMV: Yet Another SpMV Framework on GPUs

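COO format

\[
A = \begin{bmatrix}
0 & 0 & 3 & 0 & 0 & 0 & 6 & 9 \\
0 & 0 & 5 & 1 & 0 & 0 & 4 & 0 \\
0 & 0 & 0 & 0 & 7 & 2 & 3 & 5 \\
4 & 7 & 0 & 0 & 1 & 3 & 8 & 4
\end{bmatrix}
\]

COO format of matrix A
Value        = [3 6 9 5 1 4 7 2 3 5 4 7 1 3 8 4]
Row index    = [0 0 0 1 1 1 2 2 2 2 3 3 3 3 3 3]
Column index = [2 6 7 2 3 6 4 5 6 7 0 1 4 5 6 7]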
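As a minimal sketch of the COO layout shown above, the following C code scans a dense row-major matrix and emits the three COO arrays, then multiplies in COO form. The function names `dense_to_coo` and `spmv_coo` are hypothetical helpers for illustration and are not taken from the yaSpMV code base.

```c
#include <stddef.h>

/* Convert a dense row-major matrix to COO (coordinate) format.
 * The caller provides output arrays large enough for every non-zero.
 * Returns the number of non-zeros written. */
size_t dense_to_coo(const double *dense, int rows, int cols,
                    double *value, int *row_index, int *col_index)
{
    size_t nnz = 0;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            double v = dense[(size_t)r * cols + c];
            if (v != 0.0) {
                value[nnz] = v;
                row_index[nnz] = r;
                col_index[nnz] = c;
                ++nnz;
            }
        }
    }
    return nnz;
}

/* Serial y = A*x for a matrix stored in COO format. */
void spmv_coo(size_t nnz, const double *value, const int *row_index,
              const int *col_index, const double *x, double *y, int rows)
{
    for (int r = 0; r < rows; ++r)
        y[r] = 0.0;
    for (size_t k = 0; k < nnz; ++k)
        y[row_index[k]] += value[k] * x[col_index[k]];
}
```

Applied to the 4x8 matrix A, `dense_to_coo` yields 16 non-zeros in the same order as the arrays on this slide.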

Page 7: yaSpMV: Yet Another SpMV Framework on GPUs

Blocked COO (BCOO) format

\[
A = \begin{bmatrix}
0 & 0 & 3 & 0 & 0 & 0 & 6 & 9 \\
0 & 0 & 5 & 1 & 0 & 0 & 4 & 0 \\
0 & 0 & 0 & 0 & 7 & 2 & 3 & 5 \\
4 & 7 & 0 & 0 & 1 & 3 & 8 & 4
\end{bmatrix}
\]

BCOO format of matrix A (block size 2x2)
Row index    = [0 0 1 1 1]
Column index = [1 3 0 2 3]
Value        = [3 0 | 6 9 | 0 0 | 7 2 | 3 5]
               [5 1 | 4 0 | 4 7 | 1 3 | 8 4]
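The blocking step can be sketched in C as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper name `dense_to_bcoo` is hypothetical, the matrix dimensions are assumed to be multiples of the block size, and each kept block is stored contiguously in row-major order (the slide draws the values as two rows, which is a different but equivalent layout).

```c
#include <stddef.h>

/* Build a BCOO representation of a dense row-major matrix using
 * bh x bw blocks. Only blocks with at least one non-zero are kept.
 * Sketch assumptions: rows % bh == 0, cols % bw == 0, and each kept
 * block is stored contiguously (row-major, zeros included).
 * block_value must have room for rows*cols values.
 * Returns the number of non-zero blocks. */
size_t dense_to_bcoo(const double *dense, int rows, int cols, int bh, int bw,
                     double *block_value, int *block_row, int *block_col)
{
    size_t nblocks = 0;
    for (int br = 0; br < rows / bh; ++br) {
        for (int bc = 0; bc < cols / bw; ++bc) {
            int has_nonzero = 0;
            double *dst = block_value + nblocks * (size_t)(bh * bw);
            for (int i = 0; i < bh; ++i) {
                for (int j = 0; j < bw; ++j) {
                    double v = dense[(size_t)(br * bh + i) * cols + (bc * bw + j)];
                    dst[i * bw + j] = v;
                    if (v != 0.0)
                        has_nonzero = 1;
                }
            }
            if (has_nonzero) {
                block_row[nblocks] = br;
                block_col[nblocks] = bc;
                ++nblocks;          /* keep the block just copied */
            }
            /* an all-zero block is overwritten by the next candidate */
        }
    }
    return nblocks;
}
```

For matrix A with 2x2 blocks this keeps 5 of the 8 candidate blocks, with block row indices [0 0 1 1 1] and block column indices [1 3 0 2 3].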

Page 11: yaSpMV: Yet Another SpMV Framework on GPUs

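Blocked compressed COO (BCCOO) format

\[
A = \begin{bmatrix}
0 & 0 & 3 & 0 & 0 & 0 & 6 & 9 \\
0 & 0 & 5 & 1 & 0 & 0 & 4 & 0 \\
0 & 0 & 0 & 0 & 7 & 2 & 3 & 5 \\
4 & 7 & 0 & 0 & 1 & 3 & 8 & 4
\end{bmatrix}
\]

BCCOO format of matrix A (block size 2x2)
Row index (integer)     = [0 0 1 1 1]
Difference value        = [0 1 0 0 1]
Bit flag (flipped, bit) = [1 0 1 1 0]
Column index            = [1 3 0 2 3]
Value                   = [3 0 | 6 9 | 0 0 | 7 2 | 3 5]
                          [5 1 | 4 0 | 4 7 | 1 3 | 8 4]

Row index compression ratio: 1/32 (one bit instead of one integer per block)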
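The row index compression can be sketched in C as below. The names `row_index_to_bit_flags` and `get_bit_flag` are hypothetical, the packing order of bits within a 32-bit word is an assumption of this sketch, and rows without any non-zero block are assumed not to occur (they would need separate handling).

```c
#include <stddef.h>
#include <stdint.h>

/* Derive the flipped bit flags from a sorted block row index array:
 * flag i is 0 if block i is the last block of its row, 1 otherwise.
 * Flags are packed 32 per word: bit (i % 32) of word (i / 32).
 * flags must hold (n + 31) / 32 words. */
void row_index_to_bit_flags(const int *row_index, size_t n, uint32_t *flags)
{
    size_t nwords = (n + 31) / 32;
    for (size_t w = 0; w < nwords; ++w)
        flags[w] = 0;
    for (size_t i = 0; i < n; ++i) {
        int is_row_end = (i + 1 == n) || (row_index[i + 1] != row_index[i]);
        if (!is_row_end)
            flags[i / 32] |= (uint32_t)1 << (i % 32);
    }
}

/* Read flag i back out of the packed array. */
int get_bit_flag(const uint32_t *flags, size_t i)
{
    return (int)((flags[i / 32] >> (i % 32)) & 1u);
}
```

Because each block contributes one bit rather than one 32-bit integer, the row information shrinks by the 1/32 ratio quoted on the slide.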

Page 12: yaSpMV: Yet Another SpMV Framework on GPUs

Formats for SpMV
• Extensions of BCCOO format
  – BCCOO+ format
    • Rearranges the non-zero blocks.
    • Relieves the irregular accesses to the vector.
  – Column index compression
    • Uses a difference function on the column index.
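One plausible reading of "difference function on the column index" is plain delta encoding into a narrower integer type. The C sketch below illustrates that idea only; the function names are hypothetical, the 16-bit width is an assumption, and how yaSpMV handles differences that do not fit is not specified on this slide.

```c
#include <stddef.h>
#include <stdint.h>

/* Delta-encode column indices into 16-bit values:
 * delta[0] = col[0], delta[i] = col[i] - col[i-1].
 * Returns 1 on success, 0 if some difference does not fit in int16_t
 * (a real implementation would need a fallback for that case). */
int compress_columns(const int *col, size_t n, int16_t *delta)
{
    int prev = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = col[i] - prev;
        if (d < INT16_MIN || d > INT16_MAX)
            return 0;
        delta[i] = (int16_t)d;
        prev = col[i];
    }
    return 1;
}

/* Recover the original column indices with a running sum. */
void decompress_columns(const int16_t *delta, size_t n, int *col)
{
    int prev = 0;
    for (size_t i = 0; i < n; ++i) {
        prev += delta[i];
        col[i] = prev;
    }
}
```

Under these assumptions the column index array takes half the space of a 32-bit array whenever neighbouring blocks are less than 32768 columns apart.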

Page 13: yaSpMV: Yet Another SpMV Framework on GPUs

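Example matrix
• Assume there are 4 threads

\[
B = \begin{bmatrix}
5 & 0 & 3 & 0 & 2 & 0 & 6 & 9 \\
0 & 0 & 0 & 1 & 0 & 0 & 4 & 0 \\
0 & 4 & 0 & 8 & 0 & 2 & 0 & 0 \\
0 & 7 & 6 & 1 & 0 & 3 & 8 & 4
\end{bmatrix}
\]

BCCOO format of matrix B (block size 1x1)
Bit flag = [1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0]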

Page 14: yaSpMV: Yet Another SpMV Framework on GPUs

Auxiliary Information for SpMV: Result Entry
• The result entry is the location in the output array of the first result generated by each thread, that is, the row index that the first result of each thread belongs to.
• It only requires counting the zeros in the bit flag array that belong to the previous threads.

Bit flag     = [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0]
                Thread 0  Thread 1  Thread 2  Thread 3
Result entry =     0         0         2         3
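The counting rule above can be sketched in C as follows. The function name `compute_result_entry` is hypothetical, and for readability the flags are stored one per byte rather than packed into bits.

```c
#include <stddef.h>

/* For each thread, compute the output row of its first result:
 * the number of 0 flags (row ends) owned by all previous threads.
 * bit_flag holds one flag per non-zero block; each thread owns
 * per_thread consecutive flags. */
void compute_result_entry(const unsigned char *bit_flag, size_t num_threads,
                          size_t per_thread, int *result_entry)
{
    int zeros_so_far = 0;
    for (size_t t = 0; t < num_threads; ++t) {
        result_entry[t] = zeros_so_far;
        for (size_t i = 0; i < per_thread; ++i)
            if (bit_flag[t * per_thread + i] == 0)
                ++zeros_so_far;
    }
}
```

With the bit flag array of matrix B and 4 threads of 4 flags each, this produces the result entries 0, 0, 2, 3 shown on the slide.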

Page 15: yaSpMV: Yet Another SpMV Framework on GPUs

Outline
• Introduction
• Formats for SpMV
• Efficient Segmented Sum/Scan for SpMV
  – addressing the load imbalance problem
• Auto-Tuning Framework
• Experimentation
• Conclusions

Page 16: yaSpMV: Yet Another SpMV Framework on GPUs

Even workload partition
• No workload imbalance

[Figure: the non-zero blocks are divided evenly among workgroups 0 to 3, and each workgroup's share is divided evenly among its threads T0 to T3.]

Page 17: yaSpMV: Yet Another SpMV Framework on GPUs

Efficient Segmented Sum/Scan for SpMV
• Three logical steps
  – Read the data and multiply them with the corresponding vector values.
  – Perform a segmented sum/scan using the bit flag array from our BCCOO/BCCOO+ format.
  – Combine the results and write them back to global memory.
• All three steps are implemented in one kernel

Page 18: yaSpMV: Yet Another SpMV Framework on GPUs

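Step 1 Read the data and multiply with vector values
• Ex: 4 Threads

\[
B = \begin{bmatrix}
5 & 0 & 3 & 0 & 2 & 0 & 6 & 9 \\
0 & 0 & 0 & 1 & 0 & 0 & 4 & 0 \\
0 & 4 & 0 & 8 & 0 & 2 & 0 & 0 \\
0 & 7 & 6 & 1 & 0 & 3 & 8 & 4
\end{bmatrix}
\]

BCCOO format of matrix B
Bit flag = [1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0]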

Page 19: yaSpMV: Yet Another SpMV Framework on GPUs

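Step 1 Read the data and multiply with vector values
• Ex: 4 Threads

Problem: B*x = ? Assume: x = [2 9 6 5 4 8 7 3]

               Thread 0      Thread 1      Thread 2      Thread 3
Value          5  3  2  6    9  1  4  4    8  2  7  6    1  3  8  4
Column index   0  2  4  6    7  3  6  1    3  5  1  2    3  5  6  7
Vector value   2  6  4  7    3  5  7  9    5  8  9  6    5  8  7  3
Product        10 18 8  42   27 5  28 36   40 16 63 36   5  24 56 12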
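Step 1 for 1x1 blocks amounts to a gather followed by a multiply. The sketch below is serial C with a hypothetical function name; on the GPU each thread would process its own chunk of the arrays rather than one loop covering all of them.

```c
#include <stddef.h>

/* Step 1 for 1x1 blocks: gather the vector element selected by each
 * column index and multiply it with the matrix value. */
void multiply_with_vector(const double *value, const int *col_index,
                          size_t nnz, const double *x, double *intermediate)
{
    for (size_t k = 0; k < nnz; ++k)
        intermediate[k] = value[k] * x[col_index[k]];
}
```

Feeding in the values and column indices of matrix B with x = [2 9 6 5 4 8 7 3] gives the 16 products listed in the table above.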

Page 20: yaSpMV: Yet Another SpMV Framework on GPUs

Step 2 Segmented sum/scan
• Three types of rows in our algorithm
  – All the non-zeros of the row are in the same thread.
    • Serial segmented sum/scan within each thread
  – A row spans multiple threads.
    • + Parallel segmented sum/scan among threads
  – A row spans multiple workgroups.
    • + Cross-workgroup synchronization (details in the paper)

Page 21: yaSpMV: Yet Another SpMV Framework on GPUs

Step 2 Segmented sum/scan
• 1) Serial segmented sum/scan in each thread

Problem: B*x = ? Assume: x = [2 9 6 5 4 8 7 3]

Serial segmented scan (Intermediate[-1] = 0):
Intermediate[i] = Intermediate[i-1] * BitFlag[i-1] + Intermediate[i]

               Thread 0      Thread 1      Thread 2      Thread 3
Bit flag       1  1  1  1    0  1  0  1    1  0  1  1    1  1  1  0
After scan     10 28 36 78   27 5  33 36   40 56 63 99   5  29 85 97
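The per-thread recurrence can be sketched in serial C as follows. The outer loop stands in for the GPU threads, the function name is hypothetical, and the flags are stored one per byte for readability.

```c
#include <stddef.h>

/* Serial segmented scan inside each thread's chunk, following
 * intermediate[i] = intermediate[i-1] * bit_flag[i-1] + intermediate[i]
 * with intermediate[-1] = 0 at the start of every chunk. */
void serial_segmented_scan(double *intermediate, const unsigned char *bit_flag,
                           size_t num_threads, size_t per_thread)
{
    for (size_t t = 0; t < num_threads; ++t) {   /* one GPU thread each */
        double *v = intermediate + t * per_thread;
        const unsigned char *f = bit_flag + t * per_thread;
        for (size_t i = 1; i < per_thread; ++i)
            v[i] += v[i - 1] * f[i - 1];         /* a 0 flag resets the sum */
    }
}
```

A flag of 0 at position i-1 means the previous element closed a row, so the running sum restarts at element i.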

Page 22: yaSpMV: Yet Another SpMV Framework on GPUs

Step 2 Segmented sum/scan
• 2) Generate the last partial sum of each thread and perform a parallel segmented scan among threads on the last partial sums

Bit flag = [1 1 1 1 | 0 1 0 1 | 1 0 1 1 | 1 1 1 0]

                                       Thread 0  Thread 1  Thread 2  Thread 3
Last partial sum                          78        36        99        0
Head flag (exists '0' in bit flag?)       0         1         1         1
After parallel segmented scan             78        36        99        0
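The two parts of this step can be sketched in C as below. Both function names are hypothetical, and the "parallel" scan is simulated sequentially on the CPU with a doubling-stride loop; on a GPU each stride would be one synchronized step across threads. This is an illustration of a generic segmented scan, not the optimized kernel from the paper.

```c
#include <stddef.h>

/* Per thread: the last partial sum is the running sum after the chunk's
 * final element, or 0 if that element ends a row. The head flag records
 * whether the chunk contains any row end (a 0 in the bit flag). */
void last_partial_sums(const double *scanned, const unsigned char *bit_flag,
                       size_t num_threads, size_t per_thread,
                       double *partial, unsigned char *head_flag)
{
    for (size_t t = 0; t < num_threads; ++t) {
        size_t last = (t + 1) * per_thread - 1;
        partial[t] = bit_flag[last] ? scanned[last] : 0.0;
        head_flag[t] = 0;
        for (size_t i = t * per_thread; i <= last; ++i)
            if (bit_flag[i] == 0)
                head_flag[t] = 1;
    }
}

/* Inclusive segmented scan over the per-thread partial sums.
 * A set head flag starts a new segment. Both arrays are updated in
 * place; iterating t downwards makes each stride read only values
 * from the previous stride, as a synchronized parallel step would. */
void segmented_scan_threads(double *partial, unsigned char *head_flag, size_t n)
{
    for (size_t d = 1; d < n; d *= 2) {
        for (size_t t = n - 1; t >= d; --t) {
            if (!head_flag[t]) {
                partial[t] += partial[t - d];
                head_flag[t] |= head_flag[t - d];
            }
        }
    }
}
```

In the running example every thread after thread 0 contains a row end, so the scan leaves the partial sums 78, 36, 99, 0 unchanged.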

Page 24: yaSpMV: Yet Another SpMV Framework on GPUs

Step 3 Results combination and write the results to global memory

Problem: B*x = ? Assume: x = [2 9 6 5 4 8 7 3]

                        Thread 0  Thread 1  Thread 2  Thread 3
Results at row ends        -       27, 33      56        97
Partial sums               78        36        99        0
Result entry               0         0         2         3

Combined results: 78 + 27 = 105, 33, 36 + 56 = 92, 99 + 97 = 196
Output rows:      0, 0+1, 2, 3
Final result = [105, 33, 92, 196]
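The combination step can be sketched in serial C as follows. The function name `combine_and_write` and the parameter name `carry` (the per-thread partial sums after the scan among threads) are hypothetical; the outer loop again stands in for the GPU threads.

```c
#include <stddef.h>

/* Step 3: every 0 flag marks a finished row. The first finished row in
 * a thread's chunk also receives the sum carried in from the previous
 * thread; rows are written starting at the thread's result entry. */
void combine_and_write(const double *scanned, const unsigned char *bit_flag,
                       const double *carry, const int *result_entry,
                       size_t num_threads, size_t per_thread, double *y)
{
    for (size_t t = 0; t < num_threads; ++t) {
        int row = result_entry[t];
        int first = 1;
        for (size_t i = t * per_thread; i < (t + 1) * per_thread; ++i) {
            if (bit_flag[i] != 0)
                continue;               /* row not finished yet */
            double r = scanned[i];
            if (first && t > 0)
                r += carry[t - 1];      /* sum carried in from earlier threads */
            y[row++] = r;
            first = 0;
        }
    }
}
```

With the scanned values, partial sums, and result entries of the running example, the output is [105, 33, 92, 196], which matches B*x computed directly.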

Page 25: yaSpMV: Yet Another SpMV Framework on GPUs

Auto-Tuning Framework
• To generate the optimal kernel code, we employ auto-tuning to search for the best parameters.

[Table: tunable parameters]

Average auto-tuning time: 13 seconds
Auto-tuning speed: ~1 million non-zeros per second

Page 26: yaSpMV: Yet Another SpMV Framework on GPUs

Experiments
• Experimental Methodology
  – We have implemented our proposed scheme in OpenCL.
  – We have evaluated our scheme on GTX 480 and GTX 680 using 20 real-world matrices.
  – Libraries compared against
    • CUSPARSE V5.0 (Nvidia official SpMV library)
    • CUSP (SC 09)
    • clSpMV (ICS 12)

Page 27: yaSpMV: Yet Another SpMV Framework on GPUs

Used Matrices

Name              Size             Non-zeros (NNZ)   NNZ/Row
Dense             2K * 2K          4M                2000
Protein           36K * 36K        4.3M              119
FEM/Spheres       83K * 83K        6M                72
FEM/Cantilever    62K * 62K        4M                65
Wind Tunnel       218K * 218K      11M               53
FEM/Harbor        47K * 47K        2.3M              59
QCD               49K * 49K        1.9M              39
FEM/Ship          141K * 141K      7.8M              28
Economics         207K * 207K      1.2M              6
Epidemiology      526K * 526K      2.1M              4
FEM/Accelerator   121K * 121K      2.6M              22
Circuit           171K * 171K      0.95M             6
Webbase           1M * 1M          3.1M              3
LP                4K * 1.1M        11.3M             2825
Circuit5M         5.56M * 5.56M    59.5M             11
eu-2005           863K * 863K      19.2M             22
Ga41As41H72       268K * 268K      18.4M             67
in-2004           1.38M * 1.38M    17M               12
mip1              66K * 66K        10.3M             152
Si41Ge41H72       186K * 186K      15M               81

Page 28: yaSpMV: Yet Another SpMV Framework on GPUs

Performance results on Kepler (GTX 680)

[Figure: bar chart of GFLOPS (y-axis, 0 to 70) for each of the 20 matrices plus the H-mean, comparing CUSPARSE, CUSP, clSpMV Cocktail, clSpMV Best single, and yaSpMV.]

Average performance improvement: 65% over CUSPARSE, 70% over clSpMV Cocktail, 88% over clSpMV best single, 150% over CUSP

Page 33: yaSpMV: Yet Another SpMV Framework on GPUs

Performance breakdown on Kepler (GTX 680)

[Figure: bar chart of GFLOPS (y-axis, 0 to 70) for each of the 20 matrices plus the H-mean, with the optimizations added cumulatively: COO, BCCOO, Efficient Segmented Sum/Scan, Adjacent Synchronization, Fine-Grain Optimizations.]

Average performance improvement (vs. COO format):
• + BCCOO: 66%
• + Efficient Segmented Sum/Scan: 192%
• + Adjacent Synchronization: 212%
• + Fine-Grain Optimizations: 257%

Page 38: yaSpMV: Yet Another SpMV Framework on GPUs

Relative memory footprint of different formats

[Figure: bar chart of relative memory footprint (y-axis, 0.00 to 1.00) for each of the 20 matrices plus the average, comparing COO, ELL, Cocktail, Best single, and BCCOO.]

Average memory footprint consumption:
• vs. COO: 60%
• vs. ELL: 19%
• vs. Cocktail: 79%
• vs. Best single: 69%
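To see where such savings come from, the footprint of COO and BCCOO can be estimated with a small C sketch. The function names are hypothetical, and the model is a simplification: it counts only values, indices, and bit flags, and ignores auxiliary data such as the result entries.

```c
#include <stddef.h>

/* Bytes used by COO: one value, one row index, and one column index
 * per non-zero. */
size_t coo_bytes(size_t nnz, size_t value_size, size_t index_size)
{
    return nnz * (value_size + 2 * index_size);
}

/* Bytes used by BCCOO with bh x bw blocks: all values of every kept
 * block, one column index per block, and one bit flag per block
 * (rounded up to whole bytes). Auxiliary arrays are not counted. */
size_t bccoo_bytes(size_t nblocks, size_t bh, size_t bw,
                   size_t value_size, size_t index_size)
{
    size_t values = nblocks * bh * bw * value_size;
    size_t columns = nblocks * index_size;
    size_t flags = (nblocks + 7) / 8;
    return values + columns + flags;
}
```

Assuming 4-byte values and 4-byte indices, the 4x8 example matrix A (16 non-zeros, 5 non-zero 2x2 blocks) takes 192 bytes in COO and 101 bytes under this BCCOO model.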

Page 39: yaSpMV: Yet Another SpMV Framework on GPUs

Conclusions
• The BCCOO format
  – Addresses the memory bandwidth problem
• The customized matrix-based segmented sum/scan algorithms
  – Address the workload imbalance problem
  – Only need to invoke one kernel
  – Very efficient: use many optimization approaches
• Results (GTX 680)
  – vs. CUSPARSE V5.0
    • up to 229% improvement, 65% on average
  – vs. clSpMV
    • up to 195% improvement, 70% on average
• Code is available online
  – http://code.google.com/p/yaspmv/

Page 40: yaSpMV: Yet Another SpMV Framework on GPUs

Thanks & Questions?

Page 44: yaSpMV: Yet Another SpMV Framework on GPUs

Step 2 Segmented sum/scan
• 3) Accumulating partial sums across workgroups

[Figure: four workgroups each generate partial sums (P0 to P3) and then proceed to Step 3; the partial sums are accumulated across workgroups using adjacent synchronization.]

Page 45: yaSpMV: Yet Another SpMV Framework on GPUs

Fine-grained optimizations
• Texture memory for vector reads
• Cut the adjacent synchronization chain as early as possible
• Remove the parallel segmented scan if possible
• If the number of columns is smaller than 65535, a short-type column index array may help decrease memory traffic

Page 46: yaSpMV: Yet Another SpMV Framework on GPUs

Performance results on Fermi (GTX 480)

[Figure: bar chart of GFLOPS (y-axis, 0 to 70) for each of the 20 matrices plus the H-mean, comparing CUSPARSE, CUSP, clSpMV Cocktail, clSpMV Best single, and yaSpMV.]

Average performance improvement: 42% over CUSPARSE, 40% over clSpMV Cocktail, 60% over clSpMV best single, 74% over CUSP

Page 47: yaSpMV: Yet Another SpMV Framework on GPUs

Absolute memory footprint consumption of COO, BCOO, BCCOO formats

[Figure: bar chart of absolute memory footprint (y-axis, 0 to 800000000) for each of the 20 matrices plus the average, comparing COO, BCOO, and BCCOO.]