Exploiting Both Pipelining and Data Parallelism with SIMD RA Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek* *Seoul National University **UNIST (Ulsan National Institute of Science & Technology) ARC March 21, 2012 Hong Kong

Page 1: Exploiting Both Pipelining and Data Parallelism with SIMD RA

Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek*

*Seoul National University
**UNIST (Ulsan National Institute of Science & Technology)

ARC, March 21, 2012
Hong Kong

Page 2: Reconfigurable Architecture

Reconfigurable architecture
  High performance
  Flexible (cf. ASIC)
  Energy efficient (cf. GPU)

Source: ChipDesignMag.com

Page 3: Coarse-Grained Reconfigurable Architecture

Coarse-Grained RA
  Word-level granularity
  Dynamic reconfigurability
  Simpler to compile

Execution model
  [Figure: Main Processor, CGRA, Main Memory, and DMA Controller]

[Figure labels: MorphoSys, ADRES]

Page 4: Application Mapping

Place and route the DFG on the PE array mapping space
The mapping should satisfy several constraints
  Nodes should be mapped onto PEs that have the right functionality
  Data transfer between nodes should be guaranteed
  Resource consumption should be minimized for performance

[Figure: compilation flow with the stages Application, Front-end, IR, Partitioner (Seq Code / Loops), Conventional C compilation, Assembly, DFG generation, Place & Route (with Arch. Param.), Configuration, Extended assembler, Exec. + Config. Inset "Mapping for CGRA": <DFG> mapped onto <CGRA>.]

Page 5: Modulo scheduling-based mapping

Software Pipelining

[Figure: a DFG with nodes 0 to 4 (array accesses A[i], B[i], C[i]) mapped onto PE0 to PE3; successive iterations of the DFG overlap on a time axis running from cycle 1 to cycle 7.]

II = 2 cycles
II: Initiation Interval
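The effect of the initiation interval can be illustrated with a small timing model. This is only a sketch: the function names are made up for illustration, and the schedule length of 3 cycles per iteration is an assumption, since the slide states only II = 2.

```python
def iteration_start_cycles(num_iterations, ii):
    """Cycle at which each iteration is issued when one starts every `ii` cycles."""
    return [k * ii for k in range(num_iterations)]


def pipelined_cycles(num_iterations, ii, schedule_length):
    """Total cycles when iterations overlap with initiation interval `ii`."""
    if num_iterations == 0:
        return 0
    # The last iteration starts at (n - 1) * ii and needs schedule_length cycles.
    return (num_iterations - 1) * ii + schedule_length


def sequential_cycles(num_iterations, schedule_length):
    """Total cycles when iterations run back to back without overlap."""
    return num_iterations * schedule_length


if __name__ == "__main__":
    II = 2               # from the slide
    SCHEDULE_LENGTH = 3  # assumed length of one iteration, in cycles
    print(iteration_start_cycles(4, II))
    print(pipelined_cycles(4, II, SCHEDULE_LENGTH))
    print(sequential_cycles(4, SCHEDULE_LENGTH))
```

In this model a smaller II has more effect on total runtime than a shorter single-iteration schedule once the iteration count is large.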

Page 6: Problem - Scalability

Modulo scheduling-based mapping suffers from several problems on a large-scale CGRA
  Lack of parallelism
    Limited ILP in general applications
    Configuration size (in the unrolling case)
  Very large mapping space to search for placement and routing
    Skyrocketing compilation time

CGRAs remain at 4x4 or 8x8 at the most.

Page 7: Overview

Background

SIMD Reconfigurable Architecture (SIMD RA)

Mapping on SIMD RA

Evaluation

Page 8: SIMD Reconfigurable Architecture

Consists of multiple identical parts, called cores
  Identical for the reuse of configurations
  At least one load-store PE in each core

[Figure: Core 1 to Core 4 connected through a Crossbar Switch to Bank1 to Bank4]

Page 9: Advantages of SIMD-RA

More iterations executed in parallel
  Scales with the PE array size
Short compilation time thanks to the small mapping space
Achieves a denser scheduled configuration
  Higher utilization and performance
The loop must not have a loop-carried dependence.

[Figure: two timelines. With one Large Core, Iteration 0 to Iteration 5 are shown; with Core 1 to Core 4, Iter.0 to Iter.11 are shown over the same time axis.]

Page 10: Overview

Background

SIMD Reconfigurable Architecture (SIMD RA)

Bank Conflict Minimization in SIMD RA

Evaluation

Page 11: Problems of SIMD RA mapping

New mapping problem: iteration-to-core mapping
  Iteration mapping affects performance
    It is related to the data mapping
    Together they affect the number of bank conflicts

for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i];}

[Figure: the 15 iterations of the loop to be distributed over Core 1 to Core 4]

Page 12: Mapping schemes

for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i];}

Iteration-to-core mapping (one group per core)
  < Sequential >    Iter. 0-3 | Iter. 4-7 | Iter. 8-11 | Iter. 12-14
  < Interleaving >  Iter. 0,4,8,12 | Iter. 1,5,9,13 | Iter. 2,6,10,14 | Iter. 3,7,11

Data mapping (local memory banks behind the Crossbar Switch)
  < Sequential >
    A[0] A[1] A[2] A[3] A[4] A[5] … A[13] A[14]
    B[0] B[1] B[2] B[3] B[4] B[5] … B[13] B[14]
  < Interleaving > (one bank per row)
    A[0] A[4] A[8] A[12] B[1] B[5] B[9] B[13]
    A[1] A[5] A[9] A[13] B[2] B[6] B[10] B[14]
    A[2] A[6] A[10] A[14] B[3] B[7] B[11]
    A[3] A[7] A[11] B[0] B[4] B[8] B[12]
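The two iteration-to-core schemes and the interleaved data mapping on this page can be reproduced with a short sketch. The function names are illustrative, and the address model (A and B stored back to back from word address 0, bank chosen as address modulo the number of banks) is an assumption that happens to match the bank contents in the figure, not something the slide states.

```python
NUM_CORES = 4
NUM_BANKS = 4
NUM_ITERS = 15  # loop bound from the slide


def sequential_iterations(num_iters, num_cores):
    """Give each core one contiguous chunk of iterations."""
    chunk = -(-num_iters // num_cores)  # ceiling division
    return [list(range(c * chunk, min((c + 1) * chunk, num_iters)))
            for c in range(num_cores)]


def interleaved_iterations(num_iters, num_cores):
    """Deal iterations out to the cores round-robin."""
    return [list(range(c, num_iters, num_cores)) for c in range(num_cores)]


def interleaved_banks(arrays, num_banks):
    """Word-interleave arrays over the banks.

    `arrays` is a list of (name, length) pairs placed back to back in memory
    (assumed layout); word address modulo num_banks selects the bank.
    """
    banks = [[] for _ in range(num_banks)]
    address = 0
    for name, length in arrays:
        for i in range(length):
            banks[address % num_banks].append(f"{name}[{i}]")
            address += 1
    return banks


if __name__ == "__main__":
    print(sequential_iterations(NUM_ITERS, NUM_CORES))
    print(interleaved_iterations(NUM_ITERS, NUM_CORES))
    for bank in interleaved_banks([("A", 15), ("B", 15)], NUM_BANKS):
        print(" ".join(bank))
```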

Page 13: Interleaving data placement

With interleaving data placement, interleaved iteration assignment is better than sequential iteration assignment.
Weak in strided accesses
  Strided accesses reduce the number of utilized banks
  and increase bank conflicts

[Figure: sequential (Iter. 0-3; 4-7; 8-11; 12-14) and interleaved (Iter. 0,4,8,12; 1,5,9,13; 2,6,10,14; 3,7,11) iteration assignments above the Crossbar Switch and four interleaved banks (A[0] A[4] A[8] A[12] B[1] B[5] B[9] B[13]; A[1] A[5] A[9] A[13] B[2] B[6] B[10] B[14]; A[2] A[6] A[10] A[14] B[3] B[7] B[11]; A[3] A[7] A[11] B[0] B[4] B[8] B[12]). Configuration examples: Load A[i] and the strided Load A[2i].]
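A small conflict count shows why the iteration assignment and the access stride matter under interleaved data placement. This is a sketch under stated assumptions: all four cores issue their load in the same cycle, A starts at word address 0, the bank is the word address modulo 4, and the helper names are invented for illustration.

```python
from collections import Counter

NUM_CORES = 4
NUM_BANKS = 4


def interleaved_iteration(core, step):
    """Iteration executed by `core` at `step` under interleaved assignment."""
    return step * NUM_CORES + core


def sequential_iteration(core, step, chunk=4):
    """Iteration executed by `core` at `step` under sequential assignment."""
    return core * chunk + step


def conflicting_accesses(iteration_of, stride, step=0):
    """Count loads that collide with another core's load on the same bank.

    Every core loads A[stride * i] for its own iteration i at the given step.
    Only step 0 is used in the examples because the last step has fewer
    than four active cores for a 15-iteration loop.
    """
    banks = [(stride * iteration_of(core, step)) % NUM_BANKS
             for core in range(NUM_CORES)]
    return sum(count - 1 for count in Counter(banks).values())


if __name__ == "__main__":
    print(conflicting_accesses(interleaved_iteration, stride=1))
    print(conflicting_accesses(sequential_iteration, stride=1))
    print(conflicting_accesses(interleaved_iteration, stride=2))
```

Under these assumptions, Load A[i] with interleaved iterations touches four distinct banks, sequential iterations all land on one bank, and Load A[2i] with interleaved iterations uses only two of the four banks.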

Page 14: Sequential data placement

Cannot work well with SIMD mapping
  Causes frequent bank conflicts
Data tiling
  i) array base address modification
  ii) rearranging data in the local memory
Sequential iteration assignment with data tiling suits SIMD mapping

[Figure: tiled layout, one bank per row, behind the Crossbar Switch:
  A[0] A[1] A[2] A[3] B[0] B[1] B[2] B[3]
  A[4] A[5] A[6] A[7] B[4] B[5] B[6] B[7]
  A[8] A[9] A[10] A[11] B[8] B[9] B[10] B[11]
  A[12] A[13] A[14] B[12] B[13] B[14]
shown next to the untiled sequential layout (A[0] A[1] … A[14]; B[0] B[1] … B[14]). Iteration assignments shown: sequential (Iter. 0-3; 4-7; 8-11; 12-14) and interleaved (Iter. 0,4,8,12; 1,5,9,13; 2,6,10,14; 3,7,11). Configuration example: Load A[i].]
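The tiled layout in the figure can be sketched as follows. The tile size of 4 elements (one core's chunk of a 15-iteration loop on 4 cores) and the function names are assumptions for illustration; the slide does not give the actual address computation used for base address modification.

```python
NUM_BANKS = 4
CHUNK = 4  # iterations per core under sequential assignment (assumed tile size)


def tiled_bank(index, chunk=CHUNK):
    """Bank that holds element `index` of an array after tiling."""
    return index // chunk


def tiled_layout(arrays, num_banks=NUM_BANKS, chunk=CHUNK):
    """Return the contents of each bank; `arrays` is a list of (name, length)."""
    banks = [[] for _ in range(num_banks)]
    for name, length in arrays:
        for i in range(length):
            banks[tiled_bank(i, chunk)].append(f"{name}[{i}]")
    return banks


def banks_touched_by_core(core, num_iters=15, chunk=CHUNK):
    """Banks accessed by `core` for A[i] and B[i] under sequential assignment."""
    iterations = range(core * chunk, min((core + 1) * chunk, num_iters))
    return {tiled_bank(i, chunk) for i in iterations}


if __name__ == "__main__":
    for bank in tiled_layout([("A", 15), ("B", 15)]):
        print(" ".join(bank))
    print([banks_touched_by_core(core) for core in range(4)])
```

In this model each core only ever touches its own bank, so the cores do not compete for a bank when accessing A[i] and B[i].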

Page 15: Summary of Mapping Combinations Analysis

Two out of the four combinations have strong advantages
  Interleaved iteration, interleaved data mapping
    Weak in accesses with stride
    Simple data management
  Sequential iteration, sequential data mapping (with data tiling)
    More robust against bank conflicts
    Data rearranging overhead

Page 16: Experimental Setup

Sets of loop kernels from OpenCV, multimedia and SPEC2000 benchmarks
Target system
  Two CGRA sizes: 4x4, 8x4
  2x2 core with one load-store PE and one multiplier PE
  Mesh + diagonal connections between PEs
  Full crossbar switch between PEs and local memory banks
Compared with non-SIMD mapping
  Original: previous non-SIMD mapping
  SIMD: our approach (interleaving-interleaving mapping)

Page 17: Configuration Size

Configuration size is reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA.

Page 18: Runtime

[Figure: stacked bar chart. Y axis: runtime normalized to 4x4 Original, 0% to 100%, split into stall time and non-stall time. X axis: Swim1, Swim2, Swim3, Laplace, Wavelet, CalcHarris, CvtColor, DotProduct, Gaussian, Erode, and Average, each on the 4x4 and 8x4 CGRAs. Each group compares an Orig. bar (some annotated U2, U3, U4, U6, or U8) with a SIMD bar. Callout values: 29% and 32%.]

Page 19: Conclusion

Presented the SIMD reconfigurable architecture
  Exploits data parallelism and instruction-level parallelism at the same time
Advantages of the SIMD reconfigurable architecture
  Scales well to a large number of PEs
  Alleviates increasing compilation time
  Increases performance and reduces configuration size

Page 20: Thank you!

Page 21: Core size

For a large loop, a small core might not be a good match
  Merge multiple cores ⇒ Macrocore
  No HW modification required

[Figure: Core 1 to Core 4 grouped into Macrocore 1 and Macrocore 2, connected through a Crossbar Switch to Bank1 to Bank4]

Page 22: SIMD RA mapping flow

[Figure: flowchart with the boxes Check SIMD Requirement (on Fail: Traditional Mapping), Select Core Size, Iteration Mapping, Array Placement (Implicit) for Int-Int, Data Tiling for Seq-Tiling, Operation Mapping, and Modulo Scheduling.]

If scheduling fails, increase II and repeat.
If scheduling fails and MaxII<II, increase core size.
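The two retry rules in this flow can be read as a nested search over core size and II. The sketch below is one possible reading, not the authors' implementation: `try_schedule` is a hypothetical callback standing in for modulo scheduling, the candidate core sizes are supplied by the caller, and restarting II from its minimum after a core size increase is an assumption.

```python
def map_loop(try_schedule, core_sizes, min_ii, max_ii):
    """Return (core_size, ii) for the first successful schedule, else None.

    try_schedule(core_size, ii) is a hypothetical callback that returns True
    when modulo scheduling succeeds for that core size and II.
    """
    for core_size in core_sizes:
        # If scheduling fails, increase II and repeat.
        for ii in range(min_ii, max_ii + 1):
            if try_schedule(core_size, ii):
                return core_size, ii
        # II would now exceed MaxII, so move on to the next (larger) core size.
    return None


if __name__ == "__main__":
    # Toy stand-in scheduler: succeeds once core_size * ii reaches 12 slots.
    def toy_scheduler(core_size, ii):
        return core_size * ii >= 12

    print(map_loop(toy_scheduler, core_sizes=[4, 8, 16], min_ii=1, max_ii=2))
```

Returning None corresponds to exhausting every core size; what the flow does in that case is not stated on the slide.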