21
University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado at Boulder 2007.03.11 Towards a Toolchain for Pipeline-Parallel Programming on CMPs John Giacomoni

University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

  • View
    245

  • Download
    0

Embed Size (px)

Citation preview

Page 1: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Tipp Moseley, Graham Price, Brian Bushnell,

Manish Vachharajani, and Dirk Grunwald

University of Colorado at Boulder

2007.03.11

Towards a Toolchain forPipeline-Parallel Programming on CMPs

John Giacomoni

Page 2: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

ProblemProblem

• UP performance at “end of life”• Chip-Multiprocessor systems

– Individual cores less powerful than UP– Asymmetric and Heterogeneous– 10s-100s-1000s of cores

• How to program?

Intel (2x2-core) MIT RAW (16-core) 100-core 400-core

Page 3: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Programmers…Programmers…

• Programmers are:– Bad at explicitly parallel programming– “Better” at sequential programming

• Solutions?– Hide parallelism

• Compilers• Sequential libraries?

– Math, iteration, searching, and ??? Routines

Page 4: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Using Multi-CoreUsing Multi-Core

• Task Parallelism– Desktop

• Data Parallelism– Web serving– Split/Join, MapReduce, etc…

• Pipeline Parallelism– Video Decoding– Network Processing

Page 5: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

We believe that the best strategy for developing parallel programs may be to evolve them from sequential implementations.

Therefore we need a toolchain that assists programmers in converting sequential programs into parallel ones.

This toolchain will need to support all four conversion stages: identification, implementation, verification, and runtime system support.

Joining theJoining theMinority ChorusMinority Chorus

Page 6: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

TheTheToolchainToolchain

• Identification– LoopProf and LoopSampler– ParaMeter

• Implementation– Concurrent Threaded Pipelining

• Verification

• Runtime system support

Page 7: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

LoopProfLoopProfLoopSamplerLoopSampler

• Thread level parallelism benefits from coarse grain information– Not provided by gprof, et al.

• Visualize relationship between functions and hot loops

• No recompilation• LoopSampler is effectively overhead free

Page 8: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Partial LoopPartial LoopCall GraphCall Graph

Boxes are functionsOvals are loops

Page 9: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

ParaMeterParaMeter

• Dynamic Instruction Number vs. Ready Time graph• Visualize dependence chains

– Fast random access of trace information– Compact representation

• Trace Slicing– Moving forward or backwards in a trace based on a flow

(control, dependences, etc)– Requires information from disparate trace locations

• Variable Liveness Analysis

Page 10: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

DIN vs.DIN vs.Ready TimeReady Time

li r1, &fibarr

mov r2, 8

mov r4, 1

mov r3, 1

add r5,r3,r4

st 0(r1),r5

mov r3,r4

mov r4,r5

addi r2,r2,-1

addi r1,r1,4

bnz r2, loop

add r5,r3,r4

st 0(r1),r5

mov r3,r4

0xbee0

0xbee4

Instruction Mem Addr

1

2

3

4

5

6

7

8

9

10

11

12

13

14

DIN

1

2

3

4

5

6

7

8

9

10

11

12

13

14

DIN Ready Time

1 2 3 4 5 6 7 8 9 10 11 12 13 14

1

2

3

4

13

5

7

9

10

6

8

11

14

12

Page 11: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

DIN vs.DIN vs.Ready TimeReady Time

DIN plot for 254.gap (IA64,gcc,inf)MultipleMultiple

Dep. chainsDep. chains

Page 12: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Handling theHandling theInformation GlutInformation Glut

• Challenging Trace size• Trace complexity• Need fast random access

• Solution– Binary Decision Diagrams

– Compression ratios: 16-60x• 109 instructions in 1GB

Page 13: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

ImplementationImplementation

• Well researched– Task-Parallel– Data-Parallel

• More work to be done– Pipeline-Parallel

• Concurrent Threaded Pipelining– FastForward– DSWP

• Stream languages– Streamit

Page 14: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

ConcurrentConcurrentThreaded PipeliningThreaded Pipelining

• Pipeline-Parallel organization– Each stage bound to a processor– Sequential data flow

• Data Hazards are a problem

• Software solution– FastForward

Page 15: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

ThreadedThreadedPipeliningPipelining

Concurrent

Sequential

Page 16: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Related WorkRelated Work

• Software architectures– Click, SEDA, Unix pipes, sysv queues, etc…– Locking queues take >= 1200 cycles (600ns)

• Additional overhead for cross-domain communication

• Compiler extracted pipelines– Decoupled Software Pipelining (DSWP)

• Modified IMPACT compiler• Communication operations <= 100 cycles (50ns)

– Assumes hardware queues

• Decoupled Access/Execute Architectures

Page 17: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

FastForwardFastForward

• Portable software only framework ~70-80 cycles (35-40ns)/queue operation

• Core-core & die-die

– Architecturally tuned CLF queues• Works with all consistency models• Temporal slipping & prefetching to hide die-die

communication

– Cross-domain communication• Kernel/Process/Thread

Page 18: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Network ScenarioNetwork ScenarioFShmFShm

• How do we protect?

• GigE Network Properties:

• 1,488,095 frames/sec

• 672 ns/frame

• Frame dependencies

Page 19: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

VerificationVerification

• Characterize run-time behavior with static analysis– Test generation– Code verification– Post-mortem root-fault analysis

• Identify the frontier of states leading to an observed fault

• Use formal methods to final fault-lines

Page 20: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

RuntimeRuntimeSystem SupportSystem Support

• Hardware virtualization– Asymmetric and heterogeneous cores– Cores may not share main memory (GPU)

• Pipelined OS services• Pipelines may cross process domains

– FShm– Each domain should keep its private memory

• Protection

• Need label for each pipeline– Co/gang-scheduling of pipelines

Page 21: University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado

University of Colorado at Boulder

Core Research Lab

Questions?