Dynamically Specialized Datapaths for Energy Efficient Computing

Preview:

DESCRIPTION

Dynamically Specialized Datapaths for Energy Efficient Computing. Venkatraman Govindaraju , Chen-Han Ho , Karu Sankaralingam Department of Computer Sciences UW-Madison http://www.cs.wisc.edu/vertical. Hardware Improvement. Wedding Cake!. Cupcake!. Pancake!. 1971. 1991. 2011. - PowerPoint PPT Presentation

Citation preview

Dynamically Specialized Datapaths for Energy Efficient Computing

Venkatraman Govindaraju, Chen-Han Ho, Karu Sankaralingam

Department of Computer SciencesUW-Madison

http://www.cs.wisc.edu/vertical1

2

Hardware Improvement

Pancake!

Wedding Cake!

Cupcake!

Not exactly!1971 1991 2011

3

Technology Scaling

Okay, but how is a wedding cake made?

Honey, I shrunk the cooks!

4

The CPU Approachin-order processor

Cupcake!C!

5

The Advanced CPU ApproachOut-of-order, Superscalar

Wedding Cake!

WC!

Do as scheduled!

You mis-predicted!

Two ways at once!

Partial cake from

refrigerator!

Partial cake to

refrigerator!

Load strawberry!

Better performance, but not efficient!

Too many things to do!

6

Hardware Specialization

• We can build a specialized hardware datapath for a certain application

• Will be efficient• Example: GPU for

graphics processing• But,..

“The Wedding Cake Team”

7

Can I get a strawberry pancake?

What are you talking about?

Performance, Efficiency, and Flexibility?

8

Dynamically Specialized Execution Resources : DySER

Dynamically Specialized Execution!

9

Overview

• Dynamically Specialized Execution• Hardware resource: DySER– How to specialize and be dynamic?

• The compile time support: Slicer• HW/SW interface: ISA extensions• Integration, performance, and conclusion

10

A Little PeekFetch Decode Execute Memory WriteBack

D$

I$Register

File

Decode ExecUnits

DySER

11

DySER: Summary

Pipe

Shared Cache

DySER

• Heterogeneous array• ≈ 64 KB SRAM area• Up to 10X speedup• An average of 40% energy reduction

12

Dynamically Specialized Execution Resources

• An array of functional units and switches

• A stateless execution unit in processor pipeline– Pipelined– Simple flow control

A B

C

A*B+C

13

Dynamic Specialization

• Capture the pattern between different applications

• The specialized datapath is constructed at the granularity of functional units– Switches for

programmability

14

How DySER Works

• Same DySER block, different pattern

• Simple switch is sufficient– Routers are

energy inefficient• Remove per-

instruction overhead

Specialization Efficiency⇒ Circuit SwitchPacket Switch

15

Slice and Dice

• Dynamically Specialized Execution• Hardware resource: DySER– How to specialize and be dynamic?

• The compile time support: Slicer• HW/SW interface: ISA extensions• Integration, performance, and conclusion

16

Identifying The Specialization Target

• Applications are executed in phases– Capture the most

frequent phase

• Identify the phases– Path profiling

• Construct path-treesFind computation? Use DySER!

17

Core DySER

Slicer: A Compiler for the DySER • The instructions in path-

trees are not all computations– Slice the path-tree into a

computation slice and a load slice

• Execute computation slice in DySER

• Execute load-slice in conventional processor pipeline

Slicer

Application

Communication

18

Working Together

• Dynamically Specialized Execution• Hardware resource: DySER– How to specialize and be dynamic?

• The compile time support: Slicer• HW/SW interface: ISA extensions• Integration, performance, and conclusion

19

Communication Between The DySER and Processor Core

• DySER interface: ISA extension

bb1: MOV control1 => R2MOV control2 => R3MOV 1 => R4SLL R4, target => R4LD reg->node => R5DYSER_INIT [COMPSLICE]DYSER_SEND R2 => DI1DYSER_SEND R3 => DI2DYSER_SEND R4 => DI3

bb2: DYSER_LOAD [R5+offset(state)] => DM0DYSER_STORE:DO2 DO1, [R5+offset(state)]DYSER_COMMITADD R5, sizeof(node), R5ADDCC R1, -1, R1BNE bb2

Initialize DySERSend input from

register file to DySERSend input

from memory to DySER

Store output from DySER to memory

Commit DySER output to register file

20

Energy Efficient Bakery Is About to Open!

DySER to the rescue!

Integration!

21

Back To Hardware

• Dynamically Specialized Execution• Hardware resource: DySER– How to specialize and be dynamic?

• The compile time support: Slicer• HW/SW interface: ISA extensions• Integration, performance, and conclusion

22

It Is Simple -- Integration

• DySER interface: FIFOFetch Decode Execute Memory WriteBack

D$

I$Register

File

Decode ExecUnits

DySER

23

Out-of-Order Integration

• Out-of-order core integration

• DySER itself maintains no architectural state

• Use buffers to keep the state for speculative execution

24

It Is Good – Evaluation Method

• Simulator: Wisconsin Multifacet GEMS– Benchmarks: SPEC CPU2006, Parboil, and PARSEC– Modified GCC compiler– DySER with 64 functional units

• Speedup & energy reduction– Quantify the low overhead execution on computation

slice– Wattch-based model in GEMS

25

Result - Performance

cp pnssad

blacksch

oles

bodytrack

cannealnamd

soplex

lbm

Geomean1

3

5

7

9

11

1-issue inorder2-issue out-of-order

Spee

dup

26

Result – Energy Reduction

cp pnssad

blacksch

oles

bodytrack

cannealnamd

soplex

lbm

Geomean0

102030405060708090

100

1-issue inorder2-issue out-of-order

Ener

gy R

educ

tion

(%)

27

It is flexible – comparison

• DySER can be SIMD, can do operation-fusion, can accelerate loops– Not enough resources? – The Slicer can help to partition the computational

slice and offload from DySER to processor core• DySER looks like dataflow, but..– No entire new ISA, no routers or packets, no burden

to programmers

28

Conclusion

• Hardware specialization is efficient– Dynamic approach with moderate integration

complexity and few ISA extensions– Up to 10X speedup, ~40% average energy redutcion

• Future work:– FPGA implementation– Comparison with other specialization approaches• FPGA • GPGPU• SSE, AVX

29

Questions?

30

Backup Slides

31

Can This Work?Benchmark Number of pathtrees Pathtrees contribute 90%

execution time

blackscholes 9 3bodytrack 322 9canneal 89 12facesim 906 22

fluidanimate 33 2freqmine 151 31

streamcluster 61 1swaptions 36 6

• We also find: applications re-execute Path-tree several times before moves to next

32

Related work

• Industrial effort

• Generality

• RAW• TRIPS• Wave scalar

• VEAL(ISCA 08)

• scalability

• DySER

• Ambric• Mathstar

DySER Configuration

• Special configure phase– Encode configure information in data, passing through

the existing datapath

33

S1 : L->R

Switch 0: Switch 1:

Not mine This is it!

Switch 1:Left -> Right