Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Page 1: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Sunpyo Hong, Hyesoon Kim

An Integrated GPU Power and Performance Model

Page 2: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Outline

Motivation

Integrated Performance and Power (IPP) Modules

Results

Conclusion

Page 3: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Motivation

GPUs have become popular for running general applications

GFLOPS significantly increasing

Cost: Increasing number of cores

[Figure: GPU trend charts by year, 2007 to 2010]

Page 4: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Motivation

Power consumption has also been increasing

Relationship with respect to performance?

Effects of increased number of cores and power?

[Figure: GPU power consumption chart by year, 2007 to 2010]

Page 5: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System

Propose the Integrated Performance and Power system (IPP)

Performance Prediction (ISCA 2009)

Power, Temperature Prediction

Performance / Watt, Number of Active Cores Prediction

[Diagram boxes: GPU Kernel; IPP System (Performance Prediction; Power, Temperature Prediction; Performance / Watt Prediction; Number of Active Cores Prediction); Compiler; Programmer; H/W Dynamic Power Manager]

Page 6: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Performance, Power, Perf / Power

[Figure: three panels plotting Performance, Power (W), and Performance / Watt against # Active Cores, with curves labeled Type 1 and Type 2]

What kinds of applications will benefit from fewer cores?

GOAL: To maximize the performance/watt metric by using fewer cores for Type 2 applications

Page 7: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Outline

Motivation

Integrated Performance and Power (IPP) Modules

Results

Conclusion

Page 8: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System Modules

IPP System

Performance Prediction

Power, Temperature Prediction

Performance / Watt, Number of Active Cores Prediction

Page 9: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Power Prediction Module

Methodology is based on the empirical CPU power model (Isci’03)

Divide GPU into architectural units and know which unit is active

GPU_Power = Runtime_power + Idle_Power

Runtime_power = Σ Runtime_power_component

Runtime_power_component = AccessRate x MaxPower

AccessRate: How often unit is accessed

MaxPower: Maximum allowable power consumption
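The power equations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the numbers in the usage note are assumptions made for the example.

```python
def runtime_power_component(access_rate, max_power):
    """Runtime_power_component = AccessRate x MaxPower for one unit."""
    return access_rate * max_power


def gpu_power(access_rates, max_powers, idle_power):
    """GPU_Power = sum of runtime power components + Idle_Power.

    access_rates and max_powers are dicts keyed by architectural unit
    name (hypothetical layout chosen for this sketch).
    """
    runtime_power = sum(
        runtime_power_component(rate, max_powers[unit])
        for unit, rate in access_rates.items()
    )
    return runtime_power + idle_power
```

For instance, with made-up access rates `{"FP": 0.5, "SFU": 0.25}`, made-up MaxPower values `{"FP": 4.0, "SFU": 8.0}`, and an idle power of 10.0, the sketch adds 2.0 + 2.0 of runtime power to the idle power.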

Page 10: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

GPU Architectural Units

[Diagram labels: APP; GPU: SM SM SM SM, Memory, Other Logics; SM: Register File, Fetch/Decode/Schedule, FP Units, ALU, INT Units, SFU, Texture Cache, Constant Cache]

Which architectural unit is accessed?

No speculative execution

Associate each instruction with a specific arch. unit

PTX Instruction                                Arch. Unit
All Instructions                               FDS (Fetch/Decode/Scheduler)
Add_fp Sub_fp Mul_fp Fma_fp Neg_fp Min_fp
Lg2_fp Ex2_fp Mad_fp Div_fp Abs_fp             FP Unit
Sin_fp Cos_fp Rcp_fp Sqrt_fp Rsqrt_fp          SFU
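The instruction-to-unit association can be sketched as a lookup that charges every instruction to FDS and, where it applies, to the FP unit or the SFU. This is an illustrative sketch only: the lowercase mnemonics, the function names, and the use of a plain instruction list as input are assumptions, and only the three rows of the mapping listed on this slide are covered.

```python
from collections import Counter

# Instruction groups copied from the slide's mapping table (lowercased).
FP_OPS = {"add_fp", "sub_fp", "mul_fp", "fma_fp", "neg_fp", "min_fp",
          "lg2_fp", "ex2_fp", "mad_fp", "div_fp", "abs_fp"}
SFU_OPS = {"sin_fp", "cos_fp", "rcp_fp", "sqrt_fp", "rsqrt_fp"}


def units_for(instruction):
    """Return the architectural units one instruction is charged to."""
    name = instruction.lower()
    units = ["FDS"]  # every instruction goes through fetch/decode/schedule
    if name in FP_OPS:
        units.append("FP")
    elif name in SFU_OPS:
        units.append("SFU")
    return units


def count_accesses(instructions):
    """Count accesses per architectural unit for a list of instructions."""
    counts = Counter()
    for instruction in instructions:
        counts.update(units_for(instruction))
    return counts
```

Because there is no speculative execution, each executed instruction can be counted once, so per-unit totals like these are the kind of input an access-count term would need.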

Page 11: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Access Rate

How often the architectural unit is accessed per unit of time

Runtime_power_component = AccessRate x MaxPower

AccessRate = (DAC_per_th x Warps_per_SM) / (Exec_cycles / 4)

DAC_per_th: Data access count (DAC) per thread: Total number of accesses to a specific unit

Warps_per_SM: Active number of threads allocated per SM

Exec_cycles: Predicted execution cycles
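As a worked illustration of the access-rate formula, the sketch below evaluates it directly. The function and parameter names are assumptions for this example, and the input values in the usage note are made up rather than taken from any measurement.

```python
def access_rate(dac_per_thread, warps_per_sm, exec_cycles):
    """AccessRate = (DAC_per_th x Warps_per_SM) / (Exec_cycles / 4)."""
    if exec_cycles <= 0:
        raise ValueError("exec_cycles must be positive")
    return (dac_per_thread * warps_per_sm) / (exec_cycles / 4)
```

With a hypothetical 100 accesses per thread, 8 warps per SM, and 3200 predicted cycles, the numerator is 800 and the denominator is 800, so the unit is accessed at a rate of 1.0.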

Page 12: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Max Power Parameters (I)

Runtime_power_component = AccessRate x MaxPower

MaxPower: Allowable maximum power consumption per arch. unit

Arch. units: FP, SFU, INT, Global Memory, Shared Memory, Texture Cache, Constant Cache

[ P1 ]   [ AR1A  AR1B  ...  AR1Z ]   [ MA ]
[ P2 ]   [ AR2A  AR2B  ...  AR2Z ]   [ MB ]
[ P3 ] = [ AR3A  AR3B  ...  AR3Z ] x [ MC ]
[ .. ]   [  ..    ..   ...   ..  ]   [ .. ]
[ Pn ]   [ ARnA  ARnB  ...  ARnZ ]   [ MZ ]

P1 ... Pn: Measured Power (microbenchmarks)

AR1A ... ARnZ: AccessRate

MA ... MZ: Maximum power per arch. unit

Solve for the set of MaxPower values (MA, MB, … , MZ)
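One way to picture the solve step is as an ordinary linear system. The sketch below uses Gaussian elimination with partial pivoting and assumes a square system (as many microbenchmarks as architectural units); the slides do not say which solver was used, and with more microbenchmarks than units a least-squares fit would be the natural substitute. All names here are hypothetical.

```python
def solve_max_power(access_rates, measured_power):
    """Solve access_rates x max_power = measured_power for max_power.

    access_rates: n rows (one per microbenchmark) of n access rates.
    measured_power: n measured power values.
    Assumes a square, non-singular system (assumption of this sketch).
    """
    n = len(measured_power)
    if len(access_rates) != n or any(len(row) != n for row in access_rates):
        raise ValueError("this sketch needs an n x n access-rate matrix")
    # Augmented matrix [A | p], copied so the inputs are not modified.
    a = [[float(v) for v in row] + [float(p)]
         for row, p in zip(access_rates, measured_power)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("access-rate matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - known) / a[r][r]
    return x
```

Each microbenchmark contributes one row, so benchmarks that stress different units keep the rows independent and the system solvable.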

Page 13: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Max Power Parameters (II)

Empirical Power Parameters

Units                   MaxPower
FP                      0.2
REG                     0.3
ALU                     0.2
SFU                     0.5
INT                     0.25
FDS (Fetch/Dec/Sch)     0.5
Shared memory           1
Texture cache           0.9
Constant cache          0.4
Const_SM                0.813
Global memory           52
Local memory            52

[Figure: Power Consumption vs. Model (GTX280)]

The prediction error of microbenchmarks is 2.7%

Page 14: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Runtime Power

Runtime_power_component = AccessRate x MaxPower

GPU_Power = Σ Runtime_power_component + Idle_Power

[Figure: a GPU application accessing the FP unit, registers, fetch/decode/schedule, and the memory system; power labels in the figure: 2 W, 5 W, 5 W, 5 W, 5 W, 5 W, 40 W, 15 W; result shown: 27 W]

Page 15: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Power vs. Number of Active SMs

Power consumption also depends on the number of active SMs

Modeled by using a logarithmic function

Power_SMs = Max_SM x log10(α x Active_SMs + β)

α = (10 - β) / Num_SMs, β = 1.1
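The logarithmic scaling can be checked numerically with a short sketch. The function name and the example values are assumptions; only the formula and β = 1.1 come from the slide.

```python
import math


def sm_power(active_sms, num_sms, max_sm, beta=1.1):
    """Power_SMs = Max_SM x log10(alpha x Active_SMs + beta)."""
    alpha = (10 - beta) / num_sms
    return max_sm * math.log10(alpha * active_sms + beta)
```

When every SM is active, α x Active_SMs + β equals 10, so the logarithm is 1 and the result is Max_SM. With fewer active SMs the power drops sublinearly: in this model, half the SMs still draw well over half of Max_SM.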

Page 16: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Temperature Effects

Temperature effects on power consumption

Higher power consumption due to increased temperature is also modeled

Power delta: 14 watts

Temperature delta: 22 degrees

[Figure: GPU Power (W) and Temperature (C), annotated with ΔPower and ΔTemperature]

Page 17: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System Modules

IPP System

Performance Prediction

Power, Temperature Prediction

Performance / Watt, Number of Active Cores Prediction

Page 18: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Performance Metrics

MWP (Memory Warp Parallelism): Metric of memory-level parallelism

Maximum number of warps that can overlap memory accesses

CWP: Computation Warp Parallelism

Number of warps that execute during a memory access period

[Figure: warp timelines showing computation and memory waiting periods, with one example labeled MWP = 4 and another labeled CWP = 4, MWP = 2]

Page 19: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Performance Predictions

Case 1: MWP > CWP: Performance is dominated by the computation cycles

Case 2: CWP > MWP: Performance is dominated by the memory cycles

Case 3: Not Enough Warps (N = CWP and N = MWP): Under-utilized cycles
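The three cases can be sketched as a small classifier. It assumes that MWP > CWP is the computation-dominated case, which matches the later "CWP < MWP: Computationally-intensive code" condition, and it treats MWP = CWP with enough warps as memory-bound; that tie-breaking rule, the function name, and the string labels are assumptions of this sketch.

```python
def classify_kernel(mwp, cwp, n_warps):
    """Classify a kernel from MWP, CWP, and N (warps per SM)."""
    if mwp == n_warps and cwp == n_warps:
        # Case 3: both metrics are capped by the available warps.
        return "not_enough_warps"
    if mwp > cwp:
        # Case 1: memory requests overlap enough to hide memory latency.
        return "computation_dominated"
    # Case 2: CWP > MWP (ties are treated the same way, by assumption).
    return "memory_dominated"
```

For example, `classify_kernel(2, 4, 8)` falls into the memory-dominated case because only two warps can overlap their memory accesses while four could have computed in the same period.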

Page 20: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System Modules

IPP System

Performance Prediction

Power / Temperature Prediction

Performance / Watt, Number of Active Cores Prediction

Page 21: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System

Performance / Watt Prediction

-> Use the outputs from the performance and power modules

Performance / Watt = Performance prediction / Power prediction
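Combining the two module outputs is a single division, and sweeping it over candidate core counts shows how a best configuration could be picked. The sketch below is illustrative only: the function names, the dictionary of per-core-count predictions, and the numbers in the usage note are hypothetical.

```python
def perf_per_watt(performance, power_watts):
    """Performance / Watt = performance prediction / power prediction."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return performance / power_watts


def best_core_count(predictions):
    """Pick the core count with the highest predicted performance per watt.

    predictions maps a core count to a (performance, power_watts) pair.
    """
    return max(predictions, key=lambda cores: perf_per_watt(*predictions[cores]))
```

With made-up predictions `{10: (50.0, 100.0), 20: (90.0, 150.0), 30: (95.0, 200.0)}`, performance flattens after 20 cores while power keeps rising, so 20 cores gives the best ratio.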

Page 22: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System

Number of Active Cores Prediction

-> First check if it’s not possible to saturate bandwidth

-> Choose the maximum # of cores if any condition is true

1) Not Enough Warps: Not enough warps to saturate BW

2) CWP < MWP: Computationally-intensive code

3) MWP < MWP_peak_BW: Does not reach the machine’s peak BW
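The three checks can be written as one predicate. This is a sketch under stated assumptions: "Not Enough Warps" is taken to mean N = CWP and N = MWP, as in the performance cases earlier in the deck, and the function and parameter names are hypothetical.

```python
def should_use_max_cores(mwp, cwp, n_warps, mwp_peak_bw):
    """Return True if the kernel cannot saturate memory bandwidth."""
    not_enough_warps = (mwp == n_warps and cwp == n_warps)  # condition 1
    compute_intensive = cwp < mwp                           # condition 2
    below_peak_bw = mwp < mwp_peak_bw                       # condition 3
    return not_enough_warps or compute_intensive or below_peak_bw
```

If any condition holds, adding cores keeps improving performance, so the maximum number of cores is chosen; only when all three fail is a smaller core count worth computing.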

Page 23: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

IPP System

Otherwise, find the number of cores that saturates BW

Number of Cores = Memory Bandwidth / (BW_per_warp x N)

Memory Bandwidth : Machine specified BW parameter (GB/s)

BW_per_warp x N : Bandwidth that each core uses

N : # warps allocated per core

BW_per_warp : Bandwidth each warp consumes
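A minimal sketch of the core-count formula follows. Rounding up to a whole core and clamping to the number of cores on the machine are assumptions of this sketch (the slide gives only the ratio), as are the function name and the example values.

```python
import math


def cores_to_saturate_bw(memory_bw, bw_per_warp, n_warps, total_cores):
    """Number of Cores = Memory Bandwidth / (BW_per_warp x N).

    memory_bw and bw_per_warp are in the same unit (GB/s on the slide).
    """
    bw_per_core = bw_per_warp * n_warps  # bandwidth that each core uses
    if bw_per_core <= 0:
        raise ValueError("per-core bandwidth must be positive")
    cores = math.ceil(memory_bw / bw_per_core)  # assumption: round up
    return max(1, min(total_cores, cores))      # assumption: clamp to machine
```

For a hypothetical machine with 140.0 GB/s of memory bandwidth, 0.5 GB/s consumed per warp, and 14 warps per core, each core uses 7.0 GB/s, so 20 of 30 cores are enough to saturate bandwidth.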

Page 24: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Outline

Motivation

Integrated Performance and Power (IPP) Modules

Results

Conclusion

Page 25: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Benchmarks

Evaluated various GPGPU Benchmarks

Bandwidth limited vs. Non-bandwidth limited

Benchmark           Description                               Bandwidth (GB/s)
SVM                 Kernel from a SVM-based algorithm         54.679
Binomial (Bino)     American option pricing                   3.689
Sepia               Filter for artificially aging images      12.012
Convolve (Conv)     2D Separable image convolution            16.208
Blackscholes (Bs)   European option pricing                   51.033
Cmem                Matrix add FP operations                  64.617
Matrixmul (Nmat)    Naive version of matrix multiplication    123.33
Dotp                Matrix dotproduct                         111.313
Madd                Matrix multiply-add                       115.058
Dmadd               Matrix double memory multiply add         109.996
Mmul                Matrix single multiply                    114.997

Page 26: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Power Decomposition

Power breakdown graph for each benchmark

Average power prediction error for GPGPU apps is 8.94%

Page 27: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Performance Per Watt

Performance per watt vs. Number of active cores

[Figure: performance per watt plotted against Number of Active Cores, with points labeled IPP]

IPP achieves 85% of the best manual mapping for saving energy

Page 28: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Energy Savings

With power-gating, the energy saving prediction is 25.8%

The energy saving measured for runtime is 10.9%

Page 29: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Outline

Motivation

Integrated Performance and Power (IPP) Modules

Results

Conclusion

Page 30: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Conclusions

Introduced IPP (Integrated Power and Performance System) for GPU architecture and GPGPU kernels

IPP extends the empirical CPU power model to the GPU side

Power Prediction: For GPGPU benchmarks, the prediction error is 8.9%

IPP predicts the number of cores that will achieve the energy savings

Saved 10.9% energy for bandwidth-limited apps with fewer cores

The estimated energy saving for a power-gated system is 25.8%

Page 31: Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Thank you