Sunpyo Hong, Hyesoon Kim An Integrated GPU Power and Performance Model

Sunpyo Hong, Hyesoon Kim

An Integrated GPU Power and Performance Model

Outline

Motivation

Integrated Power and Performance (IPP) Modules

Results

Conclusion


Motivation


GPUs have become popular for running general applications

Cost: increasing number of cores

GFLOPS significantly increasing

[Figure: GPU generations by year, 2008 to 2010]

Motivation


Power consumption has also been increasing

Relationship with respect to performance?

Effects of increased number of cores and power?

[Figure: GPU power consumption by year, 2007 to 2010]

IPP System

[Figure: IPP system overview. Labels: GPU Kernel; Performance Prediction; Power, Temperature Prediction; Performance / Watt Prediction; Number of Active Cores Prediction; Compiler; Programmer; H/W Dynamic Power Manager]

Propose the Integrated Power and Performance (IPP) system

Performance Prediction (ISCA 2009)

Power, Temperature Prediction

Performance / Watt, Number of Active Cores Prediction

IPP System

[Figure: performance, power (W), and performance/watt vs. number of active cores, for Type 1 and Type 2 applications]

GOAL: Maximize the performance/watt metric by using fewer cores for Type 2 applications

What kinds of applications benefit from using fewer cores?



IPP System Modules

Performance Prediction




Power Prediction Module

Methodology is based on the empirical CPU power model (Isci’03)

Divide the GPU into architectural units and determine which units are active

GPU_Power = Runtime_power + Idle_Power

Runtime_power = Σ Runtime_power_component

Runtime_power_component = AccessRate x MaxPower

AccessRate: how often the unit is accessed

MaxPower: maximum allowable power consumption of the unit
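As a minimal sketch of the power model above, the Python function below sums AccessRate x MaxPower over the architectural units and adds the idle power. The function name, the dictionary layout, and the numbers in the usage lines are illustrative assumptions, not values from the slides.

```python
def gpu_power(access_rates, max_power, idle_power):
    """GPU_Power = sum(AccessRate x MaxPower) + Idle_Power.

    access_rates and max_power are dicts keyed by architectural unit name
    (a hypothetical layout chosen for this sketch).
    """
    runtime_power = sum(
        rate * max_power[unit] for unit, rate in access_rates.items()
    )
    return runtime_power + idle_power


# Illustrative inputs only; these are not measured values.
example = gpu_power(
    access_rates={"FP": 0.5, "SFU": 0.1},
    max_power={"FP": 0.2, "SFU": 0.5},
    idle_power=10.0,
)
```

Keying both inputs by unit name keeps the sum aligned with the per-unit decomposition; a unit missing from `max_power` raises a `KeyError` rather than being silently ignored.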

GPU Architectural Units

[Figure: GPU block diagram. Labels: GPU, SM, Memory, Other Logics, Register File, Fetch/Decode/Schedule, FP Units, INT Units, ALU, SFU, Constant Cache, Texture Cache]

PTX Instruction -> Arch. Unit

All instructions -> FDS (Fetch/Decode/Scheduler)

Add_fp, Sub_fp, Mul_fp, Fma_fp, Neg_fp, Min_fp, Lg2_fp, Ex2_fp, Mad_fp, Div_fp, Abs_fp -> FP Unit

Sin_fp, Cos_fp, Rcp_fp, Sqrt_fp, Rsqrt_fp -> SFU

Which architectural unit is accessed?

No speculative execution

Associate each instruction with a specific architectural unit

Access Rate

How often the architectural unit is accessed per unit of time

Runtime_power_component = AccessRate x MaxPower

AccessRate = (DAC_per_th x Warps_per_SM) / (Exec_cycles / 4)

DAC_per_th: data access count (DAC) per thread, the total number of accesses to a specific unit

Warps_per_SM: active number of threads allocated per SM, counted in warps

Exec_cycles: predicted execution cycles
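The access-rate formula can be written directly as a small function. This is a sketch under the definitions given on this slide; the function and parameter names are hypothetical, and the example numbers are made up for illustration.

```python
def access_rate(dac_per_thread, warps_per_sm, exec_cycles):
    """AccessRate = (DAC_per_th x Warps_per_SM) / (Exec_cycles / 4)."""
    if exec_cycles <= 0:
        raise ValueError("exec_cycles must be positive")
    return dac_per_thread * warps_per_sm / (exec_cycles / 4)


# Illustrative numbers only: 100 accesses per thread, 8 warps per SM,
# and 6400 predicted execution cycles.
rate = access_rate(dac_per_thread=100, warps_per_sm=8, exec_cycles=6400)
```

Here `exec_cycles` would come from the performance prediction, which is what ties the power model to the performance model.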

Max Power Parameters (I)


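Allowable maximum power consumption per architectural unit: FP, SFU, INT, Global Memory, Shared Memory, Texture Cache, Constant Cache

Microbenchmarks relate the measured power to the access rates:

[ P1 ]   [ AR1A  AR1B  ...  AR1Z ]   [ MA ]
[ P2 ]   [ AR2A  AR2B  ...  AR2Z ]   [ MB ]
[ P3 ] = [ AR3A  AR3B  ...  AR3Z ] x [ MC ]
[ .. ]   [  ..    ..   ...   ..  ]   [ .. ]
[ Pn ]   [ ARnA  ARnB  ...  ARnZ ]   [ MZ ]

P1 ... Pn: measured power of each microbenchmark

AR1A ... ARnZ: access rate of each microbenchmark for each architectural unit

MA ... MZ: maximum power per architectural unit

Solve for the set of MaxPower values (MA, MB, ..., MZ)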
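One way to solve such a system, sketched here in plain Python, is Gaussian elimination with partial pivoting. This sketch assumes a square, non-singular system (as many microbenchmarks as architectural units); the slides do not say which solver was used, and a fit with more microbenchmarks than units would need least squares instead. All names and numbers are hypothetical.

```python
def solve_max_power(access_rates, measured_power):
    """Solve access_rates x max_power = measured_power for max_power.

    access_rates: n x n list of rows, one row per microbenchmark.
    measured_power: list of n measured power values.
    """
    n = len(measured_power)
    # Build the augmented matrix [AR | P].
    aug = [
        [float(v) for v in row] + [float(p)]
        for row, p in zip(access_rates, measured_power)
    ]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("access-rate matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        residual = aug[r][n] - sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = residual / aug[r][r]
    return solution


# Two made-up microbenchmarks over two units.
max_power = solve_max_power([[1.0, 0.0], [0.5, 0.5]], [2.0, 3.0])
```

Microbenchmarks that each stress mostly one unit keep the matrix close to diagonal, which makes the solved parameters less sensitive to measurement noise.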


Max Power Parameters (II)

Empirical Power Parameters

Units                  MaxPower
FP                     0.2
REG                    0.3
ALU                    0.2
SFU                    0.5
INT                    0.25
FDS (Fetch/Dec/Sch)    0.5
Shared memory          1
Texture cache          0.9
Constant cache         0.4
Const_SM               0.813
Global memory          52
Local memory           52

Power Consumption vs. Model (GTX280)

The prediction error of microbenchmarks is 2.7%

Runtime Power

GPU_Power = Σ Runtime_power_component + Idle_Power

[Figure: example power breakdown for a GPU application accessing the FP unit, registers, fetch/decode/schedule logic, and the memory system]

Power vs. Number of Active SMs


Power consumption also depends on the number of Active SMs

Modeled by using a logarithmic function

Power_SMs = Max_SM x log10(α x Active_SMs + β)

α = (10 - β) / Num_SMs, β = 1.1
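A short sketch of this logarithmic model follows. With α defined this way, the argument of log10 equals 10 when all SMs are active, so Power_SMs reaches Max_SM at full occupancy. The function name and the example values (30 SMs, Max_SM = 100) are illustrative assumptions.

```python
import math


def sm_power(max_sm, active_sms, num_sms, beta=1.1):
    """Power_SMs = Max_SM x log10(alpha x Active_SMs + beta)."""
    alpha = (10 - beta) / num_sms
    return max_sm * math.log10(alpha * active_sms + beta)


# Illustrative values only: sweep the number of active SMs on a 30-SM GPU.
curve = [sm_power(max_sm=100.0, active_sms=k, num_sms=30) for k in range(0, 31)]
```

The curve rises quickly for the first few SMs and flattens toward Max_SM, so each additional active SM adds less power than the one before it.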

Temperature Effects


Temperature effects on power consumption

Higher power consumption due to increased temperature is also modeled

Power delta: 14 watts

Temperature delta: 22 degrees

[Figure: GPU power (W) and temperature (C), with ΔPower and ΔTemperature marked]

IPP System Modules

Number of active cores prediction




Performance Metrics

MWP: Memory Warp Parallelism, a metric of memory-level parallelism

Maximum number of warps that can overlap memory accesses

CWP: Computation Warp Parallelism

Number of warps that execute during a memory access period

[Figure: warp timelines with the memory waiting period marked, illustrating MWP = 4, and CWP = 4 with MWP = 2]

Performance Predictions

Case 1: MWP > CWP

Performance is dominated by the computation cycles

Case 2: CWP > MWP

Performance is dominated by the memory cycles

Case 3: Not enough warps (N = CWP and N = MWP)

Under-utilized cycles
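The three cases can be expressed as a small classifier. This sketch only labels the case; it does not reproduce the cycle-count formulas of the performance model. The function name and return labels are hypothetical, and two choices here are assumptions not stated on the slide: the not-enough-warps case is checked first, and a tie with MWP = CWP below N is reported as "balanced".

```python
def classify_kernel(mwp, cwp, n_warps):
    """Label which case of the performance model applies.

    mwp, cwp: memory and computation warp parallelism.
    n_warps: number of warps N running on the core.
    """
    if mwp == n_warps and cwp == n_warps:
        # Case 3: both metrics are capped by N, so cycles are under-utilized.
        return "not-enough-warps"
    if cwp > mwp:
        # Case 2: memory cycles dominate.
        return "memory-bound"
    if mwp > cwp:
        # Case 1: computation cycles dominate.
        return "compute-bound"
    # Tie below N: not covered by the slide, labeled here as an assumption.
    return "balanced"


labels = [
    classify_kernel(mwp=2, cwp=4, n_warps=8),
    classify_kernel(mwp=4, cwp=2, n_warps=8),
    classify_kernel(mwp=3, cwp=3, n_warps=3),
]
```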


IPP System

Performance / Watt prediction

-> Use the outputs from the performance and power modules

Performance / Watt = Performance prediction / Power prediction





IPP System

-> First check whether it is impossible to saturate the bandwidth

-> Choose the maximum number of cores if any of these conditions is true

1) Not enough warps: not enough warps to saturate BW

2) CWP < MWP: computationally intensive code

3) MWP < MWP_peak_BW: does not reach the machine’s peak BW

IPP System

Otherwise, find the number of cores that saturates BW

Number of Cores = Memory Bandwidth / (BW_per_warp x N)

Memory Bandwidth: machine-specified BW parameter (GB/s)

BW_per_warp x N: bandwidth that each core uses

N: number of warps allocated per core

BW_per_warp: bandwidth each warp consumes
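Putting the bandwidth check and the formula together, the selection of the number of active cores might look like the sketch below. The function and parameter names are hypothetical; rounding up to a whole core and clamping to the range 1 to max_cores are assumptions, since the slides do not say how a fractional result is handled. The example numbers are made up.

```python
import math


def predict_active_cores(memory_bandwidth, bw_per_warp, warps_per_core,
                         max_cores, mwp, cwp, mwp_peak_bw, not_enough_warps):
    """Return the number of cores to activate.

    Uses all cores when the kernel cannot saturate memory bandwidth;
    otherwise returns the core count that saturates it.
    """
    if not_enough_warps or cwp < mwp or mwp < mwp_peak_bw:
        return max_cores
    # Number of Cores = Memory Bandwidth / (BW_per_warp x N)
    cores = memory_bandwidth / (bw_per_warp * warps_per_core)
    # Assumption: round up to a whole core and stay within the machine's limits.
    return max(1, min(max_cores, math.ceil(cores)))


# Illustrative bandwidth-limited kernel: 120 GB/s machine bandwidth,
# 0.5 GB/s per warp, 12 warps per core, 30 cores available.
cores = predict_active_cores(120.0, 0.5, 12, 30,
                             mwp=4, cwp=8, mwp_peak_bw=4,
                             not_enough_warps=False)
```

For a compute-intensive kernel (CWP < MWP) the same call returns `max_cores`, matching the rule that only bandwidth-limited kernels benefit from fewer cores.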


Benchmarks

Evaluated various GPGPU benchmarks

Bandwidth limited vs. non-bandwidth limited

Benchmark            Description                               Bandwidth (GB/s)
SVM                  Kernel from an SVM-based algorithm        54.679
Binomial (Bino)      American option pricing                   3.689
Sepia                Filter for artificially aging images      12.012
Convolve (Conv)      2D separable image convolution            16.208
Blackscholes (Bs)    European option pricing                   51.033
Cmem                 Matrix add FP operations                  64.617
Matrixmul (Nmat)     Naive version of matrix multiplication    123.33
Dotp                 Matrix dot product                        111.313
Madd                 Matrix multiply-add                       115.058
Dmadd                Matrix double memory multiply-add         109.996
Mmul                 Matrix single multiply                    114.997

Power Decomposition


Power breakdown graph for each benchmark

Average power prediction error for GPGPU apps is 8.94%

Performance Per Watt

Performance per watt vs. number of active cores

[Figure: plots of performance per watt vs. number of active cores, with points labeled IPP]

IPP achieves 85% of the best manual mapping for saving energy

Energy Savings


With power-gating, the energy saving prediction is 25.8%

The energy saving measured for runtime is 10.9%


Conclusions

Introduced IPP (Integrated Power and Performance System) for GPU architecture and GPGPU kernels

Power prediction: for GPGPU benchmarks, the prediction error is 8.9%

IPP predicts the number of cores that will achieve the energy savings

Saved 10.9% energy for bandwidth-limited apps with fewer cores

The estimated energy saving for a power-gated system is 25.8%

IPP extends the empirical CPU power model to the GPU side

Thank you
