Sunpyo Hong, Hyesoon Kim
An Integrated GPU Power and Performance Model
Outline
Motivation
Integrated Performance and Power (IPP) Modules
Results
Conclusion
Motivation
GPUs have become popular for running general applications
Cost: Increasing number of cores
[Figure: GPU GFLOPS by year, 2008 to 2010]
GFLOPS significantly increasing
Motivation
Power consumption has also been increasing
Relationship with respect to performance?
[Figure: GPU power consumption by year, 2007 to 2010]
Effects of increased number of cores and power?
IPP System
[Diagram: IPP system with blocks GPU Kernel; Performance Prediction; Power, Temperature Prediction; Performance / Watt Prediction; Number of Active Cores Prediction; and users Compiler, Programmer, H/W Dynamic Power Manager]
Propose the Integrated Performance and Power system (IPP)
Performance Prediction (ISCA 2009)
Power, Temperature Prediction
Performance / Watt, Number of Active Cores Prediction
IPP System
[Figure: performance, power (W), and performance / watt vs. # active cores, for Type 1 and Type 2 applications]
GOAL: To maximize the performance/watt metric by using fewer cores for Type 2 applications
What kinds of applications benefit from using fewer cores?
Outline
Motivation
Integrated Performance and Power (IPP) Modules
Results
Conclusion
IPP System Modules
Performance Prediction
Power, Temperature Prediction
Performance / Watt, Number of Active Cores Prediction
Power Prediction Module
Methodology is based on the empirical CPU power model (Isci’03)
Divide the GPU into architectural units and determine which units are active
GPU_Power = Runtime_power + Idle_Power
Runtime_power = Σ Runtime_power_component
Runtime_power_component = AccessRate x MaxPower
AccessRate: how often the unit is accessed
MaxPower: maximum allowable power consumption
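The power model on this slide can be sketched in a few lines of Python. This is only an illustration of the two equations (total power as idle power plus the sum of AccessRate x MaxPower over units); the function name `gpu_power`, the unit names, and every number below are hypothetical and not taken from the talk.

```python
def gpu_power(access_rates, max_powers, idle_power):
    """Return idle power plus the sum of AccessRate x MaxPower over all units."""
    runtime_power = sum(
        access_rates[unit] * max_powers[unit] for unit in access_rates
    )
    return runtime_power + idle_power


# Hypothetical inputs, chosen only so the arithmetic is easy to follow.
example_access_rates = {"FP": 0.5, "REG": 1.0, "FDS": 1.0}
example_max_powers = {"FP": 4.0, "REG": 2.0, "FDS": 3.0}
example_total = gpu_power(example_access_rates, example_max_powers, idle_power=10.0)
```

A unit that is never accessed has an access rate of zero and contributes nothing beyond the idle power.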
GPU Architectural Units
Which architectural unit is accessed?
No speculative execution
Associate each instruction with a specific arch. unit
PTX Instruction | Arch. Unit
All instructions | FDS (Fetch/Decode/Scheduler)
Add_fp, Sub_fp, Mul_fp, Fma_fp, Neg_fp, Min_fp, Lg2_fp, Ex2_fp, Mad_fp, Div_fp, Abs_fp | FP Unit
Sin_fp, Cos_fp, Rcp_fp, Sqrt_fp, Rsqrt_fp | SFU
[Diagram: a GPU made of SMs, memory, and other logic; each SM contains a register file, fetch/decode/schedule, FP units, SFU, ALU, INT units, texture cache, and constant cache]
Access Rate
How often the architectural unit is accessed per unit of time
Runtime_power_component = AccessRate x MaxPower
$$\text{AccessRate} = \frac{\text{DAC\_per\_th} \times \text{Warps\_per\_SM}}{\text{Exec\_cycles} / 4}$$
DAC_per_th: data access count (DAC) per thread, the total number of accesses to a specific unit
Warps_per_SM: active number of threads allocated per SM
Exec_cycles: predicted execution cycles
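As a rough sketch of the access-rate formula, the Python function below divides DAC_per_th x Warps_per_SM by Exec_cycles / 4, exactly as written on the slide. The function name and the input numbers are hypothetical; in IPP the execution cycles would come from the performance prediction module.

```python
def access_rate(dac_per_th, warps_per_sm, exec_cycles):
    """AccessRate = (DAC_per_th x Warps_per_SM) / (Exec_cycles / 4)."""
    if exec_cycles <= 0:
        raise ValueError("exec_cycles must be positive")
    return (dac_per_th * warps_per_sm) / (exec_cycles / 4)


# Hypothetical numbers: 100 accesses per thread, 8 warps per SM,
# 6400 predicted execution cycles.
example_rate = access_rate(dac_per_th=100, warps_per_sm=8, exec_cycles=6400)
```

Multiplying `example_rate` by a unit's MaxPower gives that unit's runtime power component.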
Max Power Parameters (I)
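Runtime_power_component = AccessRate x MaxPower
MaxPower: allowable maximum power consumption per arch. unit
Arch. units: FP, SFU, INT, Global Memory, Shared Memory, Texture Cache, Constant Cache
Measured power of the microbenchmarks = AccessRate x maximum power per arch. unit:
$$\begin{bmatrix} P_1 \\ P_2 \\ P_3 \\ \vdots \\ P_n \end{bmatrix} = \begin{bmatrix} AR_{1A} & AR_{1B} & \cdots & AR_{1Z} \\ AR_{2A} & AR_{2B} & \cdots & AR_{2Z} \\ AR_{3A} & AR_{3B} & \cdots & AR_{3Z} \\ \vdots & \vdots & \ddots & \vdots \\ AR_{nA} & AR_{nB} & \cdots & AR_{nZ} \end{bmatrix} \times \begin{bmatrix} M_A \\ M_B \\ M_C \\ \vdots \\ M_Z \end{bmatrix}$$
Solve for the set of MaxPower values (MA, MB, … , MZ)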
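The slide does not say how the system is solved, so the following is a minimal sketch under one assumption: an ordinary least-squares fit, which also works when there are more microbenchmarks than units. The function name `solve_max_power` and the sample data are hypothetical.

```python
def solve_max_power(access_rates, measured_power):
    """Least-squares fit of MaxPower values from microbenchmark measurements.

    access_rates: one row per microbenchmark, one column per arch. unit.
    measured_power: one measured power value per microbenchmark.
    """
    n, k = len(access_rates), len(access_rates[0])
    # Normal equations (A^T A) m = A^T p, stored as an augmented matrix.
    aug = [
        [sum(access_rates[r][i] * access_rates[r][j] for r in range(n))
         for j in range(k)]
        + [sum(access_rates[r][i] * measured_power[r] for r in range(n))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("microbenchmarks do not determine every unit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical data: three microbenchmarks exercising two units.
example_rates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_measured = [2.0, 3.0, 5.0]
example_max_power = solve_max_power(example_rates, example_measured)
```

Each microbenchmark should stress a different mix of units; if two units always appear with proportional access rates, their MaxPower values cannot be separated and the sketch raises an error.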
Max Power Parameters (II)
Empirical Power Parameters
Units | MaxPower
FP | 0.2
REG | 0.3
ALU | 0.2
SFU | 0.5
INT | 0.25
FDS (Fetch/Dec/Sch) | 0.5
Shared memory | 1
Texture cache | 0.9
Constant cache | 0.4
Const_SM | 0.813
Global memory | 52
Local memory | 52
Power Consumption vs. Model (GTX280)
The prediction error of microbenchmarks is 2.7%
Runtime Power
Runtime_power_component = AccessRate x MaxPower
GPU_Power = Σ Runtime_power_component + Idle_Power
[Figure: example of a GPU application accessing the FP unit, registers, fetch / decode / schedule, and the memory system; values shown are 5 W, 5 W, 5 W, 40 W and 2 W, 5 W, 5 W, 15 W, with a total of 27 W]
Power vs. Number of Active SMs
Power consumption also depends on the number of active SMs
Modeled by using a logarithmic function
Power_SMs = Max_SM x log10(α x Active_SMs + β)
α = (10 - β) / Num_SMs, β = 1.1
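The logarithmic model can be written directly as a small Python function. With α defined this way, the argument of the logarithm equals 10 when every SM is active, so the model returns Max_SM at full occupancy. The function name, the SM count of 30, and the Max_SM value of 50.0 are example values assumed for illustration.

```python
import math


def power_sms(active_sms, num_sms, max_sm, beta=1.1):
    """Power_SMs = Max_SM x log10(alpha x Active_SMs + beta)."""
    alpha = (10 - beta) / num_sms
    return max_sm * math.log10(alpha * active_sms + beta)


# Hypothetical GPU with 30 SMs and Max_SM = 50.0.
example_full = power_sms(active_sms=30, num_sms=30, max_sm=50.0)
example_half = power_sms(active_sms=15, num_sms=30, max_sm=50.0)
```

Because the curve is logarithmic, `example_half` is well above half of `example_full`: turning off half of the SMs saves less than half of the SM power in this model.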
Temperature Effects
Temperature effects on power consumption
Higher power consumption due to increased temperature is also modeled
Power delta: 14 watts
Temperature delta: 22 degrees
[Figure: GPU power (W) and temperature (C), with ΔPower and ΔTemperature marked]
Performance Metrics
MWP: Metric of Memory-level Parallelism
Maximum number of warps that can overlap memory accesses
CWP: Computation Warp Parallelism
Number of warps that execute during a memory access period
[Figure: warp timelines over a memory waiting period, one with MWP = 4 and one with CWP = 4 and MWP = 2]
Performance Predictions
Case 1: MWP > CWP
Performance is dominated by the computation cycles
Case 2: CWP > MWP
Performance is dominated by the memory cycles
Case 3: Not Enough Warps (N = CWP and N = MWP)
Under-utilized cycles
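The three cases can be sketched as a small classifier. The mapping of MWP > CWP to computation-dominated code follows the later slide, which labels CWP < MWP as computationally-intensive. The function name, the returned strings, and the treatment of MWP = CWP (with enough warps) as memory-dominated are assumptions made for this sketch.

```python
def classify_kernel(mwp, cwp, n):
    """Return which of the three performance cases applies.

    n is the number of warps; mwp and cwp are the two parallelism metrics.
    """
    if mwp == n and cwp == n:
        # Case 3: not enough warps, cycles are under-utilized.
        return "not enough warps"
    if mwp > cwp:
        # Case 1: memory requests overlap well, computation dominates.
        return "computation-dominated"
    # Case 2 (and, by assumption, MWP == CWP): memory cycles dominate.
    return "memory-dominated"


example_case = classify_kernel(mwp=2, cwp=4, n=8)
```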
IPP System
Performance / Watt Prediction
-> Use the outputs from the performance and power modules
$$\text{Performance / Watt} = \frac{\text{Performance prediction}}{\text{Power prediction}}$$
IPP System
Number of Active Cores Prediction
-> First check whether the application is unable to saturate the memory bandwidth
-> Choose the maximum # of cores if any condition is true
1) Not Enough Warps: not enough warps to saturate BW
2) CWP < MWP: computationally-intensive code
3) MWP < MWP_peak_BW: does not reach the machine’s peak BW
IPP System
Otherwise, find the number of cores that saturates BW
$$\text{Number of Cores} = \frac{\text{Memory Bandwidth}}{\text{BW\_per\_warp} \times N}$$
Memory Bandwidth: machine-specified BW parameter (GB/s)
BW_per_warp x N: bandwidth that each core uses
N: # warps allocated per core
BW_per_warp: bandwidth each warp consumes
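The core-count decision can be sketched as follows: return the maximum number of cores if any of the three conditions holds, otherwise apply the bandwidth formula. The "not enough warps" test reuses the N = CWP and N = MWP condition from the performance cases. Rounding up, clamping to the available cores, the function name, and all numbers are assumptions made for this sketch.

```python
import math


def predict_active_cores(n, mwp, cwp, mwp_peak_bw, memory_bw, bw_per_warp, max_cores):
    """Pick the number of active cores for a kernel.

    memory_bw and bw_per_warp must use the same unit (for example GB/s).
    """
    not_enough_warps = mwp == n and cwp == n
    if not_enough_warps or cwp < mwp or mwp < mwp_peak_bw:
        # The kernel cannot saturate bandwidth, so use every core.
        return max_cores
    # Number of Cores = Memory Bandwidth / (BW_per_warp x N)
    cores = memory_bw / (bw_per_warp * n)
    # Assumption: round up and clamp to the cores that actually exist.
    return max(1, min(max_cores, math.ceil(cores)))


# Hypothetical bandwidth-limited kernel on a 30-core GPU.
example_cores = predict_active_cores(
    n=16, mwp=4, cwp=8, mwp_peak_bw=4,
    memory_bw=128.0, bw_per_warp=0.5, max_cores=30,
)
```

In this example each core consumes 0.5 x 16 = 8 GB/s, so 16 cores are enough to use the full 128 GB/s and the remaining cores add power without adding performance.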
Outline
Motivation
Integrated Performance and Power (IPP) Modules
Results
Conclusion
Benchmarks
Evaluated various GPGPU Benchmarks
Bandwidth limited vs. Non-bandwidth limited
Benchmark | Description | Bandwidth (GB/s)
SVM | Kernel from a SVM-based algorithm | 54.679
Binomial (Bino) | American option pricing | 3.689
Sepia | Filter for artificially aging images | 12.012
Convolve (Conv) | 2D Separable image convolution | 16.208
Blackscholes (Bs) | European option pricing | 51.033
Cmem | Matrix add FP operations | 64.617
Matrixmul (Nmat) | Naive version of matrix multiplication | 123.33
Dotp | Matrix dotproduct | 111.313
Madd | Matrix multiply-add | 115.058
Dmadd | Matrix double memory multiply add | 109.996
Mmul | Matrix single multiply | 114.997
Power Decomposition
Power breakdown graph for each benchmark
Average power prediction error for GPGPU apps is 8.94%
Performance Per Watt
Performance per watt vs. Number of active cores
[Figure: panels plotted against number of active cores, each with an IPP marker]
IPP achieves 85% of the best manual mapping for saving energy
Energy Savings
With power-gating, the energy saving prediction is 25.8%
The energy saving measured for runtime is 10.9%
Outline
Motivation
Integrated Performance and Power (IPP) Modules
Results
Conclusion
Conclusions
Introduced IPP (Integrated Power and Performance System) for GPU architecture and GPGPU kernels
Power prediction: for GPGPU benchmarks, the prediction error is 8.9%
IPP predicts the number of cores that will achieve energy savings
Saved 10.9% energy for bandwidth-limited apps by using fewer cores
Estimated energy saving for a power-gated system is 25.8%
IPP extends the empirical CPU power model to the GPU side
Thank you