GLoop: An Event-driven Runtime for Consolidating GPGPU Applications
Yusuke Suzuki*, in collaboration with
Hiroshi Yamada**, Shinpei Kato***, Kenji Kono*
* Keio University, ** Tokyo University of Agriculture and Technology
*** The University of Tokyo
Graphics Processing Unit (GPU)
• GPUs are used for data-parallel computations
– Composed of thousands of cores
– The performance-per-watt of GPUs outperforms that of CPUs
• GPGPU is widely accepted beyond scientific purposes
– Network systems [Jang et al. '11], servers [Agrawal et al. '14], file systems [Silberstein et al. '13] [Sun et al. '12], DBMSs [He et al. '08], etc.
• GPUs become computing resource for applications
[Figure: GPU architecture. Many cores with per-core L1 caches share an L2 cache and video memory; the GPU is attached to a CPU with main memory]
GPUs in the Cloud
• Cloud platforms adopt GPUs as part of their computing resource
• Beyond scientific applications, various server workloads have started using GPUs
– Key-value stores [Hetherington et al. '15], web server workloads [Agrawal et al. '14], SSL reverse proxies [Jang et al. '11], etc.
[Figure: clients send requests to a GPU-accelerated cloud server, which performs the computation and returns responses]
Motivation for GPU consolidation
• Consolidating GPGPU applications on a shared GPU is a key requirement for cloud platforms
– Consolidation can improve GPU utilization since the load of cloud services varies with diurnal patterns
• GPUs are continuously scaling up, which strengthens the motivation for consolidation
[Figure: a manager shares a single GPU among multiple GPGPU apps running on one physical machine]
Problem – GPU Eaters
• Scientific applications launch long-running kernels
• Even worse, recent GPGPU applications utilize polling for efficient / effective GPU computing
– GPUfs [Silberstein et al. '13], GPUnet [Kim et al. '14], Persistent Threads [Gupta et al. '12]
• Such GPU Eaters monopolize a shared GPU
[Figure: a GPU Eater monopolizes the shared GPU for a long time while another app is blocked; the eater runs a long-running / polling GPU kernel such as data = net::receivePolling(socket, …);]
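To make the problem concrete, the following is a minimal sketch of the polling style such GPU eaters use; the kernel name, the host-written dataReady flag, and the placeholder processing are illustrative assumptions, not code from GPUfs or GPUnet.

// Hypothetical polling kernel: every block busy-waits on a flag the
// host sets when data arrives. The kernel never returns while idle,
// so it occupies the GPU and blocks other applications.
__global__ void receivePollingKernel(volatile int* dataReady,
                                     const int* inBuf, int* outBuf, int n) {
    while (*dataReady == 0) {
        // busy-wait: this is what makes the kernel a GPU eater
    }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outBuf[i] = inBuf[i] * 2;  // placeholder processing
}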
Hardware Preemption
• Recent NVIDIA Pascal GPUs introduce instruction-level hardware preemption
• No publicly available information shows the availability of software-level preemption control
• Lack of software control limits our sharing policy
– e.g., we cannot apply a proportional-share policy based on customer payment to GPU kernels
[Figure: two apps submit GPU kernels to the hardware GPU scheduler; we do not have software control over GPU kernel scheduling]
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Previous Work
• Schedule GPU commands / kernel launches
– TimeGraph [Kato et al. ‘11], Gdev [Kato et al. ‘12], GPUvm [Suzuki et al. ‘14], Disengaged Scheduling [Menychtas et al. ‘14], PTask [Rossbach et al. ‘11], Elastic kernels [Pai et al. ‘13]
– Costly GPU kernel launches [Kim et al. '14] need to be issued frequently
• Context Funneling [Wang et al. '11]
– Allows concurrent GPU kernel execution at the expense of isolation
– To ensure concurrent GPU kernel execution, GPU resources need to be split a priori, which is not work-conserving
• GPUpIO [Zeno et al. '16]
– Suffers from long-running GPU kernels
[Figure: left, a GPU command/kernel scheduler interleaves the kernel launches of App A and App B over time; right, Context Funneling uses a service to redirect the GPU streams of multiple apps into one GPU context]
EffiSha [Chen et al. ‘17], FLEP [Wu et al. ‘17]
• EffiSha and FLEP schedule TBs in a persistent-threads manner
• They launch persistent threads, dispatch logical TBs onto them, and stop kernels if necessary
• They assume TBs are short-running
– Long-running TBs and polling TBs can still monopolize GPUs
[Figure: a software logical-TB scheduler dispatches logical TBs from a queue onto the physical TBs running on the GPU]
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Goals
1. Consolidate GPU eaters efficiently
– Relaunch GPU kernels only if necessary
– Software-controlled GPU eater scheduling
2. Provide GPU resource isolation
– Isolation among apps is mandatory for multi-tenancy
3. Use proprietary GPGPU software stack
– GPU drivers and runtimes are black boxes, and almost all GPGPU applications are built on this infrastructure
Proposal: GLoop
• GLoop is a GPGPU framework allowing us to host multiple GPU eaters on a GPU
• GLoop offers
1. An event-driven programming model for GPGPU
2. A scheduler and event loop runtime that achieve a fair share in the face of GPU eaters
[Figure: each app holds its own GPU context under the GLoop Scheduler; the event-driven programming model is written as callbacks:]
auto callback = [=](DeviceLoop* loop, …) {
    // Processing the data.
    // ...
};
net::receive(loop, socket, …, callback);
Schedulable GPU Kernels
• TBs are good schedulable units, but long-running TBs monopolize GPUs
• GLoop needs to transform long-running TBs into small chunks of schedulable tasks
[Figure: execution timelines of TB1 and TB2; long-running TBs are split at scheduling points into short TBs]
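The sketch below illustrates this transformation under assumed names (DeviceLoop, gloop::postTask, and CHUNK are not the published GLoop API): instead of one long loop, each callback processes a bounded chunk and re-posts a continuation, so the boundary between chunks becomes a scheduling point.

constexpr int CHUNK = 1024;  // items per schedulable chunk (illustrative)

// Split one long-running loop into bounded chunks. Each chunk ends by
// posting a continuation, where the device event loop may yield.
__device__ void processChunk(DeviceLoop* loop, int begin, int n) {
    int end = min(begin + CHUNK, n);
    for (int i = begin; i < end; ++i) {
        // ... work on item i ...
    }
    if (end < n) {
        gloop::postTask(loop, [=](DeviceLoop* loop) {
            processChunk(loop, end, n);  // continue with the next chunk
        });
    }
}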
Event-driven Programming
• GLoop adopts an event-driven programming model
• The event-driven style is suitable for GPGPU server applications that are driven by external events such as network packet arrivals
• GPGPU programs can execute host operations (like I/O) in a non-blocking style
• It provides fine-grained scheduling points
– GLoop-aware programs can insert low-cost scheduling points
auto callback = [=](DeviceLoop* loop, …) {
    // Processing the data.
    // ...
    // Continuation!
};
net::receive(loop, socket, …, callback);

// instead of the polling style:
…
data = net::receivePolling(socket, …);
…
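As a slightly larger sketch of the style above, callbacks can be chained into an echo-server-like flow; the net::send signature, Socket type, and BUF_SIZE are assumptions for illustration, not the published GLoop API.

constexpr int BUF_SIZE = 4096;  // illustrative buffer size

// Receive a request, echo it back, then wait for the next one. Each
// step registers a callback, so the device event loop regains control
// (and can schedule or suspend) between steps.
__device__ void serveRequest(DeviceLoop* loop, Socket* socket, char* buf) {
    net::receive(loop, socket, buf, BUF_SIZE,
        [=](DeviceLoop* loop, int received) {
            net::send(loop, socket, buf, received,
                [=](DeviceLoop* loop, int /* sent */) {
                    serveRequest(loop, socket, buf);  // next request
                });
        });
}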
GLoop Architecture
• GLoop consists of 3 components
– GLoop Scheduler
– Host Event Loops
– Device Event Loops
[Figure: the GLoop Scheduler runs in the host OS; each app lives within a resource container and consists of a host event loop and a device event loop sharing one GPU context]
Host and Device Event Loops
• Device event loops request ops from a host event loop and register a callback
• The host event loop performs the ops and notifies completions to the device event loops
• Device event loops poll for completions
– At that point, if requested, a device event loop can finish its polling
• Device event loops invoke the registered callbacks once the associated event completes
[Figure: a GPU kernel calls the GLoop API, registering a callback and requesting a host operation (like I/O); the host event loop performs the async host I/O and notifies completion; the device event loop polls for the completion and invokes the registered callback]
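A minimal sketch of the poll-and-dispatch core described above; Slot, Callback, NUM_SLOTS, and the completion flag written by the host event loop are all assumed names, not GLoop's real data structures.

constexpr int NUM_SLOTS = 64;               // illustrative

struct DeviceLoop;                          // forward declaration
typedef void (*Callback)(DeviceLoop*);      // continuation signature
struct Slot { volatile bool completed; Callback callback; };
struct DeviceLoop { int pending; Slot slots[NUM_SLOTS]; };

// Poll completion slots the host event loop writes to, and invoke the
// callback registered for each completed request.
__device__ void pollAndDispatch(DeviceLoop* loop) {
    while (loop->pending > 0) {
        for (int s = 0; s < NUM_SLOTS; ++s) {
            Slot& slot = loop->slots[s];
            if (slot.completed) {           // set by the host event loop
                slot.completed = false;
                loop->pending--;
                slot.callback(loop);        // may register further ops
            }
        }
    }
}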
GPU kernel launch requests
• The host event loop acquires a token to execute a kernel
• Device event loops dispatch the callbacks generated from a program
[Figure: an app acquires a token from the GLoop Scheduler, submits its GPU kernel, and its device event loops dispatch the callbacks]
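A host-side sketch of this launch protocol; GLoopSchedulerClient and HostLoop are assumed interfaces standing in for GLoop's real components.

struct GLoopSchedulerClient {
    void acquireToken() { /* block until the scheduler grants the GPU */ }
    void releaseToken() { /* return the GPU to the scheduler */ }
};
struct HostLoop {
    void launchDeviceEventLoop() { /* submit the GPU kernel */ }
    void waitForKernelExit()     { /* returns when it finishes or suspends */ }
};

// One scheduling round: hold the token only while the kernel runs.
void runOnce(GLoopSchedulerClient& sched, HostLoop& loop) {
    sched.acquireToken();          // 1. acquire token to execute the kernel
    loop.launchDeviceEventLoop();  // 2. submit the GPU kernel
    loop.waitForKernelExit();      // 3. device event loops dispatch callbacks
    sched.releaseToken();          // 4. yield the GPU to the next app
}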
Device Event Loop Suspension
• The scheduler interacts with applications and requests suspension to switch GPU contexts
• The GLoop Scheduler monitors the GPU utilization of each GPU app and makes scheduling decisions
[Figure: the GLoop Scheduler requests suspension; the device event loop saves the loop's state and exits, and another app then acquires the token and submits its GPU kernel]
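A sketch of the suspension check a device event loop could perform between callbacks; the host-visible suspendFlag and saveLoopState are assumptions used to illustrate the save-and-exit step.

struct DeviceLoop;
__device__ void saveLoopState(DeviceLoop* loop);  // persist continuations

// Returns true if the kernel should exit so the GPU context can switch;
// the host relaunches the kernel later to resume the saved loop.
__device__ bool shouldSuspend(DeviceLoop* loop, volatile int* suspendFlag) {
    if (*suspendFlag != 0) {    // set on behalf of the GLoop Scheduler
        saveLoopState(loop);    // save pending callbacks to device memory
        return true;
    }
    return false;
}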
GPU Apps Scheduler
• GLoop schedules applications by periodically suspending them
• GLoop uses weighted fair queuing for proportional-share scheduling
[Figure: the GLoop Scheduler keeps a queue per app (App1 Queue, App2 Queue) and time-slices the App1 and App2 contexts on the GPU]
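A minimal host-side sketch of weighted fair queuing under this design: each app accumulates virtual time inversely proportional to its weight, and the app with the smallest virtual time runs next. The struct and function names are illustrative, not GLoop's implementation.

#include <limits>

struct AppState {
    double vtime  = 0.0;  // accumulated virtual time
    double weight = 1.0;  // share assigned to this app
};

// Pick the app with the smallest virtual time.
int pickNext(const AppState* apps, int n) {
    int best = -1;
    double bestVtime = std::numeric_limits<double>::max();
    for (int i = 0; i < n; ++i) {
        if (apps[i].vtime < bestVtime) { bestVtime = apps[i].vtime; best = i; }
    }
    return best;
}

// Charge an app for its measured run time; a larger weight makes its
// virtual time grow more slowly, so it is scheduled more often.
void charge(AppState& app, double runtimeMs) {
    app.vtime += runtimeMs / app.weight;
}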
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Evaluation Setup
• Implementation
– Linux 4.4.0-34, NVIDIA GPU driver 361.42
– CUDA 7.5
• Environment
– Xeon E5-2620, 8 GB RAM
– NVIDIA Tesla K40c (Kepler) GPU, 12 GB GDDR5
Case Studies
• To demonstrate the applicability of GLoop, we ported 8 applications
• We selected benchmarks that have long-running GPU kernels from Parboil2 [Stratton et al. '12], Rodinia [Che et al. '09], GPUfs, and GPUnet
• The GLoop runtime successfully inserts enough scheduling points
Name                          Ported from   Category
TPACF                         Parboil2      Compute-intensive
LavaMD                        Rodinia       Compute-intensive
MUMmerGPU                     Rodinia       Compute-intensive
Hybridsort                    Rodinia       Compute-intensive
Grep                          GPUfs         I/O-intensive
Approximate Image Matching    GPUfs         I/O-intensive
Echo Server                   GPUnet        I/O-intensive
Matrix Multiplication Server  GPUnet        I/O-intensive
Standalone Overhead - Compute
• GLoop's overhead is -8% to 2% in compute-intensive apps except for hybridsort
• Only hybridsort shows significant performance degradation (16%)
– Currently, the GLoop toolchain has a limitation: the GLoop runtime cannot use shared memory for optimization in hybridsort's kernels
• While GLoop shows moderate performance, kernel-split shows significant degradation in some benchmarks
– Kernel-split needs to exit GPU kernels whenever it encounters a scheduling point
[Chart: execution time (sec) broken down into CUDA Init/Fin, Data Init/Fin, IO, Copy, and Kernel. hybridsort: vanilla 7.912, kernel-split 18.021, gloop 9.151; tpacf (w/ optimization): vanilla 3.362, kernel-split 3.272, gloop 3.378]
Performance at Scale
• The execution time of TPACF increases linearly
• The execution time of 8 greps is 10.6x longer than the standalone time
– This overhead comes from I/O contention
[Charts: execution time (seconds) vs. number of instances. tpacf: 1 instance 3.268, 2 instances 6.652, 4 instances 12.228, 8 instances 24.520; grep: 1 instance 26.331, 2 instances 54.266, 4 instances 123.963, 8 instances 279.149 (10.6x over standalone)]
Performance Isolation
• The GLoop scheduler successfully assigns weighted utilizations to a specific app (66% for TPACF)
[Charts: GPU utilization (%) over time (sec) for three mixes, TPACF + 1 throttle, TPACF + 3 throttles, and TPACF + 7 throttles, showing TPACF receiving its weighted share alongside throttle1 through throttle7]
GPU Server Consolidation
• Gradually launch under-utilized (20%) GPU servers and measure each server's GPU utilization
• GLoop successfully consolidates three GPU servers on one shared GPU
[Chart: GPU utilization (%) of server1, server2, and server3 over 70 seconds; the dip at each launch reflects CUDA context initialization, which takes 200 – 400 ms]
GPU Idle-time Exploitation
• While running an under-utilized (20%) GPU server, periodically launch TPACF instances
• TPACF exploits idle-time of a shared GPU
[Chart: GPU utilization (%) over 20 seconds of the server and tpacf1 through tpacf3; GLoop assigns the idle time to the compute-intensive TPACF app]
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Conclusion
• GLoop’s event-driven programming model and architecture achieves – Efficient consolidation
• While generating enough scheduling points,suspend-and-resume kernels only when GLoop Scheduler required
– Resource isolation• Use GPU Context to isolate each app’s GPU state
– Use proprietary GPU runtimes without modification
• The evaluations show– GLoop performance is comparable to an existing GPU eaters
– Successfully schedule GPU eaters with flexible scheduling policy
– Gloop consolidate multiple GPU eaters on a shared GPU