GLoop: An Event-driven Runtime for Consolidating GPGPU Applications
Yusuke Suzuki*, in collaboration with
Hiroshi Yamada**, Shinpei Kato***, Kenji Kono*
* Keio University, ** Tokyo University of Agriculture and Technology
*** The University of Tokyo
Graphics Processing Unit (GPU)
• GPUs are used for data-parallel computations
– Composed of thousands of cores
– The performance-per-watt of GPUs outperforms that of CPUs
• GPGPU is widely accepted beyond scientific purposes
– Network systems [Jang et al. '11], servers [Agrawal et al. '14], file systems [Silberstein et al. '13] [Sun et al. '12], DBMSs [He et al. '08], etc.
• GPUs become computing resource for applications
[Figure: GPU architecture. Many cores with per-core L1 caches share an L2 cache and video memory; the GPU is attached to a CPU with main memory]
GPUs in the Cloud
• Cloud platforms adopt GPUs as part of their computing resource
• Beyond scientific applications, various server workloads have started using GPUs
– Key-value stores [Hetherington et al. '15], web server workloads [Agrawal et al. '14], SSL reverse proxies [Jang et al. '11], etc.
[Figure: clients send requests to a GPU-accelerated cloud server, which performs the computation and returns responses]
Motivation for GPU consolidation
• Consolidating GPGPU applications on a shared GPU is a key requirement for cloud platforms
– Consolidation can improve GPU utilization since the load of cloud services varies with diurnal patterns
• GPUs are continuously scaling up, which strengthens the motivation for consolidation
[Figure: a manager shares a single GPU among multiple GPGPU apps running on one physical machine]
Problem – GPU Eaters
• Scientific applications launch long-running kernels
• Even worse, recent GPGPU applications utilize polling for efficient / effective GPU computing
– GPUfs [Silberstein et al. '13], GPUnet [Kim et al. '14], Persistent Threads [Gupta et al. '12]
• Such GPU Eaters monopolize a shared GPU
[Figure: a GPU Eater monopolizes the shared GPU for a long time while another app is blocked; the eater runs a long-running / polling GPU kernel such as data = net::receivePolling(socket, …);]
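To make the problem concrete, the following is a minimal sketch of the polling style such GPU eaters use; the kernel name, the host-written dataReady flag, and the placeholder processing are illustrative assumptions, not code from GPUfs or GPUnet.

// Hypothetical polling kernel: every block busy-waits on a flag the
// host sets when data arrives. The kernel never returns while idle,
// so it occupies the GPU and blocks other applications.
__global__ void receivePollingKernel(volatile int* dataReady,
                                     const int* inBuf, int* outBuf, int n) {
    while (*dataReady == 0) {
        // busy-wait: this is what makes the kernel a GPU eater
    }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outBuf[i] = inBuf[i] * 2;  // placeholder processing
}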
Hardware Preemption
• Recent NVIDIA Pascal GPUs introduce instruction-level hardware preemption
• No publicly available information shows the availability of software-level preemption control
• Lack of software control limits our sharing policy
– e.g., we cannot apply a proportional-share policy based on customer payment to GPU kernels
[Figure: two apps submit GPU kernels to the hardware GPU scheduler; we do not have software control over GPU kernel scheduling]
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Previous Work
• Schedule GPU commands / kernel launches
– TimeGraph [Kato et al. ‘11], Gdev [Kato et al. ‘12], GPUvm [Suzuki et al. ‘14], Disengaged Scheduling [Menychtas et al. ‘14], PTask [Rossbach et al. ‘11], Elastic kernels [Pai et al. ‘13]
– Costly GPU kernel launches [Kim et al. '14] need to be issued frequently
• Context Funneling [Wang et al. '11]
– Allows concurrent GPU kernel execution at the expense of isolation
– To ensure concurrent GPU kernel execution, GPU resources need to be split a priori, which is not work-conserving
• GPUpIO [Zeno et al. '16]
– Suffers from long-running GPU kernels
[Figure: left, a GPU command/kernel scheduler interleaves the kernel launches of App A and App B over time; right, Context Funneling uses a service to redirect the GPU streams of multiple apps into one GPU context]
EffiSha [Chen et al. ‘17], FLEP [Wu et al. ‘17]
• EffiSha and FLEP schedule TBs in a persistent-threads manner
• They launch persistent threads, dispatch logical TBs onto them, and stop kernels if necessary
• They assume TBs are short-running
– Long-running TBs and polling TBs can still monopolize GPUs
[Figure: a software logical-TB scheduler dispatches logical TBs from a queue onto the physical TBs running on the GPU]
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Goals
1. Consolidate GPU eaters efficiently
– Relaunch GPU kernels only if necessary
– Software-controlled GPU eater scheduling
2. Provide GPU resource isolation
– Isolation among apps is mandatory for multi-tenancy
3. Use proprietary GPGPU software stack
– GPU drivers and runtimes are black boxes, and almost all GPGPU applications are built on this infrastructure
Proposal: GLoop
• GLoop is a GPGPU framework allowing us to host multiple GPU eaters on a GPU
• GLoop offers
1. An event-driven programming model for GPGPU
2. A scheduler and event loop runtime that achieve a fair share in the face of GPU eaters
[Figure: each app holds its own GPU context under the GLoop Scheduler; the event-driven programming model is written as callbacks:]
auto callback = [=](DeviceLoop* loop, …) {
    // Processing the data.
    // ...
};
net::receive(loop, socket, …, callback);
Schedulable GPU Kernels
• TBs are good schedulable units, but long-running TBs monopolize GPUs
• GLoop needs to transform long-running TBs into small chunks of schedulable tasks
[Figure: execution timelines of TB1 and TB2; long-running TBs are split at scheduling points into short TBs]
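The sketch below illustrates this transformation under assumed names (DeviceLoop, gloop::postTask, and CHUNK are not the published GLoop API): instead of one long loop, each callback processes a bounded chunk and re-posts a continuation, so the boundary between chunks becomes a scheduling point.

constexpr int CHUNK = 1024;  // items per schedulable chunk (illustrative)

// Split one long-running loop into bounded chunks. Each chunk ends by
// posting a continuation, where the device event loop may yield.
__device__ void processChunk(DeviceLoop* loop, int begin, int n) {
    int end = min(begin + CHUNK, n);
    for (int i = begin; i < end; ++i) {
        // ... work on item i ...
    }
    if (end < n) {
        gloop::postTask(loop, [=](DeviceLoop* loop) {
            processChunk(loop, end, n);  // continue with the next chunk
        });
    }
}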
Event-driven Programming
• GLoop adopts an event-driven programming model
• The event-driven style is suitable for GPGPU server applications that are driven by external events such as network packet arrivals
• GPGPU programs can execute host operations (like I/O) in a non-blocking style
• It provides fine-grained scheduling points
– GLoop-aware programs can insert low-cost scheduling points
auto callback = [=](DeviceLoop* loop, …) {
    // Processing the data.
    // ...
    // Continuation!
};
net::receive(loop, socket, …, callback);

// instead of the polling style:
…
data = net::receivePolling(socket, …);
…
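As a slightly larger sketch of the style above, callbacks can be chained into an echo-server-like flow; the net::send signature, Socket type, and BUF_SIZE are assumptions for illustration, not the published GLoop API.

constexpr int BUF_SIZE = 4096;  // illustrative buffer size

// Receive a request, echo it back, then wait for the next one. Each
// step registers a callback, so the device event loop regains control
// (and can schedule or suspend) between steps.
__device__ void serveRequest(DeviceLoop* loop, Socket* socket, char* buf) {
    net::receive(loop, socket, buf, BUF_SIZE,
        [=](DeviceLoop* loop, int received) {
            net::send(loop, socket, buf, received,
                [=](DeviceLoop* loop, int /* sent */) {
                    serveRequest(loop, socket, buf);  // next request
                });
        });
}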
GLoop Architecture
• GLoop consists of 3 components
– GLoop Scheduler
– Host Event Loops
– Device Event Loops
[Figure: the GLoop Scheduler runs in the host OS; each app lives within a resource container and consists of a host event loop and a device event loop sharing one GPU context]
Host and Device Event Loops
• Device event loops request ops from a host event loop and register a callback
• The host event loop performs the ops and notifies completions to the device event loops
• Device event loops poll for completions
– At that point, if requested, a device event loop can finish its polling
• Device event loops invoke the registered callbacks once the associated event completes
[Figure: a GPU kernel calls the GLoop API, registering a callback and requesting a host operation (like I/O); the host event loop performs the async host I/O and notifies completion; the device event loop polls for the completion and invokes the registered callback]
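A minimal sketch of the poll-and-dispatch core described above; Slot, Callback, NUM_SLOTS, and the completion flag written by the host event loop are all assumed names, not GLoop's real data structures.

constexpr int NUM_SLOTS = 64;               // illustrative

struct DeviceLoop;                          // forward declaration
typedef void (*Callback)(DeviceLoop*);      // continuation signature
struct Slot { volatile bool completed; Callback callback; };
struct DeviceLoop { int pending; Slot slots[NUM_SLOTS]; };

// Poll completion slots the host event loop writes to, and invoke the
// callback registered for each completed request.
__device__ void pollAndDispatch(DeviceLoop* loop) {
    while (loop->pending > 0) {
        for (int s = 0; s < NUM_SLOTS; ++s) {
            Slot& slot = loop->slots[s];
            if (slot.completed) {           // set by the host event loop
                slot.completed = false;
                loop->pending--;
                slot.callback(loop);        // may register further ops
            }
        }
    }
}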
GPU kernel launch requests
• The host event loop acquires a token to execute a kernel
• Device event loops dispatch the callbacks generated from a program
[Figure: an app acquires a token from the GLoop Scheduler, submits its GPU kernel, and its device event loops dispatch the callbacks]
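A host-side sketch of this launch protocol; GLoopSchedulerClient and HostLoop are assumed interfaces standing in for GLoop's real components.

struct GLoopSchedulerClient {
    void acquireToken() { /* block until the scheduler grants the GPU */ }
    void releaseToken() { /* return the GPU to the scheduler */ }
};
struct HostLoop {
    void launchDeviceEventLoop() { /* submit the GPU kernel */ }
    void waitForKernelExit()     { /* returns when it finishes or suspends */ }
};

// One scheduling round: hold the token only while the kernel runs.
void runOnce(GLoopSchedulerClient& sched, HostLoop& loop) {
    sched.acquireToken();          // 1. acquire token to execute the kernel
    loop.launchDeviceEventLoop();  // 2. submit the GPU kernel
    loop.waitForKernelExit();      // 3. device event loops dispatch callbacks
    sched.releaseToken();          // 4. yield the GPU to the next app
}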
Device Event Loop Suspension
• The scheduler interacts with applications and requests suspension to switch GPU contexts
• The GLoop Scheduler monitors the GPU utilization of each GPU app and makes scheduling decisions
[Figure: the GLoop Scheduler requests suspension; the device event loop saves the loop's state and exits, and another app then acquires the token and submits its GPU kernel]
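A sketch of the suspension check a device event loop could perform between callbacks; the host-visible suspendFlag and saveLoopState are assumptions used to illustrate the save-and-exit step.

struct DeviceLoop;
__device__ void saveLoopState(DeviceLoop* loop);  // persist continuations

// Returns true if the kernel should exit so the GPU context can switch;
// the host relaunches the kernel later to resume the saved loop.
__device__ bool shouldSuspend(DeviceLoop* loop, volatile int* suspendFlag) {
    if (*suspendFlag != 0) {    // set on behalf of the GLoop Scheduler
        saveLoopState(loop);    // save pending callbacks to device memory
        return true;
    }
    return false;
}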
GPU Apps Scheduler
• GLoop schedules applications by periodically suspending them
• GLoop uses weighted fair queuing for proportional-share scheduling
[Figure: the GLoop Scheduler keeps a queue per app (App1 Queue, App2 Queue) and time-slices the App1 and App2 contexts on the GPU]
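A minimal host-side sketch of weighted fair queuing under this design: each app accumulates virtual time inversely proportional to its weight, and the app with the smallest virtual time runs next. The struct and function names are illustrative, not GLoop's implementation.

#include <limits>

struct AppState {
    double vtime  = 0.0;  // accumulated virtual time
    double weight = 1.0;  // share assigned to this app
};

// Pick the app with the smallest virtual time.
int pickNext(const AppState* apps, int n) {
    int best = -1;
    double bestVtime = std::numeric_limits<double>::max();
    for (int i = 0; i < n; ++i) {
        if (apps[i].vtime < bestVtime) { bestVtime = apps[i].vtime; best = i; }
    }
    return best;
}

// Charge an app for its measured run time; a larger weight makes its
// virtual time grow more slowly, so it is scheduled more often.
void charge(AppState& app, double runtimeMs) {
    app.vtime += runtimeMs / app.weight;
}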
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Evaluation Setup
• Implementation
– Linux 4.4.0-34, NVIDIA GPU driver 361.42
– CUDA 7.5
• Environment
– Xeon E5-2620, 8 GB RAM
– NVIDIA Tesla K40c (Kepler) GPU, 12 GB GDDR5
Case Studies
• To demonstrate the applicability of GLoop, we ported 8 applications
• We selected benchmarks that have long-running GPU kernels from Parboil2 [Stratton et al. '12], Rodinia [Che et al. '09], GPUfs, and GPUnet
• The GLoop runtime successfully inserts enough scheduling points
Name                          Ported from   Category
TPACF                         Parboil2      Compute-intensive
LavaMD                        Rodinia       Compute-intensive
MUMmerGPU                     Rodinia       Compute-intensive
Hybridsort                    Rodinia       Compute-intensive
Grep                          GPUfs         I/O-intensive
Approximate Image Matching    GPUfs         I/O-intensive
Echo Server                   GPUnet        I/O-intensive
Matrix Multiplication Server  GPUnet        I/O-intensive
Standalone Overhead - Compute
• GLoop's overhead is -8% to 2% in compute-intensive apps except for hybridsort
• Only hybridsort shows significant performance degradation (16%)
– Currently, the GLoop toolchain has a limitation: the GLoop runtime cannot use shared memory for optimization in hybridsort's kernels
• While GLoop shows moderate performance, kernel-split shows significant degradation in some benchmarks
– Kernel-split needs to exit GPU kernels whenever it encounters a scheduling point
[Chart: execution time (sec) broken down into CUDA Init/Fin, Data Init/Fin, IO, Copy, and Kernel. hybridsort: vanilla 7.912, kernel-split 18.021, gloop 9.151; tpacf (w/ optimization): vanilla 3.362, kernel-split 3.272, gloop 3.378]
Performance at Scale
• The execution time of TPACF increases linearly
• The execution time of 8 greps is 10.6x longer than the standalone time
– This overhead comes from I/O contention
[Charts: execution time (seconds) vs. number of instances. tpacf: 1 instance 3.268, 2 instances 6.652, 4 instances 12.228, 8 instances 24.520; grep: 1 instance 26.331, 2 instances 54.266, 4 instances 123.963, 8 instances 279.149 (10.6x over standalone)]
Performance Isolation
• The GLoop scheduler successfully assigns weighted utilizations to a specific app (66% for TPACF)
[Charts: GPU utilization (%) over time (sec) for three mixes, TPACF + 1 throttle, TPACF + 3 throttles, and TPACF + 7 throttles, showing TPACF receiving its weighted share alongside throttle1 through throttle7]
GPU Server Consolidation
• Gradually launch under-utilized (20%) GPU servers and measure each server's GPU utilization
• GLoop successfully consolidates three GPU servers on one shared GPU
[Chart: GPU utilization (%) of server1, server2, and server3 over 70 seconds; the dip at each launch reflects CUDA context initialization, which takes 200 – 400 ms]
GPU Idle-time Exploitation
• While running an under-utilized (20%) GPU server, periodically launch TPACF instances
• TPACF exploits idle-time of a shared GPU
[Chart: GPU utilization (%) over 20 seconds of the server and tpacf1 through tpacf3; GLoop assigns the idle time to the compute-intensive TPACF app]
Outline
• Motivation
• Previous Work
• Proposal: GLoop
• Experiments
• Conclusion
Conclusion
• GLoop’s event-driven programming model and architecture achieves – Efficient consolidation
• While generating enough scheduling points,suspend-and-resume kernels only when GLoop Scheduler required
– Resource isolation• Use GPU Context to isolate each app’s GPU state
– Use proprietary GPU runtimes without modification
• The evaluations show– GLoop performance is comparable to an existing GPU eaters
– Successfully schedule GPU eaters with flexible scheduling policy
– Gloop consolidate multiple GPU eaters on a shared GPU