
Demand-Based Coordinated Scheduling for SMP VMs

Hwanju Kim1, Sangwook Kim2, Jinkyu Jeong1, Joonwon Lee2, and Seungryoul Maeng1

Korea Advanced Institute of Science and Technology (KAIST)1

Sungkyunkwan University2

The 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Houston, Texas, March 16-20, 2013

1

Software Trends in Multi-core Era

• Making the best use of HW parallelism

• Increasing “thread-level parallelism”

[Diagram: HW/SW stack — processor (HW) with OS and apps (SW) on top]

"Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications", Proceedings of the IEEE, 2008

Processors are adding more cores, and apps are increasingly multithreaded; RMS apps are "emerging killer apps".

2/28

Software Trends in Multi-core Era

• Synchronization (communication)

• The greatest obstacle to the performance of multithreaded workloads

[Diagram: multithreaded app whose threads synchronize at barriers; CPU time is lost to lock wait and spinlock spin-wait]

3/28

Software Trends in Multi-core Era

• Virtualization

• Ubiquitous for consolidating multiple workloads
• "Even OSes are workloads to be handled by the VMM"

[Diagram: VMM hosting multiple SMP VMs on a multi-core processor; each virtual CPU (vCPU) is a software entity dictated by the VMM scheduler]

"Synchronization-conscious coordination" is essential for the VMM to improve efficiency.

4/28

Coordinated Scheduling

[Diagram: VMM scheduler timesharing vCPUs over four pCPUs]

Uncoordinated scheduling: each vCPU is treated as an independent entity. A lock-holder vCPU may run while its lock-waiter siblings sit waiting, so inter-vCPU synchronization becomes ineffective.

Coordinated scheduling: sibling vCPUs (those belonging to the same VM) are treated as a coordinated group.

5/28

Prior Efforts for Coordination

Coscheduling [Ousterhout82]: synchronizing execution
→ Illusion of a dedicated multi-core, but CPU fragmentation

Relaxed coscheduling [VMware10]: balancing execution time (stop execution for siblings to catch up)
→ Good CPU utilization & coordination, but not based on synchronization demands

Balance scheduling [Sukwong11]: balancing pCPU allocation
→ Good CPU utilization & coordination, but not based on synchronization demands

Selective coscheduling [Weng09,11], …: coscheduling selected vCPUs
→ Better coordination through explicit information, but relying on user or OS support

⇒ Need for VMM scheduling based on synchronization (coordination) demands

6/28

Overview

• Demand-based coordinated scheduling

• Identifying synchronization demands

• With non-intrusive design

• Not compromising inter-VM fairness

[Diagram: coscheduling when synchronization demands it, and delayed preemption when a preemption attempt arrives during synchronization]

7/28

Coordination Space

• Time and space domains

• Independent scheduling decision for each domain

Space: where to schedule? → pCPU assignment policy
Time: when to schedule? → preemptive scheduling policy (coscheduling, delayed preemption)

[Diagram: sibling vCPUs as a coordinated group mapped onto pCPUs]

8/28

Outline

• Motivation

• Coordination in time domain

• Kernel-level coordination demands

• User-level coordination demands

• Coordination in space domain

• Load-conscious balance scheduling

• Evaluation


9/28

Synchronization to be Coordinated

• Synchronization based on “busy-waiting”

• Unnecessary CPU consumption by busy-waiting for a descheduled vCPU
• Significant performance degradation

• Semantic gap: "OSes make liberal use of busy-waiting (e.g., spinlocks) since they believe their vCPUs are dedicated" → a serious problem in the kernel

• When and where is synchronization demanded?
• How can coordination demands be identified?

10/28

Kernel-Level Coordination Demands

• Does kernel really need coordination?

• Experimental analysis
  • Multithreaded applications in the PARSEC suite
  • Measuring "kernel time" when uncoordinated
  • An 8-vCPU VM on 8 pCPUs: solorun (no consolidation) vs. corun (w/ 1 VM running streamcluster)

[Figure: CPU time (%) split into kernel time and user time for the PARSEC apps (blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264), solorun vs. corun]

The kernel time ratio is amplified by 1.3x-30x when consolidated: "newly introduced kernel-level contention".

11/28

Kernel-Level Coordination Demands

• Where is the kernel time amplified?

[Figure: kernel time breakdown by function (CPU usage for kernel time, %) per PARSEC app: TLB shootdown, lock spinning, and others]

Dominant sources:
1) TLB shootdown
2) Lock spinning

How can they be identified?

12/28

How to Identify TLB Shootdown?

• TLB shootdown

• Notification of TLB invalidation to a remote CPU

[Diagram: a thread modifies or unmaps a mapping (V->P1 becomes V->P2 or V->Null) in a shared virtual address space, then sends an inter-processor interrupt (IPI) to remote CPUs whose TLBs cache the stale entry and busy-waits until all corresponding TLB entries are invalidated]

"Busy-waiting for TLB synchronization" is efficient in native systems, but not in virtualized systems if the target vCPUs are not scheduled. (It is even worse if TLBs are synchronized in a broadcast manner.)
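
To make the cost concrete, here is a minimal C sketch of the busy-wait pattern behind an x86-style TLB shootdown; the names (send_ipi, pending, NCPUS) are hypothetical simplifications, not kernel APIs. The initiator marks each target CPU, sends the IPI, and spins until every target acknowledges; if a target vCPU is descheduled by the VMM, the initiator spins for the target's entire absence.

    /* Hypothetical simplification of an x86-style TLB shootdown. */
    #include <stdatomic.h>

    #define NCPUS 8
    static atomic_int pending[NCPUS];      /* 1 = CPU still owes an ack */

    static void send_ipi(int cpu) { /* trapped and emulated by the VMM */ }

    void flush_tlb_others(unsigned long cpu_mask)  /* runs on the initiator */
    {
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (cpu_mask & (1UL << cpu)) {
                atomic_store(&pending[cpu], 1);
                send_ipi(cpu);             /* TLB shootdown IPI */
            }
        for (int cpu = 0; cpu < NCPUS; cpu++)
            while (atomic_load(&pending[cpu]))
                ;  /* busy-wait: wasted cycles if a recipient vCPU is descheduled */
    }

    void tlb_ipi_handler(int my_cpu)       /* runs on each recipient */
    {
        /* ...invalidate the stale local TLB entries here... */
        atomic_store(&pending[my_cpu], 0); /* acknowledge */
    }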

13/28

How to Identify TLB Shootdown?

• TLB shootdown IPI

• Virtualized by VMM

• Used in x86-based Windows and Linux

[Figure: per-app kernel time breakdown (TLB shootdown, lock spinning, others) and TLB shootdown IPI traffic (# of IPIs/vCPU/sec, axis 0-2000) for the PARSEC apps]

"A TLB shootdown IPI is a signal for coordination demand!" Co-schedule IPI-recipient vCPUs with their sender vCPU.
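
A hedged sketch of how a VMM scheduler could act on that signal; the hook and helpers (on_tlb_shootdown_ipi, make_urgent, the sibling list) are assumptions about the design, not KVM's actual interface. When the VMM traps a shootdown IPI, it promotes every preempted recipient sibling so sender and recipients run together.

    /* Hypothetical VMM hook, invoked when a trapped IPI is identified
     * as a TLB shootdown (design sketch, not KVM's real interface). */
    struct vcpu { int id; int running; struct vcpu *sibling_next; };

    void make_urgent(struct vcpu *v);   /* enqueue on an urgent queue */

    void on_tlb_shootdown_ipi(struct vcpu *sender, unsigned long dest_mask)
    {
        /* sibling_next: NULL-terminated list of the sender's sibling vCPUs */
        for (struct vcpu *v = sender->sibling_next; v; v = v->sibling_next) {
            if (!(dest_mask & (1UL << v->id)))
                continue;               /* not an IPI recipient */
            if (!v->running)
                make_urgent(v);         /* coschedule: run it now so it can
                                           invalidate its TLB and acknowledge */
        }
    }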

14/28

How to Identify Lock Spinning?

• Why excessive lock spinning?

• "Lock-holder preemption (LHP)": a short critical section can be unpredictably prolonged by vCPU preemption
• Which spinlocks are problematic?

[Figure: spinlock wait time (%) breakdown: futex wait-queue lock, semaphore wait-queue lock, pagetable lock, runqueue lock, and other locks; wait-queue locks dominate (82%, 93%)]

15/28

How to Identify Lock Spinning?

• Futex

• Linux kernel support for user-level synchronization (e.g., mutexes, barriers, condition variables)

vCPU1 (waker), in kernel space:
    /* critical section */
    mutex_unlock(mutex)
    futex_wake(mutex) {
        spin_lock(queue->lock)      /* wait-queue lock */
        thread = dequeue(queue)
        wake_up(thread)             /* sends a reschedule IPI to vCPU2 */
        spin_unlock(queue->lock)
    }

vCPU2 (waiter):
    mutex_lock(mutex)               /* user-level contention */
    futex_wait(mutex) {
        spin_lock(queue->lock)
        enqueue(queue, me)
        spin_unlock(queue->lock)
        schedule()                  /* blocked */
    }
    /* wake-up */
    /* critical section */
    mutex_unlock(mutex)
    futex_wake(mutex) {
        spin_lock(queue->lock)      /* kernel-level contention */
        ...

If vCPU1 is preempted before releasing the wait-queue spinlock, vCPU2 starts busy-waiting on the preempted spinlock: LHP!

16/28


How to Identify Lock Spinning?

• Why is the wait-queue lock preemption-prone?

[Diagram: timeline of a remote thread wake-up on two vCPUs. One IPI transmission triggers multiple VMM interventions (a VMExit for IPI emulation, plus VMExits/VMEntries for APIC register accesses), so the wait-queue critical section is prolonged by VMM intervention and repeated by iterative wake-ups, while a sibling spins on the wait-queue lock]

• No more short critical section → higher likelihood of preemption
• Preemption by a woken-up sibling is a serious issue

17/28

How to Identify Lock Spinning?

• Generalization: "wait-queue locks"
  • Not limited to futex wake-up
  • Many wake-up functions in the Linux kernel
    • General wake-up: __wake_up*()
    • Semaphore or mutex unlock: rwsem_wake(), __mutex_unlock_common_slowpath(), …
  • "Multithreaded workloads usually communicate and synchronize on wait-queues"

"A reschedule IPI is a signal for coordination demand!" Delay preemption of an IPI-sender vCPU until its likely-held spinlock is released.
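
A minimal sketch of that delayed preemption (Resched-DP) under the same hypothetical structures; now_ns and the vcpu fields are illustrative, with the 500us constant taken from the evaluation. A trapped reschedule IPI grants its sender a short protection window, and the scheduler declines preemption inside it.

    /* Hypothetical sketch of delayed preemption for an IPI-sender vCPU. */
    #define UTSLICE_NS 500000ULL             /* 500us urgent time slice */

    struct vcpu { unsigned long long protected_until; };

    unsigned long long now_ns(void);         /* assumed clock source */

    void on_reschedule_ipi_sent(struct vcpu *sender)
    {
        /* The sender likely still holds a wait-queue spinlock:
         * shield it from preemption for one utslice.           */
        sender->protected_until = now_ns() + UTSLICE_NS;
    }

    int may_preempt(struct vcpu *curr)       /* consulted before preemption */
    {
        /* Fairness is preserved because the window is short and bounded. */
        return now_ns() >= curr->protected_until;
    }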

18/28

Outline

• Motivation

• Coordination in time domain

• Kernel-level coordination demands

• User-level coordination demands

• Coordination in space domain

• Load-conscious balance scheduling

• Evaluation


19/28

vCPU-to-pCPU Assignment

• Balance scheduling [Sukwong11]

• Spreading sibling vCPUs on different pCPUs
• Increases the likelihood of coscheduling
• No coordination in the time domain

[Diagram: uncoordinated scheduling allows vCPU stacking (sibling vCPUs queued on the same pCPU); balance scheduling avoids stacking, giving a higher likelihood of coscheduling]

20/28

vCPU-to-pCPU Assignment

• Balance scheduling [Sukwong11]

• Limitation: assumes "global CPU loads are well balanced"
• In practice, VMs with fair CPU shares can have different numbers of vCPUs and different TLP (thread-level parallelism)

[Diagram: an SMP VM with inactive vCPUs and a UP VM with x4 shares running a single-threaded workload, alongside SMP VMs running multithreaded workloads]

[Figure: CPU usage (%) over time for canneal and dedup: TLP can change within a multithreaded app]

[Diagram: balance scheduling on imbalanced loads places siblings on overloaded pCPUs, causing high scheduling latency]

21/28

Proposed Scheme

• Load-conscious balance scheduling

• Adaptive scheme based on pCPU loads

• When assigning a vCPU, check pCPU loads

[Diagram: if load is balanced → balance scheduling; if load is imbalanced → favor underloaded pCPUs (a pCPU is overloaded when its CPU load > avg. CPU load); any resulting vCPU stacking is handled by coordination in the time domain]
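
The selection rule this implies, as a minimal sketch; the per-pCPU fields and tie-breaking order are assumptions, not the paper's code. Overloaded pCPUs (load above the average) are dropped from the candidate set; among the rest, a pCPU not running a sibling is preferred (the balance part), with load as the tie-breaker.

    /* Hypothetical sketch of load-conscious balance scheduling. */
    struct pcpu { int load; int runs_sibling; };

    int pick_pcpu(const struct pcpu *p, int n)
    {
        int avg = 0, best = -1;
        for (int i = 0; i < n; i++)
            avg += p[i].load;
        avg /= n;

        for (int i = 0; i < n; i++) {
            if (p[i].load > avg)
                continue;                        /* overloaded: not a candidate */
            if (best < 0 ||
                (p[best].runs_sibling && !p[i].runs_sibling) || /* prefer no sibling */
                (p[best].runs_sibling == p[i].runs_sibling &&
                 p[i].load < p[best].load))                     /* then lowest load */
                best = i;
        }
        /* The minimum-load pCPU is never above average, so best >= 0;
         * the fallback is defensive only. */
        return best >= 0 ? best : 0;
    }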

22/28

Outline

• Motivation

• Coordination in time domain

• Kernel-level coordination demands

• User-level coordination demands

• Coordination in space domain

• Load-conscious balance scheduling

• Evaluation

23/28

Evaluation

• Implementation
  • Based on Linux KVM and CFS

• Evaluation
  • Effective time slice for coscheduling & delayed preemption: 500us, decided by sensitivity analysis
  • Performance improvement

• Alternative
  • OS re-engineering

24/28

Evaluation

• SMP VM with UP VMs

• One 8-vCPU VM + four 1-vCPU VMs (x264)

[Figure: normalized execution time of the 8-vCPU VM's workloads under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co]

LC-Balance: load-conscious balance scheduling
Resched-DP: delayed preemption for reschedule IPI
TLB-Co: coscheduling for TLB shootdown IPI

• Futex-intensive workloads: 5-53% improvement
• TLB-intensive workloads: 20-90% improvement
• The remaining workloads are non-synchronization-intensive; balance scheduling alone shows high scheduling latency in the figure

25/28


Alternative: OS Re-engineering

• Virtualization-friendly re-engineering

• Decoupling reschedule IPI transmission from thread wake-up

wake_up(queue) {
    spin_lock(queue->lock)
    thread = dequeue(queue)
    wake_up(thread)             /* reschedule IPI is normally sent here */
    spin_unlock(queue->lock)
}                               /* re-engineered: IPI delayed until after unlock */

• Modified wake_up functions, using a per-cpu bitmap
• Applied to futex_wakeup & futex_requeue
• Workload: one 8-vCPU VM + four 1-vCPU VMs (x264)

A delayed reschedule IPI is virtualization-friendly for resolving LHP problems.
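
A hedged sketch of that re-engineering; the bitmap layout and flush point are assumptions about the design described above, not the actual patch. The wake-up path records the target CPU in a pending bitmap instead of sending the IPI while the wait-queue lock is held, then flushes the IPIs right after the unlock.

    /* Hypothetical sketch: decoupling reschedule IPI transmission
     * from thread wake-up via a pending bitmap (one per CPU in the
     * real design; a single instance here for brevity).           */
    #define NCPUS 8

    struct thread { int cpu; };
    struct queue  { int lock; };

    void spin_lock(int *l);
    void spin_unlock(int *l);
    struct thread *dequeue(struct queue *q);
    void mark_runnable(struct thread *t);  /* wake up without the IPI */
    void send_reschedule_ipi(int cpu);

    static unsigned long pending_ipis;     /* pending-IPI bitmap */

    void wake_up_delayed(struct queue *q)
    {
        spin_lock(&q->lock);
        struct thread *t = dequeue(q);
        mark_runnable(t);
        pending_ipis |= 1UL << t->cpu;     /* record instead of sending */
        spin_unlock(&q->lock);

        /* Critical section is over; an LHP here no longer stalls waiters. */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (pending_ipis & (1UL << cpu))
                send_reschedule_ipi(cpu);
        pending_ipis = 0;
    }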

26/28

[Figure: normalized execution time of facesim and streamcluster under Baseline, Baseline w/ DelayedResched, LC_Balance, LC_Balance w/ DelayedResched, and LC_Balance w/ Resched-DP]

Conclusions & Future Work

• Demand-based coordinated scheduling

• IPI as an effective signal for coordination

• pCPU assignment conscious of dynamic CPU loads

• Limitation

• Cannot cover ALL types of synchronization demands, e.g., kernel spinlock contention w/o VMM intervention

• Future work

• Cooperation with HW (e.g., PLE) & paravirtualization

27/28


Thank You!

• Questions and comments

• Contacts

[email protected]

• http://calab.kaist.ac.kr/~hjukim

28/28

EXTRA SLIDES

29

User-Level Coordination Demands

• Coscheduling-friendly workloads

• SPMD, bulk-synchronous, etc.

• Busy-waiting synchronization: "spin-then-block"

[Diagram: four threads crossing barriers. Coscheduling gives balanced execution; uncoordinated scheduling gives (largely) skewed execution, with more wake-ups, additional departure barriers, and more blocking operations (spin, then block)]
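
For reference, a generic spin-then-block wait in C; the spin threshold and the futex call shape illustrate the pattern, not streamcluster's exact code. The waiter spins in user space hoping its peers arrive soon, then falls back to blocking in the kernel.

    /* Generic spin-then-block wait on a flag (Linux futex fallback). */
    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 10000               /* illustrative spin budget */

    void wait_for_flag(atomic_int *flag)
    {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (atomic_load(flag))         /* spin phase: cheap when peers */
                return;                    /* are running concurrently     */

        while (!atomic_load(flag))         /* block phase: yields the CPU  */
            syscall(SYS_futex, flag, FUTEX_WAIT, 0, NULL, NULL, NULL);
    }

Under coscheduling the spin phase usually succeeds; uncoordinated scheduling pushes waits into the block phase, which in a VM costs HLT and reschedule-IPI VMExits (next slide).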

30/28

User-Level Coordination Demands

• Coscheduling

• Avoiding the more expensive blocking in a VM
  • VMExits for CPU yielding and wake-up: halt (HLT) and reschedule IPI
• When to coschedule?
  • User-level synchronization involves reschedule IPIs

[Figure: reschedule IPI traffic of streamcluster spikes at each round of barriers]

"A reschedule IPI is a signal for coordination demand!" Co-schedule IPI-recipient vCPUs with the sender vCPU; a knob selectively enables this coscheduling for coscheduling-friendly VMs.

31/28

Urgent vCPU First (UVF) Scheduling

• Urgent vCPU

• 1. Preemptively scheduled if fairness is kept

• 2. Protected from preemption once scheduled, during the "urgent time slice (utslice)"

[Diagram: each pCPU has an urgent queue served in FIFO order ahead of the runqueue, which is served in proportional-shares order. An urgent vCPU is coscheduled if inter-VM fairness is kept, and once scheduled it is protected from preemption during its utslice]
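
A sketch of a pick-next path honoring that policy; the queue representation and fairness test are assumptions layered on a CFS-like scheduler, not the actual implementation. Urgent vCPUs are served FIFO while inter-VM fairness holds; otherwise the proportional-shares runqueue is used.

    /* Hypothetical pick-next sketch for urgent vCPU first scheduling. */
    struct vcpu { struct vcpu *next; };

    struct pcpu_rq {
        struct vcpu *urgent_head;    /* urgent queue, FIFO order */
        struct vcpu *fair_next;      /* runqueue, proportional-shares order */
    };

    int fairness_kept(struct vcpu *v);  /* assumed inter-VM fairness test */

    struct vcpu *pick_next(struct pcpu_rq *rq)
    {
        struct vcpu *u = rq->urgent_head;
        if (u && fairness_kept(u)) {
            rq->urgent_head = u->next;  /* urgent vCPUs go first, FIFO, */
            return u;                   /* and run protected for one utslice */
        }
        return rq->fair_next;           /* otherwise, normal fair pick */
    }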

32/28

Proposed Scheme

• Load-conscious balance scheduling

• Adaptive scheme based on pCPU loads

[Diagram: balanced loads → balance scheduling; imbalanced loads → favoring underloaded pCPUs]

• Example
  • Candidate pCPU set = {pCPU0, pCPU1, pCPU2, pCPU3}; the scheduler assigns the lowest-loaded pCPU in this set
  • pCPU3 is overloaded (i.e., its CPU load > avg. CPU load); any stacking there is handled by coordination in the time domain (UVF scheduling)

33/28

Evaluation

• Urgent time slice (utslice)

• 1. Utslice for reducing LHP

• 2. Utslice for quickly serving multiple urgent vCPUs

[Figure: # of futex wait-queue LHPs vs. utslice (0-1000us) for bodytrack, facesim, and streamcluster]

Workloads: a futex-intensive workload in one VM + dedup in another VM as a preempting VM

• A utslice >300us yields a 2x-3.8x LHP reduction
• Remaining LHPs occur during local wake-up or before reschedule IPI transmission → unlikely to lead to lock contention

34/28

Evaluation

• Urgent time slice (utslice)

• 1. utslice for reducing LHP

• 2. utslice for quickly serving multiple urgent vCPUs

[Figure: spinlock cycles (%), TLB cycles (%), and average execution time (sec) vs. utslice (100-5000us); ~11% execution-time degradation at the largest utslice]

Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)

• As utslice increases, TLB shootdown cycles increase
• 500us is an appropriate utslice for both LHP reduction and quickly serving multiple urgent vCPUs

35/28

Evaluation

• Urgent allowance

• Improving overall efficiency with fairness

[Figure: spinlock cycles (%), TLB cycles (%), and slowdown vs. urgent allowance (No UVF, 0-24 msec) for vips and facesim x 2]

Workloads: a vips (TLB-IPI-intensive) VM + two facesim VMs

• Efficient TLB synchronization
• No performance drop

36/28

Evaluation

• Impact of kernel-level coordination

• One 8-vCPU VM + four 1-vCPU VMs (x264)

[Figure: normalized execution time of the 1-vCPU VM (x264) co-running with each 8-vCPU VM workload, under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co]

LC-Balance: load-conscious balance scheduling
Resched-DP: delayed preemption for reschedule IPI
TLB-Co: coscheduling for TLB shootdown IPI

• Balance scheduling causes unfair contention: up to 26% degradation

37/28

Evaluation: Two SMP VMs

[Figure: execution time of two SMP VMs, solorun vs. corun w/ dedup and w/ freqmine; a: baseline, b: balance, c: LC-balance, d: LC-balance+Resched-DP, e: LC-balance+Resched-DP+TLB-Co]

38/28

Evaluation

• Effectiveness on HW-assisted feature

• CPU feature to reduce the amount of busy-waiting
  • VMExit in response to excessive busy-waiting
  • Intel Pause-Loop-Exiting (PLE), AMD Pause Filter
  • Inevitable cost: some busy-waiting plus a VMExit

[Diagram: on LHP, the spinning vCPU executes PAUSE in a loop until a threshold is reached, then a VMExit leads to yielding]

[Figure: TLB cycles (%), spinlock cycles (%), and normalized execution time under Baseline, LC_Balance, and LC_Balance w/ UVF, for streamcluster (futex-intensive) and ferret (TLB-IPI-intensive)]

Apps                                   streamcluster   facesim   ferret   vips
Reduction in pause-loop VMExits (%)    44.5            97.7      74.0     37.9

39/28

Evaluation

• Coscheduling-friendly user-level workload

• Streamcluster: a spin-then-block, barrier-intensive workload

[Figure: barrier breakdown (# of barrier synchronizations: arrival/departure x spin/block) for UVF w/o Resched-Co vs. UVF w/ Resched-Co; annotations: blocking 38%, reschedule IPIs (3 VMExits each) 21%, additional (departure) barriers 29%]

[Figure: normalized execution time (corunning w/ bodytrack) with 0.1ms (default), 10x, and 20x spin-wait: more performance improvement as spin-waiting time increases]

Resched-Co: coscheduling for reschedule IPI

40/28