Demand-Based Coordinated Scheduling for SMP VMs
Hwanju Kim1, Sangwook Kim2, Jinkyu Jeong1, Joonwon Lee2, and Seungryoul Maeng1
Korea Advanced Institute of Science and Technology (KAIST)1
Sungkyunkwan University2
The 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
Houston, Texas, March 16-20, 2013
Software Trends in Multi-core Era
• Making the best use of HW parallelism
  • Increasing "thread-level parallelism"
• Processors are increasingly adding more cores
• Apps are increasingly being multithreaded; RMS apps are "emerging killer apps"
[Figure: HW/SW stack — multi-core processor, OS, apps]
"Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications", Proceedings of the IEEE, 2008
Software Trends in Multi-core Era
• Synchronization (communication)
  • The greatest obstacle to the performance of multithreaded workloads
[Figure: threads synchronizing across CPUs via barriers, lock waits, and spinlock spin-waits]
Software Trends in Multi-core Era
• Virtualization
  • Ubiquitous for consolidating multiple workloads
  • "Even OSes are workloads to be handled by the VMM"
[Figure: multiple SMP VMs, each with its own OS and apps, running on a VMM]
• Virtual CPU (vCPU): a software entity dictated by the VMM scheduler
→ "Synchronization-conscious coordination" is essential for the VMM to improve efficiency
Coordinated Scheduling
• Uncoordinated scheduling: each vCPU is treated as an independent entity, timeshared on its pCPU
• Coordinated scheduling: sibling vCPUs (those belonging to the same VM) are treated as a group
[Figure: vCPUs timeshared on four pCPUs under each scheme; a running lock-holder vCPU is descheduled while lock-waiter vCPUs keep waiting]
→ Uncoordinated scheduling makes inter-vCPU synchronization ineffective
Prior Efforts for Coordination
• Coscheduling [Ousterhout82]: synchronizing execution
  • Illusion of a dedicated multi-core, but CPU fragmentation
• Relaxed coscheduling [VMware10]: balancing execution time — stop execution for siblings to catch up
  • Good CPU utilization & coordination, but not based on synchronization demands
• Balance scheduling [Sukwong11]: balancing pCPU allocation
  • Good CPU utilization & coordination, but not based on synchronization demands
• Selective coscheduling [Weng09,11]…: coscheduling selected vCPUs
  • Better coordination through explicit information, but relying on user or OS support
[Figure: vCPU execution timelines on four pCPUs for each scheme]
→ Need for VMM scheduling based on synchronization (coordination) demands
Overview
• Demand-based coordinated scheduling
• Identifying synchronization demands
• With non-intrusive design
• Not compromising inter-VM fairness
[Figure: timeline on four pCPUs showing a demand for coscheduling and a demand for delayed preemption (on a preemption attempt) for synchronization]
Coordination Space
• Time and space domains
• Independent scheduling decision for each domain
• Space — where to schedule? → pCPU assignment policy
• Time — when to schedule? → preemptive scheduling policy: coscheduling, delayed preemption
[Figure: sibling vCPUs as a coordinated group mapped onto four pCPUs]
Outline
• Motivation
• Coordination in time domain
• Kernel-level coordination demands
• User-level coordination demands
• Coordination in space domain
• Load-conscious balance scheduling
• Evaluation
Synchronization to be Coordinated
• Synchronization based on "busy-waiting"
  • Unnecessary CPU consumption by busy-waiting for a descheduled vCPU
  • Significant performance degradation
• Semantic gap
  • "OSes make liberal use of busy-waiting (e.g., spinlock) since they believe their vCPUs are dedicated"
  • A serious problem in the kernel
• When and where is synchronization demanded?
• How to identify coordination demands?
Kernel-Level Coordination Demands
• Does the kernel really need coordination?
• Experimental analysis
  • Multithreaded applications in the PARSEC suite
  • Measuring "kernel time" when uncoordinated
  • Setup: an 8-vCPU VM on 8 pCPUs — solorun (no consolidation) vs. corun (w/ 1 VM running streamcluster)
[Figure: CPU time (%) split into kernel vs. user time for each PARSEC benchmark, solorun vs. corun]
→ Kernel time ratio is amplified by 1.3x-30x: "newly introduced kernel-level contention"
Kernel-Level Coordination Demands
• Where is the kernel time amplified?
[Figure: kernel-time breakdown by function — TLB shootdown, lock spinning, others — for each PARSEC benchmark]
• Dominant sources: 1) TLB shootdown, 2) lock spinning
• How to identify them?
How to Identify TLB Shootdown?
• TLB shootdown
  • Notification of TLB invalidation to a remote CPU via an inter-processor interrupt (IPI)
  • The initiator busy-waits until all corresponding TLB entries are invalidated
[Figure: a thread modifies or unmaps a virtual mapping (V->P1 becomes V->P2 or V->Null) and sends an IPI so remote CPUs invalidate their stale TLB entries]
→ "Busy-waiting for TLB synchronization" is efficient in native systems, but not in virtualized systems if target vCPUs are not scheduled. (Even worse if TLBs are synchronized in a broadcast manner.)
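The initiator's busy-wait described above can be sketched in userspace C; this is a toy model, not the kernel path — pthreads stand in for remote CPUs, and the page-table update and IPI transmission appear only as comments:

```c
#include <pthread.h>
#include <stdatomic.h>

#define NCPUS 4

static atomic_int pending;   /* remote CPUs that still hold stale TLB entries */

/* Remote CPU side: the IPI handler invalidates its local TLB entry and
 * acknowledges the initiator. */
static void *shootdown_ipi_handler(void *arg) {
    (void)arg;
    /* local_flush_tlb(vaddr) would run here */
    atomic_fetch_sub(&pending, 1);           /* acknowledge to the initiator */
    return NULL;
}

/* Initiator side: after changing the mapping, send IPIs and busy-wait for
 * every acknowledgement. */
void tlb_shootdown(void) {
    pthread_t cpu[NCPUS - 1];
    atomic_store(&pending, NCPUS - 1);
    /* set_pte(va, new_pte); send_shootdown_ipi_to_others(); (assumed names) */
    for (int i = 0; i < NCPUS - 1; i++)
        pthread_create(&cpu[i], NULL, shootdown_ipi_handler, NULL);
    while (atomic_load(&pending) != 0)
        ;   /* busy-wait: cycles are wasted whenever a target vCPU is descheduled */
    for (int i = 0; i < NCPUS - 1; i++)
        pthread_join(cpu[i], NULL);
}
```

The spin loop is exactly where a virtualized guest burns CPU if the VMM has descheduled one of the recipient vCPUs.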
How to Identify TLB Shootdown?
• TLB shootdown IPI
  • Virtualized by the VMM
  • Used in x86-based Windows and Linux
[Figure: per-benchmark kernel-time breakdown and TLB shootdown IPI traffic (# of IPIs/vCPU/sec)]
→ "A TLB shootdown IPI is a signal for coordination demand!" Co-schedule IPI-recipient vCPUs with the sender vCPU
How to Identify Lock Spinning?
• Why excessive lock spinning?
  • "Lock-holder preemption (LHP)": a short critical section can be unpredictably prolonged by vCPU preemption
• Which spinlock is problematic?
[Figure: spinlock wait-time breakdown — futex wait-queue lock and semaphore wait-queue lock dominate (82-93%), ahead of page-table, runqueue, and other locks]
How to Identify Lock Spinning?
• Futex
  • Linux kernel support for user-level synchronization (e.g., mutex, barrier, condition variables, etc.)
  • User-level contention on a mutex leads to kernel-level contention on the futex wait-queue lock:

    vCPU2 (waiter):
      mutex_lock(mutex)            /* user-level contention */
      futex_wait(mutex) {          /* kernel-level contention */
        spin_lock(queue->lock)
        enqueue(queue, me)
        spin_unlock(queue->lock)
        schedule() /* blocked */
      }

    vCPU1 (waker):
      mutex_lock(mutex)
      /* critical section */
      mutex_unlock(mutex)
      futex_wake(mutex) {
        spin_lock(queue->lock)
        thread = dequeue(queue)
        wake_up(thread)            /* sends a reschedule IPI to vCPU2 */
        spin_unlock(queue->lock)
      }

  • After wake-up, vCPU2 runs its critical section, then calls futex_wake(mutex) and spin_lock(queue->lock) in turn
→ If vCPU1 is preempted before releasing its spinlock (LHP!), vCPU2 starts busy-waiting on the preempted spinlock
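The wait-queue manipulation above can be modeled in a few lines of C. This is a toy userspace sketch — the function names are illustrative, not the kernel API — showing why the whole wake-up is one spinlock-protected critical section:

```c
#include <stdatomic.h>

#define QSIZE 8
static atomic_flag queue_lock = ATOMIC_FLAG_INIT;
static int queue[QSIZE];
static int head, tail;

static void spin_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void spin_unlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Waiter side: enqueue under the wait-queue spinlock, then block. */
void futex_wait_enqueue(int tid) {
    spin_lock(&queue_lock);
    queue[tail++ % QSIZE] = tid;   /* enqueue(queue, me) */
    spin_unlock(&queue_lock);
    /* schedule(): the waiter blocks here until futex_wake_one() picks it */
}

/* Waker side: dequeue and wake under the same spinlock. */
int futex_wake_one(void) {
    spin_lock(&queue_lock);
    /* If the vCPU running this is preempted HERE (lock-holder preemption),
     * every vCPU entering futex_wait/futex_wake busy-waits on queue_lock. */
    int tid = (head < tail) ? queue[head++ % QSIZE] : -1;
    spin_unlock(&queue_lock);      /* wake_up(thread) would send the IPI */
    return tid;
}
```

The comment in futex_wake_one() marks the exact window the slides call LHP: preemption between spin_lock and spin_unlock turns a microsecond-scale critical section into a scheduling-latency-scale one.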
How to Identify Lock Spinning?
• Why preemption-prone?
[Figure: timeline of a remote thread wake-up — the wait-queue critical section is prolonged by multiple VMM interventions (VMExit/VMEntry for IPI emulation and APIC register accesses) per IPI transmission, repeated by iterative wake-ups]
• No more short critical section! → higher likelihood of preemption
• Preemption by the woken-up sibling is a serious issue
How to Identify Lock Spinning?
• Generalization: "wait-queue locks"
  • Not limited to futex wake-up
  • Many wake-up functions in the Linux kernel:
    • General wake-up: __wake_up*()
    • Semaphore or mutex unlock: rwsem_wake(), __mutex_unlock_common_slowpath(), …
  • "Multithreaded workloads usually communicate and synchronize on wait-queues"
→ "A Reschedule IPI is a signal for coordination demand!" Delay preemption of an IPI-sender vCPU until the likely-held spinlock is released
Outline
• Motivation
• Coordination in time domain
• Kernel-level coordination demands
• User-level coordination demands
• Coordination in space domain
• Load-conscious balance scheduling
• Evaluation
vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]
  • Spreading sibling vCPUs on different pCPUs → increase in likelihood of coscheduling
  • No coordination in the time domain
[Figure: uncoordinated scheduling stacks sibling vCPUs on the same pCPU; balance scheduling avoids vCPU stacking, raising the likelihood of coscheduling]
vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]
  • Limitation: assumes "global CPU loads are well balanced"
  • In practice, VMs with fair CPU shares can have
    • Different # of vCPUs (e.g., an SMP VM with x4 shares consolidated with UP VMs)
    • Different TLP (TLP: thread-level parallelism) — an SMP VM running a single-threaded workload leaves vCPUs inactive
[Figure: CPU usage over time for canneal and dedup — TLP can change within a multithreaded app]
→ Balance scheduling on imbalanced loads incurs high scheduling latency
Proposed Scheme
• Load-conscious balance scheduling
  • Adaptive scheme based on pCPU loads
  • When assigning a vCPU, check pCPU loads
    • If load is balanced → balance scheduling
    • If load is imbalanced (some pCPU's load > avg. CPU load, i.e., overloaded) → favor underloaded pCPUs; the resulting vCPU stacking is handled by coordination in the time domain
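The assignment policy above can be sketched as a small selection function. The load/sibling arrays and the tie-breaking are assumptions for illustration, not the actual KVM/CFS implementation:

```c
/* Load-conscious balance scheduling, sketched: prefer the lowest-loaded
 * pCPU without a sibling vCPU (balance scheduling); if every such pCPU is
 * overloaded, fall back to any underloaded pCPU and let time-domain
 * coordination handle the resulting sibling stacking.
 * Returns -1 if every pCPU is overloaded. */
static int lc_balance_pick(const int load[], const int has_sibling[], int n) {
    int avg = 0, best = -1;
    for (int i = 0; i < n; i++)
        avg += load[i];
    avg /= n;                    /* pCPUs with load > avg count as overloaded */
    /* Balanced case: spread siblings across non-overloaded pCPUs. */
    for (int i = 0; i < n; i++)
        if (load[i] <= avg && !has_sibling[i] &&
            (best < 0 || load[i] < load[best]))
            best = i;
    /* Imbalanced case: allow sibling stacking on an underloaded pCPU. */
    if (best < 0)
        for (int i = 0; i < n; i++)
            if (load[i] <= avg && (best < 0 || load[i] < load[best]))
                best = i;
    return best;
}
```

With loads {2, 2, 2, 8} the overloaded fourth pCPU drops out of the candidate set, matching the slide's example.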
Outline
• Motivation
• Coordination in time domain
• Kernel-level coordination demands
• User-level coordination demands
• Coordination in space domain
• Load-conscious balance scheduling
• Evaluation
Evaluation
• Implementation
  • Based on Linux KVM and CFS
• Evaluation
  • Effective time slice for coscheduling & delayed preemption: 500us, decided by sensitivity analysis
  • Performance improvement
  • Alternative: OS re-engineering
Evaluation
• SMP VM with UP VMs
  • One 8-vCPU VM + four 1-vCPU VMs (x264)
[Figure: normalized execution time of the 8-vCPU VM's workloads under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co]
  • Futex-intensive workloads: 5-53% improvement
  • TLB-intensive workloads: 20-90% improvement
  • Non-synchronization-intensive workloads
  • Balance scheduling: high scheduling latency
LC-Balance: load-conscious balance scheduling; Resched-DP: delayed preemption for reschedule IPI; TLB-Co: coscheduling for TLB shootdown IPI
Alternative: OS Re-engineering
• Virtualization-friendly re-engineering
  • Decoupling reschedule IPI transmission from thread wake-up

    wake_up(queue) {
      spin_lock(queue->lock)
      thread = dequeue(queue)
      wake_up(thread)          /* reschedule IPI transmission is delayed */
      spin_unlock(queue->lock)
    }                          /* delayed reschedule IPIs are sent here */

  • Modified wake_up functions use a per-CPU bitmap
  • Applied to futex_wake & futex_requeue
  • Workload: one 8-vCPU VM + four 1-vCPU VMs (x264)
→ Delayed reschedule IPI is virtualization-friendly to resolve LHP problems
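A sketch of that decoupling, with the per-CPU bitmap modeled as a single word and send_resched_ipi() as a hypothetical stand-in for the real primitive:

```c
#include <stdatomic.h>

/* Delayed reschedule IPI, sketched: wake-ups performed under queue->lock
 * only mark the target CPU in a bitmap; the IPIs are sent after the
 * spinlock is released, so the (VMM-intervention-heavy) IPI path no longer
 * sits inside the critical section. */
static atomic_uint pending_resched;   /* one word stands in for a per-CPU bitmap */

void mark_resched(int target_cpu) {
    /* called while holding queue->lock: no VMM intervention here */
    atomic_fetch_or(&pending_resched, 1u << target_cpu);
}

unsigned flush_resched(void) {
    /* called after spin_unlock(queue->lock) */
    unsigned mask = atomic_exchange(&pending_resched, 0u);
    /* for each set bit b in mask: send_resched_ipi(b);  (assumed primitive) */
    return mask;
}
```

Because the critical section now contains only a bitmap update, the window for lock-holder preemption shrinks back to a few instructions.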
[Figure: normalized execution time of facesim and streamcluster under Baseline and LC_Balance, each with and without DelayedResched, and LC_Balance w/ Resched-DP]
Conclusions & Future Work
• Demand-based coordinated scheduling
• IPI as an effective signal for coordination
• pCPU assignment conscious of dynamic CPU loads
• Limitation
  • Cannot cover ALL types of synchronization demands
  • e.g., kernel spinlock contention w/o VMM intervention
• Future work
  • Cooperation with HW (e.g., PLE) & paravirt
Thank You!
• Questions and comments
• Contacts
• hjukim@calab.kaist.ac.kr
• http://calab.kaist.ac.kr/~hjukim
User-Level Coordination Demands
• Coscheduling-friendly workloads
  • SPMD, bulk-synchronous, etc.
  • Busy-waiting synchronization: "spin-then-block"
[Figure: barrier phases of four threads under coscheduling (balanced execution) vs. uncoordinated scheduling (skewed execution) — skew turns spin waits into blocking wake-ups and adds an extra barrier]
→ More blocking operations when uncoordinated
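A minimal sketch of the spin-then-block primitive such barriers use, assuming a bounded spin budget and modeling the blocking path with sched_yield() instead of a real futex_wait:

```c
#include <sched.h>
#include <stdatomic.h>

/* Spin-then-block wait: spin for a bounded budget in the hope that the
 * barrier is released quickly (cheap when execution is balanced), then fall
 * back to blocking, which requires a wake-up (reschedule IPI) later.
 * Returns 1 if the caller had to take the blocking path. */
int wait_spin_then_block(atomic_int *released, int spin_budget) {
    for (int i = 0; i < spin_budget; i++)
        if (atomic_load(released))
            return 0;            /* released while spinning: no wake-up cost */
    while (!atomic_load(released))
        sched_yield();           /* blocked path: needs a reschedule IPI */
    return 1;
}
```

Under coscheduling, siblings reach the barrier together, so the spin phase usually succeeds; under skewed execution the budget expires and the expensive blocking path dominates, which is what the figure's breakdown shows.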
User-Level Coordination Demands
• Coscheduling
  • Avoiding more expensive blocking in a VM
    • VMExits for CPU yielding and wake-up: halt (HLT) and reschedule IPI
  • When to coschedule? User-level synchronization involves reschedule IPIs
[Figure: reschedule IPI traffic of streamcluster spikes at each barrier phase]
→ "A Reschedule IPI is a signal for coordination demand!" Co-schedule IPI-recipient vCPUs with the sender vCPU
• Providing a knob to selectively enable this coscheduling for coscheduling-friendly VMs
Urgent vCPU First (UVF) Scheduling
• Urgent vCPU
  • 1. Preemptively scheduled if inter-VM fairness is kept
  • 2. Protected from preemption once scheduled, during the "urgent time slice (utslice)"
[Figure: per-pCPU urgent queue (FIFO order) served ahead of the runqueue (proportional-shares order); urgent vCPUs are coscheduled and protected from preemption]
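The two rules above can be sketched as a pick-next function; the structures and the fairness flag are assumptions for illustration, not the modified KVM scheduler:

```c
#include <stddef.h>

/* Urgent vCPU First, sketched: urgent vCPUs are served in FIFO order ahead
 * of the proportional-shares runqueue, but an urgent vCPU is only allowed
 * to preempt while its VM stays within its fair share. */
typedef struct {
    int id;
    int fairness_kept;   /* has this vCPU's VM consumed <= its fair share? */
} vcpu;

vcpu *uvf_pick_next(vcpu *urgent[], int n_urgent, vcpu *runq[], int n_runq) {
    for (int i = 0; i < n_urgent; i++)        /* FIFO order */
        if (urgent[i]->fairness_kept)
            return urgent[i];                 /* preemptively scheduled */
    return n_runq > 0 ? runq[0] : NULL;       /* proportional-shares order */
}
```

Once picked, the urgent vCPU would additionally be shielded from preemption for one utslice, which this sketch leaves to the surrounding scheduler.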
Proposed Scheme
• Load-conscious balance scheduling
  • Adaptive scheme based on pCPU loads
    • Balanced loads → balance scheduling
    • Imbalanced loads → favoring underloaded pCPUs
• Example
  • Candidate pCPU set (the scheduler assigns the lowest-loaded pCPU in this set) = {pCPU0, pCPU1, pCPU2, pCPU3}
  • pCPU3 is overloaded (i.e., its CPU load > avg. CPU load)
  • The resulting vCPU stacking is handled by coordination in the time domain (UVF scheduling)
Evaluation
• Urgent time slice (utslice)
  • 1. Utslice for reducing LHP
  • 2. Utslice for quickly serving multiple urgent vCPUs
• Workloads: a futex-intensive workload in one VM + dedup in another VM as a preempting VM
[Figure: # of futex wait-queue LHPs vs. utslice (0-1000us) for bodytrack, facesim, and streamcluster]
→ A >300us utslice gives a 2x-3.8x LHP reduction
• Remaining LHPs occur during local wake-up or before reschedule IPI transmission → unlikely to lead to lock contention
Evaluation
• Urgent time slice (utslice)
  • 1. Utslice for reducing LHP
  • 2. Utslice for quickly serving multiple urgent vCPUs
• Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)
[Figure: spinlock cycles (%), TLB cycles (%), and average execution time vs. utslice (100-5000us)]
→ As utslice increases, TLB shootdown cycles increase (~11% degradation)
→ 500us is an appropriate utslice for both LHP reduction and multiple urgent vCPUs
Evaluation
• Urgent allowance
  • Improving overall efficiency while preserving fairness
• Workloads: a vips (TLB-IPI-intensive) VM + two facesim VMs
[Figure: spinlock and TLB cycles (%) and slowdown vs. urgent allowance (No UVF, 0-24 msec)]
→ Efficient TLB synchronization with no performance drop
Evaluation
• Impact of kernel-level coordination
  • One 8-vCPU VM + four 1-vCPU VMs (x264)
[Figure: normalized execution time of the 1-vCPU VM for each co-running workload under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co]
• Balance scheduling causes unfair contention: up to 26% degradation
LC-Balance: load-conscious balance scheduling; Resched-DP: delayed preemption for reschedule IPI; TLB-Co: coscheduling for TLB shootdown IPI
Evaluation: Two SMP VMs
• w/ dedup and w/ freqmine
[Figure: execution timelines, solorun vs. corun, for a: baseline, b: balance, c: LC-balance, d: LC-balance+Resched-DP, e: LC-balance+Resched-DP+TLB-Co]
Evaluation
• Effectiveness with an HW-assisted feature
  • CPU feature to reduce the amount of busy-waiting: VMExit in response to excessive busy-waiting
    • Intel Pause-Loop Exiting (PLE), AMD Pause Filter
  • Inevitable cost of some busy-waiting and VMExit
[Figure: under LHP, a PAUSE loop reaching the threshold triggers a VMExit and yielding]
[Figure: TLB cycles (%), spinlock cycles (%), and normalized execution time for Baseline, LC_Balance, and LC_Balance w/ UVF — streamcluster (futex-intensive) and ferret (TLB-IPI-intensive)]

  Apps                                    streamcluster  facesim  ferret  vips
  Reduction in pause-loop VMExits (%)     44.5           97.7     74.0    37.9
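The PLE mechanism above can be modeled as a spin loop with a pause counter; the threshold value is arbitrary here (real hardware takes it from VMM configuration):

```c
#include <stdatomic.h>

/* Pause-Loop Exiting, modeled: the CPU counts PAUSEs executed in a tight
 * spin and forces a VMExit once the threshold is crossed, so the VMM can
 * yield the spinning vCPU instead of letting it burn its whole time slice.
 * Returns the number of pauses executed. */
int spin_with_ple(atomic_int *lock_free, int ple_threshold) {
    int pauses = 0;
    while (!atomic_load(lock_free)) {
        /* cpu_relax() / PAUSE would execute here */
        if (++pauses >= ple_threshold)
            return pauses;   /* VMExit: the VMM may deschedule this vCPU */
    }
    return pauses;           /* lock acquired without triggering PLE */
}
```

Even with PLE, the threshold's worth of busy-waiting plus the VMExit are paid on every LHP, which is why the demand-based scheme still cuts pause-loop VMExits substantially in the table above.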
Evaluation
• Coscheduling-friendly user-level workload
  • Streamcluster: a spin-then-block barrier-intensive workload
  • Resched-Co: coscheduling for reschedule IPI
[Figure: barrier breakdown (# of barrier synchronizations: arrival/departure, spin vs. block) for UVF w/o and w/ Resched-Co — blocking: 38%; reschedule IPIs (3 VMExits each): 21%; additional (departure) barriers: 29%]
[Figure: normalized execution time co-running w/ bodytrack as spin-wait time grows from 0.1ms (default) to 10x and 20x]
→ More performance improvement as the time of spin-waiting increases