Scaling and scheduling to maximize application performance within budget constraints

Ming Mao, Marty Humphrey

CS Department, UVa

Scaling and Scheduling to Maximize Application Performance within Budget Constraints in Cloud Workflows

IPDPS 2013 (May 21st 2013)


Dynamic scalability and cost saving are two of the most important factors when considering cloud adoption

Two major benefits - dynamic scalability and cost

A survey from 39 major technology companies [1]

Cloud benefits

On-demand self-service

Broad network access

Resource pooling

Rapid elasticity

Measured services

Cheaper maintenance

……

Why do you move into the cloud?


Dynamic scalability – the ability to acquire/release resources in response to demand dynamically

Dynamic scalability challenge → it relies on users to specify the size of the resource pool

Over-provisioning → costs more than necessary and offsets the cloud's advantages

Under-provisioning → hurts application performance, fails to meet service level agreements, and loses application customers

Cloud dynamic scalability

Figure: over-provisioning vs. under-provisioning


Problem - What resources should be acquired/released in the cloud, and how should the computing activities be mapped to the cloud resources, so that the application performance can be maximized within the budget constraints?

In this paper, we discuss the limited-budget case

The unlimited-budget case was discussed in our SC 11 paper

Solution - This paper argues that an automatic resource provisioning and allocation mechanism, i.e., an auto-scaling solution, is the key to successful cloud adoption. Essentially, an auto-scaling solution needs to answer the following two questions:

Capacity determination (or resource provisioning): what types of resources, how many, and for how long

Job scheduling (or resource allocation): how to map computing activities onto the cloud resources

Problem statement


An application consists of service components. A workflow goes through different service components and therefore consists of multiple connected tasks

Workload is a stream of workflow jobs not known in advance

Task precedence constraints need to be preserved

Jobs have individual priorities

Service oriented architecture (SOA) & workflow jobs


Minimize job turnaround time within budget constraints

Problem formulation

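Problem terminology

Cloud application: $app = \{S_i\}$

Job class: $J = \{DAG(S_i),\ priority_J \mid S_i \in app\}$

Cloud VM: $VM_v = \{[J^{S_i}]_v,\ c_v,\ lag_v\}$

Workload: $W_t = \{\, job_J^{S_i} \,\}$ over all jobs $job_J$ and service components $S_i$

Scaling plan: $Scaling_t = \{VM_v \rightarrow N_v\}$

Scheduling plan: $Schedule_t = \{j_J^{S_i} \rightarrow VM_v\}$

Goal: $\min \left( \dfrac{\sum_{job} turnaround \times priority}{\sum_{job} priority} \right)$ and $Cost(app) \le B$ (budget, dollars/hour)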
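The goal above is a priority-weighted average of job turnaround times, subject to an hourly cost cap. A minimal Python sketch of that objective follows; the `Job` class, its field names, and the sample numbers are illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record: turnaround is completion time minus submission time.
    turnaround_hours: float
    priority: float


def weighted_turnaround(jobs):
    """Priority-weighted average turnaround: sum(t * p) / sum(p)."""
    total_priority = sum(j.priority for j in jobs)
    if total_priority <= 0:
        raise ValueError("total priority must be positive")
    return sum(j.turnaround_hours * j.priority for j in jobs) / total_priority


def within_budget(hourly_cost, budget_per_hour):
    """Constraint Cost(app) <= B, both in dollars per hour."""
    return hourly_cost <= budget_per_hour


# Illustrative numbers only.
jobs = [Job(turnaround_hours=2.0, priority=1.0),
        Job(turnaround_hours=4.0, priority=3.0)]
print(weighted_turnaround(jobs))  # (2*1 + 4*3) / (1 + 3) = 3.5
```

A high-priority job that finishes late raises the objective more than a low-priority one, which is why both algorithms favor high-priority tasks when spending leftover budget.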

Target - The service provider has a limited budget and aims to maximize the application performance.

Solution idea – a monitor-control loop that makes scaling and scheduling decisions based on updated workload and VM information


Scheduling-first

Idea – allocate the application budget to individual jobs based on their priorities and schedule tasks within each job's budget

Step 1 – Distribute budget: $B_j = B \times p_j / \sum_j p_j$

Step 2 – Schedule tasks: for each job, schedule as many tasks as possible on their fastest machines

Step 3 – Consolidate budget: each job returns its unused budget to the application, and the application uses the budget collected from individual jobs to schedule high-priority tasks

Step 4 – Acquire instances: acquire instances and execute tasks based on the determined schedule plans
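Step 1 is a proportional split of the hourly budget by job priority. The sketch below illustrates only that step; the function name and the dictionary layout are assumptions made for the example.

```python
def distribute_budget(budget, priorities):
    """Split the hourly budget across jobs in proportion to priority.

    budget:     application budget B in dollars per hour
    priorities: mapping of job name -> priority p_j (hypothetical layout)
    """
    total = sum(priorities.values())
    if total <= 0:
        raise ValueError("total priority must be positive")
    return {job: budget * p / total for job, p in priorities.items()}


# Two jobs with equal priority share a $1/h budget evenly.
shares = distribute_budget(1.0, {"job1": 1, "job2": 1})
print(shares)  # {'job1': 0.5, 'job2': 0.5}
```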


Solution: scheduling-first


Scheduling-first

Step 1 – Distribute budget: $B_j = B \times p_j / \sum_j p_j$

Step 2 – Schedule tasks

Step 3 – Consolidate budget

Step 4 – Acquire instance

e.g. Budget (B) = $1/h; Large (L) = $0.5/h; Medium (M) = $0.3/h; Small (S) = $0.1/h

Step 1: job1 and job2 have the same priority, so job1 → $0.5/h and job2 → $0.5/h

Step 2: job1(T1) → $0.5 (L); job2(T5) → $0.5 (L)

Step 3: job1(T2+T3) → $0.5 (S+M); job2(T6) → $0.5 (L); job1 returns $0.1 to the system; job2(T7) → $0.1 (S)

Step 4: acquire instances when necessary
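The arithmetic behind the consolidation step of this example can be checked with a short sketch. Prices are in cents per hour to avoid floating-point rounding; the helper name `leftover` is an assumption, while the VM assignments are the ones listed in the example.

```python
# Hourly prices from the example, in cents.
PRICE_CENTS = {"L": 50, "M": 30, "S": 10}


def leftover(job_budget_cents, vm_types):
    """Budget a job hands back after paying for its assigned VM types."""
    cost = sum(PRICE_CENTS[t] for t in vm_types)
    if cost > job_budget_cents:
        raise ValueError("assignment exceeds the job budget")
    return job_budget_cents - cost


budget = 100  # $1/h
# Equal priorities, so each job receives half of the budget.
job_budget = {"job1": budget // 2, "job2": budget // 2}

# job1 runs T2 and T3 on Small + Medium, job2 runs T6 on Large.
returned = (leftover(job_budget["job1"], ["S", "M"])
            + leftover(job_budget["job2"], ["L"]))

# VM types the consolidated pool can still pay for (used for job2's T7).
affordable = [t for t, price in PRICE_CENTS.items() if price <= returned]
print(returned, affordable)  # 10 ['S']
```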


Solution: scaling-first

Scaling-first

Idea – determine the computing capacity by looking at the overall workload and schedule tasks based on priority

Step 1 – Determine the VMs: assume tasks run on their fastest machines, calculate the cost $C_{fast}$ for the next hour, and acquire VMs proportionally based on $Budget/C_{fast}$

Step 2 – Consolidate budget: use the remaining budget to purchase new machines

Step 3 – Schedule tasks: schedule tasks based on task priority
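Steps 1 and 2 of scaling-first can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes the demand is given as a VM count per type, rounds the proportional share down, and spends the leftover on the most expensive affordable type first without exceeding the demand. The function name and the sample numbers are illustrative.

```python
def scale_first(budget_cents, demand, price_cents):
    """Return (VM counts per type, unspent budget in cents).

    demand:      VMs per type needed if every task ran on its fastest machine
    price_cents: hourly price per VM type
    """
    c_fast = sum(demand[v] * price_cents[v] for v in demand)
    if c_fast == 0:
        return {v: 0 for v in demand}, budget_cents

    # Step 1: scale each count by Budget / C_fast, capped at 1, rounded down.
    scale = min(budget_cents, c_fast)
    plan = {v: demand[v] * scale // c_fast for v in demand}
    remaining = budget_cents - sum(plan[v] * price_cents[v] for v in plan)

    # Step 2 (assumed policy): buy extra machines, most expensive type first.
    for v in sorted(demand, key=lambda t: price_cents[t], reverse=True):
        while plan[v] < demand[v] and price_cents[v] <= remaining:
            plan[v] += 1
            remaining -= price_cents[v]
    return plan, remaining


# Illustrative numbers: $2.50/h budget, demand of 4 Large and 2 Medium VMs.
plan, remaining = scale_first(250, {"L": 4, "M": 2}, {"L": 50, "M": 30})
print(plan, remaining)  # {'L': 4, 'M': 1} 20
```

Here the proportional step yields 3 Large and 1 Medium (cost 180 cents), and the remaining 70 cents buys one more Large machine.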


Scaling-first

Step 1 – determine the VMs

Step 2 – consolidate budget

Step 3 – schedule tasks

Step 1: assume tasks run on their fastest machines, calculate $C_{fast}$, and acquire VMs proportionally based on $B/C_{fast}$

Step 2: the remaining $0.5 can be used to purchase 1 L machine

Step 3: tasks are scheduled based on their priorities


Instance consolidation

Schedule tasks on different VM types to save partial instance hour cost

Budget allocation schemes

Evenly distributed – e.g. daily x/365, hourly x/8760

Based on workload – e.g. high on busy times, low on non-busy times

Workload prediction – $/hour → $/job
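The first two allocation schemes can be illustrated with a small sketch. Function names and the sample figures are assumptions; the workload-based variant simply splits a period budget in proportion to an expected load per hour.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def even_hourly_budget(annual_budget):
    """Evenly distributed scheme: the same budget for every hour (x/8760)."""
    return annual_budget / HOURS_PER_YEAR


def workload_weighted_budget(period_budget, expected_load):
    """Workload-based scheme: busier hours receive a larger share."""
    total = sum(expected_load)
    if total <= 0:
        raise ValueError("expected load must be positive")
    return [period_budget * load / total for load in expected_load]


print(even_hourly_budget(8760.0))                   # 1.0
print(workload_weighted_budget(24.0, [1, 1, 2]))    # [6.0, 6.0, 12.0]
```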

Other considerations

Workload patterns

Application models


Evaluation – experiment setup

Time: 72 hours

Task execution: randomly generated

VM lag: 5 min

Baseline: Standard

VM Type        Price
Micro          $0.02/hour
Standard       $0.080/hour
High-CPU       $0.66/hour
High-Memory    $0.45/hour
Extra-Large    $1.3/hour

Evaluation – job turnaround time

above – weighted average job turnaround time for the hybrid application and cycle workload pattern

Scheduling-first and scaling-first can save 9.8%-45.2% of the cost compared to the standard machine choice.

Scaling-first works better under small budget ranges while scheduling-first works better under large budget ranges.

Evaluation – sensitivity to inaccurate parameters

left – scheduling-first’s sensitivity to inaccurate parameters (Hybrid application + Cycle workload pattern)

right – scaling-first’s sensitivity to inaccurate parameters (Hybrid application + Cycle workload pattern)

When the task estimation error is within ±20%, the job turnaround time differs by -10.2% to 16.7%.

When the task estimation error reaches ±60%, the performance of both algorithms degrades significantly (more than ±25% difference).

Evaluation – instance consolidation

left – job turnaround time / resource utilization of scheduling-first’s instance consolidation (Hybrid application + Cycle workload pattern)

right – job turnaround time / resource utilization of scaling-first’s instance consolidation (Hybrid application + Cycle workload pattern)

When the budget is low or high, the improvement is small. When the budget is in between, the improvement is more significant (e.g. the utilization rate improves by 2.2% to 19.9% when the budget is between $15/hour and $25/hour).

Scaling-first benefits more from the instance consolidation process than scheduling-first does.


Conclusions

Choose appropriate VM types based on the workload.

Scheduling-first and scaling-first make different trade-offs between task execution time and waiting time.

As long as the VM performance can be correctly ranked, the proposed mechanisms have good tolerance to inaccurate parameters.

Instance consolidation is an efficient strategy to save partial instance hours and improve resource utilization.

Future work

Other billing models – reserved instances, spot instances, $/min

Maximize application performance within budget constraints for data-intensive applications

Hybrid and federated cloud environments

Develop evaluation benchmarks and simulation platforms

Conclusion and future work


Thanks!
