Automated Control for Elastic Storage

1

Harold C. Lim, Shivnath Babu and Jeffrey S. Chase from Duke University

AUTOMATED CONTROL FOR ELASTIC STORAGE

Presented by: Yonggang Liu

Department of Electrical and Computer Engineering, University of Florida

2

Outline

Introduction

System overview

System architecture and modeling methodologies

Evaluation

Contribution and related work

Discussions and future work

3



4

Introduction - Popularity of highly dynamic workloads

Many web-based services (especially Web 2.0) often experience rapid load surges and drops.

One Facebook application saw an increase from 25,000 to 250,000 users in 3 days, with up to 20,000 new users signing up per hour during peak times.

Elastic services offered by cloud computing are one solution: grow or shrink service capacity dynamically as the load changes.

5

Introduction - Elasticity in cloud computing

Elasticity is one of cloud computing’s greatest features – Systems acquire and release resources in response to users’ dynamic workloads; users only pay for what they need.

[Figure: cloud computing stack with the labels Virtualization, Web Services and SLAs. Picture provided by Dr. Andy Li from UF.]

6

Introduction - Topic of this paper

This paper addresses the challenges of controlling elastic storage for a data-intensive service in a cloud computing environment.

Intuitively, the controller does the following:

If performance cannot meet the Service Level Objective (SLO) → grow storage capacity.

If performance meets the SLO and system utilization is low → shrink storage capacity.

7

Introduction - Topic of this paper

In this paper, the Hadoop Distributed File System (HDFS) is employed as the storage system.

When the controller increases the storage size:

Create new storage instances.

Move stored data to the new instances (data rebalancing).

When the controller reduces the storage size:

Remove a certain number of storage instances.

Some data on the remaining nodes gets re-replicated, because its replica count has dropped below the replication degree N. HDFS does this automatically.

8



9

System overview - What is the big picture

Suppose we are hosting the Facebook server on Amazon EC2 instances, with the proposed control techniques.

[Diagram: Clients use the Elastic Service, which runs on the Cloud Provider (Amazon EC2) and consists of a Web Tier (Apache server), an Application Tier (Facebook core) and a Storage Tier (Hadoop DFS). The Controller gathers measurements through a Sensor and manages instances through an Actuator.]

When the sensor detects higher system load: create more storage instances, and rebalance data.

When the sensor detects lower system load: remove some storage nodes.

10

System overview - Challenges in elastic storage control

Controlling elastic storage involves many challenges:

Data Rebalancing. Newly added storage nodes are not effective until data rebalancing is done.

Interference to Guest Service. Data rebalancing also consumes system resources.

Actuator Delay. The controller must account for the delay of control operations; otherwise it may respond too late or become unstable.

11



12

System architecture

The controller is composed of:

Horizontal Scale Controller (HSC) - responsible for growing and shrinking the number of storage nodes.

Data Rebalance Controller (DRC) - controls the data transfers that rebalance the storage tier after it grows or shrinks.

State machine - coordinates the actions of the HSC and the DRC.

13

System architecture - Horizontal Scale Controller (HSC)

Actuator: The HSC uses cloud APIs to change the number of active server instances.

Sensor: The paper uses CPU utilization on the storage nodes as the sensor feedback metric. It is easy to measure, and it is strongly correlated with the overall response time of the CloudStone benchmark when the bottleneck is in the storage tier.

14

Modeling methodology - System model without controller

The system without a controller can be described by a block diagram with the following signals:

U(z): the input to the system, the number of storage instances.

D(z): the effect of client workload variance, expressed as a change in the number of storage instances.

V(z): the effective number of storage instances.

Y(z): the output of the system, the CPU utilization on the storage nodes.

G(z): the transfer function of the storage system.

[Block diagram: U(z) and D(z) are summed to give V(z), which passes through G(z) to produce Y(z).]
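Written as equations, the open-loop diagram corresponds to the relations below. This is only a reading of the block diagram, assuming the two inputs are simply added before they reach the storage system:

```latex
V(z) = U(z) + D(z), \qquad Y(z) = G(z)\,V(z) = G(z)\,\bigl(U(z) + D(z)\bigr)
```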

15

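Modeling methodology - Controller - Integral control

Control policy K(z): integral control,

$u_{k+1} = u_k + K_i \, (y_{ref} - y_k)$

$u_k$ - the actuator value (number of storage instances) at step k.

$K_i$ - the integral gain parameter.

$y_k$ - the current sensor measurement.

$y_{ref}$ - the desired reference sensor measurement, which is 20% CPU utilization for a 3-second average response time.

[Block diagram: the error E(z) = R(z) - Y(z) drives the controller K(z); its output U(z) is summed with D(z) to give V(z), which passes through G(z) to produce Y(z); Y(z) is fed back to the comparison with R(z).]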
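A minimal sketch of one integral-control step, assuming the standard discrete form in which the next actuator value is the current one plus the gain times the error (reference minus measurement). The function name `integral_step`, the gain value, and the negative sign of the gain (chosen because adding storage nodes lowers CPU utilization) are illustrative assumptions, not values from the paper.

```python
def integral_step(u_k: float, y_k: float, y_ref: float = 20.0, k_i: float = -0.1) -> float:
    """Return the next actuator value: u_k + k_i * (y_ref - y_k).

    u_k   : current number of storage instances (kept continuous here)
    y_k   : measured CPU utilization in percent
    y_ref : reference CPU utilization in percent (20% on the slide)
    k_i   : integral gain (hypothetical value; negative so that
            utilization above the reference increases the node count)
    """
    return u_k + k_i * (y_ref - y_k)


# Utilization starts above the reference and settles on it.
u = 4.0
for y in (40.0, 30.0, 20.0):
    u = integral_step(u, y)
    print(round(u, 2))
```

The controller keeps adding capacity while the measurement is above the reference and stops changing once the error is zero, which is the defining property of integral control.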

16

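Modeling methodology - Controller - discrete control functions

Because the actuator is discrete (a whole number of instances), the paper uses the following discrete control function:

$$u_{k+1} = \begin{cases} u_k + K_i \, (y_h - y_k) & \text{if } y_k > y_h \\ u_k + K_i \, (y_l - y_k) & \text{if } y_k < y_l \\ u_k & \text{otherwise} \end{cases}$$

$y_h$ and $y_l$ are the higher and lower thresholds for the CPU utilization $y_k$.

Only when $y_k > y_h$ (under-provisioned) or $y_k < y_l$ (over-provisioned) is $u_{k+1} \neq u_k$, i.e., the controller adds or removes storage instances.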
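The threshold behaviour described above can be sketched as follows. The names `discrete_step`, `y_low` and `y_high`, the gain value, and the rounding policy (always move by at least one whole node, never drop below one node) are assumptions made for illustration; the slide only states that the node count changes when utilization leaves the band between the two thresholds.

```python
import math


def discrete_step(n_nodes: int, cpu: float, y_low: float, y_high: float,
                  k_i: float = -0.1) -> int:
    """Return the next node count; it changes only outside [y_low, y_high]."""
    if cpu > y_high:
        # Under-provisioned: the integral term is positive, round it up.
        delta = k_i * (y_high - cpu)
        return n_nodes + max(1, math.ceil(delta))
    if cpu < y_low:
        # Over-provisioned: the integral term is negative, round it down.
        delta = k_i * (y_low - cpu)
        return max(1, n_nodes + min(-1, math.floor(delta)))
    # Inside the band: leave the cluster alone.
    return n_nodes


print(discrete_step(4, 45.0, 10.0, 20.0))  # grows
print(discrete_step(4, 15.0, 10.0, 20.0))  # unchanged
print(discrete_step(4, 5.0, 10.0, 20.0))   # shrinks
```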

17

Modeling methodology - Proportional thresholding

How should the thresholds $y_h$ and $y_l$ be set?

They cannot be static, because for a cluster of size N, adding or removing a node changes the total capacity by 1/N.

The "proportional thresholding" mechanism: set $y_h = y_{ref}$, and vary $y_l$ to vary the range.

Suppose "workload" is the per-node workload and we have N instances. [The two equations that followed on this slide were lost in extraction.]
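One way to see why the lower threshold has to depend on N is sketched below. It assumes, purely for illustration, that CPU utilization is proportional to per-node workload. Removing one of N nodes then multiplies per-node utilization by N/(N-1), so shrinking is safe only if the utilization after removal stays at or below the higher threshold, which gives a lower threshold of y_high * (N-1)/N. The function name and this linear assumption are not taken from the slide.

```python
def low_threshold(y_high: float, n_nodes: int) -> float:
    """Lower CPU threshold that keeps utilization <= y_high after removing one node.

    Assumes utilization scales linearly with per-node workload.
    """
    if n_nodes < 2:
        return 0.0  # a single node is never removed
    return y_high * (n_nodes - 1) / n_nodes


# The band narrows as the cluster grows, because one node is a smaller
# fraction of the total capacity.
for n in (2, 4, 10):
    print(n, low_threshold(20.0, n))
```

With a fixed lower threshold, a small cluster would oscillate (removing a node pushes utilization straight back over the higher threshold), while a large cluster would hold on to idle nodes longer than necessary.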

18

System architecture - Data Rebalance Controller (DRC)

The DRC rebalances the layout of data in the system after the number of storage nodes grows or shrinks.

Rebalancing is a cause of actuator delay and interference.

Tuning knob of the HDFS rebalancer: the bandwidth b allocated to the rebalancer.

Select b to control the tradeoff between lag and interference:

Large b - fast rebalancing, but a serious impact on normal service.

Small b - slow rebalancing, but little disruption to normal service.

19

Modeling Methodology - Modeling the impacts of b

The paper employs multivariate regression to decide b:

The time to completion of rebalancing (Time) is modeled as a function of the bandwidth throttle (b) and the size of the data to be moved (s): $Time = f_t(b, s)$.

The impact of rebalancing on service response time (Impact) is modeled as a function of the bandwidth throttle (b) and the per-node workload (l): $Impact = f_i(b, l)$.

The values of s and l are measured by sensors in the DRC.
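As a simplified stand-in for the regression step, the sketch below fits a one-coefficient model for the rebalancing time, Time ≈ c * s / b, by least squares through the origin. The model form, the function names, and the sample measurements are all made up for illustration; the paper's multivariate regression and its fitted functions are not reproduced here.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope c for the model y ≈ c * x."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical measurements: (bandwidth b, data size s, observed time).
samples = [(1.0, 2.0, 2100.0), (2.0, 2.0, 1000.0), (4.0, 4.0, 1050.0)]

# Under the assumed model the single regressor is s / b.
xs = [s / b for b, s, _ in samples]
ys = [t for _, _, t in samples]
c = fit_through_origin(xs, ys)


def predict_time(b: float, s: float) -> float:
    """Predicted rebalancing time for throttle b and data size s."""
    return c * s / b


print(predict_time(2.0, 2.0) > predict_time(4.0, 2.0))  # more bandwidth, less time
```

A model for Impact as a function of b and l could be fitted the same way from measurements of response-time degradation.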

20

Modeling Methodology - Balancing between lag and interference

The Data Rebalance Controller poses the choice of b as a cost-based optimization problem over Time and Impact. [The cost function on this slide was lost in extraction.]

The ratio of the weights on Time and Impact can be specified by the guest, based on its relative preference for Time over Impact.
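A minimal sketch of such a cost-based choice, assuming the cost is a weighted sum of Time and Impact and that b is picked from a small set of candidate values. The weighted-sum form, the names `choose_bandwidth`, `w_time` and `w_impact`, and the two toy model functions are assumptions for illustration only.

```python
def choose_bandwidth(candidates, time_fn, impact_fn, w_time=1.0, w_impact=1.0):
    """Return the candidate b that minimizes w_time*Time(b) + w_impact*Impact(b)."""
    return min(candidates, key=lambda b: w_time * time_fn(b) + w_impact * impact_fn(b))


def toy_time(b):
    """Hypothetical rebalancing time: decreasing in b."""
    return 100.0 / b


def toy_impact(b):
    """Hypothetical response-time impact: increasing in b."""
    return 4.0 * b


candidates = [1, 2, 5, 10, 20]
print(choose_bandwidth(candidates, toy_time, toy_impact))
# A guest that cares little about Impact gets a larger throttle.
print(choose_bandwidth(candidates, toy_time, toy_impact, w_impact=0.1))
```

Only the ratio of the two weights matters for the chosen b, which matches the idea that the guest expresses a relative preference between fast rebalancing and low interference.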

21

System architecture - State machine

Recall that:

The Horizontal Scale Controller (HSC) is used to grow or shrink the number of storage nodes.

The Data Rebalance Controller (DRC) is used to rebalance the storage after the number of storage nodes changes.

They have mutual dependencies:

After the HSC adds a new storage node, the system cannot obtain full service until the DRC completes rebalancing.

When one component is taking actions, it introduces noise into the sensor measurements of the other one.

To preserve stability during adjustments, a state machine is employed to coordinate the HSC and the DRC and manage their mutual dependencies.

22

System architecture - State machine

The following diagram shows the internal state machine of the elasticity controller in the storage tier.

[State diagram of the Elasticity Controller acting on the Storage Tier. States: Init, Horizontal Scale, Rebalance. In the Horizontal Scale state: "Storage tier configuration changed?" No → stay in Horizontal Scale; Yes → go to Rebalance. In the Rebalance state: "Rebalancing done?" No → stay in Rebalance; Yes → return to Horizontal Scale.]
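The coordination between the two controllers can be sketched as a two-state machine. The transitions follow the labels that survive from the diagram ("Storage tier configuration changed?" and "Rebalancing done?"); the class and function names are hypothetical, and the Init state is left out because its transitions are not recoverable from the slide.

```python
from enum import Enum


class State(Enum):
    HORIZONTAL_SCALE = "horizontal_scale"  # HSC may add or remove nodes
    REBALANCE = "rebalance"                # DRC is moving data; HSC waits


def next_state(state: State, config_changed: bool = False,
               rebalance_done: bool = False) -> State:
    """Advance the controller state by one control interval."""
    if state is State.HORIZONTAL_SCALE:
        # Rebalance only after the HSC has changed the storage tier.
        return State.REBALANCE if config_changed else State.HORIZONTAL_SCALE
    # In REBALANCE: stay until the rebalancer reports completion.
    return State.HORIZONTAL_SCALE if rebalance_done else State.REBALANCE


s = State.HORIZONTAL_SCALE
s = next_state(s, config_changed=True)    # HSC added a node
s = next_state(s, rebalance_done=False)   # still rebalancing
s = next_state(s, rebalance_done=True)    # back to scaling decisions
print(s.value)
```

Because only one controller is active in each state, the HSC never reacts to CPU measurements that are distorted by an ongoing rebalance.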

23



24

Evaluation - Experimental Testbed

The paper runs CloudStone with GlassFish as the front-end application server tier.

CloudStone: a flexible Web 2.0 benchmark generator.

GlassFish: an open source application server project.

HDFS is used for storage. It is modified to expose the rebalancer's bandwidth throttle b as an actuator to the external controller.

The paper uses a local ORCA cluster as the cloud infrastructure provider.

ORCA: a resource control framework that provides a resource leasing service; guests can lease resources from a substrate resource provider, such as a cloud provider.

25

Evaluation - Experimental Testbed

The experimental service cluster: a group of servers running on a local network.

To fully explore the effects of the storage tier, the other tiers are statically over-provisioned.

The storage tier nodes are dynamically allocated virtual machine instances. They all have fixed resource configurations: 30 GB disk space; 512 MB RAM; single disk arm; 2.8 GHz CPU.

HDFS is preloaded with at least 36 GB of data.

26

Evaluation - Controller Effectiveness

Static and dynamic resource provisioning in response to a 10-fold load burst.

[Figure panels: a1. CPU utilization - static; b1. Response time - static; a2. CPU utilization - dynamic; b2. Response time - dynamic.]

Target response time: 3 seconds. Target CPU utilization: 20%.

The figures show:

1. Dynamic provisioning is able to adapt to the load burst.

2. Instance creation and data rebalancing add cost and delay before the new capacity takes effect.

27

Evaluation - Controller Effectiveness

Static and dynamic resource provisioning in response to a small load increase of 35%.

The figures show:

1. Dynamic provisioning is responsive enough to adapt to the small load increase.

2. The cost and delay of node creation and rebalancing are smaller than in the previous experiment.

28

Evaluation - Resource Efficiency

Static and dynamic resource provisioning in response to a load decrease of 30%.

The figures show:

1. Shrinking the storage size has much lower cost and delay than increasing it.

2. During the resizing process, there are almost no SLO violations.

29

Evaluation - Comparison of Rebalance Policies

Recall that:

Time is a monotone decreasing function of b.

Impact is a monotone increasing function of b.

And we want to choose b to optimize the cost function that combines them.

30



31

Contribution and related work

This paper is the first to address the problem of automated control for elastic storage in cloud computing.

SCADS is a related work dealing with dynamically scaling a storage system. It uses machine learning to predict resource requirements.

Padala et al. proposed a decoupled architecture (between guest and cloud provider) for cloud computing. They did not consider the actuator constraints.

Aqueduct uses a feedback controller to throttle the rebalancing bandwidth usage to ensure the SLOs will not be violated. As a result, the rebalancing may get to use very little bandwidth.

32



33

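Discussions and future work

The proposed modeling method is not able to correctly handle workloads with transient noise, which are common in reality.

Adding a filter module H(z) addresses the problem:

[Block diagram: the closed control loop with R(z), E(z), K(z), U(z), D(z), V(z), G(z) and Y(z), extended with a filter block H(z) whose signal is labeled W(z).]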
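The slide does not say which filter is meant. As one common choice, the sketch below smooths the CPU measurements with an exponential moving average before they reach the controller; the function name and the smoothing factor `alpha` are assumptions.

```python
def ema_filter(samples, alpha=0.3):
    """Exponentially smooth a sequence of sensor readings.

    alpha close to 1 tracks the raw signal; alpha close to 0 smooths heavily.
    """
    filtered = []
    w = None
    for y in samples:
        # The first sample initializes the filter state.
        w = y if w is None else alpha * y + (1 - alpha) * w
        filtered.append(w)
    return filtered


# A one-sample spike to 80% is damped instead of reaching the controller raw.
print(ema_filter([20, 20, 80, 20], alpha=0.5))
```

Stronger smoothing suppresses more transient noise but also delays the controller's reaction to a genuine load change, so the filter adds to the actuator lag discussed earlier.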

34

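Discussions and future work

The proposed model allocates resources tightly: a small change in system load often triggers adding or removing storage instances, which is very disruptive.

Recall the proposed control function, where $y_k$ is the measured CPU utilization and $y_l$, $y_h$ are the lower and higher thresholds:

$$u_{k+1} = \begin{cases} u_k + K_i \, (y_h - y_k) & \text{if } y_k > y_h \\ u_k + K_i \, (y_l - y_k) & \text{if } y_k < y_l \\ u_k & \text{otherwise} \end{cases}$$

By setting $y_l$ lower or $y_h$ higher (without exceeding its upper limit), we prevent the system from resizing too frequently.

The drawback of this approach: the system will be under-provisioned to some extent.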

35

Discussions and future work

Make the resource configuration of newly created storage instances tunable.

Resize storage by adding or removing storage instances with flexible resource configurations.

Optimize the system by exploring the capacity and efficiency of individual storage instances, rather than only the number of storage instances.

This requires investigating the performance of storage nodes under different setups: disk size, CPU frequency, RAM size, etc.

36

THANK YOU!
