Towards Energy Efficient Hadoop
Wednesday, June 10, 2009 · Santa Clara Marriott
Yanpei Chen, Laura Keys, Randy Katz – RAD Lab, UC Berkeley


Page 1: Towards Energy Efficient Hadoop

Towards Energy Efficient Hadoop

Wednesday, June 10, 2009 Santa Clara Marriott

Yanpei Chen, Laura Keys, Randy Katz

RAD Lab, UC Berkeley

Page 2: Towards Energy Efficient Hadoop

Why Energy?

Cooling. Costs. Environment.

Page 3: Towards Energy Efficient Hadoop

Why Energy Efficient Software

Power Usage Effectiveness (PUE)

PUE = (total power used by a datacenter) / (IT power used by a datacenter)

    = (IT power + PDU + UPS + HVAC + lighting + other overhead) / IT power

IT power covers servers, network, and storage.

PUE ≥ 2 circa 2006 and before; approaching 1 in present-day facilities.

Most of the further savings are to be had in IT hardware and software.
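As a worked example of the ratio above (with made-up numbers, not figures from the talk), a facility drawing 1.5 MW at the wall while delivering 1.0 MW to IT equipment has a PUE of 1.5:

```python
def pue(total_facility_power_w: float, it_power_w: float) -> float:
    """PUE = total power used by a datacenter / IT power used by a datacenter."""
    return total_facility_power_w / it_power_w

# Hypothetical facility: 1.5 MW total draw, 1.0 MW reaching servers/network/storage.
print(pue(1_500_000, 1_000_000))  # 1.5
```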

Page 4: Towards Energy Efficient Hadoop

Energy as a Performance Metric

[Figure: productivity vs. resources used – the traditional view of the software system design space]

Increase productivity for the fixed resources of a system

Page 5: Towards Energy Efficient Hadoop

Energy as a Performance Metric

[Figure: productivity vs. resources used, with energy as an added dimension – maybe a better view of the design space?]

Decrease energy without compromising productivity?

Page 6: Towards Energy Efficient Hadoop

Methodology

Performance metrics: basket of metrics – job duration, energy, power (i.e. the time rate of energy use). Performance variance?

Parameters: static – cluster size, workload size, configuration parameters. Dynamic – task scheduling? Block placement? Speculative execution?

Workload: exercise all components – sort, HDFS read, HDFS write, shuffle. Representative of production workloads – nutch, gridmix, others?

Energy measurement: wall-plug energy measurement – 1 W accuracy, 1 reading per second. Fine-grained measurement to correlate energy consumption to hardware components?

Page 7: Towards Energy Efficient Hadoop

Scaling to More Workers – Sort

JouleSort (a highly customized system) vs. out-of-the-box Hadoop with default config.

11k sorted records per joule vs. 87 sorted records per joule

Terasort format: 100-byte records with 10-byte keys, 10 GB of total data

Out-of-the-box Hadoop 0.18.2 with default config.

[Figures: Sort – total power (W), total energy (J), and job duration (s), workers + master, vs. number of workers (0–14)]

Reduce energy by adding more workers????

Page 8: Towards Energy Efficient Hadoop

Scaling to More Workers – Sort

Terasort format: 100-byte records with 10-byte keys, 10 GB of total data

Out-of-the-box Hadoop with default config.; workers' energy only

Energy of the master is amortized by the additional workers

[Figures: Sort – total power (W), total energy (J), and job duration (s) vs. number of workers (0–14)]

Page 9: Towards Energy Efficient Hadoop

Scaling to More Workers – Nutch

Nutch web crawler and indexer, with Hadoop 0.19.1.

Index URLs anchored at www.berkeley.edu, depth 7, 2000 links per page

Workload has some built-in bottlenecks?

[Figures: Nutch – total power (W), total energy (J), and duration (s) vs. number of workers (0, 4, 8)]

Page 10: Towards Energy Efficient Hadoop

Isolating IO Stages

HDFS read, shuffle, HDFS write jobs, modified from prepackaged sort example

Read, shuffle, write 10GB of data, terasort format, does nothing else

HDFS write seems to be the scaling bottleneck

[Figures: duration fraction and energy fraction for HDFS read, shuffle, and HDFS write vs. number of workers (1, 4, 8, 12)]

Page 11: Towards Energy Efficient Hadoop

HDFS Replication

HDFS read, shuffle, HDFS write, sort jobs, 10GB data, terasort format

Modify the number of HDFS replicas; default config. for everything else

Some workloads are affected – HDFS write; some are not – shuffle

[Figures: HDFSWrite and Shuffle – total power (W), total energy (J), and job duration (s), workers + master, vs. HDFS replication (0–5)]

Page 12: Towards Energy Efficient Hadoop

HDFS Replication

[Figures: duration fraction and energy fraction for HDFS read, shuffle, and HDFS write vs. number of workers (1, 4, 8, 12), at replication 3 (default) and replication 2]

Reducing HDFS replication to 2 makes HDFS write less of a bottleneck?

Page 13: Towards Energy Efficient Hadoop

Changing Input Size

Sort, modified from prepackaged sort example

Jobs that handle less than ~1GB of data per node bottlenecked by overhead

Out of box Hadoop competitive with JouleSort winner at 100MB?!?

Here’s a somewhat noteworthy result:

[Figure: records sorted per joule vs. number of workers (0–12), for 10GB, 5GB, 1GB, 500MB, and 100MB inputs]

Page 14: Towards Energy Efficient Hadoop

HDFS Block Size

HDFS read, shuffle, HDFS write, sort jobs, 10GB data, terasort format

Modify the HDFS block size, default config. for everything else

[Figures: HDFSRead and Shuffle – total power (W), total energy (J), and job duration (s) vs. HDFS block size (16, 64, 256, 1024 MB)]

Some workloads are affected – HDFS read; some are not – shuffle

Page 15: Towards Energy Efficient Hadoop

Slow Nodes

One node on the cluster consistently received fewer blocks

Removing the slow node leads to performance improvement

Clever ways to use the slow node instead of taking it offline?

[Figures: normalized number of HDFS blocks placed per node, with and without the lagger node r32]

experiment     duration (s)  95% CI  total energy (J)  95% CI     records/J  avg power (W)  95% CI
with r32       387.65        73.80   827039.68         157169.67  129.83     2134.27        1.80
without r32    301.15        81.15   648813.04         174540.47  165.49     1940.78        5.62

Page 16: Towards Energy Efficient Hadoop

Predicting IO Energy

[Figures: Sort – job duration (s), total energy (J), and total power (W) vs. number of workers (0–14), measured vs. predicted]

Working example: Predict IO energy for a particular task

Benchmark energy in joules per byte for HDFS read, shuffle, HDFS write

IO energy = bytes read × joules per byte (HDFS read) +

bytes shuffled × joules per byte (shuffle) +

bytes written × joules per byte (HDFS write)

The simple model is effective, but requires prior measurements
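The model above can be sketched as follows; the joules-per-byte constants are placeholder benchmark values for illustration, not measurements from the talk:

```python
# Benchmarked energy cost per byte for each IO stage (hypothetical values).
J_PER_BYTE = {"hdfs_read": 2e-7, "shuffle": 3e-7, "hdfs_write": 5e-7}

def predict_io_energy(bytes_read: float, bytes_shuffled: float,
                      bytes_written: float, j_per_byte=J_PER_BYTE) -> float:
    """IO energy = sum over stages of (bytes moved x benchmarked joules per byte)."""
    return (bytes_read * j_per_byte["hdfs_read"]
            + bytes_shuffled * j_per_byte["shuffle"]
            + bytes_written * j_per_byte["hdfs_write"])

# A 10 GB sort that reads, shuffles, and writes the full dataset:
print(predict_io_energy(10e9, 10e9, 10e9))  # about 1e4 J under these constants
```

The per-byte coefficients stand in for the prior benchmark measurements the slide says the model requires.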

Page 17: Towards Energy Efficient Hadoop

Cluster Provision and Configuration

Working example: Find optimal cluster size for a steady job stream

N = number of workers that we could assign

D(N) = job duration, as a function of N

Pa(N) = power when active, as a function of N

Pi = power when idle

T = average job arrival interval.

E(N) = expected energy consumed per job, as a function of N

E(N) = D(N) Pa(N) + [T – D(N)] Pi

Optimize for E(N) over the range N such that D(N) ≤ T

In general, multi-dimensional optimization problem to meet job constraints
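A brute-force version of this one-dimensional optimization, with illustrative placeholder models for D(N) and Pa(N) (not fitted to the talk's measurements):

```python
def optimal_cluster_size(d, p_active, p_idle, t, n_max):
    """Pick N in 1..n_max minimizing expected energy per job,
    E(N) = D(N)*Pa(N) + (T - D(N))*Pi, subject to D(N) <= T."""
    feasible = [n for n in range(1, n_max + 1) if d(n) <= t]
    energy = lambda n: d(n) * p_active(n) + (t - d(n)) * p_idle
    return min(feasible, key=energy)

# Hypothetical models: near-linear speedup plus per-worker coordination
# overhead, and active power that grows with cluster size.
d = lambda n: 3600.0 / n + 25.0 * n   # job duration D(N), seconds
pa = lambda n: 100.0 + 80.0 * n       # active power Pa(N), watts
best = optimal_cluster_size(d, pa, p_idle=60.0, t=1800.0, n_max=14)
```

With these toy curves the overhead term makes very large clusters waste energy, so the minimum lands at a small N rather than at n_max.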

Page 18: Towards Energy Efficient Hadoop

Optimal HDFS Replication

Working example: Reduce HDFS replication from 3 to 2, i.e. off-rack replica only?

Benefit = probability(no failure) × [energy(3 replicas) – energy(2 replicas)]

Cost = probability(failure and local recovery) ×

[energy(off-rack recovery) – energy(rack-local recovery)]

Cost-benefit trade-off between lower energy and higher recovery costs

Need to quantify probability of failure/recovery to set sensible replication
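One way to sketch the trade-off above; every probability and energy figure below is a hypothetical placeholder:

```python
def net_savings_2_replicas(p_no_failure, p_failure_local,
                           e_3rep, e_2rep,
                           e_offrack_recovery, e_local_recovery):
    """Expected energy saved per job by running 2 replicas instead of 3.
    benefit = P(no failure) * [energy(3 replicas) - energy(2 replicas)]
    cost    = P(failure that could recover locally) *
              [energy(off-rack recovery) - energy(rack-local recovery)]
    Positive result -> the lower replication pays off on average."""
    benefit = p_no_failure * (e_3rep - e_2rep)
    cost = p_failure_local * (e_offrack_recovery - e_local_recovery)
    return benefit - cost

# Hypothetical numbers: writes cost 0.2 MJ less at replication 2, but a
# failure that could have recovered rack-locally now recovers off-rack.
net = net_savings_2_replicas(p_no_failure=0.99, p_failure_local=0.01,
                             e_3rep=1.2e6, e_2rep=1.0e6,
                             e_offrack_recovery=5e5, e_local_recovery=1e5)
```

Under these made-up inputs the benefit dominates; with a higher failure probability the sign flips, which is exactly why the failure/recovery probabilities need to be quantified.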

Page 19: Towards Energy Efficient Hadoop

Faster = More Energy Efficient?

r = fraction of resources used in a system, ranging from 0 to 1

R(r) = work rate of a system, ranging from 0 to RMAX

P(r) = power of the system, ranging from 0 to PMAX

r1, r2 = the lower and upper bounds of the resources operating region

W = workload size

E(r) = energy consumed when operating at resource fraction r

[Figures: work rate R(r) = r × RMAX and power P(r) = r × PMAX, for r in [0, 1]]

E(r) = P(r) × W / R(r) = (r × PMAX × W) / (r × RMAX) = (PMAX / RMAX) × W

Constant energy for fixed workload size, so run as fast as we can

Page 20: Towards Energy Efficient Hadoop

Faster = More Energy Efficient?

r = fraction of resources used in a system, ranging from 0 to 1

R(r) = work rate of a system, ranging from 0 to RMAX

P(r) = power of the system, ranging from 0 to PMAX

r1, r2 = the lower and upper bounds of the resources operating region

W = workload size

E(r) = energy consumed when operating at resource fraction r

[Figures: work rate R(r) = r × RMAX and power P(r) = PIDLE + r × (PMAX – PIDLE), for r in [0, 1]]

E(r) = P(r) × W / R(r) = [PIDLE + r × (PMAX – PIDLE)] × W / (r × RMAX)

     = (W / RMAX) × PIDLE / r + (W / RMAX) × (PMAX – PIDLE)

Reduce energy by using more resources, so run as fast as we can, again
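A quick numeric check of the two models, using made-up constants for PMAX, PIDLE, RMAX, and W:

```python
PMAX, PIDLE = 300.0, 150.0   # peak and idle power, watts (hypothetical)
RMAX = 100e6                 # peak work rate, bytes/s (hypothetical)
W = 10e9                     # workload size, bytes (hypothetical)

def energy_no_idle(r):
    """Model 1: P(r) = r*PMAX, R(r) = r*RMAX -> E = (PMAX/RMAX)*W for any r."""
    return (r * PMAX) * (W / (r * RMAX))

def energy_with_idle(r):
    """Model 2: P(r) = PIDLE + r*(PMAX - PIDLE) -> E falls as r grows."""
    return (PIDLE + r * (PMAX - PIDLE)) * (W / (r * RMAX))
```

In model 1 the r terms cancel, so energy is fixed by the workload size; in model 2 the PIDLE/r term shrinks as r approaches 1, so running as fast as possible wins in both cases.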

Page 21: Towards Energy Efficient Hadoop

Faster = More Energy Efficient?

r = fraction of resources used in a system, ranging from 0 to 1

R(r) = work rate of a system, ranging from 0 to RMAX

P(r) = power of the system, ranging from 0 to PMAX

r1, r2 = the lower and upper bounds of the resources operating region

W = workload size

E(r) = energy consumed when operating at resource fraction r

[Figures: work rate R(r) = r × RMAX and power P(r) = PIDLE + r × (PMAX – PIDLE), for r in [0, 1]]

Caveats: What is meant by resource? What is a realistic behavior for R(r)?

Page 22: Towards Energy Efficient Hadoop

Take Away Thoughts

[Figure: performance vs. resources used]

If work rate ∝ resources used, energy is another aspect of performance

Prior performance optimization techniques do not need to be re-invented

What if work rate is not proportional to resources used?

Different hardware?

Productivity benchmarks?

Hadoop as terasort and JouleSort winner?