
Page 1: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

October 27, 2016

Hokkaido University
Akiyoshi SUGIKI, Phyo Thandar Thant

Page 2: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

2

Agenda

– Hokkaido University Academic Cloud
– A Docker-based Sizing Framework for Hadoop
– Multi-objective Optimization of Hadoop

Page 3: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

3

Information Initiative Center, Hokkaido University

Founded in 1962 as a national supercomputing center

A member of:

– HPCI (High Performance Computing Infrastructure) - 12 institutes

– JHPCN (Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructure) - 8 institutes

University R&D center for supercomputing, cloud computing, networking, and cyber security

Operating HPC twins
– Supercomputer (172 TFLOPS) and Academic Cloud System (43 TFLOPS)

Page 4: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

4

Hokkaido University Academic Cloud (2011-)

Japan’s largest academic cloud system
– > 43 TFLOPS (> 114 nodes)
– ~2,000 VMs

[System diagram]

Supercomputer: SR16000 M1, 172 TF / 176 nodes, 22 TB memory (128 GB/node); AMS2500 file system, 600 TB (SAS, RAID5) + 300 TB (SATA, RAID6)

Cloud System: BS2000, 44 TF / 114 nodes, 14 TB memory (128 GB/node); cloud storage, 1.96 PB; AMS2300 boot file system, 260 TB (SAS, RAID6); CloudStack 3.x

Data-science Cloud System (added 2013-): HA8000/RS210HM, 80 GB x 25 nodes and 32 GB x 2 nodes; VFP500N+AMS2500 (NAS), 500 TB (near-line NAS, RAID6); CloudStack 4.x; Hadoop package for “Big Data” (Hadoop, Hive, Mahout, and R)

Page 5: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

5

Supporting “Big Data”

“Big Data” cluster package
• Hadoop, Hive, Mahout, and R
• MPI, OpenMP, and Torque

– Automatic deployment of VM-based clusters
– Custom scheduling policy

• Spread I/O on multiple disks

[Diagram: the Big Data package (Hadoop, Hive, Mahout, R) deployed as a Hadoop cluster of VMs #1-#4, with each VM's virtual disks spread across Storage #1-#4]

Page 6: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

6

Lessons Learned (So Far)

No single Hadoop (a little like silos)
– A Hadoop instance for each group of users

Version problem
– Upgrades and expansion of the Hadoop ecosystem

Strong demand for a “middle person”
– Gives advice with a deep understanding of research domains, statistical analysis, and Hadoop-based systems

[Diagram: separate VM-based Hadoop clusters (#1, #2, #3), one per research group (#1, #2, #3), each holding its own research data]

Page 7: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

7

Going Next

A new system will be installed in April 2018
– x2 CPU cores, x5 storage space
– Bare-metal, accelerating performance at every layer
– Supports both interclouds and hybrid clouds

Still supports Hadoop as well as Spark
– Cluster templates
– Build a user community

[Diagram: cluster templates (Hadoop, Spark, …) deployable across the supercomputer system at Hokkaido U., regional sites (Tokyo, Osaka, Okinawa), and cloud systems at other universities and public clouds]

Page 8: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

8

Requirements

Run Hadoop on multiple clouds
– Academic clouds (community clouds)
• Hokkaido University Academic Cloud, ...
– Public clouds
• Amazon AWS, Microsoft Azure, Google Cloud, …

Offer the best choice for researchers (our users)
– Under multiple criteria
• Cost
• Performance (time constraints)
• Energy

Page 9: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

9

Our Solution

A Container-based Sizing Framework for Hadoop Clusters
– Docker-based
• Lightweight, easily migrated to other clouds
– Emulation (rather than simulation)
• Close to actual execution times on multiple clouds
– Output:
• Instance type
• Number of instances
• Hadoop configuration (*-site.xml files)

Page 10: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

10

Architecture

[Architecture diagram: an emulation engine interposes on the Docker runtime, regulating the CPU, memory, disk I/O, and network I/O of application containers (HPC, Big Data, …); it collects metrics, runs profiles against instance profiles (t2, m4, r3, c4) of public clouds, and feeds a cost estimator]

Page 11: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

11

Why Docker?

Virtual machines vs. OS containers:
– Size: large vs. small
– Machine emulation: complete vs. partial (shares the OS kernel)
– Launch time: long vs. short
– Migration: sometimes requires image conversion vs. easy
– Software: Xen, KVM, VMware vs. Docker, rkt, …

[Diagram: containers run App + Lib directly on a shared OS kernel, while each VM carries its own OS (App + Lib + OS) on top of a hypervisor]

Page 12: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

12

Container Execution

Cluster management
– Docker Swarm
– Multi-host (VXLAN-based) networking mode

Container
– Resources
• CPUs, memory, disk, and network I/O
– Regulation
• Docker run options, cgroups, and tc
– Monitoring

• Docker remote API and cgroups
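As a concrete illustration of this slide, here is a minimal sketch (not the framework's actual code) of launching one resource-regulated container with the Docker SDK for Python; the image tag, network name, and limit values are assumptions:

```python
# Sketch: launch one regulated Hadoop slave container with the Docker SDK
# for Python. Image tag, network name, and limits are illustrative.
import docker

client = docker.from_env()

slave = client.containers.run(
    image="hadoop-all-in-the-box:latest",  # hypothetical image tag
    name="hadoop-slave-1",
    detach=True,
    network="hadoop-overlay",              # pre-created multi-host (VXLAN) network
    cpuset_cpus="0-3",                     # CPU set: pin to 4 cores
    cpu_period=100000,
    cpu_quota=200000,                      # quota/period ~ 2 cores of CPU time
    mem_limit="8g",                        # memory limit
)

# Monitoring through the Docker remote API (cgroup-backed statistics)
stats = slave.stats(stream=False)
print(stats["memory_stats"].get("usage"),
      stats["cpu_stats"]["cpu_usage"]["total_usage"])
```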

Page 13: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

13

Docker Image

“Hadoop all in the box”
– Hadoop
– Spark
– HiBench

The same image for master/slaves

Exports
– (Environment variables)
– File mounts
• *-site.xml files
– (Data volumes)

[Diagram: the “Hadoop all in the box” image (Hadoop, Spark, HiBench) used for master and slaves, with core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml supplied via volume mounts]
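A minimal sketch of how generated *-site.xml files might be bind-mounted into the same “all in the box” image; the image tag, configuration directory, Hadoop path, and the HADOOP_ROLE environment variable are assumptions, not the deck's actual interface:

```python
# Sketch: bind-mount generated *-site.xml files into the "all in the box"
# image. Paths, image tag, and the HADOOP_ROLE variable are assumptions.
import docker

client = docker.from_env()
conf_dir = "/tmp/candidate-42"  # hypothetical directory with generated configs

volumes = {
    f"{conf_dir}/{name}": {"bind": f"/opt/hadoop/etc/hadoop/{name}", "mode": "ro"}
    for name in ("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml")
}

master = client.containers.run(
    image="hadoop-all-in-the-box:latest",   # same image for master and slaves
    name="hadoop-master",
    detach=True,
    environment={"HADOOP_ROLE": "master"},  # hypothetical role switch
    volumes=volumes,
)
```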

Page 14: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

14

Resources

Resource             How                             Command
CPU: cores           Change CPU set                  Docker run / cgroups
CPU: clock rate      Change quota & period           Docker run / cgroups
Memory: size         Set memory limit                Docker run / cgroups
Memory: OOM          Change out-of-memory handling   Docker run / cgroups
Disk: IOPS           Throttle read/write IOPS        Docker run / cgroups
Disk: bandwidth      Throttle read/write bytes/sec   Docker run / cgroups
Network: IOPS        Throttle TX/RX IOPS             Docker run / cgroups
Network: bandwidth   Throttle TX/RX bytes/sec        Docker run / cgroups
Network: latency     Insert latency (> 1 ms)         tc
Freezer: freeze      Suspend/resume                  cgroups
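A hedged sketch of two rows of this table, assuming the Docker SDK for Python and tc run inside the container (which needs the NET_ADMIN capability); the device path, interface name, and rates are illustrative:

```python
# Sketch: throttle disk IOPS/bandwidth at container start and insert network
# latency with tc (netem). Device and interface names are assumptions, and
# running tc inside the container requires the NET_ADMIN capability.
import docker

client = docker.from_env()

c = client.containers.run(
    image="hadoop-all-in-the-box:latest",
    detach=True,
    cap_add=["NET_ADMIN"],
    device_read_iops=[{"Path": "/dev/sda", "Rate": 500}],
    device_write_iops=[{"Path": "/dev/sda", "Rate": 500}],
    device_read_bps=[{"Path": "/dev/sda", "Rate": 50 * 1024 * 1024}],   # 50 MB/s
    device_write_bps=[{"Path": "/dev/sda", "Rate": 50 * 1024 * 1024}],
)

# Insert ~2 ms of latency (> 1 ms, as in the table) on the container interface
c.exec_run("tc qdisc add dev eth0 root netem delay 2ms")
```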

Page 15: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

15

Hadoop Configuration

Must be adjusted according to
– Instance type (CPU, memory, disk, and network)
– Number of instances

Targeting all parameters in *-site.xml

Dependent parameters
– (Instance type)
– YARN container size
– JVM heap size
– Map task size
– Reduce task size

[Diagram: machine instance size → YARN container size → JVM heap size → map/reduce task size]
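A sketch of that dependency chain in Python; the head-room and heap ratios are illustrative assumptions, not the values used in the deck:

```python
# Sketch: derive dependent parameters from an instance type, following the
# chain instance size -> YARN container size -> JVM heap -> task sizes.
# The 80% ratios below are illustrative, not the deck's values.

def derive_config(mem_gb: int, vcores: int) -> dict:
    node_mb = int(mem_gb * 1024 * 0.8)        # leave head-room for OS/daemons
    min_alloc_mb = 1024
    map_mb = max(min_alloc_mb, node_mb // max(vcores, 1))
    reduce_mb = 2 * map_mb
    return {
        "yarn.nodemanager.resource.memory-mb": node_mb,
        "yarn.nodemanager.resource.cpu-vcores": vcores,
        "yarn.scheduler.maximum-allocation-mb": node_mb,
        "yarn.scheduler.minimum-allocation-mb": min_alloc_mb,
        "mapreduce.map.memory.mb": map_mb,
        "mapreduce.reduce.memory.mb": reduce_mb,
        # JVM heap set to ~80% of each task container
        "mapreduce.map.java.opts": f"-Xmx{int(map_mb * 0.8)}m",
        "mapreduce.reduce.java.opts": f"-Xmx{int(reduce_mb * 0.8)}m",
    }

print(derive_config(mem_gb=12, vcores=4))     # "medium" instance in the deck
```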

Page 16: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

16

Optimization

Multi-objective GAs
– Trading off cost and performance (time constraints)
– Other factors: energy, …
– Future: from multi-objective to many-objective (> 3)

Generates a “Pareto-optimal front”
Technique: non-dominated sorting

[Plot: candidate solutions (x) in the Objective 1 vs. Objective 2 plane; the non-dominated solutions form the Pareto-optimal front]
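A minimal sketch of the core step of the non-dominated sorting mentioned above, extracting the Pareto-optimal front of (time, cost) pairs under minimization; the candidate values are made up:

```python
# Sketch: the core step of non-dominated sorting: extract the Pareto-optimal
# front of (execution time, cost) pairs under minimization.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

candidates = [(120, 4.2), (95, 5.0), (150, 3.1), (95, 4.9), (200, 3.0)]
print(pareto_front(candidates))   # the non-dominated (time, cost) pairs
```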

Page 17: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

17

(Short) Summary

A Sizing Framework for Hadoop/Spark Clusters
– OS container-based approach
– Combined with Genetic Algorithms
• Multi-objective optimization (cost & perf.)

Future Work
– Docker Container Executor (DCE)
• DCE runs YARN containers inside Docker containers
• Designed to provide a custom environment for each application
• We believe DCE can also be utilized for slowing down and speeding up Hadoop tasks

Page 18: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

18

Slow Down: Torturing Hadoop

Make stragglers

No intervention is required

[Diagram: a master coordinating map tasks (Map 1-5) and reduce tasks (Red 1-4); two of the tasks are throttled into stragglers]
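One way such stragglers might be induced is by tightening a task container's CPU quota at run time through the Docker API (a cgroup update); this sketch is illustrative and the container name is hypothetical:

```python
# Sketch: turn a running task container into a straggler by tightening its
# CPU quota at run time (a cgroup update via the Docker API). The container
# name is hypothetical.
import docker

client = docker.from_env()
task = client.containers.get("hadoop-slave-3")

task.update(cpu_period=100000, cpu_quota=25000)    # throttle to ~25% of one core
# ... observe the effect on the running job ...
task.update(cpu_period=100000, cpu_quota=400000)   # lift the cap again (~4 cores)
```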

Page 19: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

19

Speeding Up: Accelerating Hadoop

Balance the resource usage of tasks on the same node

[Diagram: the same master/map/reduce layout as above; resource limits on co-located tasks are rebalanced so that the straggler tasks catch up]

Page 20: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

MHCO: Multi-Objective Hadoop Configuration Optimization Using Steady-State NSGA-II

Page 21: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

21

Introduction

BIG DATA

◦ The increasing use of connected devices driven by the Internet of Things, together with data growth from scientific research, will lead to an exponential increase in data

◦ A portion of these data is underutilized or underexploited

◦ Hadoop MapReduce is a very popular programming model for large-scale data analytics

Page 22: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

22

Problem Definition I

◦ Objective 1: Parameter Tuning for Minimizing Execution Time

– core-site.xml: configuration settings for Hadoop core, such as I/O settings
– hdfs-site.xml: configuration settings for HDFS daemons
– mapred-site.xml: configuration settings for MapReduce daemons
– yarn-site.xml: configuration settings for YARN daemons

◦ Hadoop provides tunable options that have a significant effect on application performance

◦ Practitioners and administrators lack the expertise to tune them

◦ Appropriate parameter configuration is the key factor in Hadoop performance

Page 23: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

23

Problem Definition II

◦ Appropriate machine instance selection for the Hadoop cluster

◦ Objective 2: Instance Type Selection for Minimizing Hadoop Cluster Deployment Cost

[Diagram: the user sends an application request to the service provider and receives the result; the provider offers machine instance types (small, medium, large, x-large) on a pay-per-use basis]

Page 24: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

24

Proposed Search based Approach

ssNSGA-II

[Diagram: ssNSGA-II drives (1) performance optimization through Hadoop parameter tuning and (2) deployment cost optimization through cluster instance type selection]

◦ Chromosome encoding can handle the dynamic nature of Hadoop across version changes

◦ Uses a steady-state approach to reduce the computation overhead of the generic GA approach

◦ Bi-objective optimization (execution time, cluster deployment cost)

Page 25: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

Objective Function

min t(p), min c(p)

where p = [p1, p2, …, pm] is the list of configuration parameters and the instance type, t(p) is the execution time of the MapReduce job, and c(p) is the machine instance usage cost.

t(p) = t_wc
c(p) = (SP * NS) * t(p)

where t_wc is the workload execution time, SP is the instance price, and NS is the number of machine instances.

Assumptions: the two objective functions are black-box functions; the number of instances in the cluster is static.

Instance type   Mem (GB) / CPU cores   Price per second (yen)
x-large         128 / 40               0.0160
large           30 / 10                0.0039
medium          12 / 4                 0.0016
small           3 / 1                  0.0004
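A worked sketch of the cost objective using the prices above; the 6-node, 120-second example is made up:

```python
# Sketch: the deployment-cost objective c(p) = SP * NS * t(p), with SP taken
# from the price table above (yen per instance per second).

PRICE_PER_SEC = {"small": 0.0004, "medium": 0.0016, "large": 0.0039, "x-large": 0.0160}

def cost(instance_type: str, num_instances: int, exec_time_sec: float) -> float:
    return PRICE_PER_SEC[instance_type] * num_instances * exec_time_sec

# Example (made up): a 6-node medium cluster and a workload measured at 120 s
print(cost("medium", 6, 120.0))   # 0.0016 * 6 * 120 = 1.152 yen
```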

Page 26: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

26

Parameter Grouping

I. HDFS and MapReduce parameters: 17 (value ranges taken from previous research)

II. YARN parameters: 6 (values set from the machine instance type specification: CPU, memory)

III. YARN-related MapReduce parameters: 7 (values set from the machine instance type specification: CPU, memory)

Total: 30 parameters

Page 27: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

27

Group I Parameter Values

Parameters with two candidate values:
– dfs.namenode.handler.count: 10, 20
– dfs.datanode.handler.count: 10, 20
– dfs.blocksize: 134217728, 268435456
– mapreduce.map.output.compress: true, false
– mapreduce.job.jvm.numtasks: 1 (limited), -1 (unlimited)
– mapreduce.map.sort.spill.percent: 0.8, 0.9
– mapreduce.reduce.shuffle.input.buffer.percent: 0.7, 0.8
– mapreduce.reduce.shuffle.memory.limit.percent: 0.25, 0.5
– mapreduce.reduce.shuffle.merge.percent: 0.66, 0.9
– mapreduce.reduce.input.buffer.percent: 0.0, 0.5

Parameters with four candidate values:
– dfs.datanode.max.transfer.threads: 4096, 5120, 6144, 7168
– dfs.datanode.balance.bandwidthPerSec: 1048576, 2097152, 4194304, 8388608
– mapreduce.task.io.sort.factor: 10, 20, 30, 40
– mapreduce.task.io.sort.mb: 100, 200, 300, 400
– mapreduce.tasktracker.http.threads: 40, 45, 50, 60
– mapreduce.reduce.shuffle.parallelcopies: 5, 10, 15, 20
– mapreduce.reduce.merge.inmem.threshold: 1000, 1500, 2000, 2500

Page 28: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

28

Group II and III Parameter Values

YARN parameters (x-large / large / medium / small):
– yarn.nodemanager.resource.memory.mb: 102400 / 26624 / 10240 / 3072
– yarn.nodemanager.resource.cpu-vcores: 39 / 9 / 3 / 1
– yarn.scheduler.maximum.allocation-mb: 102400 / 26624 / 10240 / 3072
– yarn.scheduler.minimum.allocation-mb: 5120 / 2048 / 2048 / 1024
– yarn.scheduler.maximum.allocation-vcores: 39 / 9 / 3 / 1
– yarn.scheduler.minimum.allocation-vcores: 10 / 3 / 1 / 1

YARN-related MapReduce parameters (x-large / large / medium / small):
– mapreduce.map.memory.mb: 5120 / 2048 / 2048 / 1024
– mapreduce.reduce.memory.mb: 10240 / 4096 / 2048 / 1024
– mapreduce.map.cpu.vcores: 10 / 3 / 1 / 1
– mapreduce.reduce.cpu.vcores: 10 / 3 / 1 / 1
– mapreduce.child.java.opts: 8192 / 3277 / 1638 / 819
– yarn.app.mapreduce.am.resource-mb: 10240 / 4096 / 2048 / 1024
– yarn.app.mapreduce.am.command-opts: 8192 / 3277 / 1638 / 819

Page 29: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

29

Chromosome Encoding

Binary chromosome: HDFS and MapReduce parameters plus the machine instance type; a single bit or two consecutive bits represent each parameter value and the instance type.

Dependent parameters (YARN parameters and YARN-related MapReduce parameters) are derived from the selected instance type (e.g., small).

Chromosome length = 26 bits
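A sketch of decoding such a chromosome; the bit layout (10 one-bit parameters, 7 two-bit parameters, 2 bits for the instance type, 26 bits in total) is inferred from the parameter tables above, and only a subset of parameters is shown:

```python
# Sketch: decode a 26-bit chromosome. Layout inferred from the deck's tables:
# 10 two-valued parameters (1 bit each) + 7 four-valued parameters (2 bits
# each) = 24 bits, plus 2 bits selecting the instance type. Only a subset of
# parameters is listed here.
import random

TWO_VALUED = {
    "dfs.namenode.handler.count": (10, 20),
    "mapreduce.map.output.compress": (True, False),
    # ... the remaining 1-bit parameters ...
}
FOUR_VALUED = {
    "mapreduce.task.io.sort.factor": (10, 20, 30, 40),
    # ... the remaining 2-bit parameters ...
}
INSTANCE_TYPES = ("small", "medium", "large", "x-large")

def decode(bits):
    cfg, i = {}, 0
    for name, values in TWO_VALUED.items():
        cfg[name] = values[bits[i]]
        i += 1
    for name, values in FOUR_VALUED.items():
        cfg[name] = values[bits[i] * 2 + bits[i + 1]]
        i += 2
    instance_type = INSTANCE_TYPES[bits[i] * 2 + bits[i + 1]]
    return cfg, instance_type

chromosome = [random.randint(0, 1) for _ in range(26)]
print(decode(chromosome))
```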

Page 30: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

30

System Architecture

[Diagram: the ssNSGA-II optimizer submits the workload to a YARN cluster (one ResourceManager and several NodeManagers), measures execution time and cluster deployment cost, and outputs a list of optimal settings]

Page 31: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

31

ssNSGA-II Based Hadoop Configuration Optimization

1. Generate n sample configuration chromosomes C1, C2, …, Cn
2. Select 2 random parents P1, P2
3. Perform 2-point crossover on P1, P2 (probability Pc = 1) to generate offspring C_offspring
4. Perform mutation on C_offspring (probability Pm = 0.1)
5. Calculate the fitness of C_offspring and update population P
6. Perform non-dominated sorting and update population P
7. If the repeat condition holds, go back to step 2; otherwise output the Pareto solution list C_opt
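A compact Python sketch of this loop; evaluate() is a placeholder for an emulated Hadoop run, and the replacement step is a simplified stand-in for the full non-dominated-sorting and crowding-based replacement:

```python
# Sketch of the steady-state loop above. evaluate() stands in for an emulated
# Hadoop run returning (execution time, cost); the replacement step is a
# simplified stand-in for full non-dominated-sorting / crowding replacement.
import random

CHROM_LEN, POP_SIZE, EVALS, PM = 26, 30, 180, 0.1

def evaluate(chrom):          # placeholder black-box objectives (time, cost)
    return (sum(chrom) + random.random(), CHROM_LEN - sum(chrom) + random.random())

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def two_point_crossover(p1, p2):          # Pc = 1: always applied
    i, j = sorted(random.sample(range(1, CHROM_LEN), 2))
    return p1[:i] + p2[i:j] + p1[j:]

def mutate(chrom):
    return [1 - g if random.random() < PM else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
fitness = [evaluate(c) for c in population]

for _ in range(EVALS):
    a, b = random.sample(range(POP_SIZE), 2)                 # two random parents
    child = mutate(two_point_crossover(population[a], population[b]))
    f_child = evaluate(child)
    for k in range(POP_SIZE):                                # steady-state update
        if dominates(f_child, fitness[k]):
            population[k], fitness[k] = child, f_child
            break

pareto = [f for f in fitness if not any(dominates(g, f) for g in fitness if g is not f)]
print(len(pareto), "non-dominated (time, cost) points")
```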

Page 32: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

32

Experiment Benchmark

Type: Micro benchmark
Workloads: Sort, TeraSort, Wordcount
Input size: 2.98023 GB
Purpose: measure cluster performance (intrinsic behavior of the cluster)

Type: Web search
Workload: PageRank
Input size: 5000 pages with 3 iterations
Purpose: measure the execution performance of real-world big data applications

Benchmark used: HiBench benchmark suite version 4.0, https://github.com/intel-hadoop/HiBench/releases

Page 33: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

33

Experiment Environment

Setup information:
– CPU: Intel Xeon E7-8870 (40 cores)
– Memory: 128 GB RAM
– Storage: 400 TB
– Hadoop version: 2.7.1
– JDK: 1.8.0

[Diagram: a 6-node cluster (1 NameNode, 5 DataNodes) reachable by the user over a public network, driven by the ssNSGA-II optimization]

Page 34: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

34

Experimental Results

[Plots: cost (¥) vs. execution time (sec) for the Sort and TeraSort workloads, one result set per instance type (small, medium, large, x-large)]

Population size = 30, number of evaluations = 180, number of objectives = 2, mutation probability = 0.1, crossover probability = 1.0

* The HDFS and MapReduce parameters have significant effects for these workloads

Page 35: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

35

Experimental Results Cont’d

[Plots: cost (¥) vs. execution time (sec) for the PageRank and Wordcount workloads, one result set per instance type (small, medium, large, x-large)]

Population size = 30, number of evaluations = 180, number of objectives = 2, mutation probability = 0.1, crossover probability = 1.0

* These workloads depend on the YARN and YARN-related parameters more than on the HDFS and MapReduce parameters

Page 36: A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

36

Conclusion & Continuing Work

◦ Offline Hadoop configuration optimization using the ssNSGA-II-based search strategy

◦ An x-large instance type cluster is not a suitable option for the current workloads and input data sizes

◦ Large or medium instance type clusters show the best balance between our objective functions

◦ Continuing work: dynamic cluster resizing through containers and online configuration optimization of MapReduce workloads for scientific workflow applications, for effective Big Data processing