A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
October 27, 2016
Hokkaido University
Akiyoshi SUGIKI, Phyo Thandar Thant
Agenda
– Hokkaido University Academic Cloud
– A Docker-based Sizing Framework for Hadoop
– Multi-objective Optimization of Hadoop
Information Initiative Center, Hokkaido University
Founded in 1962 as a national supercomputing center
A member of
– HPCI (High Performance Computing Infrastructure) - 12 institutes
– JHPCN (Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructure) - 8 institutes
University R&D center for supercomputing, cloud computing, networking, and cyber security
Operating HPC twins
– Supercomputer (172 TFLOPS) and Academic Cloud System (43 TFLOPS)
Hokkaido University Academic Cloud (2011-)
Japan's largest academic cloud system
– > 43 TFLOPS (> 114 nodes)
– ~2,000 VMs

Supercomputer
– SR16000 M1: 172 TF / 176 nodes, 22 TB memory (128 GB/node)
– AMS2500 (file system): 600 TB (SAS, RAID5) + 300 TB (SATA, RAID6)
Cloud System (CloudStack 3.x)
– BS2000: 44 TF / 114 nodes, 14 TB memory (128 GB/node)
– Cloud storage: 1.96 PB
– AMS2300 (boot file system): 260 TB (SAS, RAID6)
– VFP500N + AMS2500 (NAS): 500 TB (near-line NAS, RAID6)
Data-science Cloud System (added 2013-, CloudStack 4.x)
– HA8000/RS210HM: 80 GB x 25 nodes, 32 GB x 2 nodes
– Hadoop package for "Big Data" (Hadoop, Hive, Mahout, and R)
Supporting "Big Data"
"Big Data" cluster package
• Hadoop, Hive, Mahout, and R
• MPI, OpenMP, and Torque
– Automatic deployment of VM-based clusters
– Custom scheduling policy
• Spread I/O on multiple disks

[Figure: a Hadoop cluster of VMs (#1-#4) whose virtual disks are spread over separate storage units (Storage #1-#4); the Big Data package bundles Hadoop, Hive, Mahout, and R]
Lessons Learned (So Far)
No single Hadoop (a little like silos)
– A Hadoop instance for each group of users
Version problem
– Upgrades and expansion of the Hadoop ecosystem
Strong demand for a "middle person"
– Gives advice with deep understanding of research domains, statistical analysis, and Hadoop-based systems

[Figure: separate VM-based Hadoop clusters (#1-#3), one per research group, each holding its own research data]
Going Next
A new system will be installed in April 2018
– x2 CPU cores, x5 storage space
– Bare-metal, accelerating performance at every layer
– Supports both interclouds and hybrid clouds
Still supports Hadoop as well as Spark
– Cluster templates
– Build a user community

[Figure: cluster templates (Hadoop, Spark, …) deployed across the Hokkaido U. supercomputer system, regional sites (Tokyo, Osaka, Okinawa), and cloud systems in other universities and public clouds]
Requirements
Run Hadoop on multiple clouds
– Academic clouds (community clouds)
• Hokkaido University Academic Cloud, ...
– Public clouds
• Amazon AWS, Microsoft Azure, Google Cloud, …
Offer the best choice for researchers (our users)
– Under multiple criteria
• Cost
• Performance (time constraints)
• Energy
• …
Our Solution
A Container-based Sizing Framework for Hadoop Clusters
– Docker-based
• Light-weight, easily migrates to other clouds
– Emulation (rather than simulation)
• Close to actual execution times on multiple clouds
– Output:
• Instance type
• Number of instances
• Hadoop configuration (*-site.xml files)
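The framework's output includes Hadoop *-site.xml files; a minimal sketch of how a parameter dictionary could be serialized into that format (the property keys are standard Hadoop ones, but this helper is illustrative, not the framework's actual code):

```python
import xml.etree.ElementTree as ET

def to_site_xml(params: dict) -> str:
    """Serialize a parameter dict into Hadoop's *-site.xml format."""
    conf = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(conf, encoding="unicode")

# Example: a fragment of hdfs-site.xml
xml = to_site_xml({"dfs.blocksize": 134217728, "dfs.namenode.handler.count": 20})
```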
Architecture

[Figure: the emulation engine interposes on the Docker runtime, collecting CPU, memory, disk I/O, and network I/O metrics from applications (HPC, Big Data, …); run profiles and instance profiles (t2, m4, r3, c4) feed a cost estimator for public clouds]
Why Docker?

                   Virtual Machines              OS Containers
Size               Large                         Small
Machine emulation  Complete                      Partial (shares OS kernel)
Launch time        Large                         Small
Migration          Sometimes requires            Easy
                   image conversion
Software           Xen, KVM, VMware              Docker, rkt, …

[Figure: VMs each carry a full guest OS on a hypervisor, while containers share the host OS kernel]
Container Execution
Cluster management
– Docker Swarm
– Multi-host (VXLAN-based) networking mode
Container
– Resources
• CPUs, memory, disk, and network I/O
– Regulation
• docker run options, cgroups, and tc
– Monitoring
• Docker remote API and cgroups
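Monitoring through cgroups amounts to reading counters from the cgroup filesystem; a minimal sketch (the cgroup-v1 memory.stat key/value layout is real, but the sample values and this small parser are illustrative):

```python
def parse_cgroup_stat(text: str) -> dict:
    """Parse a cgroup stat file (key/value pairs, one per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

# Illustrative contents of /sys/fs/cgroup/memory/<container>/memory.stat
sample = "cache 1048576\nrss 4194304\nmapped_file 0"
stats = parse_cgroup_stat(sample)
```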
Docker Image
"Hadoop all in the box"
– Hadoop
– Spark
– HiBench
The same image for master/slaves
Exports
– (Environment variables)
– File mounts
• *-site.xml files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml)
– (Data volumes)

[Figure: a single "Hadoop all in the box" image bundling Hadoop, Spark, and HiBench, with the four *-site.xml files supplied via volume mounts]
Resources

Resource             How                              Command
CPU      cores       Change CPU set                   docker run / cgroups
         clock rate  Change quota & period            docker run / cgroups
Memory   size        Set memory limit                 docker run / cgroups
         OOM         Change out-of-memory handling    docker run / cgroups
Disk     IOPS        Throttle read/write IOPS         docker run / cgroups
         bandwidth   Throttle read/write bytes/sec    docker run / cgroups
Network  IOPS        Throttle TX/RX IOPS              docker run / cgroups
         bandwidth   Throttle TX/RX bytes/sec         docker run / cgroups
         latency     Insert latency (> 1 ms)          tc
Freezer  freeze      Suspend/resume                   cgroups
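Most of the knobs above map directly onto docker run flags; a hedged sketch that assembles such a command line (the flags are real Docker options, but the image name, resource values, and device path are placeholders):

```python
def docker_run_args(image, cpuset, mem_mb, read_bps=None, device="/dev/sda"):
    """Build a `docker run` argument list with resource limits applied."""
    args = ["docker", "run", "-d",
            "--cpuset-cpus", cpuset,        # pin container to specific CPU cores
            "--memory", f"{mem_mb}m"]       # hard memory limit
    if read_bps is not None:                # throttle disk read bandwidth
        args += ["--device-read-bps", f"{device}:{read_bps}"]
    return args + [image]

cmd = docker_run_args("hadoop-all-in-the-box", cpuset="0-3", mem_mb=2048,
                      read_bps=50 * 1024 * 1024)
```

Network latency is the one knob Docker itself does not expose, which is why the table falls back to tc (netem) for it.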
Hadoop Configuration
Must be adjusted according to
– Instance type (CPU, memory, disk, and network)
– Number of instances
Targeting all parameters in *-site.xml
Dependent parameters
– (Instance type)
– YARN container size
– JVM heap size
– Map task size
– Reduce task size

Machine instance size → YARN container size → JVM heap size → Map/Reduce task size
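The dependency chain can be sketched as a simple derivation; the 0.8 heap-to-container ratio below is inferred from the Group II/III table later in the deck (e.g. a 3277 MB heap for a 4096 MB reduce container) and is a heuristic, not necessarily the framework's exact rule:

```python
def derive_sizes(map_mb, reduce_mb):
    """Derive dependent Hadoop memory settings from YARN container sizes."""
    return {
        "mapreduce.map.memory.mb": map_mb,
        "mapreduce.reduce.memory.mb": reduce_mb,
        # JVM heap set to ~80% of the (larger) container, leaving headroom
        "mapreduce.child.java.opts": f"-Xmx{round(0.8 * reduce_mb)}m",
    }

sizes = derive_sizes(2048, 4096)   # the "large" instance row
```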
Optimization
Multi-objective GAs
– Trading off cost and performance (time constraints)
– Other factors: energy, …
– Future: from multi-objective to many-objective (> 3)
Generate a "Pareto-optimal front"
Technique: non-dominated sorting

[Figure: scatter plot of candidate solutions over Objective 1 and Objective 2, with the non-dominated solutions forming the Pareto front]
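Non-dominated sorting extracts the Pareto front from a set of candidate objective vectors; a minimal sketch for two minimized objectives (the sample points are made up for illustration):

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Objective vectors: (execution time, deployment cost), both minimized
candidates = [(120, 3.0), (80, 5.0), (150, 2.0), (90, 4.0), (130, 3.5)]
front = pareto_front(candidates)
```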
(Short) Summary
A Sizing Framework for Hadoop/Spark Clusters
– OS container-based approach
– Combined with genetic algorithms
• Multi-objective optimization (cost & perf.)
Future Work
– Docker Container Executor (DCE)
• DCE runs YARN containers inside Docker containers
• Designed to provide a custom environment for each app.
• We believe DCE can also be used to slow down and speed up Hadoop tasks
Slow Down - Torturing Hadoop
Make stragglers
No intervention is required

[Figure: a master coordinating five map tasks and four reduce tasks, with one map task and one reduce task deliberately turned into stragglers]
Speeding Up - Accelerating Hadoop
Balance resource usage of tasks on the same node

[Figure: the same master/map/reduce layout as the previous slide, with the straggler tasks given more resources to rebalance the cluster]
MHCO: Multi-Objective Hadoop Configuration Optimization Using Steady-State NSGA-II
Introduction
BIG DATA
◦ The increasing use of connected devices driven by the Internet of Things, together with data growth from scientific research, will lead to an exponential increase in data
◦ A portion of these data is underutilized or underexploited
◦ Hadoop MapReduce is a very popular programming model for large-scale data analytics
Problem Definition I
◦ Objective 1: Parameter Tuning for Minimizing Execution Time

File             Contents
core-site.xml    Configuration settings for the HDFS core, such as I/O settings
hdfs-site.xml    Configuration settings for HDFS daemons
mapred-site.xml  Configuration settings for MapReduce daemons
yarn-site.xml    Configuration settings for YARN daemons

◦ Hadoop provides tunable options that have a significant effect on application performance
◦ Practitioners and administrators lack the expertise to tune them
◦ Appropriate parameter configuration is a key factor in Hadoop performance
Problem Definition II
◦ Appropriate machine instance selection for the Hadoop cluster
◦ Objective 2: Instance Type Selection for Minimizing Hadoop Cluster Deployment Cost

[Figure: a user sends a request to a service provider, which runs the application and returns the result; machine instance types (small, medium, large, x-large) are billed pay-per-use]
Proposed Search-based Approach
ssNSGA-II
1. Performance optimization - Hadoop parameter tuning
2. Deployment cost optimization - cluster instance type selection

◦ Chromosome encoding can handle the dynamic nature of Hadoop across version changes
◦ Use a steady-state approach to reduce the computation overhead of the generic GA approach
◦ Bi-objective optimization (execution time, cluster deployment cost)

Objective Function
min t(p), min c(p)
where p = [p1, p2, …, pm] is the list of configuration parameters and the instance type,
t(p) = execution time of the MR job, and c(p) = machine instance usage cost
t(p) = twc
c(p) = (SP * NS) * t(p)
where twc = workload execution time, SP = instance price, and NS = number of machine instances

Assumptions
- the two objective functions are black-box functions
- the number of instances in the cluster is static

Instance type  Mem (GB) / CPU cores  Price per second (yen)
X-large        128 / 40              0.0160
Large          30 / 10               0.0039
Medium         12 / 4                0.0016
Small          3 / 1                 0.0004
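Under this model the deployment cost is simply price x cluster size x runtime; a small sketch using the price table above (the workload runtime is a placeholder value):

```python
# Price per second (yen) per instance type, from the table above
PRICE = {"x-large": 0.0160, "large": 0.0039, "medium": 0.0016, "small": 0.0004}

def cost(instance_type, num_instances, exec_time_sec):
    """c(p) = (SP * NS) * t(p): instance price x cluster size x execution time."""
    return PRICE[instance_type] * num_instances * exec_time_sec

c = cost("small", 5, 100.0)   # 5 small instances running for 100 seconds
```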
Parameter Grouping
I.   HDFS and MapReduce parameters - 17
II.  YARN parameters - 6
III. YARN-related MapReduce parameters - 7
Total: 30 parameters

– Group II/III values follow the machine instance type specification (CPU, memory)
– Group I value ranges are referenced from previous research
Group I Parameter Values

Parameter Name                                 Value Range
dfs.namenode.handler.count                     10, 20
dfs.datanode.handler.count                     10, 20
dfs.blocksize                                  134217728, 268435456
mapreduce.map.output.compress                  true, false
mapreduce.job.jvm.numtasks                     1 (limited), -1 (unlimited)
mapreduce.map.sort.spill.percent               0.8, 0.9
mapreduce.reduce.shuffle.input.buffer.percent  0.7, 0.8
mapreduce.reduce.shuffle.memory.limit.percent  0.25, 0.5
mapreduce.reduce.shuffle.merge.percent         0.66, 0.9
mapreduce.reduce.input.buffer.percent          0.0, 0.5

Parameter Name                                 Value Range
dfs.datanode.max.transfer.threads              4096, 5120, 6144, 7168
dfs.datanode.balance.bandwidthPerSec           1048576, 2097152, 4194304, 8388608
mapreduce.task.io.sort.factor                  10, 20, 30, 40
mapreduce.task.io.sort.mb                      100, 200, 300, 400
mapreduce.tasktracker.http.threads             40, 45, 50, 60
mapreduce.reduce.shuffle.parallelcopies        5, 10, 15, 20
mapreduce.reduce.merge.inmem.threshold         1000, 1500, 2000, 2500
Group II and III Parameter Values

Parameter Name                             x-large   large   medium   small
yarn.nodemanager.resource.memory-mb        102400    26624   10240    3072
yarn.nodemanager.resource.cpu-vcores       39        9       3        1
yarn.scheduler.maximum-allocation-mb       102400    26624   10240    3072
yarn.scheduler.minimum-allocation-mb       5120      2048    2048     1024
yarn.scheduler.maximum-allocation-vcores   39        9       3        1
yarn.scheduler.minimum-allocation-vcores   10        3       1        1
mapreduce.map.memory.mb                    5120      2048    2048     1024
mapreduce.reduce.memory.mb                 10240     4096    2048     1024
mapreduce.map.cpu.vcores                   10        3       1        1
mapreduce.reduce.cpu.vcores                10        3       1        1
mapreduce.child.java.opts                  8192      3277    1638     819
yarn.app.mapreduce.am.resource.mb          10240     4096    2048     1024
yarn.app.mapreduce.am.command-opts         8192      3277    1638     819
Chromosome Encoding
Binary chromosome
– A single bit or two consecutive bits represent a parameter value or the instance type
– Genes: HDFS and MapReduce parameters (Group I) plus the machine instance type
– Dependent parameters: the YARN and YARN-related MapReduce parameters follow from the selected instance type (e.g. small)
Chromosome length = 26 bits
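With 10 two-valued Group I parameters (1 bit each), 7 four-valued ones (2 bits each), and a 2-bit instance type, the length works out to 10 + 14 + 2 = 26 bits. A hedged sketch of such a decoder (the bit order and layout are assumptions consistent with the parameter tables, not the paper's exact encoding):

```python
# Illustrative bit layout: 10 one-bit genes, 7 two-bit genes, 2-bit instance type
ONE_BIT = 10
TWO_BIT = 7
INSTANCE_TYPES = ["small", "medium", "large", "x-large"]

def decode(chromosome):
    """Decode a 26-bit chromosome string into gene indices and an instance type."""
    assert len(chromosome) == ONE_BIT + 2 * TWO_BIT + 2 == 26
    bits = iter(chromosome)
    one_bit = [int(next(bits)) for _ in range(ONE_BIT)]
    two_bit = [int(next(bits) + next(bits), 2) for _ in range(TWO_BIT)]
    instance = INSTANCE_TYPES[int(next(bits) + next(bits), 2)]
    return one_bit, two_bit, instance

genes1, genes2, inst = decode("1010101010" + "00" * 7 + "11")
```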
System Architecture

[Figure: the ssNSGA-II optimizer submits workloads to a YARN cluster (one ResourceManager, several NodeManagers), measures execution time and cluster deployment cost, and emits a list of optimal settings]
ssNSGA-II Based Hadoop Configuration Optimization
1. Generate n sample configuration chromosomes C1, C2, …, Cn
2. Select 2 random parents P1, P2
3. Perform 2-point crossover on P1, P2 (probability Pc = 1) to generate offspring Coffspring
4. Perform mutation on Coffspring (probability Pm = 0.1)
5. Calculate the fitness of Coffspring and update population P
6. Perform non-dominated sorting and update population P
7. Repeat from step 2 while the repeat condition holds
8. Output the list of Pareto solutions, Copt
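One iteration of the steady-state loop above can be sketched as follows. The fitness function is a toy stand-in for actually running the MR job, and the replacement rule (drop one dominated solution) is a simplification of ssNSGA-II's non-dominated-sorting update:

```python
import random

def two_point_crossover(p1, p2):
    """2-point crossover (Pc = 1): splice the middle segment of p2 into p1."""
    i, j = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:]

def mutate(c, pm=0.1):
    """Flip each bit independently with probability Pm."""
    return "".join(b if random.random() >= pm else "10"[int(b)] for b in c)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def steady_state_step(population, fitness):
    """Insert one offspring, then drop one dominated solution (if any)."""
    p1, p2 = random.sample(population, 2)
    child = mutate(two_point_crossover(p1, p2))
    population = population + [child]
    dominated = [c for c in population
                 if any(dominates(fitness(d), fitness(c)) for d in population)]
    population.remove(dominated[-1] if dominated else random.choice(population))
    return population

# Toy fitness: pretend (time, cost) both shrink as more bits are set
fitness = lambda c: (26 - c.count("1"), c.count("0"))
pop = ["".join(random.choice("01") for _ in range(26)) for _ in range(10)]
pop = steady_state_step(pop, fitness)
```

The steady-state flavor evaluates one offspring per iteration instead of a whole generation, which is what keeps the evaluation budget (180 runs here) small when each fitness evaluation is a full Hadoop job.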
Experiment Benchmark

Type            Workload                   Input Size                    Purpose
MicroBenchmark  Sort, TeraSort, Wordcount  2.98023 GB                    measure cluster performance (intrinsic behavior of the cluster)
Web Search      PageRank                   5000 pages with 3 iterations  measure the execution performance of real-world big data applications

Benchmark used: HiBench benchmark suite version 4.0,
https://github.com/intel-hadoop/HiBench/releases
Experiment Environment

Setup Information  Specification
CPU                Intel Xeon E7-8870 (40 cores)
Memory             128 GB RAM
Storage            400 TB
Hadoop version     2.7.1
JDK                1.8.0

6-node cluster: 1 NameNode and 5 DataNodes, driven by the ssNSGA-II optimizer over a public network
Experimental Results

[Figure: Pareto fronts of execution time (sec) vs. cost (yen) for the sort and terasort workloads, with points for the small, medium, large, and x-large instance types]

Population size = 30, number of evaluations = 180, number of objectives = 2,
mutation probability = 0.1, crossover probability = 1.0

* These workloads are significantly affected by the HDFS and MapReduce parameters
Experimental Results Cont'd

[Figure: Pareto fronts of execution time (sec) vs. cost (yen) for the pagerank and wordcount workloads, with points for the small, medium, large, and x-large instance types]

Population size = 30, number of evaluations = 180, number of objectives = 2,
mutation probability = 0.1, crossover probability = 1.0

* These workloads depend more on the YARN and YARN-related parameters than on the HDFS and MapReduce parameters
Conclusion & Continuing Work
◦ Offline Hadoop configuration optimization using the ssNSGA-II based search strategy
◦ An x-large instance type cluster is not a suitable option for the current workloads and input data size
◦ Large or medium instance type clusters show the best balance for our objective functions
◦ Continuing work: dynamic cluster resizing through containers and online configuration optimization of M/R workloads for scientific workflow applications, for effective Big Data processing