Scheduling MapReduce Jobs in HPC Clusters
Marcelo Neves, Tiago Ferreto, Cesar De Rose marcelo.neves@acad.pucrs.br
Faculty of Informatics, PUCRS, Porto Alegre, Brazil
August 30, 2012
Outline
• Introduction
• HPC Clusters and MapReduce
• MapReduce Job Adaptor
• Evaluation
• Conclusion
Introduction
• MapReduce (MR)
– A parallel programming model
– Simplicity, efficiency, and high scalability
– Has become a de facto standard for large-scale data analysis
• MR has also attracted the attention of the HPC community
– A simpler approach to the parallelization problem
– Highly visible cases where MR has been used successfully by companies such as Google, Facebook, and Yahoo!
HPC Clusters and MapReduce
• HPC clusters
– Shared among multiple users/organizations
– Managed by a Resource Management System (RMS), such as PBS/Torque
– Applications are submitted as batch jobs
– Users must explicitly allocate resources, specifying the number of nodes and the amount of time
• MR implementations (e.g., Hadoop)
– Have their own complete job management system
– Users do not have to explicitly allocate resources
– Require a dedicated cluster
Problem
• Two distinct clusters are required
• How can MapReduce jobs run in an existing HPC cluster alongside regular HPC jobs?
Current solutions
• Hadoop on Demand (HOD) and MyHadoop
– Create on-demand MR installations as RMS jobs
– Not transparent: users must still specify the number of nodes and the amount of time to be allocated
• Mesos
– Shares a cluster among multiple different frameworks
– Creates another level of resource management
– Management is taken away from the cluster's RMS
MapReduce Job Adaptor

[Architecture diagram: an HPC user submits an HPC job (# of nodes, time) directly to the Resource Management System (RMS), while an MR user submits an MR job (# of map tasks, # of reduce tasks, job profile) to the MR Job Adaptor, which translates it into an RMS job (# of nodes, time) and forwards it to the cluster.]
MapReduce Job Adaptor
• The adaptor has three main goals:
– Facilitate the execution of MR jobs in HPC clusters
– Minimize the average turnaround time of the jobs
– Exploit unused resources in the cluster (a result of the various shapes of HPC job requests)
Completion time estimation
• MR performance model by Verma et al.¹
– Job profile with performance invariants
– Estimates upper/lower bounds of job completion time
• N^J_M = number of map tasks of job J
• N^J_R = number of reduce tasks of job J
• S^J_M = number of map slots allocated to job J
• S^J_R = number of reduce slots allocated to job J

1. Verma et al.: ARIA: automatic resource inference and allocation for MapReduce environments (2011)
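The slide names the model's inputs but not its formulas. As a sketch only, the ARIA-style bounds combine those task/slot counts with average and maximum task durations taken from the job profile (the profile numbers below are illustrative assumptions, not data from the paper):

```python
# Sketch of ARIA-style completion-time bounds from a job profile.
# The profile values (avg/max task durations) are illustrative assumptions.

def completion_bounds(n_map, n_reduce, s_map, s_reduce,
                      m_avg, m_max, r_avg, r_max):
    """Estimate lower/upper bounds on MR job completion time (seconds).

    n_map, n_reduce  -- number of map / reduce tasks (N_M, N_R)
    s_map, s_reduce  -- number of map / reduce slots (S_M, S_R)
    m_avg, m_max     -- average / maximum map task duration
    r_avg, r_max     -- average / maximum reduce task duration
    """
    # Lower bound: tasks spread perfectly over the slots.
    low = (n_map / s_map) * m_avg + (n_reduce / s_reduce) * r_avg
    # Upper bound: the longest task ends up in the last wave.
    up = ((n_map - 1) / s_map) * m_avg + m_max \
         + ((n_reduce - 1) / s_reduce) * r_avg + r_max
    return low, up

low, up = completion_bounds(n_map=100, n_reduce=20, s_map=16, s_reduce=8,
                            m_avg=30.0, m_max=45.0, r_avg=60.0, r_max=90.0)
print(low, up)
```

The average of the two bounds is what ARIA uses as its completion-time estimate; the adaptor needs such an estimate to request a walltime from the RMS.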
Algorithm

[Pseudocode of the adaptor's scheduling algorithm (slide figure).]
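The slide's algorithm figure is not recoverable here. As an illustration only of the adaptor's role described on the earlier slides (translating an MR job's task counts and profile into an RMS request), here is a minimal sketch; the function names, the slot-capping rule, and the use of the upper completion-time bound as the walltime are assumptions, not the paper's exact algorithm:

```python
# Illustrative sketch of the adaptor's translation step: pick a node
# count for an MR job and derive a walltime request from its profile.
# The candidate-selection rule here is an assumption, not the paper's
# published algorithm.

SLOTS_PER_NODE = 2  # matches the simulated cluster: nodes with 2 cores

def upper_bound(n_map, n_reduce, slots, profile):
    """ARIA-style upper bound with `slots` shared by both phases."""
    m_avg, m_max, r_avg, r_max = profile
    t = ((n_map - 1) / slots) * m_avg + m_max
    if n_reduce > 0:
        t += ((n_reduce - 1) / slots) * r_avg + r_max
    return t

def adapt(n_map, n_reduce, profile, max_nodes):
    """Return an RMS request (nodes, walltime) for an MR job.

    A job cannot use more slots than it has tasks, so cap the
    allocation at that point and request a conservative walltime.
    """
    useful = max(n_map, n_reduce, 1)  # tasks limit useful slots
    nodes = min(max_nodes, (useful + SLOTS_PER_NODE - 1) // SLOTS_PER_NODE)
    walltime = upper_bound(n_map, n_reduce, nodes * SLOTS_PER_NODE, profile)
    return nodes, walltime
```

For example, `adapt(10, 3, (30.0, 45.0, 60.0, 90.0), 128)` requests 5 nodes (10 slots) rather than the full cluster, leaving the remaining nodes free for other jobs.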
Evaluation
• Simulated environment (using the SimGrid toolkit)
– Cluster composed of 128 nodes with 2 cores each
– RMS based on the Conservative Backfilling (CBF) algorithm
– Stream of job submissions
• HPC workload
– Synthetic workload based on the model by Lublin et al.¹
– Real-world HPC traces from the Parallel Workloads Archive (SDSC SP2)
• MR workload
– Synthetic workload derived from Facebook workloads described by Zaharia et al.²

1. Lublin et al.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs (2003)
2. Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010)
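The deck names Conservative Backfilling but does not describe it. As a minimal sketch, assuming discrete reservations on a fixed 128-node cluster: every job receives a reservation at submission time in the earliest gap that does not delay any previously reserved job, so later small jobs can backfill into holes left by large ones.

```python
# Minimal sketch of Conservative Backfilling (CBF), the scheduling
# policy of the simulated RMS. Simplified: every job is reserved at
# submission and reservations are never moved.

TOTAL_NODES = 128  # simulated cluster size

def free_nodes(reservations, t):
    """Nodes not reserved at time t."""
    return TOTAL_NODES - sum(n for (s, e, n) in reservations if s <= t < e)

def earliest_start(reservations, nodes, walltime, now=0):
    """Earliest time at which `nodes` are free for the whole walltime."""
    # Availability only changes at reservation boundaries, so it is
    # enough to try `now` and every reservation end as a start time...
    candidates = sorted({now} | {e for (_, e, _) in reservations if e > now})
    for start in candidates:
        # ...and to check the start plus every reservation start that
        # falls inside the interval (free nodes only drop at starts).
        points = {start} | {s for (s, _, _) in reservations
                            if start <= s < start + walltime}
        if all(free_nodes(reservations, t) >= nodes for t in points):
            return start

def submit(reservations, nodes, walltime, now=0):
    """Reserve the earliest feasible slot for a new job."""
    start = earliest_start(reservations, nodes, walltime, now)
    reservations.append((start, start + walltime, nodes))
    return start
```

A job needing the whole cluster blocks the schedule, but a later narrow job can still start inside an earlier hole, which is exactly the slack the adaptor exploits.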
Turnaround Time and System Utilization
• Workload:
– HPC: "peak hour" of Lublin's model
– MR: one hour of Facebook-like job submissions
• The adaptor obtained shorter turnaround times and better cluster utilization in all cases
– MR-only: turnaround was reduced by ≈ 40%
– HPC+MR: overall turnaround was reduced by ≈ 15%
– HPC+MR: turnaround of MR jobs was reduced by ≈ 73%
Influence of the Job Size
• Shorter turnaround regardless of the job size
• Better results for bins with smaller jobs

[Two plots: average turnaround time (minutes) per bin (1–9), Naive vs. Adaptor.]

Job sizes in Facebook workload (based on Zaharia et al.):

Bin   # Map Tasks   # Reduce Tasks   % Jobs at Facebook
 1          1              0                39%
 2          2              0                16%
 3         10              3                14%
 4         50              0                 9%
 5        100              0                 6%
 6        200             50                 6%
 7        400              0                 4%
 8        800            180                 4%
 9       2400              0                 3%
Influence of System Load

[Four plots: average turnaround time (minutes) for the Adaptor vs. Naive algorithms, as a function of the mean MR job inter-arrival time and the mean HPC job inter-arrival time (1–30 seconds).]
Real-world Workload
• Workload:
– HPC: a day-long trace from SDSC SP2
– MR: 1000 Facebook-like MR jobs
• The adaptor's algorithm performed better in all cases

[Plots annotated with ≈ 54% and ≈ 80%.]
Conclusion
• Although MR has gained attention from the HPC community, the question remains of how to run MR jobs along with regular HPC jobs in an HPC cluster
• MR Job Adaptor
– Allows transparent MR job submission to HPC clusters
– Minimizes the average turnaround time
– Improves overall utilization by exploiting unused resources in the cluster
Thank you!