Scheduling MapReduce Jobs in HPC Clusters
Marcelo Neves, Tiago Ferreto, Cesar De Rose marcelo.neves@acad.pucrs.br
Faculty of Informatics, PUCRS, Porto Alegre, Brazil
August 30, 2012
Outline
• Introduction
• HPC Clusters and MapReduce
• MapReduce Job Adaptor
• Evaluation
• Conclusion
Introduction
• MapReduce (MR)
– A parallel programming model
– Simplicity, efficiency, and high scalability
– Has become a de facto standard for large-scale data analysis
• MR has also attracted the attention of the HPC community
– A simpler approach to the parallelization problem
– Highly visible cases where MR has been used successfully by companies such as Google, Facebook, and Yahoo!
HPC Clusters and MapReduce
• HPC clusters
– Shared among multiple users/organizations
– Managed by a Resource Management System (RMS), such as PBS/Torque
– Applications are submitted as batch jobs
– Users must explicitly allocate resources, specifying the number of nodes and the amount of time
• MR implementations (e.g., Hadoop)
– Have their own complete job management system
– Users do not have to explicitly allocate resources
– Require a dedicated cluster
Problem
• Two distinct clusters are required
• How can MapReduce jobs run in an existing HPC cluster alongside regular HPC jobs?
Current solutions
• Hadoop on Demand (HOD) and MyHadoop
– Create on-demand MR installations as RMS jobs
– Not transparent: users must still specify the number of nodes and the amount of time to be allocated
• Mesos
– Shares a cluster among multiple different frameworks
– Creates another level of resource management
– Management is taken away from the cluster's RMS
MapReduce Job Adaptor

[Architecture diagram: an HPC user submits an HPC job (# of nodes, time) directly to the Resource Management System (RMS), while an MR user submits an MR job (# of map tasks, # of reduce tasks, job profile) to the MR Job Adaptor, which translates it into an RMS job (# of nodes, time) and forwards it to the cluster.]
MapReduce Job Adaptor
• The adaptor has three main goals:
– Facilitate the execution of MR jobs in HPC clusters
– Minimize the average turnaround time of the jobs
– Exploit unused resources in the cluster (a result of the various shapes of HPC job requests)
Completion time estimation
• MR performance model by Verma et al.¹
– Job profile with performance invariants
– Estimates upper/lower bounds of job completion time
• N^J_M = number of map tasks of job J
• N^J_R = number of reduce tasks of job J
• S^J_M = number of map slots allocated to job J
• S^J_R = number of reduce slots allocated to job J

1. Verma et al.: ARIA: automatic resource inference and allocation for MapReduce environments (2011)
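The slide names the model's inputs but not its formulas. As a sketch only, the ARIA-style bounds combine those task/slot counts with average and maximum task durations taken from the job profile (the profile numbers below are illustrative assumptions, not data from the paper):

```python
# Sketch of ARIA-style completion-time bounds from a job profile.
# The profile values (avg/max task durations) are illustrative assumptions.

def completion_bounds(n_map, n_reduce, s_map, s_reduce,
                      m_avg, m_max, r_avg, r_max):
    """Estimate lower/upper bounds on MR job completion time (seconds).

    n_map, n_reduce  -- number of map / reduce tasks (N_M, N_R)
    s_map, s_reduce  -- number of map / reduce slots (S_M, S_R)
    m_avg, m_max     -- average / maximum map task duration
    r_avg, r_max     -- average / maximum reduce task duration
    """
    # Lower bound: tasks spread perfectly over the slots.
    low = (n_map / s_map) * m_avg + (n_reduce / s_reduce) * r_avg
    # Upper bound: the longest task ends up in the last wave.
    up = ((n_map - 1) / s_map) * m_avg + m_max \
         + ((n_reduce - 1) / s_reduce) * r_avg + r_max
    return low, up

low, up = completion_bounds(n_map=100, n_reduce=20, s_map=16, s_reduce=8,
                            m_avg=30.0, m_max=45.0, r_avg=60.0, r_max=90.0)
print(low, up)
```

The average of the two bounds is what ARIA uses as its completion-time estimate; the adaptor needs such an estimate to request a walltime from the RMS.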
Algorithm

[Pseudocode of the adaptor's scheduling algorithm (slide figure).]
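The slide's algorithm figure is not recoverable here. As an illustration only of the adaptor's role described on the earlier slides (translating an MR job's task counts and profile into an RMS request), here is a minimal sketch; the function names, the slot-capping rule, and the use of the upper completion-time bound as the walltime are assumptions, not the paper's exact algorithm:

```python
# Illustrative sketch of the adaptor's translation step: pick a node
# count for an MR job and derive a walltime request from its profile.
# The candidate-selection rule here is an assumption, not the paper's
# published algorithm.

SLOTS_PER_NODE = 2  # matches the simulated cluster: nodes with 2 cores

def upper_bound(n_map, n_reduce, slots, profile):
    """ARIA-style upper bound with `slots` shared by both phases."""
    m_avg, m_max, r_avg, r_max = profile
    t = ((n_map - 1) / slots) * m_avg + m_max
    if n_reduce > 0:
        t += ((n_reduce - 1) / slots) * r_avg + r_max
    return t

def adapt(n_map, n_reduce, profile, max_nodes):
    """Return an RMS request (nodes, walltime) for an MR job.

    A job cannot use more slots than it has tasks, so cap the
    allocation at that point and request a conservative walltime.
    """
    useful = max(n_map, n_reduce, 1)  # tasks limit useful slots
    nodes = min(max_nodes, (useful + SLOTS_PER_NODE - 1) // SLOTS_PER_NODE)
    walltime = upper_bound(n_map, n_reduce, nodes * SLOTS_PER_NODE, profile)
    return nodes, walltime
```

For example, `adapt(10, 3, (30.0, 45.0, 60.0, 90.0), 128)` requests 5 nodes (10 slots) rather than the full cluster, leaving the remaining nodes free for other jobs.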
Evaluation
• Simulated environment (using the SimGrid toolkit)
– Cluster composed of 128 nodes with 2 cores each
– RMS based on the Conservative Backfilling (CBF) algorithm
– Stream of job submissions
• HPC workload
– Synthetic workload based on the model by Lublin et al.¹
– Real-world HPC traces from the Parallel Workloads Archive (SDSC SP2)
• MR workload
– Synthetic workload derived from Facebook workloads described by Zaharia et al.²

1. Lublin et al.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs (2003)
2. Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010)
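The deck names Conservative Backfilling but does not describe it. As a minimal sketch, assuming discrete reservations on a fixed 128-node cluster: every job receives a reservation at submission time in the earliest gap that does not delay any previously reserved job, so later small jobs can backfill into holes left by large ones.

```python
# Minimal sketch of Conservative Backfilling (CBF), the scheduling
# policy of the simulated RMS. Simplified: every job is reserved at
# submission and reservations are never moved.

TOTAL_NODES = 128  # simulated cluster size

def free_nodes(reservations, t):
    """Nodes not reserved at time t."""
    return TOTAL_NODES - sum(n for (s, e, n) in reservations if s <= t < e)

def earliest_start(reservations, nodes, walltime, now=0):
    """Earliest time at which `nodes` are free for the whole walltime."""
    # Availability only changes at reservation boundaries, so it is
    # enough to try `now` and every reservation end as a start time...
    candidates = sorted({now} | {e for (_, e, _) in reservations if e > now})
    for start in candidates:
        # ...and to check the start plus every reservation start that
        # falls inside the interval (free nodes only drop at starts).
        points = {start} | {s for (s, _, _) in reservations
                            if start <= s < start + walltime}
        if all(free_nodes(reservations, t) >= nodes for t in points):
            return start

def submit(reservations, nodes, walltime, now=0):
    """Reserve the earliest feasible slot for a new job."""
    start = earliest_start(reservations, nodes, walltime, now)
    reservations.append((start, start + walltime, nodes))
    return start
```

A job needing the whole cluster blocks the schedule, but a later narrow job can still start inside an earlier hole, which is exactly the slack the adaptor exploits.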
Turnaround Time and System Utilization
• Workload:
– HPC: "peak hour" of Lublin's model
– MR: one hour of Facebook-like job submissions
• The adaptor obtained shorter turnaround times and better cluster utilization in all cases
– MR-only: turnaround was reduced by ≈ 40%
– HPC+MR: overall turnaround was reduced by ≈ 15%
– HPC+MR: turnaround of MR jobs was reduced by ≈ 73%
Influence of the Job Size
• Shorter turnaround regardless of the job size
• Better results for bins with smaller jobs

[Two plots: average turnaround time (minutes) per bin (1–9), Naive vs. Adaptor.]

Job sizes in Facebook workload (based on Zaharia et al.):

Bin   # Map Tasks   # Reduce Tasks   % Jobs at Facebook
 1          1              0                39%
 2          2              0                16%
 3         10              3                14%
 4         50              0                 9%
 5        100              0                 6%
 6        200             50                 6%
 7        400              0                 4%
 8        800            180                 4%
 9       2400              0                 3%
Influence of System Load

[Four plots: average turnaround time (minutes) for the Adaptor vs. Naive algorithms, as a function of the mean MR job inter-arrival time and the mean HPC job inter-arrival time (1–30 seconds).]
Real-world Workload
• Workload:
– HPC: a day-long trace from SDSC SP2
– MR: 1000 Facebook-like MR jobs
• The adaptor's algorithm performed better in all cases

[Plots annotated with ≈ 54% and ≈ 80%.]
Conclusion
• Although MR has gained attention from the HPC community, the question remains of how to run MR jobs along with regular HPC jobs in an HPC cluster
• MR Job Adaptor
– Allows transparent MR job submission to HPC clusters
– Minimizes the average turnaround time
– Improves overall utilization by exploiting unused resources in the cluster
Thank you!