A New Analytical Technique for Designing Provably Efficient
MapReduce Schedulers
Yousi ZhengDept. of ECE, The Ohio State UniversityE-mail: [email protected]
Joint work with Ness B. ShroffPrasun Sinha
2
MapReduceData Centers
Facility containing a very large number of machinesThey process very large datasets: MapReduce
MapReduceFramework for processing highly parallel problems Developed (popularized) by Google Nearly ubiquitous (Google, IBM, Facebook, Microsoft,
Yahoo, Amazon, eBay, twitter…)Used in a variety of different applications
Distributed Grep, distributed sort, AI, scientific computation, Large scale pdf generation, Geographical data, Image processing…
3
MapReduceMap phase:
Takes an input job and divides into many small sub-problems
Operations can run in parallel on potentially different machines
Reduce phaseCombines the output of MapOccurs after the Map phase
is completedRuns on parallel machines…
Goal: Schedule these Map and Reduce tasks in order to minimize the total flow time in the system
4
Time is slottedN “Machines”: Each machine runs 1 unit of workload per slot
Scheduling Constraint: Reduce job startsafter all tasks in the Map job are completed.
Reduce
…Data Center with N machines
…
n jobs over T slots
Each job i consists of Map tasks and Reduce tasks
Model
Each Map task has 1 unit of workloadTotal Mi tasks for Map job i
Each Reduce task can have multiple units of workload
Ri (k)
units of workload for task k for job i
units of workload for Reduce job i
1
MapMi
Ri (k)
job i
5
Preemptive Reduce tasks can be interrupted
at the end of a slot Remaining workload in the task can be
executed on any machine(s) Reasonable if overhead
of data-migration is small Non-preemptive
Types of Schedulers: Treatment of Reduce Tasks
R1 R1
R2
Machines
Time
Machine C
Machine A
Machine B
0 1 2 3 4 5 6 7
6
Preemptive Reduce tasks can be interrupted
at the end of a slot Remaining workload in the task can be
executed on any machine(s) Reasonable if overhead
of data-migration is small Non-preemptive
Once a Reduce task is started, it can’t be interrupted till the end of the task
An individual Reduce task cannot be completed on different machines But different tasks from same job can be assigned to different
machines Note: Since Map tasks have unit workload, they cannot be
interrupted. In practice Map tasks may be > 1 unit, but small
Types of Schedulers: Treatment of Reduce Tasks
R1 R1
R2
Machines
Time
Machine C
Machine A
Machine BR2
0 1 2 3 4 5 6 7
7
Flow-Time Minimization Problem: Preemptive Scenario
Total # machines is N
Total map workload is Mi for job i
Total reduce workload is Ri for job i
Minimize flow-time
Scheduling Constraint
8
Flow-Time Minimization ProblemNon-Preemptive Scenario
Total # machines is N
Total map workload is Mi for job i
Total reduce workload is R(k)
i for task k in job i
Non-preemptive
Minimize flow-time
Both Problems are NP-hard in the strong sense
9
Competitive Ratio
TheoremFor any constant c0 and any online scheduler S, there are sequences of arrivals and workloads in the non-preemptive case, such that the competitive ratio c is greater than c0
No policy gives a finite competitive ratio
: set of all possible schedulers (including non-causal) : Total Flow time of scheduler (sum of FT of all jobs)Define
Competitive ratio: A given online policy S is said to have a competitive ratio of c if for any arrival pattern
10
Efficiency Ratio AnalysisEfficiency ratio: A given online policy S is said to have
an efficiency ratio of if (over some probability space of arrivals)
Theorem● If the workload of Reduce tasks of each job is
light-tailed distributed, then Any work-conserving scheduler has a constant
efficiency ratio, for both preemptive and non-preemptive scenarios.
11
So FarNo scheduler has a finite competitive ratio in non-
preemptive scenarioAn evaluation metric efficiency ratio is introduced
All work conserving schedulers have a finite efficiency ratio () for both preemptive & non-preemptive scenariosBut the performance of these schedulers could still be
quite poor ( could be very large)Key Question: Can we design a simple scheduler
with a provably small efficiency ratio in the MapReduce paradigm?
Preemptive scenario: Yes! (ASRPT)
12
ASRPT: Available Shortest Remaining Processing Time
SRPT: An infeasible schedulerPriority given to jobs with smallest remaining workload In SRPT, Map and Reduce for a given job can be
scheduled in the same time-slot (infeasible in reality)SRPT results in a lower flow time than any feasible
(including non-causal) schedulerASRPT
Map tasks are scheduled according to SRPT (or sooner) By running SRPT in a virtual manner
Reduce tasks greedily fill up the remaining machines In the order determined by the shortest remaining processing
time
13
ASRPT: Available Shortest Remaining Processing Time
R1
M2M1
R21
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s, N
Time: Total Flow-time 4 units
R1R1
M2M1
R21
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s, N
Time
R1
: Total Flow-time 5 unitsSRPT ASRPT
R1
Theorem:If the job arrivals and workload are i.i.d., then ASRPT has an efficiency ratio of 2 in the preemptive scenario.
14
Simulation• N = 100 machines, total time slots T = 800 • # Reduce tasks in each job is 10• Job arrival process: Poisson; λ = 2 jobs per time slot.• Preemptive Scenario• Schedulers
• ASRPT• Fair• FIFO
• ASRPT has smallerefficiency ratio thanFair and FIFO.
15
Conclusion• Investigated the flow time minimization problem in MapReduce
schedulers
• No online policy has a finite competitive ratio in non-preemptive scenarios
• All work-conserving schedulers have constant efficiency ratios • Preemptive and non-preemptive scenarios.
• A specific algorithm (ASRPT) can guarantee an efficiency ratio of 2 in the preemptive scenario. • Is efficiency ratio of ASRPT similarly small in non-pre-emptive case?
• Efficiency Ratio based analysis provides worse case (w.p.1) guarantees (compared to competitive ratio), but gives more design flexibility.• Has the potential to be applied to other problems/domains
16
Consider cost of data migration Combine preemptive and non-preemptive scenarios
Consider multiple phases With phase precedence (for complex processing)
Develop Green Energy solutions Combine with sleep-wake scheduling Consider renewable energy fluctuations
Future Work
Consider Shuffle Phase Introduce more information (e.g.,
dependency Graph) to the scheduler So that all reduce tasks don’t wait
until Map tasks are completed
17
Thank You!
18
Backup Slides
19
Main Results Introduced a metric Efficiency Ratio to evaluate the
performance of schedulers in MapReduce Proved that no online scheduler has a finite competitive ratio Guarantee with probability 1 May have greater modeling implications for analysis of algorithms
Proved that Efficiency Ratio is bounded by a constant for any work-conserving scheduler For both bounded and light-tailed workload for Reduce phase For both preemptive and non-preemptive scenarios
Developed a specific work-conserving algorithm (ASRPT) that has an efficiency ratio of 2 for preemptive scenario.
20
For any work-conserving scheduling policy: Preemptive:
Non-preemptive:
Similar result holds but with different constants ( )
Efficiency Ratio: Preemptive & Non-preemptive
where p0 = Prob (a slot has no job arrivals)
21
MapReduce: FIFO scheduler
R1M2
M1R2
1
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s
Time
Flow-time of a job: time spent by job in the system
FT of Job 1: 3 time slots FT of Job 2: 3 time slots Total FT : 6 time slots
FCFS scheduler with 4 machines, 2 jobs
Map must finish before Reduce can start
R1
M1
R1
M2 R2
22
MapReduce: Smarter Scheduler
R1
M2M1
R2
1
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s
Time
M1
R1
M2 R2
Flow-time: time in the system
FT of Job 1: 3 time slotsFT of Job 2: 2 time slotsTotal FT: 5 time slots
Goal: Minimize total flow-time
R1
“Smarter Scheduler”
23
Consider cost of data migrationCombine preemptive and non-preemptive scenarios
Consider multiple phases With phase precedence (for complex processing)
Develop Green Energy solutionsCombine with sleep-wake schedulingConsider renewable energy fluctuations
Future Work
Consider Shuffle Phase Introduce more information (e.g.,
dependency Graph) to the scheduler
So that all reduce tasks don’t wait until Map tasks are completed
24
Analysis Outline
Divide up time into repeating intervals Interval corresponds to the consecutive times slots until a
time-slot occurs with an idle machine(s)
Bound the moments of these intervals with constants
Bound the efficiency ratio using the moments
Analysis applies to: Preemptive and non-preemptive scenariosAllows jobs to have infinite but light-tailed support
25
Preemptive Scenario: An ExampleComplete time-slot (all machines used)Incomplete time-slot (all machines not used)Kj: Length of j-th interval (ends in incomplete slot)
Observations for a job arriving in interval j (work-conserving)
Map task finishes in interval j (otherwise last time-slot would be complete)
Reduce tasks finish in interval j or interval j+1
Flow time per job is bounded, wp1, if the interval sizes (moments) are bounded
g is bounded wp1.
26
Non-preemptive Scenario: An Example
Observation for reduce tasks: Unlike preemptive scenario, in the non-
preemptive scenario, each job arriving in interval j will start (not finish) all its reduce tasks in interval j or interval j+1
Reduce tasks must be finished by the end of interval j+1 plus time Ri-1.
Hence, if (all moments) of the interval are bounded then the flow times (per job) are bounded wp1 g is bounded wp1.
27
Lemma: If the scheduler is work-conserving, then for any given number , there exists a constant , such that , for all j.
where
Bounds on the Moments
Rough idea: Since the traffic intensity ρ<1, and the workload in each
slot is light-tailed, the number of consecutive complete time slots will decay exponentially (Chernoff’s Theorem)
The moments of interval K are bounded
28
Many unanswered questions
Competitive Ratio Efficiency Ratio
Preemptive ?Constant for any work-conserving scheduler
2 for ASRPT
Non-preemptive
No constant for any scheduler
Constant for any work-conserving scheduler
? for ASRPT
29
Main schedulers proposed for MapReduce framework
FIFO scheduler:Default in HadoopSuffers from the well known head-of-line blocking
problemFair scheduler [Zaharia’09]
Mitigate head-of-line blocking problem of FIFOJobs stick to the machines on which they are initially
scheduledCoupling scheduler [Tan’12]
Mitigate the problem that jobs stick to machines
30
Consider machines with speed-up [Moseley’11] O(1/ ε5) competitive algorithm with (1 + ε) speed for
the online case, where 0 < ε ≤ 1Speed-up of machines is necessary in the algorithm (if
ε decreases to 0, the competitive ratio will increase to infinity correspondingly).
Consider minimizing total completion time [Chang’11, Chen’12]Minimizing total completion time = minimizing total
flow time. Competitive ratio of “minimizing total completion time”
≠ competitive ratio of “minimizing total flow time”.
Main schedulers proposed for MapReduce framework
31
ASRPT runs SRPT in a virtual fashion and keeps track of how SRPT would have scheduled jobs in any given slot.
ASRPT assigns machines to the tasks in the following priority order (from high to low): (Only in the non-preemptive scenario) The previously
scheduled Reduce tasks which are not yet finished,The Map tasks which are scheduled in the SRPT, The available Reduce tasks, The available Map tasks.
For each priority group, the algorithm greedily assign machines to the corresponding tasks through the sorted available workload.
ASRPT
32
Light-tailed Distributed Workload
where
33
SimulationsPreemptive Non-preemptive
34
Main Results Competitive Ratio: Deterministic guarantee under adversarial arrivals
Unbounded for all non-preemptive scheduler
Efficiency Ratio Asymptotic Guarantee with probability 1 May have greater modeling implications for analysis of algorithms
Efficiency Ratio is bounded by a constant for any work-conserving scheduler For bounded workload for Reduce phase (Ri < Rmax) For light-tailed workload for Reduce phase
An Algorithm (ASRPT) with efficiency ratio of 2 for preemptive scenario.
35
Non-preemptive One task can only be allocated to one machine (preserves
data locality) Once started cannot be interrupted till the end of the task
Types of Schedulers: Treatment of Reduce Tasks
36
Preemptive Every task can be interrupted at the end of a slot Remaining workload can be executed on any machine
Types of Schedulers: Treatment of Reduce Tasks
37
More in Practice
Shuffle Phase
38
ASRPT: Available Shortest Remaining Processing Time [Example]
R1
M2M1
R21
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s, N
Time: Total Flow-time 4 units
R1R1
M2M1
R2
1
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s, N
Time
R1
: Total Flow-time 5 unitsSRPT ASRPT
39
ASRPT: Available Shortest Remaining Processing Time [Example]
R1M2
M1R2
1
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s, N
Time: Total Flow-time 6 units
R1
R1
M2M1
R2
1
2
3
4
1 2 3 4
Num
ber o
f mac
hine
s, N
Time
R1
: Total Flow-time 5 unitsFIFO ASRPT
40
Lemma: If the scheduler is work-conserving, then for any given number , there exists a constant , such that , for all j.
where
Bounds on the Moments