Getting More for Less in Optimized MapReduce Workflows

Jaegwang Lim, Dongguk University

Introduction

• MapReduce performance depends on several factors
• The user must specify the number of reduce tasks
• The user must specify the input size

MapReduce Processing Pipeline

Map → Reduce

Platform Performance Model

Phases: Read, Collect, Spill, Merge, Shuffle, Write

• Record a dataset for each phase (input size and duration)

Platform Performance Model

Read
No.   Data (MB)   Read (msec)
1     16          2010
2     16          2011
3     32          4056
4     18          2200
…     …           …

Collect
No.   Data (MB)   Collect (msec)
1     8           1210
2     8           1350
3     16          2455
4     16          2411
…     …           …

Spill
No.   Data (MB)   Spill (msec)
1     16          3213
2     16          3222
3     24          4002
4     16          3200
…     …           …

…
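The recorded (data size, duration) pairs can be turned into a per-phase predictor. The sketch below is one plausible way to do that, assuming the phase duration grows roughly linearly with the amount of data processed; the slides do not state the model form, so the linear fit and the helper names `fit_line` and `predict` are illustrative assumptions. It uses only the four Read rows shown in the table.

```python
# Read-phase measurements copied from the table: (data in MB, duration in msec).
READ_SAMPLES = [(16, 2010), (16, 2011), (32, 4056), (18, 2200)]


def fit_line(samples):
    """Ordinary least-squares fit of duration = intercept + slope * size.

    Assumption: a straight line is an adequate model for one phase.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, size_mb):
    """Predicted duration in msec for a given data size in MB."""
    intercept, slope = model
    return intercept + slope * size_mb


read_model = fit_line(READ_SAMPLES)
print("slope (msec per MB):", round(read_model[1], 1))
print("predicted Read time for 24 MB:", round(predict(read_model, 24)), "msec")
```

The same fit would be repeated for Collect, Spill, and the remaining phases, giving one small model per phase.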

Platform Performance Model
Figure panels: Read, Collect, Spill, Merge, Shuffle, Write

Platform Performance Model
• Evaluation Error

Platform Performance Model
• 2.5 GB input: error less than 10%

Workflow Performance Model
• Dataset
• The overall input data size
• The Map/Reduce selectivity
• The processing time per record of each Map/Reduce function

InputSize → Map/Reduce → OutputSize

Selectivity = Output size / Input size
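As a small illustration, selectivity can be computed directly from the measured sizes of a Map or Reduce stage. This sketch assumes selectivity is the ratio of a stage's output size to its input size, so a value below 1 means the stage shrinks the data. The function name and the sample sizes are hypothetical, not taken from the slides.

```python
def selectivity(input_size, output_size):
    """Ratio of a stage's output size to its input size (same units for both)."""
    if input_size <= 0:
        raise ValueError("input_size must be positive")
    return output_size / input_size


# Hypothetical stage that reads 1000 MB and writes 900 MB.
stage_selectivity = selectivity(1000, 900)
print(stage_selectivity)
```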

Workflow Performance Model
• Record a dataset
• Pig program: TPC-H query

Reduce Selectivity = 0.9

Suggested number of reduce tasks: 128 × 0.9 ≈ 115
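The arithmetic on this slide can be sketched as a one-line helper. The slides do not say what the 128 stands for; the code assumes it is a baseline number of reduce tasks that gets scaled by the reduce selectivity. The name `suggested_reduce_tasks` and the floor of one task are assumptions added for illustration.

```python
def suggested_reduce_tasks(baseline_tasks, reduce_selectivity):
    """Scale a baseline reduce-task count by the reduce selectivity.

    Assumption: the result is rounded to the nearest integer and never
    drops below one task.
    """
    return max(1, round(baseline_tasks * reduce_selectivity))


# Values from the slide: 128 * 0.9 = 115.2, rounded to 115.
print(suggested_reduce_tasks(128, 0.9))
```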

Conclusion
• Automated performance management system
• Helps users optimize their MapReduce applications
