Getting More for Less in Optimized MapReduce Workflows


Jaegwang Lim, Dongguk University

Introduction

• MapReduce performance depends on several factors
• The user must specify the number of reduce tasks
• The user must specify the input size

MapReduce Processing Pipeline

Map → Reduce

Platform Performance Model

Read → Collect → Spill → Merge → Shuffle → Write

• Record Dataset (Input size & Duration)

Platform Performance Model

Read
No.   Data (MB)   Read (msec)
1     16          2010
2     16          2011
3     32          4056
4     18          2200
…     …           …

Collect
No.   Data (MB)   Collect (msec)
1     8           1210
2     8           1350
3     16          2455
4     16          2411
…     …           …

Spill
No.   Data (MB)   Spill (msec)
1     16          3213
2     16          3222
3     24          4002
4     16          3200
…     …           …

…
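As an illustration of how such recorded samples could be turned into a model, the sketch below fits a straight line to the four Read samples listed above. The slides do not state the model form, so the linear relationship and the helper names `fit_line` and `predict` are assumptions made for this example only.

```python
# Read-phase samples listed above: (data size in MB, duration in msec).
READ_SAMPLES = [(16, 2010), (16, 2011), (32, 4056), (18, 2200)]


def fit_line(samples):
    """Ordinary least-squares fit of duration = intercept + slope * size.

    Assumption: the phase duration grows linearly with the data size.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, size_mb):
    """Estimate the phase duration (msec) for a given data size (MB)."""
    intercept, slope = model
    return intercept + slope * size_mb


read_model = fit_line(READ_SAMPLES)
estimate_24mb = predict(read_model, 24)
print(f"slope: {read_model[1]:.1f} msec/MB, estimate for 24 MB: {estimate_24mb:.0f} msec")
```

The same fit could be repeated for each phase (Collect, Spill, and so on), giving one small model per phase.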

Platform Performance Model

[Figure: one panel per phase: Read, Collect, Spill, Merge, Shuffle, Write]

Platform Performance Model

• Evaluation error
• 2.5 GB: less than 10%

Work-Flow Performance Model

• Dataset
  • The overall input data size
  • The Map/Reduce selectivity
  • The processing time per record of the Map/Reduce function

[Figure: InputSize → Map/Reduce → OutputSize]

Selectivity = Output size / Input size

Work-Flow Performance Model

• Record Dataset
• Pig program: TPC-H query

Reduce Selectivity = 0.9

Suggested number of reduce tasks: 128 × 0.9 ≈ 115
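The arithmetic on this slide can be written out as a small sketch. The function name `suggested_reduce_tasks` is hypothetical, the slides do not say what the baseline of 128 represents, and truncating 115.2 down to 115 is an assumption chosen to match the figure shown.

```python
def suggested_reduce_tasks(baseline_tasks, reduce_selectivity):
    """Scale a baseline reduce-task count by the measured reduce selectivity.

    Assumption: the result is truncated to a whole number of tasks,
    and at least one reduce task is always kept.
    """
    if baseline_tasks < 1:
        raise ValueError("baseline_tasks must be at least 1")
    if reduce_selectivity <= 0:
        raise ValueError("reduce_selectivity must be positive")
    return max(1, int(baseline_tasks * reduce_selectivity))


# Values from the slide: baseline 128, reduce selectivity 0.9.
suggested = suggested_reduce_tasks(128, 0.9)
print(suggested)
```

With a selectivity below 1 the suggestion shrinks the task count; a selectivity above 1 would grow it by the same rule.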

Conclusion

• Automated performance management system
• Helps users optimize their MapReduce applications