Jaegwang Lim, Dongguk University
Introduction
• MapReduce performance depends on several factors
• The user must specify the number of reduce tasks
• The user must specify the input size
MapReduce Processing Pipeline
[Figure: Map stage followed by Reduce stage]
Platform Performance Model
[Figure: pipeline phases Read, Collect, Spill, Merge, Shuffle, Write]
• Record a dataset of input size and duration for each phase
Platform Performance Model

Read
No.   Data (MB)   Read (msec)
1     16          2010
2     16          2011
3     32          4056
4     18          2200
…     …           …

Collect
No.   Data (MB)   Collect (msec)
1     8           1210
2     8           1350
3     16          2455
4     16          2411
…     …           …

Spill
No.   Data (MB)   Spill (msec)
1     16          3213
2     16          3222
3     24          4002
4     16          3200
…     …           …

(Similar tables are recorded for the remaining phases.)
Platform Performance Model
[Figure: one panel per phase: Read, Collect, Spill, Merge, Shuffle, Write]
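As an illustration of how the recorded (data size, duration) pairs could be turned into a per-phase model, here is a minimal sketch that fits a straight line to the Read-phase rows shown in the slides. The linear form is an assumption made for this example; the slides do not state which model is fitted. The names `fit_line` and `predict` are hypothetical helpers.

```python
# Read-phase measurements copied from the slide table: (data in MB, duration in msec).
read_samples = [(16, 2010), (16, 2011), (32, 4056), (18, 2200)]


def fit_line(samples):
    """Ordinary least-squares fit of duration = slope * size + intercept.

    Assumes duration grows linearly with data size (an assumption of this
    sketch, not something the slides state).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, size_mb):
    """Estimate the phase duration in msec for a given data size in MB."""
    slope, intercept = model
    return slope * size_mb + intercept


read_model = fit_line(read_samples)
print(f"msec per MB: {read_model[0]:.1f}")
print(f"estimated Read duration for 24 MB: {predict(read_model, 24):.0f} msec")
```

The same fit could be repeated for the Collect, Spill, Merge, Shuffle, and Write datasets to obtain one model per phase.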
Platform Performance Model
• Evaluation error
• 2.5 GB: less than 10%
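One common way to express such an evaluation error is the relative difference between predicted and measured duration. The sketch below assumes that metric; the slides do not say how the error was computed, and the numbers used are hypothetical placeholders rather than results from the slides.

```python
def relative_error(predicted_msec, measured_msec):
    """Return |predicted - measured| / measured as a fraction."""
    if measured_msec <= 0:
        raise ValueError("measured duration must be positive")
    return abs(predicted_msec - measured_msec) / measured_msec


# Hypothetical values for illustration only; not taken from the slides.
predicted, measured = 4050.0, 4300.0
error = relative_error(predicted, measured)
print(f"relative error: {error:.1%}, below 10%: {error < 0.10}")
```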
Work-Flow Performance Model
• Dataset
  • The overall input data size
  • The Map/Reduce selectivity
  • The processing time per record of each function
[Figure: a Map/Reduce function taking InputSize and producing OutputSize]
Selectivity = Output size / Input size
Work-Flow Performance Model
• Record dataset
• Pig program: TPC-H query
Reduce selectivity = 0.9
Suggested number of reduce tasks: 128 × 0.9 ≈ 115
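The arithmetic above can be sketched as a small helper. The slide does not say what the baseline value 128 represents, so it is treated here simply as a given base count. The function name `suggested_reduce_tasks` and the rounding to the nearest whole task are assumptions of this sketch.

```python
def suggested_reduce_tasks(base_tasks, reduce_selectivity):
    """Scale a baseline task count by the reduce selectivity.

    Rounds to the nearest whole task and never returns fewer than one.
    """
    if base_tasks <= 0 or reduce_selectivity < 0:
        raise ValueError("base_tasks must be positive and selectivity non-negative")
    return max(1, round(base_tasks * reduce_selectivity))


# Values from the slide: baseline 128, reduce selectivity 0.9.
print(suggested_reduce_tasks(128, 0.9))  # 115
```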
Conclusion
• Automated performance management system
• Helps users optimize their Map/Reduce applications