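Department of Computer Science and Information Engineering, Intelligent Systems Laboratory (iLab), Southern Taiwan University of Science and Technology

Optimizing Cloud MapReduce for Processing Stream Data using Pipelining

Source: 2011 UKSim 5th European Symposium on Computer Modeling and Simulation

Authors: Rutvik Karve, Devendra Dahiphale, Amit Chhajer

Presenter: 邵建銘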

Outline

1. Introduction

2. Literature Survey

3. Our Proposed Architecture

4. Advantages, Features And Applications

5. Conclusions And Future Work

Introduction(1/2)

• Cloud MapReduce (CMR) is gaining popularity among small companies for processing large data sets in cloud environments.

• The current implementation of CMR is designed for batch processing of data.

• Processing streaming data in CMR requires significant changes to the existing CMR architecture.

Introduction(2/2)

• We use pipelining between the Map and Reduce phases to support stream data processing.

• In the current implementation, the Reducers do not start working until all Mappers have finished.

• In our architecture, by contrast, the Reduce phase also receives a continuous stream of data and can produce continuous output.
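
The difference between batch and pipelined execution can be illustrated with a small in-memory sketch. Python generators stand in for the stream between the Map and Reduce phases. The names `map_words` and `reduce_counts` are hypothetical and are not part of CMR, and word counting is only an assumed example workload.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) pairs lazily, one input line at a time."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Iterator[dict]:
    """Reduce phase: consume pairs as they arrive and yield a running result."""
    totals: Counter = Counter()
    for word, n in pairs:
        totals[word] += n
        yield dict(totals)  # continuous output after every record


stream = ["a b", "b c"]
# The reducer pulls records from the mapper one by one; it never waits
# for the mapper to finish the whole input.
snapshots = list(reduce_counts(map_words(stream)))
print(snapshots[0])   # result after the first record only
print(snapshots[-1])  # result once the whole stream has been consumed
```

Because both phases are lazy, the same code works on an unbounded input iterator, which is the property a batch-style Reduce phase lacks.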

Literature Survey(1/4)

• MapReduce

– MapReduce is a programming model developed by Google for processing large data sets in a distributed fashion.

– The model consists of two phases: the Map phase and the Reduce phase.
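
As a minimal single-process sketch of the two phases, the example below counts words. The helper names `map_fn`, `reduce_fn` and `run_mapreduce` are illustrative assumptions; a real MapReduce framework distributes the same steps across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(line: str) -> List[Tuple[str, int]]:
    """User-defined Map function: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def reduce_fn(key: str, values: List[int]) -> int:
    """User-defined Reduce function: sum all values seen for one key."""
    return sum(values)


def run_mapreduce(records: Iterable[str]) -> Dict[str, int]:
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:                 # Map phase
        for key, value in map_fn(record):
            groups[key].append(value)      # group intermediate values by key
    # Reduce phase: starts only after every record has been mapped (batch style).
    return {key: reduce_fn(key, values) for key, values in groups.items()}


result = run_mapreduce(["to be or", "not to be"])
print(result)
```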

Literature Survey(2/4)

• Hadoop

– Hadoop is an implementation of the MapReduce programming model developed by Apache.

– It incorporates a distributed file system called HDFS.

– Hadoop is popular for processing huge data sets, especially in social networking, targeted advertisements, internet log processing, etc.

Literature Survey(3/4)

• Cloud MapReduce

– Cloud MapReduce is a light-weight implementation of the MapReduce programming model on top of the Amazon cloud OS.

– The architecture of CMR consists of one input queue, multiple reduce queues, a master reduce queue that holds pointers to the reduce queues, and an output queue that holds the final results.

– The S3 file system is used to store the data to be processed, and SimpleDB is used to communicate the status of the worker nodes.

Literature Survey(4/4)

• Online MapReduce (Hadoop Online Prototype)

– Hadoop Online Prototype (HOP) is a modification of the traditional Hadoop framework that incorporates pipelining between the Map and Reduce phases.

– A downstream dataflow element can begin processing before an upstream producer finishes.

– HOP provides support for processing streaming data.

– It also supports snapshots of output data.

Our Proposed Architecture(1/8)

• The existing implementations (HOP and Cloud MapReduce) suffer from the following drawbacks:

1. HOP is unsuitable for the cloud: it lacks the inherent scalability and flexibility of the cloud.

2. In HOP, the code for handling HDFS, reliability, scheduling, etc. is part of the Hadoop framework itself, which makes it large and heavy-weight.

3. Cloud MapReduce does not support stream data processing.

Our Proposed Architecture(2/8)

• Our proposal aims at bridging the gap between the heavy-weight HOP and the light-weight, scalable Cloud MapReduce implementation by providing support for processing stream data in CMR.

• The challenges involved in the implementation include:

1. Providing support for streaming data at input.

2. A novel design for output aggregation.

3. Handling Reducer failures.

4. Handling windows based on timestamps.

Our Proposed Architecture(3/8)

• Currently, no open-source implementation exists for processing streaming data using MapReduce on top of the cloud.

• To the best of our knowledge, this work is the first to integrate stream data processing capability with MapReduce on Amazon Web Services using EC2.

• The following slides describe the architecture of the Pipelined CMR approach.

– Input, Mapper Operation, Reduce Phase

Our Proposed Architecture(4/8)

• Input

– A drop-box concept can be used, where a folder on S3 holds the data that is to be processed by Cloud MapReduce.

– The user is responsible for placing data in the drop-box, from which it will be sent to the input SQS queue.
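
The drop-box idea can be sketched without any cloud services. In the hypothetical example below, a plain dictionary stands in for the S3 folder and a `deque` stands in for the input SQS queue. The function name `enqueue_new_objects`, the key names, and the choice to enqueue object keys rather than the data itself are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Set


def enqueue_new_objects(dropbox: Dict[str, str], seen: Set[str],
                        input_queue: Deque[str]) -> int:
    """Send each not-yet-seen drop-box key to the input queue; return the count."""
    added = 0
    for key in sorted(dropbox):
        if key not in seen:
            input_queue.append(key)  # assumption: the message is a pointer to the data
            seen.add(key)
            added += 1
    return added


dropbox = {"in/part-0": "a b", "in/part-1": "b c"}  # stand-in for the S3 folder
seen: Set[str] = set()
input_queue: Deque[str] = deque()                    # stand-in for the input SQS queue

first = enqueue_new_objects(dropbox, seen, input_queue)
dropbox["in/part-2"] = "c d"                         # the user drops more data later
second = enqueue_new_objects(dropbox, seen, input_queue)
print(first, second, list(input_queue))
```

Calling the function repeatedly (for example on a timer) turns a static folder into a continuous input stream, since only newly dropped objects are enqueued on each pass.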

Our Proposed Architecture(5/8)

• Mapper Operation

– Whenever a Mapper is free, it pops one message from the input SQS queue, which hides the message from other consumers for the duration of a visibility timeout, and processes it according to the user-defined Map function.

Our Proposed Architecture(6/8)

• Reduce Phase

– The Mapper writes the intermediate records it produces to ReduceQueues.

– ReduceQueues are intermediate staging queues, implemented using SQS, that hold the Mapper output.
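
One way a Mapper could choose a ReduceQueue for each intermediate record is to hash the record's key. The slides do not state the assignment rule, so hash partitioning here is an assumption, as are the names `partition` and `emit`; in-memory deques stand in for the SQS queues.

```python
import zlib
from collections import deque
from typing import Deque, List, Tuple

Record = Tuple[str, int]


def partition(key: str, num_queues: int) -> int:
    """Pick a reduce queue with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_queues


def emit(record: Record, reduce_queues: List[Deque[Record]]) -> None:
    """Mapper side: append an intermediate record to its reduce queue."""
    reduce_queues[partition(record[0], len(reduce_queues))].append(record)


reduce_queues: List[Deque[Record]] = [deque() for _ in range(3)]
for word in "a b a c b a".split():
    emit((word, 1), reduce_queues)

# All records for one key end up in exactly one queue.
queues_with_a = [i for i, q in enumerate(reduce_queues)
                 if any(k == "a" for k, _ in q)]
total = sum(len(q) for q in reduce_queues)
print(queues_with_a, total)
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would send the same key to different queues on different workers.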

Our Proposed Architecture(7/8)

Our Proposed Architecture(8/8)

• Handling Reducer Failures

– The Status field is used for handling Reducer failures.

– Status can be one of Live, Dead, or Idle.

– If a Reducer has not updated its status, its Status is set to Dead.

– If another Reducer's Status is Idle, that new Reducer is assigned the Reduce Queue Pointers and Output Pointers previously held by the old Reducer.
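
The recovery rule can be sketched as a small in-memory routine. The `Reducer` record, the `recover` function, the `timeout` parameter and the `last_update` timestamp are hypothetical; the slides only name the Status values and the pointers that are handed over, and in CMR the status would live in SimpleDB rather than in a Python dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List

LIVE, DEAD, IDLE = "Live", "Dead", "Idle"


@dataclass
class Reducer:
    name: str
    status: str = IDLE
    last_update: float = 0.0
    queue_pointers: List[str] = field(default_factory=list)
    output_pointers: List[str] = field(default_factory=list)


def recover(reducers: Dict[str, Reducer], now: float, timeout: float) -> None:
    """Mark stale Live reducers Dead, then hand their pointers to Idle ones."""
    for r in reducers.values():
        if r.status == LIVE and now - r.last_update > timeout:
            r.status = DEAD
    for old in reducers.values():
        if old.status != DEAD or not old.queue_pointers:
            continue
        new = next((r for r in reducers.values() if r.status == IDLE), None)
        if new is None:
            break  # no spare reducer available; try again on the next call
        new.queue_pointers, old.queue_pointers = old.queue_pointers, []
        new.output_pointers, old.output_pointers = old.output_pointers, []
        new.status, new.last_update = LIVE, now


reducers = {
    "r1": Reducer("r1", LIVE, last_update=0.0,
                  queue_pointers=["rq-0"], output_pointers=["out-0"]),
    "r2": Reducer("r2", IDLE),
}
recover(reducers, now=120.0, timeout=60.0)
print(reducers["r1"].status, reducers["r2"].status, reducers["r2"].queue_pointers)
```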

Advantages, Features And Applications(1/4)

The design has the following advantages:

1. Pipelining allows parallelism between the Map and Reduce phases.

2. A downstream processing element can start processing as soon as some data is available from an upstream element.

3. The network is better utilized, as data is continuously pushed from one phase to the next.

Advantages, Features And Applications(2/4)

4. The final output is computed incrementally.

5. Introducing a pipeline between the Reduce phase of one job and the Map phase of the next job will support cascaded MapReduce jobs.

Advantages, Features And Applications(3/4)

Other features of the design include:

• Time windows

• Snapshots

• Cascaded MapReduce jobs
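
To make the time-window feature concrete, the sketch below groups timestamped events into fixed, non-overlapping windows and counts keys per window. The slides do not specify the window semantics, so the fixed-size windows, the function name `count_per_window` and the event format are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key)


def count_per_window(events: Iterable[Event],
                     window_seconds: float) -> Dict[int, Dict[str, int]]:
    """Group events into fixed windows by timestamp and count keys in each."""
    windows: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        window_id = int(timestamp // window_seconds)  # index of the window
        windows[window_id][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}


events = [(0.5, "click"), (3.0, "click"), (12.0, "view"), (14.9, "click")]
per_window = count_per_window(events, window_seconds=10.0)
print(per_window)
```

A snapshot then corresponds to reading the counts of the current window before that window has closed.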

Advantages, Features And Applications(4/4)

Applications:

• With these features, the design is particularly suited to stream processing of data.

• Typical stream processing applications include the analysis and processing of web feeds, click-streams, micro-blogging, and stock market quotes.

• This design can also be used to process real-time data.

Conclusions And Future Work

• The design fulfills a real need: processing streaming data using MapReduce.

• Further work can be done on maintaining intermediate output information to support rolling windows.

• Future scope also includes designing a generic system that is portable across several cloud operating systems.