
Resilient Distributed Datasets


Page 1: Resilient Distributed Datasets

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

MATEI ZAHARIA, MOSHARAF CHOWDHURY, TATHAGATA DAS, ANKUR DAVE, JUSTIN MA, MURPHY MCCAULEY, MICHAEL J. FRANKLIN, SCOTT SHENKER, ION STOICA

NSDI'12 PROCEEDINGS OF THE 9TH USENIX CONFERENCE ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION

PAPERS WE LOVE AMSTERDAM AUGUST 13, 2015

@gabriele_modena

Page 2: Resilient Distributed Datasets

(C) PRESENTATION BY GABRIELE MODENA, 2015

About me

• CS.ML

• Data science & predictive modelling

• with a sprinkle of systems work

• Hadoop & c. for data wrangling & crunching numbers

• … and Spark


Page 4: Resilient Distributed Datasets


We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

Page 5: Resilient Distributed Datasets


How

• Review (concepts from) key related work

• RDD + Spark

• Some critiques

Page 6: Resilient Distributed Datasets


Related work

• MapReduce

• Dryad

• Hadoop Distributed FileSystem (HDFS)

• Mesos

Page 7: Resilient Distributed Datasets


What’s an iterative algorithm anyway?

data = input data
w = <target vector>
for i in num_iterations:
    for item in data:
        update(w)

Multiple input scans

At each iteration, do something

Update a shared data structure
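As a minimal single-machine sketch of that pattern in plain Scala (the data, the weight, and the update rule are all hypothetical stand-ins):

// Minimal sketch of the iterative pattern above; all names and the
// per-item update rule are hypothetical.
val data = Seq(1.0, 2.0, 3.0)    // multiple input scans over this
var w = 0.0                      // the shared data structure
val numIterations = 10
for (_ <- 1 to numIterations) {  // at each iteration...
  for (item <- data) {
    w += 0.01 * (item - w)       // ...update(w) with a toy rule
  }
}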

Page 8: Resilient Distributed Datasets


HDFS

• GFS paper (2003)

• Distributed storage (with replication)

• Block ops

• NameNode keeps the mapping of files to block locations

[Diagram: a Name Node tracking blocks across three Data Nodes]


Page 11: Resilient Distributed Datasets


MapReduce

• Google paper (2004)

• Apache Hadoop (~2007)

• Divide-and-conquer functional model

• Goes hand in hand with HDFS

• Structure data as (key, value)

1. Map(): filter and project; emit (k, v) pairs

2. Reduce(): aggregate and summarise; group by key and count

[Diagram: Map tasks read blocks from HDFS, their (k, v) output is shuffled to Reduce tasks, and results are written back to HDFS]

Example input: "This is a test. Yes it is a test."

Map output: (This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1), (a, 1), (test, 1)

Reduce output: (This, 1), (is, 2), (a, 2), (test, 2), (Yes, 1), (it, 1)

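To make the Map()/Reduce() contract concrete, here is a minimal single-machine sketch of the same word count in plain Scala collections; it illustrates the (key, value) model only, not the Hadoop API:

// Word count as two phases over (key, value) pairs; plain Scala
// collections stand in for the distributed Map and Reduce tasks.
val input = Seq("This is a test.", "Yes it is a test.")

// Map(): project each record into (k, v) pairs
val mapped = input.flatMap(_.split("\\s+")).map(word => (word, 1))

// Reduce(): group by key and aggregate the counts
val reduced = mapped.groupBy(_._1).map {
  case (word, pairs) => (word, pairs.map(_._2).sum)
}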

Page 15: Resilient Distributed Datasets


(c) Image from Apache Tez http://tez.apache.org

Page 16: Resilient Distributed Datasets


Critiques of MR and HDFS

• Great when records (and jobs) are independent

• In reality expect data to be shuffled across the network

• Latency measured in minutes

• Performance hit for iterative methods

• Composability monsters

• Meant for batch workflows

Page 17: Resilient Distributed Datasets


Dryad

• Microsoft paper (2007)

• Inspired Apache Tez

• Generalisation of MapReduce via I/O pipelining

• Applications are (directed acyclic) graphs of tasks

Page 18: Resilient Distributed Datasets


Dryad

DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex, summerVertex,
                     edgeConf.createDefaultEdgeProperty()));

Page 19: Resilient Distributed Datasets


MapReduce and Dryad

SELECT a.country, COUNT(b.place_id)
FROM place a
JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;

(c) Image from Apache Tez http://tez.apache.org. Modified.

Page 20: Resilient Distributed Datasets


Critiques of Dryad

• No explicit abstraction for data sharing

• Must express data representations as a DAG

• Partial solution: DryadLINQ

• No notion of a distributed filesystem

• How to handle large inputs?

• Local writes / remote reads?

Page 21: Resilient Distributed Datasets


Resilient Distributed Datasets

Read-only, partitioned collection of records
=> a distributed immutable array

accessed via coarse-grained transformations
=> apply a function (a Scala closure) to all elements of the array

[Diagram: objects laid out across the partitions of the distributed array]
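As a sketch of what coarse-grained means in practice: one closure is applied uniformly to all elements, yielding a new immutable collection rather than mutating records in place. The values below are a hypothetical stand-in for an RDD[Double]; with an RDD the same map would run per partition.

// Coarse-grained: one closure over every element produces a new,
// immutable result; `readings` is a stand-in for an RDD[Double].
val readings = Seq(32.0, 98.6, 212.0)
val celsius  = readings.map(f => (f - 32.0) * 5.0 / 9.0)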


Page 23: Resilient Distributed Datasets


Spark

• Transformations: lazily create RDDs

  wc = dataset.flatMap(tokenize)
              .reduceByKey(add)

• Actions: execute computation

  wc.collect()

Runtime and API
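A slightly fuller sketch of that pipeline in Spark's Scala API (the SparkContext sc and the input path are assumed here); note that reduceByKey operates on (key, value) pairs, so a map step sits between flatMap and the reduction:

// Hedged sketch; assumes a SparkContext `sc` and an HDFS input path.
val dataset = sc.textFile("hdfs://...")
val wc = dataset
  .flatMap(_.split("\\s+"))   // transformation: lazy
  .map(word => (word, 1))     // transformation: build (k, v) pairs
  .reduceByKey(_ + _)         // transformation: still lazy
wc.collect()                  // action: triggers the computation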

Page 24: Resilient Distributed Datasets


Applications

• Driver code defines RDDs and invokes actions

• Submit to long-lived workers, which store partitions in memory

• Scala closures are serialised as Java objects and passed across the network over HTTP

• Variables bound to the closure are saved in the serialised object

• Closures are deserialised on each worker and applied to the RDD (partition)

• Mesos takes care of resource management

[Diagram: a Driver ships tasks to three Workers; each Worker holds a partition of the input data in RAM and sends results back]


Page 31: Resilient Distributed Datasets


Data persistence

1. in memory as deserialized Java objects

2. in memory as serialized data

3. on disk

RDD checkpointing

Memory management via an LRU eviction policy

.persist() an RDD for future reuse
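A sketch of how these options map onto Spark's Scala API (StorageLevel, persist, and checkpoint are real Spark API; the errors RDD and sc are assumed from the lineage example that follows):

// Pick ONE storage level per RDD; the three calls below are alternatives.
import org.apache.spark.storage.StorageLevel
errors.persist(StorageLevel.MEMORY_ONLY)      // 1. in memory, deserialized Java objects
errors.persist(StorageLevel.MEMORY_ONLY_SER)  // 2. in memory, serialized data
errors.persist(StorageLevel.DISK_ONLY)        // 3. on disk

// Checkpointing writes the RDD to stable storage, truncating long lineage chains.
sc.setCheckpointDir("hdfs://...")
errors.checkpoint()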

Page 32: Resilient Distributed Datasets


Lineage

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()


Page 42: Resilient Distributed Datasets


Lineage: fault recovery

If a partition is lost, derive it back from the lineage.

[Diagram: lineage graph for the example above: lines -> filter(_.startsWith("ERROR")) -> errors (persisted) -> filter(_.contains("HDFS")) -> map(_.split('\t')(3)) -> time fields]

Page 43: Resilient Distributed Datasets


Representation

Challenge: track lineage across transformations

1. Partitions
2. Data locality for partition p
3. List of dependencies
4. Iterator function to compute a dataset based on its parents
5. Metadata for the partitioner scheme
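A hedged Scala sketch of that five-part interface (the names are illustrative, not Spark's exact internals):

// Illustrative sketch of the common interface used to represent any RDD.
// All names are hypothetical; Spark's internal API differs in detail.
trait Partition { def index: Int }
trait Partitioner
trait Dependency

trait RDDLike[T] {
  def partitions: Seq[Partition]                     // 1. list of partitions
  def preferredLocations(p: Partition): Seq[String]  // 2. data locality for p
  def dependencies: Seq[Dependency]                  // 3. dependencies on parents
  def compute(p: Partition): Iterator[T]             // 4. compute from parent data
  def partitioner: Option[Partitioner]               // 5. partitioner metadata
}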

Page 44: Resilient Distributed Datasets


Narrow dependencies

pipelined execution on one cluster node

map, filter, union

Page 45: Resilient Distributed Datasets


Wide dependencies

require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation

groupByKey, join with inputs not co-partitioned
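A hedged sketch of the co-partitioning remark: pre-partitioning both join inputs with the same partitioner makes the join a narrow dependency (HashPartitioner, partitionBy, and join are real Spark pair-RDD API; pairs1 and pairs2 are assumed RDDs of (K, V) pairs):

// Co-partitioning both sides of a join avoids a shuffle at join time.
import org.apache.spark.HashPartitioner
val part   = new HashPartitioner(8)
val left   = pairs1.partitionBy(part).persist()  // shuffled once, then reused
val right  = pairs2.partitionBy(part).persist()
val joined = left.join(right)  // co-partitioned inputs => narrow dependency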

Page 46: Resilient Distributed Datasets


Scheduling

Tasks are allocated based on data locality (delay scheduling)

1. An action is triggered => compute the RDD
2. Based on lineage, build a graph of stages to execute
3. Each stage contains as many pipelined transformations with narrow dependencies as possible
4. Launch tasks to compute missing partitions from each stage until the target RDD has been computed
5. If a task fails => re-run it on another node, as long as its stage's parents are still available

Page 47: Resilient Distributed Datasets


Job execution

B = A.groupBy
D = C.map
F = D.union(E)
G = B.join(F)
G.collect()

[Diagram: DAG of RDDs A..G split into stages. Stage 1 runs the groupBy from A to B; Stage 2 pipelines the map (C -> D) and the union (D, E -> F); Stage 3 joins B and F into G]


Page 60: Resilient Distributed Datasets


Evaluation

Page 61: Resilient Distributed Datasets


Some critiques (of the paper)

• How general is this approach?
• We are still doing MapReduce
• Concerns wrt iterative algorithms still stand
  • CPU-bound workloads?
  • Linear algebra?
• How much tuning is required?
• How does the partitioner work?
• What is the cost of reconstructing an RDD from lineage?
• Performance when data does not fit in memory
  • E.g. a join between two very large non-co-partitioned RDDs

Page 62: Resilient Distributed Datasets


References (Theory)

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Zaharia et al. Proceedings of NSDI '12. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Spark: Cluster Computing with Working Sets. Zaharia et al. Proceedings of HotCloud '10. http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

The Google File System. Ghemawat, Gobioff, Leung. 19th ACM Symposium on Operating Systems Principles, 2003. http://research.google.com/archive/gfs.html

MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat. OSDI '04: Sixth Symposium on Operating System Design and Implementation. http://research.google.com/archive/mapreduce.html

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Isard, Budiu, Yu, Birrell, Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Hindman et al. Proceedings of NSDI '11. https://www.cs.berkeley.edu/~alig/papers/mesos.pdf

Page 63: Resilient Distributed Datasets


References (Practice)

• An overview of the pyspark API through pictures: https://github.com/jkthompson/pyspark-pictures
• Barry Brumitt's presentation on MapReduce design patterns (UW CSE490): http://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf
• The Dryad Project: http://research.microsoft.com/en-us/projects/dryad/
• Apache Spark: http://spark.apache.org
• Apache Hadoop: https://hadoop.apache.org
• Apache Tez: https://tez.apache.org
• Apache Mesos: http://mesos.apache.org