SPARK
Alexey Diomin, [email protected]
Intro
Basic
RDD
DAG
RDD
Resilient Distributed Dataset
SchemaRDD
DAG
Mythology
Spark is not MapReduce
Run programs up to 100x faster than
MapReduce in memory, or 10x faster on disk
InMemory processing
Spark Streaming is real-time streaming
Lightning-fast cluster computing
MapReduce
Not MapReduce
Spark
Run programs up to 100x faster than Hadoop
MapReduce* in memory, or 10x faster on disk
*Hadoop without Tez
http://spark.apache.org/
InMemory
The MapReduce and Spark shuffles use a “pull” model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data.
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
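The pull model quoted above can be sketched in a single process. This is a minimal illustration, not Spark's or Hadoop's shuffle code: the class `PullShuffleSketch`, the `localDisk` map standing in for each map task's local disk, and the word-count workload are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-process model of a pull-based shuffle; not Spark or Hadoop code.
class PullShuffleSketch {
    // Stands in for each map task's local disk: mapTaskId -> (reducePartition -> records).
    static final Map<Integer, Map<Integer, List<String>>> localDisk = new HashMap<>();

    // Map side: partition records by key hash and write them to "local disk".
    static void runMapTask(int mapTaskId, List<String> words, int numReducers) {
        Map<Integer, List<String>> output = new HashMap<>();
        for (String word : words) {
            int partition = Math.floorMod(word.hashCode(), numReducers);
            output.computeIfAbsent(partition, p -> new ArrayList<>()).add(word);
        }
        localDisk.put(mapTaskId, output);
    }

    // Reduce side: pull this partition's block from every map task, then count.
    static Map<String, Integer> runReduceTask(int partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<Integer, List<String>> mapOutput : localDisk.values()) {
            // In a real cluster this lookup would be a remote fetch request.
            for (String word : mapOutput.getOrDefault(partition, List.of())) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Runs one map task per input split, then every reduce task, and merges the results.
    static Map<String, Integer> wordCount(List<List<String>> splits, int numReducers) {
        localDisk.clear();
        for (int i = 0; i < splits.size(); i++) {
            runMapTask(i, splits.get(i), numReducers);
        }
        Map<String, Integer> total = new HashMap<>();
        for (int p = 0; p < numReducers; p++) {
            total.putAll(runReduceTask(p)); // each key lives in exactly one partition
        }
        return total;
    }
}
```

The point of the sketch is that all map output is written out before any reducer asks for it, regardless of how much memory is available.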
Spark Streaming
RDD
DAG
Spark Streaming
Receiver.store(...)
Google Cloud Dataflow
One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts.
http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/
Lightning-fast cluster computing
Spark
Logging
Pipeline
Indexes
Job progress
Effective Memory
Network
Example
Staged (batch) execution
Pipelined execution
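The difference between the two execution models named above can be sketched with two toy operators. This is an illustration only, not code from Spark or any other engine: the class `ExecutionModelSketch`, the "double" and "filter" operators, and the trace strings are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the same two operators run in staged and in pipelined order.
class ExecutionModelSketch {
    // Staged: run each operator over the whole input and materialize its output
    // before the next operator starts.
    static List<String> staged(List<Integer> input) {
        List<String> trace = new ArrayList<>();
        List<Integer> doubled = new ArrayList<>(); // fully materialized intermediate
        for (int x : input) {
            trace.add("double " + x);
            doubled.add(x * 2);
        }
        for (int y : doubled) {
            trace.add("filter " + y);
        }
        return trace;
    }

    // Pipelined: push each record through every operator before reading the next one.
    static List<String> pipelined(List<Integer> input) {
        List<String> trace = new ArrayList<>();
        for (int x : input) {
            trace.add("double " + x);
            int y = x * 2; // no intermediate collection is built
            trace.add("filter " + y);
        }
        return trace;
    }
}
```

For the input 1, 2 the staged trace is double 1, double 2, filter 2, filter 4, while the pipelined trace interleaves the operators: double 1, filter 2, double 2, filter 4.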
Indexes
Netflix
https://github.com/amplab/spark-indexedrdd
Job Progress
Accumulators
Broadcast
Memory
val value = task.run(taskId, attemptNumber)
val valueBytes = resultSer.serialize(value)
val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
val serializedDirectResult = ser.serialize(directResult)
Default JavaSerializer
public synchronized byte[] toByteArray() {
    return Arrays.copyOf(buf, count);
}
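The memory cost of the snippets above can be sketched with plain Java serialization. In the Spark excerpt the task result is serialized, wrapped in a DirectTaskResult, and serialized again, and each `toByteArray()` call copies the stream's buffer. The class `ResultCopySketch` and its `Wrapper` type are hypothetical stand-ins for illustration; they are not Spark classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch of double serialization and buffer copying; not Spark code.
class ResultCopySketch {
    // Stand-in for a wrapper that carries already-serialized result bytes.
    static class Wrapper implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] valueBytes;
        Wrapper(byte[] valueBytes) { this.valueBytes = valueBytes; }
    }

    // Java-serialize an object; toByteArray() copies the stream's internal buffer.
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns {payload length, size after first serialization, size after second}.
    static int[] sizes(int payloadLength) {
        byte[] value = new byte[payloadLength];
        byte[] valueBytes = serialize(value);                // first serialization
        byte[] wrapped = serialize(new Wrapper(valueBytes)); // second serialization
        return new int[] { value.length, valueBytes.length, wrapped.length };
    }

    // True if two successive toByteArray() calls return distinct arrays.
    static boolean toByteArrayCopies() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(42);
        return buffer.toByteArray() != buffer.toByteArray();
    }
}
```

While the second serialization runs, the original value, its serialized bytes, the growing output buffer, and finally the copied array can all be live at once, so a large task result occupies several times its own size.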
Network
Problems with firewalls, NAT, multiple IP addresses, etc.
SQL
Shark (dead)
Spark SQL
Hive on Spark
SparkR
Unstable API
Minimal docs
RStudio Server
Links
Spark
http://spark.apache.org/
Flink
http://flink.apache.org/
Tez
http://tez.apache.org/