"Why Spark Is Not Nearly as Good as It Seems"

SPARK

Alexey Diomin, diominay@gmail.com

Intro

Basic

RDD

DAG

RDD

Resilient Distributed Dataset

SchemaRDD

DAG

Mythology

Spark is not MapReduce

Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk

InMemory processing

Spark Streaming is real-time streaming

Lightning-fast cluster computing

MapReduce

Not MapReduce

Spark

Run programs up to 100x faster than Hadoop MapReduce* in memory, or 10x faster on disk

*Hadoop without Tez

http://spark.apache.org/

InMemory

The MapReduce and Spark shuffles use a “pull” model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data.

http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
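
The pull model in the quote can be sketched without Spark or Hadoop. In this minimal, hypothetical illustration each "map task" writes one file per reduce partition to local disk, and each "reduce task" then fetches its partition from every map output. The names `PullShuffleSketch`, `mapTask`, `reduceTask`, and `wordCount` are invented for this example, and local file reads stand in for remote fetch requests.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a pull-based shuffle; not Spark or Hadoop code.
final class PullShuffleSketch {

    // Map side: partition words by hash and write one local file per reducer.
    static void mapTask(int mapId, List<String> words, int numReducers, Path dir) throws IOException {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String w : words) {
            buckets.get(Math.floorMod(w.hashCode(), numReducers)).add(w);
        }
        for (int r = 0; r < numReducers; r++) {
            Files.write(dir.resolve("map-" + mapId + "-reduce-" + r), buckets.get(r));
        }
    }

    // Reduce side: pull this reducer's file from every map output, then count.
    static Map<String, Integer> reduceTask(int reduceId, int numMaps, Path dir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (int m = 0; m < numMaps; m++) {
            // In a real cluster this read would be a remote fetch request.
            for (String w : Files.readAllLines(dir.resolve("map-" + m + "-reduce-" + reduceId))) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Runs one map task per input list, then all reduce tasks, and merges the results.
    static Map<String, Integer> wordCount(List<List<String>> inputs, int numReducers) {
        try {
            Path dir = Files.createTempDirectory("shuffle-sketch");
            for (int m = 0; m < inputs.size(); m++) {
                mapTask(m, inputs.get(m), numReducers, dir);
            }
            Map<String, Integer> total = new TreeMap<>();
            for (int r = 0; r < numReducers; r++) {
                // Partitions are disjoint by hash, so keys never collide here.
                total.putAll(reduceTask(r, inputs.size(), dir));
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Even in this toy version every intermediate record goes to disk between the map and reduce sides, which is the point the slide makes against the "in-memory" label.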

Spark Streaming

RDD

DAG

Receiver.store(...)
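
A receiver's `store(...)` call only buffers incoming data; the engine later groups what was stored into batches by time interval and processes each batch as a whole. The sketch below illustrates that micro-batching idea under simplifying assumptions. `MicroBatchSketch` and its methods are hypothetical names for this example and are not part of the Spark Streaming API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical micro-batching sketch; not the Spark Streaming implementation.
final class MicroBatchSketch {
    private final long batchIntervalMs;
    // Batch start time -> records stored during that interval.
    private final TreeMap<Long, List<String>> batches = new TreeMap<>();

    MicroBatchSketch(long batchIntervalMs) {
        this.batchIntervalMs = batchIntervalMs;
    }

    // Analogue of a receiver's store(...): the record is only buffered here.
    void store(String record, long arrivalTimeMs) {
        long batchStart = (arrivalTimeMs / batchIntervalMs) * batchIntervalMs;
        batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(record);
    }

    // Removes and returns the batches whose interval has fully elapsed at nowMs.
    List<List<String>> drainCompleted(long nowMs) {
        List<List<String>> ready = new ArrayList<>();
        while (!batches.isEmpty() && batches.firstKey() + batchIntervalMs <= nowMs) {
            ready.add(batches.pollFirstEntry().getValue());
        }
        return ready;
    }

    // Earliest time at which a record arriving at arrivalTimeMs can be processed.
    long earliestProcessingTime(long arrivalTimeMs) {
        return (arrivalTimeMs / batchIntervalMs + 1) * batchIntervalMs;
    }
}
```

With a 1000 ms interval, a record that arrives at 100 ms cannot be processed before 1000 ms, so latency depends on the batch interval, not on how fast a single record can be handled.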

Google Cloud Dataflow

One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts.

http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/

Lightning-fast cluster computing

Spark

Logging

Pipeline

Indexes

Job progress

Effective Memory

Network

Example

Staged (batch) execution

Pipelined execution
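
The difference between the two styles can be shown with two operators applied to a small list. This is a minimal sketch, not engine code: `ExecutionStyles`, `staged`, and `pipelined` are hypothetical names, and the `trace` list exists only to record the order in which operators run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch contrasting the two execution styles; not engine code.
final class ExecutionStyles {

    // Staged: every operator processes the whole dataset and materializes
    // its full output before the next operator starts.
    static List<String> staged(List<Integer> input, List<String> trace) {
        List<Integer> doubled = new ArrayList<>();
        for (int x : input) {
            trace.add("double(" + x + ")");
            doubled.add(x * 2);
        }
        List<String> formatted = new ArrayList<>();
        for (int x : doubled) {
            trace.add("format(" + x + ")");
            formatted.add("v" + x);
        }
        return formatted;
    }

    // Pipelined: each record flows through all operators before the next
    // record is read, so no intermediate collection is built.
    static List<String> pipelined(List<Integer> input, List<String> trace) {
        List<String> out = new ArrayList<>();
        for (int x : input) {
            trace.add("double(" + x + ")");
            int d = x * 2;
            trace.add("format(" + d + ")");
            out.add("v" + d);
        }
        return out;
    }
}
```

Both methods return the same result. The staged trace runs all `double` calls before any `format` call, while the pipelined trace alternates them per record, so the first output is available without waiting for the whole input.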

Indexes

Netflix

https://github.com/amplab/spark-indexedrdd

Job Progress

Accumulators

Broadcast

Memory

// Run the task and keep its result object in memory
val value = task.run(taskId, attemptNumber)
// First serialization: the result becomes a byte buffer
val valueBytes = resultSer.serialize(value)
// Wrap the serialized bytes together with accumulator updates and metrics
val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
// Second serialization: the wrapper, including valueBytes, is serialized again
val serializedDirectResult = ser.serialize(directResult)

Default JavaSerializer

// java.io.ByteArrayOutputStream: returns a full copy of the internal buffer
public synchronized byte[] toByteArray() {
    return Arrays.copyOf(buf, count);
}
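
The effect of serializing a result and then serializing its wrapper can be reproduced with the JDK alone. The sketch below is an assumption-laden illustration, not Spark's executor code: `DoubleSerializationSketch` and `ResultWrapper` are hypothetical names, with `ResultWrapper` standing in for a result object that carries already-serialized bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch of double serialization; not Spark's executor code.
final class DoubleSerializationSketch {

    // Stand-in for a result wrapper that carries already-serialized bytes.
    static final class ResultWrapper implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] valueBytes;

        ResultWrapper(byte[] valueBytes) {
            this.valueBytes = valueBytes;
        }
    }

    // Plain Java serialization into a growable in-memory buffer.
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // toByteArray() copies the internal buffer, so two full copies exist here.
        return buffer.toByteArray();
    }

    // Returns {size of value, size after first pass, size after second pass} in bytes.
    static long[] sizes(int payloadBytes) {
        byte[] value = new byte[payloadBytes];                      // copy 1: the result itself
        byte[] valueBytes = serialize(value);                       // copy 2: serialized result
        byte[] wrapped = serialize(new ResultWrapper(valueBytes));  // copy 3: serialized wrapper
        return new long[] {value.length, valueBytes.length, wrapped.length};
    }
}
```

For a 1 MiB payload, `sizes(1 << 20)` reports three arrays of at least 1 MiB each that are alive at the same time, and each pass also holds the stream's internal buffer until it is garbage collected.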

Network

Problems with firewalls, NAT, multiple IP addresses, etc.

SQL

Shark (dead)

Spark SQL

Hive on Spark

SparkR

Unstable API

Minimal documentation

RStudio Server

Links

Spark

http://spark.apache.org/

Flink

http://flink.apache.org/

Tez

http://tez.apache.org/