"Why Spark Is Not Nearly That Good"

Alexey Diomin, diominay@gmail.com

Resilient Distributed Dataset

SchemaRDD

Mythology

Spark is not MapReduce

Mythology

Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk

Mythology

InMemory processing

Spark Streaming is real-time streaming

Lightning-fast cluster computing

MapReduce

Not MapReduce

Run programs up to 100x faster than Hadoop MapReduce* in memory, or 10x faster on disk

*Hadoop without Tez

http://spark.apache.org/

InMemory

The MapReduce and Spark shuffles use a “pull” model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data

http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
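The pull model quoted above can be illustrated with a toy word count that uses only the JDK. This is a minimal sketch, not Spark's or Hadoop's shuffle code: the class name `PullShuffleSketch`, the file naming scheme, and the hash partitioning are all assumptions made for illustration, and a local temp directory stands in for both "local disk" and the "remote fetch". The point it shows is that intermediate data touches disk between the map and reduce sides even when the job is described as in-memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PullShuffleSketch {

    // Map side: bucket words by target reducer, then write one local file per bucket.
    public static void mapTask(Path dir, int mapId, List<String> words, int reducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String w : words) {
            buckets.get(Math.floorMod(w.hashCode(), reducers)).add(w);
        }
        try {
            for (int r = 0; r < reducers; r++) {
                Path file = dir.resolve("map" + mapId + "-reduce" + r + ".txt");
                Files.write(file, buckets.get(r), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reduce side: pull this reducer's file from every map task and count the words.
    public static Map<String, Integer> reduceTask(Path dir, int reduceId, int maps) {
        Map<String, Integer> counts = new TreeMap<>();
        try {
            for (int m = 0; m < maps; m++) {
                Path file = dir.resolve("map" + m + "-reduce" + reduceId + ".txt");
                for (String w : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // Driver: run all map tasks first, then let each reducer pull its partitions.
    public static Map<String, Integer> wordCount(List<List<String>> splits, int reducers) {
        try {
            // Temp files are not deleted here; a real job would clean them up.
            Path dir = Files.createTempDirectory("shuffle");
            for (int m = 0; m < splits.size(); m++) {
                mapTask(dir, m, splits.get(m), reducers);
            }
            Map<String, Integer> total = new TreeMap<>();
            for (int r = 0; r < reducers; r++) {
                total.putAll(reduceTask(dir, r, splits.size()));
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("b", "c"));
        System.out.println(wordCount(splits, 2)); // prints {a=2, b=2, c=1}
    }
}
```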

Spark Streaming

Receiver.store(...)
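Why a stored record is not processed "in real time" can be sketched with a toy micro-batcher. This is an illustration under stated assumptions, not Spark Streaming's implementation: the class `MicroBatchSketch`, its `store` method, and the explicit `nowMs` clock are hypothetical names chosen for the example. A record handed over by a receiver is only buffered; it becomes visible to processing when its batch interval closes.

```java
import java.util.ArrayList;
import java.util.List;

public class MicroBatchSketch {
    private final long batchIntervalMs;
    private final List<List<String>> completed = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private long batchStartMs = 0;

    public MicroBatchSketch(long batchIntervalMs) {
        this.batchIntervalMs = batchIntervalMs;
    }

    // Hypothetical analogue of a receiver handing over a record: it is only buffered.
    public void store(String record, long nowMs) {
        closeBatchesUpTo(nowMs);
        current.add(record);
    }

    // Close every batch whose interval has ended by nowMs (empty batches included).
    public void closeBatchesUpTo(long nowMs) {
        while (nowMs >= batchStartMs + batchIntervalMs) {
            completed.add(current);
            current = new ArrayList<>();
            batchStartMs += batchIntervalMs;
        }
    }

    // Only completed batches are available for processing.
    public List<List<String>> completedBatches() {
        return completed;
    }

    public static void main(String[] args) {
        MicroBatchSketch stream = new MicroBatchSketch(1000);
        stream.store("a", 100);
        stream.store("b", 900);
        System.out.println(stream.completedBatches()); // prints []
        stream.store("c", 1200);
        System.out.println(stream.completedBatches()); // prints [[a, b]]
    }
}
```

In this sketch the latency of record "a" is bounded below by the time remaining in its batch interval, which is the sense in which micro-batching differs from record-at-a-time streaming.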

Spark Streaming

Google Cloud Dataflow

One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts.

http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/

Lightning-fast cluster computing

Logging

Pipeline

Indexes

Job progress

Effective Memory

Network

Example

Staged (batch) execution

Pipelined execution
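The difference between the two execution styles can be made concrete by recording the order in which operators touch records. This is a toy sketch, assuming two hypothetical operators `A` (doubles a number) and `B` (consumes it); it does not model how Spark or any other engine schedules work. Staged execution finishes operator A for the whole input and materializes the result before B starts, while pipelined execution pushes each record through A and B before reading the next one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StagedVsPipelined {

    // Staged: operator A runs over all input and its output is materialized,
    // and only then does operator B start.
    public static List<String> staged(List<Integer> input) {
        List<String> trace = new ArrayList<>();
        List<Integer> intermediate = new ArrayList<>();
        for (int x : input) {
            trace.add("A" + x);
            intermediate.add(x * 2);
        }
        for (int y : intermediate) {
            trace.add("B" + y);
        }
        return trace;
    }

    // Pipelined: each record flows through A and then B immediately,
    // so no full intermediate collection is ever held.
    public static List<String> pipelined(List<Integer> input) {
        List<String> trace = new ArrayList<>();
        for (int x : input) {
            trace.add("A" + x);
            int y = x * 2;
            trace.add("B" + y);
        }
        return trace;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2);
        System.out.println(staged(input));    // prints [A1, A2, B2, B4]
        System.out.println(pipelined(input)); // prints [A1, B2, A2, B4]
    }
}
```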

Indexes

Netflix

https://github.com/amplab/spark-indexedrdd

Job Progress

Accumulators

Broadcast

Memory

val value = task.run(taskId, attemptNumber)

Memory

val valueBytes = resultSer.serialize(value)

Memory

val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)

val serializedDirectResult = ser.serialize(directResult)

Default JavaSerializer

// java.io.ByteArrayOutputStream: every call returns a fresh copy of the buffer
public synchronized byte[] toByteArray() {
    return Arrays.copyOf(buf, count);
}
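The chain on the Memory slides (value, serialized value, wrapper, serialized wrapper) can be reproduced with plain JDK classes to see how many copies of the same payload end up alive at once. This is a minimal sketch: `Result` is a hypothetical stand-in for the wrapper object, not Spark's `DirectTaskResult`, and `serialize` only assumes Java serialization into a `ByteArrayOutputStream`, which is what the `toByteArray` snippet above belongs to.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationCopies {

    // Hypothetical stand-in for a task result wrapper; not Spark's class.
    public static final class Result implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] valueBytes;

        public Result(byte[] valueBytes) {
            this.valueBytes = valueBytes;
        }
    }

    // Java serialization into an in-memory buffer, followed by a copy out of it.
    public static byte[] serialize(Object obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray(); // Arrays.copyOf(buf, count): a new array
    }

    // toByteArray() never hands out the internal buffer, so two calls give two arrays.
    public static boolean toByteArrayCopies() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(42);
        return bos.toByteArray() != bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] value = new byte[1 << 20];                            // the task's value, 1 MiB
        byte[] valueBytes = serialize(value);                        // first serialized copy
        byte[] serializedResult = serialize(new Result(valueBytes)); // payload serialized again

        System.out.println("value bytes:             " + value.length);
        System.out.println("valueBytes bytes:        " + valueBytes.length);
        System.out.println("serializedResult bytes:  " + serializedResult.length);
        System.out.println("toByteArray makes copies: " + toByteArrayCopies());
    }
}
```

While `main` runs, the payload exists at least three times on the heap (`value`, `valueBytes`, `serializedResult`), not counting the internal stream buffers that are still reachable during each `serialize` call.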

Network

Problems with firewalls, NAT, multiple IP addresses, etc.

Shark (dead)

Spark SQL

Hive on Spark

SparkR

Unstable API

Minimal docs

RStudio Server

http://spark.apache.org/

http://flink.apache.org/

http://tez.apache.org/
