"Why Spark Is Nowhere Near That Good"

Spark
Alexey Diomin, diominay@gmail.com

Intro

Basic
  • RDD
  • DAG

RDD: Resilient Distributed Dataset

SchemaRDD

DAG

mapValues

Mythology
  • Spark is not MapReduce
  • Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk
  • InMemory processing
  • Spark Streaming is real-time streaming
  • Lightning-fast cluster computing

MapReduce

Not MapReduce

Spark
"Run programs up to 100x faster than Hadoop MapReduce* in memory, or 10x faster on disk"
http://spark.apache.org/

*Hadoop without Tez

InMemory
"The MapReduce and Spark shuffles use a pull model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data."
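The pull model in the quote can be sketched in a few lines of plain Java. This is a minimal, single-process illustration and not Spark's or Hadoop's implementation: the class `PullShuffleSketch`, its method names, and the file naming scheme are all hypothetical, and reading a local temporary file stands in for the remote fetch a real reduce task would make.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PullShuffleSketch {

    // Hypothetical naming scheme for the file a map task leaves for one reducer.
    static Path shuffleFile(Path dir, int mapId, int reduceId) {
        return dir.resolve("map" + mapId + "-reduce" + reduceId + ".txt");
    }

    /** Map side: bucket words by target reducer and write every bucket to local disk. */
    static void runMapTask(Path dir, int mapId, List<String> words, int numReducers)
            throws IOException {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String word : words) {
            buckets.get(Math.floorMod(word.hashCode(), numReducers)).add(word);
        }
        for (int r = 0; r < numReducers; r++) {
            Files.write(shuffleFile(dir, mapId, r), buckets.get(r));
        }
    }

    /** Reduce side: pull this reducer's file from every map task, then count the words. */
    static Map<String, Integer> runReduceTask(Path dir, int reduceId, int numMaps)
            throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (int m = 0; m < numMaps; m++) {
            for (String word : Files.readAllLines(shuffleFile(dir, m, reduceId))) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Runs all map tasks first, then all reduce tasks, and merges the per-reducer results. */
    static Map<String, Integer> wordCount(List<List<String>> inputs, int numReducers) {
        try {
            Path dir = Files.createTempDirectory("shuffle-sketch");
            for (int m = 0; m < inputs.size(); m++) {
                runMapTask(dir, m, inputs.get(m), numReducers);
            }
            Map<String, Integer> result = new TreeMap<>();
            for (int r = 0; r < numReducers; r++) {
                // Each word hashes to exactly one reducer, so the key sets never overlap.
                result.putAll(runReduceTask(dir, r, inputs.size()));
            }
            // Remove the intermediate files once every reducer has fetched its data.
            File[] leftovers = dir.toFile().listFiles();
            if (leftovers != null) {
                for (File f : leftovers) {
                    f.delete();
                }
            }
            dir.toFile().delete();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
            Arrays.asList("spark", "tez", "spark"),
            Arrays.asList("flink", "spark"));
        System.out.println(wordCount(inputs, 2)); // prints {flink=1, spark=3, tez=1}
    }
}
```

Even in this toy version, every intermediate record passes through the file system between the map and reduce phases, which is the point the quote makes against the "in-memory" label.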

http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

Spark Streaming
  • RDD
  • DAG

Receiver.store(...)

Google Cloud Dataflow
"One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts."
http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/

Lightning-fast cluster computing

Spark
  • Logging
  • Pipeline
  • Indexes
  • Job progress
  • Effective Memory
  • Network

Example

Staged (batch) execution

Pipelined execution

Indexes
Netflix
https://github.com/amplab/spark-indexedrdd

Job Progress
  • Accumulators
  • Broadcast

Memory
val value = task.run(taskId, attemptNumber)
val valueBytes = resultSer.serialize(value)
val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
val serializedDirectResult = ser.serialize(directResult)

Default JavaSerializer
public synchronized byte[] toByteArray() { return Arrays.copyOf(buf, count); }
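The memory cost of this chain can be reproduced outside Spark. The sketch below is an assumption-laden illustration, not Spark code: the class `SerializationCopies` and its methods are hypothetical, and plain `ObjectOutputStream` over a `ByteArrayOutputStream` stands in for the serializer the slide calls the default. It serializes a payload twice, mirroring the `valueBytes` and `serializedDirectResult` steps, and checks that `toByteArray()` returns a fresh copy on every call.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationCopies {

    /** Serializes a value with plain Java serialization into a new byte array. */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // toByteArray() allocates a new array and copies the internal buffer into it.
        return buffer.toByteArray();
    }

    /** Returns true when two toByteArray() calls hand back distinct array objects. */
    static boolean toByteArrayCopies() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(42);
        return buffer.toByteArray() != buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] value = new byte[1 << 20];        // stand-in for a 1 MiB task result
        byte[] valueBytes = serialize(value);    // first pass: the result itself
        byte[] wrapped = serialize(valueBytes);  // second pass: the wrapper around it
        System.out.println("value=" + value.length
            + " valueBytes=" + valueBytes.length
            + " wrapped=" + wrapped.length);
        System.out.println("toByteArray copies: " + toByteArrayCopies());
    }
}
```

While `wrapped` is being built, the original value, its serialized form, the stream's internal buffer, and the final copy can all be live at once, so peak memory is a multiple of the result size.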

Network
Problems with firewalls, NAT, multiple IPs, etc.

SQL
  • Shark (dead)
  • Spark SQL
  • Spark on Hive

SparkR
  • Unstable API
  • Minimal docs

RStudio Server

Links
Spark: http://spark.apache.org/
Flink: http://flink.apache.org/
Tez: http://tez.apache.org/