@mszymani #Devoxx #TantusData #spark
http://blog.explainmydata.com/2014/05/
Agenda
• Execution model
• Sizing executors
• Skewed data
• Locality
• Caching
• Debugging tools
• Local run/tests
• Challenge and a draw!
What is Spark?
• Engine for distributed data processing
• Java, Scala, R and Python APIs
• SQL, Streaming, Machine Learning
RDD

[Diagram: execution model: an RDD computation is split into stages (Stage 1 … Stage N) separated by shuffles; each stage runs many parallel tasks, one per partition]
sc.textFile("/input/text/")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey { case (x, y) => x + y }
  .saveAsTextFile(output)

[Diagram: Stage 1 tasks (Task 1, …) read input blocks (Block 1, Block 2, …) and run the flatMap/map steps; a shuffle feeds Stage 2, which reduces per key and writes the output]
Sizing executors

Executor: a JVM process able to run one or more tasks.

Example config: --executor-cores 3 --executor-memory 10g

[Diagram: job resources shown as a pool of executors; the driver hands pending tasks to executors, which report them back as complete tasks]
Sizing executors

[Diagram: ways to split each node's resources among executors, compared side by side: one large executor per node vs a few medium executors vs many small executors]
Sizing executors
• Spark can benefit from running multiple tasks in the same JVM
• Too many cores per executor leads to problems: HDFS I/O contention, GC pauses
• 1-4 cores per executor is a good starting point
Sizing executors

Container = executor heap (spark.executor.memory) + memory overhead (spark.yarn.executor.memoryOverhead)

Executor heap = cache (spark.storage.memoryFraction) + shuffle (spark.shuffle.memoryFraction) + user program
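As a rough illustration of the container math above, here is a sketch assuming YARN's classic default overhead formula, max(384 MB, 10% of executor memory); check your Spark version's documentation for the exact default.

```python
# Sketch of how the YARN container request is derived from Spark settings.
# Assumes the default overhead of max(384 MB, 10% of executor memory);
# this default has varied across Spark versions.

def container_size_mb(executor_memory_mb, overhead_mb=None):
    """Total memory YARN must allocate for one executor container."""
    if overhead_mb is None:
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# --executor-memory 10g -> 10240 MB heap + 1024 MB overhead = 11264 MB
print(container_size_mb(10 * 1024))  # -> 11264
```

This is why asking for 10g executors on a node with exactly 10g free per slot fails: the container request is larger than the heap you configured.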
Sizing executors
• Keep in mind memory overhead
• Keep some resources for OS
• To determine memory consumption, cache an RDD and check its size
• Consider dynamic resource allocation
We are up & running
• 2 GB block limit
• Timeouts
• GC overhead limit exceeded
• OOM
• ExecutorLostFailure
We are up & running

What to watch out for:
• Failing tasks
• GC heavy tasks
• Shuffle read/write sizes
• Long running tasks
We are up & running
• Too much data per task?
• Is parallelism level large enough?
• More memory for executor?
def groupByKey(numPartitions: Int)
def repartition(numPartitions: Int)
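The effect of the parallelism level can be sketched with plain hash partitioning (a plain-Python illustration, not Spark's actual partitioner): more partitions means less data per task.

```python
from collections import Counter

def largest_partition(keys, num_partitions):
    """Assign each record to a partition by hashing its key (hash-partitioning
    in the style of Spark's HashPartitioner) and report the biggest partition,
    i.e. the data volume of the slowest task."""
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return max(sizes.values())

records = [f"key-{i}" for i in range(10_000)]

# With more partitions, the largest (slowest) task shrinks:
print(largest_partition(records, 4))    # roughly 2500 records per task
print(largest_partition(records, 100))  # roughly 100 records per task
```

The same data spread over 100 tasks instead of 4 gives each task far less to hold in memory at once, which is often the cheapest fix for OOM and GC trouble.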
Skewed data

[Diagram: what you want: evenly sized partitions; NOT: one oversized partition holding most of the data]

[Diagram: salting a hot key: SALT splits FOO's values (V1, V2, V3) into salted keys (FOO_1 V1, FOO_1 V3, FOO_2 V2); EXEC aggregates each salted key separately (FOO_1 OUT1, FOO_2 OUT2); MERGE combines the partial results back (FOO OUT). A rare key like BAR (V4 -> BAR_1 -> OUT3 -> BAR OUT) passes through unchanged]
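The salting trick in the diagram, sketched in plain Python (no Spark; the bucket count and helper name are illustrative): split the hot key into N salted keys, aggregate per salted key, then merge the partial results.

```python
from collections import defaultdict

SALT_BUCKETS = 2  # how many ways to split each hot key (an assumption)

def salted_sum(pairs):
    """Two-phase aggregation: sum per salted key, then merge per real key."""
    # Phase 1 (SALT + EXEC): spread values of the same key over several
    # salted keys, so no single task has to process all of them.
    partial = defaultdict(int)
    for i, (key, value) in enumerate(pairs):
        salted_key = f"{key}_{i % SALT_BUCKETS + 1}"  # FOO -> FOO_1, FOO_2
        partial[salted_key] += value
    # Phase 2 (MERGE): strip the salt and combine the partial sums.
    merged = defaultdict(int)
    for salted_key, value in partial.items():
        key = salted_key.rsplit("_", 1)[0]
        merged[key] += value
    return dict(merged)

print(salted_sum([("FOO", 1), ("FOO", 2), ("FOO", 3), ("BAR", 4)]))
# -> {'FOO': 6, 'BAR': 4}
```

This only works for aggregations that can be computed in two stages (sums, counts, max); the payoff is that the hot key's work is spread over SALT_BUCKETS tasks instead of one.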
Locality

sparkContext.textFile("hdfs://…")

[Diagram: the driver schedules tasks on executors co-located with their HDFS blocks: Node 1 reads Block 1, Node 2 reads Block 2, Node 3 reads Block 3]
Locality
• Increase number of executors
• For small jobs it is better to leave the defaults as they are
• spark.locality.wait parameter
Caching
val rdd1=calculate1()
rdd1.…saveAsTextFile(…)
rdd1.…saveAsTextFile(…)
Executed twice!
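Why the snippet above runs calculate1() twice: each action re-executes the whole lineage unless the RDD is cached. A plain-Python sketch of the difference (hypothetical names, no Spark):

```python
calls = {"n": 0}

def calculate1():
    calls["n"] += 1          # stands in for an expensive transformation chain
    return [1, 2, 3]

def action(make_data):
    """Stands in for a Spark action such as saveAsTextFile."""
    return sum(make_data())

# Without caching: every action recomputes the lineage from scratch.
action(calculate1)
action(calculate1)
print(calls["n"])            # -> 2: computed twice

# With caching: compute once, reuse the materialized result
# (what rdd1.cache() plus the first action would give you).
calls["n"] = 0
cached = calculate1()
sum(cached)
sum(cached)
print(calls["n"])            # -> 1: computed once
```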
Caching
• Branch in execution plan is a candidate for caching
• The Spark UI shows how much memory an RDD is taking
• You cannot control priority - it's LRU
• Don't cache to disk if computation is cheap
• Caching with a replication factor above 1: only when recreation is extremely costly
• Checkpointing vs caching
• Shuffle data is automatically persisted
Optimize shuffle

• Use the same number of partitions for RDDs you combine, so they stay co-partitioned
• Change map to mapValues, which preserves the partitioner
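The mapValues point can be sketched without Spark: mapValues guarantees keys are untouched, so every record provably stays in its hash partition, while a general map may rewrite keys and forces Spark to drop the partitioner. A plain-Python illustration (hypothetical names):

```python
def partition_of(key, num_partitions=8):
    """Hash partitioning, in the style of Spark's HashPartitioner."""
    return hash(key) % num_partitions

pairs = [("user-1", 10), ("user-2", 20), ("user-3", 30)]

# mapValues-style transform: keys untouched, so each record stays in the
# same partition and the existing partitioning remains valid.
doubled = [(k, v * 2) for k, v in pairs]
assert [partition_of(k) for k, _ in pairs] == [partition_of(k) for k, _ in doubled]

# map-style transform: the function sees the whole pair and may change
# the key, so records can land in different partitions and Spark must
# conservatively forget the partitioner.
rekeyed = [(k.upper(), v) for k, v in pairs]
print("mapValues-style transform keeps every key in its partition")
```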
Broadcast Variable

[Diagram: the small RDD2 is broadcast to every executor; each executor joins its partition of RDD1 (RDD1_0 … RDD1_3) against its local copy of RDD2, so no shuffle is needed]
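The broadcast (map-side) join in the diagram, sketched in plain Python with illustrative data: the small dataset is shipped to every executor once, and each partition of the big dataset joins against it locally, with no shuffle.

```python
# Small dataset (RDD2): a local copy is broadcast to every executor.
countries = {"SE": "Sweden", "PL": "Poland", "BE": "Belgium"}

# Big dataset (RDD1), split into partitions processed by different executors.
partitions = [
    [("SE", "order-1"), ("PL", "order-2")],
    [("BE", "order-3"), ("SE", "order-4")],
]

def join_partition(partition, broadcast):
    """Each task joins its own partition against the broadcast map locally."""
    return [(code, order, broadcast[code]) for code, order in partition]

joined = [row for part in partitions for row in join_partition(part, countries)]
print(joined)
# -> [('SE', 'order-1', 'Sweden'), ('PL', 'order-2', 'Poland'),
#     ('BE', 'order-3', 'Belgium'), ('SE', 'order-4', 'Sweden')]
```

This only pays off when one side is small enough to fit in each executor's memory; otherwise you are back to a shuffle join.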
Optimize shuffle - recap
• Control number of partitions
• Use mapValues instead of map if you can
• Broadcast variables
• Filter before shuffle
• Avoid groupByKey, use reduceByKey
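Why reduceByKey wins over groupByKey: it combines values map-side before the shuffle, while groupByKey ships every raw record. Counting shuffled records in a plain-Python sketch (illustrative data):

```python
from collections import defaultdict

partitions = [
    [("a", 1)] * 100 + [("b", 1)] * 50,   # partition on executor 1
    [("a", 1)] * 200,                     # partition on executor 2
]

def map_side_combine(partition):
    """What reduceByKey does before shuffling: pre-sum per key locally."""
    sums = defaultdict(int)
    for k, v in partition:
        sums[k] += v
    return list(sums.items())

# groupByKey-style: every raw record crosses the network.
group_by_key_shuffled = sum(len(p) for p in partitions)
# reduceByKey-style: only one pre-aggregated record per key per partition.
reduce_by_key_shuffled = sum(len(map_side_combine(p)) for p in partitions)

print(group_by_key_shuffled, reduce_by_key_shuffled)  # -> 350 3
```

Same final sums either way, but 3 shuffled records instead of 350; on real data the gap is what turns a spilling shuffle into a cheap one.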
Debugging tools
• Spark UI
• HDFS monitoring
• Aggregate your logs
• Extra Java options - observe your GC
• Run and test locally
General notes
• Tests vs no tests? - Test your code!
• You are probably not the only user of the cluster
• What are you optimizing for?
• Share the knowledge
• Spark actually works :)
goo.gl/7eTtvH