Spark and Scala Sheng QIAN 2015-06-17. The Berkeley Data Analytics Stack

Spark and Scala

Sheng QIAN

2015-06-17

The Berkeley Data Analytics Stack

The Goal of Spark

Comparison between Spark and Hadoop

Spark supports …

• Scala (Best)

• Python (2.7.*)

• Java (…)

All based on RDD (Resilient Distributed Dataset)

• A list of partitions

• A function for computing each split

• A list of dependencies on other RDDs

• Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

• Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
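The five properties above can be sketched as a small trait. This is a toy model in plain Scala, not Spark's own classes: `ToyRDD`, `Partition`, `RangeRDD`, `MappedRDD` and `collect` are hypothetical names chosen for illustration, and everything runs in a single JVM.

```scala
// Toy model only: the members mirror the five listed properties.
final case class Partition(index: Int)

trait ToyRDD[T] {
  def partitions: Seq[Partition]                               // a list of partitions
  def compute(split: Partition): Iterator[T]                   // a function for computing each split
  def dependencies: Seq[ToyRDD[_]]                             // a list of dependencies on other RDDs
  def partitioner: Option[Any => Int] = None                   // optional: key => partition index
  def preferredLocations(split: Partition): Seq[String] = Nil  // optional: host names
}

// A source dataset: the numbers 0 until n, dealt round-robin into numSlices partitions.
final class RangeRDD(n: Int, numSlices: Int) extends ToyRDD[Int] {
  val partitions: Seq[Partition] = (0 until numSlices).map(Partition(_))
  def compute(split: Partition): Iterator[Int] =
    (0 until n).iterator.filter(_ % numSlices == split.index)
  val dependencies: Seq[ToyRDD[_]] = Nil
}

// A derived dataset: same partitions as its parent, one dependency, computed on demand.
final class MappedRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  def partitions: Seq[Partition] = parent.partitions
  def compute(split: Partition): Iterator[B] = parent.compute(split).map(f)
  val dependencies: Seq[ToyRDD[_]] = Seq(parent)
}

object ToyRDDDemo {
  // Computes every partition in order and gathers the results.
  def collect[T](rdd: ToyRDD[T]): List[T] =
    rdd.partitions.toList.flatMap(p => rdd.compute(p))
}
```

For example, `ToyRDDDemo.collect(new MappedRDD[Int, Int](new RangeRDD(6, 2), _ * 2))` computes partition 0 and then partition 1 of the derived dataset.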

The process

1. File System (HDFS/HBase) / Collections → RDD

2. Transformation (lazy: execution is delayed) * One reason Spark can be faster than MapReduce

3. Action (triggers execution)
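The three steps can be imitated with an ordinary Scala collection view. This is only an analogy for delayed execution, not Spark code: `LazyPipelineDemo` and `run` are hypothetical names, and an in-memory `Seq` stands in for an HDFS/HBase input.

```scala
object LazyPipelineDemo {
  // Returns (function calls before the action, result of the action, function calls after it).
  def run(lines: Seq[String]): (Int, Int, Int) = {
    var evaluated = 0 // counts how often the mapping function has actually run

    // 1. Source: an in-memory collection.
    // 2. Transformation: the view only describes the work; nothing runs yet.
    val lengths = lines.view.map { line => evaluated += 1; line.length }
    val before = evaluated

    // 3. Action: summing forces the view to be evaluated.
    val total = lengths.sum
    (before, total, evaluated)
  }
}
```

Calling `LazyPipelineDemo.run(Seq("spark", "scala", "rdd"))` shows that the mapping function has not run at all before the action. Deferring work this way lets an engine see the whole chain of transformations before it executes anything.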

Transformations and actions

Fault tolerance

Every RDD records the RDDs it depends on (its lineage), so a lost partition can be recomputed from its parents.
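A minimal sketch of recovery from recorded dependencies, in plain Scala. All names here (`LineageDemo`, `Dataset`, `Source`, `Mapped`, `Cached`) are hypothetical, and the "loss" is simulated by clearing an in-memory cache rather than by a real node failure.

```scala
object LineageDemo {
  // A dataset is either a source or derived from a parent it remembers.
  sealed trait Dataset { def compute(): Vector[Int] }

  final case class Source(data: Vector[Int]) extends Dataset {
    def compute(): Vector[Int] = data
  }

  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    // Recomputation walks back through the recorded parent.
    def compute(): Vector[Int] = parent.compute().map(f)
  }

  // Number of datasets on the path from d back to its source.
  def lineageDepth(d: Dataset): Int = d match {
    case Source(_)         => 1
    case Mapped(parent, _) => 1 + lineageDepth(parent)
  }

  // Holds a materialized copy that can be lost and rebuilt from lineage.
  final class Cached(dataset: Dataset) {
    private var cache: Option[Vector[Int]] = None
    def get(): Vector[Int] = cache.getOrElse {
      val rebuilt = dataset.compute() // rebuild from the dependencies
      cache = Some(rebuilt)
      rebuilt
    }
    def lose(): Unit = cache = None // simulate losing the materialized data
  }
}
```

After `lose()`, the next `get()` returns the same values as before because the derived dataset still knows how it was built.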

Cluster Overview

Task Schedule

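DAG Scheduler

• Builds a DAG of stages and decides the best location for each task

• Records which RDDs or stage outputs have been materialized

• Passes each taskset to the underlying TaskScheduler

• Resubmits stages whose shuffle output has been lost

Task Scheduler

• Submits a taskset (a group of tasks) to the cluster to run and reports the results

• Reports a fetch failed error when shuffle output is lost

• Retries straggler tasks on other nodes

• Maintains a TaskSetManager for each TaskSet (tracking locality and error information)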
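To make the idea of stages concrete, here is a simplified sketch that cuts a linear chain of operations into stages at shuffle boundaries. It assumes a straight pipeline rather than a general DAG, and `StageDemo`, `Op` and `splitIntoStages` are hypothetical names, not part of Spark's scheduler.

```scala
object StageDemo {
  // One step of a pipeline; needsShuffle marks steps that require a shuffle first.
  final case class Op(name: String, needsShuffle: Boolean)

  // Groups consecutive steps into stages; a step that needs a shuffle starts a new stage.
  def splitIntoStages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)                       => List(List(op.name))
      case (stages, op) if op.needsShuffle => stages :+ List(op.name)
      case (stages, op)                    => stages.init :+ (stages.last :+ op.name)
    }
}
```

With the steps `map`, `filter`, `reduceByKey` (needs a shuffle) and `mapValues`, this yields two stages: the first two steps, then the last two. Each stage corresponds to a taskset that would be handed to the task scheduler.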

Job Schedule

Job Optimization

Why Scala

• Based on the JVM

• FP + OO
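A small illustration of both points, using only the standard library. `WordCountDemo`, `Word` and `count` are hypothetical names for this sketch.

```scala
object WordCountDemo {
  // OO side: a small immutable class with a method.
  final case class Word(text: String) {
    // JVM side: a Java class (java.util.Locale) is used directly from Scala.
    def normalized: String = text.toLowerCase(java.util.Locale.ROOT)
  }

  // FP side: higher-order functions over immutable collections.
  def count(words: Seq[Word]): Map[String, Int] =
    words.groupBy(_.normalized).map { case (key, group) => key -> group.size }
}
```

For instance, `WordCountDemo.count(Seq(Word("Spark"), Word("spark"), Word("Scala")))` groups the two spellings of "spark" under one key.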

Scala - Grammar

On Evernote

Thank you
