Spark and Scala Sheng QIAN 2015-06-17

Spark and Scala

Sheng QIAN

2015-06-17

The Berkeley Data Analytics Stack

The Goal of Spark

Comparison between Spark and Hadoop

Spark supports …

• Scala (Best)

• Python (2.7.*)

• Java (…)

All based on RDD (Resilient Distributed Dataset)

• A list of partitions

• A function for computing each split

• A list of dependencies on other RDDs

• Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

• Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
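
The five properties above map naturally onto a small interface. The following is a minimal sketch in plain Scala, not Spark's actual source: `MiniRDD`, `Split`, `RangeRDD`, and `MappedRDD` are hypothetical names introduced only for illustration.

```scala
object RddSketch {
  // Hypothetical stand-in for one partition (split) of a dataset.
  final case class Split(index: Int)

  // Hypothetical, simplified interface mirroring the five listed properties.
  trait MiniRDD[+T] {
    def partitions: Seq[Split]                               // a list of partitions
    def compute(split: Split): Iterator[T]                   // computes one split
    def dependencies: Seq[MiniRDD[Any]]                      // parents this dataset depends on
    def partitioner: Option[Any => Int] = None               // optional, for key-value data
    def preferredLocations(split: Split): Seq[String] = Nil  // optional locality hints
  }

  // A source dataset: the integers [0, n) cut into contiguous ranges.
  final class RangeRDD(n: Int, numSplits: Int) extends MiniRDD[Int] {
    def partitions: Seq[Split] = (0 until numSplits).map(i => Split(i))
    def compute(split: Split): Iterator[Int] = {
      val start = split.index * n / numSplits
      val end = (split.index + 1) * n / numSplits
      (start until end).iterator
    }
    def dependencies: Seq[MiniRDD[Any]] = Nil
  }

  // A derived dataset: one parent, same partitions, function applied per element.
  final class MappedRDD[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def partitions: Seq[Split] = parent.partitions
    def compute(split: Split): Iterator[B] = parent.compute(split).map(f)
    def dependencies: Seq[MiniRDD[Any]] = Seq(parent)
  }
}
```

In this sketch a derived dataset stores no data of its own; it only knows its parent and the function to apply, which is what makes the dependency list useful later for recomputation.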

The process

1. File System (HDFS/HBase) / Collections → RDD

2. Transformation (lazy, delayed execution) * Faster than MapReduce (MR) because of this

3. Action (execution)
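
The three steps above can be illustrated without a cluster. The sketch below uses a plain Scala `Iterator` as a stand-in for an RDD (an assumption made for illustration; this is not the Spark API). Building the pipeline runs nothing, and only the final "action" triggers computation. `LazyPipelineSketch` and `calls` are hypothetical names.

```scala
object LazyPipelineSketch {
  // Counts how many times the mapping function actually runs.
  var calls = 0

  // Returns (calls before the action, calls after the action, result).
  def run(): (Int, Int, Int) = {
    calls = 0
    // 1. Source: an in-memory collection stands in for HDFS/HBase input.
    val source = (1 to 5).iterator
    // 2. Transformations: only recorded here, nothing executes yet.
    val transformed = source.map { x => calls += 1; x * x }.filter(_ % 2 == 1)
    val callsBeforeAction = calls
    // 3. Action: forces the whole pipeline in a single pass.
    val result = transformed.sum
    (callsBeforeAction, calls, result)
  }
}
```

Calling `LazyPipelineSketch.run()` shows that the mapping function has not run at all before the action, and that the map and filter steps are then applied together in one pass over the data.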

Transformations and actions

Fault tolerance

Every RDD records the RDDs it depends on (its lineage), so a lost partition can be recomputed from its parents
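
As a rough illustration of how recorded dependencies enable recovery, the sketch below models a dataset as either a source or a transformation of a parent. `Dataset`, `Source`, `Mapped`, and `Filtered` are hypothetical names, not Spark classes; a lost partition is rebuilt by replaying the chain back to the source.

```scala
object LineageSketch {
  // Each node records only how it was derived from its parent.
  sealed trait Dataset
  final case class Source(partitions: Vector[Vector[Int]]) extends Dataset
  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset
  final case class Filtered(parent: Dataset, p: Int => Boolean) extends Dataset

  // Recompute a single partition by walking the lineage back to the source.
  def recompute(ds: Dataset, partition: Int): Vector[Int] = ds match {
    case Source(parts)          => parts(partition)
    case Mapped(parent, f)      => recompute(parent, partition).map(f)
    case Filtered(parent, pred) => recompute(parent, partition).filter(pred)
  }

  // Human-readable lineage, newest step first.
  def lineage(ds: Dataset): List[String] = ds match {
    case Source(_)           => List("Source")
    case Mapped(parent, _)   => "Mapped" :: lineage(parent)
    case Filtered(parent, _) => "Filtered" :: lineage(parent)
  }
}
```

Only the requested partition is recomputed in this sketch; other partitions are never touched, which is the property that keeps recovery cheap compared with replicating all intermediate data.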

Cluster Overview

Task Scheduling

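DAG Scheduler

• Builds a DAG of stages and decides the best location for each task

• Tracks which RDD or stage outputs have been materialized

• Passes each TaskSet to the underlying scheduler, TaskScheduler

• Resubmits stages whose shuffle output has been lost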
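
To make the DAG Scheduler's stage-building step concrete, here is a toy sketch that cuts a linear chain of operations into stages at shuffle boundaries. The `Op` type and `splitIntoStages` helper are hypothetical, and the real scheduler works on a full dependency graph rather than a flat list.

```scala
object StageSketch {
  // Hypothetical operation; `shuffle = true` means it needs data
  // from all partitions of its input, so it cannot be pipelined.
  final case class Op(name: String, shuffle: Boolean)

  // Group consecutive ops into stages; a shuffle op starts a new stage.
  def splitIntoStages(ops: List[Op]): List[List[String]] = {
    val reversedStages = ops.foldLeft(List.empty[List[String]]) { (stages, op) =>
      stages match {
        case current :: done if !op.shuffle => (op.name :: current) :: done
        case _                              => List(op.name) :: stages
      }
    }
    // Stages and their ops were accumulated newest-first; restore order.
    reversedStages.map(_.reverse).reverse
  }
}
```

For a chain such as `textFile`, `map`, `reduceByKey`, `filter`, `sortByKey`, this sketch yields three stages: the narrow operations are pipelined together, and each shuffle operation begins a new stage.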

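Task Scheduler

• Submits each TaskSet (a group of tasks) to the cluster to run and reports the results

• Reports a fetch-failed error when shuffle output is lost

• Retries straggler tasks on other nodes

• Maintains a TaskSetManager for each TaskSet (tracks locality and failure information)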
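
The Task Scheduler's retry behaviour can be sketched as a small recursive loop. Everything here is hypothetical (`runWithRetry`, the node names, the `attempt` function); it only illustrates moving a failed task to a different node and is not Spark's actual TaskSetManager logic.

```scala
object RetrySketch {
  import scala.util.{Failure, Success, Try}

  // Try the task on each candidate node in order. Returns the first success
  // together with the node that produced it, or None if every node failed.
  def runWithRetry[T](nodes: List[String])(attempt: String => T): Option[(String, T)] =
    nodes match {
      case Nil => None
      case node :: rest =>
        Try(attempt(node)) match {
          case Success(value) => Some((node, value))
          case Failure(_)     => runWithRetry(rest)(attempt)
        }
    }
}
```

A real scheduler would also weigh data locality when choosing the next node and would cap the number of attempts; both are left out here to keep the sketch short.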

Job Scheduling

Job Optimization

Why Scala

• Based on the JVM

• FP + OO
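
A brief sketch of what "FP + OO" looks like in practice: a case class (object-oriented modelling) processed with higher-order functions (functional style). The `Visit` type, the 60-second threshold, and the helper name are made up for illustration.

```scala
object FpOoSketch {
  // OO side: an immutable data class that carries its own behaviour.
  final case class Visit(user: String, seconds: Int) {
    def isLong: Boolean = seconds >= 60
  }

  // FP side: a pure function built from higher-order collection operations.
  def longVisitSecondsByUser(visits: List[Visit]): Map[String, Int] =
    visits
      .filter(_.isLong)
      .groupBy(_.user)
      .map { case (user, vs) => user -> vs.map(_.seconds).sum }
}
```

The same `filter` / `map` / grouping style carries over to RDD code, which is one reason Scala reads naturally for Spark jobs.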

Scala - Grammar

On Evernote

Thank you