Spark and Scala
Sheng QIAN
2015-06-17
The Berkeley Data Analytics Stack
The Goal of Spark
Comparison of Spark and Hadoop
Spark supports …
• Scala (Best)
• Python (2.7.*)
• Java (…)
All based on RDD (Resilient Distributed Dataset)
• A list of partitions
• A function for computing each split
• A list of dependencies on other RDDs
• Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
• Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
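The five properties above can be pictured as a small Scala trait. This is a simplified sketch only: `SimpleRdd`, `CollectionRdd`, `MappedRdd` and `Partition` are hypothetical names chosen for illustration, not Spark's actual classes, and the round-robin split of the input is an assumption.

```scala
// A simplified model of the five RDD properties (illustrative names, not Spark's API).
object RddSketch {
  final case class Partition(index: Int)

  trait SimpleRdd[T] {
    def partitions: Seq[Partition]                      // 1. a list of partitions
    def compute(split: Partition): Iterator[T]          // 2. a function for computing each split
    def dependencies: Seq[SimpleRdd[_]]                 // 3. dependencies on other RDDs
    def partitioner: Option[Any => Int] = None          // 4. optional partitioner (key -> partition index)
    def preferredLocations(split: Partition): Seq[String] = Nil // 5. optional preferred locations
  }

  // A source dataset built from an in-memory collection, dealt round-robin into numSlices partitions.
  final class CollectionRdd[T](data: Seq[T], numSlices: Int) extends SimpleRdd[T] {
    val partitions: Seq[Partition] = (0 until numSlices).map(Partition(_))
    def compute(split: Partition): Iterator[T] =
      data.iterator.zipWithIndex.collect { case (x, i) if i % numSlices == split.index => x }
    val dependencies: Seq[SimpleRdd[_]] = Nil
  }

  // A derived dataset: same partitions as its parent, each computed by applying f.
  final class MappedRdd[A, B](parent: SimpleRdd[A], f: A => B) extends SimpleRdd[B] {
    def partitions: Seq[Partition] = parent.partitions
    def compute(split: Partition): Iterator[B] = parent.compute(split).map(f)
    val dependencies: Seq[SimpleRdd[_]] = Seq(parent)
  }

  def main(args: Array[String]): Unit = {
    val base = new CollectionRdd(Seq(1, 2, 3, 4, 5, 6), numSlices = 2)
    val doubled = new MappedRdd[Int, Int](base, _ * 2)
    doubled.partitions.foreach { p =>
      println(s"partition ${p.index}: ${doubled.compute(p).toList}")
    }
  }
}
```

The derived dataset stores no data of its own. It only knows its parent and the function to apply, which is what makes the dependency list useful later.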
The process
1. File system (HDFS/HBase) or collections → RDD
2. Transformation (lazy: execution is deferred) * Faster than MR partly due to this
3. Action (triggers execution)
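The three steps can be imitated in plain Scala without a cluster. The sketch below assumes a toy `Dataset` class (a hypothetical name, not part of Spark) whose `map` and `filter` only build a recipe, while `collect` and `count` run it. Spark's RDD API uses the same method names for its transformations and actions.

```scala
object LazyPipeline {
  // A dataset is a recipe for producing elements; nothing runs until an action is called.
  final class Dataset[T](recipe: () => Iterator[T]) {
    // Transformations: wrap the recipe in a new one, doing no work yet.
    def map[U](f: T => U): Dataset[U] = new Dataset(() => recipe().map(f))
    def filter(p: T => Boolean): Dataset[T] = new Dataset(() => recipe().filter(p))
    // Actions: actually run the recipe.
    def collect(): List[T] = recipe().toList
    def count(): Int = recipe().size
  }

  // Step 1: create a dataset from a local collection.
  def fromCollection[T](xs: Seq[T]): Dataset[T] = new Dataset(() => xs.iterator)

  def main(args: Array[String]): Unit = {
    var evaluations = 0 // how many times the map function has really been applied
    // Step 2: transformations are only recorded.
    val oddSquares = fromCollection(Seq(1, 2, 3, 4, 5))
      .map { x => evaluations += 1; x * x }
      .filter(_ % 2 == 1)
    println(s"evaluations before the action: $evaluations")
    // Step 3: the action triggers execution of the whole chain.
    println(oddSquares.collect())
    println(s"evaluations after the action: $evaluations")
  }
}
```

Because the chain is only described up front, an engine is free to plan the whole pipeline before running anything, which is the point of the deferred execution noted in step 2.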
Transformations and actions
Fault tolerance
Every RDD records the RDDs it depends on (its lineage), so a lost partition can be recomputed from its parents
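A minimal sketch of this idea follows, assuming a toy model in which each dataset keeps a reference to its parent and the function that derives it. `Node`, `Source`, `Mapped` and `Cached` are hypothetical names for illustration, not Spark classes.

```scala
object LineageSketch {
  // Each node knows how to compute any of its partitions from its parent.
  sealed trait Node[T] {
    def computePartition(i: Int): Vector[T]
    def lineage: List[String]
  }

  final case class Source[T](partitions: Vector[Vector[T]]) extends Node[T] {
    def computePartition(i: Int): Vector[T] = partitions(i)
    def lineage: List[String] = List("source")
  }

  final case class Mapped[A, B](parent: Node[A], name: String, f: A => B) extends Node[B] {
    def computePartition(i: Int): Vector[B] = parent.computePartition(i).map(f)
    def lineage: List[String] = parent.lineage :+ name
  }

  // An in-memory cache that may lose partitions; a miss is rebuilt by replaying the lineage.
  final class Cached[T](node: Node[T]) {
    private val store = scala.collection.mutable.Map.empty[Int, Vector[T]]
    var recomputations = 0
    def get(i: Int): Vector[T] =
      store.getOrElseUpdate(i, { recomputations += 1; node.computePartition(i) })
    def lose(i: Int): Unit = { store.remove(i); () } // simulate a failed node
  }

  def main(args: Array[String]): Unit = {
    val src = Source(Vector(Vector(1, 2), Vector(3, 4)))
    val node = Mapped(Mapped(src, "plusOne", (x: Int) => x + 1), "times10", (x: Int) => x * 10)
    val cached = new Cached(node)
    println(cached.get(1))   // computed once and stored
    cached.lose(1)           // the stored copy disappears
    println(cached.get(1))   // rebuilt from the recorded dependencies
    println(node.lineage.mkString(" -> "))
  }
}
```

Only the lost partition is recomputed. Nothing needs to be replicated ahead of time, as long as the source data is still readable.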
Cluster Overview
Task Scheduling
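DAG Scheduler
• Builds a DAG of stages and decides the best location for each task
• Records which RDDs or stage outputs have been materialized
• Passes each TaskSet to the underlying TaskScheduler
• Resubmits stages whose shuffle output was lost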
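How a DAG Scheduler might group work into stages can be sketched for the simplest case, a linear chain of operations. The sketch assumes that a new stage starts at every operation that needs a shuffle (a wide dependency). `Op`, `Narrow`, `Wide` and `splitIntoStages` are hypothetical names, and a real DAG with branches and joins needs more than this.

```scala
object StageSketch {
  sealed trait Dep
  case object Narrow extends Dep // no shuffle needed, e.g. map or filter
  case object Wide extends Dep   // shuffle needed, e.g. reduceByKey

  final case class Op(name: String, dep: Dep)

  // Walk the chain in order and open a new stage at every wide dependency.
  def splitIntoStages(ops: List[Op]): List[List[String]] = {
    val reversed = ops.foldLeft(List(List.empty[String])) { (stages, op) =>
      op.dep match {
        case Wide if stages.head.nonEmpty => List(op.name) :: stages
        case _                            => (op.name :: stages.head) :: stages.tail
      }
    }
    // Stages and their operations were accumulated back to front.
    reversed.map(_.reverse).reverse
  }

  def main(args: Array[String]): Unit = {
    val job = List(
      Op("textFile", Narrow),
      Op("flatMap", Narrow),
      Op("map", Narrow),
      Op("reduceByKey", Wide),
      Op("mapValues", Narrow)
    )
    splitIntoStages(job).zipWithIndex.foreach { case (stage, i) =>
      println(s"stage $i: ${stage.mkString(", ")}")
    }
  }
}
```

Operations inside one stage can be pipelined on the same partition, while the boundary between stages is where shuffle output is written and may later need to be regenerated.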
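Task Scheduler
• Submits a TaskSet (a group of tasks) to the cluster to run and reports the results
• Reports a fetch failed error when shuffle output is lost
• Retries straggler tasks on other nodes
• Maintains a TaskSetManager for each TaskSet (tracking locality and error information)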
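The bookkeeping a TaskSetManager does for locality and errors can be illustrated with a toy version. This is a sketch under assumptions: `SimpleTaskSetManager`, `Task`, `pickNode` and `recordAttempt` are hypothetical names, and the policy shown (prefer an untried node that holds the data, otherwise any untried node) is a simplification chosen for the example, not Spark's actual algorithm.

```scala
object TaskPlacementSketch {
  final case class Task(id: Int, preferredNodes: Set[String])

  // Tracks, per task, which nodes were already tried and how many attempts failed.
  final class SimpleTaskSetManager(nodes: Seq[String]) {
    private val failures =
      scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
    private val tried =
      scala.collection.mutable.Map.empty[Int, Set[String]].withDefaultValue(Set.empty[String])

    // Prefer an untried node that holds the task's data; otherwise fall back to any untried node.
    def pickNode(task: Task): Option[String] = {
      val untried = nodes.filterNot(tried(task.id))
      untried.find(task.preferredNodes).orElse(untried.headOption)
    }

    // Record the outcome so that a retry is sent somewhere else.
    def recordAttempt(task: Task, node: String, succeeded: Boolean): Unit = {
      tried(task.id) = tried(task.id) + node
      if (!succeeded) failures(task.id) = failures(task.id) + 1
    }

    def failureCount(task: Task): Int = failures(task.id)
  }

  def main(args: Array[String]): Unit = {
    val manager = new SimpleTaskSetManager(Seq("node-a", "node-b", "node-c"))
    val task = Task(0, preferredNodes = Set("node-b"))
    println(manager.pickNode(task))                     // data-local choice
    manager.recordAttempt(task, "node-b", succeeded = false)
    println(manager.pickNode(task))                     // retry goes to a different node
  }
}
```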
Job Scheduling
Job Optimization
Why Scala
• Based on the JVM
• FP + OO
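A short illustration of what "FP + OO" can look like in practice. The `Shape` hierarchy is a made-up example for this note, not something taken from the talk.

```scala
object FpPlusOo {
  // OO side: a sealed hierarchy whose members share one interface.
  sealed trait Shape { def area: Double }
  final case class Circle(radius: Double) extends Shape {
    def area: Double = math.Pi * radius * radius
  }
  final case class Rect(width: Double, height: Double) extends Shape {
    def area: Double = width * height
  }

  // FP side: immutable values, higher-order functions, pattern matching.
  def totalArea(shapes: List[Shape]): Double = shapes.map(_.area).sum

  def describe(shape: Shape): String = shape match {
    case Circle(r)  => s"circle of radius $r"
    case Rect(w, h) => s"rectangle of $w by $h"
  }

  def main(args: Array[String]): Unit = {
    val shapes = List(Rect(2, 3), Rect(1, 1), Circle(1))
    shapes.foreach(s => println(describe(s)))
    println(totalArea(shapes))
  }
}
```

The same collection-style calls (`map`, `filter`, and so on) are what Spark exposes on distributed data, which is one reason the Scala API reads naturally.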
Scala - Grammar
On Evernote
Thank you