INTRODUCTION TO SPARK
Tangzk
2014/4/8
OUTLINE
- BDAS (the Berkeley Data Analytics Stack)
- Spark
- Other Components based on Spark
- Introduction to Scala
- Spark Programming Practices
- Debugging and Testing Spark Programs
- Learning Spark
- Why are Previous MapReduce-Based Systems Slow?
- Conclusion
BDAS(THE BERKELEY DATA ANALYTICS STACK)
[Figure: the BDAS software stack]
SPARK
- Master/Slave architecture
- In-memory computing platform
- Resilient Distributed Datasets (RDDs) abstraction
- DAG execution engine
- Fault recovery using lineage
- Supports interactive data mining
- Coexists with the Hadoop ecosystem
SPARK – IN-MEMORY COMPUTING
- Hadoop: two-stage MapReduce topology
- Spark: DAG execution topology
SPARK – RDD(RESILIENT DISTRIBUTED DATASETS)
- Data computing and storage abstraction
- Records organized by partitions
- Immutable (read-only): RDDs can only be created through transformations
- Moves computation to the data rather than data to the computation
- Coarse-grained programming interfaces
- Partition reconstruction using lineage
SPARK-RDD: COARSE-GRAINED PROGRAMMING INTERFACES
- Transformation: defines one or more new RDDs
- Action: returns a value to the driver
- Lazy computation: transformations are only executed when an action needs their result
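The lazy-computation point can be shown with a plain-Scala analogy (a sketch using `LazyList`, not the Spark API itself): the `map` "transformation" does no work until a `toList` "action" forces it.

```scala
// Plain-Scala analogy for RDD laziness (LazyList, not the Spark API):
// the map "transformation" runs nothing until an "action" forces it.
var evaluated = 0
val numbers = LazyList(1, 2, 3, 4)

// "Transformation": only builds a description of the computation.
val squares = numbers.map { n => evaluated += 1; n * n }
val before = evaluated // still 0: map was deferred

// "Action": forcing the LazyList triggers the deferred work.
val result = squares.toList
val after = evaluated // now 4: every element was computed
```

Spark transformations behave analogously: `filter`/`map` only record lineage, and the work happens when an action such as `collect()` runs.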
SPARK-RDD: LINEAGE GRAPH
Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format):

val logs = spark.textFile("hdfs://...")
val accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
accessesByIp.persist()
accessesByIp.filter(_.contains("GET"))
  .map(_.split('\t')(3))
  .collect()
SPARK - INSTALLATION
Running Spark:
- Local mode
- Standalone mode
- Cluster mode: on YARN/Mesos

Supported programming languages:
- Scala
- Java
- Python

Spark interactive shell
OTHER COMPONENTS(1) - SHARK
- SQL and rich analytics at scale
- Partial DAG Execution (PDE) optimizes query planning at runtime
- Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop
OTHER COMPONENTS(2) – SPARK STREAMING
Scalable fault-tolerant streaming computing framework
Discretized streams abstraction: separates the continuous dataflow into small batches of input data
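To make the batching idea concrete, here is a toy sketch (hypothetical names, plain Scala rather than the Spark Streaming API) that discretizes a timestamped event stream into fixed-width batches and runs an ordinary word count per batch:

```scala
// Toy sketch of the discretized-stream idea (not the Spark Streaming
// API): events carry a timestamp; we bucket them into fixed-width
// batches and run an ordinary batch computation on each bucket.
case class Event(timeMs: Long, word: String)

val batchMs = 1000L
val events = List(
  Event(100, "a"), Event(900, "b"),   // batch 0
  Event(1100, "a"), Event(1500, "a"), // batch 1
  Event(2300, "b")                    // batch 2
)

// Discretize: group the "stream" into batches by time window.
val batches: Map[Long, List[Event]] = events.groupBy(_.timeMs / batchMs)

// Per-batch word count, as Spark Streaming would run on each batch RDD.
val counts: Map[Long, Map[String, Int]] =
  batches.map { case (batch, evs) =>
    batch -> evs.groupBy(_.word).map { case (w, g) => w -> g.size }
  }
```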
OTHER COMPONENTS(3) - MLBASE
Distributed Machine Learning Library
OTHER COMPONENTS(4) – GRAPHX
- Distributed graph computing framework
- RDG (Resilient Distributed Graph) abstraction
- Supports the Gather-Apply-Scatter model from GraphLab
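As a rough illustration of Gather-Apply-Scatter, here is a plain-Scala sketch of one superstep on made-up data (not the GraphX API): each destination vertex gathers its in-neighbors' contributions, and a damped PageRank-style update is then applied.

```scala
// Minimal single-superstep sketch of Gather-Apply-Scatter (plain Scala,
// not GraphX; the graph and update rule are illustrative only).
val edges = List((1, 2), (1, 3), (2, 3), (3, 1)) // (src, dst)
val initial: Map[Int, Double] = Map(1 -> 1.0, 2 -> 1.0, 3 -> 1.0)
val outDegree: Map[Int, Int] =
  edges.groupBy(_._1).map { case (v, es) => v -> es.size }

// Gather: each dst vertex sums src-value / src-out-degree contributions.
val gathered: Map[Int, Double] = edges
  .map { case (src, dst) => dst -> initial(src) / outDegree(src) }
  .groupBy(_._1)
  .map { case (v, contribs) => v -> contribs.map(_._2).sum }

// Apply: damped PageRank-style update on every vertex.
val updated: Map[Int, Double] = initial.map { case (v, _) =>
  v -> (0.15 + 0.85 * gathered.getOrElse(v, 0.0))
}
```

In GraphLab/GraphX the gather runs in parallel at mirror vertices, the combined value is applied at the master, and the scatter phase would propagate the new value back out.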
GAS MODEL - GRAPHLAB
[Figure: one GAS superstep across four machines: partial sums Σ1..Σ4 are gathered from neighbors at mirror vertices, combined at the master vertex, the vertex update Y → Y' is applied, and Y' is synchronized back to the mirrors]
From Jiang Wenrui’s thesis defense
INTRODUCTION TO SCALA(1)
- Runs on the JVM (and .NET)
- Full interoperability with Java
- Statically typed
- Object-oriented
- Functional programming
INTRODUCTION TO SCALA(2)
// Declare a list of integers
val ints = List(1, 2, 4, 5, 7, 3)

// Declare a function, cube, that computes the cube of an Int
def cube(a: Int): Int = a * a * a

// Apply the cube function to the list
val cubes = ints.map(x => cube(x))

// Sum the cubes of the integers
cubes.reduce((x1, x2) => x1 + x2)

// Define a factorial function that computes n!
def fact(n: Int): Int = {
  if (n == 0) 1 else n * fact(n - 1)
}
SPARK IN PRACTICE(1)
"Hello World" (interactive shell):

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

"Hello World" (standalone app)
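The same pipeline can be checked locally with plain Scala collections, where `groupMapReduce` plays the role of `reduceByKey` (an analogy for experimenting without a cluster, not Spark itself):

```scala
// Word count on a plain Scala collection, mirroring the RDD pipeline:
// flatMap splits lines into words, groupMapReduce counts per word.
val lines = List("hello world", "hello spark")
val counts: Map[String, Int] = lines
  .flatMap(line => line.split(" "))
  .groupMapReduce(word => word)(_ => 1)(_ + _)
```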
SPARK IN PRACTICE(2)
PageRank in Spark
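The slide's PageRank code did not survive extraction. The following is a hedged reconstruction of the well-known Spark PageRank example, written against plain Scala collections so it runs locally; in Spark, `links` and `ranks` would be pair RDDs and the flatMap/reduce steps would be distributed.

```scala
// PageRank sketch in the Spark style, on local collections.
// links: page -> outgoing neighbors; ranks start at 1.0 per page.
val links: Map[String, List[String]] = Map(
  "a" -> List("b", "c"),
  "b" -> List("c"),
  "c" -> List("a")
)
var ranks: Map[String, Double] = links.map { case (p, _) => p -> 1.0 }

for (_ <- 1 to 10) {
  // Each page sends rank / outDegree to its neighbors
  // (a join + flatMap over pair RDDs in the Spark version).
  val contribs = links.toList.flatMap { case (page, outs) =>
    outs.map(dst => dst -> ranks(page) / outs.size)
  }
  // reduceByKey equivalent, followed by the damping update.
  ranks = contribs.groupMapReduce(_._1)(_._2)(_ + _)
    .map { case (p, c) => p -> (0.15 + 0.85 * c) }
}
```

With damping 0.85 the total rank mass stays constant at the number of pages, and rank concentrates on pages with more in-links (here "c" outranks "b").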
DEBUGGING AND TESTING SPARK PROGRAMS(1)
Running in local mode:
- sc = new SparkContext("local", name)
- Debug in an IDE

Running in standalone/cluster mode:
- Job web GUI on ports 8080/4040
- Log4j
- jstack/jmap
- dstat/iostat/lsof -p

Unit tests:
- Test in local mode
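One common way to make Spark programs unit-testable, in line with the slide's advice, is to keep per-record logic in pure functions and test those without a cluster (a sketch; `parseAccessTime` is our own illustrative name, reusing the tab-separated log format from the earlier example):

```scala
// Keep per-record logic in a pure function so it can be unit-tested
// directly; only thin driver code touches RDDs.
def parseAccessTime(line: String): Option[Int] = {
  val fields = line.split('\t')
  // Field 3 is the time field in the tab-separated log format.
  if (fields.length > 3) fields(3).toIntOption else None
}

// In the driver this would be: logs.flatMap(parseAccessTime)
// In a unit test we exercise it on plain strings:
val good = parseAccessTime("66.220.128.123\tGET\t/index\t42")
val bad  = parseAccessTime("malformed line")
```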
DEBUGGING AND TESTING SPARK PROGRAMS(2)
Example output from the slide's RDD join demo:
RDDJoin2: (2,4) (1,2)
RDDJoin3: (1,(1,3)) (1,(2,3))
DEBUGGING AND TESTING SPARK PROGRAMS(3)
Tuning Spark:
- Large objects captured in lambda operators should be replaced by broadcast variables
- Coalesce partitions to avoid large numbers of empty tasks after filtering operations
- Make good use of partitioning for data locality (mapPartitions)
- Choose the partitioning key well to balance data
- Set spark.local.dir to a set of disks
- Take care with the number of reduce tasks
- Don't collect data to the driver; write to HDFS directly
LEARNING SPARK
Spark Quick Start, http://spark.apache.org/docs/latest/quick-start.html
Holden Karau, Fast Data Processing with Spark
Spark Docs, http://spark.apache.org/docs/latest/
Spark Source code, https://github.com/apache/spark
Spark User Mailing list, http://spark.apache.org/mailing-lists.html
WHY ARE PREVIOUS MAPREDUCE-BASED SYSTEMS SLOW?
Conventional explanations: expensive data materialization for fault tolerance, inferior data layout (e.g., lack of indices), and costlier execution strategies.

But Hive/Shark alleviate these through:
- In-memory computing and storage
- Partial DAG execution

Experimental results in Shark point to:
- Intermediate outputs
- Data format and layout (co-partitioning)
- Execution strategy optimization using PDE
- Task scheduling cost
CONCLUSION
Spark:
- In-memory computing platform for iterative and interactive tasks
- RDD abstraction
- Lineage reconstruction for fault recovery
- A large number of components built on top of Spark

Programming:
- Think of an RDD as a vector
- Functional programming in Scala
- IDE support is not strong enough; good tools to debug and test are lacking