
Introduction to Spark


Page 1: Introduction to Spark

INTRODUCTION TO SPARK

Tangzk

2014/4/8

Page 2: Introduction to Spark


OUTLINE

- BDAS (the Berkeley Data Analytics Stack)
- Spark
- Other Components based on Spark
- Introduction to Scala
- Spark Programming Practices
- Debugging and Testing Spark Programs
- Learning Spark
- Why are Previous MapReduce-Based Systems Slow?
- Conclusion

Page 3: Introduction to Spark


BDAS(THE BERKELEY DATA ANALYTICS STACK)


Page 4: Introduction to Spark


SPARK

- Master/slave architecture
- In-memory computing platform
- Resilient Distributed Datasets (RDDs) abstraction
- DAG execution engine
- Fault recovery using lineage
- Supports interactive data mining
- Fits into the Hadoop ecosystem

Page 5: Introduction to Spark


SPARK – IN-MEMORY COMPUTING

- Hadoop: two-stage MapReduce topology
- Spark: DAG execution topology

Page 6: Introduction to Spark


SPARK – RDD(RESILIENT DISTRIBUTED DATASETS)

- Data computing and storage abstraction
- Records organized by partition
- Immutable (read-only): RDDs can only be created through transformations
- Moves computation to the data instead of moving the data
- Coarse-grained programming interfaces
- Partition reconstruction using lineage

Page 7: Introduction to Spark


SPARK-RDD: COARSE-GRAINED PROGRAMMING INTERFACES

- Transformation: defines one or more new RDDs
- Action: returns a value to the driver
- Lazy computation: transformations are only evaluated when an action needs their results (see the sketch below)
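A minimal sketch of this laziness, assuming a SparkContext named sc (as in the interactive shell):

// Transformations only build a lineage graph; nothing runs yet
val nums = sc.parallelize(1 to 1000)    // base RDD
val squares = nums.map(n => n * n)      // transformation: returns a new RDD, lazily
val evens = squares.filter(_ % 2 == 0)  // transformation: still nothing computed
val total = evens.count()               // action: now the whole chain executes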

Page 8: Introduction to Spark


SPARK-RDD: LINEAGE GRAPH

Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format):

val logs = spark.textFile("hdfs://...")
val accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
accessesByIp.persist()
accessesByIp.filter(_.contains("GET"))
  .map(_.split('\t')(3))
  .collect()

Page 9: Introduction to Spark


SPARK - INSTALLATION

- Running Spark
  - Local mode
  - Standalone mode
  - Cluster mode: on YARN/Mesos
- Supported programming languages
  - Scala
  - Java
  - Python
- Spark interactive shell
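As a sketch of how the mode is picked when a context is created (the master URL forms follow the Spark docs; the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// The master URL selects the deployment mode:
//   "local[4]"          - local mode with 4 worker threads
//   "spark://host:7077" - a standalone cluster master
//   YARN/Mesos          - via the corresponding cluster-manager master URL
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("IntroToSpark")
val sc = new SparkContext(conf)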

Page 10: Introduction to Spark


OTHER COMPONENTS(1) - SHARK

- SQL and Rich Analytics at Scale
- Partial DAG Execution (PDE) optimizes query plans at runtime
- Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop

Page 11: Introduction to Spark


OTHER COMPONENTS(2) – SPARK STREAMING

- Scalable, fault-tolerant streaming computation framework
- Discretized streams abstraction: splits a continuous dataflow into batches of input data (see the sketch below)
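A minimal DStream sketch, assuming a text source on a local socket (host, port, and the 1-second batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // each 1s batch becomes an RDD

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()  // print the word counts of every batch

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until stopped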

Page 12: Introduction to Spark


OTHER COMPONENTS(3) - MLBASE

Distributed Machine Learning Library

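MLbase builds on Spark's MLlib at its lowest layer; a hedged sketch of calling MLlib's k-means (input path, k, and the iteration cap are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors
val points = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster into k = 2 groups with at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)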

Page 13: Introduction to Spark


OTHER COMPONENTS(4) – GRAPHX

- Distributed graph computing framework
- RDG (Resilient Distributed Graph) abstraction
- Supports the Gather-Apply-Scatter (GAS) model from GraphLab (see the sketch below)
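A hedged GAS-style sketch using GraphX's later aggregateMessages API (the deck predates this API; the edge-list path is a placeholder):

import org.apache.spark.graphx._

// Build a graph from a file of "srcId dstId" pairs
val graph = GraphLoader.edgeListFile(sc, "hdfs://...")

// Gather/scatter: every edge sends 1 to its destination vertex;
// apply: the partial counts are merged per vertex, yielding in-degrees.
val inDegrees = graph.aggregateMessages[Int](
  triplet => triplet.sendToDst(1),
  _ + _
)
inDegrees.take(5).foreach(println)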

Page 14: Introduction to Spark


GAS MODEL - GRAPHLAB

[Figure: the GAS (Gather-Apply-Scatter) model in GraphLab. Partial sums Σ1-Σ4 are gathered from neighbors across machines 1-4 ("Gather infos from neighbors"), a vertex update computes Y' from the combined Σ ("Apply vertex update"), and the master vertex's new value Y' is propagated to its mirrors.]

From Jiang Wenrui's thesis defense

Page 15: Introduction to Spark


INTRODUCTION TO SCALA(1)

- Runs on the JVM (and .Net)
- Full interoperability with Java
- Statically typed
- Object-oriented
- Functional programming

Page 16: Introduction to Spark


INTRODUCTION TO SCALA(2)

Declare a list of integers:

val ints = List(1, 2, 4, 5, 7, 3)

Declare a function, cube, that computes the cube of an Int:

def cube(a: Int): Int = a * a * a

Apply the cube function to the list:

val cubes = ints.map(x => cube(x))

Sum the cubes of the integers:

cubes.reduce((x1, x2) => x1 + x2)

Define a factorial function that computes n!:

def fact(n: Int): Int = {
  if (n == 0) 1
  else n * fact(n - 1)
}

Page 17: Introduction to Spark


SPARK IN PRACTICES(1)

"Hello World" (interactive shell):

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

"Hello World" (standalone app): see the sketch below.
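The standalone-app listing did not survive the transcript; a minimal sketch of the same word count as a self-contained app (object name and output path are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("hdfs://...")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")  // write results instead of collecting
    sc.stop()
  }
}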

Page 18: Introduction to Spark


SPARK IN PRACTICES(2)

PageRank in Spark (see the sketch below)
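The slide's listing itself is lost; a sketch of the classic Spark PageRank example it presumably showed (the input path and the 10-iteration count are assumptions):

// Input: lines of "url neighborUrl"; adjacency lists are cached for reuse
val links = sc.textFile("hdfs://...")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }
  .distinct()
  .groupByKey()
  .cache()

var ranks = links.mapValues(_ => 1.0)  // start every page at rank 1.0
for (_ <- 1 to 10) {
  // Each page divides its rank among its outgoing links
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  // Damping: 0.15 base plus 0.85 of the gathered contributions
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.take(5).foreach(println)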

Page 19: Introduction to Spark


DEBUGGING AND TESTING SPARK PROGRAMS(1)

- Running in local mode
  - val sc = new SparkContext("local", name)
  - Debug in the IDE
- Running in standalone/cluster mode
  - Job web UI on port 8080 (standalone master) and 4040 (application)
  - Log4j
  - jstack/jmap
  - dstat/iostat/lsof -p
- Unit tests: run in local mode (see the sketch below)
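A hedged local-mode unit test sketch, assuming ScalaTest (the suite name and data are illustrative):

import org.apache.spark.SparkContext
import org.scalatest.funsuite.AnyFunSuite

class WordCountSuite extends AnyFunSuite {
  test("word count in local mode") {
    val sc = new SparkContext("local", "WordCountSuite")  // runs inside the test JVM
    try {
      val counts = sc.parallelize(Seq("a b", "b"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("b") == 2)
    } finally {
      sc.stop()  // release the context so other tests can create one
    }
  }
}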

Page 20: Introduction to Spark


DEBUGGING AND TESTING SPARK PROGRAMS(2)

Sample output from joining pair RDDs:

RDDJoin2: (2,4) (1,2)
RDDJoin3: (1,(1,3)) (1,(2,3))
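The code behind this output is lost; a hedged stand-in that reproduces a result like RDDJoin3 (the input values are guesses from the printed output):

val rdd1 = sc.parallelize(Seq((1, 1), (1, 2)))
val rdd2 = sc.parallelize(Seq((2, 4), (1, 3)))
val rddJoin3 = rdd1.join(rdd2)  // inner join on the key; key 2 has no match in rdd1
println(rddJoin3.collect().mkString(" "))  // (1,(1,3)) (1,(2,3)) -- order may vary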

Page 21: Introduction to Spark


DEBUGGING AND TESTING SPARK PROGRAMS(3)

Tuning Spark:

- Large objects captured in lambdas should be replaced by broadcast variables (see the sketch after this list)
- Coalesce partitions to avoid a large number of empty tasks after filtering operations
- Make good use of partitioning for data locality (mapPartitions)
- Choose the partitioning key well to balance data
- Set spark.local.dir to a set of disks
- Take care with the number of reduce tasks
- Don't collect data to the driver; write it to HDFS directly
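A minimal broadcast sketch for the first point, assuming an existing RDD[String] named keys and a hypothetical loadBigTable() helper:

// Captured directly, `lookup` would be serialized into every task;
// broadcasting ships it to each worker only once.
val lookup: Map[String, Int] = loadBigTable()  // hypothetical: builds a large table
val bc = sc.broadcast(lookup)
val resolved = keys.map(k => bc.value.getOrElse(k, 0))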

Page 22: Introduction to Spark


LEARNING SPARK

- Spark Quick Start: http://spark.apache.org/docs/latest/quick-start.html
- Holden Karau, Fast Data Processing with Spark
- Spark docs: http://spark.apache.org/docs/latest/
- Spark source code: https://github.com/apache/spark
- Spark user mailing list: http://spark.apache.org/mailing-lists.html

Page 23: Introduction to Spark


WHY ARE PREVIOUS MAPREDUCE-BASED SYSTEMS SLOW?

Conventional thoughts:

- expensive data materialization for fault tolerance
- inferior data layout (e.g., lack of indices)
- costlier execution strategies

But Shark alleviates these with in-memory computing and storage and with Partial DAG Execution.

Experimental results in Shark point instead to:

- intermediate outputs
- data format and layout (addressed by co-partitioning)
- execution strategies (optimized using PDE)
- task scheduling cost

Page 24: Introduction to Spark


CONCLUSION

- Spark
  - In-memory computing platform for iterative and interactive tasks
  - RDD abstraction
  - Lineage reconstruction for fault recovery
  - A large number of components built on Spark
- Programming
  - Just think of an RDD like a vector
  - Functional programming in Scala
  - The IDE is not strong enough
  - Lack of good tools for debugging and testing