
Introduction to Spark


Page 1: Introduction to Spark

INTRODUCTION TO SPARK

Tangzk

2014/4/8

Page 2: Introduction to Spark


OUTLINE

- BDAS (the Berkeley Data Analytics Stack)
- Spark
- Other Components based on Spark
- Introduction to Scala
- Spark Programming Practices
- Debugging and Testing Spark Programs
- Learning Spark
- Why are Previous MapReduce-Based Systems Slow?
- Conclusion

Page 3: Introduction to Spark


BDAS(THE BERKELEY DATA ANALYTICS STACK)


Page 4: Introduction to Spark


SPARK

- Master/slave architecture
- In-memory computing platform
- Resilient Distributed Datasets (RDDs) abstraction
- DAG execution engine
- Fault recovery using lineage
- Supports interactive data mining
- Fits into the Hadoop ecosystem

Page 5: Introduction to Spark


SPARK – IN-MEMORY COMPUTING

- Hadoop: two-stage MapReduce topology
- Spark: DAG execution topology

Page 6: Introduction to Spark


SPARK – RDD(RESILIENT DISTRIBUTED DATASETS)

- Data computing and storage abstraction
- Records organized by partition
- Immutable (read-only): RDDs can only be created through transformations
- Moves computation to the data instead of moving the data
- Coarse-grained programming interfaces
- Partition reconstruction using lineage

Page 7: Introduction to Spark


SPARK-RDD: COARSE-GRAINED PROGRAMMING INTERFACES

- Transformation: defines one or more new RDDs
- Action: returns a value to the driver
- Lazy computation: transformations are only evaluated when an action needs their results (see the sketch below)
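A minimal sketch of this laziness, assuming a SparkContext named sc (as in the interactive shell):

// Transformations only build a lineage graph; nothing runs yet
val nums = sc.parallelize(1 to 1000)    // base RDD
val squares = nums.map(n => n * n)      // transformation: returns a new RDD, lazily
val evens = squares.filter(_ % 2 == 0)  // transformation: still nothing computed
val total = evens.count()               // action: now the whole chain executes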

Page 8: Introduction to Spark


SPARK-RDD: LINEAGE GRAPH

Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format):

val logs = spark.textFile("hdfs://...")
val accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
accessesByIp.persist()
accessesByIp.filter(_.contains("GET"))
  .map(_.split('\t')(3))
  .collect()

Page 9: Introduction to Spark


SPARK - INSTALLATION

- Running Spark
  - Local mode
  - Standalone mode
  - Cluster mode: on YARN/Mesos
- Supported programming languages
  - Scala
  - Java
  - Python
- Spark interactive shell
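As a sketch of how the mode is picked when a context is created (the master URL forms follow the Spark docs; the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// The master URL selects the deployment mode:
//   "local[4]"          - local mode with 4 worker threads
//   "spark://host:7077" - a standalone cluster master
//   YARN/Mesos          - via the corresponding cluster-manager master URL
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("IntroToSpark")
val sc = new SparkContext(conf)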

Page 10: Introduction to Spark


OTHER COMPONENTS(1) - SHARK

- SQL and Rich Analytics at Scale
- Partial DAG Execution (PDE) optimizes query plans at runtime
- Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop

Page 11: Introduction to Spark


OTHER COMPONENTS(2) – SPARK STREAMING

- Scalable, fault-tolerant streaming computation framework
- Discretized streams abstraction: splits a continuous dataflow into batches of input data (see the sketch below)
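A minimal DStream sketch, assuming a text source on a local socket (host, port, and the 1-second batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // each 1s batch becomes an RDD

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()  // print the word counts of every batch

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until stopped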

Page 12: Introduction to Spark


OTHER COMPONENTS(3) - MLBASE

Distributed Machine Learning Library

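MLbase builds on Spark's MLlib at its lowest layer; a hedged sketch of calling MLlib's k-means (input path, k, and the iteration cap are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors
val points = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster into k = 2 groups with at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)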

Page 13: Introduction to Spark


OTHER COMPONENTS(4) – GRAPHX

- Distributed graph computing framework
- RDG (Resilient Distributed Graph) abstraction
- Supports the Gather-Apply-Scatter (GAS) model from GraphLab (see the sketch below)
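A hedged GAS-style sketch using GraphX's later aggregateMessages API (the deck predates this API; the edge-list path is a placeholder):

import org.apache.spark.graphx._

// Build a graph from a file of "srcId dstId" pairs
val graph = GraphLoader.edgeListFile(sc, "hdfs://...")

// Gather/scatter: every edge sends 1 to its destination vertex;
// apply: the partial counts are merged per vertex, yielding in-degrees.
val inDegrees = graph.aggregateMessages[Int](
  triplet => triplet.sendToDst(1),
  _ + _
)
inDegrees.take(5).foreach(println)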

Page 14: Introduction to Spark


GAS MODEL - GRAPHLAB

[Figure: the GAS (Gather-Apply-Scatter) model in GraphLab. Partial sums Σ1-Σ4 are gathered from neighbors across machines 1-4 ("Gather infos from neighbors"), a vertex update computes Y' from the combined Σ ("Apply vertex update"), and the master vertex's new value Y' is propagated to its mirrors.]

From Jiang Wenrui's thesis defense

Page 15: Introduction to Spark


INTRODUCTION TO SCALA(1)

- Runs on the JVM (and .Net)
- Full interoperability with Java
- Statically typed
- Object-oriented
- Functional programming

Page 16: Introduction to Spark


INTRODUCTION TO SCALA(2)

Declare a list of integers:

val ints = List(1, 2, 4, 5, 7, 3)

Declare a function, cube, that computes the cube of an Int:

def cube(a: Int): Int = a * a * a

Apply the cube function to the list:

val cubes = ints.map(x => cube(x))

Sum the cubes of the integers:

cubes.reduce((x1, x2) => x1 + x2)

Define a factorial function that computes n!:

def fact(n: Int): Int = {
  if (n == 0) 1
  else n * fact(n - 1)
}

Page 17: Introduction to Spark


SPARK IN PRACTICES(1)

"Hello World" (interactive shell):

val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

"Hello World" (standalone app): see the sketch below.
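The standalone-app listing did not survive the transcript; a minimal sketch of the same word count as a self-contained app (object name and output path are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("hdfs://...")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")  // write results instead of collecting
    sc.stop()
  }
}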

Page 18: Introduction to Spark


SPARK IN PRACTICES(2)

PageRank in Spark (see the sketch below)
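The slide's listing itself is lost; a sketch of the classic Spark PageRank example it presumably showed (the input path and the 10-iteration count are assumptions):

// Input: lines of "url neighborUrl"; adjacency lists are cached for reuse
val links = sc.textFile("hdfs://...")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }
  .distinct()
  .groupByKey()
  .cache()

var ranks = links.mapValues(_ => 1.0)  // start every page at rank 1.0
for (_ <- 1 to 10) {
  // Each page divides its rank among its outgoing links
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  // Damping: 0.15 base plus 0.85 of the gathered contributions
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.take(5).foreach(println)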

Page 19: Introduction to Spark


DEBUGGING AND TESTING SPARK PROGRAMS(1)

- Running in local mode
  - val sc = new SparkContext("local", name)
  - Debug in the IDE
- Running in standalone/cluster mode
  - Job web UI on port 8080 (standalone master) and 4040 (application)
  - Log4j
  - jstack/jmap
  - dstat/iostat/lsof -p
- Unit tests: run in local mode (see the sketch below)
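A hedged local-mode unit test sketch, assuming ScalaTest (the suite name and data are illustrative):

import org.apache.spark.SparkContext
import org.scalatest.funsuite.AnyFunSuite

class WordCountSuite extends AnyFunSuite {
  test("word count in local mode") {
    val sc = new SparkContext("local", "WordCountSuite")  // runs inside the test JVM
    try {
      val counts = sc.parallelize(Seq("a b", "b"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("b") == 2)
    } finally {
      sc.stop()  // release the context so other tests can create one
    }
  }
}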

Page 20: Introduction to Spark


DEBUGGING AND TESTING SPARK PROGRAMS(2)

Sample output from joining pair RDDs:

RDDJoin2: (2,4) (1,2)
RDDJoin3: (1,(1,3)) (1,(2,3))
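The code behind this output is lost; a hedged stand-in that reproduces a result like RDDJoin3 (the input values are guesses from the printed output):

val rdd1 = sc.parallelize(Seq((1, 1), (1, 2)))
val rdd2 = sc.parallelize(Seq((2, 4), (1, 3)))
val rddJoin3 = rdd1.join(rdd2)  // inner join on the key; key 2 has no match in rdd1
println(rddJoin3.collect().mkString(" "))  // (1,(1,3)) (1,(2,3)) -- order may vary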

Page 21: Introduction to Spark


DEBUGGING AND TESTING SPARK PROGRAMS(3)

Tuning Spark:

- Large objects captured in lambdas should be replaced by broadcast variables (see the sketch after this list)
- Coalesce partitions to avoid a large number of empty tasks after filtering operations
- Make good use of partitioning for data locality (mapPartitions)
- Choose the partitioning key well to balance data
- Set spark.local.dir to a set of disks
- Take care with the number of reduce tasks
- Don't collect data to the driver; write it to HDFS directly
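A minimal broadcast sketch for the first point, assuming an existing RDD[String] named keys and a hypothetical loadBigTable() helper:

// Captured directly, `lookup` would be serialized into every task;
// broadcasting ships it to each worker only once.
val lookup: Map[String, Int] = loadBigTable()  // hypothetical: builds a large table
val bc = sc.broadcast(lookup)
val resolved = keys.map(k => bc.value.getOrElse(k, 0))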

Page 22: Introduction to Spark


LEARNING SPARK

- Spark Quick Start: http://spark.apache.org/docs/latest/quick-start.html
- Holden Karau, Fast Data Processing with Spark
- Spark docs: http://spark.apache.org/docs/latest/
- Spark source code: https://github.com/apache/spark
- Spark user mailing list: http://spark.apache.org/mailing-lists.html

Page 23: Introduction to Spark


WHY ARE PREVIOUS MAPREDUCE-BASED SYSTEMS SLOW?

Conventional thoughts:

- expensive data materialization for fault tolerance
- inferior data layout (e.g., lack of indices)
- costlier execution strategies

But Shark alleviates these with in-memory computing and storage and with Partial DAG Execution.

Experimental results in Shark point instead to:

- intermediate outputs
- data format and layout (addressed by co-partitioning)
- execution strategies (optimized using PDE)
- task scheduling cost

Page 24: Introduction to Spark


CONCLUSION

- Spark
  - In-memory computing platform for iterative and interactive tasks
  - RDD abstraction
  - Lineage reconstruction for fault recovery
  - A large number of components built on Spark
- Programming
  - Just think of an RDD like a vector
  - Functional programming in Scala
  - The IDE is not strong enough
  - Lack of good tools for debugging and testing