Introduction to Spark


INTRODUCTION TO SPARK

Tangzk

2014/4/8


OUTLINE

BDAS (the Berkeley Data Analytics Stack)
Spark
Other Components based on Spark
Introduction to Scala
Spark Programming Practices
Debugging and Testing Spark Programs
Learning Spark
Why are Previous MapReduce-Based Systems Slow?
Conclusion


BDAS(THE BERKELEY DATA ANALYTICS STACK)


SPARK

Master/slave architecture
In-memory computing platform
Resilient Distributed Datasets (RDDs) abstraction
DAG execution engine
Fault recovery using lineage
Supports interactive data mining
Survives within the Hadoop ecosystem


SPARK – IN-MEMORY COMPUTING

Hadoop: two-stage MapReduce topology
Spark: DAG execution topology
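For instance, a chain like the following runs as one Spark job with two shuffle stages, where Hadoop would schedule two separate MapReduce jobs (the input path is a placeholder):

// Word count plus top-10: two shuffle stages in a single Spark job,
// where Hadoop would chain two separate MapReduce jobs.
val top10 = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                          // stage boundary (shuffle)
  .map { case (word, count) => (count, word) }
  .sortByKey(ascending = false)                // second shuffle, same job
  .take(10)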


SPARK – RDD(RESILIENT DISTRIBUTED DATASETS)

Data computing and storage abstraction
Records organized by partition
Immutable (read-only): can only be created through transformations
Moves computation to the data instead of moving the data
Coarse-grained programming interfaces
Partition reconstruction using lineage


SPARK-RDD: COARSE-GRAINED PROGRAMMING INTERFACES

Transformation: defines one or more new RDDs
Action: returns a value to the driver
Lazy computation: transformations are not evaluated until an action needs them
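A minimal sketch of the distinction (values are illustrative):

val nums = sc.parallelize(1 to 10)     // source RDD
val squares = nums.map(n => n * n)     // transformation: defines a new RDD, nothing runs yet
val total = squares.reduce(_ + _)      // action: triggers the job and returns 385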


SPARK-RDD: LINEAGE GRAPH

Spark example: return the time fields of web GET accesses by "66.220.128.123" (the time field is number 3 in a tab-separated format).

val logs = sc.textFile("hdfs://...")
val accessesByIp = logs.filter(_.startsWith("66.220.128.123"))
accessesByIp.persist()
accessesByIp.filter(_.contains("GET"))
  .map(_.split('\t')(3))
  .collect()


SPARK - INSTALLATION

Running Spark: local mode, standalone mode, or cluster mode on YARN/Mesos (the sketch below shows how the mode is selected)

Supported programming languages: Scala, Java, Python

Spark interactive shell
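A hedged sketch of how the mode is chosen in code: only the master URL changes (hostnames are placeholders, and running on YARN additionally needs the Hadoop configuration on the classpath):

import org.apache.spark.SparkContext

// Only one SparkContext per JVM; pick one master URL.
val sc = new SparkContext("local[4]", "MyApp")        // local mode, 4 worker threads
// new SparkContext("spark://master:7077", "MyApp")   // standalone cluster
// new SparkContext("yarn-client", "MyApp")           // cluster mode on YARN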


OTHER COMPONENTS(1) - SHARK

SQL and rich analytics at scale.

Partial DAG Execution (PDE) optimizes query planning at runtime.

Runs SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop.


OTHER COMPONENTS(2) – SPARK STREAMING

Scalable, fault-tolerant stream computing framework

Discretized streams abstraction: separates the continuous dataflow into batches of input data
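A minimal discretized-stream sketch (host and port are illustrative): the continuous socket stream is cut into one-second batches, each processed like an RDD:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))       // 1-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)  // continuous input stream
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()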


OTHER COMPONENTS(3) - MLBASE

Distributed Machine Learning Library


OTHER COMPONENTS(4) – GRAPHX

Distributed graph computing framework

RDG (Resilient Distributed Graph) abstraction

Supports the Gather-Apply-Scatter model from GraphLab
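A small hedged GraphX sketch under these abstractions (the edge list is illustrative): build a graph from edge tuples and compute vertex in-degrees:

import org.apache.spark.graphx._

val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L)))  // (src, dst) pairs
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)      // every vertex gets value 1
graph.inDegrees.collect().foreach(println)                     // (VertexId, inDegree) pairs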


GAS MODEL - GRAPHLAB

[Figure: GAS decomposition of a vertex program across four machines — each machine gathers partial sums Σ from its local neighbors, the master vertex applies the update to produce Y′, and Y′ is scattered back to the mirror vertices.]

From Jiang Wenrui’s thesis defense


INTRODUCTION TO SCALA(1)

Runs on the JVM (and .NET)
Full interoperability with Java
Statically typed
Object-oriented
Functional programming


INTRODUCTION TO SCALA(2)

Declare a list of integers:
val ints = List(1, 2, 4, 5, 7, 3)

Declare a function, cube, that computes the cube of an Int:
def cube(a: Int): Int = a * a * a

Apply the cube function to the list:
val cubes = ints.map(x => cube(x))

Sum the cubes of the integers:
cubes.reduce((x1, x2) => x1 + x2)

Define a factorial function that computes n!:
def fact(n: Int): Int = {
  if (n == 0) 1 else n * fact(n - 1)
}


SPARK IN PRACTICE(1)

"Hello World" (interactive shell):
val textFile = sc.textFile("hdfs://...")
textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

"Hello World" (standalone app): see the sketch below.
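A minimal sketch of the standalone version (class name and paths are illustrative):

import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount")
    sc.textFile("hdfs://...")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .saveAsTextFile("hdfs://...")   // write results out instead of collecting
    sc.stop()
  }
}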


SPARK IN PRACTICE(2)

PageRank in Spark
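A minimal sketch close to the classic Spark PageRank example (input is one "src dst" pair per line; the damping factor 0.85 and 10 iterations are the usual illustrative choices):

val links = sc.textFile("hdfs://...")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }
  .distinct()
  .groupByKey()
  .cache()                              // links are reused on every iteration
var ranks = links.mapValues(_ => 1.0)
for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))  // spread rank over outgoing links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(rank => 0.15 + 0.85 * rank)
}
ranks.collect().foreach(println)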


DEBUGGING AND TESTING SPARK PROGRAMS(1)

Running in local mode: val sc = new SparkContext("local", name), then debug in the IDE

Running in standalone/cluster mode: job web GUI on ports 8080/4040, Log4j, jstack/jmap, dstat/iostat/lsof -p

Unit tests: run in local mode (see the sketch below)
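A minimal local-mode test sketch (assumes ScalaTest on the classpath; names are illustrative):

import org.apache.spark.SparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("counts words") {
    val sc = new SparkContext("local", "test")
    try {
      val counts = sc.parallelize(Seq("a b", "a"))
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("a") == 2)
    } finally {
      sc.stop()  // free the context so later tests can create their own
    }
  }
}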


DEBUGGING AND TESTING SPARK PROGRAMS(2)

RDDJoin2: (2,4) (1,2)
RDDJoin3: (1,(1,3)) (1,(2,3))
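A hedged reconstruction of this kind of check in the local shell (pair values chosen to match the printed join result): print small intermediate RDDs and inspect their lineage with toDebugString:

val rdd1 = sc.parallelize(Seq((1, 1), (1, 2)))
val rdd2 = sc.parallelize(Seq((2, 4), (1, 3)))
val joined = rdd1.join(rdd2)                        // only keys present on both sides survive
println("RDDJoin3: " + joined.collect().mkString)   // prints (1,(1,3))(1,(2,3))
println(joined.toDebugString)                       // lineage of the join, for inspection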


DEBUGGING AND TESTING SPARK PROGRAMS(3)

Tuning Spark:
Large objects in lambda operators should be replaced by broadcast variables (see the sketch below).
Coalesce partitions to avoid large numbers of empty tasks after filtering operations.
Make good use of partitioning for data locality (mapPartitions).
Choose the partitioning key well to balance data.
Set spark.local.dir to a set of disks.
Take care with the number of reduce tasks.
Don't collect data to the driver; write to HDFS directly.
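A minimal sketch of the broadcast-variable tip (the small map stands in for a genuinely large object):

val lookup = Map("a" -> 1, "b" -> 2)            // imagine this is large
val bcLookup = sc.broadcast(lookup)             // shipped to each worker once, not per closure
val codes = sc.parallelize(Seq("a", "b", "a"))
  .map(key => bcLookup.value.getOrElse(key, 0)) // read through the broadcast handle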


LEARNING SPARK

Spark Quick Start, http://spark.apache.org/docs/latest/quick-start.html

Holden Karau, Fast Data Processing with Spark

Spark Docs, http://spark.apache.org/docs/latest/

Spark Source code, https://github.com/apache/spark

Spark User Mailing list, http://spark.apache.org/mailing-lists.html


WHY ARE PREVIOUS MAPREDUCE-BASED SYSTEMS SLOW?

Conventional explanations: expensive data materialization for fault tolerance, inferior data layout (e.g., lack of indices), and costlier execution strategies.

But Hive/Shark alleviate these with in-memory computing and storage, and with Partial DAG Execution.

Experimental results in Shark examine: intermediate outputs, data format and layout (via co-partitioning), execution strategy optimization using PDE, and task scheduling cost.


CONCLUSION

Spark:
In-memory computing platform for iterative and interactive tasks
RDD abstraction
Lineage reconstruction for fault recovery
Large number of components built on Spark

Programming:
Just think of an RDD like a vector
Functional programming in Scala
IDE support is not strong enough
Lack of good tools for debugging and testing

