Budapest Spark Meetup - Basics of Spark coding

Apache Spark
Mate Gulyas

GULYÁS MÁTÉ
CTO & Co-Founder

@gulyasm

Getting Started

UNIFIED STACK

❏ Spark Core
❏ Spark SQL
❏ Spark Streaming
❏ MLlib
❏ GraphX
❏ Cluster Managers

Spark Core

❏ RDD API
❏ DataFrame API
❏ Dataset API

WHICH LANGUAGE TO SPARK ON?

❏ Scala
❏ Java
❏ Python
❏ R

SPARK INSTALL

DRIVER, SPARKCONTEXT

DRIVER PROGRAM

Your main function. This is what you write.

It launches parallel operations on the cluster. The driver accesses Spark through SparkContext.

You access the computing cluster via SparkContext

Via SparkContext you can create RDDs.

A “SPARK SOFTWARE”

❏ INTERACTIVE
❏ STANDALONE

Resilient Distributed Dataset (RDD)

THE MAIN ATTRACTION

OPERATIONS ON RDD

❏ TRANSFORMATION
❏ ACTION

TRANSFORMATION

CREATES ANOTHER RDD

ACTION

CALCULATES A VALUE AND RETURNS IT TO THE DRIVER PROGRAM

LAZY EVALUATION
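The difference between a transformation, an action and lazy evaluation can be imitated in plain Python with generators. This is only an analogy and not Spark code: `lazy_map` and the `log` list are names made up for the illustration.

```python
log = []  # records which input elements have actually been processed


def lazy_map(func, source):
    """Transformation analogue: describes the work, performs none until iterated."""
    for item in source:
        log.append(item)
        yield func(item)


numbers = [1, 2, 3]
doubled = lazy_map(lambda x: x * 2, numbers)  # nothing is computed here
work_before_action = list(log)                # snapshot: still empty

result = list(doubled)                        # action analogue: forces the evaluation
print(work_before_action, result, log)
```

Building `doubled` touches no data; the elements are processed only when `list(...)` asks for the result, which is the role an action plays for an RDD.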

INTERACTIVE

MATERIALS

❏ The code: github.com/gulyasm/bigdata
❏ Databricks site
❏ Apache Spark site: spark.apache.org
❏ User mailing list
❏ Spark books

MATE GULYAS
gulyasm@enbrite.ly

@gulyasm, @enbritely

THANK YOU!

❏ TRANSFORMATIONS
❏ ACTIONS
❏ LAZY EVALUATION

LIFECYCLE OF A SPARK PROGRAM

1. READ DATA FROM EXTERNAL SOURCE

2. CREATE LAZILY EVALUATED TRANSFORMATIONS

3. CACHE ANY INTERMEDIATE RDD TO REUSE

4. KICK IT OFF BY CALLING SOME ACTION
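The four steps above can be walked through with a toy, single-machine stand-in for an RDD. `ToyRDD` and its `evaluations` counter are hypothetical names invented for this sketch; they only mimic the shape of the real API (`filter`, `map`, `cache`, `count`, `collect`) and say nothing about how Spark implements it.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterator
        self._use_cache = False
        self._cached = None       # filled on first evaluation when cache() was called
        self.evaluations = 0      # how many times this RDD was really computed

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        self.evaluations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = list(data)
            return iter(self._cached)
        return data

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._iterate()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        self._use_cache = True
        return self

    # Actions: force the evaluation and return a value to the caller.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


# 1. Read data from an external source (an in-memory list stands in for a file).
lines = ToyRDD(lambda: iter(["spark is fast", "", "rdd api"]))
# 2. Create lazily evaluated transformations.
nonempty = lines.filter(lambda line: len(line) > 0)
# 3. Cache the intermediate result that is reused below.
nonempty.cache()
# 4. Kick it off by calling actions.
n = nonempty.count()
upper = nonempty.map(lambda line: line.upper()).collect()
print(n, upper, nonempty.evaluations)
```

Two actions run against `nonempty`, yet it is computed once, because the second action reads the cached copy instead of going back to the source.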

PARTITIONS

RDD INTERNALS

RDD INTERFACE

➔ set of PARTITIONS

➔ list of DEPENDENCIES on PARENT RDDs

➔ functions to COMPUTE a partition given parents

➔ preferred LOCATIONS (optional)

➔ PARTITIONER for K/V pairs (optional)

MULTIPLE RDDs

/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

/** Implemented by subclasses to return the set of partitions in this RDD. */
protected def getPartitions: Array[Partition]

/** Implemented by subclasses to return how this RDD depends on parent RDDs. */
protected def getDependencies: Seq[Dependency[_]] = deps

/** Optionally overridden by subclasses to specify placement preferences. */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
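To see how the three mandatory parts of the interface fit together, here is a rough Python imitation. `ListRDD` and `MappedRDD` are hypothetical classes written for this sketch, the method names are Python-style renderings of the Scala ones above, and the two optional parts (preferred locations, partitioner) are left out.

```python
class ListRDD:
    """Source RDD: the data is already split into partitions and has no parents."""

    def __init__(self, partitions):
        self._partitions = partitions

    def get_partitions(self):
        return list(range(len(self._partitions)))

    def get_dependencies(self):
        return []

    def compute(self, split):
        return iter(self._partitions[split])


class MappedRDD:
    """Derived RDD: same partitions as its parent, each element passed through func."""

    def __init__(self, parent, func):
        self.parent = parent
        self.func = func

    def get_partitions(self):
        return self.parent.get_partitions()

    def get_dependencies(self):
        return [self.parent]

    def compute(self, split):
        # Computing a partition means pulling the same partition from the parent.
        return (self.func(x) for x in self.parent.compute(split))


source = ListRDD([[1, 2], [3, 4, 5]])
squared = MappedRDD(source, lambda x: x * x)
per_partition = [list(squared.compute(p)) for p in squared.get_partitions()]
print(per_partition)  # [[1, 4], [9, 16, 25]]
```

Each derived RDD only knows its parent and how to compute one partition from it, which is enough to rebuild any partition on demand.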

INTERNALS

THE IMPORTANT PART

❏ HOW EXECUTION WORKS

❏ TERMINOLOGY

❏ WHAT SHOULD WE CARE ABOUT

PIPELINING

❏ Parallel to CPU pipelining
❏ More steps at a time
❏ Recap: computation kicks off when an action is called, due to lazy evaluation

PIPELINING

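# Word count over a text file; sc is the SparkContext
text = sc.textFile("twit1.txt")
words = text.flatMap(lambda x: x.split(" "))
# Drop the empty strings produced by consecutive spaces
fwords = words.filter(lambda x: len(x) > 0)
ones = fwords.map(lambda x: (x, 1))
result = ones.reduceByKey(lambda l, r: r + l)
# collect() is the action that triggers the computation
result.collect()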

PIPELINING

text = sc.textFile( )
words = text.flatMap( )
fwords = words.filter( )
ones = fwords.map( )
result = ones.reduceByKey( )
result.collect()

PIPELINING

sc.textFile( ) .flatMap( ) .filter( ) .map( ) .reduceByKey( )

PIPELINING

sc.textFile().flatMap().filter().map().reduceByKey()

Each call produces an RDD:

textFile() → text
flatMap() → words
filter() → fwords
map() → ones
reduceByKey() → result
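What pipelining means for the flatMap, filter and map steps can be shown with chained Python generators. This is an analogy rather than Spark code: the three function names and the `trace` list are invented for the sketch.

```python
trace = []  # order in which each step touches each element


def flat_map_words(lines):
    for line in lines:
        for word in line.split(" "):
            trace.append(("flatMap", word))
            yield word


def keep_nonempty(words):
    for word in words:
        trace.append(("filter", word))
        if len(word) > 0:
            yield word


def to_pairs(words):
    for word in words:
        trace.append(("map", word))
        yield (word, 1)


lines = ["a b"]
pairs = list(to_pairs(keep_nonempty(flat_map_words(lines))))
print(pairs)
print(trace)
```

The trace shows `a` passing through all three steps before `b` is even split off, so no intermediate collection of words is ever built. The reduceByKey step cannot join this chain in the same way, because it needs all values of a key together.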

PIPELINING

def runJob[T, U](
    rdd: RDD[T],
    partitions: Seq[Int],
    func: (Iterator[T]) => U
): Array[U]

RDD RDD RDD RDD RDD

collect()

PIPELINING

❏ Basically an action

❏ An action creates a job

❏ A whole computation with all dependencies

RDD RDD RDD RDD RDD

collect()

❏ Unit of execution
❏ Named after the last transformation (the one runJob was called on)
❏ Transformations pipelined together into stages
❏ Stage boundary usually means shuffling

RDD RDD RDD RDD RDD

collect()

Job: Stage 1 | Shuffle | Stage 2
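Why a shuffle forces a stage boundary can be sketched in plain Python: every pair with the same key has to land in the same place before it can be reduced. The helper names below are hypothetical, and `zlib.crc32` is used only to get a hash that is stable between runs; it is not a claim about the hash function Spark uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick the target partition from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(map_outputs, num_partitions):
    """Redistribute (key, value) pairs so equal keys share one bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for partition in map_outputs:
        for key, value in partition:
            buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


def reduce_by_key(bucket, func):
    """Combine the values of each key inside a single bucket."""
    acc = {}
    for key, value in bucket:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


# Output of the first stage: two partitions of (word, 1) pairs.
stage1_output = [[("spark", 1), ("rdd", 1)], [("spark", 1), ("dag", 1)]]
buckets = shuffle(stage1_output, 2)

counts = {}
for bucket in buckets:
    counts.update(reduce_by_key(bucket, lambda l, r: l + r))
print(counts)
```

The two `("spark", 1)` pairs start in different partitions and can only be summed after the redistribution, which is why the reduce side has to wait for the whole first stage to finish.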

Repartitioning

text = sc.textFile("twit1.txt")
words = text.flatMap(lambda x: x.split(" "))
fwords = words.filter(lambda x: len(x) > 1)
ones = fwords.map(lambda x: (x, 1))
# repartition(6) redistributes the pairs across 6 partitions
rp = ones.repartition(6)
result = rp.reduceByKey(lambda l, r: r + l)
result.collect()

THE PROCESS

RDD Objects → (DAG) → DAG Scheduler → (TaskSet) → Task Scheduler → (Task) → Executor

RDD Objects
sc.textFile.map().groupBy().filter()
- Build DAG of operators

DAG Scheduler
- Splits the DAG into stages of tasks
- A stage is ready when ALL tasks it depends on have finished

Task Scheduler
- Launches tasks
- Retries failed tasks

Executor (Block manager, Task threads)
- Stores and serves blocks
- Executes tasks
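The stage-splitting step can be approximated in a few lines of Python for the simple case of a linear chain of operations. This is a deliberately simplified sketch and not the real scheduler: `SHUFFLE_OPS` and `split_into_stages` are hypothetical names, and the set of shuffling operations is an assumption limited to the ones that appear in this deck.

```python
# Assumption for the sketch: the operations from this deck that need a shuffle.
SHUFFLE_OPS = {"groupBy", "reduceByKey", "repartition"}


def split_into_stages(operations):
    """Cut a linear chain of operation names wherever a shuffle is required."""
    stages = [[]]
    for op in operations:
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])  # a shuffle starts a new stage
        stages[-1].append(op)
    return stages


chain = ["textFile", "map", "groupBy", "filter"]
stages = split_into_stages(chain)
print(stages)  # [['textFile', 'map'], ['groupBy', 'filter']]
```

Operations that share a stage are the ones that can be pipelined together; the second stage can start only after every task of the first has finished.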

MATE GULYAS
gulyasm@enbrite.ly

@gulyasm, @enbritely

THANK YOU!
