Spark Camp: Strata Singapore


[Figure: dataset sizes used in the course: 255 MB, 1.2 GB, 60 GB, 1.2 GB files; streaming input at ~x events per second]

30%: Slides

70%: Live coding/demos, labs

df.rdd.partitions.size  // returns 4

[Diagram: a 255 MB file is read as 4 partitions of 64 MB each; each record is a (Timestamp, String, Int) tuple.]
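The partition count follows from the block size: a sketch of the arithmetic in plain Python (the 64 MB block size is taken from the diagram, not a universal default).

```python
import math

def num_partitions(file_size_mb: float, block_size_mb: float = 64) -> int:
    """Number of input partitions when a file is split into fixed-size blocks."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_partitions(255))  # a 255 MB file yields 4 partitions of up to 64 MB
```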

3,600,000

The clickstream dataset is a set of (referer, resource) pairs: the resource is an article title, and the referer is either the title of the referring article or one of these bucketed values for external and missing referers:
• other-wikipedia • other-empty • other-internal • other-google • other-yahoo • other-bing • other-facebook • other-twitter • other-other
Grouping by (referer) or by (referer, resource) gives per-source or per-link counts.


Spark >= 1.6


[Diagram: a 589 MB file read as partitions of 64 MB each; record counts shown: 7,786,761; 2,418,984; 7,786,761.]

Essential Core & Intermediate Spark Operations (difficulty legend: easy / medium)

TRANSFORMATIONS
General: • map • filter • flatMap • mapPartitions • mapPartitionsWithIndex • groupBy • sortBy
Math / Statistical: • sample • randomSplit
Set Theory / Relational: • union • intersection • subtract • distinct • cartesian • zip
Data Structure / I/O: • keyBy • zipWithIndex • zipWithUniqueID • zipPartitions • coalesce • repartition • repartitionAndSortWithinPartitions • pipe

ACTIONS
General: • reduce • collect • aggregate • fold • first • take • foreach • top • treeAggregate • treeReduce • foreachPartition • collectAsMap
Math / Statistical: • count • takeSample • max • min • sum • histogram • mean • variance • stdev • sampleVariance • countApprox • countApproxDistinct
Set Theory / Relational: • takeOrdered
Data Structure / I/O: • saveAsTextFile • saveAsSequenceFile • saveAsObjectFile • saveAsHadoopDataset • saveAsHadoopFile • saveAsNewAPIHadoopDataset • saveAsNewAPIHadoopFile

Essential Core & Intermediate PairRDD Operations (difficulty legend: easy / medium)

TRANSFORMATIONS
General: • flatMapValues • groupByKey • reduceByKey • reduceByKeyLocally • foldByKey • aggregateByKey • sortByKey • combineByKey
Math / Statistical: • sampleByKey
Set Theory / Relational: • cogroup (=groupWith) • join • subtractByKey • fullOuterJoin • leftOuterJoin • rightOuterJoin
Data Structure: • keys • values • partitionBy

ACTIONS
Math / Statistical: • countByKey • countByValue • countByValueApprox • countApproxDistinctByKey • countByKeyApprox • sampleByKeyExact
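As a rough mental model (not Spark code), several of the core transformations behave like plain Python sequence operations; this sketch shows a few of them on a small list.

```python
data = [3, 1, 2, 3]

mapped   = [x * 2 for x in data]                 # map
filtered = [x for x in data if x % 2 == 1]       # filter
flat     = [y for x in data for y in range(x)]   # flatMap
dedup    = sorted(set(data))                     # distinct (Spark gives no order guarantee)
combined = data + [4, 5]                         # union (keeps duplicates)

print(mapped, filtered, dedup, combined)
```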

Narrow vs. wide dependencies:
• Narrow: each partition of the parent RDD is used by at most one partition of the child RDD. Examples: map, flatMap, filter, mapPartitions.
• Wide: multiple child RDD partitions may depend on a single parent RDD partition. Examples: groupByKey, join, repartition.
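A pure-Python sketch of the distinction, with partitions modeled as lists of lists (an illustration of the dependency shapes, not Spark's implementation):

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 1)], [("a", 1), ("b", 1), ("a", 1)]]

# Narrow: each child partition is computed from exactly one parent partition.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide: building each child partition may need records from every parent
# partition, so a shuffle is required.
grouped = defaultdict(list)
for part in partitions:
    for k, v in part:
        grouped[k].append(v)

print(dict(grouped))  # {'a': [1, 1, 1], 'b': [1, 1]}
```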

Sample pagecounts records (columns: project name, page title, # of requests, total size of content returned):

en Main_Page 245839 4737756101
en Apache_Spark 1370 29484844
en.mw Apache_Spark
en.d discombobulate 200 284834
fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
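A minimal parser for records in this format; the field names follow the column headings above, and the trailing numeric fields are treated as optional since one sample record omits them.

```python
def parse_pagecount(line: str):
    """Split a pagecounts record into (project, title, requests, bytes)."""
    fields = line.split(" ")
    project, title = fields[0], fields[1]
    requests = int(fields[2]) if len(fields) > 2 else None
    size = int(fields[3]) if len(fields) > 3 else None
    return project, title, requests, size

print(parse_pagecount("en Apache_Spark 1370 29484844"))
# ('en', 'Apache_Spark', 1370, 29484844)
```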

Operations on RDDs:
• map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save • ...

https://www.youtube.com/watch?v=Jw9yNTJI8iM

“We start by loading some fields from a tab-separated Wikipedia file into a distributed collection of Article objects, then we perform queries on it. Queries on the on-disk data take 20 seconds, but asking Spark to cache the Article objects in memory reduces the query latency to 1-2 seconds.”

- Matei

[Diagram: the classic MapReduce shuffle: a row of Map() tasks feeding a row of Reduce() tasks. Spark's key-based aggregations that trigger this shuffle: groupByKey, reduceByKey, combineByKey.]

[Diagram: word count over keys a and b across three partitions.
groupByKey: every single (a, 1) and (b, 1) record is shuffled across the network, and the sums (a, 6) and (b, 5) are computed only after the shuffle.
reduceByKey: pairs are first combined within each partition, so only the per-partition subtotals (a, 2), (a, 3), (a, 1) and (b, 1), (b, 2), (b, 2) cross the network before being merged into the same final (a, 6) and (b, 5).]
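The difference in shuffle volume can be sketched in plain Python: with map-side combining, only one record per key per partition crosses the "network". This is a simulation of the semantics, not Spark internals; the partition contents are chosen to reproduce the (a, 6) and (b, 5) totals.

```python
from collections import Counter

partitions = [["a", "a", "b"], ["a", "b", "b"], ["a", "a", "a", "b", "b"]]

# groupByKey-style: every (word, 1) pair is shuffled.
shuffled_naive = [(w, 1) for part in partitions for w in part]

# reduceByKey-style: combine within each partition first, then merge subtotals.
subtotals = [Counter(part) for part in partitions]
shuffled_combined = [kv for c in subtotals for kv in c.items()]

totals = Counter()
for k, v in shuffled_combined:
    totals[k] += v

print(len(shuffled_naive), len(shuffled_combined), dict(totals))
# 11 shuffled records vs 6; totals {'a': 6, 'b': 5}
```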

sqlContext.setConf(key, value)

spark.sql.shuffle.partitions
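A usage sketch, assuming a Spark 1.x SQLContext named `sqlContext`; the default value of spark.sql.shuffle.partitions in this era of Spark is 200, which is often far too high for small demo datasets.

```
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
```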


Shuffle

Source: parquet.apache.org

[Diagram: property-graph terminology: a graph is made of vertices connected by edges; an edge may be unidirectional or bidirectional (a bidirectional edge is equivalent to 2 uni edges). Example: vertexID 469 connects to vertexID 3728 by an edge with label 22. A vertex with five incoming edges has inDegree = 5; a vertex with two outgoing edges has out-degree = 2.]
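In-degree and out-degree as shown in the diagram can be sketched over a plain edge list; the topology below is an assumption loosely based on the vertex IDs in the figure, arranged so that vertex 3728 has inDegree = 5 and out-degree = 2.

```python
from collections import Counter

# (source, destination) pairs; hypothetical edges for illustration.
edges = [(469, 3728), (8, 3728), (93, 3728), (11, 3728), (7, 3728),
         (3728, 46), (3728, 22)]

in_deg = Counter(dst for _, dst in edges)
out_deg = Counter(src for src, _ in edges)

print(in_deg[3728], out_deg[3728])  # 5 2
```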

[Diagram: two paths between the same pair of vertices, one of 5 hops and one of 4 hops; vertices annotated with shortest-path distances 0 through 4.]

[Figure: a network of blog posts grouped into Liberal and Conservative communities; posts whose affiliation is unknown are marked "?".]

[Example graph over vertices A–F, built from the edge list:
A B
A C
C D
B C
A E
A F
E F
E D
The edges are shown distributed across 2 partitions.]
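On the edge list above, a graph computation such as triangle counting can be sketched directly in Python (an undirected interpretation of the edges is assumed; GraphX does this distributedly):

```python
from itertools import combinations

edges = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "C"),
         ("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")]

# Build an undirected adjacency map.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# A triangle is any 3 vertices that are pairwise adjacent.
vertices = sorted(adj)
triangles = [t for t in combinations(vertices, 3)
             if all(b in adj[a] for a, b in combinations(t, 2))]

print(triangles)  # [('A', 'B', 'C'), ('A', 'E', 'F')]
```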

filter() applied to a DStream runs on every micro-batch RDD in turn:

editsDStream --filter()--> filteredDStream
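A sketch of those DStream semantics: the same predicate is applied to each micro-batch as it arrives. Micro-batches are modeled as plain lists, and the event format and predicate are made-up examples.

```python
micro_batches = [
    ["edit:Main_Page", "bot:cleanup", "edit:Apache_Spark"],
    ["bot:revert", "edit:Scala"],
]

def is_human_edit(event: str) -> bool:
    return event.startswith("edit:")

# filter() on a DStream = filter() applied to every batch RDD.
filtered_batches = [[e for e in batch if is_human_edit(e)]
                    for batch in micro_batches]

print(filtered_batches)
# [['edit:Main_Page', 'edit:Apache_Spark'], ['edit:Scala']]
```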

https://issues.apache.org/jira/browse/SPARK-8360

Sample output from the test runner:

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

Running test: pyspark/broadcast.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:30 ERROR Aliens attacked the



Word counts extracted from the log lines:
Running: 43
test: 67
Spark: 110
aliens: 0
...
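Turning raw log lines into term counts like these is a simple bag-of-words featurization. A plain-Python sketch over a tiny sample (the vocabulary matches the slide; the log lines and resulting counts are illustrative, not the 43/67/110 figures above):

```python
import re
from collections import Counter

vocabulary = ["Running", "test", "Spark", "aliens"]

log_lines = [
    "Running test: pyspark/conf.py",
    "Spark assembly has been built",
    "Running test: pyspark/broadcast.py",
]

# Tokenize on word characters, then count occurrences of each vocabulary term.
tokens = re.findall(r"\w+", " ".join(log_lines))
counts = Counter(tokens)
features = [counts[term] for term in vocabulary]

print(features)  # [2, 2, 1, 0]
```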

[Diagram: a feature vector (mllib.linalg.Vector) such as (0.0, 0.1, -0.1, 50, ...) with label -1.1; LinearRegression predicts by summing weighted feature contributions such as 0.0 + 6.7 + (-11.0) + 0.0 + ...]



LabeledPoint(features: Vector, label: Double)


The MLlib training and prediction signatures:

LinearRegression.train(data: RDD[LabeledPoint]): LinearRegressionModel
LinearRegressionModel.predict(features: Vector): Double

An RDD[LabeledPoint] goes into train(), which returns a LinearRegressionModel; predict() then maps a feature Vector to a predicted Double.
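The shape of that API can be mimicked in pure Python with a one-feature least-squares fit; this is a toy stand-in for MLlib's implementation, with (feature, label) pairs playing the role of LabeledPoints.

```python
class LinearRegressionModel:
    def __init__(self, weight: float, intercept: float):
        self.weight, self.intercept = weight, intercept

    def predict(self, feature: float) -> float:
        return self.weight * feature + self.intercept


def train(data):
    """Closed-form least squares over (feature, label) pairs."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    w = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return LinearRegressionModel(w, my - w * mx)


model = train([(1, 2.0), (2, 4.0), (3, 6.0)])
print(model.predict(10))  # 20.0
```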


spark.ml pipeline concepts:
• Estimator (e.g. LinearRegression): fit on a DataFrame, produces a Model
• Transformer: transforms a DataFrame into another DataFrame; a Model is a Transformer
• Evaluator: turns predictions into a metric; e.g. RegressionMetrics consumes an RDD[(Double, Double)] of (prediction, label) pairs and yields Double-valued statistics
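That division of labor can be sketched with toy classes that mirror the interfaces; only the names correspond to spark.ml, and the "fit" and metric are deliberately trivial stand-ins.

```python
class Transformer:
    def transform(self, rows):  # DataFrame in, DataFrame out
        raise NotImplementedError


class Model(Transformer):  # a fitted Model is itself a Transformer
    def __init__(self, weight):
        self.weight = weight

    def transform(self, rows):
        return [dict(r, prediction=self.weight * r["feature"]) for r in rows]


class Estimator:
    def fit(self, rows):  # DataFrame in, Model out
        # Toy "fit": average label/feature ratio.
        w = sum(r["label"] / r["feature"] for r in rows) / len(rows)
        return Model(w)


def evaluate(pred_and_label):  # Evaluator: (prediction, label) pairs -> Double
    return sum((p - l) ** 2 for p, l in pred_and_label) / len(pred_and_label)


data = [{"feature": 1.0, "label": 2.0}, {"feature": 3.0, "label": 6.0}]
model = Estimator().fit(data)
scored = model.transform(data)
mse = evaluate([(r["prediction"], r["label"]) for r in scored])
print(model.weight, mse)  # 2.0 0.0
```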

http://www.meetup.com/Spark-Singapore/events/219039180/