[Figure: lab dataset sizes — 255 MB, 1.2 GB, 60 GB, 1.2 GB (~x per sec)]
Course format: 30% slides; 70% live coding, demos, and labs.
Reading a 255 MB file from HDFS with a 64 MB block size yields four input partitions, one per block (64 + 64 + 64 + 63 MB):

df.rdd.partitions.size = 4

[Figure: the four partitions, each holding rows of (Timestamp, String, Int) columns]
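The block math above can be sketched in a few lines. This is a simplification (it ignores `spark.sql.files.maxPartitionBytes` and input-format split logic), but it captures the one-partition-per-HDFS-block rule:

```python
import math

def hdfs_partitions(file_size_mb: int, block_size_mb: int = 64) -> int:
    """Rough partition count for a file read from HDFS:
    one input partition per block."""
    return math.ceil(file_size_mb / block_size_mb)

print(hdfs_partitions(255))  # 4 blocks: 64 + 64 + 64 + 63 MB
```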
Wikipedia Clickstream: ~3,600,000 (referer, resource) pairs. The referer field is either an article title or one of these buckets:

• other-wikipedia
• other-empty
• other-internal
• other-google
• other-yahoo
• other-bing
• other-facebook
• other-twitter
• other-other
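A quick way to see the two kinds of referer values — the bucket names and the helper function here are illustrative, not part of the dataset's tooling:

```python
# The special referer buckets from the clickstream schema
SPECIAL_REFERERS = {
    "other-wikipedia", "other-empty", "other-internal", "other-google",
    "other-yahoo", "other-bing", "other-facebook", "other-twitter", "other-other",
}

def referer_kind(referer: str) -> str:
    # A referer is either one of the buckets above
    # or the title of the linking Wikipedia article.
    return "bucket" if referer in SPECIAL_REFERERS else "article title"

print(referer_kind("other-google"))  # bucket
print(referer_kind("Apache_Spark"))  # article title
```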
Spark >= 1.6
A 589 MB file, 64 MB per block. [Figure: record counts 7,786,761; 2,418,984; 7,786,761]
Essential Core & Intermediate Spark Operations

TRANSFORMATIONS
• General (= medium): map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
• Math / Statistical (= easy): sample, randomSplit
• Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
• Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
• General: reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap
• Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
• Set Theory / Relational: takeOrdered
• Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

Essential Core & Intermediate Spark Operations (Pair RDDs)

TRANSFORMATIONS
• General (= medium): flatMapValues, groupByKey, reduceByKey, reduceByKeyLocally, foldByKey, aggregateByKey, sortByKey, combineByKey, keys, values, partitionBy
• Math / Statistical (= easy): sampleByKey
• Set Theory / Relational: cogroup (= groupWith), join, subtractByKey, fullOuterJoin, leftOuterJoin, rightOuterJoin

ACTIONS
• Math / Statistical: countByKey, countByValue, countByValueApprox, countApproxDistinctByKey, countByKeyApprox, sampleByKeyExact
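The key distinction in the lists above is that transformations are lazy and actions trigger execution. Python generators give a rough analogy (no cluster needed; this is only an illustration of deferred evaluation, not Spark itself):

```python
# Transformations build a plan; nothing runs until an "action" pulls results.
nums = range(1_000_000)                      # stand-in for an RDD
plan = (x * x for x in nums if x % 2 == 0)   # map + filter: no work done yet

first_five = [next(plan) for _ in range(5)]  # the "action" forces evaluation
print(first_five)  # [0, 4, 16, 36, 64]
```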
Narrow vs. wide dependencies:
• Narrow: each partition of the parent RDD is used by at most one partition of the child RDD — e.g. map, flatMap, filter, mapPartitions
• Wide: multiple child RDD partitions may depend on a single parent RDD partition — e.g. groupByKey, join, repartition
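Wide dependencies force a shuffle because records must be routed by key. A minimal sketch of hash partitioning (the routing rule Spark's default HashPartitioner uses, simulated here in plain Python):

```python
def hash_partition(pairs, num_partitions):
    """Route each (key, value) record to a partition by hash(key),
    so all records for one key land in the same output partition."""
    out = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        out[hash(key) % num_partitions].append((key, value))
    return out

parts = hash_partition([("a", 1), ("b", 1), ("a", 2)], 2)
# Every record with key "a" is now in a single partition.
```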
Sample pagecount records (fields: project name, page title, # of requests, total size of content returned):

en    Main_Page       245839  4737756101
en    Apache_Spark    1370    29484844
en.mw Apache_Spark
en.d  discombobulate  200     284834
fr.b  Special:Recherche/Acteurs_et_actrices_N  1  739
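Parsing one of these records is a one-liner per field; a sketch (assuming well-formed four-field lines, which — as the `en.mw` record shows — real data does not guarantee):

```python
def parse_pagecount(line: str):
    # fields: project, page title, # of requests, bytes returned
    project, title, requests, size = line.split(" ")
    return project, title, int(requests), int(size)

rec = parse_pagecount("en Apache_Spark 1370 29484844")
print(rec)  # ('en', 'Apache_Spark', 1370, 29484844)
```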
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
• groupByKey
https://www.youtube.com/watch?v=Jw9yNTJI8iM
“We start by loading some fields from a tab-separated Wikipedia file into a distributed collection of Article objects, then we perform queries on it. Queries on the on-disk data take 20 seconds, but asking Spark to cache the Article objects in memory reduces the query latency to 1-2 seconds.”
- Matei
[Diagram: MapReduce — Map() tasks run on each input partition; their output is shuffled to Reduce() tasks]
groupByKey vs. reduceByKey vs. combineByKey

[Figure: counting (a, 1) / (b, 1) pairs spread across three partitions. With groupByKey, every pair is shuffled across the network before the values are grouped. With reduceByKey, pairs are first combined within each partition — e.g. (a, 1)(a, 1) → (a, 2), (a, 1)(a, 1)(a, 1) → (a, 3) — and only the partial sums are shuffled and merged into the final (a, 6) and (b, 5)]
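The map-side combine that makes reduceByKey cheaper can be simulated in plain Python (a sketch of the semantics, not Spark's implementation):

```python
from collections import defaultdict

def local_combine(partition):
    """reduceByKey-style map-side combine: sum values per key
    within a partition before anything crosses the network."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return list(acc.items())

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("b", 1), ("a", 1), ("a", 1), ("b", 1)],
]
shuffled = [local_combine(p) for p in partitions]
# groupByKey would shuffle all 11 raw pairs; here only 6 partial sums move.
total = defaultdict(int)
for part in shuffled:
    for key, value in part:
        total[key] += value
print(dict(total))  # {'a': 6, 'b': 5}
```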
sqlContext.setConf(key, value)
spark.sql.shuffle.partitions
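For example, shuffle parallelism can be lowered from the default of 200 for small datasets. A config sketch (requires a live SparkContext/SQLContext, so it is not runnable standalone):

```python
# Reduce the number of post-shuffle partitions for small data.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
```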
Shuffle
Source: parquet.apache.org
Graph terminology: vertices connected by edges; an edge may be unidirectional or bidirectional (a bidirectional edge is equivalent to 2 uni edges). [Figure: vertexID 469 linked to vertexID 3728 by an edge labeled 22; a vertex with surrounding edges labeled 22, 8, 93, 11, 7, 46, illustrating inDegree = 5]
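In- and out-degree fall straight out of an edge list; a plain-Python sketch with made-up edges (GraphX computes the same thing with `graph.inDegrees` / `graph.outDegrees`):

```python
from collections import Counter

# Directed edges as (src, dst); example data is invented for illustration.
edges = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "B"), ("B", "C"), ("E", "B")]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

print(in_degree["B"])   # 4 (edges arriving from A, C, D, E)
print(out_degree["A"])  # 2
```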
[Figure: weighted-graph examples — edge weights such as 1, 0.5, and 13.65; a vertex with Out-degree = 2; paths of 4 and 5 hops; breadth-first distances 0 through 4]
[Figure: political blogosphere graph — Posts colored Liberal vs. Conservative, densely linked within each community]
[Figure: example graph with vertices A–F and edge list: (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D)]
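The figure's edge list is enough to run a graph algorithm by hand. A BFS sketch in plain Python over those edges, treated as undirected (GraphX's `connectedComponents` reaches the same answer iteratively via message passing):

```python
from collections import defaultdict, deque

# Edge list from the figure, treated as undirected.
edges = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "C"),
         ("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def component(start):
    """Breadth-first search: every vertex reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

print(sorted(component("A")))  # ['A', 'B', 'C', 'D', 'E', 'F']
```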
filter() applied to each micro-batch of editsDStream produces filteredDStream. [Figure: a DStream as an ongoing series of RDDs, one per batch interval]
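The per-batch semantics can be simulated with nested lists (the record fields and channel name here are invented; a real DStream applies the same function to each batch RDD):

```python
# A DStream is a sequence of micro-batches; filter() runs on each one.
edits_dstream = [
    [{"channel": "#en.wikipedia", "delta": 12},
     {"channel": "#fr.wikipedia", "delta": 3}],
    [{"channel": "#en.wikipedia", "delta": -5}],
]

filtered_dstream = [
    [e for e in batch if e["channel"] == "#en.wikipedia"]
    for batch in edits_dstream
]

print([len(b) for b in filtered_dstream])  # [1, 1]
```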
https://issues.apache.org/jira/browse/SPARK-8360
[Figure: a directory of test-runner logs as an RDD of text lines, e.g.:
Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname, …
Running test: pyspark/broadcast.py
…
14/12/15 18:36:30 ERROR Aliens attacked the …]
[Figure: featurization — term frequencies from the logs (Running: 43, test: 67, Spark: 110, aliens: 0, …) become an mllib.linalg.Vector such as [0.0, 0.1, -0.1, 50, …]; LinearRegression learns a weight vector like [0.0, 6.7, -11.0, 0.0, …] and predicts via their dot product (-1.1 in the figure)]
LabeledPoint(features: Vector, label: Double)
LinearRegression.train(data: RDD[LabeledPoint]): LinearRegressionModel
LinearRegressionModel.predict(features: Vector): Double
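The train/predict shape of that API can be mimicked without Spark. A minimal sketch using closed-form least squares on single-feature (x, label) pairs — a stand-in for the MLlib API, not its actual algorithm (which uses SGD/L-BFGS over distributed data):

```python
class LinearRegressionModel:
    """Holds learned parameters; predict() mirrors model.predict(features)."""
    def __init__(self, weight, intercept):
        self.weight, self.intercept = weight, intercept

    def predict(self, x: float) -> float:
        return self.weight * x + self.intercept

def train(data):
    """Ordinary least squares on (feature, label) pairs,
    mimicking LinearRegression.train(data)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var = sum((x - mean_x) ** 2 for x, _ in data)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    w = cov / var
    return LinearRegressionModel(w, mean_y - w * mean_x)

model = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])  # toy data on y = 2x
print(model.predict(4.0))  # 8.0
```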
In the spark.ml Pipelines API, LinearRegression is an Estimator: calling fit on a DataFrame produces a Model, which is a Transformer. An Evaluator then reduces predictions to a score — e.g. RegressionMetrics maps an RDD[(Double, Double)] of (prediction, label) pairs to a Double.
Estimator
Transformer
Evaluator
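The Evaluator step above is just a reduction over (prediction, label) pairs. A sketch of an RMSE metric in plain Python, standing in for RegressionMetrics (which also exposes MSE, MAE, and R²):

```python
import math

def rmse(pred_label_pairs):
    """Evaluator sketch: reduce (prediction, label) pairs
    to a single Double — root mean squared error."""
    n = len(pred_label_pairs)
    mse = sum((p - l) ** 2 for p, l in pred_label_pairs) / n
    return math.sqrt(mse)

score = rmse([(2.5, 3.0), (0.0, -0.5), (2.0, 2.0)])  # ≈ 0.408
print(score)
```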
http://www.meetup.com/Spark-Singapore/events/219039180/