Spark Camp: Strata Singapore


[Figure: dataset sizes used in the course: 255 MB, 1.2 GB, 60 GB, 1.2 GB files; streaming input at ~x events per second]

30%: Slides

70%: Live coding/demos, labs

df.rdd.partitions.size  // returns 4

[Diagram: a 255 MB file is read as 4 partitions of 64 MB each; each record is a (Timestamp, String, Int) tuple.]
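The partition count follows from the block size: a sketch of the arithmetic in plain Python (the 64 MB block size is taken from the diagram, not a universal default).

```python
import math

def num_partitions(file_size_mb: float, block_size_mb: float = 64) -> int:
    """Number of input partitions when a file is split into fixed-size blocks."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_partitions(255))  # a 255 MB file yields 4 partitions of up to 64 MB
```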

3,600,000

The clickstream dataset is a set of (referer, resource) pairs: the resource is an article title, and the referer is either the title of the referring article or one of these bucketed values for external and missing referers:
• other-wikipedia • other-empty • other-internal • other-google • other-yahoo • other-bing • other-facebook • other-twitter • other-other
Grouping by (referer) or by (referer, resource) gives per-source or per-link counts.


Spark >= 1.6


[Diagram: a 589 MB file read as partitions of 64 MB each; record counts shown: 7,786,761; 2,418,984; 7,786,761.]

Essential Core & Intermediate Spark Operations (difficulty legend: easy / medium)

TRANSFORMATIONS
General: • map • filter • flatMap • mapPartitions • mapPartitionsWithIndex • groupBy • sortBy
Math / Statistical: • sample • randomSplit
Set Theory / Relational: • union • intersection • subtract • distinct • cartesian • zip
Data Structure / I/O: • keyBy • zipWithIndex • zipWithUniqueID • zipPartitions • coalesce • repartition • repartitionAndSortWithinPartitions • pipe

ACTIONS
General: • reduce • collect • aggregate • fold • first • take • foreach • top • treeAggregate • treeReduce • foreachPartition • collectAsMap
Math / Statistical: • count • takeSample • max • min • sum • histogram • mean • variance • stdev • sampleVariance • countApprox • countApproxDistinct
Set Theory / Relational: • takeOrdered
Data Structure / I/O: • saveAsTextFile • saveAsSequenceFile • saveAsObjectFile • saveAsHadoopDataset • saveAsHadoopFile • saveAsNewAPIHadoopDataset • saveAsNewAPIHadoopFile

Essential Core & Intermediate PairRDD Operations (difficulty legend: easy / medium)

TRANSFORMATIONS
General: • flatMapValues • groupByKey • reduceByKey • reduceByKeyLocally • foldByKey • aggregateByKey • sortByKey • combineByKey
Math / Statistical: • sampleByKey
Set Theory / Relational: • cogroup (=groupWith) • join • subtractByKey • fullOuterJoin • leftOuterJoin • rightOuterJoin
Data Structure: • keys • values • partitionBy

ACTIONS
Math / Statistical: • countByKey • countByValue • countByValueApprox • countApproxDistinctByKey • countByKeyApprox • sampleByKeyExact
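As a rough mental model (not Spark code), several of the core transformations behave like plain Python sequence operations; this sketch shows a few of them on a small list.

```python
data = [3, 1, 2, 3]

mapped   = [x * 2 for x in data]                 # map
filtered = [x for x in data if x % 2 == 1]       # filter
flat     = [y for x in data for y in range(x)]   # flatMap
dedup    = sorted(set(data))                     # distinct (Spark gives no order guarantee)
combined = data + [4, 5]                         # union (keeps duplicates)

print(mapped, filtered, dedup, combined)
```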

Narrow vs. wide dependencies:
• Narrow: each partition of the parent RDD is used by at most one partition of the child RDD. Examples: map, flatMap, filter, mapPartitions.
• Wide: multiple child RDD partitions may depend on a single parent RDD partition. Examples: groupByKey, join, repartition.
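A pure-Python sketch of the distinction, with partitions modeled as lists of lists (an illustration of the dependency shapes, not Spark's implementation):

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 1)], [("a", 1), ("b", 1), ("a", 1)]]

# Narrow: each child partition is computed from exactly one parent partition.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide: building each child partition may need records from every parent
# partition, so a shuffle is required.
grouped = defaultdict(list)
for part in partitions:
    for k, v in part:
        grouped[k].append(v)

print(dict(grouped))  # {'a': [1, 1, 1], 'b': [1, 1]}
```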

Sample pagecounts records (columns: project name, page title, # of requests, total size of content returned):

en Main_Page 245839 4737756101
en Apache_Spark 1370 29484844
en.mw Apache_Spark
en.d discombobulate 200 284834
fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
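A minimal parser for records in this format; the field names follow the column headings above, and the trailing numeric fields are treated as optional since one sample record omits them.

```python
def parse_pagecount(line: str):
    """Split a pagecounts record into (project, title, requests, bytes)."""
    fields = line.split(" ")
    project, title = fields[0], fields[1]
    requests = int(fields[2]) if len(fields) > 2 else None
    size = int(fields[3]) if len(fields) > 3 else None
    return project, title, requests, size

print(parse_pagecount("en Apache_Spark 1370 29484844"))
# ('en', 'Apache_Spark', 1370, 29484844)
```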

Operations on RDDs:
• map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save • ...

https://www.youtube.com/watch?v=Jw9yNTJI8iM

“We start by loading some fields from a tab-separated Wikipedia file into a distributed collection of Article objects, then we perform queries on it. Queries on the on-disk data take 20 seconds, but asking Spark to cache the Article objects in memory reduces the query latency to 1-2 seconds.”

- Matei

[Diagram: the classic MapReduce shuffle: a row of Map() tasks feeding a row of Reduce() tasks. Spark's key-based aggregations that trigger this shuffle: groupByKey, reduceByKey, combineByKey.]

[Diagram: word count over keys a and b across three partitions.
groupByKey: every single (a, 1) and (b, 1) record is shuffled across the network, and the sums (a, 6) and (b, 5) are computed only after the shuffle.
reduceByKey: pairs are first combined within each partition, so only the per-partition subtotals (a, 2), (a, 3), (a, 1) and (b, 1), (b, 2), (b, 2) cross the network before being merged into the same final (a, 6) and (b, 5).]
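The difference in shuffle volume can be sketched in plain Python: with map-side combining, only one record per key per partition crosses the "network". This is a simulation of the semantics, not Spark internals; the partition contents are chosen to reproduce the (a, 6) and (b, 5) totals.

```python
from collections import Counter

partitions = [["a", "a", "b"], ["a", "b", "b"], ["a", "a", "a", "b", "b"]]

# groupByKey-style: every (word, 1) pair is shuffled.
shuffled_naive = [(w, 1) for part in partitions for w in part]

# reduceByKey-style: combine within each partition first, then merge subtotals.
subtotals = [Counter(part) for part in partitions]
shuffled_combined = [kv for c in subtotals for kv in c.items()]

totals = Counter()
for k, v in shuffled_combined:
    totals[k] += v

print(len(shuffled_naive), len(shuffled_combined), dict(totals))
# 11 shuffled records vs 6; totals {'a': 6, 'b': 5}
```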

sqlContext.setConf(key, value)

spark.sql.shuffle.partitions
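A usage sketch, assuming a Spark 1.x SQLContext named `sqlContext`; the default value of spark.sql.shuffle.partitions in this era of Spark is 200, which is often far too high for small demo datasets.

```
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
```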


Shuffle

Source: parquet.apache.org

[Diagram: property-graph terminology: a graph is made of vertices connected by edges; an edge may be unidirectional or bidirectional (a bidirectional edge is equivalent to 2 uni edges). Example: vertexID 469 connects to vertexID 3728 by an edge with label 22. A vertex with five incoming edges has inDegree = 5; a vertex with two outgoing edges has out-degree = 2.]
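In-degree and out-degree as shown in the diagram can be sketched over a plain edge list; the topology below is an assumption loosely based on the vertex IDs in the figure, arranged so that vertex 3728 has inDegree = 5 and out-degree = 2.

```python
from collections import Counter

# (source, destination) pairs; hypothetical edges for illustration.
edges = [(469, 3728), (8, 3728), (93, 3728), (11, 3728), (7, 3728),
         (3728, 46), (3728, 22)]

in_deg = Counter(dst for _, dst in edges)
out_deg = Counter(src for src, _ in edges)

print(in_deg[3728], out_deg[3728])  # 5 2
```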

[Diagram: two paths between the same pair of vertices, one of 5 hops and one of 4 hops; vertices annotated with shortest-path distances 0 through 4.]

[Figure: a network of blog posts grouped into Liberal and Conservative communities; posts whose affiliation is unknown are marked "?".]

[Example graph over vertices A–F, built from the edge list:
A B
A C
C D
B C
A E
A F
E F
E D
The edges are shown distributed across 2 partitions.]
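On the edge list above, a graph computation such as triangle counting can be sketched directly in Python (an undirected interpretation of the edges is assumed; GraphX does this distributedly):

```python
from itertools import combinations

edges = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "C"),
         ("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")]

# Build an undirected adjacency map.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# A triangle is any 3 vertices that are pairwise adjacent.
vertices = sorted(adj)
triangles = [t for t in combinations(vertices, 3)
             if all(b in adj[a] for a, b in combinations(t, 2))]

print(triangles)  # [('A', 'B', 'C'), ('A', 'E', 'F')]
```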

filter() applied to a DStream runs on every micro-batch RDD in turn:

editsDStream --filter()--> filteredDStream
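A sketch of those DStream semantics: the same predicate is applied to each micro-batch as it arrives. Micro-batches are modeled as plain lists, and the event format and predicate are made-up examples.

```python
micro_batches = [
    ["edit:Main_Page", "bot:cleanup", "edit:Apache_Spark"],
    ["bot:revert", "edit:Scala"],
]

def is_human_edit(event: str) -> bool:
    return event.startswith("edit:")

# filter() on a DStream = filter() applied to every batch RDD.
filtered_batches = [[e for e in batch if is_human_edit(e)]
                    for batch in micro_batches]

print(filtered_batches)
# [['edit:Main_Page', 'edit:Apache_Spark'], ['edit:Scala']]
```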

https://issues.apache.org/jira/browse/SPARK-8360

Sample output from the test runner:

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

Running test: pyspark/broadcast.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:30 ERROR Aliens attacked the



Word counts extracted from the log lines:
Running: 43
test: 67
Spark: 110
aliens: 0
...
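Turning raw log lines into term counts like these is a simple bag-of-words featurization. A plain-Python sketch over a tiny sample (the vocabulary matches the slide; the log lines and resulting counts are illustrative, not the 43/67/110 figures above):

```python
import re
from collections import Counter

vocabulary = ["Running", "test", "Spark", "aliens"]

log_lines = [
    "Running test: pyspark/conf.py",
    "Spark assembly has been built",
    "Running test: pyspark/broadcast.py",
]

# Tokenize on word characters, then count occurrences of each vocabulary term.
tokens = re.findall(r"\w+", " ".join(log_lines))
counts = Counter(tokens)
features = [counts[term] for term in vocabulary]

print(features)  # [2, 2, 1, 0]
```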

[Diagram: a feature vector (mllib.linalg.Vector) such as (0.0, 0.1, -0.1, 50, ...) with label -1.1; LinearRegression predicts by summing weighted feature contributions such as 0.0 + 6.7 + (-11.0) + 0.0 + ...]



LabeledPoint(features: Vector, label: Double)


The MLlib training and prediction signatures:

LinearRegression.train(data: RDD[LabeledPoint]): LinearRegressionModel
LinearRegressionModel.predict(features: Vector): Double

An RDD[LabeledPoint] goes into train(), which returns a LinearRegressionModel; predict() then maps a feature Vector to a predicted Double.
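The shape of that API can be mimicked in pure Python with a one-feature least-squares fit; this is a toy stand-in for MLlib's implementation, with (feature, label) pairs playing the role of LabeledPoints.

```python
class LinearRegressionModel:
    def __init__(self, weight: float, intercept: float):
        self.weight, self.intercept = weight, intercept

    def predict(self, feature: float) -> float:
        return self.weight * feature + self.intercept


def train(data):
    """Closed-form least squares over (feature, label) pairs."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    w = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return LinearRegressionModel(w, my - w * mx)


model = train([(1, 2.0), (2, 4.0), (3, 6.0)])
print(model.predict(10))  # 20.0
```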


spark.ml pipeline concepts:
• Estimator (e.g. LinearRegression): fit on a DataFrame, produces a Model
• Transformer: transforms a DataFrame into another DataFrame; a Model is a Transformer
• Evaluator: turns predictions into a metric; e.g. RegressionMetrics consumes an RDD[(Double, Double)] of (prediction, label) pairs and yields Double-valued statistics
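That division of labor can be sketched with toy classes that mirror the interfaces; only the names correspond to spark.ml, and the "fit" and metric are deliberately trivial stand-ins.

```python
class Transformer:
    def transform(self, rows):  # DataFrame in, DataFrame out
        raise NotImplementedError


class Model(Transformer):  # a fitted Model is itself a Transformer
    def __init__(self, weight):
        self.weight = weight

    def transform(self, rows):
        return [dict(r, prediction=self.weight * r["feature"]) for r in rows]


class Estimator:
    def fit(self, rows):  # DataFrame in, Model out
        # Toy "fit": average label/feature ratio.
        w = sum(r["label"] / r["feature"] for r in rows) / len(rows)
        return Model(w)


def evaluate(pred_and_label):  # Evaluator: (prediction, label) pairs -> Double
    return sum((p - l) ** 2 for p, l in pred_and_label) / len(pred_and_label)


data = [{"feature": 1.0, "label": 2.0}, {"feature": 3.0, "label": 6.0}]
model = Estimator().fit(data)
scored = model.transform(data)
mse = evaluate([(r["prediction"], r["label"]) for r in scored])
print(model.weight, mse)  # 2.0 0.0
```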

http://www.meetup.com/Spark-Singapore/events/219039180/