Sparkcamp stratasingapore

Page 1: Sparkcamp stratasingapore
Page 2: Sparkcamp stratasingapore
Page 3: Sparkcamp stratasingapore

Page 4: Sparkcamp stratasingapore

Page 5: Sparkcamp stratasingapore

Page 6: Sparkcamp stratasingapore

12 stacks

Page 7: Sparkcamp stratasingapore

Page 8: Sparkcamp stratasingapore
Page 9: Sparkcamp stratasingapore
Page 10: Sparkcamp stratasingapore
Page 11: Sparkcamp stratasingapore
Page 14: Sparkcamp stratasingapore

Page 15: Sparkcamp stratasingapore
Page 16: Sparkcamp stratasingapore
Page 17: Sparkcamp stratasingapore
Page 18: Sparkcamp stratasingapore

Page 19: Sparkcamp stratasingapore
Page 20: Sparkcamp stratasingapore

Page 21: Sparkcamp stratasingapore
Page 22: Sparkcamp stratasingapore

Page 23: Sparkcamp stratasingapore

(255 MB)

(1.2 GB)

(60 GB)

(1.2 GB)

(~x per sec)

30%: Slides

70%: Live coding/demos, labs

Page 24: Sparkcamp stratasingapore
Page 25: Sparkcamp stratasingapore
Page 28: Sparkcamp stratasingapore
Page 29: Sparkcamp stratasingapore
Page 30: Sparkcamp stratasingapore
Page 31: Sparkcamp stratasingapore
Page 32: Sparkcamp stratasingapore
Page 33: Sparkcamp stratasingapore
Page 35: Sparkcamp stratasingapore
Page 36: Sparkcamp stratasingapore
Page 37: Sparkcamp stratasingapore
Page 39: Sparkcamp stratasingapore
Page 40: Sparkcamp stratasingapore
Page 41: Sparkcamp stratasingapore
Page 42: Sparkcamp stratasingapore
Page 43: Sparkcamp stratasingapore

df.rdd.partitions.size = 4

(TS) (Str) (Int) (TS) (Str) (Int) (TS) (Str) (Int) (TS) (Str) (Int)

m

Page 44: Sparkcamp stratasingapore

(TS) (Str) (Int) (TS) (Str) (Int) (TS) (Str) (Int) (TS) (Str) (Int)

m

255 MB file

64 MB 64 MB 64 MB 64 MB
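The partition count shown for the 255 MB file follows from simple block arithmetic. The sketch below is a rough model only: it assumes one input partition per 64 MB block and ignores any split-size tuning; the function name `estimate_partitions` is hypothetical and not part of Spark.

```python
import math

def estimate_partitions(file_size_mb: float, block_size_mb: float = 64) -> int:
    """Ceiling-division estimate of how many blocks (and so input partitions) a file spans."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)

# The 255 MB file from the slide, with 64 MB blocks.
partitions = estimate_partitions(255)
print(partitions)
```

Under these assumptions the result agrees with the slide's df.rdd.partitions.size = 4.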

Page 45: Sparkcamp stratasingapore

m

m

3,600,000

Page 46: Sparkcamp stratasingapore

3,600,000

Page 47: Sparkcamp stratasingapore
Page 48: Sparkcamp stratasingapore
Page 49: Sparkcamp stratasingapore
Page 51: Sparkcamp stratasingapore
Page 52: Sparkcamp stratasingapore
Page 53: Sparkcamp stratasingapore

(referer, resource)

Page 54: Sparkcamp stratasingapore

(referer, resource)

Page 55: Sparkcamp stratasingapore

article title

other-wikipedia

other-empty

other-internal

other-google

other-yahoo

other-bing

other-facebook

other-twitter

other-other

(referer)

Page 56: Sparkcamp stratasingapore

(referer, resource)

Page 57: Sparkcamp stratasingapore
Page 58: Sparkcamp stratasingapore
Page 59: Sparkcamp stratasingapore
Page 60: Sparkcamp stratasingapore
Page 61: Sparkcamp stratasingapore
Page 62: Sparkcamp stratasingapore
Page 63: Sparkcamp stratasingapore
Page 64: Sparkcamp stratasingapore
Page 65: Sparkcamp stratasingapore
Page 66: Sparkcamp stratasingapore

Page 67: Sparkcamp stratasingapore
Page 68: Sparkcamp stratasingapore
Page 69: Sparkcamp stratasingapore
Page 70: Sparkcamp stratasingapore
Page 71: Sparkcamp stratasingapore
Page 72: Sparkcamp stratasingapore
Page 73: Sparkcamp stratasingapore
Page 74: Sparkcamp stratasingapore
Page 75: Sparkcamp stratasingapore
Page 76: Sparkcamp stratasingapore
Page 77: Sparkcamp stratasingapore
Page 78: Sparkcamp stratasingapore
Page 79: Sparkcamp stratasingapore

Spark >= 1.6

Page 80: Sparkcamp stratasingapore
Page 81: Sparkcamp stratasingapore
Page 82: Sparkcamp stratasingapore

Page 83: Sparkcamp stratasingapore
Page 84: Sparkcamp stratasingapore
Page 85: Sparkcamp stratasingapore
Page 86: Sparkcamp stratasingapore
Page 87: Sparkcamp stratasingapore
Page 88: Sparkcamp stratasingapore

Page 89: Sparkcamp stratasingapore
Page 90: Sparkcamp stratasingapore

Page 91: Sparkcamp stratasingapore

+

Page 92: Sparkcamp stratasingapore

589 MB file, 64 MB each

Page 93: Sparkcamp stratasingapore
Page 94: Sparkcamp stratasingapore

7,786,761

Page 95: Sparkcamp stratasingapore

2,418,984

Page 96: Sparkcamp stratasingapore

7,786,761

Page 97: Sparkcamp stratasingapore
Page 98: Sparkcamp stratasingapore
Page 99: Sparkcamp stratasingapore
Page 100: Sparkcamp stratasingapore

Essential Core & Intermediate Spark Operations

(Difficulty legend: easy, medium)

TRANSFORMATIONS

General
• map • filter • flatMap • mapPartitions • mapPartitionsWithIndex • groupBy • sortBy

Math / Statistical
• sample • randomSplit

Set Theory / Relational
• union • intersection • subtract • distinct • cartesian • zip

Data Structure / I/O
• keyBy • zipWithIndex • zipWithUniqueId • zipPartitions • coalesce • repartition • repartitionAndSortWithinPartitions • pipe

ACTIONS

General
• reduce • collect • aggregate • fold • first • take • foreach • top • treeAggregate • treeReduce • foreachPartition • collectAsMap

Math / Statistical
• count • takeSample • max • min • sum • histogram • mean • variance • stdev • sampleVariance • countApprox • countApproxDistinct

Set Theory / Relational
• takeOrdered

Data Structure / I/O
• saveAsTextFile • saveAsSequenceFile • saveAsObjectFile • saveAsHadoopDataset • saveAsHadoopFile • saveAsNewAPIHadoopDataset • saveAsNewAPIHadoopFile

Page 101: Sparkcamp stratasingapore

Essential Core & Intermediate Spark Operations

(Difficulty legend: easy, medium)

TRANSFORMATIONS

General
• flatMapValues • groupByKey • reduceByKey • reduceByKeyLocally • foldByKey • aggregateByKey • sortByKey • combineByKey

Math / Statistical
• sampleByKey

Set Theory / Relational
• cogroup (=groupWith) • join • subtractByKey • fullOuterJoin • leftOuterJoin • rightOuterJoin

Data Structure / I/O
• partitionBy

ACTIONS

General
• keys • values

Math / Statistical
• countByKey • countByValue • countByValueApprox • countApproxDistinctByKey • countByKeyApprox • sampleByKeyExact

Page 102: Sparkcamp stratasingapore

Narrow vs. wide dependencies

Narrow: each partition of the parent RDD is used by at most one partition of the child RDD

Wide: multiple child RDD partitions may depend on a single parent RDD partition
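The two definitions above can be checked mechanically on toy data. This is a plain-Python sketch, not Spark code: partitions are modeled as lists, and `classify_dependency` and the routing lambdas are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, int]

def classify_dependency(
    parent_partitions: Sequence[List[Record]],
    child_partition_of: Callable[[int, Record], int],
) -> str:
    """Return "narrow" if every parent partition feeds at most one child partition."""
    for parent_index, records in enumerate(parent_partitions):
        # Set of child partitions that receive data from this parent partition.
        child_targets = {child_partition_of(parent_index, rec) for rec in records}
        if len(child_targets) > 1:
            return "wide"
    return "narrow"

parents = [[("a", 1), ("b", 1)], [("a", 1), ("b", 1)]]

# map/filter-style routing: records stay in the partition they came from.
map_like = classify_dependency(parents, lambda i, rec: i)

# groupByKey-style routing: the key alone decides the target partition.
shuffle_like = classify_dependency(parents, lambda i, rec: ord(rec[0]) % 2)

print(map_like, shuffle_like)
```

In this model, key-based routing sends the "a" and "b" records of one parent partition to different child partitions, which is what forces a shuffle.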

Page 103: Sparkcamp stratasingapore
Page 104: Sparkcamp stratasingapore

Page 105: Sparkcamp stratasingapore

map flatMap filter mapPartitions

groupByKey join

repartition

Page 106: Sparkcamp stratasingapore

Project name  Page title                               # of requests  Total size of content returned
en            Main_Page                                245839         4737756101
en            Apache_Spark                             1370           29484844
en.mw         Apache_Spark
en.d          discombobulate                           200            284834
fr.b          Special:Recherche/Acteurs_et_actrices_N  1              739
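A record in this format can be parsed by splitting on spaces. The sketch below assumes exactly four space-separated fields per line, as in the complete sample rows; the `PageCount` type and `parse_line` function are hypothetical names, not part of the lab material.

```python
from typing import NamedTuple

class PageCount(NamedTuple):
    project: str          # project name, e.g. "en"
    page: str             # page title
    requests: int         # number of requests
    bytes_returned: int   # total size of content returned

def parse_line(line: str) -> PageCount:
    """Parse one space-separated pagecounts record; raises ValueError on malformed input."""
    project, page, requests, size = line.strip().split(" ")
    return PageCount(project, page, int(requests), int(size))

record = parse_line("en Main_Page 245839 4737756101")
print(record.page, record.requests)
```

Lines with missing fields raise ValueError here, so a real job would likely filter or log them rather than fail.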

Page 107: Sparkcamp stratasingapore
Page 108: Sparkcamp stratasingapore

• map

• filter

• groupBy

• sort

• union

• join

• leftOuterJoin

• rightOuterJoin

• reduce

• count

• fold

• reduceByKey

• cogroup

• cross

• zip

• sample

• take

• first

• partitionBy

• mapWith

• pipe

• save

• ...

• groupByKey

Page 109: Sparkcamp stratasingapore
Page 110: Sparkcamp stratasingapore
Page 111: Sparkcamp stratasingapore

https://www.youtube.com/watch?v=Jw9yNTJI8iM

“We start by loading some fields from a tab-separated Wikipedia file into a distributed collection of Article objects, then we perform queries on it. Queries on the on-disk data take 20 seconds, but asking Spark to cache the Article objects in memory reduces the query latency to 1-2 seconds.”

- Matei

Page 112: Sparkcamp stratasingapore
Page 113: Sparkcamp stratasingapore

Page 114: Sparkcamp stratasingapore
Page 115: Sparkcamp stratasingapore
Page 116: Sparkcamp stratasingapore
Page 117: Sparkcamp stratasingapore
Page 118: Sparkcamp stratasingapore

Page 119: Sparkcamp stratasingapore

Map() Map() Map() Map()

Page 120: Sparkcamp stratasingapore

Map() Map() Map() Map()

Reduce() Reduce() Reduce()

Page 121: Sparkcamp stratasingapore
Page 122: Sparkcamp stratasingapore

groupByKey reduceByKey combineByKey

Page 123: Sparkcamp stratasingapore

groupByKey (figure): all 11 input pairs are shuffled by key (a, b) and only then summed.

(a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) → (a, 6)

(b, 1) (b, 1) (b, 1) (b, 1) (b, 1) → (b, 5)

Page 124: Sparkcamp stratasingapore

reduceByKey (figure): pairs are combined within each partition first, so only the partial sums are shuffled by key (a, b).

(a, 1) (a, 2) (a, 3) → (a, 6)

(b, 1) (b, 2) (b, 2) → (b, 5)
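The difference between the two diagrams is how many records cross the shuffle. The following is a plain-Python simulation, not Spark code: the three-partition layout is an assumption chosen to match the partial sums in the figures, and both function names are hypothetical.

```python
from collections import defaultdict

# Assumed layout: three partitions holding six (a, 1) and five (b, 1) pairs in total.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("b", 1)],
]

def group_by_key_then_sum(parts):
    """groupByKey style: every pair is shuffled, then summed on the reduce side."""
    shuffled = [pair for part in parts for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def reduce_by_key(parts):
    """reduceByKey style: sum inside each partition first, shuffle only partial sums."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = group_by_key_then_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key(partitions)
print(grouped_totals, grouped_shuffled)
print(reduced_totals, reduced_shuffled)
```

Both approaches give the same totals; in this toy layout the map-side combine cuts the shuffled records from 11 to 6.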

Page 125: Sparkcamp stratasingapore

sqlContext.setConf(key, value)

spark.sql.shuffle.partitions

Shuffle

Page 126: Sparkcamp stratasingapore
Page 127: Sparkcamp stratasingapore
Page 128: Sparkcamp stratasingapore
Page 129: Sparkcamp stratasingapore

Page 130: Sparkcamp stratasingapore
Page 131: Sparkcamp stratasingapore
Page 132: Sparkcamp stratasingapore
Page 133: Sparkcamp stratasingapore

Page 134: Sparkcamp stratasingapore

Source: parquet.apache.org

Page 135: Sparkcamp stratasingapore
Page 136: Sparkcamp stratasingapore
Page 137: Sparkcamp stratasingapore
Page 138: Sparkcamp stratasingapore
Page 139: Sparkcamp stratasingapore
Page 140: Sparkcamp stratasingapore
Page 141: Sparkcamp stratasingapore
Page 142: Sparkcamp stratasingapore
Page 143: Sparkcamp stratasingapore

vertex vertex

edge

Page 144: Sparkcamp stratasingapore

unidirectional edge

bidirectional edge (2 unidirectional edges)

Page 145: Sparkcamp stratasingapore
Page 146: Sparkcamp stratasingapore

vertexID: 469

Edge label: 22

vertexID: 3728

Page 147: Sparkcamp stratasingapore

22, 8, 93, 11, 7, 46

Page 148: Sparkcamp stratasingapore
Page 149: Sparkcamp stratasingapore

inDegree = 5 (1 + 1 + 1 + 1 + 1)

Page 150: Sparkcamp stratasingapore
Page 151: Sparkcamp stratasingapore

0.5, 1, 1, 1, 1, 13.65, 1, 0.5, 1, 1

Page 152: Sparkcamp stratasingapore
Page 153: Sparkcamp stratasingapore

Out-degree = 2
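In-degree and out-degree can be counted directly from an edge list. The sketch below uses plain Python rather than GraphX, and the edge list is hypothetical: it is built so that vertex 5 has five incoming and two outgoing edges, echoing the numbers on these slides.

```python
from collections import Counter
from typing import Iterable, Tuple

Edge = Tuple[int, int]  # (source vertex id, destination vertex id)

def degrees(edges: Iterable[Edge]):
    """Return (in_degree, out_degree) counters keyed by vertex id."""
    edge_list = list(edges)
    out_degree = Counter(src for src, _ in edge_list)
    in_degree = Counter(dst for _, dst in edge_list)
    return in_degree, out_degree

# Hypothetical directed edges, not taken from the slides.
edges = [(1, 5), (2, 5), (3, 5), (4, 5), (6, 5), (5, 1), (5, 2)]
in_degree, out_degree = degrees(edges)
print(in_degree[5], out_degree[5])
```

A Counter returns 0 for vertices with no edges in a given direction, so lookups are safe for any vertex id.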

Page 154: Sparkcamp stratasingapore

5 hops

Page 155: Sparkcamp stratasingapore

4 hops

Page 156: Sparkcamp stratasingapore

1, 1, 0, 2, 4, 2, 3, 3

Page 157: Sparkcamp stratasingapore

?

Page 158: Sparkcamp stratasingapore
Page 159: Sparkcamp stratasingapore
Page 160: Sparkcamp stratasingapore
Page 161: Sparkcamp stratasingapore
Page 162: Sparkcamp stratasingapore
Page 163: Sparkcamp stratasingapore
Page 164: Sparkcamp stratasingapore
Page 165: Sparkcamp stratasingapore
Page 166: Sparkcamp stratasingapore
Page 167: Sparkcamp stratasingapore
Page 168: Sparkcamp stratasingapore

Liberal Conservative

(Figure: a network of posts, many of them marked "?")

Page 169: Sparkcamp stratasingapore

(Figure: a graph with vertices A, B, C, D, E, F, stored as tables split across partitions 1 and 2)

Edges:
A B
A C
C D
B C
A E
A F
E F
E D

Vertices: B, C, D, E, A, F

Page 170: Sparkcamp stratasingapore
Page 171: Sparkcamp stratasingapore
Page 172: Sparkcamp stratasingapore
Page 173: Sparkcamp stratasingapore
Page 174: Sparkcamp stratasingapore
Page 175: Sparkcamp stratasingapore
Page 176: Sparkcamp stratasingapore
Page 177: Sparkcamp stratasingapore
Page 178: Sparkcamp stratasingapore
Page 179: Sparkcamp stratasingapore
Page 180: Sparkcamp stratasingapore
Page 181: Sparkcamp stratasingapore
Page 182: Sparkcamp stratasingapore

editsDStream → filter() → filteredDStream
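A DStream can be pictured as a sequence of small batches, with filter() applied to each batch in turn. The sketch below models that idea in plain Python; it is not the Spark Streaming API, and the record fields (`page`, `bot`) and the function `filter_dstream` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Edit = Dict[str, object]

def filter_dstream(
    batches: Iterable[List[Edit]],
    predicate: Callable[[Edit], bool],
) -> Iterator[List[Edit]]:
    """Yield one filtered batch per input batch, keeping batch boundaries intact."""
    for batch in batches:
        yield [record for record in batch if predicate(record)]

# Three hypothetical micro-batches of edit events.
edits_dstream = [
    [{"page": "Apache_Spark", "bot": False}, {"page": "Main_Page", "bot": True}],
    [{"page": "Singapore", "bot": True}],
    [{"page": "Scala", "bot": False}],
]

filtered_dstream = list(filter_dstream(edits_dstream, lambda e: not e["bot"]))
print([[e["page"] for e in batch] for batch in filtered_dstream])
```

Note that a batch can become empty after filtering but is still emitted, so the output has as many batches as the input.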

Page 183: Sparkcamp stratasingapore
Page 184: Sparkcamp stratasingapore

https://issues.apache.org/jira/browse/SPARK-8360

Page 185: Sparkcamp stratasingapore
Page 186: Sparkcamp stratasingapore
Page 187: Sparkcamp stratasingapore
Page 188: Sparkcamp stratasingapore
Page 189: Sparkcamp stratasingapore
Page 190: Sparkcamp stratasingapore
Page 191: Sparkcamp stratasingapore

Page 192: Sparkcamp stratasingapore
Page 193: Sparkcamp stratasingapore

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

Running test: pyspark/broadcast.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:30 ERROR Aliens attacked the

Page 194: Sparkcamp stratasingapore

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

Page 195: Sparkcamp stratasingapore

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

Running  43
test     67
Spark    110
aliens   0
...

mllib.linalg.Vector

Page 196: Sparkcamp stratasingapore

LinearRegression

weights:   0.0    0.1    -0.1    50    ...
features:  43     67     110     0     ...    (Running, test, Spark, aliens)
products:  0.0  + 6.7  + -11.0 + 0.0 + ...  = -1.1
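A linear model's prediction is the dot product of its weights with the feature vector. The sketch below recomputes only the four terms visible on the slide (weights 0.0, 0.1, -0.1, 50 against the word counts 43, 67, 110, 0); the slide's total of -1.1 presumably also includes the terms elided by "...", so the partial sum here differs. The function name `predict` is a plain-Python stand-in, not the MLlib call.

```python
from typing import Sequence

def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Dot product of weights and features (no intercept term)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features))

weights = [0.0, 0.1, -0.1, 50.0]      # visible weights from the slide
features = [43.0, 67.0, 110.0, 0.0]   # counts for Running, test, Spark, aliens

partial = predict(weights, features)
print(round(partial, 6))
```

The large weight on "aliens" contributes nothing here because that word's count is 0.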

Page 197: Sparkcamp stratasingapore

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

LabeledPoint(features: Vector, label: Double)

Page 198: Sparkcamp stratasingapore

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname,

...

RDD[LabeledPoint]

Page 199: Sparkcamp stratasingapore

LinearRegressionModel.predict(features: Vector): Double

RDD[LabeledPoint]

LinearRegression.train(data: RDD[LabeledPoint]): LinearRegressionModel

Page 200: Sparkcamp stratasingapore

LinearRegressionModel

Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname, …

LinearRegression

DataFrame

Estimator

Transformer

Page 201: Sparkcamp stratasingapore

Model(Transformer)

RDD[(Double, Double)] RegressionMetrics

Double

Estimator

Transformer

Evaluator
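The Estimator, Transformer, and Evaluator roles can be illustrated with a toy pipeline. This is a plain-Python sketch of the pattern only, not the spark.ml API: all three class names are hypothetical, and the "model" simply predicts the mean label.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (features, label)

class MeanModel:
    """Transformer: turns rows into (prediction, label) pairs."""
    def __init__(self, mean: float) -> None:
        self.mean = mean

    def transform(self, rows: List[Row]) -> List[Tuple[float, float]]:
        return [(self.mean, label) for _, label in rows]

class MeanRegressor:
    """Estimator: fit() learns from data and returns a model (a Transformer)."""
    def fit(self, rows: List[Row]) -> MeanModel:
        labels = [label for _, label in rows]
        return MeanModel(sum(labels) / len(labels))

class RmseEvaluator:
    """Evaluator: reduces (prediction, label) pairs to a single Double-like score."""
    def evaluate(self, pairs: List[Tuple[float, float]]) -> float:
        return math.sqrt(sum((p - y) ** 2 for p, y in pairs) / len(pairs))

rows = [([43.0], 1.0), ([67.0], 3.0)]    # hypothetical training data
model = MeanRegressor().fit(rows)        # Estimator -> Model(Transformer)
pairs = model.transform(rows)            # rows -> (prediction, label) pairs
rmse = RmseEvaluator().evaluate(pairs)   # pairs -> single metric
print(rmse)
```

The shape mirrors the slide: fitting yields a model, transforming yields prediction and label pairs, and evaluation yields one number.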

Page 202: Sparkcamp stratasingapore
Page 203: Sparkcamp stratasingapore
Page 204: Sparkcamp stratasingapore

http://www.meetup.com/Spark-Singapore/events/219039180/

Page 206: Sparkcamp stratasingapore