[Figure: lab dataset sizes — 255 MB, 1.2 GB, 60 GB, 1.2 GB (~x per sec)]
Course format: 30% slides; 70% live coding, demos, and labs.
Reading a 255 MB file from HDFS with a 64 MB block size yields four input partitions, one per block (64 + 64 + 64 + 63 MB):

df.rdd.partitions.size = 4

[Figure: the four partitions, each holding rows of (Timestamp, String, Int) columns]
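The block math above can be sketched in a few lines. This is a simplification (it ignores `spark.sql.files.maxPartitionBytes` and input-format split logic), but it captures the one-partition-per-HDFS-block rule:

```python
import math

def hdfs_partitions(file_size_mb: int, block_size_mb: int = 64) -> int:
    """Rough partition count for a file read from HDFS:
    one input partition per block."""
    return math.ceil(file_size_mb / block_size_mb)

print(hdfs_partitions(255))  # 4 blocks: 64 + 64 + 64 + 63 MB
```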
Wikipedia Clickstream: ~3,600,000 (referer, resource) pairs. The referer field is either an article title or one of these buckets:

• other-wikipedia
• other-empty
• other-internal
• other-google
• other-yahoo
• other-bing
• other-facebook
• other-twitter
• other-other
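A quick way to see the two kinds of referer values — the bucket names and the helper function here are illustrative, not part of the dataset's tooling:

```python
# The special referer buckets from the clickstream schema
SPECIAL_REFERERS = {
    "other-wikipedia", "other-empty", "other-internal", "other-google",
    "other-yahoo", "other-bing", "other-facebook", "other-twitter", "other-other",
}

def referer_kind(referer: str) -> str:
    # A referer is either one of the buckets above
    # or the title of the linking Wikipedia article.
    return "bucket" if referer in SPECIAL_REFERERS else "article title"

print(referer_kind("other-google"))  # bucket
print(referer_kind("Apache_Spark"))  # article title
```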
Spark >= 1.6
A 589 MB file, 64 MB per block. [Figure: record counts 7,786,761; 2,418,984; 7,786,761]
Essential Core & Intermediate Spark Operations

TRANSFORMATIONS
• General (= medium): map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
• Math / Statistical (= easy): sample, randomSplit
• Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
• Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
• General: reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap
• Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
• Set Theory / Relational: takeOrdered
• Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

Essential Core & Intermediate Spark Operations (Pair RDDs)

TRANSFORMATIONS
• General (= medium): flatMapValues, groupByKey, reduceByKey, reduceByKeyLocally, foldByKey, aggregateByKey, sortByKey, combineByKey, keys, values, partitionBy
• Math / Statistical (= easy): sampleByKey
• Set Theory / Relational: cogroup (= groupWith), join, subtractByKey, fullOuterJoin, leftOuterJoin, rightOuterJoin

ACTIONS
• Math / Statistical: countByKey, countByValue, countByValueApprox, countApproxDistinctByKey, countByKeyApprox, sampleByKeyExact
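The key distinction in the lists above is that transformations are lazy and actions trigger execution. Python generators give a rough analogy (no cluster needed; this is only an illustration of deferred evaluation, not Spark itself):

```python
# Transformations build a plan; nothing runs until an "action" pulls results.
nums = range(1_000_000)                      # stand-in for an RDD
plan = (x * x for x in nums if x % 2 == 0)   # map + filter: no work done yet

first_five = [next(plan) for _ in range(5)]  # the "action" forces evaluation
print(first_five)  # [0, 4, 16, 36, 64]
```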
Narrow vs. wide dependencies:
• Narrow: each partition of the parent RDD is used by at most one partition of the child RDD — e.g. map, flatMap, filter, mapPartitions
• Wide: multiple child RDD partitions may depend on a single parent RDD partition — e.g. groupByKey, join, repartition
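Wide dependencies force a shuffle because records must be routed by key. A minimal sketch of hash partitioning (the routing rule Spark's default HashPartitioner uses, simulated here in plain Python):

```python
def hash_partition(pairs, num_partitions):
    """Route each (key, value) record to a partition by hash(key),
    so all records for one key land in the same output partition."""
    out = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        out[hash(key) % num_partitions].append((key, value))
    return out

parts = hash_partition([("a", 1), ("b", 1), ("a", 2)], 2)
# Every record with key "a" is now in a single partition.
```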
Sample pagecount records (fields: project name, page title, # of requests, total size of content returned):

en    Main_Page       245839  4737756101
en    Apache_Spark    1370    29484844
en.mw Apache_Spark
en.d  discombobulate  200     284834
fr.b  Special:Recherche/Acteurs_et_actrices_N  1  739
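Parsing one of these records is a one-liner per field; a sketch (assuming well-formed four-field lines, which — as the `en.mw` record shows — real data does not guarantee):

```python
def parse_pagecount(line: str):
    # fields: project, page title, # of requests, bytes returned
    project, title, requests, size = line.split(" ")
    return project, title, int(requests), int(size)

rec = parse_pagecount("en Apache_Spark 1370 29484844")
print(rec)  # ('en', 'Apache_Spark', 1370, 29484844)
```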
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
• groupByKey
https://www.youtube.com/watch?v=Jw9yNTJI8iM
“We start by loading some fields from a tab-separated Wikipedia file into a distributed collection of Article objects, then we perform queries on it. Queries on the on-disk data take 20 seconds, but asking Spark to cache the Article objects in memory reduces the query latency to 1-2 seconds.”
- Matei
[Diagram: MapReduce — Map() tasks run on each input partition; their output is shuffled to Reduce() tasks]
groupByKey vs. reduceByKey vs. combineByKey

[Figure: counting (a, 1) / (b, 1) pairs spread across three partitions. With groupByKey, every pair is shuffled across the network before the values are grouped. With reduceByKey, pairs are first combined within each partition — e.g. (a, 1)(a, 1) → (a, 2), (a, 1)(a, 1)(a, 1) → (a, 3) — and only the partial sums are shuffled and merged into the final (a, 6) and (b, 5)]
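The map-side combine that makes reduceByKey cheaper can be simulated in plain Python (a sketch of the semantics, not Spark's implementation):

```python
from collections import defaultdict

def local_combine(partition):
    """reduceByKey-style map-side combine: sum values per key
    within a partition before anything crosses the network."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return list(acc.items())

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("b", 1), ("a", 1), ("a", 1), ("b", 1)],
]
shuffled = [local_combine(p) for p in partitions]
# groupByKey would shuffle all 11 raw pairs; here only 6 partial sums move.
total = defaultdict(int)
for part in shuffled:
    for key, value in part:
        total[key] += value
print(dict(total))  # {'a': 6, 'b': 5}
```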
sqlContext.setConf(key, value)
spark.sql.shuffle.partitions
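For example, shuffle parallelism can be lowered from the default of 200 for small datasets. A config sketch (requires a live SparkContext/SQLContext, so it is not runnable standalone):

```python
# Reduce the number of post-shuffle partitions for small data.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
```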
Shuffle
Source: parquet.apache.org
Graph terminology: vertices connected by edges; an edge may be unidirectional or bidirectional (a bidirectional edge is equivalent to 2 uni edges). [Figure: vertexID 469 linked to vertexID 3728 by an edge labeled 22; a vertex with surrounding edges labeled 22, 8, 93, 11, 7, 46, illustrating inDegree = 5]
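In- and out-degree fall straight out of an edge list; a plain-Python sketch with made-up edges (GraphX computes the same thing with `graph.inDegrees` / `graph.outDegrees`):

```python
from collections import Counter

# Directed edges as (src, dst); example data is invented for illustration.
edges = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "B"), ("B", "C"), ("E", "B")]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

print(in_degree["B"])   # 4 (edges arriving from A, C, D, E)
print(out_degree["A"])  # 2
```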
[Figure: weighted-graph examples — edge weights such as 1, 0.5, and 13.65; a vertex with Out-degree = 2; paths of 4 and 5 hops; breadth-first distances 0 through 4]
[Figure: political blogosphere graph — Posts colored Liberal vs. Conservative, densely linked within each community]
[Figure: example graph with vertices A–F and edge list: (A, B), (A, C), (C, D), (B, C), (A, E), (A, F), (E, F), (E, D)]
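The figure's edge list is enough to run a graph algorithm by hand. A BFS sketch in plain Python over those edges, treated as undirected (GraphX's `connectedComponents` reaches the same answer iteratively via message passing):

```python
from collections import defaultdict, deque

# Edge list from the figure, treated as undirected.
edges = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "C"),
         ("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def component(start):
    """Breadth-first search: every vertex reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

print(sorted(component("A")))  # ['A', 'B', 'C', 'D', 'E', 'F']
```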
filter() applied to each micro-batch of editsDStream produces filteredDStream. [Figure: a DStream as an ongoing series of RDDs, one per batch interval]
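The per-batch semantics can be simulated with nested lists (the record fields and channel name here are invented; a real DStream applies the same function to each batch RDD):

```python
# A DStream is a sequence of micro-batches; filter() runs on each one.
edits_dstream = [
    [{"channel": "#en.wikipedia", "delta": 12},
     {"channel": "#fr.wikipedia", "delta": 3}],
    [{"channel": "#en.wikipedia", "delta": -5}],
]

filtered_dstream = [
    [e for e in batch if e["channel"] == "#en.wikipedia"]
    for batch in edits_dstream
]

print([len(b) for b in filtered_dstream])  # [1, 1]
```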
https://issues.apache.org/jira/browse/SPARK-8360
[Figure: a directory of test-runner logs as an RDD of text lines, e.g.:
Running test: pyspark/conf.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/12/15 18:36:12 WARN Utils: Your hostname, …
Running test: pyspark/broadcast.py
…
14/12/15 18:36:30 ERROR Aliens attacked the …]
[Figure: featurization — term frequencies from the logs (Running: 43, test: 67, Spark: 110, aliens: 0, …) become an mllib.linalg.Vector such as [0.0, 0.1, -0.1, 50, …]; LinearRegression learns a weight vector like [0.0, 6.7, -11.0, 0.0, …] and predicts via their dot product (-1.1 in the figure)]
LabeledPoint(features: Vector, label: Double)
LinearRegression.train(data: RDD[LabeledPoint]): LinearRegressionModel
LinearRegressionModel.predict(features: Vector): Double
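The train/predict shape of that API can be mimicked without Spark. A minimal sketch using closed-form least squares on single-feature (x, label) pairs — a stand-in for the MLlib API, not its actual algorithm (which uses SGD/L-BFGS over distributed data):

```python
class LinearRegressionModel:
    """Holds learned parameters; predict() mirrors model.predict(features)."""
    def __init__(self, weight, intercept):
        self.weight, self.intercept = weight, intercept

    def predict(self, x: float) -> float:
        return self.weight * x + self.intercept

def train(data):
    """Ordinary least squares on (feature, label) pairs,
    mimicking LinearRegression.train(data)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var = sum((x - mean_x) ** 2 for x, _ in data)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    w = cov / var
    return LinearRegressionModel(w, mean_y - w * mean_x)

model = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])  # toy data on y = 2x
print(model.predict(4.0))  # 8.0
```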
In the spark.ml Pipelines API, LinearRegression is an Estimator: calling fit on a DataFrame produces a Model, which is a Transformer. An Evaluator then reduces predictions to a score — e.g. RegressionMetrics maps an RDD[(Double, Double)] of (prediction, label) pairs to a Double.
Estimator
Transformer
Evaluator
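The Evaluator step above is just a reduction over (prediction, label) pairs. A sketch of an RMSE metric in plain Python, standing in for RegressionMetrics (which also exposes MSE, MAE, and R²):

```python
import math

def rmse(pred_label_pairs):
    """Evaluator sketch: reduce (prediction, label) pairs
    to a single Double — root mean squared error."""
    n = len(pred_label_pairs)
    mse = sum((p - l) ** 2 for p, l in pred_label_pairs) / n
    return math.sqrt(mse)

score = rmse([(2.5, 3.0), (0.0, -0.5), (2.0, 2.0)])  # ≈ 0.408
print(score)
```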
http://www.meetup.com/Spark-Singapore/events/219039180/