Cloudera, Inc. All rights reserved.
Cloudera World Tokyo 2015
SparkCloudera [email protected]
Spark
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. - http://spark.apache.org
Spark runs up to 100x faster than MapReduce (10x on disk)
100
100x
Spark
HDFS, HBase, Kudu,
YARN
Spark Hadoop MapReduce Search Others Impala
MapReduce
MapReduce:
: Mapper
: Mapper
:
MapReduce
Map Map Map Map Map Map Map Map Map Map Map Map
Reduce Reduce Reduce Reduce
- MapReduce
Figure: an Application reads web server access log lines, each starting with a client IP address (for example "72.165.33.132 - - [04/Nov/").
MapReduce - Map Map
Figure: the Application's access log lines are split across four map Tasks.
MapReduce - Reduce Reduce
72.165.33.132, 1    72.165.33.132, 1    72.165.33.132, 1    72.165.33.145, 1
168.90.228.205, 1   168.90.228.205, 1   192.120.64.138, 1
156.189.222.57, 1   156.189.222.57, 1   164.219.215.208, 1
164.39.210.117, 1   164.39.210.117, 1   164.39.210.118, 1
Task
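The Map and Reduce slides above can be sketched in plain Python, without Hadoop. This is only an illustration of the idea: the map step emits an (IP, 1) pair per log line, the pairs are grouped by key, and the reduce step sums each group. The sample lines, the simplified log format, and the function names are assumptions made for this sketch and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical sample input: each line starts with the client IP address.
LOG_LINES = [
    "72.165.33.132 - - GET /index.html",
    "168.90.228.205 - - GET /about.html",
    "72.165.33.132 - - GET /news.html",
    "72.165.33.145 - - GET /index.html",
    "168.90.228.205 - - GET /index.html",
    "72.165.33.132 - - GET /contact.html",
]


def map_phase(line):
    """Map: emit one (ip, 1) pair for a log line."""
    ip = line.split(" ", 1)[0]
    return (ip, 1)


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the values collected for one key."""
    return (key, sum(values))


def count_hits(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [map_phase(line) for line in lines]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


hits = count_hits(LOG_LINES)
print(hits)
```

In a real cluster the map calls would run in parallel tasks over separate input splits; here they run sequentially in one process.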
Hadoop MapReduce / Spark
Speed
MapReduce
Map Map Map Map Map Map Map Map Map Map Map Map
Reduce Reduce Reduce Reduce
Figure: multiple Map and Reduce stages chained one after another.
Map Reduce Map Reduce
Map Map ReduceXMapX
Map Reduce
Map Reduce
: 18 3
64-128GB RAM
16 cores
50 GB per second
Figure: a job as a directed acyclic graph (DAG) of RDDs A: to F:, connected by map, join, filter, groupBy, and take operations; the legend distinguishes RDDs from cached partitions.
First iteration: 110 sec (MapReduce), 80 sec (Spark)
Second iteration: 110 sec + 110 sec (MapReduce), 80 sec + 1 sec (Spark)
Figure: running time (s) against number of iterations (1, 5, 10, 20, 30), y-axis from 0 to 4000, for MapReduce and Spark.
MapReduce: 110 s / iteration
Spark: first iteration = 80 s, later iterations 1 s
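The arithmetic behind the chart can be worked through with a few lines of Python. This sketch assumes the figures on the slide read as 110 s for every MapReduce iteration, 80 s for the first Spark iteration, and 1 s for each later Spark iteration; the linear model and the function names are assumptions for illustration, not measurements.

```python
# Per-iteration costs as read from the slide (assumed interpretation).
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_LATER_ITERATION_SECONDS = 1


def mapreduce_total(iterations):
    """Every iteration pays the full cost again."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """The first iteration is slow; later ones reuse in-memory data."""
    if iterations == 0:
        return 0
    later = iterations - 1
    return SPARK_FIRST_ITERATION_SECONDS + later * SPARK_LATER_ITERATION_SECONDS


for n in (1, 5, 10, 20, 30):
    print(n, mapreduce_total(n), spark_total(n))
```

Under this model, 30 iterations take 3300 s with MapReduce and 109 s with Spark, which fits the 0 to 4000 range of the chart's y-axis.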
API
Scala, Java, Python

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...
scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>
Word Count

MapReduce (Java):

// Driver: configures and submits the job.
public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

// Mapper: emits (word, 1) for every word in a line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

// Reducer: sums the counts for each word.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

Spark (Python):

sc.textFile(file) \
  .flatMap(lambda s: s.split()) \
  .map(lambda w: (w, 1)) \
  .reduceByKey(lambda v1, v2: v1 + v2) \
  .saveAsTextFile(output)
2-5x
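To see what each step of the Spark word count does, the same three transformations can be imitated in plain Python on a small in-memory list. This is a rough sketch of the semantics only: the function name `word_count` and the sample lines are hypothetical, and nothing here is distributed or uses the Spark API.

```python
def word_count(lines):
    """Imitate flatMap, map, and reduceByKey on a plain list of lines."""
    # flatMap: split every line and flatten the results into one list.
    words = [word for line in lines for word in line.split()]
    # map: pair every word with the count 1.
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values of equal keys with addition.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# Hypothetical sample input.
sample = ["spark hadoop spark", "hadoop spark"]
print(word_count(sample))
```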
RDD
Resilient Distributed Datasets (RDD)
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

RDD: mydata_uc
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.

RDD: mydata_filt
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
> mydata = sc.textFile("purplecow.txt")
> mydata_uc = mydata.map(lambda line: line.upper())
> mydata_filt = \
    mydata_uc.filter(lambda line: \
    line.startswith('I'))
> mydata_filt.count()
3
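The result of the session above can be checked without a Spark installation by applying the same two steps to a plain Python list. This is a stand-in for illustration only: lists replace RDDs, and the variable names simply mirror the ones on the slide.

```python
# The four lines of purplecow.txt, as shown on the slide.
mydata = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# map: upper-case every line.
mydata_uc = [line.upper() for line in mydata]

# filter: keep only lines that start with "I".
mydata_filt = [line for line in mydata_uc if line.startswith("I")]

# count: number of remaining lines.
print(len(mydata_filt))
```

The line beginning with "BUT" is dropped by the filter, which leaves three lines.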
Hue Notebook
Jupyter/IPython Notebook
Spark SQL, MLlib, SparkR
SparkSQL
Spark/JavaSparkSQL
SparkSpark
SQLJavaScala
SQL (. )
MLlib: machine learning (ML) on Spark
MLlibSpark
Streaming
Hadoop MapReduce
Spark Streaming SparkAPI
Figure: live data arriving over time (t=0, t=1, t=2, t=3) is divided into a DStream, a sequence of RDDs: RDD @ t=1, RDD @ t=2, RDD @ t=3.
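The idea in the figure, cutting a live stream into one small batch per time interval, can be sketched in plain Python. This is a simplified illustration and not how Spark Streaming is implemented: the `discretize` function, the one-second interval, and the sample events are all assumptions made for the example. Each batch is labeled with the end of its interval, matching the "RDD @ t=1" style of the figure.

```python
from collections import defaultdict


def discretize(events, batch_seconds=1.0):
    """Group (timestamp, value) events into batches by time interval.

    Returns a list of (interval_end, values) tuples in time order.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        index = int(timestamp // batch_seconds)
        batches[index].append(value)
    return [((index + 1) * batch_seconds, batches[index])
            for index in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]

for interval_end, values in discretize(events):
    print("batch @ t=%g:" % interval_end, values)
```

Each batch can then be processed with the same operations used for a static dataset, which is the point the slide makes about sharing one API.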
Spark Streaming
: 5
MLlib
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Figure: for each batch (batch @ t, batch @ t+1, batch @ t+2), flatMap turns the tweets DStream into the hashTags DStream, and save writes each batch out.
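The per-batch flatMap step can be imitated in plain Python to show what happens to one batch at a time. In this sketch the `get_tags` helper stands in for the `getTags` function referenced on the slide, whose body is not shown there; its regular expression, the `flat_map` helper, and the sample tweets are hypothetical. Saving to HDFS is left out.

```python
import re

# Assumed definition of a hashtag: "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def get_tags(status):
    """Return the hashtags found in one tweet text (hypothetical helper)."""
    return HASHTAG.findall(status)


def flat_map(func, batch):
    """Apply func to every item and flatten the resulting lists."""
    return [out for item in batch for out in func(item)]


# Hypothetical batches of tweets, one list per time interval.
tweet_batches = [
    ["Learning #Spark at #CWT2015", "no tags here"],
    ["#Spark Streaming demo"],
]

# The same transformation is applied independently to every batch.
hash_tag_batches = [flat_map(get_tags, batch) for batch in tweet_batches]
print(hash_tag_batches)
```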
Spark: ICU
http://blog.cloudera.com/blog/2015/07/designing-fraud-detection-architecture-that-works-like-your-brain-does/
Spark / Hadoop
http://itpro.nikkeibp.co.jp/atcl/column/14/072800028/073000001/
MapReduce- Doug Cutting
http://www.cloudera.co.jp/blog/one-platform-initiative.html
Hadoop
No.
- Mike Olson
MapReduce
Hive, Pig Sqoop distcp
SparkMapReduce
Stage 1
Crunch on Spark
Search on Spark
Stage 2
Hive on Spark (beta)
Spark on HBase (beta)
Stage 3
Pig on Spark (alpha)
Sqoop on Spark
Cloudera / Spark
MapReduce / Spark
Hadoop
Spark
Impala
Solr
MapReduceIO
:
Spark Hadoop
Spark Streaming MLlib SparkSQL GraphX
Data-frames SparkR
HDFS, HBase
YARN
Spark Impala MR Others Search
ClouderaSpark
2013 2014 2015 2016
Spark
CDH4.4Spark
Spark on YARN
Spark
Spark
ClouderaSpark
ClouderaCore Spark Spark Streaming
ETL 20
Jaccard
ERP
(OCR)
(LDA)
1010
Apache Spark
Speed Easy Streaming
Cloudera
Apache Spark Spark & Hadoop I(New)
http://cloudera.co.jp/university
Spark
O'Reilly: Advanced Analytics with Spark (written by Clouderans)
Apache Hadoop
Cloudera Developer Blog
Cloudera Quick Start VM Spark http://codezine.jp/article/corner/583
Cloudera Live
cloudera.com/live
CDH
Cloudera
We are Hiring!