Cloudera World Tokyo 2015: A Thorough Introduction to Spark. Cloudera K.K., Tatsuo Kawasaki (川崎 達夫) [email protected]

A Thorough Introduction to Spark #cwt2015


  • Cloudera, Inc. All rights reserved.

    Cloudera World Tokyo 2015

    A Thorough Introduction to Spark, Cloudera K.K. [email protected]

  • Cloudera, Inc. All rights reserved.

    Spark

  • Cloudera, Inc. All rights reserved.

    Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. - http://spark.apache.org

    Spark is up to 100x faster than MapReduce (10x on disk)

  • Cloudera, Inc. All rights reserved.

    100

    100x

  • Cloudera, Inc. All rights reserved.

    Spark

    Processing engines: Spark, Hadoop MapReduce, Impala, Search, others

    Resource management: YARN

    Storage: HDFS, HBase, Kudu

  • Cloudera, Inc. All rights reserved.

    MapReduce

  • Cloudera, Inc. All rights reserved.

    MapReduce:

    : Mapper

    : Mapper

    :

    MapReduce

    [Figure: twelve Map tasks feeding four Reduce tasks]

  • Cloudera, Inc. All rights reserved.

    - MapReduce

    [Figure: an application reads a web access log stored in four blocks. Each log line starts with a client IP address, for example 72.165.33.132 - - [04/Nov/ (truncated on the slide).]

  • Cloudera, Inc. All rights reserved.

    MapReduce - Map Map

    [Figure: the application launches four Map tasks, one for each block of the access log.]

  • Cloudera, Inc. All rights reserved.

    MapReduce - Reduce Reduce

    (72.165.33.132, 1) (72.165.33.132, 1) (72.165.33.132, 1) (72.165.33.145, 1)

    (168.90.228.205, 1) (168.90.228.205, 1) (192.120.64.138, 1)

    (156.189.222.57, 1) (156.189.222.57, 1) (164.219.215.208, 1)

    (164.39.210.117, 1) (164.39.210.117, 1) (164.39.210.118, 1)

    Task
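The Map and Reduce slides above count accesses per client IP address. The following is a minimal single-process sketch of that flow in plain Python, not Hadoop code. The function names and the sample log lines are assumptions for illustration; the slide's log lines are truncated, so everything after the IP address here is made up.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit an (ip, 1) pair for every log line."""
    for line in log_lines:
        ip = line.split(" ", 1)[0]  # the client IP is the first field
        yield ip, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {ip: sum(ones) for ip, ones in groups.items()}


# Hypothetical sample data: only the IP addresses come from the slide.
sample_ips = ["72.165.33.132"] * 3 + ["72.165.33.145"] + ["168.90.228.205"] * 2
sample_log = [
    f'{ip} - - [04/Nov/2015:00:00:00 +0900] "GET / HTTP/1.1" 200 0'
    for ip in sample_ips
]

hits = reduce_phase(shuffle(map_phase(sample_log)))
print(hits)
```

In a real cluster the map calls run as separate tasks on separate blocks and the shuffle moves data across the network; here all three steps run in one process only to show the data flow.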

  • Cloudera, Inc. All rights reserved.

    Hadoop MapReduceSpark

  • Cloudera, Inc. All rights reserved.

    Speed

  • Cloudera, Inc. All rights reserved.

    MapReduce

    [Figure: twelve Map tasks feeding four Reduce tasks]

  • Cloudera, Inc. All rights reserved.

    [Figure: Map and Reduce stages chained one after another]

  • Cloudera, Inc. All rights reserved.

    Map Reduce Map Reduce

    Map Map ReduceXMapX

  • Cloudera, Inc. All rights reserved.

    Map Reduce

    Map Reduce

  • Cloudera, Inc. All rights reserved.

    : 18 3

    64-128GB RAM

    16 cores

    50 GB per second

  • Cloudera, Inc. All rights reserved.

    (DAG)

    [Figure: a graph of RDDs A to F connected by map, groupBy, filter, join and take operations. Legend: boxes are RDDs, shaded boxes are cached partitions.]

  • Cloudera, Inc. All rights reserved.

    First iteration: MapReduce 110 sec, Spark 80 sec

  • Cloudera, Inc. All rights reserved.

    Second iteration:

    MapReduce: +110 sec on top of the first 110 sec

    Spark: +1 sec on top of the first 80 sec

  • Cloudera, Inc. All rights reserved.

    [Figure: running time (s, 0 to 4000) against number of iterations (1, 5, 10, 20, 30) for MapReduce and Spark. MapReduce: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for each later iteration.]
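The chart's figures (110 s per iteration for MapReduce; 80 s for the first Spark iteration and 1 s for each later one) can be turned into a small worked calculation. This is only arithmetic on the numbers quoted on the slide, assuming the cost per iteration stays constant; the function names are illustrative.

```python
def mapreduce_time(iterations, per_iteration=110):
    """Total seconds when every iteration re-reads its input from disk."""
    return iterations * per_iteration


def spark_time(iterations, first=80, later=1):
    """Total seconds when the first iteration loads and caches the data."""
    if iterations <= 0:
        return 0
    return first + (iterations - 1) * later


# Same iteration counts as the x-axis of the chart.
for n in (1, 5, 10, 20, 30):
    print(f"{n:>2} iterations: MapReduce {mapreduce_time(n):>4} s, "
          f"Spark {spark_time(n):>3} s")
```

Under these assumptions, 30 iterations take 3300 s with MapReduce and 109 s with Spark, which is why the gap in the chart widens as the number of iterations grows.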

  • Cloudera, Inc. All rights reserved.

    API

    Scala, Java, Python

    Python:
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

    Scala:
    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

    Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();

  • Cloudera, Inc. All rights reserved.

    percolateur:spark srowen$ ./bin/spark-shell --master local[*]
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...

    scala> val words = sc.textFile("file:/usr/share/dict/words")
    ...
    words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

    scala> words.count
    ...
    res0: Long = 235886

    scala>

  • Cloudera, Inc. All rights reserved.

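    Word Count

    // Each public class goes in its own .java file.
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: configures and submits the job.
    public class WordCount {
      public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
      }
    }

    // Mapper: emits (word, 1) for every word in the line.
    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        for (String word : line.split("\\W+")) {
          if (word.length() > 0) {
            context.write(new Text(word), new IntWritable(1));
          }
        }
      }
    }

    // Reducer: sums the counts for each word.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int wordCount = 0;
        for (IntWritable value : values) {
          wordCount += value.get();
        }
        context.write(key, new IntWritable(wordCount));
      }
    }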

    sc.textFile(file) \
      .flatMap(lambda s: s.split()) \
      .map(lambda w: (w, 1)) \
      .reduceByKey(lambda v1, v2: v1 + v2) \
      .saveAsTextFile(output)

    MapReduce

    2-5x

  • Cloudera, Inc. All rights reserved.

    RDD

    Resilient Distributed Datasets (RDD)

  • Cloudera, Inc. All rights reserved.

    File: purplecow.txt

    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

    RDD: mydata

    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

    RDD: mydata_uc

    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.

    RDD: mydata_filt

    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

    > mydata = sc.textFile("purplecow.txt")

    > mydata_uc = mydata.map(lambda line: line.upper())

    > mydata_filt = \
        mydata_uc.filter(lambda line: \
          line.startswith('I'))

    > mydata_filt.count()
    3
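The same chain can be imitated without a cluster. The sketch below is an analogy in plain Python, not Spark: the built-in map and filter return lazy iterators, much as RDD transformations only describe work, and nothing is computed until the final count consumes them. The poem lines are taken from the slide; the variable count is an illustrative name.

```python
poem = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# "Transformations": both calls return lazy iterators, no line is touched yet.
mydata_uc = map(lambda line: line.upper(), poem)
mydata_filt = filter(lambda line: line.startswith("I"), mydata_uc)

# "Action": consuming the iterator runs the upper-casing and the filter.
count = sum(1 for _ in mydata_filt)
print(count)
```

One difference worth keeping in mind: a Python iterator is exhausted after one pass, whereas an RDD can be recomputed or cached and used again.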

  • Cloudera, Inc. All rights reserved.

    Hue Notebook

  • Cloudera, Inc. All rights reserved.

    Jupyter/IPython Notebook

  • Cloudera, Inc. All rights reserved.

    SparkSQL, MLlib, SparkR

  • Cloudera, Inc. All rights reserved.

    SparkSQL

    Spark/JavaSparkSQL

    SparkSpark

    SQLJavaScala

    SQL (. )

  • Cloudera, Inc. All rights reserved.

    MLlibSpark(ML)

    MLlibSpark

  • Cloudera, Inc. All rights reserved.

    Streaming

  • Cloudera, Inc. All rights reserved.

    Hadoop MapReduce

  • Cloudera, Inc. All rights reserved.

    Spark Streaming SparkAPI

    [Figure: live data arrives continuously and is cut at t=0, t=1, t=2, t=3. The DStream is the resulting sequence of RDDs (RDD @ t=1, RDD @ t=2, RDD @ t=3), each holding the data that arrived in its interval.]

  • Cloudera, Inc. All rights reserved.

    Spark Streaming

    : 5

    MLlib

  • Cloudera, Inc. All rights reserved.

    val tweets = ssc.twitterStream()

    val hashTags = tweets.flatMap (status => getTags(status))

    hashTags.saveAsHadoopFiles("hdfs://...")

    [Figure: the tweets DStream is split into batch @ t, batch @ t+1 and batch @ t+2. flatMap turns each batch into the matching batch of the hashTags DStream, and each of those batches is then saved.]
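The micro-batch idea behind the tweets and hashTags DStreams can be sketched in plain Python. This is an illustration only, not the Spark Streaming API: get_tags is a hypothetical stand-in for the slide's getTags, and micro_batches, flat_map and the sample events are made up for the example.

```python
import re
from collections import defaultdict


def get_tags(status):
    """Hypothetical stand-in for getTags: return the hashtags in a status."""
    return re.findall(r"#\w+", status)


def micro_batches(events, interval):
    """Group (timestamp, status) events into one batch per time window.

    Windows that received no events produce no batch in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, status in events:
        batches[int(timestamp // interval)].append(status)
    return [batches[window] for window in sorted(batches)]


def flat_map(func, batch):
    """Apply func to every element and flatten the resulting lists."""
    return [item for element in batch for item in func(element)]


# Made-up events: (arrival time in seconds, tweet text).
events = [
    (0.2, "learning #spark"),
    (0.7, "#hadoop and #spark"),
    (1.1, "no tags here"),
    (2.5, "#cwt2015"),
]

# The same flatMap runs on each batch, yielding one output batch per input batch.
hash_tags = [flat_map(get_tags, batch) for batch in micro_batches(events, 1.0)]
print(hash_tags)
```

Each input batch yields exactly one output batch, which is why the diagram draws a flatMap and a save step under every batch.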

  • Cloudera, Inc. All rights reserved.

    Spark ICU

  • Cloudera, Inc. All rights reserved.

    http://blog.cloudera.com/blog/2015/07/designing-fraud-detection-architecture-that-works-like-your-brain-does/

  • Cloudera, Inc. All rights reserved.

    SparkHadoop

  • Cloudera, Inc. All rights reserved.

    http://itpro.nikkeibp.co.jp/atcl/column/14/072800028/073000001/

    MapReduce- Doug Cutting

  • Cloudera, Inc. All rights reserved.

    http://www.cloudera.co.jp/blog/one-platform-initiative.html

    Hadoop

    No.

    - Mike Olson

  • Cloudera, Inc. All rights reserved.

    MapReduce

    Hive, Pig Sqoop distcp

  • Cloudera, Inc. All rights reserved.

    SparkMapReduce

    Stage 1: Crunch on Spark, Search on Spark

    Stage 2: Hive on Spark (beta), Spark on HBase (beta)

    Stage 3: Pig on Spark (alpha), Sqoop on Spark

    ClouderaSpark

  • Cloudera, Inc. All rights reserved.

    MapReduceSpark

  • Cloudera, Inc. All rights reserved.

    Hadoop

    Spark

    Impala

    Solr

    MapReduceIO

    :

  • Cloudera, Inc. All rights reserved.

    Spark Hadoop

    Spark components: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR

    Processing engines: Spark, Impala, MR, Search, others

    Resource management: YARN

    Storage: HDFS, HBase

  • Cloudera, Inc. All rights reserved.

    ClouderaSpark

    2013 2014 2015 2016

    Spark

    CDH4.4Spark

    Spark on YARN

    Spark

    Spark

    ClouderaSpark

  • Cloudera, Inc. All rights reserved.

    ClouderaCore Spark Spark Streaming

    ETL 20

    Jaccard

    ERP

    (OCR)

    (LDA)

    1010

  • Cloudera, Inc. All rights reserved.

    Apache Spark

    Speed Easy Streaming

  • Cloudera, Inc. All rights reserved.

    Cloudera

    Apache Spark Spark & Hadoop I(New)

    http://cloudera.co.jp/university

  • Cloudera, Inc. All rights reserved.

    Spark Spark Spark

    O'Reilly Advanced Analytics with Spark (written by Clouderans)

    Apache Hadoop

    Cloudera Developer Blog

    Cloudera Quick Start VM Spark http://codezine.jp/article/corner/583

  • Cloudera, Inc. All rights reserved.

    Cloudera Live

    cloudera.com/live

    CDH

  • Cloudera, Inc. All rights reserved.

    Cloudera

    [email protected]

    We are Hiring!