1 © Cloudera, Inc. All rights reserved. Spark Replaces MapReduce: Integrating Hadoop and Spark ~ The One Platform Initiative. Doug Cutting | Chief Architect | Cloudera @cutting

Spark Replaces MapReduce: Integrating Hadoop and Spark #cwt2015


  • 1

    Spark Replaces MapReduce: Integrating Hadoop and Spark ~ The One Platform Initiative

    Doug Cutting | Chief Architect | Cloudera @cutting

  • 2

    Apache Spark

    Spark

    MapReduce Spark

    One Platform Initiative

    Hadoop

  • 3

    MapReduce ...

    /

    MapReduce

    Hive, Pig, Mahout, Solr, Crunch

  • 4

    ...

    Giraph/GraphLab, Impala (SQL)

    MapReduce

    Hama, Dryad (arbitrary DAG)

  • 5

    Apache Spark

    MapReduce

    (Full Directed Graph expressions)

    :

  • 6

    Apache SparkHadoop

    API

    Scala,Java,Python API

    API

  • 7

    API Scala, Java, Python

    2~5

    Python:
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

    Scala:
    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

    Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();

  • 8

    percolateur:spark srowen$ ./bin/spark-shell --master local[*]
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...

    scala> val words = sc.textFile("file:/usr/share/dict/words")
    ...
    words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

    scala> words.count
    ...
    res0: Long = 235886

    scala>

  • 9

    Spark

    RDD (Resilient Distributed Dataset)
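
The RDD idea can be loosely illustrated without a cluster. The sketch below is a toy, single-process model in plain Python, not Spark's API or implementation: `ToyRDD`, `from_lines`, and the sample log lines are all hypothetical names and data invented for illustration. It assumes only the general behavior Spark documents for RDDs: transformations such as `filter` are recorded lazily, and an action such as `count` triggers evaluation, so a result can be recomputed from its recorded lineage.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable[str]], lineage: List[str]):
        self._source = source    # re-callable, so results can be recomputed
        self.lineage = lineage   # record of the steps that define this dataset

    @staticmethod
    def from_lines(lines: List[str]) -> "ToyRDD":
        return ToyRDD(lambda: iter(lines), ["from_lines"])

    def filter(self, pred: Callable[[str], bool]) -> "ToyRDD":
        # Transformation: nothing is evaluated here, only the lineage grows.
        parent = self._source
        return ToyRDD(lambda: (x for x in parent() if pred(x)),
                      self.lineage + ["filter"])

    def count(self) -> int:
        # Action: evaluates the whole chain starting from the original source.
        return sum(1 for _ in self._source())


# Made-up sample data standing in for sc.textFile(...)
log = ToyRDD.from_lines(["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"])
errors = log.filter(lambda s: "ERROR" in s)

print(errors.lineage)  # ['from_lines', 'filter']
print(errors.count())  # 2
print(errors.count())  # 2 again: recomputed from the lineage, not cached
```

A real RDD is additionally partitioned across machines, which this toy omits entirely; it only mirrors the lazy filter-then-count shape of the API examples on slide 7.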

  • 10

    Spark Hadoop

    Spark Streaming, MLlib, Spark SQL, GraphX, DataFrames, SparkR

    HDFS, HBase

    YARN

    Spark, Impala, MR, Others, Search

  • 11

    Cloudera Spark

    2013 2014 2015 2016

    Spark

    CDH 4.4 Spark

    YARN Spark

    Spark

    Spark

    Cloudera O'Reilly Spark

  • 12

    Cloudera Spark ClouderaSpark Hadoop SparkCloudera

    Cloudera Spark Hadoop

    Cloudera 25

    Spark

  • 13

    Cloudera Spark

    Cloudera: 67%

    Intel: 17%

    Hortonworks: 17%

    Hadoop Spark *

    IBM MapR

    Hadoop

    Cloudera: 370; Hortonworks: 4; IBM: 12; MapR: 1; Intel: 400

  • 14

    Cloudera

    Spark 150 800 Spark

  • 15

    Cloudera Core Spark Spark Streaming

    ETL 20

    Jaccard

    ERP

    (LDA)

    1010

  • 16

    Spark MapReduce Hadoop

  • 17

    Spark MapReduce

    1

    Crunch on Spark
    Search on Spark

    2

    Hive on Spark (beta)
    Spark on HBase (beta)

    3

    Pig on Spark (alpha)
    Sqoop on Spark

    Cloudera Spark

  • 18

    Spark Hadoop One Platform Initiative

    Hadoop

    Hadoop

    1

    80%

  • 19

    Hadoop Spark

    Spark

    Impala

    Low-Latency

    Solr

    MapReduce I/O

    :

  • 20

    Cloudera

    Hadoop 1

    Cloudera

  • 21

    Spark

    O'Reilly "Advanced Analytics with Spark" eBook (Cloudera)

    Cloudera Developer Blog

    cloudera.com/spark

    Cloudera Spark Training

    Cloudera Live Spark Tutorial

  • 22

    @cutting