Latest Trends in Spark, the Big Data Analytics Platform, and Its Use: Spark SUMMIT EAST 2015 – 201531718 [email protected]

150521_Latest Trends in Spark, the Big Data Analytics Platform, and Its Use -Spark SUMMIT EAST 2015-


  • Spark- Spark SUMMIT EAST 2015

    201531718 [email protected]

  • Spark Summit EAST 2015 2

    01. Spark

    Spark

    Spark

    Apache Spark

  • Spark Summit EAST 2015 3

    Spark

    Spark: UC Berkeley AMPLab. OSS. Databricks

    Ion Stoica

    hadoop

    spark

  • Spark Summit EAST 2015 4

    Spark

    Ver.

    2009 - UC Berkeley AMPLab.

    2010 - OSS, Apache

    2012/10  0.6.0  Java API
    2013/2   0.7.0  Python API
    2013/9   0.8.0  UI, MLlib
    2014/2   0.9.0  Scala 2.10, GraphX
    2014/5   1.0.0  Spark SQL, MLlib
    2014/11  1.1.0
    2014/12  1.2.0  Spark Streaming HA
    2015/3   1.3.0  DataFrames API
    2015/4   1.3.1

  • Spark Summit EAST 2015 5

    Spark

    [Figure: Hadoop and Spark stacks. Labels: HDFS; MapReduce; Spark; Spark SQL, MLlib, Hive, Sqoop; YARN or Mesos]

  • Spark Summit EAST 2015 6

    Spark

    [Figure: Hadoop MapReduce and Spark processing over HDFS. Labels: M, R R R, S S S, HDFS]

  • Spark Summit EAST 2015 7

    Spark

    Hadoop

    ASF (Apache Software Foundation) PJ: HDFS, MapReduce, SQL

    Scala, Python

  • Spark Summit EAST 2015 8

    2015/03/18 - 2015/03/19 (2 days)

    3/18: Keynote, 3 tracks, 27 sessions - Developers, Applications, Data Science

    3/19: Workshop

    The Sheraton, New York Spark Summit East Spark Summit 2015 20157 Spark Summit 20132014

  • Spark Summit EAST 2015 9

  • Spark Summit EAST 2015 10

    Sponsors: Platinum, Gold, Silver

  • Spark Summit EAST 2015 11

    Spark in 2014

    http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

  • Spark Summit EAST 2015 12

    Spark in 2014

    Matei

    Contributors per Month to Spark

    http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

  • Spark Summit EAST 2015 13

    Spark Summit Keynote

  • Spark Summit EAST 2015 14

    2015

    1. Data Science

    RDD; 2015/3: Spark 1.3; Machine Learning Pipelines; R interface; 2015/6: Spark 1.4, SparkR

    2015

    2. Platform Interfaces

    Plug in data sources and algorithms

    Data Sources: MySQL, Hive, HBase, SQL

    Goal: unified engine across data sources

    New Direction for Spark in 2015 - Matei, CTO, Databricks

  • Spark Summit EAST 2015 15

    New Direction for Spark in 2015 - Matei, CTO, Databricks

    Spark

  • Spark Summit EAST 2015 16

    Harnessing the Power of Spark with Databricks Cloud

    Ion Stoica (CEO at Databricks): Databricks Cloud

    Databricks Notebook: Scala, Python, SQL. AWS: Spark + Cluster Manager

    Notebook

  • Spark Summit EAST 2015 17

    Harnessing the Power of Spark with Databricks Cloud

    Databricks Cloud

  • Spark Summit EAST 2015 18

    Developers Track

    Developers Track: Spark

    SQL, Hadoop, DB

    Java, Python, R

  • Spark Summit EAST 2015 19

    Developers Track

    Beyond SQL: Spark SQL Abstractions For The Common Spark Job - Michael Armbrust (Databricks) Hadoop

    API

    import: JSON, Hive, MySQL, HDFS, S3

    export: dBase, Cassandra, HBase, Elasticsearch, Amazon Redshift

  • Spark Summit EAST 2015 20

    Developers Track

    Spark User Concurrency and Context/RDD Sharing at Production Scale - Farzad Aref (Zoomdata)

    Zoomdata: ex. S3, HDFS, RDB, Spark

    Spark, Zoomdata

    HDFS, Spark

  • Spark Summit EAST 2015 21

    Developers Track

    Power Hive with Spark (Hive on Spark) - Chao Sun (Cloudera), Marcelo Vanzin (Cloudera)

    Hive: SQL, Hadoop map/reduce

    Hive, Spark

    Hive: HIVE-7292

    Hive 1.1: Hive on Spark (HoS)

    [Figure: Hive on Spark (HoS) architecture. Labels: Hive, HoS, Spark, YARN, Mesos, HDFS]

  • Spark Summit EAST 2015 22

    Data Science Track

    Data Science Track

    2014 /

    MLlib, GraphX, Spark Streaming

    Spark

    SparkR: R only. Deep Learning: GPU, Spark

    YouTube

  • Spark Summit EAST 2015 23

    Spark ML Pipelines

    Tokenizer / HashingTF / TF-IDF

    lr

    ML Pipelines

    Pipelines
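
The stages named on this slide (Tokenizer, HashingTF, TF-IDF) can be sketched in plain Python to show how a pipeline chains transformations. This is a minimal illustration under stated assumptions, not Spark's ML Pipelines API: the function names, the `num_features` default, the crc32 hash and the smoothed IDF formula are all choices made for this sketch, and the final classifier stage (`lr`) is left out.

```python
import math
import zlib


def tokenize(text):
    """Tokenizer stage: lowercase the text and split on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=16):
    """HashingTF stage: hash each token into a bucket and count occurrences.

    crc32 keeps bucket indices stable across runs; Python's built-in
    hash() for str is randomized per process, so it is avoided here.
    """
    vec = [0.0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec


def fit_idf(tf_vectors):
    """IDF stage (fit): smoothed weight log((N + 1) / (df + 1)) per bucket."""
    n = len(tf_vectors)
    dims = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(dims)]
    return [math.log((n + 1) / (df[j] + 1)) for j in range(dims)]


def transform_idf(tf_vector, idf):
    """IDF stage (transform): rescale term frequencies by the fitted weights."""
    return [tf * w for tf, w in zip(tf_vector, idf)]


def run_pipeline(docs, num_features=16):
    """Chain the stages: tokenize -> hashing TF -> fit IDF -> TF-IDF vectors."""
    tf = [hashing_tf(tokenize(d), num_features) for d in docs]
    idf = fit_idf(tf)
    return [transform_idf(v, idf) for v in tf]
```

With this weighting, a term that occurs in every document gets weight 0, so only terms that distinguish documents contribute to the feature vector handed to a later classifier stage.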

  • Spark Summit EAST 2015 24

    Spark ML Pipelines

    Practical Machine Learning Pipelines with MLlib - Joseph Bradley (Databricks)

    ML Pipelines

    Spark 1.2: Cross Validation

    Future Plan / Roadmap: Spark 1.3
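
Cross validation, mentioned on this slide, splits the data into k folds and scores a model on each held-out fold in turn. The following is a plain-Python sketch of that idea only; `k_fold_splits`, `cross_validate` and the `fit_and_score` callback are hypothetical names for illustration and do not reflect Spark's own cross-validation API.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train, val) over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)
```

In practice `fit_and_score` would train on the `train` indices and return a metric computed on `val`; the parameter setting with the best average score is then kept.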

  • Spark Summit EAST 2015 25

    Spark MLlib

    K-means, Logistic regression

    Scikit-learn / R

    Scala, Python, Java Spark

    Spark Summit 2014

    https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html

  • Spark Summit EAST 2015 26

    Spark MLlib

    Un-collaborative filtering: Giving the right recommendations when your users aren't helping you - Leah McGuire (PhD, Salesforce)

    MLlib

  • Spark Summit EAST 2015 27

    Spark Streaming

    Scala, Java. Spark 1.3: Python

    Socket, Flume, Kafka, Twitter, Fluentd

    Discretized Stream = RDD

    RDD: 500ms ~ 30s. 10ms: Flume / Storm

    CPU /

    DMM2Spark https://prezi.com/iz1d_sefm1q9/dmmcom-dmm2-spark/
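
The "Discretized Stream = RDD" idea above can be illustrated by cutting a stream of timestamped events into fixed-width micro-batches and running the same batch computation on each one. This is a plain-Python sketch, not the Spark Streaming API: `discretize`, `word_counts_per_batch` and the millisecond timestamps are assumptions made for the example.

```python
from collections import defaultdict


def discretize(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into fixed-width batches.

    Returns a list of (batch_start_ms, [values]) sorted by batch start.
    Each list of values plays the role of one small batch dataset.
    """
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    batches = defaultdict(list)
    for ts, value in events:
        batch_start = (ts // batch_interval_ms) * batch_interval_ms
        batches[batch_start].append(value)
    return sorted(batches.items())


def word_counts_per_batch(events, batch_interval_ms):
    """Apply the same batch computation (a word count) to every micro-batch."""
    result = []
    for batch_start, values in discretize(events, batch_interval_ms):
        counts = {}
        for word in values:
            counts[word] = counts.get(word, 0) + 1
        result.append((batch_start, counts))
    return result
```

The batch interval sets a floor on latency, which is why the slide quotes intervals in the 500ms to 30s range. Unlike a real streaming engine, this sketch simply skips intervals that received no events.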

  • Spark Summit EAST 2015 28

    Spark Streaming

    Streaming machine learning in Spark - Jeremy Freeman (HHMI Janelia Research Center)

    Neuroscientist using computation to understand the brain

    MLlib, Spark Streaming

    K-means Streaming, Streaming Linear Regression, Time Series analysis

    Spark
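
Streaming k-means, listed above, updates cluster centers one mini-batch at a time instead of refitting on all the data. Below is a plain-Python sketch of one common formulation, in which each center is a weighted average of its previous position and the new points assigned to it. The `decay` parameter, the function names and the exact weighting are assumptions for illustration and are not taken from the talk or from MLlib's implementation.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def update_centers(centers, weights, batch, decay=1.0):
    """One streaming k-means step on a mini-batch.

    centers: list of coordinate lists; weights: points absorbed so far per center.
    decay=1.0 keeps all history; decay=0.0 uses only the newest batch.
    Returns new (centers, weights) without mutating the inputs.
    """
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    # Assign every point in the batch to its nearest current center.
    for point in batch:
        i = nearest(point, centers)
        counts[i] += 1
        for d in range(dims):
            sums[i][d] += point[d]
    new_centers, new_weights = [], []
    for c, w, s, m in zip(centers, weights, sums, counts):
        old = w * decay  # discounted weight of the history
        total = old + m
        if total == 0:
            new_centers.append(list(c))  # nothing to average, keep the center
        else:
            new_centers.append([(c[d] * old + s[d]) / total for d in range(dims)])
        new_weights.append(total)
    return new_centers, new_weights
```

Calling `update_centers` once per micro-batch and feeding its output back in as the next input gives a model that tracks drifting data; lowering `decay` makes it forget old batches faster.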

  • Spark Summit EAST 2015 29

    GraphX

    SNS, Network

    GraphX Advent Calendar 2014 http://www.adventar.org/calendars/491

    GraphX

  • Spark Summit EAST 2015 30

    Workshop

    Data Science Workshop

    Databricks Cloud

    Kaggle

    Hands On: RecSys2015

    Spark GUI: Databricks Cloud

    Advanced Developer Workshop

  • Spark Summit EAST 2015 31

    Workshop

    Workshop: Databricks Cloud

    GUI, VM. SQL, Python

    Developers Workshop

    Java, SQL, Scala, Python, R 1

    Spark Developers Wireless LAN2

    lan

  • Spark Summit EAST 2015 32

    Meetup

    Meetup: DataDriven (2015/03/17)

    NYC IT CEO, CTO. Bloomberg. YouTube

    NYC Data Science (2015/03/18): Spark DataFrames and ML Pipelines for Large-Scale Data Science, Databricks

    PyData NYC (2015/03/20): Python + Data Science 5(5/22)

    http://pydatatokyo.connpass.com/

  • Spark Summit EAST 2015 33

    Data Driven NYC #35

    #35

    SwiftKey: SwiftKey, CTO

    InfluxDB: Paul Dix @InfluxDB, CEO. Go DB

    Spark: Ion Stoica @Databricks, CEO

    SwiftKey

    1. Datadriven: http://datadrivennyc.com/

    2. Datadriven YouTube: https://www.youtube.com/channel/UCQID78IY6EOojr5RUdD47MQ

  • Spark Summit EAST 2015 34

    PyData NYC

    Project Jupyter for Data Science Matplotlib and the IPython notebook shapeshifting for your data A couple of tips for winning data science competitions

    Jupyter: Julia + Python + R

    notebook notebook

    notebook Notebook

    1. PyData: http://datadrivennyc.com/

    2. PyData YouTube: https://www.youtube.com/channel/UCQID78IY6EOojr5RUdD47MQ

  • Spark Summit EAST 2015 35

    Spark

    Spark Summit Hadoop

    Workshop HadoopHadoop

    MLlib / Spark Streaming / GraphX / SparkR

    MTG Notebook R, Python, (Julia