Spark 2.0: What's Next (Hadoop / Spark Conference Japan 2016 keynote slides)

Spark 2.0: What’s Next

Reynold Xin @rxin Spark Conference Japan Feb 8, 2016

Please put up your hand if you know what Spark is.

Put up your hand if you think your significant other knows what Spark is. (girlfriend, boyfriend, wife, husband, …)

This Talk

•  What is Spark?
•  How are people using it?
•  Spark 2.0

Spark: an open source data processing engine built around speed, ease of use, and sophisticated analytics

About Databricks

Founded by the creators of Spark, and the team behind much of Spark's development

Cloud enterprise Spark platform
•  Cluster management, interactive notebooks, dashboards, production jobs, data governance, security, …

2015: Great Year for Spark

Most active open source project in data (1000+ contributors)

New language: R

Widespread industry support & adoption

Meetup Groups: December 2014

source: meetup.com

Meetup Groups: December 2015

source: meetup.com

Tokyo Spark Meetup

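IBM announces its commitment to advancing Apache Spark, saying it has the potential to become the most important open source project of the next decade

Spark or Hadoop: which is the best big data framework?

Survey results show Apache Spark is surging in popularity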

How are people using Spark?

Diverse Runtime Environments

How respondents are running Spark: 51% on a public cloud

Most common Spark deployment environments (cluster managers)
•  Standalone mode: 48%
•  YARN: 40%
•  Mesos: 11%

Industries Using Spark

Pie chart segments: Software (SaaS, Web, Mobile); Consulting (IT); Retail, e-Commerce; Advertising, Marketing, PR; Banking, Finance; Health, Medical, Pharmacy, Biotech; Carriers, Telecommunications; Education; Computers, Hardware; Other

Segment shares (the chart layout that matched them to segments is lost): 29.4%, 17.7%, 14.0%, 9.6%, 6.7%, 6.5%, 4.4%, 4.4%, 3.9%, 3.5%

Top Applications

•  Business Intelligence: 68%
•  Data Warehousing: 52%
•  Recommendation: 44%
•  Log Processing: 40%
•  User-Facing Services: 36%
•  Fraud Detection / Security: 29%

Are we done?

No. Development is faster than ever!

2010: started @ Berkeley

2012: research paper

2013: Databricks started & donated to ASF

2014: Spark 1.0 & libraries (SQL, ML, GraphX)

2015: DataFrames, Tungsten, ML Pipelines

2016: Spark 2.0

Spark stack diagram

SQL, Streaming, MLlib, GraphX

Spark Core (RDD)

Spark stack diagram (a different take)

Frontend (user-facing APIs)

Backend (execution)

Spark stack diagram (a different take)

Frontend (RDD, DataFrame, ML pipelines, …)

Backend (scheduler, shuffle, operators, …)

Spark 2.0

Frontend: API Foundation
•  Streaming
•  DataFrame/Dataset
•  SQL

Backend: 10X Performance
•  Whole-stage Codegen
•  Vectorization
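The "whole-stage codegen" item on this slide can be pictured with a toy example. This is a minimal sketch in plain Python, not Spark's implementation: the function names (`scan`, `filter_op`, `project_op`, `run_iterator_pipeline`, `run_fused`) and the query are made up for illustration. It contrasts a pipeline of chained per-row operators with a single fused loop that computes the same result, which is the general shape of code a whole-stage code generator aims to produce.

```python
# Toy query: SELECT sum(x * 2) FROM range(n) WHERE x % 3 = 0
# All names here are hypothetical; this is not Spark code.

def scan(n):
    # Source operator: produces one row at a time.
    for x in range(n):
        yield x

def filter_op(rows, predicate):
    # Filter operator: pulls rows from its child, one call per row.
    for row in rows:
        if predicate(row):
            yield row

def project_op(rows, fn):
    # Projection operator: transforms each row it pulls from its child.
    for row in rows:
        yield fn(row)

def run_iterator_pipeline(n):
    # Operator-at-a-time style: every row crosses several generator
    # boundaries and two lambda calls before it reaches the sum.
    rows = scan(n)
    rows = filter_op(rows, lambda x: x % 3 == 0)
    rows = project_op(rows, lambda x: x * 2)
    return sum(rows)

def run_fused(n):
    # Fused style: the same query collapsed into one tight loop,
    # with no per-row calls between operators.
    total = 0
    for x in range(n):
        if x % 3 == 0:
            total += x * 2
    return total
```

Both functions return the same answer; the fused version simply does less bookkeeping per row.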

Guiding Principles for API Foundation

1.  Simple yet expressive

2.  (Semantics) well-defined

3.  Sufficiently abstracted to allow optimized backends

RDD

Java/Scala frontend

JVM backend

DataFrame

DataFrame frontend

Logical Plan

Catalyst optimizer

Physical execution

DataFrame

Frontends: Python, Java/Scala, SQL

DataFrame Logical Plan

Backends: JVM, Tungsten, …
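The frontend → logical plan → optimizer → physical execution flow in these diagrams can be sketched in a few lines. This is a toy model in plain Python under stated assumptions, not Catalyst: the classes `Scan` and `Filter`, the single rewrite rule in `optimize`, and the list-based `execute` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Hypothetical logical plan nodes (not Spark classes).
@dataclass(frozen=True)
class Scan:
    rows: tuple  # in-memory rows standing in for a data source

@dataclass(frozen=True)
class Filter:
    child: "Plan"
    predicate: Callable[[dict], bool]

Plan = Union[Scan, Filter]

def optimize(plan):
    """Toy optimizer rule: collapse Filter(Filter(child)) into one Filter."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            outer, inner = plan.predicate, child.predicate
            return Filter(child.child, lambda r: inner(r) and outer(r))
        return Filter(child, plan.predicate)
    return plan

def execute(plan):
    """Toy physical execution: evaluate the plan over Python lists."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    return [r for r in execute(plan.child) if plan.predicate(r)]

# A "frontend" would build this plan from user code; here it is built by hand.
rows = (
    {"name": "a", "age": 31},
    {"name": "b", "age": 17},
    {"name": "c", "age": 45},
)
plan = Filter(Filter(Scan(rows), lambda r: r["age"] > 18), lambda r: r["age"] < 40)
optimized = optimize(plan)
```

Because the user only describes what to compute, the optimizer is free to rewrite the plan (here, two filters become one pass over the data) without changing the result.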

API Foundations in Spark 2.0

1.  Streaming DataFrames

2.  Maturing and merging DataFrame and Dataset

3.  ANSI SQL
•  natural join, subquery, view support
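For readers unfamiliar with the three SQL features named above, here is a small illustration. It is a sketch that uses Python's built-in `sqlite3` module as a stand-in SQL engine, not Spark SQL, and the tables (`users`, `orders`) and view (`user_orders`) are invented for the example.

```python
import sqlite3

# sqlite3 is only a stand-in engine here; table and view names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (2, 5.0);

    -- View: a named, reusable query.
    -- Natural join: joins on the columns the two tables share (user_id).
    CREATE VIEW user_orders AS
        SELECT name, amount FROM users NATURAL JOIN orders;
""")

# Subquery: keep only orders above the average order amount.
rows = conn.execute("""
    SELECT name, SUM(amount) AS total
    FROM user_orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    GROUP BY name
    ORDER BY name
""").fetchall()
```

With this data the average order is about 13.33, so only alice's order of 25.0 survives the filter.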

Challenges with Stream Processing

Stream processing is hard to reason about
•  Output over time
•  Late data
•  Failures
•  Distribution

And all this has to work across complex operations
•  Windows, sessions, aggregation, etc.
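To make "late data" and "windows" concrete, here is a minimal sketch in plain Python. It is not Spark code and not something described in the talk: the lateness policy (drop an event once its window ended more than `ALLOWED_LATENESS` seconds before the newest event time seen) is one common approach assumed here for illustration, and all names and constants are hypothetical.

```python
from collections import defaultdict

WINDOW = 10           # tumbling window length in seconds (illustrative)
ALLOWED_LATENESS = 5  # how far behind the newest event time we still accept

def windowed_counts(events):
    """Count events per (window_start, key).

    events: iterable of (event_time, key) pairs in ARRIVAL order,
    which may differ from event-time order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW
        window_end = window_start + WINDOW
        if window_end <= watermark:
            # Too late: this window is already considered final.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped

# Event time 8 arrives after 12 (out of order, still accepted);
# event time 9 arrives after 27 (too late, dropped).
events = [(1, "a"), (12, "a"), (8, "a"), (27, "a"), (9, "a")]
counts, dropped = windowed_counts(events)
```

Even this toy version has to decide when a window's output is final, which is the "output over time" problem from the slide; failures and distribution are not modeled at all.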

Next-gen Streaming with DataFrames

1.  Easy-to-use APIs (batch, streaming, and interactive)

2.  Well-defined semantics
•  Out-of-order data
•  Failures
•  Sources/sinks with exactly-once semantics

3.  Leverages Tungsten backend

More details in the next few weeks

Spark is already pretty fast.

Can we make it 10X faster in 2.0?

Teaser: SQL/DataFrame Performance

High throughput
•  Spark 1.6: 13.95 million rows/sec
•  Spark 2.0 (work in progress): 125 million rows/sec

Come to my talk this afternoon to learn more
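As a quick check on how these two throughput figures relate to the 10X goal, the arithmetic can be worked out as follows. The only inputs are the two numbers on the slide; the variable names are illustrative.

```python
# Throughput figures from the slide, in rows per second.
spark_1_6_rows_per_sec = 13.95e6
spark_2_0_rows_per_sec = 125e6

# Relative speedup of the work-in-progress 2.0 build over 1.6.
speedup = spark_2_0_rows_per_sec / spark_1_6_rows_per_sec

# Average time spent per row, in nanoseconds.
ns_per_row_1_6 = 1e9 / spark_1_6_rows_per_sec
ns_per_row_2_0 = 1e9 / spark_2_0_rows_per_sec

print(f"speedup: {speedup:.2f}x")
print(f"Spark 1.6: {ns_per_row_1_6:.1f} ns/row, Spark 2.0 WIP: {ns_per_row_2_0:.1f} ns/row")
```

That works out to roughly a 9x speedup, from about 72 ns per row down to 8 ns per row, so the work-in-progress number is close to the 10X target but not quite there.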

Python, SQL, R, Streaming, Advanced Analytics

DataFrame (& Dataset)

Tungsten Execution

Spark 2.0 Release Schedule

Under active development on GitHub
•  March – April: code freeze
•  April – May: official release

Thank you! @rxin
