An Explanation of "Discretized Streams: Fault-Tolerant Streaming Computation at Scale"

Discretized Streams: Fault-Tolerant Streaming Computation at Scale

July 4, 2014 Katsunori Kanda

About the paper

• SOSP’13

• Author: Matei Zaharia et al. (UCB)

• CTO@databricks

• Assistant Professor@MIT

• Contributor of Apache Spark

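Overview

• Proposes a new stream processing model (D-Streams)

• Feature 1: parallel recovery

• Feature 2: throughput scales (100 nodes)

• Feature 3: latency ranges from a few seconds down to a few hundred milliseconds

• Implemented as Spark Streaming

What is Apache Spark?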

Spark Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)

• Collections of objects that can be stored in memory or disk across a cluster

• Parallel functional transformations (map, filter, …)

• Automatically rebuilt on failure
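The "automatically rebuilt on failure" property can be illustrated with a minimal sketch in plain Python. It assumes a single partition and a hypothetical `ToyRDD` class that remembers its parent and the transformation that produced it. This is not Spark's API, only an illustration of lineage-based recomputation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyRDD:
    """Hypothetical stand-in for an RDD: data plus the recipe that built it."""
    parent: Optional["ToyRDD"]
    transform: Optional[Callable[[list], list]]
    data: Optional[list]  # None means the in-memory copy is missing

    def map(self, f):
        return ToyRDD(self, lambda xs: [f(x) for x in xs], None)

    def filter(self, pred):
        return ToyRDD(self, lambda xs: [x for x in xs if pred(x)], None)

    def collect(self):
        if self.data is None:
            # Rebuild from lineage: re-apply the transformation to the parent.
            self.data = self.transform(self.parent.collect())
        return self.data


source = ToyRDD(None, None, [1, 2, 3, 4, 5, 6])
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.data = None          # simulate losing the cached partition
rebuilt = evens_squared.collect()  # recomputed from the parent, not restored from a copy
```

Because the transformations are deterministic, the rebuilt data is identical to the lost data, which is what lets the system skip replicating every intermediate result.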

2. Goals and Background

• Example target applications

• Site activity statistics: 10^6 events/s

• Cluster monitoring

• Spam detection

• 0.5-2 sec latency (not a target: high-frequency trading)

2.1 Goals

1. Scalability to hundreds of nodes

2. Minimal cost beyond base processing

3. Second-scale latency

4. Second-scale recovery from faults and stragglers

2.2 Previous Processing Models

• continuous operator model

• Computation is divided among long-lived, stateful operators. Each operator updates its state as input records arrive.

2.2 Previous Processing Models: Replication

• Two systems receive the same input at the same time. The two systems must be kept in sync (databases are a typical example).

2.2 Previous Processing Models: Upstream Backup

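• Each node keeps a copy of the messages it has sent since some checkpoint

• When a node fails, a standby node rebuilds the failed node's state. This rebuild is expensive.

• Examples: MapReduce Online, Storm

Handle stragglers

• Existing models cannot cope with stragglers

• replication: a straggler slows down the whole system (because the replicas must synchronize)

• upstream backup: a straggler has to be treated as a failure, but recovery is expensive (as noted above)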

3. Discretized Streams (D-Streams)

• D-Streams structure the computation as tasks that are

• short

• stateless

• deterministic

3.1. Computation Model

• A series of deterministic batch computations on short time intervals
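As a rough illustration of this model, the sketch below groups timestamped records into fixed-length intervals and runs the same deterministic job on each batch. The helper names `discretize` and `run_batches` and the 1-second interval are assumptions made for the example, not part of the paper's API.

```python
from collections import defaultdict


def discretize(records, interval):
    """Group (timestamp, value) records into batches keyed by interval start."""
    batches = defaultdict(list)
    for ts, value in records:
        start = int(ts // interval) * interval
        batches[start].append(value)
    return dict(sorted(batches.items()))


def run_batches(batches, job):
    """Apply the same deterministic batch job to every interval."""
    return {start: job(values) for start, values in batches.items()}


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]
batches = discretize(events, interval=1)
counts = run_batches(batches, len)  # number of events per 1-second interval
```

Since each interval's result depends only on that interval's input (and possibly earlier results), any batch can be recomputed independently after a failure.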

3. Computation Model: Recovery from faults

• Lost data is recomputed one partition at a time

• To avoid unbounded recomputation, RDD state is saved periodically through asynchronous replication

• Recomputation can run in parallel
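The recovery steps above can be sketched as follows, assuming a hypothetical running-count state per partition, a checkpoint taken a few timesteps ago, and the input batches received since then. Threads stand in for the cluster nodes that would share the recomputation work; all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical state: a running count per partition, saved at the last checkpoint.
checkpoint = {0: 10, 1: 20, 2: 30}

# Input that arrived after the checkpoint: one dict per timestep, keyed by partition.
inputs_since_checkpoint = [
    {0: [1, 1], 1: [2], 2: []},
    {0: [3], 1: [], 2: [4, 4]},
]


def recompute_partition(pid):
    """Rebuild one lost partition from the checkpoint plus the later input."""
    state = checkpoint[pid]
    for batch in inputs_since_checkpoint:
        state += sum(batch[pid])
    return pid, state


lost_partitions = [0, 2]  # partitions that lived on the failed node
with ThreadPoolExecutor(max_workers=len(lost_partitions)) as pool:
    # Each lost partition is an independent task, so they can run side by side.
    recovered = dict(pool.map(recompute_partition, lost_partitions))
```

The checkpoint bounds how far back the recomputation has to go, and the per-partition independence is what allows the work to be spread across many nodes.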

3.2. Timing Considerations

• Handling data that arrives out of order

• The system waits for a slack time before starting each batch

• It also provides a way to handle late records at the application level
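One way to picture the slack-time rule is the sketch below. It assumes each record carries both an event time and an arrival time, and that a record is accepted into its interval if it arrives before the interval's end plus the slack. The function `assign_batches` and the sample values are hypothetical.

```python
def assign_batches(records, interval, slack):
    """Split (event_time, arrival_time, value) records into batches and late records."""
    batches, late = {}, []
    for event_time, arrival_time, value in records:
        start = int(event_time // interval) * interval
        deadline = start + interval + slack  # the batch starts at this time
        if arrival_time <= deadline:
            batches.setdefault(start, []).append(value)
        else:
            # Too late for its batch: hand it to application-level handling.
            late.append((event_time, value))
    return batches, late


records = [
    (0.4, 0.5, "on time"),
    (0.9, 1.3, "within slack"),
    (0.8, 2.5, "too late"),
    (1.2, 1.4, "on time"),
]
batches, late = assign_batches(records, interval=1, slack=0.5)
```

A larger slack admits more out-of-order records at the cost of added latency, so the value is a trade-off the application has to choose.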

3.3. D-Stream API(1/3)

• Transformations: create a new D-Stream

• pairs = words.map(w => (w, 1))

• counts = pairs.reduceByKey((a, b) => a + b)

Stateless API
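To make the two stateless transformations concrete, here is a small Python sketch that models a D-Stream as a list of batches and applies the map and reduce-by-key steps to each batch independently. The helper `reduce_by_key` is a hypothetical local function, not the Spark Streaming call.

```python
from functools import reduce
from itertools import groupby


def reduce_by_key(pairs, f):
    """Combine the values of each key within one batch using f."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: reduce(f, (value for _, value in group))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


# A stream modeled as a list of batches, one per time interval.
word_batches = [["a", "b", "a"], ["b", "c"]]

pairs_stream = [[(w, 1) for w in batch] for batch in word_batches]
counts_stream = [reduce_by_key(batch, lambda a, b: a + b) for batch in pairs_stream]
```

Each output batch depends only on the matching input batch, which is what "stateless" means here.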

3.3. D-Stream API(2/3)

Stateful API

Incremental aggregation:

ex. pairs.reduceByWindow(“5s”, (a,b) => a + b)

pairs.reduceByWindow(“5s”, (a,b) => a + b, (a,b) => a - b)
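The difference between the two forms can be sketched in Python. The example assumes per-interval sums are already available and that the window is measured in intervals rather than seconds. The first function re-adds every value in the window, while the second keeps a running total and uses the extra "subtract" function to remove the interval that just left the window. Function names are illustrative.

```python
def window_sums_naive(batch_sums, window):
    """Recompute each window from scratch by summing its intervals."""
    return [
        sum(batch_sums[max(0, i - window + 1): i + 1])
        for i in range(len(batch_sums))
    ]


def window_sums_incremental(batch_sums, window, add, subtract):
    """Update the previous window: add the new interval, subtract the expired one."""
    results, current = [], 0  # starting from 0 assumes numeric values
    for i, value in enumerate(batch_sums):
        current = add(current, value)
        if i >= window:
            current = subtract(current, batch_sums[i - window])
        results.append(current)
    return results


per_interval = [3, 1, 4, 1, 5, 9]
naive = window_sums_naive(per_interval, window=3)
incremental = window_sums_incremental(
    per_interval, 3, lambda a, b: a + b, lambda a, b: a - b
)
```

The incremental form does constant work per interval regardless of window length, but it only applies when the aggregation has an inverse, as addition does.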

3.3. D-Stream API(3/3)

Stateful API

State tracking:

sessions = events.track( (key, ev) => 1, (key, st, ev) => ev == Exit ? null : 1, “30s”)

count = sessions.count()
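The session example can be approximated with the following Python sketch. It assumes the three arguments are an initialize function, an update function, and a timeout, that returning `None` (the slide's `null`) removes the key, and that a key expires once it has been silent for the timeout. The `track` function below is a hypothetical re-creation, not the real operator.

```python
def track(batches, initialize, update, timeout):
    """Replay (time, [(key, event), ...]) batches and return the state after each."""
    state, last_seen, snapshots = {}, {}, []
    for now, events in batches:
        for key, ev in events:
            if key in state:
                new_state = update(key, state[key], ev)
            else:
                new_state = initialize(key, ev)
            if new_state is None:
                # Returning None drops the key, mirroring `null` in the slide.
                state.pop(key, None)
                last_seen.pop(key, None)
            else:
                state[key] = new_state
                last_seen[key] = now
        # Expire keys that have been silent for at least `timeout` seconds.
        for key in [k for k, t in last_seen.items() if now - t >= timeout]:
            del state[key]
            del last_seen[key]
        snapshots.append(dict(state))
    return snapshots


batches = [
    (0, [("u1", "View"), ("u2", "View")]),
    (10, [("u1", "Exit")]),
    (40, [("u3", "View")]),
]
sessions = track(
    batches,
    initialize=lambda key, ev: 1,
    update=lambda key, st, ev: None if ev == "Exit" else 1,
    timeout=30,
)
session_counts = [len(s) for s in sessions]  # active sessions per batch
```

In this run `u1` leaves through an explicit Exit event while `u2` is dropped by the timeout, the two ways a session can end in the slide's example.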

3.4. Consistency Semantics

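• When nodes are at different points in the processing, consistency problems arise

• Existing systems: solve this with synchronization, or ignore it

• D-Streams: the semantics are clear because time is divided into intervals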

3.5. Unification with Batch & Interactive Processing

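• D-Streams use the same computation model as batch jobs, so they combine easily with batch processing

• Feature 1: streams can be joined with batch results

• Feature 2: historical data can be processed

• Feature 3: interactive queries are possible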

• counts.slice(“21:00”, “21:05”).topK(10)
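A rough Python analogue of this interactive query is shown below. It assumes the stream's per-interval counts are kept in a dictionary keyed by "HH:MM" strings and that the time range is inclusive at both ends. `slice_counts` and `top_k` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical per-interval word counts retained by the stream.
counts_by_interval = {
    "21:00": {"a": 3, "b": 1},
    "21:01": {"a": 1, "c": 5},
    "21:06": {"b": 9},
}


def slice_counts(stream, start, end):
    """Return the batches whose interval falls in [start, end]."""
    return [counts for t, counts in sorted(stream.items()) if start <= t <= end]


def top_k(batches, k):
    """Merge the selected batches and return the k most frequent keys."""
    total = Counter()
    for counts in batches:
        total.update(counts)
    return total.most_common(k)


top = top_k(slice_counts(counts_by_interval, "21:00", "21:05"), 2)
```

The query touches only stored batch results, so it can run ad hoc without interrupting the streaming job.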

3.6. Summary

4. System Architecture

• Master: tracks the D-Stream lineage graph and schedules tasks to compute new RDD partitions

• Worker nodes: receive data, store the partitions of input and computed RDDs, and execute tasks

• Client Library: sends data into the system

4.2. Optimization for Stream Processing

• Network communication: introduced asynchronous I/O, which made reduce operations faster.

• Timestep pipelining: modified Spark's scheduler so that tasks for the next timestep can be submitted ahead of time

• Task scheduling

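• Storage layer: added asynchronous checkpointing. Because RDDs are immutable, checkpointing does not have to block computation. Zero-copy I/O was also implemented.

• Lineage cutoff: lineage is now deleted once a checkpoint has been taken

• Master recovery: implemented recovery of the master's state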

4.3 Memory Management

• Data is spilled to disk using an LRU policy
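A minimal sketch of LRU spilling, assuming a fixed number of in-memory slots and pickle files in a temporary directory as the "disk". The class `SpillingCache` and the key names are invented for illustration and do not reflect Spark's block manager.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillingCache:
    """Keep the most recently used items in memory and spill the rest to disk."""

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity
        self.spill_dir = spill_dir
        self.memory = OrderedDict()  # oldest entry first

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            # Evict the least recently used entry and write it to disk.
            old_key, old_value = self.memory.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_value, f)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as recently used
            return self.memory[key]
        with open(self._path(key), "rb") as f:
            value = pickle.load(f)
        self.put(key, value)  # bring it back into memory
        return value


spill_dir = tempfile.mkdtemp()
cache = SpillingCache(capacity=2, spill_dir=spill_dir)
cache.put("rdd_1", [1, 2])
cache.put("rdd_2", [3, 4])
cache.get("rdd_1")           # rdd_1 becomes the most recently used
cache.put("rdd_3", [5, 6])   # rdd_2 is now the oldest and is spilled
```

Spilling rather than dropping means an evicted partition can be read back from disk instead of being recomputed from lineage.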

5.1. Parallel Recovery

5.2. Straggler Mitigation

• simple threshold to detect stragglers:

• 1.4 times the median task running time

• Stragglers are resolved within 1 second
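The threshold rule can be written down in a few lines. This sketch assumes the running times of the tasks in one stage are known, and it flags any task that exceeds 1.4 times their median. The function name and sample timings are made up for the example.

```python
from statistics import median


def find_stragglers(running_times, factor=1.4):
    """Return the tasks whose running time exceeds factor x the median."""
    threshold = factor * median(running_times.values())
    return sorted(task for task, secs in running_times.items() if secs > threshold)


times = {"t1": 1.0, "t2": 1.1, "t3": 0.9, "t4": 2.0, "t5": 1.0}
slow = find_stragglers(times)
```

A flagged task can then be launched speculatively on another node. Because tasks are deterministic, whichever copy finishes first can be used.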

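5.3. Master Recovery

1. Reliably record the state of the computation before starting each timestep

2. When the master fails, each worker reports the RDD partitions it holds to the new master

The key point: computing the same RDD twice causes no problems

5.3. D-Streams Metadata

• D-Streams metadata is stored in HDFS

• The user's D-Stream graph and user-defined functions

• The time of the last checkpoint

• The IDs of RDDs (since the checkpoint)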

6.1. Performance

6.1. Comparison with S4 and Storm

6.2. Varying the Checkpoint Interval

6.2. Varying the Number of Nodes

6.2. Straggler Mitigation

Software
