Real-Time Big Data Processing at Facebook

@maruyama097

Fujio Maruyama

o  Monthly active users: 845 million

o  Likes and comments per day: 2.7 billion

o  Photos uploaded per day: 250 million

o  Friend connections on Facebook: 100 billion

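The purpose of the infrastructure to be built

o  There is a huge need and a huge opportunity to get everyone in the world connected, to give everyone a voice, and to help transform society for the future.

o  The scale of the technology and infrastructure that must be built is unprecedented, and we believe this is the most important problem we can focus on.

-- From Facebook's IPO filing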

Real-Time Analytics System

http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html

What this system architecture means

o  What does this system mean for you?

o  Even if you are not Facebook, this architecture is simple enough, and built from enough off-the-shelf tools, that it can work for much smaller projects as well.

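Goals of the system

o  Give people reliable real-time counters across many different metrics, in a way that accounts for data skew.

o  Provide anonymous data. It is not possible to tell whose data it is.

o  Show why plugins are valuable. What value can your business derive from them?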

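Goals of the system (continued)

o  Make the data more actionable. Help users take actions that make their content more valuable.

n  A new UI metaphor: use the idea of a funnel.

n  How many people saw a plugin, how many people acted on it, and how many were driven to your site?

o  Make the data more timely.

n  The system went real-time. Turnaround went from 48 hours to 30 minutes.

n  Multiple points of failure were removed to reach this goal.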

Many event types: more than 100 metrics are tracked

o  Plugin impressions

o  Likes

o  News Feed impressions

o  News Feed clicks

o  Demographics

o  20 billion events per day, which is about 200,000 events per second.

Massive Amounts of Data

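Data skew: uneven key distribution

o  Likes follow something like a power-law distribution. The long tail collects almost no likes, while a few resources collect an enormous number.

o  This leads to hot regions where access concentrates, hot keys, and lock contention.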

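Trial and error with several different prototype implementations

Facebook went through a good deal of trial and error before arriving at this architecture. Let us look at those attempts here.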

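Using MySQL as a DB counter

o  Each row holds a key and a counter.

o  This results in a lot of database activity.

o  State is stored in buckets with one-day granularity. Every day at midnight the stats roll over.

n  When the rollover time arrives, the results are written to the database all at once, which creates a lot of lock contention.

n  Spreading this work out by taking time zones into account is a further challenge.

n  So is partitioning the data in a different way.

o  A high write rate leads to lock contention. It is easy to overload the database, so it must be monitored constantly, and the sharding strategy has to be rethought again and again.

o  The solution is not well tailored to the problem.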

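Using in-memory counters

o  If you are suffering from I/O bottlenecks, put everything in memory.

o  There are no scale issues. Because the counters live in memory, writes are fast and the counters are easy to shard.

o  For several reasons, in-memory counters were felt to be less accurate than the other approaches. Even a 1% failure rate would be unacceptable. Because money changes hands on the basis of the analytics, the counters must be extremely accurate.

o  Facebook did not actually implement this system. It remained a thought experiment, and the accuracy issue pushed them toward a different solution.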

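Using MapReduce

o  The previous solution used Hadoop/Hive.

o  It is flexible and easy to get running. It can handle heavy I/O for both writes and reads. You do not need to know in advance how the data will be queried: data is stored first and queried later.

o  It is not real-time. There are many dependencies and many points of failure. It is a complex system that is not dependable enough for real-time goals.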

Using Cassandra/HBase

o  On availability and write efficiency, HBase looked like a better solution than Cassandra.

o  HBase's write rate is very high, and the bottleneck was removed.

o  The Winner: HBase + Scribe + Ptail + Puma

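Architecture of the Real-Time Analytics System

An important feature of this Facebook system is that it is not an n-tier model but a tailing architecture based on a write-ahead log (WAL). It can also be applied to small and medium-sized systems.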

A tailing architecture built on the WAL

o  HBase stores data across many distributed machines.

o  A tailing architecture is used.

o  A user clicks Like on a web page.

o  An AJAX request is sent to Facebook.

o  The AJAX request is written to a log file using Scribe.

o  New events are stored in log files by Scribe, and the logs are tailed with PTail.

o  The system pulls events off the logs, processes them in Puma, and writes them to HBase storage.

o  The UI pulls the data from storage and displays it to the user.

Web -> Scribe -> PTail -> Puma -> HBase

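Write Ahead Log

o  In Facebook's system, the feature that is essential for scalability and reliability is the WAL (write-ahead log): a log of the operations that are supposed to occur.

o  Data is partitioned across region servers based on the key.

o  Data is written to the WAL first.

o  Data is then placed in memory and flushed to disk at some point in time, or once enough data has accumulated.

o  If a machine goes down, the data can be replayed from the WAL.

o  By combining a log with in-memory storage, a very high rate of I/O can be handled reliably.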
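The log-then-memory ordering described above can be sketched in a few lines of Python. This is a toy illustration, not HBase code: the class name `TinyStore`, the JSON-lines log format, and the counter semantics are all assumptions made for the example, and the periodic flush of memory to disk is left out.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy counter store: append to a log first, then update memory."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.mem = {}
        self._replay()  # rebuild in-memory state after a crash or restart

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                key, delta = json.loads(line)
                self.mem[key] = self.mem.get(key, 0) + delta

    def increment(self, key, delta=1):
        # 1. Record the intended operation durably in the log.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps([key, delta]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then apply it to the in-memory state.
        self.mem[key] = self.mem.get(key, 0) + delta
        return self.mem[key]
```

Creating a second `TinyStore` on the same log path plays the role of a restarted machine: its in-memory counters are rebuilt purely from the log.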

Data flow in Real-Time Analytics

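Data schema

o  Many counters are stored per URL.

o  The row key, which is the only lookup key, is the MD5 hash of the reverse domain (such as com.facebook). Choosing the right key structure makes scans and sharding easier.

o  The problem is sharding the data properly across different machines. Using an MD5 hash makes it easy to say that this range of URLs goes here and that range goes there.

o  Something similar is done for the URLs themselves, except that an ID is added to the URL. Inside Facebook, every URL is represented by a unique ID, which is used to help with sharding.

o  A reverse domain such as com.facebook is used to cluster data together. Because data stored this way is clustered together, per-domain statistics can be computed efficiently.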
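As a rough illustration of the key scheme described above, the following Python sketch derives a row key from a URL. The function names, the hex encoding, and especially the `domain-hash:url-id` layout of `url_row_key` are assumptions for the example; the slides do not give the exact byte layout Facebook uses.

```python
import hashlib
from urllib.parse import urlsplit


def reverse_domain(url):
    """'http://www.facebook.com/x' -> 'com.facebook.www'."""
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))


def domain_row_key(url):
    """Hex MD5 of the reversed domain, used here as the row key."""
    return hashlib.md5(reverse_domain(url).encode("utf-8")).hexdigest()


def url_row_key(url, url_id):
    """Hypothetical URL-level key: domain hash plus a numeric URL id."""
    return f"{domain_row_key(url)}:{url_id:016x}"
```

Because every URL of a domain shares the same key prefix, rows for one domain sort next to each other, while the hash spreads different domains evenly over the key space.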

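Data TTL

o  Suppose every row is a URL and its columns are counters. A different TTL (time to live) can then be set for each column.

o  So if you are keeping hourly counts, there is no need to keep them for every URL forever. A TTL of two weeks, say, is enough. Typically, TTLs are set per column family.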

Scribe

o  Scribe handles issues such as log file roll-over.

o  Scribe is built on the same HDFS file store as Hadoop.

o  Very simple log lines are written. The more compact they are, the more can be kept in memory.

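Log processing

o  Each server can handle 10,000 writes per second.

o  Checkpointing is used so that no data is lost when reading from the logs.

n  The tailers save their log stream checkpoints in HBase.

n  On restart, data is replayed from the checkpoint, so no data is lost.

o  This could be used to detect click fraud, but fraud detection has not been built in.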

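Ptail

o  Data is read from the log files using Ptail. Ptail is an internally built tool for aggregating data from multiple Scribe stores. It tails the log files and pulls the data out.

o  Ptail data is separated into three streams, so that each can eventually be sent to its own cluster in one of three different data centers.

n  Plugin impressions

n  News Feed impressions

n  Actions (plugin + news feed)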

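PTail hot spots

o  In a distributed system, one part of the system can become hotter than the others. One example is a region server, which can become hot when many keys are directed to it.

o  One tailer can lag behind the others. If one tailer is an hour behind and the others are up to date, what numbers should be displayed?

o  For example, impressions have far higher volume than actions, so the CTR (click-through rate) appears higher for the most recent period.

o  The solution is to find the tailer that is furthest behind and use its position when metrics are queried.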
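The rule of querying only up to the slowest tailer can be sketched as follows. The function names and the representation of tailer progress as a dictionary of timestamps are assumptions for illustration; the slides only state the idea.

```python
def safe_query_time(tailer_positions):
    """Return the latest timestamp that every tailer has already reached."""
    if not tailer_positions:
        raise ValueError("no tailers registered")
    return min(tailer_positions.values())


def ctr(impressions, actions, tailer_positions):
    """Click-through rate using only events at or before the safe time.

    `impressions` and `actions` are lists of event timestamps in seconds.
    """
    cutoff = safe_query_time(tailer_positions)
    seen = sum(1 for t in impressions if t <= cutoff)
    acted = sum(1 for t in actions if t <= cutoff)
    return acted / seen if seen else 0.0
```

With the impression tailer at time 100 and the action tailer at time 160, actions after time 100 are ignored, so the ratio is not inflated by the stream that happens to be further ahead.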

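Puma

o  Data is batched to lessen the impact of hot keys. Even though HBase can handle a very large number of writes per second, they still wanted to batch the data.

o  A hot post generates a large number of plugin impressions and News Feed impressions. That causes huge data skew, which in turn causes I/O problems. The more batching, the better.

o  A batch takes 1.5 seconds on average. They would like longer batches, but that would mean holding so many URLs that memory would run out when building the hash table.

o  To avoid lock contention, wait for the last flush to finish before starting a new batch.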
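The effect of batching on hot keys can be illustrated with a small Python sketch. `BatchingCounter`, the `max_keys` cap, and the dict used as a stand-in for HBase are assumptions for the example, not Puma's implementation.

```python
from collections import Counter


class BatchingCounter:
    """Combine increments in memory and write each key once per batch."""

    def __init__(self, store, max_keys=100_000):
        self.store = store          # any dict-like sink; stands in for HBase
        self.max_keys = max_keys    # cap on memory used by the pending batch
        self.pending = Counter()
        self.writes = 0             # number of writes issued to the store

    def add(self, key, delta=1):
        self.pending[key] += delta
        if len(self.pending) >= self.max_keys:
            self.flush()

    def flush(self):
        # The next batch starts only after this flush has completed.
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
            self.writes += 1
        self.pending = Counter()
```

A thousand increments to one hot URL collapse into a single store write per batch, which is why batching helps most exactly where the skew is worst. The key cap mirrors the memory limit mentioned above: a longer batch means a larger hash table.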

The UI that renders the data

o  The front end is written entirely in PHP.

o  The back end is written in Java, and Thrift is used as the messaging format, so PHP programs can call the Java services.

o  Caching is used to make web pages display faster.

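System performance

o  Performance varies with the situation. A single counter can come back immediately, but the top URLs within a domain can take a little longer. The range is from 0.5 seconds to a few seconds.

o  The more data is cached, and the longer it is cached, the less real-time it becomes. Set a different cache TTL in memcache for each kind of data.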

HBase and Hadoop

HBase plays a major role in this system. Hadoop is reportedly kept as a backup.

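HBase is a distributed column store

o  The HBase database interfaces with Hadoop. Facebook has in-house staff who work on HBase.

o  Unlike a relational database, HBase does not create mappings between tables.

o  It does not create indexes either. The only index is the primary row key.

o  From a row key, HBase can hold millions of sparse columns of storage. It is very flexible.

o  There is no need to specify a schema. You define column families, and keys can be added to them at any time.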

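Automation in HBase

o  HBase can detect system failures and automatically route around them.

o  At present, re-sharding and relocation of HBase data is done manually.

o  Automatic hot-spot detection and re-sharding is on the HBase roadmap, but it is not there yet.

o  Every Tuesday, someone checks the keys and decides what changes are needed in the sharding plan.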

MapReduce

o  Data is sent to the MapReduce servers so that it can be queried with Hive.

o  This also works as a backup plan, since the data can be recovered from Hive.

o  The original raw logs are deleted after a period of time.

Future challenges

Here we look at which challenges remain.

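Future challenge: top lists

o  Finding the top URLs, such as the most-liked URL, is very hard, because for a domain like YouTube millions of URLs can be shared in a short time.

o  A more creative solution is needed, one that keeps an in-memory sort order and updates it as the data changes.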

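Other challenges

o  Counting distinct users

n  How many people liked a URL over a time window? This is easy in MapReduce but quite hard with simple counters.

o  Generalizing the application beyond social plugins.

o  Moving to multiple data centers

n  The system currently runs in a single data center, but they would like it to run in several.

n  The current fallback plan in case of failure is to use MapReduce.

n  This backup system is tested every night. Query results from Hive and from the new system are compared to check that they agree.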

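About this project

o  The project took five months. Two engineers started working on it at first. After that, an engineer working at 50% time was added.

o  Two UI people worked on the front end.

o  Across engineering, design, PM, and operations, about 14 people appear to have worked on it.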

Cassandra

o  Some other teams chose Cassandra, because they loved its scalability, its multi-data-center support, and its ease of operation. However, Cassandra did not fit neatly into the real-time analytics stack.

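Common ground with the messaging system

o  Looking at the messaging system, it is striking how much it has in common with this analytics system.

o  Large numbers, HBase, real time. The challenge of handling a huge write load reliably and in a timely manner is the common foundation of both problems.

o  Facebook is focusing on the HBase, Hadoop, and HDFS ecosystem, and is counting on operational quirks being worked out later.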

Real-time Analytics at Facebook

Zheng Shao 10/18/2011

Analytics and Real-time what and why

Facebook Insights

o  Use cases

n  Time-series data for Websites/Ads/Apps/Pages

n  Demographic analysis

n  Unique user counts / most-accessed pages

o  Major challenges

n  Scalability

n  Latency

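Processing before Facebook's real-time Insights

o  Facebook already had a complete data warehouse solution that handled the Insights workload.

o  To make Insights processing scalable, Facebook uses a Hadoop cluster of 3,000 nodes.

n  Log streams generated by the HTTP servers are transferred to NFS within seconds by Scribe, a log collection framework.

n  That data is copied/loaded into Hadoop every hour. The Copier/Loader is a MapReduce job that handles machine failures automatically.

n  Daily summaries of the logs, generated in Hadoop, are finally loaded by pipeline jobs into MySQL for use by the service.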

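Hadoop/Hive-based analytics

o  A 3,000-node Hadoop cluster

o  Pipeline jobs: Hive allows SQL-like descriptions

o  Scalability is considerable; it holds up to the limit of the data center's power.

o  Latency, however, is terrible: 24 to 48 hours.

[Figure: HTTP -> Scribe -> NFS -> Hive/Hadoop -> MySQL. Step latencies: seconds, seconds, hourly (Copier/Loader), daily (Pipeline Jobs).]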

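Two possible responses to latency

o  One idea is small-batch processing.

n  Instead of one batch per day, run much smaller batches.

n  The problem is how to reduce the per-batch overhead so that batches of a minute or less are worthwhile.

o  The other idea is stream processing.

n  Aggregate the data as soon as it arrives. This yields near-real-time results.

n  The problem is how to make the system reliable in the face of hardware failures.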

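How can low latency be achieved?

o  Small-batch processing

n  Run Map-Reduce/Hive every hour, every 15 minutes, or every 5 minutes.

n  How can the per-batch overhead be reduced?

o  Stream processing

n  Aggregate the data as soon as it arrives.

n  How can the reliability problem be solved?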

Facebook's choice

o  Facebook ultimately chose stream processing.

n  The per-batch overhead of Map-Reduce turned out to be so high that even 5-minute batches on the Hadoop cluster were not practical.

Data Freeway and Puma

o  The real-time analytics system Facebook built consists of two basic systems.

o  The first is Data Freeway, built on Scribe and HDFS.

o  The second is Puma, a reliable stream aggregation engine built on HBase.

Data Freeway scalable data stream

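The earlier Scribe-based data transfer: collecting log data hierarchically

o  The first transfer goes from the clients to the mid-tier, funnelling tens of thousands of nodes down to a few hundred.

o  The second transfer shuffles the data by log category, so that each log category is stored on a single writer node.

o  The log data is then written to NFS by the writers and consumed by the batch copier and by Unix tail/fopen.

o  Scribe was open-sourced in 2008. At that time there were about 100 log categories.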

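Transfer with Scribe

o  Scribe is a simple push/RPC-based logging system.

o  Routing is configured statically.

[Figure: Scribe transfer pipeline. Components: Scribe Clients, Scribe Mid-Tier, Scribe Writers, NFS, HDFS, Log Consumer, Batch Copier, tail/fopen. Stage labels: first transfer; second transfer (categorize); third transfer (write to FS).]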

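Problems with this style

o  After it was open-sourced in 2008, Scribe was quickly adopted by many companies.

o  Routing was done through static configuration. That was flexible, but there were two problems.

o  First, a configuration file had to be managed for each writer machine, and having one writer per category was not scalable.

o  Second, the writer was a single point of failure.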

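Data Freeway 2011

o  In 2011, Facebook put Data Freeway into production.

o  It now handles 9 GB/sec at peak, with an end-to-end latency of 10 seconds.

o  There are now more than 2,500 categories.

o  At present, PTail reads directly from the HDFS written by Calligraphus. The plan is to PTail from the HDFS written by Continuous Copier in the future.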

[Figure: Data Freeway 2011. Components: Scribe Clients, Calligraphus Mid-Tier, Calligraphus Writers, two HDFS clusters (each with C1, C2, DataNode), Continuous Copier, PTail, PTail (in the plan), Zookeeper, Log Consumer.]

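The four components of Data Freeway

o  The first component is Scribe. It runs only on the clients and is responsible for sending data over RPC.

o  The second component is Calligraphus. It uses Zookeeper to manage ownership of categories, shuffles the data, and writes it to HDFS.

o  The third component is Continuous Copier. It continuously copies files from one HDFS to another as the files grow.

o  The fourth component is PTail. It tails multiple directories on HDFS in parallel and writes to stdout.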

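Calligraphus

o  Calligraphus is responsible for taking data from RPC and writing it to the file system.

o  Each log category is represented by one or more file system directories.

o  Each directory is an ordered list of files whose names include the name of the data.

o  This is about the simplest format imaginable for storing log data.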

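The two kinds of bucketing in Calligraphus

o  The most interesting feature of Calligraphus is that it supports two kinds of bucketing.

o  The first is application buckets: data partitioning defined by the application. These are used by consumers of partitioned logs. Most large log consumers are partitioned, because their log streams are so large.

o  The second is infrastructure buckets, which let a single application bucket have a throughput anywhere from a few bytes per second to several gigabytes per second. Each infrastructure bucket is a directory, so a large stream can be written to several directories at the same time.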

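Calligraphus performance

o  Calligraphus delivers very high performance.

o  Facebook calls the file system sync every 7 seconds, and this is currently the largest source of data latency.

o  Network throughput is high enough to saturate a 1 Gbit NIC easily. There are plans to move to 10 Gbit NICs soon.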

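Continuous Copier

o  Continuous Copier is the component that continuously copies data from one file system to another. Compared with the batch-based map-reduce Copier, its latency is considerably lower and its network usage is smoother.

o  It is currently implemented as a long-running map-only job, but it could easily be moved to any simple job scheduling system other than MapReduce.

o  It currently uses HDFS lock files, but the plan is to switch to ZooKeeper soon.

o  The peak throughput of Continuous Copier in production is about 3 GB/sec, and data is now compressed.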

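PTail

o  File System -> Stream (-> RPC)

[Figure: PTail reading the files of several directories, with a checkpoint.]

o  PTail transfers data from the file system to an output stream.

o  The key feature of PTail is its checkpoint. A PTail checkpoint contains the current file and the file offset for each directory.

o  This makes it possible to roll back to an earlier checkpoint and regenerate the data stream at a data boundary, with no data loss and no duplicates.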
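The checkpoint idea can be sketched for a single local directory as below. This is a simplified stand-in, not PTail: it reads local files rather than HDFS, handles one directory instead of many in parallel, and the `(file_name, byte_offset)` checkpoint shape and the function name `tail_from` are assumptions for the example.

```python
import os
import tempfile


def tail_from(checkpoint, directory):
    """Yield (line, new_checkpoint) for each complete line after `checkpoint`.

    `checkpoint` is a (file_name, byte_offset) pair, or None to start from
    the beginning. Files are read in sorted name order.
    """
    start_name, start_offset = checkpoint or ("", 0)
    for name in sorted(os.listdir(directory)):
        if name < start_name:
            continue  # fully consumed before the checkpoint
        offset = start_offset if name == start_name else 0
        with open(os.path.join(directory, name), "rb") as f:
            f.seek(offset)
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a partially written record
                offset += len(line)
                yield line.rstrip(b"\n").decode("utf-8"), (name, offset)
```

Each yielded checkpoint points just past a complete record, so a consumer that stores it and later resumes from it sees every record exactly once, which is the no-loss, no-duplicate property described above.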

Channel comparison

              Push / RPC    Pull / FS
Latency       1-2 sec       10 sec
Loss/Dups     Few           None
Robustness    Low           High
Complexity    Low           High

[Figure: components arranged by channel type (Push / RPC, Pull / FS): Scribe, Calligraphus, PTail + ScribeSend, Continuous Copier.]

Puma real-time aggregation/storage

Puma is a typical stream aggregation engine with a simple architecture.

[Figure: Log Stream -> Aggregations -> Storage -> Serving]

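Puma overview

o  Log streams are aggregated across a set of machines. The aggregated summaries are normally written to storage for durability.

o  Online services can fetch summaries either directly from Puma or from storage.

o  In Puma, write throughput is much higher than read throughput, because analytics data such as summaries is viewed only by the owners of the web sites and similar properties.

o  The write rate into Puma is on the order of one million rows per second.

o  Facebook needed to run multiple Group-By operations over the log lines, by age, gender, and so on.

o  The first Group-By key is always related to time/date. This means a summary becomes final after a certain amount of time.

o  Puma also supports complex aggregations such as unique user counts and most-accessed items.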

Comparison of MySQL and HBase

                    MySQL                         HBase
Parallel            Manual sharding               Automatic load balancing
Fail-over           Manual master/slave switch    Automatic
Read efficiency     High                          Low
Write efficiency    Medium                        High
Columnar support    No                            Yes

Puma2

o  The first Puma architecture that Facebook implemented is called Puma2. It went into production in March 2011 and could process 600,000 log lines per second on 100 boxes running Puma2 + HBase.

[Figure: PTail -> Puma2 -> HBase -> Serving]

Puma2 architecture

o  PTail provides Puma2 with parallel data streams.

o  For each log line, Puma2 issues an "increment" operation to HBase.

o  The Puma2 servers are all symmetric with respect to HBase; there is no sharding.

o  The same row in HBase can be incremented by several Puma2 servers at the same time.

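The HBase increment operation

o  HBase can increment multiple columns of the same row with a single operation. The increments for several Group-By columns can therefore be handled in one operation.

o  Note that this operation, though simplified to an increment, corresponds to the Reduce step of MapReduce. Key-based access to HBase corresponds to the shuffle.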
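The analogy between a multi-column row increment and a reduce step can be made concrete with a small in-memory model. `CounterTable` below is a dict-backed stand-in and does not use the HBase client API; the column naming scheme and the fields of the log line are assumptions for the example.

```python
from collections import defaultdict


class CounterTable:
    """Dict-backed stand-in for a table with multi-column row increments."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row_key, deltas):
        """Add every delta in `deltas` to its column of one row."""
        row = self.rows[row_key]
        for column, delta in deltas.items():
            row[column] += delta


def process_log_line(table, line):
    """Map one log line to a row key, then 'reduce' it by incrementing."""
    row_key = (line["hour"], line["adid"])  # plays the role of the shuffle key
    table.increment(row_key, {
        "total": 1,
        f"gender:{line['gender']}": 1,
        f"age:{line['age_bucket']}": 1,
    })
```

One call per log line updates every Group-By counter for that row, and lines with the same key meet at the same row regardless of which server processed them.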

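Advantages of Puma2

o  The good thing about Puma2 is that it is very simple and easy to manage.

o  The basic reason is that the Puma2 servers hold almost no state and are symmetric.

o  The only state a Puma2 server holds is its PTail checkpoint, which is written to HBase periodically.

o  As a result, boxes can be added easily, and if a machine goes down it can simply be restarted.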

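Problems with Puma2

o  The HBase increment operation was expensive, because it has to read the entire row, increment it, and write it back, and reading a row is costly.

o  Puma2 also had difficulty supporting aggregations other than counts, since that required extensive changes to the HBase code. In fact, for the "most-accessed items" aggregation, Puma2 used an elaborate implementation with several "most-accessed items" tables.

o  Finally, in Puma2 the increments and the checkpoint write could not be done in the same transaction, so a small amount of data duplication could occur.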

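Attempts to improve Puma2 (1)

o  The obvious idea for improving the Puma2 service was to batch the increment requests and issue them together, to reduce the load on HBase.

o  However, the Group-By keys are spread over a very wide range in a long-tail distribution, so this idea did not work well.

o  In addition, a checkpoint cannot be saved in the middle of a batch, so data accuracy would also suffer.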

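Attempts to improve Puma2 (2)

o  On the HBase side, the "increment" operation was first optimized by reducing the number of locks.

o  Another improvement with a large effect was short-circuit reads: reading the HDFS files on disk directly, without going through the DataNode daemon.

o  Facebook also improved reliability under high load.

o  Despite all these efforts, Puma2 was still not in a happy state, especially for unique counters.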

Puma3

[Figure: PTail -> Puma3 -> HBase, with Serving]

Despite all these efforts, Puma2 was not good enough, so Facebook switched to an architecture called Puma3.

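The difference between Puma2 and Puma3: aggregation in memory

o  The biggest difference between Puma2 and Puma3 is that Puma3 aggregates in the memory of the Puma3 process instead of in HBase. Local memory operations are much faster, so much higher throughput can be achieved.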

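Puma3 architecture

o  Puma3 is sharded by aggregation key.

o  Each shard is an in-memory hashmap.

o  Each hashmap entry is a pair of an aggregation key and a user-defined aggregation.

o  HBase is the persistent key/value store.

[Figure: PTail -> Puma3 -> HBase, with Serving]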

Sharding by aggregation key

o  To aggregate in memory, Facebook sharded the Puma3 servers by aggregation key.

o  This means the incoming PTail data streams must themselves be sharded. That is supported by the application bucketing feature of Calligraphus.

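Hashmap entries and persistent storage

o  Each Puma3 shard is essentially an in-memory hashmap. Each hashmap entry is a pair of an aggregation key and a user-defined aggregation, which can be count, sum, avg, or anything else.

o  Facebook uses HBase as the persistent key/value store, but normally does not read from it.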

Writing to Puma3

o  Write flow

n  Extract the key/value columns from each log line.

n  Look up the hashmap and call the user-defined aggregation function.

[Figure: PTail -> Puma3 -> HBase, with Serving]

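Writing to Puma3 (continued)

o  The Puma3 write flow is quite simple. Basically, the key and the value are extracted from the columns of each log line.

o  The key is used to look up the in-memory hashmap, and the user-defined aggregation function is called with the value.

o  Note that because the log stream is sharded by aggregation key, the same aggregation key never appears in more than one Puma3 process. This is the key to making Puma3 work.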
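A minimal sketch of this write path, assuming a CRC32-based shard choice and a `(count, sum)` aggregate. The names `AggregationShard`, `shard_for`, `route`, and the log-line fields are hypothetical; in the real system the routing is done upstream by Calligraphus application buckets rather than by a function call.

```python
import zlib


def shard_for(key, num_shards):
    """Stable shard choice; crc32 is used because hash() varies per process."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_shards


class AggregationShard:
    """One process: an in-memory map from aggregation key to aggregate."""

    def __init__(self, aggregate):
        self.aggregate = aggregate  # user-defined: (old_state, value) -> new_state
        self.state = {}

    def write(self, key, value):
        self.state[key] = self.aggregate(self.state.get(key), value)


def count_and_sum(state, value):
    """Example user-defined aggregation keeping (count, sum)."""
    count, total = state or (0, 0)
    return count + 1, total + value


def route(shards, log_line):
    """Extract the key columns and hand the value to the owning shard."""
    key = (log_line["hour"], log_line["adid"])
    shards[shard_for(key, len(shards))].write(key, log_line["cost"])
```

Because the shard is a pure function of the key, every update for a given key lands in the same hashmap, so no cross-process locking or read-modify-write against HBase is needed on the write path.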

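Saving Puma3 state

o  Checkpoint flow

n  Every five minutes, the modified hashmap entries and the PTail checkpoint are stored in HBase.

n  On startup (including after a node failure), they are loaded from HBase.

n  After a certain time has passed, these items are removed from memory.

[Figure: PTail -> Puma3 -> HBase, with Serving]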

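Saving Puma3 state (continued)

o  Facebook checkpoints the state of each Puma3 process to HBase every 5 minutes. Basically, it saves not only the PTail checkpoint but also every modified hashmap entry.

o  This means that if Puma3 crashes and restarts, it can load its state from HBase with sequential reads. Sequential reads in HBase are quite fast.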

Time Window

o  To save memory, a hashmap entry is removed from memory once the time window of its aggregation has passed.

o  This is because no new log lines will ever arrive for that time window again.
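Checkpointing and time-window eviction fit together as in the sketch below. The class `WindowedState`, the `__checkpoint__` key, and the rule of never evicting an entry that has not yet been checkpointed are assumptions for illustration; a plain dict stands in for HBase.

```python
class WindowedState:
    """In-memory aggregates keyed by (window_start, key), with checkpoints."""

    def __init__(self, window_seconds, persistent):
        self.window = window_seconds
        self.persistent = persistent   # dict-like; stands in for HBase
        self.state = {}
        self.dirty = set()             # entries modified since the last checkpoint

    def add(self, timestamp, key, delta=1):
        entry = (timestamp - timestamp % self.window, key)
        self.state[entry] = self.state.get(entry, 0) + delta
        self.dirty.add(entry)

    def checkpoint(self, tail_position):
        """Persist modified entries together with the input position."""
        for entry in self.dirty:
            self.persistent[entry] = self.state[entry]
        self.persistent["__checkpoint__"] = tail_position
        self.dirty.clear()

    def evict_closed(self, now):
        """Drop checkpointed entries whose window ended at or before `now`."""
        for entry in list(self.state):
            if entry[0] + self.window <= now and entry not in self.dirty:
                del self.state[entry]
```

Only entries touched since the last checkpoint are written, which keeps the periodic write small, and memory is bounded by the number of keys in windows that are still open.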

Reading from Puma3

o  Read flow

n  Uncommitted read: served directly from the in-memory hashmap. On a miss, the data is loaded from HBase.

n  Committed read: read from HBase and served.

[Figure: PTail -> Puma3 -> HBase, with Serving]

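Uncommitted reads from Puma3

o  There are two options for the read flow.

o  If you want to read uncommitted aggregations, with a latency of about 10 seconds, you can be served directly from the in-memory hashmap.

o  You only need to go to HBase on a miss, which happens only when the aggregation's time window has already closed.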

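Committed reads from Puma3

o  If you want to read committed data, Puma3 reads it from HBase and serves it.

o  Note that the value of an uncommitted aggregation can go down if the Puma3 process dies before writing its next checkpoint.

o  Facebook plans to put a caching layer between Puma3 and the service so that counts never decrease.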

Joins in Puma3

o  Join

n  There is a static join table in HBase.

n  A user-defined function (udf) performs a distributed hash lookup.

n  A local cache greatly improves the throughput of the udf lookup.

[Figure: PTail -> Puma3 -> HBase, with Serving]

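Static Table Join

o  Puma3 also supports joins with static tables in HBase.

o  The join key must be the row key of the static HBase table. The join is implemented as a simple distributed hash lookup inside a user-defined function.

o  A local cache is known to greatly improve the throughput of this user-defined function.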

Comparing Puma2 and Puma3

o  Comparing the two, Puma3 has far better write throughput than Puma2.

o  It needed only 25% of the machines to handle the same load. The main reason given is that HBase is very good at write throughput.

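Problems with Puma3

o  At the same time, Puma3 needs a lot of memory. Basically, all aggregations that are still changing have to be kept in memory in order to sustain the write throughput of the log stream.

o  Facebook currently uses 60 GB of memory per box for the hashmap.

o  In the future, using SSDs, it should be easy to scale this tenfold per box.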

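Special aggregations in Puma3

o  In Puma3, the following special aggregations are easy to do, at the cost of some approximation.

o  For unique counts, a simple adaptive sampling algorithm was implemented. In this algorithm, sampling becomes more aggressive as the number of unique items grows.

o  Facebook also plans to implement a standard Bloom filter. For most-accessed-items aggregation, it plans to implement the classic lossy counting algorithm and the probabilistic lossy counting algorithm.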
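The slides do not show Puma's sampling code, so the following is a textbook-style adaptive sampling sketch for approximate distinct counts, given only to illustrate how sampling can become more aggressive as the number of unique items grows. The class name, the MD5-based hash, and the default capacity are assumptions.

```python
import hashlib


class AdaptiveSampler:
    """Approximate distinct count with a bounded sample of hashed items."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.depth = 0       # an item is kept if its hash has `depth` low zero bits
        self.sample = set()

    @staticmethod
    def _hash(item):
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _keep(self, h):
        return (h & ((1 << self.depth) - 1)) == 0

    def add(self, item):
        h = self._hash(item)
        if self._keep(h):
            self.sample.add(h)
            while len(self.sample) > self.capacity:
                # Too many survivors: halve the sampling rate and re-filter.
                self.depth += 1
                self.sample = {x for x in self.sample if self._keep(x)}

    def estimate(self):
        return len(self.sample) * (1 << self.depth)
```

While the number of distinct items stays under the capacity, the count is exact; beyond that, each increase of `depth` halves the fraction of items retained, and the estimate scales the sample size back up by the same factor. Memory stays bounded by the capacity no matter how many users are seen.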

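PQL – Puma Query Language

o  The biggest feature that sets Puma apart from other stream processing projects is its language.

o  Facebook created PQL (Puma Query Language), a SQL-like query language that defines the input streams, the output tables, and the queries themselves.

o  Note that a query can include user-defined functions for joins as well as aggregation functions.

o  Puma3 is currently one step short of production. Facebook plans to promote Puma3 to production as soon as all its summaries have been verified against Puma2 and Hive.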

PQL – Puma Query Language

o  CREATE INPUT TABLE t ('time', 'adid', 'userid');

o  CREATE VIEW v AS
   SELECT *, udf.age(userid)
   FROM t
   WHERE udf.age(userid) > 21

o  CREATE HBASE TABLE h …

o  CREATE LOGICAL TABLE l …

o  CREATE AGGREGATION 'abc'
   INSERT INTO l (a, b, c)
   SELECT udf.hour(time), adid, age, count(1), udf.count_distinct(userid)
   FROM v
   GROUP BY udf.hour(time), adid, age;

Future Works challenges and opportunities

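Future work

o  This is the list of what Facebook plans to do next.

o  First, a simple scheduler for Puma3. Because the jobs run continuously, only simple scheduling is needed. Most likely an existing framework will be reused.

o  Second, wider use of the project inside Facebook. The plan is to move most of the daily reporting queries off Hive, provided they are simple enough for Puma to support. This should not only reduce latency but also improve efficiency by cutting down on compression and decompression.

o  Third, open-sourcing. At present the biggest bottleneck is Java Thrift, where Facebook's version and the open-source version have diverged.

o  Facebook plans to open-source the components one at a time, starting with Calligraphus.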

Real-time stream processing systems

o  Many similar real-time stream processing systems exist, in both academia and industry. They include the following.

o  STREAM: Stanford

o  Flume: Cloudera

o  S4: Yahoo

o  Rainbird/Storm: Twitter

o  Kafka: LinkedIn

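Characteristics of Facebook's system

o  Instead of comparing the systems one by one, let us summarize the important differences in Facebook's system.

o  Data Freeway is a scalable data stream framework with a throughput of 9 GB per second and latency within 10 seconds.

o  It supports both Push/RPC and Pull/FS channels.

o  It has components that can be combined in any way to suit the use case.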

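Features Puma supports

o  Puma is a reliable stream aggregation engine. It solidly supports both time-based Group By and Table-Stream Lookup Join.

o  If we compare this new real-time MapReduce with conventional MapReduce, Facebook's Puma has a query language, PQL, that is comparable to Hive on the conventional side.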

Features Puma does not support

o  Puma does not support sliding windows or stream joins.

o  These are very hard problems, and they are problems that do not arise in Facebook's environment.

Apache Hadoop Goes Realtime at Facebook

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system,

and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments

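4. Realtime HDFS

High availability - introducing the AvatarNode

o  At startup, the HDFS NameNode (the counterpart of the GFS master node) reads the file system metadata from a file called fsimage. This metadata contains the names and metadata of every file and directory in HDFS.

o  However, the NameNode does not persistently store the location of each file's blocks. The time to cold-start a NameNode therefore consists of two main parts.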

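NameNode cold start

o  First, it reads the file system image, applies the transaction log, and writes the new file system image back to disk.

o  Second, it processes block reports from a majority of the DataNodes in order to recover the locations of all blocks in the cluster. Facebook's largest HDFS cluster has about 150 million files, and these two steps took roughly the same amount of time. In total, a cold start of this HDFS took 45 minutes.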

BackupNode failover

o  The BackupNode in Apache HDFS avoids reading the fsimage file from disk on failover, but it still has to gather block reports from all the DataNodes. With a BackupNode-based solution, failover can therefore take close to 20 minutes.

o  Our goal was to complete a failover within seconds, so the BackupNode solution did not meet our goal of fast failover.

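Another problem with the BackupNode

o  There is another problem. The NameNode synchronously updates the BackupNode on every transaction, so the reliability of the whole system can be lower than that of a standalone NameNode.

o  This is how the HDFS AvatarNode was born.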

HDFS AvatarNode

o  An HDFS cluster has two AvatarNodes: the active AvatarNode and the standby AvatarNode. They form an active-passive hot-standby pair.

o  An AvatarNode is a wrapper around a normal NameNode. All HDFS clusters at Facebook use NFS to store one copy of the file system image and one copy of the transaction log.

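The two AvatarNodes

o  The active AvatarNode writes its transactions to the transaction log on the NFS file system.

o  At the same time, the standby AvatarNode opens the same transaction log for reading from the NFS file system. It applies the transactions to its own namespace, keeping that namespace as close as possible to the namespace of the active AvatarNode.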

stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM[13] Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.

HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data, large indices, and maintain the flexibility to scale out quickly.

4. REALTIME HDFS HDFS was originally designed to be a file system to support offline MapReduce application that are inherently batch systems and where scalability and streaming performance are most critical. We have seen the advantages of using HDFS: its linear scalability and fault tolerance results in huge cost savings across the enterprise. The new, more realtime and online usage of HDFS push new requirements and now use HDFS as a general-purpose low-latency file system. In this section, we describe some of the core changes we have made to HDFS to support these new applications.

4.1 High Availability - AvatarNode The design of HDFS has a single master – the NameNode. Whenever the master is down, the HDFS cluster is unusable until the NameNode is back up. This is a single point of failure and is one of the reason why people are reluctant to deploy HDFS for an application whose uptime requirement is 24x7. In our experience, we have seen that new software upgrades of our HDFS server software is the primary reason for cluster downtime. Since the hardware is not entirely unreliable and the software is well tested before it is deployed to production clusters, in our four years of administering HDFS clusters, we have encountered only one instance when the NameNode crashed, and that happened because of a bad filesystem where the transaction log was stored.

4.1.1 Hot Standby - AvatarNode At startup time, the HDFS NameNode reads filesystem metadata from a file called the fsimage file. This metadata contains the names and metadata of every file and directory in HDFS. However, the NameNode does not persistently store the locations of each block. Thus, the time to cold-start a NameNode consists of two main parts: firstly, the reading of the file system image, applying the transaction log and saving the new file system image back to disk; and secondly, the processing of block reports from a majority of DataNodes to recover all known block locations of

every block in the cluster. Our biggest HDFS cluster [16] has about 150 million files and we see that the two above stages take an equal amount of time. In total, a cold-restart takes about 45 minutes.

The BackupNode available in Apache HDFS avoids reading the fsimage from disk on a failover, but it still needs to gather block reports from all DataNodes. Thus, the failover times for the BackupNode solution can be as high as 20 minutes. Our goal is to do a failover within seconds; thus, the BackupNode solution does not meet our goals for fast failover. Another problem is that the NameNode synchronously updates the BackupNode on every transaction, thus the reliability of the entire system could now be lower than the reliability of the standalone NameNode. Thus, the HDFS AvatarNode was born.

Figure 1

A HDFS cluster has two AvatarNodes: the Active AvatarNode and the Standby AvatarNode. They form an active-passive-hot-standby pair. An AvatarNode is a wrapper around a normal NameNode. All HDFS clusters at Facebook use NFS to store one copy of the filesystem image and one copy of the transaction log. The Active AvatarNode writes its transactions to the transaction log stored in a NFS filesystem. At the same time, the Standby opens the same transaction log for reading from the NFS file system and starts applying transactions to its own namespace thus keeping its namespace as close to the primary as possible. The Standby AvatarNode also takes care of check-pointing the primary and creating a new filesystem image so there is no separate SecondaryNameNode anymore.

The DataNodes talk to both Active AvatarNode and Standby AvatarNode instead of just talking to a single NameNode. That means that the Standby AvatarNode has the most recent state about block locations as well and can become Active in well under a minute. The Avatar DataNode sends heartbeats, block reports and block received to both AvatarNodes. AvatarDataNodes are integrated with ZooKeeper and they know which one of the AvatarNodes serves as the primary and they only process replication/deletion commands coming from the primary AvatarNode. Replication or deletion requests coming from the Standby AvatarNode are ignored.


GFS Architecture

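AvatarNode and DataNode

o  The standby AvatarNode also takes care of checkpointing the active AvatarNode and creating a new file system image, so a separate SecondaryNameNode is no longer needed.

o  DataNodes talk to both the active AvatarNode and the standby AvatarNode instead of to a single NameNode. This means the standby AvatarNode has the most recent block location information and can become active in under a minute.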

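AvatarDataNode and ZooKeeper

o  The Avatar DataNode sends heartbeats, block reports, and block-received messages to both AvatarNodes.

o  AvatarDataNodes are integrated with ZooKeeper. They know which AvatarNode is the primary, and they process only the replication/deletion commands that come from the primary AvatarNode. Replication and deletion commands from the standby AvatarNode are ignored.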

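Enhancements to HDFS transaction logging

o  HDFS records newly allocated block IDs in the transaction log only when the file is closed or sync/flushed.

o  Since we want failover to be as transparent as possible, the standby AvatarNode needs to learn of each block allocation as it happens.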

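Using the log

o  So a new transaction is written to the edits log on every block allocation. This allows a client to continue writing to the files it was writing just before the failover.

o  When the standby AvatarNode reads transactions from the transaction log being written by the active AvatarNode, it may read a partial transaction.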

Changing the log format

o  To avoid this problem, the format of the edits log had to be changed so that every transaction written to the file carries a transaction length, a transaction ID, and a checksum.
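How a length, a transaction ID, and a checksum let a reader stop safely at a partial record can be sketched as follows. The 16-byte header layout and the function names are assumptions for the example; the actual HDFS edits log encoding is different and is not shown in the slides.

```python
import struct
import zlib

HEADER = struct.Struct(">IQI")  # payload length, transaction id, CRC32


def encode_txn(txid, payload):
    """Frame one transaction as header + payload bytes."""
    return HEADER.pack(len(payload), txid, zlib.crc32(payload)) + payload


def read_complete_txns(buf):
    """Return (transactions, bytes_consumed), stopping at a partial record."""
    txns, pos = [], 0
    while pos + HEADER.size <= len(buf):
        length, txid, crc = HEADER.unpack_from(buf, pos)
        body = buf[pos + HEADER.size: pos + HEADER.size + length]
        if len(body) < length or zlib.crc32(body) != crc:
            break  # the writer has not finished this record yet
        txns.append((txid, body))
        pos += HEADER.size + length
    return txns, pos
```

A reader that tails a log still being written applies only the transactions returned, remembers `bytes_consumed`, and retries from that offset later, so a half-written record is never applied to the standby's namespace.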

Transparent failover: DAFS

o  We developed the DistributedAvatarFileSystem (DAFS), a layered file system on the client that provides transparent access to HDFS across a failover event.

o  DAFS is integrated with ZooKeeper. ZooKeeper holds a zNode with the physical address of the primary AvatarNode for a given cluster.

DAFS and ZooKeeper

o  When a client tries to connect to an HDFS cluster (for example, dfs.cluster.com), DAFS looks up the corresponding zNode in ZooKeeper, which holds the actual physical address of the primary AvatarNode (dfs-0.cluster.com), and directs all subsequent calls to the primary AvatarNode.

o  If a call hits a network error, DAFS checks ZooKeeper for a change of the primary node.

o  In the case of a failover event, the zNode will contain the name of the new primary AvatarNode. DAFS then retries the call against the new primary AvatarNode.

o  We did not use ZooKeeper's subscription model, because it could require many more resources to be dedicated to the ZooKeeper servers.

o  If a failover is in progress, DAFS automatically blocks until the failover is complete. A failover event is completely transparent to applications that access HDFS data.
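The lookup-on-error pattern described above can be sketched without any real ZooKeeper client. `FailoverClient`, `lookup_primary`, and `connect` are hypothetical names; the registry is represented by a plain callable, and the sketch retries a bounded number of times instead of blocking until failover completes.

```python
class FailoverClient:
    """Look up the primary on demand; re-resolve and retry on network errors."""

    def __init__(self, lookup_primary, connect, max_retries=3):
        self.lookup_primary = lookup_primary   # () -> address, e.g. a znode read
        self.connect = connect                 # address -> object with .call()
        self.max_retries = max_retries
        self.address = lookup_primary()

    def call(self, method, *args):
        for _ in range(self.max_retries + 1):
            try:
                return self.connect(self.address).call(method, *args)
            except ConnectionError:
                # Poll the registry instead of subscribing to change events.
                self.address = self.lookup_primary()
        raise ConnectionError("no primary reachable")


class FakeNode:
    """Test double standing in for an AvatarNode endpoint."""

    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive

    def call(self, method, *args):
        if not self.alive:
            raise ConnectionError(self.name)
        return (self.name, method, args)
```

The registry is read only at connection time and after a failure, which reflects the design choice above: polling on error costs the registry far less than keeping a subscription open for every client.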

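Performance improvements for realtime workloads

o  HDFS was originally designed for high-throughput systems such as MapReduce. Most of its original design principles aim to improve throughput and pay little attention to response time.

o  For example, when dealing with errors, it favors retrying or waiting over failing fast. Supporting realtime applications with reasonable response times even when errors occur is an important challenge for HDFS.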

RPC timeout

o  One example is how Hadoop handles RPC timeouts.

o  Hadoop uses TCP connections to send Hadoop RPCs. When an RPC client detects a TCP socket timeout, instead of declaring an RPC timeout it sends a ping to the RPC server. If the server is still alive, the client keeps waiting for the response.

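When to wait

o  The idea is that if the RPC server is stalled by a communication burst, a temporary high load, or a stop-the-world GC, the client should wait and throttle its traffic to the server.

o  Conversely, throwing a timeout exception or retrying the RPC request would cause tasks to fail unnecessarily or add further load to the RPC server.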

When not to wait

o  However, waiting indefinitely adversely affects any application that has a realtime requirement.

o  An HDFS client occasionally makes an RPC to some DataNode, and it is bad when the DataNode fails to respond in time: the client is stuck inside the RPC.

A better strategy: fail fast

o  A better strategy is to fail fast and try a different DataNode, for reads and writes alike.

o  We therefore added the ability to specify an RPC timeout when starting an RPC session with a server.
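The fail-fast strategy can be illustrated with a per-call timeout and a list of replicas. This is a sketch using Python threads, not Hadoop RPC; `read_with_failover`, `read_block`, and the replica names are hypothetical, and a timed-out call is simply abandoned in the background.

```python
import concurrent.futures as cf
import time


def read_with_failover(replicas, read_block, timeout_s):
    """Try each replica in turn, giving up on one after `timeout_s` seconds."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        for replica in replicas:
            future = pool.submit(read_block, replica)
            try:
                return future.result(timeout=timeout_s)
            except (cf.TimeoutError, OSError):
                continue  # fail fast: move on to the next replica
        raise OSError("all replicas failed or timed out")
    finally:
        pool.shutdown(wait=False)


def fake_read(replica):
    """Test double: one slow replica, one dead replica, one healthy replica."""
    if replica == "slow":
        time.sleep(0.5)
    if replica == "dead":
        raise OSError("connection refused")
    return f"data from {replica}"
```

With a short timeout, a stalled replica costs the caller only the timeout rather than the full stall, at the price of some wasted work on the abandoned request.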

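Recover File Lease

o  Another enhancement is the ability to revoke a writer's lease quickly. HDFS supports only a single writer per file, and the NameNode maintains leases to enforce this semantic.

o  There are many cases where an application wants to open a file for reading but the file was not closed cleanly earlier. Previously this was handled by repeatedly calling HDFS-append on the log file until the call succeeded.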

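The append operation and the lease

o  The append operation triggers the expiry of the file's soft lease. The application therefore had to wait at least the soft lease period (one minute by default) before the HDFS NameNode revoked the log file's lease.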

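Adding the recoverLease API

o  Second, the HDFS-append operation adds unnecessary cost, because establishing a write pipeline usually involves more than one DataNode.

o  When an error occurs, establishing the pipeline can take up to 10 minutes. To avoid the overhead of HDFS-append, we added a lightweight HDFS API called recoverLease that revokes a file's lease.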

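How recoverLease works

o  When the NameNode receives a recoverLease request, it immediately changes the file's lease holder to itself. It then starts the lease recovery process.

o  The recoverLease RPC returns a status indicating whether lease recovery has completed. The application waits for a success return code from recoverLease before attempting to read the file.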
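From the application's side, the protocol described above is a polling loop. The sketch below models it against a fake NameNode; the method names `recover_lease` and `open_for_read`, the poll interval, and the `FakeNameNode` class are assumptions for illustration and are not the HDFS client API.

```python
import time


class FakeNameNode:
    """Test double: lease recovery finishes after a fixed number of polls."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed
        self.holder = "old-writer"

    def recover_lease(self, path):
        self.holder = "namenode"   # the lease holder changes immediately
        self.polls_left -= 1
        return self.polls_left <= 0  # True once recovery has completed

    def open_for_read(self, path):
        return f"handle:{path}"


def open_after_lease_recovery(namenode, path, poll_s=0.01, max_wait_s=5.0):
    """Call recover_lease until it reports success, then open for reading."""
    deadline = time.monotonic() + max_wait_s
    while not namenode.recover_lease(path):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"lease on {path} not recovered")
        time.sleep(poll_s)
    return namenode.open_for_read(path)
```

Unlike the repeated-append workaround, the loop never sets up a write pipeline, and the reader does not have to wait out the soft lease period before recovery begins.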

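Latency of reads and writes to HDFS

o  Applications often want to store data in HDFS for scalability or performance reasons.

o  However, the latency of reads and writes to HDFS is considerably higher than that of reads and writes to a local file on the machine.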

ローカルなレプリカからの読み出し

o  こうした問題を和らげるために、HDFSクライアントが、データのローカルなレプリカがあるかを検出して、もし存在すれば、データをDataNode経由で転送すること無く、ローカルなレプリカからトランスペアレントにデータを読み出す、機能強化を実装した。

o  これによって、HBaseを利用するある種の作業では、パフォーマンスのプロファイルは、二倍の結果を得た。

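New Feature -- Hflush/sync

o  Hflush/sync is an important operation for both HBase and Scribe. It pushes data buffered on the client side into the write pipeline, making the data visible to any reader and increasing durability if either the client or a DataNode on the pipeline fails.

New Feature -- Improving Hflush/sync

o  Hflush/sync is a synchronous operation: it does not return until an acknowledgement has been received from the write pipeline.

o  Because the operation is invoked frequently, making it efficient is important. One optimization we made is to let subsequent writes proceed while an Hflush/sync is waiting for its reply.

o  This optimization greatly improved write throughput in both HBase and Scribe, where a designated thread invokes Hflush/sync periodically.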

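New Feature -- Concurrent Readers

o  We have applications that need to read a file while it is being written.

o  The reader first asks the NameNode for the file's metadata. Because the NameNode does not have up-to-date information about the length of the last block, the client fetches that information from one of the DataNodes holding a replica.

Recomputing Checksums

o  The reader then starts reading the file. The challenge in running readers concurrently with a writer is how to serve the most recent chunk of data while its content and checksum are still changing.

o  We solved this problem by recomputing the checksum of the most recent chunk of data on demand.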
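
A rough sketch of "recompute the checksum of the last chunk on demand" follows. It is an illustration only: the 512-byte chunk size, the use of CRC32 and the function name are assumptions for the example and are not taken from the slides.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by one checksum (illustrative value)


def read_with_checksums(data: bytes, stored_checksums: list):
    """Return (chunk, crc) pairs for a file that may still be growing.

    Complete chunks reuse their stored checksums. The trailing partial chunk,
    whose stored checksum may be stale while a writer is appending, gets its
    CRC recomputed at read time.
    """
    out = []
    full_chunks = len(data) // CHUNK_SIZE
    for i in range(full_chunks):
        chunk = data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        out.append((chunk, stored_checksums[i]))
    tail = data[full_chunks * CHUNK_SIZE:]
    if tail:
        out.append((tail, zlib.crc32(tail)))  # recomputed on demand
    return out
```

Only the chunk still being appended to pays the recomputation cost; chunks that are already complete are served with the checksums stored at write time.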

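Production HBase

In this section we describe some of the important HBase enhancements we have made at Facebook relating to correctness, durability, availability, and performance.

ACID Compliance

o  Application developers have come to expect ACID compliance, or some approximation of it, from their database systems. Indeed, strong consistency guarantees were one of the benefits of HBase in our early evaluations.

HBase Changes for ACID Compliance

o  The existing MVCC (MultiVersion Concurrency Control)-like Read-Write Consistency Control (RWCC) provided sufficient isolation guarantees, and the HLog (write-ahead log) on HDFS provided sufficient durability.

o  However, some modifications were necessary to make sure that HBase adhered to the row-level atomicity and consistency of ACID compliance that we needed.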

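Guaranteeing Atomicity

o  The first step was to guarantee row-level atomicity. RWCC provided most of the guarantees.

o  However, these guarantees could be lost when a node failed.

o  Originally, the multiple entries of a single-row transaction were written to the HLog sequentially.

A New Concept: the Log Transaction (WALEdit)

o  If the RegionServer died during this write, the transaction could be only partially written.

o  With the new concept of a log transaction (WALEdit), each write transaction is now either fully completed or not written at all.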
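
The all-or-nothing behaviour can be sketched as follows. This is not HBase's WALEdit implementation; it is a toy log in which all edits of one row transaction go into a single JSON record, and a record torn by a crash is ignored as a whole on replay. The function names and the JSON encoding are assumptions made for the example.

```python
import json


def append_row_transaction(log: list, row: str, edits: dict) -> None:
    """Append all edits of one row transaction as a single log record."""
    log.append(json.dumps({"row": row, "edits": edits}))


def replay(log: list) -> dict:
    """Rebuild table state from the log, dropping a torn final record whole."""
    table = {}
    for raw in log:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            break  # torn record from a crash: apply none of its edits
        table.setdefault(record["row"], {}).update(record["edits"])
    return table
```

Because the unit of logging is the whole transaction rather than each entry, replay never observes a row with only some of a transaction's columns updated.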

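Guaranteeing Consistency

o  HDFS provides replication for HBase and thus handles most of the strong consistency guarantees that HBase needs for our usage.

o  During a write, HDFS sets up a pipeline connection to each replica, and every replica must ACK any data sent to it.

o  HBase does not proceed until it receives a response or a failure notification.

Guaranteeing Consistency

o  Through the use of sequence numbers, the NameNode can identify any misbehaving replicas and exclude them.

o  While this works, it takes time for the NameNode to perform this file recovery.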

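Rolling the HLog

o  For the HLog, making forward progress while maintaining consistency and durability is an absolute must.

o  If HBase detects that even a single HDFS replica has failed to write data, it immediately rolls the log (switches to a new log file) and obtains new blocks.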

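Data Protection

o  HDFS also provides protection against data corruption. When an HDFS block is read, its checksum is validated, and the entire block is discarded if validation fails.

o  Discarding data is rarely a problem, because two other replicas of the data exist.

o  We added functionality so that if all three replicas contain corrupt data, the blocks are quarantined for post-mortem analysis.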

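Availability Improvements

o  During kill testing, we uncovered numerous issues that caused HBase regions to go offline.

o  We soon identified the problem: the transient state of the cluster was stored only in the memory of the currently active HBase master. When the master was lost, this state was lost too.

Rewriting the HBase Master

o  We undertook a large rewrite of the HBase master. The critical component of this rewrite was moving region assignment information from the master's memory to ZooKeeper.

o  Because ZooKeeper writes to a quorum (a majority of nodes), this transient state is not lost on master failover and can survive multiple server failures.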

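Online Upgrades

o  The largest cause of cluster downtime was not random server failures but system maintenance. We solved a number of problems to minimize this downtime.

o  First, we found over time that RegionServers would intermittently take several minutes to shut down after a stop request was issued.

Making Compactions Interruptible

o  This intermittent problem was caused by long compaction cycles. To address it, we made compactions interruptible so that shutdown is more responsive.

o  With this change, a RegionServer can go down in seconds, and cluster shutdown time falls within a reasonable range.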

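Rolling Restarts

o  Another availability improvement is rolling restarts. Originally, HBase only supported stopping the whole cluster and then starting the upgrade.

o  We added a rolling-restart script that performs the software upgrade one server at a time. Because the master reassigns regions when a RegionServer stops, this minimizes the downtime our users experience.

Restart Issues

o  We fixed numerous new issues that resulted from this new style of restart.

o  Incidentally, many of the bugs that occurred during rolling restarts were related to taking regions offline and reassigning them.

o  As a result, the master rewrite with its ZooKeeper integration also helped address a number of the issues described here.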

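Distributed Log Splitting

o  When a RegionServer dies, the HLogs of that server must be split and replayed before its regions can be reopened and made available for reads and writes.

o  Previously, the master split the logs before they were replayed across the remaining RegionServers.

o  Because there are many HLogs per server, this was the slowest part of the recovery process.

Parallelizing Log Splitting

o  This work can be parallelized. By using ZooKeeper to manage the split tasks across RegionServers, the master now only coordinates a distributed log split.

o  This greatly reduced recovery time and allows RegionServers to retain more HLogs without severely affecting failover performance.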
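
The shape of a parallel log split can be sketched in a few lines. In this illustration worker threads stand in for RegionServers and the function call stands in for the master's coordination; the ZooKeeper-based task management is not modelled, and all names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_one_log(hlog):
    """Group the edits of one log by region, keeping their original order."""
    by_region = defaultdict(list)
    for region, edit in hlog:
        by_region[region].append(edit)
    return by_region


def distributed_split(hlogs, workers=4):
    """Split several logs in parallel, then merge the results per region."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so log order is preserved.
        partials = list(pool.map(split_one_log, hlogs))
    merged = defaultdict(list)
    for partial in partials:
        for region, edits in partial.items():
            merged[region].extend(edits)
    return dict(merged)
```

Each log is an independent unit of work, which is why adding workers shortens recovery when a server holds many HLogs.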

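Performance Improvements

o  Data insertion in HBase is optimized for write performance by focusing on sequential writes, at the occasional expense of redundant reads.

o  A data transaction is first written to a commit log and then applied to an in-memory cache called the MemStore. When the MemStore reaches a certain threshold, it is written out as an HFile. HFiles are immutable HDFS files containing sorted key/value pairs.

HFile Compaction

o  Instead of editing an existing HFile, a new HFile is written on every flush and added to a per-region list.

o  Read requests are issued against these multiple HFiles in parallel and aggregated into a final result.

o  For efficiency, these HFiles need to be compacted periodically, that is, merged together, to avoid degrading read performance.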
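
The write path and the role of compaction described above can be made concrete with a toy model. This is a simplified sketch, not HBase code: the `Region` class, the entry-count flush threshold and the sequential newest-first read (instead of parallel reads that are aggregated) are all assumptions made to keep the example short.

```python
class Region:
    """Toy model of a region: commit log, MemStore, and immutable sorted files."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # stand-in for the commit log
        self.memstore = {}     # in-memory cache of recent writes
        self.hfiles = []       # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. write to the log first
        self.memstore[key] = value            # 2. then apply in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write a new immutable, sorted file instead of editing an old one.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # Newest data wins: MemStore first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all files into one; values from newer files overwrite older ones.
        merged = {}
        for hfile in self.hfiles:
            merged.update(dict(hfile))
        self.hfiles = [tuple(sorted(merged.items()))]
```

Every flush adds one more file that `get` may have to search, which is the read cost that periodic compaction removes.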

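The Compaction Algorithm

o  Read performance is correlated with the number of files in a region, so a well-tuned compaction algorithm is critical.

o  More subtly, network IO efficiency can also be drastically affected if the compaction algorithm is poorly tuned. Significant effort went into making sure we had an efficient compaction algorithm for our use case.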

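Two Kinds of Compaction

o  Compactions were initially separated into two distinct code paths depending on whether they were minor or major.

o  Minor compactions select a subset of all files based on size metrics.

o  Time-based major compactions, by contrast, unconditionally compact all HFiles.

Unifying the Compaction Code Paths

o  Previously, only major compactions processed deletes, overwrites, and the purging of expired data, which meant that minor compactions produced HFiles that were larger than necessary.

o  This decreases block cache efficiency and penalizes future compactions. By unifying the two code paths, the codebase became simpler and files were kept as small as possible.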

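Improving the Compaction Algorithm

o  The next task was improving the compaction algorithm. After launching internally, we noticed that put and sync latencies were very high.

o  We found a pathological case in which a 1 GB file was regularly compacted together with three 5 MB files, producing a file only slightly larger than the original. This waste of network IO continued until the compaction queue started to back up.

Eliminating the Latency

o  The problem arose because the existing algorithm unconditionally minor-compacted the first four HFiles, while a minor compaction was triggered as soon as three HFiles had accumulated.

o  The fix was to stop unconditionally compacting files above a certain size, and to skip the compaction when not enough suitable candidate files could be found.

o  Afterwards, put latency dropped from 25 milliseconds to 3 milliseconds.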

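Deciding the Size Ratio

o  We also worked on improving the size-ratio decision of the compaction algorithm. The original algorithm sorted files by age and compared adjacent files.

o  If the older file was less than twice the size of the newer file, the compaction algorithm included that file and repeated the step.

o  However, this algorithm behaved poorly once the number and size of HFiles became very large.

Revising the Algorithm

o  To improve this, we now include an older file if it is within twice the aggregate size of all newer HFiles.

o  This changes the steady state so that an old HFile is roughly four times the size of the next newer file.

o  As a result, the compaction ratio stays at 50%.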
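
The revised selection rule can be sketched as below. This is a hedged illustration, not HBase's implementation: the factor of 2 comes from the slide, while the `min_files` cut-off of 3 and the function name are assumptions for the example.

```python
def select_for_compaction(sizes_oldest_first, ratio=2.0, min_files=3):
    """Return the indexes of the files to compact, or [] to skip this round.

    Walk from the oldest file: a file is left out while it is larger than
    `ratio` times the total size of all newer files. The first file that
    passes the test is compacted together with everything newer than it.
    """
    n = len(sizes_oldest_first)
    start = 0
    while start < n:
        newer_total = sum(sizes_oldest_first[start + 1:])
        if sizes_oldest_first[start] <= ratio * newer_total:
            break       # small enough relative to the newer data: include it
        start += 1      # too large: leave it out and test the next file
    selected = list(range(start, n))
    # Skip the compaction entirely if too few candidates were found.
    return selected if len(selected) >= min_files else []
```

With sizes in MB, `select_for_compaction([1000, 5, 5, 5])` leaves the large file alone and selects only the three small ones, which is the behaviour the fix for the 1 GB case was aiming at.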

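Read Optimizations

o  As discussed earlier, read performance hinges on keeping the number of files in a region low, which reduces random IO operations.

o  Besides using compactions to keep the number of files on disk low, skipping certain files for some queries also reduces IO operations.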

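Using Bloom Filters

o  Bloom filters provide a space-efficient, constant-time way to check whether a given row, or row and column, exists in a given HFile.

o  Because each HFile is written sequentially with optional metadata blocks at the end, Bloom filters could be added without significant changes.

Caching Bloom Filters

o  Through the use of folding, each Bloom filter is kept as small as possible when it is written to disk and cached in memory.

o  For queries that ask for specific rows or columns, a check of the cached Bloom filter for each HFile allows some files to be skipped entirely.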
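
For readers unfamiliar with Bloom filters, the following minimal sketch shows how a per-file filter lets a lookup skip files. It does not reproduce HBase's filter format or its folding step; the bit count, the number of hashes and the use of SHA-256 are arbitrary choices for the illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def files_to_read(row: str, hfile_filters: dict) -> list:
    """Keep only the files whose filter says the row may be present."""
    return [name for name, bf in hfile_filters.items() if bf.might_contain(row)]
```

A negative answer is always correct, so skipping a file is safe; a positive answer only means the file has to be read to find out.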

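HFile Timestamps

o  For data stored in HFiles that is time-series or contains a specific timestamp, a special timestamp-based file selection algorithm was added.

o  Because time moves forward and data is rarely inserted much later than its timestamp, each HFile generally contains values for a fixed range of time.

Checking Timestamps

o  This information is stored as metadata in each HFile. A query for a specific timestamp or range of timestamps checks whether the request intersects the range of each file, and skips the files that do not overlap.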
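
The range check amounts to an interval-intersection test. A small sketch, assuming each file's metadata is an inclusive (min_ts, max_ts) pair; the names are hypothetical:

```python
def overlaps(file_range, query_range):
    """True if two inclusive (min_ts, max_ts) ranges intersect."""
    f_min, f_max = file_range
    q_min, q_max = query_range
    return f_min <= q_max and q_min <= f_max


def select_hfiles(hfile_ranges: dict, query_range) -> list:
    """Return the names of the files whose time range intersects the query."""
    return [name for name, rng in hfile_ranges.items() if overlaps(rng, query_range)]
```

A query for a single timestamp is the special case where both ends of `query_range` are equal.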

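Region Locality

o  Because local file reads in HDFS improved read performance so significantly, it is critical that regions are hosted on the same physical nodes as their files.

o  Changes were made to retain the assignment of regions across cluster and node restarts, so that locality is maintained.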

4. REALTIME HDFS

HDFS was originally designed to be a file system to support offline MapReduce applications that are inherently batch systems and where scalability and streaming performance are most critical. We have seen the advantages of using HDFS: its linear scalability and fault tolerance result in huge cost savings across the enterprise. The new, more realtime and online usages of HDFS push new requirements and now use HDFS as a general-purpose low-latency file system. In this section, we describe some of the core changes we have made to HDFS to support these new applications.

4.1 High Availability - AvatarNode

The design of HDFS has a single master – the NameNode. Whenever the master is down, the HDFS cluster is unusable until the NameNode is back up. This is a single point of failure and is one of the reasons why people are reluctant to deploy HDFS for an application whose uptime requirement is 24x7. In our experience, we have seen that new software upgrades of our HDFS server software are the primary reason for cluster downtime. Since the hardware is not entirely unreliable and the software is well tested before it is deployed to production clusters, in our four years of administering HDFS clusters, we have encountered only one instance when the NameNode crashed, and that happened because of a bad filesystem where the transaction log was stored.

4.1.1 Hot Standby - AvatarNode

At startup time, the HDFS NameNode reads filesystem metadata from a file called the fsimage file. This metadata contains the names and metadata of every file and directory in HDFS. However, the NameNode does not persistently store the locations of each block. Thus, the time to cold-start a NameNode consists of two main parts: firstly, the reading of the file system image, applying the transaction log and saving the new file system image back to disk; and secondly, the processing of block reports from a majority of DataNodes to recover all known block locations of every block in the cluster. Our biggest HDFS cluster [16] has about 150 million files and we see that the two above stages take an equal amount of time. In total, a cold-restart takes about 45 minutes.

The BackupNode available in Apache HDFS avoids reading the fsimage from disk on a failover, but it still needs to gather block reports from all DataNodes. Thus, the failover times for the BackupNode solution can be as high as 20 minutes. Our goal is to do a failover within seconds; thus, the BackupNode solution does not meet our goals for fast failover. Another problem is that the NameNode synchronously updates the BackupNode on every transaction, thus the reliability of the entire system could now be lower than the reliability of the standalone NameNode. Thus, the HDFS AvatarNode was born.

Figure 1

A HDFS cluster has two AvatarNodes: the Active AvatarNode and the Standby AvatarNode. They form an active-passive-hot-standby pair. An AvatarNode is a wrapper around a normal NameNode. All HDFS clusters at Facebook use NFS to store one copy of the filesystem image and one copy of the transaction log. The Active AvatarNode writes its transactions to the transaction log stored in a NFS filesystem. At the same time, the Standby opens the same transaction log for reading from the NFS file system and starts applying transactions to its own namespace thus keeping its namespace as close to the primary as possible. The Standby AvatarNode also takes care of check-pointing the primary and creating a new filesystem image so there is no separate SecondaryNameNode anymore.

The DataNodes talk to both Active AvatarNode and Standby AvatarNode instead of just talking to a single NameNode. That means that the Standby AvatarNode has the most recent state about block locations as well and can become Active in well under a minute. The Avatar DataNode sends heartbeats, block reports and block received to both AvatarNodes. AvatarDataNodes are integrated with ZooKeeper and they know which one of the AvatarNodes serves as the primary and they only process replication/deletion commands coming from the primary AvatarNode. Replication or deletion requests coming from the Standby AvatarNode are ignored.
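
The rule that DataNodes act only on commands from the primary can be sketched as follows. This is a toy model rather than the AvatarDataNode code: `get_primary` is a hypothetical lookup (in the real system the answer comes from ZooKeeper), and the command strings are made up for the example.

```python
class AvatarDataNode:
    """Toy DataNode that reports to both AvatarNodes but obeys only the primary."""

    def __init__(self, get_primary):
        self.get_primary = get_primary  # hypothetical lookup of the current primary
        self.executed = []

    def handle_command(self, sender, command):
        """Run a replication/deletion command only if the primary sent it."""
        if sender != self.get_primary():
            return False  # commands from the Standby are ignored
        self.executed.append(command)
        return True
```

Because the primary is looked up on every command, a failover changes which AvatarNode is obeyed without restarting the DataNode.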


GFS Architecture

Scribe

Scribe https://github.com/facebook/scribe/wiki

o  Scribe is a server for aggregating log data that’s streamed in real time from clients. It is designed to be scalable and reliable.

o  There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.

o  If the central scribe server isn’t available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers.


o  The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed filesystem, or send them to another layer of scribe servers.

o  Scribe is unique in that clients log entries consisting of two strings, a category and a message. The category is a high level description of the intended destination of the message and can have a specific configuration in the scribe server, which allows data stores to be moved by changing the scribe configuration instead of client code.


o  The server also allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path.

o  Flexibility and extensibility is provided through the “store” abstraction.

o  Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server.


o  Stores are implemented as a class hierarchy, and stores can contain other stores. This allows a user to chain features together in different orders and combinations by changing only the configuration.

o  Scribe is implemented as a thrift service using the non-blocking C++ server. The installation at facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.

Scribe Overview / Reliability https://github.com/facebook/scribe/wiki/Scribe-Overview

o  The scribe system is designed to be robust to failure of the network or any specific machine, but does not provide transactional guarantees. If a scribe instance on a client machine (we’ll call it a resender for the moment) is unable to send messages to the central scribe server it saves them on local disk, then sends them when the central server or network recovers. To avoid overloading the central server upon a restart, the resender waits a random time between reconnect attempts, and if the central server is near capacity it will return TRY_LATER, which tells the resender to not attempt another send for several minutes.


o  The central server has similar behavior (the same code in fact) for handling failure of the nfs filer or distributed filesystem it’s writing to. If the filesystem goes down the scribe server writes to local disk until it recovers, then sends the data from local disk to the remote filesystem. The order of the messages is preserved in both this and the resender case.


o  These error cases will lead to loss of data:

o  If a client can’t connect to either the local or central scribe server the message will be lost

o  If a scribe server crashes it could lose a small amount of data that’s in memory but not on disk

o  Some multiple component failure cases, such as a resender can’t connect to any central server and its local disk fills up

o  Some rare timeout conditions can lead to duplicate messages

Scribe Overview / Configuration https://github.com/facebook/scribe/wiki/Scribe-Overview

o  The scribe server is configured by the file specified in the -c command line option, or the file /usr/local/scribe/scribe.conf if none is specified on the command line.

o  The basic idea of the configuration is that a particular category of messages is sent to one or more “stores” of various types. Some types of stores can contain other stores, for example a bucket store contains many file stores and distributes messages to them based on a hash.


o  The configuration file consists of a global section and a section for each store. The global section includes the listening port number and the maximum number of messages that the server can handle in a second.

o  Each store section must include a category and a type. There is no restriction on the number of categories or the number of stores per category.


o  The remaining items in the store configuration depend on the store type, and include such things as file location, maximum file size, how often to rotate files, and where a resender should send its data.

o  A store can also contain another store configuration, the name of which is specific to the type of store. For example a store of type buffer contains <primary> and <secondary> stores and a store of type bucket contains a store called <bucket>.


o  The types of stores currently available are:

o  file – writes to a file, either local or nfs.

o  network – sends messages to another scribe server.

o  buffer – contains a primary and a secondary store. Messages are sent to the primary store if possible, and otherwise the secondary. When the primary store becomes available the messages are read from the secondary store and sent to the primary. Ordering of the messages is preserved. The secondary store has the restriction that it must be readable, which at the moment means it has to be a file store.

o  bucket – contains a large number of other stores, and decides which messages to send to which stores based on a hash.

o  null – discards all messages.

o  thriftfile – similar to a file store but writes messages into a Thrift TFileTransport file.

o  multi – a store that forwards messages to multiple stores.
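
The behaviour of the buffer store (primary if possible, otherwise secondary, replay in order on recovery) can be modelled with a short sketch. This is not Scribe's C++ implementation: `primary_send` is a hypothetical callable that raises `ConnectionError` while the primary is down, and an in-memory queue stands in for the secondary file store. As a simplification, every message passes through the queue so that ordering is trivially preserved.

```python
from collections import deque


class BufferStore:
    """Toy buffer store: deliver to the primary, fall back to a secondary queue."""

    def __init__(self, primary_send):
        self.primary_send = primary_send  # hypothetical; raises ConnectionError when down
        self.secondary = deque()          # stand-in for the local file store

    def log(self, category, message):
        self.secondary.append((category, message))
        self.flush()

    def flush(self):
        # Replay oldest first so that message order is preserved.
        while self.secondary:
            try:
                self.primary_send(*self.secondary[0])
            except ConnectionError:
                return  # primary still unavailable; keep the messages buffered
            self.secondary.popleft()
```

A message is removed from the secondary only after the primary has accepted it, so an outage delays delivery without reordering it.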

The Underlying Technology of Messages

http://www.facebook.com/note.php?note_id=454991608919# November 16, 2010

HBase!

o  We spent a few weeks setting up a test framework to evaluate clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems. We ultimately chose HBase. MySQL proved to not handle the long tail of data well; as indexes and data sets grew large, performance suffered. We found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.

o  HBase comes with very good scalability and performance for this workload and a simpler consistency model than Cassandra. While we’ve done a lot of work on HBase itself over the past year, when we started we also found it to be the most feature rich in terms of our requirements (auto load balancing and failover, compression support, multiple shards per server, etc.). HDFS, the underlying filesystem used by HBase, provides several nice features such as replication, end-to-end checksums, and automatic rebalancing. Additionally, our technical teams already had a lot of development and operational expertise in HDFS from data processing with Hadoop. Since we started working on HBase, we've been focused on committing our changes back to HBase itself and working closely with the community. The open source release of HBase is what we’re running today.

o  Since Messages accepts data from many sources such as email and SMS, we decided to write an application server from scratch instead of using our generic Web infrastructure to handle all decision making for a user's messages. It interfaces with a large number of other services: we store attachments in Haystack, wrote a user discovery service on top of Apache ZooKeeper, and talk to other infrastructure services for email account verification, friend relationships, privacy decisions, and delivery decisions (for example, should a message be sent over chat or SMS). We spent a lot of time making sure each of these services are reliable, robust, and performant enough to handle a real-time messaging system.

o  The new Messages will launch over 20 new infrastructure services to ensure you have a great product experience. We hope you enjoy using it.

Building Realtime Insights

http://www.facebook.com/note.php?note_id=10150103900258920 March 16, 2011

o Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time.

o  To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. This system had to be able to process many different types of events, and do so in a way that accounted for an uneven distribution of keys. For example, Charlie Sheen articles are currently generating far more Likes and impressions on Facebook than my personal website, and there are far more URLs per domain from a site like espn.com than there is for my personal blog.

o  We tested out a few different architectures for this before settling on the one that we shipped. MySQL counters were not able to handle the write rate even with creative solutions around write batching; in-memory counters did not meet reliability requirements; and MapReduce, though extensible, brought with it too much latency and weird behaviors when multiple processes were waiting on the same table or query.

o  The Insights for Websites that we launched was based on storage in HBase - the distributed, column-oriented Hadoop database; instrumentation that flows through a log file system built atop HDFS known internally as scribe; and a client that tails, batches, and writes data streams out of scribe and into HBase. We chose HBase because of its ability to handle a very high write rate with high reliability. The Write Ahead Log of HBase is key to enabling this.

o  There are a lot of other details to the architectural design such as table schema; key composition; nuances of batching; and sharding. You can hear more about it in our first Seattle tech talk and look for future engineering blog posts.

WHY HADOOP AND HBASE

o  Elasticity: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly and the system should automatically balance load and utilization across new hardware.

o  High write throughput: Most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput.

o  Efficient and low-latency strong consistency semantics within a data center: There are important applications like Messages that require strong consistency within a data center. This requirement often arises directly from user expectations. For example ‘unread’ message counts displayed on the home page and the messages shown in the inbox page view should be consistent with respect to each other.

o  While a globally distributed strongly consistent system is practically impossible, a system that could at least provide strong consistency within a data center would make it possible to provide a good user experience. We also knew that (unlike other Facebook applications), Messages was easy to federate so that a particular user could be served entirely out of a single data center making strong consistency within a single data center a critical requirement for the Messages project. Similarly, other projects, like realtime log aggregation, may be deployed entirely within one data center and are much easier to program if the system provides strong consistency guarantees.

o  Efficient random reads from disk: In spite of the widespread use of application level caches (whether embedded or via memcached), at Facebook scale, a lot of accesses miss the cache and hit the back-end storage system. MySQL is very efficient at performing random reads from disk and any new system would have to be comparable.

o  High Availability and Disaster Recovery: We need to provide a service with very high uptime to users that covers both planned and unplanned events (examples of the former being events like software upgrades and addition of hardware/capacity and the latter exemplified by failures of hardware components). We also need to be able to tolerate the loss of a data center with minimal data loss and be able to serve data out of another data center in a reasonable time frame.

o  Fault Isolation: Our long experience running large farms of MySQL databases has shown us that fault isolation is critical. Individual databases can and do go down, but only a small fraction of users are affected by any such event. Similarly, in our warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from such faults.

o  Atomic read-modify-write primitives: Atomic increments and compare-and-swap APIs have been very useful in building lockless concurrent applications and are a must have from the underlying storage system.

o  Range Scans: Several applications require efficient retrieval of a set of rows in a particular range. For example all the last 100 messages for a given user or the hourly impression counts over the last 24 hours for a given advertiser.

o  Active-active serving capability across different data centers: As mentioned before, we were comfortable making the assumption that user data could be federated across different data centers (based ideally on user locality). Latency (when user and data locality did not match up) could be masked by using an application cache close to the user.

We chose Hadoop and HBase

o  After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store.

o  The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM[13] Trees, making local DataNode reads efficient and caching NameNode metadata).

o  Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.

o  HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data, large indices, and maintain the flexibility to scale out quickly.

REALTIME HDFS

o  HDFS was originally designed to be a file system to support offline MapReduce applications that are inherently batch systems and where scalability and streaming performance are most critical. We have seen the advantages of using HDFS: its linear scalability and fault tolerance result in huge cost savings across the enterprise. The new, more realtime and online usages of HDFS push new requirements and now use HDFS as a general-purpose low-latency file system. In this section, we describe some of the core changes we have made to HDFS to support these new applications.

HDFS

o  High Availability – Avatar Node

o  Hadoop RPC compatibility

o  Block Availability: Placement Policy

o  Performance Improvements for a Realtime Workload

o  New Features

HBase Enhancement

o  ACID Compliance

o  Availability Improvements

o  Performance Improvements

DEPLOYMENT AND OPERATIONAL EXPERIENCES

o  Testing

o  Monitoring and Tools

o  Manual versus Automatic Splitting

o  Dark Launch

o  Dashboards/ODS integration

o  Backups at the Application layer

o  Schema Changes

o  Importing Data

o  Reducing Network IO

FUTURE WORK

o  The use of Hadoop and HBase at Facebook is just getting started and we expect to make several iterations on this suite of technologies and continue to optimize for our applications.

o  As we try to use HBase for more applications, we have discussed adding support for maintenance of secondary indices and summary views in HBase. In many use cases, such derived data and views can be maintained asynchronously.

o  Many use cases benefit from storing a large amount of data in HBase’s cache and improvements to HBase are required to exploit very large physical memory. The current limitations in this area arise from issues with using an extremely large heap in Java and we are evaluating several proposals like writing a slab allocator in Java or managing memory via JNI.

o  A related topic is exploiting flash memory to extend the HBase cache and we are exploring various ways to utilize it including FlashCache [18]. Finally, as we try to use Hadoop and HBase for applications that are built to serve the same data in an active-active manner across different data centers, we are exploring approaches to deal with multi data-center replication and conflict resolution.

non-requirements:

o  Tolerance of network partitions within a single data center: Different system components are often inherently centralized. For example, MySQL servers may all be located within a few racks, and network partitions within a data center would cause major loss in serving capabilities therein. Hence every effort is made to eliminate the possibility of such events at the hardware level by having a highly redundant network design.

o  Zero Downtime in case of individual data center failure: In our experience such failures are very rare, though not impossible. In a less than ideal world where the choice of system design boils down to the choice of compromises that are acceptable, this is one compromise that we are willing to make given the low occurrence rate of such events.

Apache Hadoop Goes Realtime at Facebook

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies.

o  We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

WORKLOAD TYPES

o  Facebook Messaging

o  Facebook Insights

o  Facebook Metrics System (ODS)

WHY HADOOP AND HBASE

o  Elasticity: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly and the system should automatically balance load and utilization across new hardware.

o  High write throughput: Most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput.

o  Efficient and low-latency strong consistency semantics within a data center: There are important applications like Messages that require strong consistency within a data center. This requirement often arises directly from user expectations. For example ‘unread’ message counts displayed on the home page and the messages shown in the inbox page view should be consistent with respect to each other.

o  While a globally distributed strongly consistent system is practically impossible, a system that could at least provide strong consistency within a data center would make it possible to provide a good user experience. We also knew that (unlike other Facebook applications), Messages was easy to federate so that a particular user could be served entirely out of a single data center making strong consistency within a single data center a critical requirement for the Messages project. Similarly, other projects, like realtime log aggregation, may be deployed entirely within one data center and are much easier to program if the system provides strong consistency guarantees.

o  Efficient random reads from disk: In spite of the widespread use of application level caches (whether embedded or via memcached), at Facebook scale, a lot of accesses miss the cache and hit the back-end storage system. MySQL is very efficient at performing random reads from disk and any new system would have to be comparable.

o  High Availability and Disaster Recovery: We need to provide a service with very high uptime to users that covers both planned and unplanned events (examples of the former being events like software upgrades and addition of hardware/capacity and the latter exemplified by failures of hardware components). We also need to be able to tolerate the loss of a data center with minimal data loss and be able to serve data out of another data center in a reasonable time frame.

o  Fault Isolation: Our long experience running large farms of MySQL databases has shown us that fault isolation is critical. Individual databases can and do go down, but only a small fraction of users are affected by any such event. Similarly, in our warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from such faults.

o  Atomic read-modify-write primitives: Atomic increments and compare-and-swap APIs have been very useful in building lockless concurrent applications and are a must have from the underlying storage system.

o  Range Scans: Several applications require efficient retrieval of a set of rows in a particular range. For example, the last 100 messages for a given user, or the hourly impression counts over the last 24 hours for a given advertiser.

o  Active-active serving capability across different data centers: As mentioned before, we were comfortable making the assumption that user data could be federated across different data centers (based ideally on user locality). Latency (when user and data locality did not match up) could be masked by using an application cache close to the user.
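The atomic read-modify-write and range-scan requirements above can be illustrated with a small in-memory sketch. It is not the HBase API: `ToyStore`, its method names, and the `adv42:HH` key layout are all hypothetical, and a single lock stands in for whatever atomicity the real storage layer provides.

```python
import bisect
import threading


class ToyStore:
    """In-memory stand-in for a sorted key-value store (hypothetical, not the HBase API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = []    # row keys kept in sorted order so range scans are cheap
        self._values = {}  # row key -> value

    def _put_unlocked(self, key, value):
        # Caller must hold the lock.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def increment(self, key, delta=1):
        """Atomically add delta to a counter and return the new value."""
        with self._lock:
            new_value = self._values.get(key, 0) + delta
            self._put_unlocked(key, new_value)
            return new_value

    def compare_and_swap(self, key, expected, new_value):
        """Write new_value only if the current value equals expected."""
        with self._lock:
            if self._values.get(key) != expected:
                return False
            self._put_unlocked(key, new_value)
            return True

    def scan(self, start_key, stop_key):
        """Return (key, value) pairs with start_key <= key < stop_key, in key order."""
        with self._lock:
            lo = bisect.bisect_left(self._keys, start_key)
            hi = bisect.bisect_left(self._keys, stop_key)
            return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = ToyStore()

# Hourly impression counters for one advertiser; zero-padded hours keep keys sorted.
for hour, count in [(0, 5), (1, 7), (1, 3), (2, 4)]:
    store.increment(f"adv42:{hour:02d}", count)
store.increment("adv43:00", 9)  # a different advertiser, outside the scanned range

# Range scan over every key that starts with "adv42:".
hourly = store.scan("adv42:", "adv42:~")

# Compare-and-swap: only the first claimant of an unowned key succeeds.
first_claim = store.compare_and_swap("owner", None, "worker-1")
second_claim = store.compare_and_swap("owner", None, "worker-2")
```

Callers never take a lock themselves: a counter update is one `increment` call, and a conditional write either succeeds or reports failure so the caller can retry. Keeping keys sorted is what lets `scan` return one advertiser's hourly counters without touching other rows.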

We chose Hadoop and HBase

o  After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store.

o  The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM Trees [13], making local DataNode reads efficient and caching NameNode metadata).

o  Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.

o  HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data, large indices, and maintain the flexibility to scale out quickly.
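The remark above about adding Bloom filters to an LSM tree can be made concrete with a minimal sketch. An LSM tree keeps data in several immutable sorted files, so a point read may have to consult more than one of them; a per-file Bloom filter lets the reader skip files that definitely do not hold the key. The classes, sizes, and hash construction below are illustrative assumptions, not HBase's implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class StoreFile:
    """Hypothetical immutable sorted file with its own Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0  # counts simulated disk accesses

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent, so the disk read is skipped
        self.disk_reads += 1
        return self.rows.get(key)


def lookup(files, key):
    """Check files from newest to oldest and return the first value found."""
    for store_file in files:
        value = store_file.get(key)
        if value is not None:
            return value
    return None


files = [StoreFile({"user1": "hello"}), StoreFile({"user2": "world"})]
found = lookup(files, "user2")
missing = lookup(files, "user999")
```

Because a Bloom filter can return a false positive, `disk_reads` is only reduced, not guaranteed to be zero for files that lack the key. A false negative cannot occur, so a skipped file never hides a stored row.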

REALTIME HDFS

o  HDFS was originally designed to be a file system to support offline MapReduce applications that are inherently batch systems and where scalability and streaming performance are most critical. We have seen the advantages of using HDFS: its linear scalability and fault tolerance result in huge cost savings across the enterprise. The new, more realtime and online uses of HDFS bring new requirements and treat HDFS as a general-purpose low-latency file system. In this section, we describe some of the core changes we have made to HDFS to support these new applications.

HDFS

o  High Availability – Avatar Node

o  Hadoop RPC compatibility

o  Block Availability: Placement Policy

o  Performance Improvements for a Realtime Workload

o  New Features

HBase Enhancement

o  ACID Compliance

o  Availability Improvements

o  Performance Improvements

DEPLOYMENT AND OPERATIONAL EXPERIENCES

o  Testing

o  Monitoring and Tools

o  Manual versus Automatic Splitting

o  Dark Launch

o  Dashboards/ODS integration

o  Backups at the Application layer

o  Schema Changes

o  Importing Data

o  Reducing Network IO

FUTURE WORK

o  The use of Hadoop and HBase at Facebook is just getting started and we expect to make several iterations on this suite of technologies and continue to optimize for our applications.

o  As we try to use HBase for more applications, we have discussed adding support for maintenance of secondary indices and summary views in HBase. In many use cases, such derived data and views can be maintained asynchronously.

o  Many use cases benefit from storing a large amount of data in HBase’s cache and improvements to HBase are required to exploit very large physical memory. The current limitations in this area arise from issues with using an extremely large heap in Java and we are evaluating several proposals like writing a slab allocator in Java or managing memory via JNI.


Apache ZooKeeper

http://zookeeper.apache.org/

What is ZooKeeper?

o  ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications usually skimp on them at first, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
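To give a feel for the "group services" and "distributed synchronization" mentioned above, here is a toy, single-process sketch of group membership with leader selection. It is not the ZooKeeper client API: `ToyCoordinator` and its methods are hypothetical. It only illustrates the general idea of a central registry that hands each member an ordered sequence number and treats the lowest surviving number as the leader.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        self._sequence = itertools.count()
        self._groups = {}  # group name -> {sequence number: member id}

    def join(self, group, member_id):
        """Register a member and return its sequence number."""
        seq = next(self._sequence)
        self._groups.setdefault(group, {})[seq] = member_id
        return seq

    def leave(self, group, seq):
        """Remove a member, as would happen when it shuts down or fails."""
        self._groups.get(group, {}).pop(seq, None)

    def members(self, group):
        """List current members in join order."""
        entries = self._groups.get(group, {})
        return [entries[seq] for seq in sorted(entries)]

    def leader(self, group):
        """The member with the lowest sequence number leads the group."""
        current = self.members(group)
        return current[0] if current else None


coord = ToyCoordinator()
seq_a = coord.join("workers", "host-a")
seq_b = coord.join("workers", "host-b")

first_leader = coord.leader("workers")
coord.leave("workers", seq_a)  # the current leader goes away
next_leader = coord.leader("workers")
```

Every application that needs membership or leader selection would otherwise reimplement this bookkeeping, together with the failure detection and race handling that the toy version leaves out. That repeated effort is what the passage above argues a shared service avoids.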
