HBase at LINE

中村俊介, Shunsuke Nakamura (LINE, twitter, facebook: sunsuk7tp)

NHN Japan Corp.

HBase at LINE: How to grow our storage together with the service

About me: Shunsuke Nakamura

• 2011.10: joined as a new graduate (the former Japan subsidiary; the current Japan company since 2012.1)

• LINE server engineer, storage team

• Master of Science, Shudo Lab, Tokyo Institute of Technology

• Distributed Processing, Cloud Storage, and NoSQL

• MyCassandra [CASSANDRA-2995]: A modular NoSQL with Pluggable Storage Engine based on Cassandra

• Intern and part-time infrastructure engineer at Hatena

• NHN = Next Human Network

• NAVER Korea: search portal site

• The Korean headquarters holds about 70% of the search share

• Originally an in-house venture of Samsung

• NAVER = Navigate + er

• NAVER Japan

• Japan: in its third year this year

• After the management integration, NAVER is a service name, a group, a religion

• LINE, Matome, image search, NDrive

NHN/NAVER

• NAVER

• Hangame

• livedoor

• Data Hotel

• NHN ST

• JLISTING

• Mediator

• Shinku

Korean headquarters: Green Factory

LINE is NHN Japan STAND-ALONE

As of 8.17: 55 million users (25 million in Japan); App Store ranking: Top 1

Japan, Taiwan, Thailand, Hong Kong, Saudi, Malaysia, Bahrain, Jordan, Qatar, Singapore, Indonesia, Kazakhstan, Kuwait, Israel, Macau, Ukraine, UAE, Switzerland, Australia, Turkey, Vietnam, Germany, Russia

LINE Roadmap

2011.6 iPhone first release

Android first release

LINE Card/Camera/Brush

VOIP

2011.8

2011.10 I joined the LINE team.

PC (Win/Mac), Multi device Sticker Shop

LINE platform

Bots (News, Auto-translation, Public account, Gurin)

Sticker

WAP

WP first release

BB first release

Timeline

2012.6

2012.8

Target of LINE Storage

1. Performing well (put < 1ms, get < 5ms)

2. A highly scalable, available, and eventually consistent storage system built around NoSQL

3. Geographical distribution

[Figure: user share 43.2% / 56.8% (Japan / Global), from start to future]

LINE Storage and NoSQL

1. Performing well

2. A highly scalable, available, and eventually consistent storage system

3. Geographical distribution

At first, LINE launched with Redis.

Initial LINE Storage

• Target: 1M DL within 2011

• Client-side sharding with a few Redis nodes

• Redis

• Performance: O(1) lookup

• Availability: Async snapshot + slave nodes (+ backup MySQL)

• Capacity: memory + Virtual Memory

[Diagram: app server, queue, dispatcher, storage nodes (master / slave), backup]

August 28~29 2011 Kuwait Saudi Arabia Qatar Bahrain…

Over 1 million

• Sharded Redis

• Shard coordination: ZKs + manager daemons (shard allocation using consistent hashing, health check & failover)
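The slide only says that shards are allocated with consistent hashing. As a rough illustration of that idea (not LINE's manager daemon), the sketch below builds a hash ring in Python. The node names, the MD5 hash, and the virtual-node count are all assumptions made for the example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Toy consistent-hash ring; node names and vnode count are hypothetical."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        # Each node owns several points on the ring to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._hashes, self._hash(str(key)))
        return self._ring[idx % len(self._ring)][1]

    def without(self, dead_node):
        # Failover: rebuild the ring without the dead node.
        survivors = sorted({n for _, n in self._ring if n != dead_node})
        return ConsistentHashRing(survivors, self._vnodes)


ring = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
print(ring.node_for("user:42"))
```

The property that matters for failover is that removing one node only remaps the keys that node owned; keys on the surviving nodes stay where they were.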

2011

x 3

x 3 or 5

October 13 2011 Hong Kong

However, in fact...

100M DL within 2012

Billions of Messages/Day...

We ran into so many problems every day in 2011.10...

Redis is

• NOT easily scalable

• NOT persistent

And, it easily dies

2. A highly scalable, available, and eventually consistent storage system built around NoSQL

Take storage designed for 1M users within 2011 and make it handle 1B users within 2012, without stopping the service

Zuckerberg’s Law of Sharing (July 7, 2011)

Y = C * 2^X (Y: shared data, X: time, C: constant)

Sharing activity will double each year.

How many LINE messages per month?

1 billion x 30 = 30 billion messages/month
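The arithmetic above, combined with the doubling law Y = C * 2^X, can be written out as a small worked example. The 30-day month comes from the slide; projecting further years with the doubling law is only an illustration, not a forecast made in the talk.

```python
def messages_per_month(per_day, days=30):
    """Monthly volume from a daily volume (the slide uses 30 days)."""
    return per_day * days


def projected(c, years):
    """Y = C * 2^X: volume after `years` if sharing doubles every year."""
    return c * 2 ** years


monthly = messages_per_month(1_000_000_000)
print(monthly)                 # 30 billion messages/month
print(projected(monthly, 1))   # volume one year later under the doubling law
```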

Data and Scalability

• constant

• DATA: async operation

• SCALE: thousands ~ millions per each queue

• linear

• DATA: users’ contact and group

• SCALE: millions ~ billions

• exponential

• DATA: message and message inbox

• SCALE: tens of billions ~

[Chart: constant vs. linear vs. exponential growth; y-axis from 0 to 10,000,000,000]

• constant

• FIFO

• read&write fast

• linear

• Zipfian

• read fast [w3~5 : r95]

• exponential

• latest

• write fast [w50 : r50]

Data and Workload

[Figures: Queue; Zipfian curve; Message timeline]

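Choosing Storage

• constant: Redis

• linear, exponential: several options

• HBase

• ◯ Fits the workload; NoSQL on a DFS is easy to operate (we also have DFS specialists)

• × SPOF; 99th-percentile random read performance is somewhat low

• Cassandra

• ◯ Fits the workload; no SPOF (no coordinator, rack/DC-aware replication)

• × Operational cost that comes with weak consistency; complex implementation (especially CAS operations)

• MongoDB

• ◯ Convenient features (auto-sharding/failover, various queries) → geared toward analytics, not needed here

• × Does not fit the workload; poor use of bandwidth and disk

• MySQL Cluster

• ◯ Familiar (we operate up to nearly several thousand servers per service)

• × We should use something designed from the start to be distributed and write-scalable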

HBase

• Can store hundreds of TB

• Write-scalable for large data volumes, with efficient random access

• Semi-structured model (< MongoDB, Redis)

• No advanced RDBMS features (TX, joins)

• Strong consistency per row and column family

• NoSQL built on a DFS

• No replica management needed / regions are easy to move

• Multi-partition allocation per RS

• Ad hoc load balancing

LINE Storage (2012.3)

[Architecture diagram. Clients: iPhone, Android, WAP (x 25 million). App servers (nginx) expose Thrift API / Authentication / Renderer. Storage: Sharded Redis clusters (message, contact, group; x 400 nodes), Redis Queues with dispatchers for async and failed operations, Message HBase and Contact HBase on HDFS, and Backup MySQL for backup operations. Other node counts shown in the figure: x 100 nodes (twice), x 2 nodes.]

LINE Storage (2012.7)

[Architecture diagram. Clients: phone (iPhone/Android/WP/BlackBerry/WAP) and PC (Win/Mac) (x 50 million). App servers (nginx) expose Thrift API / Authentication / Renderer. Storage: Sharded Redis clusters (message, contact, group; x 600 nodes), Redis Queues with dispatchers for async and failed operations, Primary HBase, Msg HBase01 and Msg HBase02 over HDFS01 and HDFS02, and Backup MySQL for backup operations. Other node counts shown in the figure: x 200 nodes (twice), x 2 nodes.]

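2012.3 → 2012.7

• Users doubled, infrastructure doubled

• Still "casual data" by HBase standards

• Message HBase is a dual cluster

• Clusters are switched according to the message TTL (TTL: 2 weeks → 3 weeks)

• HDFS DNs are also used to run M/R for HBase

• Sharded Redis is still basically the primary (400 → 600)

• For messages, gets also go to HBase

• For everything else, only the model is backed up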

LINE Data on HBase

• LINE data

• MODEL: <key> → <model>

• INDEX: <key> ↔ <property in model>

• User: <userId> → <User obj>, <userId> ↔ <phone>

• Each model is represented as one row

• HBase consistency: strong consistency is guaranteed per row and column family

• Data with multiple models per key, such as contacts, uses qualifiers (columns)

• Data that needs range queries is packed into a single row (e.g. message inbox)

• Con: latency grows linearly with the number of columns → delete, search filter with timestamp

[Figure: User model keyed by userId, with INDEX entries for email and phone]

timestamp, version

• Column-level timestamp

• Indexes are built using the model's timestamp

• The API execution timestamp is used for async and failure handling

• Also used as a search filter (con: TTL cannot be used)

• Multiple versioning

• Binding multiple emails (e.g. Google account password history)

• Data trace for customer support

Primary key for LINE

• Generated from a long-type key: e.g. userId, messageId

• Simple and random, for single lookups

• No need to consider locality for range queries

• prefix(key) + key

• prefix(key): ALPHABETS[key%26] + key%1000

• Implements o.a.h.hbase.util.RegionSplitter.SplitAlgorithm

• Region splitting by prefix

[Figure: key space a, b, ..., z split into HRegions at prefix boundaries (a000, a250, a500, a750, b000, ...), with example keys such as 2600, 2626, 2652, 2756 placed by prefix]
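The prefix(key) + key scheme above can be sketched in a few lines of Python. The zero-padding of key%1000 to three digits and the four-regions-per-letter split are assumptions read off the figure labels (a000, a250, ...), not details stated in the talk, and this is not the actual RegionSplitter.SplitAlgorithm implementation.

```python
import bisect
import string

ALPHABETS = string.ascii_lowercase  # 26 letters, indexed by key % 26


def prefix(key):
    # ALPHABETS[key % 26] + key % 1000, zero-padded to 3 digits (assumed).
    return f"{ALPHABETS[key % 26]}{key % 1000:03d}"


def row_key(key):
    # prefix(key) + key spreads sequential ids across the whole key space.
    return prefix(key) + str(key)


def region_start_keys(per_letter=4):
    # Hypothetical pre-split: each letter is cut into `per_letter` regions.
    step = 1000 // per_letter
    return [f"{c}{n:03d}" for c in ALPHABETS for n in range(0, 1000, step)]


def region_for(key, starts):
    # The region whose start key is the greatest one <= the row key.
    return starts[bisect.bisect_right(starts, row_key(key)) - 1]


starts = region_start_keys()
for user_id in (2600, 2601, 2602):
    print(user_id, row_key(user_id), region_for(user_id, starts))
```

Consecutive ids land under different letters, so sequential writes do not pile up on a single region.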

Data stored in HBase

• Message, Inbox

• exponential scale

• immutable

• User, Contact, Group

• linear scale

• mutable

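Message, Inbox

• Hybrid configuration with Sharded Redis

• It is enough if reads and writes succeed on either side (< quorum)

• Failed queries are queued in the JVM heap or Redis and retried

• Immutable and idempotent queries: no consistency or duplication problems

Priority: performance and availability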
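The hybrid write path for Message and Inbox data can be sketched as follows: a write counts as successful if either store accepts it, and failures are queued for retry, which is safe because the writes are immutable and idempotent. The `DictStore` and `HybridWriter` classes are hypothetical in-memory stand-ins, not LINE's client code.

```python
from collections import deque


class DictStore:
    """In-memory stand-in for a Redis shard or an HBase table (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def put(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.data[key] = value


class HybridWriter:
    def __init__(self, stores):
        self.stores = stores
        self.retry_queue = deque()  # stands in for the JVM heap / Redis queue

    def put(self, key, value):
        succeeded = 0
        for store in self.stores:
            try:
                store.put(key, value)
                succeeded += 1
            except ConnectionError:
                # Idempotent write: replaying it later cannot create duplicates.
                self.retry_queue.append((store, key, value))
        if succeeded == 0:
            raise RuntimeError("write failed on every store")
        return succeeded

    def drain_retries(self):
        # One pass over the queue; still-failing writes are re-queued.
        for _ in range(len(self.retry_queue)):
            store, key, value = self.retry_queue.popleft()
            try:
                store.put(key, value)
            except ConnectionError:
                self.retry_queue.append((store, key, value))
```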

User, Contact

• Sharded Redis is still the primary

• No scalability problem

• Mutable, so consistency matters

• Migration from Redis to HBase (in progress)

• Only the model objects are backed up

Priority: consistency over performance

Migrating from Redis to HBase

1. Back up the models

• Sync write to Redis, async write to HBase (Java Future, Redis queuing)

2. Full migration from Sharded Redis using M/R

3. Build index / inverted index from the models (eventual) ← we are here

• Batch Operation: w/ M/R, model table full-scan using TableMapper

• Incremental Operation: Diff logging and sequential indexing or Percolator, HBase Coprocessor

4. Switch the access path and turn Redis into a cache
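Step 1 (sync write to Redis, async write to HBase) can be sketched with a thread pool. The talk mentions Java Future; this Python version uses `concurrent.futures` to show the same shape. The dict-based primary, the `FlakyBackup` class, and the list used as a retry queue are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


class FlakyBackup:
    """Stand-in for the HBase backup table; can be switched to fail."""

    def __init__(self):
        self.data = {}
        self.fail = False

    def put(self, key, model):
        if self.fail:
            raise ConnectionError("backup store unavailable")
        self.data[key] = model


class BackupWriter:
    def __init__(self, primary, backup, pool):
        self.primary = primary    # dict standing in for Sharded Redis
        self.backup = backup
        self.pool = pool
        self.retry_queue = []     # stands in for the Redis retry queue

    def _backup_task(self, key, model):
        try:
            self.backup.put(key, model)
            return True
        except ConnectionError:
            # Keep the failed write so a dispatcher can replay it later.
            self.retry_queue.append((key, model))
            return False

    def save(self, key, model):
        self.primary[key] = model  # synchronous write to the primary
        # Asynchronous backup; the caller may ignore the returned future.
        return self.pool.submit(self._backup_task, key, model)


with ThreadPoolExecutor(max_workers=2) as pool:
    writer = BackupWriter({}, FlakyBackup(), pool)
    future = writer.save("user:1", {"name": "alice"})
    print(future.result())
```

The caller only waits for the primary write; the backup outcome is recorded inside the task, so a failed backup never blocks or fails the request.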

Did replacing Redis with HBase make us happy?

In a sense, YES

• Scalability issues are solved

• At least through the end of this year

• Wide-area distribution → 3rd issue (to be continued...)

Failure Decreased?

ABSOLUTELY NOT!

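Impressions after operating HBase for 8 months

• HBase is a volcano

• Small eruptions every day

• They build up into an occasional big eruption

• Living safely at the foot of a volcano

Eruptions

• RSs dropping out due to intermittent network failures

• DN performance degradation from H/W failures, and delayed detection

• Degraded performance of get-type operations (get, increment, checkAndPut, checkAndDelete), which drags down overall performance

• Performance degradation from (major) compaction

• Data inconsistency

• No SPOF-related problems have occurred yet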

HBase Availability

• Several components are a SPOF or cause downtime when they die

1. HDFS NameNode

2. HBase RegionServer, DataNode

1. HDFS NameNode (NN)

• HA Framework for HDFS NN (HDFS-1623)

• Backup NN (0.21)

• Avatar NN (Facebook)

• HA NN using Linux HA

• Active/passive configuration deploying two NN (Cloudera)

HA NN using Linux-HA

• Redundancy with DRBD + heartbeat

• DRBD: disk mirroring (RAID1-like)

• heartbeat: network monitoring

• pacemaker: resource management (registering the failover logic)

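2. RegionServer, DataNode

• HBase itself holds no replicas

• Downtime lasts until failover completes

• Because it is built from multiple components, going from failure detection to cluster-wide agreement means waiting for a timeout on each communication hop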

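Countermeasures against downtime

• Since HBase itself holds no replicas, downtime always occurs when an RS dies

• distributed HLog splitting (>=cdh3u3)

• timeout & retry

• Most of the time is the timeout between HClient ↔ RS

• Tune timeouts (per retry, per operation)

• If the RS ↔ ZK timeout is short, RSs tend to be evicted when the network is unstable

• Place regions that hold the same key on the same RS → limits the scope of a failure

• LINE's HBase access is basically async

• Cluster replication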
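Tuning timeouts "per retry, per operation" could look like the sketch below: each operation type has its own base timeout, and each retry gets a longer one. The millisecond values, the backoff factor, and the function names are made up for illustration; they are not LINE's settings or HBase client configuration keys.

```python
# Hypothetical per-operation base timeouts in milliseconds.
OPERATION_TIMEOUTS_MS = {"get": 50, "put": 20, "delete": 100}


def timeout_for(operation, attempt, backoff=2.0):
    """Timeout for the given attempt (0-based); grows with each retry."""
    return OPERATION_TIMEOUTS_MS[operation] * backoff ** attempt


def call_with_retry(operation, fn, max_attempts=3):
    """Call fn(timeout_ms), retrying on TimeoutError with a longer timeout."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn(timeout_for(operation, attempt))
        except TimeoutError as err:
            last_error = err
    raise last_error
```

A short first timeout fails fast when a RegionServer is gone, while the longer retries give a region that is being reassigned time to come back.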

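HBase cluster replication

• Cluster Replication: master-push model

• (a binary-logging mechanism like MySQL's); see Section 8.8 of the "horse book" (the O'Reilly HBase book)

• Asynchronously pushes the WAL (HLog) to the slave cluster

• Synchronous call per RS

• ZK manages the list of WALs that have not been synced yet

• We are evaluating it, while also considering our own implementation or other approaches

• Not designed for replication across multiple DCs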

HDFS tuning for HBase

• Short-circuit local client reads directly to the DataNode's files (HDFS-2246; available from 0.23.1, 0.22.1, 1.0.0)

• Rack-aware Replica Placement (HADOOP-692)

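The deletion problem

• Deletes are somewhat slow

• They are logical deletes, so not as slow as get, but they take twice as long as put

• e.g. account withdrawal for a user with 10,000 contacts

• Too many columns, so the client times out → queuing + iterative delete

• e.g. deleting messages whose TTL has expired

• Random I/O on cold data occurs and affects the service

• → dual cluster with full truncate, or use TTL

• e.g. dealing with spammers

• Affects get performance until compaction runs (a large amount of skip processing)

• → delete per row rather than per column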
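The "queuing + iterative delete" workaround for wide rows can be sketched as follows: instead of one huge delete that times out on the client, the column list is cut into batches that are queued and applied one at a time. The batch size and the dict-based table are assumptions for the example.

```python
from collections import deque


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def enqueue_delete(queue, row, columns, batch_size=500):
    # One queue entry per batch keeps every delete call small.
    for batch in chunked(columns, batch_size):
        queue.append((row, batch))


def run_deletes(queue, table):
    """Apply queued batches one by one; returns the number of batches run."""
    batches = 0
    while queue:
        row, batch = queue.popleft()
        for column in batch:
            table[row].pop(column, None)  # idempotent: missing columns are fine
        batches += 1
    return batches


# A user with 10,000 contacts, each stored as one column of the row.
table = {"user:1": {f"contact:{i}": b"" for i in range(10_000)}}
queue = deque()
enqueue_delete(queue, "user:1", list(table["user:1"]))
print(run_deletes(queue, table), len(table["user:1"]))
```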

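Compaction countermeasures

• Bigtable design: periodic compaction is required for I/O optimization and for deletes

• Compactions are queued per RS and run on one HRegion at a time

• CPU utilization rises while a compaction runs, so timing matters

• Triggers: periodic, number of StoreFiles, user-initiated

• Run compaction and region splitting off-peak so they do not happen back to back at peak time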

Balancing, Splitting, and Compaction

• Region balancing

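• The automatic balancer (request-count-based balancing) is OFF

• Balancing is done during the service's off-peak hours

• The same key in different tables is assigned to the same server → limits the scope of a failure

• A dedicated server for problematic regions: prison RS

• Scheduling of region splitting and compaction

• Automatic splits are avoided as much as possible (auto split is triggered by hbase.hregion.max.filesize)

• Avoid consecutive major compactions

• Periodic compaction is OFF for immutable storage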
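The scheduling policy above (work only off-peak, one region at a time, no back-to-back major compactions) might be approximated by a small gate like the one below. The off-peak window, the StoreFile-count ranking, and the function names are hypothetical; the real tooling is not described in the slides.

```python
from datetime import time


def is_off_peak(now, start=time(3, 0), end=time(6, 0)):
    """True if `now` falls in the off-peak window (which may wrap midnight)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def pick_regions_to_compact(now, store_file_counts, max_per_run=1):
    """Choose regions to compact: none at peak time, the fullest ones off-peak.

    `store_file_counts` maps region name -> number of StoreFiles.
    """
    if not is_off_peak(now):
        return []
    ranked = sorted(store_file_counts, key=store_file_counts.get, reverse=True)
    return ranked[:max_per_run]


counts = {"region-a": 3, "region-b": 9, "region-c": 5}
print(pick_regions_to_compact(time(4, 0), counts))
print(pick_regions_to_compact(time(12, 0), counts))
```

Capping each run at one region mirrors the slide's point that consecutive compactions should not pile up.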

HBase Tools

• Client:

• HBaseTemplate: HBase wrapper like Spring's RedisTemplate

• MirroringHTable: support for multiple HBase clusters

• Operations and monitoring:

• auto splitting: region splits during off-peak hours

• auto HRegion balancer: off-peak balancing based on metrics

• Region snapshot & restore: daily dump of the META table, used to restore when an RS dies

• Data Migration:

• Migrator with M/R (Redis → HBase)

• H2H copy tool with M/R: table copy (HBase → HBase)

• metrics collecting via JMX

• Index Builder and Inconsistent Fixer with M/R, incremental implementation (coprocessor)

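Future work

• Build indexes and a Redis cache around the <key, model> data on HBase

• Power outage and earthquake countermeasures (rack/DC awareness)

• HBase cluster replication

• Use Cassandra as geographically distributed storage for HLog

• Scalability beyond today's (hundreds of millions to billions of users)

• HBase is network-bound; one cluster tops out at a few hundred nodes

• Get by with multiple clusters, or use Cassandra?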
