Seven Databases in Seven Weeks


[Figure: HBase on top of HDFS (Hadoop Distributed File System). Other labels in the diagram: Server, DFS, HBase.]

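Structure of the HBase chapter in "Seven Databases in Seven Weeks"

Day 1: CRUD and table administration
Day 2: Working with big data
Day 3: Taking it to the cloud

Run HBase in standalone mode
Create a table
Put data in and get it out
Load a Wikipedia dump
Get used to working through scripts (not the shell)
Operate HBase through Thrift
Deploy to EC2 with Whirr

Two parts of this outline are marked "not covered this time".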

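HBase features

Automatic sharding and automatic failover
• When a table grows large, it is split automatically.
• The resulting shards fail over automatically when a node fails.

Data consistency (the C in CAP: Consistency)
• An update can be read from the moment it is applied.
• HBase does not relax this to eventual consistency, where reads only converge on the same value over time.

Hadoop/HDFS integration
• HBase can be deployed on Hadoop's HDFS.
• Hadoop MapReduce can use HBase directly as its input and output, without going through an API.

Interfaces
• Besides the native Java API, HBase can be used through Thrift and a REST API.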

Day 1: Deploying HBase in standalone mode

[root@HBase01 ask]# cd /opt/
[root@HBase01 opt]# wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/hbase/hbase-0.94.7/hbase-0.94.7.tar.gz
[root@HBase01 opt]# tar zxvf hbase-0.94.7.tar.gz
[root@HBase01 opt]# vi hbase-0.94.7/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///var/files/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/var/files/zookeeper</value>
  </property>
</configuration>

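The first block above shows the commands that were run; the second is the content of hbase-site.xml.

Where the data files end up:
/var/files/hbase
/var/files/zookeeper

This is the minimum configuration needed to run standalone. It only specifies where files are stored, and any directory can be given as the write destination.

All properties that can be set in the XML are listed in src/main/resources/hbase-default.xml.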
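Before starting HBase it can be handy to read the configuration back and confirm which directories it points at. The following is a minimal sketch in plain Python, not part of HBase: the helper name `read_hbase_site` is made up for illustration, and the XML is inlined (with the same two properties as above) instead of being read from conf/hbase-site.xml.

```python
import xml.etree.ElementTree as ET

# Same two properties as the hbase-site.xml above, inlined so the sketch is self-contained.
HBASE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///var/files/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/var/files/zookeeper</value>
  </property>
</configuration>
"""

def read_hbase_site(xml_text):
    """Return the <property> entries as a {name: value} dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

settings = read_hbase_site(HBASE_SITE)
for name, value in sorted(settings.items()):
    print(name, "=", value)
```

To check a real installation, the same function could be fed the text of the actual file.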


[root@HBase01 opt]# hbase-0.94.7/bin/start-hbase.sh
+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| HBase requires Java 1.6 or later.                                    |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+

A JDK is required.

Specify the Java installation directory in hbase-env.sh:
[root@HBase01 opt]# vi hbase-0.94.7/conf/hbase-env.sh
- # export JAVA_HOME=/usr/java/jdk1.6.0/
+ export JAVA_HOME=/usr/java/latest/

JDK variants (pick one of the following and install it):
Oracle JDK: 1.6, 1.7
OpenJDK: 1.6, 1.7


Start:
[root@HBase01 opt]# hbase-0.94.7/bin/start-hbase.sh
starting master, logging to /opt/hbase-0.94.7/bin/../logs/hbase-root-master-HBase01.db.algnantoka.out

Stop:
[root@HBase01 opt]# hbase-0.94.7/bin/stop-hbase.sh
stopping hbase...........

Connect with the shell:
[root@HBase01 opt]# hbase-0.94.7/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.7, r1471806, Wed Apr 24 18:48:26 PDT 2013

hbase(main):001:0> status
1 servers, 0 dead, 2.0000 average load

Day 1: Using HBase

Creating a table: create

hbase(main):009:0> help "create"
Create table; pass table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration. Dictionaries are described below in the GENERAL NOTES section.
Examples:

hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'
hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
hbase> create 't1', 'f1', {SPLITS => ['10', '20', '30', '40']}
hbase> create 't1', 'f1', {SPLITS_FILE => 'splits.txt'}
hbase> # Optionally pre-split the table into NUMREGIONS, using
hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

Basic form:
create 'TableName', {NAME => 'ColumnFamilyName', Option => Value …} …

Shorthand:
create 'TableName', 'ColumnFamilyName', …


Inserting a record: put

hbase(main):010:0> help "put"
Put a cell 'value' at specified table/row/column and optionally timestamp coordinates. To put a cell value into table 't1' at row 'r1' under column 'c1' marked with the time 'ts1', do:

hbase> put 't1', 'r1', 'c1', 'value', ts1

Example with SampleTable:
create 'SampleTable', 'color', 'shape'

put 'SampleTable', 'first', 'color:red', '#F00'
put 'SampleTable', 'first', 'color:blue', '#00F'
put 'SampleTable', 'first', 'color:yellow', '#FF0'


Retrieving a record: get

hbase(main):011:0> help "get"
Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp, timerange and versions.
Examples:

hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
hbase> get 't1', 'r1', {FILTER => "ValueFilter(=, 'binary:abc')"}
hbase> get 't1', 'r1', 'c1'
hbase> get 't1', 'r1', 'c1', 'c2'
hbase> get 't1', 'r1', ['c1', 'c2']

Examples against SampleTable (a whole row, one column family, one column):
get 'SampleTable', 'first'
get 'SampleTable', 'first', 'color'
get 'SampleTable', 'first', 'color:blue'
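The put and get calls address a value by table, row key, column family, and column qualifier. One rough way to picture that addressing is a nested dictionary. The sketch below is plain Python and not the HBase client API: `put` and `get` are stand-in functions written for this illustration, and versions and timestamps are ignored.

```python
from collections import defaultdict

# table -> row key -> "family:qualifier" -> value
tables = {"SampleTable": defaultdict(dict)}
# Column families are fixed when the table is created; qualifiers are free-form.
FAMILIES = {"SampleTable": {"color", "shape"}}

def put(table, row, column, value):
    family = column.split(":", 1)[0]
    if family not in FAMILIES[table]:
        raise KeyError("unknown column family: " + family)
    tables[table][row][column] = value

def get(table, row, column=None):
    """column may be None (whole row), 'family', or 'family:qualifier'."""
    cells = tables[table].get(row, {})
    if column is None:
        return dict(cells)
    if ":" in column:
        return {c: v for c, v in cells.items() if c == column}
    return {c: v for c, v in cells.items() if c.split(":", 1)[0] == column}

put("SampleTable", "first", "color:red", "#F00")
put("SampleTable", "first", "color:blue", "#00F")
put("SampleTable", "first", "color:yellow", "#FF0")

print(get("SampleTable", "first"))                # whole row
print(get("SampleTable", "first", "color"))       # one column family
print(get("SampleTable", "first", "color:blue"))  # one column
```

Note how a new qualifier such as color:yellow needs no schema change, while an unknown family is rejected.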


Searching records: scan

hbase(main):001:0> help 'scan'
Scan a table; pass table name and optionally a dictionary of scanner specifications. Scanner specifications may include one or more of: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS, CACHE

If no columns are specified, all columns will be scanned. To scan all members of a column family, leave the qualifier empty as in 'col_family:'.

The filter can be specified in two ways:
1. Using a filterString - more information on this is available in the Filter Language document attached to the HBASE-4176 JIRA
2. Using the entire package name of the filter.

Some examples:

hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}

For experts, there is an additional option -- CACHE_BLOCKS -- which switches block caching for the scanner on (true) or off (false). By default it is enabled. Examples:

hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}

Also for experts, there is an advanced option -- RAW -- which instructs the scanner to return all cells (including delete markers and uncollected deleted cells). This option cannot be combined with requesting specific COLUMNS. Disabled by default. Example:

hbase> scan 't1', {RAW => true, VERSIONS => 10}
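Because rows are kept sorted by row key, a scan bounded by STARTROW, STOPROW, and LIMIT amounts to walking one contiguous slice of the key space. The sketch below illustrates that idea in plain Python; `scan` here is a stand-in written for this example (not the HBase API), the row keys are invented, and filters and columns are left out.

```python
import bisect

# Row keys kept in sorted order; cell values are omitted for brevity.
row_keys = sorted(["apple", "banana", "cherry", "xylophone", "xyz", "zebra"])

def scan(keys, startrow=None, stoprow=None, limit=None):
    """Return keys in the half-open range [startrow, stoprow), at most `limit` of them."""
    lo = 0 if startrow is None else bisect.bisect_left(keys, startrow)
    hi = len(keys) if stoprow is None else bisect.bisect_left(keys, stoprow)
    selected = keys[lo:hi]
    return selected if limit is None else selected[:limit]

print(scan(row_keys, startrow="xyz"))              # ['xyz', 'zebra']
print(scan(row_keys, startrow="b", stoprow="d"))   # ['banana', 'cherry']
print(scan(row_keys, limit=2))                     # ['apple', 'banana']
```

The start position is found by binary search, so only the requested range is touched; that is the reason row keys are worth designing around the scans you expect.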

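Day 1: Using HBase: timestamps

Writing to the same cell ('first', 'color:red') five times leaves five versions, each with its own timestamp:

put 'table', 'first', 'color:red', '#FFF'
put 'table', 'first', 'color:red', '#000'
put 'table', 'first', 'color:red', '#0F0'
put 'table', 'first', 'color:red', '#00F'
put 'table', 'first', 'color:red', '#F00'

[Figure: the cell 'first', 'color:red' holding the values #FFF, #000, #0F0, #00F, #F00, labeled timestamp 1 to timestamp 5.]

A plain get returns the latest version:
get 'table', 'first', 'color:red'

TIMESTAMP selects the version written at that timestamp, and VERSIONS returns up to that many recent versions:
get 'table', 'first', {COLUMN => 'color:red', TIMESTAMP => 4}
get 'table', 'first', {COLUMN => 'color:red', VERSIONS => 4}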
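The versioning behaviour can be pictured as a small list of (timestamp, value) pairs per cell. The following is a toy model in plain Python, not HBase code: the class `VersionedCell` is invented for this illustration, timestamps are the schematic 1 to 5 from the slide, and it assumes the cell is allowed to keep five versions (in HBase the number actually retained depends on the column family's VERSIONS setting).

```python
class VersionedCell:
    """Toy model of one cell that keeps several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]  # drop the oldest beyond the limit

    def get(self, timestamp=None, versions=1):
        if timestamp is not None:
            return [(t, v) for t, v in self.versions if t == timestamp]
        return self.versions[:versions]

cell = VersionedCell(max_versions=5)
for ts, value in enumerate(["#FFF", "#000", "#0F0", "#00F", "#F00"], start=1):
    cell.put(ts, value)

print(cell.get())              # newest version only
print(cell.get(timestamp=4))   # the version written at timestamp 4
print(cell.get(versions=4))    # the four most recent versions
```

With a smaller `max_versions`, the older entries would simply be gone, which is why asking for more versions than the family keeps returns fewer results than requested.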


Changing the schema: alter

The table being altered must be offline, so disable it first. While it is disabled, reads fail:

hbase(main):009:0> disable 'table1'
0 row(s) in 2.5190 seconds

hbase(main):010:0> get 'table1', 'first','color:red'
COLUMN CELL

ERROR: org.apache.hadoop.hbase.DoNotRetryIOException: table1 is disabled.

Change the number of versions kept, then bring the table back online:

hbase(main):012:0> alter 'table1' , { NAME => 'color', VERSIONS => 10}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.3630 seconds

hbase(main):014:0> enable 'table1'
0 row(s) in 2.3000 seconds

A schema change with alter proceeds as follows:
1. Create an empty table with the new schema.
2. Copy the data over from the original table.
3. Drop the original table.

This is expensive, so as a rule, avoid schema changes (changes to a ColumnFamily).

Day 1: Using HBase: JRuby scripting

The hbase shell is an extended JRuby interpreter, so it can run JRuby scripts.

hoge.rb:

include Java
# Java classes from HBase
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.HBaseConfiguration

# Convert every argument to a Java byte array
def jbytes(*args)
  args.map { |arg| arg.to_s.to_java_bytes }
end

table = HTable.new( HBaseConfiguration.new, "table1" )
p = Put.new( *jbytes( "third" ) )                 # row key
p.add( *jbytes( "color", "black", "#000" ) )      # family, qualifier, value
p.add( *jbytes( "shape", "triangle", "3" ) )
p.add( *jbytes( "shape", "square", "4" ) )
table.put( p )                                    # the records are inserted here

Run:

[root@HBase01 opt]# hbase-0.94.7/bin/hbase shell hoge.rb

hbase(main):002:0> get 'table1', 'third' ,{COLUMN => ['color','shape']}
COLUMN CELL
color:black timestamp=1369049856405, value=#000
shape:square timestamp=1369049856405, value=4
shape:triangle timestamp=1369049856405, value=3
9 row(s) in 0.0870 seconds

Because all three cells are written by the single table.put call, their timestamps are identical.

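What is HBase?

Google's internal systems (from its published papers): Google File System (GFS), MapReduce, BigTable
Hadoop project (clones of the Google systems): Hadoop Distributed File System (HDFS), MapReduce, HBase

MapReduce covers batch processing; BigTable and HBase cover real-time responses.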

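BigTable (sorted column-oriented database)

[Figure: a table with a RowKey column and three column families, ColumnFamily1 to ColumnFamily3. Rows 1 and 2 hold different sets of columns (Column1, Column2, Column3) within each family.]

• RowKey: required, and kept sorted.
• ColumnFamily: defined in the schema.
• Column: schemaless (can be added freely).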

[Figure: a single column holding the values #FFF, #000, #0F0, #00F, #F00, labeled timestamp 1 to timestamp 5.]

Each value is versioned by timestamp.

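BigTable (sorted column-oriented database)

[Figure: a table with RowKeys 1 to 9 and three column families, ColumnFamily1 to ColumnFamily3, cut into several blocks each labeled "region".]

• A table is physically partitioned (sharded) into regions.
• Each region is served by a region server in the cluster.
• Regions are created per ColumnFamily.
• A region covers a range of the sorted RowKeys, cut at a suitable size.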
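The sharding described above can be sketched as cutting a sorted list of row keys into consecutive ranges and handing each range to a server. This is a simplified model in plain Python, not HBase code: it splits by row count (an assumption made for brevity, whereas a real system would typically split by data size), and the function names and the server names `rs-a` and `rs-b` are invented for the illustration.

```python
import bisect

def split_into_regions(sorted_keys, max_rows):
    """Cut sorted row keys into consecutive chunks of at most max_rows keys."""
    return [sorted_keys[i:i + max_rows] for i in range(0, len(sorted_keys), max_rows)]

def find_region(regions, row_key):
    """Return the index of the region whose start key is the last one <= row_key."""
    start_keys = [r[0] for r in regions]
    index = bisect.bisect_right(start_keys, row_key) - 1
    return max(index, 0)

row_keys = [str(n) for n in range(1, 10)]   # '1' .. '9', as in the figure
regions = split_into_regions(row_keys, max_rows=3)
servers = ["rs-a", "rs-b"]                  # hypothetical region server names
assignment = {i: servers[i % len(servers)] for i in range(len(regions))}

print(regions)
region_index = find_region(regions, "5")
print(region_index, assignment[region_index])
```

Since neighbouring row keys land in the same range, a lookup or a short scan usually needs only one region server, which is the practical reason for keeping the keys sorted.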

BigTable (sorted column-oriented database)

• Do not add ColumnFamilies carelessly; handle new needs by adding Columns wherever possible.
• Shape the RowKey so that accesses tend to hit consecutive keys.
• The structure is not meant for searching with a Column or ColumnFamily as the condition.
• The initial design of the table schema is extremely important.


Elements behind HBase's features

Hadoop/HDFS integration
• HDFS is a GFS clone; HBase is a BigTable clone.

Automatic region splitting??

??

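Elements behind HBase's features

Automatic failover and data consistency (CAP: Consistency)

[Figure: Master Server, ZooKeeper, and two Region Servers. The first Region Server holds a Region with an in-memory store, a WAL, and a local store, and handles Read and Write. The second Region Server, the failover target, holds a Region with a WAL and a local store. An arrow between the two is labeled "replicate".]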
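The slide does not spell out how recovery works, so the following is only a general write-ahead-log sketch of why the combination of a WAL and an in-memory store lets another server take over without losing acknowledged writes. It is plain Python, not HBase code; the class `RegionServer` and its methods are invented for this illustration, and the assumption that the log survives the failure is stated in the code.

```python
class RegionServer:
    """Toy region server: each write goes to a WAL first, then to an in-memory store."""

    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []  # append-only log, assumed durable
        self.memstore = {}                          # lost if the server dies

    def write(self, row, value):
        self.wal.append((row, value))   # 1. log the edit
        self.memstore[row] = value      # 2. apply it in memory

    def read(self, row):
        return self.memstore.get(row)

    @classmethod
    def recover_from(cls, wal):
        """Build a replacement server by replaying a surviving WAL in order."""
        server = cls(wal=list(wal))
        for row, value in wal:
            server.memstore[row] = value
        return server

primary = RegionServer()
primary.write("first", "#F00")
primary.write("first", "#00F")
primary.write("second", "#0F0")

surviving_wal = primary.wal   # assumption: the log is stored where it outlives the server
del primary                   # simulate the node failure
standby = RegionServer.recover_from(surviving_wal)
print(standby.read("first"), standby.read("second"))
```

Replaying the log in write order reproduces the latest value of every row, so a reader of the replacement server sees the same data the failed server had acknowledged.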

Day 2: Working with Wikipedia data

Ran out of energy \(^0^)/

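[Figure: bar chart "Seconds taken by Scan", axis from 0 to 250. Categories: including text, excluding text. Series: with compression, without compression. Additional label: Documents.]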
