HBase Introduction Scott Miao 2012/06/25

001 hbase introduction


Page 1: 001 hbase introduction

HBase Introduction

Scott Miao

2012/06/25

Page 2: 001 hbase introduction

Agenda
• Course Credit
• One common web site story
• Why RDB not affordable ?
• Big Data
• Why use noSQL ?
• HBase Introduction
• Hands-on
• noSQL architecture common practices
• Case study

Page 3: 001 hbase introduction

The story of a web site (1/3)
• An RDBMS is the natural choice for the persistence tier
  • It handles transactions (ACID) for us, enforces integrity constraints, offers the standard SQL language, and even provides stored procedures
• The first time your user count keeps growing…
  • Use a cluster of AP servers that share a single DB server
• The second time your user count keeps growing…
  • Split the DB server into a Master-Slave architecture
    • Read data from the slave servers
    • Write data to the master server

HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
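The Master-Slave split on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ReadWriteRouter` and `server_for` are hypothetical names, servers are represented by plain strings, and any statement that starts with SELECT is treated as a read.

```python
import itertools


class ReadWriteRouter:
    """Route writes to the master and spread reads across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Simple round-robin over the slave servers (an assumed policy).
        self._next_slave = itertools.cycle(slaves)

    def server_for(self, sql):
        """Return the server that should run the given SQL statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._next_slave) if is_read else self.master


router = ReadWriteRouter("master", ["slave1", "slave2"])
print(router.server_for("INSERT INTO users VALUES (1)"))
print(router.server_for("SELECT * FROM users"))
```

Reads scale with the number of slaves in this layout, but every write still lands on the single master, which is why the next slide turns to the write bottleneck.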

Page 4: 001 hbase introduction

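The story of a web site (2/3)
• The third time your user count keeps growing…
  • To address the read bottleneck
    • Add a cache, such as Memcached (an in-memory DB), between the server program and the DB
    • But the server program's cache and the DB can easily become inconsistent
  • To address the write bottleneck
    • Upgrade the DB server's hardware (CPU, memory, disk, etc.; vertical scaling)
    • Don't forget: the slave servers' hardware has to be upgraded as well…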
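A minimal sketch of the cache layer described above, and of how it can drift out of sync with the DB. Both stores are plain Python dicts standing in for Memcached and the database; `CachedReader` and its `invalidate` flag are hypothetical names used only for this illustration.

```python
class CachedReader:
    """Cache-aside reads: try the cache first, fall back to the DB."""

    def __init__(self, db):
        self.db = db        # dict standing in for the database
        self.cache = {}     # dict standing in for Memcached
        self.db_reads = 0   # counts how often the DB is actually hit

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value, invalidate=True):
        self.db[key] = value
        if invalidate:
            # Drop the cached copy so the next read goes back to the DB.
            self.cache.pop(key, None)


reader = CachedReader({"u1": "alice"})
reader.read("u1")                             # DB read, fills the cache
reader.write("u1", "bob", invalidate=False)   # cache is now stale
print(reader.read("u1"))                      # still the old cached value
```

Skipping the invalidation step is what produces the inconsistency the slide warns about; the cache relieves reads only, and writes still go to the DB.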

Page 5: 001 hbase introduction

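The story of a web site (3/3)
• The fourth time your user count keeps growing…
  • Use database sharding
    • Move from vertical scaling to horizontal scaling
    • The management nightmare begins
    • An RDBMS is inherently ill-suited to distributed storage (ACID, fixed schema)
    • The DBA has to define a set of sharding rules
      • When one DB server goes down or runs out of storage, resharding has to be done by hand
      • Resharding means readjusting the sharding rules and then copying and migrating data with heavy I/O, while either keeping the web site in service or limiting the outage to a fixed time window
  • This is usually a last resort adopted after the fact, and one of only a few options available
    • Who knew my web site would get this popular?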
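To see why resharding is so expensive, here is a small sketch that assumes the simplest possible sharding rule: numeric user IDs placed by modulo on the number of DB servers. The rule and the function names are assumptions for illustration; real sharding rules vary.

```python
def shard_for(user_id, shard_count):
    """Assumed sharding rule: place a row by user_id modulo shard count."""
    return user_id % shard_count


def moved_keys(user_ids, old_count, new_count):
    """Return the IDs whose shard changes when the shard count changes."""
    return [
        u for u in user_ids
        if shard_for(u, old_count) != shard_for(u, new_count)
    ]


users = range(1000)
moved = moved_keys(users, old_count=4, new_count=5)
print(f"{len(moved)} of {len(users)} rows must be migrated")
```

Under this rule, going from 4 to 5 servers relocates most of the rows, which is the heavy data copying and migration work the slide describes.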

Page 6: 001 hbase introduction

Why RDB not affordable ? (1/6)
• Bottleneck of relational DBs
• 90s vs. recent years (Web 2.0)
  • Memcached + MySQL
    • Mitigates read stress effectively, but not write stress
  • MySQL Cluster solution
  • Master/Slave
    • Not affordable for high-concurrency scenarios
  • Vertical partitioning
  • Vertical/horizontal partitioning (database sharding)
    • Complex
    • Hard to scale out and to adapt to changing requirements
    • Low availability
• Certain kinds of simple but very large data cause this condition

http://www.infoq.com/cn/news/2011/01/nosql-why

Page 7: 001 hbase introduction

Why RDB not affordable ? (2/6) – A general HA system architecture design

The quality of software projects, part 4: overall design, an architecture design case ─ http://takeshi-experience.blogspot.tw/2012/04/blog-post.html

Page 8: 001 hbase introduction

Why RDB not affordable ? (3/6) – Master/Slave


Page 9: 001 hbase introduction

Why RDB not affordable ? (4/6) – Vertical Partitioning


Page 10: 001 hbase introduction

Why RDB not affordable ? (5/6) – Master/Slave + Vertical Partitioning


Page 11: 001 hbase introduction

Why RDB not affordable ? (6/6) – Vertical/Horizontal Partitioning


Page 12: 001 hbase introduction

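The era of big data!
• The data generated in the past 3 years exceeds the data created in the previous 40,000 years!
• Walmart's data volume is 167 times that of the US Library of Congress!
• eBay's analytics platform processes as much as 100 PB of data every day! (about 100,000,000 GB)
• As of 2010, the world's stored electronic data amounted to 1.2 ZB! (1,200,000 PB)
• IDC forecasts that by 2020 the world's stored electronic data will be 44 times the 2009 level, reaching 35 trillion GB!
  • 35,000,000,000,000 gigabytes

Architect, October issue ─ http://www.infoq.com/cn/minibooks/architect-oct-10-2011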

Page 13: 001 hbase introduction

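Trend Micro's problem
• Each person visits about 20 ~ 60 html pages per day
• Each html page contains about 15 ~ 30 URIs
• Each URI object is about 10 ~ 150 KB in size
• For one million users
  • 1 million X 20 = 20 million html pages
  • 20 million html pages X 15 = 300 million URIs
  • 300 million URI objects X 10 KB = 3,000,000,000 KB (3 TB)
• The figures above cover Taiwan alone
• Trend Micro is a global company
  • So the daily data volume is roughly tens of TB

Trend Micro's journey of cloud discovery ─ http://findbook.tw/book/9789866126185/basic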
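The estimate on this slide can be checked with a few lines of Python. The per-user ranges come from the slide; the function name and the use of decimal units (1 TB = 1,000,000,000 KB) are assumptions made for this sketch.

```python
KB_PER_TB = 1000 ** 3  # decimal units: 1 TB = 1,000,000,000 KB


def daily_volume_tb(users, pages_per_user, uris_per_page, kb_per_uri):
    """Estimate the daily data volume in TB for a given user population."""
    pages = users * pages_per_user
    uris = pages * uris_per_page
    return uris * kb_per_uri / KB_PER_TB


# Low end of each range on the slide: 20 pages, 15 URIs, 10 KB.
low = daily_volume_tb(1_000_000, 20, 15, 10)
# High end of each range on the slide: 60 pages, 30 URIs, 150 KB.
high = daily_volume_tb(1_000_000, 60, 30, 150)
print(low, high)
```

The slide works through the low end of every range; taking the high end of every range for the same one million users gives a far larger figure, so the true value depends heavily on the traffic mix.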

Page 14: 001 hbase introduction

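The new favorite of the big data era ─ noSQL
• Not only SQL
• Started in 2009
• Has the following characteristics
  • Does not use the relational data model
  • Distributed storage by design
  • Easy to scale horizontally
  • Open source
  • Easy to extend
  • Simple API operations (CRUD, usually without SQL support)
  • CAP (as opposed to ACID)
    • Consistency (often relaxed to eventual consistency), Availability, Partition-Tolerance
  • Stores massive and heterogeneous data

http://nosql-database.org/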

Page 15: 001 hbase introduction

Why use noSQL ?
• Easy to scale out
  • Unlike an RDB, there are no relationships between tables, so it is easy to scale out
• High performance even with big data
  • Table-level cache (RDB) vs. record-level cache (noSQL)
• Elastic data model
  • Schema vs. schema-less/dynamic schema
• High availability
  • Easy to add new machines (nodes) without any performance impact

Page 16: 001 hbase introduction

Comparison between RDB and noSQL

Aspects              RDB                        noSQL
Performance (*)      Degrades                   Sustained, as with a small amount of data
Scalability          Mainly scale-up            Mainly scale-out
Reliability          ACID                       CAP
Availability         Hard to maintain SLA       Easy to maintain SLA
Security             Robust                     Depends
Economics            High-end machines          Commodity machines
Data model           Relational, fixed schema   Depends, but more likely simple and schema-less
Maturity             Very mature                Not mature, various products
Commercial support   Global companies           Small start-ups
OLAP/BI              Mature                     Immature
Human resources      Easy to find               Hard to find

(*) If given a really huge amount of data…

Page 17: 001 hbase introduction

noSQL basic categories

iTcloud: the new cloud era ─ http://www.ithome.com.tw/002/cloud/cloud.html

Page 18: 001 hbase introduction

Introduction to Apache HBase
• A top-level project of the ASF
• Belongs to the Key-Value category of noSQL DBs
• Derived from Google's
  • Bigtable: A Distributed Storage System for Structured Data
    • a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers
    • a sparse, distributed, persistent multi-dimensional sorted map

HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1

Page 19: 001 hbase introduction

Apache HBase Concepts – Column-Oriented (1/2)

http://ofps.oreilly.com/titles/9781449396107/intro.html

Page 20: 001 hbase introduction

Apache HBase Concepts – Column-Oriented (2/2)
• a sparse, distributed, persistent multi-dimensional sorted map
  • which is indexed by row key, column key (column family + qualifiers), and a timestamp

Column Families
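The "multi-dimensional sorted map" can be pictured as nested maps: row key, then column key (family plus qualifier), then timestamp, then value. The toy class below is only an in-memory illustration of that indexing scheme; it is not HBase's API, and all names in it are hypothetical.

```python
from collections import defaultdict


class SparseSortedMap:
    """Toy model: row key -> column key -> timestamp -> value."""

    def __init__(self):
        # Only cells that were actually written take up space (sparse).
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, timestamp, value):
        self._rows[row][f"{family}:{qualifier}"][timestamp] = value

    def get(self, row, family, qualifier, timestamp=None):
        """Return the value at a timestamp, or the newest one by default."""
        versions = self._rows.get(row, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self):
        """Yield (row, column, timestamp, value) for the newest version
        of every cell, in sorted row and column order."""
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                versions = self._rows[row][column]
                latest = max(versions)
                yield row, column, latest, versions[latest]


table = SparseSortedMap()
table.put("r1", "f1", "c1", 1, "old")
table.put("r1", "f1", "c1", 2, "new")
print(table.get("r1", "f1", "c1"))
```

Writing the same cell again with a newer timestamp adds a version instead of overwriting, which is why an update in this model is just another put.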

Page 21: 001 hbase introduction

Apache HBase Concepts - Architecture

http://ofps.oreilly.com/titles/9781449396107/architecture.html

Page 22: 001 hbase introduction

Hands-on (1/3) – Use your VM (Virtual Machine) to install tm-puppet
• Please refer to the SPN Dev hbase training program again
• Install git on your PC
• Install tm-puppet on your VM

Page 23: 001 hbase introduction

Hands-on (2/3) – Use HBase shell
• Basic operations
  • help, list, scan
• Create
  • A table 'MY_FIRST_TABLE'
  • Two column families 'FAM_1', 'FAM_2'
  • Ex.
    • create 't1', {NAME => 'f1'}, {NAME => 'f2'}
    • create 't1', 'f1', 'f2'
• Put two records (column)
  • Ex. put 't1', 'r1', 'c1', 'value'
• Update a record (column) (it is also a put)
• Delete a record (column)
  • delete 't1', 'r1', 'c1'

Page 24: 001 hbase introduction

Hands-on (3/3) – Requirements
• Put a screenshot of your successfully installed tm-puppet into git
  • Use the following commands
    • jps
    • ifconfig
  • Capture the screen
  • Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-001.jpg
• Put a screenshot of your hbase shell records into git
  • Use the following commands
    • scan 'MY_TEST_TABLE'
    • ifconfig
  • Capture the screen
  • Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-002.jpg
• Commit and push to git

Page 25: 001 hbase introduction

noSQL architecture practices (1/8) – Use noSQL as complement
• Use noSQL as a mirror (implemented by code)
  • The RDB remains the primary store, with noSQL as a mirror

NoSQL architecture practices (1): NoSQL as a complement ─ http://www.infoq.com/cn/news/2011/02/nosql-architecture-practice

Page 26: 001 hbase introduction

noSQL architecture practices (2/8) – Use noSQL as complement

// PSEUDO CODE for noSQL as a mirror
// We want to store the data object
bool status = false;
DB.startTransaction();               // start transaction
id = DB.Insert(data);                // write data object to RDB
if (id > 0) {
    status = NoSQL.Add(id, data);    // write data object to noSQL by id
}
if (id > 0 && status == true) {
    DB.commit();                     // commit transaction
} else {
    DB.rollback();                   // failed, roll back transaction
}
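The pseudocode above can be turned into a small runnable sketch. The assumptions here are loud ones: an in-memory SQLite database stands in for the RDB, a plain dict stands in for the noSQL store, and the `items` table and `save_mirrored` function are hypothetical names chosen for the example.

```python
import sqlite3


def save_mirrored(conn, kv_store, data):
    """Write `data` to the RDB and mirror it into the key-value store.

    Returns the new row id, or None if either write failed.
    """
    try:
        cur = conn.execute(
            "INSERT INTO items (title, name) VALUES (?, ?)",
            (data["title"], data["name"]),
        )
        kv_store[cur.lastrowid] = dict(data)  # mirror write, keyed by RDB id
        conn.commit()                         # both writes succeeded
        return cur.lastrowid
    except Exception:
        conn.rollback()                       # undo the RDB insert
        return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, name TEXT)")
mirror = {}
print(save_mirrored(conn, mirror, {"title": "title", "name": "name"}))
```

Note that the mirror write is not covered by the RDB transaction: if the commit itself fails after the mirror write succeeded, the mirror keeps an entry the RDB never stored, so a real system would also need a cleanup or reconciliation step.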

Page 27: 001 hbase introduction

noSQL architecture practices (3/8) – Use noSQL as complement
• Use noSQL as a mirror (implemented by synchronization)

Page 28: 001 hbase introduction

noSQL architecture practices (4/8) – Use noSQL as complement
• Combine RDB & noSQL

Page 29: 001 hbase introduction

noSQL architecture practices (5/8) – Use noSQL as complement

// PSEUDO CODE for RDB & noSQL combination
// We want to store the data object
data.title = "title";
data.name  = "name";
data.time  = "2009-12-01 10:10:01";
data.from  = "1";
bool status = false;
DB.startTransaction();               // start transaction
// write into RDB; data.from is a value needed by the search criteria
id = DB.Insert("INSERT INTO table (from) VALUES (?)", data.from);
if (id > 0) {
    // write data object to noSQL by id
    status = NoSQL.Add(id, data);
}
if (id > 0 && status == true) {
    DB.commit();                     // commit transaction
} else {
    DB.rollback();                   // failed, roll back transaction
}
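A runnable sketch of the combination, including the read path that the pseudocode leaves out: the RDB keeps only the searchable value, and the full object lives in the key-value store. As assumptions for this example, SQLite stands in for the RDB, a dict stands in for noSQL, and the `messages` table, its `sender` column (used because `from` is an SQL keyword), and both function names are hypothetical.

```python
import sqlite3


def save_split(conn, kv_store, data):
    """Store the search key in the RDB and the full object in noSQL."""
    try:
        cur = conn.execute(
            "INSERT INTO messages (sender) VALUES (?)", (data["from"],)
        )
        kv_store[cur.lastrowid] = dict(data)  # full object, keyed by RDB id
        conn.commit()
        return cur.lastrowid
    except Exception:
        conn.rollback()
        return None


def find_by_sender(conn, kv_store, sender):
    """Search in the RDB, then fetch the full objects from noSQL by id."""
    rows = conn.execute(
        "SELECT id FROM messages WHERE sender = ? ORDER BY id", (sender,)
    )
    return [kv_store[row_id] for (row_id,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT)")
store = {}
save_split(conn, store, {"title": "title", "name": "name",
                         "time": "2009-12-01 10:10:01", "from": "1"})
print(find_by_sender(conn, store, "1"))
```

Because the RDB rows hold only the id and the search value, they stay small, which is where the I/O, cache, replication, and backup benefits of this practice come from.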

Page 30: 001 hbase introduction

noSQL architecture practices (6/8) – Use noSQL as complement
• What benefits can we get from combining RDB & noSQL?
  • Reduces the I/O load on the RDB and the storage space it needs
  • Increases the RDB table-level cache hit rate: only updates to key values (PK, FK, values related to search criteria) refresh the cache
  • Increases synchronization efficiency in an RDB Master/Slave architecture
  • Increases RDB backup/recovery efficiency
  • Increases the scalability/performance of the whole system

Page 31: 001 hbase introduction

noSQL architecture practices (7/8) – Use noSQL as master
• Use noSQL only
  • Mainly for systems with simple query requirements
  • But some noSQL products can fulfill more complex queries
    • MongoDB, Tokyo Cabinet, etc.

NoSQL architecture practices (2): NoSQL as the primary store ─ http://www.infoq.com/cn/news/2011/03/nosql-architecture-practice-2

Page 32: 001 hbase introduction

noSQL architecture practices (8/8) – Use noSQL as master
• Use noSQL as the major data source
  • APs write data only into noSQL
  • The data is then synchronized from noSQL to other data stores, depending on the application
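One way to picture this practice is a primary store that records every change, plus a sync step that replays those changes into a secondary store. The sketch below is an assumption-laden illustration: dicts stand in for both stores, and `PrimaryStore`, `change_log`, and `sync` are hypothetical names, not part of any product mentioned on the slides.

```python
class PrimaryStore:
    """noSQL stand-in that records which keys changed, in order."""

    def __init__(self):
        self.data = {}
        self.change_log = []

    def put(self, key, value):
        self.data[key] = value
        self.change_log.append(key)


def sync(primary, target, start=0):
    """Copy entries changed since log position `start` into `target`.

    Returns the new log position to pass to the next sync call.
    """
    for key in primary.change_log[start:]:
        target[key] = primary.data[key]
    return len(primary.change_log)


primary = PrimaryStore()
primary.put("m1", {"text": "hello"})
search_index = {}                      # some other data store
position = sync(primary, search_index)
print(position, search_index)
```

The applications only ever call `put` on the primary store; each downstream store can run its own sync at its own pace and remembers how far it has read.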

Page 33: 001 hbase introduction

Case Study (1/4) – Facebook's Real-time Message System
• Uses HBase to store 135+ billion messages a month
• Beat out a few other competitors such as Cassandra, sharded MySQL, etc.
• Data patterns
  • A short set of temporal data that tends to be volatile
  • An ever-growing set of data that rarely gets accessed

Facebook's New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html

Page 34: 001 hbase introduction

Case Study (2/4) – Facebook's Real-time Message System
• Some key aspects of their system:
  • HBase
    • Has a simpler consistency model than Cassandra.
    • Very good scalability and performance for their data patterns.
    • Most feature rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc.
    • HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing.
    • Facebook's operational teams have a lot of experience using HDFS because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system.

Page 35: 001 hbase introduction

Case Study (3/4) – Facebook's Real-time Message System
• Haystack is used to store attachments.
• A custom application server was written from scratch in order to service the massive inflows of messages from many different sources.
• A user discovery service was written on top of ZooKeeper.
• Infrastructure services are accessed for: email account verification, friend relationships, privacy decisions, and delivery decisions.
• In keeping with their approach of small teams doing amazing things, 20 new infrastructure services are being released by 15 engineers in one year.
• Facebook is not going to standardize on a single database platform; they will use separate platforms for separate tasks.

Page 36: 001 hbase introduction

Case Study (4/4) – Alibaba China Site architecture

http://www.infoq.com/cn/presentations/hl-alibaba-cn-architecture-design-practice


Page 37: 001 hbase introduction


Page 38: 001 hbase introduction

Data Access pattern as the key for noSQL
• Data Structure
  • Structured
  • Semi-structured
  • Unstructured
• Size
  • How many & how often writes/reads (proportion)
• Data Writing
  • Transaction
• Data Reading
  • Random access
  • Sequential access
  • Relationship

Page 39: 001 hbase introduction

Q & A
