
김형준, http://www.jaso.co.kr


This work may be used under the Creative Commons Korea Attribution-NonCommercial-NoDerivs 2.0 Korea license.

Agenda

• NoSQL overview
• NoSQL characteristics
• NoSQL solutions
• HBase Architecture
• HBase Data Model
• HBase Index/Data File
• HBase Failover
• HBase Use Case

NoSQL

• Differs from classic relational database management systems
  – Does not use SQL as its query language
  – May not give full ACID guarantees
  – Has a distributed, fault-tolerant architecture
• NoSQL ≠ Anti-RDBMS; NoSQL = Not Only SQL
• The term was popularized in early 2009.

Background of NoSQL's Emergence

• Data tsunami
  – 40 billion Web pages, 55 trillion Web links (2009)
  – 281 exabytes, 45 GB/person (2009)
• Growing demand for scalable data stores
  – Scale-out rather than scale-up
• For large-volume or unstructured data, existing solutions carry unnecessary features
  – UPDATEs, DELETEs, and JOINs
  – ACID transactions
  – Fixed schema
• They do not support features needed for large-scale processing
  – Hierarchical data, graphs
• Technology sharing by global Internet service companies
  – Google, Amazon, Yahoo, Facebook, etc.
• ACID vs. BASE
  – ACID: Atomicity, Consistency, Isolation, Durability
    • Properties of traditional enterprise data
  – BASE: Basically Available, Soft state, Eventually consistent
    • Properties of Internet-based data

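CAP Conjecture

In a distributed environment, it is difficult to build a data store that satisfies all three properties within a reasonable response time.

[Figure: the three CAP properties (Consistency, Availability, Partition Tolerance) with example systems placed among them: RDBMS; Bigtable, Cloudata, HBase; Dynamo, Cassandra]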

http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf

MySQL – CAP(Read operation)

[Figure: normal operation. Components: ClientA, ClientB, Master, Slave #1, Slave #2; the Master replicates to both slaves. Statements shown: "Update Data = 'A' Where Key = 'k1'" and "Select Data Where Key = 'k1'".]

[Figure: partitioned. Same components and statements, but only one replication link remains. Properties noted: Availability, Partition Tolerance.]
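The read scenario on this slide can be sketched as a toy simulation. This is a minimal illustration under simplifying assumptions, not MySQL's replication protocol: the class `ReplicaSet` and its methods are hypothetical names, replication is modeled as an immediate copy to every reachable slave, and a reconnected slave never catches up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one master with slaves; NOT MySQL's actual replication. */
final class ReplicaSet {
    private final Map<String, String> master = new HashMap<>();
    private final List<Map<String, String>> slaves = new ArrayList<>();
    private final List<Boolean> reachable = new ArrayList<>();

    ReplicaSet(int slaveCount) {
        for (int i = 0; i < slaveCount; i++) {
            slaves.add(new HashMap<>());
            reachable.add(true);
        }
    }

    /** Cuts or restores the replication link to one slave. */
    void setPartitioned(int slave, boolean partitioned) {
        reachable.set(slave, !partitioned);
    }

    /** Writes go to the master and are copied to reachable slaves only. */
    void update(String key, String value) {
        master.put(key, value);
        for (int i = 0; i < slaves.size(); i++) {
            if (reachable.get(i)) {
                slaves.get(i).put(key, value);
            }
        }
    }

    /** Reads are served by a slave, so they may return stale data. */
    String selectFromSlave(int slave, String key) {
        return slaves.get(slave).get(key);
    }

    public static void main(String[] args) {
        ReplicaSet db = new ReplicaSet(2);
        db.update("k1", "old");
        db.setPartitioned(1, true);                       // Slave #2 loses its link
        db.update("k1", "A");                             // ClientA: Update Data = 'A'
        System.out.println(db.selectFromSlave(0, "k1"));  // prints A
        System.out.println(db.selectFromSlave(1, "k1"));  // prints old (stale)
    }
}
```

Both slaves keep answering reads during the partition (availability and partition tolerance), but the cut-off slave returns the old value, which is the consistency that is given up.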

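NoSQL Characteristics

• Simple data models
  – Key/value
  – Document-based
  – Simple table/column model
• Built from many low-cost x86 servers
  – Mostly distributed architectures
  – Individual components are failure-prone
• Scalability and high availability
  – Automatic partitioning or sharding
  – Data replication
  – Automatic failover and recovery
• Weaker data consistency than relational databases
  – Eventual consistency
• Used for specific purposes rather than as a general-purpose store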

NoSQL Considerations

• Data model
  – Key/value, document, wide columnar
• Storage model
  – In-memory, persistent, hybrid
• Consistency model
  – Strong, eventual
• Data partitioning
  – Not supported, DHT, META
• Membership changes
  – Supported or not, online (no-downtime) support, whether data is reallocated
• Read/write performance
• Supported indexes
  – Row key only, secondary index, full-text index
• Failure handling (or availability) mechanism
  – Hot standby, cold standby
  – Data replication
• Client support
  – HTTP (REST), specific languages only, multiple languages
• License
  – Commercial, Apache, GPL

Data Systems and Timeframes

• Offline processing (1 hour+)
  – Hadoop
  – Data warehousing
  – OLAP
• Near-real-time processing (1 second to 1 hour)
  – Message queues (JMS, RabbitMQ, …)
  – CEP (Esper)
• Blocking request/response (less than 1 second)
  – OLTP: MySQL, Oracle, …
  – NoSQL: HBase, Cassandra, …

NoSQL Solution

• key-value cache
  – memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracotta, redis(?)
• key-value store
  – keyspace, flare, schema-free, RAMCloud
• clustered key-value store
  – dynamo, voldemort, Dynomite, SubRecord, Dovetaildb
• ordered key-value store
  – tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
• tuple store
  – gigaspaces, coord, apache river
• object database
  – ZopeDB, db4o, Shoal
• document store
  – CouchDB, MongoDB, Jackrabbit, XML Databases, ThruDB, CloudKit, Persevere, Riak Basho, Scalaris
• wide columnar store
  – BigTable, Cloudata, HBase, Cassandra, Hypertable, KAI, Qbase, KDI

Who Uses NoSQL?

• Twitter
  – Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
• Facebook
  – Cassandra, HBase, Hadoop, Scribe, Hive
• Netflix
  – Amazon SimpleDB, Cassandra
• Digg
  – Cassandra
• SimpleGeo
  – Cassandra
• StumbleUpon
  – HBase, OpenTSDB
• Yahoo!
  – Hadoop, HBase, PNUTS
• Rackspace
  – Cassandra
• Gruter
  – Hadoop, Cloudata, MongoDB, HBase, Hive, Flume

Representative NoSQL Architectures

• Bigtable
  – How can we build a distributed DB on top of a distributed file system?
  – Shared disk or data
  – http://labs.google.com/papers/bigtable.html
  – 2006
• Dynamo
  – How can we build a distributed hash table appropriate for the data center?
  – DHT (Distributed Hash Table)
  – http://portal.acm.org/citation.cfm?id=1294281
  – 2007

Google Bigtable

• Google's data management system
  – Google App Engine, Analytics, Docs, Earth, etc.
• A sparse, distributed, persistent multidimensional sorted map
  – Indexed by row key, column key, timestamp
• LSM-Tree
  – Log-Structured Merge-Tree
  – Uses in-memory and on-disk structures together
  – Data is stored in Google File System
  – Overcomes the limitations of a distributed file system
• Serves both real-time transactions and batch processing
  – Integrates easily with the MapReduce platform
• Bigtable clone projects
  – Cloudata
    • Korea, Gruter, Hadoop-based, http://www.cloudata.org
  – HBase
    • Apache, Hadoop-based, http://hbase.apache.org
  – Hypertable
    • Zvents, C++, Hadoop, KFS
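The "multidimensional sorted map indexed by row key, column key, timestamp" can be pictured as nested sorted maps. The following is a minimal in-memory sketch, assuming string keys and values and newest-first timestamps; `SortedCellMap` and its methods are hypothetical names for illustration and are not part of Bigtable or HBase.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory version of a (row key, column key, timestamp) -> value map. */
final class SortedCellMap {
    // row key -> column key -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell is absent. */
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Row keys in [startRow, stopRow) in sorted order: the basis of a scan. */
    Iterable<String> scanRowKeys(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false).keySet();
    }
}
```

Because rows are kept sorted, a range scan is a cheap sub-map view, while "sparse" simply means that absent cells take no space at all.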

HBase

• An open-source project based on the features and design of Google Bigtable
  – http://hbase.apache.org
  – Apache License
  – Development was started by developers at Powerset
  – Committers now come from Cloudera, Facebook, and others
• Excellent scalability and stable(?) data storage
  – Scales to several hundred servers
  – Automatic table split & re-assignment
  – Uses Hadoop for data storage
  – Uses ZooKeeper (zookeeper.apache.org) as the cluster coordinator
• High write throughput, general read throughput
  – 0.x to a few ms per write, a few ms per read
• Table management
  – Create, drop, modify table schema
• Real-time data processing
  – Single-row operations (no join, group by, order by)
  – Multi-row operation: scan
• Batch processing support
  – Input/OutputFormat, bulk loading
• Failover
  – On server failure, regions are reassigned to another server within seconds to minutes

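HBase/Hadoop-Based Software Stack

[Figure: software stack]
• User applications
• Distributed/parallel computing platform (MapReduce)
• HBase (large-scale distributed data store): HMaster, RegionServer #1, RegionServer #2, …, RegionServer #n; holds the logical tables
• Distributed file system (Hadoop): physical storage
• ZooKeeper (lock service)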


HBase System Components

[Figure: HBase system components. Each physical server #1 … #n runs a DataNode (HDFS), a TaskTracker (Map&Reduce), and a RegionServer, backed by local disks (SATA). An HMaster runs alongside them. The Client (HTable, HBaseAdmin) connects over rpc. Labels on the connections: rpc, event, failover/event. Legend: physical server; daemon process (HBase); daemon process (other platform); control flow; data flow.]

HBase Data Model

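[Figure: TableA as a sorted table. Rows row #1 … row #n are sorted by row key and split into Region-1, Region-2, …, Region-n (boundaries shown at row #k / row #k+1 and row #m / row #m+1). Each row has a Rowkey and ColumnFamily#1 … ColumnFamily#n. Within a column family, each column key (ck1, ck2, …, ckn) holds timestamped values such as (v1, t1), (v2, t2). Data is sorted by row key, then by column. Regions are deployed to distributed servers (RegionServers).]

[Figure: structure of one row. Row#1 = Row.Key + column families CF1 (Col1-1 … Col1-N), CF2 (Col2-1 … Col2-K), …, CFn (ColN-1 … ColN-M). Each Column = Column Key + Value(t1), Value(t2), …, Value(tn).]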

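Data Distribution and Lookup

• Region
  – 1 table = n regions; the region is the unit of data distribution
  – 100–200 MB per region, thousands of regions per server
• Lookup path
  – ROOT table: stores the locations of META regions
  – META table: stores the locations of user regions (tablets)
  – User table: stores data file information
  – Data file: stores an index over row keys

[Figure: lookup path. ZooKeeper → Root Table (an index over the Meta index, with entries such as M.T1.1000:M1, M.T1.2000:M2, …, Max:mn) → Meta Table regions m1, m2, …, mn (an index over user regions, with entries such as T1.100:U1, T1.200:U2, …, T1.1000:UN and T1.1100:U1, T1.1200:U2, …, T1.2000:UN) → user-defined table regions U1, U2, … (row keys 10, 20, …, 100, 110, 120, …, 200) → HFile (physical file, sorted by rowkey and column-name), which carries an index over its 64KB blocks (maxkey, file-offset) and is then scanned.]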
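The ROOT → META → user region lookup can be sketched with two levels of sorted maps, each keyed by the largest row key a region covers. This is a simplified illustration, assuming integer row keys and a single table; `RegionLookup`, `addRegion`, and `locate` are hypothetical names and do not reflect HBase's actual catalog format.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy two-level lookup: ROOT -> META region -> user region. */
final class RegionLookup {
    // ROOT: max row key covered by a META region -> that META region's entries.
    // META entries: max row key covered by a user region -> region id.
    private final TreeMap<Integer, TreeMap<Integer, String>> root = new TreeMap<>();

    /** Registers a user region under the META region whose max key is metaMaxKey. */
    void addRegion(int metaMaxKey, int regionMaxKey, String regionId) {
        root.computeIfAbsent(metaMaxKey, k -> new TreeMap<>())
            .put(regionMaxKey, regionId);
    }

    /** Finds the region holding rowKey, or null if no region covers it. */
    String locate(int rowKey) {
        // Step 1: first META region whose max key is >= rowKey.
        Map.Entry<Integer, TreeMap<Integer, String>> meta = root.ceilingEntry(rowKey);
        if (meta == null) {
            return null;
        }
        // Step 2: first user region whose max key is >= rowKey.
        Map.Entry<Integer, String> region = meta.getValue().ceilingEntry(rowKey);
        return region == null ? null : region.getValue();
    }

    public static void main(String[] args) {
        RegionLookup lookup = new RegionLookup();
        lookup.addRegion(1000, 100, "U1");
        lookup.addRegion(1000, 200, "U2");
        lookup.addRegion(2000, 1100, "U1");
        lookup.addRegion(2000, 1200, "U2");
        System.out.println(lookup.locate(150));   // prints U2
        System.out.println(lookup.locate(1050));  // prints U1
    }
}
```

Each step is a single ordered search, so locating a row costs a fixed number of hops regardless of table size; real clients typically also cache the result.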

HBase Index Structure

Data (stored in 64KB blocks):

Row key   Column key   Value    Timestamp
rk1       ck1          value1   T1
rk1       ck2          value2   T2
rk2       ck1          value3   T1
rk3       ck1          value4   T1
rk3       ck2          value5   T1
rk3       ck3          value6   T1
rk4       ck1          value7   T2

Index (one entry per 64KB block):

Index key   Offset
rk2.ck1     0
rk3.ck2     64
rk4.ck1     128
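Reading the index entries as (max key of a block, file offset), a point lookup only has to find the first index entry whose key is greater than or equal to the target. The sketch below illustrates that search using the slide's three entries; `BlockIndex` and its methods are hypothetical names, and the "rowkey.columnkey" string encoding is an assumption made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy block index: max key of each data block -> block offset. */
final class BlockIndex {
    private final TreeMap<String, Long> maxKeyToOffset = new TreeMap<>();

    void addBlock(String maxKey, long offset) {
        maxKeyToOffset.put(maxKey, offset);
    }

    /** Returns the offset of the only block that can hold the key, or -1. */
    long findBlockOffset(String rowKey, String columnKey) {
        Map.Entry<String, Long> block =
                maxKeyToOffset.ceilingEntry(rowKey + "." + columnKey);
        return block == null ? -1L : block.getValue();
    }

    public static void main(String[] args) {
        BlockIndex index = new BlockIndex();
        index.addBlock("rk2.ck1", 0L);
        index.addBlock("rk3.ck2", 64L);
        index.addBlock("rk4.ck1", 128L);
        System.out.println(index.findBlockOffset("rk3", "ck1")); // prints 64
        System.out.println(index.findBlockOffset("rk9", "ck1")); // prints -1
    }
}
```

Only the selected block is then read and scanned, which is why the index can stay small enough to keep in memory.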

Column Oriented

[Figure: Row Oriented vs. Column Oriented storage. Row oriented: MySQL (ibdata1 file) stores each record's id, mail, and address together. Column oriented: values are stored per column in an id column file (babokim, jindol, klee), a mail column file (e-mail addresses), and an address column file (seoul, pusan, suwon).]

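Data Operation

[Figure: read/write path inside a RegionServer. put(key, value) is written to the CommitLog (HDFS) and to the MemoryTable; the MemoryTable is written out as HFile#1, HFile#2, …, HFile#n (HDFS). Minor Compaction and Major Compaction merge HFiles into a merged MapFile (HDFS). Split divides the data into split HFile#1 and split HFile#2 (HDFS). get(key) goes through the Searcher.]

LSM-Tree: Log-Structured Merge-Tree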
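The write and read paths of an LSM-tree can be condensed into a small in-memory sketch. It is an illustration under strong simplifications, not HBase's implementation: lists and maps stand in for HDFS files, deletes are omitted, and `TinyLsmStore` with its `flushThreshold` parameter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy LSM-tree: commit log + sorted memtable + immutable sorted files. */
final class TinyLsmStore {
    private final List<String> commitLog = new ArrayList<>();   // stands in for the log on HDFS
    private TreeMap<String, String> memTable = new TreeMap<>();
    private final List<TreeMap<String, String>> files = new ArrayList<>(); // oldest first
    private final int flushThreshold;

    TinyLsmStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Append to the log first, then update memory; flush when memory is full. */
    void put(String key, String value) {
        commitLog.add(key + "=" + value);
        memTable.put(key, value);
        if (memTable.size() >= flushThreshold) {
            files.add(memTable);            // the memtable becomes an immutable "file"
            memTable = new TreeMap<>();
        }
    }

    /** Check memory first, then files from newest to oldest. */
    String get(String key) {
        String value = memTable.get(key);
        for (int i = files.size() - 1; value == null && i >= 0; i--) {
            value = files.get(i).get(key);
        }
        return value;
    }

    /** Merge all files into one; newer values overwrite older ones. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file);
        }
        files.clear();
        if (!merged.isEmpty()) {
            files.add(merged);
        }
    }

    int fileCount() {
        return files.size();
    }
}
```

Writes never modify an existing file, which suits an append-oriented distributed file system; the cost is that reads may touch several files until a compaction merges them.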

HBase & MapReduce

[Figure: HBase and MapReduce. TableA (Region-1, Region-2, Region-3, …, Region-N, located through the META Table) is read via TableInputFormat by Map Tasks running on the TaskTrackers of the Hadoop MapReduce Platform. Map output is partitioned by key and passed to Reduce Tasks, which write via TableOutputFormat to TableB (Region-1, Region-2). A DBMS or HDFS is shown as an alternative endpoint.]

Failover

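• Master failure
  – Data operations continue normally
  – Only table schema management and region splits are affected
  – Handled by running multiple masters
• RegionServer failure
  – The Master reassigns the regions
  – Recovery within tens of seconds to a few minutes
• ZooKeeper failure
  – Run as a 3- or 5-node ensemble, which is assumed never to fail as a whole (it tolerates the loss of a minority of nodes)
• Hadoop NameNode failure
  – A separate redundancy scheme is required
• Whole-Hadoop failure
  – The HBase cluster fails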
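The 3- or 5-node recommendation for ZooKeeper follows from majority-quorum arithmetic: the ensemble keeps working only while more than half of its nodes are alive. A minimal sketch of that arithmetic follows; the class `Quorum` is a hypothetical helper for illustration, not a ZooKeeper API.

```java
/** Majority-quorum arithmetic for an ensemble of n voting nodes. */
final class Quorum {
    /** Smallest number of nodes that forms a majority. */
    static int majority(int nodes) {
        return nodes / 2 + 1;
    }

    /** How many nodes may fail while a majority is still alive. */
    static int tolerableFailures(int nodes) {
        return nodes - majority(nodes);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5}) {
            System.out.println(n + " nodes tolerate " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

Under this rule a 4-node ensemble tolerates no more failures than a 3-node one, which is why odd sizes are the usual choice.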

Client API

• Native Java client/API
  – get(Get get), put(Put put)
  – delete(Delete delete)
  – getScanner(Scan scan)
• Non-Java clients
  – REST server
  – Avro server
  – Thrift server
• TableInputFormat/TableOutputFormat for MapReduce
  – HBase as MapReduce source and/or target
• HBase Shell
  – JRuby shell adding get, put, scan and admin calls

Client Code

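// HBase 0.90-era client API; tableName and cfName are Strings defined elsewhere.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;

// Point the client at the ZooKeeper quorum.
Configuration conf = HBaseConfiguration.create();
conf.set(HConstants.ZOOKEEPER_QUORUM, "local01");
conf.set("hbase.zookeeper.property.clientPort", "2181");

// Create a table with one column family.
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(tableName);
desc.addFamily(new HColumnDescriptor(cfName));
admin.createTable(desc);

// Write one cell: row "rk1", column cfName:age, value "30".
HTable table = new HTable(conf, tableName);
Put put = new Put("rk1".getBytes());
put.add(cfName.getBytes(), "age".getBytes(), "30".getBytes());
table.put(put);

// Read the row back and print every value in the column family.
Get get = new Get("rk1".getBytes());
get.addFamily(cfName.getBytes());
Result result = table.get(get);
for (KeyValue eachValue : result.list()) {
    System.out.println(new String(eachValue.getValue()));
}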

Bigtable Use Case

• Google News Personalization

HBase Use Case

• Facebook Social plugin

[Figure: pipeline stages: collection, real-time analysis, real-time feedback, batch analysis; workloads labeled Analytic and Transactional.]

"process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds."

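HBase Use Case (continued)

• Facebook Social plugin

[Figure: data collectors (scribe) write and sync log files such as /category1/collect_1.dat to Hadoop; ptail feeds them to a Driver with an Aggregation Store, a Checkpoint Handler, and Storage; results are kept as key/value pairs (Key1: value, Key2: value, Key3: value) in HBase and served to clients through a ThriftServer. The design follows a real-time Map/Reduce concept.]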

Cloudata Use Case

• www.seenal.com
  – Social network monitoring and analysis service
  – Used to store and analyze raw Twitter and blog data

[Figure: system architecture in four tiers labeled HTTP, Application, Analysis, and Storage. Components: WebServer (apache), API WebServer (jetty), AppServer (thrift), Cache (memcached), Distributed Search Server (lucene, thrift), Crawler, LogCollector (flume, scribe), Analysis App., Distributed Indexer, MapReduce, File Storage (HDFS), Data Storage (Cloudata). The web servers, API web servers, and app servers each run as multiple instances.]

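Conclusion

• Many solutions exist for data storage
  – Drop the idea that Oracle and MySQL are the only options
  – First identify the data properties and requirements of the system (CAP, ACID/BASE)
  – Apply several solutions within one system, because a single system holds data with different properties
    • Small-scale data with complex relations: RDBMS
    • Large-scale data processed in real time: NoSQL
    • Large-scale data kept for storage: Hadoop, etc.
    • Fast data lookup: cache
• Choosing the right solution
  – Always verify issues that can arise in operation before adopting
  – Most NoSQL solutions are still in beta (a hasty choice can backfire)
  – Verify down to the level of the solution's source code
• Securing the stability of a NoSQL solution
  – The stability of the solution itself must be verified; it does not match the stability of current DBMSs
  – Apply it only after securing a reliable way to safeguard the data
  – Developers with operations and development experience are hard to find
  – Select a NoSQL solution that fits the requirements
• Run a pilot first rather than starting with a critical system
  – Validate the selected solution and build in-house expertise
• For storage, building your own may also be necessary
  – Many Internet companies have released the stores they develop and use
  – This is also why so many open-source NoSQL projects are being announced

Thank you.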
