NoSQL Database
민형기(S-Core)
hg.min@samsung.com
2013. 2. 22.
1
Contents
I. NoSQL Overview
II. NoSQL Core Concepts
III. NoSQL Key Papers
IV. NoSQL Types
V. NoSQL Performance Measurement
VI. NoSQL Summary
2
NoSQL Overview
3
Thinking – Extreme Data
Qcon London 2012
4
Organizations need deeper insights
5
New Trends
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
6
□ Not "NO! SQL" but "Not Only SQL"
A new form of data store built to overcome the limitations of relational databases
Recently the focus is on big-data processing and distributed-system problems!
NoSQL Definition
A next-generation database whose key characteristics are being non-relational, distributed, open-source, and horizontally scalable
– nosql-database.org
Google Trends - nosql
7
NoSQL Technology Status: Gartner's 2012 Hype Cycle for Big Data
Source: Gartner
8
Origin of the Term NoSQL
Source: Wikipedia, Dataversity
□ 1998: Carlo Strozzi named his lightweight open-source relational DB "NoSQL" because it did not expose a SQL interface
Simplified the system architecture
Made the system portable to heterogeneous machines
Could be driven with shell-like tools in any UNIX environment
Offered fewer features than standard commercial products, at a lower price
Removed constraints such as data field sizes and column limits
□ 2009: Eric Evans, a Rackspace employee, revived "NoSQL" for an event on open-source, distributed, non-relational databases, defining it by how it differs from a traditional relational DBMS
Non-relational
Distributed
No ACID support
□ 2011: work on UnQL (Unstructured Query Language) began
9
Why NoSQL?
□ACID doesn't scale well
□Web apps have different needs (than the apps that RDBMSs were designed for)
Low and predictable response time (latency)
Scalability & elasticity (at low cost!)
High availability
Flexible schema / semi-structured data
Geographic distribution (multiple datacenters)
□Web apps can (usually) do without
Transactions / strong consistency / integrity
Complex queries
http://www.slideshare.net/marin_dimitrov/nosql-databases-3584443
10
Problems with Relational Databases I – Scalability
□ Replication – scaling by copying
Master-Slave structure
Data is replicated to as many slaves as exist; past a certain point this hits a limit
Reads are fast, but writes bottleneck because they all go to a single node
Changes take time to propagate from master to slaves, so critical reads must still go to the master, which application development has to account for
With large data volumes the data must be replicated N times, which can become a problem; this limits how far Master-Slave replication can scale
Master-Master structure
Adding masters can improve write throughput, but conflicts become possible
The conflict probability grows on the order of O(N^3) or O(N^2) (http://research.microsoft.com/~gray/replicas.ps)
□ Partitioning (Sharding) – scaling by splitting
Writes scale as well as reads, but the application must be aware of the partitioning
The value of an RDBMS lies in relations; partitioning breaks those relations, and joins across partitions are impossible, so the application layer must take responsibility for them
In general, manual sharding on an RDBMS is not easy
11
Problems with Relational Databases I – Scalability
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
12
Problems with Relational Databases II – Unneeded Features
□ UPDATE and DELETE
Rarely used, because they lose information
Records are needed for auditing and re-activation
From a domain point of view, deletes and updates are generally not used
UPDATE and DELETE can be modeled with INSERT plus a version
As data grows, inactive data is archived
An INSERT-only system has two problems:
Cascade triggers in the database cannot be used
Queries need to filter out inactive data
□ JOIN
Why avoid it: with large data volumes, JOINs are expensive because they run complex operations over lots of data, and they do not work across partitions
How to avoid it:
The purpose of normalization is to make data consistency easy and to reduce storage
De-normalization avoids the JOIN problem; it moves responsibility for consistency from the DB to the application, which is not hard if the system is INSERT-only
13
Problems with Relational Databases III – Unneeded Features
□ ACID transactions
Atomicity: atomicity across multiple records is unnecessary; single-key atomicity is enough
Consistency: most systems need P or A rather than C, so strict consistency is unnecessary; eventual consistency can be used instead
Isolation: isolation above Read-Committed is unnecessary, and single-key atomicity is simpler
Durability: durability is needed until memory becomes cheap enough to hold all the data even when individual nodes fail
□ Fixed schema
An RDBMS requires the schema (tables, indexes, etc.) to be defined before the data can be used
Schema changes are routine: today's web environment demands them in order to ship new features quickly and adjust existing ones
Schema changes are painful: adding/modifying/dropping a column locks rows, and modifying an index locks the table
□ Some missing capabilities
Hierarchies and graphs are hard to model
For fast responses it is better to avoid disk and serve data from main memory, but most relational databases are disk-based, so queries run against disk
14
NoSQL Features
□Scale horizontally for "simple operations"
Key lookups, reads and writes of one record or a small number of records, simple selections
□Replicate/distribute data over many servers
□Simple call-level interface (contrast with SQL)
□Weaker concurrency model than ACID
Eventual Consistency
BASE
□Efficient use of distributed indexes and RAM
□Flexible Schema
http://www.cs.washington.edu/education/courses/cse444/12sp/lectures/lecture26-nosql.pdf
15
NoSQL Use Cases
□Massive data volumes
A massively distributed architecture is required to store the data
Google, Amazon, Yahoo, Facebook – 10K ~ 100K servers
□Extreme query workload
Impossible to do joins efficiently at that scale with an RDBMS
□Schema evolution
Schema flexibility (migration) is not trivial at large scale
Schema changes can be introduced gradually with NoSQL
16
NoSQL Core Concepts: CAP Theorem, ACID vs. BASE, Isolation Levels, MVCC, Distributed Transactions
17
CAP Theorem I
□Three properties a distributed system should guarantee
□Consistency: every user always sees the same data.
□Availability: all users can always read and write.
□Partition tolerance: the system keeps working over a physically partitioned network.
□Within a reasonable time, a distributed system can satisfy only two of these properties.
2000: Prof. Eric Brewer, PODC Conference keynote
2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)
18
Network partitions in a distributed system must be planned for, so in practice there are only two real choices: which of the other two properties to give up. – Werner Vogels (Amazon CTO)
CAP Theorem II
(Diagram: the three properties, with systems placed by the pair they choose)
CA – relational: RDBMS
CP – partitioning-based: BigTable, HBase, MongoDB
AP – DHT-based: Dynamo, Cassandra
In a distributed environment, it is hard to build a store that satisfies all three properties within a reasonable response time.
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
19
CAP Theorem III – Partition Tolerance vs. Availability
"The network will be allowed to lose arbitrarily many messages sent from one node to another [...]"
"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response"
- Gilbert and Lynch, SIGACT 2002
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
CP: requests can complete at nodes that have quorum
AP: requests can complete at any live node, possibly violating strong consistency
20
ACID
Atomicity: All or nothing.
Consistency: Consistent state of data and transactions.
Isolation: Transactions are isolated from each other.
Durability: When the transaction is committed, state will be durable.
Any data store can achieve Atomicity, Isolation and Durability but do you always need consistency? No. By giving up ACID properties, one can achieve higher performance and scalability.
21
BASE – ACID alternative
Basically available: Nodes in a distributed environment can go down, but the whole system shouldn't be affected.
Soft State (scalable): The state of the system and data changes over time, even without input. This is because of the eventual consistency model.
Eventual Consistency: Given enough time, data will be consistent across the distributed system.
22
ACID vs. BASE
ACID:
□Strong consistency
□Isolation
□Focus on "commit"
□Nested transactions
□Less availability
□Conservative (pessimistic)
□Difficult evolution (e.g. schema)
BASE:
□Weak consistency
□Availability first
□Best effort
□Approximate answers
□Aggressive (optimistic)
□Simpler!
□Faster
□Easier evolution
Source: Brewer
23
Isolation Levels
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
□Read Uncommitted (aka NOLOCK)
Does not issue shared locks, does not honor exclusive locks
Rows can be updated/inserted/deleted before the transaction ends
Least restrictive
□Read Committed: holds shared locks
Cannot read uncommitted data, but data can be changed before the end of the transaction, resulting in non-repeatable reads or phantom rows
24
Isolation Levels
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
□Repeatable Read: locks data being read, prevents updates/deletes
New rows can be inserted during the transaction and will be included in later reads
□Serializable: aka HOLDLOCK on all tables in SELECT
Locks the range of data being read, no modifications are possible
Prevents updates/deletes/inserts
Most restrictive
25
Dirty Reads
http://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels
26
Non Repeatable Reads
http://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels
27
Phantom Reads
http://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels
28
Isolation Levels vs. Read Phenomena

Isolation Level   | Dirty Reads | Non-repeatable Reads | Phantom Reads
Read Uncommitted  | possible    | possible             | possible
Read Committed    | -           | possible             | possible
Repeatable Read   | -           | -                    | possible
Serializable      | -           | -                    | -
http://en.wikipedia.org/wiki/Isolation_(database_systems)
29
Isolation Levels vs. Locks

Isolation Level   | Range Lock | Read Lock | Write Lock
Read Uncommitted  | -          | -         | -
Read Committed    | Exclusive  | Shared    | -
Repeatable Read   | Exclusive  | Exclusive | -
Serializable      | Exclusive  | Exclusive | Exclusive
http://en.wikipedia.org/wiki/Isolation_(database_systems)
30
Multi Version Concurrency Control
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
(Diagram: a B-tree of index pages and data pages under a single Root)
31
Multi Version Concurrency Control
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
(Diagram: the same B-tree after an update – new versions of the changed index and data pages are written alongside the obsolete ones, the root pointer is switched with an atomic pointer update, and old pages are marked for compaction; reads are never blocked)
32
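The copy-on-write behaviour in the diagrams above can be sketched in a few lines (an illustrative toy, not InnoDB's actual implementation): writers append new versions instead of overwriting, and a reader works against the snapshot it took when it started, so reads are never blocked.

```python
class MVCCStore:
    """Toy multi-version store: every write creates a new version."""
    def __init__(self):
        self.clock = 0        # global commit counter
        self.versions = {}    # key -> list of (commit_ts, value), oldest first

    def write(self, key, value):
        self.clock += 1       # old versions stay until compaction
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock     # a reader remembers the clock when it starts

    def read(self, key, snap):
        # newest version committed at or before the reader's snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snap:
                return value
        return None

store = MVCCStore()
store.write("k", "v1")
snap = store.snapshot()                    # reader starts here
store.write("k", "v2")                     # concurrent writer, reader not blocked
print(store.read("k", snap))               # v1: the reader's snapshot is stable
print(store.read("k", store.snapshot()))   # v2: a new reader sees the update
```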
Distributed Transactions – 2PC
□Voting phase: each site is polled as to whether the transaction should commit (i.e., whether its sub-transaction can commit)
□Decision phase: if any site says "abort" or does not reply, then all sites must be told to abort
□Logging is performed for failure recovery (as usual)
http://www.slideshare.net/atali/2011-db-distributed
33
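The two phases above can be sketched as follows (a toy with assumed names; a real implementation adds timeouts, retries, and recovery from the forced logs):

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state, self.log = "init", []

    def prepare(self):                # voting phase: vote on the sub-transaction
        self.log.append("prepared")   # force-write the log before voting
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):       # decision phase
        self.log.append(decision)
        self.state = decision

def two_phase_commit(participants):
    # Phase 1 (voting): poll every site; any "no" (or timeout) means abort.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (decision): tell every site the global outcome.
    for p in participants:
        p.finish(decision)
    return decision

sites = [Participant("a"), Participant("b"), Participant("c", can_commit=False)]
print(two_phase_commit(sites))   # abort: one site voted no
```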
Distributed Transactions –2PC Protocol Actions
34
Distributed Transactions –2PC Protocol Actions
http://www.slideshare.net/j_singh/cs-542-concurrency-control-distributed-commit
35
NoSQL Key Papers: Amazon Dynamo, Google BigTable
36
Amazon Dynamo: Motivation, Key Features, Consistent Hashing, Virtual Nodes, Vector Clocks, Gossip Protocols & Hinted Handoffs, Read Repair
37
Amazon Dynamo - Motivation
□Vast distributed system: tens of millions of customers, tens of thousands of servers; failure is the normal case
□An outage means lost customer trust and financial losses
□Goal: a great customer experience (always available, fast, reliable, scalable)
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
38
Key Features 1/2
□Amazon, ~2007
□Highly available key-value storage system: 99.9995% of requests
Targeted at primary-key access and small values (< 1MB)
□Scalable and decentralized
□Gives tight control over the tradeoffs between availability, consistency, and performance
□Data partitioned using consistent hashing
□Consistency facilitated by object versioning
Quorum-like technique for replica consistency
Decentralized replica-synchronization protocol
Eventual consistency
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, http://www.slideshare.net/kingherc/bigtable-and-dynamo
39
Key Features 2/2
□Gossip protocol for: failure detection, membership protocol
□Service Level Agreements (SLAs): include the client's expected request-rate distribution and expected service latency
e.g.: response time < 300ms for 99.9% of requests, at a peak load of 500 requests/sec
Example: managing shopping carts, written/read and available across multiple data centers
□Trusted network, no authentication
□Incremental scalability
□Symmetry
□Heterogeneity, load distribution
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, http://www.slideshare.net/kingherc/bigtable-and-dynamo
40
Techniques used in Dynamo
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

Problem                            | Technique                                              | Advantage
Partitioning                       | Consistent hashing                                     | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads         | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                       | Provides high availability and a durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                        | Synchronizes divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node-liveness information

Consistent Hashing, Vector Clocks, Gossip Protocols, Hinted Handoffs, Read Repair, Merkle Trees
41
Modulo-based Hashing
N1 N2 N3 N4
partition = key % n_servers
?
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
42
Modulo-based Hashing
N1 N2 N3 N4
partition = key % (n_servers – 1)
?
The hashes of all entries must be recalculated whenever n_servers changes
(i.e. full data redistribution when adding/removing a node)
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
43
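The redistribution problem on this slide is easy to demonstrate with a small illustrative script: with `partition = key % n_servers`, removing one server remaps the vast majority of keys.

```python
# What fraction of keys change partition when the server count changes?
def moved_fraction(keys, n_before, n_after):
    moved = sum(1 for k in keys if k % n_before != k % n_after)
    return moved / len(keys)

keys = range(10_000)
# Going from 4 servers to 3 moves about 75% of all keys.
print(f"{moved_fraction(keys, 4, 3):.0%} of keys move when going 4 -> 3 servers")
```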
Consistent Hashing
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
(Ring diagram: nodes A–F placed on the key space from 0 to 2^128)
Ring (Key Space)
Same hash function for data and nodes
idx = hash(key)
Coordinator: the next available node clockwise from hash(key)
44
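The ring lookup can be sketched in a few lines (illustrative names, not Dynamo's actual API): one hash function places both nodes and keys on the ring, and the coordinator for a key is the next node clockwise.

```python
import bisect
import hashlib

def h(s):
    # Position on a 2**128 key space (md5 chosen only for illustration)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def coordinator(self, key):
        # First node clockwise from hash(key), wrapping around the ring
        i = bisect.bisect(self.ring, (h(key),)) % len(self.ring)
        return self.ring[i][1]

    def preference_list(self, key, n=3):
        # Coordinator plus the N-1 clockwise successors (the replica holders)
        i = bisect.bisect(self.ring, (h(key),))
        return [self.ring[(i + j) % len(self.ring)][1] for j in range(n)]

ring = Ring(["A", "B", "C", "D", "E", "F"])
print(ring.preference_list("user:42"))   # three distinct nodes off the ring
```

Adding or removing a node only moves the keys between that node and its predecessor, instead of reshuffling everything as modulo hashing does.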
Consistent Hashing - Replication
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
(Ring diagram: nodes A–F on the key space from 0 to 2^128)
Ring (Key Space)
Data is replicated on the N-1 clockwise successor nodes
KeyAB is hosted in B, C, D
Node hosting KeyFA, KeyAB, KeyBC
46
Consistent Hashing – Node Changes
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
(Ring diagram: nodes A–F)
Key membership and replicas are updated when a node joins or leaves the network.
The number of replicas for all data is kept consistent.
Copy KeyEF
Copy KeyFA
Copy KeyAB
47
Virtual Nodes
http://www.slideshare.net/kingherc/bigtable-and-dynamo
Random assignment leads to:
Non-uniform data distribution
Uneven load distribution
Solution: "virtual nodes"
A single node (physical machine) is assigned multiple random positions ("tokens") on the ring
On failure of a node, its load is evenly dispersed across the rest
On joining, a node accepts an equivalent amount of load
The number of virtual nodes assigned to a physical node can be decided based on its capacity
48
Virtual Nodes – Load Distribution
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
(Ring diagram: tokens A–I on the key space from 0 to 2^128)
Ring (Key Space)
Different strategies for virtual nodes:
Node 1: tokens A, E, G
Node 2: tokens C, F, H
Node 3: tokens B, D, I
Random tokens per physical node, partitioned by token value
49
Virtual Nodes – Load Distribution
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
50
Replication & Consistency (Quorum)
N = number of nodes with a replica of the data
W = number of replicas that must acknowledge the update (*)
R = minimum number of replicas that must participate in a successful read operation
(*) but the data will be written to N nodes no matter what

W + R > N: strong consistency (usually N=3, R=W=2)
W=N, R=1: optimised for reads
W=1, R=N: optimised for writes (durability not guaranteed in the presence of failures)
W + R < N: weak consistency

Latency is determined by the slowest of the R replicas for a read, and the slowest of the W replicas for a write.
51
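The quorum arithmetic above can be checked directly (a small illustrative helper): when W + R > N, every read quorum must intersect every write quorum, so a read always sees at least one up-to-date replica.

```python
# Classify an (N, W, R) configuration by the quorum-overlap rule.
def classify(n, w, r):
    if w + r > n:
        return "strong consistency (read and write quorums always overlap)"
    return "weak/eventual consistency (a read may miss the latest write)"

print(classify(3, 2, 2))   # the usual N=3, R=W=2 setup
print(classify(3, 3, 1))   # W=N, R=1: optimised for reads, still strong
print(classify(3, 1, 1))   # optimised for latency, weaker guarantees
```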
Vector Clocks & Conflict Detection
http://en.wikipedia.org/wiki/Vector_clock
A causality-based partial order over the events that happen in the system.
Document version history: a counter for each node that updated the document.
If all update counters in V1 are smaller than or equal to all update counters in V2, then V1 precedes V2.
52
Vector Clocks & Conflict Detection
http://en.wikipedia.org/wiki/Vector_clock
Vector clocks can detect a conflict. Conflict resolution is left to the application or the user.
The application might resolve conflicts by checking relative timestamps, or with other strategies (such as merging the changes).
Vector clocks can grow quite large (!)
53
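The comparison rule above is a few lines of code (clocks as plain dicts of node -> counter, a common sketch): V1 precedes V2 iff every counter in V1 is at most the matching counter in V2, and when neither precedes the other the versions are concurrent, i.e. in conflict.

```python
def precedes(v1, v2):
    # Missing nodes count as 0 updates.
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)

def compare(v1, v2):
    if v1 == v2:
        return "equal"
    if precedes(v1, v2):
        return "v1 -> v2"
    if precedes(v2, v1):
        return "v2 -> v1"
    return "conflict"   # concurrent updates; the application must reconcile

print(compare({"A": 1}, {"A": 2, "B": 1}))   # v1 -> v2
print(compare({"A": 2}, {"A": 1, "B": 1}))   # conflict
```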
Gossip Protocol + Hinted Handoff
http://en.wikipedia.org/wiki/Vector_clock
(Ring diagram: nodes A–F)
Ring (Key Space)
Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly chosen peers
54
Gossip Protocol + Hinted Handoff
http://en.wikipedia.org/wiki/Vector_clock
Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly chosen peers
(Ring diagram: node B is unreachable)
"I can't see B, it might be down but I need some ACK. My Merkle Tree root for range XY is 'ab03Idab4a385afda'."
"I can't see B either. My Merkle Tree root for range XY is different!"
"B must be down then. Let's disable it."
55
Gossip Protocol + Hinted Handoff
http://en.wikipedia.org/wiki/Vector_clock
Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly chosen peers
(Ring diagram: node B is unreachable; a hinted handoff)
"My canonical node is supposed to be B."
"I see. Well, I'll take care of it for now, and let B know when B is available again."
56
Merkle Trees (Hash Trees)
http://en.wikipedia.org/wiki/Hash_tree
Leaves: hashes of data blocks.
Inner nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.
57
Merkle Trees (Hash Trees)
http://www.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees
Node A / Node B gossip exchange
Minimal data transfer; differences are easy to locate
SHA-1, Whirlpool or Tiger (TTH) hash functions
58
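The anti-entropy comparison can be sketched as follows (an illustrative toy, not any particular store's on-disk format): build a hash tree over the data blocks, compare the roots, and descend only into subtrees whose hashes differ.

```python
import hashlib

def sha(b):
    return hashlib.sha1(b).hexdigest()

def build(blocks):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [sha(b) for b in blocks]
    tree = [level]
    while len(level) > 1:
        # Pair up hashes; an odd trailing hash is promoted by rehashing alone.
        level = [sha((level[i] + level[i + 1] if i + 1 < len(level)
                      else level[i]).encode())
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree   # tree[-1][0] is the root hash

a = build([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
b = build([b"k1=v1", b"k2=v2", b"k3=XX", b"k4=v4"])
print(a[-1][0] == b[-1][0])   # False: the replicas diverge somewhere
# Comparing roots flags the divergence; comparing leaves locates it.
print([i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y])   # [2]
```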
Read Repair
http://en.wikipedia.org/wiki/Vector_clock
(Ring diagram: nodes A–F)
GET(k, R=2)
59
Read Repair
http://en.wikipedia.org/wiki/Vector_clock
(Ring diagram: nodes A–F)
GET(k, R=2)
Replies: K=XYZ (v.2), K=XYZ (v.2), K=ABC (v.1)
60
Read Repair
http://en.wikipedia.org/wiki/Vector_clock
(Ring diagram: nodes A–F)
GET(k, R=2)
Read repair: UPDATE(k, XYZ) is sent to the stale replica
61
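The three slides above can be condensed into a sketch (illustrative names; plain version counters stand in for the vector clocks a real system would use): read from the replicas, pick the newest version, and push it back to any replica that returned stale data.

```python
def read_with_repair(replicas, key):
    # Each replica maps key -> (version, value); poll them all here for brevity.
    answers = {name: store.get(key) for name, store in replicas.items()}
    version, value = max(v for v in answers.values() if v is not None)
    for name, got in answers.items():
        if got != (version, value):       # repair the stale replica in place
            replicas[name][key] = (version, value)
    return value

replicas = {
    "B": {"k": (2, "XYZ")},
    "C": {"k": (2, "XYZ")},
    "D": {"k": (1, "ABC")},               # stale replica
}
print(read_with_repair(replicas, "k"))    # XYZ
print(replicas["D"]["k"])                 # (2, 'XYZ') after repair
```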
Google BigTable: Motivation, Key Features, System Architecture, Building Blocks, Data Model, SSTable/Tablet/Table, I/O & Compaction
62
Motivation
□Lots of (semi-)structured data at Google
URLs: contents, crawl metadata, links, anchors, PageRank, …
Per-user data: user preference settings, recent queries/search results, …
Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
□Scale is large: billions of URLs, many versions/page (~20K/version), hundreds of millions of users, thousands of q/sec, 100TB+ of satellite image data
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
63
Key Features
□Google, ~2006
□Distributed multi-level map
□Fault-tolerant, persistent
□Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans
□Self-managing: servers can be added/removed dynamically and adjust to load imbalance
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
64 http://www.slideshare.net/kingherc/bigtable-and-dynamo
System Architecture
65
Building Blocks
□Building blocks:
Google File System (GFS): raw storage
Scheduler: Google Work Queue, schedules jobs onto machines
Lock service: Chubby, a distributed lock manager
MapReduce: simplified large-scale data processing
□BigTable's use of the building blocks:
GFS: stores persistent data (SSTable file format)
Scheduler: schedules jobs involved in BigTable serving
Lock service: master election, location bootstrapping
MapReduce: often used to read/write BigTable data
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
66
Google File System
□Large-scale distributed "filesystem"
□Master: responsible for metadata
□Chunk servers: responsible for reading and writing large chunks of data
□Chunks replicated on 3 machines; the master is responsible for ensuring replicas exist
□OSDI '04 paper
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
67
Chubby
□Distributed lock service
□File system of {directory/file} entries, used for locking
Coarse-grained locks; a small amount of data can be stored in a lock
□High availability: 5 replicas, one elected as master; the service is live when a majority is live; uses the Paxos algorithm to solve consensus
□A client leases a session with the service
□Also an OSDI '06 paper
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
68
Data model
□"Sparse, distributed, persistent, multidimensional sorted map"
□<Row, Column, Timestamp> triple as the key; lookup, insert, and delete API
□Arbitrary "columns" on a row-by-row basis
Column family:qualifier. The family is heavyweight, the qualifier lightweight
Column-oriented physical store; rows are sparse!
□Does not support a relational model: no table-wide integrity constraints, no multi-row transactions
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
69
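The <row, column, timestamp> map can be sketched as nested dicts ("family:qualifier" column names as in the paper; the class and method names here are illustrative, not the real BigTable client):

```python
import time

class Table:
    def __init__(self):
        self.rows = {}   # row key -> column name -> {timestamp: value}

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column):
        cells = self.rows.get(row, {}).get(column, {})
        if not cells:
            return None
        return cells[max(cells)]   # the newest timestamp wins

t = Table()
t.put("com.cnn.www", "contents:", "<html>v1", ts=1)
t.put("com.cnn.www", "contents:", "<html>v2", ts=2)   # a newer version
t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=1)
print(t.get("com.cnn.www", "contents:"))   # <html>v2
```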
SSTable
□Immutable, sorted file of key-value pairs
□Chunks of data plus an index
The index is of block ranges, not values
Triplicated across three machines in GFS
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
(Diagram: an SSTable made of 64K blocks plus a block index)
70
Tablet
□Contains some range of rows of the table: a dynamically partitioned range of rows
□Built out of multiple SSTables
□Typical size: 100~200MB
□Tablets are stored on tablet servers (~100 per server)
□Unit of distribution and load balancing
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
(Diagram: a tablet [Start: aardvark, End: apple] built from multiple SSTables)
71
Table
□Multiple tablets (table segments) make up the table
□SSTables can be shared between tablets
□Tablets do not overlap; SSTables can overlap
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
(Diagram: tablets [aardvark, apple] and [apple_two_E, boat] sharing SSTables)
72
Tablets & Splitting
Large tables broken into tablets at row boundaries
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
(Diagram: a table with rows aaa.com … cnn.com, cnn.com/sports.html … Website.com … Zuppa.com/menu.html and columns "contents" (e.g. "<html>…") and "language" (e.g. EN), broken into tablets at row boundaries)
73
Finding a Tablet
http://labs.google.com/papers/bigtable-osdi06.pdf
Approach: a 3-level hierarchical lookup scheme for tablets
– A location is the ip:port of the relevant server, all stored in META tablets
– 1st level: bootstrapped from the lock server, points to the owner of META0
– 2nd level: uses META0 data to find the owner of the appropriate META1 tablet
– 3rd level: the META1 table holds the locations of the tablets of all other tables
– The META1 table itself can be split into multiple tablets
74
Servers
Tablet servers manage tablets, multiple tablets per server; each tablet is 100-200 MB
– Each tablet lives at only one server
– A tablet server splits tablets that get too big
The master is responsible for load balancing and fault tolerance
– Uses Chubby to monitor the health of tablet servers, restarts failed servers
– GFS replicates the data; prefer to start a tablet server on the machine where the data already is
75
BigTable I/O
http://www.slideshare.net/kingherc/bigtable-and-dynamo
(Diagram: writes go to the tablet log and the in-memory memtable; minor compaction flushes the memtable to SSTables in GFS, compressed with BMDiff/Zippy; merging / major compaction (GC) combines SSTables; reads merge the memtable and the SSTables)
□The commit log stores the writes: recent writes are stored in the memtable, older writes in SSTables
□A read operation sees a merged view of the memtable and the SSTables
□Authorization is checked against the ACL stored in Chubby
76
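The read path above can be sketched as follows (an illustrative toy; a real implementation keeps the memtable and SSTables sorted and merges iterators): recent writes live in the memtable, older writes in immutable SSTables, a read consults the memtable first and then the SSTables newest-first, and minor compaction freezes the memtable into a new SSTable.

```python
class Tablet:
    def __init__(self):
        self.memtable = {}   # recent writes (sorted in a real system)
        self.sstables = []   # newest first; each one is an immutable dict

    def write(self, key, value):
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:      # the memtable shadows all SSTables
            return self.memtable[key]
        for sst in self.sstables:     # then the newest SSTable wins
            if key in sst:
                return sst[key]
        return None

    def minor_compaction(self):
        # Freeze the memtable into an immutable SSTable and start fresh.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}

tab = Tablet()
tab.write("k", "old")
tab.minor_compaction()
tab.write("k", "new")
print(tab.read("k"))   # new: the memtable shadows the flushed SSTable
```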
Compactions
Minor compaction: convert the memtable into an SSTable
– Reduces memory usage
– Reduces log traffic on restart
Merging compaction: reduces the number of SSTables
– A good place to apply a "keep only N versions" policy
Major compaction: a merging compaction that results in only one SSTable
– No deletion records, only live data
77
Locality Groups
Group column families together into an SSTable
– Avoids mingling data, e.g. page contents and page metadata
– Some groups can be kept entirely in memory
Locality groups can be compressed
Bloom filters on locality groups avoid searching an SSTable
78
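A Bloom filter like the one kept per locality group is a cheap "definitely absent / maybe present" check that lets a read skip an SSTable entirely. A toy version (illustrative parameters, not BigTable's actual implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k   # m bits, k hash functions
        self.bits = 0           # the bit array as a big integer

    def _positions(self, key):
        # Derive k bit positions by salting one hash function.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means "maybe" (false
        # positives are possible, false negatives are not).
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row:aaa.com")
print(bf.might_contain("row:aaa.com"))   # True
print(bf.might_contain("row:zzz.com"))   # almost certainly False
```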
NoSQL Types: Key-Value Stores, Column Databases, Document Databases, Graph Databases
79
NoSQL Taxonomy
http://nosql.mypopescu.com/post/2335288905/nosql-databases-genealogy
80
NoSQL Landscape
Source: 451 Group
81
NoSQL Types I
□Key-value stores: based on DHTs / Amazon's Dynamo paper; data model: a (global) collection of K-V pairs; examples: Voldemort, Tokyo, Riak, Redis
□Column stores: based on Google's BigTable paper; data model: a big table with column families; examples: HBase, Cassandra, Hypertable
http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases
82
NoSQL Types II
□Document stores: inspired by Lotus Notes; data model: collections of K-V collections; examples: CouchDB, MongoDB
□Graph databases: inspired by Euler & graph theory; data model: nodes and relationships, with K-V pairs on both; examples: Neo4j, FlockDB
http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases
83
NoSQL Data Model Comparison
http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
84
Key-Value Stores: Voldemort, Riak, Redis
85
Voldemort
http://www.slideshare.net/adorepump/voldemort-nosql, http://nosql.findthebest.com/l/5/Voldemort
□LinkedIn, 2009, Apache 2.0, Java
□Model: Key-Value (data model), Dynamo (distribution model)
□Main point: data is automatically replicated and partitioned across multiple servers
□Concurrency control: MVCC
□Transactions: no
□Data storage: BDB, MySQL, RAM
□Key features:
Data is automatically replicated and partitioned across multiple servers
Simple optimistic locking for multi-row updates
Pluggable storage engine
Multiple read-writes
Consistent hashing for data distribution
Data versioning
□Major users: LinkedIn, GILT
□Best use: real-time, large-scale
AP
86 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
Highly customizable - each layer of the stack can be replaced as needed
Data elements are versioned during changes
All nodes are independent - no SPOF
Very, very fast reads
Cons
Versioning means lots of disk space being used.
Does not support range queries
No complex query filters
All joins must be done in code
No foreign key constraints
No triggers
Support can be hard to find
Voldemort Pros / Cons AP
87
Voldemort : Logical Architecture AP
http://www.project-voldemort.com/voldemort/design.html , http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
□Dynamo DHT implementation
□Consistent hashing, vector clocks
HTTP / sockets
Conflicts resolved at read and write time
Serialization: JSON, Java String, byte[], Thrift, Avro, Protobuf
Simple optimistic locking for multi-row updates; pluggable storage engine
LICENSE: Apache 2.0
LANGUAGE: Java
API / PROTOCOL: HTTP, Java, Thrift, Avro, ProtoBuf
CONCURRENCY: MVCC
88
Voldemort: Physical Architecture Options
http://www.project-voldemort.com/voldemort/design.html
AP
89
Riak
http://nosql.findthebest.com/l/6/Riak, http://www.slideshare.net/seancribbs/introduction-to-riak-red-dirt-ruby-conf-training
□Basho Technologies, 2010, Apache 2.0, Erlang & C
□Model: Key-Value (data model), Dynamo (distribution model)
□Main point: fault tolerance
□Protocol: HTTP/REST or custom binary
□Transactions: no
□Data storage: pluggable
□Features:
Tunable trade-offs for distribution and replication (N, R, W)
Pre- and post-commit hooks in JavaScript or Erlang
Map/Reduce in JavaScript and Erlang
Links & link walking: use it as a graph database
Secondary indices: but only one at a time
Large object support (Luwak)
□Major users: Mozilla, GitHub, Comcast, AOL, Ask.com
□Best uses: high availability
□Example usage: CMS, text search, point-of-sale data collection
AP
90 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Riak Pros / Cons AP
Pros
All nodes are equal - no SPOF
Horizontal Scalability
Full Text Search
RESTful interface (and HTTP)
Consistency level tunable on each operation
Secondary indexes available
Map/Reduce (JavaScript & Erlang only)
Cons
Not meant for small, discrete and numerous datapoints.
Getting data in is great; getting it out, not so much
Security is non-existent: "Riak assumes the internal environment is trusted"
Conflict resolution can bubble up to the client if not careful.
Erlang is fast, but it's got a serious learning curve.
91
Riak
Buckets -> K-V
"Links" (~relations)
Targeted JS Map/Reduce
Tunable consistency (one / quorum / all)
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
LICENSE: Apache 2.0
LANGUAGE: C, Erlang
API / PROTOCOL: REST, HTTP, ProtoBuf
AP
92 http://bcho.tistory.com/621, http://basho.com/technology/technology-stack/
Riak Logical Architecture AP
93
Redis
http://nosql.findthebest.com/l/6/Riak, http://www.slideshare.net/seancribbs/introduction-to-riak-red-dirt-ruby-conf-training
□VMware, 2009, BSD, C/C++
□Model: Key-Value (data model), Master-Slave (distribution model)
□Main point: blazing fast
□Protocol: Telnet-like
□Concurrency control: locks
□Transactions: yes
□Data storage: RAM (in-memory)
□Features:
Disk-backed in-memory database
Currently without disk swap (VM and Diskstore were abandoned)
Master-slave replication
Pub/Sub lets one implement messaging
□Major users: StackOverflow, flickr, GitHub, Blizzard, Digg
□Best uses: rapidly changing, frequently written, rarely read statistical data
□Example usage: stock prices, analytics, real-time data
CP
94 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
Transactional support
Blob storage
Support for sets, lists and sorted sets
Support for Publish-Subscribe(Pub-Sub) messaging
Robust set of operators
Cons
Entirely in memory
Master-slave replication (instead of master-master)
Security is non-existent: designed to be used in trusted environments
Does not support encryption
Support can be hard to find
Redis Pros / Cons CP
95
LICENSE
LANGUAGE
BSD
ANSI C, C++
API / PROTOCOL
*(Many Language)
Telnet Like
PERSISTENCE
In Memory
bg snapshots
Redis CP
REPLICATIONS
Master / Slave
K-V store “Data Structures Server” Map, Set, Sorted Set, Linked List Set/Queue operations, Counters, Pub-Sub, Volatile keys
http://redis.io/presentation/Redis_Cluster.pdf, http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
10-100K ops/sec (whole dataset in RAM + VM) Persistence via snapshotting (tunable fsync frequency) Distributed only if the client supports consistent hashing
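The last point means Redis is distributed only when the client shards keys itself, typically with consistent hashing. A minimal pure-Python sketch of such a ring, with hypothetical node names (this is not redis-py's API):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring of the kind a Redis client library
    can use to spread keys across independent nodes."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                # Virtual nodes smooth out the key distribution.
                ring.append((self._hash(f"{node}:{i}"), node))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]

ring = ConsistentHashRing(["redis-a:6379", "redis-b:6379", "redis-c:6379"])
assignments = {k: ring.node_for(k) for k in ("user:1", "user:2", "stock:GOOG")}
```

Adding or removing one node moves only the keys adjacent to its ring positions, which is why client-side sharding usually prefers this over `hash(key) % N`.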
96
Column Family Stores: Cassandra, HBase
97
Cassandra
http://nosql.findthebest.com/l/2/Cassandra,
□ASF, 2008, Apache 2.0, Java □Model: Column(Data Model), Dynamo(Distributed Model) □Main Point: Best of BigTable and Dynamo □Protocol: Thrift, Avro □Concurrency Control: MVCC □Transaction: No □Data Storage: Disk □Features
□Tunable trade-offs for distribution and replication (N, R, W) □Querying by column, range of keys □BigTable-like features: columns, column families □Has secondary indices □Writes are much faster than reads (!) □Map/reduce possible with Apache Hadoop □All nodes are similar, as opposed to Hadoop/HBase
□Major Users: Facebook, Netflix, Twitter, Adobe, Digg □Best Uses: write often, read less □Example Usage: banking, finance, logging
AP
98 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
Designed to span multiple datacenters
Peer to peer communication between nodes
No SPOF
Always writeable
Consistency level is tunable at run time
Supports secondary indexes
Supports Map/Reduce
Support range queries
Cons
No joins
No referential integrity
Written in Java - quite complex to administer and configure
Last update wins
Cassandra Pros / Cons AP
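The "consistency level is tunable at run time" point follows the Dynamo-style quorum rule: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap on at least one replica whenever R + W > N. A one-function sketch:

```python
def is_strongly_consistent(n, r, w):
    """Dynamo-style quorum rule: any read quorum of size r and write
    quorum of size w out of n replicas must intersect when r + w > n,
    so a read is guaranteed to see the latest acknowledged write."""
    return r + w > n

# N=3 with QUORUM reads and writes (2 each): quorums overlap.
assert is_strongly_consistent(3, 2, 2)
# ONE/ONE trades consistency for availability and latency.
assert not is_strongly_consistent(3, 1, 1)
```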
99
Cassandra
http://nosql.findthebest.com/l/2/Cassandra
AP
Data model of BigTable, infrastructure of Dynamo
LICENSE: Apache 2.0
LANGUAGE: Java
PROTOCOL: Thrift, Avro
PERSISTENCE: memtable, SSTable
CONSISTENCY: tunable R / W / N
Data model (diagram): row_key -> Super Column Family -> super_column_name -> columns of (col_name, col_value, timestamp)
keyspace.get("column_family", key, ["super_column"], "column")
Cluster (diagram): nodes A-F form a P2P ring and exchange state via Gossip
Consistency levels: ONE / QUORUM / ALL
Partitioners: Random Partitioner (MD5), OrderPreservingPartitioner
Range scans, full-text index (Solandra)
104
Cassandra - Data Model
http://javamaster.wordpress.com/2010/03/22/apache-cassandra-quick-tour/
AP
Diagram: each hash key maps to a hashtable-like object
105
Cassandra - Data Model
http://javamaster.wordpress.com/2010/03/22/apache-cassandra-quick-tour/
AP
• Column: a name-value pair
• Column Family: a collection of Columns; a hash key maps to a list of Columns
• Super Column: a Column that contains Columns, e.g. username -> {firstname, lastname}
• Super Column Family: a Column Family nested inside a Column Family
{name:"emailAddress", value:"cassandra@apache.org"} {name:"age", value:"20"}
UserProfile={ Cassandra={emailAddress:"cassandra@apache.org", age:"20"} TerryCho={emailAddress:"terry.cho@apache.org", gender:"male"} Cath={emailAddress:"cath@apache.org", age:"20", gender:"female", address:"Seoul"} }
{name:”username” value: firstname{name:”firstname”,value=”Terry”} value: lastname{name:”lastname”,value=”Cho”} }
UserList={ Cath:{ username:{firstname:”Cath”,lastname:”Yoon”} address:{city:”Seoul”,postcode:”1234”}} Terry:{ username:{firstname:”Terry”,lastname:”Cho”} account:{bank:”hana”,accounted:”1234”}} }
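The UserProfile example above can be modeled as plain nested maps: keyspace -> column family -> row key -> column name -> (value, timestamp), with the timestamp implementing the "last update wins" conflict resolution noted earlier. A hypothetical in-memory sketch, not Cassandra's API:

```python
# keyspace -> column family -> row_key -> col_name -> (value, timestamp)
keyspace = {"UserProfile": {}}

def insert(cf, row_key, col_name, value, timestamp):
    """Write one column; conflict resolution is last-write-wins by timestamp."""
    row = keyspace[cf].setdefault(row_key, {})
    current = row.get(col_name)
    if current is None or timestamp >= current[1]:
        row[col_name] = (value, timestamp)

def get(cf, row_key, col_name):
    """Read one column's current value."""
    return keyspace[cf][row_key][col_name][0]

insert("UserProfile", "Cassandra", "emailAddress", "cassandra@apache.org", 1)
insert("UserProfile", "Cassandra", "age", "20", 1)
insert("UserProfile", "Cassandra", "age", "21", 2)  # newer timestamp wins
insert("UserProfile", "Cassandra", "age", "19", 0)  # stale write is ignored
```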
106
Cassandra – Use Case: eBay AP
□A glimpse on eBay’s Cassandra deployment □Dozens of nodes across multiple clusters □200 TB+ storage provisioned □400M+ writes & 100M+ reads per day, and growing
□#1: Social Signals on eBay product & item pages □#2: Hunch taste graph for eBay users & items □#3: Time series use cases (many):
□Mobile notification logging and tracking □Tracking for fraud detection □SOA request/response payload logging □RedLaser server logs and analytics
□Cassandra meets requirements
□Need Scalable counters □Need real(or near) time analytics on collected social data □Need good write performance □Reads are not latency sensitive
http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376
107
HBase
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/10/HBase
□ASF, 2010, Apache 2.0, Java □Model: Column(Data Model), Bigtable(Distributed Model) □Main Point: Billions of rows X millions of columns □Protocol: HTTP/REST (also Thrift) □Concurrency Control: Locks □Transaction: Local □Data Storage: HDFS □Features
□Query predicate push down via server side scan and get filters □Optimizations for real time queries □A high performance Thrift gateway □HTTP supports XML, Protobuf, and binary □Rolling restart for configuration changes and minor upgrades □Random access performance is like MySQL □A cluster consists of several different types of nodes
□Major Users: Facebook □Best Use: random read write to large database □Example Usage: Live messaging
CP
108 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
Map/Reduce support
More of a CA approach than an AP one
Supports predicate push down for performance gains
Automatic partitioning and rebalancing of regions
Data is stored in a sorted order (not indexed)
RESTful API
Strong and vibrant ecosystem
Cons
Secondary indexes generally not supported
Security is non-existent
Requires a Hadoop infrastructure to function
HBase Pros / Cons CP
109 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
HBase vs. BigTable Terminology
CP
HBase | BigTable
Table | Table
Region | Tablet
RegionServer | Tablet Server
MemStore | Memtable
HFile | SSTable
WAL | Commit Log
Flush | Minor compaction
Minor Compaction | Merging compaction
Major Compaction | Major compaction
HDFS | GFS
MapReduce | MapReduce
ZooKeeper | Chubby
110
HBase: Architecture
http://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
CP
LICENSE: Apache 2.0
LANGUAGE: Java
PROTOCOL: REST/HTTP, Thrift, Avro
PERSISTENCE: memtable, SSTable
• ZooKeeper as coordinator (instead of Chubby) • HMaster: support for multiple masters • HDFS, S3, S3N, EBS (with Gzip/LZO CF compression) • Data sorted by key but evenly distributed across the cluster
111
HBase: Architecture - WAL
http://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
CP
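The write-ahead-log design this slide illustrates can be reduced to a toy model: every mutation is appended to the WAL before touching the in-memory MemStore, and a flush turns the MemStore into an immutable, sorted HFile. A hypothetical sketch, not HBase's actual classes:

```python
class RegionServerSketch:
    """Toy model of the HBase write path: WAL first, then MemStore;
    a flush writes a sorted, immutable HFile and clears the MemStore."""

    def __init__(self):
        self.wal = []        # commit log, replayed on crash recovery
        self.memstore = {}   # in-memory write buffer
        self.hfiles = []     # immutable on-disk files (sorted key -> value)

    def put(self, key, value):
        self.wal.append((key, value))  # durability before visibility
        self.memstore[key] = value

    def flush(self):
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()

    def get(self, key):
        if key in self.memstore:         # newest data lives in memory
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # then newest flush first
            if key in hfile:
                return hfile[key]
        return None

rs = RegionServerSketch()
rs.put("row1", "v1")
rs.put("row2", "v1")
rs.flush()            # both rows now live in an HFile
rs.put("row1", "v2")  # newer value in the MemStore shadows the flushed one
```

In real HBase, reads also consult block caches and Bloom filters, and compactions merge HFiles; this sketch only shows the WAL-before-MemStore ordering.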
112
HBase - Usecase: Facebook Message Service
http://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
CP
□New Message Service □ combines chat, SMS, email, and Messages into a real-time conversation □ Data pattern
A short set of temporal data that tends to be volatile; an ever-growing set of data that rarely gets accessed
□ The chat service supports over 300 million users who send over 120 billion messages per month
□Found Cassandra's eventual consistency model to be a difficult pattern to reconcile for the new Messages infrastructure
□HBase meets their requirements □Has a simpler consistency model than Cassandra □Very good scalability and performance for their data patterns □Most feature-rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc. □HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing □Facebook's operational teams have a lot of experience using HDFS, because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system
114
HBase – Use Case: Adobe
http://hstack.org/why-were-using-hbase-part-1
CP
□When we started pushing 40 million records, HBase squeaked and cracked. After 20M inserts it failed so badly it wouldn't respond or restart, and it mangled the data completely; we had to start over. The HBase community turned out to be great: they jumped in and helped us, and upgrading to a new HBase version fixed our problems
□In December 2008, our HBase cluster would write data but couldn't answer reads correctly. We were able to make another backup and restore it on a MySQL cluster
□We decided to switch focus in the beginning of 2009. We were going to
provide a generic, real-time, structured data storage and processing system that could handle any data volume.
115
Document Stores: MongoDB, CouchDB
116
MongoDB
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/10/HBase
□10gen, 2009, AGPL, C++ □Model: Document(Data Model), Master-Slave/Replica Sets(Distributed Model) □Main Point: Full Index Support, Querying, Easy to Use □Protocol: Custom, binary(BSON) □Concurrency Control: Locks □Transaction: No □Data Storage: Disk □Features
□Master/slave replication (auto failover with replica sets) □Sharding built-in □Uses memory mapped files for data storage □GridFS to store big data + metadata (not actually an FS)
□Major Users: Craigslist, Foursquare, SAP, MTV, Disney, Shutterfly, Intuit □Example Usage: CMS system, comment storage, voting □Best Use: dynamic queries, frequently written, rarely read statistical data
CP
117 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
Auto-sharding
Auto-failover
Update in place
Spatial index support
Ad hoc query support
Any field in Mongo can be indexed
Very, very popular (lots of production deployments)
Very easy transition from SQL
Cons
Does not support JSON: BSON instead
Master-slave replication
Has had some growing pains (e.g. the Foursquare outage)
Not RESTful by default
Failures require a manual database repair operation (similar to MySQL)
Replication for availability, not performance
MongoDB Pros / Cons CP
118 http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
MongoDB CP
LICENSE: AGPL v3
LANGUAGE: C++
PROTOCOL: REST / BSON
PERSISTENCE: B+ Trees, snapshots
CONCURRENCY: in-place updates
REPLICATION: master-slave, replica sets
119 http://sett.ociweb.com/sett/settAug2011.html, http://www.infoq.com/articles/mongodb-java-php-python
MongoDB - Architecture CP
• mongod: the core database process • mongos: the controller and query router for sharded clusters
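For a range-sharded collection, the query router's job is to map each shard key onto the chunk that owns it. A toy illustration with made-up split points and shard names; the real mongos also consults config servers and handles chunk migration:

```python
import bisect

class MongosRouterSketch:
    """Toy range-based shard router: n split points partition the
    shard-key space into n+1 chunks, one per shard."""

    def __init__(self, split_points, shards):
        assert len(shards) == len(split_points) + 1
        self.split_points = split_points  # must be sorted
        self.shards = shards

    def shard_for(self, shard_key):
        # bisect_right finds which chunk [prev_bound, next_bound) holds the key.
        return self.shards[bisect.bisect_right(self.split_points, shard_key)]

router = MongosRouterSketch(split_points=["g", "p"],
                            shards=["shard0", "shard1", "shard2"])
```

With this chunk map, keys below "g" route to shard0, keys in ["g", "p") to shard1, and the rest to shard2.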
120
CouchDB
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/3/CouchDB
□ASF, 2005, Apache 2.0, Erlang □Model: Document(Data Model), Notes(Distributed Model) □Main Point: DB consistency, easy to use □Protocol: HTTP, REST □Concurrency Control: MVCC □Transaction: No □Data Storage: Disk □Features
□ACID Semantics □Map/Reduce Views and Indexes □Distributed Architecture with Replication □Built for Offline
□Major Users: LotsOfWords.com, CERN, BBC □Example Usage: CRM, CMS systems □Best Use: accumulating, occasionally changing data with pre-defined queries
AP
121 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
Very simple API for development
MVCC support for read consistency
Full Map/Reduce support
Data is versioned
Secondary indexes supported
Some security support
RESTful API, JSON support
Materialized views with incremental update support
Cons
The simple API for development is somewhat limited
No foreign keys
Conflict resolution devolves to the application
Versioning requires extensive disk space
Versioning places large load on I/O channels
Replication for performance, not availability
CouchDB Pros / Cons AP
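The MVCC and conflict-resolution points above boil down to a simple update rule: every update must present the revision it read, and a mismatch is a conflict the application has to resolve. A toy sketch; CouchDB's real `_rev` values are hash-suffixed strings, not plain integers:

```python
class MvccStoreSketch:
    """Toy MVCC document store: each document carries a revision number,
    and an update must name the revision it was based on."""

    def __init__(self):
        self.docs = {}  # doc_id -> (rev, body)

    def put(self, doc_id, body, rev=None):
        current = self.docs.get(doc_id)
        if current is not None and rev != current[0]:
            # A concurrent writer got here first; the client must re-read,
            # merge, and retry -- conflict resolution devolves to the app.
            raise ValueError("conflict: document was updated concurrently")
        new_rev = 1 if current is None else current[0] + 1
        self.docs[doc_id] = (new_rev, body)
        return new_rev

store = MvccStoreSketch()
rev1 = store.put("doc1", {"title": "draft"})
rev2 = store.put("doc1", {"title": "final"}, rev=rev1)
```

A second writer still holding `rev1` now gets a conflict instead of silently overwriting the "final" body, which is the read-consistency guarantee listed under the pros.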
122
CouchDB
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
□ASF, 2005, Apache 2.0, Erlang
AP
LICENSE: Apache 2.0
LANGUAGE: Erlang
PROTOCOL: REST / JSON
PERSISTENCE: append-only B+ Tree
CONCURRENCY: MVCC
REPLICATION: multi-master
Crash-only design
123
Graph Databases: Neo4J
124
Neo4J
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/12/Neo4j
□Neo Technology, 2007, AGPL/GPL v3, Java □Model: Graph(Data Model), (Distributed Model) □Main Point: Stores data structured in graphs rather than tables. □Protocol: HTTP/REST, SparQL, native Java, JRuby □Concurrency: non-blocking reads; writes lock the involved nodes/relationships until commit □Transaction: Yes □Data Storage: Disk □Features
□Disk-based: Native graph storage engine with custom binary on-disk format
□Transactional: JTA/JTS, XA, deadlock detection, MVCC, etc □Scales Up: Several billions of nodes/rels/props on single JVM
□Example Usage: Social relations, public transport links, road maps, network
topologies. □Best Use: complex data relationships and queries.
AP
125 http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
Pros
No O/R impedance mismatch (whiteboard friendly)
Can easily evolve schemas
Can represent semi-structured info
Can represent graphs/networks (with performance)
Cons
Poor scalability
Lacks in tool and framework support
No other implementations => potential lock in
No support for ad-hoc queries
Neo4J Pros / Cons AP
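The graph use cases above (social relations, network topologies) rely on cheap traversal of typed relationships rather than joins. A minimal pure-Python illustration of the idea, not Neo4j's API:

```python
class GraphSketch:
    """Toy property graph: directed, typed relationships between nodes,
    with a one-hop traversal like the social-relations use case."""

    def __init__(self):
        self.edges = []  # (src, rel_type, dst)

    def relate(self, src, rel_type, dst):
        self.edges.append((src, rel_type, dst))

    def neighbors(self, node, rel_type):
        return [dst for (src, rel, dst) in self.edges
                if src == node and rel == rel_type]

g = GraphSketch()
g.relate("alice", "FRIEND", "bob")
g.relate("alice", "FRIEND", "carol")
g.relate("bob", "FRIEND", "dave")

# Friends-of-friends of alice, excluding alice and her direct friends --
# a two-hop traversal that would need a self-join in a relational schema.
direct = set(g.neighbors("alice", "FRIEND"))
fof = {f2 for f in direct for f2 in g.neighbors(f, "FRIEND")} - direct - {"alice"}
```

In a real graph database each node holds direct pointers to its relationships, so this traversal stays fast even on billions of edges.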
126 http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
LICENSE: AGPL v3 / Commercial
LANGUAGE: Java
PROTOCOL: REST / Java / SPARQL
PERSISTENCE: on-disk linked lists
Neo4J AP
127
NoSQL Performance Measurement (YCSB Benchmark Results)
2012.10.22 Altoros Systems, Inc.
http://research.yahoo.com/Web_Information_Management/YCSB
128
YCSB
http://www.slideshare.net/tazija/evaluating-nosql-performance-time-for-benchmarking
□Since 2010.04 – 0.1.0; current – 0.1.4
□A "standard" benchmark offered by the Yahoo! team
□Yahoo! Cloud Serving Benchmark (YCSB): focus on databases, focus on performance
□The YCSB client consists of 2 parts: a workload generator and workload scenarios
129
YCSB Features
http://www.slideshare.net/tazija/evaluating-nosql-performance-time-for-benchmarking, https://github.com/brianfrankcooper/YCSB/wiki
□Open source □Extensible □Has connectors: HBase, Cassandra, MongoDB, Redis, Voldemort; Oracle NoSQL DB, Amazon DynamoDB; PNUTS, VMware GemFire, Dynomite; a connector for sharded RDBMSs (e.g. MySQL); IMDG: JBoss Infinispan, GigaSpaces XAP
130
YCSB Architecture
http://www.slideshare.net/tazija/evaluating-nosql-performance-time-for-benchmarking
131
Workloads
http://www.slideshare.net/tazija/evaluating-nosql-performance-time-for-benchmarking,
□Workload is a combination of key-values: Request distribution (uniform, zipfian) Record size Operation proportion (%)
□Types of workload phases:
Load phase Transaction phase
132
Workloads
http://www.slideshare.net/tazija/evaluating-nosql-performance-time-for-benchmarking,
□Load phase workload Working set is created 100 million records 1 KB record (10 fields by 100 Bytes) 120-140G total or ≈30-40G per node
□Transaction phase workloads Workload A (read/update ratio: 50/50, zipfian) Workload B (read/update ratio: 95/5, zipfian) Workload C (read ratio: 100, zipfian) Workload D (read/update/insert ratio: 95/0/5, zipfian) Workload E (read/update/insert ratio: 95/0/5, uniform) Workload F (read/read-modify-write ratio: 50/50, zipfian) Workload G (read/insert ratio: 10/90, zipfian)
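The distributions named in these workloads differ only in how the next record is chosen. A simplified illustration of uniform vs. zipfian selection; this is not YCSB's exact scrambled-zipfian generator, just Zipf-style weighting:

```python
import random

def zipfian_choice(n, rng, theta=0.99):
    """Pick a record index in [0, n) with Zipf-like skew: low indices
    ('popular' records) absorb most of the requests."""
    weights = [1.0 / (i + 1) ** theta for i in range(n)]
    return rng.choices(range(n), weights=weights, k=1)[0]

def uniform_choice(n, rng):
    """Every record is equally likely, as in YCSB's uniform distribution."""
    return rng.randrange(n)

rng = random.Random(42)
hits = [zipfian_choice(1000, rng) for _ in range(5000)]
# Under the zipfian distribution, the 10 most popular records out of 1000
# receive a disproportionate share of all requests (~1% would be uniform).
top10_share = sum(1 for h in hits if h < 10) / len(hits)
```

This skew is why workloads A-D and F stress caches and hot rows, while the uniform workload E spreads load evenly across the key space.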
133
Testing Environment
http://www.slideshare.net/tazija/evaluating-nosql-performance-time-for-benchmarking,
YCSB Client (API name: m1.large)
7.5 GB of memory
four EC2 Compute Units (two virtual cores with two EC2 Compute Units each)
850 GB of instance storage
64-bit platform, high I/O performance
EBS-Optimized (500 Mbps)

NoSQL Server (API name: m1.xlarge)
15 GB of memory
eight EC2 Compute Units (four virtual cores with two EC2 Compute Units each)
1690 GB of instance storage
64-bit platform, high I/O performance
EBS-Optimized (1000 Mbps)
134
Load Phase, [Insert]
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
HBase has unconquerable superiority in writes, and with pre-created regions it showed us up to 40K ops/sec. Cassandra also provides noticeable performance during the loading phase with around 15K ops/sec. MySQL Cluster can show much higher numbers in "just in-memory" mode.
135
Workload A: Update-heavy mode 1) Read/update ratio: 50/50 2) Zipfian request distribution
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
136
Workload B: Read-heavy mode 1) Read/update ratio: 95/5 2) Zipfian request distribution
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
137
Workload C: Read-only 1) Read/update ratio: 100/0 2) Zipfian request distribution
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
138
Workload E: Scanning short ranges 1) Read/update/insert ratio: 95/0/5 2) Latest request distribution 3) Max scan length: 100 records 4) Scan length distribution: uniform
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
HBase performs a bit better than Cassandra in range scans, though Cassandra range scans improved noticeably from the 0.6 version presented in the YCSB slides. MongoDB 2.5: max throughput 20 ops/sec, latency ≈1 sec or more. MySQL Cluster 7.2.5: <10 ops/sec, latency ≈400 ms. MySQL Sharded 5.5.2.3: <40 ops/sec, latency ≈400 ms. Riak 1.1.1's Bitcask storage engine doesn't support range scans (eleveldb was slow during load)
139
Workload G: Insert-mostly mode 1) Insert/Read: 90/10 2) Latest request distribution
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
Workload with high volume inserts proves that HBase is a winner here, closely followed by Cassandra. MySQL Cluster’s NDB engine also manages perfectly with intensive writes.
140
YCSB Distribution Models
Uniform: records are chosen at random; roughly every record is equally likely to be selected.
Zipfian: a few records become "popular" and are selected very often, while most records are "unpopular" and rarely selected.
Latest: the most recently inserted records are selected most often.
Multinomial: a probability can be assigned to each operation; e.g. 95% reads, 5% updates, and 0% scans and inserts yields a read-heavy workload.
http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
141
NoSQL Summary
142
Why non-relational? □Relational databases are hard to scale.
Replication – scaling by duplication: Master-Slave (N writes, bottleneck, the number of slaves must be limited); Multi-Master (better write scalability, but many more conflicts, O(N^2)~)
Sharding – scaling by partitioning
□Some features are unnecessary: UPDATEs and DELETEs (insert-only system, by version); JOINs (complex set operations, do not work across partitions); ACID transactions (eventually consistent); fixed schema
□Some features are impossible (in an RDBMS):
hierarchical data models; graph models; escaping the dependency on main memory
143
Why use NoSQL?
□What's wrong with MySQL? (what is needed): better performance (virtually unlimited capacity); fast scaling (vertical/horizontal scaling without forced downtime); node replacement/addition with no downtime (no SPOF); free schema changes
□Google's motivation for adopting GFS (as of 2003): more than 15,000 production servers in operation
~100 machines went down every day; fault tolerance / consistency / performance / workload; multi-GB files (not a large number of small files); mostly sequential reads of 1 MB or more, with writes as concurrent appends
Almost no random writes
144
Criteria for choosing between an RDBMS and NoSQL

Technical aspects
Aspect | RDBMS | NoSQL
Data set characteristics | large volumes of structured data | extremely large volumes of unstructured data
Suitable operations | random access, complex operations | sequential access, simple operations
Data model | deduplication, normalization | no joins, denormalization
Distributed-system properties | consistency, availability | availability, distributed processing
Scalability | high-end servers, SQL optimization required | add low-cost servers; the system supports scale-out

Business aspects
Aspect | RDBMS | NoSQL
Cost / ROI | expensive hardware; diverse analysis/processing via SQL; generalist staff | low-cost hardware; higher system and application management costs; specialist staff
Suitable services | accuracy/consistency are critical; continuous updates; structured data | very large data with high growth; mostly unstructured data; web services
145
Performance comparison of MySQL and representative NoSQL stores
June 2010: results from the Yahoo! research team's paper benchmarking cloud serving systems
146
An approach to leveraging MySQL
□An assessment of NoSQL as of today □Does not provide enterprise-grade SLAs: not many success stories yet □The cost of integrating NoSQL with applications is high
No adapters such as ODBC or JDBC; you must code directly against the API □ Without joins, output that combines diverse data is hard to produce
Think of the front page of a Korean portal site; foreign services (Twitter/Facebook) have comparatively simple output
□Considerations when adopting NoSQL
□Treating it as an RDBMS replacement is wrong: approach it CRUD-first □Performance tuning takes much trial and error (ill-suited to brand-new production services) □ROI analysis as data grows is needed (adoption feasibility): hardware scale and the cost of losses from failures
□Solve the cases where an RDBMS is a poor fit: scalability, high cost, denormalization (simplification) □When using MapReduce, HDFS alone is enough: no data model needed
Unsuitable for real-time services; for real-time service, build around Memcache and simple selects
147
Conclusion □Many solutions exist for storing data
Drop the assumption that Oracle and MySQL are the only options; first understand the system's data characteristics and requirements (CAP, ACID/BASE); apply multiple solutions within one system
Small-scale data with complex relationships: RDBMS; large-scale real-time data: NoSQL, NewSQL; large-scale archival data: Hadoop, etc.
□Choose the right solution: always verify, before adoption, the issues that can arise in operation; most NoSQL solutions are in a beta state (a hasty choice can be poison); verification down to the solution's source-code level is needed
□Secure the stability of the NoSQL solution: the solution's own stability needs verification and does not yet match the stability of today's DBMSs
Apply it only after securing a reliable data storage plan; developers with operational and development experience are hard to find; select a NoSQL store that matches the requirements
□Pilot it first rather than applying it to critical systems from the start: validate the selected solution and internalize the expertise
148
Thank you.