NewSQL Database Overview 민형기 (S-Core) [email protected] 2013. 2. 22.


Contents

I. Why NewSQL?

II. NewSQL Basic Concepts

III. Types of NewSQL

IV. NewSQL Summary


Why NewSQL?


Thinking – Extreme Data

Qcon London 2012

Organizations need deeper insights

Qcon London 2012


Solutions

□ Buy high-end technology
□ Hire more developers
□ Use NoSQL
□ Use NewSQL


Solution – Buy High End Technology

Oracle, IBM

Solution – Hire more developers

http://www.trekbikes.com/us/en/bikes/road/race_performance/madone_4_series/madone_4_5

□Application Level Sharding

□Build your replication middleware

□…
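Application-level sharding means the application, not the database, decides which server holds a given row. The following is a minimal sketch under stated assumptions: the class `ShardRouter`, the shard names, and the hash-modulo rule are all hypothetical and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of application-level sharding: the application
// maps each key to one of several independent databases.
final class ShardRouter {
    private final List<String> shards;

    ShardRouter(List<String> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("need at least one shard");
        }
        this.shards = new ArrayList<>(shards);
    }

    // Modulo routing; Math.floorMod keeps the index non-negative
    // even for negative ids.
    String shardFor(long customerId) {
        int index = (int) Math.floorMod(customerId, (long) shards.size());
        return shards.get(index);
    }
}
```

With three shards, customer 7 lands on the shard at index 1. Changing the number of shards changes the index of most keys, which is one reason hand-built sharding tends to need ongoing developer effort.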


Solutions – Use NoSQL

□ New non-relational databases
□ Distributed architecture
□ Horizontal scalability
□ No fixed table schema
□ No JOIN, UPDATE, or DELETE operations
□ No transactions
□ No SQL support


NoSQL Ecosystems

451 group


MongoDB

□ Document-oriented database
  JSON-style documents: lists, maps, primitives
  Schema-less
□ Transaction = update of a single document
□ Rich query language for dynamic queries
□ Tunable writes: speed vs. reliability
□ Highly scalable and available

MongoDB Use Cases

□ Use cases
  High-volume writes
  Complex data
  Semi-structured data
□ Major customers
  Foursquare, Bit.ly, Intuit, SourceForge, NY Times, GILT Groupe, Evite, SugarCRM


Apache Cassandra

□ Column-oriented database / extensible row store
  Think Row ~= java.util.SortedMap
□ Transaction = update of a row
□ Fast writes = append to a log
□ Tunable reads/writes: consistency vs. availability
□ Extremely scalable
  Transparent and dynamic clustering
  Rack- and datacenter-aware data replication
□ CQL = "SQL"-like DDL and DML
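The "Row ~= java.util.SortedMap" analogy and the "fast writes = append to a log" point can be sketched in a few lines of Java. This is a toy model, not Cassandra's storage engine; the class `RowStoreSketch` and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the slide's analogy: each row is a sorted map of
// column name -> value, and every write is first appended to a log.
final class RowStoreSketch {
    private final List<String> commitLog = new ArrayList<>();
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    // One call updates a single row, the unit the slide calls a transaction.
    void put(String rowKey, String column, String value) {
        commitLog.add(rowKey + ":" + column + "=" + value); // append-only write
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, String>()).put(column, value);
    }

    // Columns are kept sorted, so a slice [fromColumn, toColumn) is a cheap range view.
    SortedMap<String, String> slice(String rowKey, String fromColumn, String toColumn) {
        SortedMap<String, String> row = rows.get(rowKey);
        if (row == null) {
            return new TreeMap<String, String>();
        }
        return row.subMap(fromColumn, toColumn);
    }

    int logSize() {
        return commitLog.size();
    }
}
```

Writing the columns `email`, `name`, and `age` to one row and then slicing from "a" to "f" returns `age` and `email` in that order, because the map keeps column names sorted.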

Apache Cassandra Use Cases

□ Use cases
  Big data
  Multiple-data-center distributed database
  Persistent cache
  (Write-intensive) logging
  High availability (writes)
□ Major customers
  Digg, Facebook, Twitter, Reddit, Rackspace
  Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX

"The largest production cluster has over 100 TB of data in over 150 machines." – Cassandra web site

Solutions – Use NewSQL

□ New relational databases
□ Retain SQL and ACID transactions
□ New, improved distributed architectures
□ Support high scalability and performance
□ NewSQL vendors: ScaleDB, NimbusDB, ..., VoltDB

NewSQL Definition – Wikipedia

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional single-node database system.

http://en.wikipedia.org/wiki/NewSQL

NewSQL Definition – 451 Group

A DBMS that delivers the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads.

http://www.cs.brown.edu/courses/cs227/slides/newsql/newsql-intro.pdf

NewSQL Definition – Stonebraker

SQL as the primary interface.
ACID support for transactions.
Non-locking concurrency control.
High per-node performance.
Parallel, shared-nothing architecture.

http://www.cs.brown.edu/courses/cs227/slides/newsql/newsql-intro.pdf

NewSQL Category

New Database
New MySQL Storage Engines
Transparent Clustering

The evolving database landscape

OSBC


MySQL Ecosystem


NewSQL Ecosystem


New Database

□ Newly designed from scratch to achieve scalability and performance
  One of the key considerations in improving performance is making non-disk (memory) or new kinds of disks (flash/SSD) the primary data store.
  Some (hopefully minor) changes to the code will be required, and data migration is still needed.
□ Solutions
  Software-only: VoltDB, NuoDB, Drizzle, Google Spanner
  Supported as an appliance: Clustrix, Translattice

http://www.linuxforu.com/2012/01/newsql-handle-big-data/


New MySQL Storage Engines

□ Highly optimized storage engines for MySQL
□ Scale better than built-in engines such as InnoDB
  Upside: the MySQL interface is kept
  Downside: data must be migrated from other databases
□ Solutions
  TokuDB, MemSQL, Xeround, Akiban, NDB

http://www.linuxforu.com/2012/01/newsql-handle-big-data/


Transparent Clustering

□ Retain the OLTP databases in their original format, but provide a pluggable feature to
  Cluster transparently
  Ensure scalability
□ Avoid rewriting code or performing any data migration
□ Solutions
  Cluster transparently: Schooner MySQL, Continuent Tungsten, ScalArc
  Ensure scalability: ScaleBase, dbShards

http://www.linuxforu.com/2012/01/newsql-handle-big-data/

NewSQL Products

VoltDB
Google Spanner

VoltDB

http://voltdb.com/products-services/products, http://www.slideshare.net/chris.e.richardson/polygot-persistenceforjavadevs-jfokus2012reorgpptx

□ VoltDB, 2010, GPL/VoltDB Proprietary License, Java/C++
□ Type: NewSQL, New Database
□ Main point: in-memory database, Java stored procedures; VoltDB implements the design of the academic H-Store project
□ Protocol: SQL
□ Transaction: Yes
□ Data storage: Memory
□ Features
  □ In-memory relational database
  □ Durability through replication, snapshots, logging
  □ Transparent partitioning
  □ ACID-level consistency
  □ Synchronous multi-master replication
  □ Database replication

VoltDB - Technical Overview

"OLTP Through the Looking Glass" http://cs-www.cs.yale.edu/homes/dna/papers/oltpperf-sigmod08.pdf

VoltDB avoids the overhead of traditional databases

K-safety for fault tolerance
• no logging
In-memory operation for maximum throughput
• no buffer management
Partitions operate autonomously and single-threaded
• no latching or locking
Built to scale horizontally

VoltDB - Partitions (1/3)

1 partition per physical CPU core
– Each physical server has multiple VoltDB partitions

Data - Two types of tables
– Partitioned
  Single column serves as partitioning key
  Rows are spread across all VoltDB partitions by partition column
  Transactional data (high frequency of modification)
– Replicated
  All rows exist within all VoltDB partitions
  Relatively static data (low frequency of modification)

Code - Two types of work – both ACID
– Single-Partition
  All insert/update/delete operations within single partition
  Majority of transactional workload
– Multi-Partition
  CRUD against partitioned tables across multiple partitions
  Insert/update/delete on replicated tables

VoltDB - Partitions (2/3)

Single-partition vs. Multi-partition

table orders (partitioned): customer_id (partition key), order_id, product_id
table products (replicated): product_id, product_name

Partition 1
  orders: (1, 101, 2), (1, 101, 3), (4, 401, 2)
  products: (1, knife), (2, spoon), (3, fork)

Partition 2
  orders: (2, 201, 1), (5, 501, 3), (5, 502, 2)
  products: (1, knife), (2, spoon), (3, fork)

Partition 3
  orders: (3, 201, 1), (6, 601, 1), (6, 601, 2)
  products: (1, knife), (2, spoon), (3, fork)

select count(*) from orders where customer_id = 5    (single-partition)
select count(*) from orders where product_id = 3    (multi-partition)
insert into orders (customer_id, order_id, product_id) values (3,303,2)    (single-partition)
update products set product_name = 'spork' where product_id = 3    (multi-partition)
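The four statements above can be classified with a small routing rule. The sketch below is an illustration under stated assumptions: the mapping `((customer_id - 1) % 3) + 1` is chosen only because it reproduces the partition layout in this example, the class `PartitionRouting` is hypothetical, and VoltDB's actual hashing is not described in the slides.

```java
// Sketch of single- vs multi-partition routing for the orders/products example.
// Assumption: customer_id maps to a partition as ((id - 1) % 3) + 1, which
// matches the example layout; the real hash function is not given in the source.
final class PartitionRouting {
    static final int PARTITIONS = 3;

    static int partitionFor(int customerId) {
        return Math.floorMod(customerId - 1, PARTITIONS) + 1;
    }

    // Partitioned table: single-partition only when the statement pins the
    // partition key to one value. Replicated table: a write must reach every
    // copy; a read is treated here as answerable from any one copy (assumption).
    static boolean isSinglePartition(boolean tableIsReplicated,
                                     boolean isWrite,
                                     boolean pinsPartitionKey) {
        if (tableIsReplicated) {
            return !isWrite;
        }
        return pinsPartitionKey;
    }
}
```

Under this rule the count for `customer_id = 5` goes to partition 2 only, the count by `product_id` must visit all three partitions, the insert for customer 3 goes to partition 3, and the update of `products` touches every partition's copy.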

VoltDB - Partitions (3/3)

Looking inside a VoltDB partition…
– Each partition contains data and an execution engine.
– The execution engine contains a queue for transaction requests.
– Requests are executed sequentially (single threaded).

[Figure: a partition with a work queue feeding the execution engine, plus table data and index data]

- Complete copy of all replicated tables
- Portion of rows (about 1/partitions) of all partitioned tables
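The queue-plus-single-thread design can be approximated with a standard Java single-thread executor. This is a sketch of the idea only, not VoltDB code; `PartitionSketch`, `rowCount`, and `runDemo` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of one partition: a work queue drained by a single thread, so
// requests run one at a time and the data needs no locks or latches.
final class PartitionSketch {
    private final ExecutorService engine = Executors.newSingleThreadExecutor();
    private long rowCount = 0; // stands in for table data; touched only by the engine thread

    Future<Long> submitInsert() {
        Callable<Long> request = () -> ++rowCount;
        return engine.submit(request);
    }

    Future<Long> submitCount() {
        Callable<Long> request = () -> rowCount;
        return engine.submit(request);
    }

    void shutdown() {
        engine.shutdown();
    }

    // Several client threads enqueue inserts concurrently; the final count is read
    // through the same queue, so it observes every earlier request.
    static long runDemo(int clients, int requestsPerClient) {
        PartitionSketch partition = new PartitionSketch();
        try {
            Thread[] threads = new Thread[clients];
            for (int i = 0; i < clients; i++) {
                threads[i] = new Thread(() -> {
                    for (int j = 0; j < requestsPerClient; j++) {
                        partition.submitInsert();
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            return partition.submitCount().get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            partition.shutdown();
        }
    }
}
```

Even with several clients submitting at once, no update is lost, because only the engine thread ever modifies `rowCount`.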

VoltDB - Compiling

The database is constructed from
– The schema (DDL)
– The work load (Java stored procedures)
– The Project (users, groups, partitioning)

VoltCompiler creates application catalog
– Copy to servers along with 1 .jar and 1 .so
– Start servers

Schema

CREATE TABLE HELLOWORLD (
    HELLO CHAR(15),
    WORLD CHAR(15),
    DIALECT CHAR(15),
    PRIMARY KEY (DIALECT)
);

Stored Procedures

import org.voltdb.*;

@ProcInfo(
    partitionInfo = "HELLOWORLD.DIA…
    singlePartition = true
)
public class Insert extends VoltPr…
    public final SQLStmt sql =
        new SQLStmt("INSERT INTO HELLO…
    public VoltTable[] run( String hel…

Project.xml

<?xml version="1.0"?>
<project>
    <database name='data…
        <schema path='ddl.…
        <partition table='…
    </database>
</project>

VoltDB - Transactions

All access to VoltDB is via Java stored procedures (Java + SQL)

A single invocation of a stored procedure is a transaction (committed on success)

Limits round trips between DBMS and application

High-performance client applications communicate asynchronously with VoltDB
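The rule that one stored-procedure invocation is one transaction, committed on success, can be illustrated without any VoltDB API. In this sketch the procedure runs against a private copy that replaces the live data only if the procedure returns normally; this copy-and-swap mechanism, the class `ProcedureTxSketch`, and the `transfer` procedure are assumptions for illustration, not how VoltDB implements commit and rollback.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of "one stored-procedure invocation = one transaction".
// Illustration only: the copy-and-swap commit is an assumption.
final class ProcedureTxSketch {
    private Map<String, Integer> balances = new HashMap<>();

    void put(String account, int amount) {
        balances.put(account, amount);
    }

    int get(String account) {
        return balances.get(account);
    }

    // Returns true if committed, false if the procedure threw and was rolled back.
    boolean invoke(Consumer<Map<String, Integer>> procedure) {
        Map<String, Integer> working = new HashMap<>(balances);
        try {
            procedure.accept(working);
        } catch (RuntimeException e) {
            return false;          // discard the working copy
        }
        balances = working;        // commit on success
        return true;
    }

    // Example procedure: several statements that succeed or fail as a unit.
    static void transfer(Map<String, Integer> data, String from, String to, int amount) {
        data.merge(from, -amount, Integer::sum);
        if (data.get(from) < 0) {
            throw new IllegalStateException("insufficient funds");
        }
        data.merge(to, amount, Integer::sum);
    }
}
```

Because the whole procedure runs in one call, the client makes a single round trip for all of its statements, which is the point of the "limits round trips" bullet.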

VoltDB - Clusters/Durability

Scalability
– Increase RAM in servers to add capacity
– Add servers to increase performance / capacity
– Consistently measuring 90% of single-node performance increase per additional node

High availability
– K-safety for redundancy

Snapshots
– Scheduled, continuous, on demand

Spooling to data warehouse

Disaster Recovery/WAN replication (Future)
– Asynchronous replication
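The scaling and K-safety figures lend themselves to quick arithmetic. The sketch below is a back-of-the-envelope reading, not a benchmark: it assumes each node after the first adds 90% of one node's throughput, and that a K-safety value of k keeps k + 1 copies of every partition. The class `ClusterSizing` and its method names are hypothetical.

```java
// Back-of-the-envelope reading of the scaling and K-safety figures.
// Assumptions: every node after the first adds 90% of single-node throughput,
// and K-safety k means k + 1 copies of each partition.
final class ClusterSizing {
    static double estimatedThroughput(double singleNodeTps, int nodes) {
        return singleNodeTps * (1 + 0.9 * (nodes - 1));
    }

    static int copiesPerPartition(int kSafety) {
        return kSafety + 1;
    }

    // Distinct partitions left once redundant copies are accounted for.
    static int uniquePartitions(int nodes, int partitionsPerNode, int kSafety) {
        return nodes * partitionsPerNode / copiesPerPartition(kSafety);
    }
}
```

For example, if one node handles 10,000 transactions per second, five nodes would be estimated at about 46,000 under this model, and a 4-node cluster with 6 partitions per node and k = 1 would hold 12 distinct partitions, each stored twice.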

Google Spanner

http://research.google.com/archive/spanner.html

□ Google, 2012, Paper, C++
□ Type: NewSQL, New Database
□ Main point: Google's scalable, multi-version, globally distributed, and synchronously replicated database
□ Distributed multiversion database
  General-purpose transactions (ACID)
  SQL query language
  Schematized tables
  Semi-relational data model
□ Running in production
  Storage for Google's ad data
  Replaced a sharded MySQL database

Google Spanner Overview

http://research.google.com/archive/spanner.html

□ Feature: Lock-free distributed read transactions
□ Property: External consistency of distributed transactions
  First system at global scale
□ Implementation: Integration of concurrency control, replication, and 2PC
  Correctness and performance
□ Enabling technology: TrueTime
  Interval-based global time
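Interval-based time can be illustrated with the commit-wait rule from the Spanner paper: a transaction picks its timestamp at the upper bound of the current time interval, then waits until that timestamp is guaranteed to be in the past. The sketch below simulates this with an integer clock and a constant uncertainty `epsilon`; both are simplifying assumptions, and `TrueTimeSketch` is a hypothetical class that only mirrors the paper's `TT.now()` and `TT.after()` in spirit.

```java
// Simulation of interval-based time and commit wait.
// Assumptions: an integer local clock that ticks by 1 and a constant uncertainty epsilon.
final class TrueTimeSketch {
    // Interval assumed to contain the true absolute time.
    static final class Interval {
        final long earliest;
        final long latest;

        Interval(long earliest, long latest) {
            this.earliest = earliest;
            this.latest = latest;
        }
    }

    static Interval now(long localClock, long epsilon) {
        return new Interval(localClock - epsilon, localClock + epsilon);
    }

    // True once timestamp t has definitely passed.
    static boolean after(long t, long localClock, long epsilon) {
        return now(localClock, epsilon).earliest > t;
    }

    // Picks commit timestamp s = now().latest at startClock, then ticks the
    // simulated clock until after(s) holds. Returns the clock reading at which
    // the commit may become visible.
    static long commitWait(long startClock, long epsilon) {
        long s = now(startClock, epsilon).latest;
        long clock = startClock;
        while (!after(s, clock, epsilon)) {
            clock++;
        }
        return clock;
    }
}
```

With a constant uncertainty the wait is a little over twice `epsilon`, which is why the design depends on keeping clock uncertainty small.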

NewSQL Summary

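Three Trends in the Database Industry

□ NoSQL databases:
  Designed to meet requirements such as the scalability of a distributed architecture and to fit schema-less data management needs.

□ NewSQL databases:
  Designed either to meet requirements such as the scalability of a distributed architecture, or to improve performance where horizontal scaling is not needed.

□ Data grid/cache products:
  Designed to keep data in memory in order to raise application and database performance.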

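Conclusion

□ Many solutions exist for storing data
  □ Drop the assumption that Oracle and MySQL are the only options
  □ First identify the system's data characteristics and requirements (CAP, ACID/BASE)
  □ Apply several solutions within one system
    Small-scale data with complex relationships: RDBMS
    Large-scale data processed in real time: NoSQL, NewSQL
    Large-scale data kept for storage: Hadoop and similar

□ Choose an appropriate solution
  □ Adopt only after verifying the issues that can arise in operation
  □ Most NewSQL solutions are still in beta (a hasty choice can backfire)
  □ Verification down to the solution's source-code level is needed

□ Ensure the stability of the NewSQL solution
  □ The solution's own stability needs verification; it does not yet match the stability of current DBMSs
  □ Apply only after securing a reliable way to store the data
  □ Developers with operations and development experience are hard to find
  □ Select a NewSQL product that fits the requirements

□ Pilot first rather than applying to critical systems from the start
  □ Verify the selected solution and build in-house expertise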

Thank you.


Appendix.


Early – 2000s

http://www.cs.brown.edu/courses/cs227/slides/newsql/newsql-intro.pdf

□All the big players were heavyweight and expensive.

Oracle, DB2, Sybase, SQL Server, etc.

□Open-source databases were missing important features.

Postgres, mSQL, and MySQL.


Early – 2000s : eBay Architecture

http://highscalability.com/ebay-architecture

Push functionality to application:
  Joins
  Referential integrity
  Sorting

No distributed transactions
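Pushing joins into the application means fetching rows from separate databases and matching them in memory. The sketch below is a hypothetical illustration (the class `AppLevelJoin` and the data shapes are invented here, with plain arrays and maps standing in for query results); it is not taken from eBay's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a join done in the application: rows come from two separate
// databases (plain collections stand in here) and are matched in memory.
final class AppLevelJoin {
    // orders: {orderId, userId}; usersById: userId -> user name.
    static List<String> join(int[][] orders, Map<Integer, String> usersById) {
        List<String> joined = new ArrayList<>();
        for (int[] order : orders) {
            String name = usersById.get(order[1]);
            if (name == null) {
                continue; // dangling reference: integrity checks are the application's job too
            }
            joined.add(order[0] + ":" + name);
        }
        return joined;
    }
}
```

An order that points at a missing user is silently skipped here; a real application would have to decide how to handle such rows, since no database constraint catches them.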

Mid – 2000s

http://www.cs.brown.edu/courses/cs227/slides/newsql/newsql-intro.pdf

□MySQL + InnoDB is widely adopted by new web companies:

Supported transactions, replication, recovery.

Still must use custom middleware to scale out across multiple machines.

Memcache for caching queries.
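Caching queries in front of the database usually follows a cache-aside pattern: look in the cache first, fall back to the database on a miss, and store the result. The sketch below uses an in-process map in place of Memcache and a function in place of the database; `QueryCache` and its methods are hypothetical names, and no real Memcache client API is shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside sketch: an in-process map stands in for Memcache and a
// function stands in for the database.
final class QueryCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> database;
    private int databaseHits = 0;

    QueryCache(Function<String, String> database) {
        this.database = database;
    }

    String query(String sql) {
        String cached = cache.get(sql);
        if (cached != null) {
            return cached;               // cache hit: no database round trip
        }
        databaseHits++;
        String result = database.apply(sql);
        cache.put(sql, result);
        return result;
    }

    // The application must invalidate entries itself when the data changes.
    void invalidate(String sql) {
        cache.remove(sql);
    }

    int databaseHits() {
        return databaseHits;
    }
}
```

Repeating the same query reaches the database only once until the entry is invalidated, which is the load reduction the slide refers to; keeping the cache consistent with the database remains the application's responsibility.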


Late – 2000s

http://www.cs.brown.edu/courses/cs227/slides/newsql/newsql-intro.pdf

□MySQL + InnoDB is widely adopted by new web companies:

Supported transactions, replication, recovery.

Still must use custom middleware to scale out across multiple machines.

Memcache for caching queries.


Late – 2000s : MongoDB Architecture

http://sett.ociweb.com/sett/settAug2011.html


Easy to use. Becoming more like a DBMS over time. No transactions.


Early – 2010s

http://www.cs.brown.edu/courses/cs227/slides/newsql/newsql-intro.pdf

□New DBMSs that can scale across multiple machines natively and provide ACID guarantees.

MySQL Middleware

Brand New Architectures

Database SPRAIN

□“An injury to ligaments... caused by being stretched beyond normal capacity”

□ Six key drivers for NoSQL/NewSQL/DDG adoption
  Scalability
  Performance
  Relaxed consistency
  Agility
  Intricacy
  Necessity

Database SPRAIN - Scalability

□ Associated sub-driver: Hardware economics
  Scale-out across clusters of commodity servers
□ Example project/service/vendor
  BigTable, HBase, Riak, MongoDB, Couchbase, Hadoop
  Amazon RDS, Xeround, SQL Azure, NimbusDB
  Data grid/cache
□ Associated use case:
  Large-scale distributed data storage
  Analysis of continuously updated data
  Multi-tenant PaaS data layer

Database SPRAIN - Scalability

□ User: StumbleUpon
□ Problem: Scaling problems with recommendation engine on MySQL
□ Solution: HBase
  Started using Apache HBase to provide real-time analytics on Su.pr
  MySQL lacked the performance headroom and scale
  Multiple benefits, including avoiding declaring schema
  Enables the data to be used for multiple applications and use cases

Database SPRAIN - Performance

□ Associated sub-driver: MySQL limitations
  Inability to perform consistently at scale
□ Example project/service/vendor
  Hypertable, Couchbase, Membrain, MongoDB, Redis
  Data grid/cache
  VoltDB, Clustrix
□ Associated use case:
  Real-time data processing of mixed read/write workloads
  Data caching
  Large-scale data ingestion

Database SPRAIN - Performance

□ User: AOL Advertising
□ Problem: Real-time data processing to support targeted advertising
□ Solution: Membase Server
  Segmentation analysis runs in CDH, results passed into Membase
  Makes use of its sub-millisecond data delivery
  More time for analysis as part of a 40 ms targeted ad response time
  Also real-time log and event management

Database SPRAIN – Relaxed Consistency

□ Associated sub-driver: CAP theorem
  The need to relax consistency in order to maintain availability
□ Example project/service/vendor:
  Dynamo, Voldemort, Cassandra
  Amazon SimpleDB
□ Associated use case:
  Multi-data center replication
  Service availability
  Non-transactional data off-load
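The trade-off between consistency and availability in Dynamo-style stores is often expressed with replica counts: with n replicas, a read that contacts r of them is guaranteed to overlap a write acknowledged by w of them only when r + w > n. The sketch below encodes just that arithmetic; `QuorumSketch` and its method names are hypothetical and no product API is shown.

```java
// Quorum arithmetic for a Dynamo-style replicated store.
final class QuorumSketch {
    // True when every read set of size r must intersect every write set of size w.
    static boolean readsOverlapWrites(int replicas, int r, int w) {
        return r + w > replicas;
    }

    // Number of replicas that can be unavailable while writes still succeed.
    static int writeFaultTolerance(int replicas, int w) {
        return replicas - w;
    }
}
```

With three replicas, r = 1 and w = 1 keeps writes available with two replicas down but allows stale reads, while r = 2 and w = 2 guarantees overlap at the cost of tolerating only one unavailable replica.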

Database SPRAIN – Relaxed Consistency

□ User: Wordnik
□ Problem: MySQL too consistent: it blocked access to data during inserts and created numerous temp files to stay consistent.
□ Solution: MongoDB
  A single word definition contains multiple data items from various sources
  MongoDB stores data as a complete document
  Reduced the complexity of data storage

Database SPRAIN – Agility

□ Associated sub-driver: Polyglot persistence
  Choose most appropriate storage technology for app in development
□ Example project/service/vendor
  MongoDB, CouchDB, Cassandra
  Google App Engine, SimpleDB, SQL Azure
□ Associated use case:
  Mobile/remote device synchronization
  Agile development
  Data caching

Database SPRAIN – Agility

□ User: Dimagi BHOMA (Better Health Outcomes through Mentoring and Assessments) project
□ Problem: Deliver patient information to clinics despite a lack of reliable Internet connections
□ Solution: Apache CouchDB
  Replicates data from regional to national database
  When Internet connection, and power, is available
  Upload patient data from cell phones to local clinic

Database SPRAIN – Intricacy

□ Associated sub-driver: Big data, total data
  Rising data volume, variety and velocity
□ Example project/service/vendor
  Neo4j GraphDB, InfiniteGraph
  Apache Cassandra, Hadoop
  VoltDB, Clustrix
□ Associated use case:
  Social networking applications
  Geo-locational applications
  Configuration management database

Database SPRAIN – Intricacy

□ User: Evident Software
□ Problem: Mapping infrastructure dependencies for application performance management
□ Solution: Neo4j
  Apache Cassandra stores performance data
  Neo4j used to map the correlations between different elements
  Enables users to follow relationships between resources while investigating issues

Database SPRAIN – Necessity

□ Associated sub-driver: Open source
  The failure of existing suppliers to address the performance, scalability and flexibility requirements of large-scale data processing
□ Example project/service/vendor
  BigTable, Dynamo, MapReduce, Memcached
  Hadoop, HBase, Hypertable, Cassandra, Membase
  Voldemort, Riak, BigCouch
  MongoDB, Redis, CouchDB, Neo4J
□ Associated use case: All of the above

Database SPRAIN – Necessity

□ BigTable: Google
□ Dynamo: Amazon
□ Cassandra: Facebook
□ HBase: Powerset
□ Voldemort: LinkedIn
□ Hypertable: Zvents
□ Neo4j: Windh Technologies

Yahoo: Apache Hadoop and Apache HBase
Digg: Apache Cassandra
Twitter: Apache Cassandra, Apache Hadoop and FlockDB