pg_shard: Shard and Scale Out PostgreSQL
Jason Petersen, Sumedh Pathak, Ozgun Erdogan
What is CitusDB?
• CitusDB is a scalable analytics database that extends PostgreSQL; pg_shard targets the short read/write use case
• Citus shards and replicates your data in the same way pg_shard does
• Citus parallelizes your queries for analytics
• Citus isn't a fork of Postgres. Rather, it hooks onto the planner and executor for distributed query execution.
Talk Outline
1. Why use pg_shard?
2. Data: scaling and failure handling
3. Computation: query routing and failure handling
4. Random PostgreSQL coolness
#1 Requested Feature from Citus
• Real-time analytics calls for real-time data ingest.
• We had one customer who built real-time inserts on their own. Then two other customers did the same thing.
• We also talked to PostgreSQL users. Some considered application-level sharding, or migrating to NoSQL solutions.
Customer Interviews
• Dynamically scale a cluster as new machines are added or old ones are retired. Magically handle failures.
• Simple to set up and use. Works natively on PostgreSQL.
Architectural Decisions
• Dynamic scaling: use logical shards to facilitate easy rebalancing as cluster membership changes
• Simple to use: works natively with PostgreSQL by augmenting its planner and executor logic for real-time data ingest and querying
Traditional Partitioning
[Diagram sequence: yearly click_events_2012/2013/2014 partitions (4 TB each) live on Nodes #1–#3. A newly added Node #4 sits empty, and spreading the same coarse yearly partitions across six nodes still leaves rebalancing manual and lopsided.]
Logical Sharding
[Diagram sequence: the same data split into nine small logical shards (512 MB each), each placed on two nodes. With three nodes, each node holds six shard placements; after growing to six nodes, individual shards move so each node holds three, with no need to reshuffle whole partitions.]
Example Data Distribution in a pg_shard Cluster
[Diagram: a metadata server (PostgreSQL + pg_shard) holds the shard and shard placement metadata; Worker Nodes #1–#3 each hold six of the nine logical shards, so every shard is placed on two workers.]
Metadata Server (Master Node)
• The metadata server holds authoritative state on shards and their placements in the cluster
• This state is minimal: 1 row per distributed table, 1 row per shard, and 1 row per shard placement, kept in 3 Postgres tables
• To handle metadata node failures, you can create a streaming replica, or reconstruct the metadata from the workers on the fly
Metadata and Hash Partitioning

postgres=# SELECT * FROM pgs_distribution_metadata.shard;
  id   | relation_id | storage |  min_value  |  max_value
-------+-------------+---------+-------------+-------------
 10004 |      177880 | t       | -2147483648 | -1879048194
 10005 |      177880 | t       | -1879048193 | -1610612739
 10006 |      177880 | t       | -1610612738 | -1342177284
 10007 |      177880 | t       | -1342177283 | -1073741829
 10008 |      177880 | t       | -1073741828 |  -805306374
 10009 |      177880 | t       |  -805306373 |  -536870919
  ...  |     ...     |   ...   |     ...     |     ...
Worker Nodes
• 1 shard placement = 1 PostgreSQL table
• Names of tables, indexes, and constraints are extended with their shard identifier, e.g. click_events_1001 holds data for shard 1001
• Index and constraint definitions are propagated when shards are first created
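As an illustrative sketch of the naming scheme (the column and index definitions below are assumptions, not from the talk), a placement on a worker is just an ordinary table whose name carries the shard ID:

-- hypothetical layout of shard 1001 of click_events on a worker:
CREATE TABLE click_events_1001 (
    customer_id TEXT NOT NULL,
    clicked_at  TIMESTAMPTZ
);
-- indexes are propagated with the same shard-extended name:
CREATE INDEX click_events_clicked_at_idx_1001
    ON click_events_1001 (clicked_at);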
What about the logic?
• Drop-in PostgreSQL extension
• Supports a subset of SQL
• Doesn't use any special functions
• It's still just PostgreSQL
Users: Making Scaling SQL Easy

CREATE EXTENSION pg_shard;

-- create a regular PostgreSQL table:
CREATE TABLE customer_reviews (
    customer_id TEXT NOT NULL,
    review_date DATE,
    ...
);

-- distribute the table on the given partition key:
SELECT master_create_distributed_table('customer_reviews', 'customer_id');

-- create 16 logical shards with 2 placements on workers:
SELECT master_create_worker_shards('customer_reviews', 16, 2);
PostgreSQL Secret Sauce: Hooks
• Full control over command lifetime
• Specific to needs:
  – Planning
  – Execution (Start, Run, Finish, End)
  – Utility
Planning Phase
• Determine if the query is distributed
• Fall through to PostgreSQL if not
• Using the partition key, find the involved shards
• Deparse shard-specific SQL
Planning Example

INSERT INTO customer_reviews (customer_id, rating) VALUES ('HN892', 5);

• Determine partition key clauses: customer_id = 'HN892'
• Find shards from metadata tables: hashtext('HN892') BETWEEN min_value AND max_value
• Produce shard-specific SQL: INSERT INTO customer_reviews_16 (customer_id, rating) VALUES ('HN892', 5);
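The middle step can be sketched as a query against the metadata table shown earlier (a sketch, assuming relation_id stores the distributed table's OID):

-- find the shard whose hash range covers this partition key value:
SELECT id
FROM pgs_distribution_metadata.shard
WHERE relation_id = 'customer_reviews'::regclass
  AND hashtext('HN892') BETWEEN min_value AND max_value;

The returned shard id (16 in this example) is then appended to the table name when the shard-specific SQL is deparsed.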
Execute Distributed Modify
• Locks enforce safe commutation
• Replicas visited in a predictable order
• Per-session connection pool uses libpq
• If a replica errors out, mark it as inactive
Single-shard INSERT
[Diagram sequence, replication factor 2: the master receives INSERT INTO customer_reviews ... and routes it to the matching shard's two placements on Worker Nodes #1–#3. One replica fails during the write; the master sets shard 6 on node 3 to inactive status, and the INSERT succeeds on the remaining replica.]
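What "marks inactive" amounts to is a small metadata update. As a hypothetical sketch only (the placement table name, columns, and state value below are assumptions modeled on the pgs_distribution_metadata schema shown earlier, not taken from the talk):

-- hypothetical: flag the failed placement of shard 6 on node 3
UPDATE pgs_distribution_metadata.shard_placement
SET shard_state = 3          -- assumed "inactive" state value
WHERE shard_id = 6
  AND node_name = 'worker-3';

Future reads and writes then skip that placement until it is repaired.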
Single-Shard SELECT
• Fetch the entire result from a single shard
• Fail over to another replica on error
• Do not modify state if a failure happens
• Common key-value access pattern
Single-shard SELECT
[Diagram sequence: the master runs SELECT * FROM customer_reviews WHERE customer_id = 'HN892'; against the first placement of the matching shard on Worker Nodes #1–#3, encounters an error, and retries the same query against the next placement.]
Multi-Shard SELECT
• pg_shard: engaged when no partition key constraint is present. Pulls data to the master and performs the final pass locally (aggregates, etc.)
• CitusDB: pushes computations to the worker nodes themselves for fully distributed query logic
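As an illustrative example (columns from the customer_reviews definition shown earlier), a query with no partition-key filter engages this path:

-- no customer_id constraint, so every shard is involved;
-- pg_shard gathers rows on the master and finishes the
-- GROUP BY / count there:
SELECT review_date, count(*)
FROM customer_reviews
GROUP BY review_date;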
What's Next?
• pg_shard v1.1 releases this week. It includes a shard repair UDF, easier integration with CitusDB, COPY support through triggers, SELECT projection push-downs, faster INSERTs, and bug fixes
• pg_shard v1.2 will have more real-time analytics functionality
• Anything else?
Summary
• Sharding extension for PostgreSQL: https://github.com/citusdata/pg_shard
• Many small logical shards
• Implemented with standard PostgreSQL hooks
• Load extension… create table… distribute
• JSONB + pg_shard as an alternative to NoSQL
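The last point can be sketched with the same functions shown earlier (the events table and its columns are illustrative assumptions):

-- a distributed document store on plain PostgreSQL:
CREATE TABLE events (
    device_id TEXT NOT NULL,
    payload   JSONB
);
-- GIN index for queries into the semi-structured payload:
CREATE INDEX events_payload_idx ON events USING gin (payload);
SELECT master_create_distributed_table('events', 'device_id');
SELECT master_create_worker_shards('events', 16, 2);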
PostgreSQL Renaissance
PostgreSQL Usage Trends
• "What database does your company use?" [Charts: Hacker News poll results, 2010 vs. 2014]
Why is PostgreSQL popular?
1. Oraclization of MySQL
2. Reliable and robust database
3. New extension framework: is the monolithic SQL database dying? If so, long live Postgres
PostgreSQL Extensions #1
• HyperLogLog extension for real-time unique counts over varying time intervals
• JSONB and GIN indexes for semi-structured data
PostgreSQL Extensions #2
• Real-time funnel and cohort queries
• Dynamic funnels to look at people who visited A, then B, and finally J
• Heatmap video of vehicles on a map
Contact
Jason: [email protected]
Sumedh: [email protected]
Ozgun: [email protected]
General: groups.google.com/forum/#!forum/pg_shard-users
Questions?