
Optimizing Presto Connector on Cloud Storage


Page 1: Optimizing Presto Connector on Cloud Storage

TREASURE DATA

OPTIMIZING PRESTO CONNECTOR ON CLOUD STORAGE
DB Tech Showcase Tokyo 2017

Kai Sasaki, Software Engineer at Treasure Data Inc.

Page 2: Optimizing Presto Connector on Cloud Storage

ABOUT ME

• Kai Sasaki (佐々木 海)

• Software Engineer at Treasure Data

• Hadoop/Spark contributor

• Hivemall committer

• Java/Scala/Python

Page 3: Optimizing Presto Connector on Cloud Storage

TREASURE DATA

Data Analytics Platform: unify all your raw data in a scalable and secure platform, with 100+ supported integrations that let you easily connect all your data sources in real time.

Live with OSS
• Fluentd
• Embulk
• Digdag
• Hivemall and more
https://www.treasuredata.com/opensource/

Page 4: Optimizing Presto Connector on Cloud Storage

AGENDA

• What is Presto?

• Presto Connector Detail

• Cloud Storage and PlazmaDB

• Transaction and Partitioning

• Time Index Partitioning

• User Defined Partitioning

Page 5: Optimizing Presto Connector on Cloud Storage

WHAT IS PRESTO?

Page 6: Optimizing Presto Connector on Cloud Storage

WHAT IS PRESTO?

• Presto is an open source scalable distributed SQL engine for huge OLAP workloads

• Mainly developed by Facebook and Teradata

• Used by Facebook, Uber, Netflix, etc.

• In-memory processing

• Pluggable architecture: Hive, Cassandra, Kafka, etc.

Page 7: Optimizing Presto Connector on Cloud Storage

PRESTO IN TREASURE DATA

Page 8: Optimizing Presto Connector on Cloud Storage

PRESTO IN TREASURE DATA

• Multiple clusters with 40~50 workers

• Presto 0.178 + original (in-house) Presto plugin (connector)

• 4.3+ million queries per month

• 400 trillion records per month

• 6+ PB per month

Page 9: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

Page 10: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

• A Presto connector is a plugin that provides Presto with access to various kinds of existing data storage.

• A connector is responsible for managing metadata, transactions, and data accessors.

http://prestodb.io/

Page 11: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

• Hive Connector: uses the Hive metastore for metadata and S3/HDFS as storage.

• Kafka Connector: queries Kafka topics as tables. Each message is interpreted as a row in a table.

• Redis Connector: each key/value pair is interpreted as a row in Presto.

• Cassandra Connector: supports Cassandra 2.1.5 or later.

Page 12: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

• Black Hole Connector: works like /dev/null or /dev/zero on Unix-like systems. Used for catastrophic tests or integration tests.

• Memory Connector: metadata and data are stored in RAM on worker nodes. Still an experimental connector, mainly used for tests.

• System Connector: provides information about the cluster state and running query metrics. It is useful for runtime monitoring.

Page 13: Optimizing Presto Connector on Cloud Storage

CONNECTOR DETAIL

Page 14: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

• Plugin defines an interface to bootstrap your connector creation.

• It also provides the list of UDFs available in your Presto cluster.

• A ConnectorFactory is able to provide multiple connector implementations.

[Diagram: Plugin → getConnectorFactories() → ConnectorFactory → create(connectorId, …) → Connector]
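As a rough sketch of this bootstrap path against the Presto 0.17x SPI (ExamplePlugin and ExampleConnectorFactory are hypothetical names, not Treasure Data's actual classes, and the factory stub is elided):

```java
import java.util.Collections;

import com.facebook.presto.spi.Plugin;
import com.facebook.presto.spi.connector.ConnectorFactory;

// Hypothetical sketch: the entry point of a connector plugin.
public class ExamplePlugin implements Plugin
{
    @Override
    public Iterable<ConnectorFactory> getConnectorFactories()
    {
        // A single Plugin may expose multiple ConnectorFactory implementations.
        return Collections.singletonList(new ExampleConnectorFactory());
    }
}
```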

Page 15: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

• Connector provides classes to manage metadata, storage accessors, and table access control.

• ConnectorSplitManager creates data source metadata to be distributed to multiple worker nodes.

• ConnectorPageSourceProvider / ConnectorPageSinkProvider are provided to the split operators.

[Diagram: Connector → ConnectorMetadata, ConnectorSplitManager, ConnectorPageSourceProvider, ConnectorPageSinkProvider, ConnectorAccessControl]
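A hedged sketch of how a Connector hands these components to the engine, against the Presto 0.17x SPI (all Example* classes are illustrative stubs, elided here):

```java
import com.facebook.presto.spi.connector.Connector;
import com.facebook.presto.spi.connector.ConnectorMetadata;
import com.facebook.presto.spi.connector.ConnectorPageSinkProvider;
import com.facebook.presto.spi.connector.ConnectorPageSourceProvider;
import com.facebook.presto.spi.connector.ConnectorSplitManager;
import com.facebook.presto.spi.connector.ConnectorTransactionHandle;
import com.facebook.presto.spi.transaction.IsolationLevel;

// Hypothetical sketch: a Connector returns one component per concern.
public class ExampleConnector implements Connector
{
    @Override
    public ConnectorTransactionHandle beginTransaction(IsolationLevel isolationLevel, boolean readOnly)
    {
        return new ExampleTransactionHandle();
    }

    @Override
    public ConnectorMetadata getMetadata(ConnectorTransactionHandle transaction)
    {
        return new ExampleMetadata();           // tables/columns, beginInsert/finishInsert
    }

    @Override
    public ConnectorSplitManager getSplitManager()
    {
        return new ExampleSplitManager();       // plans splits from partition metadata
    }

    @Override
    public ConnectorPageSourceProvider getPageSourceProvider()
    {
        return new ExamplePageSourceProvider(); // reads the data files for a split
    }

    @Override
    public ConnectorPageSinkProvider getPageSinkProvider()
    {
        return new ExamplePageSinkProvider();   // writes data for INSERT/CTAS
    }
}
```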

Page 16: Optimizing Presto Connector on Cloud Storage

PRESTO CONNECTOR

• Call beginInsert from ConnectorMetadata.

• ConnectorSplitManager creates splits that include metadata about the actual data source (e.g. file paths).

• ConnectorPageSourceProvider downloads the files from the data source in parallel.

• finishInsert in ConnectorMetadata commits the transaction.

[Diagram: ConnectorMetadata.beginInsert → ConnectorSplitManager.getSplits → ConnectorPageSourceProvider ×3 (in parallel) → Operators… → ConnectorMetadata.finishInsert]
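An abbreviated excerpt of what these write-path hooks look like in ConnectorMetadata (other required methods omitted, so this is not compilable as-is; the handle class is hypothetical):

```java
import java.util.Collection;
import java.util.Optional;

import com.facebook.presto.spi.ConnectorInsertTableHandle;
import com.facebook.presto.spi.ConnectorSession;
import com.facebook.presto.spi.ConnectorTableHandle;
import com.facebook.presto.spi.connector.ConnectorMetadata;
import com.facebook.presto.spi.connector.ConnectorOutputMetadata;
import io.airlift.slice.Slice;

// Abbreviated excerpt: only the insert lifecycle is shown.
public class ExampleMetadata implements ConnectorMetadata
{
    @Override
    public ConnectorInsertTableHandle beginInsert(ConnectorSession session, ConnectorTableHandle tableHandle)
    {
        // Open the transaction: subsequent uploads land in an "uncommitted" state.
        return new ExampleInsertTableHandle(tableHandle);
    }

    @Override
    public Optional<ConnectorOutputMetadata> finishInsert(
            ConnectorSession session, ConnectorInsertTableHandle insertHandle, Collection<Slice> fragments)
    {
        // Atomically flip uncommitted partitions to committed (see PlazmaDB below).
        return Optional.empty();
    }

    // listSchemaNames, getTableHandle, getColumnHandles, etc. omitted.
}
```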

Page 17: Optimizing Presto Connector on Cloud Storage

PRESTO ON CLOUD STORAGE

• A distributed execution engine like Presto cannot make use of data locality on cloud storage.

• Reads and writes of data can be a dominant factor in query performance, stability, and cost.

→ The connector should be implemented to take care of network I/O cost.

Page 18: Optimizing Presto Connector on Cloud Storage

CLOUD STORAGE IN TD

• Our Treasure Data storage service is built on cloud storage like S3.

• Presto provides only a distributed query execution layer, which requires our storage system to be scalable as well.

• On the other hand, we should make use of the maintainability and availability provided by the cloud service provider (IaaS).

Page 19: Optimizing Presto Connector on Cloud Storage

EASE-UP APPROACH

Page 20: Optimizing Presto Connector on Cloud Storage

PLAZMADB

• We built PlazmaDB, a thin storage layer on top of existing cloud storage and a relational database.

• PlazmaDB is a central component that stores all customer data for analysis in Treasure Data.

• PlazmaDB consists of two components

• Metadata (PostgreSQL)

• Storage (S3 or RiakCS)

Page 21: Optimizing Presto Connector on Cloud Storage

PLAZMADB

• PlazmaDB stores metadata of data files in PostgreSQL hosted by Amazon RDS.

Page 22: Optimizing Presto Connector on Cloud Storage

PLAZMADB

• PlazmaDB stores metadata of data files in PostgreSQL hosted by Amazon RDS.

• This PostgreSQL instance manages the index, the file paths on S3, transactions, and deleted files.


Page 23: Optimizing Presto Connector on Cloud Storage

TRANSACTION AND PARTITIONING

Page 24: Optimizing Presto Connector on Cloud Storage

TRANSACTION AND PARTITIONING

• Consistency is the most important factor for enterprise analytics workloads. Therefore an MPP engine like Presto and the backend storage MUST always guarantee consistency.

→ UPDATE is done atomically by PlazmaDB

• At the same time, we want to achieve high throughput by distributing the workload to multiple worker nodes.

→ Data files are partitioned in PlazmaDB

Page 25: Optimizing Presto Connector on Cloud Storage

PLAZMADB TRANSACTION

• PlazmaDB supports transactions for queries that have side effects (e.g. INSERT INTO / CREATE TABLE).

• A PlazmaDB transaction is an atomic operation on the visibility of data on S3, not on the actual files.

• A transaction is composed of two phases (see the sketch after this list):

• Uploading uncommitted partitions

• Committing the transaction by moving the uncommitted partitions to committed
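As a rough illustration of the commit phase, here is a minimal JDBC sketch assuming hypothetical uncommitted_partitions / committed_partitions tables keyed by a transaction id (these names are illustrative, not PlazmaDB's actual schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical sketch: commit a PlazmaDB-style transaction by atomically
// moving partition records from "uncommitted" to "committed".
public class CommitTransaction
{
    public static void commit(String jdbcUrl, long transactionId) throws Exception
    {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false); // one PostgreSQL transaction = atomic visibility switch

            // Copy the uploaded partition records into the committed table...
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO committed_partitions SELECT * FROM uncommitted_partitions WHERE txid = ?")) {
                insert.setLong(1, transactionId);
                insert.executeUpdate();
            }
            // ...and remove them from the uncommitted table.
            try (PreparedStatement delete = conn.prepareStatement(
                    "DELETE FROM uncommitted_partitions WHERE txid = ?")) {
                delete.setLong(1, transactionId);
                delete.executeUpdate();
            }
            conn.commit(); // readers see either none or all of the new partitions
        }
    }
}
```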

Page 26: Optimizing Presto Connector on Cloud Storage

PLAZMADB TRANSACTION

• Multiple workers upload files to S3 asynchronously.

[Diagram: uncommitted and committed tables in PostgreSQL; workers uploading to S3]

Page 27: Optimizing Presto Connector on Cloud Storage

PLAZMADB TRANSACTION

• After each upload finishes, a record is inserted into the uncommitted table in PostgreSQL.

[Diagram: uncommitted and committed tables in PostgreSQL]

Page 28: Optimizing Presto Connector on Cloud Storage

PLAZMADB TRANSACTION

• After each upload finishes, a record is inserted into the uncommitted table in PostgreSQL.

[Diagram: records p1 and p2 now in the uncommitted table]

Page 29: Optimizing Presto Connector on Cloud Storage

PLAZMADB TRANSACTION

• After all upload tasks are completed, the coordinator tries to commit the transaction by moving all records from uncommitted to committed.

[Diagram: records p1, p2, p3 in the uncommitted table, about to move to committed]

Page 30: Optimizing Presto Connector on Cloud Storage

PLAZMADB TRANSACTION

• After all upload tasks are completed, the coordinator tries to commit the transaction by moving all records from uncommitted to committed.

[Diagram: records p1, p2, p3 now in the committed table]

Page 31: Optimizing Presto Connector on Cloud Storage

PLAZMADB DELETE

• A delete query is handled in a similar way. First, newly created partitions are uploaded, excluding the deleted records.

[Diagram: committed partitions p1, p2, p3; new uncommitted partitions p1', p2', p3']

Page 32: Optimizing Presto Connector on Cloud Storage

PLAZMADB DELETE

• When the transaction is committed, the records in the committed table are replaced by the uncommitted records, which point to different file paths.

[Diagram: committed table now holds p1', p2', p3']

Page 33: Optimizing Presto Connector on Cloud Storage

PARTITIONING

• To make the best of Presto's parallel processing throughput, the data source must be distributed as well.

• Distributing the data source evenly contributes to high throughput and performance stability.

• Two basic partitioning methods:

• Key range partitioning → Time-Index Partitioning

• Hash partitioning → User Defined Partitioning

Page 34: Optimizing Presto Connector on Cloud Storage

PARTITIONING

• A partition record in PlazmaDB represents a file stored in S3, along with some additional information:

• Data Set ID

• Range Index Key

• Record Count

• File Size

• Checksum

• File Path

Page 35: Optimizing Presto Connector on Cloud Storage

PARTITIONING

• All partitions in PlazmaDB are indexed by the time when the data was generated. The time index is recorded as a UNIX epoch.

• A partition keeps first_index_key and last_index_key to specify the time range that the partition covers.

• The PlazmaDB index is constructed as a multicolumn index using PostgreSQL's GiST index; a sketch of the lookup it serves follows the list below. (https://www.postgresql.org/docs/current/static/gist.html)

• (data_set_id, index_range(first_index_key, last_index_key))
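As a rough illustration, a range lookup served by this index might look like the following JDBC sketch; the table and column names are hypothetical, not PlazmaDB's actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fetch partitions of a data set overlapping [start, end).
public class PartitionLookup
{
    public static List<String> findPaths(String jdbcUrl, long dataSetId, long start, long end) throws Exception
    {
        List<String> paths = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
                PreparedStatement stmt = conn.prepareStatement(
                        // Served by the multicolumn (data_set_id, time range) index.
                        "SELECT file_path FROM partitions " +
                        "WHERE data_set_id = ? AND first_index_key < ? AND last_index_key >= ?")) {
            stmt.setLong(1, dataSetId);
            stmt.setLong(2, end);    // partition starts before the query range ends
            stmt.setLong(3, start);  // partition ends after the query range starts
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    paths.add(rs.getString(1));
                }
            }
        }
        return paths;
    }
}
```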

Page 36: Optimizing Presto Connector on Cloud Storage

LIFECYCLE OF PARTITION

• PlazmaDB has two storage management layers. At the beginning, records are put into the realtime storage layer in raw format.

[Diagram: realtime storage holding raw partitions at time 100, 300, 500, 3800, 4000; archive storage empty]

Page 37: Optimizing Presto Connector on Cloud Storage

LIFECYCLE OF PARTITION

• Every hour, a dedicated MapReduce job called the Log Merge Job runs to merge records of the same time range into one partition in archive storage.

[Diagram: the MR job merges realtime partitions (time 100, 300, 500, 3800, 4000) into archive partitions covering time 0~3599 and 3600~7200]

Page 38: Optimizing Presto Connector on Cloud Storage

LIFECYCLE OF PARTITION

• A query execution engine like Presto needs to fetch data from both realtime storage and archive storage, but reading from archive storage is generally more efficient.

[Diagram: Presto reads from both realtime storage and archive storage]

Page 39: Optimizing Presto Connector on Cloud Storage

TWO PARTITIONING TYPES

Page 40: Optimizing Presto Connector on Cloud Storage

TIME INDEX PARTITIONING

• By using the multicolumn index on the time range in PlazmaDB, Presto can filter out unnecessary partitions through predicate push-down.

• The TD_TIME_RANGE UDF gives Presto a hint about which partitions should be fetched from PlazmaDB.

• e.g. TD_TIME_RANGE(time, '2017-08-31 12:30:00', NULL, 'JST')

• ConnectorSplitManager selects the necessary partitions and calculates the split distribution plan.

Page 41: Optimizing Presto Connector on Cloud Storage

TIME INDEX PARTITIONING

• Select metadata records from realtime storage and archive storage according to the given time range:
SELECT * FROM rt/ar WHERE start < time AND time < end;

[Diagram: ConnectorSplitManager selects archive partitions (time 0~3599, 3600~7200) and realtime partitions (time 8000, 8200, 8800, 9000)]

Page 42: Optimizing Presto Connector on Cloud Storage

TIME INDEX PARTITIONING

• A split is responsible for downloading multiple files from S3 in order to reduce overhead.

• ConnectorSplitManager calculates the file assignment for each split based on the available statistics (e.g. file size, number of columns, record count); a simplified sketch follows the diagram below.

[Diagram: ConnectorSplitManager assigns files f1, f2, f3 across Split1 and Split2]
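Here is a minimal sketch of the size-based assignment idea, greedily packing files into splits up to a target byte size. The greedy strategy, threshold, and names are illustrative assumptions; a real split manager would also weigh column and record counts:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pack files into splits so that each split downloads
// roughly the same volume of data from S3.
public class SplitAssignment
{
    record DataFile(String path, long sizeBytes) {}

    static List<List<DataFile>> assign(List<DataFile> files, long targetSplitBytes)
    {
        List<List<DataFile>> splits = new ArrayList<>();
        List<DataFile> current = new ArrayList<>();
        long currentSize = 0;
        for (DataFile file : files) {
            if (!current.isEmpty() && currentSize + file.sizeBytes() > targetSplitBytes) {
                splits.add(current);            // close the current split
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(file);
            currentSize += file.sizeBytes();
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```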

Page 43: Optimizing Presto Connector on Cloud Storage

TIME INDEX PARTITIONING

[Chart: elapsed time for "SELECT 10 cols in a range" with TD_TIME_RANGE, over ranges from 60 days down to 10 days; y-axis 0–180 sec]

Page 44: Optimizing Presto Connector on Cloud Storage

TIME INDEX PARTITIONING

[Chart: number of splits for "SELECT 10 cols in a range", over ranges from 6+ years down to 6 months; y-axis 0–60 splits]

split

Page 45: Optimizing Presto Connector on Cloud Storage

CHALLENGE

• Time-Index partitioning has worked very well because:

• Most logs from web pages or IoT devices natively carry the time at which they were created.

• OLAP workloads from analysts are often limited to a specific time range (e.g. the last week, or during a campaign).

• But it lacks the flexibility to build an index on a column other than time. This is required especially in digital marketing and DMP use cases.

Page 46: Optimizing Presto Connector on Cloud Storage

USER DEFINED PARTITIONING

Page 47: Optimizing Presto Connector on Cloud Storage

USER DEFINED PARTITIONING

• We are now evaluating user defined partitioning with Presto.

• User defined partitioning allows customers to flexibly set an index on an arbitrary data attribute.

• User defined partitioning can co-exist with time-index partitioning as a secondary index.

Page 48: Optimizing Presto Connector on Cloud Storage

SELECT COUNT(1) FROM audience WHERE TD_TIME_RANGE(time, '2017-09-04', '2017-09-07') AND audience.room = 'E'

Page 49: Optimizing Presto Connector on Cloud Storage

BUCKETING

• A similar mechanism to Hive bucketing.

• A bucket is a logical group of partition files, grouped by the specified bucketing column.

[Diagram: a table split into buckets; within each bucket, partition files are grouped by time range 1–4]

Page 50: Optimizing Presto Connector on Cloud Storage

BUCKETING

• PlazmaDB defines the hash function type for the partitioning key, and the total bucket count, which is fixed in advance.

[Diagram: ConnectorSplitManager receives the query "SELECT COUNT(1) FROM audience WHERE TD_TIME_RANGE(time, '2017-09-04', '2017-09-07') AND audience.room = 'E'" and maps it onto the table's buckets (bucket1, bucket2, bucket3), each holding partitions]

Page 51: Optimizing Presto Connector on Cloud Storage

BUCKETING

• ConnectorSplitManager selects the proper partitions from PostgreSQL given the time range and the bucket key.

[Diagram: hash('E') → bucket2, so only bucket2's partitions are scanned; the time filter 1504483200 < time && time < 1504742400 then selects partitions within bucket2]
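A minimal sketch of the bucket-routing idea, assuming a fixed bucket count and a hash function on the bucketing key; CRC32 and all names here are illustrative stand-ins, not PlazmaDB's actual hash function:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: route a bucketing-key value to one of N fixed buckets,
// so only that bucket's partitions need to be scanned.
public class BucketPruning
{
    private final int bucketCount;

    public BucketPruning(int bucketCount)
    {
        this.bucketCount = bucketCount; // fixed in advance for the table
    }

    public int bucketFor(String key)
    {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % bucketCount);
    }

    public static void main(String[] args)
    {
        BucketPruning pruning = new BucketPruning(3);
        // For audience.room = 'E', only this bucket is read; others are skipped.
        System.out.println("bucket for 'E': " + pruning.bucketFor("E"));
    }
}
```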

Page 52: Optimizing Presto Connector on Cloud Storage

USER DEFINED PARTITIONING

• We can skip reading unnecessary partitions. This architecture fits digital marketing use cases very well:

• Creating user segments

• Aggregation by channel

• It still makes use of time index partitioning.

• It is now being tested internally.

Page 53: Optimizing Presto Connector on Cloud Storage

RECAP

• Presto provides a plugin mechanism called a connector.

• Though Presto itself is a highly scalable distributed engine, the connector is also responsible for efficient query execution.

• PlazmaDB has features that make it a good fit for such a connector:

• Transaction support

• Time-Index Partitioning

• User Defined Partitioning

Page 54: Optimizing Presto Connector on Cloud Storage

TREASURE DATA