Internals of Presto Service
Taro L. Saito, Treasure Data [email protected] March 11-12th, 2015 Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
• 2007 University of Tokyo. Ph.D. – XML DBMS, Transaction Processing
• Relational-Style XML Query [SIGMOD 2008]
• ~ 2014 Assistant Professor at University of Tokyo – Genome Science Research
• Distributed Computing, Personal Genome Analysis
• March 2014 ~ Treasure Data – Software Engineer, MPP Team Leader
• Open source projects at GitHub – snappy-java, msgpack-java, sqlite-jdbc – sbt-pack, sbt-sonatype, larray – silk
• Distributed workflow engine
2
[Architecture diagram: the TD API / Web Console routes batch queries to Hive and interactive queries to Presto; Presto reads PlazmaDB (MessagePack columnar storage) through the td-presto connector.]
What is Presto?
• A distributed SQL engine developed by Facebook – For interactive analysis on peta-scale datasets
• Developed as a replacement for Hive – Nov. 2013: Open sourced on GitHub
• Presto – Written in Java – In-memory query layer – CPU-efficient for ad-hoc analysis – Based on ANSI SQL
– Isolates the query layer from the storage access layer • A connector provides data access (reading schemas and records)
4
Presto: Distributed SQL Engine
5
[Comparison: Hive is tailored to throughput and is fault tolerant; Presto is CPU-intensive with faster response times. TD Presto has its own query retry mechanism.]
Treasure Data: Presto as a Service
6
Presto Public Release
Topics
• Challenges in providing Database as a Service
• TD Presto Connector – Optimizing Scan Performance
– Multi-tenancy Cluster Management • Resource allocation • Monitoring • Query Tuning
7
Optimizing Scan Performance
• Fully utilize the network bandwidth from S3
• TD Presto then becomes the CPU bottleneck
8
[Scan pipeline diagram: the TableScanOperator fetches the S3 file list and table schema header, then issues requests to S3 / RiakCS through a Request Queue (priority queue with a max-connections limit). Buffers are reused via release(Buffer) under a buffer size limit. An MPC1 file consists of a header (column names) followed by column blocks 0..m. HeaderReader calls back into HeaderParser, which parses the MPC file header for column block offsets and column names; ColumnBlockReader then issues column block requests. Parallel S3 reads fill MessageBuffers, decompression is handled with msgpack-java v07, and MessageUnpacker instances pull records from the buffers. GET requests are retried on 500 (internal error), 503 (slow down), 404 (not found), and eventual-consistency failures.]
• msgpack-java v06 was the bottleneck – Inefficient buffer access
• v07: fast memory access via sun.misc.Unsafe – Direct access to heap memory – Extracts primitive-type values from byte[] with a cast – No boxing
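As a rough illustration of the Unsafe-based approach (a minimal sketch, not msgpack-java's actual code; class and method names are invented), an int can be extracted directly from a byte[] without boxing or per-byte reads:

```java
import java.lang.reflect.Field;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

// Minimal sketch: read a big-endian int directly out of a byte[] via
// sun.misc.Unsafe. Names are illustrative, not msgpack-java's own.
public class UnsafeIntReader {
    private static final Unsafe UNSAFE;
    private static final long BYTE_ARRAY_OFFSET;

    static {
        try {
            // The usual back door to the Unsafe singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Read 4 bytes at the given index as a big-endian int
    // (MessagePack's wire order), in a single memory access.
    public static int getInt(byte[] buf, int index) {
        int v = UNSAFE.getInt(buf, BYTE_ARRAY_OFFSET + index);
        return ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                ? Integer.reverseBytes(v) : v;
    }

    public static void main(String[] args) {
        byte[] buf = {0x00, 0x00, 0x01, 0x02};
        System.out.println(getInt(buf, 0)); // 0x00000102 = 258
    }
}
```

One four-byte load plus a byte swap replaces four bounds-checked array reads and the shift/or chain.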
9
Unsafe memory access performance is comparable to C
• http://frsyuki.hatenablog.com/entry/2014/03/12/155231
10
Why is ByteBuffer slow?
• It follows good programming manners – Define an interface, then implement classes
• The ByteBuffer interface has HeapByteBuffer and DirectByteBuffer implementations
• In reality, TypeProfile slows down method access – The JVM generates a look-up table of method implementations – Simply loading more than one implementation class generates a TypeProfile
• v07 avoids TypeProfile generation – It loads a single implementation class through reflection
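The trick can be sketched as follows (hypothetical classes, not v07's actual code): exactly one implementation is ever loaded, chosen through reflection at startup, so calls through the interface stay monomorphic and no polymorphic TypeProfile is built:

```java
// Hypothetical sketch of the v07 trick: select exactly one implementation
// class via reflection, so the JIT only ever profiles a single receiver
// type at Buffer call sites.
interface Buffer {
    int getInt(int index);
}

final class SafeBuffer implements Buffer {
    private final byte[] data;
    SafeBuffer(byte[] data) { this.data = data; }
    // Plain big-endian read; an Unsafe-backed variant could exist as an
    // alternative class that is simply never loaded on this path.
    public int getInt(int i) {
        return ((data[i] & 0xff) << 24) | ((data[i + 1] & 0xff) << 16)
             | ((data[i + 2] & 0xff) << 8) | (data[i + 3] & 0xff);
    }
}

final class BufferFactory {
    private static final Class<?> IMPL;
    static {
        try {
            // Only this class is linked in; choosing the other implementation
            // would also happen here, once, at startup.
            IMPL = Class.forName("SafeBuffer");
        } catch (ClassNotFoundException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    static Buffer wrap(byte[] data) throws Exception {
        return (Buffer) IMPL.getDeclaredConstructor(byte[].class).newInstance(data);
    }
}
```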
11
Format Type Detection
• MessageUnpacker – reads a 1-byte prefix – detects the format type
• switch-case – ANTLR generates this type of code
12
Format Type Detection
• Using cache-efficient lookup table: 20000x faster
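A sketch of the lookup-table approach (a simplified stand-in for MessagePack's full format list; the enum covers only a few format families):

```java
// Prefix-byte format detection with a precomputed 256-entry lookup table:
// the hot path is a single array load instead of a branch chain.
enum Format { POSFIXINT, FIXMAP, FIXARRAY, FIXSTR, NEGFIXINT, OTHER }

final class FormatTable {
    private static final Format[] TABLE = new Format[256];
    static {
        // Pay the classification cost once, at class-load time.
        for (int b = 0; b < 256; b++) TABLE[b] = classify(b);
    }

    private static Format classify(int b) {
        if (b <= 0x7f) return Format.POSFIXINT;   // 0x00 - 0x7f
        if (b <= 0x8f) return Format.FIXMAP;      // 0x80 - 0x8f
        if (b <= 0x9f) return Format.FIXARRAY;    // 0x90 - 0x9f
        if (b <= 0xbf) return Format.FIXSTR;      // 0xa0 - 0xbf
        if (b >= 0xe0) return Format.NEGFIXINT;   // 0xe0 - 0xff
        return Format.OTHER;                      // 0xc0 - 0xdf: nil, bool, int, ...
    }

    // Hot path: one cache-friendly table lookup per prefix byte.
    static Format valueOf(byte prefix) {
        return TABLE[prefix & 0xff];
    }
}
```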
13
2x performance improvement in v07
14
Database As A Service
15
Claremont Report on Database Research
• Discussion on the future of DBMS – Top researchers, vendors and practitioners – CACM, Vol. 52 No. 6, 2009
• Predicts the emergence of Cloud Data Services – SQL has an important role • limited functionality • suited for service providers
– A difficult example: Spark • Needs a secure application container to run arbitrary Scala code
16
Beckman Report on Database Research
• 2013 – http://beckman.cs.wisc.edu/beckman-report2013.pdf – Topics in Big Data
• End-to-end service – From data collection to knowledge
• Cloud services have become popular – IaaS, PaaS, SaaS – The challenge is to migrate all DBMS functionality into the Cloud
17
Big Data Simplified: The Treasure Data Approach
[Diagram: multi-structured events (register, login, start_event, purchase, etc.) are collected from app servers, mobile SDKs, the web SDK, embedded SDKs, and server-side agents (Treasure Agent) into an infinite and economical cloud data store (app log data, mobile event data, sensor data, telemetry). SQL-based ad-hoc queries and SQL-based dashboards push results to DBs & data marts and other apps, through a familiar, table-oriented interface.]
18
Challenges in Database as a Service
• Tradeoffs – Cost vs. service level objectives (SLOs)
• Reference – Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD 2013 Tutorial]
19
– Run each query set on an independent cluster: fast, but $$$
– Run all queries together on the smallest possible cluster: reasonable price, but only a limited performance guarantee
Shift of Presto Query Usage
• Initial phase – Trial and error with queries
• Many syntax errors, semantic errors
• Next phase – Scheduled query execution
• Increased Presto query usage – Some customers submit more than 1,000 Presto queries / day
– Establishing typical query patterns • hourly, daily reports • query templates
• Advanced phase: More elaborate data analysis – Complex queries
• via data scientists and data analysts – High resource usage
20
Usage Shift: Simple to Complex queries
21
Monitoring Presto Usage with Fluentd
22
Hive
Presto
DataDog
• Monitoring CPU, memory and network usage • Query stats
23
Query Collection in TD
• SQL query logs – query, detailed query plan, elapsed time, processed rows, etc.
• Presto is used for analyzing the query history
24
Daily/Hourly Query Usage
25
Query Running Time
• More than 90% of queries finish within 2 min. ≒ the expected response time for interactive queries
26
Processed Rows of Queries
27
Performance
• Processed rows / sec. of a query
28
Collecting Recoverable Error Patterns
• Presto has no fault tolerance
• Error types
– User errors
• Syntax errors – SQL syntax, missing functions
• Semantic errors – missing tables/columns
– Insufficient resources • Exceeded task memory size
– Internal failures • I/O errors – S3/Riak CS • Worker failures • etc.
29
TD Presto retries these queries
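A hedged sketch of such a retry policy (illustrative only, not TD's implementation): only internal failures are retried; user errors and resource errors fail fast:

```java
import java.util.concurrent.Callable;
import java.util.function.Function;

// Illustrative retry wrapper: an exception classifier decides whether a
// failure is recoverable (internal) or should surface immediately.
enum ErrorKind { USER, INSUFFICIENT_RESOURCE, INTERNAL }

final class QueryRetry {
    static <T> T runWithRetry(Callable<T> query,
                              Function<Exception, ErrorKind> classify,
                              int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                // Syntax/semantic/resource errors, or exhausted retries: fail fast.
                if (classify.apply(e) != ErrorKind.INTERNAL || attempt >= maxRetries)
                    throw e;
                // Internal failure (I/O error, worker failure, ...): retry.
            }
        }
    }
}
```

In practice a real policy would also back off between attempts and distinguish idempotent queries.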
Query Retry on Internal Errors
• More than 99.8% of queries finish without errors
30
Query Retry on Internal Errors (log scale)
• Queries succeed eventually
31
Multi-tenancy: Resource Allocation
• Price-plan based resource allocation
• Parameters – The number of worker nodes to use (min-candidates) – The number of hash partitions (initial-hash-partitions) – The maximum number of running tasks per account
• If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
• Presto: SqlQueryExecution class – Controls the query execution state: planning -> running -> finished
• No resource allocation policy
– The extended TDSqlQueryExecution class monitors running tasks and limits resource usage
• Rewrites SqlQueryExecutionFactory at run-time using the ASM library
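The queueing behavior can be sketched with a fair semaphore (a hypothetical stand-in for the extended execution logic; the class name is invented):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Minimal sketch of per-account admission control: when an account's
// running queries reach its allowed limit, new queries wait in FIFO order.
final class AccountQueryGate {
    private final Semaphore slots;

    AccountQueryGate(int maxConcurrentQueries) {
        // fair = true gives FIFO queueing for waiting queries.
        slots = new Semaphore(maxConcurrentQueries, true);
    }

    <T> T run(Callable<T> query) throws Exception {
        slots.acquire();          // blocks (queues) while the account is at its limit
        try {
            return query.call();
        } finally {
            slots.release();      // free the slot for the next queued query
        }
    }
}
```

A real system would hold one gate per account and track task-level resources, not just query counts.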
32
Query Queue
• Presto 0.97 – Introduces per-user query queues
• Can limit the number of concurrent queries per user
• Problem – Running too many queries delays overall query performance
33
Customer Feedback
• Feedback: – We don't care if large queries take a long time – But interactive queries should run immediately
• Challenges
– How do we allocate resources when preceding queries occupy the customer's share of resources?
– How do we know whether a submitted query is an interactive one?
34
Admission control is necessary
• Adjust resource utilization – Running Drivers (Splits) – MPL (Multi-Programming Level)
35
Challenge: Auto Scaling
• Setting the cluster size based on the peak usage is expensive
• But predicting customer usage is difficult
36
Typical Query Patterns [Li Juang]
• Q: What are the typical queries of a customer? – Customers feel some queries are slow – But we don't know what to compare them with, except scheduled queries
• Approach: Clustering customer SQLs • TF/IDF measure: TF x IDF vector
– Split SQL statements into tokens – Term frequency (TF) = the number of occurrences of each term in a query – Inverse document frequency (IDF) = log(# of queries / # of queries that contain the token)
• k-means clustering – TF/IDF vector – Generates clusters of similar queries
• x-means clustering for deciding number of clusters automatically – D. Pelleg [ICML2000]
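A compact sketch of the TF/IDF computation described above (the tokenizer here is a naive non-word split rather than a real SQL lexer, and the k-means/x-means step is omitted):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Sketch: turn each SQL query into a sparse TF x IDF vector,
// ready to feed into k-means clustering.
final class TfIdf {
    static List<Map<String, Double>> vectors(List<String> queries) {
        // Tokenize each query (naively) into lowercase terms.
        List<List<String>> docs = new ArrayList<>();
        for (String q : queries)
            docs.add(Arrays.asList(q.toLowerCase().split("\\W+")));

        // Document frequency: in how many queries each token appears.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d))
                df.merge(t, 1, Integer::sum);

        int n = docs.size();
        List<Map<String, Double>> out = new ArrayList<>();
        for (List<String> d : docs) {
            // Term frequency within this query.
            Map<String, Double> tf = new HashMap<>();
            for (String t : d) tf.merge(t, 1.0, Double::sum);
            // Weight: TF * log(#queries / #queries containing the token).
            Map<String, Double> v = new HashMap<>();
            for (Map.Entry<String, Double> e : tf.entrySet())
                v.put(e.getKey(),
                      e.getValue() * Math.log((double) n / df.get(e.getKey())));
            out.add(v);
        }
        return out;
    }
}
```

Tokens shared by every query (SELECT, FROM, ...) get IDF 0 and drop out, so clusters form around the distinguishing parts of each query.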
37
Problematic Queries
• 90% of queries finish within 2 min. – But the remaining 10% is still a lot
• 10% of 10,000 queries is 1,000
• Long-running queries • Hog queries
38
Long Running Queries
• Typical bottlenecks – Cross joins – IN (a, b, c, …)
• semi-join filtering is slow – Complex scan conditions
• pushing down selections • but this delays the column scan
– Tuple materialization • the coordinator generates JSON data
– Many aggregation columns • group by 1, 2, 3, 4, 5, 6, …
– Full scans • Scanning 100 billion rows…
• Adding more resources does not always make a query faster • Storing intermediate data to disk is necessary
39
[Diagram: results are buffered (waiting for fetch); a single slow process stalls the pipeline while the other stages are fast.]
Hog Query
• Queries consuming a lot of CPU/memory resources – Coined in S. Krompass et al. [EDBT2009]
• Example:
– select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all …
– (up to 190 days)
• More than 1000 query stages. • Presto tries to run all of the stages at once.
– High CPU usage at coordinator
40
Query Rewriting? Plan Optimization?
• Query rewriting (better) – With GROUP BY and window functions – Not a perfect solution
• Need to understand the meaning of the query • Semantic changes are not allowed
– e.g., we cannot rewrite UNION to UNION ALL – UNION includes duplicate elimination
• Workaround idea – Bushy plan -> deep plan
– Introduce stage-wise resource assignment
41
Future Work
• Reducing Queuing/Response Time – Introducing shared queue between customers
• For utilizing remaining cluster resources – Fair-Scheduling: C. Gupata [EDBT2009] – Self-tuning DBMS. S. Chaudhuri [VLDB2007]
• Adjusting running query size (hard) – Keep driver resources as small as possible for hog queries – Query-plan-based cost estimation
• Predicting Query Running Time – J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
42
Summary: Treasures in Treasure Data
• Treasures for our customers – Data collected by fluentd (td-agent) – A query analysis platform – Query results = value
• For Treasure Data – SQL query logs
• Stored in Treasure Data
– We know how customers use SQL • Typical queries and failures
– We know which parts of a query can be improved
43