Internals of Presto Service
Taro L. Saito, Treasure Data [email protected] March 11-12th, 2015 Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
• 2007 University of Tokyo. Ph.D. – XML DBMS, Transaction Processing
• Relational-Style XML Query [SIGMOD 2008]
• ~ 2014 Assistant Professor at University of Tokyo – Genome Science Research
• Distributed Computing, Personal Genome Analysis
• March 2014 ~ Treasure Data – Software Engineer, MPP Team Leader
• Open source projects at GitHub – snappy-java, msgpack-java, sqlite-jdbc – sbt-pack, sbt-sonatype, larray – silk
• Distributed workflow engine
2
[Architecture diagram: the TD API / Web Console routes batch queries to Hive and interactive queries to Presto; Presto reads PlazmaDB (MessagePack columnar storage) through the td-presto connector.]
What is Presto?
• A distributed SQL engine developed by Facebook – For interactive analysis on peta-scale datasets
• Developed as a replacement for Hive – Nov. 2013: Open sourced on GitHub
• Presto – Written in Java – In-memory query layer – CPU-efficient for ad-hoc analysis – Based on ANSI SQL
– Isolates the query layer from the storage access layer • A connector provides data access (reading schemas and records)
4
Presto: Distributed SQL Engine
5
[Comparison: Hive is tailored to throughput and is fault tolerant; Presto is CPU-intensive with faster response times. TD Presto has its own query retry mechanism.]
Treasure Data: Presto as a Service
6
Presto Public Release
Topics
• Challenges in providing Database as a Service
• TD Presto Connector – Optimizing Scan Performance
– Multi-tenancy Cluster Management • Resource allocation • Monitoring • Query Tuning
7
Optimizing Scan Performance
• Fully utilize the network bandwidth from S3
• TD Presto then becomes the CPU bottleneck
8
[Scan pipeline diagram: the TableScanOperator fetches the S3 file list and table schema header, then issues requests to S3 / RiakCS through a Request Queue (priority queue with a max-connections limit). Buffers are reused via release(Buffer) under a buffer size limit. An MPC1 file consists of a header (column names) followed by column blocks 0..m. HeaderReader calls back into HeaderParser, which parses the MPC file header for column block offsets and column names; ColumnBlockReader then issues column block requests. Parallel S3 reads fill MessageBuffers, decompression is handled with msgpack-java v07, and MessageUnpacker instances pull records from the buffers. GET requests are retried on 500 (internal error), 503 (slow down), 404 (not found), and eventual-consistency failures.]
• msgpack-java v06 was the bottleneck – Inefficient buffer access
• v07: fast memory access via sun.misc.Unsafe – Direct access to heap memory – Extracts primitive-type values from byte[] with a cast – No boxing
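As a rough illustration of the Unsafe-based approach (a minimal sketch, not msgpack-java's actual code; class and method names are invented), an int can be extracted directly from a byte[] without boxing or per-byte reads:

```java
import java.lang.reflect.Field;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

// Minimal sketch: read a big-endian int directly out of a byte[] via
// sun.misc.Unsafe. Names are illustrative, not msgpack-java's own.
public class UnsafeIntReader {
    private static final Unsafe UNSAFE;
    private static final long BYTE_ARRAY_OFFSET;

    static {
        try {
            // The usual back door to the Unsafe singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Read 4 bytes at the given index as a big-endian int
    // (MessagePack's wire order), in a single memory access.
    public static int getInt(byte[] buf, int index) {
        int v = UNSAFE.getInt(buf, BYTE_ARRAY_OFFSET + index);
        return ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                ? Integer.reverseBytes(v) : v;
    }

    public static void main(String[] args) {
        byte[] buf = {0x00, 0x00, 0x01, 0x02};
        System.out.println(getInt(buf, 0)); // 0x00000102 = 258
    }
}
```

One four-byte load plus a byte swap replaces four bounds-checked array reads and the shift/or chain.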
9
Unsafe memory access performance is comparable to C
• http://frsyuki.hatenablog.com/entry/2014/03/12/155231
10
Why is ByteBuffer slow?
• It follows good programming manners – Define an interface, then implement classes
• The ByteBuffer interface has HeapByteBuffer and DirectByteBuffer implementations
• In reality, TypeProfile slows down method access – The JVM generates a look-up table of method implementations – Simply loading more than one implementation class generates a TypeProfile
• v07 avoids TypeProfile generation – It loads a single implementation class through reflection
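The trick can be sketched as follows (hypothetical classes, not v07's actual code): exactly one implementation is ever loaded, chosen through reflection at startup, so calls through the interface stay monomorphic and no polymorphic TypeProfile is built:

```java
// Hypothetical sketch of the v07 trick: select exactly one implementation
// class via reflection, so the JIT only ever profiles a single receiver
// type at Buffer call sites.
interface Buffer {
    int getInt(int index);
}

final class SafeBuffer implements Buffer {
    private final byte[] data;
    SafeBuffer(byte[] data) { this.data = data; }
    // Plain big-endian read; an Unsafe-backed variant could exist as an
    // alternative class that is simply never loaded on this path.
    public int getInt(int i) {
        return ((data[i] & 0xff) << 24) | ((data[i + 1] & 0xff) << 16)
             | ((data[i + 2] & 0xff) << 8) | (data[i + 3] & 0xff);
    }
}

final class BufferFactory {
    private static final Class<?> IMPL;
    static {
        try {
            // Only this class is linked in; choosing the other implementation
            // would also happen here, once, at startup.
            IMPL = Class.forName("SafeBuffer");
        } catch (ClassNotFoundException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    static Buffer wrap(byte[] data) throws Exception {
        return (Buffer) IMPL.getDeclaredConstructor(byte[].class).newInstance(data);
    }
}
```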
11
Format Type Detection
• MessageUnpacker – reads a 1-byte prefix – detects the format type
• switch-case – ANTLR generates this type of code
12
Format Type Detection
• Using cache-efficient lookup table: 20000x faster
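A sketch of the lookup-table approach (a simplified stand-in for MessagePack's full format list; the enum covers only a few format families):

```java
// Prefix-byte format detection with a precomputed 256-entry lookup table:
// the hot path is a single array load instead of a branch chain.
enum Format { POSFIXINT, FIXMAP, FIXARRAY, FIXSTR, NEGFIXINT, OTHER }

final class FormatTable {
    private static final Format[] TABLE = new Format[256];
    static {
        // Pay the classification cost once, at class-load time.
        for (int b = 0; b < 256; b++) TABLE[b] = classify(b);
    }

    private static Format classify(int b) {
        if (b <= 0x7f) return Format.POSFIXINT;   // 0x00 - 0x7f
        if (b <= 0x8f) return Format.FIXMAP;      // 0x80 - 0x8f
        if (b <= 0x9f) return Format.FIXARRAY;    // 0x90 - 0x9f
        if (b <= 0xbf) return Format.FIXSTR;      // 0xa0 - 0xbf
        if (b >= 0xe0) return Format.NEGFIXINT;   // 0xe0 - 0xff
        return Format.OTHER;                      // 0xc0 - 0xdf: nil, bool, int, ...
    }

    // Hot path: one cache-friendly table lookup per prefix byte.
    static Format valueOf(byte prefix) {
        return TABLE[prefix & 0xff];
    }
}
```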
13
2x performance improvement in v07
14
Database As A Service
15
Claremont Report on Database Research
• Discussion on the future of DBMS – Top researchers, vendors and practitioners – CACM, Vol. 52 No. 6, 2009
• Predicts the emergence of Cloud Data Services – SQL has an important role • limited functionality • suited for service providers
– A difficult example: Spark • Needs a secure application container to run arbitrary Scala code
16
Beckman Report on Database Research
• 2013 – http://beckman.cs.wisc.edu/beckman-report2013.pdf – Topics in Big Data
• End-to-end service – From data collection to knowledge
• Cloud services have become popular – IaaS, PaaS, SaaS – The challenge is to migrate all DBMS functionality into the Cloud
17
Big Data Simplified: The Treasure Data Approach
[Diagram: multi-structured events (register, login, start_event, purchase, etc.) are collected from app servers, mobile SDKs, the web SDK, embedded SDKs, and server-side agents (Treasure Agent) into an infinite and economical cloud data store (app log data, mobile event data, sensor data, telemetry). SQL-based ad-hoc queries and SQL-based dashboards push results to DBs & data marts and other apps, through a familiar, table-oriented interface.]
18
Challenges in Database as a Service
• Tradeoffs – Cost vs. service level objectives (SLOs)
• Reference – Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD 2013 Tutorial]
19
– Run each query set on an independent cluster: fast, but $$$
– Run all queries together on the smallest possible cluster: reasonable price, but only a limited performance guarantee
Shift of Presto Query Usage
• Initial phase – Trial and error with queries
• Many syntax errors, semantic errors
• Next phase – Scheduled query execution
• Increased Presto query usage – Some customers submit more than 1,000 Presto queries / day
– Establishing typical query patterns • hourly, daily reports • query templates
• Advanced phase: More elaborate data analysis – Complex queries
• via data scientists and data analysts – High resource usage
20
Usage Shift: Simple to Complex queries
21
Monitoring Presto Usage with Fluentd
22
Hive
Presto
DataDog
• Monitoring CPU, memory and network usage • Query stats
23
Query Collection in TD
• SQL query logs – query, detailed query plan, elapsed time, processed rows, etc.
• Presto is used for analyzing the query history
24
Daily/Hourly Query Usage
25
Query Running Time
• More than 90% of queries finish within 2 min. ≒ the expected response time for interactive queries
26
Processed Rows of Queries
27
Performance
• Processed rows / sec. of a query
28
Collecting Recoverable Error Patterns
• Presto has no fault tolerance
• Error types
– User errors
• Syntax errors – SQL syntax, missing functions
• Semantic errors – missing tables/columns
– Insufficient resources • Exceeded task memory size
– Internal failures • I/O errors – S3/Riak CS • Worker failures • etc.
29
TD Presto retries these queries
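A hedged sketch of such a retry policy (illustrative only, not TD's implementation): only internal failures are retried; user errors and resource errors fail fast:

```java
import java.util.concurrent.Callable;
import java.util.function.Function;

// Illustrative retry wrapper: an exception classifier decides whether a
// failure is recoverable (internal) or should surface immediately.
enum ErrorKind { USER, INSUFFICIENT_RESOURCE, INTERNAL }

final class QueryRetry {
    static <T> T runWithRetry(Callable<T> query,
                              Function<Exception, ErrorKind> classify,
                              int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return query.call();
            } catch (Exception e) {
                // Syntax/semantic/resource errors, or exhausted retries: fail fast.
                if (classify.apply(e) != ErrorKind.INTERNAL || attempt >= maxRetries)
                    throw e;
                // Internal failure (I/O error, worker failure, ...): retry.
            }
        }
    }
}
```

In practice a real policy would also back off between attempts and distinguish idempotent queries.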
Query Retry on Internal Errors
• More than 99.8% of queries finish without errors
30
Query Retry on Internal Errors (log scale)
• Queries succeed eventually
31
Multi-tenancy: Resource Allocation
• Price-plan based resource allocation
• Parameters – The number of worker nodes to use (min-candidates) – The number of hash partitions (initial-hash-partitions) – The maximum number of running tasks per account
• If running queries exceed the allowed number of tasks, subsequent queries must wait (queued)
• Presto: SqlQueryExecution class – Controls the query execution state: planning -> running -> finished
• No resource allocation policy
– The extended TDSqlQueryExecution class monitors running tasks and limits resource usage
• Rewrites SqlQueryExecutionFactory at run-time using the ASM library
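The queueing behavior can be sketched with a fair semaphore (a hypothetical stand-in for the extended execution logic; the class name is invented):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Minimal sketch of per-account admission control: when an account's
// running queries reach its allowed limit, new queries wait in FIFO order.
final class AccountQueryGate {
    private final Semaphore slots;

    AccountQueryGate(int maxConcurrentQueries) {
        // fair = true gives FIFO queueing for waiting queries.
        slots = new Semaphore(maxConcurrentQueries, true);
    }

    <T> T run(Callable<T> query) throws Exception {
        slots.acquire();          // blocks (queues) while the account is at its limit
        try {
            return query.call();
        } finally {
            slots.release();      // free the slot for the next queued query
        }
    }
}
```

A real system would hold one gate per account and track task-level resources, not just query counts.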
32
Query Queue
• Presto 0.97 – Introduces per-user query queues
• Can limit the number of concurrent queries per user
• Problem – Running too many queries delays overall query performance
33
Customer Feedback
• Feedback: – We don't care if large queries take a long time – But interactive queries should run immediately
• Challenges
– How do we allocate resources when preceding queries occupy the customer's share of resources?
– How do we know whether a submitted query is an interactive one?
34
Admission control is necessary
• Adjust resource utilization – Running Drivers (Splits) – MPL (Multi-Programming Level)
35
Challenge: Auto Scaling
• Setting the cluster size based on the peak usage is expensive
• But predicting customer usage is difficult
36
Typical Query Patterns [Li Juang]
• Q: What are the typical queries of a customer? – Customers feel some queries are slow – But we don't know what to compare them with, except scheduled queries
• Approach: Clustering customer SQLs • TF/IDF measure: TF x IDF vector
– Split SQL statements into tokens – Term frequency (TF) = the number of occurrences of each term in a query – Inverse document frequency (IDF) = log(# of queries / # of queries that contain the token)
• k-means clustering – TF/IDF vector – Generates clusters of similar queries
• x-means clustering for deciding number of clusters automatically – D. Pelleg [ICML2000]
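A compact sketch of the TF/IDF computation described above (the tokenizer here is a naive non-word split rather than a real SQL lexer, and the k-means/x-means step is omitted):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Sketch: turn each SQL query into a sparse TF x IDF vector,
// ready to feed into k-means clustering.
final class TfIdf {
    static List<Map<String, Double>> vectors(List<String> queries) {
        // Tokenize each query (naively) into lowercase terms.
        List<List<String>> docs = new ArrayList<>();
        for (String q : queries)
            docs.add(Arrays.asList(q.toLowerCase().split("\\W+")));

        // Document frequency: in how many queries each token appears.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d))
                df.merge(t, 1, Integer::sum);

        int n = docs.size();
        List<Map<String, Double>> out = new ArrayList<>();
        for (List<String> d : docs) {
            // Term frequency within this query.
            Map<String, Double> tf = new HashMap<>();
            for (String t : d) tf.merge(t, 1.0, Double::sum);
            // Weight: TF * log(#queries / #queries containing the token).
            Map<String, Double> v = new HashMap<>();
            for (Map.Entry<String, Double> e : tf.entrySet())
                v.put(e.getKey(),
                      e.getValue() * Math.log((double) n / df.get(e.getKey())));
            out.add(v);
        }
        return out;
    }
}
```

Tokens shared by every query (SELECT, FROM, ...) get IDF 0 and drop out, so clusters form around the distinguishing parts of each query.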
37
Problematic Queries
• 90% of queries finish within 2 min. – But the remaining 10% is still a lot
• 10% of 10,000 queries is 1,000
• Long-running queries • Hog queries
38
Long Running Queries
• Typical bottlenecks – Cross joins – IN (a, b, c, …)
• semi-join filtering is slow – Complex scan conditions
• pushing down selections • but this delays the column scan
– Tuple materialization • the coordinator generates JSON data
– Many aggregation columns • group by 1, 2, 3, 4, 5, 6, …
– Full scans • Scanning 100 billion rows…
• Adding more resources does not always make a query faster • Storing intermediate data to disk is necessary
39
[Diagram: results are buffered (waiting for fetch); a single slow process stalls the pipeline while the other stages are fast.]
Hog Query
• Queries consuming a lot of CPU/memory resources – Coined in S. Krompass et al. [EDBT2009]
• Example:
– select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all …
– (up to 190 days)
• More than 1000 query stages. • Presto tries to run all of the stages at once.
– High CPU usage at coordinator
40
Query Rewriting? Plan Optimization?
• Query rewriting (better) – With GROUP BY and window functions – Not a perfect solution
• Need to understand the meaning of the query • Semantic changes are not allowed
– e.g., we cannot rewrite UNION to UNION ALL – UNION includes duplicate elimination
• Workaround idea – Bushy plan -> deep plan
– Introduce stage-wise resource assignment
41
Future Work
• Reducing Queuing/Response Time – Introducing shared queue between customers
• For utilizing remaining cluster resources – Fair-Scheduling: C. Gupata [EDBT2009] – Self-tuning DBMS. S. Chaudhuri [VLDB2007]
• Adjusting running query size (hard) – Keep driver resources as small as possible for hog queries – Query-plan-based cost estimation
• Predicting Query Running Time – J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
42
Summary: Treasures in Treasure Data
• Treasures for our customers – Data collected by fluentd (td-agent) – A query analysis platform – Query results = value
• For Treasure Data – SQL query logs
• Stored in Treasure Data
– We know how customers use SQL • Typical queries and failures
– We know which parts of a query can be improved
43