Impala benchmarks and tuning tips - HadoopCon 2014 in Taiwan
Impala Benchmarks and Tuning Tips
Simon Hsu (徐瑞興)
September 13, 2014
2HadoopCon 2014
About Me
• 徐瑞興 (Simon Hsu)
– Started working with Hadoop during M.S. studies (2010)
• “A Transparent Approach to Run MapReduce Programs on Collaborative Hadoops” – IEEE BigData 2014
– FOXCONN – RD Dept.
• Hadoop Product Development
– Etu – RD Dept.
• Hadoop Solution (Etu/Cloudera) / Product Development
Outline
• Impala Performance Tuning Tips
– “Practical Performance Analysis and Tuning for Cloudera Impala” - Greg Rahn @ Hadoop World 2013
• Impala Benchmarks
– TPC-DS Kit for Impala
Brief History of Impala
http://mt.orz.at/archives/2012/12/hadoop.html
Hive & Impala
• Impala: runs queries on an in-memory, distributed SQL query engine
• Hive: runs queries as MapReduce jobs
Impala Features
• Fast
– Low-latency response
• Bypasses the HDFS DataNode (reads directly from disk)
• Optimized for data warehouse queries (especially with Parquet)
• Easy to adopt
– Uses the same database metadata as Hive
• Benefits from existing tools such as Sqoop
– Supports common HDFS file formats
• Queries existing files on HDFS
No need to predict column lengths anymore!
Impala Overview
http://www.slideshare.net/cloudera/impala-v1update130709222849phpapp01
Impala Performance Tuning Tips
Pre-execution
• Data Types
• Partitioning
• File Format
• Compression
Query Execution
• Gather Table / Column Stats
• Join Type
• Query Profile
Overall Review
• Use Case
• Experience
http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html
Pre-execution
• Data Types
• Partitioning
• File Format
• Compression
Data Types
• Use the most appropriate data type for each column
– Avoids type-casting overhead
• Examples
– TIMESTAMP for time values
– INT for integers
• Although STRING is powerful...
Partition
• Create table partitions to reduce disk I/O
– Choose the partition key based on the common query pattern
• Partitioned by Month
• Partitioned by State
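The way partitioning reduces disk I/O can be sketched in a few lines of Python. This is only an illustration under assumptions: the directory names, file names, and the `files_to_scan` helper are hypothetical and do not come from the talk or from Impala itself. The idea it shows is that a filter on the partition key lets a scan skip every directory whose key does not match.

```python
# Hypothetical layout: one directory per partition value (names are illustrative only).
partitions = {
    "month=2014-07": ["part-0.parq", "part-1.parq"],
    "month=2014-08": ["part-0.parq", "part-1.parq"],
    "month=2014-09": ["part-0.parq"],
}


def files_to_scan(partitions, wanted_months=None):
    """Return the files a scan must read.

    With no filter on the partition key, every file is read.
    With a filter, only directories whose key value matches are read.
    """
    selected = []
    for directory, files in sorted(partitions.items()):
        month = directory.split("=", 1)[1]
        if wanted_months is None or month in wanted_months:
            selected.extend(f"{directory}/{name}" for name in files)
    return selected


full_scan = files_to_scan(partitions)
pruned_scan = files_to_scan(partitions, wanted_months={"2014-09"})
print(len(full_scan), "files read without a partition filter")
print(len(pruned_scan), "file(s) read with a filter on month")
```

The benefit depends on queries actually filtering on the partition key, which is why the key should follow the common query pattern.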
Partition Files in HDFS
[Figure: HDFS listings of table files with partitions and without partitions; callouts: Directories, Files]
Query Test in partitions
[Figure: query test results with partition and without partition]
File Format
• Text
– Default Impala table format
• Parquet
– Optimized for working with large data files
• typically 1 GB per file
– Reorganizes data for maximum performance on data warehouse-style queries
• Column-oriented binary file format
Compression
• Snappy
– Less CPU time
– Lower compression ratio
• Gzip
– More CPU time
– Higher compression ratio
• Test table
– Number of records: 183,364,043
• Test query
– [master.etu.im:21000] > SELECT COUNT(*) FROM store_sales;
• Setting the compression codec
– [master.etu.im:21000] > SET parquet.compression=[SNAPPY/GZIP/NONE/etc.]
Query time with different compression codecs
Codec     Table Size on HDFS (GB)    Query Time (s)
Snappy    9.2                        0.91
Gzip      6.8                        1.22
None      16.5                       1.21
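To make the tradeoff in these measurements easier to read, the short Python sketch below derives the compression ratio and the relative query time of each codec against the uncompressed table. The input numbers are copied from the table above; the `summarize` helper and the output layout are illustrative assumptions.

```python
# Measurements from the table above: codec -> (table size on HDFS in GB, query time in s).
results = {
    "Snappy": (9.2, 0.91),
    "Gzip": (6.8, 1.22),
    "None": (16.5, 1.21),
}

# The uncompressed table is the baseline for both ratios.
base_size, base_time = results["None"]


def summarize(results):
    """Compute size reduction and relative query time versus the uncompressed table."""
    rows = {}
    for codec, (size_gb, seconds) in results.items():
        rows[codec] = {
            "compression_ratio": round(base_size / size_gb, 2),
            "relative_time": round(seconds / base_time, 2),
        }
    return rows


summary = summarize(results)
for codec, row in summary.items():
    print(f"{codec:<7} ratio={row['compression_ratio']:.2f}x  time={row['relative_time']:.2f}x")
```

In this single COUNT(*) test, Snappy gives the shortest query time and Gzip the smallest table. Other queries and data sets may behave differently.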
Compression Codecs Differ
http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
Compression Codecs Differ (cont.)
http://www.safaribooksonline.com/library/view/strata-conference-new/9781491945551/part131.html
Query Execution
• Gather Table / Column Stats
• Join Type
• Query Profile
Usage of the EXPLAIN Statement
• Query:
– [master.etu.im:21000] > explain select * from store_sales where ss_sold_date_sk between 2451911 and 2451941 limit 10;
• Query time with partition: 0.31 s
• Query time without partition: 2.21 s
Compute Table Stats
• [master.etu.im:21000] > COMPUTE STATS customer;
• [master.etu.im:21000] > SHOW TABLE STATS customer;
Ladies and gentlemen: 2 files!
Gather Column Stats
• [master.etu.im:21000] > SHOW COLUMN STATS tpcds_parquet.customer;
Join Type
• Two Types of Join
– Broadcast Join
• Default Join. Typically, broadcast joins are more efficient in cases where one table is much smaller than the other.
– Shuffle Join
• Typically, shuffle joins are more efficient for joins between large tables of similar size.
• Join Order Optimization
– If automatic optimization is not sufficient
• consider adding the STRAIGHT_JOIN keyword immediately after SELECT
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_hints.html
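The rule of thumb for choosing between the two join types can be illustrated with a rough network-transfer estimate. The Python sketch below is a simplified model under stated assumptions, not Impala's actual planner logic: it assumes a broadcast join copies the smaller table to every node and a shuffle join sends each table across the network once. The function names, table sizes, and node count are all hypothetical.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def broadcast_cost(small_bytes, num_nodes):
    """Bytes sent when the smaller table is copied to every node (simplified)."""
    return small_bytes * num_nodes


def shuffle_cost(left_bytes, right_bytes):
    """Bytes sent when both tables are repartitioned on the join key (simplified)."""
    return left_bytes + right_bytes


def cheaper_join(left_bytes, right_bytes, num_nodes):
    """Pick the join type that moves fewer bytes under this simplified model."""
    small = min(left_bytes, right_bytes)
    if broadcast_cost(small, num_nodes) <= shuffle_cost(left_bytes, right_bytes):
        return "broadcast"
    return "shuffle"


# A large fact table joined to a small dimension table.
print(cheaper_join(100 * GB, 50 * MB, num_nodes=10))
# Two large tables of similar size.
print(cheaper_join(100 * GB, 80 * GB, num_nodes=10))
```

Under this model the first case favors a broadcast join and the second a shuffle join, which matches the guidance on the slide. A real planner relies on table and column statistics, which is one reason to run COMPUTE STATS first.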
Query Profile
• Impalad Web console http://Impalad_IP:25000/
Impala Performance Tuning Tips
Pre-execution
• Configurations Check
• Data Types
• Partitioning
• File Format
Query Execution
• Gather Table / Column Stats
• Join Type
• Query Profile
Overall Review
• Use Case
• Experience
Overall Review
• Use Case
• Experience
Use case
• Use case for partitioning
– L.T.V. (lifetime value) of online gaming
• Average days
• Average deposit
• Number of people in each interval
http://goo.gl/TPoqvk
Use case
• Use case for file format
– Improving query time at a hospital
• Query time reduced to 30%~50%
• Number of columns in each table: 40~50
• Number of records in the largest table: over 100,000,000
“Taking a rest helps you go further.”
http://goo.gl/RL6LSa
Notes on Configuration
• HDFS replication bandwidth
– dfs.datanode.balance.bandwidthPerSec
• Default value: 10 MB/s
• Memory usage of the Impala daemon
– Impala Daemon Memory Limit
• e.g. mem_limit: 80%
• Enable HDFS Short Circuit Read
– dfs.client.read.shortcircuit = true
Notes during Operations
• Preserve the Parquet block size when copying
– $ bin/hadoop distcp -pb srcPath dstPath
• CREATE EXTERNAL TABLE vs. CREATE TABLE
– Determines whether the raw data is preserved when the table is dropped
• Be careful with INSERT INTO ... VALUES ...
– Generates many small files
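The small-files warning can be made concrete with a back-of-the-envelope count. The Python sketch below is only a model under assumptions: it assumes each INSERT ... VALUES statement adds one row and writes one new data file, and that a bulk load writes files of roughly 1 GB (the typical Parquet file size mentioned earlier). The helper names, row count, and row size are hypothetical.

```python
import math

# Assumed target size for files written by a bulk load (about 1 GB).
TARGET_FILE_BYTES = 1024 ** 3


def files_from_single_row_inserts(num_rows):
    """Assume each INSERT ... VALUES statement adds one row and writes one new file."""
    return num_rows


def files_from_bulk_load(num_rows, row_bytes):
    """Assume one bulk load writes files of roughly TARGET_FILE_BYTES each."""
    return max(1, math.ceil(num_rows * row_bytes / TARGET_FILE_BYTES))


rows, row_bytes = 1_000_000, 200  # illustrative numbers only
print(files_from_single_row_inserts(rows), "files from row-by-row inserts")
print(files_from_bulk_load(rows, row_bytes), "file(s) from one bulk load")
```

With these illustrative numbers the same data ends up in a million tiny files instead of one, which is the situation the slide warns about.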
Turn Off Pretty-Printing (impala-shell -B)
Impala Benchmarks
• TPC Benchmark™ DS (TPC-DS)
– The New Decision Support Benchmark Standard
• Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data population, queries, data maintenance model and implementation rules have been designed to be broadly representative of modern decision support systems.
https://github.com/cloudera/impala-tpcds-kit
Procedure of TPC-DS Benchmark (Impala)
Preparation
• tpcds-env.sh
• hdfs-mkdirs.sh
Data Generation
• gen-dims.sh
• gen-facts.sh
Data Loading
• impala-create-external-tables.sh
• impala-load-dims.sh
• impala-load-store_sales.sh
Store Sales ER-Diagram
http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
Fact Table
Query 7 – Intro.
• Compute the average quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event.
– Restrict the results to a specific gender, marital and educational status.
select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk
  and ss_item_sk = i_item_sk
  and ss_cdemo_sk = cd_demo_sk
  and ss_promo_sk = p_promo_sk
  and cd_gender = 'F'
  and cd_marital_status = 'W'
  and cd_education_status = 'Primary'
  and (p_channel_email = 'N' or p_channel_event = 'N')
  and d_year = 1998
  and ss_sold_date_sk between 2450815 and 2451179
group by i_item_id
order by i_item_id
limit 100;
http://www.minddevelopmentanddesign.com/blog/leaving-las-vagues-or-focus-your-seo-keywords/
Conclusion
• Consider using the Parquet table format
• Compression codec tradeoffs
• Disk I/O reduction by table partitioning
• See Query profiles for more information
• Run Impala Benchmarks and enjoy yourself
– TPC-DS (Decision Support Benchmark)
318, Rueiguang Rd., Taipei 114, Taiwan
Simon Hsu – Sr. Software Engineer
0912-166-961
simonhsu@etusolution.com
Thank you