BDTC 2013 Beijing China
Running Cloudera Impala on PostgreSQL
By Chengzhong Liu
liuchengzhong@miaozhen.com
2013.12
Where the story comes from…
• Data gravity
• Why big data
• Why SQL on big data
Today's agenda
• Big data in Miaozhen 秒针系统
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
What happened at Miaozhen
• 3 billion ad impressions per day
• 20 TB data scan for report generation every morning
• 24-server cluster
• Besides this
– TV Monitor
– Mobile Monitor
– Site Monitor
– …
Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for a 2 TB data scan
• Mobile Monitor
– Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– Wrote a MapReduce in C++
– Handles 30 million to 500 million ad impressions
Problems & opportunities
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL interface
SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal
[Diagram: HDFS; MapReduce; Hive, Pig; Impala / Drill / Pivotal / Presto]
Latency matters
What’s this
• A kind of MPP engine
• In memory processing
• Small-to-big join
– Broadcast join
• Small result size
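A broadcast join for the small-to-big case can be sketched roughly as follows. This is a single-process illustration, not Impala's implementation: the function name, the table contents, and the use of lists to stand in for nodes are all assumptions made for the example.

```python
from collections import defaultdict

def broadcast_hash_join(big_partitions, small_table, big_key, small_key):
    """Join each partition of the big table against a full copy of the small table."""
    # Build the hash table once from the small side. In a real cluster this
    # copy would be shipped ("broadcast") to every node.
    build = defaultdict(list)
    for row in small_table:
        build[row[small_key]].append(row)

    results = []
    for partition in big_partitions:       # one partition per simulated node
        for row in partition:              # the big side is probed where it lives
            for match in build.get(row[big_key], []):
                results.append({**row, **match})
    return results

# Hypothetical data: a small campaign table and a partitioned impression table.
campaigns = [
    {"campaign_id": 1, "name": "spring"},
    {"campaign_id": 2, "name": "summer"},
]
impressions = [
    [{"id": "a", "campaign_id": 1}, {"id": "b", "campaign_id": 2}],  # node 0
    [{"id": "c", "campaign_id": 1}, {"id": "d", "campaign_id": 3}],  # node 1
]
joined = broadcast_hash_join(impressions, campaigns, "campaign_id", "campaign_id")
```

Only the small table moves over the network in this scheme, which is why it suits small-to-big joins and becomes expensive when both sides are large.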
Why Cloudera Impala
• The team moves fast
– UDFs coming out
– Better join strategy on the way
• Good code base
– Modular
– Easy to add subclasses
• Really fast
– LLVM code generation
• 80s/95s – UV test
– Distributed aggregation tree
– In-situ data processing (inside storage)
Typical architecture
[Diagram: a SQL interface and a meta store, plus three identical nodes, each with a Query Planner, a Coordinator, and an Exec Engine]
Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Speed
• A mixed-data-source MPP query engine
– Join two tables in different sources
– In fact…
Hacking… where to start
• Add, not change
– Scan Node type
– DB Meta info
• Put changes in configuration
– Thrift Protocol update
• TDBHostInfo
• TDBScanNode
Front end
• Meta store update
– Link data to the table name
– Table location management
• Front end
– Compute table location
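The front-end idea of linking a table name to the hosts that store it can be pictured with a minimal sketch. Everything here is hypothetical: the `DbHost` fields, the `TABLE_LOCATIONS` registry, and the host names are assumptions for illustration, and the real change is expressed in Impala's front end and Thrift definitions rather than in Python.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DbHost:
    # Hypothetical fields; loosely inspired by the idea of a DB host descriptor.
    host: str
    port: int
    dbname: str

# Hypothetical meta store: table name -> PostgreSQL hosts holding its shards.
TABLE_LOCATIONS = {
    "impressions": [DbHost("pg-1", 5432, "ads"), DbHost("pg-2", 5432, "ads")],
}

def compute_table_location(table):
    """Return one scan assignment per host that stores part of the table."""
    try:
        hosts = TABLE_LOCATIONS[table]
    except KeyError:
        raise LookupError(f"no location registered for table {table!r}") from None
    return [{"table": table, "host": h} for h in hosts]

plan = compute_table_location("impressions")
```

Each entry in `plan` would correspond to one scan node instance that the coordinator places on, or next to, the named PostgreSQL host.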
Back end
• Coordinator
– pg host
• New scan node type
– db scan node
• Pg scan node
• Psql library using cursor
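Reading through a cursor lets a scan node pull rows in bounded batches instead of materializing the whole result. The sketch below shows that pattern only; it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL (an assumption made so the example runs anywhere), and `scan_in_batches` is a hypothetical helper name.

```python
import sqlite3

def scan_in_batches(conn, query, batch_size):
    """Yield lists of rows, fetching at most batch_size rows at a time."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:          # an empty list means the cursor is exhausted
            break
        yield batch
    cur.close()

# Stand-in data source: an in-memory table with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(f"id{i}", "c1") for i in range(10)]
)

batches = list(scan_in_batches(conn, "SELECT id, campaign FROM t", batch_size=4))
batch_sizes = [len(b) for b in batches]
```

Each batch maps naturally onto the row batches an execution engine passes between plan nodes, so memory use stays bounded by the batch size.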
SQL Plan
• select count(distinct id) from table
– MapReduce-like process
• Plan, from root to leaf:
– Aggr.: sum(count(id))
– Exchange node
– Aggr.: count(id)
– Aggr.: group by id
– Exchange node
– Aggr.: group by id
– HDFS/PG scan
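The MapReduce-like shape of a distributed count(distinct id) can be imitated in a few lines. This is a single-process sketch under assumptions: lists stand in for scan nodes, `num_nodes` is a hypothetical parameter, and hashing with Python's `hash` stands in for the exchange step.

```python
from collections import defaultdict

def count_distinct(partitions, num_nodes=2):
    # Phase 1: each scan node groups locally by id, dropping local duplicates.
    local_groups = [set(p) for p in partitions]

    # Exchange: route each id by hash so equal ids land on the same node,
    # where the second "group by id" merges them.
    shuffled = defaultdict(set)
    for group in local_groups:
        for value in group:
            shuffled[hash(value) % num_nodes].add(value)

    # Phase 2: each node counts its now-distinct ids, i.e. count(id).
    partial_counts = [len(ids) for ids in shuffled.values()]

    # Final exchange to the coordinator, which sums the partial counts.
    return sum(partial_counts)

partitions = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
total = count_distinct(partitions)
```

Because every id is routed to exactly one node, the partial counts never overlap and a plain sum at the coordinator gives the exact distinct count.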
Env.
• Ad impression logs
– 150 million lines, 100 KB/line
• 3 servers
– 24 cores
– 32 GB memory
– 2 TB × 12 hard disks
– 100 Mbps LAN
• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'
Performance
[Chart: impala vs. hive vs. pg+impala]
• Group-by speed per core
• 20 M/s
With index
Codegen on/off
[Chart: en_codegen vs. dis_codegen]
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
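The last query selects ids that appear with both values of c. A small runnable check of that behavior follows, reading the second condition as c = '2'. It uses Python's built-in `sqlite3` rather than Impala or PostgreSQL, and the table contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("u1", "1"), ("u1", "2"),               # seen in both -> selected
        ("u2", "1"),                            # only '1'
        ("u3", "2"),                            # only '2'
        ("u4", "1"), ("u4", "1"), ("u4", "2"),  # seen in both -> selected
    ],
)

query = """
SELECT id FROM t GROUP BY id
HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
   AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
LIMIT 10
"""
ids = sorted(row[0] for row in conn.execute(query))
```

count() ignores NULLs, so each conditional count only tallies rows with the matching value of c, and the HAVING clause keeps ids where both tallies are non-zero.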
Multi-users
Conclusion
• Source quality
– Readable
– Google C++ style
– Robust
• MPP solution based on PG
– Proven performance
– Easy to scale
• Mixed engine usage
– HDFS and DB
What’s next
• YARN integration
• UDF
• Joins with big tables
• BI roadmap
• Failover
References
• Cloudera Impala online documentation and source code
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh
Thanks! Q & A