Chengzhong Liu (刘诚忠): Running Cloudera Impala on PostgreSQL

BDTC 2013 Beijing China

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu

liuchengzhong@miaozhen.com

2013.12

Where the story comes from…

• Data gravity

• Why big data

• Why SQL on big data

Today's agenda

• Big data at Miaozhen Systems (秒针系统)

• Overview of Cloudera Impala

• Hacking practice in Cloudera Impala

• Performance

• Conclusions

• Q&A

What happened at Miaozhen

• 3 billion ad impressions per day

• 20 TB data scan for report generation every morning

• 24-server cluster

• Besides this

– TV Monitor

– Mobile Monitor

– Site Monitor

– …

Before Hadoop

• Scrat – PostgreSQL 9.1 cluster

– Wrote a simple proxy

– <2s for a 2 TB data scan

• Mobile Monitor – Hadoop-like distributed computing system

– RabbitMQ + 3 computing servers

– Wrote a MapReduce implementation in C++

– Handles 30 million to 500 million ad impressions

Problems & opportunities

• Database cluster

• SQL on Hadoop

• Miscellaneous data

• Requirements

– Most data is relational

– SQL interface

SQL on Hadoop

• Google Dremel

• Apache Drill

• Cloudera Impala

• Facebook Presto

• EMC Greenplum/Pivotal

[Diagram: HDFS; MapReduce; Hive, Pig; Impala / Drill / Pivotal / Presto]

Latency matters

What’s this

• A kind of MPP engine

• In memory processing

• Small-to-big join

– Broadcast join

• Small result size
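
The small-to-big broadcast join mentioned above can be sketched roughly as follows. This is a minimal Python illustration, not Impala's implementation: the node layout, the sample rows, and the function name `broadcast_hash_join` are all assumptions made for the example. The small table is copied to every node, each node builds an in-memory hash table from it, and then probes that table with its local partition of the big table.

```python
from collections import defaultdict

def broadcast_hash_join(small_rows, big_partitions, small_key, big_key):
    """Join a small table against a partitioned big table.

    small_rows: list of dicts (the table that is broadcast to every node).
    big_partitions: list of lists of dicts, one list per (hypothetical) node.
    """
    results = []
    for partition in big_partitions:
        # Each node receives a full copy of the small table and builds
        # its own in-memory hash table on the join key.
        hash_table = defaultdict(list)
        for row in small_rows:
            hash_table[row[small_key]].append(row)
        # Probe phase: the big table never leaves the node that stores it.
        for big_row in partition:
            for small_row in hash_table.get(big_row[big_key], []):
                results.append({**small_row, **big_row})
    return results

# Hypothetical sample data: a small campaign table and impressions on two nodes.
campaigns = [{"campaign_id": 1, "name": "spring"},
             {"campaign_id": 2, "name": "autumn"}]
impressions_by_node = [
    [{"id": "a", "campaign": 1}, {"id": "b", "campaign": 2}],
    [{"id": "c", "campaign": 1}, {"id": "d", "campaign": 3}],
]
joined = broadcast_hash_join(campaigns, impressions_by_node,
                             "campaign_id", "campaign")
```

The approach only pays off while the broadcast side fits in each node's memory, which is why it is paired with "small to big" here.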

Why Cloudera Impala

• The team moves fast

– UDF coming out

– Better join strategy on the way

• Good code base

– Modularized

– Easy to add subclasses

• Really fast

– LLVM code generation

• 80s/95s – UV test

– Distributed aggregation tree

– In-situ data processing (inside storage)

Typical Arch.

[Diagram: SQL Interface, Meta Store, and three nodes, each with a Query Planner, Coordinator, and Exec Engine]

Our target

• An MPP database

– Built on PostgreSQL 9.1

– Scales well

– Speed

• A mixed-data-source MPP query engine

– Join two tables in different sources

– In fact…

Hacking… where to start

• Add, not change

– Scan Node type

– DB Meta info

• Put changes in configuration

– Thrift Protocol update

• TDBHostInfo

• TDBScanNode
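
The slides name the two new Thrift structures but do not show their fields. As a rough illustration of what such messages might carry, here is a Python sketch using dataclasses in place of Thrift definitions. Every field name, the default values, and the `conninfo` helper are hypothetical; only the struct names `TDBHostInfo` and `TDBScanNode` come from the slides.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class TDBHostInfo:
    # All fields are hypothetical; the slides only give the struct name.
    host: str
    port: int = 5432          # PostgreSQL's default port
    db_name: str = "postgres"

    def conninfo(self) -> str:
        # libpq-style keyword/value connection string.
        return f"host={self.host} port={self.port} dbname={self.db_name}"

@dataclass
class TDBScanNode:
    # Hypothetical fields: what to read and where the table's data lives.
    table_name: str
    columns: List[str]
    hosts: List[TDBHostInfo] = field(default_factory=list)

# Example plan node for a table spread over two made-up PostgreSQL hosts.
scan = TDBScanNode(
    table_name="t",
    columns=["id", "campaign"],
    hosts=[TDBHostInfo("pg-node-1"), TDBHostInfo("pg-node-2", port=5433)],
)
conninfos = [h.conninfo() for h in scan.hosts]
```

Keeping the database details in new message types rather than in existing ones matches the "add, not change" rule stated above.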

Front end

• Meta store update

– Link data to the table name

– Table location management

• Front end

– Compute table location
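
A minimal sketch of the front-end step, assuming the meta store simply maps a table name to the PostgreSQL hosts that hold its data. The dictionary contents, host names, and the function `compute_table_location` are illustrative assumptions, not the actual Impala front-end code.

```python
# Hypothetical meta store content: table name -> PostgreSQL hosts holding its data.
TABLE_LOCATIONS = {
    "t": ["pg-node-1:5432", "pg-node-2:5432", "pg-node-3:5432"],
}

def compute_table_location(table_name, locations=TABLE_LOCATIONS):
    """Return one (fragment_id, host) pair per host that stores the table."""
    try:
        hosts = locations[table_name]
    except KeyError:
        raise LookupError(f"no location registered for table {table_name!r}")
    # One scan fragment per host, numbered in registration order.
    return list(enumerate(hosts))

assignments = compute_table_location("t")
```

The planner can then hand each pair to the coordinator, which schedules one scan fragment per database host.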

Back end

• Coordinator

– PG host

• New scan node type

– DB scan node

• PG scan node

• psql library using a cursor
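
Reading through a cursor lets the scan node pull rows in fixed-size batches instead of materializing the whole result. The sketch below shows the pattern in Python, with the standard-library `sqlite3` module standing in for PostgreSQL so the example runs anywhere; against PostgreSQL the same loop would typically be driven by `DECLARE ... CURSOR` and `FETCH`. The function name `scan_in_batches`, the table layout, and the batch size are assumptions.

```python
import sqlite3

def scan_in_batches(conn, sql, batch_size):
    """Yield lists of rows, at most batch_size rows each."""
    cursor = conn.cursor()
    cursor.execute(sql)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break  # cursor exhausted
        yield batch

# Stand-in database with ten made-up impression rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(f"id{i}", i % 3) for i in range(10)])

batch_sizes = [len(b) for b in
               scan_in_batches(conn, "SELECT id, campaign FROM t", 4)]
```

Each yielded batch corresponds to one unit of rows the scan node would pass up to its parent operator.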

SQL Plan

[Plan diagram, in data-flow order: HDFS/PG scan → Aggr.: group by id → Exchange node → Aggr.: group by id → Aggr.: count(id) → Exchange node → Aggr.: sum(count(id))]

• select count(distinct id) from table

– MapReduce-like process
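
The MapReduce-like evaluation of `count(distinct id)` can be simulated in a few lines. This Python sketch assumes the usual multi-phase scheme: deduplicate locally, exchange rows by hash of `id`, deduplicate again, count per node, then sum the partial counts at the coordinator. The function name, the number of nodes, and the sample data are hypothetical.

```python
def count_distinct(partitions, num_nodes):
    """Simulate a distributed count(distinct id) over per-node id lists."""
    # Phase 1: each scan node removes its local duplicates (group by id).
    local_groups = [set(rows) for rows in partitions]

    # Exchange: route every id to one node by hash, so equal ids meet.
    # Adding to a set merges duplicates, which plays the role of the
    # second "group by id" aggregation.
    shuffled = [set() for _ in range(num_nodes)]
    for group in local_groups:
        for value in group:
            shuffled[hash(value) % num_nodes].add(value)

    # Phase 2: each node counts its ids, which are now globally unique.
    partial_counts = [len(ids) for ids in shuffled]

    # Final exchange to the coordinator, which sums the partial counts.
    return sum(partial_counts)

# Three hypothetical scan nodes with overlapping ids.
scan_partitions = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
total = count_distinct(scan_partitions, num_nodes=3)
```

Because every occurrence of a given id lands on the same node after the exchange, the per-node counts never overlap and can simply be added.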

Env.

• Ad impression logs

– 150 million lines, 100KB/line

• 3 servers

– 24 cores

– 32 GB memory

– 12 × 2 TB hard disks

– 100 Mbps LAN

• Query

– select count(id) from t group by campaign

– select count(distinct id) from t group by campaign

– select * from t where id = 'xxxxxxxx'

Performance

[Chart legend: impala, hive, pg+impala]

• Group-by speed per core

• 20 M/s

With index

Codegen on/off

[Chart legend: en_codegen, dis_codegen]

• select count(distinct id) from t group by c

• select distinct id from t

• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
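
To make the intent of the last query concrete: it returns ids that have at least one row with `c = '1'` and at least one row with `c = '2'`. The sketch below runs it on a handful of made-up rows, using Python's `sqlite3` module as a stand-in engine rather than Impala or PostgreSQL; the table contents are assumptions, and the slides do not say what column `c` represents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
# Hypothetical rows: 'a' and 'd' appear under both c = '1' and c = '2'.
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    ("a", "1"), ("a", "2"),
    ("b", "1"),
    ("c", "2"),
    ("d", "1"), ("d", "2"), ("d", "1"),
])

query = """
select id from t
group by id
having count(case when c = '1' then 1 else null end) > 0
   and count(case when c = '2' then 1 else null end) > 0
limit 10
"""
# count() ignores NULLs, so each count only sees rows matching its condition.
matching_ids = sorted(row[0] for row in conn.execute(query))
```

The `case ... else null` form works because `count` skips NULL values, so each conditional count acts as a per-group filter.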

Multi-users

Conclusion

• Source quality

– Readable

– Google C++ style

– Robust

• MPP solution based on PG

– Proven performance

– Easy to scale

• Mixed engine usage

– HDFS and DB

What’s next

• YARN integration

• UDF

• Join with big tables

• BI roadmap

• Failover

Ref.

• Cloudera Impala online documentation and source code

• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt

• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/

• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf

• @datascientist, @dongxicheng, @flyingsk, @zhh

Thanks! Q & A