Chengzhong Liu: Running Cloudera Impala on PostgreSQL

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu

liuchengzhong@miaozhen.com

2013.12

Where the story comes from…

• Data gravity

• Why big data

• Why SQL on big data

Today's agenda

• Big data at Miaozhen Systems (秒针系统)

• Overview of Cloudera Impala

• Hacking practice in Cloudera Impala

• Performance

• Conclusions

• Q&A

What happened at Miaozhen

• 3 billion ad impressions per day

• 20 TB of data scanned for report generation every morning

• 24-server cluster

• Besides this

– TV Monitor

– Mobile Monitor

– Site Monitor

– …

Before Hadoop

• Scrat – PostgreSQL 9.1 cluster

– Write a simple proxy

– <2s for 2TB data scan

• Mobile Monitor – Hadoop-like distributed computing system

– RabbitMQ + 3 computing servers

– Write a MapReduce in C++

– Handles 30 million to 500 million ad impressions

Problem & Opportunity

• Database cluster

• SQL on Hadoop

• Miscellaneous data

• Requirements

– Most data is relational

– SQL interface

SQL on Hadoop

• Google Dremel

• Apache Drill

• Cloudera Impala

• Facebook Presto

• EMC Greenplum/Pivotal

[Diagram: MapReduce; Hive and Pig; Impala / Drill / Pivotal / Presto. Caption: Latency matters]

What’s this

• A kind of MPP engine

• In-memory processing

• Small-to-big join

– Broadcast join

• Small result size
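The broadcast join named above can be pictured with a minimal sketch. This is an illustration only, not Impala's code: the function name, the row format (plain dicts) and the sample tables are all assumptions. The small table is copied to every node, each node builds a hash table from it, and the big table's rows are probed locally without moving across the network.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, big_partitions, small_key, big_key):
    """Join a small table against a partitioned big table.

    Hypothetical helper: each element of big_partitions stands in for the
    slice of the big table stored on one node.
    """
    joined = []
    for partition in big_partitions:
        # Every node receives the full small table and builds a hash table.
        build = defaultdict(list)
        for row in small_rows:
            build[row[small_key]].append(row)
        # Probe with the local rows of the big table only.
        for row in partition:
            for match in build.get(row[big_key], []):
                joined.append({**match, **row})
    return joined


# Made-up sample data: a small dimension table and a two-node fact table.
campaigns = [
    {"campaign_id": 1, "name": "spring"},
    {"campaign_id": 2, "name": "summer"},
]
impressions = [
    [{"id": "a", "campaign_id": 1}, {"id": "b", "campaign_id": 2}],  # node 0
    [{"id": "c", "campaign_id": 1}, {"id": "d", "campaign_id": 3}],  # node 1
]
result = broadcast_hash_join(campaigns, impressions, "campaign_id", "campaign_id")
```

The approach only pays off while the small table fits in each node's memory, which matches the "small to big join" and "small result size" constraints on this slide.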

Why Cloudera Impala

• The team moves fast

– UDFs coming out

– Better join strategy on the way

• Good code base

– Modularized

– Easy to add subclasses

• Really fast

– LLVM code generation

• 80s/95s – uv test

– Distributed aggregation tree

– In-situ data processing (inside storage)

Typical Arch.

[Diagram: a SQL interface and a meta store, plus three nodes that each contain a Query Planner, a Coordinator, and an Exec Engine]

Our target

• An MPP database

– Built on PostgreSQL 9.1

– Scales well

– Speed

• A mixed data source MPP query engine

– Join two tables in different sources

– In fact…

Hacking… from where

• Add, not change

– Scan Node type

– DB Meta info

• Put changes in configuration

– Thrift Protocol update

• TDBHostInfo

• TDBScanNode

Front end

• Meta store update

– Link data to the table name

– Table location management

• Front end

– Compute table location

Back end

• Coordinator

– pg host

• New scan node type

– db scan node

• Pg scan node

• Psql library using cursor
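The slides do not show the scan node's code, so the following is only a sketch of the general idea behind a cursor-based scan: rows are pulled from the database in fixed-size batches instead of being loaded all at once. Python's built-in `sqlite3` module stands in for PostgreSQL here, and `scan_batches`, the table `t` and its columns are hypothetical names.

```python
import sqlite3


def scan_batches(conn, query, batch_size=1024):
    """Yield the query result in batches, the way a scan node might feed
    row batches to the operators above it (illustrative only)."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# In-memory stand-in database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(f"id{i}", f"c{i % 3}") for i in range(10)],
)
batches = list(scan_batches(conn, "SELECT id, campaign FROM t", batch_size=4))
```

Fetching in batches keeps memory use bounded on the scanning side and lets downstream operators start working before the whole table has been read.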

SQL Plan

[Plan diagram; node labels as extracted from the slide: Aggr.: sum(count(id)); Exchange node; Aggr.: group by id; Aggr.: count(id); HDFS/PG scan; Aggr.: group by id; Exchange node]

• select count(distinct id) from table

– MR-like process
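One plausible reading of the plan for `count(distinct id)` is a two-round, MapReduce-like process: each scan node groups its own rows by id, an exchange routes every id to a single node by hash, each node counts the ids it owns, and a final step sums the partial counts. The sketch below simulates that flow in one process; the function names and the sample partitions are assumptions, not Impala code.

```python
def local_distinct(rows):
    # Per scan node: "group by id" collapses duplicates seen locally.
    return set(rows)


def shuffle_by_id(local_sets, num_nodes):
    # Exchange: route each id by hash so equal ids land on the same node.
    buckets = [set() for _ in range(num_nodes)]
    for ids in local_sets:
        for value in ids:
            buckets[hash(value) % num_nodes].add(value)
    return buckets


def count_distinct(partitions):
    local_sets = [local_distinct(p) for p in partitions]
    buckets = shuffle_by_id(local_sets, len(partitions))
    # Per node: count(id) over the ids that node now owns.
    partial_counts = [len(bucket) for bucket in buckets]
    # Coordinator: sum(count(id)) over the partial counts.
    return sum(partial_counts)


# Made-up data: three partitions with ids repeated within and across nodes.
partitions = [["a", "b", "a"], ["b", "c"], ["c", "d", "d"]]
total = count_distinct(partitions)
```

Because the hash exchange sends each id to exactly one node, no id is counted twice and the partial counts can simply be added.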

• Ad impression logs

– 150 million, 100KB/line

• 3 servers

– 24 cores

– 32 GB memory

– 12 × 2 TB hard disks

– 100 Mbps LAN

• Queries

– select count(id) from t group by campaign

– select count(distinct id) from t group by campaign

– select * from t where id = 'xxxxxxxx'

Performance

[Chart: impala vs. pg+impala]

• Group-by speed per core

• 20 M/s

With index

Codegen on/off

[Chart: en_codegen vs. dis_codegen]

• select count(distinct id) from t group by c

• select distinct id from t

• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
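To make the last query concrete: it keeps only the ids that have at least one row with c = '1' and at least one row with c = '2'. The sketch below runs it on a tiny made-up table using Python's built-in `sqlite3` as a stand-in engine; the table contents are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("u1", "1"), ("u1", "2"),               # seen in both -> kept
        ("u2", "1"),                            # only c = '1' -> dropped
        ("u3", "2"),                            # only c = '2' -> dropped
        ("u4", "1"), ("u4", "2"), ("u4", "3"),  # both, plus another -> kept
    ],
)

query = """
SELECT id FROM t
GROUP BY id
HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
   AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
LIMIT 10
"""
matching_ids = sorted(row[0] for row in conn.execute(query))
```

`count` ignores NULLs, so each `CASE` expression counts only the rows that match its condition within the group.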

Multi-users

Conclusion

• Source quality

– Readable

– Google C++ style

– Robust

• MPP solution based on PG

– Proven performance

– Easy to scale

• Mixed engine usage

– HDFS and DB

What’s next

• YARN integration

• UDF

• Join with big tables

• BI roadmap

• Failover

• Cloudera Impala online doc. & src

• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt

• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/

• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf

• @datascientist, @dongxicheng, @flyingsk, @zhh

Thanks! Q & A
