28
Apache Hive 2 (대규모 고속 SQL) 최종욱, Hortonworks [email protected]

Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

Apache Hive 2 (대규모 고속 SQL)

최종욱, Hortonworks

[email protected]

Page 2: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

INTERNET OF

ANYTHING

데이터의 시대

오픈소스는 표준이고,

아파치는 그 중심에 있습니다. Founded: 2011 IPO: 2014

Founded: 1999

Page 3: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

호튼웍스의 접근과 배포판(HDP, HDF)

개발

배포 서비스

설계

DATA AT REST 보관 데이터

DATA IN MOTION 실시간 데이터

ACTIONABLEINTELLIGENCE (실행가능한 지능)

Modern Data Applications (선구적인 데이터 어플리케이션)

Hortonworks DataFlow

(Powered by Apache NIFI)

Hortonworks Data Platform

(Powered by Apache Hadoop)

Page 4: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

호튼웍스 회사 소개 O

NLY

100 오픈소스

Apache Hadoop Data Platform

% 2011년 Yahoo! 에서

하둡팀이 분사

1 번째 하둡 상장회사

기술파트너

전략 파트너

컨설턴트

리셀러 파트너사

개국

19 IPO 4Q14 (NASDAQ: HDP) 유일한 흑자회사

US 포천 100대 기업의

60%, Global 포천

500대기업의 30%가

구독고객

1,100+ 구독고객

1,110+

1 Operating billings is an operating measure defined as the aggregate value of all invoices sent to our customers in a given period Page 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

2,100+

직원

Page 5: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda Scalable Data Warehousing on Hadoop

Overview

Solution Architecture

Customer Use Cases

What’s New

Roadmap

Page 6: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Scalable Data Warehousing on Hadoop C

apab

iliti

es

Batch SQL OLAP / Cube Interactive SQL Sub-Second

SQL ACID / MERGE

Ap

plic

atio

ns

• ETL • Reporting • Data Mining • Deep Analytics

• Multidimensional Analytics

• MDX Tools • Excel

• Reporting • BI Tools: Tableau,

Microstrategy, Cognos

• Ad-Hoc • Drill-Down • BI Tools: Tableau,

Excel

• Continuous Ingestion from Operational DBMS

• Slowly Changing Dimensions

Existing

HDP 2.6

Tech Preview

Legend

Co

re

Platform

Scale-Out Storage

Petabyte Scale Processing

Core SQL Engine

Apache Tez: Scalable Distributed Processing

Advanced Cost-Based Optimizer

Connectivity

Advanced Security

JDBC / ODBC

Comprehensive SQL:2011 Coverage

Page 7: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive’s Unique Advantages in HDP 2.6

Why Hive:

• Comprehensive ANSI SQL including 99 TPC-DS Queries.

• The only Hadoop SQL with ACID MERGE for easy updates.

• In-Memory caching for MPP performance at Hadoop scale.

• Per-User dynamic row and column security.

• Replication and DR for critical workloads.

• Compatible with every major BI Tool.

• Proven at 300+ PB Scale.

Page 8: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Apache Hive: Fast Facts

Most Queries Per Hour

100,000 Queries Per Hour (Yahoo Japan)

Analytics Performance

100 Million rows/s Per Node (with Hive LLAP)

Largest Hive Warehouse

300+ PB Raw Storage (Facebook)

Largest Cluster

4,500+ Nodes (Yahoo)

Page 9: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda Scalable Data Warehousing on Hadoop

Overview

Solution Architecture

Customer Use Cases

What’s New

Roadmap

Page 10: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

HDP 2.6 Continues Strong Momentum for Hive

At a High Level: – 1200+ features, improvements and bug fixes

in Hive since HDP 2.5.

– 400+ of these from outside of Hortonworks.

Major Improvements: – Hive LLAP Now GA

– ACID MERGE

– SQL: All 99 TPC-DS out-of-the-box with only trivial rewrites

– Hive View 2.0: Great Features for DBAs

– Diagnostics: Tez UI Total Timeline View

– Tech Preview: Hive OLAP Indexes powered by Druid

820

413

From Hortonworks

From Community

HDP 2.6 Improvements

Hive LLAP GA +

SQL MERGE +

All TPC-DS Queries +

Page 11: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Scalable Data Warehousing on Hadoop C

apab

iliti

es

Batch SQL OLAP / Cube Interactive SQL Sub-Second

SQL ACID / MERGE

Ap

plic

atio

ns

• ETL • Reporting • Data Mining • Deep Analytics

• Multidimensional Analytics

• MDX Tools • Excel

• Reporting • BI Tools: Tableau,

Microstrategy, Cognos

• Ad-Hoc • Drill-Down • BI Tools: Tableau,

Excel

• Continuous Ingestion from Operational DBMS

• Slowly Changing Dimensions

Existing

HDP 2.6

Tech Preview

Legend

Co

re

Platform

Scale-Out Storage

Petabyte Scale Processing

Core SQL Engine

Apache Tez: Scalable Distributed Processing

Advanced Cost-Based Optimizer

Connectivity

Advanced Security

JDBC / ODBC

Comprehensive SQL:2011 Coverage

Page 12: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive LLAP -- MPP Performance at Hadoop Scale

Dee

p

Sto

rage

YARN Cluster

LLAP Daemon

Query Executors

LLAP Daemon

Query Executors

LLAP Daemon

Query Executors

LLAP Daemon

Query Executors

Query Coordinators

Coord-inator

Coord-inator

Coord-inator

HiveServer2 (Query

Endpoint)

ODBC / JDBC

SQL Queries In-Memory Cache

(Shared Across All Users)

HDFS and Compatible

S3 WASB Isilon

Page 13: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive LLAP: One-Touch Provisioning

Page 14: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

28 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive LLAP Text Cache Delivers Zero ETL Analytics

Fast interactive analytics on CSV and JSON data.

No ETL / conversion to ORCFile needed, just load and go.

Once data is cached, analytics become dramatically faster.

Highlights

Page 15: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

29 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Cache 4x More Data with Hive LLAP SSD Cache

Use the combination of DRAM and SSD to dynamically cache data.

Cache 4x more data than using DRAM alone.

Deliver fast analytics on larger datasets with higher concurrency.

Highlights

DRAM Cache

SSD

Deep Storage

Deep Storage

Deep Storage

Page 16: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

30 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive ACID MERGE Makes Data Maintenance Simple

SQL Standard ACID MERGE now available in Hive.

Efficiently perform record-level inserts, updates and deletes.

Delivers real Data Management in Hadoop, massively simplifying updates, data restatements and change data capture.

Highlights

X Y

1 X1

5 X2

7 X3

A B

1 X1

2 Y2

3 Y3

4 Y4

5 X2

6 Y6

7 X3

Page 17: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

33 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Comprehensive SQL in Hive Including All 99 TPC-DS Queries

Multiple and Scalar Subqueries

INTERSECT and EXCEPT

Standard syntax for ROLLUP / GROUPING

Syntax improvements for GROUP BY and ORDER BY

In HDP 2.6+ Hive runs all 99 TPC-DS with only trivial re-writes.

Highlights

Page 18: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

37 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive View 2.0: Visual Explain Plan Makes Debugging Easier

Visual Explain Plan See how your query is run and the cost of each step.

Page 19: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

38 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Tez Total Timeline View Show Exactly Where Time Goes

Total Timeline View See exactly where query

time is spent, from compilation to execution.

Page 20: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

40 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Dynamic Tag-based Access Policies with Apache Atlas

• Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation.

• Geo-based policy – Policy based on IP address, proxy IP substitution maybe required. The rule enforcement but be geo aware.

• Time-based policy – Timer for data access, de-coupled from deletion of data.

• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.

Key Benefits: New scalable metadata based security paradigm Dynamic, real-time policy Active protection – fast updates to changes Centralized and simple to manage policy

Page 21: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

42 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Head to Head: TPC-DS Hive versus Impala

5000 6000 7000 8000 9000 10000 11000 12000 13000

CDH 5.12

HDP 2.6

Total Runtime (s)

(Shared Queries,

Lower is Better)

0 20 40 60 80 100

CDH 5.12

HDP 2.6

TPC-DS Queries

Supported

(Higher is Better)

TPC-DS; Scale 10,000; 9 Nodes; Identical, Trivially Modified SQL Queries

60

99

Page 22: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

43 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Performance: Apache Hive vs. Presto on a partitioned 1TB dataset.

Presto lacks basic performance optimizations like dynamic partition pruning.

On a real dataset / workload Presto perform poorly without full re-writes.

Example: Query 55 without re-writes = 185.17s, with re-writes = 16s. LLAP = 1.37s.

Highlights

Page 23: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

45 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Announcing: Druid is Now GA in HDP 2.6.3

✓ Analyze Streaming and Historical Data with SQL

✓ Powerful Visualization

✓ Simple management and monitoring with Ambari

✓ Fine-grained security

✓ Integrates with Hortonworks SAM for simple development.

Note: Superset remains in Technical Preview in HDP 2.6.3

Page 24: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

48 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hive + Druid = Insight When You Need It

OLAP Cubes SQL Tables

Streaming Data Historical Data

Unified SQL Layer

Pre-Aggregate ACID MERGE

Easily ingest event data into OLAP cubes

Keep data up-to-date with Hive MERGE

Build OLAP Cubes from Hive Archive data to Hive for history

Run OLAP queries in real-time or Deep Analytics over all history

Deep Analytics Real-Time Query

Page 25: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

49 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

OLAP Analytics in Milliseconds with Hive over Druid

.0

.5

1.0

1.5

2.0

2.5

3.0

Q1.1 Q1.2 Q1.3 Q2.1 Q2.2 Q2.3 Q3.1 Q3.2 Q3.3 Q3.4 Q4.1 Q4.2 Q4.3

Ru

nti

me

(s)

Star Schema Benchmark 1TB Scale with Hive over 10 Druid Nodes

Page 26: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

50 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

OLAP Analytics in Milliseconds with Hive over Druid

Page 27: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda Scalable Data Warehousing on Hadoop

Overview

Solution Architecture

Customer Use Cases

What’s New

Roadmap

Page 28: Apache Hive 2 - 삼성 오픈소스 컨퍼런스Performance: Apache Hive vs. Presto on a partitioned 1TB dataset. Presto lacks basic performance optimizations like dynamic partition

53 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Roadmap At A Glance

Scalable DW in Hadoop HDP 2.6 (Current) HDP 3.0 (Future) Beyond HDP 3.0

Fast BI

• LLAP GA • Vectorized Decimal • SSD Cache • Text / JSON Cache LLAP • Tech Preview: Hive / Druid

• Fine-grained Resource Management Policies

• View navigation for Druid tables • Transparent Parquet cache • Intermediate Results Spooling

• Query Cache • Admission Controls

SQL / EDW

• ACID MERGE • SQL: Cross Product, Multi

Subquery, TPC-DS Complete

• Tables default to ACID capable • Column NOT NULL / Defaults • Surrogate Key Generation • Better Unicode support

• Multi-Statement Transactions • Improved HPL/SQL

Cloud • LLAP Template for

Hortonworks Data Cloud • Replication / DR • Full ACID support for S3 / WASB

Operations • Hive View 2.0: Tools for DBAs • Tez UI: Hive Power Tools

• Hive Studio: Single pane of glass for Hive development and debugging

• Usage Reports • Schema Recommendations

Current

Note: Roadmap is forward-looking and subject to change without notice.