Hortonworks: We Do Hadoop.hortonworks.com/wp-content/uploads/2013/11/Webinar.HDP2_.201311… · Hortonworks: We Do Hadoop. ... Hadoop Pig HCatalog Hive HBase Sqoop Flume Oozie Zookeeper

© Hortonworks Inc. 2013

Hortonworks: We Do Hadoop.

Our mission is to enable your Modern Data Architecture by delivering One Enterprise Hadoop

November 2013



Recent Announcements

Hortonworks Data Platform 2.0 GA

Culmination of years of work: Hortonworks delivers YARN from the community to the enterprise to further cement Hadoop’s role in the data architectures of tomorrow

– YARN
– Stinger Phase 2
– Platform and Operational Services

Real Time Stream Processing with Storm

Announcing the Hortonworks investment roadmap for deeply integrating Apache Storm with Hadoop for analyzing sensor and machine data

October 15

October 23


HDP 2.0: Investment Themes

FLEXIBLE
•  Delivering YARN from the community to the enterprise to extend Hadoop into a multi-use platform
•  Hadoop beyond batch

COMPLETE
•  Delivery of Stinger Phase 2
•  Provides management of YARN and Hadoop 2.0 with Ambari
•  As always, we deliver a tested, stable distribution across all the most recent Apache releases

INTEGRATED
•  Certified by partners and customers


HDP: Reliable, Consistent & Current

HDP demonstrates most recent community innovation

[Figure: component versions shipped in each Hortonworks Data Platform release. Components: Hadoop, Pig, HCatalog, Hive, HBase, Sqoop, Flume, Oozie, Zookeeper, Mahout, Ambari. Releases: HDP 1.0 (June 2012), HDP 1.1 (Sept 2012), HDP 1.2 (Feb 2013), HDP 1.3 (May 2013), HDP 2.0 (Oct 2013).]


Hadoop Beyond Batch

A shift from the old to the new…

HADOOP 1.0: Single Use System (Batch Apps)
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
•  MapReduce (batch), Tez (interactive), Others (varied)
•  YARN (operating system: cluster resource management)
•  HDFS2 (redundant, reliable storage)


Hadoop: a FLEXIBLE Multi-use Data Platform

Apache YARN: the Hadoop 2.0 Operating System

•  Apache YARN enables data processing models beyond MapReduce (batch), such as interactive, online, streaming and beyond.
•  Interact with all data in multiple ways simultaneously

Data Processing Engines Run Natively IN Hadoop

•  BATCH: MapReduce
•  INTERACTIVE: Tez
•  ONLINE: HBase
•  STREAMING: Storm
•  GRAPH: Giraph
•  REEF
•  LASR, HPA
•  OTHERS


YARN: Efficiency with Shared Services

2x: YARN allows you to double processing in Hadoop on the same hardware while providing more predictable performance & quality of service

•  Batch: MapReduce
•  Interactive: Tez
•  Standard SQL Processing: Hive
•  Online Data Processing: HBase, Accumulo
•  Real Time Stream Processing: Storm
•  others …

Efficient Cluster Resource Management & Shared Services (Apache YARN)

Redundant, Reliable Storage (HDFS2)

Shared Services: YARN provides a stable, common set of shared resources across multiple, coordinated workloads
-  Manage & Monitor
-  Multi-tenancy
-  Security
-  High Availability
-  Disaster Recovery


Stinger Project (announced February 2013)

Batch AND Interactive SQL-IN-Hadoop

Stinger Initiative: a broad, community-based effort to drive the next generation of HIVE

Goals:
•  Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
•  Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
•  SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop

…all IN Hadoop

Stinger Phase 1: Delivered May 2013, HIVE 0.11 (HDP 1.3)
•  Base Optimizations
•  SQL Types
•  SQL Analytic Functions
•  ORCFile Modern File Format

Stinger Phase 2: Delivered September 2013, HIVE 0.12 (HDP 2.0)
•  SQL Types
•  SQL Analytic Functions
•  Advanced Optimizations
•  Performance Boosts via YARN

Stinger Phase 3: Coming Soon
•  Hive on Apache Tez
•  Query Service (always on)
•  Buffer Cache
•  Cost Based Optimizer (Optiq)

…70% complete in 6 months


SPEED: Increasing Hive Performance

Performance Improvements included in Hive 0.12
–  Vectorization
–  Base & advanced query optimization
–  Startup time improvement
–  Join optimizations

Human Acceptable Query Times across ALL use cases

< 10s

•  Simple and advanced queries across petabytes in seconds
•  Integrates seamlessly with existing tools
•  Currently a 60x improvement in just six months


SCALE: Interactive Query at Petabyte Scale

Sustained Query Times

Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale

[Figure: File Size Comparison Across Encoding Methods. Dataset: TPC-DS Scale 500. Encodings compared: Text, RCFile, Parquet, ORCFile (Hive 0.12). Labeled values: 585 GB (Original Size), 131 GB (78% Smaller).]

•  Larger Block Sizes
•  Columnar format arranges columns adjacent within the file for compression & fast access

Smaller Footprint

Better encoding with ORCFile in Apache Hive 0.12 reduces resource requirements for your cluster
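
The "78% Smaller" label on this slide can be checked against the two sizes it quotes (585 GB original, 131 GB encoded). The short Python sketch below does only that arithmetic; the function name `percent_smaller` is made up for illustration and is not part of any Hive or ORC tooling.

```python
def percent_smaller(original_gb: float, encoded_gb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (1 - encoded_gb / original_gb) * 100


# Sizes quoted on the slide for the TPC-DS Scale 500 dataset.
reduction = percent_smaller(585, 131)
print(f"{reduction:.1f}% smaller")  # 77.6% smaller, which rounds to the slide's 78%
```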


SQL: Enhancing SQL Semantics

Hive SQL Datatypes            Hive SQL Semantics
INT                           SELECT, INSERT
TINYINT/SMALLINT/BIGINT       GROUP BY, ORDER BY, SORT BY
BOOLEAN                       JOIN on explicit join key
FLOAT                         Inner, outer, cross and semi joins
DOUBLE                        Sub-queries in FROM clause
STRING                        ROLLUP and CUBE
TIMESTAMP                     UNION
BINARY                        Windowing Functions (OVER, RANK, etc.)
DECIMAL                       Custom Java UDFs
ARRAY, MAP, STRUCT, UNION     Standard Aggregation (SUM, AVG, etc.)
DATE                          Advanced UDFs (ngram, Xpath, URL)
VARCHAR                       Sub-queries for IN/NOT IN, HAVING
CHAR                          INTERSECT / EXCEPT
                              Expanded JOIN Syntax

Legend: Available; Hive 0.12 (HDP 2.0); Roadmap

SQL Compliance

Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
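
Of the semantics listed on this slide, windowing functions (OVER, RANK) are perhaps the least familiar. In HiveQL a ranking query takes the form `SELECT region, rep, amount, RANK() OVER (PARTITION BY region ORDER BY amount DESC) FROM sales`. The sketch below emulates what that query computes in plain Python, so the tie-handling behavior is visible. It does not call Hive; the `sales` rows, column names, and the `rank_over` helper are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def rank_over(rows, partition_key, order_key):
    """Emulate RANK() OVER (PARTITION BY partition_key ORDER BY order_key DESC).

    Rows with equal order_key values share a rank, and the rank after a tie
    skips ahead (1, 1, 3), which is how SQL RANK() differs from DENSE_RANK().
    """
    ranked = []
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        members = sorted(group, key=itemgetter(order_key), reverse=True)
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for position, row in enumerate(members, start=1):
            if row[order_key] != previous:
                rank = position
                previous = row[order_key]
            ranked.append({**row, "rank": rank})
    return ranked


# Hypothetical input rows standing in for a Hive table.
sales = [
    {"region": "east", "rep": "ann", "amount": 300},
    {"region": "east", "rep": "bob", "amount": 300},
    {"region": "east", "rep": "cat", "amount": 100},
    {"region": "west", "rep": "dan", "amount": 250},
]
result = rank_over(sales, "region", "amount")
for row in result:
    print(row["region"], row["rep"], row["amount"], row["rank"])
```

Within the "east" partition, the two reps tied at 300 both receive rank 1 and the next rep receives rank 3; the "west" partition restarts at rank 1.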


HDFS2 Highlights

• HDFS Federation/NameSpaces (further scales # of files & nodes)
• Automated failover with a hot standby and full stack resiliency for the NameNode master service
• Standard NFS read/write access to HDFS
• Point in time recovery with Snapshots in HDFS
• Wire Encryption for HDFS Data Transfer Protocol



HDP 2.0 Certified Partners





Hadoop 2.0: GA in Apache Community

YARN based architecture of Hadoop 2.0 enables new processing approaches

HADOOP 1.0: Single Use System (Batch Apps)
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)

HADOOP 2.0
•  MapReduce (batch), Tez (interactive), Others (varied)
•  YARN (cluster resource management)


Stream Processing in Hadoop

Stream processing has emerged as a key use case

Driven by new types of data
–  Sensor/Machine
–  Server logs
–  Clickstream

Storm with Hadoop enables new business opportunities
–  Low-latency dashboards
–  Quality, Security, Safety, Operations Alerts
–  Improved operations
–  Real-time data integration

HADOOP 2.0
•  MapReduce (batch), Tez (interactive), Apache STORM (streaming)
•  YARN (cluster resource management)


Apache Storm Leading for Stream Processing

• Developed by Twitter, adopted by large enterprises
• In use and proven at Yahoo! and others
• Apache Project with active community of developers

Key Capabilities of Storm
–  Ingest millions of events / second
–  Perform arithmetic and aggregations on the data as it arrives
–  Alert on boundary conditions
–  Persist to Hive, HBase and HDFS
–  Integration with queuing
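
Two of the capabilities above, aggregating data as it arrives and alerting on boundary conditions, can be illustrated with a small Python sketch. This is not the Storm API (Storm topologies are built from spouts and bolts); it is a single-process approximation of the idea, and the function name `rolling_alerts`, the window size, the threshold, and the sample readings are all assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def rolling_alerts(
    readings: Iterable[float], window: int, threshold: float
) -> Iterator[Tuple[int, float]]:
    """Yield (index, rolling_mean) whenever the mean of the last `window`
    readings exceeds `threshold`. Events are consumed one at a time."""
    recent: Deque[float] = deque(maxlen=window)
    total = 0.0
    for i, value in enumerate(readings):
        if len(recent) == window:
            total -= recent[0]  # oldest value is evicted by the append below
        recent.append(value)
        total += value
        mean = total / len(recent)
        # Only alert once a full window of readings has been seen.
        if len(recent) == window and mean > threshold:
            yield i, mean


# Hypothetical sensor readings arriving in order.
sensor_events = [70.0, 71.0, 69.0, 90.0, 95.0, 98.0, 72.0]
alerts = list(rolling_alerts(sensor_events, window=3, threshold=80.0))
for index, mean in alerts:
    print(f"alert at event {index}: rolling mean {mean:.2f}")
```

Because the function is a generator over any iterable, it handles each event as it arrives without holding the full stream in memory, which mirrors the per-event processing model the slide describes.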

[Figure: STORM Processing & Events. Labels: Machine, Server log, Events; Dashboards, Analytics, Email, HDFS.]


Hortonworks Investment in Storm

Hortonworks Storm Investment Plans: bringing innovation from the community to the Enterprise

Goals:
•  Unlock new uses of data: Real-time event processing for sensor networks and business activity monitoring
•  Ease of use: Connected with Hadoop and the enterprise. Integrated developer and operations tools
•  Scale: Ingesting millions of events per second. Fast query on petabytes of data

…all IN Hadoop

Phase 1: Streaming IN Hadoop
•  Storm-on-YARN
•  Installation with Ambari
•  Ganglia & Nagios monitoring

Phase 2: Enterprise connectivity
•  Data ingest Spouts
•  Bolts for notification and data persistence: HDFS, HBase
•  AD/LDAP plugin for authentication
•  High Availability management w/Ambari

Phase 3: Visual stream development and management
•  Declarative “wiring”
•  Hive update support
•  Advanced scheduler
•  OpenStack Savanna support

Q1 2014


Hortonworks Approach to Enterprise Hadoop

Community Driven Enterprise Apache Hadoop

Identify and introduce enterprise requirements into the public domain

Work with the community to advance and incubate open source projects

Apply Enterprise Rigor to provide the most stable and reliable distribution


One Hadoop: Interoperable & Familiar

[Figure: Hadoop within the enterprise data architecture. Labels: APPLICATIONS (BusinessObjects BI); DEV & DATA TOOLS; OPERATIONAL TOOLS; DATA SYSTEM (RDBMS, EDW, MPP, HANA); INFRASTRUCTURE; SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured).]


Hortonworks: The Value of “Open” for You

Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community

Avoid Vendor Lock-in: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in

The partners you rely on, rely on Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments

Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use

Support from the experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience

Visit www.hortonworks.com

Try www.hortonworks.com/sandbox

Follow twitter.com/hortonworks
