1 © Copyright 2012 EMC Corporation. All rights reserved.

Greenplum hadoop

Page 2: Greenplum hadoop

Integrated Analysis of Structured and Unstructured Data, with Application Cases
Greenplum Enables Big Data Analytics

Jimmy Chiu (邱垂吉), Technical Consultant, EMC Greenplum Taiwan

Page 3: Greenplum hadoop

3 © Copyright 2010 EMC Corporation. All rights reserved.

• Volume: data volumes approaching multiple petabytes

• Velocity: data being generated and ingested for analysis in real-time

• Variety: tabular, documents, e-mail, metering, network, video, image, audio

• Complexity: different standards, domain rules, and storage formats per data type

Big Data: Volume, Variety, Velocity, Value + Complexity

[Figure labels: Transactional Data, Documents, Smart Grid, Images, Audio, Video, Text; Variety, Complexity, Velocity, Volume; New insights on customers, products, and operations; Contextual and location-aware delivery to any device]

Source: Gartner, March 2011

Page 4: Greenplum hadoop

Sample Big Data Scenarios

AUTO INSURANCE IN P&C INSURANCE

LOAN PROCESSING IN BANKING

SMART GRID ANALYTICS IN UTILITIES/ENERGY

VIDEO ANALYTICS IN RETAIL

PROACTIVE EMERGENCY RESPONSE IN HEALTHCARE

REAL-TIME STATISTICAL PROCESS CONTROL IN MANUFACTURING

Page 5: Greenplum hadoop

Big Data Analytics For Competitive Advantage

Today’s Business Model: Suppliers, Inventory, Physical Assets, Manufacturing, Distribution, Services, Mass Marketing, Customers

Big Data Analytics Business Model: Suppliers, Inventory, Physical Assets, Manufacturing, Distribution, Services, Personal Marketing, Customers, Additional Profits

• Who are my most valuable customers?

• What are my most important products?

• What are my most successful campaigns?

Page 6: Greenplum hadoop

Big Data meets Fast Data

Social and Personal – Every Minute:

•Google gets more than 2 million search queries

•About 47,000 people download an App

•Some 100,000 tweets hit Twitter

•Almost 300,000 people log on to Facebook

Business and Transactional:

•CERN (European Organization for Nuclear Research) generates 40TB/sec of scientific data

•Wal-Mart – 1 million transactions per hour

•The world’s top trading systems currently execute trades in under 50 microseconds

•New York Stock Exchange generates 1TB of new trading data daily

Page 7: Greenplum hadoop

Working together, they enable entirely New Business Models

Big Data allows you to find opportunities you didn’t know you had. Fast Data allows you to respond to opportunities before they are gone.

In the Financial Services Industry, large quantities of historical data need to be processed against a growing number of fast-moving data feeds. Batch processing is no longer a suitable solution!

Page 8: Greenplum hadoop

Effective Customer Segmentation is all about blending Structured and Unstructured Data

– Transaction data (structured data) tells you what the customer did.

– Unstructured data can tell you why they did it, why some others did not, what else they need or want, and what problems they may have.

Page 9: Greenplum hadoop

Big Data Architecture Requirements

• Multiple data types: structured, semi-structured, unstructured

• Integrated data stores: real-time, traditional, data warehouse

• Modern development tools: Java, lightweight messages, mobile-enabled

• Cloud-enabled: elastic scale, self-healing

Beware point solutions – integration is critical!

"Solving Big Data challenge involves more than just managing volumes of data." (Gartner)

Page 10: Greenplum hadoop

Greenplum Overview

Page 11: Greenplum hadoop

Greenplum Product Line

Page 12: Greenplum hadoop

Architecture of Greenplum

• Master servers optimize queries for the most efficient query execution

• MPP Scatter/Gather streaming for fast loading of data

• Flexible framework for processing large datasets

• Interconnect for continuous pipelining of data processing

• Segment servers process queries close to the data in parallel

• Process large datasets with support for both SQL and MapReduce

[Diagram labels: Master, Master, SQL, MapReduce]
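The MapReduce model named on this slide can be illustrated with a minimal sketch. This is plain Python rather than Greenplum or Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "fast data"]))
```

In a real cluster the map and reduce calls run on many nodes and the shuffle moves data across the network; here all three phases run in one process to keep the idea visible.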

Page 13: Greenplum hadoop

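Greenplum MPP Shared-Nothing Architecture

• Shared everything (e.g., a Unix server): a single DB on a single disk

• Shared disk (e.g., Oracle RAC): several DB nodes on the intranet, all attached over SAN/FC to one shared SAN disk

• Shared nothing (e.g., Greenplum): a Master and several DB nodes on the intranet, each DB node with its own disk (MPP)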

Page 14: Greenplum hadoop

Benefits of the Greenplum Database Architecture

• Simplicity

– Parallelism is automatic: no manual partitioning required

– No complex tuning required: just load and query

– High availability (HA)

– Best-of-breed x86 and Ethernet networking technologies

• Scalability

– Linear scalability

– Each node adds storage, query performance, and loading performance

• Flexibility

– Full parallelism for SQL92, SQL99, SQL2003 OLAP, and MapReduce

– Any schema (star, snowflake, 3NF, hybrid, etc.)

– Rich extensibility and language support (Perl, Python, R, C, etc.)

– Structured, semi-structured, and unstructured data
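As a rough illustration of why a shared-nothing design can parallelize work automatically, the sketch below hashes rows across segments, lets each segment aggregate only its own rows, and then gathers the partial results. It is a conceptual model in plain Python, not Greenplum's implementation; `NUM_SEGMENTS`, the CRC32 hash, and all function names are assumptions made for this example.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # hypothetical number of segment servers


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Choose a segment by hashing the distribution key (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def scatter(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place every row on exactly one segment; no segment shares storage."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def local_sum(segment_rows, group_field, value_field):
    """Work done by one segment: aggregate only the rows it owns."""
    partial = defaultdict(float)
    for row in segment_rows:
        partial[row[group_field]] += row[value_field]
    return dict(partial)


def gather(partials):
    """Work done by the master: merge the partial sums from all segments."""
    totals = defaultdict(float)
    for partial in partials:
        for group, subtotal in partial.items():
            totals[group] += subtotal
    return dict(totals)


def parallel_sum(rows, key_field, group_field, value_field):
    """Scatter rows, aggregate on each segment concurrently, then gather."""
    segments = scatter(rows, key_field)
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(
            pool.map(lambda seg: local_sum(seg, group_field, value_field), segments)
        )
    return gather(partials)


if __name__ == "__main__":
    sales = [
        {"id": 1, "region": "north", "amount": 10.0},
        {"id": 2, "region": "south", "amount": 12.5},
        {"id": 3, "region": "north", "amount": 20.0},
    ]
    print(parallel_sum(sales, "id", "region", "amount"))
```

Because each segment touches only its own rows, adding segments adds both storage and processing capacity, which is the intuition behind the scalability bullet above.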

Page 15: Greenplum hadoop

Greenplum and Hadoop

Analytics

• Structured: ERP/CRM

• Semi-Structured: Machine Data, Logs

• Unstructured: Images/Sound

[Diagram labels: Ad-hoc Analysis; Dynamic Data; batch reporting on static data]

Page 16: Greenplum hadoop

Big Data Analytics: The Power of Data Co-Processing

Greenplum Chorus: Analytic Productivity & Tool Integration

Data Access And Query: SQL, MapReduce, SAS, MADlib, Mahout, R, and others

Greenplum Database: SQL Engine For Structured Data

• In-database Advanced Analytics

• Extreme performance on commodity hardware

Greenplum Hadoop: MapReduce Engine For Unstructured Data

• Enterprise-ready Apache Hadoop

• Faster, more dependable, and easier to use

Network: parallel data exchange; Parallel Loading Of All Data Types

Greenplum Commander: End-to-end Platform Management & Control

Page 17: Greenplum hadoop

Greenplum Hadoop

• Greenplum HD

– Enterprise-ready Apache Hadoop

– Proven at Scale in 1,000 node Analytics Workbench

– Single product with 2 storage options (Isilon & HDFS)

• Enterprise Edition becomes Greenplum MR:

– Advanced features

– 100% API compatible

– Software-only product

Page 18: Greenplum hadoop

AWB Update

Analytics Workbench Operational!

•1025 nodes operational

•1011 nodes with GPHD installed

•8 projects in total have been onboarded, ranging from university collaborations to partner technology evaluations

Proposals accepted by customer engagement team – [email protected]

•Engagement team will learn project objectives

•JEDI council approves/disapproves each project based on technical feasibility and alignment with company goals

•Projects informed of decisions and timelines

Cluster access via - http://portal.analyticsworkbench.com/

Page 19: Greenplum hadoop

Apache Hadoop Pain Points

Operability and Manageability

• Poor Job and Application Monitoring Solution

• Non-existent Performance Monitoring

• Complex System Configuration and Manageability

• No Data Format Interoperability & Storage Abstractions

Performance

• Poor Dimensional Lookup Performance

• Very Poor Random Access and Serving Performance

Page 20: Greenplum hadoop

Greenplum MR: Enterprise Edition Stack

100% APACHE INTERFACE

• Distributed File System

• MapReduce Framework (MapRed)

• Pig

• Hive

• HBase

• Zookeeper

• Enhanced Monitoring

Page 21: Greenplum hadoop

Greenplum MR: Enterprise Edition

Enterprise-Ready Hadoop Platform for Unstructured Data

Faster

• 2 – 5x Faster than Apache Hadoop

Reliable

• High Availability

• Mirroring

Easier to Use

• NFS mountable

• Graphical System Management

Page 22: Greenplum hadoop

Greenplum MR Simple Management

• Health Monitoring

• Cluster Administration

• Application Provisioning

Page 23: Greenplum hadoop

Rack Level Monitoring

Page 24: Greenplum hadoop

Greenplum MR Delivers True Return on Investment

• Eliminates all single points of failure

• High Availability for JobTracker, NameNode & NFS

• Snapshots allow point-in-time data protection and recovery

• Mirroring for business continuity includes wide area replication support

• NFS direct access to simply load and access data directly in a Hadoop cluster

• Enables standard tools and utilities to work directly on data contained in Hadoop

• Heatmap user interface provides full cluster visibility and control

• Speeds jobs by 2X – 5X

• Provides faster performance with ½ the hardware

• Substantial capital and operating expense savings

Page 25: Greenplum hadoop

EMC Greenplum

[Column headings: DATA IN; IN-DATABASE ANALYTICS; DECISIONS OUT; Fastest data loading; Advanced analytics]

Scatter/Gather Streaming technology for the world’s fastest data loading

• Eliminate data load bottlenecks

• Clean and integrate new data

• Several loading options, ranging from bulk load updates to micro-batching for near real-time processing

Optimized for fast query execution and linear scalability

• Move processing closer to data

• Shared-nothing, massively parallel processing (MPP) scale-out architecture

• Computing is automatically optimized and distributed across resources

• Provides the best concurrent multi-workload performance

Unified data access for greater insight and value from data

• Enable parallel analysis across the enterprise

• Open platform with broad language support

• Certified enterprise connectivity and integration with most business intelligence; extract, transform, and load (ETL); and management products
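The micro-batching option mentioned in the loading bullets can be sketched as follows. This is a generic illustration in plain Python, not a Greenplum loading API; `micro_batches`, `load_in_micro_batches`, and the `load_batch` callback are hypothetical names, and a real loader would also flush on a timer rather than on batch size alone.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield consecutive lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_in_micro_batches(records, load_batch, batch_size=3):
    """Hand each small batch to load_batch (a caller-supplied loader) and count rows."""
    loaded = 0
    for batch in micro_batches(records, batch_size):
        load_batch(batch)  # hypothetical hook: e.g. one bulk insert per batch
        loaded += len(batch)
    return loaded


if __name__ == "__main__":
    received = []
    total = load_in_micro_batches(range(7), received.append, batch_size=3)
    print(total, received)
```

Small batches trade some per-load overhead for lower latency between a record arriving and becoming queryable, which is what makes the approach "near real-time" rather than bulk.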

Page 26: Greenplum hadoop

EMC Big Data Analytics Reference Architecture

Stages: Data Input; Integration; Data Stores and Access; Data Analysis; Presentation & Delivery

• Data Sources: Multimedia, Web/Social, ERP, CRM, POS, Mobile, Documents, Machine Data, LOB data

• Integration: Quality, MDM, ETL, Map-Reduce

• SQL Stores: Enterprise Data Warehouse, Data Marts (BU 1, BU 2, BU 3), Federated Data Warehouse

• Hadoop: HDFS, Map-Reduce, Ecosystem*

• NoSQL Stores: Key Values, Documents, Other NoSQL

• Data Analysis: Statistics, Data Mining, Operations Research, Neural Nets, Genetic Algorithms, OLAP, BI as a Service

• Presentation & Delivery: Alerts, Reports, Dashboards, Spreadsheets, Mobile, Data Visualization

[Other diagram labels: Structured data sources; Traditional data integration; Traditional data warehousing; Big data analytics ramifications; parallel data exchange]

*Hadoop Ecosystem includes: Hive, Pig, Mahout, HBase, ZooKeeper, Oozie, Sqoop, Avro

Page 27: Greenplum hadoop

Architecture for Business Value

[Diagram labels: DBs; Files (.csv, .txt); ETL; Load; GPDB; MapRFS (GPMR); HBase; JDBC; ODBC; Java API; Analytics tools (SAS, R, MADlib and more); Analytics tools (Mahout); Analytics self-developed app; Business Value]

• MapRFS: C++; MR: C++

• Performance: 2~5X

• High Availability

• Stable

• SAS & MADlib: In GPDB, In Memory

• Chorus for Collaboration

Page 28: Greenplum hadoop

Big Data And EMC

1. Petabyte Scale Data Storage

2. Unified Analytics Platform

3. Data Science

4. New Analytic Applications

Page 29: Greenplum hadoop

SAS / Greenplum Product Overview

SAS High Performance Computing

SAS Access for Integration

• Provides integration capability to a number of databases

• Allows for increased performance of Base SAS Procs

• Products: SAS Access for Greenplum

SAS In-Database Processing

• Requires SAS Enterprise Miner in order to be of value

• Will lead to significant improvement in performance

• Products: SAS Access for Greenplum, SAS Grid Manager, SAS Enterprise Miner, SAS Scoring Accelerator for Greenplum

SAS In-Memory Analytics

• New functionality from SAS that requires a dedicated database appliance

• Very high performance for business users, which can significantly increase revenues or decrease costs

• Products: SAS Access for Greenplum, SAS Grid Manager, SAS High Performance Analytics

Page 30: Greenplum hadoop

SAS and Greenplum UAP Integrated Architecture

Data Science Team: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User, Data Platform Admin

Layers labeled in the diagram:

• SAS Analytics, SAS Business Intelligence, SAS Information Management

• Greenplum Chorus - Analytic Productivity Layer

• Data Access & Query Layer (SAS ACCESS, SQL, MapReduce)

• Greenplum Database, Greenplum Hadoop

• Private/Hybrid Cloud Infrastructure or Appliance

Page 31: Greenplum hadoop

Structured & Unstructured Data

Analyze Petabytes Of Current Data

Virtual, Scale Out Architecture

Self-Service

Iterative, Agile

Transparent, Real-time Collaboration

In A Single Unified Analytics Platform
