Building a Next-Generation Intelligent Data Platform (建構新世代的智慧數據平台), 尹寒柏 Bob Yin, Senior Product Specialist

Track B-1 建構新世代的智慧數據平台


Page 1: Track B-1 建構新世代的智慧數據平台


Page 2: Track B-1 建構新世代的智慧數據平台

TECHNOLOGY (period): USERS; SOURCES; BUSINESS VALUE

•  Mainframe (OS/360), 1960s-1970s: Few Employees; 10^2; Back Office Automation
•  Client-Server, 1980s: Many Employees; 10^4; Front Office Productivity
•  Web, 1990s: Customers/Consumers; 10^6; E-Commerce
•  Cloud, 2007: Business Ecosystems; 10^7; Line-of-Business Self-Service
•  Social, 2011: Communities & Society; 10^9; Social Engagement
•  Internet of Things, 2014: Devices & Machines; 10^11; Real-Time Optimization

Page 3: Track B-1 建構新世代的智慧數據平台

What are your Business Initiatives related to Big Data?

Financial Services
•  Fraud Detection
•  Risk & Portfolio Analysis
•  Investment Recommendations

Retail & Telco
•  Proactive Customer Engagement
•  Location Based Services

Manufacturing
•  Connected Vehicle
•  Predictive Maintenance

Healthcare & Pharma
•  Predicting Patient Outcomes
•  Total Cost of Care
•  Drug Discovery

Public Sector
•  Health Insurance Exchanges
•  Public Safety
•  Tax Optimization
•  Fraud Detection

Media & Entertainment
•  Online & In-Game Behavior
•  Customer X/Up-Sell

Page 4: Track B-1 建構新世代的智慧數據平台

80% of the work in big data projects is data integration and data quality

“80% of the work in any data project is in cleaning the data”

“70% of my value is an ability to pull the data, 20% of my value is using data-science…”

“I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis.”

Page 5: Track B-1 建構新世代的智慧數據平台

What Are Your Primary Concerns About Using Big Data Software

•  38%: Big data expertise is scarce and expensive
•  33%: Data warehouse appliance platforms are expensive
•  31%: We aren’t sure how big data analytics will create business opportunities
•  22%: Analytical tools are lacking for big data platforms like Hadoop and NoSQL databases
•  21%: Our data’s not accurate
•  17%: Hadoop and NoSQL technologies are hard to learn

Source: InformationWeek 2013 Analytics, Business Intelligence and Information Management Survey of 541 business technology professionals, October 2012

Page 6: Track B-1 建構新世代的智慧數據平台

Staff Projects with Readily Available Skills
•  Informatica Developers are Hadoop Developers

A large global bank grew staff from 2 Java developers (hand-coding) to 100 Informatica developers after implementing Informatica Big Data Edition.

Careerbuilder.com found in a survey there were 27,000 requests for Hadoop skills and only 3,000 resumes with Hadoop skills, whereas there are over 100,000 trained Informatica developers globally.

Page 7: Track B-1 建構新世代的智慧數據平台

Increase Developer Productivity
•  Informatica Developers are up to 5x more productive

Hadoop hand-coders: 4 weeks vs. Informatica developers: 4 days, with 2X performance!

Informatica Developers are 5x more productive based on customer POCs

Page 8: Track B-1 建構新世代的智慧數據平台

Why Informatica for Big Data & Hadoop

Informatica on Hadoop | Why Customers Care
Visual development environment | Increase productivity up to 5x over hand-coding
100K+ trained Informatica developers globally | Use existing & readily available skills for big data
200+ high-performance connectors (legacy & new) | Move all types of customer data into Hadoop faster
100+ pre-built transforms for ETL & data quality | Provide broadest out-of-box transformations on Hadoop
100+ pre-built parsers for complex data formats | Analyze and integrate all types of data faster
Vibe “Map Once, Deploy Anywhere” virtual data machine | An insurance policy as new data types and technologies change
Reference architectures to get started | Accelerate customer success with proven solution

Page 9: Track B-1 建構新世代的智慧數據平台

Unleash the Power of Hadoop: Informatica Developers are Now Hadoop Developers

[Diagram]
•  Sources: Machine Device, Cloud; Documents and Emails; Relational, Mainframe; Social Media, Web Logs
•  Into Hadoop: Archive, Load, Replicate, Stream (Services, Events, Topics)
•  On Hadoop: Profile, Parse, Cleanse, ETL, Match
•  Out of Hadoop: Load to Data Warehouse, Mobile Apps, Analytics & Op Dashboards, Alerts, Analytics Teams

Page 10: Track B-1 建構新世代的智慧數據平台

Maximize Your Return On Big Data

[Diagram]
•  Sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Email; Machine Device, Scientific
•  Pipeline stages (Hadoop & other NoSQL): Access & Ingest, Parse & Prepare, Discover & Profile, Transform & Cleanse, Extract & Deliver
•  Manage (i.e. Security, Performance, Governance, Collaboration)
•  Surrounding layers: Operational Systems, Analytical Systems, Data Assets, Data Products
•  Systems shown: OLTP, ODS, Data Warehouse, Data Mart, MDM

Page 11: Track B-1 建構新世代的智慧數據平台

Hadoop complements your existing infrastructure

Page 12: Track B-1 建構新世代的智慧數據平台

Data Ingestion and Extraction
•  Moving terabytes of data per hour

[Diagram]
•  Sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Email; Industry Standards; Machine Device, Scientific
•  Into the Low Cost Store: Replicate, Streaming, Batch Load, Archive
•  Out of the Low Cost Store: Extract to Data Warehouse, MDM, Applications

Page 13: Track B-1 建構新世代的智慧數據平台

Access All Types of Data
•  200+ High Performance Connectors, Pre-built Parsers for Specialized Data Formats

Messaging, and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods

Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel

Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC

Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, Binary Flat Files, Tape Formats…

Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP

Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, SWIFT, Cargo IMP, MVR

XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)

SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand

Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift

MPP Appliances: Pivotal, Vertica, Netezza, Teradata, Aster

Page 14: Track B-1 建構新世代的智慧數據平台

Cloud of Connectors

Page 15: Track B-1 建構新世代的智慧數據平台

Real-Time Data Collection and Streaming

Leverage High Performance Messaging Infrastructure: Publish with Ultra Messaging for global distribution without additional staging or landing.

[Diagram]
•  Sources: Web Servers, Operations Monitors, rsyslog, log files, JSON, TCP/UDP, HTTP, SLF4J, etc.
•  Sources: Handhelds, Smart Meters, etc. (Discrete Data Messages, MQTT); Internet of Things, Sensor Data
•  Transport: Ultra Messaging Bus (Publish / Subscribe) connecting the nodes
•  Zookeeper; Management and Monitoring
•  Targets: HDFS; PowerCenter Real-Time Edition, Rulepoint (CEP); NoSQL Databases: Cassandra

Transformations: Filtering, Timestamp, Static Text, Custom
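
The four transformation types listed above (Filtering, Timestamp, Static Text, Custom) can be pictured as small functions chained over a record stream. The following is a minimal sketch in plain Python, not the product's API; the function names, the `ingested_at` field, and the dictionary record shape are all assumptions made for illustration.

```python
import time

def filter_by(predicate):
    """Filtering: drop records for which predicate(record) is false."""
    def apply(record):
        return record if predicate(record) else None
    return apply

def add_timestamp(field="ingested_at", clock=time.time):
    """Timestamp: stamp each record with the collection time."""
    def apply(record):
        return {**record, field: clock()}
    return apply

def add_static_text(field, text):
    """Static Text: attach a constant value, e.g. the source host name."""
    def apply(record):
        return {**record, field: text}
    return apply

def run_pipeline(records, transforms):
    """Apply each transform in order; a transform returning None drops the record."""
    for record in records:
        current = record
        for transform in transforms:
            current = transform(current)
            if current is None:
                break
        else:
            yield current
```

A Custom transformation is simply any other function with the same shape: it takes a record and returns a record, or `None` to drop it.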

Page 16: Track B-1 建構新世代的智慧數據平台

Informatica Vibe Data Stream for Machine Data


•  High performance/efficient streaming data collection over LAN/WAN

•  GUI interface provides ease of configuration, deployment & use

•  Continuous ingestion of real-time generated data (sensors; logs; etc.). Machine generated & other data sources

•  Enable real-time interactions & response

•  Real-time delivery directly to multiple targets (batch/stream processing)

•  Highly available; efficient; scalable

•  Available ecosystem of lightweight agents (sources & targets)

Page 17: Track B-1 建構新世代的智慧數據平台

Streaming Analytics Complex Event Processing

Page 18: Track B-1 建構新世代的智慧數據平台

NoSQL Support for HBase


Read from HBase as standard source

Write to HBase as standard target

A complete mapping with an HBase source/target can execute on Hadoop

Sample HBase column families (Stored in JSON/complex formats)

Page 19: Track B-1 建構新世代的智慧數據平台

NoSQL Support for MongoDB

Access, integrate, transform & ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)

Access, integrate, transform, & ingest data into MongoDB

Sampling MongoDB data & flattening it to relational format
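
To make "flattening to relational format" concrete, here is a minimal sketch in plain Python that turns nested documents into rows with a shared column list. It does not use any MongoDB driver or Informatica component; the dotted column naming (`address.city`, `tags.0`) and the function names are assumptions chosen for illustration.

```python
def flatten(doc, prefix=""):
    """Flatten one nested document into a single-level dict with dotted column names."""
    row = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # Lists become positional columns: tags.0, tags.1, ...
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    row.update(flatten(item, prefix=f"{column}.{i}."))
                else:
                    row[f"{column}.{i}"] = item
        else:
            row[column] = value
    return row

def to_table(docs):
    """Return (columns, rows); fields missing from a document become None."""
    flat = [flatten(d) for d in docs]
    columns = sorted({c for row in flat for c in row})
    return columns, [[row.get(c) for c in columns] for row in flat]
```

Sampling matters because documents in one collection need not share a schema: the column list is only as complete as the documents inspected.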

Page 20: Track B-1 建構新世代的智慧數據平台

Big Data Parser: Easy Deployment of Industry Standards

Graphical representation highlighting data, segments, separators, and missing or invalid data

Import pre-built industry libraries and easily customize for specific needs

Support of Healthcare industry standards and more

Libraries are constantly maintained to ensure continued compliance
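
The idea of segments, separators, and missing data can be sketched with a toy delimited message. This is a simplified, HL7-like format invented for illustration; it does not implement the HL7 specification or the parser libraries mentioned above, and the segment names, separators, and required-field rules are assumptions.

```python
def parse_message(text, segment_sep="\n", field_sep="|"):
    """Split a delimited message into (segment_name, [fields]) pairs."""
    segments = []
    for line in text.strip().split(segment_sep):
        if not line:
            continue
        name, *fields = line.split(field_sep)
        segments.append((name, fields))
    return segments

def find_missing(segments, required):
    """Report (segment_name, position) for required fields that are absent or empty.

    `required` maps a segment name to 1-based field positions.
    """
    problems = []
    for name, fields in segments:
        for position in required.get(name, []):
            if position > len(fields) or fields[position - 1] == "":
                problems.append((name, position))
    return problems
```

For example, parsing `"PID|1001|DOE^JOHN|19800101\nOBX|1||120"` and requiring OBX fields 2 to 4 flags positions 2 (empty) and 4 (absent).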

Page 21: Track B-1 建構新世代的智慧數據平台

Big Data Parser on Taobao

Page 22: Track B-1 建構新世代的智慧數據平台

Hadoop Data Profiling Results

1. Profiling Stats: Min/Max Values, NULLs, Inferred Data Types, etc. Stats to identify outliers and anomalies in data.

2. Value & Pattern Analysis of Hadoop Data: Value and Pattern Frequency to isolate inconsistent/dirty data or unexpected patterns.

3. Drilldown Analysis (into Hadoop Data): Drill down into actual data values to inspect results across entire data set, including potential duplicates.

Hadoop Data Profiling results are exposed to anyone in the enterprise via browser.

Examples shown: CUSTOMER_ID, COUNTRY CODE
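
The profiling statistics named on this slide (min/max, NULLs, inferred data types, value and pattern frequency) are simple to compute for one column. The sketch below works on an in-memory list of strings rather than Hadoop data; the pattern notation (`9` for a digit, `A` for a letter) and all function names are assumptions for illustration.

```python
import re
from collections import Counter

def infer_type(values):
    """Infer a column type from its non-null string values."""
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "unknown"
    if all(re.fullmatch(r"[+-]?\d+", v) for v in non_null):
        return "integer"
    if all(re.fullmatch(r"[+-]?\d+(\.\d+)?", v) for v in non_null):
        return "decimal"
    return "string"

def pattern_of(value):
    """Reduce a value to its shape: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def profile(values):
    """Basic column profile; min/max compare strings lexicographically."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "nulls": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "inferred_type": infer_type(values),
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern_of(v) for v in non_null),
    }
```

On a country-code column, a rare pattern such as `A9` next to a dominant `AA` is the kind of unexpected pattern that points to dirty data.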

Page 23: Track B-1 建構新世代的智慧數據平台

Execute Data Quality on Hadoop
•  Big Data cleansing, deduplication, parsing

•  Address Validation: Address Validation and Geocoding enrichment across 260 countries
•  Standardize: Standardization and Reference Data Management
•  Parsing: Parsing of Unstructured Data/Text Fields of all types of data (customer/product/social/logs)
•  Matching: Probabilistic or Deterministic Matching

DQ logic pushed down/run natively ON Hadoop
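
The difference between deterministic and score-based matching can be shown in a few lines. This is a minimal sketch using Python's standard `difflib`; a plain string-similarity ratio stands in for a real probabilistic matcher, which would typically weigh several fields. The normalization rules and the 0.85 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Standardize before matching: lowercase, drop dots and hyphens, collapse spaces."""
    return " ".join(name.lower().replace(".", " ").replace("-", " ").split())

def deterministic_match(a, b):
    """Deterministic: records match only if their normalized keys are identical."""
    return normalize(a) == normalize(b)

def similarity(a, b):
    """Score in [0, 1]; a stand-in for a probabilistic match score."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_match(a, b, threshold=0.85):
    """Score-based: records match when the similarity reaches the threshold."""
    return similarity(a, b) >= threshold
```

Deterministic rules are predictable but miss spelling variants such as "Jon Smith" and "John Smith"; a score with a threshold catches them at the cost of tuning and occasional false positives.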

Page 24: Track B-1 建構新世代的智慧數據平台

Data Quality Taiwan Address

Page 25: Track B-1 建構新世代的智慧數據平台

Cross-language matching

Arabic:

Abdulaziz A/Rahman Al Sugair عبدالعزيز عبدالرحمن الصقير

Abd. A.Rhman Hammed Al-Shuqair عبدالله عبدالرحمن حمد الشقي

عبدالعزيز عبدالله الشقير عبدالعزيز بن محمد الصق Abdulrahman Abdullah A.Alshegri

Japanese:

Toyotomi Hideyoshi 豊臣秀吉 トヨトミヒデヨシ とよとみひでよし

上本町207 シャトー上本町303 シャトー上本町303 兵庫県 小野市 上本町207 上本町303 シャトー上本町33 兵庫県 野市
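
One small piece of the Japanese example, treating the katakana and hiragana spellings of a name as the same reading, can be sketched with the standard library. This is an illustration only, not the product's matching algorithm; it relies on the fact that katakana U+30A1 to U+30F6 sit exactly 0x60 above their hiragana counterparts. Matching the kanji form (豊臣秀吉) would need a reading dictionary and is out of scope here.

```python
import unicodedata

def katakana_to_hiragana(text):
    """Fold katakana into hiragana; NFKC first widens half-width katakana."""
    folded = []
    for ch in unicodedata.normalize("NFKC", text):
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:
            folded.append(chr(code - 0x60))
        else:
            folded.append(ch)
    return "".join(folded)

def same_reading(a, b):
    """True when two kana strings are identical after folding to hiragana."""
    return katakana_to_hiragana(a) == katakana_to_hiragana(b)
```

With this folding, `same_reading("トヨトミヒデヨシ", "とよとみひでよし")` is true, while Latin text passes through unchanged.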

Page 26: Track B-1 建構新世代的智慧數據平台

Cross-language matching example

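Traditional Chinese to Simplified Chinese (繁簡)

Simplified Chinese to English (簡英)

Simplified Chinese to English, Cantonese (簡英(廣東))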

Page 27: Track B-1 建構新世代的智慧數據平台

Data Integration & Quality on Hadoop

1.  Entire Informatica mapping translated to Hive Query Language (Hive-QL)

2.  Optimized HQL converted to MapReduce & submitted to Hadoop cluster (job tracker).

3.  Advanced mapping transformations executed on Hadoop through User Defined Functions (UDF) using Vibe

Hive-QL:

FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
         FROM lineitem GROUP BY L_ORDERKEY ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
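
As a rough illustration of the first step, a mapping translated to Hive-QL, the sketch below renders a tiny declarative mapping description into an `INSERT OVERWRITE ... SELECT` statement. It is not Informatica's translator: the `Mapping` class, its fields, and the `to_hiveql` function are hypothetical names invented for this example, and only a single-source aggregate mapping is covered.

```python
from dataclasses import dataclass

@dataclass
class Mapping:
    """Hypothetical description of a one-source, one-target aggregate mapping."""
    source: str
    target: str
    group_by: list
    aggregates: dict  # output column name -> aggregate expression
    where: str = ""

def to_hiveql(mapping):
    """Render the mapping as a Hive-QL INSERT OVERWRITE statement."""
    select_cols = list(mapping.group_by) + [
        f"{expr} AS {name}" for name, expr in mapping.aggregates.items()
    ]
    lines = [
        f"INSERT OVERWRITE TABLE {mapping.target}",
        f"SELECT {', '.join(select_cols)}",
        f"FROM {mapping.source}",
    ]
    if mapping.where:
        lines.append(f"WHERE {mapping.where}")
    if mapping.group_by:
        lines.append(f"GROUP BY {', '.join(mapping.group_by)}")
    return "\n".join(lines) + ";"
```

The point of the design is that the developer edits the mapping description, never the generated query; the query text is a build artifact that the cluster turns into MapReduce jobs.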

Page 28: Track B-1 建構新世代的智慧數據平台

Configure Mapping for Hadoop Execution

No need to redesign mapping logic to execute on either Traditional or Hadoop infrastructure.

Configure where the integration logic should run – Hadoop or Native
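
The principle here, one piece of logic with the runtime chosen by configuration, can be sketched as follows. This is a toy model rather than the product's execution engine: the `Runtime` enum and `run_mapping` function are hypothetical, and the "Hadoop" branch only simulates partitioned execution in-process instead of submitting a job to a cluster.

```python
from enum import Enum

class Runtime(Enum):
    NATIVE = "native"
    HADOOP = "hadoop"

def standardize_country(code):
    """Example mapping logic: trim whitespace and uppercase a country code."""
    return code.strip().upper()

def run_mapping(logic, records, runtime, partitions=2):
    """Run the same logic natively or over simulated partitions."""
    if runtime is Runtime.NATIVE:
        return [logic(r) for r in records]
    # Simulated distributed run: split the input, process each chunk, then
    # concatenate. Output order may differ from the native run.
    chunks = [records[i::partitions] for i in range(partitions)]
    return [logic(r) for chunk in chunks for r in chunk]
```

Because `standardize_country` never refers to the runtime, switching between the two modes is a configuration change, not a redesign.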

Page 29: Track B-1 建構新世代的智慧數據平台

Mixed Workflow Orchestration
One workflow running tasks on Hadoop and local environments

Workflow tasks: Cmd_ChooseLoadPath, MT_Load2Hadoop + Parse, Cmd_Load2Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_DataAnalysis, Notification

List of variables:

Name | Type | Default Value | Description
$User.LoadOptionPath | Integer | 2 | Load path for workflow, depending on output of cmd task
$User.DataSourceConnection | String | HiveSourceConnection | Source connection object
$User.ProfileResult | Integer | 100 | Output from “profiling” command task.
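
A workflow variable such as `$User.LoadOptionPath` typically drives a branch in the task graph. The sketch below models that decision in plain Python. The task and variable names come from the slide, but the branching rule (path 1 runs the combined load-and-parse task, any other value runs the two separate tasks) and the task order are assumptions, since the slide does not spell them out.

```python
# Default values as listed in the slide's variable table.
DEFAULT_VARIABLES = {
    "$User.LoadOptionPath": 2,
    "$User.DataSourceConnection": "HiveSourceConnection",
    "$User.ProfileResult": 100,
}

def plan_workflow(variables=None):
    """Return the ordered task names for one run (the branch rule is assumed)."""
    merged = {**DEFAULT_VARIABLES, **(variables or {})}
    steps = ["Cmd_ChooseLoadPath"]
    if merged["$User.LoadOptionPath"] == 1:
        steps.append("MT_Load2Hadoop+Parse")
    else:
        steps += ["Cmd_Load2Hadoop", "MT_Parse"]
    steps += ["Cmd_ProfileData", "MT_Cleanse", "MT_DataAnalysis", "Notification"]
    return steps
```

With the defaults, the plan takes the two-step load path; overriding `$User.LoadOptionPath` to 1 switches to the combined task without touching the rest of the workflow.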

Page 30: Track B-1 建構新世代的智慧數據平台

Unified Administration: Single Place to Manage & Monitor

Full traceability from workflow to MapReduce jobs

View generated Hive scripts

Page 31: Track B-1 建構新世代的智慧數據平台
Page 32: Track B-1 建構新世代的智慧數據平台

Map Once. Deploy Anywhere.

Deployment targets: ON PREMISE, HADOOP, 3rd PARTY APPLICATIONS, CLOUD

Page 33: Track B-1 建構新世代的智慧數據平台