bigdata.pervasive.com •+1.855.356.DATA A Visual Workbench for Big Data Analytics on Hadoop

Feb 2013 HUG: A Visual Workbench for Big Data Analytics on Hadoop


DESCRIPTION

Two of the major barriers to effective Hadoop deployments in the enterprise are the complexity and limited applicability of MapReduce. Software developers with Hadoop and MapReduce experience are in short supply, which slows big data initiatives. Delivering faster results across a broad range of analytic scenarios requires working at a higher level of abstraction, supported by new programming paradigms and tools. In this talk we present one such approach, based on our experience developing a visual workbench for big data analytics on Hadoop. This approach enables data scientists and analysts to build and execute complex big data workflows for Hadoop with minimal training and without MapReduce knowledge. Libraries of pre-built operators for data preparation and analytics reduce the time and effort required to develop big data projects on Hadoop. The framework is extensible, allowing new operators to be added as needed. The efficiency of the underlying dataflow framework shortens run times, allowing faster iterations of discovery and analysis.

Presenter: Jim Falgout, Chief Technologist, Pervasive Big Data & Analytics



Visual Workbench for Hadoop

• Agenda
– Pervasive Software
– History of DataRush
– Dataflow Concepts
– Hadoop Integration
– Demo
– Performance Testing


Who is Pervasive?

Global Software Company
• Tens of thousands of users across the globe
• Operations in Americas, EMEA, Asia
• ~260 employees

Strong Financials
• $51 million revenue (trailing 12-month)
• 48 consecutive quarters of profitability
• $46 million in the bank
• NASDAQ:PVSW since 1997

Leader in Data Innovation
• 25% of top-line revenue re-invested in R&D
• Software to manage, integrate and analyze data, in the cloud or on-premises, throughout the entire data lifecycle


History of DataRush

• Initially developed as next-gen data engine for integration

• Requirements
– High data throughput
– Scalable (data, multicore)
– Based on dataflow concepts
– Component based architecture
– Easy to extend
– Easily fits in visual development environment
• Embedded in Pervasive products (DataProfiler)
• Extended with SDK for more general use


Dataflow Concepts

• Operators (nodes) linked together in a directed graph
• Data flows along edges
• Shared nothing architecture
• Provides pipeline parallelism
• Supports data parallelism
• Data scalable
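The dataflow concepts on this slide (operators linked in a directed graph, with data flowing along the edges) can be illustrated with a minimal Python sketch. It is not the DataRush API: the operator names are borrowed from the execution-plan slide, and the row fields `quantity`, `price` and `total` are hypothetical. Generators only show the row-at-a-time streaming between operators; a real dataflow engine would also run the operators concurrently to get pipeline parallelism.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, float]


def reader(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: emits rows one at a time onto its output edge."""
    for row in rows:
        yield row


def filter_rows(rows: Iterable[Row], min_quantity: float) -> Iterator[Row]:
    """Passes through only the rows whose quantity meets the threshold."""
    for row in rows:
        if row["quantity"] >= min_quantity:
            yield row


def derive_fields(rows: Iterable[Row]) -> Iterator[Row]:
    """Adds a computed 'total' field to every row it receives."""
    for row in rows:
        yield {**row, "total": row["quantity"] * row["price"]}


def run_pipeline(rows: Iterable[Row], min_quantity: float = 2) -> List[Row]:
    # Wiring the generators together forms the graph's edges. No operator
    # holds the whole data set; each row streams through all three stages.
    graph = derive_fields(filter_rows(reader(rows), min_quantity))
    return list(graph)


SAMPLE_ROWS: List[Row] = [
    {"quantity": 1, "price": 10.0},
    {"quantity": 5, "price": 2.0},
    {"quantity": 8, "price": 1.5},
]

if __name__ == "__main__":
    for row in run_pipeline(SAMPLE_ROWS):
        print(row)
```

Because each operator only sees its input edge and shares no state with the others, the same graph could in principle be replicated over partitions of the data, which is the data parallelism the slide mentions.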


Compilation to Execution Plan

[Diagram: execution plan in two phases, each shown as four parallel copies of the same chain of operators.]

Phase 1 (x4): Reader, FilterRows, DeriveFields, Group(partial)
Phase 2 (x4): Repartition, Group(final), Writer

Compiled to a set of physical graphs
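The two-phase plan above follows a common pattern: each parallel stream aggregates its own partition, the partial results are repartitioned by key, and a final aggregation combines them. The Python sketch below illustrates that pattern under stated assumptions; it is not DataRush code. It assumes a sum aggregate, and the function names and the byte-sum bucketing rule are hypothetical stand-ins.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]


def _sum_by_key(records: Iterable[Record]) -> Dict[str, int]:
    sums: Dict[str, int] = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return dict(sums)


def group_partial(partition: List[Record]) -> Dict[str, int]:
    """Phase 1: aggregate one partition without looking at any other."""
    return _sum_by_key(partition)


def bucket(key: str, workers: int) -> int:
    """Deterministic stand-in for a hash partitioner (assumption)."""
    return sum(key.encode("utf-8")) % workers


def repartition(partials: List[Dict[str, int]], workers: int) -> List[List[Record]]:
    """Route every partial result for a given key to the same worker."""
    routed: List[List[Record]] = [[] for _ in range(workers)]
    for partial in partials:
        for key, value in partial.items():
            routed[bucket(key, workers)].append((key, value))
    return routed


def group_final(records: List[Record]) -> Dict[str, int]:
    """Phase 2: combine the partial sums that arrived for each key."""
    return _sum_by_key(records)


def run_plan(partitions: List[List[Record]], final_workers: int) -> Dict[str, int]:
    partials = [group_partial(p) for p in partitions]       # Phase 1
    routed = repartition(partials, final_workers)           # phase boundary
    finals = [group_final(records) for records in routed]   # Phase 2
    merged: Dict[str, int] = {}
    for final in finals:
        merged.update(final)  # keys are disjoint across final workers
    return merged


PARTITIONS: List[List[Record]] = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
    [("a", 7)],
]

if __name__ == "__main__":
    print(run_plan(PARTITIONS, final_workers=2))
```

Aggregating before the repartition means only one record per key per partition crosses the phase boundary, which is the usual reason for splitting a grouping into partial and final steps.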


Operator Library


KNIME

• KNIME
– Open source analytics workflow tool for the desktop
– Web site: www.knime.org
– Supports team collaboration and resource sharing:
    • KNIME Teamspace
    • KNIME Server
    • KNIME Report
• Integrated with DataRush
– DataRush dataflow executor integrated as a plug-in extension
– Includes DataRush operators
– Product: RushAnalytics for KNIME


DataRush + KNIME


Integration with Hadoop

• Data Level
– HDFS access
    • File system abstraction: works with all I/O operators
    • Distributed execution: uses splits much like MR
– HBase
    • Temporal key-value data store based on column families
    • Fast loading using HFile integration
    • Fast temporal queries
• Execution
– Distributed execution uses distributed DataRush engines (not MapReduce)
– Integrating with YARN for resource sharing


Distributed Execution

[Architecture diagram. Components: Client, ClusterManager, NodeManager, Executor, PerfMonitor, Web Browser, HDFS. Graphs shown: Local Phase Graph, Phase Graph. Edge labels: Initiates Job, Allocates Resources, Spawns, Data.]


Distributed I/O

• Allows downstream operators to be parallelized

• Parallelization concepts are the same whether the graph is run locally or distributed

[Diagram: one AssignSplits operator and four ReadSplit operators.]
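The split-based reading on this slide can be sketched in a few lines of Python. This is an illustration, not the DataRush operators: it assumes a file is cut into fixed-size byte ranges that are handed to parallel readers round-robin. The function names are hypothetical, and a real implementation on HDFS would normally also take block locations into account.

```python
from typing import List, Tuple

Split = Tuple[int, int]  # (byte offset, length)


def make_splits(file_size: int, split_size: int) -> List[Split]:
    """Cut a file of file_size bytes into consecutive byte ranges."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


def assign_splits(splits: List[Split], reader_count: int) -> List[List[Split]]:
    """Distribute splits over readers round-robin (assumed policy)."""
    assignments: List[List[Split]] = [[] for _ in range(reader_count)]
    for index, split in enumerate(splits):
        assignments[index % reader_count].append(split)
    return assignments


if __name__ == "__main__":
    splits = make_splits(file_size=1000, split_size=300)
    for reader_id, work in enumerate(assign_splits(splits, reader_count=3)):
        print(f"reader {reader_id}: {work}")
```

Each reader then produces an independent stream of rows, so the operators downstream of it can run in parallel whether the readers sit in one process or on different nodes.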


Demo


Performance Test

• DataRush versus PIG
– Used TPC-H data
– Generated 1TB data set in HDFS
– Ran several “queries” coded in DataRush and PIG
– Run times in seconds (smaller is better)

Cluster Configuration:
• 5 worker nodes
• 2 X Intel E5-2650 (8 core)
• 64GB RAM
• 24 X 1TB SATA 7200 rpm

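TPC-H : 1 Terabyte Test : Run times in seconds

Query    PIG     DataRush
Q1       2036    401
Q3       1414    660
Q6       363     273
Q9       2356    1198
Q10      1027    626
Q18      1742    543
Q21      3528    892

(Values recovered from a bar chart. The lower series is taken to be DataRush, consistent with the summary's statement that DataRush outperformed comparable PIG scripts.)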
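The per-query speedups implied by the chart can be worked out with a short Python sketch. It assumes the chart's values read as seven PIG run times followed by seven DataRush run times, in the query order shown. That pairing is an inference from the summary's claim that DataRush outperformed PIG, not something the slide states outright.

```python
from typing import Dict, List

QUERIES: List[str] = ["Q1", "Q3", "Q6", "Q9", "Q10", "Q18", "Q21"]

# Run times in seconds as they appear on the slide; which series belongs
# to which tool is an assumption (see the paragraph above).
PIG_SECONDS: List[int] = [2036, 1414, 363, 2356, 1027, 1742, 3528]
DATARUSH_SECONDS: List[int] = [401, 660, 273, 1198, 626, 543, 892]


def speedups(slow: List[int], fast: List[int]) -> Dict[str, float]:
    """Ratio of slow to fast run time for each query."""
    return {q: s / f for q, s, f in zip(QUERIES, slow, fast)}


def overall_speedup(slow: List[int], fast: List[int]) -> float:
    """Ratio of total run time across all queries."""
    return sum(slow) / sum(fast)


if __name__ == "__main__":
    for query, ratio in speedups(PIG_SECONDS, DATARUSH_SECONDS).items():
        print(f"{query}: {ratio:.2f}x")
    print(f"all queries: {overall_speedup(PIG_SECONDS, DATARUSH_SECONDS):.2f}x")
```

Under that reading the gap is largest on Q1 and smallest on Q6. Total run time across the seven queries is 12466 seconds against 4593 seconds.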


DataRush/RushAnalytics Solutions

• Opera Solutions
– Data science solutions provider
– Embedding DataRush in engineered solutions
• Healthcare
– Claims cleansing & processing
• Retail
– Market basket analysis
– Product category resolution (MDM)
• Telecom
– CDR processing & analysis

“Pervasive DataRush’s efficiency and ability to automatically scale, whether on a single server or a Hadoop cluster, supports our vision for consistent, reusable, scalable Big Data analytics.”

– Armando Escalante, Chief Operating Officer, Opera Solutions


Summary

• Easy development of Hadoop workloads
– Using drag-and-drop desktop GUI
– Team oriented - Supports collaboration with others
– No code to write - MapReduce included
• Scalable Execution
– Executes within Hadoop cluster
– Scales from desktop to server to cluster with no workflow changes
– Scales as cluster does
– Handles small to very large data sizes
– TPC-H performance testing shows improved performance over comparable PIG scripts


Questions?

• My contact info:

[email protected]

@jimfalgout

• Website

bigdata.pervasive.com