Feb 2013 HUG: A Visual Workbench for Big Data Analytics on Hadoop

bigdata.pervasive.com •+1.855.356.DATA

A Visual Workbench for Big Data Analytics on Hadoop


Visual Workbench for Hadoop

• Agenda
  – Pervasive Software
  – History of DataRush
  – Dataflow Concepts
  – Hadoop Integration
  – Demo
  – Performance Testing


Who is Pervasive?

Global Software Company
• Tens of thousands of users across the globe
• Operations in Americas, EMEA, Asia
• ~260 employees

Strong Financials
• $51 million revenue (trailing 12-month)
• 48 consecutive quarters of profitability
• $46 million in the bank
• NASDAQ:PVSW since 1997

Leader in Data Innovation
• 25% of top-line revenue re-invested in R&D
• Software to manage, integrate and analyze data, in the cloud or on-premises, throughout the entire data lifecycle


History of DataRush

• Initially developed as next-gen data engine for integration

• Requirements
  – High data throughput
  – Scalable (data, multicore)
  – Based on dataflow concepts
  – Component based architecture
  – Easy to extend
  – Easily fits in visual development environment

• Embedded in Pervasive products (DataProfiler)
• Extended with SDK for more general use


Dataflow Concepts

• Operators (nodes) linked together in a directed graph
• Data flows along edges
• Shared nothing architecture
• Provides pipeline parallelism
• Supports data parallelism
• Data scalable
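To make the dataflow idea concrete, here is a minimal sketch in plain Python (not the DataRush API). Each operator runs in its own thread and shares no state with the others; rows travel along edges modeled as bounded queues. All names (`source`, `transform`, `sink`, `run_pipeline`) are hypothetical. Python threads only illustrate the structure here; they do not give true multicore speedup.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream on an edge


def source(rows, out_q):
    """Reader-like operator: pushes each row onto its output edge."""
    for row in rows:
        out_q.put(row)
    out_q.put(_DONE)


def transform(fn, in_q, out_q):
    """Generic operator: applies fn to each row; fn returns None to drop a row."""
    while True:
        row = in_q.get()
        if row is _DONE:
            out_q.put(_DONE)
            return
        result = fn(row)
        if result is not None:
            out_q.put(result)


def sink(in_q, results):
    """Writer-like operator: collects rows arriving on its input edge."""
    while True:
        row = in_q.get()
        if row is _DONE:
            return
        results.append(row)


def run_pipeline(rows):
    # Edges are bounded queues, so a fast producer blocks instead of buffering everything.
    e1, e2, e3 = (queue.Queue(maxsize=4) for _ in range(3))
    results = []
    operators = [
        threading.Thread(target=source, args=(rows, e1)),
        threading.Thread(target=transform, args=(lambda r: r if r % 2 == 0 else None, e1, e2)),
        threading.Thread(target=transform, args=(lambda r: r * 10, e2, e3)),
        threading.Thread(target=sink, args=(e3, results)),
    ]
    for t in operators:
        t.start()
    for t in operators:
        t.join()
    return results
```

Calling `run_pipeline(range(10))` keeps the even numbers and multiplies each by ten. All four operators are active at once, each working on a different row, which is the pipeline parallelism named in the bullets.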



Compilation to Execution Plan

• Compiled to a set of physical graphs
• Phase 1: Reader → FilterRows → DeriveFields → Group(partial)
• Phase 2: Repartition → Group(final) → Writer
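The split of a grouping into a partial step, a repartition, and a final step can be sketched in plain Python. This is an illustrative sketch of the general two-phase aggregation pattern, assuming a simple sum per key; the function names are hypothetical and are not DataRush operators.

```python
from collections import defaultdict


def partial_group(partition):
    """Phase 1: each partition sums values per key locally."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return dict(acc)


def repartition(partials, n_targets):
    """Route each (key, partial_sum) to a target partition chosen by key hash."""
    targets = [[] for _ in range(n_targets)]
    for partial in partials:
        for key, value in partial.items():
            targets[hash(key) % n_targets].append((key, value))
    return targets


def final_group(partition):
    """Phase 2: merge partial sums; all rows for a given key are now co-located."""
    return partial_group(partition)


def two_phase_sum(partitions, n_targets=2):
    partials = [partial_group(p) for p in partitions]
    merged = {}
    for target in repartition(partials, n_targets):
        # Keys are disjoint across targets, so update() never overwrites a result.
        merged.update(final_group(target))
    return merged
```

The partial step shrinks the data before it crosses partition boundaries, so the repartition moves one row per key per partition rather than every input row.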


Operator Library


KNIME

• KNIME
  – Open source analytics workflow tool for the desktop
  – Web site: www.knime.org
  – Supports team collaboration and resource sharing:
    • KNIME Teamspace
    • KNIME Server
    • KNIME Report

• Integrated with DataRush
  – DataRush dataflow executor integrated as a plug-in extension
  – Includes DataRush operators
  – Product: RushAnalytics for KNIME


DataRush + KNIME


Integration with Hadoop

• Data Level
  – HDFS access
    • File system abstraction – works with all I/O operators
    • Distributed execution – uses splits much like MR
  – HBase
    • Temporal key-value data store based on column families
    • Fast loading using HFile integration
    • Fast temporal queries

• Execution
  – Distributed execution uses distributed DataRush engines (not MapReduce)
  – Integrating with YARN for resource sharing


Distributed Execution

[Diagram. Components: Client, ClusterManager, NodeManager, Executor, PerfMonitor, Web Browser, HDFS. Labels: Initiates Job, Allocates Resources, Spawns, Data, Local Phase Graph.]


Distributed I/O

• Allows downstream operators to be parallelized

• Parallelization concepts are the same whether the graph is run locally or distributed

[Diagram: one AssignSplits operator and four ReadSplit operators.]
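As a rough illustration of split-based reading, the sketch below cuts a file into byte ranges and deals them out to a fixed number of readers. It is plain Python under simplifying assumptions: the names `assign_splits` and `read_split` are hypothetical, assignment is round-robin, and record boundaries are ignored (a real split reader must handle records that straddle two splits).

```python
def assign_splits(file_size, split_size, n_readers):
    """Cut [0, file_size) into byte ranges and deal them round-robin to readers."""
    splits = [
        (start, min(start + split_size, file_size))
        for start in range(0, file_size, split_size)
    ]
    assignments = [[] for _ in range(n_readers)]
    for i, split in enumerate(splits):
        assignments[i % n_readers].append(split)
    return assignments


def read_split(data, split):
    """Return the bytes covered by one (start, end) split."""
    start, end = split
    return data[start:end]
```

Because each reader works only on its own splits, the operators downstream of each reader can run in parallel, and the same logic applies whether the readers are threads on one machine or processes across a cluster.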


Demo


Performance Test

• DataRush versus PIG
  – Used TPC-H data
  – Generated 1TB data set in HDFS
  – Ran several “queries” coded in DataRush and PIG
  – Run times in seconds (smaller is better)

Cluster Configuration:
• 5 worker nodes
• 2 X Intel E5-2650 (8 core)
• 64GB RAM
• 24 X 1TB SATA 7200 rpm

TPC-H : 1 Terabyte Test : Run times (seconds)

Query   DataRush   PIG
Q1      401        2036
Q3      660        1414
Q6      273        363
Q9      1198       2356
Q10     626        1027
Q18     543        1742
Q21     892        3528
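The run times charted above can be turned into per-query speedups with a few lines of Python. This sketch assumes that the larger value for each query belongs to PIG and the smaller to DataRush, which matches the deck's summary claim but is a reading of the chart rather than something the slide states outright.

```python
# Run times in seconds for the 1 TB TPC-H test, as read from the chart.
# Assumption: the larger series is PIG and the smaller series is DataRush.
PIG = {"Q1": 2036, "Q3": 1414, "Q6": 363, "Q9": 2356,
       "Q10": 1027, "Q18": 1742, "Q21": 3528}
DATARUSH = {"Q1": 401, "Q3": 660, "Q6": 273, "Q9": 1198,
            "Q10": 626, "Q18": 543, "Q21": 892}


def speedups(baseline, candidate):
    """Per-query ratio of baseline run time to candidate run time."""
    return {q: baseline[q] / candidate[q] for q in baseline}


def total_speedup(baseline, candidate):
    """Ratio of summed run times across all queries."""
    return sum(baseline.values()) / sum(candidate.values())


if __name__ == "__main__":
    for query, ratio in speedups(PIG, DATARUSH).items():
        print(f"{query}: {ratio:.2f}x")
    print(f"Total: {total_speedup(PIG, DATARUSH):.2f}x")
```

Under that assumption the summed run times are 12466 s for PIG and 4593 s for DataRush, roughly a 2.7x difference overall, with the gap varying considerably from query to query.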


DataRush/RushAnalytics Solutions

• Opera Solutions
  – Data science solutions provider
  – Embedding DataRush in engineered solutions

• Healthcare
  – Claims cleansing & processing

• Retail
  – Market basket analysis
  – Product category resolution (MDM)

• Telecom
  – CDR processing & analysis

“Pervasive DataRush’s efficiency and ability to automatically scale, whether on a single server or a Hadoop cluster, supports our vision for consistent, reusable, scalable Big Data analytics.”

– Armando Escalante, Chief Operating Officer, Opera Solutions


Summary

• Easy development of Hadoop workloads
  – Using drag-and-drop desktop GUI
  – Team oriented - Supports collaboration with others
  – No code to write - MapReduce included

• Scalable Execution
  – Executes within Hadoop cluster
  – Scales from desktop to server to cluster with no workflow changes
  – Scales as cluster does
  – Handles small to very large data sizes
  – TPC-H performance testing shows improved performance over comparable PIG scripts


Questions?

• My contact info:

[email protected]

@jimfalgout

• Website

bigdata.pervasive.com
