DESCRIPTION
Two of the major barriers to effective Hadoop deployments in the enterprise are the complexity and limited applicability of MapReduce. Software developers with Hadoop and MapReduce experience are in short supply, slowing big data initiatives. Delivering faster results across a broad range of analytic scenarios requires working at a higher level of abstraction, supported by new programming paradigms and tools. In this talk we present one such approach, based on our experience developing a visual workbench for big data analytics on Hadoop. This approach enables data scientists and analysts to build and execute complex big data workflows for Hadoop with minimal training and without MapReduce knowledge. Libraries of pre-built operators for data preparation and analytics reduce the time and effort required to develop big data projects on Hadoop. The framework is extensible, allowing the addition of new operators as needed. Because the underlying dataflow framework is efficient, run times are shorter, allowing faster iterations of discovery and analysis.
Presenter: Jim Falgout, Chief Technologist, Pervasive Big Data & Analytics
bigdata.pervasive.com •+1.855.356.DATA
A Visual Workbench for Big Data Analytics on Hadoop
Visual Workbench for Hadoop
• Agenda
  – Pervasive Software
  – History of DataRush
  – Dataflow Concepts
  – Hadoop Integration
  – Demo
  – Performance Testing
Who is Pervasive?
Global Software Company
• Tens of thousands of users across the globe
• Operations in Americas, EMEA, Asia
• ~260 employees
Strong Financials
• $51 million revenue (trailing 12-month)
• 48 consecutive quarters of profitability
• $46 million in the bank
• NASDAQ:PVSW since 1997
Leader in Data Innovation
• 25% of top-line revenue re-invested in R&D
• Software to manage, integrate and analyze data, in the cloud or on-premises, throughout the entire data lifecycle
History of DataRush
• Initially developed as next-gen data engine for integration
• Requirements
  – High data throughput
  – Scalable (data, multicore)
  – Based on dataflow concepts
  – Component based architecture
  – Easy to extend
  – Easily fits in visual development environment
• Embedded in Pervasive products (DataProfiler)
• Extended with SDK for more general use
Dataflow Concepts
• Operators (nodes) linked together in a directed graph
• Data flows along edges
• Shared nothing architecture
• Provides pipeline parallelism
• Supports data parallelism
• Data scalable
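The dataflow idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the DataRush API: the operator names `reader`, `filter_rows` and `derive_field`, the `run_pipeline` helper, and the sample rows are all hypothetical. Each operator is a generator, and each generator handoff stands in for an edge of the graph. Rows stream through one at a time, which models the pipelined handoff between operators; the stages here still run in a single thread, so the sketch shows the structure rather than real parallel execution.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]


def reader(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator (hypothetical): emits rows one at a time."""
    for row in rows:
        yield row


def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Passes along only the rows for which predicate(row) is true."""
    for row in rows:
        if predicate(row):
            yield row


def derive_field(rows: Iterable[Row], name: str, fn: Callable[[Row], float]) -> Iterator[Row]:
    """Adds a computed field to a copy of each row; inputs are not mutated."""
    for row in rows:
        out = dict(row)
        out[name] = fn(row)
        yield out


def run_pipeline(data: Iterable[Row]) -> List[Row]:
    """Links three operators into a linear graph and drains the output."""
    source = reader(data)
    kept = filter_rows(source, lambda r: r["qty"] > 0)
    derived = derive_field(kept, "total", lambda r: r["qty"] * r["price"])
    return list(derived)


# Illustrative sample data (not from the talk).
SAMPLE: List[Row] = [
    {"qty": 2, "price": 5.0},
    {"qty": 0, "price": 9.0},
    {"qty": 3, "price": 1.5},
]
```

Because no operator shares state with another, each one could in principle be moved to its own thread or process without changing the graph, which is the property the "shared nothing" bullet refers to.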
Compilation to Execution Plan
• Compiled to a set of physical graphs
[Diagram: each phase is shown as four parallel instances]
Phase 1: Reader → FilterRows → DeriveFields → Group(partial)
Phase 2: Repartition → Group(final) → Writer
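The two phases in the diagram follow a common pattern for parallel aggregation: aggregate within each partition first, redistribute the partial results by key, then finish the aggregation. The sketch below shows that pattern for a sum, under stated assumptions. The function names (`group_partial`, `repartition`, `group_final`, `two_phase_sum`) echo the operator labels on the slide but are hypothetical and are not DataRush code, and a CRC32 hash is assumed for assigning keys to buckets.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # (group key, value)


def bucket_of(key: str, n_buckets: int) -> int:
    """Deterministic key-to-bucket mapping (assumed; CRC32 modulo bucket count)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets


def group_partial(partition: Iterable[Record]) -> Dict[str, int]:
    """Phase 1: sum values per key within a single partition."""
    sums: Dict[str, int] = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def repartition(partials: List[Dict[str, int]], n_buckets: int) -> List[List[Record]]:
    """Routes every partial sum for a given key to the same bucket."""
    buckets: List[List[Record]] = [[] for _ in range(n_buckets)]
    for partial in partials:
        for key, value in partial.items():
            buckets[bucket_of(key, n_buckets)].append((key, value))
    return buckets


def group_final(bucket: Iterable[Record]) -> Dict[str, int]:
    """Phase 2: combine the partial sums that arrived in one bucket."""
    return group_partial(bucket)


def two_phase_sum(partitions: List[List[Record]], n_buckets: int) -> Dict[str, int]:
    """Runs both phases sequentially and merges the per-bucket results."""
    partials = [group_partial(p) for p in partitions]
    buckets = repartition(partials, n_buckets)
    merged: Dict[str, int] = {}
    for bucket in buckets:
        merged.update(group_final(bucket))  # buckets hold disjoint key sets
    return merged
```

The partial step shrinks the data before it crosses the repartition boundary, so only one record per key per partition has to move between workers.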
Operator Library
KNIME
• KNIME
  – Open source analytics workflow tool for the desktop
  – Web site: www.knime.org
  – Supports team collaboration and resource sharing:
    • KNIME Teamspace
    • KNIME Server
    • KNIME Report
• Integrated with DataRush
  – DataRush dataflow executor integrated as a plug-in extension
  – Includes DataRush operators
  – Product: RushAnalytics for KNIME
DataRush + KNIME
Integration with Hadoop
• Data Level
  – HDFS access
    • File system abstraction – works with all I/O operators
    • Distributed execution – uses splits much like MapReduce
  – HBase
    • Temporal key-value data store based on column families
    • Fast loading using HFile integration
    • Fast temporal queries
• Execution
  – Distributed execution uses distributed DataRush engines (not MapReduce)
  – Integrating with YARN for resource sharing
Distributed Execution
[Diagram. Components: Client, ClusterManager, NodeManager, Executor, PerfMonitor, Web Browser, HDFS. Labels: Initiates Job, Allocates Resources, Spawns, Data, Local, Phase Graph (×2).]
Distributed I/O
• Allows downstream operators to be parallelized
• Parallelization concepts are the same whether the graph is run locally or distributed
[Diagram: one AssignSplits operator and four parallel ReadSplit operators]
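A rough sketch of the split-based reading shown on this slide follows. It assumes fixed-size byte ranges and round-robin assignment, neither of which is stated in the talk. The names `make_splits`, `assign_splits` and `read_split` are hypothetical stand-ins for the AssignSplits and ReadSplit boxes, and an in-memory `bytes` value stands in for a file in HDFS. A real reader would also have to align splits to record boundaries, which is ignored here.

```python
from typing import List, Tuple

Split = Tuple[int, int]  # (byte offset, length)


def make_splits(file_size: int, split_size: int) -> List[Split]:
    """Cuts a file of file_size bytes into consecutive byte ranges."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


def assign_splits(splits: List[Split], n_readers: int) -> List[List[Split]]:
    """Hands splits to readers round-robin (assumed policy)."""
    if n_readers <= 0:
        raise ValueError("n_readers must be positive")
    assignments: List[List[Split]] = [[] for _ in range(n_readers)]
    for index, split in enumerate(splits):
        assignments[index % n_readers].append(split)
    return assignments


def read_split(data: bytes, split: Split) -> bytes:
    """Reads one byte range; each reader instance would call this independently."""
    offset, length = split
    return data[offset:offset + length]
```

The same three functions work whether the readers are threads on one machine or processes on several nodes, which mirrors the slide's point that the parallelization concepts do not change between local and distributed runs.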
Demo
Performance Test
• DataRush versus PIG
  – Used TPC-H data
  – Generated 1TB data set in HDFS
  – Ran several “queries” coded in DataRush and PIG
  – Run times in seconds (smaller is better)
Cluster Configuration:
• 5 worker nodes
• 2 X Intel E5-2650 (8 core)
• 64GB RAM
• 24 X 1TB SATA 7200 rpm
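TPC-H : 1 Terabyte Test : Run times (run time in seconds)
Query   PIG     DataRush
Q1      2036    401
Q3      1414    660
Q6      363     273
Q9      2356    1198
Q10     1027    626
Q18     1742    543
Q21     3528    892
(Chart legend: DataRush, PIG. The extracted chart does not label the two series; the lower run times are attributed to DataRush on the basis of the Summary slide.)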
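The chart's data labels can be turned into per-query ratios with a few lines of arithmetic. The sketch below rests on two assumptions that the extracted text does not confirm: that each series of seven values is listed in the same order as the query labels (Q1 through Q21), and that the series with the lower run times is DataRush, which is consistent with the Summary slide's statement that DataRush outperformed comparable PIG scripts. The variable and function names are illustrative only.

```python
from typing import Dict, List

QUERIES: List[str] = ["Q1", "Q3", "Q6", "Q9", "Q10", "Q18", "Q21"]

# Run times in seconds, copied from the chart's data labels.
# Assumption: the slower series is PIG and the faster one is DataRush.
SLOWER_SERIES: List[int] = [2036, 1414, 363, 2356, 1027, 1742, 3528]
FASTER_SERIES: List[int] = [401, 660, 273, 1198, 626, 543, 892]


def per_query_ratio() -> Dict[str, float]:
    """Ratio of slower to faster run time for each query."""
    return {
        query: slow / fast
        for query, slow, fast in zip(QUERIES, SLOWER_SERIES, FASTER_SERIES)
    }


def overall_ratio() -> float:
    """Ratio of total run time across all seven queries."""
    return sum(SLOWER_SERIES) / sum(FASTER_SERIES)


if __name__ == "__main__":
    for query, ratio in per_query_ratio().items():
        print(f"{query}: {ratio:.2f}x")
    print(f"All queries: {overall_ratio():.2f}x")
```

Ratios of this kind describe only these seven queries on this one cluster configuration and should not be read as a general speedup figure.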
DataRush/RushAnalytics Solutions
• Opera Solutions
  – Data science solutions provider
  – Embedding DataRush in engineered solutions
• Healthcare
  – Claims cleansing & processing
• Retail
  – Market basket analysis
  – Product category resolution (MDM)
• Telecom
  – CDR processing & analysis
“Pervasive DataRush’s efficiency and ability to automatically scale, whether on a single server or a Hadoop cluster, supports our vision for consistent, reusable, scalable Big Data analytics.”
– Armando Escalante, Chief Operating Officer, Opera Solutions
Summary
• Easy development of Hadoop workloads
  – Using drag-and-drop desktop GUI
  – Team oriented - Supports collaboration with others
  – No code to write - MapReduce included
• Scalable Execution
  – Executes within Hadoop cluster
  – Scales from desktop to server to cluster with no workflow changes
  – Scales as cluster does
  – Handles small to very large data sizes
  – TPC-H performance testing shows improved performance over comparable PIG scripts
Questions?
• My contact info:
@jimfalgout
• Website
bigdata.pervasive.com