DESCRIPTION
In this slidecast, SriSatish Ambati from 0xdata describes H2O, the company's new open source, in-memory machine learning application for big data. "We developed H2O to unlock the predictive power of big data through better algorithms," said SriSatish Ambati, CEO and co-founder of 0xdata. "H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world." Watch the presentation video: http://wp.me/p3RLEV-1xc Learn more: http://0xdata.com
H2O – The Open Source Math Engine
Better Predictions!
4/23/13
H2O – Open Source in-memory Machine Learning for Big Data
SriSatish Ambati, July 2013
Universe is sparse. Life is messy. Data is sparse & messy.
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java
Volume: HDFS
HIVE/SQL
Data Scientist
Munging slice n dice Features
Classification Regression Clustering Optimal Model
Engineer
Velocity: Events Online Scoring
Exploration
Modeling
Offline Scoring
Business Analyst
Ensemble models Low latency
Applications
Predictions
Rule Engine
Before H2O
H2O the Prediction Engine
Adhoc Exploration
Math Modeling
Real-time Scoring
Big Data
Messy NAs
Clustering
Classification
Ensembles 100’s nanos
models
Regression
Group By Grep
H2O the Prediction Engine
Big Data Exploration Modeling Scoring Real-time
No New API!
Approximate results each step!
Big Data Exploration Modeling Scoring Real-time
Big Data beats Better Algorithms!
Big Data Exploration Modeling Scoring Real-time
Big Data and Better Algorithms! Scale & Parallelism!
H2O the Prediction Engine
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
Usecases
Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations
Pricing Engine
Fraud Detection
Customers, Users
Insurance Credit Card Others…
Big Data and Better Algorithms
- Antonio Mollins, Data Scientist
Pete Fishman, Data Science @Yammer
0xdata.com
A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                 // more than an int's worth
  // fast random access
  double at(long idx);           // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);  // writable
  void append(double d);         // variable sized
}
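To make the interface above concrete, here is a minimal single-JVM sketch of the same five methods. The class name `LocalVec`, the growable `double[]` backing store, and the choice of NaN to mark an NA are all assumptions made for illustration. This is not H2O's implementation, which is distributed and compressed, and it is limited to an int's worth of elements.

```java
import java.util.Arrays;

// Hypothetical single-JVM stand-in for the Vec interface on the slide.
// It is NOT H2O's implementation: no distribution, no compression.
class LocalVec {
    private double[] data = new double[16];
    private long len = 0;

    long length() { return len; }

    double at(long idx) {
        checkIndex(idx);
        return data[(int) idx];
    }

    // Assumption: a missing value (NA) is encoded as NaN.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        data[(int) idx] = d;
    }

    void append(double d) {
        // Double the backing array when it is full.
        if (len == data.length) data = Arrays.copyOf(data, data.length * 2);
        data[(int) len] = d;
        len++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= len) throw new IndexOutOfBoundsException("index " + idx);
    }

    public static void main(String[] args) {
        LocalVec v = new LocalVec();
        v.append(1.5);
        v.append(Double.NaN);
        v.set(0, 2.5);
        System.out.println(v.length() + " " + v.at(0) + " " + v.isNA(1));
    }
}
```

The `long` indices are kept in the signatures only to mirror the slide; a real distributed vector would route each index to the JVM that holds it.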
Frames
[Figure: four JVM heaps, JVM 1 to JVM 4]
A Frame: Vec[] (age, sex, zip, ID, car)
• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
• But faster if local... more on that later
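The idea of a Frame as an array of aligned columns can be sketched on a single JVM as follows. `LocalFrame`, its methods, and the sample values are hypothetical and exist only for illustration. The sketch shows row access as "the i'th element of every column" and does not model placement across heaps.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical local sketch of a Frame: named, equal-length columns.
class LocalFrame {
    private final Map<String, double[]> cols = new LinkedHashMap<>();
    private int numRows = -1;

    void add(String name, double[] values) {
        // All columns must stay aligned, so lengths have to match.
        if (numRows >= 0 && values.length != numRows)
            throw new IllegalArgumentException("column length mismatch");
        numRows = values.length;
        cols.put(name, values);
    }

    // Row i is the i'th element of every column, in insertion order.
    double[] row(int i) {
        double[] r = new double[cols.size()];
        int c = 0;
        for (double[] col : cols.values()) r[c++] = col[i];
        return r;
    }

    public static void main(String[] args) {
        LocalFrame f = new LocalFrame();
        // Made-up values, for illustration only.
        f.add("age", new double[] {34, 51});
        f.add("zip", new double[] {94043, 10001});
        System.out.println(Arrays.toString(f.row(1)));
    }
}
```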
Distributed Data Taxonomy
[Figure: four JVM heaps, JVM 1 to JVM 4; five Vecs]
A Chunk, Unit of Parallel Access
• Typically 1e3 to 1e6 elements
• Stored compressed
• In byte arrays
• Get/put is a few clock cycles including compression
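One way a chunk can be "stored compressed in byte arrays" while keeping reads cheap is bias encoding, sketched below. This is an assumed scheme chosen for illustration, not a description of H2O's actual chunk formats. The class `ByteChunk` and its methods are hypothetical. It handles only integer-valued doubles whose range fits in one byte and returns `null` for anything else.

```java
// Hypothetical chunk codec: integer-valued doubles within 255 of the
// chunk minimum are stored as one byte each, relative to that minimum.
class ByteChunk {
    private final long bias;
    private final byte[] bytes;

    private ByteChunk(long bias, byte[] bytes) {
        this.bias = bias;
        this.bytes = bytes;
    }

    // Returns null when the values do not fit this encoding.
    static ByteChunk compress(double[] vals) {
        if (vals.length == 0) return new ByteChunk(0, new byte[0]);
        double min = vals[0], max = vals[0];
        for (double v : vals) {
            // Rejects fractions, NaN, infinities and very large magnitudes.
            if (v != Math.rint(v) || Math.abs(v) > 1e15) return null;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (max - min > 255) return null;
        long bias = (long) min;
        byte[] b = new byte[vals.length];
        for (int i = 0; i < vals.length; i++) b[i] = (byte) ((long) vals[i] - bias);
        return new ByteChunk(bias, b);
    }

    // Decoding is one mask and one add.
    double at(int i) { return bias + (bytes[i] & 0xFF); }

    int length() { return bytes.length; }

    public static void main(String[] args) {
        ByteChunk c = ByteChunk.compress(new double[] {1000, 1003, 1255});
        System.out.println(c.at(0) + " " + c.at(1) + " " + c.at(2));
    }
}
```

Here each element takes one byte instead of eight, and a read is a mask plus an add, which is the kind of cost the slide's "few clock cycles" suggests.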
Distributed Parallel Execution
[Figure: four JVM heaps, JVM 1 to JVM 4; five Vecs]
• All CPUs grab Chunks in parallel
• F/J load balances
• Code moves to Data
• Map/Reduce & F/J handles all sync
• H2O handles all comm, data manage
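The map/reduce-over-chunks pattern in the bullets above can be sketched on one JVM with the standard `java.util.concurrent` Fork/Join classes. `ChunkSum` is a hypothetical name, and the chunks are plain `double[]` arrays rather than H2O chunks. The sketch covers only the in-process part: one task per chunk (map), pairwise combination of partial sums (reduce), and work stealing for load balance. It does not show communication between JVMs.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical map/reduce over chunks on one JVM using Fork/Join.
class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi; // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {
            // Map: a single task scans a single chunk.
            double s = 0;
            for (double v : chunks[lo]) s += v;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                 // run the left half asynchronously
        double r = right.compute();  // run the right half in this thread
        return left.join() + r;      // Reduce: combine partial sums
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }

    public static void main(String[] args) {
        double[][] chunks = { {1, 2, 3}, {4, 5}, {6} };
        System.out.println(ChunkSum.sum(chunks)); // prints 21.0
    }
}
```

The caller never writes a lock or a barrier: `fork` and `join` carry all the synchronization.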
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
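Given this taxonomy, reading element i of a Vec means first finding which Chunk holds it. A minimal sketch of that lookup follows, assuming each Vec records the global start index of every chunk and that every chunk is non-empty. The class `ChunkIndex` and its methods are hypothetical, and the slides do not say how H2O performs this mapping.

```java
import java.util.Arrays;

// Hypothetical lookup: map a global element index to (chunk, offset).
// Assumption: every chunk holds at least one element.
class ChunkIndex {
    private final long[] starts; // starts[c] = global index of chunk c's first element
    private final long total;

    ChunkIndex(int[] chunkLengths) {
        starts = new long[chunkLengths.length];
        long sum = 0;
        for (int c = 0; c < chunkLengths.length; c++) {
            starts[c] = sum;
            sum += chunkLengths[c];
        }
        total = sum;
    }

    int chunkOf(long idx) {
        if (idx < 0 || idx >= total) throw new IndexOutOfBoundsException("index " + idx);
        int pos = Arrays.binarySearch(starts, idx);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the owning chunk is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    int offsetIn(long idx) {
        return (int) (idx - starts[chunkOf(idx)]);
    }

    public static void main(String[] args) {
        ChunkIndex ix = new ChunkIndex(new int[] {1000, 1000, 500});
        System.out.println(ix.chunkOf(1500) + " " + ix.offsetIn(1500)); // prints 1 500
    }
}
```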
Distributed Coding Taxonomy
• No Distribution Coding:
    • Whole Algorithms, Whole Vector-Math
    • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
    • Per-Row (or neighbor row) Math
    • Map/Reduce-style: e.g. Any dense linear algebra
• Complex Data-Parallel Coding
    • K/V Store, Graph Algo's, e.g. PageRank
Read the docs!
This talk!
Join our GIT!
H2O – The Open Source Math Engine
Better Predictions!