H2O – The Open Source Math Engine
Better Predictions!
4/23/13
H2O – Open Source in-memory Machine Learning for Big Data
SriSatish Ambati, July 2013
Universe is sparse. Life is messy. Data is sparse & messy.
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java
Volume: HDFS
HIVE/SQL
Data Scientist
Munging slice n dice Features
Classification Regression Clustering Optimal Model
Engineer
Velocity: Events Online Scoring
Exploration
Modeling
Offline Scoring
Business Analyst
Ensemble models Low latency
Applications
Predictions
Rule Engine
Before H2O
H2O the Prediction Engine
Ad hoc Exploration
Math Modeling
Real-time Scoring
Big Data
Messy NAs
Clustering
Classification
Ensembles 100’s nanos
models
Regression
Group By Grep
H2O the Prediction Engine
Big Data Exploration Modeling Scoring Real-time
No New API!
Approximate results each step!
Big Data Exploration Modeling Scoring Real-time
Big Data beats Better Algorithms!
Big Data Exploration Modeling Scoring Real-time
Big Data and Better Algorithms! Scale & Parallelism!
H2O the Prediction Engine
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
Usecases
Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations
Pricing Engine
Fraud Detection
Customers, Users
Insurance, Credit Card, Others…
Big Data and Better Algorithms
- Antonio Mollins, Data Scientist
Pete Fishman, Data Science @Yammer
0xdata.com
A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                  // more than an int's worth
  // fast random access
  double at(long idx);            // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);   // writable
  void append(double d);          // variable sized
}
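To make the interface above concrete, here is a minimal single-JVM sketch. It is not H2O's implementation: the class name `SimpleVec`, the fixed block size, and the use of NaN as the missing-value marker are all assumptions made for illustration. It only shows how `long` indexing and `append` can work once storage is split into blocks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-JVM stand-in for the Vec interface on the slide.
// Data lives in fixed-size blocks so indices are not limited to an int.
class SimpleVec {
    private static final int BLOCK = 1 << 16;   // elements per block (assumed)
    private final List<double[]> blocks = new ArrayList<>();
    private long length = 0;

    long length() { return length; }

    double at(long idx) {
        checkIndex(idx);
        return blocks.get((int) (idx / BLOCK))[(int) (idx % BLOCK)];
    }

    // NaN serves as the missing-value marker in this sketch.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        blocks.get((int) (idx / BLOCK))[(int) (idx % BLOCK)] = d;
    }

    void append(double d) {
        if (length % BLOCK == 0) blocks.add(new double[BLOCK]);  // start a new block
        blocks.get((int) (length / BLOCK))[(int) (length % BLOCK)] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
    }

    public static void main(String[] args) {
        SimpleVec v = new SimpleVec();
        for (int i = 0; i < 100_000; i++) v.append(i);
        v.set(70_000, Double.NaN);
        System.out.println(v.length() + " " + v.at(99_999) + " " + v.isNA(70_000));
    }
}
```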
[Diagram: four JVM heaps, labeled JVM 1 Heap to JVM 4 Heap]
Frames
A Frame: Vec[]
Columns: age, sex, zip, ID, car
• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
• But faster if local... more on that later
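The idea of a Frame as an array of aligned columns can be sketched as follows. This is an illustration only: `MiniFrame` is a hypothetical name, and a plain `double[]` stands in for each distributed Vec, so nothing here reflects how H2O places data across heaps.

```java
// Hypothetical sketch: a Frame as named, equal-length columns.
// A plain double[] stands in for a distributed Vec.
class MiniFrame {
    final String[] names;
    final double[][] cols;   // cols[c][r]: one array per column

    MiniFrame(String[] names, double[][] cols) {
        if (names.length != cols.length)
            throw new IllegalArgumentException("names/cols mismatch");
        for (double[] c : cols)
            if (c.length != cols[0].length)
                throw new IllegalArgumentException("columns must align");
        this.names = names;
        this.cols = cols;
    }

    int numRows() { return cols.length == 0 ? 0 : cols[0].length; }

    // Row i is the i'th element of every column.
    double[] row(int i) {
        double[] r = new double[cols.length];
        for (int c = 0; c < cols.length; c++) r[c] = cols[c][i];
        return r;
    }

    double[] col(String name) {
        for (int c = 0; c < names.length; c++)
            if (names[c].equals(name)) return cols[c];
        throw new IllegalArgumentException("no column " + name);
    }

    public static void main(String[] args) {
        MiniFrame f = new MiniFrame(
            new String[] {"age", "zip"},
            new double[][] {{34, 51}, {94043, 10001}});
        System.out.println(java.util.Arrays.toString(f.row(1)));
    }
}
```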
[Diagram: four JVM heaps, labeled JVM 1 Heap to JVM 4 Heap]
Distributed Data Taxonomy
A Chunk, Unit of Parallel Access
[Diagram labels: Vec, Vec, Vec, Vec, Vec]
• Typically 1e3 to 1e6 elements
• Stored compressed
• In byte arrays
• Get/put is a few clock cycles including compression
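One way a chunk could be held compressed in a byte array while keeping reads cheap is sketched below. The scheme (one byte per value for integers 0 to 254, with 255 as the missing-value marker) and the name `ByteChunk` are assumptions chosen for illustration, not a description of the formats H2O actually uses.

```java
// Hypothetical sketch of one compression scheme: values that are small
// non-negative integers (0..254) take one byte each; 255 marks a missing value.
class ByteChunk {
    private static final int NA = 0xFF;
    private final byte[] mem;

    ByteChunk(double[] vals) {
        mem = new byte[vals.length];
        for (int i = 0; i < vals.length; i++) {
            double d = vals[i];
            if (Double.isNaN(d)) { mem[i] = (byte) NA; continue; }
            if (d < 0 || d > 254 || d != Math.rint(d))
                throw new IllegalArgumentException("not representable in one byte: " + d);
            mem[i] = (byte) (int) d;
        }
    }

    int length() { return mem.length; }

    // Decoding is a mask and a compare, so a read stays cheap.
    double at(int i) {
        int b = mem[i] & 0xFF;
        return b == NA ? Double.NaN : b;
    }

    boolean isNA(int i) { return (mem[i] & 0xFF) == NA; }

    public static void main(String[] args) {
        ByteChunk c = new ByteChunk(new double[] {0, 17, 200, Double.NaN});
        System.out.println(c.at(2) + " " + c.isNA(3) + " bytes=" + c.length());
    }
}
```

In this sketch each element costs 1 byte instead of the 8 bytes of a raw `double`, at the price of a restricted value range.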
[Diagram: four JVM heaps, labeled JVM 1 Heap to JVM 4 Heap]
Distributed Parallel Execution
[Diagram labels: Vec, Vec, Vec, Vec, Vec]
• All CPUs grab Chunks in parallel
• F/J load balances
• Code moves to Data
• Map/Reduce & F/J handles all sync
• H2O handles all comm, data management
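Assuming "F/J" refers to Java's fork/join framework, the pattern of CPUs grabbing chunks in parallel can be sketched on a single node with the standard `ForkJoinPool` and `RecursiveTask` classes. The class `ChunkSum` and the use of `double[][]` as a stand-in for one Vec's chunks are hypothetical; cross-node communication, which the slide attributes to H2O, is not modeled.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: sum a column stored as chunks, one leaf task per chunk.
// double[][] stands in for the chunks of a single Vec on one node.
class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi;   // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {                  // leaf: "map" over one chunk
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        left.fork();                         // idle workers may steal this half
        double right = new ChunkSum(chunks, mid, hi).compute();
        return left.join() + right;          // "reduce": combine partial sums
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }

    public static void main(String[] args) {
        double[][] chunks = new double[8][1000];
        for (double[] c : chunks) java.util.Arrays.fill(c, 1.0);
        System.out.println(sum(chunks));
    }
}
```

The caller never touches a lock or a thread: the split, the work stealing, and the join order are all handled by the pool, which is the sense in which the framework "handles all sync".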
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
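The taxonomy implies a lookup path from a global row index to a chunk and then to an element inside it. The sketch below shows one plausible way to do that on a single node. The name `ChunkedColumn` and the `starts` array of per-chunk first-row offsets are assumptions for illustration; the slides do not say how H2O locates a chunk.

```java
import java.util.Arrays;

// Sketch of the Vec -> Chunk -> elem lookup path on a single node.
// starts[k] is the global row index of the first element in chunk k
// (an assumed layout; chunk sizes may differ but must be non-zero).
class ChunkedColumn {
    private final double[][] chunks;
    private final long[] starts;
    private final long length;

    ChunkedColumn(double[][] chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length];
        long n = 0;
        for (int k = 0; k < chunks.length; k++) {
            if (chunks[k].length == 0) throw new IllegalArgumentException("empty chunk " + k);
            starts[k] = n;
            n += chunks[k].length;
        }
        this.length = n;
    }

    long length() { return length; }

    // Index of the chunk that holds global row idx.
    int chunkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
        int k = Arrays.binarySearch(starts, idx);
        return k >= 0 ? k : -k - 2;   // not found: insertion point minus one
    }

    double at(long idx) {
        int k = chunkIndex(idx);
        return chunks[k][(int) (idx - starts[k])];
    }

    public static void main(String[] args) {
        ChunkedColumn col = new ChunkedColumn(new double[][] {{1, 2, 3}, {4, 5}, {6}});
        System.out.println(col.chunkIndex(4) + " " + col.at(4));
    }
}
```

If every Vec in a Frame shares the same `starts` layout, row i of the Frame lives at the same chunk index and offset in each column.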
Distributed Coding Taxonomy
• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra
• Complex Data-Parallel Coding
  • K/V Store, Graph Algo's, e.g. PageRank
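The middle tier, per-row math in a Map/Reduce style, can be illustrated with a task that computes a column mean from per-chunk partial results. Everything named here (`MeanTask`, `map`, `reduce`, `run`) is hypothetical and is not H2O's task API; the driver is deliberately sequential so the shape of the computation is easy to follow.

```java
// Hypothetical map/reduce-style task: map() sees one chunk at a time,
// reduce() merges two partial results. Names are illustrative only.
class MeanTask {
    long n;        // non-missing rows seen so far
    double sum;    // running sum of those rows

    void map(double[] chunk) {
        for (double d : chunk) {
            if (Double.isNaN(d)) continue;   // skip missing values
            n++;
            sum += d;
        }
    }

    void reduce(MeanTask other) {
        n += other.n;
        sum += other.sum;
    }

    double mean() { return n == 0 ? Double.NaN : sum / n; }

    // Sequential driver for clarity; a parallel runtime would run each map()
    // on its own task instance and combine the results with reduce().
    static MeanTask run(double[][] chunks) {
        MeanTask total = new MeanTask();
        for (double[] chunk : chunks) {
            MeanTask part = new MeanTask();
            part.map(chunk);
            total.reduce(part);
        }
        return total;
    }

    public static void main(String[] args) {
        MeanTask t = run(new double[][] {{1, 2, Double.NaN}, {3}});
        System.out.println(t.n + " " + t.mean());
    }
}
```

Because `reduce` only adds counts and sums, partial results can be merged in any order, which is what lets a runtime distribute the `map` calls freely.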
Distributed Coding Taxonomy
• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra
• Complex Data-Parallel Coding
  • K/V Store, Graph Algo's, e.g. PageRank
Read the docs!
This talk!
Join our GIT!
H2O – The Open Source Math Engine
Better Predictions!