0xdata H2O Podcast

H2O – The Open Source Math Engine!

Better Predictions!

4/23/13

H2O – Open Source in-memory Machine Learning for Big Data

SriSatish Ambati, July 2013

Universe is sparse. Life is messy. Data is sparse & messy!

- Lao Tzu

• Hadoop = opportunity
• Not enough Data Scientists
• Analysts won’t code Java

Volume: HDFS

HIVE/SQL

Data Scientist

Munging slice n dice Features

Classification Regression Clustering Optimal Model

Engineer

Velocity: Events Online Scoring

Exploration

Modeling

Offline Scoring

Business Analyst

Ensemble models Low latency

Applications

Predictions

Rule Engine

Before H2O

H2O the Prediction Engine

Ad hoc Exploration

Math Modeling

Real-time Scoring

Big Data

Messy NAs

Clustering

Classification

Ensembles 100’s nanos

models

Regression

Group By Grep

H2O the Prediction Engine

Big Data: Exploration, Modeling, Real-time Scoring

No New API!

Approximate results each step!


Big Data beats Better Algorithms!


Big Data and Better Algorithms! Scale & Parallelism!

H2O the Prediction Engine

Intellectual Legacy

Math needs to be free

Open Source

Support and Innovation

https://github.com/0xdata/h2o

Usecases

Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations

Pricing Engine
Fraud Detection

Customers, Users

Insurance Credit Card Others…

Big Data and Better Algorithms

- Antonio Mollins, Data Scientist

Pete Fishman, Data Science @Yammer


0xdata.com


A Collection of Distributed Vectors

// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                  // more than an int's worth
  // fast random access
  double at(long idx);            // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);   // writable
  void append(double d);          // variable sized
}
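The interface above can be tried out with a single-JVM stand-in. The following is a minimal sketch, not H2O's implementation: the class name `ToyVec`, the growable local array, and the use of NaN to mark NA are all assumptions made for illustration, and this toy is limited to what one Java array can hold.

```java
import java.util.Arrays;

// Hypothetical single-JVM stand-in for the Vec interface on the slide.
// A distributed Vec would span many heaps; this one only grows a local array.
class ToyVec {
    private double[] data = new double[8];
    private long len = 0;

    long length() { return len; }

    double at(long idx) {
        check(idx);
        return data[(int) idx];
    }

    // Assumption: NaN is used as the NA marker in this sketch.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        check(idx);
        data[(int) idx] = d;
    }

    // Variable sized: double the backing array when it fills up.
    void append(double d) {
        if (len == data.length) data = Arrays.copyOf(data, data.length * 2);
        data[(int) len++] = d;
    }

    private void check(long idx) {
        if (idx < 0 || idx >= len) throw new IndexOutOfBoundsException("idx=" + idx);
    }
}
```

Appending twenty values and then calling `set(3, Double.NaN)` leaves `length()` at 20 and makes `isNA(3)` true.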


[Figure: four JVM heaps, JVM 1 to JVM 4]

Frames

A Frame: Vec[]
Columns: age, sex, zip, ID, car

• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
• But faster if local... more on that later


[Figure: four JVM heaps, JVM 1 to JVM 4]

Distributed Data Taxonomy

A Chunk, Unit of Parallel Access
[Figure: five columns, each labeled Vec]

• Typically 1e3 to 1e6 elements
• Stored compressed
• In byte arrays
• Get/put is a few clock cycles including compression
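To make "stored compressed, in byte arrays" concrete, here is a minimal sketch of one possible scheme. It is an assumption for illustration, not one of H2O's actual chunk formats: each element is stored as a single byte holding its offset from a per-chunk bias, and one byte value is reserved for NA. The class name `ByteChunk` is hypothetical. Get and put each reduce to one array access plus an add or subtract, which is the sense in which decompression can cost only a few cycles.

```java
// Hypothetical compression scheme, chosen for illustration only:
// each element is stored as one byte holding (value - bias),
// and the byte -128 is reserved to mean NA.
class ByteChunk {
    private static final byte NA = Byte.MIN_VALUE;
    private final byte[] mem;   // compressed storage: 1 byte per element instead of 8
    private final long bias;

    ByteChunk(long bias, double[] vals) {
        this.bias = bias;
        this.mem = new byte[vals.length];
        for (int i = 0; i < vals.length; i++) mem[i] = encode(vals[i]);
    }

    // Get: one array load, one compare, one add.
    double at(int i) {
        byte b = mem[i];
        return b == NA ? Double.NaN : (double) (b + bias);
    }

    boolean isNA(int i) { return mem[i] == NA; }

    // Put: compress on the way in.
    void set(int i, double d) { mem[i] = encode(d); }

    int bytes() { return mem.length; }

    private byte encode(double d) {
        if (Double.isNaN(d)) return NA;
        double off = d - bias;
        // Only whole numbers within 127 of the bias fit this format.
        if (off != Math.rint(off) || off < -127 || off > 127)
            throw new IllegalArgumentException("value does not fit this chunk: " + d);
        return (byte) off;
    }
}
```

A real system would pick the encoding per chunk from the data it sees and fall back to wider formats when values do not fit; this sketch simply rejects such values.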


[Figure: four JVM heaps, JVM 1 to JVM 4]

Distributed Parallel Execution

[Figure: five columns, each labeled Vec]

• All CPUs grab Chunks in parallel
• F/J (Fork/Join) load balances
• Code moves to Data
• Map/Reduce & F/J handle all sync
• H2O handles all communication and data management
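The bullets above describe a map/reduce over Chunks scheduled by Fork/Join. A minimal single-JVM sketch of that pattern, using only the standard `java.util.concurrent` Fork/Join classes, might look like the following. The class name `ChunkSum` and the `double[][]` stand-in for a Vec's chunks are assumptions; nothing here is H2O's own task API, and cross-node communication is out of scope.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums all elements of a chunked vector. Each leaf task "maps" over one chunk;
// partial sums are "reduced" by addition on the way back up the task tree.
class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;  // hypothetical stand-in for the chunks of a Vec
    private final int lo, hi;         // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {                        // map: one whole chunk per leaf
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        left.fork();                               // idle workers can steal this half
        double right = new ChunkSum(chunks, mid, hi).compute();
        return left.join() + right;                // reduce
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

The user code is only the per-chunk loop and the `+` in the reduce step; task splitting, work stealing, and joining are handled by the Fork/Join framework.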


Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
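The taxonomy maps directly onto nested containers. The sketch below is one way to write it down, under assumptions: the `Sketch*` class names and the `start` offset table are hypothetical, chunks are assumed non-empty, and everything lives in one JVM. It shows how a global row index is resolved to a chunk and a local offset, and how row i is assembled from the i'th element of every Vec.

```java
import java.util.Arrays;

// Hypothetical names throughout; this mirrors the taxonomy only.
class SketchChunk {
    final double[] elems;                      // elem: a Java double
    SketchChunk(double... elems) { this.elems = elems; }
}

class SketchVec {
    final SketchChunk[] chunks;                // Vec: a collection of Chunks
    final long[] start;                        // start[c] = global index of chunk c's first elem

    SketchVec(SketchChunk... chunks) {         // assumption: every chunk is non-empty
        this.chunks = chunks;
        this.start = new long[chunks.length + 1];
        for (int c = 0; c < chunks.length; c++)
            start[c + 1] = start[c] + chunks[c].elems.length;
    }

    long length() { return start[chunks.length]; }

    double at(long idx) {
        if (idx < 0 || idx >= length()) throw new IndexOutOfBoundsException("idx=" + idx);
        int c = Arrays.binarySearch(start, idx);
        if (c < 0) c = -c - 2;                 // not on a boundary: step back to the containing chunk
        return chunks[c].elems[(int) (idx - start[c])];
    }
}

class SketchFrame {
    final String[] names;
    final SketchVec[] vecs;                    // Frame: a collection of Vecs

    SketchFrame(String[] names, SketchVec[] vecs) {
        this.names = names;
        this.vecs = vecs;
    }

    // Row i: the i'th element of every Vec in the Frame.
    double[] row(long i) {
        double[] r = new double[vecs.length];
        for (int j = 0; j < vecs.length; j++) r[j] = vecs[j].at(i);
        return r;
    }
}
```

Giving every Vec in a Frame the same chunk boundaries means the elements of a given row sit at the same chunk index and offset in each Vec.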


Distributed Coding Taxonomy

• No Distribution Coding:
    • Whole Algorithms, Whole Vector-Math
    • REST + JSON: e.g. load data, GLM, get results

• Simple Data-Parallel Coding:
    • Per-Row (or neighbor row) Math
    • Map/Reduce-style: e.g. any dense linear algebra

• Complex Data-Parallel Coding:
    • K/V Store, Graph Algos, e.g. PageRank
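As an example of the middle tier, per-row math in Map/Reduce style applied to dense linear algebra, consider the Gram matrix X^T X: each chunk of rows yields a partial matrix (map), and partial matrices are added elementwise (reduce). The sketch below uses plain Java parallel streams on one machine; the class name `GramMapReduce` and the `double[][][]` layout (chunk, then row, then column) are assumptions for illustration and not H2O's API.

```java
import java.util.Arrays;

// Gram matrix X^T X computed chunk by chunk, map/reduce style.
class GramMapReduce {
    // map: one chunk of rows -> that chunk's partial Gram matrix
    static double[][] map(double[][] rows, int p) {
        double[][] g = new double[p][p];
        for (double[] r : rows)
            for (int i = 0; i < p; i++)
                for (int j = 0; j < p; j++)
                    g[i][j] += r[i] * r[j];
        return g;
    }

    // reduce: elementwise sum of two partial results (associative, so order is free)
    static double[][] reduce(double[][] a, double[][] b) {
        double[][] out = new double[a.length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a.length; j++)
                out[i][j] = a[i][j] + b[i][j];
        return out;
    }

    // chunks[c][r][j]: column j of row r in chunk c; p is the number of columns
    static double[][] gram(double[][][] chunks, int p) {
        return Arrays.stream(chunks)
                .parallel()
                .map(c -> map(c, p))
                .reduce(new double[p][p], GramMapReduce::reduce);
    }
}
```

Because the reduce step is an associative sum that allocates a fresh result, chunks can be processed in any order and on any worker, which is what makes this tier simple to parallelize.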


Read the docs!

This talk!

Join our GIT!

H2O – The Open Source Math Engine!

Better Predictions!

Technology

0xdata H2O Podcast