0xdata H2O Podcast

DESCRIPTION

In this slidecast, SriSatish Ambati from 0xdata describes the company's new H2O Open Source, In-memory Machine Learning application for Big Data. "We developed H2O to unlock the predictive power of big data through better algorithms," said SriSatish Ambati, CEO and co-founder of 0xdata. "H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world." Watch the presentation video: http://wp.me/p3RLEV-1xc Learn more: http://0xdata.com

Citation preview

H2O – The Open Source Math Engine

Better Predictions!

4/23/13

H2O – Open Source in-memory Machine Learning for Big Data

SriSatish Ambati, July 2013

Universe is sparse. Life is messy. Data is sparse & messy.

- Lao Tzu

Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java

Before H2O

Diagram labels: Volume: HDFS; HIVE/SQL; Data Scientist; Munging, slice n dice, Features; Classification, Regression, Clustering, Optimal Model; Engineer; Velocity: Events, Online Scoring; Exploration; Modeling; Offline Scoring; Business Analyst; Ensemble models, Low latency; Applications; Predictions; Rule Engine

H2O the Prediction Engine

Diagram labels: Adhoc Exploration; Math Modeling; Real-time Scoring; Big Data; Messy NAs; Clustering; Classification; Ensembles 100’s nanos models; Regression; Group By; Grep

H2O the Prediction Engine

Big Data, Exploration, Modeling, Scoring, Real-time

No New API
Approximate results each step

Big Data beats Better Algorithms

Big Data and Better Algorithms
Scale & Parallelism

H2O the Prediction Engine

Intellectual Legacy

Math needs to be free

Open Source

Support and Innovation

https://github.com/0xdata/h2o

Usecases

Conversion, Retention & Churn
•  Lead Conversion
•  Engagement
•  Product Placement
•  Recommendations

Pricing Engine
Fraud Detection

Customers, Users

Insurance, Credit Card, Others…

Big Data and Better Algorithms

- Antonio Mollins, Data Scientist

Pete Fishman, Data Science @Yammer

0xdata.com  

A Collection of Distributed Vectors

// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                 // more than an int's worth
  // fast random access
  double at(long idx);           // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);  // writable
  void append(double d);         // variable sized
}
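The slide lists only method signatures. As a rough illustration of how a long-indexed, appendable vector could be backed by fixed-size chunks, here is a single-JVM sketch. The names SimpleVec and CHUNK_SIZE, the chunk size, and the use of NaN to mark an NA are all assumptions made for this example; none of this is H2O's actual implementation, and nothing here is distributed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a single-JVM stand-in for the Vec signatures on the slide.
// SimpleVec and CHUNK_SIZE are hypothetical names, not H2O classes.
final class SimpleVec {
    private static final int CHUNK_SIZE = 1 << 16;          // elements per chunk (assumed)
    private final List<double[]> chunks = new ArrayList<>();
    private long length = 0;                                // a long, so it can exceed Integer.MAX_VALUE

    long length() { return length; }

    // Random access: locate the chunk, then the offset inside it.
    double at(long idx) {
        checkIndex(idx);
        return chunks.get((int) (idx / CHUNK_SIZE))[(int) (idx % CHUNK_SIZE)];
    }

    // Assumption: a missing value (NA) is encoded as NaN.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        chunks.get((int) (idx / CHUNK_SIZE))[(int) (idx % CHUNK_SIZE)] = d;
    }

    void append(double d) {
        int offset = (int) (length % CHUNK_SIZE);
        if (offset == 0) chunks.add(new double[CHUNK_SIZE]); // last chunk is full (or none exists yet)
        chunks.get(chunks.size() - 1)[offset] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("index " + idx);
    }
}
```

Appending never copies existing data, because a full chunk is left in place and a new one is started; that is one plausible reason to chunk a variable-sized vector.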

Frames

Diagram: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap

A Frame: Vec[]
Columns: age, sex, zip, ID, car

•  Vecs aligned in heaps
•  Optimized for concurrent access
•  Random access any row, any JVM
•  But faster if local... more on that later
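To make "A Frame: Vec[]" concrete, the sketch below models a frame as named, equal-length columns and reads one row by taking the same index from every column. ColumnFrame and its methods are hypothetical names for this example; plain double[] arrays stand in for Vecs, and the multi-JVM alignment described on the slide is not modeled.

```java
import java.util.Arrays;

// Hypothetical single-JVM sketch: a frame as named, aligned columns.
final class ColumnFrame {
    private final String[] names;
    private final double[][] cols;   // cols[c][r]: one array per column

    ColumnFrame(String[] names, double[][] cols) {
        if (names.length != cols.length)
            throw new IllegalArgumentException("one name per column required");
        for (double[] c : cols)
            if (c.length != cols[0].length)
                throw new IllegalArgumentException("columns must be aligned (equal length)");
        this.names = names.clone();
        this.cols = cols;
    }

    int numCols() { return cols.length; }
    int numRows() { return cols.length == 0 ? 0 : cols[0].length; }

    // Row r is the r'th element of every column.
    double[] row(int r) {
        double[] out = new double[cols.length];
        for (int c = 0; c < cols.length; c++) out[c] = cols[c][r];
        return out;
    }

    double[] column(String name) {
        int c = Arrays.asList(names).indexOf(name);
        if (c < 0) throw new IllegalArgumentException("no such column: " + name);
        return cols[c];
    }
}
```

For example, a frame built from the columns age, sex and zip returns one value per column when asked for a row, in column order.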

Distributed Data Taxonomy

Diagram: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap; Vec, Vec, Vec, Vec, Vec

A Chunk, Unit of Parallel Access

•  Typically 1e3 to 1e6 elements
•  Stored compressed
•  In byte arrays
•  Get/put is a few clock cycles including compression
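The slide does not say which compression scheme the chunks use. As one possible illustration of "stored compressed, in byte arrays" with a cheap get, the sketch below stores integer values that span a range of at most 255 as one byte each plus a shared bias. ByteChunk and the bias scheme are assumptions for this example, not a description of H2O's chunk formats.

```java
// Hypothetical sketch: one byte per element plus a shared bias (the chunk minimum).
final class ByteChunk {
    private final byte[] mem;   // compressed storage
    private final long bias;    // value represented by a stored 0

    private ByteChunk(byte[] mem, long bias) {
        this.mem = mem;
        this.bias = bias;
    }

    // Compresses values whose range (max - min) fits in an unsigned byte.
    static ByteChunk compress(int[] values) {
        if (values.length == 0) return new ByteChunk(new byte[0], 0);
        long min = values[0], max = values[0];
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (max - min > 255)
            throw new IllegalArgumentException("range too wide for one byte per element");
        byte[] mem = new byte[values.length];
        for (int i = 0; i < values.length; i++) mem[i] = (byte) (values[i] - min);
        return new ByteChunk(mem, min);
    }

    // Decompression on read: a mask and an add.
    double at(int i) { return bias + (mem[i] & 0xFF); }

    int length() { return mem.length; }
}
```

Reading an element costs a mask and an addition, which is in the spirit of the "few clock cycles including compression" claim, while the storage is an eighth of what a double[] would need.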

Distributed Parallel Execution

Diagram: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap; Vec, Vec, Vec, Vec, Vec

•  All CPUs grab Chunks in parallel
•  F/J (Fork/Join) load balances
•  Code moves to Data
•  Map/Reduce & F/J handles all sync
•  H2O handles all comm, data management
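The bullets above can be sketched on a single JVM with the standard java.util.concurrent Fork/Join framework: each leaf task processes one chunk (the map step), partial results are added on the way back up (the reduce step), and the pool's work stealing does the load balancing. ChunkSum is a hypothetical name for this example, plain double[] arrays stand in for chunks, and the cross-JVM communication that the slide attributes to H2O is not modeled.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: sum all elements, one Fork/Join leaf task per chunk.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi;          // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {            // map: process a single chunk
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        left.fork();                   // idle workers may steal this half
        double right = new ChunkSum(chunks, mid, hi).compute();
        return left.join() + right;    // reduce: combine partial sums
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

The caller writes no locks or explicit synchronization; fork and join are the only coordination points, which mirrors the "F/J handles all sync" bullet.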

Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame

Distributed Coding Taxonomy

•  No Distribution Coding:
    •  Whole Algorithms, Whole Vector-Math
    •  REST + JSON: e.g. load data, GLM, get results
•  Simple Data-Parallel Coding:
    •  Per-Row (or neighbor row) Math
    •  Map/Reduce-style: e.g. Any dense linear algebra
•  Complex Data-Parallel Coding
    •  K/V Store, Graph Algo's, e.g. PageRank
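As an example of the middle tier (per-row math in a Map/Reduce style applied to dense linear algebra), the sketch below computes the Gram matrix X'X of a row-major data set. The map step adds each row's outer product into a local accumulator and the reduce step adds accumulators element-wise. It uses standard Java parallel streams on one JVM; GramMatrix is a hypothetical name for this example and this is not H2O's Map/Reduce API.

```java
import java.util.Arrays;

// Hypothetical sketch: X'X as a per-row map (outer product) and an additive reduce.
final class GramMatrix {
    static double[][] gram(double[][] rows, int ncols) {
        return Arrays.stream(rows).parallel().collect(
            () -> new double[ncols][ncols],             // fresh accumulator per worker
            (acc, row) -> {                             // map: add this row's outer product
                for (int i = 0; i < ncols; i++)
                    for (int j = 0; j < ncols; j++)
                        acc[i][j] += row[i] * row[j];
            },
            (a, b) -> {                                 // reduce: merge two accumulators
                for (int i = 0; i < ncols; i++)
                    for (int j = 0; j < ncols; j++)
                        a[i][j] += b[i][j];
            });
    }
}
```

Because the reduce step is plain addition, the partial results can be combined in any grouping, which is what makes this kind of computation a natural fit for data-parallel execution.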

Read the docs!

This talk!

Join our GIT!

H2O – The Open Source Math Engine

Better Predictions!