H2O – The Open Source Math Engine Better Predictions

0xdata H2O Podcast

DESCRIPTION

In this slidecast, SriSatish Ambati from 0xdata describes the company's new H2O open source, in-memory machine learning application for Big Data. "We developed H2O to unlock the predictive power of big data through better algorithms," said SriSatish Ambati, CEO and co-founder of 0xdata. "H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world."

Watch the presentation video: http://wp.me/p3RLEV-1xc
Learn more: http://0xdata.com

Citation preview

Page 1: 0xdata H2O Podcast

H2O – The Open Source Math Engine

Better Predictions

Page 2: 0xdata H2O Podcast

4/23/13

H2O – Open Source in-memory Machine Learning for Big Data

SriSatish Ambati, July 2013

Page 3: 0xdata H2O Podcast

Universe is sparse. Life is messy. Data is sparse & messy.

- Lao Tzu

Page 4: 0xdata H2O Podcast

Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java

Page 5: 0xdata H2O Podcast

Before H2O

Diagram labels: Volume: HDFS; HIVE/SQL; Data Scientist; Munging, slice n dice, Features; Classification, Regression, Clustering, Optimal Model; Engineer; Velocity: Events, Online Scoring; Exploration; Modeling; Offline Scoring; Business Analyst; Ensemble models, Low latency; Applications; Predictions; Rule Engine.

Page 6: 0xdata H2O Podcast

H2O the Prediction Engine

Diagram labels: Ad hoc Exploration; Math Modeling; Real-time Scoring; Big Data; Messy NAs; Clustering; Classification; Ensembles 100’s nanos; models; Regression; Group By; Grep.

Page 7: 0xdata H2O Podcast

H2O the Prediction Engine

Big Data, Exploration, Modeling, Scoring, Real-time

No New API

Approximate results each step

Page 8: 0xdata H2O Podcast

Big Data, Exploration, Modeling, Scoring, Real-time

Big Data beats Better Algorithms

Page 9: 0xdata H2O Podcast

Big Data, Exploration, Modeling, Scoring, Real-time

Big Data and Better Algorithms
Scale & Parallelism

Page 10: 0xdata H2O Podcast

H2O the Prediction Engine

Intellectual Legacy

Math needs to be free

Open Source

Support and Innovation

https://github.com/0xdata/h2o

Page 11: 0xdata H2O Podcast

Use cases

Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations

Pricing Engine
Fraud Detection

Page 12: 0xdata H2O Podcast

Customers, Users

Insurance, Credit Card, Others…

Page 13: 0xdata H2O Podcast

Big Data and Better Algorithms

- Antonio Mollins, Data Scientist

Page 14: 0xdata H2O Podcast

Pete Fishman, Data Science @Yammer

Page 15: 0xdata H2O Podcast

Page 16: 0xdata H2O Podcast

Page 17: 0xdata H2O Podcast

0xdata.com

A Collection of Distributed Vectors

// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                  // more than an int's worth
  // fast random access
  double at(long idx);            // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);   // writable
  void append(double d);          // variable sized
}
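
The signatures on this slide can be tried out on a single JVM. The sketch below is one possible backing, not H2O's implementation: `LocalVec` and `CHUNK_SIZE` are hypothetical names, storage is a list of fixed-size `double[]` blocks, and NA is assumed to be encoded as NaN.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-JVM stand-in for the slide's Vec: long-indexed and
// growable, backed by fixed-size blocks so no single array needs an int
// index larger than CHUNK_SIZE.
final class LocalVec {
    static final int CHUNK_SIZE = 1 << 16;   // assumed block size
    private final List<double[]> chunks = new ArrayList<>();
    private long length = 0;

    long length() { return length; }

    double at(long idx) {
        checkIndex(idx);
        return chunks.get((int) (idx / CHUNK_SIZE))[(int) (idx % CHUNK_SIZE)];
    }

    // Assumption: a missing value (NA) is stored as NaN.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        chunks.get((int) (idx / CHUNK_SIZE))[(int) (idx % CHUNK_SIZE)] = d;
    }

    void append(double d) {
        // Allocate a fresh block whenever the previous one is full.
        if (length % CHUNK_SIZE == 0) chunks.add(new double[CHUNK_SIZE]);
        chunks.get((int) (length / CHUNK_SIZE))[(int) (length % CHUNK_SIZE)] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length)
            throw new IndexOutOfBoundsException("index " + idx);
    }
}
```

Appending past `CHUNK_SIZE` elements simply adds another block, which is why `length()` can return a `long` while every array access still uses an `int`.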

Page 18: 0xdata H2O Podcast

Frames

A Frame: Vec[] (age, sex, zip, ID, car)

Diagram labels: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap

• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
• But faster if local... more on that later

Page 19: 0xdata H2O Podcast

Distributed Data Taxonomy

Diagram labels: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap; five Vecs

A Chunk, Unit of Parallel Access

• Typically 1e3 to 1e6 elements
• Stored compressed
• In byte arrays
• Get/put is a few clock cycles including compression
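
To make "stored compressed, in byte arrays" concrete, here is a minimal sketch of one possible scheme. It is an assumption for illustration, not a description of H2O's actual chunk formats: integer-valued doubles are stored as one byte each relative to a per-chunk bias, and one byte value is reserved for NA. `ByteChunk` is a hypothetical name.

```java
// Hypothetical compressed chunk: integer-valued doubles stored as one byte
// each, relative to a per-chunk bias. Byte value -128 is reserved for NA.
final class ByteChunk {
    private static final byte NA = Byte.MIN_VALUE;
    private final byte[] mem;
    private final long bias;

    // Compresses the values; rejects non-integers and ranges wider than 254.
    ByteChunk(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double d : values) {
            if (Double.isNaN(d)) continue;               // NaN is treated as NA
            if (d != Math.rint(d) || Math.abs(d) > 1e15)
                throw new IllegalArgumentException("not a small integer: " + d);
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        if (min > max) { min = 0; max = 0; }             // empty or all-NA chunk
        if (max - min > 254)
            throw new IllegalArgumentException("range too wide for one byte");
        bias = (long) min + 127;                         // maps min to -127
        mem = new byte[values.length];
        for (int i = 0; i < values.length; i++)
            mem[i] = Double.isNaN(values[i]) ? NA : (byte) ((long) values[i] - bias);
    }

    int length() { return mem.length; }

    boolean isNA(int i) { return mem[i] == NA; }

    // Decompression is one load, one compare and one add.
    double at(int i) { return mem[i] == NA ? Double.NaN : mem[i] + bias; }

    // Writes in place; values outside the chunk's byte range are rejected.
    void set(int i, double d) {
        if (Double.isNaN(d)) { mem[i] = NA; return; }
        double delta = d - bias;
        if (d != Math.rint(d) || delta < -127 || delta > 127)
            throw new IllegalArgumentException("value does not fit this chunk: " + d);
        mem[i] = (byte) delta;
    }
}
```

In this sketch a column such as age or a small categorical code takes one byte per element instead of eight, and reads stay cheap because decoding is a single addition.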

Page 20: 0xdata H2O Podcast

Distributed Parallel Execution

Diagram labels: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap; five Vecs

• All CPUs grab Chunks in parallel
• F/J load balances
• Code moves to Data
• Map/Reduce & F/J handles all sync
• H2O handles all comm, data manage
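
Assuming "F/J" on this slide refers to Java's Fork/Join framework, the single-JVM part of this picture can be sketched with `java.util.concurrent`. `ChunkSum` is a hypothetical class and the chunks are plain `double[]` arrays. The cross-JVM steps (moving code to data, communication) are not modeled here.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a column stored as an array of chunks. The range of chunk indices is
// split recursively; idle worker threads steal the forked halves, which is
// how Fork/Join balances load.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi;   // half-open range [lo, hi) of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= 0) return 0.0;
        if (hi - lo == 1) {                 // "map": one chunk, scanned serially
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                        // run the left half asynchronously
        double r = right.compute();         // work on the right half ourselves
        return left.join() + r;             // "reduce": combine partial sums
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

The caller only writes the per-chunk loop and the combine step. All thread synchronization sits in `fork()` and `join()`.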

Page 21: 0xdata H2O Podcast

Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
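
The taxonomy above maps naturally onto a few small classes. The following is an illustrative single-JVM sketch. `SimpleChunk`, `SimpleVec` and `SimpleFrame` are hypothetical names, and the chunks are uncompressed.

```java
// Hypothetical in-memory mirror of the taxonomy: Frame -> Vec -> Chunk -> elem.
final class SimpleChunk {
    final double[] elems;
    SimpleChunk(double... elems) { this.elems = elems; }
}

final class SimpleVec {
    private final SimpleChunk[] chunks;
    // starts[c] is the global index of the first elem of chunk c;
    // the last entry holds the total length.
    private final long[] starts;

    SimpleVec(SimpleChunk... chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length + 1];
        for (int c = 0; c < chunks.length; c++)
            starts[c + 1] = starts[c] + chunks[c].elems.length;
    }

    long length() { return starts[chunks.length]; }

    double at(long idx) {
        if (idx < 0 || idx >= length())
            throw new IndexOutOfBoundsException("index " + idx);
        int c = 0;
        while (idx >= starts[c + 1]) c++;   // linear scan, kept simple for the sketch
        return chunks[c].elems[(int) (idx - starts[c])];
    }
}

final class SimpleFrame {
    final String[] names;
    final SimpleVec[] vecs;

    SimpleFrame(String[] names, SimpleVec[] vecs) {
        if (names.length != vecs.length)
            throw new IllegalArgumentException("one name per Vec required");
        for (SimpleVec v : vecs)
            if (v.length() != vecs[0].length())
                throw new IllegalArgumentException("all Vecs must have the same length");
        this.names = names;
        this.vecs = vecs;
    }

    long numRows() { return vecs.length == 0 ? 0 : vecs[0].length(); }

    // Row i is the i'th element of every Vec in the Frame.
    double[] row(long i) {
        double[] r = new double[vecs.length];
        for (int j = 0; j < vecs.length; j++) r[j] = vecs[j].at(i);
        return r;
    }
}
```

A row is never stored as such. It is assembled on demand from the columns, which is the column-oriented layout the definitions describe.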

Page 22: 0xdata H2O Podcast

Distributed Coding Taxonomy

• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results

• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra

• Complex Data-Parallel Coding
  • K/V Store, Graph Algo's, e.g. PageRank
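
As an illustration of the middle tier (per-row math written Map/Reduce-style), here is a sketch that computes a column mean while skipping NAs. It uses Java parallel streams on one JVM rather than any H2O API. `MeanTask` and its method names are hypothetical, and NA is assumed to be NaN.

```java
import java.util.Arrays;

// Hypothetical Map/Reduce-style task: map() turns one chunk into a partial
// result, reduce() merges two partial results. Instances are immutable, so
// they are safe to combine from parallel threads.
final class MeanTask {
    final long n;       // number of non-NA elements seen
    final double sum;   // sum of those elements

    MeanTask(long n, double sum) {
        this.n = n;
        this.sum = sum;
    }

    // Per-row math over a single chunk; NaN is assumed to mark an NA.
    static MeanTask map(double[] chunk) {
        long n = 0;
        double s = 0;
        for (double d : chunk)
            if (!Double.isNaN(d)) { n++; s += d; }
        return new MeanTask(n, s);
    }

    MeanTask reduce(MeanTask other) {
        return new MeanTask(n + other.n, sum + other.sum);
    }

    double mean() { return n == 0 ? Double.NaN : sum / n; }

    // Maps every chunk in parallel, then folds the partial results together.
    static MeanTask run(double[][] chunks) {
        return Arrays.stream(chunks)
                     .parallel()
                     .map(MeanTask::map)
                     .reduce(new MeanTask(0, 0), MeanTask::reduce);
    }
}
```

The author of such a task writes only the per-chunk loop and the merge. How chunks are scheduled is left to the runtime, which is the division of labor this tier describes.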

Page 23: 0xdata H2O Podcast

Distributed Coding Taxonomy

• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results

• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra

• Complex Data-Parallel Coding
  • K/V Store, Graph Algo's, e.g. PageRank

Read the docs!

This talk!

Join our GIT!

Page 24: 0xdata H2O Podcast

H2O – The Open Source Math Engine

Better Predictions