Page 1: 0xdata H2O Podcast

H2O – The Open Source Math Engine

Better Predictions!

Page 2: 0xdata H2O Podcast

4/23/13

H2O – Open Source in-memory Machine Learning for Big Data

SriSatish Ambati, July 2013

Page 3: 0xdata H2O Podcast

Universe is sparse. Life is messy. Data is sparse & messy.

- Lao Tzu

Page 4: 0xdata H2O Podcast

• Hadoop = opportunity
• Not enough Data Scientists
• Analysts won’t code Java

Page 5: 0xdata H2O Podcast

Before H2O

[Diagram labels: Volume: HDFS, HIVE/SQL; Velocity: Events, Online Scoring; Data Scientist; Engineer; Business Analyst; Exploration; Modeling; Offline Scoring; Munging, slice n dice, Features; Classification, Regression, Clustering, Optimal Model; Ensemble models, Low latency; Applications; Predictions; Rule Engine]

Page 6: 0xdata H2O Podcast

H2O the Prediction Engine

[Diagram labels: Big Data; Adhoc Exploration; Math Modeling; Real-time Scoring; Messy NAs; Group By; Grep; Clustering; Classification; Regression; Ensembles; 100’s nanos; models]

Page 7: 0xdata H2O Podcast

H2O the Prediction Engine

Big Data, Exploration, Modeling, Scoring, Real-time

No New API!

Approximate results each step

Page 8: 0xdata H2O Podcast

Big Data, Exploration, Modeling, Scoring, Real-time

Big Data beats Better Algorithms!

Page 9: 0xdata H2O Podcast

Big Data, Exploration, Modeling, Scoring, Real-time

Big Data and Better Algorithms
Scale & Parallelism

Page 10: 0xdata H2O Podcast

H2O the Prediction Engine

Intellectual Legacy

Math needs to be free

Open Source

Support and Innovation

https://github.com/0xdata/h2o

Page 11: 0xdata H2O Podcast

Usecases

Conversion, Retention & Churn
• Lead Conversion
• Engagement
• Product Placement
• Recommendations

Pricing Engine
Fraud Detection

Page 12: 0xdata H2O Podcast

Customers, Users

Insurance, Credit Card, Others…

Page 13: 0xdata H2O Podcast

Big Data and Better Algorithms

- Antonio Mollins, Data Scientist

Page 14: 0xdata H2O Podcast

Pete Fishman, Data Science @Yammer

Page 15: 0xdata H2O Podcast


Page 16: 0xdata H2O Podcast


Page 17: 0xdata H2O Podcast

0xdata.com  


A Collection of Distributed Vectors

// A Distributed Vector
// much more than 2 billion elements
class Vec {
  long length();                  // more than an int's worth
  // fast random access
  double at(long idx);            // Get the idx'th elem
  boolean isNA(long idx);
  void set(long idx, double d);   // writable
  void append(double d);          // variable sized
}
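To make the Vec signatures above concrete, here is a minimal single-JVM sketch. It is not H2O's implementation: `LocalVec` is a hypothetical name, the backing store is a plain growable `double[]` (so it cannot actually exceed an int's worth of elements), and encoding NA as `Double.NaN` is an assumption made only for this illustration.

```java
import java.util.Arrays;

// Hypothetical single-JVM stand-in for the slide's Vec signatures.
// Assumption: NA is encoded as Double.NaN; the slides do not say how NAs are stored.
class LocalVec {
    private double[] data = new double[16];
    private long length = 0;

    long length() { return length; }

    // Random access by long index, mirroring the slide's signature.
    double at(long idx) {
        checkIndex(idx);
        return data[(int) idx];
    }

    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        data[(int) idx] = d;
    }

    // Grow the backing array by doubling when it is full.
    void append(double d) {
        if (length == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[(int) length] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) {
            throw new IndexOutOfBoundsException("index " + idx + ", length " + length);
        }
    }
}

class LocalVecDemo {
    public static void main(String[] args) {
        LocalVec v = new LocalVec();
        v.append(1.5);
        v.append(Double.NaN);   // an NA under this sketch's assumption
        v.set(0, 2.5);
        System.out.println(v.length() + " " + v.at(0) + " " + v.isNA(1));
    }
}
```

A distributed version would have to split the elements across JVMs, which is what the Chunk slides that follow describe.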

Page 18: 0xdata H2O Podcast


Frames

A Frame: Vec[] (age, sex, zip, ID, car)
[Diagram labels: JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap]
• Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
• But faster if local... more on that later
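The idea of a Frame as an array of aligned columns can be sketched on a single JVM as follows. `FrameSketch` is a hypothetical name, and plain `double[]` columns stand in for distributed Vecs; nothing here reflects how H2O places data across heaps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a Frame as a set of equal-length named columns.
// Plain double[] stands in for the distributed Vec of the slides.
class FrameSketch {
    private final Map<String, double[]> cols = new LinkedHashMap<>();
    private int numRows = -1;   // -1 until the first column is added

    void add(String name, double[] col) {
        if (numRows >= 0 && col.length != numRows) {
            throw new IllegalArgumentException("column " + name + " has the wrong length");
        }
        numRows = col.length;
        cols.put(name, col);
    }

    int numRows() { return Math.max(numRows, 0); }

    // Row i is the i'th element of every column, in insertion order.
    double[] row(int i) {
        double[] r = new double[cols.size()];
        int c = 0;
        for (double[] col : cols.values()) {
            r[c++] = col[i];
        }
        return r;
    }
}

class FrameSketchDemo {
    public static void main(String[] args) {
        FrameSketch f = new FrameSketch();
        f.add("age", new double[] {31, 45});
        f.add("zip", new double[] {94043, 10001});
        double[] r = f.row(1);
        System.out.println(r[0] + " " + r[1]);
    }
}
```

Keeping all columns the same length is what makes "row i" well defined; the alignment bullet above suggests the same row range of every Vec lives in the same heap, which this single-JVM sketch does not model.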

Page 19: 0xdata H2O Podcast


Distributed Data Taxonomy

A Chunk, Unit of Parallel Access
[Diagram labels: five Vecs; JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap]
• Typically 1e3 to 1e6 elements
• Stored compressed
• In byte arrays
• Get/put is a few clock cycles including compression
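One way a chunk could be "stored compressed in byte arrays" while keeping get/put cheap is a bias-plus-byte encoding, sketched below. This specific scheme is an assumption chosen for illustration; the slides do not say which encodings H2O uses, and `ByteChunk` is a hypothetical name.

```java
// Hypothetical sketch of a byte-compressed chunk: each element is stored as
// one unsigned byte plus a shared bias. The actual H2O encodings are not
// given in the slides; this only illustrates cheap decompress-on-read.
class ByteChunk {
    private final byte[] mem;   // one byte per element
    private final int bias;     // smallest value in the chunk

    ByteChunk(int[] values) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (values.length > 0 && (long) max - min > 255) {
            throw new IllegalArgumentException("value range too wide for one byte per element");
        }
        bias = values.length == 0 ? 0 : min;
        mem = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            mem[i] = (byte) (values[i] - bias);
        }
    }

    int length() { return mem.length; }

    // Decompression is one mask and one add.
    double at(int i) { return bias + (mem[i] & 0xFF); }

    // Compression on write is one subtract and a range check.
    void set(int i, int v) {
        long d = (long) v - bias;
        if (d < 0 || d > 255) {
            throw new IllegalArgumentException("value does not fit this chunk's encoding");
        }
        mem[i] = (byte) d;
    }
}

class ByteChunkDemo {
    public static void main(String[] args) {
        ByteChunk c = new ByteChunk(new int[] {1000, 1200, 1255});
        c.set(0, 1100);
        System.out.println(c.at(0) + " " + c.at(1) + " " + c.at(2));
    }
}
```

In this sketch three values that would take 24 bytes as doubles take 3 bytes plus the shared bias. A value outside the byte range would force a re-encoding of the whole chunk, which the sketch simply rejects.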

Page 20: 0xdata H2O Podcast


Distributed Parallel Execution

[Diagram labels: five Vecs; JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap]
• All CPUs grab Chunks in parallel
• F/J load balances
• Code moves to Data
• Map/Reduce & F/J handles all sync
• H2O handles all comm, data manage
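The map/reduce-over-chunks pattern in the bullets above can be sketched on one JVM with the standard Java Fork/Join framework. This is not H2O's task API: `ChunkSum` is a hypothetical class, the chunks are plain `double[]` arrays, and there is no cross-JVM communication here.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: sum all elements by splitting the chunk list in half
// recursively (fork), summing each chunk locally (map), and adding the
// partial sums on the way back up (reduce).
class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi;   // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= 0) {
            return 0.0;
        }
        if (hi - lo == 1) {
            double s = 0;
            for (double d : chunks[lo]) {
                s += d;             // map: work on one chunk
            }
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                        // run the left half asynchronously
        double r = right.compute();         // compute the right half in this thread
        return left.join() + r;             // reduce: combine partial results
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}

class ChunkSumDemo {
    public static void main(String[] args) {
        double[][] chunks = { {1, 2, 3}, {4, 5}, {}, {6} };
        System.out.println(ChunkSum.sum(chunks));
    }
}
```

The Fork/Join pool's work stealing is what balances load across CPUs when chunks take uneven time; user code only supplies the per-chunk work and the combine step.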

Page 21: 0xdata H2O Podcast


Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
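Since a Vec is a collection of Chunks, reading element `idx` means first finding the chunk that holds it. A minimal sketch, assuming the Vec keeps a table of each chunk's starting global index (the `starts` array and the `ChunkedVec` name are hypothetical, and chunks are assumed non-empty so the table is strictly increasing):

```java
import java.util.Arrays;

// Hypothetical sketch: a Vec as a list of chunks plus a table of the global
// index at which each chunk starts. Lookup is a binary search over that table.
class ChunkedVec {
    private final double[][] chunks;
    private final long[] starts;   // starts[c] = global index of chunk c's first element; last entry = total length

    ChunkedVec(double[][] chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length + 1];
        for (int c = 0; c < chunks.length; c++) {
            if (chunks[c].length == 0) {
                throw new IllegalArgumentException("empty chunks are not supported in this sketch");
            }
            starts[c + 1] = starts[c] + chunks[c].length;
        }
    }

    long length() { return starts[chunks.length]; }

    // Which chunk holds global element idx?
    int chunkIndex(long idx) {
        if (idx < 0 || idx >= length()) {
            throw new IndexOutOfBoundsException("index " + idx + ", length " + length());
        }
        int pos = Arrays.binarySearch(starts, idx);
        // Exact hit: idx is the first element of chunk pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the chunk is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    double at(long idx) {
        int c = chunkIndex(idx);
        return chunks[c][(int) (idx - starts[c])];
    }
}

class ChunkedVecDemo {
    public static void main(String[] args) {
        ChunkedVec v = new ChunkedVec(new double[][] { {10, 11, 12}, {20, 21}, {30} });
        System.out.println(v.length() + " " + v.at(3) + " " + v.chunkIndex(5));
    }
}
```

If every Vec in a Frame shares the same chunk boundaries, row i of the Frame falls in the same chunk index for each Vec, so one lookup serves the whole row.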

Page 22: 0xdata H2O Podcast


Distributed Coding Taxonomy

• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra
• Complex Data-Parallel Coding
  • K/V Store, Graph Algo's, e.g. PageRank

Page 23: 0xdata H2O Podcast


Distributed Coding Taxonomy

• No Distribution Coding:
  • Whole Algorithms, Whole Vector-Math
  • REST + JSON: e.g. load data, GLM, get results
• Simple Data-Parallel Coding:
  • Per-Row (or neighbor row) Math
  • Map/Reduce-style: e.g. Any dense linear algebra
• Complex Data-Parallel Coding
  • K/V Store, Graph Algo's, e.g. PageRank

Read the docs!

This talk!

Join our GIT!

Page 24: 0xdata H2O Podcast

H2O – The Open Source Math Engine

Better Predictions!