H2O PySparkling Water
Michal Malohlava @mmalohlava and @h2oai
presents
2016/10/08 PyData
H2O.ai Machine Intelligence
H2O + PySpark = PySparkling
H2O: Open-Source In-Memory Data Science Platform
• Highly optimized Java code (in-house)
• Distributed in-memory K-V store and map/reduce computation framework
• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)
• Read/write access to distributed data frames (R/Pandas-style)
• ML algos: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
• REST API clients: Interactive UI/R/Python
H2O Python client
pySparkling
PySparkling Provides
Transparent integration of H2O machine learning platform with Spark ecosystem (PySpark)
Transparent use of H2O data structures (H2OFrame) and algorithms with Spark Python API
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Functionality missing in H2O can be replaced by Spark and vice versa
Benefits
• Additional algorithms
• NLP features
• Powerful data munging
• ML Pipelines
• Advanced algorithms
• speed v. accuracy
• advanced parameters
• Fully distributed and parallelized
• Graphical environment
• Fully fledged Python/R interfaces
PySparkling Use-Cases
Model Building
Data Source
Data munging
Modelling: Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
Prediction processing
Steam: Model management
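A minimal sketch of the stages above, assuming `h2o_context` offers the `as_h2o_frame` / `as_spark_frame` conversions named later in this deck, and that `estimator` is any H2O Python estimator (for example `H2OGradientBoostingEstimator`) with the usual `train(x=..., y=..., training_frame=...)` and `predict(...)` methods. The helper name `build_and_predict` and its parameters are hypothetical, chosen only for illustration.

```python
def build_and_predict(h2o_context, munged_df, estimator, features, label):
    """Train an H2O model on a munged Spark DataFrame and return predictions to Spark.

    h2o_context : object exposing as_h2o_frame / as_spark_frame (assumed, per the deck)
    munged_df   : Spark DataFrame produced by the data-munging stage
    estimator   : H2O estimator instance (has train and predict)
    features    : list of feature column names (hypothetical parameter)
    label       : name of the response column (hypothetical parameter)
    """
    # Modelling stage: hand the munged Spark data to H2O.
    training_frame = h2o_context.as_h2o_frame(munged_df)
    estimator.train(x=features, y=label, training_frame=training_frame)

    # Prediction stage: score inside H2O (here on the training data, for brevity).
    predictions = estimator.predict(training_frame)

    # Prediction processing: expose the predictions to Spark again.
    return h2o_context.as_spark_frame(predictions)
```

Passing the context and estimator in as arguments keeps the helper independent of how the cluster is started, so the same function works in a notebook or a batch job.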
Data Munging
Data Source
Data load/munging/exploration (H2O Flow UI)
Modelling
Stream processing
Data Source
Off-line model training: Data munging, Modelling, Export model
Stream processing: Data Stream (Spark Streaming/Storm/Flink), Deploy the model, Model prediction
Steam: Model management
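The hand-off between the off-line and streaming halves of this slide (train, export, deploy, predict) can be sketched in plain Python. This is only an illustration of the flow: the threshold "model", the JSON export format, and every function name below are hypothetical stand-ins, not H2O, Spark Streaming, Storm, or Flink API.

```python
import json
from statistics import mean


def train_offline(samples):
    """Off-line training on a list of (value, label) pairs, label in {0, 1}.

    Toy model: the decision threshold is the midpoint between the class means.
    """
    negatives = [value for value, label in samples if label == 0]
    positives = [value for value, label in samples if label == 1]
    return {"threshold": (mean(negatives) + mean(positives)) / 2.0}


def export_model(model):
    """Export step: serialize the trained model so another process can load it."""
    return json.dumps(model)


def deploy_model(exported):
    """Deploy step: rebuild a scoring function from the exported artifact."""
    threshold = json.loads(exported)["threshold"]

    def predict(value):
        return 1 if value > threshold else 0

    return predict


def score_stream(stream, predict):
    """Stream-processing step: score each incoming value as it arrives."""
    for value in stream:
        yield value, predict(value)


# Off-line side
exported = export_model(train_offline([(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]))

# Streaming side: only the exported artifact crosses the boundary
predict = deploy_model(exported)
for value, label in score_stream(iter([3.0, 6.0]), predict):
    print(value, label)
```

The point of the design is that the streaming side depends only on the exported artifact, not on the training code or the training cluster.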
What is inside?
Cluster
Driver node: PySpark main program, SparkContext, H2OContext
Worker nodes (three): one Spark executor each
sc = SparkContext.getOrCreate()
h2o_context = H2OContext.getOrCreate()
Data Ingest
Spark Cluster: three Spark Executors, each with an H2O Store
Data Source: DataFrame
Data Source: H2OFrame
Data Exchange (PyAPI)
Spark Cluster: three Spark Executors, each with an H2O Store
DataFrame, H2OFrame
h2o_context.as_spark_frame: H2OFrame serves data for DataFrame operations
Data Exchange (PyAPI)
Spark Cluster: three Spark Executors, each with an H2O Store
DataFrame, H2OFrame
h2o_context.as_h2o_frame: Materializes DataFrame as H2OFrame (H2O as a clever cache)
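The two conversions on the data-exchange slides can be put side by side in a small round-trip helper. This is a sketch under assumptions: it relies only on the two method names shown in this deck, each called with a single frame argument. The helper name `round_trip` is hypothetical, and method names or signatures may differ between Sparkling Water releases.

```python
def round_trip(h2o_context, spark_df):
    """Move a Spark DataFrame into H2O and expose the result back to Spark.

    Returns the intermediate H2OFrame and the Spark-side view of it.
    """
    # Materialize the Spark DataFrame as an H2OFrame in the H2O store.
    h2o_frame = h2o_context.as_h2o_frame(spark_df)

    # Expose the H2OFrame to Spark; H2O serves the data for DataFrame operations.
    spark_view = h2o_context.as_spark_frame(h2o_frame)
    return h2o_frame, spark_view


# Usage with real contexts (needs a running Spark + Sparkling Water setup):
#   from pyspark import SparkContext
#   from pysparkling import H2OContext
#   sc = SparkContext.getOrCreate()
#   h2o_context = H2OContext.getOrCreate()  # as written on the slide; check your release's signature
#   h2o_frame, spark_view = round_trip(h2o_context, some_spark_dataframe)
```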
DEMO
Sentiment Analysis with PySparkling
Start PySparkling
Opens Jupyter Notebook
Download from h2o.ai/download
Future
The Plan
Separation of H2O cluster from Spark infrastructure ✓ Preserving existing API
h2oContext = H2OContext.getOrCreate(ip="…", port=…)
Better integration into PySpark pipelines ✓ Support of H2O Ensembles (right now only as R-package)
Integration with Steam platform to support model management
DeepWater integration
H2O DeepWater with Python, early sneak: Fabrizio Milo, Sunday 3pm
More info
Check out GitHub & Contribute: https://github.com/h2oai/sparkling-water
Check out H2O.ai Training Books: http://h2o.ai/resources
Check out H2O.ai Blog: http://h2o.ai/blog/
Check out H2O.ai Youtube Channel: https://www.youtube.com/user/0xdata
Learn more at h2o.ai Follow us at @h2oai
Come to see us at Open Tour in Dallas! See open.h2o.ai
PySparkling is an open-source ML application platform combining the power of PySpark and H2O
Thank you!
DALLAS, TX OCT 26th