PMML with R and Java
Thomas Darimont Data Science Meetup Luxembourg
24th Sep 2014
Predictive Model Lifecycle
Traditional way …
Scientist
• Uses statistical tool
• Defines / trains model
• R, Python
• Writes model specification
Model Specification V1
Engineer
• Implements spec
• Writes custom code
• C++, C#, Java
• Deploys model (code)
Source: Own representation based on “Representing Predictive Solutions with PMML”, by Alex Guazzelli https://www.youtube.com/watch?v=QBpguVZRVPo
Problems
• Model definition not machine readable
• Model needs to be implemented by hand
• Changes in the model documents have to be propagated – by hand
• Time consuming (weeks, months, years)
• Prone to errors
• Implementation ≠ Specification
Solution?
PMML
• Predictive Model Markup Language
• Open Standard
• Maintained by Data Mining Group (DMG)
• XML based DSL for predictive models
• First Version (1999) – Current Version 4.2.1
Goal: “Bridge the Gap between Data Scientists and Engineers”
Anatomy of PMML Model
• Pre Processing
• Predictive Model
  – Algorithm description(s)
  – Parameterization → trained model
• Post Processing
  – Transform model output
  – Thresholds / Business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.
PMML General Structure
Header
• Version / Timestamp
• Model development environment information
Data Dictionary
• Definition of variable types
• Handling of valid, invalid and missing values
Data Transformations
• Pre-processing: Normalization, mapping and discretization
• Built-in and user defined functions
Model 1..*
• Mining Schema
• Targets
• Outputs
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.
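To make the general structure concrete, here is a minimal sketch that parses a small hand-written PMML document and lists its top-level sections. The document is hypothetical: the field names x and y, the coefficients, and the helper names (local_name, top_level_sections, data_fields) are assumptions made for illustration, and only Python's standard xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_2"

# Hand-written, hypothetical PMML document; x and y are placeholder fields.
DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <Header description="hand-written sketch"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary/>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.0">
      <NumericPredictor name="x" coefficient="1.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def top_level_sections(pmml_text):
    """Return the names of the elements directly below <PMML>, in order."""
    root = ET.fromstring(pmml_text)
    return [local_name(child.tag) for child in root]


def data_fields(pmml_text):
    """Map each DataField name to its (optype, dataType) pair."""
    root = ET.fromstring(pmml_text)
    ns = {"p": PMML_NS}
    return {
        field.get("name"): (field.get("optype"), field.get("dataType"))
        for field in root.findall("p:DataDictionary/p:DataField", ns)
    }


print(top_level_sections(DOC))
print(data_fields(DOC))
```

The order of the returned section names mirrors the slide: Header, Data Dictionary, Data Transformations, then one or more models.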
PMML Model Structure
Mining Schema
• Definition of usage type
• Outlier and missing value treatment / replacement
Targets
• Prior probability and default value
Outputs
• List of computed output fields
• Post-processing
• Definition of model specific parameters (Parameters)
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.
PMML example
irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)
[Figure: PMML generated for irisModel, with callouts for Header, Data Dictionary, Model, Model Parameters and Output]
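The PMML shown on the slide is not reproduced in this transcript. As a rough illustration of what a consumer does with such a file, the sketch below scores a hand-written RegressionModel for the same formula. The intercept and coefficient are illustrative placeholders, not the values fitted by lm, and score_regression is a hypothetical helper rather than part of any PMML library.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_2"}

# Hand-written stand-in for the exported model.
# The intercept and coefficient are placeholders, NOT the fitted iris values.
IRIS_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="Petal.Length" optype="continuous" dataType="double"/>
    <DataField name="Petal.Width" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="Petal.Length"/>
      <MiningField name="Petal.Width" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="-0.36">
      <NumericPredictor name="Petal.Length" coefficient="0.42"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def score_regression(pmml_text, record):
    """Compute intercept + sum(coefficient * value ** exponent) from the RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        exponent = int(predictor.get("exponent", "1"))  # exponent defaults to 1
        value = record[predictor.get("name")]
        result += float(predictor.get("coefficient")) * value ** exponent
    return result


print(score_regression(IRIS_PMML, {"Petal.Length": 4.0}))
```

The point of the sketch is that the scoring side needs only the XML file, not the R session that trained the model.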
PMML Supported Models
• 15 model types
• Association Rules
• Baseline Models
• Cluster Models
• (General) Regression
• k-Nearest Neighbors
• Naive Bayes
• Neural Network
• Ruleset
• Scorecard
• Sequences
• Text Models
• Time Series
• Trees
• Vector Machine
• … roll your own: Ensemble Models -> Use provided building blocks
PMML Transformations
• Normalization: map values to numbers; the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
• Discretization: map continuous values to discrete values.
• Value Mapping: map discrete values to other discrete values.
• Functions: derive a value by applying a function to one or more parameters.
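The first three kinds of transformation can be sketched as plain Python functions. This is a simplified illustration, not an implementation of the PMML elements: the function names, the half-open bin convention, and the choice to extrapolate outside the known range are assumptions of this sketch.

```python
from bisect import bisect_right


def norm_continuous(x, points):
    """Piecewise-linear mapping through (original, normalized) pairs.

    Values outside the covered range are extrapolated from the nearest
    segment (an assumption of this sketch).
    """
    points = sorted(points)
    origs = [orig for orig, _ in points]
    # Pick the segment containing x, clamped to the first/last segment.
    i = min(max(bisect_right(origs, x) - 1, 0), len(points) - 2)
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def norm_discrete(value, category):
    """Indicator encoding: 1.0 if the value equals the category, else 0.0."""
    return 1.0 if value == category else 0.0


def discretize(x, bins, default=None):
    """Map a continuous value to the label of the first bin [low, high) containing it.

    None as a bound means the bin is open on that side.
    """
    for low, high, label in bins:
        if (low is None or x >= low) and (high is None or x < high):
            return label
    return default


def map_values(value, table, default=None):
    """Map one discrete value to another via a lookup table."""
    return table.get(value, default)


print(norm_continuous(5.0, [(0.0, 0.0), (10.0, 1.0)]))
print(discretize(17, [(None, 18, "minor"), (18, None, "adult")]))
print(map_values("m", {"m": "male", "f": "female"}, default="unknown"))
```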
PMML Functions
• Custom functions for common transformations
• Building blocks
Category | Functions
Arithmetic | +, -, * and /
Math | log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
Stats | min, max, sum, avg, median, product
Logic | if, and, or, not, equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual, isMissing, isNotMissing, isIn, isNotIn
String | uppercase, lowercase, substring, trimBlanks, concat, replace, matches
Format | formatNumber, formatDatetime
Date/Time | dateDaysSinceYear, dateSecondsSinceYear, dateSecondsSinceMidnight
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 63.
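In PMML these functions are combined through nested Apply elements with FieldRef and Constant arguments. The sketch below evaluates such an expression for a small subset of the functions in the table. It is a simplification under stated assumptions: namespaces are omitted, constants are treated as numbers unless marked dataType="string", both branches of "if" are evaluated eagerly, and missing-value handling is left out. The field name watts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Small subset of the built-in functions, implemented as Python callables.
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "min": lambda *xs: min(xs),
    "max": lambda *xs: max(xs),
    "avg": lambda *xs: sum(xs) / len(xs),
    "greaterThan": lambda a, b: a > b,
    "uppercase": lambda s: s.upper(),
    "if": lambda cond, then, otherwise=None: then if cond else otherwise,
}


def evaluate(node, record):
    """Recursively evaluate an Apply / FieldRef / Constant expression tree."""
    if node.tag == "FieldRef":
        return record[node.get("field")]
    if node.tag == "Constant":
        # Assumption of this sketch: numeric unless explicitly a string.
        if node.get("dataType") == "string":
            return node.text
        return float(node.text)
    if node.tag == "Apply":
        args = [evaluate(child, record) for child in node]
        return FUNCTIONS[node.get("function")](*args)
    raise ValueError(f"unsupported element: {node.tag}")


# Hypothetical derived field: label a reading as "high" above 1000 watts.
EXPR = ET.fromstring("""<Apply function="if">
  <Apply function="greaterThan">
    <FieldRef field="watts"/>
    <Constant>1000</Constant>
  </Apply>
  <Constant dataType="string">high</Constant>
  <Constant dataType="string">normal</Constant>
</Apply>""")

print(evaluate(EXPR, {"watts": 1500.0}))
print(evaluate(EXPR, {"watts": 200.0}))
```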
PMML Multiple Models
• Several ways for combining multiple models in one PMML file
  – Model Segmentation
  – Model Ensemble
  – Model Chaining
  – Model Composition
• Custom extensions for referencing external model files
Model Segmentation
[Figure: inside a PMML File, Raw input → Input Validation (Outliers, Missing Values, Invalid Values) → Data Pre-Processing → Predictive Model → Prediction. The Predictive Model contains Model 1, Model 2, …, Model n; predicate based model selection (e.g. SelectFirst) with predicates such as X = 1 and X = 2 decides which model is used.]
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 190.
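Predicate based selection can be sketched in a few lines: each segment pairs a predicate with a model, and the first segment whose predicate holds produces the prediction. The function name select_first, the field X, and the two toy models are assumptions made for illustration only.

```python
def select_first(segments, record):
    """Return the prediction of the first segment whose predicate is true.

    segments is an ordered list of (predicate, model) pairs; both are
    callables that take the input record. Returns None if nothing matches.
    """
    for predicate, model in segments:
        if predicate(record):
            return model(record)
    return None


# Hypothetical segments keyed on the field X, mirroring "X = 1" / "X = 2".
SEGMENTS = [
    (lambda r: r["X"] == 1, lambda r: "scored by model 1"),
    (lambda r: r["X"] == 2, lambda r: "scored by model 2"),
    (lambda r: True, lambda r: "scored by fallback model"),
]

print(select_first(SEGMENTS, {"X": 2}))
```

Because evaluation stops at the first match, the order of the segments matters; a catch-all predicate belongs at the end.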
Model Ensemble
[Figure: inside a PMML File, Raw input → Input Validation → Data Pre-Processing → Model 1, Model 2, …, Model n → Voting → Prediction. Scores from all models are computed and combined by Majority Voting, Weighted Voting, Weighted Average, etc.]
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 193.
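The three combination rules named on the slide can be sketched as follows. The function names are assumptions of this sketch, and ties are resolved in favour of the label seen first, which is a choice made here rather than something the slides specify.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label (first seen wins on ties)."""
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels, weights):
    """Return the label with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def weighted_average(scores, weights):
    """Combine numeric scores as a weight-normalized average."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


print(majority_vote(["churn", "stay", "churn"]))
print(weighted_vote(["churn", "stay", "churn"], [1.0, 5.0, 1.0]))
print(weighted_average([1.0, 3.0], [1.0, 3.0]))
```

Voting applies to classification outputs, while the weighted average applies to numeric scores.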
Model Chaining
[Figure: inside a PMML File, Raw input → Input Validation → Data Pre-Processing → Model 1 → Model 2 → … → Model n → Prediction. Output scores from earlier models are used by subsequent models.]
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 195.
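Chaining can be sketched as a loop in which every model's output is added to the record under a named output field, so that later models can read it. The function run_chain, the field names, and the two toy models are hypothetical.

```python
def run_chain(models, record):
    """Run models in order; each output becomes an input field for later models.

    models is an ordered list of (output_field_name, model) pairs.
    Returns the output of the last model, or None for an empty chain.
    """
    fields = dict(record)  # copy so the caller's record is not modified
    result = None
    for output_name, model in models:
        result = model(fields)
        fields[output_name] = result
    return result


# Hypothetical two-step chain: a risk score feeds a final decision.
CHAIN = [
    ("risk", lambda f: 2 * f["x"]),
    ("decision", lambda f: "review" if f["risk"] > 5 else "accept"),
]

print(run_chain(CHAIN, {"x": 3}))
```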
Model Composition
[Figure: inside a PMML File, Raw input → Input Validation → Data Pre-Processing → Main Model → Prediction. The Main Model uses predicate based model selection among Model 2, …, Model n.]
Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 196.
Model Verification
• “Scoring matching test”
• “Regression tests for models”
• VerificationFields
  – Asserts, range checks for results
• InlineTable
  – Input + expected output
  – Include already scored data
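The idea behind verification, replaying already scored rows and comparing results within a tolerance, can be sketched as below. The function verify and its parameters are assumptions of this sketch, loosely modelled on the precision and zeroThreshold attributes of PMML verification fields; the toy model and rows are made up.

```python
def verify(model, rows, target, precision=1e-6, zero_threshold=1e-16):
    """Return the indexes of rows whose score deviates from the expected value.

    Each row holds the input fields plus the expected output under `target`.
    The comparison is relative to the expected value, except near zero,
    where an absolute threshold is used instead.
    """
    failed = []
    for index, row in enumerate(rows):
        inputs = {name: value for name, value in row.items() if name != target}
        expected = row[target]
        actual = model(inputs)
        if abs(expected) <= zero_threshold:
            ok = abs(actual) <= zero_threshold
        else:
            ok = abs(actual - expected) <= precision * abs(expected)
        if not ok:
            failed.append(index)
    return failed


# Hypothetical model and already scored data; the second row is deliberately wrong.
def toy_model(inputs):
    return 2.0 * inputs["x"]


ROWS = [
    {"x": 1.0, "y": 2.0},
    {"x": 2.0, "y": 4.5},
    {"x": 0.0, "y": 0.0},
]

print(verify(toy_model, ROWS, target="y"))
```

A deployment pipeline could run such a check after loading a model and refuse to serve it if any row fails.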
Model Deployment with PMML
Model Building
• Statistics Tool
• Data Mining Tool
• …
Export Model → Deploy Model
Model Scoring
• Analytics Application
Example Application
Real-time Analytics in Stream Processing
Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.
Example Application cont.
“Prediction of short-term energy consumption in a SmartGrid”
Real-time Analytics in Stream Processing
[Figure: architecture on an EC2 Cluster. Input: Sensor Data (W / kWh every s; 40 houses, 325 households, 2125 plugs). Components: RabbitMQ, Spring XD, analytic-PMML, Spring Batch, PMML, R, Madlib, Redis, Postgresql, HDFS, Spring Boot, HTML5 / JS, D3.]
Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.
PMML Tools
• R / Rattle
• RapidMiner
• KNIME
• Various PMML Tools from Zementis
  – Transformation Generator
  – Generic Operation Generator
• Py2PMML
  – Can transform models learned with scikit-learn to PMML
• SPSS
• SAS
• Statistica
• …
PMML Industry Support
Digest of analytic software vendors with PMML support
• http://www.dmg.org/products.html
• IBM
• Microsoft
• Google
• Oracle
• EMC
• Pivotal
• SAS
• Pentaho
• Teradata
PMML Resources
• PMML in Action 2nd Edition Book
• https://support.zementis.com/entries/22119057-Top-10-PMML-Resources
• http://journal.r-project.org/archive/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf
• https://www.ibm.com/developerworks/opensource/library/ba-ind-PMML1/
• http://zementis.com/knowledge-base-resources/white-papers/
• YouTube talk: Representing Predictive Solutions with PMML https://www.youtube.com/watch?v=QBpguVZRVPo
PMML Summary
• Open
• Mature
• Extensible
• Standard
• Broad industry support
“PMML is the Lingua Franca for sharing Predictive Model Solutions”
Source: Dr. Alex Guazzelli, Representing Predictive Solutions with PMML, YouTube, 2012
PMML with R
• Packages
  – pmml / 10 years
    • Export model to PMML
  – pmmlTransformations / 1.5 years
    • WrapData wraps dataframe in a SmartObject (SO)
    • Transformations applied to SO are saved in PMML
• Support for: ksvm, nnet, rpart, lm & glm, arules, kmeans and hclust, randomForest
PMML with R
• Hello World
DEMO
PMML example
irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)
PMML with Java
• JPMML https://github.com/jpmml/jpmml
  – Java based dual licensed AGPL V3 “Umbrella” Project
  – Reference implementation of PMML standard
  – Backed by http://openscoring.io/
  – Supports latest PMML Version >= 3.0
  – 12 out of 15 model types supported (no Time Series)
• jpmml-evaluator sub-project
  – API for scoring / evaluation
• jpmml-model sub-project
  – JAXB model derived from PMML XSD
  – Import / Export / Model generation
• Some integration projects
  – Hive, PostgreSQL, pig
  – Planned: Apache Storm and Apache Spark
DEMO
Questions