PMML with R and Java Thomas Darimont Data Science Meetup Luxembourg 24th Sep 2014

PMML with R and Java - Meetup files.meetup.com/5431352/DS_lux_pmml_with_R_td_draft_v11.pdf · PMML • Predictive Model Markup Language • Open Standard • Maintained by Data Mining Group (DMG)


PMML with R and Java

Thomas Darimont
Data Science Meetup Luxembourg

24th Sep 2014

Predictive Model Lifecycle

Traditional way …

Scientist
• Uses statistical tool
• Defines / trains model
• R, Python
• Writes model specification

Scientist → Model Specification V1 → Engineer

Engineer
• Implements Spec
• Writes custom code
• C++, C#, Java
• Deploys model (code)

Source: Own representation based on “Representing Predictive Solutions with PMML”, by Alex Guazzelli https://www.youtube.com/watch?v=QBpguVZRVPo

Problems

• Model definition not machine readable
• Model needs to be implemented by hand
• Changes in the model documents have to be propagated – by hand
• Time consuming (weeks, months, years)
• Prone to errors
• Implementation ≠ Specification

Solution?

PMML

• Predictive Model Markup Language
• Open Standard
• Maintained by Data Mining Group (DMG)
• XML based DSL for predictive models
• First Version (1999) – Current Version 4.2.1

Goal: “Bridge the Gap between Data Scientists and Engineers”

Anatomy of PMML Model

• Pre Processing
• Predictive Model
  – Algorithm description(s)
  – Parameterization → trained model
• Post Processing
  – Transform model output
  – Thresholds / Business rules

Source: PMML in Action, 2nd Edition, 2012, p. 7.

PMML General Structure

Header
• Version / Timestamp
• Model development environment information

Data Dictionary
• Definition of variable types
• Handling of valid, invalid and missing values

Data Transformations
• Pre-processing: Normalization, mapping and discretization
• Built-in and user defined functions

Model 1..*
• Mining Schema
• Targets
• Outputs

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.
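To make the general structure concrete, here is a minimal sketch in Java that holds a hand-written PMML skeleton as a string and lists its top-level sections using only the JDK's DOM parser. The field names `x` and `y`, the description text and the coefficients are assumptions chosen for illustration, and the skeleton has not been validated against the PMML XSD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class PmmlSkeleton {
    // Illustrative skeleton only: field names and values are made up.
    static final String PMML =
        "<PMML version=\"4.2\" xmlns=\"http://www.dmg.org/PMML-4_2\">"
      + "<Header description=\"illustrative skeleton\"/>"
      + "<DataDictionary numberOfFields=\"2\">"
      + "<DataField name=\"x\" optype=\"continuous\" dataType=\"double\"/>"
      + "<DataField name=\"y\" optype=\"continuous\" dataType=\"double\"/>"
      + "</DataDictionary>"
      + "<TransformationDictionary/>"
      + "<RegressionModel functionName=\"regression\">"
      + "<MiningSchema>"
      + "<MiningField name=\"x\"/>"
      + "<MiningField name=\"y\" usageType=\"target\"/>"
      + "</MiningSchema>"
      + "<RegressionTable intercept=\"0.0\">"
      + "<NumericPredictor name=\"x\" coefficient=\"1.0\"/>"
      + "</RegressionTable>"
      + "</RegressionModel>"
      + "</PMML>";

    // Returns the element names directly below the <PMML> root, in document order.
    static List<String> topLevelSections(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> names = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    names.add(children.item(i).getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(topLevelSections(PMML));
    }
}
```

The four element names map onto the boxes of the slide: Header, Data Dictionary, Data Transformations (TransformationDictionary) and one model element.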

PMML Model Structure

Mining Schema
• Definition of usage type
• Outlier and missing value treatment / replacement

Targets
• Prior probability and default value

Outputs
• List of computed output fields
• Post-processing

(Parameters)
• Definition of model specific parameters

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.

PMML example

irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)

[Figure: PMML export of irisModel, annotated with the sections Header, Data Dictionary, Model, Model Parameters and Output]

PMML Supported Models

• 15 model types
• Association Rules
• Baseline Models
• Cluster Models
• (General) Regression
• k-Nearest Neighbors
• Naive Bayes
• Neural Network
• Ruleset
• Scorecard
• Sequences
• Text Models
• Time Series
• Trees
• Vector Machine
• … roll your own: Ensemble Models -> Use provided building blocks

PMML Transformations

• Normalization: map values to numbers, the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
• Discretization: map continuous values to discrete values.
• Value Mapping: map discrete values to other discrete values.
• Functions: derive a value by applying a function to one or more parameters.
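The transformation types above can be sketched in plain Java. This is not PMML engine code: the class `TransformSketch`, its method names and the array-based parameters are hypothetical, and outlier and missing-value handling is deliberately left out.

```java
import java.util.Map;

class TransformSketch {
    // NormContinuous: piecewise-linear interpolation between (orig, norm)
    // support points; orig is assumed to be sorted in ascending order.
    static double normContinuous(double x, double[] orig, double[] norm) {
        for (int i = 0; i < orig.length - 1; i++) {
            if (x >= orig[i] && x <= orig[i + 1]) {
                double t = (x - orig[i]) / (orig[i + 1] - orig[i]);
                return norm[i] + t * (norm[i + 1] - norm[i]);
            }
        }
        throw new IllegalArgumentException("outlier treatment not shown: " + x);
    }

    // NormDiscrete: indicator that is 1.0 when the field equals the given value.
    static double normDiscrete(String fieldValue, String value) {
        return value.equals(fieldValue) ? 1.0 : 0.0;
    }

    // Discretization: bin i covers [bounds[i], bounds[i + 1]) and yields labels[i].
    static String discretize(double x, double[] bounds, String[] labels, String defaultValue) {
        for (int i = 0; i < labels.length; i++) {
            if (x >= bounds[i] && x < bounds[i + 1]) {
                return labels[i];
            }
        }
        return defaultValue;
    }

    // Value mapping: look up a discrete value in a table, with a fallback.
    static String mapValue(Map<String, String> table, String in, String defaultValue) {
        return table.getOrDefault(in, defaultValue);
    }

    public static void main(String[] args) {
        System.out.println(normContinuous(5.0, new double[] {0.0, 10.0}, new double[] {0.0, 1.0}));
        System.out.println(discretize(25.0, new double[] {0, 18, 65, 120},
            new String[] {"child", "adult", "senior"}, "unknown"));
    }
}
```

In a PMML file the same information would be declared as XML (support points, bin intervals, mapping table) and interpreted by the scoring engine, rather than coded by hand.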

PMML Functions

• Custom functions for common transformations
• Building blocks

Category     Functions
Arithmetic   +, -, * and /
Math         log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
Stats        min, max, sum, avg, median, product
Logic        if, and, or, not, equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual, isMissing, isNotMissing, isIn, isNotIn
String       uppercase, lowercase, substring, trimBlanks, concat, replace, matches
Format       formatNumber, formatDatetime
Date/Time    dateDaysSinceYear, dateSecondsSinceYear, dateSecondsSinceMidnight

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 63.

PMML Multiple Models

• Several ways for combining multiple models in one PMML file
  – Model Segmentation
  – Model Ensemble
  – Model Chaining
  – Model Composition
• Custom extensions for referencing external model files

Model Segmentation

[Diagram, inside one PMML File: Raw input → Input Validation (Outliers, Missing Values, Invalid Values) → Data Pre-Processing → Predictive Model (Model 1, Model 2, …, Model n) → Prediction]

Predicate based Model selection (branches labelled X = 1, X = 2). E.g.: SelectFirst

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 190.
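Predicate based selection in the style of SelectFirst can be sketched as follows. The `Segment` class, the input fields `X` and `y`, and the three toy models are assumptions made for illustration; a real engine would read predicates and models from the PMML file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

class SegmentationSketch {
    // One segment: a predicate that guards a model.
    static final class Segment {
        final Predicate<Map<String, Double>> predicate;
        final ToDoubleFunction<Map<String, Double>> model;

        Segment(Predicate<Map<String, Double>> predicate,
                ToDoubleFunction<Map<String, Double>> model) {
            this.predicate = predicate;
            this.model = model;
        }
    }

    // SelectFirst: score with the first segment whose predicate is true.
    static double selectFirst(List<Segment> segments, Map<String, Double> input) {
        for (Segment s : segments) {
            if (s.predicate.test(input)) {
                return s.model.applyAsDouble(input);
            }
        }
        throw new IllegalStateException("no segment matched");
    }

    // Hypothetical segments mirroring the X = 1 / X = 2 branches of the slide.
    static List<Segment> exampleSegments() {
        return Arrays.asList(
            new Segment(in -> in.get("X") == 1.0, in -> 10.0 * in.get("y")),
            new Segment(in -> in.get("X") == 2.0, in -> 20.0 * in.get("y")),
            new Segment(in -> true, in -> 0.0)); // catch-all segment
    }

    // Convenience wrapper that builds the input record.
    static double score(double x, double y) {
        Map<String, Double> input = new HashMap<>();
        input.put("X", x);
        input.put("y", y);
        return selectFirst(exampleSegments(), input);
    }

    public static void main(String[] args) {
        System.out.println(score(2.0, 3.0));
    }
}
```

Because the segments are tried in order, a catch-all predicate at the end acts as a default model.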

Model Ensemble

[Diagram, inside one PMML File: Raw input → Input Validation → Data Pre-Processing → Model 1, Model 2, …, Model n → Voting → Prediction]

Scores from all models are computed

Voting: Majority Voting, Weighted Voting, Weighted Average, etc.

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 193.
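Two of the combination rules named above, majority voting for class labels and weighted average for numeric scores, might look like this in Java. The class and method names are hypothetical, and the tie-breaking rule (the label that first reaches the highest count wins) is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class EnsembleSketch {
    // Majority voting over the class labels predicted by the member models.
    static String majorityVote(String[] predictions) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String p : predictions) {
            int c = counts.merge(p, 1, Integer::sum);
            if (c > bestCount) {
                best = p;
                bestCount = c;
            }
        }
        return best;
    }

    // Weighted average over the numeric scores of the member models.
    static double weightedAverage(double[] scores, double[] weights) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += weights[i] * scores[i];
            weightTotal += weights[i];
        }
        return weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(new String[] {"setosa", "virginica", "virginica"}));
        System.out.println(weightedAverage(new double[] {1.0, 3.0}, new double[] {1.0, 3.0}));
    }
}
```

In PMML itself the rule is not coded but declared, so switching from voting to averaging is a change in the model file rather than in the application.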

Model Chaining

[Diagram, inside one PMML File: Raw input → Input Validation → Data Pre-Processing → Model 1 → Model 2 → … → Model n → Prediction]

Output scores from earlier models are used by subsequent models

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 195.

Model Composition

[Diagram, inside one PMML File: Raw input → Input Validation → Data Pre-Processing → Main Model (selects among Model 2, …, Model n) → Prediction]

Predicate based model selection

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 196.

Model Verification

• “Scoring matching test”
• “Regression tests for models”
• VerificationFields
  – Asserts, range checks for results
• InlineTable
  – Input + expected output
  – Include already scored data
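The idea of a "scoring matching test" can be sketched as a small regression test: each row carries an input and the expected output, and the consumer checks its own score against it. The `precision` and `zeroThreshold` parameters are modelled on the attributes of the same name on PMML's VerificationField; the exact comparison rule used here is a simplifying assumption, as are the class name and the single-input row layout.

```java
import java.util.function.DoubleUnaryOperator;

class VerificationSketch {
    // Accept when the relative deviation is within precision; expected values
    // close to zero are compared against zeroThreshold instead.
    static boolean matches(double expected, double actual,
                           double precision, double zeroThreshold) {
        if (Math.abs(expected) <= zeroThreshold) {
            return Math.abs(actual) <= zeroThreshold;
        }
        return Math.abs(actual - expected) <= precision * Math.abs(expected);
    }

    // rows[i] = {input, expectedOutput}, like the records of an InlineTable.
    static int countMismatches(double[][] rows, DoubleUnaryOperator model,
                               double precision, double zeroThreshold) {
        int mismatches = 0;
        for (double[] row : rows) {
            double actual = model.applyAsDouble(row[0]);
            if (!matches(row[1], actual, precision, zeroThreshold)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 3}, {2, 5}, {3, 7.5}};
        // Toy model y = 2x + 1; the third row is deliberately wrong.
        System.out.println(countMismatches(rows, x -> 2 * x + 1, 1e-6, 1e-16));
    }
}
```

Shipping such rows inside the model file lets every scoring engine confirm that it reproduces the scores of the tool that built the model.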

Model Deployment with PMML

Model Building
• Statistics Tool
• Data Mining Tool
• …

Export Model → Deploy Model

Model Scoring
• Analytics Application

Example Application

Real-time Analytics in Stream Processing

Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.

Example Application cont.

Real-time Analytics in Stream Processing

“Prediction of short-term energy consumption in a SmartGrid”

[Architecture diagram; recoverable labels:]
• Sensor Data: W / kWh every s, 40 houses, 325 households, 2125 plugs
• Rabbit MQ
• Spring XD
• analytic-PMML
• PMML
• R, Madlib
• Spring Batch
• Redis, Postgresql, HDFS
• HTML5 / JS, D3, Spring Boot
• EC2 Cluster

Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.

PMML Tools

• R / Rattle
• RapidMiner
• KNIME
• Various PMML Tools from Zementis
  – Transformation Generator
  – Generic Operation Generator
• Py2PMML
  – Can transform models learned with scikit-learn to PMML
• SPSS
• SAS
• Statistica
• …

PMML Industry Support

Digest of analytic software vendors with PMML support
• http://www.dmg.org/products.html
• IBM
• Microsoft
• Google
• Oracle
• EMC
• Pivotal
• SAS
• Pentaho
• Teradata

PMML Resources

• PMML in Action 2nd Edition Book
• https://support.zementis.com/entries/22119057-Top-10-PMML-Resources
• http://journal.r-project.org/archive/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf
• https://www.ibm.com/developerworks/opensource/library/ba-ind-PMML1/
• http://zementis.com/knowledge-base-resources/white-papers/
• YouTube talk: Representing Predictive Solutions with PMML https://www.youtube.com/watch?v=QBpguVZRVPo

PMML Summary

• Open
• Mature
• Extensible
• Standard
• Broad industry support

“PMML is the Lingua Franca for sharing Predictive Model Solutions”

Source: Dr. Alex Guazzelli, Representing Predictive Solutions with PMML, youtube, 2012

PMML with R

• Packages
  – pmml / 10 years
    • Export model to PMML
  – pmmlTransformations / 1.5 years
    • WrapData wraps dataframe in a SmartObject (SO)
    • Transformations applied to SO are saved in PMML
• Support for: ksvm, nnet, rpart, lm & glm, arules, kmeans and hclust, randomForest

PMML with R

• Hello World

DEMO

PMML example

irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)

PMML with Java

• JPMML https://github.com/jpmml/jpmml
  – Java based dual licensed AGPL V3 “Umbrella” Project
  – Reference implementation of PMML standard
  – Backed by http://openscoring.io/
  – Supports latest PMML Version >= 3.0
  – 12 out of 15 model types supported (No Time Series ☹)
• jpmml-evaluator sub-project
  – API for scoring / evaluation
• jpmml-model sub-project
  – JAXB model derived from PMML XSD
  – Import / Export / Model generation
• Some integration projects
  – Hive, PostgreSQL, pig
  – Planned: Apache Storm and Apache Spark

DEMO
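The demo itself is not part of the slides. As a stand-in, the sketch below shows what scoring a PMML regression model boils down to, using only the JDK's DOM parser. It deliberately does not use the JPMML API, whose classes differ between versions. The embedded PMML is a hand-written fragment for the iris model from the earlier example; the intercept and coefficient are approximate values for that fit and should be treated as illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RegressionScorer {
    // Hand-written fragment, not actual output of the R pmml package.
    static final String IRIS_PMML =
        "<PMML version=\"4.2\" xmlns=\"http://www.dmg.org/PMML-4_2\">"
      + "<Header description=\"illustrative iris model\"/>"
      + "<DataDictionary numberOfFields=\"2\">"
      + "<DataField name=\"Petal.Width\" optype=\"continuous\" dataType=\"double\"/>"
      + "<DataField name=\"Petal.Length\" optype=\"continuous\" dataType=\"double\"/>"
      + "</DataDictionary>"
      + "<RegressionModel functionName=\"regression\">"
      + "<MiningSchema>"
      + "<MiningField name=\"Petal.Width\" usageType=\"target\"/>"
      + "<MiningField name=\"Petal.Length\"/>"
      + "</MiningSchema>"
      + "<RegressionTable intercept=\"-0.363076\">"
      + "<NumericPredictor name=\"Petal.Length\" coefficient=\"0.415755\"/>"
      + "</RegressionTable>"
      + "</RegressionModel>"
      + "</PMML>";

    // Computes intercept + sum(coefficient * input) for the first RegressionTable.
    // Exponents, categorical predictors and missing values are not handled.
    static double score(String pmml, Map<String, Double> input) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
            double result = Double.parseDouble(table.getAttribute("intercept"));
            NodeList predictors = table.getElementsByTagName("NumericPredictor");
            for (int i = 0; i < predictors.getLength(); i++) {
                Element p = (Element) predictors.item(i);
                double coefficient = Double.parseDouble(p.getAttribute("coefficient"));
                result += coefficient * input.get(p.getAttribute("name"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not score PMML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(score(IRIS_PMML, Collections.singletonMap("Petal.Length", 1.4)));
    }
}
```

A library such as jpmml-evaluator adds what this sketch omits: schema-aware parsing, input validation, transformations and support for the other model types.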

Questions