
Doubt Truth to be a Liar: Non Triviality of Type Safety for Machine Learning


Matthew Tovbin Principal Engineer, Salesforce Einstein

[email protected] @tovbinm

Doubt Truth to be a Liar Non Triviality of Type Safety for Machine Learning

“Doubt thou the stars are fire, Doubt that the sun doth move, Doubt truth to be a liar, But never doubt I love.” - William Shakespeare, Hamlet

A glimpse into the future

What I am going to talk about:
• Machine Learning (ML) 101
• Real-life ML
• Building ML application with Spark ML
• Typed Feature Engineering with Optimus Prime
• Behind the scenes
• Going forward

Machine Learning 101

What is Machine Learning?
• “The capacity of a computer to learn from experience, i.e. to modify its processing on the basis of newly acquired information” – 1950s, IBM Journal

What is “experience” in computer terms?
• It’s just data.

What are the tasks Machine Learning solves?

• Recognition, diagnosis, prediction, forecasting, planning, data mining, etc.

Machine Learning 101  Building a ML model

[Diagram] Feature Engineering → Model Training → (Model A | Model B | Model C) → Model Evaluation

Real-life ML  Building a ML model pipeline

[Diagram] ETL → Feature Engineering → Model Training → (Model A | Model B | Model C) → Model Evaluation → Scoring → Deployment

Real-life ML  Just a few problems to mention

• ETL is tough
• Feature Engineering is even tougher
• Our prototype in R/Python/Octave works great, but …
• Copy/pasting code across projects doesn’t scale
• Model Training fails exactly two hours after you go to sleep
   •  Data is not there
   •  Data is there, but in the wrong format
   •  Data is there, but insufficient
   •  OOM, Insufficient Space, Serialization…
• So we have the models/scores, but can we trust them?!

Real-life ML  Specifically for Salesforce

Multi-tenancy:
• Multiple customers: Square, Fanatics, etc.
• Multiple customer environments: Live, Staging, QA, Dev.
• Multiple data sources: SFDC, Marketing Cloud, Service Cloud, IoT Cloud, etc.
• Multiple data entities: Leads, Opportunities, Email Campaigns, etc.
• Multiple applications: Lead Scoring, Predictive Journeys, or custom.

Plus: Security, Scale, Automation, Transparency, Cost Efficiency and so on.

Salesforce Einstein

AI for everyone
• ML Platform for customers, engineers, data scientists
• No need for ETL or PhD

PredictionIO
• Most starred Scala project on GitHub
• Now part of Apache’s incubator program

Optimus Prime
• In-house transformation framework
• Declarative, collaborative, reusable, typed

Services, Microservices, Nanoservices…

Building ML application with Spark ML

Predict survival on the Titanic

Sources: https://www.kaggle.com/c/titanic/data, https://github.com/BenFradet/spark-kaggle

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S
...
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30 | C148 | C
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.75 | | Q

Building ML application with Spark ML  [1/5] - Spinning up Spark

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._

// Spinning up Spark
val conf = new SparkConf().setMaster("local[2]")
val session = SparkSession.builder.config(conf).getOrCreate()
val (sc, sqlc) = (session.sparkContext, session.sqlContext)

import sqlc.implicits._

Building ML application with Spark ML  [2/5] - Reading the data

def readData(file: String): DataFrame = {
  val schema = StructType(Array(
    StructField("PassengerId", IntegerType, nullable = true), // <-- field names are easy to misspell
    StructField("Survived", DoubleType, nullable = true),     // <-- we can set any type here
    StructField("Pclass", DoubleType, nullable = true),
    StructField("Name", StringType, nullable = true),
    StructField("Sex", StringType, nullable = true),
    StructField("Age", DoubleType, nullable = true),
    StructField("SibSp", DoubleType, nullable = true),
    StructField("Parch", DoubleType, nullable = true),
    StructField("Ticket", StringType, nullable = true),
    StructField("Fare", DoubleType, nullable = true),
    StructField("Cabin", StringType, nullable = true),
    StructField("Embarked", StringType, nullable = true)
  ))
  val df: DataFrame = sqlc.read.format("csv").option("header", "true").schema(schema).load(file)
  // Select and rename the necessary fields
  val select = Array($"Survived".as("survived"), $"Sex".as("sex"), $"Age".as("age"),
    $"Pclass".as("pclass"), $"SibSp".as("sibsp"), $"Parch".as("parch"), $"Embarked".as("embarked"))
  df.select(select: _*)
}
val rawData: DataFrame = readData(file = "titanic.csv") // <-- runtime exceptions...

Building ML application with Spark ML  [3/5] - Feature Engineering


def addFeatures(df: DataFrame): DataFrame = {
  // Create a new family size field := siblings + spouses + parents + children + self
  val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }

  df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
}

def fillMissing(df: DataFrame): DataFrame = {
  // Fill missing age values with the average age
  val avgAge = df.agg(avg("age")).first().getDouble(0)

  // Fill missing embarked values with default "S" (i.e. Southampton)
  val embarkedUDF = udf { (e: String) => if (e == null || e.isEmpty) "S" else e }

  df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
}

// Modify the dataframe
val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching!
val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())
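As a side note, Spark ML (2.2+) also ships an Imputer estimator that can fill missing numeric values as a reusable pipeline stage instead of a hand-rolled aggregation; a minimal sketch, assuming the column names from readData above (the output column name is our choice here):

import org.apache.spark.ml.feature.Imputer

// Sketch: fill missing ages with the column mean as a pipeline stage
val ageImputer = new Imputer()
  .setInputCols(Array("age"))
  .setOutputCols(Array("age_imputed"))
  .setStrategy("mean")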

Building ML application with Spark ML  [4/5] - Building the pipeline

// Prepare categorical columns
val categoricalFeatures = Array("pclass", "sex", "embarked")
val stringIndexers = categoricalFeatures.map(colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
)

// Concat all the features into a numeric feature vector (transformer)
val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)

val vectorAssembler =
  new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")

// Prepare the Logistic Regression estimator
val logisticRegression =
  new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")

// Finally, build the pipeline with the stages above
val pipeline =
  new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logisticRegression))

Building ML application with Spark ML  [5/5] - Model training

// Cross validate our pipeline with various parameters
val paramGrid = {
  new ParamGridBuilder()
    .addGrid(logisticRegression.regParam, Array(1.0, 0.1, 0.01))
    .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
    .build()
}
val crossValidator = {
  new CrossValidator()
    .setEstimator(pipeline)
    .setEstimatorParamMaps(paramGrid)
    .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
    .setNumFolds(3)
}
// Train the model & compute scores
val model: CrossValidatorModel = crossValidator.fit(trainSet)
val scores: DataFrame = model.transform(testSet)

// Save the model for later use
model.save("titanic-lr-model.ml")
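The persisted model can later be reloaded and reused for scoring via Spark's standard persistence API; a minimal sketch:

import org.apache.spark.ml.tuning.CrossValidatorModel

// Reload the saved model and score new data with it
val reloaded = CrossValidatorModel.load("titanic-lr-model.ml")
val newScores = reloaded.transform(testSet)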

Building ML application with Spark ML  The good parts

• Simple abstractions: Transformers (.map), Estimators (.reduce) and Pipelines
• Serialization allows reusability of models
• Good implementations of various estimators: Word2Vec, LR, etc.
• All-in-one: data exploration, prototyping, productionization
• Multi-language support: Java/Scala/Python
• Healthy ecosystem

Building ML application with Spark ML  The not so good parts

• No type checking (especially painful for Transformers, Estimators and Pipelines)
• Transformer and Estimator interfaces are too open: Dataset => DataFrame
• DataFrames are everywhere
   •  No type checking
   •  Easy to misspell column names (see the sketch below)
   •  No integration with ML Vector
   •  Missing a lot of RDD functionality
• Lack of support for common data I/O operations
• Schema and algorithm definitions are interleaved with data manipulations
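For instance, a misspelled column name compiles without complaint and only fails when the job runs; a minimal sketch using the rawData from earlier (the typo is deliberate):

// Compiles fine, but throws AnalysisException at runtime:
// "cannot resolve 'sruvived' given input columns ..."
val labels = rawData.select(col("sruvived")) // <-- caught only when the job runs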

Can we do better?!

Typed Feature Engineering with Optimus Prime

Optimus Prime

What is Optimus Prime?
• A transformation framework to develop reusable, modular and typed ML pipelines

Why are we building it?
• Declarative and intuitive syntax
• Typed operations with Spark ML
• Reusability of I/O operations, features, transformations, pipelines
• Separation of features and transformations from data operations


Building ML application with Optimus Prime  [1/3] – Defining a reader & raw features

import com.salesforce.op._
import com.salesforce.op.test.avro.Passenger

// Define the reader from CSV to Avro
val trainReader: DataReader[Passenger] = DataReaders.Simple.csv[Passenger](
  path = Some("titanic.csv"),
  schema = Passenger.getClassSchema.toString
)
// Define the response feature
val survived: FeatureLike[Binary] = FeatureBuilder.Binary[Passenger]
  .extract(p => Option(p.getSurvived).map(_ != 0)).asResponse

// Define the predictor features
val age = FeatureBuilder.NullableNumeric[Passenger].extract(p => Option(p.getAge)).asPredictor
val sex = FeatureBuilder.Categorical[Passenger].extract(p => Array(p.getSex)).asPredictor
val pclass = FeatureBuilder.Numeric[Passenger].extract(_.getPClass).asPredictor
val sibsp = FeatureBuilder.Numeric[Passenger].extract(_.getSibSp).asPredictor
val parch = FeatureBuilder.Numeric[Passenger].extract(_.getParCh).asPredictor
val embarked = FeatureBuilder.Text[Passenger].extract(p => Option(p.getEmbarked)).asPredictor

Building ML application with Optimus Prime  [2/3] – Feature engineering

// Create a new family size feature – no annoying UDFs here!
val fsize: FeatureLike[Numeric] = sibsp + parch + 1

// Fill missing age values with the average age, i.e. from NullableNumeric we get Numeric
val ageFilled: FeatureLike[Numeric] = age.fillMissingWithMean

// Fill missing embarked values with default "S" (i.e. Southampton)
val embarkedFilled: FeatureLike[Text] = embarked.fillMissingWith("S")

// Create a feature vector using the default vectorizers
val featureVector: FeatureLike[Vector] =
  Seq(sex, ageFilled, pclass, sibsp, parch, fsize, embarkedFilled).vectorize()
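This is where the typing pays off: nonsensical feature combinations are rejected at compile time rather than at runtime. A hypothetical sketch of what the compiler would catch (error wording approximate):

// Does not compile: no + operation is defined between a Text feature
// and a Numeric feature, so the mistake never reaches a running job
// val broken = embarked + sibsp // <-- compilation error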

Building ML application with Optimus Prime  [3/3] – Model training


// Create a model selector with two algorithms
val modelSelector = {
  new ModelSelector[Passenger]().setInput(survived, featureVector)
    .setParams(
      LogisticRegression.RegParam -> Array(1.0, 0.1, 0.01),
      LogisticRegression.MaxIter -> Array(10, 50, 100),
      RandomForest.NumTrees -> Array(3, 5, 10)
    )
    .setModels(LogisticRegression, RandomForest) // <-- multiple algorithms here
    .setEvaluator(BinaryClassification)
}
// Build the pipeline with the model selector
val pipeline = new OpPipeline[Passenger]().setInput(modelSelector)

// And only now we are spinning up Spark
val conf = new SparkConf().setMaster("local[2]")
implicit val session = SparkSession.builder.config(conf).getOrCreate()

// Train the model & compute scores
val model: OpPipelineModel[Passenger] = pipeline.train()
val scores: DataFrame = model.setReader(testReader).score()

model.save("/models/titanic-lr-model.op") // Save the model for later use

Building ML application with Optimus Prime  Summary

• Everything is typed
• Declarative and intuitive syntax with type completion
• Feature names are inferred from val names –> no misspelled names
• Features are always unique in a code block, compilation error otherwise
• Common data I/O operations provided with DataReaders
• DataFrames are abstracted away – no direct interaction
• Features and transformations are separate from data operations

Behind the scenes of Optimus Prime

Types and interactions

[Diagram] Typed features (Feature[T]: Numeric, Text, Categorical, Binary, ...) are transformed with
Transformers (.map) and Estimators (.reduce) – e.g. Unary, Binary, Average, Word2Vec – which produce
new features; Estimators are fitted into models (e.g. “My Model”). Data Readers (CSV, Avro, joined)
read the data materialized by Pipelines (e.g. Titanic, Lead Scoring), which are fitted into models.

Primitive types

case object FeatureTypes {
  type Numeric = Double
  type NullableNumeric = Option[Double]
  type Categorical = scala.collection.mutable.WrappedArray[String]
  type Text = Option[String]
  type Binary = Option[Boolean]
  type DateList = scala.collection.mutable.WrappedArray[Long]
  type KeyString = scala.collection.Map[String, String]
  type KeyNumeric = scala.collection.Map[String, Double]
  type KeyBinary = scala.collection.Map[String, Boolean]
  type Vector = org.apache.spark.ml.linalg.Vector

  // TBD: Specific/rich types: Email, PhoneNumber, ZipCode, Address, GeoLocation etc.
}

Typed Features

trait FeatureLike[O] extends Serializable {

  implicit def wtt: WeakTypeTag[O] // Overcoming type erasure

  def name: String // name of the feature

  def defaultValue: O // feature default value

  def originStage: OpPipelineStage[O] // the stage which generated this feature

  def parents: Seq[FeatureLike[_]] // the input features for the origin stage

  // feature transformation function. We have more like this...
  final def transformWith[U](stage: OpPipelineStage1[O, U]): FeatureLike[U]

  // ...
}

Feature names from vals using Macros magic

val sibsp = FeatureBuilder.Numeric[Passenger](name = "sibsp")

val sibsp = FeatureBuilder.Numeric[Passenger] // <-- can we just do this?

object FeatureBuilder {
  def Numeric[I]: FeatureBuilder[I, Numeric] = macro FeatureBuilderMacros.apply[I, Numeric]
}

// HAHA! So we meet again!
private[op] object FeatureBuilderMacros {
  def apply[I: c.WeakTypeTag, O: c.WeakTypeTag](c: Context): c.Expr[FeatureBuilder[I, O]] = {
    import c.universe._
    val enclosingValName = MacrosHelper.definingValName(c)
    val featureName = c.Expr[String](Literal(Constant(enclosingValName)))
    val fbApply = Select(reify(FeatureBuilder).tree, TermName("apply"))
    val fbExpr = c.Expr[FeatureBuilder[I, O]](Apply(fbApply, featureName.tree :: Nil))

    reify(fbExpr.splice)
  }
}

// Read more in the sbt codebase - https://goo.gl/OdPvry
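A quick illustration of what the macro buys us; a sketch, assuming name is exposed on the resulting feature as in FeatureLike above:

// The feature name is inferred from the enclosing val at compile time,
// so renaming the val automatically renames the feature
val sibsp = FeatureBuilder.Numeric[Passenger].extract(_.getSibSp).asPredictor
assert(sibsp.name == "sibsp")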

Feature transformations with Implicit Classes

// Create a new family size feature := siblings + spouses + parents + children + self
val fsize = sibsp.transformWith(new BinaryNullableAndNumeric(_ + _), parch)
  .transformWith(new BinaryNumeric(_ + _), 1)

val fsize = sibsp + parch + 1 // <-- can we just do this?

// Sure thing!
implicit class RichNullableNumericFeature[I <: NullableNumeric : TypeTag](val f: FeatureLike[I]) {

  def +[I2 <: Numeric : TypeTag](that: FeatureLike[I2]): FeatureLike[NullableNumeric] = {
    val plus = (a: Numeric, b: Numeric) => a + b
    val stage = new BinaryNullableAndNumeric(plus)

    f.transformWith[I2, NullableNumeric](f = that, stage)
  }

}

// Note: we use another Macro here as well to infer the name of the feature, i.e. "fsize"

Features and Transformers

survived | pclass | sex    | age | sibsp | parch | embarked | fsize
0        | 3      | male   | 22  | 1     | 0     | S        | 2
1        | 1      | female | 38  | 1     | 0     | C        | 2
1        | 3      | female | 26  | 0     | 0     | S        | 1

val fsize: FeatureLike[Numeric] = sibsp + parch + 1

[Diagram] sibsp: Feature[Numeric] and parch: Feature[Numeric] feed a Binary Transformer (_ + _),
producing an intermediate _ : Feature[Numeric]; that feature and the constant 1 feed a second
Binary Transformer (_ + 1), producing fsize: Feature[Numeric].

Transformation DAG: (sibsp, parch) → _ ; (_, 1) → fsize

Pipeline Stages

Typed pipeline stages

import org.apache.spark.ml.util.MLWritable
import org.apache.spark.ml.PipelineStage

// OP pipeline stages represent a feature transformation and also
// carry around the input and output feature types
trait OpPipelineStageBase extends OpPipelineStageParams with MLWritable { self: PipelineStage =>
  type InputFeatures
  type OutputFeatures

  def setInput(features: InputFeatures): this.type
  def getOutput(): OutputFeatures

  // This method allows us to modify the DataFrame schema accordingly
  final override def transformSchema(schema: StructType): StructType
}

[Diagram] Binary Transformer (_ + _): (Feature[Numeric], Feature[Numeric]) -> Feature[Numeric]

Typed pipeline stages


// Stage providing a single feature[O]
trait OpPipelineStage[O] extends OpPipelineStageBase {
  type InputFeatures
  final override type OutputFeatures = FeatureLike[O]
}

// Stage from feature[I] to a feature[O]
trait OpPipelineStage1[I, O] extends OpPipelineStage[O] {
  final override type InputFeatures = FeatureLike[I]
}

// Stage from a tuple of features to a feature[O]
trait OpPipelineStage2[I1, I2, O] extends OpPipelineStage[O] {
  final override type InputFeatures = (FeatureLike[I1], FeatureLike[I2])
}

// ...
// And so on for various combinations: 1to2, ..., 1toN, 2to2, ..., Nto1, Nto2, ...
// See Scala Product types - https://goo.gl/J3V5DP
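Following the same pattern, the next arity is mechanical to write out; a sketch not shown on the original slides:

// Stage from a triple of features to a feature[O]
trait OpPipelineStage3[I1, I2, I3, O] extends OpPipelineStage[O] {
  final override type InputFeatures = (FeatureLike[I1], FeatureLike[I2], FeatureLike[I3])
}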

1-to-1 Transformer example


import org.apache.spark.ml.Transformer

// A simple 1 to 1 transformer
trait OpTransformer1[I, O] extends Transformer with OpPipelineStage1[I, O] {
  implicit def tti: TypeTag[I]
  implicit def tto: TypeTag[O]

  // User provided transform function that operates on input feature value I and produces O
  def transformFn: I => O

  // We wrap the transform function above into a UDF
  final override def transform(dataset: Dataset[_]): DataFrame = {
    val functionUDF = udf { (in: Any) => transformFn(FeatureTypeDefaults.castAs[I](in)) }

    // Return a dataset with a new column
    dataset.withColumn(outputName, functionUDF(col(in1.name)))
  }
}
// ...
// And so on for various combinations: 1to2, 1to3, ..., 1toN, ..., Nto1, Nto2, ...
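For a feel of the user-facing side, here is a hypothetical concrete transformer built on this trait; the class name and constructor wiring are illustrative, not from the slides:

// Sketch: a 1-to-1 transformer that lowercases a Text feature (Text = Option[String])
class LowerCaseTransformer(implicit val tti: TypeTag[Text], val tto: TypeTag[Text])
  extends OpTransformer1[Text, Text] {
  def transformFn: Text => Text = _.map(_.toLowerCase)
}

// Usage sketch: produces a new Text feature from an existing one
// val embarkedLower: FeatureLike[Text] = embarked.transformWith(new LowerCaseTransformer)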

1-to-1 Estimator example  Last code slide. Really.

import org.apache.spark.ml.{Estimator, Model}

// A simple 1 to 1 estimator which is trained into a model
class UnaryEstimator[I: TypeTag : ClassTag, O: TypeTag](val fitFn: Dataset[I] => I => O)
  extends Estimator[UnaryModel[I, O]] with OpPipelineStage1[I, O]
{
  implicit val iEncoder: Encoder[I] = ExpressionEncoder()

  final override def fit(dataset: Dataset[_]): UnaryModel[I, O] = {
    val df: DataFrame = dataset.select(in1.name)
    val ds: Dataset[I] = df.map(r => FeatureTypeDefaults.castAs[I](r.get(0))) // Needs encoder

    val transformFn: I => O = fitFn(ds) // Fit function returns a transform function!

    new UnaryModel[I, O](transformFn).setParent(this).setInput(in1)
  }
}

// Represents a trained model (transformer) from feature[I] to feature[O]
class UnaryModel[I, O](val transformFn: I => O)
  (implicit val tti: TypeTag[I], val tto: TypeTag[O])
  extends Model[UnaryModel[I, O]] with OpTransformer1[I, O]
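This is enough machinery to express the fillMissingWithMean operation from earlier; a sketch assuming the types above (the real implementation may differ, and collecting to the driver is a simplification for illustration):

// Sketch: fit computes the mean over the non-missing values;
// the returned transform function fills the gaps with it
val fillMissingWithMean = new UnaryEstimator[NullableNumeric, Numeric](
  fitFn = ds => {
    val values = ds.collect().flatten // simplification: fine for small data
    val mean = values.sum / values.length
    (v: NullableNumeric) => v.getOrElse(mean)
  }
)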

Model training (.reduce)

[Diagram] Data Reader → RDD[Passenger] → Feature Generator → DataFrame (age, sibsp, parch, embarked, …)
→ Spark Pipeline .setStages(stages).fit(trainData) → Titanic Model .transform(…)

The stages are ordered by a topological sort of the transformation DAG: (sibsp, parch) → _ ; (_, 1) → fsize

Going forward with Optimus Prime

What is still missing?
• Handling common data I/O scenarios (i.e. joins, aggregators)
• Specific/rich feature types: Email, PhoneNumber, ZipCode etc.
• Wrapping existing Spark ML transformers/estimators (in progress)
• Codegen for stage in-out combinations (1to1, …, 1toN, ..., Nto1, Nto2, ...)
• Codegen for Macros (scary!)
• Automatic feature engineering
• Sanity checking
• More…

Key takeaways

• Real-life Machine Learning is hard
• Spark ML is great, but it needs type safety
• Simple and intuitive syntax saves you trouble down the road
• Scala has all the relevant facilities to provide the above – know how to use them
• Modularity and reusability are key

Further exploration

• Salesforce Einstein – http://einstein.com
• PredictionIO – http://predictionio.incubator.apache.org
• Optimus Prime – to be open-sourced (no ETA yet)
   •  “Optimus Prime: declarative, collaborative, type-safe machine learning” by Shubha Nabar, PhD – https://goo.gl/hgWxJb
   •  “The Lego Model for Machine Learning” by Leah McGuire, PhD – https://goo.gl/hmct4R

If You’re Curious …

[email protected]

Thank You