Upload
matthew-tovbin
View
363
Download
1
Embed Size (px)
Citation preview
Matthew Tovbin Principal Engineer, Salesforce Einstein
[email protected] @tovbinm
Doubt Truth to be a Liar Non Triviality of Type Safety for Machine Learning
“Doubt thou the stars are fire, Doubt that the sun doth move, Doubt truth to be a liar, But never doubt I love.” - William Shakespeare, Hamlet
A glimpse into the future
What I am going to talk about: • Machine Learning (ML) 101
• Real-life ML • Building ML application with Spark ML
• Typed Feature Engineering with Optimus Prime • Behind the scenes
• Going forward
Machine Learning 101
What is Machine Learning? • “The capacity of a computer to learn from experience, i.e. to modify it’s
processing on the basis of newly acquired information” – 1950s, IBM Journal
What is “experience” in the computer terms? • It’s just data.
What are the tasks Machine Learning solves?
• Recognition, diagnosis, prediction, forecasting, planning, data mining, etc.
Machine Learning 101 Building a ML model
Feature Engineering
Model Training
Model A
Model B
Model C
Model Evaluation
Real-life ML Building a ML model pipeline
ETL
Model Evaluation
Feature Engineering
Scoring
Model Training
Model A
Model B
Model C
Deployment
Real-life ML Just a few problems to mention
• ETL is tough • Feature Engineering is even tougher • Our prototype in R/Python/Octave works great, but …
• Copy/pasting code across projects doesn’t scale • Model Training fails exactly two hours after you go to sleep • Data is not there • Data is there, but wrong format • Data is there, but insufficient • OOM, Insufficient Space, Serialization…
• So we have the models/scores, but can we trust them?!
Real-life ML Specifically for Salesforce
Multi-tenancy
• Multiple customers: Square, Fanatics, etc.
• Multiple customer environments: Live, Staging, QA, Dev. • Multiple data sources: SFDC, Marketing Cloud, Service Cloud, IoT Cloud, etc.
• Multiple data entities: Leads, Opportunities, Email Campaigns, etc. • Multiple applications: Lead Scoring, Predictive Journeys, or custom.
Security, Scale, Automation, Transparency, Cost Efficiency and so on.
Salesforce Einstein
AI for everyone • ML Platform for customers, engineers, data scientists
• No need for ETL or PhD
PredictionIO
• Most starred Scala project on GitHub • Now part of Apache’s incubator program
Optimus Prime • In-house transformation framework
• Declarative, collaborative, reusable, typed Services, Microservices, Nanoservices…
Predict survival on the Titanic
Sources: https://www.kaggle.com/c/titanic/data, https://github.com/BenFradet/spark-kaggle
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
PC 17599
71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0
STON/O2. 3101282 7.925 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 S 6 0 3 Moran, Mr. James male 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.075 S ... 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30 C148 C 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.75 Q
Building ML application with Spark ML [1/5] - Spinning up Spark
import org.apache.spark._ !import org.apache.spark.sql._ !import org.apache.spark.sql.types._ !import org.apache.spark.sql.functions._ !import org.apache.spark.ml._ !import org.apache.spark.ml.classification._ !import org.apache.spark.ml.evaluation._ !import org.apache.spark.ml.feature._ !import org.apache.spark.ml.tuning._ !!// Spinning up Spark !val conf = new SparkConf().setMaster("local[2]") !val session = SparkSession.builder.config(conf).getOrCreate!val (sc, sqlc) = (session.sparkContext, session.sqlContext) !!import sqlc.implicits._ !
Building ML application with Spark ML [2/5] - Reading the data def readData(file: String): DataFrame = { ! val schema = StructType(Array( ! StructField("PassengerId", IntegerType, nullable = true), // <-- field names are easy to misspell ! StructField("Survived", DoubleType, nullable = true), // <-- we can set any type here ! StructField("Pclass", DoubleType, nullable = true), ! StructField("Name", StringType, nullable = true), ! StructField("Sex", StringType, nullable = true), ! StructField("Age", DoubleType, nullable = true), ! StructField("SibSp", DoubleType, nullable = true), ! StructField("Parch", DoubleType, nullable = true), ! StructField("Ticket", StringType, nullable = true), ! StructField("Fare", DoubleType, nullable = true), ! StructField("Cabin", StringType, nullable = true), ! StructField("Embarked", StringType, nullable = true) ! )) ! val df: DataFrame = sqlc.read.format("csv").option("header", "true").schema(schema).load(file) ! // Select and rename necessary fields ! val select = Array($"Survived".as("survived"), $"Sex".as("sex"), $"Age".as("age”), ! $"Pclass".as("pclass"),$"SibSp".as("sibsp"), $"Parch".as("parch"), $"Embarked".as("embarked")) ! df.select(select: _*) !} !val rawData: DataFrame = readData(file = "titanic.csv") // <-- runtime exceptions...!
Building ML application with Spark ML [3/5] - Feature Engineering
!
def addFeatures(df: DataFrame): DataFrame = { ! // Create a new family size field := siblings + spouses + parents + children + self ! val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 } ! ! df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite !} !!def fillMissing(df: DataFrame): DataFrame = { ! // Fill missing age values with average age ! val avgAge = df.select("age").agg(avg("Age")).collect.head(0) ! ! // Fill missing embarked values with default "S" (i.e Southampton) ! val embarkedUDF = udf{(e: String)=> e match { case x if x == null || x.isEmpty => "S"; case x => x}} ! ! df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked"))) !} !// Modify the dataframe!val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching! !val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())!
Building ML application with Spark ML [4/5] - Building the pipeline
// Prepare categorical columns !val categoricalFeatures = Array("pclass", "sex", "embarked") !val stringIndexers = categoricalFeatures.map(colName => ! new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData) !) !
// Concat all the feature into a numeric feature vector (transformer) !val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol) !!val vectorAssembler = ! new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector”) !
// Prepare Logistic Regression estimator !val logisticRegression = ! new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived”) !
// Finally build the pipeline with the stages above !val pipeline = ! new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logisticRegression)) !
Building ML application with Spark ML [5/5] - Model training
// Cross validate our pipeline with various parameters !val paramGrid = { ! new ParamGridBuilder() ! .addGrid(logisticRegression.regParam, Array(1, 0.1, 0.01)) ! .addGrid(logisticRegression.maxIter, Array(10, 50, 100)) ! .build() !} !val crossValidator = { ! new CrossValidator() ! .setEstimator(pipeline) ! .setEstimatorParamMaps(paramGrid) ! .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived")) ! .setNumFolds(3) !} !// Train the model & compute scores !val model: CrossValidationModel = crossValidator.fit(trainSet) !val scores: DataFrame = model.transform(testSet) !!// Save the model for later use !model.save(”titanic-lr-model.ml") !
Building ML application with Spark ML The good parts
• Simple abstraction: Transformers (.map), Estimators (.reduce) and Pipelines • Serialization allows reusability of models • Good implementations for various estimators: Word2Vec, LR, etc.
• All-In-One: data exploration, prototyping, productionization • Multi language support: Java/Scala/Python
• Healthy ecosystem
Building ML application with Spark ML The not so good parts
• No type checking (especially painful for Transformers, Estimators and Pipelines) • Transformer and Estimator interfaces are too open: Dataset => DataFrame • DataFrames are everywhere • No type checking • Easy to misspell column names • No integration with ML Vector • Missing a lot of RDDs functionality
• Lack of support for common data I/O operations • Schema and algorithms definitions are interleaved with data manipulations
Can we do better?!
Optimus Prime
What is Optimus Prime?
• A transformation framework to develop reusable, modular and typed ML pipelines
Why are we building it? • Declarative and intuitive syntax • Typed operations with Spark ML
• Reusability of I/O operations, features, transformations, pipelines
• Separation of features and transformations from data operations
Building ML application with Optimus Prime [1/3] – Defining a reader & raw features
import com.salesforce.op._!import com.salesforce.op.test.avro.Passenger!!// Define the reader from CSV to Avro !val trainReader: DataReader[Passenger] = DataReaders.Simple.csv[Passenger]( ! path = Some("titanic.csv"), ! schema = Passenger.getClassSchema.toString!) !// Define the response feature !val survived: FeatureLike[Binary] = FeatureBuilder.Binary[Passenger] ! .extract(p => Option(p.getSurvived).map(_ != 0)).asResponse!
// Define the predictors features !val age = FeatureBuilder.NullableNumeric[Passenger].extract(p => Option(p.getAge)).asPredictor!val sex = FeatureBuilder.Categorical[Passenger].extract(p => Array(p.getSex)).asPredictor!val pclass = FeatureBuilder.Numeric[Passenger].extract(_.getPClass).asPredictor!val sibsp = FeatureBuilder.Numeric[Passenger].extract(_.getSibSp).asPredictor!val parch = FeatureBuilder.Numeric[Passenger].extract(_.getParCh).asPredictor!val embarked = FeatureBuilder.Text[Passenger].extract(p => Option(p.getEmbarked).asPredictor!
Building ML application with Optimus Prime [2/3] – Feature engineering
// Create a new family size feature – no annoying UDFs here! !val fsize: FeatureLike[Numeric] = sibsp + parch + 1 !
// Fill missing age values with average age, i.e. from NullableNumeric we get Numeric !val ageFilled: FeatureLike[Numeric] = age.fillMissingWithMean!
// Fill missing embarked values with default "S" (i.e Southampton) !val embarkedFilled: FeatureLike[Text] = embarked.fillMissingWith(“S”) !
// Create a feature vector field using default vectorizers!val featureVector: FeatureLike[Vector] = ! Seq(sex, ageFilled, pclass, sibsp, parch, fsize, embarkedFilled).vectorize() !!
Building ML application with Optimus Prime [3/3] – Model training
!
val modelSelector = { // Create a model selector with two algorithms ! new ModelSelector[Passenger]().setInput(survived, featureVector) ! .setParams( !
LogisticRegression.RegParam -> Array(1, 0.1, 0.01), ! LogisticRegression.MaxIter -> Array(10, 50, 100), !
RandomForest.NumTrees -> Array(3, 5, 10) ! ).setModels(LogisticRegression, RandomForest) // <-- multiple algorithms here ! .setEvaluator(BinaryClassification) !} !// Build the pipeline with the model selector !val pipeline = new OpPipeline[Passenger]().setInput(modelSelector) !!// And only now we are spinning up Spark !val conf = new SparkConf().setMaster("local[2]") !implicit val session = SparkSession.builder.config(conf).getOrCreate!!// Train the model & compute scores !val model: OpPipelineModel[Passenger] = pipeline.train()!val scores: DataFrame = model.setReader(testReader).score() !!model.save(”/models/titanic-lr-model.op") // Save the model for later use !
Building ML application with Optimus Prime Summary
• Everything is typed • Declarative and intuitive syntax with type completion • Feature names are inferred from val names –> no misspelled names
• Features are always unique in a code block, compilation error otherwise • Common data I/O operations provided with DataReaders
• DataFrames are abstracted away – no direct interaction
• Features and transformations are separate from data operations
Types and interactions
Numeric
Text
…
Feature[T]
transformed with
Categorical
Unary
produce
Transformers (.map)
Binary
…
Estimators (.reduce)
Average
Word2Vec
… fitted into
My Model
Data Readers
CSV
Avro
…
Pipelines
Titanic
Lead Scoring
…
read materialized by
fitted into
joined
Primitive types
case object FeatureTypes { ! type Numeric = Double! type NullableNumeric = Option[Double] ! type Categorical = scala.collection.mutable.WrappedArray[String] ! type Text = Option[String] ! type Binary = Option[Boolean] ! type DateList = scala.collection.mutable.WrappedArray[Long] ! type KeyString = scala.collection.Map[String, String] ! type KeyNumeric = scala.collection.Map[String, Double] ! type KeyBinary = scala.collection.Map[String, Boolean] ! type Vector = org.apache.spark.ml.linalg.Vector! ! // TBD: Specific/rich types: Email, PhoneNumber, ZipCode, Address, GeoLocation etc. !} !
Typed Features
trait FeatureLike[O] extends Serializable { ! ! implicit def wtt: WeakTypeTag[O] // Overcoming type erasure ! ! def name: String // name of the feature !! def defaultValue: O // feature default value ! ! def originStage: OpPipelineStage[O] // the stage which generated this feature ! ! def parents: Seq[FeatureLike[_]] // the input features for the origin stage ! ! // feature transformation function. We have more like this...! final def transformWith[U](stage: OpPipelineStage1[O, U]): FeatureLike[U] !! // ...!} !
Feature names from vals using Macros magic
val sibsp = FeatureBuilder.Numeric[Passenger](name = ”sibsp”) !!val sibsp = FeatureBuilder.Numeric[Passenger] // <-- can we just do this?! !!!!object FeatureBuilder { ! def Numeric[I]: FeatureBuilder[I, Numeric] = macro FeatureBuilderMacros.apply[I, Numeric] !} !!// HAHA! So we meet again! !private[op] object FeatureBuilderMacros { ! def apply[I: c.WeakTypeTag, O: c.WeakTypeTag](c: Context): c.Expr[FeatureBuilder[I, O]] = { ! import c.universe._ ! val enclosingValName = MacrosHelper.definingValName(c) ! val featureName = c.Expr[String](Literal(Constant(enclosingValName))) ! val fbApply = Select(reify(FeatureBuilder).tree, TermName("apply")) ! val fbExpr = c.Expr[FeatureBuilder[I, O]](Apply(fbApply, featureName.tree :: Nil)) !! reify(fbExpr.splice) ! } !} !!// Read more in sbt codebase - https://goo.gl/OdPvry!!
Feature transformations with Implicit Classes
// Create a new family size feature := siblings + spouses + parents + children + self !val fsize = sibsp.transformWith(new BinaryNullableAndNumeric(_ + _), parch) !
.transformWith(new BinaryNumeric (_ + _), 1) !!val fsize = sibsp + parch + 1 // <-- can we just do this? !
// Sure thing! !implicit class RichNullableNumericFeature[I <: NullableNumeric : TypeTag](val f: FeatureLike[I]) { !! def +[I2 <: Numeric: TypeTag](that: FeatureLike[I2]): FeatureLike[NullableNumeric] = { ! val plus = (a: Numeric, b: Numeric) => a + b ! val stage = new BinaryNullableAndNumeric(plus) ! ! f.transformWith[I2, NullableNumeric](f = that, stage) ! }!!} !!// Note: we use another Macro here as well to infer the name of the feature to “fsize” !
Features and Transformers
sibsp: Feature[Numeric]
survivedpclasssex age sibsp parchembarked fsize0 3 male 22 1 0 S 2
1 1 female38 1 0 C 2
1 3 female26 0 0 S 1
Binary Transformer ( _ + _ )
parch: Feature[Numeric]
val fsize: FeatureLike[Numeric] = sibsp + parch + 1 !
_ : Feature[Numeric]
fsize: Feature[Numeric] Binary Transformer ( _ + 1 )
Transformation DAG sibsp parch
_ 1
fsize
Pipeline Stages
Typed pipeline stages
import org.apache.spark.ml.util.MLWritable !import org.apache.spark.ml.PipelineStage!!// OP pipeline stages represent a feature transformation and also !// carry around the input and output feature types !trait OpPipelineStageBase extends OpPipelineStageParams with MLWritable { self: PipelineStage => ! type InputFeatures! type OutputFeatures! ! def setInput(features: InputFeatures): this.type! def getOutput(): OutputFeatures!! // This method allows us to modify the DataFrame schema accordingly ! final override def transformSchema(schema: StructType): StructType!} !!
Binary Transformer ( _ + _ ) Feature[Numeric] Feature[Numeric]
-> Feature[Numeric]
Typed pipeline stages
!
// Stage providing a single feature[O] !trait OpPipelineStage[O] extends OpPipelineStageBase { ! type InputFeatures! final override type OutputFeatures = FeatureLike[O] !} !
// Stage from feature[I] to a feature[O] !trait OpPipelineStage1[I, O] extends OpPipelineStage[O] { ! final override type InputFeatures = FeatureLike[I] !} !
// Stage from a tuple of features to a feature[O] !trait OpPipelineStage2[I1, I2, O] extends OpPipelineStage[O] { ! final override type InputFeatures = (FeatureLike[I1], FeatureLike[I2]) !} !
// ... !// And so on for various combinations: 1to2, ..., 1toN, 2to2, ..., Nto1, Nto2, ...!// See Scala Product types - https://goo.gl/J3V5DP !
1-to-1 Transformer example
!
import org.apache.spark.ml.Transformer!!// A simple 1 to 1 transformer !trait OpTransformer1[I, O] extends Transformer with OpPipelineStage1[I, O] { ! implicit def tti: TypeTag[I] ! implicit def tto: TypeTag[O] ! ! // User provided transform function that operates on input feature value I and produces O ! def transformFn: I => O ! ! // We wrap the transform function above into a UDF ! final override def transform(dataset: Dataset[_]): DataFrame = { ! val functionUDF = udf { (in: Any) => transformFn(FeatureTypeDefaults.castAs[I](in)) } ! ! // Return a dataset with a new column ! dataset.withColumn(outputName, functionUDF(col(in1.name))) ! } !} !// ... !// And so on for various combinations: 1to2, 1to3, ..., 1toN, ..., Nto1, Nto2, ...!
1-to-1 Estimator example Last code slide. Really.
import org.apache.spark.ml.{Estimator, Model} !!// A simple 1 to 1 estimator which is trained into a model !class UnaryEstimator[I: TypeTag: ClassTag, O: TypeTag](val fitFn: Dataset[I] => I => O) ! extends Estimator[UnaryModel[I, O]] with OpPipelineStage1[I, O] !{ ! implicit val iEncoder: Encoder[I] = ExpressionEncoder()! ! final override def fit(dataset: Dataset[_]): UnaryModel[I, O] = { ! val df: DataFrame = dataset.select(in1.name) ! val ds: Dataset[I] = df.map(r => FeatureTypeDefaults.castAs[I](r.get(0))) // Needs encoder ! ! val transformFn: I => O = fitFn(ds) // Fit function returns a transform function!! new UnaryModel[I, O](transformFn).setParent(this).setInput(in1) ! } !} !!// Represents a trained model (transformer) from feature[I] to feature[O] !class UnaryModel[I, O](val transformFn: I => O) ! (implicit val tti: TypeTag[I], val tto: TypeTag[O]) ! extends Model[UnaryModel[I, O]] with OpTransformer1[I, O] !
Model training (.reduce)
Data Reader RDD[Passenger]
Feature Generator
DataFrame (age, sibsp, parch, embarked, …)
Spark Pipeline .setStages(stages).fit(trainData)
Titanic Model .tranform(…)
Topological sort
Transformation DAG sibsp parch
_ 1
fsize
Going forward with Optimus Prime
What is still missing? • Handling common data I/O scenarios (i.e joins, aggregators)
• Specific/rich feature types: Email, PhoneNumber, ZipCode etc. • Wrapping existing Spark ML transformers/estimators (in progress)
• Codegen for stage in-out combinations (1to1, …, 1toN, ... Nto1, Nto2, ...) • Codegen for Macros (scary!)
• Automatic feature engineering
• Sanity checking • More…
Key takeaways
• Real-life Machine Learning is hard • Spark ML is great, but it needs type safety • Simple and intuitive syntax saves you trouble down the road
• Scala has all the relevant facilities to provide the above – know to use it • Modularity and reusability is the key
Further exploration
• Salesforce Einstein – http://einstein.com
• PredictionIO – http://predictionio.incubator.apache.org
• Optimus Prime – to be open-sourced (no ETA yet) • “Optimus Prime: declarative, collaborative, type-safe machine learning” by Shubha Nabar, PhD
https://goo.gl/hgWxJb • “The Lego Model for Machine Learning” by Leah McGuire, PhD - https://goo.gl/hmct4R