Presented by takeshi-yamamuro
What’s Hivemall
• A collection of machine learning algorithms implemented as Hive UDFs/UDTFs/UDAFs
  – Classification & Regression
  – Recommendation
  – k-Nearest Neighbor Search, …
• Simple SQL-based declarative APIs
• An open source project on GitHub
  – Licensed under Apache License
  – bit.ly/hivemall
Outline
• https://github.com/maropu/hivemall-spark
  – A Hivemall wrapper for Spark
  – Plus some utilities that make it easy to use in Spark
• This Hivemall wrapper gives you…
  – ‘scalable’ ML algorithms in Spark
  – predictions on streaming data in Spark Streaming
  – quick trials on your laptop
Outline
Cited from http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east#16
hivemall-on-spark
Why Hivemall in Spark, not MLlib?
• High barriers to adding novel algorithms in MLlib
  – Hivemall already has many fascinating ML algorithms and useful utilities
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Quick Trials for Hivemall in Spark
• Only two steps needed:
  – 1. Fetch Spark v1.5.1 compiled with Hive dependencies from http://spark.apache.org/downloads.html
  – 2. Just run ‘<SPARK_HOME>/bin/spark-shell --packages maropu:hivemall-spark:0.0.5’
Ex.) Logistic Regression
• The equation is given as follows: compute w to predict y given x,
  where label y ∈ {0.0, 1.0}, input vector x = (x1, x2, x3, …), and model vector w = (w1, w2, w3, …)
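The equation image on this slide did not survive the transcript; the standard logistic regression model it refers to, with the symbols defined above, is:

```latex
P(y = 1 \mid \mathbf{x}) \;=\; \mathrm{sigmoid}(\mathbf{w}^{\top}\mathbf{x})
\;=\; \frac{1}{1 + e^{-\mathbf{w}^{\top}\mathbf{x}}}
```

Training finds the weight vector w that makes these predicted probabilities match the observed labels.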
Logis+c Regression in HiveQL
• Training

// Assume that the input format is as follows:
// 1.0,[1:0.5,3:0.3,8:0.1]
// 0.0,[2:0.1,3:0.8,7:0.4,9:0.1]
// …
scala> val trainDf = … sc.textFile(...).map(HmLabeledPoint.parse)
scala> trainDf.registerTempTable("trainTable")
scala> :paste
val modelDf = sql("
  SELECT feature, AVG(weight) AS weight
  FROM (
    SELECT train_logregr(add_bias(features), label) AS (feature, weight)
    FROM trainTable
  ) model
  GROUP BY feature
")
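The query above trains a logistic-regression model on each worker and then averages the per-feature weights via GROUP BY feature / AVG(weight). A minimal Python sketch of that train-then-average scheme, with toy data and a hand-rolled SGD loop (not Hivemall’s actual implementation):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_train(examples, n_features, lr=0.1, epochs=50):
    """Train logistic regression by SGD over one partition of (label, x) rows."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for label, x in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - label  # gradient of the log loss w.r.t. w.x
            for i in range(n_features):
                w[i] -= lr * g * x[i]
    return w

# Toy dataset: y = 1 iff x1 > x0; a trailing 1.0 plays the role of add_bias
random.seed(0)
def make_rows(n):
    rows = []
    for _ in range(n):
        x0, x1 = random.random(), random.random()
        rows.append((1.0 if x1 > x0 else 0.0, [x0, x1, 1.0]))
    return rows

# Two "partitions", one per worker; each produces its own weight vector
partitions = [make_rows(200), make_rows(200)]
per_worker = [sgd_train(p, 3) for p in partitions]

# GROUP BY feature + AVG(weight): average each feature's weight across workers
model = [sum(ws) / len(ws) for ws in zip(*per_worker)]
```

The averaged model should put positive weight on x1 and negative weight on x0, matching the generating rule.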
Logis+c Regression in DataFrame
• Training

// Assume that the input format is as follows:
// 1.0,[1:0.5,3:0.3,8:0.1]
// 0.0,[2:0.1,3:0.8,7:0.4,9:0.1]
// …
scala> val trainDf = … sc.textFile(...).map(HmLabeledPoint.parse)
scala> :paste
val modelDf = trainDf
  .train_logregr(add_bias($"features"), $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
Logis+c Regression in HiveQL
• Test
scala> val testDf = … sc.textFile(...).map(HmLabeledPoint.parse)
scala> testDf.registerTempTable("testDf")
scala> modelDf.registerTempTable("modelDf")
scala> :paste
sql("
  SELECT t.rowid, sigmoid(SUM(m.weight * t.value)) AS prob
  FROM testDf t
  LEFT OUTER JOIN modelDf m ON (t.feature = m.feature)
  GROUP BY t.rowid
")
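The test query computes, per rowid, the sparse dot product of the model weights with that row’s features, then squashes it through a sigmoid. A small Python sketch of the same computation (the weights and rows here are made-up values for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# model table: feature -> weight (illustrative values)
model = {"1": 0.5, "2": -0.3, "3": 0.8}

# test table: one (rowid, feature, value) row per non-zero feature
test_rows = [
    ("row1", "1", 0.5), ("row1", "3", 0.3),
    ("row2", "2", 0.1), ("row2", "3", 0.8),
]

# LEFT OUTER JOIN on feature, then SUM(m.weight * t.value) GROUP BY t.rowid
sums = {}
for rowid, feature, value in test_rows:
    sums[rowid] = sums.get(rowid, 0.0) + model.get(feature, 0.0) * value

# Finally, sigmoid turns each summed score into a probability
probs = {rowid: sigmoid(s) for rowid, s in sums.items()}
```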
Logis+c Regression in DataFrame
• Test
scala> val testDf = … sc.textFile(...).map(HmLabeledPoint.parse)
scala> :paste
// Build a UDF from the model data
val modelUdf = HivemallUtils.funcModel(modelDf)
val predictDf = testDf
  .select($"target", sigmoid(modelUdf($"features")).as("prob"))
Other Examples
• Tutorials: github.com/maropu/hivemall-spark/blob/master/tutorials
  – Regression
  – Binary/Multiclass classification
  – Mix Servers: averaging models of in-progress workers
Hivemall on Spark Streaming
• Apply predictions to each streaming DataFrame block
// Assume that the input format is as follows:
// 1.0,[1:0.5,3:0.3,8:0.1]
// 0.0,[2:0.1,3:0.8,7:0.4,9:0.1]
// …
scala> val testDs = ssc.textFileStream(...).map(HmLabeledPoint.parse)
scala> :paste
testDs.predict { case testDf =>
  // Run the tests described before on each block
  val predictResults = testDf.sql("SELECT …")
}
Low Latency Predic+ons on RDBMS
• Build a mature model from a substantial amount of training data in Spark and load it into an RDBMS
  → You can get low-latency predictions*!!

// How to write a model into an RDBMS from Spark
scala> val model = (Build a model as described before)
scala> model.write.jdbc("jdbc:postgresql:postgres", "pg_table", …)
*Detailed in https://github.com/myui/hivemall/wiki/Real-time-prediction-on-MySQL-and-batch-model-construction-on-Hivemall
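The flow above, batch training in Spark followed by scoring inside the database with the same join-and-sigmoid query, can be sketched with Python’s sqlite3 standing in for the RDBMS (table names and weights are illustrative, and sigmoid is registered as a UDF because SQLite has no built-in one):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, lambda z: 1.0 / (1.0 + math.exp(-z)))

# Model table, as exported from Spark via DataFrame.write.jdbc (simulated here)
conn.execute("CREATE TABLE model (feature TEXT, weight REAL)")
conn.executemany("INSERT INTO model VALUES (?, ?)",
                 [("1", 0.5), ("2", -0.3), ("3", 0.8)])

# Incoming rows to score: one (rowid, feature, value) triple per non-zero feature
conn.execute("CREATE TABLE test (rowid TEXT, feature TEXT, value REAL)")
conn.executemany("INSERT INTO test VALUES (?, ?, ?)",
                 [("row1", "1", 0.5), ("row1", "3", 0.3)])

# Same join-and-sigmoid query shape as the Hive/Spark version, now served
# directly by the database at low latency
row = conn.execute("""
    SELECT t.rowid, sigmoid(SUM(m.weight * t.value)) AS prob
    FROM test t
    LEFT OUTER JOIN model m ON t.feature = m.feature
    GROUP BY t.rowid
""").fetchone()
```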
Current Developments in Hivemall
Cited from http://www.slideshare.net/myui/2nd-hivemall-meetup-20151020#37