–
Yung-Chuan Lee 2016.12.18
1
2
Laws [data applications] are like sausages. It is better not to see them being made.
—Otto von Bismarck
! Spark ◦ ● Spark ● Scala ● RDD
! LAB ◦ ~ ● Spark Scala IDE
! Spark MLlib ◦ … ● Scala + lambda + Spark MLlib ● Clustering Classification Regression
3
! github page: https://github.com/yclee0418/sparkTeach ◦ installation: Spark ◦ codeSample: Spark ● exercise - https://github.com/yclee0418/sparkTeach/tree/master/codeSample/exercise ● final - https://github.com/yclee0418/sparkTeach/tree/master/codeSample/final
4
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib
5
Outline
!
◦ 2020 44ZB(IDC 2013~2014) ◦
! ◦ MapReduce by Google(2004) ◦ Hadoop HDFS MapReduce by Yahoo!(2005) ◦ SparkHadoop 10~1000 by AMPLab (2013)
! [ ]SparkHadoop
6
–
!
AMPLab ! ! API ◦ Java Scala Python R
! One Stack to rule them all ◦ SQL Streaming ◦ RDD
7
Spark
! Cluster Manager ◦ Standalone – Spark Manager ◦ Apache Mesos ◦ Hadoop YARN
8
Spark
! [exercise] Install Spark ◦ JDK 1.8 ◦ spark-2.0.1.tgz (http://spark.apache.org/downloads.html) ◦ Terminal (for Mac) ● cd /Users/xxxxx/Downloads ● tar -xvzf spark-2.0.1.tgz ● sudo mv spark-2.0.1 /usr/local/spark (move spark under /usr/local) ● cd /usr/local/spark ● ./build/sbt package ● ./bin/spark-shell (launch the Spark shell; pwd should be /usr/local/spark)
9
Spark (2.0.1)
[Tips] https://goo.gl/oxNbIX
./bin/run-example org.apache.spark.examples.SparkPi
! Spark Shell Spark command line ◦ Spark
! spark-shell ◦ [ ] Spark \bin\spark-shell
! ◦ var res1: Int = 3 + 5 ◦ import org.apache.spark.rdd._ ◦ val intRdd: RDD[Int]=sc.parallelize(List(1,2,3,4,5)) ◦ intRdd.collect ◦ val txtRdd=sc.textFile("file:///<Spark install path>/README.md") ◦ txtRdd.count
! spark-shell ◦ [ ] :quit Ctrl D
10
Spark Shell
SparkScala
[Tips]: ➢ var val ? ➢ intRdd txtRdd ? ➢ org. [Tab] ? ➢ http://localhost:4040
! Spark ! RDD(Resilient Distributed Dataset) ! Scala ! Spark MLlib
11
Outline
! Google ! Map Reduce ! MapReduce ◦ Map: (K1, V1) → list(K2, V2) ◦ Reduce: (K2, list(V2)) → list(K3, V3)
! ( Word Count )
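The Map and Reduce signatures above can be sketched with plain Scala collections (no Hadoop required); this illustrative word-count analogue uses made-up sample input and is not the actual Hadoop API:

```scala
// Word count with explicit map and reduce phases, mirroring
// Map: (K1,V1) -> list(K2,V2) and Reduce: (K2, list(V2)) -> list(K3,V3).
val docs = List((1, "to be or not to be"), (2, "to do or not to do"))

// Map phase: each (docId, line) record emits a list of (word, 1) pairs
val mapped: List[(String, Int)] =
  docs.flatMap { case (_, line) => line.split(" ").map(w => (w, 1)) }

// Shuffle: group the intermediate pairs by key (the word)
val grouped: Map[String, List[Int]] =
  mapped.groupBy(_._1).mapValues(_.map(_._2)).toMap

// Reduce phase: sum the list of counts for each word
val counts: Map[String, Int] = grouped.map { case (w, ones) => (w, ones.sum) }
```

The shuffle step (groupBy) is what a MapReduce framework performs between the two user-supplied functions.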
12
RDD MapReduce
! MapReduce on HadoopWord Count …
◦ iteration iteration ( ) …
13
Hadoop …
HDFS
! Spark – RDD(Resilient Distributed Datasets) ◦ In-Memory Data Processing and Sharing ◦ (tolerant) (efficient)
! ◦ (lineage) – RDD ◦ lineage
! ◦ Transformations: In memory lazy lineage RDD ◦ Action: return Storage◦ Persistence: RDD
14
Spark …
: 1+2+3+4+5 = 15Transformation Action
15
RDD
RDD Ref: http://spark.apache.org/docs/latest/programming-guide.html#transformations
! SparkContext.textFile – RDD ! map: RDD RDD ! filter: RDD RDD ! reduceByKey: RDD Key
RDD Key
! groupByKey: RDD Key RDD ! join cogroup: RDD KeyRDD
! sortBy reverse: RDD! take(N): RDD N RDD! saveAsTextFile: RDD
16
RDD
! count: RDD ! collect: RDD Collection(Seq ! head(N): RDD N ! mkString: Collection
17
[Tips] • • Transformation
! [Exercise] spark-shell ◦ val intRDD = sc.parallelize(List(1,2,3,4,5,6,7,8,9,0)) ◦ intRDD.map(x => x + 1).collect() ◦ intRDD.filter(x => x > 5).collect() ◦ intRDD.stats ◦ val mapRDD=intRDD.map{x=>("g"+(x%3), x)} ◦ mapRDD.groupByKey.foreach{x=>println("key: %s, vals=%s".format(x._1, x._2.mkString(",")))} ◦ mapRDD.reduceByKey(_+_).foreach(println) ◦ mapRDD.reduceByKey{case(a,b) => a+b}.foreach(println)
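For readers without a running spark-shell, the exercise has plain Scala collection analogues; groupBy and map below stand in for groupByKey and reduceByKey (a sketch, not Spark code):

```scala
val ints = List(1,2,3,4,5,6,7,8,9,0)
val plusOne = ints.map(x => x + 1)            // like intRDD.map(...).collect()
val big     = ints.filter(x => x > 5)         // like intRDD.filter(...).collect()

// mapRDD's ("g"+(x%3), x) pairs, grouped and reduced by key:
val pairs   = ints.map(x => ("g" + (x % 3), x))
val grouped = pairs.groupBy(_._1)             // like groupByKey
val summed  = pairs.groupBy(_._1)
                   .map { case (k, vs) => (k, vs.map(_._2).sum) } // like reduceByKey(_+_)
```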
18
RDD
! [Exercise] (The Gettysburg Address)
◦ (The Gettysburg Address)(https://docs.google.com/file/d/0B5ioqs2Bs0AnZ1Z1TWJET2NuQlU/view) gettysburg.txt ◦ gettysburg.txt ( ) ●
◦ ◦ ◦
19
RDD (Word Count )
sc.textFile flatMap split
toLowerCase, filter
sortBy foreach
https://github.com/yclee0418/sparkTeach/blob/master/codeSample/exercise/WordCount_Rdd.txt
take(5) foreach
reduceByKey
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib
20
Outline
! Scala Scalable Language ( ) ! Scala ◦ lambda expression
Scala
Scala: List(1,2,3,4,5).foreach(x=>println("item %d".format(x)))
Java: int[] intArr = new int[] {1,2,3,4,5}; for (int x: intArr) System.out.println(String.format("item %d", x));
! scala Java .NET ! ( actor model akka) ! Spark
! import◦ import org.apache.spark.SparkContext◦ import org.apache.spark.rdd._ ( rdd class)◦ import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel } (
clustering class)!
◦ val int1: Int = 5 ( error)◦ var int2: Int = 5 ( )◦ val int = 5 ( )
! ( )◦ def voidFunc(param1: Type, param2: Type2) = { … }
22
Scala
def setLogger = { Logger.getLogger("com").setLevel(Level.OFF); Logger.getLogger("io").setLevel(Level.OFF) }
! ( )◦ def rtnFunc1(param1: Type, param2: Type2): Type3 = {
val v1:Type3 = …v1 //
}! ( )◦ def rtnFunc2(param1: Type, param2: Type2): (Type3, Type4) = {
val v1: Type3 = …val v2: Type4= … (v1, v2)//
}
23
Scala
def getMinMax(intArr: Array[Int]):(Int,Int) = { val min=intArr.min val max=intArr.max (min, max)
}
!
◦ val res = rtnFunc1(param1, param2) ( res)
◦ val (res1, res2) = rtnFunc2(param1, param2) (res1,res2 )
◦ val (_, res2) = rtnFunc2(param1, param2) ()
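Putting the three calling styles together with the getMinMax function from the previous slide gives this runnable snippet (the sample array values are illustrative):

```scala
def getMinMax(intArr: Array[Int]): (Int, Int) = {
  val min = intArr.min
  val max = intArr.max
  (min, max)                        // the last expression is the return value
}

val arr = Array(3, 1, 4, 1, 5, 9)
val res = getMinMax(arr)            // receive the whole tuple
val (lo, hi) = getMinMax(arr)       // destructure into two vals
val (_, onlyMax) = getMinMax(arr)   // ignore the part you don't need
```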
! For Loop◦ for (i <- collection) { … }
! For Loop ( yield )◦ val rtnArr = for (i <- collection) yield { … }
24
Scala
val intArr = Array(1,2,3,4,5,6,7,8,9) val multiArr= for (i <- intArr; j <- intArr) yield { i*j } //multiArr has 81 elements (9×9 products)
val (min,max)=getMinMax(intArr) val (_, max)=getMinMax(intArr)
! Tuple◦ Tuple ◦ val v=(v1,v2,v3...) v._1, v._2, v._3…◦ lambda
◦ lambda (_)
25
Scala
val intArr = Array(1,2,3,4,5,6,7,8,9) val res=getMinMax(intArr) //res=(1,9) => tuple val min=res._1 // first element of res val max=res._2 // second element of res
val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple val intArr2=intArr.map(x=> (x._1 * x._2 * x._3)) //intArr2: Array[Int] = Array(6, 120, 504) val intArr3=intArr.filter(x=> (x._1 + x._2 > x._3)) //intArr3: Array[(Int, Int, Int)] = Array((4,5,6), (7,8,9))
val intArr = Array((1,2,3),(4,5,6),(7,8,9)) //intArr Tuple def getThird(x:(Int,Int,Int)): Int = { (x._3) } val intArr2=intArr.map(getThird(_)) val intArr2=intArr.map(x=>getThird(x)) // //intArr2: Array[Int] = Array(3, 6, 9)
! Class◦ Scala Class JAVA Class
● private / protected public
● Class
26
Scala
Scala: class Person(userID: Int, name: String) // members default to private
class Person(val userID: Int, var name: String) // public; userID is read-only (val) val person = new Person(102, "John Smith") person.userID // 102
Equivalent Java version of the Person class: public class Person { private final int userID; private final String name; public Person(int userID, String name) { this.userID = userID; this.name = name; } }
! Object◦ Scala static
instance◦ Scala Object static● Scala Object singleton class instance
! Scala Object vs Class◦ object utility Spark Driver Program◦ class Entity
27
Scala
Scala Object: object Utility { def isNumeric(input: String): Boolean = input.trim().matches("[+-]?((\\d+(e\\d+)?[lL]?)|(((\\d+(\\.\\d*)?)|(\\.\\d+))(e\\d+)?[fF]?))") def toDouble(input: String): Double = { val rtn = if (input.isEmpty() || !isNumeric(input)) Double.NaN else input.toDouble rtn } }
val d = Utility.toDouble("20") // no new needed
! ◦ val intArr = Array(1,2,3,4,5,6,7,8,9)
! ◦ val intArrExtra = intArr ++ Array(0,11,12)
! map: ! filter: ! join: Map Key Map ! sortBy reverse: ! take(N): N
28
scala
val intArr = Array(1,2,3,4,5,6,7,8,9) val intArr2=intArr.map(_ * 2) //intArr2: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18) val intArr3=intArr.filter(_ > 5) //intArr3: Array[Int] = Array(6, 7, 8, 9) val intArr4=intArr.reverse //intArr4: Array[Int] = Array(9, 8, 7, 6, 5, 4, 3, 2, 1)
! sum: ◦ val sum = Array(1,2,3,4,5,6,7,8,9).sum
! max: ◦ val max = Array(1,2,3,4,5,6,7,8,9).max
! min: ◦ val min = Array(1,2,3,4,5,6,7,8,9).min
! distinct:
29
scala
val intArr = Array(1,2,3,4,5,6,7,8,9) val sum = intArr.sum //sum = 45 val max = intArr.max //max = 9 val min = intArr.min //min = 1 val disc = Array(1,1,1,2,2,2,3,3).distinct //disc = Array(1, 2, 3)
! spark-shell
! ScalaIDE for eclipse 4.4.1 ◦ http://scala-ide.org/download/sdk.html ◦ ◦ ( ) ◦ ◦ ScalaIDE
30
(IDE)
! Driver Program(word complete breakpoint)
! spark-shell jar ! ◦Eclipse 4.4.2 (Luna) ◦ Scala IDE 4.4.1 ◦ Scala 2.11.8 and Scala 2.10.6 ◦ Sbt 0.13.8 ◦ Scala Worksheet 0.4.0 ◦ Play Framework support 0.6.0 ◦ ScalaTest support 2.10.0 ◦ Scala Refactoring 0.10.0 ◦ Scala Search 0.3.0 ◦ Access to the full Scala IDE ecosystem
31
Scala IDE for eclipse
! Scala IDE Driver Program ◦ Scala Project ◦ Build Path ● Spark ● Scala
◦ package ● package ( )
◦ scala object ◦ ◦ debug ◦ Jar ◦ spark-submit Spark
32
Scala IDE Driver Program
! Scala IDE ◦ FILE -> NEW -> Scala Project ◦ projectFirstScalaProj ◦ JRE 1.8 (1.7 ) ◦ ◦ Finish
33
Scala Project
! Package Explorer Project ExplorerFirstScalaProj Build Path -> Configure Build Path
34
Build Path
[Tips]: Q: Package Project Explorer A: ! Scala perspective
! Scala perspective
-> Window -> Show View
! Spark Driver Program Build Path ◦ jar◦ Scala Library Container 2.11.8(IDE 2.11.8 )
! Configure Build Path Java Build Path Libraries ->Add External JARs… ◦Spark Jar Spark /assembly/target/scala-2.11/jars/ ◦ jar
! Java Build Path Scala Library Container 2.11.8
35
Build Path
! Package Explorer FirstScalaProj srcpackage
◦ src ->New->Package( Ctrl N)
◦ bikeSharing Package ! FirstScalaProj data (Folder) input
36
Package
! (gettysburg.txt)copy data ! bikeSharing Package Scala ObjectBikeCountSort
!
37
Scala Object
package practice1
//spark lib import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd._ //log import org.apache.log4j.Logger import org.apache.log4j.Level
object WordCount { def main(args: Array[String]): Unit = { // silence log output to the console Logger.getLogger("org").setLevel(Level.ERROR) //mask MLlib INFO msgs val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]")) val rawRdd = sc.textFile("data/gettysburg.txt").flatMap { x=>x.split(" ") } // lowercase and trim each token, then drop empty strings with filter val txtRdd = rawRdd.map { x => x.toLowerCase.trim }.filter { x => !x.equals("") } val countRdd = txtRdd.map { x => (x, 1) } // map each word to (word, 1) val resultRdd = countRdd.reduceByKey { case (a, b) => a + b } // sum counts per word with reduceByKey val sortResRdd = resultRdd.sortBy((x => x._2), false) // sort by count, descending sortResRdd.take(5).foreach { println } // print the top 5 sortResRdd.saveAsTextFile("data/wc_output") } }
38
WordCountimportLibrary
object main
saveAsTextFile
! word complete ALT/ word complete
! ( tuple)
39
IDE
! debug configuration ◦ icon Debug Configurations ◦ Scala ApplicationDebug ●Name WordCount ●Project FirstScalaProj ●Main Classpractice1.WordCount
◦ Launcher
40
Debug Configuration
! icon Debug ConfigurationDebug console
41
[Tips] • data/output sortResRdd ( part-xxxx ) • Log Level console • output
! Spark-Submit JAR
◦ Package Explorer FirstScalaProj ->Export...->Java/JAR file-> FirstScalaProj src
JAR File
42
JAR
! input output JAR File
◦ data JAR File
43
Spark-submit
! spark-submit
44
Spark-submit
1. submit 2. launch workers
3. return status
! Command Line JAR File! Spark-submit
./bin/spark-submit --class <main-class> (package scala object ) --master <master-url> ( master URL local[Worker thread num]) --deploy-mode <deploy-mode> ( Worker Cluster Client Client) --conf <key>=<value> ( Spark ) ... # other options <application-jar> (JAR ) [application-arguments] ( Driver main )
45
Spark-submit submit JOB
Spark /bin/spark-submit --class practice1.WordCount --master local[*] WordCount.jar
[Tips]: ! spark-submit JAR data ! merge output
◦ linux: cat data/output/part-* > res.txt ◦windows: type data\output\part-* > res.txt
! Exercise wordCount Package WordCount2 Object ◦ gettysburg.txt ( )
●
◦ ●Hint1: (index) ● val posRdd=txtRdd.zipWithIndex()
●Hint2: reduceByKey groupByKeyindex
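Hint1's zipWithIndex also exists on plain Scala collections; this sketch attaches positions to words and collects each word's position list (the RDD version is analogous; the word list is a made-up sample):

```scala
val words = List("four", "score", "and", "seven", "years", "ago", "and")

// attach each word's position (index) in the sequence
val posPairs = words.zipWithIndex        // List((four,0), (score,1), ...)

// collect every index at which each word appears (groupByKey-style)
val positions: Map[String, List[Int]] =
  posPairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }
```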
46
Word Count
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics)◦Clustering ◦Classification◦Regression
47
Outline
! ◦
◦
! ◦ ◦
48
Tasks Experience Performance
! ! ! ! ! ! ! ! ! DNA ! !
49
! (Supervised learning) ◦ (Training Set)
◦ (Features)
(Label) ◦ Regression Classification (
)
50http://en.proft.me/media/science/ml_svlw.jpg
! (Unsupervised learning) ◦ (
Label) ◦ ◦ Clustering ( KMeans)
51http://www.cnblogs.com/shishanyuan/p/4747761.html
! MLlib Machine Learning library Spark
! ◦ RDD ◦
52
Spark MLlib
http://www.cnblogs.com/shishanyuan/p/4747761.html
53
Spark MLlib
https://www.safaribooksonline.com/library/view/spark-for-python/9781784399696/graphics/B03986_04_02.jpg
! Bike Sharing Dataset ()
! https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset ◦ ● hour.csv: hourly records, 2011.01.01~2012.12.30, 17,379 rows ● day.csv: daily aggregation of hour.csv
54
Spark MLlib Let’s biking
55
Bike Sharing DatasetFeaturesLabel
(for hour.csv only)
(0 to 6)
(1 to 4)
! ◦ (Summary Statistics):
MultivariateStatisticalSummary Statistics◦ Feature ( ) Label ( )
(correlation) Statistics!
◦Clustering KMeans!
◦Classification Decision Tree LogisticRegressionWithSGD!
◦Regression Decision Tree LinearRegressionWithSGD
56
Spark MLlib
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics)◦Clustering ◦Classification◦Regression
57
Outline
58
! (Summary Statistics)◦
◦◦Spark● 1: RDD[Double/
Float/Int] RDD stats● 2: RDD[Vector]
Statistics.colStats
59
! (correlation) ◦ (Correlation)
◦ Spark supports Pearson and Spearman ◦ compute r with Statistics.corr ● 0 < |r| < 0.3 (weak correlation) ● 0.3 <= |r| < 0.7 (moderate correlation) ● 0.7 <= |r| < 1 (strong correlation) ● |r| = 1 (perfect correlation)
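Statistics.corr uses Pearson by default; for intuition, Pearson's r can be computed by hand in plain Scala. This is a minimal sketch of the standard formula (covariance over the product of standard deviations), not MLlib's implementation:

```scala
// Pearson correlation coefficient of two equal-length sequences
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty)
  val n  = xs.length
  val mx = xs.sum / n
  val my = ys.sum / n
  // covariance numerator and the two standard-deviation factors
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}
```

A perfectly linear relationship gives r = 1 (or -1 when decreasing), matching the |r| = 1 case above.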
60
A. Scala B. Package Scala Object C. data Folder D. Library
! ScalaIDE Scala folder package Object ◦ SummaryStat ( ) ● src ● bike (package ) ● BikeSummary (scala object )
● data (folder ) ● hour.csv
! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8
61
62
A. import B. main Driver Program
C. Log D. SparkContext
//import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary, Statistics }
object BikeSummary { def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]"))
} }
! spark-shell sparkContext sc ! Driver Program sc ◦ appName - Driver Program ◦master - master URL
63
64
! prepare ◦ input file Features
LabelRDD
! lines.map features( 3~14 ) label( 17 ) RDD ! RDD
: ◦ RDD[Array] ◦ RDD[Tuple] ◦ RDD[BikeShareEntity]
prepare def prepare(sc: SparkContext): RDD[???] = { val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) => { if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x => x.split(",").map { x => x.trim() } } //split columns with comma val bikeData:RDD[???]=lines.map{ … } //??? depends on your impl }
65
! RDD[Array]: ◦ val bikeData:RDD[Array[Double]] =lines.map{x=>(x.slice(3,13).map(_.toDouble) ++ Array(x(16).toDouble))} ◦ Trade-off: prepare is easy to implement, but painful to use later (you must remember each field's index in the Array) and error-prone
! RDD[Tuple]: ◦ val bikeData:RDD[(Double, Double, Double, …, Double)] =lines.map{case(season,yr,mnth,…,cnt)=>(season.toDouble, yr.toDouble, mnth.toDouble,…cnt.toDouble)} ◦ Trade-off: prepare is harder to implement and still painful to use, but less error-prone (better variable names can receive the returned values) ◦ e.g.: val features = bikeData.map{case(season,yr,mnth,…,cnt)=> (season, yr, mnth, …, windspeed)}
66
! RDD[case class]: ◦ val bikeData:RDD[BikeShareEntity] = lines.map{ x=> BikeShareEntity(⋯)} ◦ Trade-off: prepare is painful to implement, but pleasant to use afterwards (you work with entity objects: no field positions to track, better abstraction) and not error-prone ◦ e.g.: val labelRdd = bikeData.map{ ent => { ent.label }}
Case Class: case class BikeShareEntity(instant: String, dteday: String, season: Double, yr: Double, mnth: Double, hr: Double, holiday: Double, weekday: Double, workingday: Double, weathersit: Double, temp: Double, atemp: Double, hum: Double, windspeed: Double, casual: Double, registered: Double, cnt: Double)
67
map to RDD[BikeShareEntity]: val bikeData = lines.map { x => BikeShareEntity(x(0), x(1), x(2).toDouble, x(3).toDouble, x(4).toDouble, x(5).toDouble, x(6).toDouble, x(7).toDouble, x(8).toDouble, x(9).toDouble, x(10).toDouble, x(11).toDouble, x(12).toDouble, x(13).toDouble, x(14).toDouble, x(15).toDouble, x(16).toDouble) }
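The same mapping can be tried outside Spark; a trimmed-down, hypothetical MiniBikeEntity with only four of hour.csv's 17 columns keeps the example short:

```scala
// Trimmed-down illustration of parsing a CSV row into a case class
// (the real BikeShareEntity carries all 17 hour.csv columns).
case class MiniBikeEntity(instant: String, dteday: String,
                          season: Double, cnt: Double)

def parse(line: String): MiniBikeEntity = {
  val x = line.split(",").map(_.trim)   // split columns, trim whitespace
  MiniBikeEntity(x(0), x(1), x(2).toDouble, x(3).toDouble)
}

val row = parse("1,2011-01-01,1,16")
```

Accessing `row.cnt` instead of `x(3)` is exactly the readability benefit the case-class approach buys.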
68
! (Class) ! prepare ◦ input file Features
LabelRDD
Entity Class //import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import Statistics library import org.apache.spark.mllib.stat.{ MultivariateStatisticalSummary, Statistics }object BikeSummary {
case class BikeShareEntity(⋯⋯) def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeSummary").setMaster("local[*]"))
} }
69
70
! getFeatures ◦
! printSummary ◦ console
! printCorrelation ◦console
printSummary def printSummary(entRdd: RDD[BikeShareEntity]) = {
val dvRdd = entRdd.map { x => Vectors.dense(getFeatures(x)) } //RDD[Vector] // compute the Summary Statistics with Statistics.colStats val summaryAll = Statistics.colStats(dvRdd) println("mean:" + summaryAll.mean.toArray.mkString(",")) // per-column mean println("variance:" + summaryAll.variance.toArray.mkString(",")) // per-column variance
}
71
getFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
val featureArr = Array(bikeData.casual, bikeData.registered, bikeData.cnt) featureArr
}
72
printCorrelation def printCorrelation(entRdd: RDD[BikeShareEntity]) = {
// map to RDD[Double] val cntRdd = entRdd.map { x => x.cnt } val yrRdd = entRdd.map { x => x.yr } // correlation of yr vs cnt
val yrCorr = Statistics.corr(yrRdd, cntRdd) // compute the correlation println("correlation: %s vs %s: %f".format("yr", "cnt", yrCorr)) val seaRdd = entRdd.map { x => x.season } // season val seaCorr = Statistics.corr(seaRdd, cntRdd) println("correlation: %s vs %s: %f".format("season", "cnt", seaCorr))
}
A. ◦ BikeSummary.scala SummaryStat ◦ hour.csv data ◦ BikeSummary ( TODO
B. ◦ getFeatures printSummary ● console (temp) (hum) (windspeed) ● yr mnth (temp) (hum)(windspeed) (cnt) console
73
for (yr <- 0 to 1) for (mnth <- 1 to 12) { val yrMnRdd = entRdd.filter { ??? }.map { x => Vectors.dense(getFeatures(x)) } val summaryYrMn = Statistics.colStats( ??? )
println("====== summary yr=%d, mnth=%d ==========".format(yr,mnth)) println("mean: " + ???) println("variance: " + ???)
}
A. ◦ BikeSummary printCorrelation ◦ hour.csv [yr~windspeed] cnt
console B. feature
◦ printCorrelation ● yr mnth feature( yrmo yrmo=yr*12+mnth)
yrmo cnt
74
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics)◦Clustering ◦Classification◦Regression
75
Outline
! Traing Set( Label)
! (cluster) !
! ◦
76
Clustering
! ! (x1,x2,...,xn) K-Means n K
(k≤n), (WCSS within-cluster sum of squares)
!
A. KB. KC. ( )D. B C
77
K-Meansiteration
RUN
78
K-Means
ref: http://mropengate.blogspot.tw/2015/06/ai-ch16-5-k-introduction-to-clustering.html
! KMeans.train Model(KMeansModel ◦ val model=KMeans.train(data, numClusters, maxIterations, runs) ● data (RDD[Vector]) ● numClusters (K) ●maxIterations run Iteration
iteration maxIterations model ● runs KMeans run
model ! model.clusterCenters Feature ! model.computeCost WCSS model
79
K-Means in Spark MLlib
80
K-Means BikeSharing! hour.csv KMeans
console
◦ Features yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed,cnt( cnt Label Feature )
◦ numClusters 5 ( 5 ) ◦maxIterations 20 ( run 20 iteration) ◦ runs 3 3 Run model)
81
Model
A. Scala B. Package Scala Object C. data Folder D. Library
Model K
! ScalaIDE Scala folder package Object ◦ Clustering ( ) ● src ● bike (package ) ● BikeShareClustering (scala object )
● data (folder ) ● hour.csv
! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8
82
83
A. import B. main Driver Program
C. Log D. SparkContext
Model
Model K
//import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import KMeans library import org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
object BikeShareClustering { def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]"))
} }
! KMeans Library ! Driver Program sc ◦ appName - Driver Program ◦master - master URL
84
85
! (Class) ! prepare ◦ input file Features
LabelRDD
! BikeSummary
Model
Model K
86
! getFeatures ◦
! KMeans ◦ KMeans.train KMeansModel
! getDisplayString ◦
Model
Model K
getFeatures getDisplayStringgetFeatures def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.cnt, bikeData.yr, bikeData.season,
bikeData.mnth, bikeData.hr, bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed, bikeData.casual, bikeData.registered)
featureArr }
87
getDisplayString def getDisplayString(centers: Array[Double]): String = { val dispStr = """cnt: %.5f, yr: %.5f, season: %.5f, mnth: %.5f, hr: %.5f,
holiday: %.5f, weekday: %.5f, workingday: %.5f, weathersit: %.5f, temp: %.5f, atemp: %.5f, hum: %.5f,windspeed: %.5f, casual: %.5f, registered: %.5f"""
.format(centers(0), centers(1),centers(2), centers(3),centers(4), centers(5),centers(6), centers(7),centers(8), centers(9),centers(10), centers(11),centers(12), centers(13),centers(14))
dispStr }
KMeans
// convert the features to RDD[Vector] val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations per run, 3 runs
88
var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => {
println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray)))
clusterIdx += 1 } } // clusters sorted by cnt
89
//K-Meansimport org.apache.spark.mllib.clustering.{ KMeans, KMeansModel }
object BikeShareClustering { def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger // initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClustering").setMaster("local[*]"))
println("============== preparing data ==================") val bikeData = prepare(sc) // load hour.csv into RDD[BikeShareEntity] bikeData.persist()
println("============== clustering by KMeans ==================") // convert the features to RDD[Vector]
val featureRdd = bikeData.map { x => Vectors.dense(getFeatures(x)) } val model = KMeans.train(featureRdd, 5, 20, 3) // K=5, 20 iterations per run, 3 runs
var clusterIdx = 0 model.clusterCenters.sortBy { x => x(0) }.foreach { x => { println("center of cluster %d\n%s".format(clusterIdx, getDisplayString(x.toArray))) clusterIdx += 1 } } // clusters sorted by cnt
bikeData.unpersist() }
90
! yr season mnth hr cnt ! weathersit cnt ( ) ! temp atemp cnt ( ) ! hum cnt ( ) ! correlation
91
! K ModelWCSS
! WCSS K
Model
Model K
! model.computeCost WCSS model(WCSS )
! numClusters WCSS (K) ! WCSS
92
K-Means
println("============== tuning parameters ==================") for (k <- Array(5,10,15,20,25)) {
// try different numClusters and compare WCSS val iterations = 20 val tm = KMeans.train(featureRdd, k, iterations, 3) println("k=%d, WCSS=%f".format(k, tm.computeCost(featureRdd))) }
============== tuning parameters ==================
k=5, WCSS=89540755.504054
k=10, WCSS=36566061.126232
k=15, WCSS=23705349.962375
k=20, WCSS=18134353.720998
k=25, WCSS=14282108.404025
A. ◦ BikeShareClustering.scala Scala ◦ hour.csv data ◦ BikeShareClustering ( TODO
B. feature ◦ BikeClustering ● yrmo getFeatures KMeansconsole yrmo
● numClusters (ex:50,75,100)
93
K-Means
! K-Means! KMeans
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ (summary statistics)◦Clustering ◦Classification◦Regression
94
Outline
!
(Binary Classification)(Multi-Class Classification)
! ! ◦ (logistic regression) (decision
trees) (naive Bayes) ◦
95
! !
(Features)(Label)
! (Random Forest)
!
96
! import org.apache.spark.mllib.tree.DecisionTree ! import org.apache.spark.mllib.tree.model.DecisionTreeModel ! DecisionTree.trainClassifier Model(DecisionTreeModel
◦ val model=DecisionTree.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) ● trainData RDD[LabeledPoint] ● numClasses 2 ● categoricalFeaturesInfo trainData categorical Map[ Index, ]
continuous ● e.g. Map(0->2, 4->10): the features at index 0 and 4 are categorical, with 2 and 10 distinct values respectively
● impurity (Gini Entropy) ●maxDepth
overfit ●maxBins
● categoricalFeaturesInfo maxBins categoricalFeaturesInfo
97
Decision Tree in Spark MLlib
!
( ) !
threshold( ) ◦ Features yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed
◦ Label cnt 200 1 0 ◦ numClasses 2 ◦ impurity gini ◦maxDepth 5 ◦maxBins 30
98
99
Model
A. Scala B. Package Scala Object C. data Folder D. Library
Model
! ScalaIDE Scala folder package Object ◦ Classification ( ) ● src ● bike (package ) ● BikeShareClassificationDT (scala object )
● data (folder ) ● hour.csv
! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8
100
101
Model
Model A. import B. main Driver Program
C. Log D. SparkContext
//import spark rdd library import org.apache.spark._ import org.apache.spark.SparkContext._ import org.apache.spark.rdd._ //import decision tree library import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareClassificationDT { def main(args: Array[String]): Unit = {
Logger.getLogger("com").setLevel(Level.OFF) //set logger //initialize SparkContext val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationDT").setMaster("local[*]"))
} }
! Decision Tree Library ! Driver Program sc ◦ appName - Driver Program ◦master - master URL
102
103
Model
Model
! (Class) ◦ BikeSummary
! prepare ◦ input file Features Label
RDD[LabeledPoint] ◦ RDD[LabeledPoint]
! getFeatures◦ Model feature
! getCategoryInfo◦ categroyInfoMap
104
prepare def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
val rawData=sc.textFile("data/hour.csv") //read hour.csv in data folder val rawDataNoHead=rawData.mapPartitionsWithIndex { (idx, iter) =>
{ if (idx == 0) iter.drop(1) else iter } } //ignore first row(column name) val lines:RDD[Array[String]] = rawDataNoHead.map { x =>
x.split(",").map { x => x.trim() } } //split columns with comma val bikeData = lines.map{ x => BikeShareEntity(⋯) }//RDD[BikeShareEntity] val lpData=bikeData.map { x => {
val label = if (x.cnt > 200) 1 else 0 //label 1 if cnt > 200, else 0 val features = Vectors.dense(getFeatures(x)) new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector }
} //randomly split 6:4 into training and validation data val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4)) (trainData, validateData)
}
105
getFeatures and getCategoryInfo
getFeatures method def getFeatures(bikeData: BikeShareEntity): Array[Double] = { val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1,
bikeData.hr, bikeData.holiday, bikeData.weekday, bikeData.workingday, bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed)
featureArr } // 1-based categorical features such as season are shifted down by 1 to start at 0
getCategoryInfo method def getCategoryInfo(): Map[Int, Int] = { val categoryInfoMap = Map[Int, Int](
(/*"yr"*/ 0, 2), (/*"season"*/ 1, 4), (/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24), (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4))
categoryInfoMap } //(index in featureArr, number of distinct values)
106
Model
Model
! trainModel ◦ DecisionTree.trainClassifier Model
! evaluateModel ◦ AUC trainModel Model
107
trainModel and evaluateModel def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int, cateInfo: Map[Int, Int]): (DecisionTreeModel, Double) = {
val startTime = new DateTime() //start timing val model = DecisionTree.trainClassifier(trainData, 2, cateInfo, impurity, maxDepth, maxBins) //train the Model val endTime = new DateTime() //stop timing val duration = new Duration(startTime, endTime) //training time
//MyLogger.debug(model.toDebugString) //print the Decision Tree rules (model, duration.getMillis) }
def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = { val scoreAndLabels = validateData.map { data => val predict = model.predict(data.features) (predict, data.label) // RDD[(prediction, label)] for AUC } val metrics = new BinaryClassificationMetrics(scoreAndLabels) val auc = metrics.areaUnderROC() // areaUnderROC computes the AUC auc }
108
Model
Model ! tuneParameter ◦ impurity Max Depth Max Bin
trainModelevaluateModel AUC
109
AUC (Area under the Curve of ROC)
Confusion matrix:
actual Positive (Label 1) — predicted Positive (1): true positive (TP); predicted Negative (0): false negative (FN)
actual Negative (Label 0) — predicted Positive (1): false positive (FP); predicted Negative (0): true negative (TN)
! (True Pos Rate) TPR: the fraction of actual positives predicted as 1 ◦ TPR=TP/(TP+FN)
! (False Pos Rate) FPR: the fraction of actual negatives predicted as 1 ◦ FPR=FP/(FP+TN)
! Plotting FPR on the X axis and TPR on the Y axis gives the ROC curve ! AUC is the area under the ROC curve
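The TPR and FPR definitions above in runnable form: counting TP/FN/FP/TN from (prediction, label) pairs shaped like scoreAndLabels (the pairs here are made-up sample data):

```scala
// scoreAndLabels-style pairs: (predicted class, actual label)
val scoreAndLabels = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 1.0),
                         (1.0, 1.0), (0.0, 0.0), (0.0, 0.0))

val tp = scoreAndLabels.count { case (p, l) => p == 1.0 && l == 1.0 }
val fn = scoreAndLabels.count { case (p, l) => p == 0.0 && l == 1.0 }
val fp = scoreAndLabels.count { case (p, l) => p == 1.0 && l == 0.0 }
val tn = scoreAndLabels.count { case (p, l) => p == 0.0 && l == 0.0 }

val tpr = tp.toDouble / (tp + fn)   // true positive rate = TP/(TP+FN)
val fpr = fp.toDouble / (fp + tn)   // false positive rate = FP/(FP+TN)
```

BinaryClassificationMetrics effectively sweeps the decision threshold and integrates (FPR, TPR) points like these into the AUC.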
110
AUC
AUC 1100%
0.5 < AUC < 1
AUC 0.5
AUC < 0.5
AUC(Area under the Curve of ROC)
111
tuneParameter def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint], cateInfo: Map[Int, Int]) = { val impurityArr = Array("gini", "entropy") val depthArr = Array(3, 5, 10, 15, 20, 25) val binsArr = Array(50, 100, 200) val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield { // train a model and compute its AUC for each combination val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo) val auc = evaluateModel(validateData, model) println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f".format(impurity, maxDepth, maxBins, auc)) (impurity, maxDepth, maxBins, auc) } val bestEval = (evalArr.sortBy(_._4).reverse)(0) // highest AUC println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, auc=%f".format(bestEval._1, bestEval._2, bestEval._3, bestEval._4)) }
112
Decision Tree//MLlib libimport org.apache.spark.mllib.regression.LabeledPointimport org.apache.spark.mllib.evaluation._import org.apache.spark.mllib.linalg.Vectors//decision treeimport org.apache.spark.mllib.tree.DecisionTreeimport org.apache.spark.mllib.tree.model.DecisionTreeModel
object BikeShareClassificationDT {case class BikeShareEntity(…) // case classdef main(args: Array[String]): Unit = {
MyLogger.setLogger val doTrain = (args != null && args.length > 0 && "Y".equals(args(0))) val sc = new SparkContext(new SparkConf().setAppName("ClassificationDT").setMaster("local[*]"))
println("============== preparing data ==================") val (trainData, validateData) = prepare(sc)
val cateInfo = getCategoryInfo() if (!doTrain) { println("============== train Model (CateInfo)==================") val (modelC, durationC) = trainModel(trainData, "gini", 5, 30, cateInfo) val aucC = evaluateModel(validateData, modelC) println("validate auc(CateInfo)=%f".format(aucC)) } else { println("============== tuning parameters(cateInfo) ==================") tuneParameter(trainData, validateData, cateInfo) }
}}
A. ◦ BikeShareClassificationDT.scala Scala ◦ hour.csv data ◦ BikeShareClassificationDT ( TODO
B. feature ◦ BikeShareClassificationDT ● category AUC ● feature ( |correlation| > 0.1 ) Model AUC
113
Decision Tree
============== tuning parameters(cateInfo) ==================
parameter: impurity=gini, maxDepth=3, maxBins=50, auc=0.835524
parameter: impurity=gini, maxDepth=3, maxBins=100, auc=0.835524
parameter: impurity=gini, maxDepth=3, maxBins=200, auc=0.835524
parameter: impurity=gini, maxDepth=5, maxBins=50, auc=0.851846
parameter: impurity=gini, maxDepth=5, maxBins=100, auc=0.851846
parameter: impurity=gini, maxDepth=5, maxBins=200, auc=0.851846
! Simple linear regression (e.g. y=ax+b) predicts a continuous value (y) ◦ from an input variable (x) to an output (y)
! Logistic regression ◦ predicts a category rather than a continuous value
! The S-shaped sigmoid function maps the score to a probability p; with the 0.5 threshold, p >= 0.5 predicts class [1], otherwise class [0]
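A minimal sketch of the sigmoid and the 0.5-threshold decision in plain Scala (illustrative, not the MLlib API):

```scala
// sigmoid squashes any real-valued score into a probability in (0, 1)
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// default 0.5 threshold turns the probability into a class decision
def classify(z: Double, threshold: Double = 0.5): Int =
  if (sigmoid(z) >= threshold) 1 else 0
```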
114
! import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel }
! LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction) Model(LogisticRegressionModel ◦ val model=LogisticRegressionWithSGD.train(trainData,numIterations,
stepSize, miniBatchFraction) ● trainData RDD[LabeledPoint] ●numIterations (SGD) 100 ● stepSize SGD 1 ●miniBatchFraction 0~1
1
115
Logistic Regression in Spark
http://www.csie.ntnu.edu.tw/~u91029/Optimization.html
! Before training with LogisticRegression, categorical features should be one-of-k (one-hot) encoded
! One-of-K encoding: ◦ expand the feature into N columns (N = number of distinct values) ◦ set the column at the value's index to 1, all others to 0
116
Categorical Features
weather value: Clear=1, Mist=2, Light Snow=3, Heavy Rain=4
INDEX Map (weathersit -> Index): 1 -> 0, 2 -> 1, 3 -> 2, 4 -> 3
Encoding (Index -> Encode): 0 -> 1000, 1 -> 0100, 2 -> 0010, 3 -> 0001
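The index map and encoding above, as code: build the value-to-index map from the distinct category values (mirroring the distinct().zipWithIndex idea used in prepare), then encode (the sample values are illustrative):

```scala
// distinct category values observed in the data (e.g. weathersit)
val values = Seq(1.0, 2.0, 3.0, 4.0, 2.0, 1.0)

// value -> index map, like the weatherMap built in prepare
val indexMap: Map[Double, Int] = values.distinct.sorted.zipWithIndex.toMap

def oneHot(v: Double): Array[Double] = {
  val arr = Array.ofDim[Double](indexMap.size)  // all zeros
  arr(indexMap(v)) = 1.0                        // 1 at the value's index
  arr
}
```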
117
Model
A. Scala B. Package Scala Object C. data Folder D. Library
Model
! ScalaIDE Scala folder package Object ◦ Classification ( ) ● src ● bike (package ) ● BikeShareClassificationLG (scala object )
● data (folder ) ● hour.csv
! Build Path ◦ Spark /assembly/target/scala-2.11/jars/ ◦ Scala container 2.11.8
118
119
Model
Model steps: A. import the libraries B. create main (the Driver Program)
C. configure the Log D. initialize the SparkContext

//import spark rdd library
import org.apache.log4j.{ Level, Logger } //needed for the Logger call below
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import logistic regression library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.classification.{ LogisticRegressionWithSGD, LogisticRegressionModel }

object BikeShareClassificationLG {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("com").setLevel(Level.OFF) //set logger
    //initialize SparkContext
    val sc = new SparkContext(new SparkConf().setAppName("BikeClassificationLG").setMaster("local[*]"))
  }
}

! import the Logistic Regression Library
! create the Driver Program and SparkContext (sc)
◦ appName - the Driver Program's display name
◦ master - the master URL
120
121
Model
Model
! Define the data class (BikeSummary)
! prepare method
◦ read the input file, convert it to Features and Label, and produce RDD[LabeledPoint]
◦ split the RDD[LabeledPoint] into training and validation sets
! getFeatures method ◦ collect the features used by the Model
! getCategoryFeature method ◦ 1-of-k encode a categorical field into an Array[Double]
One-Of-K encoding in prepare

def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  …
  val bikeData = lines.map { x => new BikeShareEntity(x) } //RDD[BikeShareEntity]
  val weatherMap = bikeData.map { x => x.getField("weathersit") }
    .distinct().collect().zipWithIndex.toMap //build the value-to-index map
  val lpData = bikeData.map { x => {
    val label = x.getLabel()
    val features = Vectors.dense(x.getFeatures(weatherMap))
    new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
  } }
  …
}

def getFeatures(weatherMap: Map[Double, Int]) = {
  var rtnArr: Array[Double] = Array()
  val weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0)
  val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
  weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
  rtnArr = rtnArr ++ weatherArray
  ….
}
! Standardizes each feature: subtract the mean and divide by the standard deviation (unit variance)
! Keeps large-valued features from dominating the gradient-descent updates

StandardScaler

def prepare(sc): RDD[LabeledPoint] = {
  …
  val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
  //fit the StandardScaler on the whole feature RDD
  val stdScaler = new StandardScaler(withMean = true, withStd = true).fit(featureRdd)
  val lpData2 = bikeData.map { x => {
    val label = x.getLabel()
    //standardize the features before building the LabeledPoint
    val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
    new LabeledPoint(label, features)
  } }
  …
prepare method

def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  …
  val bikeData = lines.map { x => BikeShareEntity(…) } //RDD[BikeShareEntity]
  val weatherMap = bikeData.map { x => x.weathersit }
    .distinct().collect().zipWithIndex.toMap //build the value-to-index map
  //Standardize
  val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap,
    mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)) }
  val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap)
  //build LabeledPoints from the one-of-k encoded category features
  val lpData = bikeData.map { x => {
    val label = if (x.cnt > 200) 1 else 0 //label is 1 when rentals exceed 200, else 0
    val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap,
      mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
    new LabeledPoint(label, features)
  } }
  //randomly split 6:4 into training and validation data
  val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
  (trainData, validateData)
}
125
getFeatures method

def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int],
    mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
    weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int],
    weatherMap: Map[Double, Int]): Array[Double] = {
  var featureArr: Array[Double] = Array()
  //one-of-k encode every categorical field
  featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
  featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
  featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
  featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
  featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
  featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
  featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
  featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
  //append the continuous features as-is
  featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed)
  featureArr
}
126
getCategoryFeature method

def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = {
  val featureArray = Array.ofDim[Double](categoryMap.size)
  val index = categoryMap(fieldVal)
  featureArray(index) = 1
  featureArray
}
127
Model
Model
! trainModel ◦ train the Model with LogisticRegressionWithSGD.train
! evaluateModel ◦ compute the AUC of the Model produced by trainModel, using the validation data
128
trainModel and evaluateModel

def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double,
    miniBatchFraction: Double): (LogisticRegressionModel, Double) = {
  val startTime = new DateTime()
  //train the model with LogisticRegressionWithSGD.train
  val model = LogisticRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  //MyLogger.debug(model.toPMML()) //dump the model for debugging
  (model, duration.getMillis)
}

def evaluateModel(validateData: RDD[LabeledPoint], model: LogisticRegressionModel): Double = {
  val scoreAndLabels = validateData.map { data =>
    val predict = model.predict(data.features)
    (predict, data.label) //RDD[(prediction, label)] for the AUC computation
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  val auc = metrics.areaUnderROC() //areaUnderROC returns the AUC
  auc
}
129
Model
Model
! tuneParameter ◦ iterate over combinations of iteration, stepSize and miniBatchFraction, calling trainModel and evaluateModel for each, and keep the combination with the best AUC
130
tuneParameter

def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = {
  val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
  val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
  val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
  val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr;
      miniBatchFraction <- miniBatchFractionArr) yield {
    //train a model for every parameter combination and compute its AUC
    val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction)
    val auc = evaluateModel(validateData, model)
    println("parameter: iteraion=%d, stepSize=%f, miniBatchFraction=%f, auc=%f"
      .format(iteration, stepSize, miniBatchFraction, auc))
    (iteration, stepSize, miniBatchFraction, auc)
  }
  val bestEval = (evalArr.sortBy(_._4).reverse)(0) //the combination with the highest AUC
  println("best parameter: iteraion=%d, stepSize=%f, miniBatchFraction=%f, auc=%f"
    .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A. Run the example ◦ add BikeShareClassificationLG.scala to the Scala project ◦ put hour.csv in the data folder ◦ complete the TODO sections in BikeShareClassificationLG and run it
B. Improve the features ◦ in BikeShareClassificationLG ● compare the AUC with and without category encoding ● keep only the features with |correlation| > 0.1 and compare the Model's AUC
131
Logistic Regression
============== tuning parameters(Category) ==================
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.500000, auc=0.857073
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=0.800000, auc=0.855904
parameter: iteraion=5, stepSize=10.000000, miniBatchFraction=1.000000, auc=0.855685
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.500000, auc=0.852388
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=0.800000, auc=0.852901
parameter: iteraion=5, stepSize=50.000000, miniBatchFraction=1.000000, auc=0.853237
parameter: iteraion=5, stepSize=100.000000, miniBatchFraction=0.500000, auc=0.852087
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib ◦ summary statistics ◦ Clustering ◦ Classification ◦ Regression
132
Outline
! Regression predicts a continuous value instead of a class
! Spark MLlib provides several regression methods
◦ least squares (Least Squares), Lasso, and ridge regression
133
! import org.apache.spark.mllib.tree.DecisionTree
! import org.apache.spark.mllib.tree.model.DecisionTreeModel
! DecisionTree.trainRegressor returns a trained Model (DecisionTreeModel)
◦ val model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
● trainData: RDD[LabeledPoint]
● categoricalFeaturesInfo: declares which features in trainData are categorical, as Map[feature index, number of distinct values]; features not in the map are treated as continuous
● e.g. Map(0->2, 4->10): the 1st and 5th features are categorical, with 2 and 10 distinct values
● impurity: split criterion (variance for regression)
● maxDepth: maximum tree depth; too deep may overfit
● maxBins: maximum number of bins per split
● when categoricalFeaturesInfo is given, maxBins must be at least the largest category count declared in it
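Because training is rejected when maxBins is smaller than the largest declared category count, the constraint can be checked up front with a small guard (a hypothetical helper, not part of the Spark API):

```scala
// maxBins must cover the largest number of categories declared in
// categoricalFeaturesInfo, otherwise DecisionTree training fails.
def validMaxBins(cateInfo: Map[Int, Int], maxBins: Int): Boolean =
  cateInfo.values.forall(_ <= maxBins)
```

validMaxBins(Map(0 -> 2, 4 -> 10), 30) holds; with maxBins = 5 it would not, because feature index 4 declares 10 distinct values.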
134
Decision Tree Regression in Spark
! The first Model
◦ Features: yr, season, mnth, hr, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed
◦ Label: cnt
◦ impurity: variance (the only impurity supported for regression trees)
◦ maxDepth: 5
◦ maxBins: 30
135
136
Model
A. Create the Scala project B. Create the Package and Scala Object C. Create the data Folder D. Add the Spark Library

Model

! In ScalaIDE create the Scala project, data folder, package and Object
◦ Regression (the project)
● src
● bike (the package)
● BikeShareRegressionDT (the scala object)
● data (the folder)
● hour.csv
! Add to the Build Path
◦ every jar under <Spark install dir>/assembly/target/scala-2.11/jars/
◦ set the Scala Library container to 2.11.8
137
138
Model
Model steps: A. import the libraries B. create main (the Driver Program)
C. configure the Log D. initialize the SparkContext

//import spark rdd library
import org.apache.log4j.{ Level, Logger } //needed for the Logger call below
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import decision tree library
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

object BikeShareRegressionDT {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("com").setLevel(Level.OFF) //set logger
    //initialize SparkContext
    val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionDT").setMaster("local[*]"))
  }
}

! import the Decision Tree Library
! create the Driver Program and SparkContext (sc)
◦ appName - the Driver Program's display name
◦ master - the master URL
139
140
Model
Model
! Define the data class (BikeSummary)
! prepare method
◦ read the input file, convert it to Features and Label, and produce RDD[LabeledPoint]
◦ split the RDD[LabeledPoint] into training and validation sets
! getFeatures method ◦ collect the features used by the Model
! getCategoryInfo method ◦ build the categoryInfoMap describing the categorical features
141
prepare method

def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  val rawData = sc.textFile("data/hour.csv") //read hour.csv in data folder
  val rawDataNoHead = rawData.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  } //ignore first row (column names)
  val lines: RDD[Array[String]] = rawDataNoHead.map { x =>
    x.split(",").map { x => x.trim() }
  } //split columns with comma
  val bikeData = lines.map { x => BikeShareEntity(…) } //RDD[BikeShareEntity]
  val lpData = bikeData.map { x => {
    val label = x.cnt //the prediction target is the rental-count column
    val features = Vectors.dense(getFeatures(x))
    new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
  } }
  //randomly split 6:4 into training and validation data
  val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
  (trainData, validateData)
}
142
getFeatures and getCategoryInfo methods

def getFeatures(bikeData: BikeShareEntity): Array[Double] = {
  val featureArr = Array(bikeData.yr, bikeData.season - 1, bikeData.mnth - 1,
    bikeData.hr, bikeData.holiday, bikeData.weekday, bikeData.workingday,
    bikeData.weathersit - 1, bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed)
  featureArr
} //category fields such as season start at 1, so shift them to start at 0

def getCategoryInfo(): Map[Int, Int] = {
  val categoryInfoMap = Map[Int, Int](
    (/*"yr"*/ 0, 2), (/*"season"*/ 1, 4), (/*"mnth"*/ 2, 12), (/*"hr"*/ 3, 24),
    (/*"holiday"*/ 4, 2), (/*"weekday"*/ 5, 7), (/*"workingday"*/ 6, 2), (/*"weathersit"*/ 7, 4))
  categoryInfoMap
} //(feature index in featureArr, number of distinct values)
143
Model
Model
! trainModel ◦ train the Model with DecisionTree.trainRegressor
! evaluateModel ◦ compute the RMSE of the Model produced by trainModel, using the validation data
! RMSE (root-mean-square deviation / root-mean-square error) measures the typical size of the prediction error
! It is the sample standard deviation of the differences between predicted and observed values
! The lower the RMSE, the better the model fits the data
144
RMSE(root-mean-square error)
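Spark's RegressionMetrics computes this directly; for intuition, the same quantity over a plain Seq of (prediction, label) pairs:

```scala
// Root-mean-square error: the square root of the mean squared prediction error.
def rmse(scoreAndLabels: Seq[(Double, Double)]): Double = {
  val mse = scoreAndLabels.map { case (p, l) => (p - l) * (p - l) }.sum / scoreAndLabels.size
  math.sqrt(mse)
}
```

For the pairs (3, 1) and (1, 3) each error is 2, so the RMSE is 2.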
145
trainModel and evaluateModel

def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int,
    cateInfo: Map[Int, Int]): (DecisionTreeModel, Double) = {
  val startTime = new DateTime()
  //train the model with DecisionTree.trainRegressor
  val model = DecisionTree.trainRegressor(trainData, cateInfo, impurity, maxDepth, maxBins)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  //MyLogger.debug(model.toDebugString) //dump the Decision Tree for debugging
  (model, duration.getMillis)
}

def evaluateModel(validateData: RDD[LabeledPoint], model: DecisionTreeModel): Double = {
  val scoreAndLabels = validateData.map { data =>
    val predict = model.predict(data.features)
    (predict, data.label) //RDD[(prediction, label)] for the RMSE computation
  }
  val metrics = new RegressionMetrics(scoreAndLabels)
  val rmse = metrics.rootMeanSquaredError //rootMeanSquaredError returns the RMSE
  rmse
}
146
Model
Model
! tuneParameter ◦ iterate over combinations of Max Depth and Max Bins, calling trainModel and evaluateModel for each, and keep the combination with the lowest RMSE
147
tuneParameter

def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint],
    cateInfo: Map[Int, Int]) = {
  val impurityArr = Array("variance")
  val depthArr = Array(3, 5, 10, 15, 20, 25)
  val binsArr = Array(50, 100, 200)
  val evalArr = for (impurity <- impurityArr; maxDepth <- depthArr; maxBins <- binsArr) yield {
    //train a model for every parameter combination and compute its RMSE
    val (model, duration) = trainModel(trainData, impurity, maxDepth, maxBins, cateInfo)
    val rmse = evaluateModel(validateData, model)
    println("parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f"
      .format(impurity, maxDepth, maxBins, rmse))
    (impurity, maxDepth, maxBins, rmse)
  }
  val bestEvalAsc = evalArr.sortBy(_._4)
  val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE
  println("best parameter: impurity=%s, maxDepth=%d, maxBins=%d, rmse=%f"
    .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A. Run the example ◦ add BikeShareRegressionDT.scala to the Scala project ◦ put hour.csv in the data folder ◦ complete the TODO sections in BikeShareRegressionDT.scala and run it
B. Improve the features ◦ in BikeShareRegressionDT
● add a new feature dayType (Double), derived as follows:
● holiday=0 and workingday=0 → dayType=0
● holiday=1 → dayType=1
● holiday=0 and workingday=1 → dayType=2
● add dayType to the features and retrain the Model (update getFeatures and getCategoryInfo)
◦ remember to update the Categorical Info accordingly
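The dayType rule in the exercise can be sketched as a small helper (the name is hypothetical; wire its result into getFeatures and declare 3 distinct values for it in getCategoryInfo):

```scala
// dayType: 0 = day off that is not a holiday, 1 = holiday, 2 = working day.
def dayType(holiday: Double, workingday: Double): Double =
  if (holiday == 1) 1.0
  else if (workingday == 1) 2.0
  else 0.0
```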
148
Decision Tree
============== tuning parameters(CateInfo) ==================
parameter: impurity=variance, maxDepth=3, maxBins=50, rmse=118.424606
parameter: impurity=variance, maxDepth=3, maxBins=100, rmse=118.424606
parameter: impurity=variance, maxDepth=3, maxBins=200, rmse=118.424606
parameter: impurity=variance, maxDepth=5, maxBins=50, rmse=93.138794
parameter: impurity=variance, maxDepth=5, maxBins=100, rmse=93.138794
parameter: impurity=variance, maxDepth=5, maxBins=200, rmse=93.138794
! Least Squares regression fits the coefficients by minimizing the sum of squared errors
! Spark provides LinearRegressionWithSGD for this
149
! import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel }
! LinearRegressionWithSGD.train(trainData, numIterations, stepSize) returns a trained Model (LinearRegressionModel)
◦ val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize)
● trainData: RDD[LabeledPoint]
● numIterations: number of gradient-descent (SGD) iterations
● stepSize: SGD step size, default 1; linear regression is sensitive to stepSize
● miniBatchFraction: fraction of the data used per iteration, 0~1, default 1
150
Least Squares Regression in Spark
151
Model
A. Create the Scala project B. Create the Package and Scala Object C. Create the data Folder D. Add the Spark Library

Model

! In ScalaIDE create the Scala project, data folder, package and Object
◦ Regression (the project)
● src
● bike (the package)
● BikeShareRegressionLR (the scala object)
● data (the folder)
● hour.csv
! Add to the Build Path
◦ every jar under <Spark install dir>/assembly/target/scala-2.11/jars/
◦ set the Scala Library container to 2.11.8
152
153
Model
Model steps: A. import the libraries B. create main (the Driver Program)
C. configure the Log D. initialize the SparkContext

//import spark rdd library
import org.apache.log4j.{ Level, Logger } //needed for the Logger call below
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
//import linear regression library (these classes live in mllib.regression, not mllib.classification)
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{ LinearRegressionWithSGD, LinearRegressionModel }

object BikeShareRegressionLR {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("com").setLevel(Level.OFF) //set logger
    //initialize SparkContext
    val sc = new SparkContext(new SparkConf().setAppName("BikeRegressionLR").setMaster("local[*]"))
  }
}

! import the Linear Regression Library
! create the Driver Program and SparkContext (sc)
◦ appName - the Driver Program's display name
◦ master - the master URL
154
155
Model
Model
! Define the data class (BikeSummary)
! prepare method
◦ read the input file, convert it to Features and Label, and produce RDD[LabeledPoint]
◦ split the RDD[LabeledPoint] into training and validation sets
! getFeatures method ◦ collect the features used by the Model
! getCategoryFeature method ◦ 1-of-k encode a categorical field into an Array[Double]
One-Of-K encoding in prepare

def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  …
  val bikeData = lines.map { x => new BikeShareEntity(x) } //RDD[BikeShareEntity]
  val weatherMap = bikeData.map { x => x.getField("weathersit") }
    .distinct().collect().zipWithIndex.toMap //build the value-to-index map
  val lpData = bikeData.map { x => {
    val label = x.getLabel()
    val features = Vectors.dense(x.getFeatures(weatherMap))
    new LabeledPoint(label, features) //a LabeledPoint is a label plus a feature Vector
  } }
  …
}

def getFeatures(weatherMap: Map[Double, Int]) = {
  var rtnArr: Array[Double] = Array()
  val weatherArray: Array[Double] = Array.ofDim[Double](weatherMap.size) //weatherArray=Array(0,0,0,0)
  val index = weatherMap(getField("weathersit")) //weathersit=2; index=1
  weatherArray(index) = 1 //weatherArray=Array(0,1,0,0)
  rtnArr = rtnArr ++ weatherArray
  ….
}
! Standardizes each feature: subtract the mean and divide by the standard deviation (unit variance)
! Keeps large-valued features from dominating the gradient-descent updates

StandardScaler

def prepare(sc): RDD[LabeledPoint] = {
  …
  val featureRdd = bikeData.map { x => Vectors.dense(x.getFeatures(weatherMap)) }
  //fit the StandardScaler on the whole feature RDD
  val stdScaler = new StandardScaler(withMean = true, withStd = true).fit(featureRdd)
  val lpData2 = bikeData.map { x => {
    val label = x.getLabel()
    //standardize the features before building the LabeledPoint
    val features = stdScaler.transform(Vectors.dense(x.getFeatures(weatherMap)))
    new LabeledPoint(label, features)
  } }
  …
prepare method

def prepare(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  …
  val bikeData = lines.map { x => BikeShareEntity(…) } //RDD[BikeShareEntity]
  val weatherMap = bikeData.map { x => x.weathersit }
    .distinct().collect().zipWithIndex.toMap //build the value-to-index map
  //Standardize
  val featureRddWithMap = bikeData.map { x => Vectors.dense(getFeatures(x, yrMap, seasonMap,
    mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)) }
  val stdScalerWithMap = new StandardScaler(withMean = true, withStd = true).fit(featureRddWithMap)
  //build LabeledPoints from the one-of-k encoded category features
  val lpData = bikeData.map { x => {
    val label = x.cnt //the prediction target is the rental count
    val features = stdScalerWithMap.transform(Vectors.dense(getFeatures(x, yrMap, seasonMap,
      mnthMap, hrMap, holidayMap, weekdayMap, workdayMap, weatherMap)))
    new LabeledPoint(label, features)
  } }
  //randomly split 6:4 into training and validation data
  val Array(trainData, validateData) = lpData.randomSplit(Array(0.6, 0.4))
  (trainData, validateData)
}
159
getFeatures method

def getFeatures(bikeData: BikeShareEntity, yrMap: Map[Double, Int], seasonMap: Map[Double, Int],
    mnthMap: Map[Double, Int], hrMap: Map[Double, Int], holidayMap: Map[Double, Int],
    weekdayMap: Map[Double, Int], workdayMap: Map[Double, Int],
    weatherMap: Map[Double, Int]): Array[Double] = {
  var featureArr: Array[Double] = Array()
  //one-of-k encode every categorical field
  featureArr ++= getCategoryFeature(bikeData.yr, yrMap)
  featureArr ++= getCategoryFeature(bikeData.season, seasonMap)
  featureArr ++= getCategoryFeature(bikeData.mnth, mnthMap)
  featureArr ++= getCategoryFeature(bikeData.holiday, holidayMap)
  featureArr ++= getCategoryFeature(bikeData.weekday, weekdayMap)
  featureArr ++= getCategoryFeature(bikeData.workingday, workdayMap)
  featureArr ++= getCategoryFeature(bikeData.weathersit, weatherMap)
  featureArr ++= getCategoryFeature(bikeData.hr, hrMap)
  //append the continuous features as-is
  featureArr ++= Array(bikeData.temp, bikeData.atemp, bikeData.hum, bikeData.windspeed)
  featureArr
}
160
getCategoryFeature method

def getCategoryFeature(fieldVal: Double, categoryMap: Map[Double, Int]): Array[Double] = {
  val featureArray = Array.ofDim[Double](categoryMap.size)
  val index = categoryMap(fieldVal)
  featureArray(index) = 1
  featureArray
}
161
Model
Model
! trainModel ◦ train the Model with LinearRegressionWithSGD.train
! evaluateModel ◦ compute the RMSE of the Model produced by trainModel, using the validation data
162
trainModel and evaluateModel

def trainModel(trainData: RDD[LabeledPoint], numIterations: Int, stepSize: Double,
    miniBatchFraction: Double): (LinearRegressionModel, Double) = {
  val startTime = new DateTime()
  //train the model with LinearRegressionWithSGD.train
  val model = LinearRegressionWithSGD.train(trainData, numIterations, stepSize, miniBatchFraction)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  //MyLogger.debug(model.toPMML()) //dump the model for debugging
  (model, duration.getMillis)
}

def evaluateModel(validateData: RDD[LabeledPoint], model: LinearRegressionModel): Double = {
  val scoreAndLabels = validateData.map { data =>
    val predict = model.predict(data.features)
    (predict, data.label) //RDD[(prediction, label)] for the RMSE computation
  }
  val metrics = new RegressionMetrics(scoreAndLabels)
  val rmse = metrics.rootMeanSquaredError //rootMeanSquaredError returns the RMSE
  rmse
}
163
Model
Model
! tuneParameter ◦ iterate over combinations of iteration, stepSize and miniBatchFraction, calling trainModel and evaluateModel for each, and keep the combination with the lowest RMSE
164
tuneParameter

def tuneParameter(trainData: RDD[LabeledPoint], validateData: RDD[LabeledPoint]) = {
  val iterationArr: Array[Int] = Array(5, 10, 20, 60, 100)
  val stepSizeArr: Array[Double] = Array(10, 50, 100, 200)
  val miniBatchFractionArr: Array[Double] = Array(0.5, 0.8, 1)
  val evalArr = for (iteration <- iterationArr; stepSize <- stepSizeArr;
      miniBatchFraction <- miniBatchFractionArr) yield {
    //train a model for every parameter combination and compute its RMSE
    val (model, duration) = trainModel(trainData, iteration, stepSize, miniBatchFraction)
    val rmse = evaluateModel(validateData, model)
    println("parameter: iteraion=%d, stepSize=%f, miniBatchFraction=%f, rmse=%f"
      .format(iteration, stepSize, miniBatchFraction, rmse))
    (iteration, stepSize, miniBatchFraction, rmse)
  }
  val bestEvalAsc = evalArr.sortBy(_._4)
  val bestEval = bestEvalAsc(0) //the combination with the lowest RMSE
  println("best parameter: iteraion=%d, stepSize=%f, miniBatchFraction=%f, rmse=%f"
    .format(bestEval._1, bestEval._2, bestEval._3, bestEval._4))
}
A. Run the example ◦ add BikeShareRegressionLR.scala to the Scala project ◦ put hour.csv in the data folder ◦ complete the TODO sections in BikeShareRegressionLR.scala and run it
B. Improve the features ◦ in BikeShareRegressionLR
● add a new feature dayType (Double), derived as follows:
● holiday=0 and workingday=0 → dayType=0
● holiday=1 → dayType=1
● holiday=0 and workingday=1 → dayType=2
● add dayType to the features and retrain the Model (update getFeatures and the category encoding)
165
Linear Regression
============== tuning parameters(Category) ==================
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.500000, rmse=256.370620
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=0.800000, rmse=256.376770
parameter: iteraion=5, stepSize=0.010000, miniBatchFraction=1.000000, rmse=256.407185
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.500000, rmse=250.037095
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=0.800000, rmse=250.062817
parameter: iteraion=5, stepSize=0.025000, miniBatchFraction=1.000000, rmse=250.126173
! A Random Forest trains a multitude of Decision Trees and aggregates their outputs
◦ classification: the mode (majority vote) of the individual trees
◦ regression: the mean of the individual tree predictions
! Advantages
◦ less prone to overfit than a single tree
◦ tolerates missing values
166
(RandomForest)
! import org.apache.spark.mllib.tree.RandomForest
! import org.apache.spark.mllib.tree.model.RandomForestModel
! RandomForest.trainRegressor returns a trained Model (RandomForestModel)
◦ val model = RandomForest.trainRegressor(trainData, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
● trainData: RDD[LabeledPoint]
● categoricalFeaturesInfo: Map[feature index, number of distinct values]; features not listed are treated as continuous
● e.g. Map(0->2, 4->10): the 1st and 5th features are categorical, with 2 and 10 distinct values
● numTrees: number of trees in the forest (the Model combines them)
● impurity: split criterion (variance for regression)
● featureSubsetStrategy: how many Features to consider per split ("auto" recommended)
● maxDepth: maximum tree depth; too deep may overfit
● maxBins: maximum number of bins per split; when categoricalFeaturesInfo is given, maxBins must be at least its largest category count
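For regression the forest simply averages the individual tree predictions (classification takes the mode instead). A toy sketch of that aggregation step, with the trees stubbed as plain functions rather than real DecisionTreeModels:

```scala
// Each "tree" is stubbed as a function from a feature vector to a prediction;
// the forest's regression output is the mean of the trees' outputs.
def forestPredict(trees: Seq[Array[Double] => Double], features: Array[Double]): Double =
  trees.map(_(features)).sum / trees.size

// Stub trees that ignore the features and return constants.
val stubTrees: Seq[Array[Double] => Double] = Seq(_ => 10.0, _ => 20.0, _ => 30.0)
```

forestPredict(stubTrees, Array(1.0)) averages 10, 20 and 30 to 20.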
167
Random Forest Regression in Spark
168
trainModel and evaluateModel

def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int,
    maxBins: Int): (RandomForestModel, Double) = {
  val startTime = new DateTime()
  val cateInfo = BikeShareEntity.getCategoryInfo(true) //categoricalFeaturesInfo
  //train a 3-tree forest with the "auto" feature-subset strategy
  val model = RandomForest.trainRegressor(trainData, cateInfo, 3, "auto", impurity, maxDepth, maxBins)
  val endTime = new DateTime()
  val duration = new Duration(startTime, endTime)
  //MyLogger.debug(model.toDebugString) //dump the trees for debugging
  (model, duration.getMillis)
}

def evaluateModel(validateData: RDD[LabeledPoint], model: RandomForestModel): Double = {
  val scoreAndLabels = validateData.map { data =>
    val predict = model.predict(data.features)
    (predict, data.label) //RDD[(prediction, label)] for the RMSE computation
  }
  val metrics = new RegressionMetrics(scoreAndLabels)
  val rmse = metrics.rootMeanSquaredError //rootMeanSquaredError returns the RMSE
  rmse
}
! [Exercise]
◦ Set up the Package, Object, data folder and Build Path from Regression.zip in the Scala IDE
◦ complete BikeShareRegressionRF
◦ compare the RandomForest results with the Decision Tree results
169
RandomForest Regression
! Spark ! RDD(Resilient Distributed Datasets) ! Scala ! Spark MLlib !
170
Outline
! Etu & udn Hadoop Competition 2016
! Organized by ETU and udn, using udn e-commerce data (plus Open Data)
! EHC task: from browse (View) and order (Order) records of 2015/6 ~ 2015/10 and member data (Member), predict the store (Storeid) and categories (Cat_id1, Cat_id2) each member will order from in 2015/11
172
1) Convert the raw Data into Feature + Label LabeledPoint Data
◦ Features: View/Order behaviour from months 6~9
◦ Label: whether an Order occurred in month 10 (1 if yes, 0 if no)
◦ Features: …
2) Split the LabeledPoint Data into a Training Set and a Validating Set (6:4 Split)
3) Train and validate a Machine Learning Model with the Training Set and Validating Set
4) Build the Testing Set
◦ Features: View/Order behaviour from months 6~10
◦ Features: same definitions as in 1)
5) Predict on the Testing Set with the Model from 3)
6) Output the predictions
7) Iterate over steps 1) ~ 6)
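The month-window split in steps 1) and 4) can be sketched like this (Event and its fields are simplified stand-ins for the real View/Order records):

```scala
// Simplified stand-in for one View/Order record.
case class Event(uid: String, month: Int)

// Months 6-9 build the features; month 10 supplies the label events.
def splitByWindow(events: Seq[Event]): (Seq[Event], Seq[Event]) =
  (events.filter(e => e.month >= 6 && e.month <= 9), events.filter(_.month == 10))
```

For the Testing Set the feature window widens to months 6~10 and the label month moves to 11.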
173
! Aggregate the View/Order records per uid-storeid-cat_id1-cat_id2 key to build the Features
! Behavioural features (RFM-style, from the months 6~9 View/Order records)
◦ View – viewRecent, viewCnt, viewLast1MCnt, viewLast2MCnt (recency of the last view, total views in months 6~9, views in the last month, views in the last two months)
◦ Order – orderRecent, orderCnt, orderLast1MCnt, orderLast2MCnt (recency of the last order, total orders in months 6~9, orders in the last month, orders in the last two months)
◦ avgDaySpan, avgViewCnt, lastViewDate, lastViewCnt
174
– Features(I)
! Member features
◦ gender, ageScore, cityScore (gender as-is; age and city encoded into scores)
◦ ageScore: 1~11
● EX: if (ages.equals("20…")) ageScore = 1
◦ cityScore: 1~24
● EX: if (livecity.equals(…)) cityScore = 24
! Missing Values (the \N marker)
● Gender: default 2 (unknown)
● Ages: default 35-39
● City: default …
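The missing-value handling can be sketched as a per-field default substitution (the defaults are the ones listed above; the marker in the raw data is the literal \N):

```scala
// Replace the "\N" missing-value marker with a per-field default value.
def fillMissing(raw: String, default: String): String =
  if (raw == "\\N") default else raw
```

fillMissing("\\N", "35-39") returns the Ages default, while real values pass through unchanged.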
175
– Features(II)
! Weather Open Data (Central Weather Bureau monthly data)
◦ http://www.cwb.gov.tw/V7/climate/monthlyData/mD.htm
◦ monthly data for months 6~10
◦ processed file: https://drive.google.com/file/d/0B-b4FvCO9SYoN2VBSVNjN3F3a0U/view?usp=sharing
! In total 35 Features per uid-storeid-cat_id1-cat_id2 key
176
– Features(III)
177
– LabeledPoint Data
Sort the values, then encode the top N ranks
EX: viewCnt (5 values encoded)

Score:  7          3          2          1          …
Value:  viewCnt=5  viewCnt=4  viewCnt=3  viewCnt=2  viewCnt=1
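A sketch of this rank-based encoding (the 7/3/2/1 scores follow the figures visible on the slide and are illustrative; values below the top ranks score 0):

```scala
// Scores for the top ranks; everything below the top 4 scores 0.
val rankScore = Map(0 -> 7.0, 1 -> 3.0, 2 -> 2.0, 3 -> 1.0).withDefaultValue(0.0)

// Sort the distinct values in descending order and assign each a rank-based score.
def encodeByRank(values: Seq[Double]): Map[Double, Double] =
  values.distinct.sorted(Ordering[Double].reverse).zipWithIndex
    .map { case (v, rank) => v -> rankScore(rank) }.toMap
```

The highest viewCnt maps to 7, the next to 3, and the long tail collapses toward 0.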
! Xgboost (Extreme Gradient Boosting)
◦ Input: LabeledPoint Data (Training Set)
● 35 Features
● Label (1/0; Label=1 if an order occurred, else 0)
◦ Parameters:
● max_depth: maximum Tree depth
● nround: number of boosting rounds
● objective: binary:logistic (binary classification)
◦ Implement:
178
– Machine Learning(I)
val param = List("objective" -> "binary:logistic", "max_depth" -> 6)
val model = XGBoost.train(trainSet, param, nround, 2, null, null)
! Xgboost
◦ Evaluate (with validating Set):
● val predictRes = model.predict(validateSet)
● compute the F_measure
◦ Parameter Tuning:
● grid-search max_depth=(5~10) and nround=(10~25), keeping the best combination
● result: max_depth=6, nround=10
179
– Machine Learning(II)
Precision = 0.16669166166766647
F1 measure = 0.15969926394341
Accuracy = 0.15065655700028824
Micro recall = 0.21370309951060
Micro precision = 0.3715258082813
Micro F1 measure = 0.271333885
! Performance Improvement ◦ retrain the model using only the top-N most important Features
180
– Machine Learning(III)
Run time: 90000ms -> 72000ms (local mode)
! Submit the JOB to the yarn resource manager so it runs on the Workers
◦ submit with spark-submit
spark-submit --class ehc.RecommandV4 --deploy-mode cluster --master yarn ehcFinalV4.jar
! Do not hard-code the master URL in new SparkContext when using spark-submit
new SparkContext(new SparkConf().setAppName("ehcFinal051").setMaster("local[4]"))
➔ remove the setMaster call and let spark-submit supply the master
182
Spark-submit Run Script Sample
###### Script: submit the Driver Program to Spark (yarn as the cluster Manager) ######
###### for linux-like system ######

# delete output on hdfs first
hadoop fs -rm -R -f /user/team007/data/output

# submit spark job
echo -e "processing spark job"
spark-submit --deploy-mode cluster --master yarn \
  --jars lib/jcommon-1.0.23.jar,lib/joda-time-2.2.jar \
  --class ehc.RecommandV4 ehcFinalV4.jar Y

# write to result_yyyyMMddHHmmss.txt
echo -e "write to outFile"
hadoop fs -cat /user/team007/data/output/part-* > 'result_'`date +%Y%m%d%H%M%S`'.txt'
! Feature ! Feature
◦
183
–
! Input Single Node
◦ Worker merge ◦ uid-storeid-cat_id1-cat_id2 Sort
! F-Measure ◦ used to evaluate the Model ◦ computed with Spark's MultilabelMetrics
! ◦
184
–
val scoreAndLabels: RDD[(Array[Double], Array[Double])] = …
val metrics = new MultilabelMetrics(scoreAndLabels)
println(s"F1 measure = ${metrics.f1Measure}")
!
! Spark MLlib
◦ Feature Engineering
! Spark MLlib
◦
185
186