SVMs with Slack
Nicholas Ruozzi, University of Texas at Dallas
Based roughly on the slides of David Sontag
Dual SVM
$$\max_{\lambda \geq 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, x^{(i)T} x^{(j)} + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$
• The dual formulation only depends on inner products between the data points
• Same thing is true if we use feature vectors instead
2
Dual SVM
$$\max_{\lambda \geq 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, \Phi(x^{(i)})^T \Phi(x^{(j)}) + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$
• The dual formulation only depends on inner products between the data points
• Same thing is true if we use feature vectors instead
3
The Kernel Trick
• For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large
• This is best illustrated by example
• Let $\Phi(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_2 x_1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

• $\Phi(x_1, x_2)^T \Phi(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2$
4
The Kernel Trick
• For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large
• This is best illustrated by example
• Let $\Phi(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_2 x_1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

• $\Phi(x_1, x_2)^T \Phi(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2$
Reduces to a dot product in the original space
5
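As a concrete check of the example above, here is a minimal NumPy sketch (the names `phi` and `k` are illustrative, not from the slides) verifying that the explicit feature map and the kernel give the same value:

```python
import numpy as np

# phi maps (x1, x2) to (x1*x2, x2*x1, x1^2, x2^2); its inner product with
# phi(z) should match k(x, z) = (x^T z)^2 computed directly in 2 dimensions.
def phi(x):
    return np.array([x[0] * x[1], x[1] * x[0], x[0] ** 2, x[1] ** 2])

def k(x, z):
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))  # 1.0
print(k(x, z))                 # 1.0 -- same value, no feature map needed
```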
The Kernel Trick
• The same idea can be applied for the feature vector $\Phi$ of all polynomials of degree (exactly) $d$
• $\Phi(x)^T \Phi(z) = (x^T z)^d$
• More generally, a kernel is a function $K(x, z) = \Phi(x)^T \Phi(z)$ for some feature map $\Phi$
• Rewrite the dual objective

$$\max_{\lambda \geq 0,\; \sum_i \lambda_i y_i = 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, K(x^{(i)}, x^{(j)}) + \sum_i \lambda_i$$
6
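A minimal NumPy sketch of evaluating this kernelized dual objective (the names `dual_objective`, `lam`, and `K` are illustrative; this only evaluates the objective for a given feasible choice of multipliers, it does not solve the maximization):

```python
import numpy as np

def dual_objective(lam, y, K):
    # -1/2 * sum_ij lam_i lam_j y_i y_j K_ij + sum_i lam_i
    yl = lam * y
    return -0.5 * yl @ K @ yl + lam.sum()

# Gram matrix for the degree-2 polynomial kernel K(x, z) = (x^T z)^2
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])
y = np.array([1.0, -1.0, 1.0])
K = (X @ X.T) ** 2

lam = np.array([0.1, 0.2, 0.1])  # feasible: lam >= 0 and sum_i lam_i y_i = 0
print(dual_objective(lam, y, K))
```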
Examples of Kernels
• Polynomial kernel of degree exactly $d$
  • $K(x, z) = (x^T z)^d$
• General polynomial kernel of degree $d$ for some $c$
  • $K(x, z) = (x^T z + c)^d$
• Gaussian kernel for some $\sigma$
  • $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
  • The corresponding $\Phi$ is infinite dimensional!
• So many more…
7
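Possible implementations of these kernels (a sketch; the function and parameter names are illustrative):

```python
import numpy as np

def poly_kernel_exact(x, z, d):
    """Polynomial kernel of degree exactly d: (x^T z)^d."""
    return np.dot(x, z) ** d

def poly_kernel(x, z, d, c):
    """General polynomial kernel of degree d: (x^T z + c)^d."""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```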
Gaussian Kernels
• Consider the Gaussian kernel

$$\exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{(x - z)^T (x - z)}{2\sigma^2}\right) = \exp\left(\frac{-\|x\|^2 + 2 x^T z - \|z\|^2}{2\sigma^2}\right)$$

$$= \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|z\|^2}{2\sigma^2}\right) \exp\left(\frac{x^T z}{\sigma^2}\right)$$

• Use the Taylor expansion for $\exp()$

$$\exp\left(\frac{x^T z}{\sigma^2}\right) = \sum_{n=0}^{\infty} \frac{(x^T z)^n}{\sigma^{2n}\, n!}$$
8
Gaussian Kernels
• Consider the Gaussian kernel

$$\exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{(x - z)^T (x - z)}{2\sigma^2}\right) = \exp\left(\frac{-\|x\|^2 + 2 x^T z - \|z\|^2}{2\sigma^2}\right)$$

$$= \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|z\|^2}{2\sigma^2}\right) \exp\left(\frac{x^T z}{\sigma^2}\right)$$

• Use the Taylor expansion for $\exp()$

$$\exp\left(\frac{x^T z}{\sigma^2}\right) = \sum_{n=0}^{\infty} \frac{(x^T z)^n}{\sigma^{2n}\, n!}$$
9
Polynomial kernels of every degree!
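A quick numerical sanity check of this expansion (a sketch with illustrative names): each term is a scaled polynomial kernel $(x^T z)^n$, and truncating the series after $N$ terms approaches the exact Gaussian kernel value as $N$ grows.

```python
import numpy as np
from math import factorial

def gaussian_kernel(x, z, sigma):
    return np.exp(-np.dot(x - z, x - z) / (2.0 * sigma ** 2))

def truncated_expansion(x, z, sigma, N):
    # exp(-||x||^2/2s^2) exp(-||z||^2/2s^2) * sum_{n<N} (x^T z)^n / (s^(2n) n!)
    scale = np.exp(-np.dot(x, x) / (2 * sigma ** 2)) * np.exp(-np.dot(z, z) / (2 * sigma ** 2))
    series = sum(np.dot(x, z) ** n / (sigma ** (2 * n) * factorial(n)) for n in range(N))
    return scale * series

x, z, sigma = np.array([0.5, 1.0]), np.array([1.0, 0.5]), 1.0
print(gaussian_kernel(x, z, sigma))                # exact value
for N in (1, 2, 5, 10):
    print(N, truncated_expansion(x, z, sigma, N))  # converges as N grows
```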
Kernels
• Bigger feature space increases the possibility of overfitting
• Large margin solutions may still generalize reasonably well
• Alternative: add "penalties" to the objective to disincentivize complicated solutions
$$\min_{w} \; \frac{1}{2}\|w\|^2 + c \cdot (\#\text{ of misclassifications})$$
• Not a quadratic program anymore (in fact, it's NP-hard)
• Same problem as the Hamming loss: there is no notion of how badly the data is misclassified
10
SVMs with Slack
• Allow misclassification
• Penalize misclassification linearly (just like in the perceptron algorithm)
• Again, easier to work with than the Hamming loss
• Objective stays convex
• Will let us handle data that isn't linearly separable!
11
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
12
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$

Potentially allows some points to be misclassified/inside the margin
13
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
The constant $c$ determines the degree to which slack is penalized
14
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• How does this objective change with $c$?
15
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• How does this objective change with $c$?
  • As $c \to \infty$, requires a perfect classifier
  • As $c \to 0$, allows arbitrary classifiers (i.e., ignores the data)
16
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• How should we pick $c$?
17
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• How should we pick $c$?
  • Divide the data into three pieces: training, testing, and validation
  • Use the validation set to tune the value of the hyperparameter $c$
18
Validation Set
• General learning strategy
  • Build a classifier using the training data
  • Select hyperparameters using the validation data
  • Evaluate the chosen model with the selected hyperparameters on the test data
19
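A sketch of this strategy for tuning $c$ (illustrative only; it uses scikit-learn's `LinearSVC`, whose `C` parameter plays the role of $c$ in the slack formulation, and synthetic data standing in for a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a labeled dataset (X, y with labels in {-1, +1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=300))

# Randomly split into training, validation, and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune the hyperparameter on the validation set...
best_c, best_acc = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0, 100.0]:
    acc = LinearSVC(C=c).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_c, best_acc = c, acc

# ...and report the accuracy of the chosen model on the test set.
final = LinearSVC(C=best_c).fit(X_train, y_train)
print(best_c, final.score(X_test, y_test))
```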
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• What is the optimal value of $\xi$ for fixed $w$ and $b$?
20
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• What is the optimal value of $\xi$ for fixed $w$ and $b$?
  • If $y_i \left( w^T x^{(i)} + b \right) \geq 1$, then $\xi_i = 0$
  • If $y_i \left( w^T x^{(i)} + b \right) < 1$, then $\xi_i = 1 - y_i \left( w^T x^{(i)} + b \right)$
21
SVMs with Slack
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
• We can formulate this slightly differently
  • $\xi_i = \max\left(0,\, 1 - y_i \left( w^T x^{(i)} + b \right)\right)$
• Does this look familiar?
  • Hinge loss provides an upper bound on the Hamming loss

22
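In code, the optimal slack values are exactly the hinge losses of the individual points (a small NumPy sketch with illustrative names):

```python
import numpy as np

def optimal_slack(w, b, X, y):
    # xi_i = max(0, 1 - y_i (w^T x_i + b)): zero for points beyond the margin,
    # linear in the violation otherwise.
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

X = np.array([[2.0, 0.0], [0.2, 0.1], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0
print(optimal_slack(w, b, X, y))  # [0.  0.8 0. ]
```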
Hinge Loss Formulation
• Obtain a new objective by substituting in for $\xi$

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + c \sum_i \max\left(0,\, 1 - y_i \left( w^T x^{(i)} + b \right)\right)$$
23
Can minimize with gradient descent!
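A minimal (sub)gradient descent sketch for this objective (illustrative: the step size, iteration count, and toy data are assumptions, and the max term is handled with a subgradient since it is not differentiable at the hinge):

```python
import numpy as np

def svm_subgradient_descent(X, y, c, lr=0.01, iters=1000):
    """Minimize 1/2 ||w||^2 + c * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with nonzero hinge loss
        grad_w = w - c * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -c * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny example: two roughly separable clusters
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.array([-1.0] * 20 + [1.0] * 20)
w, b = svm_subgradient_descent(X, y, c=1.0)
print(np.mean(np.sign(X @ w + b) == y))  # training accuracy
```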
Hinge Loss Formulation
• Obtain a new objective by substituting in for $\xi$

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + c \sum_i \max\left(0,\, 1 - y_i \left( w^T x^{(i)} + b \right)\right)$$
24
Penalty to prevent overfitting: $\frac{1}{2}\|w\|^2$
Hinge loss: $\sum_i \max\left(0,\, 1 - y_i \left( w^T x^{(i)} + b \right)\right)$
Hinge Loss Formulation
• Obtain a new objective by substituting in for $\xi$

$$\min_{w, b} \; \frac{\lambda}{2}\|w\|^2 + c \sum_i \max\left(0,\, 1 - y_i \left( w^T x^{(i)} + b \right)\right)$$
25
Regularizer: $\frac{\lambda}{2}\|w\|^2$
Hinge loss: $\sum_i \max\left(0,\, 1 - y_i \left( w^T x^{(i)} + b \right)\right)$

$\lambda$ controls the amount of regularization

How should we pick $\lambda$?
Imbalanced Data
• If the data is imbalanced (i.e., more positive examples than negative examples), we may want to distribute the error evenly between the two classes
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + \frac{c}{N_+} \sum_{i : y_i = 1} \xi_i + \frac{c}{N_-} \sum_{i : y_i = -1} \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
26
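A sketch of the per-class weighting (illustrative names): each slack term gets weight $c/N_+$ or $c/N_-$ depending on its label, so each class contributes equally to the total penalty. scikit-learn's `class_weight='balanced'` option implements a similar reweighting.

```python
import numpy as np

def per_example_penalties(y, c):
    # Weight c / N_+ for positive examples and c / N_- for negative examples,
    # so the total penalty available to each class is the same.
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    return np.where(y == 1, c / n_pos, c / n_neg)

y = np.array([1, 1, 1, 1, 1, 1, -1, -1])   # imbalanced: 6 positive, 2 negative
print(per_example_penalties(y, c=1.0))     # positives get 1/6, negatives get 1/2
```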
Dual of Slack Formulation
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that

$$y_i \left( w^T x^{(i)} + b \right) \geq 1 - \xi_i, \text{ for all } i$$
$$\xi_i \geq 0, \text{ for all } i$$
27
Dual of Slack Formulation
$$L(w, b, \xi, \lambda, \mu) = \frac{1}{2} w^T w + c \sum_i \xi_i + \sum_i \lambda_i \left(1 - \xi_i - y_i \left( w^T x^{(i)} + b \right)\right) + \sum_i -\mu_i \xi_i$$

Convex in $w, b, \xi$, so take derivatives to form the dual

$$\frac{\partial L}{\partial w_k} = w_k + \sum_i -\lambda_i y_i x_k^{(i)} = 0$$

$$\frac{\partial L}{\partial b} = \sum_i -\lambda_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = c - \lambda_i - \mu_i = 0$$
28
Dual of Slack Formulation
$$\max_{\lambda \geq 0} \; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, x^{(i)T} x^{(j)} + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$
$$c \geq \lambda_i \geq 0, \text{ for all } i$$
29
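This box-constrained quadratic program is what standard SVM solvers work with. As a sketch (assuming scikit-learn is acceptable), `SVC`'s `C` parameter is the constant $c$ here, and the magnitudes of the learned dual coefficients $y_i \lambda_i$ never exceed it:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(1, 1, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(C=1.0, kernel="linear").fit(X, y)
# dual_coef_ holds y_i * lambda_i for the support vectors; the box constraint
# 0 <= lambda_i <= C bounds their magnitude by C.
print(np.abs(clf.dual_coef_).max() <= 1.0)  # True
```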
Summary
• Gather data + labels
• Select feature vectors
• Randomly split into three groups
  • Training set
  • Validation set
  • Test set
• Experimentation cycle
  • Select a "good" hypothesis from the hypothesis space
  • Tune hyperparameters using the validation set
  • Compute accuracy on the test set (fraction of correctly classified instances)
30
Generalization
• We argued, intuitively, that SVMs generalize better than the perceptron algorithm
• How can we make this precise?
• Coming soon... but first...
31