SVMs with Slack

Nicholas Ruozzi
University of Texas at Dallas

Based roughly on the slides of David Sontag

Dual SVM

$$\max_{\lambda \ge 0} \;\; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \big(x^{(i)}\big)^T x^{(j)} + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$

• The dual formulation only depends on inner products between the data points

• Same thing is true if we use feature vectors instead


Dual SVM

$$\max_{\lambda \ge 0} \;\; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \,\Phi\big(x^{(i)}\big)^T \Phi\big(x^{(j)}\big) + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$

• The dual formulation only depends on inner products between the data points

• Same thing is true if we use feature vectors instead


The Kernel Trick

• For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large

• This is best illustrated by example

• Let $\phi(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_2 x_1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

• $\phi(x_1, x_2)^T \phi(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2$

Reduces to a dot product in the original space
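A quick numerical check of this identity (an illustrative sketch, not part of the original slides; the array values are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit 4-dimensional feature map from the slide: (x1*x2, x2*x1, x1^2, x2^2)."""
    x1, x2 = v
    return np.array([x1 * x2, x2 * x1, x1 ** 2, x2 ** 2])

x, z = np.array([3.0, -1.0]), np.array([2.0, 5.0])
print(phi(x) @ phi(z))   # inner product in feature space
print((x @ z) ** 2)      # (x^T z)^2 computed in the original 2-d space: same value
```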


The Kernel Trick

• The same idea can be applied for the feature vector $\phi$ of all polynomials of degree (exactly) $d$

• $\phi(x)^T \phi(z) = (x^T z)^d$

• More generally, a kernel is a function $k(x, z) = \phi(x)^T \phi(z)$ for some feature map $\phi$

• Rewrite the dual objective

$$\max_{\lambda \ge 0,\; \sum_i \lambda_i y_i = 0} \;\; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \, k\big(x^{(i)}, x^{(j)}\big) + \sum_i \lambda_i$$

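A minimal sketch of what this buys us computationally (illustrative code, not from the slides): the dual objective needs only the Gram matrix $K_{ij} = k(x^{(i)}, x^{(j)})$, never $\phi$ itself.

```python
import numpy as np

def dual_objective(lam, K, y):
    """-1/2 * sum_ij lam_i lam_j y_i y_j K_ij + sum_i lam_i."""
    yl = lam * y
    return -0.5 * yl @ K @ yl + lam.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                  # 10 points in R^3
y = np.where(rng.random(10) < 0.5, -1.0, 1.0)
K = (X @ X.T) ** 2                            # degree-2 polynomial kernel (x^T z)^2
lam = rng.uniform(size=10)                    # an arbitrary candidate lambda
print(dual_objective(lam, K, y))
```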

Examples of Kernels

• Polynomial kernel of degree exactly $d$

  • $k(x, z) = (x^T z)^d$

• General polynomial kernel of degree $d$ for some $c$

  • $k(x, z) = (x^T z + c)^d$

• Gaussian kernel for some $\sigma$

  • $k(x, z) = \exp\left(-\dfrac{\|x - z\|^2}{2\sigma^2}\right)$

  • The corresponding $\phi$ is infinite dimensional!

• So many more…

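The same three kernels written as plain functions (a hedged sketch; the parameter values below are arbitrary):

```python
import numpy as np

def poly_kernel(x, z, d):
    return (x @ z) ** d                                    # (x^T z)^d

def general_poly_kernel(x, z, d, c=1.0):
    return (x @ z + c) ** d                                # (x^T z + c)^d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z, 2), general_poly_kernel(x, z, 3), gaussian_kernel(x, z))
```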

Gaussian Kernels

• Consider the Gaussian kernel

$$\exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{(x - z)^T (x - z)}{2\sigma^2}\right) = \exp\left(\frac{-\|x\|^2 + 2 x^T z - \|z\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|z\|^2}{2\sigma^2}\right) \exp\left(\frac{x^T z}{\sigma^2}\right)$$

• Use the Taylor expansion for $\exp()$

$$\exp\left(\frac{x^T z}{\sigma^2}\right) = \sum_{n=0}^{\infty} \frac{(x^T z)^n}{\sigma^{2n}\, n!}$$


Polynomial kernels of every degree!
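A small numerical sanity check of this factorization (an illustrative sketch, not from the slides): truncating the series after a modest number of terms already matches the Gaussian kernel for small inputs.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

# Gaussian kernel computed directly
k_direct = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Same kernel via the factorization above: each Taylor term is a scaled
# polynomial kernel (x^T z)^n of degree n.
series = sum((x @ z) ** n / (sigma ** (2 * n) * factorial(n)) for n in range(20))
k_series = np.exp(-(x @ x) / (2 * sigma ** 2)) * np.exp(-(z @ z) / (2 * sigma ** 2)) * series

print(k_direct, k_series)   # the two values agree
```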

Kernels

• Bigger feature space increases the possibility of overfitting

• Large margin solutions may still generalize reasonably well

• Alternative: add "penalties" to the objective to disincentivize complicated solutions

$$\min_{w} \;\; \frac{1}{2}\|w\|^2 + c \cdot (\#\text{ of misclassifications})$$

• Not a quadratic program anymore (in fact, it's NP-hard)

• Similar problem to the Hamming loss: no notion of how badly the data is misclassified


SVMs with Slack

• Allow misclassification

• Penalize misclassification linearly (just like in the perceptron algorithm)

• Again, easier to work with than the Hamming loss

• Objective stays convex

• Will let us handle data that isn't linearly separable!


SVMs with Slack

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

Potentially allows some points to be misclassified / inside the margin


SVMs with Slack

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

The constant $c$ determines the degree to which slack is penalized


SVMs with Slack

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

• How does this objective change with $c$?

• As $c \to \infty$, requires a perfect classifier

• As $c \to 0$, allows arbitrary classifiers (i.e., ignores the data)

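A hedged sketch of this trade-off using scikit-learn (the toy dataset and the particular values are illustrative; scikit-learn calls the slack penalty C):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates many margin violations (lots of slack, wide margin);
    # large C pushes toward classifying every training point correctly.
    print(C, clf.score(X, y), clf.n_support_.sum())
```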

SVMs with Slack

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

• How should we pick $c$?

• Divide the data into three pieces: training, testing, and validation

• Use the validation set to tune the value of the hyperparameter $c$


Validation Set

• General learning strategy

• Build a classifier using the training data

• Select hyperparameters using validation data

• Evaluate the chosen model with the selected hyperparameters on the test data

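A hedged sketch of this strategy (the split proportions, dataset, and candidate values of $c$ are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Randomly split into training (60%), validation (20%), and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Build classifiers on the training data; pick c using the validation set only.
best_c, best_acc = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0, 100.0):
    acc = SVC(kernel="rbf", C=c).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_c, best_acc = c, acc

# Evaluate the chosen model once on the held-out test data.
print(best_c, SVC(kernel="rbf", C=best_c).fit(X_train, y_train).score(X_test, y_test))
```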

SVMs with Slack

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

• What is the optimal value of $\xi$ for fixed $w$ and $b$?

• If $y_i\big(w^T x^{(i)} + b\big) \ge 1$, then $\xi_i = 0$

• If $y_i\big(w^T x^{(i)} + b\big) < 1$, then $\xi_i = 1 - y_i\big(w^T x^{(i)} + b\big)$


SVMs with Slack

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

• We can formulate this slightly differently

• $\xi_i = \max\big(0,\; 1 - y_i(w^T x^{(i)} + b)\big)$

• Does this look familiar?

• Hinge loss provides an upper bound on the Hamming loss

Hinge Loss Formulation

• Obtain a new objective by substituting in for $\xi$

$$\min_{w, b} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \max\big(0,\; 1 - y_i(w^T x^{(i)} + b)\big)$$


Can minimize with gradient descent!
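A minimal subgradient-descent sketch for this objective (illustrative, not the slides' implementation; the step size, iteration count, and toy data are assumptions):

```python
import numpy as np

def svm_subgradient_descent(X, y, c=1.0, lr=0.01, iters=1000):
    """Minimize (1/2)||w||^2 + c * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1                     # points violating the margin
        # Subgradient: w from the norm term; -c * y_i x_i on violated points.
        grad_w = w - c * (y[viol] @ X[viol])
        grad_b = -c * np.sum(y[viol])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny usage example on a toy two-class problem.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = svm_subgradient_descent(X, y)
print(np.mean(np.sign(X @ w + b) == y))        # training accuracy
```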

Hinge Loss Formulation

• Obtain a new objective by substituting in for $\xi$

$$\min_{w, b} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \max\big(0,\; 1 - y_i(w^T x^{(i)} + b)\big)$$


The $\frac{1}{2}\|w\|^2$ term is a penalty to prevent overfitting; the sum is the hinge loss

Hinge Loss Formulation

• Obtain a new objective by substituting in for $\xi$

$$\min_{w, b} \;\; \frac{\lambda}{2}\|w\|^2 + c \sum_i \max\big(0,\; 1 - y_i(w^T x^{(i)} + b)\big)$$


The $\frac{\lambda}{2}\|w\|^2$ term is the regularizer; the sum is the hinge loss

$\lambda$ controls the amount of regularization

How should we pick $\lambda$?

Imbalanced Data

• If the data is imbalanced (i.e., more positive examples than negative examples), may want to evenly distribute the error between the two classes

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + \frac{c}{N_+} \sum_{i : y_i = 1} \xi_i + \frac{c}{N_-} \sum_{i : y_i = -1} \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$

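A hedged sketch of the same idea in scikit-learn (illustrative: class_weight="balanced" rescales the slack penalty inversely to class frequency, playing the role of $c/N_+$ and $c/N_-$):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.svm import SVC

# Imbalanced toy data: roughly 90% of examples in class 0, 10% in class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

unweighted = SVC(kernel="linear", C=1.0).fit(X, y)
balanced = SVC(kernel="linear", C=1.0, class_weight="balanced").fit(X, y)

# Recall on the rare class typically improves once the error is re-balanced.
print(recall_score(y, unweighted.predict(X)), recall_score(y, balanced.predict(X)))
```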

Dual of Slack Formulation

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + c \sum_i \xi_i$$

such that $y_i\big(w^T x^{(i)} + b\big) \ge 1 - \xi_i$, for all $i$

$\xi_i \ge 0$, for all $i$


Dual of Slack Formulation

$$L(w, b, \xi, \lambda, \mu) = \frac{1}{2} w^T w + c \sum_i \xi_i + \sum_i \lambda_i \Big(1 - \xi_i - y_i\big(w^T x^{(i)} + b\big)\Big) - \sum_i \mu_i \xi_i$$

Convex in $w$, $b$, $\xi$, so take derivatives to form the dual

$$\frac{\partial L}{\partial w_k} = w_k - \sum_i \lambda_i y_i x_k^{(i)} = 0$$

$$\frac{\partial L}{\partial b} = -\sum_i \lambda_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_k} = c - \lambda_k - \mu_k = 0$$

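A brief note on the substitution step (a sketch, not spelled out on the slide): since $\mu_k \ge 0$, the last condition caps each $\lambda_k$ at $c$, and plugging the first two conditions back into $L$ eliminates $w$, $b$, $\xi$, and $\mu$:

$$w = \sum_i \lambda_i y_i x^{(i)}, \qquad \sum_i \lambda_i y_i = 0, \qquad \mu_k = c - \lambda_k \ge 0 \;\Rightarrow\; 0 \le \lambda_k \le c$$

$$L = \sum_i \lambda_i - \frac{1}{2} \sum_i \sum_j \lambda_i \lambda_j y_i y_j \big(x^{(i)}\big)^T x^{(j)}$$

which is exactly the dual objective on the next slide.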

Dual of Slack Formulation

$$\max_{\lambda \ge 0} \;\; -\frac{1}{2}\sum_i \sum_j \lambda_i \lambda_j y_i y_j \big(x^{(i)}\big)^T x^{(j)} + \sum_i \lambda_i$$

such that

$$\sum_i \lambda_i y_i = 0$$

$$c \ge \lambda_i \ge 0, \quad \text{for all } i$$

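A hedged sketch of solving this small dual QP numerically with SciPy's SLSQP solver and recovering $w$ and $b$ (the toy data, tolerances, and variable names are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
c = 1.0

G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negate the dual objective so that minimizing it maximizes the dual.
    return 0.5 * lam @ G @ lam - lam.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, c)] * len(y),                                 # c >= lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],   # sum_i lambda_i y_i = 0
)
lam = res.x
w = (lam * y) @ X                                # w = sum_i lambda_i y_i x^(i)
sv = lam > 1e-6                                  # support vectors
on_margin = sv & (lam < c - 1e-6)                # those exactly on the margin
idx = on_margin if on_margin.any() else sv
b = np.mean(y[idx] - X[idx] @ w)
print(np.mean(np.sign(X @ w + b) == y))          # training accuracy
```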

Summary

• Gather data + labels
• Select feature vectors
• Randomly split into three groups:
  • Training set
  • Validation set
  • Test set
• Experimentation cycle:
  • Select a "good" hypothesis from the hypothesis space
  • Tune hyperparameters using the validation set
  • Compute accuracy on the test set (fraction of correctly classified instances)


Generalization

• We argued, intuitively, that SVMs generalize better than the perceptron algorithm

• How can we make this precise?

• Coming soon... but first...


Roadmap

• Where are we headed?

• Other types of hypothesis spaces for supervised learning

  • k-nearest neighbor

  • Decision trees

• Learning theory

  • Generalization and PAC bounds

  • VC dimension

• Bias/variance tradeoff
