
Page 1: Support Vector Machines

Support Vector Machines

Konstantin Tretyakov ([email protected])

MTAT.03.227 Machine Learning

Page 2: Support Vector Machines

So far…

Supervised machine learning

Linear models

Least squares regression

Fisher’s discriminant, Perceptron, Logistic model

Non-linear models

Neural networks, Decision trees, Association rules

Unsupervised machine learning

Clustering/EM, PCA

Generic scaffolding

Probabilistic modeling, ML/MAP estimation

Performance evaluation, Statistical learning theory

Linear algebra, Optimization methods

May 8, 2012

Page 3: Support Vector Machines

Coming up next

Supervised machine learning

Linear models

Least squares regression, SVM

Fisher’s discriminant, Perceptron, Logistic regression, SVM

Non-linear models

Neural networks, Decision trees, Association rules

SVM, Kernel-XXX

Unsupervised machine learning

Clustering/EM, PCA, Kernel-XXX

Generic scaffolding

Probabilistic modeling, ML/MAP estimation

Performance evaluation, Statistical learning theory

Linear algebra, Optimization methods

Kernels

Page 4: Support Vector Machines

First things first

SVM (y ∈ {−1, 1}):

library('e1071')

m = svm(X, y, kernel='linear')

predict(m, newX)

Page 5: Support Vector Machines

Quiz


This line is called …

This vector is …

Those lines are …

f(x) = ?

x₁ = ?  y₁ = ?

Functional margin of 𝒙𝟏?

Geometric margin of 𝒙𝟏?

Distance to origin?

Page 6: Support Vector Machines

Quiz


Separating hyperplane

Normal 𝒘

Isolines (level lines)

f(x) = wᵀx + b

x₁ = (2, 6);  y₁ = −1

y₁ · f(x₁) ≈ 2

y₁ · f(x₁)/‖w‖ ≈ 3√2

d = |b|/‖w‖

Page 7: Support Vector Machines

Quiz

Suppose we scale 𝒘 and 𝑏 by some constant.

Will it:

Affect the separating hyperplane? How?

Affect the functional margins? How?

Affect the geometric margins? How?

Page 8: Support Vector Machines

Quiz

Example: 𝒘 → 2𝒘, 𝑏 = 0

Page 9: Support Vector Machines

Quiz

Suppose we scale 𝒘 and b by some constant.

Will it:

Affect the separating hyperplane? How?

No: wᵀx + b = 0 ⇔ 2wᵀx + 2b = 0

Affect the functional margins? How?

Yes: (2wᵀx + 2b)·y = 2 · (wᵀx + b)·y

Affect the geometric margins? How?

No: (2wᵀx + 2b)/‖2w‖ = (wᵀx + b)/‖w‖
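These identities can be sanity-checked numerically; the sketch below uses made-up values for w, b and the point (they are assumptions, not from the slides):

```python
import math

# Check: scaling (w, b) by a constant c scales functional margins by c
# but leaves the geometric margins unchanged.

def f(w, b, x):
    """Linear decision function f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, 1.0], -2.0   # assumed example classifier
x, y = [3.0, 4.0], 1      # assumed example point with label
c = 2.0                   # scaling constant

func_orig = y * f(w, b, x)
func_scaled = y * f([c * wi for wi in w], c * b, x)

geom_orig = func_orig / math.hypot(*w)
geom_scaled = func_scaled / math.hypot(*[c * wi for wi in w])

print(func_scaled == c * func_orig)           # True: functional margin scales by c
print(math.isclose(geom_orig, geom_scaled))   # True: geometric margin invariant
```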

Page 10: Support Vector Machines

Which classifier is best?

Page 11: Support Vector Machines

Maximal margin classifier

Page 12: Support Vector Machines

Why maximal margin?

Well-defined, single stable solution

Noise-tolerant

Small parameterization

(Fairly) efficient algorithms exist for finding it

Page 13: Support Vector Machines

Maximal margin: Separable case

f(x) = 1

f(x) = −1

Page 14: Support Vector Machines

Maximal margin: Separable case

f(x) = 1

f(x) = −1

∀i: f(xᵢ)·yᵢ ≥ 1

Page 15: Support Vector Machines

Maximal margin: Separable case

f(x) = 1

f(x) = −1

The (geometric) distance to the isoline f(x) = 1 is:

Page 16: Support Vector Machines

Maximal margin: Separable case

f(x) = 1

f(x) = −1

The (geometric) distance to the isoline f(x) = 1 is:

d = f(x)/‖w‖ = 1/‖w‖
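The formula d = 1/‖w‖ can be verified numerically; the values of w and b below are illustrative assumptions:

```python
import math

w, b = [3.0, 4.0], 1.0    # assumed example, chosen so that ||w|| = 5
norm = math.hypot(*w)

# A point on the hyperplane f(x) = 0, and the nearest point on the
# isoline f(x) = 1, reached by stepping 1/||w|| along the unit normal:
x0 = [-b * wi / norm**2 for wi in w]
x1 = [x0i + wi / norm**2 for x0i, wi in zip(x0, w)]

f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
print(f(x0), f(x1))                  # ~0 on the plane, ~1 on the isoline
print(math.dist(x0, x1), 1 / norm)   # both ~0.2: the distance is 1/||w||
```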

Page 17: Support Vector Machines

Maximal margin: Separable case

Among all linear classifiers (𝒘, 𝑏)

… which keep all points at functional margin of

𝟏 or more,

… we shall look for the one which has the largest

distance 𝒅 to the corresponding isolines, i.e. the

largest geometric margin.

As d = 1/‖w‖, this is equivalent to finding the classifier with minimal ‖w‖…

…which is equivalent to finding the classifier with minimal ‖w‖².

Page 18: Support Vector Machines

Page 19: Support Vector Machines

Page 20: Support Vector Machines

Page 21: Support Vector Machines

Page 22: Support Vector Machines

Compare

“Generic” linear classification (separable case):

Find (𝒘, b), such that all points are classified correctly

i.e. f(xᵢ)·yᵢ > 0

Maximal margin classification (separable case):

Find (w, b), such that all points are classified correctly

with a fixed functional margin,

i.e. f(xᵢ)·yᵢ ≥ 1,

and ‖w‖² is minimal.

Page 23: Support Vector Machines

Remember

May 8, 2012

SVM optimization problem

(separable case):

min_{w,b} ½‖w‖²

so that

(wᵀxᵢ + b)·yᵢ ≥ 1 for all i

Page 24: Support Vector Machines

General case (“soft margin”)

The same, but we also penalize all margin

violations.

May 8, 2012

SVM optimization problem:

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ

where

ξᵢ = [1 − f(xᵢ)·yᵢ]₊

Page 25: Support Vector Machines

General case (“soft margin”)

The same, but we also penalize all margin

violations.

May 8, 2012

SVM optimization problem:

min_{w,b} ½‖w‖² + C·Σᵢ [1 − f(xᵢ)·yᵢ]₊

ξᵢ = [1 − f(xᵢ)·yᵢ]₊

Page 26: Support Vector Machines

General case (“soft margin”)

The same, but we also penalize all margin

violations.

May 8, 2012

SVM optimization problem:

min_{w,b} ½‖w‖² + C·Σᵢ [1 − mᵢ]₊

where mᵢ = f(xᵢ)·yᵢ, so ξᵢ = [1 − mᵢ]₊

Page 27: Support Vector Machines

General case (“soft margin”)

The same, but we also penalize all margin

violations.

May 8, 2012

SVM optimization problem:

min_{w,b} ½‖w‖² + C·Σᵢ hinge(mᵢ)

where

hinge(mᵢ) = [1 − mᵢ]₊

ξᵢ = [1 − f(xᵢ)·yᵢ]₊

Page 28: Support Vector Machines

Hinge loss: hinge(mᵢ) = [1 − mᵢ]₊
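In code the hinge loss is a one-liner (a minimal sketch):

```python
def hinge(m):
    """Hinge loss (1 - m)_+ = max(0, 1 - m), where m = f(x)*y is the margin."""
    return max(0.0, 1.0 - m)

# Zero loss for margins >= 1, linearly growing penalty for violations:
print([hinge(m) for m in (2.0, 1.0, 0.5, 0.0, -1.0)])  # [0.0, 0.0, 0.5, 1.0, 2.0]
```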

Page 29: Support Vector Machines

Classification loss functions

“Generic”

classification:

min_{w,b} Σᵢ [mᵢ < 0]

Page 30: Support Vector Machines

Classification loss functions

Perceptron:

Page 31: Support Vector Machines

Classification loss functions

Perceptron:

min_{w,b} Σᵢ (−mᵢ)₊

Page 32: Support Vector Machines

Classification loss functions

Least squares

classification*:

min_{w,b} Σᵢ (mᵢ − 1)²

Page 33: Support Vector Machines

Classification loss functions

Boosting:

min_{w,b} Σᵢ exp(−mᵢ)

Page 34: Support Vector Machines

Classification loss functions

Logistic regression:

min_{w,b} Σᵢ log(1 + e^(−mᵢ))

Page 35: Support Vector Machines

Classification loss functions

Regularized logistic

regression:

min_{w,b} Σᵢ log(1 + e^(−mᵢ)) + λ · ½‖w‖²

Page 36: Support Vector Machines

Classification loss functions

SVM:

min_{w,b} Σᵢ [1 − mᵢ]₊ + (1/2C) · ‖w‖²

Page 37: Support Vector Machines

Classification loss functions

L2-SVM:

min_{w,b} Σᵢ [1 − mᵢ]₊² + (1/2C) · ‖w‖²

Page 38: Support Vector Machines

Classification loss functions

L1-regularized L2-SVM:

min_{w,b} Σᵢ [1 − mᵢ]₊² + (1/2C) · ‖w‖₁

… etc.

Page 39: Support Vector Machines

In general

min_{w,b} Σᵢ φ(mᵢ) + λ · Ω(w)

(first term: model fit; second term: model complexity)
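All the losses from the preceding slides are instances of this template; here is a sketch (the margins and weights are made-up values) evaluating a few choices of φ side by side:

```python
import math

# phi(m) for several classifiers, as functions of the margin m = f(x)*y:
losses = {
    "zero-one":    lambda m: 1.0 if m < 0 else 0.0,
    "perceptron":  lambda m: max(0.0, -m),
    "hinge":       lambda m: max(0.0, 1.0 - m),
    "squared":     lambda m: (m - 1.0) ** 2,
    "exponential": lambda m: math.exp(-m),
    "logistic":    lambda m: math.log(1.0 + math.exp(-m)),
}

def objective(phi, margins, w, lam=0.1):
    """Model fit + model complexity: sum_i phi(m_i) + lam * ||w||^2 / 2."""
    return sum(phi(m) for m in margins) + lam * sum(wi * wi for wi in w) / 2

margins = [2.0, 0.5, -1.0]   # assumed example margins
w = [1.0, -2.0]              # assumed example weight vector
for name, phi in losses.items():
    print(f"{name:12s} {objective(phi, margins, w):.3f}")
```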

Page 40: Support Vector Machines

Compare to MAP estimation

max_Model Σᵢ log P(xᵢ | Model) + log P(Model)

(first term: likelihood; second term: model prior)

Page 41: Support Vector Machines

Compare to MAP estimation

max_Model log P(Data | Model) + log P(Model)

(first term: likelihood; second term: model prior)
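The correspondence becomes exact for a Gaussian prior on w; this is a standard derivation, not from the slides: with P(w) ∝ exp(−λ‖w‖²/2),

```latex
-\log P(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \mathrm{const},
\qquad
\max_{\mathbf{w}}\; \log P(\text{Data}\mid\mathbf{w}) + \log P(\mathbf{w})
\;\Longleftrightarrow\;
\min_{\mathbf{w}}\; \sum_i \phi(m_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2,
```

where φ(mᵢ) = −log P(xᵢ | w). Regularized loss minimization is thus MAP estimation in disguise.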

Page 42: Support Vector Machines

Solving the SVM

min_{w,b} ½‖w‖² + C·Σᵢ [1 − f(xᵢ)·yᵢ]₊

Page 43: Support Vector Machines

Solving the SVM

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ

such that

f(xᵢ)·yᵢ ≥ 1 − ξᵢ

ξᵢ ≥ 0

Page 44: Support Vector Machines

Solving the SVM

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ

such that

f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0

ξᵢ ≥ 0

Page 45: Support Vector Machines

Solving the SVM

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ

such that

f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0

ξᵢ ≥ 0

Quadratic function with linear constraints!

Page 46: Support Vector Machines

Solving the SVM

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ

such that

f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0

ξᵢ ≥ 0

Quadratic function with linear constraints!

Quadratic programming

Minimize

f(x) = ½ xᵀQx + cᵀx

subject to:

𝑨𝒙 ≥ 𝒃

𝑪𝒙 = 𝒅

Page 47: Support Vector Machines

Solving the SVM

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ

such that

f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0

ξᵢ ≥ 0

Quadratic function with linear constraints!

Quadratic programming

Minimize

f(x) = ½ xᵀQx + cᵀx

subject to:

𝑨𝒙 ≥ 𝒃

𝑪𝒙 = 𝒅

> library(quadprog)

> solve.QP(Q, -c, t(A), b, meq)  # solve.QP expects t(Amat) %*% x >= bvec; meq = number of equality constraints
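To make the mapping onto a QP solver concrete, here is a sketch in Python (the helper name, the variable ordering z = (w, b, ξ) and the toy data are my own assumptions) that assembles Q, c, A, b for the soft-margin primal. Note that Q is only positive semi-definite (the b and ξ coordinates carry no curvature), so strictly-convex solvers such as quadprog need a tiny ridge added to its diagonal in practice.

```python
def build_svm_qp(X, y, C):
    """Soft-margin SVM primal as a QP over z = (w_1..w_d, b, xi_1..xi_n):
        minimize 1/2 z^T Q z + c^T z   subject to   A z >= b_vec."""
    n, d = len(X), len(X[0])
    size = d + 1 + n

    # 1/2 ||w||^2 + C * sum(xi): identity block for w, linear C terms for xi.
    Q = [[1.0 if (i == j and i < d) else 0.0 for j in range(size)] for i in range(size)]
    c = [0.0] * (d + 1) + [float(C)] * n

    A, b_vec = [], []
    for i in range(n):
        # Margin constraint: y_i * (w^T x_i + b) + xi_i >= 1
        row = [y[i] * X[i][j] for j in range(d)] + [float(y[i])] + [0.0] * n
        row[d + 1 + i] = 1.0
        A.append(row)
        b_vec.append(1.0)
    for i in range(n):
        # Slack positivity: xi_i >= 0
        row = [0.0] * size
        row[d + 1 + i] = 1.0
        A.append(row)
        b_vec.append(0.0)
    return Q, c, A, b_vec

Q, c, A, b_vec = build_svm_qp([[0.0, 1.0], [2.0, 0.0]], [1, -1], C=1.0)
print(len(Q), len(c), len(A), len(b_vec))   # 5 5 4 4
```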

Page 48: Support Vector Machines

Solving the SVM: Dual

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ  such that  f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0,  ξᵢ ≥ 0

Is equivalent to:

min_{w,b} max_{α≥0, β≥0} ½‖w‖² + C·Σᵢ ξᵢ − Σᵢ αᵢ·(f(xᵢ)·yᵢ − (1 − ξᵢ)) − Σᵢ βᵢ·ξᵢ

Page 49: Support Vector Machines

Solving the SVM: Dual

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ  such that  f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0,  ξᵢ ≥ 0

Is equivalent to:

min_{w,b} max_{α≥0, β≥0} ½‖w‖² + C·Σᵢ ξᵢ − Σᵢ αᵢ·(f(xᵢ)·yᵢ − (1 − ξᵢ)) − Σᵢ βᵢ·ξᵢ

Page 50: Support Vector Machines

Solving the SVM: Dual

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ  such that  f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0,  ξᵢ ≥ 0

Is equivalent to:

min_{w,b} max_{α≥0, β≥0} ½‖w‖² + Σᵢ ξᵢ·(C − αᵢ − βᵢ) − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

Page 51: Support Vector Machines

Solving the SVM: Dual

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ  such that  f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0,  ξᵢ ≥ 0

Is equivalent to:

min_{w,b} max_{α≥0, β≥0} ½‖w‖² + Σᵢ ξᵢ·(C − αᵢ − βᵢ) − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

C − αᵢ − βᵢ = 0

Page 52: Support Vector Machines

Solving the SVM: Dual

min_{w,b} ½‖w‖² + C·Σᵢ ξᵢ  such that  f(xᵢ)·yᵢ − (1 − ξᵢ) ≥ 0,  ξᵢ ≥ 0

Is equivalent to:

min_{w,b} max_{α≥0, β≥0} ½‖w‖² + Σᵢ ξᵢ·(C − αᵢ − βᵢ) − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Page 53: Support Vector Machines

Solving the SVM: Dual

min_{w,b} max_α ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Page 54: Support Vector Machines

Solving the SVM: Dual

min_{w,b} max_α ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Sparsity: αᵢ can be nonzero only for those points which have f(xᵢ)·yᵢ ≤ 1 (the support vectors).

Page 55: Support Vector Machines

Solving the SVM: Dual

min_{w,b} max_α ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Now swap the min and the max (possible here because the problem is convex, so strong duality holds).

Page 56: Support Vector Machines

Solving the SVM: Dual

max_α min_{w,b} ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Next solve the inner (unconstrained) min as usual.

Page 57: Support Vector Machines

Solving the SVM: Dual

max_α min_{w,b} ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Next solve the inner (unconstrained) min as usual:

∇_w = w − Σᵢ αᵢ·yᵢ·xᵢ = 0

∂/∂b = −Σᵢ αᵢ·yᵢ = 0

Page 58: Support Vector Machines

Solving the SVM: Dual

max_α min_{w,b} ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Express w and substitute:

w = Σᵢ αᵢ·yᵢ·xᵢ

Σᵢ αᵢ·yᵢ = 0

Page 59: Support Vector Machines

Solving the SVM: Dual

max_α min_{w,b} ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1)

0 ≤ αᵢ ≤ C

Express w and substitute:

w = Σᵢ αᵢ·yᵢ·xᵢ

Σᵢ αᵢ·yᵢ = 0

Dual representation

Page 60: Support Vector Machines

Solving the SVM: Dual

max_α min_{w,b} ½‖w‖² − Σᵢ αᵢ·(f(xᵢ)·yᵢ − 1),  0 ≤ αᵢ ≤ C

Express w and substitute:

max_α Σᵢ αᵢ − ½ Σᵢⱼ αᵢ·αⱼ·yᵢ·yⱼ·xᵢᵀxⱼ

0 ≤ αᵢ ≤ C

Σᵢ αᵢ·yᵢ = 0

Page 61: Support Vector Machines

Solving the SVM: Dual

max_α Σᵢ αᵢ − ½ Σᵢⱼ αᵢ·αⱼ·yᵢ·yⱼ·xᵢᵀxⱼ

0 ≤ αᵢ ≤ C

Σᵢ αᵢ·yᵢ = 0

Page 62: Support Vector Machines

Solving the SVM: Dual

max_α 1ᵀα − ½ αᵀ(K ∘ Y)α

0 ≤ α ≤ C,  yᵀα = 0

Kᵢⱼ = xᵢᵀxⱼ,  Yᵢⱼ = yᵢ·yⱼ

Page 63: Support Vector Machines

Solving the SVM: Dual

min_α ½ αᵀ(K ∘ Y)α − 1ᵀα

α ≥ 0

−α ≥ −C

yᵀα = 0

Then find b from the condition:

f(xᵢ)·yᵢ = 1 if 0 < αᵢ < C
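Recovering the primal classifier from a dual solution can be sketched as follows (the data and the value of α are made up for illustration, not produced by a real solver):

```python
# w = sum_i alpha_i y_i x_i; b from any margin support vector (0 < alpha_i < C),
# for which f(x_i) y_i = 1, i.e. b = y_i - w^T x_i.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # toy training points (assumed)
y = [1, 1, -1]
alpha = [0.0, 0.5, 0.5]                    # assumed dual solution
C = 1.0

w = [sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X)) for j in range(2)]
sv = next(i for i, a in enumerate(alpha) if 0 < a < C)
b = y[sv] - sum(wj * xj for wj, xj in zip(w, X[sv]))

def f(x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

print(w, b, f(X[sv]) * y[sv])   # the margin support vector has functional margin 1
```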

Page 64: Support Vector Machines


Support vectors

Page 65: Support Vector Machines

(Figure: the α values at the individual training points — most points have αᵢ = 0; the support vectors carry αᵢ = 0.5, 1, or C.)

Support vectors

Σᵢ αᵢ·yᵢ = 0

0 ≤ αᵢ ≤ C

Page 66: Support Vector Machines

Sparsity

The dual solution is often very sparse, which allows the optimization to be performed efficiently (the “working set” approach).

Page 67: Support Vector Machines

Kernels

f(x) = wᵀx + b

w = Σᵢ αᵢ·yᵢ·xᵢ

f(x) = Σᵢ αᵢ·yᵢ·xᵢᵀx + b

f(x) = Σᵢ αᵢ·yᵢ·K(xᵢ, x) + b

Page 68: Support Vector Machines

Kernels

f(x) = wᵀx + b

w = Σᵢ αᵢ·yᵢ·xᵢ

f(x) = Σᵢ αᵢ·yᵢ·xᵢᵀx + b

f(x) = Σᵢ αᵢ·yᵢ·K(xᵢ, x) + b


Kernel function

Page 69: Support Vector Machines

f(x) = wᵀx + b

w = Σᵢ αᵢ·yᵢ·xᵢ

f(x) = Σᵢ αᵢ·yᵢ·xᵢᵀx + b

f(x) = Σᵢ αᵢ·yᵢ·K(xᵢ, x) + b

Kernels

f(x) = w₁·x + w₂·x² + b

f(x) = Σᵢ αᵢ·yᵢ·exp(−‖xᵢ − x‖²) + b
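The kernelized decision function transcribes directly into code; this sketch (toy data and coefficients are assumptions, and the Gaussian kernel width is fixed to 1 as on the slide) shows that only the kernel changes between the linear and the non-linear case:

```python
import math

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def rbf_kernel(u, v):
    return math.exp(-sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def decision(x, X, y, alpha, b, K):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b"""
    return sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

X = [[0.0], [2.0]]       # toy training points (assumed)
y = [1, -1]
alpha = [1.0, 1.0]       # assumed dual coefficients
b = 0.0

print(decision([0.0], X, y, alpha, b, linear_kernel))   # 0.0
print(decision([0.0], X, y, alpha, b, rbf_kernel))      # 1 - e^(-4), about 0.98
```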

Page 70: Support Vector Machines

Quiz

SVM is a __________ linear classifier.

Margin maximization can be achieved via

minimization of ______________.

SVM uses _____ loss and _______

regularization.

Besides hinge loss I also know ____ loss and

___ loss.

SVM in both primal and dual form is solved

using ________ programming.

Page 71: Support Vector Machines

Quiz

In primal formulation we solve for parameter

vector ___. In dual formulation we solve for

___ instead.

_____ form of SVM is typically sparse.

Support vectors are those training points for

which _______.

The relation between primal and dual variables

is: ___ = Σᵢ ______.

A Kernel is a generalization of _____ product.

Page 72: Support Vector Machines