
Page 1

Linear Classifiers / SVM

Soongsil University, Dept. of Industrial and Information Systems Engineering

Intelligence Systems Lab.

Page 2

sample

Linear Classifiers

Page 3

Feature

Linear Classifiers

Page 4

Feature

Linear Classifiers

Page 5

Training Set

Linear Classifiers

Page 6

How to Classify Them Using a Computer?

Linear Classifiers

Page 7

How to Classify Them Using a Computer?

Linear Classifiers

WᵀX = Σ_{i=1}^{n} wi xi   (or, equivalently, XᵀW = Σ_{i=1}^{n} xi wi)
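Below is a minimal Python sketch of this linear decision rule (added for illustration, not part of the original slides): a point is assigned to one class or the other by the sign of WᵀX + b. The weight values, bias, and test point are made-up examples.

import numpy as np

def linear_classify(w, b, x):
    # assign +1 or -1 according to the sign of w.x + b
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([0.4, -1.2])          # hypothetical weights
b = 0.5                            # hypothetical bias
print(linear_classify(w, b, np.array([2.0, 1.5])))   # -> -1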

Page 8

Linear Classification

Linear Classifiers

Page 9

Optimal Hyperplane

SVMs(Support Vector Machines)

Misclassified

Page 10

Which Separating Hyperplane to Use?

[Figure: two classes of points in the Var1-Var2 plane with several candidate separating lines.]

Page 11

denotes +1

denotes -1

Any of these would be fine..

..but which is best?

Optimal Hyperplane

SVMs(Support Vector Machines)

Page 12

Support Vector Machines

Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin.

2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications

3. Map data to high dimensional space where it is easier to classify with linear decision surfaces: reformulate problem so that data is mapped implicitly to this space

Page 13

Optimal Hyperplane

SVMs(Support Vector Machines)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Page 14

Canonical Hyperplane

SVMs(Support Vector Machines)

The maximum margin linear classifier is the linear classifier with the maximum margin.

This is the simplest kind of SVM (Called an LSVM)

Page 15

Normal Vector

SVMs(Support Vector Machines)

Page 16

Maximizing the Margin

[Figure: the margin width around the separating hyperplane in the Var1-Var2 plane.]

IDEA 1: Select the separating hyperplane that maximizes the margin!

Page 17

Support Vectors

[Figure: the data points that touch the margin boundaries in the Var1-Var2 plane are the support vectors.]

Page 18

Margin width

[Figure: hyperplane B1 (W·X + b = 0) with margin boundaries b11 (W·X + b = 1) and b12 (W·X + b = -1); X1 lies on b11, X2 lies on b12, and d denotes the margin.]

Subtracting the two boundary equations gives W·(X1 - X2) = 2.
The inner product of the vector W with (X1 - X2) has the following geometric meaning:

  W·(X1 - X2) = ||W|| ||X1 - X2|| cos θ = 2

Since d = ||X1 - X2|| cos θ (the projection of X1 - X2 onto the direction of W),

  d ||W|| = 2, so the margin is d = 2 / ||W||.

Page 19

Setting Up the Optimization Problem

[Figure: margin boundaries w·x + b = 1 and w·x + b = -1 around the hyperplane w·x + b = 0 in the Var1-Var2 plane.]

There is a choice of scale and unit for the data such that k = 1. The problem then becomes:

  max 2 / ||w||
  s.t. w·x + b ≥ 1   for every x of class 1
       w·x + b ≤ -1  for every x of class 2

Page 20

Setting Up the Optimization Problem

SVMs(Support Vector Machines)

[Figure: margin boundaries w·x + b = k and w·x + b = -k around the hyperplane w·x + b = 0.]

The width of the margin is 2k / ||w||. So the problem is:

  max 2k / ||w||
  s.t. w·x + b ≥ k   for every x of class 1
       w·x + b ≤ -k  for every x of class 2

Page 21

Setting Up the Optimization Problem

SVMs(Support Vector Machines)

• If class 1 corresponds to yi = 1 and class 2 corresponds to yi = -1, we can rewrite

    w·xi + b ≥ 1   for xi with yi = 1
    w·xi + b ≤ -1  for xi with yi = -1

  as

    yi(w·xi + b) ≥ 1  for all xi

• So the problem becomes:

    max 2 / ||w||                              min (1/2)||w||²
    s.t. yi(w·xi + b) ≥ 1 for all xi    or     s.t. yi(w·xi + b) ≥ 1 for all xi

  (||w||² = wᵀw is minimized)

Page 22

Linear, Hard-Margin SVM Formulation

Find w, b that solve

  min (1/2)||w||²
  s.t. yi(w·xi + b) ≥ 1  for all xi

The problem is convex, so there is a unique global minimum value (when feasible).
There is also a unique minimizer, i.e. the w and b values that attain the minimum.
It is not solvable if the data are not linearly separable.
This is a Quadratic Programming (QP) problem, and it is very efficient computationally with modern constrained-optimization engines (they handle thousands of constraints and training instances).
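A sketch (not from the slides) of how this hard-margin QP could be solved in practice; it assumes the cvxpy modeling library and a small made-up, linearly separable dataset.

import numpy as np
import cvxpy as cp

# toy, linearly separable data (assumption for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# hard-margin primal: minimize (1/2)||w||^2 s.t. yi (w.xi + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)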

Page 23

Finding the Decision Boundary

• Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
• The decision boundary should classify all points correctly, i.e. yi(w·xi + b) ≥ 1 for all i.
• The decision boundary can be found by solving the following constrained optimization problem:
    min (1/2)||w||²  s.t. yi(w·xi + b) ≥ 1 for all i
• The Lagrangian of this optimization problem is
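The hard-margin Lagrangian that the following pages work with can be written out as below (reconstructed in LaTeX here, since the original slide images did not survive extraction); the αi ≥ 0 are the Lagrange multipliers:

  L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^{\top} x_i + b) - 1 \right], \qquad \alpha_i \ge 0

  \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

These two stationarity conditions are the results substituted back into the Lagrangian on page 30.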

Page 24

Lagrangian of SVM optimization problem

Page 30

Lagrangian of SVM optimization problem

Substituting the results obtained from the partial derivatives back into the Lagrangian and simplifying gives:

  Q(α) = Σ_{i=1}^{n} αi - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj XiᵀXj

Page 31

Remember the Dual Problem!!

• Two functions based on the Lagrangian function:
  for fixed λ, find the x that minimizes L(x, λ) and call the resulting value L̂(λ);
  the dual problem is then to find the λ that maximizes L̂(λ).

Page 32

The Dual Problem

• By setting the derivatives of the Lagrangian to zero, the optimization problem can be rewritten in terms of the αi (the dual problem):

    Max  Q(α) = Σ_{i=1}^{n} αi - (1/2) Σi Σj αi αj yi yj XiᵀXj
    subject to  Σ_{i=1}^{n} αi yi = 0,  αi ≥ 0

• This is a quadratic programming (QP) problem; a global maximum over the αi can always be found.
• w can be recovered by  w = Σ_{i=1}^{n} αi yi Xi
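A sketch (not from the slides) of how this dual QP could be solved numerically; the toy data, the linear kernel, and the box constraint C below are assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=1.0):
    # solve max Q(a) = sum(a) - 1/2 a'Qa with sum(a_i y_i) = 0 and 0 <= a_i <= C
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)      # y_i y_j (x_i . x_j), linear kernel
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()      # minimize the negated dual objective
    grad = lambda a: Q @ a - 1.0
    cons = {'type': 'eq', 'fun': lambda a: a @ y}
    res = minimize(obj, np.zeros(n), jac=grad, bounds=[(0, C)] * n,
                   constraints=cons, method='SLSQP')
    return res.x

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = svm_dual(X, y, C=10.0)
w = (alpha * y) @ X                                # w = sum_i alpha_i y_i x_i
print(alpha, w)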

Page 33

The Dual Problem (continued)

• The same dual QP as on the previous page: maximize Q(α) = Σi αi - (1/2) Σi Σj αi αj yi yj XiᵀXj subject to Σi αi yi = 0 and αi ≥ 0, with w recovered as w = Σi αi yi Xi.
• Note: if the number of training examples is very large, SVM training can become very slow, because the number of dual parameters αi grows with the number of training examples.

Page 34

A Geometrical Interpretation

[Figure: class 1 and class 2 points with their Lagrange multipliers; only the points lying on the margin have non-zero multipliers (α1 = 0.8, α6 = 1.4, α8 = 0.6), while α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0. The points with non-zero αi are the support vectors.]

Page 36

Characteristics of the Solution

• The KKT conditions indicate that many of the αi are zero: w is a linear combination of a small number of data points.
• The xi with non-zero αi are called support vectors (SV); the decision boundary is determined only by the SVs.
• Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write
    w = Σ_{j=1}^{s} α_tj y_tj x_tj
• For testing with a new data point z, compute
    f(z) = w·z + b = Σ_{j=1}^{s} α_tj y_tj (x_tjᵀ z) + b
  and classify z as class 1 if the sum is positive, class 2 otherwise.
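Continuing the dual sketch from page 32 (still illustrative, not from the slides): given multipliers alpha for data X, y, the bias b can be recovered from any support vector and a new point classified by the sign of w·z + b.

import numpy as np

def recover_b_and_classify(X, y, alpha, z):
    # w = sum_i alpha_i y_i x_i; on a support vector, y_i (w.x_i + b) = 1, so b = y_i - w.x_i
    w = (alpha * y) @ X
    sv = alpha > 1e-6                      # support vectors have non-zero alpha
    b = np.mean(y[sv] - X[sv] @ w)
    return 1 if z @ w + b > 0 else -1      # class 1 if positive, class 2 otherwise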

Page 37

Support Vector Machines

Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin.
2. Extend the above definition to non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

Page 38

Non-Linearly Separable Data

• What if the training set is not linearly separable?
• Allow some instances to fall within the margin, but penalize them.
• Introduce slack variables ξi.

[Figure: margin boundaries w·x + b = 1 and w·x + b = -1 around w·x + b = 0, with a slack ξi measuring how far a violating point falls inside the margin or onto the wrong side.]

Page 39

Formulating the Optimization Problem

[Figure: the same margin picture, with the slack ξi of each violating point marked.]

The constraint becomes:

  yi(w·xi + b) ≥ 1 - ξi   for all xi
  ξi ≥ 0

The objective function penalizes misclassified instances and those within the margin:

  min (1/2)||w||² + C Σi ξi

C trades off margin width against misclassifications; it is chosen by the user. A large C gives a higher penalty to errors.

Page 40

Soft Margin Hyperplane

• When Σi ξi is minimized, each ξi can be obtained as ξi = max(0, 1 - yi(w·xi + b)).
• The ξi are "slack variables" in the optimization: ξi = 0 if there is no error for xi, and Σi ξi is an upper bound on the number of training errors.
• The optimization problem becomes:

  min (1/2)||w||² + C Σi ξi
  s.t. yi(w·xi + b) ≥ 1 - ξi,  ξi ≥ 0,  for all i
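A tiny sketch (not from the slides) of how the slack values and this soft-margin objective could be evaluated for a given w and b; the data and parameter values below are made up.

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # slack_i = max(0, 1 - y_i (w.x_i + b)); objective = (1/2)||w||^2 + C * sum of slacks
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum(), slacks

X = np.array([[2.0, 2.0], [0.5, 0.4], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
obj, slacks = soft_margin_objective(np.array([1.0, 1.0]), -1.0, X, y, C=1.0)
print(obj, slacks)    # the second point violates the margin, so it gets a positive slack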

Page 41

Soft Margin Classification

SVMs(Support Vector Machines)

• Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• Need to minimize:

    L(w) = (1/2)||W||² + C Σ_{i=1}^{N} ξi^k

• Subject to:

    f(Xi) = 1   if W·Xi + b ≥ 1 - ξi
    f(Xi) = -1  if W·Xi + b ≤ -1 + ξi

Page 42

Linear, Soft-Margin SVMs

  min (1/2)||w||² + C Σi ξi
  s.t. yi(w·xi + b) ≥ 1 - ξi,  ξi ≥ 0,  for all i

The algorithm tries to keep the ξi at zero while maximizing the margin.
Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
Other formulations use ξi² instead.
As C → ∞, we get closer to the hard-margin solution.

Page 43

Robustness of Soft vs Hard Margin SVMs

[Figure: a soft-margin SVM (left) absorbs an outlier with a slack ξi and keeps a wide margin around w·x + b = 0, while a hard-margin SVM (right) is forced to tilt its hyperplane to accommodate the same outlier; as C → ∞ the soft-margin solution approaches the hard-margin one, and as C → 0 margin violations are penalized less.]

Page 44

Soft vs Hard Margin SVM

• A soft-margin SVM always has a solution, is more robust to outliers, and gives smoother surfaces (in the non-linear case).
• A hard-margin SVM does not require guessing the cost parameter (it requires no parameters at all).

Page 49

Linear SVMs: Overview

SVMs(Support Vector Machines)

• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrange multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

    Find α1 ... αN such that
      Q(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xiᵀxj)  is maximized, and
      (1) Σi αi yi = 0
      (2) 0 ≤ αi ≤ C for all αi

    f(x) = Σi αi yi (xiᵀx) + b

Page 50

Support Vector Machines

Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin.
2. Extend the above definition to non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

Page 52

Extension to Non-linear Decision Boundary

• So far we have only considered large-margin classifiers with a linear decision boundary; how can this be generalized to become non-linear?
• Key idea: transform xi to a higher-dimensional space to "make life easier".
  - Input space: the space where the points xi are located.
  - Feature space: the space of φ(xi) after the transformation.
• Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  - Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable.

Page 53

Non-linear SVMs

SVMs(Support Vector Machines)

Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space?

[Figure: 1-D points on the x axis that no single threshold can separate become separable after mapping each point x to (x, x²).]

Page 54

Disadvantages of Linear Decision Surfaces

SVMs(Support Vector Machines)

[Figure: a dataset in the Var1-Var2 plane that no straight line can separate well.]

Page 55

Advantages of Non-Linear Surfaces

SVMs(Support Vector Machines)

[Figure: the same kind of data separated cleanly by a curved decision surface.]

Page 56

Linear Classifiers in High-Dimensional Spaces

SVMs(Support Vector Machines)

Find a function φ(x) to map the data to a different space.

[Figure: a non-linear problem in the Var1-Var2 plane becomes linearly separable in the space of Constructed Feature 1 and Constructed Feature 2.]

Page 57

Transforming the Data

• Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue.

[Figure: the mapping φ(.) sends each point from the input space to the feature space.]

Page 59

Mapping Data to a High-Dimensional Space

SVMs(Support Vector Machines)

• Find a function φ(x) to map to a different space; the SVM formulation then becomes:

    min (1/2)||w||² + C Σi ξi
    s.t. yi(w·φ(xi) + b) ≥ 1 - ξi,  ξi ≥ 0,  for all i

• The data appear only as φ(x); the weights w are now weights in the new space.
• Explicit mapping is expensive if φ(x) is very high-dimensional.
• Solving the problem without explicitly mapping the data is desirable!!

Page 60

The Kernel Trick

Recall the SVM optimization problem:
• The data points only appear as inner products.
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
• Many common geometric operations (angles, distances) can be expressed by inner products.
• Define the kernel function K by K(xi, xj) = φ(xi)ᵀφ(xj).

Page 61

An Example for φ(.) and K(.,.)

• Suppose a mapping φ(.) is given that sends each input vector to a higher-dimensional feature vector.
• An inner product in the feature space is then φ(x)ᵀφ(x′).
• So, if we define the kernel function so that K(x, x′) = φ(x)ᵀφ(x′), there is no need to carry out φ(.) explicitly (a concrete example follows on the next page).
• This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.

Page 62

Kernel Example

The linear classifier relies on the dot product between vectors: K(xi, xj) = xiᵀxj.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(xi, xj) = φ(xi)ᵀφ(xj).
A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):

  K(xi, xj) = (1 + xiᵀxj)²
            = (1 + xi1 xj1 + xi2 xj2)²
            = 1 + xi1² xj1² + 2 xi1 xj1 xi2 xj2 + xi2² xj2² + 2 xi1 xj1 + 2 xi2 xj2
            = [1, xi1², √2 xi1 xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1 xj2, xj2², √2 xj1, √2 xj2]
            = φ(xi)ᵀφ(xj),  where φ(x) = [1, x1², √2 x1 x2, x2², √2 x1, √2 x2]
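A quick numerical check (added for illustration, not part of the slides) that the expansion above is correct: for arbitrary 2-D vectors, (1 + xiᵀxj)² equals φ(xi)ᵀφ(xj).

import numpy as np

def phi(x):
    # feature map for the degree-2 polynomial kernel (1 + x.y)^2 in two dimensions
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)
print((1 + xi @ xj)**2, phi(xi) @ phi(xj))    # the two numbers agree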

Page 63

Kernel Example

Let φ(x1, x2) = (x1², √2 x1x2, x2²). Then

  φ(x)ᵀφ(x′) = x1² x′1² + 2 x1 x2 x′1 x′2 + x2² x′2²
             = (x1 x′1 + x2 x′2)²
             = (xᵀx′)²
             = K(x, x′)

where K(x, x′) = (xᵀx′)² ... (we can do XOR!)

Page 64

What Functions are Kernels?

SVMs(Support Vector Machines)

For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)ᵀφ(xj) can be cumbersome.

Mercer's theorem: every positive semi-definite symmetric function is a kernel. Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

  K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  ...  K(x1,xN) |
      | K(x2,x1)  K(x2,x2)  K(x2,x3)  ...  K(x2,xN) |
      |   ...       ...       ...     ...    ...    |
      | K(xN,x1)  K(xN,x2)  K(xN,x3)  ...  K(xN,xN) |
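A small sketch (not from the slides) of how this condition can be checked empirically on a finite sample: build the Gram matrix for a candidate kernel and verify that its eigenvalues are non-negative. The sample data and the degree-2 polynomial kernel below are assumptions for illustration.

import numpy as np

def is_psd_gram(kernel, X, tol=1e-10):
    # Gram matrix K[i, j] = kernel(x_i, x_j); positive semi-definite iff all eigenvalues >= 0
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.default_rng(1).normal(size=(20, 2))
poly = lambda a, b: (1 + a @ b) ** 2          # degree-2 polynomial kernel
print(is_psd_gram(poly, X))                   # expected: True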

Page 65

Kernel Functions

• In practical use of SVMs, only the kernel function is specified (and not φ(.)).
• The kernel function can be thought of as a similarity measure between the input objects.
• Not every similarity measure can be used as a kernel function, however.
  - Mercer's condition states that any positive semi-definite kernel K(x, y) can be expressed as a dot product in a high-dimensional space.

Page 66

Examples of Kernel Functions

SVMs(Support Vector Machines)

Linear: K(xi, xj) = xiᵀxj

Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p

Gaussian (radial-basis function network):
  K(xi, xj) = exp(-||xi - xj||² / (2σ²))
  Closely related to radial basis function neural networks.

Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
  It does not satisfy the Mercer condition for all β0 and β1.
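The kernels listed above, written out as small Python functions (a sketch for illustration; the default parameter values are arbitrary).

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=3):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    # not a valid Mercer kernel for every choice of beta0 and beta1
    return np.tanh(beta0 * (xi @ xj) + beta1)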

Page 67

Modification Due to Kernel Function

• Change all inner products to kernel functions.
• For training:
    Original: maximize Q(α) = Σi αi - (1/2) Σi Σj αi αj yi yj (xiᵀxj)
    With kernel function: maximize Q(α) = Σi αi - (1/2) Σi Σj αi αj yi yj K(xi, xj)

Page 68

Modification Due to Kernel Function

• For testing, the new data z is classified as class 1 if f ≥ 0 and class 2 if f < 0, where:
    Original: f(z) = Σj α_tj y_tj (x_tjᵀ z) + b
    With kernel function: f(z) = Σj α_tj y_tj K(x_tj, z) + b

Page 71

Example

Suppose we have five 1-D data points:
  x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so
  y1=1, y2=1, y3=-1, y4=-1, y5=1.

We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)², and C is set to 100.
We first find the αi (i = 1, ..., 5) by solving the dual QP.

Page 72

Example (continued)

Using a QP solver, we get
  α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833.
  - Verify (at home) that the constraints are indeed satisfied.
  - The support vectors are {x2=2, x4=5, x5=6}.

The discriminant function is
  f(z) = 2.5 (1) (2z+1)² + 7.333 (-1) (5z+1)² + 4.833 (1) (6z+1)² + b
       = 0.6667 z² - 5.333 z + b

b is recovered by solving f(x2=2) = 1, or by f(x4=5) = -1, or by f(x5=6) = 1,
as x2 and x5 lie on wᵀφ(x) + b = 1 and x4 lies on wᵀφ(x) + b = -1; all give b = 9.
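A short numerical check (not in the original slides) of this worked example: with the α values above (7.333 and 4.833 taken as 22/3 and 29/6) and b = 9, the decision function evaluates to +1 on the class-1 support vectors x2 and x5 and to -1 on x4.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.0, 2.5, 0.0, 22/3, 29/6])   # 7.333 and 4.833 on the slide, written exactly
b = 9.0

def f(z):
    # decision function with the degree-2 polynomial kernel K(x, z) = (x*z + 1)^2
    return np.sum(alpha * y * (x * z + 1.0) ** 2) + b

print(f(2.0), f(5.0), f(6.0))    # -> 1.0, -1.0, 1.0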

Page 74

Homework

Solve the classical XOR problem, i.e., find a non-linear discriminant function!
- Dataset: Class 1: x1 = (-1, -1), x4 = (+1, +1); Class 2: x2 = (-1, +1), x3 = (+1, -1)
- Kernel function: polynomial of order 2, K(x, x′) = (xᵀx′ + 1)²
- To achieve linear separability, we will use C = ∞

Page 76

Robustness of Soft vs Hard Margin SVMs

[Figure repeated from page 43: soft-margin SVM (left, slack ξi allowed) and hard-margin SVM (right), each with its hyperplane w·x + b = 0.]

Page 77

  min (1/2)||w||² + C Σi ξi

Page 78

Choosing the Kernel Function

Probably the trickiest part of using an SVM.
The kernel function is important because it creates the kernel matrix, which summarizes all the data.
Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...).
In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications.
An SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions chosen automatically by the SVM.

Page 79

Strengths of SVM

• Strengths:
  - Training is relatively easy: there are no local optima, unlike in neural networks.
  - It scales relatively well to high-dimensional data.
  - The tradeoff between classifier complexity and error can be controlled explicitly.
  - Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors.
  - Performing logistic regression (sigmoid fitting) on the SVM outputs of a set of data can map SVM outputs to probabilities.

Page 80

Weaknesses of SVM

• Need to choose a "good" kernel function.
• It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance.
• It only considers two classes. How to do multi-class classification with SVM?
  1) With output arity m, learn m SVMs (as in the sketch below):
     - SVM 1 learns "Output == 1" vs "Output != 1"
     - SVM 2 learns "Output == 2" vs "Output != 2"
     - ...
     - SVM m learns "Output == m" vs "Output != m"
  2) To predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region.
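A minimal one-vs-rest sketch (added for illustration) of the procedure above, using scikit-learn's SVC on a made-up three-class dataset. Note that SVC can also handle multi-class directly; the loop below only mirrors the recipe on this slide.

import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit(X, y, m, C=1.0, kernel="rbf"):
    # classifier k separates class k from all other classes
    return [SVC(kernel=kernel, C=C).fit(X, (y == k).astype(int)) for k in range(m)]

def one_vs_rest_predict(models, X):
    # pick the class whose SVM pushes the point furthest into its positive region
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 20)
models = one_vs_rest_fit(X, y, m=3)
print(one_vs_rest_predict(models, np.array([[0.0, 0.0], [6.0, 6.0]])))   # -> [0 2]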

Page 81

Summary: Steps for Classification

1. Prepare the pattern matrix {(xi, yi)}.
2. Select a kernel function.
3. Select the error parameter C: use the values suggested by the SVM software, or set apart a validation set to determine the value of the parameter.
4. Execute the training algorithm (to find all αi).
5. New data can be classified using the αi and the support vectors.
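The five steps above, sketched end to end with scikit-learn (illustrative only; the dataset is synthetic and C is chosen on a simple validation split).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Prepare the pattern matrix (synthetic 2-D data, two classes)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Select a kernel function and 3. select C on the validation set
best = None
for C in (0.1, 1.0, 10.0, 100.0):
    model = SVC(kernel="rbf", C=C).fit(X_train, y_train)   # 4. training finds the alphas
    acc = model.score(X_val, y_val)
    if best is None or acc > best[0]:
        best = (acc, model)

# 5. Classify new data using the alphas and support vectors inside the trained model
print(best[1].predict(np.array([[-1.2, -0.8], [1.5, 0.9]])))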

Page 82

The Dual of the SVM Formulation

Original SVM formulation:
  min over w, b:  (1/2)||w||² + C Σi ξi
  s.t. yi(w·φ(xi) + b) ≥ 1 - ξi,  ξi ≥ 0,  for all xi
  - n inequality constraints
  - n positivity constraints
  - n slack variables (plus w and b)

The (Wolfe) dual of this problem:
  min over α:  (1/2) Σi Σj αi αj yi yj (φ(xi)·φ(xj)) - Σi αi
  s.t. Σi αi yi = 0,  0 ≤ αi ≤ C,  for all i
  - one equality constraint
  - n positivity constraints
  - n variables (the Lagrange multipliers)
  - the objective function is more complicated

NOTICE: the data only appear as the inner products φ(xi)·φ(xj).

Page 83

Nonlinear SVM - Overview

SVMs(Support Vector Machines)

The SVM locates a separating hyperplane in the feature space and classifies points in that space.
It does not need to represent the feature space explicitly; it simply defines a kernel function.
The kernel function plays the role of the dot product in the feature space.

Page 84

SVM Applications

• SVM has been used successfully in many real-world problems:
  - text (and hypertext) categorization
  - image classification
  - ranking (e.g., Google searches)
  - bioinformatics (protein classification, cancer classification)
  - hand-written character recognition

Page 85

Handwritten digit recognition

Page 86

Comparison with Neural Networks

Neural Networks:
  - Hidden layers map to lower-dimensional spaces
  - The search space has multiple local minima
  - Training is expensive
  - Classification is extremely efficient
  - Requires choosing the number of hidden units and layers
  - Very good accuracy in typical domains

SVMs:
  - The kernel maps to a very high-dimensional space
  - The search space has a unique minimum
  - Training is extremely efficient
  - Classification is extremely efficient
  - The kernel and the cost C are the two parameters to select
  - Very good accuracy in typical domains
  - Extremely robust

Page 87

Conclusions

• SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization.
• SVMs use the kernel trick to map the data indirectly to extremely high-dimensional spaces.
• SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.

Page 88

Suggested Further Reading

• http://www.kernel-machines.org/tutorial.html
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
• P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on ν-support vector machines. 2003.
• N. Cristianini. ICML'01 tutorial, 2001.
• K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, May 2001.
• B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the NIPS Conference.
• Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.

Page 89

References

• Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Labs, 1998.
• Law, Martin. "A Simple Introduction to Support Vector Machines." Michigan State University, 2006.
• Prabhakar, K. "An Introduction to Support Vector Machines."
