1
Support Vector Machine (支持向量機)
Speaker : Yao-Min Huang
Date : 2004/11/17
2
Outline
• Linear Learning Machines
• Kernel-Induced Feature Spaces
• Optimization Theory
• SVM Concept
• Hyperplane Classifiers
• Optimal Margin Support Vector Classifiers
• ν-Soft Margin Support Vector Classifiers
• Implementation Techniques
• Implementation of ν-SV Classifiers
• Tools
• Conclusion
3
Linear Learning Machines
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap2
4
Introduction
• In supervised learning, the learning machine is given a training set of inputs with associated output values
$S = ((x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)) \subseteq (X \times Y)^l$
l : number of training samples
x : examples (inputs)
y : labels (outputs)
5
Introduction
• A training set S is said to be trivial if all labels are equal
• Usually $X \subseteq \mathbb{R}^n$, $Y \subseteq \mathbb{R}$, $x = (x_1, x_2, \ldots, x_n)$
• Binary classification
– Input $x = (x_1, x_2, \ldots, x_n)'$
– $f(x) \ge 0$ : assigned to the positive class (assign x to +1)
– Otherwise the negative class (assign x to −1)
where
$f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{n} w_i x_i + b$
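To make the decision rule concrete, here is a minimal sketch in Python (my illustration; the weights `w`, `b` are assumed already learned, and the values below are hypothetical):

```python
import numpy as np

def linear_decision(x, w, b):
    """Assign x to +1 if f(x) = <w, x> + b >= 0, otherwise to -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# hypothetical hyperplane and test points
w = np.array([1.0, -1.0])
b = 0.5
print(linear_decision(np.array([2.0, 1.0]), w, b))  # f = 1.5 >= 0 -> +1
print(linear_decision(np.array([0.0, 2.0]), w, b))  # f = -1.5 < 0 -> -1
```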
6
Linear Classification
7
Linear Classification
• The hyperplane ( 超平面 ) is the dark line.
• w defines a direction perpendicular to the hyperplane
• b moves the hyperplane parallel to itself (the number of free parameters is n+1)
8
Linear Classification
• Def : Functional margin
$\gamma_i = y_i(\langle w \cdot x_i \rangle + b)$
– $\gamma_i > 0$ implies correct classification
– The geometric margin is the perpendicular Euclidean distance of the point to the hyperplane
– The margin of a training set S is the maximum geometric margin over all hyperplanes
– Try to find the hyperplane $(w_{opt}, b_{opt})$ where the margin is largest
9
Linear Classification
10
Rosenblatt’s Perceptron
• By Frank Rosenblatt in 1956
• On-line and mistake driven (it only adapts the weights when a classification mistake is made)
• Starts with an initial weight vector w = 0
• Makes at most $(2R/\gamma)^2$ mistakes (k is the total number of mistakes)
• Requires the data to be linearly separable
11
Rosenblatt’s Perceptron
• Linearly separable
12
Rosenblatt’s Perceptron
• Non-separable
13
Rosenblatt’s Perceptron
14
Rosenblatt’s Perceptron
• Theorem (Novikoff)
– Proves that Rosenblatt's algorithm will converge
– Let $R = \max_{1 \le i \le l} \|x_i\|$, and suppose there exist a vector $w_{opt}$ with $\|w_{opt}\| = 1$ and $\gamma > 0$ such that $y_i(\langle w_{opt} \cdot x_i \rangle + b_{opt}) \ge \gamma$ for $1 \le i \le l$
– Then $k \le (2R/\gamma)^2$ (k is the number of mistakes)
– Proof (skip)
15
Rosenblatt’s Perceptron
• Def : margin slack variable
– Fix $\gamma > 0$; we can define the margin slack variable as
$\xi_i = \xi((x_i, y_i), (w, b), \gamma) = \max(0,\ \gamma - y_i(\langle w \cdot x_i \rangle + b))$
– If $\xi_i > \gamma$, then $x_i$ is misclassified by (w, b)
– Figure (next page):
• Two misclassified points
• The other points have their slack variable equal to zero, since they have a positive margin of more than $\gamma$
16
Rosenblatt’s Perceptron
17
Rosenblatt’s Perceptron
• Theorem (Freund and Schapire)
– S : nontrivial training set with $\|x_i\| \le R$
– (w, b) : any hyperplane with $\|w\| = 1$, $\gamma > 0$
– Define $D^2 = \sum_{i=1}^{l} \xi_i^2$, where $\xi_i = \xi((x_i, y_i), (w, b), \gamma)$
– Then the number of mistakes in the first execution of the for loop is bounded by $\left(\frac{2(R+D)}{\gamma}\right)^2$
18
Rosenblatt’s Perceptron
• Freund and Schapire
– Applies only to the first iteration
– D can be defined with respect to any hyperplane; the data are not necessarily linearly separable
– Finding the smallest number of mistakes is NP-complete
19
Rosenblatt’s Perceptron
• Algorithm in dual form (using Lagrange multipliers and the KKT conditions, the weight vector can be written as $w = \sum_{j=1}^{l} \alpha_j y_j x_j$); a Python transcription follows below

Given a training set S
$\alpha \leftarrow 0$, $b \leftarrow 0$, $R \leftarrow \max_{1 \le i \le l} \|x_i\|$
repeat
    for i = 1 to l
        if $y_i \left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j \cdot x_i \rangle + b\right) \le 0$ then
            $\alpha_i = \alpha_i + 1$
            $b = b + y_i R^2$
    end for
until no mistakes made within the for loop
Return $(\alpha, b)$ to define h(x)
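A direct Python transcription of the dual algorithm above, as a sketch (it assumes labels in {−1, +1} and linearly separable data, and adds an epoch cap of my choosing so it terminates on non-separable input):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual-form perceptron: alpha[i] counts the mistakes made on example i."""
    l = len(y)
    alpha, b = np.zeros(l), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                      # Gram matrix of inner products <x_j, x_i>
    for _ in range(max_epochs):
        mistakes = False
        for i in range(l):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1        # one more mistake on example i
                b += y[i] * R ** 2
                mistakes = True
        if not mistakes:             # no mistakes in a full pass: converged
            break
    return alpha, b
```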
20
Rosenblatt’s Perceptron
• Example i with few/many mistakes has a small/large $\alpha_i$
• $\alpha_i$ can be regarded as the information content of $x_i$
• The points that are harder to learn have larger $\alpha_i$, which can be used to rank the data according to their information content
$h(x) = \mathrm{sgn}(\langle w \cdot x \rangle + b) = \mathrm{sgn}\left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j \cdot x \rangle + b\right)$
21
Kernel-Induced Feature Spaces
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap3
Section 5 of the paper “A Tutorial on ν-Support Vector Machines”
22
Overview
• Non-Linear Classifiers
– One solution: multiple layers of thresholded linear functions → multi-layer neural network (problems: local minima; many parameters; heuristics needed to train; etc.)
– Other solution: project the data into a high-dimensional feature space to increase the computational power of the linear learning machine
23
Overview
24
Kernel Function
• In order to learn non-linear relations with a linear machine, we need to select a set of non-linear features and rewrite the data in the new representation
– First : a fixed non-linear mapping $\phi : X \rightarrow F$ transforms the data into a feature space F
– Second : classify them in the feature space
• In the previous discussion, $f(x) = \sum_{i=1}^{N} w_i \phi_i(x) + b$, which can be expressed in the dual form $f(x) = \sum_{i=1}^{l} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle + b$
• If we have a way of computing the inner product in the feature space directly as a function of the original input points, it becomes possible to merge the two steps needed to build a non-linear learning machine
• We call such a direct computation method a kernel function
25
The Gram (Kernel) Matrix
• Gram matrix (also called the kernel matrix)
– Given m points with n-dimensional feature vectors, let M be the $n \times m$ matrix whose jth column consists of the coordinates of $\phi(x_j)$, $j = 1, \ldots, m$; that is, $M = (\phi(x_1), \phi(x_2), \ldots, \phi(x_m))$. Then define the Gram matrix $K = M^T M$
– Contains all necessary information for the learning algorithm
26
Making Kernels
• The kernel function must be symmetric:
$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle = \langle \phi(z) \cdot \phi(x) \rangle = K(z, x)$
• and satisfy the inequality that follows from the Cauchy-Schwarz inequality:
$K(x, z)^2 = \langle \phi(x) \cdot \phi(z) \rangle^2 \le \|\phi(x)\|^2 \|\phi(z)\|^2 = \langle \phi(x) \cdot \phi(x) \rangle \langle \phi(z) \cdot \phi(z) \rangle = K(x, x)\, K(z, z)$
27
Popular Kernel function
• Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
• Radial Basis Function (RBF) kernel: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
• Polynomial kernel: $K(x_i, x_j) = [(x_i \cdot x_j) + 1]^q$
• Sigmoid kernel: $K(x_i, x_j) = \tanh((x_i \cdot x_j) + c)$
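These four kernels translate directly into code. A sketch (the parameter names `sigma`, `q` and `c` are my choices), together with the Gram matrix from the earlier slide:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def poly_kernel(x, z, q=2):
    return (np.dot(x, z) + 1) ** q

def sigmoid_kernel(x, z, c=0.0):
    return np.tanh(np.dot(x, z) + c)

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j): all the information the learner needs."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
```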
28
Optimization Theory
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap5
http://www.chass.utoronto.ca/~osborne/MathTutorial/
29
Optimization Theory
• Definition – The Kuhn-Tucker conditions for the problem
$\max f(x)$ subject to $g_j(x) \le c_j$ for $j = 1, \ldots, m$
are
$L'_i(x) = 0$ for $i = 1, \ldots, n$
$\lambda_j \ge 0$, $g_j(x) \le c_j$, and $\lambda_j [g_j(x) - c_j] = 0$ for $j = 1, \ldots, m$ (the last is the so-called complementarity condition)
where
$L(x) = f(x) - \sum_{j=1}^{m} \lambda_j [g_j(x) - c_j]$
• L(x) : the Lagrangian (Lagrange, 1788)
30
Optimization Theory
• Ex
– Consider the problem
$\max_{x_1, x_2} \left[-(x_1 - 4)^2 - (x_2 - 4)^2\right]$ subject to $x_1 + x_2 \le 4$ and $x_1 + 3x_2 \le 9$
– The Kuhn-Tucker conditions are
$-2(x_1 - 4) - \lambda_1 - \lambda_2 = 0$
$-2(x_2 - 4) - \lambda_1 - 3\lambda_2 = 0$
$x_1 + x_2 \le 4$, $\lambda_1 \ge 0$, and $\lambda_1 (x_1 + x_2 - 4) = 0$
$x_1 + 3x_2 \le 9$, $\lambda_2 \ge 0$, and $\lambda_2 (x_1 + 3x_2 - 9) = 0$
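The slide stops before solving the system; completing the example (my worked steps), guess that only the first constraint is active and check the conditions:

```latex
\text{Try } \lambda_2 = 0 \text{ with } x_1 + x_2 = 4 \text{ active. Stationarity gives }
-2(x_1 - 4) = \lambda_1 \text{ and } -2(x_2 - 4) = \lambda_1,
\text{ so } x_1 = x_2 = 2 \text{ and } \lambda_1 = -2(2 - 4) = 4 \ge 0.
\text{The inactive constraint holds: } x_1 + 3x_2 = 8 \le 9.
\text{All conditions are satisfied, so } (x_1, x_2) = (2, 2),\ (\lambda_1, \lambda_2) = (4, 0).
```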
31
SVM Concept
Ref : Section 2 of the paper “A Tutorial on ν-Support Vector Machines”
32
The history of SVM
• SVM is a pattern recognition method based on statistical learning theory. It was first proposed by Boser, Guyon and Vapnik at COLT-92 and has developed rapidly ever since; it has been applied successfully in many fields (bioinformatics, text and handwriting recognition, classification, etc.)
• COLT(Computational Learning Theory)
33
SVM Concept
• Goal: find a hyperplane that correctly separates as many of the two classes of data points as possible, while keeping the separated points as far from the separating surface as possible.
• Approach: construct a constrained optimization problem, specifically a constrained quadratic programming problem; solving it yields the classifier.
34
General description of the pattern recognition problem
• Given: m observed samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Find: the best function $y' = f(x, w)$
• Condition: minimize the expected risk
$R(w) = \int L(y, f(x, w))\, dF(x, y)$
• Loss function
$L(y, f(x, w)) = \begin{cases} 0, & y = f(x, w) \\ 1, & y \ne f(x, w) \end{cases}$
35
• The expected risk R(w) depends on information about the joint probability F(x, y), so it cannot be computed in practical problems.
• The empirical risk $R_{emp}(w)$ is usually used in place of the expected risk R(w):
$R_{emp}(w) = \frac{1}{m} \sum_{i=1}^{m} L(y_i, f(x_i, w)) = \frac{\text{number of misclassified samples}}{m}$
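In code, the empirical risk under this 0-1 loss is simply the misclassification rate (a trivial sketch):

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp(w) = (number of misclassified samples) / m under the 0-1 loss."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))
```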
36
Problems with general pattern recognition methods
• Minimum empirical risk is not the same as minimum expected risk, so the classifier's predictive ability cannot be guaranteed.
• The empirical risk approaches the expected risk only as the number of samples tends to infinity; a very large number of samples is needed to guarantee the classifier's performance.
• We need to find the balance point between minimal empirical risk and maximal generalization ability.
37
Optimal separating hyperplane. Simple case: the optimal separating hyperplane in the linearly separable case (maximum margin)
38
Mathematical formulation of the SVM problem
• Given: m observed samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Goal: the optimal separating hyperplane $w \cdot x + b = 0$
• Conditions: the separating hyperplane should have minimal empirical risk (fewest misclassifications) and maximal generalization ability (largest margin)
39
Conditions the separating hyperplane must satisfy
• For $(x_i, y_i)$, the separating hyperplane $g(x) = w \cdot x + b$ should satisfy
$y_i = +1 \Rightarrow g(x_i) = w \cdot x_i + b \ge 1$
$y_i = -1 \Rightarrow g(x_i) = w \cdot x_i + b \le -1$
• i.e.
$y_i (w \cdot x_i + b) \ge 1$
40
Margin
• Margin width
• = 2 × the distance from a closest sample point to the line
• $= 2 \times \frac{|w \cdot x + b|}{\|w\|} = \frac{2}{\|w\|}$
• $\max \frac{2}{\|w\|} \Leftrightarrow \min \frac{1}{2}\|w\|^2$
41
SVM
• Given: m observed samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Solve:
$\min \frac{1}{2}\|w\|^2$
subject to $y_i (w \cdot x_i + b) \ge 1$, $i = 1, 2, \ldots, m$
• Goal: the optimal separating hyperplane $w \cdot x + b = 0$
• Note: this is the Maximal Margin Classifier problem; it applies only when the data are linearly separable in the feature space (a solver sketch follows below)
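This constrained QP can be handed to any off-the-shelf solver. A minimal sketch using cvxpy (my choice of tool, not mentioned in the slides), assuming the data X, y are linearly separable so the problem is feasible:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1."""
    m, n = X.shape
    w, b = cp.Variable(n), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

# hypothetical separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
```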
42
Hyperplane Classifiersand
Optimal Margin Support Vector Classifier
Ref : Sections 3 & 4 of the paper “A Tutorial on ν-Support Vector Machines”
43
Hyperplane Classifiers
• To construct the Optimal Hyperplane, one solves the following optimization problem:
$\min_{w,b} \frac{1}{2}\|w\|^2$
subject to $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, 2, \ldots, m$
– Lagrangian dual:
$\max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)$   (24)
where $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left(y_i((w \cdot x_i) + b) - 1\right)$   (25)
– By the KKT conditions, $\frac{\partial}{\partial b} L(w, b, \alpha) = 0$ and $\frac{\partial}{\partial w} L(w, b, \alpha) = 0$, together with the complementarity condition, lead to
$\alpha_i \left(y_i((w \cdot x_i) + b) - 1\right) = 0$ for i = 1 to m,  $\sum_{i=1}^{m} \alpha_i y_i = 0$   (32),  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$   (33)
44
Hyperplane Classifiers
– What does this mean? [substituting (33) into (24)]
$\max_{\alpha \ge 0} \begin{cases} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) & \text{if } \sum_{i=1}^{m} \alpha_i y_i = 0 \\ -\infty & \text{if } \sum_{i=1}^{m} \alpha_i y_i \ne 0 \end{cases}$   (34)
– primal form → dual form:
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$   (35)
subject to $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$   (36)
– So the hyperplane decision function can be written as
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i (x \cdot x_i) + b\right)$   (38)
45
Optimal Margin Support Vector Classifiers
• Linear kernel function
$k(x, x') = (x \cdot x') = (\phi(x) \cdot \phi(x'))$   (39)
• More general form
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i (\phi(x) \cdot \phi(x_i)) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b\right)$   (40)
and the following QP:
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$   (41)
subject to $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$   (42)
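Once the QP has produced the multipliers α and offset b, equation (40) is easy to evaluate; a sketch (`kernel` can be any of the functions from the kernel slide):

```python
import numpy as np

def svm_predict(x, X_train, y_train, alpha, b, kernel):
    """f(x) = sgn( sum_i y_i alpha_i k(x, x_i) + b ), equation (40).
    Only support vectors (alpha_i > 0) contribute to the sum."""
    s = sum(a * yi * kernel(x, xi)
            for a, yi, xi in zip(alpha, y_train, X_train) if a > 1e-12)
    return np.sign(s + b)
```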
46
ν-Soft Margin Support Vector Classifiers
Ref : Section 6 of the paper “A Tutorial on ν-Support Vector Machines”
47
C-SVC
• C-SVC (add slack variables $\xi_i$)
$\min_{w,\xi} \tau(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$   (52)
subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $i = 1, 2, \ldots, m$   (53)
$\xi_i \ge 0$, $i = 1, \ldots, m$   (54)
– Incorporating kernels, and rewriting it in terms of Lagrange multipliers:
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to $0 \le \alpha_i \le C$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$   (55)
48
ν-SVC
• C is replaced by the parameter $\nu \in [0, 1]$
– ν is a lower bound and an upper bound on the fraction of examples that are support vectors and that lie on the wrong side of the hyperplane, respectively.
$\min_{w \in H,\ \xi \in \mathbb{R}^m,\ \rho, b \in \mathbb{R}} \tau(w, \xi, \rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m} \sum_{i=1}^{m} \xi_i$   (56)
subject to $y_i (w \cdot x_i + b) \ge \rho - \xi_i$, $i = 1, 2, \ldots, m$   (57)
$\xi_i \ge 0$, $\rho \ge 0$, $i = 1, \ldots, m$   (58)
49
50
ν-SVC
• Derive the dual form
$L(w, \xi, b, \rho, \alpha, \beta, \delta) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \left(\alpha_i (y_i(\langle x_i \cdot w \rangle + b) - \rho + \xi_i) + \beta_i \xi_i\right) - \delta\rho$   (60)
By the KKT conditions:
$w = \sum_{i=1}^{m} \alpha_i y_i x_i$   (61)
$\alpha_i + \beta_i = \frac{1}{m}$   (62)
$\sum_{i=1}^{m} \alpha_i y_i = 0$   (63)
$\sum_{i=1}^{m} \alpha_i - \delta = \nu$   (64)
51
ν-SVC
• Derive the dual form [substituting (61) & (62) into L]
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$   (65)
subject to $0 \le \alpha_i \le \frac{1}{m}$   (66)
$\sum_{i=1}^{m} \alpha_i y_i = 0$   (67)
$\sum_{i=1}^{m} \alpha_i \ge \nu$   (68)
• And the resulting decision function
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i k(x, x_i) + b\right)$   (69)
• Connection between ν-SVC and C-SVC: if ν-SVC leads to $\rho > 0$, then C-SVC with C set a priori to $\frac{1}{m\rho}$ leads to the same decision function.
52
Implementation Techniques
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap7
53
Implementation Techniques
• Parameter Selection
• The Naïve solution – Gradient Ascent
• Chunking and Decomposition
• Sequential Minimal Optimization (SMO)
54
Parameter Selection
• Parameters in the kernel function
– Kernel Alignment: a measure of similarity between a kernel and a target function y(x)
• [Kandola 2002, Optimizing Kernel Alignment over Combinations of Kernels]
– Cross-validation technique (popular)
55
Gradient Ascent
• Solve
$\max W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
$0 \le \alpha_i \le C$, $\sum_{i} \alpha_i y_i = 0$
• The ith component of the gradient of $W(\alpha)$ is
$\frac{\partial W(\alpha)}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{l} \alpha_j y_j K(x_i, x_j)$
• So one can maximize $W(\alpha)$ simply by iterating the update rule
$\alpha_i \leftarrow \alpha_i + \eta \frac{\partial W(\alpha)}{\partial \alpha_i}$
56
Gradient Ascent
• Simple on-line algorithm for the 1-norm soft margin (a Python sketch follows below)

Given training set S and learning rate $\eta \in \mathbb{R}^+$
$\alpha \leftarrow 0$
repeat
    for i = 1 to m
        $\alpha_i \leftarrow \alpha_i + \eta\left(1 - y_i \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)\right)$
        make sure that $\alpha_i$ stays between 0 and C:
        (if $\alpha_i < 0$ then $\alpha_i \leftarrow 0$
        else if $\alpha_i > C$ then $\alpha_i \leftarrow C$)
    end for
until stopping criterion satisfied
return $\alpha$
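The same loop in Python, as a sketch (the fixed learning rate and the convergence test on α are my choices):

```python
import numpy as np

def kernel_gradient_ascent(K, y, C, eta=0.1, epochs=1000, tol=1e-6):
    """1-norm soft margin via coordinate-wise gradient ascent on W(alpha).
    K is the precomputed kernel matrix, y the labels in {-1, +1}."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        old = alpha.copy()
        for i in range(m):
            grad_i = 1 - y[i] * np.sum(alpha * y * K[i])       # dW/d(alpha_i)
            alpha[i] = np.clip(alpha[i] + eta * grad_i, 0, C)  # keep 0 <= alpha_i <= C
        if np.max(np.abs(alpha - old)) < tol:                  # stopping criterion
            break
    return alpha
```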
57
Chunking and Decomposition
• Similar to C-SVC; the difficulty is that the $y_i y_j k(x_i, x_j)$ are in general not zero. Thus, for large data sets, the Hessian (second-derivative) matrix of the objective function cannot be stored in the computer's memory
– Solution
• Traditional method: Newton or quasi-Newton method
• Current: decomposition methods
• Edgar Osuna (Cambridge, MA) et al. presented "An Improved Training Algorithm for Support Vector Machines" at IEEE NNSP'97, proposing a decomposition algorithm for SVM
58
Chunking
• Idea:
– The value of the objective function is the same if zero rows and columns of the matrix Q are removed, so a large QP problem breaks down into a series of smaller QP problems.
• Pseudo-code for the general working set method

Given training set S
$\alpha \leftarrow 0$
select an arbitrary working set $\hat{S} \subseteq S$
repeat
    solve optimisation problem on $\hat{S}$
    select new working set from data not satisfying the Karush-Kuhn-Tucker conditions
until stopping criterion satisfied
return $\alpha$
59
Osuna’s Decomposition method
• Keep a constant-size matrix for every QP sub-problem, to allow very large training data sets.
• Pseudo-code:

Given training set S
$\alpha \leftarrow 0$
Select an arbitrary working set B of parameters; let N be the set of remaining parameters
While the KKT conditions are violated, i.e. there exists some $j \in N$ such that
    $\alpha_j = 0$ and $y_j f(x_j) < 1$, or
    $0 < \alpha_j < C$ and $y_j f(x_j) \ne 1$, or
    $\alpha_j = C$ and $y_j f(x_j) > 1$
  select new set B by replacing any $i \in B$ with $j \in N$
  solve optimization problem on B
return $\alpha$
60
Sequential Minimal Optimization (SMO)
• 1998, John C. Platt (Microsoft Research)
– Derived by taking the idea of the decomposition method to its extreme and optimizing a minimal subset of just two points at each iteration
– Benefits
• doesn't need any extra matrix storage
• doesn't need a numerical QP optimization step
• needs more iterations to converge, but only needs a few operations at each step, which leads to an overall speed-up
– Components
• An analytic method to solve for the two Lagrange parameters
• A heuristic for choosing the points
61
Sequential Minimal Optimization (SMO)
• Optimize two Lagrange multipliers: the problem
$\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ subject to $0 \le \alpha_i \le \frac{1}{\nu l}$, $\sum_{i=1}^{l} \alpha_i = 1$
can be reduced to
$\min_{\alpha_1, \alpha_2} \frac{1}{2}\left(\alpha_1^2 K_{11} + 2\alpha_1\alpha_2 K_{12} + \alpha_2^2 K_{22}\right) + \alpha_1 C_1 + \alpha_2 C_2 + C$
with $C_i = y_i \sum_{j=3}^{l} \alpha_j y_j k(x_i, x_j)$ and C collecting the terms that depend only on $\alpha_3, \ldots, \alpha_l$,
subject to $0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu l}$, $\alpha_1 + \alpha_2 = \Delta$, where $\Delta = 1 - \sum_{j=3}^{l} \alpha_j$
(Remark: $K_{ij} = K(x_i, x_j)$, $i, j = 1, 2$)
62
Sequential Minimal Optimization (SMO)
• Optimize two Lagrange multipliers
Since C does not depend on $\alpha_1, \alpha_2$, one can eliminate it; using $\alpha_1 = \Delta - \alpha_2$, the new form is
$\min_{\alpha_2} \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2) C_1 + \alpha_2 C_2$
with the derivative
$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2$
Let the derivative be zero, then
$\alpha_2 = \frac{\Delta (K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2 K_{12}}$
Since $\alpha_2$ is found, we can calculate $\alpha_1$ from the previous condition ($\alpha_1 = \Delta - \alpha_2$)
63
Sequential Minimal Optimization (SMO)
• Update after a successful optimization step
Let $\alpha_1^*, \alpha_2^*$ be the values of the Lagrange parameters after the step; the corresponding output is
$O_i = K_{1i} \alpha_1^* + K_{2i} \alpha_2^* + C_i$
Combining with the previous expressions, one then has an update equation for $\alpha_2$ in which $\Delta$ disappears:
$\alpha_2^* = \alpha_2 + \frac{O_1 - O_2}{K_{11} + K_{22} - 2 K_{12}}$
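For comparison, in the more familiar C-SVC dual the same two-multiplier idea yields the classic analytic step below; this is a sketch of standard SMO (not of the ν-version derived above), and it omits Platt's pair-selection heuristics and the threshold update:

```python
import numpy as np

def smo_pair_step(i, j, alpha, b, y, K, C):
    """One analytic SMO update on the pair (i, j) for the C-SVC dual."""
    f = lambda k: np.sum(alpha * y * K[:, k]) + b      # current output on point k
    E_i, E_j = f(i) - y[i], f(j) - y[j]                # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]              # curvature along the constraint line
    if eta <= 0:
        return alpha                                   # skip degenerate pairs in this sketch
    # feasible segment for alpha_j imposed by 0 <= alpha <= C and sum(alpha * y) = const
    if y[i] == y[j]:
        lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, lo, hi)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)    # restore sum(alpha * y)
    alpha[i], alpha[j] = a_i, a_j
    return alpha
```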
64
Summary for chunking and decomposition

Method                 | Needs QP module? | Size for each iteration | Iteration unit
-----------------------|------------------|-------------------------|-----------------
Chunking               | Y                | arbitrary               | training samples
Osuna's decomposition  | Y                | arbitrary               | parameters
SMO                    | N                | 2                       | parameters
65
Implementation of ν-SV Classifiers
Ref : Section 7 of the paper “A Tutorial on ν-Support Vector Machines”
66
ν-SVC
• Dual form
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$   (76)
subject to $0 \le \alpha_i \le \frac{1}{m}$   (77)
$\sum_{i=1}^{m} \alpha_i y_i = 0$
• [C.-C. Chang 2001] proves that for any given ν, there is at least one optimal solution which satisfies $e^T \alpha = \nu$
67
ν-SVC
• Decomposition method algorithm
68
ν-SVC
• SMO-type Implementation– (79) and (80) can be rewritten as
69
Tools
• SVM implementations
– Royal Holloway, GMD-FIRST, ATT
– SVMlight (C) by Thorsten Joachims
– libsvm (C++/Java/Python) by Chih-Chung Chang and Chih-Jen Lin
– Torch (C++) by Ronan Collobert
– Weka (Java) at University of Waikato
– http://www.kernel-machines.org/
• Optimization packages
– MINOS (licensed software) by Bruce Murtagh and Michael Saunders
– LOQO by Robert Vanderbei
– MATLAB package
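As a quick illustration of driving one of these tools, here is a minimal example using scikit-learn's NuSVC, a modern Python wrapper around libsvm (my addition; it is not on the original slide):

```python
from sklearn.svm import NuSVC

# hypothetical toy data, labels in {-1, +1}
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, 1, 1]

clf = NuSVC(nu=0.5, kernel='rbf', gamma='scale')  # nu as in the nu-SVC slides
clf.fit(X, y)
print(clf.predict([[0.9, 0.9]]))  # expected: [1]
```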
70
Conclusion
• Kernels provide an elegant framework for studying three fundamental issues of machine learning
– Similarity measures: the kernel can be viewed as a (nonlinear) similarity measure, and should ideally incorporate prior knowledge about the problem at hand
– Data representation: as described above, kernels induce representations of the data in a linear space
– Function class: due to the representer theorem, the kernel implicitly also determines the function class which is used for learning
71
The End