1
Support Vector Machine (支持向量機)
Speaker : Yao-Min Huang
Date : 2004/11/17
2
Outline
• Linear Learning Machines
• Kernel-Induced Feature Spaces
• Optimization Theory
• SVM Concept
• Hyperplane Classifiers
• Optimal Margin Support Vector Classifiers
• ν-Soft Margin Support Vector Classifiers
• Implementation Techniques
• Implementation of ν-SV Classifiers
• Tools
• Conclusion
3
Linear Learning Machines
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap2
4
Introduction
• In supervised learning, the learning machine is given a training set of inputs with associated output values
$S = ((x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)) \subseteq (X \times Y)^l$
l : number of training samples
x : examples (inputs)
y : labels (outputs)
5
Introduction
• A training set S is said to be trivial if all labels are equal
• Usually $X \subseteq \mathbb{R}^n$, $Y \subseteq \mathbb{R}$, $x = (x_1, x_2, \ldots, x_n)$
• Binary classification
– Input $x = (x_1, x_2, \ldots, x_n)'$
– $f(x) \ge 0$ : assigned to the positive class (assign x to +1)
– Otherwise the negative class (assign x to −1)
where
$f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{n} w_i x_i + b$
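To make the decision rule concrete, here is a minimal sketch in Python (my illustration; the weights `w`, `b` are assumed already learned, and the values below are hypothetical):

```python
import numpy as np

def linear_decision(x, w, b):
    """Assign x to +1 if f(x) = <w, x> + b >= 0, otherwise to -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# hypothetical hyperplane and test points
w = np.array([1.0, -1.0])
b = 0.5
print(linear_decision(np.array([2.0, 1.0]), w, b))  # f = 1.5 >= 0 -> +1
print(linear_decision(np.array([0.0, 2.0]), w, b))  # f = -1.5 < 0 -> -1
```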
6
Linear Classification
7
Linear Classification
• The hyperplane ( 超平面 ) is the dark line.
• w defines a direction perpendicular to the hyperplane
• b moves the hyperplane parallel to itself (the number of free parameters is n+1)
8
Linear Classification
• Def : Functional margin
$\gamma_i = y_i(\langle w \cdot x_i \rangle + b)$
– $\gamma_i > 0$ implies correct classification
– The geometric margin is the perpendicular Euclidean distance of the point to the hyperplane
– The margin of a training set S is the maximum geometric margin over all hyperplanes
– Try to find the hyperplane $(w_{opt}, b_{opt})$ where the margin is largest
9
Linear Classification
10
Rosenblatt’s Perceptron
• By Frank Rosenblatt in 1956
• On-line and mistake driven (it only adapts the weights when a classification mistake is made)
• Starts with an initial weight vector w = 0
• Makes at most $(2R/\gamma)^2$ mistakes (k is the total number of mistakes)
• Requires the data to be linearly separable
11
Rosenblatt’s Perceptron
• Linearly separable
12
Rosenblatt’s Perceptron
• Non-separable
13
Rosenblatt’s Perceptron
14
Rosenblatt’s Perceptron
• Theorem (Novikoff)
– Proves that Rosenblatt's algorithm will converge
– Let $R = \max_{1 \le i \le l} \|x_i\|$, and suppose there exist a vector $w_{opt}$ with $\|w_{opt}\| = 1$ and $\gamma > 0$ such that $y_i(\langle w_{opt} \cdot x_i \rangle + b_{opt}) \ge \gamma$ for $1 \le i \le l$
– Then $k \le (2R/\gamma)^2$ (k is the number of mistakes)
– Proof (skip)
15
Rosenblatt’s Perceptron
• Def : margin slack variable
– Fix $\gamma > 0$; we can define the margin slack variable as
$\xi_i = \xi((x_i, y_i), (w, b), \gamma) = \max(0,\ \gamma - y_i(\langle w \cdot x_i \rangle + b))$
– If $\xi_i > \gamma$, then $x_i$ is misclassified by (w, b)
– Figure (next page):
• Two misclassified points
• The other points have their slack variable equal to zero, since they have a positive margin of more than $\gamma$
16
Rosenblatt’s Perceptron
17
Rosenblatt’s Perceptron
• Theorem (Freund and Schapire)
– S : nontrivial training set with $\|x_i\| \le R$
– (w, b) : any hyperplane with $\|w\| = 1$, $\gamma > 0$
– Define $D^2 = \sum_{i=1}^{l} \xi_i^2$, where $\xi_i = \xi((x_i, y_i), (w, b), \gamma)$
– Then the number of mistakes in the first execution of the for loop is bounded by $\left(\frac{2(R+D)}{\gamma}\right)^2$
18
Rosenblatt’s Perceptron
• Freund and Schapire
– Applies only to the first iteration
– D can be defined with respect to any hyperplane; the data are not necessarily linearly separable
– Finding the smallest number of mistakes is NP-complete
19
Rosenblatt’s Perceptron
• Algorithm in dual form (using Lagrange multipliers and the KKT conditions, the weight vector can be written as $w = \sum_{j=1}^{l} \alpha_j y_j x_j$); a Python transcription follows below

Given a training set S
$\alpha \leftarrow 0$, $b \leftarrow 0$, $R \leftarrow \max_{1 \le i \le l} \|x_i\|$
repeat
    for i = 1 to l
        if $y_i \left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j \cdot x_i \rangle + b\right) \le 0$ then
            $\alpha_i = \alpha_i + 1$
            $b = b + y_i R^2$
    end for
until no mistakes made within the for loop
Return $(\alpha, b)$ to define h(x)
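A direct Python transcription of the dual algorithm above, as a sketch (it assumes labels in {−1, +1} and linearly separable data, and adds an epoch cap of my choosing so it terminates on non-separable input):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual-form perceptron: alpha[i] counts the mistakes made on example i."""
    l = len(y)
    alpha, b = np.zeros(l), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                      # Gram matrix of inner products <x_j, x_i>
    for _ in range(max_epochs):
        mistakes = False
        for i in range(l):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1        # one more mistake on example i
                b += y[i] * R ** 2
                mistakes = True
        if not mistakes:             # no mistakes in a full pass: converged
            break
    return alpha, b
```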
20
Rosenblatt’s Perceptron
• Example i with few/many mistakes has a small/large $\alpha_i$
• $\alpha_i$ can be regarded as the information content of $x_i$
• The points that are harder to learn have larger $\alpha_i$, which can be used to rank the data according to their information content
$h(x) = \mathrm{sgn}(\langle w \cdot x \rangle + b) = \mathrm{sgn}\left(\sum_{j=1}^{l} \alpha_j y_j \langle x_j \cdot x \rangle + b\right)$
21
Kernel-Induced Feature Spaces
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap3
Section 5 of the paper “A Tutorial on ν-Support Vector Machines”
22
Overview
• Non-Linear Classifiers
– One solution: multiple layers of thresholded linear functions → multi-layer neural network (problems: local minima; many parameters; heuristics needed to train; etc.)
– Other solution: project the data into a high-dimensional feature space to increase the computational power of the linear learning machine
23
Overview
24
Kernel Function
• In order to learn non-linear relations with a linear machine, we need to select a set of non-linear features and rewrite the data in the new representation
– First : a fixed non-linear mapping $\phi : X \rightarrow F$ transforms the data into a feature space F
– Second : classify them in the feature space
• In the previous discussion, $f(x) = \sum_{i=1}^{N} w_i \phi_i(x) + b$, which can be expressed in the dual form $f(x) = \sum_{i=1}^{l} \alpha_i y_i \langle \phi(x_i) \cdot \phi(x) \rangle + b$
• If we have a way of computing the inner product in the feature space directly as a function of the original input points, it becomes possible to merge the two steps needed to build a non-linear learning machine
• We call such a direct computation method a kernel function
25
The Gram (Kernel) Matrix
• Gram matrix (also called the kernel matrix)
– Given m points with n-dimensional feature vectors, let M be the $n \times m$ matrix whose jth column consists of the coordinates of $\phi(x_j)$, $j = 1, \ldots, m$; that is, $M = (\phi(x_1), \phi(x_2), \ldots, \phi(x_m))$. Then define the Gram matrix $K = M^T M$
– Contains all necessary information for the learning algorithm
26
Making Kernels
• The kernel function must be symmetric:
$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle = \langle \phi(z) \cdot \phi(x) \rangle = K(z, x)$
• and satisfy the inequality that follows from the Cauchy-Schwarz inequality:
$K(x, z)^2 = \langle \phi(x) \cdot \phi(z) \rangle^2 \le \|\phi(x)\|^2 \|\phi(z)\|^2 = \langle \phi(x) \cdot \phi(x) \rangle \langle \phi(z) \cdot \phi(z) \rangle = K(x, x)\, K(z, z)$
27
Popular Kernel function
• Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
• Radial Basis Function (RBF) kernel: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
• Polynomial kernel: $K(x_i, x_j) = [(x_i \cdot x_j) + 1]^q$
• Sigmoid kernel: $K(x_i, x_j) = \tanh((x_i \cdot x_j) + c)$
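These four kernels translate directly into code. A sketch (the parameter names `sigma`, `q` and `c` are my choices), together with the Gram matrix from the earlier slide:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def poly_kernel(x, z, q=2):
    return (np.dot(x, z) + 1) ** q

def sigmoid_kernel(x, z, c=0.0):
    return np.tanh(np.dot(x, z) + c)

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j): all the information the learner needs."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
```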
28
Optimization Theory
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap5
http://www.chass.utoronto.ca/~osborne/MathTutorial/
29
Optimization Theory
• Definition – The Kuhn-Tucker conditions for the problem
$\max f(x)$ subject to $g_j(x) \le c_j$ for $j = 1, \ldots, m$
are
$L'_i(x) = 0$ for $i = 1, \ldots, n$
$\lambda_j \ge 0$, $g_j(x) \le c_j$, and $\lambda_j [g_j(x) - c_j] = 0$ for $j = 1, \ldots, m$ (the last is the so-called complementarity condition)
where
$L(x) = f(x) - \sum_{j=1}^{m} \lambda_j [g_j(x) - c_j]$
• L(x) : the Lagrangian (Lagrange, 1788)
30
Optimization Theory
• Ex
– Consider the problem
$\max_{x_1, x_2} \left[-(x_1 - 4)^2 - (x_2 - 4)^2\right]$ subject to $x_1 + x_2 \le 4$ and $x_1 + 3x_2 \le 9$
– The Kuhn-Tucker conditions are
$-2(x_1 - 4) - \lambda_1 - \lambda_2 = 0$
$-2(x_2 - 4) - \lambda_1 - 3\lambda_2 = 0$
$x_1 + x_2 \le 4$, $\lambda_1 \ge 0$, and $\lambda_1 (x_1 + x_2 - 4) = 0$
$x_1 + 3x_2 \le 9$, $\lambda_2 \ge 0$, and $\lambda_2 (x_1 + 3x_2 - 9) = 0$
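The slide stops before solving the system; completing the example (my worked steps), guess that only the first constraint is active and check the conditions:

```latex
\text{Try } \lambda_2 = 0 \text{ with } x_1 + x_2 = 4 \text{ active. Stationarity gives }
-2(x_1 - 4) = \lambda_1 \text{ and } -2(x_2 - 4) = \lambda_1,
\text{ so } x_1 = x_2 = 2 \text{ and } \lambda_1 = -2(2 - 4) = 4 \ge 0.
\text{The inactive constraint holds: } x_1 + 3x_2 = 8 \le 9.
\text{All conditions are satisfied, so } (x_1, x_2) = (2, 2),\ (\lambda_1, \lambda_2) = (4, 0).
```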
31
SVM Concept
Ref : Section 2 of the paper “A Tutorial on ν-Support Vector Machines”
32
The history of SVM
• SVM is a pattern recognition method based on statistical learning theory. It was first proposed by Boser, Guyon and Vapnik at COLT-92 and has developed rapidly ever since; it has been applied successfully in many fields (bioinformatics, text and handwriting recognition, classification, etc.)
• COLT(Computational Learning Theory)
33
SVM Concept
• Goal: find a hyperplane that correctly separates as many of the two classes of data points as possible, while keeping the separated points as far from the separating surface as possible.
• Approach: construct a constrained optimization problem, specifically a constrained quadratic programming problem; solving it yields the classifier.
34
General description of the pattern recognition problem
• Given: m observed samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Find: the best function $y' = f(x, w)$
• Condition: minimize the expected risk
$R(w) = \int L(y, f(x, w))\, dF(x, y)$
• Loss function
$L(y, f(x, w)) = \begin{cases} 0, & y = f(x, w) \\ 1, & y \ne f(x, w) \end{cases}$
35
• The expected risk R(w) depends on information about the joint probability F(x, y), so it cannot be computed in practical problems.
• The empirical risk $R_{emp}(w)$ is usually used in place of the expected risk R(w):
$R_{emp}(w) = \frac{1}{m} \sum_{i=1}^{m} L(y_i, f(x_i, w)) = \frac{\text{number of misclassified samples}}{m}$
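In code, the empirical risk under this 0-1 loss is simply the misclassification rate (a trivial sketch):

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp(w) = (number of misclassified samples) / m under the 0-1 loss."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))
```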
36
Problems with general pattern recognition methods
• Minimum empirical risk is not the same as minimum expected risk, so the classifier's predictive ability cannot be guaranteed.
• The empirical risk approaches the expected risk only as the number of samples tends to infinity; a very large number of samples is needed to guarantee the classifier's performance.
• We need to find the balance point between minimal empirical risk and maximal generalization ability.
37
Optimal separating hyperplane. Simple case: the optimal separating hyperplane in the linearly separable case (maximum margin)
38
Mathematical formulation of the SVM problem
• Given: m observed samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Goal: the optimal separating hyperplane $w \cdot x + b = 0$
• Conditions: the separating hyperplane should have minimal empirical risk (fewest misclassifications) and maximal generalization ability (largest margin)
39
Conditions the separating hyperplane must satisfy
• For $(x_i, y_i)$, the separating hyperplane $g(x) = w \cdot x + b$ should satisfy
$y_i = +1 \Rightarrow g(x_i) = w \cdot x_i + b \ge 1$
$y_i = -1 \Rightarrow g(x_i) = w \cdot x_i + b \le -1$
• i.e.
$y_i (w \cdot x_i + b) \ge 1$
40
Margin
• Margin width
• = 2 × the distance from a closest sample point to the line
• $= 2 \times \frac{|w \cdot x + b|}{\|w\|} = \frac{2}{\|w\|}$
• $\max \frac{2}{\|w\|} \Leftrightarrow \min \frac{1}{2}\|w\|^2$
41
SVM
• Given: m observed samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Solve:
$\min \frac{1}{2}\|w\|^2$
subject to $y_i (w \cdot x_i + b) \ge 1$, $i = 1, 2, \ldots, m$
• Goal: the optimal separating hyperplane $w \cdot x + b = 0$
• Note: this is the Maximal Margin Classifier problem; it applies only when the data are linearly separable in the feature space (a solver sketch follows below)
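This constrained QP can be handed to any off-the-shelf solver. A minimal sketch using cvxpy (my choice of tool, not mentioned in the slides), assuming the data X, y are linearly separable so the problem is feasible:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1."""
    m, n = X.shape
    w, b = cp.Variable(n), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

# hypothetical separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
```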
42
Hyperplane Classifiersand
Optimal Margin Support Vector Classifier
Ref : Sections 3 & 4 of the paper “A Tutorial on ν-Support Vector Machines”
43
Hyperplane Classifiers
• To construct the Optimal Hyperplane, one solves the following optimization problem:
$\min_{w,b} \frac{1}{2}\|w\|^2$
subject to $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, 2, \ldots, m$
– Lagrangian dual:
$\max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)$   (24)
where $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left(y_i((w \cdot x_i) + b) - 1\right)$   (25)
– By the KKT conditions, $\frac{\partial}{\partial b} L(w, b, \alpha) = 0$ and $\frac{\partial}{\partial w} L(w, b, \alpha) = 0$, together with the complementarity condition, lead to
$\alpha_i \left(y_i((w \cdot x_i) + b) - 1\right) = 0$ for i = 1 to m,  $\sum_{i=1}^{m} \alpha_i y_i = 0$   (32),  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$   (33)
44
Hyperplane Classifiers
– What does this mean? [substituting (33) into (24)]
$\max_{\alpha \ge 0} \begin{cases} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) & \text{if } \sum_{i=1}^{m} \alpha_i y_i = 0 \\ -\infty & \text{if } \sum_{i=1}^{m} \alpha_i y_i \ne 0 \end{cases}$   (34)
– primal form → dual form:
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$   (35)
subject to $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$   (36)
– So the hyperplane decision function can be written as
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i (x \cdot x_i) + b\right)$   (38)
45
Optimal Margin Support Vector Classifiers
• Linear kernel function
$k(x, x') = (x \cdot x') = (\phi(x) \cdot \phi(x'))$   (39)
• More general form
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i (\phi(x) \cdot \phi(x_i)) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i k(x, x_i) + b\right)$   (40)
and the following QP:
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$   (41)
subject to $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$   (42)
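Once the QP has produced the multipliers α and offset b, equation (40) is easy to evaluate; a sketch (`kernel` can be any of the functions from the kernel slide):

```python
import numpy as np

def svm_predict(x, X_train, y_train, alpha, b, kernel):
    """f(x) = sgn( sum_i y_i alpha_i k(x, x_i) + b ), equation (40).
    Only support vectors (alpha_i > 0) contribute to the sum."""
    s = sum(a * yi * kernel(x, xi)
            for a, yi, xi in zip(alpha, y_train, X_train) if a > 1e-12)
    return np.sign(s + b)
```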
46
ν-Soft Margin Support Vector Classifiers
Ref : Section 6 of the paper “A Tutorial on ν-Support Vector Machines”
47
C-SVC
• C-SVC (add slack variables $\xi_i$)
$\min_{w,\xi} \tau(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$   (52)
subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $i = 1, 2, \ldots, m$   (53)
$\xi_i \ge 0$, $i = 1, \ldots, m$   (54)
– Incorporating kernels, and rewriting it in terms of Lagrange multipliers:
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to $0 \le \alpha_i \le C$, $i = 1, \ldots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$   (55)
48
ν-SVC
• C is replaced by the parameter $\nu \in [0, 1]$
– ν is a lower bound and an upper bound on the fraction of examples that are support vectors and that lie on the wrong side of the hyperplane, respectively.
$\min_{w \in H,\ \xi \in \mathbb{R}^m,\ \rho, b \in \mathbb{R}} \tau(w, \xi, \rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m} \sum_{i=1}^{m} \xi_i$   (56)
subject to $y_i (w \cdot x_i + b) \ge \rho - \xi_i$, $i = 1, 2, \ldots, m$   (57)
$\xi_i \ge 0$, $\rho \ge 0$, $i = 1, \ldots, m$   (58)
49
50
ν-SVC
• Derive the dual form
$L(w, \xi, b, \rho, \alpha, \beta, \delta) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \left(\alpha_i (y_i(\langle x_i \cdot w \rangle + b) - \rho + \xi_i) + \beta_i \xi_i\right) - \delta\rho$   (60)
By the KKT conditions:
$w = \sum_{i=1}^{m} \alpha_i y_i x_i$   (61)
$\alpha_i + \beta_i = \frac{1}{m}$   (62)
$\sum_{i=1}^{m} \alpha_i y_i = 0$   (63)
$\sum_{i=1}^{m} \alpha_i - \delta = \nu$   (64)
51
ν-SVC
• Derive the dual form [substituting (61) & (62) into L]
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$   (65)
subject to $0 \le \alpha_i \le \frac{1}{m}$   (66)
$\sum_{i=1}^{m} \alpha_i y_i = 0$   (67)
$\sum_{i=1}^{m} \alpha_i \ge \nu$   (68)
• And the resulting decision function
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i k(x, x_i) + b\right)$   (69)
• Connection between ν-SVC and C-SVC: if ν-SVC leads to $\rho > 0$, then C-SVC with C set a priori to $\frac{1}{m\rho}$ leads to the same decision function.
52
Implementation Techniques
Ref : AN INTRODUCTION TO SUPPORT VECTOR MACHINES Chap7
53
Implementation Techniques
• Parameter Selection
• The Naïve solution – Gradient Ascent
• Chunking and Decomposition
• Sequential Minimal Optimization (SMO)
54
Parameter Selection
• Parameters in the kernel function
– Kernel Alignment: a measure of similarity between a kernel and a target function y(x)
• [Kandola 2002, Optimizing Kernel Alignment over Combinations of Kernels]
– Cross-validation technique (popular)
55
Gradient Ascent
• Solve
$\max W(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
$0 \le \alpha_i \le C$, $\sum_{i} \alpha_i y_i = 0$
• The ith component of the gradient of $W(\alpha)$ is
$\frac{\partial W(\alpha)}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{l} \alpha_j y_j K(x_i, x_j)$
• So one can maximize $W(\alpha)$ simply by iterating the update rule
$\alpha_i \leftarrow \alpha_i + \eta \frac{\partial W(\alpha)}{\partial \alpha_i}$
56
Gradient Ascent
• Simple on-line algorithm for the 1-norm soft margin (a Python sketch follows below)

Given training set S and learning rate $\eta \in \mathbb{R}^+$
$\alpha \leftarrow 0$
repeat
    for i = 1 to m
        $\alpha_i \leftarrow \alpha_i + \eta\left(1 - y_i \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j)\right)$
        make sure that $\alpha_i$ stays between 0 and C:
        (if $\alpha_i < 0$ then $\alpha_i \leftarrow 0$
        else if $\alpha_i > C$ then $\alpha_i \leftarrow C$)
    end for
until stopping criterion satisfied
return $\alpha$
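The same loop in Python, as a sketch (the fixed learning rate and the convergence test on α are my choices):

```python
import numpy as np

def kernel_gradient_ascent(K, y, C, eta=0.1, epochs=1000, tol=1e-6):
    """1-norm soft margin via coordinate-wise gradient ascent on W(alpha).
    K is the precomputed kernel matrix, y the labels in {-1, +1}."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        old = alpha.copy()
        for i in range(m):
            grad_i = 1 - y[i] * np.sum(alpha * y * K[i])       # dW/d(alpha_i)
            alpha[i] = np.clip(alpha[i] + eta * grad_i, 0, C)  # keep 0 <= alpha_i <= C
        if np.max(np.abs(alpha - old)) < tol:                  # stopping criterion
            break
    return alpha
```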
57
Chunking and Decomposition
• Similar to C-SVC; the difficulty is that the $y_i y_j k(x_i, x_j)$ are in general not zero. Thus, for large data sets, the Hessian (second-derivative) matrix of the objective function cannot be stored in the computer's memory
– Solution
• Traditional method: Newton or quasi-Newton method
• Current: decomposition methods
• Edgar Osuna (Cambridge, MA) et al. presented "An Improved Training Algorithm for Support Vector Machines" at IEEE NNSP'97, proposing a decomposition algorithm for SVM
58
Chunking
• Idea:
– The value of the objective function is the same if zero rows and columns of the matrix Q are removed, so a large QP problem breaks down into a series of smaller QP problems.
• Pseudo-code for the general working set method

Given training set S
$\alpha \leftarrow 0$
select an arbitrary working set $\hat{S} \subseteq S$
repeat
    solve optimisation problem on $\hat{S}$
    select new working set from data not satisfying the Karush-Kuhn-Tucker conditions
until stopping criterion satisfied
return $\alpha$
59
Osuna’s Decomposition method
• Keep a constant-size matrix for every QP sub-problem, to allow very large training data sets.
• Pseudo-code:

Given training set S
$\alpha \leftarrow 0$
Select an arbitrary working set B of parameters; let N be the set of remaining parameters
While the KKT conditions are violated, i.e. there exists some $j \in N$ such that
    $\alpha_j = 0$ and $y_j f(x_j) < 1$, or
    $0 < \alpha_j < C$ and $y_j f(x_j) \ne 1$, or
    $\alpha_j = C$ and $y_j f(x_j) > 1$
  select new set B by replacing any $i \in B$ with $j \in N$
  solve optimization problem on B
return $\alpha$
60
Sequential Minimal Optimization (SMO)
• 1998, John C. Platt (Microsoft Research)
– Derived by taking the idea of the decomposition method to its extreme and optimizing a minimal subset of just two points at each iteration
– Benefits
• doesn't need any extra matrix storage
• doesn't need a numerical QP optimization step
• needs more iterations to converge, but only needs a few operations at each step, which leads to an overall speed-up
– Components
• An analytic method to solve for the two Lagrange parameters
• A heuristic for choosing the points
61
Sequential Minimal Optimization (SMO)
• Optimize two Lagrange multipliers: the problem
$\min_{\alpha} \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ subject to $0 \le \alpha_i \le \frac{1}{\nu l}$, $\sum_{i=1}^{l} \alpha_i = 1$
can be reduced to
$\min_{\alpha_1, \alpha_2} \frac{1}{2}\left(\alpha_1^2 K_{11} + 2\alpha_1\alpha_2 K_{12} + \alpha_2^2 K_{22}\right) + \alpha_1 C_1 + \alpha_2 C_2 + C$
with $C_i = y_i \sum_{j=3}^{l} \alpha_j y_j k(x_i, x_j)$ and C collecting the terms that depend only on $\alpha_3, \ldots, \alpha_l$,
subject to $0 \le \alpha_1, \alpha_2 \le \frac{1}{\nu l}$, $\alpha_1 + \alpha_2 = \Delta$, where $\Delta = 1 - \sum_{j=3}^{l} \alpha_j$
(Remark: $K_{ij} = K(x_i, x_j)$, $i, j = 1, 2$)
62
Sequential Minimal Optimization (SMO)
• Optimize two Lagrange multipliers
Since C does not depend on $\alpha_1, \alpha_2$, one can eliminate it; using $\alpha_1 = \Delta - \alpha_2$, the new form is
$\min_{\alpha_2} \frac{1}{2}(\Delta - \alpha_2)^2 K_{11} + (\Delta - \alpha_2)\alpha_2 K_{12} + \frac{1}{2}\alpha_2^2 K_{22} + (\Delta - \alpha_2) C_1 + \alpha_2 C_2$
with the derivative
$-(\Delta - \alpha_2) K_{11} + (\Delta - 2\alpha_2) K_{12} + \alpha_2 K_{22} - C_1 + C_2$
Let the derivative be zero, then
$\alpha_2 = \frac{\Delta (K_{11} - K_{12}) + C_1 - C_2}{K_{11} + K_{22} - 2 K_{12}}$
Since $\alpha_2$ is found, we can calculate $\alpha_1$ from the previous condition ($\alpha_1 = \Delta - \alpha_2$)
63
Sequential Minimal Optimization (SMO)
• Update after a successful optimization step
Let $\alpha_1^*, \alpha_2^*$ be the values of the Lagrange parameters after the step; the corresponding output is
$O_i = K_{1i} \alpha_1^* + K_{2i} \alpha_2^* + C_i$
Combining with the previous expressions, one then has an update equation for $\alpha_2$ in which $\Delta$ disappears:
$\alpha_2^* = \alpha_2 + \frac{O_1 - O_2}{K_{11} + K_{22} - 2 K_{12}}$
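For comparison, in the more familiar C-SVC dual the same two-multiplier idea yields the classic analytic step below; this is a sketch of standard SMO (not of the ν-version derived above), and it omits Platt's pair-selection heuristics and the threshold update:

```python
import numpy as np

def smo_pair_step(i, j, alpha, b, y, K, C):
    """One analytic SMO update on the pair (i, j) for the C-SVC dual."""
    f = lambda k: np.sum(alpha * y * K[:, k]) + b      # current output on point k
    E_i, E_j = f(i) - y[i], f(j) - y[j]                # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]              # curvature along the constraint line
    if eta <= 0:
        return alpha                                   # skip degenerate pairs in this sketch
    # feasible segment for alpha_j imposed by 0 <= alpha <= C and sum(alpha * y) = const
    if y[i] == y[j]:
        lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, lo, hi)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)    # restore sum(alpha * y)
    alpha[i], alpha[j] = a_i, a_j
    return alpha
```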
64
Summary for chunking and decomposition

Method                 | Needs QP module? | Size for each iteration | Iteration unit
-----------------------|------------------|-------------------------|-----------------
Chunking               | Y                | arbitrary               | training samples
Osuna's decomposition  | Y                | arbitrary               | parameters
SMO                    | N                | 2                       | parameters
65
Implementation of ν-SV Classifiers
Ref : Section 7 of the paper “A Tutorial on ν-Support Vector Machines”
66
ν-SVC
• Dual form
$\max_{\alpha \in \mathbb{R}^m} W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$   (76)
subject to $0 \le \alpha_i \le \frac{1}{m}$   (77)
$\sum_{i=1}^{m} \alpha_i y_i = 0$
• [C.-C. Chang 2001] proves that for any given ν, there is at least one optimal solution which satisfies $e^T \alpha = \nu$
67
ν-SVC
• Decomposition method algorithm
68
ν-SVC
• SMO-type Implementation– (79) and (80) can be rewritten as
69
Tools
• SVM implementations
– Royal Holloway, GMD-FIRST, ATT
– SVMlight (C) by Thorsten Joachims
– libsvm (C++/Java/Python) by Chih-Chung Chang and Chih-Jen Lin
– Torch (C++) by Ronan Collobert
– Weka (Java) at University of Waikato
– http://www.kernel-machines.org/
• Optimization packages
– MINOS (licensed software) by Bruce Murtagh and Michael Saunders
– LOQO by Robert Vanderbei
– MATLAB package
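As a quick illustration of driving one of these tools, here is a minimal example using scikit-learn's NuSVC, a modern Python wrapper around libsvm (my addition; it is not on the original slide):

```python
from sklearn.svm import NuSVC

# hypothetical toy data, labels in {-1, +1}
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, 1, 1]

clf = NuSVC(nu=0.5, kernel='rbf', gamma='scale')  # nu as in the nu-SVC slides
clf.fit(X, y)
print(clf.predict([[0.9, 0.9]]))  # expected: [1]
```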
70
Conclusion
• Kernels provide an elegant framework for studying three fundamental issues of machine learning
– Similarity measures: the kernel can be viewed as a (nonlinear) similarity measure, and should ideally incorporate prior knowledge about the problem at hand
– Data representation: as described above, kernels induce representations of the data in a linear space
– Function class: due to the representer theorem, the kernel implicitly also determines the function class which is used for learning
71
The End