Support Vector Machine Easy Tutorial (I wish)


Page 1: Support Vector Machine Tutorial 한국어

Support Vector Machine

Easy Tutorial (I wish)

Page 2: Support Vector Machine Tutorial 한국어

(1)   y(x) = w^T φ(x) + b

Page 3: Support Vector Machine Tutorial 한국어

A few assumptions to simplify the problem:

1. After mapping into a feature space through some function, the data are linearly separable.

2. The labels take only two values, +1 and -1.

Page 4: Support Vector Machine Tutorial 한국어

Distance between the hyperplane and a point:

(2)   t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||

where φ(x_n) is the feature vector of x_n, and t_n y(x_n) > 0 because we assume the data are linearly separable.

(1)   y(x) = w^T φ(x) + b

Page 5: Support Vector Machine Tutorial 한국어

Perceptron algorithm:   w' = w + t_n x_n   if   sign(y(x_n)) ≠ t_n

The normal vector of the hyperplane is adjusted using the misclassified samples; this is why the perceptron algorithm is called a mistake-driven algorithm.
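A minimal sketch of this update rule, assuming raw inputs (no feature mapping), a zero initial weight vector, and a made-up epoch cap:

```python
import numpy as np

def perceptron(X, t, epochs=100):
    """Mistake-driven updates: w <- w + t_n * x_n on each misclassified sample.
    X: (N, d) samples (append a bias column if needed), t: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if np.sign(w @ x_n) != t_n:   # misclassified (sign(0) counts as a mistake)
                w += t_n * x_n            # adjust the normal vector toward the sample
                mistakes += 1
        if mistakes == 0:                 # converged: every sample classified correctly
            break
    return w
```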

Page 6: Support Vector Machine Tutorial 한국어

Perceptron, convergence

• Claim: the perceptron algorithm indeed converges in a finite number of updates.

• Assume that

θ^(k) : the parameter vector after the k-th update,

||x_t|| ≤ R for all t and some finite R,

there exists θ* s.t. y_t (θ*^T x_t) > 0 for all t.

Page 7: Support Vector Machine Tutorial 한국어

Convergence proof

On each mistake (y_t θ^(k-1)T x_t ≤ 0) the update is θ^(k) = θ^(k-1) + y_t x_t.

Lower bound:  θ*^T θ^(k) = θ*^T θ^(k-1) + y_t (θ*^T x_t) ≥ θ*^T θ^(k-1) + γ ≥ … ≥ k γ,  where γ = min_t y_t (θ*^T x_t) > 0.

Upper bound:  ||θ^(k)||² = ||θ^(k-1)||² + 2 y_t (θ^(k-1)T x_t) + ||x_t||² ≤ ||θ^(k-1)||² + R² ≤ … ≤ k R².

Combining the two:

1 ≥ cos(θ*, θ^(k)) = θ*^T θ^(k) / (||θ*|| ||θ^(k)||) ≥ k γ / (||θ*|| √k R)

so k ≤ R² ||θ*||² / γ² : the number of updates is finite.

Page 8: Support Vector Machine Tutorial 한국어

Definition of Support Vector and Margin

The red circles are the support vectors; the perpendicular distance between a support vector and the hyperplane is called the margin.
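A small sketch of that distance computation (w, b, and the sample points below are made up for illustration):

```python
import numpy as np

# Geometric margin: the perpendicular distance of each sample to the
# hyperplane w^T x + b = 0 is |w^T x + b| / ||w||.
w, b = np.array([2.0, 1.0]), -1.0
X = np.array([[1.0, 0.5], [0.2, 0.1], [2.0, 2.0]])

dist = np.abs(X @ w + b) / np.linalg.norm(w)
margin = dist.min()                # distance of the closest sample(s)
print(dist, margin)
```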

Page 9: Support Vector Machine Tutorial 한국어

Find the parameters that maximize the margin to the sample closest to the hyperplane:

(3)   argmax_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }

SVM's motivation?, intuition?

There are countless separating hyperplanes. Which hyperplane will have the smallest generalization error? The maximal margin classifier!!

Page 10: Support Vector Machine Tutorial 한국어

Linear Classifiers

x → f(x, w, b) → y_est

denotes +1

denotes -1

f(x, w, b) = sign(w·x - b)

How would you classify this data?

Page 11: Support Vector Machine Tutorial 한국어

Linear Classifiers

x → f(x, w, b) → y_est

denotes +1

denotes -1

f(x, w, b) = sign(w·x - b)

How would you classify this data?

Page 12: Support Vector Machine Tutorial 한국어

Linear Classifiers

x → f(x, w, b) → y_est

denotes +1

denotes -1

f(x, w, b) = sign(w·x - b)

How would you classify this data?

Page 13: Support Vector Machine Tutorial 한국어

Linear Classifiers

x → f(x, w, b) → y_est

denotes +1

denotes -1

f(x, w, b) = sign(w·x - b)

Any of these would be fine..

..but which is best?

Page 14: Support Vector Machine Tutorial 한국어
Page 15: Support Vector Machine Tutorial 한국어
Page 16: Support Vector Machine Tutorial 한국어
Page 17: Support Vector Machine Tutorial 한국어

Preprocessing to formulate the SVM idea

Distance between the hyperplane and a point (with t_n y(x_n) > 0, since we assume the data are linearly separable):

(2)   t_n y(x_n) / ||w|| = t_n (w^T φ(x_n) + b) / ||w||

Rescaling w → κw and b → κb leaves the margin t_n (κ w^T φ(x_n) + κb) / ||κw|| unchanged, so we may fix the scale freely.

(4)   Adjust w and b so that the point closest to the surface satisfies  t_n (w^T φ(x_n) + b) = 1.

(5)   Then every point satisfies  t_n (w^T φ(x_n) + b) ≥ 1.

Maximizing the margin 1/||w|| is the same as minimizing ||w||, so the parameters are found by minimizing the inverse of the margin:

(6)   argmin_{w,b} (1/2) ||w||²   (the factor 1/2 is for convenience when differentiating later)

Page 18: Support Vector Machine Tutorial 한국어

Formulating the SVM idea

(6)   argmin_{w,b} (1/2) ||w||²   subject to   t_n (w^T φ(x_n) + b) ≥ 1

This completes the SVM formulation!!!!!! Now all that is left is to solve this formula. The optimization problem is similar to the linear programming we saw in high school; it is called quadratic programming (QP). To solve this quadratic programming problem by converting it into its dual form, we introduce the concept of Lagrange multipliers.

Why convert to the dual form?

-> For the kernel trick.

Note also that the formulation above is a convex optimization problem.

Convex vs. non-convex, briefly:

-> The nice thing about convexity is that the global minimum can be found with the "differentiate and set to zero" technique.

Page 19: Support Vector Machine Tutorial 한국어

Lagrange multiplier

Solving a constrained problem another way. Assume all points of interest lie on the constraint surface g(x) = 0.

Taylor series: for a point x + ε on the surface, g(x + ε) ≈ g(x) + ε^T ∇g(x). Since g(x + ε) = g(x) = 0, we get ε^T ∇g(x) ≈ 0 in the limit ||ε|| → 0, i.e. ∇g is perpendicular to the constraint surface.

At a constrained maximum of f, ∇f must also be perpendicular to the surface (otherwise we could move along the surface and increase f). Because ∇f is parallel to ∇g, there is a λ ≠ 0 such that

∇f + λ ∇g = 0   (the KKT conditions generalize this to inequality constraints).

Define the Lagrangian  L(x, λ) = f(x) + λ g(x);  setting ∂L/∂x = 0 and ∂L/∂λ = 0 recovers exactly the condition above plus the constraint.

Example: maximize f(x1, x2) = 1 - x1² - x2² subject to g(x1, x2) = x1 + x2 - 1 = 0.

L(x, λ) = 1 - x1² - x2² + λ (x1 + x2 - 1)

∂L/∂x1 = -2 x1 + λ = 0,   ∂L/∂x2 = -2 x2 + λ = 0,   ∂L/∂λ = x1 + x2 - 1 = 0

⇒ (x1*, x2*) = (1/2, 1/2), λ = 1.
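A quick numeric check of the worked example (the function names are ours):

```python
import numpy as np

# At (1/2, 1/2) with lambda = 1, the gradient of the Lagrangian vanishes
# and the constraint holds.
def f(x): return 1 - x[0]**2 - x[1]**2
def grad_f(x): return np.array([-2*x[0], -2*x[1]])
grad_g = np.array([1.0, 1.0])           # gradient of g(x) = x1 + x2 - 1

x_star, lam = np.array([0.5, 0.5]), 1.0
print(grad_f(x_star) + lam * grad_g)    # [0. 0.]  -> stationarity of L
print(x_star.sum() - 1.0)               # 0.0      -> constraint satisfied
print(f(x_star))                        # 0.5      -> the constrained maximum
```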

Page 20: Support Vector Machine Tutorial 한국어

Converting the SVM formulation to its dual form (for the SVM magic)

Lagrangian of the optimization problem:

(7)   L(w, b, a) = (1/2) ||w||² - Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) - 1 },   a_n ≥ 0

Equations obtained by differentiating (7) and rearranging:

(8)   ∂L/∂w = 0   ⇒   w = Σ_{n=1}^N a_n t_n φ(x_n)

(9)   ∂L/∂b = 0   ⇒   Σ_{n=1}^N a_n t_n = 0

Substituting (8) and (9) into (7) and rearranging gives the dual:

(10)   L̃(a) = Σ_{n=1}^N a_n - (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m),   where k(x_n, x_m) = φ(x_n)^T φ(x_m)

(11)   a_n ≥ 0   (constraint from (7))

(12)   Σ_{n=1}^N a_n t_n = 0   (constraint from (9))

Take this part slowly, to understand it.

Page 21: Support Vector Machine Tutorial 한국어

(6)   argmin_{w,b} (1/2) ||w||²   subject to   t_n (w^T φ(x_n) + b) ≥ 1

Lagrangian optimization formula:

(7)   L(w, b, a) = (1/2) ||w||² - Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) - 1 }

Equations obtained by differentiating (7) and rearranging:

(8)   ∂L/∂w = 0   ⇒   w = Σ_{n=1}^N a_n t_n φ(x_n)

(9)   ∂L/∂b = 0   ⇒   Σ_{n=1}^N a_n t_n = 0

Page 22: Support Vector Machine Tutorial 한국어

Lagrangian optimization formula:

(7)   L(w, b, a) = (1/2) ||w||² - Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) - 1 }

Substituting (8)  w = Σ_{n=1}^N a_n t_n φ(x_n)  and (9)  Σ_{n=1}^N a_n t_n = 0  into (7) and rearranging:

(10)   L̃(a) = Σ_{n=1}^N a_n - (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

(11)   a_n ≥ 0   (constraint from (7))

(12)   Σ_{n=1}^N a_n t_n = 0   (constraint from (9))

Page 23: Support Vector Machine Tutorial 한국어

The magic of SVM

Solving the primal optimization problem (7) has a computational cost of about O(M³), where M is the dimension of the feature space φ(·).

Solving the dual problem (10), obtained using (8) and (9), costs about O(N³), where N is the number of samples.

Usually M < N, so the conversion to the dual form looks like a bad idea.

But if M ≫ N, i.e. we map into a very high-dimensional (even infinite-dimensional) feature space, the kernel function k(x, x') = φ(x)^T φ(x') does the same work with a small computation in the original space: the dimension M disappears from the problem. This is called the kernel trick.

Take this part slowly; read it once more.
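A small sketch of why the dimension disappears: the dual (10) touches the data only through the N × N Gram matrix, shown here with a Gaussian kernel whose implicit feature space is infinite-dimensional (the toy data and σ are made up):

```python
import numpy as np

# The dual needs only K[n, m] = k(x_n, x_m), so its cost depends on N,
# not on the dimension of phi(x).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                 # N = 5 samples in 2-D input space
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))      # 5 x 5, independent of dim(phi)
print(K.shape)
```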

Page 24: Support Vector Machine Tutorial 한국어

SVM's predictor

(1)   y(x) = w^T φ(x) + b;  substituting (8)  w = Σ_{n=1}^N a_n t_n φ(x_n):

(13)   y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b

So SVM learning ultimately means learning a_n and b.

The a_n are obtained by solving the dual with its constraints:

maximize   L̃(a) = Σ_{n=1}^N a_n - (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

subject to   a_n ≥ 0,   Σ_{n=1}^N a_n t_n = 0.

Solving this QP is the computationally heaviest part of SVM training, so there is research on optimizing it; one result is SMO ("sequential minimal optimization"). MATLAB provides the "quadprog" function.
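As a sketch, the same dual can be handed to a generic QP solver; here using the cvxopt package (an assumption on our part, not mentioned in the slides), which minimizes (1/2)aᵀPa + qᵀa subject to Ga ≤ h and Aa = b:

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is installed

def solve_dual(K, t):
    """Solve the hard-margin dual (10)-(12) with a generic QP solver.
    K: (N, N) kernel matrix, t: labels in {-1, +1}. Returns the a_n."""
    N = len(t)
    P = matrix(np.outer(t, t) * K)                   # a_n a_m t_n t_m k(x_n, x_m)
    q = matrix(-np.ones(N))                          # maximize sum(a) == minimize -sum(a)
    G = matrix(-np.eye(N)); h = matrix(np.zeros(N))  # -a_n <= 0, i.e. a_n >= 0
    A = matrix(t.reshape(1, -1).astype(float))       # sum_n a_n t_n = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```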

Page 25: Support Vector Machine Tutorial 한국어

The Lagrangian optimization (7) satisfies the following KKT conditions:

(14)   a_n ≥ 0

(15)   t_n y(x_n) - 1 ≥ 0

(16)   a_n ( t_n y(x_n) - 1 ) = 0

The meaning of condition (16) is that for every sample either a_n = 0 or t_n y(x_n) = 1.

The samples with a_n > 0 are the support vectors (SV); i.e. once we solve the QP for a, we know the SV set.

The prediction (13) can then classify using only the SV set.

Page 26: Support Vector Machine Tutorial 한국어

Computing b

Any support vector satisfies t_n y(x_n) = 1. Substituting (13) into this, with S the set of support vectors (and using t_n² = 1):

(17)   t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1

Averaging over all N_S support vectors:

(18)   b = (1/N_S) Σ_{n∈S} ( t_n - Σ_{m∈S} a_m t_m k(x_n, x_m) )

The classifying hyperplane is built from the support vectors alone.
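A small sketch of (18); the names are ours, and a, t, and the kernel matrix K are assumed to come from the trained model:

```python
import numpy as np

def intercept(a, t, K, eps=1e-8):
    """Average b over the support vectors, as in (18).
    a: multipliers, t: labels in {-1, +1}, K: precomputed kernel matrix."""
    S = np.where(a > eps)[0]                        # support vector indices
    b_vals = [t[n] - np.sum(a[S] * t[S] * K[n, S]) for n in S]
    return np.mean(b_vals)
```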

Page 27: Support Vector Machine Tutorial 한국어

What if the data are not linearly separable?

So far we have developed the theory assuming the samples can be made linearly separable through a high-dimensional mapping function.

In real problems, however, that is rarely the case.

So the optimization formulation needs to change in order to handle data that are not linearly separable.

Page 28: Support Vector Machine Tutorial 한국어

(19)   def: slack  ξ_n = |t_n - y(x_n)|, the difference between the actual target value and the prediction:
ξ_n = 0 for samples correctly classified on or outside the margin boundary; 0 < ξ_n ≤ 1 for samples inside the margin but still correctly classified; ξ_n > 1 for misclassified samples.

(20)   The constraint changes to   t_n y(x_n) ≥ 1 - ξ_n,   ξ_n ≥ 0,   n = 1, …, N.

Penalizing wrongly classified samples while still minimizing the inverse of the margin:

(21)   argmin_{w,b}  C Σ_{n=1}^N ξ_n + (1/2) ||w||²   (as C → ∞ this becomes identical to the separable case (6))

(22)   L(w, b, ξ, a, μ) = (1/2) ||w||² + C Σ_{n=1}^N ξ_n - Σ_{n=1}^N a_n { t_n y(x_n) - 1 + ξ_n } - Σ_{n=1}^N μ_n ξ_n

with the KKT conditions

(23)   a_n ≥ 0

(24)   t_n y(x_n) - 1 + ξ_n ≥ 0

(25)   a_n ( t_n y(x_n) - 1 + ξ_n ) = 0

(26)   μ_n ≥ 0

(27)   ξ_n ≥ 0

(28)   μ_n ξ_n = 0
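A small sketch of the slack variables (w, b, C, and the points below are made up): ξ_n measures how far sample n falls on the wrong side of its margin boundary, and (21) trades the total slack against the margin.

```python
import numpy as np

w, b, C = np.array([1.0, -1.0]), 0.0, 10.0
X = np.array([[2.0, 0.0], [0.5, 0.2], [-1.0, 1.0]])
t = np.array([1, 1, 1])

y = X @ w + b
xi = np.maximum(0.0, 1.0 - t * y)        # 0: outside margin, (0,1]: inside, >1: misclassified
objective = C * xi.sum() + 0.5 * w @ w   # the soft-margin objective (21)
print(xi, objective)
```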

Page 29: Support Vector Machine Tutorial 한국어

What if the data are not linearly separable?

KKT conditions and the derivative of L with respect to each parameter:

(29)   ∂L/∂w = 0   ⇒   w = Σ_{n=1}^N a_n t_n φ(x_n)

(30)   ∂L/∂b = 0   ⇒   Σ_{n=1}^N a_n t_n = 0

(31)   ∂L/∂ξ_n = 0   ⇒   a_n = C - μ_n

(32)   Substituting (29)-(31) into (22):

L̃(a) = Σ_{n=1}^N a_n - (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

(33)   From (23)  a_n ≥ 0  and (26)  μ_n ≥ 0  together with (31):   0 ≤ a_n ≤ C

(34)   From (30):   Σ_{n=1}^N a_n t_n = 0

The dual objective is the same as before; only the box constraint on a_n is new.

Page 30: Support Vector Machine Tutorial 한국어
Page 31: Support Vector Machine Tutorial 한국어
Page 32: Support Vector Machine Tutorial 한국어

Example.

Polynomial kernel:

k(x, z) = (1 + x^T z)²
        = (1 + x₁z₁ + x₂z₂)²
        = 1 + 2x₁z₁ + 2x₂z₂ + x₁²z₁² + 2x₁x₂z₁z₂ + x₂²z₂²
        = (1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²)^T (1, √2·z₁, √2·z₂, z₁², √2·z₁z₂, z₂²)
        = φ(x)^T φ(z)

Classify the following points with an SVM using this kernel:

t₁ = +1 : (2, 3), (1, 5)

t₂ = -1 : (1, 5), (2, 4)
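A quick check of the expansion above, using two of the sample points (the helper name phi is ours):

```python
import numpy as np

# (1 + x.z)^2 equals phi(x).phi(z) for the 6-D feature map written above.
def phi(v):
    v1, v2 = v
    return np.array([1.0, np.sqrt(2)*v1, np.sqrt(2)*v2,
                     v1**2, np.sqrt(2)*v1*v2, v2**2])

x, z = np.array([2.0, 3.0]), np.array([2.0, 4.0])
assert np.isclose((1 + x @ z) ** 2, phi(x) @ phi(z))   # both give 289.0
```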

Page 33: Support Vector Machine Tutorial 한국어

The dual QP:

maximize   L̃(a) = Σ_{n=1}^N a_n - (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

subject to   a_n ≥ 0,   Σ_{n=1}^N a_n t_n = 0

Originally this should be solved as a QP, but because the problem is so simple we solved it analytically, just as shown in the Lagrange multiplier example.

(8)   w = Σ_{n=1}^N a_n t_n φ(x_n)

Page 34: Support Vector Machine Tutorial 한국어

Computing the hyperplane in the original dimension, before the high-dimensional mapping, gives a parabola-shaped classification boundary.

Page 35: Support Vector Machine Tutorial 한국어
Page 36: Support Vector Machine Tutorial 한국어

References

1. Bishop's book (PRML)

2. MIT open lectures: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Fall-2006/LectureNotes/index.htm

3. Countless internet resources?

Page 37: Support Vector Machine Tutorial 한국어

Fast Training of Support Vector Machines using

Sequential Minimal Optimization,

Platt, John.

Jung-Kyu Lee @ MLLAB

Page 38: Support Vector Machine Tutorial 한국어

SVM revisited • the (dual) optimization problem we covered in SVM class

• Solving this optimization problem for large-scale applications using generic quadratic programming is very computationally expensive.

• John Platt, at Microsoft, proposed an efficient algorithm called SMO (Sequential Minimal Optimization) to actually solve this optimization problem.

max_a   W(a) = Σ_{i=1}^m a_i - (1/2) Σ_{i,j=1}^m y_i y_j a_i a_j ⟨x_i, x_j⟩

s.t.   0 ≤ a_i ≤ C,   i = 1, …, m

       Σ_{i=1}^m a_i y_i = 0

Page 39: Support Vector Machine Tutorial 한국어

Digression: Coordinate ascent (1/3)

• Consider trying to solve the following unconstrained optimization problem:

max_a  W(a_1, a_2, …, a_m)

• Coordinate ascent: loop until convergence; for each i = 1, …, m, set a_i := argmax_{â_i} W(a_1, …, a_{i-1}, â_i, a_{i+1}, …, a_m), holding every other a_j fixed.

Page 40: Support Vector Machine Tutorial 한국어

Digression: Coordinate ascent (2/3)

• How does coordinate ascent work?

• We optimize a quadratic function whose minimum is at (0,0).

• First, minimize this w.r.t α1.

• Next, minimize this w.r.t α2.

• With the same argument, iterate until convergence.
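A minimal sketch of those alternating one-dimensional minimizations on a quadratic (the matrix Q and the start point are made up; the minimum is at (0, 0)):

```python
import numpy as np

# Coordinate descent on f(a) = 0.5 * a^T Q a: each step minimizes f exactly
# along one coordinate while holding the other fixed.
Q = np.array([[2.0, 0.8],
              [0.8, 1.0]])          # positive definite
a = np.array([2.0, -2.0])           # arbitrary starting point

for _ in range(20):
    for i in range(2):
        j = 1 - i
        # df/da_i = Q[i,i]*a_i + Q[i,j]*a_j = 0  =>  exact 1-D minimizer:
        a[i] = -Q[i, j] * a[j] / Q[i, i]
print(a)                             # converges toward (0, 0)
```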

Page 41: Support Vector Machine Tutorial 한국어

Digression: Coordinate

ascent(3/3)

• Coordinate ascent will usually take a lot more steps than other iterative optimization algorithms such as gradient descent, Newton's method, conjugate gradient descent, etc.

• However, there are many optimization problems for which it’s

particularly easy to fix all but one of the parameters and optimize

with respect to just that one parameter.

• It turns out that this will be true when we modify this algorithm to

solve the SVM optimization problem.

Page 42: Support Vector Machine Tutorial 한국어

SMO • Coordinate ascent in its basic form cannot be applied to the SVM dual optimization problem.

• As coordinate ascent progresses, we can't change a single αi without violating the constraint Σ_i a_i y_i = 0:

max_a   W(a) = Σ_{i=1}^m a_i - (1/2) Σ_{i,j=1}^m y_i y_j a_i a_j ⟨x_i, x_j⟩

s.t.   0 ≤ a_i ≤ C,   i = 1, …, m

       Σ_{i=1}^m a_i y_i = 0

From the equality constraint,

a_1 y_1 = - Σ_{i=2}^m a_i y_i

a_1 = - y_1 Σ_{i=2}^m a_i y_i    (since y_1 ∈ {-1, 1}, y_1² = 1)

so a_1 is completely determined by the other α's.

Page 43: Support Vector Machine Tutorial 한국어

Outline of SMO • The SMO (sequential minimal optimization) algorithm, therefore, instead of trying to change one α at a time, changes two α at a time so that the constraints keep being satisfied.

• The term "minimal" refers to the fact that we're choosing the smallest number of α's that can be changed together (two).

Page 44: Support Vector Machine Tutorial 한국어

Convergence of SMO • To test for convergence of the SMO algorithm, we can check whether the KKT conditions are satisfied to within some tolerance:

a_i = 0      ⇒   y_i (w^T x_i + b) ≥ 1

0 < a_i < C  ⇒   y_i (w^T x_i + b) = 1

a_i = C      ⇒   y_i (w^T x_i + b) ≤ 1

• The key reason that SMO is an efficient algorithm is that the update of αi, αj can be computed very efficiently.
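A sketch of that convergence test, vectorized (the names and the tolerance default are ours):

```python
import numpy as np

def kkt_satisfied(a, y, fx, C, tol=1e-3):
    """a: multipliers, y: labels in {-1,+1}, fx: w^T x_i + b per sample, C: box bound."""
    m = y * fx                                # functional margins y_i (w.x_i + b)
    ok_zero = (a > tol) | (m >= 1 - tol)      # a_i = 0      -> margin >= 1
    ok_C = (a < C - tol) | (m <= 1 + tol)     # a_i = C      -> margin <= 1
    inside = (a > tol) & (a < C - tol)
    ok_in = ~inside | (np.abs(m - 1) <= tol)  # 0 < a_i < C  -> margin == 1
    return bool(np.all(ok_zero & ok_C & ok_in))
```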

Page 45: Support Vector Machine Tutorial 한국어

Detail of SMO

• Suppose we’ve decide to hold α3,…,αm , update α1, α2

aa

aaa

a

2211

m

3i

2211

m

1i

0

yy

yyy

y

ii

ii

Page 46: Support Vector Machine Tutorial 한국어

Detail of SMO

• We can thus picture the constraints on α1, α2 as follows:

• α1 and α2 must lie within the box [0,C]×[0,C] shown.

• α1 and α2 must lie on the line:  y_1 a_1 + y_2 a_2 = ζ.

• From these constraints, we know:  L ≤ a_2 ≤ H.
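The endpoints L and H of that feasible segment come from intersecting the line with the box; a sketch, following the simplified-SMO notes cited in the references:

```python
def bounds(a1, a2, y1, y2, C):
    """Feasible range [L, H] for a_2 on the constraint line inside the box."""
    if y1 != y2:                  # line of slope +1
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:                         # line of slope -1
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    return L, H
```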

Page 47: Support Vector Machine Tutorial 한국어

Detail of SMO

• From  y_1 a_1 + y_2 a_2 = ζ  we get  a_1 = (ζ - y_2 a_2) y_1  (multiplying by y_1 and using y_1² = 1).

• W(α) becomes

W(a_1, a_2, …, a_m) = W( (ζ - y_2 a_2) y_1, a_2, …, a_m ).

• Viewed as a function of a_2 alone, this is a standard quadratic function: it can be written as  a·a_2² + b·a_2 + c  for some constants a, b, c.

Page 48: Support Vector Machine Tutorial 한국어

Detail of SMO

• From  W = a·a_2² + b·a_2 + c:

• This is really easy to optimize by setting its derivative to zero.

• If you end up with a value outside [L, H], you must clip your solution.

• That'll give you the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on the straight line.

Page 49: Support Vector Machine Tutorial 한국어

Detail of SMO • In conclusion,

a_2^new =  H                    if  a_2^{new,unclipped} > H
           a_2^{new,unclipped}  if  L ≤ a_2^{new,unclipped} ≤ H
           L                    if  a_2^{new,unclipped} < L

a_1^new = a_1 + y_1 y_2 ( a_2 - a_2^new )

• We can do this process very quickly, which makes the inner loop of the SMO algorithm very efficient.
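A sketch of those two closing steps (the function names are ours):

```python
import numpy as np

def clip(a2_unclipped, L, H):
    """Project the unconstrained 1-D optimum back onto the feasible segment."""
    return float(np.clip(a2_unclipped, L, H))

def update_a1(a1_old, a2_old, a2_new, y1, y2):
    """Keep y1*a1 + y2*a2 constant, so the equality constraint still holds."""
    return a1_old + y1 * y2 * (a2_old - a2_new)
```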

Page 50: Support Vector Machine Tutorial 한국어

Conclusion • SMO

- scales to large data sets.

- makes SVM training computationally inexpensive.

- can be implemented easily because it has a simple training structure (coordinate ascent).

Page 51: Support Vector Machine Tutorial 한국어

Reference

• [1] Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).

• [2] The Simplified SMO Algorithm, Machine Learning course @ Stanford: http://www.stanford.edu/class/cs229/materials/smo.pdf