Support Vector Machines

Page 1: Support Vector Machines

Support Vector Machines

Speaker: 虞台文

Page 2: Support Vector Machines

Content

• Introduction
• The VC Dimension & Structural Risk Minimization
• Linear SVM: The Separable Case
• Linear SVM: The Non-Separable Case
• Lagrange Multipliers

Page 3: Support Vector Machines

Support Vector Machines

Introduction

Page 4: Support Vector Machines

Learning Machines

A machine to learn the mapping

$\mathbf{x}_i \mapsto y_i$

defined as

$f(\mathbf{x}, \boldsymbol{\alpha})$

Learning is done by adjusting the parameter $\boldsymbol{\alpha}$.

Page 5: Support Vector Machines

Generalization vs. Learning

How does a machine learn?
– By adjusting its parameters so as to partition the pattern (feature) space for classification.
– How are the parameters adjusted? Traditional approaches minimize the empirical risk.

What has the machine learned?
– Does it memorize the patterns it sees, or
– does it memorize the rules it finds for the different classes?
– What does the machine actually learn if it minimizes the empirical risk only?

Page 6: Support Vector Machines

Risks

Expected risk (test error):

$R(\alpha) = \int \tfrac{1}{2}\,|y - f(\mathbf{x}, \alpha)|\; dP(\mathbf{x}, y)$

Empirical risk (training error):

$R_{\mathrm{emp}}(\alpha) = \dfrac{1}{2l} \sum_{i=1}^{l} |y_i - f(\mathbf{x}_i, \alpha)|$

Is $R_{\mathrm{emp}}(\alpha) \approx R(\alpha)$?
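
To ground these two quantities, here is a tiny sketch (the function name and the $\{-1,+1\}$ label convention are assumptions consistent with the slides) that evaluates the empirical risk of a classifier on a labeled sample:

```python
import numpy as np

def empirical_risk(y, y_hat):
    """R_emp = 1/(2l) * sum_i |y_i - f(x_i)|, for labels in {-1, +1}.

    Each mismatch contributes |y_i - y_hat_i| = 2, so this equals the
    fraction of misclassified training patterns.
    """
    return np.sum(np.abs(y - y_hat)) / (2 * len(y))

y     = np.array([ 1, -1,  1,  1, -1])   # true labels (made-up)
y_hat = np.array([ 1, -1, -1,  1, -1])   # classifier outputs (made-up)
print(empirical_risk(y, y_hat))          # 0.2: one error in five
```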

Page 7: Support Vector Machines

More on Empirical Risk

How can we make the empirical risk arbitrarily small?
– Let the machine have a very large memorization capacity.

Does a machine with a small empirical risk also achieve a small expected risk?

How do we keep the machine from merely memorizing the training patterns instead of generalizing?

How do we deal with the memorization capacity of a machine?

What should the new criterion be?

Page 8: Support Vector Machines

Structural Risk Minimization

Goal: learn both the right 'structure' and the right 'rules' for classification.

Right rules: e.g., the right amount and the right forms of components or parameters are to participate in the learning machine.

Right structure: the empirical risk will also be reduced if the right rules are learned.

Page 9: Support Vector Machines

New Criterion

Total Risk = Empirical Risk + Risk due to the structure of the learning machine

Page 10: Support Vector Machines

Support Vector Machines

The VC Dimension &
Structural Risk Minimization

Page 11: Support Vector Machines

The VC Dimension

Consider a set of functions $f(\mathbf{x}, \alpha) \in \{-1, +1\}$. A given set of $l$ points can be labeled in $2^l$ ways. If for every such labeling a member of the set $\{f(\alpha)\}$ can be found which correctly assigns those labels, then the set of points is shattered by that set of functions.

The VC dimension of $\{f(\alpha)\}$ is the maximum number of training points that can be shattered by $\{f(\alpha)\}$.

VC: Vapnik-Chervonenkis

Page 12: Support Vector Machines

The VC Dimension for Oriented Lines in $\mathbb{R}^2$

VC dimension = 3

Page 13: Support Vector Machines

More on VC Dimension

In general, the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is $n+1$.

The VC dimension is a measure of memorization capability.

The VC dimension is not directly related to the number of parameters: Vapnik (1995) gives an example with one parameter and infinite VC dimension.

Page 14: Support Vector Machines

Bound on Expected Risk

Expected risk: $R(\alpha) = \int \tfrac{1}{2}|y - f(\mathbf{x},\alpha)|\; dP(\mathbf{x},y)$

Empirical risk: $R_{\mathrm{emp}}(\alpha) = \dfrac{1}{2l}\sum_{i=1}^{l}|y_i - f(\mathbf{x}_i,\alpha)|$

With probability $1-\eta$,

$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\dfrac{h\,(\log(2l/h)+1) - \log(\eta/4)}{l}}$

where $h$ is the VC dimension. The square-root term is the VC confidence.

Page 15: Support Vector Machines

Bound on Expected Risk

With probability $1-\eta$,

$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\dfrac{h\,(\log(2l/h)+1) - \log(\eta/4)}{l}}$

where the square-root term is the VC confidence.

Consider a small $\eta$ (e.g., $\eta = 0.05$).

Page 16: Support Vector Machines

Bound on Expected Risk

Consider a small $\eta$ (e.g., $\eta = 0.05$).

$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\dfrac{h\,(\log(2l/h)+1) - \log(\eta/4)}{l}}$

Traditional approaches minimize the empirical risk only.

Structural risk minimization minimizes the bound.

Page 17: Support Vector Machines

VC Confidence

$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\dfrac{h\,(\log(2l/h)+1) - \log(\eta/4)}{l}}$

[Figure: the VC confidence as a function of the VC dimension $h$, for $\eta = 0.05$ and $l = 10{,}000$.]

Amongst machines with zero empirical risk, choose the one with the smallest VC dimension.

How do we evaluate the VC dimension?
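
The VC confidence is easy to evaluate numerically once $h$, $l$, and $\eta$ are fixed. Below is a small sketch (values and names are illustrative; it simply codes the square-root term above) showing how the confidence grows with $h$ for $\eta = 0.05$ and $l = 10{,}000$:

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """VC confidence: sqrt( (h*(log(2l/h)+1) - log(eta/4)) / l )."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = 10_000
for h in (10, 100, 1_000, 5_000):
    print(f"h = {h:5d}  ->  VC confidence = {vc_confidence(h, l):.3f}")
```

The term increases monotonically in $h$ (for $h < 2l$), which is why, among machines with zero empirical risk, the bound favors the smallest VC dimension.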

Page 18: Support Vector Machines

Structural Risk Minimization

[Figure: nested subsets of functions with VC dimensions $h_1 \le h_2 \le h_3 \le h_4$.]

Nested subsets of functions with different VC dimensions:

$h_1 \le h_2 \le h_3 \le h_4$

Page 19: Support Vector Machines

Support Vector Machines

The Linear SVM
The Separable Case

Page 20: Support Vector Machines

The Linear Separability

[Figure: a linearly separable pattern set vs. one that is not linearly separable.]

Page 21: Support Vector Machines

The Linear Separability

[Figure: a separating hyperplane $\mathbf{w}\cdot\mathbf{x} + b = 0$ with the two margin hyperplanes $\mathbf{w}\cdot\mathbf{x} + b = +1$ and $\mathbf{w}\cdot\mathbf{x} + b = -1$.]

The patterns are linearly separable if there exist $\mathbf{w}$ and $b$ such that

$\mathbf{w}\cdot\mathbf{x}_i + b \ge +1$ for $y_i = +1$
$\mathbf{w}\cdot\mathbf{x}_i + b \le -1$ for $y_i = -1$

or, combining both cases,

$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

Page 22: Support Vector Machines

Margin Width

[Figure: the margin between the hyperplanes $\mathbf{w}\cdot\mathbf{x}+b = +1$ and $\mathbf{w}\cdot\mathbf{x}+b = -1$.]

$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

The distances from the origin to the two margin hyperplanes are $d_+ = \dfrac{|1-b|}{\|\mathbf{w}\|}$ and $d_- = \dfrac{|-1-b|}{\|\mathbf{w}\|}$, so the margin width is

$\dfrac{2}{\|\mathbf{w}\|}$

How about maximizing the margin?

What is the relation between the margin width and the VC dimension?

Page 23: Support Vector Machines

Maximum Margin Classifier

[Figure: the maximum-margin hyperplane; the patterns lying on the margin hyperplanes are the supporters (support vectors).]

$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

Margin width: $\dfrac{2}{\|\mathbf{w}\|}$

How about maximizing the margin?

What is the relation between the margin width and the VC dimension?

Page 24: Support Vector Machines

Building SVM

Minimize

$\tfrac{1}{2}\|\mathbf{w}\|^2$

Subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

This requires knowledge of Lagrange multipliers.

Page 25: Support Vector Machines

The Method of Lagrange

Minimize

$\tfrac{1}{2}\|\mathbf{w}\|^2$

Subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

The Lagrangian:

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right], \qquad \alpha_i \ge 0$

Minimize it w.r.t. $\mathbf{w}$ and $b$, while maximizing it w.r.t. $\boldsymbol{\alpha}$.

Page 26: Support Vector Machines

The Method of Lagrange

Minimize

$\tfrac{1}{2}\|\mathbf{w}\|^2$

Subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

The Lagrangian:

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right], \qquad \alpha_i \ge 0$

Minimize it w.r.t. $\mathbf{w}$ and $b$, while maximizing it w.r.t. $\boldsymbol{\alpha}$.

What value should $\alpha_i$ take if its constraint is feasible and nonzero? How about if the constraint equals zero?

Page 27: Support Vector Machines

The Method of Lagrange

Minimize

$\tfrac{1}{2}\|\mathbf{w}\|^2$

Subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

The Lagrangian:

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right], \qquad \alpha_i \ge 0$

Expanding,

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{w}^T\mathbf{x}_i + b) + \sum_{i=1}^{l} \alpha_i$

Page 28: Support Vector Machines

Duality

The primal: minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \ \forall i$, with

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{w}^T\mathbf{x}_i + b) + \sum_{i=1}^{l} \alpha_i$

The dual: maximize $L(\mathbf{w}^*, b^*; \boldsymbol{\alpha})$

Subject to

$\nabla_{\mathbf{w},b}\, L(\mathbf{w}, b; \boldsymbol{\alpha}) = \mathbf{0}$
$\alpha_i \ge 0, \quad i = 1, \dots, l$

Page 29: Support Vector Machines

Duality

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{w}^T\mathbf{x}_i + b) + \sum_{i=1}^{l} \alpha_i$

Maximize $L(\mathbf{w}^*, b^*; \boldsymbol{\alpha})$ subject to $\nabla_{\mathbf{w},b} L = \mathbf{0}$ and $\alpha_i \ge 0$, $i = 1, \dots, l$.

Setting the derivatives to zero:

$\nabla_{\mathbf{w}} L = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = \mathbf{0} \;\Rightarrow\; \mathbf{w}^* = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$

$\dfrac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$

Page 30: Support Vector Machines

Duality

Substituting $\mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$ back into the Lagrangian:

$L(\mathbf{w}^*, b^*; \boldsymbol{\alpha}) = \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - \sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j - b\sum_{i=1}^{l}\alpha_i y_i + \sum_{i=1}^{l}\alpha_i$

$= \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle \;\equiv\; F(\boldsymbol{\alpha})$

Maximize $F(\boldsymbol{\alpha})$.

Page 31: Support Vector Machines

Duality

$L(\mathbf{w}^*, b^*; \boldsymbol{\alpha}) = \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle = F(\boldsymbol{\alpha})$

In matrix form, with $D_{ij} = y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j$:

Maximize $F(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{1} - \tfrac{1}{2}\,\boldsymbol{\alpha}^T D\,\boldsymbol{\alpha}$

Subject to $\boldsymbol{\alpha}^T\mathbf{y} = 0$

Page 32: Support Vector Machines

Duality

The Primal

Minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$
Subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i$

The Dual

Maximize $F(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{1} - \tfrac{1}{2}\,\boldsymbol{\alpha}^T D\,\boldsymbol{\alpha}$
Subject to $\boldsymbol{\alpha} \ge \mathbf{0}$ and $\boldsymbol{\alpha}^T\mathbf{y} = 0$

Page 33: Support Vector Machines

The Solution

The Dual

Maximize $F(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{1} - \tfrac{1}{2}\,\boldsymbol{\alpha}^T D\,\boldsymbol{\alpha}$
Subject to $\boldsymbol{\alpha} \ge \mathbf{0}$ and $\boldsymbol{\alpha}^T\mathbf{y} = 0$

Find $\boldsymbol{\alpha}^*$ by quadratic programming. Then

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

$b^* = \;?$
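
To make "find $\boldsymbol{\alpha}^*$ by quadratic programming" concrete, here is a minimal sketch using the cvxopt QP solver on a made-up toy set (the data, tolerance, and variable names are illustrative, not from the slides). cvxopt minimizes $\tfrac{1}{2}\mathbf{a}^T P \mathbf{a} + \mathbf{q}^T\mathbf{a}$, so maximizing $F(\boldsymbol{\alpha})$ means passing $P = D$ and $\mathbf{q} = -\mathbf{1}$:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

# D_ij = y_i y_j <x_i, x_j>
D = (y[:, None] * y[None, :]) * (X @ X.T)

P = matrix(D)
q = matrix(-np.ones(l))
G = matrix(-np.eye(l))            # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(l))
A = matrix(y.reshape(1, -1))      # equality constraint alpha^T y = 0
b = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

w = (alpha * y) @ X                      # w* = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                        # supporters: alpha_i > 0
b_star = np.mean(y[sv] - X[sv] @ w)      # KKT: y_i (w.x_i + b) = 1 on supporters
print("w* =", w, " b* =", b_star)
```

The `b_star` line anticipates the next slide: by the Karush-Kuhn-Tucker conditions, every supporter sits exactly on its margin hyperplane.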

Page 34: Support Vector Machines

The Solution

Find $\boldsymbol{\alpha}^*$ by quadratic programming. Then

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

The Lagrangian:

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right]$

By the Karush-Kuhn-Tucker conditions, for any $i$ with $\alpha_i > 0$,

$b^* = y_i - \mathbf{w}^{*T}\mathbf{x}_i$

Call $\mathbf{x}_i$ a support vector if $\alpha_i > 0$.

Page 35: Support Vector Machines

The Karush-Kuhn-Tucker Conditions

For the Lagrangian

$L(\mathbf{w}, b; \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right]$

the KKT conditions are:

$\nabla_{\mathbf{w}} L = \mathbf{w} - \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i = \mathbf{0}$

$\dfrac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0$

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \ge 0, \quad i = 1, \dots, l$

$\alpha_i \ge 0, \quad i = 1, \dots, l$

$\alpha_i \left[ y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \right] = 0, \quad i = 1, \dots, l$

Page 36: Support Vector Machines

Classification

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

$f(\mathbf{x}) = \operatorname{sgn}\left(\mathbf{w}^{*T}\mathbf{x} + b^*\right) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i^* y_i \langle\mathbf{x}_i, \mathbf{x}\rangle + b^*\right) = \operatorname{sgn}\left(\sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle\mathbf{x}_i, \mathbf{x}\rangle + b^*\right)$

Page 37: Support Vector Machines

Classification Using Supporters

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle\mathbf{x}_i, \mathbf{x}\rangle + b^*\right)$

$\langle\mathbf{x}_i, \mathbf{x}\rangle$: the similarity measure between the input and the $i$th support vector.
$\alpha_i^* y_i$: the weight for the $i$th support vector.
$b^*$: the bias.
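
As a sketch of this decision rule (array names are hypothetical: `sv_x`, `sv_y`, `sv_alpha` hold only the supporters):

```python
import numpy as np

def svm_classify(x, sv_x, sv_y, sv_alpha, b):
    """f(x) = sgn( sum_{alpha_i > 0} alpha_i y_i <x_i, x> + b ).

    sv_x: (n_sv, d) support vectors; sv_y: (n_sv,) labels;
    sv_alpha: (n_sv,) multipliers; b: bias.
    """
    similarities = sv_x @ x                  # <x_i, x> for each supporter
    return np.sign((sv_alpha * sv_y) @ similarities + b)
```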

Page 38: Support Vector Machines

Demonstration

Page 39: Support Vector Machines

Support Vector Machines

The Linear SVM

The Non-Separable Case

Page 40: Support Vector Machines

The Non-Separable Case

[Figure: hyperplanes $\mathbf{w}\cdot\mathbf{x}+b = 0, \pm 1$; a pattern violating its margin is assigned a slack $\xi_i$.]

We require that

$\mathbf{w}\cdot\mathbf{x}_i + b \ge +1 - \xi_i$ for $y_i = +1$
$\mathbf{w}\cdot\mathbf{x}_i + b \le -1 + \xi_i$ for $y_i = -1$

or, combined,

$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 + \xi_i \ge 0 \quad \forall i$
$\xi_i \ge 0 \quad \forall i$

Page 41: Support Vector Machines

Mathematical Formulation

Minimize

$\tfrac{1}{2}\|\mathbf{w}\|^2 + C\left(\sum_i \xi_i\right)^k$

Subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0 \quad \forall i$
$\xi_i \ge 0 \quad \forall i$

For simplicity, we consider $k = 1$.

Page 42: Support Vector Machines

The Lagrangian

Minimize

$\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

Subject to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0 \quad \forall i$
$\xi_i \ge 0 \quad \forall i$

The Lagrangian:

$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_i \mu_i\xi_i$

with $\alpha_i \ge 0$, $\mu_i \ge 0$.

Page 43: Support Vector Machines

Duality

$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_i \mu_i\xi_i, \qquad \alpha_i \ge 0,\ \mu_i \ge 0$

The primal: minimize $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i$ subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \ \forall i$.

The dual: maximize $L(\mathbf{w}^*, b^*, \boldsymbol{\xi}^*; \boldsymbol{\alpha}, \boldsymbol{\mu})$

Subject to

$\nabla_{\mathbf{w},b,\boldsymbol{\xi}}\, L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \mathbf{0}$
$\boldsymbol{\alpha} \ge \mathbf{0}, \quad \boldsymbol{\mu} \ge \mathbf{0}$

Page 44: Support Vector Machines

Duality

$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_i \mu_i\xi_i, \qquad \alpha_i \ge 0,\ \mu_i \ge 0$

Setting the derivatives to zero:

$\nabla_{\mathbf{w}} L = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0} \;\Rightarrow\; \mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$

$\dfrac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$

$\dfrac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C$

Page 45: Support Vector Machines

Duality

Substituting back into the Lagrangian (the slack terms cancel, since $C - \alpha_i - \mu_i = 0$):

$L(\mathbf{w}^*, b^*, \boldsymbol{\xi}^*; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle = F(\boldsymbol{\alpha})$

Maximize this.

Page 46: Support Vector Machines

Duality

$L(\mathbf{w}^*, b^*, \boldsymbol{\xi}^*; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle = F(\boldsymbol{\alpha})$

In matrix form:

Maximize $F(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{1} - \tfrac{1}{2}\,\boldsymbol{\alpha}^T D\,\boldsymbol{\alpha}$

Subject to $\boldsymbol{\alpha}^T\mathbf{y} = 0$ and $\mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$

Page 47: Support Vector Machines

Duality

The Primal

Minimize $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\left(\sum_i \xi_i\right)^k$
Subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \quad \forall i$

The Dual

Maximize $F(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{1} - \tfrac{1}{2}\,\boldsymbol{\alpha}^T D\,\boldsymbol{\alpha}$
Subject to $\mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$ and $\boldsymbol{\alpha}^T\mathbf{y} = 0$
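
Compared with the separable case, the only change to the quadratic program is the box constraint $\mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$. A minimal sketch with cvxopt, assuming the same conventions as the hard-margin sketch above (the function and variable names are illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, y, C):
    """Solve max alpha^T 1 - 1/2 alpha^T D alpha,
    s.t. 0 <= alpha <= C and alpha^T y = 0.
    X: (l, d) float array; y: (l,) float labels in {-1, +1}."""
    l = len(y)
    D = (y[:, None] * y[None, :]) * (X @ X.T)
    P, q = matrix(D), matrix(-np.ones(l))
    # Stack the two one-sided bounds: -alpha <= 0 and alpha <= C.
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)
    return np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
```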

Page 48: Support Vector Machines

The Karush-Kuhn-Tucker Conditions

$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_i \mu_i\xi_i$

The KKT conditions:

$\nabla_{\mathbf{w}} L = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = \mathbf{0}$

$\dfrac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$

$\dfrac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$

$y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i \ge 0$

$\xi_i \ge 0, \quad \alpha_i \ge 0, \quad \mu_i \ge 0$

$\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] = 0$

$\mu_i\xi_i = 0$

Page 49: Support Vector Machines

The Solution

The Dual

Maximize $F(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\mathbf{1} - \tfrac{1}{2}\,\boldsymbol{\alpha}^T D\,\boldsymbol{\alpha}$
Subject to $\mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$ and $\boldsymbol{\alpha}^T\mathbf{y} = 0$

Find $\boldsymbol{\alpha}^*$ by quadratic programming. Then

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

$b^* = \;? \qquad \boldsymbol{\xi}^* = \;?$

Page 50: Support Vector Machines

The Solution

Find $\boldsymbol{\alpha}^*$ by quadratic programming. Then

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

The Lagrangian:

$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_i \mu_i\xi_i$

From the KKT conditions, for any $i$ with $0 < \alpha_i < C$ (so $\mu_i > 0$ and hence $\xi_i = 0$),

$b^* = y_i - \mathbf{w}^{*T}\mathbf{x}_i$

Call $\mathbf{x}_i$ a support vector if $0 < \alpha_i < C$.

Page 51: Support Vector Machines

The Solution

Find $\boldsymbol{\alpha}^*$ by quadratic programming. Then

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

$b^* = y_i - \mathbf{w}^{*T}\mathbf{x}_i$ for any $i$ with $0 < \alpha_i < C$

Call $\mathbf{x}_i$ a support vector if $0 < \alpha_i < C$.

The slacks follow as

$\xi_i = \max\left(0,\; 1 - y_i(\mathbf{w}^*\cdot\mathbf{x}_i + b^*)\right)$

A pattern is falsely classified if $\xi_i > 1$ (for $0 < \xi_i \le 1$ it lies inside the margin but on the correct side).
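
A short sketch of recovering $\mathbf{w}^*$, $b^*$, and the slacks from a solved dual (`alpha`, `X`, `y`, `C` are hypothetical names, e.g. the inputs and output of the soft-margin QP sketch above; the tolerance is an assumption):

```python
import numpy as np

def recover_solution(alpha, X, y, C, tol=1e-8):
    """Recover w*, b*, and slacks xi from a solved soft-margin dual.

    alpha: (l,) dual solution; X: (l, d); y: (l,) in {-1, +1}; C: penalty.
    """
    w = (alpha * y) @ X                            # w* = sum_i alpha_i y_i x_i
    margin = (alpha > tol) & (alpha < C - tol)     # 0 < alpha_i < C => xi_i = 0
    b = np.mean(y[margin] - X[margin] @ w)         # b* = y_i - w*.x_i (averaged)
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))    # xi_i > 1 => misclassified
    return w, b, xi
```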

Page 52: Support Vector Machines

Classification

$\mathbf{w}^* = \sum_{i=1}^{l} \alpha_i^* y_i \mathbf{x}_i$

$f(\mathbf{x}) = \operatorname{sgn}\left(\mathbf{w}^{*T}\mathbf{x} + b^*\right) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i^* y_i \langle\mathbf{x}_i, \mathbf{x}\rangle + b^*\right) = \operatorname{sgn}\left(\sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle\mathbf{x}_i, \mathbf{x}\rangle + b^*\right)$

Page 53: Support Vector Machines

Classification Using Supporters

$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle\mathbf{x}_i, \mathbf{x}\rangle + b^*\right)$

$\langle\mathbf{x}_i, \mathbf{x}\rangle$: the similarity measure between the input and the $i$th support vector.
$\alpha_i^* y_i$: the weight for the $i$th support vector.
$b^*$: the bias.

Page 54: Support Vector Machines

Support Vector Machines

Lagrange Multipliers

Page 55: Support Vector Machines

Lagrange Multiplier

The "Lagrange multiplier method" is a powerful tool for constrained optimization.

Contributed by Joseph-Louis Lagrange.

Page 56: Support Vector Machines

Milkmaid Problem

Page 57: Support Vector Machines

Milkmaid Problem

Page 58: Support Vector Machines

Milkmaid Problem

[Figure: the milkmaid at $(x_1, y_1)$, the cow at $(x_2, y_2)$, and a point $(x, y)$ on the river bank $g(x, y) = 0$.]

Goal: Minimize

$f(x, y) = \sum_{i=1}^{2}\sqrt{(x - x_i)^2 + (y - y_i)^2}$

Subject to

$g(x, y) = 0$

Page 59: Support Vector Machines

Observation

[Figure: at the constrained minimum $(x^*, y^*)$ the gradients of $f$ and $g$ are parallel.]

Goal: Minimize $f(x, y) = \sum_{i=1}^{2}\sqrt{(x - x_i)^2 + (y - y_i)^2}$ subject to $g(x, y) = 0$.

At the extreme point, say $(x^*, y^*)$:

$\nabla f(x^*, y^*) = \lambda\, \nabla g(x^*, y^*)$

Page 60: Support Vector Machines

Optimization with Equality Constraints

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$.

Lemma: At an extreme point, say $\mathbf{x}^*$, we have

$\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*) \quad \text{if } \nabla g(\mathbf{x}^*) \ne \mathbf{0}$

Page 61: Support Vector Machines

Proof. We show $\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*)$ if $\nabla g(\mathbf{x}^*) \ne \mathbf{0}$.

Let $\mathbf{x}^*$ be an extreme point and let $\mathbf{r}(t)$ be any differentiable path on the surface $g(\mathbf{x}) = 0$ such that $\mathbf{r}(t_0) = \mathbf{x}^*$. Then $\dot{\mathbf{r}}(t_0)$ is a vector tangent to the surface $g(\mathbf{x}) = 0$ at $\mathbf{x}^*$.

Consider $f(\mathbf{x}^*) = f(\mathbf{r}(t))\big|_{t = t_0}$. Since $\mathbf{x}^*$ is an extreme point and $\frac{d}{dt} f(\mathbf{r}(t)) = \nabla f(\mathbf{r}(t)) \cdot \dot{\mathbf{r}}(t)$,

$0 = \nabla f(\mathbf{r}(t_0)) \cdot \dot{\mathbf{r}}(t_0) = \nabla f(\mathbf{x}^*) \cdot \dot{\mathbf{r}}(t_0)$

Page 62: Support Vector Machines

Proof (continued).

$0 = \nabla f(\mathbf{x}^*) \cdot \dot{\mathbf{r}}(t_0)$, i.e., $\nabla f(\mathbf{x}^*) \perp \dot{\mathbf{r}}(t_0)$.

This is true for any path $\mathbf{r}$ passing through $\mathbf{x}^*$ on the surface $g(\mathbf{x}) = 0$. It implies that $\nabla f(\mathbf{x}^*) \perp T$, where $T$ is the tangent plane of the surface $g(\mathbf{x}) = 0$ at $\mathbf{x}^*$. Since $\nabla g(\mathbf{x}^*)$ is also perpendicular to $T$, the two gradients are parallel: $\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*)$. ∎

Page 63: Support Vector Machines

Optimization with Equality Constraints

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$.

Lemma: At an extreme point, say $\mathbf{x}^*$, we have

$\nabla f(\mathbf{x}^*) = \lambda\, \nabla g(\mathbf{x}^*) \quad \text{if } \nabla g(\mathbf{x}^*) \ne \mathbf{0}$

$\lambda$ is called the Lagrange multiplier.

Page 64: Support Vector Machines

The Method of Lagrange

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$.

Find the extreme points by solving the following equations:

$\nabla f(\mathbf{x}) = \lambda\, \nabla g(\mathbf{x})$
$g(\mathbf{x}) = 0$

With $\mathbf{x}$ of dimension $n$: $n + 1$ equations in $n + 1$ unknowns.
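
As a sketch, these $n + 1$ equations can be handed to a numerical root finder. Below, the milkmaid problem with a hypothetical straight river $g(x, y) = y$ and made-up positions, solved with scipy.optimize.fsolve:

```python
import numpy as np
from scipy.optimize import fsolve

p1 = np.array([0.0, 2.0])   # milkmaid (made-up position)
p2 = np.array([4.0, 1.0])   # cow (made-up position)

def grad_f(p):
    """Gradient of f(x, y) = |p - p1| + |p - p2| (sum of distances)."""
    return (p - p1) / np.linalg.norm(p - p1) + (p - p2) / np.linalg.norm(p - p2)

def equations(v):
    x, y, lam = v
    fx, fy = grad_f(np.array([x, y]))
    # grad f = lambda * grad g, with grad g = (0, 1) for g(x, y) = y; plus g = 0.
    return [fx - lam * 0.0, fy - lam * 1.0, y]

x, y, lam = fsolve(equations, x0=[1.0, 0.0, 1.0])
print(f"touch the river at ({x:.4f}, {y:.4f}), lambda = {lam:.4f}")
```

The root lands where the angle of incidence equals the angle of reflection, as the geometry of the problem predicts.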

Page 65: Support Vector Machines

Lagrangian

Goal: Min/Max $f(\mathbf{x})$ subject to $g(\mathbf{x}) = 0$ (constrained optimization).

Define the Lagrangian

$L(\mathbf{x}; \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x})$

Solve

$\nabla_{\mathbf{x}}\, L(\mathbf{x}; \lambda) = \mathbf{0}, \qquad \dfrac{\partial L(\mathbf{x}; \lambda)}{\partial \lambda} = 0$

(unconstrained optimization).

Page 66: Support Vector Machines

Optimization with Multiple Equality Constraints

Min/Max $f(\mathbf{x})$ subject to $g_i(\mathbf{x}) = 0$, $i = 1, \dots, m$.

Define the Lagrangian

$L(\mathbf{x}; \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}), \qquad \boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_m)^T$

Solve

$\nabla_{\mathbf{x}, \boldsymbol{\lambda}}\, L(\mathbf{x}; \boldsymbol{\lambda}) = \mathbf{0}$

Page 67: Support Vector Machines

Optimization with Inequality Constraints

Minimize $f(\mathbf{x})$

Subject to $g_i(\mathbf{x}) = 0$, $i = 1, \dots, m$; $h_j(\mathbf{x}) \le 0$, $j = 1, \dots, n$.

You can always reformulate your problem into the above form.

Page 68: Support Vector Machines

Lagrange Multipliers

Minimize $f(\mathbf{x})$
Subject to $g_i(\mathbf{x}) = 0$, $i = 1, \dots, m$; $h_j(\mathbf{x}) \le 0$, $j = 1, \dots, n$.

Lagrangian:

$L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$

$\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_m)^T, \qquad \boldsymbol{\mu} = (\mu_1, \dots, \mu_n)^T, \qquad \mu_j \ge 0$

Page 69: Support Vector Machines

Lagrange Multipliers

Minimize $f(\mathbf{x})$
Subject to $g_i(\mathbf{x}) = 0$, $i = 1, \dots, m$; $h_j(\mathbf{x}) \le 0$, $j = 1, \dots, n$.

Lagrangian:

$L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x}), \qquad \mu_j \ge 0$

For feasible solutions, $\sum_i \lambda_i\, g_i(\mathbf{x}) = 0$ and $\sum_j \mu_j\, h_j(\mathbf{x}) \le 0$ (nonpositive).

Page 70: Support Vector Machines

Duality

$L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$

Let $\mathbf{x}^*$ be a local extreme, i.e., $\nabla_{\mathbf{x}} L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu})\big|_{\mathbf{x} = \mathbf{x}^*} = \mathbf{0}$.

Define

$D(\boldsymbol{\lambda}, \boldsymbol{\mu}) = L(\mathbf{x}^*; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}^*) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}^*) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x}^*) \;\le\; f(\mathbf{x}^*)$

Maximize it w.r.t. $\boldsymbol{\lambda}, \boldsymbol{\mu}$.

Page 71: Support Vector Machines

Duality

$L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$

Let $\mathbf{x}^*$ be a local extreme, i.e., $\nabla_{\mathbf{x}} L\big|_{\mathbf{x} = \mathbf{x}^*} = \mathbf{0}$. Define

$D(\boldsymbol{\lambda}, \boldsymbol{\mu}) = L(\mathbf{x}^*; \boldsymbol{\lambda}, \boldsymbol{\mu}) \le f(\mathbf{x}^*)$

and maximize it w.r.t. $\boldsymbol{\lambda}, \boldsymbol{\mu}$.

What are we doing? We minimize the Lagrangian w.r.t. $\mathbf{x}$ while maximizing it w.r.t. $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$.

Page 72: Support Vector Machines

Saddle Point Determination

$L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$

[Figure: the solution is a saddle point of the Lagrangian, a minimum along $\mathbf{x}$ and a maximum along $\boldsymbol{\lambda}, \boldsymbol{\mu}$.]

Page 73: Support Vector Machines

Duality

The primal:

Minimize $f(\mathbf{x})$
Subject to $g_i(\mathbf{x}) = 0$, $i = 1, \dots, m$; $h_j(\mathbf{x}) \le 0$, $j = 1, \dots, n$.

The dual:

Maximize $L(\mathbf{x}^*; \boldsymbol{\lambda}, \boldsymbol{\mu})$
Subject to $\nabla_{\mathbf{x}}\, L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = \mathbf{0}$ and $\boldsymbol{\mu} \ge \mathbf{0}$

where $L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$.

Page 74: Support Vector Machines

The Karush-Kuhn-Tucker Conditions

For the Lagrangian

$L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, h_j(\mathbf{x})$

the KKT conditions are:

$\nabla_{\mathbf{x}} L(\mathbf{x}; \boldsymbol{\lambda}, \boldsymbol{\mu}) = \nabla f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i\, \nabla g_i(\mathbf{x}) + \sum_{j=1}^{n} \mu_j\, \nabla h_j(\mathbf{x}) = \mathbf{0}$

$g_i(\mathbf{x}) = 0, \quad i = 1, \dots, m$

$h_j(\mathbf{x}) \le 0, \quad j = 1, \dots, n$

$\mu_j \ge 0, \quad j = 1, \dots, n$

$\mu_j\, h_j(\mathbf{x}) = 0, \quad j = 1, \dots, n$
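
As a closing sketch, the KKT conditions can be solved symbolically for a small made-up problem (one inequality constraint, no equality constraints; the objective and constraint are invented for illustration). Stationarity plus complementary slackness enumerate the candidate points, and the sign conditions filter them:

```python
import sympy as sp

x, y, mu = sp.symbols('x y mu', real=True)
f = (x - 2)**2 + (y - 1)**2          # made-up objective
h = x + y - 2                        # made-up constraint h(x, y) <= 0

L = f + mu * h                       # Lagrangian
stationarity = [sp.diff(L, v) for v in (x, y)]

# Stationarity + complementary slackness (mu * h = 0) give the candidates;
# feasibility (h <= 0) and dual feasibility (mu >= 0) filter them.
candidates = sp.solve(stationarity + [mu * h], [x, y, mu], dict=True)
kkt_points = [s for s in candidates if s[mu] >= 0 and h.subs(s) <= 0]
print(kkt_points)                    # [{x: 3/2, y: 1/2, mu: 1}]
```

The unconstrained minimum $(2, 1)$ violates $h \le 0$, so the active-constraint candidate with $\mu = 1 > 0$ is the solution, exactly the pattern the SVM duals above exploit.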