
1

LINEAR CLASSIFIERS

The Problem: Consider a two-class task with classes $\omega_1, \omega_2$ and the linear discriminant function

$g(x) = w^T x + w_0 = w_1 x_1 + w_2 x_2 + \dots + w_l x_l + w_0 = 0$

Assume $x_1, x_2$ on the decision hyperplane:

$w^T x_1 + w_0 = w^T x_2 + w_0 = 0 \;\Rightarrow\; w^T (x_1 - x_2) = 0, \quad \forall x_1, x_2$ on the hyperplane

2

Hence: $w$ is orthogonal to the decision hyperplane

$g(x) = w^T x + w_0 = 0$

The distance of the hyperplane from the origin and the distance of a point $x$ from the hyperplane are, respectively,

$d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}, \qquad z = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}$
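As a quick numerical illustration of these two formulas (the weight values below are made up for the example):

```python
import numpy as np

# Hypothetical hyperplane g(x) = w^T x + w0 with w = [1, 1], w0 = -0.5
w = np.array([1.0, 1.0])
w0 = -0.5

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

# Distance of the hyperplane from the origin: d = |w0| / ||w||
d = abs(w0) / np.linalg.norm(w)

# Distance of an arbitrary point x from the hyperplane: z = |g(x)| / ||w||
x = np.array([0.4, 0.05])
z = abs(g(x)) / np.linalg.norm(w)
print(d, z)
```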

3

The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists $w^*$ such that

$w^{*T} x > 0 \quad \forall x \in \omega_1$
$w^{*T} x < 0 \quad \forall x \in \omega_2$

The case $w^{*T} x + w_0^*$ falls under the above formulation, since, defining the extended vectors

$x' = [x^T, 1]^T, \qquad w' = [w^{*T}, w_0^*]^T,$

we have $w^{*T} x + w_0^* = w'^T x'$.

4

Our goal: Compute a solution, i.e., a hyperplane w, so that

$w^T x \gtrless 0, \quad x \in \omega_1 (\omega_2)$

The steps
– Define a cost function to be minimized.
– Choose an algorithm to minimize the cost function.
– The minimum corresponds to a solution.

5

The Cost Function

$J(w) = \sum_{x \in Y} \delta_x (w^T x)$

where Y is the subset of the vectors wrongly classified by w, and

$\delta_x = -1 \text{ if } x \in Y \text{ and } x \in \omega_1$
$\delta_x = +1 \text{ if } x \in Y \text{ and } x \in \omega_2$

so that $J(w) \ge 0$. When $Y = \varnothing$ (empty set) a solution is achieved and $J(w) = 0$.

6

• J(w) is piecewise linear (WHY?)

The Algorithm
• The philosophy of gradient descent is adopted.

7

$w(\text{new}) = w(\text{old}) + \Delta w$

$\Delta w = -\rho \left. \frac{\partial J(w)}{\partial w} \right|_{w = w(\text{old})}$

$\frac{\partial J(w)}{\partial w} = \sum_{x \in Y} \delta_x x$

Wherever valid

$w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x$

This is the celebrated Perceptron Algorithm.
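A minimal sketch of this update rule in Python/NumPy, assuming a constant learning rate ρ and the δ_x convention defined above (function and variable names are illustrative):

```python
import numpy as np

def perceptron_batch(X, labels, rho=0.1, max_iter=1000):
    """Batch perceptron: w(t+1) = w(t) - rho * sum_{x in Y} delta_x * x.

    X: (N, l) data in extended form (last column = 1).
    labels: (N,) with +1 for omega_1 and -1 for omega_2.
    delta_x = -1 for misclassified omega_1 points, +1 for misclassified omega_2 points.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        scores = X @ w
        # misclassified set Y: w^T x <= 0 for omega_1, w^T x >= 0 for omega_2
        mis = labels * scores <= 0
        if not mis.any():
            break  # Y is empty: a separating solution has been found
        delta = -labels[mis]          # delta_x = -1 (omega_1), +1 (omega_2)
        w = w - rho * (delta[:, None] * X[mis]).sum(axis=0)
    return w
```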

8

An example of an update step: $w(t+1) = w(t) + \rho_t x$ (correction using a misclassified vector $x$).

The perceptron algorithm converges in a finite number of iteration steps to a solution if

$\lim_{t \to \infty} \sum_{k=0}^{t} \rho_k \to \infty \quad \text{and} \quad \lim_{t \to \infty} \sum_{k=0}^{t} \rho_k^2 < \infty, \qquad \text{e.g., } \rho_t = \frac{c}{t}$

9

A useful variant of the perceptron algorithm

$w(t+1) = w(t) + \rho x^{(t)} \quad \text{if } x^{(t)} \in \omega_1 \text{ and } w^T(t)\, x^{(t)} \le 0$
$w(t+1) = w(t) - \rho x^{(t)} \quad \text{if } x^{(t)} \in \omega_2 \text{ and } w^T(t)\, x^{(t)} \ge 0$
$w(t+1) = w(t) \quad \text{otherwise}$

It is a reward and punishment type of algorithm.
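A sketch of this reward-and-punishment scheme, cycling through the training vectors one at a time (names and the stopping rule are illustrative):

```python
import numpy as np

def perceptron_online(X, labels, rho=0.7, max_epochs=100):
    """Reward-punishment perceptron.

    At step t, the vector x_(t) is presented:
      w(t+1) = w(t) + rho*x_(t)  if x_(t) in omega_1 and w(t)^T x_(t) <= 0
      w(t+1) = w(t) - rho*x_(t)  if x_(t) in omega_2 and w(t)^T x_(t) >= 0
      w(t+1) = w(t)              otherwise
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(X, labels):
            s = w @ x
            if y == +1 and s <= 0:      # omega_1 vector on the wrong side
                w = w + rho * x
                updated = True
            elif y == -1 and s >= 0:    # omega_2 vector on the wrong side
                w = w - rho * x
                updated = True
        if not updated:
            break  # all training vectors correctly classified
    return w
```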

10

The perceptron

The $w_i$'s are the synapses or synaptic weights and $w_0$ is the threshold.

It is a learning machine that learns from the training vectors via the perceptron algorithm.

The network is called perceptron or neuron.

11

Example: At some stage t the perceptron algorithm results in

$w_1 = 1, \; w_2 = 1, \; w_0 = -0.5$

The corresponding hyperplane is

$x_1 + x_2 - 0.5 = 0$

With $\rho = 0.7$, the update based on the two misclassified vectors is

$w(t+1) = \begin{bmatrix} 1 \\ 1 \\ -0.5 \end{bmatrix} + 0.7\,(-1) \begin{bmatrix} -0.2 \\ 0.75 \\ 1 \end{bmatrix} + 0.7\,(+1) \begin{bmatrix} 0.4 \\ 0.05 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.42 \\ 0.51 \\ -0.5 \end{bmatrix}$
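The arithmetic of this update can be checked directly (the two misclassified vectors are written in extended form, with the bias component last):

```python
import numpy as np

w_t = np.array([1.0, 1.0, -0.5])    # current weights [w1, w2, w0]
x_a = np.array([-0.2, 0.75, 1.0])   # misclassified vector, corrected with factor -1
x_b = np.array([0.4, 0.05, 1.0])    # misclassified vector, corrected with factor +1
rho = 0.7

w_next = w_t + rho * (-1) * x_a + rho * (+1) * x_b
print(w_next)   # approximately [ 1.42  0.51 -0.5 ]
```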

12

Least Squares Methods

If classes are linearly separable, the perceptron output results in $\pm 1$.

If classes are NOT linearly separable, we shall compute the weights $w_0, w_1, \dots, w_l$ so that the difference between

• the actual output of the classifier, $w^T x$, and
• the desired outputs, e.g., $y = +1$ if $x \in \omega_1$, $y = -1$ if $x \in \omega_2$,

is SMALL.

13

SMALL, in the mean square error sense, means to choose $w$ so that the cost function

$J(w) = E\left[ (y - w^T x)^2 \right]$

becomes minimum:

$\hat{w} = \arg\min_w J(w)$

where $y$ is the corresponding desired response.

14

Minimizing $J(w)$ with respect to $w$ results in:

$\frac{\partial}{\partial w} E\left[ (y - w^T x)^2 \right] = 0 \;\Rightarrow\; -2 E\left[ x (y - x^T w) \right] = 0 \;\Rightarrow\; E[x x^T]\, w = E[x y] \;\Rightarrow\; \hat{w} = R_x^{-1} E[x y]$

where $R_x$ is the autocorrelation matrix

$R_x \equiv E[x x^T] = \begin{bmatrix} E[x_1 x_1] & E[x_1 x_2] & \dots & E[x_1 x_l] \\ \vdots & \vdots & & \vdots \\ E[x_l x_1] & E[x_l x_2] & \dots & E[x_l x_l] \end{bmatrix}$

and $E[x y]$ the crosscorrelation vector

$E[x y] = \begin{bmatrix} E[x_1 y] \\ \vdots \\ E[x_l y] \end{bmatrix}$
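In practice the expectations are replaced by sample averages over the training set; a minimal sketch of the resulting estimate (illustrative names):

```python
import numpy as np

def mse_weights(X, y):
    """Estimate w_hat = R_x^{-1} E[x y] from samples.

    X: (N, l) feature vectors, y: (N,) desired responses (+1 / -1).
    """
    N = X.shape[0]
    R_x = (X.T @ X) / N      # sample autocorrelation matrix E[x x^T]
    r_xy = (X.T @ y) / N     # sample cross-correlation vector E[x y]
    return np.linalg.solve(R_x, r_xy)
```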

15

Multi-class generalization

• The goal is to compute M linear discriminant functions

$g_i(x) = w_i^T x$

according to the MSE.

• Adopt as desired responses $y_i$:

$y_i = 1$ if $x \in \omega_i$, and $y_i = 0$ otherwise.

• Let $y = [y_1, y_2, \dots, y_M]^T$

• And the matrix $W = [w_1, w_2, \dots, w_M]$

16

• The goal is to compute $\hat{W}$:

$\hat{W} = \arg\min_W E\left[ \| y - W^T x \|^2 \right] = \arg\min_W \sum_{i=1}^{M} E\left[ (y_i - w_i^T x)^2 \right]$

• The above is equivalent to a number M of MSE minimization problems. That is: design each $w_i$ so that its desired output is 1 for $x \in \omega_i$ and 0 for any other class.

Remark: The MSE criterion belongs to a more general class of cost functions with the following important property:

• The value of $g_i(x)$ is an estimate, in the MSE sense, of the a-posteriori probability $P(\omega_i \mid x)$, provided that the desired responses used during training are $y_i = 1$ for $x \in \omega_i$ and 0 otherwise.

17

Mean square error regression: Let $y \in \mathcal{R}^M$, $x \in \mathcal{R}^l$ be jointly distributed random vectors with a joint pdf $p(x, y)$.

• The goal: Given the value of $x$, estimate the value of $y$. In the pattern recognition framework, given $x$ one wants to estimate the respective label $y$.

• The MSE estimate $\hat{y}$ of $y$, given $x$, is defined as:

$\hat{y} = \arg\min_{\tilde{y}} E\left[ \| y - \tilde{y} \|^2 \right]$

• It turns out that:

$\hat{y} = E[y \mid x] = \int y\, p(y \mid x)\, dy$

The above is known as the regression of $y$ given $x$ and it is, in general, a non-linear function of $x$. If $p(x, y)$ is Gaussian the MSE regressor is linear.

18

SMALL in the sum of error squares sense means

$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$

where $(y_i, x_i)$, $i = 1, 2, \dots, N$, are the training pairs, that is, the input $x_i$ and its corresponding class label $y_i$ ($\pm 1$).

Minimizing:

$\frac{\partial J(w)}{\partial w} = 0 \;\Rightarrow\; \left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$

19

Pseudoinverse Matrix

Define

$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}$ (an $N \times l$ matrix), $\qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}$ the corresponding desired responses,

$X^T = [x_1, x_2, \dots, x_N]$ (an $l \times N$ matrix).

Then

$X^T X = \sum_{i=1}^{N} x_i x_i^T, \qquad X^T y = \sum_{i=1}^{N} x_i y_i$

20

Thus

$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i \;\Rightarrow\; (X^T X)\, \hat{w} = X^T y \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y$

$X^{+} \equiv (X^T X)^{-1} X^T$ is the pseudoinverse of $X$.

Assume $N = l$ and $X$ square and invertible. Then

$X^{+} = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}$
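A short sketch of this solution in NumPy; np.linalg.pinv computes the Moore-Penrose pseudoinverse, which coincides with $(X^T X)^{-1} X^T$ whenever $X^T X$ is invertible:

```python
import numpy as np

def ls_weights(X, y):
    """Sum-of-error-squares solution w_hat = X^+ y = (X^T X)^{-1} X^T y."""
    return np.linalg.pinv(X) @ y

# Equivalent explicit form (valid when X^T X is invertible):
# w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```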

21

Assume N > l. Then, in general, there is no solution to satisfy all equations simultaneously:

$X w = y: \qquad x_1^T w = y_1, \quad x_2^T w = y_2, \quad \dots, \quad x_N^T w = y_N \qquad (N \text{ equations}, \; l \text{ unknowns})$

The "solution" $\hat{w} = X^{+} y$ corresponds to the minimum sum of squares solution.

22

Example:

$\omega_1: \; \begin{bmatrix} 0.4 \\ 0.5 \end{bmatrix}, \begin{bmatrix} 0.6 \\ 0.5 \end{bmatrix}, \begin{bmatrix} 0.1 \\ 0.4 \end{bmatrix}, \begin{bmatrix} 0.2 \\ 0.7 \end{bmatrix}, \begin{bmatrix} 0.3 \\ 0.3 \end{bmatrix}$

$\omega_2: \; \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix}, \begin{bmatrix} 0.6 \\ 0.2 \end{bmatrix}, \begin{bmatrix} 0.7 \\ 0.4 \end{bmatrix}, \begin{bmatrix} 0.8 \\ 0.6 \end{bmatrix}, \begin{bmatrix} 0.7 \\ 0.5 \end{bmatrix}$

In extended form (last component equal to 1) and with desired responses $+1$ for $\omega_1$ and $-1$ for $\omega_2$:

$X = \begin{bmatrix} 0.4 & 0.5 & 1 \\ 0.6 & 0.5 & 1 \\ 0.1 & 0.4 & 1 \\ 0.2 & 0.7 & 1 \\ 0.3 & 0.3 & 1 \\ 0.4 & 0.6 & 1 \\ 0.6 & 0.2 & 1 \\ 0.7 & 0.4 & 1 \\ 0.8 & 0.6 & 1 \\ 0.7 & 0.5 & 1 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ -1 \\ -1 \\ -1 \\ -1 \\ -1 \end{bmatrix}$

23

$X^T X = \begin{bmatrix} 2.8 & 2.24 & 4.8 \\ 2.24 & 2.41 & 4.7 \\ 4.8 & 4.7 & 10 \end{bmatrix}, \qquad X^T y = \begin{bmatrix} -1.6 \\ 0.1 \\ 0.0 \end{bmatrix}$

$\hat{w} = (X^T X)^{-1} X^T y \approx \begin{bmatrix} -3.22 \\ 0.24 \\ 1.43 \end{bmatrix}$
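The matrices above can be reproduced with a few lines (rows of X in extended form, desired responses +1 for ω1 and -1 for ω2):

```python
import numpy as np

X = np.array([
    [0.4, 0.5, 1], [0.6, 0.5, 1], [0.1, 0.4, 1], [0.2, 0.7, 1], [0.3, 0.3, 1],   # omega_1
    [0.4, 0.6, 1], [0.6, 0.2, 1], [0.7, 0.4, 1], [0.8, 0.6, 1], [0.7, 0.5, 1],   # omega_2
])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

print(X.T @ X)                             # ~ [[2.8, 2.24, 4.8], [2.24, 2.41, 4.7], [4.8, 4.7, 10]]
print(X.T @ y)                             # ~ [-1.6, 0.1, 0.0]
print(np.linalg.solve(X.T @ X, X.T @ y))   # w_hat ~ [-3.22, 0.24, 1.43]
```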

24

The Bias-Variance Dilemma

A classifier $g(x)$ is a learning machine that tries to predict the class label y of $x$. In practice, a finite data set D is used for its training. Let us write $g(x; D)$. Observe that:

For some training sets, $D = \{(y_i, x_i),\; i = 1, 2, \dots, N\}$, the training may result in good estimates; for some others the result may be worse.

The average performance of the classifier can be tested against the MSE optimal value, in the mean squares sense, that is:

$E_D\left[ \left( g(x; D) - E[y \mid x] \right)^2 \right]$

where $E_D$ is the mean over all possible data sets D.

25

• The above is written as:

$E_D\left[ \left( g(x; D) - E[y \mid x] \right)^2 \right] = \left( E_D[g(x; D)] - E[y \mid x] \right)^2 + E_D\left[ \left( g(x; D) - E_D[g(x; D)] \right)^2 \right]$

• In the above, the first term is the contribution of the bias and the second term is the contribution of the variance.

• For a finite D, there is a trade-off between the two terms. Increasing the bias reduces the variance and vice versa. This is known as the bias-variance dilemma.

• Using a complex model results in low bias but high variance, as one changes from one training set to another. Using a simple model results in high bias but low variance.
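The decomposition can be illustrated by a small simulation: train the same type of model on many independently drawn training sets and measure, over a grid of test points, the squared deviation of the average prediction from the true regression (bias²) and the spread of the predictions around their average (variance). The synthetic target function and model degrees below are assumptions made only for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                        # assumed "true" regression E[y|x]
    return np.sin(2 * np.pi * x)

def fit_poly(degree, N=20, noise=0.3):
    """Draw one training set D and fit a polynomial model to it."""
    x = rng.uniform(0, 1, N)
    y = f(x) + noise * rng.normal(size=N)
    return np.polyfit(x, y, degree)

x_test = np.linspace(0, 1, 50)
for degree in (1, 7):            # simple model vs. complex model
    preds = np.array([np.polyval(fit_poly(degree), x_test) for _ in range(500)])
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree}  bias^2={bias2:.3f}  variance={variance:.3f}")
```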

26

LOGISTIC DISCRIMINATION

Let an M-class task, $\omega_1, \omega_2, \dots, \omega_M$. In logistic discrimination, the logarithm of the likelihood ratios is modeled via linear functions, i.e.,

$\ln\frac{P(\omega_i \mid x)}{P(\omega_M \mid x)} = w_{i,0} + w_i^T x, \qquad i = 1, 2, \dots, M-1$

Taking into account that

$\sum_{i=1}^{M} P(\omega_i \mid x) = 1$

it can easily be shown that the above is equivalent to modeling the posterior probabilities as:

27

$P(\omega_M \mid x) = \frac{1}{1 + \sum_{i=1}^{M-1} \exp\left( w_{i,0} + w_i^T x \right)}$

$P(\omega_i \mid x) = \frac{\exp\left( w_{i,0} + w_i^T x \right)}{1 + \sum_{i=1}^{M-1} \exp\left( w_{i,0} + w_i^T x \right)}, \qquad i = 1, 2, \dots, M-1$

For the two-class case it turns out that

$P(\omega_2 \mid x) = \frac{1}{1 + \exp\left( w_0 + w^T x \right)}$

$P(\omega_1 \mid x) = \frac{\exp\left( w_0 + w^T x \right)}{1 + \exp\left( w_0 + w^T x \right)}$
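A sketch of the two-class model and a maximum-likelihood fit by plain gradient ascent (the learning rate, iteration count, and function names are illustrative; targets t = 1 encode ω1 and t = 0 encode ω2):

```python
import numpy as np

def posteriors(w, w0, x):
    """Two-class logistic model: returns (P(omega_1|x), P(omega_2|x))."""
    a = np.exp(w0 + w @ x)
    return a / (1 + a), 1 / (1 + a)

def fit_logistic(X, t, lr=0.1, n_iter=2000):
    """Maximum-likelihood estimation of (w, w0) by gradient ascent.

    X: (N, l) features, t: (N,) targets with t=1 for omega_1, t=0 for omega_2.
    """
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])   # extended vectors [x, 1]
    w = np.zeros(Xe.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(Xe @ w)))             # P(omega_1 | x_i)
        # gradient of the log-likelihood: sum_i (t_i - p_i) x_i
        w += lr * Xe.T @ (t - p) / len(t)
    return w[:-1], w[-1]                            # (w, w0)
```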

28

The unknown parameters $w_i, w_{i,0}, \; i = 1, 2, \dots, M-1$, are usually estimated by maximum likelihood arguments.

Logistic discrimination is a useful tool, since it allows linear modeling and at the same time ensures that the posterior probabilities add to one.

29

Support Vector Machines

The goal: Given two linearly separable classes, design the classifier

$g(x) = w^T x + w_0 = 0$

that leaves the maximum margin from both classes.

30

Margin: Each hyperplane is characterized by:

• Its direction in space, i.e., $w$
• Its position in space, i.e., $w_0$
• For EACH direction, $w$, choose the hyperplane that leaves the SAME distance from the nearest points from each class. The margin is twice this distance.

31

The distance of a point $\hat{x}$ from the hyperplane is given by:

$z = \frac{|g(\hat{x})|}{\|w\|}$

Scale $w, w_0$ so that at the nearest points, from each class, the discriminant function is $\pm 1$:

$|g(x)| = 1$, i.e., $g(x) = 1$ for $x \in \omega_1$ and $g(x) = -1$ for $x \in \omega_2$

Thus the margin is given by:

$\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$

Also, the following is valid:

$w^T x + w_0 \ge 1, \quad \forall x \in \omega_1$
$w^T x + w_0 \le -1, \quad \forall x \in \omega_2$

32

SVM (linear) classifier: $g(x) = w^T x + w_0$

Minimize

$J(w) = \frac{1}{2} \|w\|^2$

Subject to

$y_i (w^T x_i + w_0) \ge 1, \qquad i = 1, 2, \dots, N$

where $y_i = +1$ for $x_i \in \omega_1$ and $y_i = -1$ for $x_i \in \omega_2$.

The above is justified since by minimizing $\|w\|$ the margin $\frac{2}{\|w\|}$ is maximised.

33

The above is a quadratic optimization task, subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker conditions state that the minimizer satisfies:

• (1) $\frac{\partial}{\partial w} L(w, w_0, \lambda) = 0$

• (2) $\frac{\partial}{\partial w_0} L(w, w_0, \lambda) = 0$

• (3) $\lambda_i \ge 0, \qquad i = 1, 2, \dots, N$

• (4) $\lambda_i \left[ y_i (w^T x_i + w_0) - 1 \right] = 0, \qquad i = 1, 2, \dots, N$

• Where $L(w, w_0, \lambda)$ is the Lagrangian

$L(w, w_0, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \lambda_i \left[ y_i (w^T x_i + w_0) - 1 \right]$

34

The solution: from the above, it turns out that:

$w = \sum_{i=1}^{N} \lambda_i y_i x_i$

$\sum_{i=1}^{N} \lambda_i y_i = 0$

35

Remarks:
• The Lagrange multipliers can be either zero or positive. Thus,

$w = \sum_{i=1}^{N_s} \lambda_i y_i x_i$

where $N_s \le N$, corresponding to positive Lagrange multipliers.

– From constraint (4) above, i.e.,

$\lambda_i \left[ y_i (w^T x_i + w_0) - 1 \right] = 0, \qquad i = 1, 2, \dots, N$

the vectors contributing to $w$ satisfy

$w^T x_i + w_0 = \pm 1$

36

– These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.

– Once $w$ is computed, $w_0$ is determined from conditions (4).

– The optimal hyperplane classifier of a support vector machine is UNIQUE.

– Although the solution is unique, the resulting Lagrange multipliers are not unique.

37

Dual Problem Formulation

• The SVM formulation is a convex programming problem, with
– Convex cost function
– Convex region of feasible solutions

• Thus, its solution can be achieved by its dual problem, i.e.,

– maximize

$L(w, w_0, \lambda)$

– subject to

$w = \sum_{i=1}^{N} \lambda_i y_i x_i, \qquad \sum_{i=1}^{N} \lambda_i y_i = 0, \qquad \lambda \ge 0$

38

• Combine the above to obtain:

– maximize

$\sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j$

– subject to

$\sum_{i=1}^{N} \lambda_i y_i = 0, \qquad \lambda \ge 0$
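A small sketch that solves this dual with a general-purpose optimizer (SciPy's SLSQP), rather than the dedicated QP/SMO solvers discussed later; it assumes linearly separable data and uses illustrative names:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """Maximize sum(lam) - 0.5 * sum_ij lam_i lam_j y_i y_j x_i^T x_j
    subject to lam >= 0 and sum_i lam_i y_i = 0."""
    N = X.shape[0]
    Yx = y[:, None] * X
    H = Yx @ Yx.T                                 # H_ij = y_i y_j x_i^T x_j

    def neg_dual(lam):                            # minimize the negative dual
        return 0.5 * lam @ H @ lam - lam.sum()

    res = minimize(neg_dual, np.zeros(N), method="SLSQP",
                   bounds=[(0, None)] * N,
                   constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
    lam = res.x
    w = ((lam * y)[:, None] * X).sum(axis=0)      # w = sum_i lam_i y_i x_i
    sv = lam > 1e-6                               # support vectors: lam_i > 0
    w0 = np.mean(y[sv] - X[sv] @ w)               # from y_i (w^T x_i + w0) = 1
    return w, w0, lam
```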

39

Remarks:
• Support vectors enter via inner products.

Non-Separable classes

40

In this case, there is no hyperplane such that:

$y (w^T x + w_0) > 1, \qquad \forall x$

• Recall that the margin is defined as twice the distance between the following two hyperplanes:

$w^T x + w_0 = 1 \quad \text{and} \quad w^T x + w_0 = -1$

41

The training vectors belong to one of three possible categories:

1) Vectors outside the band which are correctly classified, i.e.,
$y_i (w^T x + w_0) \ge 1$

2) Vectors inside the band, and correctly classified, i.e.,
$0 \le y_i (w^T x + w_0) < 1$

3) Vectors misclassified, i.e.,
$y_i (w^T x + w_0) < 0$

42

All three cases above can be represented as:

$y_i (w^T x + w_0) \ge 1 - \xi_i$

1) $\xi_i = 0$
2) $0 < \xi_i \le 1$
3) $\xi_i > 1$

The $\xi_i$ are known as slack variables.

43

The goal of the optimization is now two-fold:
• Maximize the margin
• Minimize the number of patterns with $\xi_i > 0$.

One way to achieve this goal is via the cost

$J(w, w_0, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} I(\xi_i)$

where C is a constant and

$I(\xi_i) = \begin{cases} 1, & \xi_i > 0 \\ 0, & \xi_i = 0 \end{cases}$

• I(.) is not differentiable. In practice, we use an approximation. A popular choice is:

$J(w, w_0, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$

• Following a similar procedure as before we obtain:

44

KKT conditions

(1) $w = \sum_{i=1}^{N} \lambda_i y_i x_i$

(2) $\sum_{i=1}^{N} \lambda_i y_i = 0$

(3) $C - \mu_i - \lambda_i = 0, \qquad i = 1, 2, \dots, N$

(4) $\lambda_i \left[ y_i (w^T x_i + w_0) - 1 + \xi_i \right] = 0, \qquad i = 1, 2, \dots, N$

(5) $\mu_i \xi_i = 0, \qquad i = 1, 2, \dots, N$

(6) $\mu_i, \lambda_i \ge 0, \qquad i = 1, 2, \dots, N$

45

The associated dual problem

Maximize

$\sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j$

subject to

$0 \le \lambda_i \le C, \qquad i = 1, 2, \dots, N$

$\sum_{i=1}^{N} \lambda_i y_i = 0$

Remarks: The only difference with the separable class case is the existence of C in the constraints.

46

Training the SVM: A major problem is the high computational cost. To this end, decomposition techniques are used. The rationale behind them consists of the following:

• Start with an arbitrary data subset (working set) that can fit in the memory. Perform optimization via a general purpose optimizer.

• Resulting support vectors remain in the working set, while others are replaced by new ones (outside the set) that severely violate the KKT conditions.

• Repeat the procedure.

• The above procedure guarantees that the cost function decreases.

• Platt's SMO algorithm chooses a working set of two samples; thus an analytic optimization solution can be obtained.

47

Multi-class generalization

Although theoretical generalizations exist, the most popular approach in practice is to look at the problem as M two-class problems (one against all).

Example: Observe the effect of different values of C in the case of non-separable classes.
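This effect can be reproduced with an off-the-shelf linear SVM (scikit-learn's SVC); the overlapping Gaussian data below are synthetic and only for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping (non-separable) Gaussian classes
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, w0 = clf.coef_[0], clf.intercept_[0]
    margin = 2 / np.linalg.norm(w)
    print(f"C={C:6.2f}  margin={margin:.2f}  support vectors={clf.support_.size}")
```

Smaller C tolerates more margin violations (wider margin, more support vectors); larger C penalizes violations heavily (narrower margin).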