Statistical pattern recognition

Statistical pattern recognition. Texas A&M University, courses.cs.tamu.edu/rgutier/cpsc636_s10/spr.pdf. Ricardo Gutierrez‐Osuna, TAMU CSE.


Page 1

Statistical pattern recognition

Page 2

Bayes theorem

• Problem: deciding if a patient has a particular condition based on a particular test

– However, the test is imperfect
• Someone with the condition may go undetected (false negative)
• Someone without the condition may come out positive (false positive)

– Test properties
• SPECIFICITY, or true‐negative rate: P(NEG|¬COND)
• SENSITIVITY, or true‐positive rate: P(POS|COND)

Ricardo Gutierrez‐Osuna TAMU‐CSE

Page 3

• Problem definition
– Assume a population of 10,000 where 1 out of every 100 people has the medical condition
– Assume that we design a test with 98% specificity P(NEG|¬COND) and 90% sensitivity P(POS|COND)
– You take the test, and it comes out POSITIVE
– What conditional probability are we after? How likely is it that you have the condition?

Page 4

• Solution: Joint frequency table
– The answer is the ratio of individuals with the condition to the total number of individuals, considering only the individuals that tested positive: 90/288 = 0.3125

                    TEST IS POSITIVE                TEST IS NEGATIVE               ROW TOTAL
HAS CONDITION       True positive P(POS|COND)       False negative P(NEG|COND)
                    100×0.90 = 90                   100×(1−0.90) = 10              100
FREE OF CONDITION   False positive P(POS|¬COND)     True negative P(NEG|¬COND)
                    9,900×(1−0.98) = 198            9,900×0.98 = 9,702             9,900
COLUMN TOTAL        288                             9,712                          10,000
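
The counts in the table can be verified with a few lines of Python; a minimal sketch (the variable names are my own, not from the slides):

```python
# Rebuild the joint frequency table: population 10,000, prevalence 1/100,
# sensitivity P(POS|COND) = 0.90, specificity P(NEG|~COND) = 0.98.
population = 10_000
has_cond = population // 100          # 100 people with the condition
free_cond = population - has_cond     # 9,900 without it

true_pos = has_cond * 0.90            # detected cases
false_neg = has_cond * (1 - 0.90)     # missed cases
false_pos = free_cond * (1 - 0.98)    # healthy people testing positive
true_neg = free_cond * 0.98           # healthy people testing negative

positives = true_pos + false_pos      # everyone who tested positive
p_cond_given_pos = true_pos / positives
```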

Page 5

• Conditional probability

P(A|B) = P(A∩B) / P(B),   for P(B) > 0   ("B has occurred")

• Total probability

P(A) = P(A∩S) = P(A∩B₁) + … + P(A∩B_N)
     = P(A|B₁)·P(B₁) + … + P(A|B_N)·P(B_N) = ∑_{k=1..N} P(A|B_k)·P(B_k)

where the events B₁, …, B_N partition the sample space S (Venn and partition diagrams omitted)

Page 6

• Alternative solution: Bayes theorem

P(B|A) = P(A|B)·P(B) / P(A)

P(cond|+) = P(+|cond)·P(cond) / [P(+|cond)·P(cond) + P(+|¬cond)·P(¬cond)]
          = 0.90×0.01 / [0.90×0.01 + (1−0.98)×0.99]
          = 0.009 / 0.0288 = 0.3125
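
The same arithmetic, written as a direct application of Bayes theorem (a sketch; the variable names are illustrative):

```python
p_cond = 0.01    # prior probability of the condition
sens = 0.90      # P(+|cond), sensitivity
spec = 0.98      # P(-|~cond), specificity

# Total probability of testing positive
p_pos = sens * p_cond + (1 - spec) * (1 - p_cond)
# Bayes theorem: posterior probability of the condition given a positive test
p_cond_given_pos = sens * p_cond / p_pos
```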

Page 7

• In SPR, Bayes theorem is expressed as

P(ω_j|x) = P(x|ω_j)·P(ω_j) / P(x) = P(x|ω_j)·P(ω_j) / ∑_{k=1..N} P(x|ω_k)·P(ω_k)

Posterior = Likelihood × Prior / Normalization constant

– And we assign sample x to the class ω_k with the highest posterior
• It can be shown that this rule minimizes the probability of error
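
This maximum-a-posteriori rule can be sketched in a few lines; the likelihood and prior values below are made-up numbers for illustration:

```python
# Posteriors via Bayes theorem for a single sample x, then a MAP decision.
likelihoods = [0.05, 0.20, 0.10]   # P(x|w_j) for classes w_1..w_3 (illustrative)
priors = [0.5, 0.2, 0.3]           # P(w_j)

joint = [l * p for l, p in zip(likelihoods, priors)]
evidence = sum(joint)              # P(x), the normalization constant
posteriors = [j / evidence for j in joint]
best = max(range(3), key=lambda j: posteriors[j])   # class with highest posterior
```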

Page 8

Discriminant functions

(block diagram: features x₁, x₂, x₃, …, x_d feed the discriminant functions g₁(x), g₂(x), …, g_C(x); costs are applied, the maximum is selected, and a class is assigned)

x ∈ ω_i ⇔ g_i(x) > g_j(x) ∀ j ≠ i,   where g_i(x) = P(ω_i|x)

Page 9

Quadratic classifiers

• For normally distributed classes, the posterior can be reduced to a very simple expression
– Recall that an n‐dimensional Gaussian density is

p(x) = [1 / ((2π)^(n/2)·|Σ|^(1/2))] · exp[ −½·(x−μ)ᵀ·Σ⁻¹·(x−μ) ]

– Using Bayes rule, the DF can be written as

g_i(x) = P(ω_i|x) = P(x|ω_i)·P(ω_i) / P(x)
       = [1 / ((2π)^(n/2)·|Σ_i|^(1/2))] · exp[ −½·(x−μ_i)ᵀ·Σ_i⁻¹·(x−μ_i) ] · P(ω_i) / P(x)

Page 10

– Eliminating constant terms

g_i(x) = |Σ_i|^(−1/2) · exp[ −½·(x−μ_i)ᵀ·Σ_i⁻¹·(x−μ_i) ] · P(ω_i)

– And taking logs

g_i(x) = −½·(x−μ_i)ᵀ·Σ_i⁻¹·(x−μ_i) − ½·log|Σ_i| + log P(ω_i)

– This is known as a quadratic discriminant function (because it is a function of x²)
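
The quadratic discriminant can be evaluated numerically; a minimal NumPy sketch (the two classes below are invented for illustration):

```python
import numpy as np

def g(x, mu, sigma, prior):
    """Quadratic discriminant: -1/2 (x-mu)' S^-1 (x-mu) - 1/2 log|S| + log P(w)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.inv(sigma) @ d
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Two illustrative Gaussian classes with different covariances
mu1, sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, sigma2 = np.array([3.0, 3.0]), 2.0 * np.eye(2)
x = np.array([0.5, 0.2])
label = 1 if g(x, mu1, sigma1, 0.5) > g(x, mu2, sigma2, 0.5) else 2
```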

Page 11

Case 1: Σ_i = σ²I

• Features are statistically independent and have the same variance for all classes
– In this case, the quadratic discriminant function becomes

g_i(x) = −½·(x−μ_i)ᵀ·(σ²I)⁻¹·(x−μ_i) − ½·log|σ²I| + log P(ω_i)
       = −[1/(2σ²)]·(x−μ_i)ᵀ·(x−μ_i) + log P(ω_i)

– Assuming equal priors and dropping constant terms

g_i(x) = −(x−μ_i)ᵀ·(x−μ_i) = −∑_{k=1..DIM} (x_k − μ_{i,k})²

– This is called a Euclidean‐distance or nearest‐mean classifier
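
A nearest‐mean classifier is just an argmin over squared Euclidean distances; a small sketch with made‐up class means:

```python
import numpy as np

def nearest_mean(x, means):
    """Assign x to the class whose mean minimizes ||x - mu_i||^2
    (equivalently, maximizes g_i(x) = -(x - mu_i)'(x - mu_i))."""
    return int(np.argmin([np.sum((x - m) ** 2) for m in means]))

means = [np.array([3.0, 2.0]), np.array([7.0, 4.0]), np.array([2.0, 5.0])]
label = nearest_mean(np.array([6.0, 4.5]), means)   # closest to the second mean
```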

Page 12

– Example: μ₁ᵀ = [3 2], μ₂ᵀ = [7 4], μ₃ᵀ = [2 5], with Σ₁ = Σ₂ = Σ₃ = [2 0; 0 2] (figure with the resulting decision boundaries omitted)

Page 13

Case 2: Σ_i = Σ

• All classes have the same covariance matrix, but the matrix is not diagonal
– In this case, the quadratic discriminant becomes

g_i(x) = −½·(x−μ_i)ᵀ·Σ⁻¹·(x−μ_i) − ½·log|Σ| + log P(ω_i)

– Assuming equal priors and eliminating constants

g_i(x) = −(x−μ_i)ᵀ·Σ⁻¹·(x−μ_i)

– This is known as a Mahalanobis‐distance classifier (figure contrasting the Euclidean contours ‖x−μ_i‖² = K with the Mahalanobis contours (x−μ_i)ᵀ·Σ⁻¹·(x−μ_i) = K omitted)
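
The Mahalanobis‐distance classifier differs from the nearest‐mean rule only in the metric; a sketch with an illustrative shared covariance:

```python
import numpy as np

def mahalanobis_label(x, means, sigma):
    """Assign x to the class minimizing (x - mu_i)' Sigma^-1 (x - mu_i)."""
    inv = np.linalg.inv(sigma)
    return int(np.argmin([(x - m) @ inv @ (x - m) for m in means]))

sigma = np.array([[1.0, 0.7], [0.7, 2.0]])            # shared, non-diagonal
means = [np.array([3.0, 2.0]), np.array([5.0, 4.0])]
label = mahalanobis_label(np.array([4.5, 3.5]), means, sigma)
```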

Page 14

– Example: μ₁ᵀ = [3 2], μ₂ᵀ = [5 4], μ₃ᵀ = [2 5], with Σ₁ = Σ₂ = Σ₃ = [1 0.7; 0.7 2] (figure with the resulting decision boundaries omitted)

Page 15

General case

– Example: μ₁ᵀ = [3 2], μ₂ᵀ = [5 4], μ₃ᵀ = [2 5], with Σ₁ = [1 −1; −1 2], Σ₂ = [1 −1; −1 7], Σ₃ = [0.5 0.5; 0.5 3] (figure with the resulting quadratic decision boundaries, including a zoomed‐out view, omitted)

Page 16

k nearest neighbors

• Non‐parametric approximation
– Likelihood of each class

P(x|ω_i) = k_i / (N_i·V)

where k_i of the k nearest neighbors of x belong to ω_i, N_i is the number of examples in ω_i, and V is the volume that encloses the k neighbors

– And priors

P(ω_i) = N_i / N

– Then, the posterior becomes

P(ω_i|x) = P(x|ω_i)·P(ω_i) / P(x) = [k_i/(N_i·V)]·[N_i/N] / [k/(N·V)] = k_i / k
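
The result P(ω_i|x) = k_i/k says the posterior is simply the fraction of the k neighbors drawn from each class; a two‐line sketch:

```python
from collections import Counter

def knn_posteriors(neighbor_labels):
    """P(w_i|x) = k_i / k: fraction of the k nearest neighbors labeled w_i."""
    k = len(neighbor_labels)
    return {c: n / k for c, n in Counter(neighbor_labels).items()}

posteriors = knn_posteriors([1, 1, 1, 1, 3])   # 4 neighbors from w1, 1 from w3
```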

Page 17

• Example
– Given the three classes, assign a class label for the unknown example x_u
– Assume the Euclidean distance and k = 5 neighbors
– Of the 5 closest neighbors, 4 belong to ω₁ and 1 belongs to ω₃, so x_u is assigned to ω₁, the predominant class (figure with the classes ω₁, ω₂, ω₃ and the point x_u omitted)

Page 18 (figure omitted)

Page 19 (figure: decision regions for 1-NN, 5-NN, and 20-NN)

Page 20

• Advantages
– Simple implementation
– Nearly optimal in the large‐sample limit (N→∞)
• P[error]_Bayes < P[error]_1NN < 2·P[error]_Bayes
– Uses local information, which can yield highly adaptive behavior
– Lends itself very easily to parallel implementations

• Disadvantages
– Large storage requirements
– Computationally intensive recall
– Highly susceptible to the curse of dimensionality

Page 21

Dimensionality reduction

Page 22

Why do dimensionality reduction?

• The so‐called "curse of dimensionality"
– Exponential growth in the number of examples required to accurately estimate a function

• Exploratory data analysis
– Visualizing the structure of the data in a low‐dimensional subspace

Page 23

• Two approaches to perform dimensionality reduction
– Feature selection: choose a subset of all the features

[x₁ x₂ … x_N] → [x_{i1} x_{i2} … x_{iM}]

– Feature extraction: create new features by combining the existing ones

[x₁ x₂ … x_N] → [y₁ y₂ … y_M] = f([x₁ x₂ … x_N])

– Feature extraction is typically a linear transform y = W·x:

[y₁ ]   [w₁₁  w₁₂  …  w₁N ] [x₁ ]
[y₂ ] = [w₂₁  w₂₂  …  w₂N ] [x₂ ]
[ ⋮ ]   [ ⋮    ⋮   ⋱   ⋮  ] [ ⋮ ]
[y_M]   [w_M1 w_M2 …  w_MN] [x_N]
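
The linear transform above is a single matrix-vector product; a toy sketch (the weight matrix W is arbitrary, chosen only for illustration):

```python
import numpy as np

# Linear feature extraction y = W x: an M x N matrix W maps the
# N original features to M < N extracted features.
W = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])   # illustrative 2 x 4 weights
x = np.array([1.0, 2.0, 3.0, 4.0])     # original feature vector (N = 4)
y = W @ x                              # extracted features (M = 2)
```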

Page 24

Representation vs. classification

(figure: scatter plot of two classes, labeled 1 and 2, in the Feature 1 vs. Feature 2 plane)

Page 25

PCA

• Solution
– Project the data onto the eigenvectors corresponding to the largest eigenvalues of the covariance matrix
– PCA finds the orthogonal directions of largest variance

• Properties
– If the data is Gaussian, PCA finds independent axes
– Otherwise, it simply de‐correlates the axes

• Limitation
– Directions of high variance do not necessarily contain discriminatory information

Page 26

LDA

• Define scatter matrices
– Within class

S_W = ∑_{i=1..C} S_i = ∑_{i=1..C} ∑_{x∈ω_i} (x − μ_i)·(x − μ_i)ᵀ

– Between class

S_B = ∑_{i=1..C} N_i·(μ_i − μ)·(μ_i − μ)ᵀ

– Then maximize the ratio

J(W) = |Wᵀ·S_B·W| / |Wᵀ·S_W·W|

(figure: class means μ₁, μ₂, μ₃ and global mean μ in the x₁–x₂ plane, with within‐class scatters S_W1, S_W2, S_W3 and between‐class scatters S_B1, S_B2, S_B3)
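
The scatter matrices and the ratio J(W) lead to an eigenproblem; a NumPy sketch that builds S_W and S_B and extracts the projections, on synthetic two‐class data invented for illustration:

```python
import numpy as np

def lda(X, y, m):
    """LDA sketch: within/between scatter, then top eigenvectors of Sw^-1 Sb."""
    n_feat = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((n_feat, n_feat))
    Sb = np.zeros((n_feat, n_feat))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)             # within-class scatter
        d = (mu_c - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)                     # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:m]]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([5.0, 5.0], 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = lda(X, y, 1)                    # two classes -> at most C-1 = 1 projection
```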

Page 27

• Solution
– The optimal projections are the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem

(S_B − λ_i·S_W)·w_i = 0

• NOTE
– S_B is the sum of C matrices of rank one or less, and the mean vectors are constrained by ∑_{i=1..C} N_i·(μ_i − μ) = 0
– Therefore, S_B will be at most of rank C−1, and LDA produces at most C−1 feature projections

• Limitations
– Overfitting
– Information not in the mean of the data
– Classes significantly non‐Gaussian

Page 28

(figure: 3‐D scatter plots of a 10‐class dataset, classes labeled 1–10, projected onto axes 1–3 of PCA and of LDA)

Page 29

• LDA and overfitting
– Generate an artificial dataset
• Three classes, 50 examples per class, with the exact same likelihood: a multivariate Gaussian with zero mean and identity covariance

(figure: LDA projections of the dataset for 10, 50, 100 and 150 dimensions; the classes, labeled 1–3, appear increasingly separable as the dimensionality grows, even though all three likelihoods are identical)