
Page 1: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Lecturer: 虞台文

Page 2: Bayesian Decision Theory (Classification) 主講人:虞台文

Contents

Introduction
Generalized Bayesian Decision Rule
Discriminant Functions
The Normal Distribution
Discriminant Functions for the Normal Populations
Minimax Criterion
Neyman-Pearson Criterion

Page 3: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Introduction

Page 4: Bayesian Decision Theory (Classification) 主講人:虞台文

What is Bayesian Decision Theory?

Mathematical foundation for decision making.

It uses a probabilistic approach to make decisions (e.g., classification) so as to minimize the risk (cost).

Page 5: Bayesian Decision Theory (Classification) 主講人:虞台文

Preliminaries and Notations

$\omega \in \{\omega_1, \omega_2, \ldots, \omega_c\}$: a state of nature

$P(\omega_i)$: prior probability

$\mathbf{x}$: feature vector

$p(\mathbf{x} \mid \omega_i)$: class-conditional density

$P(\omega_i \mid \mathbf{x})$: posterior probability

Page 6: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Rule

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$$

Page 7: Bayesian Decision Theory (Classification) 主講人:虞台文

Decision

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}$$

$$\omega(\mathbf{x}) = \arg\max_i P(\omega_i \mid \mathbf{x})$$

The evidence $p(\mathbf{x})$ is unimportant in making the decision.

Page 8: Bayesian Decision Theory (Classification) 主講人:虞台文

Decision

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}, \qquad \omega(\mathbf{x}) = \arg\max_i P(\omega_i \mid \mathbf{x})$$

Decide $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \neq i$.

Decide $\omega_i$ if $p(\mathbf{x} \mid \omega_i)\, P(\omega_i) > p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$ for all $j \neq i$.

Special cases:
1. $P(\omega_1) = P(\omega_2) = \cdots = P(\omega_c)$
2. $p(\mathbf{x} \mid \omega_1) = p(\mathbf{x} \mid \omega_2) = \cdots = p(\mathbf{x} \mid \omega_c)$
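
As a quick illustration of this rule, here is a minimal Python sketch; the two 1-D Gaussian class-conditional densities and the priors are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D example: two classes with Gaussian class-conditional densities.
priors = np.array([2/3, 1/3])                      # P(w1), P(w2)
densities = [norm(loc=-1.0, scale=1.0),            # p(x|w1)
             norm(loc=+1.0, scale=0.7)]            # p(x|w2)

def posterior(x):
    """Bayes rule: P(wi|x) = p(x|wi) P(wi) / p(x)."""
    joint = np.array([d.pdf(x) * P for d, P in zip(densities, priors)])
    return joint / joint.sum()                      # p(x) normalizes the joint

def decide(x):
    """Minimum-error-rate rule: pick the class with the largest posterior."""
    return np.argmax(posterior(x)) + 1              # 1-based class label

print(posterior(0.2), decide(0.2))
```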

Page 9: Bayesian Decision Theory (Classification) 主講人:虞台文

Two Categories

Decide $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \neq i$; equivalently, decide $\omega_i$ if $p(\mathbf{x} \mid \omega_i)\, P(\omega_i) > p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$ for all $j \neq i$.

For two categories:

Decide $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$; otherwise decide $\omega_2$.
Decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1)\, P(\omega_1) > p(\mathbf{x} \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$.

Special cases:
1. $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1) > p(\mathbf{x} \mid \omega_2)$; otherwise decide $\omega_2$.
2. $p(\mathbf{x} \mid \omega_1) = p(\mathbf{x} \mid \omega_2)$: decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$.

Page 10: Bayesian Decision Theory (Classification) 主講人:虞台文

Example

[Figure: decision regions $\mathcal{R}_1$ and $\mathcal{R}_2$ for equal priors $P(\omega_1) = P(\omega_2)$.]

Special cases:
1. $P(\omega_1) = P(\omega_2)$: decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1) > p(\mathbf{x} \mid \omega_2)$; otherwise decide $\omega_2$.
2. $p(\mathbf{x} \mid \omega_1) = p(\mathbf{x} \mid \omega_2)$: decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$.

Page 11: Bayesian Decision Theory (Classification) 主講人:虞台文

Example

[Figure: decision regions $\mathcal{R}_1$ and $\mathcal{R}_2$ for priors $P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$.]

Decide $\omega_1$ if $p(\mathbf{x} \mid \omega_1)\, P(\omega_1) > p(\mathbf{x} \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$.

Page 12: Bayesian Decision Theory (Classification) 主講人:虞台文

Classification Error

$$P(\text{error}) = \int p(\text{error}, \mathbf{x})\, d\mathbf{x} = \int P(\text{error} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

Consider two categories:

$$P(\text{error} \mid \mathbf{x}) = \begin{cases} P(\omega_1 \mid \mathbf{x}) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid \mathbf{x}) & \text{if we decide } \omega_1 \end{cases}$$

Decide $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$; otherwise decide $\omega_2$, so that

$$P(\text{error} \mid \mathbf{x}) = \min\left[P(\omega_1 \mid \mathbf{x}),\, P(\omega_2 \mid \mathbf{x})\right]$$

Page 13: Bayesian Decision Theory (Classification) 主講人:虞台文

Classification Error

$$P(\text{error}) = \int P(\text{error} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}, \qquad P(\text{error} \mid \mathbf{x}) = \min\left[P(\omega_1 \mid \mathbf{x}),\, P(\omega_2 \mid \mathbf{x})\right]$$

Page 14: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Generalized Bayesian Decision Rule

Page 15: Bayesian Decision Theory (Classification) 主講人:虞台文

The Generalization

$\{\omega_1, \omega_2, \ldots, \omega_c\}$: a set of $c$ states of nature

$\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$: a set of $a$ possible actions

$\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$: the loss incurred for taking action $\alpha_i$ when the true state of nature is $\omega_j$ (the loss for a correct action can be zero).

We want to minimize the expected loss (the risk) in making decisions.

Page 16: Bayesian Decision Theory (Classification) 主講人:虞台文

Conditional Risk

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda_{ij}\, P(\omega_j \mid \mathbf{x})$$

Given $\mathbf{x}$, this is the expected loss (risk) associated with taking action $\alpha_i$.

Page 17: Bayesian Decision Theory (Classification) 主講人:虞台文

0/1 Loss Function

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \text{ (a correct decision associated with } \omega_j\text{)} \\ 1 & \text{otherwise} \end{cases}$$

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda_{ij}\, P(\omega_j \mid \mathbf{x}) = \sum_{j \neq i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x}) = P(\text{error} \mid \mathbf{x})$$

Page 18: Bayesian Decision Theory (Classification) 主講人:虞台文

Decision

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda_{ij}\, P(\omega_j \mid \mathbf{x})$$

Bayesian decision rule: $\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x})$
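
A minimal sketch of this general rule; the 2x2 loss matrix and the posteriors below are illustrative assumptions.

```python
import numpy as np

# Illustrative loss matrix: loss[i, j] = lambda(alpha_i | omega_j).
loss = np.array([[0.0, 2.0],     # action a1: no loss if w1 is true, costly if w2
                 [1.0, 0.0]])    # action a2: mild loss if w1, correct for w2

def conditional_risk(posteriors):
    """R(a_i | x) = sum_j lambda_ij P(w_j | x), computed for every action i."""
    return loss @ posteriors

def bayes_action(posteriors):
    """Bayesian decision rule: take the action with minimum conditional risk."""
    return np.argmin(conditional_risk(posteriors))

post = np.array([0.3, 0.7])      # example posteriors P(w1|x), P(w2|x)
print(conditional_risk(post), bayes_action(post))
```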

Page 19: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

where $\alpha(\cdot)$ is the decision function. The Bayesian decision rule $\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x})$ is the optimal one to minimize the overall risk; its resulting overall risk is called the Bayesian risk.

Page 20: Bayesian Decision Theory (Classification) 主講人:虞台文

Two-Category Classification

$\{\omega_1, \omega_2\}$: states of nature; $\{\alpha_1, \alpha_2\}$: actions

Loss function $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$:

| Action \ State of nature | $\omega_1$ | $\omega_2$ |
|---|---|---|
| $\alpha_1$ | $\lambda_{11}$ | $\lambda_{12}$ |
| $\alpha_2$ | $\lambda_{21}$ | $\lambda_{22}$ |

$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$$

Page 21: Bayesian Decision Theory (Classification) 主講人:虞台文

Two-Category Classification

$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$$

Perform $\alpha_1$ if $R(\alpha_2 \mid \mathbf{x}) > R(\alpha_1 \mid \mathbf{x})$; otherwise perform $\alpha_2$. That is, perform $\alpha_1$ if

$$\lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x}) > \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$

i.e.,

$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid \mathbf{x})$$

Page 22: Bayesian Decision Theory (Classification) 主講人:虞台文

Two-Category Classification

Perform $\alpha_1$ if $R(\alpha_2 \mid \mathbf{x}) > R(\alpha_1 \mid \mathbf{x})$; otherwise perform $\alpha_2$, i.e., perform $\alpha_1$ if

$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid \mathbf{x})$$

where $(\lambda_{21} - \lambda_{11})$ and $(\lambda_{12} - \lambda_{22})$ are normally positive: the posterior probabilities are scaled before comparison.

Page 23: Bayesian Decision Theory (Classification) 主講人:虞台文

Two-Category Classification

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})} \quad (\text{the evidence } p(\mathbf{x}) \text{ is irrelevant to the comparison})$$

Perform $\alpha_1$ if $R(\alpha_2 \mid \mathbf{x}) > R(\alpha_1 \mid \mathbf{x})$; otherwise perform $\alpha_2$:

$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid \mathbf{x})$$
$$(\lambda_{21} - \lambda_{11})\, p(\mathbf{x} \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(\mathbf{x} \mid \omega_2)\, P(\omega_2)$$
$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

Page 24: Bayesian Decision Theory (Classification) 主講人:虞台文

Two-Category Classification

Perform $\alpha_1$ if

$$\underbrace{\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}}_{\text{threshold}}$$

This slide will be recalled later.
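
A sketch of this likelihood-ratio rule; the losses, priors, and Gaussian class-conditional densities are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative ingredients (all values are assumptions for the sketch).
p1, p2 = norm(-1.0, 1.0), norm(1.0, 1.0)   # p(x|w1), p(x|w2)
P1, P2 = 0.6, 0.4                          # priors
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0    # losses lambda_ij

threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)

def decide(x):
    """Perform a1 when the likelihood ratio exceeds the loss/prior threshold."""
    likelihood_ratio = p1.pdf(x) / p2.pdf(x)
    return "a1" if likelihood_ratio > threshold else "a2"

print(threshold, decide(-0.5), decide(2.0))
```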

Page 25: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Discriminant Functions

Page 26: Bayesian Decision Theory (Classification) 主講人:虞台文

The Multicategory Classification

[Diagram: the feature vector $\mathbf{x}$ is fed to discriminant functions $g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_c(\mathbf{x})$, and their maximum determines the action $\alpha(\mathbf{x})$ (e.g., a classification).]

Assign $\mathbf{x}$ to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.

The $g_i(\mathbf{x})$'s are called the discriminant functions.

How to define discriminant functions?

Page 27: Bayesian Decision Theory (Classification) 主講人:虞台文

Simple Discriminant Functions

Minimum-risk case: $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$

Minimum error-rate case: $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$, or equivalently $g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\, P(\omega_i)$ or $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$

If $f(\cdot)$ is a monotonically increasing function, then the $f(g_i(\cdot))$'s are also discriminant functions.

Page 28: Bayesian Decision Theory (Classification) 主講人:虞台文

Decision Regions

$$\mathcal{R}_i = \{\, \mathbf{x} \mid g_i(\mathbf{x}) > g_j(\mathbf{x}) \ \forall j \neq i \,\}$$

Two-category example

Decision regions are separated by decision boundaries.

Page 29: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

The Normal Distribution

Page 30: Bayesian Decision Theory (Classification) 主講人:虞台文

Basics of Probability

Discrete random variable $X$ (assume integer-valued):
Probability mass function (pmf): $p(x) = P(X = x)$
Cumulative distribution function (cdf): $F(x) = P(X \le x) = \sum_{t \le x} p(t)$

Continuous random variable $X$:
Probability density function (pdf): $p(x)$ or $f(x)$ (not a probability)
Cumulative distribution function (cdf): $F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\, dt$

Page 31: Bayesian Decision Theory (Classification) 主講人:虞台文

Expectations

Let $g$ be a function of random variable $X$.

$$E[g(X)] = \begin{cases} \sum_x g(x)\, p(x) & X \text{ discrete} \\ \int g(x)\, p(x)\, dx & X \text{ continuous} \end{cases}$$

The $k$th moment: $E[X^k]$

The $k$th central moment: $E[(X - \bar{X})^k]$

The 1st moment: $\bar{X} = E[X]$

Page 32: Bayesian Decision Theory (Classification) 主講人:虞台文

Important Expectations

Mean:

$$\mu_X = E[X] = \begin{cases} \sum_x x\, p(x) & X \text{ discrete} \\ \int x\, p(x)\, dx & X \text{ continuous} \end{cases}$$

Variance:

$$\sigma_X^2 = \mathrm{Var}[X] = E[(X - \mu_X)^2] = \begin{cases} \sum_x (x - \mu_X)^2\, p(x) & X \text{ discrete} \\ \int (x - \mu_X)^2\, p(x)\, dx & X \text{ continuous} \end{cases}$$

Fact: $\mathrm{Var}[X] = E[X^2] - (E[X])^2$

Page 33: Bayesian Decision Theory (Classification) 主講人:虞台文

Entropy

$$H[X] = \begin{cases} -\sum_x p(x) \ln p(x) & X \text{ discrete} \\ -\int p(x) \ln p(x)\, dx & X \text{ continuous} \end{cases}$$

The entropy measures the fundamental uncertainty in the value of points selected randomly from a distribution.

Page 34: Bayesian Decision Theory (Classification) 主講人:虞台文

Univariate Gaussian Distribution

$X \sim N(\mu, \sigma^2)$:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

$E[X] = \mu$, $\mathrm{Var}[X] = \sigma^2$

[Figure: the density $p(x)$, centered at $\mu$, with markers at $\mu \pm \sigma$, $\mu \pm 2\sigma$, and $\mu \pm 3\sigma$.]

Properties:
1. Maximizes the entropy (among densities with a given mean and variance).
2. Central limit theorem.
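
A small sketch that evaluates this density directly and numerically checks that it integrates to one; the parameter values are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Univariate normal density: exp(-(x-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Sanity check: the density should integrate to ~1 and have mean ~mu.
xs = np.linspace(-8, 8, 4001)
px = gaussian_pdf(xs, mu=1.5, sigma=0.8)
print(np.trapz(px, xs), np.trapz(xs * px, xs))   # ~1.0 and ~1.5
```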

Page 35: Bayesian Decision Theory (Classification) 主講人:虞台文

Random Vectors

$\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$: a $d$-dimensional random vector, $\mathbf{X} : \mathbb{R}^d$

Vector mean: $\boldsymbol{\mu} = E[\mathbf{X}] = (\mu_1, \mu_2, \ldots, \mu_d)^T$

Covariance matrix:

$$\boldsymbol{\Sigma} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$

Page 36: Bayesian Decision Theory (Classification) 主講人:虞台文

Multivariate Gaussian Distribution

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$: a $d$-dimensional random vector

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

$E[\mathbf{X}] = \boldsymbol{\mu}$, $E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = \boldsymbol{\Sigma}$

Compare with the univariate case:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$
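
A small sketch evaluating this density; the mean vector and covariance matrix below are illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density for a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)          # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```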

Page 37: Bayesian Decision Theory (Classification) 主講人:虞台文

Properties of N(μ,Σ)

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$: a $d$-dimensional random vector.

Let $\mathbf{Y} = \mathbf{A}^T \mathbf{X}$, where $\mathbf{A}$ is a $d \times k$ matrix. Then $\mathbf{Y} \sim N(\mathbf{A}^T \boldsymbol{\mu}, \mathbf{A}^T \boldsymbol{\Sigma} \mathbf{A})$.

Page 38: Bayesian Decision Theory (Classification) 主講人:虞台文

Properties of N(μ,Σ)

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$: a $d$-dimensional random vector.

Let $\mathbf{Y} = \mathbf{A}^T \mathbf{X}$, where $\mathbf{A}$ is a $d \times k$ matrix. Then $\mathbf{Y} \sim N(\mathbf{A}^T \boldsymbol{\mu}, \mathbf{A}^T \boldsymbol{\Sigma} \mathbf{A})$.

Page 39: Bayesian Decision Theory (Classification) 主講人:虞台文

On Parameters of N(μ,Σ)

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$

$$\boldsymbol{\mu} = E[\mathbf{X}] = (\mu_1, \mu_2, \ldots, \mu_d)^T, \qquad \mu_i = E[X_i]$$

$$\boldsymbol{\Sigma} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = [\sigma_{ij}]_{d \times d}$$

$$\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)] = \mathrm{Cov}(X_i, X_j), \qquad \sigma_{ii} = E[(X_i - \mu_i)^2] = \mathrm{Var}(X_i) = \sigma_i^2$$

$X_i$ and $X_j$ independent $\Rightarrow \sigma_{ij} = 0$.

Page 40: Bayesian Decision Theory (Classification) 主講人:虞台文

More On Covariance Matrix

$$\boldsymbol{\Sigma} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = [\sigma_{ij}]_{d \times d}, \qquad \sigma_{ij} = \mathrm{Cov}(X_i, X_j), \qquad \sigma_{ii} = \mathrm{Var}(X_i) = \sigma_i^2$$

$\boldsymbol{\Sigma}$ is symmetric and positive semidefinite, so it has the eigendecomposition

$$\boldsymbol{\Sigma} = \boldsymbol{\Phi} \boldsymbol{\Lambda} \boldsymbol{\Phi}^T$$

where $\boldsymbol{\Phi}$ is an orthonormal matrix whose columns are the eigenvectors of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues. Hence

$$\boldsymbol{\Sigma} = (\boldsymbol{\Phi} \boldsymbol{\Lambda}^{1/2})(\boldsymbol{\Phi} \boldsymbol{\Lambda}^{1/2})^T$$

Page 41: Bayesian Decision Theory (Classification) 主講人:虞台文

Whitening Transform

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$; for $\mathbf{Y} = \mathbf{A}^T \mathbf{X}$, $\mathbf{Y} \sim N(\mathbf{A}^T \boldsymbol{\mu}, \mathbf{A}^T \boldsymbol{\Sigma} \mathbf{A})$, and $\boldsymbol{\Sigma} = (\boldsymbol{\Phi} \boldsymbol{\Lambda}^{1/2})(\boldsymbol{\Phi} \boldsymbol{\Lambda}^{1/2})^T$.

Let $\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$. Then

$$\mathbf{A}_w^T \boldsymbol{\Sigma} \mathbf{A}_w = (\boldsymbol{\Lambda}^{-1/2} \boldsymbol{\Phi}^T)(\boldsymbol{\Phi} \boldsymbol{\Lambda} \boldsymbol{\Phi}^T)(\boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}) = \mathbf{I}$$

so that

$$\mathbf{A}_w^T \mathbf{X} \sim N(\mathbf{A}_w^T \boldsymbol{\mu}, \mathbf{I})$$

Page 42: Bayesian Decision Theory (Classification) 主講人:虞台文

Whitening Transform

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$, $\mathbf{A}_w^T \mathbf{X} \sim N(\mathbf{A}_w^T \boldsymbol{\mu}, \mathbf{I})$

[Figure: the effect of a general linear transform, a projection, and the whitening transform on the Gaussian density.]
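
A sketch of the whitening transform via the eigendecomposition of an illustrative covariance matrix; the sample covariance of the transformed data should be close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from an illustrative correlated Gaussian.
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)

# Whitening transform A_w = Phi Lambda^{-1/2} from the eigendecomposition of Sigma.
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)

Y = X @ A_w                      # each row is A_w^T x for one sample
print(np.cov(Y.T))               # ~ identity matrix
```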

Page 43: Bayesian Decision Theory (Classification) 主講人:虞台文

Mahalanobis Distance

$\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

The density is constant on surfaces of constant squared Mahalanobis distance

$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

and the contour depends on the value of $r^2$.
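
A small sketch computing the squared Mahalanobis distance; the covariance below is illustrative, and the example shows that equal Euclidean distances can give different Mahalanobis distances.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])

print(mahalanobis_sq(np.array([2.0, 0.0]), mu, Sigma))   # 1.0
print(mahalanobis_sq(np.array([0.0, 2.0]), mu, Sigma))   # 4.0
```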

Page 44: Bayesian Decision Theory (Classification) 主講人:虞台文

Mahalanobis Distance

$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}), \qquad \mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

[Figure: contours of constant density are surfaces of constant Mahalanobis distance $r$ centered at $\boldsymbol{\mu}$.]

Page 45: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Discriminant Functions for the Normal Populations

Page 46: Bayesian Decision Theory (Classification) 主講人:虞台文

Minimum-Error-Rate Classification

$$g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}), \qquad g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\, P(\omega_i), \qquad g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$$

For Gaussian classes $\mathbf{X}_i \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Page 47: Bayesian Decision Theory (Classification) 主講人:虞台文

Minimum-Error-Rate Classification

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Three cases:

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$. Classes are centered at different means; the feature components are pairwise independent and have the same variance.

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$. Classes are centered at different means but share the same covariance matrix.

Case 3: $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$. Arbitrary covariance matrices.
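
A sketch of this log-discriminant that applies to all three cases; the class parameters below are illustrative assumptions.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)
    return (-0.5 * maha
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustrative two-class problem; classify x by the larger discriminant.
params = [
    (np.array([0.0, 0.0]), np.eye(2),               0.5),   # class 1
    (np.array([2.0, 2.0]), np.array([[2.0, 0.3],
                                     [0.3, 1.0]]),  0.5),   # class 2
]
x = np.array([1.0, 1.5])
scores = [gaussian_discriminant(x, m, S, P) for m, S, P in params]
print(scores, np.argmax(scores) + 1)
```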

Page 48: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

With $\boldsymbol{\Sigma}_i^{-1} = \frac{1}{\sigma^2} \mathbf{I}$, the terms that do not depend on $i$ are irrelevant:

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i) = -\frac{1}{2\sigma^2} \left(\mathbf{x}^T \mathbf{x} - 2 \boldsymbol{\mu}_i^T \mathbf{x} + \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i\right) + \ln P(\omega_i)$$

Dropping $\mathbf{x}^T \mathbf{x}$, which is the same for every class:

$$g_i(\mathbf{x}) = \frac{1}{\sigma^2} \boldsymbol{\mu}_i^T \mathbf{x} - \frac{1}{2\sigma^2} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \ln P(\omega_i)$$

Page 49: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

$$g_i(\mathbf{x}) = \frac{1}{\sigma^2} \boldsymbol{\mu}_i^T \mathbf{x} - \frac{1}{2\sigma^2} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \ln P(\omega_i)$$

This is a linear discriminant:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \ln P(\omega_i)$$

Page 50: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \ln P(\omega_i)$$

Boundary between $\omega_i$ and $\omega_j$: $g_i(\mathbf{x}) = g_j(\mathbf{x})$, i.e., $(\mathbf{w}_i - \mathbf{w}_j)^T \mathbf{x} + (w_{i0} - w_{j0}) = 0$. Substituting and multiplying by $\sigma^2$:

$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \mathbf{x} - \frac{1}{2} (\boldsymbol{\mu}_i^T \boldsymbol{\mu}_i - \boldsymbol{\mu}_j^T \boldsymbol{\mu}_j) + \sigma^2 \ln \frac{P(\omega_i)}{P(\omega_j)} = 0$$

$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \left[\mathbf{x} - \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) + \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)\right] = 0$$

Page 51: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

Boundary between $\omega_i$ and $\omega_j$: $g_i(\mathbf{x}) = g_j(\mathbf{x})$ defines the hyperplane

$$\mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0, \qquad \mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j, \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$

The decision boundary is a hyperplane perpendicular to the line between the means, passing through $\mathbf{x}_0$; if $P(\omega_i) = P(\omega_j)$, $\mathbf{x}_0$ is the midpoint of the means.

Page 52: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

$$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \frac{\sigma^2}{\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2} \ln \frac{P(\omega_1)}{P(\omega_2)}\, (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

When $P(\omega_1) = P(\omega_2)$, the rule becomes a minimum distance classifier (template matching).
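
A sketch of the Case 1 linear discriminants; with equal priors it reduces to assigning x to the nearest mean. All parameter values below are illustrative.

```python
import numpy as np

# Illustrative class means, shared isotropic covariance sigma^2 I, and priors.
means = np.array([[0.0, 0.0],
                  [3.0, 1.0]])
sigma2 = 1.0
priors = np.array([0.5, 0.5])

def g(x):
    """Linear discriminants for Sigma_i = sigma^2 I:
       g_i(x) = mu_i^T x / sigma^2 - mu_i^T mu_i / (2 sigma^2) + ln P(w_i)."""
    return (means @ x) / sigma2 - np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    return np.argmax(g(x)) + 1

# With equal priors this assigns x to the nearest mean.
print(classify(np.array([0.4, 0.2])), classify(np.array([2.5, 0.9])))
```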

Page 53: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

$$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \frac{\sigma^2}{\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2} \ln \frac{P(\omega_1)}{P(\omega_2)}\, (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

[Figure: decision boundaries for this case as the priors $P(\omega_1)$ and $P(\omega_2)$ vary.]

Page 54: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

[Figure: decision boundaries for this case as the priors $P(\omega_1)$ and $P(\omega_2)$ vary.]

Page 55: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$

[Figure: decision boundaries for this case as the priors $P(\omega_1)$ and $P(\omega_2)$ vary.]

Demo

Page 56: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

The terms that do not depend on $i$ are irrelevant, leaving

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$$

The quadratic form is the squared Mahalanobis distance; the prior term is irrelevant if $P(\omega_i) = P(\omega_j)$ for all $i, j$.

Page 57: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i) = -\frac{1}{2} \left(\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i\right) + \ln P(\omega_i)$$

The term $\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$ is the same for every class (irrelevant), giving the linear discriminant

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$

Page 58: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$

Boundary between $\omega_i$ and $\omega_j$: $g_i(\mathbf{x}) = g_j(\mathbf{x})$ defines the hyperplane $\mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0$ with

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j), \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
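
A sketch of the Case 2 linear discriminants; the shared covariance matrix, means, and priors below are illustrative.

```python
import numpy as np

# Illustrative parameters for two classes sharing the same covariance matrix.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
priors = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant parameters: w_i = Sigma^{-1} mu_i,
# w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(w_i).
w = [Sigma_inv @ m for m in means]
w0 = [-0.5 * m @ Sigma_inv @ m + np.log(P) for m, P in zip(means, priors)]

def classify(x):
    scores = [wi @ x + wi0 for wi, wi0 in zip(w, w0)]
    return np.argmax(scores) + 1

print(classify(np.array([0.3, 0.1])), classify(np.array([1.8, 0.9])))
```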

Page 59: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$

Page 60: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ (Demo)

Page 61: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 3: $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

Only the constant $\frac{d}{2} \ln 2\pi$ is irrelevant:

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

The discriminant is quadratic:

$$g_i(\mathbf{x}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
$$\mathbf{W}_i = -\frac{1}{2} \boldsymbol{\Sigma}_i^{-1}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

The quadratic term and the $-\frac{1}{2} \ln |\boldsymbol{\Sigma}_i|$ term are absent in Cases 1 and 2. Decision surfaces are hyperquadrics, e.g., hyperplanes, hyperspheres, hyperellipsoids, and hyperhyperboloids.
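
A sketch of the Case 3 quadratic discriminant parameters; the class mean, covariance, and prior below are illustrative.

```python
import numpy as np

def quadratic_discriminant_params(mu, Sigma, prior):
    """Parameters of g_i(x) = x^T W_i x + w_i^T x + w_i0 for an arbitrary Sigma_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return W, w, w0

# Illustrative class with its own covariance; evaluate g_i at a point.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.7]])
W, w, w0 = quadratic_discriminant_params(mu, Sigma, prior=0.3)
x = np.array([0.5, 0.0])
print(x @ W @ x + w @ x + w0)
```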

Page 62: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 3: $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$

Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance.

Page 63: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 3: $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$

Page 64: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 3: $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$

Page 65: Bayesian Decision Theory (Classification) 主講人:虞台文

Case 3: $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$

Demo

Page 66: Bayesian Decision Theory (Classification) 主講人:虞台文

Multi-Category Classification

Page 67: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Minimax Criterion

Page 68: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Rule: Two-Category Classification

Decide $\omega_1$ if

$$\underbrace{\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}}_{\text{threshold}}$$

The minimax criterion deals with the case where the prior probabilities are unknown.

Page 69: Bayesian Decision Theory (Classification) 主講人:虞台文

Basic Concept on Minimax

Consider the worst-case prior probabilities (those giving the maximum loss), and then pick the decision rule that minimizes the overall risk under them.

Minimize the maximum possible overall risk.

Page 70: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \int_{\mathcal{R}_1} R(\alpha_1 \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} + \int_{\mathcal{R}_2} R(\alpha_2 \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

With $R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$ and $R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$:

$$R = \int_{\mathcal{R}_1} \left[\lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})\right] p(\mathbf{x})\, d\mathbf{x} + \int_{\mathcal{R}_2} \left[\lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})\right] p(\mathbf{x})\, d\mathbf{x}$$

Page 71: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

Using $P(\omega_i \mid \mathbf{x})\, p(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\, P(\omega_i)$:

$$R = \int_{\mathcal{R}_1} \left[\lambda_{11} P(\omega_1)\, p(\mathbf{x} \mid \omega_1) + \lambda_{12} P(\omega_2)\, p(\mathbf{x} \mid \omega_2)\right] d\mathbf{x} + \int_{\mathcal{R}_2} \left[\lambda_{21} P(\omega_1)\, p(\mathbf{x} \mid \omega_1) + \lambda_{22} P(\omega_2)\, p(\mathbf{x} \mid \omega_2)\right] d\mathbf{x}$$

Page 72: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

With $P(\omega_2) = 1 - P(\omega_1)$:

$$R = \int_{\mathcal{R}_1} \left\{\lambda_{11} P(\omega_1)\, p(\mathbf{x} \mid \omega_1) + \lambda_{12} [1 - P(\omega_1)]\, p(\mathbf{x} \mid \omega_2)\right\} d\mathbf{x} + \int_{\mathcal{R}_2} \left\{\lambda_{21} P(\omega_1)\, p(\mathbf{x} \mid \omega_1) + \lambda_{22} [1 - P(\omega_1)]\, p(\mathbf{x} \mid \omega_2)\right\} d\mathbf{x}$$

Page 73: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

Since $\int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_i)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_i)\, d\mathbf{x} = 1$, the risk can be written as a function of $P(\omega_1)$:

$$R[P(\omega_1)] = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

Page 74: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

$$R[P(\omega_1)] = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

This is the overall risk for a particular $P(\omega_1)$. For a fixed decision boundary (fixed $\mathcal{R}_1$ and $\mathcal{R}_2$) the two bracketed quantities are constants, so the risk is linear in $P(\omega_1)$: $R = a\, P(\omega_1) + b$.

Page 75: Bayesian Decision Theory (Classification) 主講人:虞台文

Overall Risk

$$R[P(\omega_1)] = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

For the minimax solution the bracketed coefficient of $P(\omega_1)$ is set to zero, making the risk independent of the value of $P(\omega_i)$; the remaining constant term is $R_{mm}$, the minimax risk.

Page 76: Bayesian Decision Theory (Classification) 主講人:虞台文

Minimax Risk

$$R_{mm} = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} = \lambda_{11} + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x}$$

Page 77: Bayesian Decision Theory (Classification) 主講人:虞台文

Error Probability

$$R[P(\omega_1)] = \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

Using the 0/1 loss function ($\lambda_{11} = \lambda_{22} = 0$, $\lambda_{12} = \lambda_{21} = 1$):

$$P(\text{error})[P(\omega_1)] = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

Page 78: Bayesian Decision Theory (Classification) 主講人:虞台文

Minimax Error-Probability

$$P(\text{error})[P(\omega_1)] = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

Setting the bracket to zero gives the minimax error probability

$$P_{mm}(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x}$$

i.e., $P(\omega_1 \mid \omega_2) = P(\omega_2 \mid \omega_1)$.

Page 79: Bayesian Decision Theory (Classification) 主講人:虞台文

Minimax Error-Probability

[Figure: the two class-conditional densities over regions $\mathcal{R}_1$ and $\mathcal{R}_2$ ($\omega_1$ and $\omega_2$).]

$$P_{mm}(\text{error}) = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x}, \qquad P(\omega_1 \mid \omega_2) = P(\omega_2 \mid \omega_1)$$

Page 80: Bayesian Decision Theory (Classification) 主講人:虞台文

Minimax Error-Probability

$$P(\text{error})[P(\omega_1)] = \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + P(\omega_1) \left[\int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} - \int_{\mathcal{R}_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}\right]$$

Page 81: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Theory (Classification)

Neyman-Pearson Criterion

Page 82: Bayesian Decision Theory (Classification) 主講人:虞台文

Bayesian Decision Rule: Two-Category Classification

Decide $\omega_1$ if

$$\underbrace{\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)}}_{\text{likelihood ratio}} > \underbrace{\frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}}_{\text{threshold}}$$

The Neyman-Pearson criterion deals with the case where both the loss functions and the prior probabilities are unknown.

Page 83: Bayesian Decision Theory (Classification) 主講人:虞台文

Signal Detection Theory

Signal detection theory evolved from the development of communications and radar equipment in the first half of the last century.

It migrated to psychology, initially as part of sensation and perception, in the 1950s and 1960s, as an attempt to understand some features of human behavior when detecting very faint stimuli that were not explained by traditional theories of thresholds.

Page 84: Bayesian Decision Theory (Classification) 主講人:虞台文

The situation of interest

A person is faced with a stimulus (signal) that is very faint or confusing.

The person must make a decision: is the signal there or not?

What makes this situation confusing and difficult is the presence of other "mess" that is similar to the signal. Let us call this mess noise.

Page 85: Bayesian Decision Theory (Classification) 主講人:虞台文

Example

Noise is present both in the environment and in the sensory system of the observer.

The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as responding to environmental stimuli, which may include a signal.

Page 86: Bayesian Decision Theory (Classification) 主講人:虞台文

Example

A radiologist is examining a CT scan, looking for evidence of a tumor. This is a hard job, because there is always some uncertainty.

There are four possible outcomes:
– hit (tumor present and doctor says "yes")
– miss (tumor present and doctor says "no")
– false alarm (tumor absent and doctor says "yes")
– correct rejection (tumor absent and doctor says "no")

Two types of Error

Page 87: Bayesian Decision Theory (Classification) 主講人:虞台文

The Four Cases

| Decision \ Signal (tumor) | Absent ($\omega_1$) | Present ($\omega_2$) |
|---|---|---|
| No ($\alpha_1$) | Correct rejection, $P(\alpha_1 \mid \omega_1)$ | Miss, $P(\alpha_1 \mid \omega_2)$ |
| Yes ($\alpha_2$) | False alarm, $P(\alpha_2 \mid \omega_1)$ | Hit, $P(\alpha_2 \mid \omega_2)$ |

Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.

Page 88: Bayesian Decision Theory (Classification) 主講人:虞台文

Decision Making

[Figure: the "Noise" distribution ($\omega_1$) and the "Noise + Signal" distribution ($\omega_2$) along the internal response axis; a criterion splits the axis into "No" ($\alpha_1$) and "Yes" ($\alpha_2$), with the hit rate $P(\alpha_2 \mid \omega_2)$ and the false-alarm rate $P(\alpha_2 \mid \omega_1)$ given by the areas beyond the criterion.]

Discriminability: $d' = \dfrac{|\mu_2 - \mu_1|}{\sigma}$

The criterion is set based on expectancy (decision bias).

Page 89: Bayesian Decision Theory (Classification) 主講人:虞台文

ROC Curve (Receiver Operating Characteristic)

[Figure: the ROC curve plots the hit rate $P_H = P(\alpha_2 \mid \omega_2)$ against the false-alarm rate $P_{FA} = P(\alpha_2 \mid \omega_1)$ as the decision criterion is varied.]
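
A sketch that traces the ROC curve for an illustrative equal-variance Gaussian signal-detection model; the value of d' and the criterion grid are assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative signal-detection setup: noise ~ N(0,1), noise + signal ~ N(d',1).
d_prime = 1.5
noise, signal = norm(0.0, 1.0), norm(d_prime, 1.0)

# Sweep the criterion; "yes" is said whenever the response exceeds it.
criteria = np.linspace(-4, 6, 201)
P_FA = 1 - noise.cdf(criteria)      # false-alarm rate P(yes | noise only)
P_H = 1 - signal.cdf(criteria)      # hit rate        P(yes | signal present)

# Each (P_FA, P_H) pair is one point on the ROC curve.
for i in (40, 100, 160):
    print(f"criterion={criteria[i]:5.2f}  P_FA={P_FA[i]:.3f}  P_H={P_H[i]:.3f}")
```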

Page 90: Bayesian Decision Theory (Classification) 主講人:虞台文

Neyman-Pearson Criterion

NP: maximize $P_H = P(\alpha_2 \mid \omega_2)$ subject to $P_{FA} = P(\alpha_2 \mid \omega_1) \le a$.

[Figure: the Neyman-Pearson operating point on the ROC curve, the point with the largest hit rate whose false-alarm rate does not exceed $a$.]

Page 91: Bayesian Decision Theory (Classification) 主講人:虞台文

Likelihood Ratio Test

$$\phi(\mathbf{x}) = \begin{cases} 1 & \text{if } p(\mathbf{x} \mid \omega_1)/p(\mathbf{x} \mid \omega_2) < T \\ 0 & \text{if } p(\mathbf{x} \mid \omega_1)/p(\mathbf{x} \mid \omega_2) > T \end{cases}$$

where $T$ is a threshold that meets the $P_{FA}$ constraint ($\le a$).

$$\mathcal{R}_1 = \{\mathbf{x} \mid p(\mathbf{x} \mid \omega_1) > T\, p(\mathbf{x} \mid \omega_2)\}, \qquad \mathcal{R}_2 = \{\mathbf{x} \mid p(\mathbf{x} \mid \omega_1) < T\, p(\mathbf{x} \mid \omega_2)\}$$

How to determine $T$?
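
A numerical sketch of choosing T, assuming illustrative 1-D Gaussian class-conditional densities; for these densities the likelihood ratio is monotone in x, so the P_FA constraint fixes a criterion c and hence T.

```python
import numpy as np
from scipy.stats import norm

# Illustrative densities: p(x|w1) = noise, p(x|w2) = signal.
p1, p2 = norm(0.0, 1.0), norm(1.5, 1.0)
a = 0.05                                   # allowed false-alarm rate

# Here the likelihood ratio p(x|w1)/p(x|w2) decreases with x, so
# "ratio < T" is equivalent to "x > c"; pick c so that P_FA = P(x > c | w1) = a.
c = p1.ppf(1 - a)
T = p1.pdf(c) / p2.pdf(c)                  # the corresponding ratio threshold

P_FA = 1 - p1.cdf(c)                       # = a by construction
P_H = 1 - p2.cdf(c)                        # hit rate achieved under the constraint
print(T, P_FA, P_H)
```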

Page 92: Bayesian Decision Theory (Classification) 主講人:虞台文

Likelihood Ratio Test

$$\phi(\mathbf{x}) = \begin{cases} 1 & \text{if } p(\mathbf{x} \mid \omega_1)/p(\mathbf{x} \mid \omega_2) < T \\ 0 & \text{if } p(\mathbf{x} \mid \omega_1)/p(\mathbf{x} \mid \omega_2) > T \end{cases}$$
$$\mathcal{R}_1 = \{\mathbf{x} \mid p(\mathbf{x} \mid \omega_1) > T\, p(\mathbf{x} \mid \omega_2)\}, \qquad \mathcal{R}_2 = \{\mathbf{x} \mid p(\mathbf{x} \mid \omega_1) < T\, p(\mathbf{x} \mid \omega_2)\}$$

$$P_{FA} = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} = \int \phi(\mathbf{x})\, p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} = E[\phi(\mathbf{X}) \mid \omega_1]$$
$$P_{H} = \int_{\mathcal{R}_2} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} = \int \phi(\mathbf{x})\, p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} = E[\phi(\mathbf{X}) \mid \omega_2]$$

Page 93: Bayesian Decision Theory (Classification) 主講人:虞台文

Neyman-Pearson Lemma

Consider the above rule with $T$ chosen to give $P_{FA}(\phi) = a$. There is no decision rule $\phi'$ such that $P_{FA}(\phi') \le a$ and $P_H(\phi') > P_H(\phi)$.

$$P_{FA} = E[\phi(\mathbf{X}) \mid \omega_1], \qquad P_H = E[\phi(\mathbf{X}) \mid \omega_2]$$

Pf) Let $\phi'$ be a decision rule with $P_{FA}(\phi') = E[\phi'(\mathbf{X}) \mid \omega_1] \le a$. Consider

$$\int [\phi(\mathbf{x}) - \phi'(\mathbf{x})]\, [T\, p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1)]\, d\mathbf{x} \ge 0$$

Where $\phi(\mathbf{x}) = 1$: the factor $T\, p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1)$ is $> 0$ and $\phi(\mathbf{x}) - \phi'(\mathbf{x}) \ge 0$, so the integrand is $\ge 0$.

Page 94: Bayesian Decision Theory (Classification) 主講人:虞台文

Neyman-Pearson Lemma

Consider the above rule with $T$ chosen to give $P_{FA}(\phi) = a$. There is no decision rule $\phi'$ such that $P_{FA}(\phi') \le a$ and $P_H(\phi') > P_H(\phi)$.

Pf, continued) Where $\phi(\mathbf{x}) = 0$: the factor $T\, p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1)$ is $\le 0$ and $\phi(\mathbf{x}) - \phi'(\mathbf{x}) \le 0$, so the integrand is again $\ge 0$, and therefore

$$\int [\phi(\mathbf{x}) - \phi'(\mathbf{x})]\, [T\, p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1)]\, d\mathbf{x} \ge 0$$

Page 95: Bayesian Decision Theory (Classification) 主講人:虞台文


Neyman-Pearson Lemma

Consider the above rule with $T$ chosen to give $P_{FA}(\phi) = a$. There is no decision rule $\phi'$ such that $P_{FA}(\phi') \le a$ and $P_H(\phi') > P_H(\phi)$.

Pf, continued) Expanding the nonnegative integral:

$$0 \le \int [\phi(\mathbf{x}) - \phi'(\mathbf{x})]\, [T\, p(\mathbf{x} \mid \omega_2) - p(\mathbf{x} \mid \omega_1)]\, d\mathbf{x} = T\, [P_H(\phi) - P_H(\phi')] - [P_{FA}(\phi) - P_{FA}(\phi')]$$

Since $P_{FA}(\phi) = a \ge P_{FA}(\phi')$, the second bracket is $\ge 0$, so $T\, [P_H(\phi) - P_H(\phi')] \ge 0$ and hence $P_H(\phi) \ge P_H(\phi')$.