Nonparametric Estimation

Lecturer: 虞台文
Contents
- Introduction
- Parzen Windows
- kn-Nearest-Neighbor Estimation
- Classification Techniques
  – The Nearest-Neighbor Rule (1-NN)
  – The k-Nearest-Neighbor Rule (k-NN)
- Distance Metrics
Nonparametric Estimation
Introduction
Facts
- Classical parametric densities are unimodal, while many practical problems involve multimodal densities.
- Common parametric forms rarely fit the densities actually encountered in practice.
Goals
- Estimate class-conditional densities $p(\mathbf{x}|\omega_i)$.
- Estimate posterior probabilities $P(\omega_i|\mathbf{x})$.
Density Estimation

(Figure: a small region R containing the query point x, with n samples scattered around it.)

The probability mass of a region R is
$$P_R = \int_R p(\mathbf{x}')\, d\mathbf{x}'.$$

Assume p(x) is continuous and R is small, so that
$$\int_R p(\mathbf{x}')\, d\mathbf{x}' \approx p(\mathbf{x})\, V_R,$$
where $V_R$ is the volume of R; hence $P_R \approx p(\mathbf{x})\, V_R$.

Randomly take n samples, and let K denote the number of samples inside R. Then
$$P(K = k) = \binom{n}{k} P_R^k (1 - P_R)^{n-k}, \qquad K \sim B(n, P_R),$$
so $E[K] = n P_R$.
Density Estimation (continued)

Let $k_R$ denote the number of samples observed in R. Since $E[K] = n P_R$, a natural estimate is
$$P_R \approx k_R / n.$$
Combining this with $P_R \approx p(\mathbf{x})\, V_R$ gives the basic estimate
$$p(\mathbf{x}) \approx \frac{k_R / n}{V_R}.$$
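This relation is easy to check numerically. Below is a minimal sketch (not from the slides), assuming NumPy and an arbitrary random seed: it estimates a standard normal density at x = 0 by counting how many samples fall inside a small interval R.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Draw n samples from N(0, 1) and estimate p(x) at x = 0
# using the basic relation p(x) ~ (k/n) / V for a small region R.
n = 10_000
samples = rng.standard_normal(n)

x, half_width = 0.0, 0.05                      # R = [x - 0.05, x + 0.05]
V = 2 * half_width                             # volume (length) of R
k = np.sum(np.abs(samples - x) <= half_width)  # samples falling inside R

p_hat = (k / n) / V
print(p_hat)  # should be close to 1/sqrt(2*pi) ~ 0.3989
```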
Density Estimation (continued)

What items can be controlled, and how?

Use the subscript n to take the sample size into account:
$$p_n(\mathbf{x}) = \frac{k_n / n}{V_n}.$$

We hope that $\lim_{n\to\infty} p_n(\mathbf{x}) = p(\mathbf{x})$. For this, we should have:

1. $\lim_{n\to\infty} V_n = 0$
2. $\lim_{n\to\infty} k_n = \infty$
3. $\lim_{n\to\infty} k_n / n = 0$
Two Approaches

- Parzen windows: control $V_n$.
- $k_n$-nearest-neighbor: control $k_n$.
Nonparametric Estimation
Parzen Windows
Window Function

Define the unit hypercube window
$$\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2,\ j = 1, 2, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
which satisfies $\int \varphi(\mathbf{u})\, d\mathbf{u} = 1$.
Window Function (continued)

Scaling by the width $h_n$,
$$\varphi\!\left(\frac{\mathbf{x}}{h_n}\right) = \begin{cases} 1 & |x_j| \le h_n/2,\ j = 1, 2, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
describes a hypercube of side $h_n$ centered at the origin.
Window Function (continued)

Centering the scaled window at a sample $\mathbf{x}_i$,
$$\varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) = \begin{cases} 1 & |x_j - x_{ij}| \le h_n/2,\ j = 1, 2, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
equals 1 exactly when $\mathbf{x}_i$ falls inside the hypercube of side $h_n$ centered at $\mathbf{x}$.
Parzen-Window Estimation

Given the samples $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$, let $k_n$ be the number of samples inside the hypercube of side $h_n$ centered at $\mathbf{x}$:
$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right).$$
Substituting into $p_n(\mathbf{x}) = (k_n/n)/V_n$ gives the Parzen-window estimate
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right), \qquad V_n = h_n^d.$$
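The estimate is a direct sum over the training samples. A minimal NumPy sketch; the function name, toy data, and window width below are illustrative choices, not part of the slides.

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen-window estimate p_n(x) with a hypercube window of side h.

    x:       query point, shape (d,)
    samples: training data, shape (n, d)
    h:       window width h_n
    """
    n, d = samples.shape
    # phi((x - x_i)/h) = 1 iff every coordinate differs by at most h/2
    inside = np.all(np.abs(samples - x) <= h / 2.0, axis=1)
    k = np.sum(inside)              # k_n: samples inside the hypercube
    V = h ** d                      # V_n = h^d
    return (k / n) / V

# Usage: estimate a 2-D density at the origin (illustrative values).
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 2))
print(parzen_hypercube(np.zeros(2), data, h=0.5))
```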
Generalization

The same estimate
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)$$
may be used with any window function satisfying $\int \varphi(\mathbf{u})\, d\mathbf{u} = 1$.
Requirement: $\int p_n(\mathbf{x})\, d\mathbf{x} = 1$

Setting $\mathbf{u} = (\mathbf{x} - \mathbf{x}_i)/h_n$ and using $V_n = h_n^d$,
$$\int p_n(\mathbf{x})\, d\mathbf{x} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \int \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) d\mathbf{x} = \frac{1}{n} \sum_{i=1}^{n} \int \varphi(\mathbf{u})\, d\mathbf{u} = 1.$$

The window is not necessarily a hypercube. $h_n$ is an important parameter, and it depends on the sample size.
Interpolation Parameter

Write the per-sample contribution as
$$\delta_n(\mathbf{x}) = \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x}}{h_n}\right).$$
As $h_n \to 0$, $\delta_n(\mathbf{x})$ approaches a Dirac delta function.
Example

Parzen-window estimations
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)$$
for five samples.
Convergence Conditions

To assure convergence of
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right),$$
i.e.,
$$\lim_{n\to\infty} E[p_n(\mathbf{x})] = p(\mathbf{x}) \qquad\text{and}\qquad \lim_{n\to\infty} \operatorname{Var}[p_n(\mathbf{x})] = 0,$$
we have the following additional constraints:
$$\sup_{\mathbf{u}}\, \varphi(\mathbf{u}) < \infty, \qquad \lim_{\|\mathbf{u}\|\to\infty} \varphi(\mathbf{u}) \prod_{i=1}^{d} u_i = 0,$$
$$\lim_{n\to\infty} V_n = 0, \qquad \lim_{n\to\infty} n V_n = \infty.$$
Illustrations

One-dimensional case: take the Gaussian window
$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2},$$
so that
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right),$$
with samples drawn from $X \sim N(0, 1)$ and width $h_n = h_1/\sqrt{n}$.
Two-dimensional case: with the same Gaussian window,
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n^2}\, \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right), \qquad h_n = h_1/\sqrt{n}.$$
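A small sketch of the illustrated one-dimensional case, assuming NumPy; the sample size, seed, and candidate values of $h_1$ are arbitrary.

```python
import numpy as np

def parzen_gaussian_1d(x, samples, h1):
    """1-D Parzen estimate with Gaussian window and h_n = h1 / sqrt(n)."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - samples) / hn                        # (x - x_i) / h_n
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(phi / hn)                      # (1/n) sum (1/h_n) phi(.)

# Samples from N(0, 1); the estimate should approach the true density.
rng = np.random.default_rng(2)
data = rng.standard_normal(256)
for h1 in (0.5, 1.0, 2.0):                        # illustrative initial widths
    print(h1, parzen_gaussian_1d(0.0, data, h1))
```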
Classification Example

(Figure: decision regions obtained with a smaller window vs. a larger window.)
Choosing the Window Function

- $V_n$ must approach zero as $n \to \infty$, but at a rate slower than $1/n$, e.g., $V_n = V_1/\sqrt{n}$.
- The value of the initial volume $V_1$ is important.
- In some cases, a cell volume proper for one region may be unsuitable in a different region.
PNN (Probabilistic Neural Network)

Use a Gaussian window
$$\varphi(\mathbf{u}) = \frac{1}{(2\pi)^{d/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}\, \mathbf{u}^T \boldsymbol{\Sigma}^{-1} \mathbf{u}\right).$$
With $\boldsymbol{\Sigma} = \mathbf{I}$ and training set $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$,
$$\varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_k}{h_n}\right) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{(\mathbf{x} - \mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k)}{2 h_n^2}\right) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{\mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}_k^T\mathbf{x} + \mathbf{x}_k^T\mathbf{x}_k}{2 h_n^2}\right),$$
so the Parzen estimate becomes
$$p_n(\mathbf{x}) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{V_n}\, \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{\mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}_k^T\mathbf{x} + \mathbf{x}_k^T\mathbf{x}_k}{2 h_n^2}\right).$$
PNN (continued)

The factor $\exp(-\mathbf{x}^T\mathbf{x}/2h_n^2)$ is common to every term and is therefore irrelevant for discriminant analysis. Define
$$net_k = \mathbf{w}_k^T \mathbf{x}, \qquad \mathbf{w}_k = \mathbf{x}_k, \qquad \theta_k = \tfrac{1}{2}\, \mathbf{x}_k^T \mathbf{x}_k \ (\text{bias}),$$
$$a_k(net_k) = \exp\!\left(\frac{net_k - \theta_k}{h_n^2}\right).$$
Then, up to factors irrelevant for discriminant analysis,
$$p_n(\mathbf{x}) \propto \sum_{k=1}^{n} a_k(net_k).$$
PNN (continued)

(Figure: a single pattern unit. Inputs $x_1, x_2, \dots, x_d$ feed through the weights $w_{k1}, w_{k2}, \dots, w_{kd}$ of unit $k$, which computes $net_k = \mathbf{w}_k^T \mathbf{x}$ and outputs the activation $a_k(net_k)$.)
PNN (continued)

With labeled training sets $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_c\}$, each class gets one output unit that sums the activations of that class's pattern units. (Figure: the input layer $x_1, x_2, \dots, x_d$, the pattern units, and the output units $o_1, o_2, \dots, o_c$, where $o_i \propto p(\mathbf{x}|\omega_i)$.)
PNN (continued)

Assign patterns to the class with the maximum output value.

Questions:
1. What are the pros and cons of the approach?
2. How can prior probabilities be dealt with?
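Below is a minimal PNN sketch along the lines of the slides, assuming NumPy; the class name, the two-blob toy data, and h = 1.0 are illustrative. The common factor $\exp(-\mathbf{x}^T\mathbf{x}/2h^2)$ is dropped, which leaves the argmax over class outputs unchanged.

```python
import numpy as np

class PNN:
    """Minimal probabilistic-neural-network sketch: one pattern unit per
    training sample, one output unit per class. The common factor
    exp(-x.T x / (2 h^2)) is dropped, as it does not affect which
    class output is largest."""

    def __init__(self, X, y, h):
        self.W = np.asarray(X, float)                   # w_k = x_k, shape (n, d)
        self.y = np.asarray(y)
        self.h = h
        self.theta = 0.5 * np.sum(self.W ** 2, axis=1)  # bias: x_k.T x_k / 2

    def class_outputs(self, x):
        net = self.W @ x                                # net_k = w_k.T x
        a = np.exp((net - self.theta) / self.h ** 2)    # pattern activations
        classes = np.unique(self.y)
        return classes, np.array([a[self.y == c].sum() for c in classes])

    def classify(self, x):
        classes, o = self.class_outputs(x)
        return classes[np.argmax(o)]                    # class with max output

# Toy usage with two Gaussian blobs (illustrative data).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(PNN(X, y, h=1.0).classify(np.array([3.5, 4.2])))  # expect 1
```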
Nonparametric Estimation
kn-Nearest-Neighbor Estimation
Basic Concept

Let the cell volume depend on the training data. To estimate p(x), center a cell about x and let it grow until it captures $k_n$ samples, where $k_n$ is some specified function of n, e.g., $k_n = \sqrt{n}$.
Example: $k_n = 5$.

Example: $k_n = \sqrt{n}$.
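A one-dimensional sketch of the idea, assuming NumPy; the default $k_n = \sqrt{n}$ follows the example above, while the function name and test data are illustrative.

```python
import numpy as np

def knn_density_1d(x, samples, kn=None):
    """1-D k_n-nearest-neighbor density estimate: grow an interval around x
    until it captures k_n samples, then p_n(x) = (k_n/n) / V_n."""
    samples = np.asarray(samples, float)
    n = len(samples)
    if kn is None:
        kn = max(1, int(round(np.sqrt(n))))    # e.g. k_n = sqrt(n)
    r = np.sort(np.abs(samples - x))[kn - 1]   # radius to the k_n-th neighbor
    V = 2 * r                                  # interval volume (length)
    return (kn / n) / V

rng = np.random.default_rng(4)
data = rng.standard_normal(1000)
print(knn_density_1d(0.0, data))               # ~ 0.399 for N(0,1) at 0
```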
Estimation of A Posteriori Probabilities

$P_n(\omega_i|\mathbf{x}) = ?$

Place a cell of volume $V_n$ around $\mathbf{x}$ and suppose it captures $k$ samples, $k_i$ of which are labeled $\omega_i$. The joint density estimate is
$$p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V_n},$$
so
$$P_n(\omega_i|\mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}.$$

The value of $V_n$ or $k_n$ can be determined based on the Parzen-window or $k_n$-nearest-neighbor technique.
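A short sketch of the $k_i/k$ estimate, assuming NumPy and Euclidean distance; the function name and the two-blob toy data are illustrative.

```python
import numpy as np

def knn_posteriors(x, X, y, k):
    """Estimate P_n(w_i | x) = k_i / k from the k nearest training samples."""
    X, y = np.asarray(X, float), np.asarray(y)
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]        # k nearest
    classes = np.unique(y)
    counts = np.array([(y[idx] == c).sum() for c in classes])  # k_i per class
    return classes, counts / k                                 # P_n(w_i | x)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(knn_posteriors(np.array([1.0, 1.0]), X, y, k=9))
```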
Nonparametric Estimation
Classification Techniques
• The Nearest-Neighbor Rule
• The k-Nearest-Neighbor Rule
The Nearest-Neighbor Rule

Given a set of labeled prototypes $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$, classify $\mathbf{x}$ into the class of its nearest prototype $\mathbf{x}'$.
The Nearest-Neighbor Rule: Voronoi Tessellation

The rule partitions the feature space into a Voronoi tessellation: each prototype's cell consists of the points closer to it than to any other prototype, and every point in the cell receives that prototype's label.
Error Rate

Bayesian (optimum): choose $\omega_m = \arg\max_i P(\omega_i|\mathbf{x})$, giving
$$P^*(error|\mathbf{x}) = 1 - P(\omega_m|\mathbf{x}), \qquad P^*(error) = \int P^*(error|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.$$
Error Rate (1-NN)

Let $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$ be labeled prototypes with labels $\theta_i \in \{\omega_1, \omega_2, \dots, \omega_c\}$, and let $\mathbf{x}'$ be the nearest neighbor of $\mathbf{x}$. An error occurs unless $\mathbf{x}$ and $\mathbf{x}'$ have the same class, so
$$P(error|\mathbf{x}, \mathbf{x}') = 1 - \sum_{i=1}^{c} P(\omega_i, \omega_i|\mathbf{x}, \mathbf{x}') = 1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x})\, P(\omega_i|\mathbf{x}'),$$
$$P(error|\mathbf{x}) = \int P(error|\mathbf{x}, \mathbf{x}')\, p(\mathbf{x}'|\mathbf{x})\, d\mathbf{x}'.$$
Error Rate (1-NN, asymptotic)

As $n \to \infty$, $\mathbf{x}' \to \mathbf{x}$ and $p(\mathbf{x}'|\mathbf{x}) \to \delta(\mathbf{x}' - \mathbf{x})$, so
$$P(error|\mathbf{x}) \to 1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x})^2.$$
Error Rate (1-NN, asymptotic)

Hence the asymptotic 1-NN error rate is
$$P(error) = \int P(error|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \int \left[1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x})^2\right] p(\mathbf{x})\, d\mathbf{x},$$
to be compared with the optimum $P^*(error) = \int [1 - P(\omega_m|\mathbf{x})]\, p(\mathbf{x})\, d\mathbf{x}$.
Error Bounds

Bayesian: $P^*(error) = \int [1 - P(\omega_m|\mathbf{x})]\, p(\mathbf{x})\, d\mathbf{x}$.
1-NN: $P(error) = \int [1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x})^2]\, p(\mathbf{x})\, d\mathbf{x}$.

Consider the most complex classification case, $P(\omega_i|\mathbf{x}) = 1/c$ for all $i$:
$$P(error) = \int (1 - 1/c)\, p(\mathbf{x})\, d\mathbf{x} = 1 - 1/c, \qquad P^*(error) = 1 - 1/c,$$
so here $P^*(error) = P(error)$. What is the relation in general?
Error Bounds (continued)

Consider the opposite case, $P(\omega_m|\mathbf{x}) \approx 1$. Split
$$\sum_{i=1}^{c} P(\omega_i|\mathbf{x})^2 = P(\omega_m|\mathbf{x})^2 + \sum_{i \ne m} P(\omega_i|\mathbf{x})^2,$$
with $P^*(error|\mathbf{x}) = 1 - P(\omega_m|\mathbf{x})$. To find the upper bound on the 1-NN error, minimize this sum. First,
$$P(\omega_m|\mathbf{x})^2 = [1 - P^*(error|\mathbf{x})]^2 = 1 - 2P^*(error|\mathbf{x}) + P^*(error|\mathbf{x})^2.$$
The remaining term is minimized when all the other posteriors are equal, i.e.,
$$P(\omega_i|\mathbf{x}) = \frac{P^*(error|\mathbf{x})}{c - 1}, \qquad i \ne m,$$
which gives
$$\sum_{i=1}^{c} P(\omega_i|\mathbf{x})^2 \ge 1 - 2P^*(error|\mathbf{x}) + \frac{c}{c-1}\, P^*(error|\mathbf{x})^2.$$
Error Bounds (continued)

Therefore
$$1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x})^2 \le 2P^*(error|\mathbf{x}) - \frac{c}{c-1}\, P^*(error|\mathbf{x})^2 \le 2P^*(error|\mathbf{x}),$$
and integrating against $p(\mathbf{x})$,
$$P(error) \le \int 2P^*(error|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = 2P^*(error).$$
Combined with the optimality of the Bayes rule, this yields $P^*(error) \le P(error) \le 2P^*(error)$.
Error Bounds (conclusion)

$$P^*(error) \le P(error) \le 2P^*(error)$$

The nearest-neighbor rule is a suboptimal procedure, but its asymptotic error rate is never worse than twice the Bayes rate.
The k-Nearest-Neighbor Rule

Example: k = 5. Assign the pattern to the class that wins the majority among its k nearest neighbors.
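A minimal majority-vote sketch, assuming NumPy; ties are broken arbitrarily by Counter, and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=5):
    """k-nearest-neighbor rule: majority vote among the k closest prototypes."""
    d = np.linalg.norm(prototypes - x, axis=1)
    nearest = np.argsort(d)[:k]                      # indices of k nearest
    vote = Counter(labels[i] for i in nearest)       # count class occurrences
    return vote.most_common(1)[0][0]                 # majority class

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array(["red"] * 20 + ["blue"] * 20)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))
```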
(Figure: error bounds for the k-nearest-neighbor rule.)
Computation Complexity

The computational complexity of the nearest-neighbor algorithm (in both time and space) has received a great deal of analysis.

- Storing the n prototypes of a training set requires O(dn) space. Remedies: editing, pruning, or condensing.
- Searching for the nearest neighbor of a d-dimensional test point x takes O(dn) time. Remedies: partial distance, search trees.
Partial Distance

Use the following fact to reject far-away prototypes early: for $r \le d$,
$$D_r(\mathbf{a}, \mathbf{b}) = \left(\sum_{k=1}^{r} (a_k - b_k)^2\right)^{1/2} \le \left(\sum_{k=1}^{d} (a_k - b_k)^2\right)^{1/2} = D_d(\mathbf{a}, \mathbf{b}).$$
Since the partial distance can only grow as more dimensions are accumulated, a prototype can be abandoned as soon as its partial distance exceeds the distance of the best candidate found so far.
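A sketch of the early-rejection search, assuming NumPy; it is written with an explicit per-dimension loop so the rejection test is visible (a production version would be organized differently).

```python
import numpy as np

def nn_partial_distance(x, prototypes):
    """Nearest-neighbor search with partial-distance early rejection:
    abandon a prototype once its partial squared distance exceeds the
    best full squared distance found so far (D_r is monotone in r)."""
    best_i, best_d2 = -1, np.inf
    for i, p in enumerate(prototypes):
        d2 = 0.0
        for a, b in zip(x, p):                 # accumulate over dimensions
            d2 += (a - b) ** 2
            if d2 >= best_d2:                  # partial distance already too big
                break                          # reject this prototype early
        else:
            best_i, best_d2 = i, d2            # completed: new best candidate
    return best_i, np.sqrt(best_d2)

rng = np.random.default_rng(7)
P = rng.standard_normal((1000, 16))
print(nn_partial_distance(np.zeros(16), P))
```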
Editing Nearest Neighbor
Given a set of points, a Voronoi diagram is a partition of space into regions, within which all points are closer to some particular node than to any other node.
Delaunay Triangulation
If two Voronoi regions share a boundary, the nodes of these regions are connected with an edge. Such nodes are called Voronoi neighbors (or Delaunay neighbors).
The Decision Boundary
The circled prototypes are redundant.
The Edited Training Set
Editing: The Voronoi Diagram Approach

1. Compute the Delaunay triangulation for the training set.
2. Visit each node, marking it if all its Delaunay neighbors are of the same class as the current node.
3. Delete all marked nodes, exiting with the remaining ones as the edited training set.
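A sketch of the three steps above, assuming NumPy and SciPy's Delaunay triangulation; the Delaunay edges are read off the simplices, and the two-blob toy data are illustrative.

```python
import numpy as np
from scipy.spatial import Delaunay
from itertools import combinations

def voronoi_edit(X, y):
    """Voronoi-diagram editing: drop every prototype whose Delaunay
    neighbors all share its own class (such prototypes do not touch
    the decision boundary)."""
    tri = Delaunay(X)
    neighbors = [set() for _ in range(len(X))]
    for simplex in tri.simplices:              # every pair of vertices in a
        for i, j in combinations(simplex, 2):  # simplex is a Delaunay edge
            neighbors[i].add(j)
            neighbors[j].add(i)
    keep = [i for i, nb in enumerate(neighbors)
            if any(y[j] != y[i] for j in nb)]  # keep boundary prototypes
    return X[keep], y[keep]

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
Xe, ye = voronoi_edit(X, y)
print(len(X), "->", len(Xe))                   # the edited set is smaller
```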
Editing: Other Approaches

- The Gabriel Graph Approach
- The Relative Neighbour Graph Approach

References:
- B. K. Bhattacharya, R. S. Poulsen, and G. T. Toussaint, "Application of Proximity Graphs to Editing Nearest Neighbor Decision Rule," International Symposium on Information Theory, Santa Monica, 1981.
- T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. IT-13, no. 1, 1967, pp. 21-27.
- V. Klee, "On the complexity of d-dimensional Voronoi diagrams," Arch. Math., vol. 34, 1980, pp. 75-80.
- G. T. Toussaint, "The Relative Neighborhood Graph of a Finite Planar Set," Pattern Recognition, vol. 12, no. 4, 1980, pp. 261-268.
Nonparametric Estimation
Distance Metrics
Nearest-Neighbor Classifier

The distance measurement is an important factor for a nearest-neighbor classifier, e.g., to achieve invariant pattern recognition. Two cautionary effects:
- The effect of changing units: rescaling one coordinate can change which prototype is nearest.
- The effect of translation: a shifted pattern can be far from its unshifted version under a naive metric.
Properties of a Distance Metric

- Nonnegativity: $D(\mathbf{a}, \mathbf{b}) \ge 0$
- Reflexivity: $D(\mathbf{a}, \mathbf{b}) = 0$ iff $\mathbf{a} = \mathbf{b}$
- Symmetry: $D(\mathbf{a}, \mathbf{b}) = D(\mathbf{b}, \mathbf{a})$
- Triangle inequality: $D(\mathbf{a}, \mathbf{b}) + D(\mathbf{b}, \mathbf{c}) \ge D(\mathbf{a}, \mathbf{c})$
Minkowski Metric (Lp Norm)

$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$$

1. L1 norm (Manhattan or city-block distance):
$$L_1(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_1 = \sum_{i=1}^{d} |a_i - b_i|$$

2. L2 norm (Euclidean distance):
$$L_2(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \left(\sum_{i=1}^{d} |a_i - b_i|^2\right)^{1/2}$$

3. $L_\infty$ norm (chessboard distance):
$$L_\infty(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_\infty = \lim_{p\to\infty}\left(\sum_{i=1}^{d} |a_i - b_i|^p\right)^{1/p} = \max_i |a_i - b_i|$$