Statistical Pattern Recognition: A Review
Lecturer: 虞台文
Contents
Introduction
Statistical Pattern Recognition
The Curse of Dimensionality & Peaking Phenomena
Dimensionality Reduction
– Feature Extraction
– Feature Selection
Classifiers
Classifier Combination
Error Estimation
Unsupervised Classification
Statistical Pattern Recognition: A Review
Introduction
What is Pattern Recognition?
The study of how machines can
– observe the environment,
– learn to distinguish patterns of interest from their background, and
– make sound and reasonable decisions about the categories of the patterns.
What is a pattern? What kinds of categories do we have?
What is a Pattern?
As opposed to chaos, a pattern is an entity, vaguely defined, that could be given a name.
For example, a pattern could be
– a fingerprint image
– a handwritten cursive word
– a human face
– a speech signal
Categories (Classes)
Supervised Classification
– Discriminant Analysis
Unsupervised Classification
– Clustering
Applications of Pattern Recognition
The Design
The design of a pattern recognition system essentially involves the following three aspects:
– data acquisition
– data representation
– decision making
Pattern Recognition Models
The four best-known approaches:
– template matching
– statistical classification
– syntactic or structural matching
– neural networks
Statistical Pattern Recognition: A Review
Statistical Pattern Recognition
Pattern Representation
A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector:
$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$
Two Modes of a Pattern Recognition System (figure):
– Classification mode: test pattern → Preprocessing → Feature Measurement → Classification
– Training mode: training pattern → Preprocessing → Feature Extraction/Selection → Learning
Decision Making Rules
Given a pattern $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$, assign it to one of c categories in $\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$.
Decision Making Rules
(Figure: the feature space partitioned into three class regions $\omega_1, \omega_2, \omega_3$; for each pattern $\mathbf{x}$ the rule must decide $\omega(\mathbf{x}) = \,?$)
The States of Nature
(Figure: the same three classes, annotated with their priors and densities.)
$P(\omega_i)$: prior probabilities.
$P(\mathbf{x}|\omega_i)$: class-conditional probabilities.
Bayesian Decision Theory
– $P(\omega_i)$: prior probabilities.
– $P(\mathbf{x}|\omega_i)$: class-conditional probabilities.
A posterior probability:
$P(\omega_i|\mathbf{x}) = \dfrac{P(\mathbf{x}|\omega_i)\,P(\omega_i)}{P(\mathbf{x})}$, where $P(\mathbf{x}) = \sum_{i=1}^{c} P(\mathbf{x}|\omega_i)\,P(\omega_i)$
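To make the computation concrete, here is a minimal sketch (not from the slides) that evaluates these posteriors for a 1-D problem; the three Gaussian class-conditional densities and the priors are made-up values:

```python
import numpy as np
from scipy.stats import norm

# Made-up 1-D example: three classes with Gaussian class-conditional
# densities P(x|w_i) and prior probabilities P(w_i).
priors = np.array([0.5, 0.3, 0.2])        # P(w_i)
means = np.array([-2.0, 0.0, 3.0])        # class means (hypothetical)
stds = np.array([1.0, 0.5, 1.5])          # class std devs (hypothetical)

def posteriors(x):
    likelihoods = norm.pdf(x, loc=means, scale=stds)  # P(x|w_i)
    joint = likelihoods * priors                      # P(x|w_i) P(w_i)
    return joint / joint.sum()   # divide by P(x) = sum_i P(x|w_i) P(w_i)

print(posteriors(0.4))   # posterior P(w_i|x) for a test point
```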
Decision Making Rules
– Bayes Decision Rule
– Maximum Likelihood Rule
– Minimax
– Neyman-Pearson
– . . .
Loss Functions & Conditional Risk
$\lambda(\omega_i|\omega_j)$: the loss incurred in deciding $\omega_i$ when the true class is $\omega_j$ (the loss function).
Conditional risk:
$R(\omega_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\omega_i|\omega_j)\,P(\omega_j|\mathbf{x})$
where $P(\omega_j|\mathbf{x})$ is the posterior probability.
Bayes Decision Rule
$\omega(\mathbf{x}) = \arg\min_{\omega_i} R(\omega_i|\mathbf{x})$
The optimal decision rule for minimizing the risk.
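A hedged illustration of this rule: given the posteriors for a pattern and a loss matrix (the values here are made up), pick the class with the smallest conditional risk:

```python
import numpy as np

# lam[i, j] = loss incurred in deciding w_i when the true class is w_j
# (made-up values; the diagonal is zero: correct decisions cost nothing)
lam = np.array([[0.0, 2.0, 1.0],
                [1.0, 0.0, 3.0],
                [4.0, 2.0, 0.0]])

def bayes_decision(posterior):
    risks = lam @ posterior     # R(w_i|x) = sum_j lam(i,j) P(w_j|x)
    return np.argmin(risks)     # w(x) = argmin_i R(w_i|x)

print(bayes_decision(np.array([0.2, 0.5, 0.3])))  # index of chosen class
```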
Maximum Likelihood Decision Rule
With the 0/1 loss function
$\lambda(\omega_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases}$
the minimum-risk rule $\omega(\mathbf{x}) = \arg\min_{\omega_i} R(\omega_i|\mathbf{x})$ becomes
$\omega(\mathbf{x}) = \arg\max_{\omega_i} P(\omega_i|\mathbf{x})$
Minimax
Minimax deals with the case where the prior probabilities $P(\omega_i)$ are unknown: following a game-theoretic viewpoint, choose the decision rule that minimizes the maximum possible overall risk.
Neyman-Pearson
The Neyman-Pearson criterion minimizes the overall risk subject to a constraint, such as
$\int R(\omega_i|\mathbf{x})\,d\mathbf{x} = \text{constant}$
for a particular $i$.
Various Approaches
Performance Evaluation
(Figure: the two modes of a pattern recognition system again, now showing the Training Set feeding the training mode — preprocessing, feature extraction/selection, learning — and the Test Set feeding the classification mode — preprocessing, feature measurement, classification.)
Optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set.
Problems on Learning (Generalization)
The curse of dimensionality
– The number of features is too large relative to the number of training samples.
Classifier complexity
– The number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network).
Overtraining
– The classifier is too intensively optimized on the training set.
Statistical Pattern Recognition: A Review
The Curse of Dimensionality & Peaking Phenomena
The Curse
The performance of a classifier depends on the interrelationship between
– sample sizes
– number of features
– classifier complexity
If a table-lookup technique is adopted for classification, how many training samples are required w.r.t. the number of features? (Exponentially many: the number of table cells grows exponentially with the number of features.)
Peaking Phenomena
The probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known.
In practice, with a finite number of training samples, adding features may actually degrade the performance of a classifier.
Trunk’s Example
Two classes with mean vectors and covariance matrices as follows:
$\mathbf{m}_1 = \left(1,\ \tfrac{1}{\sqrt{2}},\ \tfrac{1}{\sqrt{3}},\ \ldots,\ \tfrac{1}{\sqrt{d}}\right)^T, \qquad \mathbf{m}_2 = -\mathbf{m}_1, \qquad \Sigma_1 = \Sigma_2 = I$
G. V. Trunk, “A Problem of Dimensionality: A Simple Example,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 3, pp. 306-307, July 1979.
Guideline
The relationship between the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish.
It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is good practice to follow in classifier design.
Statistical Pattern Recognition: A Review
Dimensionality Reduction
Dimensionality Reduction
A limited yet salient feature set simplifies both pattern representation and classifier design.
Pattern representation is easy for 2D and 3D features.
How can patterns with high-dimensional features be made viewable?
Example: Pattern Representation
Dimensionality Reduction: How?
Feature Extraction
– Create new features based on the original feature set.
– Transforms are usually involved.
Feature Selection
– Select the best subset from a given feature set.
Main Issues in Dimensionality Reduction
The choice of a criterion function
– Commonly used criterion: classification error
The determination of the appropriate dimensionality
– Correlated with the intrinsic dimensionality of the data
Dimensionality Reduction
Feature Extraction
Feature Extractor
$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T \ \longrightarrow\ \mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{im})^T$, usually $m < d$.
Some Important Methods
Linear approaches:
– Principal Component Analysis (PCA), or Karhunen-Loeve Expansion
– Projection Pursuit
– Independent Component Analysis (ICA)
– Factor Analysis
– Discriminant Analysis
Nonlinear approaches:
– Kernel PCA
– Multidimensional Scaling (MDS)
– Neural networks: Feed-Forward Neural Networks (implementing linear or nonlinear PCA), Self-Organizing Map
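For concreteness, a minimal PCA sketch via eigen-decomposition of the sample covariance matrix (a standard construction, not taken from the slides):

```python
import numpy as np

def pca(X, m):
    """Project d-dimensional rows of X onto the top-m principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # d x d sample covariance
    _, eigvecs = np.linalg.eigh(cov)         # eigenvectors, ascending order
    W = eigvecs[:, ::-1][:, :m]              # top-m eigenvectors as columns
    return Xc @ W                            # n x m extracted features

X = np.random.randn(200, 5)
Y = pca(X, 2)       # y_i = W^T (x_i - mean), with m < d
print(Y.shape)      # (200, 2)
```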
Demonstration (figure; Iris data: +: Iris Setosa, *: Iris Versicolor, o: Iris Virginica) comparing PCA, Fisher mapping, Sammon mapping, and Kernel PCA with a second-order polynomial kernel.
Summary
Dimensionality Reduction
Feature Selection
Feature Selector
$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T \ \longrightarrow\ \mathbf{x}' = (x'_1, x'_2, \ldots, x'_m)^T$, usually $m < d$.
Number of possible selections: $\binom{d}{m}$
The problem: Given a set of d features, select a subset of size m that leads to the smallest classification error.
No nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset.
Optimal Methods
Exhaustive Search
– Evaluate all possible subsets.
Branch-and-Bound
– The monotonicity property of the criterion function has to hold.
Suboptimal Methods
– Best Individual Features
– Sequential Forward Selection (SFS)
– Sequential Backward Selection (SBS)
– “Plus l-take away r” Selection
– Sequential Forward Floating Search and Sequential Backward Floating Search
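A sketch of Sequential Forward Selection under simple assumptions: `score(subset)` stands for any user-supplied criterion (e.g., cross-validated accuracy), and the toy criterion below is made up. The search is greedy, hence suboptimal:

```python
def sfs(d, m, score):
    """Greedily grow a feature subset of size m; `score` evaluates a subset."""
    selected = []
    remaining = set(range(d))
    while len(selected) < m:
        # add the single feature that improves the criterion the most
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy criterion (made up): prefer low feature indices
print(sfs(d=6, m=3, score=lambda s: -sum(s)))
```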
Summary (figure): taxonomy of feature selection methods, split into optimal and suboptimal approaches.
Statistical Pattern Recognition: A Review
Classifiers
Classification
Once a feature selection or classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches.
In practice, the choice of a classifier is a difficult problem, and it is often based on which classifier(s) happen to be available to, or best known by, the user.
Approaches of Classifier Design
Based on Similarity
Probabilistic Approach
Decision-Boundary Approach
– Geometric Approach
Classifiers Based on Similarity
Once a good metric is defined to measure similarity, patterns can be classified by template matching or a minimum-distance classifier using a few prototypes per class.
The choice of the metric and prototypes is crucial to the success of this approach.
Classification Methods Based on Similarity
Template Matching
– Assign the pattern to the most similar template.
Nearest Mean Classifier
– Assign the pattern to the nearest class mean.
Subspace Method
– Assign the pattern to the nearest subspace (invariance).
1-Nearest Neighbor Rule
– Assign the pattern to the class of the nearest training pattern.
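As an illustration, a minimal nearest-mean classifier (Euclidean metric assumed; the class means serve as the prototypes):

```python
import numpy as np

class NearestMean:
    def fit(self, X, y):
        self.classes = np.unique(y)
        # one prototype per class: the class mean
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self

    def predict(self, X):
        # assign each pattern to the nearest class mean
        d = np.linalg.norm(X[:, None, :] - self.means[None, :, :], axis=2)
        return self.classes[np.argmin(d, axis=1)]

X = np.vstack([np.random.randn(50, 2) - 2, np.random.randn(50, 2) + 2])
y = np.array([0] * 50 + [1] * 50)
print(NearestMean().fit(X, y).predict(X[:5]))
```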
Probabilistic Approach
Bayes decision rule
– It takes into account the costs associated with different types of misclassification.
– Given the prior probabilities, loss function, and class-conditional densities, it is “optimal” in minimizing the risk.
– With the 0/1 loss function, it assigns a pattern to the class with the maximum posterior probability (the maximum likelihood decision rule).
Probability Models
Parametric Models
– Parameter estimation
– Commonly used models:
  – Multivariate Gaussian distributions for continuous features
  – Binomial distributions for binary features
  – Multinomial distributions for integer-valued features
– Bayes “plug-in” rule
Nonparametric Models
Classifiers — Probabilistic Approach
Parametric Models
– Bayes plug-in
– Logistic Classifier
Nonparametric Models
– k-Nearest Neighbor Rule
– Parzen Classifier
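A short sketch of the k-nearest neighbor rule (Euclidean distance and majority vote assumed; the data are made up):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """k-NN rule: majority vote among the k nearest training patterns."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = y_train[np.argsort(dist)[:k]]      # labels of k nearest neighbors
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

X = np.vstack([np.random.randn(50, 2) - 2, np.random.randn(50, 2) + 2])
y = np.array([0] * 50 + [1] * 50)
print(knn_predict(X, y, np.array([1.5, 1.0]), k=5))
```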
Geometric Approach
Construct decision boundaries directly by optimizing a certain error criterion.
Commonly used criteria:
– Classification error
– MSE
A training procedure is required.
Classifiers — Geometric Approach
Fisher Linear Discriminant
Binary Decision Tree
Neural Networks
– Perceptron
– Multi-Layer Perceptron
– Radial Basis Network
Support Vector Classifier
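To illustrate the geometric approach, a sketch of the classic perceptron update rule (labels assumed to be in {-1, +1}; it converges only for linearly separable data):

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Learn a linear boundary w.x + b = 0; labels y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified: move boundary toward xi
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:                  # all patterns correct: converged
            break
    return w, b

X = np.vstack([np.random.randn(50, 2) + 2, np.random.randn(50, 2) - 2])
y = np.array([1] * 50 + [-1] * 50)
w, b = perceptron(X, y)
print(w, b)
```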
Statistical Pattern Recognition: A Review
Classifier Combination
Why Combining Classifiers?
Independent classifiers for the same goal.
– Person identification by voice, face, and handwriting.
Sometimes more than a single training set is available, each collected at a different time or in a different environment. These training sets may even use different features.
Different classifiers trained on the same data may not only differ in their global performance, but they also may show strong local differences. Each classifier may have its own region in the feature space where it performs the best.
Some classifiers such as neural networks show different results with different initializations due to the randomness inherent in the training procedure. Instead of selecting the best network and discarding the others, one can combine various networks, thereby taking advantage of all the attempts to learn from data.
Combining Schemes
Parallel
Cascading
Hierarchical
Selection and Training of Individual Classifiers
Combining classifiers that are largely independent.
Create training sets using various resampling techniques, such as rotation and bootstrapping.
Examples:
– Stacking
– Bagging (bootstrap aggregation)
– Boosting or ARCing (Adaptive Reweighting and Combining)
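A hedged sketch of bagging: each base classifier is trained on a bootstrap resample and the decisions are combined by majority vote. `Stump` is a made-up toy learner; any classifier with fit/predict would do:

```python
import numpy as np

class Stump:
    """Toy base learner (made up): threshold on the first feature."""
    def fit(self, X, y):
        self.t = X[:, 0].mean()
        return self
    def predict(self, X):
        return (X[:, 0] > self.t).astype(int)

def bagging(X, y, Learner, n_classifiers=10, seed=0):
    """Bootstrap aggregation: train each learner on a resampled training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(Learner().fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    votes = np.array([m.predict(X) for m in models])  # (n_models, n_patterns)
    # per pattern, pick the most frequent label (labels: nonnegative ints)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X = np.random.randn(100, 2)
y = (X[:, 0] > 0).astype(int)
print(majority_vote(bagging(X, y, Stump), X[:5]))
```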
Selection and Training of Individual Classifiers
Cluster analysis may be used to separate the individual classes in the training set into subclasses.
– Consequently, simpler classifiers (e.g., linear) may be used and combined later to generate, for instance, a piecewise-linear result.
When building different classifiers on different sets of training patterns, different feature sets may be used (e.g., the random subspace method).
Combiners
Static Combiners
– Voting, Averaging, Borda Count
Trainable Combiners
– Lead to better improvement than static ones.
– Additional training data needed.
Adaptive Combiners
– The combiner evaluates (or weighs) the decisions of individual classifiers depending on the input patterns.
Output Types of Individual Classifiers
Measurement
– Confidence or probability
Rank
– Assign a rank to each class
Abstract
– A set of several class labels
An Example
Handwritten numerals (0-9)
– Extracted from a collection of Dutch utility maps
– 30 × 48 binary images
– 200 patterns per class (2000 in total)
Features:
– 76 Fourier coefficients of the character shapes
– 216 profile correlations
– 64 Karhunen-Loeve coefficients
– 240 pixel averages in 2 × 3 windows
– 47 Zernike moments
– 6 morphological features
Statistical Pattern Recognition: A Review
Error Estimation
Ultimate Measurement of a Classifier
Classification error, or simply the error rate $P_e$.
The percentage of misclassified test samples is taken as an estimate of the error rate.
How should the available samples be split to form training and test sets?
– Especially important in the small-sample case.
Error Estimation Methods
Cross-Validation Approaches
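Since the slides only name the approach, here is a minimal k-fold cross-validation sketch for estimating the error rate (any classifier object with fit returning self and predict is assumed):

```python
import numpy as np

def cross_val_error(X, y, Learner, k=5, seed=0):
    """Estimate the error rate Pe by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = Learner().fit(X[train], y[train])
        # fraction of misclassified samples in the held-out fold
        errors.append(np.mean(model.predict(X[test]) != y[test]))
    return np.mean(errors)   # average misclassification rate over folds
```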
Other Performance Measurements
Example: Fingerprint Matching System
False Acceptance Rate (FAR)
– The percentage of incorrect matches (impostors accepted).
False Reject Rate (FRR)
– The percentage of incorrect non-matches (genuine users rejected).
Reject rate
– The percentage of doubtful patterns rejected.
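For illustration, a sketch of computing FAR and FRR from matcher similarity scores at a given decision threshold; the score distributions below are made up:

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    # FAR: fraction of impostor attempts incorrectly accepted (matched)
    far = np.mean(impostor_scores >= threshold)
    # FRR: fraction of genuine attempts incorrectly rejected (unmatched)
    frr = np.mean(genuine_scores < threshold)
    return far, frr

genuine = np.random.normal(0.8, 0.1, 1000)    # made-up score distributions
impostor = np.random.normal(0.3, 0.1, 1000)
print(far_frr(genuine, impostor, threshold=0.55))
```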
Statistical Pattern Recognition: A Review
Unsupervised Classification
Unlabelled Training Data
Unsupervised classification is also known as data clustering.
Difficulties
– Data can reveal clusters with different sizes and shapes.
– The number of clusters depends on the resolution.
– Similarity measurement.
Importance
Data Mining
Information Retrieval
Image Segmentation
Signal Compression and Coding
Machine Learning
Main Techniques
Iterative Square-Error Partition Clustering
– Main concern in the following discussion
Agglomerative Hierarchical Clustering
Formulation
Given n patterns in a d-dimensional metric space, determine a partition of the patterns into K clusters such that the patterns in a cluster are more similar to each other than to patterns in different clusters.
Two Popular Approaches for Partition Clustering
Square-Error Clustering
– K-Means
– Fuzzy K-Means
Mixture Decomposition
– EM Algorithm
Square-Error Clustering
(Figure: patterns partitioned into three clusters with means $\mathbf{m}_1, \mathbf{m}_2, \mathbf{m}_3$ and overall mean $\mathbf{m}$.)
Total scattering:
$S^2 = \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{m}\|^2$
Between-cluster scattering:
$S_B^2 = \sum_{k=1}^{K} n_k \|\mathbf{m}_k - \mathbf{m}\|^2$
Within-cluster scattering:
$S_W^2 = \sum_{k=1}^{K} \sum_{\mathbf{x} \in C_k} \|\mathbf{x} - \mathbf{m}_k\|^2$
Fact: $S^2 = S_B^2 + S_W^2$
Square-Error Clustering
Fact: $S^2 = S_B^2 + S_W^2$
Goal: find a partition that minimizes $S_W^2$, or equivalently maximizes $S_B^2$.
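A minimal K-means sketch that iteratively reduces the within-cluster scattering $S_W^2$ (it assumes no cluster becomes empty during the iterations):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Iterative square-error partition clustering (K-means)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), K, replace=False)]   # initial cluster means
    for _ in range(iters):
        # assign each pattern to the nearest cluster mean
        labels = np.argmin(
            np.linalg.norm(X[:, None] - means[None], axis=2), axis=1)
        # recompute cluster means (assumes every cluster stays nonempty)
        new_means = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_means, means):
            break
        means = new_means
    sw2 = np.sum((X - means[labels]) ** 2)   # within-cluster scattering S_W^2
    return labels, means, sw2

X = np.vstack([np.random.randn(100, 2) + m for m in ([0, 0], [5, 5], [0, 5])])
labels, means, sw2 = kmeans(X, K=3)
print(sw2)
```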
Mixture Model
A formal model for unsupervised classification.
Each pattern was produced by one of a set of alternative (probabilistically modeled) sources.
Mixtures can also be seen as a class of models able to represent arbitrarily complex probability density functions.
Mixtures are also well suited to representing complex class-conditional densities in supervised learning scenarios.
Mixture Generation
K random sources with densities $p_1(\mathbf{x}|\theta_1), p_2(\mathbf{x}|\theta_2), \ldots, p_K(\mathbf{x}|\theta_K)$ and mixing probabilities $P(\mathbf{x} \in C_i) = \alpha_i$:
$p(\mathbf{x}|\Theta_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x}|\theta_m)$
Parameters: $\Theta_{(K)} = (\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K)$
If $p_m(\mathbf{x}|\theta_m)$ is normal, the model is called a Gaussian Mixture Model (GMM).
Goal
Given the form of $p_m(\mathbf{x}|\theta_m)$ and the data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, find
$\hat{\Theta}_{(K)} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_K, \hat{\theta}_1, \ldots, \hat{\theta}_K)$
to fit the model
$p(\mathbf{x}|\Theta_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x}|\theta_m)$
Issues
How to estimate the parameters $\Theta_{(K)} = (\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K)$?
– EM (Expectation-Maximization) algorithm
– MCMC (Markov Chain Monte Carlo) methods
How to estimate the number of components (sources)?
– More difficult
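As a concrete instance of the EM approach, a compact sketch for a 1-D Gaussian mixture with K known (a standard derivation, not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, iters=200, seed=0):
    """EM estimation of (alpha_m, mu_m, sigma_m) for a 1-D GMM."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)             # initial mixing probabilities
    mu = rng.choice(x, K, replace=False)    # initial means: random data points
    sigma = np.full(K, x.std())
    for _ in range(iters):
        # E-step: responsibilities r[i, m] = P(source m | x_i)
        r = alpha * norm.pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and std devs
        nm = r.sum(axis=0)
        alpha = nm / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nm
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nm)
    return alpha, mu, sigma

x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm_1d(x, K=2))
```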
Example