
Statistical Pattern Recognition: A Review

Lecturer: 虞台文

Contents

- Introduction
- Statistical Pattern Recognition
- The Curse of Dimensionality & Peaking Phenomena
- Dimensionality Reduction
  - Feature Extraction
  - Feature Selection
- Classifiers
- Classifier Combination
- Error Estimation
- Unsupervised Classification

Statistical Pattern Recognition: A Review

Introduction

What is Pattern Recognition?

The study of how machines can
- observe the environment,
- learn to distinguish patterns of interest from their background, and
- make sound and reasonable decisions about the categories of the patterns.

What is a pattern? What kinds of categories do we have?

What is a Pattern?

As opposed to chaos, a pattern is an entity, vaguely defined, that could be given a name.

For example, a pattern could be
- a fingerprint image
- a handwritten cursive word
- a human face
- a speech signal

Categories (Classes)

Supervised Classification
- Discriminant Analysis

Unsupervised Classification
- Clustering

Applications of Pattern Recognition

The Design

The design of a pattern recognition system essentially involves the following three aspects:
- data acquisition
- data representation
- decision making

Pattern Recognition Models

The four best known approaches:
- template matching
- statistical classification
- syntactic or structural matching
- neural networks


Statistical Pattern Recognition: A Review

Statistical Pattern Recognition

Pattern Representation

A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector.

$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$

Two Modes of a Pattern Recognition System (block diagram):
- Classification mode: test pattern → Preprocessing → Feature Measurement → Classification
- Training mode: training pattern → Preprocessing → Feature Extraction/Selection → Learning

Decision Making Rules

Given a pattern $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$, assign it to one of the c categories in $\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$.

Decision Making Rules

(Figure: a pattern x and three candidate classes ω₁, ω₂, ω₃; the decision α(x) = ? must assign x to one of them.)

The States of Nature

(Figure: the classes ω₁, ω₂, ω₃ as possible states of nature generating the observed pattern x.)

P(ωᵢ): prior probabilities.
P(x|ωᵢ): class-conditional probabilities.

Bayesian Decision Theory

A posterior probability:

$P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{P(\mathbf{x})}, \qquad P(\mathbf{x}) = \sum_{i=1}^{c} P(\mathbf{x} \mid \omega_i)\, P(\omega_i)$

P(ωᵢ): prior probabilities.
P(x|ωᵢ): class-conditional probabilities.

Decision Making Rules

- Bayes Decision Rule
- Maximum Likelihood Rule
- Minimax
- Neyman-Pearson
- ...

Loss Functions & Conditional Risk

Loss function $\lambda(\alpha_i \mid \omega_j)$: the loss incurred in deciding $\omega_i$ when the true class is $\omega_j$.

Conditional risk (with posterior probability $P(\omega_j \mid \mathbf{x})$):

$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$

Bayes Decision Rule

$\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x}) = \arg\min_i \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$

The optimal decision rule for minimizing the risk.

Maximum Likelihood Decision Rule

$\alpha(\mathbf{x}) = \arg\min_i R(\alpha_i \mid \mathbf{x})$ with the 0/1 loss function

$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}$

Under this loss, $R(\alpha_i \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$, so the rule reduces to

$\alpha(\mathbf{x}) = \arg\max_i P(\omega_i \mid \mathbf{x})$
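As a concrete illustration of the two rules above (not from the slides; the posteriors and loss matrix are made-up numbers), the following NumPy sketch computes the conditional risks and compares the general Bayes decision with the 0/1-loss (maximum posterior) decision:

```python
import numpy as np

# Assumed posterior probabilities P(w_j | x) for c = 3 classes (made-up numbers).
posteriors = np.array([0.2, 0.5, 0.3])

# Loss matrix: loss[i, j] = lambda(alpha_i | w_j),
# the loss of deciding class i when the true class is j (made-up values).
loss = np.array([[0.0, 2.0, 1.0],
                 [1.0, 0.0, 4.0],
                 [1.0, 1.0, 0.0]])

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | w_j) * P(w_j | x).
risks = loss @ posteriors
bayes_decision = int(np.argmin(risks))     # general Bayes rule: minimize conditional risk

# With the 0/1 loss, minimizing the risk is equivalent to maximizing the posterior.
map_decision = int(np.argmax(posteriors))  # maximum likelihood / maximum posterior rule

print("conditional risks:", risks)
print("Bayes decision:", bayes_decision, " 0/1-loss decision:", map_decision)
```

With these made-up costs, the two rules disagree: the 0/1-loss rule picks the most probable class, while the general Bayes rule avoids the decision with the expensive error.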

Minimax

Minimax deals with the case in which the prior probabilities P(ωᵢ) are unknown. Borrowing from game theory, it chooses the decision rule that minimizes the maximum possible overall risk (the worst case over the unknown priors), still using the class-conditional probabilities P(x|ωᵢ) and the conditional risk

$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$

Neyman-Pearson

The Neyman-Pearson criterion minimizes the overall risk subject to a constraint, such as

$\int R(\alpha_i \mid \mathbf{x})\, d\mathbf{x} = \text{constant}$

for a particular i, where $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$ is the conditional risk and P(ωᵢ), P(x|ωᵢ) are the prior and class-conditional probabilities as before.

Various Approaches

Performance Evaluation

(Block diagram: the two modes of the system again, with the available samples split into a Training Set that feeds the training mode and a Test Set that feeds the classification mode.)

Optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set.

Problems on Learning (Generalization)

- The curse of dimensionality: the number of features is too large relative to the number of training samples.
- Classifier complexity: the number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network).
- Overtraining: the classifier is too intensively optimized on the training set.

Statistical Pattern Recognition: A Review

The Curse of Dimensionality & Peaking Phenomena

The Curse

The performance of a classifier depends on the interrelationship between
- sample sizes
- number of features
- classifier complexity

If a table-lookup technique is adopted for classification, how many training samples are required w.r.t. the number of features?

Peaking Phenomena

The probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known.

With finite training samples, however, adding features may actually degrade the performance of a classifier.

Trunk’s Example

Two classes with mean vectors and covariance matrices as follows:

$\mathbf{m}_1 = \left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \ldots, \tfrac{1}{\sqrt{d}}\right)^T, \qquad \mathbf{m}_2 = -\mathbf{m}_1, \qquad \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \mathbf{I}$

G. V. Trunk, "A Problem of Dimensionality: A Simple Example," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 3, pp. 306-307, July 1979.

Guideline

The relationship among the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish.

It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is a good practice to follow in classifier design.

Statistical Pattern Recognition: A Review

Dimensionality Reduction

Dimensionality Reduction

A limited yet salient feature set simplifies both pattern representation and classifier design.

Pattern representation is easy for 2D and 3D features.

How can patterns with high-dimensional features be made viewable?

Example: Pattern Representation

Dimensionality Reduction: How?

Feature Extraction
- Create new features based on the original feature set.
- Transforms are usually involved.

Feature Selection
- Select the best subset from a given feature set.

Main Issues in Dimensionality Reduction

The choice of a criterion function
- Commonly used criterion: classification error

The determination of the appropriate dimensionality
- Correlated with the intrinsic dimensionality of the data

Dimensionality Reduction

Feature Extraction

Feature Extractor

$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T \;\longrightarrow\; \text{Feature Extractor} \;\longrightarrow\; \mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{im})^T$

$m < d$, usually.

Some Important Methods

Linear approaches:
- Principal Component Analysis (PCA), or Karhunen-Loeve Expansion
- Projection Pursuit
- Independent Component Analysis (ICA)
- Factor Analysis
- Discriminant Analysis

Nonlinear approaches:
- Kernel PCA
- Multidimensional Scaling (MDS)
- Neural networks:
  - Feed-Forward Neural Networks (can realize both linear PCA and nonlinear PCA)
  - Self-Organizing Map

Demonstration (Iris data; +: Iris Setosa, *: Iris Versicolor, o: Iris Virginica)

(Figures: 2-D mappings of the Iris data produced by PCA, Fisher Mapping, Sammon Mapping, and Kernel PCA with a second-order polynomial kernel.)
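A comparable demonstration can be reproduced in a few lines; this is a minimal sketch assuming scikit-learn and matplotlib are available (the Fisher mapping is realized with scikit-learn's LinearDiscriminantAnalysis, and Sammon mapping is omitted because scikit-learn has no built-in implementation):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target   # 4 features, 3 classes

# Three 2-D mappings of the 4-D Iris data.
mappings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "Fisher mapping (LDA)": LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y),
    "Kernel PCA (poly, degree 2)": KernelPCA(n_components=2, kernel="poly", degree=2).fit_transform(X),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, Z) in zip(axes, mappings.items()):
    for label, marker in zip(range(3), ["+", "*", "o"]):
        ax.scatter(Z[y == label, 0], Z[y == label, 1], marker=marker,
                   label=iris.target_names[label])
    ax.set_title(name)
axes[0].legend()
plt.tight_layout()
plt.show()
```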

Summary

Dimensionality Reduction

Feature Selection

Feature Selector

$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\longrightarrow\; \text{Feature Selector} \;\longrightarrow\; \mathbf{x}' = (x'_1, x'_2, \ldots, x'_m)^T$

$m < d$, usually.

# possible selections: $\binom{d}{m}$ (e.g., 15,504 subsets for d = 20 and m = 5).

The problem: given a set of d features, select a subset of size m that leads to the smallest classification error.

No nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset.

Optimal Methods

Exhaustive Search
- Evaluate all possible subsets.

Branch-and-Bound
- The monotonicity property of the criterion function has to hold.

Suboptimal Methods

- Best Individual Features
- Sequential Forward Selection (SFS) (see the sketch after this list)
- Sequential Backward Selection (SBS)
- "Plus l-take away r" Selection
- Sequential Forward Floating Search and Sequential Backward Floating Search
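The SFS sketch referenced above (illustrative, not from the slides): greedily add, one at a time, the feature that most improves a criterion function, here taken to be cross-validated 1-NN accuracy on the Iris data (both assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sfs(X, y, m):
    """Greedy sequential forward selection of m features."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m:
        # Score every candidate feature added to the current subset.
        scores = {j: cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)   # feature giving the highest criterion value
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_iris(return_X_y=True)
print("selected feature indices:", sfs(X, y, m=2))
```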

Summary

(Summary table comparing the optimal and suboptimal feature selection methods above.)

Statistical Pattern Recognition: A Review

Classifiers

Classification

Once a feature selection or classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches.

In practice, the choice of a classifier is a difficult problem, and it is often based on which classifier(s) happen to be available to, or are best known by, the user.

Approaches of Classifier Design

- Based on Similarity
- Probabilistic Approach
- Decision-Boundary (Geometric) Approach

Classifiers Based on Similarity

Once a good metric is defined to measure similarity, patterns can be classified by template matching or a minimum-distance classifier using a few prototypes per class.

The choice of the metric and prototypes is crucial to the success of this approach.

Classification Methods Based on Similarity

Template Matching
- Assign the pattern to the most similar template.

Nearest Mean Classifier
- Assign the pattern to the nearest class mean.

Subspace Method
- Assign the pattern to the nearest subspace (invariance).

1-Nearest Neighbor Rule
- Assign the pattern to the class of the nearest training pattern (see the sketch after this list).
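The sketch below (NumPy; the Iris data and the Euclidean metric are assumptions for illustration) implements the nearest mean classifier and the 1-nearest neighbor rule from the list above:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
class_means = np.array([X[y == c].mean(axis=0) for c in classes])   # one prototype per class

def nearest_mean(x):
    # Assign x to the class whose mean (prototype) is closest in Euclidean distance.
    return classes[np.argmin(np.linalg.norm(class_means - x, axis=1))]

def one_nn(x):
    # Assign x to the class of the single nearest training pattern.
    return y[np.argmin(np.linalg.norm(X - x, axis=1))]

x_test = X[0]
print("nearest mean:", nearest_mean(x_test), " 1-NN:", one_nn(x_test))
```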

Probabilistic Approach

Bayes decision rule
- It takes into account the costs associated with different types of misclassification.
- Given the prior probabilities, loss function, and class-conditional densities, it is "optimal" in minimizing the risk.
- With the 0/1 loss function, it assigns a pattern to the class with the maximum posterior probability (maximum likelihood decision rule).

Probability Models

Parametric Models
- Parameter estimation
- Commonly used models:
  - multivariate Gaussian distributions for continuous features
  - binomial distributions for binary features
  - multinomial distributions for integer-valued features
- Bayes "plug-in" rule (see the sketch after this slide)

Nonparametric Models
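The sketch referenced above (an illustration, not the slides' own material) applies the Bayes plug-in rule with a multivariate Gaussian model: estimate each class's mean, covariance, and prior from the training data, plug them into the class-conditional density, and pick the class with the largest (unnormalized) posterior. SciPy and scikit-learn (for the Iris data) are assumed:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# "Plug in" the estimated parameters of each class-conditional Gaussian.
params = {c: (X[y == c].mean(axis=0),            # mean estimate
              np.cov(X[y == c], rowvar=False),   # covariance estimate
              np.mean(y == c))                   # prior estimate P(w_c)
          for c in classes}

def plug_in_bayes(x):
    # Unnormalized posteriors p(x | w_c) * P(w_c); the largest one wins (0/1 loss).
    scores = [multivariate_normal.pdf(x, mean=m, cov=S) * p for m, S, p in params.values()]
    return classes[int(np.argmax(scores))]

print("prediction:", plug_in_bayes(X[0]), " true class:", y[0])
```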

Classifiers — Probabilistic Approach

Parametric Models
- Bayes Plug-in
- Logistic Classifier

Nonparametric Models
- k-Nearest Neighbor Rule
- Parzen Classifier

Geometric Approach

Construct decision boundaries directly by optimizing a certain error criterion.

Commonly used criteria:
- classification error
- mean squared error (MSE)

A training procedure is required.

Classifiers — Geometric Approach

- Fisher Linear Discriminant
- Binary Decision Tree
- Neural Networks
  - Perceptron
  - Multi-Layer Perceptron
  - Radial Basis Network
- Support Vector Classifier
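As a small illustration of the decision-boundary approach (a sketch assuming scikit-learn; the particular models and parameters are arbitrary), several of the classifiers listed above can be trained and compared on the same data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Decision-boundary classifiers: each fits its boundary by optimizing an error criterion.
models = {
    "binary decision tree": DecisionTreeClassifier(random_state=0),
    "multi-layer perceptron": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "support vector classifier": SVC(kernel="rbf"),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {acc:.2f}")
```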

Statistical Pattern Recognition: A Review

Classifier Combination

Why Combining Classifiers?

Independent classifiers for the same goal
- e.g., person identification by voice, face, and handwriting.

Sometimes more than one training set is available, each collected at a different time or in a different environment. These training sets may even use different features.

Different classifiers trained on the same data may not only differ in their global performance, but they also may show strong local differences. Each classifier may have its own region in the feature space where it performs the best.

Some classifiers such as neural networks show different results with different initializations due to the randomness inherent in the training procedure. Instead of selecting the best network and discarding the others, one can combine various networks, thereby taking advantage of all the attempts to learn from data.

Combining Schemes

Parallel

Cascading

Hierarchical

Selection and Training of Individual Classifiers

- Combine classifiers that are largely independent.
- Create training sets using various resampling techniques, such as rotation and bootstrapping.

Examples:
- Stacking
- Bagging (bootstrap aggregation)
- Boosting or ARCing (Adaptive Reweighting and Combining)
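A brief sketch (assuming scikit-learn; not part of the slides) of how bagging and boosting resample or reweight the training set to build an ensemble from a weak base classifier:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base = DecisionTreeClassifier(max_depth=1)   # a weak base classifier ("decision stump")

# Bagging: each base classifier is trained on a bootstrap sample of the training set.
bagging = BaggingClassifier(base, n_estimators=50, random_state=0)

# Boosting (AdaBoost): training patterns are adaptively reweighted between rounds.
boosting = AdaBoostClassifier(base, n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, "accuracy:", cross_val_score(model, X, y, cv=5).mean().round(2))
```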

Selection and Training of Individual Classifiers

- Cluster analysis may be used to separate the individual classes in the training set into subclasses. Consequently, simpler (e.g., linear) classifiers may be used and combined later to generate, for instance, a piecewise-linear result.
- When building different classifiers on different sets of training patterns, different feature sets may be used, e.g., the random subspace method.

Combiners

Static Combiners
- Voting, Averaging, Borda Count.

Trainable Combiners
- Lead to better improvement than static ones.
- Additional training data needed.

Adaptive Combiners
- The combiner evaluates (or weighs) the decisions of individual classifiers depending on the input pattern.
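For instance, a static majority-vote combiner over the class labels produced by several independently trained classifiers can be sketched as follows (the three component classifiers and the data split are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Three independently trained classifiers, each producing class-label outputs.
classifiers = [GaussianNB(), KNeighborsClassifier(n_neighbors=3), DecisionTreeClassifier(random_state=0)]
predictions = np.array([clf.fit(X_tr, y_tr).predict(X_te) for clf in classifiers])

# Static combiner: majority vote over the labels produced for each test pattern.
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, predictions)
print("combined accuracy:", (votes == y_te).mean())
```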

Output Types of Individual Classifiers

Measurement
- Confidence or probability

Rank
- Assign a rank to each class

Abstract
- A set of several class labels

An Example

Handwritten numerals (0-9)
- Extracted from a collection of Dutch utility maps
- 30 × 48 binary images
- 200 patterns per class (2,000 in total)

Features:
- 76 Fourier coefficients of the character shapes
- 216 profile correlations
- 64 Karhunen-Loeve coefficients
- 240 pixel averages in 2 × 3 windows
- 47 Zernike moments
- 6 morphological features

An Example

Statistical Pattern Recognition: A Review

Error Estimation

Ultimate Measurement of a Classifier

Classification error, or simply the error rate Pe.

The percentage of misclassified test samples is taken as an estimate of the error rate.

How should the available samples be split to form training and test sets?
- Especially important in the small-sample case.

Error Estimation Methods

Cross-Validation Approaches
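A minimal error-estimation sketch (assuming scikit-learn; the 3-NN classifier is an arbitrary choice): the data are repeatedly split into training and test parts, and the average test error serves as the estimate of Pe:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

# 10-fold cross-validation: every sample is tested exactly once.
err_10fold = 1 - cross_val_score(clf, X, y, cv=10).mean()

# Leave-one-out: n folds of size 1, often used when the sample size is small.
err_loo = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"estimated error rate Pe: 10-fold = {err_10fold:.3f}, leave-one-out = {err_loo:.3f}")
```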

Other Performance Measurements

Example: a fingerprint matching system

False Acceptance Rate (FAR)
- The percentage of incorrect matches (impostor attempts accepted).

False Reject Rate (FRR)
- The percentage of incorrect non-matches (genuine attempts rejected).

Reject rate
- The percentage of doubtful patterns that are rejected.

Statistical Pattern Recognition: A Review

Unsupervised Classification

Unlabelled Training Data

Unsupervised classification is also known as data clustering.

Difficulties

- Data can reveal clusters with different sizes and shapes.
- The number of clusters depends on the resolution.
- Similarity measurement.

Importance

- Data Mining
- Information Retrieval
- Image Segmentation
- Signal Compression and Coding
- Machine Learning

Main Techniques

- Iterative Square-Error Partition Clustering (the main concern in the following discussion)
- Agglomerative Hierarchical Clustering

Formulation

Given n patterns in a d-dimensional metric space, determine a partition of the patterns into K clusters such that the patterns in a cluster are more similar to each other than to patterns in different clusters.

Two Popular Approaches for Partition Clustering

Square-Error Clustering
- K-Means
- Fuzzy K-Means

Mixture Decomposition
- EM Algorithm
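A compact K-means sketch (NumPy; not from the slides, with synthetic data for illustration) showing the iterative square-error minimization: assign each pattern to its nearest cluster mean, then recompute the means:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Basic K-means: alternate nearest-mean assignment and mean update."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), K, replace=False)]   # initial cluster means
    for _ in range(n_iter):
        # Assignment step: each pattern goes to its nearest mean
        # (minimizes the within-cluster squared error for the current means).
        labels = np.argmin(np.linalg.norm(X[:, None] - means, axis=2), axis=1)
        # Update step: recompute each cluster mean (keep the old one if a cluster is empty).
        means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else means[k]
                          for k in range(K)])
    return labels, means

# Synthetic 2-D data from three well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, means = kmeans(X, K=3)
print("cluster means:\n", means)
```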

Square-Error Clustering

(Figure: overall mean m and cluster means m₁, m₂, m₃.)

Total scattering:

$S^2 = \sum_{i=1}^{n} \lVert \mathbf{x}_i - \mathbf{m} \rVert^2$

Between-cluster scattering:

$S_B^2 = \sum_{k=1}^{K} n_k \lVert \mathbf{m}_k - \mathbf{m} \rVert^2$

Within-cluster scattering:

$S_W^2 = \sum_{k=1}^{K} \sum_{\mathbf{x} \in C_k} \lVert \mathbf{x} - \mathbf{m}_k \rVert^2$

Fact: $S^2 = S_B^2 + S_W^2$

Square-Error Clustering

Fact: $S^2 = S_B^2 + S_W^2$

Goal: find a partition that minimizes $S_W^2$, or equivalently maximizes $S_B^2$.
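The decomposition $S^2 = S_B^2 + S_W^2$ can be checked numerically. The sketch below (synthetic data and an arbitrary partition, both assumptions for illustration) computes all three scatters and confirms the identity:

```python
import numpy as np

def scatters(X, labels):
    """Total, between-cluster, and within-cluster scattering of a partition."""
    m = X.mean(axis=0)                                       # overall mean
    S2 = np.sum(np.linalg.norm(X - m, axis=1) ** 2)          # total scattering
    SB2 = SW2 = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)                                 # cluster mean
        SB2 += len(Xk) * np.linalg.norm(mk - m) ** 2         # between-cluster scattering
        SW2 += np.sum(np.linalg.norm(Xk - mk, axis=1) ** 2)  # within-cluster scattering
    return S2, SB2, SW2

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat([[0, 0], [5, 5], [0, 5]], 30, axis=0)
labels = np.repeat([0, 1, 2], 30)   # any partition satisfies the identity
S2, SB2, SW2 = scatters(X, labels)
print(S2, SB2 + SW2)                # the two values agree: S^2 = S_B^2 + S_W^2
```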

Mixture Model

- A formal model for unsupervised classification.
- Each pattern was produced by one of a set of alternative (probabilistically modeled) sources.
- Mixtures can also be seen as a class of models that are able to represent arbitrarily complex probability density functions.
- Mixtures are also well suited to represent complex class-conditional densities in supervised learning scenarios.

Mixture Generation

K random sources, the m-th with density $p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$ and $P(\mathbf{x} \in C_i) = \alpha_i$.

$p(\mathbf{x} \mid \boldsymbol{\Theta}_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$

Parameters: $\boldsymbol{\Theta}_{(K)} = (\alpha_1, \ldots, \alpha_K, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$

If each $p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$ is normal, the model is called the Gaussian Mixture Model (GMM).

Goal

Given the form of $p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$ and the data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, find the parameter estimate

$\hat{\boldsymbol{\Theta}}_{(K)} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_K, \hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_K)$

that fits the mixture model

$p(\mathbf{x} \mid \boldsymbol{\Theta}_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$

Issues

How to estimate the parameters?
- EM (expectation-maximization) algorithm
- MCMC (Markov Chain Monte Carlo) methods

How to estimate the number of components (sources)?
- More difficult.
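For example, the EM algorithm for a GMM is available in scikit-learn's GaussianMixture (a sketch under that assumption, with synthetic data); an information criterion such as BIC is one common way to compare candidate numbers of components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from three Gaussian sources.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [4, 4], [0, 4])])

# Fit GMMs with different numbers of components via EM and compare them with BIC.
for K in range(1, 6):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    print(f"K = {K}: BIC = {gmm.bic(X):.1f}")

# Refit with the chosen K and inspect the estimated mixing weights (the alpha_m's).
best = GaussianMixture(n_components=3, random_state=0).fit(X)
print("estimated mixing weights:", np.round(best.weights_, 2))
```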

Example
