
Page 1

Statistical Pattern Recognition: A Review

Speaker: 虞台文

Page 2

Contents

Introduction
Statistical Pattern Recognition
The Curse of Dimensionality & Peaking Phenomena
Dimensionality Reduction
– Feature Extraction
– Feature Selection
Classifiers
Classifier Combination
Error Estimation
Unsupervised Classification

Page 3

Statistical Pattern Recognition: A Review

Introduction

Page 4

What is Pattern Recognition?

The study of how machines can
– observe the environment,
– learn to distinguish patterns of interest from their background, and
– make sound and reasonable decisions about the categories of the patterns.

What is a pattern? What kinds of categories do we have?

Page 5

What is a Pattern?

As opposed to chaos, a pattern is an entity, vaguely defined, that could be given a name.

For example, a pattern could be
– a fingerprint image
– a handwritten cursive word
– a human face
– a speech signal

Page 6

Categories (Classes)

Supervised Classification
– Discriminant Analysis

Unsupervised Classification
– Clustering

Page 7

Applications of Pattern Recognition

Page 8

The Design

The design of a pattern recognition system essentially involves the following three aspects:
– data acquisition
– data representation
– decision making

Page 9

Pattern Recognition Models

The four best known approaches:
– template matching
– statistical classification
– syntactic or structural matching
– neural networks

Page 10

Pattern Recognition Models

Page 11

Statistical Pattern Recognition: A Review

Statistical Pattern Recognition

Page 12

Pattern Representation

A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector.

$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$

Page 13

[Figure] Two modes of a pattern recognition system:
– Classification mode: test pattern → Preprocessing → Feature Measurement → Classification
– Training mode: training pattern → Preprocessing → Feature Extraction/Selection → Learning

Page 14

Decision Making Rules

Given a pattern $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$, assign it to one of the $c$ categories in

$$\Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$$

Page 15

Decision Making Rules

[Figure] A pattern $\mathbf{x}$ and three categories $\omega_1$, $\omega_2$, $\omega_3$: which class should the decision rule $\alpha(\mathbf{x})$ assign?

Page 16

The States of Nature

[Figure] Three states of nature $\omega_1$, $\omega_2$, $\omega_3$, each with a prior probability $P(\omega_i)$ and a class-conditional density $p(\mathbf{x}|\omega_i)$ over the feature vector $\mathbf{x}$.

$P(\omega_i)$: prior probabilities.
$p(\mathbf{x}|\omega_i)$: class-conditional probabilities.

Page 17

Bayesian Decision Theory

[Figure] The same three states of nature, with priors $P(\omega_i)$ and class-conditional densities $p(\mathbf{x}|\omega_i)$.

A posterior probability is obtained from Bayes rule:

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{i=1}^{c} p(\mathbf{x} \mid \omega_i)\, P(\omega_i)$$

$P(\omega_i)$: prior probabilities.
$p(\mathbf{x}|\omega_i)$: class-conditional probabilities.
$P(\omega_i \mid \mathbf{x})$: a posterior probability.
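To make the computation concrete, here is a minimal numerical sketch of Bayes rule (not from the slides): a hypothetical two-class, one-dimensional problem whose priors and Gaussian class-conditional densities are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class, 1-D problem; all parameters are made up.
priors = np.array([0.6, 0.4])             # P(w_1), P(w_2)

def class_conditionals(x):
    """p(x|w_1), p(x|w_2): unit-variance Gaussians centered at 0 and 2."""
    return np.array([norm.pdf(x, loc=0.0, scale=1.0),
                     norm.pdf(x, loc=2.0, scale=1.0)])

x = 1.2
p_x_given_w = class_conditionals(x)
p_x = np.dot(p_x_given_w, priors)         # p(x) = sum_i p(x|w_i) P(w_i)
posteriors = p_x_given_w * priors / p_x   # P(w_i|x) by Bayes rule
print(posteriors, posteriors.sum())       # the posteriors sum to 1
```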

Page 18

Decision Making Rules

Bayes Decision Rule
Maximum Likelihood Rule
Minimax
Neyman-Pearson
. . .

Page 19

Loss Functions & Conditional Risk

[Figure] Decisions versus the true classes $\omega_1$, $\omega_2$, $\omega_3$.

Loss function:

$\lambda(\alpha_i \mid \omega_j)$: the loss incurred in deciding $\alpha_i$ when the true class is $\omega_j$.

Conditional risk:

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$

$P(\omega_j \mid \mathbf{x})$: posterior probability.

Page 20

Bayes Decision Rule

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$

$$\alpha(\mathbf{x}) = \arg\min_{\alpha_i} R(\alpha_i \mid \mathbf{x})$$

This is the optimal decision rule for minimizing the risk.

Page 21

Maximum Likelihood Decision Rule

$$\alpha(\mathbf{x}) = \arg\min_{\alpha_i} R(\alpha_i \mid \mathbf{x}), \qquad R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$

With the 0/1 loss function

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

the rule reduces to

$$\alpha(\mathbf{x}) = \arg\max_{\omega_i} P(\omega_i \mid \mathbf{x})$$
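Continuing the hypothetical numbers from the Bayes-rule sketch above, the following fragment shows the risk computation and the 0/1-loss shortcut; the loss matrix is invented purely for illustration.

```python
import numpy as np

# Posteriors P(w_j|x) for a two-class problem (made-up values).
posteriors = np.array([0.55, 0.45])

# Hypothetical loss matrix: lam[i, j] = loss of deciding a_i when truth is w_j.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])

# Conditional risk R(a_i|x) = sum_j lam[i, j] * P(w_j|x); pick the minimum.
risks = lam @ posteriors
bayes_decision = np.argmin(risks)
print(risks, bayes_decision)

# With the 0/1 loss, minimizing risk is the same as maximizing the posterior.
zero_one = 1.0 - np.eye(2)
assert np.argmin(zero_one @ posteriors) == np.argmax(posteriors)
```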

Page 22

Minimax

Minimax deals with the case where the prior probabilities $P(\omega_i)$ are unknown: it chooses the decision rule that minimizes the maximum (worst-case) overall risk over the possible priors, a game-theoretic formulation.

It is built from the same ingredients as before: the class-conditional densities $p(\mathbf{x}|\omega_i)$ and the conditional risk $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$.

Page 23

Neyman-Pearson

The Neyman-Pearson criterion wishes to minimize the overall risk subject to a constraint, such as

$$\int R(\alpha_i \mid \mathbf{x})\, d\mathbf{x} = \text{constant}$$

for a particular $i$. Here, as before, $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$, with the posteriors obtained from the priors $P(\omega_i)$ and the class-conditional densities $p(\mathbf{x}|\omega_i)$ via Bayes rule.
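In the two-class case, a Neyman-Pearson-style rule is commonly realized by thresholding the likelihood ratio so that the error on one class is pinned at a fixed level. A minimal Monte-Carlo sketch, assuming two made-up Gaussian class-conditional densities (this example is not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two hypothetical classes: w_0 ~ N(0,1), w_1 ~ N(2,1).
# Choose the threshold t on the likelihood ratio so that the error
# committed on class w_0 (the "false alarm" rate) equals alpha.
alpha = 0.05
x0 = rng.normal(0.0, 1.0, 100_000)               # samples from w_0
lr0 = norm.pdf(x0, 2, 1) / norm.pdf(x0, 0, 1)    # likelihood ratio
t = np.quantile(lr0, 1.0 - alpha)                # empirical threshold

# Detection rate achieved on class w_1 under that constraint:
x1 = rng.normal(2.0, 1.0, 100_000)
lr1 = norm.pdf(x1, 2, 1) / norm.pdf(x1, 0, 1)
print("false-alarm rate:", np.mean(lr0 > t))     # ~ alpha by construction
print("detection rate:", np.mean(lr1 > t))
```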

Page 24

Various Approaches

Page 25

Performance Evaluation

[Figure] The two modes of a pattern recognition system again, with the training pattern drawn from a Training Set and the test pattern drawn from a separate Test Set.

Optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set.

Page 26

Problems on Learning (Generalization)

The curse of dimensionality
– The number of features is too large relative to the number of training samples.

Classifier complexity
– The number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network).

Overtraining
– Too intensively optimized on the training set.

Page 27

Statistical Pattern Recognition: A Review

The Curse of Dimensionality & Peaking Phenomena

Page 28

The Curse

The performance of a classifier depends on the interrelationship between
– sample size
– number of features
– classifier complexity

If a table-lookup technique is adopted for classification, how many training samples are required w.r.t. the number of features?

Page 29

Peaking Phenomena

The probability of misclassification of a decision rule does not increase as the number of features increases.
– This is true as long as the class-conditional densities are completely known.

Peaking phenomena
– In practice, adding features may actually degrade the performance of a classifier.

Page 30

Trunk’s Example

Two classes with mean vectors and covariance matrices as follows:

$$\mathbf{m}_1 = \left(1, \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{3}}, \ldots, \frac{1}{\sqrt{d}}\right)^T, \qquad \mathbf{m}_2 = -\mathbf{m}_1, \qquad \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \mathbf{I}$$

G. V. Trunk, “A Problem of Dimensionality : A Simple Example,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 3, pp. 306-307, July 1979.
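The peaking behavior can be reproduced numerically. Below is a sketch, assuming a nearest-mean classifier whose class means are estimated from a fixed number of training samples; the sample sizes are arbitrary choices, and only the problem setup follows the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def trunk_error(d, n_train=20, n_test=2000):
    """Estimated test error for Trunk's two-class problem in d dimensions.

    Means are +/- (1, 1/sqrt(2), ..., 1/sqrt(d)), identity covariance.
    Because the discriminant uses *estimated* class means, the error
    eventually rises again as d grows (the peaking phenomenon).
    """
    m = 1.0 / np.sqrt(np.arange(1, d + 1))
    X1 = rng.normal(m, 1.0, size=(n_train, d))       # class 1 training data
    X2 = rng.normal(-m, 1.0, size=(n_train, d))      # class 2 training data
    w = X1.mean(axis=0) - X2.mean(axis=0)            # estimated discriminant
    mid = (X1.mean(axis=0) + X2.mean(axis=0)) / 2.0
    T = rng.normal(m, 1.0, size=(n_test, d))         # test set from class 1
    return np.mean((T - mid) @ w < 0)                # misclassification rate

for d in (1, 5, 20, 100, 500):
    print(d, trunk_error(d))
```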

Page 31

Guideline

The exact relationship between the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish.

It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is good practice to follow in classifier design.

Page 32

Statistical Pattern Recognition: A Review

Dimensionality Reduction

Page 33

Dimensionality Reduction

A limited yet salient feature set simplifies both pattern representation and classifier design.

Pattern representation is easy for 2-D and 3-D features.

How can patterns with high-dimensional features be made viewable?

Page 34

Example: Pattern Representation

Page 35

Dimensionality Reduction: How?

Feature Extraction
– Create new features based on the original feature set.
– Transforms are usually involved.

Feature Selection
– Select the best subset from a given feature set.

Page 36

Main Issues in Dimensionality Reduction

The choice of a criterion function
– Commonly used criterion: classification error

The determination of the appropriate dimensionality
– Correlated with the intrinsic dimensionality of the data

Page 37

Dimensionality Reduction

Feature Extraction

Page 38

Feature Extractor

[Figure] A feature extractor maps $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$ to $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{im})^T$, usually with $m \ll d$.

Page 39

Some Important Methods

Linear approaches:
– Principal Component Analysis (PCA), or Karhunen-Loeve expansion (see the sketch below)
– Projection Pursuit
– Independent Component Analysis (ICA)
– Factor Analysis
– Discriminant Analysis

Nonlinear approaches:
– Kernel PCA
– Multidimensional Scaling (MDS)

Neural networks:
– Feed-Forward Neural Networks
– Self-Organizing Map
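As a reference point for the linear approaches, here is a minimal PCA sketch via the eigendecomposition of the sample covariance matrix; this is generic illustrative code, not the lecture's own implementation.

```python
import numpy as np

def pca(X, m):
    """Project the d-dimensional rows of X onto the m leading principal axes."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]             # m leading eigenvectors
    return Xc @ W                           # n x m extracted features y_i

# Example: reduce hypothetical 10-D patterns to 2-D for visualization.
X = np.random.default_rng(0).normal(size=(200, 10))
Y = pca(X, 2)
print(Y.shape)                              # (200, 2)
```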

Page 40

Feed-Forward Neural Networks

[Figure] Linear PCA versus nonlinear PCA realized by feed-forward neural networks.

Page 41

Demonstration (+: Iris Setosa, *: Iris Versicolor, o: Iris Virginica)

[Figure] Four 2-D projections of the Iris data: PCA, Fisher mapping, Sammon mapping, and kernel PCA with a second-order polynomial kernel.

Page 42

Summary

Page 43

Dimensionality Reduction

Feature Selection

Page 44

Feature Selector

[Figure] A feature selector maps $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ to $\mathbf{x}' = (x'_1, x'_2, \ldots, x'_m)^T$, usually with $m \ll d$.

Number of possible selections: $\binom{d}{m}$

Page 45

The Problem

Given a set of d features, select a subset of size m that leads to the smallest classification error.

No nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset.

Page 46

Optimal Methods

Exhaustive Search
– Evaluate all possible subsets.

Branch-and-Bound
– The monotonicity property of the criterion function has to hold.

Page 47

Suboptimal Methods

Best Individual Features
Sequential Forward Selection (SFS) (see the sketch below)
Sequential Backward Selection (SBS)
“Plus l-take away r” Selection
Sequential Forward Floating Search and Sequential Backward Floating Search
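A minimal sketch of SFS, the greedy procedure named above. The criterion function `score` is a placeholder for whatever the designer maximizes (e.g., cross-validated accuracy), and the toy Fisher-like criterion here is invented purely for illustration.

```python
import numpy as np

def sfs(X, y, m, score):
    """Sequential Forward Selection: greedily grow a subset of m features.

    Greedy, hence suboptimal: once a feature is added it is never removed
    (the floating variants above address exactly this limitation).
    """
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m:
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def toy_score(Xs, y):
    """Toy two-class criterion: squared mean difference over variance."""
    m0, m1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    return np.sum((m0 - m1) ** 2 / (Xs.var(axis=0) + 1e-9))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = (X[:, 2] + X[:, 5] > 0).astype(int)   # classes depend on features 2 and 5
print(sfs(X, y, 2, toy_score))            # likely selects [2, 5]
```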

Page 48

Summary

[Figure] Taxonomy of feature selection methods, split into optimal and suboptimal families.

Page 49

Statistical Pattern Recognition: A Review

Classifiers

Page 50

Classification

Once a feature selection or a classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches.

In practice, the choice of a classifier is a difficult problem, and it is often based on which classifier(s) happen to be available or best known to the user.

Page 51

Approaches of Classifier Design

Based on Similarity

Probabilistic Approach

Decision-Boundary Approach
– Geometric Approach

Page 52

Classifiers Based on Similarity

Once a good metric is defined to measure similarity, patterns can be classified by template matching or by a minimum-distance classifier using a few prototypes per class.

The choice of the metric and the prototypes is crucial to the success of this approach.

Page 53

Classification Methods Based on Similarity

Template Matching
– Assign the pattern to the most similar template.

Nearest Mean Classifier
– Assign the pattern to the nearest class mean (sketched below).

Subspace Method
– Assign the pattern to the nearest subspace (invariance).

1-Nearest Neighbor Rule
– Assign the pattern to the class of the nearest training pattern (sketched below).
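A minimal sketch of the nearest mean classifier and the 1-NN rule under the Euclidean metric; this is generic illustrative code with made-up data, not from the slides.

```python
import numpy as np

def nearest_mean_fit(X, y):
    """One prototype per class: the class mean."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_mean_predict(x, classes, means):
    """Assign x to the class whose mean is nearest."""
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

def one_nn_predict(x, X_train, y_train):
    """1-NN rule: the class of the nearest training pattern."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
classes, means = nearest_mean_fit(X, y)
x = np.array([2.5, 2.5])
print(nearest_mean_predict(x, classes, means), one_nn_predict(x, X, y))
```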

Page 54

Probabilistic Approach

Bayes decision rule
– It takes into account the costs associated with different types of misclassification.
– Given the prior probabilities, loss function, and class-conditional densities, it is “optimal” in minimizing the risk.
– With the 0/1 loss function, it assigns a pattern to the class with the maximum posterior probability (the maximum likelihood decision rule).

Page 55

Probability Models

Parametric Models
– Parameter estimation
– Commonly used models:
  – Multivariate Gaussian distributions for continuous features
  – Binomial distributions for binary features
  – Multinomial distributions for integer-valued features
– Bayes “plug-in” rule

Nonparametric Models

Page 56

Classifiers — Probabilistic Approach

Parametric Models
– Bayes plug-in
– Logistic Classifier

Nonparametric Models
– k-Nearest Neighbor Rule
– Parzen Classifier

Page 57

Geometric Approach

Construct decision boundaries directly by optimizing a certain error criterion.

Commonly used criteria:
– Classification error
– MSE

A training procedure is required.

Page 58

Classifiers — Geometric Approach

Fisher Linear Discriminant
Binary Decision Tree
Neural Networks
– Perceptron
– Multi-Layer Perceptron
– Radial Basis Network
Support Vector Classifier

Page 59

Statistical Pattern Recognition: A Review

Classifier Combination

Page 60

Why Combining Classifiers?

There may be independent classifiers for the same goal, e.g., person identification by voice, face, and handwriting.

Sometimes more than a single training set is available, each collected at a different time or in a different environment. These training sets may even use different features.

Different classifiers trained on the same data may not only differ in their global performance, but they may also show strong local differences. Each classifier may have its own region in the feature space where it performs best.

Some classifiers, such as neural networks, show different results with different initializations due to the randomness inherent in the training procedure. Instead of selecting the best network and discarding the others, one can combine various networks, thereby taking advantage of all the attempts to learn from the data.

Page 61

Combining Schemes

Parallel

Cascading

Hierarchical

Page 62

Selection and Training of Individual Classifier

Combining classifiers that are largely independent.

Create training sets using various resampling techniques, such as rotation and bootstrapping.

Examples:
– Stacking
– Bagging (bootstrap aggregation)
– Boosting or ARCing (Adaptive Reweighting and Combining)

Page 63

Selection and Training of Individual Classifier

Cluster analysis may be used to separate the individual classes in the training set into subclasses.
– Consequently, simpler classifiers (e.g., linear) may be used and combined later to generate, for instance, a piecewise-linear result.

When building different classifiers on different sets of training patterns, different feature sets may also be used, e.g., the random subspace method.

Page 64

Combiners

Static Combiners
– Voting, Averaging, Borda Count (voting and averaging are sketched below).

Trainable Combiners
– Lead to better improvement than static ones.
– Additional training data are needed.

Adaptive Combiners
– The combiner evaluates (or weighs) the decisions of individual classifiers depending on the input pattern.
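A minimal sketch of the two simplest static combiners: majority voting on abstract-level outputs and averaging on measurement-level outputs. Generic illustrative code, not from the slides.

```python
import numpy as np

def majority_vote(labels):
    """Abstract-level combiner: the most frequent label wins."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def average_posteriors(posterior_list):
    """Measurement-level combiner: average each classifier's P(w_i|x),
    then pick the class with the largest averaged posterior."""
    return int(np.argmax(np.mean(posterior_list, axis=0)))

print(majority_vote(np.array([1, 2, 1])))                        # -> 1
print(average_posteriors([[0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]))  # -> 1
```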

Page 65

Output Types of Individual Classifiers

Measurement
– Confidence or probability

Rank
– Assign a rank to each class

Abstract
– A set of several class labels

Page 66

An Example

Handwritten numerals (0-9)
– Extracted from a collection of Dutch utility maps
– 30 × 48 binary images
– 200 patterns per class (2000 in total)

Features:
– 76 Fourier coefficients of the character shapes
– 216 profile correlations
– 64 Karhunen-Loeve coefficients
– 240 pixel averages in 2 × 3 windows
– 47 Zernike moments
– 6 morphological features

Page 67

An Example

Page 68

Statistical Pattern Recognition: A Review

Error Estimation

Page 69

Ultimate Measurement of a Classifier

The classification error, or simply the error rate Pe.

The percentage of misclassified test samples is taken as an estimate of the error rate.

How should the available samples be split to form the training and test sets?
– Especially important in the small-sample case.

Page 70

Error Estimation Methods

Cross-Validation Approaches (a k-fold sketch follows below)
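A minimal k-fold cross-validation sketch for estimating the error rate Pe; `train_and_predict` is a hypothetical placeholder for any classifier's train-then-classify routine.

```python
import numpy as np

def cross_validated_error(X, y, train_and_predict, k=5, seed=0):
    """k-fold cross-validation estimate of the error rate Pe.

    Each sample is tested exactly once, by a model that never saw it
    during training, which is the point of the train/test split above.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_hat = train_and_predict(X[train], y[train], X[test])
        errors.append(np.mean(y_hat != y[test]))
    return float(np.mean(errors))
```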

Page 71

Other Performance Measurements

Example: fingerprint matching system

False Acceptance Rate (FAR)
– The percentage of incorrect matches.

False Reject Rate (FRR)
– The percentage of incorrect non-matches.

Reject Rate
– The percentage of doubtful patterns that are rejected.

Page 72

Statistical Pattern Recognition: A Review

Unsupervised Classification

Page 73

Unlabelled Training Data

Unsupervised classification is also known as data clustering.

Page 74

Difficulties

Data can reveal clusters with different sizes and shapes.
The number of clusters depends on the resolution.
Similarity measurement.

Page 75

Importance

Data Mining
Information Retrieval
Image Segmentation
Signal Compression and Coding
Machine Learning

Page 76

Main Techniques

Iterative Square-Error Partition Clustering
– Main concern in the following discussion

Agglomerative Hierarchical Clustering

Page 77

Formulation

Given n patterns in a d-dimensional metric space, determine a partition of the patterns into K clusters such that the patterns in a cluster are more similar to each other than to patterns in different clusters.

Page 78

Two Popular Approaches forPartition Clustering

Square-Error Clustering
– K-Means
– Fuzzy K-Means

Mixture Decomposition
– EM Algorithm

Page 79

Square-Error Clustering

[Figure] n patterns with overall mean $\mathbf{m}$, partitioned into three clusters with means $\mathbf{m}_1$, $\mathbf{m}_2$, $\mathbf{m}_3$.

Total scattering:

$$S^2 = \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{m}\|^2$$

Between-cluster scattering:

$$S_B^2 = \sum_{k=1}^{K} n_k \|\mathbf{m}_k - \mathbf{m}\|^2$$

Within-cluster scattering:

$$S_W^2 = \sum_{k=1}^{K} \sum_{\mathbf{x} \in C_k} \|\mathbf{x} - \mathbf{m}_k\|^2$$

Fact: $S^2 = S_B^2 + S_W^2$

Page 80

Square-Error Clustering

Fact: $S^2 = S_B^2 + S_W^2$

Goal: since $S^2$ is fixed for a given data set, find a partition that minimizes $S_W^2$ or, equivalently, maximizes $S_B^2$.
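A minimal K-means sketch that alternates nearest-mean assignment and mean updates; every iteration can only decrease $S_W^2$. The initialization (random patterns as prototypes) is an arbitrary choice for illustration.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means minimization of the within-cluster scattering S_W^2."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), K, replace=False)]   # initial prototypes
    for _ in range(n_iter):
        # Assignment step: each pattern joins its nearest cluster mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute means (keep the old one if a cluster empties).
        new = np.stack([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else means[k] for k in range(K)])
        if np.allclose(new, means):
            break                                     # converged
        means = new
    sw2 = np.sum((X - means[labels]) ** 2)            # within-cluster S_W^2
    return labels, means, sw2
```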

Page 81

Mixture Model

A formal model for unsupervised classification: each pattern was produced by one of a set of alternative (probabilistically modeled) sources.

Mixtures can also be seen as a class of models that are able to represent arbitrarily complex probability density functions.

Mixtures are also well suited for representing complex class-conditional densities in supervised learning scenarios.

Page 82

Mixture Generation

[Figure] K random sources with densities $p_1(\mathbf{x}|\boldsymbol{\theta}_1), p_2(\mathbf{x}|\boldsymbol{\theta}_2), \ldots, p_K(\mathbf{x}|\boldsymbol{\theta}_K)$; source $C_i$ is chosen with probability $P(\mathbf{x} \text{ from } C_i) = \alpha_i$.

$$p(\mathbf{x} \mid \boldsymbol{\Theta}_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$$

Parameters: $\boldsymbol{\Theta}_{(K)} = (\alpha_1, \ldots, \alpha_K, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$

Page 83

Mixture Generation

[Figure] The same K-source model as on the previous page.

$$p(\mathbf{x} \mid \boldsymbol{\Theta}_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$$

Parameters: $\boldsymbol{\Theta}_{(K)} = (\alpha_1, \ldots, \alpha_K, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$

If $p_m(\mathbf{x}|\boldsymbol{\theta}_m)$ is normal, the model is called the Gaussian Mixture Model (GMM).

Page 84

Goal

Given the form of $p_m(\mathbf{x}|\boldsymbol{\theta}_m)$ and the data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, find

$$\hat{\boldsymbol{\Theta}}_{(K)} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_K, \hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_K)$$

to fit the model

$$p(\mathbf{x} \mid \boldsymbol{\Theta}_{(K)}) = \sum_{m=1}^{K} \alpha_m\, p_m(\mathbf{x} \mid \boldsymbol{\theta}_m)$$
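A minimal EM sketch for fitting a Gaussian mixture, i.e., assuming each $p_m(\mathbf{x}|\boldsymbol{\theta}_m)$ is multivariate normal. The initialization and the small covariance regularizer are arbitrary choices for illustration, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a GMM: estimate alpha_m, mu_m, Sigma_m by maximum likelihood."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(K, 1.0 / K)                       # mixing weights
    mu = X[rng.choice(n, K, replace=False)]           # arbitrary initial means
    sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, m] = P(source m | x_i).
        r = np.stack([alpha[m] * multivariate_normal.pdf(X, mu[m], sigma[m])
                      for m in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted data.
        nk = r.sum(axis=0)
        alpha = nk / n
        mu = (r.T @ X) / nk[:, None]
        for m in range(K):
            Xc = X - mu[m]
            sigma[m] = (r[:, m, None] * Xc).T @ Xc / nk[m] + 1e-6 * np.eye(d)
    return alpha, mu, sigma
```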

Page 85

Issues

How do we estimate the parameters $\boldsymbol{\Theta}_{(K)} = (\alpha_1, \ldots, \alpha_K, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$?
– EM (expectation-maximization) algorithm
– MCMC (Markov Chain Monte-Carlo) methods

How do we estimate the number of components (sources) K?
– More difficult

Page 86

Example