1/104
Computer Vision
3. Machine Vision Algorithms
Computer Engineering, Sejong University
Dongil Han
2/104
Contents
• SIFT (Scale Invariant Feature Transform)
• AdaBoost (Adaptive Boosting)
• SVM (Support Vector Machine)
• Precision and Recall
• DL (Deep Learning)
3/104
Scale Invariant Feature Transform
Scale-invariant feature transform (or SIFT) is an algorithm in computer vision to detect and describe local features in images. The algorithm was published by David Lowe in 1999.
Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, and match moving.
The algorithm is patented in the US; the owner is the University of British Columbia.
David Lowe, Computer Science Department, University of British Columbia
4/104
Challenges
• Scale change
• Rotation
• Occlusion
• Illumination
……
5/104
Overview of SIFT
6/104
Scale Space extrema detection
• Find the points whose surrounding patches (at some scale) are distinctive
• An approximation to the scale-normalized Laplacian of Gaussian
7/104
Maxima and minima in a 3×3×3 neighborhood
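A minimal sketch of the two steps above, assuming NumPy and SciPy are available; the number of scales and the σ values are illustrative, not Lowe's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, num_scales=5, sigma0=1.6, k=2 ** 0.5):
    # Stack of Gaussian-blurred images at increasing scales.
    gaussians = [gaussian_filter(image.astype(float), sigma0 * k ** i)
                 for i in range(num_scales)]
    # Difference-of-Gaussians approximates the scale-normalized Laplacian.
    dogs = np.stack([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])

    keypoints = []
    for s in range(1, dogs.shape[0] - 1):
        for y in range(1, image.shape[0] - 1):
            for x in range(1, image.shape[1] - 1):
                patch = dogs[s-1:s+2, y-1:y+2, x-1:x+2]  # 3x3x3 neighborhood
                v = dogs[s, y, x]
                # Keep the point if it is the max or min of its 26 neighbors
                # (ties accepted for simplicity in this sketch).
                if v == patch.max() or v == patch.min():
                    keypoints.append((x, y, s))
    return keypoints
```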
8/104
Locate DoG Extrema
• There are still a lot of points; some of them are not good enough.
• The locations of the keypoints may not be accurate.
9/104
Edge Response Elimination
• Such a point has a large principal curvature across the edge but a small one in the perpendicular direction
• The principal curvatures can be calculated from the Hessian matrix
• The eigenvalues of H are proportional to the principal curvatures, so the two eigenvalues should not differ too much (see the test below)
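Concretely, Lowe's test uses the 2×2 Hessian of the DoG function D at the keypoint; since only the ratio of the eigenvalues matters, the trace and determinant suffice (the paper uses r = 10):

$$\mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}, \qquad \frac{\operatorname{Tr}(\mathbf{H})^2}{\operatorname{Det}(\mathbf{H})} = \frac{(\lambda_1 + \lambda_2)^2}{\lambda_1 \lambda_2} < \frac{(r+1)^2}{r}$$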
10/104
11/104
Orientation assignment
• Assign an orientation to each keypoint; the keypoint descriptor can then be represented relative to this orientation, achieving invariance to image rotation
• Compute gradient magnitude and orientation on the Gaussian-smoothed images, as below
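On the Gaussian-smoothed image L, the magnitude and orientation at each pixel are computed from pixel differences:

$$m(x,y) = \sqrt{\big(L(x{+}1,y) - L(x{-}1,y)\big)^2 + \big(L(x,y{+}1) - L(x,y{-}1)\big)^2}$$
$$\theta(x,y) = \tan^{-1}\frac{L(x,y{+}1) - L(x,y{-}1)}{L(x{+}1,y) - L(x{-}1,y)}$$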
12/104
Orientation assignment
• A histogram is formed by quantizing the orientations into 36 bins;
• Peaks in the histogram correspond to the orientations of the patch;
• For the same scale and location, there could be multiple keypoints with different orientations;
13/104
Feature Descriptor
• Based on 16×16 patches
• 4×4 subregions
• 8 bins in each subregion
• 4×4×8 = 128 dimensions in total
14/104
Application : Object recognition
• The SIFT features of training images are extracted and stored
• For a query image
1. Extract SIFT feature
2. Efficient nearest neighbor indexing
3. Geometric verification (using at least 3 matched keypoints)
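A minimal sketch of steps 1–2 with OpenCV, assuming a build where cv2.SIFT_create is available (OpenCV ≥ 4.4); the ratio threshold is Lowe's usual heuristic:

```python
import cv2

def match_sift(train_img, query_img, ratio=0.75):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(train_img, None)
    kp2, des2 = sift.detectAndCompute(query_img, None)

    # Nearest-neighbor matching with Lowe's ratio test to reject
    # ambiguous matches.
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    return kp1, kp2, good
```

Step 3 could then, for example, fit a geometric transform to the matched coordinates with RANSAC and keep only objects supported by enough consistent matches.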
15/104
16/104
SURF (Speeded Up Robust Features)
• Approximate version of SIFT
• Works almost equally well
• Very fast
17/104
SIFT Summary
• The most successful feature (probably the most successful paper in computer vision)
• A lot of heuristics: the parameters are optimized on a small, specific dataset, so different tasks should have different parameter settings.
• Learning local image descriptors (Winder et al 2007): tuning parameters given their dataset.
• We need a universal objective function.
18/104
Adaptive Boosting(Adaboost)
• Introduced by Schapire and Freund in the 1990s.
• “Boosting”: convert a weak learning algorithm into a strong one.
• Main idea: combine many weak classifiers to produce a powerful committee.
• Algorithms:
– AdaBoost: adaptive boosting
– Gentle AdaBoost
– BrownBoost
– …
19/104
Classifiers
• Classifier Examples
– Linear Classifier
– Quadratic Classifier
– Nonlinear Classifier
[Figure: a test set of +/− points with example linear, quadratic, and nonlinear decision boundaries]
20/104
Adaboost Learning Algorithm
21/104
Adaboost Learning Algorithm • Round 1 of 3
[Figure: weak classifier h1 splits the training points; weighted error ε1 = 0.300, classifier weight α1 = 0.424; misclassified points are up-weighted to form the new distribution D2]
22/104
Adaboost Learning Algorithm • Round 2 of 3
[Figure: weak classifier h2 on distribution D2; weighted error ε2 = 0.196, classifier weight α2 = 0.704; reweighted points form D3]
23/104
Adaboost Learning Algorithm • Round 3 of 3
[Figure: weak classifier h3 on distribution D3; weighted error ε3 = 0.344, classifier weight α3 = 0.323. STOP]
24/104
Adaboost Learning Algorithm • Final Hypothesis
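With each round's vote weight computed as $\alpha_t = \tfrac{1}{2}\ln\tfrac{1-\epsilon_t}{\epsilon_t}$, the final strong classifier is the sign of the weighted vote of the three weak classifiers, using the round weights above:

$$H(\mathbf{x}) = \operatorname{sign}\Big(\sum_{t=1}^{3} \alpha_t h_t(\mathbf{x})\Big) = \operatorname{sign}\big(0.424\,h_1(\mathbf{x}) + 0.704\,h_2(\mathbf{x}) + 0.323\,h_3(\mathbf{x})\big)$$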
25/104
Adaboost Learning Algorithm • Learning Process
26/104
Adaboost Learning Algorithm • Final Classifier and Parameters
27/104
Algorithm Recapitulation
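A minimal sketch of discrete AdaBoost with decision stumps, assuming NumPy and ±1 labels; the exhaustive stump search is illustrative, not optimized:

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    return polarity * np.where(X[:, feature] < threshold, 1, -1)

def adaboost(X, y, rounds=3):
    n = len(y)
    D = np.full(n, 1.0 / n)            # initial uniform distribution D_1
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the smallest weighted error.
        best = None
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = stump_predict(X, f, thr, pol)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # vote weight
        D *= np.exp(-alpha * y * pred)   # up-weight misclassified points
        D /= D.sum()
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def predict(ensemble, X):
    scores = sum(a * stump_predict(X, f, t, p) for a, f, t, p in ensemble)
    return np.sign(scores)
```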
28/104
29/104
30/104
Adaboost in Face Detection
31/104
Adaboost in Face Detection
• Adaboost Learning Algorithm by Fröba and Ernst
– The detector has four classifier stages of increasing complexity
• analyzes image patches W of size 22×22 after MCT: Γ
• location within the analysis window W: x
• Adaboost classifier stage: j
• classifier: H, pixel classifier: hx
32/104
Adaboost Summary
• Advantages
– Very simple to implement
– Does feature selection, resulting in a relatively simple classifier
– Fairly good generalization
• Disadvantages
– Suboptimal solution
– Sensitive to noisy data and outliers
33/104
Support Vector Machine(SVM)
• A classifier derived from statistical learning theory by Vapnik et al. in 1992
• SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task
• Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
• Also used for regression
• Demo of SVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
V. Vapnik
These slides are courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
34/104
Discriminant Function
• It can be an arbitrary function of x, such as:
– Nearest Neighbor
– Decision Tree
– Linear functions: $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$
– Nonlinear functions
35/104
Linear Discriminant Function
• g(x) is a linear function:
$$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$$
• A hyper-plane in the feature space: $\mathbf{w}^T\mathbf{x} + b = 0$ separates the region $\mathbf{w}^T\mathbf{x} + b > 0$ from $\mathbf{w}^T\mathbf{x} + b < 0$
• (Unit-length) normal vector of the hyper-plane: $\mathbf{n} = \mathbf{w}/\|\mathbf{w}\|$
36/104
Linear Discriminant Function
• How would you classify these points using a linear discriminant function in order to minimize the error rate?
=> Infinite number of answers!
[Figure: points of class +1 and class −1 in the (x1, x2) plane]
37/104
Linear Discriminant Function
• How would you classify these points using a linear discriminant function in order to minimize the error rate?
=> Infinite number of answers!
=> Which one is the best?
[Figure: points of class +1 and class −1 in the (x1, x2) plane]
38/104
Large Margin Linear Classifier
• The linear discriminant function (classifier) with the maximum margin is the best
• Margin is defined as the width that the boundary could be increased by before hitting a data point
• Why is it the best?
– Robust to outliers and thus strong generalization ability
[Figure: the separating line with its margin (“safe zone”) between the two classes]
39/104
Large Margin Linear Classifier
• Given a set of data points $\{(\mathbf{x}_i, y_i)\},\ i = 1, 2, \ldots, n$, where
$$\mathbf{w}^T\mathbf{x}_i + b > 0 \ \text{for}\ y_i = +1, \qquad \mathbf{w}^T\mathbf{x}_i + b < 0 \ \text{for}\ y_i = -1$$
• With a scale transformation on both w and b, the above is equivalent to
$$\mathbf{w}^T\mathbf{x}_i + b \ge 1 \ \text{for}\ y_i = +1, \qquad \mathbf{w}^T\mathbf{x}_i + b \le -1 \ \text{for}\ y_i = -1$$
40/104
Large Margin Linear Classifier
• We know that
$$\mathbf{w}^T\mathbf{x}^+ + b = 1, \qquad \mathbf{w}^T\mathbf{x}^- + b = -1$$
• The margin width is:
$$M = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$
[Figure: the margin between the two classes; the points x+ and x− on its boundaries are the support vectors]
41/104
Large Margin Linear Classifier
• Formulation:
$$\text{maximize}\ \frac{2}{\|\mathbf{w}\|}$$
such that
$$\mathbf{w}^T\mathbf{x}_i + b \ge 1 \ \text{for}\ y_i = +1, \qquad \mathbf{w}^T\mathbf{x}_i + b \le -1 \ \text{for}\ y_i = -1$$
42/104
Large Margin Linear Classifier
• Formulation:
$$\text{maximize}\ \frac{2}{\|\mathbf{w}\|} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{x}_i + b \ge 1 \ \text{for}\ y_i = +1, \quad \mathbf{w}^T\mathbf{x}_i + b \le -1 \ \text{for}\ y_i = -1$$
• Or, equivalently:
$$\text{minimize}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$$
43/104
Solving the Optimization Problem
$$\text{minimize}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$$
• Quadratic programming with linear constraints
• Lagrangian function:
$$\text{minimize}\ L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \big) \quad \text{s.t.} \quad \alpha_i \ge 0$$
44/104
Solving the Optimization Problem
$$\text{minimize}\ L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \big) \quad \text{s.t.} \quad \alpha_i \ge 0$$
• Setting the partial derivatives to zero:
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow\ \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$
45/104
Solving the Optimization Problem
• The solution has the form:
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \sum_{i \in \mathrm{SV}} \alpha_i y_i \mathbf{x}_i$$
• From the KKT condition:
$$\alpha_i \big( y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 \big) = 0$$
• Thus, only support vectors have $\alpha_i \neq 0$
• Get b from $y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 = 0$, where $\mathbf{x}_i$ is a support vector
[Figure: support vectors x+ and x− lying on the margin boundaries]
KKT conditions: https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
46/104
Solving the Optimization Problem
• The linear discriminant function is:
$$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = \sum_{i \in \mathrm{SV}} \alpha_i y_i \mathbf{x}_i^T\mathbf{x} + b$$
• Notice it relies on a dot product between the test point x and the support vectors xi
• Also keep in mind that solving the optimization problem involved computing the dot products $\mathbf{x}_i^T\mathbf{x}_j$ between all pairs of training points
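Substituting $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ back into $L_p$ yields the standard dual problem, which likewise touches the data only through those dot products:

$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{n} \alpha_i y_i = 0$$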
47/104
Large Margin Linear Classifier
• What if the data is not linearly separable? (noisy data, outliers, etc.)
• Slack variables ξi can be added to allow mis-classification of difficult or noisy data points
[Figure: two classes in the (x1, x2) plane with slack variables ξ1, ξ2 marking points that violate the margin]
48/104
Large Margin Linear Classifier
• Formulation:
$$\text{minimize}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
such that
$$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
• Parameter C can be viewed as a way to control over-fitting.
49/104
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great:
[Figure: 1-D points along the x axis, separable by a single threshold]
• But what are we going to do if the dataset is just too hard?
[Figure: 1-D points along the x axis that no single threshold separates]
• How about… mapping data to a higher-dimensional space:
[Figure: the same points mapped to (x, x²), where a line now separates them]
50/104
Non-linear SVMs: Feature Space
• General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x→ φ(x)
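A small sketch of this idea, assuming scikit-learn is available; the dataset and kernel choices are illustrative. Points labeled by distance from the origin defeat a linear SVM in the input space, while an RBF kernel separates them easily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)  # label by radius: not linearly separable

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)
print("linear accuracy:", linear.score(X, y))  # noticeably lower
print("rbf accuracy:", rbf.score(X, y))        # near 1.0
```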
51/104
Applications of SVMs
• Bioinformatics
• Machine Vision
• Text Categorization
• Ranking (e.g., Google searches)
• Handwritten Character Recognition
• Time series analysis
Lots of very successful applications!!!
52/104
Car license plate extraction
• Extract SIFT features first
• Classification using SVM
• Results: Precision = 62%, Recall = 50.2%, False negative rate = 49.8%, False positive rate = 1.1%

Classification results vs. ground truth:
                        Classified Outside    Classified Inside
Ground truth: Outside   293734 (TN)           3398 (FP)
Ground truth: Inside    5505 (FN)             5545 (TP)
53/104
Summary: Support Vector Machine
• Large Margin Classifier
– Better generalization ability & less over-fitting
• The Kernel Trick
– Map data points to a higher-dimensional space in order to make them linearly separable
– Since only the dot product is used, we do not need to represent the mapping explicitly
54/104
Precision and Recall
• Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
• Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
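A quick check of these formulas in Python, using the license-plate confusion matrix from the earlier slide:

```python
# Counts from the car license plate extraction slide.
tp, fp, fn, tn = 5545, 3398, 5505, 293734
precision = tp / (tp + fp)   # 5545/8943  ≈ 0.620
recall = tp / (tp + fn)      # 5545/11050 ≈ 0.502
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```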
55/104
Precision and Recall
• Car license plate extraction example
- TP (True Positive): a plate classified as a plate
- FP (False Positive): background classified as a plate
- FN (False Negative): a plate classified as background
- TN (True Negative): background classified as background
Precision: of the regions classified as plates, the fraction that actually are plates
Recall (detection rate): of all actual plates, the fraction classified as plates
56/104
DL(Deep Learning)
• What exactly is deep learning?
- ‘Deep learning’ means using a neural network with several layers of nodes between input and output
• Why is it generally better than other methods on image, speech and certain other types of data?
- The series of layers between input & output do feature identification and processing in a series of stages, just as our brains seem to.
These slides are courtesy of https://www.macs.hw.ac.uk/~dwcorne/
57/104
DL(Deep Learning)
• Multilayer neural networks have been around for 25 years. What's actually new?
- We have always had good algorithms for learning the weights in networks with 1 hidden layer
- But these algorithms are not good at learning the weights for networks with more hidden layers
- What's new: algorithms for training many-layer networks
58/104
Brain vs. Computer
Brain:
1. 10 billion neurons
2. 60 trillion synapses
3. Distributed processing
4. Nonlinear processing
5. Parallel processing

Computer:
1. Faster than a neuron (10⁻⁹ sec; cf. neuron: 10⁻³ sec)
3. Central processing
4. Arithmetic operation (linearity)
5. Sequential processing
59/104
Neuron vs Artificial Neuron
[Figure: biological neuron – dendrites, cell body, axon]
60/104
Simple Perceptron
61/104
Classic Perceptron
Nonlinear Neuron: Sigmoid Unit
• The sigmoid function $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ is differentiable:
$$\frac{d\sigma(x)}{dx} = \sigma(x)\big(1 - \sigma(x)\big)$$
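A minimal NumPy sketch of the sigmoid unit and its derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # sigma'(x) = sigma(x) * (1 - sigma(x))
```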
62/104
Multilayer Perceptron
63/104
[Figure: a single unit with three inputs, weights w1 = 1.4, w2 = -2.5, w3 = -0.06, and activation function f(x)]
64/104
[Figure: the same unit with inputs 2.7, -8.6 and 0.002 on the three weighted connections]
x = -0.06×2.7 + 2.5×8.6 + 1.4×0.002 = 21.34
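The same arithmetic in NumPy, assuming the figure's inputs were (2.7, -8.6, 0.002) with weights (-0.06, -2.5, 1.4), which reproduces the slide's sum:

```python
import numpy as np

inputs = np.array([2.7, -8.6, 0.002])
weights = np.array([-0.06, -2.5, 1.4])
x = float(inputs @ weights)   # weighted sum fed into f(x)
print(round(x, 2))            # 21.34
```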
65/104
Training the neural network
Training data (fields → class):
1.4 2.7 1.9 → 0
3.8 3.4 3.2 → 0
6.4 2.8 1.7 → 1
4.1 0.1 0.2 → 0
etc …
66/104
Feed a training instance through to get the output: inputs (1.4, 2.7, 1.9) → output 0.8
67/104
Compare with the target output: inputs (1.4, 2.7, 1.9) → output 0.8, target 0, error 0.8
68/104
Adjust the weights based on the error: inputs (1.4, 2.7, 1.9) → output 0.8, target 0, error 0.8
69/104
Feed the next instance through to get the output: inputs (6.4, 2.8, 1.7) → output 0.9
70/104
Compare with the target output: inputs (6.4, 2.8, 1.7) → output 0.9, target 1, error -0.1
71/104
Adjust the weights based on the error: inputs (6.4, 2.8, 1.7) → output 0.9, target 1, error -0.1
72/104
And so on ….
Repeat this thousands, maybe millions of times – each time taking a random training instance, and making slight weight adjustments. Algorithms for weight adjustment are designed to make changes that will reduce the error.
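A minimal sketch of this loop, assuming NumPy; the one-hidden-layer network, layer sizes, and learning rate are illustrative, and the weight adjustments are standard backpropagation on the squared error:

```python
import numpy as np

X = np.array([[1.4, 2.7, 1.9], [3.8, 3.4, 3.2],
              [6.4, 2.8, 1.7], [4.1, 0.1, 0.2]])   # the slide's training data
t = np.array([0.0, 0.0, 1.0, 0.0])                 # target classes

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))            # input -> hidden weights
W2 = rng.normal(scale=0.5, size=4)                 # hidden -> output weights
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(100_000):                        # "thousands, maybe millions"
    i = rng.integers(len(X))                       # random training instance
    h = sigmoid(X[i] @ W1)                         # hidden activations
    y = sigmoid(h @ W2)                            # network output
    err = y - t[i]                                 # compare with target
    # Slight weight adjustments that reduce the error (backpropagation).
    dW2 = err * y * (1 - y) * h
    dW1 = np.outer(X[i], err * y * (1 - y) * W2 * h * (1 - h))
    W2 -= lr * dW2
    W1 -= lr * dW1
```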
73/104
The decision boundary perspective…
Initial random weights
74/104
The decision boundary perspective…
Present a training instance / adjust the weights
75/104
The decision boundary perspective…
Eventually ….
76/104
The point I am trying to make
• weight-learning algorithms for NNs are dumb
• they work by making thousands and thousands of tiny adjustments, each making the network do better at the most recent pattern, but perhaps a little worse on many others
• but, by dumb luck, eventually this tends to be good enough to
learn effective classifiers for many real applications
77/104
Some other points
If f(x) is non-linear, a network with 1 hidden layer can, in theory, learn perfectly any classification problem. A set of weights exists that can produce the targets from the inputs. The problem is finding them.
78/104
Some other ‘by the way’ points
If f(x) is linear, the NN can only draw straight decision boundaries (even if there are many layers of units)
79/104
Some other ‘by the way’ points
NNs use nonlinear f(x) so they can draw complex boundaries, but keep the data unchanged
80/104
Some other ‘by the way’ points
NNs use nonlinear f(x) so they can draw complex boundaries, but keep the data unchanged. SVMs only draw straight lines, but they transform the data first in a way that makes that OK.
81/104
Feature detectors
82/104
what is this unit doing?
83/104
Hidden layer units become self-organised feature detectors
[Figure: map of incoming weight strengths for hidden units (1–63) across the input pixels (1, 5, 10, 15, 20, 25, …)]
84/104
What does this unit detect?
[Figure: the unit's incoming weight strengths across the input pixels]
85/104
What does this unit detect?
[Figure: the unit's incoming weight strengths across the input pixels]
it will send a strong signal for a horizontal line in the top row, ignoring everything else
86/104
What does this unit detect?
[Figure: the unit's incoming weight strengths across the input pixels]
87/104
What does this unit detect?
[Figure: the unit's incoming weight strengths across the input pixels]
Strong signal for a dark area in the top left corner
88/104
What features might you expect a good NNto learn, when trained with data like this?
89/104
vertical lines
90/104
Horizontal lines
91/104
Small circles
92/104
Small circles
But what about position invariance?
Our example unit detectors were tied to specific parts of the image
93/104
successive layers can learn higher-level features …
[Figure: first-layer units detect lines in specific positions; higher-level detectors combine them: “horizontal line”, “RHS vertical line”, “upper loop”, etc. …]
94/104
successive layers can learn higher-level features …
[Figure: first-layer units detect lines in specific positions; higher-level detectors combine them: “horizontal line”, “RHS vertical line”, “upper loop”, etc. …]
What does this unit detect?
95/104
So: multiple layers make sense
Your brain works that way
96/104
So: multiple layers make sense
Many-layer neural network architectures should be capable of learning the true underlying features and ‘feature logic’, and therefore generalise very well …
97/104
But, until very recently, our weight-learning algorithms simply did not work on multi-layer architectures
98/104
The new way to train multi-layer NNs…
99/104
The new way to train multi-layer NNs…
Train this layer first
100/104
The new way to train multi-layer NNs…
Train this layer first
then this layer
then this layer
then this layer
finally this layer
101/104
Final layer trained to predict class based on outputs from previous layers
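A minimal sketch of this greedy layer-wise scheme, assuming PyTorch; the layer sizes, toy data, and the use of autoencoders for the unsupervised stage are illustrative assumptions, not the slides' exact recipe:

```python
import torch
import torch.nn as nn

def pretrain_layer(layer, data, epochs=50):
    # Train 'layer' plus a throwaway decoder to reconstruct its own input.
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()))
    for _ in range(epochs):
        recon = decoder(torch.sigmoid(layer(data)))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(layer(data)).detach()  # input for the next layer

X = torch.randn(256, 64)                 # toy data (hypothetical)
y = (X.sum(dim=1) > 0).float()           # toy labels (hypothetical)
sizes = [64, 32, 16]

layers, h = [], X
for n_in, n_out in zip(sizes, sizes[1:]):   # train this layer first, then the next…
    layer = nn.Linear(n_in, n_out)
    h = pretrain_layer(layer, h)
    layers.append(layer)

# Finally: train the output layer to predict the class from the top-level features.
clf = nn.Linear(sizes[-1], 1)
opt = torch.optim.Adam(clf.parameters())
for _ in range(200):
    loss = nn.functional.binary_cross_entropy_with_logits(clf(h).squeeze(1), y)
    opt.zero_grad(); loss.backward(); opt.step()
```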
102/104
And that’s that
• That’s the basic idea
• There are many, many types of deep learning: different kinds of autoencoder, variations on architectures and training algorithms, etc…
• Very fast growing area …
103/104
DL Application
104/104
DL Summary
• Much recent excitement, still much to be discovered
• "Google-Brain"• Sum of Products Nets• Biological Plausibility• Potential for significant improvements• Good in structured spaces
– Important research question: To what extent can we use Deep Learning in more arbitrary feature spaces?
– Recent deep training of MLPs with BP shows potential in this area