7/28/2019 EE491_Y2150
1/32
Feature-Based Aerial Image Registration and Mosaicing
A Report Submitted in Partial Fulfillment of the Requirements
for the Degree of
Bachelor of Technology
by
Gaurav Gupta
Y2150
to the
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
April, 2006
Certificate
This is to certify that the work contained in the thesis entitled Feature-Based Aerial
Image Registration and Mosaicing, by Gaurav Gupta, has been carried out under my
supervision and that this work has not been submitted elsewhere for a degree.
April, 2006
------------------------------------------
(Dr. Sumana Gupta)
Department of Electrical Engineering,
Indian Institute of Technology, Kanpur.
-------------------------------------------
(Dr. Amitabha Mukerjee)
Department of Computer Science and Engineering,
Indian Institute of Technology, Kanpur.
ACKNOWLEDGEMENT
I am extremely thankful to Dr. Amitabha Mukerjee and Dr. Sumana Gupta, who
provided the support and guidance that were indispensable to my work and who
welcomed all my doubts. I would also like to thank Dr. Jharna Majumdar, under
whose guidance I gained knowledge and experience during my short stay at ADE,
Bangalore, which helped me a lot while working on this project. I would like to
thank Dr. A. K. Ghosh for organizing the flight to collect the data required for
this project. Furthermore, I would like to extend my sincere gratitude to
Mr. Shobhit Niranjan, M.Tech (Dual) student, Dept. of Electrical Engineering,
IIT Kanpur, who sat with me to sort out problems in my code and algorithms. I
would also like to thank Mr. Subhranshu Maji, B.Tech student, Dept. of Computer
Science and Engineering, who helped me code some parts. I also thank my
teammates from Aerospace Engineering and Computer Science and Engineering in the
UAV Project Group at IIT Kanpur for their motivation, support and company in
critical times.
Contents
1 Introduction
1.1 Geo-Registration
1.2 Geometric Transformations
1.3 Aerial Image Registration
1.4 Image Mosaicing
2 Feature Extraction
2.1 Corner Detection
2.2 Saliency Map and Salient Point based on Itti's Model
3 Registration Technique
3.1 Registration without using Correspondence and Mosaicing
4 Discussion
5 Conclusion and Further Study
Chapter 1
Introduction
The ability to associate scenes and objects visible in aerial video imagery with their
corresponding locations in a reference coordinate system is becoming increasingly
important in visually guided navigation, surveillance and monitoring systems [2] [11].
The availability of low-cost, lightweight video camera systems, high-bandwidth VHF
communication links and a growing inventory of Unmanned Aerial Vehicles (UAVs)
and Mini Aerial Vehicles (MAVs) has created dramatic new opportunities for
surveillance and sensing applications of such algorithms. A typical UAV/MAV mission
will involve registering the video of a particular surveillance flight with a reference
image (alignment of video frames with pre-calibrated reference imagery: DEM and
satellite data). Frame-to-reference registration of video is difficult because a typical
frame does not cover a large enough area to contain stable features/salient points, and
it is also computationally expensive. A possible solution is to create a mosaic from the
flight data of a typical MAV mission and to register that mosaic with the reference image.
The objective of this work is to study feature-based techniques for registering the
aerial flight data of a typical MAV mission in a semi-urban environment, and to create
a mosaic using the estimated geometric transformation parameters. The next task is to
register the generated mosaic with a high-resolution satellite reference image, which would
allow us to locate any given landmark in a MAV mission video in the world coordinate
system. The two aspects of this work are:
1. Feature extraction
2. Registration
This semester we have studied several important features that can be used for
registration: the Harris corner detector, the KLT corner detector and
visually salient points based on Itti's model. The features were extracted successfully.
Rotation and scale parameters are determined using a recent registration technique devised
by Xiong et al. [1]. Unlike conventional feature-based image registration algorithms,
the algorithm in [1] does not need image matching or correspondence. We compare
different feature extraction algorithms, use their features for registration, and
compare the results. We extend the approach in [1] to estimate translation as well, and
use the estimated parameters to generate a mosaic. Code for mosaicing has also been
written and some results have been obtained, although the mosaicing algorithm is not
yet robust to noise in the registration parameters. The implementation was done on the
Windows platform in Visual C++ 6.0 with the OpenCV [29] libraries.
The tasks for next semester are to estimate the translation parameters robustly and to
create a mosaic from MAV video data. In the future the mosaic will be geo-registered with
ortho-rectified satellite image data (the reference image data). We plan to buy
satellite image data of the Kanpur area for this purpose from NRSA (National Remote
Sensing Agency). Some preprocessing is also needed: the collected aerial data
generally contains noise due to weather conditions, camera motion needs to be
compensated, and the camera parameters have to be estimated in order to pre-warp the
images. All of these factors affect the registration procedure and its accuracy.
1.1 Geo-Registration
Computer vision techniques can be used to align any given video frame with
pre-calibrated reference imagery. This kind of registration is known as
geo-registration [2]. The Reference Imagery is a high-resolution orthographic image,
usually with a Ground Sampling Distance of 1 m (meaning that a pixel corresponds to
1 m² on the ground). The Reference Imagery is geodetically aligned and has an associated
Digital Elevation Map (DEM), so that each pixel of the Reference Imagery has a precise
longitude, latitude and height associated with it. Telemetry is an automatic
measurement of data that defines the position of the camera in terms of nine
parameters: vehicle latitude, vehicle longitude, vehicle height, vehicle roll, vehicle
pitch, vehicle heading, camera elevation, camera scan angle and camera focal length.
On the basis of the telemetry data, the Reference Imagery, which covers a substantial
area, can be cropped to a smaller area corresponding to $I_{video}(x)$, which denotes
a video frame or a mosaic created from aerial data. This cropped reference image is
referred to as $I_{ref}(x)$.
Two transformation functions exist between the reference and aerial images. If $x$ is
an image point (in pixels), $I_1(x)$ is the aerial image array and $I_2(x)$ is the
reference image array, then:
1. $f_{reg}(x)$ is the geometric transformation. Finding this mapping is the problem
of image registration,
\[ I_1(x) \approx I_2(f_{reg}(x)) \tag{1.1} \]
2. $f_{color}$ is the intensity/colour mapping between the aerial and reference images,
\[ I_1(x) = f_{color}(I_2(f_{reg}(x))) \tag{1.2} \]
Finding $f_{reg}(x)$ is the larger challenge. Shah et al. [2] identify the following
difficulties:
1. The two imageries are in different projection views: $I_{video}(x)$ is an image of
perspective projection, whereas $I_{ref}(x)$ is an image of orthographic projection.
While the telemetry information can be used with a sensor model to bring both
images into a single projection view,
2. Because of the long time that elapses between the capture of the two images,
data distortions such as severe lighting and atmospheric variations, and object
changes in the form of forest growth or new construction, cause a high
number of disjoint features (features present in one image but not in the other).
3. Remotely sensed terrain imagery, in particular, has the property of being highly
self-correlated both as image data and as elevation data. This includes first-order
correlations (locally similar luminance or elevation values in buildings),
second-order correlations (edge continuations in roads, forest edges, and ridges), as
well as higher-order correlations (homogeneous textures in forests and homogeneous
elevations in plateaus).
In the past, substantial research has been directed towards determining the geo-
location of objects from an aerial view. Several systems such as Terrain Contour
Matching (TERCOM) [3], SITAN, Inertial Navigation/Guidance Systems (INS/IGS),
Global Positioning Systems (GPS) and most recently Digital Scene-Matching and
Area Correlation (DSMAC) have already been deployed in applications requiring
geo-location. While each of these systems has had some degree of success, several
shortcomings and deficiencies have become increasingly apparent. By understanding
the limitations of these systems, we can acquire a better appreciation of the need for
effective image-based systems.
Two types of approaches can be distinguished for the geo-registration problem:
Elevation-Based Correspondence and Image-Based Correspondence. Elevation-based
algorithms attempt to achieve alignment by matching the DEM with an elevation map
recovered from video data. Aggarwal et al. [4] perform pixel-wise stereo analysis of
successive frames to yield a recovered elevation map (REM) as the initial data
rectification step. In [5], Sim and Park propose another geo-registration algorithm that
reconstructs a REM from stereo analysis of successive video frames; Normalized Cross
Correlation based point-matching is used to recover the elevation values.
Elevation-based approaches (based on DEMs) have the general drawback that they rely on
the accuracy of the elevation recovered from two frames, a task found to be notoriously
difficult.
Intensity-based approaches to geo-registration use intensity properties of both
imageries to achieve alignment. Image-based techniques have been developed for the
registration of two sets of reference imagery [6], as well as for the registration
of two successive video images [7], [8]. In [9], Cannata et al. use the telemetry
information to bring a video frame into an orthographic projection view by associating
each pixel with an elevation value from the DEM. By ortho-rectifying the aerial
video frame, the process of alignment is simplified to a strict 2D registration problem.
Correspondence is achieved by taking 32 × 32 pixel patches uniformly over the aerial
image and correlating them with a larger search patch in the Reference Image, using
Normalized Cross Correlation. Finally, the sensor parameters are updated using a
conjugate gradient method, or by a Kalman filter to stress temporal continuity.
An alternative approach is presented by Kumar et al. in [10], where instead of
ortho-rectifying the aerial video frame, a perspective projection of the associated
area of the Reference Image is performed. In [10], two further data rectification
steps are performed. Video frame-to-frame alignment is used to create a mosaic,
providing greater context for alignment than a single image. For data rectification,
a Laplacian filter at multiple scales is then applied to both the video mosaic and the
reference image. To achieve correspondence, two stages of alignment are used: coarse
alignment followed by fine alignment. For coarse alignment, salient (feature) points are
defined as the locations where the response in both scale and space is maximum. Normalized
correlation is used as a match measure between salient points and the associated
reference patch. One feature point is picked as a reference, and the correlation surfaces
for each feature point are then translated to be centered at the reference feature point.
In the subsequent work [11], the filter is modified to use the Laplacian of Gaussian
filter as well as the Hilbert transform in four directions, yielding four oriented energy
images for each aerial video frame and for each perspectively projected reference
image. Instead of considering video mosaics for alignment, the authors use a mosaic
of 3 key-frames from the data stream, each with at least 50 percent overlap.
The major limitation of the intensity-based approaches lies in the assumptions that
they make. The research literature on image-based correspondence is quite vast; [12] is
a general survey of some of these registration techniques. Alignment by maximization
of Mutual Information [13] is another frequently used registration approach; while
it provides high levels of robustness, it also allows many false positives when matching
over a search area of the nature encountered in geo-registration. In addition to
working without GPS, it is also possible to consider situations where telemetry data
is unavailable or corrupted. Registration remains possible in that case, but without an
initial point for the visual search the computation time increases significantly.
1.2 Geometric Transformations
A geometric transformation is a mapping that relocates image points. Transformations
can be global or local in nature. Global transformations are usually defined by a
single set of parameters, which is applied to the whole image. Some of the most
common global transformations are affine, perspective, and polynomial transformations.
The affine transformations include translation, rotation, scaling and shear parameters.
Translation and rotation are usually caused by a different orientation of
the sensor, while scaling is the effect of a change in the altitude of the sensor.
Sensor distortion or the viewing angle may cause stretching and shearing. Rigid
transformations account for object or sensor movement in which objects in the images
maintain their relative shape and size [14]. A rigid-body transformation is composed
of a combination of rotation $\theta$, translation in the x direction $t_x$, translation
in the y direction $t_y$, and scale $s$. It can be written as
\[
\begin{pmatrix} x_2 \\ y_2 \end{pmatrix}
=
\begin{pmatrix} t_x \\ t_y \end{pmatrix}
+ s
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x_1 \\ y_1 \end{pmatrix}
\tag{1.3}
\]
where (x2, y2) is the new transformed coordinate of (x1, y1), tx and ty are x-axis
and y-axis translations, and s is a scale factor. The general 2D affine transformation
can be expressed as shown in the following equation:
Figure 1.1: (a) The Reference Image is cropped based on the telemetry data, which
specifies the area of the Reference Imagery that the camera is capturing. (b) The aerial
video frame before and (c) after geo-registration with the cropped Reference Image.
Note that the Reference Image is an orthographic image while the aerial video frame
is a perspective image. Images are taken from [2] for illustration.
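\[
\begin{pmatrix} x_2 \\ y_2 \end{pmatrix}
=
\begin{pmatrix} t_x \\ t_y \end{pmatrix}
+
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ y_1 \end{pmatrix}
\tag{1.4}
\]
\[
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\]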
where $(x_2, y_2)$ is the new transformed coordinate of $(x_1, y_1)$. The matrix $A$
can be a combination of rotation, scale, or shear. The rotation matrix is the same as in
Eq. (1.3). The scale for the x and y axes can be expressed as:
\[
\text{Scale} = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix}
\tag{1.5}
\]
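As an illustration of Eqs. (1.3) and (1.4), the following minimal Python sketch applies a rigid-body and a general affine transformation to a single point. The function names and the counter-clockwise sign convention for the rotation are assumptions made for this example; they are not taken from the Visual C++ implementation used in this work.

```python
import math


def rigid_transform(point, theta, tx, ty, s=1.0):
    """Eq. (1.3): rotate by theta (radians), scale by s, then translate.

    Assumes the counter-clockwise rotation convention.
    """
    x1, y1 = point
    c, sn = math.cos(theta), math.sin(theta)
    x2 = tx + s * (c * x1 - sn * y1)
    y2 = ty + s * (sn * x1 + c * y1)
    return (x2, y2)


def affine_transform(point, A, t):
    """Eq. (1.4): A = ((a11, a12), (a21, a22)) and t = (tx, ty)."""
    x1, y1 = point
    (a11, a12), (a21, a22) = A
    return (t[0] + a11 * x1 + a12 * y1, t[1] + a21 * x1 + a22 * y1)
```

For example, a pure scaling as in Eq. (1.5) corresponds to calling `affine_transform` with `A = ((Sx, 0), (0, Sy))`.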
However, local distortions may be present in the scenes due to motion parallax,
movement of objects, etc. The parameters of a local mapping transformation
vary across the different regions of the image to handle local deformations. These
parameters can be determined by subdividing the image into small parts.
1.3 Aerial Image Registration
Image registration is the process of determining the geometric transformation defined
above between a newly sensed image, called the input image, and a reference image of
the same scene that may have been taken at a different time, from a different sensor,
or from a different viewpoint. Current automated registration techniques can be
classified into two broad categories: area-based and feature-based techniques. In the
area-based algorithms, a small window of points in the sensed image is compared
statistically with windows of the same size in the reference image [15]. Window
correspondence is based on a similarity measure between two given windows. The
measure of similarity is usually the normalized cross correlation. Area-based
techniques can be implemented in the Fourier domain using the fast Fourier transform
(FFT) [16]. A majority of the area-based methods have the limitation of registering
only images with small misalignment, and therefore the images must be roughly
aligned with each other initially. The correlation measures become unreliable when
the images have multiple modalities and the gray-level characteristics vary (e.g., TM
and synthetic aperture radar (SAR) data).
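The area-based window comparison described above can be sketched as follows. This is a minimal, unoptimized illustration in plain Python, assuming images are given as lists of rows of gray levels; the function names `ncc` and `best_match` are hypothetical and the exhaustive search stands in for whatever search strategy a real system would use.

```python
import math


def ncc(window_a, window_b):
    """Normalized cross correlation of two equally sized windows."""
    a = [v for row in window_a for v in row]
    b = [v for row in window_b for v in row]
    if len(a) != len(b):
        raise ValueError("windows must have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # A flat window has no contrast; treat its correlation as zero.
    return num / den if den > 0 else 0.0


def best_match(template, image):
    """Slide the template over the image; return (row, col, score) of the best NCC."""
    th, tw = len(template), len(template[0])
    best = (0, 0, -2.0)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(template, window)
            if score > best[2]:
                best = (r, c, score)
    return best
```

Because the score is normalized by the mean and variance of each window, it is insensitive to linear changes in brightness, but, as noted above, it still requires the two images to be roughly aligned in rotation and scale.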
In contrast, the feature-based methods are more robust and more suitable in
these cases. Two critical procedures are generally involved in the feature-based
techniques: feature extraction and feature correspondence. The basic building block
of a feature-based image registration scheme involves matching feature points that are
extracted from a sensed image to their counterparts in a reference image. Features
may be control points, corners, junctions or interest points. These features are also
known as visually salient points. Feature matching overcomes the limitations of
area-based signal correlation by attempting to match only information-rich points.
1.3.1 Image Integration
Image integration deals with finding $f_{color}$ between two aerial images or two
different imaging systems. Various techniques have been developed for modifying the
image grey levels in the vicinity of a boundary to obtain a smooth transition between
images, removing the seams and creating a blended image. These mainly consist of
choosing a frontier that induces a minimum of discontinuity [18]. The authors of [17]
proposed using a polynomial curve: in each line of the common area one point is
retained, and the curve is defined by a minimum squared error procedure. These methods
are appropriate if the common area is nearly identical (planar support). Heitz presented
a simplification using a parametric plane function s = ax + by + c determined by a mean
square error procedure. A more substantial transformation must be applied when the common
regions include large differences; Westerkamp et al. [19] described a polynomial
function to assemble distorted microscopic images.
1.4 Image Mosaicing
An image mosaic is a synthetic composition generated from a sequence of images;
it can be obtained by understanding the geometric relationships between the images.
These geometric relations are coordinate transformations that relate the different
image coordinate systems. By applying the appropriate transformations via a warping
operation and merging the overlapping regions of the warped images, it is possible to
construct a single image, indistinguishable from a single large image of the same object,
that covers the entire visible area of the scene. This merged single image is called the
mosaic. The basic scheme for mosaicing comprises two main steps, which are
outlined below.
1. Image registration using Geometric Transformations derived from image data
and/or camera models.
2. Image integration or blending.
We have adopted a feature-based approach to solve the registration problem. Unlike
conventional feature-based image registration algorithms, our approach is based
on the work of Xiong et al. [1], which does not need image matching or
correspondence. In [1] only Harris corners are considered as features. We compare
different feature extraction algorithms, use their features for registration, and
compare the results. We extend the approach in [1] to estimate translation as well, and
use the estimated parameters to generate a mosaic.
Chapter 2
Feature Extraction
A feature is the result of an interpretation of n pixels, usually in a compact support,
in a window of p × p. An important step in almost all machine as well as biological
vision systems is to process the input image(s) to extract features or primal sketches.
In general, the feature detection process involves computing the response R of one or
multiple detectors (filters/operators) to the input image(s), followed by the analysis
of R to isolate points (or regions) that satisfy certain constraints. In fact, the best
definition of a feature is the operator itself. There are several kinds of features used
for matching. They may be divided into four groups as follows:
Visual features (edges, textures, junctions and corners)
Transform coefficient features (Fourier descriptors, Hadamard coefficients)
Algebraic features (based on matrix decomposition of an image)
Statistical features (moment invariants)
2.1 Corner Detection
Corners are defined as the junction points of two straight-line edges. Most existing
edge detectors perform poorly at corners because they assume an edge to be an
entity of infinite extent, an assumption that is violated at corners. Since
most gray-level corner detectors are based on existing edge detectors, the
performance of such corner detectors is not satisfactory. For example, the Canny edge
detector [20] is found incapable of accurately locating edges near a corner due to the
well-known rounding effect. The Harris corner detector [21] and the KLT feature
detector [22] are the most widely used corner detectors, so we compare them for our
application.
2.1.1 Harris Corner Detector
The Harris corner detector [21] computes a matrix that is related to the
auto-correlation function of the image intensity. This matrix averages the first
derivatives of the signal over a window:
\[
C = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}
\tag{2.1}
\]
where $I_x$ and $I_y$ are the gradients (derivatives) in the x and y directions. The
eigenvalues of this matrix are the principal curvatures of the auto-correlation function.
If these two curvatures are high, an interest point is present.
The algorithm for Harris corner detection is as follows:
Algorithm for Harris Corner Detector
1. Compute the matrix $C$ for each pixel of the input image.
2. The standard Harris corner detection algorithm proposes two different criteria
for corner point selection. The first is to compare the value of
$\det(C) - k\,\mathrm{trace}(C)^2$ with a threshold; the second is to compare the
value of $R = \det(C)/\mathrm{trace}(C)$ with a threshold, where $C$ is the covariance
matrix of the gradient computed above. We have used the second criterion while
implementing the Harris algorithm because the first depends strongly on the chosen
value of the constant $k$.
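The two steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in this work: central differences for the derivatives, the window radius, the value of sigma and the small constant `eps` that avoids division by zero in flat regions are all assumptions chosen for the example.

```python
import math


def gradients(img):
    """Central-difference derivatives Ix, Iy (border pixels are left at zero)."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix[r][c] = (img[r][c + 1] - img[r][c - 1]) / 2.0
            iy[r][c] = (img[r + 1][c] - img[r - 1][c]) / 2.0
    return ix, iy


def structure_matrix(ix, iy, r, c, radius=2, sigma=1.0):
    """Gaussian-weighted sums of Ix^2, IxIy, Iy^2 around (r, c), as in Eq. (2.1)."""
    h, w = len(ix), len(ix[0])
    a = b = d = 0.0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                wgt = math.exp(-(dr * dr + dc * dc) / (2.0 * sigma * sigma))
                a += wgt * ix[rr][cc] ** 2
                b += wgt * ix[rr][cc] * iy[rr][cc]
                d += wgt * iy[rr][cc] ** 2
    return a, b, d  # C = [[a, b], [b, d]]


def harris_response(a, b, d, eps=1e-12):
    """Second criterion: R = det(C) / trace(C)."""
    return (a * d - b * b) / (a + d + eps)


def response_at(img, r, c):
    """Convenience wrapper: corner response of one pixel."""
    ix, iy = gradients(img)
    return harris_response(*structure_matrix(ix, iy, r, c))
```

On a synthetic image containing one bright rectangle, `response_at` is large at the rectangle's corner and close to zero both along its straight edges and in flat regions, which is the behaviour the threshold on R relies on.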
Feature Reduction
While selecting corner points using the Harris algorithm we have applied a two-level
corner strength comparison. Suppose $\lambda_1$ and $\lambda_2$ are the two eigenvalues
of the covariance matrix $C$; the criteria for feature reduction are then as follows:
First we compare the value of $\mathrm{norm} = \sqrt{\lambda_1^2 + \lambda_2^2}$ with a
threshold; if it is greater, the point is a first-level corner point.
Then we divide the image into 25 × 25 grids, and in each grid we select at
most one corner point, namely the one with the highest value of the norm of eigenvalues
defined above.
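A minimal sketch of this two-level reduction is given below. It assumes that each candidate corner is supplied as a tuple (x, y, lambda1, lambda2) and that "25 × 25 grids" means grid cells of 25 × 25 pixels; both the tuple layout and that reading of the grid size are assumptions made for illustration, and `reduce_features` is a hypothetical name.

```python
import math


def reduce_features(corners, threshold, cell=25):
    """Keep at most one corner per grid cell, chosen by eigenvalue norm.

    corners: iterable of (x, y, lam1, lam2).
    Returns a sorted list of (x, y, norm).
    """
    best = {}
    for x, y, lam1, lam2 in corners:
        norm = math.hypot(lam1, lam2)  # sqrt(lam1^2 + lam2^2)
        if norm <= threshold:
            continue  # fails the first-level test
        key = (int(x) // cell, int(y) // cell)  # grid cell containing the corner
        if key not in best or norm > best[key][2]:
            best[key] = (x, y, norm)
    return sorted(best.values())
```

Selecting a single corner per cell spreads the retained features evenly over the image, which keeps densely textured regions from dominating the later voting step.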
Fig. 2.1(a) shows the corners detected using the Harris corner detection algorithm [21]
and Fig. 2.1(b) shows the corners after applying the feature reduction algorithm.
Figure 2.1: (a) Corners obtained using the Harris corner detector algorithm, (b)
detected features after the feature reduction algorithm.
2.1.2 KLT Corner Detector
KLT features [22] are geometrically stable under different transformations. Hence
features detected by KLT have a high repeatability factor and high information
content. The detector is also based on the auto-correlation function of the image
intensity.
KLT Corner Detector Algorithm
1. Compute the matrix $C$ for each pixel of the input image and let $\lambda_1$ and
$\lambda_2$ denote its eigenvalues.
2. The KLT detector performs first-level corner detection based on the value of the
smaller eigenvalue. It is computed in a window about the point under consideration
and is compared with a threshold: if it is greater than the threshold, the point is a
first-level corner point. The array of all corner points is then sorted in decreasing
order of the minimum eigenvalue of the windows about the points under consideration.
3. Moving from the top of the sorted array downwards, we delete every point that lies
below the point under consideration in the array and falls within its 8-neighborhood.
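The selection logic of steps 2 and 3 can be sketched as follows, assuming the entries a, b, d of the symmetric matrix C = [[a, b], [b, d]] have already been computed for each pixel. The function names and the candidate format (row, col, min_eig) are hypothetical, chosen only for this illustration.

```python
import math


def min_eigenvalue(a, b, d):
    """Smaller eigenvalue of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d) - math.sqrt(0.25 * (a - d) ** 2 + b * b)


def klt_select(candidates, threshold):
    """candidates: list of (row, col, min_eig). Returns the retained corners."""
    # Step 2: first-level test, then sort by decreasing minimum eigenvalue.
    strong = sorted((p for p in candidates if p[2] > threshold),
                    key=lambda p: p[2], reverse=True)
    kept = []
    # Step 3: walk down the sorted array, dropping any point that lies in the
    # 8-neighbourhood of a stronger point that has already been kept.
    for r, c, lam in strong:
        if all(max(abs(r - kr), abs(c - kc)) > 1 for kr, kc, _ in kept):
            kept.append((r, c, lam))
    return kept
```

Thresholding the smaller eigenvalue guarantees that both eigenvalues are large, so points on straight edges (one large, one small eigenvalue) are rejected.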
Fig. 2.2(a) shows the corners detected using the KLT algorithm and Fig. 2.2(b) shows
the reduced KLT corners.
Figure 2.2: (a) Corners obtained using the KLT corner detector algorithm, (b)
detected features after feature reduction.
2.2 Saliency Map and Salient Point based on Itti's Model
Visual attention is a biological mechanism used by primates to
compensate for the inability of their brains to process the huge amount of visual
information gathered by the two eyes. Early works on attention modeling were mostly
inspired by the biological model of the brain. The Caltech hypothesis [23], elaborated
by Itti and Koch [24], represents one of the first concrete descriptions of how the
visual attention model works. According to the hypothesis, the elementary features
are combined into a unique map of attention, the saliency map, which resides either
in the LGN (lateral geniculate nucleus) or in V1 (the primary visual cortex). Finally,
the Winner Take All (WTA) network [27], which is responsible for detecting the most
salient scene location, is located around the thalamic reticular nucleus.
One of the first and most popular computational models of saliency
is based on the Caltech hypothesis. It rests on four main principles: visual attention
is based on multi-featured inputs; the saliency of a region is affected by the surrounding
context; the saliency of locations is represented by a saliency map; and Winner
Take All and Inhibition of Return are suitable mechanisms to allow attention shifts.
Figure 2.3: Schematic of Itti's model [23]
2.2.1 Feature Maps for Static Images
First, a number of features (1, ..., j, ..., n) are extracted from the scene by computing
the so-called feature maps $F_j$. Such a map represents the image of the scene based on a
well-defined feature, which leads to a multi-featured representation of the scene. In
his implementation, Itti considered seven different features, which are computed from
an RGB color image and which belong to three main cues, namely intensity, color,
and orientation.
Intensity feature:
\[ F_1 = I = 0.3R + 0.59G + 0.11B \tag{2.2} \]
Two chromatic features based on the two color opponency filters $R^+G^-$ and
$B^+Y^-$, where the yellow signal is defined as $Y = \frac{R+G}{2}$. Such chromatic
opponency exists in the human visual cortex.
\[ F_2 = \frac{R - G}{I} \tag{2.3} \]
\[ F_3 = \frac{B - Y}{I} \tag{2.4} \]
The normalization of the features by $I$ decouples hue from intensity.
Four local orientation features $F_4, \ldots, F_7$ according to the angles
{0°, 45°, 90°, 135°}. Gabor filters, which represent a suitable mathematical model of
the receptive field impulse response of orientation-selective neurons in the primary
visual cortex [25], are used to compute the orientation features. In this
implementation of the model, it is possible to use an arbitrary number of orientations.
However, it has been noticed that using more than four orientations does not improve the
performance of the model drastically.
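As a small worked example of Eqs. (2.2) to (2.4), the sketch below computes the intensity and the two chromatic features for a single RGB pixel. The function name and the handling of near-black pixels (returning zero chromatic features when I is below `eps`) are assumptions added for the example; the orientation features, which require Gabor filtering, are not covered.

```python
def intensity_color_features(r, g, b, eps=1e-6):
    """Return (F1, F2, F3) for one RGB pixel, following Eqs. (2.2)-(2.4)."""
    intensity = 0.3 * r + 0.59 * g + 0.11 * b   # F1 = I
    yellow = (r + g) / 2.0                      # Y = (R + G) / 2
    if intensity < eps:
        # Assumption: hue is undefined for black pixels, so report no opponency.
        return (intensity, 0.0, 0.0)
    f2 = (r - g) / intensity                    # red/green opponency
    f3 = (b - yellow) / intensity               # blue/yellow opponency
    return (intensity, f2, f3)
```

A gray pixel such as (100, 100, 100) yields zero for both chromatic features, while a pure red pixel yields a positive red/green response and a negative blue/yellow response, which shows how dividing by I separates hue from brightness.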
2.2.2 Center-Surround Receptive Field Profiles
In a second step, each feature map is transformed into its conspicuity map, which
highlights the parts of the scene that strongly differ, according to a specific feature,
from their surroundings. In biologically plausible models, this is usually achieved by
using a center-surround mechanism. Practically, this mechanism can be implemented
with a difference-of-Gaussians (DoG) filter, which can be applied to the feature maps to
extract local activities for each feature type. A visual attention task has to detect
conspicuous regions regardless of their sizes. Thus, a multiscale conspicuity operator
is required. Applying center-surround filters of variable size to fixed-size images has a
high computational cost. The method used here is instead based on a multiresolution
representation of images. For a feature j, a Gaussian pyramid $I_j$ is created by
progressively low-pass filtering and sub-sampling the feature map $F_j$ by a factor of 2,
using a Gaussian filter $G$:
\[ I_j(0) = F_j \tag{2.5} \]
\[ I_j(i) = \downarrow_2 \left( I_j(i-1) \ast G \right) \tag{2.6} \]
where $\ast$ refers to the spatial convolution operator and $\downarrow_2$ refers to the
downsampling operation. Center-surround is then implemented as the difference between
fine (c for center) and coarse (s for surround) scales. For a feature j (1, ..., j, ..., n),
a set of intermediate multiscale conspicuity maps $M_{j,k}$ (1, ..., k, ..., K) is computed
according to the equation below, giving rise to n × K maps for the n considered features.
\[ M_{j,k} = \left| I_j(c_k) \ominus I_j(s_k) \right| \tag{2.7} \]
where $\ominus$ is a cross-scale difference operator that first interpolates the coarser
scale to the finer one and then carries out a point-by-point subtraction. The absolute
value of the difference between the center and the surround allows the simultaneous
computation of both sensitivities, dark center on bright surround and bright center on
dark surround (red/green and green/red, or blue/yellow and yellow/blue, for color).
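The pyramid construction of Eqs. (2.5) and (2.6) and the cross-scale difference of Eq. (2.7) can be sketched in plain Python as follows. The 5-tap binomial kernel standing in for the Gaussian filter G, the clamped border handling and the nearest-neighbour interpolation are assumptions chosen to keep the example short; they are not prescribed by the model.

```python
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # binomial approximation of G


def blur(img):
    """Separable 5-tap low-pass filter with clamped borders."""
    h, w = len(img), len(img[0])

    def clamp(v, hi):
        return max(0, min(hi, v))

    tmp = [[sum(k * img[r][clamp(c + o - 2, w - 1)] for o, k in enumerate(KERNEL))
            for c in range(w)] for r in range(h)]
    return [[sum(k * tmp[clamp(r + o - 2, h - 1)][c] for o, k in enumerate(KERNEL))
             for c in range(w)] for r in range(h)]


def downsample(img):
    """Keep every second row and column."""
    return [row[::2] for row in img[::2]]


def pyramid(feature_map, levels):
    """Eqs. (2.5)-(2.6): level 0 is F_j, each next level is blurred then halved."""
    pyr = [feature_map]
    for _ in range(levels):
        pyr.append(downsample(blur(pyr[-1])))
    return pyr


def upsample_to(img, h, w):
    """Nearest-neighbour interpolation of a coarse level to size h x w."""
    ch, cw = len(img), len(img[0])
    return [[img[min(r * ch // h, ch - 1)][min(c * cw // w, cw - 1)]
             for c in range(w)] for r in range(h)]


def center_surround(pyr, c, s):
    """Eq. (2.7): |I(c) (-) I(s)| with the coarse level interpolated to the fine one."""
    fine = pyr[c]
    h, w = len(fine), len(fine[0])
    coarse = upsample_to(pyr[s], h, w)
    return [[abs(fine[r][col] - coarse[r][col]) for col in range(w)]
            for r in range(h)]
```

A uniform feature map produces an all-zero conspicuity map, while an isolated bright pixel produces a strong response at its own location, since the coarse (surround) level has averaged it away.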
2.2.3 Saliency Map
The purpose of the saliency map is to represent the conspicuity, or saliency,
at every location in the visual field by a scalar quantity, and to guide the selection
of attended locations based on the spatial distribution of saliency. At each spatial
location, all the feature maps consequently need to be combined into a unique scalar
measure of salience. In the implementation, all the feature maps are normalized to
the same total dynamic range (e.g., between 0 and 255) and then summed
into the saliency map. This operation is defined as N(.).
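The normalization and summation just described can be sketched as follows. This minimal example implements only what the paragraph states, namely rescaling every map to a common range and adding the results; mapping a constant map to the lower bound is an assumption made to avoid division by zero, and the function names are hypothetical.

```python
def normalize(fmap, lo=0.0, hi=255.0):
    """Linearly rescale a map so that its values span [lo, hi]."""
    flat = [v for row in fmap for v in row]
    mn, mx = min(flat), max(flat)
    if mx == mn:
        # Assumption: a constant map carries no contrast, so map it to lo.
        return [[lo for _ in row] for row in fmap]
    scale = (hi - lo) / (mx - mn)
    return [[lo + (v - mn) * scale for v in row] for row in fmap]


def saliency_map(feature_maps):
    """Sum equally sized feature maps after normalizing each to the same range."""
    normed = [normalize(m) for m in feature_maps]
    h, w = len(normed[0]), len(normed[0][0])
    return [[sum(m[r][c] for m in normed) for c in range(w)] for r in range(h)]
```

Normalizing before summing prevents a feature whose raw values happen to be numerically large from dominating the combined map.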
2.2.4 Selection of the point of Attention
Once the saliency map has been computed, Winner Take All (WTA) and Inhibition
of Return are suitable mechanisms to imitate eye movements and the focus
of attention [27]. The WTA selects the point of maximum salience at each
iteration. The attention point is then moved by inhibiting the
saliency of the object currently being attended [26] [27]: at each iteration the saliency
of the attended object is decayed, so that objects not yet attended eventually become
relatively more salient and take the focus of attention. Another approach
is to divide the saliency map into a sufficient number of grid cells and, within each
cell, to take the local maximum of the saliency map intensity if it exceeds a certain
threshold value.
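The WTA selection with inhibition of return can be sketched as a simple loop. The square inhibition region of a given `radius` and the multiplicative `decay` factor are assumptions made for this illustration; the text above does not fix the shape of the inhibited region or the decay law.

```python
def attend(saliency, n_points, radius=1, decay=0.0):
    """Return n_points attended locations as (row, col), most salient first."""
    s = [list(row) for row in saliency]  # work on a copy
    h, w = len(s), len(s[0])
    fixations = []
    for _ in range(n_points):
        # Winner Take All: pick the location of maximum saliency.
        r, c = max(((r, c) for r in range(h) for c in range(w)),
                   key=lambda p: s[p[0]][p[1]])
        fixations.append((r, c))
        # Inhibition of return: decay the saliency around the winner.
        for rr in range(max(0, r - radius), min(h, r + radius + 1)):
            for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                s[rr][cc] *= decay
    return fixations
```

With `decay=0.0` an attended region can never win again; a value between 0 and 1 would let it regain the focus only after stronger regions elsewhere have also been suppressed.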
Fig. 2.4(a) shows the saliency map generated by combining all the feature maps
based on Itti's model for finding visually salient points. It clearly shows that the
salient locations have larger intensity in the image. Fig. 2.5(a) and Fig. 2.5(b) show
the visually salient points detected on a pair of images.
Figure 2.4: Saliency map obtained from the normalized summation of all feature maps
Figure 2.5: (a) Visually salient points obtained in the first image, (b) salient points
obtained in the second image.
Chapter 3
Registration Technique
3.1 Registration without using Correspondence and Mosaicing
This approach is based on the work of Xiong et al. [1], which addresses the problem of
aerial image registration without any correspondence by means of a novel algorithm.
Features are detected on a pair of images (observed and reference) using either a corner
detector or the saliency map approach based on Itti's model. Image patches are created
using these features as positions. A circle is used as the shape of the image patches to
deal with rotation. By changing the size of the image patches, we can handle scaling.
The orientations of the image patches are computed with an eigenvector approach. From
the orientation differences of patches between the reference and observed images, an
angle histogram is created by a voting procedure. The orientation difference
corresponding to the maximum peak of the histogram is the rotation angle between the
reference and observed images. Different sizes of image patches are used to create
different angle histograms. The scaling value between the two images is determined by
the angle histogram that has the highest maximum peak. The approach is described in
the following subsections.
3.1.1 The Orientation of an Image Patch
For a given patch p(i, j) (i = 1, 2, ..., m), the covariance matrix is defined as
$$COV_p = E\left[(X - m_x)(X - m_x)^T\right] \qquad (3.1)$$
where $X = (i, j)^T$ is the position of the pixel and $m_x = (m_{x_i}, m_{x_j})^T$ is
the centroid of the image patch p(i, j), i.e. the first order moment. The eigenvalues
can be found by solving
$$|COV_p - \lambda I| = 0 \qquad (3.2)$$
Equation 3.2 gives two eigenvalues. Let $\lambda_1$ be the larger eigenvalue and
$\lambda_2$ the smaller one. The normalized eigenvectors $V_1$ and $V_2$ that
correspond to $\lambda_1$ and $\lambda_2$ are orthogonal, since $COV_p$ is symmetric.
The direction of eigenvector $V_1$ is defined as the orientation of image patch
p(i, j). By applying this approach, we can compute orientations for all image
patches on the reference and observed images.
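The orientation computation can be sketched as below. One point is an assumption on our part: the text does not say how the expectation in Eq. 3.1 is weighted, and the sketch weights each pixel position by its intensity (with equal weights a circular patch would have no preferred direction). The function name `patch_orientation` and the angle convention (measured from the column axis towards the row axis) are also illustrative choices.

```python
import numpy as np

def patch_orientation(patch):
    """Orientation (radians, in [0, pi)) of the principal axis of a patch.

    Assumption: the expectation in Eq. 3.1 is weighted by pixel intensity.
    """
    p = np.asarray(patch, dtype=np.float64)
    total = p.sum()
    if total <= 0:
        raise ValueError("patch must contain positive intensity")
    i, j = np.indices(p.shape)
    X = np.stack([i.ravel(), j.ravel()]).astype(np.float64)  # 2 x N positions
    w = p.ravel() / total                    # weights, summing to 1
    m = X @ w                                # centroid m_x
    d = X - m[:, None]
    cov = (d * w) @ d.T                      # 2 x 2 covariance matrix COV_p
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                      # eigenvector of the largest one
    # An eigenvector is defined only up to sign: fold the angle into [0, pi).
    return float(np.arctan2(v1[0], v1[1]) % np.pi)
```

For a circular patch, pixels outside the circle can be set to zero before calling the function. Because the eigenvector sign is arbitrary, orientations are only defined modulo 180 degrees.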
3.1.2 Angle Histogram for Image Rotation and Scaling
For the observed image, we can create a patch set $P_t = \{p_t^j,\ j = 1, 2, \ldots, n_t\}$
and obtain an orientation set $\Theta_t = \{\theta_t^j,\ j = 1, 2, \ldots, n_t\}$.
Similarly, for the reference image, we can create a patch set
$P_f = \{p_f^i,\ i = 1, 2, \ldots, n_f\}$ and an orientation set
$\Theta_f = \{\theta_f^i,\ i = 1, 2, \ldots, n_f\}$. Suppose that the rotation angle
between the observed and reference images is $\theta$ and that both images cover the
same scene. For an image patch $p_t^j$ on the observed image, we compute the
orientation differences with all patches $P_f$ on the reference image:
$$\Delta\theta_l = |\theta_t^j - \theta_f^i|, \quad i = 1, 2, \ldots, n_f \qquad (3.3)$$
If we do the same computation for all patches $p_t^j$, $j = 1, 2, \ldots, n_t$, on the
observed image, we obtain a set of orientation differences
$\Delta\Theta = \{\Delta\theta_l,\ l = 1, 2, \ldots, n_t n_f\}$, among which are the $n_t$
pairs of corresponding patches on the reference image. For these $n_t$ pairs of
corresponding patches, the orientation difference equals the rotation angle $\theta$:
$$\Delta\theta_l = |\theta_t^j - \theta_f^i|, \quad i = 1, 2, \ldots, n_f \qquad (3.4)$$
If we create a histogram of the orientation differences, the bin that contains the
orientation difference between corresponding patches will therefore have the highest
count.
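The voting procedure can be illustrated with the following sketch. It assumes orientations given in degrees within [0, 180) and a bin width of one degree; both are choices made for the example, as is the function name `rotation_from_orientations`.

```python
import numpy as np

def rotation_from_orientations(theta_t, theta_f, bin_width_deg=1.0):
    """Vote all pairwise orientation differences into an angle histogram.

    theta_t, theta_f: patch orientations in degrees for the observed and
    reference images. Returns (angle at the centre of the peak bin,
    count in the peak bin, all bin counts).
    """
    theta_t = np.asarray(theta_t, dtype=np.float64)
    theta_f = np.asarray(theta_f, dtype=np.float64)
    # All n_t * n_f differences |theta_t^j - theta_f^i| (Eq. 3.3).
    diffs = np.abs(theta_t[:, None] - theta_f[None, :]).ravel()
    n_bins = int(round(180.0 / bin_width_deg))
    counts, edges = np.histogram(diffs, bins=n_bins, range=(0.0, 180.0))
    k = int(np.argmax(counts))
    angle = 0.5 * (edges[k] + edges[k + 1])
    return angle, int(counts[k]), counts
```

No patch correspondence is needed: the true pairs all vote for the same bin, while the remaining pairs spread their votes over the whole angle range.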
The value of the scaling is obtained through a series of voting processes. By
changing the size of the image patches and computing the angle histogram for each
size, we obtain a series of angle histograms, and we choose the one which has the
highest maximum peak. Let $A_t$ and $A_f$ denote the patch sizes on the observed and
reference images corresponding to the histogram $H_h$ which has the highest maximum
peak. The value of the scaling between the observed and reference images can then be
computed by
$$s = \frac{A_t}{A_f} \qquad (3.5)$$
At the same time, the orientation difference at the maximum peak of the histogram
$H_h$ is the rotation angle between the observed and reference images.
Figure 3.1: Typical angle histogram. Each column along the X-axis represents a bin of
angle difference, and the Y-axis shows the number of occurrences in that bin. The
peak for the rotation angle is very dominant, assuming the same scale.
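The choice of the scale can be sketched as a search over the histograms obtained for different patch-size pairs. The dictionary layout and the function name `select_scale` are hypothetical; the histograms themselves are assumed to have been computed beforehand, one per pair of patch sizes.

```python
def select_scale(histograms):
    """Pick the patch-size pair whose angle histogram has the highest peak.

    histograms: dict mapping (A_t, A_f) patch sizes to a sequence of bin
    counts. Returns (scale s = A_t / A_f, index of the peak bin).
    """
    (a_t, a_f), counts = max(histograms.items(), key=lambda kv: max(kv[1]))
    peak_bin = max(range(len(counts)), key=lambda k: counts[k])
    return a_t / a_f, peak_bin
```

For example, with histograms for the size pairs (20, 20), (24, 20) and (28, 20), a highest peak in the second histogram gives s = 24 / 20 = 1.2, and the peak bin of that same histogram gives the rotation angle.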
3.1.3 Finding Translation
Algorithm
1. Select one feature point in the reference image (currently done manually).
2. Find similar points in the target image according to the similarity measure
defined in [28], with a threshold of 0.85.
3. Take a window about the interest point in the reference image and calculate the
normalized cross correlation with each similar point; the similar point in the
target image with the highest correlation is taken as the corresponding
point/patch. The translation then follows directly from a simple transformation.
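Step 3 can be sketched as follows. The code assumes that the candidate points from step 2 are already available, that all windows lie completely inside the images, and that points are given as (row, column); the similarity measure of [28] is not reproduced. All function names are hypothetical.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized windows."""
    a = np.asarray(a, dtype=np.float64) - np.mean(a)
    b = np.asarray(b, dtype=np.float64) - np.mean(b)
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def window(img, point, half):
    """Square window of side 2*half+1 centred on point = (row, col)."""
    r, c = point
    return img[r - half:r + half + 1, c - half:c + half + 1]

def best_match(ref_img, ref_point, tgt_img, candidates, half=7):
    """Return the candidate whose window correlates best with the reference."""
    ref_win = window(ref_img, ref_point, half)
    scores = [ncc(ref_win, window(tgt_img, p, half)) for p in candidates]
    return candidates[int(np.argmax(scores))], max(scores)

def translation(ref_point, tgt_point, angle_deg=0.0, scale=1.0):
    """Translation t such that tgt = scale * R(angle) * ref + t."""
    th = np.deg2rad(angle_deg)
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return np.asarray(tgt_point, float) - scale * R @ np.asarray(ref_point, float)
```

Once the rotation and scale are known, a single reliable correspondence is enough to fix the two translation parameters, which is why only one feature point is selected in step 1.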
Figure 3.2: (a) a test image, (b) the test image rotated by 11°; the result shows
11.46° rotation.
3.1.4 Results of Registration
After extracting the features, we use them to find the rotation, scale and
translation parameters and register the images according to the algorithm described
above. Table 3.1 shows the results obtained for Harris features, Table 3.2 for KLT
features and Table 3.3 for salient features using Itti's model.
To validate the algorithm, we applied it to a pair of test images in which
fig. 3.2(a) is the image without any rotation and fig. 3.2(b) is the same image
rotated 11° anti-clockwise. The result obtained is a rotation of 11.46° and a scale
of 1.0, i.e. an error of 0.46° in rotation and none in scale. We did similar
experiments with other images as well. To validate the final correspondence after
finding the translation, we highlighted the estimated corresponding feature in the
destination image with a black blob, as shown in fig. 3.3(a) and 3.3(b).
Fig. 3.4(a) and fig. 3.4(b) show the pair of images (source and destination images
for registration) for which the above registration parameters were found.
Fig. 1.1(c) shows the mosaic obtained by combining the source and destination images
using the registration parameters.
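For mosaicing, the four registration parameters can be combined into a single similarity transform. The sketch below is illustrative only: the sign and direction conventions (which image is mapped onto which, and the sense of positive rotation) are assumptions, and the function names are hypothetical. The values in the usage note are those reported for KLT features in Table 3.2.

```python
import numpy as np

def similarity_matrix(angle_deg, scale, tx, ty):
    """3x3 homogeneous matrix for x' = scale * R(angle) * x + (tx, ty)."""
    th = np.deg2rad(angle_deg)
    c, s = scale * np.cos(th), scale * np.sin(th)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def map_points(M, pts):
    """Apply M to an (N, 2) array of (x, y) points."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ M.T)[:, :2]
```

With `M = similarity_matrix(5.1567, 1.0, 48.427681, 8.734428)`, the origin of one image maps to (48.427681, 8.734428) in the other. The first two rows of M form the 2x3 affine matrix that a warping routine expects when one image is resampled into the frame of the other.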
Figure 3.3: (a) source image with a feature selected manually, (b) destination image
with the estimated feature corresponding to the feature in the source image
Table 3.1: Results - Using Harris Corner Features
Angle of Rotation    3.6492°
Scale                0.973329
Translation Tx       57.547512 pixels
Translation Ty       1.562878 pixels
Table 3.2: Results - Using KLT Corner Features
Angle of Rotation    5.1567°
Scale                1.000000
Translation Tx       48.427681 pixels
Translation Ty       8.734428 pixels
Table 3.3: Results - Using Visually Salient Points
Angle of Rotation    1.43312°
Scale                1.000000
Translation Tx       50.427681 pixels
Translation Ty       1.734428 pixels
Figure 3.4: (a) First image, (b) second image.
Figure 3.5: Merged image using corner features
Chapter 4
Discussion
• KLT features are more stable than Harris features. Salient points are also found
to be stable.
• A degree of blurring in the images affects the detected features, because the
presence of noise affects the texture (orientation). The blurring is clearly due to
shaking of the mount, which should be minimized.
• The registration parameters are comparable whichever of the features is used. The
rotation is very small, as expected, because the motion was mainly translational at
the time of data collection.
• The mosaic generated by combining the pair of images does not have perfect overlap
in the inlier region of the images, and the mosaicing is not robust enough to noise
in the registration parameters.
Chapter 5
Conclusion and Further Study
Features are extracted successfully. At present, the algorithm for finding the
translation after the rotation and scale have been found is manual; it needs to be
made statistical and automatic, without any human intervention. Key-frames need to
be detected from the video mission data, which can be done only once sufficient
mission data is available, so more flight data has to be collected in the future.
The algorithm for mosaicing is not robust enough to noise in the registration
parameters, and this has to be addressed. Some preprocessing is also needed: the
collected aerial data generally has noise due to weather conditions, motion
compensation is required, and the camera parameters have to be estimated for
pre-warping the images. These factors affect the registration procedure and its
accuracy. In the future the mosaic needs to be geo-registered with ortho-rectified
satellite image data (reference image data). We are planning to buy satellite image
data of the Kanpur area for this purpose from NRSA (National Remote Sensing Agency).
For funding, we have submitted a detailed research proposal to ARDB (Aeronautics
Research & Development Board).
Bibliography
[1] Xiong, Y., and Quek, F., Automatic Aerial Image Registration Without
Correspondence, The 4th IEEE International Conference on Computer Vision Systems
(ICVS2006), January 5-7, 2006. St. Johns University, Manhattan, New
York City, New York, USA.
[2] Sheikh, Y., and Khan, S. and Shah, M. and Cannata, R.W., Geodetic Alignment
of Aerial Video Frames, VideoRegister03, 2003, Chapter 7
[3] Golden, J.P., Terrain Contour Matching (TERCOM): A cruise missile guidance
aid, Proc. Image Processing Missile Guidance, vol. 238, pp. 10-18, 1980.
[4] Rodriguez, J., and Aggarwal, J., Matching Aerial Images to 3D terrain maps,
IEEE PAMI, 12(12), pp. 1138-1149, 1990.
[5] Sim, D., and Park, R., Localization based on the gradient information for DEM
Matching, Proc. Transactions on Image Processing, 11(1), pp. 52-55, 2002.
[6] Zheng, Q., and Chellappa, R., A computational vision approach to image reg-
istration, IEEE Transactions on Image Processing, 2(3), pp. 311 -326, 1993.
[7] Bergen, J., Anandan, P., Hanna, K., and Hingorani, R., Hierarchical model-based motion estimation, Proc. European Conference on Computer Vision, pp.
237-252, 1992.
[8] Szeliski, R., Image mosaicing for tele-reality applications, IEEE Workshop on
Applications of Computer Vision, pp. 44-53, 1994.
[9] Cannata, R., Shah, M., Blask, S., and Workum, J. V., Autonomous Video Reg-
istration Using Sensor Model Parameter Adjustments, Applied Imagery Pattern
Recognition Workshop, 2000.
[10] Kumar, R., Sawhney, H., Asmuth, J., Pope, A., and Hsu, S., Registration of
video to georeferenced imagery, Fourteenth International Conference on Pattern
Recognition, vol. 2, pp. 1393-1400, 1998.
[11] Wildes, R., Hirvonen, D., Hsu, S., Kumar, R., Lehman, W., Matei, B., and
Zhao, W., Video Registration: Algorithm and quantitative evaluation, Proc.
International Conference on Computer Vision, Vol. 2, pp. 343 -350, 2001.
[12] Horn, B., and Schunck, B., Determining Optical Flow, Artificial Intelligence,
vol. 17, pp. 185-203, 1981.
[13] Lucas, B., and Kanade, T., An Iterative image registration technique with an
application to stereo vision, Proceedings of the 7th International Joint Confer-
ence on Artificial Intelligence, pp. 674-679, 1981.
[14] Brown, L.G., A Survey of Image Registration Techniques, ACM Computing
Surveys, Vol. 24, No. 4, pp. 325-376, December 1992.
[15] Li, H., Manjunath, B.S., and Mitra, S.K., A contour based approach to
multisensor image registration, IEEE Trans. Image Processing, pp. 320-334, March
1995.
[16] Cideciyan, A. V., Registration of high resolution images of the retina, in Proc.
SPIE, Medical Imaging VI: Image Processing, Feb. 1992, vol. 1652, pp. 310-322.
[17] Tainxi, W., A New Mosaicing Method for Landsat Remote Sensing Images,
Kexue Tongbao (Science Bulletin), 32(12): 854-859, 1987.
[18] Herbert, P., and Rouge, B., Digital Image Mosaics, Prentice Hall, Englewood
cliffs, New Jersey, 1979.
[19] Westerkamp, D., and Gahm, T., Non-Distorted Assemblage of the Digital Im-
ages of Adjacent Fields in Histological Sections, Universitat Hannover, Appel-
strasse 9A, 3000 Hannover 1, Germany March 1992.
[20] Canny, J., A computational approach to edge detection, IEEE PAMI, pp. 679-698
(1986).
[21] Harris, J. C., and Stephens, M., A combined corner and edge detector, In
Proc. 4th Alvey Vision Conf, pages 189-192, 1988.
[22] Tomasi, C., and Kanade, T., Detection and Tracking of Point Features, CMU
Technical Report CMU-CS-91-132, April 1991.
[23] Itti, L., Models of Bottom-Up and Top-Down Visual Attention, PhD thesis,
Pasadena, California, 2000.
[24] Itti, L., and Koch, C., Nature Reviews Neuroscience, 2(3), 194-203, 2001
[25] Ouerhani, N., Visual Attention: From Bio-Inspired Modelling to Real-Time
Implementation, PhD thesis, 2003
[26] Backer, G., and Mertsching, B., Two selection stages provide efficient
object-based attentional control for dynamic vision, in International Workshop on
Attention and Performance in Computer Vision, 2004.
[27] Maji, S., and Mukerjee, A., Motion Conspicuity Detection: A Visual Attention
model for Dynamic Scenes, Report on CS497, IIT Kanpur, available at
www.cse.iitk.ac.in/report-repository/ 2005/Y2383 497-report.pdf
[28] Kyung and Lacroix, S., A Robust Interest Point Matching Algorithm, IEEE,
2001
[29] Intel Open Source Computer Vision Library.
http://www.intel.com/technology/computing/opencv/index.htm