EE491_Y2150

7/28/2019 EE491_Y2150


Feature-Based Aerial Image Registration and Mosaicing

A Report Submitted in Partial Fulfillment of the Requirements

for the Degree of

Bachelor of Technology

by

Gaurav Gupta

Y2150

to the

Department of Electrical Engineering

Indian Institute of Technology, Kanpur

April, 2006


Certificate

This is to certify that the work contained in the thesis entitled Feature-Based Aerial Image Registration and Mosaicing, by Gaurav Gupta, has been carried out under my supervision and that this work has not been submitted elsewhere for a degree.

April, 2006

(Dr. Sumana Gupta)
Department of Electrical Engineering,
Indian Institute of Technology, Kanpur.

(Dr. Amitabha Mukerjee)
Department of Computer Science and Engineering,
Indian Institute of Technology, Kanpur.


ACKNOWLEDGEMENT

I am extremely thankful to Dr. Amitabha Mukerjee and Dr. Sumana Gupta, who provided me with the support and guidance that were indispensable for my work. All my doubts were welcome. I would also like to thank Dr. Jharna Majumdar, under whose guidance I gained knowledge and experience during my short stay at ADE, Bangalore, which helped me a lot while working on this project. I would like to thank Dr. A. K. Ghosh for organizing the flight for collecting the data required for this project. Furthermore, I would like to extend my sincere gratitude to Mr. Shobhit Niranjan, M.Tech (Dual) student, Dept. of Electrical Engineering, IIT Kanpur, who sat with me to sort out problems in my code and algorithms. I would also like to thank Mr. Subhranshu Maji, B.Tech student, Dept. of Computer Science and Engineering, who helped me in coding some parts. I also thank my teammates from Aerospace Engineering and Computer Science and Engineering who are with me in the UAV Project Group at IIT Kanpur for their motivation, support and company in critical times.


Contents

1 Introduction
1.1 Geo-Registration
1.2 Geometric Transformations
1.3 Aerial Image Registration
1.4 Image Mosaicing

2 Feature Extraction
2.1 Corner Detection
2.2 Saliency Map and Salient Point based on Itti's Model

3 Registration Technique
3.1 Registration without using Correspondence and Mosaicing

4 Discussion

5 Conclusion and Further Study


Chapter 1

Introduction

The ability to associate scenes and objects visible in aerial video imagery with their corresponding locations in a reference coordinate system is becoming increasingly important in visually guided navigation, surveillance and monitoring systems [2] [11]. The availability of low-cost, lightweight video camera systems, high-bandwidth VHF communication links and a growing inventory of Unmanned Aerial Vehicles (UAVs) and Mini Aerial Vehicles (MAVs) has resulted in dramatic new opportunities for surveillance and sensing applications of such algorithms. The typical mission of such a UAV/MAV will consist of registering the video of a particular surveillance flight with a reference image (alignment of video frames with pre-calibrated reference imagery, that is, DEM and satellite data). Frame-to-reference registration of video is difficult because a typical frame does not cover a large enough area to contain stable features or salient points, and it is also computationally expensive. A possible solution is to create a mosaic from typical MAV mission flight data and register this mosaic with the reference image.

The objective of this work is to study feature-based techniques for aerial registration of typical MAV mission flight data in a semi-urban environment and to create a mosaic using the estimated geometric transformation parameters. The next task is to register this generated mosaic with a high-resolution satellite reference image, which would allow us to locate any given landmark in a MAV mission video in the world coordinate system. The two aspects of this work are:

1. Feature extraction

2. Registration

In this semester we have so far studied several important features that can be used for registration: the Harris corner detector, the KLT corner detector and visually salient points based on Itti's model. Features are extracted successfully. Rotation and scale parameters are determined using a recent registration technique devised by Xiong et al. [1]. The algorithm in [1] differs from conventional feature-based image registration algorithms in that it does not need image matching or correspondence. We compare different feature extraction algorithms, use those features for registration and compare the results. We extend the approach in [1] to find translation as well and use the estimated parameters to generate a mosaic. Code for mosaicing has also been written and some results have been obtained, although the mosaicing algorithm is not yet robust to noise in the registration parameters. The implementation is done on the Windows platform in Visual C++ 6.0 with the OpenCV [29] libraries.

The tasks for the next semester are to find the translation parameters robustly and to create a mosaic from MAV video data. In the future the mosaic will be geo-registered with ortho-rectified satellite image data (reference image data). We are planning to buy satellite image data of the Kanpur area for this purpose from NRSA (National Remote Sensing Agency). Some preprocessing is also needed: the collected aerial data generally contains noise due to weather conditions, motion compensation is required, and the camera parameters have to be estimated for pre-warping the images. These factors affect the registration procedure and its accuracy.

1.1 Geo-Registration

Computer vision techniques can be used to align any given video frame with pre-calibrated reference imagery. This kind of registration is known as geo-registration [2]. The Reference Imagery is a high-resolution orthographic image, usually with a Ground Sampling Distance of 1 m (meaning that a pixel corresponds to 1 m$^2$ on the ground). This Reference Imagery is geodetically aligned and has an associated Digital Elevation Map (DEM), so that each pixel of the Reference Imagery has a precise longitude, latitude and height associated with it. Telemetry is an automatic measurement of data that defines the position of the camera in terms of nine parameters: vehicle latitude, vehicle longitude, vehicle height, vehicle roll, vehicle pitch, vehicle heading, camera elevation, camera scan angle and camera focal length. The Reference Imagery, which covers a substantial area, can be cropped on the basis of the telemetry data to a smaller area corresponding to $I_{video}(x)$, which denotes a video frame or a mosaic created from aerial data. This cropped reference image is referred to as $I_{ref}(x)$.


Two transformation functions exist between the reference and the aerial image. If $x$ is an image point (in pixels), $I_1(x)$ is the aerial image array and $I_2(x)$ is the reference image array, then:

1. $f_{reg}(x)$ is the geometric transformation. Finding this mapping is the problem of image registration,

$$I_1(x) \approx I_2(f_{reg}(x)) \tag{1.1}$$

2. $f_{color}$ is the intensity/colour mapping between the aerial and the reference image,

$$I_1(x) = f_{color}(I_2(f_{reg}(x))) \tag{1.2}$$

Finding $f_{reg}(x)$ is the larger challenge. Shah et al. [2] identify the following difficulties:

1. The two imageries are in different projection views: $I_{video}(x)$ is an image of perspective projection, whereas $I_{ref}(x)$ is an image of orthographic projection.

While the telemetry information can be used with a sensor model to bring both

images into a single projection view,

2. Because of the large duration of time that elapses between the capturing of the

two images, data distortions like severe lighting and atmospheric variations and

object changes in the form of forest growths or new construction cause a high

number of disjoint features (features present in one image but not in the other).

3. Remotely sensed terrain imagery, in particular, has the property of being highly self-correlated both as image data and as elevation data. This includes first-order correlations (locally similar luminance or elevation values in buildings), second-order correlations (edge continuations in roads, forest edges and ridges), as well as higher-order correlations (homogeneous textures in forests and homogeneous elevations in plateaus).

In the past, substantial research has been directed towards determining the geo-

location of objects from an aerial view. Several systems such as Terrain Contour

Matching (TERCOM) [3], SITAN, Inertial Navigation/Guidance Systems (INS/IGS),

Global Positioning Systems (GPS) and most recently Digital Scene-Matching and

Area Correlation (DSMAC) have already been deployed in applications requiring

geo-location. While each of these systems has had some degree of success, several

shortcomings and deficiencies have become increasingly apparent. By understanding the limitations of these systems, we can acquire a better appreciation of the need for effective image-based systems.

Two types of approaches can be distinguished for the geo-registration problem: Elevation-Based Correspondence and Image-Based Correspondence. Elevation-based algorithms attempt to achieve alignment by matching the DEM with an elevation map recovered from video data. Aggarwal et al. [4] perform pixel-wise stereo analysis of successive frames to yield a recovered elevation map (REM) as the initial data rectification step. In [5], Sim and Park propose another geo-registration algorithm that reconstructs a REM from stereo analysis of successive video frames. Normalized cross correlation based point matching is used to recover the elevation values. Elevation-based approaches (based on DEMs) have the general drawback that they rely on the accuracy of the elevation recovered from two frames, a task found to be notoriously difficult.

Intensity-based approaches to geo-registration use intensity properties of both imageries to achieve alignment. Work has been done on developing image-based techniques for the registration of two sets of reference imageries [6], as well as the registration of two successive video images ([7], [8]). In [9], Cannata et al. use the telemetry information to bring a video frame into an orthographic projection view by associating each pixel with an elevation value from the DEM. By ortho-rectifying the aerial video frame, the process of alignment is simplified to a strict 2D registration problem. Correspondence is achieved by taking 32 × 32 pixel patches uniformly over the aerial image and correlating them with a larger search patch in the Reference Image, using normalized cross correlation. Finally, the sensor parameters are updated using a conjugate gradient method, or by a Kalman filter to stress temporal continuity.

An alternative approach is presented by Kumar et al. in [10] where, instead of ortho-rectifying the aerial video frame, a perspective projection of the associated area of the Reference Image is performed. In [10], two further data rectification steps are performed. Video frame-to-frame alignment is used to create a mosaic, providing greater context for alignment than a single image. For data rectification, a Laplacian filter at multiple scales is then applied to both the video mosaic and the reference image. To achieve correspondence, two stages of alignment are used: coarse alignment followed by fine alignment. For coarse alignment, salient (feature) points are defined as the locations where the response in both scale and space is maximum. Normalized correlation is used as a match measure between salient points and the associated reference patch. One feature point is picked as a reference, and the correlation surfaces for each feature point are then translated to be centered at the reference feature point. In the subsequent work [11], the filter is modified to use the Laplacian of Gaussian filter as well as the Hilbert transform in four directions, to yield four oriented energy images for each aerial video frame and for each perspectively projected reference image. Instead of considering video mosaics for alignment, the authors use a mosaic of 3 key-frames from the data stream, each with at least 50 percent overlap.

The major limitation of the intensity-based approaches lies in the assumptions that they make. The research literature on image-based correspondence is quite vast; [12] is a general survey of some of these registration techniques. Alignment by maximization of mutual information [13] is another frequently used registration approach, and while it provides high levels of robustness it also allows many false positives when matching over a search area of the nature encountered in geo-registration. In addition to working without GPS, it is also possible to consider situations where telemetry data is unavailable or corrupted. Registration is still possible in this case, but because the visual search has no initial point, the computational time increases significantly.

1.2 Geometric Transformations

A geometric transformation is a mapping that relocates image points. Transformations can be global or local in nature. Global transformations are usually defined by a single set of parameters, which is applied to the whole image. Some of the most common global transformations are affine, perspective and polynomial transformations. The affine transformations include translation, rotation, scaling and shear. Translation and rotation are usually caused by a different orientation of the sensor, while scaling is the effect of a change in the altitude of the sensor. Sensor distortion or the viewing angle may cause stretching and shearing. Rigid transformations account for object or sensor movement in which objects in the images maintain their relative shape and size [14]. A rigid-body transformation is composed of a combination of a rotation $\theta$, a translation $t_x$ in the x direction, a translation $t_y$ in the y direction, and a scale $s$. It can be written as

$$\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} \tag{1.3}$$

where $(x_2, y_2)$ is the new transformed coordinate of $(x_1, y_1)$, $t_x$ and $t_y$ are the x-axis and y-axis translations, and $s$ is a scale factor.

Figure 1.1: (a) The Reference Image, cropped on the basis of the telemetry data, which specifies the area of the Reference Imagery that the camera is capturing. (b) The aerial video frame before and (c) after geo-registration with the cropped Reference Image. Note that the Reference Image is an orthographic image while the aerial video frame is a perspective image. Images are taken from [2] for illustration.

The general 2D affine transformation can be expressed as

$$\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} \tag{1.4}$$

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$

where $(x_2, y_2)$ is the new transformed coordinate of $(x_1, y_1)$. The matrix $A$ can be a combination of rotation, scale or shear. The rotation matrix is the same as in (1.3). The scale for the x and y axes can be expressed as

$$\mathrm{Scale} = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix} \tag{1.5}$$
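The transformations of Eqs. (1.3) and (1.4) can be illustrated with a minimal Python sketch. The function names and the tuple-based point and matrix representation are assumptions made here for illustration only; they are not taken from the implementation described in this report.

```python
import math


def similarity_transform(point, theta, s, tx, ty):
    """Apply Eq. (1.3): rotate by theta (radians), scale by s, then translate."""
    x1, y1 = point
    x2 = tx + s * (math.cos(theta) * x1 - math.sin(theta) * y1)
    y2 = ty + s * (math.sin(theta) * x1 + math.cos(theta) * y1)
    return (x2, y2)


def affine_transform(point, A, t):
    """Apply Eq. (1.4): A is a 2x2 matrix given as nested tuples, t = (tx, ty)."""
    x1, y1 = point
    (a11, a12), (a21, a22) = A
    return (t[0] + a11 * x1 + a12 * y1, t[1] + a21 * x1 + a22 * y1)
```

Setting $A$ to $s$ times the rotation matrix makes `affine_transform` reproduce `similarity_transform`, which is the sense in which (1.3) is a special case of (1.4).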

However, local distortions may be present in the scene due to motion parallax, movement of objects, etc. The parameters of a local mapping transformation vary across the different regions of the image to handle local deformations. These parameters can be determined by subdividing the image into small parts.

1.3 Aerial Image Registration

Image registration is the process of determining the geometric transformation defined above between a newly sensed image, called the input image, and a reference image of the same scene that may have been taken at a different time, from a different sensor, or from a different viewpoint. Current automated registration techniques can be classified into two broad categories: area-based and feature-based techniques. In area-based algorithms, a small window of points in the sensed image is compared statistically with windows of the same size in the reference image [15]. Window correspondence is based on a similarity measure between two given windows, usually the normalized cross correlation. Area-based techniques can be implemented in the Fourier domain using the fast Fourier transform (FFT) [16]. A majority of the area-based methods can only register images with small misalignment, and therefore the images must be roughly aligned with each other initially. The correlation measures become unreliable when the images have multiple modalities and the gray-level characteristics vary (e.g., TM and synthetic aperture radar (SAR) data).
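The window similarity measure mentioned above can be sketched as follows. This is a minimal illustration of zero-mean normalized cross correlation in plain Python; the function name and the list-of-rows window format are assumptions, and a practical implementation would use an optimized library routine.

```python
import math


def normalized_cross_correlation(window_a, window_b):
    """Return the zero-mean NCC of two equal-sized windows (lists of rows)."""
    a = [v for row in window_a for v in row]
    b = [v for row in window_b for v in row]
    if len(a) != len(b) or not a:
        raise ValueError("windows must be non-empty and of equal size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mean_a) ** 2 for p in a)
                    * sum((q - mean_b) ** 2 for q in b))
    # A flat window has no structure to correlate; report 0 in that case.
    return num / den if den > 0 else 0.0
```

Because the means are subtracted and the result is normalized, the score is unchanged by a linear gain and offset in gray level, but it is not robust to the non-linear intensity differences between modalities noted above.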

In contrast, feature-based methods are more robust and more suitable in these cases. Two critical procedures are generally involved in feature-based techniques: feature extraction and feature correspondence. The basic building block of a feature-based image registration scheme is the matching of feature points extracted from a sensed image to their counterparts in a reference image. Features may be control points, corners, junctions or interest points. These features are also known as visually salient points. Feature matching overcomes the limitations of area-based signal correlation by attempting to match only information-rich points.

1.3.1 Image Integration

Image integration deals with finding $f_{color}$ between two aerial images or two different imaging systems. Various techniques have been developed for modifying the image grey levels in the vicinity of a boundary to obtain a smooth transition between images, removing the seams and creating a blended image. These mainly consist in choosing a frontier that induces a minimum of discontinuity [18]. The use of a polynomial curve was proposed in [17]: in each line of the common area one point is retained, and the curve is defined by a minimum squared error procedure. These methods are appropriate if the common area is nearly identical in both images (plane support). Heitz presented a simplification using a parametric plane function $s = ax + by + c$ determined by a mean square error procedure. A more substantial transformation must be applied when the common regions include large differences; Westerkamp et al. [19] described a polynomial function to assemble distorted microscopic images.

1.4 Image Mosaicing

An image mosaic is a synthetic composition generated from a sequence of images, and it can be obtained by understanding the geometric relationships between the images. The geometric relations are coordinate transformations that relate the different image coordinate systems. By applying the appropriate transformations via a warping operation and merging the overlapping regions of the warped images, it is possible to construct a single image, indistinguishable from a single large image of the same object, covering the entire visible area of the scene. This merged single image is called the mosaic. The basic scheme for mosaicing comprises two main steps, which are outlined below.

1. Image registration using Geometric Transformations derived from image data

and/or camera models.

2. Image integration or blending.


We have adopted a feature-based approach to solve the registration problem. Unlike conventional feature-based image registration algorithms, our approach is based on the work of Xiong et al. [1], which does not need image matching or correspondence. In [1] only Harris corners are considered as features. We compare different feature extraction algorithms, use those features for registration and compare the results. We extend the approach in [1] to find translation as well and use the estimated parameters to generate a mosaic.


Chapter 2

Feature Extraction

A feature is the result of an interpretation of $n$ pixels, usually in a compact support, in a window of $p \times p$. An important step in almost all machine as well as biological vision systems is to process the input image(s) to extract features or primal sketches. In general, the feature detection process involves computing the response $R$ of one or multiple detectors (filters/operators) to the input image(s), followed by the analysis of $R$ to isolate points (or regions) that satisfy certain constraints. In fact, the best definition of a feature is the operator itself. Several kinds of features are used for matching. They may be divided into four groups as follows:

• Visual features (edges, textures, junctions and corners)

• Transform coefficient features (Fourier descriptors, Hadamard coefficients)

• Algebraic features (based on matrix decomposition of an image)

• Statistical features (moment invariants)

2.1 Corner Detection

Corners are defined as the junction points of two straight-line edges. Most existing edge detectors perform poorly at corners because they assume an edge to be an entity with infinite extent, an assumption that is violated at corners. Since most gray-level based corner detectors are built on existing edge detectors, their performance is not satisfactory. For example, the Canny edge detector [20] is found to be incapable of accurately locating edges near a corner due to the well-known rounding effect. The Harris corner detector [21] and the KLT feature detector [22] are the most widely used corner detectors, so we compare them for our application.


2.1.1 Harris Corner Detector

The Harris corner detector [21] computes a matrix that is related to the auto-correlation function of the image intensity. This matrix averages the first derivatives of the signal over a window:

$$C = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} \tag{2.1}$$

where $I_x$ and $I_y$ are the gradients (derivatives) in the x and y directions. The eigenvalues of this matrix are the principal curvatures of the auto-correlation function. If both curvatures are high, an interest point is present. The algorithm for Harris corner detection is as follows:

Algorithm for Harris Corner Detector

1. Compute the matrix $C$ for each pixel of the input image.

2. The standard Harris corner detection algorithm proposes two different criteria for corner point selection. The first is to compare the value of $\det(C) - k\,\mathrm{trace}(C)^2$ with a threshold, and the second is to compare the value of $R = \det(C)/\mathrm{trace}(C)$ with a threshold, where $C$ is the covariance matrix of the gradient computed above. We have used the second criterion while implementing the Harris algorithm because the first depends strongly on the chosen value of the constant $k$.
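The two steps above can be sketched in plain Python for a single pixel. The function names, the square-window representation and the default value `k=0.04` are assumptions introduced here for illustration (0.04 is a commonly quoted choice, not a value stated in this report); the gradient patches are assumed to be computed beforehand.

```python
import math


def structure_matrix(ix_patch, iy_patch, sigma=1.0):
    """Gaussian-weighted sums of gradient products over a square window.

    ix_patch and iy_patch are lists of rows holding I_x and I_y, with the
    pixel of interest at the window centre. Returns (c11, c12, c22), the
    entries of the symmetric matrix C of Eq. (2.1) at that pixel.
    """
    n = len(ix_patch)
    half = n // 2
    c11 = c12 = c22 = 0.0
    for r in range(n):
        for c in range(n):
            dx, dy = c - half, r - half
            w = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            gx, gy = ix_patch[r][c], iy_patch[r][c]
            c11 += w * gx * gx
            c12 += w * gx * gy
            c22 += w * gy * gy
    return c11, c12, c22


def harris_measures(c11, c12, c22, k=0.04):
    """Return both corner measures: det - k*trace^2 and det/trace."""
    det = c11 * c22 - c12 * c12
    trace = c11 + c22
    r_k = det - k * trace * trace
    r_ratio = det / trace if trace > 0 else 0.0
    return r_k, r_ratio
```

The ratio form has no tunable constant, which matches the reason given above for preferring the second criterion.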

Feature Reduction

While selecting corner points using the Harris algorithm we have applied a two-level corner strength comparison. Suppose $\lambda_1$ and $\lambda_2$ are the two eigenvalues of the covariance matrix $C$; then the criteria for feature reduction are as follows:

• First we compare the value of $\mathrm{norm} = \lambda_1^2 + \lambda_2^2$ with a threshold; if it is greater, the point is a first-level corner point.

• Then we divide the image into 25 × 25 grids and in each grid we select at most one corner point, the one with the highest value of the norm of eigenvalues defined above.

Fig. 2.1(a) shows the corners detected using the Harris corner detection algorithm [21] and Fig. 2.1(b) shows the corners after applying the feature reduction algorithm.
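The two-level feature reduction can be sketched as follows. This is a minimal illustration, assuming that each grid is a cell of `cell` × `cell` pixels (one possible reading of the 25 × 25 grids above) and that candidates arrive as `(x, y, lam1, lam2)` tuples; both the names and the data layout are hypothetical.

```python
import math


def eigenvalues_2x2(c11, c12, c22):
    """Eigenvalues (larger first) of the symmetric matrix [[c11, c12], [c12, c22]]."""
    mean = 0.5 * (c11 + c22)
    root = math.sqrt((0.5 * (c11 - c22)) ** 2 + c12 * c12)
    return mean + root, mean - root


def reduce_features(candidates, threshold, cell=25):
    """Keep at most one corner per grid cell.

    candidates: iterable of (x, y, lam1, lam2) with integer pixel coordinates.
    A point survives level one if lam1^2 + lam2^2 exceeds the threshold; within
    each cell only the point with the largest norm is kept.
    """
    best = {}
    for x, y, lam1, lam2 in candidates:
        norm = lam1 ** 2 + lam2 ** 2
        if norm <= threshold:
            continue
        key = (x // cell, y // cell)
        if key not in best or norm > best[key][0]:
            best[key] = (norm, (x, y))
    return sorted(point for _, point in best.values())
```

Limiting each cell to one corner spreads the retained features over the image, which helps when they are later used to place image patches.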


Figure 2.1: (a) Corners obtained using the Harris corner detector algorithm, (b) detected features after the feature reduction algorithm.

2.1.2 KLT Corner Detector

KLT features [22] are geometrically stable under different transformations. Hence features detected by KLT have a high repeatability factor and high information content. The detector is also based on the auto-correlation function of the image intensity.

KLT Corner Detector Algorithm

1. Compute the matrix $C$ for each pixel of the input image and let $\lambda_1$ and $\lambda_2$ denote its eigenvalues.

2. The KLT detector performs a first-level corner detection based on the value of the smaller eigenvalue. It is computed in a window about the point under consideration and compared with a threshold: if it is greater than the threshold, the point is a first-level corner point. The array of all first-level corner points is then sorted in decreasing order of the smaller eigenvalue.

3. Moving from the top of the sorted array to the bottom, we delete every point that lies below the point under consideration in the array and falls within its 8-neighbourhood.

In Fig. 2.2(a) the corners obtained using the KLT algorithm are shown, and Fig. 2.2(b) shows the reduced KLT corners.
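Steps 2 and 3 of the KLT selection can be sketched as below. The function name and the `(x, y, min_eigenvalue)` input format are assumptions for illustration; the smaller eigenvalue of $C$ is assumed to have been computed for each candidate pixel already.

```python
def klt_select(points, threshold):
    """Select KLT corners from (x, y, min_eigenvalue) candidates.

    Points whose smaller eigenvalue exceeds the threshold are sorted in
    decreasing order of that eigenvalue; a point is then dropped if a
    stronger, already kept point lies in its 8-neighbourhood.
    """
    strong = [p for p in points if p[2] > threshold]
    strong.sort(key=lambda p: p[2], reverse=True)
    kept = []
    for x, y, _ in strong:
        # Chebyshev distance <= 1 means the same pixel or one of its 8 neighbours.
        if all(max(abs(x - kx), abs(y - ky)) > 1 for kx, ky in kept):
            kept.append((x, y))
    return kept
```

Checking each point against the already kept points gives the same result as deleting the weaker neighbours of each point while walking down the sorted array.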


Figure 2.2: (a) Corners obtained using the KLT corner detector algorithm, (b) detected features after feature reduction.

2.2 Saliency Map and Salient Point based on Itti's Model

Visual attention is a biological mechanism used by primates to compensate for the inability of their brains to process the huge amount of visual information gathered by the two eyes. Early works on attention modeling were mostly inspired by the biological model of the brain. The Caltech hypothesis [23], elaborated by Itti and Koch [24], represents one of the first concrete descriptions of how the visual attention model works. According to the hypothesis, the elementary features are gathered into a unique map of attention, the saliency map, which resides either in the LGN (lateral geniculate nucleus) or in V1 (primary visual cortex). Finally, the Winner Take All (WTA) network [27], which is responsible for detecting the most salient scene location, is located around the thalamic reticular nucleus.

One of the first and most popular computational models of saliency is based on the Caltech hypothesis. It rests on four main principles: visual attention is based on multi-featured inputs; the saliency of a region is affected by the surrounding context; the saliency of locations is represented by a saliency map; and Winner Take All and Inhibition of Return are suitable mechanisms to allow attention shifts.


Figure 2.3: Schematic of Itti's model [23]

2.2.1 Feature Maps for Static Images

First, a number of features $(1, \ldots, j, \ldots, n)$ are extracted from the scene by computing the so-called feature maps $F_j$. Such a map represents the image of the scene based on a well-defined feature, which leads to a multi-featured representation of the scene. In his implementation, Itti considered seven different features, which are computed from an RGB color image and which belong to three main cues, namely intensity, color and orientation.

• Intensity feature:

$$F_1 = I = 0.3\,R + 0.59\,G + 0.11\,B \tag{2.2}$$

• Two chromatic features based on the two color opponency filters $R^+G^-$ and $B^+Y^-$, where the yellow signal is defined as $Y = \frac{R+G}{2}$. Such chromatic opponency exists in the human visual cortex.

$$F_2 = \frac{R - G}{I} \tag{2.3}$$

$$F_3 = \frac{B - Y}{I} \tag{2.4}$$

The normalization of the features by $I$ decouples hue from intensity.

• Four local orientation features $F_{4 \ldots 7}$ according to the angles $\{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$. Gabor filters, which represent a suitable mathematical model of the receptive field impulse response of orientation-selective neurons in the primary visual cortex [25], are used to compute the orientation features. In this implementation of the model it is possible to use an arbitrary number of orientations. However, it has been noticed that using more than four orientations does not improve the performance of the model drastically.
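The intensity and chromatic features of Eqs. (2.2) to (2.4) can be sketched per pixel as follows. The function name is hypothetical, and returning zeros for a black pixel (where $I = 0$ and the normalization is undefined) is an assumption made here, not something specified by the model description above. The orientation features are omitted because they require Gabor filtering of a whole image.

```python
def color_features(r, g, b):
    """Return (F1, F2, F3) for one RGB pixel, following Eqs. (2.2)-(2.4)."""
    intensity = 0.3 * r + 0.59 * g + 0.11 * b
    if intensity == 0:
        # Hue is undefined for a black pixel; zeros are an assumed convention.
        return 0.0, 0.0, 0.0
    yellow = (r + g) / 2.0
    f2 = (r - g) / intensity       # red/green opponency
    f3 = (b - yellow) / intensity  # blue/yellow opponency
    return intensity, f2, f3
```

A gray pixel (equal R, G and B) gives zero for both chromatic features, which illustrates how dividing by $I$ separates hue from intensity.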

2.2.2 Center-Surround Receptive Field Profiles

In a second step, each feature map is transformed into its conspicuity map, which highlights the parts of the scene that strongly differ, according to a specific feature, from their surroundings. In biologically plausible models this is usually achieved by using a center-surround mechanism. Practically, this mechanism can be implemented with a difference-of-Gaussians (DoG) filter, which can be applied to the feature maps to extract local activities for each feature type. A visual attention task has to detect conspicuous regions regardless of their sizes, so a multiscale conspicuity operator is required. Applying variable-size center-surround filters to fixed-size images has a high computational cost. The method used here is therefore based on a multiresolution representation of images. For a feature $j$, a Gaussian pyramid $I_j$ is created by progressively lowpass filtering and sub-sampling the feature map $F_j$ by a factor of 2, using a Gaussian filter $G$:

$$I_j(0) = F_j \tag{2.5}$$

$$I_j(i) = \downarrow_2 \left( I_j(i-1) * G \right) \tag{2.6}$$

where $*$ refers to the spatial convolution operator and $\downarrow_2$ refers to the downsampling operation. Center-surround is then implemented as the difference between fine (c for center) and coarse (s for surround) scales. For a feature $j$ $(1, \ldots, j, \ldots, n)$, a set of intermediate multiscale conspicuity maps $M_{j,k}$ $(1, \ldots, k, \ldots, K)$ is computed according to the equation below, giving rise to $n \times K$ maps for the $n$ considered features.

$$M_{j,k} = \left| I_j(c_k) \ominus I_j(s_k) \right| \tag{2.7}$$

where $\ominus$ is a cross-scale difference operator that first interpolates the coarser scale to the finer one and then carries out a point-by-point subtraction. The absolute value of the difference between the center and the surround allows the simultaneous computation of both sensitivities, dark center on bright surround and bright center on dark surround (red/green and green/red, or blue/yellow and yellow/blue, for color).
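Eqs. (2.5) to (2.7) can be sketched in plain Python on images stored as lists of rows. Several details are assumptions made for this illustration and are not specified above: a separable [1, 2, 1]/4 kernel stands in for the Gaussian filter $G$, borders are handled by edge replication, and the cross-scale interpolation is nearest neighbour.

```python
def smooth(img):
    """Separable [1, 2, 1] / 4 low-pass filter with edge replication."""
    h, w = len(img), len(img[0])
    horiz = [[(row[max(c - 1, 0)] + 2 * row[c] + row[min(c + 1, w - 1)]) / 4.0
              for c in range(w)] for row in img]
    return [[(horiz[max(r - 1, 0)][c] + 2 * horiz[r][c]
              + horiz[min(r + 1, h - 1)][c]) / 4.0
             for c in range(w)] for r in range(h)]


def downsample(img):
    """Keep every second row and column (sub-sampling by a factor of 2)."""
    return [row[::2] for row in img[::2]]


def gaussian_pyramid(feature_map, levels):
    """Eqs. (2.5)-(2.6): level 0 is the feature map, each next level is
    the smoothed and sub-sampled previous level."""
    pyramid = [feature_map]
    for _ in range(1, levels):
        pyramid.append(downsample(smooth(pyramid[-1])))
    return pyramid


def center_surround(pyramid, c, s):
    """Eq. (2.7): bring the coarse level s to the size of the fine level c
    (nearest neighbour) and take the absolute point-by-point difference."""
    fine, coarse = pyramid[c], pyramid[s]
    factor = 2 ** (s - c)
    h_c, w_c = len(coarse), len(coarse[0])
    return [[abs(fine[r][col]
                 - coarse[min(r // factor, h_c - 1)][min(col // factor, w_c - 1)])
             for col in range(len(fine[0]))] for r in range(len(fine))]
```

For a uniform feature map every conspicuity value is zero, while an isolated bright pixel produces its strongest response at its own location, which is the behaviour the center-surround mechanism is meant to capture.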

2.2.3 Saliency Map

The purpose of the saliency map is to represent the conspicuity, or saliency, at every location in the visual field by a scalar quantity, and to guide the selection of attended locations based on the spatial distribution of saliency. At each spatial location, all the feature maps consequently need to be combined into a unique scalar measure of salience. In the implementation, all the feature maps are normalized to the same dynamic range (e.g., between 0 and 255) and then summed into the saliency map. The normalization operation is denoted $N(\cdot)$.
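The normalization and summation described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that all maps have already been brought to the same size; it implements only the simple range normalization mentioned in the text.

```python
def normalize_map(feature_map, lo=0.0, hi=255.0):
    """Rescale a map linearly so that its values span [lo, hi]."""
    values = [v for row in feature_map for v in row]
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant map carries no saliency information.
        return [[lo for _ in row] for row in feature_map]
    scale = (hi - lo) / (vmax - vmin)
    return [[lo + (v - vmin) * scale for v in row] for row in feature_map]


def saliency_map(feature_maps):
    """Sum the normalized maps (all assumed to have identical dimensions)."""
    normalized = [normalize_map(m) for m in feature_maps]
    rows, cols = len(normalized[0]), len(normalized[0][0])
    return [[sum(m[r][c] for m in normalized) for c in range(cols)]
            for r in range(rows)]
```

Normalizing first prevents a feature with a numerically large range from dominating the sum.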

2.2.4 Selection of the point of Attention

Once the saliency map has been computed, Winner Take All (WTA) and Inhibition of Return are suitable mechanisms to imitate eye movements and the focus of attention [27]. The WTA selects the point with maximum salience at each iteration. The attention point is then moved by inhibiting the saliency of the object currently being attended [26] [27]. At each iteration the saliency of the attended object is decayed, so that eventually the objects not being attended gain relative saliency and take the focus of attention. Another approach is to divide the saliency map into a sufficient number of grid cells and, in each cell, take the local maximum of the saliency map intensity if it exceeds a certain threshold.
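The first selection scheme can be sketched as a loop that repeatedly picks the global maximum and then suppresses its surroundings. In this illustration the inhibition is simplified to setting a square neighbourhood of the winner to zero rather than decaying it gradually; the function name and the `radius` parameter are assumptions.

```python
def attend(saliency, n_points, radius=1):
    """Return up to n_points (row, col) fixations in order of attention.

    Each iteration is a winner-take-all step (global maximum) followed by
    inhibition of return (zeroing a square of the given radius around it).
    """
    work = [row[:] for row in saliency]  # copy so the input is not modified
    h, w = len(work), len(work[0])
    fixations = []
    for _ in range(n_points):
        value, r, c = max((work[i][j], i, j) for i in range(h) for j in range(w))
        if value <= 0:
            break  # nothing salient is left
        fixations.append((r, c))
        for rr in range(max(r - radius, 0), min(r + radius + 1, h)):
            for cc in range(max(c - radius, 0), min(c + radius + 1, w)):
                work[rr][cc] = 0.0
    return fixations
```

The inhibition radius plays a role similar to the grid size in the second scheme: it controls how close two selected salient points may be.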

Fig. 2.4 shows the saliency map generated by combining all the feature maps based on Itti's model for finding visually salient points. It clearly shows that the salient locations have larger intensity in the image. Fig. 2.5(a) and Fig. 2.5(b) show the visually salient points detected on a pair of images.


Figure 2.4: Saliency map obtained from the normalized summation of all feature maps.

Figure 2.5: (a) Visually salient points obtained in the first image, (b) salient points obtained in the second image.


Chapter 3

Registration Technique

3.1 Registration without using Correspondence and Mosaicing

This approach is based on Xiong et al. [1], who study the problem of aerial
image registration without any correspondence. Features are
detected on a pair of images (observed and reference) using either a corner detector or
the saliency map approach based on Itti's model. Image patches are created using these
features as positions. Circular patches are used to deal with rotation,
and the patch size is varied to handle scaling.
The orientations of the image patches are computed with an eigenvector approach. From the
orientation differences of patches between the reference and observed images, an angle
histogram is created by a voting procedure. The orientation difference corresponding
to the maximum peak of the histogram is the rotation angle between the reference and
observed images. Different sizes of image patches are used to create different angle
histograms. The scaling value between the two images is determined from the
angle histogram which has the highest maximum peak. The approach is described in
the following subsections.

3.1.1 The Orientation of an Image Patch

For a given patch p(i, j) (i = 1, 2, ..., m), the covariance matrix is defined as

$$COV_p = E\left[(X - m_x)(X - m_x)^T\right] \tag{3.1}$$


where $X = [i \;\; j]^T$ is the position of the pixel and $m_x = [m_{x_i} \;\; m_{x_j}]^T$ is the centroid of the
image patch p(i, j), i.e. the first order moment. The eigenvalues can be found by solving

$$|COV_p - \lambda I| = 0 \tag{3.2}$$

Equation 3.2 gives two eigenvalues. Suppose $\lambda_1$ is the larger eigenvalue and
$\lambda_2$ is the smaller one. The normalized eigenvectors $V_1$ and $V_2$ that correspond
to the eigenvalues $\lambda_1$ and $\lambda_2$ are orthogonal, because the covariance matrix
is symmetric. The direction of eigenvector $V_1$ is defined as the orientation of image
patch p(i, j). By applying this approach, we
can compute orientations for all image patches on the reference and observed images.
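The orientation computation of Equations 3.1 and 3.2 can be sketched as below. This is an illustration under stated assumptions rather than the implementation used here: the function name `patch_orientation` is hypothetical, the expectation in Equation 3.1 is taken with pixel intensities as weights (an assumption, since unweighted positions of a circular patch would have no preferred direction), and the angle is measured from the column axis towards the row axis.

```python
import numpy as np

def patch_orientation(patch):
    """Orientation in degrees, in [0, 180), of the principal axis of a patch."""
    p = np.asarray(patch, dtype=float)
    i, j = np.indices(p.shape)                 # pixel positions X = [i, j]^T
    X = np.stack([i.ravel(), j.ravel()])       # shape (2, number of pixels)
    w = p.ravel() / p.sum()                    # ASSUMPTION: intensity weights
    m_x = X @ w                                # centroid (first order moment)
    d = X - m_x[:, None]
    cov = (d * w) @ d.T                        # COV_p = E[(X - m_x)(X - m_x)^T]
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    # An eigenvector has no sign, so the angle is only defined modulo 180 degrees.
    return float(np.degrees(np.arctan2(v1[0], v1[1])) % 180.0)
```

The sketch assumes a patch with non-zero total intensity. To use circular patches, pixels outside the circle can be set to zero before calling the function.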

3.1.2 Angle Histogram for Image Rotation and Scaling

For the observed image, we can create a patch set $P_t = \{p_t^j,\ j = 1, 2, \ldots, n_t\}$ and obtain an
orientation set $\Phi_t = \{\phi_t^j,\ j = 1, 2, \ldots, n_t\}$. Similarly, for the reference image, we can create
a patch set $P_f = \{p_f^i,\ i = 1, 2, \ldots, n_f\}$ and an orientation set $\Phi_f = \{\phi_f^i,\ i = 1, 2, \ldots, n_f\}$.
Suppose that the rotation angle between the observed and reference images is $\theta$ and
both images cover the same scene. For an image patch $p_t^j$ on the observed image, we compute
orientation differences with all patches $P_f = \{p_f^i,\ i = 1, 2, \ldots, n_f\}$ on the reference image:

$$\Delta\phi_l = |\phi_t^j - \phi_f^i|, \quad i = 1, 2, \ldots, n_f \tag{3.3}$$

If we do the same computation for all patches $p_t^j$, $j = 1, 2, \ldots, n_t$, on the observed image,
we obtain a set of orientation differences $\Delta\Phi = \{\Delta\phi_l,\ l = 1, 2, \ldots, n_t n_f\}$ and find
$n_t$ corresponding patches on the reference image. For these $n_t$ pairs of corresponding
patches, the value of the orientation difference will be the rotation angle:

$$\Delta\phi_l = |\phi_t^j - \phi_f^i|, \quad i = 1, 2, \ldots, n_f \tag{3.4}$$

If we create a histogram of the orientation differences, the count for the value of the
orientation difference between corresponding patches will be the highest.
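The voting procedure can be sketched as follows: every observed patch votes with its orientation difference to every reference patch, and the most populated histogram bin gives the rotation angle. The function name `rotation_from_orientations` and the 1-degree `bin_width` are hypothetical choices for illustration, and orientations are assumed to be given in degrees.

```python
import numpy as np

def rotation_from_orientations(phi_t, phi_f, bin_width=1.0):
    """Vote all |phi_t[j] - phi_f[i]| into a histogram; return (angle, votes)."""
    phi_t = np.asarray(phi_t, dtype=float)
    phi_f = np.asarray(phi_f, dtype=float)
    # All n_t * n_f orientation differences between observed and reference patches.
    diffs = np.abs(phi_t[:, None] - phi_f[None, :]).ravel()
    edges = np.arange(0.0, 180.0 + bin_width, bin_width)
    hist, edges = np.histogram(diffs, bins=edges)
    k = int(np.argmax(hist))                    # maximum peak of the histogram
    angle = 0.5 * (edges[k] + edges[k + 1])     # centre of the winning bin
    return angle, int(hist[k])
```

For example, six reference orientations `[3, 11, 26, 47, 62, 85]` and the same values shifted by 20.5 degrees produce six votes in the bin from 20 to 21 degrees, while no other bin receives more than two.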

Figure 3.1: Typical angle histogram. On the X-axis each column represents a bin of angle difference, and the Y-axis shows the number of occurrences in that bin. The peak for the angle is very dominant, assuming the same scale.

To find the scale, we obtain the value of the scaling through a series of
voting processes. By changing the size of the image patches and computing the
angle histograms, we obtain a series of angle histograms and choose the one which
has the highest maximum peak. Let $A_t$ and $A_f$ denote the patch sizes on the observed
and reference images corresponding to the histogram $H_h$ which has the highest
maximum peak. The value of the scaling between the observed and reference images can be
computed by

$$s = \frac{A_t}{A_f} \tag{3.5}$$

At the same time, the orientation difference corresponding to the peak of histogram $H_h$
is the rotation angle between the observed and reference images.
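Selecting the scale from the series of histograms can be sketched as below. The function name `select_scale` and the dictionary layout, which maps each patch-size pair to the peak of its angle histogram, are hypothetical and serve only to illustrate Equation 3.5.

```python
def select_scale(peaks):
    """Pick the patch-size pair whose angle histogram has the highest peak.

    peaks maps (A_t, A_f) -> (angle, votes), where angle and votes describe
    the maximum peak of the histogram built with those patch sizes.
    Returns (s, angle) with s = A_t / A_f as in Equation 3.5.
    """
    (a_t, a_f), (angle, _votes) = max(peaks.items(), key=lambda kv: kv[1][1])
    return a_t / a_f, angle
```

For instance, `select_scale({(15, 15): (5.5, 12), (15, 20): (4.5, 31)})` selects the second pair, because its peak has more votes, and returns a scale of 0.75 with a rotation of 4.5.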

3.1.3 Finding Translation

Algorithm

1. Select one feature point in the reference image (done manually).

2. Find similar points in the target image according to the similarity measure
defined in [28], with threshold = 0.85.

3. Take a window about the interest point in the reference image and calculate the
normalized cross correlation with the window about each similar point; the similar
point in the target image with the highest correlation is taken as the corresponding
point/patch. Finding the translation is then trivial using a simple transformation.
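Step 3 and the final translation can be sketched as follows. All names (`ncc`, `window`, `best_match`, `translation`) and the window half-size are hypothetical. The sketch assumes that the candidate points from step 2 are already available, that every window lies fully inside its image, that the rotation is small enough for windows to be compared directly, and that the transformation has the form x_ref = s R x_tgt + t in (row, column) coordinates; none of these conventions is stated in the text.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized windows."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def window(img, center, half):
    """Square window of side 2*half+1 around center = (row, col)."""
    r, c = center
    return img[r - half:r + half + 1, c - half:c + half + 1]

def best_match(ref_img, ref_pt, tgt_img, candidates, half=3):
    """Candidate in the target image with the highest NCC to the reference window."""
    ref_win = window(ref_img, ref_pt, half)
    scores = [ncc(ref_win, window(tgt_img, c, half)) for c in candidates]
    k = int(np.argmax(scores))
    return candidates[k], scores[k]

def translation(ref_pt, tgt_pt, angle_deg=0.0, scale=1.0):
    """t such that ref_pt = scale * R(angle) @ tgt_pt + t (ASSUMED convention)."""
    th = np.deg2rad(angle_deg)
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    return np.asarray(ref_pt, dtype=float) - scale * R @ np.asarray(tgt_pt, dtype=float)
```

With the rotation and scale already estimated from the angle histograms, a single reliable correspondence is enough to fix the two translation components.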


Figure 3.2: (a) a test image, (b) the test image rotated by 11°; the result shows 11.46° rotation.

3.1.4 Results of Registration

After extracting the features, we use them to find the rotation, scale and
translation parameters and register the images according to the algorithm described
above. Table 3.1 shows the results obtained for Harris features, Table 3.2 for
KLT features and Table 3.3 for salient features using Itti's model.

For validation, we applied the algorithm to a pair of test images such that
fig. 3.2(a) is the image without any rotation and fig. 3.2(b) is the same image rotated
11° anti-clockwise. The results obtained show 11.46° rotation and 1.0 scale, i.e. a
rotation error of 0.46° and the correct scale. We did similar experiments with other
images as well. To validate the final correspondence after finding the translation, we
highlighted the corresponding feature found in the destination image with a black blob,
as shown in fig. 3.3(a) and 3.3(b).

Fig. 3.4(a) and fig. 3.4(b) show the pair of images (source and destination
images for registration) for which the above registration parameters are found. Fig. 1.1(c)
shows the mosaic obtained by combining the source and destination images using the
registration parameters.


Figure 3.3: (a) source image with a feature selected manually, (b) destination image with the estimated feature corresponding to the feature of the source image

Table 3.1: Results - Using Harris Corner Features

Angle of Rotation    3.6492°
Scale                0.973329
Translation Tx       57.547512 pixels
Translation Ty       1.562878 pixels

Table 3.2: Results - Using KLT Corner Features

Angle of Rotation    5.1567°
Scale                1.000000
Translation Tx       48.427681 pixels
Translation Ty       8.734428 pixels

Table 3.3: Results - Using Visually Salient Points

Angle of Rotation    1.43312°
Scale                1.000000
Translation Tx       50.427681 pixels
Translation Ty       1.734428 pixels


Figure 3.4: (a) Image One, (b) Second Image.

Figure 3.5: Merged Image Using Corner Features


Chapter 4

Discussion

• KLT features are more stable than Harris features. Salient points are also found
to be stable.

• A degree of blurring in the images affects the detected features, because the presence
of noise affects the texture (orientation). The blurring is clearly due to shaking
of the mount, which should be minimized.

• Registration parameters are comparable using any of the features. The rotation
is very small, as expected, because at the time of data collection the motion was
mainly translational.

• The mosaic generated by combining the pair of images does not have perfect overlap
of the inlier region of the images, and the mosaicing is not robust enough to noise in
the registration parameters.


Chapter 5

Conclusion and Further Study

Features are extracted successfully. At present, the algorithm for finding the translation
after finding the rotation and scale is manual; it needs to be made statistical and
automatic, without any human intervention. Key-frames need to be detected from video
mission data, which can be done only when sufficient mission data is available, so more
flight data has to be collected in the future.
The algorithm for mosaicing is not robust enough to noise in the registration
parameters, and this has to be addressed. Some preprocessing is also needed:
the collected aerial data generally has noise due to weather conditions, motion
compensation is required, and camera parameters have to be estimated for
pre-warping the images. These factors affect the registration procedure and its accuracy.
In the future the mosaic needs to be geo-registered with ortho-rectified satellite image
data (reference image data). We are planning to buy satellite image data of the Kanpur
area for this purpose from NRSA (National Remote Sensing Agency). For funding
we have submitted a detailed research proposal to ARDB (Aeronautics Research &
Development Board).


Bibliography

[1] Xiong, Y., and Quek, F., Automatic Aerial Image Registration Without Correspondence, The 4th IEEE International Conference on Computer Vision Systems (ICVS2006), January 5-7, 2006, St. John's University, Manhattan, New York City, New York, USA.

[2] Sheikh, Y., Khan, S., Shah, M., and Cannata, R.W., Geodetic Alignment of Aerial Video Frames, VideoRegister03, 2003, Chapter 7.

[3] Golden, J.P., Terrain Contour Matching (TERCOM): A cruise missile guidance aid, Proc. Image Processing Missile Guidance, vol. 238, pp. 10-18, 1980.

[4] Rodriquez, J., and Aggarwal, J., Matching Aerial Images to 3D terrain maps, IEEE PAMI, 12(12), pp. 1138-1149, 1990.

[5] Sim, D., and Park, R., Localization based on the gradient information for DEM Matching, Proc. Transactions on Image Processing, 11(1), pp. 52-55, 2002.

[6] Zheng, Q., and Chellappa, R., A computational vision approach to image registration, IEEE Transactions on Image Processing, 2(3), pp. 311-326, 1993.

[7] Bergen, J., Anandan, P., Hanna, K., and Hingorani, R., Hierarchical model-based motion estimation, Proc. European Conference on Computer Vision, pp. 237-252, 1992.

[8] Szeliski, R., Image mosaicing for tele-reality applications, IEEE Workshop on Applications of Computer Vision, pp. 44-53, 1994.

[9] Cannata, R., Shah, M., Blask, S., and Workum, J. V., Autonomous Video Registration Using Sensor Model Parameter Adjustments, Applied Imagery Pattern Recognition Workshop, 2000.


[10] Kumar, R., Sawhney, H., Asmuth, J., Pope, A., and Hsu, S., Registration of video to georeferenced imagery, Fourteenth International Conference on Pattern Recognition, vol. 2, pp. 1393-1400, 1998.

[11] Wildes, R., Hirvonen, D., Hsu, S., Kumar, R., Lehman, W., Matei, B., and Zhao, W., Video Registration: Algorithm and quantitative evaluation, Proc. International Conference on Computer Vision, Vol. 2, pp. 343-350, 2001.

[12] Horn, B., and Schunck, B., Determining Optical Flow, Artificial Intelligence, vol. 17, pp. 185-203, 1981.

[13] Lucas, B., and Kanade, T., An Iterative image registration technique with an application to stereo vision, Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.

[14] Brown, L.G., A Survey of Image Registration Techniques, ACM Computing Surveys, Vol. 24, No. 4, pp. 325-376, December 1992.

[15] Li, H., Manjunath, B.S., and Mitra, S.K., A contour based approach to multisensor image registration, IEEE Trans. Image Processing, pp. 320-334, March 1995.

[16] Cideciyan, A. V., Registration of high resolution images of the retina, in Proc. SPIE, Medical Imaging VI: Image Processing, Feb. 1992, vol. 1652, pp. 310-322.

[17] Tainxi, W., A New Mosaicing Method for Landsat Remote Sensing Images, Kexue Tongbao (Science Bulletin), 32(12): 854-859, 1987.

[18] Herbert, P., and Rouge, B., Digital Image Mosaics, Prentice Hall, Englewood Cliffs, New Jersey, 1979.

[19] Westerkamp, D., and Gahm, T., Non-Distorted Assemblage of the Digital Images of Adjacent Fields in Histological Sections, Universitat Hannover, Appelstrasse 9A, 3000 Hannover 1, Germany, March 1992.

[20] Canny, J., A computational approach to edge detection, IEEE PAMI, pp. 679-698, 1986.

[21] Harris, J. C., and Stephens, M., A combined corner and edge detector, In Proc. 4th Alvey Vision Conf, pages 189-192, 1988.


[22] Tomasi, C., and Kanade, T., Detection and Tracking of Point Features, CMU Technical Report CMU-CS-91-132, April 1991.

[23] Itti, L., Models of Bottom-Up and Top-Down Visual Attention, PhD thesis, Pasadena, California, 2000.

[24] Itti, L., and Koch, C., Nature Reviews Neuroscience, 2(3), 194-203, 2001.

[25] Ouerhani, N., Visual Attention: From Bio-Inspired Modelling to Real-Time Implementation, PhD thesis, 2003.

[26] Backer, G., and Mertsching, B., Two selection stages provide efficient object-based attentional control for dynamic vision, in International Workshop on Attention and Performance in Computer Vision, 2004.

[27] Maji, S., and Mukerjee, A., Motion Conspicuity Detection: A Visual Attention model for Dynamic Scenes, Report on CS497, IIT Kanpur, available at
www.cse.iitk.ac.in/report-repository/ 2005/Y2383 497-report.pdf

[28] Kyung and Lacroix, S., A Robust Interest Point Matching Algorithm, IEEE, 2001.

[29] Intel Open Source Computer Vision Library.
http://www.intel.com/technology/computing/opencv/index.htm
