
Feature-Based Aerial Image Registration and Mosaicing

A Report Submitted in Partial Fulfillment of the Requirements

    for the Degree of

    Bachelor of Technology

    by

Gaurav Gupta (Y2150)

    to the

    Department of Electrical Engineering

    Indian Institute of Technology, Kanpur

    April, 2006


    Certificate

This is to certify that the work contained in the thesis entitled Feature-Based Aerial Image Registration and Mosaicing, by Gaurav Gupta, has been carried out under my supervision and that this work has not been submitted elsewhere for a degree.

April, 2006

------------------------------------------
(Dr. Sumana Gupta)
Department of Electrical Engineering,
Indian Institute of Technology, Kanpur.

-------------------------------------------
(Dr. Amitabha Mukerjee)
Department of Computer Science and Engineering,
Indian Institute of Technology, Kanpur.


    ACKNOWLEDGEMENT

I am extremely thankful to Dr. Amitabha Mukerjee and Dr. Sumana Gupta, who provided me with support and guidance that were invaluable for my work. All my doubts were welcome. I would also like to thank Dr. Jharna Majumdar, under whose guidance I gained knowledge and experience during my short stay at ADE, Bangalore, which helped me a lot in working on this project. I would like to thank Dr. A. K. Ghosh for organizing the flight for collecting the data required for this project. Furthermore, I would like to extend my sincere gratitude to Mr. Shobhit Niranjan, M.Tech (Dual) student, Dept. of Electrical Engineering, IIT Kanpur, who sat with me to sort out problems in my code and algorithms. I would also like to thank Mr. Subhranshu Maji, B.Tech student, Dept. of Computer Science and Engineering, who helped me with coding some parts. I also thank my teammates from Aerospace Engineering and Computer Science and Engineering, who are with me in the UAV Project Group at IIT Kanpur, for their motivation, support and company in critical times.


    Contents

1 Introduction
    1.1 Geo-Registration
    1.2 Geometric Transformations
    1.3 Aerial Image Registration
    1.4 Image Mosaicing

2 Feature Extraction
    2.1 Corner Detection
    2.2 Saliency Map and Salient Points based on Itti's Model

3 Registration Technique
    3.1 Registration without using Correspondence, and Mosaicing

4 Discussion

5 Conclusion and Further Study


    Chapter 1

    Introduction

The ability to locate scenes and objects visible in aerial video imagery at their corresponding locations in a reference coordinate system is becoming increasingly important in visually guided navigation, surveillance and monitoring systems [2] [11]. The availability of low-cost, lightweight video camera systems, high-bandwidth VHF communication links and a growing inventory of Unmanned Aerial Vehicles (UAVs) and Mini Aerial Vehicles (MAVs) has created dramatic new opportunities for surveillance and sensing applications of such algorithms. The typical mission of such a UAV/MAV consists of registering the video of a particular surveillance flight with a reference image (alignment of video frames with pre-calibrated reference imagery: a DEM and satellite data). Frame-to-reference registration of video is complex due to the lack of stable features/salient points, because a typical frame does not cover a large enough area, and it is also computationally expensive. A possible solution is to register, with the reference image, a mosaic created from typical MAV mission flight data.

The objective of this work is to study feature-based techniques for aerial registration of typical MAV mission flight data in a semi-urban environment and to create a mosaic using the estimated geometric transformation parameters. The next task is to register this generated mosaic with a high-resolution satellite reference image, which would allow us to locate any given landmark in a MAV mission video in a world coordinate system. The two aspects to this work are:

1. Feature extraction

2. Registration

In this semester, so far we have studied various important features that can be used for registration: the Harris corner detector, the KLT corner detector and


visually salient points using Itti's model. Features are extracted successfully. Rotation and scale parameters are determined using a recent registration technique devised by Xiong et al. [1]. The algorithm in [1] is different from conventional feature-based image registration algorithms in that it does not need image matching or correspondence. We compare different feature extraction algorithms, use those features for registration, and compare the results. We extend the approach in [1] to also find the translation, and use the estimated parameters to generate a mosaic. Code for mosaicing has also been written and some results are obtained, although the mosaicing algorithm is not yet robust to noise in the registration parameters. The implementation is done on the Windows platform in Visual C++ 6.0 with the OpenCV [29] libraries.

Next semester's tasks will be to find the translation parameters robustly and to create a mosaic from MAV video data. In the future, the mosaic will be geo-registered with ortho-rectified satellite image data (reference image data). We are planning to buy satellite image data of the Kanpur area for this purpose from NRSA (National Remote Sensing Agency). Some preprocessing is also needed, because the collected aerial data generally has noise due to weather conditions; motion compensation is needed, and camera parameters have to be estimated for pre-warping the images. These factors affect the registration procedure and its accuracy.

    1.1 Geo-Registration

Computer vision techniques can be used to align any given video frame with pre-calibrated reference imagery. This kind of registration is known as geo-registration [2]. The Reference Imagery is a high-resolution orthographic image, usually with a Ground Sampling Distance of 1 m (meaning a pixel corresponds to 1 m² on the ground). This Reference Imagery is geodetically aligned and has an associated Digital Elevation Map (DEM), so that each pixel of the Reference Imagery has a precise longitude, latitude, and height associated with it. The Reference Imagery, which covers a substantial area, can be cropped on the basis of the telemetry data (telemetry is an automatic measurement of data that defines the position of the camera in terms of nine parameters: vehicle latitude, vehicle longitude, vehicle height, vehicle roll, vehicle pitch, vehicle heading, camera elevation, camera scan angle and camera focal length) to a smaller area corresponding to I_video(x), which denotes the video frame or mosaic created from the aerial data. This cropped reference image is referred to as I_ref(x).


Two transformation functions exist between the reference and aerial images. If x is an image point (in pixels), I_1(x) is the aerial image array and I_2(x) is the reference image array, then:

1. f_reg(x) is the geometric transformation. Finding this mapping is the problem of image registration,

   I_1(x) \approx I_2(f_{reg}(x))                                   (1.1)

2. f_color is the intensity/colour mapping between the aerial and reference images,

   I_1(x) = f_{color}(I_2(f_{reg}(x)))                              (1.2)

Finding f_reg(x) is the larger challenge. Shah et al. [2] identify the following difficulties:

1. The two imageries are in different projection views: I_video(x) is an image in perspective projection, whereas I_ref(x) is an image in orthographic projection. The telemetry information can be used with a sensor model to bring both images into a single projection view.

2. Because of the long duration of time that elapses between the capturing of the two images, data distortions like severe lighting and atmospheric variations, and object changes in the form of forest growth or new construction, cause a high number of disjoint features (features present in one image but not in the other).

3. Remotely sensed terrain imagery, in particular, has the property of being highly self-correlated both as image data and as elevation data. This includes first-order correlations (locally similar luminance or elevation values in buildings), second-order correlations (edge continuations in roads, forest edges, and ridges), as well as higher-order correlations (homogeneous textures in forests and homogeneous elevations in plateaus).

    In the past, substantial research has been directed towards determining the geo-

    location of objects from an aerial view. Several systems such as Terrain Contour

    Matching (TERCOM) [3], SITAN, Inertial Navigation/Guidance Systems (INS/IGS),

    Global Positioning Systems (GPS) and most recently Digital Scene-Matching and

    Area Correlation (DSMAC) have already been deployed in applications requiring

    geo-location. While each of these systems has had some degree of success, several

    shortcomings and deficiencies have become increasingly apparent. By understanding


the limitations of these systems, we can acquire a better appreciation of the need for effective image-based systems.

Two types of approaches can be distinguished for the geo-registration problem: elevation-based correspondence and image-based correspondence. Elevation-based algorithms attempt to achieve alignment by matching the DEM with an elevation map recovered from the video data. Aggarwal et al. in [4] perform pixel-wise stereo analysis of successive frames to yield a recovered elevation map, or REM, as the initial data rectification step. In [5], Sim and Park propose another geo-registration algorithm that reconstructs a REM from stereo analysis of successive video frames. Normalized cross correlation based point-matching is used to recover the elevation values. Elevation-based approaches (based on DEMs) have the general drawback that they rely on the accuracy of elevation recovered from two frames, a task found to be notoriously difficult.

Intensity-based approaches to geo-registration use intensity properties of both imageries to achieve alignment. Work has been done on developing image-based techniques for the registration of two sets of reference imageries [6], as well as the registration of two successive video images ([7], [8]). In [9], Cannata et al. use the telemetry information to bring a video frame into an orthographic projection view, by associating each pixel with an elevation value from the DEM. By ortho-rectifying the aerial video frame, the process of alignment is simplified to a strict 2D registration problem. Correspondence is achieved by taking 32 × 32 pixel patches uniformly over the aerial image and correlating them with a larger search patch in the Reference Image, using normalized cross correlation. Finally, the sensor parameters are updated using a conjugate gradient method, or by a Kalman filter to stress temporal continuity.

An alternate approach is presented by Kumar et al. in [10], where instead of ortho-rectifying the aerial video frame, a perspective projection of the associated area of the Reference Image is performed. In [10], two further data rectification steps are performed. Video frame-to-frame alignment is used to create a mosaic, providing greater context for alignment than a single image. For data rectification, a Laplacian filter at multiple scales is then applied to both the video mosaic and the reference image. To achieve correspondence, two stages of alignment are used: coarse followed by fine alignment. For coarse alignment, salient (feature) points are defined as the locations where the response in both scale and space is maximum. Normalized correlation is used as a match measure between salient points and the associated reference patch. One feature point is picked as a reference, and the correlation surfaces for each feature point are then translated to be centered at the reference feature point.


In the subsequent work [11], the filter is modified to use the Laplacian of Gaussian filter as well as the Hilbert transform, in four directions, to yield four oriented energy images for each aerial video frame and for each perspectively projected reference image. Instead of considering video mosaics for alignment, the authors use a mosaic of 3 key-frames from the data stream, each with at least 50 percent overlap.

The major limitation of the intensity-based approaches is the assumptions that are made. The research literature on image-based correspondence is quite vast; [12] is a general survey of some of these registration techniques. Alignment by maximization of Mutual Information [13] is another frequently used registration approach, and while it provides a high level of robustness, it also allows many false positives when matching over a search area of the nature encountered in geo-registration. In addition to working with no GPS, it is also possible to consider situations where the telemetry data is unavailable or corrupted; registration is still possible, but due to the lack of an initial point for the visual search, it results in a significant increase in computational time.

    1.2 Geometric Transformations

A geometric transformation is a mapping that relocates image points. Transformations can be global or local in nature. Global transformations are usually defined by a single set of parameters, which is applied to the whole image. Some of the most common global transformations are affine, perspective, and polynomial transformations. The affine transformations include translation, scaling and shear motion parameters. Translation and rotation transforms are usually caused by the different orientation of the sensor, while a scaling transform is the effect of a change in altitude of the sensor. Sensor distortion or the viewing angle may cause stretching and shearing. Rigid transformations account for object or sensor movement in which objects in the images maintain their relative shape and size [14]. A rigid-body transformation is composed of a combination of rotation \theta, translation t_x in the x direction, translation t_y in the y direction, and scale s. It can be written as

\begin{bmatrix} x_2 \\ y_2 \end{bmatrix} =
\begin{bmatrix} t_x \\ t_y \end{bmatrix} +
s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}                                          (1.3)

where (x_2, y_2) is the new transformed coordinate of (x_1, y_1), t_x and t_y are the x-axis and y-axis translations, and s is a scale factor. The general 2D affine transformation can be expressed as shown in the following equation:


Figure 1.1: (a) Based on the telemetry data, which specifies the corresponding area of the Reference Imagery the camera is capturing, the Reference Image is cropped. (b) The aerial video frame before and (c) after geo-registration with the cropped Reference Image. Note that the Reference Image is an orthographic image while the aerial video frame is a perspective image. Images are taken from [2] for elaboration.


\begin{bmatrix} x_2 \\ y_2 \end{bmatrix} =
\begin{bmatrix} t_x \\ t_y \end{bmatrix} +
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}                                          (1.4)

A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}

where (x_2, y_2) is the new transformed coordinate of (x_1, y_1). The matrix A can be a combination of rotation, scale, or shear. The rotation matrix is similar to that in (1.3). The scale for both the x and y axes can be expressed as:

Scale = \begin{bmatrix} S_x & 0 \\ 0 & S_y \end{bmatrix}                           (1.5)

However, local distortions may be present in the scene due to motion parallax, movement of objects, etc. The parameters of a local mapping transformation vary across different regions of the image in order to handle local deformations. These parameters can be determined by subdividing the image into small image parts.
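As a concrete illustration of the rigid-body model in (1.3), the following minimal sketch builds the 2x3 matrix [sR | t] and warps an image with it. It uses the modern OpenCV C++ API rather than the original Visual C++ 6.0/OpenCV code of this project, and the angle, scale, translation and file names are arbitrary placeholders.

```cpp
// rigid_warp.cpp -- illustrative sketch of the rigid-body transform in Eq. (1.3)
// applied with OpenCV's warpAffine. Parameter values are placeholders.
#include <opencv2/opencv.hpp>
#include <cmath>

int main() {
    cv::Mat src = cv::imread("frame.png");            // hypothetical input frame
    if (src.empty()) return 1;

    const double theta = 10.0 * CV_PI / 180.0;        // rotation (radians)
    const double s = 0.97;                             // isotropic scale
    const double tx = 57.5, ty = 1.6;                  // translation in pixels

    // [x2; y2] = [tx; ty] + s * R(theta) * [x1; y1]
    cv::Mat M = (cv::Mat_<double>(2, 3) <<
                 s * std::cos(theta), -s * std::sin(theta), tx,
                 s * std::sin(theta),  s * std::cos(theta), ty);

    cv::Mat dst;
    cv::warpAffine(src, dst, M, src.size());
    cv::imwrite("frame_warped.png", dst);
    return 0;
}
```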

    1.3 Aerial Image Registration

Image registration is the process of determining the geometric transformation defined above between a newly sensed image, called the input image, and a reference image of the same scene that could possibly be taken at a different time, from a different sensor, or from a different viewpoint. The current automated registration techniques can be classified into two broad categories: area-based and feature-based techniques. In the area-based algorithms, a small window of points in the sensed image is compared statistically with windows of the same size in the reference image [15]. Window correspondence is based on the similarity measure between two given windows. The measure of similarity is usually the normalized cross correlation. Area-based techniques can be implemented in the Fourier domain using the fast Fourier transform (FFT) [16]. A majority of the area-based methods have the limitation of registering only images with small misalignment, and therefore the images must be roughly aligned with each other initially. The correlation measures become unreliable when the images have multiple modalities and the gray-level characteristics vary (e.g., TM and synthetic aperture radar (SAR) data).

In contrast, the feature-based methods are more robust and more suitable in these cases. There are two critical procedures generally involved in the feature-based techniques: feature extraction and feature correspondence. The basic building block


of a feature-based image registration scheme involves matching feature points extracted from a sensed image to their counterparts in a reference image. Features may be control points, corners, junctions or interest points. These features are also known as visually salient points. Feature matching overcomes the limitations of area-based signal correlation by attempting to match only information-rich points.

    1.3.1 Image Integration

This deals with finding f_color between two aerial images or between two different imaging systems. Various techniques have been developed for modifying the image grey levels in the vicinity of a boundary to obtain a smooth transition between images by removing these seams and creating a blended image. These mainly consist in choosing a frontier which induces a minimum of discontinuity [18]. [17] proposed using a polynomial curve: in each line of the common area, one point is retained and the curve is defined by a minimum squared error procedure. These methods are appropriate if the common area is nearly identical (planar support). Heitz presented a simplification using a parametric plane function s = ax + by + c determined by a mean square error procedure. A more substantial transformation must be applied when the common regions include large differences; Westerkamp et al. [19] describe a polynomial function to assemble distorted microscopic images.

    1.4 Image Mosaicing

An image mosaic is a synthetic composition generated from a sequence of images, and it can be obtained by understanding the geometric relationships between the images. The geometric relations are coordinate transformations that relate the different image coordinate systems. By applying the appropriate transformations via a warping operation and merging the overlapping regions of the warped images, it is possible to construct a single image, indistinguishable from a single large image of the same object, covering the entire visible area of the scene. This merged single image is called the mosaic. The basic scheme for mosaicing comprises two main steps, which are outlined below; a minimal code sketch combining both steps follows the list.

    1. Image registration using Geometric Transformations derived from image data

    and/or camera models.

    2. Image integration or blending.
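A minimal sketch of the two-step scheme above, assuming a pair of input files and a set of already-estimated rigid registration parameters: the second image is warped into the coordinate frame of the first (step 1) and the overlap is blended by simple averaging (step 2). This is an illustration only, not the project's mosaicing code, and the parameter values and file names are placeholders.

```cpp
// mosaic_sketch.cpp -- two-image mosaic: warp, then blend the overlap by averaging.
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img1 = cv::imread("img1.png"), img2 = cv::imread("img2.png");
    if (img1.empty() || img2.empty()) return 1;

    // Assumed rigid registration parameters (rotation in degrees, scale, translation).
    cv::Mat M = cv::getRotationMatrix2D(cv::Point2f(0.f, 0.f), 3.6, 0.97);
    M.at<double>(0, 2) += 57.5;                        // tx (pixels)
    M.at<double>(1, 2) += 1.6;                         // ty (pixels)

    // Canvas crudely oversized so the warped image is not clipped.
    cv::Size canvas(img1.cols + img2.cols, img1.rows + img2.rows);
    cv::Mat warped2, mask2;
    cv::Mat ones = cv::Mat::ones(img2.size(), CV_8U) * 255;
    cv::warpAffine(img2, warped2, M, canvas);
    cv::warpAffine(ones, mask2, M, canvas);            // valid-pixel mask of image 2

    cv::Mat mosaic = cv::Mat::zeros(canvas, img1.type());
    cv::Mat roi = mosaic(cv::Rect(0, 0, img1.cols, img1.rows));
    img1.copyTo(roi);

    for (int y = 0; y < canvas.height; ++y)
        for (int x = 0; x < canvas.width; ++x) {
            if (!mask2.at<uchar>(y, x)) continue;      // no data from image 2 here
            cv::Vec3b b = warped2.at<cv::Vec3b>(y, x);
            cv::Vec3b a = mosaic.at<cv::Vec3b>(y, x);
            bool hasA = (x < img1.cols && y < img1.rows);
            cv::Vec3b out;
            for (int c = 0; c < 3; ++c)                // average in the overlap region
                out[c] = hasA ? static_cast<uchar>((a[c] + b[c]) / 2) : b[c];
            mosaic.at<cv::Vec3b>(y, x) = out;
        }
    cv::imwrite("mosaic.png", mosaic);
    return 0;
}
```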


We have adopted a feature-based approach to solve the registration problem. Different from conventional feature-based image registration algorithms, our approach is based on the work done by Xiong et al. [1], which does not need image matching or correspondence. In [1] they only consider Harris corners as features. We compare different feature extraction algorithms, use those features for correspondence, and compare the results. We extend the approach in [1] to also find the translation, and use the estimated parameters to generate a mosaic.


    Chapter 2

    Feature Extraction

A feature is the result of an interpretation of n pixels, usually over a compact support, in a window of p × p. An important step in almost all machine as well as biological vision systems is to process the input image(s) to extract features or primal sketches. In general, the feature detection process involves computing the response R of one or multiple detectors (filters/operators) to the input image(s), followed by the analysis of R to isolate points (or regions) that satisfy certain constraints. In fact, the best definition of a feature is the operator itself. There are several kinds of features used for matching. They may be divided into four groups as follows:

- Visual features (edges, textures, junctions and corners)

- Transform coefficient features (Fourier descriptors, Hadamard coefficients)

- Algebraic features (based on matrix decomposition of an image)

- Statistical features (moment invariants)

    2.1 Corner Detection

Corners are defined as the junction points of two straight-line edges. Most existing edge detectors perform poorly at corners, because they assume an edge to be an entity with infinite extent, an assumption which is violated at corners. Since most of the gray-level based corner detectors are built on existing edge detectors, the performance of such corner detectors is not satisfactory. For example, the Canny edge detector [20] is found incapable of accurately locating edges near a corner due to the well-known rounding effect. The Harris corner detector [21] and the KLT feature detector [22] are the most widely used corner detectors, so we compare them for our application.


    2.1.1 Harris Corner Detector

The Harris corner detector [21] computes a matrix related to the auto-correlation function of the image intensity. This matrix averages the first derivatives of the signal over a window:

C = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}                    (2.1)

where I_x and I_y are the gradients (derivatives) in the x and y directions. The eigenvalues of this matrix are the principal curvatures of the auto-correlation function. If both curvatures are high, an interest point is present.

Algorithm for the Harris corner detector

1. Compute the matrix C for each pixel of the input image.

2. The standard Harris corner detection algorithm proposes two different criteria for corner point selection. The first is to compare the value of det(C) - k\,trace(C)^2 with a threshold, and the second is to compare the value of R = det(C)/trace(C) with a threshold, where C is the covariance matrix of the gradient computed above. We have used the second criterion while implementing the Harris algorithm, because the first method depends strongly on the chosen value of the constant k.

Feature Reduction

While selecting corner points using the Harris algorithm we have applied a two-level corner strength comparison. Suppose \lambda_1 and \lambda_2 are the two eigenvalues of the covariance matrix C; then the criteria for feature reduction are as follows:

- First, we compare the value of norm = \sqrt{\lambda_1^2 + \lambda_2^2} with a threshold, and if it is greater, the point is a first-level corner point.

- Then we divide the image into 25 × 25 grids, and in each grid we select at most one corner point, the one with the highest value of the eigenvalue norm defined above.

Fig. 2.1(a) shows the corners detected using the Harris corner detection algorithm [21] and fig. 2.1(b) shows the corners after applying the feature reduction algorithm. A code sketch of this two-level selection follows.
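The sketch below gives the flavour of this scheme in the modern OpenCV C++ API (file name, thresholds and grid size are assumptions). Note that cv::cornerHarris computes the det(C) - k·trace(C)^2 response, i.e. the first criterion of step 2, rather than the det/trace ratio actually used in this work; the grid-based reduction keeps the strongest response per cell.

```cpp
// harris_grid.cpp -- Harris response plus grid-based feature reduction (sketch).
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    cv::Mat gray = cv::imread("aerial.png", cv::IMREAD_GRAYSCALE);  // assumed input
    if (gray.empty()) return 1;

    cv::Mat resp;
    cv::cornerHarris(gray, resp, /*blockSize=*/3, /*ksize=*/3, /*k=*/0.04);

    double minV, maxV;
    cv::minMaxLoc(resp, &minV, &maxV);
    const float thresh = 0.01f * static_cast<float>(maxV);  // first-level threshold
    const int cell = 25;                                     // grid cell size (pixels)

    std::vector<cv::Point> corners;                          // at most one per cell
    for (int gy = 0; gy < resp.rows; gy += cell)
        for (int gx = 0; gx < resp.cols; gx += cell) {
            cv::Point best(-1, -1);
            float bestVal = thresh;
            for (int y = gy; y < std::min(gy + cell, resp.rows); ++y)
                for (int x = gx; x < std::min(gx + cell, resp.cols); ++x)
                    if (resp.at<float>(y, x) > bestVal) {
                        bestVal = resp.at<float>(y, x);
                        best = cv::Point(x, y);
                    }
            if (best.x >= 0) corners.push_back(best);
        }
    std::printf("kept %zu corners\n", corners.size());
    return 0;
}
```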


Figure 2.1: (a) Corners obtained using the Harris corner detector algorithm; (b) detected features after the feature reduction algorithm.

    2.1.2 KLT Corner Detector

KLT features [22] are geometrically stable under different transformations. Hence, features detected by KLT have a high repeatability factor and high information content. The detector is also based on the auto-correlation function of the image intensity.

KLT Corner Detector Algorithm

1. Compute the matrix C for each pixel of the input image and let \lambda_1 and \lambda_2 denote its eigenvalues.

2. The KLT detector performs first-level corner detection based on the value of the smaller eigenvalue. It is computed in a window about the point under consideration and compared with a threshold: if it is greater than the threshold, the point is a first-level corner point. The array of all corner points is then sorted in decreasing order of the minimum eigenvalue of the window about each point.

3. Moving from top to bottom, we delete all the points lying below the point under consideration in the array that satisfy the 8-neighbourhood criterion.

In fig. 2.2(a) the corners obtained using the KLT algorithm are shown, and fig. 2.2(b) shows the reduced KLT corners.
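For reference, OpenCV's cv::goodFeaturesToTrack implements the same smaller-eigenvalue (Shi-Tomasi/KLT) criterion, with a minimum-distance pruning step that plays roughly the role of the 8-neighbourhood deletion above. A minimal sketch, with assumed file name and parameter values:

```cpp
// klt_features.cpp -- min-eigenvalue (KLT/Shi-Tomasi) corner selection sketch.
#include <opencv2/opencv.hpp>
#include <vector>
#include <cstdio>

int main() {
    cv::Mat gray = cv::imread("aerial.png", cv::IMREAD_GRAYSCALE);  // assumed input
    if (gray.empty()) return 1;

    std::vector<cv::Point2f> corners;
    // useHarrisDetector=false selects the smaller-eigenvalue criterion;
    // minDistance suppresses neighbouring detections.
    cv::goodFeaturesToTrack(gray, corners,
                            /*maxCorners=*/200,
                            /*qualityLevel=*/0.01,
                            /*minDistance=*/10.0,
                            cv::noArray(),
                            /*blockSize=*/3,
                            /*useHarrisDetector=*/false);
    std::printf("detected %zu KLT-style corners\n", corners.size());
    return 0;
}
```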


Figure 2.2: (a) Corners obtained using the KLT corner detector algorithm; (b) detected features after feature reduction.

2.2 Saliency Map and Salient Points based on Itti's Model

Visual attention is basically a biological mechanism used essentially by primates to compensate for the inability of their brains to process the huge amount of visual information gathered by the two eyes. Early work on attention modelling was mostly inspired by the biological model of the brain. The Caltech hypothesis [23] elaborated by Itti and Koch [24] represents one of the first concrete descriptions of how the visual attention model works. According to the hypothesis, the elementary features are extracted into a unique map of attention, the saliency map, which resides either in the LGN (lateral geniculate nucleus) or in V1 (the primary visual cortex). Finally, the Winner-Take-All (WTA) network [27], which is responsible for detecting the most salient scene location, is located around the thalamic reticular nucleus.

One of the first and most popular computational models of saliency is based on the Caltech hypothesis. It rests on four main principles: visual attention is based on multi-featured inputs; the saliency of a region is affected by the surrounding context; the saliency of locations is represented by a saliency map; and Winner-Take-All and Inhibition of Return are suitable mechanisms to allow attention shifts.


Figure 2.3: Schematic diagram of Itti's model [23].

    2.2.1 Feature Maps for Static Images

First, a number of features (1, ..., j, ..., n) are extracted from the scene by computing the so-called feature maps F_j. Such a map represents the image of the scene based on a well-defined feature, which leads to a multi-featured representation of the scene. In his implementation, Itti considered seven different features, which are computed from an RGB colour image and which belong to three main cues, namely intensity, colour, and orientation.

Intensity feature:

F_1 = I = 0.3R + 0.59G + 0.11B                                                (2.2)

Two chromatic features based on the two colour-opponency filters R+G and B+Y, where the yellow signal is defined as Y = (R + G)/2. Such chromatic opponency exists in the human visual cortex.

F_2 = (R - G)/I                                                               (2.3)

F_3 = (B - Y)/I                                                               (2.4)


The normalization of the features by I decouples hue from intensity.

Four local orientation features F_4, ..., F_7 according to the angles {0°, 45°, 90°, 135°}. Gabor filters, which represent a suitable mathematical model of the receptive-field impulse response of orientation-selective neurons in the primary visual cortex [25], are used to compute the orientation features. In this implementation of the model it is possible to use an arbitrary number of orientations; however, it has been noticed that using more than four orientations does not improve the performance of the model drastically.
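A short sketch of the intensity and chromatic feature maps of Eqs. (2.2)-(2.4), again in the modern OpenCV C++ API with an assumed input file; the small epsilon added to I is a guard against division by zero and is not part of the model.

```cpp
// feature_maps.cpp -- intensity and colour-opponency feature maps, Eqs. (2.2)-(2.4).
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat bgr = cv::imread("aerial.png");             // OpenCV loads channels as B,G,R
    if (bgr.empty()) return 1;

    cv::Mat img;
    bgr.convertTo(img, CV_32FC3, 1.0 / 255.0);
    std::vector<cv::Mat> ch(3);
    cv::split(img, ch);
    cv::Mat B = ch[0], G = ch[1], R = ch[2];

    cv::Mat I = 0.3 * R + 0.59 * G + 0.11 * B;           // F1, Eq. (2.2)
    cv::Mat Y;
    cv::addWeighted(R, 0.5, G, 0.5, 0.0, Y);             // yellow signal Y = (R + G)/2
    cv::Mat Is = I + 1e-6;                               // avoid division by zero
    cv::Mat F2, F3;
    cv::divide(R - G, Is, F2);                           // Eq. (2.3)
    cv::divide(B - Y, Is, F3);                           // Eq. (2.4)

    cv::Mat out;
    I.convertTo(out, CV_8U, 255.0);
    cv::imwrite("F1_intensity.png", out);                // F2, F3 would feed the
    return 0;                                            // conspicuity-map stage
}
```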

    2.2.2 Center-Surround Receptive Field Profiles

In a second step, each feature map is transformed into its conspicuity map, which highlights the parts of the scene that strongly differ, according to a specific feature, from their surroundings. In biologically plausible models this is usually achieved by using a center-surround mechanism. Practically, this mechanism can be implemented with a difference-of-Gaussians filter (DoG), which can be applied to the feature maps to extract local activities for each feature type. A visual attention task has to detect conspicuous regions regardless of their sizes; thus, a multiscale conspicuity operator is required. Applying variable-size center-surround filters to fixed-size images has a high computational cost, so the method is instead based on a multiresolution representation of the images. For a feature j, a Gaussian pyramid I_j is created by progressively lowpass filtering and sub-sampling the feature map F_j by a factor of 2, using a Gaussian filter G:

I_j(0) = F_j                                                                      (2.5)

I_j(i) = \downarrow (I_j(i-1) * G)                                                (2.6)

where (*) refers to the spatial convolution operator and \downarrow refers to the downsampling operation. Center-surround is then implemented as the difference between fine (c, for center) and coarse (s, for surround) scales. Indeed, for a feature j (1, ..., j, ..., n), a set of intermediate multiscale conspicuity maps M_{j,k} (1, ..., k, ..., K) is computed according to the equation below, giving rise to (n × K) maps for the n considered features:

M_{j,k} = |I_j(c_k) \ominus I_j(s_k)|                                             (2.7)

where \ominus is a cross-scale difference operator that first interpolates the coarser scale to the finer one and then carries out a point-by-point subtraction.


The absolute value of the difference between the center and the surround allows the simultaneous computation of both sensitivities: dark center on bright surround and bright center on dark surround (red/green and green/red, or blue/yellow and yellow/blue, for colour).
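A minimal sketch of Eqs. (2.5)-(2.7): cv::pyrDown performs the Gaussian filtering and subsampling of (2.6), the coarse (surround) level is interpolated back to the fine (center) level's size, and the absolute difference gives one intermediate conspicuity map. The pyramid depth and the (c, s) scale pair are arbitrary choices for illustration.

```cpp
// center_surround.cpp -- Gaussian pyramid and cross-scale difference, Eqs. (2.5)-(2.7).
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat F = cv::imread("F1_intensity.png", cv::IMREAD_GRAYSCALE); // one feature map
    if (F.empty()) return 1;
    F.convertTo(F, CV_32F);

    // Eqs. (2.5)-(2.6): I(0) = F, I(i) = downsample(I(i-1) * G)
    std::vector<cv::Mat> pyr{F};
    for (int i = 1; i <= 4; ++i) {
        cv::Mat next;
        cv::pyrDown(pyr.back(), next);                 // Gaussian blur + subsample by 2
        pyr.push_back(next);
    }

    // Eq. (2.7): M = |I(c) (-) I(s)|, here with center scale c = 1, surround s = 3.
    const int c = 1, s = 3;
    cv::Mat surroundUp, M;
    cv::resize(pyr[s], surroundUp, pyr[c].size(), 0, 0, cv::INTER_LINEAR);
    cv::absdiff(pyr[c], surroundUp, M);

    cv::Mat out;
    cv::normalize(M, out, 0, 255, cv::NORM_MINMAX);
    out.convertTo(out, CV_8U);
    cv::imwrite("conspicuity_c1_s3.png", out);
    return 0;
}
```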

    2.2.3 Saliency Map

The purpose of the saliency map is to represent the conspicuity, or saliency, at every location in the visual field by a scalar quantity, and to guide the selection of attended locations based on the spatial distribution of saliency. At each spatial location, all the feature maps consequently need to be combined into a unique scalar measure of salience. In the implementation, all the feature maps are normalized to the same dynamic range (e.g., between 0 and 255) and then summed into the saliency map. This normalization operation is denoted N(.).
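A small sketch of this combination step, assuming the per-cue conspicuity maps have already been written to disk (the file names are hypothetical) and share the same size: each map is rescaled to a common dynamic range, playing the role of N(.), and the maps are summed point-wise.

```cpp
// saliency_map.cpp -- combine normalized conspicuity maps into one saliency map.
#include <opencv2/opencv.hpp>
#include <vector>
#include <string>

int main() {
    // Hypothetical per-cue conspicuity maps, assumed to have identical sizes.
    std::vector<std::string> files = {"consp_intensity.png", "consp_color.png",
                                      "consp_orientation.png"};
    cv::Mat saliency;
    for (const std::string& f : files) {
        cv::Mat m = cv::imread(f, cv::IMREAD_GRAYSCALE);
        if (m.empty()) continue;
        m.convertTo(m, CV_32F);
        cv::normalize(m, m, 0, 255, cv::NORM_MINMAX);   // N(.): common dynamic range
        if (saliency.empty()) saliency = m;
        else saliency += m;                              // point-wise summation
    }
    if (saliency.empty()) return 1;
    cv::normalize(saliency, saliency, 0, 255, cv::NORM_MINMAX);
    saliency.convertTo(saliency, CV_8U);
    cv::imwrite("saliency.png", saliency);
    return 0;
}
```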

    2.2.4 Selection of the point of Attention

Once the saliency map has been computed, Winner-Take-All (WTA) and Inhibition of Return are suitable mechanisms to imitate eye movements and the focus of attention [27]. The WTA selects the point with maximum salience at each iteration. The movement of the attention point can be achieved by inhibiting the saliency of the object currently being attended [26] [27]: at each iteration the saliency of the attended object is decayed, so eventually the objects not being attended to increase in relative saliency and take the focus of attention. Another approach is to divide the saliency map into a sufficient number of grid cells and take the local maxima of the saliency map intensity in each cell above a certain threshold value.
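A minimal sketch of the WTA/Inhibition-of-Return loop described above, with full suppression of a disc around each winner instead of a gradual decay; the radius, number of fixations and input file name are arbitrary assumptions.

```cpp
// wta_ior.cpp -- Winner-Take-All with Inhibition of Return (sketch).
#include <opencv2/opencv.hpp>
#include <vector>
#include <cstdio>

int main() {
    cv::Mat sal = cv::imread("saliency.png", cv::IMREAD_GRAYSCALE);  // assumed input
    if (sal.empty()) return 1;
    sal.convertTo(sal, CV_32F);

    const int nPoints = 10, inhibitRadius = 25;    // arbitrary choices for the sketch
    std::vector<cv::Point> attended;
    for (int i = 0; i < nPoints; ++i) {
        double maxVal;
        cv::Point maxLoc;
        cv::minMaxLoc(sal, nullptr, &maxVal, nullptr, &maxLoc);   // WTA: global max
        if (maxVal <= 0) break;
        attended.push_back(maxLoc);
        // Inhibition of Return: suppress a disc around the current winner.
        cv::circle(sal, maxLoc, inhibitRadius, cv::Scalar(0), cv::FILLED);
        std::printf("attention shift %d -> (%d, %d)\n", i, maxLoc.x, maxLoc.y);
    }
    return 0;
}
```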

Fig. 2.4(a) shows the saliency map generated by combining all the feature maps based on Itti's model for finding visually salient points. It clearly shows that the salient locations have larger intensity in the image. Fig. 2.5(a) and fig. 2.5(b) show the visually salient points detected on a pair of images.


Figure 2.4: Saliency map obtained from the normalized summation of all feature maps.

Figure 2.5: (a) Visually salient points obtained in the first image; (b) salient points obtained in the second image.


    Chapter 3

    Registration Technique

3.1 Registration without using Correspondence, and Mosaicing

This approach is based on Xiong et al. [1], which addresses the problem of aerial image registration without any correspondence, using a novel algorithm. Features are detected on a pair of images (observed and reference) using either a corner detector or the saliency-map approach based on Itti's model. Image patches are created using these features as positions. A circle is used as the shape of the image patches to deal with rotation. By changing the size of the image patches, we can handle scaling. Orientations of the image patches are computed with an eigenvector approach. From the orientation differences of patches between the reference and observed images, an angle histogram is created by a voting procedure. The orientation difference corresponding to the maximum peak of the histogram is the rotation angle between the reference and observed images. Different sizes of image patches are used to create different angle histograms. The scaling value between the two images can be determined from the angle histogram which has the highest maximum peak. The approach is described in the following subsections.

    3.1.1 The Orientation of an Image Patch

For a given patch p(i, j) (i = 1, 2, ..., m), the covariance matrix is defined as

COV_p = E\left[(X - m_x)(X - m_x)^T\right]                                        (3.1)


where X = [i, j]^T is the position of a pixel and m_x = [m_i, m_j]^T is the centroid of the image patch p(i, j), its first-order moment. The eigenvalues can be found by solving

|COV_p - \lambda I| = 0                                                           (3.2)

Equation 3.2 gives two eigenvalues. Suppose \lambda_1 is the larger eigenvalue and \lambda_2 is the smaller one. The normalized eigenvectors V_1 and V_2 that correspond to the eigenvalues \lambda_1 and \lambda_2 are of course orthogonal. The direction of the eigenvector V_1 is defined as the orientation of the image patch p(i, j). By applying this approach, we can compute orientations for all image patches on the reference and observed images.
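The sketch below computes this patch orientation for a circular patch. Intensity-weighted moments are used here, which is an assumption about how the centroid and covariance are accumulated, and the principal-eigenvector angle of the 2x2 symmetric matrix is obtained in closed form rather than by an explicit eigen-decomposition.

```cpp
// patch_orientation.cpp -- orientation of a circular image patch from the principal
// eigenvector of its position covariance matrix (Eqs. 3.1-3.2). Illustrative sketch.
#include <opencv2/opencv.hpp>
#include <cmath>
#include <cstdio>

// Orientation (radians) of the circular patch centred at (cx, cy) with given radius.
double patchOrientation(const cv::Mat& gray, int cx, int cy, int radius) {
    double sum = 0, mi = 0, mj = 0;
    for (int i = cy - radius; i <= cy + radius; ++i)
        for (int j = cx - radius; j <= cx + radius; ++j) {
            if (i < 0 || j < 0 || i >= gray.rows || j >= gray.cols) continue;
            if ((i - cy) * (i - cy) + (j - cx) * (j - cx) > radius * radius) continue;
            double w = gray.at<uchar>(i, j);       // intensity weight (assumption)
            sum += w; mi += w * i; mj += w * j;
        }
    if (sum <= 0) return 0.0;
    mi /= sum; mj /= sum;                          // centroid m_x = (m_i, m_j)

    double cii = 0, cjj = 0, cij = 0;              // entries of COV_p
    for (int i = cy - radius; i <= cy + radius; ++i)
        for (int j = cx - radius; j <= cx + radius; ++j) {
            if (i < 0 || j < 0 || i >= gray.rows || j >= gray.cols) continue;
            if ((i - cy) * (i - cy) + (j - cx) * (j - cx) > radius * radius) continue;
            double w = gray.at<uchar>(i, j), di = i - mi, dj = j - mj;
            cii += w * di * di; cjj += w * dj * dj; cij += w * di * dj;
        }
    // Angle of the principal eigenvector V1 of the symmetric 2x2 matrix COV_p.
    return 0.5 * std::atan2(2.0 * cij, cii - cjj);
}

int main() {
    cv::Mat gray = cv::imread("aerial.png", cv::IMREAD_GRAYSCALE);  // assumed input
    if (gray.empty()) return 1;
    double theta = patchOrientation(gray, gray.cols / 2, gray.rows / 2, 20);
    std::printf("patch orientation: %.2f degrees\n", theta * 180.0 / CV_PI);
    return 0;
}
```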

    3.1.2 Angle Histogram for Image Rotation and Scaling

For the observed image, we can create a patch set P_t = {p_t^j, j = 1, 2, ..., n_t} and obtain an orientation set \Theta_t = {\theta_t^j, j = 1, 2, ..., n_t}. Similarly, for the reference image, we can create a patch set P_f = {p_f^i, i = 1, 2, ..., n_f} and an orientation set \Theta_f = {\theta_f^i, i = 1, 2, ..., n_f}. Suppose that the rotation angle between the observed and reference images is \alpha and that both images cover the same scene. For an image patch p_t^j on the observed image, we compute orientation differences with all patches P_f = {p_f^i, i = 1, 2, ..., n_f} on the reference image:

\Delta\theta_l = |\theta_t^j - \theta_f^i|,   i = 1, 2, ..., n_f                  (3.3)

If we do the same computation for all patches p_t^j, j = 1, 2, ..., n_t on the observed image, we obtain a set of orientation differences \Delta = {\Delta\theta_l, l = 1, 2, ..., n_t n_f} and find n_t corresponding patches on the reference image. For these n_t pairs of corresponding patches, the value of the orientation difference will be the rotation angle:

\Delta\theta_l = |\theta_t^j - \theta_f^i| = \alpha                               (3.4)

If we create a histogram of the orientation differences, the count of the bin in which the orientation difference between corresponding patches falls will be the highest.

To find the scale, we can obtain the value of the scaling through a series of voting processes. By changing the size of the image patches and computing the angle histograms, we obtain a series of angle histograms, and choose the one which has the highest maximum peak. Let A_t and A_f denote the patch sizes on the observed and reference images corresponding to the histogram H_h which has the highest maximum peak.


Figure 3.1: Typical angle histogram. On the X-axis each column represents a bin of angle difference and on the Y-axis the occurrence count of that bin is shown. The peak for the rotation angle is very dominant, assuming the same scale.

The value of the scaling between the observed and reference images can be computed by

s = A_t / A_f                                                                     (3.5)

At the same time, the orientation difference corresponding to the histogram H_h is the rotation angle between the observed and reference images.
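A compact sketch of the rotation vote of Eqs. (3.3)-(3.4), using synthetic orientations. In the full pipeline the two orientation sets would come from the patch-orientation computation above, and the whole procedure would be repeated for several patch sizes, keeping the histogram with the highest peak to obtain the scale via Eq. (3.5).

```cpp
// angle_histogram.cpp -- rotation estimation by voting on pairwise orientation
// differences (Sec. 3.1.2). Orientations here are synthetic placeholders.
#include <vector>
#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical patch orientations (radians) for the observed and reference images.
    std::vector<double> thetaObs = {0.10, 0.52, 1.05, 1.40};
    std::vector<double> thetaRef = {0.30, 0.72, 1.25, 1.60};

    const double PI = 3.14159265358979;
    const int nBins = 180;                               // 1-degree bins over [0, 180)
    std::vector<int> hist(nBins, 0);

    for (double to : thetaObs)
        for (double tr : thetaRef) {
            double d = std::fabs(to - tr) * 180.0 / PI;  // Eq. (3.3), in degrees
            int bin = static_cast<int>(d) % nBins;       // vote into the histogram
            ++hist[bin];
        }

    int best = 0;
    for (int b = 1; b < nBins; ++b)
        if (hist[b] > hist[best]) best = b;              // peak bin = rotation angle
    std::printf("estimated rotation ~ %d degrees (%d votes)\n", best, hist[best]);
    return 0;
}
```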

    3.1.3 Finding Translation

    Algorithm

1. Select one feature point in the reference image (currently done manually).

2. Find similar points according to the similarity measure defined in [28], with a threshold of 0.85.

3. Take a window about the interest point in the reference image and calculate the normalized cross correlation for each similar point; the particular similar point in the target image with the highest correlation is the corresponding point/patch. Finding the translation is then trivial using the simple transformation. A correlation sketch follows below.
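A minimal sketch of the correlation part of step 3 using cv::matchTemplate with the normalized cross-correlation score. The feature location, window size and file names are assumptions, and the candidate pre-filtering by the similarity measure of [28] is omitted, so the whole target image is searched instead.

```cpp
// ncc_translation.cpp -- normalized cross correlation of a window around a
// reference feature against the target image (step 3 of the translation search).
#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    cv::Mat ref = cv::imread("img1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat tgt = cv::imread("img2.png", cv::IMREAD_GRAYSCALE);
    if (ref.empty() || tgt.empty()) return 1;

    const cv::Point feature(200, 150);      // assumed feature point in the reference
    const int half = 16;                    // 33x33 window, assumed fully inside ref
    cv::Rect win(feature.x - half, feature.y - half, 2 * half + 1, 2 * half + 1);
    cv::Mat templ = ref(win);

    cv::Mat score;
    cv::matchTemplate(tgt, templ, score, cv::TM_CCORR_NORMED);   // NCC surface
    double maxVal;
    cv::Point maxLoc;
    cv::minMaxLoc(score, nullptr, &maxVal, nullptr, &maxLoc);

    // Centre of the best-matching window is the corresponding point; the translation
    // is the difference between the two feature locations.
    cv::Point corresp(maxLoc.x + half, maxLoc.y + half);
    std::printf("tx = %d, ty = %d (ncc = %.3f)\n",
                corresp.x - feature.x, corresp.y - feature.y, maxVal);
    return 0;
}
```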


Figure 3.2: (a) A test image; (b) the test image rotated by 11°; the result shows an 11.46° rotation.

    3.1.4 Results of Registration

After extracting the features, we use them to find the rotation, scale and translation parameters and register the images according to the algorithm described above. Table 3.1 shows the results obtained for Harris features, Table 3.2 for KLT features, and Table 3.3 for salient features using Itti's model.

To validate the algorithm we applied it to a pair of test images, such that fig. 3.2(a) is the image without any rotation and fig. 3.2(b) is the same image rotated by 11° anticlockwise. The results obtained show an 11.46° rotation and a 1.0 scale, which is a very good result. We did similar experiments with other images as well. To validate the final correspondence after finding the translation, we highlighted the found corresponding feature in the destination image with a black blob, as shown in fig. 3.3(a) and 3.3(b).

Fig. 3.4(a) and fig. 3.4(b) show the pair of images (source and destination images for registration) for which the above registration parameters were found. Fig. 3.5 shows the mosaic obtained by combining the source and destination images using the registration parameters.


Figure 3.3: (a) Source image with a feature selected manually; (b) destination image with the estimated feature corresponding to the feature in the source image.

Table 3.1: Results using Harris corner features

    Angle of rotation    3.6492°
    Scale                0.973329
    Translation Tx       57.547512 pixels
    Translation Ty       1.562878 pixels

Table 3.2: Results using KLT corner features

    Angle of rotation    5.1567°
    Scale                1.000000
    Translation Tx       48.427681 pixels
    Translation Ty       8.734428 pixels

Table 3.3: Results using visually salient points

    Angle of rotation    1.43312°
    Scale                1.000000
    Translation Tx       50.427681 pixels
    Translation Ty       1.734428 pixels


Figure 3.4: (a) First image; (b) second image.

Figure 3.5: Merged image using corner features.


    Chapter 4

    Discussion

- KLT features are more stable than Harris features. Salient points are also found to be stable.

- A degree of blurring in the images affects the detected features, because the presence of noise affects the texture (orientation). The blurring is clearly due to shaking of the camera mount, which should be minimized.

- The registration parameters are comparable using any of the features. The rotation is very small, as expected, because at the time of data collection the motion was mainly translational.

- The mosaic generated by combining the pair of images does not have a perfect overlap of the inlier regions of the images, and the mosaicing is not robust enough to noise in the registration parameters.


    Chapter 5

    Conclusion and Further Study

Features are extracted successfully. As of now, the algorithm for finding the translation after finding the rotation and scale is manual; it needs to be statistical and automatic, without any human intervention. There is a need to detect key-frames from mission video data, which can be done only when we have sufficient mission data, so in the future more flight data has to be collected. The algorithm for mosaicing is not robust enough to noise in the registration parameters; this artifact has to be taken care of. Some preprocessing is also needed, because the collected aerial data generally has noise due to weather conditions; motion compensation is needed, and camera parameters have to be estimated for pre-warping the images. These factors affect the registration procedure and its accuracy. In the future the mosaic needs to be geo-registered with ortho-rectified satellite image data (reference image data). We are planning to buy satellite image data of the Kanpur area for this purpose from NRSA (National Remote Sensing Agency). For funding, we have submitted a detailed research proposal to the ARDB (Aeronautics Research & Development Board).


    Bibliography

[1] Xiong, Y., and Quek, F., Automatic Aerial Image Registration Without Correspondence, The 4th IEEE International Conference on Computer Vision Systems (ICVS 2006), January 5-7, 2006, St. John's University, Manhattan, New York City, New York, USA.

[2] Sheikh, Y., Khan, S., Shah, M., and Cannata, R.W., Geodetic Alignment of Aerial Video Frames, VideoRegister03, 2003, Chapter 7.

    [3] Golden, J.P., Terrain Contour Matching (TERCOM): A cruise missile guidance

    aid, Proc. Image Processing Missile Guidance, vol. 238, pp. 10-18, 1980.

[4] Rodriguez, J., and Aggarwal, J., Matching Aerial Images to 3D Terrain Maps, IEEE PAMI, 12(12), pp. 1138-1149, 1990.

    [5] Sim, D., and Park, R., Localization based on the gradient information for DEM

    Matching, Proc. Transactions on Image Processing, 11(1), pp. 52-55, 2002.

    [6] Zheng, Q., and Chellappa, R., A computational vision approach to image reg-

    istration, IEEE Transactions on Image Processing, 2(3), pp. 311 -326, 1993.

    [7] Bergen, J., Anandan, P., Hanna, K., and Hingorani, R., Hierarchical model-based motion estimation, Proc. European Conference on Computer Vision, pp.

    237-252, 1992.

    [8] Szeliski, R., Image mosaicing for tele-reality applications, IEEE Workshop on

    Applications of Computer Vision, pp. 44-53, 1994.

    [9] Cannata, R., Shah, M., Blask, S., and Workum, J. V., Autonomous Video Reg-

    istration Using Sensor Model Parameter Adjustments, Applied Imagery Pattern

    Recognition Workshop, 2000.

    26

  • 7/28/2019 EE491_Y2150

    31/32

[10] Kumar, R., Sawhney, H., Asmuth, J., Pope, A., and Hsu, S., Registration of video to georeferenced imagery, Fourteenth International Conference on Pattern Recognition, vol. 2, pp. 1393-1400, 1998.

    [11] Wildes, R., Hirvonen, D., Hsu, S., Kumar, R., Lehman, W., Matei, B., and

    Zhao, W., Video Registration: Algorithm and quantitative evaluation, Proc.

    International Conference on Computer Vision, Vol. 2, pp. 343 -350, 2001.

[12] Horn, B., and Schunck, B., Determining Optical Flow, Artificial Intelligence, vol. 17, pp. 185-203, 1981.

    [13] Lucas, B., and Kanade, T., An Iterative image registration technique with an

    application to stereo vision, Proceedings of the 7th International Joint Confer-

    ence on Artificial Intelligence, pp. 674-679, 1981.

    [14] Brown, L.G., A Survey of Image Registration Techniques, ACM Computing

    Surveys, Vol. 24, No. 4, pp. 325-376, December 1992.

[15] Li, H., Manjunath, B.S., and Mitra, S.K., A contour-based approach to multisensor image registration, IEEE Trans. Image Processing, pp. 320-334, March 1995.

[16] Cideciyan, A. V., Registration of high resolution images of the retina, in Proc. SPIE, Medical Imaging VI: Image Processing, Feb. 1992, vol. 1652, pp. 310-322.

    [17] Tainxi, W., A New Mosaicing Method for Landsat Remote Sensing Images,

    Kexue Tongbao (Science Bulletin), 32(12): 854-859, 1987.

    [18] Herbert, P., and Rouge, B., Digital Image Mosaics, Prentice Hall, Englewood

    cliffs, New Jersey, 1979.

    [19] Westerkamp, D., and Gahm, T., Non-Distorted Assemblage of the Digital Im-

    ages of Adjacent Fields in Histological Sections, Universitat Hannover, Appel-

    strasse 9A, 3000 Hannover 1, Germany March 1992.

[20] Canny, J., A computational approach to edge detection, IEEE PAMI, pp. 679-698, 1986.

[21] Harris, C., and Stephens, M., A combined corner and edge detector, in Proc. 4th Alvey Vision Conf., pp. 189-192, 1988.

    27

  • 7/28/2019 EE491_Y2150

    32/32

    [22] Tomasi, C., and Kanade, T., Detection and Tracking of Point Features, CMU

    Technical Report CMU-CS-91-132, April 1991.

    [23] Itti, L., Models of Bottom-Up and Top-Down Visual Attention, PhD thesis,

    Pasadena, California, 2000.

[24] Itti, L., and Koch, C., Computational modelling of visual attention, Nature Reviews Neuroscience, 2(3), pp. 194-203, 2001.

[25] Ouerhani, N., Visual Attention: From Bio-Inspired Modelling to Real-Time Implementation, PhD thesis, 2003.

[26] Backer, G., and Mertsching, B., Two selection stages provide efficient object-based attentional control for dynamic vision, in International Workshop on Attention and Performance in Computer Vision, 2004.

[27] Maji, S., and Mukerjee, A., Motion Conspicuity Detection: A Visual Attention Model for Dynamic Scenes, Report on CS497, IIT Kanpur, available at www.cse.iitk.ac.in/report-repository/ 2005/Y2383 497-report.pdf

    [28] Kyung and Lacroix, S., A Robust Interest Point Matching Algorithm, IEEE,

    2001

[29] Intel Open Source Computer Vision Library, http://www.intel.com/technology/computing/opencv/index.htm