
Experiments in Visual-based Navigation with an Omnidirectional Camera

Niall Winters 1,2, José Gaspar 2, Etienne Grossmann 2 and José Santos-Victor 2

1 Computer Vision and Robotics Group, Department of Computer Science, University of Dublin, Trinity College, Dublin 2, Ireland. [email protected]

2 Instituto de Sistemas e Robótica, Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal. {jasv,etienne,jag}@isr.ist.utl.pt

    Abstract

This paper overviews experiments in autonomous visual-based navigation undertaken at the Instituto de Sistemas e Robótica. By considering the precise nature of the robot's task, we specify a navigation method which fulfils the environmental and localization requirements best suited to achieving this task. On-going research into task specification using 3D models, along with improvements to our topological navigation method, are presented. We show how to build 3D models from images obtained with an omnidirectional camera equipped with a spherical mirror, despite the fact that it does not have a single projection centre.

1. Introduction

The problem of navigation is central to the successful application of mobile robots. Autonomous visual-based navigation can be achieved using catadioptric vision sensors. Significantly, such a sensor can generate diverse views of the environment. We have utilized omnidirectional, panoramic and bird's-eye views (i.e. orthographic images of the ground plane) to achieve a variety of navigation tasks [8].

The motivation for our research comes from studies of animal navigation. These suggest that most species utilize a very parsimonious combination of perceptual, action and representational strategies [3] that lead to much more efficient solutions when compared to those of today's robots. For example, when entering one's hall door one needs to navigate with more precision than when walking along a city avenue. This observation of a path distance/accuracy trade-off is a fundamental aspect of our work [8, 17], with each mode requiring a specific environmental representation and localization accuracy.

For traveling long distances within an indoor environment our robot does not rely upon knowledge of its exact position [18]. Instead, fast qualitative positioning is undertaken in a topological context by matching the current omnidirectional image to an a priori acquired set, i.e. an appearance-based approach to the problem using PCA [11]. For tasks requiring very precise movements, e.g. docking, we developed a method termed Visual Path Following [7] for tracking visual landmarks in bird's-eye views of a scene. We can easily integrate these approaches in a natural manner. In one experiment the robot navigates through the Computer Vision Lab, traverses the door, travels down a corridor and then returns to its starting position.

It is important to note that for successful completion of such a task, user-specified information is required. While our robot is autonomous, it is not independent. This constraint led us to investigate a method whereby a user could specify particular subtasks in as intuitive a manner as possible. This is the subject of on-going research. We wished to solve this problem by designing an effective human-robot interface using only omnidirectional images. In this paper, we show how to construct this complementary human-robot interface module and how it fits into our existing navigation framework. Our solution is to reconstruct a 3D model of the environment from a single omnidirectional image. An intermediate step in this construction requires the generation of perspective images. Thus, we show how our sensor can be modeled by the central projection model [9], despite the fact that it does not have a single centre of projection [1].

Additionally, this paper presents an improvement to our topological navigation module. In our previous work, entire omnidirectional images were used for localization. Using a statistical method termed Information Sampling [19], we can now select the most discriminatory data from a set of images acquired a priori. This not only gives us a speed increase but is also beneficial when dealing with such problems as partial occlusion and illumination change.

Figure 1: Left: the omnidirectional camera. Right: the camera mounted on the mobile robot.

This paper is outlined as follows: Section 2 introduces the Information Sampling method and presents associated results. Section 3 details our human-robot interface, in addition to showing how the central projection model can model our spherical sensor. Results are also given. In Section 4 we present our conclusions and the future direction of our research.

2. Information Sampling for Navigation

Information Sampling is a statistical method to extract the most discriminatory data from a set of omnidirectional images acquired a priori. We rank this data from most to least discriminatory. For more details on ranking methods, see [19]. Then, using only the highest ranking data, we build a local appearance space which the robot utilizes for autonomous navigation. Theoretically, Information Sampling can be applied on a pixel-by-pixel basis to any type of image. For computational reasons, we use windows instead of pixels, extracted from omnidirectional images. Our setup is shown in Figure 1.

2.1. Related Work

Ohba [12] presents a method to divide images into a number of smaller windows. Eigenspace analysis was then applied to each window. Thus, even if a number of windows become occluded, the remaining ones contain enough information to perform the given task. Unfortunately, this method requires storage of a very large number of image windows, and the chance of a window acquired at runtime being matched to a number of images from the a priori set is high. Therefore, it is highly desirable that only the most discriminatory windows are selected from each acquired image.

Figure 2: Left: A 32 × 32 omnidirectional image acquired at runtime and its most discriminatory 8 × 8 information window. Right: Its reconstruction using this information window.

A number of solutions have been proposed, including the use of interest operators [14], pre-defined grids [4] or criteria such as detectability, uniqueness and reliability [12]. It should be noted that, unlike the other solutions cited here, Information Sampling does not rely on eigenspace analysis or the use of interest operators.

    2.2. Information Sampling: Overview

Essentially, Information Sampling can be described as a process for (i) reconstructing an entire image from the observation of a few (noisy) pixels and (ii) determining the most relevant image pixels, in the sense that they convey the most information about the image set.

Part (i) selects a number of pixels, d, to test according to a selection matrix, S. The problem is to estimate an (entire) image based on these partial (noisy) observations of the pixels, d. Once we have this information, we can calculate the maximum a posteriori estimate of an image, I_MAP [13], as follows:

$$I_{MAP} = \arg\max_I \; p(I\,|\,d) = \left(\Sigma_I^{-1} + S^T \Sigma_n^{-1} S\right)^{-1}\left(\Sigma_I^{-1}\,\bar{I} + S^T \Sigma_n^{-1}\,d\right) \qquad (1)$$

where $\bar{I}$ and $\Sigma_I$ denote the mean and covariance of the a priori image set and $\Sigma_n$ is the observation noise covariance.

In order to determine the best selection of pixels in the image, part (ii) involves computing the error associated with the above reconstruction. This error is calculated as the determinant of the error covariance matrix, $\Sigma_{error} = \mathrm{Cov}(I - I_{MAP})$. In essence, we wish to determine the S that minimizes (in some sense) $\Sigma_{error}$. This will be our information window. Notice that the information criterion is based on the entire set of images and not, as with other methods, on an image-by-image basis. More details on Information Sampling can be found in [19].
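As a concrete illustration, the sketch below implements the MAP reconstruction of equation (1) together with a window-scoring function based on the determinant of the error covariance. It is a minimal Python/NumPy sketch under an assumed isotropic noise model with standard deviation sigma_n; the function names and the way the selection matrix is built are illustrative choices, not the authors' original code.

```python
import numpy as np

def window_selection_matrix(top, left, win, side):
    """Selection matrix S picking a win x win window from a side x side image."""
    rows = []
    for i in range(win):
        for j in range(win):
            row = np.zeros(side * side)
            row[(top + i) * side + (left + j)] = 1.0  # flattened pixel index
            rows.append(row)
    return np.array(rows)

def map_reconstruction(d, S, mean_I, Sigma_I, sigma_n):
    """Equation (1): MAP estimate of a full image from a few noisy pixels d = S I + n."""
    Sigma_I_inv = np.linalg.inv(Sigma_I)
    A = Sigma_I_inv + (S.T @ S) / sigma_n**2            # posterior information matrix
    b = Sigma_I_inv @ mean_I + (S.T @ d) / sigma_n**2
    I_map = np.linalg.solve(A, b)
    Sigma_err = np.linalg.inv(A)                        # Cov(I - I_MAP)
    return I_map, Sigma_err

def window_score(S, Sigma_I, sigma_n):
    """Log-determinant of Cov(I - I_MAP); smaller means a more informative window."""
    A = np.linalg.inv(Sigma_I) + (S.T @ S) / sigma_n**2
    sign, logdet = np.linalg.slogdet(A)
    return -logdet
```

Scoring every candidate window position with window_score and keeping the smallest value would implement the selection step described above for one window size.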

Figure 2 (left) shows an omnidirectional image acquired at runtime and (right) its reconstruction using only the most discriminatory information window. This illustrates the power of our method. Results are detailed in the following section.


Figure 3: Top: Close-up of the 32 × 32 information windows: unknown (left), closest (middle) and reconstructed (right). Bottom: The position of the unknown and closest images in their respective omnidirectional images.

    2.3. Position Estimation Results

Once the best window was determined using Information Sampling, we built a local appearance space [19, 4] using only the data selected with this window from each a priori image. This additionally compressed our data to only approximately one thousandth of the original 128 × 128 image data. Successful position estimation has been achieved using windows as small as 4 × 4 pixels in size.

We projected only the selected windows from each image into the eigenspace. This is an improvement on previous approaches, where all windows first had to be projected. Thus, we were able to immediately reduce the ambiguity associated with projection. Figure 3, top row, from left to right, shows the most relevant information window from an unknown image, its closest match from the a priori set of omnidirectional images and its reconstruction. The bottom row shows the information window in the unknown 128 × 128 image (left) and its closest match from the a priori set, obtained by projecting only the most relevant information window (right). We note here that we could in principle, given enough computing power, use equation (1) to reconstruct a 128 × 128 image using only the most relevant window.
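For reference, one possible implementation of this appearance-based localization step is sketched below in Python/NumPy: a local appearance space is built by PCA over the selected windows of the a priori images, and the topological position is estimated as the nearest neighbour in that space. The eigenvector count n_components and the helper names are our illustrative assumptions, not the authors' code.

```python
import numpy as np

def build_appearance_space(prior_windows, n_components=8):
    """PCA over the selected information windows of the a priori images.

    prior_windows : (M, d) array, one flattened window per reference image.
    Returns the mean, the top eigenvectors and the projected reference set.
    """
    mean = prior_windows.mean(axis=0)
    centered = prior_windows - mean
    # SVD of the centered data gives the eigenvectors of the covariance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]            # (n_components, d)
    coords = centered @ basis.T          # reference coordinates in eigenspace
    return mean, basis, coords

def estimate_position(window, mean, basis, coords):
    """Index of the closest a priori image, i.e. the topological position."""
    q = (window - mean) @ basis.T
    dists = np.linalg.norm(coords - q, axis=1)
    return int(np.argmin(dists)), dists
```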

Figure 4 shows the distance between images acquired at run time and those acquired a priori. The global minimum is the correct estimate of the robot's current topological position. Local minima correspond to images similar to the current one, some distance away from the robot's position. Figure 4 (left) shows the graph obtained using 16 × 16 images, while Figure 4 (right) shows that obtained using the most discriminating 4 × 4 image window. While different local minima are obtained by both methods, the global minimum is maintained. Thus, we have shown that effective navigation can be undertaken using only the most informative 16 pixels from each image.

Figure 4: A 3D plot of images acquired at run time versus those acquired a priori using (top) 16 × 16 images and (bottom) 4 × 4 information windows, respectively.

3. Human-Robot Interface

Once we had developed an effective method for autonomous qualitative robot navigation along a topological map, we turned our attention to developing an intuitive user interface from which to select subtasks. While final experiments have yet to be undertaken, in this section we show how to construct this interface.

Our interface consists of a 3D interactive reconstruction of a structured environment obtained from a single omnidirectional image. In order to build such a model, we first have to generate perspective images from our catadioptric sensor, despite the fact that it does not have a single centre of projection, since it utilizes a spherical mirror. The benefit of such an interface is that even though it does not provide very fine details of the environment, it provides the user with a sufficiently rich description that may easily be transmitted over a low bandwidth link.

3.1. Catadioptric Sensor Modeling

In [9], Geyer and Daniilidis present a central projection model for all catadioptric systems with a single projection centre. This model combines a mapping to a sphere followed by a projection to a plane. The projection of a point in space (x, y, z) to an image point (u, v) can be written as:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \frac{l+m}{l\,r - z}\begin{bmatrix} x \\ y \end{bmatrix} = P(x, y, z;\, l, m), \qquad r = \sqrt{x^2 + y^2 + z^2} \qquad (2)$$


Figure 5: Left: Central Projection Model. Right: Spherical Mirror.

Figure 6: Mean absolute error [pix] between the central projection model and the projection with a spherical mirror, plotted against depth [cm]. Vertical bars indicate the standard deviation.

where l and m represent the distance from the sphere centre to the projection centre, O, and the distance from the sphere centre to the plane, respectively. This is graphically represented by Figure 5 (left).

On the other hand, catadioptric sensors which use a spherical mirror, shown in Figure 5 (right), are essentially modelled by the equation of reflection at the mirror surface, $\theta_i = \theta_r$ [7].

A camera with a spherical mirror cannot be exactly represented by the central projection model. In order to find an approximate representation, we compare the image projection error, instead of analyzing the projection centre itself.

Let $P(x_i, y_i, z_i;\, \theta)$ denote the central projection defined in equation (2) and $P_c$ be the projection with a spherical mirror. Grouping the geometric and intrinsic parameters together ($\theta$ and $\theta_c$ for the former and latter projections), we want to minimize the cost functional associated with the image projection error:

$$\hat{\theta} = \arg\min_{\theta} \sum_i \left\| P(x_i, y_i, z_i;\, \theta) - P_c(x_i, y_i, z_i;\, \theta_c) \right\|^2$$

The minimization of this cost functional gives the desired parameters, $\theta$, for the central projection model, $P$, which approximates the spherical sensor characterized by $P_c$ and $\theta_c$. Despite the fact that projection using spherical mirrors cannot be represented by the central projection model, Figure 6 shows that for a certain operational range this theory gives a very good approximation. The error is less than 1 pixel.
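One possible way to carry out this fit numerically is sketched below: equation (2) is implemented directly and (l, m) are estimated by nonlinear least squares against correspondences between 3D points and their images under the true spherical-mirror projection (e.g. from a calibration set or a ray-traced simulation of P_c). The data-loading side, the assumption that intrinsics are already factored out, and the initial guess are illustrative; this is not the authors' calibration code.

```python
import numpy as np
from scipy.optimize import least_squares

def central_projection(X, l, m):
    """Equation (2): central projection of 3D points X (N, 3) with parameters l, m."""
    r = np.linalg.norm(X, axis=1)
    s = (l + m) / (l * r - X[:, 2])
    return np.stack([s * X[:, 0], s * X[:, 1]], axis=1)

def fit_central_model(X, uv_mirror, l0=0.2, m0=10.0):
    """Fit (l, m) so that the central model best reproduces the spherical-mirror
    projections uv_mirror (N, 2) of the 3D points X (N, 3); l0, m0 are an
    arbitrary initial guess."""
    def residuals(params):
        l, m = params
        return (central_projection(X, l, m) - uv_mirror).ravel()
    sol = least_squares(residuals, x0=[l0, m0])
    return sol.x  # estimated (l, m)
```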

3.2. Forming Perspective Images

The acquisition of correct perspective images, independent of the scenario, requires that the vision sensor be characterised by a single projection centre [1]. By definition the spherical projection model has this property, but due to the intermediate mapping over the sphere the images obtained are, in general, not perspective.

In order to obtain correct perspective images, the spherical projection must first be reversed from the image plane to the sphere surface and then re-projected to the desired plane from the sphere centre. We term this reverse projection back-projection, after Sturm [15, 16]. The back-projection of an image pixel (u, v), obtained through spherical projection, yields a 3D direction k·(x, y, z) given by the following equations, derived from equation (2):

$$a = -(l + m), \qquad b = u^2 + v^2$$

$$\begin{bmatrix} x \\ y \end{bmatrix} = \frac{l\,a - \mathrm{sign}(a)\sqrt{(1 - l^2)\,b + a^2}}{a^2 + b}\begin{bmatrix} u \\ v \end{bmatrix}, \qquad z = \pm\sqrt{1 - x^2 - y^2} \qquad (3)$$

where z is negative when $|l + m|/l > \sqrt{u^2 + v^2}$, and positive otherwise. It is assumed, without loss of generality, that (x, y, z) lies on the surface of the unit sphere.

Note that the set {(x, y, z)}, interpreted as points of the projective plane, already defines a perspective image. However, for displaying, or for obtaining specific viewing directions, further development is required.

Letting R denote the orientation of the desired (pin-hole) camera relative to the frame associated with the results of back-projection, the new perspective image {(u, v)} becomes:

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\, R^{-1} \begin{bmatrix} x \\ y \\ z \end{bmatrix} \qquad (4)$$

where K contains the intrinsic parameters and $\lambda$ is a scaling factor. This is the pin-hole camera projection model [5], when the origin of the coordinates is the camera centre.
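The two steps of this subsection translate almost directly into code. The sketch below back-projects an image point to the unit sphere with equation (3) and re-projects it with equation (4); it is an illustrative Python/NumPy sketch, and the handling of the z sign as well as the choice of K and R follow the text rather than the authors' implementation.

```python
import numpy as np

def back_project(u, v, l, m):
    """Equation (3): lift an image point (u, v) to a direction on the unit sphere."""
    a = -(l + m)
    b = u**2 + v**2
    k = (l * a - np.sign(a) * np.sqrt((1 - l**2) * b + a**2)) / (a**2 + b)
    x, y = k * u, k * v
    z = np.sqrt(max(0.0, 1 - x**2 - y**2))
    # z is negative when |l + m| / l > sqrt(u^2 + v^2)
    if abs(l + m) / l > np.sqrt(b):
        z = -z
    return np.array([x, y, z])

def to_perspective(p_sphere, K, R):
    """Equation (4): re-project a sphere point into the desired pin-hole camera."""
    q = K @ R.T @ p_sphere      # R^{-1} = R^T for a rotation matrix
    return q[:2] / q[2]         # divide out the scale factor lambda
```

In practice one would typically warp the omnidirectional image by mapping each pixel of the desired perspective view back through these relations (inverting equation (4) to get a direction, then applying equation (2)) and sampling the source image there.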

3.3. Interactive 3D Reconstruction

We now describe the method used to obtain 3D models given the following information: a perspective image (obtained from an omnidirectional camera with a spherical mirror), a camera orientation obtained from vanishing points [2] and some limited user input [6, 10].


Let $p = [u\ v\ 1]^T$ be the projection of a 3D point $[C_x\ C_y\ C_z\ 1]^T$ that we want to reconstruct. Then, if we consider a normalized camera [5], we have the following:

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = [r_1\ r_2\ r_3\ 0] \begin{bmatrix} C_x \\ C_y \\ C_z \\ 1 \end{bmatrix} = C_x r_1 + C_y r_2 + C_z r_3 \qquad (5)$$

where $r_1$, $r_2$, $r_3$ are vanishing points. As is usual, we choose 0 as the origin of the coordinates for reconstruction. Next, we define lines towards the vanishing points:

$$l_i = r_i \times [u\ v\ 1]^T, \qquad i = 1 \ldots 3. \qquad (6)$$

Then, using the cross and internal product property that $(r \times p)^T p = 0$, we obtain:

$$l_i^T r_1\, C_x + l_i^T r_2\, C_y + l_i^T r_3\, C_z = 0 \qquad (7)$$

which is a linear system in the coordinates of the 3D point. This can be rewritten as:

$$\begin{bmatrix} 0 & l_1^T r_2 & l_1^T r_3 \\ l_2^T r_1 & 0 & l_2^T r_3 \\ l_3^T r_1 & l_3^T r_2 & 0 \end{bmatrix} \begin{bmatrix} C_x \\ C_y \\ C_z \end{bmatrix} = 0. \qquad (8)$$

Generalising this system to N points, we again obtain a linear system:

$$A\,C = 0 \qquad (9)$$

where C contains the 3N coordinates of the N tridimensional points that we wish to locate and A is block diagonal, where each block has the form shown in equation (8). Thus, A is of size 3N × 3N. Since only two equations from the set defined by equation (8) are independent, the co-rank of A is equal to the number of points N. As expected, this shows that there is an unknown scale factor for each point.

Now, adding some limited user input, in the form of co-planarity or co-linearity point constraints, a number of 3D coordinates become equal and thus the number of columns of A may be reduced. As a simple example, if we have 5 points we have 5 × 3 free coordinates, i.e. the number of columns of A. Now, if we impose the constraint that points $P_1$, $P_2$, $P_3$ are co-planar, with constant z value, and that points $P_4$, $P_5$ are co-linear, with a constant (x, y) value, then the total number of free coordinates is reduced from the initial 15 to 11. Thus, the coordinates $P_{2z}$, $P_{3z}$, $P_{5x}$, $P_{5y}$ are dropped from the linear system defined by equation (9).

Given sufficient user input, the co-rank of A becomes 1. In this case, the solution of the system will give a reconstruction with no ambiguity other than the well-known one of scale [10].
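A compact way to realise this procedure is sketched below in Python/NumPy: one 3 × 3 block per point is assembled as in equation (8), the user constraints are encoded as a column-merging matrix, and the reduced system is solved as the null space (smallest singular vector) of the resulting matrix. The merge-matrix encoding of the constraints is our own illustrative choice; the paper does not prescribe a particular implementation.

```python
import numpy as np

def point_block(p, r1, r2, r3):
    """Equation (8) block for one image point p = [u, v, 1]."""
    l = [np.cross(r, p) for r in (r1, r2, r3)]
    B = np.array([[l[i] @ r for r in (r1, r2, r3)] for i in range(3)])
    np.fill_diagonal(B, 0.0)   # l_i^T r_i = 0 by construction
    return B

def reconstruct(points, r1, r2, r3, merge):
    """Solve A (merge c) = 0 for the reduced coordinate vector c.

    points : list of image points [u, v, 1]
    merge  : (3N, k) matrix mapping the k free coordinates to the 3N coordinates
             (identity columns; coordinates made equal by a constraint share a column).
    """
    n = len(points)
    A = np.zeros((3 * n, 3 * n))
    for i, p in enumerate(points):
        A[3 * i:3 * i + 3, 3 * i:3 * i + 3] = point_block(p, r1, r2, r3)
    M = A @ merge
    _, _, vt = np.linalg.svd(M)
    c = vt[-1]                 # null-space direction: reconstruction up to scale
    return (merge @ c).reshape(-1, 3)
```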

Figure 7: User-defined planes orthogonal to the x axis (light gray), y axis (white) and z axis (dark gray); the 16 numbered image points are marked.

Axis | Planes                                                 | Lines
x    | (1, 2, 9, 10, 11), (3, 4, 14, 15, 16), (5, 7, 12, 13)  |
y    | (5, 6, 10, 11, 12), (7, 8, 13, 14, 15), (1, 3, 9, 16)  | (1, 2)
z    | (1, 2, 3, 4, 5, 6, 7, 8), (9, 12, 13, 16)              |

Table 1: User-defined planes and lines. The numbers are the indexes of the image points shown in Figure 7. The first column indicates the axis to which the planes are orthogonal and the lines are parallel.

3.4. Developing the Human-Robot Interface: Results

Figure 7 shows an omnidirectional image and superposed user input. This input consists of the 16 points shown, knowledge that sets of points belong to constant x, y or z planes, and that other sets belong to lines parallel to the x, y or z axes. Table 1 details all the user-defined data. Planes orthogonal to the x and y axes are in light gray and white respectively, and one horizontal plane is shown in dark gray (the topmost horizontal plane is not shown as it would occlude the other planes). The coordinates in the original image were transformed to the equivalent pin-hole model coordinates and used for reconstruction. Figure 8 shows the resulting texture-mapped reconstructions. These results are interesting given that they required only a single image and limited user input to reconstruct the surroundings of the sensor.

Figure 8: Views of the reconstructed 3D model.

4. Conclusions & Future Work

This paper presented experiments in visual-based navigation using an omnidirectional camera. We showed improvements to our topological module by using Information Sampling. We explained how our sensor, despite the fact that it does not have a single projection centre, can be modeled by the central projection model. Then, using single omnidirectional images and some limited user input, we showed how to build a 3D model of the environment. This acted as a basis for an intuitive human-robot interface.

In terms of our future work, we plan on extending the implementation of Information Sampling. Additionally, creation of large scene models shall be achieved by fusing different models together. We plan on carrying out extended closed-loop control experiments verifying the applicability of our navigation framework.

    Acknowledgements

This work was partly funded by the European Union RTD - Future and Emerging Technologies Project Number IST-1999-29017, Omniviews.

References

[1] S. Baker and S.K. Nayar. A theory of single-viewpoint catadioptric image formation. International Journal of Computer Vision, 35(2):175-196, 1999.
[2] B. Caprile and V. Torre. Using vanishing points for camera calibration. International Journal of Computer Vision, 4:127-140, 1990.
[3] T. S. Collett, E. Dillmann, A. Giger, and R. Wehner. Visual landmarks and route following in the desert ant. J. Comp. Physiology A, 170:435-442, 1992.
[4] V. Colin de Verdière and J. L. Crowley. Visual recognition using local appearance. In 5th European Conference on Computer Vision (ECCV 1998), pages 640-654, Freiburg, Germany, June 1998.
[5] O. Faugeras. Three Dimensional Computer Vision. MIT Press, 1993.
[6] J. Gaspar, E. Grossmann, and J. Santos-Victor. Interactive reconstruction from an omnidirectional image. In 9th International Symposium on Intelligent Robotic Systems (SIRS'01), Toulouse, France, July 2001.
[7] J. Gaspar and J. Santos-Victor. Visual path following with a catadioptric panoramic camera. In 7th International Symposium on Intelligent Robotic Systems (SIRS'99), pages 139-147, Coimbra, Portugal, July 1999.
[8] J. Gaspar, N. Winters, and J. Santos-Victor. Vision-based navigation and environmental representations with an omni-directional camera. IEEE Transactions on Robotics and Automation, 16(6):890-898, December 2000.
[9] C. Geyer and K. Daniilidis. A unifying theory for central panoramic systems and practical applications. In ECCV 2000, pages 445-461, Dublin, Ireland, 2000.
[10] E. Grossmann, D. Ortin, and J. Santos-Victor. Reconstruction of structured scenes from one or more views: Nature of solutions. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[11] H. Murase and S. K. Nayar. Visual learning and recognition of 3D objects from appearance. International Journal of Computer Vision, 14(1):5-24, January 1995.
[12] K. Ohba and K. Ikeuchi. Detectability, uniqueness and reliability of eigen windows for stable verification of partially occluded objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(9):1043-1048, September 1997.
[13] A. P. Sage and J. L. Melsa. Estimation Theory with Applications to Communications and Control. McGraw-Hill, 1971.
[14] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530-535, May 1997.
[15] P. Sturm. A method for 3D reconstruction of piecewise planar objects from single panoramic images. In 1st International IEEE Workshop on Omnidirectional Vision at CVPR, pages 119-126, 2000.
[16] P. Sturm and S. Maybank. A method for interactive 3D reconstruction of piecewise planar objects from single images. In British Machine Vision Conference, pages 265-274, 1999.
[17] N. Winters, J. Gaspar, G. Lacey, and J. Santos-Victor. Omni-directional vision for robot navigation. In 1st International IEEE Workshop on Omni-directional Vision at CVPR, pages 21-28, 2000.
[18] N. Winters and J. Santos-Victor. Omni-directional visual navigation. In 7th International Symposium on Intelligent Robotics Systems (SIRS'99), pages 109-118, Coimbra, Portugal, July 1999.
[19] N. Winters and J. Santos-Victor. Information sampling for optimal image data selection. In 9th International Symposium on Intelligent Robotics Systems (SIRS'01), Toulouse, France, July 2001.