Visual Tracking for Seamless 3D Interactions in Augmented Reality
Fraunhofer Institute for Applied Information Technology, Collaborative Virtual and Augmented Environments,
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
chunrong.firstname.lastname@example.org
Abstract. This paper presents a computer vision based approach for creating 3D tangible interfaces, which can facilitate real-time and flexible interactions with the augmented virtual world. This approach uses real-world objects and free-hand gestures as interaction handles. The identity of these objects/gestures as well as their 3D pose in the physical world can be tracked in real time. Once the objects and gestures are perceived and localized, the corresponding virtual objects can be manipulated dynamically by human operators who are operating on those real objects. Since the tracking algorithm is robust against background clutter and adaptable to illumination changes, it performs well in real-world scenarios, where both objects and cameras move rapidly in unconstrained environments.
1 Introduction

Augmented Reality (AR) deals mainly with the visual enhancement of the physical world. The interactive aspect of AR requires tangible interfaces that can invoke dynamic actions and changes in the augmented 3D space. On the one hand, the concept of tangible interfaces makes it possible to develop interactive AR applications. On the other hand, reliable systems that can retrieve the identity and location of real-world objects have to be developed. It is obvious that successful AR interactions depend, among other things, largely on the robust processing and tracking of real-world objects. According to , many AR systems will not be able to run without accurate registration of the real world.
Various means can be employed for the tracking of real-world objects, including mechanical, electromagnetic, acoustic, inertial, optical and image based devices . We favor the image based tracking method because it is non-invasive and can be applied in both static and dynamic situations. Unlike other approaches, image based visual tracking is a closed-loop approach that tackles the registration and interaction problems simultaneously. Images can provide visual feedback on the registration performance, so that an AR user can know how closely the real and virtual objects match each other. With this visual feedback, interactions with the virtual world can take place more naturally and efficiently.

The author thanks the whole CVAE group as well as colleagues outside Fraunhofer for their kind support and discussions.
G. Bebis et al. (Eds.): ISVC 2005, LNCS 3804, pp. 321–328, 2005. © Springer-Verlag Berlin Heidelberg 2005
322 C. Yuan
One popular approach to the visual tracking problem is using marker objects. In , 2D ARToolkit markers are used to render virtual objects onto them. A cube with different colors on each side of its surface has been used in , where the cube is localized in an image by the CSC color segmentation algorithm. In , the 3D pose of a dotted pattern is recovered using a pair of stereo cameras. Because these marker objects are designed only for tracking, they are not suitable for interaction purposes.
A few other works suggest using hand gestures as tangible interfaces. In , a pointing posture is detected based on human body segmentation, combining a background subtraction method with region categorization. Another example is the augmented desk interface . Here the arms of a user are segmented from the infrared input image using a simple threshold operation. After that, fingertips are searched for within fixed-size regions using a template matching algorithm. Gestures are then recognized based on multiple fingertip trajectories.
In this paper, we present a new approach that is capable of real-time tracking of the physical world as well as the creation of natural and easy-to-use interfaces. By relating real-world objects to their counterparts in the augmented virtual world one by one, a set of interaction units can be constructed so that the virtual world can be manipulated seamlessly by AR users operating on those real objects .
The proposed tracking approach contributes to the state of the art in several aspects. First, both real-world objects and free-hand gestures are tracked simultaneously to satisfy different interaction purposes. Unlike the markers used in the references, the objects we have designed are much smaller, which makes them much easier to grasp. Second, our tracking system can support multiple users, who can interact with the AR world either individually or cooperatively. Last but not least, the tracking cameras in our system are allowed to move freely in unconstrained environments, while most tracking systems can only handle static camera(s).
The remainder of this paper is organized as follows. Sect. 2 gives an overview of the tracking system. Sect. 3 presents the visual tracking algorithm. Interaction mechanisms based on the results of visual tracking are shown in Sect. 4. System performance is evaluated and discussed in Sect. 5, followed by a summary in Sect. 6.
2 System Overview
The tracking system is designed to be used in a multi-user AR environment, where several users need to interact collaboratively with the virtual world rendered on top of a round table (see Fig. 1(a)). For different purposes, different kinds of interaction mechanisms are needed. Hence we use various 2D/3D objects as well as hand gestures as input devices. The scene captured by the tracking system is very dynamic, as both foreground and background objects are changing constantly and unexpectedly.
The users can sit or stand, and can move around the table to examine the virtual world from different viewpoints. In order for the system to keep tracking the hand gestures while the users are moving freely, cameras are mounted on the head-mounted displays (HMD). As a result, both the objects and the cameras are moving all the time. To enable dynamic interactions with the target objects in the virtual world, the 3D pose parameters of the objects and gestures should be estimated precisely and in real time.
Fig. 1. (a) Multiple AR users interact with the augmented virtual world. (b) Objects and gestures used in the tracking system. (c) Offline color calibration. (d) Illustration of recognition and tracking results. (e) Manipulation of the virtual buildings. (f) Creation of new 3D models.
The central task of the vision based 3D interface is the identification and tracking of multiple colored objects appearing in the camera view. As shown in Fig. 1(b), the objects comprise six 2D place holder objects (PHOs), two 3D pointers, and a set of gestures. PHOs are 2D colored objects with a 3DOF (degrees of freedom) pose. They are called place holders because they are used mainly to be related to their virtual counterparts. The pose of the pointers is 6DOF. They are pointing devices that can be used to point at some virtual objects in 3D.
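The distinction between 3DOF place holders and 6DOF pointers can be sketched as two simple pose records. The field names and units below are illustrative assumptions, not the paper's internal representation:

```python
from dataclasses import dataclass

# Hypothetical pose records for the two object classes described above.

@dataclass
class PHOPose:
    """2D place holder object: 3DOF pose (table-plane position plus in-plane rotation)."""
    x: float      # position on the table plane (units are an assumption)
    y: float
    theta: float  # in-plane orientation in radians

@dataclass
class PointerPose:
    """3D pointer or pointing gesture: full 6DOF pose (position and orientation)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

def dof(pose) -> int:
    """Number of degrees of freedom encoded in a pose record."""
    return len(vars(pose))
```

A PHO thus carries three parameters, a pointer six, which is why pointers need the stereo camera pair described in Sect. 2 while PHOs can be tracked by the single overhead camera.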
There are altogether six kinds of gestures used in the system, with the hand showing zero (a fist gesture) to five fingers. The gesture with one finger is a dynamic pointing
gesture whose 6DOF pose can be tracked in the same way as that of the 3D pointers. The other five gestures are also tracked continuously. But unlike the pointing gesture, these gestures are tracked only in 3DOF, as they are generally used as visual commands to trigger certain operations in the virtual world. Some HCI applications don't require the pose of a gesture to be known . However, pose parameters of even a static gesture are indispensable for 3D interactions in location critical applications.
The tracking system uses a static camera (Elmo CC491 camera unit with lipstick-size micro-head QP49H) hanging over the round table to recognize the PHOs. Each AR user wears a pair of head-mounted cameras (HMC), installed horizontally on the left and right sides of the HMD. Each HMC is made of a pair of stereo cameras (JAI CVM 2250 micro-head camera) for 3D pose estimation. Pointers can be tracked by all the users' HMCs. Gestures made by an AR user are tracked only by the HMC on his own head. To increase tracking speed, the right image of a stereo pair will only be processed if pointers or gestures have been recognized in the left image.
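The left-image gating just described can be sketched as follows. The two callables are hypothetical placeholders for the system's actual recognition and stereo pose estimation steps, not the paper's code:

```python
def process_stereo_pair(left_img, right_img, detect, estimate_pose_stereo):
    """Speed optimization from Sect. 2: the right image of a stereo pair is
    analysed only when a pointer or gesture has already been recognized in
    the left image. `detect` and `estimate_pose_stereo` are placeholder
    callables standing in for the real recognition/pose components."""
    detections = detect(left_img)   # color segmentation + shape analysis
    if not detections:
        return []                   # nothing found: skip the right image entirely
    # Only now pay the cost of the second view to recover 3D pose.
    return [estimate_pose_stereo(d, left_img, right_img) for d in detections]
```

Since most frames of a moving head-mounted camera contain no pointer or gesture, skipping the second image in the empty case roughly halves the average per-frame workload.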
3 Visual Object Tracking

Visual tracking for AR involves several steps: object detection, object identification and object pose estimation. In the whole system, tracking is done using colors. First, colored regions are detected. Then the shapes of the colored regions are analyzed to identify the objects and gestures. After an object or a gesture is identified, its 2D/3D pose will be estimated. Though we do use inter-frame information to guide tracking, it is not necessary to use a general-purpose tracking algorithm such as the Condensation or mean-shift algorithm, as the scene is very dynamic (both cameras and objects move irregularly).
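The detect → identify → estimate-pose sequence above can be sketched as a per-frame pipeline. All four callables are hypothetical placeholders for the components the paper describes, not its actual implementation:

```python
def track_frame(frame, classify_pixels, find_regions, identify_shape, estimate_pose):
    """Per-frame tracking pipeline sketch for Sect. 3:
    (1) pixel-wise color classification, (2) grouping into colored regions,
    (3) shape analysis to identify the object or gesture,
    (4) 2D/3D pose estimation for each identified item."""
    label_map = classify_pixels(frame)        # color segmentation (Sect. 3.1)
    results = []
    for region in find_regions(label_map):    # connected colored regions
        identity = identify_shape(region)     # which PHO / pointer / gesture?
        if identity is None:
            continue                          # background clutter: reject region
        results.append((identity, estimate_pose(identity, region)))
    return results
```

Rejecting unidentifiable regions between steps (3) and (4) is what gives the pipeline its robustness to background clutter: pose estimation is only ever run on regions that passed the shape check.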
3.1 Color Segmentation

Color regions are segmented by identifying the different colors based on pixel-wise classification of the input images. For each of the colors used in the tracking system, a Gaussian model is built to approximate its distribution in the normalized red-green color space (r = R/(R+G+B), g = G/(R+G+B)). Since color is very sensitive to changes in lighting conditions, adaptable color models are built in an offline color calibration process before the tracking system works online. The calibration is done interactively by putting objects in different locations. The adaptability of the color model can be visualized after calibration. To test the calibration result, the user just clicks on a color region and sees whether it can be segmented properly.
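A minimal sketch of such a Gaussian color model in normalized red-green space follows. The Mahalanobis-distance acceptance test and its 3-sigma threshold are illustrative assumptions; the paper does not state its exact classification rule:

```python
import numpy as np

def normalized_rg(img):
    """Convert an RGB image of shape (H, W, 3) to normalized chromaticity:
    r = R/(R+G+B), g = G/(R+G+B). Dividing out the intensity sum is what
    gives this space some robustness to lighting changes."""
    rgb = img.astype(np.float64)
    s = rgb.sum(axis=2) + 1e-9          # guard against division by zero on black pixels
    return rgb[..., 0] / s, rgb[..., 1] / s

def gaussian_color_model(samples):
    """Fit a 2D Gaussian (mean, covariance) to calibration samples of one
    color, given as an (N, 2) array of (r, g) values."""
    return samples.mean(axis=0), np.cov(samples.T)

def classify_pixel(rg, model, threshold=9.0):
    """Accept a pixel if its squared Mahalanobis distance to the model mean
    is below a threshold (9.0, i.e. 3 sigma, is an illustrative choice)."""
    mean, cov = model
    d = np.asarray(rg) - mean
    return float(d @ np.linalg.inv(cov) @ d) < threshold
```

In the calibration step described above, the (N, 2) sample array for each color would be collected interactively from objects placed at different table locations, so that the fitted covariance absorbs the illumination variation across the workspace.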