Visual Tracking for Seamless 3D Interactions in Augmented Reality
Fraunhofer Institute for Applied Information Technology, Collaborative Virtual and Augmented Environments,
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
chunrong.firstname.lastname@example.org
Abstract. This paper presents a computer vision based approach for creating 3D tangible interfaces, which can facilitate real-time and flexible interactions with the augmented virtual world. This approach uses real-world objects and free-hand gestures as interaction handles. The identity of these objects/gestures as well as their 3D pose in the physical world can be tracked in real time. Once the objects and gestures are perceived and localized, the corresponding virtual objects can be manipulated dynamically by human operators who are operating on those real objects. Since the tracking algorithm is robust against background clutter and adaptable to illumination changes, it performs well in real-world scenarios, where both objects and cameras move rapidly in unconstrained environments.
1 Introduction

Augmented Reality (AR) deals mainly with the visual enhancement of the physical world. The interactive aspect of AR requires tangible interfaces that can invoke dynamic actions and changes in the augmented 3D space. On the one hand, the concept of tangible interfaces makes it possible to develop interactive AR applications. On the other hand, reliable systems that can retrieve the identity and location of real-world objects have to be developed. It is obvious that successful AR interactions depend, among other things, largely on the robust processing and tracking of real-world objects. According to , many AR systems will not be able to run without accurate registration of the real world.
Various means can be employed for the tracking of real-world objects, including mechanical, electromagnetic, acoustic, inertial, optical and image based devices . We favor the image based tracking method because it is non-invasive and can be applied in both static and dynamic situations. Unlike other approaches, image based visual tracking is a closed-loop approach that tackles the registration and interaction problems simultaneously. Images can provide visual feedback on the registration performance, so that an AR user can know how closely the real and virtual objects match each other. With this visual feedback, interactions with the virtual world can take place more naturally and efficiently.

The author thanks the whole CVAE group as well as colleagues outside Fraunhofer for their kind support and discussions.
G. Bebis et al. (Eds.): ISVC 2005, LNCS 3804, pp. 321–328, 2005. © Springer-Verlag Berlin Heidelberg 2005
322 C. Yuan
One popular approach to the visual tracking problem is using marker objects. In , 2D ARToolkit markers are used to render virtual objects onto them. A cube with different colors on each side of its surface has been used in , where the cube is localized in an image by the CSC color segmentation algorithm. In , the 3D pose of a dotted pattern is recovered using a pair of stereo cameras. Because these marker objects are designed only for tracking, they are not suitable for interaction purposes.
A few other works suggest using hand gestures as tangible interfaces. In , a pointing posture is detected based on human body segmentation, combining a background subtraction method with region categorization. Another example is the augmented desk interface . Here the arms of a user are segmented from the infrared input image using a simple threshold operation. After that, fingertips are searched for within fixed-size regions using a template matching algorithm. Gestures are then recognized based on multiple fingertip trajectories.
In this paper, we present a new approach that is capable of real-time tracking of the physical world as well as the creation of natural and easy-to-use interfaces. By relating real-world objects to their counterparts in the augmented virtual world one by one, a set of interaction units can be constructed so that the virtual world can be manipulated seamlessly by AR users operating on those real objects .
The proposed tracking approach contributes to the state of the art in several aspects. First, both real-world objects and free-hand gestures are tracked simultaneously to satisfy different interaction purposes. Unlike the markers used in the references, the objects we have designed are much smaller, which makes them much easier to grasp. Second, our tracking system can support multiple users, who can interact with the AR world either individually or cooperatively. Last but not least, the tracking cameras in our system are allowed to move freely in unconstrained environments, while most tracking systems can only handle static camera(s).
The remainder of this paper is organized as follows. Sect. 2 gives an overview of the tracking system. Sect. 3 presents the visual tracking algorithm. Interaction mechanisms based on the results of visual tracking are shown in Sect. 4. System performance is evaluated and discussed in Sect. 5, followed by a summary in Sect. 6.
2 System Overview
The tracking system is designed to be used in a multi-user AR environment, where several users need to interact collaboratively with the virtual world rendered on top of a round table (see Fig. 1(a)). For different purposes, different kinds of interaction mechanisms are needed. Hence we use various 2D/3D objects as well as hand gestures as input devices. The scene captured by the tracking system is very dynamic, as both foreground and background objects are changing constantly and unexpectedly.
The users can sit or stand, and can move around the table to examine the virtual world from different viewpoints. In order for the system to keep tracking the hand gestures while the users are moving freely, cameras are mounted on the head-mounted displays (HMD). As a result, both the objects and the cameras are moving all the time. To enable dynamic interactions with the target objects in the virtual world, the 3D pose parameters of the objects and gestures should be estimated precisely and in real time.
Fig. 1. (a) Multiple AR users interact with the augmented virtual world. (b) Objects and gestures used in the tracking system. (c) Offline color calibration. (d) Illustration of recognition and tracking results. (e) Manipulation of the virtual buildings. (f) Creation of new 3D models.
The central task of the vision based 3D interface is the identification and tracking of multiple colored objects appearing in the camera view. As shown in Fig. 1(b), the objects comprise six 2D place holder objects (PHOs), two 3D pointers, and a set of gestures. PHOs are 2D colored objects with a 3DOF (degrees of freedom) pose. They are called place holders because they are used mainly to be related to their virtual counterparts. The pose of the pointers is 6DOF. They are pointing devices that can be used to point at some virtual objects in 3D.
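The distinction between 3DOF place holders and 6DOF pointers can be sketched as two simple pose records. The field names and units below are illustrative assumptions, not the paper's internal representation:

```python
from dataclasses import dataclass

# Hypothetical pose records for the two object classes described above.

@dataclass
class PHOPose:
    """2D place holder object: 3DOF pose (table-plane position plus in-plane rotation)."""
    x: float      # position on the table plane (units are an assumption)
    y: float
    theta: float  # in-plane orientation in radians

@dataclass
class PointerPose:
    """3D pointer or pointing gesture: full 6DOF pose (position and orientation)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

def dof(pose) -> int:
    """Number of degrees of freedom encoded in a pose record."""
    return len(vars(pose))
```

A PHO thus carries three parameters, a pointer six, which is why pointers need the stereo camera pair described in Sect. 2 while PHOs can be tracked by the single overhead camera.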
There are altogether six kinds of gestures used in the system, with the hand showing zero (a fist gesture) to five fingers. The gesture with one finger is a dynamic pointing
gesture whose 6DOF pose can be tracked in the same way as that of the 3D pointers. The other five gestures are also tracked continuously. But unlike the pointing gesture, these gestures are tracked only in 3DOF, as they are generally used as visual commands to trigger certain operations in the virtual world. Some HCI applications don't require the pose of a gesture to be known . However, pose parameters of even a static gesture are indispensable for 3D interactions in location critical applications.
The tracking system uses a static camera (Elmo CC491 camera unit with lipstick-size micro-head QP49H) hanging over the round table to recognize the PHOs. Each AR user wears a pair of head-mounted cameras (HMC), installed horizontally on the left and right sides of the HMD. Each HMC is made of a pair of stereo cameras (JAI CVM 2250 micro-head camera) for 3D pose estimation. Pointers can be tracked by all the users' HMCs. Gestures made by an AR user are tracked only by the HMC on his own head. To increase tracking speed, the right image of a stereo pair will only be processed if pointers or gestures have been recognized in the left image.
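The left-image gating just described can be sketched as follows. The two callables are hypothetical placeholders for the system's actual recognition and stereo pose estimation steps, not the paper's code:

```python
def process_stereo_pair(left_img, right_img, detect, estimate_pose_stereo):
    """Speed optimization from Sect. 2: the right image of a stereo pair is
    analysed only when a pointer or gesture has already been recognized in
    the left image. `detect` and `estimate_pose_stereo` are placeholder
    callables standing in for the real recognition/pose components."""
    detections = detect(left_img)   # color segmentation + shape analysis
    if not detections:
        return []                   # nothing found: skip the right image entirely
    # Only now pay the cost of the second view to recover 3D pose.
    return [estimate_pose_stereo(d, left_img, right_img) for d in detections]
```

Since most frames of a moving head-mounted camera contain no pointer or gesture, skipping the second image in the empty case roughly halves the average per-frame workload.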
3 Visual Object Tracking

Visual tracking for AR involves several steps: object detection, object identification and object pose estimation. In the whole system, tracking is done using colors. First, colored regions are detected. Then the shapes of the colored regions are analyzed to identify the objects and gestures. After an object or a gesture is identified, its 2D/3D pose will be estimated. Though we do use inter-frame information to guide tracking, it is not necessary to use a general-purpose tracking algorithm such as the Condensation or mean-shift algorithm, as the scene is very dynamic (both cameras and objects move irregularly).
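The detect → identify → estimate-pose sequence above can be sketched as a per-frame pipeline. All four callables are hypothetical placeholders for the components the paper describes, not its actual implementation:

```python
def track_frame(frame, classify_pixels, find_regions, identify_shape, estimate_pose):
    """Per-frame tracking pipeline sketch for Sect. 3:
    (1) pixel-wise color classification, (2) grouping into colored regions,
    (3) shape analysis to identify the object or gesture,
    (4) 2D/3D pose estimation for each identified item."""
    label_map = classify_pixels(frame)        # color segmentation (Sect. 3.1)
    results = []
    for region in find_regions(label_map):    # connected colored regions
        identity = identify_shape(region)     # which PHO / pointer / gesture?
        if identity is None:
            continue                          # background clutter: reject region
        results.append((identity, estimate_pose(identity, region)))
    return results
```

Rejecting unidentifiable regions between steps (3) and (4) is what gives the pipeline its robustness to background clutter: pose estimation is only ever run on regions that passed the shape check.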
3.1 Color Segmentation

Color regions are segmented by identifying the different colors based on pixel-wise classification of the input images. For each of the colors used in the tracking system, a Gaussian model is built to approximate its distribution in the normalized red-green color space (r = R/(R+G+B), g = G/(R+G+B)). Since color is very sensitive to changes in lighting conditions, adaptable color models are built in an offline color calibration process before the tracking system works online. The calibration is done interactively by putting objects in different locations. The adaptability of the color model can be visualized after calibration. To test the calibration result, the user just clicks on a color region and sees whether it can be segmented properly.
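A minimal sketch of such a Gaussian color model in normalized red-green space follows. The Mahalanobis-distance acceptance test and its 3-sigma threshold are illustrative assumptions; the paper does not state its exact classification rule:

```python
import numpy as np

def normalized_rg(img):
    """Convert an RGB image of shape (H, W, 3) to normalized chromaticity:
    r = R/(R+G+B), g = G/(R+G+B). Dividing out the intensity sum is what
    gives this space some robustness to lighting changes."""
    rgb = img.astype(np.float64)
    s = rgb.sum(axis=2) + 1e-9          # guard against division by zero on black pixels
    return rgb[..., 0] / s, rgb[..., 1] / s

def gaussian_color_model(samples):
    """Fit a 2D Gaussian (mean, covariance) to calibration samples of one
    color, given as an (N, 2) array of (r, g) values."""
    return samples.mean(axis=0), np.cov(samples.T)

def classify_pixel(rg, model, threshold=9.0):
    """Accept a pixel if its squared Mahalanobis distance to the model mean
    is below a threshold (9.0, i.e. 3 sigma, is an illustrative choice)."""
    mean, cov = model
    d = np.asarray(rg) - mean
    return float(d @ np.linalg.inv(cov) @ d) < threshold
```

In the calibration step described above, the (N, 2) sample array for each color would be collected interactively from objects placed at different table locations, so that the fitted covariance absorbs the illumination variation across the workspace.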