
Master Erasmus Mundus in Color in Informatics and Media Technology (CIMET)

Object Tracking: State of The Art and CAMSHIFT Improvement Using Multi‐dominant Colors Tracking

Master Thesis Report

Presented by Priyanto Hidayatullah

and defended at the

University of Jean Monnet Saint‐Etienne, France

22nd June 2010

Jury Committee: Supervisor: Prof. Alain Tremeau; Hubert Konik, Ph.D; Prof. Jon Yngve Hardeberg; Faouzi Alaya Cheikh, Ph.D; Javier Hernández-Andrés, Ph.D; Damien Muselet, Ph.D; Eric Dinet, Ph.D


Abstract

Object tracking is a wide area with many available methods and a wide variety of applications. One application is tracking an object in a clickable hypervideo to enrich the interactivity of video applications. In this thesis, several state-of-the-art object tracking methods are reviewed and closely examined. We then select one of them to improve. We choose CAMSHIFT, which is widely accepted as one of the most prominent object tracking methods, runs in real time, and is therefore well suited to clickable hypervideo. CAMSHIFT works very well for tracking single-hue objects when the object's color differs from the background colors.

In this thesis, we try to improve the robustness of CAMSHIFT for tracking multihued objects and for situations where the object's colors are similar to the background colors. To improve robustness when the object's colors are similar to the background colors, we localize the object by selecting each dominant-color object part using a combination of Mean-Shift segmentation and region growing. Hue-distance, saturation and value color histograms are used to describe the object. We also track the dominant-color object parts separately and combine the results to improve the robustness of tracking on multihued objects. Our experiments showed that these methods improved CAMSHIFT significantly. We hope this improvement will be useful for object tracking in clickable hypervideo.

Keywords: Object tracking, CAMSHIFT, Segmentation, Mean-Shift, Hypervideo.



Table of Contents

Abstract
Table of Contents
Table of Figures
1 Introduction
    1.1 The General Aim of The Master Thesis
2 Previous Work
    2.1 Test Videos
    2.2 Object Tracking Categorization
    2.3 Corner Detector Combined with Optical Flow
        2.3.1 Corner detection
        2.3.2 Optical flow
    2.4 Speeded Up Robust Features (SURF)
    2.5 Mean Shift Tracking
    2.6 CAMSHIFT Tracking
        2.6.1 Color probability distribution and histogram back projection
        2.6.2 Mass center calculation
        2.6.3 CAMSHIFT advantages and disadvantages
    2.7 Local Binary Pattern
    2.8 Beyond Semi-Supervised Online Boosting Tracking
    2.9 Method that We Choose
    2.10 CAMSHIFT/Mean-Shift Improvement in Literatures
        2.10.1 Mean-Shift tracking combined with texture histogram
        2.10.2 CAMSHIFT and Mean-Shift combined with interest points
        2.10.3 CAMSHIFT improvement using new HSV model
        2.10.4 CAMSHIFT improvement using hue-distance and saturation features
        2.10.5 CAMSHIFT with improvement of object localization
        2.10.6 CAMSHIFT improvement using adaptive background (ABCShift)
        2.10.7 CAMSHIFT improvement by background subtraction
        2.10.8 The CAMSHIFT improvement method that we choose
        2.10.9 The more specific aim of the master thesis
3 Proposed Method
    3.1 Object Localization
        3.1.1 Preprocessing
        3.1.2 Image color transformation
        3.1.3 Object Selection
        3.1.4 Minimum and maximum values storing
    3.2 Object Modeling
    3.3 Making Color Mask
    3.4 Segmentation
    3.5 Histogram Back Projection
    3.6 Tracking
4 Implementations and Experiments
    4.1 Implementations
    4.2 Experiments Setting
5 Results and Discussions
    5.1 Results
        5.1.1 First Experiment Results
        5.1.2 Second Experiment Results
        5.1.3 Third Experiment Results
        5.1.4 Fourth Experiment Results
    5.2 Discussion
        5.2.1 Some Advantages
        5.2.2 Some Limitations
6 Conclusions and Future Works
    6.1 Conclusions
    6.2 Future Works
7 Bibliography



Table of Figures

Figure 2.1 Test Videos.
Figure 2.2 Illustration of optical flow [28].
Figure 2.3 The Shi-Tomasi corner detector also detects background corners inside the object's rectangle
Figure 2.4 SURF Tracker Result.
Figure 2.5 Intuitive description of Mean-Shift. [14]
Figure 2.6 Summary of CAMSHIFT algorithm.
Figure 2.7 LBP and CS-LBP features for a neighborhood of 8 pixels [16]
Figure 2.8 Example of LBP calculation [16]
Figure 2.9 LBP Tracker Result in second video
Figure 2.10 The core classifier system: detector, recognizer and tracker. [20]
Figure 2.11 Comparing LBP Image and its back projection image.
Figure 2.12 SURF and CAMSHIFT 1
Figure 2.13 SURF and CAMSHIFT 2.
Figure 2.14 CAMSHIFT with new HSV model.
Figure 2.15 CAMSHIFT improvement with hue-distance saturation features.
Figure 2.16 Foreground extraction.
Figure 2.17 Sample of elongated object
Figure 2.18 Background subtraction in static background.
Figure 2.19 Background subtraction in dynamic background.
Figure 3.1 A sample of complex shape object
Figure 3.2 Object Localization using only region growing
Figure 3.3 More precise object localization with only a single click.
Figure 3.4 Text file configuration to tune the parameters
Figure 3.5 Color mask illustration
Figure 3.6 Segmentation for smoothing and noise removal of third test video
Figure 3.7 Histogram Back Projection of first test video
Figure 3.8 Maximum rectangle illustration.
Figure 3.9 The proposed method's schema
Figure 4.1 Hue histogram of airplane body.
Figure 5.1 First video result with the proposed method.
Figure 5.2 First video result with classic CAMSHIFT at frame 33.
Figure 5.3 Object localization comparison
Figure 5.4 Second video result with our proposed method.
Figure 5.5 Second video result with classic CAMSHIFT.
Figure 5.6 Third video result with the proposed method.
Figure 5.7 Third video best result with classic CAMSHIFT at frame 300.
Figure 5.8 Object (marked with red rectangle) tracked by the proposed method.
Figure 5.9 Fourth video best result with classic CAMSHIFT at frame 57.
Figure 5.10 Drifting tracker
Figure 5.11 Multiple object tracking using our proposed method



1 Introduction

Object tracking is one of the fastest-emerging areas in computer vision, and it has many applications. One of them is tracking an object in a clickable hypervideo. Hypervideo is a displayed video stream that contains embedded user-clickable anchors [19]. In this application, the user can interact with the video in the same way as with a website, which enriches the interactivity of the video. This capability has many advantages. For example, users can monetize their videos by placing companies' links inside them and, conversely, companies can promote their products in videos.

Another interesting capability is object tracking in hypervideo. This means the user can select any object in a video and track it along the video sequence. For example, a user who has a favorite player in a football match could track that player's movement throughout the match. The same holds for a user who wants to track a favorite racer in F1 videos, a favorite star in a movie, etc.

In this thesis, we try to improve an object tracking method that can be used in hypervideo. Several state-of-the-art object tracking methods are reviewed, tested and closely examined. We then select one of them and improve it.

1.1 The General Aim of The Master Thesis

The general objectives of the master thesis can be summarized in these points:

1) Study some state-of-the-art object tracking methods

2) Choose one to improve based on some criteria

3) Improve the chosen method, under some constraints if needed



2 Previous Work

Object tracking is a very wide area in computer vision. There are many kinds of methods, some of which are suitable only for specific conditions. This chapter reviews some of the state-of-the-art object tracking methods currently available.

2.1 Test Videos

Before going deeper into the state-of-the-art methods, this section presents the test videos that we used to examine those methods and to help us choose one of them. These test videos are also used to compare our proposed method against the chosen method without our improvement.

The first one is a yellow trunk (Figure 2.1(a)). This is the simplest case: the object is single-hued, with scaling, rotation and slight deformation, in front of a dynamic background whose color is quite different from the object's. The object is yellow while the background is mostly blue. In the middle of the video, partial dynamic occlusion occurs. The dimension of the video is 1280 x 720 pixels in 24 bit. The purpose of this video is to test the robustness of the state-of-the-art object tracking methods, as well as our proposed method, to partial occlusion, scaling, rotation and deformation of the object.

Figure 2.1 Test Videos. (a) First video: yellow trunk (b) Second video: airplane (c) Third video: small scaling toy (d) Fourth video: football match

The second test video shows an airplane flying above the sea with some small islands below and mild cloud distraction (Figure 2.1(b)). This is a multihued object passing through a dynamic background. Some parts of the background have colors similar to those of some object parts, which creates distractions. The dimension of the video is 1280 x 720 pixels in 24 bit. The purpose of using this video is to test the robustness of object tracking methods on a multihued object with some distractions.

The third video, which is available in [33], shows a small toy with several dominant colors moving across a complex background (Figure 2.1(c)). This is a multihued object in front of a complex background whose colors are very similar to the object's. Further challenges in this video are scaling and skewing of the object. The object moves away from the camera until it becomes very small, and it skews several times. The object also moves very fast, which makes it harder to track.

The background is actually static. Usually, for this kind of video, background subtraction is very powerful. But because the object stays in place for quite a long time in the early frames, even background subtraction will have problems: it needs a lot of training data to build a good background model. Moreover, in the middle of the video, some movement in the background can ruin the background model. Finally, the target object is not the only thing moving: the hand and the paper below the object also move, which poses further challenges for background subtraction.

The dimension of the video is 640 x 480 pixels in 24 bit. The purpose of this video is to test the robustness of object tracking methods on a multihued object in front of a similarly colored background. The challenging scaling and skewing of the object are also important for testing object tracking performance.

The fourth video is a football match (Figure 2.1(d)), which is available in [34]. In this video there is almost full occlusion, and there is distraction from moving objects of similar color. The object is also very small, which is a great challenge for some object tracking methods. The dimension of the video is 544 x 436 pixels in 24 bit. The purpose of using this video is to test the robustness of object tracking methods on a very small object with almost full occlusion and distraction from other similarly colored objects.


2.2 Object Tracking Categorization

In [17], Yilmaz et al. report the results of a survey of object tracking methods. They propose a categorization of object tracking with representative methods for each category. The categories are divided into object detection categories and object tracking categories. The categorization was presented in 2006. We update its examples with some more recent methods in each category so that it is more relevant to our master thesis. The categorization is summarized in Table 2.1 and Table 2.2.

In the following sections, we choose some representative methods to study based on these criteria:

1) Acceptability. Methods that are widely used by researchers are preferred.

2) Recentness. We prefer more recent methods to older ones.

2.3 Corner Detector Combined with Optical Flow

In [5], Bradski states that one of the basic ways to track an object is to select representative point features and track those features using optical flow. This is one of the most intuitive methods in object tracking. The KLT tracker, as the representative of this method, is a well-known object tracking method. That is why we choose this method to review. This method represents the point detectors and kernel tracking categories in the categorization of Yilmaz et al. [17].

Categories: Representative Work

Point detectors: Harris detector [Harris and Stephens 1988]; KLT detector [Shi and Tomasi 1994]; Scale Invariant Feature Transform [Lowe 2004]; Speeded Up Robust Features [Bay 2006]

Segmentation: Active contours [Caselles et al. 1995]; Mean-shift [Comaniciu and Meer 1999]

Texture Descriptor: Gray co-occurrence matrices [C. C. Gotlieb et al., 1990]; Gabor filtering [G. Wouwer et al., 1999]; Local Binary Pattern [Ojala, Pietikainen, 2001]

Background Modeling: Mixture of Gaussians [Stauffer and Grimson 2000]; Eigenbackground [Oliver et al. 2000]; Dynamic texture background [Monnet et al. 2003]

Supervised Classifiers: Support Vector Machines [Papageorgiou et al. 1998]; Neural Networks [Rowley et al. 1998]; Adaptive Boosting [Viola et al. 2003]; Beyond Semi Supervised Online Boosting [Stalder et al. 2009]

Table 2.1 Object Detection Categories [17]


Categories: Representative Work

Point Tracking
• Deterministic methods: MGE tracker [Salari and Sethi 1990]; GOA tracker [Veenman et al. 2001]
• Statistical methods: Kalman filter [Broida and Chellappa 1986]; JPDAF [Bar-Shalom and Foreman 1988]; PMHT [Streit and Luginbuhl 1994]

Kernel Tracking
• Template and density based appearance models: KLT [Shi and Tomasi 1994]; CAMSHIFT [Bradski, 1998]; Layering [Tao et al. 2002]
• Multi-view appearance models: Eigentracking [Black and Jepson 1998]; SVM tracker [Avidan 2001]

Silhouette Tracking
• Contour evolution: State space models [Isard and Blake 1998]; Variational methods [Bertalmio et al. 2000]; Heuristic methods [Ronfard 1994]
• Matching shapes: Hausdorff [Huttenlocher et al. 1993]; Hough transform [Sato and Aggarwal 2004]

Table 2.2 Tracking Categories [17]

2.3.1 Corner detection

Representative features are naturally the features that will most probably show some significant change in the next frame. Ideally we select unique (or almost unique) points so that they can be tracked more easily. One can take points that have a strong derivative, but such points may lie along an edge, where they look like all the other points on that edge. If instead we require strong derivatives in two orthogonal directions, we can hope that the points are unique. Such points are called corners.

To detect corners, one method that can be used is the KLT (Shi-Tomasi) corner detector [26]. An implementation is available in OpenCV 2.0 [24] as the function cvGoodFeaturesToTrack(). This function computes the required second derivatives (using Sobel operators) and from them computes the required eigenvalues. It then returns a list of points that meet the requirements of good features to track.
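The idea of requiring strong derivatives in two orthogonal directions can be illustrated with a small sketch. It is not the OpenCV implementation: it assumes a grayscale patch stored as a list of rows, uses central differences instead of Sobel operators, and the function name min_eigenvalue_score is hypothetical. The score is the smaller eigenvalue of the 2 x 2 matrix of summed gradient products, which is large only when the patch varies in both directions.

```python
import math

def min_eigenvalue_score(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix summed over the patch interior."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences (an assumption; OpenCV uses Sobel operators).
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - root

flat = [[10] * 5 for _ in range(5)]                  # no structure
edge = [[0, 0, 0, 100, 100] for _ in range(5)]       # vertical edge only
corner = [[100 if x >= 2 and y >= 2 else 0 for x in range(5)] for y in range(5)]

for name, patch in (("flat", flat), ("edge", edge), ("corner", corner)):
    print(name, min_eigenvalue_score(patch))
```

The flat patch and the pure edge both score zero, while the corner patch scores well above zero, which is why a threshold on this value keeps corners and rejects edges.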

2.3.2 Optical flow

Another approach to tracking a region defined by a primitive shape is to compute its translation using an optical flow method. Optical flow methods are used for generating dense flow fields by computing the flow vector of each pixel [17]. One of the best-known optical flow algorithms is the Lucas-Kanade algorithm. Its most basic equation, stated in [27], is

(1)

The goal of feature tracking is as follows: for a given point u in image I, find its corresponding location v = u + d in the next image J such that I(u) and J(v) are "similar". The displacement vector d is the image velocity at u, also known as the optical flow at u [27]. The similarity function is measured on an image neighborhood of size (2ωx + 1) x (2ωy + 1). This neighborhood is also called the integration window. Here ωx and ωy are two integers with typical values of 2, 3, 4, 5, 6 or 7 pixels.
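One common way to make "similar" precise is the sum of squared differences between I and J over the integration window. The sketch below assumes that measure and, purely for illustration, finds d by brute force over integer displacements; the actual Lucas-Kanade algorithm instead solves for d iteratively from image gradients. The function names and the search parameter are hypothetical.

```python
def ssd_residual(I, J, u, d, wx=1, wy=1):
    """Sum of squared differences between I around u and J around u + d."""
    ux, uy = u
    dx, dy = d
    total = 0
    for y in range(uy - wy, uy + wy + 1):
        for x in range(ux - wx, ux + wx + 1):
            diff = I[y][x] - J[y + dy][x + dx]
            total += diff * diff
    return total

def best_integer_displacement(I, J, u, search=2, wx=1, wy=1):
    """Brute-force stand-in for the LK solver: try every d in a small range."""
    candidates = [(dx, dy)
                  for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)]
    return min(candidates, key=lambda d: ssd_residual(I, J, u, d, wx, wy))

# Synthetic images: J is I translated by d = (1, 2).
def intensity(x, y):
    return x * x + 3 * y * y

I = [[intensity(x, y) for x in range(9)] for y in range(9)]
J = [[intensity(x - 1, y - 2) for x in range(9)] for y in range(9)]

print(best_integer_displacement(I, J, u=(4, 4)))
```

With wx = wy = 1 the integration window is 3 x 3 pixels, matching the (2ωx + 1) x (2ωy + 1) size described above.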

The basic idea of the Lucas-Kanade algorithm rests on three assumptions [5]:

1) Brightness constancy. A pixel of an object in an image does not change in appearance as it (possibly) moves from frame to frame. For a grayscale image, this means we assume that the brightness of a pixel does not change as it is tracked from frame to frame.

2) Small movements. The image motion of an object changes slowly in time.

3) Spatial coherence. Neighboring points in a scene that belong to the same surface have similar motion.

Figure 2.2 Illustration of optical flow[28].

The disadvantage of using a small local window in Lucas-Kanade is that large motions can move points outside the local window, making them impossible to track [5]. This led to the development of the pyramidal LK algorithm, which starts tracking at the highest level of an image pyramid (lowest detail) and works down to the lower levels (finer detail). Tracking over image pyramids makes it possible to track large motions with local windows. In 1994, Shi and Tomasi proposed the KLT tracker, which iteratively computes the translation (du, dv) of a region (e.g., a 25 × 25 patch) centered on an interest point [17].
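The reason a pyramid helps can be seen from a minimal sketch: each level halves the resolution, so a motion of 8 pixels at full resolution is only 2 pixels two levels up and fits inside a small window again. The sketch assumes plain 2 x 2 averaging; real implementations typically smooth with a Gaussian before subsampling. The function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block (no Gaussian smoothing)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)]
            for y in range(h)]

def build_pyramid(img, levels):
    """Level 0 is the original image; each further level is half the size."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

image = [[x + 8 * y for x in range(8)] for y in range(8)]
pyramid = build_pyramid(image, levels=2)
print([(len(level), len(level[0])) for level in pyramid])
```

Tracking would start at the last (coarsest) entry of the list and use the displacement found there, doubled, as the initial guess one level down.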

Figure 2.3 The Shi-Tomasi corner detector also detects background corners inside the object's rectangle

We tried both methods in combination, using the implementations in OpenCV 2.0. We draw a rectangle bounding the object, detect the corners inside it using the Shi-Tomasi method, and detect the movement of those corners using the pyramidal Lucas-Kanade method. To update the position of the object's rectangle, we average the movement of all the corners, based on the assumption of spatial coherence.

Nevertheless, the result is not satisfying, because the Shi-Tomasi corner detector gives us not only the object's corners but also background corners inside the object's rectangle (Figure 2.3). For example, if the object moves upward then the background moves downward. When we average the movement of all the corners inside the object's rectangle (both object and background corners) to compute the movement of the rectangle, the estimate is corrupted by the background motion. It is difficult to tune the Shi-Tomasi corner detection parameters so that the detector returns only the object's corners inside the object's rectangle.
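The failure described above is easy to reproduce with made-up numbers. In the sketch below, which uses hypothetical helper names and invented corner coordinates, two corners on the object move up by 10 pixels while two background corners inside the same rectangle move down by 10 pixels. The averaged displacement is zero, so the rectangle does not follow the object at all.

```python
def mean_displacement(prev_pts, next_pts):
    """Average (dx, dy) over all tracked corners, object and background alike."""
    n = len(prev_pts)
    dx = sum(q[0] - p[0] for p, q in zip(prev_pts, next_pts)) / n
    dy = sum(q[1] - p[1] for p, q in zip(prev_pts, next_pts)) / n
    return dx, dy

def shift_rect(rect, d):
    """Move an (x, y, width, height) rectangle by displacement d."""
    x, y, w, h = rect
    return (x + d[0], y + d[1], w, h)

# Invented example: first two corners lie on the object, last two on the background.
prev_pts = [(50, 50), (60, 55), (45, 70), (65, 72)]
next_pts = [(50, 40), (60, 45), (45, 80), (65, 82)]

d = mean_displacement(prev_pts, next_pts)
print(d, shift_rect((40, 40, 30, 40), d))
```

Averaging only the first two corners would give the correct motion of (0, -10), which is why separating object corners from background corners matters so much for this approach.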

2.4 Speeded Up Robust Features (SURF)

SURF is an image detector and descriptor based on interest points of the image. It is widely accepted by researchers as one of the most prominent interest point detectors and descriptors. That is why we choose this method to review. This method represents the point detectors category in the categorization of Yilmaz et al. [17].

In [15], Bay et al. explain that interest point detection uses a very basic Hessian-matrix approximation. This lends itself to the use of integral images, which reduces the computation time drastically. Interest points need to be found at different scales, not least because the search for correspondences often requires comparing them in images where they are seen at different scales. Scale spaces are usually implemented as an image pyramid: the images are repeatedly smoothed with a Gaussian and then subsampled to obtain the next higher level of the pyramid. To localize interest points in the image and over scales, non-maximum suppression in a 3 x 3 x 3 neighborhood is applied.
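The speed-up from integral images comes from the fact that the sum over any axis-aligned box needs only four lookups, regardless of the box size. A minimal sketch of that data structure follows. It is not taken from the SURF code; the function names and the half-open box convention are assumptions made for the example.

```python
def integral_image(img):
    """ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3), box_sum(ii, 1, 1, 3, 3))
```

Box filters of any size therefore cost the same, which is what makes approximating the Hessian with box filters cheap at every scale.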

For interest point description and matching, they build on the distribution of first-order Haar wavelet responses in the x and y directions rather than the gradient, exploit integral images for speed, and use only 64 dimensions. This reduces the time for feature computation and matching, and has been shown to increase robustness at the same time. Furthermore, they introduce a new indexing step based on the sign of the Laplacian, which increases the robustness of the descriptor and the matching speed. The sign of the Laplacian distinguishes bright blobs on dark backgrounds from the reverse situation. This feature is available at no extra computational cost, as it is already computed during the detection phase.
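The indexing step can be sketched as a filter applied before any descriptor distance is computed. The example below is only an illustration: it assumes each feature is stored as a (Laplacian sign, descriptor) pair, uses 2-dimensional toy descriptors instead of 64 dimensions, and the function name nearest_match is hypothetical.

```python
def nearest_match(query, candidates):
    """Index of the closest candidate with the same Laplacian sign, or None."""
    q_sign, q_desc = query
    best_index, best_dist = None, float("inf")
    for i, (sign, desc) in enumerate(candidates):
        if sign != q_sign:
            continue  # a bright blob never matches a dark blob; skip the distance
        dist = sum((a - b) ** 2 for a, b in zip(q_desc, desc))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

candidates = [
    (-1, [0.1, 0.9]),   # identical descriptor but opposite contrast
    (+1, [0.2, 0.8]),
    (+1, [0.9, 0.1]),
]
print(nearest_match((+1, [0.1, 0.9]), candidates))
```

The first candidate has exactly the query's descriptor but is rejected because its sign differs, so the match goes to the second candidate. Skipping mismatched signs both avoids such false matches and saves the distance computation.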

In conclusion, they claim that SURF works very well for classification tasks, performing better than previous methods (SIFT, GLOH) while still being faster to compute. They state that SURF should be very well suited for tasks in object detection, object recognition or image retrieval.

This method is widely used nowadays, which led us to try it for object tracking. We use the code provided by the authors in [25]. We perform a simple tracking procedure: we draw a bounding rectangle around the object, find the interest points inside it, and store them as the object model. We then evaluate the next frame, find the matched points, and calculate their displacement. Finally, we move the object rectangle based on the displacement of those matched points.

With the steps above, we test SURF on our test videos (Figure 2.4). In the first video (yellow trunk), SURF fails to detect the interest points. In the second video (airplane), SURF detects the object and moves the rectangle quite well. In the third video, SURF detects the object in the first several frames, but when the object is too far from the camera and becomes too small, SURF fails. Moreover, the SURF implementation sometimes gives wrongly matched interest points.

Figure 2.4 SURF Tracker Result. (a) SURF Tracker in first video at frame 1 (b) SURF Tracker in first video at frame 65 (c) SURF Tracker in second video at frame 1 (d) SURF Tracker in second video at frame 95 (e) SURF Tracker in third video at frame 1 (f) SURF Tracker in third video at frame 148. An object rectangle in the left corner means SURF fails to detect the object's interest points.

2.5 Mean Shift Tracking

Mean-Shift is a robust method for finding the mode of the density distribution of a data set [5]. The method is versatile, since the density can describe not only color distribution but also texture, motion, etc. [2,5]. For continuous distributions this is an easy process: it is essentially just hill climbing applied to a density histogram of the data [5]. The method is efficient compared to standard template matching since it eliminates the brute-force search [17]. These characteristics have made Mean-Shift widely accepted by researchers. That is why we choose this method to review. This method represents the segmentation category of object detection in the categorization of Yilmaz et al. [17].

The method can be summarized intuitively as follows [14]:

1) Start with a window (region of interest) of arbitrary position and size.

2) Find the mean-shift vector.

3) Move the window along the vector so that the center of the window is now at the end point of the vector (the mean).

4) Recalculate the vector inside the current window position.

5) Return to step 3) until convergence. Convergence here means that the window movement is below a threshold or that the mean-shift procedure has been carried out for a given number of iterations.
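The steps above can be sketched in one dimension for brevity. The example assumes a flat window of fixed radius over a list of sample values, so the mean-shift vector is simply the difference between the mean of the samples inside the window and the current window center. The function name and the sample data are made up for illustration.

```python
def mean_shift_1d(points, start, radius, eps=1e-3, max_iter=100):
    """Hill-climb from `start` to a local mode of the point density."""
    center = start
    for _ in range(max_iter):
        inside = [p for p in points if abs(p - center) <= radius]
        if not inside:
            break  # empty window: nothing to climb towards
        new_center = sum(inside) / len(inside)   # end point of the mean-shift vector
        if abs(new_center - center) < eps:       # movement below threshold
            return new_center
        center = new_center
    return center

# Two clusters of samples, around 1.0 and around 5.0.
points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 5.2, 4.8]
print(mean_shift_1d(points, start=3.5, radius=2.0))
print(mean_shift_1d(points, start=0.0, radius=2.0))
```

Different starting positions converge to different modes, which shows that the result depends on where the initial window is placed.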

(a) (b)

Figure 2.5 Intuitive description of Mean-Shift.[14]

Mean-shift was not originally meant to be a tracking algorithm [1]. In [2], Comaniciu et al. applied mean-shift to discontinuity-preserving filtering and image segmentation. Later, Comaniciu introduced mean-shift for tracking non-rigid objects [29].

Bradski in [5] states that the mean-shift calculation can be simplified by considering a rectangular kernel. A rectangular kernel is a kernel with no falloff with distance from the center, until a single sharp transition to zero. This is in contrast to the exponential falloff of a Gaussian kernel and the falloff with the square of the distance from the center in the commonly used Epanechnikov kernel. The simplification reduces the mean-shift vector equation to calculating the center of mass of the image pixel distribution using image moments:

1) Zeroth moment calculation

$M_{00} = \sum_x \sum_y I(x, y)$ (2)

2) First moment calculation

$M_{10} = \sum_x \sum_y x \, I(x, y), \quad M_{01} = \sum_x \sum_y y \, I(x, y)$ (3)

3) Mean search window calculation

$x_c = M_{10} / M_{00}, \quad y_c = M_{01} / M_{00}$ (4)

where I(x, y) is the pixel value at position (x, y), and x and y range over the search window.

In practice, the mean-shift tracking algorithm then runs as follows [5]:

1) Choose a search window and its characteristics:

i. Initial location

ii. Type (uniform, polynomial, exponential, or Gaussian)

iii. Shape (symmetric, rounded, rectangular)

iv. Size

2) Compute the window's center of mass using moments

3) Center the window at the center of mass

4) Return to step 2) until convergence.
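With a rectangular kernel, this loop reduces to repeatedly recentering the window on the center of mass computed from the zeroth and first moments. The sketch below assumes a probability image stored as a list of rows, an odd-sized (x, y, width, height) window, and integer window positions clamped to the image. It is an illustration with hypothetical function names, not the OpenCV implementation.

```python
import math

def center_of_mass(prob, window):
    """Center of mass (M10/M00, M01/M00) of the values inside the window."""
    x0, y0, w, h = window
    m00 = m10 = m01 = 0.0
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            p = prob[y][x]
            m00 += p
            m10 += x * p
            m01 += y * p
    if m00 == 0:
        return None  # no mass in the window
    return m10 / m00, m01 / m00

def mean_shift(prob, window, max_iter=10):
    """Recenter the window on its center of mass until it stops moving."""
    x0, y0, w, h = window
    for _ in range(max_iter):
        com = center_of_mass(prob, (x0, y0, w, h))
        if com is None:
            break
        # New top-left corner so that the window center sits on the center of mass.
        nx = int(math.floor(com[0] - (w - 1) / 2.0 + 0.5))
        ny = int(math.floor(com[1] - (h - 1) / 2.0 + 0.5))
        nx = max(0, min(nx, len(prob[0]) - w))   # keep the window inside the image
        ny = max(0, min(ny, len(prob) - h))
        if (nx, ny) == (x0, y0):
            break  # converged
        x0, y0 = nx, ny
    return (x0, y0, w, h)

# 10x10 probability image with a 3x3 blob centered at (7, 6).
prob = [[1.0 if 6 <= x <= 8 and 5 <= y <= 7 else 0.0 for x in range(10)]
        for y in range(10)]
print(mean_shift(prob, (2, 2, 5, 5)))
```

In this example the initial window overlaps only the edge of the blob, and two moves are enough to center it on the blob. If the initial window contains no mass at all, the sketch simply leaves it where it is.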

This method works well for a single-hue object on a background whose color differs from the object's. Its disadvantage is that mean-shift only gives the mean position; it does not give the object's orientation. In [5], Bradski implies that Mean-Shift does not give the object's size. In [17], Yilmaz notes that Mean-Shift is not rotation invariant.

2.6 CAMSHIFT Tracking

CAMSHIFT is an improvement of the well-known Mean-Shift method that makes the color distribution adapt to the changes in each frame. The heart of CAMSHIFT is Mean-Shift, which gives the center position of the rectangle. CAMSHIFT gives not only the position of the object but also its size and orientation [1,5]. The ability of CAMSHIFT to improve on Mean-Shift by giving the size and orientation of the object is very important in our case. That is why we choose this method to review. This method represents the template and density based appearance models category in the categorization of Yilmaz et al. [17].



CAMSHIFT was originally intended for a real-time perceptual user interface, with human face tracking as the application [1]. It is based on the mean-shift method, modified so that it adapts to color probability distributions that change dynamically from frame to frame in a video [1], because the color distribution of a frame sequence changes over time. CAMSHIFT is now used as a computer interface for controlling computer games.

To serve as a computer interface, CAMSHIFT was developed to meet the following requirements:

1) Real time

2) Can run on inexpensive consumer cameras without lens calibration

For these purposes, the authors decided to focus on color-based tracking and use a color histogram to track the colored object.

The CAMSHIFT algorithm can be summarized in the following steps [1]:

1) Choose the initial region of interest, which contains the object we want to track.

2) Make a color histogram of that region as the object model.

3) Make a probability distribution of the frame using the color histogram. In the implementation, they use the histogram back projection method.

4) Based on the probability distribution image, find the center of mass of the search window using the mean-shift method.

5) Center the search window at the point found in step 4 and iterate step 4 until convergence.

6) Process the next frame with the search window position from step 5.

2.6.1 Color probability distribution and histogram back projection

To track a colored object, CAMSHIFT needs a probability distribution image. The authors use the HSV color system and only the hue component to build a 1D histogram of the object’s color. This histogram is stored and used to convert subsequent frames into the corresponding object probability. The probability distribution image is made by back projecting the 1D hue histogram onto the hue image of each frame. The result is called the back projection image. CAMSHIFT then tracks the object based on this back projection image.

Histogram back projection is a technique for finding how well the pixels of an image fit a histogram. Each pixel of the image is assigned the probability that its value has under the histogram.
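As an illustration, hue histogram back projection can be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation used in [1]: the hue range 0..179 (the common 8-bit convention), the 16 bins, the scaling of the histogram by its peak value, and the function names are our own choices for the example.

```python
def hue_histogram(hues, bins=16, hue_max=180):
    """1D histogram of integer hue values taken from the object region."""
    hist = [0] * bins
    for row in hues:
        for h in row:
            hist[min(h * bins // hue_max, bins - 1)] += 1
    return hist


def back_project(hues, hist, bins=16, hue_max=180):
    """Replace every hue by the (peak-normalized) histogram value of its bin."""
    peak = max(hist)
    if peak == 0:
        return [[0.0] * len(row) for row in hues]
    return [[hist[min(h * bins // hue_max, bins - 1)] / peak for h in row]
            for row in hues]
```

A pixel whose hue falls in the most frequent bin of the model receives the value 1.0, and a pixel whose hue never occurred in the object region receives 0.0.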



2.6.2 Mass center calculation

The mean location of the probability image inside the search window is computed using image moments, where I(x,y) is the intensity of the discrete probability image at (x,y) within the search window.

1) Zeroth moment calculation using formula (2)

2) First moment calculation using formula (3)

3) Mean search window calculation using formula (4)

This phase gives the center position of the object in every frame. However, this is not enough, since the size of the object can vary over time. For example, if the object moves towards or away from the camera, its size changes. This information can be calculated using second moments, which give not only the length and width of the object but also its orientation.

Figure 2.6 Summary of CAMSHIFT algorithm.

The gray box is the mean-shift algorithm [1]

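Second moments are:

$$M_{20} = \sum_{x}\sum_{y} x^2 I(x,y), \qquad M_{02} = \sum_{x}\sum_{y} y^2 I(x,y), \qquad M_{11} = \sum_{x}\sum_{y} x\,y\,I(x,y) \tag{5}$$

The orientation is:

$$\theta = \frac{1}{2}\arctan\left(\frac{2\left(\frac{M_{11}}{M_{00}} - x_c y_c\right)}{\left(\frac{M_{20}}{M_{00}} - x_c^2\right) - \left(\frac{M_{02}}{M_{00}} - y_c^2\right)}\right) \tag{6}$$

Then the length l and width w from the distribution centroid are

$$l = \sqrt{\frac{(a + c) + \sqrt{b^2 + (a - c)^2}}{2}} \tag{7}$$

$$w = \sqrt{\frac{(a + c) - \sqrt{b^2 + (a - c)^2}}{2}} \tag{8}$$

where a, b and c are

$$a = \frac{M_{20}}{M_{00}} - x_c^2 \tag{9}$$

$$b = 2\left(\frac{M_{11}}{M_{00}} - x_c y_c\right) \tag{10}$$

$$c = \frac{M_{02}}{M_{00}} - y_c^2 \tag{11}$$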
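The second-moment computation can be sketched in plain Python as below. This is an illustrative sketch following the formulas of Bradski [5], not code from the thesis: the function name `camshift_shape` and the nested-list image layout are hypothetical, and `atan2` is used in place of a plain arctangent of the ratio so that the case a = c does not divide by zero.

```python
import math


def camshift_shape(prob):
    """Return (x_c, y_c, theta, l, w) of the probability image prob[y][x]."""
    m00 = m10 = m01 = m20 = m02 = m11 = 0.0
    for y, row in enumerate(prob):
        for x, v in enumerate(row):
            m00 += v
            m10 += x * v
            m01 += y * v
            m20 += x * x * v
            m02 += y * y * v
            m11 += x * y * v
    if m00 == 0:
        return None  # empty distribution, shape is undefined
    xc, yc = m10 / m00, m01 / m00
    a = m20 / m00 - xc * xc
    b = 2.0 * (m11 / m00 - xc * yc)
    c = m02 / m00 - yc * yc
    theta = 0.5 * math.atan2(b, a - c)
    root = math.sqrt(b * b + (a - c) ** 2)
    length = math.sqrt(((a + c) + root) / 2.0)
    # Guard against tiny negative values caused by rounding.
    width = math.sqrt(max(0.0, ((a + c) - root) / 2.0))
    return xc, yc, theta, length, width
```

For a horizontal bar of equal probabilities, the sketch returns an orientation of zero, a positive length along the bar and a width of zero.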

2.6.3 CAMSHIFT advantages and disadvantages

With these characteristics, classic CAMSHIFT has both advantages and disadvantages.

Some advantages:

1) Computationally efficient, with real-time tracking performance (30 fps).

2) Invariant to scaling and rotation.

3) Ignores image distractors as long as they lie outside the search window.

4) Can deal with occlusion as long as the occlusion is not 100%.

5) Insensitive to object deformation [4].

Some disadvantages:

1) Problems with multi-hue objects. If the object has more than one hue, the tracker tends to track the most significant object part and leave the small parts untracked. The problem also occurs with complex backgrounds.

2) Because it uses only the hue component, problems may occur if the object passes a background whose colors are similar to those of the object.

3) Sensitive to illuminant changes.

4) In addition, when the object moves so fast that the target areas in two neighboring frames do not overlap, tracking often converges to a wrong object [4].

2.7 Local Binary Pattern

The local binary pattern (LBP) is a descriptor that describes each pixel in a region by its texture, calculated from the relative gray levels of its neighboring pixels. LBP is a powerful illumination-invariant texture primitive [16]. The histogram of the binary patterns computed over a region is used for texture description. Figure 2.7 shows how to calculate LBP using classic LBP and the center-symmetric local binary pattern (CS-LBP).

LBP is a fairly recent texture descriptor that has been developed rapidly by many researchers and shows encouraging results in texture description. Besides CS-LBP, illustrated in Figure 2.7, there are also Rotation Invariant Volume LBP (RIV-LBP) and LBP from Three Orthogonal Planes (LBP-TOP) [22]. That is why we chose this method to review. It represents the texture descriptors in the categorization in Table 2.1.

Figure 2.7 LBP and CS-LBP features for a neighborhood of 8 pixels [16].

Figure 2.8 Example of LBP calculation [16]

Based on the shared LBP implementation in [23], we tried to use LBP as an object tracker in the following simple way. First, we draw a rectangle that marks the object. We calculate the LBP histogram using Rotation Invariant Volume LBP (RIV-LBP) and store it as the object model. On the next frame, we build a search region twice as large as the object’s rectangle. We then slide a window of the same size as the object rectangle over the search region. At each window position, we calculate the RIV-LBP histogram and compare it with the model RIV-LBP histogram. The window whose histogram gives the best similarity is assumed to be the object, and the last step is to move the object rectangle to that window. The similarity measure that we used is the histogram intersection implemented in OpenCV 2.0 [24].
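The sliding-window search described above can be sketched as follows. This is a simplified illustration and not the code used in the experiment: it uses the plain 8-neighbor LBP code instead of RIV-LBP, a hand-written histogram intersection instead of the OpenCV function, and it reads "twice as large" as a search region centered on the previous rectangle. All function names are hypothetical.

```python
# (dy, dx) offsets of the 8 neighbors, clockwise from the top-left pixel
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(gray, y, x):
    """Classic LBP: one bit per neighbor that is >= the center pixel."""
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if gray[y + dy][x + dx] >= gray[y][x]:
            code |= 1 << bit
    return code


def lbp_histogram(gray, x0, y0, w, h):
    """Normalized 256-bin LBP histogram of the window (border pixels skipped)."""
    rows, cols = len(gray), len(gray[0])
    hist = [0.0] * 256
    for y in range(max(1, y0), min(rows - 1, y0 + h)):
        for x in range(max(1, x0), min(cols - 1, x0 + w)):
            hist[lbp_code(gray, y, x)] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def best_window(gray, model, x0, y0, w, h):
    """Slide a w-by-h window over a region twice the rectangle size."""
    best, best_score = (x0, y0), -1.0
    for y in range(y0 - h // 2, y0 + h // 2 + 1):
        for x in range(x0 - w // 2, x0 + w // 2 + 1):
            if x < 0 or y < 0 or x + w > len(gray[0]) or y + h > len(gray):
                continue  # window would leave the image
            score = intersection(model, lbp_histogram(gray, x, y, w, h))
            if score > best_score:
                best, best_score = (x, y), score
    return best
```

On a synthetic image with one textured patch on a flat background, the window that exactly covers the patch is the only one whose histogram matches the model completely.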

(a) (b)

Figure 2.9 LBP Tracker Result in second video.

(a) Object rectangle at frame 1 (b) Object rectangle at frame 33

The result is not what we expected, even for a simple video with a homogeneous plain background and a rigid, fixed-size object (Figure 2.9). The object rectangle often moves to a position that, visually, is not the best position of the object in the next frame. We therefore did not continue to base our thesis on improving LBP.



2.8 Beyond Semi-Supervised Online Boosting Tracking

This method is based on an online boosting mechanism that assigns a weight to every feature. With this approach, the tracking problem is treated as a classification problem. In [20], Stalder et al. presented a multiple classifier system that splits the tasks of detection (finding the object of interest), recognition (distinguishing similar objects in a scene), and tracking (retrieving the object to be tracked) into separate classifiers. The purpose of this split is to simplify each classification task.

This method represents the supervised classifiers in the categorization of Yilmaz et al. [17]. It is one of the most recent methods in this object tracking category and shows many encouraging results. That is why we chose it as the method to study from the supervised classifiers category.

For feature selection, the goal of boosting is to minimize the error by selecting and combining a set of N “weak” classification algorithms into a strong classifier. In [20], they describe the on-line variant, where the main idea is to perform on-line boosting on selectors rather than on the weak classifiers directly. A selector holds a set of M weak classifiers and selects the one with the lowest estimated error. For tracking, online boosting is used by building an initial classifier from positive samples taken from the object and negative samples taken from the background. The classifier is then evaluated exhaustively on the image at time t+1. The resulting confidence distribution is analyzed, and in the simplest case the local maximum is considered to be the new object position. In order to adapt to appearance changes of the object (e.g. different illumination) or a changed background, the classifier is updated and the loop repeats. The supervised online boosting tracker uses self-labeled data for updates, whereas the semi-supervised online boosting tracker uses unlabeled data.
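The role of a selector can be illustrated with a short sketch. It is deliberately simplified and is not the implementation of [20]: the weak classifiers are assumed to be fixed callables that return +1 or -1, whereas in on-line boosting they are themselves trained on-line, and the error estimate is a plain weighted count with an initial value of 1 for each counter. The class name and its methods are hypothetical.

```python
class Selector:
    """Holds M weak classifiers and keeps a running error estimate for each."""

    def __init__(self, weak_classifiers):
        self.weak = list(weak_classifiers)
        # Start both counters at 1 so that the first error estimate is 0.5.
        self.correct = [1.0] * len(self.weak)
        self.wrong = [1.0] * len(self.weak)

    def update(self, sample, label, weight=1.0):
        """Account for one labeled sample (label is +1 or -1)."""
        for i, classify in enumerate(self.weak):
            if classify(sample) == label:
                self.correct[i] += weight
            else:
                self.wrong[i] += weight

    def errors(self):
        """Estimated error of every weak classifier."""
        return [w / (c + w) for c, w in zip(self.correct, self.wrong)]

    def best(self):
        """Index of the weak classifier with the lowest estimated error."""
        errs = self.errors()
        return min(range(len(errs)), key=errs.__getitem__)
```

After a few labeled samples, `best()` returns the index of the weak classifier that has agreed with the labels most often.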

In beyond semi-supervised tracking, the authors propose a multi-classifier system (Figure 2.10). The detector is an offline classifier whose purpose is to reliably find the object of interest; it is not updated during tracking. Any kind of object detector can be integrated into the system as long as it is generic and can be applied to any kind of scene. The recognizer is a supervised online classifier, which is updated during tracking. Its positive training set consists of tracked samples that are validated by the detector, and its negative training set is composed of hard examples collected from the background image at the time of detection. This allows similar objects in a scene to be distinguished. The tracker is a semi-supervised on-line classifier. The confidence map is analyzed via semi-supervised updates to retrieve a stable maximum.

Figure 2.10 The core classifier system: detector, recognizer and tracker.[20]

They used Haar-like features, histograms of oriented gradients (HOG), and color histograms. They achieved 10 fps on a common 3.0 GHz dual-core PC with 2 GB of RAM [20].

Some of the advantages of this method are:

1) It is currently one of the state-of-the-art supervised/semi-supervised object tracking methods.

2) It can distinguish between very similar objects. The authors give an example of tracking a Coca-Cola bottle near another similar Coca-Cola bottle, in which the method distinguishes the tracked bottle from the other one.

3) It can track a partially or fully occluded object, under either static or dynamic occlusion.

4) It can track an object whose texture is similar to that of its background.

5) It can track multiple objects.

6) Face tracking is also possible, with good results.

7) It is able to re-track the object.

The weakness is that this method works only if the size of the object is unchanged from frame to frame. If the size changes (e.g. the object moves away from the viewer), the method will not detect the object. This is a critical shortcoming in our case. The speed is also not high enough for a real-time application.

This method is very promising, and the authors also share the code [21]. Unfortunately, we had difficulties in two respects. First, the code is hard to understand. The implementation of the multi-classifier system contains many pattern recognition algorithms, and even though the shared code is the simple version with only Haar-like features, it is still difficult to follow. Second, pattern recognition is not our main field. We therefore decided not to study or improve this method further.

2.9 Method that We Choose

After studying, experimenting with and observing some state-of-the-art methods in object tracking, we decided to choose CAMSHIFT, for the following reasons:

1) CAMSHIFT has real-time performance, while some other methods do not. Our experiments show that it reaches 30 fps. This capability is very important in our case (clickable hypervideo), since a hypervideo application needs the tracker to run in real time. Semi-Supervised Online Boosting Tracking fails to achieve real-time performance. In our experiments, LBP, KLT and SURF also fail to do so.

2) CAMSHIFT is invariant to scaling and rotation, while some other methods are not. Scaling and rotation are unavoidable in a real video application such as hypervideo, so this is the next most important characteristic that we need. CAMSHIFT is very robust to scaled and rotated objects. SURF is able to detect a rotating object but fails to detect a highly scaled one; in our case, SURF fails if the object becomes too small. Semi-Supervised Online Boosting Tracking certainly fails for scaled objects. Mean-Shift uses a fixed kernel size and moves the kernel towards the mean position, so the size does not change over time, whereas CAMSHIFT adapts to the changing color distribution over time, which makes it scale invariant [5]. According to [17], Mean-Shift is suitable for translational and scaling motion but not for rotational motion, and KLT is listed only for affine motion, not for translational, rotational or scaling motion.

3) Across all of our test videos, CAMSHIFT is one of the methods that can track the object with reasonable results. It can track the object but starts to fail after some frames. For the first video, the tracker drifts at frame 145. It drifts at frames 573 and 61 for the third and fourth videos respectively. It succeeds in tracking the object in the second video, but only some object parts (the propellers). With the procedures given in 2.3 and 2.7, KLT and LBP fail for all of the videos. SURF succeeds in tracking the object in the second video. Nevertheless, it fails outright to detect object interest points in the first and fourth videos. It succeeds in tracking the object in the third video for some frames but fails when the object becomes too small. We obtained the simple source code of Beyond Semi-Supervised Online Boosting tracking, but not the full version, and tried it on our test videos. This simple source code contains only Haar-like features, without color histogram and Histogram of Oriented Gradients (HOG) features. With this simple version, the object in the second video is tracked successfully. It certainly fails for the third test video, as stated by the author in [33].

4) Insensitive to object deformation [4]. Changing object shape is not a problem

for CAMSHIFT.

5) We believe that we can improve some critical disadvantages of CAMSHIFT within the duration of the master thesis. We eliminated Semi-Supervised Online Boosting Tracking because it deals mostly with pattern recognition, while our main field is image analysis and processing with a specialization in color features (Color in Informatics and Media Technology). We believe that this latter method could be improved to meet our requirements, but the time required is not feasible within the duration of the master thesis.

6) CAMSHIFT is close to Mean-Shift. However, since CAMSHIFT is an improved version of Mean-Shift that can adapt to changes in the color distribution over time, we believe that CAMSHIFT is a better starting point than Mean-Shift.

2.10 CAMSHIFT/Mean-Shift Improvement in Literatures

Before starting to improve CAMSHIFT, we studied the literature to find out what researchers have done to improve CAMSHIFT/Mean-Shift. After that, we specify which CAMSHIFT improvement will be made in this thesis.

2.10.1 Mean-Shift tracking combined with texture histogram

Ning et al. [8] proposed a joint color-texture histogram to represent an object and applied it within the mean-shift framework. The purpose is to improve tracking accuracy and efficiency by complementing conventional color histogram features with texture features, in this case the local binary pattern (LBP). The idea is to improve the object model by describing every pixel in the object not only by its color information but also by its texture value. Ning et al. [8] combine mean-shift tracking with LBP because of LBP’s fast computation and rotation invariance. They claim that the proposed method performs much better than the original color-based method, with fewer iterations, especially when tracking objects whose color appearance is similar to the background.

We described LBP in 2.7. We tried LBP once again, using the implementation in [30], in the following way:

1) We use the code to obtain the LBP image from our test video (Figure 2.11).

2) We then calculate the LBP texture histogram using RIV-LBP [23].

3) We perform back projection to obtain a probability image based on texture.

Unfortunately, the result did not meet our expectations. Back projection should give high intensity (probability) to pixels whose characteristics are similar to the model histogram, but the result appears to do the opposite (Figure 2.11). We therefore decided not to use texture to improve CAMSHIFT.

(a) (b)

Figure 2.11 Comparing LBP Image and its back projection image. (a) LBP Image (b) The Back projection image. The small toy intensity is low.

2.10.2 CAMSHIFT and Mean-Shift combined with interest points

Another way to improve CAMSHIFT is to combine it with interest point features. Interest point features are well known for their invariance to illuminant, rotation and scaling. This advantage is very useful for compensating for the weakness of the color histogram in CAMSHIFT, which is sensitive to illuminant change.

One implementation of this approach is by Ganoun et al. [10]. The aim of their research is to widen the scope of the CAMSHIFT method so that it can be applied to gray image sequences. They do so by improving the object model, adding feature point information to the color histogram.

In principle, they measure the displacement of the search window by calculating the displacement of matched interest points. They then use CAMSHIFT to determine the final object position.

The method can be summarized in the following steps [10]:

1) Calculation of the object model using the color histogram and feature points.

2) Calculation of the temporary object displacement by matching feature points between image I_t of the sequence at instant t and image I_{t+1} at instant t+1.

3) Determination of a reduced search window positioned on the centre C_temp calculated at step 2.

4) Calculation of the probability image in the search window.

5) Application of the Mean-Shift algorithm to determine the new object centre.

6) Update of the object model.

Another implementation of this approach is by Qiu Xuena et al. [9], who use SIFT as the interest point descriptor, combined with spatial features, to create the probability distribution of the tracked object. They use the spatial features to increase robustness when the tracker deals with occlusion. They claim that their method can handle changes in object scale, orientation, view and illumination, and that it can also deal with camera movement.

SIFT was added to increase the robustness of CAMSHIFT under occlusion and when the object has a color similar to the background. Meanwhile, the color feature can help SIFT segment the target, and the spatial feature can handle occlusion of the object. The probability distribution of the tracked object is represented by a linear weighted combination of the kernel functions of these three features.

The entire algorithm can be summarized as follows [9]:

1) Define a rectangle on the region of interest in the first frame;

2) Compute the color histogram of this region and, at the same time, extract the SIFT features within it;

3) In the second frame, let the previous location be the center of the region of interest, whose size is one quarter of the frame size. Within this region of interest, let each pixel be the center of a sub-window whose size is the same as that of the target region;

4) For every sub-window, calculate the probability density;

5) Determine the final tracking window.

(a) (b)

Figure 2.12 SURF and CAMSHIFT 1. (a) Yellow trunk is tracked by CAMSHIFT in frame 33

(b) Air plane is successfully tracked by SURF in frame 300

(a) (b) (c)

Figure 2.13 SURF and CAMSHIFT 2. (a) Small scaling toy is successfully tracked by SURF at frame 2 (b) SURF sometimes match

wrong points (c) SURF fails completely when the object skews at frame 360

A publicly available interest point detector and descriptor that is widely used and performs very well is SURF. We therefore tried SURF once more, this time with the purpose of improving CAMSHIFT. We used the SURF implementation in OpenCV 2.0 [24] and compared the results with the previous experiment in 2.4. We proceeded as follows:

1) Make a rectangle that covers the object in the first frame.

2) Calculate the conventional CAMSHIFT color histogram of the object inside the rectangle. Store it as the object’s color model.

3) Use SURF inside the rectangle to detect the interest points of the object. Store them as the object’s interest point model.

4) In each frame, use both models to track the object.

The result is more or less the same as in 2.4: good in some cases but not in others. For the first test video (yellow trunk), SURF simply fails to track the interest points, while CAMSHIFT tracks the object almost perfectly. In our second test video (Figure 2.12), SURF helps considerably by giving a fairly precise object rectangle, while CAMSHIFT fails to cover the whole object and only gives an ellipse that covers both of the airplane’s propellers. For these videos, the two methods therefore complement each other.

For the third video (Figure 2.13), CAMSHIFT succeeds for 275 frames but then drifts and tracks the background, which has a color similar to the object. SURF also manages to track the object for 105 frames, but it fails when the object becomes too small or skews. The SURF detector can no longer recognize the object, even if we update the interest point object model with the last successfully tracked object.

Based on these results, we decided not to choose interest points as the improvement to CAMSHIFT.

2.10.3 CAMSHIFT improvement using new HSV model

In [4], G. Tian et al. propose an improved one-dimensional color histogram model that combines H, S and V for CAMSHIFT object tracking. The purpose is to improve CAMSHIFT tracking accuracy when the color distributions of the object and the background are similar, or when the background is complex. The method is based on the Munsell 3D color coordinate system, which has been confirmed to be suitable for the human visual system [4].

Based on the optical theory that each color has a corresponding wavelength, they quantize H, S and V into different ranges. In summary, the process is:

1) Divide the color ranges: based on the human visual ability to distinguish colors, they divide the H color space into 8 parts, the S color space into 3 parts and the V color space into 3 parts.

2) Quantize the values of H, S and V with different intervals, based on subjective human color perception of the different color ranges.

3) Build a combined one-dimensional feature vector from the 3 color components:

G = H·Qs·Qv + S·Qv + V

They choose Qs = 3 and Qv = 3, where Qs and Qv are the weights of the S and V components.

Therefore: G = 9H + 3S + V
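The combined index can be illustrated with a short sketch. Note the assumption: [4] quantizes H, S and V with non-uniform intervals that are not reproduced in this text, so the sketch uses uniform intervals purely for illustration, with H in degrees (0 to 360) and S and V in the range 0 to 1. The function names are hypothetical.

```python
def quantize(value, max_value, levels):
    """Uniformly map a value in [0, max_value] to an integer in 0..levels-1."""
    return min(int(value * levels / max_value), levels - 1)


def combined_index(h, s, v, qs=3, qv=3):
    """One-dimensional color index G = H*Qs*Qv + S*Qv + V."""
    hq = quantize(h, 360.0, 8)   # 8 hue parts
    sq = quantize(s, 1.0, qs)    # 3 saturation parts
    vq = quantize(v, 1.0, qv)    # 3 value parts
    return hq * qs * qv + sq * qv + vq
```

With 8, 3 and 3 quantization levels, G takes integer values from 0 to 71, so the object model becomes a single 72-bin histogram.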

They show some quite encouraging results. We have not tried this method, but we have put it on our list of algorithms for improving CAMSHIFT.

2.10.4 CAMSHIFT improvement using hue-distance and saturation features

J. A. Corrales et al. [3] use a different approach. Their purpose is to improve CAMSHIFT when tracking objects against dynamic backgrounds with similar hue values, by using the hue and saturation color components but modifying the hue component: instead of the hue, they use the hue-distance.

The hue-distance is a function that represents each hue value H as a distance from a reference hue value Href. The following distance function is used instead of the hue component:

(12)

The hue reference Href is the hue value which has the highest frequency in the

histogram h(x) obtained from the histogram calculation in the first step of CAMSHIFT

algorithm:

$$H_{ref} = \arg\max_{x} h(x) \tag{13}$$



(a) (b)

(c) (d)

Figure 2.14 CAMSHIFT with new HSV model. (a) The result image using CAMSHIFT (b) The back projection image using CAMSHIFT (c) The

result image using CAMSHIFT with new HSV model (d) The back projection image using CAMSHIFT with new HSV model[4]

First, a histogram of the hue component of the standard HSV model is calculated in order to obtain the hue reference Href using (13). Afterwards, the histogram is recomputed using the hue-distance in (12) and used as the target distribution. All subsequent images are transformed to the HSV model, but using the hue-distance instead of the hue component.
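The two steps can be sketched as follows. The selection of the reference hue follows the description of (13) directly. The distance function, however, is only an assumption: the exact form of (12) in [3] is not reproduced in this text, so the sketch uses a circular distance on the hue wheel as one plausible choice. Both function names are hypothetical.

```python
def reference_hue(hist):
    """Index of the most frequent hue bin, used as H_ref."""
    return max(range(len(hist)), key=hist.__getitem__)


def hue_distance(h, h_ref, hue_range=360):
    """Assumed circular distance between a hue and the reference hue."""
    d = abs(h - h_ref) % hue_range
    return min(d, hue_range - d)
```

With this choice, hues on either side of the reference hue map to the same small distance, and the largest possible distance is half of the hue range.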

Using only the hue component to obtain the probability distribution image, as in classic CAMSHIFT, is not sufficient when elements in the background have hue values similar to those of the target object. In this case, the CAMSHIFT algorithm may wrongly include background elements in the search window.

Two histograms are calculated: one for the hue distance, hd(H,Href)(x), and another for the saturation component, hS(x). In step 3 of the algorithm, the histogram hd(H,Href)(x) is used to obtain the back projection Bd(H,Href)(x, y) of the hue-distance channel, and the histogram hS(x) is used to obtain the back projection BS(x, y) of the saturation channel. These two back projections are combined according to the following equations to create a final probability distribution image B(x, y), which is used by the Mean-Shift algorithm to find the center of mass:

(14)

(15)

Equation (15) removes from the hue-distance back projection those pixels whose saturation does not match the saturation values of the tracked object. Most of the background pixels whose hue values are similar to the tracked object can be removed in this way because their saturation values are different. Therefore, only the pixels with hue and saturation similar to the object are considered.
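The effect described above can be illustrated with a small sketch. It is only one possible reading of the text, since equations (14) and (15) are not reproduced here: a pixel keeps its hue-distance back projection value only if its saturation back projection exceeds a threshold. The function name and the `sat_threshold` parameter are hypothetical.

```python
def combine_back_projections(b_hue, b_sat, sat_threshold=0.0):
    """Keep the hue-distance value only where the saturation also matches."""
    return [[bh if bs > sat_threshold else 0.0
             for bh, bs in zip(row_hue, row_sat)]
            for row_hue, row_sat in zip(b_hue, b_sat)]
```

A background pixel with a matching hue but a non-matching saturation is therefore set to zero and no longer attracts the search window.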

We implemented this method according to the procedure above, with the modification that the V color component is also included, because we believe that value is also important to take into account in the model. The result is encouraging (Figure 2.15). This method has a purpose similar to that of the new HSV model method in 2.10.3.

2.10.5 CAMSHIFT with improvement of object localization

These methods try to improve the CAMSHIFT object model by improving the object localization method.

Foreground extraction

• This method tries to increase tracking robustness by giving a very high positive weight to the center of the rectangle and a low negative weight to the parts beyond a given range [12]. The range is circular. An illustration is given in Figure 2.16. The formula is

(16)

where (x, y) is the pixel coordinate and hi is any value chosen by the user to filter out background color clusters.

• This still causes problems because, in real applications, many objects are not circular. For example, if the object is elongated (Figure 2.17), some background information will be taken into the object model, or some object parts will be left out of it.



Weighted and Ratio Histogram

This method is, in some ways, similar to foreground extraction. For pixels inside the object search window, it gives higher weights to the pixels near the center and a weight of 0 to the pixels far from the center when calculating the model histogram [7]. The authors also use a ratio histogram, which gives lower weights to the pixels outside the object search window.
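A center-weighted model histogram can be sketched as follows. The weighting profile used in [7] is not given in this text, so the sketch assumes a weight of 1 minus the squared normalized distance from the window center, clipped at zero, which decreases from the center and reaches 0 far from it. The function name and the input layout (a window of integer bin indices) are hypothetical.

```python
def weighted_histogram(bin_window, bins):
    """Normalized histogram in which central pixels count more than outer ones."""
    rows, cols = len(bin_window), len(bin_window[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    ry, rx = rows / 2.0, cols / 2.0
    hist = [0.0] * bins
    for y in range(rows):
        for x in range(cols):
            r2 = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2
            weight = max(0.0, 1.0 - r2)   # assumed profile, 0 far from center
            hist[bin_window[y][x]] += weight
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else hist
```

Compared with an unweighted histogram, the color at the center of the window contributes a larger share of the model, and colors near the corners, which are more likely to be background, contribute less.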

We decided not to choose these methods, since they cannot effectively localize the object. We propose another method, which will be described in 3.1.

2.10.6 CAMSHIFT improvement using adaptive background (ABCShift)

The aim of this method is to track robustly in two situations where CAMSHIFT fails: first, when the scenery changes due to camera motion, and second, when the tracked object moves across regions of background with which it shares significant colors [11]. It tries to improve the tracker by modeling the background with a Bayesian probability model.

In summary, the algorithm is[11]:

1) Identify an object region in the first image and train the object model.

2) Center the search window on the estimated object centroid and resize it to

have an area r times greater than the estimated object size.

The centroid position can be calculated using:

(17)

where i is index of all pixels in the search window and ci is the color of pixel i.

Then the position of the centroid is

(18)

where (xi, yi) is the position of pixel i in the search window. At the end of the

iteration the center of the search window is shifted to the new position (xc, yc)

and the procedure is repeated until two consecutive center positions are

within ε of each other.

3) Learn the color distribution, P(C), by building a histogram of the colors of all

pixels within the search window.



Figure 2.15 CAMSHIFT improvement with hue-distance saturation features.

(a) Tracked image with selected object in rectangle (b) hue distance back projection (c) Saturation back projection (d) Hue distance – Saturation combination back projection[]

Figure 2.16 Foreground extraction.

FEM applies high positive value to the pixels near the center and applies negative values to the pixels toward the edges of the object region[12].



Figure 2.17 Sample of elongated object.

Elongated object is not suitable using foreground extraction method.

4) Use Bayes’ law to assign object probabilities, P(O|C), to every pixel in the search window, creating a 2D distribution of object location:

$$P(O \mid C) = \frac{P(C \mid O)\,P(O)}{P(C)} \tag{19}$$

where P(O|C) denotes the probability that the pixel represents the tracked object given its color, P(C|O) is the color model learned for the tracked object, and P(O) and P(C) are the prior probabilities that the pixel represents the object and has the color C, respectively.

5) Estimate the new object position as the centroid of this distribution and estimate the new object size (in pixels) as the sum of all pixel probabilities within the search window.

6) Repeat steps 2-5 until the object position estimate converges.

7) Return to step 2 for the next image frame.
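Steps 3 to 5 can be sketched in plain Python as below. This is an illustrative sketch and not the implementation of [11]: colors are assumed to be already quantized to integer bin indices, the prior P(O) is passed in as a parameter, and the posterior is clipped to 1.0 as a safeguard of our own. The function names are hypothetical.

```python
def color_distribution(colors, bins):
    """P(C): normalized histogram of the color indices in the search window."""
    hist = [0.0] * bins
    count = 0
    for row in colors:
        for c in row:
            hist[c] += 1.0
            count += 1
    return [v / count for v in hist]


def object_probabilities(colors, object_model, prior_object, bins):
    """P(O|C) = P(C|O) P(O) / P(C) for every pixel in the search window."""
    p_c = color_distribution(colors, bins)
    result = []
    for row in colors:
        out = []
        for c in row:
            # p_c[c] > 0 here because color c occurs in the window.
            p = object_model[c] * prior_object / p_c[c]
            out.append(min(p, 1.0))
        result.append(out)
    return result


def estimated_size(probs):
    """Object size in pixels: the sum of all pixel probabilities."""
    return sum(sum(row) for row in probs)
```

Because P(C) is relearned from the current search window, a color that is frequent in the surrounding background receives a lower object probability than the same color would in classic CAMSHIFT.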

They show some videos that support their claim that ABCShift gives good results even when the object passes through a background with a similar color. They implement the method in robotics. The authors show some encouraging results, but not for multiple object tracking.



2.10.7 CAMSHIFT improvement by background subtraction

This method tries to improve CAMSHIFT tracking robustness against complex backgrounds by modeling the background and subtracting it from every frame of the sequence. Background subtraction has been widely used in object tracking when the background is static. The basic principle is to model the background and subtract it from each frame; the result is, more or less, the moving object in the frame. The method works under the assumption that the object is moving. Otherwise, the object will be identified as background.

Background subtraction methods include the average, the median, the codebook and the Gaussian mixture model. Implementations of these methods are publicly available: average background subtraction code is available in [31], codebook code is available in OpenCV 1.0 [24] or higher, and Gaussian mixture model code is available in [32]. For median background subtraction, we developed our own code by modifying the average background subtraction code.

We tried three of these methods in combination with CAMSHIFT. The results at frame 30 of the scaling small toy test video are shown in Figure 2.18.

Figure 2.18 Background subtraction in static background.

First row is the result image; second row is the foreground image. First column uses the average, second column the median, third column the Gaussian mixture model.

In conclusion, background subtraction helps CAMSHIFT considerably in static background videos. Moreover, with the help of background subtraction, we can extract the object contour. Unfortunately, background subtraction is not helpful in dynamic background videos, because movement in the background is detected as foreground. The result of using background subtraction at frame 30 of the airplane test video can be seen in Figure 2.19.

We tried this method because the previous improvements had difficulty solving the challenging tracking problem in the third video (Figure 2.19). We noticed that this video has a static background, which is useful because background subtraction usually works very well in this condition. We confirmed this by trying background subtraction, which does in fact help CAMSHIFT track the object in the third video. This method could therefore be used if we had a technique to detect whether a background is static or dynamic: background subtraction would be applied when the background is detected as static and skipped otherwise. However, this would be a separate piece of research requiring a considerable amount of time. We do not pursue background subtraction further for improving CAMSHIFT, because hypervideo cannot be limited to videos with a static background and because of the method’s limitations in dynamic background videos.

2.10.8 The CAMSHIFT improvement method that we choose

After studying, experimenting, and observing, we found that there are still some problems with the previous improvements. For improving CAMSHIFT using LBP, we tried our simple procedure as stated in 2.10.1. Unfortunately, the result fell short of our expectations, so we decided not to use texture to improve CAMSHIFT. Combining interest points with CAMSHIFT also gave unsatisfying results for our case. We do not choose the CAMSHIFT improvement using the new HSV model because of the rigidity of its bin ranges. The CAMSHIFT object model improvement with foreground extraction and weighted histogram either fails to exclude background information from the model or fails to include all object information in the model. ABCShift does improve CAMSHIFT significantly, especially when the object’s colors are similar to the background’s colors. Nevertheless, this method is closer to pattern recognition, which is not our main field of interest, and we want to try another approach. We do not use background subtraction, as it is effective only for videos with a static background. The only method that we adopt is the hue-distance histogram, which is explained in 2.10.4. We slightly modify this method by using not only hue-distance and saturation histograms but also a value histogram.

2.10.9 The more specific aim of the master thesis

Based on the discussion above, we decide to address some critical disadvantages of CAMSHIFT that are not fully solved by other researchers or that do not meet our aim for this thesis. From this conclusion, we define the specific aims of the master thesis as:



1) Improve the robustness of classic CAMSHIFT for multihued object tracking.

2) Improve the robustness of CAMSHIFT when the object’s colors are similar to the background’s colors.

3) Improve the capability of CAMSHIFT to do multiple object tracking. We did not find any literature describing an improvement of CAMSHIFT that enables multiple object tracking.

4) Speed and illuminant change are not our main concern. In [], Colantoni et al. stated that, using a Graphics Processing Unit (GPU), the speed of an image processing task (e.g. object tracking) can be increased by up to 10 times. As our main field is image analysis and processing, we focus on improving robustness, while speed will be improved in our future work or left to an expert in GPU programming.

To achieve our aim, we adopt the hue-distance histogram idea as one way to improve CAMSHIFT when the object’s colors are similar to the background’s colors. For object localization, we propose another method, which will be described in 3.1. To increase the robustness of tracking, we also propose another method, which will be explained in 3.2 and 3.6. Multiple object tracking will be easy to do if we can solve the first two problems.

Figure 2.19 Background subtraction in dynamic background.

First row is the result image; second row is the foreground image. First column uses the average, second column the median, third column the Gaussian mixture model.



3 Proposed Method

In this thesis, we propose several ways to improve CAMSHIFT. The key methods are multi-dominant color object localization and separate tracking of the dominant color object parts. This section describes the proposed method. Implementation details such as parameter values are described in the Implementations section.

3.1 Object Localization

An object rectangle is the most common method for object localization: the user draws a rectangle around the object. This is simple and easy to use; nevertheless, problems may occur because, most of the time, the object is not exactly rectangular. As a result, some background information is included in the object model. When this happens, drifting often occurs and the tracker is not robust in tracking the object.

Another method uses points as the object boundary. The selection continues by drawing lines between consecutive points. This method can give the exact region of the object and is suitable for simple object shapes. The problem is that it is not practical for objects with complex or irregular shapes (Figure 3.1). Even for objects with a simple circular shape, this method is not very practical.

Figure 3.1 A sample of complex shape object

Some recent methods [7][12] do not give a precise localization of the object; they fail to capture the exact object information. This drives us to create a more sophisticated method for object localization.

3.1.1 Preprocessing

We propose object localization by combining mean-shift segmentation and region growing. Mean-shift is a preprocessing step before the selection of object parts. It is applied to segment each part of the object and make the parts homogeneous enough to be chosen easily. Mean-shift segmentation smooths the image while preserving discontinuities [2]. This holds only up to a certain level: if the object is in front of a background of very similar color during the localization phase, the object will be merged with the background. We need this preprocessing step because region growing alone is not enough, even if we increase the color tolerance. Figure 3.2 gives a simple illustration of this case. Note that this preprocessing is needed only for the frame used in the object localization step.

(a) (b)

Figure 3.2 Object Localization using only region growing (a) before selection (b) after selection

(a) (b)

Figure 3.3 More precise object localization with only a single click. (a) before selection (b) after selection

An alternative for the preprocessing step is K-Means segmentation. One can segment the object and background by selecting some colors as means, and all pixels are then classified according to those means. However, this method is not practical because we have to choose not only the object colors but also the background colors. If there are many colors in the background, which is very common in everyday video as well as in hypervideo, practicality becomes an issue. Examples are the third and fourth test videos.

To be more adaptive to the user’s needs, we designed the segmentation to be tunable through a text file. The user can change the segmentation parameter values until the whole object appears to be selected. Figure 3.4 shows that we can tune the preprocessing segmentation parameters as well as the color tolerance for region growing, the number of bins, etc.

Figure 3.4 Text file configuration to tune the parameters

3.1.2 Image color transformation

The next step is to transform the image into the HSV color space. We choose HSV because this model is based on human visual perception and is represented using the Munsell three-dimensional color coordinate system, which has been confirmed to be suitable for human visual understanding [4].

3.1.3 Object Selection

The next step is object selection. After mean-shift segmentation, the user chooses the object by clicking each of its parts. The click positions become a set of “seed” points that have specific properties. From these seed points, the region grows by appending neighbors that have properties similar to the seeds [18]. We also allow some tolerance values so that the seeds can grow further until they reach the edges of the object parts. The selected object parts are considered to contain the dominant colors of the object, which will be tracked in the tracking phase.
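Seeded region growing with per-channel tolerances can be sketched as follows. This is a simplified illustration, not the flood fill routine used in our implementation: the function name `grow_region` is hypothetical, 4-connectivity is assumed, every candidate is compared with the seed color, and hue wrap-around is ignored. The default tolerances are the ones reported in Section 4.1.

```python
from collections import deque


def grow_region(image, seed, tol=(20, 40, 20)):
    """Grow a region from `seed` over 4-connected pixels close to the seed color.

    image : 2D list of (h, s, v) tuples
    seed  : (x, y) click position
    tol   : per-channel tolerance for (h, s, v)
    Returns the set of (x, y) positions belonging to the region.
    """
    height, width = len(image), len(image[0])
    sx, sy = seed
    ref = image[sy][sx]
    region = {(sx, sy)}
    queue = deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in region:
                # Accept the neighbor only if every channel is within tolerance.
                if all(abs(a - b) <= t
                       for a, b, t in zip(image[ny][nx], ref, tol)):
                    region.add((nx, ny))
                    queue.append((nx, ny))
    return region
```

Pixels of similar color that are not connected to the seed are left out, which is what keeps a single click confined to one object part.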

3.1.4 Minimum and maximum values storing

Each time an object part is selected, we store the minimum and maximum values of its color components, that is, hmin, hmax, smin, smax, vmin, and vmax. These values are needed to make the color mask, which will be described in section 3.3.



One advantage of the proposed method is that no surrounding background information is added to the model. This makes the tracker more robust and avoids the drifting problem. Another advantage is that the method is more practical for objects with few parts. For example, for the yellow trunk, which has only one hue, the selection consists of clicking any part of it, and it will be selected entirely (Figure 3.3). Even if the shape of the object is complex and irregular, there is no problem as long as it has a homogeneous color.

3.2 Object Modeling

The next step is object modeling. We use color histograms to model the object. We try to improve CAMSHIFT using only the color information of the object: no background information is used in the process.

We use the HSV color space. In the original implementation of CAMSHIFT, Bradski used only the hue component. This leads to a problem when the object passes through a background that has a hue similar to the object’s. Some other methods use only the hue and saturation components. This may be sufficient in some cases, but in others the hue and saturation components are not enough to distinguish the object from its background. The value component often gives good discrimination between the object and its background. That is why we use all three components and build a histogram for each of them.

Using those three components alone is apparently still not enough to distinguish the object from its background. Tian et al. in [4] proposed a new HSV color model to describe the object and claim that it improves CAMSHIFT tracking. Corrales et al. in [3] proposed using a hue-distance histogram instead of a hue histogram, as described in 2.10.4, and combined it with the saturation component.

In this thesis, we combine those ideas and use hue-distance, saturation, and value histograms to model the object. This gives better discrimination.

We quantize each color histogram by using a smaller number of bins. The number of bins for each component is the next important factor. Based on our experiments, 30 bins for the hue-distance component, 9 bins for saturation, and 6 bins for value give good results.
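The quantization can be illustrated with a small sketch. The helper names `bin_index` and `histogram` are hypothetical, equal-width bins are assumed, and the upper bound of 256 for saturation and value is the usual 8-bit range; the hue-distance component would be binned the same way with 30 bins over its own range.

```python
def bin_index(value, upper, bins):
    """Map a value in [0, upper) to one of `bins` equal-width bins."""
    return min(int(value * bins / upper), bins - 1)


def histogram(values, upper, bins):
    """Count how many values fall into each bin."""
    hist = [0] * bins
    for v in values:
        hist[bin_index(v, upper, bins)] += 1
    return hist


# Example: saturation with 9 bins and value with 6 bins over 0..255.
saturation_hist = histogram([0, 10, 255], upper=256, bins=9)
value_bin = bin_index(128, upper=256, bins=6)
```

Fewer bins make the model less sensitive to small color variations, at the cost of discriminating less between colors that are close to each other.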



3.3 Making Color Mask

A color mask is made for each object part, based on the minimum and maximum values taken in the object localization step (3.1.4). Each pixel in every frame is evaluated against those minimum and maximum values. If a pixel is inside the range, it is given the value 1; otherwise, it is given the value 0.
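As a minimal sketch of sections 3.1.4 and 3.3 together (the function names `color_extremes` and `color_mask` are hypothetical, and images are assumed to be nested lists of (h, s, v) tuples):

```python
def color_extremes(image, region):
    """Return ((hmin, hmax), (smin, smax), (vmin, vmax)) over the selected pixels.

    region is an iterable of (x, y) positions of one object part.
    """
    pixels = [image[y][x] for x, y in region]
    return tuple((min(p[i] for p in pixels), max(p[i] for p in pixels))
                 for i in range(3))


def color_mask(image, extremes):
    """1 where all three components lie inside the stored ranges, else 0."""
    return [[1 if all(lo <= c <= hi for c, (lo, hi) in zip(pixel, extremes))
             else 0
             for pixel in row]
            for row in image]
```

Because the ranges come from the selected pixels themselves, no manual tuning of the six limits is needed for each video.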

(a) (b) (c)

Figure 3.5 Color mask illustration

(a) Original frame (b) Selected object (airplane body) (c) Color mask

3.4 Segmentation

For subsequent frames, mean-shift segmentation is carried out. This eases the differentiation of the object from the background. With small spatial and color ranges, mean-shift smooths the image and removes noise while preserving discontinuities. Mean-shift merges some background areas of close color into one region, which hopefully has color information outside the object histogram. Figure 3.6 shows some yellow noise in the background that is close to the object color. This noise is merged with neighboring pixels and assigned color information that differs from the object’s color information.
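The edge-preserving behavior can be illustrated on a one-dimensional signal. This is a toy sketch with a flat kernel, not the pyramid mean-shift routine used in our implementation; the function name `mean_shift_filter_1d` and the parameter values in the example are assumptions chosen for illustration.

```python
def mean_shift_filter_1d(signal, hs=2, hr=10, max_iter=20):
    """Flat-kernel mean-shift filtering of a 1D signal.

    hs : spatial range (in samples), hr : color range (in intensity units).
    Each sample moves to the mean of the samples that are close to it both
    in position and in intensity, until it stops moving.
    """
    n = len(signal)
    out = []
    for i, v in enumerate(signal):
        x, c = float(i), float(v)
        for _ in range(max_iter):
            members = [(j, signal[j]) for j in range(n)
                       if abs(j - x) <= hs and abs(signal[j] - c) <= hr]
            if not members:
                break
            new_x = sum(j for j, _ in members) / len(members)
            new_c = sum(s for _, s in members) / len(members)
            if abs(new_x - x) < 1e-3 and abs(new_c - c) < 1e-3:
                break
            x, c = new_x, new_c
        out.append(c)
    return out


# A noisy dark region followed by a noisy bright region.
smoothed = mean_shift_filter_1d([10, 12, 11, 10, 50, 52, 51])
```

Samples on each side of the step are averaged only with their own side, so the noise is reduced while the step between the two regions remains.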

(a) (b)

Figure 3.6 Segmentation for smoothing and noise removal of third test video (a) Original frame (b) Segmented frame



3.5 Histogram Back Projection

Histogram back projection means evaluating each pixel in the frame sequence based on the histogram model built in the object modeling phase. Before back projection, we apply the color mask to the frame so that only pixels within the object color ranges pass. We then back project the hue-distance, saturation, and value histograms. Each back projection yields a back projection image that contains the probability of each pixel in the frame according to the corresponding histogram. We then combine all back projection images into a single back projection image using the AND operator. This single back projection image is the final input for the tracking (Figure 3.7).
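The sketch below illustrates back projection and the combination step on one channel at a time. It rests on several assumptions: the names `back_project` and `combine_and` are hypothetical, histograms are scaled so that the highest bin maps to 255, and AND is read as the bitwise AND of 8-bit values. An elementwise minimum would be another plausible reading of the AND operator.

```python
def back_project(channel, hist, upper):
    """Replace each pixel by the scaled histogram value of its bin (0..255)."""
    bins = len(hist)
    peak = max(hist) or 1  # avoid division by zero for an empty histogram
    lut = [round(255 * h / peak) for h in hist]
    return [[lut[min(int(v * bins / upper), bins - 1)] for v in row]
            for row in channel]


def combine_and(images, mask):
    """Bitwise AND of several 8-bit back projection images under a 0/1 mask."""
    height, width = len(mask), len(mask[0])
    # A masked-in pixel starts with all bits set so that AND keeps the values.
    out = [[255 if mask[y][x] else 0 for x in range(width)]
           for y in range(height)]
    for img in images:
        for y in range(height):
            for x in range(width):
                out[y][x] &= img[y][x]
    return out
```

With this reading, a pixel keeps a non-zero value only if it passes the color mask and has support in every histogram.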

Figure 3.7 Histogram Back Projection of first test video

3.6 Tracking

Good localization is a first step towards good tracking, but it does not guarantee 100% accurate tracking. If we select all the object parts as one part and take a single HSV color histogram of it, it will be difficult to track the object when it passes through a background whose colors fall within the object’s color range.

One good way to improve the robustness of tracking is “divide and conquer”: we split the tracking problem so that it becomes easier to solve. To split the problem, we propose to track the object parts separately. Object parts represent the dominant color parts of the whole object. Each object part is modeled using hue-distance, saturation, and value histograms and is then tracked. The whole object rectangle is the maximum rectangle of the object part rectangles (Figure 3.8), and the whole object center position is the center of the maximum rectangle. The maximum rectangle is defined as the smallest possible rectangle that covers all rectangles inside it [5].

We also propose a mechanism to detect whether a tracker has lost its object part and to handle this in the whole object rectangle. If the area of the next tracking rectangle of an object part is equal to 0, that object part tracker is declared lost. The rectangle of a lost tracker is not taken into account in the whole object rectangle, which considers only the “surviving” tracking rectangles.
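The maximum rectangle and the lost-tracker rule can be sketched as follows. The function names `whole_object_rect` and `rect_center` are hypothetical, and rectangles are assumed to be (x, y, width, height) tuples.

```python
def whole_object_rect(part_rects):
    """Smallest rectangle covering all surviving object part rectangles.

    A part whose rectangle has zero area is treated as lost and ignored.
    Returns None when every part tracker is lost.
    """
    alive = [r for r in part_rects if r[2] * r[3] > 0]
    if not alive:
        return None
    x0 = min(x for x, y, w, h in alive)
    y0 = min(y for x, y, w, h in alive)
    x1 = max(x + w for x, y, w, h in alive)
    y1 = max(y + h for x, y, w, h in alive)
    return (x0, y0, x1 - x0, y1 - y0)


def rect_center(rect):
    """Center of an (x, y, width, height) rectangle."""
    x, y, w, h = rect
    return (x + w / 2, y + h / 2)
```

For example, two surviving parts at (10, 10, 20, 30) and (25, 35, 10, 10) plus one lost part at (0, 0, 0, 0) give the whole object rectangle (10, 10, 25, 35).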

Figure 3.8 Maximum rectangle illustration.

Maximum rectangle (thick red) of body, trouser and shoes’ rectangle (thin blue).

The proposed method can be summarized in the following steps (Figure 3.9):

1) Apply mean-shift segmentation to the first frame so that the object parts are easier to choose.

2) Transform the frame into HSV space.

3) Choose each object part by clicking it, and perform region growing starting from the click position.

4) For each object part (red zone in Figure 3.9):

a. Take the minimum and maximum values of hue, saturation, and value for the object part.

b. Calculate the hue-distance, saturation, and value histograms.

c. Make a color mask by evaluating each pixel in the frame against the minimum and maximum values taken in step 4a.

d. Apply mean-shift segmentation to the image to smooth it and reduce noise.

e. Perform histogram back projection using the histograms from step 4b and combine all back projection images.

f. Track the object part based on the combined back projection image and store the new tracking window information.

g. If the tracker is lost, leave it. Otherwise, go to step 5.

5) Find the maximum rectangle of the object part rectangles.

6) Return to step 4c using the new tracking window from step 4f.



Figure 3.9 The proposed method’s schema



4 Implementations and Experiments

In this section, the implementations and experiments are explained. The implementation environment, the library used, and some of the parameters used are described.

4.1 Implementations

In the implementation phase, the OpenCV 2.0 library is used. We use Microsoft Visual Studio 2005 as the development tool.

For mean-shift segmentation, we use the implementation in the OpenCV library. As proposed by Bradski et al. in [5], we use hs = 20, hr = 40, and maximum level = 2, which is good for an image of dimension 640 x 480. For an image of dimension 1280 x 720 (e.g. the first test video), these parameters also give good segmentation results.

For region growing, we use the Flood Fill method, which is also available in the OpenCV 2.0 library [24]. Flood fill appends neighboring pixels based on their color characteristics: if a neighboring pixel has color characteristics close to the seed (i.e. within the tolerance ranges), the pixel is appended. We use a tolerance of 20 for the hue and value components and 40 for saturation.

As the seed grows, it marks the area with perfect white (H=255, S=255, V=255). This perfect white area becomes a mask used to find the extreme values of the area and to calculate the object part histogram. The extreme values are hmin, hmax, smin, smax, vmin, and vmax, which correspond to hue minimum, hue maximum, saturation minimum, saturation maximum, value minimum, and value maximum. These extreme values are used to make the color mask of each object part.

In the histogram calculation, we use 30 bins for the hue and hue-distance components, 9 bins for saturation, and 6 bins for value. These parameters come from experiments and observations and give good results. We keep the hue histogram. After the histogram calculation, which takes the hue-distance, saturation, and value components, we apply a threshold to the hue histogram. This threshold is important to retain pixels of close hue and remove unwanted pixels of distant hue in the back projection. First, we set a threshold of 255 on the histogram; any bin with a value above the threshold is declared a peak. The number of close hues is 70% of the number of peaks, rounded because we need an integer. For example, if we have 10 peaks, the number of close hues is 70% x 10 = 7. This means that only hues with a distance below 7 from the hue reference are taken into the back projection; all other hues are discarded (Figure 4.1). If no histogram bin exceeds 255, the hue-distance threshold is set to 1.
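The threshold rule can be written compactly. In this sketch the function names are hypothetical, and rounding half up is an assumption, since the text above only says that the result is rounded.

```python
def hue_distance_threshold(hue_hist, peak_level=255):
    """Hue-distance threshold derived from the number of histogram peaks.

    A bin is a peak when its value exceeds `peak_level`. The threshold is
    70% of the number of peaks, rounded half up (assumed rounding mode),
    or 1 when there is no peak at all.
    """
    peaks = sum(1 for b in hue_hist if b > peak_level)
    if peaks == 0:
        return 1
    return (70 * peaks + 50) // 100  # integer form of round(0.7 * peaks)


def keep_hue(distance, threshold):
    """Only hues whose distance to the reference is below the threshold pass."""
    return distance < threshold
```

With 10 peaks among the 30 hue bins, the threshold is 7, so a hue at distance 6 from the reference is kept and a hue at distance 7 is discarded.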

Figure 4.1 Hue histogram of airplane body.

Number of peaks = 10, Hue-distance threshold= 7

We smooth the frames by mean-shift segmentation starting from the second frame. We use spatial range hs = 5, color range hr = 20, and maximum pyramid level = 2. With these parameters, noise is reduced significantly while discontinuities (e.g. edges) are still preserved.

The color mask is created by simply evaluating each pixel in the current frame against the extreme (minimum and maximum) values taken in the previous step. For histogram back projection, CAMSHIFT tracking, and maximum rectangle calculation, we use the functions available in the OpenCV 2.0 library [24].

4.2 Experiments Setting

All experiments are carried out on a laptop with an AMD Turion X2 (dual core) 1.6 GHz processor and 2 GB of RAM.



5 Results and Discussions

This section describes the experiment results, followed by a discussion of them.

5.1 Results

5.1.1 First Experiment Results

In the first video, our proposed object localization method shows its strength. With a single click, the object is selected exactly, without any surrounding background included. Surrounding background means background that is spatially close to the object. This helps to build a robust histogram and achieve robust tracking. The object is tracked perfectly despite changes in object shape and orientation. Partial occlusion occurs in the middle of the video, but the tracker deals with it without any problem. The average frame rate is 2.4 frames per second (fps).

This shows that our object localization method works very well for a single-hue object in front of a background of very distinct color.

Figure 5.1 First video result with the proposed method.

First column is the frame, the second column is the back projection image, the third column is the hue histogram with 30 bins at frame 33.

If we use classic CAMSHIFT, we have to configure the parameters for hue, saturation, and value manually. These parameters are used to make a color mask: every pixel that fits these parameters is assigned 1; otherwise it is assigned 0. In the CAMSHIFT implementation example in [24], the hue parameters are set to hmin=0 and hmax=180, which takes all possible hues. Saturation is set to smin=30 and smax=256, and value is set to vmin=10 and vmax=256 (Table 5.1).

We tuned these parameters to get the best result with classic CAMSHIFT. The tuning is carried out on the saturation and value parameters. For the first video, we set smin=50 and, in the first experiment for this video, vmin=150.



Extreme                        Value
Minimum hue (hmin)             0
Maximum hue (hmax)             180 (maximum hue in OpenCV implementation)
Minimum saturation (smin)      30
Maximum saturation (smax)      256
Minimum value (vmin)           10
Maximum value (vmax)           256

Table 5.1 Default parameters of classic CAMSHIFT in [24]

(a) (b) (c)

Figure 5.2 First video result with classic CAMSHIFT at frame 33. (a) The frame (b) back projection image (c) the hue histogram with 16 bins.

(a) (b)

(c) (d)

Figure 5.3 Object localization comparison (a) Object localization using proposed method (b) Object localization using classic CAMSHIFT

(c) Hue histogram using proposed method (d) Hue histogram using classic CAMSHIFT

The result shows that the yellow trunk is tracked successfully at first. Unfortunately, the tracker starts drifting at frame 145, from which point it tracks the background. This is due to the object localization. In classic CAMSHIFT, we use a rectangle to localize the object, and with a rectangle there is a high possibility that some background information is taken into the object model. As we can see in Figure 5.3(b), there is background inside the rectangle, which is included in the object model (hue histogram). That is why the blue bin is filled (Figure 5.3(d)).

5.1.2 Second Experiment Results

For the second video (Figure 5.4), the localization is done with 3 clicks: the first for the airplane body and wing, the second and third for the propellers. Each object part is tracked separately. The result is the whole object rectangle (search window), which is the maximum rectangle of the object part rectangles. The experiment shows that the object parts and the whole object are tracked successfully. The whole object rectangle is very stable in bounding the airplane. Even with cloud distraction, the tracker still tracks the object (Figure 5.4 (a)). The average frame rate is 1.15 fps.

Meanwhile, problems occur if we use classic CAMSHIFT. Here, CAMSHIFT takes the whole range of hue, which in the OpenCV implementation is from 0 to 180, and maximum saturation (smax) = 256. After selecting the whole airplane, the tracking ellipse becomes very large when we set the maximum value (vmax) to 150 (Figure 5.5, 1st row). With vmax=100, the tracking ellipse goes to the island in the background (Figure 5.5, 2nd row). Finally, with vmax=70, the tracking ellipse covers only the propellers (Figure 5.5, 3rd row). This can be explained using the hue histogram and the back projection image.

The back projection image is the projection of the hue histogram onto the hue image of each frame. In the case of vmax=150, the hue histogram is close to the color characteristics of the background. That is why the background has very high intensity in the back projection image, while the object has very low intensity, and the tracking ellipse covers almost the whole background. In the case of vmax=100, some parts of the background have higher intensity than the object. That is why, when the object passes through those background parts, the tracking ellipse jumps to them. In the case of vmax=70, the tracking ellipse tracks robustly but, unfortunately, tracks only the propellers and not the whole airplane. We also tried different minimum and maximum saturation values, but this did not improve the tracking.



(a) (b)

(c) (d)

Figure 5.4 Second video result with our proposed method. (a) Tracked image passing through cloud at frame 290 (b) Air plane body back projection image

at frame 10 (c) Left propeller back projection image at frame 10 (d) Right propeller back projection image at frame 10

Figure 5.5 Second video result with classic CAMSHIFT.

First column is the frame, the second column is the back projection image, the third column is the hue histogram with 16 bins. First row using vmax=150 at frame 95, Second row is using

vmax=100 at frame 325, Third row is using vmax=70 at frame 29



Extreme Values






Maximum value (vmax) 150 / 70

Table 5.2 Tuned CAMSHIFT parameters for test video 1 and test video 2

Extreme Values







Table 5.3 Tuned classic CAMSHIFT parameters for test video 3

(a) (b)

(c) (d)

Figure 5.6 Third video result with the proposed method. (a) Tracked image when the object skewed at frame 639 (b) Yellow body part back projection

image at frame 10. (c) Blue trouser back projection image at frame 10 (d) Orange shoes back projection image at frame 10



This is one of the main problems of the classic CAMSHIFT method. When it deals with a multihued object, it often drifts or tracks only some object parts. We also have to specify the parameters manually for different kinds of videos, which is impractical.

5.1.3 Third Experiment Results

In the third video, a problem occurs: the head is merged with the background by the mean-shift segmentation, so it cannot be selected. This is the problem that arises when, in the localization phase, the object is in front of a background of similar color. We select the body, the trouser, and the shoes. Despite this localization shortcoming, our proposed method still tracks the object well (Figure 5.6). The shoes are tracked until frame 855, when the tracker is lost because the shoes become too small to detect. The trouser is tracked until frame 105: as it moves far from the camera, its color becomes darker, which makes the hue hard to detect. The body is tracked successfully until the end of the video. The average frame rate is 1.96 fps.

With the classic CAMSHIFT method, we found some problems. As mentioned for the second video, the problem with multihued objects happens again here. We used the same experimental configuration; the best result is shown in Figure 5.7. We do not change the minimum saturation value because varying it did not have much influence on the result, so we use the default value smin=30. The result shows that as soon as the object passes through a background of similar color, the tracker drifts and starts tracking the background (Figure 5.7).

5.1.4 Fourth Experiment Results

For the fourth video (Figure 5.8), we select one of the football players. The object can be selected with one click. The result shows that the object is tracked successfully by the proposed method. In the middle of the sequence, the object is almost fully occluded by an opposing player. Nevertheless, the tracker is still able to track the remaining unoccluded part of the object. There is also distraction from a teammate who runs towards the object, but the tracker still tracks the object. The average frame rate is 11.77 fps.

When we test classic CAMSHIFT, we tune the parameters to get the best result (Table 5.4). We set minimum value vmin = 10, minimum saturation smin = 10, and maximum saturation smax = 256, and take the whole hue range. We vary the maximum value among 150, 100, and 70; vmax = 70 gives the better tracking result. Nevertheless, when there is occlusion, the tracker drifts and tracks the occluder (Figure 5.10).

(a) (b) (c)

Figure 5.7 Third video best result with classic CAMSHIFT at frame 300. (a) the frame (b) back projection image (c) hue histogram with 16 bins.

Extreme Values







Table 5.4 Tuned classic CAMSHIFT parameters for test video 4

(a) (b)

Figure 5.8 Object (marked with red rectangle) tracked by the proposed method. (a) The object is almost fully occluded at frame 57 (b) The corresponding back projection image

(a) (b) (c)

Figure 5.9 Fourth video best result with classic CAMSHIFT at frame 57. (a) the frame (b) back projection image (c) hue histogram with 16 bins.



(a) (b)

(c) (d)

Figure 5.10 Drifting tracker. (a) The object is tracked at frame 54 (b) The object almost fully occluded at frame 57 (c) The ellipse covers the object and occluder at frame 61 (d) Tracker drifts: it tracks the occluder at

frame 62

(a) (b)

Figure 5.11 Multiple object tracking using our proposed method (a) Frame 2 (b) Frame 45

Classic CAMSHIFT cannot do multiple object tracking. With our proposed method, CAMSHIFT gains that capability (Figure 5.11). The average frame rate is 4.18 fps.



5.2 Discussion

5.2.1 Some Advantages

Our experimental results show that the proposed method improves CAMSHIFT significantly. This is due to the following improvements.

First, the object localization is more precise. It prevents the object model from taking in surrounding background information, so the tracker drifts less. Classic CAMSHIFT, in contrast, uses a rectangle, which in many cases takes surrounding background information into the object model. Some other improvements [7,12] also fail to model the object precisely. We use the default preprocessing mean-shift segmentation parameters proposed by Bradski [5], which work very well in every test video and reduce the need to tune the parameters manually. We also make it possible to tune these parameters so that the method adapts to the user’s needs. The proposed method detects the extreme values of the object automatically, while in classic CAMSHIFT we have to tune the parameters manually.

Second, the use of a hue-distance histogram with a threshold increases the robustness of CAMSHIFT when the object passes through a background of similar color. The automatic threshold ensures that only pixels of very similar hue are taken into the hue-distance back projection image. In classic CAMSHIFT, using only the hue histogram makes it difficult to track the object in that situation.

Third, splitting the tracking problem into smaller problems by tracking the object parts separately increases the robustness of tracking multihued objects. Classic CAMSHIFT and current CAMSHIFT improvement methods track the object as a whole, which often makes them drift. This is one of the main advantages of our method in terms of robustness.
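
Tracking parts separately can be sketched as one mean-shift window per dominant-color part, each run on that part's own back-projection image. The following is a minimal sketch rather than the thesis implementation: the `mean_shift` and `blob` helpers, the `(x, y, w, h)` window format, and the toy data are assumptions, and CAMSHIFT's window resizing and orientation steps are omitted.

```python
import math


def mean_shift(weights, window, max_iter=10):
    """Shift a fixed-size window (x, y, w, h) to the weighted centroid beneath it."""
    rows, cols = len(weights), len(weights[0])
    x, y, w, h = window
    for _ in range(max_iter):
        m00 = m10 = m01 = 0.0
        for r in range(max(y, 0), min(y + h, rows)):
            for c in range(max(x, 0), min(x + w, cols)):
                m00 += weights[r][c]
                m10 += c * weights[r][c]
                m01 += r * weights[r][c]
        if m00 == 0:
            break  # nothing under the window; this part would be lost
        # Re-centre the window on the centroid (round half up).
        new_x = math.floor(m10 / m00 - (w - 1) / 2 + 0.5)
        new_y = math.floor(m01 / m00 - (h - 1) / 2 + 0.5)
        if (new_x, new_y) == (x, y):
            break  # converged
        x, y = new_x, new_y
    return (x, y, w, h)


def blob(rows, cols, top, left, size):
    """Toy back-projection image: a bright square on a dark background."""
    return [[255 if top <= r < top + size and left <= c < left + size else 0
             for c in range(cols)] for r in range(rows)]


# One back-projection image and one window per dominant-color part.
part_maps = {"part_a": blob(10, 10, 1, 5, 3), "part_b": blob(10, 10, 6, 2, 3)}
windows = {"part_a": (4, 0, 3, 3), "part_b": (1, 5, 3, 3)}

for name in windows:
    windows[name] = mean_shift(part_maps[name], windows[name])
print(windows)
```

Each window converges on its own part independently, so a failure on one part does not pull the other window away from its target.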

Fourth, our proposed method is able to track multiple objects. We tried tracking 6 objects simultaneously with very good results: all objects were tracked successfully. We did not find any CAMSHIFT improvement method that supports multiple object tracking.

5.2.2 Some Limitations

Our proposed method has increased the robustness of CAMSHIFT tracking. Nevertheless, it has some limitations.



First, when the object has many hues, the object localization may not be very practical. In addition, performance may be slower because more trackers have to be computed. For textured objects, such as a running cheetah, the tracker also suffers from the limits of its object localization method.

Second, if the object passes in front of a background with exactly the same color (hue, saturation, value) as the object, the tracker will most likely fail. The reason is that we use only color information, so a background with exactly the same color as the object is considered part of the object as well.

Third, the tracker cannot re-track lost object parts. This is because we use only color information: if we re-track using color alone, there is a high probability that the re-tracking result will be wrong. To re-track, we need other features to model the object.

Fourth, the speed. Our method does not achieve real-time performance because of the separate tracking and some additional tasks that increase robustness. As stated from the outset, however, this issue is not our main concern.



6 Conclusions and Future Works

6.1 Conclusions

We have developed several ways to improve CAMSHIFT robustness. The proposed object localization method improves the robustness of the object model: it largely prevents surrounding background information from being taken into the object model. The use of the hue-distance histogram, the separate tracking of dominant-color object parts, and the use of a maximum rectangle that combines the rectangles of the object parts also help CAMSHIFT track multihued objects against similarly colored backgrounds. Our experimental results show that the proposed method significantly improves CAMSHIFT robustness in challenging videos.
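
The maximum rectangle mentioned above can be read as the smallest axis-aligned rectangle that encloses every object part rectangle. A minimal sketch under that reading follows; the `max_rectangle` name and the `(x, y, w, h)` rectangle format are assumptions for illustration.

```python
def max_rectangle(part_rects):
    """Smallest axis-aligned rectangle enclosing all (x, y, w, h) rectangles."""
    if not part_rects:
        raise ValueError("at least one part rectangle is required")
    left = min(x for x, y, w, h in part_rects)
    top = min(y for x, y, w, h in part_rects)
    right = max(x + w for x, y, w, h in part_rects)
    bottom = max(y + h for x, y, w, h in part_rects)
    return (left, top, right - left, bottom - top)


# Two hypothetical part rectangles, e.g. an upper and a lower object part.
parts = [(5, 1, 3, 3), (2, 6, 3, 3)]
object_rect = max_rectangle(parts)
print(object_rect)  # (2, 1, 6, 8)
```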

6.2 Future Works

Our future work will be to improve the method using a graphics processing unit (GPU) or parallel programming on multi-core processors so that it can achieve real-time speed.

Besides that, we propose to improve the method so that it can re-track lost object parts and better track textured objects and objects with many hues. These are important because many real-world applications need these capabilities.

One remaining task is to improve the tracker for the case where the object has exactly the same color as the background, and to apply this tracker to clickable hypervideo.



7 Bibliography

[1] Bradski, G. R. 1998. “Computer Vision Face Tracking for Use in a Perceptual

User Interface”. Intel Technology Journal, 2(2), 13-27.

[2] Comaniciu, D. and P. Meer. 2002. “Mean Shift: A Robust Approach Toward

Feature Space Analysis”. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 24(5), 603-619.

[3] J. A. Corrales, P. Gil, F. A. Candelas, F. Torres. 2009. “Tracking based on Hue-

Saturation Features with a Miniaturized Active Vision System”. In Proceedings

Book of 40th International Symposium on Robotics, Asociación Española de

Robótica y Automatización Tecnologías de la Producción – AER-ATP,

Barcelona, Spain. pp.107

[4] Tian, G., Hu, R., Wang, Z., and Fu, Y. 2009. “Improved Object Tracking

Algorithm Based on New HSV Color Probability Model”. In Proceedings of the

6th international Symposium on Neural Networks: Advances in Neural

Networks - Part II, Wuhan, China.

[5] Bradski, G., and Kaehler, A. 2008. Learning OpenCV: Computer Vision with

the OpenCV Library. O'Reilly Media, Inc.

[6] Intel Corporation. 2001. Open Source Computer Vision Library Reference

Manual, 123456-001

[7] J. G. Allen, R. Y. D. Xu, and J. S. Jin. 2004. “Object tracking using camshift

algorithm and multiple quantized feature spaces”, in Proceedings of the Pan-

Sydney area workshop on Visual information processing, ser. ACM

International Conference Proceeding Series, vol. 100. Darlinghurst, Australia:

Australian Computer Society, Inc., pp. 3–7.

[8] J. Ning, L. Zhang, D. Zhang, and C. Wu. 2009. “Robust Object Tracking using Joint Color-Texture Histogram”. International Journal of Pattern Recognition and Artificial Intelligence, 23(7), 1245–1263. World Scientific Publishing Company.

[9] Qiu, X., Liu, S., Liu, F. 2009. Kernel-based Target Tracking with Multiple

Features Fusion. Joint 48th IEEE Conference on Decision and Control and

28th Chinese Control Conference, Shanghai, P.R. China.

[10] Ganoun, A., Ould-Dris, N., and Canals, R. 2006. “Tracking System Using CAMSHIFT and Feature Points”. 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy.



[11] Stolkin, R., I. Florescu, M. Baron, C. Harrier and B. Kocherov. 2008. Efficient

Visual Servoing with the ABCshift Tracking Algorithm. In: IEEE International

Conference on Robotics and Automation, pp. 3219-3224, Pasadena, California,

USA.

[12] Xu, R. Y. D., Allen, J., and Jin, J. S. 2003. Robust real-time tracking of non-rigid objects. Conferences in Research and Practice in Information Technology, VIP'03, Sydney, Australia.

[13] K. Fukunaga and L. D. Hostetler. 1975. “The estimation of the gradient of a density function, with applications in pattern recognition”. IEEE Trans. Information Theory, vol. 21, pp. 32-40.

[14] Collins, R. 2007. “Lecture 29: Video Tracking: Mean-Shift” CSE/EE486

Computer Vision I, CSE Department, Penn State University

http://www.cse.psu.edu/~rcollins/CSE486/lecture29.pdf (visited June 2010)

[15] H. Bay, T. Tuytelaars, and L. Van Gool. 2006. SURF: Speeded Up Robust

Features. In ECCV (1), pages 404–417.

[16] M. Heikkila, M. Pietikainen, and C. Schmid. 2009. Description of interest

regions with local binary patterns. Pattern Recognition 42(3):425–436.

[17] A. Yilmaz, O. Javed, M. Shah. 2006. “Object tracking: a survey”. ACM Computing Surveys, vol. 38, no. 4, pp. 1-45.

[18] R. C. Gonzalez, R. E. Woods, and S. L. Eddins. 2004. Digital Image Processing Using MATLAB, 1st Edition. Dorling Kindersley, USA.

[19] J. McC. Smith and D. Stotts. 2002. An Extensible Object Tracking Architecture

for Hyperlinking in Real-time and Stored Video Streams, Technical Report

TR02-017, Department of Computer Science Univ of North Carolina at Chapel

Hill, USA.

[20] S. Stalder, H. Grabner, and L. Van Gool. 2009. Beyond Semi-Supervised

Tracking: Tracking Should Be as Simple as Detection, but not Simpler than

Recognition. In Proceedings ICCV’09 WS on On-line Learning for Computer

Vision, 2009

[21] S. Stalder, H. Grabner, and L. Van Gool. Beyond Semi-Supervised Tracking

Code.

http://www.vision.ee.ethz.ch/boostingTrackers/download.htm (visited

February 2010)

[22] M. Pietikainen and G. Zhao. 2009. Local Texture Descriptors in Computer

Vision. Tutorial in: IEEE International Conference on Computer Vision ICCV.



[23] G. Zhao & M. Pietikainen. C++ implementation of spatio-temporal LBP.

http://www.ee.oulu.fi/research/imag/texture/download/STLBP_VC.zip

(visited March 2010)

[24] 2009. OpenCV 2.0 library. Code from web.

http://sourceforge.net/projects/opencvlibrary/ (visited February 2010)

[25] H. Bay, T. Tuytelaars, and L. Van Gool. 2006. Code from web.

http://www.vision.ee.ethz.ch/~surf/download.html (visited February 2010)

[26] J. Shi and C. Tomasi. 1994. Good features to track, Proc. IEEE Comput. Soc.

Conf. Comput. Vision and Pattern Recogn., pages 593-600.

[27] Jean-Yves Bouguet. Pyramidal Implementation of the Lucas Kanade Feature

Tracker, Description of the algorithm. Intel Corporation Microprocessor

Research Labs

[28] D. Stavens. 2007. The OpenCV Library: Computing Optical Flow. Stanford

Artificial Intelligence Lab, USA

[29] D. Comaniciu, V. Ramesh and P. Meer. 2000. Real-time Tracking of Non-Rigid

Objects Using Mean Shift. CVPR.

[30] M. Heikkilä and T. Ahonen. 2009. Code from web. http://www.ee.oulu.fi/mvg/page/lbp_matlab (visited May 2010)

[31] Code from web. http://opencv.jp/sample/accumulation_of_background.html

(visited May 2010)

[32] Z. Zivkovic. 2004. Improved adaptive Gaussian mixture model for background subtraction. Code from web. http://staff.science.uva.nl/~zivkovic/Publications/CvBSLibGMM.zip (visited May 2010)

[33] S. Stalder, H. Grabner, and L. Van Gool. 2009. Video from web.

http://www.vision.ee.ethz.ch/boostingTrackers/contactBoosting.html (visited

February 2010)

[34] R Valenti, F Hageloh. Video from web.

http://student.science.uva.nl/~rvalenti/uva/MIR/movies/soccer.avi

(visited April 2010)

[35] P. Colantoni, N. Boukala, J. Da Rugna. 2003. Fast and Accurate Color Image

Processing Using 3D Graphics Cards, VMV 2003. Munich, Germany