
Scene Duplicate Detection Based on the Pattern of Discontinuities in Feature Point Trajectories

Xiaomeng Wu
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
[email protected]

Masao Takimoto
The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8654, Japan

Shin'ichi Satoh
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
[email protected]

Jun Adachi
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
[email protected]

ABSTRACT

This paper aims to detect and retrieve videos of the same scene (scene duplicates) from broadcast video archives. A scene duplicate is composed of different pieces of footage of the same scene and the same event at the same time, but from different viewpoints. Scene duplicate detection would be particularly useful for identifying the same event reported in different programs from different broadcast stations. The approach should be invariant to viewpoint changes. We focused on object motion in videos and devised a video matching approach based on the temporal pattern of discontinuities obtained from feature point trajectories. We developed an acceleration method based on the discontinuity pattern, which is more robust to variations in camerawork and editing than conventional features, to dramatically reduce the computational burden. We compared our approach with an existing video matching method based on local features of keyframes. The spatial registration strategy of this method was also used with the proposed approach to cope with visually different, unrelated video pairs. The performance and effectiveness of our approach were demonstrated on actual broadcast videos.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Indexing methods; I.4 [Image Processing and Computer Vision]: Applications

General Terms

Algorithms, Experimentation

Keywords

Video Matching, Scene Duplicate Detection, Time Series Analysis, Feature Point Tracking

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM'08, October 26–31, 2008, Vancouver, British Columbia, Canada.
Copyright 2008 ACM 978-1-60558-303-7/08/10 ...$5.00.

1. INTRODUCTION

Recent advances in broadband networks, storage devices, and digital video broadcasting have created a demand for large-scale video databases and intelligent access to them. To achieve this, video semantic analysis is indispensable. Despite the efforts of many researchers, including the experimental projects on high-level feature extraction conducted by TRECVID, the performance of video semantic analysis is still insufficient.

Recently, near-duplicate shot detection has attracted the attention of researchers. This form of detection does not require semantic analysis of videos (e.g., it does not need to extract information such as what the shown object is or what the scenery is), yet it may enable semantic relations between shots to be evaluated without analyzing the semantic content of each shot. There are several promising applications of near-duplicate detection to broadcast video streams.

1) Commercial film detection and identification [3]. This application is of particular interest to the sponsoring companies of commercial films.

2) News topic tracking and threading [5, 6, 20]. Near-duplicate shots shared by several news topics imply latent semantic relations between these topics. This information can be used to define the similarity between news topics; thus, it is useful for news topic tracking and threading. For instance, by detecting common topics among news videos, near duplicates can be used to provide supplementary information to text similarity evaluation.

3) News topic ranking [23]. Near-duplicate detection can be used to identify the same topic among news programs broadcast by different channels. A topic mentioned by many broadcast stations can be regarded as a highly ranked topic [23].

4) Novelty detection [19]. The earliest broadcast topic may be a novel occurrence among the same topics detected in news video archives collected from several channels over a certain period. Near-duplicate detection can be used for this purpose, combined with text similarity based on language models [19].

5) Filler shot detection [14]. Within a certain period of a news video archive, repeatedly used shots can be regarded as redundant shots (or filler shots), e.g., opening CG shots, anchorperson shots, and weather charts, which are especially useful for news video parsing. Near-duplicate detection can be used to detect such repeatedly used shots [14].


1.1 Classification of Near Duplicates

Near duplicates can be classified as follows:

a) Strict near duplicate. The same video material (footage) is used several times in different programs. In this case, various techniques, e.g., editing, cropping, and video captioning, may have been applied to the footage, but no viewpoint change occurs. Fig. 1 shows examples of strict near duplicates. Typical cases include commercials and file footage used in news.

Figure 1: Example of strict near duplicates. File footage of the same video material.

b) The same objects (Object duplicate). This type of near duplicate is composed of footage involving the same object or the same background, but taken at different times (and/or at different places). Fig. 2 shows an example of object duplicates. This example shows two pieces of footage of a government spokesman speaking about two different topics on different days.

Figure 2: Example of object duplicates. File footage of the same government spokesman mentioning a different topic on different days.

c) The same scene (Scene duplicate). This type of near duplicate is composed of different pieces of footage of the same scene and the same event at the same time, but from different viewpoints, e.g., taken by different cameras and possibly with temporal offsets. Fig. 3 shows examples of scene duplicates. This is a special case of b).

1.2 Related Work

There are a number of near-duplicate detection approaches. Type a) can be detected by almost all approaches. Some approaches focus on detecting a) and pay more attention to computational efficiency [4, 7, 21, 22]. Interest point approaches are typically used to detect types b) and c) [9, 10, 11, 13]. These approaches first extract interest points and compute features, such as SIFT [12], which are invariant to shifts, changes in scale, rotations, etc. If two keyframes share a certain fraction of interest points having almost the same feature values, the shots corresponding to the keyframes are regarded as near duplicates. These approaches are invariant to viewpoint changes, camerawork, etc., to some extent, and therefore they can detect types b) and c) as well as type a).

Figure 3: Examples of scene duplicates. File footage of the same scene at the same time but from different viewpoints.

Computer vision researchers have devised a method to match two videos of the same scene taken from different viewpoints [1]. However, this method estimates the temporal offset and the relative viewpoint difference (such as a homography), and it is computationally extremely expensive. The method described in [2] matches videos using spatio-temporal segmentation and motion-based signatures, but it cannot handle scene duplicate detection properly. The authors of [8] mentioned that detection of type c) is challenging, but they did not present any method to distinguish c) from b). Reference [17] matches the temporal patterns of camera flashes to detect scene duplicates, but it cannot detect scenes without flash lights. To date, little attention has been paid to distinguishing types b) and c), i.e., to distinguishing scene duplicates from object duplicates.

1.3 Research Purpose and Issue

If we could detect scene duplicates without object duplicates, they would be particularly useful for identifying the same event reported in different programs from different broadcast stations. Since type b) may include shots of the same anchorperson reporting on different topics or shots of the same object or person in a totally different context, type b) may have to be eliminated to identify shots corresponding to the same event. Fig. 2 and Fig. 3 clearly show the difference: the object duplicate in Fig. 2 does not correspond to the same event, while the scene duplicates in Fig. 3 obviously relate to the same event.

Figure 4: Motion of an object in scene duplicates. Scene duplicate shots show the same object with the same motion, and thus feature points on the object and their trajectories have the same motion pattern.


In this paper, we propose an approach to detect scene duplicates while eliminating object duplicates. Note that this work is different from traditional video retrieval studies, which do not take into account whether the video similarity corresponds to the "same" scene. In contrast, the purpose of this paper is to detect "identical" video duplicates that contain the "same" scene, the "same" event, and possibly the "same" news topic. This work is also different from copy detection studies that aim at detecting strict near duplicates with no camerawork variation (e.g., the content-based copy detection task in TRECVID). Our work targets the detection of not only strict near duplicates but also scene duplicates taken by different cameras.

The approach should be invariant to viewpoint changes. This issue can be addressed by using invariant features such as SIFT; however, it is difficult to distinguish b) from c) by using interest points and invariant features. Instead, we employ the temporal pattern of discontinuities obtained from the trajectories of feature points (Fig. 4). If two shots belong to the same scene viewed from different viewpoints, a certain fraction of the temporal patterns of trajectories in these shots will match under a common temporal offset.

The approach presented in this paper extends [15], where we used inconsistency [16] to detect discontinuities and to match trajectories. Since applying the method of [15] to large-scale video archives is computationally expensive, we use a filtering approach based on temporal discontinuity patterns to accelerate the matching. Based on this idea, we developed a new near-duplicate detection method that can detect scene duplicates but not object duplicates. We demonstrated its performance on actual broadcast videos from five TV channels and compared it with a state-of-the-art reference technique [13] under the same conditions. This reference technique is keyframe-based and uses local spatial features to detect near-duplicate keyframes in a large dataset. We furthermore exploit this keyframe-based technique in our approach to cope with visually different, unrelated video pairs, and thereby improve the speed and accuracy of scene duplicate detection.

2. SCENE DUPLICATE DETECTION

2.1 Overview

Figure 5: Framework overview.

Our approach works on a video archive composed of video programs, and it outputs the scene duplicates within the archive. The approach is composed of an offline process and an online process (Fig. 5). The offline process first decomposes the given videos into shots with an arbitrary shot boundary detection method and applies interest point detection and tracking to obtain trajectories from each shot. We use the KLT tracker [18] for interest point detection and tracking, but basically any interest point detector and tracker can be used. From each trajectory, the approach extracts an inconsistency sequence and a discontinuity sequence (these terms are explained later). They are all stored for the online process to match shots. Note that a shot may correspond to several trajectories (in our case up to 200), and each trajectory is then associated with an inconsistency sequence and a discontinuity sequence.

The online process then matches all pairs of shots in the archive to obtain scene duplicates. Given a pair of shots, the approach evaluates the similarity between the shots under all possible temporal shifts (frame by frame). The similarity between a pair of shots is evaluated by matching all possible combinations of pairs of trajectories between the two shots. If a certain fraction of the trajectory pairs are similar to some extent, the similarity between the shots will be close to one (a match). The inconsistency sequences and discontinuity sequences generated in the offline process are referred to when evaluating the similarity between trajectories. In addition, the discontinuity sequences are used to accelerate the similarity evaluation.

2.2 Feature Calculation

Before we explain how the similarity between shots S_1 and S_2 given a temporal offset τ is evaluated, we explain the features used to evaluate similarity during the offline process. Let us assume that shot S_i is composed of n_i = |S_i| trajectories T_i^j, j = 1, ..., n_i. The issue is how to evaluate the similarity between two trajectories T_1^j and T_2^k.

To detect motion discontinuities, we use inconsistency [16], which is known to work well for motion estimation from video sequences of complex dynamic scenes. Inconsistency is defined for a small spatio-temporal patch (ST-patch) in the spatio-temporal intensity space, as shown in Fig. 6. If an ST-patch is small enough (e.g., 7 [pixels] × 7 [pixels] × 3 [frames] in our work), then all pixels within it can be assumed to move with a single uniform motion unless the ST-patch is located at a motion discontinuity or contains an abrupt temporal change in motion direction or velocity. Our purpose is to detect these motion discontinuities or temporal changes as the inconsistency.

To do so, the spatio-temporal gradients ∇P_i = (P_{x_i}, P_{y_i}, P_{t_i}) of the intensity at each pixel i within the ST-patch are first examined, with i = 1, ..., n, where n is the number of pixels (e.g., if the ST-patch is 7 × 7 × 3, then n = 147). The following matrix is then computed within the patch:

    M = | Σ P_x²     Σ P_x P_y   Σ P_x P_t |
        | Σ P_y P_x  Σ P_y²      Σ P_y P_t |       (1)
        | Σ P_t P_x  Σ P_t P_y   Σ P_t²    |

where P_x, P_y, and P_t are the spatio-temporal gradients of the intensities at all pixels within the ST-patch. For all small ST-patches containing a single uniform motion, the matrix M is rank deficient: rank(M) ≤ 2. rank(M) = 3 happens when and only when the ST-patch is located at a spatio-temporal motion discontinuity or a change in motion direction or velocity. Such patches are also known as "space-time corners" or patches of "no coherent motion". To detect these motion discontinuities, the following matrix M♦, which only captures the spatial properties of the ST-patch, is defined by removing the temporal property from M:

    M♦ = | Σ P_x²     Σ P_x P_y |
         | Σ P_y P_x  Σ P_y²    |       (2)

It is obvious that rank(M♦) ≤ 2. Also, for an ST-patch with a single uniform motion, the following rank condition holds: rank(M) = rank(M♦). Namely, when there is a single uniform motion within the ST-patch, the added temporal component, which is captured by the third row and third column of M, does not introduce any increase in rank. This, however, does not hold when the ST-patch is located at a motion discontinuity or temporal change, e.g., when the motion is not along a single straight line. In such cases, the added temporal component introduces an increase in rank, namely rank(M) = rank(M♦) + 1.

Let λ_1 ≥ λ_2 ≥ λ_3 be the eigenvalues of M, and let λ♦_1 ≥ λ♦_2 be the eigenvalues of M♦. By the interlacing property of eigenvalues of symmetric matrices, it follows that λ_1 ≥ λ♦_1 ≥ λ_2 ≥ λ♦_2 ≥ λ_3. This leads to the following two observations:

    λ_1 ≥ (λ_1 · λ_2 · λ_3) / (λ♦_1 · λ♦_2) = det(M) / det(M♦) ≥ λ_3       (3)

    1 ≥ (λ_2 · λ_3) / (λ♦_1 · λ♦_2) ≥ λ_3 / λ_1 ≥ 0       (4)

The continuous rank-increase measure ∆r, i.e., the inconsistency, is defined to be

    ∆r = (λ_2 · λ_3) / (λ♦_1 · λ♦_2)       (5)

with 0 ≤ ∆r ≤ 1. The case ∆r = 0 is the ideal case of no rank increase, and ∆r = 1 indicates a clear rank increase.
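For concreteness, the following Python sketch computes ∆r for one ST-patch following Eqs. (1)–(5). It is a minimal illustration, not the authors' code: the gradient computation, the handling of textureless patches, and the patch layout (frames × height × width) are our assumptions.

```python
import numpy as np

def rank_increase_measure(patch):
    """Continuous rank-increase measure (inconsistency) of one ST-patch.

    `patch` is a (frames, height, width) array of intensities, e.g. 3 x 7 x 7.
    Minimal sketch of Eqs. (1)-(5); border handling of the gradients is
    simplified compared with a careful implementation.
    """
    # Spatio-temporal gradients P_t, P_y, P_x at every pixel of the patch.
    pt, py, px = np.gradient(patch.astype(np.float64))
    g = np.stack([px.ravel(), py.ravel(), pt.ravel()], axis=1)  # n x 3

    M = g.T @ g            # Eq. (1): 3x3 matrix of summed gradient products
    M_sp = M[:2, :2]       # Eq. (2): spatial-only 2x2 submatrix (M-diamond)

    lam = np.sort(np.linalg.eigvalsh(M))[::-1]        # lambda1 >= lambda2 >= lambda3
    lam_sp = np.sort(np.linalg.eigvalsh(M_sp))[::-1]  # lambda1-diamond >= lambda2-diamond

    denom = lam_sp[0] * lam_sp[1]
    if denom < 1e-12:      # textureless patch: no evidence of a rank increase
        return 0.0
    return float(lam[1] * lam[2] / denom)             # Eq. (5), bounded by [0, 1]
```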

Inconsistency was first used to compare small (in terms of both image size and duration) template videos with target videos to see whether the target contained an object with motion similar to that of the template. To do so, the study in [16] decomposes the template video as well as the target video into patches and then evaluates the correlation between all combinations of pairs of patches in the videos. This requires computing inconsistencies for all patches, as well as comparing all combinations of patch pairs, which is computationally very demanding.

Figure 6: Calculation of features for trajectories.

Figure 7: Inconsistency of feature points. The graph shows two inconsistency sequences corresponding to the feature points of two scene duplicate shots. Since the feature points are in correspondence, the two inconsistency sequences have almost identical temporal patterns.

We borrow the idea of inconsistency but use it in a different way. We first apply the KLT tracker to a shot, limiting the maximum number of trajectories to 200. Then, for each trajectory, we extract the inconsistency values of the patches on the trajectory (Fig. 6), resulting in a sequence of values for each trajectory. We call such a sequence an inconsistency sequence, c(t; T_i^j), of trajectory T_i^j, expressing the inconsistency value at time t. Fig. 7 shows an example of inconsistency sequences. The computation of inconsistency values is limited to trajectories only, and, as explained later, the comparison of inconsistency values is also limited to pairs of trajectories. After smoothing an inconsistency sequence with a Gaussian, we detect the local maxima and regard them as discontinuities. We generate a discontinuity sequence d(t; T_i^j) of T_i^j as a binary sequence in which 1 corresponds to a maximum (discontinuity) and 0 otherwise.
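The step from an inconsistency sequence to a binary discontinuity sequence can be sketched as follows. The (2w + 1)-frame maximum window mirrors the description above, but the Gaussian width `sigma` is an assumed value, not a parameter taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, maximum_filter1d

def discontinuity_sequence(c, sigma=2.0, w=5):
    """Binary discontinuity sequence d(t) from an inconsistency sequence c(t).

    Smooth c(t) with a Gaussian, then mark frames that are local maxima
    within a (2w + 1)-frame window; w = 5 matches the 11-frame window used
    in the experiments.  The c > 0 check discards flat zero regions.
    """
    c = gaussian_filter1d(np.asarray(c, dtype=np.float64), sigma)
    is_max = (c == maximum_filter1d(c, size=2 * w + 1)) & (c > 0)
    return is_max.astype(np.uint8)
```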

2.3 Shot Similarity

By using inconsistency sequences and discontinuity sequences, we can then define the similarity between trajectories. The basic idea is to check whether the motion discontinuities of the two trajectories occur at almost the same timing. To evaluate the similarity of trajectories T_1 and T_2, we compute the local normalized cross correlation centered at all discontinuities within a certain window width w, and the average of these values is used as the similarity of the trajectories. The similarity of trajectories T_1 and T_2 given a temporal offset τ is then defined as follows:

    Sim(T_1||T_2; τ) = [ Σ_t (d(t;T_1) + d(t−τ;T_2)) · NCC(c(t;T_1), c(t−τ;T_2); t−w, t+w) ] / [ Σ_t (d(t;T_1) + d(t−τ;T_2)) ]       (6)

    NCC(c(t;T_1), c(t;T_2); t_1, t_2) = [ Σ_{t=t_1..t_2} (c(t;T_1) − c̄(T_1)) (c(t;T_2) − c̄(T_2)) ] / sqrt( Σ (c(t;T_1) − c̄(T_1))² · Σ (c(t;T_2) − c̄(T_2))² )       (7)

where NCC(c_1, c_2; t_1, t_2) is the normalized cross correlation between inconsistency sequences for t_1 ≤ t ≤ t_2, and c̄(T_i) is the average of c(t; T_i) for t_1 ≤ t ≤ t_2. The NCC is computed at frames where d(t; T_1) = 1, d(t − τ; T_2) = 1, or both, and then averaged.
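A direct (unaccelerated) implementation of Eqs. (6) and (7) might look like the sketch below; boundary frames where the NCC window does not fit are simply skipped, which is our assumption rather than a detail from the paper.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation between two equal-length windows (Eq. 7)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 1e-12 else 0.0

def trajectory_similarity(c1, d1, c2, d2, tau, w=14):
    """Sim(T1 || T2; tau) of Eq. (6): weighted average of local NCCs centred
    at frames where either trajectory has a discontinuity."""
    num = den = 0.0
    for t in range(w, len(c1) - w):
        s = t - tau                      # corresponding frame of T2
        if s - w < 0 or s + w >= len(c2):
            continue
        weight = d1[t] + d2[s]           # 0, 1 or 2 discontinuities at this frame
        if weight == 0:
            continue
        num += weight * ncc(c1[t - w:t + w + 1], c2[s - w:s + w + 1])
        den += weight
    return num / den if den > 0 else 0.0
```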

On the basis of the similarity between trajectories, we then define the similarity between a trajectory and a shot, i.e., a set of trajectories. From here onwards, the temporal offset τ is omitted for the sake of readability. The similarity between the j-th trajectory in S_1 (T_1^j) and another shot (S_2) is defined as the similarity between T_1^j and the most similar trajectory among the T_2^k:

    Sim(T_1^j || S_2) = max_k Sim(T_1^j || T_2^k)       (8)


The similarity between shots S_1 and S_2 is then defined as follows:

    Sim(S_1 || S_2) = avg_{top ρ%} Sim(T_1^j || S_2)       (9)

    Sim(S_1, S_2) = (1/2) · ( Sim(S_1 || S_2) + Sim(S_2 || S_1) )       (10)

Figure 8: Example of a scene duplicate including feature points that are occluded in one shot. The two shots are scene duplicates. Feature points on the faces match each other; i.e., they are inliers. However, feature points on the occluded background, as well as feature points around the tie that are occluded by the video caption in the other shot, are outlier trajectories.

Among the trajectories j, the top ρ% of the values Sim(T_1^j || S_2) are used for the average. Between shots of scene duplicates, some of the trajectories in one shot will match those of the other shot, but other trajectories will not. (See Fig. 8 for examples of outlier trajectories.) The parameter ρ should be small enough to eliminate trajectories that cannot match, yet at the same time large enough to ensure that a sufficient fraction of trajectories match between the detected scene duplicate shots.
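Putting Eqs. (8)–(10) together, the shot-level similarity can be sketched as below; `trajectory_similarity` is the function sketched in Sec. 2.3, and representing each trajectory by its precomputed (inconsistency, discontinuity) pair is our own packaging, not the paper's data structure.

```python
import numpy as np

def shot_similarity(trajs1, trajs2, tau, rho=0.5):
    """Symmetric shot similarity of Eqs. (8)-(10), a minimal sketch.

    trajs1, trajs2 : lists of (c, d) pairs, one per trajectory, where c is
    the inconsistency sequence and d the binary discontinuity sequence.
    """
    def one_way(a, b, offset):
        # Eq. (8): each trajectory keeps its best match in the other shot.
        best = [max(trajectory_similarity(c1, d1, c2, d2, offset)
                    for (c2, d2) in b)
                for (c1, d1) in a]
        # Eq. (9): average over the top rho fraction of the trajectories.
        best.sort(reverse=True)
        k = max(1, int(rho * len(best)))
        return float(np.mean(best[:k]))

    # Eq. (10): symmetrise; the reverse direction uses the opposite offset.
    return 0.5 * (one_way(trajs1, trajs2, tau) + one_way(trajs2, trajs1, -tau))
```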

2.4 Filtering using Discontinuity Sequences

Processing                               Time (h:mm:ss)
Similarity calculation (trajectory)      3:59:43
Similarity calculation (shot)            0:02:20
Result file output                       0:09:08
Others                                   0:28:38
Total                                    4:39:49

Number of shots: 50
Number of shot pairs: 1,225
Number of trajectory pairs: 8,883,931
Number of trajectory similarity calculations: 2,998,411,843

Table 1: Processing time.

The defined shot similarity can be applied to all combinations of shot pairs in a video archive, and pairs having larger similarities can then be detected as scene duplicates. However, evaluating shot similarity requires a computationally costly normalized cross correlation, even though only a piecewise evaluation around discontinuities (d(t;T_1) = 1, d(t−τ;T_2) = 1, or both) is performed. We tested our approach using 50 video shots captured from TV news programs as the dataset for the experiment. The total length of these shots is around 8 minutes, and Table 1 shows the processing time of this experiment. The system took more than 4.5 hours to detect all scene duplicates in the 8-minute archive, which is far from practical. The trajectory-level similarity calculation took the most time (around 4 hours). This is mainly because we estimate the temporal offset τ by sliding two trajectories and calculating their NCC-based similarity at all possible offsets. Moreover, this evaluation is made for all possible combinations of trajectory and shot pairs within the video archive (in total, 2,998,411,843 possible offsets were tested in our experiment). In other words, the computational cost grows quadratically with the size of the video archive, and the computational burden of this process is excessive. Further acceleration is thus required.

We propose an acceleration method using discontinuity sequences. Since discontinuity sequences are binary, comparing them is computationally much lighter than computing the inconsistency-based trajectory similarity. In this study, we make two assumptions:

Assumption 1: if two trajectories match in terms of trajectory similarity, they have to share at least a certain number of matching discontinuities, namely θd.

Assumption 2: if two trajectories match in terms of trajectory similarity, they have to share at least a certain ratio of matching discontinuities, namely θRoD.

With these assumptions, we can discard all trajectory pairs that do not satisfy the following conditions.

Condition 1: N ≥ θd, where N1 and N2 are the numbers of discontinuities within the overlapping section of the two trajectories T1 and T2, and N is the number of matching discontinuities between them.

Condition 2:
  Strict mode: N/N1 ≥ θRoD and N/N2 ≥ θRoD
  Normal mode: 2N/(N1 + N2) ≥ θRoD
  Loose mode: N/N1 ≥ θRoD or N/N2 ≥ θRoD

Figure 9: Detection of matching discontinuities.

For Condition 1, we can first discard all trajectories having fewer than θd discontinuities. This reduces the number of candidate trajectories, and it might also reduce the number of candidate shots, which is very helpful for acceleration. After that, we can easily filter out any pair of trajectories in which either one has fewer than θd discontinuities within the overlapping section of the two trajectories. This check can be performed quite efficiently by using a cumulative discontinuity sequence: each value of the sequence is the total number of discontinuities from the start of the sequence to the location of that value. Next, we count the number of matching discontinuities between two trajectories and filter out pairs having fewer than θd matching discontinuities. With trajectory T1 fixed, trajectory T2 is gradually slid by changing the temporal offset τ (Fig. 9). Here, we allow a one-frame difference to compensate for variance in the location of discontinuities. This process is also very efficient. For Condition 2, we propose three modes corresponding to different strengths of the restriction. Tightening the restriction reduces the number of noisy trajectory pairs but also eliminates many helpful ones. Relaxing the restriction tends to ensure more complete detection but decreases the efficiency of noise filtering. We tested all three modes in our experiments to evaluate the performance of our filtering method (Sec. 3.3).
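The filter of Conditions 1 and 2 can be sketched as a quick check on a pair of binary discontinuity sequences at a given offset; the one-frame tolerance is implemented here by dilating one sequence, and the wrap-around of np.roll at the sequence ends is a simplification of our own.

```python
import numpy as np

def passes_filter(d1, d2, tau, theta_d=3, theta_rod=0.5, mode="loose"):
    """Discontinuity-sequence filter of Sec. 2.4 (a sketch).

    Frame t of T1 is compared with frame t - tau of T2; a one-frame
    difference is tolerated when counting matching discontinuities.
    """
    lo, hi = max(0, tau), min(len(d1), len(d2) + tau)   # overlapping section
    if hi <= lo:
        return False
    a = np.asarray(d1[lo:hi], dtype=bool)
    b = np.asarray(d2[lo - tau:hi - tau], dtype=bool)

    n1, n2 = int(a.sum()), int(b.sum())
    b_dilated = b | np.roll(b, 1) | np.roll(b, -1)      # +/- 1 frame tolerance
    n = int((a & b_dilated).sum())                      # matching discontinuities

    if n < theta_d:                                     # Condition 1
        return False
    r1, r2 = n / max(n1, 1), n / max(n2, 1)             # Condition 2
    if mode == "strict":
        return r1 >= theta_rod and r2 >= theta_rod
    if mode == "normal":
        return 2 * n / max(n1 + n2, 1) >= theta_rod
    return r1 >= theta_rod or r2 >= theta_rod           # loose mode
```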

3. EXPERIMENTS

We tested our approach on actual broadcast videos. 50 video shots were captured from TV news programs, and they composed the dataset for the experiment. The total length of these shots was around 8 minutes. The dataset generated 1,225 shot pairs and included 20 scene duplicate pairs (29 shots), which were selected manually. The other shots included 49 object duplicate pairs. The videos were stored in MPEG-1 format at 30 frames per second and had an image size of 352 × 240 pixels. All experiments were performed on a PC (Dell Precision T3400, 2.99 GHz, 3.25 GB RAM, Windows XP). All code was written in C++ and compiled with GCC.

The experiment required certain parameters to be set first. The KLT tracker needs the number of feature points to be tracked; this value was set to 200, as mentioned before. The approach requires two shots to overlap by more than a certain number of frames θo. If the overlap is less than θo = 120 frames, the similarity between the pair of shots is defined to be zero (no match). A shot was considered to lack sufficient motion information and to be unsuitable for our trajectory-based approach if it contained θt = 20 or fewer trajectories having more than θDoT = 5 discontinuities. The window size for calculating the NCC was set to w = 14 (2w + 1 = 29 frames in total). The window size for calculating the local maxima of the inconsistency sequences to obtain the discontinuity sequences was set to w = 5 (2w + 1 = 11 frames in total). The parameter ρ used to calculate the similarity between shots was set to ρ% = 50%. The minimum number of matching discontinuities θd was set to 3. The minimum matching discontinuity ratio θRoD was varied from 0 to 1. All parameters, including the ones mentioned above, were determined empirically.
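For reference, the parameter values listed above can be collected in one place; this configuration dictionary is our own summary (names included), not code from the paper.

```python
# Parameter values used in the experiments (Sec. 3); the key names are ours.
PARAMS = {
    "max_trajectories": 200,  # feature points tracked by the KLT tracker
    "theta_o": 120,           # minimum shot overlap in frames
    "theta_t": 20,            # minimum number of usable trajectories per shot
    "theta_DoT": 5,           # minimum discontinuities for a usable trajectory
    "w_ncc": 14,              # NCC half-window (2w + 1 = 29 frames)
    "w_maxima": 5,            # local-maximum half-window (2w + 1 = 11 frames)
    "rho": 0.5,               # top fraction of trajectories averaged in Eq. (9)
    "theta_d": 3,             # minimum number of matching discontinuities
    # theta_RoD is swept from 0 to 1 in the experiments
}
```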

The feature extraction phase rejected 7 shots because they had θt or fewer trajectories whose discontinuities numbered more than θDoT. One of the 29 shots that made up the 20 scene duplicate pairs was rejected in this phase, so one true scene duplicate was falsely rejected. Therefore, among the 43C2 = 946 pairs of shots, 19 pairs were scene duplicates and remained to be detected by our approach.

3.1 Reference Technique

As mentioned in Sec. 1.2, little attention has been paid to distinguishing object duplicates from scene duplicates. In this paper, as a reference to compare with the proposed approach, we used an interest-point-based technique with a local description. The algorithm, named LIP-IS+OOS, was proposed by Ngo et al. [13], and it was applied to the same 50-video dataset described above.

LIP-IS+OOS was originally designed to detect general near duplicates, including both object duplicates and scene duplicates. Videos are segmented into keyframes, and a DoG detector is used to detect interest points in these keyframes. A one-to-one symmetric (OOS) interest point matching strategy is used to match interest points across frames. The matching patterns of keyframes are captured with two histograms of matching orientations of interest points. The histograms are constructed by aligning two frames horizontally and vertically; depending on the alignment, a histogram is composed of the quantized angles formed by the matching lines of interest points and the horizontal or vertical axis. The homogeneity of the histogram patterns is measured as the keyframe similarity by evaluating the mutual information between the two histograms, with entropy used to reveal the mutual information. The similarity between a pair of shots is then evaluated based on the keyframes extracted from them.

In [13], LIP-IS+OOS was tested on a keyframe database rather than a video archive. By observing our experimental results, we found that the method used to select keyframes has a significant impact on the accuracy of near-duplicate "shot" detection. Therefore, we performed the experiment while changing the density of keyframe selection. We extracted multiple keyframes from each video shot in the 50-video dataset. To change the density, the shot length was divided equally, and the frames at the points of division were selected as keyframes. In equation terms, given the shot length L, the (i·L/(N+1))-th frames are extracted as keyframes, with i = 1, ..., N and N = 1, ..., 5. The similarity of each shot pair was evaluated by calculating the maximum similarity among all possible keyframe pairs of the two shots:

    Sim(S_1, S_2) = max_{i,j} Sim(K_1^i, K_2^j)       (11)
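The keyframe sampling and Eq. (11) can be sketched as follows; `frame_similarity` stands in for the LIP-IS+OOS keyframe matcher, whose implementation is not reproduced here, so this is an illustration of the evaluation protocol only.

```python
def keyframe_indices(shot_length, n):
    """Frames at i*L/(N+1), i = 1..N, used as keyframes (a sketch)."""
    return [round(i * shot_length / (n + 1)) for i in range(1, n + 1)]

def keyframe_shot_similarity(keyframes1, keyframes2, frame_similarity):
    """Eq. (11): maximum similarity over all keyframe pairs of two shots."""
    return max(frame_similarity(k1, k2)
               for k1 in keyframes1 for k2 in keyframes2)
```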

3.2 Accuracy Analysis

Figure 10: Precision-recall without acceleration.

Method                     AveP (%)
LIP-IS+OOS (N=1)           31.30
LIP-IS+OOS (N=2)           22.91
LIP-IS+OOS (N=3)           32.84
LIP-IS+OOS (N=4)           36.10
LIP-IS+OOS (N=5)           29.09
PROPOSED METHOD (SDD)      59.58

Table 2: Average precision comparison.

Fig. 10 shows the performance of our scene duplicate detection as well as that of LIP-IS+OOS. The comparison shown in the figure does not involve filtering by discontinuity sequences. An evaluation using average precision (AveP) was also performed (Eq. 12), and Table 2 shows the results.

    AveP = [ Σ_{r=1..N} P(r) · rel(r) ] / (number of relevant documents)       (12)

where r is the rank, N is the number of retrieved items, rel() is a binary function indicating the relevance of a given rank, and P() is the precision at a given cut-off rank.
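Eq. (12) corresponds to the usual non-interpolated average precision; a minimal sketch over a ranked list of binary relevance labels:

```python
def average_precision(ranked_relevance, num_relevant):
    """Eq. (12): AveP over a ranked list, where rel(r) is 0 or 1."""
    hits, total = 0, 0.0
    for r, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / r        # P(r) * rel(r)
    return total / num_relevant if num_relevant else 0.0
```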

From Fig. 10, we can see that the proposed approach achieves a high precision rate (78.57%) at a fixed recall of 55%. From Fig. 10 and Table 2, the proposed approach enables more accurate and more complete scene duplicate detection than LIP-IS+OOS. Even without the filtering method, our approach outperforms LIP-IS+OOS at almost all fixed recalls. One thing to note here is that LIP-IS+OOS was originally proposed for detecting general near duplicates, not only the scene duplicates of the same event that are the main target of our study. For this reason, it stands to reason that LIP-IS+OOS underperforms the proposed approach in terms of scene duplicate detection. This is also considered to be the reason why the precision-recall curve of LIP-IS+OOS shows such an irregular shape in Fig. 10 instead of the standard trade-off relationship.

Figure 11: Examples of detected scene duplicates.

Examples of detected scene duplicates are shown in Fig. 11. They show that our approach is sufficiently invariant to camerawork. Variation in camerawork always leads to changes of viewpoint, picture composition, camera motion, color, resolution, and image distortion. Because the proposed video matching criterion is based only on the motion of feature point trajectories, the impact of color and resolution variation can be avoided. Changes of camera motion, e.g., zoom or pan, as well as image distortion due to editing, might affect the inconsistency-based similarity measure of scene duplicates. However, this problem did not arise here because we used TV news programs as the experimental data, and no scene duplicate in this dataset contained large variations in camera motion or distortion. As for viewpoint and picture composition, as shown in Fig. 11, even when the variation of camera direction or zoom leads to a relatively large difference in picture composition (the 3rd pair) or object scale (the 4th pair), our approach can still successfully detect these scene duplicates based on object motion.

Figure 12: Examples of false negatives.

In this paper, we used the KLT tracker for interest point detection and tracking. KLT tracking points have a tendency to drift due to abrupt illumination changes, occlusion, and the aperture problem. Drift can be the greatest source of tracking error in object tracking studies, which need very accurate spatial positioning of the tracked points. In our work, if too many points drift away from their correct positions, they will certainly affect the motion consistency. However, if the number of drifting points is small enough, the temporal positions of the motion discontinuities will hardly change, and our approach can still successfully detect scene duplicates based on the motion discontinuity pattern.

We have to note that our approach cannot handle all possible changes of viewpoint. For instance, in the upper example shown in Fig. 12, because the viewing angle of the object is too different, the discontinuity sequences of most of the detected trajectories do not match. We have not yet quantitatively studied to what extent our approach can tolerate viewpoint change. In the lower example in Fig. 12, although the most informative motion is in the head region, most of the feature points are detected in the body region, the background, and the video captions. Therefore, the similarity between these two shots becomes low. This problem could be addressed by further treatment in the trajectory detection phase. These issues are left as future work.

3.3 Acceleration Evaluation

Figure 13: Comparison of the three filtering modes. The graph is generated by shifting the threshold θRoD and evaluating the average precision and processing time of scene duplicate detection. The closer the curve is to the top-left corner, the better the performance (less processing time and higher average precision).

Figure 14: Accuracies with and without filtering. θRoD is set to 0.5 in all cases.

We evaluated the effects of our acceleration. The results are shown in Fig. 13 and Fig. 14. The former illustrates the relationship between processing time and average precision for the three filtering modes and is generated by shifting the threshold θRoD. The latter illustrates the precision-recall curves of scene duplicate detection with and without filtering, with θRoD set to 0.5. SMF, NMF, and LMF respectively denote strict, normal, and loose mode filtering.

From Fig. 13, it is clear that filtering in all three modes dramatically accelerates scene duplicate detection without decreasing detection accuracy. Regarding loose mode filtering using discontinuity sequences, when θRoD is set to 0.5, the approach achieves a 10.58-fold speedup. In this case, the processing time is reduced from more than 4.5 hours (Table 1) to around 26 minutes. The number of possible offsets evaluated in the trajectory similarity calculation phase drops from 2,998,411,843 to 164,531,932 (5.49%).

Also, in Fig. 13 and Fig. 14, loose mode (SDD+LMF) shows the best accuracy among the three filtering modes. In Fig. 13, given a fixed processing time (e.g., 1 hour), LMF gives higher average precision than SMF and NMF. Conversely, LMF takes the shortest time to reach a given average precision (e.g., 60%). The reason that LMF outperforms the other two modes is that the loose mode is more invariant to changes in scale and motion intensity. As defined in Sec. 1.1, scene duplicates are videos taken by different cameramen. This leads to possibly different camerawork, e.g., close-up, mid-range, and far-range shots, which causes variations in the scales of moving objects in the video. The upper shot pair in Fig. 3 and the 4th pair in Fig. 11 are prime examples. Since the real-world motion of the video objects is the same, a close-up view tends to show larger or faster motion than mid-range and far-range views. By observing the experimental results, we found that the inconsistency and discontinuity features are sensitive to large motions, so that more noise discontinuities are generated from a trajectory with larger motion. The resulting difference between the discontinuity counts N1 and N2 in Condition 2 (Sec. 2.4) makes a true scene duplicate more likely to be rejected by strict mode and normal mode filtering. Therefore, we chose to use the loose mode in further experiments.

Moreover, the acceleration treatment improved the detection accuracy. In the case of loose mode filtering with θRoD set to 0.5, the approach achieves a better average precision (61.87%) than without filtering (59.58%). The best average precision (62.73%) is obtained when θRoD = 0.35.

4. FUSION WITH LOCAL FEATURE REGISTRATION

From the experiments described in Sec. 3, we found that LIP-IS+OOS cannot distinguish object duplicates (Fig. 2) from scene duplicates. On the other hand, the proposed trajectory-based matching approach detects duplicates without depending on visual or spatial similarity. It can reject object duplicates of different events even when the two shots are very similar in image appearance.

Figure 15: Object duplicates with similar motion.

However, there are still limitations to an approach that depends only on trajectory motion. For instance, as illustrated in Fig. 15, the two videos are completely unrelated but are falsely recognized as scene duplicates only because the persons' faces have similar motions. Conversely, LIP-IS+OOS can easily avoid such false positives because it depends on local-feature-based visual similarity rather than motion information. In particular, since it uses spatial coherence as the video matching criterion, LIP-IS+OOS is much more sensitive to the difference between completely unrelated videos. In other words, although LIP-IS+OOS cannot distinguish object duplicates from scene duplicates, its local-feature-based registration strategy enables it to effectively distinguish general near duplicates from noisy video pairs that are entirely visually different.

Figure 16: Fusion of local-feature-based registration with the proposed approach. θRoD is set to 0.5 in all cases.

Method                          Time (h:mm:ss)   AveP (%)
LIP-IS+OOS (N=1)                0:01:14          31.30
LIP-IS+OOS (N=2)                0:04:58          22.91
LIP-IS+OOS (N=3)                0:11:11          32.84
LIP-IS+OOS (N=4)                0:19:30          36.10
LIP-IS+OOS (N=5)                0:30:50          29.09
PROPOSED METHOD (SDD)           4:39:49          59.58
LIP-IS+OOS+SDD+LMF (N=1)        0:02:01          53.63
LIP-IS+OOS+SDD+LMF (N=2)        0:05:58          57.91
LIP-IS+OOS+SDD+LMF (N=3)        0:12:23          71.63
LIP-IS+OOS+SDD+LMF (N=4)        0:20:45          70.42
LIP-IS+OOS+SDD+LMF (N=5)        0:32:22          74.34
SDD+LMF                         0:26:27          61.88

Table 3: Processing time and average precision: fusion of local-feature-based registration with the proposed approach. θRoD is set to 0.5 in all cases.

We used the local-feature-based method to remove noise and improve the precision of scene duplicate detection by filtering out entirely visually different video pairs. LIP-IS+OOS was used as preprocessing, and the same experiments as described in Sec. 3 were performed. To ensure completeness of detection, we loosened the threshold setting of LIP-IS+OOS so that it would generate as many near-duplicate candidates as possible. The proposed approach was then applied to only these candidates, and the performance of scene duplicate detection was evaluated. The results are shown in Fig. 16 and Table 3.

From Fig. 16, we can see the performance improvement of the proposed approach with LIP-IS+OOS as preprocessing. Object duplicates whose inconsistency patterns are very similar (Fig. 15) are successfully filtered out. When extracting three keyframes per shot (N = 3) and setting θRoD to 0.5, we obtain very accurate scene duplicate detection with 100% precision at 60% recall. Moreover, as shown in Table 3, the detection speed is also increased by removing noise and thereby reducing the number of candidate shot pairs. The minimum processing time is obtained with N = 1 keyframe per shot, but the accuracy is lower. In the case of N = 3, the processing time falls from around 26 minutes to less than 13 minutes, a more than two-fold reduction, and the average precision rises from 61.88% to 71.63%. Therefore, we recommend N = 3 for the LIP-IS+OOS preprocessing.
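The fusion can be viewed as a two-stage cascade; the sketch below assumes scalar similarity functions and thresholds (`theta_pre`, `theta_sdd`) that are not specified in the paper, so it illustrates only the structure of the pipeline.

```python
def detect_scene_duplicates(shot_pairs, lip_is_oos_sim, sdd_sim,
                            theta_pre=0.1, theta_sdd=0.5):
    """Two-stage cascade of Sec. 4 (a sketch): a loosely thresholded
    LIP-IS+OOS pass removes visually unrelated pairs, then the
    trajectory-based similarity (SDD+LMF) makes the final decision."""
    candidates = [(s1, s2) for (s1, s2) in shot_pairs
                  if lip_is_oos_sim(s1, s2) >= theta_pre]
    return [(s1, s2) for (s1, s2) in candidates
            if sdd_sim(s1, s2) >= theta_sdd]
```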

5. APPLICATION TO BROADCAST VIDEO

Stations:    NHK    NTV    TBS    FTV    TVA
Duration:    3 hours per station
Aired time:  06:00–08:00, 06:00–09:00, 19:00–20:00 (depending on the station)
#Shots:      80     83     75     89     97

Table 4: Dataset obtained from the video archive.

We then applied the proposed scene duplicate detection to videos obtained from an actual broadcast video archive. Table 4 describes the videos used in this experiment. We employed news programs from five channels, three hours each, 15 hours in total. They were aired on the same day, so some news topics were expected to be shared among different channels. After shot decomposition, we manually excluded anchorperson shots to reduce the number of candidate shots and save computation. Although we excluded these shots manually, anchorperson shot detection as well as camera motion analysis have been studied and many good algorithms have already been proposed, so this process can be automated if needed. The number of shots extracted and used is shown in Table 4.

Processing                               Time (hh:mm:ss)
LIP-IS+OOS (N=3)                         19:24:34
Matching discontinuity detection         00:02:15
Similarity calculation (trajectory)      00:15:36
Similarity calculation (shot)            00:05:10
Result file output                       00:16:59
Others                                   00:02:28
Total                                    20:07:02

Table 5: Processing time of scene duplicate detection on the 15-hour broadcast video archive. LIP-IS+OOS+SDD+LMF (N=3); θRoD is set to 0.5. Note that the estimated processing time of the original system (SDD) on the same 15-hour archive would be more than 341 hours (more than 14 days).

We then applied scene duplicate detection to the videos. We chose the loose mode in this experiment (θRoD = 0.5) and N = 3 for the LIP-IS+OOS preprocessing. The processing time is shown in Table 5. LIP-IS+OOS took the most time (more than 19 hours); after that, scene duplicate detection took about 42 minutes. The estimated processing time of the original system (SDD) on the same 15-hour archive would be more than 341 hours (more than 14 days). With the filtering using discontinuity sequences and the local-feature-based registration, the processing time decreased from more than 14 days to around 20 hours, roughly a 17-fold speedup.


6. CONCLUSIONS

We proposed an approach to detect scene duplicates as a variant of near duplicates. The approach is invariant to camera angle differences, background complexity, and video captions. Experiments show that our approach can successfully detect scene duplicates while excluding object duplicates.

The current trajectory-based matching technique can handle frequent discontinuous motion, and so far it works well mostly for videos showing faces. Adaptation to other types of video or to larger-scale data, e.g., TRECVID copy detection, is a future issue. Another issue is the application to regular broadcast streams. If we assume a five-channel broadcast video archive containing three-hour programs per day, the total amount is 15 hours per day (focusing only on news programs, which are more likely to contain topics shared among different channels than other types of programs). It is also reasonable to assume that news programs more than one week older than the current day are out of date and do not share scene duplicates with today's 15-hour videos. The issue is then to detect scene duplicates, first within today's videos (1 day) and second between today's videos and the videos of the previous week (7 days). The total processing time is therefore 1 + 7 = 8 times the time needed to perform detection on 15 hours of video, which is around 8 × 20 = 160 hours (N = 3 and θRoD = 0.5). Since the shot similarity evaluations of different shot pairs are independent calculations, scene duplicate detection can be performed in parallel. Using 7 CPUs, the current system would take 160/7 < 23 hours to detect all duplicates for a one-day archive, which is faster than real time.

7. ACKNOWLEDGMENTS

Our thanks go out to Professor Chong-Wah Ngo at City University of Hong Kong for providing us with the binary of the LIP-IS+OOS method.

8. REFERENCES

[1] Y. Caspi and M. Irani. Spatio-temporal alignment of sequences. IEEE Trans. Pattern Anal. Mach. Intell., 24(11):1409–1424, 2002.
[2] D. DeMenthon and D. S. Doermann. Video retrieval using spatio-temporal descriptors. In ACM Multimedia, pages 508–517, 2003.
[3] L.-Y. Duan, J. Wang, Y. Zheng, J. S. Jin, H. Lu, and C. Xu. Segmentation, categorization, and identification of commercial clips from TV streams using multimodal analysis. In ACM Multimedia, pages 201–210, 2006.
[4] A. Hampapur and R. M. Bolle. Comparison of distance measures for video copy detection. In ICME, 2001.
[5] W. H. Hsu and S.-F. Chang. Topic tracking across broadcast news videos with visual duplicates and semantic concepts. In ICIP, pages 141–144, 2006.
[6] I. Ide, H. Mo, N. Katayama, and S. Satoh. Topic threading for structuring a large-scale news video archive. In CIVR, pages 123–131, 2004.
[7] K. Iwamoto, E. Kasutani, and A. Yamada. Image signature robust to caption superimposition for video sequence identification. In ICIP, pages 3185–3188, 2006.
[8] A. Jaimes, S.-F. Chang, and A. C. Loui. Duplicate detection in consumer photography and news video. In ACM Multimedia, pages 423–424, 2002.
[9] A. Joly, O. Buisson, and C. Frelicot. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 9(2):293–306, February 2007.
[10] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In ACM Multimedia, pages 835–844, 2006.
[11] J. Lejsek, F. H. Asmundsson, B. T. Jonsson, and L. Amsaleg. Scalability of local image descriptors: a comparative study. In ACM Multimedia, pages 589–598, 2006.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[13] C.-W. Ngo, W. Zhao, and Y.-G. Jiang. Fast tracking of near-duplicate keyframes in broadcast domain with transitivity propagation. In ACM Multimedia, pages 845–854, 2006.
[14] S. Satoh. News video analysis based on identical shot detection. In IEEE International Conference on Multimedia and Expo, pages 69–72, 2002.
[15] S. Satoh, M. Takimoto, and J. Adachi. Scene duplicate detection from videos based on trajectories of feature points. In Multimedia Information Retrieval, pages 237–244, 2007.
[16] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, pages 405–412, 2005.
[17] M. Takimoto, S. Satoh, and M. Sakauchi. Identification and detection of the same scene based on flash light patterns. In ICME, pages 9–12, 2006.
[18] C. Tomasi and T. Kanade. Detection and tracking of point features. Carnegie Mellon University Technical Report CMU-CS-91-132, April 1991.
[19] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Novelty and redundancy detection with multimodalities in cross-lingual broadcast domain. Computer Vision and Image Understanding, to appear.
[20] X. Wu, C.-W. Ngo, and Q. Li. Threading and autodocumenting news videos: a promising solution to rapidly browse news topics. IEEE Signal Processing Magazine, 23(2):59–68, March 2006.
[21] F. Yamagishi, S. Satoh, and M. Sakauchi. A news video browser using identical video segment detection. In PCM, pages 205–212, 2004.
[22] J. Yuan, L.-Y. Duan, Q. Tian, and C. Xu. Fast and robust short video clip search using an index structure. In Multimedia Information Retrieval, pages 61–68, 2004.
[23] Y. Zhai and M. Shah. Tracking news stories across different sources. In ACM Multimedia, pages 2–10, 2005.
