Institut National de l’Audiovisuel
Conservatoire National des Arts et Métiers

PhD Thesis

Scalable Content-Based Video Copy Detection for Stream Monitoring and Video Mining

Sébastien POULLOT

Examination committee:
Dr. Shin’ichi SATOH, NII, Tokyo, JAPAN
Dr. Patrick GROS, INRIA, Rennes, FRANCE
Pr. Alberto DEL BIMBO, Università degli Studi di Firenze, ITALY
Pr. Frédéric JURIE, Université de Caen, FRANCE
Pr. Michel CRUCIANU, CNAM, Paris, FRANCE
Dr. Olivier BUISSON, INA, Paris, FRANCE
Invited member: Denis MARRAUD, EADS Innovation Works, FRANCE

This thesis was prepared in the TTAV team of the INA and in the Vertigo team of CEDRIC.
[email protected]
http://poullot.sebastien.free.fr

pour Tinou qui n’a jamais rien compris à l’informatique 0:-)

Acknowledgments

So many people to thank. First, Shin’ichi Satoh and Patrick Gros for accepting to examine my work, and the members of the jury, for taking the time to review it. For the more personal thanks I prefer to use my French words, so as not to lose the spirit :)

Mes parents qui m’ont conseillé de faire des études, j’espère que ça ira, pour ma part oui je crois, mais je peux encore prolonger. Mes douces et rassurantes soeurettes, Juliette et Elsa. Lorraine qui m’a accompagné tout le long de cette aventure et qui est plus que patiente quand le travail me rend irascible. Les sarthois en général, et les pochtrons plus précisément, toujours agités, même s’ils sont passés de la rugueuse téquila aux douceurs vinicoles nationales. Axl, Alexandra, Julien, Judith, Guillaume, Julie, Alex, Anne-Laure, Etienne, Pal, Manu et Cyprien, en particulier. Au fait Manu t’as gagné, t’es prem’s, mais t’es à Winipegg :p, pas pour longtemps t’inquiète :). Les trop gentils et trop mignons fans du GF38, Damien, Claire, Bob et Marceau, qui finiront par supporter le MUC72. Le magnanime Toto, sa bonne cuisine et ses apéros du mardi, créateurs de rencontres et générateurs de curiosité, et tous ceux qu’on y croise, plus particulièrement le Gonz’, Delphine, Nico, Allie et Marion. Mes compagnons de l’INA, ceux qui ont partagé des idées et des conférences, Julien et Alexis, ceux qui ont partagé les locaux, la super cantine (que je remercie aussi !) et le café, Jérôme, Thomas, Jérémy, Hervé, Quentin, Félicien, Damien, Sébastien, Jean-Et’ et Mc-Luce. Jérôme encore une fois pour tout le travail qu’il a fourni sur les interfaces graphiques. Mes compagnons du CNAM, Nouha, Imen, Timo, Dan et Nico notamment. Michel Scholl, un directeur de recherche toujours insoumis et souriant. Michel pour son précieux travail de rédaction et sa magistrale direction de thèse, et Olivier pour m’avoir tout donné avant que je puisse continuer par moi-même.

Merci à tous pour votre soutien et votre attention, pour les partages de galères mais surtout de joie. 3 ans c’est long, et beaucoup de choses se sont passées, mais je n’ai rien à regretter ni à oublier. Hou que non. Continuons plus loin, plus haut, plus fort. Et je vous promets des rillettes encore meilleures.

Contents

1 Introduction ..... 1

2 Scalable Content Based Copy Detection for Video Stream Monitoring ..... 5
  2.1 Introduction ..... 5
  2.2 Video copy detection and stream monitoring ..... 8
    2.2.1 Requirements for a stream monitoring system ..... 8
    2.2.2 Possible solutions for video copy detection ..... 12
    2.2.3 Generic workflow of video stream monitoring ..... 12
  2.3 Video description and content-based copy detection ..... 14
    2.3.1 State of the art ..... 14
    2.3.2 Adopted video description method ..... 17
    2.3.3 Decision process ..... 20
  2.4 Similarity-based retrieval and indexing ..... 21
    2.4.1 Retrieval by similarity for copy detection ..... 21
    2.4.2 Existing indexing proposals ..... 23
    2.4.3 Overview of our contributions ..... 28
  2.5 Z-grid indexing ..... 30
    2.5.1 Definition and construction of the Z-grid index ..... 30
    2.5.2 Probabilistic retrieval with the Z-grid ..... 32
  2.6 Optimizations exploiting the global data distribution ..... 35
    2.6.1 Component sorting ..... 36
    2.6.2 Boundary adaptation ..... 37
    2.6.3 Evaluation of the selectivity improvements ..... 37
  2.7 Optimizations based on local models ..... 40
    2.7.1 Precise modeling of the distortions ..... 41
    2.7.2 Modeling the local density ..... 49
  2.8 Experimental evaluations ..... 54
    2.8.1 Base and index construction ..... 54
    2.8.2 Detection quality on ground truth databases ..... 55
    2.8.3 Scalability to very large databases ..... 61
  2.9 Conclusion ..... 64

3 Video Mining Using Content Based Copy Detection ..... 67
  3.1 Introduction ..... 67
  3.2 Video mining state of the art ..... 69
    3.2.1 Approaches relying on content-based copy detection ..... 69
    3.2.2 Other approaches ..... 70
  3.3 Potential applications ..... 71
    3.3.1 INA context ..... 71
    3.3.2 Web2.0 context ..... 72
  3.4 General goals ..... 74
  3.5 Difficulties in adapting a stream monitoring solution to video mining ..... 75
    3.5.1 General considerations on performance ..... 75
    3.5.2 Focus on the bottlenecks ..... 76
  3.6 A new video mining framework by CBCD ..... 77
    3.6.1 Our contributions ..... 77
    3.6.2 Proposed framework ..... 78
  3.7 The Glocal descriptor ..... 80
    3.7.1 The Glocal descriptor scheme ..... 80
    3.7.2 Compactness of Glocal signatures ..... 82
    3.7.3 Similarity between Glocal signatures ..... 82
    3.7.4 Effects of the copy processes ..... 83
  3.8 Sentence-based indexing ..... 85
    3.8.1 Principle of the indexing scheme ..... 86
    3.8.2 Speed-up estimation of the similarity self-join ..... 88
    3.8.3 Sentences selection ..... 89
    3.8.4 Collision analysis for rule-based bucket selection ..... 90
    3.8.5 Collision analysis for random bucket selection ..... 92
    3.8.6 Implementation issues ..... 93
    3.8.7 Finding the links between keyframes ..... 96
  3.9 Reconstruction of the matching video sequences ..... 97
  3.10 Parallelization of the mining process ..... 99
  3.11 Graphs and post-processing ..... 100
  3.12 Experimental evaluations ..... 103
    3.12.1 Calibration and mining quality evaluation ..... 104
    3.12.2 Evaluation of the off-line version ..... 106
    3.12.3 Evaluation of the online version ..... 108
  3.13 Graphs and illustrations ..... 112
  3.14 Conclusion ..... 120

4 Conclusion ..... 123

Bibliography ..... 126

CHAPTER 1

Introduction

The production and consumption of video contents are subject to deep changes driven by quickly evolving related technologies and by globalization. Demand is continuously encouraged by pervasive, ever faster and cheaper access to distribution channels. Technological convergence gives different types of terminals access to several such channels. “Video anywhere, anytime” also has an impact on the characteristics of the sought-after content, typically consisting of short video sequences played over and over again and possibly merged into “best of” compilations (e.g. music videos, soccer goals, funny shots). Cultural globalization further implies that about the same content is popular in many different areas.

Given the globalization of the audiovisual industry, the cultural globalization and the existence of large video sharing Web sites, content produced anywhere is typically distributed everywhere. Cheap video cameras, often integrated into widespread devices like mobile phones, together with free user-friendly video-editing software, allow a large number of consumers to become content creators. Video sharing Web sites and peer-to-peer networks provide user-generated content (and, in some cases, professional content) with free and reliable platforms for large scale distribution.

For higher impact and lower production costs, new content offered by professionals or amateurs is often obtained by recycling existing popular content, taken either from professional producers or from video sharing Web sites. It is worth noting that 95% of the videos stored on Web2.0 sites are less than 4 minutes long [CDL07], [GALM07]. The fact that new video programs are often produced by recycling existing content shows how widespread the use of copies is, both among the professionals and among the amateurs.

The term “copy” is employed here for any video that is directly or indirectly derived from original content. While initially motivated by copyright enforcement, the detection of these copies in video streams or databases can have many other applications of significant interest for content owners, providers and consumers.

Apart from copyright enforcement, copy detection in video streams can be directly employed for automatic billing, scheduling enforcement and content filtering. The identification, in a large video database, of all the video sequences that occur more than once (with various modifications) can make explicit an important part of the internal structure of the database, thus supporting content management, retrieval and preservation for both large institutional archives and video sharing Web sites. As specific examples, we can mention content segmentation, the extension of textual annotations from one video to another, the removal of lower-quality copies or advanced visual navigation.

Among all the copies found in TV broadcasts or on Web2.0 sites, full copies or exact copies are quite infrequent. A full copy is a reproduction of the entire original program, with possible changes of the visual aspect. An exact copy may exploit only an excerpt of the original program, but that excerpt is left unmodified. Since the visual aspect or the timeline of the original content is, most of the time, significantly modified when a copy is created, the detection of copies is a difficult task. The representation of the videos and the decision procedure should be as robust as possible to the changes resulting from a copy-creation process in order to have a good detection rate. At the same time, they should remain very sensitive to all the other differences between videos in order to have as few false alarms as possible. Also, given the broad distribution of video content, a large number of channels should be monitored. By further considering the huge volume of original documents to protect, we can see that video copy detection faces a significant scalability challenge.

Watermarking was the first solution to the video copy detection problem. Since many watermarking schemes have a relatively low computation cost, they are more scalable. But watermarks are not so robust to many of the strong transformations that are employed for creating copies. Moreover, the multiplicity of existing watermarking solutions and the diversity of their usage policies makes the implementation of a general copy detection system highly challenging. Last but not least, copy detection based on watermarks cannot be employed if the original content was disseminated before the application of any mark, which is the case for a large part of the existing video content to protect. The alternative to watermarking is to perform Content-Based Copy Detection (CBCD) by matching signatures describing the content itself. Indeed, to have some interest, a copy should preserve at least the main information conveyed by the original content. If, with adequate content descriptions and an associated similarity measure, the candidate video (potential copy) is found to be similar enough to the original video, then the candidate is considered to be a copy of the original. During the last few years there has been significant interest in image and video CBCD, both for finding description schemes that provide reliable detection results and for obtaining cost-effective CBCD solutions. Note that nothing prevents CBCD from being employed together with watermarking in a broader copy detection approach.

In this thesis we address the scalability challenge for content-based video copy detection, both for monitoring video streams and for finding content links in a large video database.
While both applications rely on CBCD, their specific requirements lead us to adopt different solutions, which explains why the thesis is divided into two parts.

In the first part, our focus is on video stream monitoring. We aim to take scalability to a new level with respect to previous proposals, by achieving a similar detection quality when a stream is compared to original content databases that are at least ten times larger than before. To perform content-based copy detection, we represent a video as a sequence of keyframes, each keyframe being described by a set of local features. This representation is computationally more expensive than the use of video descriptions that are less local in time and space. But it was shown to bring in more robustness to the changes resulting from a copy-creation process and allows the detection of copies of variable-length video excerpts. The local features of every keyframe from the stream are used as similarity-based queries in the database holding the features of the original videos. The features returned are employed by the decision procedure to provide detection results. We first put forward a more efficient index for processing probabilistic similarity-based queries. Since the stream monitoring application we aim at does not require immediate alarms when copies arrive on the stream, a batch architecture is used in order to minimize the importance of mass storage latency. We then develop refined yet inexpensive local models of feature distortions, which makes the probabilistic queries more selective and can also improve the detection quality. By testing the system with very large databases of up to 280,000 hours of video, we find that content redundancy can have a significant negative impact on retrieval cost for such large amounts of data. Accordingly, we optimize retrieval by introducing local models for the density of local features in their description space.

Separate evaluations are performed for the quality of copy detection and for the scalability of the method. Precision and recall are measured on the two available ground truth databases, then some detections are illustrated on the Trecvid 2008 Copy Detection data¹. Retrieval cost is then evaluated on several larger databases (up to 280,000 hours), together with the impact of scale on recall. It is shown that the complete system can handle realistic situations with limited resources: one computer is sufficient for monitoring, in deferred real time, a video stream against a database of 280,000 hours of video.

In the second part of this thesis we focus on finding all the video sequences that occur more than once (with various modifications) in a video database. This use of content-based copy detection for making explicit content links in a large video database can be seen as a specific video mining problem. With appropriate content descriptions and an associated similarity measure, two video sequences are considered to be connected by an extended “copy” relation if their descriptions are similar enough. The video mining process can then be expressed as a similarity self-join on the video database. If exhaustive search is performed, the complexity of the join operation is quadratic in the size of the database. It is natural to attempt to employ the indexing method developed for stream monitoring in order to make the similarity self-join more tractable. The database must be serialized and the resulting stream monitored against the database itself. However, a closer analysis highlights many problems, mainly due to the use of sets of local features and of a complex decision procedure, that strongly limit the performance expected from such a solution and preclude efficient parallel implementations.
Based on these findings, we put forward a compact embedding description of video keyframes that also simplifies the similarity evaluations, while preserving enough information to obtain reliable detection results. Taking advantage of this compact description, we then propose an indexing method based on the redundant segmentation of the database, which significantly speeds up the similarity self-join operation and allows for an efficient parallel implementation. The quality of detection is evaluated by quantifying precision and recall obtained on two ground truth databases initially developed for CBCD (after including the copy detection queries into the original database). We further measure the time required for mining databases of various sizes with a single computer. An online version of the system is employed for small databases of less than 100 hours (a typical upper bound for the size of the results returned by a video sharing Web site to a textual query), while larger databases of up to 10,000 hours of video are processed by an off-line version of the system. We show that mining can be interactively performed for the smaller databases and remains affordable for the larger ones.

¹ The ground truth for the Trecvid 2008 Copy Detection evaluation campaign was not publicly available when this manuscript was completed.

CHAPTER 2

Scalable Content Based Copy Detection for Video Stream Monitoring

2.1 Introduction

The continuous multiplication of content diffusion channels, together with the fast growth in the production of both professional and personal audiovisual content, significantly increases the (re)use of video contents. Copy detection is therefore a key issue in protecting content owners against intentional or accidental unauthorized (re)use of their content. It consists in finding whether a candidate document, streamed on a TV channel for example, is derived from an original document stored in a reference video database. Given the large number of content providers whose streams must be monitored and the huge volume of original documents to protect, automatic and semi-automatic tasks must be performed in order to help human operators. Consequently, copy detection faces a significant scalability challenge. Some other potential applications of video copy detection, such as video mining for broadcast programming analysis or media impact evaluation, as we will see in the second part of this thesis, also stress the importance of scalability.

The term copy is employed here for a document obtained from an original by the application of one or several transformations such as filtering, cropping, scaling, insertion of logos or frames, addition of noise, etc. Figure 2.1 illustrates two examples that combine such modifications. The nature and amplitude of the transformations vary widely and largely depend on whether the copy results from professional post-production operations made for a TV show or is produced by a private user and posted on a video Web site. These transformations must nevertheless preserve the main information conveyed by the original content in order to obtain a copy with some interest. Section 2.3.1 gives more details on the definition of the video copy, together with a broad description of the typical transformations.

Figure 2.1: A copy obtained by post-production operations (top) and one from a Web2.0 site (bottom). The copies are on the left side, the original frames on the right side.

Currently there are two different existing solutions to identify video copies: watermarking and content-based video copy detection (video CBCD). Section 2.2.2 shows that watermarking is not so well adapted to our purposes and that there is a scalability challenge for video CBCD. We subsequently present several contributions in order to deal with this scalability issue.

The first part of this thesis addresses the scalability problem for a monitoring system using video CBCD. The aim is to compare a video stream to a reference video database. The challenge comes from the size of the database, with a target of 500,000 hours of video in the short term, and from the streaming constraint, since the stream must be continuously followed. Such systems exist, e.g. [DLA+07], but their efficiency was not demonstrated for such volumes of data. Our proposal of a monitoring system relies on the use of local visual features (signatures) of the video. Classically, they are computed on keyframes that correspond to strong visual changes. Using only the keyframes drastically reduces the size of the video description, the average being 3000 keyframes for an hour of video. This reduction helps with the scalability issue. The first step is to compute the signatures of all the reference videos and to store them in a reference signature database. Then, the signatures computed on a stream are compared to those of the reference signature database in order to find a potential original version. Finally, a decision process using all the candidate signatures returned tells whether the current broadcast is a copy or not. The local signatures, sometimes called fingerprints, have shown good robustness to many visual distortions (see figure 2.1), but also to more complex cases including such operations as insertion, occlusion, cropping, scaling or shift [JBF07]. Indeed, with an appropriate decision process, if some signatures disappear from a keyframe, the remaining ones may still indicate that we have a copy case.

Unfortunately, the use of local signatures also has a drawback: the video description is large. Some design choices can nevertheless bound this size: at most N signatures are used for a keyframe, and a signature belongs to a d-dimensional space (with relatively low d). But, for the volume of video databases we aim to protect (target of 500,000 hours), we still have about 30 × 10⁹ signatures of dimension d. Such an ambitious goal was not previously addressed. The scalability challenge mainly concerns the retrieval of similar signatures from the reference signature database. To reduce the computation cost of this step, one has to use an indexing method, see e.g. [BAG03], [GIM99], [JBF07], in order to avoid the sequential scan of the database. Such methods scan only a relevant part of the database to find potential similar signatures. However, the effectiveness of the indexing methods significantly depends on the data distribution in the description space. To address the scalability challenge for video stream monitoring, we put forward in the following:

• an index structure based on a Z-grid and an associated dimension-wise probabilistic search method that is adapted to the distribution of image signatures in the database,

• a general approach for obtaining refined models of the distortions undergone by the signatures during the copy creation process, and

• an effective method for speeding up retrieval in high-density regions of the description space.

In the next section, after introducing the requirements (in terms of content volume, granularity and transformations) of video stream monitoring, we briefly justify our preference for content-based copy detection rather than watermarking and show a generic workflow for CBCD. Section 2.3 includes a state of the art regarding video CBCD, with a focus on the video description methods employed and their ability to support the requirements defined in section 2.2.1. We then present the video description method we adopted and the associated decision procedure for copy detection. Our focus in the first part of this thesis is on the scalability of video stream monitoring and, given our target for the size of the original content databases, the main challenge concerns the similarity-based retrieval. Consequently, in section 2.4 we study the issue of retrieval by similarity and review existing indexing methods that were proposed in the literature. Our contributions are developed in the next sections. The Z-grid index is presented in section 2.5. In section 2.6 the associated dimension-wise probabilistic retrieval is described, together with a solution to the problems raised by long-range variations in the data distributions. After a brief analysis of the impact the local properties of signature databases can have on query selectivity, section 2.7 describes the refined local modeling of signature distortions and of signature density, which provides an efficient way to improve selectivity. An extensive evaluation of the entire system and of the effect of each contribution is performed in section 2.8. Comparative results regarding detection quality are provided on the CIVR 2007 benchmark and on an INA ground truth, then scalability is measured on larger databases of up to 280,000 hours of video.


2.2 Video copy detection and stream monitoring

Before studying possible solutions for video copy detection in stream monitoring applications, we have to define the specific requirements of such applications, regarding not only content volume and granularity but also the video transformations that can be expected to occur during copy creation. These requirements will drive all our later choices.

2.2.1 Requirements for a stream monitoring system

The context of use has a strong impact on the scalability requirements for video CBCD methods. The important variables are the size of the original video database to protect, the minimal length of the video sequences that should be detected, the amplitudes of the video transformations and the number of streams that should be monitored simultaneously. While the market for video CBCD services is in its infancy, a first indication can be obtained from national video archives. As an example, the Institut National de l’Audiovisuel has a database of more than 280,000 hours of digitized video to protect (expected to grow to 500,000 hours in the short term) and the detection rate should be high even for brief copy sequences (of a few seconds). Other applications, such as video mining, may benefit from the detection of even shorter sequences. The number of broadcast, cable and satellite TV streams that must be monitored, e.g. in France, is currently about 250 and expected to increase. Regarding the videos on the Internet, Chen et al. [CDL07] estimate that a famous UGC Web site contains 42 × 10⁶ videos whose average length is 3 minutes. The estimated volume is then 2.1 × 10⁶ hours. It is mentioned in [SZH+07] that on average 65 × 10³ clips are uploaded every day, i.e. about 3,250 hours.

A video copy is obtained from an original video by the application of one or several transformations. During the creation of a video copy, an individual image I in an original video sequence is transformed into a copy image I′:

I′ = T(I, Θ)    (2.1)

where T is the transformation and Θ are the parameters of the transformation. The term “copy” is employed if T belongs to a set of tolerated transformations that preserve at least the main information conveyed by the original content. They usually combine intensity and geometric transformations, T = Ti ◦ Tg; both Ti and Tg can be primitive or composed. Since some transformations modify the timeline of a set of frames, at the level of a video we must also consider temporal transformations. Another family includes more sophisticated video editing transformations. The range of the most frequent primitive transformations is wide. For intensity transformations we can mention:

• changes of contrast, gamma or hue,
• switch from color to B&W and, conversely, colorization,
• saturation of the different colors,
• use of color filters,
• blur,
• addition of noise,
• strong compression producing compression artifacts.

Among the geometric transformations we can mention:

• cropping,
• scaling,
• shifting,
• change of aspect ratio,
• insertion of black stripes,
• flip (invert right and left),
• camcording (shooting a screen with a video camera).

For the temporal transformations we can mention:

• change of frame rate,
• insertion and deletion of a frame,
• slow motion and, conversely, speed-up,
• editing of the shot boundaries (the boundaries of a shot are defined by a break of continuity: switch of the shooting video camera),
• reversing the video.

Eventually, among the video editing transformations we can mention:

• insertion of logos,
• insertion of subtitles,
• in-frame combination of frames from different videos,
• insertion and drop of shots,
• timeline editing.
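As a rough illustration of equation (2.1) and of the composition T = Ti ◦ Tg used above, the following Python sketch synthesizes a toy copy frame from an original one. The chosen transformations, parameter values and function names are illustrative assumptions, not operations prescribed by the thesis.

```python
import numpy as np

def intensity_transform(frame, gamma=1.2, contrast=0.9):
    """A primitive intensity transformation Ti: contrast reduction followed by a gamma change."""
    f = frame.astype(np.float64) / 255.0
    f = np.clip(0.5 + contrast * (f - 0.5), 0.0, 1.0) ** gamma
    return (f * 255.0).astype(np.uint8)

def geometric_transform(frame, crop_ratio=0.05, shift=(4, 2)):
    """A primitive geometric transformation Tg: symmetric crop followed by a small shift."""
    h, w = frame.shape[:2]
    ch, cw = int(h * crop_ratio), int(w * crop_ratio)
    cropped = frame[ch:h - ch, cw:w - cw]
    dy, dx = shift
    shifted = np.zeros_like(cropped)
    shifted[dy:, dx:] = cropped[:cropped.shape[0] - dy, :cropped.shape[1] - dx]
    return shifted

def make_copy(frame):
    """I' = T(I, Theta) with T = Ti o Tg (equation 2.1), for a toy gray-level frame."""
    return intensity_transform(geometric_transform(frame))

if __name__ == "__main__":
    original = np.random.randint(0, 256, size=(288, 352), dtype=np.uint8)  # synthetic frame
    copy = make_copy(original)
    print(original.shape, copy.shape)
```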


This list is broad but by no means exhaustive. Most of the time, a copy is the result of a composition of several of these transformations. All these potential transformations may have various amplitudes. However, if a copy (transformed version of a video) has lost the major part of the information content of the original video, then it is of poor interest. Furthermore, the very difficult cases typically have a much higher cost for automatic detection. So, for video copy detection, acceptable ranges must be defined for the amplitudes of the transformations. Detection time can be traded against detection quality. Figure 2.2 illustrates a good detection. A good detection is called a true positive, a false detection is a false positive and a missed detection is a false negative.

Figure 2.2: An example of a true positive. The two frames are from different video clips obtained from the same original content. The right frame is a “zoom in” of the left one, with a further slight crop and a large text overlay (Madonna, “Sorry”).

The experience of the Institut National de l’Audiovisuel (whose main mission is to collect and store French radio and television broadcasts) with reused video content shows that the most frequent transformations concern gamma and contrast changes, scaling, cropping, compression artifacts and insertion of logos or frames. The amplitude of gamma or contrast changes and of scaling is relatively low, in order to preserve the perceived quality. Cropping and shifting are also limited in order to preserve the interest of the video. Insertions can be quite complex and have a stronger impact on the video frames. Finally, editing can affect the timeline of the original video. The insertion or deletion of frames also occurs, together with the in-frame combination of videos from different sources in order to make a new video. The study of a large and diverse set of copies led to the conclusion that the probability of a transformation decreases when the amplitude of the transformation increases [JFB05].

The copies of entire programs are rare, so we must be able to find excerpts of a video, transformed or not. The near-duplicate (or near-exact) copies [WHN07b], [SZH+07] are a special case where transformations are rather light, and where copy and original generally have the same length and the same timeline. Near-exact or exact copies can be considered easy cases: they usually do not require a precise description of the content, a coarse global description of the video can be sufficient.

Difficult cases for video CBCD are the news shows, talk shows, weather forecasts, etc. in which the background does not change from one day to another (but, at best, the suit or tie of the anchorman, or the guests). Two such videos can be very similar but are not copies. Here, copy detection can produce false positives, i.e. a video sequence is detected as a copy but it is not a copy. Figure 2.3 illustrates such a case.

Figure 2.3: An example of a false positive. The two frames are from different magazines, one shot in the morning and the other shot in the evening (“Vamos a bailar”).

An important issue is the characterization of videos of the same scene taken with different video cameras from different points of view. Figure 2.4 illustrates such a case. For the purpose of copy detection, the resulting videos are not considered as copies. Content-based copy detection methods would have difficulties in finding a correspondence here, given the significant differences between the two views, so there is little risk of false positives. Some work has focused on finding such correspondences, using specific solutions for specific cases, such as TV news shows with temporal patterns of flashes in [Sat02] and close-ups of one person in [STA07]. Note that camcording generally leads to a different problem, since the different views are of a same flat screen and not of a 3D scene; also, the angular variations are quite limited for camcording. The broad dissemination of personal video cameras increases both the number of movie camcordings and the number of different recordings of popular events (concerts, soccer games, etc.).

Figure 2.4: An example of videos of the same scene taken with different cameras, from different angles. The illustrations are from [STA07].


2.2.2 Possible solutions for video copy detection

Existing solutions for copy detection rely either on the use of watermarks (see [LELD05] for a review) or on matching signatures describing the content itself. Each of these alternatives has specific advantages and drawbacks. Watermarks that include transaction-related meta-data can support services that go beyond copy detection and make it possible to trace the path followed by each individual document. Also, many watermarking schemes can keep the computational costs of copy detection relatively low and are thus more scalable. But given the multiplicity of existing watermarking solutions and the diversity of their usage policies, making available a general copy detection system appears to be highly challenging. A more fundamental problem is that copy detection based on watermarks cannot be employed if copies of the original content have been disseminated before the application of any mark, which is unfortunately the case for a large part of the existing content needing protection.

Since a copy is expected to preserve at least the main information conveyed by the original content, copy detection can also rely on similarity evaluations. If, with adequate descriptions of the content, the candidate document (potential copy) is found to be similar enough to the original (source), then the candidate is considered to be a copy of the original. This is the underlying principle of content-based copy detection (CBCD) methods. Recent proposals regarding image and video CBCD, such as [SZ02], [BAG03], [KSH04], [JFB05], [LTBGBB06], [CPIZ07], [FZST07] or [Jol07], attempt to find description schemes that provide more reliable detection results and to make copy detection feasible for large databases of original documents. The next section explains the general approach to content-based copy detection for video stream monitoring.

2.2.3 Generic workflow of video stream monitoring

Figure 2.5 shows the workflow of a monitoring system that employs CBCD. It is quite generic with respect to the video description method and indexing scheme. The workflow is composed of two distinct parts:

• The indexing component. This off-line process computes the signatures (working descriptions) for all the content in the reference content database containing the original videos and builds a reference signature database. Section 2.3.1 includes a state of the art regarding the video description methods employed for CBCD, then section 2.3.2 presents the description method we adopted. This database is usually rather static and an index can be built for it in order to speed up the copy detection process. In section 2.4 we review existing indexing methods that were proposed in the literature.

• The detection component. This online process checks whether a candidate video (from a stream or a serialized database) is a copy of an original video stored in the reference content database. Signatures are extracted from the candidate video (as in section 2.3.2) and used as queries: the database is scanned in order to retrieve the reference signatures that are similar. This similarity-based retrieval operation is critical for the scalability of the copy detection method. For very large databases, good efficiency can only be obtained if an index was built during the off-line process (to avoid a sequential scan, see section 2.4). Our contributions concern this issue and their presentation starts with section 2.5. Finally, a decision process labels the candidate video as “copy” if the signatures of this video are similar enough to the signatures of one of the original sequences retrieved. The decision process we employ is described in section 2.3.3.

Figure 2.5: Generic workflow of a stream monitoring system using video CBCD (indexing: reference content database → video feature extraction → reference signature database and index; detection: candidate video → video feature extraction → query by similarity and decision making → ID of original video).

Most of the existing proposals for video (or image) monitoring using CBCD follow this generic workflow but employ different content description schemes, different solutions for similarity-based retrieval and decision. In order to have a system faster than real time, the time for processing the queries (the signatures of new incoming documents) has to be lower than or equal to the duration of the videos described by these queries. This is the main obstacle in monitoring a stream against a very large reference signature database. The computation cost depends on the indexing scheme and on the retrieval process. The reference signature databases we employ are generally too large to be stored in main memory, so this requires access to mass storage. We aim to speed up the similarity-based retrieval without degrading the quality of the results.
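To fix ideas, the following Python sketch mirrors this two-component workflow with toy stand-ins: extract_signatures, SignatureIndex and the decide function are hypothetical placeholders for the extraction, indexing and decision components detailed in sections 2.3.2, 2.4-2.7 and 2.3.3, and the exhaustive L2 scan shown here is exactly what the real index is designed to avoid.

```python
import numpy as np

def extract_signatures(video, n_keyframes=3, n_points=20, d=20):
    """Placeholder for feature extraction: returns synthetic (keyframe_id, signatures) pairs."""
    rng = np.random.default_rng(0)
    return [(k, rng.integers(0, 256, size=(n_points, d)).astype(np.float32))
            for k in range(n_keyframes)]

class SignatureIndex:
    """Stand-in for the reference signature database; a toy exhaustive L2 scan."""
    def __init__(self):
        self.signatures = []   # reference local signatures
        self.video_ids = []    # identifier of the original video for each signature

    def add_video(self, video, video_id):
        for _, sigs in extract_signatures(video):
            for s in sigs:
                self.signatures.append(s)
                self.video_ids.append(video_id)

    def query(self, signature, k=100):
        ref = np.asarray(self.signatures)
        order = np.argsort(np.linalg.norm(ref - signature, axis=1))[:k]
        return [(self.video_ids[i], ref[i]) for i in order]

def monitor_stream(stream, index, decide):
    """On-line detection component: query every local signature, then apply the decision."""
    alarms = []
    for keyframe_id, sigs in extract_signatures(stream):
        candidates = [m for s in sigs for m in index.query(s)]
        original_id = decide(sigs, candidates)
        if original_id is not None:
            alarms.append((keyframe_id, original_id))
    return alarms
```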


2.3 Video description and content-based copy detection

Content-based video copy detection is a difficult task and the video description is critical for achieving good detection quality. After a state of the art regarding video CBCD, with a focus on the video description methods employed and their ability to support the requirements defined in section 2.2.1, we will present the video description method we adopted and the associated decision procedure for copy detection.

2.3.1 State of the art

For video CBCD, both the original videos and the candidate document (potential copy) are represented by specific signatures computed from their content, then the similarity between these signatures is measured and a decision is made. Image transformations lead to distortions between the signatures of the original documents and the signatures of the corresponding copies. Effective copy detection requires a content description scheme and an associated metric that are rather insensitive to the tolerated transformations, while being highly sensitive to the other differences between documents. A brief overview of the existing video descriptors is necessary in order to specify which are adapted to video CBCD. Their robustness (invariance to distortions) and computational costs are of primary importance.

Before being applied to videos, content-based copy detection was first suggested for still images. The early proposal in [CWLW98] addresses the detection of replicated images on the web, using a global image description based on wavelet coefficients. A database was built together with an index based on hashing. The solution was designed to detect duplicates (i.e. images having the same content) and was only robust to resizing and compression artifacts.

The following review of video CBCD methods is organized according to the granularity of the description employed: for an entire video, for a set of frames, for a frame or for a part of a frame.

Some proposals describe an entire video with a single signature. In [cSCZ03] a video is represented by using some of its frames, selected because they were similar to some randomly chosen seed images. The similarity between two videos thus relies on the intersection of these frames. A rather similar method is suggested in [SOZ05], where the frames of a video are clustered by visual similarity. The similarity between two videos relies on the number of shared and unshared frame clusters. In [HWL03] the videos are described at two levels of granularity, a coarse and a fine one, in order to deal with the scalability issue. The first level quickly decides if two videos could be similar, then the second one performs a more precise verification if necessary. An approach based on the difference between successive frames is put forward in [HZ03]. A motion approximation and a color histogram are employed for each frame. The signature of a video depends on its length. Alignment operations are performed in order to compare the signature of a query video to those of the reference signature database. In [HHB02] a frame is partitioned into regular image blocks and a signal feature is extracted for each block (discrete cosine transform coefficients, average gray level, etc.). An ordinal measure is used for each frame: the blocks are sorted according to their signal. This order is the coarse signature of a frame. The signature of a video is the concatenation of these frame signatures. Alignment operations are performed to find similar videos. The methods in [OKH02] and [KV05] are very similar and add a temporal signature. In [MAW06] the authors also use an alignment to match similar clips. Each clip is described by the concatenation of the global signatures of the frames. These signatures are obtained from three Mpeg-7 descriptors: color layout, scalable color and edge histogram.

The previous methods use visual features of the videos and so are sensitive to the intensity transformations, but also to many geometric transformations such as cropping, shifting, rescaling, insertions, etc. Some other methods further rely on temporal features of the shots. They usually use a temporal signature based on the cuts [Shi99]. Alignment operations must then be performed. Some very recent work [CLW+06], [CCC08] shows good results for handling some temporal transformations (notably slow motion and fast motion). The temporal signatures are very robust to many geometric and intensity transformations, but lack effectiveness if the timeline is modified. When monitoring a stream, the boundaries of the programs are not easy to find and the methods just mentioned could be inefficient for this task. Another important drawback of such methods is that they are not adapted to the detection of excerpts. Our goal is to find short copy sequences that may be part of a longer video program or stream. A typical example is the “best of” TV show re-exploiting a large number of excerpts from older shows. If the excerpts are too short, their temporal signature might not be long enough to support reliable detection.

In order to be able to find small excerpts, a video program cannot be considered as the basic element: the granularity of the description must be finer than that of entire videos. Some methods like [HZ03], [LTBGBB06], [CPIZ07], [STA07] or [YOZ08] directly work at the frame level of description: they employ an individual signature for each frame. Such a refined description can support a better precision (less false alarms) than a coarse description. But it can have a negative impact on recall (more false negatives) and necessarily makes the solution more difficult to scale. Indeed, all these solutions were evaluated with rather small volumes, usually less than 100 hours. The proposal in [YOZ08] uses a coarse visual signature for a frame based on the discrete cosine transform of image blocks and an ordinal signature. Only one single coefficient is kept after a coarse approximation of the position of the visual signature in its description space. A fixed number of frames are used to build a sketch. The signature of a sketch is the concatenation of the coefficients of its frames. This method is designed especially to deal with streams, but it is still sensitive to some common transformations because of the design of the ordinal signature. In [CPIZ07] a local visual vocabulary is employed for describing the information conveyed by frames extracted from the stream at a fixed rate (1 out of 15). The vocabulary is built using the Bag of Features (BoF) approach: (i) extraction of points of interest (PoI), (ii) computation of local features (SIFT) at these locations, (iii) k-means clustering of the local features, each cluster being a visual word.
The frames containing a high number of similar visual words are considered as near-duplicates. The method is robust to a large set of distortions thanks to the use of local features having good invariance properties and to the approximation resulting from the cluster-based matching. However, the pairs of frames returned can be similar but not copies, so a decision step (not described) must be added in order to filter out these cases. The method was evaluated on small volumes of video. In the image search domain, in [JT05] and [MTJ07] the authors show that using random patches and new clustering methods can produce a visual vocabulary faster and make it more discriminant.

A video is a sequence of frames (images); the SECAM or PAL frame frequency is 25 frames per second (25 fps). Considering all the frames is not necessary: some are visually very close, and sometimes there is no visual activity. Using only the keyframes ([CCM+97], [HHB02], [YOZ08]) minimizes the redundancy and maximizes the discrimination ability. Moreover, the scalability challenge would be much harder if every frame was employed. The keyframes of a video, seen as an accurate summary of the video, are generally considered satisfactory. The keyframes are automatically detected and correspond to strong global visual changes. On average, 1/30 of the frames are detected as keyframes, which leads to about 3000 keyframes per hour of video.

To significantly improve robustness for the detection of individual frames (or keyframes), recent proposals describe images by sets of local features and evaluate the similarity between two images as a score of the best match between the sets of local features [SZ02], [JFB03], [BAG03], [KSH04], [JFB05], [LTBGBB06], [CPIZ07], [FZST07], [LTCJ+07]. The use of sets of local features for copy detection typically relies on comparing individual images via a two-stage process: first, the individual features of the candidate image are used as queries for retrieving similar local features from the reference signature database; then, a matching process is performed and the decision is taken. Part of the robustness comes from the fact that the local features employed are invariant to some of the transformations (e.g. most of the intensity ones and some geometric ones, such as rescaling and change of aspect ratio). Also, the matching process involves some form of voting, thus allowing for partial matches that provide robustness to other transformations (cropping, shifting, occlusion, insertion of logos, etc.), where global descriptions usually fail. Various types of local features and matching procedures are employed. While this is computationally expensive, several proposals along this line (e.g. [BAG03], [JFB03], [DLA+07]) have already dealt with large volumes of video and shown that scalability was possible.

To describe an image with local features, two operations are needed: the detection of points of interest and the extraction of the local features around each point. First, a set of locations (generically called interest points in the following) are detected. The detectors are rather stable to various image transformations. Among the robust detectors, we can mention the improved Harris detector [SM97], the Harris-Laplace detector [MS01] and the Difference of Gaussians (DoG) detector [BA83] used by the Scale Invariant Feature Transform (SIFT) [Low04]. Then, the neighborhood of each such interest point has to be described according to the intended goal. Most of the recent description methods, such as SIFT [Low04], PCA-SIFT [KS04], Gradient Location and Orientation Histogram (GLOH [MS04]) or Speeded-Up Robust Features (SURF [BTG06]), aim at providing strong robustness to general affine transforms and were mainly designed for applications such as object recognition and stereo matching. Some of these solutions were successfully employed for CBCD [KSH04], [CPIZ07], [FZST07], [DLA+07]. But such high-dimensional descriptions pose severe problems to multidimensional index structures and the cost of extraction [BTG06] and of similarity computations is high. Moreover, strong robustness to general affine transforms makes features less discriminant and leads to false detections (in the copy detection context) for images representing the same scene with viewpoint changes [Foo07].

2.3.2 Adopted video description method

In order to support the detection of excerpts, to increase robustness to temporal transformations and to keep the description compact (by minimizing redundancy), we represent a video as a sequence of keyframes. Then, we describe every keyframe by a set of local features, which increases the robustness to some geometric and video editing transformations. Furthermore, by employing a specific type of features, we can reach good robustness to a range of intensity and geometric transformations, while remaining sensitive to stronger affine transforms. The specific robustness and scalability requirements led us to use here the video content description scheme of [JFB05], described below.

The algorithm employed to find the keyframes in a stream is quite simple and relies on finding maxima of the global intensity of motion [EM99]. The pixels that change between two successive frames are counted and if their number is above a fixed threshold then the last frame is considered to be a keyframe. Such a keyframe extraction method is rather robust to many temporal transformations like timeline editing, change of frame rate, or insertion and drop of frames. In this way, about 3,000 keyframes are detected per hour of video, which drastically reduces the initial volume of information (to about 1/30).
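The sketch below illustrates this keyframe selection principle in Python: a frame is flagged when the fraction of pixels that changed with respect to the previous frame exceeds a threshold. The two threshold values are illustrative assumptions, and the real detector of [EM99] works on maxima of the global intensity of motion rather than on this simplified rule.

```python
import numpy as np

def detect_keyframes(frames, pixel_delta=16, change_ratio=0.10):
    """Simplified keyframe selection: keep a frame when the fraction of pixels that changed
    (by more than pixel_delta gray levels) since the previous frame exceeds change_ratio."""
    keyframe_ids = []
    prev = None
    for t, frame in enumerate(frames):
        if prev is not None:
            changed = np.abs(frame.astype(np.int16) - prev.astype(np.int16)) > pixel_delta
            if changed.mean() > change_ratio:
                keyframe_ids.append(t)
        prev = frame
    return keyframe_ids

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, size=(288, 352), dtype=np.uint8) for _ in range(10)]
    print(detect_keyframes(frames))
```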

Interest points are detected in each keyframe with the improved Harris detector [SM97]. The neighborhood of every interest point is then described by the normalized 5-dimensional vector of first and second-order partial derivatives (∂x, ∂y, ∂xy, ∂x², ∂y²) of the gray-level brightness:

v_{x,y,t} = ( ∂I/∂x, ∂I/∂y, ∂²I/∂x∂y, ∂²I/∂x², ∂²I/∂y² )_{x,y,t}    (2.2)

For each interest point in the keyframe at time t, the same type of description is computed for three other neighboring points in frames t + δ, t − δ and t − 2δ respectively, with a small spatial offset ∆, as shown in figure 2.6 (where the spatial offset was exaggerated). Each description is individually normalized. This provides a d-dimensional spatio-temporal signature (with d = 20 here) for every interest point detected in every keyframe. The final description is:

s_{x,y,t} = ( v_{x,y,t} / ‖v_{x,y,t}‖ , v_{x−∆,y−∆,t+δ} / ‖v_{x−∆,y−∆,t+δ}‖ , v_{x+∆,y−∆,t−δ} / ‖v_{x+∆,y−∆,t−δ}‖ , v_{x+∆,y+∆,t−2δ} / ‖v_{x+∆,y+∆,t−2δ}‖ )    (2.3)
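A minimal numpy sketch of equations (2.2) and (2.3) is given below. The finite-difference derivatives, the δ and ∆ values and the absence of border handling are illustrative simplifications, not the exact implementation used in the thesis.

```python
import numpy as np

def derivative_vector(frame, x, y):
    """Normalized 5-d vector of first and second order derivatives at (x, y), cf. equation (2.2)."""
    I = frame.astype(np.float64)
    dx = (I[y, x + 1] - I[y, x - 1]) / 2.0
    dy = (I[y + 1, x] - I[y - 1, x]) / 2.0
    dxy = (I[y + 1, x + 1] - I[y + 1, x - 1] - I[y - 1, x + 1] + I[y - 1, x - 1]) / 4.0
    dxx = I[y, x + 1] - 2.0 * I[y, x] + I[y, x - 1]
    dyy = I[y + 1, x] - 2.0 * I[y, x] + I[y - 1, x]
    v = np.array([dx, dy, dxy, dxx, dyy])
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def spatio_temporal_signature(frames, t, x, y, delta=1, offset=2):
    """20-d signature of equation (2.3): four normalized 5-d vectors taken around (x, y, t).
    delta (temporal step) and offset (spatial offset) are illustrative values."""
    parts = [
        derivative_vector(frames[t], x, y),
        derivative_vector(frames[t + delta], x - offset, y - offset),
        derivative_vector(frames[t - delta], x + offset, y - offset),
        derivative_vector(frames[t - 2 * delta], x + offset, y + offset),
    ]
    return np.concatenate(parts)  # d = 20 components, later quantized to one byte each
```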

Each component i, i ∈ [0, d], has a value ci ∈ [0, 1]. To keep the signatures compact, we represent each value on one byte, so ci ∈ [0, 255]. The resulting description space is [0, 255]^d. The improved Harris detector and the differential description employed provide invariance to most intensity changes (change of contrast, change of gamma, B&W, colorization), as well as partial robustness to others (noise, blur, saturation, change of hue, compression artifacts) and to some geometric transformations (limited amplitude scaling, i.e. ×0.8 to ×1.2, change of aspect ratio, camcording). Robustness to other transformations (mainly geometric and temporal) depends on the decision process. Conversely, the video description we employ is not robust to rotations or high amplitude scaling, but these distortions are very infrequently used for generating copies. The description also lacks robustness to left-right flips or time-reversal of the video. A solution for detecting these types of copies could be to index a flipped version and a time-reversed version of the video together with the original one.

Figure 2.6: Spatio-temporal signature of one interest point

A local signature needs 20 bytes for storage. Some metadata are also stored for each signature: a unique identifier associated to the video it was extracted from, a time code Tc corresponding to the number of the frame it belongs to (1 if in the first frame), and the x and y positions of the interest point in the frame. These metadata are used in the decision process. Finally, we have a signature of 32 bytes. The extraction of keyframes and of their local features is performed once, off-line, for all the videos in the reference content database; all these features constitute the reference signature database. A similar extraction is also performed on-line for each monitored video stream. The entire extraction process (including keyframe extraction, improved Harris corner detection and local feature computation) takes about 1/15 of real-time for 352 × 288 MPEG1 videos.
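To make the 32-byte record concrete, here is a minimal packing sketch; the exact field order and integer widths (a 4-byte video identifier, a 4-byte time code and 2-byte coordinates) are assumptions chosen only so that the total matches the 32 bytes given above, not the storage format actually used in the thesis.

```python
import struct

# Hypothetical 32-byte record: 20 descriptor bytes + metadata.
RECORD_FORMAT = "<20BIIHH"                      # little-endian, no padding
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)    # -> 32 bytes

def pack_signature(descriptor, video_id, time_code, x, y):
    """Serialize one local signature (20 quantized components + metadata)."""
    assert len(descriptor) == 20 and all(0 <= c <= 255 for c in descriptor)
    return struct.pack(RECORD_FORMAT, *descriptor, video_id, time_code, x, y)

def unpack_signature(buf):
    fields = struct.unpack(RECORD_FORMAT, buf)
    return list(fields[:20]), fields[20], fields[21], fields[22], fields[23]
```

With such a fixed-size record, the reference signature database can be stored as a flat file and any signature accessed by a simple offset computation.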

To perform the copy detection task, the different local signatures of the monitored video stream must be compared to the ones in the signature database. A copy is detected when two sets of signatures match: those in the current keyframe and some issued from a frame in the signature database. Two steps are necessary for this: first perform retrieval by similarity in the reference signature database, then proceed to a decision step to find the possible matches. In order to limit the increase of the size of the signature database and the number of similarity-based retrieval operations to be performed for a stream, we extract a maximum of Ns = 20 signatures from a keyframe. For a 300,000 hours video database, this means a set of 20 × 3,000 × 300,000 = 18 × 10^9 signatures. Although we attempt to limit the database size, it is still quite large and monitoring a video stream against it remains a challenge.


If the signature s from an original keyframe has a corresponding signature s′ in the transformed keyframe belonging to the copy, then:

$$s' = T_i(s, \Theta_i) \circ T_g(x, \Theta_g) \qquad (2.4)$$

where Θi are the parameters of Ti, Θg are the parameters of Tg and x is the position of the point of interest. If the signatures were invariant to all the transformations applied, we would have s′ = s. But, even for the most robust local descriptions, this is never true in practice; the signatures are at best invariant to a subset of the transformations. Every signature is affected by a distortion

$$\delta s = s - s' \qquad (2.5)$$

Figure 2.7 shows seven interest points detected in the original keyframe I and a copy I′ where only four of these points remain. The bottom of figure 2.7 depicts the distortions of the signatures in a simplified 2D description space. The local signatures of the original keyframe are in the reference signature database (off-line process), while the local signatures of the copy are the queries (on-line process).

Figure 2.7: Some interest points detected in the original (top left) and in a copy (top right), and schematic representation of the distortions of their signatures (bottom) in a 2D description space

Evaluating the amplitude of the local feature distortions is very important for controlling the selectivity of the similarity-based retrieval from the reference signature database, with a strong impact on both the efficiency (mean cost per query) and the effectiveness (quality in terms of precision and recall) of copy detection. In section 2.7.1 we develop refined yet inexpensive local models of feature distortions, which make the probabilistic queries more selective and also improve detection quality.


2.3.3 Decision process

To decide whether a candidate video is a copy of an original video, their similarity is computed at the keyframe level and then at the video sequence level. For copy detection, the similarity between keyframes described by local features can be simply measured by the number of local matches [SM97], [BAG03]. Since this method does not take into account the relative positions of the interest points, it is robust to strong geometric transformations but can also result in many false positive detections. Another solution consists in taking into account geometrical consistency with a global transformation [Low99], [JCL03], [KSH04], [MTS+05], [RLSP06]. The parameters of the transformation can be estimated using a random sample consensus (RANSAC [FB81]) algorithm. For each candidate local feature, only the best match in the video (according to the L2 distance between the positions of the signatures) is kept for the estimation. The similarity is obtained by counting the number of interest points that match (up to a spatio-temporal approximation) when the transformation is applied. If this similarity between the candidate video and an original video is higher than a threshold, the candidate is marked as a copy of this original. The vote-based decision scheme leads to significant robustness to occlusion and video inlay and reduces the rate of false positives. We employ in the following this solution based on registration and vote.

More precisely, a sliding window of 9 keyframes of the stream is used. For every interest point from these 9 keyframes, the similar signatures found in the reference database are retrieved. The identifiers of the original videos from which the returned signatures were issued are collected and the number of occurrences of each identifier is determined. Only the identifiers whose number of occurrences is above a fixed threshold are retained. For each such identifier, a 3-dimensional matching (using the x, y position in the keyframe and the time position Tc) of the signatures is attempted with the 9 keyframes from the stream. An optimization is used to find the best match. If the score of the best match is above a threshold, then the stream sequence is considered to be a copy of this original.

As we shall see in section 2.4, similarity-based retrieval returns a maximum of k signatures for a query. The complexity of the decision algorithm employed is quadratic in k. The decision is performed in 1/30 of real-time for k = 100; increasing k can have a strong negative impact on scalability. Recently, Gengembre et al. [GB08] have proposed a probabilistic matching process that has the potential to significantly reduce the number of false detections and also improve computation time.
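As a minimal illustration of the vote step just described, the sketch below counts, over a 9-keyframe window, how many retrieved signatures point at each reference video; the helper `search_similar` and the threshold value are hypothetical placeholders, not the thesis' actual interface.

```python
from collections import Counter

WINDOW = 9            # keyframes in the sliding window, as in the text
VOTE_THRESHOLD = 10   # illustrative value only

def candidate_videos(window_keyframes, search_similar):
    """Collect the reference-video identifiers voted for by the signatures of
    a sliding window. `search_similar(sig)` is assumed to return the similar
    reference signatures, each carrying a `video_id` attribute."""
    votes = Counter()
    for keyframe in window_keyframes[:WINDOW]:
        for sig in keyframe.signatures:
            for match in search_similar(sig):
                votes[match.video_id] += 1
    # keep only the identifiers seen often enough to deserve registration
    return [vid for vid, count in votes.items() if count > VOTE_THRESHOLD]
```

The identifiers returned by such a function would then go through the spatio-temporal registration and scoring step described above before a copy is declared.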

The next section focuses on similarity-based retrieval and indexing. A state of the art is provided before we introduce our contributions that are developed and evaluated in the later sections.


2.4 Similarity-based retrieval and indexing

To decide whether a candidate video is a copy of a video from the reference content database, the signatures extracted from the candidate video are used to query by similarity the reference signature database. The signatures retrieved, together with the signatures of the candidate video, are the input for the decision process described above. Given the size requirements mentioned in section 2.2.1, this retrieval by similarity is a critical stage for the scalability of content-based copy detection applied to video stream monitoring. When the volume of videos to protect is small or when few compact features are sufficient for reliable decisions (e.g. for detecting near-exact copies), the reference signature database is small and exhaustive search can answer similarity queries fast enough. But for very large databases, like the ones we focus on here, an indexing solution is needed in order to reduce retrieval time and to make it increase sub-linearly with the size of the database. The principle of indexing is to avoid exhaustive search by restricting the comparisons to be performed to the relevant parts of the database (i.e. where signatures that are similar to the query can be found). Before describing some potentially relevant existing proposals in section 2.4.2 and our contributions in the later sections, we highlight two specific issues that can have significant implications on the type of retrieval by similarity to be performed.

2.4.1 Retrieval by similarity for copy detection

The first issue concerns the requirements of the application of copy detection to video stream monitoring. While some methods attempt to raise an immediate alarm when a copy arrives on the stream [YOZ08], [DLA+07], we consider that raising the alarms in deferred real time is sufficient in most application contexts. Our goal is to minimize the computation time required for processing a query rather than the total time interval needed to return the answers. By relaxing the constraint of requiring immediate answers, we can employ a batch processing scheme in order to minimize the impact of mass storage latency. Figure 2.8 shows these successive processing tasks along the timeline of the stream. The extraction of the visual descriptors, performed in parallel, requires about 1/15 of real time. Queries are accumulated in main memory. At constant intervals, when the buffer of query signatures is full (indicated at the top of the figure, from brown to light orange), the search process is launched (including the loading of the database). The answers to an interval of the video stream are provided with a delay (marked by an arrow).

Figure 2.8: Stream monitoring process providing answers in deferred real time

The second issue concerns the fact that the signatures of a copy are distorted versions of the signatures of the original video (see section 2.3.2). The distribution of the amplitude of these distortions is not uniform (small amplitudes being much more frequent than large amplitudes) and can depend on the region in description space where the original signature is. This has an impact on the type of retrieval that should be performed.

For data in a metric space, the two most frequently employed types of retrieval by similarity correspond to ǫ-range queries and kNN queries. An ǫ-range retrieval operation should return all the elements within a sphere of fixed radius ǫ centered on the query element. A kNN retrieval operation must return the k nearest neighbors of the query element. The local density has a strong impact on both types of retrieval, showing their respective drawbacks in opposite cases. In high density areas, an ǫ-range query returns a very large number of elements, so it is not very selective. In low density areas, a kNN query returns elements that are very far from the query element and thus potentially irrelevant.

Methods like hashing (e.g. LSH [IM98], see the next section) typically do not compute any distance at query time. Hashing partitions the elements of the database into small sets (buckets) composed of the elements that collide. The hashing function is locality sensitive if the probability for two elements to collide is proportional to their similarity. For retrieval by similarity, a set of buckets is visited for each query and all the elements they contain are equally considered as candidates. Such methods save memory space since the vector of the signature does not need to be stored, so they are relevant for large signatures. They also save computation time during similarity-based retrieval, since no distance is computed. The drawback is a limited selectivity: some elements in a same bucket can be quite different from each other. To deal with this selectivity problem, a criterion is proposed in [Low04] in order to see whether the nearest neighbor is meaningful or not. If n1 is the nearest neighbor and n2 is the second nearest neighbor of a query q, n1 is considered meaningful only if d(n2, q) > 1.8 d(n1, q). In [LAJA05] this criterion is extended to 100-NN. The neighbor ni is considered meaningful if

$$d(n_{100}, q) > c^2\, d(n_i, q) \qquad (2.6)$$

with c = 1.8 as default value. The signatures are required for computing the criterion. The study of a large and diverse set of copies leads to the conclusion that the probability of distortions decreases when their amplitude increases [JFB05]. Indeed, high amplitudes are unlikely


because they imply that more of the information conveyed by the content is lost. Furthermore, to handle very large databases with limited resources, a trade-off must be found between computation cost and the scope of copy detection; high amplitude transformations should be the first to be neglected. A simple solution is to set a hard bound, ǫ, on the amplitude of the tolerable distortions and, for every signature of a potential copy, retrieve all the items within a range of ǫ around it. Such ǫ-range queries can be used with many different index structures, but they give equal importance to near and farther neighbors within the range, with a negative impact on retrieval speed [JBF07]. Instead of using hard bounds, it was shown that it is better to consider probabilistic retrieval that privileges lower amplitude signature distortions [JFB05]. A probabilistic query is defined by a probability density function (pdf) over the description space; probabilistic retrieval consists in returning a set of segments from a multidimensional index so that their cumulative probability (computed using the query pdf) is above a target value Pα . Many multidimensional index structures were put forward to perform similarity-based retrieval in general, see [Sam06] for a comprehensive presentation. In the following we review a few methods that were proved to be able to handle very high volumes of data.

2.4.2 Existing indexing proposals

To reduce the cost of similarity-based search, the most efficient methods perform approximate search. The result returned is not guaranteed to be identical to what exhaustive search would return, but is usually close to it. Some methods can provide an estimate for the quality of this approximation. Access methods have to face the curse of dimensionality: when the dimension of the vectors increases, the performance of the multidimensional access methods decreases. There are several families of indexing methods. For each family presented here a few representative methods are described, considered to handle the curse of dimensionality well. They can be employed with medium or high-dimensional signatures, which corresponds to ours. However, in many cases the dimension of a signature can be reduced. A subset of convenient dimensions (or combinations of dimensions) is found with methods like PCA [KS04], [DA08] or others [MDPI04], [WMS00], in order to produce smaller descriptors and thus diminish the impact of the curse of dimensionality.

For a better comprehension, the context is narrowed in the following. We consider that we are working on keyframes taken from videos. Each keyframe has a unique identifier (ID) obtained from both the identifier of the video it belongs to and its time code. Each signature, classically a vector of medium to high dimension (from 10 to 1000) computed on a keyframe, is associated to this ID. Some methods keep this vector so as to compute distances during similarity-based search, in order to refine the results. Some others only keep the ID in order to save space in main memory and computation time; no distance is computed any more, but selectivity can be limited. Consequently, the precision of these approaches can be difficult to control.

Random projection methods. These methods rely on the use of functions that project data into subspaces. The functions employed intend to preserve the information regarding locality in order to keep


the neighboring elements close together in each sub-space. They usually show good results when the distribution of the data is close to uniform.

Locality Sensitive Hashing (LSH) [IM98], [GIM99] is a powerful method often used for indexing high dimensional data and for similarity-based retrieval of visual signatures. The principle is to define a set of hashing functions that preserve the information of proximity in the description space. A set of n functions are defined to project the visual signatures onto n indexes. Each visual signature is inserted in one bucket for each index, so in a total of n buckets. For similarity-based search, the n buckets where the query is projected are visited. Usually, only the ID is kept, so all the elements contained by the buckets associated to the query are considered as candidates. Then, they take part in a final decision, where the most frequently seen IDs (frequency above a threshold) are considered as copies. A weakness of this method is that the indexes are very large. For n indexes, an element is inserted in n buckets and so the global size is n times the original size of the database. LSH has mainly been used for main memory searches. Moreover, the precision is hard to tune and this has to be done prior to the construction of the indexes. Also, the selectivity can be quite low.

A recent improvement, multi-probe LSH [LJW+07], uses fewer indexes. On search, several buckets are visited on each index. They are selected according to their distance to the query. It still works in main memory but, since the number of indexes is smaller, larger volumes of data can be processed. In [JB08] it is shown that selectivity is still poor, so the signatures are employed for a refinement step after bucket selection. In order to compare LSH with the NV-tree, in [LAJA05] a mass storage version of LSH is employed, also keeping the signatures. In [JASG08] it is suggested to use lattices rather than projections for building an LSH index. The positional information is better maintained in this way and the buckets contain more relevant elements. Consequently, selectivity is highly improved but the computational cost can also be significantly higher. To compensate for the low selectivity of LSH, Jegou et al. [JDS08] propose to store for every element a coarse approximation of its position with respect to the centroid of the bucket it belongs to. On search, the elements of a bucket are further filtered using these approximations.

The NV-tree [LAJA05], [DLA+07], based on OMEDRANK [FKS03], relies on the iterative projection of the vectorial signatures on random lines (orthogonality being encouraged). For every line, the domain covered by the projections is split in I intervals. The signatures projected on a line are divided in I sets, one set for each interval. The lines are used successively for projections, dividing at each projection the previous sets in I smaller sets. The main idea is quite close to LSH. But, in order to deal with the boundaries between intervals, some overlapping intervals are added between two consecutive intervals. An element that should have been in only one bucket is finally present in several ones. For search, only one bucket is visited. The query is successively projected too, but only one interval is kept on each line, the one having its center closest to the projection of the query.
The selectivity of this method is not high enough, so the authors propose to construct several (three in [DLA+07]) different indexes using several sets of random lines. Only the signatures contained in at least two of these three sets are kept for the subsequent refining step. The remaining signatures almost fulfill the criterion (2.6) of section 2.4.1. These merged searches significantly improve the selectivity. This method was shown to be efficient for 200 × 10^6 signatures on mass storage. But it shares a weakness with LSH: the creation of very large indexes. Here each NV-tree requires 50 Gb.
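As an illustration of the random projection family reviewed above, here is a minimal LSH-style sketch in which each of n tables stores only the IDs of the signatures that fall in a bucket; the projection and quantization parameters are arbitrary assumptions and do not reproduce the schemes of the cited papers.

```python
import numpy as np

class SimpleLSH:
    """Minimal LSH-style index: n tables of random-projection hash keys,
    each bucket keeping only the IDs of the inserted signatures."""

    def __init__(self, dim, n_tables=8, n_bits=16, bucket_width=32.0, seed=0):
        rng = np.random.default_rng(seed)
        # one set of random projection directions and offsets per table
        self.projections = rng.normal(size=(n_tables, n_bits, dim))
        self.offsets = rng.uniform(0.0, bucket_width, size=(n_tables, n_bits))
        self.width = bucket_width
        self.tables = [dict() for _ in range(n_tables)]

    def _keys(self, vector):
        for proj, off, table in zip(self.projections, self.offsets, self.tables):
            bins = np.floor((proj @ vector + off) / self.width).astype(int)
            yield table, tuple(bins)             # quantized projections form the bucket key

    def insert(self, vector, item_id):
        for table, key in self._keys(vector):
            table.setdefault(key, []).append(item_id)

    def query(self, vector):
        candidates = []
        for table, key in self._keys(vector):
            candidates.extend(table.get(key, []))
        return candidates                        # IDs only; a vote or decision step follows
```

Since only identifiers are kept, the selectivity limitation discussed above applies directly: every element of a visited bucket is treated as a candidate, however far it may be from the query.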


Space quantization methods. These methods aim to make approximations of the vectors contained in the database. They assume that the curse of dimensionality cannot be avoided. These methods show good performance when the distribution of the data is close to uniform.

The VA-file [BW97] proposes to divide the space in regular non-overlapping cells. Each cell, according to the number of divisions, is coded by a set of bits. This code is a coarse approximation of the vectors of the cell. Only the cells containing signatures are kept. This vector approximation (VA) index must be kept in main memory, while the contents of the cells are held on mass storage. Both ǫ-range queries and kNN queries can be processed with this index. In a first stage, the index is fully scanned. If a cell is of interest, i.e. the minimum distance from the query to the cell is below the search distance, it is visited during the second stage. In this stage, when a cell is visited, the signatures it contains are retrieved from mass storage and distance computations allow to refine the selection. In order to deal with less uniformly distributed data, the VA+-file [FTAA00] first proceeds to a PCA of the data and uses more bits to approximate the most discriminating dimensions. When dealing with very large datasets these two methods can be inefficient because either the number of cells must be increased (so the index becomes too large for main memory) or the number of signatures per cell must be increased (so many distance computations have to be performed during the second stage).

Sampling metric methods. These methods depend on the use of random pivot elements to organize the data. They show better performance with clustered data.

The Spatial Approximation Sample Hierarchy (SASH) [HS05] is a very different approach. The architecture is based on a graph whose nodes are the elements of the database. The signatures must be stored. The graph is built as follows: at level l, p × N_{l−1} signatures from the database are randomly chosen, where N_l is the number of nodes at level l. A link between two nodes of two consecutive levels, l − 1 and l, is established if the distance between the two nodes is below a fixed threshold. At most, a node of level l − 1 can have p sons, and a node of level l can have c fathers, always the closest ones. If a node has no father, then the nearest node of the previous level having fewer than p sons is chosen. The root of the tree is also a randomly selected signature. For similarity-based search, the graph is scanned, following the links that point at elements close to the query. With the SASH, kNN retrieval operations can be directly performed. The IDs of the returned elements take part in the decision step. The time needed for constructing a SASH is quite long. The structure, an index where elements lie at different levels, has to be stored in main memory to be efficient.

Data partitioning methods. These methods rely on the partitioning of the data. They aim to deal with data that is not uniformly distributed and only covers specific areas of the description space. Often but not always they rely on the use of a tree structure. The areas defined by the partitioning can overlap or not. They show good performance when the distribution of data is very non-uniform.

Berrani et al. [BAG03] suggest to first cluster the data with an algorithm inspired from BIRCH [ZRL96]. Clusters are hyper-spheres. Once they are defined, they are approximated by smaller hyperspheres.
The idea is to avoid the overlap between clusters, whose volume could be large if the dimension is high (curse of dimensionality). The selection depends on the minimal and maximal distances from the query to the approximate cluster. This method was shown to perform better for similarity-based search


than an SS-tree [WJ96] or an SR-Tree [KS97]. A final step scans the elements of the selected hypersphere (cluster) and computes the distances to the query in order to refine the selection. The precision of retrieval is easy to tune. The limitations concern the use of clustering, which can be very time-consuming, and a selectivity that can be low for high-dimensional data.

Space partitioning methods. These methods partition the description space into non-overlapping areas. A signature lies in only one area. The space is recursively divided into areas until one or several criteria are satisfied. The result of these methods is generally a tree structure. The entire space is associated to the root node and each internal node represents only a part of the space. While some of these methods rely on a regular partitioning and are best adapted to uniformly distributed data, many of them take the data distribution into account and can perform well on non-uniformly distributed data.

The k-d-B-tree [Rob81] is a data-adaptive space partitioning method. Recursively, according to a dimension, space is partitioned in two in order to have the same number of signatures on each branch (balanced partitioning). The order of partitioning can be defined according to several criteria. Each branch of the tree is independent and has its own partitioning. The signatures are stored at the level of the leaves. A leaf contains at most Ne signatures, where Ne is usually defined by a main memory constraint. Both ǫ-range queries and kNN queries can be processed. During search, branches of the tree are removed using the minimum distance from the query to the cell representing that branch. Leaves can be stored on mass storage and usually the signatures are employed for refining the results. But the tree has to be stored in main memory to be efficient, otherwise mass storage accesses are needed for finding the cells to visit.

The LSD-tree [HSW89] (for Local Split Decision tree) is a binary tree. An LSD-tree is built dynamically by successive insertions. Each signature is inserted into the area (node) it belongs to. When a node is full, it is split into two children nodes; the choice of the split position and dimension is done locally, according to the distribution of data inside the node to be split. Each internal node of the LSD-tree is represented by a dimension d, which is the dimension used to separate the space, and a separation point s. The internal nodes of the tree can be stored partially or totally in main memory. The remaining nodes and the leaves that include data are saved on mass storage devices. The LSDh-tree [Hen98] is a version of the LSD-tree in which the separation point s is replaced by a hyper-rectangle. It is coded by a binary vector of small size. This hyper-rectangle includes all the data of the node it is associated with. However, it is not the minimal bounding region of the data. This coding is needed in order to save space for storing the tree structure in main memory. The construction of an LSD-tree or LSDh-tree takes very long for large datasets. To be efficient, LSD-trees need to be stored in main memory. Because of the amount of data we want to deal with, the tree would be very large if we want it to be selective, i.e. have small leaves. The leaves usually contain the signatures in order to refine the selection by computing the distance to the query.

The GC-tree (Grid Cell Tree) [CL07] is based on a space partitioning strategy using the data density to decide if splitting should continue or not.
It uses a recursive hierarchical decomposition of the multidimensional space into cells, so as to reach leaves. The leaves are stored on mass storage and contain the signatures. First, each dimension i is divided into p intervals, which generates p^d partitions (d being


the number of dimensions). If the density of a cell, defined as the number of points in the cell (the cells have the same volume), is lower than a fixed threshold τ, then all the points it includes are considered as outliers; otherwise the cell is a cluster. If a cluster overflows the page capacity, fixed by the user, it is partitioned too. The result of this recursive partitioning is represented by a tree structure (called directory structure) and a set of nodes (called data nodes) that include the signatures. The directory tree represents the approximations of data, where the internal nodes represent approximations of the bounding hyper-rectangles of the different partitions and the leaves describe signatures by their Local Polar Coordinate (LPC) representation. In the directory tree all the outliers are gathered in the same leaf node and described by their LPC representation too. Owing to the adaptive partitioning, this method shows good results with real data.

In [JFB05] and [JBF07] the description space is hierarchically partitioned into rectangular cells. At each level, every cell of the previous level is divided into two equal parts. Since this partitioning is regular, the tree does not need any storage. Each cell has an address given by a Hilbert space-filling curve. To find the address of the cells that must be selected for a query, the Butz algorithm must be used. Both ǫ-range queries and kNN queries can be processed, but the authors suggest to perform probabilistic retrieval based on a pdf that models signature distortions. An isotropic (i.e. having the same variance in all directions) multidimensional normal pdf was considered, with a uniform variance over the entire description space. Figure 2.9 shows a simple partitioning of a 2D square into cells and a query defined by an isotropic normal pdf.

Figure 2.9: A probabilistic query following a 2D normal law pdf on a 2D grid. Darkness is proportional to the cumulative probability over the cell

For a query q and a 1D standard deviation σ for the isotropic normal pdf, the probability of a cell (a, b) = [a_1, b_1] × · · · × [a_d, b_d] is

$$P(a, b) = \prod_{i=1}^{d} \int_{a_i}^{b_i} N(q_i, \sigma)(x)\, dx \qquad (2.7)$$


For a description space partitioned at depth h and a candidate signature s, a probabilistic query must find a set Bα of h-level cells bi such that:

$$\sum_{i=1}^{\mathrm{Card}(B_\alpha)} \int_{b_i} p(x - s)\, dx \;\geq\; P_\alpha, \qquad b_i \in B_\alpha \qquad (2.8)$$

where Card(Bα) ≤ 2^h is the number of blocks in Bα and p is the pdf of the distortion. Probabilistic retrieval returns a minimum number of cells so that their cumulative probability is above a fixed threshold Pα considered acceptable for reliable copy detection. It was shown in [JB08] that such a probabilistic search is very selective (3 to 4 times better than multi-probe LSH [LJW+07]). In the selected cells, all the signatures are returned and the distance from the query to each signature is computed. If the distance is lower than a hard bound (as in an ǫ-range query), the signature is kept. Then, in order to limit the number of results and thus the complexity of the decision step, only the k nearest signatures are selected and take part in the final decision.
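The sketch below makes equations (2.7) and (2.8) concrete for the isotropic normal model: the probability of a hyper-rectangular cell is a product of one-dimensional Gaussian integrals, and cells are accumulated until the target probability Pα is reached. The cell representation (a list of per-dimension bounds) and the greedy accumulation are illustrative simplifications; the actual retrieval of section 2.5.2 obtains the cell set by a hierarchical descent instead.

```python
import math

def interval_prob(q_i, a_i, b_i, sigma):
    """P(a_i <= X <= b_i) for X ~ N(q_i, sigma^2): one factor of eq. (2.7)."""
    cdf = lambda t: 0.5 * (1.0 + math.erf((t - q_i) / (sigma * math.sqrt(2.0))))
    return cdf(b_i) - cdf(a_i)

def cell_prob(query, cell, sigma):
    """Probability of a cell [(a_1, b_1), ..., (a_d, b_d)] under eq. (2.7)."""
    p = 1.0
    for q_i, (a_i, b_i) in zip(query, cell):
        p *= interval_prob(q_i, a_i, b_i, sigma)
    return p

def select_cells(query, cells, sigma, p_alpha):
    """Keep the most probable cells until their cumulative probability
    reaches P_alpha, in the spirit of eq. (2.8)."""
    ranked = sorted(cells, key=lambda c: cell_prob(query, c, sigma), reverse=True)
    selected, total = [], 0.0
    for cell in ranked:
        selected.append(cell)
        total += cell_prob(query, cell, sigma)
        if total >= p_alpha:
            break
    return selected
```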

2.4.3 Overview of our contributions

The space partitioning method proposed in [JFB05], [JBF07] has an important advantage for dealing with very large databases: the internal nodes of the tree do not need to be stored. Also, probabilistic retrieval with such a space partitioning appears to be quite selective. However, the Hilbert space-filling curve appears to be sub-optimal. Indeed, for a query, the computation of its cell address on the Hilbert curve is a difficult task: the Butz algorithm requires several different successive binary operations (shift, XOR, AND, etc.). Moreover, the regular space partitioning is optimal for uniform data, but the distribution of real data is usually not uniform. Data-dependent space partitioning methods like [Rob81], [Hen98], [CL07] cannot be employed since for very large databases the resulting tree would either be too large for main memory or not selective enough.

Using the generic content-based video copy detection workflow shown in figure 2.5, we address the scalability challenge raised by the query search on a very large database of signatures. Our contributions concern the indexing solution and the similarity-based search using this index. Based on the study of the signature distribution in the description space and of the distortions undergone by these signatures during the copy-creation process, we put forward in the following:

• an index structure based on a Z-grid with a simple balancing mechanism,

• a general approach for obtaining refined yet inexpensive local models of signature distortions,

• a general and effective method for speeding up retrieval in high-density areas of the description space.

To index the reference signature database, we propose an indexing scheme called Z-grid, a space partitioning method that considers each dimension independently of the others and where the address of a cell is very easy to compute. We further improve this indexing scheme, first of all by analyzing the


distribution of the signatures in the database along each component in order to balance the partitioning. As we shall see, this operation is important for reducing the average computation time per query. Inspired by [JFB05], we employ probabilistic search with this index. The results are refined with a combination of ǫ-range retrieval and kNN retrieval: only the k nearest elements, under a defined range, are kept as candidates for the final decision procedure.

Considering each component independently also allows us to build inexpensive local models at every possible query position. Indeed, the model at a precise position in the description space will be expressed as a combination of models along each component. The Z-grid indexing scheme simplifies the use of models based on projections on the individual components. Thus, we also propose two local models used in the similarity search process. First, a refined model of the distortions undergone by the signatures during the copy creation process allows to search in a more appropriately defined area of the description space, increasing query selectivity and also improving detection quality. Second, local density estimates are used during the retrieval process in order to avoid producing, in areas where the local density of signatures is very high, an overwhelming amount of candidates that would be of little interest for detecting copies. It is shown that, by the joint use of these proposals, it is possible to handle very large databases with limited resources: for example, one standard computer is sufficient for monitoring in deferred real time a video stream against a database of 280,000 hours of video, with good detection quality.


2.5 Z-grid indexing

The Z-grid indexing relies on a spatial splitting of the description space, to which the local visual signatures extracted from the keyframes belong. The space is iteratively divided in two parts, along each dimension in turn. Each leaf (terminal cell) is coded by a unique binary vector of h bits representing its address (key). The splitting method allows to compute the address of a cell very quickly. In such an index a signature lies at a single position. Some other positions that may contain similar elements are visited during similarity-based search.

2.5.1 Definition and construction of the Z-grid index

The Z-grid is obtained by hierarchically partitioning the description space ([0, 255]^d in our case, see section 2.3.2) into hyper-rectangular cells, as shown in the left side of figure 2.10 (where a 2D description space is represented, not the frame space). The partitioning uses each dimension in turn. Considering a d-dimensional description space (d = 20 in our case), at the first d levels of the hierarchy every unidimensional interval ([0, 255] here) defining the description space is partitioned into two (first-order partitioning); the parts can be equal ([0, 127] and [128, 255] here), but a better alternative is suggested in subsection 2.6. At the next levels, the resulting spatial partitions are further segmented by dividing each interval into two (second and higher-order partitioning). Thus, at depth h in the hierarchy there are 2^h cells. The resulting cells are ordered according to a Z-curve in the index that is built and in the associated signature database.

Figure 2.10: Two-dimensional space partitioning at depth 3 following the Z-curve (left) or the Hilbert curve (right)

The proposal in [JFB05], [JBF07] is based on a Hilbert space-filling curve to control the partitioning of the description space and to sort the resulting cells in the index. However, for high-dimensional spaces and higher-order partitioning, finding the key (the address) in the index from the position in description space (Butz algorithm [But71]) becomes difficult. Further problems arise from the fact that, as seen in the right side of figure 2.10, when the partitioning depth exceeds the number of dimensions, not all the cells are partitioned along the same dimensions. The Hilbert space-filling curve can be chosen in order to use the Butz algorithm to compute the neighboring cells in the description space [LLL01], [LK01], [CC05], where potentially similar signatures could be found. But


if a probabilistic search is performed, the only requirement is to be able to compute the probability of a cell. It only depends on the position and size of the cell and on the position of the query. With a Z-grid, the key of the cell containing a given signature is easily found by following the partitioning hierarchy from top to bottom. It provides at every level the appropriate value for a new bit in the key, going from the Most Significant Bit (MSB) to the Least Significant Bit (LSB). Then, for a given partitioning depth, the key of a cell is the binary string of the cell position on the Z-curve. Extension to higher-order partitioning is straightforward, as can be seen in figure 2.11 for a depth of 3 in 2D. We call the resulting index Z-grid index. In contrast with the use of a Hilbert curve, in a Z-grid all the cells are partitioned along the same dimensions, whatever the depth. The direct association between individual dimensions and levels in the hierarchy also allows us to put forward in subsection 2.6 optimizations based on unidimensional analysis, allowing the index to adapt to the distribution of the data.

Figure 2.11: Z-grid: computation of the keys at depth 3
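To make the key computation concrete, the following sketch derives the Z-grid key of a signature by emitting one bit per partitioning level, from MSB to LSB. It uses the simple equal-split boundaries; the adapted boundaries of section 2.6 would replace the middle of each interval, and the function name and defaults are of course illustrative.

```python
def zgrid_key(signature, depth, lo=0.0, hi=256.0):
    """Z-grid key (an integer of `depth` bits) of a signature whose components
    lie in [lo, hi). Dimensions are used in turn, each interval being split in
    two equal halves; one bit is produced per level, MSB first."""
    d = len(signature)
    bounds = [(lo, hi)] * d              # current interval along each dimension
    key = 0
    for level in range(depth):
        dim = level % d                  # dimensions are partitioned in turn
        a, b = bounds[dim]
        mid = (a + b) / 2.0
        bit = 1 if signature[dim] >= mid else 0
        key = (key << 1) | bit           # append the new bit
        bounds[dim] = (mid, b) if bit else (a, mid)
    return key

# Example: key of a 20-dimensional signature indexed at depth 31
# key = zgrid_key([128] * 20, depth=31)
```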

The partitioning depth providing the fastest retrieval depends on the relative costs of the different stages of the retrieval algorithm. For indexes based on the Hilbert curve, it was empirically found in [JBF07] that the optimal depth was h* ≃ log2 Ns, Ns being the number of signatures in the reference database. With the Z-grid we found instead h* ≃ 1.5 × log2 Ns + 2. The difference depends on the algorithm but also on the processor speed. There is a balance between the time taken by the computation of the keys and the time taken by the computation of the distances: the longer the keys (i.e. the deeper the partitioning), the shorter the computation of the distances (and the larger the index); indeed, fewer signatures are returned for longer keys, but if the keys are too long their computation becomes too expensive. Using this value h* as an initial guess, several indexes are produced at different depths (usually 3, at h*, h* − 1 and h* + 1) and, after some retrieval tests, the one providing the fastest retrieval is retained. The differences in performance remain however limited: between the optimal index and the second one, the difference in computation time is lower than 10%.

To obtain the indexed database file, all the signatures of the reference database are sorted in ascending order of their key. Then, to build the index file, the signature database is scanned and the positions where the keys are incremented are stored. Algorithm 1 provides the details of these two steps. The complexity of the construction of a Z-grid-based index is O(d Ns + Ns log2 Ns).


Algorithm 1 Z-grid indexing
Require: List of local signatures with associated metadata (S, M) issued from the reference videos
Require: Depth of indexing h
Ensure: Sorted database of local signatures with associated metadata (S, M) and index I
 1: for all Si in S do
 2:   Compute Z-grid key Zi from Si
 3: end for
 4: Sort (S, M) in ascending order along Z
 5: i = 0
 6: Add i to I
 7: for k = 0 to 2^h do
 8:   while k is equal to Zi do
 9:     i = i + 1
10:   end while
11:   Add i to I
12: end for
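A compact sketch of the same construction, under the assumption that all signatures fit in memory; `key_fn` stands for a key function such as the hypothetical `zgrid_key` above, and the index stores, for every possible key value, the position of its first signature in the sorted database (so it is only practical here for small depths).

```python
def build_zgrid_index(signatures, metadata, depth, key_fn):
    """Sort the signature database by Z-grid key and build the offset index,
    following the two steps of Algorithm 1."""
    keys = [key_fn(s, depth) for s in signatures]
    order = sorted(range(len(signatures)), key=lambda j: keys[j])

    sorted_db = [(signatures[j], metadata[j]) for j in order]
    sorted_keys = [keys[j] for j in order]

    index, i = [0], 0
    for k in range(2 ** depth):
        while i < len(sorted_keys) and sorted_keys[i] == k:
            i += 1
        index.append(i)   # signatures with key k occupy sorted_db[index[k]:index[k+1]]
    return sorted_db, index
```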

2.5.2 Probabilistic retrieval with the Z-grid

The retrieval of the signatures that are similar to a query is performed in two stages. During the first stage, hierarchical probabilistic retrieval returns a set of cells such that their cumulative probability (following the density centered on the query) is above a threshold Pα considered acceptable for reliable copy detection. This Pα is set by the user according to the recall level he requires. During the second stage, all the signatures from these cells are compared to the query and those that are too far from it are filtered out. We employ a combination of ǫ-range retrieval and kNN retrieval: only the k nearest neighbors, under a defined range, are kept as candidates. The resulting signatures identify the original videos to which the candidate video is compared by the matching algorithm in order to decide whether it is a copy or not.

First stage. The hierarchical probabilistic retrieval is performed as follows. The description space is split recursively using the Z-grid scheme. At each level, the probability of the cells to contain a signature that is close to the query is computed. This probability is obtained by integrating over the cell the density corresponding to the distortion model. If this probability is below a cell probability threshold Tα, retrieval along this branch (within this cell) is stopped. This is performed starting from the top level (half of the description space) and down to the leaves (terminal cells), at depth h. All the cells returned are defined at the splitting depth h. The cell probability threshold Tα has to be set; it depends on Pα, the cumulative probability threshold. A calibration is performed using 10,000 queries and 10,000 signatures. The signatures are randomly selected in the signature database. From a given Tα, a dichotomy is performed in order to obtain a Pα that is as close as possible to the one desired by the user: if the hierarchical probabilistic retrieval results in a P′α > Pα, Tα is increased, otherwise it is lowered. The distortion model employed can be an isotropic normal law, as in [JFB03, JFB05], so the probability of a cell is given by (2.7). A refined modeling solution is proposed in subsection 2.7.1. Since at


each level one dimension is divided, only the corresponding factor is updated in (2.7). The first stage is detailed in Algorithm 2.

Algorithm 2 Z-grid cell search
Require: Query signature with associated metadata (s, m) issued from the video stream
Require: Reference signature database S and its index I
Require: Depth of indexing h
Require: Model of signature distortion d
Require: Cell probability threshold Tα
Ensure: List of cells K to visit
 1: K = entire description space
 2: K′ = ∅
 3: for j = 0 to h do
 4:   for all Ki ∈ K do
 5:     Compute probability Pd(Ki) to contain candidates
 6:     if Pd(Ki) > Tα then
 7:       Add Ki to K′
 8:     end if
 9:   end for
10:   K = ∅
11:   for all K′i ∈ K′ do
12:     Compute K′i1 and K′i2, the keys issued from K′i at depth j + 1
13:     Add K′i1 and K′i2 to K
14:   end for
15: end for

Second stage. The cells returned by the first stage may contain many signatures that are too far from the query to be potential originals for this query. The second stage filters out those signatures whose distance to the query is above a reference range rr that depends on the signature distortion model. In some cases, the number of signatures within the reference range can still be too high, which would have a strong negative impact on the subsequent vote-based decision step. In these cases we return only the k (usually 100) nearest neighbors of the query signature. The impact of this solution is further discussed in subsection 2.7.2, where local models for the density of signatures are introduced. The time complexity of this stage is linear in the number of signatures of the selected cells; for this cost to be systematically low, both the mean value and the variance of the number of signatures per cell should be small. The lower bound of the first stage complexity is logarithmic in Ns (the size of the reference database) and, since the selectivity of the index is relatively high, we consider that with an appropriate choice of the depth h, complexity remains sub-linear in Ns. This is confirmed by the investigation in section 2.8.

The simple Z-grid scheme provides a fast retrieval process. The evaluation of the probabilistic retrieval with the Z-grid shows that it outperforms the use of a Hilbert scheme. Using a regular splitting, it is well adapted to uniform data. But, even if the signatures cover well the entire description space, their density is not uniform. In order to optimize the similarity-based retrieval using the Z-grid, we estimate the data distribution. The aim is to balance the population of the cells in order to increase the selectivity. The components along which the data distributions are the most uniform are partitioned first. These are


also the components that are partitioned several times. We also adapt the positions of the partitioning boundaries to the data distribution.
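A minimal sketch of this two-stage retrieval, assuming a `cell_prob(query, cell)` function such as the one sketched in section 2.4.2 (with its σ fixed) and representing cells as lists of per-dimension bounds; the values of Tα, rr and k are placeholders and the descent differs slightly from Algorithm 2 in where the threshold test is applied.

```python
import heapq

def probable_cells(query, depth, cell_prob, t_alpha, d=20, lo=0.0, hi=256.0):
    """First stage: descend the Z-grid, keeping at each level only the
    half-cells whose probability exceeds the threshold T_alpha."""
    cells = [[(lo, hi)] * d]                     # start with the whole space
    for level in range(depth):
        dim = level % d
        kept = []
        for cell in cells:
            a, b = cell[dim]
            mid = (a + b) / 2.0
            for half in ((a, mid), (mid, b)):    # the two children at the next depth
                child = list(cell)
                child[dim] = half
                if cell_prob(query, child) > t_alpha:
                    kept.append(child)
        cells = kept
    return cells

def second_stage(query, candidates, rr, k=100, dist=None):
    """Second stage: keep at most the k nearest candidate signatures within
    the reference range rr."""
    dist = dist or (lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5)
    scored = [(dist(query, s), s) for s in candidates]
    in_range = [p for p in scored if p[0] <= rr]
    return [s for _, s in heapq.nsmallest(k, in_range, key=lambda p: p[0])]
```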


2.6 Optimizations exploiting the global data distribution

If the regular splitting of the Z-grid is adapted to uniform data, it certainly does not fit well the distribution of real data. Sorting the components and using adapted boundaries for the Z-grid indexing scheme help balance both the database and the query distributions, resulting in a better selectivity of the probabilistic similarity-based search. The idea is to get closer to data-dependent space-partitioning methods like the k-d-B-tree [Rob81], the GC-tree [CL07] or the LSDh-tree [Hen98], but without having to store a large tree in main memory.

For large databases (more than 10,000 hours) containing diverse videos, the signatures employed here cover well the description space. Due to the high redundancy within some types of videos (e.g. weather forecasts, news, shows, ads), but also to characteristics of the local descriptors employed (which were nevertheless designed to make the distribution more uniform, see 2.3.2), different cells at the same partitioning level can contain very different numbers of signatures, as shown in figure 2.12.

Figure 2.12: Distribution of cell populations at an indexing depth of 20 (2^20 cells) for a 10,000 hours database

The presence of highly “populated” cells has a negative impact on the second stage of retrieval. Indeed, if one of these cells is returned, all of its elements must be scanned and their distances to the query computed. Extreme values have a strong impact on the average, so a few very expensive queries significantly increase the average search time for a query. To mitigate the effects of a non-uniform data distribution, we define a specific order between components for the hierarchical partitioning and then fit the partitioning to the distribution of data. This is inspired by data-dependent space-partitioning methods, but we aim to introduce as little additional complexity as possible. Indeed, the complexity of the construction of a Z-grid-based index is O(Ns log2 Ns), while for the LSDh-tree [Hen98] it is O(d Ns log2 Ns) (d being the dimension of the description space). Furthermore, a k-d-B-tree or LSDh-tree (both based on the k-d-tree) needs to store at every level, for each partition, the position where it is divided; for a medium size database of 60,000 hours of video, h* ≃ 32, it would require at least 40 Gb (with 10 bytes per node, including the component number, boundary position and addresses of the sons) and could hardly hold in main memory.


Figure 2.13: Distributions of data projections on dimensions 1 and 3, and the corresponding adapted partitioning

When building the index, the description space is partitioned along every dimension once or several times. As shown in figure 2.13, the distribution of the signature projections can be rather uniform for some dimensions and quite non-uniform for others. The non-uniform ones all present a peak close to the center. If we consider the distribution of the query costs and relate it to the position of the queries, we obtain useful information: queries that lie near the center of the space are expensive. Indeed, the border between cells is close to the peak for the very non-uniform distributions, so queries are more likely to select cells on both sides of this border, thus increasing computation cost. This should be avoided as much as possible. For large volumes of data, the queries (signatures of the monitored streams) and the signatures in the reference database can be considered to have the same distribution. We thus have a double benefit from partitioning first the dimensions along which the distributions are the most uniform. Also, by dividing these dimensions as much as possible, the populations of the cells obtained at the optimal depth h* should be as balanced as possible.

2.6.1 Component sorting

The components are sorted in increasing order of their non-uniformity and partitioning follows this order. Non-uniformity is defined as:

$$V_j = \sum_{i=0}^{255} |d_{ij} - \bar{d}_j| \qquad (2.9)$$

where d_{ij} is the number of signatures whose orthogonal projections on dimension j fall in [i, i + 1) and d̄j is the mean value of d_{ij} along dimension j. It corresponds to the L1 distance between the actual histogram along dimension j and a uniform histogram. This simple criterion is adequate because the distributions of all the components are continuous and follow a bell shape, more or less flat. Other criteria should be chosen otherwise. The indexing scheme corresponding to this configuration is called Z with sorting in the remaining part of this section.
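A small sketch of this criterion: for each dimension, a 256-bin histogram of the projections is compared to a flat histogram with the L1 distance, and the dimensions are then ordered by increasing non-uniformity. The array layout is an assumption (one row per signature, one column per component).

```python
import numpy as np

def non_uniformity(signatures):
    """V_j of equation (2.9) for every dimension j.
    `signatures` is an (N, d) array of byte-valued components."""
    n, d = signatures.shape
    v = np.empty(d)
    for j in range(d):
        hist = np.bincount(signatures[:, j].astype(np.int64), minlength=256)[:256]
        v[j] = np.abs(hist - n / 256.0).sum()   # L1 distance to a uniform histogram
    return v

def sorted_dimensions(signatures):
    """Dimensions ordered from the most uniform to the least uniform,
    i.e. the order in which the Z-grid partitions them."""
    return list(np.argsort(non_uniformity(signatures)))
```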


2.6.2 Boundary adaptation

To further balance the populations of the cells with as little additional complexity as possible, the partitioning along each dimension is fitted to the distribution of the signatures while keeping it independent of the other dimensions. As shown in figure 2.13, partitioning at first order follows the median and at second order the remaining quartiles. The indexing scheme employing both the sorting of components and the adaptation of the boundaries is called ZN in the following. To evaluate how uniform the distribution is and to find the quartiles, the distributions of the projections on the individual dimensions are estimated on 1,000 hours of randomly selected videos, but they are found to be already stable for a random sample of 100 hours. Note that an indexing scheme where components are used independently from each other is necessary. An indexing scheme based on a Hilbert curve cannot be employed: if not all the cells are split along the same component at some level, the resulting cells are not oriented in the same way (see figure 2.10), so a single dimension cannot be used in priority.
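The following sketch shows how the bits contributed by one component could be computed under this adapted partitioning: the first-order split uses the estimated median, and the second-order split uses the quartile of whichever half the value fell into. The quantile estimation on a sample and the helper names are illustrative assumptions.

```python
import numpy as np

def dimension_quantiles(sample_column):
    """Quartiles of the projections on one dimension, estimated on a sample."""
    q25, q50, q75 = np.quantile(sample_column, [0.25, 0.5, 0.75])
    return q25, q50, q75

def dimension_bits(value, quantiles, n_splits):
    """Bits contributed by one component under the ZN scheme: first-order
    split at the median, second-order split at the remaining quartile."""
    q25, q50, q75 = quantiles
    first = 1 if value >= q50 else 0
    bits = [first]
    if n_splits >= 2:
        boundary = q75 if first else q25
        bits.append(1 if value >= boundary else 0)
    return bits

# Hypothetical usage: the key of a signature interleaves these bits following
# the component order of section 2.6.1 (most uniform dimensions first).
```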

2.6.3 Evaluation of the selectivity improvements

We have directly evaluated the gains resulting from the two improvements, the sorting of the components and the adaptation of cell boundaries according to the distribution along the individual components, on a 10,000 hours database. Figure 2.14 shows the distribution of the number of signatures returned with the three different indexing schemes described above: the Z-grid scheme (Z), the Z with sorting one and the ZN one. Ten thousand queries were built; they are issued from signatures of the database, artificially distorted with a Gaussian noise in the description space. The plot in figure 2.14 only shows the distributions of the queries that return ≤ 15,000 signatures (for improved readability). Table 2.1 shows the number of the queries that return very many signatures (more than 20,000 and 200,000) and do not appear on the plot.

In figure 2.14, the Z scheme has a distribution with a peak for 700 signatures returned by a query and a long tail. For the Z with sorting and the ZN schemes the peak is for fewer signatures (around 500) and is much higher. Many more queries return fewer signatures than with the simple Z scheme. The plot also suggests that the tail is shorter. Table 2.1 corroborates this. The number of queries that return more than 20,000 signatures is significantly lower for Z with sorting and ZN. The number of extreme values, 200,000 signatures, also shows a strong decrease. The ZN scheme provides the best results and so is kept for the subsequent evaluations. The optimal indexing depth indicated with the Z-grid in 2.5.1 is slightly lowered by the ZN scheme and is now h* ≃ 1.5 × log2 Ns. This is a nice benefit, since a smaller index is faster to load. Figure 2.15 shows the average query time with the two indexing schemes for a 10,000 hours video database. The optimal depth is 31 for ZN and 32 for Z. The results given here for illustration employ the refined model of distortions presented in section 2.7.1; the profiles of the curves are the same, but average retrieval time is lower.



Figure 2.14: Distributions of the number of distances computed for 10,000 queries with a Z-grid (Z), a Z-grid including the sorting of the components (Z with sorting), and a Z-grid including the sorting of the components and adapted boundaries (ZN)

Table 2.1: Number of expensive queries (out of 10,000) for three indexing schemes

Indexing scheme                                          | Z       | Z with sorting | ZN
Average number of signatures returned                    | 78,376  | 43,557 (-45%)  | 36,430 (-54%)
Number of queries returning more than 20,000 signatures  | 4,150   | 3,124 (-24%)   | 2,759 (-34%)
Number of queries returning 200,000 signatures           | 1,085   | 569 (-48%)     | 545 (-50%)

Table 2.2 illustrates, for the ZN scheme and different volumes of video, the size of the resulting database, the associated optimal indexing depth and the resulting index size. Using the Z-grid with vectors of medium dimension (d = 20) is effective. An evaluation of both speed and detection quality for stream monitoring is performed in section 2.8. The optimal indexing depth depends on the amount of data; here a component is split once or twice. The approach is generic, so the indexing scheme can be used with any kind of description. It can deal with descriptors of smaller or larger dimensions. The components are sorted according to their uniformity and boundaries are adapted (ZN scheme). For low-dimensional descriptors, each component is cut several times, with a lower bound depending on the range of signature distortions. For high-dimensional descriptors, only the most uniform components would be used. We believe that this scheme can also be effective with such description schemes as SIFT [Low04], SURF [BTG06] or GLOH [MS04]. The use of the unidimensional distributions in order to sort the components and to adapt the partitioning of the description space is the first stage for optimizing the probabilistic similarity-based retrieval



Figure 2.15: Average cost per query with the Z and ZN schemes at various indexing depths on a 10,000 hours database

Table 2.2: Some databases, with corresponding optimal partitioning depths and index sizes

Volume of video | Nb. of descriptors | Estimated optimal indexing depth (h*) | Experimentally optimal indexing depth (he) | Approximate number of cells for he | Index size
10,000 h        | 0.6 × 10^9         | 30                                    | 31                                         | 2 × 10^9                           | 16 Gb
60,000 h        | 3.6 × 10^9         | 33                                    | 32                                         | 8.5 × 10^9                         | 32 Gb
120,000 h       | 7.2 × 10^9         | 34                                    | 34                                         | 17 × 10^9                          | 128 Gb
280,000 h       | 16.8 × 10^9        | 35                                    | 35                                         | 34 × 10^9                          | 256 Gb

with a Z-grid. The second stage consists in building local models for signature distortions and for the local density in description space.


2.7 Optimizations based on local models

The first local models we are interested in concern signature distortions. The models used in [JFB03] or [BAG03] are isotropic (same variance in all directions) and uniform over the entire description space. We performed a statistical analysis on a set of videos and automatically generated copies, which shows that the isotropic uniform model is very restrictive. Indeed, the distortions have significant variations depending on the position of the signatures and on the direction considered. Based on this analysis we build better (yet inexpensive) local models of signature distortions (equation 2.4). We aim to improve the selectivity of a query but also the relevance of the covered area. To relate this to probabilistic search, the set of cells returned by a query can be different when compared to an isotropic uniform model: there are new selected cells and, conversely, other cells are removed; overall, fewer cells are returned.

We already exploited the global distribution of the signatures in order to enhance the selectivity, by improving the balance of the population of the cells. But locally some areas are still very dense, and they are still increasing the average time of queries. The idea is then to correct the probabilistic search according to a local estimation of the signature density in description space. The number of cells brought back in these areas is lowered, in order to keep only the ones having a higher probability to contain signatures close to the query. In order to have a more generic approach, we propose two separate models that can be employed together.

We retained three criteria we consider important for the selection of a modeling method:

1. The volume of data required for building the model. A model is obtained by statistical means. To be reliable, the model associated to a part of the description space should rely on a sufficient amount of data.

2. How representative the model is. A model for a part of the description space should have a low variability over that part of the space, otherwise it is of little use.

3. The potential contribution of the model to the performance of the system. A model can be expected to have a valuable contribution if it significantly improves the selectivity of the similarity-based queries at constant detection quality, and if it can be exploited by the retrieval process with little additional computation cost and without requiring a high amount of storage.

The granularity of the model has an impact on the three criteria. A model is defined for the whole description space but relies on different parameters. Using local models induces the setting of different parameters according to the position in the description space. The position is defined according to the granularity used. The first possibility is to build a model at the level of a leaf (terminal cell). A coarser level of precision can be used, at a lower partitioning depth, as well as a finer level. The limit is to have a specific model for each position in the description space (among the 256^d existing positions in our case). If a model is obtained at a coarse cell level, then it might not be representative enough. Indeed, the cells are too large and high variations (of the local density or of the consequences of the distortions) between


If a model is obtained at a fine cell level, then it might not be reliable enough, since there is little chance to have enough data for building it. Furthermore, specific model-related data would have to be stored for the model of each cell, which is prohibitive given the number of terminal cells for a large database. The first and the third criteria would not be satisfied.

Another possibility is to cluster the data and use a model for each cluster. By setting a large lower bound for cluster size, the first criterion can be fulfilled. But if some clusters are too large, the second criterion is not respected. Conversely, if the number of clusters is too high, the first and third criteria cannot be respected. The clustering of the signatures in the database is hard to tune and a good compromise might not even exist. Moreover, the clustering supporting a model of density might not be the same as the clustering supporting a model of distortions. Also, it is difficult to choose the model for a query that lies at the intersection of (or in-between) several clusters.

To build appropriate models we rely again on specific independence assumptions for the components of the description space. A model can then be defined at any position of the description space as the combination of unidimensional independent models along each component. This choice can fulfill the three criteria:

1. Considering one component coded on one byte, 256 values are possible, so 1/256 of the available data is involved in the construction of each local model.

2. The granularity used is the finest possible: each position of the description space has its own local model.

3. Only 256 values for each component are needed for defining each local model.

Note that having local models at the same level of granularity simplifies their use during the probabilistic search. The final evaluation results will allow us to see whether all the assumptions we made hold sufficiently well.

2.7.1 Precise modeling of the distortions

A study of the distortions found in a small ground truth database shows that the model in (2.7) is simplistic and has as a main consequence a sub-optimal selectivity of the queries in the reference signature database. A refined model has the potential to further reduce retrieval time by improving the selectivity, while maintaining (or even improving) detection performance.

The size of a ground truth cannot be expected to be large enough for the construction of refined models. While it is possible to automatically generate a “ground truth” for estimating refined models, it should be stressed that such a ground truth is nevertheless obtained from a set of original videos that is well known (all the redundancies are identified) and large enough to provide a good coverage of the description space. Beyond a certain size, such a set of original videos is still very difficult to obtain.


Moreover, the cost of automatically generating this “ground truth” and of the subsequent construction of the model can become prohibitive. A model provides an approximation of the real phenomena and we need data in order to build it. In our case, we have to use the original signatures and the corresponding distorted signatures. The more precise the model is, the more it can be expected to improve the selectivity of the queries and thus to speed up retrieval. But, as seen previously, the more precise we want a model to be, the more data we need to build it. Also, the more precise the model, the more complex is the computation involved and the larger is the additional storage it requires, with a negative impact on the retrieval speed. Finding a compromise is difficult.

Obtaining the database

A very important difficulty in obtaining a model of distortions is the collection of the data to build the model. Ideally, we should have a very large ground truth database where:

• all the (transformed) copies of existing content are properly identified (without false positives or false negatives) and

• a link is established between each point of interest in a copy and the original point of interest it corresponds to.

The point of interest signatures in this ground truth database should also provide a sufficient coverage of all the regions of the description space. Clearly, this ideal situation cannot be encountered in practice: first, past copies are not necessarily representative of future copies (even when neither the nature nor the parameters of the transformations are expected to change); secondly, the creation of a reliable ground truth would require a prohibitive amount of human intervention. It is also important to note here that the more complex the target model is, the larger the amount of data required to estimate it reliably.

We had access to two ground truth databases. One consists of 30 hours of original videos from INA. The other is the video copy detection benchmark of CIVR07 (http://www-rocq.inria.fr/imedia/civr-bench/), consisting of 80 hours of original videos and about 2 hours of copies. These ground truth databases are too small for modeling the amplitude of the image transformations we can expect. Also, if a ground truth were used for building the model then it could no longer be employed for evaluating it. Consequently, we do not employ these ground truth databases for building the distortion models, but only for evaluating the complete system in section 2.8.

Since the construction of a model cannot be performed by using a ground truth database, we need to automatically generate a large set of copies (transformed videos) and then model the distortions of the signatures using this automatically generated data. This synthetic copy database should be as representative as possible of the copies we can expect: it should cover at least all the types of transformations we have already encountered, with transformation parameters that remain within reasonable bounds with regard to visual perception.


Figure 2.16: An illustration of the bounds considered for gamma changes, scaling and contrast changes

The coverage of the description space should also be appropriate but, as we have seen in section 2.7.1, the precise requirements directly depend on the nature of the model. The set of transformations we considered comprises gamma and contrast changes (intensity transformations), scaling (geometric transformation) and addition of Gaussian noise. Both in official postproduction and in user generated copies, these are the most frequently used transformations that preserve the points of interest. Other transformations, including occlusion, crop and insertions, either remove or introduce new points of interest, so they should not be taken into account for building a model of signature distortions (which only concerns the original signatures that are distorted but not removed). The introduction or removal of points of interest during the generation of a copy is implicitly handled by the vote-based decision step (see 2.3.3). Figure 2.16 illustrates the bounds we have taken for the amplitude of the different transformations.

We consider that, with the transformations listed above, the interest point detector (Harris for the results presented here) preserves the points of interest. So, all the points of interest found on the original videos are also found on the corresponding copies, using a geometric correction when needed (i.e. for scaling). Then, this process for generating copies provides not only the correspondence between a copy and the original, but also the link between every point of interest in a copy and the original point of interest it corresponds to.


Within the ranges we have taken for the parameters, we select a set of samples {Θi}k (where i is the parameter number, i ∈ [1, NΘ], and k is the index of the sample). By applying one transformation to every image in a set {Ij} (j being the image number, j ∈ [1, NI]), we obtain the data needed for building the signature distortion model. More specifically, the data generation process follows these stages:

1. On every image, compute the point of interest signatures detected in order to get a set of original signatures {sl} (l being the signature number, l ∈ [1, Ns]).

2. On every image Ij, apply one transformation (2.4) with the parameters {Θi} in order to obtain a transformed image (i.e. copy) I′j.

3. Compute the local signatures on every transformed image to obtain the transformed signatures {s′l}.

4. As we perfectly know the image transformations, we can easily associate an original signature sl with its distorted version s′l.

5. Compute the distortion δsl for each signature using (2.5) to obtain the set {δsl}.

For the cases we consider, every general transformation T in (2.1) is composed of two types of transformations, T = Ti ◦ Tg. Ti is an intensity transformation of the pixel of coordinates x,

I′(x) = Ti(I(x), Θi)    (2.10)

Θi being the parameters of Ti (with I(x), I′(x) ∈ [0, 1]). Tg is the geometrical transformation x′ = Tg(x, Θg), Θg being the parameters of Tg. Ti can be further expressed as

I′(x) = a[I(x)]^γ + b + ν    (2.11)

where a, γ and b are the parameters of the changes in image intensity and ν is the noise.

To produce the synthetic copies we start from a set of 300 hours of miscellaneous videos, containing 16,580,000 local signatures. Following the procedure above we generate 100 hours of transformed videos (copies) using a resizing factor in [0.8, 1.2]\{1.0} and the following intervals for the values of the parameters in equation (2.11): a ∈ [0.6, 1.4]\{1.0}, γ ∈ [0.6, 1.4]\{1.0} and σ(ν) ∈ ]0.0, 7.0]. This produces a set {⟨sl, s′l⟩} of 5.5 × 10^6 pairs of original and associated transformed signatures. The parameters are randomly chosen; each copy results from a combination of the four possible transformations. This aims to model cases of strong distortion.
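To make the copy-generation step concrete, the intensity part of (2.11) can be sketched as follows. This is only an illustrative example, assuming images stored as floating-point arrays with values in [0, 1]; the range of the additive offset b and the rescaling of the noise standard deviation to the [0, 1] scale are our own assumptions, and the function names are not part of the actual extraction software.

```python
import numpy as np

def random_intensity_transform(image, rng):
    """Apply I'(x) = a * I(x)**gamma + b + noise with randomly drawn parameters.

    `image` is a 2D float array with values in [0, 1]; a and gamma are drawn in
    [0.6, 1.4] and the Gaussian noise standard deviation in ]0, 7] (rescaled here
    to the [0, 1] intensity range), following equation (2.11)."""
    a = rng.uniform(0.6, 1.4)
    gamma = rng.uniform(0.6, 1.4)
    b = rng.uniform(-0.1, 0.1)            # assumption: small additive offset
    sigma = rng.uniform(1e-3, 7.0) / 255.0
    noise = rng.normal(0.0, sigma, size=image.shape)
    out = a * np.power(image, gamma) + b + noise
    return np.clip(out, 0.0, 1.0)         # keep intensities in [0, 1]

rng = np.random.default_rng(0)
original = rng.random((288, 352))          # stand-in for a decoded keyframe
copy = random_intensity_transform(original, rng)
```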

Modeling alternatives

The isotropic model (2.7) consists in the use of a single value for the variance along all components, for the entire description space. All the alternatives we consider here for modeling signature distortions associate specific covariance matrices (in general having different values on the diagonal) and vectorial offsets to specific regions of the description space.


Our models differ in the definition of these regions. Since the hypothesis of independence of the components was found to hold in general [JFB03], we only consider diagonal variance matrices, which also limits the amount of data needed for estimating the model.

After the isotropic model (2.7), the next simplest model is the global non-isotropic model. It consists in a variance matrix that has different values on the diagonal and is employed for the entire description space. Stated otherwise, for every component we have a specific value for the variance of the distortions, but this value is considered true over the entire space. However, we noticed that the behavior of distortions varies significantly between different regions of the space. These variations are averaged when computing the empirical variance matrix, producing a model that is close to the isotropic one and does not bring any significant improvement.

A more refined alternative is to associate specific variance matrices to different cells (space divisions), either at the same partitioning level or at different levels. Following the procedure in section 2.7.1, we could obtain about 5.5 × 10^6 data couples {original signature, distorted signature}. For a small database of 1,000 hours of video we have about 250 × 10^6 cells at the bottom level. For a larger database of 100,000 hours we have about 17 × 10^9 cells. The amount of data is thus clearly insufficient to estimate a diagonal variance matrix (or even the variance for an isotropic normal distribution) for every cell at the bottom level. We could then attempt to develop a coarser model, by estimating variance matrices for higher-level cells. Unfortunately, to reliably associate a variance matrix to a cell, we should have a low variability of the variances over the cell. Given the fact that the cells are large even at the bottom level (where they generally span at least a quarter of each component), this condition can hardly be satisfied.

Partitioning the data rather than the space and associating specific variance matrices to different data partitions has the potential of reducing the number of different models if there is more consistency within data partitions than within existing space partitions. But this raises the problem of the empty (or low density) regions, where no model is estimated but where we can have queries. Also, in this case we need both a clustering method to obtain the data partitions and a complementary search procedure to find which model should be applied for a query, resulting in significantly heavier off-line and online computation. For both alternatives mentioned above, which associate specific models either to different space partitions or to different data partitions, the location-dependency of the models can be improved if the variance matrix employed for a query close to a frontier is obtained as a weighted combination of the matrices associated to the neighboring cells. Unfortunately, this would considerably increase the complexity of the computations involved.

We can also refine the global non-isotropic model in an alternative way: by associating to every (coordinate) value of each component a different variance and offset (see figure 2.17). Since every component only has 256 different values, we have a parsimonious model of only 2 × d × 256 parameters to estimate and store. Also, we have on average 5.5 × 10^6/256 ≈ 21,000 data couples {original signature, distorted signature} to estimate every parameter. To estimate the variance along a component at a given value of that component, we average over all the original data points within a “slice” (hyperplane) that is orthogonal to that component.


Figure 2.17: Proposed signature distortion model: at every value of each component we can have a potentially different value of the variance along this component.

Figure 2.18: The global isotropic model (fixed-size circles) and the refined model (ellipses) in a 2D representation

While there is significant variation in the behavior of distortions within such a slice, this variation appears to be low enough in the direction of the component considered (orthogonal to the slice) for the resulting model to be reliable. This quality is also indirectly confirmed by the comparative experiments performed in section 2.8. Given the advantages of this last model, we selected it as the refined model for evaluation. We now consider two versions of this model, one ignoring the offsets (“precise model”) and one employing the offsets (“precise model with offset”). Figure 2.18 shows the differences between the previous, global isotropic model (fixed-size circles) and the refined model (ellipses of various shapes, dimensions and orientations) in a 2D representation of the description space.


Figure 2.19: On the left, distribution of the deviations of copy signatures with respect to original signatures for components 0 and 7. On the right, distribution of the absolute values of the deviations of copy signatures with respect to original signatures for components 0 and 7.

Construction of the refined model

For a signature s, the distortion δs produced by the copy creation process is given by equations (2.4) and (2.5). However, when we have to process a query q we do not know a priori whether it is actually a distorted version, s′, of an original signature from the database, nor what values the parameters of the transform (2.4) may have. We can only rely on a statistical model of distortions and attempt to find original signatures s that may produce q after distortion. For the precise model put forward in the previous section we consider that the statistical distribution of the distortions δs depends on the position of the signature s in the description space. We write:

δs ∼ M(s, Ψ)    (2.12)

where Ψ are the parameters of the statistical model M. To build this model we rely on the synthetically generated sample {(sk, s′k)} (see section 2.7.1). We compute the deviations between the signatures issued from the copies and the corresponding original signatures. The left part of figure 2.19 shows these deviations for components 0 and 7, while the right part shows the cumulative distribution of the absolute values of these deviations, together with the threshold of 90%. The distributions are quite different among the components; we represented here the distributions having the highest and respectively the lowest peak, the distributions of the other components being between these two.

To obtain the precise model we need to estimate, for each of the d components, the variance along that component at every coordinate j (in 0, 1, . . . , 255) of the component. For this, we average over all the original data points within the corresponding “slice” that is orthogonal to the component. For component i, the standard deviation σ̂ij along that component at coordinate j is:

σ̂ij = sqrt( Es∈Sliceij[(δs)i^2] − (Es∈Sliceij[(δs)i])^2 )    (2.13)

where Sliceij is the slice that is orthogonal to component i and intersects it at coordinate j, and (δs)i is the projection of δs on component i.

For the precise model with offset we also need, for each of the d components, the offset along that component at every coordinate of the component. For component i, the offset r̂ij along that component at coordinate j is:

r̂ij = Es∈Sliceij[(δs)i]    (2.14)

Figure 2.20: On the left, standard deviations computed for every abscissa along components 2, 4, 5, 7 and 10 respectively. On the right, mean distortion computed for every abscissa along components 2 and 7.

The left part of figure 2.20 shows the estimated standard deviations σ̂ij at various coordinates along components 2, 4, 5, 7 and 10. The right part of figure 2.20 shows the estimated offsets r̂ij for components 2 and 7 only (for clarity). To estimate the 2 × d × 256 parameters, all the 5.5 × 10^6 data couples {original signature, distorted signature} were employed.
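As an illustration of this estimation step, the following sketch computes the tables σ̂ij and r̂ij of equations (2.13) and (2.14) from a set of (original, distorted) signature pairs stored as numpy arrays; the array layout, the handling of empty slices and all names are our own assumptions.

```python
import numpy as np

def estimate_distortion_model(originals, copies):
    """Estimate, for every component i and every coordinate j in 0..255,
    the standard deviation sigma[i, j] and the offset r[i, j] of the
    distortion (equations 2.13 and 2.14), by averaging over the 'slice'
    of original signatures whose i-th component equals j."""
    originals = originals.astype(np.int16)     # avoid uint8 wrap-around
    copies = copies.astype(np.int16)
    deltas = copies - originals                # delta s = s' - s
    n, d = originals.shape
    sigma = np.zeros((d, 256))
    offset = np.zeros((d, 256))
    for i in range(d):
        for j in range(256):
            in_slice = deltas[originals[:, i] == j, i]
            if in_slice.size == 0:             # empty slice: keep zeros (assumption)
                continue
            offset[i, j] = in_slice.mean()     # r_ij, equation (2.14)
            sigma[i, j] = in_slice.std()       # sigma_ij, equation (2.13)
    return sigma, offset

# toy usage with random data standing in for the 5.5e6 real pairs
rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(10000, 20), dtype=np.uint8)
cop = np.clip(orig + rng.normal(0, 5, size=orig.shape), 0, 255).astype(np.uint8)
sig, off = estimate_distortion_model(orig, cop)
```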

Use of the refined model

During search, when we have to process a query q, we employ the statistical model of distortions (2.12) estimated for the same position in the description space in order to find original signatures s that may produce q after distortion. More specifically, we consider that:

δs ∼ M(s′, Ψ)    (2.15)

We can do so because the model is relatively smooth at the scale of the estimated standard deviations σ̂ij. With the precise model, only the values of σ̂ij are employed. For a query q (with components qi), the computation of the probability of a cell is no longer performed according to (2.7), but follows instead:

P(a, b) = ∏_{i=1..d} ∫_{ai}^{bi} N(qi, σ̂ij)(x) dx    (2.16)


If the precise model with offset is employed, we first correct the query q using the corresponding estimated offset:

q′i = qi + r̂ij    (2.17)

for j = qi. Then, the probabilities of the cells are computed according to (2.16), using the corrected q′ instead of q.
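For illustration, the cell probability of (2.16) reduces to a product of one-dimensional Gaussian integrals that can be evaluated with the error function. The sketch below assumes the cell is given by its per-component bounds a and b and that sigma is the table estimated above; the names and the lower bound on the standard deviation are ours.

```python
import math

def cell_probability(query, a, b, sigma):
    """Probability of the cell [a, b] for query q under the precise model:
    product over the d components of the integral of N(q_i, sigma_ij) over
    [a_i, b_i] (equation 2.16); sigma[i][j] is the estimated standard
    deviation of the distortion along component i at coordinate j = q_i."""
    p = 1.0
    for i, q_i in enumerate(query):
        s = max(sigma[i][q_i], 1e-6)          # avoid a zero standard deviation
        # integral of a Gaussian centred on q_i between a_i and b_i
        upper = 0.5 * (1.0 + math.erf((b[i] - q_i) / (s * math.sqrt(2.0))))
        lower = 0.5 * (1.0 + math.erf((a[i] - q_i) / (s * math.sqrt(2.0))))
        p *= (upper - lower)
    return p

# toy usage: a 4-dimensional query and a cell covering [96, 160) on each axis
sigma = [[5.0] * 256 for _ in range(4)]
print(cell_probability([100, 120, 130, 90], [96] * 4, [160] * 4, sigma))
```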

Selectivity improvements

We can see, from the right side of figure 2.20, that the values of the offset are not negligible with respect to the values of the standard deviations. But, after analysis, it appeared that the offset is only a consequence of a saturation phenomenon: the hyper-ellipsoid that describes the area where a distorted version of a signature may be found is truncated by the border of the description space. A direct consequence is the offset of the average distortion. Therefore r̂ is not considered relevant and is not used during the evaluations.

Table 2.3 recalls the selectivity of the probabilistic search based on the ZN scheme and using a global isotropic model (the left column is the same as the ZN column in table 2.1). It is compared to the selectivity of the probabilistic search based on a ZN scheme but using the precise model of distortion (right column).

Table 2.3: Number of expensive queries (out of 10,000) with two distortion models

Distortion model                                                global isotropic    precise
Average number of signatures brought back                       36,430              18,719
Number of queries bringing back more than 20,000 signatures     2,759               1,910
Number of queries bringing back 200,000 signatures              545                 167

The joint use of the ZN indexing scheme and of the precise model of signature distortions is denoted by ZNA in the following. Figure 2.21 presents the distribution of the number of signatures returned for the two models of distortion, as in section 2.6.3. The next local models introduce a local correction of the probabilistic search according to the local density in the description space.

2.7.2 Modeling the local density

Motivation

The first stage of probabilistic retrieval (see section 2.5.2) is driven by the model of signature distortions. The cells returned cover a region larger than the range of distortions in the description space. In the second stage, the signatures whose distance to the query is above this range are filtered out. Since in the dense areas of the space the number of remaining signatures can be too high and make the subsequent vote-based decision step very expensive, only the k nearest neighbors of the query are retained.


Figure 2.21: Distributions of the number of distances computed for 10,000 queries with the ZN scheme using a global isotropic model of distortion (ZN) or a precise model (ZNA).

While this kNN selection may leave out many signatures that could be potential originals for the query, we found that a very good recall can nevertheless be achieved (see section 2.8). The likely explanation is that the signatures in the dense areas of the description space have a low contribution to the final decision.

A query in a dense area of the description space has a large amount of very close neighbors. It can then be very time consuming to filter out, among the signatures from the cells returned by probabilistic retrieval, those that are beyond the kNN of the query. Experiments with a 10,000 hours database show that such queries can be 100 times more expensive than the others and significantly slow down retrieval. Moreover, the ratio of dense areas naturally increases with the size of the database. A considerable speedup could then be achieved by doing the filtering as early as possible during the retrieval process. The solution we suggest is to estimate the local density in the neighborhood of a query and, if this density is high, to raise the cell rejection threshold Tα used for the hierarchical probabilistic retrieval (section 2.5.2) so as to return fewer cells.

Relevance of queries in high density areas

First, we want to know whether the results of these queries have an informative content, i.e. whether they take part in the final vote-based decision. Thus we traced the queries, from retrieval to the final vote process, and noted their participation in a true positive detection together with the number of signatures contained by the returned cells. We found that in the high density areas, few queries were taking part in the final vote. This is easily explained: the kNN retrieval is saturated, since the radius between a copy and its original remains the same. The space is being filled and the good signatures that should have taken part in the vote process are relegated beyond the kNN radius limit. The kNN radius (the distance to the k-th NN) becomes smaller as the local density increases.


Figure 2.22 shows an example of such a case. The light blue disk represents the ε-range, while the dark blue disk represents the kNN range. In a low density area (left side), the signature (green) that is a distorted version of the original (brown) is among the kNN. In a high density area this signature is not in the kNN range anymore, but many more signatures are scanned in order to find the k nearest. These computations are useless.

Figure 2.22: kNN in an area of low density (left) and in an area of high density (right), for k = 4.

If the good signatures are too far, relative to the kNN radius, we should not spend time on these expensive queries. In order to apply this idea, we need to limit the number of cells returned during search, before any kNN filtering of the results. An evaluation of the local density can anticipate this problem and adapt the search process. Note that to increase the retrieval quality we could increase k, the number of neighbors returned. During search, the computation time would not be significantly increased. Indeed, all the signatures in the remaining cells are scanned whatever the value of k. But it would have a strong impact on the computation time for the vote-based decision, which is quadratic in k. Also, we consider detection quality to be good enough. However, if the quality were to decrease as a consequence of other features of the complete system, this possibility of increasing k is worth mentioning. This is not in contradiction with the use of a local model of density, which aims to increase the selectivity of the probabilistic search before the kNN filtering is performed. By increasing Tα, the cells of lower probability are pruned. We could simply have chosen to neglect the queries located in these areas, but we found this solution too radical. For very similar copies, originals and copies are very close and so the kNN retrieval returns good results. Moreover, the larger the database is, the larger such areas are, so the system must be able to deal with them. The first step is to estimate the local density in the neighborhood of a query in order to apply the correction.

Estimation of the local density

In principle, the local density could be estimated by counting the number of signatures found within a small range around the query signature s. But this is too expensive to be performed during retrieval. Local densities should rather be estimated prior to any retrieval operation, for every point in the description space, and stored with the Z-grid.


However, the time needed to perform this estimation and the space required to store the resulting local density values are prohibitive. Using again an assumption of independence between components, a less expensive model of the density is devised. It consists in (i) finding, prior to any query, the unidimensional densities ρi (i ∈ {1, . . . , d}) of the projections of all the signatures in the database on each of the axes and (ii) estimating, when processing a query s, the local density in s as:

ρ̂(s) = ∏_{i=1..d} ρi(s)    (2.18)

The time needed to obtain the unidimensional densities is proportional to the size of the signature database and only 256 × d floats are required to store these densities. The result of (2.18) was found to be a good estimate of the local density ρ(s) for points s where the unidimensional densities are comparable, but underestimates ρ(s) where some unidimensional densities are much lower than the others.
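A small sketch of this density model, assuming the database signatures are byte-valued vectors stored row-wise in a numpy array; the normalization by the database size and the names are our own choices.

```python
import numpy as np

def build_unidimensional_densities(signatures):
    """For each of the d components, estimate the density of the projections
    of all database signatures on that axis (256 possible values per axis)."""
    n, d = signatures.shape
    densities = np.zeros((d, 256))
    for i in range(d):
        counts = np.bincount(signatures[:, i], minlength=256)
        densities[i] = counts / float(n)      # relative frequency per coordinate
    return densities

def estimate_local_density(query, densities):
    """Equation (2.18): product of the unidimensional densities at the query."""
    return float(np.prod([densities[i][q] for i, q in enumerate(query)]))

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(100000, 20), dtype=np.uint8)
rho = build_unidimensional_densities(db)
print(estimate_local_density(db[0], rho))
```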

Definition of dense areas and modification of the cell selection threshold

The cost of the queries falling in dense areas is reduced by raising the cell rejection threshold Tα so that hierarchical probabilistic retrieval returns fewer cells. More precisely, the modification of Tα should guarantee that the ratio of queries for which the selected cells contain more than k signatures is lower than a bound η. This condition is employed to define density thresholds that tell whether a query is in a dense area or not. Since there are significant differences between the unidimensional densities (see e.g. figure 2.13), a specific threshold ρ0i is employed for every dimension i. Cost considerations govern the choice of η. The conservative approach taken here returns at least one cell for every query (the cell to which the query belongs), so very low values for η may be unreachable. The ρ0i thresholds are obtained by an iterative procedure using a random sample of 10,000 queries.

Consider a query s that falls in a dense area, i.e. ∃i ∈ {1, . . . , d}, ρi(s) > ρ0i, and let I(s) = {i ∈ {1, . . . , d} | ρi(s) > ρ0i}. For probabilistic retrieval, the cell rejection threshold Tα is then multiplied by:

c(s) = sqrt( ∏_{i∈I(s)} ρi(s) / ρ0i )    (2.19)

This form for c(s) corrects the inaccuracy noticed for the estimate in (2.18) for those points s where some unidimensional densities are much lower than the others. Moreover, it makes the iterative process converge. Note that for convergence, we had to perform the similarity-based retrieval with k = 300.

The iterative procedure to find the density thresholds starts with ρ0i = 1.3 × max_s{ρi(s)} for every i. It evaluates the ratio of the 10,000 queries for which the selected cells (with Tα modified according to (2.19)) contain more than k signatures. If this ratio is above η, all ρ0i are divided by c = 1.02. The procedure is run again until a ratio equal to η is reached. At the end of the process, we obtain a reference value on each dimension, defining the limit between high density and normal density.


The initial value of ρ0i aims to penalize the components having a peaked distribution. Indeed, the iterative process is a geometric progression, so the decrease is faster for components having a high maximal value. The value of c is set so as to have a smooth decrease. The calibration must converge; if the decrease goes too far, the process is reversed with a lower c. With c = 1.02 this does not occur often. This calibration procedure can be repeated for several values of η, producing several sets of thresholds {ρ0i}η that require little storage; the desired value of η can be selected at query time and the corresponding precomputed set of thresholds {ρ0i}η directly employed. The value of η is an indication of the correction strength. If η = 1 then no correction is performed on search; if η = 0.5 then the initial threshold Tα is raised in order to have on average k/2 nearest neighbors where k were available previously for the same radius. A calibration is specific to a database: it depends on the size and on the local densities, but also on the depth of indexing. It may take quite a long time for very large databases, where the search process must be performed many times to set up the threshold on each component, so the database must be loaded several times.
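The calibration loop can be sketched as follows. Here expensive_query_ratio is only a placeholder for the costly evaluation that runs the 10,000 sample queries with the corrected Tα and returns the fraction whose selected cells contain more than k signatures; it is not an actual function of our system, and the reversal step mentioned above is omitted for brevity.

```python
import numpy as np

def calibrate_density_thresholds(densities, expensive_query_ratio, eta, c=1.02):
    """Iterative calibration of the per-dimension density thresholds rho0.

    `densities` is the (d, 256) table of unidimensional densities;
    `expensive_query_ratio(rho0)` must run the sample queries with the
    threshold correction of equation (2.19) and return the ratio of queries
    whose selected cells contain more than k signatures."""
    rho0 = 1.3 * densities.max(axis=1)         # initial value, one per dimension
    while expensive_query_ratio(rho0) > eta:   # too many expensive queries
        rho0 = rho0 / c                        # lower all thresholds geometrically
    return rho0

# toy usage: a fake evaluation that becomes acceptable once thresholds shrink
rng = np.random.default_rng(0)
dens = rng.random((20, 256))
fake_ratio = lambda rho0: 1.0 if rho0.mean() > 0.5 else 0.3
print(calibrate_density_thresholds(dens, fake_ratio, eta=0.4))
```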

Using the local density estimates during search

Summing up, for each query signature s the following operations are performed (a small sketch is given below):

1. if I(s) is empty then leave the cell selection threshold Tα unchanged, else multiply Tα by the correction factor c(s) from (2.19), and

2. with the resulting Tα, carry out hierarchical probabilistic retrieval (section 2.5.2).

Of course, the model of density is totally independent from the model of distortion, so any combination of parameters can be used for each model.
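A minimal sketch of this per-query correction, reusing the density table and thresholds of the previous sketches (the names are ours):

```python
import math

def corrected_threshold(query, densities, rho0, t_alpha):
    """Steps 1-2 above: compute I(s) and, if it is not empty, multiply the
    cell rejection threshold T_alpha by c(s) from equation (2.19)."""
    ratios = [densities[i][q] / rho0[i]
              for i, q in enumerate(query)
              if densities[i][q] > rho0[i]]    # dimensions belonging to I(s)
    if not ratios:                             # not a dense area: no correction
        return t_alpha
    c_s = math.sqrt(math.prod(ratios))         # c(s) = sqrt(prod rho_i / rho0_i)
    return t_alpha * c_s

# toy usage; the corrected threshold is then passed to the hierarchical
# probabilistic retrieval of section 2.5.2 in place of the default T_alpha
dens = [[0.004] * 256 for _ in range(4)]
rho0 = [0.002, 0.002, 0.01, 0.01]
print(corrected_threshold([10, 20, 30, 40], dens, rho0, t_alpha=0.05))
```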

The last section of the first part of this thesis is devoted to the evaluation of our contributions to the scalability of the reliable monitoring of video streams on very large databases:

1. a Z-grid indexing scheme with a mechanism for balancing cell populations,

2. a local model of the distortions undergone by the signatures during the copy-creation processes,

3. a local model of the density of signatures in the description space.


2.8 Experimental evaluations

A comprehensive evaluation must concern both the scalability and the detection performance of the proposed system. We can define several evaluation steps. For both the online and the off-line component (see figure 2.5), we have to measure the time for signature extraction from MPEG files: existing software is used; it computes the local descriptors for 360 hours of videos in one day, which means that it processes in 1/15 of real-time. A maximum of N = 20 points of interest are detected using the improved Harris detector, then local signatures are extracted around these points.

For the off-line process we must first measure the construction time of the database and of the index. It is not critical, but it shows that the system can be set up and started quickly, and thus that periodical updates can be performed at reasonable cost. Then, for the online process, we have to measure the mean time required for similarity-based retrieval and for decision-making.

The quality of the results has to be evaluated to prove the effectiveness of the system. Precision and recall are critical for this task; they are defined as follows:

P = Ngd / Nd    (2.20)

R = Ngd / Ntd    (2.21)

where Ngd is the number of positive detections among the detections, Nd is the total number of detections, and Ntd is the total number of possible positive detections. The precision indicates the part of “noise” in the detections. When there is no false detection in the top Nd results, Ngd = Nd and P = 1. The recall indicates the part of what is retrieved in what it is possible to retrieve. If all positive detections are found, Ngd = Ntd and R = 1. Measuring the recall is not trivial in very large databases, since the contents of such databases can hardly be completely known.

All the experiments were performed on a single PC having a 3 GHz Xeon64 CPU with 2 Gb of RAM and running Linux. The mass storage device is 1.2 TB, SCSI, RAID0, composed of twelve disks at 7200 rpm.

2.8.1 Base and index construction

To build the reference signature database from signature files, the time needed increases linearly with the volume. For a 10,000 hours database, 100 min are needed. A 280,000 hours database requires 2,800 min (46 h 40 min). Concerning the indexing, our experimental results appear in table 2.4. Indexing mainly consists in sorting the signatures according to their key, so the complexity is O(n log n), where n is the number of signatures in the database.
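Schematically, and under the simplifying assumption of a plain bit-interleaved Z-order key (the population-balancing mechanism described earlier in this chapter is deliberately ignored here, and all names are ours), index construction reduces to a key computation followed by a sort:

```python
import numpy as np

def z_order_key(signature, depth):
    """Schematic Z-grid key: interleave the most significant bits of the
    components, one bit per component per level, down to `depth` bits in
    total (the population balancing of the actual system is not modeled)."""
    d = len(signature)
    key = 0
    for level in range(depth):
        component = signature[level % d]
        msb = (int(component) >> (7 - level // d)) & 1   # next most significant bit
        key = (key << 1) | msb
    return key

def build_index(signatures, depth):
    """Index construction reduces to computing one key per signature and
    sorting the signatures by key: O(n log n)."""
    keys = np.array([z_order_key(s, depth) for s in signatures], dtype=np.uint64)
    order = np.argsort(keys, kind="stable")
    return keys[order], signatures[order]

rng = np.random.default_rng(0)
sigs = rng.integers(0, 256, size=(1000, 20), dtype=np.uint8)
keys, sorted_sigs = build_index(sigs, depth=31)
```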


Table 2.4: Indexing time for several databases

Volume           Time           Indexing depth    Index Size    Database Size
10,000 hours     1 h 40 min     31                16 Gb         17 Gb*
60,000 hours     7 h 20 min     33                64 Gb         84 Gb*
120,000 hours    14 h 50 min    34                128 Gb        85 Gb*
280,000 hours    30 h 00 min    35                256 Gb        425 Gb*

The sizes of the databases marked (*) cannot be directly compared because the kind of signature metadata employed was not always the same (identical for 10,000, 60,000 and 280,000 hours, but different for 120,000 hours).

2.8.2 Detection quality on ground truth databases

To evaluate the quality of detection, precision-recall graphs are obtained on two different ground truths. Since these databases are relatively small, there is no dense area in the description space, so the density-based modification of probabilistic retrieval (section 2.7.2) is not used here. For both databases the partitioning depth for the Z-grid is h = 25 and k = 100 nearest neighbors are retained for every query.

The performance of the system is first compared to other existing systems using the public video copy detection benchmark of CIVR 2007 (http://www-rocq.inria.fr/imedia/civr-bench/). This benchmark provides a database of 80 hours and two sets of queries: ST1 queries that are integral copies of video programs in the database, and ST2 queries corresponding to short excerpts. The results of the best performing system of the CIVR 2007 competition (“Best CIVR”) and of the ZNA version of our system, regarding both detection quality and retrieval speed, are provided in Table 2.5. ZNA obtains better scores; speed cannot be compared since different computers were used. Figure 2.23 shows the precision-recall graphs obtained on ST1 and ST2 with the ZN and ZNA versions of our system.

Table 2.5: Comparison of the CIVR runs on the CIVR 2007 data

Task                 ST1                       ST2
System               Best CIVR    ZNA          Best CIVR    ZNA
Score                0.86         1.00         0.86         0.98
Search time (min)    44           19           35           5

All the copies are found for task ST1; recall and precision reach 1 together. For task ST2, a recall of 1 is not reached, because one good detection is missing. This missing detection corresponds to a flip transformation; we only stored the signatures of the original videos in the reference database, we did not add the signatures of the videos after flipping. The descriptor employed was not designed to be robust to this transformation. However, extracting the signatures from a flipped version of the query (or using a flipped version of the original videos) allows this copy to be detected. Figure 2.25 shows this missing detection at the bottom. In ST1, another flip was present in the detections, but it was found.


Figure 2.23: Precision versus recall for ST1 queries (left) and ST2 queries (right) of the CIVR 2007 ground truth

Fortunately, a part of the video had a vertical symmetry and so was found to match. Figure 2.24 shows examples of detections found for the ST1 task, while figure 2.25 shows some found for the ST2 task.

Figure 2.24: The first two pairs of keyframes illustrate copies detected for the ST1 task (full copies). The first copy is 6 minutes long, the second copy is 11 minutes 30 seconds long.

A second evaluation is performed on a ground truth database set up at INA, containing 30 hours of original broadcasts issued from the INA archive. The queries are all the keyframes issued from 1 hour of copies (short excerpts) obtained from a Web2.0 site for tests. Several high amplitude distortions can be found in the copies, essentially intensity changes, blur and compression artifacts. The ground truth consists in the links between the video copies and the corresponding sequences from the 30 hours of original broadcasts. These links were initially identified with an earlier version of this CBCD system and were then verified by visual inspection. Figure 2.26 shows a slight improvement in precision with ZNA on this ground truth. All the 5.2 × 10^4 signatures from the 1 hour of video copies are used to query our system.

To reliably evaluate the speed of retrieval, we need realistic, relatively large ground truth databases.


Figure 2.25: The first two pairs of keyframes illustrate copies detected for the ST2 task (copies of excerpts). The first copy is 24 seconds long, the second copy is 29 seconds long. The third pair is the only copy not detected (57 seconds long).

Since we did not find any such benchmark, we built our own databases, of 1,000 and 10,000 hours respectively, based on 10,000 hours of miscellaneous videos from the INA archive. The 1,000 hours database contains the 30 hours of original video broadcasts from which the copies are produced and 970 hours from the 10,000 hours (6 × 10^7 signatures). The 10,000 hours database contains the 1,000 hours and another 9,000 hours from the 10,000 hours (for a total of 5.6 × 10^8 signatures).

The results are presented for the ZNA version that employs the standard deviation given by equation 2.13 but not the offsets in equation 2.14. Taking the offset into account does not improve precision or recall on this ground truth but slightly increases retrieval time: the offset pushes the query toward the center of the description space and thus increases the number of cells returned. The Mahalanobis distance was also tried, instead of the L2 distance, in order to link the distance used for filtering out signatures to the model of distortion. We believed it could be better adapted to the volume described by our refined distortion model. It leads to a marginally better precision, but the computation overhead is high; these results are not reported here.

For very large databases, the use of the optimization in high-density areas (subsection 2.7.2) might adversely affect the quality of copy detection. Since we could not obtain very large ground truth databases, this impact was evaluated by using ZNAD on a 280,000 hours database consisting of a large and diverse set of TV broadcasts stored at INA, to which the CIVR 2007 ground truth database was added.


Figure 2.26: Precision versus recall on the INA ground truth database

Precision and recall are measured for the ST1 and ST2 queries of this ground truth. Given the source of the CIVR 2007 videos, it is unlikely (but not guaranteed) that a copy of any of them is present in the large INA database. The 30 hours INA ground truth mentioned above would have been less appropriate here because it is very likely that the larger INA database contains copies of videos from this ground truth.


Figure 2.27: Precision versus recall for ST1 queries (left) and ST2 queries (right) on the CIVR 2007 ground truth included in the 280,000 hour database

Since the x, y positions of the interest points in the keyframes were not available for the 280,000 hours database, geometrical consistency could not be checked during the matching-based decision for copy detection, which has a strong negative impact on precision. However, the main goal of this experiment is to evaluate the potentially negative impact that the optimization in high-density areas has on the quality of detection, i.e. to measure the difference between ZNA and ZNAD. As shown in figure 2.27, the precisions of ZNA and ZNAD are close for most recall levels, including the highest recall attained by ZNA. ZNAD was employed with k = 100 and η = 0.4.

Finally, we used our system on the TRECVID 2008 copy detection task. Two hundred hours of video composed the database and 2,000 video queries of various lengths were provided; we employed the ZNA version for the tests. Given the industrial implications of the technology, we were not allowed to post


the results obtained, so we cannot provide scores. However, we can show some shots of interesting detections in the following figures (2.28 to 2.34).

Figure 2.28: Copy with strong compression artifacts.

Figure 2.29: Copy with strong noise.

Figure 2.30: Copy with a large occlusion (50% of the frame).

While we expected to detect the occlusions, crops and shifts thanks to the use of local signatures, the first two cases (figures 2.28 and 2.29) are more surprising because the improved Harris corner detector is sensitive to strong noise and blur. Indeed, many of the original points are lost in these cases, but the vote-based decision still allows these copies to be detected. However, it is not certain that these detections would still be found for larger volumes (the video reference database contains only 200 hours). Additional tests should be performed to investigate this.


Figure 2.31: Copy with a large border inserted, leading to a large crop.

Figure 2.32: A picture-in-picture copy case with a slight shift and crop.

Figure 2.33: Various transformations combined, leading to a difficult case.

Figure 2.34: Another difficult case combining several different transformations.


2.8.3 Scalability to very large databases

To evaluate the contribution of our proposals to the speed-up and also see how the mean retrieval time increases with the size of the database, four larger databases are employed. The largest one is the 280,000 hours database introduced in the previous subsection. This database is redundant, i.e. it contains several modified excerpts of many broadcasts. The other three databases were obtained by a random selection of programs from the largest database and contain 10,000, 60,000 and 120,000 hours of video respectively.

To avoid the bias that could be introduced by using as queries the interest points from the copies belonging to the two specific ground truth databases, a different procedure was designed to produce the queries. First, a uniform random selection of 10,000 interest points describing original videos is performed. Then, we built two types of queries for testing the scalability: original signatures from the database, and signatures from the database distorted with Gaussian noise. The goal is to test the difference between the mean retrieval time required for existing signatures and the time required for non-existing signatures that follow the data distribution. We obtained the same result (the difference is not significant) in both cases, so just one is reported here. We did not perform any test with random uniformly distributed queries: first, because we designed the system to fit the distribution of the data in order to improve the performance; second, because the input data from the video streams is not uniformly distributed.

Table 2.6: Comparative results for the average query time (ms) on several database sizes

Method            HC       Z        ZN       ZNA      ZNAD
10,000 hours      17.10    4.52     2.69     1.81     0.32
60,000 hours      −        19.44    11.60    7.81     1.30
120,000 hours     −        37.10    18.40    11.84    2.35
280,000 hours     −        72.45    38.60    18.80    4.10

The comparison with the method in [JBF07], labeled HC (from Hilbert curve), could only be performed on the 10,000 hours database. The HC, Z and ZN systems employed the same parameters σ = 20 and rr = 6.7σ. HC was used with a partitioning depth of 29, considered optimal for this database size in [JBF07], while Z, ZN, ZNA and ZNAD were employed with a partitioning depth of 31, optimal for them. Table 2.6 shows the comparison of the average query time on several databases. ZNAD is more than 50 times faster than HC on the 10,000 hours database. On the larger databases, the modification of probabilistic retrieval in the very dense areas provides a very significant speed-up: ZNAD is 17 times faster than Z on the 280,000 hours database. Figure 2.36 shows that the mean retrieval time increases sub-linearly with the size of the database. Consequently, this is also the case for the computational resources required. The mean search time obtained with ZNAD allows the monitoring, in deferred real-time, of one TV channel against the 280,000 hours database with a single PC like the one used in these experiments.

Finally, we investigated the impact of k on the decision process. We performed an evaluation on the CIVR07 benchmark with k = 200, 300 and 500. The variations of precision and recall are not significant. The precision-recall curves for k = 100 and k = 500 are shown in figure 2.37. This supports our use of the filtering step with a low k (100 for the CIVR07 database) and of the retrieval optimization by modeling the local densities (section 2.7.2).


Figure 2.35: Impact of each of our contributions to the resulting speed-up

Figure 2.36: Dependence of retrieval time on the size of the database: with ZNA and ZNAD, this dependence also becomes sub-linear for the largest databases

Figure 2.37: Precision versus recall for ST1 queries (left) and ST2 queries (right) of the CIVR 2007 ground truth included in the 280,000 hour database for k = 100 and k = 500


2.9 Conclusion

While content-based copy detection methods for still images and video show significant robustness to image transformations and do not depend on the presence of watermarks in the original content to be protected, their computational complexity also poses a scalability challenge. To be able to monitor video streams against very large databases of original video content, we introduced here several innovations for the similarity-based probabilistic retrieval process, directly addressing the scalability challenge.

First, we put forward an index structure relying on a Z-grid, together with an adapted probabilistic search method. This spatial indexing scheme balances the population of the cells in order to lower the average time of a query. The selectivity of the probabilistic search is consequently highly enhanced. While this method is only applied here to specific, rather low-dimensional signatures, we conjecture that comparable results could be obtained on higher-dimensional descriptions. The index should be built only on the dimensions along which the signatures have a relatively uniform distribution and a distortion range that is significantly smaller than the size of the description space spanned.

Then, we proposed a general approach for producing refined models of the distortions undergone by the signatures during the copy creation process. Besides increasing query selectivity, with a positive impact on retrieval speed, such refined models can also improve the quality of copy detection. We showed that reliable detection of real video copies can be achieved with models obtained from automatically generated copies and does not require high volumes of manually verified data. The models can thus closely follow the requirements of the application, e.g. by taking into account specific ranges for different types of distortions.

We eventually suggested an effective method to speed up retrieval in high-density regions of the description space. Given the high cost and relative irrelevance of the queries falling in these areas, the probabilistic search in these areas is considerably narrowed. This allows for a significant reduction in retrieval time and helps keep it sub-linear in the size of the database, even for very large databases.

An evaluation of the system on two ground truth video databases shows that it provides reliable detection results. Very efficient retrieval is then demonstrated on much larger databases of up to 280,000 hours, pointing out that it is possible to handle realistic situations with limited resources. Finally, a simulation shows that the monitoring of one stream against a database of 280,000 hours can be performed with one computer having 2 Gb of RAM and a 7200 rpm disk.

Our investigation of content-based video copy detection for stream monitoring and the results we obtained serve as a basis for the second problem we addressed, which is video mining by copy detection. This second problem has a higher complexity and will require more fundamental changes that go beyond an adaptation of the stream monitoring solution to video mining.

The requirements of stream monitoring applications are evolving with the development of Web2.0 video sharing sites. Indeed, a user uploading a video is not supposed to wait for too long before knowing whether the video is accepted by the web site, so a fast answer can be required. Also, in many cases, users upload copies of videos coming from e.g. digital broadcasts a few minutes after the broadcast, so dynamic updates of the original signature database and of the associated index may be needed.


Of course, an adaptation of the system we proposed here can provide solutions to these problems; nevertheless, the issues of interactive answers and of dynamically updating the index should be further explored.


CHAPTER 3

Video Mining Using Content Based Copy Detection

3.1 Introduction

The aim of the second part of this thesis is to obtain a scalable method for Video Mining by content-based Copy Detection (VMCD). As seen in the general introduction, the number of existing copies on the different channels (television and Internet) is now tremendous. The detection of all these multiple occurrences can support many applications. For this purpose, a video database must be queried with each of its elements. If the system is based on visual descriptions of the content, as in the first part of this thesis, all the signatures are used as queries by similarity in the reference database. The problem is now one in which every element is a query as well as a candidate. This corresponds to a similarity self-join operation whose complexity, if no index is employed, is quadratic in the size of the database.

Copy detection applications and solutions have recently been extensively investigated. In the first part of this thesis we put forward a solution for video stream monitoring and cited other recent work. Video mining has also seen some important recent developments. We review some of the existing proposals in section 3.2. Some of them do not rely on copy detection; however, they still aim at finding videos that are similar in some respect. Most of them are very specific to a type of video or can only deal with rather small volumes of data. No robust, general and scalable solution based on copy detection has been proposed yet. Such a solution should be found before addressing more complex tasks like video mining based on similar parts of frames.

We focus here on the scalability problem for VMCD, considering generic video contents and a relatively large family of transformations of high amplitude, as in figure 3.1. Our goal is to identify all the occurrences of short (a few seconds) video excerpts from any video document in a large database. Figure 3.2 shows the links that should be found between four videos containing two redundant excerpts.
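To make the cost issue concrete, a naive similarity self-join compares every pair of signatures, hence the quadratic complexity mentioned above; the distance, the threshold and the names below are illustrative assumptions and do not correspond to the descriptors introduced later in this chapter.

```python
import numpy as np

def naive_similarity_self_join(signatures, threshold):
    """Return all pairs (i, j), i < j, whose L2 distance is below `threshold`.
    The double loop costs O(n^2) distance computations, which is exactly what
    the indexing scheme proposed in this chapter aims to avoid."""
    pairs = []
    n = len(signatures)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(signatures[i] - signatures[j]) < threshold:
                pairs.append((i, j))
    return pairs

rng = np.random.default_rng(0)
sigs = rng.random((500, 20)).astype(np.float32)
print(len(naive_similarity_self_join(sigs, threshold=1.0)))
```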


The mining process produces a graph having as nodes all the video sequences that occur more than once, with various transformations, in the database, while the edges are the links found between such occurrences. By further processing this graph, different types of structural components can be identified in the video database, which supports several applications.

Figure 3.1: Example of copies found on a video Web2.0 site

The identification, in a large video database, of all the video sequences that occur more than once (with various modifications) can make explicit an important part of the internal structure of the database, thus supporting content management, retrieval and preservation for both large institutional archives and video sharing Web sites. As specific examples, we can mention content segmentation, the extension of textual annotations from one video to another, the removal of lower-quality copies or advanced visual navigation in a database. Further details regarding a broad range of potential applications are provided in section 3.3.

Figure 3.1: Example of copies found on a video Web2.0 site The identification, in a large video database, of all the video sequences that occur more than once (with various modifications) can make explicit an important part of the internal structure of the database, thus supporting content management, retrieval and preservation for both large institutional archives and video sharing Web sites. As specific examples, we can mention content segmentation, the extension of textual annotations from one video to another, the removal of lower-quality copies or advanced visual navigation in a database. Further details are provided regarding a broad range of potential applications in section 3.3.

Figure 3.2: Four videos containing two redundant excerpts. The yellow one occurs in three videos; in video 3 it is truncated. The orange one occurs in all the videos, twice in video 4.

The textual annotations that can be found on a video sharing site or the annotations provided by INA archivists are usually not sufficient for detecting copies, but can allow a range of videos to be selected from a database in which it is relevant to search for copies. For visual browsing applications this is essential in order to obtain smaller databases that fit into main memory, allowing fast video mining that supports dynamic browsing.

The second part of this thesis is organized as follows. We first review in section 3.2 some recent approaches for video mining. We then describe a range of potential applications of VMCD that we consider relevant and realistic (section 3.3), together with the corresponding requirements (section 3.4).


consider relevant and realistic (section 3.3), together with the corresponding requirements (section 3.4). In section 3.5 an adaptation of the monitoring solution seen in the first part of the thesis is considered and shown to be inefficient for mining. Consequently, a new framework is proposed in section 3.6 for addressing the scalability issue for VMCD. This new framework includes two important contributions:

• A new keyframe descriptor called "Glocal" that is a compact representation for a set of local features (section 3.7). Its goals are to reduce the computations during the mining process, to avoid the vote-based decision step and to reduce the size of the database to be mined, while maintaining good robustness to a wide range of video transformations. VMCD then starts with a similarity self-join of the Glocal signatures from a video database (section 3.8.7).

• A new indexing scheme adapted to the Glocal descriptor and to the similarity self-join operation. It employs a redundant scheme based on sets of bit positions in the Glocal signatures (section 3.8). It aims to speed up the similarity self-join operation with as little loss of detection quality as possible.

The similarity self-join performed on the set of keyframes extracted from the videos in the database returns a set of pairs of similar keyframes that are further connected, using their metadata, in order to find all the "copy" links between video sequences (section 3.9). Finally, a graph whose nodes are video sequences and whose links correspond to "copy" relations is returned to the user (section 3.11).

The proposed solution allows mining a database of 10,000 hours of video in a realistic time (80 hours) with a single standard PC. An efficient parallel scheme can be devised, so significantly larger volumes of video can easily be processed. An online implementation allows mining small databases (100 hours or less) in tens of seconds in order to interactively return results supporting dynamic browsing. For these applications, the range of supported video transformations is quite large. In the next section, we review recent existing proposals for video mining.

3.2 Video mining state of the art

3.2.1 Approaches relying on content-based copy detection

As seen in the first part of this thesis (section 2.2.2), various methods have recently been used for the copy detection task: some employ a general description of a video [cSCZ03], [SOZ05], some are based on the information conveyed by the individual frames, using global or local features [JFB03], [BAG03], [LTBGBB06], [DLA+07], [YOZ08], others rely on temporal features [CLW+06], [CCC08] and some combine different approaches [HWL03], [WHN07b]. Various indexing solutions are used to reduce the cost of similarity-based operations, depending on the description scheme employed. Nevertheless, while the copy detection task has been heavily investigated, video mining has been less explored. The early approaches to video mining in [Sat02] and [YSS04] focus on news shows and develop CBCD solutions. While the studied databases are not very large (less than 100 hours) and the descriptors


employed are rather compact (limiting the range of transformations that are compatible with the detection of copies), they demonstrate the relevance of CBCD for video mining.

It is proposed in [WHN07b] to employ video mining in order to eliminate the near-duplicates from the results returned by a Web search engine. Inexpensive descriptors are used to separate the least similar videos, then local descriptors are used to refine near-duplicate detection. This cascade scheme aims to reduce the cost of mining, but even with this improvement the removal of duplicates remains computationally intensive. This approach attempts to match each video with every other one, so the complexity remains quadratic. Moreover, there are strong restrictions on the correspondences allowed between videos, e.g. the shots must occur in the same order on the timeline.

Two solutions for video mining are proposed in [CPIZ07], one using keyframe histograms with LSH indexing, the other using a visual vocabulary based on SIFT with minHash indexing. The first solution supports fast retrieval of near-duplicate keyframes. Unfortunately, the selectivity of the similarity-based search with LSH is very low: using it to find the near-duplicate shots in the TRECVID06 video database (165 hours) requires 90 minutes. Moreover, this solution is not robust to some colorimetric changes, crops and occlusions. The second solution is derived from [SZ03]; it is very robust to these transformations but it is also invariant to viewpoint changes, so it is finally used to find similar scenes rather than copies. Processing times are not provided, but the method is employed on only one video clip and one movie. Scalability remains a problem.

3.2.2 Other approaches

Several proposals for mining databases of news shows put forward methods that do not actually aim at the detection of transformed videos. In some cases the occurrences of the same scenes (shot by different cameras, possibly from very different viewpoints) are detected, either by using the discontinuities in the trajectories of interest points [STA07] or by employing flash patterns [TSS06]. The video sequence descriptors are very compact, so relatively large video databases can be processed, but the application of these solutions is limited to rather specific video content. Others focus on tracking news subjects by using both video keyframe similarity and automatic audio transcriptions [ZS05], [WHN07a], [WZN07]. The keyframe descriptions employed make it possible, to some extent, to find views of a same scene from different angles, but rather small databases are processed.

In [NG08] the authors address the issue of finding duplicates in TV video streams. A shot segmentation is first performed, then a DCT-based description is computed for each frame. Hash tables are used for storing these descriptions and the associated shot identifiers. Given the compact description, 24 hours of video can be compared to a database of 600 hours in seconds, with very good recall and precision. However, this method is designed for very-near-duplicate detection, specific to advertisements, credits and jingles: little variation is accepted in the content and the temporal definition of the shots must also be preserved.

The video mining stage in [SZ03] identifies in principle content links at a sub-frame level, e.g. same or similar objects in different scenes, with visual words based on SIFT descriptions. However, the size and number of signatures per frame, the cost of matching and tracking, together with the very large number of links this approach can generate, significantly reinforce the scalability challenge.


In [QFG06] the authors use several types of features to mine a video clip. Their goal is to find the similar objects along a video clip where scenes are repeated several times. They first use a shot detector and extract 4 frames from each shot. Then they build a visual vocabulary of the video clip, as in [SZ03]. Relations that include absolute spatial positions in the frame are defined for the words that are spatially close within the frame. A second version also includes temporal relations between successive frames. Then, they attempt to match pairs of frames using these relations. The results show that the approach is effective, but the construction of the relations requires about 0.4 seconds per frame (10 minutes for a clip of 4 min 30 sec) and the time needed for the matching process is not provided. The structure of the relations is quite complex, so no indexing scheme is employed.

In the next section we describe a range of potential applications of VMCD that we consider relevant and realistic. Then we define the precise goals of our work.

3.3 Potential applications

We investigate two contexts in which video mining by copy detection can have many applications of significant impact: a large institutional archive like INA and a video Web2.0 site.

3.3.1 INA context

At the INA, the video streams are recorded quite directly: the segmentation of the streams into shows is not precise. Advertisements appear at the beginning and at the end of a show, sometimes shows are divided into two parts by advertisements, etc. For large archives like the one of INA, the identification of content links between video sequences would allow better segmentation and the extension of textual metadata from one video sequence to others, while avoiding separate annotation of several versions of the same content. It further supports services such as visual navigation in the database, broadcast programming analysis or media impact evaluation, which often help deciding which videos should be annotated. For example, very short sequences that are near-identical copies usually correspond to opening or closing credits, broadcast design sequences or jingles; they allow to identify the channel, segment the broadcast and identify individual shows, thus providing annotations and supporting data management and retrieval. Longer frequent sequences that are also highly similar typically represent TV commercials or, with a different temporal pattern, brief flashes from news agencies; they can provide information for media impact evaluation. Longer, infrequent and more strongly transformed sequences correspond to reused excerpts from shows or movies; they provide information about the type of program, help the transfer of annotations from one program to another and also support broadcast programming analysis.

The volume to mine at INA is very large: as seen in the first part of the thesis, it will soon reach 500,000 hours. The mining of such a volume of video cannot be performed on a single computer; the mining process must be parallelized. For example, by using 50 computers, each one would have to mine an equivalent of 10,000 hours (though, as will be seen later, this does not correspond to partitioning the


complete database in segments of 10,000 hours). However, this volume is still large and the descriptions and index used for mining would not hold in main memory, so a solution using mass storage is necessary.

3.3.2 Web2.0 context

A very large share of the videos currently hosted by Web2.0 sites originates from professional video content that is modified or repurposed and eventually uploaded by many different users. In many cases, the number of different transformed versions of a same content that are stored on a site is high and, given the wide availability of easy-to-use video editing software, can be expected to keep growing. The frequent transformations include resizing, compression, colorimetric changes, insertion of logos or various artifacts, modification of the time line and mixing. The accumulation of very many transformed versions of the same content has a strong negative impact on the quality of the results presented to the users and on the management of the available content. Typically, several consecutive results to a query are versions of a same original video, forcing the user to skim through several pages of results in order to find the next truly different video. Also, a very large share of the videos that a Web2.0 site stores is actually useless (and even detrimental) because it consists of such duplicates. Figure 3.3 illustrates the videos returned for the query "Madonna" on a video Web site. Even when the illustrating thumbnail is different, by watching the videos one can notice that several of them are copies, even among the first 16 results.

These examples bring out the need for identifying the content links between the stored videos, and for using them to unclutter and structure the content of the video databases, with the ultimate aim of improving user experience with Web2.0 sites. The textual annotations currently available for the videos can very seldom bring to light the links between the many versions of a same original video. The tool of choice to find such links is content-based video copy detection (CBCD), where the relation of "copy" holds between two videos if they are transformed variants of the same original video. We consider that CBCD can be used for this purpose in two complementary ways.

First, an off-line process can be periodically launched on large but specific parts of the database in order to identify near-duplicates and videos that share a common sub-sequence. A good reference is the volume of videos returned by a broad query; as an example, on a popular Web2.0 site the query "Madonna" returns about 116,000 answers (corresponding to a little less than 10,000 hours of video). The results of this process can be employed to remove the lower quality versions of the videos, thus reducing storage requirements and improving the diversity of the answers to future user queries. They also make it possible to establish navigation links based on common sub-sequences between videos that do not share keywords, or to improve the quality of the keywords associated to a video by exploiting the keywords of its different versions. Given the volume of videos returned by a broad query, the solution previously considered for mining the INA archives is also suitable for this task.

Second, when the system receives a keyword-based query, it can employ a query-dependent online process to structure the preliminary results before returning them to the user. This process can remove recently uploaded duplicates and group the results in a query-specific way, or modify the default ranking by exploiting the content links. Such an online process should be able to answer almost immediately


Figure 3.3: The first 16 videos returned by the query “Madonna” on a video Web site. Five videos are linked by seven relations of copy.

when given about 1,000 videos (Web sites usually return the top 1,000 answers even if more relevant videos are available) of a few minutes each (most videos are less than 5 minutes long), i.e. about 80 hours of video (less than 100 hours). Here scalability is not the priority; we first need fast mining in order to provide an interactive answer to a query (a user-acceptable time, i.e. close to ten seconds). Figure 3.4 shows a possible system using both the off-line (left part) and the online (right part) processes applied to a video Web site. Note that it could also be used in this way at INA.

Several problems have to be solved in order to achieve such fast online mining and scalable off-line processing. The diversity and the amplitude of the transformations that can be encountered (regarding both the individual frames and the temporal arrangement of video sub-sequences) rule out simple descriptions of the videos. But robust descriptions typically require expensive matching operations that can make CBCD-based mining hopelessly slow.


Figure 3.4: Proposed workflow for VMCD in a video Web2.0 site context

In the next section we present our general goals regarding video mining by CBCD; they essentially concern the scalability issue.

3.4 General goals

The goal concerning scalability is twofold. First, we want to be able to deal with large video databases of at least 10,000 hours with a single PC and make the mining of very large sets (100,000 hours and more) possible. Of course, this can only be performed off-line. Second, we want to be able to mine smaller volumes (about 100 hours) very fast for online applications. Some extreme copy cases, as we have seen in the first part of this thesis at the end of section 2.8.2 (figures 2.28 to 2.34), are not very realistic and rarely seen; we do not focus on dealing with such difficult cases.

The above-mentioned applications of VMCD have different requirements in terms of length of the detected sequences, of nature and amplitude of the transformations applied and of the particular categories of video programs that should be analyzed. Rather than developing entirely specific mining frameworks for specific applications and/or types of content (which are not always easy to define), it can be useful to have a single mining framework that is able to deal with a very large volume of general video content and find links between substantially transformed sequences. The detection capabilities of the mining process should ideally be the same as those of our stream monitoring system presented in the first part of this thesis: find all occurrences of short excerpts that may have been transformed by a combination of the basic visual (intensity and geometric) and temporal changes (see section 2.2.1). Clearly, the scalability challenge is stronger when considering a large volume of general (non specific) video content and a wide range of transformations with potentially high amplitude (see figure 3.1). Supporting a volume of 280,000 hours is not realistic with a single PC solution, therefore a full parallel


solution is necessary: the video database must be divided into P parts, and each part must be independently handled by one PC. A single PC should be able to deal with a 10,000 hours video database. It must also be able to mine a database of less than 100 hours in a human-acceptable time, that is, tens of seconds or so. Finally, a dynamic structure is a plus: inserting or deleting a video should not incur a heavy overhead and should not require rebuilding the system (database and index). Indeed, the number of video streams is growing very fast, and devising a dynamic solution rather than a static one is clearly a step forward.

The result of mining is a large graph having as nodes all the video sequences that occur more than once (with various transformations) in the database, while the edges are the links found between such occurrences. False detections can significantly overload the graph and connect many otherwise disconnected components; good precision (low rate of false detections) is thus even more important for mining than for monitoring. Some information required by specific applications can be obtained by processing the graph; indeed, the edges and nodes carry information about the videos and about the correspondences between them (see section 3.11).

3.5 Difficulties in adapting a stream monitoring solution to video mining

3.5.1 General considerations on performance

Obviously, a sequential search on a video database using each video as a query cannot be considered. A natural solution would be to consider a general and scalable method for content-based copy detection and apply it to the mining problem. CBCD methods usually have an off-line indexing component and an online detection component, as we have seen in the first part of this thesis. The indexing component computes the signatures for all the keyframes in the database of original video content, then stores these signatures in a reference database and creates an index supporting fast similarity-based retrieval from this reference database. The detection component extracts keyframes from a video stream and computes their signatures, retrieves similar signatures from the reference database and decides whether the similarity between sequences of keyframes is sufficient for the input sequence to be considered a copy of an original sequence. To directly apply a CBCD method to mining, the reference database and the corresponding index must be created first using the signatures of every keyframe. Then all the signatures of the reference database are used to query the database in order to identify the copies.

The method we described in the first part of this thesis can monitor in deferred real time, with one PC, one TV channel (video stream) against a database of 280,000 hours of video (860 × 10^6 keyframes, 16.4 × 10^9 local signatures). It copes with a large set of transformations and detects the transformed sequences of a few seconds with good recall and precision. However, the direct application of this method to mining would be unacceptably slow, since we can estimate that it would take more than 3 years to mine the 280,000 hours database with 1 PC (800 days for the similarity-based search and 390 for the decision process). Parallelism might be a solution.


Considering that any signature may be compared to any other from the database, every one must be used to query the indexed database. The first solution is then to copy the database of 280,000 hours on 28 computers, in order to mine 10,000 hours of queries on each computer. This is the best solution because the similarity-based search has a cost that is sub-linear in the size of the database and the decision process has a cost that is linear in the number of queries (time length). The other choice is to create sub-databases of 10,000 hours, then to query the 280,000 hours of video on each one. The good solution would require 44 days of computation for each PC (30 days for the similarity-based search using ZNAD and 14 days for the decision process with 100-NN, see the first part of the thesis). We can say this is acceptable, but still too much. The bad solution would require 462 days for each PC (72 days for the similarity-based search and 390 days for the decision process).

Second, the deferred time is acceptable for the off-line process, but not for the online process. However, we could make a main memory adaptation for databases of 100 hours and less. But an estimation of the time required to mine a database with the stream monitoring method shows that it is not appropriate. Let us consider the entire process, from the construction of a database of 100 hours up to the final result (we suppose however that the keyframes of a video and their signatures were already extracted and stored on mass storage). We need about 20 seconds to build the database and the index. Then, 0.05 ms are required to process a query; since the database contains 20 × 3,000 × 100 = 6,000,000 signatures, the mining would take 300 seconds. The global processing time is already 320 seconds, more than 5 minutes, too long a waiting time for any user. The final decision process requires 1/30 of real time, so we would need 3 hours. Clearly, this solution cannot be further considered. We see here that the vote process becomes a heavy overhead and a real issue: it is fast enough to monitor a video stream, but too expensive for a video mining process. The next subsection points out precisely the features of the stream monitoring scheme that are obstacles to an adaptation to video mining.

3.5.2 Focus on the bottlenecks

A major obstacle for mining comes from the use of local features. Our stream monitoring system uses M = 20 local features, which is not that many, yet mining is already too slow: first because the similarity self-join is processed too slowly, then because an expensive decision process is necessary. Some VMCD proposals [SZ03], [QFG06], [CPIZ07] use more than 100 local features (even 1000) on each frame (see section 3.2), which has a significant negative impact on scalability. But the local features are needed for the robustness to some frequent distortions (cropping, shifting, scaling, inlays, occlusions, etc.). Therefore, we should reduce the number of features by gathering the local features of a frame into a single feature that offers a higher level of description, combining the robustness of the local features and the compactness of a global description.

The distance computations have the highest cost during the similarity-based search, therefore we have to use a compact structure for the description of a keyframe and a simple distance between the signatures of two keyframes. The description scheme should not be a simple aggregation of the local


visual descriptors, but rather a novel structure embedding them. A new description scheme must be designed, as well as an associated indexing solution that is adapted to the mining task.

The decision process using local features is slow because it needs many steps: (i) it gathers, among the kNN of each query signature, those having the same video identifier, (ii) it assembles those with the same time code, (iii) it checks some geometric consistency constraints between the remaining candidates and (iv) it checks some temporal constraints with the detections found on the previous keyframes. The complexity of each of these steps is quadratic in the number of local features, so using a single feature for a keyframe drastically reduces the computation time. Furthermore, if we use a global description for a keyframe we avoid step (ii) and possibly also step (iii). These two steps, together with step (i), also have a quadratic complexity in k and are therefore very expensive.

It is also important to note that kNN retrieval for the local features may be inappropriate. In some cases (and especially in a Web2.0 context), the system has to process databases that may contain very many copies of some videos, so the value of k would quickly become too small to find all these occurrences. If higher values were used for k, then the decision process (already too slow) would become unusable. So kNN retrieval for the local features is not appropriate for a mining task. This encourages the use of global features instead of local ones.

We have outlined the goals and requirements of the applications we are aiming at. An adaptation of the stream monitoring proposal is not satisfying in terms of performance (section 3.5), and we have pointed out why above. We have already mentioned existing solutions for video mining (see section 3.2) that cannot fulfil our requirements either. We now present a new video mining framework including two major contributions: a compact keyframe description method and an indexing scheme for mining with such descriptions.

3.6 A new video mining framework by CBCD

3.6.1 Our contributions

To meet the scalability requirements for video mining, we

• define and employ compact frame-level signatures (called Glocal) instead of sets of interest point signatures, and

• build a redundant index to support the similarity self-join.

The Glocal signatures make it possible to compute similarity directly at the level of frames and avoid the large volume of intermediate results (between similarity self-join and the vote-based decision) often associated with the direct use of interest point signatures. They provide a better compromise between detection time and quality and, being compact, make a redundant index affordable. The frame-level signatures and the redundant index keep the similarity computations local to small sub-databases, minimizing exchanges between main memory and mass storage.


As in the first part of this thesis, an indexing solution is needed to make the approach scalable. It is defined in relation with the structure of the keyframe descriptor. Furthermore, this indexing solution must be easy to parallelize in order to deal with very large databases, so a redundant indexing scheme is required. In a classic CBCD process, the cells (or buckets) where a query may have neighbors are visited; for mining, the query is instead inserted into these cells. In this way, each cell is an independent sub-database Di and the similarity self-join is independently performed in each Di. Of course, the size of the entire database increases in comparison with a non-redundant one, but it carries a part of the similarity self-join in its structure and makes an efficient parallel implementation possible. To offset this size increase, the use of a compact frame descriptor is important.

3.6.2 Proposed framework

The problem we address can be described as a self-join operation on a database of video sequences, returning all the pairs of sequences that satisfy the selection criteria ("matching" sequences). The primary criterion is similarity, but other criteria can be added, depending on the data or application (e.g. exclude multiple occurrences that are close together in a short time interval, or only consider one type of broadcasts). When the additional criteria are more selective than similarity, they should be employed first to reduce the complexity of the problem. Rather than directly searching for similar sequences, the mining process suggested here first retrieves the pairs of similar keyframes and then employs them to find similar sequences. This solution brings more flexibility and can readily be used to mine databases that contain both images and videos.

To remain as general as possible, we only use in the following one additional criterion, which is not critical to the approach: avoid temporal proximity of the keyframes. So, if we have to mine a stream, it is cut into segments (their duration will depend on the application) and each segment receives a unique identifier. If we have to mine a set of videos, each video receives a unique identifier. Then, in order to satisfy the criterion, the correspondence between keyframes having the same identifier is simply forbidden during the similarity self-join (see section 3.9).

Figure 3.5 illustrates the proposed framework. The boxes with plain outlines represent data, the boxes with dotted outlines correspond to processes. A short summary of the necessary steps clarifies the process:

1. The compact Glocal signatures are computed from the local features of the keyframes, stored in a database and indexed. The Glocal description scheme is described in the next section (3.7). The redundant indexing scheme is presented in section 3.8.

2. The similarity self-join on the set of Glocal signatures (keyframes) is then processed (section 3.8.7); the result is a set of keyframe pairs. The similarity between the Glocal signatures of the keyframes in a pair is above a fixed threshold. This threshold is explained after the introduction of the Glocal signature, in section 3.7.


Figure 3.5: Proposed workflow for VMCD

3. The pairs of keyframes are then sorted according to the identifiers of the videos they belong to. Each video has a unique identifier, as seen previously. If we have Nv videos in the database, at most Nv × (Nv − 1)/2 pairs of identifiers exist.

4. For every pair of videos containing a pair of keyframes previously connected, the reconstruction of corresponding video sequences is attempted, using time consistency conditions (section 3.9). At the end of the process, all the "copy" links between video sequences are returned.

5. A graph is built, whose nodes are video sequences and whose links correspond to "copy" relations. A simpler graph is also returned, where every node is an entire video and every edge represents all the links between two videos (section 3.11).

6. Further post-processing can be performed on these graphs, based on the structure of the graph as well as on the attributes of the nodes and of the edges (section 3.11).

The different stages of the mining process are now presented, following this order.


3.7 The Glocal descriptor

In the stream monitoring process described in the first part of this thesis, a good robustness to the typical transformations between original videos and copies is obtained by the use of local descriptors (see 2.2.1). Each keyframe is described by the local spatio-temporal signatures of at most M interest points (20 in our case, sometimes slightly fewer) found by the improved Harris detector. The use of multiple local signatures for every keyframe and of a specific matching provides robustness to cropping, shifting, light scaling, occlusions or insertions. On the other hand, it is expensive in terms of storage and, above all, the decision process using the kNN of local signatures is computationally too expensive. Consequently, it is important to find a frame-level description scheme that keeps as much relevant information as possible and makes it possible to include part of the vote-based decision in a simple computation of the similarity between two keyframes, while significantly reducing the total computation and storage cost.

3.7.1 The Glocal descriptor scheme

The spatio-temporal signature of an interest point in keyframe t is composed of the normalized 5-dimensional vector of first and second-order partial derivatives of the grey-level brightness around this point, and of the same vectors for 3 other neighboring points in the frames t + δ, t − δ and t − 2δ. Such a signature belongs to the 20-dimensional description space [0, 255]^20. See the first part of this thesis (section 2.3.2) for more details. The distribution of the signatures covers this space well, so indexing is based on a hierarchical partitioning of the description space (not of the image plane) into hyper-rectangular cells. Every local signature is defined by a precise position and falls within one such cell. Only the coarse information given by the identifiers of the cells to which the local signatures belong is employed to define a frame-level signature.

The 20-dimensional description space is partitioned at a limited depth h (at every level, a new interval is partitioned in two), which produces 2^h cells. To every cell a position is assigned in a binary vector of fixed dimension 2^h. The position follows a numbering scheme; here we use the Z-grid order (section 2.5.1) because it is known and convenient, but any scheme could be used. The Glocal signature of a keyframe is such a binary vector, where the i-th bit is set to 1 if the local signature of at least one interest point from that keyframe falls within the cell i, or else left to 0. Therefore, a Glocal signature is a fixed-size embedding of a set of local signatures.

A simplified example (with interest point signatures in a 2-dimensional description space) is given in figure 3.6, for a keyframe containing 6 interest points. The squares represent the partitioning of the description space (not of the image plane) at depth 4. The identifiers of the cells are shown in the corners. The local signatures are the + marks within the squares. The resulting Glocal signature of 2^h = 16 bits is 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1. Because 2 local signatures are in the same cell, only 5 bits are set to 1, the ones at positions 3, 8, 12, 13 and 16.

In figure 3.6 the partitioning grid is regular. Actually, the partitioning described in the first part of the thesis in section 2.6 is used: the components are sorted according to the distance of their distribution to the



uniform distribution, and the boundaries are moved in order to balance the local signatures between the cells. This way, the distribution of the bits set to 1 in the Glocal signature is also more uniform.

Figure 3.6: Construction of a Glocal signature in a 2-dimensional description space

The information provided by a Glocal signature consists in the positions of the bits set to 1 (in the following, a bit position set to 1 is denoted B1) and is a coarse approximation of what a set of local signatures would bring. This description scheme would be inadequate if the local signatures of many interest points belonging to a same keyframe fell, on average, in a same cell: a very significant amount of information would then be lost. By studying how many bits are set to 1 in a Glocal signature, we obtained an average of 17.4 for a partitioning depth h = 8, close to the average number of local signatures in a keyframe (which is 19.4), when the number of local signatures is limited to M = 20. This shows that the distribution is convenient, so the loss of information is limited, and that the partitioning of the space at this depth is already appropriate. Partitioning at a higher depth cannot diminish this average. Figure 3.7 shows the ratio between the number of B1 in the Glocal signature and the number of local signatures at different depths (still limited to M = 20 per keyframe).


Figure 3.7: Ratio between the number of B1 in a Glocal signature and the number of local signatures in a keyframe (at most M = 20 per keyframe)

The metadata associated to a keyframe consists of the unique identifier of the video it belongs to, Idx, 0 ≤ x ≤ Nv, where Nv is the number of videos in the database, and of the time code of the keyframe, Tcxy, 0 ≤ y ≤ tx, where tx is the number of keyframes of the video x. The keyframe is the basic element for the similarity self-join, so the time code check used to assert whether local descriptors come from a same keyframe is no longer required. Also, the positions of the visual signatures in the keyframe are no longer employed.


The Glocal signature is a list of the positions of the local descriptors of a keyframe, in a coarse approximation of the description space. A Glocal signature has a fixed size, so it is meaningful to compare the bits found at the same position in two Glocal signatures. If a bit position (cell address) is seen as a word, for a partitioning depth of h there are 2^h words. Glocal signatures can be compared to the bag-of-features approach (BoF, [SZ03], [NJT06]). In the BoF approach, a set of visual words must be built: first, the size of the vocabulary (number of words) must be selected (usually arbitrarily, from 1,000 to 10,000 words, depending on the volume and content); then, a clustering of the local signatures of the database is performed, for instance using k-means with k equal to the chosen number of words. For Glocal signatures, the number of words depends on the partitioning depth h and no clustering has to be performed: one cell corresponds to one word, given by its position on the Z-grid. The Glocal signatures are binary, very compact and sparse, while in [SZ03] or [NJT06] one thousand visual words are selected to describe a frame. The similarity between Glocal signatures can be computed very fast.
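To make the construction concrete, here is a minimal Python sketch (our own illustration, not the thesis implementation): it assumes a simplified regular partitioning with one split value per partitioned component and 0-based cell numbering, whereas the actual system uses the Z-grid numbering and the balanced boundaries of section 2.6.

from typing import List, Sequence

def cell_id(local_sig: Sequence[float], splits: Sequence[float]) -> int:
    # One bit per partitioned component: 1 if the component value is above the
    # split value, 0 otherwise; concatenating the h bits gives the cell number.
    cid = 0
    for value, split in zip(local_sig, splits):
        cid = (cid << 1) | (1 if value >= split else 0)
    return cid

def glocal_signature(local_sigs: List[Sequence[float]], splits: Sequence[float]) -> int:
    # Binary vector of 2**h bits stored as a Python integer: bit i is set to 1
    # if at least one local signature of the keyframe falls within cell i.
    sig = 0
    for local_sig in local_sigs:
        sig |= 1 << cell_id(local_sig, splits)
    return sig

# Toy usage with h = 4: only the first 4 components are partitioned, each at 128.
splits = [128.0, 128.0, 128.0, 128.0]
points = [[40, 200, 10, 90], [220, 15, 130, 250], [41, 202, 12, 91]]
print(bin(glocal_signature(points, splits)))   # the two near-identical points share a cell

Storing the signature as an integer makes the similarity computation of section 3.7.3 a matter of a few bitwise operations.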

3.7.2 Compactness of Glocal signatures

Considering a partitioning depth of 8, every keyframe is described by 2^8 bits (i.e. 32 bytes) with a Glocal signature, while the description in the stream monitoring system requires M local signatures of 20 bytes each, for a total of 20 × M bytes. Moreover, a local signature is associated with 4 other values: the positions x and y in the keyframe, the time code of the keyframe and the identifier of the video, for 12 more bytes. A Glocal signature only needs to be associated with a time code and an identifier, i.e. 8 bytes. Finally, in our case a keyframe requires (20 + 12) × M = 32 × M bytes with local signatures, and only 32 + 8 = 40 bytes with a Glocal signature. The more local signatures are used to build a Glocal signature, the larger the reduction in database size: R = 40 / (32 × M). For M = 20 the ratio is R = 40/640 = 1/16. This gain allows the use of a redundant indexing scheme (see section 3.8) that supports faster mining, without increasing too much the total storage requirements, depending on the redundancy of the index.

3.7.3 Similarity between Glocal signatures

The Glocal signature is a sparse binary vector. At depth 8, a maximum of 20 bits are set to 1 among the 256 available. Then, a natural choice to measure the similarity of Glocal signatures is the Dice coefficient (SDice):

SDice(g1, g2) = 2 |G1 ∩ G2| / (|G1| + |G2|)    (3.1)

where Gi is the set of B1 in the signature gi and |·| denotes set cardinality. Similarity is high between vectors having a match of their B1, i.e. the embedded local features correspond. The Dice coefficient is directly related to the Jaccard coefficient

SJaccard(g1, g2) = |G1 ∩ G2| / (|G1| + |G2| − |G1 ∩ G2|)

by

SJaccard = SDice / (2 − SDice)

Also, when the number of B1 is approximately the same for every signature, |Gi| ≈ L, the Dice coefficient is almost identical to the overlap coefficient,

Soverlap(g1, g2) = |G1 ∩ G2| / min{|G1|, |G2|} ≈ SDice(g1, g2)

and can also be related to the Hamming distance by

dHamming ≈ 2L(1 − SDice)

Our implementation of the Dice coefficient has a computation time of 10^−7 seconds for an indexing depth of 8. The time is divided by 2 if the depth is decreased by 1 and multiplied by 2 if it is increased by 1, since it depends on the length of the binary vector. An implementation based on the comparison of lists could be considered for larger values of the partitioning depth.

A database of 10,000 hours of video contains about Nk = 3 × 10^7 keyframes. Mining using an exhaustive approach needs Nk × (Nk − 1)/2 ≃ 4.5 × 10^14 similarity computations, i.e. 4.5 × 10^7 seconds or 12,500 hours (520 days), which is not realistic. Indexing is necessary in order to make mining scalable. Thus, in order to perform the similarity self-join in a reasonable time, a database of Glocal signatures is built together with an index. Indexing must be performed according to the structure of the Glocal descriptor and to the distortions undergone during the copy processes.
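As a minimal illustration of equation 3.1 (ours, not the thesis implementation), the Dice coefficient can be computed on Glocal signatures stored as integers with two population counts:

def dice(g1: int, g2: int) -> float:
    # Equation 3.1: |G1 ∩ G2| is the popcount of the bitwise AND,
    # |Gi| the popcount of gi.
    inter = bin(g1 & g2).count("1")
    total = bin(g1).count("1") + bin(g2).count("1")
    return 2.0 * inter / total if total else 0.0

# Example with two 16-bit signatures sharing 4 of their 5 B1.
g1 = 0b0010000100011001
g2 = 0b0010000100010011
print(dice(g1, g2))   # 2 * 4 / (5 + 5) = 0.8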

3.7.4 Effects of the copy processes

When a copy of a video is made, the local signatures may be distorted, depending on the intensity, geometric and temporal transformations applied. A local signature may be moved away from its original position in the description space. It may also disappear from the set, because the Harris corner detector no longer finds the same points of interest in the copy. In the same way, a new local signature may appear in the set. The resulting effect on the Glocal signature is the shift, appearance or disappearance of B1. The Dice coefficient measures these changes.


Table 3.1 shows frame captures from two clips derived from the same video, with the corresponding Dice coefficient between their Glocal signatures. The differences between the two versions are the consequences of a crop of the top and bottom of the picture in the first one, along with the insertion of a text at bottom right and a stronger compression in the second one.

Table 3.1: Frame captures from two video clips of Madonna from a video Web site and their SDice, for M = 20 and h = 8 (the pairs shown have SDice = 0.33, 0.46, 0.53, 0.67 and 0.81)

As seen in the previous section, indexing is necessary. The index must be built by using the B1, since they define the visual vocabulary of a keyframe: the keyframes sharing a visual vocabulary should be grouped together. Only part of the words associated to an original keyframe are also associated to its copy; the index must take these changes into account.


3.8 Sentence-based indexing

The first stage of the video mining process, which consists in identifying the links between individual keyframes, is the most expensive. If D is the database of Glocal signatures and θ is the similarity threshold above which one keyframe is considered a transformed version of another, then the following set should be found:

Kθ = {(gi, gj) | gi, gj ∈ D, SDice(gi, gj) > θ}    (3.2)

This corresponds to a similarity self-join on the database of Glocal keyframe signatures and its time complexity would be O(N²) if the similarity were computed for every pair of signatures (N is the size of the database). The efficient computation of similarity joins is addressed in the information retrieval and data management literature, e.g. [SK04], [AGK06], [BMS07]. Most of the existing methods are based on inverted lists and are used for text vectors. Our signatures contain a closed visual vocabulary (2^h words), so we should carefully consider these approaches in order to define the indexing scheme.

To find an appropriate solution to speed up the first stage of the mining process, our prerequisites and the main characteristics of the data should be made explicit. The main requirement concerns the level of scalability: we intend to mine databases of more than 10,000 hours of video (≥ 30 × 10^6 Glocal signatures) much faster than by the direct application of existing CBCD methods. To further reduce the time required for larger databases, the method should fully support a parallel implementation by requiring a limited volume of data exchanges. As seen in subsection 3.7, the Glocal signatures are compact (2^h bits per frame for a partitioning depth of h) and sparse.

The method proposed in [BMS07] relies on inverted lists and an active similarity self-join: it uses the results found so far to speed up the subsequent operations. Text signatures and social network signatures typically employed in information retrieval have millions of dimensions, and the signature of a document has tens or hundreds of non-zero components. Glocal signatures are neither as large nor as sparse as these signatures. Furthermore, while for text signatures the number of non-zero components is usually highly variable, for Glocal signatures it is about the same. Also, the Glocal signatures cover the description space well, even if the probability of being set is not the same for all the bits. In [BMS07], as in [AGK06], the authors propose an exact method to perform the similarity self-join. The computation requirements are very low compared to an exhaustive method. However, we think that an exact similarity self-join would be overkill, so we consider an approximate similarity self-join. If a pair of similar keyframes is missed, a previous or following matching pair between two sequences can still make the detection possible. Of course, we trade time against quality, as in the first part of this thesis, but we believe that it is essential. The characteristics of the Glocal signatures and of their distribution, together with the requirements mentioned above, do not allow us to expect a significant speed-up from the direct application of the proposals in [BMS07] or [AGK06].


In [SK04], an exact method is suggested that also employs inverted lists. Even if it is not the core of the paper, an interesting idea of n-grams is introduced to build some of the lists: textual words are combined for indexing. This idea is used in many domains, such as genomics [KWLL05]. The proposals mentioned above rely on a final voting process depending on the number of inverted lists where a pair of signatures is found. For that, only an identifier of a signature is stored in the inverted list, as in LSH; a vote must then be performed on the returned signatures to determine which are relevant. We suggest keeping the signatures in the index and computing the similarity directly. Some similarity computations must be performed, but a vote is avoided. Moreover, the selectivity is higher, which saves a large part of the cost of the next stage (time consistency) that also controls precision.

In a classic similarity-based search using an indexing scheme, the processing of a query involves the exploration of several buckets in order to find the similar elements: the elements have first to be inserted in a database, then an index of the database is built. The proposed redundant indexing reverses the problem by placing a signature into the buckets that would have been visited during the similarity-based search. Using a redundant index relies on the idea of bringing together the elements that have a similarity above a threshold, thus making local similarity self-joins possible.

3.8.1 Principle of the indexing scheme

The indexing scheme put forward here is inspired by both inverted lists [WBM94], [MZ96] and hashing [IM98]. In a typical indexing scheme, the Glocal signatures would be inserted in an index; then, each Glocal signature would be used as a query and a set of buckets would be visited to find potentially similar signatures. If the database can be stored in main memory this solution is affordable. Otherwise the process must be batched, as in the first part of the thesis, and the database loaded several times; consequently the similarity self-join can become prohibitively expensive. Moreover, the process cannot be easily parallelized: each computer would have to deal with the entire database and part of the queries, or with a part of the database and all the queries. Here we focus on being able to deal with very large databases that cannot be stored in main memory and on parallelizing the process in an optimal way. The proposed redundant indexing requires more mass storage space but makes parallelism easy, as will be seen in section 3.10. We aim to accelerate the discovery of Kθ defined by (3.2) by avoiding the computation of the similarity for every pair of signatures, while keeping storage requirements rather low. To make this stage affordable, we require that:

1. the full similarity between two signatures is only computed if it is above a similarity threshold, and

2. for every maximal group of signatures that satisfies this constraint, this computation is performed without intermediate disk accesses.


Item (1) says that it should be possible to divide the database into segments (that may overlap) such that, in each segment, the similarity between any two signatures is above a threshold. The similarity self-join can thus be independently performed on the segments of the database. Item (2) further requires that every such part be small enough to hold in main memory and be stored in a compact way on disk. Given the sizes of the target databases, an approximate solution is considered acceptable if it significantly accelerates the mining process. Such a solution is also allowed because the transformations undergone by the video sequences during the creation of copies can only be described by probabilistic models, so that exact retrieval is not a requirement.

A segment is defined by a specific set of B1 (actually n-grams, called sentences below) in the representation of Glocal signatures. The segment ("bucket" in the following) consists of all the Glocal signatures in the database that contain this sentence; it can be stored as an inverted list. In a standard inverted list (in text indexing for example), one word defines a bucket and the texts containing this word at least once are inserted in that bucket. In the same way, in the bag-of-features approach visual words are independently used for indexing and retrieval. Here, we prefer to associate several words of our visual vocabulary to build the index, as suggested in [SK04]. The association offers a larger diversity, which implies a smaller and more discriminative index. In [SK04] the problem is different, since the universe of the possible words of the database is not closed. With Glocal signatures the set of possible visual words is closed and can be small, defined by the depth h. Consequently, the probabilities of collision are higher.

To sum up how indexing is performed:

• the index relies on buckets; a bucket is defined by a set of B1,

• for each Glocal signature, a set of buckets is selected among the possible ones,

• each Glocal signature is then inserted in all these buckets.

It follows that a Glocal signature is inserted several times in the index. The index supports a part of the similarity self-join in its structure: the Glocal signatures that should be compared are grouped in the buckets. The similarity self-join is then performed in each bucket independently from the other ones. The results of the local similarity self-joins are merged after the whole index has been processed. The index is loaded into memory only once.

To complete the description of the indexing scheme we must specify:

• at what depth the description space should be partitioned in order to define the Glocal signatures,

• how many bits the sentences should have,

• how many different sentences should be used to index a Glocal signature,

• how these sentences should be selected.

In the next section, we investigate the different parameters on which the expected speed-up depends.
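A minimal sketch of this bucket-based self-join (our own illustration in Python, not the thesis implementation) is given below; signatures are stored as integers and, for simplicity, the bucket keys are only the pairs of neighboring B1, one of the sentence selection rules presented in section 3.8.3.

from collections import defaultdict
from itertools import combinations

def bits_set(g: int):
    # Positions of the B1 of a Glocal signature stored as an integer.
    return [i for i in range(g.bit_length()) if (g >> i) & 1]

def sentence_keys(g: int):
    # Bucket keys: here, simply the pairs of neighboring B1 (length-2 sentences).
    b = bits_set(g)
    return list(zip(b, b[1:]))

def dice(g1: int, g2: int) -> float:
    inter = bin(g1 & g2).count("1")
    total = bin(g1).count("1") + bin(g2).count("1")
    return 2.0 * inter / total if total else 0.0

def similarity_self_join(signatures, theta):
    # signatures: list of (video_id, time_code, glocal) tuples.
    buckets = defaultdict(list)
    for entry in signatures:                       # redundant insertion
        for key in sentence_keys(entry[2]):
            buckets[key].append(entry)
    pairs = set()
    for bucket in buckets.values():                # local self-join in each bucket
        for (v1, t1, g1), (v2, t2, g2) in combinations(bucket, 2):
            if v1 != v2 and dice(g1, g2) > theta:  # same-video matches are forbidden
                pairs.add(((v1, t1), (v2, t2)))
    return pairs

# Toy database: keyframes 0 and 1 are near-duplicates from different videos.
db = [(0, 12.0, 0b0010000100011001), (1, 3.5, 0b0010000100010011), (2, 7.2, 0b1100110000000000)]
print(similarity_self_join(db, theta=0.5))

Since every bucket is processed independently, the buckets can be distributed over several machines, which is what makes the parallel implementation discussed in section 3.10 straightforward.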


3.8.2 Speed-up estimation of the similarity self-join

Let N be the total number of Glocal signatures in the database. If the similarity is evaluated for every pair of signatures, the total number of similarity computations is N(N − 1)/2 ≈ N²/2. A comparison with the index-based solution shows the impact of the partitioning depth h and of the length l of the sentences. We consider the index as an inverted list. The number of B1 is about the same for every Glocal signature, in our case ≥ 17 for h ≥ 8, and is upper bounded by the maximum number of interest points per keyframe (M = 20 here); it does not increase very much with h for h ≥ 8 and can then be considered as static; it is denoted by L. It follows that the length of every signature is 2^h, the total number of different buckets (possible sentences) is C(2^h, l) and every signature is present in C(L, l) buckets, where C(n, k) denotes the binomial coefficient. Table 3.2 gives an idea of the number of existing buckets as a function of h and l. Table 3.3 shows the

number of buckets in which a Glocal signature may lie, depending on the number of local features used in the keyframe. For both tables, if the resulting numbers are too high they are not shown (. . .).

Table 3.2: Approximate number of buckets (×10³) as a function of the indexing depth and of the length of the sentences

Sentence length (l) | h = 6  | h = 7   | h = 8   | h = 9     | h = 10
2                   | 2      | 8       | 32      | 130       | 523
3                   | 41     | 341     | 2,763   | 22,238    | 178,433
4                   | 625    | 10,668  | 174,792 | 2,829,877 | . . .
5                   | 7,624  | 264,566 | . . .   | . . .     | . . .
6                   | 74,974 | . . .   | . . .   | . . .     | . . .

Table 3.3: Number of possible sentences for a Glocal signature as a function of the number of local features and of the length of the sentences

Sentence length (l) | L = 10 | L = 20 | L = 30  | L = 40    | L = 50    | L = 100
2                   | 45     | 190    | 435     | 780       | 1,225     | 4,950
3                   | 120    | 1,140  | 4,060   | 9,880     | 19,600    | 161,700
4                   | 210    | 4,845  | 27,405  | 91,390    | 230,300   | 3,921,225
5                   | 252    | 15,504 | 142,506 | 658,008   | 2,118,760 | . . .
6                   | 210    | 38,760 | 593,775 | 3,838,380 | . . .     | . . .

If all the bits are set to 1 with the same frequency for the signatures in the database, then all the buckets have the same size, equal to N C(L, l) C(2^h, l)^−1. Consequently, the number of similarity computations performed with the index is approximately C(2^h, l) (N C(L, l) / C(2^h, l))² / 2 = (N²/2) C(L, l)² C(2^h, l)^−1. The estimated speed-up a obtained by using the index is then:

a = C(2^h, l) C(L, l)^−2    (3.3)
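As a quick numerical check of equation 3.3 (our own worked example), the estimated speed-up can be evaluated with exact binomial coefficients:

from math import comb

def speedup(h: int, l: int, L: int = 20) -> float:
    # Equation 3.3: a = C(2^h, l) / C(L, l)^2
    return comb(2 ** h, l) / comb(L, l) ** 2

for h, l in [(8, 2), (8, 3), (10, 3), (10, 4)]:
    print(f"h={h}, l={l}: a = {speedup(h, l):.1f}")
# Prints roughly 0.9, 2.1, 137.3 and 1940.2: the estimated speed-up grows
# quickly with both h and l, as shown in figure 3.8.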



Figure 3.8: Impact of h and l on speed-up

If sparse representations are employed for the Glocal signatures, the space required to store a signature and the time required to compute the similarity between two signatures can be considered fixed and independent of h. As shown in figure 3.8 for l ∈ {1, . . . , 6} and h ∈ {8, . . . , 20} (with L = 20), the speed-up increases with both l and h. But the storage requirements depend on N C(L, l), so they increase when l grows from 1 to 10 and then decrease. Taking for l a value between 15 and 20 would make the similarity between two signatures in a same bucket higher than or equal to (2 × 15)/(20 + 20) = 0.75, according to equation 3.1, which would severely restrict recall. Moreover, when h increases (so the partitioning of the description space is finer) the similarity between a keyframe and a transformed version of this keyframe diminishes. The similarity threshold θ used in equation 3.2 to establish a link between two keyframes could be reduced accordingly, but this would limit the possibility to distinguish true positives from true negatives. For these reasons, in the system developed here both the sentence length l and the partitioning depth h are close to their lower bounds shown in figure 3.8.

To further save computation time and storage, it is possible to reduce both the number of buckets and their size by assigning each signature to a relatively small share of the C(L, l) buckets it can belong to. This can be performed by defining rules to select sentences (sets of B1) in a signature; then, all the signatures containing a given sentence are assigned to the corresponding bucket.

3.8.3 Sentence selection

The aim is to define a set of sentences that covers the information contained in the Glocal signature and is robust to the insertion, deletion or displacement of B1. These operations are the consequences of the different distortions undergone by the videos: some sentences found in a signature may no longer exist in the signature of a copy. We must therefore find rules that select sentences using a wide range of B1 and different combinations for each B1. The rules considered here follow these directions; we propose to consider all the sentences of B1 that can be made with:

• the neighboring bits: they are not separated by any other B1 in the Glocal bit list,

• the 1-out-of-2 bits: they are separated by exactly one B1 in the Glocal bit list,

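These four rules can be illustrated with a minimal sketch (Python; the function name and the representation of a signature as the ordered list of its B1 positions are assumptions made for illustration, and for sentences longer than 2 the exact enumeration used in the thesis may differ slightly). Applied to the example signature discussed in the next paragraph, it returns exactly the length-2 sentences listed there.

```python
def select_sentences(b1_positions, l=2, strides=(1, 2, 3, 4)):
    """Enumerate the sentences of length l built from the ordered B1 positions.
    stride 1 = neighboring bits, stride 2 = 1-out-of-2 bits,
    stride 3 = 1-out-of-3 bits, stride 4 = 1-out-of-4 bits."""
    sentences = []
    n = len(b1_positions)
    for stride in strides:
        for start in range(max(0, n - (l - 1) * stride)):
            sentences.append(tuple(b1_positions[start + k * stride] for k in range(l)))
    return sentences

# Example signature 0010000100011001: B1 at positions 3, 8, 12, 13 and 16
print(select_sentences([3, 8, 12, 13, 16], l=2))
# -> (3, 8), (8, 12), (12, 13), (13, 16), (3, 12), (8, 13), (12, 16), (3, 13), (8, 16), (3, 16)
```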

As an example, for the Glocal signature 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1, the B1 are 3, 8, 12, 13 and 16, so the sentences of length 2 of neighboring bits are 3-8, 8-12, 12-13 and 13-16, those of 1-out-of-2 bits are 3-12, 8-13 and 12-16, those of 1-out-of-3 bits are 3-13 and 8-16, while the only sentence of 1-out-of-4 bits is 3-16. This selection mode guarantees that all the sentences found are different.

The reduction of the number of buckets to which a signature is assigned is significant: e.g., with L = 20 and l = 3, every signature is only assigned to a maximum of 19 + 18 + 17 + 16 = 70 of the \binom{20}{3} = 1140 buckets in which it would otherwise be present. Ns, the average number of buckets per signature, can even be lower. But if this reduction is too strong, some pairs of signatures corresponding to a keyframe and to its transformed version may no longer be placed together in any remaining bucket, so only part of Kθ would be found.

To evaluate these rules and find appropriate values for h and l, we explore their impact on an automatically generated set of transformed videos (copies). Since the available ground truths are small, they are only employed for the final evaluation in section 3.12. Instead of using a ground truth, we employed a large set of automatically generated copies (already used in the first part of this thesis to generate a distortion model, see section 2.7.1). We computed the Glocal signatures of the original keyframes and of their transformed versions, then measured the impact of different parameters on the quality of copy detection. To show the effects of the selection rules, we also compare with an approach in which the sentences are selected randomly for each Glocal signature.

Figure 3.9: Estimated probability of collision between the Glocal signature of an original keyframe and that of an automatically generated copy, obtained at depth 7 (left), 8 (middle) and 9 (right), respectively.

3.8.4 Collision analysis for rule-based bucket selection

Using this large dataset, we estimate the probability for the Glocal signature of an original keyframe to "collide" (i.e., to be assigned to at least one common bucket) with the corresponding signature of an automatically generated copy; the results are shown in figure 3.9 for h ∈ {7, 8, 9}. Figure 3.10 displays the results obtained when the estimation is performed for randomly selected pairs of signatures (moreover, two such keyframes are always taken from different broadcasts).


Figure 3.10: Estimated probability of collision between the Glocal signatures of two random keyframes, obtained at depth 7 (left), 8 (middle) and 9 (right), respectively.

Each curve on the graphs corresponds to the addition of one more selection rule: "0" stands for neighboring bits only, "0 1" for neighboring bits and 1-out-of-2 bits, "0 1 2" adds the selection of 1-out-of-3 bits and "0 1 2 3" adds the selection of 1-out-of-4 bits. Each point on a curve corresponds to a value of the sentence length: 2 for the point at the top, 3 for the point below, and so on up to 8 for the point at the bottom. The abscissa represents the total number of different sentences obtained for a signature. The similarity between two signatures is computed and compared to the threshold θ only if the two signatures collide.

When keyframes are compared to their copies (figure 3.9), the estimated probability of collision pc should be as high as possible; a value of 1 would guarantee that all of Kθ is found. When random keyframes are compared, the estimated probability of collision pr should be as close to 0 as possible in order to save similarity computations; a positive value does not necessarily decrease the precision of mining, since the similarity between signatures that collide is always computed and compared to θ.

Figures 3.9 and 3.10 show that sentences of length l = 2 offer a very high pc (≥ 95%) for all the indexing depths and all the sets of rules we considered, but unfortunately also a relatively high pr (≥ 4%), which suggests that many similarity computations are useless. Sentences of length l = 4 produce a pc that is not high enough (pc ≤ 70%) with the set of selection rules we use; with more rules pc could be increased, but the redundancy would be too high and make the index inefficient. Sentences of length l = 3 appear to be appropriate for a partitioning at depth h = 8 or h = 9 with the set of rules "0 1 2":

• a relatively high pc = 88% (almost all of the corresponding Kθ is found) for h = 8, and pc = 82% for h = 9;
• a low pr ≤ 1% for h = 8 and pr ≤ 0.1% for h = 9;
• a very strong reduction of the number of buckets used for each Glocal signature, with a maximum of 19 + 18 + 17 = 54 and an average of Ns ≃ 42 for h = 8 or Ns ≃ 44 for h = 9, instead of \binom{20}{3} = 1140;


• an estimated speedup of a = (256 × 255 × 254)/(3 × 2 × 42²) ≃ 1,566 for h = 8 and a = (512 × 511 × 510)/(3 × 2 × 44²) ≃ 11,486 for h = 9.

The values h = 8 and l = 3 are used as reference for the evaluations in section 3.12. However, the larger the video database, the deeper the partitioning should be, so we employed h = 9 for the largest databases. The speedup given by equation (3.3) is an optimistic estimate: in practice the speedup is not as high, since the buckets are likely to have different sizes (they are not balanced). In fact, the more copies the video database contains, the less balanced the buckets can be expected to be. Since the complexity is quadratic in the size of the buckets, the overall computational gain essentially depends on the largest buckets.

The indexing scheme is generic, but several different implementations can be produced depending on the mining context. An off-line and an online version are described in section 3.8.6. Before presenting these versions, we investigate two alternative solutions to highlight the benefit of both redundancy and rule-based selection for mining. The next section analyzes the collisions obtained when the sentences are randomly chosen, in order to show the benefit of the proposed selection rules.

3.8.5 Collision analysis for random bucket selection

We performed another collision analysis by randomly selecting a fixed number of sentences for each Glocal signature (rather than selecting them for all the signatures according to the same rules, as above). We used h = 8 and considered sentences of length l = 2, l = 3 and l = 4. The left-side graph of figure 3.11 shows the probability of collision between an original signature and the corresponding signature of a copy (pc); the same number of sentences is employed for both signatures. The right-side graph shows the probability of collision between two random signatures (pr).

Figure 3.11: Estimated probability of collision with randomly selected sentences. The left-side graph shows the probability of collision between the Glocal signature of an original keyframe and that of a copy. The right-side graph shows the probability of collision between the Glocal signatures of two random keyframes.

For l = 2 we notice that a high pc = 90% can only be obtained for Ns ≥ 40 sentences, but this also produces a rather high pr ≥ 9%. For l = 2 there is no convenient compromise. For l = 4 there


is no convenient compromise either: pc remains rather low even for Ns ≃ 200 sentences. Only l = 3 can provide a good compromise. To obtain pc = 88%, as with h = 8 and l = 3 employing the rules "0 1 2", Ns ≃ 130 sentences must be selected, compared to Ns ≃ 42 previously; the value of pr is the same in both cases. Consequently, the speedup is higher when the sentences are selected according to the same rules for all the Glocal signatures than when the selection is random for each signature. Indeed, at constant quality, by employing the same rules the number of similarity computations is reduced by a factor of (120/42)² ≃ 8.1 in comparison with the case where the selection is random for each signature.

These results show that with rule-based selection a high probability of collision between an original signature and its copy can be reached with limited redundancy, so less computation is required to perform the similarity self-join. The strategy of using a wide range of sentences is efficient: it balances the effects of the copy-creation process.

3.8.6 Implementation issues

Off-line version

This version addresses the scalability issue, with the aim of mining a 10,000 hours video database with a single PC. It does not have to provide immediate answers and can employ mass storage. But it also has to satisfy some constraints: it must be dynamic and easy to parallelize, in order to support scalability to very large video databases (100,000 hours and more).

There are many ways to build the database and the index. In order to be as close as possible to an industrial process, we want a dynamic system in which the insertion or deletion of a video is fast and easy. Indeed, it can take quite some time to accumulate 10,000 hours of video (some details are given in the following); for a weekly update, an incremental process is recommended. For that, a bucket is stored as a single file, whose name is the identifier of the bucket. This solution has several advantages:

• no specific index file is required, the structure is represented by the bucket files,
• accessing a bucket only requires opening a file, if it exists,
• only the modified buckets are opened for an update,
• to use C computers in parallel, if B buckets exist, each computer has to process B/C bucket files.

But it also has some drawbacks:

• many files are created, increasing the size of the database,
• many system calls (opening and closing of the files) are necessary to perform the mining, which slows down the process.

The dynamic operations are easy to perform. For the insertion of a video, the Glocal signatures are computed for all its keyframes, and for each one the buckets it should lie in are identified, then the


corresponding files are opened and the signatures are added. The deletion of a video also begins with the computation of the Glocal signatures and of their respective buckets; then the corresponding files are opened and scanned in order to remove the Glocal signatures carrying the identifier of the video to delete.

This solution can be implemented in two ways: the buckets store (1) the Glocal signatures themselves or (2) their addresses in a list. For the first alternative (off-line solution 1) nothing more needs to be done, while for the second alternative (off-line solution 2) a list of all the Glocal signatures must be built and stored. For good efficiency, this list must be kept in main memory in order to avoid accesses to mass storage. The advantage is that it saves disk space for the storage of the buckets, since a pointer (4 bytes for 2^32 items) replaces a Glocal signature and its metadata (40 bytes if h = 8) in the buckets. A 10,000 hours video database contains about 30,000,000 keyframes, which means that storing the Glocal signature list requires 1.2 Gb, suitable for main memory storage.

Tables 3.4 and 3.5 show the time required for the construction of the databases for both off-line versions and for different values of the indexing depth h. The amount of memory required for the construction is 1.2 Gb. The larger the main memory available, the faster the index is built; the main cost is due to the accesses to mass storage.
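The bucket-as-file layout and the dynamic operations can be sketched as follows (Python; the directory layout, record format and function names are illustrative assumptions, corresponding to off-line solution 2, where each bucket entry is a 4-byte pointer into the global signature list):

```python
import os
import struct

BUCKET_DIR = "buckets"  # hypothetical layout: one file per bucket, named after the bucket identifier

def insert_signature(bucket_ids, signature_ref):
    """Append a 4-byte reference to the global Glocal signature list into every bucket file
    the signature belongs to (the files are created on first use)."""
    os.makedirs(BUCKET_DIR, exist_ok=True)
    for bid in bucket_ids:
        with open(os.path.join(BUCKET_DIR, str(bid)), "ab") as f:
            f.write(struct.pack("<I", signature_ref))

def delete_video(bucket_ids_per_signature, video_refs):
    """Remove all the signature references of a video; only the affected bucket files are rewritten."""
    video_refs = set(video_refs)
    affected = {bid for bids in bucket_ids_per_signature for bid in bids}
    for bid in affected:
        path = os.path.join(BUCKET_DIR, str(bid))
        with open(path, "rb") as f:
            refs = [r for (r,) in struct.iter_unpack("<I", f.read())]
        with open(path, "wb") as f:
            f.write(b"".join(struct.pack("<I", r) for r in refs if r not in video_refs))
```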

Table 3.4: Time required to build the database and volume of the database for various volumes of video with the first off-line version (each cell: construction time - database volume - number of signatures in the index)

Volume of video (hours, keyframes) | h = 7 | h = 8 | h = 9
1,000 h (3.1 × 10^6) | 23 min - 8.3 Gb - 117 × 10^6 | 36 min - 31 Gb - 127 × 10^6 | 2 h 50 min - 153 Gb - 133 × 10^6
2,000 h (6.1 × 10^6) | 44 min - 13 Gb - 230 × 10^6 | 53 min - 38 Gb - 242 × 10^6 | 2 h 52 min - 180 Gb - 255 × 10^6

Table 3.5: Time required to build the database and volume of the database for various volumes of video with the second off-line version

Volume of video (hours) | h = 7 | h = 8 | h = 9
1,000 | 9 min - 4.7 Gb | 26 min - 28 Gb | 2 h 02 min - 150 Gb
2,000 | 24 min - 5.4 Gb | 41 min - 31 Gb | 2 h 12 min - 177 Gb

We performed mining operations with the two off-line versions on the different databases. Version 2 requires about 10% more time than version 1 for the similarity self-join. Conversely, as we can see in tables 3.4 and 3.5, the time required to build the database is significantly shorter with version 2. Also, some disk space is saved with version 2. Version 2 of the off-line solution is therefore retained here and further evaluated in section 3.12.


Online version

This version must be fast enough to be employed online on Web 2.0 video Web sites, for comparatively small amounts of video data. It must quickly output a graph showing the links between video sequences for a video database of less than 100 hours, which is about the maximum cumulative volume of the 1,000 top answers to a textual query on a video Web site (1000 × 4 minutes ≈ 67 hours). The aim is to provide improved results to the user and also a better way of browsing these results. It groups the videos having the same content, thus offering a broader and non-redundant set of results. It also provides links between the videos having common sequences, in order to highlight the structure of the result set. It can identify the videos having the largest number of copies, in order to point out popular videos (typically uploaded, under various versions, by many users). As we shall see in section 3.13, many videos on a Web2.0 site are compilations ("bests of"). The graph of links between video sequences gives access to a new level of granularity in browsing, since the different similar segments are grouped in connected subgraphs.

The online version is not intended to be dynamic: the entire mining process is performed online, i.e. indexing (section 3.8), similarity self-join on the set of keyframes (section 3.8.7), establishment of links between video sequences (section 3.9) and construction of a view of the graph. For a textual query, the Glocal signatures corresponding to the keyframes of the returned videos (the top 1000 answers to the query) are processed. A new textual query returns other videos, so the mining process is performed again from scratch. For the most popular queries, the graphs of content links may of course be precomputed off-line and stored.

This version must work with all the data structures in main memory, so the size of the available memory can be a limitation. Using a list of the Glocal signatures, as for version 2 of the off-line solution in section 3.8.6, is essential here. The volume of this list is small (for 100 hours we have about 300,000 Glocal signatures, i.e. 12 Mb), but the index (the set of buckets) is significantly larger, so it may be necessary to bound the size of the buckets. If the bound is too low, many buckets are truncated, thus reducing the probability of collision, with a negative impact on recall. Using 2 Gb of main memory for the index, at depth h = 7 a bucket can hold about 1,500 Glocal signatures, but the computational gain is then insufficient. At depth h = 8, about 180 Glocal signatures can be stored in a bucket, which is convenient. For h = 9 a bucket can only hold about 22 Glocal signatures, which is by far insufficient to obtain a good level of recall. An indexing depth of h = 8 is the best compromise. Table 3.6 presents the computation time and the size of the index for various volumes of video, when the main memory is limited to 3 Gb.

At this point, the Glocal signatures have been computed for every keyframe of the video database that has to be mined. These signatures are indexed using the redundant scheme explained in the previous sections. Two implementations are available, one performing the similarity self-join off-line for large databases, and one performing it online in main memory (with the corresponding size limitations).
Figure 3.12 presents a general scheme that summarizes the process from the extraction of the local signatures (upon which the Glocal description is based) to the construction of the redundant index of the Glocal signatures.
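A minimal sketch of the bounded, in-memory index used by the online version could look as follows (Python; the class and parameter names are illustrative assumptions, not taken from the actual software; the default bound of 180 entries per bucket corresponds to the figure quoted above for h = 8 and 2 Gb of RAM):

```python
from collections import defaultdict

class OnlineIndex:
    """In-memory redundant index: bucket identifier -> list of signature references.
    Buckets are truncated at max_bucket_size (Nbd) to bound memory usage."""

    def __init__(self, max_bucket_size=180):
        self.max_bucket_size = max_bucket_size
        self.buckets = defaultdict(list)

    def insert(self, signature_ref, bucket_ids):
        for bid in bucket_ids:
            bucket = self.buckets[bid]
            if len(bucket) < self.max_bucket_size:  # truncation: extra entries are dropped
                bucket.append(signature_ref)

    def candidate_pairs(self):
        """Yield the pairs of signature references that collide in at least one bucket."""
        seen = set()
        for bucket in self.buckets.values():
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    pair = (min(bucket[i], bucket[j]), max(bucket[i], bucket[j]))
                    if pair not in seen:
                        seen.add(pair)
                        yield pair
```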


Table 3.6: Time required (seconds) to build the database for various volumes of video with the online solution

Volume of video (hours, keyframes) | h = 7 | h = 8 | h = 9
10 h (32,849) | 3.8 s | 5.4 s | 19 s
20 h (61,968) | 4.5 s | 6.2 s | 20.1 s
50 h (145,860) | 6.8 s | 8.5 s | 23.5 s
100 h (276,245) | 10.2 s | 13.4 s | 29.4 s

In the next section, details are provided for the similarity self-join on the set of keyframes based on the redundant indexing.

3.8.7 Finding the links between keyframes

The role of the buckets is to avoid comparing each signature to every other signature in the database: the full similarity between two signatures is only computed if they fall together in at least one bucket. With L = 20 and l = 3, equation 3.1 implies that the similarity between two signatures in a same bucket is at least 2×3/(20+20) = 0.15; while the mining process is significantly accelerated, this value is too low to remove false detections. A higher value could be obtained by increasing the length of the sentences that define the buckets, but this has a negative impact on recall, as seen in the previous subsection. Consequently, within each bucket, the similarity is evaluated for every pair of Glocal signatures and compared to a threshold θ (defining Kθ) whose value is significantly higher than 0.15 (see section 3.12). A link is established between two keyframes if the similarity between their signatures is above θ.

As mentioned in section 3.6, a preliminary analysis has shown that a strong temporal redundancy can be found in many broadcasts (such as talk shows or weather forecasts). If mining is directly applied, many of the links will be established between keyframes that are temporally close and belong to a same broadcast; these uninteresting links would then overload the subsequent filtering stages (such as the temporal consistency check). This problem can be solved in many different ways, depending on the representation of the data. Here, the videos stored in the archives we employed are already segmented into relatively short broadcasts (between a few minutes and three hours), each having its own identifier. In the same way, the videos from a video Web site already have an identifier. On the other hand, for a video stream, chunks have to be made at regular intervals and an identifier is given to each one. Another solution could be to check the temporal proximity of the keyframes, but it would make our framework too specific.

To every Glocal signature in the database are associated the ID of the video broadcast and the time code Tc of the keyframe it is issued from; time codes are relative to the beginning of the broadcast.


Figure 3.12: Obtaining the Glocal signature database

The link between keyframe n from broadcast x and keyframe m from broadcast y is specified by the corresponding pairs of IDs (Idx; Idy) and time codes (Tcx,n; Tcy,m). To avoid the generation of links between keyframes that belong to a same broadcast, the signatures that are in a same bucket and have the same Idx are never compared to each other. For this, the pairs of keyframes having a similarity above θ have to respect the constraint Idx < Idy (otherwise Idx and Idy are swapped). The pairs are sorted by (Idx; Idy) (first on Idx, then on Idy if they have the same Idx). There are Nv × (Nv − 1)/2 possible pairs of videos, where Nv is the number of videos in the database.

Algorithm 3 summarizes the indexing and similarity self-join operations performed on the database of Glocal signatures. In order to reconstruct the potentially matching video segments between two broadcasts, a merging process relying on the time codes of the keyframes is performed for each pair (Idx; Idy). The next section describes this process.

3.9 Reconstruction of the matching video sequences

The connections identified between individual keyframes are used to delimit and link together video sequences that are transformed versions of a same content. First, the pairs of keyframes found during the previous stage (section 3.8.7) are sorted by increasing time codes (Tcn; Tcm) (first on Tcn, then on Tcm if they have the same Tcn). The timeline of the first video is thus respected.


Algorithm 3 Indexing and similarity self-join of the database of Glocal signatures
Require: Glocal signatures DG issued from the video set to mine
Require: Metadata (ID, TC)
Ensure: List of linked keyframe pairs LP
 1: // Redundant indexing
 2: for all DGi ∈ DG do
 3:   Compute the sentences PH of signature DGi
 4:   for all PHj ∈ PH do
 5:     Compute address SBk from PHj
 6:     if SBk does not exist then
 7:       Create SBk // create a new bucket
 8:     end if
 9:     Insert DGi in SBk
10:   end for
11: end for
12: // Similarity self-join
13: for all SBi ∈ SB do // for each bucket
14:   for all DGij ∈ SBi do
15:     for all DGik ∈ SBi, k > j do
16:       Compute the similarity Sijk between DGij and DGik
17:       if Sijk > θ then
18:         Add ((DGij.id, DGij.tc), (DGik.id, DGik.tc)) to LP
19:       end if
20:     end for
21:   end for
22: end for
23: // Sorting
24: Sort the pairs of LP by the first id, then the second id, then the first tc, then the second tc, in ascending order

Starting from the first two connected keyframes, two joined video sequences are built by the stepwise addition of other connected keyframes (with increasing time codes) that verify temporal consistency conditions. These conditions make the detection more robust to the absence of a few connected keyframes and to the presence of some false positive detections. The first requirement is that the temporal gap between the last keyframe in a sequence (with time code Tcx,l) and a candidate keyframe to be added to the same sequence (with time code Tcx,c) should be lower than a threshold τg:

Tcx,c − Tcx,l < τg    (3.4)

Gaps are due to the absence of a few connected keyframes, as a result of post-processing operations (addition or removal of several frames), of instabilities of the keyframe detector or of false negatives in the detection of connected keyframes. The allowed gap is kept small in order to avoid matching distant pairs of keyframes, which generally produces false positives. The second requirement bounds the variation of temporal offset (jitter) between the connected keyframes of the two sequences of Idx and Idy. Jitter is caused by post-processing operations or by instabilities of the keyframe detector. If Tcx,l is the time code of the last keyframe in Idx, Tcy,l the time code


of the last keyframe in Idy, and Tcx,c, Tcy,c are the time codes of the candidate keyframes, then the condition upper-bounding the jitter by τj is:

|(Tcx,c − Tcx,l) − (Tcy,c − Tcy,l)| < τj    (3.5)

The candidate keyframes are added at the end of the current sequences only if both the gap and the jitter conditions are satisfied. The third condition is that the resulting video sequences should be longer than a minimal value τl to be considered valid; this removes very short detections, which are typically false positives. Figure 3.13 summarizes the three temporal constraints.

Figure 3.13: Reconstruction of the matching video sequences from pairs of linked keyframes

From the indexed database of Glocal signatures representing the keyframes extracted from a collection of videos, the mining process brings out pairs of matching sequences (transformed versions of a same content). The entire process is summarized in figure 3.14. The results can be represented as a graph where every node is a video sequence and every edge is a match between two video sequences. A video can contain several sequences that are linked to other sequences in several other videos. A graph having entire videos as nodes can also be produced, where every link summarizes all the links between the matching video sequences of two videos. This graph is consequently smaller than the previous one. More details are given in section 3.11 and examples of specific applications of the two types of graphs are provided in section 3.13.

Figure 3.14: Finding links between video sequences
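As an illustration of this merging process, here is a much simplified sketch (Python; the greedy grouping strategy and the unit of the time codes are assumptions made for illustration). It applies the gap (eq. 3.4), jitter (eq. 3.5) and minimal-length constraints to the sorted list of keyframe matches of one (Idx; Idy) pair, using the default thresholds of the thesis (τg = 3, τj = 10, τl = 3 keyframes).

```python
def reconstruct_sequences(pairs, tau_g=3, tau_j=10, tau_l=3):
    """Group the keyframe matches of one (Id_x, Id_y) pair into matching sequences.
    pairs: list of (tc_x, tc_y) time codes, sorted by tc_x then tc_y.
    Returns the sequences satisfying the gap, jitter and minimal-length constraints."""
    sequences, current = [], []
    for tc_x, tc_y in pairs:
        if current:
            last_x, last_y = current[-1]
            gap_ok = (tc_x - last_x) < tau_g                             # equation 3.4
            jitter_ok = abs((tc_x - last_x) - (tc_y - last_y)) < tau_j   # equation 3.5
            if gap_ok and jitter_ok:
                current.append((tc_x, tc_y))
                continue
            if len(current) >= tau_l:  # keep only sequences long enough (third condition)
                sequences.append(current)
            current = []
        current.append((tc_x, tc_y))
    if len(current) >= tau_l:
        sequences.append(current)
    return sequences
```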

3.10 Parallelization of the mining process

Before presenting the evaluation of our method for mining video databases by copy detection, we must show how easy it is to efficiently parallelize the mining process. Figure 3.15 presents a possible parallel scheme for the entire off-line process (version 1 or 2). The first level (V1 to VNV) represents the initial video files. Each of the C computers processes a subset of these files (the load is balanced according


to the length of the files) in order to compute the Glocal signature of every keyframe (GS1 to GSNV). These Glocal signatures are inserted into a local redundant index (not shown here) during the Glocal signature duplication step. Each computer fills in a local index, independently from the others. When this step is finished, the C indexes are merged (the complexity of this operation is linear in the number of Glocal signatures). Then, each computer receives a set of buckets from the global index (B1 to BNB). The similarity self-join is independently performed in each bucket and the resulting pairs of keyframe identifiers are stored locally (SSS1 to SSSNB) in files whose names are obtained from the pairs of IDs (i_j, 1 ≤ i, j ≤ Nv, i < j, Nv being the number of available video files). They are then collected in common files (SRV1 to SRVNV, a step whose complexity is linear in the number of initial files). Each computer then receives a set of these files (level denoted TCV1 to TCVNV) in order to reconstruct, in each file, the matching video sequences (RV1 to RVNV). The final graph is obtained in a distributed form (sub-graphs) on the different computers; nodes and edges then have to be merged in order to obtain the global graph.

At every stage (computation of Glocal signatures, similarity self-joins, sorting and time consistency checks), processing is performed in parallel on the available computers. The intermediate merging operations have to be handled with care, depending on the communication and storage architecture employed, in order to be efficient. However, these operations have linear complexity and only concern I/O accesses, so they are not critical.
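As a rough, single-machine illustration of this bucket-level parallelism (the real scheme distributes the buckets over several computers), the following sketch uses Python's multiprocessing module; the similarity function follows the form implied by equation 3.1 (2 |A ∩ B| / (|A| + |B|)) and the threshold is the reference value θ = 0.45:

```python
from multiprocessing import Pool

THETA = 0.45  # reference similarity threshold

def similarity(a, b):
    """Glocal similarity: a and b are the sets of B1 positions of two signatures."""
    return 2 * len(a & b) / (len(a) + len(b))

def self_join_bucket(bucket):
    """Similarity self-join inside one bucket.
    bucket: list of (b1_set, video_id, time_code); signatures of a same video are never compared."""
    pairs = []
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            s1, id1, tc1 = bucket[i]
            s2, id2, tc2 = bucket[j]
            if id1 != id2 and similarity(s1, s2) > THETA:
                pairs.append(((id1, tc1), (id2, tc2)))
    return pairs

def parallel_self_join(buckets, workers=4):
    """Each worker processes a share of the buckets, then the partial results are merged."""
    with Pool(workers) as pool:
        partial = pool.map(self_join_bucket, buckets)
    return [p for bucket_pairs in partial for p in bucket_pairs]
```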

Figure 3.15: A parallel implementation scheme for the off-line process

3.11 Graphs and post-processing

Various types of post-processing operations can be performed on the graph resulting from the primary mining procedure, using the attributes of the nodes and of the edges, as well as the morphology of the

graph. We explored some of these post-processing operations that have limited complexity but valuable applications.

The results of the matching video sequences for a video database are represented as two graphs, at two different levels of granularity: the graph of the entire videos (where a node corresponds to a single ID) and the graph of video segments (where a node is an excerpt from a video, identified by a video ID and a time frame). In our current representation of the graph of video segments, two segments of a same video that overlap are represented as a single node. Every video of the database is a node in the graph of entire videos. If a video does not have any link to another video, it is an isolated node. If a video has one or more links with another video, an edge is set between the two nodes. Only the segments that have at least one match with another segment appear as nodes in the graph of video segments. If different segments of a video have a match with other segments in other videos, the node representing the video in the graph of videos is split into several nodes (one for each segment) in the graph of segments, and each match is represented by an edge in this graph. Figure 3.16 illustrates the two types of graphs obtained by mining a database of four videos: video 0 contains three matching segments, video 1 contains two such segments, videos 2 and 3 only one.

Figure 3.16: The two graphs obtained by mining a database of four videos: the graph of segments (left) and the graph of entire videos (right)

The graphs store several attributes. In the graph of video segments, a node carries the name of the video it corresponds to and a number corresponding to the order of appearance of the segment in the entire video. The length of the segment and the broadcasting date are also stored, together with various types of associated metadata (e.g., for a video Web site, the keywords of the videos). The edges also carry information. In the graph of video sequences, an edge between two segments stores the length of the match (duration and number of keyframes), the ratio between the number of


individually matching keyframes and the total length of the sequence, and the mean, minimum and maximum similarities between matching keyframes. All these attributes can be used to perform various filtering operations: finding long or short matching sequences, finding "exact" matches (high similarity between keyframes, high ratio of matching keyframes), etc. In the graph of entire videos, an edge stores averages of these values over all the edges between corresponding segments of the two videos.

Some operations can also be performed using the morphology of the graph. For example, a transitive closure on the graph of video sequences can add some links that were not detected and complete the graph. Other interesting patterns can also appear and have a specific meaning. We present here one post-processing method that is very useful for the INA video database. The largest share of the sequences occurring more than once are broadcast design sequences, especially jingles, which are very brief and follow the pattern shown in figure 3.17. TV shows (magazines, reports, news, weather forecasts, etc.) have opening credits and ending credits (usually different), of ten seconds or slightly longer. These credits usually match across all the episodes of a same show. During the show, almost identical (and typically short) jingles that are specific to the show occur several times. The credits and jingles are useful for the identification and the segmentation of a show, so detecting them is valuable.

Figure 3.17: Typical pattern of broadcast design sequences

The pattern in figure 3.17 suggests to simply consider the redundant detections of similar segments between two shows as jingles. The graph of video segments can then be divided into a graph containing the copies considered as jingles and a graph containing all the other copies. The graph of jingles is useful for the identification and segmentation of shows, while the second graph can be employed for browsing (an example is shown in section 3.13). Each graph is useful, in its own way, for the textual annotation of the videos. Note that the detection of jingles is of little interest for the videos from a Web2.0 site, which typically do not contain jingles.

The graphs obtained for videos taken from a Web2.0 site also show interesting patterns. Some textual queries, typically queries on popular people (actors, singers, sportsmen, etc.), generally return videos that are summaries of the career of that person, his or her best clips or concerts or goals or smashes, and so on. In the same way, a query on a popular TV show returns compilations with the best moments of


the show. Such compilations are very useful in order to organize the other answers, since the different segments that compose them are linked to clusters containing several versions of a video for each period (of a life, of a show). These compilations are easy to identify: a node representing a compilation has many links in the graph of videos (star pattern) and the corresponding video is divided into many nodes in the graph of video sequences. Depending on how the compilation is built, its timeline can support browsing based on popularity (most popular events first) or on chronology. Some examples are shown in section 3.13.
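To make the relation between the two graphs concrete, here is a minimal sketch (Python; the data structures and attribute handling are illustrative assumptions) that collapses the graph of segments into the graph of entire videos by averaging each numeric edge attribute over all the segment edges of a pair of videos, as described above:

```python
from collections import defaultdict

def collapse_to_video_graph(segment_edges):
    """segment_edges: list of ((video_a, seg_a), (video_b, seg_b), attrs), where attrs is a
    dict of numeric edge attributes (e.g. match length, ratio of matching keyframes).
    Returns one edge per pair of videos, each attribute averaged over the segment edges."""
    grouped = defaultdict(list)
    for (video_a, _), (video_b, _), attrs in segment_edges:
        grouped[tuple(sorted((video_a, video_b)))].append(attrs)
    video_edges = {}
    for pair, attr_list in grouped.items():
        # all segment edges are assumed to carry the same attribute names
        video_edges[pair] = {
            name: sum(a[name] for a in attr_list) / len(attr_list)
            for name in attr_list[0]
        }
    return video_edges
```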

3.12 Experimental evaluations

The experimental evaluations presented here concern the quality, the speed and the scalability of video mining. To evaluate the quality of video mining based on copy detection, it is in principle possible to employ a ground truth for content-based video copy detection, by including the queries in the reference database and attempting to find all the links between the two by mining. However, things are not so simple, because both the reference database of the ground truth and the set of queries are typically redundant, so mining should also find the links between different queries and those between different videos in the reference database. To make possible a direct comparison of the precision-recall results obtained by mining with those obtained by stream monitoring for the same ground truth, we decided to consider only the links between the queries and the reference database.

The experiments are performed with sentences of l = 3 words. The default number of points of interest in a keyframe is L = 20, but for some comparisons we employ higher values for L. Sentence (bucket) selection is performed using the first three rules seen in section 3.8.3 (neighboring bits, 1-out-of-2 bits, 1-out-of-3 bits). The default parameters for the reconstruction of the matching video sequences (section 3.9) are τl = 3, τg = 3 and τj = 10. All the experiments were performed on a single PC having a 3 GHz Xeon64 CPU with 4 Gb of RAM and running Linux. The disk system is a 1.2 TB RAID0 array composed of four 7,200 rpm disks.


3.12.1 Calibration and mining quality evaluation

To set the value of the similarity threshold θ and to check the optimal value of the partitioning depth h employed in computing the Glocal signatures, a calibration is performed on one ground truth, then the parameters are employed on another ground truth for evaluation. Precision and recall (defined in chapter 2) are measured using only the links established between the queries and the reference database. The links between different queries or between different reference videos are not considered.

The first ground truth database employed consists of 30 hours of original broadcasts, composed of 30 files issued from the INA archive, together with 20 minutes of copies, composed of 20 short excerpts (from 20 seconds to 4 minutes) obtained from a UGC Web site for tests. At least one sequence from every broadcast is present, transformed, in the set of copies. Since the amount of data is relatively limited, all the links between original and transformed video sequences were manually checked.

Table 3.7: Best recall and associated precision of the mining process on the INA ground truth (each cell: recall / precision)

Similarity threshold | h = 6 | h = 7 | h = 8
0.40 | 0.83 / 0.73 | 0.86 / 0.79 | 0.80 / 0.39
0.45 | 0.83 / 0.97 | 0.86 / 0.97 | 0.77 / 0.96
0.50 | 0.71 / 0.96 | 0.66 / 0.96 | 0.57 / 0.95
0.55 | 0.49 / 0.94 | 0.49 / 0.94 | 0.49 / 0.94
0.60 | 0.31 / 1.00 | 0.31 / 1.00 | 0.31 / 1.00

Several experiments were performed to find an appropriate value for the similarity threshold θ used to establish links between individual keyframes. Our previous experience with a CBCD system (see the first part of this thesis) has shown that a keyframe can be safely considered a copy of another keyframe if half of the local signatures are very similar. This suggested that a good initial guess could be θ = 0.5. Table 3.7 shows the precision and recall on the ground truth using different indexing depths and different similarity thresholds. The best compromise for every depth is obtained at θ = 0.45. The depth of 7 offers the best scores, followed by depth 6, then 8. In section 3.12.3 we see that the time required to build the three databases online is short, but that the time required to perform the similarity self-join is much higher for depths 6 and 7. This is because, in these cases, some buckets contain too many signatures. Consequently, the reference value selected is h = 8.

This value of the threshold had to be evaluated on another, independent ground truth. We employed the public video copy detection benchmark of CIVR 2007 (http://www-rocq.inria.fr/imedia/civr-bench/), which provides a database of 80 hours and


two sets of queries: ST1 are copies of entire videos from the database, while ST2 consists of copies of excerpts inserted into longer videos that are external to the database. The ST1 and ST2 queries were inserted into the database and the mining operations were performed, employing the value θ = 0.45 found previously. Recall is 0.8 for ST1 and 0.81 for ST2, with the same precision of 1.0.

It is important to note that, with the video description and mining solution put forward here, the recall values given above are stable whatever the volume of data used. If the ground truth is inserted in a larger database, the recall measured on the ground truth remains the same: the pairs of Glocal signatures still lie in the same buckets and the matches are found. On the other hand, the precision measured on the ground truth alone can be lower, because of new true positives that appear in the larger database but are not counted, or because of new false positives.

The database of Glocal signatures, the buckets, the links between keyframes and between sequences easily hold in main memory for such small databases, so mining using the indexing scheme only takes 40 seconds. As a comparison, if the indexing scheme is not employed, recall and precision remain the same for θ = 0.45, but mining requires 23 minutes.

To measure the effectiveness of the Glocal descriptor in detecting copies, Table 3.8 compares our results to those of the CIVR 2007 video copy detection competition, using the same performance measures.

Table 3.8: Comparison of scores on the tasks of the CIVR 2007 benchmark

Method | ST1 score | ST2 segment score
Best CIVR 2007 | 0.86 | 0.86
Ours with θ = 0.55 | 0.73 | 0.71
Ours with θ = 0.45 | 0.80 | 0.81

In the following, the CIVR 2007 copy detection ground truth is employed to build a database for comparative evaluations. It is called the CIVR07VM database and includes the reference videos together with the ST1 and ST2 sets of queries. Figures 3.18 and 3.19 show some copies from TRECVID 2008 that are found with the ZNA method for video stream monitoring (section 2.8.2) but not found by the mining method. These lost detections are the consequence of the disappearance of too many points of interest, so the similarity between the associated Glocal signatures is below the θ threshold.

Figure 3.18: A difficult case resulting from the combination of various transformations.


Figure 3.19: Another difficult case resulting from the combination of various transformations.

Now, different evaluations must be performed specifically for the two versions available: a scalability evaluation for the off-line disk version and a speed evaluation for the online, main-memory version. The quality of the mining is evaluated through recall and precision as well. The precision measure is slightly different from the one used in equation 2.8. For the precision we employ two measures, Pe for the edges of the graph and Pn for the nodes:

Pe = Nge / Ne    (3.6)

Pn = Ngn / Nn    (3.7)

where Nge is the number of correct edges among the Ne edges found, while Ngn is the number of correct nodes among the Nn nodes found. The good nodes are those not involved in any false edge. Pe and Pn are slightly different because certain nodes attract wrong detections, so many edges are connected to them. To measure Pe and Pn, a manual browsing of the graph is necessary; therefore, it is only performed for small databases. In the same way, recall could be measured by:

Re = Nge / Nte    (3.8)

where Nge is the number of correct edges found and Nte the total number of correct edges. Unfortunately, Nte cannot be properly evaluated: we know the links between the queries and the reference database for our ground truths, but not those within the reference database or within the queries. Therefore, we do not use this measure. We use the recall given by equation 2.8, limited to the links between queries and the reference videos, in order to compare mining results for different values of the parameters. The recall on a ground truth can easily be measured even when the ground truth is inserted in a larger database.

3.12.2 Evaluation of the off-line version

We first evaluate the scalability of the off-line version of our method. Table 3.9 gives the mining time for large databases of 500, 1,000 and 2,000 hours at the indexing depth h = 8, with θ = 0.55, L = 20, l = 3. Separate values are provided for every stage. The similarity self-join takes 85% of the mining time for 2,000 hours, but only 50% for 500 hours. The buckets become overpopulated when the database size increases. To balance this, the indexing depth must be increased.


Table 3.9: Time required to mine 3 databases at depth 8

Database size | 500 h | 1,000 h | 2,000 h
Nb. of keyframes | 1.6 × 10^6 | 2.9 × 10^6 | 5.8 × 10^6
Nb. of signatures | 65 × 10^6 | 117 × 10^6 | 231 × 10^6
Base construction | 0 h 33 min | 0 h 35 min | 0 h 41 min
Linking keyframes | 1 h 00 min | 2 h 12 min | 7 h 30 min
Linking sequences | 0 h 28 min | 0 h 34 min | 0 h 45 min
Total | 2 h 01 min | 3 h 21 min | 9 h 16 min

Table 3.10 gives the mining time for larger databases of 2,000, 5,000 and 10,000 hours at the indexing depth h = 9, the other parameters remaining the same. For the construction of the database, most of the time is taken by accesses to mass storage (80%). The 10,000 hours database requires twice the time needed for the construction of the 5,000 hours database because it results from the insertion of a new volume of 5,000 hours into the previous 5,000 hours database. For 2,000 hours, the construction time increases for h = 9 compared to h = 8, but the similarity self-join time decreases and the global mining time is slightly shorter for h = 9. As the volume to mine increases, the last stage (linking sequences) takes a larger share of the total computation time; the complexity of this step is quadratic in the number of copies. However, mining a 10,000 hours database requires about 82 hours, which can be considered reasonable.

Table 3.10: Time required to mine 3 databases at depth 9

Database size | 2,000 h | 5,000 h | 10,000 h
Nb. of keyframes | 5.8 × 10^6 | 14.5 × 10^6 | 28.7 × 10^6
Nb. of signatures | 231 × 10^6 | 615 × 10^6 | 1,217 × 10^6
Base construction | 2 h 12 min | 3 h 38 min | 7 h
Linking keyframes | 5 h 40 min | 14 h 59 min | 55 h
Linking sequences | 1 h 15 min | 7 h 15 min | 20 h 35 min
Total | 8 h 07 min | 25 h 52 min | 82 h 35 min

For comparison, if the similarity were computed for every pair of Glocal signatures, mining the 10,000 hours database would require about one year. If the monitoring solution using ZNAD were directly employed (see section 2.8.3), mining 10,000 hours would require about 22 days (528 hours).

Influence of the partitioning depth on time and quality. An increase of the indexing depth h will not only have an impact on speed, but may change the values of recall and precision. Indeed, the cells of the description space are halved, so the local signatures are further separated. We built two databases of 1,000 hours, one at depth h = 8 and the other at depth h = 9, including the CIVR07VM database and various other videos from INA. The base construction time is indicated in table 3.5. The global mining time is 3 h 21 min at depth h = 8 (table 3.9) and 5 h at depth h = 9: the database is not large enough to benefit from a partitioning depth of 9. However, we can compare the mining results (the similarity is still computed at depth h = 8). Using a threshold θ = 0.45 at depth 9, we found the same recall for ST1 and ST2: 0.8 and 0.81. Pe and Pn were not measured because of the very large number of nodes and edges.


Influence of the number of corners in the image. We also tried to change the number of local features per keyframe used to obtain a Glocal signature. We employed h = 8, l = 3 and θ = 0.45. We used the default value of L = 20 local signatures per keyframe, then L = 30, L = 40, L = 50 and L = 10. We built the corresponding Glocal reference signature databases using only the benchmark videos, in order to measure the precision and recall variations along with the computation time. The first observation is that increasing the number of local signatures per keyframe yields a higher recall: it reaches 0.9 for both ST1 and ST2 when L = 40. But unfortunately the precision is lowered to 0.8 for ST1, and for L = 50 recall is stable but precision for ST1 decreases to 0.6. Two missed detections are flip cases (inversion of left and right in the image); the Glocal signature is not designed to handle these transformations. The third missed detection is a query where there is no activity and, consequently, only one keyframe is extracted; for the merging process (section 3.9) the default value of τl (minimal number of keyframes) is 3, so the detection is not possible. We noted that the detections of the trials using L = 10 and L = 20 do not overlap. If these detections are merged, we obtain a recall of 0.9 for ST1 and 0.86 for ST2, with Pe values of 0.96 and 0.91 respectively.

Table 3.11: Comparison of time, Pe and recall on the CIVR07VM database using various numbers of local signatures per keyframe

Maximum nb. of points | 10 | 20 | 30 | 40 | 50
Base construction time | 9 min 33 s | 9 min 40 s | 9 min 31 s | 9 min 41 s | 9 min 10 s
Total mining time | 18 min 17 s | 19 min 33 s | 20 min 04 s | 21 min 20 s | 22 min 34 s
Nb. of signatures | 2,376,513 | 6,090,847 | 9,483,460 | 12,596,407 | 15,228,910
Pe (ST1, ST2) | (1.0, 1.0) | (0.96, 0.91) | (0.96, 0.91) | (0.81, 1.0) | (0.62, 1.0)
Recall (ST1, ST2) | (0.8, 0.71) | (0.8, 0.81) | (0.8, 0.85) | (0.9, 0.9) | (0.9, 0.9)

One can note that the processing times, for both base construction and mining, do not change as much as one could have expected. The database construction time is stable because it mainly corresponds to the creation of files. Mining with L = 40 points per keyframe does not require twice the time needed for L = 20 points. Indeed, the rules of bucket selection double the number of buckets where the Glocal signature of a keyframe is inserted (fourth line of table 3.11), but the number of buckets significantly increases, so the sizes of the buckets decrease. Also, the distribution of bucket sizes becomes more uniform when the number of local features increases.

3.12.3 Evaluation of the online version

Table 3.12 gives general information regarding three video databases obtained from a Web2.0 site for tests. The first database is built with the top answers to the query "Zidane", the second one with the top answers to the query "Funny cats", the third one with the top answers to the query "Madonna". The number of videos in each database varies because ffmpeg was not able to re-encode some of the original files to the mpeg1 format used by our software. We notice that the databases are of various sizes. The contents of each database are also highly variable: the database "Zidane" contains essentially compilations of goals and technical moves, while the database "Funny cats" contains compilations of homemade


movies. The database "Madonna" contains video clips, concert excerpts and interviews. Consequently, the average length and keyframe rate change from one database to another.

Table 3.12: Databases issued from three textual queries submitted to a Web2.0 site

Keyword-based query | "Zidane" | "Funny cats" | "Madonna"
Nb. of videos | 739 | 1303 | 920
Database size (hours) | 35 | 55 | 69
Average video length (seconds) | 170 | 151 | 270
Nb. of keyframes | 92,452 | 77,027 | 182,613
Keyframe rate (/hour) | 2,641 | 1,400 | 2,646
Average nb. keyframes / video | 125 | 59 | 198

Table 3.13 shows the mining time required for each database with the online implementation and the reference parameters L = 20, h = 8, θ = 0.45. A new parameter, R, appears for the online version: the amount of RAM used for processing the databases, here R = 3,000 Mb. The extraction of the local signatures and the computation of the Glocal signatures are not included: it is reasonable to assume that for an online process this step has been performed off-line once (e.g. when the videos are uploaded) and that the Glocal signatures are stored. Note that the cost of obtaining the Glocal signatures is negligible in comparison with the cost of local signature extraction.

Table 3.13: Mining time (seconds) for the three Web2.0 databases using the online implementation (L = 20, l = 3, h = 8, θ = 0.45, R = 3,000 Mb)

Database | "Zidane" | "Funny cats" | "Madonna"
Base construction | 5 s | 4.5 s | 7 s
Linking keyframes | 8 s | 7.5 s | 23 s
Linking sequences | 1 s | 1.5 s | 2 s
Total | 15 s | 14 s | 33 s

For the "Madonna" database, with the set of parameters L = 20, l = 3 and h = 8, the Glocal signature list is a 7 Mb file. Consequently, two possibilities can be considered for an online application: the processing could be performed on a server, but the list could also be sent to the client and the process run locally with a plug-in (our mining software takes 211 Kb).

For the three test databases, table 3.14 shows the content of the two types of graphs: for video sequences (segments) and for entire videos (programs), respectively. Almost all the edges found are true positives. The very few false positives are scrolling credits; in the "Madonna" database some false positives are also due to clips consisting of fixed flashing pictures with background music.

Influence of the partitioning depth. Table 3.15 presents the mining time for the "Madonna" database at depths 6, 7 and 8. We notice that at depth 6 the similarity self-join step (linking keyframes) is extremely long, and that it is still quite slow at depth 7. The indexes at these depths do not separate the Glocal signatures enough among the buckets, so the computational gain is not high enough. Moreover, the selectivity is also low and the last stage (linking sequences) takes quite a while. At depth 8 the processing is rather fast, about 32 seconds, and could be performed online. A depth of 9 is not employed because the buckets are too small with the available amount of RAM.


Table 3.14: Number of nodes and edges of the entire videos (programs) and video sequences (segments) graphs of the 3 databases (L = 20, l = 3, h = 8, θ = 0.45, R = 3,000 Mb)

Database | "Zidane" | "Funny cats" | "Madonna"
Programs - Nodes | 302 | 351 | 285
Programs - Edges | 1,050 | 2,472 | 2,476
Segments - Nodes | 654 | 605 | 570
Segments - Edges | 1,309 | 3,161 | 2,661

Table 3.15: Mining time (seconds) for the "Madonna" database at various partitioning depths (L = 20, l = 3, θ = 0.45, R = 3,000 Mb)

Partitioning depth h | 6 | 7 | 8
Base and index construction | 1.5 s | 4.5 s | 7 s
Linking keyframes | 475 s | 120 s | 23 s
Linking sequences | 45 s | 25 s | 2 s

Influence of the bound on the size of the buckets. To obtain figure 3.20 we used different amounts of RAM to mine the "Madonna" database. With a limited amount of RAM, the size of the buckets must be bounded (so the buckets are truncated). This only modifies the construction of the buckets; the next stage remains the same. Here h = 8, which defines the number of possible buckets and thus the bucket size. The value of the threshold is θ = 0.55; it is not a critical parameter here because it is used after the construction of the index. We employed several RAM sizes, between 500 Mb and 3,500 Mb, corresponding to Nbd = 25 and Nbd = 293 signatures per bucket respectively, with h = 8. We measured the proportion of nodes and edges found in comparison with the disk version (which does not limit the bucket size). We note that with Nbd = 69 a very large part of the nodes and edges are already found (99.2% and 95.7% respectively), with a processing time of 20.4 seconds.

Figure 3.20: On the left graph, proportion of nodes and edges found by the bounded-memory online implementation, as a function of the bucket size bound. On the right graph, the total processing time as a function of the bucket size bound ("Madonna" database, θ = 0.55, h = 8, L = 20, l = 3).

For figure 3.21 we used different values of R to mine the "Zidane" database. The values are between R = 500 Mb and R = 3,000 Mb, corresponding to 25 and 248 signatures per bucket respectively.


This database is smaller than the "Madonna" database; 98% of the nodes are found with Nbd = 69, in only 11 seconds of computation.

Figure 3.21: On the left graph, proportion of nodes and edges found by the bounded-memory online implementation, as a function of the bucket size bound. On the right graph, the total processing time as a function of the bucket size bound ("Zidane" database, θ = 0.55, h = 8, L = 20, l = 3).

The conclusion of these tests is that using large buckets is not necessary for an online application: most of the nodes and edges are found with truncated buckets, despite the high redundancy of the databases. This saves computation time and space in main memory.

Influence of the similarity threshold. Table 3.16 presents the time required for the various mining operations on the CIVR07VM database. Time values are shown along with Pn and Pe. The threshold θ has a strong influence. For θ = 0.3 and θ = 0.35 the computation time is prohibitive for an online application. Moreover, the resulting graphs have many false detections (edges); these edges overwhelm the true positives, making the graph unusable. For θ = 0.4 the graph becomes cleaner, and it is really clean for θ = 0.45. For higher thresholds some true positives are lost. Recall is not shown, but note that for θ ≤ 0.40 the flip case is found for ST1, in a 10-second sequence in which the image is almost symmetric.

Table 3.16: Online mining of the CIVR07VM database using various values of the similarity threshold θ (h = 8, L = 20, l = 3, R = 3,000 Mb)

θ | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 | 0.55 | 0.6
Total mining time | 89 s | 50 s | 31 s | 21 s | 17 s | 16 s | 16 s
Linking keyframes | 42 s | 29 s | 19.5 s | 15.5 s | 14 s | 13 s | 13 s
Linking sequences | 41 s | 18 s | 11 s | 4 s | 3 s | 2.5 s | 1.5 s
Nb. of nodes | 476 | 382 | 236 | 172 | 147 | 122 | 103
Pn | ... | ... | 0.66 | 0.89 | 0.93 | 0.97 | 1.0
Nb. of edges | 1,320 | 619 | 365 | 303 | 266 | 234 | 222
Pe | ... | ... | 0.79 | 0.92 | 0.96 | 0.99 | 1.0


Figure 3.22: Graphs of the segments built from the 1,000 hours database including the CIVR07VM database. For the graph on the right side, the jingles were filtered out by post-processing (section 3.11).

3.13 Graphs and illustrations

In the following we present several graphs showing the results of mining several databases. To display these graphs we use a software tool (developed at INA) that clusters the subgraphs as much as possible in a two-dimensional space. The layout is controlled by forces between nodes and clusters; it does not depend here on the attributes of individual nodes (other than connectivity) or edges. The tool cannot handle graphs obtained from databases larger than 5,000 hours. This type of display is provided for illustration purposes only; the resulting layout is not meant to be used for the browsing applications considered in section 3.3. All the figures in this section result from mining with the parameters L = 20, l = 3, θ = 0.45, h = 8.

Figure 3.22 shows two graphs built for the 1,000 hours database including the CIVR07VM database. For the graph on the right side, the jingles were filtered out by the process described in section 3.11. Both graphs have many nodes and edges, but the one on the right side is less dense. In both graphs, the center contains a large connected subgraph (cluster) composed of sub-clusters; this cluster contains many false positive detections that connect different subgraphs together. Many isolated detections surround it. The dense clusters are usually composed of opening or ending credits and jingles; they are therefore very useful for identifying programs and thus support a better segmentation of the videos. One large cluster is composed of many false detections, but they all concern weather forecasts, so it is still informative. Other clusters correspond to specific advertisements and make it possible to analyze how they were broadcast.

The following figures (“Madonna”, “Zidane” and “Funny cats” databases) result from the online process with R = 3,000 Mb. Figures 3.23 to 3.28 are close-ups (magnifications) of parts of the graphs built on the “Madonna” database. Figures 3.29 to 3.33 are examples of difficult cases that were nevertheless detected and connected; the values of θ that differ from the default are indicated.
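The two display levels used in these figures (graph of segments, graph of programs) can be related by a simple aggregation: two programs are linked as soon as at least one segment of the first is linked to a segment of the second. The sketch below illustrates this reading with hypothetical identifiers; it is not the code of the INA visualization tool.

```python
from collections import defaultdict

def program_level_edges(segment_edges, segment_to_program):
    """Collapse a graph of segments into a graph of programs.

    segment_edges: iterable of (segment_a, segment_b) links found by mining.
    segment_to_program: maps each segment to the video (program) it belongs to.
    Links between two segments of the same program are ignored; the returned
    dictionary counts how many segment links support each program edge.
    """
    edges = defaultdict(int)
    for seg_a, seg_b in segment_edges:
        prog_a = segment_to_program[seg_a]
        prog_b = segment_to_program[seg_b]
        if prog_a != prog_b:
            edges[tuple(sorted((prog_a, prog_b)))] += 1
    return edges
```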


Figure 3.23: Graph of the segments built from the “Madonna” database. The central clusters correspond to the last Madonna clip, which is composed of many brief shots and for which very many customized versions are present in the database. The smaller clusters mainly correspond to older clips and to excerpts of concerts or interviews.

Figure 3.24: Close-up from the graph of the segments built from the “Madonna” database. The cluster on the left side is another clip, while the one on the right side is composed of false detections. These false detections all belong to instances of a show where the background is the same.


Figure 3.25: Close-up from the graph of the segments built from the “Madonna” database, showing several elementary subgraphs (one link between two segments).

Figure 3.26: Graph of programs built from the “Madonna” database. The central cluster contains several sub-clusters.


Figure 3.27: Close-up of the graph of programs built from the “Madonna” database. The top left cluster contains versions of the last clip of Madonna rescaled at 50% in comparison with the versions of the dense central cluster. A link between the two clusters is nevertheless found during the mining process.

Figure 3.28: Close-up of the graph of the programs built from the “Madonna” database. Some nodes have been moved in this close-up in comparison with the previous one. Two programs are surrounded by a rectangle. The left one is a compilation of clips and concert excerpts; it is linked to many other programs, so the links have a “star” configuration. The other program is a compilation of the three last clips of Madonna, so it is linked to the three corresponding clusters, encircled on the figure (the large central one and two smaller ones).


Figure 3.29: A strong quality degradation case. Aspect ratio and colors are also changed. Detection found at θ = 0.35.

Figure 3.30: Insertion of logo, quality degradation and inlays. Detection found at θ = 0.45.

Figure 3.31: Quality degradation, change of gamma and contrast, and insertion of text. Detection found at θ = 0.4.

Figure 3.32: Strong rescaling (half of the original size). Detection found at θ = 0.45.

Figure 3.33: Picture in picture case. Detection found at θ = 0.35.


Figures 3.34 and 3.35 are views of the graph of segments built on the “Zidane” database. There are many compilations, and many of them are repurposed versions of other compilations, with new logos or a different aspect ratio.

Figure 3.34: Graph of the segments built from the “Zidane” database. The videos are essentially compilations of goals or dribbles. Most of the clusters correspond to a scene, a famous goal or dribble, and sometimes to several successive ones that were not separated by the users who reused the content.

Figure 3.35: Close-up from the graph of segments built from the “Zidane” database.


Figures 3.36 to 3.39 are views of the graphs built on the “funny cats” database. The videos are compilations of funny excerpts from homemade videos of playing cats.

Figure 3.36: Graph of the segments built from the “funny cats” database. As for the “Zidane” database, most clusters represent a shot or a sequence of shots that were uploaded several times. The large blue cluster at the bottom right is composed of false detections. They are all homemade opening and ending credits.

Figure 3.37: Close-up from the graph of segments built from the “funny cats” database, showing three clusters containing three different popular videos.


Figure 3.38: Close-up from the graph of segments built from the “funny cats” database, showing three other clusters containing three other popular videos.

Figure 3.39: Close-up from the graph of programs built from the “funny cats” database. Single videos with no links are also shown. Most of the programs containing copies are linked together, but some sub-clusters can easily be identified.


3.14 Conclusion

In the second part of this thesis we focused on mining video databases. Television channels and the Internet now offer a large number of transformed versions of the same video content. Reusing content, customized in various ways, is very popular among users, and professionals exploit older or user-generated content to produce popular shows at a lower production cost. The many versions of the same original video can serve many applications, such as content segmentation, the extension of textual annotations from one video to another, the removal of lower-quality copies or advanced visual navigation, as mentioned in section 3.3. For this, large databases must be mined using content-based copy detection.

We put forward here a novel method for mining a video database by CBCD. This method contains two major contributions that address the scalability challenge. The first is a new image descriptor, called Glocal, that embeds into a fixed-size binary pattern a set of local features extracted from the image. The Glocal description is based on a simple quantization of the description space that defines a visual vocabulary where each partition corresponds to a word. In spite of its compactness, we have shown that the Glocal description is discriminant; it is also robust to various transformations of a video. The mining process we developed relies on a similarity self-join performed on the set of Glocal signatures describing the keyframes of a video database. Our second contribution is a new redundant indexing scheme using “sentences” of visual words. This indexing solution significantly reduces the computational cost of the similarity self-join operation. After the similarity self-join, the resulting pairs of keyframes serve to reconstruct the video sequences that are connected by a “copy” relation.

Two versions of the method have been implemented. An off-line version, using mass storage, is scalable: we have shown that it can mine a database of 10,000 hours of video in affordable time (80 hours) with a single standard computer. This off-line version also makes the dynamic update of the video database simple. Moreover, it can be efficiently parallelized, allowing much larger databases to be mined with a set of computers. An online version relies on full in-memory processing and can successfully mine small databases (100 hours or less) in a very short time (less than 30 seconds).

The final result of the mining process is a graph that can be displayed at two different levels, the level of entire videos and the level of video sequences. Each level offers specific advantages for the applications.
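To recall how compact such a signature is, here is a minimal sketch of a Glocal-style embedding, assuming the description space has already been partitioned into vocab_size cells (visual words). The assignment function cell_of is a placeholder for the data-adapted partitioning actually used in the thesis.

```python
def glocal_signature(local_features, cell_of, vocab_size):
    """Embed the local features of a keyframe into a fixed-size binary pattern.

    cell_of(feature) returns the index of the visual word (cell of the
    partitioned description space) the feature falls into; vocab_size is
    the number of cells.  The signature only records which words occur,
    not how many times or where.  Illustrative sketch only.
    """
    signature = bytearray((vocab_size + 7) // 8)
    for feature in local_features:
        word = cell_of(feature)
        signature[word // 8] |= 1 << (word % 8)  # set the bit of that visual word
    return bytes(signature)
```

Because the result is a fixed-size bit pattern, comparing two keyframes reduces to a handful of bitwise operations, which is what keeps the similarity self-join affordable at this scale.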

Future work. We have shown that the method we put forward has good effectiveness and efficiency for video mining. However, we did not explore the entire space of possible choices for the variables that define the embedding and the indexing scheme. Some of these variables are quite strongly connected, both with respect to detection quality and with respect to the cost of mining (e.g., the value of the similarity threshold θ, the number of local features employed for building a Glocal signature and the depth of partitioning). Regarding the similarity threshold, a lower value should result in a higher recall. To avoid a negative impact on precision when lowering θ, the rules of bucket selection may have to be revised in order to provide better selectivity. To obtain a higher recall, we can also further investigate the number of local features employed: the evaluations presented here show that using more local features per keyframe can increase recall without hurting precision. A low-cost solution for exploiting rough configurations of the local features could also improve selectivity and should be explored; this may involve an evolution of both the embedding description and the indexing scheme.

The embedding scheme we proposed requires very simple computation (for obtaining the embedded signatures and for computing their similarity) and provides good discrimination despite the compactness of these signatures. It should be evaluated with other local features having broader invariance properties (such as SIFT), in order to support applications involving a more general similarity between images (keyframes) than the one corresponding to copy detection. Generalizations of this embedding scheme should also be explored.

Concerning the implementation of the indexing solution, the standard file system is not well adapted to the operations performed by the off-line version. If we want to keep the dynamic updating feature, other file systems such as reiserfs or xfs should be investigated, since they should be better suited to these operations. If dynamic updating is not considered important, a mass storage implementation of the online version could be used.

To evaluate the quality of our video mining method we mainly employed the same performance metrics and ground truth databases as for the evaluation of content-based copy detection. This allows direct comparisons with methods developed for stream monitoring, but does not reflect the specific characteristics of video mining. A more comprehensive evaluation requires specific ground truth databases, supporting specific recall and precision metrics, as well as more application-dependent metrics.

Further work is also needed to develop specific tools based on the results of mining. We saw that for large databases the graphs can be very large and quite cluttered. We have to find out how to deal with such graphs and how to display them in order to support advanced navigation in the database. This requires an adequate spatial layout of the graphs, together with methods for filtering them based on their connectivity and on the attributes of nodes and edges. The exploitation of the mining results for the automatic (or assisted) transfer of textual annotations from one video to others, or for “cleaning up” these annotations, also requires specific tools.


CHAPTER 4

Conclusion

The production and consumption of video content are undergoing significant changes. The multiplication of available devices and the convergence of technologies make access to content possible “anywhere, anytime”. Given the low cost and user-friendliness of image and video editing software, the recycling and customization of popular video content is becoming widespread. Professionals also extensively reuse older content and user-generated content to lower production costs and increase the appeal of their programs. A direct consequence of these evolutions is the existence of many copies of the same video content, on TV channels as well as on the Internet. These copies are often short and undergo a wide range of intensity, geometric, temporal and editing transformations.

Monitoring video streams in order to identify copies of (excerpts from) original programs is necessary for copyright enforcement, but it also has many other applications related to automatic billing, scheduling enforcement and content filtering. Finding multiple transformed occurrences of video sequences in large video databases can make explicit an important part of the internal structure of the database, thus supporting content management, retrieval and preservation for both large institutional archives and video sharing Web sites. As specific application examples, we can mention content segmentation, the extension of textual annotations from one video to another, the removal of lower-quality copies or advanced visual browsing.

Given the very large volume of the content databases and the high number of sources or channels, copy detection is a challenge. Watermarking methods can detect copies, support other applications and scale well. However, the content must be watermarked before dissemination, these methods are not robust to some strong content transformations, and the multiplicity of watermarking solutions creates significant practical problems. The alternative is to employ a content-based copy detection method.


To be robust to strong transformations of the content, such methods require rich and invariant content descriptions, but this also makes scalability a significant challenge.

In this thesis we addressed the scalability of content-based copy detection for video stream monitoring and for video mining.

In the first part, our focus was on video stream monitoring. We put forward a more efficient index for processing probabilistic similarity-based queries, exploiting the distribution of signatures in the description space. To make the probabilistic queries more selective and improve detection quality, we developed refined yet inexpensive local models of the distortions undergone by the local features when a copy is created. We eventually optimized retrieval on very large databases by introducing local models of the density of local features in their description space. The system we developed can monitor a video stream against a database of 280,000 hours of video, corresponding to almost 17 billion local signatures, in deferred real time with a single standard PC. The system was shown to present both high recall and high precision, for a wide range of visual transformations, on two ground truth databases. This shows that approximate models of the distribution and distortion of signatures can be obtained and that their use has a significant impact on scalability even for very large databases. Indeed, these models allow the cost of detection to be lowered by more than an order of magnitude without loss of recall.

In the second part of this thesis we focused on finding all the video sequences that occur more than once (with various modifications) in a video database. This can be seen as a specific video mining problem. After studying the potential applications and their requirements, we showed why an attempt to adapt the method put forward for stream monitoring could not lead to an efficient solution for the mining problem. Based on this analysis, we put forward a compact embedding description of video keyframes that simplifies the similarity evaluations while preserving enough information to obtain reliable detection results. We then proposed an indexing method based on a redundant segmentation of the database, taking advantage of the compact keyframe description. This method significantly speeds up the similarity self-join operation and allows for an efficient parallel implementation. The quality of detection was evaluated by quantifying precision and recall on two ground truth databases initially developed for CBCD (after including the copy detection queries into the original database). We further measured the time required for mining databases of various sizes with a single computer. An online version of the system is employed for small databases of less than 100 hours (a typical upper bound for the size of the results returned by a video sharing Web site for a textual query), while larger databases of up to 10,000 hours of video are processed by an off-line version of the system. We showed that mining can be performed interactively for the smaller databases and remains affordable for the larger ones. The resulting system can mine a 10,000 hours video database in about 80 hours (including mass storage accesses) with a single standard PC. A main-memory version can mine a database of 100 hours or less in about 30 seconds. The system was also shown to provide reliable detections on two ground truth databases. This shows that embedding a set of local signatures into a compact keyframe signature (Glocal) can provide an answer to the strong scalability challenge while maintaining good detection quality. Recent works such as [GD07] and [DWCL08] have also shown the relevance of similar approaches, which encourages us to continue investigating in this direction.

A first and natural extension of the work presented here would consist in adopting Glocal signatures in a video stream monitoring system. This would require an adaptation of the indexing solution but could result in a significant reduction of the cost of monitoring and, especially, of the latency of the monitoring system. A version working in main memory, for databases of medium size, could perform copy detection in real time rather than in deferred real time.

The analysis of the distribution of the data proved important in optimizing the indexing and retrieval solution put forward for stream monitoring. The global distribution of the data was also employed to improve the space partitioning required for building Glocal signatures. The way we made use of the local models in the stream monitoring method can be seen as adapting the similarity measure used for retrieval to the data. Specific statistical studies should be conducted to optimize the mining application and, especially, to help define data-adapted rules for the selection of the buckets.

Content-based video copy detection in general can provide a new way of using video content. The organization of video content for browsing could rely on the off-line or online detection of copies. A detected sequence may be part of several programs and have different continuations. Watching could then be combined with browsing: the user would select one or more of the available continuations of the current sequence, proposed by the system.

Our focus here was on content-based video copy detection. However, since the solutions put forward here all rely on measures of similarity, they can be extended to new problems beyond copy detection. Starting from appropriate initial features, the embedding scheme and the indexing solution could be used to address the scalability of mining based on more general or simply different similarity criteria. We can mention, for example, the detection of scenes with the same background [LTBGBB06], the detection of the same scene shot from different points of view [STA07], or simply the detection of videos that are similar but not copies. This would make it possible to address an even wider range of applications.


Bibliography

[AGK06] Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd international conference on Very Large Data Bases (VLDB’06), pages 918–929, Seoul, Korea, 2006. VLDB Endowment. [BA83] Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on Communications, COM-31,4:532–540, 1983. [BAG03] Sid-Ahmed Berrani, Laurent Amsaleg, and Patrick Gros. Robust content-based image searches for copyright protection. In Proc. 1st ACM intl. workshop on Multimedia Databases (MMDB’03), pages 70–77, New Orleans, USA, 2003. ACM Press. [BMS07] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web (WWW’07), pages 131–140, New York, NY, USA, 2007. ACM. [BTG06] Herbert Bay, Tinne Tuytelaars, and Luc J. Van Gool. SURF: Speeded up robust features. In Ales Leonardis, Horst Bischof, and Axel Pinz, editors, Proceedings of the European Conference on Computer Vision (ECCV’06), volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer, 2006. [But71] A.R. Butz. Alternative algorithm for hilbert’s space-filling curve. IEEE Transactions on Computers, C-20(4):424–426, April 1971. [BW97] Stephen Blott and Roger Weber. A simple vector-approximation file for similarity search in highdimensional vector spaces. Technical report, ESPRIT project HERMES (no. 9141), 1997. [CC05] Hue-Ling Chen and Ye-In Chang. Neighbor-finding based on space-filling curves. Inf. Syst., 30(3):205–226, 2005.


[CCC08] C.Y. Chiu, C.S. Chen, and L.F. Chien. A framework for handling spatiotemporal variations in video copy detection. CirSysVideo, 18(3):412–417, March 2008. [CCM+ 97] Shih-Fu Chang, William Chen, Horace J. Meng, Hari Sundaram, and Di Zhong. Videoq: an automated content based video search system using visual cues. In MULTIMEDIA ’97: Proceedings of the fifth ACM international conference on Multimedia, pages 313– 324, New York, NY, USA, 1997. ACM. [CDL07] Xu Cheng, Cameron Dale, and Jiangchuan Liu. Understanding the characteristics of internet short video sharing: Youtube as a case study, Jul 2007. [CL07] Junbo Chen and Shanping Li. Gc-tree: A fast online algorithm for mining frequent closed itemsets. In PAKDD Workshops, pages 457–468, 2007. [CLW+ 06] Chih-Yi Chiu, Cheng-Hung Li, Hsiang-An Wang, Chu-Song Chen, and Lee-Feng Chien. A time warping based approach for video copy detection. In ICPR ’06: Proceedings of the 18th International Conference on Pattern Recognition, pages 228–231, Washington, DC, USA, 2006. IEEE Computer Society. [CPIZ07] Ondˇrej Chum, James Philbin, Michael Isard, and Andrew Zisserman. Scalable near identical image and shot detection. In Proc. 6th ACM intl. conf. on Image and video retrieval (CIVR’07), pages 549–556, Amsterdam, The Netherlands, 2007. ACM Press. [cSCZ03] Sen ching S. Cheung and Avideh Zakhor. Efficient video similarity measurement with video signature. IEEE Transactions on Circuits and Systems for Video Technology, 13:59– 74, 2003. [CWLW98] Edward Chang, James Wang, C. Li, and G. Wilderhold. Rime - a replicated image detector for the world-wide web. In Proceedings of SPIE Symposium of Voice, Video and Data Communications, pages 58–67, November 1998. [DA08] A.B. Dahl and H. Aanaes. Effective image database search via dimensionality reduction. In InterNet08, pages 1–6, 2008. ´ [DLA+ 07] Kristleifur Dadason, Herwig Lejsek, Fridrik Asmundsson, Bj¨orn J´onsson, and Laurent Amsaleg. Videntifier: identifying pirated videos in real-time. In Proceedings of the 15th international conference on Multimedia (MULTIMEDIA’07), pages 471–472, New York, NY, USA, 2007. ACM. [DWCL08] Wei Dong, Zhe Wang, Moses Charikar, and Kai Li. Efficiently matching sets of features with random histograms. In MM ’08: Proceeding of the 16th ACM international conference on Multimedia, pages 179–188, New York, NY, USA, 2008. ACM. [EM99] S. Eickeler and S. Muller. Content-based video indexing of tv broadcast news using hidden markov models. In ICASSP ’99: Proceedings of the Acoustics, Speech, and Signal Processing, 1999. on 1999 IEEE International Conference, pages 2997–3000, Washington, DC, USA, 1999. IEEE Computer Society.


[FB81] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. [FKS03] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data (SIGMOD’03), pages 301–312, New York, NY, USA, 2003. ACM. [Foo07] Jun Jie Foo. Detection of Near-duplicates in Large Image Collections. Ph.D. dissertation, School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Melbourne, Victoria, Australia, 2007. [FTAA00] Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal, and Amr El Abbadi. Vector approximation based indexing for non-uniform high dimensional data sets. In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, pages 202–209, New York, NY, USA, 2000. ACM. [FZST07] Jun Jie Foo, Justin Zobel, Ranjan Sinha, and S. M. M. Tahaghoghi. Detection of nearduplicate images for web search. In Proceedings of the 6th ACM international conference on Image and video retrieval (CIVR’07), pages 557–564, New York, NY, USA, 2007. ACM. [GALM07] Phillipa Gill, Martin Arlitt, Zongpeng Li, and Anirban Mahanti. Youtube traffic characterization: a view from the edge. In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 15–28, New York, NY, USA, 2007. ACM. [GB08] Nicolas Gengembre and Sid-Ahmed Berrani. A probabilistic framework for fusing framebased searches within a video copy detection system. In CIVR ’08: Proceedings of the 2008 international conference on Content-based image and video retrieval, pages 211– 220, New York, NY, USA, 2008. ACM. [GD07] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Efficient learning with sets of features. J. Mach. Learn. Res., 8:725–760, 2007. [GIM99] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB’99), pages 518–529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. [Hen98] Andreas Henrich. The LSDh-tree: An access structure for feature vectors. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98), pages 362–369, Washington, DC, USA, 1998. IEEE Computer Society. [HHB02] Arun Hampapur, Kiho Hyun, and Ruud M. Bolle. Comparison of sequence matching techniques for video copy detection. In Minerva M. Yeung, Chung-Sheng Li, and


Rainer W. Lienhart, editors, Proc. Conf. on Storage and Retrieval for Media Databases, pages 194–201, December 2002. [HS05] Michael E. Houle and Jun Sakuma. Fast approximate similarity search in extremely highdimensional data sets. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), pages 619–630, Washington, DC, USA, 2005. IEEE Computer Society. [HSW89] Andreas Henrich, Hans-Werner Six, and Peter Widmayer. The LSD tree: spatial access to multidimensional and non-point objects. In VLDB’89: Proceedings of the 15th international conference on Very large data bases, pages 45–53, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc. [HWL03] Chu-Hong Hoi, Wei Wang, and Michael R. Lyu. A novel scheme for video similarity detection. In CIVR, pages 373–382, 2003. [HZ03] Timothy C. Hoad and Justin Zobel. Fast video matching with signature alignment. In MIR ’03: Proceedings of the 5th ACM SIGMM international workshop on Multimedia information retrieval, pages 262–269, New York, NY, USA, 2003. ACM. [IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, New York, NY, USA, 1998. ACM. [JASG08] Herve Jegou, Laurent Amsaleg, Cordelia Schmid, and Patrick Gros. Query-adaptative locality sensitive hashing. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, apr 2008. [JB08] Alexis Joly and Olivier Buisson. A posteriori multi-probe locality sensitive hashing. In ACM ’08: Proceedings of ACM Multimedia conference, Vancouver, Canada, 2008. [JBF07] Alexis Joly, Olivier Buisson, and Carl Fr´elicot. Content-based copy detection using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 9(2):293–306, 2007. [JCL03] A. Jaimes, Shih-Fu Chang, and A. C. Loui. Detection of non-identical duplicate consumer photographs. In 4th Pacific Rim Conference on Multimedia, volume 1, pages 16–20, 2003. [JDS08] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Andrew Zisserman David Forsyth, Philip Torr, editor, European Conference on Computer Vision, volume I of LNCS, pages 304–317. Springer, oct 2008. [JFB03] Alexis Joly, Carl Fr´elicot, and Olivier Buisson. Robust content-based video copy identification in a large reference database. In Intl. Conf. on Image and Video Retrieval (CIVR’03), pages 414–424, Urbana-Champaign, IL, USA, 2003.


[JFB05] Alexis Joly, Carl Fr´elicot, and Olivier Buisson. Discriminant local features selection using efficient density estimation in a large database. In Proc. 7th ACM SIGMM intl. workshop on Multimedia Information Retrieval (MIR’05), pages 201–208, New York, NY, USA, 2005. ACM Press. [Jol07] Alexis Joly. New local descriptors based on dissociated dipoles. In CIVR ’07: Proceedings of the 6th ACM international conference on Image and video retrieval, pages 573–580, New York, NY, USA, 2007. ACM. [JT05] Frederic Jurie and Bill Triggs. Creating efficient codebooks for visual recognition. In ICCV ’05: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, pages 604–610, Washington, DC, USA, 2005. IEEE Computer Society. [KS97] Norio Katayama and Shin’ichi Satoh. The sr-tree: An index structure for highdimensional nearest neighbor queries. In Joan Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 369–380. ACM Press, 1997. [KS04] Yan Ke and Rahul Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’04), volume 02, pages 506–513, Los Alamitos, CA, USA, 2004. IEEE Computer Society. [KSH04] Yan Ke, Rahul Sukthankar, and Larry Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on Multimedia, pages 869–876, New York, NY, USA, 2004. ACM. [KV05] C. Kim and B. Vasudev. Spatiotemporal sequence matching for efficient video copy detection. CirSysVideo, 15(1):127–132, January 2005. [KWLL05] Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee. n-gram/2l: a space and time efficient two-level n-gram inverted index structure. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 325–336. VLDB Endowment, 2005. ´ [LAJA05] Herwig Lejsek, Fridrik Heidar Asmundsson, Bj¨orn Thor J´onsson, and Laurent Amsaleg. Efficient and effective image copyright enforcement. In V´eronique Benzaken, editor, BDA, 2005. [LELD05] Eugene Lin, Ahmet Eskicioglu, Reginald Lagendijk, and Edward Delp. Advances in digital video content protection. Proceedings of the IEEE, 93(1):171–183, January 2005. [LJW+ 07] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe lsh: efficient indexing for high-dimensional similarity search. In VLDB ’07: Proceedings


of the 33rd international conference on Very large data bases, pages 950–961. VLDB Endowment, 2007. [LK01] J. K. Lawder and P. J. H. King. Querying multi-dimensional data indexed using the hilbert space-filling curve. SIGMOD Rec., 30(1):19–24, 2001. [LLL01] Swanwa Liao, Mario A. Lopez, and Scott T. Leutenegger. High dimensional similarity search with space filling curves. International Conference on Data Engineering, 0:0615, 2001. [Low99] David G. Lowe. Object recognition from local scale-invariant features. In Proc. Intl. Conference on Computer Vision (ICCV’99), Volume 2), pages 1150–1157, Washington, DC, USA, 1999. IEEE Computer Society. [Low04] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. [LTBGBB06] Julien Law-To, Olivier Buisson, Valerie Gouet-Brunet, and Nozha Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In Proceedings of the 14th annual ACM international conference on Multimedia, pages 835–844, New York, NY, USA, 2006. ACM Press. [LTCJ+ 07] Julien Law-To, Li Chen, Alexis Joly, Ivan Laptev, Olivier Buisson, Valerie Gouet-Brunet, Nozha Boujemaa, and Fred Stentiford. Video copy detection: a comparative study. In Proceedings of the 6th ACM international conference on Image and video retrieval (CIVR’07), pages 371–378, New York, NY, USA, 2007. ACM. [MAW06] Bertini Marco, Del Bimbo Alberto, and Nunziati Walter. Video clip matching using mpeg7 descriptors and edit distance. In Proc. of ACM International Conference on Image and Video Retrieval (CIVR), LNCS, pages 133–142, Tempe, AZ, USA, July 2006. Springer. [MDPI04] E. Mriwka, A. Dorado, W. Pedrycz, and E. Izquierdo. Dimensionality reduction for content-based image classification. International Conference on Information Visualisation, 0:435–438, 2004. [MS01] Krystian Mikolajczyk and Cordelia Schmid. Indexing based on scale invariant interest points. In International Conference on Computer Vision, volume 1, pages 525–531, 2001. [MS04] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63–86, 2004. [MTJ07] Frank Moosmann, Bill Triggs, and Frederic Jurie. Fast discriminative visual codebooks using randomized clustering forests. In In NIPS, 2007. [MTS+ 05] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. Int. J. Computer Vision, 65(1-2):43–72, 2005.


[MZ96] Alistair Moffat and Justin Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4):349–379, 1996. [NG08] Xavier Naturel and Patrick Gros. Detecting repeats for video structuring. Multimedia Tools Appl., 38(2):233–252, 2008. [NJT06] Eric Nowak, Frederic Jurie, and Bill Triggs. Sampling strategies for bag-of-features image classification. In European Conference on Computer Vision. Springer, 2006. [OKH02] Job Oostveen, Ton Kalker, and Jaap Haitsma. Feature extraction and a database strategy for video fingerprinting. In Proc. 5th Intl. Conf. on Recent Advances in Visual Information Systems (VISUAL’02), pages 117–128, London, UK, 2002. Springer-Verlag. [QFG06] Till Quack, Vittorio Ferrari, and Luc J. Van Gool. Video mining with frequent itemset configurations. In Hari Sundaram, Milind R. Naphade, John R. Smith, and Yong Rui, editors, CIVR, volume 4071 of Lecture Notes in Computer Science, pages 360–369. Springer, 2006. [RLSP06] F. Rothganger, Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Object modeling and recognition using local affine-invariant image descriptors and multi-view spatial contraints. International Journal of Computer Vision, 66(3), 2006. [Rob81] John T. Robinson. The k-d-b-tree: a search structure for large multidimensional dynamic indexes. In SIGMOD ’81: Proceedings of the 1981 ACM SIGMOD international conference on Management of data, pages 10–18, New York, NY, USA, 1981. ACM. [Sam06] Hanan Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006. [Sat02] Shin’ichi Satoh. News video analysis based on identical shot detection. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’02), pages 69–72, 2002. [Shi99] Narayanan Shivakumar. Detecting digital copyright violations on the internet. PhD thesis, Stanford University, Stanford, CA, USA, 1999. Adviser-Hector Garcia-Molina. [SK04] Sunita Sarawagi and Alok Kirpal. Efficient set joins on similarity predicates. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data (SIGMOD’04), pages 743–754, New York, NY, USA, 2004. ACM. [SM97] Cordelia Schmid and Roger Mohr. Local grayvalue invariants for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 19(5):530–535, 1997. [SOZ05] Heng Tao Shen, Beng Chin Ooi, and Xiaofang Zhou. Towards effective indexing for very large video sequence database. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 730–741, New York, NY, USA, 2005. ACM.


[STA07] Shin’ichi Satoh, Masao Takimoto, and Jun Adachi. Scene duplicate detection from videos based on trajectories of feature points. In Proceedings of the international workshop on Multimedia Information Retrieval (MIR’07), pages 237–244, New York, NY, USA, 2007. ACM. [SZ02] Frederik Schaffalitzky and Andrew Zisserman. Multi-view matching for unordered image sets, or ”how do I organize my holiday snaps?”. In Proceedings of the 7th European Conference on Computer Vision (ECCV’02), pages 414–431, London, UK, 2002. Springer-Verlag. [SZ03] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV’03), pages 1470–1477, Washington, DC, USA, 2003. IEEE Computer Society. [SZH+ 07] Heng Tao Shen, Xiaofang Zhou, Zi Huang, Jie Shao, and Xiangmin Zhou. Uqlips: a realtime near-duplicate video clip detection system. In VLDB ’07: Proceedings of the 33rd international conference on Very large data bases, pages 1374–1377. VLDB Endowment, 2007. [TSS06] Masao Takimoto, Shin’ichi Satoh, and Masao Sakauchi. Identification and detection of the same scene based on flash light patterns. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’06), pages 9–12, Los Alamitos, CA, USA, 2006. IEEE Computer Society. [WBM94] Ian H. Witten, Timothy C. Bell, and Alistair Moffat. Managing Gigabytes: Compressing and Indexing Documents and Images. John Wiley & Sons, Inc., New York, NY, USA, 1994. [WHN07a] Xiao Wu, Alexander G. Hauptmann, and Chong-Wah Ngo. Novelty detection for crosslingual news stories with visual duplicates and speech transcripts. In Proceedings of the 15th international conference on Multimedia, pages 168–177, New York, NY, USA, 2007. ACM. [WHN07b] Xiao Wu, Alexander G. Hauptmann, and Chong-Wah Ngo. Practical elimination of nearduplicates from web video search. In Proceedings of the 15th international conference on Multimedia, pages 218–227, New York, NY, USA, 2007. ACM. [WJ96] David A. White and Ramesh Jain. Similarity indexing with the ss-tree. In ICDE ’96: Proceedings of the Twelfth International Conference on Data Engineering, pages 516– 523, Washington, DC, USA, 1996. IEEE Computer Society. [WMS00] P. Wu, B.S. Manjunath, and H.D. Shin. Dimensionality reduction for image retrieval. In ICIP00, pages Vol III: 726–729, 2000.


[WZN07] Xiao Wu, Wan-Lei Zhao, and Chong-Wah Ngo. Near-duplicate keyframe retrieval with visual keywords and semantic context. In Proceedings of the 6th ACM international Conference on Image and Video Retrieval (CIVR’07), pages 162–169, New York, NY, USA, 2007. ACM. [YOZ08] Ying Yan, Beng Chin Ooi, and Aoying Zhou. Continuous content-based copy detection over streaming videos. In ICDE, pages 853–862, 2008. [YSS04] Fuminori Yamagishi, Shin’ichi Satoh, and Masao Sakauchi. A news video browser using identical video segment detection. In Kiyoharu Aizawa, Yuichi Nakamura, and Shin’ichi Satoh, editors, PCM (2), volume 3332 of Lecture Notes in Computer Science, pages 205– 212. Springer, 2004. [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In H. V. Jagadish and Inderpal Singh Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, pages 103–114. ACM Press, 1996. [ZS05] Yun Zhai and Mubarak Shah. Tracking news stories across different sources. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 2–10, New York, NY, USA, 2005. ACM.