Efficient Visual Search of Videos Cast as Text Retrieval

Josef Sivic and Andrew Zisserman, PAMI 2009. Presented by: John Paisley, Duke University.



TRANSCRIPT

Page 1: Efficient Visual Search of Videos Cast as Text Retrieval

Efficient Visual Search of Videos Cast as Text Retrieval

Josef Sivic and Andrew Zisserman, PAMI 2009

Presented by: John Paisley, Duke University

Page 2: Efficient Visual Search of Videos Cast as Text Retrieval

Outline

• Introduction

• Text retrieval review

• Object retrieval in video

• Experiments

• Conclusion

Page 3: Efficient Visual Search of Videos Cast as Text Retrieval

Introduction

• Goal: Retrieve objects in a video database that are similar to a queried object.

• This work aims to cast this problem as a text retrieval problem.

– In text retrieval, each document is the unit of retrieval and each word is given an index. Each document is then represented by a vector of the counts of each word.

– Can we treat video the same way? Each frame is treated as a document. Multiple feature vectors are extracted from a single frame and quantized, with the quantized values then treated as words.

– Text retrieval algorithms can then be used; a minimal sketch of this frame-as-document representation follows.
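To make the analogy concrete, here is a minimal sketch (not from the paper) of turning one frame's descriptors into a "document" of word counts. The names `bow_histogram` and `codebook` are placeholders; the codebook stands for a visual vocabulary trained as described on the later slides.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Turn one frame's local descriptors into a bag-of-visual-words vector.

    descriptors: (n_regions, 128) array of SIFT descriptors for the frame
    codebook:    (n_words, 128) array of cluster centres (visual vocabulary)
    """
    # Assign each descriptor to its nearest visual word (plain Euclidean
    # distance here; the paper clusters with the Mahalanobis distance).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # Count word occurrences, exactly like term counts in a text document.
    return np.bincount(words, minlength=codebook.shape[0])
```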

Page 4: Efficient Visual Search of Videos Cast as Text Retrieval

Text Retrieval

• As mentioned, each document is represented by a vector. The standard way of weighting this vector is "term frequency-inverse document frequency" (tf-idf); the formula is given after this list.

• Document retrieval then proceeds by scoring every document vector against the query vector and sorting the documents in descending order of similarity.

• If these vectors are L2-normalized, ranking by Euclidean distance is equivalent to ranking by the normalized scalar product (cosine similarity).
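The weighting formula on the original slide did not survive the transcript. For reference, the standard tf-idf weight assigned to visual word i in document (frame) d, and the ranking score, are:

```latex
t_{i,d} = \frac{n_{id}}{n_d}\,\log\frac{N}{n_i},
\qquad
\operatorname{sim}(q, d) = \frac{\mathbf{v}_q \cdot \mathbf{v}_d}{\lVert \mathbf{v}_q \rVert \, \lVert \mathbf{v}_d \rVert}
```

where n_id is the number of occurrences of word i in document d, n_d the total number of words in document d, n_i the number of documents containing word i, and N the total number of documents in the database; v_q and v_d are the tf-idf vectors of the query and of document d.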

Page 5: Efficient Visual Search of Videos Cast as Text Retrieval

Object Retrieval in Video: Viewpoint Invariant Description

• Goal: Extract a description of an object that is unaffected by viewpoint, scale, illumination, etc.

• To do this, for each frame, two types of region detectors are used to define regions of interest (the paper's Shape Adapted and Maximally Stable regions). Roughly 1,200 regions are computed for each frame, and each region is represented as a 128-dimensional vector using the SIFT descriptor.

• To reject spurious regions, each region is tracked over a few frames to make sure it is stable, and therefore potentially interesting. This reduces the number of feature vectors to about 600 per frame (a sketch of the per-frame extraction step follows).
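A minimal per-frame extraction sketch, assuming OpenCV is available. Note the substitutions: OpenCV's stock SIFT detector stands in for the paper's two affine-covariant detectors, and the tracking-based stability filter over neighbouring frames is omitted.

```python
import cv2

def frame_descriptors(frame_bgr):
    """Detect regions in one frame and compute 128-D SIFT descriptors.

    Stand-in for the paper's pipeline: OpenCV's DoG-based SIFT detector
    replaces the Shape Adapted / Maximally Stable detectors, and the
    multi-frame stability filtering is not shown.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: (n_regions, 128) float32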

Page 6: Efficient Visual Search of Videos Cast as Text Retrieval

Object Retrieval in Video: Building a Visual Vocabulary

• Now represent each frame by roughly a 128 × 600 matrix of descriptors.

• To go from images to words, build a global dictionary using vector quantization (e.g., K-means) and quantize the feature vectors. In this paper, K-means is used with the Mahalanobis distance.

• These clusters are found separately for each region detector. In all, the authors use 16,000 clusters (or words).

• Each frame is now represented as a 16,000-dimensional vector of counts of the number of observations in each cluster. Words that arise frequently across documents are thrown out as stop words (see the sketch after this slide).
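A rough sketch of the vocabulary-building step using scikit-learn. Euclidean K-means on a pooled descriptor subsample stands in for the paper's per-detector, Mahalanobis-distance clustering, and the 5% stop-word fraction is an illustrative choice rather than the paper's exact setting.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sample, n_words=16000):
    """Cluster a subsample of descriptors into a visual vocabulary.

    The paper clusters each region type separately (summing to 16,000
    words) and uses Mahalanobis distance; plain Euclidean K-means on a
    pooled subsample is used here for simplicity.
    """
    km = KMeans(n_clusters=n_words, n_init=1).fit(descriptor_sample)
    return km.cluster_centers_

def stop_words(histograms, top_fraction=0.05):
    """Indices of the most frequent visual words, to be dropped.

    histograms: (n_frames, n_words) matrix of per-frame word counts.
    """
    doc_freq = (histograms > 0).sum(axis=0)         # frames containing each word
    n_stop = int(top_fraction * histograms.shape[1])
    return np.argsort(doc_freq)[-n_stop:]           # most ubiquitous words
```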

Page 7: Efficient Visual Search of Videos Cast as Text Retrieval

Object Retrieval in Video: Spatial Consistency

• Given a queried object, there is information in the spatial relationships of the regions of interest that can help the ranking.

• This is done by first returning results using the text retrieval algorithm discussed before, and then re-ranking frames by how consistently the matched regions agree with their k nearest spatial neighbors in both the query and the retrieved frame (a simplified sketch follows).
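A simplified reading of the re-ranking step, not a line-by-line reimplementation of the paper's scheme: each putative word match votes for every other match that lies among its k nearest spatial neighbours in both frames, and retrieved frames are re-ranked by their total vote count.

```python
import numpy as np

def spatial_consistency_score(q_xy, r_xy, k=15):
    """Vote count for one retrieved frame.

    q_xy, r_xy: (n_matches, 2) positions of corresponding matched regions
    in the query and in the retrieved frame (row i of each array is one
    putative visual-word match).
    """
    n = len(q_xy)
    k = min(k, n - 1)
    votes = 0
    for i in range(n):
        dq = np.linalg.norm(q_xy - q_xy[i], axis=1)
        dr = np.linalg.norm(r_xy - r_xy[i], axis=1)
        near_q = set(np.argsort(dq)[1:k + 1])   # k nearest neighbours in query
        near_r = set(np.argsort(dr)[1:k + 1])   # k nearest neighbours in result
        votes += len(near_q & near_r)           # matches consistent in both
    return votes
```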

Page 8: Efficient Visual Search of Videos Cast as Text Retrieval

Object Retrieval Process

• A feature-length film usually has 100K-150K frames. Using one frame per second reduces this to 4K-6K frames.

• Features are extracted and quantized as discussed.

• The user selects a query region. “Words” are extracted from it, along with their spatial relationships.

• A desired number of frames is returned using the text retrieval algorithm and re-ranked using the spatial consistency method (an inverted-index sketch follows).
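As in text search engines, the paper implements retrieval with an inverted file index, which is where the speed noted in the experiments comes from. A minimal sketch, with `frame_words` a hypothetical mapping from frame id to that frame's visual-word ids:

```python
from collections import defaultdict

def build_inverted_index(frame_words):
    """Inverted file: visual word id -> set of frames containing it."""
    index = defaultdict(set)
    for frame_id, words in frame_words.items():
        for w in set(words):
            index[w].add(frame_id)
    return index

def candidate_frames(index, query_words):
    """Only frames sharing at least one word with the query need scoring."""
    hits = set()
    for w in set(query_words):
        hits |= index.get(w, set())
    return hits
```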

Page 9: Efficient Visual Search of Videos Cast as Text Retrieval

Experiments

• Results use the movies “Groundhog Day,” “Run Lola Run,” and “Casablanca.”

• Six objects of interest were selected and searched for.

• An additional benefit of the proposed method is speed.

Page 10: Efficient Visual Search of Videos Cast as Text Retrieval

Experiments

• Fig. 16 shows the effect of vocabulary size.

• Table 2 shows the effect of building the dictionary from the movie being searched, from a different movie, and from two movies combined.

• Table 3 shows the combined effect of the two factors.

Page 11: Efficient Visual Search of Videos Cast as Text Retrieval

Experiments: Different Distance Measures

Page 12: Efficient Visual Search of Videos Cast as Text Retrieval

Conclusion

• Vector quantization does not seem to degrade retrieval performance, while making search significantly faster.

• Using spatial information via spatial consistency reranking was shown to significantly improve results.

• This can be extended to temporal information as well.