Efficient Visual Search of Videos Cast as Text Retrieval
Josef Sivic and Andrew Zisserman, PAMI 2009
Presented by: John Paisley, Duke University
Outline
• Introduction
• Text retrieval review
• Object retrieval in video
• Experiments
• Conclusion
Introduction
• Goal: Retrieve objects in a video database similar to a queried object.
• This work aims to cast this problem as a text retrieval problem.
– In text retrieval, each document is an object and each word is given an index. Each document then is represented by a vector of the counts of each word.
– Can we treat video the same way? Each frame is treated as a document. Multiple feature vectors are extracted from a single frame and quantized, and each quantized value is treated as a visual word.
– Text retrieval algorithms can then be used.
Text Retrieval
• As mentioned, each document is represented by a vector. The standard way of obtaining this vector is via “term frequency-inverse document frequency” (tf-idf) weighting.
• Document retrieval then proceeds by scoring each document’s vector against the query vector and sorting the documents in descending order of similarity.
• If these vectors are normalized, the Euclidean distance can be used equivalently.
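The weighting and ranking above can be sketched in a few lines. This is a minimal illustration of tf-idf (tf = word count over document length, idf = log of corpus size over document frequency) and cosine ranking; the function names and the tiny toy corpus in the test are illustrative, not from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """tf-idf vector per document: (n_id / n_d) * log(N / n_i)."""
    N = len(docs)
    df = Counter()                       # n_i: documents containing word i
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        n_d = len(doc)                   # words in this document
        vec = [(counts[w] / n_d) * math.log(N / df[w]) if df[w] else 0.0
               for w in vocab]
        vecs.append(vec)
    return vecs

def cosine(u, v):
    """Similarity of two vectors; on unit vectors this ranks the same
    as (negated) Euclidean distance."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking a query then amounts to computing `cosine(query_vec, doc_vec)` for every document and sorting in descending order.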
Object Retrieval in Video: Viewpoint Invariant Description
• Goal: Extract a description of an object that is unaffected by viewpoint, scale, illumination, etc.
• To do this, for each frame, use segmentation algorithms to define regions of interest (two are used here). Roughly 1,200 regions are computed per frame. Each region is represented as a 128-dimensional vector using the SIFT descriptor.
• To get rid of bogus regions, they are tracked over a few frames to verify that they are stable, and therefore potentially interesting. This reduces the number of feature vectors to about 600 per frame.
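The stability filter can be sketched as follows. This is only an illustration of the idea (keep a region if it can be tracked through the next few frames); the nearest-descriptor matching rule and the `max_dist` threshold are assumptions for the sketch, not the paper's tracker.

```python
import numpy as np

def stable_regions(frames_desc, min_track_len=3, max_dist=0.5):
    """frames_desc: list of (n_i, 128) arrays, one SIFT-descriptor array
    per frame. Returns a boolean mask over frame 0's regions: True if the
    region can be tracked through the next min_track_len - 1 frames by
    nearest-descriptor matching (an illustrative criterion)."""
    cur = frames_desc[0]
    alive = np.ones(len(cur), dtype=bool)
    track = cur.copy()                       # current descriptor of each track
    for nxt in frames_desc[1:min_track_len]:
        for i in np.flatnonzero(alive):
            d = np.linalg.norm(nxt - track[i], axis=1)
            j = int(np.argmin(d))
            if d[j] < max_dist:
                track[i] = nxt[j]            # continue the track
            else:
                alive[i] = False             # track lost: reject as unstable
    return alive
```

Only the surviving (stable) regions contribute feature vectors, which is what cuts roughly 1,200 regions per frame down to about 600.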
Object Retrieval in Video: Building a Visual Vocabulary
• Each frame is now represented by roughly a 128 x 600 matrix of descriptors.
• To go from images to words, build a global dictionary using vector quantization (e.g., K-means) and quantize the feature vectors. In this paper, K-means is run with the Mahalanobis distance.
• These clusters are found separately for each segmentation algorithm. In all, the authors use 16,000 clusters (or words).
• Each frame is now represented as a 16,000-dimensional vector of counts of observations in each cluster. Words that arise frequently across documents are thrown out as stop words.
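A minimal sketch of the vocabulary-building and quantization step. For simplicity this uses plain Euclidean K-means (the paper uses the Mahalanobis distance), a tiny descriptor dimension, and made-up function names.

```python
import numpy as np

def build_vocab(descriptors, k, iters=20, seed=0):
    """Cluster descriptors into k visual words with Euclidean K-means
    (the paper uses the Mahalanobis distance instead)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def frame_histogram(frame_desc, centers):
    """Quantize one frame's descriptors and count visual-word occurrences,
    giving the frame's bag-of-words vector."""
    d = np.linalg.norm(frame_desc[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))
```

In the paper the histogram would be 16,000-dimensional, one bin per visual word, with very frequent words discarded as stop words before tf-idf weighting.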
Object Retrieval in Video: Spatial Consistency
• Given a queried object, there is information in the spatial relationships of its regions of interest that can help the ranking.
• This is done by first returning results using the text retrieval algorithm discussed above, and then re-ranking: a matched region supports a frame when its nearest spatial neighbors in the query region also match regions near it in the retrieved frame.
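The re-ranking idea can be sketched as a spatial-consistency score. This is an illustrative simplification of the paper's neighborhood-voting scheme (the paper votes over the 15 nearest neighbors; here `k` and the scoring rule are assumptions for the sketch).

```python
import numpy as np

def spatial_score(q_pos, q_words, f_pos, f_words, k=5):
    """For each query/frame region pair sharing a visual word, count how
    many visual words are shared between their k nearest spatial
    neighborhoods. Positions are (n, 2) arrays; words are int arrays."""
    score = 0
    for i, w in enumerate(q_words):
        for j, v in enumerate(f_words):
            if w != v:
                continue                      # only word matches can vote
            # indices of the k nearest spatial neighbors (excluding self)
            qi = np.argsort(np.linalg.norm(q_pos - q_pos[i], axis=1))[1:k + 1]
            fj = np.argsort(np.linalg.norm(f_pos - f_pos[j], axis=1))[1:k + 1]
            score += len(set(q_words[qi]) & set(f_words[fj]))
    return score
```

Frames returned by the tf-idf ranking are then re-sorted by this score, so frames that preserve the query's spatial layout rise to the top.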
Object Retrieval Process
• A feature-length film usually has 100K-150K frames. Using one frame per second reduces this to 4K-6K frames.
• Features are extracted and quantized as discussed.
• The user selects a query region. “Words” are extracted as well as spatial relationships.
• A desired number of frames are returned using the text retrieval algorithm and re-ranked using the spatial consistency method.
Experiments
• Results use the movies “Groundhog Day,” “Run Lola Run,” and “Casablanca.”
• Six objects of interest were selected and searched for.
• An additional benefit of the proposed method is speed.
Experiments
• Fig. 16 shows the effect of vocabulary size.
• Table 2 shows the effect of building the dictionary from the movie being searched, from a different movie, and from both movies.
• Table 3 shows the combination of the two.
Experiments: Different Distance Measures
Conclusion
• Vector quantization does not seem to degrade performance, while retrieval becomes significantly faster.
• Using spatial information via spatial consistency reranking was shown to significantly improve results.
• This can be extended to temporal information as well.