Efficient Visual Search of Videos Cast as Text Retrieval
Josef Sivic and Andrew Zisserman, PAMI 2009
Presented by: John Paisley, Duke University
Outline
• Introduction
• Text retrieval review
• Object retrieval in video
• Experiments
• Conclusion
Introduction
• Goal: Retrieve objects in a video database similar to a queried object.
• This work aims to cast this problem as a text retrieval problem.
– In text retrieval, each document is an object and each word is given an index. Each document then is represented by a vector of the counts of each word.
– Can we treat video the same way? Each frame is treated as a document. Multiple feature vectors are extracted from a single frame and quantized, and each quantized value is treated as a visual word.
– Text retrieval algorithms can then be used.
Text Retrieval
• As mentioned, each document is represented by a vector. The standard way of obtaining this vector is via “term frequency-inverse document frequency” (tf-idf) weighting.
• Document retrieval then proceeds by scoring each document’s vector against the query vector and sorting the documents in descending order of similarity.
• If these vectors are normalized, the Euclidean distance can be used equivalently.
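The weighting and ranking above can be sketched in a few lines. This is a minimal illustration of tf-idf (tf = word count over document length, idf = log of corpus size over document frequency) and cosine ranking; the function names and the tiny toy corpus in the test are illustrative, not from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """tf-idf vector per document: (n_id / n_d) * log(N / n_i)."""
    N = len(docs)
    df = Counter()                       # n_i: documents containing word i
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        n_d = len(doc)                   # words in this document
        vec = [(counts[w] / n_d) * math.log(N / df[w]) if df[w] else 0.0
               for w in vocab]
        vecs.append(vec)
    return vecs

def cosine(u, v):
    """Similarity of two vectors; on unit vectors this ranks the same
    as (negated) Euclidean distance."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking a query then amounts to computing `cosine(query_vec, doc_vec)` for every document and sorting in descending order.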
Object Retrieval in Video: Viewpoint Invariant Description
• Goal: Extract a description of an object that is unaffected by viewpoint, scale, illumination, etc.
• To do this, for each frame, use segmentation algorithms to define regions of interest (two are used here). Roughly 1,200 regions are computed per frame. Each region is represented as a 128-dimensional vector using the SIFT descriptor.
• To get rid of bogus regions, they are tracked over a few frames to verify that they are stable, and therefore potentially interesting. This reduces the number of feature vectors to about 600 per frame.
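The stability filter can be sketched as follows. This is only an illustration of the idea (keep a region if it can be tracked through the next few frames); the nearest-descriptor matching rule and the `max_dist` threshold are assumptions for the sketch, not the paper's tracker.

```python
import numpy as np

def stable_regions(frames_desc, min_track_len=3, max_dist=0.5):
    """frames_desc: list of (n_i, 128) arrays, one SIFT-descriptor array
    per frame. Returns a boolean mask over frame 0's regions: True if the
    region can be tracked through the next min_track_len - 1 frames by
    nearest-descriptor matching (an illustrative criterion)."""
    cur = frames_desc[0]
    alive = np.ones(len(cur), dtype=bool)
    track = cur.copy()                       # current descriptor of each track
    for nxt in frames_desc[1:min_track_len]:
        for i in np.flatnonzero(alive):
            d = np.linalg.norm(nxt - track[i], axis=1)
            j = int(np.argmin(d))
            if d[j] < max_dist:
                track[i] = nxt[j]            # continue the track
            else:
                alive[i] = False             # track lost: reject as unstable
    return alive
```

Only the surviving (stable) regions contribute feature vectors, which is what cuts roughly 1,200 regions per frame down to about 600.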
Object Retrieval in Video: Building a Visual Vocabulary
• Each frame is now represented by roughly a 128 x 600 matrix of descriptors.
• To go from images to words, build a global dictionary using vector quantization (e.g., K-means) and quantize the feature vectors. In this paper, K-means is run with the Mahalanobis distance.
• These clusters are found separately for each segmentation algorithm. In all, the authors use 16,000 clusters (or words).
• Each frame is now represented as a 16,000-dimensional vector of counts of observations in each cluster. Words that arise frequently across documents are thrown out as stop words.
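A minimal sketch of the vocabulary-building and quantization step. For simplicity this uses plain Euclidean K-means (the paper uses the Mahalanobis distance), a tiny descriptor dimension, and made-up function names.

```python
import numpy as np

def build_vocab(descriptors, k, iters=20, seed=0):
    """Cluster descriptors into k visual words with Euclidean K-means
    (the paper uses the Mahalanobis distance instead)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def frame_histogram(frame_desc, centers):
    """Quantize one frame's descriptors and count visual-word occurrences,
    giving the frame's bag-of-words vector."""
    d = np.linalg.norm(frame_desc[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))
```

In the paper the histogram would be 16,000-dimensional, one bin per visual word, with very frequent words discarded as stop words before tf-idf weighting.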
Object Retrieval in Video: Spatial Consistency
• Given a queried object, there is information in the spatial relationships of its regions of interest that can help the ranking.
• This is done by first returning results using the text retrieval algorithm discussed above, and then re-ranking: a matched region supports a frame when its nearest spatial neighbors in the query region also match regions near it in the retrieved frame.
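The re-ranking idea can be sketched as a spatial-consistency score. This is an illustrative simplification of the paper's neighborhood-voting scheme (the paper votes over the 15 nearest neighbors; here `k` and the scoring rule are assumptions for the sketch).

```python
import numpy as np

def spatial_score(q_pos, q_words, f_pos, f_words, k=5):
    """For each query/frame region pair sharing a visual word, count how
    many visual words are shared between their k nearest spatial
    neighborhoods. Positions are (n, 2) arrays; words are int arrays."""
    score = 0
    for i, w in enumerate(q_words):
        for j, v in enumerate(f_words):
            if w != v:
                continue                      # only word matches can vote
            # indices of the k nearest spatial neighbors (excluding self)
            qi = np.argsort(np.linalg.norm(q_pos - q_pos[i], axis=1))[1:k + 1]
            fj = np.argsort(np.linalg.norm(f_pos - f_pos[j], axis=1))[1:k + 1]
            score += len(set(q_words[qi]) & set(f_words[fj]))
    return score
```

Frames returned by the tf-idf ranking are then re-sorted by this score, so frames that preserve the query's spatial layout rise to the top.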
Object Retrieval Process
• A feature-length film usually has 100K-150K frames. Using one frame per second reduces this to 4K-6K frames.
• Features are extracted and quantized as discussed.
• The user selects a query region. “Words” are extracted as well as spatial relationships.
• A desired number of frames are returned using the text retrieval algorithm and re-ranked using the spatial consistency method.
Experiments
• Results use the movies “Groundhog Day,” “Run Lola Run,” and “Casablanca.”
• Six objects of interest were selected and searched for.
• An additional benefit of the proposed method is speed.
Experiments
• Fig. 16 shows the effect of vocabulary size.
• Table 2 shows the effect of building the dictionary from the movie being searched, from a different movie, and from both movies.
• Table 3 shows the combination of the two.
Experiments: Different Distance Measures
Conclusion
• Vector quantization does not seem to degrade performance, while retrieval becomes significantly faster.
• Using spatial information via spatial consistency reranking was shown to significantly improve results.
• This can be extended to temporal information as well.