near-duplicate video retrieval by aggregating intermediate cnn layers
TRANSCRIPT
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Giorgos Kordopatis-Zilos1,2, Symeon Papadopoulos1, Ioannis Patras2 and Yiannis Kompatsiaris1
1 Information Technologies Institute, CERTH, Thessaloniki, Greece
2 Queen Mary University of London, Mile End Campus, UK, E1 4NS
23rd International Conference on MultiMedia Modeling, Reykjavík, Iceland, 4-6 January 2017
Problem & Motivation
• Near-Duplicate Video Retrieval (NDVR)
  • Given a query video, search a video dataset to retrieve (visually) highly similar videos
  • Rank the candidate videos based on their similarity to the query
• Various applications
  • content verification
  • video retrieval, management and recommendation
  • copyright protection
• NDVR is of crucial importance due to the exponential growth of video content
Near-Duplicate Videos: Definition
• Variety of definitions and understandings of near-duplicate videos
• We adopt the definition by Wu et al. (2007), which covers:
  • photometric variations: gamma, contrast, brightness, etc.
  • editing operations: resize, shift, crop, flip
  • insertion of patterns: caption, logo, subtitles, sliding captions, etc.
  • re-encoding: video format, compression
  • video modifications: frame rate, frame insertion, deletion, swap
X. Wu, A. G. Hauptmann, and C. W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM International Conference on Multimedia, pp. 218-227, 2007
Related Work
• Variety of approaches (Liu et al., 2013)
  • Video-level matching: comparison of global signatures
    • global feature vectors
    • fingerprints
    • hash codes
  • Frame-level matching: frames or sequences
    • local descriptors
    • spatiotemporal features
  • Hybrid-level matching
    • filter-and-refine methods
• TRECVID content-based copy detection (Kraaij & Awad, 2011)
  • duplicates artificially generated by standard transformations
W. Kraaij and G. Awad. TRECVID 2011 content-based copy detection: Task overview. Proc. TRECVid 2010, 2011
J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang. Near-duplicate video retrieval: Current research and future trends. ACM Computing Surveys, vol. 45, no. 4, article 44, 2013
Feature Extraction (1/2)
• Employ a pre-trained CNN with convolutional layers
• Apply max pooling on every channel of the feature map of each layer (Zheng et al., 2016)
• One vector is generated per layer, with dimensionality equal to the number of channels of that layer's feature map

L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian. Good Practice in CNN Feature Transfer. arXiv:1604.00133, 2016
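As a concrete illustration of the pooling step above, here is a minimal NumPy sketch (function and variable names are ours, not from the paper): each channel of a convolutional feature map is max-pooled into a single value, so a layer with C channels yields a C-dimensional frame descriptor.

```python
import numpy as np

def layer_vector(feature_map):
    """Max-pool each channel of a conv-layer feature map (C, H, W)
    into one scalar, giving a C-dimensional frame descriptor."""
    return feature_map.reshape(feature_map.shape[0], -1).max(axis=1)

# toy feature map: 4 channels of 3x3 activations
fmap = np.arange(36, dtype=float).reshape(4, 3, 3)
vec = layer_vector(fmap)  # shape (4,): one value per channel
```

In practice the feature maps would come from a forward pass through the pre-trained network; the toy array only shows the pooling itself.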
Feature Extraction (2/2)
• Pre-trained CNN networks from Caffe (Jia et al., 2014): a) AlexNet, b) VGGNet, c) GoogLeNet
• Feature extraction uses the convolutional layers of these architectures
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM int. conference on Multimedia, pp. 675-678, 2014
(Figure: the AlexNet, VGGNet and GoogLeNet architectures)
Vector Aggregation
Layer Aggregation
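The vector- and layer-aggregation slides map CNN layer descriptors to visual words. As a minimal, hedged sketch of the underlying bag-of-words step (assuming a standard k-means codebook; the names and toy data are ours, and the paper's exact aggregation scheme may differ):

```python
import numpy as np

def assign_visual_words(descriptors, codebook):
    """Map each descriptor in (N, D) to the index of its nearest
    codebook centre in (K, D), i.e. its visual word."""
    # pairwise squared Euclidean distances via broadcasting -> (N, K)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# toy 2-word codebook and two frame descriptors
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
descs = np.array([[0.1, 0.0], [0.9, 1.2]])
words = assign_visual_words(descs, codebook)  # -> array([0, 1])
```

A video is then represented by the multiset of visual words assigned to its keyframes, which feeds directly into the tf-idf indexing described next.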
Video Indexing and Querying
• tf-idf weighting of visual words
• Inverted file indexing structure for fast search
• Retrieve candidates with at least one common visual word
• Rank candidates based on cosine similarity of their tf-idf representations
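The indexing and querying steps above can be sketched in a few lines of plain Python (a toy illustration with our own invented data; the paper's implementation details may differ):

```python
import math
from collections import Counter, defaultdict

# toy corpus: each video is a bag of visual-word ids
videos = {"v1": [0, 0, 1], "v2": [1, 2], "v3": [2, 2, 3]}

N = len(videos)
df = Counter(w for ws in videos.values() for w in set(ws))  # document frequency
idf = {w: math.log(N / df[w]) for w in df}

def tfidf(words):
    """L2-normalised tf-idf vector of a bag of visual words."""
    tf = Counter(words)
    vec = {w: tf[w] * idf.get(w, 0.0) for w in tf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}

index = defaultdict(set)  # inverted file: visual word -> videos containing it
vecs = {}
for vid, ws in videos.items():
    vecs[vid] = tfidf(ws)
    for w in set(ws):
        index[w].add(vid)

def query(words):
    """Retrieve videos sharing at least one visual word with the query,
    ranked by cosine similarity of tf-idf representations."""
    qv = tfidf(words)
    cands = set().union(*(index[w] for w in set(words) if w in index))
    sims = {v: sum(qv.get(w, 0.0) * vecs[v].get(w, 0.0) for w in qv)
            for v in cands}
    return sorted(sims.items(), key=lambda x: -x[1])
```

Note that `query([0, 1])` never touches `v3`, which shares no visual word with the query, mirroring the candidate-filtering step of the inverted file.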
Evaluation: Dataset
• Dataset: CC_WEB_VIDEO
  • Videos: 13,139
  • Keyframes: 397,965 images
CC_WEB_VIDEO: http://vireo.cs.cityu.edu.hk/webvideo/
Dataset Annotation
• Evaluation metrics
  • precision-recall (PR)
  • mean Average Precision (mAP)
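For reference, one common definition of these metrics (not necessarily the exact variant used in the paper): average precision (AP) averages the precision at each rank where a relevant video is retrieved, and mAP averages AP over all queries.

```python
def average_precision(ranked, relevant):
    """AP of one ranked result list: mean of precision@k over the
    positions k at which a relevant item appears."""
    hits, score = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["a", "b", "c"], {"a", "c"})  # (1/1 + 2/3) / 2
```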
Dataset Examples
(Figure: a query video alongside its near-duplicate videos)
Results I: Impact of CNN architecture and vocabulary size
Results II: Performance using individual layers
(Figure: per-layer performance for AlexNet, VGGNet and GoogLeNet)
Results III
• Performance per query
• Best runs
  • CNN-V: vector-based aggregation with GoogLeNet
  • CNN-L: layer-based aggregation with VGGNet
• Lower precision on hard queries
  • query 18 (Bus uncle)
  • query 22 (Numa Gary)
Evaluation: Comparison to SoA
• Color Histograms (CH) (Wu et al., 2007) - video-level matching based on global color histograms
• Auto Color Correlograms (ACC) (Cai et al., 2011) - frame-level matching with auto color correlograms, BoW, and tf-idf weighted cosine similarity
• Local Structure (LS) (Wu et al., 2007) - hybrid-level matching combining color histograms with keyframe similarity of PCA-SIFT descriptors
• Multiple Feature Hashing (MFH) (Song et al., 2013) - video-level matching that hashes multiple features into Hamming space and combines the keyframe hash codes into a global video representation
• Pattern-based approach (PPT) (Chou et al., 2015) - hybrid-level matching with a pattern-based indexing tree (PI-tree), m-pattern-based dynamic programming (mPDP), and time-shift m-pattern similarity (TPS)
Y. Cai, L. Yang, W. Ping, F. Wang, T. Mei, X. S. Hua, and S. Li. Million-scale near-duplicate video retrieval system. In Proceedings of the 19th ACM International Conference on Multimedia, pp. 837-838, 2011
J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008, 2013
C. L. Chou, H. T. Chen, and S. Y. Lee. Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos. IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382-395, 2015
Results IV: Comparison against existing NDVR approaches
Future Work
• Exploit C3D features (Tran et al., 2015)
• Conduct more comprehensive evaluations
  • more challenging datasets: larger scale, more similar but non-relevant videos (distractors)
• Partial-Duplicate Video Retrieval (PDVR)
  • assess the applicability of the approach to the PDVR problem
D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015
Thank you!
Get in touch:• George Kordopatis-Zilos: [email protected] • Symeon Papadopoulos: [email protected] / @sympap
With the support of: