SUPER: Towards Real-time Event Recognition in Internet Videos
Speeded Up Event Recognition. Yu-Gang Jiang, School of Computer Science, Fudan University, Shanghai, China. ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, Jun. 2012.
SUPER: Towards Real-time Event Recognition in Internet Videos
Yu-Gang Jiang
School of Computer Science, Fudan University
Shanghai, China
ACM ICMR 2012, Hong Kong, June 2012
Speeded Up Event Recognition
The Problem
• Recognize high-level events in videos
  We’re particularly interested in Internet consumer videos
• Applications: video search, personal video collection management, smart advertising, intelligence analysis, …
Our Objective
Improve Efficiency
Maintain Accuracy
The Baseline Recognition Framework
• Feature extraction: SIFT, spatial-temporal interest points (STIP), MFCC audio feature
• Classifier: χ2 kernel SVM
• Fusion: late average fusion
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
Best-performing approach in the TRECVID-2010 Multimedia Event Detection (MED) task
Three Audio-Visual Features…
• SIFT (visual) – D. Lowe, IJCV ’04
• STIP (visual) – I. Laptev, IJCV ’05
• MFCC (audio) … 16ms 16ms
Bag-of-words Representation
• SIFT / STIP / MFCC words
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
[Figure: Bag-of-SIFT. Panels: Vocabulary Generation (keypoint extraction with DoG and Hessian Affine detectors, SIFT feature space, Vocabulary 1 and Vocabulary 2) and BoW Representation (BoW histograms using soft-weighting).]
Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans on Audio, Speech, and Language Processing, 2010
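The soft-weighting step named above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each descriptor votes for its top-N most similar words, with the i-th nearest word receiving the similarity divided by 2^(i-1), and it assumes cosine similarity. The names `cosine`, `soft_weight_bow` and `top_n`, and the toy two-word vocabulary, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def soft_weight_bow(descriptors, vocabulary, top_n=4):
    """Soft-weighted bag-of-words histogram over `vocabulary` (illustrative)."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        sims = [cosine(d, w) for w in vocabulary]
        # Word indices ordered from most to least similar to this descriptor.
        ranked = sorted(range(len(vocabulary)), key=lambda k: -sims[k])
        for rank, k in enumerate(ranked[:top_n]):
            # The i-th nearest word (rank = i - 1) gets similarity / 2**(i - 1).
            hist[k] += sims[k] / (2 ** rank)
    return hist

# Toy example: two visual words, two descriptors (made-up values).
vocab = [[1.0, 0.0], [0.0, 1.0]]
hist = soft_weight_bow([[1.0, 0.0], [3.0, 4.0]], vocab, top_n=2)
print(hist)
```

Compared with hard assignment, a descriptor that falls between two words contributes to both, which is the motivation usually given for soft weighting.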
Baseline Speed…
• 4 factors on speed: feature, classifier, fusion, frame sampling
• Feature extraction: SIFT 82.0 s, spatial-temporal interest points (STIP) 916.8 s, MFCC audio feature 2.36 s
• Classifier (χ2 kernel SVM): ~2.0 s
• Late average fusion: <<1 s
Feature efficiency is measured in seconds needed for processing an 80-second video sequence (for SIFT: 0.5fps).
Classification time is measured by classifying a video using classifiers of all the 20 categories
Total: 1003 seconds per video!
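As a quick check of the total, the per-stage timings can be added up. The sketch below assumes the four numbers on the slide map, in order, to SIFT, STIP, MFCC and classification, and that fusion is negligible. The stage labels in the dictionary are that assumption, not something the slide states explicitly.

```python
# Per-stage cost in seconds for one 80-second video. The stage labels are
# assumed from the order of the pipeline diagram on the slide.
baseline_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "chi2 SVM (20 categories)": 2.0,  # reported as "~2.0"
}
video_length = 80.0  # seconds, the clip length used for the timings

total = sum(baseline_seconds.values())
slowdown = total / video_length
print(f"total = {total:.2f} s, about {slowdown:.1f}x slower than real time")
```

Under these assumptions, STIP extraction alone accounts for over 90% of the baseline cost.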
Dataset: Columbia Consumer Videos (CCV)
• 20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.
Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR 2009.
Feature Options
• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY
Suggested feature combinations:
Classifier Kernels
• Chi Square Kernel
• Histogram Intersection Kernel (HI)
• Fast HI Kernel (fastHI)
Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
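The three kernel options can be sketched as follows. This is an illustration under stated assumptions: the χ2 kernel is written in one common exponential form with a hypothetical `gamma` parameter, and the fast HI evaluation follows the general idea in Maji et al. (sort each dimension's support-vector values once, then answer each query with a binary search) rather than reproducing their implementation. All names and toy values are hypothetical.

```python
import math
from bisect import bisect_left

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel: exp(-gamma * sum((x-y)^2 / (x+y)))."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)

def hi_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))

def naive_hi_decision(x, support_vectors, coeffs, bias=0.0):
    """SVM decision value, one HI kernel evaluation per support vector."""
    return bias + sum(c * hi_kernel(x, sv) for sv, c in zip(support_vectors, coeffs))

class FastHIDecision:
    """Same decision value, using per-dimension sorted tables."""
    def __init__(self, support_vectors, coeffs, bias=0.0):
        self.bias = bias
        self.tables = []
        for i in range(len(support_vectors[0])):
            pairs = sorted((sv[i], c) for sv, c in zip(support_vectors, coeffs))
            xs = [x for x, _ in pairs]
            below = [0.0]  # below[r] = sum of c * x over the r smallest values
            for x, c in pairs:
                below.append(below[-1] + c * x)
            above = [0.0] * (len(pairs) + 1)  # above[r] = sum of c over the rest
            for r in range(len(pairs) - 1, -1, -1):
                above[r] = above[r + 1] + pairs[r][1]
            self.tables.append((xs, below, above))

    def __call__(self, x):
        total = self.bias
        for s, (xs, below, above) in zip(x, self.tables):
            r = bisect_left(xs, s)  # number of support values smaller than s
            # min(s, v) is v for the r smaller values and s for the others.
            total += below[r] + s * above[r]
        return total

# Toy support vectors and coefficients (alpha * label), made up for the example.
svs = [[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
coeffs = [1.0, -0.5, 0.25]
query = [0.5, 0.5]
fast = FastHIDecision(svs, coeffs)
print(naive_hi_decision(query, svs, coeffs), fast(query))
```

The naive evaluation costs time proportional to the number of support vectors times the dimensionality, while the table-based version needs only one binary search per dimension at test time.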
Multi-modality Fusion
• Early Fusion: feature concatenation
• Kernel Fusion: Kf = K1 + K2 + …
• Late Fusion: fusion of classification scores
MFCC, DURF, SSIM, CM, GIST, LBP
MFCC, DURF
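The three fusion strategies differ only in where the modalities are combined. A minimal sketch, assuming plain Python lists for features, kernel matrices and scores, and assuming simple averaging for late fusion (as in the baseline's late average fusion). The function names are hypothetical.

```python
def early_fusion(features):
    """Concatenate per-modality feature vectors into one vector."""
    return [v for f in features for v in f]

def kernel_fusion(kernels):
    """Element-wise sum of same-sized kernel matrices: Kf = K1 + K2 + ..."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*kernels)]

def late_fusion(scores):
    """Average the per-modality classifier scores for one video."""
    return sum(scores) / len(scores)

# Toy inputs (made-up numbers) for two modalities, e.g. audio and visual.
fused_feature = early_fusion([[0.1, 0.9], [0.3, 0.3, 0.4]])
fused_kernel = kernel_fusion([[[1.0, 0.2], [0.2, 1.0]],
                              [[1.0, 0.5], [0.5, 1.0]]])
fused_score = late_fusion([0.2, 0.4, 0.9])
print(fused_feature, fused_kernel, fused_score)
```

Early fusion trains one classifier on the concatenated vector, whereas kernel and late fusion need one kernel computation or one classifier per modality, which is relevant when speed is the objective.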
Frame Sampling
• DURF: uniformly sampling 16 frames per video seems sufficient.
K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
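Uniform sampling of a fixed number of frames can be sketched as below. The function name, the choice of taking the centre of each segment, and the 30 fps frame rate in the example are assumptions for illustration. The slide only states that 16 uniformly sampled frames per video seem sufficient.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Pick up to `num_samples` evenly spaced frame indices from a video."""
    if num_frames <= 0:
        return []
    n = min(num_samples, num_frames)
    # Split the video into n equal-length segments and take each centre.
    return [int((i + 0.5) * num_frames / n) for i in range(n)]

# Example: an 80-second clip at an assumed 30 frames per second.
indices = uniform_frame_indices(80 * 30)
print(indices)
```

With a fixed frame budget, the visual feature extraction cost no longer grows with video length.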
Frame Sampling
• MFCC: sampling audio frames is always harmful.
Summary
• Feature: Dense SURF (DURF), MFCC, plus some global features
• Classifier: Fast HI kernel SVM
• Fusion: Early
• Frame Selection: Audio - No; Visual - Yes
220-fold speed-up!
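To put the speed-up in terms of processing time, a rough calculation, assuming the 220-fold figure applies to the 1003-second baseline total for the same 80-second video (the slide does not state the resulting time, so the number below is derived, not quoted):

```python
baseline_total = 1003.0  # seconds per video, from the baseline speed slide
speed_up = 220.0         # reported speed-up factor
video_length = 80.0      # seconds, the clip length used for the timings

new_total = baseline_total / speed_up
realtime_factor = video_length / new_total
print(f"{new_total:.2f} s per video, about {realtime_factor:.1f}x faster than real time")
```

Under that assumption, recognition takes a few seconds for an 80-second clip, which is consistent with the "towards real-time" claim in the title.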
Demo…