SUPER: Towards Real-time Event Recognition in Internet Videos

Yu-Gang Jiang

School of Computer Science, Fudan University

Shanghai, China

[email protected]

ACM ICMR 2012, Hong Kong, June 2012

Speeded Up Event Recognition

ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, Jun. 2012.


The Problem

• Recognize high-level events in videos

We’re particularly interested in Internet consumer videos

• Applications: Video Search, Personal Video Collection Management, Smart Advertising, Intelligence Analysis, …


Our Objective

Improve Efficiency

Maintain Accuracy

The Baseline Recognition Framework

[Pipeline diagram]

• Feature extraction: SIFT, spatial-temporal interest points, MFCC audio feature
• Classifier: χ2 kernel SVM
• Late average fusion

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.

Best-performing approach in the TRECVID 2010 Multimedia Event Detection (MED) task

Three Audio-Visual Features…


• SIFT (visual) – D. Lowe, IJCV ’04

• STIP (visual) – I. Laptev, IJCV ’05

• MFCC (audio) … (figure labels: 16 ms, 16 ms)

Bag-of-words Representation

• SIFT / STIP / MFCC words
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

[Figure: Bag-of-SIFT. Keypoint extraction (DoG, Hessian Affine); vocabulary generation in SIFT feature space (Vocabulary 1, Vocabulary 2); BoW representation (BoW histograms using soft-weighting)]

Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
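The soft-weighting step named on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each descriptor votes for its `top_n` nearest vocabulary words, with the vote for the i-th nearest word halved at each rank. The similarity function `1 / (1 + distance)`, the final L1 normalization, and the name `soft_weighted_bow` are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def soft_weighted_bow(descriptors, vocabulary, top_n=4):
    """Build a soft-weighted bag-of-words histogram.

    Each descriptor contributes to its top_n nearest words; the word at
    rank i (0-based) receives similarity / 2**i.
    """
    hist = [0.0] * len(vocabulary)
    for desc in descriptors:
        # Rank all vocabulary words by distance to this descriptor.
        ranked = sorted((euclidean(desc, word), k)
                        for k, word in enumerate(vocabulary))
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            similarity = 1.0 / (1.0 + dist)  # assumption: illustrative similarity
            hist[k] += similarity / (2 ** rank)
    total = sum(hist)
    # Assumption: L1-normalize so histograms of different videos are comparable.
    return [h / total for h in hist] if total > 0 else hist


vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
hist = soft_weighted_bow([[0.0, 0.0]], vocab, top_n=2)
```

Compared with hard assignment, a descriptor that falls between two words contributes to both, which reduces quantization loss for small vocabularies.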

Baseline Speed…

[Pipeline diagram with per-stage time in seconds; values read in diagram order]

• Feature extraction: SIFT 82.0; spatial-temporal interest points 916.8; MFCC audio feature 2.36
• χ2 kernel SVM classifier: ~2.0
• Late average fusion: <<1

• 4 factors on speed: Feature, Classifier, Fusion, Frame Sampling

Feature efficiency is measured in seconds needed for processing an 80-second video sequence (for SIFT: 0.5fps).

Classification time is measured by classifying a video using the classifiers of all 20 categories.

Total: 1003 seconds per video!
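As a quick check of the total, the per-stage figures on this slide can be summed. The sketch below reads them as 82.0, 916.8, 2.36 and roughly 2.0 seconds and ignores fusion (well under a second). The assignment of each number to a stage follows the diagram order and is an assumption.

```python
# Per-stage seconds for one 80-second video; the stage labels are an
# assumption based on the order in which the numbers appear on the slide.
stage_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "classifier": 2.0,  # shown as "~2.0" on the slide
}

total_seconds = sum(stage_seconds.values())
stip_share = stage_seconds["STIP"] / total_seconds

print(f"total: {total_seconds:.2f} s, STIP share: {stip_share:.1%}")
```

Under this reading, spatial-temporal interest point extraction accounts for more than 90% of the baseline cost.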

Dataset: Columbia Consumer Videos (CCV)

20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground

Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.


Feature Options

• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY

Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR 2009.

Suggested feature combinations:


Classifier Kernels

• Chi Square Kernel
• Histogram Intersection Kernel (HI)
• Fast HI Kernel (fastHI)

Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
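The three kernel options can be sketched as below. This is a minimal illustration, not the code used in the talk. The χ2 kernel is written in a common exponential form, with `gamma` as an assumed parameter. `FastHIDecision` illustrates the idea behind the fast HI kernel: because the intersection kernel is additive over dimensions, the SVM decision function can be evaluated per dimension from sorted support-vector values and prefix sums, instead of looping over all support vectors. The class and function names are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel (gamma is an assumed bandwidth parameter)."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)


def hi_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def naive_hi_decision(x, support_vectors, coeffs, bias=0.0):
    """Standard evaluation: one kernel computation per support vector."""
    return bias + sum(c * hi_kernel(x, sv)
                      for sv, c in zip(support_vectors, coeffs))


class FastHIDecision:
    """Exact HI-kernel SVM decision via per-dimension sorted tables."""

    def __init__(self, support_vectors, coeffs, bias=0.0):
        self.bias = bias
        self.tables = []
        for i in range(len(support_vectors[0])):
            # Sort support-vector values in dimension i with their coefficients.
            pairs = sorted((sv[i], c) for sv, c in zip(support_vectors, coeffs))
            values = [v for v, _ in pairs]
            # cum_cv[r]: sum of c * v over the r smallest values.
            cum_cv = [0.0] + list(accumulate(c * v for v, c in pairs))
            cum_c = [0.0] + list(accumulate(c for _, c in pairs))
            # tail_c[r]: sum of c over all values beyond the r smallest.
            tail_c = [cum_c[-1] - s for s in cum_c]
            self.tables.append((values, cum_cv, tail_c))

    def __call__(self, x):
        score = self.bias
        for s, (values, cum_cv, tail_c) in zip(x, self.tables):
            r = bisect_right(values, s)  # values <= s contribute min = value
            score += cum_cv[r] + s * tail_c[r]  # the rest contribute min = s
        return score


svs = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
alphas = [1.0, -0.5, 0.25]
fast = FastHIDecision(svs, alphas, bias=0.1)
query = [0.4, 0.6]
```

With m support vectors and d dimensions, the naive evaluation costs O(m·d) per test vector, while the table lookup costs O(d·log m).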

Multi-modality Fusion

• Early Fusion: feature concatenation
• Kernel Fusion: Kf = K1 + K2 + …
• Late Fusion: fusion of classification scores

MFCC, DURF, SSIM, CM, GIST, LBP

MFCC, DURF
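The three fusion strategies can be sketched as below, assuming two toy feature histograms per video. The function names are hypothetical, and late fusion is shown as plain score averaging, as in the baseline's late average fusion.

```python
def hi_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def early_fusion(feature_vectors):
    """Early fusion: concatenate per-modality features into one vector."""
    return [v for feat in feature_vectors for v in feat]


def kernel_fusion(kernel_values):
    """Kernel fusion: Kf = K1 + K2 + ... (unweighted sum)."""
    return sum(kernel_values)


def late_fusion(scores):
    """Late fusion: average the per-modality classifier scores."""
    return sum(scores) / len(scores)


# Two videos, each described by a toy visual and a toy audio histogram.
video_a = {"visual": [0.6, 0.4], "audio": [0.1, 0.9]}
video_b = {"visual": [0.3, 0.7], "audio": [0.5, 0.5]}

k_early = hi_kernel(early_fusion(video_a.values()),
                    early_fusion(video_b.values()))
k_sum = kernel_fusion(hi_kernel(video_a[m], video_b[m])
                      for m in ("visual", "audio"))
fused_score = late_fusion([0.8, 0.4])
```

Because the intersection kernel is additive over dimensions, concatenating features gives the same kernel value as summing the per-feature HI kernels, so only one classifier has to be trained and evaluated.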


Frame Sampling

• DURF: uniformly sampling 16 frames per video seems sufficient.

K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
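Uniform sampling of a fixed number of frames can be sketched as below. The helper name `uniform_frame_indices` and the choice of taking the centre of each equal-length segment are assumptions. The 30 fps frame rate in the usage line is also assumed.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Return indices of num_samples frames spread evenly over a video.

    The video is split into num_samples equal segments and the frame at
    the centre of each segment is chosen.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_frames <= num_samples:
        return list(range(num_frames))  # short clip: keep every frame
    step = num_frames / num_samples
    return [int((i + 0.5) * step) for i in range(num_samples)]


# An 80-second clip at an assumed 30 fps has 2400 frames.
indices = uniform_frame_indices(80 * 30)
```

Visual feature extraction then runs on 16 frames regardless of clip length, so its cost no longer grows with video duration.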


Frame Sampling

• MFCC: sampling audio frames is always harmful.


Summary

• Feature: Dense SURF (DURF), MFCC, plus some global features
• Classifier: Fast HI kernel SVM
• Fusion: Early
• Frame Selection: Audio - No; Visual - Yes

220-fold speed-up!
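To put the speed-up in context, the arithmetic below assumes the 220-fold figure applies to the baseline total of 1003 seconds for the same 80-second video. That pairing is an assumption, not something the slides state explicitly.

```python
baseline_seconds = 1003.0  # baseline total per video from the earlier slide
speedup = 220.0            # speed-up factor from the summary slide
video_seconds = 80.0       # clip length used for the timing measurements

# Assumption: the speed-up applies to the whole baseline total.
seconds_per_video = baseline_seconds / speedup
realtime_factor = video_seconds / seconds_per_video

print(f"{seconds_per_video:.2f} s per video, "
      f"{realtime_factor:.1f}x faster than playback")
```

Under that assumption, processing takes about 4.6 seconds per 80-second video, which is faster than real time.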


Demo…

email: [email protected]

THANK YOU!
