SUPER: Towards Real-time Event Recognition in Internet Videos



Page 1: SUPER: Towards Real-time Event Recognition in Internet Videos

SUPER: Towards Real-time Event Recognition in Internet Videos

Yu-Gang Jiang
School of Computer Science, Fudan University, Shanghai, China
[email protected]

SUPER: Speeded Up Event Recognition

ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, June 2012

Page 2: SUPER: Towards Real-time Event Recognition in Internet Videos

The Problem

• Recognize high-level events in videos
  We're particularly interested in Internet consumer videos
• Applications: video search, personal video collection management, smart advertising, intelligence analysis, …

Page 3: SUPER: Towards Real-time Event Recognition in Internet Videos

Our Objective

Improve Efficiency

Maintain Accuracy

Page 4: SUPER: Towards Real-time Event Recognition in Internet Videos

The Baseline Recognition Framework

[Diagram: feature extraction (SIFT, spatial-temporal interest points, MFCC audio feature) → χ2 kernel SVM classifier → late average fusion]

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.

Best-performing approach in the TRECVID-2010 Multimedia Event Detection (MED) task

Page 5: SUPER: Towards Real-time Event Recognition in Internet Videos

Three Audio-Visual Features…

• SIFT (visual) – D. Lowe, IJCV '04

• STIP (visual) – I. Laptev, IJCV '05

• MFCC (audio) [figure: audio framing diagram, labeled 16 ms]

Page 6: SUPER: Towards Real-time Event Recognition in Internet Videos

Bag-of-words Representation

• SIFT / STIP / MFCC words
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

[Figure: Bag-of-SIFT pipeline. Keypoint extraction (DoG and Hessian Affine detectors) → vocabulary generation in SIFT feature space (Vocabulary 1, Vocabulary 2) → BoW histograms using soft-weighting]

Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
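As a rough illustration of the soft-weighting idea, the sketch below lets each descriptor vote for several nearby vocabulary words instead of only the nearest one, with the vote for the i-th nearest word discounted by 1 / 2^(i-1). The function name, the `top_n=4` default, and the similarity measure `1 / (1 + distance)` are assumptions made for this example; the slide does not specify them.

```python
import math


def soft_weight_histogram(descriptors, vocabulary, top_n=4):
    """Build a soft-weighted bag-of-words histogram.

    Each descriptor votes for its top_n nearest words; the vote for the
    i-th nearest word (i = 1, 2, ...) is scaled by 1 / 2**(i - 1).
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Euclidean distance from this descriptor to every vocabulary word.
        ranked = sorted((math.dist(d, w), k) for k, w in enumerate(vocabulary))
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            similarity = 1.0 / (1.0 + dist)  # assumed similarity measure
            hist[k] += similarity / (2 ** rank)
    return hist


# Toy usage: one 2-D descriptor, three words, two votes per descriptor.
words = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
print(soft_weight_histogram([(0.0, 0.0)], words, top_n=2))
```

With hard assignment only the first bin would be incremented; here the second-nearest word also receives a smaller, rank-discounted vote.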

Page 7: SUPER: Towards Real-time Event Recognition in Internet Videos

Baseline Speed…

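• 4 factors affect speed: feature, classifier, fusion, frame sampling

[Diagram: the baseline framework (feature extraction with SIFT, spatial-temporal interest points and MFCC audio feature; χ2 kernel SVM classifier; late average fusion), annotated with timings in seconds, in the order they appear: 82.0, 916.8, 2.36, ~2.00, <<1]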

Feature efficiency is measured in seconds needed for processing an 80-second video sequence (for SIFT: 0.5fps).

Classification time is the time needed to classify one video with the classifiers of all 20 categories.

Total: 1003 seconds per video!
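The total can be checked against the individual timings on the slide. The snippet below is only a sanity check: the mapping of each number to a component is assumed from the order of the boxes in the diagram, and the fusion cost (shown as "<<1") is left out.

```python
# Timings in seconds as read off the slide; the labels are an assumption
# based on the order of the diagram boxes.
baseline_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "classification": 2.0,  # shown as approximate on the slide
}

total = sum(baseline_seconds.values())
video_length = 80.0  # seconds, the test sequence length stated on the slide
slowdown = total / video_length

print(f"total = {total:.2f} s, about {slowdown:.1f}x slower than real time")
```

Under these assumptions the components add up to roughly the 1003 seconds quoted, with the second feature accounting for over 90% of the cost.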

Page 8: SUPER: Towards Real-time Event Recognition in Internet Videos

Dataset: Columbia Consumer Videos (CCV)

20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground

Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.

Page 9: SUPER: Towards Real-time Event Recognition in Internet Videos

Feature Options

• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY

Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR 2009.

Suggested feature combinations:

Page 10: SUPER: Towards Real-time Event Recognition in Internet Videos

Classifier Kernels

• Chi Square Kernel
• Histogram Intersection Kernel (HI)
• Fast HI Kernel (fastHI)

Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
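To make the difference between the three kernel options concrete, here is a minimal sketch. The chi-square form with a `gamma` parameter is one common formulation and is an assumption here, since the slide gives no formula. `FastHIDecision` is a hypothetical class illustrating the idea from the cited Maji et al. paper: because the intersection kernel is a sum of per-dimension minima, the SVM decision function can be evaluated from per-dimension sorted tables with a binary search rather than one kernel computation per support vector.

```python
import bisect
import math


def chi_square_kernel(x, y, gamma=1.0):
    """exp(-gamma * sum((x_i - y_i)^2 / (x_i + y_i))), skipping empty bins."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)


def hi_kernel(x, y):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def naive_hi_decision(x, support_vectors, coeffs, bias):
    """Direct evaluation: one kernel computation per support vector."""
    return sum(c * hi_kernel(x, sv) for sv, c in zip(support_vectors, coeffs)) + bias


class FastHIDecision:
    """Exact evaluation of the same function via per-dimension sorted tables."""

    def __init__(self, support_vectors, coeffs, bias):
        self.bias = bias
        self.tables = []
        for dim in range(len(support_vectors[0])):
            pairs = sorted((sv[dim], c) for sv, c in zip(support_vectors, coeffs))
            values = [v for v, _ in pairs]
            # below[r]: sum of c * v over the r smallest values.
            # above[r]: sum of c over the remaining (larger) values.
            below, above = [0.0], [sum(c for _, c in pairs)]
            for v, c in pairs:
                below.append(below[-1] + c * v)
                above.append(above[-1] - c)
            self.tables.append((values, below, above))

    def __call__(self, x):
        total = self.bias
        for s, (values, below, above) in zip(x, self.tables):
            r = bisect.bisect_right(values, s)  # number of stored values <= s
            total += below[r] + s * above[r]
        return total


# Toy usage with made-up support vectors and coefficients.
svs = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
coeffs = [0.7, -1.2, 0.4]
fast = FastHIDecision(svs, coeffs, bias=0.1)
query = [0.4, 0.6]
print(naive_hi_decision(query, svs, coeffs, 0.1), fast(query))
```

For n dimensions and m support vectors, the direct evaluation costs on the order of n·m operations per test sample, while the table lookup costs on the order of n·log m, which is where the classification speed-up comes from.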

Page 11: SUPER: Towards Real-time Event Recognition in Internet Videos

Multi-modality Fusion

• Early Fusion: feature concatenation
• Kernel Fusion: Kf = K1 + K2 + …
• Late Fusion: fusion of classification scores

[Figure labels: MFCC, DURF, SSIM, CM, GIST, LBP; MFCC, DURF]
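The three fusion strategies differ only in where the modalities are combined: before training (features), inside the classifier (kernels), or after classification (scores). The helper names below are hypothetical, and plain averaging is assumed for late fusion, matching the "late average fusion" of the baseline.

```python
def early_fusion(features):
    """Concatenate per-modality feature vectors into one long vector."""
    return [v for vec in features for v in vec]


def kernel_fusion(kernels):
    """Element-wise sum of same-sized kernel matrices: Kf = K1 + K2 + ..."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*kernels)]


def late_average_fusion(scores):
    """Average the per-modality classifier scores for one video."""
    return sum(scores) / len(scores)


# Toy usage with two modalities.
print(early_fusion([[0.1, 0.9], [0.5]]))
print(kernel_fusion([[[1, 2], [3, 4]], [[10, 20], [30, 40]]]))
print(late_average_fusion([0.25, 0.75]))
```

One property worth noting: for an additive kernel such as histogram intersection, the kernel of two concatenated vectors equals the sum of the per-modality kernels, so unweighted early fusion and kernel fusion produce the same kernel matrix. Early fusion then needs only one classifier per category.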

Page 12: SUPER: Towards Real-time Event Recognition in Internet Videos

Frame Sampling

• DURF: uniformly sampling 16 frames per video seems sufficient.

K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
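Uniform sampling of a fixed number of frames can be sketched as follows. The function name and the choice of taking the middle frame of each segment are assumptions for illustration; the slide only states that 16 uniformly sampled frames seem sufficient.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Pick num_samples frame indices spread evenly across the video."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the middle frame of each of the num_samples equal-length segments.
    return [int(step * (i + 0.5)) for i in range(num_samples)]


# Toy usage: an 80-second clip at an assumed 30 frames per second.
print(uniform_frame_indices(80 * 30))
```

Visual features are then extracted from only these frames, so the extraction cost no longer grows with video length.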

Page 13: SUPER: Towards Real-time Event Recognition in Internet Videos

Frame Sampling

• MFCC: sampling audio frames is always harmful.

Page 14: SUPER: Towards Real-time Event Recognition in Internet Videos

Summary

• Feature: Dense SURF (DURF), MFCC, plus some global features
• Classifier: Fast HI kernel SVM
• Fusion: Early
• Frame Selection: Audio - No; Visual - Yes

220-fold speed-up!
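A quick back-of-the-envelope calculation shows what this speed-up means in practice. It assumes that the 220-fold figure is measured against the 1003-second baseline total for an 80-second video; the slide does not state the reference point explicitly.

```python
baseline_seconds = 1003.0  # baseline total per video, from the "Baseline Speed" slide
speed_up = 220.0           # reported on this slide
video_length = 80.0        # length of the test sequence in seconds

# Assumption: the speed-up is relative to the full baseline pipeline.
super_seconds = baseline_seconds / speed_up
real_time_factor = video_length / super_seconds

print(f"{super_seconds:.2f} s per video, about {real_time_factor:.1f}x faster than real time")
```

Under that assumption, processing drops from well over ten times the video's duration to a small fraction of it.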

Page 15: SUPER: Towards Real-time Event Recognition in Internet Videos

Demo…

Page 16: SUPER: Towards Real-time Event Recognition in Internet Videos

email: [email protected]

THANK YOU!
