
Page 1: ARF @ MediaEval 2012: Multimodal Video Classification

~ Multimodal Video Classification ~

University POLITEHNICA of Bucharest

Austrian Research Institute for Artificial Intelligence

Bogdan IONESCU*1,3 ([email protected])

Ionuț MIRONICĂ1 ([email protected])

Klaus SEYERLEHNER2 ([email protected])

Peter KNEES2 ([email protected])

Jan SCHLÜTER4 ([email protected])

Markus SCHEDL2 ([email protected])

Horia CUCU1 ([email protected])

Andi BUZO1 ([email protected])

Patrick LAMBERT3 ([email protected])

ARF (Austria-Romania-France) team

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.

Page 2: ARF @ MediaEval 2012: Multimodal Video Classification


Presentation outline

MediaEval - Pisa, Italy, 4-5 October 2012 1/16

• The approach

• Video content description

• Experimental results

• Conclusions and future work

Page 3: ARF @ MediaEval 2012: Multimodal Video Classification


The approach


> challenge: find a way to assign (genre) tags to unknown videos;

> approach: machine learning paradigm;

[diagram: a labeled video database (genres such as web, food, autos) is used to train a classifier, which then labels unlabeled videos, producing a tagged video database]

Page 4: ARF @ MediaEval 2012: Multimodal Video Classification


The approach: classification


> the entire process relies on the concept of “similarity” computed between content annotations (numeric features);

> this year’s focus is on:

objective 1: go multimodal (truly): visual + audio + text;

objective 2: test a broad range of classifiers and descriptor combinations.
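The similarity-driven paradigm above can be sketched with a toy nearest-centroid model (an illustration of the idea only, not the team's actual classifiers); the genre names and 2-D descriptors are made up:

```python
import numpy as np

def train_centroids(features, labels):
    """One centroid (mean feature vector) per genre, from the labeled data."""
    return {g: features[labels == g].mean(axis=0) for g in np.unique(labels)}

def predict(centroids, x):
    """Tag an unknown video with the genre whose centroid is most similar,
    i.e. at the smallest Euclidean distance in feature space."""
    return min(centroids, key=lambda g: np.linalg.norm(x - centroids[g]))

# toy labeled videos: 2-D descriptors for two genres
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array(["food", "food", "autos", "autos"])
model = train_centroids(X, y)
print(predict(model, np.array([0.15, 0.15])))  # -> food
```

The real runs replace the centroid rule with stronger learners (SVM, Weka classifiers), but the driving notion is the same: similarity between numeric content annotations.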

Page 5: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - audio


[Klaus Seyerlehner et al., MIREX’11, USA]

block-level audio features (also capture local temporal information): the signal is cut into overlapping blocks (e.g. 50% overlapping) and per-block values are summarized by statistics such as average, median, and variance.

• Spectral Pattern, ~ soundtrack’s timbre;

• delta Spectral Pattern, ~ strength of onsets;

• variance delta Spectral Pattern, ~ variation of the onset strength;

• Logarithmic Fluctuation Pattern, ~ rhythmic aspects;

• Spectral Contrast Pattern, ~ “toneness”;

• Correlation Pattern, ~ loudness changes;

• Local Single Gaussian model, ~ timbral;

• George Tzanetakis model, ~ timbral.
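The block-level idea can be sketched as follows (a simplified stand-in, not Seyerlehner's actual feature implementation): cut the spectrogram into 50%-overlapping blocks and summarize per-block values by statistics, so local temporal structure survives that a single global mean would smear out. The toy spectrogram and block sizes are arbitrary:

```python
import numpy as np

def block_summaries(spectrogram, block_len=10, hop=5):
    """Cut a (freq x time) spectrogram into 50%-overlapping blocks of frames
    (hop = block_len / 2) and summarize across blocks with several statistics."""
    n_frames = spectrogram.shape[1]
    blocks = [spectrogram[:, t:t + block_len]
              for t in range(0, n_frames - block_len + 1, hop)]
    per_block = np.array([b.mean(axis=1) for b in blocks])  # one vector per block
    return {"average": per_block.mean(axis=0),
            "median": np.median(per_block, axis=0),
            "variance": per_block.var(axis=0)}

spec = np.abs(np.random.default_rng(0).normal(size=(4, 40)))  # toy 4-band spectrogram
stats = block_summaries(spec)  # three 4-D summary vectors
```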

Page 6: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - audio


[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]

standard audio features (audio frame-based):

• Linear Predictive Coefficients,

• Line Spectral Pairs,

• Mel-Frequency Cepstral Coefficients,

• Zero-Crossing Rate,

• spectral centroid, flux, rolloff, and kurtosis,

+ variance of each feature over a certain window.

[diagram: frame-level features f1, f2, …, fn computed over time; the global feature is their mean & variance]
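The frame-based scheme can be sketched with two toy frame-level features (a hand-rolled zero-crossing rate and plain standard deviation, standing in for the Yaafe descriptors): compute each feature on every frame, then keep the mean and variance over time as the global descriptor:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of successive samples whose sign changes."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def global_descriptor(frames, feature_fns):
    """Apply each frame-level feature to every frame, then concatenate
    the mean and the variance of each feature over time."""
    F = np.array([[fn(f) for fn in feature_fns] for f in frames])
    return np.concatenate([F.mean(axis=0), F.var(axis=0)])

rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 512))  # 100 toy audio frames of 512 samples
desc = global_descriptor(frames, [zero_crossing_rate, np.std])
print(desc.shape)  # (4,): mean & variance of each of the two features
```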

Page 7: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - visual


[OpenCV toolbox, http://opencv.willowgarage.com]

MPEG-7 & color/texture descriptors (visual frame-based)

• Local Binary Pattern,

• Autocorrelogram,

• Color Coherence Vector,

• Color Layout Pattern,

• Edge Histogram,

• Scalable Color Descriptor,

• Classic color histogram,

• Color moments.

[diagram: frame-level features f1, f2, …, fn computed over time; the global feature concatenates their mean, dispersion, skewness, kurtosis, median, and root mean square]
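A sketch of the six-statistic aggregation from the diagram, applied to made-up per-frame descriptors (the statistics match the slide; the data and dimensionality are arbitrary):

```python
import numpy as np

def aggregate(frame_features):
    """Collapse per-frame descriptors into one global vector via the six
    statistics from the diagram: mean, dispersion (std), skewness,
    (excess) kurtosis, median and root mean square, per dimension."""
    F = np.asarray(frame_features, dtype=float)
    mu, sigma = F.mean(axis=0), F.std(axis=0)
    z = (F - mu) / np.where(sigma > 0, sigma, 1.0)
    return np.concatenate([mu, sigma,
                           (z ** 3).mean(axis=0),            # skewness
                           (z ** 4).mean(axis=0) - 3.0,      # excess kurtosis
                           np.median(F, axis=0),
                           np.sqrt((F ** 2).mean(axis=0))])  # root mean square

rng = np.random.default_rng(2)
feats = rng.uniform(size=(30, 8))  # 30 frames, toy 8-bin color histogram each
g = aggregate(feats)
print(g.shape)  # (48,): 6 statistics x 8 dimensions
```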

Page 8: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - visual


[OpenCV toolbox, http://opencv.willowgarage.com]

feature descriptors (visual frame-based)

• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientation in localized portions of an image (20º per bin)

• Harris corner detector

• Speeded Up Robust Feature (SURF)

[figure: example of detected feature points (e.g. Harris); image source http://www.ifp.illinois.edu/~yuhuang]
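The HoG idea can be sketched without OpenCV: an orientation histogram with 20º bins (9 bins over 180º), weighted by gradient magnitude. This covers only the global histogram, omitting HoG's per-cell blocks and normalization:

```python
import numpy as np

def hog_histogram(image, bins=9):
    """Global orientation histogram with 20-degree bins (180 / 9),
    each gradient weighted by its magnitude."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    hist, _ = np.histogram(angle, bins=bins, range=(0, 180), weights=magnitude)
    return hist / (hist.sum() + 1e-12)

img = np.tile(np.arange(16.0), (16, 1))  # horizontal intensity ramp
h = hog_histogram(img)
print(np.argmax(h))  # -> 0: all gradient energy falls into the 0-20 degree bin
```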

Page 9: ARF @ MediaEval 2012: Multimodal Video Classification

9

Video content description - text


TF-IDF descriptors (Term Frequency-Inverse Document Frequency)

> text sources: ASR and metadata,

1. remove XML markup,

2. remove terms below the 5th percentile of the frequency distribution,

3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently than in the complement classes,

4. represent each document by its TF-IDF values.
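Step 4 (the TF-IDF weighting) in a minimal form, omitting the XML cleaning and χ2 term selection of steps 1-3; the toy documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF vectors for tokenized documents: term frequency weighted by
    inverse document frequency, so corpus-wide common terms count for less."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

# toy tokenized "documents" (one per video, e.g. from ASR or metadata)
docs = [["car", "engine", "car"], ["recipe", "food"], ["car", "race"]]
vecs = tf_idf(docs)
```

Here "car" occurs in two of three documents, so its weight is discounted relative to terms like "recipe" that occur in only one.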

Page 10: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)

[chart: avg. F-score (over all genres) per visual descriptor]

> classifiers from Weka (Bayes, lazy, functional, trees, etc.),

[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]

> cross-validation (train 50% – test 50%):

- visual descriptors achieve 30%±10%,

- using more visual descriptors is not more accurate than using a few,

- best: LBP+CCV+histogram (F-score = 41.2%).
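The measure reported in these tables, average F-score over all genres, sketched on a toy prediction (per-genre F-score is the harmonic mean of precision and recall):

```python
import numpy as np

def f_score(y_true, y_pred, genre):
    """Harmonic mean of precision and recall for one genre."""
    tp = np.sum((y_pred == genre) & (y_true == genre))
    fp = np.sum((y_pred == genre) & (y_true != genre))
    fn = np.sum((y_pred != genre) & (y_true == genre))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def avg_f_score(y_true, y_pred):
    """Average the per-genre F-scores over all genres."""
    return float(np.mean([f_score(y_true, y_pred, g) for g in np.unique(y_true)]))

y_true = np.array(["food", "food", "autos", "autos"])
y_pred = np.array(["food", "autos", "autos", "autos"])
print(round(avg_f_score(y_true, y_pred), 4))  # -> 0.7333
```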

Page 11: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)



[chart: avg. F-score (over all genres) per audio descriptor]

> cross-validation (train 50% – test 50%):

- proposed block-based better than standard (by ~10%),

- audio still better than visual (improvement ~6%).

Page 12: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)



[chart: avg. F-score (over all genres) per text source]

> cross-validation (train 50% – test 50%):

- ASR from LIMSI more representative than LIUM (~3%),

- best performance: ASR LIMSI + metadata (F-score = 68%).

Page 13: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)



[chart: avg. F-score (over all genres) per modality combination]

> cross-validation (train 50% – test 50%):

- audio-visual close to text (ASR) for the automatic descriptors,

- increasing the number of modalities increases the performance.

Page 14: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: official runs (9,550 seq.)


> train on devset, test on testset (SVM linear),

Run 1: LBP + CCV + hist + audio block-based

Run 2: TF-IDF on ASR LIMSI

Run 3: audio block-based + LBP + CCV + hist + TF-IDF on ASR LIMSI

Run 4: audio block-based

Run 5: TF-IDF on metadata + ASR LIMSI

[chart: MAP per run, with MediaEval 2011 reference points: MAP 10.3% and MAP 12% (metadata)]

Page 15: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: official runs (9,550 seq.)


> per-genre MAP for Run 5 (TF-IDF on ASR + metadata) and Run 1 (visual + audio),

[chart: per-genre MAP; readable values include religion 71%, gaming 71%, autos 52%, environment 50%]

Page 16: ARF @ MediaEval 2012: Multimodal Video Classification


Conclusions and future work


> classification adapts to the corpus – changing the corpus will change the performance;

> audio-visual descriptors are inherently limited;

> how far can we go with ad-hoc classification without human intervention?

> future work:

- pursue tests on the entire data set;

- more elaborate late fusion?

- perhaps a more elaborate Bag-of-Visual-Words approach.

Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.

Page 17: ARF @ MediaEval 2012: Multimodal Video Classification


thank you !


any questions ?