
Page 1: ARF @ MediaEval 2012: Multimodal Video Classification

~ Multimodal Video Classification ~

University POLITEHNICA of Bucharest

Austrian Research Institute for Artificial Intelligence

Bogdan IONESCU*1,3 ([email protected])

Ionuț MIRONICĂ1 ([email protected])

Klaus SEYERLEHNER2 ([email protected])

Peter KNEES2 ([email protected])

Jan SCHLÜTER4 ([email protected])

Markus SCHEDL2 ([email protected])

Horia CUCU1 ([email protected])

Andi BUZO1 ([email protected])

Patrick LAMBERT3 ([email protected])

ARF (Austria-Romania-France) team

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.

Page 2: ARF @ MediaEval 2012: Multimodal Video Classification


Presentation outline

MediaEval - Pisa, Italy, 4-5 October 2012 1/16

• The approach

• Video content description

• Experimental results

• Conclusions and future work

Page 3: ARF @ MediaEval 2012: Multimodal Video Classification


The approach


> challenge: find a way to assign (genre) tags to unknown videos;

> approach: machine learning paradigm;

[diagram: a labeled video database (genres such as web, food, autos) is used to train a classifier, which then labels unlabeled videos, producing a tagged video database]

Page 4: ARF @ MediaEval 2012: Multimodal Video Classification


The approach: classification


> the entire process relies on the concept of “similarity” computed between content annotations (numeric features);

> this year’s focus is on:

objective 1: go multimodal (truly): visual + audio + text;

objective 2: test a broad range of classifiers and descriptor combinations.
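The similarity-driven paradigm above can be sketched with a toy nearest-centroid model (an illustration of the idea only, not the team's actual classifiers); the genre names and 2-D descriptors are made up:

```python
import numpy as np

def train_centroids(features, labels):
    """One centroid (mean feature vector) per genre, from the labeled data."""
    return {g: features[labels == g].mean(axis=0) for g in np.unique(labels)}

def predict(centroids, x):
    """Tag an unknown video with the genre whose centroid is most similar,
    i.e. at the smallest Euclidean distance in feature space."""
    return min(centroids, key=lambda g: np.linalg.norm(x - centroids[g]))

# toy labeled videos: 2-D descriptors for two genres
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array(["food", "food", "autos", "autos"])
model = train_centroids(X, y)
print(predict(model, np.array([0.15, 0.15])))  # -> food
```

The real runs replace the centroid rule with stronger learners (SVM, Weka classifiers), but the driving notion is the same: similarity between numeric content annotations.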

Page 5: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - audio


[Klaus Seyerlehner et al., MIREX’11, USA]

block-level audio features (also capture local temporal information): the signal is cut into overlapping blocks (e.g. 50% overlapping) and per-block values are summarized by statistics such as average, median, and variance.

• Spectral Pattern, ~ soundtrack’s timbre;

• delta Spectral Pattern, ~ strength of onsets;

• variance delta Spectral Pattern, ~ variation of the onset strength;

• Logarithmic Fluctuation Pattern, ~ rhythmic aspects;

• Spectral Contrast Pattern, ~ “toneness”;

• Correlation Pattern, ~ loudness changes;

• Local Single Gaussian model, ~ timbral;

• George Tzanetakis model, ~ timbral.
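The block-level idea can be sketched as follows (a simplified stand-in, not Seyerlehner's actual feature implementation): cut the spectrogram into 50%-overlapping blocks and summarize per-block values by statistics, so local temporal structure survives that a single global mean would smear out. The toy spectrogram and block sizes are arbitrary:

```python
import numpy as np

def block_summaries(spectrogram, block_len=10, hop=5):
    """Cut a (freq x time) spectrogram into 50%-overlapping blocks of frames
    (hop = block_len / 2) and summarize across blocks with several statistics."""
    n_frames = spectrogram.shape[1]
    blocks = [spectrogram[:, t:t + block_len]
              for t in range(0, n_frames - block_len + 1, hop)]
    per_block = np.array([b.mean(axis=1) for b in blocks])  # one vector per block
    return {"average": per_block.mean(axis=0),
            "median": np.median(per_block, axis=0),
            "variance": per_block.var(axis=0)}

spec = np.abs(np.random.default_rng(0).normal(size=(4, 40)))  # toy 4-band spectrogram
stats = block_summaries(spec)  # three 4-D summary vectors
```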

Page 6: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - audio


[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]

standard audio features (audio frame-based):

• Linear Predictive Coefficients,

• Line Spectral Pairs,

• Mel-Frequency Cepstral Coefficients,

• Zero-Crossing Rate,

• spectral centroid, flux, rolloff, and kurtosis,

+ variance of each feature over a certain window.

[diagram: frame-level features f1, f2, …, fn computed over time; the global feature is their mean & variance]
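The frame-based scheme can be sketched with two toy frame-level features (a hand-rolled zero-crossing rate and plain standard deviation, standing in for the Yaafe descriptors): compute each feature on every frame, then keep the mean and variance over time as the global descriptor:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of successive samples whose sign changes."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

def global_descriptor(frames, feature_fns):
    """Apply each frame-level feature to every frame, then concatenate
    the mean and the variance of each feature over time."""
    F = np.array([[fn(f) for fn in feature_fns] for f in frames])
    return np.concatenate([F.mean(axis=0), F.var(axis=0)])

rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 512))  # 100 toy audio frames of 512 samples
desc = global_descriptor(frames, [zero_crossing_rate, np.std])
print(desc.shape)  # (4,): mean & variance of each of the two features
```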

Page 7: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - visual


[OpenCV toolbox, http://opencv.willowgarage.com]

MPEG-7 & color/texture descriptors (visual frame-based)

• Local Binary Pattern,

• Autocorrelogram,

• Color Coherence Vector,

• Color Layout Pattern,

• Edge Histogram,

• Scalable Color Descriptor,

• Classic color histogram,

• Color moments.

[diagram: frame-level features f1, f2, …, fn computed over time; the global feature concatenates their mean, dispersion, skewness, kurtosis, median, and root mean square]
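A sketch of the six-statistic aggregation from the diagram, applied to made-up per-frame descriptors (the statistics match the slide; the data and dimensionality are arbitrary):

```python
import numpy as np

def aggregate(frame_features):
    """Collapse per-frame descriptors into one global vector via the six
    statistics from the diagram: mean, dispersion (std), skewness,
    (excess) kurtosis, median and root mean square, per dimension."""
    F = np.asarray(frame_features, dtype=float)
    mu, sigma = F.mean(axis=0), F.std(axis=0)
    z = (F - mu) / np.where(sigma > 0, sigma, 1.0)
    return np.concatenate([mu, sigma,
                           (z ** 3).mean(axis=0),            # skewness
                           (z ** 4).mean(axis=0) - 3.0,      # excess kurtosis
                           np.median(F, axis=0),
                           np.sqrt((F ** 2).mean(axis=0))])  # root mean square

rng = np.random.default_rng(2)
feats = rng.uniform(size=(30, 8))  # 30 frames, toy 8-bin color histogram each
g = aggregate(feats)
print(g.shape)  # (48,): 6 statistics x 8 dimensions
```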

Page 8: ARF @ MediaEval 2012: Multimodal Video Classification


Video content description - visual


[OpenCV toolbox, http://opencv.willowgarage.com]

feature descriptors (visual frame-based)

• Histogram of Oriented Gradients (HoG) ~ counts occurrences of gradient orientation in localized portions of an image (20º per bin)

• Harris corner detector

• Speeded Up Robust Feature (SURF)

[figure: example of detected feature points (e.g. Harris); image source http://www.ifp.illinois.edu/~yuhuang]
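The HoG idea can be sketched without OpenCV: an orientation histogram with 20º bins (9 bins over 180º), weighted by gradient magnitude. This covers only the global histogram, omitting HoG's per-cell blocks and normalization:

```python
import numpy as np

def hog_histogram(image, bins=9):
    """Global orientation histogram with 20-degree bins (180 / 9),
    each gradient weighted by its magnitude."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    hist, _ = np.histogram(angle, bins=bins, range=(0, 180), weights=magnitude)
    return hist / (hist.sum() + 1e-12)

img = np.tile(np.arange(16.0), (16, 1))  # horizontal intensity ramp
h = hog_histogram(img)
print(np.argmax(h))  # -> 0: all gradient energy falls into the 0-20 degree bin
```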

Page 9: ARF @ MediaEval 2012: Multimodal Video Classification

9

Video content description - text


TF-IDF descriptors (Term Frequency-Inverse Document Frequency)

> text sources: ASR and metadata,

1. remove XML markup,

2. remove terms below the 5th percentile of the frequency distribution,

3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently than in the complement classes,

4. represent each document by its TF-IDF values.
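Step 4 (the TF-IDF weighting) in a minimal form, omitting the XML cleaning and χ2 term selection of steps 1-3; the toy documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF vectors for tokenized documents: term frequency weighted by
    inverse document frequency, so corpus-wide common terms count for less."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

# toy tokenized "documents" (one per video, e.g. from ASR or metadata)
docs = [["car", "engine", "car"], ["recipe", "food"], ["car", "race"]]
vecs = tf_idf(docs)
```

Here "car" occurs in two of three documents, so its weight is discounted relative to terms like "recipe" that occur in only one.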

Page 10: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)

[chart: avg. F-score (over all genres) per visual descriptor]

> classifiers from Weka (Bayes, lazy, functional, trees, etc.),

[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]

> cross-validation (train 50% – test 50%):

- visual descriptors achieve 30%±10%,

- using more visual descriptors is not more accurate than using a few,

- best: LBP+CCV+histogram (F-score = 41.2%).
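The measure reported in these tables, average F-score over all genres, sketched on a toy prediction (per-genre F-score is the harmonic mean of precision and recall):

```python
import numpy as np

def f_score(y_true, y_pred, genre):
    """Harmonic mean of precision and recall for one genre."""
    tp = np.sum((y_pred == genre) & (y_true == genre))
    fp = np.sum((y_pred == genre) & (y_true != genre))
    fn = np.sum((y_pred != genre) & (y_true == genre))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def avg_f_score(y_true, y_pred):
    """Average the per-genre F-scores over all genres."""
    return float(np.mean([f_score(y_true, y_pred, g) for g in np.unique(y_true)]))

y_true = np.array(["food", "food", "autos", "autos"])
y_pred = np.array(["food", "autos", "autos", "autos"])
print(round(avg_f_score(y_true, y_pred), 4))  # -> 0.7333
```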

Page 11: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)



[chart: avg. F-score (over all genres) per audio descriptor]

> cross-validation (train 50% – test 50%):

- proposed block-based better than standard (by ~10%),

- audio still better than visual (improvement ~6%).

Page 12: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)



[chart: avg. F-score (over all genres) per text source]

> cross-validation (train 50% – test 50%):

- ASR from LIMSI more representative than LIUM (~3%),

- best performance: ASR LIMSI + metadata (F-score = 68%).

Page 13: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: devset (5,127 seq.)



[chart: avg. F-score (over all genres) per modality combination]

> cross-validation (train 50% – test 50%):

- audio-visual close to text (ASR) for the automatic descriptors,

- increasing the number of modalities increases the performance.

Page 14: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: official runs (9,550 seq.)


> train on devset, test on testset (SVM linear),

Run 1: LBP + CCV + hist + audio block-based

Run 2: TF-IDF on ASR LIMSI

Run 3: audio block-based + LBP + CCV + hist + TF-IDF on ASR LIMSI

Run 4: audio block-based

Run 5: TF-IDF on metadata + ASR LIMSI

[chart: MAP per run, with MediaEval 2011 reference points: MAP 10.3% and MAP 12% (metadata)]

Page 15: ARF @ MediaEval 2012: Multimodal Video Classification


Experimental results: official runs (9,550 seq.)


> per-genre MAP for Run 5 (TF-IDF on ASR + metadata) and Run 1 (visual + audio),

[chart: per-genre MAP; readable values include religion 71%, gaming 71%, autos 52%, environment 50%]

Page 16: ARF @ MediaEval 2012: Multimodal Video Classification


Conclusions and future work


> classification adapts to the corpus – changing the corpus will change the performance;

> audio-visual descriptors are inherently limited;

> how far can we go with ad-hoc classification without human intervention?

> future work:

- pursue tests on the entire data set;

- more elaborate late fusion?

- perhaps a more elaborate Bag-of-Visual-Words approach.

Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.

Page 17: ARF @ MediaEval 2012: Multimodal Video Classification


thank you !


any questions ?