A Multimodal Approach for Video Geocoding (UNICAMP at Placing Task MediaEval 2012). Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette, Otávio A. B. Penatti, and Ricardo da S. Torres. Institute of Computing, University of Campinas (UNICAMP), Brazil


TRANSCRIPT

Page 1: A Multimodal Approach for Video Geocoding

A Multimodal Approach for Video Geocoding

(UNICAMP at Placing Task MediaEval 2012)

Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette, Otávio A. B. Penatti, and Ricardo da S. Torres

Institute of Computing - University of Campinas (UNICAMP) Brazil

Page 2

Multimodal geocoding proposal

Page 3

Textual features

• Similarity functions: Okapi & Dice

• Video metadata (run 1)

– Title + description + keywords (Okapi_all)

– Only description: Okapi_desc & Dice_desc

– Combined result in run 1:

• Okapi_all + Okapi_desc + Dice_desc

• Photo tags (run 5)

– Okapi function

keywords (test video) matched against tags (3,185,258 Flickr photos)
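The Dice similarity function named above can be sketched over term sets. This is a minimal illustration only; the exact preprocessing and the Okapi (BM25) weighting used in the runs are not detailed in the slides, and the sample inputs are hypothetical:

```python
def dice(a, b):
    """Dice coefficient between two term sets: 2*|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical example: a test video's keywords vs. a Flickr photo's tags.
video_kw = {"beach", "rio", "sunset"}
photo_tags = {"rio", "sunset", "copacabana", "brazil"}
score = dice(video_kw, photo_tags)  # 2*2 / (3+4) = 4/7
```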

Page 4

Geocoding Visual Content

[Diagram] A test video goes through video feature extraction (Bag of Scenes built from photos; Histograms of Motion Patterns) and video similarity computation against the development set (15,563 geocoded videos, each with lat/long, tags, title, description, etc.), yielding a ranked list of similar videos (V1 … Vk) with candidate lat/long and match scores. The location (lat/long) of the most similar video is used as the candidate location for the test video.

Page 5

Visual Features (HMP): Extracting

• Histograms of Motion Patterns

• Keyframes: Not used

• Algorithm to compare video sequences: (1) partial decoding; (2) feature extraction; (3) signature generation.

“Comparison of video sequences with histograms of motion patterns”, J. Almeida et al. ICIP, 2011.
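The quantize-and-histogram idea behind steps (2) and (3) can be caricatured in a few lines. This toy sketch assumes motion vectors are already available from partial decoding and is only an illustration; the actual HMP descriptor (Almeida et al., ICIP 2011) builds histograms of motion patterns from the compressed stream:

```python
import math
from collections import Counter

def motion_histogram(motion_vectors, bins=8):
    """Toy video signature: quantize each motion vector's orientation into
    one of `bins` pattern codes and return an L1-normalized histogram.
    (Hypothetical simplification, not the real HMP feature.)"""
    counts = Counter()
    for dx, dy in motion_vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)          # orientation in [0, 2π)
        counts[int(angle / (2 * math.pi) * bins) % bins] += 1
    total = sum(counts.values()) or 1
    return [counts[i] / total for i in range(bins)]

sig = motion_histogram([(1, 0), (0, 1), (1, 1), (-1, 0)])
```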

Page 6

Visual Features (HMP): overview

[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]

Page 7

HMP: Comparing Videos

• Comparison of histograms can be performed by any vectorial distance function

– like Manhattan (L1) or Euclidean (L2)

• Video sequences compared with

– Histogram intersection, defined as:

  d(v1, v2) = Σ_i min(H_{v1}(i), H_{v2}(i))

where H_{vi} is the histogram extracted from video Vi. Output range: [0, 1], with 0 = histograms not similar and 1 = identical histograms.
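The histogram-intersection measure above, as a minimal sketch for L1-normalized histograms:

```python
def hist_intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima.
    For L1-normalized histograms the result lies in [0, 1]:
    0 = no overlap, 1 = identical histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

score = hist_intersection([0.5, 0.5, 0.0], [0.5, 0.25, 0.25])  # 0.5 + 0.25 + 0.0 = 0.75
```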

Page 8

Visual Features: Bag-of-Scenes (BoS)

[Diagram: instead of a dictionary of local descriptions, BoS builds a dictionary of scenes, from which the video feature vector is computed]

[Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]

Page 9

Creating the dictionary: scenes → feature vectors → visual-word selection → dictionary of scenes.

Using the dictionary: video → frames → feature vectors → assignment + pooling → video feature vector (bag-of-scenes).
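The assignment + pooling step can be sketched as below. This assumes hard assignment to the nearest scene and sum pooling with L1 normalization; these are illustrative choices, not necessarily the exact configuration used in the runs:

```python
import numpy as np

def bag_of_scenes(frame_feats, dictionary):
    """Hard-assign each frame's feature vector to its nearest 'scene'
    (dictionary word), then sum-pool the assignments into a single
    L1-normalized video vector. Minimal sketch of BoS encoding."""
    # Pairwise distances: shape (n_frames, n_scenes).
    d = np.linalg.norm(frame_feats[:, None, :] - dictionary[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                      # hard assignment
    bos = np.bincount(nearest, minlength=len(dictionary)).astype(float)
    return bos / max(bos.sum(), 1.0)                # sum pooling + L1 norm

# Hypothetical example: 3 frame features, a 2-scene dictionary.
vec = bag_of_scenes(np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.0]]),
                    np.array([[0.0, 0.0], [1.0, 1.0]]))
```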

Page 10

Data Fusion – Rank Aggregation

• In multimedia applications, fusing information from different modalities is essential for achieving high effectiveness

• Rank Aggregation:

– Unsupervised approach for Data Fusion

– Combination of different features using:

A multiplication approach inspired by Naive Bayes classifiers (assuming conditional independence among features)


Page 11

Data Fusion – Rank Aggregation

[Diagram: ranked lists produced by two visual features and one textual feature are fused into a single combined ranked list]

Page 12

Data Fusion – Rank Aggregation

Notation: q is the query video, d is a dataset video, and s_j(q, d) is the similarity score between videos q and d under feature j. Given the set of similarity functions {s_1, …, s_m} defined by the different features, the new aggregated score is computed by multiplication:

  s(q, d) = Π_{j=1..m} s_j(q, d)
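The multiplication-based aggregation can be sketched as follows. The epsilon guard against zero scores is an assumed implementation detail, not something specified in the slides, and the toy score dictionaries are hypothetical:

```python
def aggregate(scores_per_feature):
    """Multiplication-based rank aggregation (Naive-Bayes-inspired):
    a dataset video's aggregated score is the product of its similarity
    scores under each feature. Scores are assumed to lie in (0, 1];
    `eps` guards against zeros wiping out a candidate entirely."""
    eps = 1e-6
    agg = {}
    for scores in scores_per_feature:          # one {video: score} dict per feature
        for video, s in scores.items():
            agg[video] = agg.get(video, 1.0) * max(s, eps)
    return sorted(agg, key=agg.get, reverse=True)   # combined ranked list

ranked = aggregate([
    {"v1": 0.9, "v2": 0.4},   # e.g. textual similarity
    {"v1": 0.3, "v2": 0.8},   # e.g. visual similarity (HMP)
])
```

Here v2 wins (0.4 × 0.8 = 0.32) over v1 (0.9 × 0.3 = 0.27): a candidate must score reasonably under every modality to rank high, which is the point of the product rule.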

Page 13

Runs Summary

Run   | Description                               | Descriptors used
Run 1 | Combine 3 textual                         | Okapi_all + Okapi_desc + Dice_desc
Run 2 | Combine 2 textual & 2 visual              | Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
Run 3 | Single visual: HMP                        | HMP (last year's visual approach)
Run 4 | Combine 3 visual                          | HMP + BoS5000 + BoS500
Run 5 | Textual: Flickr photo tags as geo-profile | Okapi on keywords

Page 14

Results for Test Set

Radius (km) | Run 1  | Run 2  | Run 3  | Run 4  | Run 5
1           | 21.40% | 22.29% | 15.81% | 15.93% |  9.28%
10          | 30.68% | 31.25% | 16.07% | 16.09% | 19.44%
100         | 35.39% | 36.42% | 16.62% | 17.07% | 24.13%
200         | 37.37% | 38.40% | 17.58% | 17.86% | 25.85%
500         | 41.77% | 43.35% | 19.68% | 19.97% | 29.29%
1,000       | 45.38% | 47.68% | 24.77% | 25.47% | 33.91%
2,000       | 53.32% | 56.03% | 33.48% | 33.31% | 46.05%
5,000       | 62.29% | 66.91% | 45.34% | 45.34% | 65.73%
10,000      | 85.27% | 87.95% | 81.95% | 81.73% | 87.69%
15,000      | 95.89% | 96.80% | 95.79% | 95.70% | 96.17%

The combined classical text vector-space result (run 1) is twice as good as using visual cues alone (run 4).
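The per-radius percentages in the table are accuracies within a given distance of the ground-truth location. A sketch of how such a metric can be computed with the haversine great-circle distance (the task's official evaluation script may differ in detail):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_radius(predictions, ground_truth, radius_km):
    """Fraction of test videos whose predicted (lat, long) falls within
    `radius_km` of the true location."""
    hits = sum(
        haversine_km(*p, *t) <= radius_km
        for p, t in zip(predictions, ground_truth)
    )
    return hits / len(predictions)
```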

Page 15

Results for Test Set (same table as on Page 14)

Run 5 (photo metadata used as a geo-profile) performs worse than run 4 (visual information only) at 1 km precision. However, for all other radii, run 5 beats runs 3 and 4.

Page 16

Results for Test Set (same table as on Page 14)

Combining different textual and visual descriptors (run 2) leads to statistically significant improvements (confidence ≥ 0.99) over using only textual clues (run 1).

[Figure: 99% confidence intervals of the distance error for runs 1 and 2; distance axis roughly 3,500-4,500 km]

Page 17

Results for Test Set: run 3 (HMP) vs. 2011's HMP results (other runs as in the table on Page 14)

Radius (km) | Run 3 (2012) | HMP (2011)
1           | 15.81%       |  0.21%
10          | 16.07%       |  1.12%
100         | 16.62%       |  2.71%
200         | 17.58%       |  3.33%
500         | 19.68%       |  6.08%
1,000       | 24.77%       | 12.16%
2,000       | 33.48%       | 22.11%
5,000       | 45.34%       | 37.78%
10,000      | 81.95%       | 79.45%

Run 3, using only HMP (our approach from last year), performs much better on this year's data set. Why? Perhaps the larger development set (+5,000 videos in 2012) yields a richer geo-profile.

Page 18

Conclusion

• Textual features: Okapi & Dice
• Visual features: HMP & BoS
  – HMP results: better in 2012 than in 2011 – is it due to the bigger development set?
• Combined textual (video metadata) and visual features via ranked lists
• Promising results: combining modalities yields better results than any single modality
• Future improvements:
  – other strategies for combining different modalities
  – other information sources to filter out noisy data from ranked lists (e.g., GeoNames and Wikipedia)

Page 19

Acknowledgements & contacts

• RECOD Lab @ Institute of Computing, UNICAMP

(University of Campinas)

• VoD Lab @ UFMG

(Universidade Federal de Minas Gerais)

• Organizers of Placing Task and MediaEval 2012

• Brazilian funding agencies

CAPES, FAPESP, CNPq

Contact email: {lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br