A Multimodal Approach for Video Geocoding (UNICAMP at Placing Task MediaEval 2012). Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette, Otávio A. B. Penatti, and Ricardo da S. Torres. Institute of Computing, University of Campinas (UNICAMP), Brazil


TRANSCRIPT

Page 1: A Multimodal Approach for Video Geocoding

A Multimodal Approach for Video Geocoding

(UNICAMP at Placing Task MediaEval 2012)

Lin Tzy Li, Jurandy Almeida, Daniel Carlos Guimarães Pedronette, Otávio A. B. Penatti, and Ricardo da S. Torres

Institute of Computing - University of Campinas (UNICAMP) Brazil

Page 2

Multimodal geocoding proposal

Page 3

Textual features

• Similarity functions: Okapi & Dice

• Video metadata (run 1)

– Title + description + keywords (Okapi_all)

– Only description: Okapi_desc & Dice_desc

– Combined result in run 1:

• Okapi_all + Okapi_desc + Dice_desc

• Photo tags (run 5)

– Okapi function

keywords (test video) matched against tags (3,185,258 Flickr photos)
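The Dice similarity function named above can be sketched over term sets. This is a minimal illustration only; the exact preprocessing and the Okapi (BM25) weighting used in the runs are not detailed in the slides, and the sample inputs are hypothetical:

```python
def dice(a, b):
    """Dice coefficient between two term sets: 2*|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical example: a test video's keywords vs. a Flickr photo's tags.
video_kw = {"beach", "rio", "sunset"}
photo_tags = {"rio", "sunset", "copacabana", "brazil"}
score = dice(video_kw, photo_tags)  # 2*2 / (3+4) = 4/7
```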

Page 4

Geocoding Visual Content

[Diagram] A test video goes through video feature extraction (Bag of Scenes built from photos; Histograms of Motion Patterns) and video similarity computation against the development set (15,563 geocoded videos, each with lat/long, tags, title, description, etc.), yielding a ranked list of similar videos (V1 … Vk) with candidate lat/long and match scores. The location (lat/long) of the most similar video is used as the candidate location for the test video.

Page 5

Visual Features (HMP): Extracting

• Histograms of Motion Patterns

• Keyframes: Not used

• Algorithm to compare video sequences: (1) partial decoding; (2) feature extraction; (3) signature generation.

“Comparison of video sequences with histograms of motion patterns”, J. Almeida et al. ICIP, 2011.
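The quantize-and-histogram idea behind steps (2) and (3) can be caricatured in a few lines. This toy sketch assumes motion vectors are already available from partial decoding and is only an illustration; the actual HMP descriptor (Almeida et al., ICIP 2011) builds histograms of motion patterns from the compressed stream:

```python
import math
from collections import Counter

def motion_histogram(motion_vectors, bins=8):
    """Toy video signature: quantize each motion vector's orientation into
    one of `bins` pattern codes and return an L1-normalized histogram.
    (Hypothetical simplification, not the real HMP feature.)"""
    counts = Counter()
    for dx, dy in motion_vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)          # orientation in [0, 2π)
        counts[int(angle / (2 * math.pi) * bins) % bins] += 1
    total = sum(counts.values()) or 1
    return [counts[i] / total for i in range(bins)]

sig = motion_histogram([(1, 0), (0, 1), (1, 1), (-1, 0)])
```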

Page 6

Visual Features (HMP): overview

[Almeida et al., Comparison of video sequences with histograms of motion patterns. ICIP 2011]

Page 7

HMP: Comparing Videos

• Comparison of histograms can be performed by any vectorial distance function

– like Manhattan (L1) or Euclidean (L2)

• Video sequences compared with

– Histogram intersection, defined as:

  d(v1, v2) = Σ_i min(H_{v1}(i), H_{v2}(i))

where H_{vi} is the histogram extracted from video Vi. Output range: [0, 1], with 0 = histograms not similar and 1 = identical histograms.
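The histogram-intersection measure above, as a minimal sketch for L1-normalized histograms:

```python
def hist_intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima.
    For L1-normalized histograms the result lies in [0, 1]:
    0 = no overlap, 1 = identical histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

score = hist_intersection([0.5, 0.5, 0.0], [0.5, 0.25, 0.25])  # 0.5 + 0.25 + 0.0 = 0.75
```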

Page 8

Visual Features: Bag-of-Scenes (BoS)

[Diagram: instead of a dictionary of local descriptions, BoS builds a dictionary of scenes, from which the video feature vector is computed]

[Penatti et al., A Visual Approach for Video Geocoding using Bag-of-Scenes. ACM ICMR 2012]

Page 9

Creating the dictionary: scenes → feature vectors → visual-word selection → dictionary of scenes.

Using the dictionary: video → frames → feature vectors → assignment + pooling → video feature vector (bag-of-scenes).
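The assignment + pooling step can be sketched as below. This assumes hard assignment to the nearest scene and sum pooling with L1 normalization; these are illustrative choices, not necessarily the exact configuration used in the runs:

```python
import numpy as np

def bag_of_scenes(frame_feats, dictionary):
    """Hard-assign each frame's feature vector to its nearest 'scene'
    (dictionary word), then sum-pool the assignments into a single
    L1-normalized video vector. Minimal sketch of BoS encoding."""
    # Pairwise distances: shape (n_frames, n_scenes).
    d = np.linalg.norm(frame_feats[:, None, :] - dictionary[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                      # hard assignment
    bos = np.bincount(nearest, minlength=len(dictionary)).astype(float)
    return bos / max(bos.sum(), 1.0)                # sum pooling + L1 norm

# Hypothetical example: 3 frame features, a 2-scene dictionary.
vec = bag_of_scenes(np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.0]]),
                    np.array([[0.0, 0.0], [1.0, 1.0]]))
```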

Page 10

Data Fusion – Rank Aggregation

• In multimedia applications, fusing information from different modalities is essential for achieving high effectiveness

• Rank Aggregation:

– Unsupervised approach for Data Fusion

– Combination of different features using:

A multiplication approach inspired by Naive Bayes classifiers (assuming conditional independence among features)


Page 11

Data Fusion – Rank Aggregation

[Diagram: ranked lists produced by two visual features and one textual feature are fused into a single combined ranked list]

Page 12

Data Fusion – Rank Aggregation

Notation: q is the query video, d is a dataset video, and s_j(q, d) is the similarity score between videos q and d under feature j. Given the set of similarity functions {s_1, …, s_m} defined by the different features, the new aggregated score is computed by multiplication:

  s(q, d) = Π_{j=1..m} s_j(q, d)
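The multiplication-based aggregation can be sketched as follows. The epsilon guard against zero scores is an assumed implementation detail, not something specified in the slides, and the toy score dictionaries are hypothetical:

```python
def aggregate(scores_per_feature):
    """Multiplication-based rank aggregation (Naive-Bayes-inspired):
    a dataset video's aggregated score is the product of its similarity
    scores under each feature. Scores are assumed to lie in (0, 1];
    `eps` guards against zeros wiping out a candidate entirely."""
    eps = 1e-6
    agg = {}
    for scores in scores_per_feature:          # one {video: score} dict per feature
        for video, s in scores.items():
            agg[video] = agg.get(video, 1.0) * max(s, eps)
    return sorted(agg, key=agg.get, reverse=True)   # combined ranked list

ranked = aggregate([
    {"v1": 0.9, "v2": 0.4},   # e.g. textual similarity
    {"v1": 0.3, "v2": 0.8},   # e.g. visual similarity (HMP)
])
```

Here v2 wins (0.4 × 0.8 = 0.32) over v1 (0.9 × 0.3 = 0.27): a candidate must score reasonably under every modality to rank high, which is the point of the product rule.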

Page 13

Runs Summary

Run   | Description                               | Descriptors used
Run 1 | Combine 3 textual                         | Okapi_all + Okapi_desc + Dice_desc
Run 2 | Combine 2 textual & 2 visual              | Okapi_all + Okapi_desc + HMP + BoS_CEDD5000
Run 3 | Single visual: HMP                        | HMP (last year's visual approach)
Run 4 | Combine 3 visual                          | HMP + BoS5000 + BoS500
Run 5 | Textual: Flickr photo tags as geo-profile | Okapi on keywords

Page 14

Results for Test Set

Radius (km) | Run 1  | Run 2  | Run 3  | Run 4  | Run 5
1           | 21.40% | 22.29% | 15.81% | 15.93% |  9.28%
10          | 30.68% | 31.25% | 16.07% | 16.09% | 19.44%
100         | 35.39% | 36.42% | 16.62% | 17.07% | 24.13%
200         | 37.37% | 38.40% | 17.58% | 17.86% | 25.85%
500         | 41.77% | 43.35% | 19.68% | 19.97% | 29.29%
1,000       | 45.38% | 47.68% | 24.77% | 25.47% | 33.91%
2,000       | 53.32% | 56.03% | 33.48% | 33.31% | 46.05%
5,000       | 62.29% | 66.91% | 45.34% | 45.34% | 65.73%
10,000      | 85.27% | 87.95% | 81.95% | 81.73% | 87.69%
15,000      | 95.89% | 96.80% | 95.79% | 95.70% | 96.17%

The combined classical text vector-space result (run 1) is twice as good as using visual cues alone (run 4).
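The per-radius percentages in the table are accuracies within a given distance of the ground-truth location. A sketch of how such a metric can be computed with the haversine great-circle distance (the task's official evaluation script may differ in detail):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_radius(predictions, ground_truth, radius_km):
    """Fraction of test videos whose predicted (lat, long) falls within
    `radius_km` of the true location."""
    hits = sum(
        haversine_km(*p, *t) <= radius_km
        for p, t in zip(predictions, ground_truth)
    )
    return hits / len(predictions)
```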

Page 15

Results for Test Set (same table as on Page 14)

Run 5 (photo metadata used as a geo-profile) performs worse than run 4 (visual information only) at 1 km precision. However, for all other radii, run 5 beats runs 3 and 4.

Page 16

Results for Test Set (same table as on Page 14)

Combining different textual and visual descriptors (run 2) leads to statistically significant improvements (confidence ≥ 0.99) over using only textual clues (run 1).

[Figure: 99% confidence intervals of the distance error for runs 1 and 2; distance axis roughly 3,500-4,500 km]

Page 17

Results for Test Set: run 3 (HMP) vs. 2011's HMP results (other runs as in the table on Page 14)

Radius (km) | Run 3 (2012) | HMP (2011)
1           | 15.81%       |  0.21%
10          | 16.07%       |  1.12%
100         | 16.62%       |  2.71%
200         | 17.58%       |  3.33%
500         | 19.68%       |  6.08%
1,000       | 24.77%       | 12.16%
2,000       | 33.48%       | 22.11%
5,000       | 45.34%       | 37.78%
10,000      | 81.95%       | 79.45%

Run 3, using only HMP (our approach from last year), performs much better on this year's data set. Why? Perhaps the larger development set (+5,000 videos in 2012) yields a richer geo-profile.

Page 18

Conclusion

• Textual features: Okapi & Dice
• Visual features: HMP & BoS
  – HMP results: better in 2012 than in 2011 – is it due to the bigger development set?
• Combined textual (video metadata) and visual features via ranked lists
• Promising results: combining modalities yields better results than any single modality
• Future improvements:
  – other strategies for combining different modalities
  – other information sources to filter out noisy data from ranked lists (e.g., GeoNames and Wikipedia)

Page 19

Acknowledgements & contacts

• RECOD Lab @ Institute of Computing, UNICAMP

(University of Campinas)

• VoD Lab @ UFMG

(Universidade Federal de Minas Gerais)

• Organizers of Placing Task and MediaEval 2012

• Brazilian funding agencies

CAPES, FAPESP, CNPq

Contact email: {lintzyli, jurandy.almeida, dcarlos, penatti, rtorres}@ic.unicamp.br