fisher kernel based relevance feedback for multimodal video retrieval

31
University Politehnica of Bucharest & University of Trento ICMR 2013 Sunday, November 2, 2014 1 Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval Ionu MIRONIC ț Ă 1 Bogdan IONESCU 1,2 Jasper Uijlings 2 Nicu Sebe 2 1 LAPI – University Politehnica of Bucharest, Romania 2 MHUG DISI – University of Trento, Italy ACM International Conference on Multimedia Retrieval, ICMR 2013, Dallas, Texas, USA, April 16 - 19, 2013

Upload: ionut-mironica

Post on 18-Jul-2015

133 views

Category:

Engineering


2 download

TRANSCRIPT

Page 1: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 1

Fisher Kernel based Relevance Feedbackfor Multimodal Video Retrieval

Ionu MIRONICț Ă1 Bogdan IONESCU1,2 Jasper Uijlings2 Nicu Sebe2

1LAPI – University Politehnica of Bucharest, Romania2MHUG DISI – University of Trento, Italy

ACM International Conference on Multimedia Retrieval, ICMR 2013, Dallas, Texas, USA, April 16 - 19, 2013

Page 2: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 2

Agenda

• Introduction

• Relevance Feedback Algorithms

• Fisher Kernel Theory

• Implementation

• Experimental results

• Conclusion

Page 3: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 3

Problem StatementConcepts• Content Based Video Retrieval• Genre Retrieval• Query by Example

Query Results

Movie Sample

Query Database

Page 4: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 4

CBVR System

VideoDescriptors (vectors with

features)Movie Database

(Off-line)Descriptor compute

(On-line)Descriptor Compute

Comparison

Image Selection

(browsing)

Page 5: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 5

Semantic Gap and Relevance Feedback• Relevance feedback uses positive and negative examples provided by the user to improve the system’s performance.

UserFeedback

Display

Estimation &Display selection

Feedbackto system

Display

UserFeedback

Page 6: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 6

State-of-the-Art algorithms:

• Rocchio Algorithm

• Feature Relevance Estimation (FRE)

• SVM – Relevance Feedback

• This paper: Fisher Kernel representation + SVM

Relevance Feedback

Page 7: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 7

∑∑∈∈

−+=NN

iRR

i

ii

NN

RR

QQγβα'

• Uses a set R of relevant documents and a set N of non-relevant documents, selected in the user relevance feedback phase, and updates the query feature.

Rocchio Algorithm

Initial query

Relevant documents Non-Relevant documentsNew Query-Point

Page 8: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 8

∑ ∑= =

−=d

i

d

iiiii WYXWYXDist

1 1

2)(),(

• some specific features may be more important than other features• every feature will have an importance weight that will be computed as Wi = 1/σ, where σ denotes the variance of video features.

Feature Relevance Estimation (FRE)

Initial queryRelevant documents

Non-Relevant documents

Page 9: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 9

SVM – Relevance Feedback

• Train SVM with two classes: - positive samples - negative samples

• Classify next samples as positive or negative using SVM network

• Use the confidence values of the classifiers to rerank the documents

Page 10: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 10

Fisher Kernel (FK) Theory- Introduced by [Jaakkola, T., Haussler, D.: Exploiting generative models in

discriminative classifiers. NIPS’99] for protein detection

- Introduced in Computer Vision by [Perronnin, F., Dance C. "Fisher kernels on visual vocabularies for image categorization." CVPR’07]

- combines the benefits of generative and discriminative approaches

- represents a signal as the gradient of the probability density function that is a learned generative model of that signal

Page 11: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 11

Fisher Kernel Framework

Compute Fisher Vectors

Training & Classification

Step

Feature extraction Dictionar extraction

X = {x1 ... xm}

Altering Feature Space Training & Reranking Step

Page 12: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 12

Fisher Kernel based RF

- Use a video document as a „query by example”

- The system returns top k documents (k=20)

- Use these document to train an Gaussian Mixture Model (GMM)

- Compute the Fisher Kernels for all the features (reshape the features space)

- Apply Fisher Kernel Normalization

Page 13: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 13

- The user selects the relevant documents

- The system is trained with these samples (SVM classifiers)- The documents are reranked using the classifier confidence level

- If the results are not satisfactory the use can continue with more Relevance Feedback iterations)

Fisher Kernel based RF

Page 14: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 14

Fisher Kernel based RF

Page 15: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 15

Experimental Setup MediaEval 2012 Dataset•14,838 episodes from 2,249 shows ~ 3,260 hours of data•split into Development and Test sets

5,288 for development / 9,550 for test

Page 16: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 16

Experimental Setup MediaEval 2012 Dataset

• 26 Genre labels26 Genre labels1000 art 1001 autos_and_vehicles 1002 business 1003 citizen_journalism 1004 comedy 1005 conferences_and_other_events 1006 default_category 1007 documentary 1008 educational 1009 food_and_drink 1010 gaming 1011 health 1012 literature 1013 movies_and_television 1014 music_and_entertainment 1015 personal_or_auto-biographical 1016 politics 1017 religion 1018 school_and_education 1019 sports 1020 technology 1021 the_environment 1022 the_mainstream_media 1023 travel 1024 videoblogging 1025 web_development_and_sites

Page 17: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014

17

Feedback Approaches for MPEG-7 Content-based Biomedical Image

Experimental Setup

Precision = Number of returned relevant documents

Total number of returned documents

Recall = Number of returned relevant documents

Total number of relevant documents

• Precision – Recall Chart

• Mean Average Precision - summarize rankings from multiple queries by averaging average precision

Page 18: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 18

Visual Descriptors• Global MPEG 7 related descriptors: Local Binary Pattern (LBP), Autocorrelogram, Color Coherence Vector (CCV), Edge Histogram (EHD) Color Layout Pattern(CLD), Scalable Color Descriptor classic color histogram (hist), color moments

• Global HoG

• Global Structural Descriptors describe contour properties and appearance parameters

Global Descriptors = mean and variance over all frame descriptors

[IJCV, C. Rasche’10]

edginess symmetry

Page 19: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 19

Audio Descriptors

[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]

• Linear Predictive Coefficients,

• Line Spectral Pairs,

• Mel-Frequency Cepstral Coefficients,

• Zero-Crossing Rate,

+ variance of each feature over a certain window.

• spectral centroid, flux, rolloff, and kurtosis,

f1 fn…f2

globalfeature = mean & variance

time

+var{f2} var{fn}

[Klaus Seyerlehner et al., MIREX’11, USA]

Standard Audio Features

Block-based Audio Features• Spectral Pattern, delta Spectral Pattern, • variance delta Spectral Pattern, • Logarithmic Fluctuation Pattern , • Correlation Pattern, Local Single Gaussian model• Spectral Contrast Pattern, • Mel-Frequency Cepstral Coefficients

[B. Mathieu et al., Yaafe toolbox, ISMIR’10]

Page 20: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 20

Text Descriptors

TF-IDF descriptors (Term Frequency-Inverse Document Frequency)

1. extract root words

2. remove terms <5%-percentile of the frequency distribution

3. select term corpus: retaining for each genre class m terms (e.g. m = 150 for ASR) with the highest χ2 values that occur more frequently than in complement classes

4. for each document the TF-IDF values are computed.

Text source: ASR [Lamel, et al., ICNL’08]

Page 21: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 21

Evaluation

(1) Evaluating Features Metrics

Manhattan Euclidian Mahalanobis Cosinus Bray-Curtis

Chi-Square

Canberra

Hog 17.02% 17.18% 17.07% 17.00% 17.10% 17.07% 16.67%

Structural 10.87% 10.55 11.14% 2.18% 10.92% 11.58% 14.82%

MPEG 7 12.37% 10.85 21.14% 8.69% 13.34% 13.34% 25.97%

Standard Audio

7.76% 7.78 29.26% 15.28% 7.78% 8.04% 1.58%

Block-Based Audio

19.33% 19.58 20.21% 21.23% 19.71% 19.99% 20.37%

Text 8.32% 7.15 5.39% 16.64% 20.40% 9.83% 9.68%

(MAP values)

Page 22: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 22

Evaluation

(2) Optimizing Fisher Kernel

- number of GMM Centroids

Page 23: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 23

Implementation

Visual Audio Text

Without Normalization 37.25% 38.68% 31.13%

L1 – norm 36.82% 37.97% 29.83%

L2 – norm 39.22% 41.94% 30.51%

Log – norm 38.61% 42.01% 35.07%

PN 38.51% 41.37% 34.93%

PN + L1 – norm 39.20% 42.98% 30.12%

PN + L2 – norm 39.46% 43.23% 31.71%

(2) Optimizing Fisher Kernel

- Normalization strategy

(MAP values)

PN = Power NormalizationLog – norm = Logarithmic Normalization

Page 24: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 24

Implementation

Feature FK- for all data FK RF – RBF

Visual features 34.02% 38.23%

Standard audio features 38.25% 46.34%

Text features 32.37% 35.14%

(3) Fisher Kernel representation on all data

(MAP values)

Page 25: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 25

Evaluation

(4) Comparison to State-of-the-Art algorithms

- Rocchio- Nearest Neighbor RF - NB- Boost RF- SVM RF - Random Forest RF - (RF)- Relevance Feature Estimation - (RFE)

Page 26: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 26

Evaluation (4) Comparison to State-of-the-Art algorithms

• Rocchio• Nearest Neighbor RF - NB• Boost RF• SVM RF • Random Forest RF - (RF)• Relevance Feature Estimation - (RFE)

Page 27: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 27

Evaluation

Feature Without RF

Rocchio NB Boost SVM RF RFE FK Linear

FK RBF

HoG 17.18 25.57 24.18 26.72 26.49 26.89 27.50 29.46 29.59

Structural 14.82 21.96 23.73 23.63 24.62 24.69 23.91 26.28 23.96

MPEG 7 25.97 30.88 34.09 32.55 32.90 36.85 31.93 40.50 40.80

All Visual 26.18 32.98 34.25 35.99 36.08 42.28 32.43 41.33 42.23

Standard Audio

29.26 32.71 34.88 32.88 38.58 40.46 44.32 44.80 46.34

Block Based Audio

21.23 35.39 35.22 39.87 31.46 33.41 31.93 43.96 43.69

Text 20.40 32.55 26.91 26.93 34.70 34.70 25.82 34.84 35.14

All features 30.29 37.91 39.88 38.88 40.93 45.31 44.93 46.43 46.80

(4) Comparison to State-of-the-Art algorithms

(MAP values)

Page 28: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 28

Evaluation

(5) Frame Agreggation with Fisher KernelNumber of GMM Centroids

Page 29: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 29

Evaluation

(5) Frame Agreggation with Fisher Kernel

Feature FKRF Linear (T=1)

FKRF RBF (T=1)

Frame Aggregation FK-RF Linear

Frame Aggregation FK-RF RBF

HoG 29.46% 29.59% 32.12% 32.87%

MPEG-7 40.50% 40.80% 44.69% 45.43%

(MAP values)

almost 5% improvement for MPEG-7

Page 30: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 30

• Relevance feedback is a powerful tool for improving content based video retrieval systems• The Fisher Kernel RF Approach outperforms classical RF algorithms (such as Rocchio or RFE and SVM) in terms of accuracy.• Modelling the temporal variation in video using the Fisher Kernel yields 5% improvement for MPEG-7 features

Future Work•Test algorithm on bigger datasets•Extend to the other features (motion, audio and text)•Test modelling temporal variation on general retrieval tasks

Conclusions

Page 31: Fisher Kernel based Relevance Feedback for Multimodal Video Retrieval

University Politehnica of Bucharest & University of Trento

ICMR 2013Sunday, November 2, 2014 31

Thank you!

Questions?