ICASSP, May 21 2004 Arjen P. de Vries Thijs Westerveld Tzvetanka I. Ianeva Combining Multiple Representations on the TRECVID Search Task


ICASSP, May 21 2004

Arjen P. de Vries

Thijs Westerveld

Tzvetanka I. Ianeva

Combining Multiple Representations on the TRECVID Search Task


Introduction

• Video retrieval should take advantage of information from all available sources and modalities
– …but so far ASR best for almost any query

• LL11@TRECVID2003: Combining information sources
– Different models/modalities
– Multiple example images


‘Language Modelling’ approach to IR

[Figure: each document (Docs) is represented by a model (Models)]

Retrieval

• Calculate conditional probabilities of observing query samples given each model in the collection: P(Q|M1), P(Q|M2), P(Q|M3), P(Q|M4)

[Figure: the query is scored against each of the models M1 to M4]
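The retrieval step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes unigram language models stored as word-probability dictionaries, and it smooths each model by linear interpolation with a background model because the slides do not say which smoothing is used. The names `query_log_likelihood`, `rank_models`, and the weight `lam` are hypothetical.

```python
import math

def query_log_likelihood(query_terms, model, background, lam=0.5):
    """Return log P(Q|M) for a unigram model, treating query terms as independent.

    Interpolating with a background model is an assumption made here so that
    a term missing from the document model does not zero out the whole query.
    """
    total = 0.0
    for term in query_terms:
        p = lam * model.get(term, 0.0) + (1.0 - lam) * background.get(term, 0.0)
        if p <= 0.0:
            return float("-inf")  # term unknown to both models
        total += math.log(p)
    return total

def rank_models(query_terms, models, background, lam=0.5):
    """Score every model in the collection and sort by decreasing log P(Q|M)."""
    scored = [(name, query_log_likelihood(query_terms, m, background, lam))
              for name, m in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy collection (made-up probabilities, for illustration only).
models = {
    "M1": {"dow": 0.5, "jones": 0.5},
    "M2": {"basketball": 0.6, "game": 0.4},
}
background = {"dow": 0.25, "jones": 0.25, "basketball": 0.3, "game": 0.2}
ranking = rank_models(["dow", "jones"], models, background)
```

Working in log space avoids numerical underflow when many query samples are multiplied together.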


Static Model

• Indexing
– Estimate a Gaussian Mixture Model from each keyframe (using EM)
– Fixed number of components (C=8)
– Feature vectors contain colour, texture, and position information from pixel blocks: <x,y,DCT>
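The indexing step above can be illustrated with a small EM routine. The following is a sketch under stated assumptions, not the authors' code: it uses diagonal covariances, random initialisation from the samples, and a variance floor, none of which are specified on the slide. The function names and the parameters `n_iter`, `min_var`, and `seed` are hypothetical; each sample stands in for one <x,y,DCT> feature vector.

```python
import math
import random

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def fit_gmm(samples, n_components=8, n_iter=25, min_var=1e-3, seed=0):
    """Fit a diagonal-covariance GMM with EM; returns (weights, means, variances)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [1.0 / n_components] * n_components
    means = [list(s) for s in rng.sample(samples, n_components)]
    variances = [[1.0] * dim for _ in range(n_components)]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            logs = [math.log(w) + log_gauss_diag(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            norm = log_sum_exp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M-step: re-estimate weight, mean, and variance per component.
        for c in range(n_components):
            mass = sum(r[c] for r in resp) + 1e-12  # guard against empty components
            weights[c] = mass / len(samples)
            means[c] = [sum(r[c] * x[d] for r, x in zip(resp, samples)) / mass
                        for d in range(dim)]
            variances[c] = [max(min_var,
                                sum(r[c] * (x[d] - means[c][d]) ** 2
                                    for r, x in zip(resp, samples)) / mass)
                            for d in range(dim)]
    return weights, means, variances

# Toy data: two groups of 2-D points standing in for block feature vectors.
toy = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 1.0) for i in range(10)]
weights, means, variances = fit_gmm(toy, n_components=2, n_iter=10)
```

For real keyframes one would call `fit_gmm(features, n_components=8)` on the block feature vectors of a single keyframe, matching the C=8 setting on the slide.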


Dynamic Model

• Indexing
– GMM of multiple frames (N=29) around keyframe
– Feature vectors extended with time-stamp in [0,1]: <x,y,t,DCT>

[Figure: time axis with ticks at 0, .5, and 1]
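The time-stamp extension above can be sketched as follows. This is an illustration only: it assumes the frames are already ordered in time and evenly spaced, so the first frame gets t=0 and the last gets t=1. The function name `add_timestamps` and the toy block values are hypothetical.

```python
def add_timestamps(frames):
    """Turn per-frame <x,y,DCT> vectors into <x,y,t,DCT> vectors.

    frames: list of frames in temporal order; each frame is a list of
    tuples (x, y, dct_1, dct_2, ...). The time-stamp t is the frame's
    relative position in the window, normalised to [0, 1].
    """
    n = len(frames)
    samples = []
    for i, blocks in enumerate(frames):
        # A single-frame window has no extent; 0.5 is an arbitrary choice.
        t = i / (n - 1) if n > 1 else 0.5
        for x, y, *dct in blocks:
            samples.append((x, y, t, *dct))
    return samples

# Toy window of N=29 frames with two blocks each (made-up values).
frames = [[(0, 0, 12.0), (8, 0, 7.5)] for _ in range(29)]
samples = add_timestamps(frames)
```

With N=29 the middle frame (index 14) receives t=0.5, which matches the ticks on the slide's time axis.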


Dynamic Model Advantages

• More training data for models

• Reduced dependency upon selecting appropriate keyframe

• Some spatio-temporal aspects of shot are captured
– (Dis-)appearance of objects


Experimental Set-up

• Build models for each shot
– Static, Dynamic, Language

• Build queries from topics
– Construct simple keyword text query
– Select visual example
– Rescale and compress example images to match video size and quality


Combining Modalities

• Independence assumption textual/visual
– P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM)
• Combination works if both runs useful [CWI:TREC:2002]
• Dynamic run more useful than static run

Run            MAP
ASR only       .130
Static only    .022
Static+ASR     .105
Dynamic only   .022
Dynamic+ASR    .132
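Under the independence assumption above, the product P(Qt|LM) * P(Qv|GMM) becomes a sum of log-likelihoods. The sketch below shows that combination for per-shot scores. It is an assumption-laden illustration: the function name `combine_scores`, the dictionary layout, and the choice to skip shots missing from either run are not taken from the slides.

```python
def combine_scores(text_log_scores, visual_log_scores):
    """Rank shots by log P(Qt|LM) + log P(Qv|GMM).

    Both arguments map a shot identifier to a log-likelihood. Shots that
    appear in only one run are dropped, which is an assumption made here.
    """
    shared = text_log_scores.keys() & visual_log_scores.keys()
    combined = {shot: text_log_scores[shot] + visual_log_scores[shot]
                for shot in shared}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Made-up log-likelihoods for three shots.
text_scores = {"shot1": -4.0, "shot2": -6.0, "shot3": -5.0}
visual_scores = {"shot1": -30.0, "shot2": -20.0, "shot3": -29.5}
combined_ranking = combine_scores(text_scores, visual_scores)
```

In this toy case shot2 moves to the top because its visual score outweighs its weaker text score.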


Combining Modalities

Dynamic: Higher Initial Precision


Dow Jones Topic (120)

• “Dow Jones Industrial Average rise day points”

[Figure: keyword query + example image = combined result]


Arafat Topic (103)


Basketball topic (101)
Baseball topic (102)


Merging Run Results

• Combining (conflicting) examples difficult [CWI:TREC:2002]
• Single example: misses relevant shots
• Round-Robin Merging

[Figure: two ranked lists (ranks 1 to 10 each) interleaved into a combined list: 1 1 2 2 3 3 4 4 ..]
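Round-robin merging as pictured above takes the rank-1 result of every run, then the rank-2 results, and so on. A minimal sketch follows. Skipping shots that were already emitted is an assumption (the slide does not say how duplicates are handled), and the name `round_robin_merge` and its `limit` parameter are hypothetical.

```python
def round_robin_merge(runs, limit=None):
    """Interleave ranked lists rank by rank: 1, 1, 2, 2, 3, 3, ...

    runs: list of ranked lists of shot identifiers, best first.
    A shot that already appeared in the merged list is skipped.
    """
    merged, seen = [], set()
    longest = max((len(run) for run in runs), default=0)
    for rank in range(longest):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                seen.add(run[rank])
                merged.append(run[rank])
    return merged if limit is None else merged[:limit]

# Two runs, one per example image (made-up shot identifiers).
run_a = ["a", "b", "c"]
run_b = ["b", "d", "e"]
merged = round_robin_merge([run_a, run_b])
```

Here "b" is the second result of the first run and the top result of the second run, so it is emitted once, at position two.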

Round-robin merging results (MAP)

Examples   MAP    MAP +ASR
Single     .022   .132
All        .031   .149
Selected   .039   .151
Best       .050   .155
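The tables report mean average precision (MAP). For reference, the standard non-interpolated definition can be sketched as below. This is the textbook formula, not the official evaluation tool, and it assumes each ranking contains no duplicate shots. The function names are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k at which a relevant shot occurs,
    divided by the total number of relevant shots."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, shot in enumerate(ranking, start=1):
        if shot in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(topics):
    """topics: list of (ranking, relevant_set) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)

# Two toy topics with made-up judgements.
topics = [
    (["a", "x", "c"], {"a", "c"}),
    (["x", "b"], {"b"}),
]
map_score = mean_average_precision(topics)
```

Relevant shots that never appear in the ranking still count in the denominator, so a run is penalised for what it misses.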


Flames Topic (112)


Conclusions

• For most topics, neither the static nor the dynamic visual model captures the user information need sufficiently…

• …averaged over 25 topics, however, it is better to use both modalities than ASR only

Working hypothesis: Matching against both modalities gives robustness


• Dynamic captures visual similarity better
– Thanks to spatio-temporal aspects?
    • Experiments with full covariance matrix for <x,y,t>-dims

• Static model of KF is too fragile
– Dependency on single KF?
    • To be tested by ranking max(all I-frames in shot)
– Not enough training data?
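One possible reading of the proposed test, "ranking max(all I-frames in shot)", is to score every I-frame of a shot with the static model and rank shots by their best frame. The sketch below illustrates that reading only; the slide marks it as future work, so the function name `rank_shots_by_best_frame` and the data layout are assumptions.

```python
def rank_shots_by_best_frame(frame_scores):
    """Rank shots by the maximum static-model score over their I-frames.

    frame_scores: dict mapping a shot identifier to a list of per-I-frame
    log-likelihoods. Shots without any scored frame are skipped.
    """
    shot_scores = {shot: max(scores)
                   for shot, scores in frame_scores.items() if scores}
    return sorted(shot_scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up per-frame scores for three shots.
frame_scores = {
    "shot1": [-40.0, -25.0, -33.0],
    "shot2": [-28.0, -29.0],
    "shot3": [],
}
best_frame_ranking = rank_shots_by_best_frame(frame_scores)
```

Taking the maximum removes the dependency on a single pre-selected keyframe, which is the fragility the bullet above asks about.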


• Visual aspects of an information need are best captured by using multiple examples

• Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics