When Textual and Visual Information Join Forces for Multimedia Retrieval

Bahjat Safadi, Mathilde Sahuguet, Benoit Huet

EURECOM, Multimedia Department

Sophia Antipolis, France

Introduction

• The EU alone hosts 500+ online video platforms

• 42.7m hrs of footage in online archives of broadcasters and producers (61% of archive footage is online)

• UGC on the advance:
  – YouTube receives 60 hrs of video/minute
  – Vine and Instagram video

• Internet video is now 40 percent of consumer Internet traffic, and will reach 62 percent by the end of 2015 and 75 percent in 2017 (source: CISCO)

• How to make the content accessible?
  – Browsing, Searching, Hyperlinking

B Huet - Eurecom - BAMMF - 20/06/2014

Objectives and Contributions

• We propose and evaluate a video search framework that uses visual information to enrich classic text-based search for video retrieval operating at the fragment level.

• We investigate the following two questions:
  – To what extent can visual concepts contribute information when retrieving videos?
  – How can we cope with the confidence in visual concept detection?

• The framework extends conventional text-based search by fusing together textual and visual scores.

• We address both the semantic and intention gaps:
  – By automatically mapping the query text to semantic concepts.
  – With the addition of “visual cues”.


MediaEval Search & Hyperlinking

• Information seeking in a video dataset: retrieving media fragments/anchors


The Video Archive

2323 BBC videos of different genres (440 programs)

• ~1697h of video + audio
• Subtitles (manual)
• Two ASR transcripts (LIMSI, LIUM)
• Metadata (Title, Cast, Description, ...)
• Shot boundaries and key-frames
• Search: 50 queries from 29 users
  – Textual query + visual cues
• Face detection
• Concept detection


The Video Archive

2323 BBC videos of different genres (440 programs)� ~1697h of video + audio� Subtitles (manual)� Two ASR transcripts (LIMSI,LIUM)� Metadata (Title, Cast, Description,..)� Shot boundaries and key-frames� Search: 50 queries from 29 users

– Textual query + visual cues� Face detection� Concept detection


Text query: Medieval history of why castles were first built

Visual cues: Castle

Text query: Best players of all time; Embarrassing England performances; Wake up call for English football; Wembley massacre;

Visual cues: Poor camera quality; heavy looking football; unusual goal celebrations; unusual crowd reactions; dark; grey; overcast; black and white;

The proposed Framework


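[Figure: overview of the framework. Off-line: the collection (videos, scenes and subtitles) is indexed in two ways: Lucene indexing of scenes and subtitles, and content-based indexing with visual semantic concepts, which yields concept indexing scores for each scene. On-line: user querying with a textual/visual query. The textual query produces text-based scores; the visual cues are mapped to selected concepts, which produce visual-based scores. Fusion of the two scores gives the ranking (ranked list).]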



Content-based indexing

[Figure: off-line part of the framework. The collection (videos, scenes and subtitles) is split into scenes and indexed with visual semantic concepts, giving concept indexing scores.]

No training data for visual concepts

Use 151 visual concept detectors trained on TrecVid 2012 data

Unknown performance

Visual concept detector confidence (w)

• 100 top images for the concept “Animal”

• 58 out of 100 are manually evaluated as valid
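The slide does not spell out how the confidence w is derived from this manual check. A minimal sketch, assuming w is simply the fraction of the top-ranked images judged valid (precision at 100), could look as follows; the function name `detector_confidence` and the judgment list are illustrative, not part of the original system.

```python
def detector_confidence(judgments):
    """Fraction of the top-ranked images judged valid (precision at k).

    Assumption: this is used directly as the detector confidence w.
    """
    if not judgments:
        raise ValueError("need at least one judged image")
    return sum(1 for valid in judgments if valid) / len(judgments)


# 58 valid images among the top 100 for "Animal", as reported on the slide.
animal_judgments = [True] * 58 + [False] * 42
w_animal = detector_confidence(animal_judgments)
print(w_animal)  # 0.58
```

Under this assumption the “Animal” detector would get w = 0.58, and a detector whose top images are mostly wrong would contribute less to the visual score.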




User querying

Textual/visual query:

<queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText> <visualCues>House memories Farm exploration A poem on animal and shells </visualCues>

Users are not aware of visual concepts

Mapping visual cues to visual concepts

• <queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText> <visualCues>House memories Farm exploration A poem on animal and shells </visualCues>

[Figure: WordNet mapping from keywords to visual concepts. Keywords taken from the visual cues: Farm, Shells, Exploration, Poem, Animal, House, Memories. Visual concepts: Animal, Birds, Insect, Cattle, Dogs, Building, School, Church, Flags, Mountain.]
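The first step of the mapping, turning the free-text visual cues into keywords, can be sketched as below. This is only an illustration under stated assumptions: the `<topic>` wrapper element, the `extract_keywords` helper and the tiny stop-word list are hypothetical, and the slides do not say how the original system tokenises or filters the cues.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical minimal stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "of", "on", "the"}


def extract_keywords(topic_xml: str) -> List[str]:
    """Return the distinct non-stop-word tokens of the <visualCues> element."""
    # The topic has two sibling elements and no root, so wrap it before parsing.
    root = ET.fromstring("<topic>" + topic_xml + "</topic>")
    cues = root.findtext("visualCues", default="")
    keywords = []
    for token in cues.lower().split():
        if token.isalpha() and token not in STOP_WORDS and token not in keywords:
            keywords.append(token)
    return keywords


topic = (
    "<queryText>Children out on poetry trip Exploration of poetry by school "
    "children Poem writing</queryText> <visualCues>House memories Farm "
    "exploration A poem on animal and shells </visualCues>"
)
print(extract_keywords(topic))
# ['house', 'memories', 'farm', 'exploration', 'poem', 'animal', 'shells']
```

The seven keywords obtained this way are the ones shown in the mapping example; each would then be compared against the visual concept names through WordNet.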


Mapping visual cues to visual concepts

• Concepts mapped to the visual query “Castle”

• Semantic similarity (β) computed using the “Lin” similarity measure

Concept   Windows   Plant    Court    Church   Building
β         0.4533    0.4582   0.5115   0.6123   0.701
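The evaluation later keeps only the mapped concepts whose similarity reaches a threshold θ. A minimal sketch of that selection, using the β values listed above for “Castle”, is given here; the function name `select_concepts` and the dictionary layout are assumptions made for illustration.

```python
# Similarity (β) between the cue "Castle" and each mapped concept, from the slide.
CASTLE_SIMILARITY = {
    "Windows": 0.4533,
    "Plant": 0.4582,
    "Court": 0.5115,
    "Church": 0.6123,
    "Building": 0.701,
}


def select_concepts(similarity, theta):
    """Keep only the concepts whose similarity is at least theta."""
    return {c: beta for c, beta in similarity.items() if beta >= theta}


print(sorted(select_concepts(CASTLE_SIMILARITY, 0.5)))  # ['Building', 'Church', 'Court']
print(sorted(select_concepts(CASTLE_SIMILARITY, 0.7)))  # ['Building']
print(sorted(select_concepts(CASTLE_SIMILARITY, 0.8)))  # []
```

Raising θ trades coverage for precision: at θ = 0.8 the “Castle” query keeps no concept at all, so it would fall back to text-only search.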



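Ranking and fusion

[Figure: on-line part of the framework. Text-based scores from the Lucene index and visual-based scores from the concepts selected by WordNet similarity are fused to produce the ranking.]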

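One score for each scene (t): the text-based score $t_i$ of scene $i$.

One score for each scene (v): computed from the scores of the selected concepts for each scene:

$$v_i^q = \sum_{c \in C'_q} w_c \times vs_i^c$$

where $C'_q$ is the set of concepts selected for query $q$, $w_c$ is the confidence of the detector for concept $c$, and $vs_i^c$ is the score of concept $c$ for scene $i$.

Fusion of the two scores (query index $q$ omitted):

$$f_i = t_i^{\alpha} + v_i^{1-\alpha}$$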
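The two scores and their fusion can be sketched in a few lines. The sketch reads the slide's formulas as: the visual score of a scene is the confidence-weighted sum of its scores for the selected concepts, and the fused score is the text score raised to α plus the visual score raised to 1 − α. All scene names, score values and function names are made up for illustration, and scores are assumed to be non-negative.

```python
def visual_score(concept_scores, weights, selected):
    """Confidence-weighted sum of a scene's scores over the selected concepts."""
    return sum(weights[c] * concept_scores[c] for c in selected)


def fused_score(t, v, alpha):
    """Fuse a text score t and a visual score v; assumes t, v >= 0 and 0 <= alpha <= 1."""
    return t ** alpha + v ** (1 - alpha)


def rank_scenes(scenes, weights, selected, alpha):
    """Return scene ids sorted by decreasing fused score."""
    fused = {
        scene_id: fused_score(s["t"], visual_score(s["vs"], weights, selected), alpha)
        for scene_id, s in scenes.items()
    }
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scenes: a text score t and per-concept detector scores vs.
scenes = {
    "scene_a": {"t": 0.8, "vs": {"Building": 0.1, "Church": 0.0}},
    "scene_b": {"t": 0.6, "vs": {"Building": 0.9, "Church": 0.7}},
}
weights = {"Building": 1.0, "Church": 1.0}  # the w = 1.0 setting
selected = ["Building", "Church"]

print(rank_scenes(scenes, weights, selected, alpha=0.5))  # ['scene_b', 'scene_a']
print(rank_scenes(scenes, weights, selected, alpha=1.0))  # ['scene_a', 'scene_b']
```

With α = 1 the visual term becomes a constant and the ranking is purely text-based; lowering α lets strong visual evidence promote scene_b above scene_a.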

Evaluation

• To what extent can visual concepts contribute information when retrieving videos?

• How can we cope with the confidence in visual concept detection?

• BBC Archive subset provided by the MediaEval 2013 Search and Hyperlinking task.

• Evaluation measures:
  – Mean Reciprocal Rank (MRR): assesses the rank of the relevant segment
  – Mean Generalized Average Precision (mGAP): takes into account the starting time of the segment
  – Mean Average Segment Precision (MASP): measures both ranking and segmentation of relevant segments
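Of the three measures, MRR is the simplest to illustrate. The sketch below uses the standard definition (the mean over queries of the reciprocal rank of the first relevant result) and treats relevance as plain membership of a segment id in a relevant set. That is a simplification: the benchmark's own matching of returned segments to the relevant one, which is not described on the slide, is not reproduced here. The query and segment ids are invented.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Mean over queries of 1 / rank of the first relevant item (0 if none is found)."""
    if not ranked_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for query, ranking in ranked_lists.items():
        for rank, item in enumerate(ranking, start=1):
            if item in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical rankings for three queries.
ranked_lists = {
    "q1": ["s3", "s7", "s1"],        # relevant segment at rank 1
    "q2": ["s2", "s9", "s4", "s5"],  # relevant segment at rank 4
    "q3": ["s8", "s6"],              # relevant segment not retrieved
}
relevant = {"q1": {"s3"}, "q2": {"s5"}, "q3": {"s0"}}

mrr = mean_reciprocal_rank(ranked_lists, relevant)
print(round(mrr, 4))  # 0.4167
```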


Retrieval Performance (50 queries)

• Low impact of visual concept detector confidence (w)

• Significant improvement can be achieved by combining only mapped concepts with θ ≥ 0.3.

• Best performance is obtained when θ ≥ 0.8 (gain ≈ 11-12%).

[Figure: two panels, w = 1.0 and w = confidence(c)]

Visual concepts and Query association

• Number of concepts associated with each query for different values of the threshold θ. The last row, #Q (#C'q > 0), counts the queries that keep at least one concept.

θ               0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Min               5     5     5     2     0     0     0     0     0     0
Max              45    45    41    37    25    19    19    12     6     2
Mean             20    19    18    15    11     7     5     3     1     1
#Q (#C'q > 0)    50    50    50    50    49    49    48    44    29    21

Retrieval on queries with visual concepts (21)

• Concept mapping significantly improves the performance of the text-based search task on these queries.

• The best performance was achieved with θ ≥ 0.7 (gain ≈ 32-33%).

[Figure: two panels, w = 1.0 and w = confidence(c)]

Conclusion

• A novel video search framework using visual information to enrich text-based search for video retrieval has been presented.

• We conducted our evaluation on MediaEval 2013, where we ranked 2nd on Search and 1st on Hyperlinking.

• Experimental results show that mapping text-based queries to visual concepts significantly improves the search system.

• When the relevant visual concepts are appropriately selected, a very significant improvement is achieved (gain ≈ 33%).


Related Publications

• B. Safadi, M. Sahuguet and B. Huet. When textual and visual information join forces for multimedia retrieval. ICMR 2014, ACM International Conference on Multimedia Retrieval, April 1-4, 2014, Glasgow, Scotland.

• M. Sahuguet and B. Huet. Mining the Web for Multimedia-based Enriching. MMM 2014, 20th International Conference on MultiMedia Modeling, January 8-10, 2014, Dublin, Ireland.

• M. Sahuguet, B. Huet, B. Cervenkova, E. Apostolidis, V. Mezaris, D. Stein, S. Eickeler, J-L. Redondo Garcia, R. Troncy and L. Pikora. LinkedTV at MediaEval 2013 search and hyperlinking task. MediaEval 2013, Multimedia Benchmark Workshop, October 18-19, 2013, Barcelona, Spain.

• D. Stein, A. Öktem, E. Apostolidis, V. Mezaris, J. L. Redondo García, R. Troncy, M. Sahuguet and B. Huet. From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow. NEM Summit 2013, Networked & Electronic Media, October 28-30, 2013, Nantes, France.

• V. Mezaris and B. Huet. Video Hyperlinking. Tutorial accepted at ICIP 2014 (October), Paris.

• B. Safadi, M. Sahuguet and B. Huet. Linking text and visual concepts semantically for cross modal multimedia search. ICIP 2014, Paris, 2014.


Questions?

http://www.slideshare.net/huetbenoit/

Thank you.
