when textual and visual information join forces for multimedia retrieval
DESCRIPTION
Currently, popular search engines retrieve documents on the basis of text information. However, integrating visual information with text-based search for video and image retrieval is still a hot research topic. In this paper, we propose and evaluate a video search framework that uses visual information to enrich the classic text-based search for video retrieval. The framework extends conventional text-based search by fusing text and visual scores, obtained from video subtitles (or automatic speech recognition) and visual concept detectors respectively. We attempt to overcome the so-called semantic gap problem by automatically mapping query text to semantic concepts. With the proposed framework, we endeavor to show experimentally, on a set of real-world scenarios, that visual cues can effectively improve the quality of video retrieval. Experimental results show that mapping text-based queries to visual concepts improves the performance of the search system. Moreover, when the relevant visual concepts for a query are selected appropriately, a very significant improvement in the system's performance is achieved.
TRANSCRIPT
When Textual and Visual Information Join Forces for MultiMedia Retrieval
Bahjat Safadi, Mathilde Sahuguet, Benoit Huet
EURECOM, Multimedia Department
Sophia Antipolis, France
Introduction
• EU alone hosts 500+ online video platforms
• 42.7m hrs of footage in online archives of broadcasters and producers (61% of archive footage is online)
• UGC on the advance:
  – YouTube receives 60 hrs of video/minute
  – Vine and Instagram video
• Internet video is now 40 percent of consumer Internet traffic, and will reach 62 percent by the end of 2015, 75% in 2017 (source: CISCO)
• How to make the content accessible?
  – Browsing, Searching, Hyperlinking
20/06/2014 B Huet - Eurecom - BAMMF - p 2
Objectives and Contributions
• We propose and evaluate a video search framework using visual information to enrich the classic text-based search for video retrieval operating at the fragment level.
• We investigate the following two questions:
  – To what extent can visual concepts contribute information when retrieving videos?
  – How can we cope with the confidence in visual concept detection?
• The framework extends conventional text-based search by fusing together textual and visual scores.
• We address both the semantic and intention gaps
  – By automatically mapping the query text to semantic concepts.
  – With the addition of “visual cues”
MediaEval Search & Hyperlinking
• Information seeking in a video dataset: retrieving media fragments/anchors
The Video Archive
2323 BBC videos of different genres (440 programs)
• ~1697h of video + audio
• Subtitles (manual)
• Two ASR transcripts (LIMSI, LIUM)
• Metadata (Title, Cast, Description, ...)
• Shot boundaries and key-frames
• Search: 50 queries from 29 users
  – Textual query + visual cues
• Face detection
• Concept detection
Text query: Medieval history of why castles were first built
Visual cues: Castle
Text query: Best players of all time; Embarrassing England performances; Wake up call for English football; Wembley massacre;
Visual cues: Poor camera quality; heavy looking football; unusual goal celebrations; unusual crowd reactions; dark; grey; overcast; black and white;
The proposed Framework
[Figure: framework overview. Off-line: the collection (videos, scenes and subtitles) is indexed twice, with Lucene indexing of scenes + subtitles and with content-based indexing that applies visual semantic concept detectors to the scenes and stores concept indexing scores. On-line: the user submits a textual/visual query; the textual query yields text-based scores, the visual cues yield selected concepts and visual-based scores, and fusion of the two produces the ranking and the ranked list.]
The proposed Framework
[Figure: off-line content-based indexing. Scenes from the collection (videos, scenes and subtitles) are scored by visual semantic concept detectors, producing concept indexing scores.]
No training data for visual concepts
Use 151 visual concept detectors trained on TrecVid 2012 data
Unknown performance
Visual concept detector confidence (w)
• 100 top images for the concept “Animal”
• 58 out of 100 are manually evaluated as valid
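One plausible reading of this slide is that the confidence w of a detector is the precision among its top-ranked images, as judged by hand. The sketch below works under that assumption; the function name and the list-of-booleans input are illustrative and do not come from the talk.

```python
def detector_confidence(judgements):
    """Fraction of manually validated images among the top-ranked ones."""
    if not judgements:
        raise ValueError("need at least one judged image")
    return sum(1 for ok in judgements if ok) / len(judgements)


# "Animal": 58 of the 100 top-ranked images were judged valid.
animal_judgements = [True] * 58 + [False] * 42
w_animal = detector_confidence(animal_judgements)
print(w_animal)
```

A value obtained this way could serve as the weight of the concept when several detectors are combined, which is how w is used in the scoring slides.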
The proposed Framework
User querying, textual/visual query:
<queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText> <visualCues>House memories Farm exploration A poem on animal and shells </visualCues>
Users are not aware of visual concepts
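Since the query carries a text part and a visual-cue part, a first processing step is to separate the two. The sketch below does this with Python's standard XML parser; the `<query>` wrapper element and the function name are assumptions made for the example, because the slide shows the two elements without a common parent.

```python
import xml.etree.ElementTree as ET

RAW_QUERY = (
    "<queryText>Children out on poetry trip Exploration of poetry by "
    "school children Poem writing</queryText> "
    "<visualCues>House memories Farm exploration A poem on animal and shells "
    "</visualCues>"
)


def parse_query(raw):
    """Split a query into its text part and its visual-cue part."""
    # Wrap the two sibling elements in a dummy root so the string is
    # well-formed XML (the wrapper is an assumption of this sketch).
    root = ET.fromstring("<query>" + raw + "</query>")
    text = (root.findtext("queryText") or "").strip()
    cues = (root.findtext("visualCues") or "").strip()
    return text, cues


query_text, visual_cues = parse_query(RAW_QUERY)
print(query_text)
print(visual_cues)
```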
Mapping visual cues to visual concepts
• <queryText>Children out on poetry trip Exploration of poetry by school children Poem writing</queryText> <visualCues>House memories Farm exploration A poem on animal and shells </visualCues>
[Figure: WordNet mapping. Keywords taken from the visual cues (Farm, Shells, Exploration, Poem, Animal, House, Memories) are mapped to visual concepts (Animal, Birds, Insect, Cattle, Dogs, Building, School, Church, Flags, Mountain).]
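Before any WordNet lookup, the free-text visual cues have to be reduced to keywords. The talk does not say how this is done; the sketch below is one simple possibility, assuming whitespace tokenisation and a tiny hand-written stop-word list (both are illustrative choices).

```python
# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "on", "and", "of", "in", "by"}


def extract_keywords(visual_cues):
    """Lower-case the cues, drop stop words and duplicates, keep order."""
    keywords = []
    for token in visual_cues.lower().split():
        word = token.strip(".,;:!?")
        if word and word not in STOP_WORDS and word not in keywords:
            keywords.append(word)
    return keywords


cues = "House memories Farm exploration A poem on animal and shells"
keywords = extract_keywords(cues)
print(keywords)
```

On this example the sketch yields the same seven keywords that the slide lists for the query.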
Mapping visual cues to visual concepts
• Concepts mapped to the visual query “Castle”
• Semantic similarity computed using the “Lin” distance

| Concept | Windows | Plant | Court | Church | Building |
|---|---|---|---|---|---|
| β | 0.4533 | 0.4582 | 0.5115 | 0.6123 | 0.701 |
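Given similarity values like those in the table for “Castle”, concept selection can be pictured as a simple cut-off. The sketch below assumes that the threshold θ used in the evaluation slides is applied directly to these β values; that link, and the function name, are assumptions rather than something the slides state.

```python
# Similarity between the cue "Castle" and each candidate concept,
# copied from the table for this query.
CASTLE_SIMILARITY = {
    "Windows": 0.4533,
    "Plant": 0.4582,
    "Court": 0.5115,
    "Church": 0.6123,
    "Building": 0.701,
}


def select_concepts(similarity, theta):
    """Keep concepts whose similarity reaches theta, strongest first."""
    kept = [(c, s) for c, s in similarity.items() if s >= theta]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in kept]


for theta in (0.3, 0.5, 0.7):
    print(theta, select_concepts(CASTLE_SIMILARITY, theta))
```

Raising θ trades coverage for precision: at 0.7 only the closest concept remains for this query.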
The proposed Framework
[Figure: on-line scoring. Text-based scores come from Lucene indexing; visual-based scores come from the concepts selected by WordNet similarity; fusion of the two produces the ranking.]
One text-based score for each scene ($t$) and one visual-based score for each scene ($v$) are fused as:
$$f_i = t_i^{\alpha} + v_i^{1-\alpha}$$
The visual score is computed from the scores of the selected concepts for each scene:
$$v_i^q = \sum_{c \in C'_q} w_c \times vs_i^c$$
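The two scoring steps can be sketched in a few lines. The fusion below follows the formula as it reads on the slide, with α and 1−α taken as exponents; the scene data, the value of α and all function names are hypothetical, and scores are assumed to be non-negative.

```python
def visual_score(concept_scores, selected, weights):
    """v_i^q: weighted sum of detector scores over the selected concepts."""
    return sum(weights[c] * concept_scores.get(c, 0.0) for c in selected)


def fused_score(t, v, alpha):
    """f_i = t^alpha + v^(1 - alpha), following the slide's formula."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return t ** alpha + v ** (1.0 - alpha)


# Hypothetical scenes: a text score plus per-concept detector scores.
scenes = {
    "scene_a": {"text": 0.8, "concepts": {"Building": 0.9, "Church": 0.2}},
    "scene_b": {"text": 0.6, "concepts": {"Building": 0.1, "Church": 0.1}},
}
selected = ["Building", "Church"]            # concepts chosen for the query
weights = {"Building": 1.0, "Church": 1.0}   # the w = 1.0 setting
alpha = 0.5                                  # illustrative value only

fused = {
    name: fused_score(
        s["text"], visual_score(s["concepts"], selected, weights), alpha
    )
    for name, s in scenes.items()
}
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked)
```

Replacing the constant weights with per-concept confidence values gives the w = confidence(c) variant.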
Evaluation
• To what extent can visual concepts contribute information when retrieving videos?
• How can we cope with the confidence in visual concept detection?
• BBC Archive subset provided by the MediaEval 2013 Search and Hyperlinking task.
• Evaluation Measures:
  – Mean Reciprocal Rank (MRR): assesses the rank of the relevant segment
  – Mean Generalized Average Precision (mGAP): takes into account starting time of the segment
  – Mean Average Segment Precision (MASP): measures both ranking and segmentation of relevant segments
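Of the three measures, MRR is the simplest to illustrate. The sketch below uses the textbook definition with exact matching on segment identifiers; the task's own scoring tools may match segments by time overlap instead, and the rankings shown are invented for the example.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the first relevant item; 0 if none is found."""
    if not ranked_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for query_id, ranking in ranked_lists.items():
        for rank, item in enumerate(ranking, start=1):
            if item in relevant[query_id]:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical rankings for three queries.
ranked_lists = {
    "q1": ["seg_3", "seg_7", "seg_1"],   # relevant segment at rank 1
    "q2": ["seg_5", "seg_2", "seg_9"],   # relevant segment at rank 2
    "q3": ["seg_4", "seg_6", "seg_8"],   # relevant segment not retrieved
}
relevant = {"q1": {"seg_3"}, "q2": {"seg_2"}, "q3": {"seg_0"}}

mrr = mean_reciprocal_rank(ranked_lists, relevant)
print(mrr)
```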
Retrieval Performance (50 queries)
• Low impact of visual concept detector confidence (w)
• Significant improvement can be achieved by combining only mapped concepts with θ ≥ 0.3.
• Best performance is obtained when θ ≥ 0.8 (gain ≈ 11-12%).
[Figure: retrieval performance over the 50 queries, two panels: w=1.0 and w=confidence(c)]
Visual concepts and Query association
• The number of concepts associated with queries at different thresholds θ.

| θ | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Min | 5 | 5 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Max | 45 | 45 | 41 | 37 | 25 | 19 | 19 | 12 | 6 | 2 |
| Mean | 20 | 19 | 18 | 15 | 11 | 7 | 5 | 3 | 1 | 1 |
| #Q (#c’q > 0) | 50 | 50 | 50 | 50 | 49 | 49 | 48 | 44 | 29 | 21 |
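Statistics of this kind can be derived from the per-query similarity values. The sketch below shows one way to compute the four rows for a given θ; the similarity data is invented, and the function name and output keys are illustrative.

```python
from statistics import mean


def concept_stats(per_query_similarities, theta):
    """Min/max/mean number of kept concepts and queries with at least one."""
    counts = [
        sum(1 for s in sims if s >= theta)
        for sims in per_query_similarities.values()
    ]
    if not counts:
        raise ValueError("need at least one query")
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "queries_with_concepts": sum(1 for n in counts if n > 0),
    }


# Hypothetical similarity values for three queries.
similarities = {
    "q1": [0.95, 0.60, 0.20],
    "q2": [0.50, 0.35],
    "q3": [0.10],
}

stats_low = concept_stats(similarities, 0.3)
stats_high = concept_stats(similarities, 0.9)
print(stats_low)
print(stats_high)
```

The last row of the table corresponds to `queries_with_concepts`, the number of queries that still have at least one mapped concept at that threshold.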
Retrieval on queries with visual concepts (21)
• Concept mapping significantly improves the performance of the text-based search task on these queries.
• The best performance was achieved with θ ≥ 0.7 (gain ≈ 32-33%).
[Figure: retrieval performance on the 21 queries with visual concepts, two panels: w=1.0 and w=confidence(c)]
Conclusion
• A novel video search framework using visual information to enrich a text-based search for video retrieval has been presented.
• We conducted our evaluations on the MediaEval 2013 Search and Hyperlinking task, where we ranked 2nd on Search and 1st on Hyperlinking.
• Experimental results show that mapping text-based queries to visual concepts significantly improves the search system.
• When appropriately selecting the relevant visual concepts, a very significant improvement is achieved (gain ≈ 33%).
Related Publications
• B. Safadi, M. Sahuguet and B. Huet, “When textual and visual information join forces for multimedia retrieval”, ICMR 2014, ACM International Conference on Multimedia Retrieval, April 1-4, 2014, Glasgow, Scotland
• M. Sahuguet and B. Huet, “Mining the Web for Multimedia-based Enriching”, MMM 2014, 20th International Conference on MultiMedia Modeling, January 8-10, 2014, Dublin, Ireland
• M. Sahuguet, B. Huet, B. Cervenkova, E. Apostolidis, V. Mezaris, D. Stein, S. Eickeler, J-L. Redondo Garcia, R. Troncy and L. Pikora, “LinkedTV at MediaEval 2013 search and hyperlinking task”, MEDIAEVAL 2013, Multimedia Benchmark Workshop, October 18-19, 2013, Barcelona, Spain
• D. Stein, A. Öktem, E. Apostolidis, V. Mezaris, J. L. Redondo García, R. Troncy, M. Sahuguet and B. Huet, “From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow”, NEM Summit 2013, Networked & Electronic Media, October 28-30, 2013, Nantes, France
• V. Mezaris and B. Huet, “Video Hyperlinking”, tutorial accepted at ICIP 2014, October 2014, Paris
• B. Safadi, M. Sahuguet and B. Huet, “Linking text and visual concepts semantically for cross modal multimedia search”, ICIP 2014, Paris, 2014
Questions?
http://www.slideshare.net/huetbenoit/
Thank you.