the trec interactive video track and content-based retrieval from digital video rong yan...

66
The TREC Interactive Video The TREC Interactive Video Track and Content-based Track and Content-based Retrieval from Digital Video Retrieval from Digital Video Rong Yan http://www.cs.cmu.edu/~christel/MM2002/ syllabus.htm

Upload: piers-french

Post on 12-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

The TREC Interactive Video Track The TREC Interactive Video Track and Content-based Retrieval from and Content-based Retrieval from

Digital VideoDigital Video

Rong Yan

http://www.cs.cmu.edu/~christel/MM2002/syllabus.htm

Page 2: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 2 Carnegie Mellon

State-of-the-art Multimedia Search EnginesState-of-the-art Multimedia Search Engines

• Recalling the Homework 1 and Homework 2Recalling the Homework 1 and Homework 2• Better for simple concepts, Better for simple concepts,

e.g. Two people kissing, A picture of a giraffe e.g. Two people kissing, A picture of a giraffe • Don’t work for complex queriesDon’t work for complex queries

e.g. A picture of a brick home with black shutters and e.g. A picture of a brick home with black shutters and white pillars, with a pickup truck in front of it (image) white pillars, with a pickup truck in front of it (image)

Page 3: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 3 Carnegie Mellon

ExamplesExamples

• Find the pictures of giraffe Find the pictures of giraffe • Keyword: giraffe Keyword: giraffe • http://images.google.com/images?hl=en&lr=lang_zh-CN

%7Clang_en&ie=UTF-8&oe=UTF-8&q=giraffe+

• A picture of a brick home with black shutters and white A picture of a brick home with black shutters and white pillars, with a pickup truck in front of it (image) pillars, with a pickup truck in front of it (image) • brick home shutters brick home shutters • http://images.google.com/images?hl=en&lr=lang_zh-CN

%7Clang_en&ie=UTF-8&oe=UTF-8&q=brick+home+shutters+

Page 4: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 4 Carnegie Mellon

Why this happens?Why this happens?

• Most of these search engines are keyword based Most of these search engines are keyword based • ““False” multi-media search engineFalse” multi-media search engine• Have to represent your idea in keywordsHave to represent your idea in keywords• These keywords are expected to appear in the filename, These keywords are expected to appear in the filename,

or corresponding webpageor corresponding webpage

• Therefore……Therefore……• Unable to handle semantic meaning of imagesUnable to handle semantic meaning of images• Unable to handle visual positionUnable to handle visual position• Unable to handle time informationUnable to handle time information• Unable to use images as query Unable to use images as query • ………………..

Page 5: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 5 Carnegie Mellon

Solution Solution

• Excerpted from your homeworkExcerpted from your homework

• …………I found that the Google Image Search was not I found that the Google Image Search was not as good as expected. Altavista was the more as good as expected. Altavista was the more useful multimedia search engine. useful multimedia search engine. However, most However, most of them just did a search based on the of them just did a search based on the filename or the matching keywordsfilename or the matching keywords within within the site it was located.the site it was located. I think it would be great I think it would be great to have multimedia search engine intelligent to have multimedia search engine intelligent enough to enough to associate its own keywords based on associate its own keywords based on what's in the imagewhat's in the image..  

Page 6: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 6 Carnegie Mellon

Solution Solution

• Excerpted from your homeworkExcerpted from your homework

• …………I found that the Google Image Search was not I found that the Google Image Search was not as good as expected. Altavista was the more as good as expected. Altavista was the more useful multimedia search engine. useful multimedia search engine. However, most However, most of them just did a search based on the of them just did a search based on the filename or the matching keywordsfilename or the matching keywords within within the site it was located.the site it was located. I think it would be great I think it would be great to have multimedia search engine intelligent to have multimedia search engine intelligent enough to enough to associate its own keywords based on associate its own keywords based on what's in the imagewhat's in the image..  

  

Our Solution: Content-based Information

Retrieval(CBIR)

Page 7: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 7 Carnegie Mellon

Solution Solution

• Excerpted from your homeworkExcerpted from your homework

• …………I found that the Google Image Search was not I found that the Google Image Search was not as good as expected. Altavista was the more as good as expected. Altavista was the more useful multimedia search engine. useful multimedia search engine. However, most However, most of them just did a search based on the of them just did a search based on the filename or the matching keywordsfilename or the matching keywords within within the site it was located.the site it was located. I think it would be great I think it would be great to have multimedia search engine intelligent to have multimedia search engine intelligent enough to enough to associate its own keywords based on associate its own keywords based on what's in the imagewhat's in the image. .   

Our Solution: Content-based Information

Retrieval(CBIR)

Mainly Video Retrieval in this

lecture

Page 8: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 8 Carnegie Mellon

Content-based Video RetrievalContent-based Video Retrieval

• ApplicationApplication

• ImplementationImplementation

• Effort on TREC02 video trackEffort on TREC02 video track• Feature Extraction Task (High-level Semantics Feature)Feature Extraction Task (High-level Semantics Feature)• Manual Retrieval Task (One-run Retrieval)Manual Retrieval Task (One-run Retrieval)• Interactive Retrieval Task (Multiple-run with Feedback)Interactive Retrieval Task (Multiple-run with Feedback)• Results & DemoResults & Demo

• ConclusionConclusion

Page 9: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 9 Carnegie Mellon

Application Application

• Increasing demand for visual information retrievalIncreasing demand for visual information retrieval• Retrieve useful information from databasesRetrieve useful information from databases• Sharing and distributing video data through computer Sharing and distributing video data through computer

networksnetworks

• Example: BBCExample: BBC• BBC archive has +500k queries plus 1M new items … BBC archive has +500k queries plus 1M new items …

per year;per year;• From the BBC …From the BBC …

• Police car with blue light flashing Police car with blue light flashing

• Government plan to improve reading standards Government plan to improve reading standards

• Two shot of Kenneth Clarke and William HagueTwo shot of Kenneth Clarke and William Hague

Page 10: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 10 Carnegie Mellon

Application ( Cont. ) Application ( Cont. )

• Video Surveillance Video Surveillance • Find where else the person appearsFind where else the person appears

• Experience On-DemandExperience On-Demand• Help to remember previous eventsHelp to remember previous events

• Provide useful information on travelingProvide useful information on traveling• Equipment on cars to retrieve useful multimedia Equipment on cars to retrieve useful multimedia

information according to your location/preferenceinformation according to your location/preference

• ………………

• Video content is plentiful … its now available digitally … Video content is plentiful … its now available digitally … we can work on it directly … so it followswe can work on it directly … so it follows

Page 11: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 11 Carnegie Mellon

Typical Retrieval FrameworkTypical Retrieval Framework

• User : provide query information that User : provide query information that represents his information needs represents his information needs

• Database: store a large collection of Database: store a large collection of video datavideo data

• Goal: Find the most relevant shots Goal: Find the most relevant shots from the databasefrom the database• Shots: “paragraph” in video, typically Shots: “paragraph” in video, typically

20 – 40 seconds, which is the basic 20 – 40 seconds, which is the basic unit of video retrieval unit of video retrieval

Page 12: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 12 Carnegie Mellon

Sample QuerySample Query

• Text : Text : Find pictures of George WashingtonFind pictures of George Washington

• Image:Image: • Video:Video:

Page 13: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 13 Carnegie Mellon

Bridging the Gap Bridging the Gap

Video DatabaseUser

Result

Page 14: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 14 Carnegie Mellon

Automatically Structure Video DataAutomatically Structure Video Data

• The first step for video retrieval: Video “programmes” The first step for video retrieval: Video “programmes” are structured into logical scenes, and physical shots are structured into logical scenes, and physical shots

• If dealing with text, then the structure is obvious: If dealing with text, then the structure is obvious: • paragraph, section, topic, page, etc.paragraph, section, topic, page, etc.

• All text-based indexing, retrieval, linking, etc. builds All text-based indexing, retrieval, linking, etc. builds upon this structure;upon this structure;

• Automatic shot boundary detection and selection of Automatic shot boundary detection and selection of representative keyframes is usually the first step;representative keyframes is usually the first step;

Page 15: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 15 Carnegie Mellon

Typical automatic structuring of videoTypical automatic structuring of video

A set of shots

a video document

Keyframe browser combined with transcript or object-based search

Page 16: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 16 Carnegie Mellon

Bridging the Gap Bridging the Gap

Video DatabaseUser

Video Structure Information Need

Result

Page 17: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 17 Carnegie Mellon

Ideal solution Ideal solution

Video DatabaseUser

Video Structure Information NeedUnderstanding the semantic meaning and retrieve

Result

Page 18: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 18 Carnegie Mellon

Ideal solution Ideal solution

Video DatabaseUser

Video Structure Information NeedUnderstanding the semantic meaning and retrieve

Result

However, 1. Hard to represent query in

natural language and for computer to understand

2. Computers have no experience3. Other representation

restriction like position, time

Page 19: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 19 Carnegie Mellon

Alternative SolutionAlternative Solution

Video DatabaseUser

Video Structure Information Need

Match and combine

Result

Provide evidence of relevant information ( text, image, audio)

Page 20: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 20 Carnegie Mellon

Evidence-based Retrieval SystemEvidence-based Retrieval System

• General framework for current video retrieval systemGeneral framework for current video retrieval system

• Video retrieval based on the evidence from both users Video retrieval based on the evidence from both users and database, includingand database, including• Text information Text information • Image informationImage information• Motion informationMotion information• Audio informaitonAudio informaiton

• Return a relevant score for each evidenceReturn a relevant score for each evidence

• Combination of the scoresCombination of the scores

Page 21: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 21 Carnegie Mellon

Keyword-based SystemKeyword-based System

Video DatabaseUser

Video Structure Information Need

Automatic Annotation

Keyword

Including filename, video title, caption,

related web page

Page 22: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 22 Carnegie Mellon

Keyword-based SystemKeyword-based System

Video DatabaseUser

Video Structure Information Need

Automatic Annotation

Keyword

Manual Annotation

Page 23: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 23 Carnegie Mellon

Manual AnnotationManual Annotation

• Manually creating annotation/keywords for image / Manually creating annotation/keywords for image / video datavideo data

• Examples: Examples: Gettyimage.com (image retrieval) (image retrieval)

• Pros: Pros: • Represent the semantic meaning of videoRepresent the semantic meaning of video

• ConsCons• Time-consuming, labor-intensive Time-consuming, labor-intensive • Keyword is not enough to represent information needKeyword is not enough to represent information need

Page 24: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 24 Carnegie Mellon

Speech and OCR transcriptionSpeech and OCR transcription

Video DatabaseUser

Video Structure Information Need

Annotation

Keyword

Speech Transcription

OCR Transcription

Page 25: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 25 Carnegie Mellon

Query using speech/OCR informationQuery using speech/OCR information

Query: Query: Find pictures of Harry Hertz, Find pictures of Harry Hertz,

Director of the National Quality Director of the National Quality Program, NISTProgram, NIST

Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular …

OCR:H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director

Page 26: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 26 Carnegie Mellon

What we lack?What we lack?

Video DatabaseUser

Video Structure Information Need

Annotation

Keyword

Speech Transcription

OCR Transcription

Image Information

Page 27: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 27 Carnegie Mellon

Image-based RetrievalImage-based Retrieval

Video DatabaseUser

Video Structure Information Need

Text Information

Keyword

Image Feature

Query Images

Page 28: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 28 Carnegie Mellon

Global Low-level Image FeatureGlobal Low-level Image Feature

• Color-based FeatureColor-based Feature• Color HistogramColor Histogram• Color PecentageColor Pecentage• Color CorrelogramColor Correlogram• Color MomentsColor Moments

• Texture-based FeatureTexture-based Feature• Gabor FilterGabor Filter• Wavelet Wavelet

• Shape/Structure FeatureShape/Structure Feature

Page 29: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 29 Carnegie Mellon

Regional Low-level Image FeatureRegional Low-level Image Feature

• Segmentation into objectsSegmentation into objects

• Extract low-level features from each regionsExtract low-level features from each regions

Page 30: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 30 Carnegie Mellon

Image SearchImage Search

• Feature RepresentationFeature Representation• Image: represented as a series of real number, or a Image: represented as a series of real number, or a

vector of features, vector of features, (f1, …., fn)(f1, …., fn)• Distance Function: The distance between two vectors, Distance Function: The distance between two vectors,

typically Euclidean Distancetypically Euclidean Distance• We believe “Nearest is relevant”We believe “Nearest is relevant”

• The nearest images in the database is relevant to the query The nearest images in the database is relevant to the query images. images.

Page 31: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 31 Carnegie Mellon

Finding Similar ImagesFinding Similar Images

Page 32: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 32 Carnegie Mellon

But…..But…..• Low-level feature doesn’t work in all the cases

Page 33: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 33 Carnegie Mellon

High-level Image FeatureHigh-level Image Feature

• Objects: Persons, Roads, Cars, Skies…Objects: Persons, Roads, Cars, Skies…

• Scenes: Indoors, Outdoors, Cityscape, Landscape, Scenes: Indoors, Outdoors, Cityscape, Landscape, Water, Office, Factory…Water, Office, Factory…

• Event: Parade, Explosion, Picnic, Playing Soccer…Event: Parade, Explosion, Picnic, Playing Soccer…

• Generated from low-level featuresGenerated from low-level features

Page 34: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 34 Carnegie Mellon

Image Feature

Image-based RetrievalImage-based Retrieval

Video DatabaseUser

Video Structure Information Need

Text Information

Keyword

Query Images

Low-level Feature

High-level Feature

Page 35: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 35 Carnegie Mellon

More Evidence in Video RetrievalMore Evidence in Video Retrieval

Video DatabaseUser

Video Structure Information Need

Text Information

Keyword

Image Information

Query Images

MotionInformation

Audio Information

Motion

Audio

Page 36: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 36 Carnegie Mellon

Combination of multi-modal resultsCombination of multi-modal results

• Difference characteristics between multi-modal Difference characteristics between multi-modal informationinformation• Text-based InformationText-based Information: better for middle and high level : better for middle and high level

queriesqueries• e.g. Find the video clip of dancing women wearing dresses e.g. Find the video clip of dancing women wearing dresses

• Image-based InformationImage-based Information: better for low and middle level : better for low and middle level queriesqueries

• e.g. Find the video clip of green treese.g. Find the video clip of green trees

• Combination of multi-modal informationCombination of multi-modal information

Page 37: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 37 Carnegie Mellon

Other Useful TechniqueOther Useful Technique

• Query ExpansionQuery Expansion

• Cross-Modal RelationCross-Modal Relation

• Relevance FeedbackRelevance Feedback

Page 38: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 38 Carnegie Mellon

RecapRecap

• Video Retrieval is to bridge the gap between user Video Retrieval is to bridge the gap between user information need and video databaseinformation need and video database

• Multi-modal evidenceMulti-modal evidence• Text-based (most popular) Text-based (most popular) • Image-basedImage-based• Motion-basedMotion-based• Audio-basedAudio-based

• Combination of the evidenceCombination of the evidence

Page 39: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 39 Carnegie Mellon

Content-based Video RetrievalContent-based Video Retrieval

• ApplicationApplication

• ImplementationImplementation

• TREC video trackTREC video track• Feature Extraction Task (High-level Semantics Feature)Feature Extraction Task (High-level Semantics Feature)• Manual Retrieval Task (One-run Retrieval)Manual Retrieval Task (One-run Retrieval)• Interactive Retrieval Task (Multiple-run with Feedback)Interactive Retrieval Task (Multiple-run with Feedback)• Results & DemoResults & Demo

• ConclusionConclusion

Page 40: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 40 Carnegie Mellon

Introduction to TREC Video Retrieval TrackIntroduction to TREC Video Retrieval Track

• Full Name: Text REtrieval Conference Full Name: Text REtrieval Conference

• TREC Video Track web site: TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/

• TREC series sponsored by the National Institute of TREC series sponsored by the National Institute of Standards and Technology (NIST) with additional support Standards and Technology (NIST) with additional support from other U.S. government agenciesfrom other U.S. government agencies• Goal is to encourage research in information retrievalGoal is to encourage research in information retrieval

Page 41: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 41 Carnegie Mellon

Introduction to TREC Video Retrieval TrackIntroduction to TREC Video Retrieval Track

• Video Retrieval Track started in 2001Video Retrieval Track started in 2001• Goal is investigation of content-based retrieval from digital Goal is investigation of content-based retrieval from digital

videovideo• Focus on the Focus on the shotshot as the unit of information retrieval rather as the unit of information retrieval rather

than the scene or story/segment/clipthan the scene or story/segment/clip

• Current state-of-the-art Video Retrieval CompetitionCurrent state-of-the-art Video Retrieval Competition• 17 active participants, including groups from CMU, IBM 17 active participants, including groups from CMU, IBM

Research, Microsoft Research Asia, MediaMill, LIMSI, Dublin Research, Microsoft Research Asia, MediaMill, LIMSI, Dublin City University.City University.

Page 42: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 42 Carnegie Mellon

Main tasks in TRECMain tasks in TREC

• Shot boundary detectionShot boundary detection

• Semantic Feature Extraction Task Semantic Feature Extraction Task

• Video Retrieval TaskVideo Retrieval Task• Manual Retrieval: Manual Retrieval: Human formulate a query and then Human formulate a query and then

automatically retrieve from collectionautomatically retrieve from collection• Interactive Retrieval: Interactive Retrieval: Full human access and feedbackFull human access and feedback

Page 43: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 43 Carnegie Mellon

Image Feature

Where are they?Where are they?

Video DatabaseUser

Video Structure Information Need

Text Information

Keyword

Query Images

Low-level Feature

High-level Feature

Shot Boundary Detection Feature

Extraction

Retrieval Task

Page 44: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 44 Carnegie Mellon

Video DataVideo Data

• Difficult to get video data for use in TREC because ©Difficult to get video data for use in TREC because ©

• Used mainly Used mainly Internet Archive• advertising, educational, industrial, amateur films 1930-1970 advertising, educational, industrial, amateur films 1930-1970 • produced by corporations, non-profit organisations, trade produced by corporations, non-profit organisations, trade

groups, etc.groups, etc.• Noisy, strange color, but real archive dataNoisy, strange color, but real archive data• 73.3 hours partitioned as follows:73.3 hours partitioned as follows:

4.85

5.07

23.26

40.12Search test

Feature development(training and validation)

Feature test

Shot boundary test

Page 45: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 45 Carnegie Mellon

Shot Boundary DetectionShot Boundary Detection

• Fundamental primitive of most/all work in content-based Fundamental primitive of most/all work in content-based video retrievalvideo retrieval

A set of video shots

a video document

Page 46: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 46 Carnegie Mellon

Feature ExtractionFeature Extraction

• Extracted high-level semantic feature from videoExtracted high-level semantic feature from video

• Assign a video clip to one or more of several categories Assign a video clip to one or more of several categories of videoof video

High-level features:Cityscape, Lake,

Trees, Water, Sky

Page 47: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 47 Carnegie Mellon

Feature ExtractionFeature Extraction

• Interesting itself but when it serves to help video navigation and search then its importance increases

• Benefits:

• Retrieval - Find video from a particular class

• Filtering - Remove irrelevant and distracting information categories from summaries and visualizations

Page 48: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 48 Carnegie Mellon

The FeaturesThe Features

FaceFace

Clip contains at least one human face with the nose, Clip contains at least one human face with the nose, mouth, and both eyes visible. Pictures of a face meeting mouth, and both eyes visible. Pictures of a face meeting the above conditions countthe above conditions count

PeoplePeople

Clip contains a group of two more humans, each of which Clip contains a group of two more humans, each of which is at least partially visible and is recognizable as a humanis at least partially visible and is recognizable as a human

On-screen TextOn-screen Text

Clip contains superimposed text large enough to be readClip contains superimposed text large enough to be read

Page 49: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 49 Carnegie Mellon

The FeaturesThe Features

IndoorIndoor

Clip contains a recognizably indoor location, i.e., inside a building Clip contains a recognizably indoor location, i.e., inside a building

OutdoorOutdoorClip contains a recognizably outdoor location, i.e., one outside of Clip contains a recognizably outdoor location, i.e., one outside of

buildingsbuildings

CityscapeCityscapeClip contains a recognizably city/urban/suburban settingClip contains a recognizably city/urban/suburban setting

LandscapeLandscapeClip contains a predominantly natural inland setting, i.e., one with Clip contains a predominantly natural inland setting, i.e., one with

little or no evidence of development by humans. Scenes with little or no evidence of development by humans. Scenes with bodies of water that are clearly inland may be includedbodies of water that are clearly inland may be included

Page 50: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 50 Carnegie Mellon

Non-Video (Audio) FeaturesNon-Video (Audio) Features

SpeechSpeechA human voice uttering words is recognizable as such in this A human voice uttering words is recognizable as such in this

segment segment

Instrumental SoundInstrumental SoundSound produced by one or more musical instruments is Sound produced by one or more musical instruments is

recognizable as such in this segmentrecognizable as such in this segment

MonologuesMonologuesSegment contains an event in which a single person is at least Segment contains an event in which a single person is at least

partially visible and speaks for a long time without partially visible and speaks for a long time without interruption by another speaker. Pauses are ok if shortinterruption by another speaker. Pauses are ok if short

Page 51: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 51 Carnegie Mellon

TREC02 Results TREC02 Results

0

0.2

0.4

0.6

0.8

1 2 3 4 5 6 7 8 9 10

Feature

Avera

ge p

recis

ion

CMU_r1

A_CMU_r2

CLIPS-LIT_GEOD

CLIPS-LIT-LIMSU

DCUFE2002

Eurecom1

Fudan_FE_Sys1

Fudan_FE_Sys2

IBM-1

IBM-2

MediaMill1

MediaMill2

MSRA

UnivO_MT1

UnivO_MT2

Avg Prec Cap

Outd

oors

Indo

ors

Face

Peop

le

City

scap

e Land

sca

pe Text

ov

erla

ySp

eec

h

Inst

rum

enta

l

so

und

Mon

olo

g

Random baseline

Page 52: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 52 Carnegie Mellon

Video Search TaskVideo Search Task

• The most important task and final goalThe most important task and final goal

• Manual & Interactive Search Task Manual & Interactive Search Task

Page 53: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 53 Carnegie Mellon

Queries for 2002 TREC Video TrackQueries for 2002 TREC Video Track

• Specific item or personSpecific item or person• Eddie Rickenbacker, James Chandler, George Washington, Golden Eddie Rickenbacker, James Chandler, George Washington, Golden

Gate Bridge, Price Tower in Bartlesville, OK Gate Bridge, Price Tower in Bartlesville, OK

• Specific factSpecific fact• Arch in Washington Square Park in NYC, map of continental USArch in Washington Square Park in NYC, map of continental US

• Instances of a categoryInstances of a category• football players, overhead views of cities, one or more women football players, overhead views of cities, one or more women

standing in long dressesstanding in long dresses

• Instances of events/activitiesInstances of events/activities• people spending leisure time at the beach, one or more musicians people spending leisure time at the beach, one or more musicians

with audible music, crowd walking in an urban environment, with audible music, crowd walking in an urban environment, locomotive approaching the viewerlocomotive approaching the viewer

Page 54: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 54 Carnegie Mellon

Sample QuerySample Query

• XML RepresentationXML Representation

<!DOCTYPE videoTopic SYSTEM "videoTopics.dtd"><videoTopic num="077">

<textDescription text="Find pictures of George Washington" />

<imageExample src="http://www.cia.gov/csi/monograph/firstln/955pres2.gif" desc="face" />

<videoExample src="01681.mpg" start="09m25.938s" stop="09m29.308s" desc="face" />

</videoTopic>

Page 55: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 55 Carnegie Mellon

Sample QuerySample Query

• Text : Text : Find pictures of George WashingtonFind pictures of George Washington

• Image:Image: • Video:Video:

Page 56: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 56 Carnegie Mellon

Evaluation MetricEvaluation Metric

• Goal: Maximize the Mean Average PrecisionGoal: Maximize the Mean Average Precision• Result set limited to 100 shotsResult set limited to 100 shots• Precision = (# relevant shots retrieved)/(total # shots retrieved)Precision = (# relevant shots retrieved)/(total # shots retrieved)• Average precision: compute precision after each retrieved Average precision: compute precision after each retrieved

relevant shot and then average these precisions over the total relevant shot and then average these precisions over the total number of retrieved relevant shots in the collection for that number of retrieved relevant shots in the collection for that topictopic

• Submitting the maximum number of shots per result set can Submitting the maximum number of shots per result set can never lower the average precision for that submissionnever lower the average precision for that submission

• Mean Average Precision = average of the average precision Mean Average Precision = average of the average precision measures for each topicmeasures for each topic

Page 57: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 57 Carnegie Mellon

CMU Manual Retrieval System CMU Manual Retrieval System

Query

Text Image

Final Score

Text Score

ImageScore

RetrievalAgents

MovieInfo

PRFScore

Page 58: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 58 Carnegie Mellon

Snapshot of the systemSnapshot of the system

Page 59: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 59 Carnegie Mellon

Manual Search ResultManual Search Result

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 1

Recall

Pre

cis

ion

Prous Science

IBM-2

CMU_MANUAL1

IBM-3

LL10_T

CLIPS+ASR

Fudan_Search_Sys4

CLIPS+ASR+X

ICMKM-2

UMDMqtrec

Page 60: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 60 Carnegie Mellon

CMU Interactive Search SystemCMU Interactive Search System

• New Interface based on Informedia systemNew Interface based on Informedia system

• Multiple document storyboardsMultiple document storyboards

• Query context plays a key role in filtering image sets to Query context plays a key role in filtering image sets to manageable sizesmanageable sizes

• TREC 2002 image feature set offers additional filtering TREC 2002 image feature set offers additional filtering capabilities for indoor, outdoor, faces, people, etc.capabilities for indoor, outdoor, faces, people, etc.

• Displaying filter count and distribution guides their use Displaying filter count and distribution guides their use in manipulating the storyboard viewsin manipulating the storyboard views

Page 61: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 61 Carnegie Mellon

Snapshot of the systemSnapshot of the system

Page 62: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 62 Carnegie Mellon

Filter Interface for using Image FeaturesFilter Interface for using Image Features

Page 63: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 63 Carnegie Mellon

Interactive runs top 10 (of 13)Interactive runs top 10 (of 13)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 1

Recall

Pre

cis

ion

CMUInfInt1

DCUTrec11B.1

DCUTrec11C.2

CMU_INTERACTIVE_2

UnivO_MT5

IBM-4

DCUTrec11B.3

DCUTrec11C.4

UMDIqtrec

MSRA.Q-Video.2a

Prous Science

IBM-2

CMU_MANUAL1

Page 64: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 64 Carnegie Mellon

Mean AvgP vs mean elapsed timeMean AvgP vs mean elapsed time

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0

5

10

15

20

25

30

IuVf1CMUInfIn

t1CMU_IN

TERACTIVE

DCUTrec11

B.

1DCUTrec

11B.

3DCUTrec

11C.

2DCUTrec

11C.4

IBM-4

ICMKM-1

MSRA.Q-

Video.1

MSRA.Q-V

ideo.2.

A

UMDIqtrec

UNivO_M

T5

Mean average precision

Mean elapsed time (mins.)

Wide variation in elapsed time.Not the dominant factor in effectiveness

Page 65: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 65 Carnegie Mellon

DemoDemo

• CMU Interactive Search SystemCMU Interactive Search System

• IBM Video Retrieval SystemIBM Video Retrieval Systemhttp://mp7.i2.ibm.com/

Page 66: The TREC Interactive Video Track and Content-based Retrieval from Digital Video Rong Yan christel/MM2002/syllabus.htm

© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 66 Carnegie Mellon

ConclusionConclusion

• The goal of content-based video retrieval is to build The goal of content-based video retrieval is to build more intelligent video retrieval engine via semantic more intelligent video retrieval engine via semantic meaningmeaning

• Many applications in daily lifeMany applications in daily life

• Combine evidence from different aspectsCombine evidence from different aspects

• Hot research topic, few business systemHot research topic, few business system

• State-of-the-art performance is still unacceptable for State-of-the-art performance is still unacceptable for normal users, space to improvenormal users, space to improve