The TREC Interactive Video Track and Content-based Retrieval from Digital Video
Rong Yan
http://www.cs.cmu.edu/~christel/MM2002/syllabus.htm
© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 2 Carnegie Mellon
State-of-the-art Multimedia Search Engines
• Recalling Homework 1 and Homework 2
• Better for simple concepts,
  e.g. Two people kissing, A picture of a giraffe
• Don’t work for complex queries,
  e.g. A picture of a brick home with black shutters and white pillars, with a pickup truck in front of it (image)

Examples
• Find the pictures of giraffe
  • Keyword: giraffe
  • http://images.google.com/images?hl=en&lr=lang_zh-CN%7Clang_en&ie=UTF-8&oe=UTF-8&q=giraffe+
• A picture of a brick home with black shutters and white pillars, with a pickup truck in front of it (image)
  • brick home shutters
  • http://images.google.com/images?hl=en&lr=lang_zh-CN%7Clang_en&ie=UTF-8&oe=UTF-8&q=brick+home+shutters+

Why does this happen?
• Most of these search engines are keyword based
  • “False” multimedia search engines
  • You have to represent your idea in keywords
  • These keywords are expected to appear in the filename or the corresponding web page
• Therefore…
  • Unable to handle semantic meaning of images
  • Unable to handle visual position
  • Unable to handle time information
  • Unable to use images as query
  • …

Solution
• Excerpted from your homework
  • … I found that the Google Image Search was not as good as expected. Altavista was the more useful multimedia search engine. However, most of them just did a search based on the filename or the matching keywords within the site it was located. I think it would be great to have multimedia search engine intelligent enough to associate its own keywords based on what's in the image.
Our Solution: Content-based Information Retrieval (CBIR)
Mainly Video Retrieval in this lecture

Content-based Video Retrieval
• Application
• Implementation
• Effort on TREC02 video track
  • Feature Extraction Task (High-level Semantics Feature)
  • Manual Retrieval Task (One-run Retrieval)
  • Interactive Retrieval Task (Multiple-run with Feedback)
  • Results & Demo
• Conclusion

Application
• Increasing demand for visual information retrieval
  • Retrieve useful information from databases
  • Sharing and distributing video data through computer networks
• Example: BBC
  • BBC archive has +500k queries plus 1M new items … per year
  • From the BBC …
    • Police car with blue light flashing
    • Government plan to improve reading standards
    • Two shot of Kenneth Clarke and William Hague

Application (Cont.)
• Video Surveillance
  • Find where else the person appears
• Experience On-Demand
  • Help to remember previous events
• Provide useful information on traveling
  • Equipment on cars to retrieve useful multimedia information according to your location/preference
• …
• Video content is plentiful … it's now available digitally … we can work on it directly … so it follows

Typical Retrieval Framework
• User: provides query information that represents his information needs
• Database: stores a large collection of video data
• Goal: find the most relevant shots from the database
  • Shots: “paragraph” in video, typically 20 – 40 seconds, which is the basic unit of video retrieval

Sample Query
• Text: Find pictures of George Washington
• Image:
• Video:

Bridging the Gap
[Diagram labels: User; Video Database; Result]

Automatically Structure Video Data
• The first step for video retrieval: video “programmes” are structured into logical scenes and physical shots
• If dealing with text, then the structure is obvious:
  • paragraph, section, topic, page, etc.
• All text-based indexing, retrieval, linking, etc. builds upon this structure
• Automatic shot boundary detection and selection of representative keyframes is usually the first step
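
The slides do not say how shot boundaries are found. As a minimal sketch, assuming a common baseline that is not confirmed by the source, the code below compares gray-level histograms of consecutive frames and declares a hard cut when the difference exceeds a threshold. The frame format (a flat list of 0..255 gray values), the bin count, the threshold, and the "middle frame as keyframe" rule are all illustrative assumptions.

```python
from typing import Dict, List, Sequence

# Assumed input format: each frame is a non-empty flat sequence of gray values in 0..255.
Frame = Sequence[int]


def gray_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized gray-level histogram of one frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    total = float(len(frame))
    return [c / total for c in counts]


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return every index i where a cut is assumed between frame i-1 and frame i."""
    cuts = []
    previous = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        current = gray_histogram(frames[i])
        # L1 distance between two normalized histograms lies in [0, 2].
        difference = sum(abs(a - b) for a, b in zip(previous, current))
        if difference > threshold:
            cuts.append(i)
        previous = current
    return cuts


def split_into_shots(frames: Sequence[Frame], threshold: float = 0.5) -> List[Dict[str, int]]:
    """Group frames into shots [start, end) and pick the middle frame as keyframe."""
    boundaries = [0] + detect_cuts(frames, threshold) + [len(frames)]
    shots = []
    for start, end in zip(boundaries, boundaries[1:]):
        shots.append({"start": start, "end": end, "keyframe": (start + end - 1) // 2})
    return shots


if __name__ == "__main__":
    dark, bright = [10] * 64, [200] * 64
    print(split_into_shots([dark, dark, dark, bright, bright]))
```

A fixed threshold only catches abrupt cuts; gradual transitions such as fades and dissolves would need a different test.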

Typical automatic structuring of video
[Diagram: a video document is split into a set of shots; keyframe browser combined with transcript or object-based search]

Bridging the Gap
[Diagram labels: User (Information Need); Video Database (Video Structure); Result]

Ideal solution
[Diagram labels: User (Information Need); Video Database (Video Structure); Understanding the semantic meaning and retrieve; Result]
However,
1. Hard to represent query in natural language and for computer to understand
2. Computers have no experience
3. Other representation restriction like position, time

Alternative Solution
[Diagram labels: User (Information Need); Video Database (Video Structure); Match and combine; Result]
Provide evidence of relevant information (text, image, audio)

Evidence-based Retrieval System
• General framework for current video retrieval systems
• Video retrieval based on the evidence from both users and database, including
  • Text information
  • Image information
  • Motion information
  • Audio information
• Return a relevance score for each piece of evidence
• Combination of the scores
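
The slide does not specify how the per-evidence scores are combined. One simple possibility, shown here as a sketch under assumptions, is to rescale each modality's scores to [0, 1] and take a weighted sum. The min-max rescaling, the weights, and the shot identifiers are all hypothetical choices for illustration.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def combine_evidence(
    evidence: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Weighted sum of normalized scores; a shot missing from a modality contributes 0."""
    combined: Dict[str, float] = {}
    for modality, scores in evidence.items():
        weight = weights.get(modality, 0.0)
        for shot, score in min_max_normalize(scores).items():
            combined[shot] = combined.get(shot, 0.0) + weight * score
    return combined


def rank(combined: Dict[str, float]) -> List[str]:
    """Shots ordered by descending combined score, ties broken by shot id."""
    return sorted(combined, key=lambda shot: (-combined[shot], shot))


if __name__ == "__main__":
    evidence = {
        "text": {"shot1": 2.0, "shot2": 1.0, "shot3": 0.0},
        "image": {"shot1": 0.0, "shot2": 10.0},
    }
    print(rank(combine_evidence(evidence, {"text": 0.5, "image": 0.5})))
```

In practice the weights could depend on the query, for example favoring text evidence for high-level queries and image evidence for low-level ones.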

Keyword-based System
[Diagram labels: User (Information Need); Keyword; Automatic Annotation (including filename, video title, caption, related web page); Video Database (Video Structure)]

Keyword-based System
[Diagram labels: User (Information Need); Keyword; Automatic Annotation; Manual Annotation; Video Database (Video Structure)]

Manual Annotation
• Manually creating annotation/keywords for image / video data
• Examples: Gettyimage.com (image retrieval)
• Pros:
  • Represent the semantic meaning of video
• Cons:
  • Time-consuming, labor-intensive
  • Keyword is not enough to represent information need

Speech and OCR transcription
[Diagram labels: User (Information Need); Keyword; Annotation; Speech Transcription; OCR Transcription; Video Database (Video Structure)]

Query using speech/OCR information
Query: Find pictures of Harry Hertz, Director of the National Quality Program, NIST
Speech: We’re looking for people that have a broad range of expertise that have business knowledge that have knowledge on quality management on quality improvement and in particular …
OCR: H,arry Hertz a Director aro 7 wa-,i,,ty Program,Harry Hertz a Director

What do we lack?
[Diagram labels: User (Information Need); Keyword; Annotation; Speech Transcription; OCR Transcription; Image Information; Video Database (Video Structure)]

Image-based Retrieval
[Diagram labels: User (Information Need); Keyword; Query Images; Text Information; Image Feature; Video Database (Video Structure)]

Global Low-level Image Feature
• Color-based Feature
  • Color Histogram
  • Color Percentage
  • Color Correlogram
  • Color Moments
• Texture-based Feature
  • Gabor Filter
  • Wavelet
• Shape/Structure Feature
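
To make the first color feature concrete, here is a minimal sketch of a global color histogram. It assumes an image is given as a non-empty list of (R, G, B) tuples with values in 0..255 and that each channel is quantized into 4 bins; both choices are illustrative and not taken from the slides.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # assumed format: (R, G, B), each in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 entries."""
    size = bins_per_channel
    counts = [0] * (size ** 3)
    for r, g, b in pixels:
        # Quantize each channel, then flatten the 3-D bin index.
        ri = r * size // 256
        gi = g * size // 256
        bi = b * size // 256
        counts[(ri * size + gi) * size + bi] += 1
    total = float(len(pixels))
    return [c / total for c in counts]


if __name__ == "__main__":
    image = [(255, 0, 0)] * 3 + [(0, 0, 255)]
    histogram = color_histogram(image)
    print(len(histogram), histogram[48], histogram[3])
```

Each entry is the fraction of pixels falling into one quantized color, so the vector sums to 1 regardless of image size. It ignores where in the image each color occurs.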

Regional Low-level Image Feature
• Segmentation into objects
• Extract low-level features from each region

Image Search
• Feature Representation
  • Image: represented as a series of real numbers, or a vector of features, (f1, …, fn)
  • Distance Function: the distance between two vectors, typically Euclidean distance
  • We believe “Nearest is relevant”
    • The nearest images in the database are relevant to the query images.
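
The "nearest is relevant" idea can be sketched as a brute-force nearest-neighbor search over feature vectors with Euclidean distance. The database layout (a dict from image name to feature vector) and the function names are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_images(
    query: Sequence[float], database: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k database images closest to the query, nearest first."""
    scored = [(name, euclidean_distance(query, features)) for name, features in database.items()]
    scored.sort(key=lambda item: (item[1], item[0]))
    return scored[:k]


if __name__ == "__main__":
    database = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 0.0)}
    print(nearest_images((0.0, 0.0), database, k=2))
```

A linear scan like this is fine for a small collection; a large archive would typically need an index structure instead.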

Finding Similar Images

But…
• Low-level features don’t work in all cases

High-level Image Feature
• Objects: Persons, Roads, Cars, Skies…
• Scenes: Indoors, Outdoors, Cityscape, Landscape, Water, Office, Factory…
• Event: Parade, Explosion, Picnic, Playing Soccer…
• Generated from low-level features

Image-based Retrieval
[Diagram labels: User (Information Need); Keyword; Query Images; Text Information; Image Feature (Low-level Feature, High-level Feature); Video Database (Video Structure)]

More Evidence in Video Retrieval
[Diagram labels: User (Information Need); Keyword; Query Images; Motion; Audio; Text Information; Image Information; Motion Information; Audio Information; Video Database (Video Structure)]

Combination of multi-modal results
• Different characteristics of multi-modal information
  • Text-based Information: better for middle and high level queries
    • e.g. Find the video clip of dancing women wearing dresses
  • Image-based Information: better for low and middle level queries
    • e.g. Find the video clip of green trees
• Combination of multi-modal information

Other Useful Techniques
• Query Expansion
• Cross-Modal Relation
• Relevance Feedback

Recap
• Video Retrieval is to bridge the gap between user information need and video database
• Multi-modal evidence
  • Text-based (most popular)
  • Image-based
  • Motion-based
  • Audio-based
• Combination of the evidence

Content-based Video Retrieval
• Application
• Implementation
• TREC video track
  • Feature Extraction Task (High-level Semantics Feature)
  • Manual Retrieval Task (One-run Retrieval)
  • Interactive Retrieval Task (Multiple-run with Feedback)
  • Results & Demo
• Conclusion

Introduction to TREC Video Retrieval Track
• Full Name: Text REtrieval Conference
• TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/
• TREC series sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies
  • Goal is to encourage research in information retrieval

Introduction to TREC Video Retrieval Track
• Video Retrieval Track started in 2001
  • Goal is investigation of content-based retrieval from digital video
  • Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip
• Current state-of-the-art Video Retrieval Competition
  • 17 active participants, including groups from CMU, IBM Research, Microsoft Research Asia, MediaMill, LIMSI, Dublin City University.

Main tasks in TREC
• Shot boundary detection
• Semantic Feature Extraction Task
• Video Retrieval Task
  • Manual Retrieval: a human formulates a query, and the system then retrieves from the collection automatically
  • Interactive Retrieval: full human access and feedback

Where are they?
[Diagram labels: User (Information Need); Keyword; Query Images; Text Information; Image Feature (Low-level Feature, High-level Feature); Video Database (Video Structure); annotated with the TREC tasks Shot Boundary Detection, Feature Extraction, and Retrieval Task]

Video Data
• Difficult to get video data for use in TREC because ©
• Used mainly Internet Archive
  • advertising, educational, industrial, amateur films 1930-1970
  • produced by corporations, non-profit organisations, trade groups, etc.
  • Noisy, strange color, but real archive data
  • 73.3 hours partitioned as follows:
    Search test: 40.12
    Feature development (training and validation): 23.26
    Feature test: 5.07
    Shot boundary test: 4.85

Shot Boundary Detection
• Fundamental primitive of most/all work in content-based video retrieval
[Diagram: a video document is split into a set of video shots]

Feature Extraction
• Extract high-level semantic features from video
• Assign a video clip to one or more of several categories of video
High-level features: Cityscape, Lake, Trees, Water, Sky

Feature Extraction
• Interesting in itself, but its importance increases when it helps video navigation and search
• Benefits:
  • Retrieval - Find video from a particular class
  • Filtering - Remove irrelevant and distracting information categories from summaries and visualizations
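
Both benefits can be illustrated with a small sketch: once each shot carries a set of detected feature labels, retrieval by class and filtering become set operations. The shot identifiers, the label spellings, and the function name below are hypothetical, and the sketch assumes the feature detectors have already produced binary decisions.

```python
from typing import Dict, Iterable, List, Set


def filter_shots(
    shot_features: Dict[str, Set[str]],
    require: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> List[str]:
    """Keep shots that have every required label and none of the excluded ones."""
    required, excluded = set(require), set(exclude)
    return sorted(
        shot
        for shot, features in shot_features.items()
        if required <= features and not (excluded & features)
    )


if __name__ == "__main__":
    # Hypothetical detector output: shot id -> labels detected in that shot.
    shots = {
        "shot_001": {"outdoor", "cityscape"},
        "shot_002": {"indoor", "face", "speech"},
        "shot_003": {"outdoor", "landscape"},
        "shot_004": {"outdoor", "face", "people"},
    }
    print(filter_shots(shots, require=["outdoor"]))
    print(filter_shots(shots, require=["outdoor"], exclude=["face"]))
```

Real detectors return confidence values rather than hard labels, so a practical system would threshold or rank by confidence before applying filters like these.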

The Features
Face
  Clip contains at least one human face with the nose, mouth, and both eyes visible. Pictures of a face meeting the above conditions count
People
  Clip contains a group of two or more humans, each of which is at least partially visible and is recognizable as a human
On-screen Text
  Clip contains superimposed text large enough to be read

The Features
Indoor
  Clip contains a recognizably indoor location, i.e., inside a building
Outdoor
  Clip contains a recognizably outdoor location, i.e., one outside of buildings
Cityscape
  Clip contains a recognizably city/urban/suburban setting
Landscape
  Clip contains a predominantly natural inland setting, i.e., one with little or no evidence of development by humans. Scenes with bodies of water that are clearly inland may be included

Non-Video (Audio) Features
Speech
  A human voice uttering words is recognizable as such in this segment
Instrumental Sound
  Sound produced by one or more musical instruments is recognizable as such in this segment
Monologues
  Segment contains an event in which a single person is at least partially visible and speaks for a long time without interruption by another speaker. Pauses are ok if short

TREC02 Results
[Chart: average precision per feature for features 1 to 10 (Outdoors, Indoors, Face, People, Cityscape, Landscape, Text overlay, Speech, Instrumental sound, Monolog). Series: CMU_r1, A_CMU_r2, CLIPS-LIT_GEOD, CLIPS-LIT-LIMSU, DCUFE2002, Eurecom1, Fudan_FE_Sys1, Fudan_FE_Sys2, IBM-1, IBM-2, MediaMill1, MediaMill2, MSRA, UnivO_MT1, UnivO_MT2, Avg Prec Cap, Random baseline]

Video Search Task
• The most important task and final goal
• Manual & Interactive Search Task

Queries for 2002 TREC Video Track
• Specific item or person
  • Eddie Rickenbacker, James Chandler, George Washington, Golden Gate Bridge, Price Tower in Bartlesville, OK
• Specific fact
  • Arch in Washington Square Park in NYC, map of continental US
• Instances of a category
  • football players, overhead views of cities, one or more women standing in long dresses
• Instances of events/activities
  • people spending leisure time at the beach, one or more musicians with audible music, crowd walking in an urban environment, locomotive approaching the viewer

Sample Query
• XML Representation
<!DOCTYPE videoTopic SYSTEM "videoTopics.dtd">
<videoTopic num="077">
  <textDescription text="Find pictures of George Washington" />
  <imageExample src="http://www.cia.gov/csi/monograph/firstln/955pres2.gif" desc="face" />
  <videoExample src="01681.mpg" start="09m25.938s" stop="09m29.308s" desc="face" />
</videoTopic>
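
A topic file like the one above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is one way to do it, not the track's official tooling; the `parse_topic` and `parse_timestamp` helpers are hypothetical names, and the timestamp format is inferred from the single sample shown (minutes, `m`, seconds, `s`).

```python
import xml.etree.ElementTree as ET

TOPIC_XML = """<!DOCTYPE videoTopic SYSTEM "videoTopics.dtd">
<videoTopic num="077">
  <textDescription text="Find pictures of George Washington" />
  <imageExample src="http://www.cia.gov/csi/monograph/firstln/955pres2.gif" desc="face" />
  <videoExample src="01681.mpg" start="09m25.938s" stop="09m29.308s" desc="face" />
</videoTopic>"""


def parse_timestamp(value: str) -> float:
    """Convert a string such as '09m25.938s' to seconds (format assumed from the sample)."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def parse_topic(xml_text: str) -> dict:
    """Collect the text, image, and video examples of one topic."""
    root = ET.fromstring(xml_text)  # the external DTD is not fetched
    return {
        "num": root.get("num"),
        "text": [e.get("text") for e in root.findall("textDescription")],
        "images": [e.get("src") for e in root.findall("imageExample")],
        "videos": [
            (e.get("src"), parse_timestamp(e.get("start")), parse_timestamp(e.get("stop")))
            for e in root.findall("videoExample")
        ],
    }


if __name__ == "__main__":
    print(parse_topic(TOPIC_XML))
```

The three lists map directly onto the text, image, and video parts of a multi-modal query.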

Sample Query
• Text: Find pictures of George Washington
• Image:
• Video:

Evaluation Metric
• Goal: Maximize the Mean Average Precision
  • Result set limited to 100 shots
  • Precision = (# relevant shots retrieved)/(total # shots retrieved)
  • Average precision: compute precision after each retrieved relevant shot and then average these precisions over the total number of relevant shots in the collection for that topic
  • Submitting the maximum number of shots per result set can never lower the average precision for that submission
  • Mean Average Precision = average of the average precision measures for each topic
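
The metric can be written out as a short sketch. It assumes that the sum of precisions is divided by the total number of relevant shots in the collection for the topic, which is the normalization under which submitting more shots can never lower the score. The function names and the toy judgments are illustrative, and ranked lists are assumed to contain no duplicate shots.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str], max_results: int = 100) -> float:
    """Average precision for one topic over at most max_results returned shots."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, shot in enumerate(ranked[:max_results], start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this relevant shot
    # Assumed normalization: all relevant shots in the collection, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]], judgments: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision values."""
    if not runs:
        return 0.0
    total = sum(average_precision(ranked, judgments.get(topic, set())) for topic, ranked in runs.items())
    return total / len(runs)


if __name__ == "__main__":
    runs = {"topic_a": ["a", "x", "b", "y"], "topic_b": ["q"]}
    judgments = {"topic_a": {"a", "b", "c"}, "topic_b": {"q"}}
    print(mean_average_precision(runs, judgments))
```

For `topic_a` the relevant shots are found at ranks 1 and 3, giving precisions 1/1 and 2/3; dividing their sum by the three relevant shots yields 5/9, so the relevant shot that was never retrieved lowers the score.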

CMU Manual Retrieval System
[Diagram labels: Query (Text, Image); Retrieval Agents; Text Score; Image Score; PRF Score; Movie Info; Final Score]

Snapshot of the system

Manual Search Result
[Chart: precision vs. recall, both axes 0 to 1. Runs: Prous Science, IBM-2, CMU_MANUAL1, IBM-3, LL10_T, CLIPS+ASR, Fudan_Search_Sys4, CLIPS+ASR+X, ICMKM-2, UMDMqtrec]

CMU Interactive Search System
• New Interface based on Informedia system
• Multiple document storyboards
• Query context plays a key role in filtering image sets to manageable sizes
• TREC 2002 image feature set offers additional filtering capabilities for indoor, outdoor, faces, people, etc.
• Displaying filter count and distribution guides their use in manipulating the storyboard views

Snapshot of the system

Filter Interface for using Image Features

Interactive runs top 10 (of 13)
[Chart: precision vs. recall, both axes 0 to 1. Runs: CMUInfInt1, DCUTrec11B.1, DCUTrec11C.2, CMU_INTERACTIVE_2, UnivO_MT5, IBM-4, DCUTrec11B.3, DCUTrec11C.4, UMDIqtrec, MSRA.Q-Video.2a, Prous Science, IBM-2, CMU_MANUAL1]

Mean AvgP vs mean elapsed time
[Chart: mean average precision (0 to 1) and mean elapsed time in minutes (0 to 30) per interactive run. Runs: IuVf1, CMUInfInt1, CMU_INTERACTIVE, DCUTrec11B.1, DCUTrec11B.3, DCUTrec11C.2, DCUTrec11C.4, IBM-4, ICMKM-1, MSRA.Q-Video.1, MSRA.Q-Video.2.A, UMDIqtrec, UNivO_MT5]
Wide variation in elapsed time. Not the dominant factor in effectiveness

Demo
• CMU Interactive Search System
• IBM Video Retrieval System
  http://mp7.i2.ibm.com/

Conclusion
• The goal of content-based video retrieval is to build a more intelligent video retrieval engine via semantic meaning
• Many applications in daily life
• Combine evidence from different aspects
• Hot research topic, few business systems
• State-of-the-art performance is still unacceptable for normal users; there is room to improve