TRANSCRIPT
ECS 289H: Visual Recognition, Fall 2014
Yong Jae Lee
Department of Computer Science
Plan for today
• Questions?
• Research overview
Standard supervised visual learning
Annotators → category models → novel images (e.g., building, tree)
• Number of training images required can be costly
• Assumes closed-world setting where all categories are known
Unsupervised visual discovery
Visual world
Discovered categories
Unsupervised visual discovery
Visual world
Object segmentations in images and video
Unsupervised visual discovery
• No human to explicitly guide visual recognition process
Visual world
Storyboard visual summary
1:00 pm 2:00 pm 3:00 pm 4:00 pm
Why visual discovery?
Exploring new environments
Summarization
MSR SenseCam
Why visual discovery?
• Billions of images across major web services (6 billion, 70 billion, 10 billion); 1 billion images served daily; 100 hours of video uploaded per minute
• Almost 90% of web traffic is visual!
Why visual discovery?
Most of it is unlabeled!!
Personal photo albums
Surveillance and security
Movies, news, sports
Medical and scientific images
Inputs today
Svetlana Lazebnik
Understand, organize, and index all this data!
Let’s first explore…
what we can do with big data!
Everyday use of big data: Predictive text
Predictive drawing?
Research goal: Visual discovery
Visual world
Discovered categories
Key challenges
• Simultaneously estimate segmentation and groups
• Unknown variability in appearance
• What is the proper distance metric?
• Grayvalue distance of 50 values
• Euclidean distance of 5 units between points (x, y)
• Hamming distance of 1 letter: CLIME - CRIME
How similar are two pictures?
Alyosha Efros
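The slide's point is that familiar distance functions are trivial to compute, while image similarity is not. The three quoted distances can be sketched directly (illustrative values only):

```python
import math

def gray_distance(a, b):
    """Absolute difference of two 8-bit gray values."""
    return abs(a - b)

def euclidean_distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.dist(p, q)

def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t)
    return sum(c1 != c2 for c1, c2 in zip(s, t))

print(gray_distance(200, 150))            # 50
print(euclidean_distance((0, 0), (3, 4))) # 5.0
print(hamming_distance("CLIME", "CRIME")) # 1
```

No such one-liner exists for "distance between two pictures", which is the point of the question that follows.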
How similar are two pictures?
Problem
Clusters formed from full image matches
Mutual Relationship between Foreground Features and Clusters
• If we have only foreground features, we can form good clusters…
Clusters formed from full image matches
Clusters formed from foreground matches
Mutual Relationship between Foreground Features and Clusters
• If we have good clusters, we can detect the foreground…
Our Approach
• Unsupervised task that iteratively seeks the mutual support between discovered objects and their defining features
• Update clusters based on weighted feature matches
• Refine feature weights given current clusters
[Lee & Grauman, Foreground Focus, IJCV 2009]
Cluster and Feature Weight Refinement: Iteration 1
Images as Local Feature Sets → Pair-wise Matching → Normalized Cuts Clustering → Initial Set of Clusters → Compute Feature Weights → New Feature Weights
Cluster and Feature Weight Refinement: Iteration 2
New Set of Clusters → Compute Feature Weights → New Feature Weights
Cluster and Feature Weight Refinement: Iteration 3
Pair-wise Matching + Normalized Cuts → Final Set of Clusters → New Feature Weights
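The alternation above can be sketched under two toy assumptions of mine: each image is reduced to a single fixed-length descriptor, and normalized-cuts clustering is approximated by one spectral (Fiedler-vector) split into two groups. The real system matches local feature sets; this only illustrates the cluster/weight loop:

```python
import numpy as np

def foreground_focus(features, n_iters=3):
    """Toy sketch of the iterative refinement (data layout assumed):
    features is an (n_images, n_features) matrix. Each round
    (1) builds weighted pairwise similarities, (2) splits images into
    two clusters with a spectral, normalized-cuts-style cut, and
    (3) re-weights features so those consistent within clusters count more."""
    n, d = features.shape
    w = np.ones(d) / d                         # start with uniform weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iters):
        sim = (features * w) @ features.T      # weighted similarities
        deg = sim.sum(axis=1)
        lap = np.diag(deg) - sim               # unnormalized graph Laplacian
        _, vecs = np.linalg.eigh(lap)
        labels = (vecs[:, 1] > 0).astype(int)  # sign of the Fiedler vector
        var = np.zeros(d)
        for c in (0, 1):                       # within-cluster feature variance
            members = features[labels == c]
            if len(members) > 1:
                var += members.var(axis=0)
        w = 1.0 / (1e-6 + var)                 # consistent features gain weight
        w /= w.sum()
    return labels, w
```

Features that agree within the discovered groups (the "foreground" ones) end up dominating the similarity, which in turn sharpens the groups on the next round.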
Quality of Clusters Formed
• Black dotted lines indicate the best possible quality that could be obtained if the ground truth segmentation were known
Quality of Foreground Detection
10-class subset: highly weighted features
Shape
• Invariant to lighting conditions
• Relatively stable compared to intra-category appearance (texture, color) variations
Can we discover common object shapes within unlabeled multi-category collections of images?
Anchoring Edge Fragments to Local Patches
Even with accurate patch matches, there’s a limit to how much shape information can be captured.
By anchoring edge fragments to patch features, we can produce more reliable matches and describe the object’s shape.
Foreground Shape Discovery: Prototypical Shape
Examples of discovered object contours (our shapes)
[Lee & Grauman, Shape Discovery, CVPR 2009]
• Works well for object-centric images
• Complex images with multiple objects remain challenging…
Existing approaches
• Previous work treats unsupervised visual discovery as an appearance-grouping problem.
How can seeing previously learned objects in novel images help to discover new categories?
Our idea
Our idea: Discover visual categories within unlabeled images by modeling interactions between the unfamiliar regions and familiar objects.
[Lee & Grauman, Object-graphs, CVPR 2010]
Context-aware visual discovery
• Example scenes: familiar regions (sky, grass, house, drive-way, truck, fence) surround unfamiliar regions marked "?".
Learn “known” categories
• Offline: Train region-based classifiers for N “known” categories using labeled training data.
sky, road, building, tree
Pipeline: Detect Unknowns → Object-level Context → Discovery → Learn Models
Identifying unknown regions
Input: unlabeled pool of novel images
Compute multiple segmentations for each unlabeled image
• Compute the class posterior P(class | region) for each segment under the known-category classifiers.
• Peaked posterior → prediction: known. High entropy → prediction: unknown.
• Deem each segment as "known" or "unknown" based on the resulting entropy.
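A minimal sketch of the entropy test: a peaked posterior has low entropy (known), a flat one has high entropy (unknown). The threshold value here is an arbitrary placeholder, not a value from the paper:

```python
import math

def entropy(posterior):
    """Shannon entropy (in nats) of a class-posterior distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0)

def is_unknown(posterior, threshold=1.0):
    """Flag a region as 'unknown' when no known class dominates, i.e.,
    the posterior entropy exceeds a threshold (a free parameter here)."""
    return entropy(posterior) > threshold

print(is_unknown([0.9, 0.05, 0.03, 0.02]))   # peaked -> known (False)
print(is_unknown([0.25, 0.25, 0.25, 0.25]))  # flat   -> unknown (True)
```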
Model the topology of category predictions relative to the unknown (unfamiliar) region.
An unknown region within an image
Object-graphs
An unknown region s within an image has, as the closest nodes in its object-graph, its spatially nearest regions above (1a, 2a, …, Ra) and below (1b, 2b, …, Rb). For the region itself ("self", H0) and for the 1st nearest regions out to the Rth nearest (H1 … HR), record the posterior distribution over the known classes (b, t, s, r). The descriptor concatenates these histograms:
g(s) = [ H0(s), H1(s), …, HR(s) ]
• Consider spatially near regions above and below, and record distributions for each known class.
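The descriptor g(s) can be sketched as follows; the input layout (lists of posterior vectors, nearest regions first) is my assumption:

```python
import numpy as np

def object_graph_descriptor(self_post, above_posts, below_posts, R=2):
    """Sketch of the object-graph descriptor g(s): concatenate the unknown
    region's own class posterior (H0) with, for r = 1..R, the accumulated
    posteriors of the r spatially nearest regions above and below it.
    above_posts / below_posts: posterior vectors, nearest region first."""
    parts = [np.asarray(self_post)]               # H0: the region itself
    above_acc = np.zeros_like(parts[0], dtype=float)
    below_acc = np.zeros_like(parts[0], dtype=float)
    for r in range(R):
        above_acc += above_posts[r]
        below_acc += below_posts[r]
        # Hr: average posterior over the r nearest regions above / below
        parts.append(above_acc / (r + 1))
        parts.append(below_acc / (r + 1))
    return np.concatenate(parts)
```

With 4 known classes and R = 2 this yields a 20-dimensional context descriptor, letting two unknown regions match if their surrounding known-class topology matches, even when their own appearance differs.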
Example object-graphs
• building, sky, road, unknown
• Colors indicate the predicted known category (max posterior)
Unknown regions
Clusters from region-region affinities
Results: object discovery accuracy
• Datasets: MSRC-v0, MSRC-v2, PASCAL 2008, Corel
• Object-level context provides more robust affinities
Example discoveries
Context-aware face discovery
• System can suggest novel people to name based on their appearance and co-occurrence with familiar people.
• Faces labeled David and Kate co-occur with an unlabeled face: name?
[Lee & Grauman, Face discovery, BMVC 2011]
Co-occurring faces (discovered face clusters 2, 3, 7, 12)
• Datasets: Gallagher, Friends, Buffy
• 12,542 images, 8,452 faces, and 23 unique people
• Two splits: 8 unknowns and 15 unknowns
Results: Context-aware face discovery
[Lee & Grauman, Face discovery, BMVC 2011]
Self-paced discovery
Traditional: batch k-way
• Previous work treats unsupervised visual discovery as a one-pass "batch" procedure.
Self-paced discovery
• Focus on the “easier” instances first, and gradually discover new models of increasing complexity.
Single Easiest (Ours)
[Lee & Grauman, Self-paced discovery, CVPR 2011]
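The contrast with batch k-way clustering can be sketched as a loop that always takes the single easiest remaining instance; the model update after each discovery is stubbed out (an assumption of this sketch):

```python
def self_paced_discovery(instances, easiness, n_rounds=3):
    """Sketch of the self-paced strategy: instead of one-pass batch
    clustering, repeatedly take the single easiest unexplained instance,
    'discover' it, and let later rounds build on what was learned.
    easiness: maps an instance to a score (higher = easier)."""
    remaining = list(instances)
    discovered = []
    for _ in range(min(n_rounds, len(remaining))):
        easiest = max(remaining, key=easiness)  # single easiest first
        discovered.append(easiest)
        remaining.remove(easiest)
        # in the real system, discovering `easiest` would update the
        # familiarity/context models used to score the remaining instances
    return discovered
```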
Pipeline: Initialize Stuff → Identify Easy Objects → Detect Easy Instances → Discover New Category → Expand Knowledge
Cues: Objectness (Obj) and Context-Awareness (CA), with a Familiarity Map (F), combine (+) into Easiness (ES)
• Obj: how well a window contains any generic object.
• CA: how well surrounding regions resemble familiar categories.
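One plausible reading of the "+" on the slide is a weighted sum of the two cues; the equal weights below are an assumption of this sketch, not the paper's setting:

```python
def easiness(objectness, context_awareness):
    """Easiness (ES) score for a candidate window.
    objectness: how well the window contains any generic object, in [0, 1].
    context_awareness: how familiar the surrounding regions look, in [0, 1].
    Equal weighting is an assumption of this sketch."""
    return 0.5 * objectness + 0.5 * context_awareness

# Rank candidate windows and take the single easiest first, matching the
# "Single Easiest" strategy (hypothetical window scores).
windows = [("w1", 0.9, 0.8), ("w2", 0.4, 0.9), ("w3", 0.7, 0.2)]
easiest = max(windows, key=lambda w: easiness(w[1], w[2]))
print(easiest[0])  # w1
```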
Object Discovery Accuracy
Unsupervised visual discovery
Visual world
Object segmentations in images and video
Collect-Cut
Best Bottom-up (with multi-segs)
Collect-Cut (ours)
Discovered Ensemble from Unlabeled Multi-Object Images
Unlabeled Images
Unsupervised Segmentation Examples
[Lee & Grauman, Collect-Cut, CVPR 2010]
Problem: Video object segmentation
Input: unannotated video. Desired output: segmentation of the high-ranking foreground object.
• Existing methods group pixels using low-level features, which can result in an “over-segmentation.” [Brendel & Todorovic 2009,
Vazquez-Reina et al. 2010, Grundmann et al. 2010, Brox & Malik 2010]
How to segment the foreground objects in video when:
• the background is moving and changing
• the categories of foreground objects are unknown in advance
Key-segment discovery
• Discover a set of “object-like” key-segments for category independent video object segmentation
– Resist over-segmentation by detecting regions with “object-like” appearance and motion
[Lee, Kim, Grauman, Key-segments, ICCV 2011]
1) Find object-like regions using appearance and motion cues
2) Group regions across video to discover key-segment hypotheses
3) Rank hypotheses and build segmentation models for each hypothesis
4) For a given hypothesis, segment the corresponding foreground object using the models
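Step 3 above can be sketched by scoring each key-segment hypothesis with the mean "object-like" score of its member regions; the equal weighting of the appearance and motion cues is an assumption of this sketch:

```python
def rank_hypotheses(hypotheses):
    """Rank key-segment hypotheses (step 3) by the mean object-likeness of
    their member regions. Each hypothesis is a list of
    (appearance_score, motion_score) pairs; equal cue weights assumed."""
    def score(hypothesis):
        return sum(0.5 * a + 0.5 * m for a, m in hypothesis) / len(hypothesis)
    return sorted(hypotheses, key=score, reverse=True)
```

The top-ranked hypotheses then seed the foreground models (step 4) used to segment the object through the video.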
Key-segment discovery
Output segmentation
Shape model, color model
Results: Key-segment video segmentation
76
• Detect and segment people and discovered important objects without category-specific models
• Success in spite of moving camera, bg changes, low resolution
Results: Key-segment video segmentation
Grundmann et al. 2010
Grundmann et al. 2010
Ours Ours
• Resists over-segmentation by detecting regions with “object-like” appearance and motion
Results: Key-segment video segmentation
[29]: Tsai et al. BMVC 2010, [7]: Chockalingam et al. ICCV 2009
• Background subtraction falls apart
• Ours produces state-of-the-art results even when compared to supervised methods
Segmentation error rate
Unsupervised visual discovery
Visual world
Storyboard visual summary
1:00 pm 2:00 pm 3:00 pm 4:00 pm
Mining first-person camera data
• Wearable cameras: GoPro, Google Glass, Looxcie, Pivothead, Tobii, SMI
• Steve Mann's life logger (1990s)
Problem: Summarizing egocentric videos
Output: Storyboard summary of discovered important people and objects
9:00 am 10:00 am 11:00 am 12:00 pm 1:00 pm 2:00 pm
Wearable camera
Input: Egocentric video of the camera wearer’s day
[Lee, Ghosh, Grauman, Egocentric video summarization, CVPR 2012]
Important person/object discovery
• Discover important people and objects for egocentric video summarization
– Important: things with which the camera wearer has significant interaction
Data collection
• 15 fps, 320 × 480 resolution
• 10 videos, 3-5 hrs in length; 37 hrs total
• Four subjects: one undergraduate, two grad students, and one office worker
Learning region importance
Pipeline: segment video into events → collect training data → learn importance → discover important regions → storyboard summary
• Egocentric features: distance to hand, distance to frame center, frequency
• Region features: size, width, height, centroid
• Object features: candidate region's appearance and motion; surrounding area's appearance and motion; object-like appearance and motion; overlap with face detection
Learning region importance
• Regressor to learn and predict a region's degree of importance
• Expect significant interactions between the features; e.g., a region near the hand is important only if it is object-like in appearance
• For training: learn parameters relating the feature values xi(r) to importance I(r)
• For testing: predict I(r) given the xi(r)'s
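The training/testing bullets can be sketched with a least-squares regressor over interaction-augmented features, so products like (distance to hand) × (object-likeness) are available to the model. The specific regressor and feature layout are my assumptions, not the paper's exact formulation:

```python
import numpy as np

def interaction_features(x):
    """Augment raw egocentric features (e.g., distance to hand, distance
    to frame center, frequency) with pairwise products, so the regressor
    can capture interactions such as 'near the hand AND object-like'."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return np.concatenate([[1.0], x, pairs])  # bias + linear + interactions

def fit_importance(X, y):
    """Least-squares fit of importance I(r) on augmented features
    (the choice of regressor is an assumption of this sketch)."""
    A = np.stack([interaction_features(x) for x in X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_importance(w, x):
    """Predict I(r) for a new region from its feature values."""
    return float(interaction_features(x) @ w)
```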
Methods: Ours; Object-like [Carreira, 2010]; Object-like [Endres, 2010]; Saliency [Walther, 2006]
Results: Important region prediction
Good predictions
Failure cases
Generating a storyboard summary
• Display event boundaries and frames of the selected important people and objects
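Storyboard assembly can be sketched as selecting, per event, the frames whose discovered regions score highest under the learned importance predictor; the data layout and the choice of k are assumptions of this sketch:

```python
def storyboard(events, importance, k=3):
    """Sketch of storyboard generation: for each temporal event, keep the
    k frames whose discovered regions score highest under the learned
    importance model, displayed in order with event boundaries.
    events: list of events, each a list of (frame_id, region_features).
    importance: learned predictor mapping region features to a score."""
    board = []
    for event in events:
        ranked = sorted(event, key=lambda fr: importance(fr[1]), reverse=True)
        board.append([frame_id for frame_id, _ in ranked[:k]])
    return board
```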
Original video (3 hours) → our summary (12 frames)
Results: Egocentric video summarization
Fine-grained recognition
Coming up
• Sign up for papers
• Next class:
– Object Recognition from Local Scale-Invariant Features. D. Lowe. ICCV 1999.
– Video Google: A Text Retrieval Approach to Object Matching in Videos. J. Sivic and A. Zisserman. ICCV 2003.
• Read both papers
• Write a review for one of them