TRANSCRIPT
ECS 289H: Visual Recognition, Fall 2014
Yong Jae Lee
Department of Computer Science
Plan for today
• Questions?
• Research overview
Standard supervised visual learning
Annotators → category models → novel images (e.g., building, tree)
• Number of training images required can be costly
• Assumes closed-world setting where all categories are known
Unsupervised visual discovery
Visual world
Discovered categories
Unsupervised visual discovery
Visual world
Object segmentations in images and video
Unsupervised visual discovery
• No human to explicitly guide visual recognition process
Visual world
Storyboard visual summary
1:00 pm 2:00 pm 3:00 pm 4:00 pm
Why visual discovery?
Exploring new environments
Summarization
MSR SenseCam
Why visual discovery?
• Billions of images across major web services (6 billion, 70 billion, 10 billion); 1 billion images served daily; 100 hours of video uploaded per minute
• Almost 90% of web traffic is visual!
Why visual discovery?
Most of it is unlabeled!!
Personal photo albums
Surveillance and security
Movies, news, sports
Medical and scientific images
Inputs today
Svetlana Lazebnik
Understand, organize, and index all this data!
Let’s first explore…
what we can do with big data!
Everyday use of big data: Predictive text
Predictive drawing?
Research goal: Visual discovery
Visual world
Discovered categories
Key challenges
• Simultaneously estimate segmentation and groups
• Unknown variability in appearance
• What is the proper distance metric?
• Grayvalue distance of 50 values
• Euclidean distance of 5 units between points (x, y)
• Hamming distance of 1 letter: CLIME - CRIME
How similar are two pictures?
Alyosha Efros
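The slide's point is that familiar distance functions are trivial to compute, while image similarity is not. The three quoted distances can be sketched directly (illustrative values only):

```python
import math

def gray_distance(a, b):
    """Absolute difference of two 8-bit gray values."""
    return abs(a - b)

def euclidean_distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.dist(p, q)

def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    assert len(s) == len(t)
    return sum(c1 != c2 for c1, c2 in zip(s, t))

print(gray_distance(200, 150))            # 50
print(euclidean_distance((0, 0), (3, 4))) # 5.0
print(hamming_distance("CLIME", "CRIME")) # 1
```

No such one-liner exists for "distance between two pictures", which is the point of the question that follows.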
How similar are two pictures?
Problem
Clusters formed from full image matches
Mutual Relationship between Foreground Features and Clusters
• If we have only foreground features, we can form good clusters…
Clusters formed from full image matches
Clusters formed from foreground matches
Mutual Relationship between Foreground Features and Clusters
• If we have good clusters, we can detect the foreground…
Our Approach
• Unsupervised task that iteratively seeks the mutual support between discovered objects and their defining features
• Update clusters based on weighted feature matches
• Refine feature weights given current clusters
[Lee & Grauman, Foreground Focus, IJCV 2009]
Cluster and Feature Weight Refinement: Iteration 1
Images as Local Feature Sets → Pair-wise Matching → Normalized Cuts Clustering → Initial Set of Clusters → Compute Feature Weights → New Feature Weights
Cluster and Feature Weight Refinement: Iteration 2
New Set of Clusters → Compute Feature Weights → New Feature Weights
Cluster and Feature Weight Refinement: Iteration 3
Pair-wise Matching + Normalized Cuts → Final Set of Clusters → New Feature Weights
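The alternation above can be sketched under two toy assumptions of mine: each image is reduced to a single fixed-length descriptor, and normalized-cuts clustering is approximated by one spectral (Fiedler-vector) split into two groups. The real system matches local feature sets; this only illustrates the cluster/weight loop:

```python
import numpy as np

def foreground_focus(features, n_iters=3):
    """Toy sketch of the iterative refinement (data layout assumed):
    features is an (n_images, n_features) matrix. Each round
    (1) builds weighted pairwise similarities, (2) splits images into
    two clusters with a spectral, normalized-cuts-style cut, and
    (3) re-weights features so those consistent within clusters count more."""
    n, d = features.shape
    w = np.ones(d) / d                         # start with uniform weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iters):
        sim = (features * w) @ features.T      # weighted similarities
        deg = sim.sum(axis=1)
        lap = np.diag(deg) - sim               # unnormalized graph Laplacian
        _, vecs = np.linalg.eigh(lap)
        labels = (vecs[:, 1] > 0).astype(int)  # sign of the Fiedler vector
        var = np.zeros(d)
        for c in (0, 1):                       # within-cluster feature variance
            members = features[labels == c]
            if len(members) > 1:
                var += members.var(axis=0)
        w = 1.0 / (1e-6 + var)                 # consistent features gain weight
        w /= w.sum()
    return labels, w
```

Features that agree within the discovered groups (the "foreground" ones) end up dominating the similarity, which in turn sharpens the groups on the next round.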
Quality of Clusters Formed
• Black dotted lines indicate the best possible quality that could be obtained if the ground truth segmentation were known
Quality of Foreground Detection
10-class subset: highly weighted features
Shape
• Invariant to lighting conditions
• Relatively stable compared to intra-category appearance (texture, color) variations
Can we discover common object shapes within unlabeled multi-category collections of images?
Anchoring Edge Fragments to Local Patches
Even with accurate patch matches, there’s a limit to how much shape information can be captured.
By anchoring edge fragments to patch features, we can produce more reliable matches and describe the object’s shape.
Foreground Shape Discovery: Prototypical Shape
Examples of discovered object contours (our shapes)
[Lee & Grauman, Shape Discovery, CVPR 2009]
• Works well for object-centric images
• Complex images with multiple objects remain challenging…
Existing approaches
• Previous work treats unsupervised visual discovery as an appearance-grouping problem.
How can seeing previously learned objects in novel images help to discover new categories?
Our idea
Our idea: Discover visual categories within unlabeled images by modeling interactions between the unfamiliar regions and familiar objects.
[Lee & Grauman, Object-graphs, CVPR 2010]
Context-aware visual discovery
• Example scenes: familiar regions (sky, grass, house, drive-way, truck, fence) surround unfamiliar regions marked "?".
Learn “known” categories
• Offline: Train region-based classifiers for N “known” categories using labeled training data.
sky, road, building, tree
Pipeline: Detect Unknowns → Object-level Context → Discovery → Learn Models
Identifying unknown regions
Input: unlabeled pool of novel images
Compute multiple segmentations for each unlabeled image
• Compute the class posterior P(class | region) for each segment under the known-category classifiers.
• Peaked posterior → prediction: known. High entropy → prediction: unknown.
• Deem each segment as "known" or "unknown" based on the resulting entropy.
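A minimal sketch of the entropy test: a peaked posterior has low entropy (known), a flat one has high entropy (unknown). The threshold value here is an arbitrary placeholder, not a value from the paper:

```python
import math

def entropy(posterior):
    """Shannon entropy (in nats) of a class-posterior distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0)

def is_unknown(posterior, threshold=1.0):
    """Flag a region as 'unknown' when no known class dominates, i.e.,
    the posterior entropy exceeds a threshold (a free parameter here)."""
    return entropy(posterior) > threshold

print(is_unknown([0.9, 0.05, 0.03, 0.02]))   # peaked -> known (False)
print(is_unknown([0.25, 0.25, 0.25, 0.25]))  # flat   -> unknown (True)
```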
Model the topology of category predictions relative to the unknown (unfamiliar) region.
An unknown region within an image
Object-graphs
An unknown region s within an image has, as the closest nodes in its object-graph, its spatially nearest regions above (1a, 2a, …, Ra) and below (1b, 2b, …, Rb). For the region itself ("self", H0) and for the 1st nearest regions out to the Rth nearest (H1 … HR), record the posterior distribution over the known classes (b, t, s, r). The descriptor concatenates these histograms:
g(s) = [ H0(s), H1(s), …, HR(s) ]
• Consider spatially near regions above and below, and record distributions for each known class.
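The descriptor g(s) can be sketched as follows; the input layout (lists of posterior vectors, nearest regions first) is my assumption:

```python
import numpy as np

def object_graph_descriptor(self_post, above_posts, below_posts, R=2):
    """Sketch of the object-graph descriptor g(s): concatenate the unknown
    region's own class posterior (H0) with, for r = 1..R, the accumulated
    posteriors of the r spatially nearest regions above and below it.
    above_posts / below_posts: posterior vectors, nearest region first."""
    parts = [np.asarray(self_post)]               # H0: the region itself
    above_acc = np.zeros_like(parts[0], dtype=float)
    below_acc = np.zeros_like(parts[0], dtype=float)
    for r in range(R):
        above_acc += above_posts[r]
        below_acc += below_posts[r]
        # Hr: average posterior over the r nearest regions above / below
        parts.append(above_acc / (r + 1))
        parts.append(below_acc / (r + 1))
    return np.concatenate(parts)
```

With 4 known classes and R = 2 this yields a 20-dimensional context descriptor, letting two unknown regions match if their surrounding known-class topology matches, even when their own appearance differs.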
Example object-graphs
• building, sky, road, unknown
• Colors indicate the predicted known category (max posterior)
Unknown regions
Clusters from region-region affinities
Results: object discovery accuracy
• Datasets: MSRC-v0, MSRC-v2, PASCAL 2008, Corel
• Object-level context provides more robust affinities
Example discoveries
Context-aware face discovery
• System can suggest novel people to name based on their appearance and co-occurrence with familiar people.
• Faces labeled David and Kate co-occur with an unlabeled face: name?
[Lee & Grauman, Face discovery, BMVC 2011]
Co-occurring faces (discovered face clusters 2, 3, 7, 12)
• Datasets: Gallagher, Friends, Buffy
• 12,542 images, 8,452 faces, and 23 unique people
• Two splits: 8 unknowns and 15 unknowns
Results: Context-aware face discovery
[Lee & Grauman, Face discovery, BMVC 2011]
Self-paced discovery
Traditional: batch k-way
• Previous work treats unsupervised visual discovery as a one-pass "batch" procedure.
Self-paced discovery
• Focus on the “easier” instances first, and gradually discover new models of increasing complexity.
Single Easiest (Ours)
[Lee & Grauman, Self-paced discovery, CVPR 2011]
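The contrast with batch k-way clustering can be sketched as a loop that always takes the single easiest remaining instance; the model update after each discovery is stubbed out (an assumption of this sketch):

```python
def self_paced_discovery(instances, easiness, n_rounds=3):
    """Sketch of the self-paced strategy: instead of one-pass batch
    clustering, repeatedly take the single easiest unexplained instance,
    'discover' it, and let later rounds build on what was learned.
    easiness: maps an instance to a score (higher = easier)."""
    remaining = list(instances)
    discovered = []
    for _ in range(min(n_rounds, len(remaining))):
        easiest = max(remaining, key=easiness)  # single easiest first
        discovered.append(easiest)
        remaining.remove(easiest)
        # in the real system, discovering `easiest` would update the
        # familiarity/context models used to score the remaining instances
    return discovered
```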
Pipeline: Initialize Stuff → Identify Easy Objects → Detect Easy Instances → Discover New Category → Expand Knowledge
Cues: Objectness (Obj) and Context-Awareness (CA), with a Familiarity Map (F), combine (+) into Easiness (ES)
• Obj: how well a window contains any generic object.
• CA: how well surrounding regions resemble familiar categories.
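One plausible reading of the "+" on the slide is a weighted sum of the two cues; the equal weights below are an assumption of this sketch, not the paper's setting:

```python
def easiness(objectness, context_awareness):
    """Easiness (ES) score for a candidate window.
    objectness: how well the window contains any generic object, in [0, 1].
    context_awareness: how familiar the surrounding regions look, in [0, 1].
    Equal weighting is an assumption of this sketch."""
    return 0.5 * objectness + 0.5 * context_awareness

# Rank candidate windows and take the single easiest first, matching the
# "Single Easiest" strategy (hypothetical window scores).
windows = [("w1", 0.9, 0.8), ("w2", 0.4, 0.9), ("w3", 0.7, 0.2)]
easiest = max(windows, key=lambda w: easiness(w[1], w[2]))
print(easiest[0])  # w1
```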
Object Discovery Accuracy
Unsupervised visual discovery
Visual world
Object segmentations in images and video
Collect-Cut
Best Bottom-up (with multi-segs)
Collect-Cut (ours)
Discovered Ensemble from Unlabeled Multi-Object Images
Unlabeled Images
Unsupervised Segmentation Examples
[Lee & Grauman, Collect-Cut, CVPR 2010]
Problem: Video object segmentation
Input: unannotated video. Desired output: segmentation of the high-ranking foreground object.
• Existing methods group pixels using low-level features, which can result in an “over-segmentation.” [Brendel & Todorovic 2009,
Vazquez-Reina et al. 2010, Grundmann et al. 2010, Brox & Malik 2010]
How to segment the foreground objects in video when:
• the background is moving and changing
• the categories of foreground objects are unknown in advance
Key-segment discovery
• Discover a set of “object-like” key-segments for category independent video object segmentation
– Resist over-segmentation by detecting regions with “object-like” appearance and motion
[Lee, Kim, Grauman, Key-segments, ICCV 2011]
1) Find object-like regions using appearance and motion cues
2) Group regions across video to discover key-segment hypotheses
3) Rank hypotheses and build segmentation models for each hypothesis
4) For a given hypothesis, segment the corresponding foreground object using the models
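Step 3 above can be sketched by scoring each key-segment hypothesis with the mean "object-like" score of its member regions; the equal weighting of the appearance and motion cues is an assumption of this sketch:

```python
def rank_hypotheses(hypotheses):
    """Rank key-segment hypotheses (step 3) by the mean object-likeness of
    their member regions. Each hypothesis is a list of
    (appearance_score, motion_score) pairs; equal cue weights assumed."""
    def score(hypothesis):
        return sum(0.5 * a + 0.5 * m for a, m in hypothesis) / len(hypothesis)
    return sorted(hypotheses, key=score, reverse=True)
```

The top-ranked hypotheses then seed the foreground models (step 4) used to segment the object through the video.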
Key-segment discovery
Output segmentation
Shape model, color model
Results: Key-segment video segmentation
76
• Detect and segment people and discovered important objects without category-specific models
• Success in spite of moving camera, bg changes, low resolution
Results: Key-segment video segmentation
Grundmann et al. 2010
Grundmann et al. 2010
Ours Ours
• Resists over-segmentation by detecting regions with “object-like” appearance and motion
Results: Key-segment video segmentation
[29]: Tsai et al. BMVC 2010, [7]: Chockalingam et al. ICCV 2009
• Background subtraction falls apart
• Ours produces state-of-the-art results even when compared to supervised methods
Segmentation error rate
Unsupervised visual discovery
Visual world
Storyboard visual summary
1:00 pm 2:00 pm 3:00 pm 4:00 pm
Mining first-person camera data
• Wearable cameras: GoPro, Google Glass, Looxcie, Pivothead, Tobii, SMI
• Steve Mann's life logger (1990s)
Problem: Summarizing egocentric videos
Output: Storyboard summary of discovered important people and objects
9:00 am 10:00 am 11:00 am 12:00 pm 1:00 pm 2:00 pm
Wearable camera
Input: Egocentric video of the camera wearer’s day
[Lee, Ghosh, Grauman, Egocentric video summarization, CVPR 2012]
Important person/object discovery
• Discover important people and objects for egocentric video summarization
– Important: things with which the camera wearer has significant interaction
Data collection
• 15 fps, 320 × 480 resolution
• 10 videos, 3-5 hrs in length; 37 hrs total
• Four subjects: one undergraduate, two grad students, and one office worker
Learning region importance
Pipeline: segment video into events → collect training data → learn importance → discover important regions → storyboard summary
• Egocentric features: distance to hand, distance to frame center, frequency
• Region features: size, width, height, centroid
• Object features: candidate region's appearance and motion; surrounding area's appearance and motion; object-like appearance and motion; overlap with face detection
Learning region importance
• Regressor to learn and predict a region's degree of importance
• Expect significant interactions between the features; e.g., a region near the hand is important only if it is object-like in appearance
• For training: learn parameters relating the feature values xi(r) to importance I(r)
• For testing: predict I(r) given the xi(r)'s
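The training/testing bullets can be sketched with a least-squares regressor over interaction-augmented features, so products like (distance to hand) × (object-likeness) are available to the model. The specific regressor and feature layout are my assumptions, not the paper's exact formulation:

```python
import numpy as np

def interaction_features(x):
    """Augment raw egocentric features (e.g., distance to hand, distance
    to frame center, frequency) with pairwise products, so the regressor
    can capture interactions such as 'near the hand AND object-like'."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i + 1, len(x))]
    return np.concatenate([[1.0], x, pairs])  # bias + linear + interactions

def fit_importance(X, y):
    """Least-squares fit of importance I(r) on augmented features
    (the choice of regressor is an assumption of this sketch)."""
    A = np.stack([interaction_features(x) for x in X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_importance(w, x):
    """Predict I(r) for a new region from its feature values."""
    return float(interaction_features(x) @ w)
```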
Methods: Ours; Object-like [Carreira, 2010]; Object-like [Endres, 2010]; Saliency [Walther, 2006]
Results: Important region prediction
Good predictions
Failure cases
Generating a storyboard summary
• Display event boundaries and frames of the selected important people and objects
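Storyboard assembly can be sketched as selecting, per event, the frames whose discovered regions score highest under the learned importance predictor; the data layout and the choice of k are assumptions of this sketch:

```python
def storyboard(events, importance, k=3):
    """Sketch of storyboard generation: for each temporal event, keep the
    k frames whose discovered regions score highest under the learned
    importance model, displayed in order with event boundaries.
    events: list of events, each a list of (frame_id, region_features).
    importance: learned predictor mapping region features to a score."""
    board = []
    for event in events:
        ranked = sorted(event, key=lambda fr: importance(fr[1]), reverse=True)
        board.append([frame_id for frame_id, _ in ranked[:k]])
    return board
```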
Original video (3 hours) → our summary (12 frames)
Results: Egocentric video summarization
Fine-grained recognition
Coming up
• Sign up for papers
• Next class:
– Object Recognition from Local Scale-Invariant Features. D. Lowe. ICCV 1999.
– Video Google: A Text Retrieval Approach to Object Matching in Videos. J. Sivic and A. Zisserman. ICCV 2003.
• Read both papers
• Write a review for one of them