Visual Element Discovery as Discriminative Mode Seeking

Carl Doersch (CMU), Abhinav Gupta (CMU), Alexei A. Efros (UCB)

The need for mid-level representations

6 billion images 70 billion images 1 billion images served daily

10 billion images

60 hours uploaded per minute

Almost 90% of web traffic is visual!

Discriminative patches

• Visual words are too simple

• Objects are too difficult

• Something in the middle?

(Felzenszwalb et al. 2008)

(Singh et al. 2012)

Mid-level “Visual Elements”

• Simple enough to be detected easily
• Complex enough to be meaningful
  – “Meaningful” as measured by weak labels

(Doersch et al. 2012)

(Singh et al. 2012)

Mid-level “Visual Elements”


(Singh et al. 2012)

• Doersch et al. 2012
• Singh et al. 2012
• Jain et al. 2013
• Endres et al. 2013
• Juneja et al. 2013
• Li et al. 2013
• Sun et al. 2013
• Wang et al. 2013
• Fouhey et al. 2013
• Lee et al. 2013

Our goal

• Provide a mathematical optimization formulation for visual element discovery

• Improve the performance of mid-level representations

Elements as Patch Classifiers

What if the labels are weak?

• E.g. image has horse/no-horse
• (Or even weaker, like Paris/not-Paris)

• Idea: Label these all as “horse”

• Problem: 10,000 patches per image, most of which are unclassifiable.

The weaker the label, the bigger the problem.

Task: Learn to classify Paris from Not-Paris

Paris Also Paris

Other approaches

• Latent SVM:
  – Assumes we have one instance per positive image

• Multiple instance learning:
  – Not clear how to define the bags

What if the labels are weak?

• Negatives are negatives, positives might not be positive

• Most of our data can be ignored
• First: how to cluster without clustering everything


(Singh et al. 2012)

Mean shift

Patch distances

Min distance: 2.59e-4

Max distance: 1.22e-4

Input Nearest neighbor

Mean shift

Negative Set (legend: Not Paris, Paris)

Density Ratios (legend: Not Paris, Paris)

Adaptive Bandwidth (legend: Negative, Positive; annotation: Bandwidth)

Discriminative Mode Seeking

• Find local optima of an estimate of the density ratio

• Allow an adaptive bandwidth
• Be extremely fast
  – Minimize the number of passes through the data


• Mean shift: maximize (w.r.t. w)

[Equation not recovered from the slide. Its annotated terms: centroid (w), patch feature, bandwidth (b), distance]
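The slide's equation did not survive extraction, so the following is only a minimal sketch of a mean-shift objective built from the annotated terms (centroid w, patch features, bandwidth b, distance). It assumes a hinge-shaped kernel, max(b - d(x, w), 0), and a squared Euclidean distance d. Both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def mean_shift_objective(w, X, b):
    """Sum over patches of max(b - d(x_i, w), 0).

    Assumption: d is the squared Euclidean distance, so only patches
    closer than the bandwidth b contribute to the sum.
    """
    d = ((X - w) ** 2).sum(axis=1)
    return np.maximum(b - d, 0.0).sum()

def mean_shift_step(w, X, b):
    """One mean-shift update: move w to the mean of the patches inside the window.

    With the kernel assumed above, this step never decreases the objective.
    If no patch falls inside the window, w is returned unchanged.
    """
    d = ((X - w) ** 2).sum(axis=1)
    inside = d < b
    return X[inside].mean(axis=0) if inside.any() else w
```

Repeating `mean_shift_step` until w stops moving gives a local mode of the assumed density estimate; patches far from w never influence the result.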


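B(w) is the value of b satisfying:

[Equation not recovered from the slide]

[Equation not recovered from the slide: an objective to optimize, subject to (s.t.) a constraint]

• Distance metric: Normalized Correlation

[Equation not recovered from the slide: an objective to optimize, subject to (s.t.) a constraint]

[Figure legend: Negative, Positive; labeled vector: w]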
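A rough sketch of how an adaptive bandwidth could be computed under a normalized-correlation metric. It assumes the hinge form max(w·x - b, 0), which is suggested by the wᵀx - b > 0 condition on the optimization slide, and it assumes B(w) is the b at which the summed hinge response over negative patches equals a constant, here called `beta`. The constant, the function names, and the exact constraint are assumptions, not the slide's equations.

```python
import numpy as np

def normalize_patches(X):
    """Zero-mean, unit-norm rows, so a dot product acts as a normalized correlation.

    Assumes no row is constant (a constant row would have zero norm).
    """
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def solve_bandwidth(neg_scores, beta):
    """Return b such that sum(max(s - b, 0) for s in neg_scores) == beta.

    The left-hand side is piecewise linear and decreasing in b, so the
    solution is found by scanning the scores in descending order.
    """
    s = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    if beta <= 0 or len(s) == 0:
        raise ValueError("beta must be positive and neg_scores non-empty")
    csum = np.cumsum(s)
    for k in range(1, len(s) + 1):
        b = (csum[k - 1] - beta) / k           # b if exactly the top k scores are active
        next_score = s[k] if k < len(s) else -np.inf
        if b >= next_score:                    # the (k+1)-th score is inactive: consistent
            return b

def element_objective(w, X_pos, X_neg, beta):
    """Summed hinge response on positives, with b set from the negatives."""
    b = solve_bandwidth(X_neg @ w, beta)
    return np.maximum(X_pos @ w - b, 0.0).sum(), b
```

Because b is fixed by the negatives alone, a w that fires on many negative patches is forced to a tighter window, which is one way to read "adaptive bandwidth".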


Optimization

• Initialization is straightforward
• For each element, just keep around 500 patches where w^T x - b > 0
• Trivially parallelizable in MapReduce
• Optimization is piecewise quadratic
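The "keep around 500 patches" step can be sketched as a simple filter. The slide does not say what happens when more than 500 patches satisfy w^T x - b > 0; keeping the highest-scoring ones is an assumption here, and `active_patches` and `max_keep` are hypothetical names.

```python
import numpy as np

def active_patches(w, b, X, max_keep=500):
    """Indices of patches with w.x - b > 0, capped at max_keep.

    Assumption: when more than max_keep patches are active, the ones
    with the largest scores are retained.
    """
    scores = X @ w - b
    idx = np.flatnonzero(scores > 0)
    if len(idx) > max_keep:
        order = np.argsort(scores[idx])[::-1][:max_keep]
        idx = idx[order]
    return idx
```

Since each element only needs its own small active set, one pass over the data can update every element independently, which fits the MapReduce remark.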

Evaluation via Purity-Coverage Plot

• Analogous to Precision-Recall Plot

[Figure: Elements 1 to 5 shown for two cases, "Low Purity" and "High purity, Low Coverage"]

[Figure: Purity-Coverage Curve. x-axis: Coverage (x1e4 pixels), 0 to 10; y-axis: Purity, 0 to 1; legend: Paris, Not Paris]

• Coverage for multiple elements is simply the union.
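A minimal sketch of how such a curve could be traced, by analogy with a precision-recall sweep. It assumes purity is the fraction of detections so far that come from positive images, and coverage is the number of distinct positive-image pixels in the union of those detections. The input layout and the function name are hypothetical.

```python
def purity_coverage_curve(detections):
    """Trace (coverage, purity) points while lowering the score threshold.

    detections: iterable of (score, is_positive, pixels), where pixels is a
    set of globally unique pixel ids, e.g. (image_id, row, col) tuples.
    Detections from several elements can be pooled; their coverage is the union.
    """
    covered = set()
    n_positive = 0
    curve = []
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, is_positive, pixels) in enumerate(ranked, start=1):
        if is_positive:
            n_positive += 1
            covered |= pixels          # overlapping pixels count only once
        curve.append((len(covered), n_positive / rank))
    return curve
```

Using a set union means two elements that fire on the same region do not inflate coverage, so redundant elements gain nothing on this plot.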

Purity-Coverage

[Figure: two purity-coverage plots, "Top 25 Elements" and "Top 200 Elements". x-axis: Coverage (fraction of positive dataset); y-axis: Purity, 0.8 to 1]

Legend:
• This work
• This work, no inter-element
• SVM Retrained 5x (Doersch et al. 2012)
• LDA Retrained 5x
• LDA Retrained
• Exemplar LDA (Hariharan et al. 2012)

Results on Indoor 67 Scenes

Kitchen Grocery Bowling

Elevator Bakery Bathroom

Results on Indoor 67 Scenes

Method                               Accuracy (%)
ROI+Gist (Quattoni et al.)           26.05
MM-Scene (Zhu et al.)                28.00
Scene-DPM (Pandey et al.)            30.40
CENTRIST (Wu et al.)                 36.90
Object Bank (Li et al.)              37.60
RBoW (Parizi et al.)                 37.93
Discr. Patches (Singh et al.)        38.10
Latent Pyramid. (Sadeghi et al.)     44.84
Bag of Parts (Juneja et al.)         46.10
miSVM (Li et al.)                    46.40
D. Patches (full) (Singh et al.)     49.40
MMDL (Wang et al.)                   50.15
Discr. Parts (Sun et al.)            51.40
IFV (Juneja et al.)                  60.77
Bag of Parts+IFV (Juneja et al.)     63.10
Ours (no inter-element)              63.36
Ours                                 64.03
Ours+IFV                             66.87

Qualitative Indoor67 Results

Indoor67: Error Analysis

Ground Truth (GT): deli; Guess: grocery store
GT: corridor; Guess: staircase
GT: laundromat; Guess: closet
GT: museum; Guess: garage

Thank you!

More results at http://graphics.cs.cmu.edu/projects/discriminativeModeSeeking/

Paris Elements • Indoor 67 Elements • Indoor 67 Heatmaps • Source code (soon)

Some New Paris Elements
