Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual
Classifiers
Abhinav Gupta and Larry S. Davis, University of Maryland, College Park
Proceedings of ECCV 2008
Presented by: Debaleena Chattopadhyay
Presentation Outline
- The Problem Definition
- The Novelty
- The Problem Solution
- The Results
The Problem Definition
To learn visual classifiers for object recognition from weakly labeled data
Labels: city, mountain, sky, sun
Input: an image annotated only with the labels above
Expected Output: image regions labeled city, mountain, sky, sun
Novelty
To learn visual classifiers for object recognition from weakly labeled data utilizing additional language
constructs
Labels:
(Nouns) city, mountain, sky, sun
(Relations) below(mountain, sky), below(mountain, sun), above(sky, city), above(sun, city), brighter(sun, mountain), brighter(sun, city), behind(mountain, city), convex(sun, city), in(sun, sky), smaller(sun, sky)
Input: an image annotated with the nouns and relations above
Expected Output: image regions labeled city, mountain, sky, sun
Related Work
Some previous works:
• Learn classifiers for visual attributes from a training dataset of positive and negative images using a generative model [Ferrari et al.]
• Learn adjectives and nouns in two steps (adjectives in the first step, nouns in the second) using a latent model [Barnard et al.]
Some later works:
• Mining Discriminative Adjectives and Prepositions for Natural Scene Recognition [Fei-Fei Li et al., CVPR 2009]
• Joint learning of visual attributes, object classes and visual saliency [Forsyth et al., ICCV 2009]
Overview
Nouns: SEA, SUN, SKY
Pairs of nouns: (SEA, SUN), (SEA, SKY), (SKY, SEA), (SKY, SUN), (SUN, SKY), (SUN, SEA)
Relationships: in, above, below
Proposed Algorithm
• Dataset: training set annotated with nouns and binary relationships (prepositions and comparative adjectives)
• Algorithm:
o Each image is segmented into a set of regions.
o Each region is represented by a set of features.
o Classifiers for nouns are based on these features (CA).
o Classifiers for relationships are based on differential features extracted from pairs of regions (CR).
o An EM approach is used to learn the noun and relationship models simultaneously:
E-step: update the assignments of nouns to image regions, given CA and CR.
M-step: update the model parameters (CA and CR), given the updated assignments.
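The alternation between the two steps can be sketched in a toy form. This is a hedged illustration, not the paper's implementation: noun models are reduced to 1-D Gaussian means over a single region feature, and the relationship classifier CR is omitted for brevity; all names and values are illustrative.

```python
def e_step(images, means):
    """E-step sketch: assign each annotated noun to the region that best fits it."""
    assignments = []
    for labels, regions in images:
        per_image = {}
        for noun in labels:
            # pick the region with the highest Gaussian likelihood
            # (equivalently, the smallest squared distance to the noun mean)
            per_image[noun] = max(
                range(len(regions)),
                key=lambda j: -((regions[j] - means[noun]) ** 2),
            )
        assignments.append(per_image)
    return assignments

def m_step(images, assignments, nouns):
    """M-step sketch: refit each noun model from its currently assigned regions."""
    means = {}
    for noun in nouns:
        vals = [regions[a[noun]]
                for (labels, regions), a in zip(images, assignments)
                if noun in labels]
        means[noun] = sum(vals) / len(vals)
    return means

# Toy weakly labeled data: each image is (labels, per-region feature values).
images = [({"sky", "sea"}, [0.9, 0.1]),
          ({"sky", "sea"}, [0.2, 0.8])]
nouns = {"sky", "sea"}
means = {"sky": 0.6, "sea": 0.4}   # rough initialization
for _ in range(5):                 # alternate E- and M-steps
    assignments = e_step(images, means)
    means = m_step(images, assignments, nouns)
```

Even in this stripped-down form, the loop resolves the correspondence ambiguity: "sky" locks onto the bright regions and "sea" onto the dark ones, without any per-region supervision.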
The Generative Model
[Graphical model for image annotation: nouns ns and np generate the region features Ij and Ik through the noun classifiers CA; the relationship r generates the differential features Ijk through the relationship classifier CR.]
Learning the Model
EM-approach: Simultaneously solve for the correspondence problem and learn the parameters of classifiers (noun and relationship)
E-step: Compute the noun assignments using the parameters from the previous iteration:

P(noun l assigned to region j) = P(A_j^l | I, C_A^old, C_R^old) / Σ_k P(A_k^l | I, C_A^old, C_R^old)

where

P(A^l | I, C_A^old, C_R^old) ∝ P(A^l | I, C_A^old) · P(A^l | I, C_R^old)
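The normalized assignment in the E-step is simple to sketch: combine the noun-classifier and relationship-classifier scores per region, then normalize over regions. The function name and toy scores below are illustrative.

```python
def assignment_posterior(p_noun, p_rel):
    """Normalized probability that a noun maps to each region j,
    combining per-region noun-classifier scores (CA) with
    relationship-classifier scores (CR), as in the E-step."""
    scores = [a * r for a, r in zip(p_noun, p_rel)]
    total = sum(scores)
    return [s / total for s in scores]

# With the noun classifier undecided, the relationship scores break the tie.
post = assignment_posterior([0.5, 0.5], [0.8, 0.2])
```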
Learning the Model
M-step: Update the model parameters (CA and CR) given the updated assignments from the E-step. The maximum-likelihood parameters depend upon the classifier used.

To utilize contextual information when labeling test images, priors on relationships, P(r | ns, np), are also learned from a co-occurrence table after the relationship annotations are generated.
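Estimating P(r | ns, np) from a co-occurrence table amounts to counting how often each relationship occurs for a given noun pair and normalizing. A hedged sketch, with a hypothetical triple format (relationship, subject noun, object noun):

```python
from collections import Counter

def relationship_priors(annotations):
    """Estimate P(r | ns, np) from generated relationship annotations
    by normalizing a co-occurrence count table per noun pair."""
    counts = Counter()
    pair_totals = Counter()
    for r, ns, np_ in annotations:
        counts[(r, ns, np_)] += 1
        pair_totals[(ns, np_)] += 1
    return {key: counts[key] / pair_totals[(key[1], key[2])]
            for key in counts}

# Toy annotations: (sky, sea) appears three times, twice as "above".
priors = relationship_priors([
    ("above", "sky", "sea"),
    ("above", "sky", "sea"),
    ("brighter", "sky", "sea"),
    ("in", "sun", "sky"),
])
```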
Inference - Labeling
• Test images are divided into regions. Region j is associated with features Ij and a noun nj.
• We observe Ij and must estimate nj.
• The labeling problem is constrained by priors on relationships between pairs of nouns.
• A Bayesian network represents the labeling problem, and belief propagation is used for inference.
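For a small number of regions, the effect of the relationship priors can be shown with brute-force enumeration instead of belief propagation; this is a stand-in for the slides' inference procedure, not the paper's method, and all names and scores are illustrative.

```python
from itertools import permutations

def label_regions(region_scores, rel_prior, observed_rel):
    """Choose the noun-per-region assignment maximizing the product of
    per-region noun likelihoods and the prior on the observed
    relationship between regions 0 and 1. Brute-force stand-in for
    belief propagation; only feasible for tiny label sets."""
    nouns = list(region_scores[0])
    best, best_score = None, -1.0
    for labels in permutations(nouns, len(region_scores)):
        score = 1.0
        for j, noun in enumerate(labels):
            score *= region_scores[j][noun]
        score *= rel_prior.get((observed_rel, labels[0], labels[1]), 0.0)
        if score > best_score:
            best, best_score = labels, score
    return best

# The noun classifiers alone cannot decide, but the prior on "above" can.
region_scores = [{"sky": 0.5, "sea": 0.5}, {"sky": 0.5, "sea": 0.5}]
rel_prior = {("above", "sky", "sea"): 0.9, ("above", "sea", "sky"): 0.1}
labels = label_regions(region_scores, rel_prior, "above")
```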
Experimental Results
Dataset:
• Subset of the Corel5k training and test dataset
• For training, 850 images with nouns and hand-labelled relationships between a subset of the pairs of nouns
• Nearest-neighbor and Gaussian-classifier based likelihood models for nouns
• Decision-stump based likelihood model for relationships
• 173 nouns
• 19 relationships: above, behind, below, beside, more textured, brighter, in, greener, larger, left, near, far from, on top of, more blue, right, similar, smaller, taller, shorter
• Image features used (30): area, x, y, boundary/area, convexity, moment of inertia, RGB (3), RGB stdev (3), L*a*b (3), L*a*b stdev (3), mean oriented energy in 30-degree increments (12)
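The decision-stump relationship classifiers operate on differential features, i.e. differences of per-region features across a pair. A hedged sketch with made-up feature names (y-centroid, area, brightness) standing in for the 30 features listed above:

```python
def differential_features(region_a, region_b):
    """Differential features for a region pair: per-feature differences
    that relationship classifiers such as above/below, larger/smaller,
    or brighter could threshold on. Feature names are illustrative."""
    return {
        "dy": region_a["y"] - region_b["y"],                 # above/below cue
        "darea": region_a["area"] - region_b["area"],        # larger/smaller cue
        "dbright": region_a["bright"] - region_b["bright"],  # brighter cue
    }

def stump(feature, threshold=0.0):
    """A decision stump: a single threshold on one differential feature."""
    return lambda d: d[feature] > threshold

# Toy regions (y grows downward, so the sun sits above the sea).
sun = {"y": 0.2, "area": 0.05, "bright": 0.95}
sea = {"y": 0.8, "area": 0.40, "bright": 0.30}
d = differential_features(sun, sea)
brighter = stump("dbright")
```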
Experimental Results
Resolution of Correspondence Ambiguities
• On 150 images randomly sampled from the training dataset
• Compared with human labeling
• Performance measures:
- Range of semantics identified: both algorithms give similar performance (left)
- Frequency correct: the latter algorithm identifies nouns correctly more often (right)
[Charts compare three models - nouns only, nouns & relationships (learned), nouns & relationships (human) - for the proposed EM algorithm bootstrapped by IBM Model 1 and bootstrapped by Duygulu et al.]
Experimental Results: Reducing Correspondence Ambiguity
[Side-by-side example labelings: Duygulu et al. vs. Beyond Nouns]
Experimental Results: Labeling New Images
• Dataset: subset of 500 images from the Corel5k dataset, selected randomly from images annotated with words present in the learned vocabulary
• Performance measures:
- Missed labels (left): compute St \ Sg, where St = the set of annotations provided by the Corel dataset and Sg = the set of annotations generated by the algorithm. Using the proposed Bayesian model, missed labels decrease by 24% (IBM Model 1) and 17% (Duygulu et al.)
- False labels (right): compared with human observers
Experimental Results: Image Labeling with the Constrained Bayesian Model
[Side-by-side example labelings: Duygulu et al. vs. Beyond Nouns]
Experimental Results
Precision-Recall:
• Precision ratio: the ratio of the number of images correctly annotated with a word to the number of images the algorithm annotated with that word (with respect to human observers).
• Recall ratio: the ratio of the number of images correctly annotated with a word by the algorithm to the number of images that should have been annotated with that word (with respect to the Corel annotations).
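The two ratios reduce to simple counts per word. A minimal sketch, with hypothetical counts for one vocabulary word:

```python
def precision_recall(correct, predicted, relevant):
    """Precision = correct annotations / annotations the algorithm made;
    recall = correct annotations / images that should carry the word,
    matching the ratio definitions above."""
    return correct / predicted, correct / relevant

# Hypothetical counts for one word: 8 correct out of 10 predicted,
# with 16 images that should have carried the word.
p, r = precision_recall(correct=8, predicted=10, relevant=16)
```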
Conclusion
• Most approaches to learning visual classifiers from weakly labeled data use a "bag of nouns" model and try to find correspondences using co-occurrence of image features and nouns. However, correspondence ambiguity remains.
• This paper proposes an EM-based method to simultaneously learn visual classifiers for nouns, prepositions, and comparative adjectives.
• Experimental results show that using relationship words helps reduce correspondence ambiguity, and that the constrained Bayesian model leads to better labeling performance.
Thank you