Unsupervised discovery of visual object class hierarchies Josef Sivic (INRIA / ENS), Bryan Russell (MIT), Andrew Zisserman (Oxford), Alyosha Efros (CMU) and Bill Freeman (MIT)

Post on 21-Dec-2015



Page 2

Levels of supervision for training object category models

• None? Images only

[Agarwal & Roth, Leibe & Schiele, Torralba et al., Shotton et al.]

[Barnard et al.] [Csurka et al., Dorko & Schmid, Fergus et al., Opelt et al.]

• Object label + segmentation

• Object label only

[Viola & Jones]

Can we learn about objects just by looking at images?

Page 3

Goal: Given a collection of unlabelled images, discover a hierarchy of visual object categories

Which images contain the same object(s)?

Where is the object in the image?

Organize objects into a visual hierarchy (tree).

Page 4

Review: Object discovery in the visual domain

I. Represent an image as a bag of visual words

II. Apply topic discovery methods to find objects in the corpus of images

Decompose the image collection into objects common to all images and mixture coefficients specific to each image

Hofmann: Probabilistic Latent Semantic Analysis

Blei et al.: Latent Dirichlet Allocation

[Sivic, Russell, Efros, Freeman, Zisserman, ICCV’05]

Page 5

Topic discovery models

‘Flat’ topic structure – all topics are ‘available’ to all documents

d … documents (images)

w … visual words

z … topics (‘objects’)

Probabilistic Latent Semantic Analysis (pLSA) [Hofmann’99]

M documents

N words per document
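
In pLSA each image's word distribution is a mixture of topics, P(w|d) = Σ_z P(w|z) P(z|d). A minimal pure-Python sketch of fitting such a model by EM follows; the function name `plsa_em`, the iteration count, the random initialization and the toy count matrix are illustrative assumptions, not details from the talk.

```python
import random

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. counts[d][w] = occurrences of word w in document d."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        s = sum(row)
        return [x / s for x in row]

    # Random (assumed) initialization of P(w|z) and P(z|d).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_topics)])
                # M-step accumulators: expected counts per topic.
                for z in range(n_topics):
                    new_w_z[z][w] += counts[d][w] * post[z]
                    new_z_d[d][z] += counts[d][w] * post[z]
        p_w_z = [normalize(r) for r in new_w_z]
        p_z_d = [normalize(r) for r in new_z_d]
    return p_w_z, p_z_d

# Toy corpus: 4 "images" over a 4-word vocabulary.
counts = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 7, 3], [0, 0, 4, 5]]
p_w_z, p_z_d = plsa_em(counts, n_topics=2)
```

Each row of `p_z_d` holds the mixture coefficients specific to one image, and each row of `p_w_z` is one topic shared by all images.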

Page 6

Hierarchical topic models

• Topics organized in a tree

• Document is a superposition of topics along a single path

• Topics at internal nodes are shared by two or more paths

• The hope is that more specialized topics emerge as we descend the tree

c … paths

z … levels

[Hofmann’99, Blei et al.’04, Barnard et al.’01]

Page 10

Hierarchical topic models

d … documents (images)

w … words

z … levels of the tree

c … paths in the tree

For each document:

Introduce a hidden variable c indicating the path in the tree

[Hofmann’99, Blei et al.’04, Barnard et al.’01]
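
The slide's equation did not survive in this transcript. One way to write the resulting word distribution, assuming the same factors P(z|d) and P(w|z,c) that the next slide names, plus a per-document path distribution P(c|d) and a tree of depth L, is to mix over paths and then over levels along the chosen path:

```latex
P(w \mid d) \;=\; \sum_{c} P(c \mid d) \sum_{z=1}^{L} P(w \mid z, c)\, P(z \mid d)
```

Here the inner sum is the superposition of topics along a single path, and the outer sum marginalizes over the hidden path variable c. This is a sketch under the stated assumptions, not a formula recovered from the slide.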

Page 11

Hierarchical Latent Dirichlet Allocation (hLDA)

d … documents (images)

w … words

z … levels of the tree

c … paths in the tree

Treat P(z|d) and P(w|z,c) as random variables sampled from Dirichlet priors:

[Blei et al. ’2004]
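
To make the generative view concrete, here is a small sketch of drawing one document along a fixed path. It assumes symmetric Dirichlet hyperparameters, a toy vocabulary of 6 words and a depth of 3; the helper names (`sample_dirichlet`, `generate_document`) and all numeric values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(path_topics, alpha, n_words, rng):
    """path_topics[l] plays the role of P(w | z=l, c) for the level-l node on the path."""
    theta = sample_dirichlet(alpha, rng)             # P(z|d): weights over levels
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)           # choose a level on the path
        w = sample_categorical(path_topics[z], rng)  # choose a word from that node's topic
        words.append((z, w))
    return theta, words

rng = random.Random(0)
V, L = 6, 3                       # assumed vocabulary size and tree depth
eta = [0.5] * V                   # assumed Dirichlet prior on topics
path_topics = [sample_dirichlet(eta, rng) for _ in range(L)]
theta, words = generate_document(path_topics, alpha=[1.0] * L, n_words=20, rng=rng)
```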

Page 12

Hierarchical Latent Dirichlet Allocation (hLDA)

d … documents (images)

w … words

z … levels of the tree

c … paths in the tree

[Blei et al. ’2004]

The tree structure is not fixed: the assignments of documents to paths, c_j, are sampled from the nested Chinese restaurant process (nCRP) prior

Page 13

Nested Chinese restaurant process (nCRP) [Blei et al.’04]

CRP: customers sit in a restaurant with an unlimited number of tables

Nested CRP: extension of CRP to tree structures

• Prior on assignments of documents to paths in the tree (of fixed depth L)

• Each internal node corresponds to a CRP, each table points to a child node

Example: Tree of depth 3 with 4 documents

[Figure: tree with root A, second-level nodes B and C, and leaves D, E, F; node labels list the documents passing through each node: 1,2,3,4 at the root; 1,2,3 and 4 at the second level; 1,2, 3 and 4 at the leaves]

Sample path for the 5th document (5th customer arriving)
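
The path prior can be sketched in a few lines. At every internal node a standard CRP chooses an existing child with probability proportional to the number of documents already passing through it, or a new child with probability proportional to a concentration parameter. The dictionary-based tree, the function names and the value `gamma=1.0` are assumptions for illustration.

```python
import random

def crp_choose(child_counts, gamma, rng):
    """Return an existing child index i with prob n_i / (n + gamma),
    or len(child_counts) (a new child) with prob gamma / (n + gamma)."""
    n = sum(child_counts)
    r = rng.random() * (n + gamma)
    acc = 0.0
    for i, c in enumerate(child_counts):
        acc += c
        if r < acc:
            return i
    return len(child_counts)

def sample_path(root, depth, gamma, rng):
    """Walk from the root to a leaf, running one CRP per internal node.
    Document counts are updated in place; new branches are created on demand."""
    node, path = root, [root]
    root["count"] += 1
    for _ in range(depth - 1):
        k = crp_choose([ch["count"] for ch in node["children"]], gamma, rng)
        if k == len(node["children"]):
            node["children"].append({"count": 0, "children": []})
        node = node["children"][k]
        node["count"] += 1
        path.append(node)
    return path

rng = random.Random(0)
root = {"count": 0, "children": []}
# Seat five "customers" (documents) in a tree of depth 3.
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(5)]
```

Because popular branches attract more documents, the prior favours reusing paths while still allowing new ones to appear as documents arrive.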

Page 14

hLDA model fitting

Use a Gibbs sampler to draw samples from P(z, c, T | w)

c … paths

z … levels

For a given document j:

• sample z_j while keeping c_j fixed (LDA along one path)

• sample c_j while keeping z_j fixed (can delete/create branches)
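
For the first step (levels resampled with the path fixed), a collapsed Gibbs update for a single word has the familiar LDA form: a document term times a topic term. The sketch below assumes symmetric hyperparameters `alpha` and `eta` and a collapsed sampler; the talk does not state these details, and the function and argument names are hypothetical.

```python
def level_conditional(doc_level_counts, path_word_counts, path_totals,
                      w, alpha, eta, vocab_size):
    """P(z = l | everything else) for one word w of a document with a fixed path.

    doc_level_counts[l]    : words of this document currently assigned to level l
    path_word_counts[l][w] : count of word w at the level-l node of the path
    path_totals[l]         : total number of words at that node
    All counts must exclude the word being resampled.
    """
    weights = []
    for l in range(len(doc_level_counts)):
        doc_term = doc_level_counts[l] + alpha
        word_term = ((path_word_counts[l].get(w, 0) + eta)
                     / (path_totals[l] + vocab_size * eta))
        weights.append(doc_term * word_term)
    total = sum(weights)
    return [x / total for x in weights]

# Toy counts for a 3-level path and a 100-word vocabulary.
probs = level_conditional(
    doc_level_counts=[3, 1, 0],
    path_word_counts=[{7: 10}, {7: 1}, {}],
    path_totals=[50, 20, 5],
    w=7, alpha=1.0, eta=0.1, vocab_size=100,
)
```

A new level for the word is then drawn from `probs`, and the counts are updated before moving on to the next word.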

Page 15

Image representation – ‘dense’ visual words

Extract circular regions on a regular grid, at multiple scales

Represent each region by a SIFT descriptor

Cf. [Agarwal and Triggs’05, Bosch and Zisserman’06]
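
A sketch of the dense sampling step: circle centres are placed on a regular grid, once per scale. The radii, the 50% overlap between neighbouring circles and the function name are assumptions; the descriptor computation itself is left out.

```python
def dense_regions(width, height, radii, overlap=0.5):
    """Return (x, y, r) circles on a regular grid, one grid per radius.
    Grid spacing is 2 * r * (1 - overlap); circles stay inside the image."""
    regions = []
    for r in radii:
        step = max(1, int(round(2 * r * (1 - overlap))))
        for y in range(r, height - r + 1, step):
            for x in range(r, width - r + 1, step):
                regions.append((x, y, r))
    return regions

# Two assumed scales on a 64x48 image.
regions = dense_regions(width=64, height=48, radii=[8, 16])
```

Each returned circle would then be described by one SIFT descriptor.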

Page 16

Build a visual vocabulary

Quantize descriptors using k-means

K = 10 + 1

K = 100 + 1

Visualization by ‘average’ words from the training set (single scale)
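
Vocabulary building can be sketched with plain Lloyd's k-means: cluster the descriptors, then map every descriptor to its nearest centre. The toy 2-D "descriptors", the seed and the function names are illustrative assumptions; real SIFT descriptors are 128-dimensional and the vocabularies here have 10 or 100 centres.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, v):
    """Index of the centre closest to v, i.e. the visual word of v."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], v))

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Lloyd's algorithm with centres initialized from k random descriptors."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(descriptors, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in descriptors:
            groups[nearest(centers, v)].append(v)
        for j, g in enumerate(groups):
            if g:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

descriptors = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
               (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
centers = kmeans(descriptors, k=2)
words = [nearest(centers, v) for v in descriptors]  # one visual word per descriptor
```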

Page 17

Vocabulary with varying degree of spatial and appearance granularity

Vocabulary   Appearance granularity   Spatial granularity
V1           K1 = 11                  Bag of words
V2           K2 = 101                 Bag of words
V3           K3 = 101                 3x3 grid
V4           K4 = 101                 5x5 grid

Combined vocabulary: K = 11 + 101 + 909 + 2,525 = 3,546 visual words

Cf. Fergus et al.’05, Lazebnik et al.’06
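
The vocabulary size follows from multiplying each appearance vocabulary by its number of grid cells: 101 × 9 = 909 and 101 × 25 = 2,525. The sketch below reproduces the count and shows one possible way to index a word in the combined vocabulary; the index layout (offset, then grid cell, then appearance word) is an assumption, as are the function names.

```python
# (appearance words, grid side) for V1..V4; a 1x1 grid is a plain bag of words.
VOCABS = [(11, 1), (101, 1), (101, 3), (101, 5)]

def vocab_sizes():
    """Number of words contributed by each vocabulary."""
    return [k * g * g for k, g in VOCABS]

def combined_index(v, appearance_word, x, y, width, height):
    """Index in the combined vocabulary for a region centred at (x, y),
    using vocabulary v (0-based) and its appearance word."""
    k, g = VOCABS[v]
    col = min(int(x * g / width), g - 1)
    row = min(int(y * g / height), g - 1)
    cell = row * g + col
    offset = sum(vocab_sizes()[:v])
    return offset + cell * k + appearance_word

sizes = vocab_sizes()   # [11, 101, 909, 2525]
total = sum(sizes)      # 3546
```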

Page 18

Example I. – cropped LabelMe images

• 125 images, 5 object classes: cars side, cars rear, switches, traffic lights, computer screens

• Images cropped to contain mostly the object, and normalized for scale

Page 19

Example I. – cropped LabelMe images

• Learn 4-level tree hierarchy

• Initialization:

• c with a random tree (125 documents) sampled from the nCRP (concentration parameter γ = 1)

• z based on vocabulary granularity

c … paths

z … levels

V1: K1 = 11, bag of words

V2: K2 = 101, bag of words

V3: K3 = 101, 3x3 grid

V4: K4 = 101, 5x5 grid

Page 20

Example I. – cropped LabelMe images

Learnt object hierarchy

Nodes visualized by average images

Example images assigned to different paths

Page 21

Quality of the tree?

For each node t and class i, measure the classification score:

$\rho(i,t) = \frac{|GT_i \cap N_t|}{|GT_i \cup N_t|}$

where $GT_i$ is the set of ground-truth images in class i and $N_t$ is the set of images assigned to a path passing through t.

Good score:

- All images of class i assigned to node t (high recall)

- No images of other classes assigned to t (high precision)

Score for class i:

Page 22

Quality of the tree?

Example: traffic lights, node 2

Page 23

Quality of the tree?

Example: switches, node 9

Page 24

Quality of the tree?

Overall score:
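
The per-class and overall formulas are missing from this transcript. The sketch below computes the node/class score as intersection over union, and then assumes that the score for a class is the best score over all nodes and that the overall score is the mean over classes. Both of those aggregation rules, the function names and the toy data are assumptions.

```python
def node_class_score(gt_images, node_images):
    """Intersection over union of two sets of image ids."""
    union = gt_images | node_images
    return len(gt_images & node_images) / len(union) if union else 0.0

def tree_scores(ground_truth, node_assignments):
    """ground_truth: class -> set of image ids.
    node_assignments: node id -> set of image ids whose path passes through the node.
    Returns {class: (best score, best node)} and the mean of the best scores."""
    per_class = {}
    for cls, gt in ground_truth.items():
        per_class[cls] = max((node_class_score(gt, imgs), node)
                             for node, imgs in node_assignments.items())
    overall = sum(score for score, _ in per_class.values()) / len(per_class)
    return per_class, overall

# Toy example: 5 images, 2 classes, 3 nodes (node 0 is the root).
ground_truth = {"switches": {1, 2, 3}, "traffic lights": {4, 5}}
node_assignments = {0: {1, 2, 3, 4, 5}, 1: {1, 2, 3, 4}, 2: {5}}
per_class, overall = tree_scores(ground_truth, node_assignments)
```

A node scores well for a class only when it captures most images of that class (high recall) and few images of other classes (high precision), so the root, which contains every image, rarely wins.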

Page 25

Example II. – MSRC b1 dataset

240 images, 9 object classes, pixel-wise labelled

Cars

Airplanes

Cows

Buildings

Faces

Grass

Trees

Bicycles

Sky

Page 26

Example II. – MSRC b1 dataset

Experiment 1: Known object mask (manual), unknown class labels

- More objects and images (than Ex. I)

- Measure classification performance

- Compare with the standard ‘flat’ LDA

Experiment 2: Both segmentation and class labels unknown (just images)

- ‘Unsupervised discovery’ scenario

- Employ the ‘multiple segmentations’ framework of [Russell et al.’06]

- Measure segmentation accuracy

Page 27

MSRC b1 dataset – known object mask

Learnt tree visualized by average images; node size indicates the number of images

Some nodes visualized by top 3 images (sorted by KL divergence)
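
Ranking images by KL divergence can be sketched as follows. The slide does not say which two distributions are compared or in which direction, so this example assumes KL(image histogram || node topic) over normalized visual-word histograms; the function names and toy numbers are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions of equal length.
    eps guards against division by zero and log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def top_images(node_topic, image_histograms, n=3):
    """Indices of the n images whose histogram is closest to the node's topic."""
    order = sorted(range(len(image_histograms)),
                   key=lambda i: kl_divergence(image_histograms[i], node_topic))
    return order[:n]

topic = [0.5, 0.3, 0.2]
hists = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2]]
ranked = top_images(topic, hists)
```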

Page 28

MSRC b1 dataset – known object mask

Classification performance: comparison with ‘flat’ LDA

Flat LDA:

Estimate mixing weights for each topic i

Assign each image to a single topic:
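
The assignment rule is not preserved in this transcript. A natural reading, assumed here, is that each image goes to the topic with the largest estimated mixing weight; the function name and the example weights are illustrative.

```python
def assign_to_topic(mixing_weights):
    """Index of the topic with the largest mixing weight for one image."""
    return max(range(len(mixing_weights)), key=lambda i: mixing_weights[i])

# Assumed mixing weights for two images over three topics.
theta = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [assign_to_topic(t) for t in theta]
```

With a hard assignment per image, the flat model can be scored in the same way as the nodes of the learnt tree.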

Page 29

MSRC b1 dataset – unknown object mask and image labels

Page 30

Multiple segmentation approach [Russell et al.’06] (review)

1) Produce multiple segmentations of each image

2) Discover clusters of similar segments (here use hLDA)

3) Score segments by how well they fit an object cluster

[Figure panels: Images; Multiple segmentations; Cars; Buildings]

Page 31

Road/asphalt

Page 32

Conclusions

• Investigated learning visual object hierarchies using hLDA

• The number of topics/objects and the structure of the tree are estimated automatically from the data

• Topic/object hierarchy may improve classification performance