TRANSCRIPT
Perceptual and Sensory Augmented Computing
Visual Object Recognition Tutorial
Visual Object Recognition
Bastian Leibe, Computer Vision Laboratory, ETH Zurich
Kristen Grauman, Department of Computer Sciences, University of Texas at Austin
Chicago, 14.07.2008
K. Grauman, B. Leibe
Outline
1. Detection with Global Appearance & Sliding Windows
2. Local Invariant Features: Detection & Description
3. Specific Object Recognition with Local Features
― Coffee Break ―
4. Visual Words: Indexing, Bags of Words Categorization
5. Matching Local Feature Sets
6. Part-Based Models for Categorization
7. Current Challenges and Research Directions
Global representations: limitations
• Success may rely on alignment -> sensitive to viewpoint
• All parts of the image or window impact the description -> sensitive to occlusion, clutter
Local representations
• Describe component regions or patches separately.
• Many options for detection & description…
Superpixels [Ren et al.]
Shape context [Belongie et al.]
Maximally Stable Extremal Regions [Matas et al.]
Geometric Blur [Berg et al.]
SIFT [Lowe]
Salient regions [Kadir et al.]
Harris-Affine [Schmid et al.]
Spin images [Johnson and Hebert]
Recall: Invariant local features
Subset of local feature types designed to be invariant to:
• Scale
• Translation
• Rotation
• Affine transformations
• Illumination
1) Detect interest points 2) Extract descriptors
Descriptors: [x1, x2, …, xd], [y1, y2, …, yd]
[Mikolajczyk & Schmid, Matas et al., Tuytelaars & Van Gool, Lowe, Kadir et al., …]
Recognition with local feature sets
• Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.
• Now, we’ll use local features for:
– Indexing-based recognition
– Bags of words representations
– Correspondence / matching kernels
Aachen Cathedral
Basic flow
Detect or sample features -> Describe features (list of positions, scales, orientations, with an associated list of d-dimensional descriptors) -> Index each one into pool of descriptors from previously seen images
Indexing local features
• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)
Indexing local features
• When we see close points in feature space, we have similar descriptors, which indicates similar local content.
Figure credit: A. Zisserman
Indexing local features
• We saw in the previous section how to use voting and pose clustering to identify objects using local features
Figure credit: David Lowe
Indexing local features
• With potentially thousands of features per image, and hundreds to millions of images to search, how to efficiently find those that are relevant to a new image?
– Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search
– High-dimensional descriptors: approximate nearest neighbor search methods more practical
– Inverted file indexing schemes
Indexing local features: approximate nearest neighbor search
Best-Bin First (BBF), a variant of k-d trees that uses a priority queue to examine the most promising branches first [Beis & Lowe, CVPR 1997]
Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]
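The LSH idea can be sketched with the random-hyperplane hash family, a common variant for angular/cosine similarity (the original Indyk & Motwani construction uses different hash families); all names and sizes here are illustrative:

```python
import numpy as np

def make_lsh(dim, n_bits, seed=0):
    """Random-hyperplane LSH: the key is the sign pattern of the
    descriptor's projections onto n_bits random directions, so
    descriptors separated by a small angle receive the same key
    with high probability."""
    planes = np.random.default_rng(seed).standard_normal((n_bits, dim))
    return lambda x: tuple((planes @ x > 0).tolist())

h = make_lsh(dim=128, n_bits=16)

# Hash every database descriptor into its bucket.
rng = np.random.default_rng(1)
db = [rng.standard_normal(128) for _ in range(100)]
table = {}
for i, d in enumerate(db):
    table.setdefault(h(d), []).append(i)

# A query descriptor is compared only against its own bucket,
# not against all 100 database descriptors.
query = db[42] + 1e-3 * rng.standard_normal(128)
candidates = table.get(h(query), [])
```

With few bits, dissimilar points collide often; practical systems therefore use several tables with independently drawn hash functions to trade memory for recall.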
Indexing local features: inverted file index
• For text documents, an efficient way to find all pages on which a word occurs is to use an index…
• We want to find all images in which a feature occurs.
• To use this idea, we’ll need to map our features to “visual words”.
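The text-retrieval analogy translates almost directly: each visual word keeps a posting list of the images containing it, and a query touches only the lists for its own words. A minimal sketch (illustrative, not the tutorial's implementation):

```python
from collections import defaultdict

class InvertedIndex:
    """Map each visual word id to the set of images containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_image(self, image_id, word_ids):
        # Record this image under every distinct word it contains.
        for w in set(word_ids):
            self.postings[w].add(image_id)

    def query(self, word_ids):
        """Candidate images sharing at least one word with the query,
        ranked by the number of shared words."""
        votes = defaultdict(int)
        for w in set(word_ids):
            for img in self.postings.get(w, ()):
                votes[img] += 1
        return sorted(votes, key=votes.get, reverse=True)
```

This ranks by raw shared-word counts; real retrieval systems typically weight words by tf-idf before ranking.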
Visual words: main idea
• Extract some local features from a number of images …
e.g., SIFT descriptor space: each point is 128-dimensional
Slide credit: D. Nister
Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space
• Quantize via clustering, let cluster centers be the prototype “words”
Descriptor space
Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space
• Determine which word to assign to each new image region by finding the closest cluster center.
Descriptor space
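Both steps (cluster training descriptors to form the words, then snap each new descriptor to its nearest center) can be sketched with plain k-means; this is a minimal illustration, and real vocabularies use optimized implementations and far larger k:

```python
import numpy as np

def build_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Plain k-means; the returned cluster centers are the visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center for every descriptor.
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def assign_words(descriptors, centers):
    """Quantize descriptors to the id of the closest visual word."""
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    return dists.argmin(axis=1)
```

The pairwise-distance broadcast is O(n·k·d) memory and time per iteration, which is fine for a sketch but motivates the hierarchical schemes discussed later for million-word vocabularies.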
Visual words
• Example: each group of patches belongs to the same visual word
Figure from Sivic & Zisserman, ICCV 2003
Visual words
• First explored for texture and material representations
• Texton = cluster center of filter responses over collection of images
• Describe textures and materials based on distribution of prototypical texture elements.
Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003;
Visual words
• More recently used for describing scenes and objects for the sake of indexing or classification.
Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.
Inverted file index for images comprised of visual words
Image credit: A. Zisserman
Word number
List of image numbers
Bags of visual words
• Summarize entire image based on its distribution (histogram) of word occurrences.
• Analogous to bag of words representation commonly used for documents.
Image credit: Fei-Fei Li
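The bag of words step reduces an image's word ids to a normalized histogram, after which any vector comparison applies. A sketch with made-up word ids, using histogram intersection as one common similarity choice:

```python
import numpy as np

def bag_of_words(word_ids, vocab_size):
    """Normalized histogram of visual word occurrences for one image."""
    hist = np.bincount(np.asarray(word_ids, dtype=int),
                       minlength=vocab_size).astype(float)
    total = hist.sum()
    return hist / total if total else hist

def intersection(h1, h2):
    """Histogram intersection similarity between two word histograms."""
    return float(np.minimum(h1, h2).sum())
```

Because the histogram has the same fixed length for every image, regardless of how many features were detected, it plugs directly into standard vector-based classifiers.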
Video Google System
1. Collect all words within query region
2. Inverted file index to find relevant frames
3. Compare word counts
4. Spatial verification
Sivic & Zisserman, ICCV 2003
• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
Query region
Retrieved frames
Basic flow
Detect or sample features -> Describe features -> Index each one into pool of descriptors from previously seen images, or quantize to form bag of words vector for the image
Visual vocabulary formation
Issues:
• Sampling strategy
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words
Sampling strategies
Image credits: F-F. Li, E. Nowak, J. Sivic
(Figure: dense, uniform sampling; sparse sampling at interest points; random sampling; multiple interest operators.)
• To find specific, textured objects, sparse sampling from interest points often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage.
[See Nowak, Jurie & Triggs, ECCV 2006]
Clustering / quantization methods
• k-means (typical choice), agglomerative clustering, mean-shift,…
• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies
Vocabulary tree [Nister & Stewenius, CVPR 2006]
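The payoff of the tree is in lookup cost: with branch factor b and depth L, quantizing one descriptor takes about b·L distance computations instead of one per leaf word (b^L). A toy, hand-built tree for illustration; real trees come from hierarchical k-means as in [Nister & Stewenius, CVPR'06]:

```python
import numpy as np

def leaf(center):
    return {"center": np.asarray(center, float), "children": []}

def node(center, children):
    return {"center": np.asarray(center, float), "children": children}

def descend(tree, x):
    """Walk down the vocabulary tree: at each node, compare x only to
    that node's few children and recurse into the nearest one.  The
    path of child indices identifies the leaf, i.e. the visual word."""
    path = []
    cur = tree
    while cur["children"]:
        dists = [np.linalg.norm(x - c["center"]) for c in cur["children"]]
        best = int(np.argmin(dists))
        path.append(best)
        cur = cur["children"][best]
    return path

# Branch factor 2, depth 2: four leaf words.
tree = node([0, 0], [
    node([-5, 0], [leaf([-6, -1]), leaf([-4, 1])]),
    node([5, 0], [leaf([4, -1]), leaf([6, 1])]),
])
```

With b = 10 and L = 6 this is 60 comparisons to reach one of a million leaf words, which is what makes the vocabulary sizes in the next slides feasible.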
Example: Recognition with Vocabulary Tree
• Tree construction:
Slide credit: David Nister
[Nister & Stewenius, CVPR’06]
Vocabulary Tree
• Training: Filling the tree
Slide credit: David Nister
[Nister & Stewenius, CVPR’06]
Vocabulary Tree
• Recognition
Slide credit: David Nister
[Nister & Stewenius, CVPR’06]
RANSAC verification
Vocabulary Tree: Performance
• Evaluated on large databases: indexing with up to 1M images
• Online recognition for database of 50,000 CD covers: retrieval in ~1s
• Find experimentally that large vocabularies can be beneficial for recognition
[Nister & Stewenius, CVPR’06]
Vocabulary formation
• Ensembles of trees provide additional robustness
Figure credit: F. Jurie
Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; …
Supervised vocabulary formation
• Recent work considers how to leverage labeled images when constructing the vocabulary
Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.
Supervised vocabulary formation
• Merge words that don’t aid in discriminability
Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005
Supervised vocabulary formation
• Consider vocabulary and classifier construction jointly.
Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.
Learning and recognition with bag of words histograms
• Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples)
• Provides easy way to use distribution of feature types with various learning algorithms requiring vector input.
• …including unsupervised topic models designed for documents.
• Hierarchical Bayesian text models (pLSA and LDA)
– Hofmann 2001; Blei, Ng & Jordan 2003
– For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005
Learning and recognition with bag of words histograms
Figure credit: Fei-Fei Li
• …including unsupervised topic models designed for documents.
Learning and recognition with bag of words histograms
Probabilistic Latent Semantic Analysis (pLSA): each observed word w in a document is generated from a latent topic z (plates in the graphical model: Nd words per document, D documents); a discovered topic can correspond to a category such as “face”.
Sivic et al. ICCV 2005 [pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]
Figure credit: Fei-Fei Li
Bags of words: pros and cons
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ has yielded good recognition results in practice
- basic model ignores geometry; must verify afterwards, or encode via features
- background and foreground mixed when bag covers whole image
- interest points or sampling: no guarantee to capture object-level parts
- optimal vocabulary formation remains unclear
Outline
1. Detection with Global Appearance & Sliding Windows
2. Local Invariant Features: Detection & Description
3. Specific Object Recognition with Local Features
― Coffee Break ―
4. Visual Words: Indexing, Bags of Words Categorization
5. Matching Local Feature Sets
6. Part-Based Models for Categorization
7. Current Challenges and Research Directions
Basic flow
Detect or sample features -> Describe features -> Index each one into pool of descriptors from previously seen images, or compute match with another image, or quantize to form bag of words vector for the image
Local feature correspondences
• The matching between sets of local features helps to establish overall similarity between objects or shapes.
• Assigned matches also useful for localization
Shape context [Belongie & Malik 2001]
Low-distortion matching [Berg & Malik 2005]
Match kernel [Wallraven, Caputo & Graf 2003]
Local feature correspondences
• Least cost match: minimize total cost between matched points
• Least cost partial match: match all of smaller set to some portion of larger set.
$\min_{\pi : X \to Y} \sum_{x_i \in X} \left\| x_i - \pi(x_i) \right\|$
Pyramid match kernel (PMK)
• Optimal matching expensive relative to number of features per image (m).
• PMK is approximate partial match for efficient discriminative learning from sets of local features.
Optimal match: O(m³)
Greedy match: O(m² log m)
Pyramid match: O(m)
[Grauman & Darrell, ICCV 2005]
Pyramid match kernel: pyramid extraction
Histogram pyramid: level i has bins of size 2^i (side length doubling at each level)
Pyramid match kernel: counting matches
Histogram intersection
Pyramid match kernel: counting new matches
Difference in histogram intersections across levels counts the number of newly matched pairs:
N_i = I(H_i(X), H_i(Y)) - I(H_{i-1}(X), H_{i-1}(Y))
(matches at this level minus matches at the previous level, where I denotes histogram intersection)
Pyramid match kernel
• For similarity, weight w_i is inversely proportional to bin size at level i, a measure of the difficulty of a match at that level (or the weights may be learned discriminatively)
• Kernel value between histogram pyramids: K(Ψ(X), Ψ(Y)) = Σ_i w_i N_i, where N_i is the number of newly matched pairs at level i
• Normalize kernel values to avoid favoring large sets
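The computation above can be sketched end to end for 1-D point sets; this is a toy illustration (the actual kernel bins the d-dimensional descriptor space), with bins doubling in size per level and weights w_i = 1/2^i:

```python
import numpy as np

def pyramid_match(X, Y, n_levels=4, lo=0.0, hi=16.0):
    """Approximate partial-match similarity between two 1-D point sets.
    At each level, histogram intersection counts matches; only matches
    that are new at that level are credited, discounted by 1/2**i."""
    score, prev, n_bins = 0.0, 0.0, 2 ** n_levels
    for i in range(n_levels + 1):
        edges = np.linspace(lo, hi, n_bins + 1)
        hx, _ = np.histogram(X, bins=edges)
        hy, _ = np.histogram(Y, bins=edges)
        inter = np.minimum(hx, hy).sum()   # matches at this level
        score += (inter - prev) / 2 ** i   # credit only the new matches
        prev = inter
        n_bins //= 2                       # bins double in size next level
    return score
```

An identical pair of sets scores |X| (every point matches at the finest level), while a pair that only co-occurs in the single coarsest bin is discounted by 1/2^n_levels.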
Example pyramid match
Example pyramid match
Example pyramid match
(Figure legend: pyramid match vs. optimal match)
Example pyramid match
Pyramid match kernel
• Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods
• Bounded error relative to optimal partial match
• Linear time -> efficient learning with large feature sets
Pyramid match kernel
• Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods
• Bounded error relative to optimal partial match
• Linear time -> efficient learning with large feature sets
(Plots on the ETH-80 data set: accuracy and time (s) vs. mean number of features per image, comparing the pyramid match, O(m), with the match kernel of [Wallraven et al.], O(m²).)
Pyramid match kernel
• Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods
• Bounded error relative to optimal partial match
• Linear time -> efficient learning with large feature sets
• Use data-dependent pyramid partitions for high-dimensional feature spaces
(Figure: uniform pyramid bins vs. vocabulary-guided pyramid bins)
Code for PMK: http://people.csail.mit.edu/jjl/libpmk/
Matching smoothness & local geometry
• Solving for linear assignment means (non-overlapping) features can be matched independently, ignoring relative geometry.
• One alternative: simply expand feature vectors to include spatial information before matching.
[ f1, …, f128, xa, ya ]   (descriptor for a patch, augmented with its image coordinates xa, ya)
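The augmented vector is a one-liner; the coordinate weight (an illustrative knob, not a value prescribed by the tutorial) controls how strongly spatial displacement counts against appearance difference in the combined Euclidean distance:

```python
import numpy as np

def augment(descriptor, x, y, w=1.0):
    """Append weighted image coordinates to an appearance descriptor so
    that Euclidean distance on the result also penalizes spatial offset."""
    return np.concatenate([np.asarray(descriptor, float), [w * x, w * y]])
```

With w = 0 this reduces to pure appearance matching; large w forces matches to be nearly co-located.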
Spatial pyramid match kernel
• First quantize descriptors into words, then do one pyramid match per word in image coordinate space.
Lazebnik, Schmid & Ponce, CVPR 2006
Matching smoothness & local geometry
• Use correspondence to estimate parameterized transformation, regularize to enforce smoothness
Shape context matching [Belongie, Malik, & Puzicha 2001]
Code: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/sc_digits.html
Matching smoothness & local geometry
• Let matching cost include term to penalize distortion between pairs of matched features.
Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005
(Figure: template features i, j matched to query features i', j'; the distortion term compares pairwise geometric relations R_ij in the template with S_i'j' in the query. Figure credit: Alex Berg)
Matching smoothness & local geometry
• Compare “semi-local” features: consider configurations or neighborhoods and co-occurrence relationships
Hyperfeatures [Agarwal & Triggs, ECCV 2006]
Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006]
Proximity distribution kernel [Ling & Soatto, ICCV 2007]
Feature neighborhoods [Sivic & Zisserman, CVPR 2004]
Tiled neighborhood [Quack, Ferrari, Leibe, van Gool, ICCV 2007]
Matching smoothness & local geometry
• Learn or provide explicit object-specific shape model [Next in the tutorial : part-based models]
(Figure: object parts x1 … x6 linked by an explicit shape model.)
Summary
• Local features are a useful, flexible representation
– Invariance properties typically built into the descriptor
– Distinctive, especially helpful for identifying specific textured objects
– Breaking image into regions/parts gives tolerance to occlusions and clutter
– Mapping to visual words forms discrete tokens from image regions
• Efficient methods available for:
– Indexing patches or regions
– Comparing distributions of visual words
– Matching features