Inference Network Approach to Image Retrieval
Don Metzler
R. Manmatha
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst
Motivation
Most image retrieval systems assume:
- an implicit "AND" between query terms
- equal weight given to all query terms
- queries made up of a single representation (keywords or an image)
Example: "tiger grass" => "find images of tigers AND grass, where each is equally important"
How can we search with queries made up of both keywords and images?
How do we perform the following queries?
- "swimmers OR jets"
- "tiger AND grass, with more emphasis on tigers than grass"
- "find me images of birds that are similar to this image"
Inference Networks
Inference Network Framework [Turtle and Croft '89]
- Formal information retrieval framework; basis of the INQUERY search engine
- Allows structured queries: phrases, term weighting, synonyms, etc.
  Example: #wsum( 2.0 #phrase( image retrieval ) 1.0 model )
- Handles multiple document representations (full text, abstracts, etc.)
MIRROR [deVries '98]
- General multimedia retrieval framework based on the inference network framework
- Probabilities based on clustering of metadata + feature vectors
Image Retrieval / Annotation
- Co-occurrence model [Mori et al.]
- Translation model [Duygulu et al.]
- Correspondence LDA [Blei and Jordan]
- Relevance model-based approaches:
  - Cross-Media Relevance Models (CMRM) [Jeon et al.]
  - Continuous Relevance Models (CRM) [Lavrenko et al.]
Goals
Input:
- Set of annotated training images
- User's information need: terms, images, "soft" Boolean operators (AND, OR, NOT), weights
- Set of test images with no annotations
Output:
- Ranked list of test images relevant to the user's information need
Data
Corel data set†
- 4500 training images (annotated), 500 test images
- 374-word vocabulary
- Each image automatically segmented using normalized cuts
- Each image represented as a set of representation vectors
- 36 geometric, color, and texture features; same features used in similar past work
† Available at: http://vision.cs.arizona.edu/kobus/research/data/eccv_2002/
Features
- Geometric (6): area, position (2), boundary/area, convexity, moment of inertia
- Color (18): avg. RGB x 2 (6), std. dev. of RGB (3), avg. L*a*b x 2 (6), std. dev. of L*a*b (3)
- Texture (12): mean oriented energy, 30 deg. increments
Image representation
An image d is represented by:

d = [ r_1, ..., r_|d| ]   representation vectors (real-valued, 1 per image segment)
w = [ w_1, ..., w_|V| ]   annotation vector (binary, same for each segment)

Example annotation: cat, grass, tiger, water
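The representation above can be sketched as a small container type. This is a hypothetical illustration (the class name `Image` and field names are mine, not from the slides), assuming the Corel setup described earlier: 36-dimensional segment features and a 374-word binary annotation vector.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Image:
    """One image: segment features plus a shared annotation vector.

    reps:       shape (num_segments, 36), one real-valued
                representation vector per normalized-cuts segment.
    annotation: shape (vocab_size,), binary; the same vector
                applies to every segment of the image.
    """
    reps: np.ndarray
    annotation: np.ndarray

# Example: a 3-segment image annotated with {cat, grass, tiger, water}
vocab = ["cat", "grass", "tiger", "water"]
img = Image(reps=np.random.rand(3, 36),
            annotation=np.array([1, 1, 1, 1]))
```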
Image Inference Network
Nodes:
- J – representation vectors for image (continuous, observed)
- q_w – word w appears in annotation (binary, hidden)
- q_r – representation vector r describes image (binary, hidden)
- q_op – query operator satisfied (binary, hidden)
- I – user's information need is satisfied (binary, hidden)
[Network diagram: the "image network" (J feeding term nodes q_w1 ... q_wk and representation nodes q_r1 ... q_rk) is fixed, based on the image; the "query network" (operator nodes q_op1, q_op2 feeding I) is dynamic, based on the query.]
P(q_w | J)   [ e.g., P( tiger | image ) ]
Probability that term w appears in the annotation given image J.
Apply Bayes' rule and use non-parametric density estimation.
Assumes representation vectors are conditionally independent given that term w annotates the image:

P(q_w | J) = P(q_w) P(J | q_w) / P(J)
           = P(q_w) prod_{r_i in J} P(r_i | q_w) / P(J)

with the prior estimated from annotation counts:

P(q_w) = n_w / n_tot
How can we compute P(r_i | q_w)?
[Figure: training-set representation vectors plotted as points; the vectors associated with images annotated by w define areas of high likelihood, the remaining vectors areas of low likelihood.]
P(q_w | J) [final form]

P(q_w | J) = (n_w / n_tot) prod_{r_i in J} P(r_i | q_w) / P(J)

P(r_i | q_w) = (1 / |T_w|) sum_{g_k in T_w} N(r_i; g_k, Σ)

where T_w is the set of training representation vectors drawn from images annotated with w, and N is the Gaussian kernel

N(x; x̄, Σ) = 1 / ((2π)^d |Σ|)^(1/2) * exp( -(1/2) (x - x̄)^T Σ^{-1} (x - x̄) )

Σ assumed to be diagonal, estimated from training data
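The kernel density estimate on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a diagonal Σ (passed as a vector of variances) and dropping the per-image constant P(J), which does not affect the ranking of terms for a single image; function names are mine.

```python
import numpy as np

def gaussian_kernel(x, mean, sigma2):
    """N(x; mean, Sigma) with diagonal Sigma; sigma2 holds the variances."""
    d = x.shape[0]
    norm = np.sqrt((2 * np.pi) ** d * np.prod(sigma2))
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / sigma2)) / norm

def p_r_given_qw(r, T_w, sigma2):
    """P(r | q_w): average of Gaussian kernels centered at the
    training representation vectors T_w from images annotated with w."""
    return np.mean([gaussian_kernel(r, g, sigma2) for g in T_w])

def p_qw_given_J(reps_J, T_w, n_w, n_tot, sigma2):
    """Unnormalized P(q_w | J) ∝ P(q_w) * prod_i P(r_i | q_w).
    The P(J) denominator is constant per image, so it can be
    dropped when ranking annotation terms for one image."""
    prior = n_w / n_tot
    likelihood = np.prod([p_r_given_qw(r, T_w, sigma2) for r in reps_J])
    return prior * likelihood
```

Note that the product over segments underflows quickly for real 36-dimensional features; a practical implementation would sum log-probabilities instead.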
Regularized estimates…
P(qw | J) are good, but not comparable across images
term P(qw | J)
cat 0.45
grass 0.35
tiger 0.15
water 0.05
term P(qw | J)
cat 0.90
grass 0.05
tiger 0.01
water 0.03 Is the 2nd image really 2x more “cat-like”? Probabilities are relative per image
Regularized estimates… Impact Transformations
- Used in information retrieval: "rank is more important than value" [Anh and Moffat]
- Idea: rank each term according to P(q_w | J) and give higher probabilities to higher-ranked terms: P(q_w | J) ≈ 1/rank_{q_w}
- Zipfian assumption on relevant words: a few words are very relevant, a medium number of words are somewhat relevant, many words are not relevant
Regularized estimates…

P̂(q_w | J) = 1 / rank_{q_w}   (values below normalized to sum to 1)

Image 1:
term   P(q_w | J)   1/rank
cat    0.45         0.48
grass  0.35         0.24
tiger  0.15         0.16
water  0.05         0.12

Image 2:
term   P(q_w | J)   1/rank
cat    0.90         0.48
grass  0.05         0.24
tiger  0.01         0.12
water  0.03         0.16
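The rank-based regularization above can be sketched as follows. This is a minimal illustration (function name mine), assuming the 1/rank values are normalized to sum to 1, which reproduces the numbers in the tables on this slide.

```python
def regularize(term_probs):
    """Replace each P(q_w | J) by a normalized 1/rank(q_w).

    Only the ordering of terms within an image survives, so the
    resulting scores are comparable across images.
    """
    terms = sorted(term_probs, key=term_probs.get, reverse=True)
    inv_rank = {t: 1.0 / (i + 1) for i, t in enumerate(terms)}
    z = sum(inv_rank.values())
    return {t: v / z for t, v in inv_rank.items()}

image1 = {"cat": 0.45, "grass": 0.35, "tiger": 0.15, "water": 0.05}
image2 = {"cat": 0.90, "grass": 0.05, "tiger": 0.01, "water": 0.03}
# Both images now assign cat the same score (0.48), even though
# their raw probabilities differed by a factor of two.
```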
P(q_r | J)   [ e.g., P( query image segment | image ) ]
Probability that representation vector r is observed given J.
Use non-parametric density estimation again: impose a density over J's representation vectors just as in the previous case.

P(q_r | J) = (1 / |J|) sum_{r_i in J} N(q_r; r_i, Σ)

Estimates may be poor: based on a small sample (~10 representation vectors). Naive and simple, yet somewhat effective.
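This estimate can be sketched the same way as the term-probability kernel estimate, again assuming a diagonal Σ passed as a vector of variances (function name mine):

```python
import numpy as np

def p_qr_given_J(q_r, reps_J, sigma2):
    """P(q_r | J): average Gaussian kernel over the image's own
    representation vectors reps_J (diagonal covariance sigma2)."""
    d = q_r.shape[0]
    norm = np.sqrt((2 * np.pi) ** d * np.prod(sigma2))
    sq_dists = [np.sum((q_r - r_i) ** 2 / sigma2) for r_i in reps_J]
    return np.mean([np.exp(-0.5 * s) for s in sq_dists]) / norm
```

Because the density is built from only the handful of segments in one image, it peaks sharply around those segments, which is exactly the small-sample weakness the slide notes.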
Query Operators
- "Soft" Boolean operators: #and / #wand (weighted and), #or, #not
- One node added to the query network for each operator present in the query
- Many others possible: #max, #sum, #wsum, #syn, #odn, #uwn, #phrase, etc.
Operator Nodes
- Combine probabilities from term and image nodes
- Closed forms derived from the corresponding link matrices
- Allows efficient inference within the network

P(q_and | J) = prod_{p_i in Par(q)} P(p_i | J)
P(q_wand | J) = prod_{p_i in Par(q)} P(p_i | J)^(w_i / W)
P(q_or | J) = 1 - prod_{p_i in Par(q)} ( 1 - P(p_i | J) )
P(q_not | J) = 1 - P(p | J)

Par(q) = set of q's parent nodes; w_i = weight of parent p_i, W = sum of the weights
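The closed forms above are simple enough to write directly. A minimal sketch (function names mine), taking each node's parent beliefs p_i = P(p_i | J) as plain floats; #wsum from the previous slide is included for comparison:

```python
import math

def op_and(parents):
    """#and: product of parent beliefs."""
    return math.prod(parents)

def op_wand(parents, weights):
    """#wand: each parent belief raised to w_i / W, W = sum of weights."""
    W = sum(weights)
    return math.prod(p ** (w / W) for p, w in zip(parents, weights))

def op_or(parents):
    """#or: complement of the product of complements."""
    return 1.0 - math.prod(1.0 - p for p in parents)

def op_not(p):
    """#not: complement of the single parent's belief."""
    return 1.0 - p

def op_wsum(parents, weights):
    """#wsum: weighted average, e.g. #wsum( 2.0 A 1.0 B )."""
    return sum(w * p for w, p in zip(weights, parents)) / sum(weights)
```

Evaluating a query is then a bottom-up pass over the operator tree, which is why inference in the network is fast.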
Results - Annotation
Results on full vocabulary:

                          Translation   CMRM   CRM    InfNet
# words with recall > 0   49            66     107    117
Mean per-word recall      0.04          0.09   0.19   0.24
Mean per-word precision   0.06          0.10   0.16   0.17
F-measure                 0.05          0.09   0.17   0.20

Example annotations (top five words per image):
- foals (0.46), mare (0.33), horses (0.20), field (1.9E-5), grass (4.9E-6)
- railroad (0.67), train (0.27), smoke (0.04), locomotive (0.01), ruins (1.7E-5)
- sphinx (0.99), polar (5.0E-3), stone (1.0E-3), bear (9.7E-4), sculpture (6.0E-4)
Results - Retrieval

Precision @ 5 retrieved images:
             1 word   2 word   3 word
CMRM         0.1989   0.1306   0.1494
CRM          0.2480   0.1902   0.1888
InfNet       0.2525   0.1672   0.1727
InfNet-reg   0.2547   0.1964   0.2170

Mean Average Precision:
             1 word   2 word   3 word
CMRM         0.1697   0.1642   0.2030
CRM          0.2353   0.2534   0.3152
InfNet       0.2484   0.2155   0.2478
InfNet-reg   0.2633   0.2649   0.3238
Future Work
- Use rectangular segmentation and improved features
- Different probability estimates: better methods for estimating P(q_r | J); use the CRM to estimate P(q_w | J)
- Apply to documents with both text and images
- Develop a method/testbed for evaluating more "interesting" queries
Conclusions
- General, robust model based on the inference network framework
- Departure from the implied "AND" between query terms
- Unique non-parametric method for estimating network probabilities
Pros:
- Retrieval (inference) is fast
- Makes no assumptions about the distribution of the data
Cons:
- Estimation of term probabilities is slow
- Requires sufficient data to get a good estimate