Inference Network Approach to Image Retrieval
Don Metzler
R. Manmatha
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst
Motivation
Most image retrieval systems assume:
- an implicit "AND" between query terms
- equal weight given to all query terms
- queries made up of a single representation (keywords or an image)
Example: "tiger grass" => "find images of tigers AND grass, where each is equally important"
How can we search with queries made up of both keywords and images?
How do we perform the following queries?
- "swimmers OR jets"
- "tiger AND grass, with more emphasis on tigers than grass"
- "find me images of birds that are similar to this image"
Inference Networks
Inference Network Framework [Turtle and Croft '89]
- Formal information retrieval framework; basis of the INQUERY search engine
- Allows structured queries: phrases, term weighting, synonyms, etc.
  Example: #wsum( 2.0 #phrase( image retrieval ) 1.0 model )
- Handles multiple document representations (full text, abstracts, etc.)
MIRROR [deVries '98]
- General multimedia retrieval framework based on the inference network framework
- Probabilities based on clustering of metadata + feature vectors
Image Retrieval / Annotation
- Co-occurrence model [Mori et al.]
- Translation model [Duygulu et al.]
- Correspondence LDA [Blei and Jordan]
- Relevance model-based approaches:
  - Cross-Media Relevance Models (CMRM) [Jeon et al.]
  - Continuous Relevance Models (CRM) [Lavrenko et al.]
Goals
Input:
- Set of annotated training images
- User's information need: terms, images, "soft" Boolean operators (AND, OR, NOT), weights
- Set of test images with no annotations
Output:
- Ranked list of test images relevant to the user's information need
Data
Corel data set†
- 4500 training images (annotated), 500 test images
- 374-word vocabulary
- Each image automatically segmented using normalized cuts
- Each image represented as a set of representation vectors
- 36 geometric, color, and texture features; same features used in similar past work
† Available at: http://vision.cs.arizona.edu/kobus/research/data/eccv_2002/
Features
- Geometric (6): area, position (2), boundary/area, convexity, moment of inertia
- Color (18): avg. RGB x 2 (6), std. dev. of RGB (3), avg. L*a*b x 2 (6), std. dev. of L*a*b (3)
- Texture (12): mean oriented energy, 30 deg. increments
Image representation
An image d is represented by:

d = [ r_1, ..., r_|d| ]   representation vectors (real-valued, 1 per image segment)
w = [ w_1, ..., w_|V| ]   annotation vector (binary, same for each segment)

Example annotation: cat, grass, tiger, water
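The representation above can be sketched as a small container type. This is a hypothetical illustration (the class name `Image` and field names are mine, not from the slides), assuming the Corel setup described earlier: 36-dimensional segment features and a 374-word binary annotation vector.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Image:
    """One image: segment features plus a shared annotation vector.

    reps:       shape (num_segments, 36), one real-valued
                representation vector per normalized-cuts segment.
    annotation: shape (vocab_size,), binary; the same vector
                applies to every segment of the image.
    """
    reps: np.ndarray
    annotation: np.ndarray

# Example: a 3-segment image annotated with {cat, grass, tiger, water}
vocab = ["cat", "grass", "tiger", "water"]
img = Image(reps=np.random.rand(3, 36),
            annotation=np.array([1, 1, 1, 1]))
```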
Image Inference Network
Nodes:
- J – representation vectors for image (continuous, observed)
- q_w – word w appears in annotation (binary, hidden)
- q_r – representation vector r describes image (binary, hidden)
- q_op – query operator satisfied (binary, hidden)
- I – user's information need is satisfied (binary, hidden)
[Network diagram: the "image network" (J feeding term nodes q_w1 ... q_wk and representation nodes q_r1 ... q_rk) is fixed, based on the image; the "query network" (operator nodes q_op1, q_op2 feeding I) is dynamic, based on the query.]
P(q_w | J)   [ e.g., P( tiger | image ) ]
Probability that term w appears in the annotation given image J.
Apply Bayes' rule and use non-parametric density estimation.
Assumes representation vectors are conditionally independent given that term w annotates the image:

P(q_w | J) = P(q_w) P(J | q_w) / P(J)
           = P(q_w) prod_{r_i in J} P(r_i | q_w) / P(J)

with the prior estimated from annotation counts:

P(q_w) = n_w / n_tot
How can we compute P(r_i | q_w)?
[Figure: training-set representation vectors plotted as points; the vectors associated with images annotated by w define areas of high likelihood, the remaining vectors areas of low likelihood.]
P(q_w | J) [final form]

P(q_w | J) = (n_w / n_tot) prod_{r_i in J} P(r_i | q_w) / P(J)

P(r_i | q_w) = (1 / |T_w|) sum_{g_k in T_w} N(r_i; g_k, Σ)

where T_w is the set of training representation vectors drawn from images annotated with w, and N is the Gaussian kernel

N(x; x̄, Σ) = 1 / ((2π)^d |Σ|)^(1/2) * exp( -(1/2) (x - x̄)^T Σ^{-1} (x - x̄) )

Σ assumed to be diagonal, estimated from training data
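The kernel density estimate on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a diagonal Σ (passed as a vector of variances) and dropping the per-image constant P(J), which does not affect the ranking of terms for a single image; function names are mine.

```python
import numpy as np

def gaussian_kernel(x, mean, sigma2):
    """N(x; mean, Sigma) with diagonal Sigma; sigma2 holds the variances."""
    d = x.shape[0]
    norm = np.sqrt((2 * np.pi) ** d * np.prod(sigma2))
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / sigma2)) / norm

def p_r_given_qw(r, T_w, sigma2):
    """P(r | q_w): average of Gaussian kernels centered at the
    training representation vectors T_w from images annotated with w."""
    return np.mean([gaussian_kernel(r, g, sigma2) for g in T_w])

def p_qw_given_J(reps_J, T_w, n_w, n_tot, sigma2):
    """Unnormalized P(q_w | J) ∝ P(q_w) * prod_i P(r_i | q_w).
    The P(J) denominator is constant per image, so it can be
    dropped when ranking annotation terms for one image."""
    prior = n_w / n_tot
    likelihood = np.prod([p_r_given_qw(r, T_w, sigma2) for r in reps_J])
    return prior * likelihood
```

Note that the product over segments underflows quickly for real 36-dimensional features; a practical implementation would sum log-probabilities instead.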
Regularized estimates…
P(qw | J) are good, but not comparable across images
term P(qw | J)
cat 0.45
grass 0.35
tiger 0.15
water 0.05
term P(qw | J)
cat 0.90
grass 0.05
tiger 0.01
water 0.03 Is the 2nd image really 2x more “cat-like”? Probabilities are relative per image
Regularized estimates… Impact Transformations
- Used in information retrieval: "rank is more important than value" [Anh and Moffat]
- Idea: rank each term according to P(q_w | J) and give higher probabilities to higher-ranked terms: P(q_w | J) ≈ 1/rank_{q_w}
- Zipfian assumption on relevant words: a few words are very relevant, a medium number of words are somewhat relevant, many words are not relevant
Regularized estimates…

P̂(q_w | J) = 1 / rank_{q_w}   (values below normalized to sum to 1)

Image 1:
term   P(q_w | J)   1/rank
cat    0.45         0.48
grass  0.35         0.24
tiger  0.15         0.16
water  0.05         0.12

Image 2:
term   P(q_w | J)   1/rank
cat    0.90         0.48
grass  0.05         0.24
tiger  0.01         0.12
water  0.03         0.16
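The rank-based regularization above can be sketched as follows. This is a minimal illustration (function name mine), assuming the 1/rank values are normalized to sum to 1, which reproduces the numbers in the tables on this slide.

```python
def regularize(term_probs):
    """Replace each P(q_w | J) by a normalized 1/rank(q_w).

    Only the ordering of terms within an image survives, so the
    resulting scores are comparable across images.
    """
    terms = sorted(term_probs, key=term_probs.get, reverse=True)
    inv_rank = {t: 1.0 / (i + 1) for i, t in enumerate(terms)}
    z = sum(inv_rank.values())
    return {t: v / z for t, v in inv_rank.items()}

image1 = {"cat": 0.45, "grass": 0.35, "tiger": 0.15, "water": 0.05}
image2 = {"cat": 0.90, "grass": 0.05, "tiger": 0.01, "water": 0.03}
# Both images now assign cat the same score (0.48), even though
# their raw probabilities differed by a factor of two.
```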
P(q_r | J)   [ e.g., P( query image segment | image ) ]
Probability that representation vector r is observed given J.
Use non-parametric density estimation again: impose a density over J's representation vectors just as in the previous case.

P(q_r | J) = (1 / |J|) sum_{r_i in J} N(q_r; r_i, Σ)

Estimates may be poor: based on a small sample (~10 representation vectors). Naive and simple, yet somewhat effective.
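This estimate can be sketched the same way as the term-probability kernel estimate, again assuming a diagonal Σ passed as a vector of variances (function name mine):

```python
import numpy as np

def p_qr_given_J(q_r, reps_J, sigma2):
    """P(q_r | J): average Gaussian kernel over the image's own
    representation vectors reps_J (diagonal covariance sigma2)."""
    d = q_r.shape[0]
    norm = np.sqrt((2 * np.pi) ** d * np.prod(sigma2))
    sq_dists = [np.sum((q_r - r_i) ** 2 / sigma2) for r_i in reps_J]
    return np.mean([np.exp(-0.5 * s) for s in sq_dists]) / norm
```

Because the density is built from only the handful of segments in one image, it peaks sharply around those segments, which is exactly the small-sample weakness the slide notes.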
Query Operators
- "Soft" Boolean operators: #and / #wand (weighted and), #or, #not
- One node added to the query network for each operator present in the query
- Many others possible: #max, #sum, #wsum, #syn, #odn, #uwn, #phrase, etc.
Operator Nodes
- Combine probabilities from term and image nodes
- Closed forms derived from the corresponding link matrices
- Allows efficient inference within the network

P(q_and | J) = prod_{p_i in Par(q)} P(p_i | J)
P(q_wand | J) = prod_{p_i in Par(q)} P(p_i | J)^(w_i / W)
P(q_or | J) = 1 - prod_{p_i in Par(q)} ( 1 - P(p_i | J) )
P(q_not | J) = 1 - P(p | J)

Par(q) = set of q's parent nodes; w_i = weight of parent p_i, W = sum of the weights
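The closed forms above are simple enough to write directly. A minimal sketch (function names mine), taking each node's parent beliefs p_i = P(p_i | J) as plain floats; #wsum from the previous slide is included for comparison:

```python
import math

def op_and(parents):
    """#and: product of parent beliefs."""
    return math.prod(parents)

def op_wand(parents, weights):
    """#wand: each parent belief raised to w_i / W, W = sum of weights."""
    W = sum(weights)
    return math.prod(p ** (w / W) for p, w in zip(parents, weights))

def op_or(parents):
    """#or: complement of the product of complements."""
    return 1.0 - math.prod(1.0 - p for p in parents)

def op_not(p):
    """#not: complement of the single parent's belief."""
    return 1.0 - p

def op_wsum(parents, weights):
    """#wsum: weighted average, e.g. #wsum( 2.0 A 1.0 B )."""
    return sum(w * p for w, p in zip(weights, parents)) / sum(weights)
```

Evaluating a query is then a bottom-up pass over the operator tree, which is why inference in the network is fast.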
Results - Annotation
Results on full vocabulary:

                          Translation   CMRM   CRM    InfNet
# words with recall > 0   49            66     107    117
Mean per-word recall      0.04          0.09   0.19   0.24
Mean per-word precision   0.06          0.10   0.16   0.17
F-measure                 0.05          0.09   0.17   0.20

Example annotations (top five words per image):
- foals (0.46), mare (0.33), horses (0.20), field (1.9E-5), grass (4.9E-6)
- railroad (0.67), train (0.27), smoke (0.04), locomotive (0.01), ruins (1.7E-5)
- sphinx (0.99), polar (5.0E-3), stone (1.0E-3), bear (9.7E-4), sculpture (6.0E-4)
Results - Retrieval

Precision @ 5 retrieved images:
             1 word   2 word   3 word
CMRM         0.1989   0.1306   0.1494
CRM          0.2480   0.1902   0.1888
InfNet       0.2525   0.1672   0.1727
InfNet-reg   0.2547   0.1964   0.2170

Mean Average Precision:
             1 word   2 word   3 word
CMRM         0.1697   0.1642   0.2030
CRM          0.2353   0.2534   0.3152
InfNet       0.2484   0.2155   0.2478
InfNet-reg   0.2633   0.2649   0.3238
Future Work
- Use rectangular segmentation and improved features
- Different probability estimates: better methods for estimating P(q_r | J); use the CRM to estimate P(q_w | J)
- Apply to documents with both text and images
- Develop a method/testbed for evaluating more "interesting" queries
Conclusions
- General, robust model based on the inference network framework
- Departure from the implied "AND" between query terms
- Unique non-parametric method for estimating network probabilities
Pros:
- Retrieval (inference) is fast
- Makes no assumptions about the distribution of the data
Cons:
- Estimation of term probabilities is slow
- Requires sufficient data to get a good estimate