building text features for object image classification

BUILDING TEXT FEATURES FOR OBJECT IMAGE CLASSIFICATION

Gang Wang, Derek Hoiem, David Forsyth

MAIN IDEA

Text-based image features are built using an auxiliary dataset of internet images annotated with tags.

They help a visual classifier that is presented with an object viewed under novel circumstances.

So, basically: Text classifier + Image classifier -> Unified

WHAT ARE THEY TRYING TO DO?

Determine which objects are present in an image based on the text that surrounds similar images drawn from large collections.

CHALLENGES

Sounds easy, but these vary: Object appearance, Pose, Illumination

LOW LEVEL FEATURES CAN RESCUE BUT…..

Color, texture, and SIFT features can help if we have millions of training samples, but this is unrealistic.

So what can help? Millions of images on the internet: not hand-labeled, but the text associated with them helps classification.

EUREKA!!!!!!

Easier to determine image content using surrounding text than with currently available image features.

Given a large enough dataset, we are bound to find images very similar to an input image. So they infer likely text for an input image from the text of similar images.

THE COMMON APPROACH

Approach: Improve annotation quality or filter spurious search results that can be used for training.

The Problem: Noise or ambiguity in annotations can easily nullify any benefit.

Proposal: Learn a distance metric that causes images with similar surrounding text to be similar in visual feature space.

THEIR APPROACH

Build text features for object image classification as they are expected to capture direct semantic meaning of an image.

APPROACH EXPLAINED

Dataset = Training + Test images.

Auxiliary dataset = Internet images (Flickr) that have associated text.

For each training image:

Extract visual features.

Find the K nearest neighbor images from the internet dataset.

Use the text associated with these internet images to build a text feature.

Train!!

Repeat for visual features and combine both.

VISUAL FEATURES

SIFT:

Used for image matching and object recognition. They use it to detect and describe local patches.

Extract 1000 local patches from each image.

Quantize the patch descriptors into 1000 clusters; each patch is assigned a cluster index.

Finally, each image is represented as a normalized histogram of cluster indices.
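The quantization step above can be sketched as follows. This is a minimal illustration, assuming the cluster centers have already been computed (for example by k-means) and that patches are assigned by squared Euclidean distance; the function names and the toy 2-D descriptors are hypothetical, and real SIFT descriptors are 128-dimensional.

```python
import math

def nearest_cluster(descriptor, centers):
    """Index of the cluster center closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, math.inf
    for index, center in enumerate(centers):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_words_histogram(descriptors, centers):
    """Normalized histogram of cluster indices for one image."""
    counts = [0.0] * len(centers)
    for descriptor in descriptors:
        counts[nearest_cluster(descriptor, centers)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy example: 3 clusters and 4 patches (the slides use 1000 of each).
centers = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
patches = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
hist = bag_of_words_histogram(patches, centers)
```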

GIST: Powerful in scene categorization and retrieval. They represent each image as a 960-dimensional GIST descriptor.

Color: Quantize each channel to 8 bins, so each pixel value is represented as an integer between 1 and 512 (8 x 8 x 8 joint bins). Each image becomes a 512-dimensional histogram.
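A minimal sketch of this color feature, assuming 8-bit RGB pixels. The order in which the three channel bins are combined into one index and the normalization of the histogram are assumptions; the slides do not specify them.

```python
def color_bin(r, g, b, bins_per_channel=8):
    """Map an 8-bit RGB pixel to a joint bin index in 1..512."""
    step = 256 // bins_per_channel  # width of one bin per channel (32 for 8 bins)
    rq, gq, bq = r // step, g // step, b // step
    return rq * bins_per_channel ** 2 + gq * bins_per_channel + bq + 1

def color_histogram(pixels, bins_per_channel=8):
    """Histogram over joint color bins for an iterable of (r, g, b) pixels."""
    counts = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        counts[color_bin(r, g, b, bins_per_channel) - 1] += 1.0
    total = sum(counts)
    # Normalizing to sum 1 is an assumption made so images of any size compare.
    return [c / total for c in counts] if total else counts
```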

Gradient: Can be considered a global, coarse SIFT feature.

Divide the image into 4 x 4 cells.

In each cell, quantize the gradient into 16 bins.

The whole image is represented as a 256-dimensional vector (16 cells x 16 bins).
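One way to realize such a gradient feature is sketched below. Several details are assumptions not stated on the slides: the image is a grayscale list of rows, gradients use central differences, the 16 bins cover orientation over the full circle, and each pixel votes with its gradient magnitude.

```python
import math

def gradient_feature(image, grid=4, bins=16):
    """Concatenated per-cell orientation histograms (grid*grid*bins values)."""
    height, width = len(image), len(image[0])
    feature = [0.0] * (grid * grid * bins)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat pixels carry no orientation
            angle = math.atan2(gy, gx) % (2 * math.pi)
            orientation = min(int(angle / (2 * math.pi) * bins), bins - 1)
            cell = (y * grid // height) * grid + (x * grid // width)
            feature[cell * bins + orientation] += magnitude
    total = sum(feature)
    return [v / total for v in feature] if total else feature

# Toy example: an 8 x 8 horizontal ramp, whose gradient always points along +x.
ramp = [[float(x) for x in range(8)] for _ in range(8)]
feature = gradient_feature(ramp)
```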

Unified: Concatenation of the 4 previously described features. Let the above features be f1, f2, f3, f4. The resulting feature is [w1f1, w2f2, w3f3, w4f4].
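The weighted concatenation can be written in a few lines. This is a sketch only; `unified_feature` is a hypothetical helper name, and the weights here are arbitrary placeholders rather than learned values.

```python
def unified_feature(features, weights):
    """Concatenate feature vectors, scaling the j-th vector by the j-th weight."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    unified = []
    for weight, feature in zip(weights, features):
        unified.extend(weight * value for value in feature)
    return unified

# Toy example with two short feature vectors instead of four long ones.
example = unified_feature([[1.0, 2.0], [3.0]], [0.5, 2.0])
```

A useful property of this construction: for a chi square distance of the form sum (a - b)^2 / (a + b), scaling both vectors by w scales the distance by w, so the distance between two unified features is the weighted sum of the per-feature distances.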

HOW TO FIND WEIGHTS:

Learn weights from training images.

Aim to force images from the same category to be close, and images from different categories to be far apart.

Randomly select N pairs of images from the training set.

For the ith pair, Si = 1 if the two images share at least one object class; otherwise Si = 0.

Calculate the chi square distance fj for the ith pair on the jth feature, then learn the weights by minimizing an objective over the pairs.

Can solve directly using “fmincon” in Matlab.
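The objective itself is not legible in this transcript, so the sketch below uses a stand-in that should be read as an assumption: the weighted distance of pair i is d_i = sum_j w_j * f_ij, and the weights minimize sum_i (exp(-d_i) - S_i)^2 subject to w_j >= 0. Plain projected gradient descent replaces Matlab's fmincon here; the function names, step size, and toy data are all hypothetical.

```python
import math

def objective(weights, distances, labels):
    """Assumed objective: squared gap between exp(-d_i) and the pair label S_i."""
    total = 0.0
    for f, s in zip(distances, labels):
        d = sum(w * fj for w, fj in zip(weights, f))
        total += (math.exp(-d) - s) ** 2
    return total

def learn_weights(distances, labels, steps=1000, lr=0.05):
    """Projected gradient descent keeping every weight non-negative."""
    n_features = len(distances[0])
    weights = [1.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for f, s in zip(distances, labels):
            d = sum(w * fj for w, fj in zip(weights, f))
            e = math.exp(-d)
            for j, fj in enumerate(f):
                grad[j] += 2.0 * (e - s) * (-e) * fj
        # Gradient step followed by projection onto w >= 0.
        weights = [max(0.0, w - lr * g) for w, g in zip(weights, grad)]
    return weights

# Toy data: per-pair chi square distances on two features, and pair labels S_i.
pair_distances = [[0.1, 0.5], [0.2, 0.4], [1.5, 0.5], [1.8, 0.4]]
pair_labels = [1, 1, 0, 0]
learned = learn_weights(pair_distances, pair_labels)
```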

CHI SQUARE???

Chi square distance (http://www.stat.lsu.edu/faculty/moser/exst7037/geometry.pdf):

Denominator is the normalization component for each point in X.

So for n dimensions:
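As an illustration, here is one common form of the chi square distance between two n-dimensional histograms, sum over k of (x_k - y_k)^2 / (x_k + y_k). This specific form is an assumption: the formula from the slide is not legible in this transcript, and the linked reference may normalize differently.

```python
def chi_square_distance(x, y):
    """Sum over dimensions of (x_k - y_k)^2 / (x_k + y_k), skipping empty bins."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # a bin empty in both histograms contributes nothing
            total += (a - b) ** 2 / (a + b)
    return total

# Identical histograms are at distance 0; disjoint unit histograms at distance 2.
same = chi_square_distance([0.5, 0.5], [0.5, 0.5])
disjoint = chi_square_distance([1.0, 0.0], [0.0, 1.0])
```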


FMINCON?????

Finds the minimum of a constrained nonlinear multivariable function.

x = fmincon(fun,x0,A,b)

x = fmincon(fun,x0,A,b,Aeq,beq)

x = fmincon(fun,x0,A,b,Aeq,beq,lb,ub)

…

http://www.mathworks.com/help/toolbox/optim/ug/fmincon.html


AUXILIARY DATASET

Collected from Flickr: 1 million images in total.

Of these, 700,000 images were collected for 58 object categories whose names come from the PASCAL and Caltech 256 datasets.

The rest were collected from a group called “10 million photos”. These are random images.

TEXT FEATURES

For each training/test image:

Find the K nearest neighbor images from the auxiliary dataset.

Extract the text associated with these images.

Build text features.

A tag or group name such as “Dogs! Dogs! Dogs!” is treated as a single item.

Use only the frequent tags and group names (6000) in the auxiliary dataset.

The text feature is a normalized histogram of tag and group name counts.
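The steps above can be sketched end to end on toy data. Assumptions: neighbors are ranked by a chi square distance on visual features (the slides do not say which distance the search uses), the auxiliary set is a list of (visual_feature, tags) pairs, and `text_feature` is a hypothetical name. The slides use K = 150 and a 6000-item vocabulary; the toy example uses K = 2 and three items.

```python
from collections import Counter

def chi_square_distance(x, y):
    """Sum of (a - b)^2 / (a + b) over bins that are non-empty in x or y."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def text_feature(query_visual, auxiliary, vocabulary, k=150):
    """Normalized histogram of vocabulary items among the k nearest neighbors."""
    allowed = set(vocabulary)
    neighbors = sorted(
        auxiliary, key=lambda item: chi_square_distance(query_visual, item[0])
    )[:k]
    counts = Counter()
    for _, tags in neighbors:
        # Each tag or group name counts as one item, however many words it has.
        counts.update(tag for tag in tags if tag in allowed)
    total = sum(counts.values())
    return [counts[term] / total if total else 0.0 for term in vocabulary]

auxiliary = [
    ([1.0, 0.0], ["dog", "park"]),
    ([0.9, 0.1], ["dog", "Dogs! Dogs! Dogs!"]),
    ([0.0, 1.0], ["car", "street"]),
]
vocabulary = ["dog", "car", "Dogs! Dogs! Dogs!"]
feature = text_feature([1.0, 0.0], auxiliary, vocabulary, k=2)
```

In the toy run, the two nearest neighbors contribute "dog" twice and the group name once, while "park" is dropped because it is not in the vocabulary.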

CLASSIFIER

An SVM classifier with a chi-squared kernel is used for text features.

The same classifier is used for visual features as well.
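A chi-squared kernel is often written as exp(-gamma * chi square distance). The sketch below builds a Gram matrix under that assumed form; the exact kernel and the value of `gamma` used by the authors are not given on the slides.

```python
import math

def chi_square_distance(x, y):
    """Sum of (a - b)^2 / (a + b) over bins that are non-empty in x or y."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi_square_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel; equals 1 for identical histograms."""
    return math.exp(-gamma * chi_square_distance(x, y))

def gram_matrix(features, gamma=1.0):
    """Pairwise kernel values for a list of histograms."""
    return [[chi_square_kernel(a, b, gamma) for b in features] for a in features]

histograms = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
gram = gram_matrix(histograms)
```

A matrix like this can be handed to an SVM implementation that accepts precomputed kernels.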

FUSION

Build a visual classifier.

Build a text classifier.

A third classifier is trained to combine the confidence values of the above two and give the final prediction.

The final classifier is a logistic regression, trained on a validation set.
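The fusion step can be illustrated with a small logistic regression over the two confidence values. This is a sketch under assumptions: plain batch gradient descent, hypothetical function names, and made-up validation scores; the authors' training procedure is not described on the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_fusion(scores, labels, steps=2000, lr=0.5):
    """Fit weights for (visual_confidence, text_confidence) pairs and a bias."""
    w_visual, w_text, bias = 0.0, 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        g_visual = g_text = g_bias = 0.0
        for (v, t), y in zip(scores, labels):
            error = sigmoid(w_visual * v + w_text * t + bias) - y
            g_visual += error * v
            g_text += error * t
            g_bias += error
        w_visual -= lr * g_visual / n
        w_text -= lr * g_text / n
        bias -= lr * g_bias / n
    return w_visual, w_text, bias

def fuse(params, visual_confidence, text_confidence):
    """Final probability that the object is present."""
    w_visual, w_text, bias = params
    return sigmoid(w_visual * visual_confidence + w_text * text_confidence + bias)

# Made-up validation confidences from the two SVMs, with ground-truth labels.
val_scores = [(1.2, 0.8), (0.5, 1.5), (-0.7, -0.2), (-1.0, -1.3), (0.3, -0.9), (-0.4, 0.9)]
val_labels = [1, 1, 0, 0, 0, 1]
params = train_fusion(val_scores, val_labels)
```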

RESULTS

PASCAL VOC 2006: 10 object categories.

PASCAL VOC 2007: 20 object categories.

Performance is measured quantitatively using AUC (area under the ROC curve) for the 2006 dataset and AP (average precision) for the 2007 dataset.

150 nearest neighbor images are used in all experiments.
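For reference, both metrics can be computed from classifier scores as sketched below. The AP here is the plain, non-interpolated form; the official PASCAL VOC 2007 protocol uses an interpolated variant, so this sketch would not reproduce the reported numbers exactly. Function names and the toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
```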

PERFORMANCE METRICS

Performance of text features built with different visual features.

Effects of combining text and visual classifiers.

Effects of varying the number of training images.

Performance of the text features built with a varying number of internet images.

Effects of category names.

For the 2006 dataset: the text classifier outperforms GIST KNN for each feature. Unified is the best among all. Combination(V) etc. are obtained by training a logistic regression classifier on the validation dataset using the confidence values returned by the individual classifiers.

VARYING NUMBER OF AUXILIARY IMAGES

EXCLUDING CATEGORY NAMES

QUESTIONS???
