On Sparse Representations for
Scalability in Image Pattern Matching
SIAM, PPSC-2012
Parallel Processing and Scientific Computing
Karl Ni, [email protected], MIT Lincoln Laboratory
22 September 2011
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Motivation
  – Image pattern recognition
  – Image data collection capabilities
• Problems and Challenges
• Training an Image Database
• Results
• Conclusions
Applying Semantic Understanding of Images
• What can a computer understand? Who? What? When? Where?
• Pipeline: Training Data → Feature Extraction → Matching & Association → Classifier Decision
• Inputs: query by example, query by sketch, statistical models
• Computer vision algorithms:
  – Image retrieval
  – Robotic navigation
  – Semantic labeling
  – Image sketch
  – Structure from Motion
• Requires: some prior knowledge
Where can we get training sets?
• Open source and image collection capabilities
• "Kullen.Net"
• UW vs. Cornell's "PhotoCity" competition
• The growth of Flickr: 2 billionth image (Nov 13, 2007), 3 billionth image (Nov. 8, 2008), 4 billionth image (Oct 9, 2009)
  – "Still, it's a staggering number of photos for a site that launched in 2004" -- TechCrunch
• Flickr "trounced by" Facebook
  – 15 billion photos
  – In Nov 2008, "2 billion photos each month"
Outline
• Motivation
• Problems and Challenges
  – Current Techniques
  – Computational Issues
• Training an Image Database
• Results
• Conclusions
Specialized Content Detectors
• Face detection and recognition: mostly done
• Generic object detection: not so much
• Computation for these algorithms (e.g., deformable parts):
  – Rely on multiple-instance learning (can be a considerable # of instances)
  – Parse the entire image for relevant features
  – Serial in computation
  – Rely on false-alarm rejection to reduce computation
  – The feature space is exceedingly complex
Detectors for Every Object?
• Parallelizable, but still poor algorithmic performance
• Say you have 10 very good detectors (~5% FA rate each):
  – You still have a large image to classify at different scales/orientations
  – 10 detectors at a 5% FA rate each compound to roughly a 40% FA rate (1 − 0.95^10 ≈ 0.40)!
  – These classifiers don't know anything about their surroundings: people can't be flying or walking on billboards!
• Labeled scene objects: 1. Chair, 2. Table, 3. Road, 4. Road, 5. Table, 6. Car, 7. Keyboard
• We use context to make inferences about an image
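The ~40% figure comes from compounding independent false alarms rather than summing them; a quick sanity check (illustrative, not from the talk):

```python
# Sanity check of the slide's ~40% figure: with 10 independent
# detectors at a 5% per-detector false-alarm (FA) rate, the chance
# that at least one detector false-alarms compounds well beyond 5%.
def compound_fa_rate(per_detector_fa, n_detectors):
    """P(at least one detector false-alarms), assuming independence."""
    return 1.0 - (1.0 - per_detector_fa) ** n_detectors

rate = compound_fa_rate(0.05, 10)
print(f"{rate:.1%}")  # 40.1%
```

Note that a naive 10 × 5% = 50% overstates the rate slightly; the independence assumption gives 1 − 0.95^10 ≈ 40%.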
Outline
• Motivation
• Problems and Challenges
• Training an Image Database
  – What are the "best" features?
  – How to automatically train for the best features
  – Automated choice in hierarchical GMMs
  – A better option: optimizing for sparsity
  – Parallel processing in sparse feature finding
• Results
• Conclusions
Finding the Features of an Image
• Problems in image pattern matching:
  – Each image = 10 million pixels!
  – Most dimensions are irrelevant
  – Multiple concepts inside the image
• Pipeline: Feature Extraction → Training / Classifier
• Features are a quantitative way for machines to understand an image
• Image property → feature technique:
  – Local color → luma + chroma components
  – Object texture → Fourier domain / wavelets
  – Shape → curvelets, shapelets
  – Lower-level gradients → wavelets (Haar, Daubechies)
  – Higher-level descriptors → SIFT/SURF/etc.
  – Overall image descriptors → GIST
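As a concrete illustration of one entry in this list, here is a minimal sketch of extracting block-DCT coefficients as texture features (the function names and block size are my choices, not the talk's pipeline; DCT features reappear later in the computational results):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II transform matrix (rows = frequencies)."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] *= np.sqrt(1.0 / n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

def block_dct_features(img, block=8):
    """Stack the 2D DCT coefficients of each block into feature vectors."""
    d = dct_matrix(block)
    h, w = (s - s % block for s in img.shape)  # drop ragged edges
    feats = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = img[i:i + block, j:j + block]
            feats.append((d @ patch @ d.T).ravel())
    return np.array(feats)

img = np.random.rand(32, 32)
f = block_dct_features(img)
print(f.shape)  # (16, 64): 16 blocks, 64 DCT coefficients each
```

Because the transform is orthonormal, the features preserve the image's energy per block, which makes them a faithful low-level texture representation.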
Numerous features: only a subset is relevant
• Scene example 1, features are:
  – Red bricks on multiple buildings
  – Small hedges, etc.
  – Windows of a certain type
  – The types of buildings present
• Scene example 2, features are:
  – Arches and white buildings
  – Domes and ancient architecture
  – Older/speckled materials (higher-frequency image content)
• Scene example 3, features are:
  – More suburb-like
  – Larger roads
  – Drier vegetation
  – Shorter houses
• Choice of features requires looking at multiple semantic concepts defined by entities and attributes inside of images
Environment is relevant
• Feature invariance in images is necessary for most concepts:
  – to transformations (e.g., 3D rotation, translation, scale)
  – to dynamic content (e.g., deformable parts)
  – to various contexts (e.g., illumination at different times of day)
  – to different instances
• Some features (e.g., SIFT) acquire some of these attributes, but only to a certain extent:
  – robust to viewpoint changes of up to ~30 degrees (SIFT)
  – many times they don't match
  – partial illumination invariance
• A collective group of features is necessary (boosting/blending):
  – A large set of features makes training/classification more complex
  – Training is very difficult (feature extraction & training/classification)
Getting the Right Features
• Tools to hand-label concepts, 2006–2011:
  – Google Image Labeler
  – Kobus's Corel Dataset
  – MIT LabelMe
  – Yahoo! Games
• Problems:
  – Tedious
  – Time consuming
  – Incorrect
  – Very low throughput
• Face detection: consider the time it took to collect all the data
• Feature selection: currently an active area of research
Automatically Learn the Best Features
• Computational complexity is high:
  – Feature extraction is a difficult problem
  – Classifiers are difficult problems
  – Complexity is traditionally passed between the two: with more knowledge of the domain/model, rely on better feature extractors; with less knowledge, rely (unfortunately) on complexity in discriminant methods
• Would like to feed in the entire image:
  – No need to manually segment images
  – Feeding in noise will learn the context (info about surroundings)
  – Learn multiple instances of a concept (build invariance through example)
  – Massively parallel per image per class
• Take several features, and subselect the "best" ones
• Diagram: training images from Image Class 1 through Image Class N, each entire image feeding Distribution 1 through Distribution N
• Automatic feature subselection has been submitted to SSP 2012
How do we do it?
• Lots of work in the 1990s:
  – Conditional probabilities through large training data sets
  – Vasconcelos et al.'s work on semantic image retrieval
  – Primarily based on multiple-instance learning and noisy density estimation
• Learning multiple instances of an object (no-noise case)
• Robustness to noise through the law of large numbers:
  – Hope to integrate it out
  – Although the area of red boxes per instance is small, their aggregate over all instances is dominant
  – Noise, if uncorrelated, will become more and more sparse
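The law-of-large-numbers argument can be seen in a toy experiment (data and sizes are mine): a recurring object pattern survives averaging over instances, while uncorrelated per-instance noise shrinks toward zero.

```python
import numpy as np

# Toy version of the law-of-large-numbers argument: a recurring object
# pattern survives averaging over many instances, while uncorrelated
# per-instance noise averages out.
rng = np.random.default_rng(0)
signal = np.zeros(100)
signal[40:60] = 1.0                       # the recurring object pattern

instances = signal + rng.normal(0.0, 1.0, size=(500, 100))
avg = instances.mean(axis=0)

object_level = avg[40:60].mean()          # region containing the object
noise_level = np.abs(avg[:40]).mean()     # region with noise only
print(round(object_level, 2), round(noise_level, 2))
```

With 500 instances the residual noise amplitude is on the order of 1/√500 of its per-instance level, while the object stays at full strength.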
Parallel Calculations through Mixture Hierarchies
• Statistical distributions:
  – Generative methods represent millions of points by a few parameters
• Mixture hierarchies can be incrementally trained: lower-level GMMs (one per image: image 1, image 2, image 3, …) can be fit in parallel and then merged into a top-level GMM
• Problems with HGMMs:
  – Extensive computational process to bring the hierarchies together
  – Difficult to train, as each level requires an initialization point
  – Must specify the number of classes at each level for initialization
Finding a sparse basis set
• Gaussian mixtures as a density estimate:
  – Non-convex / sensitive to initialization
  – Iterative and very slow
  – Small-sample bias is large
• Think discriminantly:
  – Instead of: generating centroids that represent images
  – Think: prune features to eliminate redundancy
• Sparsity optimization:
  – Solve directly for the features that we want to use
  – Induces less complexity and, as we will see, is an LP problem
  – Reduction of redundancy is intuitive, not generative
• Under normalization, the GMM classifier can be implemented with a matched filter instead: after normalizing x and the y_i, argmin_{i∈{1,…,C}} ‖x − y_i‖²₂ is equivalent to argmax_{i∈{1,…,C}} ⟨x, y_i⟩
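The matched-filter equivalence is easy to verify numerically (dimensions and data are arbitrary): for unit-norm vectors, ‖x − y_i‖² = 2 − 2⟨x, y_i⟩, so minimizing distance and maximizing the inner product pick the same template.

```python
import numpy as np

# Numerical check of the matched-filter equivalence: for unit-norm
# vectors, ||x - y_i||^2 = 2 - 2<x, y_i>, so the nearest template under
# squared distance is the one with the largest inner product.
rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

x = unit(rng.normal(size=8))          # normalized query vector
ys = unit(rng.normal(size=(5, 8)))    # 5 normalized class templates

by_distance = np.argmin(((x - ys) ** 2).sum(axis=1))
by_matched_filter = np.argmax(ys @ x)
print(by_distance == by_matched_filter)  # True
```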
Finding sparsity with linear programming
• Gaussian Mixture Models, solved via EM (a non-convex optimization problem):
  – min over {µ_1, …, µ_M, π_1, …, π_M} of −Σ_{j=1}^{N} log Σ_{m=1}^{M} π_m p(x_j | µ_m)
  – Exponential according to N; each iteration is O(MNd²)
• Many optimization problems (compressed sensing, Group Lasso) induce sparsity
• Max-Constraint Optimization:
  – Matched-filter constraint: argmax_β [ tr(XᵀXβ) + λ‖1 − β‖²₂ ], such that β_ij ∈ {0,1} and βᵀ1 = 1
  – Relaxation of the constraints gives an LP optimization problem: argmin_β [ −tr(XᵀXβ) + λ Σ_i t_i ], such that 0 ≤ β_ij ≤ t_i ≤ 1 and βᵀ1 = 1
  – Faster than EM; faster than G-Lasso
  – Independent of dimensionality!
  – Convex (unlike the matched-filter optimization and GMM/EM)
  – On average, scales according to N²
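One way the relaxed problem could be posed for an off-the-shelf LP solver (a hedged sketch, not the talk's implementation: the variable layout, λ value, and toy data are my choices). Near-duplicate feature columns make the Gram matrix G = XᵀX strongly correlated, and the LP selects one representative per redundant pair:

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the relaxed feature-selection LP:
#   min_{beta, t}  -tr(G beta) + lam * sum_i t_i
#   s.t. 0 <= beta_ij <= t_i <= 1, and each column of beta sums to 1,
# where G = X^T X for normalized feature columns. Large t_i marks
# feature i as a selected representative.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.01 * rng.normal(size=50),
                     base[:, 1], base[:, 1] + 0.01 * rng.normal(size=50)])
X /= np.linalg.norm(X, axis=0)        # two near-duplicate feature pairs
G = X.T @ X
n, lam = G.shape[0], 0.5

# Variable vector: [beta.ravel() (n*n entries), t (n entries)]
c = np.concatenate([-G.T.ravel(), np.full(n, lam)])
# Inequalities: beta_ij - t_i <= 0
A_ub = np.zeros((n * n, n * n + n))
for i in range(n):
    for j in range(n):
        A_ub[i * n + j, i * n + j] = 1.0
        A_ub[i * n + j, n * n + i] = -1.0
b_ub = np.zeros(n * n)
# Equalities: each column of beta sums to 1
A_eq = np.zeros((n, n * n + n))
for j in range(n):
    A_eq[j, j:n * n:n] = 1.0
b_eq = np.ones(n)
bounds = [(0, 1)] * (n * n + n)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
t = res.x[n * n:]
print(np.round(t, 2))  # large entries mark the selected representatives
```

With two near-duplicate pairs, the solver keeps one representative from each pair; λ trades off reconstruction (the trace term) against the number of retained features.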
Intuition
• β* = argmin_β [ −tr(XᵀXβ) + λ Σ_i t_i ], s.t. 0 ≤ β_ij ≤ t_i ≤ 1 and βᵀ1 = 1
• Relies on the covariance (Gram) matrix concept: redundant features show up as near-1 off-diagonal entries of XᵀX (the slide's 4×4 example has strongly correlated feature pairs with entries near 0.98 and 0.95)
• Each entry in row i of β is bounded by its own threshold t_i (… < t1, < t2, < t3, < t4); driving t_i to zero removes feature i
• The actual implementation does not form the covariance matrix, but rather keeps track of the β indices
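The index-tracking point can be illustrated with a greedy sketch (the threshold and data are mine, not the talk's): near-duplicate columns show up as near-1 inner products, so redundant features can be pruned by index without ever holding the full Gram matrix.

```python
import numpy as np

# Toy version of the Gram-matrix intuition: near-duplicate feature
# columns have near-1 inner products, so they can be pruned by tracking
# indices instead of forming the full covariance matrix at once.
rng = np.random.default_rng(0)
f = rng.normal(size=(100, 3))
X = np.column_stack([f, f[:, 0] + 1e-3 * rng.normal(size=100)])  # col 3 ~ col 0
X /= np.linalg.norm(X, axis=0)

keep = []
for j in range(X.shape[1]):            # track kept indices, not a matrix
    if all(abs(X[:, j] @ X[:, k]) < 0.95 for k in keep):
        keep.append(j)
print(keep)  # [0, 1, 2]: the duplicate column 3 is pruned
```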
Outline
• Motivation
• Problems and Challenges
• Training an Image Database
• Results
• Conclusions
LP Feature Learning versus G-Lasso
• More intuitive grouping:
  – Threshold learning is unnecessary
  – Post-processing is unnecessary
• 5.452% more accurate on +1/−1 learning classes
• 80.054% faster than GMMs
Classifying Texture
[Figure: three panels — Original Image, Decisions, and Decision Confidence]
Complexity and Confusion Matrix
• 1400 images per dataset
• Filter reduction to 356 filters per class
• Less than a minute of classification time
• Coverage of cities: entire cities (Vienna, Dubrovnik, Lubbock), portion of Cambridge (MIT-Kendall)

Confusion matrix (rows = testing dataset, columns = training dataset):

| Testing \ Training | MIT-Kendall | Vienna | Dubrovnik | Lubbock |
| MIT-Kendall        | 0.975       | 0.056  | 0.024     | 0.102   |
| Vienna             | 0.050       | 0.896  | 0.035     | 0.060   |
| Dubrovnik          | 0.015       | 0.024  | 0.905     | 0.057   |
| Lubbock            | 0.097       | 0.002  | 0.053     | 0.901   |
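The diagonal of the matrix holds each dataset's self-classification rate, so the average per-city accuracy can be read off directly:

```python
import numpy as np

# Average per-city accuracy from the confusion matrix above:
# the diagonal holds each dataset's self-classification rate.
cm = np.array([[0.975, 0.056, 0.024, 0.102],   # MIT-Kendall
               [0.050, 0.896, 0.035, 0.060],   # Vienna
               [0.015, 0.024, 0.905, 0.057],   # Dubrovnik
               [0.097, 0.002, 0.053, 0.901]])  # Lubbock
print(round(np.trace(cm) / 4, 3))  # 0.919
```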
Computational Results & Accuracy
• Fixed iterations and fixed k for the GMMs
• Best initialization via k-means (not included in the timed optimization)
[Figure: computation time (log10 min) and MSE versus log10 of the number of DCT features, comparing GMMs against the β optimization ("Beta Opt")]
Interesting automatic semantic learning result
Conclusions
• Training in computer vision is troublesome:
  – Big data
  – Feature extraction
  – Non-automated processes
• Statistical characterization reduces complexity
• Redundancy arbitration achieves savings
• Feature selection through linear programming produces gains in computation time
Contributors and Acknowledgements
• MIT Lincoln Laboratory: Karl Ni, Nicholas Armstrong-Crews, Scott Sawyer, Nadya Bliss
• MIT: Katherine L. Bouman
• Boston University: Zachary Sun
• Northeastern University: Alexandru Vasile
• Cornell University: Noah Snavely
Questions?
Between-Class Training (Car Class vs. Buffalo Class)
• There's sky in both of these image classes
• Sky is a feature descriptive of most situations where you would find a car or a buffalo
• Simply discard the features that are common to both classes
• Keep the most discriminative features
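The idea above can be sketched with toy feature responses (the data, feature names, and threshold are mine): score each feature by how differently it responds across the two classes, and drop features like "sky" whose responses are common to both.

```python
import numpy as np

# Toy between-class feature pruning: keep features whose class means
# differ by more than their spread; shared features (e.g., "sky") are
# dropped, discriminative ones are kept.
rng = np.random.default_rng(0)
n = 200
# Per-image feature responses: feature 0 ("sky") fires for both classes;
# features 1 ("car-like") and 2 ("buffalo-like") fire for one class each.
car = np.column_stack([rng.normal(1.0, 0.1, n),
                       rng.normal(1.0, 0.1, n),
                       rng.normal(0.0, 0.1, n)])
buffalo = np.column_stack([rng.normal(1.0, 0.1, n),
                           rng.normal(0.0, 0.1, n),
                           rng.normal(1.0, 0.1, n)])

gap = np.abs(car.mean(axis=0) - buffalo.mean(axis=0))
spread = car.std(axis=0) + buffalo.std(axis=0)
discriminative = np.where(gap > 2 * spread)[0]
print(discriminative)  # [1 2]: the shared "sky" feature 0 is dropped
```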
Abstract
Training an image class is most efficiently done with a small number of descriptive yet discriminative features. Features are often manually handpicked from subsets of imagery or machine-generated feature extractors. It is beneficial to automatically discard irrelevant features and retain the most representative ones. Determining the best features to use is an inherently difficult and computationally taxing process. Such a methodology would allow training large-scale datasets quickly, in parallel, and without human aid. We overview an automated technique in image pattern matching that uses sparse optimization constraints to select the best subset of large amounts of feature data.
Can you tell what is in this picture?
Courtesy A. Torralba
Context in processing is important
Courtesy A. Torralba