From Image Analysis to Content Extraction:

Are We There Yet?

Tsuhan Chen, Carnegie Mellon University

Pittsburgh, [email protected]

Tsuhan Chen

A Journey of 10+ Years

• Multimedia Signal Processing (MMSP) Technical Committee

– Founding Chair 1996~1999

• MMSP Workshops

– Princeton 1997, Los Angeles 1998, Copenhagen 1999, Cannes 2001, St. Thomas 2002, Siena 2004, Shanghai 2005, Victoria 2006…

• IEEE Transactions on Multimedia

– Editor-in-Chief: 2002~2004

• International Conference on Multimedia and Expo (ICME)

– New York 2000, Tokyo 2001, Lausanne 2002, Baltimore 2003, Taipei 2004, Amsterdam 2005, Toronto 2006, Beijing 2007…

• IEEE Fellow, 2007~, “…multimedia signal processing”

• IEEE Distinguished Lecturer, 2007~2008

Signal vs. Content

[Baker and Kanade]

What is “content”?

Number of all possible 16×12 images $= 2^{16 \times 12 \times 8} \gg 30 \times 60 \times 60 \times 24 \times 365 \times \text{human history} \times \text{world population}$

“Content” is based on signals, i.e., prior, statistics, data-driven…
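As a rough check of the counting argument above, the sketch below compares the number of 16×12 images at 8 bits per pixel with the number of frames all people could have seen at 30 frames per second. The values for the length of human history and the world population are hypothetical placeholders chosen only for illustration; they are not from the slides.

```python
# Size of the space of 16x12 images with 8 bits per pixel.
num_images = 2 ** (16 * 12 * 8)

# Hypothetical bounds, chosen only for illustration (not from the slides).
YEARS_OF_HUMAN_HISTORY = 10_000
WORLD_POPULATION = 10_000_000_000

# Frames seen if every person watched 30 frames per second for all of history.
frames_per_year = 30 * 60 * 60 * 24 * 365
images_seen = frames_per_year * YEARS_OF_HUMAN_HISTORY * WORLD_POPULATION

print("bits needed for the image count:", num_images.bit_length())
print("decimal digits of frames seen:", len(str(images_seen)))
print("image space is larger:", num_images > images_seen)
```

Even with far more generous placeholder values, the image space stays larger by hundreds of orders of magnitude, which is why a data-driven prior is needed.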

Thoughts

• “The most compelling shapes are those near to our hearts: people’s faces, a gracefully moving body, a natural scene with rustling leaves and flowing water. Evolution has tuned us to these sights…”

[Lengyel, 1998]

• How do we see such “objects of interest”?

• Content extraction is more than processing bits…it’s signal processing + statistical learning

[Chen, 2007]

Sample Projects in Content Retrieval

Beyond digital images/videos…

Trademark Retrieval [Leung&Chen ICME’02]

Hand-Drawn Query

Retrieved Trademarks

Sketch Retrieval

User sketches a query…

Query Sketch

Similar Sketch

Page Stored in Database

[Leung&Chen ICME’03]

3D Object Retrieval [Zhang&Chen ACM MM’01]

3D Protein Retrieval [Chen&Chen ICIP’02]

Object Discovery

Object Discovery ≠ Object Detection

Object Detection

Training Data (Labeled)

Test Data

[BioID Face Database]

[Caltech Face+Background Dataset]

Discover = Categorize + Localize

How did we do that?

[UIUC Car Dataset]

Discover = Categorize + Localize

How did we do that?

Discovering Objects in Video

Discover = Categorize + Localize [YouTube/Google Video]

How did we do that?

The Approach

Feature Extraction

Statistical Learning

Feature Extraction

Maximally Stable Extremal Regions (MSER) [Matas et al., 02]

“Patch”

Scale Invariant Feature Transform (SIFT) [Lowe, 04]

• Robust to viewpoint, illumination, blurring, rotation, and scale changes

Quantization into Visual Words

128-dim SIFT features → K-means → Discrete symbols (Visual Words)

[Leung and Malik, 01]

Every image becomes a bag of words…
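The quantization step can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the talk: the 4-dimensional toy vectors stand in for 128-dim SIFT descriptors, and the helper names (`kmeans`, `bag_of_words`) and all parameter values are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centers):
    """Index of the closest center, i.e. the visual word for this descriptor."""
    return min(range(len(centers)), key=lambda k: sq_dist(vec, centers[k]))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns the codebook of k centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centers)].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bag_of_words(descriptors, centers):
    """Histogram of visual-word counts for one image."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1
    return hist

# Toy descriptors: three loose groups in 4 dimensions (stand-ins for SIFT).
rng = random.Random(1)
descriptors = [[rng.gauss(c, 0.1) for _ in range(4)]
               for c in (0.0, 1.0, 2.0) for _ in range(10)]
codebook = kmeans(descriptors, k=3)
histogram = bag_of_words(descriptors, codebook)
print(histogram)
```

After this step each patch is represented only by its word index w, which is what the statistical model consumes.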

Statistical Learning

Feature Extraction

Statistical Learning

Single Image

Collection of Images

Video

Goal

• Label each patch as background or object of interest

Figure labels: “Location” r = (200, 200) and r = (300, 100); “Appearance” w = w2 and w = w3; label z = object of interest or z = background.

Probabilistic Model

Image Characteristic p(z): 0.7 (z1), 0.3 (z2)

Location Semantics p(r|z): p(r|z1) Gaussian, p(r|z2) uniform

Topic Appearance p(w|z):

      z1    z2
w1    0.4   0.1
w2    0.4   0.0
w3    0.2   0.9

$$p(z, r, w) = p(z)\,p(r, w \mid z) = p(z)\,p(r \mid z)\,p(w \mid z)$$

r: Location, w: Appearance, z: Obj/Bg

Posterior Probability

Figure labels: r = (300, 100), w = w3; r = (200, 200), w = w2

$$p(z \mid r, w) = \frac{p(z, r, w)}{\sum_{z} p(z, r, w)} = \frac{p(z)\,p(r \mid z)\,p(w \mid z)}{\sum_{z} p(z)\,p(r \mid z)\,p(w \mid z)}$$

$$\hat{z} = \arg\max_{z}\, p(z \mid r, w)$$

Posterior Probabilities ~ (Soft) Labels
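The posterior computation can be sketched in a few lines. The probability values below are illustrative only: their assignment to z1 and z2, the Gaussian mean and spread, and the image size are all assumptions made for this example, not parameters reported in the talk.

```python
import math

# Illustrative model parameters (assumed for this sketch).
P_Z = {"z1": 0.7, "z2": 0.3}
P_W_GIVEN_Z = {
    "z1": {"w1": 0.4, "w2": 0.4, "w3": 0.2},
    "z2": {"w1": 0.1, "w2": 0.0, "w3": 0.9},
}
OBJ_MEAN = (200.0, 200.0)     # assumed center of the object's location Gaussian
OBJ_STD = 50.0                # assumed isotropic standard deviation, in pixels
IMAGE_AREA = 400.0 * 400.0    # assumed image size for the uniform background

def p_r_given_z(r, z):
    """Location term: Gaussian for z1, uniform over the image for z2."""
    if z == "z2":
        return 1.0 / IMAGE_AREA
    dx, dy = r[0] - OBJ_MEAN[0], r[1] - OBJ_MEAN[1]
    norm = 2.0 * math.pi * OBJ_STD ** 2
    return math.exp(-(dx * dx + dy * dy) / (2.0 * OBJ_STD ** 2)) / norm

def posterior(r, w):
    """p(z | r, w) obtained by normalizing p(z) p(r|z) p(w|z) over z."""
    joint = {z: P_Z[z] * p_r_given_z(r, z) * P_W_GIVEN_Z[z][w] for z in P_Z}
    total = sum(joint.values())
    return {z: value / total for z, value in joint.items()}

post_a = posterior((200.0, 200.0), "w2")
post_b = posterior((300.0, 100.0), "w3")
print(post_a, max(post_a, key=post_a.get))
print(post_b, max(post_b, key=post_b.get))
```

Taking the arg max gives a hard label; keeping the full dictionary gives the soft label used in the estimation steps.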

Only half of the story…

p(z), p(w|z), p(r|z)

p(z|r, w)

r: Location, w: Appearance, z: Obj/Bg

Estimate Image Characteristic

How to estimate p(z):

• If label is known: $p(z = z_1) = 1/2$

• If p(z|w, r) is known: $p(z = z_1) = \dfrac{\frac{1}{4} + \frac{3}{4}}{2} = \dfrac{1}{2}$

Figure label: z = z1


Estimate Topic Appearance

How to estimate p(w|z):

• If label is known: $p(w = w_1 \mid z = z_1) = \dfrac{1/2}{1/2} = 1$

• If p(z|w, r) is known: $p(w = w_1 \mid z = z_1) = \dfrac{\left(\frac{3}{4} + 0\right)/2}{1/2} = \dfrac{3}{4}$

Figure labels: w1, w2; z = z1
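The two cases above (known labels, known posteriors) reduce to the same soft-count formula, with a known label treated as a posterior of 0 or 1. The sketch below assumes a toy image with two patches whose posteriors are chosen to reproduce the fractions on the slides; the function names are hypothetical.

```python
from fractions import Fraction

def estimate_p_z1(patches):
    """p(z = z1): average posterior weight over all patches."""
    return sum(post for _, post in patches) / len(patches)

def estimate_p_w_given_z1(patches, word):
    """p(w | z = z1): posterior weight on `word` divided by total z1 weight."""
    weight_for_word = sum(post for w, post in patches if w == word)
    total_weight = sum(post for _, post in patches)
    return weight_for_word / total_weight

# Each patch is (visual word, p(z = z1 | w, r)).
soft_patches = [("w1", Fraction(3, 4)), ("w2", Fraction(1, 4))]
hard_patches = [("w1", Fraction(1)), ("w2", Fraction(0))]

print(estimate_p_z1(soft_patches), estimate_p_w_given_z1(soft_patches, "w1"))
print(estimate_p_z1(hard_patches), estimate_p_w_given_z1(hard_patches, "w1"))
```

Exact fractions are used only so the numbers can be compared directly with the worked examples.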


Estimate Location Semantics

How to estimate mean and var of p(r|z = z1):

• If label is known

• If p(z|w, r) is known
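For the location term, the same idea gives a weighted mean and variance, where each patch position is weighted by its posterior (or by a 0/1 label). This is a minimal sketch assuming an axis-aligned Gaussian with independent x and y variances; the slides do not state the covariance structure, and the function name is hypothetical.

```python
def weighted_gaussian_fit(points, weights):
    """Weighted mean and per-axis variance of 2-D points.

    weights[i] plays the role of p(z = z1 | w, r) for patch i,
    or a 0/1 indicator when labels are known.
    """
    total = sum(weights)
    mean = tuple(
        sum(w * p[i] for p, w in zip(points, weights)) / total
        for i in range(2)
    )
    var = tuple(
        sum(w * (p[i] - mean[i]) ** 2 for p, w in zip(points, weights)) / total
        for i in range(2)
    )
    return mean, var

points = [(0.0, 0.0), (2.0, 4.0)]
print(weighted_gaussian_fit(points, [0.5, 0.5]))  # soft labels
print(weighted_gaussian_fit(points, [1.0, 0.0]))  # hard labels
```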

An Iterative Algorithm

p(z), p(w|z)

Location Estimation: p(r|z)

p(z|r, w)


Can start anywhere, can seed anyhow…
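The alternation between parameter estimates and posteriors can be sketched as an EM-style loop. Everything specific here is an assumption for illustration: the axis-aligned Gaussian, the variance floor `min_var`, the smoothing constant `eps`, the alternating seed, and the toy patches are not taken from the talk.

```python
import math

def gaussian(r, mean, var):
    """Axis-aligned 2-D Gaussian density (diagonal covariance is assumed)."""
    density = 1.0
    for x, m, v in zip(r, mean, var):
        density *= math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    return density

def discover(patches, words, area, iters=30, min_var=1.0, eps=1e-6):
    """patches: list of (r, w). Returns p(z = z1 | r, w) for every patch."""
    n = len(patches)
    post = [0.6 if i % 2 == 0 else 0.4 for i in range(n)]  # arbitrary seed
    for _ in range(iters):
        # Re-estimate p(z), p(w|z) and p(r|z1) from the current soft labels.
        obj = sum(post)
        p_z1 = min(max(obj / n, eps), 1 - eps)
        p_w_obj = {w: eps for w in words}
        p_w_bg = {w: eps for w in words}
        for (_, w), q in zip(patches, post):
            p_w_obj[w] += q
            p_w_bg[w] += 1 - q
        norm_obj, norm_bg = sum(p_w_obj.values()), sum(p_w_bg.values())
        mean = [sum(q * r[i] for (r, _), q in zip(patches, post)) / (obj + eps)
                for i in range(2)]
        var = [max(sum(q * (r[i] - mean[i]) ** 2
                       for (r, _), q in zip(patches, post)) / (obj + eps), min_var)
               for i in range(2)]
        # Re-estimate the soft labels p(z | r, w) from the updated model.
        new_post = []
        for r, w in patches:
            a = p_z1 * gaussian(r, mean, var) * p_w_obj[w] / norm_obj
            b = (1 - p_z1) * (1.0 / area) * p_w_bg[w] / norm_bg
            new_post.append(a / (a + b))
        post = new_post
    return post

# Toy image: a compact group of patches plus scattered ones.
patches = [((195, 205), "w2"), ((200, 198), "w2"), ((205, 202), "w2"),
           ((198, 200), "w1"), ((30, 350), "w3"), ((350, 40), "w3"),
           ((300, 100), "w3"), ((60, 80), "w3"), ((380, 390), "w1")]
posteriors = discover(patches, words=["w1", "w2", "w3"], area=400.0 * 400.0)
for (r, w), q in zip(patches, posteriors):
    print(r, w, round(q, 3))
```

Like any EM-style procedure, the result depends on the seed and can settle in a local optimum, so the printed labels should be read as one possible outcome rather than a guaranteed one.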

Collection of Images

p(z|d) over z1, z2: 0.4, 0.6 and 0.8, 0.2 for d1, d2

p(r|z1, d1), p(r|z1, d2)

p(w|z):

      z1    z2
w1    0.4   0.1
w2    0.4   0.0
w3    0.2   0.9

Single image:

$$p(z, r, w) = p(z)\,p(r \mid z)\,p(w \mid z)$$

Collection of images:

$$p(z, r, w \mid d) = p(z \mid d)\,p(r \mid z, d)\,p(w \mid z, d) = p(z \mid d)\,p(r \mid z, d)\,p(w \mid z)$$

r: Location, w: Appearance, z: Obj/Bg, d: Image

p(z|d), p(w|z)

Location Estimation: p(r|z, d)

p(z|r, w, d)


Same as before, but location/characteristics are image-dependent

An Example

[Caltech Face+Background Dataset]

Location Semantics: p(r|z = z1, d)

Topic Appearance: p(w|z = z1), p(w|z = z2)

Posterior: p(z|r, w, d)


Video ≠ Collection of Images

Time

Smooth trajectory expected

Motion Information

Figure label: $\nu^{(i)}$

$$\nu \equiv \sum_{i} \beta^{(i)} \nu^{(i)}$$

$$\beta^{(i)} \propto N(\nu^{(i)} \mid 0, S)$$

[Bar-Shalom, 80]

$$\nu \equiv \sum_{i} \beta^{(i)} \nu^{(i)}$$

$$\beta^{(i)} \propto N(\nu^{(i)} \mid 0, S)$$

$$\beta^{(i)} \propto N(\nu^{(i)} \mid 0, S)\; p(z = z_1 \mid w^{(i)}, r^{(i)}, d)$$

[Bar-Shalom, 80]
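The weighted innovation can be sketched for a scalar state. This is a simplified illustration: the scalar setting, the normalization of the weights over the observed patches only, and the function names are assumptions, not details given in the talk.

```python
import math

def gaussian_1d(x, var):
    """Zero-mean scalar Gaussian density N(x | 0, var)."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def combined_innovation(innovations, object_posteriors, S):
    """Weights beta_i proportional to N(nu_i | 0, S) * p(z = z1 | w_i, r_i, d).

    Returns the normalized weights and the combined innovation
    nu = sum_i beta_i * nu_i.
    """
    raw = [gaussian_1d(nu, S) * p for nu, p in zip(innovations, object_posteriors)]
    total = sum(raw)
    betas = [x / total for x in raw]
    nu = sum(b * n for b, n in zip(betas, innovations))
    return betas, nu

# Two patches on opposite sides of the prediction.
print(combined_innovation([1.0, -1.0], [0.5, 0.5], S=4.0))
# Same geometry, but only the first patch looks like the object.
print(combined_innovation([1.0, -1.0], [1.0, 0.0], S=4.0))
```

The second call shows the effect of the appearance term: a patch with zero object posterior no longer pulls the location estimate.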

Figure labels: $\nu$, $s^{-}$, $s^{+}$

$$\hat{s}^{+} = \hat{s}^{-} + W\nu$$

$$W = f\left(\Sigma_{\text{system}}, \Sigma_{\text{measurement}}, E\left[(s - \hat{s}^{+})^{2}\right]\right)$$
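The update above has the form of a Kalman filter correction. The slides only state that W is a function of the system and measurement covariances, so the sketch below substitutes the standard scalar Kalman gain and a constant-position motion model as assumptions; the function names are hypothetical.

```python
def kalman_update(s_prior, p_prior, nu, meas_var):
    """Scalar correction step: s+ = s- + W * nu.

    W is the textbook scalar Kalman gain, used here as an assumption.
    p_prior is the variance of the prior estimate s-.
    """
    W = p_prior / (p_prior + meas_var)
    s_post = s_prior + W * nu
    p_post = (1 - W) * p_prior
    return s_post, p_post

def kalman_predict(s_post, p_post, system_var):
    """Prediction for the next frame under an assumed constant-position model."""
    return s_post, p_post + system_var

s, p = kalman_update(s_prior=0.0, p_prior=1.0, nu=2.0, meas_var=1.0)
print(s, p)
print(kalman_predict(s, p, system_var=0.25))
```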

p(z|d), p(w|z)

Location Estimation + Motion Modeling: p(r|z, d)

p(z|w, r, d)

p(z|d), p(w|z)

Motion Modeling: p(r|z, d)

p(z|w, r, d)

• Knowledge of appearance improves location estimate

p(z|d), p(w|z)

Motion Modeling: p(r|z, d)

p(z|w, r, d)

• Knowledge of location improves appearance estimate

Applications

• Object localization

• Categorization

– Video skimming

• Keyframe extraction

– Video summarization

Input Video

CMU dataset

Comparison

APP+LOC+MOTION: p(z|d), p(w|z), p(r|z, d) with Motion Model; p(z|w, r, d)

APP+LOC: p(z|d), p(w|z), p(r|z, d) with Location Estim.; p(z|w, r, d)

APP: p(z|d), p(w|z); p(z|w, d) [Sivic et al. 05]

Localization

[CMU Dataset]

APP+LOC+MOTION

APP+LOC

APP

Categorization

• Top 40 frames out of 181, according to p(z = z1|d)

[YouTube/Google Video]
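Ranking frames by p(z = z1|d) is a simple sort. The sketch below assumes the per-frame probabilities are already available as a list; the function name and the sample values are hypothetical.

```python
def top_frames(p_object_given_frame, k):
    """Indices of the k frames with the largest p(z = z1 | d), best first."""
    order = sorted(
        range(len(p_object_given_frame)),
        key=lambda d: p_object_given_frame[d],
        reverse=True,
    )
    return order[:k]

scores = [0.1, 0.9, 0.5, 0.7]  # made-up p(z = z1 | d) for four frames
print(top_frames(scores, 2))
```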

Categorization

• Top 40 frames out of 711, according to p(z = z1|d)

Keyframe Extraction on YouTube


Keyframe Extraction – Our Result

181 frames. 2 frames/sec.

5 keyframes from top 40 frames, according to p(z = z1|d)


Keyframe Extraction on YouTube


Keyframe Extraction – Our Result

711 frames. 2 frames/sec.

5 keyframes from top 40 frames, according to p(z = z1|d)


Extensions

• Geometric Consistency

• Semi-supervised

• Multiple classes and instances

• Hierarchical semantics of objects

Geometric Consistency

– Background random, object consistent

– Matched patches more likely from object of interest

[Caltech-4 data set]

Geometric Consistency

Correspondence Info p(m|z):

# matches   z1     z2
0 ~ 5       0.36   0.97
6 ~ 10      0.37   0.02
11 ~ 15     0.20   0.01
> 16        0.07   0.00

$$p(z, w, r, m \mid d) = p(z \mid d)\,p(w \mid z)\,p(r \mid z, d)\,p(m \mid z)$$
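Because p(m|z) enters the joint as one extra factor, it can be folded into an existing posterior p(z = z1|w, r, d) by reweighting and renormalizing. The sketch below uses the table values with the column assignment read as z1 = object of interest, which is an interpretation of the slide; placing exactly 16 matches in the last bin is also an assumption, since the table leaves that case open.

```python
# p(m | z) per bin of match counts, as (p(m | z1), p(m | z2)).
P_M_GIVEN_Z = {
    "0-5": (0.36, 0.97),
    "6-10": (0.37, 0.02),
    "11-15": (0.20, 0.01),
    "16+": (0.07, 0.00),
}

def match_bin(num_matches):
    """Map a match count to its table bin (16 goes to the last bin by assumption)."""
    if num_matches <= 5:
        return "0-5"
    if num_matches <= 10:
        return "6-10"
    if num_matches <= 15:
        return "11-15"
    return "16+"

def posterior_with_matches(p_z1, num_matches):
    """Fold p(m | z) into an existing posterior p(z = z1 | w, r, d)."""
    p_m_z1, p_m_z2 = P_M_GIVEN_Z[match_bin(num_matches)]
    a = p_z1 * p_m_z1
    b = (1.0 - p_z1) * p_m_z2
    total = a + b
    return a / total if total > 0 else 0.0

for m in (3, 8, 12, 20):
    print(m, round(posterior_with_matches(0.5, m), 3))
```

Starting from an undecided patch (0.5), few matches push the posterior toward background and many matches push it toward the object, which matches the intuition stated on the previous slide.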

Semi-Supervised

• User provides limited information

– e.g., Label one frame

p(z|d), p(w|z)

Location Estimation: p(r|z, d)

pL(z|w, r, d), pU(z|w, r, d)

Multiple Classes and Instances

• Multiple classes

• Multiple instances of the same object class

– Parametric methods: Model selection with BIC [Schwarz 78], Variational Bayes [Attias 99]

– Nonparametric methods: Mean-shift [Comaniciu & Meer 01]

Hierarchical Semantics of Objects [Parikh&Chen CVPR’07]

Collection of images; Corresponding hSO

Figure labels: CHAIR, OFFICE, PHONE, MONITOR, KEYBOARD, computer, desk-area

Summary

• Probabilistic framework for object discovery

– Incorporate information from appearance / location / motion / geometry

– Multiple classes and multiple instances possible

– Unsupervised and semi-supervised possible

– Discovery of hierarchical semantics of objects

Finally…

Some Related Work

Camera Array

What can be done…

[EyeVision]

[CMU 3D Dome]

[CMU CamArray]

Beyond Camera Array: “Active Sensing”

Advanced Multimedia Processing Lab

Please visit us at: http://amp.ece.cmu.edu
