visuelle perzeption für mensch- maschine schnittstellen€¦ · comparison for each shape location...

Interactive Systems Laboratories, Universität Karlsr uhe (TH)Edgar Seemann, 04.12.09 1

Visuelle Perzeption für Mensch-Maschine Schnittstellen

Vorlesung, WS 2009

Prof. Dr. Rainer StiefelhagenDr. Edgar Seemann

Institut für AnthropomatikUniversität Karlsruhe (TH)

http://[email protected]

[email protected]


Computer Vision:

People Detection II

WS 2009/10

Dr. Edgar Seemann

[email protected]


Termine (1)

TBA21.12.2009

Stereo and Optical Flow18.12.2009

Termine Thema19.10.2009 Introduction, Applications

23.10.2009 Basics: Cameras, Transformations, Color

26.10.2009 Basics: Image Processing

30.10.2009 Basics: Pattern recognition

02.11.2009 Computer Vision: Tasks, Challenges, Learning, Performance measures

06.11.2009 Face Detection I: Color, Edges (Birchfield)

09.11.2009 Project 1: Intro + Programming tips13.11.2009 Face Detection II: ANNs, SVM, Viola & Jones

16.11.2009 Project 1: Questions

20.11.2009 Face Recognition I: Traditional Approaches, Eigenfaces, Fisherfaces, EBGM

23.12.2009 Face Recognition II

27.11.2009 Head Pose Estimation: Model-based, NN, Texture Mapping, Focus of Attention

30.11.2009 People Detection I

04.12.2009 People Detection II

07.12.2009 Project 1: Student Presentations, Project 2: Intro11.12.2009 People Detection III (Part-Based Models)

14.12.2009 Scene Context and Geometry

Edgar Seemann, 04.12.09 4

Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciToday

� Last time: Global approaches� Compute a singe feature vector / representation

for the complete body

� Today:� Silhouette matching (global approach)

� Part-based approaches


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci

Silhouette Matching


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciChamfer Matching [Gavrila & Philomin ICCV’99]

Goal

• Align known object shapes with image

Object shapesReal-world image of object

Requirements for an alignment algorithm

• High detection rate

• Few false positives

• Robustness

• Computationally inexpensive


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci

� Used to compare/align two (typically binary) shapes

1. Compute for each pixel the distance to the next edge pixel� Here the eculidean distances are

approximated by the 2-3 distance

Distance Transform

Shape 1 Shape 2

Distance = ?

Distance transform

20

3

36

0

85222025

2024

630

2

520002

42222

4333

555

668

522024

30024

2024

235244


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciDistance Transform

2. Overlay second shape over distance transform

3. Accumulate distances along shape 24. Find best matching position by an exhaustive search

� Distance is not symmetric� Distance has to be normalized w.r.t. the length of the

shapes

Distance transform

20

3

36

0

85222025

2024

630

2

520002

42222

4333

555

668

522024

30024

2024

235244

Distance = 14


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciChamfer Matching

Distance measureDistance transform of a real-world image

Distance transform

20

3

36

0

85222025

2024

630

2

520002

42222

4333

555

668

522024

30024

2024

235244

Binary image

• Compute distance transform (DT)

• For each possible object location

• Position known object shape over DT

• Accumulate distances along the

contour

Chamfer Matching


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciEfficient implementation

� The distance transform can be efficiently computed by two scans over the complete image

� Forward-Scan� Starts in the upper-left corner and moves from left to right, top to

bottom

� Uses the following mask

� Backward-Scan� Starts in the lower-right corner and moves from right to left,

bottom to top

� Uses the following mask

3 2 3

2 0

3 2 3

20


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci

3

2

? ? ?

?

?

Forward scan

� We can choose different values for the filter mask

� The local distances, d, s and c, in the mask are added to the pixel values of the distance map and the new value of the zero pixel is the minimum of the five sums

� Example:

d s d

s c

2+2 2+3 3+5

2+0 2

3 2 3

2 0 ?

? ? ?

5

?

?


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciAdvantages and Disadvantages

� Fast� Distance transform has to be computed only once

� Comparison for each shape location is cheap

� Good performance on uncluttered images (with few background structures)

� Bad performance for cluttered images

� Needs a huge number of people silhouettes� But computation effort increases with the number of

silhouettes


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciTemplate Hierachy

� To reduce the number of silhouettes to consider, silhouettes can be organized in a template hierarchy

� For this, the shapes are clustered by similarity


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciSearch in the hierarchy

� Matching the shapes, then corresponds to a traversal of the template hierarchy

� How can we prune search branches to speed up matching?� Thresholds depend on:

� Edge detector (likelihood of gaps)

� Silhouette sizes

� Hierarchy level

� Allowed shape variation

� Thresholds are set statisticallyduring training


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciExample Detections


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciVideo


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciCoarse-To-Fine Search

� Goal: Reduce search effort by discarding unlikely regions with minimal computation

� Idea:� Subsample image and search

first at a coarse scale

� Only consider regions with alow distance when searchingfor a match on finer scales

� Again, we have to findreasonable thresholds

Level 1

Level 2

Level 3


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciProtector System (Daimler)


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciAdding edge orientation

� So far edge orientation has been completely ignored

� Idea: Consider edge orientation for each pixel

Distance = small


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciEdge orientation - The math

� Given two shapes S, C, we can express the chamfer distance in the following manner

� The orientation correspondence between two points is then measured by

� The combined distance measure:


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciStatistical Relevance

� Adding statistical relevance of silhouette regions further improves the results[Dimitrijevic06]


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciSpatio-Temporal templates

� Use multiple successive frames to build a spatio-temporal template (T={T1,…,TN})

� Allow spatial variations of dx, dy (due to motion or camera movements)


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciExample: single-frame vs. 3 frames


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciQuantitative Results

� Red: spatio-temporal templates + statistical relevance


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciVideo

� Restrict detection to a single articulation (when legs are in a v-shaped position)

� Spatio-Temporal templates:� Allows more reliable detection of motion direction

� Avoids confusions and some false positive detections


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciAlternatives: Earth Mover’s Distance

� Originally developed to compare histograms

� Idea: Find the minimal ‘flow’ to transform one histogram to another

� Example:


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciEarth mover’s distance (EMD)

Example detection

• Detect edges in image

• For each possible object location

• Optimize correspondences between

known shape and edge image

EMD MatchingChamfer:

EMD:

Distance measure basis


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciEMD – The math

� Variant of the transportation problem (possible solutions: Stepping Stone Algorithm, Transportation-simplex method)

� Constraints

� EMD-Distance


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciAdvantages and Disadvantages

� Optimizes matching between silhouette and edge structure in image

� Enforces one-to-one matchings (unlike chamfer)

� Allows partial matches

� Can deal with arbitrary features

� High computational complexity� Approximation is possible [Graumann, Darrel CVPR’94]


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci

Part-Based Models


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciPart-Based Models

� Fischler & Elschlager1973

� Model has two components� parts

(2D image fragments)

� structure (configuration of parts)


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciConfiguration of parts

� Fixed Spatial Layout� Local parts have a mostly fixed position and orientation

with respect to the center object center or detection window

� Flexible Spatial Layout� Local parts are allowed to shift in location and scale� Able to compensate for deformations or articulation

changes� Well suited for non-rigid objects � Spatial relations often modeled probabilistically


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci

K. Grauman, B. Leibe

Different Connectivity Structures

Fergus et al. ’03Fei-Fei et al. ‘03

Leibe et al. ’04, ‘08Crandall et al. ‘05Fergus et al. ’05

Crandall et al. ‘05 Felzenszwalb & Huttenlocher ‘05

Bouchard & Triggs ‘05 Carneiro & Lowe ‘06Csurka ’04Vasconcelos ‘00

from [Carneiro & Lowe, ECCV’06]

O(N6) O(N2) O(N3) O(N2)


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci

Fixed Spatial Layout


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciMotivation

� Breakdown overall variability into more manageable pieces

� Pieces can be classified by a less complex classifier

� Apply prior information by (manually) grouping the global feature vector into meaningful parts


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciExample: Sashua et al. IVS’04

� Divide person into 9 overlapping parts

� 4 pair combinations


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciFeature Representation

� Edge orientation histograms (similar to HOG)

� Each part is divided into 2x2 regions� i.e. 4 histograms per part

� 8 gradient orientations

� Features are insensitive to small local shifts


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciPart Classifier and Combination

� Parts are classified by a linear regression� i.e. distance to a separating hyperplane

� One score with respect to each of the 9 parts� Separate head vs. leg feature, head vs. upper body etc.

� The values of the individual part detectors are combined by AdaBoost


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciAttention mechanism

� Attention mechanism� Considers scene geometry, camera perspective

� Filters out regions with lack of texture properties

� Only 75 image windows are considered for pedestrian classification per frame


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciResults

� Lower curve:Global SVM

� Middle curve:Mohan et al.

� Upper curve:Sashua et al.


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciResults


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciExample: Mohan et al.

� Body divided into 4 parts� Face/Shoulder

� Legs

� Right/Left arm

� Detection� Sliding window

� 64x128 pixels


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciGeometric constraints

� Body parts are not always at the exact same position

� Allow local shifts� Position

� Scale

� Best location has to befound for each detectionwindow


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciTraining Data

� MIT pedestrian database


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciFeatures

� Based on wavelets� Multi-Resolution representation

� Haar wavelets

� Quadrupledensity


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciDiscrete Wavelet Transform (DWT)

Motivation

� Representation with basis functions

� Given: Signal [9 7 3 5]

� Disadvantages:� Coefficients provide only

local information

� No information aboutglobal signal change

http://nt.eit.uni-kl.de/wavelet/transformation.html


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciHaar Wavelet Basis

� Scaling function provides information about the mean value or signal level

� Representation inwavelet basis:� [12 4 sqrt(2) -sqrt(2)]


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci1D Wavelet Transformation

� Step wise transformation with orthogonal

spaces

� Basis function for Vj� Scaling function

� Basis function for Wj

� Wavelet function


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciOur Example

� Project signal onto subspaces

V1 W1

V2 W2


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ci2D Haar-Wavelets

� Same princple as in 1D

� Quadruple densityi.e. overcompleterepresentation


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciFeature Discussion

� Haar wavelets represent pixel/region differences

� Strong responses of wavelet coefficients on image boundaries, i.e. coefficients encode shape

� Similar to gradients at different resolutions

� Overcomplete basis allows good spatial resolution

� Fine scales are discarded as they represent noise

� Coarse scales are discarded as they represent global intensity levels and not the shape


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciPerformance: Part Detectors


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciClassifier Combination

� Voting� E.g. majority of part detectors, classify detection

window as person

� SVM combination� Feed SVM scores of part detectors in a second stage

SVM

� Referred to as Adaptive Classifier Combination


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciCombination Performance


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciResults


Com

pute

r V

isio

n fo

r H

uman

-Com

pute

r In

tera

ctio

n

Res

earc

h G

roup

, Uni

vers

itätK

arls

ruhe

(T

H)

cv:h

ciOcclusion

visuelle perzeption für mensch- maschine schnittstellen€¦ · comparison for each shape location...

Documents