TRANSCRIPT
Machine learning and imprecise probabilities for
computer vision
Fabio Cuzzolin
IDIAP, Martigny, 19/4/2006
Myself
Master’s thesis on gesture recognition at the University of Padova
Visiting student, ESSRL, Washington University in St. Louis
Ph.D. thesis on the theory of evidence
Young researcher in Milan with the Image and Sound Processing group
Post-doc at UCLA in the Vision Lab
My research
Discrete mathematics
linear independence on lattices
Belief functions and imprecise probabilities
geometric approach
algebraic analysis
combinatorial analysis
Computer vision
object and body tracking
data association
gesture and action recognition
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification
Imprecise probabilities: geometric approach; algebraic analysis
Approach
Problem: recognizing an example of a known category of gestures from image sequences
Combination of HMMs (for dynamics) and size functions (for pose representation)
Continuous hidden Markov models
EM algorithm for parameter learning (Moore)
Example
transition matrix A -> gesture dynamics
state-output matrix C -> collection of hand poses
The gesture is represented as a sequence of transitions between a small set of canonical poses
Size functions
Hand poses are represented through their contours
[figure: real image, measuring function, family of lines, size function table]
Gesture classification
[diagram: HMM 1, HMM 2, …, HMM n]
The EM algorithm is used to learn HMM parameters from an input feature sequence
The new sequence is fed to the learnt gesture models; each produces a likelihood, and the most likely model is chosen (if above a threshold)
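The decision rule just described — pick the model with the highest likelihood, reject if even that score falls below a threshold — can be sketched as follows (the model names, scores, and threshold are illustrative, not from the talk):

```python
def classify_gesture(log_likelihoods, threshold):
    """Pick the gesture model with the highest log-likelihood.

    log_likelihoods: dict mapping model name -> log P(sequence | model).
    Returns the best model's name, or None when even the best score
    falls below the rejection threshold (unknown gesture).
    """
    best_model = max(log_likelihoods, key=log_likelihoods.get)
    if log_likelihoods[best_model] < threshold:
        return None  # no model is confident enough
    return best_model

# Hypothetical scores for a new feature sequence
scores = {"fly": -12.3, "cycle": -40.1, "wave": -35.7}
print(classify_gesture(scores, threshold=-20.0))  # fly
print(classify_gesture(scores, threshold=-5.0))   # None
```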
Composition of HMMs
Compositional behavior of HMMs: the model of the action of interest is embedded in the overall model
Clustering: states of the original model are grouped into clusters, and the transition matrix is recomputed accordingly:
P(X_{t+1} \in C_k \mid X_t = e_i) = \sum_{e_j \in C_k} P(X_{t+1} = e_j \mid X_t = e_i)
State clustering Effect of clustering on HMM topology
“Cluttered” model for the two overlapping motions
Reduced model for the “fly” gesture extracted through clustering
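The recomputation step can be sketched numerically. Summing the transition matrix over the states of each destination cluster follows the formula above; how to merge the rows of a source cluster into one is not specified on the slides, so the uniform averaging used here is an assumption of this sketch:

```python
import numpy as np

def cluster_transition_matrix(A, clusters):
    """Recompute an HMM transition matrix after grouping states.

    A: (n, n) row-stochastic matrix, A[i, j] = P(X_{t+1}=e_j | X_t=e_i).
    clusters: list of lists of state indices, one list per cluster.
    Summing A over a destination cluster gives P(X_{t+1} in C_k | X_t = e_i);
    rows belonging to the same source cluster are then merged by uniform
    averaging (an assumption, not from the slides).
    """
    K = len(clusters)
    A_red = np.zeros((K, K))
    for h, src in enumerate(clusters):
        for k, dst in enumerate(clusters):
            A_red[h, k] = A[np.ix_(src, dst)].sum() / len(src)
    return A_red

A = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])
print(cluster_transition_matrix(A, [[0, 1], [2]]))
```

The reduced matrix stays row-stochastic, so it is again a valid HMM transition matrix over the clusters.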
Kullback-Leibler comparison
We used the K-L distance to measure the similarity between models extracted from clutter and models learned in its absence
KL distances between “fly” (solid) and “fly from clutter” (dash)
KL distances between “fly” and “cycle”
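For reference, the discrete Kullback-Leibler divergence is easy to compute; note that for HMMs there is no closed form, and in practice the distance is usually approximated by averaging log-likelihood ratios over sequences sampled from one model — the helper below covers only the plain discrete case:

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D(p || q).

    p, q: sequences of probabilities over the same finite alphabet.
    Terms with p_i = 0 contribute nothing by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive: distributions differ
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```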
Volumetric action recognition
Problem: recognizing the action performed by a person viewed by a number of cameras
2D approaches: features are extracted from single views -> viewpoint dependence
Volumetric approach: features are extracted from a volumetric reconstruction of the moving body
Locally linear embedding to find a topological representation of the moving body
3D feature extraction
Linear discriminant analysis (LDA) to estimate the direction of motion as the direction of maximal separation between the legs
k-means clustering to separate body parts
Uncertainty descriptions
A number of formalisms have been proposed to extend or replace classical probability:
e.g. possibilities, fuzzy sets, random sets, monotone capacities, gambles, upper and lower previsions
Theory of evidence (A. Dempster, G. Shafer): probabilities are replaced by belief functions, Bayes’ rule is replaced by Dempster’s rule, and families of domains allow multiple representations of evidence
Belief functions
Probability on a finite set Θ: a function p: 2^Θ -> [0,1] with p(A) = \sum_{x \in A} m(x), where m: Θ -> [0,1] is a mass function meeting the normalization constraint
Probabilities are additive: if A \cap B = \emptyset then p(A \cup B) = p(A) + p(B)
Belief function s: 2^Θ -> [0,1], defined as
s(A) = \sum_{B \subseteq A} m(B)
where m is a mass function on 2^Θ such that \sum_{B \subseteq \Theta} m(B) = 1
Belief functions are not additive
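The belief/mass relationship can be sketched in a few lines: a belief value is the total mass of all focal elements contained in the queried set. The mass function below is illustrative:

```python
def belief(mass):
    """Build the belief function bel(A) = sum of m(B) over all B subset of A.

    mass: dict mapping frozenset (focal element) -> mass value
    (non-negative, summing to 1). Returns a function evaluating
    bel on any subset of the frame.
    """
    def bel(A):
        A = frozenset(A)
        return sum(m for B, m in mass.items() if B <= A)
    return bel

m = {frozenset({"a1"}): 0.7, frozenset({"a1", "a2"}): 0.3}
bel = belief(m)
print(bel({"a1"}))        # 0.7
print(bel({"a2"}))        # 0 -- yet bel({a1, a2}) is 1: belief is not additive
print(bel({"a1", "a2"}))
```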
Dempster’s rule
In the theory of evidence, new information encoded as a belief function is combined with old beliefs in a revision process
Belief functions (with Bel(A) = \sum_{B \subseteq A} m(B)) are combined through Dempster’s rule: s = s' \oplus s''
The rule operates on the intersections of the focal elements A_i of s' and B_j of s'':
m(A) \propto \sum_{A_i \cap B_j = A} m_1(A_i)\, m_2(B_j)
Example of combination
Frame Θ = {a1, a2, a3, a4}
s1: m({a1}) = 0.7, m({a1, a2}) = 0.3
s2: m(Θ) = 0.1, m({a2, a3, a4}) = 0.9
s1 \oplus s2:
m({a1}) = 0.7*0.1/0.37 = 0.19
m({a2}) = 0.3*0.9/0.37 = 0.73
m({a1, a2}) = 0.3*0.1/0.37 = 0.08
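The example can be reproduced with a short implementation of Dempster's rule for mass functions stored as dictionaries (the conflicting mass here is 0.7 * 0.9 = 0.63, hence the 0.37 normalization):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    m1, m2: dicts mapping frozenset (focal element) -> mass.
    Masses of each pair of focal elements are multiplied and assigned
    to their intersection; the mass landing on the empty set (conflict)
    is discarded and the remainder renormalized.
    """
    combined = {}
    conflict = 0.0
    for (A, mA), (B, mB) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            combined[C] = combined.get(C, 0.0) + mA * mB
        else:
            conflict += mA * mB
    norm = 1.0 - conflict
    return {C: v / norm for C, v in combined.items()}

theta = frozenset({"a1", "a2", "a3", "a4"})
s1 = {frozenset({"a1"}): 0.7, frozenset({"a1", "a2"}): 0.3}
s2 = {theta: 0.1, frozenset({"a2", "a3", "a4"}): 0.9}
for C, v in dempster_combine(s1, s2).items():
    print(sorted(C), round(v, 2))  # matches the 0.19 / 0.73 / 0.08 above
```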
JPDA with shape info
Robustness: clutter does not meet the shape constraints
Occlusions: occluded targets can be estimated
JPDA model: independent targets
Shape model: rigid links
Dempster’s fusion
Body tracking
Application: tracking of feature points on a moving human body
Pose estimation
Estimating the “pose” (internal configuration) of a moving body from the available images
Salient image measurements: features
[diagram: image sequence from t=0 to t=T with pose estimates q̂ in the configuration space Q]
Model-based estimation
If you have an a-priori model of the object…
…you can exploit it to help (or drive) the estimation
Example: kinematic model
Model-free estimation
If you do not have any information about the body…
…the only way to do inference is to learn a map between features and poses directly from the data
This can be done in a training stage
Collecting training data
Motion capture system: 3D locations of markers = pose
Training data
When the object performs some “significant” movements in front of the camera…
…a finite collection of configuration values q_1, …, q_T is provided by the motion capture system
…while a sequence of features y_1, …, y_T is computed from the image(s)
Learning feature-pose maps
Hidden Markov models provide a way to build feature-pose maps from the training data
A Gaussian density for each state is set up on the feature space -> approximate feature space
A map between each region and the set of training poses q_k whose feature value y_k falls inside it
Evidential model
Approximate feature spaces…
…and the approximate parameter space…
…form a family of compatible frames: the evidential model
Human body tracking
two experiments, two views
four markers on the right arm
six markers on both legs
Feature extraction
three steps: original image, color segmentation, bounding box
Performance
Comparison of three models: left view only, right view only, both views
[plot legend: pose estimation yielded by the overall model; estimate associated with the “right” model; the “left” model; ground truth]
GaitID
The problem: recognizing the identity of humans from their gait
Typical approaches: PCA on image features, HMMs; people typically use silhouette data
Issue: view-invariance — it can be addressed via 3D representations, but 3D tracking is difficult and sensitive
Bilinear models
From view-invariance to “style” invariance
In a dataset of sequences, each motion possesses several labels: action, identity, viewpoint, emotional state, etc.
Bilinear models (Tenenbaum) can be used to separate the influence of two of those factors, called “style” and “content” (the label to classify):
y^{SC} = A^S b^C
where y^{SC} is a training set of k-dimensional observations with labels S and C, b^C is a parameter vector representing content, and A^S is a style-specific linear map from the content space onto the observation space
Content classification of unknown style
Consider a training set in which persons (content = ID) are seen walking from different viewpoints (style = viewpoint)
An asymmetric bilinear model can be learned from it through the SVD of a stacked observation matrix
When new motions are acquired in which a known person is seen walking from a different viewpoint (unknown style)…
…an iterative EM procedure can be set up to classify the content (identity):
E step -> estimation of p(c|s), the probability of the content given the current estimate s of the style
M step -> estimation of the linear map A^s for the unknown style s
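The SVD training step can be sketched as follows. This is a minimal version of Tenenbaum & Freeman's asymmetric-model solution, assuming one mean observation per (style, content) pair is available; the function name and the synthetic data are illustrative. Note the factorization is only unique up to an invertible linear transform, so only the products A^s b^c are directly comparable:

```python
import numpy as np

def fit_asymmetric_bilinear(Y, S, k, J):
    """Fit y^{sc} ~ A^s b^c from a fully labelled training set.

    Y: (S*k, C) matrix whose s-th block of k rows holds, column by column,
       the mean observation for style s and content class c.
    J: dimension kept for the content space.
    The stacked style maps are read off U * Sigma, the content vectors off Vt.
    """
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    A = (U[:, :J] * sigma[:J]).reshape(S, k, J)   # A[s] is the style-s map
    B = Vt[:J, :]                                  # B[:, c] is content vector b^c
    return A, B

# Synthetic check: 2 styles, 3 content classes, 4-dimensional observations
rng = np.random.default_rng(0)
A_true = rng.normal(size=(2, 4, 2))
B_true = rng.normal(size=(2, 3))
Y = np.vstack([A_true[s] @ B_true for s in range(2)])   # stacked (8, 3) matrix
A, B = fit_asymmetric_bilinear(Y, S=2, k=4, J=2)
print(np.allclose(A[0] @ B, A_true[0] @ B_true))        # reconstruction matches
```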
Three-layer model
Each sequence is encoded as a Markov model, its C matrix is stacked in an observation vector, and a bilinear model is trained over those vectors
Feature representation: projection of the contour of the silhouette onto a sheaf of lines passing through the center
MOBO database
Mobo database: 25 people performing 4 different walking actions, viewed from 6 cameras
Each sequence has three labels: action, id, view
We set up four experiments in which one label was chosen as content, another as style, and the remaining one was treated as a nuisance factor:
Content = id, style = view -> view-invariant gaitID
Content = id, style = action -> action-invariant gaitID
Content = action, style = view -> view-invariant action recognition
Content = action, style = id -> style-invariant action recognition
Results
Compared performance with the “baseline” algorithm and straight k-NN on sequence HMMs
Distances between dynamical models
Problem: motion classification
Approach: representing each movement as a linear dynamical model; for instance, each image sequence can be mapped to an ARMA or AR linear model
Classification then reduces to finding a suitable distance function in the space of dynamical models
We can use this distance in any of the popular classification schemes: k-NN, SVM, etc.
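The point that the distance is a pluggable component of the classifier can be illustrated with a 1-NN rule; the AR(2) parameter pairs, labels, and the Euclidean placeholder distance below are illustrative — any model distance (Martin's distance, a learned pullback metric, …) slots in the same way:

```python
def nearest_neighbour(query, models, labels, distance):
    """1-NN classification with a custom distance between models.

    models: list of model parameters (here, plain tuples of AR coefficients);
    distance: any function d(m1, m2) -> float between two models.
    """
    d, best = min((distance(query, m), lab) for m, lab in zip(models, labels))
    return best

# Toy illustration with AR(2) parameter pairs and a Euclidean distance
euclid = lambda m1, m2: sum((x - y) ** 2 for x, y in zip(m1, m2)) ** 0.5
models = [(0.9, -0.2), (0.1, 0.5)]
labels = ["walk", "run"]
print(nearest_neighbour((0.8, -0.1), models, labels, euclid))  # walk
```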
Riemannian metrics
Some distances have been proposed: Martin’s distance, subspace angles, gap metric, Fisher metric
However, it makes no sense to choose a single distance for all possible classification problems
When some a-priori information is available (a training set)…
…we can learn in a supervised fashion the “best” metric for the classification problem!
Feasible approach: volume minimization of pullback metrics
Learning pullback metrics
Many unsupervised algorithms take as input a dataset and map it to an embedded space, but they fail to learn a full metric
Consider then a family of diffeomorphisms F between the original space M and a metric space N
The diffeomorphism F induces on M a pullback metric
Space of AR(2) models
Given an input sequence, we can identify the parameters of the linear model which best describes it
We chose the class of autoregressive models of order 2, AR(2)
Fisher metric on AR(2):
g(a_1, a_2) = \frac{1}{(1+a_2)(1-a_1-a_2)(1+a_1-a_2)} \begin{pmatrix} 1-a_2 & a_1 \\ a_1 & 1-a_2 \end{pmatrix}
Compute the geodesics of the pullback metric on M
Results
Scalar feature, AR(2) and ARMA models; NN algorithm to classify new sequences
Identity recognition; action recognition
Results – 2
Recognition performance of the second-best distance and the optimal pullback metric
The whole dataset is considered, regardless of the view
[plots: View 1, View 5]
Geometric approach to the ToE
Belief functions can be seen as points of a Euclidean space of dimension 2^n - 2
Each subset A corresponds to the A-th coordinate s(A)
Belief space: the space of all the belief functions on a given frame
S = Cl(P_A, A \subseteq \Theta); it has the shape of a simplex
Geometry of Dempster’s rule
Dempster’s rule can be studied in the geometric setup too: it is a geometric operator mapping pairs of points of the belief space onto another point
Conditional subspaces
[diagram: conditional subspaces, probability simplices P_x, P_y and images F_x(s), F_y(s)]
Compositional criterion: the approximation behaves like s when combined through Dempster’s rule
\hat{p} = \arg\min_{p \in P} \mathrm{dist}(s, p)
Probabilistic approximation
Problem: given a belief function s, finding the “best” probabilistic approximation of s; this can be solved in the geometric setup
Comparative study of all the proposed probabilistic approximations
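One well-known member of the family of probabilistic approximations compared here is Smets' pignistic transform, which splits each focal element's mass evenly among its singletons; the sketch below is only this one candidate, not the geometric argmin solution of the slides:

```python
def pignistic(mass):
    """Smets' pignistic transform of a mass function.

    mass: dict mapping frozenset (focal element) -> mass value.
    Returns BetP(x) = sum over A containing x of m(A) / |A|,
    a probability distribution over the singletons.
    """
    betp = {}
    for A, m in mass.items():
        for x in A:
            betp[x] = betp.get(x, 0.0) + m / len(A)
    return betp

m = {frozenset({"a1"}): 0.4, frozenset({"a1", "a2"}): 0.6}
print(pignistic(m))  # a1: 0.4 + 0.3 = 0.7, a2: 0.3
```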
Lattice structure
Families of frames have the algebraic structure of a lattice
Order relation: existence of a refining
Minimal refinement and maximal coarsening
F is a locally Birkhoff (semimodular with finite length) lattice, bounded below
Total belief theorem
Generalization of the total probability theorem
A-priori constraint and conditional constraint
Whole graph of candidate solutions; connections with combinatorics and linear systems