Visual Object Detection, Recognition & Tracking (without Deep Learning)

Yu Huang

Sunnyvale, California

[email protected]


Outline

Object Detection and Classification

State of Art Object Detection and Classification

Global/local features (Harris, FAST, SIFT, SURF, HOG,

LBP, BRIEF, BRISK, FREAK, ORB)

Kd-tree, LSH, min-hash, inverted file;

part-based (constellation model, pictorial structure,

implicit shape model, deformable model)

Pose estimation

bag of words (codebook, pyramid match kernel, spatial

pyramid match, vocabulary tree, pLSA, LDA)

VLAD, Fisher kernel, Hamming embedding, product quantization

Machine learning: generative/discriminative model

Efficiency in Detection/Classification

Divide-and-conquer, branch-and-bound, coarse-to-fine,

DP, selective search by segmentation.

Open set problem

Data imbalance problem

Face detection/recognition

Text detection/recognition

Scene parsing/semantic segmentation

Data set and evaluation metric

Object Tracking

State-of-art methods of Object Tracking

Representation Scheme in Tracking

Search Mechanism in Tracking

Model Update in Tracking

Context in Tracker, Fusion of Trackers

Multiple object tracking

3-D Model-based tracking

SLAM (feature/pixel tracking)

Appendix: Action/Event Detection/Classification

Object Detection

Given an image or a frame in the video, the goal of object detection is to determine whether there are any defined objects in the image and return their locations and extents (from a long time and whole viewed observer).

The object detector should know how to differentiate the specific object from everything else in the view.

Object detection is usually a binary classification problem; however, additional context information from the background, such as co-occurrence of objects and geometric location priors, helps build a strong detector.

Multiple views or 3-D information (depth) can be used if available.

Object Classification

Given an image or a frame in the video, the goal of

object classification is to identify specific objects

within a certain object set.

The object classification should tell what the

difference is between object A and object B in the

predefined object set.

Contextual information between objects is modeled (by learning) too;

Object detection can be solved as classification of object parts;

Segmentation can be co-trained with classification.

Objects’ geometric context is useful too;

Multiple views or 3-D information (depth) can be used if available.

Object classification is complicated, and the difficulty rises as the number of objects in the set increases.

State-of-Art Methods of Object Detection/Classification

1. Global/Local representation;

Template matching (multi-scale) and eigen-object (subspace or manifold);

GIST, MSER, SIFT, SURF, Haar-like, Histogram of Oriented Gradient (HOG), Pyramid HOG,

GLOH (Gradient Location and Orientation Histogram), Local Binary Pattern (LBP), Shape-Context etc;

Local feature correspondence

– Indexing local features for Approximate Nearest Neighbor (ANN) search: KD-tree, min-Hashing, spectral hashing,

inverted index;

– Fast matching for large dataset: product quantization, Hamming embedding, Fisher kernel, sparse coding,

manifold learning;

– Global spatial model: homography constraints.

Visual Features

Harris Corner Detection

• Shifting the window in any direction should yield a large change in appearance;

• Harris corner detector gives a mathematical

approach for determining which case holds: flat,

edge or corner;

• Treat gradient vectors as a set of (dx,dy) points

with a center of mass defined as being at (0,0);

• Fit an ellipse to that set of points via scatter matrix.
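The scatter-matrix idea above can be sketched in a few lines. This is a minimal illustration, not the slide author's implementation: the response `det - k * trace^2` is the commonly used Harris measure, and `k = 0.04` is an assumed, typical empirical value; neither appears on the slide. The function takes the (dx, dy) gradient vectors of one window and reports a value that is near zero for flat regions, negative for edges and positive for corners.

```python
def harris_response(grads, k=0.04):
    """Corner response from the (dx, dy) gradient vectors of one window."""
    # Scatter (second-moment) matrix about (0, 0): [[sxx, sxy], [sxy, syy]].
    sxx = sum(dx * dx for dx, dy in grads)
    syy = sum(dy * dy for dx, dy in grads)
    sxy = sum(dx * dy for dx, dy in grads)
    det = sxx * syy - sxy * sxy      # product of the ellipse's eigenvalues
    trace = sxx + syy                # sum of the eigenvalues
    return det - k * trace * trace


# Three toy windows (hypothetical data).
flat = [(0.0, 0.0)] * 9                          # no gradient anywhere
edge = [(1.0, 0.0)] * 9                          # gradients in one direction
corner = [(1.0, 0.0)] * 5 + [(0.0, 1.0)] * 4     # two gradient directions
```

Both eigenvalues of the scatter matrix are large only in the corner case, which is why the determinant term dominates there.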

FAST

• Features from accelerated segment test (FAST) is a corner detection method;

• The FAST corner detector uses a circle of 16 pixels (radius 3) to classify whether a candidate is a corner.

• A candidate with intensity Ip is classified as a corner, given a threshold t, if either condition 1 or 2 below holds:

• 1: N contiguous pixels on the circle have intensities > Ip + t;

• 2: N contiguous pixels on the circle have intensities < Ip - t.

• Tradeoff in choosing N (usually 12) and t (20%).

• The high-speed test for rejecting non-corner

points is operated by examining 4 example

pixels, namely pixel 1, 9, 5 and 13.

• Non maximal suppression.
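The segment test and the high-speed rejection step can be sketched as follows. This is an illustrative sketch under stated assumptions: `ring` is assumed to hold the 16 circle intensities in order (pixel 1 at index 0), and `t` is treated as an absolute intensity threshold rather than a percentage. Non-maximal suppression is not shown.

```python
def is_fast_corner(ring, ip, t, n=12):
    """Segment test on the 16 circle intensities around a candidate pixel."""
    assert len(ring) == 16
    # High-speed rejection with pixels 1, 9, 5 and 13 (indices 0, 8, 4, 12):
    # an arc of n >= 12 contiguous pixels must cover at least three of them.
    if n >= 12:
        quick = [ring[i] for i in (0, 8, 4, 12)]
        brighter = sum(v > ip + t for v in quick)
        darker = sum(v < ip - t for v in quick)
        if brighter < 3 and darker < 3:
            return False
    # Full test: look for n contiguous brighter (or darker) pixels.
    for passes in (lambda v: v > ip + t, lambda v: v < ip - t):
        run = 0
        for v in ring + ring:        # walk twice so wrapped arcs are counted
            run = run + 1 if passes(v) else 0
            if run >= n:
                return True
    return False


bright_arc_12 = [200] * 12 + [100] * 4
bright_arc_11 = [200] * 11 + [100] * 5
wrapped_arc = [200] * 6 + [100] * 4 + [200] * 6
```

With Ip = 100 and t = 20, the 12-pixel arc and the wrapped arc are accepted, while the 11-pixel arc is rejected.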

FAST

• Machine learning: corner detection is run on a set of training images;

• For every pixel p, store the 16 pixels surrounding it as a vector P;

• Each value in the vector takes one of three states: darker than, brighter than, or similar to p;

• Depending on the states, the vector is subdivided into three subsets: Pd, Ps, Pb;

• Define a variable Kp: true if p is an interest point and false otherwise;

• A decision tree classifier queries each subset using Kp, on the principle of entropy minimization: the true class is found with the minimum number of queries for the three subsets;

• Terminate when the entropy of a subset is zero;

• The learned order of queries is then used for detection.

LBP: Local Binary Pattern

LBP transforms an image into an array or image of integer labels describing the small-scale appearance of the image.

Assume texture has locally two complementary aspects, a pattern and its strength.

Divide the examined window into cells (e.g. 16x16 pixels for each cell).

For each pixel in a cell, compare the pixel to each of its 8 neighbors. Follow the pixels along a

circle, i.e. clockwise or counter-clockwise.

Where the center pixel's value is greater than the neighbor's value, write "1". Otherwise, write "0".

This gives an 8-digit binary number (which is usually converted to decimal for convenience).

Compute the histogram, over the cell, of the frequency of each "number" occurring (i.e., each

combination of which pixels are smaller and which are greater).

Optionally normalize the histogram.

Concatenate (normalized) histograms of all cells. This gives the feature vector.
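The per-pixel code and the per-cell histogram described above can be sketched as below. The bit convention follows the text (1 where the center is greater than the neighbor); many implementations use the opposite convention, which yields an equivalent descriptor. The neighbor ordering (clockwise from the top-left) and the function names are assumptions made for illustration.

```python
from collections import Counter

# Offsets of the 8 neighbours, followed clockwise from the top-left.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP label of pixel (r, c)."""
    code = 0
    for dr, dc in NEIGHBOURS:
        # Convention from the text: 1 where the centre exceeds the neighbour.
        bit = 1 if img[r][c] > img[r + dr][c + dc] else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(cell):
    """Normalised 256-bin histogram over the interior pixels of one cell."""
    counts = Counter(
        lbp_code(cell, r, c)
        for r in range(1, len(cell) - 1)
        for c in range(1, len(cell[0]) - 1)
    )
    total = sum(counts.values())
    return [counts.get(v, 0) / total for v in range(256)]


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
```

For the 3x3 `patch`, the center value 6 is compared with 5, 9, 1, 7, 8, 3, 2, 4 in turn, giving the bits 10100111, i.e. the label 167. The feature vector for a window is the concatenation of `lbp_histogram` over its cells.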

http://upload.wikimedia.org/wikipedia/commons/c/c2/Lbp_neighbors.svg

HOG: Histograms of Oriented Gradients

Introduce invariance

Bias / gain / nonlinear transformations

bias: gradients / gain: local

normalization

nonlinearity: clamping magnitude,

orientations

Small deformations

spatial subsampling

local “bag” models

At each pixel

Gradient magnitude:

m = \| (I_x, I_y) \|

Gradient orientation:

o = \tan^{-1}(I_y / I_x)

Quantize orientation: vote into bin (weighted)
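A minimal sketch of the per-pixel computation and the weighted vote for a single cell is given below. The choices of 9 unsigned orientation bins, centered differences, hard (non-interpolated) voting and L2 normalization are assumptions for illustration; the slide does not fix them.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Orientation histogram of one cell; votes weighted by magnitude."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins                  # unsigned orientation, [0, 180)
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            ix = cell[r][c + 1] - cell[r][c - 1]    # centred differences
            iy = cell[r + 1][c] - cell[r - 1][c]
            m = math.hypot(ix, iy)                  # gradient magnitude
            o = math.degrees(math.atan2(iy, ix)) % 180.0
            hist[int(o // bin_width) % n_bins] += m
    return hist


def l2_normalise(hist, eps=1e-6):
    """Local normalisation, which gives invariance to gain changes."""
    norm = math.sqrt(sum(v * v for v in hist) + eps * eps)
    return [v / norm for v in hist]


# A vertical step edge: every interior gradient points along x.
cell = [[0, 0, 10, 10]] * 4
```

For the step edge, all four interior pixels vote with magnitude 10 into the first bin, so the histogram is 40 in bin 0 and zero elsewhere.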

SIFT: Scale Invariant Feature Transform

Scale-space extrema detection:

Find the points, whose surrounding patches (with some scale) are distinctive

An approximation to the scale-normalized Laplacian of Gaussian

Keypoint localization: eliminating edge points

Orientation assignment:

Assign an orientation to each keypoint; the descriptor is represented relative to this orientation, which achieves invariance to image rotation;

Magnitude & orientation on the Gaussian smoothed images

A histogram is formed by quantizing the orientations into 36 bins;

Peaks in the histogram correspond to the orientations of the patch;

For the same scale & location, there could be multiple keypoints with different orientations (if another

peak is bigger than 80% of the maximal peak);

Keypoint descriptor: 128-d

16x16 patch -> 4x4 subregions ->8 bins for each subregion


SURF: Speeded Up Robust Features

The feature vector of SURF is almost identical to that of SIFT. It

creates a grid around the keypoint and divides each grid cell into

sub-grids.

At each sub-grid cell, the gradient is calculated and is binned by

angle into a histogram whose counts are increased by the

magnitude of the gradient, all weighted by a Gaussian.

These grid histograms of gradients are concatenated into a 64-d

vector.

SURF can also use a 36-d vector of principal components of the 64-d vector (PCA is performed on a large set of training images) for a speedup.

SURF also improves on SIFT by using a box filter approximation

to the convolution kernel of the Gaussian derivative operator. This

convolution is sped up further using integral images.

BRIEF

BRIEF: Binary Robust Independent Elementary Features;

Binary test

BRIEF descriptor

For each S*S patch

Smooth it, then pick pixels using pre-defined binary tests

Pros:

Compact, easy to compute, highly discriminative

Fast matching using Hamming distance

Good recognition performance

Cons:

More sensitive to image distortions and transformations, in particular to in-plane rotation and scale

change
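The binary tests and Hamming-distance matching can be sketched as follows. This is an illustration under assumptions: the test pairs are drawn uniformly at random with a fixed seed (the original method studies several sampling distributions), the bit is set when the first pixel is darker than the second, and the smoothing step is left to the caller.

```python
import random


def make_tests(patch_size, n_bits=256, seed=0):
    """Pre-defined binary tests: fixed pairs of pixel positions in a patch."""
    rng = random.Random(seed)

    def rand_pt():
        return (rng.randrange(patch_size), rng.randrange(patch_size))

    return [(rand_pt(), rand_pt()) for _ in range(n_bits)]


def brief_descriptor(patch, tests):
    """Pack the test results into an integer: bit = 1 if I(p) < I(q)."""
    bits = 0
    for (pr, pc), (qr, qc) in tests:
        bits = (bits << 1) | (1 if patch[pr][pc] < patch[qr][qc] else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")


# Hypothetical 8x8 patch (assumed already smoothed).
_rng = random.Random(42)
patch = [[_rng.randrange(256) for _ in range(8)] for _ in range(8)]
tests = make_tests(8)
desc = brief_descriptor(patch, tests)
```

Because the test positions are fixed in the patch frame, rotating or rescaling the patch changes which pixels are compared, which is the sensitivity listed under the cons above.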

Binary Robust Invariant Scalable Keypoints

BRISK: Combination of SIFT-like scale-space keypoint detection and BRIEF-like descriptor

Scale and rotation invariant

BRISK is a 512-bit binary descriptor that computes the weighted Gaussian average over a select pattern of points near the keypoint;

It compares the values of specific pairs of Gaussian windows, leading to either a 1 or a 0, depending on which window in the pair was greater.

The pairs to use are preselected in BRISK. This creates binary descriptors that work with Hamming distance instead of Euclidean.

FREAK: Fast Retina Keypoint

In FREAK, a cascade of binary strings is computed by efficiently comparing image intensities over a retinal sampling pattern;

FREAK improves upon the sampling pattern and method of pair

selection that BRISK uses.

FREAK evaluates 43 weighted Gaussians at locations around the

keypoint, but the pattern formed by these Gaussians is biologically

inspired by the retinal pattern in the eye.

The pixels being averaged overlap, and are much more

concentrated near the keypoint. This leads to a more accurate

description of the keypoint.

The actual FREAK algorithm also uses a cascade for comparing

these pairs, and puts the 64 most important bits in front to speed

up the matching process.

ORB

• Oriented FAST and Rotated BRIEF:

• Fast and accurate orientation compensation to FAST;

• Efficient computation of oriented BRIEF;

• Learn to de-correlate BRIEF in sampling pairs under rotational invariance.

• Intensity centroid for corner orientation:

• Properties of the sampling pairs:

• Uncorrelation – each new pair will bring new information to the descriptor;

• High variance – more discriminative, since it responds differently to inputs.

• Learning the sampling: a training set of about 300,000 keypoints drawn;

• Greedy method: obtain a set of 256 uncorrelated binary tests with high variance.

http://gilscvblog.files.wordpress.com/2013/10/figure3.jpg

http://gilscvblog.files.wordpress.com/2013/10/figure4.jpg

Kaze/A-Kaze

• Kaze: Multi-scale 2D feature detection and description algorithm in nonlinear scale spaces by means of nonlinear diffusion filtering;

• The discretization of a function by means of the forward Euler scheme: high computation cost;

• Efficient Additive Operator Splitting (AOS) techniques and variable conductance diffusion, solving a

tri-diagonal system of linear equations, which can be efficiently done by Thomas algorithm;

• The set of first and second order derivatives are approximated by means of 3x3 Scharr filters of

different derivative step sizes (better than Sobel-like operators);

• Find dominant orientation and adapt M-SURF to build descriptor;

• A-Kaze: accelerated Kaze with fast explicit diffusion (FED) in nonlinear scale spaces;

• FED combines the advantages of explicit and semi-implicit schemes while avoiding shortcomings;

• Idea: M cycles of n explicit diffusion steps with varying step sizes from factorization of the box filter;

• Embed the FED scheme into a fine to coarse pyramidal framework;

• Use a Modified-Local Difference Binary (M-LDB) that exploits gradient and intensity information;

• LDB is similar to BRIEF but using binary tests between the average of areas (not pixels).

KD-tree

The kd-tree data structure is based on a recursive subdivision of space into disjoint hyper-rectangular regions called cells;

Each node of the tree is associated with a region, called a box, and with the set of data points that lie within this box;

The root node of the tree is associated with a bounding box that contains all the data points;

Consider an arbitrary node in the tree: as long as the number of data points associated with this node is greater than a small quantity, called the bucket size, the box is split into two boxes by an axis-orthogonal hyper-plane that intersects this box;

There are a number of different splitting rules, which determine how this hyper-plane is selected (usually the mean or median);

When the number of points associated with the current box falls below the bucket size, the resulting node is declared a leaf node, and these points are stored with the node;

For ANN search, limit the number of neighboring k-d tree bins that are explored;

Reduce the boundary effects by randomization.
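The construction and search described above can be sketched as below. The sketch assumes a median split with the splitting axis cycled by depth, which is one of several possible splitting rules, and it performs an exact nearest-neighbor search; an ANN variant would additionally cap the number of leaves visited.

```python
def build(points, bucket_size=2, depth=0):
    """Split at the median while a box holds more than bucket_size points."""
    if len(points) <= bucket_size:
        return {"leaf": points}
    axis = depth % len(points[0])               # assumed rule: cycle the axes
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"axis": axis, "split": pts[mid][axis],
            "left": build(pts[:mid], bucket_size, depth + 1),
            "right": build(pts[mid:], bucket_size, depth + 1)}


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, q, best=None):
    """Exact NN search; boxes that cannot hold a closer point are skipped."""
    if "leaf" in node:
        for p in node["leaf"]:
            if best is None or sq_dist(p, q) < sq_dist(best, q):
                best = p
        return best
    diff = q[node["axis"]] - node["split"]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, q, best)
    # The far box is at least |diff| away along the split axis.
    if best is None or diff * diff < sq_dist(best, q):
        best = nearest(far, q, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (1, 9), (6, 8)]
tree = build(points)
```

The pruning test in `nearest` is where most of the saving comes from: a whole subtree is skipped whenever the splitting plane is farther from the query than the best point found so far.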

KD-tree

Randomized kd-tree forest: a fast ANN search;

Split by picking the dimension with the highest variance first;

Multiple randomized trees increase the chances of finding nearby points;

Best-bin first search heuristic: priority queue;

A branch-and-bound technique for an estimate of the smallest distance from the query point to any

of the data points down all of the open paths;

Priority search: visits cells in increasing order of distance from the query, and converges rapidly on the true NN (max/min heap data structure).

Locality Sensitive Hashing (LSH)

LSH is a randomized hashing technique using hash functions that map similar points to the same bin, with high probability;

Choose a random projection;

Project points;

Points close in the original space remain close under the projection;

Use multiple quantized projections defining a high-dimensional “grid”;

Cell contents can be efficiently indexed using a hash table;

Repeat to avoid quantization errors near the cell boundaries;

Point that shares at least one cell = potential candidate;

Compute distance to all candidates;
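The steps above can be sketched with a small random-projection index. All names and parameter values here (`RandomProjectionLSH`, `n_tables`, `n_proj`, `cell`) are hypothetical and chosen for illustration; Gaussian projection directions with random offsets are one common choice, not the only one.

```python
import random
from collections import defaultdict


class RandomProjectionLSH:
    def __init__(self, dim, n_tables=8, n_proj=4, cell=1.0, seed=0):
        rng = random.Random(seed)
        self.cell = cell
        # Each table repeats the construction with fresh projections, so that
        # points separated by a cell boundary in one table may meet in another.
        self.tables = []
        for _ in range(n_tables):
            dirs = [[rng.gauss(0, 1) for _ in range(dim)]
                    for _ in range(n_proj)]
            offs = [rng.uniform(0, cell) for _ in range(n_proj)]
            self.tables.append((dirs, offs, defaultdict(list)))

    def _key(self, x, dirs, offs):
        # Quantised projections: the coordinates of the grid cell holding x.
        return tuple(
            int((sum(a * b for a, b in zip(d, x)) + o) // self.cell)
            for d, o in zip(dirs, offs))

    def add(self, idx, x):
        for dirs, offs, buckets in self.tables:
            buckets[self._key(x, dirs, offs)].append(idx)

    def candidates(self, q):
        """Every point sharing at least one cell with q."""
        found = set()
        for dirs, offs, buckets in self.tables:
            found.update(buckets.get(self._key(q, dirs, offs), []))
        return found

    def query(self, q, data):
        """Exact distances are computed for the candidates only."""
        return min(self.candidates(q),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(data[i], q)),
                   default=None)


data = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0]]
lsh = RandomProjectionLSH(dim=2)
for i, x in enumerate(data):
    lsh.add(i, x)
```

A query identical to an indexed point always lands in that point's cells, so it is always among the candidates; nearby but distinct points are found only with high probability, which is why the construction is repeated over several tables.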

Min-Hash

Min-Hash can be seen as an instance of locality sensitive hashing (LSH);

The more similar two items are, the higher the chance that they share the same min-hash;

Similarity is measured by the Jaccard similarity (set overlap): the number of elements two sets have in common divided by the number of elements in their union;

It is a weaker representation than a BoW since word frequency information is reduced to binary information (present or absent);

To estimate the word overlap of two items, multiple independent min-Hash functions fi are used;

To efficiently retrieve items with high similarity, the values of min-Hash functions fi are grouped into s-tuples, called sketches;

The recall is increased by repeating the random selection of sketches k times;

A pair of items is a potential match when at least one sketch collision is encountered;

The probability of a pair of items having at least one sketch out of k in common is a function of the word overlap.
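A small sketch of min-hash signatures, overlap estimation and sketch collisions is given below. It assumes each hash function fi is an explicit random permutation of a small vocabulary, which is simple but only practical for toy vocabulary sizes; the function names and the values of `n_hashes`, `s` and `k` are illustrative.

```python
import random


def minhash_signature(words, vocab_size, n_hashes=200, seed=0):
    """One min-hash per function f_i; each f_i is a random permutation."""
    rng = random.Random(seed)
    sig = []
    for _ in range(n_hashes):
        perm = list(range(vocab_size))
        rng.shuffle(perm)
        sig.append(min(perm[w] for w in words))
    return sig


def estimate_overlap(sig_a, sig_b):
    """The fraction of equal min-hashes estimates |A & B| / |A | B|."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def sketch_keys(sig, s=2, k=50, seed=1):
    """k sketches, each an s-tuple of randomly selected min-hash values."""
    rng = random.Random(seed)
    keys = set()
    for _ in range(k):
        pick = rng.sample(range(len(sig)), s)
        keys.add((tuple(pick), tuple(sig[i] for i in pick)))
    return keys


def is_candidate_pair(sig_a, sig_b):
    """Potential match when at least one sketch collides."""
    return bool(sketch_keys(sig_a) & sketch_keys(sig_b))


A = set(range(0, 100))
B = set(range(50, 150))      # overlap with A: 50 / 150
C = set(range(200, 250))     # disjoint from A
sig_a, sig_b, sig_c = (minhash_signature(s, 300) for s in (A, B, C))
```

Requiring all s values of a sketch to agree makes a collision unlikely for dissimilar pairs, while repeating the selection k times recovers recall for similar ones.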

Inverted File

An inverted file index is just like an index in a book, where the keywords are mapped to the numbers of the pages using them;

In the visual word case, a table is likewise built that points from the word number to the indices of the database images containing the word;

Retrieval via the inverted file is faster than searching every image, assuming that not all images contain every word (sparse).

Index compression: Huffman compression

Note: inverted list contains both the vector identifier and the encoded residual.
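The book-index analogy can be sketched as a dictionary from visual-word id to the set of images containing it. The image ids and word ids below are hypothetical, and the scoring (number of shared words) is a deliberately simple stand-in for a real relevance score.

```python
from collections import Counter, defaultdict


def build_inverted_file(images):
    """images: {image_id: list of visual-word ids} -> {word: {image_id}}."""
    index = defaultdict(set)
    for image_id, words in images.items():
        for w in words:
            index[w].add(image_id)
    return index


def query(index, query_words):
    """Rank only the images that share at least one word with the query."""
    votes = Counter()
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return votes.most_common()


db = {"a": [1, 5, 9], "b": [2, 5], "c": [7, 8]}
index = build_inverted_file(db)
```

Querying with words 5 and 9 touches only the lists for those two words, so image "c" is never visited; this is the saving the sparsity assumption buys.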


State-of-Art Methods of Object Detection/Classification

2. Bags of words model (derived from natural language processing);

Key point Localization (MSER, SIFT, SURF, Shape Context…);

Codebook generation: clustering or quantizing the feature space (k-means);

Sparse coding for efficient quantization.

Learning with histogram of code words and its extension;

– Pyramid match kernel: map to multi-dimensional multi-resolution histograms;

– Spatial pyramid match: partition the image into increasingly fine sub-regions.

– Prob. Latent Semantic Analysis (pLSA): mixture decomposition from a latent model;

– Latent Dirichlet Allocation (LDA): add the Dirichlet prior for the topic distribution.

Bag of Visual Words

[Figure: bag-of-visual-words pipeline. Representation: feature detection & representation, codewords dictionary, image representation; then category models (and/or) classifiers; then category decision.]

Pyramid Match Kernel

Fast approximation of Earth Mover’s Distance;

Weighted sum of histogram intersections at multiple resolutions (linear in the

number of features instead of cubic);
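The weighted sum of histogram intersections can be sketched as follows. The sketch assumes bin side lengths that double at each level and weights of 1/2^level applied to newly matched pairs only, following the usual formulation of the pyramid match kernel; these details are not stated on the slide.

```python
from collections import Counter


def intersection(points_x, points_y, side):
    """Histogram intersection for grid bins of the given side length."""
    hx = Counter(tuple(int(v // side) for v in p) for p in points_x)
    hy = Counter(tuple(int(v // side) for v in p) for p in points_y)
    return sum(min(n, hy[b]) for b, n in hx.items())


def pyramid_match(points_x, points_y, levels):
    """Sum over levels of newly matched pairs, weighted by 1 / 2**level."""
    score, previous = 0.0, 0
    for level in range(levels + 1):
        matched = intersection(points_x, points_y, 2 ** level)
        score += (matched - previous) / 2 ** level   # new matches only
        previous = matched
    return score


# Two small 1-D feature sets (hypothetical data).
X = [(0.5,), (4.2,)]
Y = [(0.7,), (7.9,)]
```

For X and Y, one pair matches at the finest level (weight 1) and the second pair only matches once the bins have side 4 (weight 1/4), so the score is 1.25. Each level costs one pass over the features, which is where the linear complexity comes from.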

Spatial Pyramid Matching

Based on pyramid matching kernel;

Descriptor layer: detect and locate features, extract corresponding descriptors;

Code layer: code the descriptor by VQ, soft-VQ or even sparse coding;

SPM layer: pool codes across subregions and normalize into a histogram;

Classifiers with these features by nonlinear kernels.

Vocabulary Trees

Vocabulary Tree defined using an offline unsupervised (k-means) training stage.

Hierarchical scoring based on term frequency inverse document frequency (TF-IDF).

Number of the descriptor vectors of each image with a path along the node i (ni query, mi database)

Number of images in the database with at least one descriptor

vector path through the node i (Ni )

Defining the relevance score
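The scoring formula itself is not reproduced in the text. As a hedged reconstruction, the TF-IDF scoring commonly given for vocabulary trees (Nister & Stewenius, 2006) can be written with the symbols defined above, where N (an added symbol) is the total number of images in the database:

```latex
w_i = \ln \frac{N}{N_i}, \qquad
q_i = n_i \, w_i, \qquad
d_i = m_i \, w_i, \qquad
s(q, d) = \left\| \frac{q}{\|q\|} - \frac{d}{\|d\|} \right\|
```

A lower s means a more relevant database image. Nodes that occur in most images get a weight w_i close to zero and therefore contribute little to the score.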

Implementation of Scoring

Every node is associated with an inverted file

Decrease the fraction of images in the database that have to be explicitly considered for a query

Hierarchical k-means

K-means tree of height h (levels)

Determining the path of a descriptor means performing k·h dot products (k per level).

Nister & Stewenius, 2006

Vocabulary Trees

[Figure: index and query]

Locally Aggregated Descriptors

VLAD: vector of locally aggregated descriptors;

Learning: a vector quantizer (k-means)

output: k centroids (visual words): c1,…,ci,…ck

centroid ci has dimension d

For a given image

assign each descriptor to closest center ci

accumulate (sum) descriptors per cell

vi := vi + (x - ci)

VLAD (dimension D = k x d): run PCA for reduction

The vector is L2-normalized;

VLAD better than BoF for a given descriptor size

comparable to Fisher descriptors for these operating points

Choose a small D if output dimension D’ is small.
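The aggregation step can be sketched as below. The centroids are assumed to be given (learned beforehand by k-means), and the PCA reduction is omitted; only the residual accumulation and the L2 normalization are shown.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate residuals per centroid into one L2-normalised vector."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Assign x to its closest centroid c_i ...
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(x, centroids[j])))
        # ... and accumulate the residual: v_i := v_i + (x - c_i).
        for t in range(d):
            v[i][t] += x[t] - centroids[i][t]
    flat = [val for row in v for val in row]        # dimension D = k * d
    norm = math.sqrt(sum(val * val for val in flat))
    return [val / norm for val in flat] if norm > 0 else flat


centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
out = vlad(descriptors, centroids)
```

In this toy case the two descriptors near the first centroid sum to the residual (1, 1) and the third gives (0, 2) for the second centroid, so the unnormalized vector is (1, 1, 0, 2).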

Fisher Kernel

Given a likelihood function uλ with parameters λ, the score function of a given sample X is given by:

fixed-length vector whose dimensionality depends only on # parameters.

Intuition: direction in which the parameters λ of the model should be modified to better fit the data;

Fisher information matrix (FIM) or negative Hessian:

Measure similarity between two samples using the Fisher Kernel (FK):

FK can be rewritten as a dot product between Fisher Vectors (FV):

A Gaussian Mixture Model (GMM) is trained on a large set of features X={xt, t=1,...T}, to get a probabilistic visual vocabulary (soft BoV);

Fisher kernel transforms a variable-size set of independent samples into a fixed-size vector representation.
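The equations referred to on this slide are not reproduced in the text. As a hedged reconstruction, the standard Fisher kernel definitions can be written as follows; the symbols G, F, K, L and the calligraphic G are the usual notation and are assumptions with respect to the original slide:

```latex
% Score function of a sample X
G_\lambda^X = \nabla_\lambda \log u_\lambda(X)

% Fisher information matrix
F_\lambda = \mathbb{E}_{x \sim u_\lambda}
  \left[ \nabla_\lambda \log u_\lambda(x) \,
         \nabla_\lambda \log u_\lambda(x)^{\top} \right]

% Fisher kernel between two samples X and Y
K(X, Y) = {G_\lambda^X}^{\top} F_\lambda^{-1} G_\lambda^Y

% With F_\lambda^{-1} = L_\lambda^{\top} L_\lambda, the kernel is a dot
% product between Fisher vectors
\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X, \qquad
K(X, Y) = {\mathcal{G}_\lambda^X}^{\top} \mathcal{G}_\lambda^Y
```

For a set of T independent samples, the log-likelihood is a sum over the samples, so the score is a sum of per-sample gradients; its dimensionality depends only on the number of parameters in λ, not on T.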

average pooling

Hamming Embedding

Representation of a descriptor x with binary signatures

Vector-quantized to q(x) as in standard BoF

Short binary vector b(x) for an additional localization in the Voronoi cell

Define HE matching: two descriptors x and y match iff q(x)=q(y) and h(b(x), b(y)) <= ht

where h(a, b) is the Hamming distance.

Nearest neighbors for Hamming distance ≈ the ones for Euclidean distance

Efficiency:

Hamming distance = very few operations

Fewer random memory accesses: faster than BOF with the same dictionary size!

Off-line (given a quantizer)

Draw an orthogonal projection matrix P of size db× d (random matrix generation)

this defines db random projection directions (projection and assignment for each learning data point)

for each Voronoi cell and projection direction, compute the median value from a learning set;

On-line: Compute the binary signature b(x) of a given descriptor

project x onto the projection directions as z(x) = (z1,…zdb)

Signature: bi(x) = 1 if zi(x) is above the learned median value, otherwise 0
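The off-line and on-line stages can be sketched as below. This is a toy illustration under assumptions: the quantizer (two hand-picked centroids) is given, the orthogonal projection is obtained by Gram-Schmidt on Gaussian vectors, and every Voronoi cell is assumed to receive at least one learning point. All function names are hypothetical.

```python
import math
import random
import statistics


def quantize(x, centroids):
    """q(x): index of the nearest centroid (the Voronoi cell of x)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(x, centroids[j])))


def random_orthogonal_rows(db, d, rng):
    """Gram-Schmidt on Gaussian vectors: db orthonormal rows (db <= d)."""
    rows = []
    while len(rows) < db:
        v = [rng.gauss(0, 1) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def project(P, x):
    return [sum(a * b for a, b in zip(row, x)) for row in P]


def train_medians(P, centroids, learning_set):
    """Off-line: median of each projection within each Voronoi cell."""
    per_cell = {}
    for x in learning_set:
        per_cell.setdefault(quantize(x, centroids), []).append(project(P, x))
    return {cell: [statistics.median(col) for col in zip(*zs)]
            for cell, zs in per_cell.items()}


def signature(x, P, centroids, medians):
    """On-line: (q(x), b(x)) with b_i(x) = 1 if z_i(x) is above the median."""
    cell = quantize(x, centroids)
    z = project(P, x)
    return cell, tuple(1 if zi > mi else 0
                       for zi, mi in zip(z, medians[cell]))


def he_match(sig_x, sig_y, ht):
    (qx, bx), (qy, by) = sig_x, sig_y
    return qx == qy and sum(a != b for a, b in zip(bx, by)) <= ht


rng = random.Random(0)
centroids = [[0.0] * 4, [10.0] * 4]
learning_set = [[c + rng.gauss(0, 1) for c in cen]
                for cen in centroids for _ in range(50)]
P = random_orthogonal_rows(3, 4, rng)
medians = train_medians(P, centroids, learning_set)
sx = signature([0.2, 0.1, -0.3, 0.0], P, centroids, medians)
sy = signature([9.8, 10.1, 10.0, 9.9], P, centroids, medians)
```

Using per-cell medians as thresholds makes each bit roughly balanced within its cell, so the signature refines the position of a descriptor inside the Voronoi cell.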

Product Quantization

Main idea: compressed representation of the database vectors;

Vector split into m sub-vectors: y --> [y1|…|ym];

Sub-vectors are quantized separately by different quantizers:

q(y)=[q1(y1)|…|qm(ym)], where each qi is learned by k-means with a limited number of centroids;

The key: estimate the distances in the compressed domain, such that

Quantization is fast enough;

Quantization is precise, i.e., many different possible indexes (ex: 2^64).

Note: Regular k-means is not appropriate; not for k=2^64 centroids;

Product quantization-based approach offers

Competitive search accuracy

Compact footprint: few bytes per indexed vector
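Encoding and distance estimation in the compressed domain can be sketched as follows. The per-sub-vector codebooks are assumed to be given (learned beforehand by k-means), and the distance shown is the asymmetric variant, where the query is kept unquantized; the tiny codebooks are hypothetical.

```python
def split(v, m):
    """Split v into m equal-length sub-vectors [v1 | ... | vm]."""
    step = len(v) // m
    return [v[i * step:(i + 1) * step] for i in range(m)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(y, codebooks):
    """Code = one centroid index per sub-vector: [q1(y1) | ... | qm(ym)]."""
    return [min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
            for sub, cb in zip(split(y, len(codebooks)), codebooks)]


def adc_distance(x, code, codebooks):
    """Estimated squared distance between query x and an encoded vector."""
    subs = split(x, len(codebooks))
    # Lookup tables: distance from each query sub-vector to every centroid.
    # In a real search they are built once per query and reused.
    tables = [[sq_dist(sub, c) for c in cb]
              for sub, cb in zip(subs, codebooks)]
    return sum(table[idx] for table, idx in zip(tables, code))


# m = 2 sub-quantizers with 2 centroids each.
codebooks = [[[0.0, 0.0], [1.0, 1.0]],
             [[0.0, 0.0], [5.0, 5.0]]]
code = pq_encode([0.9, 1.2, 4.8, 5.1], codebooks)
```

With m sub-quantizers of k* centroids each, the product quantizer has k*^m distinct codes while storing only m·k* centroids; for example, m = 8 and k* = 256 gives 2^64 possible indexes at 8 bytes per vector.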

MPEG CDVS Image Feature Extraction Pipeline

• Keypoint detection: ALP;

• SIFT descriptor;

• Feature selection;

• Local descriptor compression;

• Coordinate coding;

• Global descriptor aggregation;

MPEG Standard CDVS-based Pairwise Matching

and Indexing/Retrieval Pipeline

• Global descriptor: top matches;

• Local descriptor: decoding->matching->geometric verification->localization;

• Location coding and descriptor coding.

MPEG Standard CDVS Key Proposals

• CE1: global descriptor: Residual Enhanced Visual Vector (REVV);

• SCFV (Scalable Compressed Fisher Vector);

• Robust Visual Descriptor (RVD);

• CE2: local descriptor compression: CHoG;

• Transform + Scalar Quantizer;

• Multi-stage Vector Quantizer;

• CE3: Location coding: context based coordinate coding;

• CE4: key-point detection: ALP;

• Block-based Frequency domain LoG;

• CE5: Local Descriptors: SIFT;

• CE6: retrieval pipeline: MBIT;

• CE7: feature selection: Naïve Bayesian learning based feature selection;

• CE8: pairwise matching pipeline: DISTRAT (local matching with geometric verification)

• Weighted Hamming distance (global matching).

ALP

• SIFT Patent: DoG filtering to construct scale space!

• ALP (low polynomial degree): by Telecom Italia;

• Idea: scale space response modeled by a polynomial function; then

estimation of coefficients by LoG filtering at different scales:

• 1. Scale space: approximated by a polynomial, get local extrema in the

polynomials, eliminates those exceeded at the boundaries, output a list of

candidates (x, y, σ);

• 2. Clean candidates: either at edges with bigger ratio of curvatures, or with lower

absolute values or curvatures;

• 3. Refinement of coordinates: approximate scale space by a polynomial;

• 4. Eliminates duplicates at octave boundaries;

• 5. Find the remaining candidates.

D. G. Lowe: ”Distinctive image features from scale-invariant keypoints”, IJCV 2004, Patent No US 6,711,293.

BoVW, VT, VLAD, REVV, Fisher Vector, CFV, SCFV, RVD

• Bag of Visual Words: zero order moments;

• Vocabulary Tree: hierarchical clustering;

• Vector of Locally Aggregated Descriptors (VLAD): 1st order moments;

• Residual Enhanced Visual Vector: similar to VLAD, as residual + LDA;

• Hamming Embedding: BoW plus a binary vector (Hamming distance);

• Fisher Vector: 2nd order moments;

• Compressed Fisher Vector: Fisher vector + Product Quantization;

• PQ: decomposed into Cartesian product of subspace, then quantized separately;

• Spectral Hashing: partition by spectral method (eigenvectors of graph Laplacian).

• Scalable Compressed Fisher Vector: sparsity of FVs;

• Robust Visual Descriptor: similar to SCFV, robust cluster + bit selection.


State-of-Art Methods of Object Detection/Classification

3. Part-based method (structure);

Constellation Model: find mean of the appearance density, mean location & uncertainty in part location,

output the prob. of that part being present;

Pictorial Structures: modeled with unary template and pair-wise springs, then joint estimation of part

locations;

Implicit Shape Model: Consistent configurations of the observed parts (visual words) with spatial

distribution and final probabilistic voting for object segmentation;

Deformable model: learn the latent SVM model structure, filters, deformation costs.

Constellation Model

Originally for unsupervised learning of object categories;

It represents objects by estimating a joint appearance and shape distribution of their parts;

The model is very flexible and can even be applied to objects that are only characterized by

their texture;

Representation

Joint model of part locations

Ability to deal with background clutter and occlusions

Multiple mixture components for different viewpoints

Learning

Manual construction of part detectors

Estimate parameters of shape density

Now semi-unsupervised

Automatic construction and selection of part detectors

Estimation of parameters using EM

Recognition

Run part detectors over image

Try combinations of features in model

Use efficient search techniques to make it fast

Pictorial Structure

Objects are modeled by a collection of parts in a deformable configuration;

Statistical framework

Prior distribution is defined as a tree-structured Markov random field where no preference is given to

the absolute location of each part;

Model parts based on the response of Gaussian derivative filters of different orders, orientations and

scales;

Connections modeled by springs between parts: Gaussian Distribution.

Best match is found by minimizing function that measures both individual match costs and connection

costs;

The function measures how well each part matches at its location and how well the locations agree with the deformable model;

Matching a pictorial structure does not involve making any decisions about location of individual

parts;

It is solved independently and implies that any kind of part model can be used as long as maximum

likelihood can be computed for an individual part;

Implicit Shape Model

Representation: object shape is only defined implicitly;

Semantically meaningful parts are not modeled; instead, the object is represented as a collection of a large number of prototypical features that densely cover the object area;

Each feature has a defined appearance and a spatial probability distribution for the locations relative

to the object center;

Learning:

Build a visual vocabulary from local features overlapped with training objects;

Learns a spatial occurrence distribution for each visual word (a list of all positions and scales relative

to the object center, and a reference figure-ground mask for each occurrence entry, used for inferring a

top-down segmentation);

Recognition: by the Generalized Hough Transform

In the test image, only a small fraction of the learned features will typically occur, but their consistent configuration still provides strong evidence for an object’s presence;

Each activated visual word then casts votes for possible positions of the object center according to its

learned spatial distribution;

Consistent hypotheses are searched as local maxima in the voting space.

Implicit Shape Model (Training + Recognition)

Deformable Model

Latent SVM Model training: discriminative model with latent variable

The learned positions of object-parts and the position of the whole object are the Latent Variables;

Training data consists of images with labeled bounding boxes;

Need to learn the model structure, filters and deformation costs;

Detection: 8x8 blocks, HOG feature at different resolution;

Root filter: rectangular templates defining weights for features

Learn root filter by standard SVM;

Part filter: Multi-scale model captures features;

Deformation model: matching with pictorial structures.

Articulated Human Detection with Flexible Mixtures-of-Parts

• Human detection and human pose estimation in static images based on deformable part models;

• A general, flexible mixture model that jointly captures spatial relations between part locations and co-occurrence relations between part mixtures, augmenting standard pictorial structure models that encode just spatial relations;

• Learn all parameters, including local appearances, spatial relations, and co-occurrence relations (which

encode local rigidity) with a structured SVM solver.

Flexible mixture-of-parts model (middle), not classic

approaches (left) by warping template to different

orientation and foreshortening states (top right),

approximate small warps by translating patches

connected with a spring (bottom right).

A visualization of our model for K = 14 parts and T

= 4 local mixtures, trained on the Parse dataset.

Appearance and Expressive Spatial Models for Human Pose Estimation

Extend the basic Pictorial Structure model: (a) to more flexible structure with stronger local

appearance representations including single component part detectors (b) and mixtures of

part detectors (c). Combine local appearance model with mid-level representation based

on semi-global poselets which capture configurations of multiple parts (d).

MODEC: Multimodal Decomposed Models for Posture

• Most pictorial structures researchers have put effort into better and larger feature spaces, in which they fit one linear model;

• Feature computation is expensive, and still fails at capturing many appearance modes in real data.

• Instead of finding an increasing number of features for high-dimensional linear separability, model non-linearities in simpler, lower dimensional feature spaces, using locally linear models.

Armlet: a Varied Poselet for Upper Body Posture

• Predict the pose of both arms for each person;

• Train highly discriminative classifiers to estimate the arm configurations (armlets);

• Extend HOG, integrate strong contours, skin color and contextual cues.

Far left: HOG with local gradient contours for each cell.

Left: each cell is HOG with gPb. Right: average value of

the skin classifier at each cell. Far right: context feature; at

each cell, a poselet activation feature vector; for each

poselet type, maximum of all poselet activations whose

center falls in the cell, or zero if no activations are present.

Pose Recognition from A Depth Image

From a depth image of single frame, the pipeline includes background removal, initial pose estimation,

and pose correction. The skeleton joints marked by color dots in (c) are the ones with high confidence in

estimation whereas the ones without color dots are with the low confidence.

Each pixel (black square) casts a 3D vote (orange line) for each joint. Mean shift is used to aggregate

these votes and produce a final set of hypotheses for each joint. The highest confidence hypothesis for

each joint is shown. NB ‘left’ refers to the user’s left as if looking in a mirror.

1. regression directly from the raw depth image, no use of an arbitrary intermediate representation;

2. applicability to general motions (not constrained to particular activities);

3. the ability to localize occluded as well as visible body joints.

Learning Pose from Depth Images


State-of-Art Methods of Object Detection/Classification

4. Generative models vs. Discriminative models;

– Naïve Bayes classifier, MRF, PCA, mixture Gaussian, pLSA, LDA, …;

– SVM, Adaboost, random forest, CRF, Logistic regression, KNN,…;

– Deep learning: convolutional NN, DBN, DBM, Auto-encoder, ....

Neural Network (MLP)

A multilayer perceptron (MLP) is a feedforward artificial NN model that maps sets of input data onto a set of appropriate outputs;

An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one;

Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function;

MLP utilizes a supervised learning technique called back propagation for training the network;

MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly

separable.
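The claim about non-linearly-separable data can be illustrated with XOR, which no single linear perceptron can represent. The sketch below shows only the forward pass of a tiny MLP with one hidden layer of two ReLU units; the weights are hand-picked for the example rather than learned by back propagation, and the function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def mlp_xor(x):
    """Two inputs -> two ReLU hidden units -> one linear output."""
    hidden = dense(x, [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    (out,) = dense(hidden, [[1.0, -2.0]], [0.0], lambda z: z)
    return out


outputs = [mlp_xor([a, b]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
```

The output is relu(x1 + x2) - 2·relu(x1 + x2 - 1), which is 0 for (0, 0) and (1, 1) and 1 for the two mixed inputs. Removing the nonlinearity would collapse the two layers into a single linear map, which cannot produce this pattern.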

Decision Trees

Classification trees and regression trees predict responses to data.

To predict a response, follow the decisions in the tree from the root node down to a leaf

node. The leaf node contains the response.

Classification trees give responses that are nominal, such as 'true' or 'false'.

Regression trees give numeric responses.

A Decision Tree consists of three

types of nodes:

1. Decision nodes;

2. Chance nodes;

3. End nodes .

https://sites.google.com/site/yorkyuhuang/home/research/machine-learning-information-retrieval/x-ray-based-detection/random_forest.png?attredirects=0

Naïve Bayes Classifier

The Naive Bayes classifier is designed for features that are independent of one another within each class, but it appears to work well even when that independence assumption is not valid. It classifies data in two steps:

Training step: using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.

Prediction step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test sample according to the largest posterior probability.

The class-conditional independence assumption greatly simplifies the training step, since the one-dimensional class-conditional density can be estimated for each feature individually;

While class-conditional independence between features does not hold in general, research shows that this optimistic assumption works well in practice;

The class-conditional independence assumption allows the Naive Bayes classifier to estimate its parameters accurately while using less training data than many other classifiers;

This makes it particularly effective for datasets containing many predictors or features.
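The two steps can be sketched as below, assuming (as one possible choice) a one-dimensional Gaussian density per feature and class. The function names `nb_train` and `nb_predict` are illustrative, not from any library.

```python
import math
from collections import defaultdict


def nb_train(samples, labels):
    """Training step: per-class prior plus per-feature mean and variance."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))             # one tuple per feature
        means = [sum(col) / n for col in columns]
        # Small constant avoids a zero variance for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[y] = (n / len(samples), means, variances)
    return model


def nb_log_posteriors(model, x):
    """Unnormalized log posterior of each class for sample x."""
    scores = {}
    for y, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log densities simply add up.
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[y] = score
    return scores


def nb_predict(model, x):
    """Prediction step: pick the class with the largest posterior."""
    scores = nb_log_posteriors(model, x)
    return max(scores, key=scores.get)
```

Each feature contributes one mean and one variance per class, so the number of parameters grows only linearly with the number of features.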

Support Vector Machines (SVM)

Separable Data

An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class.

“Margin” means the maximal width of the slab parallel to the hyperplane that has no interior data points; the best hyperplane is the one with the largest margin.

The support vectors are the data points that are closest to the separating hyperplane.

Non-separable Data

The data might not allow for a separating hyperplane. In that case, an SVM can use a soft margin, meaning a hyperplane that separates many, but not all, data points.

Kernel trick: polynomial, radial basis function (RBF) or sigmoid kernels.
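A minimal sketch of the three kernel families named above. The default parameter values (`degree`, `gamma`, `coef0`) are arbitrary assumptions, and `poly2_features` is a hypothetical helper that only exists to show what the kernel trick avoids computing explicitly.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def poly2_features(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input
    (with coef0 = 1); the kernel equals a dot product in this 6-D space."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]
```

For example, `polynomial_kernel((1, 2), (3, 4))` and `dot(poly2_features((1, 2)), poly2_features((3, 4)))` both evaluate to 144 (up to rounding), but the kernel never forms the higher-dimensional vectors.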

Generative Model: MRF

Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi

takes value fi in a label set L.

Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it

satisfies Markov property.

Generative model for the joint probability p(x);

The potential functions allow no direct probabilistic interpretation:

define potential functions Ψ on the maximal cliques A;

each Ψ maps a joint assignment to a non-negative real number;

the product of the potentials requires normalization;

An MRF is an undirected graphical model.

Hidden Markov Model

A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;

In HMM, state is not visible, but output, dependent on state, is visible.

Each state has a probability distribution over the possible output tokens;

Sequence of tokens generated by an HMM gives some information about the sequence of states.

Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;

A HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;

Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP);

Learning: optimize state transition and output probabilities by Baum-Welch algorithm (special case of EM).
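The two inference tasks can be sketched for a small discrete HMM as below: the forward pass gives the probability of the observed sequence, and Viterbi gives the most likely state trajectory by dynamic programming. The two-state model in the usage note is a textbook-style toy example, not taken from these slides.

```python
def hmm_forward(obs, states, start_p, trans_p, emit_p):
    """Probability of the observed sequence (forward pass)."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def hmm_viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence and its probability (dynamic programming)."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0])
            new_best[s] = (prob * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy model: hidden health state, visible symptom.
STATES = ("Healthy", "Fever")
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever": {"Healthy": 0.4, "Fever": 0.6}}
EMIT = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
```

For the observations `("normal", "cold", "dizzy")`, Viterbi returns the path Healthy, Healthy, Fever; the forward probability sums over all paths and is therefore at least as large as the probability of that single best path.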

http://en.wikipedia.org/wiki/File:Hmm_temporal_bayesian_net.svg

Discriminative Model: CRF

Conditional, not joint, probabilistic sequential models p(y|x)

Allow arbitrary, non-independent features on the observation seq X

Specify the probability of possible label seq given an observation seq

Prob. of a transition between labels depend on past/future observ.

Relax strong independence assumptions, no p(x) required

CRF is MRF plus “external” variables, where “internal” variables Y of MRF are un-observables and “external” variables X

are observables

Linear chain CRF: transition score depends on current observation

Inference by DP like HMM, learning by forward-backward as HMM

Optimization for learning CRF: discriminative model

Conjugate gradient, stochastic gradient,…

AdaBoost

Boosting: at each step, the training data are re-weighted so that incorrectly classified objects get larger weights in a new, modified training set, which in effect maximizes the margins between objects;

Classifiers are constructed on weighted versions of the training set, which depend on the previous classification results;

Boosting originated from the Probably Approximately Correct (PAC) learning theory;

AdaBoost (adaptive boosting) is the first boosting algorithm that could adapt to the weak learners;

Variants of AdaBoost (the original is Discrete AdaBoost):

LogitBoost;

GentleBoost: the update is fm(x) = P(y=1|x) - P(y=0|x) instead of RealBoost's half log-ratio, fm(x) = ½ log[P(y=1|x) / P(y=0|x)].
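The re-weighting idea can be sketched with Discrete AdaBoost over one-dimensional decision stumps. This is a minimal illustration under assumed conventions (labels in {+1, -1}, exhaustive stump search); the function names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner: a threshold test on a scalar feature."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Pick the threshold/polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_train(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction = -1) get larger weights.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_classify(ensemble, x):
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1
```

On the points 0..9 with labels `[1, 1, 1, -1, -1, -1, 1, 1, 1, -1]`, no single stump separates the classes, but the weighted vote of three stumps classifies all ten training points correctly.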


Detection/Classification 5. Efficiency in Detection/Classification:

Efficient features: Vector quantization;

Integral/Aggregate channel features;

Divide-and-conquer: Cascaded Classifiers.

Coarse-to-fine: Feature scaling, feature pyramid;

Branch-and-bound; Efficient subwindow search.

Generic object proposal:

Selective search;

BING for objectness;

Dynamic programming: avoid repeated computation.

Parallelize (hardware): GPU;

Multi-core;

Cloud?

Integral Channel Features (ICF)

Multiple registered image channels are computed using linear/non-linear transformations of the input image; the features extracted from them are called integral channel features;

Features such as local sums, histograms, Haar features and their various generalizations are computed efficiently using integral images;

ICF naturally integrate heterogeneous sources of information, have few parameters, and result in fast, accurate detectors.
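The integral image that makes these local sums cheap can be sketched as follows; a single channel is represented as a list of rows, and the function names are illustrative.

```python
def integral_image(channel):
    """ii[r][c] = sum of channel[0..r-1][0..c-1]; one extra row/column of zeros."""
    height, width = len(channel), len(channel[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += channel[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_two_rect(ii, top, left, bottom, right):
    """Haar-like feature: left half minus right half of the window."""
    mid = (left + right) // 2
    return box_sum(ii, top, left, bottom, mid) - box_sum(ii, top, mid, bottom, right)
```

After one pass over the channel, any rectangular sum costs four lookups regardless of the rectangle size, which is why features built from local sums are fast to evaluate in a sliding window.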

Integral Channel Features (ICF) Very efficient for human/pedestrian detection;

6 quantized orientations, 1 gradient magnitude, 3 LUV color channels.

Multi-scale without image scaling.

Boosted classifiers: soft cascades

Two level decision trees.

Aggregate Channel Features (ACF)

• Compute several channels of the image, sum every block of pixels, and smooth the resulting lower-resolution channels;

• Features are single pixel lookups in the aggregated channels;

• Boosting is used to learn decision trees over these features (pixels) to distinguish object from background;

• A multiscale sliding window is applied;

• With an appropriate choice of channels, ACF achieves state-of-the-art performance in pedestrian detection;

• Channels: normalized gradient magnitude, histogram of oriented gradients and LUV color;

• Compute the feature pyramid at octave-spaced scale intervals;

• AdaBoost is used for training, combining 2048 depth-two trees over the 5120 candidate features in each 128x64 search window.

Segmentation as Selective Search Challenges

Objects extremely diverse

Various shapes, sizes and appearances;

Within object variation

Multiple materials and textures with strong

interior boundaries;

Many objects in an image

Selective search by hierarchical grouping:

encouraging diversity.

BING: Binarized Normed Gradients

What is an object?

Standalone, unique with different appearance from

surroundings, well closed boundary;

Objectness metric is for how likely a window covers an

object of any category;

Reduce the search space;

Allow strong classifiers.

Normed gradients (NG) + Linear SVMs

BING feature: illustration

Use a single atomic variable (INT64 & BYTE) to represent a BING feature and its last row.

CVPR 2014

Branch-and-Bound for Sub-window Search

It is a general algorithm for finding optimal solutions, especially in discrete and combinatorial optimization;

A branch-and-bound algorithm consists of a systematic enumeration of all candidate solutions, where large subsets of fruitless candidates are discarded en masse, by using upper and lower estimated bounds of the quantity being optimized;

Splitting (branching): given a set of candidates S, split it into subsets S1 ∪ S2 ∪ … = S; this defines a search tree whose nodes are S1, S2, …

Bounding: compute upper/lower bounds for the objective function within each Si;

The idea (for minimization): prune by maintaining a global variable recording the minimum upper bound seen so far, so any node whose lower bound is greater than it can be discarded;

Used for efficient sliding sub-window search.
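As a hedged illustration, the sketch below applies branch-and-bound to a one-dimensional analogue of sub-window search: find the interval [a, b] with the maximum sum of per-position scores. It uses the best-first, maximization form (always expand the candidate set with the largest upper bound) rather than the minimization form with a global pruning threshold. A candidate set is a range of start positions times a range of end positions; the bound adds all positive scores in the largest interval of the set and all negative scores in the smallest one.

```python
import heapq


def prefix_sums(values):
    out = [0.0]
    for v in values:
        out.append(out[-1] + v)
    return out


def best_interval(scores):
    """Return ((a, b), total) maximizing sum(scores[a..b]), a <= b inclusive."""
    n = len(scores)
    pos = prefix_sums([max(s, 0.0) for s in scores])
    neg = prefix_sums([min(s, 0.0) for s in scores])

    def upper_bound(a_lo, a_hi, b_lo, b_hi):
        # Positive scores inside the largest interval [a_lo, b_hi] ...
        bound = pos[b_hi + 1] - pos[a_lo]
        # ... plus negative scores inside the smallest interval [a_hi, b_lo].
        if a_hi <= b_lo:
            bound += neg[b_lo + 1] - neg[a_hi]
        return bound

    root = (0, n - 1, 0, n - 1)
    heap = [(-upper_bound(*root),) + root]
    while heap:
        neg_bound, a_lo, a_hi, b_lo, b_hi = heapq.heappop(heap)
        if a_lo == a_hi and b_lo == b_hi:
            # A single interval: its bound is exact, and no other set can beat it.
            return (a_lo, b_lo), -neg_bound
        # Branch: split the wider of the two parameter ranges in half.
        if a_hi - a_lo >= b_hi - b_lo:
            mid = (a_lo + a_hi) // 2
            children = [(a_lo, mid, b_lo, b_hi), (mid + 1, a_hi, b_lo, b_hi)]
        else:
            mid = (b_lo + b_hi) // 2
            children = [(a_lo, a_hi, b_lo, mid), (a_lo, a_hi, mid + 1, b_hi)]
        for child in children:
            if child[0] <= child[3]:       # keep sets with at least one a <= b
                heapq.heappush(heap, (-upper_bound(*child),) + child)
    return None
```

Because the bound is exact for a single interval and never underestimates a set, the first single interval popped from the queue is optimal, and large sets with poor bounds are never expanded.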


Detection/Classification 6. Open Questions:

Viewpoint variations;

Scales and appearance variation (illumination and shape distortion);

Cluttered background, partial occlusion, lack of contextual inform.;

Multiple poses (out of the plane or on the plane) or articulation;

Variations of the intra class or category.


Detection/Classification 7. “Open Set” problem: how to handle unknown or unfamiliar classes;

Label as one of known classes or as unknown;

Zero shot learning/unseen class detection;

Novelty detection with null space methods; One class SVM;

Multiple classes: Artificial super class from all given classes;

Combine several one class classifiers learned separately;

K-nearest neighbors;


Detection/Classification 8. “Data unbalancing” problem:

Resampling methods for balancing the data set. Over-sampling, under-sampling, importance sampling;

Modification of existing learning algorithms. Cost-sensitive learning;

One class classification;

Classifier ensemble (bagging, boosting, random forest…)

Measuring the classifier performance in imbalanced domains. ROC, F-measure,…

Relationship between class imbalance and other data complexity characteristics.


Detection/Classification 9. Face detection/recognition is a special area: large variations.

Face detection: Viola-Jones cascaded AdaBoost over simple (Haar-like) features;

Face verification/authentication: validate a claimed identity based on the image, and either accept

or reject the identity claim (one-to-one matching);

Face identification: identify a person based on the image, compared with all the registered persons

(one-to-many matching);

Face clustering: find the common people among these faces.

Face Detection and Pose Alignment

• Paul Viola’s method: Integral image + Cascaded Adaboost;

• Boosting learning:

• RealBoost;

• FloatBoost;

• GentleBoost;

• HoG + SVM + Image Pyramid (dlib C++ lib);

• Generic linear features: Anisotropic Gaussian filters;

• Variation of LBP and HoG for face detection:

• Local Gradient Pattern and Binary Histogram of Gradients;

• Integral/Aggregate Channel Features: HOG + LUV;

• Shape features: edgelet, shapelet;

• DPM (Deformable Part Model) or Pictorial model plus SVM;

• Strongly or weakly supervised.

• Antiface: multi-template matching in cascade.

Face Detection and Pose Alignment

View/Pose alignment + view/pose specific detectors;

Facial feature (2d-landmark) detection for alignment: holistic or local

ASM, AAM: generative model;

Elastic graph matching;

Constrained local model (CLM): global shape constraints;

Explicit Shape Regression;

Robust cascaded pose regression (RCPR);

Conditional regression forests;

Tree Structured Part Model (TSPM): [Zhu’12];

Ensemble of regression trees (dlib C++ lib);

Supervised Descent Method;

Parts-based deformable shape model;

Congealing or funneling: reduce the entropy by transform.

Face Detection, Landmark Localization and Pose Alignment

• An ensemble of regression trees is used to estimate the face’s landmark positions directly from a sparse subset of pixel intensities;

• Based on gradient boosting, it learns by optimizing the sum of square error loss and naturally handles missing or partially labelled data;

• Appropriate (exponential) priors exploiting the structure of image data help with efficient feature selection;

• Different regularization strategies are used to combat overfitting;

• Can detect 194 landmarks on a face from a single image in a millisecond for face alignment;

• Open sources: http://dlib.net/.

Landmark estimates at different levels of the cascade initialized with the mean shape centered

at the output of a Viola face detector. T is number of strong regressors in the tree.


Joint Cascaded Face Detection and Alignment

• Define the Post classifier:

• use the Viola-Jones detector in OpenCV with a low threshold to ensure a high recall;

• split all the images into two parts, then use the positive and negative output windows in the first part to train a linear SVM classifier, and test all the output windows in the second part;

• Feature extractors (the 3rd one is the best for classifier performance):

• 1. divide the window into 6*6 non-overlapping cells and extract a SIFT descriptor in each cell;

• 2. use a mean face shape with 27 facial points and extract a SIFT descriptor centered on each point;

• 3. align the 27 facial points and extract a SIFT descriptor centered on each point;

• Local learning of the tree structure;

• Global learning of the tree output;

• A Unified Framework for Cascade Face Detection and Alignment:

• Cascade Detection;

• Cascade Alignment;

• A Unified Framework;

• Joint Learning of Detection and Alignment:

• S-strategy: in the split test of each internal node, randomly choose to either minimize the binary entropy for classification or the variance of the facial point increments for regression;

• use RealBoost for the cascade classification learning with multi-scale shape indexed pixel difference features.


Face Recognition: Verification or Identification

• How to represent the face?

• Features are hand-crafted or learned automatically;

• Global feature-based: eigen face, fisher face;

• Local feature-based: Gabor feature, Haar, HOG, LBP, SIFT feature.

• Hierarchically? (manual first, then learn from it…)

• Dimensionality reduction? (high dimensional is good)

• Subspace methods: PCA, LDA

• Manifold methods.

• Bag of words (BoW): encoding/quantization

• VLAD, Fisher vector;

• Spatial information.

• Matching metric learning: weight optimization, Joint Bayesian method

• Siamese network (two identical convolutional network sharing weights)

High Dim LBP Feature for Face Verification • Making a high-dimensional (e.g., 100K-dim)

face feature is critical to high performance;

• Local Binary Pattern (LBP) descriptor;

• First extract multi-scale patches centered at

dense facial landmarks;

• Then divide each patch into a grid of cells and

code each cell by a certain descriptor;

• Finally concatenate all descriptors to form our

high-dimensional feature.

• Learn a sparse linear projection with a much

lower computational/storage cost.

• Adopt PCA first;

• Then supervised subspace learning methods such as LDA and Joint Bayesian are applied to extract discriminative information for face recognition;

• Learn a sparse linear projection (L1-based

regression) directly mapping high-d feature to low-d

feature.

In the training phase, low-dimensional features 𝑌 are first

obtained by PCA and supervised subspace learning. Then

learn the sparse projection matrix 𝐵 which maps 𝑋 to 𝑌 by

the rotated sparse regression. In the testing phase,

compute the low-dimensional feature by directly projecting

high-dimensional feature using sparse matrix 𝐵.

A Metric Learning: Joint Bayesian Method

• Model the appearance difference of two

faces jointly with an appropriate prior on

the face representation in verification;

• Each face is the summation of two

independent Gaussian latent variables, i.e.

intrinsic for identity, and intra-personal for

within-person variation.

• EM like algorithm for learning and closed form

solution for testing.

• Derived new similarity metric preserves

the separability and leads to better

performance;

• Can be viewed as a reference model as

well with parametric form.

3-D Face Analysis and Recognition Preprocessing: surface smoothing, noise removal and hole filling;

3-d facial landmark detection and face registration;

Curvature-based (spin image is too costly): landmark extraction;

3D statistical facial feature model (SFAM): both global and local;

Procrustes Analysis and Iterated Closest Point (ICP) for registration;

Depth image: RGB-D;

Deformable face model.

Feature extraction/learning;

Curvature-based;

Part-based feature;

Learning again?

Feature matching.

Region-based ICP;

Shape and texture.

Average Face Model Average Regional Model

3-D Model based Face Alignment • Model types:

• 3-D morphable model;

• 3-D mesh model;

• Parts-based model.

• Recover the frontal pose by 3d geometrical transformations;

• Align a 2D face image to a 3D face image and then rotate it to render the frontal

view;

• A 3d-to-2d camera is fitted for this rotation (face frontalization);

• Robust to illumination and viewpoint variation;

(a) Query photo; (b) facial feature detections; (c) and (d) the same on a reference face and model; (e) estimate a projection matrix

used to back-project to the reference coord. system; (f) estimated visibility due to non-frontal poses; (g) final frontalization.

3-D Model based Face Alignment

Improved single-3D Single-3D DeepFaces

Infrared Face Recognition

• face challenges in the presence of illumination, pose and expression

changes, as well as facial disguises;

• Appearance-based;

• Registration;

• Feature-based;

• Infrared LBP;

• Wavelet/Curvelet Transform;

• Multi-spectral/Hyper-spectral;

• Multi-modal methods.

Video-based Face Recognition

• Merits: a set of observations, temporal dynamics and 3D information;

• Challenges: pose, expression, illumination, scale, motion blur and occlusion;

• Set-to-Set method:

• Frames of a video showing the same face are often represented as sets of vectors, one

vector per frame; then recognition becomes a problem of determining similarity between

vector sets in different spaces and metrics;

• Algebraic methods that compare sets regard each video as a linear subspace, spanned

by the vectors encoding the frames in the video;

• Pyramid Match Kernel (PMK) is non-algebraic kernel for encoding similarities between

sets of vectors in a hierarchical structure;

• A set-to-set similarity measure: the Matched Background Similarity (MBGS).

• Sequence method:

• Split tracking and recognition to use the temporal dynamics, such as HMM, DTW.

Video-based Face Recognition

Facial Expression/Emotion Recognition

• Questions:

• Features are holistic or analytic;

• Temporal information used or not;

• View or volume-based.

• Measuring Facial Actions:

• Facial Action Coding (FACS): muscle‐based describing visually distinguishable facial movements;

• Facial Expression Extraction: both facial deformation and motion parts.

Facial Expression/Emotion Recognition

• Features: Haar, LBP, HOG, Optic flow;

• Recognition: frame-based and sequence-based

• Frame: PCA, ICA, LDA, kNN, GentleBoost, SVM, NN;

• Sequence: HMMs, DTW, motion energy, RNN .

• Deep Learning-based: CNN, RNN, 3D-CNN

• Deformation extraction:

• Image-based: Gabor filter, LDA/PCA, Gabor WT;

• Model-based: AAM, PDM, Facial Animation;

• Motion extraction:

• Frame-based: snake, region-based, model units;

• Sequence-based: Candid node, feature tracking


Detection/Classification 10. Text detection and recognition: text is object too.

Text detection: connected component-based or texture(template, i.e. sliding window)-based;

Maximally Stable Extremal Regions (MSERs) first;

Feature HOG works;

Text recognition: OCR (optical character recogn.), license plate recogn.

Pictorial structure model;

Lexicon helps.

“Video text detection and recognition: dataset and benchmark”, by P Nguyen, K Wang, S Belongie, ICCV13.


Detection/Classification 11. Scene parsing or semantic image segmentation.

Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and

to account for context;

Most methods rely on a pre-segmentation into superpixels or other segment candidates, and extract features and

categories from individual segments and from various combinations of neighboring segments;

Deep learning-based method: hierarchical feature learning.

RGB-D data: depth information helps inferring 3D relationship in the scene.

SuperParsing: Scalable Nonparametric Image Parsing with Superpixels

• First perform global scene-level matching against the training set, followed by superpixel-level

matching and MRF optimization for incorporating neighborhood context.

• Compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car)

and geometric classes (sky, vertical, ground).

Nonparametric Scene Parsing: Label Transfer via

Dense Scene Alignment

• Retrieve the nearest neighbors of the input image from a large database containing fully annotated images;

• Then, establish dense correspondences between the input image and each of the nearest neighbors using the dense SIFT flow algorithm;

• Finally, warp the existing annotations and integrate multiple cues in an MRF framework to segment and recognize the query image.

Indoor Segmentation and Support Inference from RGBD Images

• Interpret the major surfaces, objects, and support relations of an indoor scene;

• Parse indoor scenes into floor, walls, supporting surfaces, and object regions, and recover support relationships; 3D cues can best inform a structured 3D interpretation.

Dataset for Object Classification

Caltech 101

and 256;

LabelMe;

ImageNet;

VOC: Pascal;

MS COCO;

Dataset for Object Classification Face Recognition:

NIST FERET;

Yale Face;

CMU PIE Face;

Labeled Faces in the Wild

Database of Scene Parsing • Stanford background: 715 images with 8 classes of

outdoor scenes;

• SIFT Flow: 2688 images with 33 scenes;

• CamVid: Cambridge driving Labeled Video Database;

• Kitti Vision Dataset: Honda Research Inst. Euro GmbH;

• Barcelona database: 14871+279 images of 170 classes;

• MSRC: 240 small images with 9 classes;

• LabelMe: 2700 city scenes with 5-20 common classes;

• PASCAL VOC challenge: semantic segmentation with

20 FG classes and 1 BG class;

• RGB-D NYU data: 407024 RGB-D pairs with 894

categories;

• Microsoft COCO: image recognition, segmentation, and

captioning dataset.

Evaluation Metric Detection/Classification rate

true positives, false positives, true negatives and false negatives (reflected in the confusion or contingency matrix);

accuracy, precision, recall;

ROC, i.e. receiver operating characteristic;

plotting the fraction of true positives (TPR = true positive rate) vs. the fraction of false positives

(FPR = false positive rate), at various threshold settings;

• Precision = true positive / (true positive + false positive);

• Recall (sensitivity) = true positive / (true positive + false negative);

• Precision-Recall curve;
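The quantities above can be computed from scored detections as sketched below. Labels are assumed to be 1 (positive) or 0 (negative), a detection is declared when the score reaches the threshold, and both classes are assumed to be present; the function names are illustrative.

```python
def confusion_counts(labels, predictions):
    """Return (TP, FP, TN, FN) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn


def precision_recall(labels, scores, threshold):
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is detected
    recall = tp / (tp + fn)
    return precision, recall


def roc_points(labels, scores):
    """(FPR, TPR) pairs, one per distinct threshold, from strict to lenient."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predictions = [1 if s >= threshold else 0 for s in scores]
        tp, fp, tn, fn = confusion_counts(labels, predictions)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points
```

Sweeping the threshold traces the ROC curve with `roc_points`, while calling `precision_recall` at each threshold traces the precision-recall curve.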

Complexity/Time;

Real-time?

Object Tracking

Definition: object tracking is generally posed as a recursive estimation problem, i.e., estimate the current location (size) given the estimate at the previous time instant as well as the current observation (or measurement).

Object tracking finds an object from the viewpoint of a short-time (a Markov chain is applied in modeling) and narrow-view observer (a dynamic model guides the search);

However, object tracking can collaborate with different types of object

detectors/classifiers to enhance the performance.

Object tracking can be performed on different feature levels: points, regions, contours

or blobs;

Tracked objects can be of varied types: articulated, deformable, fluid, multiple objects and so on.

State-of-Art Methods of Object Tracking

Y Wu, J Lim, M-H Yang, “Online Object Tracking: A Benchmark,” CVPR 2013.

model update

L: local, H: holistic, T: template,

IH: intensity histogram, BP: binary pattern,

SR: sparse representation,

DM: discriminative model,

GM: generative model.

PF: particle filter,

MCMC: Markov Chain Monte Carlo,

LOS: local optimum search,

DS: dense sampling search.

Representation Scheme in Tracking

Holistic templates;

Subspace-based;

Sparse representation;

Feature-based:

Color histograms, HOG, covariance region descriptor, Haar-like features, LBP etc.;

Discriminative model with a binary classifier:

SVM, structured output SVM, ranking SVM, boosting, semi-boosting and multi-instance boosting;

Parts-based;

Bags of features (patch-based);

3-d information (depth) or multiple cameras-based?

PCA, AP & Spectral Clustering

Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

This transformation is defined in such a way that the first principal component has the largest

possible variance and each succeeding component in turn has the highest variance possible under the

constraint that it be orthogonal to the preceding components.

PCA is sensitive to the relative scaling of the original variables.

Also known as, or closely related to, the Karhunen–Loève transform (KLT), Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition etc.;
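A minimal sketch of finding the first principal component: center the data, form the sample covariance matrix, and run power iteration to obtain its dominant eigenvector. Power iteration is used here only to keep the example dependency-free (an eigensolver or SVD would normally be used), and it assumes the starting vector is not orthogonal to the dominant direction.

```python
import math


def first_principal_component(data, iterations=200):
    """Return (unit direction, variance along it) for a list of row vectors."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance
```

For points lying on the line y = 2x, the returned direction is proportional to (1, 2) and the returned variance equals the total variance of the data, since a single component explains everything.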

Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm;

Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to

perform dimensionality reduction before clustering in fewer dimensions.

The similarity matrix consists of a quantitative assessment of the relative similarity of each pair of

points in the dataset.

Blind Source Separation & ICA

Independent component analysis (ICA) separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and all statistically independent from each other. ICA is a special case of blind source separation.

Assumptions: the source signals are independent of each other; the distribution of the values in each source signal is non-Gaussian.

Three effects of mixing signals:

Independence: the source signals are independent, but their mixtures may not be;

Normality: a mixture is closer to Gaussian than any of the original variables;

Complexity: the complexity of a mixture is greater than that of its simplest constituent source signal.

Preprocessing: centering, whitening and dimension reduction;

ICA finds the independent components (latent variables) by maximizing the statistical

independence of the estimated components;

Definitions of independence for ICA:

Minimization of mutual information (KL divergence or entropy);

Maximization of non-Gaussianity (kurtosis and negentropy).

NMF & pLSA

Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, such that all three matrices have no negative elements.

The different types of NMF arise from using different cost functions for measuring the divergence between V and W*H and possibly by regularization of the W and/or H matrices;

squared error, Kullback-Leibler divergence or total variation (TV);

NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis);

pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the

probability of each co-occurrence as a mixture of conditionally independent multinomial distributions;

Their parameters are learned using EM algorithm;

pLSA is based on a mixture decomposition derived from a latent class model, not as downsizing the

occurrence tables by SVD in LSA.

Note: an extended model, LDA (Latent Dirichlet allocation) , adds a Dirichlet prior on the per-document

topic distribution.

ISOMAP

General idea:

Approximate the geodesic distances by the shortest graph distances.

Apply MDS (multi-dimensional scaling) using the geodesic distances.

Algorithm:

Construct a neighborhood graph;

Construct a distance matrix;

Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new distance matrix such that Dij is the length of the shortest path between i and j;

Apply MDS to the new distance matrix to find the coordinates.
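The graph-distance part of the algorithm can be sketched as below: build a symmetric k-nearest-neighbor graph with Euclidean edge lengths, then run Floyd-Warshall to get all-pairs shortest path lengths as geodesic estimates. The final MDS step is omitted, and the choice of k is an assumption left to the caller.

```python
import math


def neighborhood_graph(points, k):
    """Distance matrix with edges only between k-nearest neighbors (symmetrized)."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        nearest = sorted((math.dist(points[i], points[j]), j)
                         for j in range(n) if j != i)
        for d, j in nearest[:k]:
            dist[i][j] = dist[j][i] = d
    return dist


def geodesic_distances(dist):
    """Floyd-Warshall: shortest path length between every pair of nodes."""
    n = len(dist)
    d = [row[:] for row in dist]
    for m in range(n):                 # allow paths that pass through node m
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d
```

For points sampled along an L-shaped curve, the graph distance between the two ends follows the curve and is longer than the straight-line Euclidean distance, which is the property MDS then preserves in the embedding.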

LLE (Locally Linear Embedding)

General idea: represent each point on the local linear subspace of the manifold as a linear combination of its neighbors to characterize the local neighborhood relations; then use the same linear coefficients for embedding to preserve the neighborhood relations in the low dimensional space;

Compute the coefficients w for each data point by solving a constrained LS problem;

Algorithm:

1. Find the weight matrix W of linear coefficients;

2. Find the low dimensional embedding Y that minimizes the reconstruction error

$$\Phi(Y) = \sum_i \Big\| Y_i - \sum_j W_{ij} Y_j \Big\|^2$$

3. Solution: eigen-decomposition of M=(I-W)’(I-W).

Local Tangent Space Alignment (LTSA) Every smooth manifold can be constructed locally by its tangent plane;

Stages: 1) A local parameterization is established for each data point; 2) then a global

alignment is computed.

Taylor series expansion of the embedding function f(•) in the local neighborhood;

We are given samples from the embedded manifold with noise therefore, for an

arbitrary point xi and its local neighbor and in the absence of the noise (εi = 0), we can

write:

Solve the problem:

where si is the i-th membership vector.

The optimal alignment (using LS): Substituting Li into the objective:

where S=[s1,…,sn], W=diag(W1,…,Wn), and

Solve using an EVD.


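Laplacian Eigenmaps

General idea: minimize the norm of the gradient of the map f on the manifold, $\int \|\nabla f\|^2$, which measures how far apart f maps nearby points; this quantity is governed by the Laplace-Beltrami operator.

Avoid the trivial solution of f = const.

The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph with appropriate weights.

Construct the Laplacian matrix L=D-W, where D is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$; the objective can then be approximated by its discrete equivalent.

Algorithm:

Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors).

Construct an adjacency matrix W with weights on the graph edges (e.g., heat-kernel weights $W_{ij} = \exp(-\|x_i - x_j\|^2 / t)$ for neighbors and 0 otherwise).

Minimize $\sum_{i,j} W_{ij} \|y_i - y_j\|^2 = 2\,\mathrm{tr}(Y^\top L Y)$ subject to $Y^\top D Y = I$.

The generalized eigen-decomposition of the graph Laplacian is $L y = \lambda D y$.

Spectral embedding of the Laplacian manifold: use the eigenvectors with the smallest non-zero eigenvalues as coordinates.

• The first eigenvector is trivial (the all one vector).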

Search Mechanism in Tracking

Tracking is posed within an optimization framework, where gradient descent methods are used to locate the target:

Iterative Image Registration;

Mean shift;

Mean field;

Distribution Fields.

Dense sampling methods, at the expense of a high computational load:

Online boosting;

Online Multiple Instance Learning;

Struck: Structured Output Tracking with Kernels (SVM).

Stochastic search algorithms, which are insensitive to local minima and computationally efficient:

Particle filters;

Incremental learning within Particle filter;

Sparse coding within particle filter;

Mean shift within particle filter.

Particle Filter

Monte Carlo characterization of pdf:

Represent the posterior density by a set of random i.i.d. samples (particles) from the pdf p(x0:t|z1:t);

For a larger number N of particles, equivalent to a functional description of the pdf;

For N → ∞, approaches the optimal Bayesian estimate.

Regions of high density:

Many particles;

Large weight of particles.

Uneven partitioning;

Discrete approximation for the continuous pdf:

$$P_N(x_{0:t} \mid z_{1:t}) = \sum_{i=1}^{N} w_t^{(i)} \, \delta\big(x_{0:t} - x_{0:t}^{(i)}\big)$$

Draw N samples x0:t(i) from the importance sampling distribution π(x0:t|z1:t);

Importance weight and its update:

$$w(x_{0:t}) = \frac{p(x_{0:t} \mid z_{1:t})}{\pi(x_{0:t} \mid z_{1:t})}$$

$$w_t^{(i)} \propto w_{t-1}^{(i)} \, \frac{p(z_t \mid x_t^{(i)}) \, p(x_t^{(i)} \mid x_{t-1}^{(i)})}{\pi(x_t^{(i)} \mid x_{t-1}^{(i)}, z_t)}$$
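A minimal sketch of a bootstrap particle filter for a one-dimensional state, under several assumptions not stated in the slides: a random-walk motion model, Gaussian observation noise, the transition prior used as the importance distribution (so the weight update reduces to multiplying by the likelihood p(z|x)), and resampling at every step.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def particle_filter(observations, n_particles=500, motion_std=1.0,
                    obs_std=1.0, seed=0):
    """Return the posterior-mean state estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(observations[0], obs_std) for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    estimates = []
    for z in observations:
        # Propose: sample each particle from the transition prior p(x_t | x_{t-1}).
        particles = [x + rng.gauss(0.0, motion_std) for x in particles]
        # Update: with this proposal, w_t is proportional to w_{t-1} * p(z_t | x_t).
        weights = [w * gaussian_pdf(z, x, obs_std) for w, x in zip(weights, particles)]
        total = sum(weights)
        weights = [w / total for w in weights]
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: concentrate particles in regions of high density.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        weights = [1.0 / n_particles] * n_particles
    return estimates
```

With observations drifting steadily from 0 to 5, the estimates follow the drift with a small lag, because the random-walk motion model does not anticipate the trend.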

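Mean Shift

A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N;

Non-parametric density estimation (Parzen window):

$$P(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x} - \mathbf{x}_i), \qquad K(\mathbf{x} - \mathbf{x}_i) = c\,k\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)$$

MS is for kernel density gradient estimation; with $g = -k'$ and $g_i = g(\|(\mathbf{x} - \mathbf{x}_i)/h\|^2)$, and constant factors absorbed into c:

$$\nabla P(\mathbf{x}) = \frac{c}{n} \sum_{i=1}^{n} \nabla k_i = \frac{c}{n} \left[\sum_{i=1}^{n} g_i\right] \left[\frac{\sum_{i=1}^{n} \mathbf{x}_i g_i}{\sum_{i=1}^{n} g_i} - \mathbf{x}\right]$$

Translate the kernel window by the mean shift vector m(x):

$$\mathbf{m}(\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{x}_i \, g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{x}$$

Go to the maxima of density.

Used for visual tracking.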
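The mode-seeking iteration can be sketched for scalar samples as below, assuming a Gaussian kernel (for which the weighting profile g has the same Gaussian shape). Each step moves the window to the weighted mean of the samples, i.e. translates it by the mean shift vector, until the shift becomes negligible. The bandwidth and tolerance values are arbitrary assumptions.

```python
import math


def mean_shift_mode(samples, start, bandwidth=0.5, tol=1e-6, max_iter=500):
    """Climb from `start` to a nearby mode of the kernel density estimate."""
    x = start
    for _ in range(max_iter):
        # Gaussian profile weights g(||(x - x_i) / h||^2).
        g = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples]
        weighted_mean = sum(gi * xi for gi, xi in zip(g, samples)) / sum(g)
        shift = weighted_mean - x          # the mean shift vector m(x)
        x = weighted_mean
        if abs(shift) < tol:
            break
    return x
```

With two well-separated clusters, different starting points converge to different modes. A start extremely far from all samples would make every weight underflow to zero, so the start is assumed to lie within a few bandwidths of the data.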

Model Update in Tracking

Template update for the KLT algorithm;

Online mixture model;

Incremental subspace update;

Online boosting;

More robust online-trained classifier:

Semi-supervised;

Multiple instance learning;

Co-tracking with different features (co-training);

Multiple classifier boosting (object part-based or orientation-based);

Background separation for its multimodal distribution as well;

Object Structural Constraints by Positive-Negative learning (self learning);

Struck: Structured Output Tracking with Kernels (SVM).

Online Multiple Negative Modality

Learning Boosting for Tracking

Online Boosting

Online learning: a learning algorithm is presented with one example at a time;

Since we don’t know a priori how difficult/good a sample is, the online boosting algorithm turns to a new strategy to compute the weight distribution;

Oza proposed an on-line boosting framework in which the importance of a sample can be

estimated by propagating it through the set of weak classifiers.

However, Oza’s algorithm has no way of choosing the most discriminative feature

because the entire training set is not available at one time.

Grabner and Bischof proposed a modified method which performs feature selection

(defined as selectors) by maintaining a pool of M > N candidate weak classifiers:

The number of weak classifiers N is fixed at the beginning;

One sample is used to update all weak classifiers and the corresponding voting weights;

For each classifier, the most discriminative feature for the entire training set is selected from a given feature pool;

It is suggested that the worst feature be replaced by a new one drawn randomly from the feature pool.

Stochastic gradient descent is more suited for online learning, more specifically for

online boosting.
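A minimal sketch of how a sample's importance can be propagated through the weak classifiers in Oza-style online boosting. It assumes the weak classifiers' predictions for the current sample are already available, and it omits the Poisson-driven update of each weak classifier. The function name `propagate_importance` and the running sums `lam_correct` / `lam_wrong` are illustrative names.

```python
def propagate_importance(predictions, label, lam_correct, lam_wrong):
    """Return the importance seen by each weak classifier for one sample.

    lam_correct[m] / lam_wrong[m] accumulate the importance of samples that
    weak classifier m has classified correctly / wrongly so far.
    """
    lam = 1.0                      # every new sample starts with importance 1
    seen = []
    for m, pred in enumerate(predictions):
        seen.append(lam)
        if pred == label:
            lam_correct[m] += lam
            eps = lam_wrong[m] / (lam_correct[m] + lam_wrong[m])
            lam *= 1.0 / (2.0 * (1.0 - eps))   # easy sample: importance shrinks
        else:
            lam_wrong[m] += lam
            eps = lam_wrong[m] / (lam_correct[m] + lam_wrong[m])
            lam *= 1.0 / (2.0 * eps)           # hard sample: importance grows
    return seen

lam_correct = [3.0, 3.0]
lam_wrong = [1.0, 1.0]
# First weak classifier is wrong, second is right, for a sample with label +1.
seen = propagate_importance([-1, +1], +1, lam_correct, lam_wrong)
```

A sample misclassified by the first weak classifier reaches the second one with a larger importance, which is the online counterpart of reweighting in batch boosting.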

Multiple Instance Learning (MIL) The basic idea is that during training, examples are presented in sets or bags and labels

are provided for the bags rather than individual instances;

If the bag is labeled positive, it is assumed to contain at least one positive instance,

otherwise the bag is negative;

The ambiguity is passed on to the learning algorithm, which now has to figure out which

instance in each positive bag is the most “correct”.

The MIL problem can be solved by a gradient boosting framework (proposed by

Friedman) to maximize the log likelihood of bags;

A Noisy-OR (NOR) model is used to define the bag probability;

The instance-is-positive probability is modeled as the logistic function;

The weight on each sample is given as the derivative of the loss function with respect to a change

in the score of the sample.

People have been adapting classical classification methods, such as SVMs or boosting,

to work within the context of multiple-instance learning.
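A minimal sketch of the Noisy-OR bag probability and the resulting per-instance weights, assuming the instance probability is the logistic function of a classifier score and the objective is the bag log likelihood. The function name `noisy_or_weights` is illustrative.

```python
import numpy as np

def noisy_or_weights(scores, bag_label):
    """Bag probability under Noisy-OR and d(log likelihood)/d(score) per instance."""
    scores = np.asarray(scores, dtype=float)
    p_inst = 1.0 / (1.0 + np.exp(-scores))        # logistic instance probability
    p_bag = 1.0 - np.prod(1.0 - p_inst)           # Noisy-OR over the bag
    # Derivative of the bag log likelihood w.r.t. each instance score:
    #   positive bag:  p_inst * (1 - p_bag) / p_bag
    #   negative bag: -p_inst
    weights = (bag_label - p_bag) / p_bag * p_inst
    return p_bag, weights

p_pos, w_pos = noisy_or_weights([0.0, 0.0], bag_label=1)
p_neg, w_neg = noisy_or_weights([0.0, 0.0], bag_label=0)
```

Instances in a positive bag receive positive weights that shrink as the bag becomes confidently positive, while every instance in a negative bag is pushed down in proportion to its own probability.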

Online Incremental SVM

Learning for Tracking

Incremental/Decremental SVM Incremental learning on-line to construct the solution recursively one point at a time;

Retain the Kuhn-Tucker (KT) conditions on all previously seen data, while “adiabatically” adding a new data point to

the solution;

Support vectors (SV) in soft margin classification: margin SV, error SV and ignored SV;

Leave-one-out (LOO) for predicting the generalization power of a trained classifier, implemented by

decremental unlearning, adiabatic reversal of incremental learning, on each of the training data from

the full trained solution.


Context in Tracker, Fusion of Trackers Context information is also important;

Mining auxiliary objects or local visual information surrounding the target to assist tracking;

The context information is helpful when the target is fully occluded or leaves the image; “Learning where the object might be”;

“Exploring Supporters and Distracters in

Unconstrained Environments”.

Combines static, moderately adaptive and highly

adaptive trackers to account for appearance

changes;

Multiple trackers are maintained in MAP;

Multiple feature sets selected in a Bayesian way.

Multiple Object Tracking

The MOT tracker must not only associate the objects with the observation data correctly (despite interactions between them), but also discriminate between objects of similar appearance;

Persistence in motion;

Mutual exclusion;

Challenging situations:

Joint state space: variable number of objects?

Interaction (split/merge): occlusion;

Birth/death (enter/leave): appear/disappear.

Similar cases:

Articulated object tracking: body parts, hand fingers etc.

Deformable object tracking: facial expression with multiple action units.

http://michal.is/projects/people-tracking-identification/

http://artsandsciences.sc.edu/technology/node/495

Multiple Object Tracking

Typical methods:

Multiple independent object tracker (MioT): no interaction, so separate trackers are applied in parallel.

Multiple hypothesis tracker (MHT): k-best hypothesis, gating/pruning.

Joint probabilistic data association filter (JPDAF): expectation over all.

Probability hypothesis density: propagate 1st moment of posterior.

Particle filter Apply the probabilistic exclusion principle;

Bayesian Multiple Blob tracker: background subtraction;

Subordination links between particles;

Mixture model: particle clustering.

MCMC: Markov chain in sampling.

Markov random field-based: energy minimization of interaction and data terms;

Belief propagation: inference approximation.

Mean Field: variational approximation.

3-D Model-based Tracking

3-D model: CAD geometric model, planar parts or rough ellipsoid; Pose (camera or object) estimation can considerably simplify the task;

Camera calibration or not?

Non-rigid object tracking with 3-d model: 3D Morphable Models;

Articulated object tracking with 3-d model: 3D kinematic chain structure;

Factorization-based: separation of motion from structure.

http://cvrlcode.ics.forth.gr/handtracking/

3-D Model-based Tracking How to eliminate the drifting error?

Key frames for registration (pose refinement);

Bundle adjustment (used in the point-based method). Edge-based method

Look for strong gradients: fast and general;

Extract image contours and then fit them to the model outlines; Pixel-based method

Optic flow: add a feedback loop to suppress drift effect;

Template matching: Lucas-Kanade;

Interest point-based: Lucas-Kanade-Tomasi;

Note: tracking without 3-d models -> visual SLAM vs. SfM (structure from motion).

SLAM (simultaneous localization and mapping): to provide in real-time an estimation of

the relative motion of a camera and a 3D map of the surrounding environment;

Extensible tracking: attempts to add previously unknown scene elements to its initial

map, and these then provide registration even when the original map is out of sensing

range;

Deformable Object Tracking

2-D methods; Active contour (snake);

Level set;

Exemplar-based shape prior: ASM, AAM as a

generative model; Project-out inverse compositional algorithm for model

fitting;

Constrained local model: Use point distribution model (PDM), but only model the

appearance of local patches around landmarks;

3-D methods: model the variability of a certain class

of objects; Deformable model:

Shape, represented as curve or surface, is deformed in

order to match a specific example;

Model texture variations & imaging factors i.e.

perspective projection and illumination effects;

3-D pose estimation from 2-D view; Rigid pose.

Articulated Object Tracking

2-D methods; Model-free methods;

Exemplar-based;

2-D model-based: cardboard-like;

3-D methods; Without estimating parts’ joint angles;

With fixed basis;

Use top part’s tip locations;

3-D model-based: Kinematic chain structure;

Quantized feature space;

Model refinement: factorization, key-frame-based;

Motion filter: Kalman filter, particle filter or multiple hypothesis tracking (MHT);

Data driven dimensionality reduction by learning the configuration space;

3-D pose estimation from 2-D view; Classification-based or direct mapping-based.

PTAM, DTAM and Semi-Dense TAM

PTAM: tracking and mapping are separated; mapping is based on key-frame BA (not updated frame by frame), while tracking is based on camera pose estimation with patch-based search;

DTAM: still key frames-based, a pixel-based rather than feature-based method for tracking

and mapping with a dense 3D surface textured model;

Dense inverse depth map estimate by regularization (TV): primal dual;

Occlusion/blur handling;

Tracking by image registration with synthesized realistic novel view:

Nonlinear LS: like iterative LK style (rotation first, then full pose).

Semi-Dense TAM: semi-dense depth of pixels with apparent image gradients;

Probabilistic depth map + uncertainty from both geometric and photometric disparity errors;

Reduction of image regions for estimating semi-dense inverse depth map for the current frame;

Tracking is still dense in image alignment.

Initial depth map is from stereo vision by five-point-pose algorithm (Multiple view stereo).

Feature Tracking

The popular KLT (Kanade-Lucas-Tomasi) method;

Good features for tracking;

Eigenvalues (Shi);

Affine alignment;

FAST, SIFT, SURF, HOG, …;

Gradient, orientation, histogram;

Random forest, FERNS: tracking-by-detection

The classifiers are applied for keypoint matching;

Multiclassifier based on randomized trees;

Random selection of features;

Random selection of patches.

For classification.

Note: Optic flow is dense tracking.
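A minimal sketch of the eigenvalue criterion for "good features to track", assuming the score of a patch is the smaller eigenvalue of its 2×2 gradient structure tensor. The function name `min_eigenvalue`, the use of `np.gradient` for derivatives, and the synthetic patches are illustrative assumptions.

```python
import numpy as np

def min_eigenvalue(patch):
    """Smaller eigenvalue of the gradient structure tensor summed over a patch."""
    gy, gx = np.gradient(patch.astype(float))   # derivatives along rows, columns
    a = np.sum(gx * gx)
    b = np.sum(gx * gy)
    c = np.sum(gy * gy)
    # Closed-form smaller eigenvalue of the symmetric matrix [[a, b], [b, c]].
    return 0.5 * ((a + c) - np.sqrt((a - c) ** 2 + 4.0 * b * b))

flat = np.zeros((8, 8))
edge = np.zeros((8, 8))
edge[:, 4:] = 1.0            # vertical step edge: gradient in one direction only
corner = np.zeros((8, 8))
corner[4:, 4:] = 1.0         # corner: gradients in both directions

score_flat = min_eigenvalue(flat)
score_edge = min_eigenvalue(edge)
score_corner = min_eigenvalue(corner)
```

Only the corner patch gets a clearly positive score, which is why such points can be tracked without the aperture ambiguity of edges.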

Extract features independently and then match by comparing descriptors

[Figure: a fern with 32 possible outputs; the posterior distributions P(F_k | C = c_i) are stored as look-up tables.]

Feature Tracking

After FERNS are trained;

Do patch classification.

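$$\text{class\_label} = \arg\max_{c_i} \prod_{k=1}^{M} P(F_k \mid C = c_i)$$

[Figure: example with Fern 1, Fern 2 and Fern 3 producing outputs 2, 6 and 1, each indexing its posterior distribution look-up table.]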
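A minimal sketch of patch classification with trained ferns, assuming each fern is a list of pixel-pair intensity comparisons and that the posteriors P(F_k | C = c_i) are already stored in look-up tables. The toy setup (two ferns with two tests each, two classes, a 4-pixel patch) and the names `fern_index` and `classify` are illustrative and much smaller than the 32-output ferns in the figure.

```python
import numpy as np

def fern_index(patch, pixel_pairs):
    """Concatenate the binary test results of one fern into a table index."""
    idx = 0
    for a, b in pixel_pairs:
        idx = (idx << 1) | int(patch[a] < patch[b])
    return idx

def classify(patch, ferns, tables):
    """argmax over classes of the product of per-fern posteriors (in log space)."""
    log_score = np.zeros(tables.shape[2])
    for k, pairs in enumerate(ferns):
        log_score += np.log(tables[k, fern_index(patch, pairs)])
    return int(np.argmax(log_score))

# Flattened toy patch and two ferns, each with two pixel-pair tests.
patch = np.array([10, 20, 30, 5])
ferns = [[(0, 1), (2, 3)], [(3, 0), (1, 2)]]

# tables[k, output, class] = P(F_k = output | C = class); columns sum to 1.
tables = np.array([
    [[0.3, 0.1], [0.3, 0.1], [0.1, 0.7], [0.3, 0.1]],
    [[0.3, 0.1], [0.3, 0.1], [0.2, 0.2], [0.2, 0.6]],
])

label = classify(patch, ferns, tables)
```

Summing log posteriors instead of multiplying probabilities avoids underflow when many ferns are combined.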

Depth Sensor-based Tracking Features:

RGBD HOG feature;

Point cloud feature;

3D iterative closest point (ICP);

Occlusion detection and recovery from RGBD;

“Tracking revisited using RGBD camera: unified benchmark and Baselines”, by S Song, J Xiao, proc. of ICCV13.

3-D Model-based Body Tracking with Kinect RGB-D Data

• Predict the 3D position of each body joint from a single depth image

• No temporal information

• Uses an object recognition approach

• Single per-pixel classification

• Large and highly varied training set

• The classifications can produce hypotheses of 3D body joint positions by skeletal tracking;

• Random forest-based classifiers with depth invariant features.

• The method has been designed to be robust, in two ways:

• The system is trained with a vast, highly varied training set of synthetic images, ensuring coverage of:

• Ages, body shapes and sizes, clothing, hairstyles;

• The recognition does not rely on any temporal information:

• The system can initialize from arbitrary poses;

• Prevents catastrophic loss of track.

3-D Model-based Face Tracking with Kinect RGB-D Data

Use a 3-D deformable face model;

Handle noisy depth data by maximum likelihood;

L1 regularization in ICP-based tracking framework;

Feature tracking in RGB data helps.

Appendix: Action/Event Detection & Classification

Action/Event Detection and Classification

• Action classification: assigning an action label to a video clip;

• Action localization: search locations of an action in a video;

• Query for videos in professional Archives and YouTube;

• Query and organize your personal videos collections;

• Car safety (self-driving) and video surveillance;

• Detection of humans (pedestrians) and their motion;

• Detection of unusual behavior.

Challenging Issues

• Large variation in appearance

• Viewpoint changes

• Intra-class variation

• Camera motion

• Manual collection of training data is difficult

• Many action classes, rare occurrence

• Pose and object annotation often a plus

• Action vocabulary is not well defined

• What is the action granularity?

• How to represent composite actions?

Spatial-Temporal Features

• LTP (Local Trinary Pattern);

• MOSIFT;

• Histogram of optical flow (HOF);

• HoG3D;

• Motion boundary histograms (MBH);

• Internal Motion Histogram (IMH);

• Space-time interest points (STIP);

• Trajectory: dense or sparse;

• Trajectons: sparse;

• (Improved) Dense trajectory: optic flow.

Local Trinary Pattern

• Captures the effect of motion on the local structure of self similarities;

• The frame is divided into a grid of m × n cells, and histograms of 8-digit trinary strings are measured;

• Examine the similarity between a patch centered at (x, y) at time t and the patch around (x − ∆x, y) at time t + ∆t as the background statistic;

• Encode with one trit whether one of the two similarities is significantly higher than the other or whether the two similarities are approximately the same.

Patches at eight shifted locations at times t−∆t and t+∆t are compared to a central patch at time t to produce 16 similarities. 8 trits are used to represent each pixel. SSD1 and SSD2 are computed patch distances at one of the eight locations. ∆t is set to 3 frames. Patches spread 4 pixels around the center patch.
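A minimal sketch of computing the 8 trits of one pixel, assuming 3×3 patches, eight offsets roughly 4 pixels from the center, and a fixed SSD threshold. The offsets, the patch size, the threshold value and the sign convention (which of SSD1 or SSD2 maps to +1) are assumptions for illustration; the names `patch_at` and `ltp_trits` are hypothetical.

```python
import numpy as np

# Eight shifted locations about 4 pixels from the center (assumed layout).
OFFSETS = [(-4, 0), (-3, 3), (0, 4), (3, 3), (4, 0), (3, -3), (0, -4), (-3, -3)]

def patch_at(frame, y, x, half=1):
    """Square patch of side 2*half+1 centered at (y, x), as floats."""
    return frame[y - half:y + half + 1, x - half:x + half + 1].astype(float)

def ltp_trits(prev_frame, cur_frame, next_frame, y, x, threshold):
    """One trit per shifted location, comparing SSD1 (past) with SSD2 (future)."""
    center = patch_at(cur_frame, y, x)
    trits = []
    for dy, dx in OFFSETS:
        ssd1 = np.sum((patch_at(prev_frame, y + dy, x + dx) - center) ** 2)
        ssd2 = np.sum((patch_at(next_frame, y + dy, x + dx) - center) ** 2)
        if ssd1 - ssd2 > threshold:
            trits.append(1)
        elif ssd2 - ssd1 > threshold:
            trits.append(-1)
        else:
            trits.append(0)       # similarities approximately the same
    return trits

static = np.zeros((16, 16))
changed = np.full((16, 16), 10.0)

# No motion at all: every comparison is a tie.
trits_static = ltp_trits(static, static, static, 8, 8, threshold=100.0)
# Future frame differs strongly from the center patch, past frame does not.
trits_changed = ltp_trits(static, static, changed, 8, 8, threshold=100.0)
```

The 8-trit strings of all pixels in a cell would then be accumulated into that cell's histogram.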

MOSIFT

• A pair of frames as input;

• Local extrema of DoG and optical flow determine the MoSIFT points for which features are described;

• MoSIFT concatenates aggregated grids for both appearance and motion into a 256-d descriptor vector.

Histogram of Optical Flow

• Detect interest points using a space-time extension of the Harris operator;

• Normalized histogram descriptors of space-time volumes in the neighborhood

of detected points.

HOG3D

• HoG like 3D descriptor, similar to SIFT.

Motion Boundary Histograms

• MBH

(a,b) Reference images at time t and t+1. (c,d) Computed optical flow, and flow

magnitude showing motion boundaries. (e,f) Gradient magnitude of flow field for

image pair (a,b). (g,h) Average MBH descriptor over all training images for flow field.

Internal Motion Histograms

• IMHdiff

• IMHcd

• IMHmd

• IMHwd

(a) One block of IMHcd coding scheme. The arrows emerging from the central cell

show the central pixel used to compute differences for the corresponding pixel in the

neighbouring cell. Similar differences are computed for each of the 8 neighbouring

cells. Values +1 and −1 represent the difference weights. (b) The wavelet operators

used in the IMHwd motion coding scheme.

Space-time Interest Points

• Space-time corner detector;

STIP Feature for Action Classification

• Bag of space-time features + SVM classifier;

• Group similar STIP descriptors together with k-means clustering.
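A minimal sketch of turning a video's local descriptors into a bag-of-features histogram, assuming a codebook of cluster centers has already been obtained with k-means. The function name `bow_histogram` and the tiny 2-D descriptors are illustrative; real STIP descriptors are much higher-dimensional.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count occurrences."""
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()      # L1-normalized histogram

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])            # two visual words
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 10.0]])
hist = bow_histogram(descriptors, codebook)
```

The resulting fixed-length histogram is what the SVM classifier would take as input.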

Trajecton Features

Simple KLT feature tracking is used to track as many features as possible within a video. Each tracked point produces a fixed-length trajectory snippet every frame, consisting of the last L (usually 10) positions in its trajectory. These snippets are quantized to a library of trajectons.

Dense Trajectory (DT) Features

• Dense sampling at several scales

• Feature tracking based on optical flow for several scales

• Length 15 frames, to avoid drift

• Histogram of gradients (HOG: 2x2x3x8)

• Histogram of optical flow (HOF: 2x2x3x9)

• Motion-boundary histogram (MBHx + MBHy: 2x2x3x8)
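A short worked calculation of the descriptor sizes implied by the grid notation, reading "2x2x3xB" as 2×2 spatial cells, 3 temporal cells and B orientation bins, and assuming the MBH layout applies to each of MBHx and MBHy separately. The helper name `descriptor_dim` is illustrative.

```python
def descriptor_dim(cells_x, cells_y, cells_t, bins):
    """Length of a histogram descriptor over a space-time grid of cells."""
    return cells_x * cells_y * cells_t * bins

dims = {
    "HOG": descriptor_dim(2, 2, 3, 8),
    "HOF": descriptor_dim(2, 2, 3, 9),
    "MBHx": descriptor_dim(2, 2, 3, 8),
    "MBHy": descriptor_dim(2, 2, 3, 8),
}
# Total length when all four histograms are concatenated.
total = sum(dims.values())
```

Under this reading HOG and each MBH component have 96 dimensions and HOF has 108, for 396 in total before any trajectory-shape descriptor is added.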

Improved Dense Trajectory (IDT) Features

Improve dense trajectories by explicit camera motion estimation

Detect humans to remove outlier matches for homography estimation

Stabilize optical flow to eliminate camera motion

IDT for Action Recognition

Motion stabilized trajectories and features (HOG, HOF, MBH);

Normalization for each descriptor, then PCA to reduce its dimension by a factor of 2;

Use Fisher vector to encode each descriptor separately, set # Gaussians K=256;

Use Power+L2 normalization for FV, SVM for multi-class classification;

HOF improves significantly and MBH somewhat, with almost no impact on HOG;

HOF/MBH are complementary, representing zeroth- and first-order motion information;

IDT significantly improves over DT;

Human detection always helps;

Combine with static CNN features.

Note: CNN motion descriptors do

not improve the performance.
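A minimal sketch of the Power+L2 normalization applied to a Fisher vector before the SVM, assuming the common signed square-root form (exponent 0.5). The function name `power_l2_normalize` and the exponent default are illustrative assumptions.

```python
import numpy as np

def power_l2_normalize(fv, alpha=0.5):
    """Signed power normalization followed by L2 normalization."""
    fv = np.asarray(fv, dtype=float)
    fv = np.sign(fv) * np.abs(fv) ** alpha     # dampen large components
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

normalized = power_l2_normalize([4.0, -9.0, 0.0])
```

The power step reduces the dominance of a few bursty components, and the L2 step puts all videos on the unit sphere so that a linear kernel behaves consistently.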

Modeling Temporal Structure of Decomposable

Motion Segments for Activity Classification

Latent SVM model with

temporal action parts.

Enables temporal

localization of action parts.

Learning Latent Temporal Structure for

Complex Event Detection

Modeling of longer events such

as Grooming an animal;

Discriminatively trained Markov

model;

Aims to infer and learn latent

temporal structure of actions.

Interaction Models

Rule-based system

Detect/Track moving objects, manually identify key regions in scene (road,

checkpoint)

Scenarios describe relative arrangements of objects in scene

e.g. proximity of car to checkpoint, notions of scene context

Local feature relationship

UT Interaction dataset

Focus on interactions between people

Local feature approach

Define spatio-temporal relationships between points

Novel kernel for comparing sets

Key poses

Identify key points in an interaction, pose of people, and relative positions.

Latent SVM formulation.

Group Models

Stochastic grammars

Probabilistic grammar for describing domain

Person context

Activities are context-dependent

Adaptive structures

Temporal extent

Chains model

Link tracklet

Data association

Storyline model

Build AND-OR graph representation of activities

Event Models

Holistic flow

Global model of crowd

No tracking of individuals

Compute flow on dense grid

Interaction forces between particles estimated in local regions

Quantized into bag of words representation over the video

Group flow

Model group motion using Markov chain parameters

probabilistic spatial transition model (from tracklets of individuals)

Define intra-group and inter-group measurements

Collectiveness: fit of individual members to group parameters

Stability: maintenance of nearest neighbours within a group, …

Event Structure Learning

ST weakly supervised learning

How to identify important/salient

segments to same events?

Frame clusters

Can be done implicitly or explicitly?

What is the granularity in time and

feature space which will work?

Learning Structure Implicitly using

Topic-based Pooling

Recognition by Composition: Latent

Temporal Part-based Learning

A Videography Analysis Framework for

Video Retrieval and Summarization

A set of camera motion + related features = A “videography style descriptor”;

Capture some semantically meaningful things about how the video was taken.
