visual object detection, recognition & tracking (without deep learning)
Yu Huang
Sunnyvale, California
Visual Object Detection, Recognition &
Tracking (without Deep Learning)
Outline Object Detection and Classification
State of Art Object Detection and Classification
Global/local features (Harris, FAST, SIFT, SURF, HOG,
LBP, BRIEF, BRISK, FREAK, ORB)
Kd-tree, LSH, min-hash, inverted file;
part-based (constellation model, pictorial structure,
implicit shape model, deformable model)
Pose estimation
bag of words (codebook, pyramid match kernel, spatial
pyramid match, vocabulary tree, pLSA, LDA)
VLAD, Fisher kernel, Hamming embedding, product
quant.
Machine learning: generative/discriminative model
Efficiency in Detection/Classification
Divide-and-conquer, branch-and-bound, coarse-to-fine,
DP, selective search by segmentation.
Open set problem
Data unbalancing problem
Face detection/recognition
Text detection/recognition
Scene parsing/semantic segmentation
Data set and evaluation metric
Object Tracking
State-of-art methods of Object Tracking
Representation Scheme in Tracking
Search Mechanism in Tracking
Model Update in Tracking
Context in Tracker, Fusion of Trackers
Multiple object tracking
3-D Model-based tracking
SLAM (feature/pixel tracking)
Appendix: Action/Event Detection/Classification
Object Detection Given an image or a frame in the video, the goal of object detection is to determine
whether there are any defined objects in the image and return their locations and extents
(from a long time and whole viewed observer).
The object detection should know how to differentiate the specific object from everything
else in the view.
Object detection usually is a binary classification problem; however, additional context
information from background helps building a strong detector, such as co-occurrence of
objects and geometric location priors.
Multi views or 3-d information (depth) if available.
Object Classification
Given an image or a frame in the video, the goal of
object classification is to identify specific objects
within a certain object set.
The object classification should tell what the
difference is between object A and object B in the
predefined object set.
Contextual info. between objects is modeled (by
learning) too;
Object detection can be solved as object parts
classification;
Segmentation co-trained with classification.
Objects’ geometric context is useful too;
Multi views or 3-d information (depth) if available.
Object classification is complicated and the difficulty
rises as the number of objects in the set increases.
State-of-Art Methods of Object
Detection/Classification 1. Global/Local representation;
Template matching (multi-scale) and eigen-object (subspace or manifold);
GIST, MSER, SIFT, SURF, Haar-like, Histogram of Oriented Gradient (HOG), Pyramid HOG,
GLOH (Gradient Location and Orientation Histogram), Local Binary Pattern (LBP), Shape-Context etc;
Local feature correspondence
– Indexing local features for Approximate Nearest Neighbor (ANN) search: KD-tree, min-Hashing, spectral hashing,
inverted index;
– Fast matching for large dataset: product quantization, Hamming embedding, Fisher kernel, sparse coding,
manifold learning;
– Global spatial model: homography constraints.
Visual Features
Harris Corner Detection • Shifting the window in any direction should yield a
large change in appearance;
• Harris corner detector gives a mathematical
approach for determining which case holds: flat,
edge or corner;
• Treat gradient vectors as a set of (dx,dy) points
with a center of mass defined as being at (0,0);
• Fit an ellipse to that set of points via scatter matrix.
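The scatter-matrix idea above can be sketched as follows. This is a minimal illustration, not a reference implementation: the 3x3 window, the central-difference gradients, the response R = det(M) - k * trace(M)^2 and the value k = 0.04 are assumptions (the usual textbook choices), and `harris_response` is a hypothetical name.

```python
def harris_response(img, k=0.04):
    """Harris response at every interior pixel of a 2-D list of numbers."""
    h, w = len(img), len(img[0])
    # Central-difference gradients (left at zero on the image border).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Scatter matrix [[a, b], [b, c]] of the (dx, dy) points in a 3x3 window.
            a = b = c = 0.0
            for v in range(y - 1, y + 2):
                for u in range(x - 1, x + 2):
                    a += ix[v][u] * ix[v][u]
                    b += ix[v][u] * iy[v][u]
                    c += iy[v][u] * iy[v][u]
            # Two large eigenvalues (corner) -> R > 0; one (edge) -> R < 0; none (flat) -> R ~ 0.
            resp[y][x] = (a * c - b * b) - k * (a + c) ** 2
    return resp
```

On a synthetic image holding one bright square, the response is positive at the square's corner, negative along its edges and zero in flat regions.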
FAST • Features from accelerated segment test
(FAST) is corner detection method;
• FAST corner detector uses a circle of 16 pixels
(radius 3) to classify if a candidate is a corner.
• Classified as corner: intensity Ip, threshold t,
condition either 1 or 2 as below
• 1: N contiguous pixels whose intensities > Ip + t;
• 2: N contiguous pixels whose intensities < Ip - t.
• Tradeoff of choosing N (usually as 12), t (20%).
• The high-speed test for rejecting non-corner
points is operated by examining 4 example
pixels, namely pixel 1, 9, 5 and 13.
• Non maximal suppression.
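A minimal sketch of the segment test described above, assuming a grayscale image stored as a list of rows and an absolute threshold t (the slide quotes t as a percentage). The name `is_fast_corner` and the circle offset table are illustrative; non-maximal suppression is not included.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3, starting at the top, clockwise.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, t, n=12):
    """True if n contiguous circle pixels are all brighter than Ip + t or all darker than Ip - t."""
    ip = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    if n >= 12:
        # High-speed rejection on pixels 1, 9, 5, 13 (indices 0, 8, 4, 12):
        # an arc of 12 or more pixels must contain at least three of them.
        quick = [ring[i] for i in (0, 8, 4, 12)]
        brighter = sum(v > ip + t for v in quick)
        darker = sum(v < ip - t for v in quick)
        if brighter < 3 and darker < 3:
            return False
    for sign in (1, -1):  # +1 tests "brighter", -1 tests "darker"
        flags = [(v - ip) * sign > t for v in ring]
        run = best = 0
        for f in flags + flags:  # list doubled so runs may wrap around the circle
            run = run + 1 if f else 0
            best = max(best, run)
        if min(best, 16) >= n:
            return True
    return False
```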
FAST • Machine learning: corner detection processed on a set of training images;
• For every pixel p, store the 16 pixels surrounding it, as a vector P;
• Each value in the vector takes one of three states: darker than, brighter than, or similar to p;
• Depending on the states, the vector will be subdivided into three subsets: Pd, Ps, Pb;
• Define variable Kp: true if p is an interest point and false otherwise;
• Decision tree classifier queries each subset using Kp, on principle of entropy
minimization: the true class is found with min. number of queries for three subsets;
• Terminate when entropy of a subset is zero;
• This learned order of querying is used then.
LBP: Local Binary Pattern LBP transforms an image into an array or image of integer labels describing small-scale
appearance of the image.
Assume texture has locally two complementary aspects, a pattern and its strength.
Divide the examined window into cells (e.g. 16x16 pixels for each cell).
For each pixel in a cell, compare the pixel to each of its 8 neighbors. Follow the pixels along a
circle, i.e. clockwise or counter-clockwise.
Where the center pixel's value is greater than the neighbor's value, write "1". Otherwise, write "0".
This gives an 8-digit binary number (which is usually converted to decimal for convenience).
Compute the histogram, over the cell, of the frequency of each "number" occurring (i.e., each
combination of which pixels are smaller and which are greater).
Optionally normalize the histogram.
Concatenate (normalized) histograms of all cells. This gives the feature vector.
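The per-cell steps above can be sketched like this. It is an illustration only: it follows the bit rule as stated in the text (1 where the center is greater than the neighbor), walks the 8 neighbors clockwise from the top-left (an arbitrary starting point), and the function names are hypothetical.

```python
def lbp_code(img, x, y):
    """8-bit LBP label of pixel (x, y): bit is 1 where center > neighbor."""
    # 8 neighbors visited clockwise, starting at the top-left.
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]
    center = img[y][x]
    code = 0
    for dx, dy in offsets:
        code = (code << 1) | (1 if center > img[y + dy][x + dx] else 0)
    return code


def lbp_histogram(cell):
    """Normalized 256-bin histogram of LBP codes over the interior pixels of one cell."""
    hist = [0] * 256
    h, w = len(cell), len(cell[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            hist[lbp_code(cell, x, y)] += 1
    total = sum(hist)
    return [c / total for c in hist] if total else hist
```

The full feature vector would be the concatenation of `lbp_histogram` over all cells of the window.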
HOG: Histograms of Oriented Gradients
Introduce invariance
Bias / gain / nonlinear transformations
bias: gradients / gain: local
normalization
nonlinearity: clamping magnitude,
orientations
Small deformations
spatial subsampling
local “bag” models
At each pixel
Gradient magnitude:
m = || (Ix, Iy) ||
Gradient orientation:
o = arctan(Iy / Ix)
Quantize orientation: vote into bin
(weighted)
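A small sketch of the per-pixel computation and weighted voting for a single cell. The 9 unsigned orientation bins over [0, 180) degrees and the final L2 normalization are assumptions (common HOG settings, not stated in the slide), and `cell_histogram` is a hypothetical name. `atan2` is used instead of a plain arctangent so that Ix = 0 is handled.

```python
import math


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = cell[y][x + 1] - cell[y][x - 1]
            iy = cell[y + 1][x] - cell[y - 1][x]
            m = math.hypot(ix, iy)                          # gradient magnitude
            o = math.degrees(math.atan2(iy, ix)) % 180.0    # unsigned orientation
            hist[int(o / (180.0 / bins)) % bins] += m       # weighted vote into a bin
    # Local normalization gives invariance to gain changes.
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]
```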
SIFT: Scale Invariant Feature Transform
Scale-space extrema detection:
Find the points, whose surrounding patches (with some scale) are distinctive
An approximation to the scale-normalized Laplacian of Gaussian
Keypoint localization: eliminating edge points
Orientation assignment:
Assign an orientation to each keypoint; the descriptor is represented relative to this orientation
and therefore achieves invariance to image rotation;
Magnitude & orientation on the Gaussian smoothed images
A histogram is formed by quantizing the orientations into 36 bins;
Peaks in the histogram correspond to the orientations of the patch;
For the same scale & location, there could be multiple keypoints with different orientations (if another
peak is bigger than 80% of the maximal peak);
Keypoint descriptor: 128-d
16x16 patch -> 4x4 subregions ->8 bins for each subregion
SURF: Speeded Up Robust Features
The feature vector of SURF is almost identical to that of SIFT. It
creates a grid around the keypoint and divides each grid cell into
sub-grids.
At each sub-grid cell, the gradient is calculated and is binned by
angle into a histogram whose counts are increased by the
magnitude of the gradient, all weighted by a Gaussian.
These grid histograms of gradients are concatenated into a 64-d
vector.
SURF can also use a 36-d vector of principal components of the 64-d
vector (PCA is performed on a large set of training images) for a
speedup.
SURF also improves on SIFT by using a box filter approximation
to the convolution kernel of the Gaussian derivative operator. This
convolution is sped up further using integral images.
BRIEF
BRIEF: Binary Robust Independent Elementary Features;
Binary test
BRIEF descriptor
For each S*S patch
Smooth it, then pick pixels using pre-defined binary tests
Pros:
Compact, easy-computed, highly discriminative
Fast matching using Hamming distance
Good recognition performance
Cons:
More sensitive to image distortions and transformations, in particular to in-plane rotation and scale
change
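A minimal sketch of the BRIEF idea, under stated assumptions: the binary tests are drawn uniformly at random inside the patch (one of several sampling strategies), the smoothing step is omitted for brevity, and the descriptor is packed into a Python integer. All function names are hypothetical.

```python
import random


def make_tests(patch_size, n_bits=256, seed=0):
    """Pre-defined binary tests: n_bits random pixel pairs inside an S x S patch."""
    rng = random.Random(seed)

    def pick():
        return (rng.randrange(patch_size), rng.randrange(patch_size))

    return [(pick(), pick()) for _ in range(n_bits)]


def brief_descriptor(patch, tests):
    """Bit i is 1 if the intensity at the first point of test i is lower than at the second."""
    bits = 0
    for (x1, y1), (x2, y2) in tests:
        bits = (bits << 1) | (1 if patch[y1][x1] < patch[y2][x2] else 0)
    return bits


def hamming(a, b):
    """Number of differing bits: XOR followed by a population count."""
    return bin(a ^ b).count("1")
```

Because the tests are fixed pixel positions, rotating the patch changes which pixels are compared, which is the sensitivity to in-plane rotation listed under Cons.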
Binary Robust Invariant Scalable Keypoints
BRISK: Combination of SIFT-like scale-
space keypoint detection and BRIEF-
like descriptor
Scale and rotation invariant
BRISK is a 512 bit binary descriptor
that computes the weighted Gaussian
average over a select pattern of points
near the keypoint;
It compares the values of specific pairs
of Gaussian windows, leading to either
a 1 or a 0, depending on which window
in the pair was greater.
The pairs to use are preselected in
BRISK. This creates binary descriptors
that work with hamming distance
instead of Euclidean.
FREAK: Fast Retina Keypoint
FREAK is a cascade of binary strings computed by efficiently
comparing image intensities over a retinal sampling pattern;
FREAK improves upon the sampling pattern and method of pair
selection that BRISK uses.
FREAK evaluates 43 weighted Gaussians at locations around the
keypoint, but the pattern formed by these Gaussians is biologically
inspired by the retinal pattern in the eye.
The pixels being averaged overlap, and are much more
concentrated near the keypoint. This leads to a more accurate
description of the keypoint.
The actual FREAK algorithm also uses a cascade for comparing
these pairs, and puts the 64 most important bits in front to speed
up the matching process.
ORB
• Oriented FAST and Rotated BRIEF:
• Fast and accurate orientation compensation to FAST;
• Efficient computation of oriented BRIEF;
• Learn to de-correlate BRIEF in sampling pairs under rotational invariance.
• Intensity centroid for corner orientation:
• Properties of the sampling pairs:
• Uncorrelation – each new pair will bring new information to the descriptor;
• High variance – more discriminative, since it responds differently to inputs.
• Learning the sampling: a training set of about 300,000 keypoints drawn;
• Greedy method: obtain a set of 256 uncorrelated binary tests with high variance.
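The intensity-centroid orientation mentioned above can be sketched as follows. The formula is not in the transcript, so this assumes the usual definition: with moments m10 = sum of x*I(x, y) and m01 = sum of y*I(x, y) taken relative to the patch center, the orientation is atan2(m01, m10). A square patch is used here for simplicity, and `patch_orientation` is a hypothetical name.

```python
import math


def patch_orientation(patch):
    """Angle (radians) of the vector from the patch center to its intensity centroid."""
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    m10 = m01 = 0.0
    for y in range(h):
        for x in range(w):
            m10 += (x - cx) * patch[y][x]   # first-order moment in x
            m01 += (y - cy) * patch[y][x]   # first-order moment in y
    return math.atan2(m01, m10)
```

Rotating the BRIEF sampling pairs by this angle before testing is what makes the descriptor "rotated BRIEF".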
Kaze/A-Kaze • Kaze: Multi-scale 2D feature detection and description algorithm in nonlinear scale
spaces by means of nonlinear diffusion filtering;
• The discretization of a function by means of the forward Euler scheme: high computation cost;
• Efficient Additive Operator Splitting (AOS) techniques and variable conductance diffusion, solving a
tri-diagonal system of linear equations, which can be efficiently done by Thomas algorithm;
• The set of first and second order derivatives are approximated by means of 3x3 Scharr filters of
different derivative step sizes (better than Sobel-like operators);
• Find dominant orientation and adapt M-SURF to build descriptor;
• A-Kaze: accelerated Kaze with fast explicit diffusion (FED) in nonlinear scale spaces;
• FED combines the advantages of explicit and semi-implicit schemes while avoiding shortcomings;
• Idea: M cycles of n explicit diffusion steps with varying step sizes from factorization of the box filter;
• Embed the FED scheme into a fine to coarse pyramidal framework;
• Use a Modified-Local Difference Binary (M-LDB) that exploits gradient and intensity information;
• LDB is similar to BRIEF but using binary tests between the average of areas (not pixels).
KD-tree The kd-tree data structure is based on a recursive subdivision of space into disjoint
hyper-rectangular regions called cells; Each node of the tree is associated with a region, called a box, and is associated with a set of data
points that lie within this box;
The root node of the tree is associated with a bounding box that contains all the data points;
Consider an arbitrary node in the tree: As long as the number of data points associated with this node is greater than a small quantity, called the bucket size, the box is split into two boxes by an axis-orthogonal hyper-plane that intersects this box.
There are a number of different splitting rules, which determine how this hyper-plane is selected (mean or median usually);
When the number of points that are associated with the current box falls below the bucket size, then the resulting node is declared a leaf node, and these points are stored with the node;
Limit the number of neighboring k-d tree bins to explore ANN;
Reduce the boundary effects by randomization.
KD-tree Randomized kd-tree forest: a fast ANN search;
Split by picking the dimension with the highest variance first;
Multiple randomized trees increase the chances of finding nearby points;
Best-bin first search heuristic: priority queue;
A branch-and-bound technique for an estimate of the smallest distance from the query point to any
of the data points down all of the open paths;
Priority search: visits cells in increasing order of distance from the query, and converge rapidly on
the true NN (max/min heap data structure).
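A compact sketch of a kd-tree with a bucket size and an exact nearest-neighbor search that prunes with a branch-and-bound test. Assumptions: the split axis simply cycles with depth and the split value is the median; the randomized forest and the priority-queue (best-bin-first) search are not shown. Names are hypothetical.

```python
def build(points, bucket_size=2, depth=0):
    """Split on the median of one axis until a box holds at most bucket_size points."""
    if len(points) <= bucket_size:
        return {"leaf": points}
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"axis": axis, "split": pts[mid][axis],
            "left": build(pts[:mid], bucket_size, depth + 1),
            "right": build(pts[mid:], bucket_size, depth + 1)}


def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(node, q, best=None):
    """Exact NN: visit the query's own cell first, then the far side only if it can still win."""
    if "leaf" in node:
        for p in node["leaf"]:
            if best is None or sqdist(p, q) < sqdist(best, q):
                best = p
        return best
    diff = q[node["axis"]] - node["split"]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, q, best)
    # Bound: every point in the far box is at least |diff| away from the query.
    if best is None or diff * diff < sqdist(best, q):
        best = nearest(far, q, best)
    return best
```

An approximate search would cap the number of far-side boxes explored instead of visiting every box that passes the bound.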
Locality Sensitive Hashing (LSH) LSH is a randomized hashing technique using hash
functions that map similar points to the same bin, with high probability;
Choose a random projection;
Project points;
Points close in the original space remain close under the projection;
Use multiple quantized projections defining a high-dimensional “grid”;
Cell contents can be efficiently indexed using a hash table;
Repeat to avoid quantization errors near the cell boundaries;
Point that shares at least one cell = potential candidate;
Compute distance to all candidates;
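The steps above can be sketched with random Gaussian projections quantized into cells of fixed width. This is an illustration under assumptions: the class name, the number of tables, the number of projections per table and the cell width are all arbitrary choices, not values from the slides.

```python
import random
from collections import defaultdict


class RandomProjectionLSH:
    def __init__(self, dim, n_tables=8, n_proj=4, cell=1.0, seed=0):
        rng = random.Random(seed)
        self.cell = cell
        self.points = []
        self.tables = []
        # Each table: n_proj random directions plus random offsets define one grid.
        for _ in range(n_tables):
            dirs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_proj)]
            offs = [rng.uniform(0, cell) for _ in range(n_proj)]
            self.tables.append((dirs, offs, defaultdict(list)))

    def _key(self, dirs, offs, p):
        # Project the point, then quantize each projection into a cell index.
        return tuple(int((sum(d * x for d, x in zip(dr, p)) + o) // self.cell)
                     for dr, o in zip(dirs, offs))

    def add(self, p):
        idx = len(self.points)
        self.points.append(p)
        for dirs, offs, buckets in self.tables:
            buckets[self._key(dirs, offs, p)].append(idx)

    def query(self, q):
        """Index of the closest candidate sharing at least one cell with q, or None."""
        cand = set()
        for dirs, offs, buckets in self.tables:
            cand.update(buckets.get(self._key(dirs, offs, q), []))

        def dist(i):
            return sum((a - b) ** 2 for a, b in zip(self.points[i], q))

        return min(cand, key=dist) if cand else None
```

Repeating the hashing over several tables is what reduces misses for points that fall near a cell boundary in any single grid.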
Min-Hash The Min-Hash seen as an instance of locality sensitive hashing (LSH);
The more similar two items are, the higher the chance that they share the same min-hash; Similarity is measured by the Jaccard index: the number of elements two sets have in common
divided by the number of elements in their union;
It is a weaker representation than a BoW since word frequency information is reduced into a binary information (present or absent);
To estimate the word overlap of two items, multiple independent min-Hash functions fi are used;
To efficiently retrieve items with high similarity, the values of min-Hash functions fi are grouped into s-tuples, called sketches;
The recall is increased by repeating the random selection of sketches k times; A pair of items is a potential match when at least one sketch collision is encountered;
The probability of a pair of items having at least one sketch out of k in common is a function of the word overlap.
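A minimal sketch of min-Hash on sets of visual-word ids. Each min-Hash function is modeled here as an explicit random permutation of the vocabulary (an assumption that is only practical for small vocabularies; real systems use hash functions instead). Function names are hypothetical.

```python
import random


def make_permutations(vocab_size, n, seed=0):
    """n independent random permutations of the word ids 0..vocab_size-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n):
        p = list(range(vocab_size))
        rng.shuffle(p)
        perms.append(p)
    return perms


def minhash_signature(words, perms):
    """One min-hash per permutation: the smallest permuted id among the set's words."""
    return [min(p[w] for w in words) for p in perms]


def estimate_similarity(sig1, sig2):
    """Fraction of agreeing min-hashes, an estimate of the set overlap."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


def sketches(sig, s):
    """Group consecutive min-hash values into s-tuples."""
    return [tuple(sig[i:i + s]) for i in range(0, len(sig) - s + 1, s)]


def is_candidate(sig1, sig2, s):
    """Potential match when at least one sketch collides."""
    return any(a == b for a, b in zip(sketches(sig1, s), sketches(sig2, s)))
```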
Inverted File An inverted file index is just like an index in a book, where the keywords are
mapped to the numbers of pages using them;
In the visual word case, a table that points from the word number to the indices of
the database images with the word, is built too;
Retrieval via the inverted file is faster than searching every image, assuming that
not all images contain every word (sparse).
Index compression: Huffman compression
Note: inverted list contains both the vector identifier and the encoded residual.
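A minimal sketch of an inverted file over visual words, assuming each database image is given as a list of word ids. Index compression and stored residuals are left out, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_file(images):
    """images: {image_id: iterable of visual-word ids} -> {word: set of image ids}."""
    index = defaultdict(set)
    for image_id, words in images.items():
        for w in words:
            index[w].add(image_id)
    return index


def query(index, words):
    """Rank database images by the number of distinct query words they contain."""
    votes = Counter()
    for w in set(words):
        # Only the lists of words present in the query are read.
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return votes.most_common()
```

Images that share no word with the query are never touched, which is where the speedup over scanning every image comes from.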
State-of-Art Methods of Object
Detection/Classification 2. Bags of words model (derived from natural language processing);
Key point Localization (MSER, SIFT, SURF, Shape Context…);
Codebook generation: clustering or quantizing the feature space (k-means);
Sparse coding for efficient quantization.
Learning with histogram of code words and its extension;
– Pyramid match kernel: map to multi-dimensional multi-resolution histograms;
– Spatial pyramid match: partition the image into increasingly fine sub-regions.
– Prob. Latent Semantic Analysis (pLSA): mixture decomposition from a latent model;
– Latent Dirichlet Allocation (LDA): add the Dirichlet prior for the topic distribution.
Bag of Visual Words
[Figure: Representation: 1. feature detection & representation, 2. codewords dictionary, 3. image representation; followed by category models (and/or) classifiers and the category decision]
Pyramid Match Kernel
Fast approximation of Earth Mover’s Distance;
Weighted sum of histogram intersections at multiple resolutions (linear in the
number of features instead of cubic);
Spatial Pyramid Matching Based on pyramid matching kernel;
Descriptor layer: detect, locate features, extract the corresponding descriptors;
Code layer: code the descriptor by VQ, soft-VQ or even sparse coding;
SPM layer: pool codes across subregions and normalize into a histogram;
Classifiers with these features by nonlinear kernels.
Vocabulary Trees Vocabulary Tree defined using an offline unsupervised (k-
means) training stage.
Hierarchical scoring based on term freq. inverse document
freq. (TF-IDF). Number of the descriptor vectors of each image with a path along
the node i (ni query, mi database)
Number of images in the database with at least one descriptor
vector path through the node i (Ni )
Defining the relevance score
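The score formula itself did not survive in the transcript. A hedged sketch, assuming the TF-IDF scheme commonly described for vocabulary trees: node weight wi = ln(N / Ni), query and database vectors qi = ni * wi and di = mi * wi, and a score equal to the L1 distance between the normalized vectors (smaller means more relevant). The choice of the L1 norm and the function names are assumptions.

```python
import math


def tfidf_vector(counts, n_images, images_per_node):
    """counts: {node: ni or mi}. Returns the L1-normalized vector of count * ln(N / Ni)."""
    vec = {i: c * math.log(n_images / images_per_node[i]) for i, c in counts.items()}
    norm = sum(abs(v) for v in vec.values()) or 1.0
    return {i: v / norm for i, v in vec.items()}


def relevance_score(q, d):
    """L1 distance between normalized query and database vectors (0 = identical)."""
    return sum(abs(q.get(i, 0.0) - d.get(i, 0.0)) for i in set(q) | set(d))
```

A node visited by every database image gets weight ln(1) = 0 and so contributes nothing to the score.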
Implementation of Scoring Every node is associated with an inverted file
Decrease the fraction of images in the database that have to be
explicitly considered for a query
Hierarchical k-means
K-means tree of height h (levels)
Determining the path of a descriptor means performing k·h
dot products (k per level over h levels).
Nister & Stewenius, 2006
Vocabulary Trees
Index Query
Locally Aggregated Descriptors
VLAD: vector of locally aggregated descriptors;
Learning: a vector quantizer (k-means)
output: k centroids (visual words): c1,…,ci,…ck
centroid ci has dimension d
For a given image
assign each descriptor to closest center ci
accumulate (sum) descriptors per cell
vi := vi + (x - ci)
VLAD (dimension D = k x d): run PCA for reduction
The vector is L2-normalized;
VLAD better than BoF for a given descriptor size
comparable to Fisher descriptors for these operating points
Choose a small D if output dimension D’ is small.
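The aggregation steps above can be sketched as follows, assuming the k centroids have already been learned by k-means. The PCA reduction step is omitted, and `vlad` is a hypothetical name.

```python
import math


def vlad(descriptors, centroids):
    """Concatenated, L2-normalized sums of residuals x - ci per visual word."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Assign the descriptor to its closest centroid.
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        # Accumulate the residual: vi := vi + (x - ci).
        for t in range(d):
            v[i][t] += x[t] - centroids[i][t]
    flat = [val for row in v for val in row]          # dimension D = k x d
    norm = math.sqrt(sum(val * val for val in flat)) or 1.0
    return [val / norm for val in flat]
```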
Fisher Kernel Given a likelihood function uλ with parameters λ, the score function of a given sample X is given by:
G_λ^X = ∇_λ log uλ(X),
a fixed-length vector whose dimensionality depends only on the number of parameters.
Intuition: direction in which the parameters λ of the model should be modified to better fit the data;
Fisher information matrix (FIM) or negative Hessian:
F_λ = E_{x~uλ}[ G_λ^x (G_λ^x)^T ]
Measure similarity between two samples X and Y using the Fisher Kernel (FK):
K(X, Y) = (G_λ^X)^T F_λ^{-1} G_λ^Y
FK can be rewritten as a dot product between Fisher Vectors (FV):
FV(X) = L_λ G_λ^X with F_λ^{-1} = L_λ^T L_λ, so that K(X, Y) = FV(X)^T FV(Y)
A Gaussian Mixture Model (GMM) trained on a large set of features X={xt,
t=1,...T}, to get a probabilistic visual vocabulary (soft BoV);
Fisher kernel transforms a variable-size set of independent samples into a fixed-size vector
representation.
average pooling
Hamming Embedding Representation of a descriptor x with binary signatures
Vector-quantized to q(x) as in standard BoF
Short binary vector b(x) for an additional localization in the Voronoi cell
Define HE matching: two descriptors x and y match iff q(x) = q(y) and h(b(x), b(y)) <= ht
where h(a, b) is the Hamming distance.
Nearest neighbors for Hamming distance ≈ the ones for Euclidean distance
Efficiency:
Hamming distance = very few operations
Fewer random memory accesses: faster than BoF with the same dictionary size!
Off-line (given a quantizer)
Draw an orthogonal projection matrix P of size db× d (random matrix generation)
this defines db random projection directions (projection and assignment for each learning data point)
for each Voronoi cell and projection direction, compute the median value from a learning set;
On-line: Compute the binary signature b(x) of a given descriptor
project x onto the projection directions as z(x) = (z1,…zdb)
Signature: bi(x) = 1 if zi(x) is above the learned median value, otherwise 0
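The off-line and on-line stages above can be sketched as follows. Assumptions: the projection matrix P is passed in already generated (drawing a random orthogonal matrix is not shown), and each training descriptor comes with the id of its Voronoi cell. Function names are hypothetical.

```python
import statistics


def project(P, x):
    """z(x) = P x, one value per projection direction (row of P)."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]


def learn_medians(P, samples, cells):
    """Off-line: per Voronoi cell, the median of each projected component."""
    by_cell = {}
    for x, c in zip(samples, cells):
        by_cell.setdefault(c, []).append(project(P, x))
    return {c: [statistics.median(col) for col in zip(*zs)] for c, zs in by_cell.items()}


def signature(P, medians, x, cell):
    """On-line: bi(x) = 1 if zi(x) is above the learned median of its cell, else 0."""
    return [1 if z > m else 0 for z, m in zip(project(P, x), medians[cell])]


def he_match(cell_x, b_x, cell_y, b_y, ht):
    """x and y match iff they share a cell and their signatures differ in at most ht bits."""
    return cell_x == cell_y and sum(a != b for a, b in zip(b_x, b_y)) <= ht
```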
Product Quantization
Main idea: compressed representation of the database vectors;
Vector split into m sub-vectors: y --> [y1|…|ym];
Sub-vectors are quantized separately by different quantizers
q(y)=[q1(y1)|…|qm(ym)]; where each qi is learned by k-means with a limited number of centroids;
The key: estimate the distances in the compressed domain, such that Quantization is fast enough;
Quantization is precise, i.e., many different possible indexes (ex: 2^64).
Note: Regular k-means is not appropriate; not for k=2^64 centroids;
Product quantization-based approach offers Competitive search accuracy
Compact footprint: few bytes per indexed vector
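A small sketch of encoding with a product quantizer and estimating a distance in the compressed domain. Assumptions: the per-sub-vector codebooks are already trained by k-means, the vector length is divisible by m, and the distance shown is the asymmetric variant (uncompressed query against a compressed database vector). Names are hypothetical.

```python
def split(y, m):
    """y -> [y1 | ... | ym], m sub-vectors of equal length."""
    step = len(y) // m
    return [y[i * step:(i + 1) * step] for i in range(m)]


def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def pq_encode(y, codebooks):
    """codebooks[j] holds the centroids of sub-quantizer qj; returns one index per sub-vector."""
    return [min(range(len(cb)), key=lambda c: sqdist(sub, cb[c]))
            for sub, cb in zip(split(y, len(codebooks)), codebooks)]


def asymmetric_distance(x, code, codebooks):
    """Squared distance between query x and a PQ code, summed over the sub-spaces."""
    subs = split(x, len(codebooks))
    return sum(sqdist(sub, cb[c]) for sub, cb, c in zip(subs, codebooks, code))
```

With m sub-quantizers of k centroids each, the code distinguishes k^m cells while storing only m small indices per vector.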
MPEG CDVS Image Feature Extraction Pipeline
• Keypoint detection: ALP;
• SIFT descriptor;
• Feature selection;
• Local descriptor compression;
• Coordinate coding;
• Global descriptor aggregation;
MPEG Standard CDVS-based Pairwise Matching
and Indexing/Retrieval Pipeline
• Global descriptor: top matches;
• Local descriptor: decoding->matching->geometric verification->localization;
• Location coding and descriptor coding.
MPEG Standard CDVS Key Proposals
• CE1: global descriptor: Residual Enhanced Visual Vector (REVV);
• SCFV (Scalable Compressed Fisher Vector);
• Robust Visual Descriptor (RVD);
• CE2: local descriptor compression: CHoG;
• Transform + Scalar Quantizer;
• Multi-stage Vector Quantizer;
• CE3: Location coding: context based coordinate coding;
• CE4: key-point detection: ALP;
• Block-based Frequency domain LoG;
• CE5: Local Descriptors: SIFT;
• CE6: retrieval pipeline: MBIT;
• CE7: feature selection: Naïve Bayesian learning based feature selection;
• CE8: pairwise matching pipeline: DISTRAT (local matching with geometric verification)
• Weighted Hamming distance (global matching).
ALP • SIFT Patent: DoG filtering to construct scale space!
• ALP (low polynomial degree): by Telecom Italia;
• Idea: scale space response modeled by a polynomial function; then
estimation of coefficients by LoG filtering at different scales:
• 1. Scale space: approximated by a polynomial, get local extrema in the
polynomials, eliminates those exceeded at the boundaries, output a list of
candidates (x, y, σ);
• 2. Clean candidates: either at edges with bigger ratio of curvatures, or with lower
absolute values or curvatures;
• 3. Refinement of coordinates: approximate scale space by a polynomial;
• 4. Eliminates duplicates at octave boundaries;
• 5. Find the remaining candidates.
D. G. Lowe: ”Distinctive image features from scale-invariant keypoints”, IJCV 2004, Patent No US 6,711,293.
BoVW, VT, VLAD, REVV, Fisher Vector, CFV, SCFV, RVD
• Bag of Visual Words: zero order moments;
• Vocabulary Tree: hierarchical clustering;
• Vector of Locally Aggregated Descriptors (VLAD): 1st order moments;
• Residual Enhanced Visual Vector: similar to VLAD, as residual + LDA;
• Hamming Embedding: BoW plus a binary vector (Hamming distance);
• Fisher Vector: 2nd order moments;
• Compressed Fisher Vector: Fisher vector + Product Quantization;
• PQ: decomposed into Cartesian product of subspace, then quantized separately;
• Spectral Hashing: partition by spectral method (eigenvectors of graph Laplacian).
• Scalable Compressed Fisher Vector: sparsity of FVs;
• Robust Visual Descriptor: similar to SCFV, robust cluster + bit selection.
State-of-Art Methods of Object
Detection/Classification 3. Part-based method (structure);
Constellation Model: find mean of the appearance density, mean location & uncertainty in part location,
output the prob. of that part being present;
Pictorial Structures: modeled with unary template and pair-wise springs, then joint estimation of part
locations;
Implicit Shape Model: Consistent configurations of the observed parts (visual words) with spatial
distribution and final probabilistic voting for object segmentation;
Deformable model: learn the latent SVM model structure, filters, deformation costs.
Constellation Model Originally for unsupervised learning of object categories;
It represents objects by estimating a joint appearance and shape distribution of their parts;
The model is very flexible and can even be applied to objects that are only characterized by
their texture;
Representation
Joint model of part locations
Ability to deal with background clutter and occlusions
Multiple mixture components for different viewpoints
Learning
Manual construction of part detectors
Estimate parameters of shape density
Now semi-unsupervised
Automatic construction and selection of part detectors
Estimation of parameters using EM
Recognition
Run part detectors over image
Try combinations of features in model
Use efficient search techniques to make fast
Pictorial Structure
Objects are modeled by a collection of parts in a deformable configuration;
Statistical framework
Prior distribution is defined as a tree-structured Markov random field where no preference is given to
the absolute location of each part;
Model parts based on the response of Gaussian derivative filters of different orders, orientations and
scales;
Connections modeled by springs between parts: Gaussian Distribution.
Best match is found by minimizing function that measures both individual match costs and connection
costs;
That is, how well each part matches at its location and how well the part locations agree with the deformable model;
Matching a pictorial structure does not involve making any decisions about location of individual
parts;
It is solved independently and implies that any kind of part model can be used as long as maximum
likelihood can be computed for an individual part;
Implicit Shape Model Representation: object shape is only defined implicitly;
It does not model semantically meaningful parts; instead, an object is a collection of a large number of prototypical
features that densely cover the object area;
Each feature has a defined appearance and a spatial probability distribution for the locations relative
to the object center;
Learning:
Build a visual vocabulary from local features overlapped with training objects;
Learns a spatial occurrence distribution for each visual word (a list of all positions and scales relative
to the object center, and a reference figure-ground mask for each occurrence entry, used for inferring a
top-down segmentation);
Recognition: by the Generalized Hough Transform
In the test image, only a small fraction of the learned features will typically occur, but their consistent
configuration still provides strong evidence for an object’s presence;
Each activated visual word then casts votes for possible positions of the object center according to its
learned spatial distribution;
Consistent hypotheses are searched as local maxima in the voting space.
Implicit Shape Model (Training + Recognition)
Deformable Model Latent SVM Model training: discriminative model with latent variable
The learned positions of object-parts and the position of the whole object are the Latent Variables;
Training data consists of images with labeled bounding boxes;
Need to learn the model structure, filters and deformation costs;
Detection: 8x8 blocks, HOG feature at different resolution;
Root filter: rectangular templates defining weights for features
Learn root filter by standard SVM;
Part filter: Multi-scale model captures features;
Deformation model: matching with pictorial structures.
Articulated Human Detection with Flexible Mixtures-of-Parts
• Human detection and human pose estimation in static images based on deformable part models;
• A general, flexible mixture model that jointly captures spatial relations between part locations and co-
occurrence relations between part mixtures, augmenting standard pictorial structure models that encode
just spatial relations;
• Learn all parameters, including local appearances, spatial relations, and co-occurrence relations (which
encode local rigidity) with a structured SVM solver.
Flexible mixture-of-parts model (middle), not classic
approaches (left) by warping template to different
orientation and foreshortening states (top right),
approximate small warps by translating patches
connected with a spring (bottom right).
A visualization of our model for K = 14 parts and T
= 4 local mixtures, trained on the Parse dataset.
Appearance and Expressive Spatial Models for Human Pose Estimation
Extend the basic Pictorial Structure model: (a) to more flexible structure with stronger local
appearance representations including single component part detectors (b) and mixtures of
part detectors (c). Combine local appearance model with mid-level representation based
on semi-global poselets which capture configurations of multiple parts (d).
MODEC: Multimodal Decomposed Models for Posture
• Most pictorial structures researchers have put effort into better and larger feature
spaces, in which they fit one linear model;
• Feature computation is expensive and still fails to capture many appearance modes in real data.
• Rather than seeking ever more features for high-dimensional linear separability, MODEC models
non-linearities in simpler, lower-dimensional feature spaces, using locally linear models.
Armlet: a Varied Poselet for Upper Body Posture
• Predict the pose of both arms for each person;
• Train highly discriminative classifiers to estimate the arm configurations (armlets);
• Extend HOG, integrate strong contours, skin color and contextual cues.
Far left: HOG with local gradient contours for each cell.
Left: each cell is HOG with gPb. Right: average value of
the skin classifier at each cell. Far right: context feature; at
each cell, a poselet activation feature vector; for each
poselet type, maximum of all poselet activations whose
center falls in the cell, or zero if no activations are present.
Pose Recognition from A Depth Image
From a depth image of single frame, the pipeline includes background removal, initial pose estimation,
and pose correction. The skeleton joints marked by color dots in (c) are the ones with high confidence in
estimation whereas the ones without color dots are with the low confidence.
Each pixel (black square) casts a 3D vote (orange line) for each joint. Mean shift is used to aggregate
these votes and produce a final set of hypotheses for each joint. The highest confidence hypothesis for
each joint is shown. NB ‘left’ refers to the user’s left as if looking in a mirror.
1. regression directly from the raw depth image, no use of an arbitrary intermediate representation;
2. applicability to general motions (not constrained to particular activities);
3. the ability to localize occluded as well as visible body joints.
Learning Pose from Depth Images
State-of-Art Methods of Object
Detection/Classification 4. Generative models vs. Discriminative models;
– Naïve Bayes classifier, MRF, PCA, mixture Gaussian, pLSA, LDA, …;
– SVM, Adaboost, random forest, CRF, Logistic regression, KNN,…;
– Deep learning: convolutional NN, DBN, DBM, Auto-encoder, ....
Neural Network (MLP) A multilayer perceptron (MLP) is a feedforward artificial NN model that maps sets of input data onto a
set of appropriate outputs;
An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the
next one; Except for the input nodes, each node is a neuron (or processing element) with a
nonlinear activation function;
MLP utilizes a supervised learning technique called back propagation for training the network;
MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly
separable.
Decision Trees
Classification trees and regression trees predict responses to data.
To predict a response, follow the decisions in the tree from the root node down to a leaf
node. The leaf node contains the response.
Classification trees give responses that are nominal, such as 'true' or 'false'.
Regression trees give numeric responses.
A Decision Tree consists of three
types of nodes:
1. Decision nodes;
2. Chance nodes;
3. End nodes .
Naïve Bayes Classifier
The Naive Bayes classifier is designed for use when features are independent of one
another within each class, but it appears to work well even when that
independence assumption is not valid. It classifies data in two steps: Training step: Using the training samples, the method estimates the parameters of a probability
distribution, assuming features are conditionally independent given the class.
Prediction step: For any unseen test sample, the method computes the posterior probability of that
sample belonging to each class. The method then classifies the test sample according to the largest
posterior probability.
The class-conditional independence assumption greatly simplifies the training
step since you can estimate the one-dimensional class-conditional density for
each feature individually;
While the class-conditional independence between features is not true in general, research shows
that this optimistic assumption works well in practice;
This assumption of class-conditional independence allows the Naive Bayes classifier to better estimate the
parameters for accurate classification while using less training data than many other classifiers;
This makes it particularly effective for datasets containing many predictors or features.
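The two steps can be sketched as follows, assuming Gaussian class-conditional densities per feature (one common choice, not stated on the slide). The toy data, class names and the small variance floor are assumptions for illustration.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples, labels):
    """Training step: per class, a prior and one 1-D Gaussian per feature."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small variance floor (an assumption) avoids division by zero.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[y] = (n / len(samples), means, variances)
    return model


def predict_nb(model, x):
    """Prediction step: pick the class with the largest (log) posterior."""
    scores = {}
    for y, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for v, m, var in zip(x, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[y] = log_p
    return max(scores, key=scores.get)


X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.2], [3.2, 0.1], [2.9, 0.0]]
y = ["face", "face", "face", "background", "background", "background"]
nb_model = train_gaussian_nb(X, y)
nb_label = predict_nb(nb_model, [1.1, 2.0])
```

Each feature's density is estimated independently, which is why the training step stays one-dimensional.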
Support Vector Machines (SVM)
Separable Data An SVM classifies data by finding the best hyperplane that separates all data points of one class
from those of the other class.
“Margin” means the maximal width of the slab parallel to the hyperplane that has no interior data
points.
The support vectors are the data points that are closest to the separating hyperplane.
Non-separable Data Your data might not allow for a separating hyperplane. In that case, SVM can use a soft margin,
meaning a hyperplane that separates many, but not all data points.
Kernel trick: Polynomials, Radial basis or Sigmoid function
Generative Model: MRF
Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi
takes value fi in a label set L.
Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it
satisfies Markov property.
Generative model for joint probability p(x)
allows no direct probabilistic interpretation
define potential functions Ψ on maximal cliques A
map joint assignment to non-negative real number
requires normalization
An MRF is an undirected graphical model
Hidden Markov Model
A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;
In HMM, state is not visible, but output, dependent on state, is visible.
Each state has a probability distribution over the possible output tokens;
Sequence of tokens generated by an HMM gives some information about the sequence of states.
Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;
An HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;
Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP);
Learning: optimize state transition and output probabilities by Baum-Welch algorithm (special case of EM).
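A small sketch of the two inference tasks, assuming a hypothetical two-state weather model with made-up probabilities: the forward pass gives the probability of the observed sequence, and Viterbi gives the most likely state trajectory by dynamic programming.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Probability of the observed sequence (forward pass)."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence and its probability (dynamic programming)."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(((best[r][0] * trans_p[r][s], best[r][1])
                              for r in states), key=lambda c: c[0])
            new_best[s] = (prob * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Hypothetical toy model: hidden weather, visible activities.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
obs = ["walk", "shop", "clean"]

seq_prob = forward(obs, states, start_p, trans_p, emit_p)
best_prob, best_path = viterbi(obs, states, start_p, trans_p, emit_p)
```

Baum-Welch learning would re-estimate `trans_p` and `emit_p` from forward and backward passes; it is omitted here.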
Discriminative Model: CRF
Conditional, not joint, probabilistic sequential models p(y|x)
Allow arbitrary, non-independent features on the observation seq X
Specify the probability of possible label seq given an observation seq
Prob. of a transition between labels depends on past/future observ.
Relax strong independence assumptions, no p(x) required
CRF is MRF plus “external” variables, where “internal” variables Y of MRF are un-observables and “external” variables X
are observables
Linear chain CRF: transition score depends on current observation
Inference by DP like HMM, learning by forward-backward as HMM
Optimization for learning CRF: discriminative model
Conjugate gradient, stochastic gradient,…
AdaBoost Boosting: at each step, the training data are re-weighted so that incorrectly classified objects get
larger weights in a new, modified training set, which in effect maximizes the margins
between objects;
Classifiers are constructed on weighted versions of the training set, which are
dependent on previous classification results;
Boosting learning originated from the Probably Approximately Correct (PAC) learning
theory;
AdaBoost is the first algorithm that could adapt to the weak learners;
Variant of Adaboost (Adaptive boosting): originally Discrete AdaBoost
LogitBoost:
GentleBoost: Update is fm(x) = P(y=1|x) – P(y=0|x) instead of RealBoost’s
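The re-weighting idea can be sketched with discrete AdaBoost over 1-D threshold stumps. This is a minimal illustration under assumptions (toy data, labels in {+1, -1}, stumps as weak learners), not any of the listed variants in full.

```python
import math


def stump_predict(x, thr, polarity):
    """Weak learner: polarity if x > thr, else -polarity."""
    return polarity if x > thr else -polarity


def train_stump(xs, ys, w):
    """Pick the threshold/polarity with the lowest weighted error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for polarity in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, polarity = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, polarity))
        # Misclassified samples get larger weights for the next round.
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def strong_predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
train_preds = [strong_predict(ensemble, x) for x in xs]
```

No single stump separates this data, but three weighted stumps do.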
State-of-Art Methods of Object
Detection/Classification 5. Efficiency in Detection/Classification:
Efficient features: Vector quantization;
Integral/Aggregate channel features;
Divide-and-conquer: Cascaded Classifiers.
Coarse-to-fine: Feature scaling, feature pyramid;
Branch-and-bound; Efficient subwindow search.
Generic object proposal:
Selective search;
BING for objectness;
Dynamic programming: avoid repeated computation.
Parallelize (hardware): GPU;
Multi-core;
Cloud?
Integral Channel Features (ICF) Multiple registered image channels are computed using image linear/non-linear transformations, called
integral channel features;
Features such as local sums, histograms, Haar features and their various generalizations are computed
using integral images;
ICF naturally integrate heterogeneous sources of information, have few parameters, and result in fast,
accurate detectors.
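Why integral images make local sums cheap can be sketched as below: after one pass over a channel, any rectangular sum costs four lookups. The tiny 3x3 channel is a made-up example.

```python
def integral_image(channel):
    """ii[r][c] = sum of channel over rows < r and columns < c (zero-padded)."""
    h, w = len(channel), len(channel[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += channel[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


channel = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
ii = integral_image(channel)
center_sum = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
```

A Haar-like feature is then a difference of two or more such box sums on one channel.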
Integral Channel Features (ICF) Very efficient for human/pedestrian detection;
6 quantized orientations, 1 gradient magnitude, 3 LUV color channels.
Multi-scale without image scaling.
Boosted classifiers: soft cascades
Two level decision trees.
Aggregate Channel Features (ACF)
• Compute several channels of the image, sum every block of pixels, and smooth the resulting lower-resolution (LR) channels;
• Features are single pixel lookups in the aggregated channels;
• Boosting is used to learn decision trees over these features (pixels) to distinguish object from
background;
• A multiscale sliding window is applied;
• With the appropriate choice of channels to design, ACF achieves SoA performance in pedestrian
detection;
• Normalized gradient magnitude, histogram of oriented gradients and LUV color;
• Compute feature pyramid at octave-spaced scale intervals;
• Adaboost for training, combining 2048 depth-two trees over 5120 candidate features in each 128x64 search window.
Segmentation as Selective Search Challenges
Objects extremely diverse
Various shapes, sizes and appearances;
Within object variation
Multiple materials and textures with strong
interior boundaries;
Many objects in an image
Selective search by hierarchical grouping:
encouraging diversity.
BING: Binarized Normed Gradients What is the object?
Standalone, unique with different appearance from
surroundings, well closed boundary;
Objectness metric is for how likely a window covers an
object of any category;
Reduce the search space;
Allow strong classifiers.
Normed gradients (NG) + Linear SVMs
BING feature: illustration
Use a single atomic variable (INT64 & BYTE) to represent a
BING feature and its last row.
CVPR 2014
Branch-and-Bound for Sub-window Search
It is a general algorithm for optimal solution, especially for discrete and
combinatorial optimization;
A branch-and-bound algorithm consists of a systematic enumeration of all
candidate solutions, where large subsets of fruitless candidates are
discarded en masse, by using upper and lower estimated bounds of the
quantity being optimized;
Splitting (branching): given a set of candidates S, split it into subsets with S1 ∪ S2 ∪ … = S; applied recursively, this defines a search
tree whose nodes are S1, S2, …
Bounding: upper/lower bounds for the objective function within Si;
The idea (for minimization): prune by maintaining a global variable recording the minimum upper
bound, so any node whose lower bound is greater than it can be discarded;
Used for efficient sliding sub-window search.
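A 1-D analogue of sub-window search can illustrate the scheme. This sketch assumes a maximization problem (best-scoring interval over per-position scores), so it uses upper bounds with best-first search; the 2-D rectangle case follows the same pattern with four coordinate ranges. The function name and the toy scores are assumptions.

```python
import heapq


def best_interval(weights):
    """Return (score, left, right) of the max-sum interval via branch-and-bound."""
    n = len(weights)
    pos = [0.0] * (n + 1)  # prefix sums of positive parts
    neg = [0.0] * (n + 1)  # prefix sums of negative parts
    for i, w in enumerate(weights):
        pos[i + 1] = pos[i] + max(w, 0.0)
        neg[i + 1] = neg[i] + min(w, 0.0)

    def span(prefix, a, b):
        """Sum over indices a..b inclusive, 0 if the range is empty."""
        return prefix[b + 1] - prefix[a] if a <= b else 0.0

    def upper_bound(l_lo, l_hi, r_lo, r_hi):
        # Positives from the largest interval, negatives from the smallest one.
        return span(pos, l_lo, r_hi) + span(neg, l_hi, r_lo)

    start = (0, n - 1, 0, n - 1)  # ranges for the left and right end points
    heap = [(-upper_bound(*start), start)]
    while heap:
        neg_bound, (l_lo, l_hi, r_lo, r_hi) = heapq.heappop(heap)
        if l_lo == l_hi and r_lo == r_hi:
            # A single candidate: its bound is its exact score, so it is optimal.
            return -neg_bound, l_lo, r_lo
        if l_hi - l_lo >= r_hi - r_lo:  # branch on the wider range
            mid = (l_lo + l_hi) // 2
            children = [(l_lo, mid, r_lo, r_hi), (mid + 1, l_hi, r_lo, r_hi)]
        else:
            mid = (r_lo + r_hi) // 2
            children = [(l_lo, l_hi, r_lo, mid), (l_lo, l_hi, mid + 1, r_hi)]
        for child in children:
            heapq.heappush(heap, (-upper_bound(*child), child))


result = best_interval([-1, 2, 3, -4, 5, -2])
```

Candidate sets whose bound never reaches the top of the queue are never expanded, which is the pruning step.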
State-of-Art Methods of Object
Detection/Classification 6. Open Questions:
Viewpoint variations;
Scales and appearance variation (illumination and shape distortion);
Cluttered background, partial occlusion, lack of contextual inform.;
Multiple poses (out of the plane or on the plane) or articulation;
Variations of the intra class or category.
State-of-Art Methods of Object
Detection/Classification 7. “Open Set” problem: how to handle unknown or unfamiliar classes;
Label as one of known classes or as unknown;
Zero shot learning/unseen class detection;
Novelty detection with null space methods; One class SVM;
Multiple classes: Artificial super class from all given classes;
Combine several one class classifiers learned separately;
K-nearest neighbors;
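One simple way to label a sample as a known class or as unknown is nearest-neighbor classification with a rejection distance. The sketch below assumes Euclidean features and a hand-picked threshold; both are illustrative assumptions rather than a recommended setting.

```python
import math


def classify_open_set(sample, gallery, reject_distance):
    """1-NN label, or 'unknown' if the nearest known sample is too far away."""
    label, dist = min(((lbl, math.dist(sample, feat)) for feat, lbl in gallery),
                      key=lambda c: c[1])
    return label if dist <= reject_distance else "unknown"


# Hypothetical 2-D feature vectors for two known classes.
gallery = [([0.0, 0.0], "cat"), ([0.0, 1.0], "cat"), ([5.0, 5.0], "dog")]
known = classify_open_set([0.2, 0.3], gallery, reject_distance=2.0)
novel = classify_open_set([20.0, 20.0], gallery, reject_distance=2.0)
```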
State-of-Art Methods of Object
Detection/Classification 8. “Data unbalancing” problem:
Resampling methods for balancing the data set. Over-sampling, under-sampling, importance sampling;
Modification of existing learning algorithms. Cost-sensitive learning;
One class classification;
Classifier ensemble (bagging, boosting, random forest…)
Measuring the classifier performance in imbalanced domains. ROC, F-measure,…
Relationship between class imbalance and other data complexity characteristics.
State-of-Art Methods of Object
Detection/Classification 9. Face detection/recognition is a special area: large variations.
Face detection: Viola-Jones’s cascaded Adaboost for cascaded simple feature extractions;
Face verification/authentication: validate a claimed identity based on the image, and either accept
or reject the identity claim (one-to-one matching);
Face identification: identify a person based on the image, compared with all the registered persons
(one-to-many matching);
Face clustering: find the common people among these faces.
Face Detection and Pose Alignment
• Paul Viola’s method: Integral image + Cascaded Adaboost;
• Boosting learning:
• RealBoost;
• FloatBoost;
• GentleBoost;
• HoG + SVM + Image Pyramid (dlib C++ lib);
• Generic linear features: Anisotropic Gaussian filters;
• Variation of LBP and HoG for face detection: • Local Gradient Pattern and Binary Histogram of Gradients;
• Integral/Aggregate Channel Features: HOG + LUV;
• Shape features: edgelet, shapelet;
• DPM (Deformable Part Model) or Pictorial model plus SVM;
• Strongly or weakly supervised.
• Antiface: multi-template matching in cascade.
Face Detection and Pose Alignment View/Pose alignment + view/pose specific detectors;
Facial feature (2d-landmark) detection for alignment: holistic or local
ASM, AAM: generative model;
Elastic graph matching;
Constrained local model (CLM): global shape constraints;
Explicit Shape Regression;
Robust cascaded pose regression (RCPR);
Conditional regression forests;
Tree Structured Part Model (TSPM): [Zhu’12];
Ensemble of regression trees (dlib C++ lib);
Supervised Descent Method;
Parts-based deformable shape model;
Congealing or funneling: reduce the entropy by transform.
Face Detection, Landmark Localization and Pose Alignment • Ensemble of regression trees used to estimate the face’s
landmark positions directly from a sparse subset of pixel
intensities;
• Based on gradient boosting, learn by optimizing the sum of
square error loss and naturally handles missing or partially
labelled data;
• Appropriate (exponential) priors exploiting the structure of
image data helps with efficient feature selection;
• Different regularization strategies to combat overfitting;
• Can detect 194 landmarks on face from a single image in a
millisecond for face alignment;
• Open sources: http://dlib.net/.
Landmark estimates at different levels of the cascade initialized with the mean shape centered
at the output of a Viola face detector. T is the number of strong regressors in the cascade.
Joint Cascaded Face Detection and Alignment
• Define the Post classifier: • use the Viola-Jones detector in OpenCV with a low threshold to ensure a high recall;
• split all the images into two parts, then use the positive and negative output windows in the first part to
train a linear SVM classifier, and test all the output windows in the second part;
• Feature extractors: the 3rd one is the best for classifier performance; • 1. divide the window into 6*6 non-overlapping cells and extract a SIFT in each cell;
• 2. use a mean face shape with 27 facial points and extract a SIFT centered on each point;
• 3. align the 27 facial points and extract a SIFT centered on each point;
• Local learning of the tree structure;
• Global learning of the tree output;
• A Unified Framework for Cascade Face Detection and Alignment: • Cascade Detection;
• Cascade Alignment;
• A Unified Framework;
• Joint Learning of Detection and Alignment: • S-strategy: in the split test of each internal node randomly choose to either minimize the binary
entropy for classification or the variance of the facial point increments for regression;
• use RealBoost for the cascade classification learning with multi-scale shape indexed pixel difference
features.
Face Recognition: Verification or Identification
• How to represent the face?
• Features are hand-crafted or learned automatically;
• Global feature-based: eigen face, fisher face;
• Local feature-based: Gabor feature, Haar, HOG, LBP, SIFT feature.
• Hierarchically? (manual first, then learn from it…)
• Dimensionality reduction? (high dimensional is good)
• Subspace methods: PCA, LDA
• Manifold methods.
• Bag of words (BoW): encoding/quantization
• VLAD, Fisher vector;
• Spatial information.
• Matching metric learning: weight optimization, Joint Bayesian method
• Siamese network (two identical convolutional network sharing weights)
High Dim LBP Feature for Face Verification • Making a high-dimensional (e.g., 100K-dim)
face feature is critical to high performance;
• Local Binary Pattern (LBP) descriptor;
• First extract multi-scale patches centered at
dense facial landmarks;
• Then divide each patch into a grid of cells and
code each cell by a certain descriptor;
• Finally concatenate all descriptors to form our
high-dimensional feature.
• Learn a sparse linear projection with a much
lower computational/storage cost.
• Adopt PCA first;
• Then supervised subspace learning methods such as
LDA and Joint Bayesian are applied to extract
discriminative information for face recognition;
• Learn a sparse linear projection (L1-based
regression) directly mapping high-d feature to low-d
feature.
In the training phase, low-dimensional features 𝑌 are first
obtained by PCA and supervised subspace learning. Then
learn the sparse projection matrix 𝐵 which maps 𝑋 to 𝑌 by
the rotated sparse regression. In the testing phase,
compute the low-dimensional feature by directly projecting
high-dimensional feature using sparse matrix 𝐵.
Metric Learning: Joint Bayesian Method
• Model the appearance difference of two
faces jointly with an appropriate prior on
the face representation in verification;
• Each face is the summation of two
independent Gaussian latent variables, i.e.
intrinsic for identity, and intra-personal for
within-person variation.
• EM like algorithm for learning and closed form
solution for testing.
• Derived new similarity metric preserves
the separability and leads to better
performance;
• Can be viewed as a reference model as
well with parametric form.
3-D Face Analysis and Recognition Preprocessing: surface smoothing, noise removal and hole filling;
3-d facial landmark detection and face registration;
Curvature-based (spin image is too costly): landmark extraction;
3D statistical facial feature model (SFAM): both global and local;
Procrustes Analysis and Iterated Closest Point (ICP) for registration;
Depth image: RGB-D;
Deformable face model.
Feature extraction/learning;
Curvature-based;
Part-based feature;
Learning again?
Feature matching.
Region-based ICP;
Shape and texture.
Average Face Model Average Regional Model
3-D Model based Face Alignment • Model types:
• 3-D morphable model;
• 3-D mesh model;
• Parts-based model.
• Recover the frontal pose by 3d geometrical transformations;
• Align a 2D face image to a 3D face image and then rotate it to render the frontal
view;
• A 3d-to-2d camera is fitted for this rotation (face frontalization);
• Robust to illumination and viewpoint variation;
(a) Query photo; (b) facial feature detections; (c) and (d) the same on a reference face and model; (e) estimate a projection matrix
used to back-project to the reference coord. system; (f) estimated visibility due to non-frontal poses; (g) final frontalization.
[Figure panels: Improved single-3D; Single-3D; DeepFaces]
Infrared Face Recognition
• Face recognition faces challenges in the presence of illumination, pose and expression
changes, as well as facial disguises;
• Appearance-based;
• Registration;
• Feature-based;
• Infrared LBP;
• Wavelet/Curvelet Transform;
• Multi-spectral/Hyper-spectral;
• Multi-modal methods.
Video-based Face Recognition
• Merits: a set of observations, temporal dynamics and 3D information;
• Challenges: pose, expression, illumination, scale, motion blur and occlusion;
• Set-to-Set method:
• Frames of a video showing the same face are often represented as sets of vectors, one
vector per frame; then recognition becomes a problem of determining similarity between
vector sets in different spaces and metrics;
• Algebraic methods that compare sets regard each video as a linear subspace, spanned
by the vectors encoding the frames in the video;
• Pyramid Match Kernel (PMK) is non-algebraic kernel for encoding similarities between
sets of vectors in a hierarchical structure;
• A set-to-set similarity measure: the Matched Background Similarity (MBGS).
• Sequence method:
• Split tracking and recognition to use the temporal dynamics, such as HMM, DTW.
Facial Expression/Emotion Recognition
• Questions:
• Features are holistic or analytic;
• Temporal information used or not;
• View or volume-based.
• Measuring Facial Actions:
• Facial Action Coding System (FACS): muscle‐based, describing visually distinguishable facial movements;
• Facial Expression Extraction: both facial deformation and motion parts.
Facial Expression/Emotion Recognition
• Features: Haar, LBP, HOG, Optic flow;
• Recognition: frame-based and sequence-based
• Frame: PCA, ICA, LDA, kNN, GentleBoost, SVM, NN;
• Sequence: HMMs, DTW, motion energy, RNN .
• Deep Learning-based: CNN, RNN, 3D-CNN
• Deformation extraction:
• Image-based: Gabor filter, LDA/PCA, Gabor WT;
• Model-based: AAM, PDM, Facial Animation;
• Motion extraction:
• Frame-based: snake, region-based, model units;
• Sequence-based: Candid node, feature tracking
State-of-Art Methods of Object
Detection/Classification 10. Text detection and recognition: text is object too.
Text detection: connected component-based or texture(template, i.e. sliding window)-based;
Maximally Stable Extremal Regions (MSERs) first;
Feature HOG works;
Text recognition: OCR (optical character recogn.), license plate recogn.
Pictorial structure model;
Lexicon helps.
“video text detection and recognition: dataset and benchmark”, by P Nguyen, K Wang, S Belongie, ICCV13.
State-of-Art Methods of Object
Detection/Classification 11. Scene parsing or semantic image segmentation.
Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and
to account for context;
Most methods rely on a pre-segmentation into superpixels or other segment candidates, and extract features and
categories from individual segments and from various combinations of neighboring segments;
Deep learning-based method: hierarchical feature learning.
RGB-D data: depth information helps inferring 3D relationship in the scene.
SuperParsing: Scalable Nonparametric Image Parsing with Superpixels
• First perform global scene-level matching against the training set, followed by superpixel-level
matching and MRF optimization for incorporating neighborhood context.
• Compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car)
and geometric classes (sky, vertical, ground).
Nonparametric Scene Parsing: Label Transfer via
Dense Scene Alignment
• Retrieve its nearest neighbors
from a large database
containing fully annotated
images;
• Then, establish dense
correspondences between the
input image and each of the
nearest neighbors using the
dense SIFT flow algorithm;
• Finally, warp the existing
annotations and integrate
multiple cues in a MRF
framework to segment and
recognize the query image.
Indoor Segmentation and Support Inference from RGBD Images
• Interpret the major surfaces, objects, and support relations of an indoor scene;
• Parse indoor scenes into floor, walls, supporting surfaces, and object regions, and
recover support relationships; 3D cues can best inform a structured 3D interpretation.
Dataset for Object Classification
Caltech 101 and 256;
LabelMe;
ImageNet;
VOC: Pascal;
MS COCO;
Dataset for Object Classification Face Recognition:
NIST FERET;
Yale Face;
CMU PIE Face;
Labeled Faces in the Wild
Database of Scene Parsing • Stanford background: 715 images with 8 classes of
outdoor scenes;
• SIFT Flow: 2688 images with 33 scenes;
• CamVid: Cambridge driving Labeled Video Database;
• KITTI Vision Dataset: Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago;
• Barcelona database: 14871+279 images of 170 classes;
• MSRC: 240 small images with 9 classes;
• LabelMe: 2700 city scenes with 5-20 common classes;
• PASCAL VOC challenge: semantic segmentation with
20 FG classes and 1 BG class;
• RGB-D NYU data: 407024 RGB-D pairs with 894
categories;
• Microsoft COCO: image recognition, segmentation, and
captioning dataset.
Evaluation Metric Detection/Classification rate
true positives, false positives, true negatives and false negatives (reflected in confusion or
contingency matrix);
accuracy, precision, recall;
ROC, i.e. receiver operating characteristic;
plotting the fraction of true positives (TPR = true positive rate) vs. the fraction of false positives
(FPR = false positive rate), at various threshold settings;
• Precision = true positive / (true positive + false positive);
• Recall (sensitivity) = true positive / (true positive + false negative);
• Precision-Recall curve;
Complexity/Time;
Real-time?
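The rates above can be computed from scored detections at a chosen threshold, as in this sketch. The labels, scores and threshold are made-up values; sweeping the threshold would trace out the ROC and precision-recall curves.

```python
def detection_metrics(y_true, scores, threshold):
    """Confusion-matrix counts and derived rates at one score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def ratio(a, b):
        return a / b if b else 0.0  # convention assumed here for empty cases

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),      # also the true positive rate (TPR)
        "fpr": ratio(fp, fp + tn),         # false positive rate
        "accuracy": ratio(tp + tn, len(y_true)),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
m = detection_metrics(y_true, scores, threshold=0.5)
```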
Object Tracking
Definition: object tracking is generally posed as a recursive estimation problem, i.e.,
estimate the current location (size) given the estimate at the previous time instant as
well as the current observation (or measurement).
Object tracking is to find an object based on a short-time (applying Markov chain in
modeling) and narrow-viewed observer (a dynamic model for searching);
However, object tracking can collaborate with different types of object
detectors/classifiers to enhance the performance.
Object tracking can be performed on different feature levels: points, regions, contours
or blobs;
Tracked objects can be of varied types: articulated, deformable, fluid, multiple objects
and so on.
State-of-Art Methods of Object Tracking
Y Wu, J Lim, M-H Yang, “Online Object Tracking: A Benchmark,” CVPR 2013.
model update
L: local, H: holistic, T: template,
IH: intensity histogram, BP: binary pattern,
SR: sparse representation,
DM: discriminative model,
GM: generative model.
PF: particle filter,
MCMC: Markov Chain Monte Carlo,
LOS: local optimum search,
DS: dense sampling search.
Representation Scheme in Tracking
Holistic templates;
Subspace-based;
Sparse representation;
Feature-based:
Color histograms, HOG, covariance region descriptor, Haar-like features, LBP etc.;
Discriminative model with a binary classifier:
SVM, structured output SVM, ranking SVM, boosting, semi-boosting and multi-instance boosting;
Parts-based;
Bags of features (patch-based);
3-d information (depth) or multiple cameras-based?
PCA, AP & Spectral Clustering Principal Component Analysis (PCA) uses orthogonal transformation to convert a set of observations
of possibly correlated variables into a set of linearly uncorrelated variables called principal
components.
This transformation is defined in such a way that the first principal component has the largest
possible variance and each succeeding component in turn has the highest variance possible under the
constraint that it be orthogonal to the preceding components.
PCA is sensitive to the relative scaling of the original variables.
Also known, depending on the field, as the Karhunen–Loève transform (KLT) or Hotelling transform, and closely related to singular value
decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition etc.;
Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing"
between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require
the number of clusters to be determined or estimated before running the algorithm;
Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to
perform dimensionality reduction before clustering in fewer dimensions.
The similarity matrix consists of a quantitative assessment of the relative similarity of each pair of
points in the dataset.
Blind Source Separation & ICA Independent component analysis (ICA) is for separating a multivariate signal into
additive subcomponents by assuming that the subcomponents are non-Gaussian signals
and all statistically independent from each other. ICA is a special case of blind source separation.
Assumptions: the source signals are independent of each other; distribution of the values in
each source signals are non-Gaussian.
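Three effects of mixing signals:
Independence: the source signals are independent, but their signal mixtures may not be;
Normality: a signal mixture is closer to Gaussian than any of the original source signals;
Complexity: the complexity of a signal mixture is greater than that of its constituent source signals.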
Preprocessing: centering, whitening and dimension reduction;
ICA finds the independent components (latent variables) by maximizing the statistical
independence of the estimated components;
Definitions of independence for ICA: Minimization of mutual information (KL divergence or entropy);
Maximization of non-Gaussianity (kurtosis and negative entropy).
NMF & pLSA Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H,
such that all three matrices have no negative elements.
The different types arise from using different cost functions for measuring the divergence
between V and W*H and possibly by regularization of the W and/or H matrices;
squared error, Kullback-Leibler divergence or total variation (TV);
NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA
(probabilistic latent semantic analysis);
pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the
probability of each co-occurrence as a mixture of conditionally independent multinomial distributions;
Their parameters are learned using EM algorithm;
pLSA is based on a mixture decomposition derived from a latent class model, rather than downsizing the
occurrence tables by SVD as in LSA.
Note: an extended model, LDA (Latent Dirichlet allocation) , adds a Dirichlet prior on the per-document
topic distribution.
ISOMAP General idea:
Approximate the geodesic distances by shortest graph distance.
MDS (multi-dimensional scaling) using geodesic distances
Algorithm:
Construct a neighborhood graph
Construct a distance matrix
Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new distance matrix such that Dij is the
length of the shortest path between i and j.
Apply MDS to matrix to find coordinates
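The first three steps (neighborhood graph, distance matrix, shortest paths) can be sketched as below, assuming a k-nearest-neighbor graph and Floyd-Warshall. The final MDS step needs an eigen-decomposition and is left out. The U-shaped toy point set is made up to show geodesic distance exceeding straight-line distance.

```python
import math


def geodesic_distances(points, k):
    """Approximate geodesic distances by shortest paths on a k-NN graph."""
    n = len(points)
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        for j in order[1:k + 1]:  # skip the point itself (distance 0)
            w = math.dist(points[i], points[j])
            d[i][j] = d[j][i] = w  # symmetric neighborhood graph
    # Floyd-Warshall: allow each node m in turn as an intermediate stop.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d


# Points along a U-shaped curve; the two ends are close in the plane
# but far apart along the curve.
u_curve = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 1), (3, 0)]
D = geodesic_distances(u_curve, k=2)
end_to_end = D[0][7]
```

The straight-line distance between the two ends is 3, while the graph distance follows the curve.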
LLE (Locally Linear Embedding) General idea: represent each point on the local linear subspace of the manifold as a linear combination
of its neighbors to characterize the local neighborhood relations; then use the same linear coefficient for
embedding to preserve the neighborhood relations in the low dimensional space;
Compute the coefficient w for each data point by solving a constrained least-squares (LS) problem;
Algorithm: 1. Find weight matrix W of linear coefficients
2. Find low dimensional embedding Y that minimizes the reconstruction error $\Phi(Y) = \sum_i \left\| Y_i - \sum_j W_{ij} Y_j \right\|^2$
3. Solution: Eigen-decomposition of M=(I-W)’(I-W)
Local Tangent Space Alignment (LTSA) Every smooth manifold can be approximated locally by its tangent plane;
Stages: 1) A local parameterization is established for each data point; 2) then a global
alignment is computed.
Taylor series expansion of the embedding function f(•) in the local neighborhood;
We are given samples from the embedded manifold with noise therefore, for an
arbitrary point xi and its local neighbor and in the absence of the noise (εi = 0), we can
write:
Solve the problem:
where si is the i-th membership vector.
The optimal alignment (using LS): Substituting Li into the objective:
where S=[s1,…,sn], W=diag(W1,…,Wn), and
Solve using an EVD.
Laplacian Eigenmaps General idea: minimize the norm of the Laplace-Beltrami operator on the manifold, which
measures how far apart the map sends nearby points.
Avoid the trivial solution of f = const.
The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph with appropriate weights.
Construct the Laplacian matrix L=D-W.
The continuous objective can be approximated by its discrete equivalent
Algorithm:
Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors).
Construct an adjacency matrix with the following weights
Minimize
The generalized eigen-decomposition of the graph Laplacian is
Spectral embedding of the Laplacian manifold:
• The first eigenvector is trivial (the all one vector).
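A small sketch of the graph Laplacian L = D - W and two of its properties, assuming simple 0/1 edge weights on a toy path graph (heat-kernel weights are another common choice). It checks that the all-one vector is mapped to zero, which is why the first eigenvector is trivial, and that the quadratic form f'Lf equals the weighted sum of squared differences between neighbors. All helper names are illustrative.

```python
def graph_laplacian(W):
    """L = D - W, with D the diagonal degree matrix of the weight matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def mat_vec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def quadratic_form(M, f):
    return sum(f_i * r_i for f_i, r_i in zip(f, mat_vec(M, f)))


def pairwise_smoothness(W, f):
    """0.5 * sum_ij W_ij (f_i - f_j)^2: small when neighbors map close together."""
    n = len(W)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))


# Toy path graph 0 - 1 - 2 - 3 with unit edge weights (an assumption).
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
L = graph_laplacian(W)
f = [0.0, 1.0, 3.0, 6.0]
ones_image = mat_vec(L, [1.0, 1.0, 1.0, 1.0])
```

The embedding itself would come from the eigenvectors of L with the smallest nonzero eigenvalues, which requires an eigen-solver not shown here.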
Search Mechanism in Tracking Tracking is posed within an optimization framework, gradient descent methods used to locate the target;
Iterative Image Registration;
Mean shift;
Mean field;
Distribution Fields.
Dense sampling methods at the expense of high computational load;
Online boosting;
Online Multiple Instance Learning;
Struck: Structured Output Tracking with Kernels (SVM).
Stochastic search algorithms insensitive to local minima and computationally efficient:
Particle filters;
Incremental learning within Particle filter;
Sparse coding within particle filter;
Mean shift within particle filter.
Particle Filter Monte Carlo characterization of pdf:
Represent posterior density by a set of random i.i.d. samples (particles) from the pdf p(x0:t|z1:t)
For a larger number N of particles, equivalent to a functional description of the pdf
As N → ∞, approaches the optimal Bayesian estimate
Regions of high density
Many particles
Large weight of particles
Uneven partitioning
Discrete approximation for continuous pdf
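Draw N samples x0:t(i) from importance sampling distribution π(x0:t|z1:t)
Importance weight and its update
$P_N(x_{0:t} \mid z_{1:t}) = \sum_{i=1}^{N} w_t^{(i)} \, \delta\left(x_{0:t} - x_{0:t}^{(i)}\right)$
$w(x_{0:t}) = \dfrac{p(x_{0:t} \mid z_{1:t})}{\pi(x_{0:t} \mid z_{1:t})}$
$w_t^{(i)} \propto w_{t-1}^{(i)} \, \dfrac{p(z_t \mid x_t^{(i)}) \, p(x_t^{(i)} \mid x_{t-1}^{(i)})}{\pi(x_t^{(i)} \mid x_{t-1}^{(i)}, z_t)}$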
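A minimal bootstrap particle filter for a 1-D target can illustrate the weight update. It assumes the proposal equals the state transition prior, so the weight update reduces to the likelihood p(z_t|x_t), and it resamples at every step. The motion model, noise levels and observations are made-up values.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500, motion=1.0,
                              process_std=0.3, obs_std=0.5, seed=0):
    """Return the posterior-mean estimate of a 1-D state after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Predict: sample from the transition prior p(x_t | x_{t-1}).
        particles = [x + motion + rng.gauss(0.0, process_std) for x in particles]
        # Update: with this proposal the weight is the likelihood p(z_t | x_t).
        weights = [math.exp(-0.5 * ((z - x) / obs_std) ** 2) for x in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: high-density regions end up holding many particles.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Hypothetical noisy positions of a target moving one unit per frame.
observations = [1.0, 2.1, 2.9, 4.0, 5.1, 5.9, 7.0, 8.0]
estimates = bootstrap_particle_filter(observations)
```

Resampling resets all weights to 1/N, which is why the previous weight does not appear explicitly in the update.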
Mean Shift A tool for finding modes in a set of data samples, manifesting an underlying probability
density function (PDF) in R^N;
Non-parametric density estimation (Parzen window)
MS is for kernel density gradient estimation
Translate the kernel window by the mean shift vector: m(x)
Go to the maxima of density.
Used for visual tracking.
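Kernel density estimate: $P(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x} - \mathbf{x}_i)$
Kernel with profile k: $K(\mathbf{x} - \mathbf{x}_i) = c \, k\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)$
Density gradient, with $g = -k'$ and $g_i = g\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)$ (constant factors absorbed into c): $\nabla P(\mathbf{x}) = \frac{c}{n} \sum_{i=1}^{n} \nabla k_i = \frac{c}{n} \left[ \sum_{i=1}^{n} g_i \right] \left[ \frac{\sum_{i=1}^{n} \mathbf{x}_i g_i}{\sum_{i=1}^{n} g_i} - \mathbf{x} \right]$
Mean shift vector: $\mathbf{m}(\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{x}_i \, g\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\left( \left\| \frac{\mathbf{x} - \mathbf{x}_i}{h} \right\|^2 \right)} - \mathbf{x}$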
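The mode-seeking iteration can be sketched in 1-D as below, assuming a Gaussian kernel (so the weights g are also Gaussian). The window is moved to the weighted mean of the samples until the mean shift vector vanishes. Samples, bandwidth and start points are made-up values.

```python
import math


def mean_shift_mode(samples, start, bandwidth, tol=1e-6, max_iter=200):
    """Climb to a local maximum of the kernel density estimate from `start`."""
    x = start
    for _ in range(max_iter):
        # Gaussian profile: g is proportional to exp(-0.5 * ||(x - xi) / h||^2).
        g = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples]
        shifted = sum(gi * xi for gi, xi in zip(g, samples)) / sum(g)
        if abs(shifted - x) < tol:  # mean shift vector m(x) is ~0 at a mode
            return shifted
        x = shifted  # translate the window by m(x)
    return x


samples = [0.8, 0.9, 1.0, 1.1, 1.2, 4.9, 5.0, 5.1]
mode_a = mean_shift_mode(samples, start=1.6, bandwidth=0.5)
mode_b = mean_shift_mode(samples, start=4.0, bandwidth=0.5)
```

In tracking, the samples would be pixel locations weighted by how well their appearance matches the target model.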
Model Update in Tracking
Template update for the KLT algorithm;
Online mixture model;
Incremental subspace update;
Online boosting;
More robust online-trained classifier:
Semi-supervised;
Multiple instance learning;
Co-tracking with different features (co-training);
Multiple classifier boosting (object part-based or orientation-based);
Background separation for its multimodal distribution as well;
Object Structural Constraints by Positive-Negative learning (self learning);
Struck: Structured Output Tracking with Kernels (SVM).
Online Multiple Negative Modality
Learning Boosting for Tracking
Online Boosting Online learning: a learning algorithm is presented with one example at a time;
Since we don’t know a priori how difficult/good a sample is, the online boosting
algorithm turns to a new strategy to compute the weight distribution;
Oza proposed an on-line boosting framework in which the importance of a sample can be
estimated by propagating it through the set of weak classifiers.
However, Oza’s algorithm has no way of choosing the most discriminative feature
because the entire training set is not available at one time.
Grabner and Bischof proposed a modified method which performs feature selection
(defined as selectors) by maintaining a pool of M > N candidate weak classifiers:
The number of weak classifiers N is fixed at the beginning;
One sample is used to update all weak classifiers and the corresponding voting weights;
For each classifier, the most discriminative feature for the entire training set is selected from a given feature pool;
It is suggested that the worst feature be replaced by a new one chosen randomly from the feature pool.
Stochastic gradient descent is more suited for online learning, more specifically for
online boosting.
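A minimal sketch of the importance-propagation idea in Oza's online boosting: the sample importance `lam` is passed through the weak classifiers, each weak classifier is updated Poisson(lam) times, and lam is raised after a mistake and lowered after a correct prediction. The weak learner `MeanStump`, the class names and the synthetic data are assumptions for illustration; the feature selection step (selectors) of Grabner and Bischof is not included.

```python
import numpy as np

class MeanStump:
    """Online weak learner (illustrative choice): nearest class mean on one feature."""
    def __init__(self, feature):
        self.feature = feature
        self.sums = {-1: 0.0, 1: 0.0}
        self.counts = {-1: 0, 1: 0}

    def update(self, x, y):
        self.sums[y] += x[self.feature]
        self.counts[y] += 1

    def predict(self, x):
        if self.counts[-1] == 0 or self.counts[1] == 0:
            return 1  # arbitrary guess until both classes have been seen
        v = x[self.feature]
        d_pos = abs(v - self.sums[1] / self.counts[1])
        d_neg = abs(v - self.sums[-1] / self.counts[-1])
        return 1 if d_pos <= d_neg else -1

class OnlineBoosting:
    def __init__(self, weak_learners, rng):
        self.weak = weak_learners
        self.rng = rng
        self.lam_correct = np.zeros(len(weak_learners))  # importance seen when correct
        self.lam_wrong = np.zeros(len(weak_learners))    # importance seen when wrong

    def update(self, x, y):
        lam = 1.0  # importance of the sample, propagated through the weak classifiers
        for m, h in enumerate(self.weak):
            for _ in range(self.rng.poisson(lam)):
                h.update(x, y)
            if h.predict(x) == y:
                self.lam_correct[m] += lam
                eps = self.lam_wrong[m] / (self.lam_correct[m] + self.lam_wrong[m])
                lam *= 1.0 / (2.0 * (1.0 - eps))
            else:
                self.lam_wrong[m] += lam
                eps = self.lam_wrong[m] / (self.lam_correct[m] + self.lam_wrong[m])
                lam *= 1.0 / (2.0 * eps)

    def predict(self, x):
        total = np.maximum(self.lam_correct + self.lam_wrong, 1e-12)
        eps = np.clip(self.lam_wrong / total, 1e-6, 1.0 - 1e-6)
        alpha = np.log((1.0 - eps) / eps)  # voting weights
        votes = sum(a * h.predict(x) for a, h in zip(alpha, self.weak))
        return 1 if votes >= 0 else -1

# Synthetic two-class stream (assumption): class means at (+1, +1) and (-1, -1).
rng = np.random.default_rng(0)
def sample(n):
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * 1.0 + rng.normal(scale=0.5, size=(n, 2)), y

model = OnlineBoosting([MeanStump(0), MeanStump(1)], rng)
x_train, y_train = sample(400)
for x, y in zip(x_train, y_train):
    model.update(x, int(y))  # one example at a time
x_test, y_test = sample(200)
accuracy = float(np.mean([model.predict(x) == y for x, y in zip(x_test, y_test)]))
```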
Multiple Instance Learning (MIL)
The basic idea is that during training, examples are presented in sets or bags and labels
are provided for the bags rather than individual instances;
If the bag is labeled positive, it is assumed to contain at least one positive instance,
otherwise the bag is negative;
The ambiguity is passed on to the learning algorithm, which now has to figure out which
instance in each positive bag is the most “correct”.
The MIL problem can be solved by a gradient boosting framework (proposed by
Friedman) to maximize the log likelihood of bags;
A Noisy-OR (NOR) model is used to define the bag probability;
The instance-is-positive probability is modeled as the logistic function;
The weight on each sample is given as the derivative of the loss function with respect to a change
in the score of the sample.
People have been adapting classical classification methods, such as SVMs or boosting,
to work within the context of multiple-instance learning.
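A minimal sketch of the quantities named above, for a single bag: the logistic instance probability, the Noisy-OR bag probability, the bag log likelihood, and the per-instance weight as the derivative of the log likelihood with respect to the instance score. The function name `noisy_or_bag` and the example scores are assumptions for illustration.

```python
import numpy as np

def noisy_or_bag(scores, bag_label):
    """Noisy-OR bag model for one bag of instance scores; bag_label is 1 or 0."""
    scores = np.asarray(scores, dtype=float)
    # Instance-is-positive probability: logistic function of the classifier score.
    p_inst = 1.0 / (1.0 + np.exp(-scores))
    # Noisy-OR: the bag is positive unless every instance is negative.
    p_bag = 1.0 - np.prod(1.0 - p_inst)
    log_lik = np.log(p_bag) if bag_label == 1 else np.log(1.0 - p_bag)
    # Weight of each instance = d(log likelihood) / d(score of that instance).
    weights = (bag_label - p_bag) / p_bag * p_inst
    return p_bag, log_lik, weights

p_bag, log_lik, weights = noisy_or_bag([2.0, -1.0, -3.0], bag_label=1)
```

In a positive bag the instance with the highest probability receives the largest weight, so the boosting rounds concentrate on the instance that looks most "correct"; in a negative bag every weight is negative.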
Online Incremental SVM
Learning for Tracking
Incremental/Decremental SVM
Incremental online learning constructs the solution recursively, one point at a time;
Retain the Kuhn-Tucker (KT) conditions on all previously seen data, while “adiabatically” adding a new data point to
the solution;
Support vectors (SV) in soft margin classification: margin SV, error SV and ignored SV;
Leave-one-out (LOO) for predicting the generalization power of a trained classifier, implemented by
decremental unlearning, adiabatic reversal of incremental learning, on each of the training data from
the full trained solution.
Context in Tracker, Fusion of Trackers
Context information is also important;
Mining auxiliary objects or local visual information
surrounding the target to assist tracking;
The context information is helpful when the target is
fully occluded or leaves the image; “Learning where the object might be”;
“Exploring Supporters and Distracters in
Unconstrained Environments”.
Combines static, moderately adaptive and highly
adaptive trackers to account for appearance
changes;
Multiple trackers are maintained in MAP;
Multiple feature sets selected in a Bayesian way.
Multiple Object Tracking
The MOT tracker not only associates the objects with the observation data correctly (with
in-between interaction), but also discriminates between objects of similar appearance;
Persistence in motion;
Mutual exclusion;
Challenging situations:
Joint state space: variable number of objects?
Interaction (split/merge): occlusion;
Birth/death (enter/leave): appear/disappear.
Similar cases:
Articulated object tracking: body parts, hand fingers etc.
Deformable object tracking: facial expression with multiple action units.
Multiple Object Tracking
Typical methods:
Multiple independent object trackers (MioT): no interaction, so separate trackers are
applied simultaneously.
Multiple hypothesis tracker (MHT): k-best hypothesis, gating/pruning.
Joint probabilistic data association filter (JPDAF): expectation over all.
Probability hypothesis density: propagate 1st moment of posterior.
Particle filter:
Apply the probabilistic exclusion principle;
Bayesian Multiple Blob tracker: background subtraction;
Subordination links between particles;
Mixture model: particle clustering.
MCMC: Markov chain in sampling.
Markov random field-based: energy minimization of interaction and data terms;
Belief propagation: inference approximation.
Mean Field: variational approximation.
3-D Model-based Tracking
3-D model: CAD geometric model, planar parts or rough ellipsoid;
Pose (camera or object) estimation can considerably simplify the task;
Camera calibration or not?
Non-rigid object tracking with 3-d model: 3D Morphable Models;
Articulated object tracking with 3-d model: 3D kinematic chain structure;
Factorization-based: separation of motion from structure.
3-D Model-based Tracking
How to eliminate the drifting error?
Key frames for registration (pose refinement);
Bundle adjustment (used in the point-based method).
Edge-based method:
Look for strong gradients: fast and general;
Extract image contours and then fit them to the model outlines;
Pixel-based method:
Optic flow: add a feedback loop to suppress drift effect;
Template matching: Lucas-Kanade;
Interest point-based: Lucas-Kanade-Tomasi;
Note: tracking without 3-d models -> visual SLAM vs. SfM (structure from motion).
SLAM (simultaneous localization and mapping): to provide in real-time an estimation of
the relative motion of a camera and a 3D map of the surrounding environment;
Extensible tracking: attempts to add previously unknown scene elements to its initial
map, and these then provide registration even when the original map is out of sensing
range;
Deformable Object Tracking
2-D methods:
Active contour (snake);
Level set;
Exemplar-based shape prior: ASM, AAM as a generative model;
Project-out inverse compositional algorithm for model fitting;
Constrained local model: uses a point distribution model (PDM), but only models the
appearance of local patches around landmarks;
3-D methods: model the variability of a certain class of objects;
Deformable model:
Shape, represented as a curve or surface, is deformed in order to match a specific example;
Model texture variations and imaging factors, i.e. perspective projection and illumination effects;
3-D pose estimation from 2-D view: rigid pose.
Articulated Object Tracking
2-D methods:
Model-free methods;
Exemplar-based;
2-D model-based: cardboard-like;
3-D methods:
Without estimating parts’ joint angles;
With fixed basis;
Use top part’s tip locations;
3-D model-based: kinematic chain structure;
Quantized feature space;
Model refinement: factorization, key-frame-based;
Motion filter: Kalman filter, particle filter or multiple hypothesis tracker (MHT);
Data-driven dimensionality reduction by learning the configuration space;
3-D pose estimation from 2-D view: classification-based or direct mapping-based.
PTAM, DTAM and Semi-Dense TAM
PTAM: tracking and mapping are separated; mapping is based on key-frame BA (not updated
frame by frame), while tracking is based on camera pose estimation with patch-based
search;
DTAM: still key frames-based, a pixel-based rather than feature-based method for tracking
and mapping with a dense 3D surface textured model;
Dense inverse depth map estimate by regularization (TV): primal dual;
Occlusion/blur handling;
Tracking by image registration with synthesized realistic novel view:
Nonlinear LS: like iterative LK style (rotation first, then full pose).
Semi-Dense TAM: semi-dense depth of pixels with apparent image gradients;
Probabilistic depth map + uncertainty from both geometric and photometric disparity errors;
Reduction of image regions for estimating semi-dense inverse depth map for the current frame;
Tracking is still dense in image alignment.
Initial depth map is from stereo vision by five-point-pose algorithm (Multiple view stereo).
Feature Tracking
The popular KLT (Kanade-Lucas-Tomasi) method;
Good features for tracking;
Eigenvalues (Shi);
Affine alignment;
FAST, SIFT, SURF, HOG, …;
Gradient, orientation, histogram;
Random forest, FERNS: tracking-by-detection
The classifiers are applied for keypoint matching;
Multiclassifier based on randomized trees;
Random selection of features;
Random selection of patches.
For classification.
Note: Optic flow is dense tracking.
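A minimal sketch of the eigenvalue criterion behind "good features for tracking": a window is a good feature when the smaller eigenvalue of its gradient structure tensor is large. The function name `min_eigenvalue_map`, the 5x5 window and the synthetic square image are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def min_eigenvalue_map(image, window=5):
    """Smaller eigenvalue of the 2x2 structure tensor, summed over each window."""
    iy, ix = np.gradient(image.astype(float))

    def box_sum(a):
        return sliding_window_view(a, (window, window)).sum(axis=(-2, -1))

    sxx, syy, sxy = box_sum(ix * ix), box_sum(iy * iy), box_sum(ix * iy)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] are half_trace +/- root.
    half_trace = 0.5 * (sxx + syy)
    root = np.sqrt(0.25 * (sxx - syy) ** 2 + sxy ** 2)
    return half_trace - root

# Synthetic test image (assumption): a bright square on a dark background.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
scores = min_eigenvalue_map(image, window=5)
peak = np.unravel_index(np.argmax(scores), scores.shape)
center = (int(peak[0]) + 2, int(peak[1]) + 2)  # window centre in image coordinates
```

The score is near zero in flat regions and along straight edges (one dominant gradient direction) and peaks near the corners of the square, which is why corners are preferred for KLT tracking.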
Extract features independently and then match by comparing descriptors
[Figure: a fern of binary tests has 32 possible outputs (0 to 31); each output indexes the posterior distributions P(F_k | C = c_i), stored as look-up tables.]
Feature Tracking
After FERNS are trained;
Do patch classification.
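$$\text{class\_label} = \arg\max_{c_i} \prod_{k=1}^{M} P(F_k \mid C = c_i)$$
[Figure: example with three ferns (Fern 1, Fern 2, Fern 3); each fern output selects one entry of its posterior distributions (look-up tables).]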
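A minimal sketch of fern training and patch classification: each fern evaluates a few binary pixel comparisons, the resulting bit string indexes a look-up table of class-conditional probabilities, and the class maximizing the product over ferns (computed as a sum of logs) is returned. The class name `Ferns`, the +1 count prior, the pixel-pair tests and the synthetic patches are assumptions for illustration.

```python
import numpy as np

class Ferns:
    def __init__(self, n_ferns, n_tests, n_pixels, n_classes, rng):
        # Each binary test compares two randomly chosen pixels of the flattened patch.
        self.pairs = rng.integers(0, n_pixels, size=(n_ferns, n_tests, 2))
        self.powers = 2 ** np.arange(n_tests)
        # counts[k, f, c]: how often fern k gave output f for class c (+1 prior).
        self.counts = np.ones((n_ferns, 2 ** n_tests, n_classes))
        self.fern_ids = np.arange(n_ferns)

    def outputs(self, patch):
        flat = np.asarray(patch, dtype=float).ravel()
        bits = flat[self.pairs[..., 0]] > flat[self.pairs[..., 1]]
        return bits.astype(int) @ self.powers  # one integer output per fern

    def train(self, patch, label):
        self.counts[self.fern_ids, self.outputs(patch), label] += 1

    def classify(self, patch):
        # Look-up tables P(F_k | C = c_i), normalized over fern outputs.
        posteriors = self.counts / self.counts.sum(axis=1, keepdims=True)
        log_p = np.log(posteriors[self.fern_ids, self.outputs(patch)])
        # argmax over classes of the product over ferns (sum of logs).
        return int(np.argmax(log_p.sum(axis=0)))

# Synthetic data (assumption): three 8x8 prototype patches plus pixel noise.
rng = np.random.default_rng(0)
prototypes = rng.random((3, 8, 8))
ferns = Ferns(n_ferns=20, n_tests=5, n_pixels=64, n_classes=3, rng=rng)
for _ in range(100):
    for label, proto in enumerate(prototypes):
        ferns.train(proto + rng.normal(scale=0.05, size=proto.shape), label)
predicted = [ferns.classify(p + rng.normal(scale=0.05, size=p.shape))
             for p in prototypes]
```

With 5 tests per fern there are 2^5 = 32 possible outputs, matching the table size shown on the slide.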
Depth Sensor-based Tracking
Features:
RGBD HOG feature;
Point cloud feature;
3D iterative closest point (ICP);
Occlusion detection and recovery from RGBD;
S. Song, J. Xiao, “Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines”, Proc. of ICCV 2013.
3-D Model-based Body Tracking with Kinect RGB-D Data
• Predict the 3D position of each body joint from a single depth image
• No temporal information
• Uses an object recognition approach
• Single per-pixel classification
• Large and highly varied training set
• The classifications produce hypotheses of 3D body joint positions for skeletal tracking;
• Random forest-based classifiers with depth-invariant features.
• The method has been designed to be robust, in two ways:
• The system is trained with a vast, highly varied training set of synthetic images covering a wide range of:
• Ages, body shapes and sizes, clothing, hairstyles;
• The recognition does not rely on any temporal information:
• The system can initialize from arbitrary poses;
• Prevents catastrophic loss of track.
3-D Model-based Face Tracking with Kinect RGB-D Data
Use a 3-D deformable face model;
Handle noisy depth data by maximum likelihood;
L1 regularization in ICP-based tracking framework;
Feature tracking in RGB data helps.
Appendix: Action/Event Detection & Classification
Action/Event Detection and Classification
• Action classification: assigning an action label to a video clip;
• Action localization: search locations of an action in a video;
• Query for videos in professional archives and YouTube;
• Query and organize your personal video collections;
• Car safety (self-driving) and video surveillance;
• Detection of humans (pedestrians) and their motion;
• Detection of unusual behavior.
Challenging Issues
• Large variation in appearance
• Viewpoint changes
• Intra-class variation
• Camera motion
• Manual collection of training data is difficult
• Many action classes, rare occurrence
• Pose and object annotation often a plus
• Action vocabulary is not well defined
• What is the action granularity?
• How to represent composite actions?
Spatial-Temporal Features
• LTP (Local Trinary Pattern);
• MOSIFT;
• Histogram of optical flow (HOF);
• HoG3D;
• Motion boundary histograms (MBH);
• Internal Motion Histogram (IMH);
• Space-time interest points (STIP);
• Trajectory: dense or sparse;
• Trajectons: sparse;
• (Improved) Dense trajectory: optic flow.
Local Trinary Pattern
• Captures the effect of motion on the local structure of self similarities;
• The frame divided into a grid of m × n cells, histograms of 8-digit trinary strings measured;
• Examine the similarity between a patch centered at (x, y) at time t and the patch around (x − ∆x,
y) at time t + ∆t as the background statistic;
• Encode with one trit, whether one of the two similarities is significantly higher than the
other or whether the two similarities are approximately the same.
Patches at eight shifted locations at times
t−∆t and t+∆t are compared to a central
patch at time t to produce 16 similarities. 8
trits are used to represent each pixel.
SSD1 and SSD2 are computed patch
distances at one of the eight locations. ∆t is
set to 3 frames. Patches spread 4 pixels
around the center patch.
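A minimal sketch of the per-pixel encoding described in the caption: for each of the eight shifted locations, SSD1 (to the earlier frame) and SSD2 (to the later frame) are compared and encoded as one trit. The sign convention, the threshold value, the exact shift offsets, the 3x3 patch size and the synthetic frames are assumptions for illustration; the caller is expected to pass the frames at t−∆t, t and t+∆t.

```python
import numpy as np

# Eight shifted locations roughly 4 pixels from the centre (assumed layout).
SHIFTS = [(-4, 0), (-3, 3), (0, 4), (3, 3), (4, 0), (3, -3), (0, -4), (-3, -3)]

def patch(frame, r, c, half=1):
    """(2*half+1) x (2*half+1) patch centred at (r, c)."""
    return frame[r - half:r + half + 1, c - half:c + half + 1].astype(float)

def ltp_code(prev_frame, cur_frame, next_frame, r, c, threshold):
    """Return the 8 trits for pixel (r, c) of the current frame."""
    center = patch(cur_frame, r, c)
    trits = []
    for dr, dc in SHIFTS:
        ssd1 = np.sum((patch(prev_frame, r + dr, c + dc) - center) ** 2)
        ssd2 = np.sum((patch(next_frame, r + dr, c + dc) - center) ** 2)
        if ssd1 - ssd2 > threshold:
            trits.append(1)    # later frame is clearly more similar
        elif ssd2 - ssd1 > threshold:
            trits.append(-1)   # earlier frame is clearly more similar
        else:
            trits.append(0)    # the two similarities are approximately the same
    return trits

# Synthetic frames (assumption): a random texture translating 4 pixels to the right.
rng = np.random.default_rng(0)
cur = rng.integers(0, 256, size=(32, 32))
prev = np.roll(cur, -4, axis=1)
nxt = np.roll(cur, 4, axis=1)
static_code = ltp_code(cur, cur, cur, 16, 16, threshold=1000.0)
moving_code = ltp_code(prev, cur, nxt, 16, 16, threshold=1000.0)
```

A static scene gives all-zero trits, while the translating texture gives opposite trits at the locations to the right and to the left of the centre.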
MoSIFT
• A pair of frames as input;
• Local extremes of DoG and optical flow determine the MoSIFT points for which features are described;
• MoSIFT concatenates aggregated grids for both appearance and motion into a 256-d descriptor vector.
Histogram of Optical Flow
• Detect interest points using a space-time extension of the Harris operator;
• Normalized histogram descriptors of space-time volumes in the neighborhood
of detected points.
HOG3D
• HoG like 3D descriptor, similar to SIFT.
Motion Boundary Histograms
• MBH
(a,b) Reference images at time t and t+1. (c,d) Computed optical flow, and flow
magnitude showing motion boundaries. (e,f) Gradient magnitude of flow field for
image pair (a,b). (g,h) Average MBH descriptor over all training images for flow field.
Internal Motion Histograms
• IMHdiff
• IMHcd
• IMHmd
• IMHwd
(a) One block of IMHcd coding scheme. The arrows emerging from the central cell
show the central pixel used to compute differences for the corresponding pixel in the
neighbouring cell. Similar differences are computed for each of the 8 neighbouring
cells. Values +1 and −1 represent the difference weights. (b) The wavelet operators
used in the IMHwd motion coding scheme.
Space-time Interest Points
• Space-time corner detector;
STIP Feature for Action Classification
• Bag of space-time features + SVM classifier;
• Group similar STIP descriptors together with k-means clustering.
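A minimal sketch of the bag-of-features step: k-means builds a codebook from local descriptors, and each video is then represented by a normalized histogram of codeword assignments, which would be the input to the SVM classifier. The helper names, the codebook size k=8 and the random stand-in descriptors are assumptions for illustration; the SVM itself is not shown.

```python
import numpy as np

def assign(descriptors, centers):
    """Index of the nearest codeword for every descriptor."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)

def kmeans(descriptors, k, rng, n_iter=20):
    """Plain Lloyd iterations, started from k randomly chosen descriptors."""
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Normalized histogram of codeword occurrences for one video."""
    counts = np.bincount(assign(descriptors, centers), minlength=len(centers))
    return counts / counts.sum()

# Random vectors stand in for STIP descriptors (assumption).
rng = np.random.default_rng(0)
training_descriptors = rng.normal(size=(500, 16))
codebook = kmeans(training_descriptors, k=8, rng=rng)
video_descriptors = rng.normal(size=(60, 16))
histogram = bow_histogram(video_descriptors, codebook)
```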
Trajecton Features
Simple KLT feature tracking is used to track as many features as possible within a video.
Each tracked point produces a fixed length trajectory snippet every frame consisting of the
last L (usually 10) positions in its trajectory.
These snippets are quantized to a library of trajectons.
Dense Trajectory (DT) Features
• Dense sampling at several scales
• Feature tracking based on optical flow for several scales
• Length 15 frames, to avoid drift
• Histogram of gradients (HOG: 2x2x3x8)
• Histogram of optical flow (HOF: 2x2x3x9)
• Motion-boundary histogram (MBHx + MBHy: 2x2x3x8)
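The grid notation above gives the descriptor sizes directly: each descriptor length is the product of its cell counts and its number of histogram bins. A short worked computation, with the dictionary name `GRID` chosen here for illustration:

```python
import numpy as np

# Cell layout (x cells, y cells, t cells, histogram bins) taken from the list above.
GRID = {
    "HOG": (2, 2, 3, 8),
    "HOF": (2, 2, 3, 9),
    "MBHx": (2, 2, 3, 8),
    "MBHy": (2, 2, 3, 8),
}
dims = {name: int(np.prod(shape)) for name, shape in GRID.items()}
mbh_total = dims["MBHx"] + dims["MBHy"]
```

This gives 96 dimensions for HOG, 108 for HOF, and 96 for each MBH component (192 for MBHx and MBHy together).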
Improved Dense Trajectory (IDT) Features
Improve dense trajectories by explicit camera motion estimation
Detect humans to remove outlier matches for homography estimation
Stabilize optical flow to eliminate camera motion
IDT for Action Recognition
Motion stabilized trajectories and features (HOG, HOF, MBH);
Normalization for each descriptor, then PCA to reduce its dimension by a factor of 2;
Use Fisher vector to encode each descriptor separately, set # Gaussians K=256;
Use Power+L2 normalization for FV, SVM for multi-class classification;
HOF improves significantly and MBH somewhat; almost no impact on HOG;
HOF/MBH are complementary, representing zero- and first-order motion information;
IDT significantly improves over DT;
Human detection always helps;
Combine with static CNN features.
Note: CNN motion descriptors do
not improve the performance.
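A minimal sketch of the Power+L2 normalization applied to a Fisher vector, together with the resulting vector size. The exponent 0.5 (signed square root), the function names, and the assumption that the Fisher vector stacks gradients with respect to the Gaussian means and variances (length 2·D·K) are illustrative choices, not details stated on the slide.

```python
import numpy as np

def power_l2_normalize(fv, alpha=0.5):
    """Signed power normalization followed by L2 normalization."""
    fv = np.asarray(fv, dtype=float)
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

def fisher_vector_length(descriptor_dim, n_gaussians=256):
    """Length when gradients w.r.t. means and variances are stacked (assumption)."""
    return 2 * descriptor_dim * n_gaussians

normalized = power_l2_normalize([4.0, -9.0, 0.0])
hog_fv_length = fisher_vector_length(96 // 2)  # 96-d HOG halved by PCA, K=256
```

Under these assumptions a 96-d HOG descriptor reduced to 48 dimensions yields a Fisher vector of 2 × 48 × 256 = 24576 entries.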
Modeling Temporal Structure of Decomposable
Motion Segments for Activity Classification
Latent SVM model with
temporal action parts.
Enables temporal
localization of action parts.
Learning Latent Temporal Structure for
Complex Event Detection
Modeling of longer events such
as Grooming an animal;
Discriminatively trained Markov
model;
Aims to infer and learn latent
temporal structure of actions.
Interaction Models
Rule-based system
Detect/Track moving objects, manually identify key regions in scene (road,
checkpoint)
Scenarios describe relative arrangements of objects in scene
e.g. proximity of car to checkpoint, notions of scene context
Local feature relationship
UT Interaction dataset
Focus on interactions between people
Local feature approach
Define spatio-temporal relationships between points
Novel kernel for comparing sets
Key poses
Identify key points in an interaction, pose of people, and relative positions.
Latent SVM formulation.
Group Models
Stochastic grammars
Probabilistic grammar for describing domain
Person context
Activities are context-dependent
Adaptive structures
Temporal extent
Chains model
Link tracklet
Data association
Storyline model
Build AND-OR graph representation of activities
Event Models
Holistic flow
Global model of crowd
No tracking of individuals
Compute flow on dense grid
Interaction forces between particles estimated in local regions
Quantized into bag of words representation over the video
Group flow
Model group motion using Markov chain parameters
probabilistic spatial transition model (from tracklets of individuals)
Define intra-group and inter-group measurements
Collectiveness: fit of individual members to group parameters
Stability: maintenance of nearest neighbours within a group, …
Event Structure Learning
ST weakly supervised learning
How to identify important/salient
segments to same events?
Frame clusters
Can be done implicitly or explicitly?
What is the granularity in time and
feature space which will work?
Learning Structure Implicitly using
Topic-based Pooling
Recognition by Composition: Latent
Temporal Part-based Learning
A Videography Analysis Framework for
Video Retrieval and Summarization
A set of camera motion + related features = A “videography style descriptor”;
Capture some semantically meaningful things about how the video was taken.