
Face detection

Slides from: Svetlana Lazebnik, Silvio Savarese, Fei-Fei Li, Derek Hoiem; also from Mor Yakobovits, Roni Karlikar, and David Cohen

Face detection and recognition

Detection → Recognition → “Sally”

Applications of Face Detection

• Auto-focus in cameras

• Security systems (recognize faces of certain people)

• Human-computer interface

• Marketing systems

• Much more..

Humans also have a tendency to see face patterns even where none exist.

Faces everywhere

http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html

Difficulties of Face Detection

Building a model for faces is not a simple task; faces are complex and vary from one another.

Faces in images are also affected by the environment.

Face shapes, facial features, skin tone variations…

Difficulties of Face Detection

Face appearance under variation in lighting can change drastically

Difficulties of Face Detection

Difficulties of Face Detection

Scaling and angles

Obstruction of facial features

Difficulties of Face Detection

Difficulties of Face Detection

Facial expressions

Funny Nikon ads: "The Nikon S60 detects up to 12 faces."

Consumer application: Apple iPhoto

• Things iPhoto thinks are faces

Approaches to Face Detection

Skin Detection - approaches

Approaches to Face Detection

Template Matching (flowchart):

Start → skin segmentation → cross-correlation between image blocks and all scaled average faces → if max(xcorr) > threshold, record a face candidate (Face Loc = Pos(max corr), Face Size = size of the average face) and blank out the face using the location and size just found → move to the next block → repeat until all blocks are done → Stop. Outputs: face candidates.
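To make the pipeline concrete, here is a minimal sketch (not the exact code behind the slide) that correlates a single average-face template against an image and keeps locations whose normalized cross-correlation exceeds a threshold; the image, template, and threshold value are placeholder assumptions, and a full implementation would also loop over skin-segmented regions and several scaled average faces.

```python
import numpy as np

def normalized_cross_correlation(image, template):
    """Slide `template` over `image` and return the NCC score at each position."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum()) + 1e-8
    scores = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            w = image[y:y + th, x:x + tw]
            w = w - w.mean()
            w_norm = np.sqrt((w ** 2).sum()) + 1e-8
            scores[y, x] = (w * t).sum() / (w_norm * t_norm)
    return scores

def find_face_candidates(image, avg_face, threshold=0.6):
    """Return (y, x) positions where correlation with the average face exceeds the threshold."""
    scores = normalized_cross_correlation(image, avg_face)
    ys, xs = np.where(scores > threshold)
    return list(zip(ys.tolist(), xs.tolist()))

# Toy usage with random data (a real run would use a skin-segmented image
# and several scaled average faces, as in the flowchart above).
image = np.random.rand(64, 64)
avg_face = np.random.rand(24, 24)
print(find_face_candidates(image, avg_face))
```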

Approaches to Face Detection

Template Matching

Template Based approaches – Deformable Templates

Approaches to Face Detection

Deformable templates: Yuille, Cohen, Hallinan (1989)

Eigenfaces

M.A. Turk and A.P. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

The Viola/Jones Face Detector

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Basic idea:

1. Treat pixels as a vector x
2. Recognize a face by nearest neighbor over the training faces y_1, …, y_n:

   k* = argmin_k || y_k − x ||

Eigenfaces

The space of all face images

• Face images as vectors are extremely high-dimensional

– 100x100 image = 10,000 dimensions

– Slow and lots of storage

• But very few 10,000-dimensional vectors are valid face images

• We want to effectively model the subspace of face images

An image is a point in a high-dimensional space

• An N x M image is a point in R^(NM)

• We can define a vector for every image in this space

Face Space


• Eigenface idea: construct a low-dimensional linear

subspace that best explains the variation in the set

of face images

Face Space

• Detect new face by measuring distance to Face-Space

Principal Component Analysis (PCA)

Given data, find the best linear representation in a lower dimension.

Find a set of k components (vectors) e_1, e_2, e_3, … such that a linear combination gives a good approximation of every data point:

p_i ≈ Σ_{j=1..k} w_ij e_j

Find the e_j and w_ij that minimize:

Σ_i || p_i − Σ_{j=1..k} w_ij e_j ||^2

Step k = 0: given the data set p_i, find e_0 that minimizes

Σ_i || p_i − e_0 ||^2

The minimizer is e_0 = mean(p_i).

Principal Component Analysis (PCA)

Subtract the mean to obtain the zero-mean data set: p̃_i = p_i − e_0.

Principal Component Analysis (PCA)

k = 1: find e_1 that minimizes

Σ_i || p̃_i − w_i1 e_1 ||^2 = Σ_i || p_i − e_0 − w_i1 e_1 ||^2

Approximation: p_i ≈ e_0 + w_i1 e_1, with coefficient w_i1 = <p̃_i, e_1>.

Principal Component Analysis (PCA)

k = 2: find e_2 that minimizes

Σ_i || p̃_i − Σ_{j=1..2} w_ij e_j ||^2 = Σ_i || p_i − e_0 − w_i1 e_1 − w_i2 e_2 ||^2

Approximation: p_i ≈ e_0 + w_i1 e_1 + w_i2 e_2.

Principal Component Analysis (PCA)

Equivalently, find a set of directions (vectors) e_1, e_2, e_3, … with maximum variance of the points:

var(e) = Σ_i || (p_i − µ)^T e ||^2 = Σ_i e^T (p_i − µ)(p_i − µ)^T e = e^T C e

where A = [ p̃_1 p̃_2 … p̃_M ] and C = A A^T is the covariance matrix.

Principal Component Analysis (PCA)

Find a set of directions (vectors) e_1, e_2, e_3, … with maximum variance of the points:

maximize e^T C e subject to ||e||^2 = 1

The solution satisfies C e = λ e: choose e to be the eigenvector with the largest eigenvalue λ.

Create covariance matrix

Diagonalize using Singular Value Decomposition (SVD):

C = U D V^T

where D is a diagonal matrix of eigenvalues and U, V are matrices of eigenvectors:

U = V = [ e_1 e_2 e_3 … ]

Principal Component Analysis (PCA)

Given p_i, i = 1..n, find a set of k directions (vectors) e_1, e_2, e_3, … (in decreasing order of eigenvalues):

C = Σ_i (p_i − µ)(p_i − µ)^T

Choose the first k eigenvectors.
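A small numerical sketch of this recipe, assuming nothing beyond NumPy (the variable names and the toy data are my own): center the points, form the covariance matrix, and keep the eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca(points, k):
    """points: array of shape (num_points, dim). Returns (mean, top-k eigenvectors, eigenvalues)."""
    e0 = points.mean(axis=0)                 # e0 = mean(p_i)
    centered = points - e0                   # zero-mean data set
    C = centered.T @ centered                # C = sum_i (p_i - mu)(p_i - mu)^T
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # decreasing order of eigenvalues
    return e0, eigvecs[:, order[:k]], eigvals[order[:k]]

# Toy usage: 200 points in R^3 that mostly vary along one direction.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 1)) * np.array([3.0, 1.0, 0.2]) + rng.normal(scale=0.1, size=(200, 3))
mean, components, variances = pca(pts, k=2)
print(components.shape, variances)
```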

Eigenfaces

Training images: face images must be aligned (centered and of the same size).

Eigenfaces

Treat pixels as a vector: stack the pixel values of an image (e.g. 247, 249, 251, 249, 249, …) into a column vector p_i. The eigenfaces e_1, e_2, e_3, … are vectors in the same space.

Eigenfaces

Stack the (zero-mean) face vectors as the columns of a matrix: P = [ p̃_1 p̃_2 p̃_3 … ]. The covariance matrix is C = P P^T; diagonalizing it with the SVD, C = U D V^T, gives the eigenfaces e_1, e_2, e_3, …
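As a sketch of this pipeline (the array `faces` of flattened, aligned face images is a hypothetical input): mean-center the face vectors, stack them as the columns of P, and take the SVD; the left singular vectors are the eigenfaces.

```python
import numpy as np

def compute_eigenfaces(faces, k):
    """faces: array of shape (num_images, num_pixels), one flattened aligned face per row.
    Returns (mean_face, eigenfaces) where eigenfaces has shape (k, num_pixels)."""
    mean_face = faces.mean(axis=0)
    P = (faces - mean_face).T                 # columns are zero-mean face vectors
    # SVD of P: the left singular vectors of P are the eigenvectors of C = P P^T
    U, S, Vt = np.linalg.svd(P, full_matrices=False)
    return mean_face, U[:, :k].T              # top-k eigenfaces, one per row

# Toy usage: 50 random "faces" of size 24x24 (real use needs aligned face images).
faces = np.random.rand(50, 24 * 24)
mu, eigenfaces = compute_eigenfaces(faces, k=10)
print(eigenfaces.shape)   # (10, 576)
```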

Eigenfaces example: choose the top eigenvectors e_1, …, e_k and the mean µ.

Eigenfaces

Eigenfaces Projection

Top 3 eigenvectors e_1, e_2, e_3 and mean µ. A face p_i is approximated as

p_i ≈ µ + w_i1 e_1 + w_i2 e_2 + w_i3 e_3,   where w_ij = <p_i, e_j> = p_i^T e_j

For the example face on the slide: p ≈ µ + 0.195 e_1 + 0.046 e_2 + 0.032 e_3.

Eigenfaces Projection

[Plot: similarity of the reconstruction to the original image as a function of k = # of eigenfaces]

How many eigenfaces are needed?

Choosing the Dimension K - Example

Choosing the Dimension K

[Plot: eigenvalue spectrum λ_i for i = 1 … NM]

• How many eigenfaces to use?

• Look at the decay of the eigenvalues

– the eigenvalue tells you the amount of variance “in the direction” of that eigenface

– ignore eigenfaces with low variance

Eigenfaces Projection

We can project a new image into face space.

[Illustration: a new image and its projection into face space]

If new ≈ projected, the new image is a face.

Eigenfaces Projection

[Illustration: face space spanned by e_1 and e_2, with projections of example images]

p ≈ Σ_{j=1..k} w_j e_j

d = || p − Σ_{j=1..k} w_j e_j ||^2

d < Thresh ⇒ p is a face;   d > Thresh ⇒ p is not a face

http://demonstrations.wolfram.com/FaceRecognitionUsingTheEigenfaceAlgorithm/
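A minimal sketch of this face/non-face test (the threshold and the toy data are illustrative assumptions): project the new image onto the eigenfaces and compare the distance d between the image and its projection to a threshold.

```python
import numpy as np

def face_space_distance(x, mean_face, eigenfaces):
    """x: flattened test image; eigenfaces: array (k, num_pixels) of orthonormal eigenfaces.
    Returns the distance between x and its projection onto face space."""
    centered = x - mean_face
    w = eigenfaces @ centered                 # weights w_j = <x - mu, e_j>
    projection = eigenfaces.T @ w             # sum_j w_j e_j
    return np.linalg.norm(centered - projection)

def is_face(x, mean_face, eigenfaces, threshold):
    return face_space_distance(x, mean_face, eigenfaces) < threshold

# Toy usage with random data; in practice the threshold is tuned on a validation set.
rng = np.random.default_rng(1)
mean_face = rng.random(576)
E, _ = np.linalg.qr(rng.normal(size=(576, 10)))   # 10 orthonormal stand-in "eigenfaces"
eigenfaces = E.T
x = rng.random(576)
print(is_face(x, mean_face, eigenfaces, threshold=12.0))
```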

Eigenfaces Results

Eigenfaces Results

Reconstruction using the eigenfaces

Eigenfaces Issues - 1

Problem: the projection of a new image may be near face space but NOT near any of the faces.

Possible solutions:
• Evaluate the distance to the mean face.
• Evaluate the distance to the nearest face in face space.
• Evaluate the size of the weights Σ_j w_j.

Eigenfaces Issues - 2

Problem: the dimension of C = A A^T is N^2 x N^2, where N^2 is the number of pixels in each image. This matrix is often too large to be practical.

Solution: consider the matrix L = A^T A instead. Its dimension is M x M, where M is the number of images in the training set, and typically M << N^2.

• If v_i are the eigenvectors of L = A^T A (and λ_i the eigenvalues): A^T A v_i = λ_i v_i

• Multiply both sides by A: A A^T A v_i = A λ_i v_i, i.e. (A A^T)(A v_i) = λ_i (A v_i), so C e_i = λ_i e_i

• The eigenfaces of C are then e_i = A v_i (need to normalize).
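A sketch of this trick under the same conventions (A holds the zero-mean face vectors as columns; names are my own): diagonalize the small M x M matrix L = A^T A and map its eigenvectors back with e_i = A v_i, normalizing the result.

```python
import numpy as np

def eigenfaces_small_trick(A, k):
    """A: (num_pixels, num_images) matrix of zero-mean face vectors (columns).
    Returns the top-k eigenfaces of C = A A^T computed via the small matrix L = A^T A."""
    L = A.T @ A                               # M x M instead of N^2 x N^2
    eigvals, V = np.linalg.eigh(L)            # columns of V are the v_i
    order = np.argsort(eigvals)[::-1][:k]
    E = A @ V[:, order]                       # e_i = A v_i
    E /= np.linalg.norm(E, axis=0)            # normalize each eigenface
    return E                                  # shape (num_pixels, k)

# Toy usage: 30 images of 24x24 pixels, so L is only 30x30.
faces = np.random.rand(30, 576)
A = (faces - faces.mean(axis=0)).T
print(eigenfaces_small_trick(A, k=5).shape)   # (576, 5)
```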

Eigenfaces Limitations - I

• PCA assumes that the data follows a Gaussian distribution (mean µ, covariance matrix Σ)

• [Illustration] The shape of this dataset is not well described by its principal components

Eigenfaces Limitations - II

Global appearance method: not robust to

misalignment, background variation

Eigenfaces Limitations - III

• Performance decreases quickly with changes to face size
  − Multi-scale eigenspaces.
  − Scale the input image to multiple sizes.

• Performance decreases with changes to face orientation (but not as fast as with scale changes)
  − Plane rotations are easier to handle.
  − Out-of-plane rotations are more difficult to handle.

Eigenfaces Limitations - IV

The direction of maximum variance is not

always good for classification

[Illustration: direction e_1 (maximum variance) mixes the two classes, while direction f_1 separates faces from non-faces]

Fisherfaces (FLD)

A more discriminative subspace: FLD

• Fisher Linear Discriminants → “Fisherfaces”

• PCA preserves maximum variance

• FLD preserves discrimination

– Find projection that maximizes scatter between classes

and minimizes scatter within classes

Reference: Eigenfaces vs. Fisherfaces, Belhumeur et al., PAMI 1997

The Viola/Jones Face Detector

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Challenges of face detection

• Sliding window detector must evaluate tens of thousands of location/scale combinations

• Faces are rare: 0–10 per image

• For computational efficiency, we should try to spend as little time as possible on the non-face windows

• A megapixel image has ~10^6 pixels and a comparable number of candidate face locations

• To avoid having a false positive in every image, our false positive rate has to be less than 10^-6

• First real-time face detector

• Training is slow, but detection is very fast

• Key ideas

• Integral images for fast feature evaluation

• Boosting for feature selection

• Attentional cascade for fast rejection of non-face windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

The Viola/Jones Face Detector

Image Features

• All faces share some common features:

• The eyes region is darker than the upper-cheeks.

• The nose bridge region is brighter than the eyes.

• Features must be simple (a value) and efficient to compute

• How many features are needed to indicate the existence of a face?

Image Features

“Rectangle filters”

(Haar-like features)

Value =

∑ (pixels in white area) – ∑ (pixels in black area)

= correlation with a mask having 1 in pixels of white areas,

and -1 in pixels of black areas

Example

Value = 0.001 Value = 10

Rectangle Features (Haar Features)

• Some features correspond to common facial features. Examples:

Basic Features Vary Scale and Orientation

For a 24x24 detection region, there are 162,336 possible features (at all sizes and positions), all based on the 5 base feature types.

Rectangle Features (Haar Features)

Challenges

1) Feature Computation – as fast as possible

2) Feature Selection – too many features, need to select the most informative ones

3) Real-timeliness – focus mainly on potentially positive image areas (potentially faces)

Fast !!!

• The integral image computes a value at each pixel (x,y) that is the sum of the pixel values above and to the left of (x,y), inclusive

• This can quickly be computed in one pass through the image

Integral Image

i(x, y) is the image and ii(x, y) is its integral image; s(x, y) is the cumulative sum of pixels along row y, up to and including column x.

Computing the Integral Image

Formal definition:

ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')

Recursive definition (one pass over the image, with s(−1, y) = 0 and ii(x, −1) = 0):

s(x, y) = s(x − 1, y) + i(x, y)
ii(x, y) = ii(x, y − 1) + s(x, y)
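The recursive definition translates almost line for line into code; the sketch below uses plain loops for clarity and the 4x4 image from the example that follows.

```python
import numpy as np

def integral_image(i):
    """Compute ii(x, y) = sum of i over all pixels up to and including (x, y),
    using the recursion s(x, y) = s(x-1, y) + i(x, y), ii(x, y) = ii(x, y-1) + s(x, y).
    Arrays are indexed [y, x] with y the row and x the column."""
    h, w = i.shape
    s = np.zeros((h, w))
    ii = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            s[y, x] = (s[y, x - 1] if x > 0 else 0) + i[y, x]    # cumulative row sum
            ii[y, x] = (ii[y - 1, x] if y > 0 else 0) + s[y, x]  # add row sum to the row above
    return ii

img = np.array([[205, 100,  90,  0],
                [200,  30, 105, 80],
                [205, 100,  90,  0],
                [200,  30, 105, 80]], dtype=float)
print(integral_image(img))
```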

Computing sum within a rectangle

We want the sum of the original image values within rectangle D. Let a, b, c, d be the corners of D, and let A, B, C, D also denote the sums of the image over the four regions that meet at those corners (D being the target rectangle). Then:

ii(a) = A
ii(b) = A + B
ii(c) = A + C
ii(d) = A + B + C + D

so the sum within D can be computed as:

sum(D) = ii(d) − ii(b) − ii(c) + ii(a)

Only 3 additions are required for any size of rectangle!

Integral image - example (slides from David Cohen)

Original image:

205 100  90   0
200  30 105  80
205 100  90   0
200  30 105  80

Constructing the integral image row by row, with s(−1, y) = 0 and ii(x, −1) = 0:

Row 0:
s(0,0) = s(−1,0) + i(0,0) = 0 + 205 = 205;   ii(0,0) = ii(0,−1) + s(0,0) = 0 + 205 = 205
s(1,0) = s(0,0) + i(1,0) = 205 + 100 = 305;  ii(1,0) = ii(1,−1) + s(1,0) = 0 + 305 = 305
s(2,0) = s(1,0) + i(2,0) = 305 + 90 = 395;   ii(2,0) = 395
s(3,0) = s(2,0) + i(3,0) = 395 + 0 = 395;    ii(3,0) = 395

Row 1:
s(0,1) = s(−1,1) + i(0,1) = 0 + 200 = 200;   ii(0,1) = ii(0,0) + s(0,1) = 205 + 200 = 405
s(1,1) = s(0,1) + i(1,1) = 200 + 30 = 230;   ii(1,1) = ii(1,0) + s(1,1) = 305 + 230 = 535
s(2,1) = s(1,1) + i(2,1) = 230 + 105 = 335;  ii(2,1) = ii(2,0) + s(2,1) = 395 + 335 = 730
s(3,1) = s(2,1) + i(3,1) = 335 + 80 = 415;   ii(3,1) = ii(3,0) + s(3,1) = 395 + 415 = 810

Continuing for rows 2 and 3 gives the full integral image:

205  305  395  395
405  535  730  810
610  840 1125 1205
810 1070 1460 1620

ii(x,y) in the integral image is the sum of all the pixels above and to the left of (x,y) in the original image.

Integral image - example (slides from David Cohen)

Assume we want to calculate the sum of the pixels in the red area.

Original image:            Integral image:

205 100  90   0            205  305  395  395
200  30 105  80            405  535  730  810
205 100  90   0            610  840 1125 1205
200  30 105  80            810 1070 1460 1620

Using the region labels A, B, C, D around the red rectangle D:

D = ii(d) + ii(a) − ii(b) − ii(c) = 1620 − 810 − 1070 + 535 = 275
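A short sketch that checks this computation, hard-coding the integral image from the example above; taking the red area to be the bottom-right 2x2 block of the image (90 + 0 + 105 + 80) reproduces the value 275.

```python
import numpy as np

# Integral image from the 4x4 example above (ii[y, x], y = row, x = column).
ii = np.array([[205,  305,  395,  395],
               [405,  535,  730,  810],
               [610,  840, 1125, 1205],
               [810, 1070, 1460, 1620]])

def rect_sum(ii, top, left, bottom, right):
    """Sum of original-image pixels in rows top..bottom, columns left..right (inclusive),
    using sum = ii(d) - ii(b) - ii(c) + ii(a)."""
    d = ii[bottom, right]
    b = ii[top - 1, right] if top > 0 else 0
    c = ii[bottom, left - 1] if left > 0 else 0
    a = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return d - b - c + a

# Bottom-right 2x2 block: 1620 - 810 - 1070 + 535 = 275
print(rect_sum(ii, top=2, left=2, bottom=3, right=3))
```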

Feature Evaluation Using Integral Image

For a two-rectangle feature, with A, B, C, D, E, F the integral image values at the six corner points:

Black rectangle = D − B − C + A
White rectangle = F − D − E + C

∑ (pixels in white area) − ∑ (pixels in black area) = −A + B + 2C − 2D − E + F

Result: rapid feature evaluation!

Two-, three- and four-rectangle features can be computed with 6, 8 and 9 array accesses respectively.
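As an illustration, here is a sketch of a simple two-rectangle (left-white, right-black) feature evaluated from an integral image; the feature layout and helper names are my own, not the paper's.

```python
import numpy as np

def rect_sum(ii, top, left, bottom, right):
    """Sum of image pixels in rows top..bottom, columns left..right (inclusive)."""
    d = ii[bottom, right]
    b = ii[top - 1, right] if top > 0 else 0
    c = ii[bottom, left - 1] if left > 0 else 0
    a = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return d - b - c + a

def two_rect_feature(ii, top, left, height, width):
    """Value = sum(white half) - sum(black half) for a feature split vertically in two:
    left half white, right half black (width should be even)."""
    mid = left + width // 2
    white = rect_sum(ii, top, left, top + height - 1, mid - 1)
    black = rect_sum(ii, top, mid, top + height - 1, left + width - 1)
    return white - black

# Toy usage on a random 24x24 window.
img = np.random.rand(24, 24)
ii = np.cumsum(np.cumsum(img, axis=0), axis=1)   # integral image via cumulative sums
print(two_rect_feature(ii, top=4, left=6, height=8, width=12))
```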

Challenges

1) Feature Computation – as fast as possible

2) Feature Selection – too many features, need

to select the most informative ones

3) Real-timeliness – focus mainly on potentially

positive image areas (potentially faces)

Feature Selection

• The problem: too many features

– In a 24x24 sub-window there are ~160,000 possible features

– Impractical to evaluate all of the features in every candidate sub-window

• The solution: select the most informative features

AdaBoost Algorithm

• Introduced by Yoav Freund & Robert E. Schapire in 1995

• It is a machine-learning algorithm

• Stands for Adaptive Boost

• AdaBoost is an algorithm for constructing a “strong” classifier as a linear combination of “simple” “weak” classifiers

• Weak classifier – performs slightly better than random guessing.

The strong classifier is a weighted linear combination of weak classifiers applied to an image window x:

C(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …

The weak classifiers

A weak classifier h_t(x) consists of a feature f_t, a threshold θ_t, and a parity p_t indicating the direction of the inequality sign:

h_t(x) = 1 if p_t f_t(x) > p_t θ_t, 0 otherwise

where x is a 24-by-24 sub-window of an image and f_t(x) is the value of the rectangle feature on that window.

The Strong classifier

Ensemble classification function = linear combination of weak classifiers:

C(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, 0 otherwise

where the α_t are learned weights.

Adaboost procedure

• Given a training set (x_1, y_1), …, (x_N, y_N)
• y_i ∈ {−1, +1} is the correct/incorrect label of each x_i ∈ X
• All examples are initialized to have the same weight w_i = 1/N
• For t = 1, …, T:
  • Construct all weak classifiers h_t: X → {−1, +1}
  • Choose the weak classifier with minimum weighted error ε_t:

    ε_t = Σ_i w_i |h_t(x_i) − y_i|

  • Update the weights – increase the weight of points that were classified incorrectly by the current weak classifier.

Adaboost procedure

• Compute the final classifier as a linear combination of all weak learners (the weight of each learner is directly proportional to its accuracy).

C(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, 0 otherwise

• The α_t are a function of the point weights w_i
• The α_t are proportional to how “reliable” a weak classifier is.
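A compact sketch of the procedure on a single 1-D feature, using threshold-plus-parity stumps as the weak classifiers; the toy data are invented, and the weight and alpha updates follow the standard AdaBoost rule, which the slides describe only informally.

```python
import numpy as np

def stump_predict(x, theta, parity):
    """Weak classifier on one feature: +1 if parity * (x - theta) > 0, else -1."""
    return np.where(parity * (x - theta) > 0, 1, -1)

def best_stump(x, y, w):
    """Choose the (theta, parity) stump with minimum weighted error on labels y in {-1,+1}."""
    best = (0.0, 1, np.inf)
    for theta in np.unique(x):
        for parity in (1, -1):
            err = np.sum(w[stump_predict(x, theta, parity) != y])
            if err < best[2]:
                best = (theta, parity, err)
    return best

def adaboost(x, y, T):
    """Return a list of (alpha, theta, parity) weak classifiers."""
    n = len(x)
    w = np.full(n, 1.0 / n)                       # all examples start with the same weight
    stages = []
    for _ in range(T):
        theta, parity, err = best_stump(x, y, w)  # weak classifier with minimum error
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # weight ~ "reliability" of the weak classifier
        pred = stump_predict(x, theta, parity)
        w *= np.exp(-alpha * y * pred)            # increase weight of misclassified points
        w /= w.sum()
        stages.append((alpha, theta, parity))
    return stages

def strong_classify(stages, x):
    """C(x): sign of the weighted vote of the weak classifiers."""
    score = sum(a * stump_predict(np.asarray(x), t, p) for a, t, p in stages)
    return np.where(score >= 0, 1, -1)

# Toy usage: the positive class has larger feature values.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([-np.ones(100), np.ones(100)]).astype(int)
model = adaboost(x, y, T=10)
print(np.mean(strong_classify(model, x) == y))    # training accuracy
```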

Boosting Example

First classifier

First 2 classifiers

First 3 classifiers

Final Classifier learned by Boosting

Boosting for face detection

First two features selected by boosting:

This feature combination can yield 100% detection rate and 50% false positive rate

Boosting vs. SVM

• Advantages of boosting
  • Integrates classifier training with feature selection

• Complexity of training is linear instead of quadratic in the number of training examples

• Flexibility in the choice of weak learners, boosting scheme

• Testing is fast

• Easy to implement

• Disadvantages
  • Needs many training examples

• Training is slow

• Often doesn’t work as well as SVM (especially for many-class problems)

• A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084

Not good enough!

Receiver operating characteristic (ROC) curve

Boosting for face detection

Challenges

1) Feature Computation – as fast as possible

2) Feature Selection – too many features, need to select the most informative ones

3) Real-timeliness – focus mainly on potentially positive image areas (potentially faces)

Real-timeliness

• On average only 0.01% of all sub-windows in an image are positives (faces)

• Yet we spend an equal amount of time on negative and positive windows

Attentional cascade

• Start with simple classifiers which reject many of the negative sub-windows but detect (almost) all positive sub-windows

• Positive response from the first classifier triggers a second (more complex) classifier, and so on

• A negative outcome at any point leads to the immediate rejection of the sub-window

[Flowchart: IMAGE SUB-WINDOW → Classifier 1 –T→ Classifier 2 –T→ Classifier 3 –T→ … → FACE; an F (negative) at any classifier → NON-FACE]

Cascade classifiers with gradually increased complexity

[Flowchart: IMAGE SUB-WINDOW → Classifier 1 –T→ Classifier 2 –T→ Classifier 3 –T→ … → FACE; an F (negative) at any classifier → NON-FACE]

• Chain classifiers that are progressively more complex and have lower false positive rates:

• Each layer will be a “strong” classifier obtained using AdaBoost

[ROC curve: % detection vs. % false positives; the trade-off between false positives and false negatives is determined by each classifier's threshold]

Attentional cascade

• The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages

• A detection rate of 0.9 and a false positive rate on the order of 10^-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.99^10 ≈ 0.9) and a false positive rate of about 0.30 (0.3^10 ≈ 6×10^-6)
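This arithmetic is easy to verify with a few lines (the per-stage numbers are the ones quoted above):

```python
# Overall cascade rates are the products of the per-stage rates.
stages = 10
stage_detection_rate = 0.99
stage_false_positive_rate = 0.30

overall_detection = stage_detection_rate ** stages              # 0.99**10 ~ 0.904
overall_false_positive = stage_false_positive_rate ** stages    # 0.30**10 ~ 5.9e-06

print(f"detection ~ {overall_detection:.3f}, false positives ~ {overall_false_positive:.1e}")
```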

Attentional cascade

[Flowchart: IMAGE SUB-WINDOW → Classifier 1 –T→ Classifier 2 –T→ Classifier 3 –T→ … → FACE; an F (negative) at any classifier → NON-FACE]

Cascade - Comparison

Viola & Jones Algorithm - Visualization

Training phase: training set (sub-windows) → integral representation → feature computation → AdaBoost feature selection → cascade trainer.

Testing phase: the classifier cascade framework passes each sub-window through Strong Classifier 1 (cascade stage 1), Strong Classifier 2 (cascade stage 2), …, Strong Classifier N (cascade stage N); a sub-window that passes every stage: FACE IDENTIFIED.

Viola & Jones Algorithm - Visualization

Testing phase: a sub-window rejected by any stage of the classifier cascade framework: NOT A FACE !!!

• Finding the optimum is not practical.

• Viola & Jones goal: 95% TP rate, 10^-6 FP rate

They suggested an algorithm that:

• does not guarantee optimality, but

• is able to generate a cascade that meets their goal

Training the cascade

– How many layers? (strong classifiers)

– How many features in each layer?

– Threshold of each strong classifier?

• Set target detection and false positive rates for each stage

• Keep adding features to the current stage until its target rates have been met

• Need to lower AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)

• Test on a validation set

• If the overall false positive rate is not low enough, then add another stage

• Use false positives from current stage as the negative training examples for the next stage

Training the cascade

Viola & Jones System

• Tested on the MIT+CMU test set

• Training time: “weeks” on a 466 MHz Sun workstation

• 38 layers, 6,061 features in total

• 1st layer: 2 features, 50% FP rate, 99.9% TP rate

• 2nd layer: 10 features, 20% FP rate, 99.9% TP rate

• Next 2 layers: 25 features each; next 3 layers: 50 features each

• An average of 10 features evaluated per window on the test set

• A 384x288 image on a PC (dated 2001) took about 0.067 seconds

• 15 times faster than the previous detector (Rowley et al., 1998)

Output of Face Detector on Test Images

Profile Detection

Profile Features

Summary: Viola/Jones detector

• Rectangle features

• Integral images for fast computation

• Boosting for feature selection

• Attentional cascade for fast rejection of negative windows
