A Robust Real Time Face Detection
Outline
AdaBoost – Learning Algorithm
Face Detection in real life
Using AdaBoost for Face Detection
Improvements
Demonstration
AdaBoost
A Short Introduction to Boosting (Freund & Schapire, 1999)
Logistic Regression, AdaBoost and Bregman Distances (Collins, Schapire, Singer, 2002)
Boosting
The Horse-Racing Gambler Problem
– Rules of thumb for a set of races
– How should we choose the set of races in order to get the best rules of thumb?
– How should the rules be combined into a single highly accurate prediction rule?
Boosting!
Boosting
AdaBoost – the idea
Initialize sample weights
For each cycle:
– Find a classifier that performs well on the weighted sample
– Increase the weights of misclassified examples
Return a weighted list of classifiers

AdaBoost agglomerates many weak classifiers into one strong classifier.
[Figure: toy example data plotted on axes shoe size vs. IQ]
AdaBoost – algorithm
Given $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$, $y_i \in \{-1, +1\}$
Initialize the distribution: $D_1(i) = 1/m$
For $t = 1 \ldots T$:
– Select the best weak classifier using distribution $D_t$
– Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$
– Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
– Update
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i)\, \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
where $Z_t$ is a normalization factor
Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
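As a concrete illustration, here is a minimal Python/NumPy sketch of the loop above. It is not from the original papers: the pool of weak classifiers is assumed to be given as plain callables, and all names are mine.

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Train AdaBoost for T rounds.

    X: (m, d) array of examples; y: (m,) labels in {-1, +1};
    weak_learners: candidate classifiers, each a callable mapping X to
    an (m,) array of {-1, +1} predictions.
    Returns the weights alpha_t and chosen hypotheses h_t defining H(x).
    """
    m = len(y)
    D = np.full(m, 1.0 / m)            # D_1(i) = 1/m
    alphas, hypotheses = [], []
    for t in range(T):
        # Select the weak classifier with the lowest weighted error on D_t
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errors))
        h_t, eps_t = weak_learners[best], errors[best]
        if eps_t >= 0.5:               # no better-than-chance hypothesis left
            break
        alpha_t = 0.5 * np.log((1.0 - eps_t) / max(eps_t, 1e-12))
        # Increase the weights of misclassified examples, then normalize (Z_t)
        D *= np.exp(-alpha_t * y * h_t(X))
        D /= D.sum()
        alphas.append(alpha_t)
        hypotheses.append(h_t)
    return alphas, hypotheses

def predict(alphas, hypotheses, X):
    """Final hypothesis: H(x) = sign(sum_t alpha_t h_t(x))."""
    return np.sign(sum(a * h(X) for a, h in zip(alphas, hypotheses)))
```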
AdaBoost – training error
Freund and Schapire (1997) proved that the training error satisfies
$$\mathrm{err}'(H) \leq \exp\left(-2\sum_{t=1}^{T}\gamma_t^2\right), \quad \text{where } \gamma_t = \frac{1}{2} - \epsilon_t$$
AdaBoost adapts to the error rates of the individual weak hypotheses.
– Therefore it is called ADABoost.
AdaBoost – generalization error
Freund and Schapire (1997) showed that:
$$\mathrm{err}(H) \leq \Pr{}'[H(x) \neq y] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$$
where:
– $\Pr{}'$ – the empirical probability on the training sample
– $d$ – VC dimension
– $T$ – number of rounds
– $m$ – training set size
AdaBoost – generalization error
The analysis implies that boosting will overfit if the algorithm is run for too many rounds.
However, it was observed empirically that AdaBoost does not overfit, even when run for thousands of rounds.
Moreover, it was observed that the generalization error continues to decrease long after the training error has reached zero.
AdaBoost – generalization error
An alternative analysis was presented by Schapire et al. (1998) that matches the empirical findings:
$$\Pr[H(x) \neq y] \leq \Pr{}'[\mathrm{margin}(x, y) \leq \theta] + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2}}\right)$$
where:
$$\mathrm{margin}(x, y) = \frac{y\sum_t \alpha_t h_t(x)}{\sum_t |\alpha_t|}$$
AdaBoost – different point of view
We try to solve the problem of approximating the $y_i$'s using a linear combination of weak hypotheses.
In other words, we are interested in the problem of finding a vector of parameters $\alpha$ such that
$$f(x_i) = \sum_{j=1}^{n} \alpha_j h_j(x_i)$$
is a 'good approximation' of $y_i$.
For classification problems we try to match the sign of $f(x_i)$ to $y_i$.
AdaBoost – different point of view
Sometimes it is advantageous to minimize some other (non-negative) loss function instead of the number of classification errors.
For AdaBoost the loss function is
$$\sum_{i=1}^{n} \exp(-y_i f(x_i))$$
This point of view was used by Collins, Schapire and Singer (2002) to demonstrate that AdaBoost converges to optimality.
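As a quick numerical sketch (NumPy; `y` and `f_values` are hypothetical arrays holding the labels and the values $f(x_i)$):

```python
import numpy as np

def exponential_loss(y, f_values):
    """AdaBoost's loss: sum_i exp(-y_i * f(x_i)).
    It upper-bounds the number of classification errors, since
    exp(-y * f) >= 1 whenever sign(f) != y."""
    return np.sum(np.exp(-y * f_values))
```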
Face Detection (not face recognition)
Face Detection in Monkeys
There are cells that 'detect faces'.
Face Detection in Humans
There are 'processes of face detection'.
Faces Are Special
We humans analyze faces in a 'different way'.
Face Recognition in Humans
We analyze faces 'in a specific location'.
Robust Real-Time Face Detection
Viola and Jones, 2003
Features
Picture analysis, Integral Image
Features
The system classifies images based on the value of simple features:
Two-rectangle
Three-rectangle
Four-rectangle
Value = ∑(pixels in white area) − ∑(pixels in black area)
Contrast Features
[Figure: a source image and the resulting feature responses]
Features
Notice that each feature is related to a specific location in the sub-window.
Why features and not pixels?
– Encode domain knowledge
– A feature-based system operates faster
– Inspiration from human vision
Features
Later we will see that there are other features that can be used to implement an efficient face detector.
The original system of Viola and Jones used only rectangle features.
Computing Features
Given a detection resolution of 24x24 and an image size of ~200x200, the set of rectangle features is ~160,000!
We need to find a way to rapidly compute the features.
Integral Image
Intermediate representation of the image
Computed in one pass over the original image
$$ii(x, y) = \sum_{x' \leq x,\; y' \leq y} i(x', y')$$
Computed via the recurrences
$$s(x, y) = s(x, y-1) + i(x, y)$$
$$ii(x, y) = ii(x-1, y) + s(x, y)$$
with $s(x, -1) = 0$ and $ii(-1, y) = 0$, where $s(x, y)$ is the cumulative row sum.
Integral Image
Using the integral image representation, one can compute the value of any rectangular sum in constant time.
For example, the sum inside rectangle D can be computed as:
ii(4) + ii(1) – ii(2) – ii(3)
[Figure: ii(x, y) is the sum of all pixels above and to the left of (x, y); the origin (0,0) is at the top-left corner]
Integral Image
[Figure: evaluating a rectangle feature directly from the integral image using corner weights +1, -1, +2, -1, -2, +1]
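A minimal NumPy sketch of the integral image and the constant-time rectangle sum, under the conventions above (the padded row and column of zeros implement the boundary cases; function names are mine):

```python
import numpy as np

def integral_image(img):
    """ii(x, y): sum of img over all x' <= x, y' <= y.
    A leading row and column of zeros implement the boundary cases
    s(x, -1) = 0 and ii(-1, y) = 0, so the result is built in one pass."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels inside a rectangle using 4 lookups:
    ii(4) + ii(1) - ii(2) - ii(3) in the slide's corner numbering."""
    return (ii[top + height, left + width] + ii[top, left]
            - ii[top, left + width] - ii[top + height, left])

# A two-rectangle feature is then just a difference of two such sums:
# value = rect_sum(white area) - rect_sum(black area)
```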
Building a Detector
Cascading, training a cascade
Main Ideas
The Features will be used as weak classifiers.
We will concatenate several detectors serially into a cascade.
We will boost (using a version of AdaBoost) a number of features to get 'good enough' detectors.
Weak Classifiers
Weak Classifier: a feature which best separates the examples.
Given a sub-window (x), a feature (f), a threshold (Θ), and a polarity (p) indicating the direction of the inequality:
$$h(x, f, p, \theta) = \begin{cases} 1 & \text{if } p \cdot f(x) < p \cdot \theta \\ 0 & \text{otherwise} \end{cases}$$
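In code, the weak classifier is a single threshold test (a sketch; `f` is assumed to be any callable computing the feature value of a sub-window):

```python
def weak_classifier(x, f, p, theta):
    """h(x, f, p, theta): 1 if p * f(x) < p * theta, else 0.
    p in {+1, -1} selects the direction of the inequality."""
    return 1 if p * f(x) < p * theta else 0
```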
Weak Classifiers
A weak classifier is a combination of a feature and a threshold.
We have K features.
We have N thresholds, where N is the number of examples.
Thus there are KN weak classifiers.
Weak Classifier Selection
For each feature, sort the examples based on feature value.
For each element, evaluate the total sum of positive/negative example weights (T+/T−) and the sum of positive/negative weights below the current example (S+/S−).
The error for a threshold which splits the range between the current and previous example in the sorted list is:
$$e = \min\left(S^+ + (T^- - S^-),\; S^- + (T^+ - S^+)\right)$$
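The single-pass search over the sorted examples can be written directly from these definitions. A sketch with hypothetical names; the polarity convention is my own assumption. It returns the threshold, polarity, and error of the best split for one feature:

```python
import numpy as np

def best_threshold(values, labels, weights):
    """values: feature value per example; labels in {+1, -1};
    weights: example weights summing to 1.
    Scans thresholds between consecutive sorted examples and minimizes
    e = min(S+ + (T- - S-), S- + (T+ - S+))."""
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    T_pos = w[y == 1].sum()    # total positive example weight (T+)
    T_neg = w[y == -1].sum()   # total negative example weight (T-)
    S_pos = S_neg = 0.0        # weight below the current example (S+/S-)
    best = (None, 1, np.inf)   # (threshold, polarity, error)
    for i in range(len(v)):
        e_a = S_pos + (T_neg - S_neg)   # misweight if "below" is labeled negative
        e_b = S_neg + (T_pos - S_pos)   # misweight if "below" is labeled positive
        e, p = (e_a, 1) if e_a < e_b else (e_b, -1)
        if e < best[2]:
            thresh = v[i] if i == 0 else (v[i] + v[i - 1]) / 2
            best = (thresh, p, e)
        if y[i] == 1:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best
```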
An example

x    y    f    W     T+    T−    S+    S−    A     B     e
X1   -1   2    1/5   3/5   2/5   0     0     2/5   3/5   2/5
X2   -1   3    1/5   3/5   2/5   0     1/5   1/5   4/5   1/5
X3    1   5    1/5   3/5   2/5   0     2/5   0     5/5   0
X4    1   7    1/5   3/5   2/5   1/5   2/5   1/5   4/5   1/5
X5    1   8    1/5   3/5   2/5   2/5   2/5   2/5   3/5   2/5

Error = min(A, B), with A = S+ + (T− − S−) and B = S− + (T+ − S+).
f is the feature value and W the example weight; T+/T− are the total positive/negative example weights, and S+/S− the sums of positive/negative weights below the current example. The best threshold here is the one at X3, where e = 0.
Main Ideas: Cascading
The Features will be used as weak classifiers.
We will concatenate several detectors serially into a cascade.
We will boost (using a version of AdaBoost) a number of features to get 'good enough' detectors.
Cascading
We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows.
Positive results from the first classifier trigger the evaluation of a second (more complex) classifier, and so on.
A negative outcome at any point leads to the immediate rejection of the sub-window.
Cascading
[Figure: sub-windows flow through a chain of classifiers; each stage either rejects them or passes them to the next]
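The run-time logic of the cascade is just a short-circuiting loop (a sketch; each `stage` is assumed to be a boosted classifier returning True when the sub-window may contain a face):

```python
def cascade_classify(window, stages):
    """Run a sub-window through the cascade.
    Each stage is cheap relative to the next, so most negative
    sub-windows are discarded after only a stage or two."""
    for stage in stages:
        if not stage(window):
            return False   # negative outcome: immediate rejection
    return True            # survived every stage: report a face
```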
Main Ideas: Boosting
The Features will be used as weak classifiers.
We will concatenate several detectors serially into a cascade.
We will boost (using a version of AdaBoost) a number of features to get 'good enough' detectors.
Training a cascade
User selects values for:
– Maximum acceptable false positive rate per layer
– Minimum acceptable detection rate per layer
– Target overall false positive rate
User gives a set of positive and negative examples.
Training a cascade (cont.)
While the overall false positive rate is not met:
– While the false positive rate of the current layer is greater than the maximum per layer:
  • Train a classifier with n features using AdaBoost on a set of positive and negative examples
  • Decrease the threshold until the current classifier's detection rate on the layer is at least the minimum
  • Evaluate the current cascade classifier on a validation set
– Evaluate the current cascade detector on a set of non-face images and put any false detections into the negative training set.
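Schematically, the training procedure looks like the following sketch. Every helper here (`train_adaboost_layer`, `evaluate_fpr`, `collect_false_positives`, the `lower_threshold_until` method) is hypothetical and only mirrors the steps listed above:

```python
def train_cascade(pos, neg, val_set, nonface_images,
                  max_fpr_per_layer, min_det_per_layer, target_fpr):
    """Schematic cascade training loop; helpers are assumed to exist."""
    cascade = []
    overall_fpr = 1.0
    while overall_fpr > target_fpr:
        n_features, layer_fpr, layer = 0, 1.0, None
        while layer_fpr > max_fpr_per_layer:
            n_features += 1
            # Train a boosted classifier with n_features features
            layer = train_adaboost_layer(pos, neg, n_features)
            # Decrease its threshold until the layer detects at least
            # min_det_per_layer of the positives (this raises its FPR)
            layer.lower_threshold_until(min_det_per_layer, val_set)
            # Evaluate the current cascade classifier on the validation set
            layer_fpr = evaluate_fpr(cascade + [layer], val_set)
        cascade.append(layer)
        overall_fpr = evaluate_fpr(cascade, val_set)
        # Refill the negative set with the cascade's false detections
        neg = collect_false_positives(cascade, nonface_images)
    return cascade
```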
Results
Training Data Set
4916 hand-labeled faces, aligned to the base resolution (24x24)
Non-faces for the first layer were collected from 9500 non-face images
Non-faces for subsequent layers were obtained by scanning the partial cascade across non-face images and collecting false positives (max 6000 per layer)
Structure of the Detector
38-layer cascade with 6060 features in total

Layer number        1     2     3 to 4   5 to 38
Number of features  2     10    50       –
Detection rate      100%  100%  –        –
Rejection rate      50%   80%   –        –
Speed of the final Detector
On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds.
Improvements
Learning Object Detection from a Small Number of Examples: the Importance of Good Features (Levy & Weiss, 2004)
Improvements
Performance depends crucially on the features that are used to represent the objects (Levy & Weiss, 2004).
Good Features imply:
– Good results from small training databases
– Better generalization abilities
– Shorter (faster) classifiers
Edge Orientation Histogram
Invariant to global illumination changes
Captures geometric properties of faces
Domain knowledge represented:
– The inner part of the face includes more horizontal edges than vertical
– The ratio between vertical and horizontal edges is bounded
– The area of the eyes includes mainly horizontal edges
– The chin has more or less the same number of oblique edges on both sides
Edge Orientation Histogram
Called EOH. The EOH can be calculated using a version of the Integral Image:
– We find the gradients at the point (x, y) using Sobel masks
– We calculate the orientation of the edge at (x, y)
– We divide the edges into K bins
– The result is stored in K matrices
– We use the same idea of the Integral Image for the matrices
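A NumPy/SciPy sketch of these five steps, assuming K orientation bins and edge magnitudes as bin contents (the function and variable names are mine; Levy & Weiss's exact implementation may differ):

```python
import numpy as np
from scipy.signal import convolve2d

def eoh_integral_images(img, K=8):
    """Build K integral images, one per edge-orientation bin."""
    # 1. Gradients at each pixel via Sobel masks
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    gx = convolve2d(img, sobel_x, mode='same')
    gy = convolve2d(img, sobel_x.T, mode='same')
    # 2. Edge orientation at each pixel, folded into [0, pi)
    theta = np.arctan2(gy, gx) % np.pi
    magnitude = np.hypot(gx, gy)
    # 3. Divide the edges into K orientation bins
    bins = np.minimum((theta / np.pi * K).astype(int), K - 1)
    # 4. One magnitude matrix per bin; 5. the usual integral image trick
    iis = []
    for k in range(K):
        layer = np.where(bins == k, magnitude, 0.0)
        iis.append(np.cumsum(np.cumsum(layer, axis=0), axis=1))
    return iis

# A ratio feature is then the energy of bin k1 inside a rectangle divided
# by the energy of bin k2 in the same rectangle, each obtained from four
# integral-image lookups.
```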
EOH Features
– The ratio between two orientations
– The dominance of a given orientation
– Symmetry features
Results
Already with only 250 positive examples we see a detection rate above 90%
Faster classifier
Better performance on profile faces
Demo
Implementing the Viola & Jones system
Frank Fritze, 2004