synergistic face detection and pose...

Synergistic Face Detection Synergistic Face Detection

and Pose Estimationand Pose Estimation

M. Osadchy M.Miller Y. LeCunM. Osadchy M.Miller Y. LeCun

Technion NEC Labs NYUTechnion NEC Labs NYU

Our SystemOur System

Detects faces independently of their poses. Detects faces independently of their poses.

Estimates head poses.Estimates head poses.

No

tracking!

Our SystemOur System

Robust to: yaw (from left to right profile), Robust to: yaw (from left to right profile), roll (-45, 45), and pitch (-60, 60).roll (-45, 45), and pitch (-60, 60).

Single Detector is applied to all poses.Single Detector is applied to all poses.

Pose estimation: Within 15° error about 90% of Pose estimation: Within 15° error about 90% of poses are estimated correctly.poses are estimated correctly.

Near real-time: 5 frames per second on standard Near real-time: 5 frames per second on standard hardware.hardware.

SynergySynergy

Multi-View

Face

Detection

Pose

estimation

closely related Inner class variation (skin Inner class variation (skin

color, hair style, etc.)color, hair style, etc.)

Lighting VariationsLighting Variations

Scale VariationsScale Variations

Facial ExpressionsFacial Expressions

……

Common Problems

Train together Better generalization

Integrating Face Detection and Pose Integrating Face Detection and Pose

Estimation: Previous MethodsEstimation: Previous Methods

image

Rough pose

estimation

Pose specific face

detector

Unmanageable in real problems

Integrating Face Detection and Pose Integrating Face Detection and Pose

Estimation: Our ApproachEstimation: Our ApproachLow dimensional space

Mapping: G

)(ZF

Face Manifold

parameterized by

pose

Apply

Image X

)(XG)()( ZFXG −

Parameterization of the Face Parameterization of the Face

Manifold – Single ParameterManifold – Single Parameter

]2 ,2[ pipiZ −==θYaw:

03pi− 3pi pose2pi− 2pi2pi−

( ) )cos( iiF αθθ −= ]3 ,0 ,3[ pipi−=α3,2,1=i

Face manifold

1F

2F3F

∑

∑

=

==3

1

3

1

sin

cosarctan

iii

iii

G

G

α

αθθ

Parameterization of the Face Parameterization of the Face

Manifold – Two ParametersManifold – Two Parameters

]2 ,2[ pipi−=θ

);cos()cos(),( jiijF βϕαθϕθ −−= ]3 ,0 ,3[, pipi−=βα3,2,1, =ji

a portion of the surface of a sphere

Yaw and roll :

] 4 ,4[ pipi−=ϕ ≅

( ) )cos()cos( jiij

ij XGcc βα∑=

( ) )cos()cos( jiij

ij XGsc βα∑=

( ) )sin()cos( jiij

ij XGcs βα∑=

( ) )sin()sin( jiij

ij XGss βα∑=

( ) ( )sscccsscssccsccs +−+−+= ,atan2,atan2(5.0θ

( ) ( )sscccsscssccsccs +−−−+= ,atan2,atan2(5.0ϕ

where

),( ϕθ=Z

Minimum Energy Machine Minimum Energy Machine

( )XZ,Y,EWEnergy function:

]45,45[]90,90[}{ −×−=Z

imagepose label

parameters

=0

1Y face

non face

measures compatibility between X,Z,Y. ( )XZ,Y,EW

If X is a face with pose Z then we want: ( ) ( ) ;'Z ,X,Z'0,EXZ,1,E WW ∀< <

( ) ( ) ZZ ≠∀< <' ,X,Z'1,EXZ,1,E WW

Operating the MachineOperating the Machine

Clamp X to the observed value (the image)

Find Z and Y such that:

Complete energy:

( ){ }

( )XZ,Y,EminargZ,Y WZZ{Y},Y ∈∈

=

( ) ( ) TY1)()(YXZ,Y,EW ⋅−+−⋅= ZFXGW

Y=1X is a face

Operating the MachineOperating the Machine

Clamp X to the observed value (the image)

Find Z and Y such that:

Complete energy:

( ){ }

( )XZ,Y,EminargZ,Y WZZ{Y},Y ∈∈

=

( ) ( ) TY1)()(YXZ,Y,EW ⋅−+−⋅= ZFXGW

Y=0X is not a face

ArchitectureArchitecture

X (image)

analytical

mapping onto

face manifold

convolutional

network

Z (pose)

)()(GW ZFX −

switch

( energy)

Y (label)

W

(param)

)(GW X )(ZF

T

),,(EW XZY

Operating the machine:

F(Z)(X)GZ W{Z}Z

−=∈minarg

=0

1Y

TF(Z)(X)GW <−otherwise

Convolutional NetworkConvolutional Network

““end-to-end” trainable systems from low-level features to end-to-end” trainable systems from low-level features to

high-level representations.high-level representations.

Easily learn the type of shift-invariant features, relevant Easily learn the type of shift-invariant features, relevant

to object recognition.to object recognition.

Can be replicated over large images much more Can be replicated over large images much more

efficiently than traditional classifiers. efficiently than traditional classifiers.

Considerable advantage for

real-time systems!

Convolutions Subsampling

ConvolutionsSubsampling

Convolutions

Full

connection

C1: feature maps

8@28x28

S1: f. maps

8@14x14

C3: f. maps

20@10x10S4: f. maps

20@5x5 C5: 120Output: 9

Input

32x32

Similar to LeNet5, with more maps:

Training with Discriminative Loss Training with Discriminative Loss

FunctionFunction

( ) ( )∑∑∈∈

+=01

,1

,,1

)( 0

0

1

1 Si

i

Si

ii XWLS

XZWLS

WL

training faces training non-faces

loss for face sample

with known pose loss for non-face

sample

We showed that this loss function causes the machine to exhibit proper behavior:

( ) ( ) margin,...YE,...YE undesireddesired +<

Minimize:

21 ),,1(),,1,( XZEXZWL W= [ ]),,1(exp),0,(0 XZEKXWL −=

(X)GWF(Z) (X)GW

)ZF(

Running the MachineRunning the Machine

Works on grey-level images.Works on grey-level images.

Applied at range of scales stepping by a factor Applied at range of scales stepping by a factor

of .of .

The network is replicated over the image at each The network is replicated over the image at each

scale, stepping by 4 pixels in x and y.scale, stepping by 4 pixels in x and y.

Overlapping detections are replaced by the Overlapping detections are replaced by the

strongest.strongest.

2

ResultsResults

Our system is robust to yaw , in-plane rotation , Our system is robust to yaw , in-plane rotation ,

and pitchand pitch

90±54± 60±

TrainingTraining

52,850, 32x32 grey-level images of faces (NEC Labs hand 52,850, 32x32 grey-level images of faces (NEC Labs hand annotated set) with uniform distribution of poses.annotated set) with uniform distribution of poses.

Initial negative set: 52,850 random non-face natural images.Initial negative set: 52,850 random non-face natural images. Second phase: half of the initial negative set was replaced by Second phase: half of the initial negative set was replaced by

false positives of the initial version of the detector.false positives of the initial version of the detector. Each training image was used 5 times with random variation Each training image was used 5 times with random variation

in scale, in-plane rotation, brightness and contrast.in scale, in-plane rotation, brightness and contrast. 9 passes on the data: 26 hours on 2Ghz Pentium 4.9 passes on the data: 26 hours on 2Ghz Pentium 4. The system converged to an EER of 5% on training set and The system converged to an EER of 5% on training set and

6% on test set of 90,000 images.6% on test set of 90,000 images.

Test on Standard Data SetsTest on Standard Data Sets No standard set tests all poses, that our system is No standard set tests all poses, that our system is

designed to detect.designed to detect. 3 standard sets focusing on particular pose variation: 3 standard sets focusing on particular pose variation:

tilted, profile, and frontal.tilted, profile, and frontal.

x93%86%Schneiderman & Kanade

x96%89%Rowley et al

x83%70%xJones & Viola (profile)

xx95%90%Jones & Viola (tilted)

88%83%83%67%97%90%Our Detector

1.280.53.360.4726.94.42

MIT+CMUPROFILETILTEDData Set->False positives per image->

Rea

l tim

e

DetectionPose Estimation of

the detected faces

Note: typical pose estimation systems input centered faces; when we hand

localize this faces we get: 89% of yaw and 100% of in-plane rotations

within 15 degrees.

Standard Sets

Synergy TestSynergy Test

Detection Pose Estimation

synergistic face detection and pose...

Documents