Transcript
Page 1: Deep Face Recognition

Deep Face Recognition
Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman

Visual Geometry Group, Department of Engineering Science, University of Oxford


Objective

Results  


• Build a large-scale face dataset with minimal human intervention

• Train a convolutional neural network for face recognition which can compete with those of Internet giants like Google and Facebook


Incorporating Attributes

Contributions

Representation Learning

Comparison with the state of the art

• New face dataset of 2.6 million faces with minimal manual labeling

• State-of-the-art results on the YouTube Faces in the Wild dataset (EER 3%) [Better than Google's FaceNet and Facebook's DeepFace]

• Comparable to the state-of-the-art results on the Labeled Faces in the Wild dataset (EER 0.87%) [Comparable to Google's FaceNet and better than Facebook's DeepFace]

Dataset construction

1. Candidate list generation: 5000 identities
• Tap the knowledge on the web
• Collect representative images for each celebrity (200 images per identity)

2. Candidate list filtering
• Remove people with low representation on the Google image search
• Remove overlap with public benchmarks
• 2622 celebrities in the final dataset

3. Rank image sets
• Download 2000 images per identity
• Learn a classifier using the data obtained in step 1
• Rank the 2000 images and select the top 1000 (see the ranking sketch after this list)

4. Near duplicate removal
• VLAD descriptor based near duplicate removal

5. Manual filtering
• Curating the dataset further using manual checks
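Step 3 above (rank image sets) can be sketched as a per-identity linear classifier trained on the step-1 images and used to score the freshly downloaded candidates. The snippet below is a minimal sketch, assuming precomputed per-image feature vectors; the feature choice, classifier, and hyper-parameters are illustrative, not the authors' exact pipeline.

```python
# Minimal sketch of step 3 (rank image sets), assuming each image is already
# represented by a fixed-length feature vector. Feature choice, classifier and
# hyper-parameters are illustrative, not the authors' exact setup.
import numpy as np
from sklearn.svm import LinearSVC

def rank_images_for_identity(pos_feats, neg_feats, candidate_feats, keep=1000):
    """Train an identity-vs-rest classifier on the step-1 images (pos_feats)
    against images of other identities (neg_feats), then keep the top-scoring
    candidates among the ~2000 downloads."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    clf = LinearSVC(C=1.0).fit(X, y)
    scores = clf.decision_function(candidate_feats)  # higher = more likely this identity
    order = np.argsort(-scores)                      # rank the downloaded images
    return order[:keep]                              # select the top 1000
```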

       

   

       

No. | Aim                        | Mode   | # Persons | # Images / person | Total # images | Anno. effort
1   | Candidate list generation  | Auto   | 5000      | 200               | 1,000,000      | -
2   | Candidate list filtering   | Manual | 2622      | -                 | -              | 4 days
3   | Rank image sets            | Auto   | 2622      | 1000              | 2,622,000      | -
4   | Near duplicate removal     | Auto   | 2622      | 623               | 1,635,159      | -
5   | Manual filtering           | Manual | 2622      | 375               | 982,803        | 10 days
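Step 4 (near duplicate removal, which reduces roughly 1000 to 623 images per identity in the table above) can be illustrated as a greedy pass over descriptor distances. This is a minimal sketch assuming a precomputed VLAD descriptor per image; the threshold and the greedy keep/drop rule are assumptions, not the authors' exact procedure.

```python
# Minimal sketch of step 4 (near duplicate removal), assuming a precomputed,
# l2-normalised VLAD descriptor per image (descriptor extraction not shown).
# Threshold and greedy rule are illustrative assumptions.
import numpy as np

def remove_near_duplicates(vlad_descs, threshold=0.1):
    """vlad_descs: (N, D) VLAD descriptors for the images of one identity.
    Returns indices of the images kept after greedy near-duplicate removal."""
    kept = []
    for i, d in enumerate(vlad_descs):
        # drop image i if it is too close to an already-kept image
        if all(np.linalg.norm(d - vlad_descs[j]) > threshold for j in kept):
            kept.append(i)
    return kept
```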

No. | Dataset                    | # Persons | Total # images
1   | Labeled Faces in the Wild  | 5,749     | 13,233
2   | WDRef                      | 2,995     | 99,773
3   | CelebFaces                 | 10,177    | 202,599
4   | Ours                       | 2,622     | 2.6 M
5   | Facebook                   | 4,030     | 4.4 M
6   | Google                     | 8 M       | 200 M

Dataset Comparison

Example images from our dataset

Dataset Statistics

CNN architecture (poster diagram): image → Conv-64 ×2 → maxpool → Conv-128 ×2 → maxpool → Conv-256 ×3 → maxpool → Conv-512 ×3 → maxpool → Conv-512 ×3 → maxpool → fc-4096 → fc-4096 → fc-2622 → Softmax

Objective               | 100%-EER
Softmax classification  | 97.30
Embedding learning      | 99.10
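The 100%-EER numbers above come from verification: threshold the distance between two face descriptors and find the operating point where false accept and false reject rates are equal. Below is a minimal sketch of that computation, assuming descriptor pairs and same/different labels are available; the function names and the Euclidean distance choice are illustrative, not the authors' evaluation code.

```python
# Minimal sketch of verification and equal error rate (EER), assuming
# l2-normalised descriptors and ground-truth same/different pair labels.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(desc_a, desc_b, same_identity):
    """desc_a, desc_b: (N, D) descriptor arrays for N face pairs.
    same_identity: (N,) boolean array, True for same-person pairs."""
    dists = np.linalg.norm(desc_a - desc_b, axis=1)  # Euclidean distance per pair
    fpr, tpr, _ = roc_curve(same_identity, -dists)   # negate: smaller distance = more similar
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))               # threshold where FAR == FRR
    return 0.5 * (fpr[idx] + fnr[idx])

# The poster reports 100% - EER, i.e. 100 * (1 - equal_error_rate(...)).
```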

No. | Method                        | # Training images | # Networks | Accuracy
1   | Fisher Vector Faces           | -                 | -          | 93.10
2   | DeepFace (Facebook)           | 4 M               | 3          | 97.35
3   | DeepFace Fusion (Facebook)    | 500 M             | 5          | 98.37
4   | DeepID-2,3                    | Full              | 200        | 99.47
5   | FaceNet (Google)              | 200 M             | 1          | 98.87
6   | FaceNet + Alignment (Google)  | 200 M             | 1          | 99.63
7   | Ours (VGG Face)               | 2.6 M             | 1          | 98.78

No. | Method                     | # Training images | # Networks | 100%-EER | Accuracy
1   | Video Fisher Vector Faces  | -                 | -          | 87.7     | 83.8
2   | DeepFace (Facebook)        | 4 M               | 1          | 91.4     | 91.4
4   | DeepID-2,2+,3              | -                 | 200        | -        | 93.2
5   | FaceNet (Google)           | 200 M             | 1          | -        | 95.1
7   | Ours (VGG Face)            | 2.6 M             | 1          | 97.60    | 97.4

[Figure: ROC curve on the LFW dataset (true positive rate vs. false positive rate), comparing Our Method, DeepID3, DeepFace and Fisher Vector Faces.]

Dataset

YouTube Faces in the Wild: Unrestricted Protocol

Labeled Faces in the Wild: Unrestricted Protocol

• “Very Deep” architecture
  • Different from previous architectures proposed for face recognition:
    • locally connected layers (Facebook)
    • inception (Google)

• Learning a multi-way classifier
  • Softmax objective
  • 2622-way classification
  • 4096-d descriptor

• Learning Task-Specific Embedding
  • The embedding is learnt by minimizing the triplet loss

  • Learning a projection layer from 4096 to 1024 dimensions
  • Online triplet formation at the beginning of each iteration (see the sampling sketch after this list)
  • Fine-tuned on target datasets

• Experiments: Labeled Faces in the Wild, Unrestricted Protocol
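The "online triplet formation" bullet can be illustrated as follows. This is a generic sketch that forms triplets within a mini-batch of already-projected embeddings and keeps only those that currently violate the margin; the selection rule is an assumption for illustration, not the exact procedure described in Section 4.4 of the paper.

```python
# Generic sketch of online triplet formation within a mini-batch: for each
# anchor, pick a random positive and a random negative, and keep the triplet
# only if it violates the margin (i.e. has non-zero loss). The selection rule
# is an illustrative assumption, not the paper's exact procedure.
import numpy as np

def form_triplets(embeddings, labels, margin=0.2, rng=None):
    """embeddings: (N, L) projected, l2-normalised descriptors.
    labels: (N,) integer array of identity ids."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(labels)
    triplets = []
    for a in range(n):
        same = labels == labels[a]
        positives = np.flatnonzero(same & (np.arange(n) != a))
        negatives = np.flatnonzero(~same)
        if positives.size == 0 or negatives.size == 0:
            continue
        p = rng.choice(positives)
        neg = rng.choice(negatives)
        d_ap = np.sum((embeddings[a] - embeddings[p]) ** 2)
        d_an = np.sum((embeddings[a] - embeddings[neg]) ** 2)
        if margin - d_an + d_ap > 0:   # keep only triplets with non-zero loss
            triplets.append((a, p, neg))
    return triplets
```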

 


layer | type    | name    | support | filt dim | num filts | stride | pad
0     | input   | –       | –       | –        | –         | –      | –
1     | conv    | conv1_1 | 3       | 3        | 64        | 1      | 1
2     | relu    | relu1_1 | 1       | –        | –         | 1      | 0
3     | conv    | conv1_2 | 3       | 64       | 64        | 1      | 1
4     | relu    | relu1_2 | 1       | –        | –         | 1      | 0
5     | mpool   | pool1   | 2       | –        | –         | 2      | 0
6     | conv    | conv2_1 | 3       | 64       | 128       | 1      | 1
7     | relu    | relu2_1 | 1       | –        | –         | 1      | 0
8     | conv    | conv2_2 | 3       | 128      | 128       | 1      | 1
9     | relu    | relu2_2 | 1       | –        | –         | 1      | 0
10    | mpool   | pool2   | 2       | –        | –         | 2      | 0
11    | conv    | conv3_1 | 3       | 128      | 256       | 1      | 1
12    | relu    | relu3_1 | 1       | –        | –         | 1      | 0
13    | conv    | conv3_2 | 3       | 256      | 256       | 1      | 1
14    | relu    | relu3_2 | 1       | –        | –         | 1      | 0
15    | conv    | conv3_3 | 3       | 256      | 256       | 1      | 1
16    | relu    | relu3_3 | 1       | –        | –         | 1      | 0
17    | mpool   | pool3   | 2       | –        | –         | 2      | 0
18    | conv    | conv4_1 | 3       | 256      | 512       | 1      | 1
19    | relu    | relu4_1 | 1       | –        | –         | 1      | 0
20    | conv    | conv4_2 | 3       | 512      | 512       | 1      | 1
21    | relu    | relu4_2 | 1       | –        | –         | 1      | 0
22    | conv    | conv4_3 | 3       | 512      | 512       | 1      | 1
23    | relu    | relu4_3 | 1       | –        | –         | 1      | 0
24    | mpool   | pool4   | 2       | –        | –         | 2      | 0
25    | conv    | conv5_1 | 3       | 512      | 512       | 1      | 1
26    | relu    | relu5_1 | 1       | –        | –         | 1      | 0
27    | conv    | conv5_2 | 3       | 512      | 512       | 1      | 1
28    | relu    | relu5_2 | 1       | –        | –         | 1      | 0
29    | conv    | conv5_3 | 3       | 512      | 512       | 1      | 1
30    | relu    | relu5_3 | 1       | –        | –         | 1      | 0
31    | mpool   | pool5   | 2       | –        | –         | 2      | 0
32    | conv    | fc6     | 7       | 512      | 4096      | 1      | 0
33    | relu    | relu6   | 1       | –        | –         | 1      | 0
34    | conv    | fc7     | 1       | 4096     | 4096      | 1      | 0
35    | relu    | relu7   | 1       | –        | –         | 1      | 0
36    | conv    | fc8     | 1       | 4096     | 2622      | 1      | 0
37    | softmax | prob    | 1       | –        | –         | 1      | 0

Table 3: Network configuration. Details of the face CNN configuration A. The FC layers are listed as “convolution” as they are a special case of convolution (see Section 4.3). For each convolution layer, the filter size, number of filters, stride and padding are indicated.

using a “triplet loss” training scheme, illustrated in the next section. While the latter is essential to obtain a good overall performance, bootstrapping the network as a classifier, as explained in this section, was found to make training significantly easier and faster.
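As a rough illustration of Table 3 and of the classifier bootstrapping described here, the sketch below builds a configuration-A-style network (3×3 convolutions in blocks of 2, 2, 3, 3, 3, five max-pooling stages, then 4096/4096/2622 fully connected layers) and trains it with a cross-entropy (softmax) objective. It is a minimal PyTorch sketch, not the authors' released implementation; the optimiser settings are assumptions, and the 224×224 input size is inferred from the 7×7 support of fc6 in the table.

```python
# Rough PyTorch sketch of the configuration-A face CNN of Table 3 and of the
# softmax bootstrapping stage. Layer counts and filter sizes follow the table;
# optimiser settings and data handling are illustrative assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return layers

vgg_face = nn.Sequential(
    *conv_block(3, 64, 2), *conv_block(64, 128, 2), *conv_block(128, 256, 3),
    *conv_block(256, 512, 3), *conv_block(512, 512, 3),
    nn.Flatten(),                      # 512 x 7 x 7 for a 224x224 input (fc6 is a 7x7 "conv" in the table)
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 2622),             # fc8: one logit per dataset identity
)

# Bootstrapping as a 2622-way classifier: plain cross-entropy on identity labels.
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(vgg_face.parameters(), lr=0.01, momentum=0.9)

def train_step(images, identity_labels):
    """images: (B, 3, 224, 224) float tensor; identity_labels: (B,) long tensor."""
    optimiser.zero_grad()
    loss = criterion(vgg_face(images), identity_labels)
    loss.backward()
    optimiser.step()
    return loss.item()
```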

4.2 Learning a face embedding using a triplet loss

Triplet-loss training aims at learning score vectors that perform well in the final application, i.e. identity verification by comparing face descriptors in Euclidean space. This is similar in spirit to “metric learning”, and, like many metric learning approaches, is used to learn a projection that is at the same time distinctive and compact, achieving dimensionality reduction at the same time.

Our triplet-loss training scheme is similar in spirit to that of [17]. The output $f(\ell_t) \in \mathbb{R}^D$ of the CNN, pre-trained as explained in Section 4.1, is $l_2$-normalised and projected to an $L \ll D$ dimensional space using an affine projection $x_t = W' f(\ell_t)/\|f(\ell_t)\|_2$, $W' \in \mathbb{R}^{L \times D}$. While this formula is similar to the linear predictor learned above, there are two key differences. The first one is that $L \neq D$ is not equal to the number of class identities, but it is the (arbitrary) size of the descriptor embedding (we set $L = 1{,}024$). The second one is that the projection $W'$ is trained to minimise the empirical triplet loss

$$E(W') = \sum_{(a,p,n) \in T} \max\bigl\{0,\; \alpha - \|x_a - x_n\|_2^2 + \|x_a - x_p\|_2^2\bigr\}, \qquad x_i = W'\,\frac{f(\ell_i)}{\|f(\ell_i)\|_2}. \tag{1}$$

Note that, differently from the previous section, there is no bias being learned here, as the differences in (1) would cancel it. Here $\alpha \ge 0$ is a fixed scalar representing a learning margin and $T$ is a collection of training triplets. A triplet $(a, p, n)$ contains an anchor face image $a$ as well as positive $p \ne a$ and negative $n$ examples of the anchor's identity. The projection $W'$ is learned on target datasets such as LFW and YTF, honouring their guidelines. The construction of the triplet training set $T$ is discussed in Section 4.4.
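A direct reading of Eq. (1): $l_2$-normalise the CNN output, apply the learned projection $W'$ (no bias), and accumulate hinge losses over the triplet set. The PyTorch sketch below is a minimal rendering of this formula under the stated settings ($D = 4096$, $L = 1024$); the margin value and variable names are illustrative assumptions.

```python
# Minimal PyTorch sketch of the triplet loss of Eq. (1): descriptors are
# l2-normalised CNN outputs f(l_t) projected by W' (no bias), and the loss is
# a hinge on squared Euclidean distances with margin alpha.
import torch
import torch.nn.functional as F

D, L, alpha = 4096, 1024, 0.2            # descriptor dim, embedding dim, margin (alpha >= 0 is illustrative)
W = torch.nn.Linear(D, L, bias=False)    # the affine projection W' of Eq. (1), learned on the target dataset

def embed(cnn_features):
    """cnn_features: (N, D) outputs f(l_t) of the pre-trained CNN."""
    return W(F.normalize(cnn_features, p=2, dim=1))

def triplet_loss(anchor_feats, positive_feats, negative_feats):
    xa, xp, xn = embed(anchor_feats), embed(positive_feats), embed(negative_feats)
    d_ap = (xa - xp).pow(2).sum(dim=1)   # ||x_a - x_p||^2
    d_an = (xa - xn).pow(2).sum(dim=1)   # ||x_a - x_n||^2
    return torch.clamp(alpha - d_an + d_ap, min=0).sum()   # sum over the triplet set T
```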

4.3 Architecture

We consider three architectures based on the A, B, and D architectures of [19]. The CNN architecture A is given in full detail in Table 3. It comprises 11 blocks, each containing a linear operator followed by one or more non-linearities such as ReLU and max pooling. The first eight such blocks are said to be convolutional as the linear operator is a bank of linear filters (linear convolution). The last three blocks are instead called Fully Connected (FC);

Download  

Effect of dataset curation

Dataset | # Images  | 100%-EER
Curated | 982,803   | 92.83
Full    | 2,622,000 | 95.80

Effect of embedding learning

More experiments in the paper

Example correct classification: positive pairs

Example correct classification: negative pairs

Example incorrect classification

Acknowledgements  

False positive

CNN model and face detector available online!! Visit: http://www.robots.ox.ac.uk/~vgg/software/vgg_face/

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA)

False negative
