
Multimodal Facial Feature Extraction for Automatic 3D Face Recognition

Xiaoguang Lu and Anil K. Jain
Dept. of Computer Science & Engineering

Michigan State University
East Lansing, MI 48824

{Lvxiaogu, jain}@cse.msu.edu

Abstract. Facial feature extraction is important in many face-related applications, such as face alignment for recognition. We propose a multimodal scheme that integrates the 3D (range) and 2D (intensity) information provided by a facial scan to extract feature points. Given a face scan, the foreground is segmented from the background using the range map, and the face area is detected using a real-time intensity-based algorithm. A robust nose tip locator is presented. A statistical 3D feature location model is applied after aligning the model with the nose tip. The shape index response derived from the range map and the cornerness response from the intensity map are combined to determine the positions of the corners of the eyes and the mouth. Real-world data is subject to sensor noise, which results in spurious feature points. We introduce a local quality metric to automatically reject scans whose sensor noise is above a certain threshold. As a result, a fully automatic multimodal face recognition system is developed. Both qualitative and quantitative evaluations of the proposed feature extraction algorithm are conducted on a publicly available database containing 946 facial scans of 267 subjects. The automatic feature extraction algorithm has been integrated into an automatic face recognition system. The identification performance on a database of 198 probe scans and 200 gallery subjects is close to that obtained with manually labeled landmarks.

1 Introduction

Automatic human face recognition has received substantial attention from researchers in the biometrics, pattern recognition, and computer vision communities over the past decade [1, 2]. Current 2D face recognition systems still encounter difficulties in handling facial variations due to head pose, lighting conditions, and facial expressions [3], which introduce a large amount of intra-class variation.

Range images captured by a 3D sensor explicitly contain facial surface shape information, which complements the information contained in a 2D image. The 3D shape information does not change much with pose and lighting variations, which can change the corresponding intensity image significantly. Range-image-based 3D face recognition has been demonstrated to be effective in enhancing face recognition accuracy [4–6]. As 3D imaging technology progresses quickly [7], non-intrusive capture of 3D data along with texture information will become readily available. Current 3D cameras provide two registered modalities, range and intensity. Figure 1 gives an example of a facial scan.


In both 2D and 3D face recognition systems, alignment (registration) between the query and the template is necessary to make the probe and the template in the gallery comparable [8, 3]. In general, face recognition systems include face detection, alignment, and recognition. Registration based on feature point correspondence is one of the most popular alignment methods [2]. To make a face recognition system fully automatic, facial feature extraction is therefore one of the crucial steps.

Facial features can be of different types: regions [9, 10], key points (landmarks) [11, 12], and contours [13, 14]. Generally, key point features provide a more accurate and consistent representation for alignment purposes than region-based features, with lower complexity and computational burden than contour feature extraction. Therefore, we focus on key point feature extraction. We select a subset of the facial landmarks (or fiducial points), as defined in anthropometry [15], as our feature points. These feature points are the nose tip, the two inner eye corners, the two outside eye corners, and the two mouth corners, shown in Fig. 1(d). The selected feature points define a basic facial configuration. In addition to face alignment, they can be used for tracking, screening (face retrieval), animation, etc. These feature points can also be used to initialize active appearance models [13, 14] for higher-level feature extraction, such as extracting the contours of the eyes.

Registration in 3D space handles head pose changes better than registration in 2D space. In 2D face recognition systems, the two eye centers are commonly used for alignment [1]. However, the eye center regions, especially for brown and black eyes, cannot be reliably captured by a 3D laser-based scanner due to the low reflectivity of dark regions [16]; see Fig. 1 for an example. We therefore extract more reliable feature points, such as the eye corners, to achieve the alignment in three-dimensional space.

Fig. 1. Facial scan and feature points. (a) Intensity image. (b) Range image, with the color map indicating the corresponding depth (z value). (c) Mask image provided by the sensor, indicating valid points (white). Notice the holes in the eye centers due to dark regions. (d) Feature points.


Intensity images captured by 2D cameras are closer to the input of the human visual system for interpreting facial images, but robust facial feature extraction from intensity images is still a challenging problem. Properties derived from the intensity, such as edge and corner responses, are not robust to lighting and pose changes. The range modality is relatively insensitive to lighting and pose changes, but is subject to sensor noise. Due to the large intra-class variability, a single modality may not provide consistent feature points across a large population. Accumulating evidence derived from different modalities has the potential to make the feature extraction system more robust.

While a number of approaches have been proposed for 2D intensity-based facial feature extraction, only a few address feature extraction utilizing both 3D range and 2D intensity in frontal facial scans [17, 16]. Wang et al. [17] classified feature points into two types, 3D and 2D. The point signature [18] and the stacked Gabor filter responses [11] were used as the 3D and 2D features for each point in the image, respectively. Each extracted feature point was associated with a feature vector containing values of the 3D and 2D features, which was used for matching. Boehnen and Russ [16] used 2D color information to extract the skin-tone region and identify the eyes and the mouth. The 3D information contained in the range image was utilized to compute a distance constraint between the eyes and to locate the nose tip. The eye centers, the nose tip, and the center of the upper lip were extracted.

We focus on automatically extracting feature points (the nose tip and the corners of the eyes and the mouth) from frontal facial scans, which are then used for face alignment in three-dimensional space. A robust approach is proposed to identify the nose tip from the range map. We propose an integration scheme that utilizes the registered range and intensity maps to extract the fiducial points. For each candidate point, the shape index response from the range map and the cornerness response from the intensity map are calculated. The normalized responses from the two modalities are combined to generate a unified response for each candidate point, and the point with the strongest response is identified as the targeted feature point. In addition, we introduce a reject option to handle large sensor noise. From a system point of view, the reject option makes the system more robust than returning unreliable results. Utilizing the automatic feature extraction module, a fully automatic multimodal face recognition system, which integrates both range and intensity modalities, is developed and its performance is evaluated.

2 Feature Extraction

The overall feature extraction process is shown in Fig. 2.

2.1 Face Segmentation

The first step in the entire face recognition system is to detect a face and locate the face area (if any) in a given facial scan. A well-known real-time boosted cascade classifier 1 [19, 20] is applied on the intensity map for face detection. The intensity-based algorithm may generate false alarms in a cluttered background. We utilize the range map to segment the foreground from the background using the depth information, which constrains the face detection search area and reduces the false alarm rate.

1 The source package is downloaded from http://sourceforge.net/projects/opencvlibrary/.


Fig. 2. Multimodal feature extraction diagram.

The face segmentation result produced by our system for the facial scan in Fig. 1 is provided in Fig. 2. The subsequent feature point extraction is conducted within the segmented face area.
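As an illustration of how the range map can gate the intensity-based detector, the sketch below thresholds the depth to obtain a foreground mask and keeps only cascade detections that overlap it. The depth cutoff, the overlap criterion, and the use of OpenCV's bundled frontal-face cascade are our assumptions, not details taken from the paper.

```python
import cv2
import numpy as np

def detect_face(intensity, range_map, valid_mask, max_depth_mm=1200):
    """Detect a face box on the 8-bit grayscale intensity map, keeping only
    detections that overlap the range-based foreground (illustrative sketch;
    the thresholding direction depends on the sensor's depth convention)."""
    # Foreground: valid range samples within an assumed working distance.
    foreground = valid_mask & (range_map < max_depth_mm)

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = cascade.detectMultiScale(intensity, scaleFactor=1.1, minNeighbors=4)

    for (x, y, w, h) in boxes:
        # Require most of the detected box to lie on the range foreground.
        if foreground[y:y + h, x:x + w].mean() > 0.5:
            return x, y, w, h
    return None
```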

2.2 Robust Nose Tip Extraction

The nose tip is a distinctive point of the human face, especially in the range map. It is also insensitive to facial expression changes. The nose tip in a frontal scan has a local maximum value in the z direction (the z direction points out of the image plane), but it is not necessarily the global maximum, due to various factors such as beard, hair, other objects in the field of view, and sensor noise. Figure 3 gives a number of examples.

Fig. 3. Nose tip mis-location results from using the heuristic that the nose tip is the closest point to the sensor. The errors can result from beard (a), hair (b), other objects in the field of view (elbow in (c)), and sensor noise (d). (e) is the range image of (d) shown from the side viewpoint in 3D space; the spike noise appears in the eye area.

We have developed a robust nose tip extraction scheme. The nose tip is located through cross profile analysis on the range map based on the shape of the nose; the region around the nose tip lacks texture information (intensity characteristics). The range image is represented as z(r, c), where z is the depth value and r and c are the row (horizontal) and column (vertical) indices, respectively.


In the following, we use the facial scan shown in Fig. 1 to demonstrate our feature extraction scheme.

In the frontal pose, the nose tip lies close to the mid-line (the cross section between the facial surface and the bilateral symmetry plane), and the nose ridge follows the mid-line. Therefore, points on the nose ridge are very likely to provide the extreme z values along the horizontal cross sections in the segmented face area. For each row, we find the position with the maximum z value, shown in Fig. 4(a). Then, for each column, the number of these positions is counted, resulting in the histogram shown in Fig. 4(b). We pick the column at which the histogram reaches its peak as the mid-line, shown in Fig. 4(c).
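A minimal sketch of this histogram-based mid-line search (variable names and the masking convention are ours):

```python
import numpy as np

def find_midline_column(range_map, face_mask):
    """Column index of the face mid-line: for each face row, take the column
    of the maximum z value (closest to the sensor), then pick the column that
    is hit most often."""
    z = np.where(face_mask, range_map, -np.inf)      # ignore non-face pixels
    face_rows = np.where(face_mask.any(axis=1))[0]
    peak_cols = z[face_rows].argmax(axis=1)          # per-row extremum position
    counts = np.bincount(peak_cols, minlength=range_map.shape[1])
    return int(counts.argmax())                      # column with the peak count
```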

Fig. 4. Finding the face mid-line. (a) The yellow marks represent the positions where the z value reaches its extremum along each row. (b) Total number of extreme z values (yellow points) in each column. (c) The mid-line (in blue) is located by choosing the column with the maximum peak in (b).

The search region for the nose tip is reduced to the points along the mid-line, and a vertical z profile analysis along the mid-line is conducted. Figure 5(a) presents the profile (z value) along the mid-line.

Fig. 5. The depth (z) profile along the mid-line after being processed with a mean filter along the vertical axis. The horizontal axis represents the row index of the range image.

In the mid-line profile, the nose bridge presents a strong consecutive increase in z values. The gradients are calculated as g(r) = z(r + 1) − z(r), where r is the row index. Then the run-length of consecutive positive gradient signs is computed for each row along the mid-line as follows:


RL(r) =
\begin{cases}
RL(r-1) + 1 & \text{if } g(r) > 0, \\
0 & \text{otherwise.}
\end{cases}
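A direct sketch of this run-length recurrence, where the input is assumed to be the mean-filtered mid-line depth profile of Fig. 5:

```python
import numpy as np

def gradient_run_lengths(z_profile):
    """RL(r): length of the current run of positive depth gradients along
    the mid-line profile."""
    g = np.diff(z_profile)                    # g(r) = z(r + 1) - z(r)
    rl = np.zeros(len(z_profile), dtype=int)
    for r in range(1, len(z_profile) - 1):
        rl[r] = rl[r - 1] + 1 if g[r] > 0 else 0
    return rl
```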

Figure 6 shows the RL values computed from the mid-line profile.

Fig. 6. Run-lengths along the mid-line. (a) Run-length of the consecutive positive signs of the depth gradients. The arrows point out the three highest peaks, with the corresponding points labeled in (b).

The nose tip is located at one of these peaks, where the run-length reaches a local maximum before dropping to zero. To verify which peak corresponds to the nose tip, we conduct a profile analysis along the row at each peak position, i.e., a horizontal z profile analysis.


Fig. 7. Row profiles for local peaks along the mid-line. (a) The horizontal depth (z) profiles (along the row direction) at the three peaks in Fig. 6. (b) Extracted nose tip.

In our experiments, the three highest peaks along the mid-line are selected. The corresponding row profiles are provided in Fig. 7(a). Let c_p denote the column index of a peak. For each row profile, the metric s_g is computed by

s_g = \sum_{c \in R} |z(c_p) - z(c)|,

where R is the column neighborhood around c_p. The metric s_g measures the variation of the row profile with respect to each peak.


In our experiments, R is empirically set to within ±20 mm of c_p. The peak with the largest s_g value is selected as the nose tip candidate, around which the point with the maximum z value is identified as the nose tip. The final nose tip extraction result is shown in Fig. 7(b).
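A sketch of this peak verification step; the ±20 mm neighborhood is expressed in pixels here, under the assumption (from Table 1) that adjacent pixels are roughly 1 mm apart, and the final refinement to the local z maximum is only indicated in a comment.

```python
import numpy as np

def verify_nose_tip(range_map, midline_col, rl, n_peaks=3, half_win=20):
    """Among the rows with the highest run-length values (a simple stand-in
    for true peak picking), choose the row whose horizontal profile has the
    largest spread s_g around the mid-line column."""
    peak_rows = np.argsort(rl)[-n_peaks:]
    best_row, best_sg = None, -np.inf
    for r in peak_rows:
        lo = max(midline_col - half_win, 0)
        hi = min(midline_col + half_win + 1, range_map.shape[1])
        sg = np.abs(range_map[r, midline_col] - range_map[r, lo:hi]).sum()
        if sg > best_sg:
            best_row, best_sg = int(r), sg
    # The paper then refines the nose tip to the point with the maximum
    # z value in the neighborhood of (best_row, midline_col).
    return best_row, midline_col
```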

2.3 Feature Location Model

Facial feature points have a similar shape configuration across the population. A statistical model of the facial features is used as a prior constraint to reduce the search area for the feature points. Effectively reducing the search region not only enhances the accuracy of the extraction results, but also improves the computational efficiency. Based on an independently collected set of frontal facial scans with manually labeled feature points, the statistical model is constructed as the average position of each feature point associated with a 3D ellipsoid, whose semi-axes are 1.5 times the standard deviations along the respective (x, y, z) directions. The frontal scans are aligned by setting the nose tip as the origin.

The scans provided by the 3D sensor contain (x, y, z) coordinates in the world coordinate system, in units of mm. The statistical feature location model is built in the physical world coordinate system, so that the scale factor induced by the world-to-image (pixel) mapping is removed from the model. In our experiments, 145 frontal facial scans are used to construct the model, which is shown in Fig. 8.
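A sketch of how such a model could be built, assuming `labeled_scans` is a list of (num_landmarks, 3) arrays of manually labeled (x, y, z) landmarks in mm, with the nose tip stored first (our convention, not the paper's):

```python
import numpy as np

def build_location_model(labeled_scans):
    """Average landmark positions and 1.5-sigma ellipsoid semi-axes after
    translating every scan so its nose tip sits at the origin."""
    aligned = [pts - pts[0] for pts in labeled_scans]   # nose tip -> origin
    aligned = np.stack(aligned)                         # (scans, landmarks, 3)
    mean_pos = aligned.mean(axis=0)                     # average position
    half_axes = 1.5 * aligned.std(axis=0)               # ellipsoid semi-axes
    return mean_pos, half_axes
```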

Fig. 8. Feature location model (left) overlaid on a 3D face image with the nose tip aligned (right). The red star denotes the average position and the purple ellipsoid spans 1.5 standard deviations along the (x, y, z) directions. Since the nose tip is used to align all the scans, there is no variation at the nose tip.

2.4 Extracting Corners of the Eyes and the Mouth

Given the extracted nose tip position, the statistical model is aligned with the scan by translating it to the nose tip (for the frontal case). The search region for each feature point is constrained to the corresponding ellipsoid shown in Fig. 8.


Shape Index (range) We derive the local shape index [21] at each point from the range map. The shape index S(p) at point p is defined using the maximum (k1) and minimum (k2) local curvature values (see Eq. (1)). The shape index takes a value in the interval [0, 1]. The corners of the eyes and the mouth have a cup-like shape with low shape index values. Figure 9 shows nine representative shapes with their corresponding shape index values.

S(p) = \frac{1}{2} - \frac{1}{\pi} \arctan \frac{k_1(p) + k_2(p)}{k_1(p) - k_2(p)}    (1)
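The paper does not detail how the principal curvatures are estimated. The sketch below uses finite-difference estimates of the Gaussian and mean curvatures of the range surface z(x, y) and then evaluates Eq. (1); in practice the range map would typically be smoothed first, and the discretization choices here are our assumptions.

```python
import numpy as np

def shape_index(z):
    """Per-pixel shape index of a range map z (depth in mm), via
    finite-difference principal curvature estimates."""
    zy, zx = np.gradient(z)            # first derivatives (rows, cols)
    zxy, zxx = np.gradient(zx)
    zyy = np.gradient(zy)[0]
    # Gaussian (K) and mean (H) curvature of the Monge patch z(x, y)
    denom1 = (1 + zx ** 2 + zy ** 2)
    K = (zxx * zyy - zxy ** 2) / denom1 ** 2
    H = ((1 + zy ** 2) * zxx - 2 * zx * zy * zxy + (1 + zx ** 2) * zyy) \
        / (2 * denom1 ** 1.5)
    # Principal curvatures, k1 >= k2
    disc = np.sqrt(np.maximum(H ** 2 - K, 0.0))
    k1, k2 = H + disc, H - disc
    # Eq. (1); guard against the flat/umbilic case k1 == k2
    diff = np.where(np.abs(k1 - k2) < 1e-8, 1e-8, k1 - k2)
    return 0.5 - (1.0 / np.pi) * np.arctan((k1 + k2) / diff)
```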

Fig. 9. Nine representative shapes on the shape index scale [21].

Cornerness (intensity) In the intensity map, the corners of the eyes and the mouth show a strong corner-like pattern. We apply the Harris corner detector [22], based on the fact that the intensity changes in a local neighborhood of a corner point should be large along all directions. The Harris corner detector has been demonstrated to have good repeatability on images taken under varying conditions [23]. Consider the Hessian matrix H of the image intensity function I in a local neighborhood of point p(x, y). If the two eigenvalues of H are large, then a small motion in any direction will cause a significant change of gray level, which indicates that the point p is a corner. An improved variant of the corner response function is given in [24]:

C(p) = \frac{\frac{\partial^2 I}{\partial x^2}\,\frac{\partial^2 I}{\partial y^2} - \left(\frac{\partial^2 I}{\partial x\,\partial y}\right)^2}{\frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}}

The stronger the corner response C(p), the more likely the point p is a corner.
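A finite-difference sketch of this det/trace corner measure (the derivative scheme and lack of pre-smoothing are our simplifications; a library Harris implementation could equally be used):

```python
import numpy as np

def corner_response(intensity):
    """Corner response of the form det/trace of the second-derivative
    matrix, following the formula above."""
    img = intensity.astype(float)
    iy, ix = np.gradient(img)
    ixy, ixx = np.gradient(ix)
    iyy = np.gradient(iy)[0]
    trace = ixx + iyy
    trace = np.where(np.abs(trace) < 1e-8, 1e-8, trace)   # avoid divide-by-zero
    return (ixx * iyy - ixy ** 2) / trace
```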

Fusion The responses obtained from the range and intensity maps are integrated. In order to apply the fusion rules, both S(p) and C(p) are normalized using the min-max rule within the search region of each feature point.


The normalized shape index response S'(p) at point p is computed as

S'(p) = \frac{S(p) - \min\{S_i\}}{\max\{S_i\} - \min\{S_i\}},

where {S_i} is the set of shape index values of the candidate points in the search region. The same normalization scheme is applied to the cornerness response C, yielding C'(p).

The final score F(p) is computed by integrating the scores from the two modalities using the sum rule [25]:

F(p) = (1 - S'(p)) + C'(p).

The point with the highest F(p) in each search region is identified as the corresponding feature point. Figure 10 shows an example of the extracted corners.
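A sketch of the per-region normalization and sum-rule fusion, assuming full-image shape index and cornerness maps plus a boolean mask marking one feature's search region projected into the image (our interface, not the paper's):

```python
import numpy as np

def fuse_and_pick(shape_index_map, corner_map, region_mask):
    """Min-max normalize both responses inside the search region and return
    the (row, col) of the point with the highest fused score F."""
    def minmax(x):
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo + 1e-12)

    s = minmax(shape_index_map[region_mask])
    c = minmax(corner_map[region_mask])
    fused = (1.0 - s) + c                  # F(p) = (1 - S'(p)) + C'(p)
    rows, cols = np.where(region_mask)
    best = fused.argmax()
    return rows[best], cols[best]
```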

Fig. 10. Feature extraction using fusion. (a) Shape index response in the detected face area; dark regions represent low shape index values and bright regions high values. The background is shown in black. (b) Corner response in the detected face area. (c) Feature points (yellow dots) extracted using the range (shape index) map only, taking the smallest shape index values. (d) Feature points (yellow dots) extracted using the intensity (cornerness) map only, taking the highest corner response. (e) Feature points (cyan plus signs) extracted by the fusion of range and intensity, taking the highest final score F.

Reject Option: Local Quality Analysis Range images captured by the 3D sensor are subject to sensor noise caused by dark regions, reflectance properties, or lighting effects [16].


It is difficult to model all these sensor errors. Most existing feature extraction systems suffer from bad data and produce invalid feature points. We introduce a reject option in the system to ensure the quality of the extracted feature points and to reject scans with poor quality. As a result, the system generates fewer (unexpected) invalid matching results. For missing data, the 3D sensor provides a mask map (see Fig. 1(c) for an example) indicating which sample points have not been captured correctly. A local quality metric based on the distribution of depth (z) values in a local region is computed as

LQ(p_0) = -\max\Big\{ \max_{p \in R}(z_p) - \operatorname{median}_{p \in R}(z_p),\; \operatorname{median}_{p \in R}(z_p) - \min_{p \in R}(z_p) \Big\},

where R is the local neighborhood centered at p_0 and z_p denotes the z coordinate of point p. The metric LQ is designed to detect spikes in both the positive and negative z directions. LQ is always non-positive, so the higher the LQ value, the better the local quality.

After a nose tip is extracted, the local quality metric LQ is evaluated at the candidate feature points. If LQ is below a pre-defined threshold, the extracted feature point will be rejected.
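A sketch of this local quality check, with an assumed window size; z values flagged invalid by the sensor mask are excluded from the statistics:

```python
import numpy as np

def local_quality(range_map, valid_mask, p0, half_win=10):
    """LQ(p0): spike-sensitive local quality around point p0 = (row, col).
    The window half-size is an illustrative assumption."""
    r, c = p0
    r0, r1 = max(r - half_win, 0), r + half_win + 1
    c0, c1 = max(c - half_win, 0), c + half_win + 1
    z = range_map[r0:r1, c0:c1][valid_mask[r0:r1, c0:c1]]
    med = np.median(z)
    return -max(z.max() - med, med - z.min())

# A scan is rejected when LQ at the extracted feature point falls below a
# pre-defined threshold (Sec. 4.1 reports -10 mm at the nose tip).
```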

3 Automatic Multimodal Face Recognition

We have developed a multimodal face recognition system [26] that matches multiview 2.5D face scans to 3D face models. For each subject, a 3D face model is constructed by integrating several 2.5D face scans captured from different views. The recognition engine consists of two components, surface matching and appearance-based matching. The range map is used for surface matching based on a modified iterative closest point (ICP) scheme, and the intensity map is used for appearance-based matching using discriminant subspace analysis. A weighted sum rule is applied to combine the scores given by the two matching components.

The feature points are used both for alignment in three-dimensional space for surface matching and for facial area cropping for the appearance-based matching. With the feature points automatically extracted, the entire matching system is fully automatic.

4 Experiments and Discussion

Two facial scan databases are used. One is the publicly available database from the University of Notre Dame 2 (UND) [4, 27], containing 946 facial scans from 267 subjects. To evaluate the entire face recognition system [28] with the automatic feature extraction integrated, two 3D (full view) face model databases are combined to construct the gallery database3. One was collected at Michigan State University (MSU), and the other was provided by the University of South Florida (USF) [29]. In total, there are 200 subjects (3D models) in the gallery database. The test set contains 98 frontal scans with neutral expression and 98 frontal scans with smiling expression from 98 subjects in the MSU database.

2 The database can be accessed at http://www.nd.edu/~cvrl/UNDBiometricsDatabase.html.
3 There is no 3D face model available in the UND database.


The range images (downsampled to 320 × 240, with a depth resolution of about 0.1 mm) were collected using a Minolta Vivid 910 scanner [30] at Michigan State University. This scanner uses structured laser light to construct the face image in less than one second. Each point in a scan has a color (r, g, b) as well as a location in 3D space (x, y, z); only gray-scale intensities, not color, are used in our experiments. An independent dataset was collected for building the statistical feature location model. The USF 3D model database was captured with a Cyberware scanner 4.

4.1 Results of Feature Extraction

Figure 11 provides examples of the feature extraction results.

Fig. 11. Examples of feature extraction results.

The rejection threshold for the local quality measure LQ is empirically set to −10 mm. In total, 10 scans from the UND database are rejected due to large sensor noise at the nose tip; the rejected images are given in Fig. 12.

Fig. 12. Rejected images based on local quality analysis. The surface (range map) is visu-alized in a wire-frame mesh.

4 http://www.cyberware.com/


Figure 13 presents examples of extracted feature points with large displacement from the manually labeled ground truth positions.

Fig. 13. Examples of extracted feature points with large displacement in localization.

Based on the intensity map of each scan, the feature points are manually labeled. Since the range map and the intensity map are registered (by the sensor), the corresponding 3D coordinates of the feature points are obtained from the range map. Using the manually labeled position as the ground truth, the localization displacement is computed as the Euclidean distance between the position of the automatically extracted feature point and the ground truth position. For ease of notation, we introduce the following terms: NT: nose tip; LE: inner left eye corner; RE: inner right eye corner; ORE: outside right eye corner; OLE: outside left eye corner; RM: right mouth corner; LM: left mouth corner. Table 1 provides the statistics of the localization displacement on the UND database. The corresponding 3D visualization is presented in Fig. 14.

Table 1. Statistics of the Euclidean distance (in 3D) between the automatically extracted feature points and the manually labeled feature points. (For the range images used in the experiments, the distance between two adjacent pixels in the x and y directions is ∼1 mm.)

           NT    LE    RE    ORE   OLE   RM    LM
Mean (mm)  5.0   5.7   6.0   7.1   7.9   3.6   3.6
Std (mm)   2.4   3.0   3.3   5.9   5.1   3.3   2.9
Max (mm)   13.9  24.4  23.1  38.2  37.6  26.4  25.9
Min (mm)   0     0     0     0     0     0     0

For each individual feature point, we compute the localization displacement histogram (LDH) across the entire dataset. The localization accuracy curve (LAC) is calculated by counting the number (or the percentage with respect to the total number of scans) of extracted feature points whose localization displacement is below a threshold. Figure 15 shows the LDH and LAC for the nose tip; Fig. 16 and Fig. 17 provide the LDH and LAC for the corners of the eyes and the mouth, respectively.
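A minimal sketch of the LAC computation from a list of per-scan displacements (in mm); the threshold grid is an assumption:

```python
import numpy as np

def localization_accuracy_curve(displacements_mm, thresholds_mm):
    """Fraction of scans whose localization displacement is at or below
    each threshold (the LAC described above)."""
    d = np.asarray(displacements_mm)
    return np.array([(d <= t).mean() for t in thresholds_mm])

# Example: lac = localization_accuracy_curve(nose_tip_errors, np.arange(0, 21))
```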

In practice, the feature points are used as a set, rather than individually, for registration purposes, and the localization accuracies of individual feature points from the same scan are not independent. Therefore, a correlated localization displacement evaluation is conducted. For each scan, the average localization displacement (Euclidean distance) over the seven extracted feature points is computed as the localization displacement for this scan.


Fig. 14. 3D visualization of the mean and standard deviation of the localization displacement for each feature point extracted by the fusion scheme (left), overlaid on a 3D face model (right). The red star is the ground truth position. The blue ellipsoid represents the average displacement along the x, y, and z directions. The purple ellipsoid is determined by the average plus one standard deviation in the corresponding direction.


Fig. 15. The localization displacement histogram (a) and the localization accuracy curve (b) for the nose tip extracted from the range map.

The corresponding localization accuracy curve is provided in Fig. 18.

In general, the fusion scheme integrating the range and intensity modalities achieves better performance than either individual modality. The total computation time for feature extraction, including face segmentation and localization of the 7 feature points, is approximately 2 seconds on a Pentium 4 2.8 GHz CPU.


Fig. 16. The localization displacement histograms for the corners of the eyes and the mouth extracted by three different schemes: range only, intensity only, and the fusion of both.

4.2 Results of Automatic Face Recognition

The face recognition system automatically matches the 196 frontal test scans to the 200 3D face models in identification mode. The identification results are given in Table 2, together with the identification results using manually labeled feature points for comparison. The fully automatic system provides identification accuracy comparable to that of the system using manually labeled feature points.

Table 2. Identification accuracy comparison based on automatically and manually extracted feature points. The rank-one matching error numbers are provided, along with the error numbers up to rank five (in parentheses). There are 196 frontal test scans and 200 3D face models from different subjects in the gallery.

Dataset                 Neutral (No. of Scans = 98)   Smiling (No. of Scans = 98)
Manual (ICP only)       2 (1, 0, 0, 0)                31 (28, 27, 23, 21)
Automatic (ICP only)    1 (1, 0, 0, 0)                37 (29, 27, 25, 24)
Manual (ICP + LDA)      1 (1, 1, 0, 0)                24 (19, 15, 15, 14)
Automatic (ICP + LDA)   0 (0, 0, 0, 0)                27 (23, 21, 18, 16)


Fig. 17. The localization accuracy curves for the corners of the eyes and the mouth extracted by three different schemes: range only, intensity only, and the fusion of both.

Fig. 18. The localization accuracy curve for correlated analysis.

5 Conclusions and Future Work

We propose a multimodal scheme that combines 3D (range) and 2D (intensity) information to extract facial feature points, leading to a fully automatic face recognition system.


The multimodal scheme is developed in each key step of the proposed face recognition system, including face segmentation, feature extraction, and recognition. The proposed feature extraction scheme can handle multiple faces in the same scan. A robust nose tip locator is presented, based on cross profile analysis of the nose shape. A statistical 3D location model of the feature points is applied to constrain their search regions. The corners of the eyes and the mouth are extracted using the shape index from the range map and the cornerness response from the intensity map. A local quality analysis is developed to reject scans of poor quality. The developed automatic face recognition system achieves identification accuracy similar to that of the system with manually labeled landmarks.

We are extending the scheme to feature extraction from facial scans with large pose variations. We are also exploring the use of the matching score as a confidence measure to robustly select the most reliable points for registration.

References

1. W. Zhao, R. Chellappa, A. Rosenfeld, and P.J. Phillips, “Face recognition: A literature survey,” CVL Technical Report, University of Maryland, 2000, <ftp://ftp.cfar.umd.edu/TRs/CVL-Reports-2000/TR4167-zhao.ps.gz>.

2. S.Z. Li and A.K. Jain (Eds.), Handbook of Face Recognition, Springer, 2005.

3. Face Recognition Vendor Test (FRVT), <http://www.frvt.org/>.

4. K.I. Chang, K.W. Bowyer, and P.J. Flynn, “Multi-modal 2D and 3D biometrics for face recognition,” in Proc. IEEE Workshop on Analysis and Modeling of Faces and Gestures, France, Oct. 2003.

5. F. Tsalakanidou, S. Malassiotis, and M. Strintzis, “Face localization and authentication using color and depth images,” IEEE Transactions on Image Processing, vol. 14, no. 2, pp. 152–168, 2005.

6. C. BenAbdelkader and P. Griffin, “Comparing and combining depth and texture cues for face recognition,” Image and Vision Computing, vol. 23, pp. 339–352, 2005.

7. 5th International Conference on 3-D Digital Imaging and Modeling (3DIM), <http://www.3dimconference.org/>, 2005.

8. S. Shan, Y. Chang, W. Gao, and B. Cao, “Curse of mis-alignment in face recognition: Problem and a novel mis-alignment learning solution,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Korea, 2004, pp. 314–320.

9. Y. Ryu and S. Oh, “Automatic extraction of eye and mouth fields from a face image using eigenfeatures and multilayer perceptrons,” Pattern Recognition, vol. 34, no. 12, pp. 2459–2466, 2001.

10. D. Cristinacce and T. Cootes, “Facial feature detection using AdaBoost with shape constraints,” in Proc. 14th British Machine Vision Conference, Norwich, UK, Sep. 2003, pp. 231–240.

11. L. Wiskott, J.M. Fellous, N. Kruger, and C. von der Malsburg, “Face recognition by elastic bunch graph matching,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 775–779, 1997.

12. K. Toyama, R. Feris, J. Gemmell, and V. Kruger, “Hierarchical wavelet networks for facial feature localization,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Washington D.C., 2002, pp. 118–123.

13. T.F. Cootes, G.J. Edwards, and C.J. Taylor, “Active appearance models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681–685, Jun. 2001.

14. J. Xiao, S. Baker, I. Matthews, and T. Kanade, “Real-time combined 2D+3D active appearance models,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 535–542.

15. L.G. Farkas, Anthropometry of the Head and Face, Raven Press, 2nd edition, 1994.


16. C. Boehnen and T. Russ, “A fast multi-modal approach to facial feature detection,” in Proc. 7th IEEE Workshop on Applications of Computer Vision, Breckenridge, CO, Jan. 2005, pp. 135–142.

17. Y. Wang, C. Chua, and Y. Ho, “Facial feature detection and face recognition from 2D and 3D images,” Pattern Recognition Letters, vol. 23, pp. 1191–1202, 2002.

18. C.S. Chua and R. Jarvis, “Point signature: A new representation for 3D object recognition,” International Journal of Computer Vision, vol. 25, no. 1, pp. 63–85, 1997.

19. P. Viola and M.J. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, pp. 511–518.

20. R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in Proc. IEEE International Conference on Image Processing, Sep. 2002, pp. 900–903.

21. C. Dorai and A.K. Jain, “COSMOS: A representation scheme for 3D free-form objects,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 10, pp. 1115–1130, 1997.

22. C.G. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. 4th Alvey Vision Conference, 1988, pp. 147–151.

23. C. Schmid, R. Mohr, and C. Bauckhage, “Evaluation of interest point detectors,” International Journal of Computer Vision, vol. 37, no. 2, pp. 151–172, 2000.

24. A. Noble, Descriptions of Image Surfaces, PhD thesis, Department of Engineering Science, Oxford University, 1989.

25. J. Kittler, M. Hatef, R. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.

26. X. Lu and A.K. Jain, “Integrating range and texture information for 3D face recognition,” in Proc. 7th IEEE Workshop on Applications of Computer Vision, Breckenridge, CO, 2005, pp. 156–163.

27. P.J. Phillips, “Face recognition grand challenge,” Biometric Consortium Conference, 2004.

28. X. Lu, A.K. Jain, and D. Colbry, “Matching 2.5D face scans to 3D models,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2005. To appear.

29. USF HumanID 3D Face Dataset.

30. Minolta Vivid 910 non-contact 3D laser scanner, <http://www.minoltausa.com/vivid/>.