
Predicting First Impressions with Deep Learning

Mel McCurrie1, Fernando Beletti1, Lucas Parzianello1, Allen Westendorp1, Samuel Anthony2,3, and Walter J. Scheirer1

1 University of Notre Dame
2 Harvard University

3 Perceptive Automata, Inc.

Abstract— Describable visual facial attributes are now commonplace in human biometrics and affective computing, with existing algorithms even reaching a sufficient point of maturity for placement into commercial products. These algorithms model objective facets of facial appearance, such as hair and eye color, expression, and aspects of the geometry of the face. A natural extension, which has not been studied to any great extent thus far, is the ability to model subjective attributes that are assigned to a face based purely on visual judgements. For instance, with just a glance, our first impression of a face may lead us to believe that a person is smart, worthy of our trust, and perhaps even our admiration — regardless of the underlying truth behind such attributes. Psychologists believe that these judgements are based on a variety of factors such as emotional states, personality traits, and other physiognomic cues. But work in this direction leads to an interesting question: how do we create models for problems where there is no ground truth, only measurable behavior? In this paper, we introduce a new convolutional neural network-based regression framework that allows us to train predictive models of crowd behavior for social attribute assignment. Over images from the AFLW face database, these models demonstrate strong correlations with human crowd ratings.

I. INTRODUCTION

In human attribute modeling, there exists a disparity between the way humans describe humans and the way current computational models describe humans. A large amount of describable attribute research in computer vision concentrates on objective traits. Work using the CelebA dataset [22], [29], [42], [40] has applied different methods to model traits such as “Male”, “Wearing a Hat”, and “Bearded”, using the dataset’s binary “yes” or “no” annotations. Going beyond objective attributes, it is possible to model more subjective traits such as expression [12], [7], attractiveness [17], and humorousness [21], but the current state of the art overlooks the important interrelation between human attribute modeling and social psychology. Both enabling computers to make accurate predictions about objective content and enabling computers to make human-like judgements about subjective content are necessary steps in the development of machine intelligence.

In this work we concentrate on descriptions of the face, as an abundance of social psychology research demonstrates a human tendency to make judgements in social interactions based on the faces of fellow humans [31], [38], [1]. Popular human characteristics of academic interest closely related to these social interactions include emotion [24], attractiveness [1], trustworthiness [35], [38], [28], [8],

Fig. 1: Computational modeling of social attributes allows us to predict what the crowd might say about a face image. In this image we graphically compare the attribute predictions for Julian Assange and Benedict Cumberbatch, who plays Assange in the movie The Fifth Estate, as well as the predictions for Edward Snowden and Joseph Gordon-Levitt, who plays Snowden in the movie Snowden. Specifically looking at these images, our models output remarkably similar predictions between the subjects and their actors, attesting to the accuracy of the portrayals in the films. The radar plots above reflect the output of a face processing pipeline, where faces are detected, aligned, and then processed through a deep convolutional neural network regressor that models a particular social attribute. This regression framework is the main contribution of our work.

dominance [31], [24], sociability, intelligence, and morality [1]. Psychologists often specifically concentrate on trustworthiness, dominance, and intelligence because they represent comprehensive abstract qualities that humans regard in each other. Alexander Todorov, one of the foremost psychologists studying these social judgements, uses dominance and trustworthiness as the basis of many in-depth studies of human judgement [35], [36], [34].


Ultimately he finds that most other recognizable subjective traits in humans can be represented as an orthogonal function of dominance and trustworthiness [27], which suggests these two conceptual traits are ideal for our computational modeling.

Closely related to our results is the intriguing branch of research concentrated on the assessment of abstract traits in human faces based on the effect of facial contortions and positions. Inspired by animals’ displays of dominance and submissiveness in respective head raises and bows, Mignault et al. specifically analyzed the effects of head tilt on the change in perceived dominance and emotion [24]. Not only does the study confirm the hypothesized disparity in perceived traits based on head tilt, but it also finds gender has a noteworthy influence on subjects’ perceptions. Keating et al. assessed the effect of eyebrow and mouth gestures on perceived dominance and happiness in a cultural context [14]. The study found smiling to be a universal indicator of happiness and showed weak associations between not smiling and dominance. It also determined the effect of a lowered brow on perceived dominance to be generally restricted to Western subjects.

In this paper we begin to bridge the gap between traditional machine learning and social psychology. We work specifically with traits that do not have a ground truth and can be considered abstract representations of high-level human attributes. Additionally, we introduce a new convolutional neural network-based regression framework that allows us to train predictive models of crowd behavior for social attribute assignment. Very different from prior work, we make use of a unique visual psychophysics crowdsourcing platform called TestMyBrain.org to gather the annotations necessary for training. As a case study, we examine three purely (when analyzed in a visual context) social attributes: dominance, trustworthiness, and IQ. We also look at the more familiar objective attribute of age, but purely in the context of crowd judgements. Our models demonstrate strong correlations with crowd ratings, largely driven by low-level image cues as opposed to high-level facial structure.

In short, our contributions in this paper are:

• A novel ground-truth-free dataset of over 6,000 images annotated for all four traits of interest (dominance, trustworthiness, IQ, and age) on continuous distributions.

• The deployment of a crowd-sourced data collection regime, which collects large amounts of data on high-level social attributes from the popular psychophysics testing platform TestMyBrain.org.

• The comparison of different deep learning architectures for high-level social attribute modeling.

• A set of highly effective automatic predictors of social attributes that have not been modeled before in computer vision.

II. RELATED WORK

The related work in computer vision falls into two categories: general facial attributes, and specific convolutional neural network-based approaches. We review both in this section.

A. Attributes in Computer Vision

Due to the proliferation of low-cost, high-performance computing resources (e.g., GPUs) and web-scale image data, large-scale image classification and labeling is now commonplace in computer vision. With respect to face images from the web, Labeled Faces in the Wild [13], YouTube Faces [39], MegaFace [26], Janus Benchmark A [15], and CelebA [22] are all popular choices for a variety of facial modeling tasks beyond conventional face recognition. Attribute prediction, where the objective is to assign semantically meaningful labels to faces in order to build a human interpretable description of facial appearance, is the particular task we concentrate on in this paper.

Both Farhadi et al. [9] and Lampert et al. [19] originally conceived of visual attributes as a development supporting object recognition, rather than a primary goal in and of itself. Faces, however, are a special case where standalone analysis supports applications in biometrics and affective computing. Kumar et al. used facial attributes for face verification and image search [17]. Scheirer et al. applied statistical Extreme Value Theory to facial attribute search spaces to create accurate multi-dimensional representations of attribute searches [30]. Siddiquie et al. modeled the relationships between different attributes to create more accurate multi-attribute searches [32]. Luo et al. captured the interdependencies of local face regions to increase classification accuracy [23].

Certain traits such as age [25], [18], [20] and gender [21], [20] have enjoyed disproportionate attention, but researchers also model numerous other facial attributes. The release of the large CelebA dataset [22] also prompted several novel studies of facial attributes on all 40 traits in the dataset [29], [42], [40]. For a comprehensive review of facial attribute work in practical biometric systems, see the survey authored by Dantcheva et al. [6].

B. Convolutional Neural Networks for Attributes

Current state-of-the-art facial attribute modeling relies on CNNs. In pioneering work in the field, Golomb et al. trained a CNN with an 8.1% error rate on gender prediction [11]. More recently, Zhang et al. used CNNs alongside conventional part-based models to predict attributes such as clothing style, gender, action, and hair style from images [41]. Wang et al. applied CNNs to an automatically generated egocentric dataset annotated for contextual information such as weather and location [37]. Levi et al. used a CNN for age and gender classification from faces [20]. Liu et al. used two cascaded CNNs and trained Support Vector Machines to separate the processes of face localization and attribute prediction [22]. And Zhong et al. extended the work of Liu et al., using off-the-shelf CNNs to build facial descriptors in a different approach to attribute prediction [42].

Most similar to our research is the recent work of Lewenberg et al. [21]. They use a CNN to predict objective traits including gender, ethnicity, age, make-up, and hair color, and subjective traits including emotional state, attractiveness, and humorousness.


Fig. 2: We assert that to most accurately model humans’ psychological judgements, each of these traits should be modeled on a continuous distribution. For this reason we employed the Likert Scale in our data collection. This image shows faces at each quartile from the training dataset (left) as well as the training data distributions (right), all of which seem to be close to normal.

That research introduced a new face attributes dataset of 10,000 images annotated for these traits. To generate this dataset, Lewenberg et al. employed Amazon Mechanical Turk raters from the US and Canada to rate a subset of the PubFig dataset, aggregating labels from three separate individuals for each image. Notably, the work only analyzes the traits on a discrete distribution, labeling and predicting each image as a binary “yes” or “no” for each trait. Our most immediate improvement on this work lies in the way we collect data. We use an online psychophysics testing platform, aggregating data from a larger number of raters from an arguably more reliable and geographically variable source. In addition, we model more abstract, representational traits on continuous distributions.

Also parallel to our work, and the current state of the art in attribute prediction, is the work of Rudd et al. [29]. Rudd et al. employ a single custom Mixed Objective Optimization Network (MOON) for multi-task facial attribute recognition, minimizing the error of their networks over all forty traits of the CelebA dataset [22]. We use our own implementation of the MOON architecture as a basis for each separate trait in our modeling.

III. CROWD-SOURCED DATA COLLECTION

In this paper we introduce a new dataset for social attribute modeling. The dataset consists of 6,300 greyscale images of faces sampled from the AFLW dataset [16] and annotated for the four traits we study. Representative samples of the dataset for each trait can be seen in Figure 2. This dataset is novel in that there is no ground truth. For traits such as age and IQ, which are easy to record and are described on well-known scales, it is of course possible to produce a dataset with verifiable ground truth annotations — but this is not our objective. Rather than analyze and model actual trustworthiness, dominance, IQ, and age, we choose to study people’s described perceptions of these traits.

For example, our dataset does not include actual ages; instead, the images are annotated with a consensus score — aggregate statistics of what many people said about the ages of the subjects in the images. Note that the complete dataset with annotations will be released following publication.

A. TestMyBrain.org

For this high-level, ground-truth-free annotation, we use TestMyBrain.org [10], a crowd-sourced psychophysics testing website where users go to test and compare their mental abilities and preferences. It is one of the most popular “brain testing” sites on the web, with over 1.6 million participants since 2008. But what specific advantages does TestMyBrain.org have over Amazon’s Mechanical Turk service?

TestMyBrain.org is a citizen science effort that facilitates psychological experiments and provides personalized feedback for the user, mutually benefiting both researchers and those curious about their own minds. The subject pool is geographically diverse and provides an arguably superior psychometric testing group compared to smaller, more homogeneous subject pools such as that of Mechanical Turk. In addition to being an ideal setting for aggregate, cross-cultural psychometric experiments for researchers, TestMyBrain.org provides the non-monetary incentive of detailed, personalized results for subjects. Subjects visiting the site are motivated by a desire to learn about themselves and have little incentive to respond to experiments quickly or poorly. Based on these factors, we determined that the subject pool of TestMyBrain.org is ideal for the delicate task of honestly appraising abstract, ground-truth-free attributes in faces.

Using TestMyBrain.org, we asked participants to judge faces for a select trait on a Likert Scale, a psychometric bipolar scaling method shown in Fig. 3.


Fig. 3: A sample behavioral task that a subject might see on TestMyBrain.org. All ratings collected for this work were on a Likert scale from 1 to 7, where 1 indicates the least presence of the attribute and 7 indicates the most.

As can be seen in Table I, each face has an average of about 32 judgements for Trustworthiness and Dominance and about 15 for Age and IQ. We recorded the average to use as the consensus score for that image and normalized the Trustworthiness and Dominance scores.
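As a concrete illustration of this aggregation, the following is a minimal sketch, assuming ratings arrive as per-image lists of 1-7 Likert values; the min-max normalization shown for Trustworthiness and Dominance is one plausible choice, as the exact scheme is not specified here.

import numpy as np

def consensus_scores(ratings_per_image, normalize=False):
    # Average the Likert ratings (1-7) collected for each image.
    means = np.array([np.mean(r) for r in ratings_per_image])
    if normalize:
        # Illustrative normalization for Trustworthiness and Dominance.
        means = (means - means.min()) / (means.max() - means.min())
    return means

# Toy example: three images with four ratings each.
trust = [np.array([4, 5, 3, 4]), np.array([2, 1, 2, 3]), np.array([6, 7, 6, 5])]
print(consensus_scores(trust, normalize=True))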

IV. CNN REGRESSION FOR SOCIAL ATTRIBUTES

Our algorithm is a regression model that outputs a single score from an input image. A regression, rather than a binary classification, is a more realistic representation of the initial judgements humans make. For example, of our four modeled traits, both Age and IQ are already known to be described by continuous distributions and are therefore likely judged on continuous distributions. We assert that the other two modeled traits, Trustworthiness and Dominance, are similarly best described by continuous distributions. For what is discussed below with respect to architectures, assume the output is always a single floating point number from the fully-connected layers of a CNN after the convolutional layers’ feature extraction.

A. Comparing Architectures: What Works Best for Social Attribute Modeling?

We initially compared four separate deep architectures with conceptually similar structures but different depths, use of max pooling and dropout, and implementation parameters. To test very deep architectures we used the Oxford Visual Geometry Group’s VGG networks [33]. We reproduced VGGNet19’s convolutional architecture, modifying the shape of the input and output matrices for our smaller grayscale images and single floating point regression output. To compare results from another deep, yet slightly shallower architecture, we also modified and used VGGNet16 in the same manner.
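For readers who want a concrete starting point, a minimal sketch of such an adaptation follows, written in modern tf.keras functional style rather than reproducing our original code; the 128x128 input size and the Adam/MSE training configuration are assumptions for illustration.

from tensorflow.keras import Input, Model, layers

def vgg_style_regressor(input_shape=(128, 128, 1)):
    x = inputs = Input(shape=input_shape)
    # VGG16-style blocks: stacked 3x3 convolutions, then max pooling.
    for n_filters, n_convs in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
        for _ in range(n_convs):
            x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dense(4096, activation="relu")(x)
    outputs = layers.Dense(1)(x)  # single floating point regression score
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model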

TABLE I: Statistics on the 6,000 images used for training for all four social attribute classes.

                     Trust.   Dom.     Age      IQ
Mean Rating          0.434    0.449    3.507    3.986
Mean Std. Deviation  0.254    0.257    0.783    1.176
Mean Num. of Ratings 32.318   32.304   15.851   15.819
Min.                 0.028    0.048    1.000    1.727
Max.                 0.826    0.875    6.957    6.200

The newest architecture we analyzed is our implementation of the MOON architecture [29], which is shallower than both of the VGGNet implementations. The convolutional feature extracting portion of the architecture consists of several segments, where each segment has multiple convolutional layers followed by a max pooling layer. We modify the architecture for our smaller grayscale images and connect the convolutional layers to fully-connected layers that output a single score. With respect to our reimplementation of the MOON architecture, we made use of the Keras [5] and Theano [2] deep learning frameworks.

For comparison we also added a much shallower custom architecture with fewer convolutional layers and more frequent dropout. This shallow convolutional architecture consists of three segments of convolutional layers, each followed by max pooling. We connect this feature extracting portion to three fully-connected layers, each followed by 50% dropout, and a final fully-connected layer that outputs a single score. All activations in the shallow network are parametric ReLU.
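This description maps directly onto code; a sketch under stated assumptions follows (the filter counts of 32/64/128 and the 512-unit fully-connected layers are illustrative choices, not values taken from the paper).

from tensorflow.keras import Input, Model, layers

def shallow_regressor(input_shape=(128, 128, 1)):
    x = inputs = Input(shape=input_shape)
    for n_filters in (32, 64, 128):      # three convolutional segments
        x = layers.Conv2D(n_filters, 3, padding="same")(x)
        x = layers.PReLU()(x)            # parametric ReLU activations
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    for _ in range(3):                   # three FC layers, each with 50% dropout
        x = layers.Dense(512)(x)
        x = layers.PReLU()(x)
        x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1)(x)         # single regression score
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model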

As will be discussed below in Sec. V, the differences in model performance on the validation sets during training are not very large, suggesting the architecture choice may not make a significant difference. The newer MOON architecture performs slightly better on most of the traits, however, so we chose to use it as a basis for our final optimized models. The code and models will be released following publication.

B. Hyperparameter Optimization

Although Rudd et al. train their MOON models on RGB images that are larger than our grayscale images, they also model hypothetically lower-level objective attributes from the CelebA dataset, which is annotated for binary classification. This suggests that our very different dataset and features could benefit from some deviations in parameter choices.

To determine the best network size and deviation in parameters from the original MOON architecture, we optimize the network for each trait using hyperopt [3], a Python library for hyperparameter optimization. Our search space includes learning rate, dropout, batch size, the number of filters in each layer, and the number of layers. Employing hyperopt with the Tree of Parzen Estimators (TPE) algorithm allowed us to test a multitude of different parameter and architecture combinations over forty iterations.
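A sketch of this search is shown below; the value ranges are assumptions chosen to bracket the results in Table II, and the training-and-evaluation step is stubbed out with a hypothetical placeholder so the sketch runs standalone.

from hyperopt import fmin, hp, tpe

# Search space over the dimensions named above (ranges are illustrative).
space = {
    "lr": hp.loguniform("lr", -13, -7),              # roughly 2e-6 to 9e-4
    "dropout": hp.quniform("dropout", 0.40, 0.60, 0.05),
    "batch_size": hp.choice("batch_size", [1, 16, 32]),
    "conv_filters": [hp.choice("conv%d" % i, [32, 64, 128, 256, 512])
                     for i in range(1, 6)],
}

def objective(params):
    # Hypothetical stand-in: train a model with `params` and return its
    # validation R^2; a dummy score keeps this sketch runnable.
    r2 = 0.5 - abs(params["dropout"] - 0.5)
    return 1.0 - r2                                  # TPE minimizes 1 - R^2

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=40)
print(best)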

In Table II we label each segment of the convolutional portion of the architecture as Convolution N, where N is the in-order number of a segment.


Fig. 4: In this image we compare the ability of each architecture to learn the dataset and generalize to new unseen validation data. We include all four traits, training and validation scores, and all four original architectures (best viewed in color) plus our final architecture based on the hyperparameter optimization results. Of the four original architectures the MOON models generally perform better, but our optimized models consistently perform the best.

The optimal number of filters for all the convolutional layers in each segment is specified in the table. The absence of a number in a row signifies that model performance was best without the entire segment.

We maximize the model’s performance by minimizing 1 − R², where R² is the coefficient of determination from the regression of ŷ, the model’s predicted scores, on y, the original human annotations. We use the coefficient of determination as the measure of performance because it represents the percentage of prediction variation explained by the regression model. As explained previously, our measure of performance cannot be described as “accuracy”, as there is no ground truth.
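To make the objective concrete, the sketch below computes this quantity; for a simple linear regression of ŷ on y with an intercept, R² equals the squared Pearson correlation between the two variables. The numbers are illustrative only.

import numpy as np

# Illustrative values: crowd consensus scores y, model predictions y_hat.
y = np.array([0.43, 0.51, 0.28, 0.66, 0.35])
y_hat = np.array([0.40, 0.55, 0.31, 0.60, 0.39])

# R^2 of the simple regression of y_hat on y is the squared correlation.
r2 = np.corrcoef(y_hat, y)[0, 1] ** 2
print("objective:", 1.0 - r2)  # the quantity minimized during optimization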

As seen in Figure 4, each trait trains very differently with the original MOON architecture. Following this trend, each trait’s coefficient of determination is optimized by slightly different hyperparameters and deviations from the MOON architecture, as seen in Table II.

TABLE II: Hyperparameter optimization results per trait.

               Trust.   Dom.    Age     IQ
Learning Rate  1E-5     1E-4    1E-4    1E-4
Dropout        50%      50%     60%     45%
Batch Size     1        16      1       32
Convolution 1  32       64      32      64
Convolution 2  64       64      128     64
Convolution 3  256      256     256     256
Convolution 4  256      512     256     512
Convolution 5  -        512     256     512
Outputs 1      4096     4096    4096    4096
Outputs 2      -        4096    -       -

However, the differences from the original architecture are only modest.

V. EXPERIMENTAL EVALUATION

There are two important facets of evaluation with respect to our social attribute models: (1) model correlation with human crowd ratings of images, and (2) feature importance for social attribute models. Each of these facets is explored in this section. After data collection, our dataset consisted of 6,300 grayscale images of faces, aligned to correct for in-plane rotation using the CSU Toolkit [4] and annotated for Trustworthiness, Dominance, Age, and IQ. We randomly separated the original dataset into a training set of 6,000 images, a validation set of 200 images, and a test set of 100 images. The test set is held out during training, while the validation set is used to tune the hyperparameters.

A. Correlations with Crowd Judgements

We employ the R² value from a regression of ŷ, the model’s predicted scores, on y, the original human annotations, as a measure of our model’s performance. To properly compare architectures and assess the training performance of our optimized model, we record the R² at each epoch and graph the values in Figure 4.

Looking at the graphs, the largest differences in the effect of architecture choice are in the speed of training and the variability of the R² scores between epochs. The shallow custom architecture with high dropout tends to learn the slowest, yet does not have very much variability in scores between epochs, whereas the validation scores of the faster-learning, deeper models vary greatly between epochs.


Fig. 5: We can visualize regions of the face that are most important to the trained models by systematically covering parts of the face and recording the absolute differences. Here we display the average visualization of the differences from the validation set on top of the average face of the validation set.

It is very important to note how small the differences in final R² values are for each of the original four architectures. At a lower number of epochs, the correlation of model predictions with face annotations varies greatly as the architectures train at different speeds. But after 120 epochs, the scores are very similar — this is most prominently illustrated in the Dominance training.

As expected, our optimized architectures outperform the other four architectures. We display our final results from the optimized networks in Table III, which shows R² values from regressions of our models’ predicted values on human-annotated consensus scores for both the validation and testing sets. Each trait has a slightly different coefficient of determination; however, all scores are very strong for a psychology-oriented experiment incorporating noisy human measurements. Our models for Age are the strongest, those for IQ are the weakest, and Trustworthiness and Dominance perform similarly to each other.

TABLE III: R² values of validation and testing results from our optimized MOON architectures for each trait.

            Trust.   Dom.     Age      IQ
Validation  0.5157   0.4620   0.6803   0.3200
Test        0.5687   0.4601   0.6453   0.3548

Fig. 6: Visualizing a sample of the filters from the last convolutional layer of each optimized model, we can see the resemblance of the output to a low-level feature extractor.

B. Visualizations of Feature Importance

Visualizations of the hyperparameter-optimized CNN models show localized areas of importance on the face for each trait. As an example, we display the average heatmaps on the average faces of the entire validation subset in Figure 5. To produce these graphics we systematically moved a gray box over an image, iteratively scaling the box down after each pass. We then recorded the absolute difference in total score at each point. This visualization is intriguing because it allows us to view, in a certain image or over an average of images, what areas of the face have the most or least significant effect on the final prediction.
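A single pass of this occlusion procedure can be sketched as follows; the iterative rescaling of the box is omitted for brevity, and the box size, stride, and gray value are assumptions. The model argument is any single-output Keras regressor, such as the sketches in Sec. IV.

import numpy as np

def occlusion_heatmap(model, image, box=24, stride=8, gray=0.5):
    # image: 2-D grayscale array scaled to [0, 1].
    h, w = image.shape
    base = model.predict(image[None, :, :, None], verbose=0)[0, 0]
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(0, h - box + 1, stride):
        for left in range(0, w - box + 1, stride):
            occluded = image.copy()
            occluded[top:top + box, left:left + box] = gray   # gray box
            score = model.predict(occluded[None, :, :, None], verbose=0)[0, 0]
            # Accumulate the absolute change in the predicted score.
            heat[top:top + box, left:left + box] += abs(score - base)
            counts[top:top + box, left:left + box] += 1
    return heat / np.maximum(counts, 1)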

Again, it is difficult to assess the validity of our models, as accuracy cannot be calculated because there is no known ground truth. Referring back to previous social psychology research [24], [14], however, both Trustworthiness and Dominance are expected to rely on the mouth. Our models indicate a heavy reliance on areas near the mouth and chin. Similarly, Keating et al. [14] determined that a lowered brow should affect the (mostly Western) perception of Dominance. Both our Dominance and Trustworthiness models approximately locate the brow mid-sections. These observations indicate that our models have learned to look in the same places that humans do, replicating the way we judge high-level attributes in each other.

Another method of assessing our models is a visualization of the filters. Our visualizations of the filters from the final convolutional layer of each network in Figure 6 are intriguing because they resemble the output of a low-level feature extractor. This indicates that despite the high-level abstract quality of these traits, low-level features might be enough for humans to make their immediate judgements.
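The sketch below shows one way such filter tiles can be pulled from a trained Keras model; the 4x4 grid and per-filter normalization are presentation choices of this sketch, not of the paper.

import numpy as np
import matplotlib.pyplot as plt

def show_last_conv_filters(model, n=16):
    # Keras stores Conv2D kernels as (kh, kw, in_channels, out_channels).
    conv = [l for l in model.layers if l.__class__.__name__ == "Conv2D"][-1]
    kernels = conv.get_weights()[0]
    fig, axes = plt.subplots(4, 4, figsize=(6, 6))
    for i, ax in enumerate(axes.flat[:n]):
        k = kernels[:, :, 0, i]                     # first input channel
        k = (k - k.min()) / (np.ptp(k) + 1e-8)      # normalize for display
        ax.imshow(k, cmap="gray", interpolation="nearest")
        ax.axis("off")
    plt.show()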


Fig. 7: Frames taken from real-time video processing examples. The scores are normalized and then shown over time on a line plot and a histogram. Similar to Figure 1, we attempted to gauge the trustworthiness and dominance of actors Benedict Cumberbatch and Joseph Gordon-Levitt, who respectively appear in the movies The Fifth Estate and Snowden. According to the predictions from the models, the actors rate rather low on the trustworthiness scale, mirroring the visual consistency observed in Figure 1.

C. Processing Faces in Video

A very good litmus test for our models is video processing. For each frame of a video, we can apply face detection and face alignment, and then use our optimized models to predict the score of each trait. With the models loaded into memory we can even do this in real time, allowing the subjects in the videos to move and change position while we simultaneously determine the change in other people’s perceptions. Figure 7 shows several frames from a couple of example videos being processed. In Figure 7, all scores are mapped to a standard normal distribution and shown over time on both a line plot and a histogram. A selection of processed videos is provided as supplemental material.
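A per-frame sketch of this pipeline is given below; OpenCV's bundled Haar cascade stands in for the face detector here, and the alignment step (the CSU Toolkit in our offline pipeline) is omitted for brevity.

import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def score_video(model, path, size=128):
    cap = cv2.VideoCapture(path)
    scores = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces[:1]:       # score the first detected face
            crop = cv2.resize(gray[y:y + h, x:x + w], (size, size)) / 255.0
            scores.append(model.predict(crop[None, :, :, None], verbose=0)[0, 0])
    cap.release()
    return np.array(scores)  # map to a standard normal and plot downstream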

VI. DISCUSSION

Current state-of-the-art visual recognition algorithms in computer vision, and more specifically algorithms for facial attribute prediction [21], [29], show accuracy that promises new applications in the near future. It is in the best interest of both researchers and those applying these models in various industries to promote research that focuses on the interrelation of machine learning, computer vision, and social psychology.

A model is only as good as its data. The dataset and its annotations will ultimately have the most significant effect on the psychological validity and usefulness of the models. When annotating a dataset for subjective traits, small differences such as the number of annotators, the number of annotations, and the geographic and cultural differences of the annotators must be taken into consideration.


Different cultures and languages affect the way people interpret traits, or the description of traits. Just as intriguing as the generalizations about people that we made in our work is the study of different cultures and focus groups. Models trained only on the annotations of a focus group could generalize to new data, enabling cross-culture comparisons — useful in research, marketing, political campaigning, and more.

In systematically analyzing human judgements, it is also important to choose traits that best fulfill a purpose. In our case, Trustworthiness and Dominance are the best representations of the abstract judgements humans make about each other. IQ and Age, while not as fundamental in a psychological sense, still have conceivable applications, including the assessment of preconceived notions of intelligence and seniority — subtle social cues we often take for granted.

ACKNOWLEDGEMENTS

Mel McCurrie was supported by Boeing, Fernando Beletti and Lucas Parzianello were supported by the Brazil Scientific Mobility Program, Allen Westendorp was supported by the National Science Foundation, and Samuel Anthony was supported in part by NSF SBIR #IIP-1621689.

REFERENCES

[1] M. D. Alicke, R. H. Smith, and M. L. Klotz. Judgments of physical attractiveness: The role of faces and bodies. Personality and Social Psychology Bulletin, 12(4):381–389, 1986.

[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., pages 1–7, 2010.

[3] J. Bergstra, D. Yamins, and D. D. Cox. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proc. of the 12th Python in Science Conf., pages 13–20, 2013.

[4] D. S. Bolme, J. R. Beveridge, M. Teixeira, and B. A. Draper. The CSU face identification evaluation system: Its purpose, features, and structure. In International Conference on Computer Vision Systems, pages 304–313. Springer, 2003.

[5] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.

[6] A. Dantcheva, P. Elia, and A. Ross. What else does your biometric data reveal? A survey on soft biometrics. IEEE Transactions on Information Forensics and Security, 11(3):441–467, 2016.

[7] M. Dumas. Emotional expression recognition using support vector machines. In Proc. of the International Conference on Multimodal Interfaces, 2001.

[8] V. Falvello, M. Vinson, C. Ferrari, and A. Todorov. The robustness of learning about the trustworthiness of other people. Social Cognition, 33(5):368, 2015.

[9] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[10] L. Germine, K. Nakayama, B. C. Duchaine, C. F. Chabris, G. Chatterjee, and J. B. Wilmer. Is the web as good as the lab? Comparable performance from web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19(5):847–857, 2012.

[11] B. A. Golomb, D. T. Lawrence, and T. J. Sejnowski. Sexnet: A neural network identifies sex from human faces. In NIPS, volume 1, page 2, 1990.

[12] A. Graves, C. Mayer, M. Wimmer, J. Schmidhuber, and B. Radig. Facial expression recognition with recurrent neural networks. In Proc. of the International Workshop on Cognition for Technical Systems, 2008.

[13] G. B. Huang, M. Mattar, H. Lee, and E. Learned-Miller. Learning to align from scratch. In Neural Information Processing Systems (NIPS), 2012.

[14] C. F. Keating, A. Mazur, M. H. Segall, P. G. Cysneiros, J. E. Kilbride, P. Leahy, W. T. Divale, S. Komin, B. Thurman, and R. Wirsing. Culture and the perception of social dominance from facial expression. Journal of Personality and Social Psychology, 40(4):615, 1981.

[15] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[16] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[17] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar. Describable visual attributes for face verification and image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10):1962–1977, 2011.

[18] Y. H. Kwon and N. da Vitoria Lobo. Age classification from facial images. Computer Vision and Image Understanding, 74(1):1–21, 1999.

[19] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[20] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–42, 2015.

[21] Y. Lewenberg, Y. Bachrach, S. Shankar, and A. Criminisi. Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. arXiv preprint arXiv:1605.09062, 2016.

[22] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision (ICCV), 2015.

[23] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In IEEE International Conference on Computer Vision (ICCV), 2013.

[24] A. Mignault and A. Chaudhuri. The many faces of a neutral face: Head tilt and perception of dominance and emotion. Journal of Nonverbal Behavior, 27(2):111–132, 2003.

[25] A. Montillo and H. Ling. Age regression from faces using random forests. In IEEE International Conference on Image Processing (ICIP), 2009.

[26] A. Nech and I. Kemelmacher-Shlizerman. Megaface 2: 672,057 identities for face recognition. 2016.

[27] N. N. Oosterhof and A. Todorov. The functional basis of face evaluation. Proceedings of the National Academy of Sciences, 105(32):11087–11092, 2008.

[28] A. E. Pinkham, J. B. Hopfinger, K. Ruparel, and D. L. Penn. An investigation of the relationship between activation of a social cognitive neural network and social functioning. Schizophrenia Bulletin, 34(4):688–697, 2008.

[29] E. Rudd, M. Gunther, and T. Boult. MOON: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision (ECCV), 2016.

[30] W. J. Scheirer, N. Kumar, P. N. Belhumeur, and T. E. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[31] C. Senior, M. Phillips, J. Barnes, and A. David. An investigation into the perception of dominance from schematic faces: A study using the world-wide web. Behavior Research Methods, Instruments, & Computers, 31(2):341–346, 1999.

[32] B. Siddiquie, R. S. Feris, and L. S. Davis. Image ranking and retrieval based on multi-attribute queries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[34] A. Todorov, S. G. Baron, and N. N. Oosterhof. Evaluating face trustworthiness: A model-based approach. Social Cognitive and Affective Neuroscience, 3(2):119–127, 2008.

[35] A. Todorov and B. Duchaine. Reading trustworthiness in faces without recognizing faces. Cognitive Neuropsychology, 25(3):395–410, 2008.

[36] A. Todorov, M. Pakrashi, and N. N. Oosterhof. Evaluating faces on trustworthiness after minimal time exposure. Social Cognition, 27(6):813–833, 2009.

[37] J. Wang, Y. Cheng, and R. S. Feris. Walk and learn: Facial attribute representation learning from egocentric video and contextual data. arXiv preprint arXiv:1604.06433, 2016.


[38] J. S. Winston, B. A. Strange, J. O’Doherty, and R. J. Dolan. Automatic and intentional brain responses during evaluation of trustworthiness of faces. Nature Neuroscience, 5(3):277–283, 2002.

[39] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[40] K. Zhang, L. Tan, Z. Li, and Y. Qiao. Gender and smile classification using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–38, 2016.

[41] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1644, 2014.

[42] Y. Zhong, J. Sullivan, and H. Li. Face attribute prediction using off-the-shelf CNN features. In IAPR Int. Conf. on Biometrics, 2016.