
First Impressions: A Survey on Vision-based Apparent Personality Trait Analysis

Julio C. S. Jacques Junior, Yağmur Güçlütürk, Marc Pérez, Umut Güçlü, Carlos Andujar, Xavier Baró, Hugo Jair Escalante, Isabelle Guyon, Marcel A. J. van Gerven, Rob van Lier and Sergio Escalera

Abstract—Personality analysis has been widely studied in psychology, neuropsychology, and signal processing fields, among others. Over the past few years, it has also become an attractive research area in visual computing. From the computational point of view, speech and text have been by far the most frequently considered cues of information for analyzing personality. Recently, however, there has been an increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches are able to accurately analyze human faces, body postures and behaviors, and use this information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact that this sort of method could have in society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting-edge works on the subject, discussing and comparing their distinctive features and limitations. Future avenues of research in the field are identified and discussed. Furthermore, aspects of subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push research in the field, are reviewed.

Index Terms—Personality computing, First impressions, Person perception, Big-Five, Subjective bias, Computer vision, Machine learning, Nonverbal signals, Facial expression, Gesture, Speech analysis, Multi-modal recognition.


• Julio C. S. Jacques Junior is a postdoctoral researcher at Universitat Oberta de Catalunya and research collaborator at Computer Vision Center, Spain. E-mail: [email protected]

• Yağmur Güçlütürk is an assistant professor at Radboud University, Donders Institute for Brain, Cognition and Behaviour, the Netherlands. E-mail: [email protected]

• Marc Pérez is a bachelor's student at University of Barcelona, Spain. E-mail: [email protected]

• Umut Güçlü is an assistant professor at Radboud University, Donders Institute for Brain, Cognition and Behaviour, the Netherlands. E-mail: [email protected]

• Carlos Andujar is an associate professor at Universitat Politècnica de Catalunya, Spain. E-mail: [email protected]

• Xavier Baró is an associate professor at Universitat Oberta de Catalunya and Computer Vision Center, Spain. E-mail: [email protected]

• Hugo Jair Escalante is a research scientist at Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico. E-mail: [email protected]

• Isabelle Guyon is a professor and researcher at UPSud/INRIA Université Paris-Saclay, France, and President of ChaLearn, Berkeley, California. E-mail: [email protected]

• Marcel A. J. van Gerven is a professor and PI at Radboud University, Donders Institute for Brain, Cognition and Behaviour, the Netherlands. E-mail: [email protected]

• Rob van Lier is an associate professor at Radboud University, Donders Institute for Brain, Cognition and Behaviour, the Netherlands. E-mail: [email protected]

• Sergio Escalera is an associate professor at University of Barcelona and Computer Vision Center, Spain. E-mail: [email protected]

1 INTRODUCTION

Psychologists have long studied human personality, and throughout the years different theories have been proposed to categorize, explain and understand it. According to Vinciarelli and Mohammadi [1], the models that most effectively predict measurable aspects in the lives of people are those based on traits. Trait theory [2] is an approach based on the definition and measurement of traits, i.e., habitual patterns of behaviors, thoughts and emotions that are relatively stable over time. Trait models are built upon human judgments about semantic similarity and relationships between the adjectives that people use to describe themselves and others. For instance, consider that most people know the meaning of nervous, enthusiastic, and open-minded. Trait psychologists build on these familiar notions, giving precise definitions, devising quantitative measures, and documenting the impact of traits on people's lives [2].

Psychology studies, among other aspects, behaviour. Behaviour (B) is a function of the person (P) and the situation (S), i.e., [B = f(P, S)]. From a psychological point of view, most research has been conducted on the personal side of the equation (P), especially on individual differences in personality and cognitive traits. From a computational perspective, recent studies have also started to pay attention to the situational part of the equation (S), with a particular interest in personality perception. Apparent personality (A), however, is also conditioned on the observer (O), and could be defined as a function of (P, S, O), i.e., [A = f(P, S, O)]. While a vast amount of psychological research on the study of cognitive processes in individuality judgment from the observer's point of view can be found in the literature [3], the research from a computational point of view is just at its early stages.


This study revealed, among other things, that in addition to the measurements of agreement with respect to personality perception, almost no attention has been given to the part of the equation associated with the observer (O) when apparent personality trait recognition is considered.

From the perspective of automatic personality computing, the relationship between stimuli (everything observable people do) and the outcomes of social perception processes (how we form impressions about others) is likely to be stable enough to be modeled in statistical terms [4]. This is a major advantage for the analysis of social perception processes, because computing science, in particular machine learning, provides a wide spectrum of methods aimed at modeling statistical relationships like those observed in social perception. However, the main criticism against the use of trait models is that they are purely descriptive and do not correspond to actual characteristics of individuals, even though several decades of research have shown that the same traits appear with surprising regularity across a wide range of situations and cultures, suggesting that they actually correspond to psychologically salient phenomena [1].

Over the past decades, different trait models have been proposed and broadly studied: the Big-Five [5], Big-Two [6], and 16PF [7], among others. The model known as the Big-Five or Five-Factor Model, often represented by the acronym OCEAN, is one of the most adopted and influential models in psychology and personality computing. It is a hierarchical organization of personality traits in terms of five basic dimensions: Openness to Experience (contrasts traits such as imagination, curiosity and creativity with shallowness and imperceptiveness), Conscientiousness (contrasts organization, thoroughness and reliability with traits such as carelessness, negligence and unreliability), Extraversion (contrasts talkativeness, assertiveness and activity level with silence, passivity and reserve), Agreeableness (contrasts kindness, trust and warmth with traits such as hostility, selfishness and distrust), and Neuroticism (contrasts emotional instability, anxiety and moodiness with emotional stability).

For the sake of illustration, Fig. 1 shows representative face images for the highest and lowest levels of each Big-Five trait, obtained from [8]. Such images reflect a kind of relationship between observers' perception and observed people on the ChaLearn First Impressions database [9]. As can be seen, these representative samples are influenced by, among other things, facial expression (e.g., a high score on the Extraversion trait and an associated smiling expression) and the subjective bias of annotators with respect to gender (i.e., some traits scored high, such as Openness to Experience and Extraversion, show a female-looking face, whereas the same traits scored low seem to show a male-looking face). These are just a few examples of how the characteristics of observers and observed people affect first impressions of personality, and of how complex and subjective the problem can be.

Assessing the personality of an individual means measuring how well the adjectives mentioned above describe him/her. Psychologists have developed reliable and useful methodologies for assessing personality traits. Despite their known limitations, self-report questionnaires have become the dominant method for assessing personality [10]. However, personality assessment is not limited to psychologists: everybody, every day, makes judgments about their own personality as well as those of others.

Fig. 1. Representative face images for the highest and lowest levels of each Big-Five trait, obtained from [8]. Images were created by aligning and averaging the faces of the 100 unique individuals that had the highest and lowest evaluations for each trait on the ChaLearn First Impressions database [9] (we inverted the images for the Neuroticism trait, i.e., low → high, as they were drawn in [8] from the perspective of Emotional Stability).
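A minimal sketch of the averaging idea behind Fig. 1 (not the authors' exact pipeline): faces are detected, cropped to a common size, and pixel-wise averaged. The use of OpenCV's Haar cascade and the function names are our own illustrative assumptions; [8] aligned faces using facial landmarks, which yields sharper averages.

```python
# Hypothetical sketch of the face-averaging idea behind Fig. 1.
# [8] used landmark-based alignment; we approximate it here with a
# simple detect-and-crop step using OpenCV's Haar cascade.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def average_face(image_paths, size=(128, 128)):
    acc = np.zeros((size[1], size[0], 3), dtype=np.float64)
    n = 0
    for path in image_paths:
        img = cv2.imread(path)
        if img is None:
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]                      # keep the first detection
        acc += cv2.resize(img[y:y + h, x:x + w], size).astype(np.float64)
        n += 1
    return (acc / max(n, 1)).astype(np.uint8)      # pixel-wise mean face

# e.g., avg = average_face(paths_of_100_highest_extraversion_faces)
```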

In everyday intuition, the personality of a person is assessed along several dimensions. We are used to talking about an individual as being too much/too little focused on herself, (dis)organized, (non-)open-minded, etc. [11]. Nonetheless, support for the validity of these first impressions is inconclusive, raising the question of why we form them so readily. According to Willis and Todorov [12], people form first impressions about others, either from their faces [13] or in general, from a glimpse as brief as 100 ms, and these snap judgments predict all kinds of important decisions (discussed in Sec. 2.1).

While being diverse in terms of data, technologies and methodologies, all domains concerned with personality computing consider the same three main problems [1], namely automatic personality recognition (i.e., of the real personality of an individual), automatic personality perception (the prediction of the personality others attribute to a given person), and automatic personality synthesis (i.e., the generation of artificial personalities through embodied agents). Recently, the machine learning research community adopted the terms apparent personality, personality impressions, or simply first impressions [9], [14], [15] to refer to personality perception (these terms are used interchangeably throughout the text). First impressions, however, are not restricted to personality. In general, automatic personality perception is composed of three main steps: data annotation/ground-truth generation, feature extraction, and classification/regression, which will be incrementally discussed throughout the text.

Vinciarelli and Mohammadi [1] presented the first survey on personality computing. However, they focused on automatic personality recognition, perception and synthesis from a more general point of view rather than from a vision-based perspective. In this work, we contribute to the research area in the following directions:

• We present an up-to-date literature review on apparent personality trait analysis from a vision-based perspective, i.e., centered on the visual analysis of humans. Reviewed works include some kind of image-based analysis at some stage of their pipelines. Hence, this study can be considered the first comprehensive review covering this particular research area.

• We discuss the subjectivity in data labeling (and evaluation protocols) for first impressions, which is a relatively new and emerging research topic.


• We propose a taxonomy to group works according to the type of data they use: still images, image sequences, audiovisual or multimodal. We claim that the type of data and the application are strongly correlated, and that the proposed taxonomy can help future researchers in identifying: 1) common features, databases and protocols employed in different categories, 2) pros and cons of “similar” approaches, and 3) what methods they should compare with (or get inspired by).

• We present and discuss relevant works developed for real personality trait analysis, as well as those correlating both real and apparent personalities, which is an almost unexplored area in visual computing.

• We present a set of mid-level cues, collected from the reviewed works, that are highly correlated with each personality trait of the Big-Five model. We analyse reported results and identify the traits that are typically recognized best, as well as the most challenging ones.

• We discuss current datasets and competitions organized to push research in the field, their main limitations, and future research prospects.

• We identify recent trend methodologies, open challenges and research opportunities in the field.

The remainder of this paper is organized as follows. Sec. 2.1 motivates research on the topic. Subjectivity is discussed in Sec. 2.2. The state-of-the-art, according to the proposed taxonomy, is presented and discussed from Sec. 2.3 to Sec. 2.6. Then, a joint discussion about real and apparent personality is presented in Sec. 2.7. Later, we briefly discuss high-level features and their correlated traits in Sec. 2.8, and the overall accuracy obtained for different Big-Five traits in Sec. 2.9. In Sec. 3 we discuss past and current challenges organized to push the research area forward. Finally, final remarks and conclusions are drawn in Sec. 4.

2 RELATED WORK

This section presents a comprehensive review of vision-based methods for apparent personality trait analysis.

2.1 The importance of first impressions in our lives

A computer program capable of predicting in mere seconds the psychological profile of someone could have wide application for companies as well as for individuals around the globe. To mention just a few areas, such methods could be applied in health (e.g., personalized psychological therapies), robotics (e.g., humanized and social robots), learning (e.g., automatic tutoring systems), and leisure and business (e.g., personalized recommendation systems). For instance, recent studies show that therapeutic robots can be helpful in stimulating social skills in children with autism, which would be impossible without a robot provided with such advanced capabilities. Other studies indicate that video interviews, through nonverbal visual human behavior analysis, are starting to change the way applicants get hired [16]. Nevertheless, these kinds of applications will only be truly accepted and trusted if explainability and transparency can be guaranteed [17]. Moreover, to become inclusive and benefit everyone, such systems need to be able to generalize to different contexts.

The development and evaluation of automatic methods for personality perception is a very delicate topic, making us think over the still open question “what should be the limit of such technology?”. The accuracy of apparent personality recognition models is generally measured in terms of how close the outcomes of the approach are to the judgments made by external observers (i.e., annotators). The main assumption behind such evaluation is that social perception technologies are not expected to predict the actual state of the target, but the state observers attribute to it, i.e., their impressions. This makes automatic apparent personality trait analysis a very complex and subjective task.
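As an illustration of this evaluation philosophy, the ChaLearn First Impressions challenge [9] scored predictions by their closeness to averaged observer ratings in [0, 1]. Below is a minimal sketch of a metric of that form (mean accuracy, i.e., one minus the mean absolute error); the variable names are our own.

```python
import numpy as np

def mean_accuracy(pred, gt):
    """1 - mean absolute error per trait; pred and gt are arrays of
    trait scores in [0, 1], shape (n_videos, 5) for the Big-Five."""
    return 1.0 - np.abs(np.asarray(pred) - np.asarray(gt)).mean(axis=0)

# Example: per-trait accuracies against averaged annotator scores.
# acc = mean_accuracy(model_outputs, observer_labels)
```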

According to the literature, faces are a rich source of cues for apparent personality attribution [18], [19]. However, Todorov and Porter [20] showed that first impressions based on facial analysis can vary with different photos (i.e., the ratings vary w.r.t. context). Whether or not trait inferences from faces are accurate, they also affect important social outcomes. For example, attractive people have better mating success and job prospects than their less fortunate peers. The effects of appearance on social outcomes may be partly attributed to the halo effect [21], i.e., the tendency to use global evaluations to make judgments about specific traits (e.g., attractiveness correlates with perceptions of intelligence), which has a strong influence on how people build their first impressions about others. Rapid judgments of competence based solely on the facial appearance of candidates were enough to allow participants in an experiment [22] to predict the outcomes of Senate elections in the United States in 72.4% of the cases. It seems politicians who simply look more competent are more likely to win elections [23], [24].

First impressions also influence legal decision-making [25]. Nevertheless, at the same time that different research communities (e.g., machine learning, computer vision and psychology) are advancing the state-of-the-art in the field in different directions, it was recently observed¹ that some Artificial Intelligence based models exhibit racial and gender biases, which are considered extremely complex and emerging issues. One possible explanation for these problems (i.e., in the case of first impressions) is that the ground-truth annotations used to train AI-based models are given by individuals and may reflect their bias/preconception towards the person in the images or videos, even though it may be unintentional and subconscious [26]. Hence, trained classifiers can inherently contain a subjective bias. This phenomenon was also observed and analyzed in natural language processing [27], and should receive special attention from all research areas.

2.2 How challenging and subjective can apparent personality trait labeling/evaluation be?

The outcomes of machine learning based models will in some way be a reflection of the learning data they use, i.e., in the case of first impression analysis, the labels provided by the annotators. The validity of such data can be very subjective due to several factors, such as cultural [28], [29], social [21], [30], contextual [20], gender [31], [32], and appearance [33] effects, which makes research and development on personality perception a very challenging task.

1. https://www.theguardian.com/technology/2017/apr/13/ai-programs-exhibit-racist-and-sexist-biases-research-reveals



The subjectivity of impressions raises further questions about how many annotators should be involved in an experiment and how much they should agree with one another. When it comes to the Big-Five model, the literature suggests that agreement should be measured in terms of the amount of variance shared by the observers. In general, low agreement should not be considered the result of low-quality judgments or data, but an effect of the inherent ambiguity of the problem [4]. However, existing works analyse apparent personality as a universal perception, that is, the impressions given by different observers concerning a particular individual are averaged. Although this has become a standard procedure, we argue it does not accurately reflect how first impressions work in real life. Person perception is conditioned on the observer. Thus, particularities of the different populations of observers, such as cultural differences [28], should also be taken into account. The aim of this section, rather than trying to answer the above questions, is to introduce a brief review and discussion of the topic.
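To make the shared-variance notion concrete, one common way to quantify it is an intraclass correlation computed across observers. The sketch below (our own simplification, not any specific paper's protocol) estimates the proportion of between-target variance in a targets × raters score matrix.

```python
import numpy as np

def icc_1_1(scores):
    """One-way random ICC(1,1): proportion of score variance shared
    by raters. scores: array of shape (n_targets, k_raters)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    target_means = x.mean(axis=1)
    grand = x.mean()
    msb = k * ((target_means - grand) ** 2).sum() / (n - 1)        # between targets
    msw = ((x - target_means[:, None]) ** 2).sum() / (n * (k - 1))  # within targets
    return (msb - msw) / (msb + (k - 1) * msw)

# Low values need not indicate bad annotators: the inherent ambiguity
# of the task itself lowers agreement [4].
```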

Jenkins et al. [32] analyzed the variability in photos of the same faces. According to their study, within-person variability exceeded between-person variability in attractiveness, suggesting that the consequences of within-person variability are not confined to judgments of identity. It was also observed that female annotators tended to be rather harsh on the male faces (i.e., gender issues). Although attractiveness is not explicitly related to personality, these results indicate how complex the task of social judgment is. The study presented in [30] revealed that the most important source of within-person variability related to social impressions of the key traits of trustworthiness, dominance and attractiveness, which index the main dimensions in theoretical models of facial impressions, is the emotional expression of the face, but the viewpoint of the photograph also affects impressions and modulates the effects of expression.

Within-person variance in behavior is likely to be a response to variability in relevant situational cues (e.g., people are more extraverted in large groups than in small groups, even though some individuals may not increase or may even decrease their level of extraversion with the size of the group) [34]. Because situational cues vary in everyday behavior, behavior varies as well. Abele and Wojciszke [6] analyzed the Big-Two model (i.e., Agency, also called competence or power, and Communion, also called warmth, morality or expressiveness) from the perspective of self versus others. According to their work, agency is more desirable and important in the self-perspective, and communion is more desirable and important in the other-perspective, i.e., “people perceive and evaluate themselves and others in a way that maximizes their own interests and current goals”.

Cross-cultural consensus and differences were found in [28] when addressing the problem of universal and cultural differences in forming personality trait judgments from faces. Using a face model, the authors were able to formalize the static facial information that is used to make certain personality trait judgments, such as aggressiveness, extroversion and likeability. According to their study, Asian and Western participants were able to identify the enhanced salience of all the different traits in the faces, suggesting that the associations between mere static facial information and certain traits are highly shared among participants from different cultural backgrounds. Nevertheless, faces with enhanced salience of aggressiveness, extroversion, and trustworthiness were better identified by Western than by Asian participants.

Even though the problem of stereotyping is minimized in [28] (which could otherwise bias the analysis) through the use of manipulated faces (synthetic faces are used in [23], [25] for the same purpose), it should be noted that social judgments in real situations are formed from different sources, such as pose, gaze, facial expression or styling. For example, hairstyle, which is extra-facial information, can be intentionally chosen by target persons to shape others' impressions of them. According to Vetter and Walker [35], in order to systematically investigate how faces are perceived, categorized or recognized, we need control over the stimuli we use in our experiments. The problem is not only to get face images taken under comparable lighting conditions, distance from the camera, pose or facial expression, but to get face stimuli with clearly defined similarities and differences.

Barratt et al. [36] revisited the classical problem of the Kuleshov effect. According to film mythology, the filmmaker Lev Kuleshov conducted an experiment (in the early 1920s) in which he combined a close-up of an actor's neutral face with three different emotional contexts: happiness, sadness and hunger. The viewers reportedly perceived the actor's face as expressing an emotion congruent with the given context. Recent attempts at replicating the experiment have produced either conflicting or unreliable results. However, it was observed that some sort of Kuleshov effect does in fact exist. Olivola and Todorov [37] evaluated the ability of human judges to infer the characteristics of others from their appearance. They found that judges are generally less accurate at predicting characteristics than they would be if appearance cues were ignored, suggesting that appearance is overweighted and can have detrimental effects on accuracy.

More recently, Escalante et al. [26] analyzed the First Impressions dataset [9] and the top winning approaches of the ChaLearn LAP Job Candidate Screening Challenge [38]. Part of the study focused on the existence of latent bias towards gender and ethnicity. When correlating these variables with apparent personality annotations (Big-Five), they first observed an overall positive attitude/preconception towards females in both the personality traits (except for agreeableness) and the job interview invitation variable. Moreover, gender bias was observed to be stronger than ethnicity bias. Concerning ethnicity, results indicated an overall positive bias towards Caucasians, and a negative bias towards African-Americans. No discernible bias towards Asians in either direction was observed.

2.2.1 Discussion

This section discusses the topics raised in Sec. 2.2, covering the main challenges related to data labeling, the different solutions employed to address worker bias, and suggestions for future research.

Data labels. Annotation protocols for personality perception require special attention. The challenge lies in assigning a particular score to a certain trait, either from a continuous domain or within a specific range (e.g., using a Likert scale [39]), which by default is extremely subjective, time-consuming and influenced by worker bias.


This study revealed that there is no standard protocol for annotating data for personality perception, and no gold standard for apparent personality. According to [40], every individual may perceive others in a different way, which poses a great obstacle to creating a model addressing the issues of annotator subjectivity and bias. To illustrate how challenging data labeling in this area can be, Nguyen and Gatica-Perez [16] asked Amazon Mechanical Turk (AMT) workers to label 939 videos with respect to the Big-Five model using a five-point Likert scale and a standard personality inventory questionnaire. In their work, only the extraversion trait was observed to be consistently rated.

Reducing bias. Worker bias is particularly difficult to evaluate and correct when many workers each contribute just a few labels, which is typical when labeling is crowd-sourced. Bremner et al. [41] proposed to identify annotators who assigned labels without looking at the content by removing those who incorrectly answered a test question. Pairwise comparisons [9], [24], [42] have been demonstrated in recent studies to be a very effective way to address worker bias. Joo et al. [24] asked AMT workers to compare a pair of images in given dimensions rather than evaluating each image individually. A similar strategy is applied in [9] for video files, which includes an algorithm to estimate continuous scores (i.e., ground-truth labels) from pairwise annotations [42]. Comparison schemes have three main advantages: 1) the annotators do not need to establish absolute baselines or scales for these social dimensions, which would be inconsistent, i.e., “what does a score of 0.3 mean?”; 2) they naturally identify the strength of each sample in terms of its relational distance from other examples, generating a more reliable ranking of subtle signal differences [24]; and 3) they prevent previously annotated samples from biasing future scores (i.e., scoring someone very low on a certain trait because of an unconscious comparison with previous videos/images where the score was high).
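A minimal sketch of how continuous scores can be recovered from pairwise votes, in the spirit of [9], [42] (the exact estimator used there differs): the Bradley-Terry model below is one standard choice, fit with a simple minorization-maximization update.

```python
import numpy as np

def bradley_terry(wins, iters=100):
    """wins[i, j] = number of times item i was preferred over item j
    (zero diagonal). Returns latent scores normalized to [0, 1],
    estimated with Hunter's MM updates for the Bradley-Terry model."""
    n = wins.shape[0]
    comparisons = wins + wins.T          # n_ij: total i-vs-j comparisons
    w = np.ones(n)
    for _ in range(iters):
        denom = comparisons / (w[:, None] + w[None, :] + 1e-12)
        np.fill_diagonal(denom, 0.0)
        w = wins.sum(axis=1) / (denom.sum(axis=1) + 1e-12)
        w /= w.sum()                     # fix the arbitrary scale
    scores = np.log(w + 1e-12)
    return (scores - scores.min()) / (np.ptp(scores) + 1e-12)

# e.g., per-trait video scores from crowd-sourced pairwise judgments:
# trait_scores = bradley_terry(pairwise_wins_matrix)
```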

Future directions. The area of personality computing would benefit from the design, definition and release of new, large and public databases, as well as from the design of standard protocols for data collection, labeling and evaluation. According to our study, ~40% of the reviewed works developed for personality perception are evaluated on public databases without any type of customization (in most cases, small in size and/or composed of a small number of participants); around 25% of works are evaluated on private databases, and the remaining ones on customized versions of public databases. The use of private or customized databases makes comparison among works a big challenge, creating a barrier to advancing the state-of-the-art in the field. Disregarding the different applications (e.g., face-to-face interviews, HRI, or conversational videos), which can influence the design of new databases, we argue that future databases should be composed of at least a large number of samples from a heterogeneous population (with respect to both observed people and annotators), so that current and/or future works can generalize to different cultures more easily. We envisage two main scenarios that could make a big difference in future research on the topic, as follows.

Joint analysis of real and apparent personalities: this will require the design of new, large and public databases containing labels for both real and apparent personality. Up to now, the correlation analysis of both personality types, from a computer vision/machine learning point of view, has not been fully exploited. Despite the great difficulty of accomplishing such a task, as it requires data collection (self-reports and annotations) from a large population, it could benefit different research lines in the field (discussed in Sec. 2.7). For instance, when addressing real and apparent age estimation, Jacques et al. [43] showed that the subjective bias contained in the perception of age can be used to better approximate the real target. As far as we know, a similar idea has never been exploited in the context of personality trait analysis. Then, questions such as “what is the effect of the real personality of someone on his/her perceived personality?”, or “is it possible to accurately regress the real personality from perception mechanisms?”, could be studied.

Correlate observer vs. observed: “what is the observer looking at?” or “what characteristics does the observer have?”, or, even better, a combination of both questions. Existing works do not consider any information about the observers (apart from their impressions) to perform automatic personality perception, such as cultural similarities or differences [28] with respect to different target populations. Future research could take into account, for example, the gender, age, ethnicity, or even the real personality (making a link to the previous scenario) of both observers and the people being observed. Taking the observers' characteristics into account would move research in this area to another level. Nevertheless, the above questions may pose a great challenge regarding privacy and ethical issues. A preliminary analysis of this topic was presented in [26]. The authors analyzed the correlation between gender and ethnicity (of the people being observed) and the interview variable contained in their database (provided by the observers). However, no data about the annotators were collected.

Some extra questions still remain open and could be a subject for future research, such as whether the universal and cultural differences studied in [28] can be generalized to faces from other cultural backgrounds, whether the personality impression about one trait can influence the impressions about other traits [44], or the relationship between nonverbal content and personality variables/scores [45].

2.3 Still Images

This section reviews the very few works developed for automatic personality perception from still images (where neither audio nor temporal information is used). This class of works usually focuses on facial information to drive the models, generally combining features at different levels and their relationships. Note that some works developed for other categories (e.g., image sequences or audiovisual) could be applied (or easily adapted) to still images, as they perform a frame-by-frame prediction before a final aggregation or fusion step (responsible for dealing with the temporal information).

In the work of Guntuku et al. [46], low-level features are employed to detect mid-level cues (gender, age, presence of image editing, etc.), which are used to predict real and apparent personality traits (Big-Five) of users in self-portrait images (selfies). Even though a small dataset is used (i.e., composed of 123 images from different users), the authors presented some insights on which mid-level cues contribute to personality recognition and perception (see Table 2).


Yan et al. [47] studied the relationship between facial appearance and personality impressions in the form of trustworthiness. In their work, different low-level features are extracted from different face regions, as well as relationships between regions. For instance, HoG is used to describe eyebrow shape, and the Euclidean distance to describe eye width. To alleviate the semantic gap between low-level and high-level features, mid-level cues are built through clustering. Then, a Support Vector Machine (SVM) is used to find the relationship between face features and personality impressions.
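A rough sketch of this kind of pipeline (region choice, cluster count and classifier settings are our illustrative assumptions, not the exact configuration of [47]): low-level descriptors per face region, k-means to form mid-level cues, and an SVM on top.

```python
# Illustrative sketch, not the exact pipeline of [47]: HOG descriptors
# of grayscale face-region crops -> k-means "mid-level cues" -> SVM.
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def region_descriptor(region_img):
    # e.g., eyebrow shape via HOG; crops would come from landmarks
    return hog(region_img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def fit_mid_level(train_regions, n_cues=32):
    feats = np.stack([region_descriptor(r) for r in train_regions])
    km = KMeans(n_clusters=n_cues, n_init=10).fit(feats)
    return km, feats

# km, feats = fit_mid_level(eyebrow_crops)
# Distances to cluster centers act as a mid-level representation:
# svm = SVC(kernel="rbf").fit(km.transform(feats), trust_labels)
```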

Dhall and Hoey [48] exploited a multivariate regression approach to infer personality impressions of users from Twitter profile images. Hand-crafted and deep learning based features are computed from the face region. Background information is also considered, as it may affect personality perception, and a high correlation between openness and scene descriptors was observed, suggesting that the context where pictures are taken loosely relates to a person's ability to explore new places. In [49], the authors combined eigenface features with an SVM to investigate whether people perceive an individual portrayed in a picture to be above or below the median with respect to each Big-Five trait.
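For the eigenface-style approach of [49], a minimal scikit-learn sketch follows; the median split mirrors the paper's binary formulation, while the component count and kernel are our own illustrative choices.

```python
# Minimal eigenfaces + SVM sketch in the spirit of [49]; settings are ours.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def fit_trait_classifier(face_vectors, trait_scores, n_components=50):
    """face_vectors: (n_samples, n_pixels) flattened aligned face images.
    Binary target: above vs. below the median of one Big-Five trait."""
    y = (trait_scores > np.median(trait_scores)).astype(int)
    model = make_pipeline(PCA(n_components=n_components, whiten=True),
                          SVC(kernel="rbf"))
    return model.fit(face_vectors, y)
```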

2.3.1 Discussion

This section discusses the importance of semantic attributes, as well as limitations and future research directions related to Sec. 2.3.

Mid-level features. Facial landmarks seem to be the starting point of different feature extraction methods [48], especially those which exploit mid-level features or semantic attributes [46], [47]. Mid-level features or semantic attributes usually carry meaningful information, which can be used to complement other low-level features and thus improve accuracy. They also enable more interpretable analysis of the results. For instance, when studying social impressions, Joo et al. [24] analyzed the correlation between a set of mid-level attributes and social dimensions (e.g., attractiveness, intelligence and dominance), and investigated which face regions contribute most to each trait dimension. Such analysis could be considered a step forward in the direction of feature selection for automatic personality perception. Moreover, mid-level cue detectors outperformed many of the low-level features analysed in [46] for almost all trait dimensions of the Big-Five model, reinforcing their benefit.

Current limitations. Most works presented in this section either built their own datasets [46] (e.g., collecting data from the Internet) or adapted to their needs [47], [49] datasets developed for other purposes (e.g., face recognition). The common point of these works is that images need to be labelled, and the labels usually are not made public. This way, reproducing their results can be a big challenge. The Twitter profile images collected in [50] are used in [48]; however, baseline labels (Big-Five) are created through the analysis of users' tweets. These points reinforce the fact that new and large public datasets, with associated standard protocols, are fundamental to advancing the state-of-the-art in the field.

Future directions. The analysis of image content outside the face region, as performed in [48], is a topic which deserves further attention (and is not specific to still images, as it could be applied to other categories). As emphasized in [51], when addressing the perception of emotions from images, the context has an important role, which is completely aligned with personality perception studies. Although some works proposed to ignore background information, hairstyle or clothes [23], [25], [28], people can intentionally combine such information to shape others' impressions of them. Moreover, the literature shows that context also influences annotators during data labeling. In addition to the context, high-level features extracted from body pose, gestures or facial expression, which have already been exploited by some works in other categories, have not been fully exploited when still images are considered. Body language analysis [52], which is an emerging research topic in computer vision, could benefit apparent personality trait analysis in many ways. It includes different kinds of nonverbal indicators, such as gaze direction, position of the hands and the style of smiling, among others, which are important markers of the emotional and cognitive inner state of a person.

2.4 Image Sequences

Works exploiting visual cues from image sequences are presented next. They benefit from temporal information and scene dynamics (without acoustic information), which bring useful and complementary information to the problem.

Biel et al. [18] studied personality impressions in conversational videos (vlogs) using a subset of the YouTube vlog dataset [53], focusing on facial expression analysis. Automatic personality perception is addressed using Support Vector Regression (SVR) combined with statistics of facial activity based on frame-by-frame estimates. Results show that extraversion is the trait showing the largest activity cue utilization (reinforced in Table 2), which is related to evidence found in the literature that extraversion is typically easier to judge [54], [55]. Later, Aran and Gatica-Perez [56] investigated the use of social media content as a domain for transfer learning from conversational videos to small group settings. They considered the particular trait of extraversion, and addressed the problem by combining Ridge Regression and SVM classifiers with statistics extracted from weighted Motion Energy Images. In [44], the connections between facial emotion expressions and personality impressions are analysed as an extension of [18]. Four sets of behavioral cues (and fusion strategies) that characterize face statistics and dynamics over brief observation windows are proposed to represent facial patterns. Co-occurrence analysis (e.g., smiling with surprise) is also exploited. Finally, the inference task is addressed using SVR. Their study shows that while multiple facial expression cues have significant correlations with several of the Big-Five traits, they were only able to significantly predict extraversion impressions.
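A simplified sketch of this statistics-plus-SVR strategy (the specific statistics and SVR settings below are our own; [18] derived its per-frame cues from a facial activity detector):

```python
# Sketch of [18]-style prediction: per-frame facial activity estimates
# are summarized by simple statistics, then regressed onto a trait score.
import numpy as np
from sklearn.svm import SVR

def video_descriptor(frame_activity):
    """frame_activity: 1-D array, one facial-activity estimate per frame."""
    a = np.asarray(frame_activity, dtype=float)
    return np.array([a.mean(), a.std(), a.min(), a.max(),
                     np.median(a), np.abs(np.diff(a)).mean()])

# X = np.stack([video_descriptor(v) for v in per_video_activity])
# extraversion_model = SVR(kernel="rbf").fit(X, extraversion_scores)
```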

Taking Human-Computer Interaction (HCI) into account, Celiktutan and Gunes [57] addressed the challenging task of continuous prediction of perceived traits in space and time. According to the authors, continuous predictions of first impressions had not been explored before. In their work, external observers were asked to continuously provide ratings along multiple dimensions (Big-Five) in order to generate continuous annotations of video sequences. The inference problem is then addressed using low-level visual features (e.g., HoG/HoF) combined with a linear regression method.


The work was extended in [58], [59] with real-time capabilities, audio-only and audio-visual data analysis.

Considering the great advances in the field of deep learning, Gurpinar and collaborators [19] employed a pretrained Convolutional Neural Network (CNN) to extract facial expressions as well as ambient information (which is ignored by most competing works on the topic) on the ChaLearn First Impressions [9] dataset. Visual features that represent facial expressions and the scene are combined and fed to a Kernel Extreme Learning Machine (ELM) regressor. Ventura et al. [60] studied why CNN models perform surprisingly well in automatically inferring first impressions. Results show that the face provides most of the discriminative information for personality impression inference, and that the internal CNN representations mainly analyze key face regions such as the eyes, nose, and mouth.

Unlike the latter works, Bekhouche et al. [61] combined texture features, extracted from the face region, with five SVRs to estimate apparent personality traits (Big-Five). As reported by the authors, although deep learning based approaches can achieve better results, temporal face texture based approaches are still very effective.

2.4.1 Discussion

Next, Sec. 2.4 is discussed. Topics such as interaction types, spatio-temporal information modeling, slice length/location and prospects for future research are covered.

Type of interaction. Two works presented in this section are related to humans interacting either with a virtual agent [57] or with small groups of people [56]. Hypothetically, one could consider that some kind of interaction exists when people talk to a camera, as in [18], [19], [56], [60], [61]. According to [56], people talk to the camera in vlogs as if they were talking to other people. When some kind of interaction is considered, classes of features that encode specific aspects of social interactions can be exploited, such as visual activity, facial expressions or body/head motion.

Spatio-temporal information. A preliminary analysis suggests that the inclusion of spatio-temporal information placed first impression analysis on a new level (compared to still images), with a wider range of applications. Continuous prediction remains a little-explored research line. This may be due to the challenging and complex task of generating accurate labels over time for a huge amount of data, in particular when deep learning based methods are considered, which, to our knowledge, have not been employed in this context yet. Celiktutan and Gunes [57] pioneered continuous prediction of first impressions. However, they used a small dataset composed of 30 video recordings captured from 10 subjects. Moreover, their continuous prediction can be interpreted as a frame-based regressor where each frame is treated independently; thus, the dynamics of the scene are not completely explored. In [19], [61], short video clips are globally represented with statistics computed over the sequence of frames. Even though such an approach does not treat the frames independently, it still does not consider the temporal evolution of the data. In [44], statistics of facial expression outputs are characterized as dynamic signals over brief observation windows, which can be considered a step forward in dynamically analyzing first impressions along the temporal dimension.

Temporal information has also been exploited through relatively simple motion pattern representations [56], [61], which can have a positive impact on the outcomes. However, the considered motion-based approaches are still quite limited. Very few works (presented in Sec. 2.5) have proposed to model spatio-temporal dynamics using more powerful and advanced techniques such as 3D-CNNs [62] (to model local temporal patterns) or Long Short-Term Memory (LSTM) networks [62], [63]. Therefore, the influence (either in prediction or data labeling), the benefits, and the appropriate ways to model temporal information are still open questions on the topic, as briefly discussed next (and later in Sec. 2.5.3, w.r.t. slice length/location) and in Sec. 2.6.1 (future directions).
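To contrast the two strategies discussed above, the PyTorch sketch below (our own construction, not a reviewed model; sizes are arbitrary) keeps the temporal order of per-frame features with an LSTM instead of collapsing them into global statistics.

```python
# Sketch (ours, not a reviewed model): regress Big-Five impressions from
# an ordered sequence of per-frame features, rather than global stats.
import torch
import torch.nn as nn

class SequenceTraitRegressor(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, n_traits=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_traits)

    def forward(self, x):              # x: (batch, n_frames, feat_dim)
        _, (h, _) = self.lstm(x)       # h[-1]: last hidden state per clip
        return torch.sigmoid(self.head(h[-1]))  # trait scores in [0, 1]

# model = SequenceTraitRegressor()
# preds = model(torch.randn(8, 100, 128))   # 8 clips, 100 frames each
```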

Slice length/location. The predictive power of facial expressions, depending on the duration and relative position of specific vlog segments, is analyzed in [44]. Results suggest that viewers' impressions are better predicted by features computed at the beginning of each video, corroborating the idea that first impressions are built from short interactions [13]. Nevertheless, the authors reported that further research is needed to confirm their hypothesis, as well as to verify whether the same effect is observed for different nonverbal sources, and whether the optimal duration and position of the slices are the same for each data type.

Future directions. This study revealed that the number of methods developed for image sequences is significantly smaller than the number developed for the audiovisual category (Sec. 2.5), which may be related to the improvements obtained by the inclusion of complementary information or to the different applications being considered. A few of the works described in this section (e.g., [19], [57]) have been extended to consider audiovisual information [63], [64], [65], emphasizing the benefits of including acoustic features in the pipeline. Nevertheless, in general, works are not exploiting the full benefits of temporal information. Feature representations that can keep the temporal evolution of the data, such as the Dynamic Image Networks [66] used in action recognition, should be considered in future research.

Although great advancements have been reported with deep learning based approaches [19], [60], they are often perceived as black-box techniques, i.e., they are able to effectively model very complex problems, but they cannot be easily interpreted, nor can their predictions be easily explained. Because of this, explainability and interpretability deserve special attention. In fact, the interest in these topics is evidenced by the organization of dedicated events, such as thematic workshops [67], [68] and challenges [26], [38]. However, this kind of research is just in its infancy.

2.5 Audiovisual trait prediction

In this section, we review works using both acoustic (nonverbal) and visual features to perform automatic personality perception. Works are further classified based on the type of interaction (with/without), as they may use datasets, features and methodologies developed for different purposes. Interactive approaches, for instance, can exploit particular behaviours usually found in interactive scenarios (e.g., attention received while speaking), while non-interactive methods are generally focused on the individual.


2.5.1 Interactive approaches

Aran and Gatica-Perez [69] addressed personality perception during small group interactions using a subset of the ELEA corpus [70]. Personality impressions needed to be collected, as the dataset did not provide them. The inference task is addressed using Ridge Regression combined with statistics computed from the given video segments (e.g., average speaking turn, prosodic features and visual activity). For a comprehensive review of nonverbal cues applied to small group settings, we refer the reader to [71].

Focusing on feature representation for personality and social impressions, Okada et al. [72] proposed a co-occurrence event mining framework for multiparty and multimodal interactions. According to the authors, the use of co-occurrence patterns between modalities yields two main advantages: (i) it can improve the inference accuracy of trait values based on a richer feature set, and (ii) it can discover key context patterns linking personality traits. In their work, speech utterances, body motion and gaze are represented as time-series binary data, and co-occurrence patterns are defined as multimodal events overlapping in time. Co-occurrence events are then detected through clustering before a final inference using Ridge Regression and a linear SVM.
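A toy sketch of the co-occurrence idea (the representation below is our own simplification of [72]): each modality is a binary time series on a shared grid, and a co-occurrence event is a span where two modalities are simultaneously active.

```python
# Toy sketch in the spirit of [72]: modalities as binary time series;
# an event "co-occurs" when two modalities are simultaneously active.
import numpy as np

def cooccurrence_rate(speaking, gazing_at_partner):
    """Both inputs: binary arrays sampled on the same time grid."""
    s = np.asarray(speaking, dtype=bool)
    g = np.asarray(gazing_at_partner, dtype=bool)
    overlap = s & g                 # frames where both modalities are active
    return overlap.mean()           # fraction of co-occurring time

# Features like this, pooled over a meeting and clustered into recurring
# patterns, would feed the final Ridge/SVM inference stage of [72].
```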

Staiano et al. [73] focused on feature selection to model the dynamics of personality states in a meeting scenario. A personality state refers to a specific behavioral episode wherein a person behaves as more or less introverted/extraverted, neurotic or open to experience, etc.; this is also referred to as situational cues [34]. According to the authors, the problem with “traditional approaches” is that they assume a direct and stable relationship between, e.g., being extravert and acting extravertedly (speaking loudly, being talkative, etc.). However, “on the contrary, extraverts can sometimes be silent and reflexive, while introverts can at times exhibit extraverted behaviors. Similarly, people prone to neuroticism do not always exhibit anxious behavior, while agreeable people can sometimes be aggressive” [73]. In their work, several low-level (acoustic) and high-level features (attention given/received in the form of head pose, gaze and voice activity) are combined with different machine learning approaches.

More recently, Celiktutan and Gunes [63] addressed how personality impressions fluctuate with time and situational context. First, audio-visual features are extracted (e.g., face/head/body movements, geometric and hybrid features). Then, a Bidirectional LSTM network is employed to model the temporal relationships between the continuously generated annotations and the extracted features. Finally, decision-level fusion is performed to combine the outputs of the audio and visual regression models. Nevertheless, their study required a database (a subset of the SEMAINE corpus [74]) annotated in a time-continuous manner.
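Decision-level fusion of this kind can be sketched as a weighted combination of per-modality predictions; the fixed weight below is a placeholder assumption (in practice it would be tuned on validation data, and [63] may combine the outputs differently).

```python
# Minimal decision-level fusion sketch: combine the outputs of separate
# audio and visual regressors; the weight is a placeholder to be tuned.
import numpy as np

def fuse_decisions(audio_pred, visual_pred, w_audio=0.4):
    """Each input: (n_timesteps, 5) continuous Big-Five predictions."""
    return (w_audio * np.asarray(audio_pred)
            + (1.0 - w_audio) * np.asarray(visual_pred))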

Interested in how differences in situational context affect trait perception and labelling, Joshi et al. [40] analyzed thin slices (14 sec long) of behavioral data in HCI settings. The authors analyzed (i) the differences between the perceived traits (Big-Five and social impressions) during audio-visual and visual-only observations; (ii) the deviation in perception when there is a change of situational context; and (iii) the change in the perception marked by an external observer when the same individual interacts with different virtual characters (exhibiting specific emotional and social attributes). To take into account errors induced by subjective biases, they proposed a framework which encapsulates a weighted model based on linear SVR and low-level visual features computed over the face region.

Bremner et al. [41] investigated how robot mediation affects the way the Big-Five personality traits of the operator are perceived. Results showed (i) that observers use robot appearance cues along with operator vocal cues to make their judgments; (ii) that operators' gestures reproduced on the robot aid personality judgments; and (iii) that personality perception through robot mediation is highly operator-dependent. Extending [41], Celiktutan et al. [75] showed that apparent personality classification from nonverbal cues works better than from audio only (except for agreeableness), and that facial activity and head pose, together with audio and arm gestures, play an important role in conveying specific personality traits in a telepresence context.

2.5.2 Non-interactive settings

In general, works falling in this category are those exploiting conversational videos, self-presentations or video resumes.

Biel and Gatica-Perez [54], [55] pioneered personality impressions in vlogs from the perspective of audiovisual behavioral analysis. In [54], they studied the use of nonverbal cues as descriptors of vloggers' behavior. Later [55], they addressed the problem as a regression task, where features extracted from audio (speaking activity and prosody), video (looking activity, pose and overall motion) and co-occurrence events (e.g., looking-while-speaking) were combined with SVR to infer apparent personality. In both works, the analyses are performed on thin vlog slices (1 minute). An extensive review discussing both verbal and nonverbal aspects of vlogger behaviors is presented in [76].

Nguyen and Gatica-Perez [16] analyzed the formation of job-related first impressions in conversational video resumes. In fact, job recommendation systems based on the visual analysis of nonverbal human behavior have received a lot of attention over the past few years [77], [78], [79], with social impressions being the focus of most works. According to [16], online video resumes represent an opportunity to study the formation of first impressions in an employment context at a scale never attempted before. In their work, the linear relationships between nonverbal behavior and the organizational constructs of hirability and apparent personality (Big-Five) are examined. Different regression methods are analyzed for the prediction of personality and hirability impressions from audio (speaking activity and prosody) and visual cues (proximity, face events and visual motion). Results suggest that combining feature groups strongly improves accuracy, reinforcing the benefits of including complementary information in the pipeline. More recently, Gatica-Perez et al. [80] addressed the recognition of personal state and trait impressions in a longitudinal study using behavioral data of vloggers who posted on YouTube for a period of between three and six years. The dataset is composed of a small number of participants, and results do not show any significant temporal trend related to personality.
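
As a minimal sketch of this kind of analysis (with synthetic data standing in for real audio and visual descriptors, so the printed scores are meaningless except as a template), one can compare individual feature groups against their concatenation with a simple ridge regressor:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200                              # number of video resumes (dummy)
audio = rng.normal(size=(n, 8))      # e.g., speaking activity, prosody
visual = rng.normal(size=(n, 12))    # e.g., proximity, face events, motion
y = rng.uniform(size=n)              # impression score in [0, 1]

for name, X in [("audio", audio), ("visual", visual),
                ("audio+visual", np.hstack([audio, visual]))]:
    r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean()
    print(f"{name:>12}: mean R^2 = {r2:.3f}")
```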

Following the recent advancements of CNNs, Gucluturk et al. [15] presented an audiovisual Deep Residual Network (trained end-to-end) for apparent personality trait recognition. The network does not require any feature engineering or visual analysis such as face/landmark detection and alignment or facial expression recognition. Auditory and visual streams are merged into an audiovisual stream, which comprises a fully-connected layer. At the training/test stage, the fully-connected layer outputs five continuous prediction values, one per trait, for the given input video clip. Their work won third place in the ChaLearn First Impressions Challenge [9] (1st round), whereas [62] and [14] achieved second and first place, respectively. The work [15] was extended in [8] to consider verbal content, and to predict an "invitation to job interview" variable. In [62], two end-to-end trainable deep learning architectures are proposed to recognize personality impressions. The networks have two branches, one for encoding audio and the other for visual features. The first model is formulated as a Volumetric (3D) CNN, while the second one is formulated as an LSTM-based network, to learn temporal patterns in the audio-visual channels. Both models concatenate statistics of certain acoustic properties (obtained from non-overlapping partitions) and visual data (from segmented faces, after landmark detection) at a later stage.
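
The overall shape of such a two-branch audiovisual network can be sketched as follows (a deliberately simplified illustration under our own assumptions; the actual models in [15], [62] use much deeper residual, 3D-convolutional or recurrent backbones):

```python
import torch
import torch.nn as nn

class AudioVisualNet(nn.Module):
    """Each modality is encoded separately; the embeddings are concatenated
    and a fully-connected layer outputs five continuous trait scores."""
    def __init__(self):
        super().__init__()
        self.audio = nn.Sequential(      # raw waveform -> embedding
            nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.visual = nn.Sequential(     # RGB frame -> embedding
            nn.Conv2d(3, 16, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(16 + 16, 5)  # Big-Five predictions

    def forward(self, wav, img):
        z = torch.cat([self.audio(wav), self.visual(img)], dim=1)
        return torch.sigmoid(self.fc(z))  # trait scores in [0, 1]

model = AudioVisualNet()
wav = torch.randn(2, 1, 16000)            # ~1 s of dummy audio
img = torch.randn(2, 3, 112, 112)         # one dummy video frame
scores = model(wav, img)                  # shape: (2, 5)
```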

In order to capture rich information from both the visual and audio modalities, Zhang et al. [14], [81] proposed a Deep Bimodal Regression framework. They modified the traditional CNN to exploit important visual cues (introducing what they called Descriptor Aggregation Networks), and built a linear regressor for the audio modality. To combine complementary information from the two modalities, they ensembled the predicted regression scores by both early and late fusion. Gurpinar et al. [65] extended their previous work [19] (briefly described in Sec. 2.4) with the inclusion of additional visual descriptors, acoustic features and a weighted score-level fusion strategy, and ranked first in the ChaLearn First Impressions Challenge [82] (2nd round).
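
Score-level (late) fusion of modality-specific regressors is simple to express. The sketch below is our own illustration with made-up predictions and a single made-up weight; in practice the weight(s) would be tuned per trait on validation data:

```python
import numpy as np

# Dummy per-trait predictions in [0, 1] from two modality-specific models,
# for a batch of 3 clips and the 5 traits (O, C, E, A, N).
visual_pred = np.array([[.6, .5, .7, .4, .3],
                        [.2, .8, .5, .6, .4],
                        [.5, .5, .5, .5, .5]])
audio_pred = np.array([[.5, .6, .8, .5, .2],
                       [.3, .7, .4, .7, .5],
                       [.6, .4, .6, .4, .6]])

# Weighted score-level fusion of the two unimodal predictions.
w = 0.65
fused = w * visual_pred + (1 - w) * audio_pred
print(fused.round(3))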

2.5.3 Discussion

A general discussion about Sec. 2.5 is presented next, covering current limitations, slice length/location, co-occurrence event mining, personality states, job recommendation systems and prospects for future research directions.

Current limitations. The ELEA [70] dataset, employed in [69], was captured in a very specific and controlled environment, and was developed for analyzing emergent leadership. Of the 40 recorded meetings, only 27 have both audio and video (for more details, see Table 4). From the point of view of deep learning based approaches, which are dominating different lines of research in social/affective computing, small datasets have limited application. The Mission Survival Task corpus [83], employed in [73], is, according to the authors, not currently available. The databases used in [45], [84] are not publicly available because of the privacy-sensitive content of the interviews or data protection laws [16]. As recently reported in [63], most of the results found in the literature are not directly comparable to each other, as different evaluation protocols are employed.

Slice length/location. According to [69], external observers usually form their impressions based on thin slices selected from video samples. Thus, deciding which part of the video (from the whole sequence) will be analysed is a common requirement, and in most cases it is addressed empirically. Slice length/location is also studied in the context of social impressions from nonverbal visual analysis. Although having a different goal, it strongly overlaps with personality perception studies, as both areas are centered on the analysis of human behavior. For instance, Lepri et al. [85], [86] observed that classification accuracy (of the extraversion trait) is affected by the size of the slice when addressing real personality recognition (i.e., classification performance increases with slice length up to a certain amount of time, and then slightly decreases). In [78], manual "scene/slice segmentation" was performed to analyze different segments of a job interview in the context of social impressions. Results show that no slice clearly stood out in terms of predictive validity, i.e., all slices yielded comparable results. As stated in [78], and shared by personality impression studies, one of the challenges in thin-slice research is the amount of temporal support necessary for each behavioral feature to be predictive of the outcome of the full interaction. In other words, some cues need to be aggregated over a longer period than others. According to their study, and to our review, no metric assessing the necessary amount of temporal support for a given feature exists, neither in social impressions nor in personality perception. Thus, whether thin-slice impressions would generalize to the whole video for the prediction task is still an open question. Questions such as "is it possible to automatically select the slice of the whole clip that best describes first impressions? Will it generalize to the whole video?" could be a subject for future research in the field.
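
A common baseline, sketched below under our own assumptions (the slice length, stride and toy model are illustrative), is to score several thin slices independently and aggregate the per-slice predictions into a clip-level impression:

```python
import numpy as np

def predict_clip(frames, model, slice_len=150, stride=150):
    """Average trait predictions over consecutive thin slices of a clip.

    frames: array of shape (n_frames, feat_dim) with per-frame features.
    model: any callable mapping a slice of frames to 5 trait scores.
    """
    preds = [model(frames[s:s + slice_len])
             for s in range(0, len(frames) - slice_len + 1, stride)]
    return np.mean(preds, axis=0)

# Toy stand-in model: mean feature value copied to 5 identical scores.
toy_model = lambda s: np.full(5, s.mean())
frames = np.random.rand(450, 16)        # 450 frames of dummy features
print(predict_clip(frames, toy_model))  # aggregated Big-Five estimate
```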

Co-occurrence event mining. It has demonstrated to be very effective [16], [55], [72] as an alternative to exploit complementary information associated with body language [52], such as looking while speaking or attention received while silent. According to [87], the computer vision community has reached a point where it can start considering high-level reasoning tasks such as the "communicative intents" of images.

Personality states or situational context. These have a strong impact on personality perception, whether interactions are considered or not [34], [40], [63], [73], [80]. Although first impressions have been studied from different perspectives over the past few years, situational cues have not received enough attention from the computer vision community so far. The great complexity and subjectivity of this topic, along with existing dataset limitations, could be a possible explanation. One can imagine, for instance, how challenging the task of defining a new dataset on this topic can be, either across different contexts (e.g., at work, at a party, during a job interview, etc.) or even within the same context but at different time intervals [63]. Although it can be considered a challenging task, future research on this topic would have a strong impact on the whole research area.

Job recommendation. The question of why a particular individual receives a positive (or negative) evaluation based on first impression analysis deserves special attention from the research community, whether personality or hirability (social) impressions are considered. Note that a close link between the constructs of personality and hirability exists [88]. Automatic job recommendation systems can be very subjective, and might have a strong influence on our lives once they become common. Recent studies [26], including those submitted to a workshop organized by the ChaLearn group [38], sought to address this question.

Future directions. Despite its limitations, verbal content [76], combined with nonverbal visual data, is a potential direction to advance the state of the art in automatic personality perception (discussed in Sec. 2.6.1).

Taking into account the particular case of job recommendation systems, the differences across job types have not been investigated in computing [16], even though they have already been addressed in psychological studies. For instance, the expected behavior of a person applying for a sales position may differ from that of someone applying for an engineering position. Combining differences across job types with personality trait analysis could make job recommendation systems more transparent and inclusive.

Regarding the recently proposed CNN-based models for automatic personality perception [14], [15], [60], [62], we observed that there is still a long avenue to be explored. The top three winning methods [14], [15], [62] submitted to the ChaLearn First Impressions Challenge [9] obtained very similar overall performances (i.e., 0.913, 0.912 and 0.911, respectively) despite presenting different solutions, suggesting that the proposed architectures may be exploiting complementary features [26], which could be combined to improve overall accuracy. Moreover, deep neural networks are currently among the most promising candidates to tackle the challenges of multimodal data fusion [14], [62], [65], [81] and multi-task solutions in first impressions.

2.6 Multimodal trait prediction

This section reviews works using multimodal data for automatic personality perception, i.e., in addition to audio-visual cues, they may exploit verbal content, depth information, or data acquired by more specialized devices.

Biel et al. [39] addressed personality impressions of vloggers using Linguistic Inquiry and Word Count (LIWC) and N-gram analysis. While the focus of their work is on what vloggers say, a few experiments fusing verbal and nonverbal content were performed. Verbal content is also exploited in [76]; in this case, the work focuses on nonverbal content analysis. Both works [39], [76] use manual transcripts of vlogs to verify (in an error-free setting) the ability of verbal content to predict personality impressions (Big-Five). The feasibility of building a fully automatic framework was investigated using Automatic Speech Recognition (ASR). However, errors caused by the ASR system significantly decreased the performance.
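
In the same spirit (a toy sketch with a hypothetical mini-lexicon; the real LIWC categories are proprietary and far richer), verbal features can be built by normalizing word counts over hand-defined categories:

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for LIWC-style categories.
LEXICON = {
    "positive_emotion": {"happy", "great", "love", "nice"},
    "negative_emotion": {"sad", "angry", "hate", "awful"},
    "social": {"friend", "family", "talk", "party"},
}

def category_counts(transcript):
    """Per-category word frequencies from a transcript string."""
    words = transcript.lower().split()
    total = max(len(words), 1)
    counts = Counter()
    for w in words:
        for cat, vocab in LEXICON.items():
            if w in vocab:
                counts[cat] += 1
    return {cat: counts[cat] / total for cat in LEXICON}

print(category_counts("I love to talk with my friend it was great"))
```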

Chavez-Martínez et al. [89] considered the inference of mood and personality impressions (Big-Five) of vloggers from verbal content (i.e., categorizing word counts into linguistic categories, obtained from manual transcriptions) and nonverbal audio-visual cues (e.g., pitch, speaking rate, body motion and facial expression). High-level facial features are considered through the concept of compound facial expressions. The inference task is then addressed using a multi-label classifier. Results suggest that the combination of mood and trait labels improved overall performance in comparison with the mood-only and trait-only experiments.

Using a logistic regression model, Sarkar et al. [90] combined audiovisual (pitch, speech and movement analysis), verbal content (unigram bag-of-words and statistics from the transcriptions), demographic (gender) and sentiment features (e.g., positive/negative sentiment scores of the verbal content) for apparent personality trait (Big-Five) classification. Results show that different personality traits are better predicted using different combinations of features. Alam and Riccardi [91] reached a similar conclusion when addressing personality impressions using the same dataset (YouTube vlog [55]), i.e., the performance of each trait varies for different feature sets. In their work, linguistic, psycholinguistic and emotional features extracted from the transcripts are analyzed, in addition to the audio-visual features provided with the dataset. Similarly to [90], [91], Farnadi et al. [92] combined audiovisual and several text-based features with different multivariate regression techniques. The main differences among the above solutions [90], [91], [92] lie in the verbal features and the way the problem was modeled, as the audio-visual features were provided with the dataset (released in the WCPR2014 Challenge [93]).

Srivastava et al. [94] exploited audio, visual and lexical features to predict the Big-Five Inventory-10 answers, from which personality trait scores can be obtained. Audiovisual and verbal cues are combined for recognizing emotions, and used to learn a linear regression model based on the proposed Sparse and Low-rank Transformation (SLoT). A dataset composed of short movie clips (4-7 sec long each), manually labeled with BFI-10 answers and personality impressions, is used. Several high-level tasks are performed (e.g., face/emotion expression recognition, tracking, scene change detection), as the dataset can show multiple people and cuts, which makes the study even harder. Moreover, as dialogs are extracted from the movies' subtitles, audiovisual information might not be well synchronized with the text.

More recently, Gucluturk et al. [8] extended their previous work [15] to consider verbal content (extracted from audio transcripts provided with the data) as well as to infer hirability scores [26], [38]. Different modalities are analysed, including audio-only, visual-only, language-only, audiovisual, and a combination of audiovisual and language in a late fusion strategy. Results show that the best performance is obtained through the fusion of all data modalities.

Exploiting the concept of group engagement, Salam et al. [95] investigated how personality impressions of participants can be used together with the robot's personality to predict the engagement state of each participant in a triadic Human-Human-Robot Interaction setting. Nonverbal visual cues of individuals (e.g., body activity, appearance and visual focus of attention) and interpersonal features (e.g., relative distances, attention given/received) captured from RGB-D data are employed. However, several high-level tasks have to be addressed for feature extraction, such as ROI/group detection, body/head/face detection, skeleton joints, as well as robot detection. The authors also propose to use extroverted/introverted robots in the experiments to vary the context of the interactions.

Finnerty and collaborators [84] studied whether first impressions of stress are equivalent to physiological measurements of electrodermal activity (EDA) in the context of job interviews. The outcomes of job interviews are then analyzed based on features extracted from multiple data modalities (EDA, audio and visual). Even though focusing on stress impressions, they presented a preliminary analysis of the relationship among real and apparent personality and stress impressions, briefly discussed in Sec. 2.7.

2.6.1 Discussion

In this section, we discuss different aspects of complementary information and recurrent problems in the field, as well as prospects for future research directions related to Sec. 2.6. Then, we close the section with a summary of recent trend methodologies with respect to all data modalities.

Complementary information. Whether focusing on verbal [39] or nonverbal content analysis [76], this study revealed that overall improvements for each personality trait are obtained when different cues are employed [39], [90], [91], and that results can be further improved when different features are combined/fused [8]. For instance, improvements were obtained in [89] by combining mood and trait labels. In [39], extraversion was better predicted using nonverbal content, whereas agreeableness, neuroticism and conscientiousness were better predicted when verbal cues were used. Nevertheless, according to the reviewed literature, verbal content analysis has some limitations: 1) most existing works exploiting verbal content are based on manual transcriptions of the audio channel (which imposes a great barrier to applicability); 2) automatic speech recognition methods are still not accurate enough to capture verbal content without introducing noise/errors into the pipeline; and 3) it is language dependent, i.e., verbal content analysis from people speaking different languages might require different treatments. Thus, according to our study, there is no single set of features that maximizes accuracy for all personality traits. Verbal content, voice, facial expressions, gestures and poses (i.e., body language), among many other features, are potential sources to code/decode personality, and can complement each other in different ways.

Recurrent problems. The use of controlled environments [95] or specialized sensors [84] imposes a limitation on applicability, which can be even more severe if the study is based on private [94], [95] or customized datasets [84], [89] composed of a small number of participants [95].

Future directions. The two-stage approach presented in [94] is motivated by the fact that the relationship between features and personality traits is generally difficult to describe through a simple linear model. The authors attempted to learn a model for predicting answers to the BFI-10 questionnaire from features, which acts as a mid-level step in the semantic hierarchy from features to personality traits. Results indicate that features-to-answers followed by answers-to-personality-scores can achieve superior performance to mapping features to personality scores directly, even though in a preliminary study. As far as we know, no other work in the field has tried to address the problem using such a two-stage methodology, which could receive special attention in future research.
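
The two-stage idea can be sketched as follows (a simplified illustration with synthetic data and plain ridge regressors, not the SLoT formulation of [94]): one model maps features to questionnaire answers, and a second maps the predicted answers to trait scores.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                    # audiovisual features
answers = rng.integers(1, 6, size=(300, 10)).astype(float)  # BFI-10 items
traits = answers @ rng.normal(size=(10, 5)) / 10  # Big-Five scores

# Stage 1: features -> questionnaire answers (mid-level step).
stage1 = Ridge(alpha=1.0).fit(X, answers)
# Stage 2: (predicted) answers -> Big-Five trait scores.
stage2 = Ridge(alpha=1.0).fit(stage1.predict(X), traits)

x_new = rng.normal(size=(1, 20))
print(stage2.predict(stage1.predict(x_new)))      # estimated trait scores
```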

It is well known that deep learning is revolutionizing almost all research domains related to visual computing (i.e., compared to traditional machine learning and standard computer vision approaches based on hand-crafted features). The architecture presented in [8], [15] was the only approach among the top-ranking entries in the ChaLearn First Impressions Challenge [9] that relied neither on pretrained models nor on feature engineering, which makes it particularly appealing since it does not require making any assumptions regarding the important features for the task at hand. The authors also evaluated the changes in performance of the audio and visual models as a function of exposure time (i.e., slice length), which is still an open question in the field. Results suggest that there is enough information about personality in a single frame [8], as evidenced in [13] when studying the bases of personality judgments ("first impressions are built from a glimpse as brief as 100ms"). Nevertheless, the same reasoning does not hold for the auditory modality, especially for very short auditory clips. Note that frame selection (i.e., which frame from the whole sequence will be analyzed), not addressed in [8], has not been properly addressed yet in spatio-temporal methods for automatic personality perception (i.e., the standard approach is to analyse uniformly distributed samples over the set of frames).
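
The uniform-sampling baseline mentioned above is trivial to implement; the following snippet (our own illustration) picks k evenly spaced frame indices from a clip:

```python
import numpy as np

def uniform_frame_indices(n_frames, k):
    """Return k frame indices evenly spread over a clip of n_frames."""
    return np.linspace(0, n_frames - 1, num=k, dtype=int)

print(uniform_frame_indices(n_frames=450, k=6))  # [  0  89 179 269 359 449]
```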

Although single images can carry meaningful information about the personality of an individual, we envisage that future research directions in multimodal approaches (which hold for almost all other categories of the proposed taxonomy) should contemplate end-to-end trainable models and multi-task scenarios (different tasks jointly trained), taking advantage of the evolution of the data in the spatio-temporal domain, and possibly benefiting from transfer learning (e.g., cross-domain) and semi-supervised approaches (using partially annotated data from different datasets).

Finally, our study also revealed recent trend methodologies, summarized in Table 1, where deep architectures predominate. Nevertheless, there is still no standard way to represent multimodal information, i.e., hand-crafted features, deep features and raw data are combined in different ways. Notably, some models are able to obtain high accuracy without any engineered feature (at least with respect to the visual channel), which can be considered a breakthrough in automatic personality perception. It can also be observed that most of these trend methodologies benefit from transfer learning through the use of pretrained models, which increases complexity and the number of parameters, but can be considered a step forward in enhancing generalization to different contexts and scenarios. The design of standard evaluation protocols, databases and challenges (discussed in Sec. 3) facilitates the comparison of different approaches. For instance, the databases [9] and [38] (shown in Table 1) are composed of the same audio-visual data (see Table 4 for more details), so works evaluated on them can be directly compared. However, there is still a large avenue to be explored, as some key points affecting personality perception can be considered to be at an early stage of research (e.g., the influence of face attributes and context, spatio-temporal modeling, explainability, etc.).

2.7 Real and apparent personality trait analysis

In this section, we present relevant works developed for automatic personality recognition, due to their relevance and similarity to the topic of this survey, and briefly discuss the very few existing works analysing both real and apparent personality traits. Note that, if we ignore the way the labels (used to train machine learning algorithms for the classification/regression task) are obtained (i.e., from external observers, in the case of personality perception, or from self-report questionnaires, in real personality trait analysis), the task of inferring personality from visual, audio-visual, or multimodal data could be considered the same, whether real or apparent personality is considered. However, it is important to emphasise that "apparent personality recognition" is not a replacement for "real personality recognition", as personality perception does not necessarily relate to the actual trait(s) of the person.

TABLE 1
Recent trend methodologies in apparent personality trait analysis. Cat.: still image (SI), image sequence (IS), audio-visual (AV) and multimodal (M). Pretrained models employed for feature extraction or pre-processing (audio and video resampling are not considered) are reported. The average score (and metric) reported for a particular database is presented. Legend: LFAV: late fusion of acoustic and visual cues; FLDA: face/landmark detection and alignment; HC: hand-crafted; DF: deep feature; RD: raw data; acc: accuracy.

| Ref  | Cat. | Model                                                            | Pretrained       | Pre-processing              | Features                       | Database         | Score        | Key-points*                                 |
|------|------|------------------------------------------------------------------|------------------|-----------------------------|--------------------------------|------------------|--------------|---------------------------------------------|
| [48] | SI   | KPLS multivariate regression                                     | VGG16            | FLDA                        | HC and DF                      | custom from [50] | 0.37 (RMSE)  | scene descriptor                            |
| [19] | IS   | KELM regressor (late fusion of scene and emotion features)       | VGG19; VGG-Face  | FLDA                        | HC and DF                      | [9]              | 0.9094 (acc) | scene descriptor; facial expression         |
| [60] | IS   | Deep Bimodal Regression                                          | VGG-Face         | face detection              | HC and DF                      | [38]             | 0.912 (acc)  | Action Unit analysis; explainability        |
| [63] | AV   | BLSTM with decision-level fusion of audio and visual cues        | -                | FLDA; upper body detection  | HC                             | custom from [74] | 0.51 (MSE)   | continuous domain; varying context          |
| [62] | AV   | 3D-CNN and LSTM bi-modal deep networks (LFAV)                    | -                | FLDA                        | RD (video); HC (audio)         | [38]             | 0.9121 (acc) | end-to-end; spatio-temporal modeling        |
| [81] | AV   | Deep Bimodal Regression (early & LFAV); linear regression (audio)| VGG-Face; ResNet | -                           | RD (video); HC (audio)         | [9]              | 0.9130 (acc) | ensemble of multiple models; explainability |
| [8]  | M    | RNN (LFAV); Ridge regression (LFAV and verbal content)           | -                | -                           | RD (video, audio); HC (verbal) | [38]             | 0.9118 (acc) | end-to-end; explainability; verbal content  |

* Key-points are particular aspects we consider relevant to be further exploited.


2.7.1 Automatic Personality Recognition

Similarly to the case of personality perception, personality recognition has been addressed in the literature using different data modalities, i.e., still images [50], [96], [97], image sequences [98], [99], audiovisual (with [11], [59], [79], [85], [86], [100], [101], [102], [103] or without [88], [104] interactions) and multimodal [105], [106], [107], [108].

Taking the popularity of social networks into account, Ferwerda et al. [96] proposed to infer real personality from the way users manipulate their pictures on Instagram. However, the work is basically limited to color information analysis. Liu et al. [50] addressed how Twitter profile images vary with the personality of users. Although profile images from over 66,000 users are used, personality traits were estimated based on their tweets. In [97], they analysed the Big-Five traits and interaction styles from Facebook profile images, however without explicitly analysing human faces, as only some of the images contain a single person.

Subramanian et al. [99] showed that social attention patterns computed from the target's position and head pose during social interactions are excellent predictors of extraversion and neuroticism. In [98], the impact of personality during Human-Robot Interactions (HRI) is analysed based on nonverbal cues extracted from a first-person perspective, as well as on their relationships with participants' self-reported personalities and interaction experience. Linear SVR is employed to predict personality traits from gaze direction, attention and head movements while interacting with either an "extroverted" or an "introverted" robot.

Fang et al. [100] combined Ridge Regression with audio-visual features to address Big-Five personality recognition and social impressions during small-group interactions. In a similar context, speaking time and visual attention are exploited in [85], [86] to predict extraversion. In [85], they consider the attention an individual receives from, or gives to, the group members. In [86], the authors differentiate the amount of attention given/received while the person is speaking. Both works also address the impact of slice length on the classification. In a meeting scenario, Pianesi et al. [11] addressed extraversion and Locus of Control classification using SVM and audio-visual features (e.g., audio signal statistics and Motion History Images) extracted from 1-min videos. Later [101], the problem was addressed as a regression task. Inference of extraversion and Locus of Control is also proposed in [102], using Bayesian Networks that explicitly incorporate hypotheses about the relationships among personality, actual behavior of the target and situational aspects.

Batrinca et al. [103] employed 2-5 min videos to recognize personality traits during HCI, combining audio-visual cues and feature selection with SVM. In their work, the computer interacts with individuals using different levels of collaboration, to elicit the manifestation of different personality traits. The work was extended in [109] to consider Human-Human Interactions (HHI). In [88], the authors combined nonverbal visual features with Naive Bayes and SVM to predict Big-Five traits in a monologue setting similar to vlogging [54]. The study [88] was extended in [104] to automatically extract a few additional visual features. In both works, the highest accuracy was obtained when classifying conscientiousness. As related by the authors, the request to introduce themselves in front of a camera apparently activated the subjects' conscientiousness dispositions.

Nguyen et al. [79] extended [77] to predict the Big-Five traits in addition to hirability impressions, focusing on postures and gestures (extracted from a mixture of manual annotations and automated methods) as well as co-occurrence events. Rahbar et al. [105] addressed extraversion recognition during HRI, taking into account the first thin slices of the interaction. Multimodal features extracted from depth images (e.g., motion and human-robot distance) are used to train a Logistic Regression classifier. The work was extended in [106] to consider new features and an in-depth analysis. Farnadi et al. [107] compared different personality recognition methods and investigated the possibility of cross-learning from different social media, i.e., Facebook, Twitter and YouTube. However, apart from the analysis performed on the YouTube vlog dataset [55], no visual-based analysis was performed in relation to the other sources. In [108], they studied the relation between Big-Five traits and implicit responses of people to affective content (i.e., emotional videos), combining features obtained from electroencephalogram, peripheral physiological signals and facial landmark trajectories with a linear regression model.

In summary, the very few existing works developed for automatic personality recognition are: 1) mainly based on hand-crafted features, classic machine learning approaches and single-task scenarios, without modelling multiple visual human cues for an accurate representation or exploiting the temporal dynamics of the data; 2) analysed on small datasets without standard evaluation protocols, so there is no guarantee of generalization to different target populations/scenarios; and, last but not least, 3) none of the existing works regressing personality traits from audio-visual data analyse sources of bias to further correlate real and apparent personality.

2.7.2 Joint analysis of real and apparent personality

Preliminary results on the relationship between real and apparent personality traits have been reported in the literature [41], [84], [110], [111], [112], with limited outcomes. In the work of Wolffhechel et al. [110], no connection was found between participants' self-reported personality traits and the scores they gave to others. According to their study, a single facial picture may lack information for evaluating diverse traits, i.e., "a viewer will miss additional cues for gathering a more complete first impression and will therefore focus overly on facial expressions instead". In [111], no visual information is used (just audio and transcripts). Furthermore, a small dataset was used, and results show the low generalization capability of the proposed model. Bremner et al. [41] analysed audio-visual and audio-only information on a small dataset of 20 participants (and 5 observers) captured in a controlled environment, with no generalization guarantee to different target populations. They observed that judges' ratings bear a significant relation to the targets' self-ratings only for the extraversion trait.

Although restricted to job interviews, preliminary results on the relationship among stress impressions and the Big-Five traits (real and apparent) are reported in [84]. With respect to real personality, they observed that the openness to experience and conscientiousness traits were negatively correlated with stress impressions. For traits as perceived by others, conscientiousness was negatively correlated with stress impressions, while neuroticism was positively correlated.

More recently, Celiktutan et al. [112] presented a multimodal database to study personality simultaneously in HHI and HRI scenarios, and its relationship with engagement. Note that, differently from existing approaches in personality computing, personality impressions were provided by acquaintances, which may explain why baseline results show that trends in personality classification performance remained the same with respect to the self and acquaintance labels. Moreover, results are limited to an analysis conducted with a small number of participants.

In [46], the authors addressed both real and apparent personality trait recognition. However, beyond presenting some insights on which cues contribute to each personality type, no joint analysis was performed. Fang et al. [100] addressed the recognition of real personality traits and social impressions without making an in-depth correlation analysis between the two domains. In [69], personality impressions were collected for the same dataset used in [100], however still without correlating both personality types.

This study revealed that the state of the art in visual-based personality computing is neither focusing on understanding the human biases that influence personality perception nor trying to automatically (and accurately) regress real personality from perception mechanisms. This is mainly because of the high complexity of accurately modelling humans in visual data, as well as the high subjectivity of the topic (involving a large set of possible sources of bias), which leaves it a largely unexplored area.

2.8 What features give better results?

As discussed in previous sections, there is no standard set or modality of features that works best for any type of data, database or personality trait. Different solutions have been proposed over the past years based on distinct evaluation protocols, which prevents the above question from being properly addressed. However, we present in Table 2 a list of mid-level features and semantic attributes highly correlated with Big-Five traits, as reported by state-of-the-art works on personality computing. We expect the information summarized in Table 2 to inspire future research to advance the state of the art in the field, in particular when studying new strategies to improve the recognition performance of traits that are currently difficult to recognize (as discussed in Sec. 2.9).

While some agreement can be observed in Table 2 (in most cases) in relation to particular sets of attributes, traits and personality type, a few minor inconsistencies can also be noted, reinforcing the difficulty of addressing the above question. This is the case for "positive emotions and smiling expressions" with respect to openness and real personality (reported, in different studies, to have negative and positive correlation). It must be emphasized that the studies performed by the different works presented in Table 2 may be limited to analyses performed on small datasets, composed of small numbers of participants and without standard evaluation protocols, which do not guarantee generalization to different scenarios and contexts. In the case of apparent personality, they are also influenced by the subjective bias of the people who labelled the data, the number of annotators, among other variables. Nevertheless, disregarding these limitations, it is possible to observe some agreement with the psychology literature, for example that extroverted people usually show smiling expressions and speak louder, and that more conscientious people tend to preserve their privacy (e.g., by not sharing images from private locations on social media).

Evidence found in the literature shows that extraversion is the trait with the largest activity cue utilization [18] (reflected in Table 2), as well as the trait that is typically easiest to judge [54], [55]. Our study also revealed that, in the case of personality perception, extraversion is also the trait recognized with the highest accuracy (shown in Fig. 2).

TABLE 2
Reported mid-level features and semantic attributes highly correlated with Big-Five traits, as reported by the state of the art in personality computing.

Openness (O)
- Negative correlation, real: positive emotions and smiling expressions [50]; stress impression [84]
- Negative correlation, apparent: negative emotions (anger, disgust) [18], [44], [76]; verbal content associated to negative emotions [39]; long eye contact [55] and frontal face event duration [16]
- Positive correlation, real: positive emotions and smiling expressions [45]; body motion and speaking activity during collaborative task [109]
- Positive correlation, apparent: smiling expressions, joy [44]; speaking time [54]; body motion [16], [55], [76]; hirability impressions and eye contact [16]; verbal content associated to leisure activities [39]

Conscientiousness (C)
- Negative correlation, real: private location [46]; negative mood [50]; speech conflict with others [100]; stress impression [84]
- Negative correlation, apparent: negative emotions/valence (sad, anger) [39], [44]; body activity [54]; long eye contact [55], [76]; stress impression [84]
- Positive correlation, real: smiling expressions [88]; positive mood and valence (joy) [50]; eye contact [97]; speaking activity and body motion during collaborative task [109]
- Positive correlation, apparent: smiling expressions, joy and contempt [44]; frontal pose [48]; speaking time [54]; eye contact, low body motion activity, looking-while-speaking [55], [76]; verbal content associated to occupation and achievements [39]

Extraversion (E)
- Negative correlation, real: low deviation in received attention during interaction [99]; visual attention given to the rest of the group [102]
- Negative correlation, apparent: neutral or negative emotions/valence (anger, disgust, contempt) [11], [44], [76], [80], [89]; pressed lips [46]; low speech activity [80] during group interaction [72]; give/receive attention while silent [86]; speech turns [54], [55], [76]; long eye contact [16], [55]
- Positive correlation, real: positive emotions [50]; smiling expressions [97]; body motion [103], [109]; attention received and speaking time [102]; engagement during interaction [95]
- Positive correlation, apparent: positive emotions/valence (joy) [18], [46], [76], [89] and smiling expressions [18], [44]; funny [80]; body motion [16], [54], [55], [72], [76]; give/receive attention while speaking [55], [69], [72], [86]; speaking time [54], [55], [76] and speaking louder [55]; eye contact [16]; verbal content related to interpersonal interactions and sexuality [39]; hirability impression [16], [79]

Agreeableness (A)
- Negative correlation, real: negative emotion expressions [50]; talking turns in group interactions [100]
- Negative correlation, apparent: negative emotions/valence (anger and disgust) [18], [44], [76], [89]; verbal content associated to negative emotions, sexuality and religion [39]
- Positive correlation, real: positive emotions, smiling expressions and joy [50]; body motion during collaborative task [109]; long speaking duration [88]
- Positive correlation, apparent: positive emotions/valence (joy) and smiling expression [18], [44], [76], [80]; eye contact [46]; facial attractiveness [40]; verbal content associated to positive emotions [39]

Neuroticism (N)
- Negative correlation, real: positive emotions [50]; smiling expressions [97]; long speaking duration during collaborative task [109]
- Negative correlation, apparent: smiling expressions, joy [44]; face visibility [46]; facial attractiveness [40]; looking-while-speaking [55], [76]; hirability impression [16]; agreeableness [80]
- Positive correlation, real: negative emotions or lack of emotion expression [50]; social profile images without faces [50], [97]; low body motion activity during interactions [99]
- Positive correlation, apparent: negative emotions/valence (anger) [80], [89]; verbal content associated to negative words and negative emotional words [39]; duckface [46]; stress impression [84]

2.9 Which traits are easily recognized?

To address the above question, we analysed the results reported by the reviewed works with respect to the Big-Five model, i.e., {O, C, E, A, N}. In total, 11 and 33 works addressing real and apparent personality recognition, respectively, were analysed. Works which did not report results for all Big-Five traits were not considered. For each work, we retrieved the two traits recognized with the highest accuracy. Fig. 2 shows the obtained distribution. Different observations can be drawn from Fig. 2: 1) the "ranking" of traits in the two distributions is different; 2) regarding real trait estimation, "C" and "O" are the traits recognized with the highest accuracy, whereas 3) "E" and "C" are the best recognized traits in personality perception; 4) clearly, "N" is challenging for both types of work; and 5) surprisingly, "A", which is usually recognized with satisfactory accuracy in personality perception, is the most difficult trait to recognize when real personality is considered.

Fig. 2. Distribution of Big-Five personality traits “easily” recognized.

It must be emphasized that Fig. 2 was generated based on few works evaluated on different datasets and protocols. Thus, the analysis presented in this section can only provide a more general view of the above question, and further studies are needed to confirm our observations.

3 TRAIT RECOGNITION CHALLENGES

Academic competitions/challenges are an excellent way to quickly advance the state of the art in a particular field. Organizers of challenges formulate a problem and provide data, evaluation metrics, an evaluation platform and forums for the dissemination of results. Within the computer vision, pattern recognition and multimedia information processing communities², several challenges related to personality analysis have been proposed. These are summarized in Table 3.

The first challenge on apparent personality analysis was the one organized at Interspeech in 2012 [117]. However, that challenge focused only on the audio modality. Interestingly, its organizers found that participants had difficulty improving on the baseline, and that solutions lacked creativity, hence motivating further research in terms of feature extraction and modeling.

The MAPTRAITS [58] and WCPR [93] challenges organized in 2014 were the first ones involving both video and audio modalities, focusing on personality perception. Two tracks were launched in the MAPTRAITS challenge: the continuous estimation of personality traits through time, and the recognition of traits from entire video clips (discrete). Organizers found that participants barely obtained performance comparable to the baselines; the best results were obtained with the audio-visual and visual-only baselines for the continuous and discrete tracks, respectively, highlighting the importance of the visual modality. The WCPR challenge also comprised two tracks: one focusing on the use of multimodal information and the other using only text information. The conclusions from this competition were that the problem was too hard, but better results were obtained with multimodal information than when only text was used. In both challenges, the number of available samples was quite small (only 44 samples were used in [58] and 404 in [93]), which may be the cause of the low recognition performance. Still, these were the first efforts on the usage of multimodal information for personality analysis.

2. Please note that inferring real personality traits from text is a topic that has been studied intensively by the NLP community (e.g., [113]).

TABLE 3
Challenges on apparent personality analysis.

| Challenge     | Year | Dataset              | Task                | Samples | Event       | Winner       | Overview |
|---------------|------|----------------------|---------------------|---------|-------------|--------------|----------|
| ChaLearn LAP  | 2017 | First impressions v2 | Invite to interview | 10,000  | CVPR/IJCNN  | [61], [64]   | [38]     |
| ChaLearn LAP  | 2016 | First impressions    | P. traits (Big-5)   | 10,000  | ECCV        | [14]         | [9]      |
| ChaLearn LAP  | 2016 | First impressions    | P. traits (Big-5)   | 10,000  | ICPR        | [65]         | [82]     |
| MAPTRAITS     | 2014 | Semaine              | P. traits (Big-5)   | 44      | ICMI        | [114], [115] | [58]     |
| WCPR          | 2014 | YouTube vlog         | P. traits (Big-5)   | 404     | MM          | [91]         | [93]     |
| Speaker Trait | 2012 | I-ST                 | P. traits (Big-5)   | 640     | Interspeech | [116]        | [117]    |


More recently, ChaLearn³ organized two rounds of a challenge on personality perception from audiovisual data [9], [82]. A new dataset comprising realistic videos annotated by AMT workers was used. To the best of our knowledge, this is the largest dataset available so far for apparent personality analysis (10K samples). The challenge focused on trait recognition in short clips taken from YouTube. The winning methods were based on deep learning [14], [65]. In fact, most participants of the contest adopted deep learning methods (e.g., [15], [62]). The best performance was achieved by solutions that incorporated both audio and visual cues. For these competitions, participants succeeded in improving on the baseline, achieving recognition performance above 90% in terms of average accuracy. The main conclusion from these competitions was that accurately recognizing personality impressions from short videos is feasible, motivating further research on this topic.
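
For reference, the average accuracy reported in these competitions is, to our understanding, computed as one minus the mean absolute error between predicted and ground-truth trait values in [0, 1], averaged over clips and traits. A minimal sketch:

```python
import numpy as np

def mean_accuracy(pred, gt):
    """Challenge-style 'accuracy' for continuous trait scores in [0, 1]."""
    return 1.0 - np.abs(np.asarray(pred) - np.asarray(gt)).mean()

pred = np.array([[0.55, 0.60, 0.70, 0.45, 0.30],
                 [0.40, 0.75, 0.50, 0.65, 0.45]])
gt = np.array([[0.50, 0.62, 0.68, 0.50, 0.35],
               [0.45, 0.70, 0.55, 0.60, 0.40]])
print(mean_accuracy(pred, gt))  # 0.956
```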

Results from the latter challenge motivated a new competition on a closely related topic involving personality traits, the so-called Job Candidate Screening Coopetition [38]. In this challenge, an extended version of the ChaLearn First Impressions dataset [9] was considered, where the extension consisted of a new variable to be predicted and manual transcriptions of the audio in the videos. Participants had to predict a variable indicating whether the person in a video would be invited or not to a job interview (round 1). In addition, participants had to develop an interface explaining their recommendations (round 2). For the latter task, participants relied on apparent personality trait predictions. Organizers concluded that the "invite-for-interview" variable could be predicted with high accuracy, and that developing explainable models is a critical aspect, although there is still large room for improvement in this regard.

From these challenges, several lessons can be learned. First, a major difficulty for the organizers of the first competitions was the scarcity of data. In those competitions, participants could barely improve on the performance of the baseline, and there was not much diversity in terms of the type of solution. Larger datasets, like those provided in the more recent challenges, together with more powerful modeling techniques that can leverage such amounts of data (e.g., deep learning methods), could be the key ingredient allowing participants to improve on the baselines and obtain outstanding performance. Table 4 shows the available datasets on the topic, from a vision-based perspective, as well as additional details about labels, modality, etc. Secondly, the inclusion of multimodal information, e.g., as opposed to using audio or text only, increased the range of possible methodologies and information that could be used to predict personality traits. Finally, the initial competitions were not organized in consecutive years (except those organized by ChaLearn). We think continuity is a key aspect for the success of any academic challenge.

3. http://chalearnlap.cvc.uab.es


It is important to emphasize that, despite there not being many challenges on personality analysis, the progress that previous competitions have driven is remarkable. In addition, there are several related challenges that deserve to be mentioned⁴. For instance, the EmotiW [119] and AVEC [120] challenge series, which have run for several years and whose focus is on emotion recognition, depression detection, mood classification and multimodal data processing. Clearly, progress in these related fields is having an impact on new challenges targeting personality traits exclusively.

4 DISCUSSION

In this section, we provide a final discussion of the research topic. First, we comment on a few relevant aspects observed, and lessons learned, across the different modalities of the proposed taxonomy. Then, we summarize and discuss the changes in terms of applications, features and limitations when temporal information started being used. Next, we discuss accuracy and its relation to subjectivity, as well as the importance of dataset definition as a rich resource to advance the research. Finally, we comment on current deep learning technologies applied to first impression analysis, expected outcomes and applications.

Particularities of the different modalities. From this study, several lessons can be learned. It revealed that approaches based on still images mostly focus on geometric and/or appearance facial features, using low-level or mid-level/semantic attributes to drive the recognition of personality traits. Most works within this category use ad-hoc datasets, making the comparison with competing approaches a big challenge. Techniques developed for image sequences usually include higher-level features and analysis, such as facial emotion expressions, co-occurrence event mining and head/body motion, in addition to the ones used for still images. When temporal information is available, the great majority of works tend to compute functional statistics over time or treat each frame independently, omitting large spatio-temporal interactions. We envision future studies exploiting the spatio-temporal dependencies in (audio)visual (or multimodal) data to be an essential line of research. Possible studies in this direction may focus on new temporal deep learning models, such as those using 3D convolutions (to consider local motion patterns), or on temporal models such as Recurrent Neural Networks (RNN)-LSTM, which are able to model large spatio-temporal interactions.

4. We consider these challenges related because both are associated with the social signal processing field.

TABLE 4
Available datasets used in personality computing, centered on the visual analysis of humans, from a vision-based perspective.

MHHRI [112], multimodal, 2017
- Short description: 12 interaction sessions (~4h) captured with egocentric cameras, depth and bio-sensors; 18 participants; controlled environment
- Focus: personality and engagement during HHI and HCI
- Labels: self/acquaintance-assessed Big-Five, and engagement
- Used in: [95], [98]

ChaLearn First Impression v2 [38], multimodal, 2017
- Short description: extended version of [9], with the inclusion of hirability impressions and audio transcripts
- Focus: apparent personality traits and hirability impressions
- Labels: Big-Five impressions, job interview variable and transcripts
- Used in: [8], [26], [60], [61], [64]

ChaLearn First Impression [9], audiovisual, 2016
- Short description: 10K short videos: ~15 sec each, collected from 2762 YouTube users, 1280x720, RGB, 30fps, uncontrolled environment
- Focus: apparent personality trait analysis (no interaction; a single person talking to a camera)
- Labels: Big-Five impressions
- Used in: [14], [15], [19], [62], [65], [81], [82]

SEMAINE [74], multimodal, 2012
- Short description: 959 conversations: ~5 min each, 150 participants, 780x580, 49.979fps, RGB and gray, frontal and profile views, controlled environment
- Focus: face-to-face (interactive) conversations with sensitive artificial listener agents
- Labels: metadata∗, transcripts, 5 affective dimensions and 27 associated categories†
- Used in: [40], [57], [58], [59], [63], [114], [115]

Emergent LEAder (ELEA) [70], audiovisual, 2012
- Short description: 40 meetings: ~15 min each, 27 having both audio and video, composed of 3 or 4 members, 148 participants; 6 static (25fps) and 2 portable (30fps) cameras, controlled environment
- Focus: small group interactions and emergent leadership (winter survival task)
- Labels: metadata∗, Big-Five (self-report)† and social impressions
- Used in: [69], [72], [100]

YouTube vlog [53], [55], [118], audiovisual, 2011
- Short description: 442 vlogs: ~1 min each, 1 video per participant, uncontrolled environment
- Focus: conversational vlogs [53], [118] and apparent personality trait analysis [55]
- Labels: metadata∗ and Big-Five impressions
- Used in: [18], [39], [44], [54], [55], [56], [76], [90], [91], [92]

∗ Metadata can include gender, age, number of "likes" on social media, presence of laughs, Facial Action Units, etc., and varies for each dataset.
† Labels for personality perception are annotated by each independent work, as they are not provided with the database.


Beyond still images. The use of temporal information introduced the problem of defining the slice duration and location. Even though addressed in some works, these questions remain open in all related modalities. Other issues appeared when audiovisual approaches came into focus, such as situational contexts or personality states, which are extremely important points, as they contribute to the complexity and subjectivity of first impression studies.

In addition to visual information, methods for the analysis of personality in videos have relied on other modalities of data. For instance, most of the works reviewed in this survey fall within the audiovisual category, including low-level acoustic features (pitch, intensity, frequencies) as well as descriptors of speaking activity, turns, pauses, looking-while-speaking (a co-occurrence event), etc. This is because the nonverbal audio modality has proven to carry information that can be highly correlated with personality. In the same line, lexical analysis of the audio transcriptions has proven to be very useful; this is not surprising, as automatic detection of personality from text has been widely studied.

Performance vs. subjectivity. A considerable number of works show that the performance of each trait varies with different feature sets. This study also revealed that: 1) the ranking of best recognized traits varies across the reviewed works, although some tendencies have been observed; and 2) the best feature set/modality, or ranking of best recognized traits, might also change for the same work from one dataset to another, due to the subjectivity and complexity of the task.

Public datasets as valuable resources. A major problem in personality computing in the past has been the lack of unified public datasets allowing the accurate evaluation of methodologies for personality recognition and perception. Our review revealed that the construction of resources can be a valuable contribution. In fact, a few datasets are available nowadays, some of them generated in the context of academic challenges. Nevertheless, the design of new datasets and challenges will speed up progress in the field. We envision that new, large and public datasets, considering a large and heterogeneous population and exploiting the following topics, could define new research directions in the next few years: 1) situational contexts or personality states in more realistic scenarios; 2) continuous predictions; 3) joint analysis of real and apparent personality; 4) observer vs. observed analysis, in the context of first impression data labeling and subjective bias analysis. Note that these topics are highly correlated and could be tackled together, constituting a richer resource for research in the field. They could also potentially benefit from 5) a more comprehensive personality profile (which could also consider physical and mental health [121], cognitive abilities [122], implicit bias [123], among other attributes).

The revolution of deep learning. This study revealed that most works developed for automatic personality trait analysis (whether real or apparent) are mainly based on hand-crafted features, standard machine learning approaches and single-task scenarios. Although a few recent works are trained end-to-end, they do not integrate a comprehensive set of human visual cues, nor do they address human bias or correlate real and apparent personality. Nevertheless, CNNs are starting to be used in first impressions with very promising results, allowing the model to analyze not only a limited set of predefined features but the whole scene with contextual information, as well as facilitating advanced spatio-temporal modeling through the use of, e.g., RNNs or 3D-CNNs. Furthermore, CNNs can nowadays be considered one of the most promising candidates to meet the challenges of multimodal data fusion, by virtue of their high capability in extracting powerful high-level features.
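For concreteness, here is a minimal sketch of the kind of architecture this paragraph describes: a CNN backbone over video frames, a recurrent layer for temporal modeling, and late fusion with an audio branch. It is written in PyTorch; the layer sizes, the audio feature dimension and the overall design are illustrative assumptions, not a reconstruction of any reviewed model.

```python
import torch
import torch.nn as nn
from torchvision import models

class AudioVisualFusion(nn.Module):
    """Late-fusion sketch: CNN over frames + GRU over time + audio MLP."""

    def __init__(self, audio_dim=68, hidden=128, n_traits=5):
        super().__init__()
        # Visual branch: a CNN backbone producing a 512-d descriptor per frame
        # (pretrained weights omitted to keep the sketch self-contained).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.visual = backbone
        # Temporal modeling: a GRU summarizes the sequence of frame descriptors.
        self.temporal = nn.GRU(512, hidden, batch_first=True)
        # Audio branch: an MLP over utterance-level acoustic descriptors.
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Late fusion: concatenate the two embeddings and regress five traits.
        self.head = nn.Sequential(nn.Linear(2 * hidden, n_traits), nn.Sigmoid())

    def forward(self, frames, audio_feats):
        # frames: (batch, time, 3, 224, 224); audio_feats: (batch, audio_dim)
        b, t = frames.shape[:2]
        v = self.visual(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.temporal(v)                 # final hidden state summarizes the clip
        fused = torch.cat([h[-1], self.audio(audio_feats)], dim=1)
        return self.head(fused)                 # trait scores in [0, 1]

model = AudioVisualFusion()
scores = model(torch.randn(2, 8, 3, 224, 224), torch.randn(2, 68))
print(scores.shape)  # torch.Size([2, 5])
```

Training such a network end-to-end, e.g., with an L1 loss against continuous trait annotations, is precisely what distinguishes these approaches from pipelines built on hand-crafted features.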

Outcomes. Recent and promising results on personality computing may encourage psychologists to take an interest in machine learning approaches and to contribute to the field. Along these lines, a very promising avenue for research is the incorporation of prior knowledge into personality analysis models. This represents a new challenge for both the psychology and machine learning communities. Likewise, we believe that users of personality recognition methods would benefit from information about the decisions or recommendations made by an automatic system. Thus, another promising avenue for research is the explainability and interpretability of personality recognition methods.
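As one minimal illustration of interpretability, the sketch below computes a gradient-based saliency map, i.e., which input pixels most influence a predicted trait score. The tiny CNN is a stand-in for illustration only, not any published personality model.

```python
# Minimal sketch: gradient-based saliency for one predicted trait score.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 5), nn.Sigmoid(),          # five Big-Five trait scores
).eval()

image = torch.randn(1, 3, 224, 224, requires_grad=True)
model(image)[0, 2].backward()               # gradient of one trait w.r.t. pixels
saliency = image.grad.abs().amax(dim=1)     # per-pixel importance, max over channels
print(saliency.shape)                       # torch.Size([1, 224, 224])
```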

Applications. This study revealed that automatic personality trait analysis is applied in a vast number of scenarios. Reviewed works are applied to social media, small groups, face-to-face or interface-based interviews, HRI, HCI, vlogs and video resumes, among other contexts. From a practical point of view, the wide range of potential applications related to automatic personality perception, whose limits have not been defined yet, can benefit health, affective interfaces, learning, business and leisure, among others. However, in order to be applied only for good causes, the new generation of intelligent systems provided with personality perception capabilities will need to be more effective, explainable and inclusive, able to generalize ethically to different cultural and social contexts, benefiting everyone, everywhere. We anticipate that personality computing will become a hot topic in the next few years, with a high impact on a wide number of applications and scenarios.

ACKNOWLEDGMENTS

This project has been partially supported by the Spanish Ministry projects TIN2016-74946-P, TIN2015-66951-C2-2-R and TIN2017-88515-C2-1-R. This work is partially supported by ICREA under the ICREA Academia programme. We thank the ChaLearn Looking at People sponsors for their support, including Microsoft Research, Google, NVIDIA Corporation, Amazon, Facebook and Disney Research.

REFERENCES

[1] A. Vinciarelli and G. Mohammadi, "A survey of personality computing," TAC, vol. 5, no. 3, pp. 273–291, 2014.

[2] P. T. Costa and R. R. McCrae, Trait Theories of Personality. Boston, MA: Springer US, 1998, pp. 103–121.

[3] D. C. Funder, "Accurate personality judgment," Current Directions in Psychological Science, vol. 21, no. 3, pp. 177–182, 2012.

[4] A. Vinciarelli, Social Perception in Machines: The Case of Personality and the Big-Five Traits. Springer, 2016, pp. 151–164.

[5] R. R. McCrae and O. P. John, "An introduction to the five-factor model and its applications," Journal of Personality, vol. 60, no. 2, pp. 175–215, 1992.

[6] A. E. Abele and B. Wojciszke, "Agency and communion from the perspective of self versus others," Journal of Personality and Social Psychology, vol. 95, no. 5, pp. 751–763, 2007.

[7] R. B. Cattell and S. E. Krug, "The number of factors in the 16PF: A review of the evidence with special emphasis on methodological problems," Educational and Psychological Measurement, vol. 46, no. 3, pp. 509–522, 1986.

[8] Y. Gucluturk, U. Guclu, X. Baro, H. J. Escalante, I. Guyon, S. Escalera, M. A. J. van Gerven, and R. van Lier, "Multimodal first impression analysis with deep residual networks," TAC, vol. PP, no. 99, pp. 1–1, 2017.

[9] V. Ponce-Lopez, B. Chen, A. Places, M. Oliu, C. Corneanu, X. Baro, H. Escalante, I. Guyon, and S. Escalera, "ChaLearn LAP 2016: First round challenge on first impressions - dataset and results," in ECCVW, 2016, pp. 400–418.

[10] G. J. Boyle and E. Helmes, The Cambridge Handbook of Personality Psychology, Chapter: Methods of Personality Assessment. Cambridge University Press, Eds: P. J. Corr & G. Matthews, pp. 110–126, 2009.

[11] F. Pianesi, N. Mana, A. Cappelletti, B. Lepri, and M. Zancanaro, "Multimodal recognition of personality traits in social interactions," in ICMI, 2008, pp. 53–60.

[12] J. Willis and A. Todorov, "First impressions: Making up your mind after a 100-ms exposure to a face," Psychological Science, vol. 17, no. 7, pp. 592–598, 2006.

[13] A. Todorov, Face Value: The Irresistible Influence of First Impressions. Princeton and Oxford: Princeton University Press, 2017.

[14] C.-L. Zhang, H. Zhang, X.-S. Wei, and J. Wu, "Deep bimodal regression for apparent personality analysis," in ECCVW, 2016.

[15] Y. Gucluturk, U. Guclu, M. A. van Gerven, and R. van Lier, "Deep impression: Audiovisual deep residual networks for multimodal apparent personality trait recognition," in ECCVW, 2016.

[16] L. S. Nguyen and D. Gatica-Perez, "Hirability in the wild: Analysis of online conversational video resumes," IEEE Transactions on Multimedia, vol. 18, no. 7, pp. 1422–1437, 2016.

[17] C. Liem, M. Langer, A. Demetriou, A. Hiemstra, A. Sukma Wicaksana, M. Born, and C. Konig, Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening. Springer International Publishing, 2018, pp. 197–253.

[18] J.-I. Biel, L. Teijeiro-Mosquera, and D. Gatica-Perez, "FaceTube: Predicting personality from facial expressions of emotion in online conversational video," in ICMI, 2012, pp. 53–56.

[19] F. Gurpinar, H. Kaya, and A. A. Salah, "Combining deep facial and ambient features for first impression estimation," in ECCVW, G. Hua and H. Jegou, Eds., 2016, pp. 372–385.

[20] A. Todorov and J. Porter, "Misleading first impressions: Different for different facial images of the same person," Psychological Science, vol. 25, no. 7, pp. 1404–1417, 2014.

[21] A. Todorov, C. Said, and S. Verosky, Personality Impressions from Facial Appearance. Oxford Handbook of Face Perception, 2012.

[22] C. C. Ballew and A. Todorov, "Predicting political elections from rapid and unreflective face judgments," in National Academy of Sciences of the USA, 104(46), 2007, pp. 17948–17953.

[23] C. Y. Olivola and A. Todorov, "Elected in 100 milliseconds: Appearance-based trait inferences and voting," Journal of Nonverbal Behavior, vol. 34, no. 2, pp. 83–110, 2010.

[24] J. Joo, F. F. Steen, and S. C. Zhu, "Automated facial trait judgment and election outcome prediction: Social dimensions of face," in ICCV, 2015, pp. 3712–3720.

[25] F. Funk, M. Walker, and A. Todorov, "Modelling perceptions of criminality and remorse from faces using a data-driven computational approach," Cognition and Emotion, vol. 31, no. 7, pp. 1431–1443, 2017.

[26] H. Escalante, H. Kaya, A. Salah, S. Escalera, Y. Gucluturk, U. Guclu, X. Baro, I. Guyon, J. Jacques Jr., M. Madadi, S. Ayache, E. Viegas, F. Gurpinar, A. Wicaksana, C. Liem, M. van Gerven, and R. van Lier, "Explaining first impressions: Modeling, recognizing, and explaining apparent personality from videos," ArXiv e-prints, Feb. 2018.

[27] A. C. Islam, J. J. Bryson, and A. Narayanan, "Semantics derived automatically from language corpora contain human-like biases," Science, vol. 356, no. 6334, pp. 183–186, 2017.

[28] M. Walker, F. Jiang, T. Vetter, and S. Sczesny, "Universals and cultural differences in forming personality trait judgments from faces," Social Psychological and Personality Science, vol. 2, no. 6, pp. 609–617, 2011.

[29] C. Sofer, R. Dotsch, M. Oikawa, H. Oikawa, D. H. J. Wigboldus, and A. Todorov, "For your local eyes only: Culture-specific face typicality influences perceptions of trustworthiness," Perception, vol. 46, no. 8, pp. 914–928, 2017.

[30] C. A. M. Sutherland, A. W. Young, and G. Rhodes, "Facial first impressions from another angle: How social judgements are influenced by changeable and invariant facial properties," British Journal of Psychology, vol. 108, no. 2, pp. 397–415, 2017.

[31] K. Mattarozzi, A. Todorov, M. Marzocchi, A. Vicari, and P. M. Russo, "Effects of gender and personality on first impression," PLOS ONE, vol. 10, no. 9, pp. 1–13, 2015.

[32] R. Jenkins, D. White, X. V. Montfort, and A. M. Burton, "Variability in photos of the same face," Cognition, vol. 121, no. 3, pp. 313–323, 2011.

[33] S. C. Rudert, L. Reutner, R. Greifeneder, and M. Walker, "Faced with exclusion: Perceived facial warmth and competence influence moral judgments of social exclusion," Journal of Experimental Social Psychology, vol. 68, pp. 101–112, 2017.

[34] W. Fleeson, "Toward a structure- and process-integrated view of personality: Traits as density distributions of states," Personality and Social Psychology, vol. 80, no. 6, pp. 1011–1027, 2001.

[35] T. Vetter and M. Walker, Computer-Generated Images in Face Perception. Andrew J. Calder, Gillian Rhodes, Mark H. Johnson, James V. Haxby (Editors). The Oxford Handbook of Face Perception: Oxford University Press, pp. 387–399, 2011.

[36] D. Barratt, A. C. Redei, A. Innes-Ker, and J. van de Weijer, "Does the Kuleshov effect really exist? Revisiting a classic film experiment on facial expressions and emotional contexts," Perception, vol. 45, no. 8, pp. 847–874, 2016.

[37] C. Y. Olivola and A. Todorov, "Fooled by first impressions? Re-examining the diagnostic value of appearance-based inferences," Journal of Experimental Social Psychology, vol. 46, no. 2, pp. 315–324, 2010.

[38] H. Escalante, I. Guyon, S. Escalera, J. Jacques, M. Madadi, X. Baro, S. Ayache, E. Viegas, Y. Gucluturk, U. Guclu, M. A. J. van Gerven, and R. van Lier, "Design of an explainable machine learning challenge for video interviews," in IJCNN, 2017, pp. 3688–3695.

[39] J.-I. Biel, V. Tsiminaki, J. Dines, and D. Gatica-Perez, "Hi YouTube!: Personality impressions and verbal content in social video," in ICMI, 2013, pp. 119–126.

[40] J. Joshi, H. Gunes, and R. Goecke, "Automatic prediction of perceived traits using visual cues under varied situational context," in ICPR, 2014, pp. 2855–2860.

[41] P. Bremner, O. Celiktutan, and H. Gunes, "Personality perception of robot avatar tele-operators," in HRI, 2016, pp. 141–148.

[42] B. Chen, S. Escalera, I. Guyon, V. Ponce-Lopez, N. Shah, and M. Oliu Simon, Overcoming Calibration Problems in Pattern Labeling with Pairwise Ratings: Application to Personality Traits. Springer International Publishing, 2016, pp. 419–432.

[43] J. Jacques Junior, C. Ozcinar, M. Jakovljevic, X. Baro, G. Anbarjafari, and S. Escalera, "On the effect of age perception biases for real age regression," in FG, 2019 (in press), pp. 1–8.

[44] L. Teijeiro-Mosquera, J.-I. Biel, J. L. Alba-Castro, and D. Gatica-Perez, "What your face vlogs about: Expressions of emotion and Big-Five traits impressions in YouTube," TAC, vol. 6, no. 2, pp. 193–205, 2015.

[45] I. Naim, M. I. Tanveer, D. Gildea, and M. E. Hoque, "Automated prediction and analysis of job interview performance: The role of what you say and how you say it," in FG, 2015, pp. 1–6.

[46] S. C. Guntuku, L. Qiu, S. Roy, W. Lin, and V. Jakhetiya, "Do others perceive you as you want them to?: Modeling personality based on selfies," in International Workshop on Affect & Sentiment in Multimedia, 2015, pp. 21–26.

[47] Y. Yan, J. Nie, L. Huang, Z. Li, Q. Cao, and Z. Wei, "Exploring relationship between face and trustworthy impression using mid-level facial features," in ICMM, 2016, pp. 540–549.

[48] A. Dhall and J. Hoey, "First impressions - predicting user personality from Twitter profile images," in Human Behavior Understanding, M. Chetouani, J. Cohn, and A. Salah, Eds., 2016, pp. 148–158.

[49] N. Al Moubayed, Y. Vazquez-Alvarez, A. McKay, and A. Vinciarelli, "Face-based automatic personality perception," in International Conference on Multimedia, 2014, pp. 1153–1156.

[50] L. Liu, D. Preotiuc-Pietro, Z. Samani, M. Moghaddam, and L. Ungar, "Analyzing personality through social media profile picture choice," in ICWSM, 2016, pp. 211–220.

[51] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, "Emotion recognition in context," in CVPR, 2017, pp. 1960–1968.

[52] F. Noroozi, C. A. Corneanu, D. Kaminska, T. Sapinski, S. Escalera, and G. Anbarjafari, "Survey on emotional body gesture recognition," CoRR, vol. abs/1801.07481, 2018.

[53] J.-I. Biel and D. Gatica-Perez, "Voices of vlogging," in ICWSM, 2010, pp. 211–214.

[54] J.-I. Biel, O. Aran, and D. Gatica-Perez, "You are known by how you vlog: Personality impressions and nonverbal behavior in YouTube," in ICWSM, 2011, pp. 446–449.

[55] J.-I. Biel and D. Gatica-Perez, "The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs," IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 41–55, 2013.

[56] O. Aran and D. Gatica-Perez, "Cross-domain personality prediction: From video blogs to small group meetings," in International Conference on Multimodal Interaction, 2013, pp. 127–130.

[57] O. Celiktutan and H. Gunes, "Continuous prediction of perceived traits and social dimensions in space and time," in ICIP, 2014, pp. 4196–4200.

[58] O. Celiktutan, F. Eyben, E. Sariyanidi, H. Gunes, and B. Schuller, "Maptraits 2014: The first audio/visual mapping personality traits challenge," in Mapping Personality Traits Challenge and Workshop, 2014, pp. 3–9.

[59] O. Celiktutan, E. Sariyanidi, and H. Gunes, "Let me tell you about your personality!: Real-time personality prediction from nonverbal behavioural cues," in FG, vol. 1, 2015, pp. 1–1.

[60] C. Ventura, D. Masip, and A. Lapedriza, "Interpreting CNN models for apparent personality trait regression," in CVPRW, 2017, pp. 1705–1713.

[61] S. E. Bekhouche, F. Dornaika, A. Ouafi, and T. A. Abdemalik, "Personality traits and job candidate screening via analyzing facial videos," in CVPRW, 2017.

[62] A. Subramaniam, V. Patel, A. Mishra, P. Balasubramanian, and A. Mittal, "Bi-modal first impressions recognition using temporally ordered deep audio and stochastic visual features," in ECCVW, 2016.

[63] O. Celiktutan and H. Gunes, "Automatic prediction of impressions in time and across varying context: Personality, attractiveness and likeability," TAC, vol. 8, no. 1, pp. 29–42, 2017.

[64] H. Kaya, F. Gurpinar, and A. A. Salah, "Multi-modal score fusion and decision trees for explainable automatic job candidate screening from video CVs," in CVPRW, 2017, pp. 1651–1659.

[65] F. Gurpinar, H. Kaya, and A. A. Salah, "Multimodal fusion of audio, scene, and face features for first impression estimation," in ICPR, 2016, pp. 43–48.

[66] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, "Action recognition with dynamic image networks," TPAMI, vol. PP, no. 99, pp. 1–14, 2017.

[67] B. Kim, D. M. Malioutov, K. R. Varshney, and A. Weller, "Proceedings of the 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017)," ArXiv e-prints, 2017.

[68] K.-R. Muller, A. Vedaldi, L. K. Hansen, W. Samek, and G. Montavon, "Interpreting, explaining and visualizing deep learning workshop," NIPS, vol. Forthcoming, 2017.

[69] O. Aran and D. Gatica-Perez, "One of a kind: Inferring personality impressions in meetings," in ICMI, 2013, pp. 11–18.

[70] D. Sanchez-Cortes, O. Aran, M. S. Mast, and D. Gatica-Perez, "A nonverbal behavior approach to identify emergent leaders in small groups," IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 816–832, 2012.

[71] D. Gatica-Perez, "Automatic nonverbal analysis of social interaction in small groups: A review," Image and Vision Computing, vol. 27, no. 12, pp. 1775–1787, 2009.

[72] S. Okada, O. Aran, and D. Gatica-Perez, "Personality trait classification via co-occurrent multiparty multimodal event discovery," in ICMI, 2015, pp. 15–22.

[73] J. Staiano, B. Lepri, R. Subramanian, N. Sebe, and F. Pianesi, "Automatic modeling of personality states in small group interactions," in Int. Conference on Multimedia, 2011, pp. 989–992.

[74] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," TAC, vol. 3, no. 1, pp. 5–17, 2012.

[75] O. Celiktutan, P. Bremner, and H. Gunes, "Personality classification from robot-mediated communication cues," in RO-MAN, 2016, pp. 1–2.

[76] J.-I. Biel, "Mining conversational social video," Ph.D. dissertation, Ecole Polytechnique Federale de Lausanne, 2013.

[77] L. S. Nguyen, D. Frauendorfer, M. S. Mast, and D. Gatica-Perez, "Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior," IEEE Transactions on Multimedia, vol. 16, no. 4, pp. 1018–1031, 2014.

[78] L. S. Nguyen, "Computational analysis of behavior in employment interviews and video resumes," Ph.D. dissertation, Ecole Polytechnique Federale de Lausanne, 2015.

[79] L. S. Nguyen, A. Marcos-Ramiro, M. Marron Romera, and D. Gatica-Perez, "Multimodal analysis of body communication cues in employment interviews," in ICMI, 2013, pp. 437–444.

[80] D. Gatica-Perez, D. Sanchez-Cortes, T. M. T. Do, D. B. Jayagopi, and K. Otsuka, "Vlogging over time: Longitudinal impressions and behavior in YouTube," in MUM, 2018, pp. 37–46.

[81] X. S. Wei, C. L. Zhang, H. Zhang, and J. Wu, "Deep bimodal regression of apparent personality traits from short video sequences," TAC, vol. PP, no. 99, pp. 1–14, 2017.

[82] H. J. Escalante, V. Ponce, J. Wan, M. Riegler, C. B., A. Clapes, S. Escalera, I. Guyon, X. Baro, P. Halvorsen, H. Muller, and M. Larson, "ChaLearn joint contest on multimedia challenges beyond visual analysis: An overview," in ICPRW, 2016.

[83] N. Mana, B. Lepri, P. Chippendale, A. Cappelletti, F. Pianesi, P. Svaizer, and M. Zancanaro, "Multimodal corpus of multi-party meetings for automatic social behavior analysis and personality traits detection," in Workshop on Tagging, Mining and Retrieval of Human Related Activity Information, 2007, pp. 9–14.

[84] A. N. Finnerty, S. Muralidhar, L. S. Nguyen, F. Pianesi, and D. Gatica-Perez, "Stressful first impressions in job interviews," in ICMI, 2016, pp. 325–332.

[85] B. Lepri, R. Subramanian, K. Kalimeri, J. Staiano, F. Pianesi, and N. Sebe, "Employing social gaze and speaking activity for automatic determination of the extraversion trait," in ICMI, 2010.

[86] B. Lepri, R. Subramanian, K. Kalimeri, J. Staiano, F. Pianesi, and N. Sebe, "Connecting meeting behavior with extraversion – a systematic study," TAC, vol. 3, no. 4, pp. 443–455, 2012.

[87] X. Huang and A. Kovashka, "Inferring visual persuasion via body language, setting, and deep features," in CVPRW, 2016, pp. 778–784.

[88] L. Batrinca, N. Mana, B. Lepri, F. Pianesi, and N. Sebe, "Please, tell me about yourself: Automatic personality assessment using short self-presentations," in ICMI, 2011, pp. 255–262.

[89] G. Chavez-Martinez, S. Ruiz-Correa, and D. Gatica-Perez, "Happy and agreeable?: Multi-label classification of impressions in social video," in MUM, 2015, pp. 109–120.

[90] C. Sarkar, S. Bhatia, A. Agarwal, and J. Li, "Feature analysis for computational personality recognition using YouTube personality data set," in WCPR, 2014, pp. 11–14.

[91] F. Alam and G. Riccardi, "Predicting personality traits using multimodal information," in WCPR, 2014, pp. 15–18.

[92] G. Farnadi, S. Sushmita, G. Sitaraman, N. Ton, M. De Cock, and S. Davalos, "A multivariate regression approach to personality impression recognition of vloggers," in WCPR, 2014, pp. 1–6.

[93] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez, G. Riccardi, and F. Pianesi, "The workshop on computational personality recognition 2014," in Int. Conference on Multimedia, 2014, pp. 1245–1246.

[94] R. Srivastava, J. Feng, S. Roy, S. Yan, and T. Sim, "Don't ask me what I'm like, just watch and listen," in International Conference on Multimedia, 2012, pp. 329–338.

[95] H. Salam, O. Celiktutan, I. Hupont, H. Gunes, and M. Chetouani, "Fully automatic analysis of engagement and its relationship to personality in human-robot interactions," IEEE Access, vol. 5, pp. 705–721, 2017.

[96] B. Ferwerda, M. Schedl, and M. Tkalcic, "Using Instagram picture features to predict users' personality," in MMM, 2016.

[97] F. Celli, E. Bruni, and B. Lepri, "Automatic personality and interaction style recognition from Facebook profile pictures," in International Conference on Multimedia, 2014, pp. 1101–1104.

[98] O. Celiktutan and H. Gunes, "Computational analysis of human-robot interactions through first-person vision: Personality and interaction experience," in RO-MAN, 2015, pp. 815–820.

[99] R. Subramanian, Y. Yan, J. Staiano, O. Lanz, and N. Sebe, "On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions," in ICMI, 2013, pp. 3–10.

[100] S. Fang, C. Achard, and S. Dubuisson, "Personality classification and behaviour interpretation: An approach based on feature categories," in ICMI, 2016, pp. 225–232.

[101] B. Lepri, N. Mana, A. Cappelletti, F. Pianesi, and M. Zancanaro, "Modeling the personality of participants during group interactions," in ACM UMAP, 2009, pp. 114–125.

[102] K. Kalimeri, B. Lepri, and F. Pianesi, "Causal-modelling of personality traits: Extraversion and locus of control," in Workshop on Social Signal Processing, 2010, pp. 41–46.

[103] L. Batrinca, B. Lepri, N. Mana, and F. Pianesi, "Multimodal recognition of personality traits in human-computer collaborative tasks," in ICMI, 2012, pp. 39–46.

[104] L. Batrinca, B. Lepri, and F. Pianesi, "Multimodal recognition of personality during short self-presentations," in ACM Workshop on Human Gesture and Behavior Understanding, 2011, pp. 27–28.

[105] F. Rahbar, S. M. Anzalone, G. Varni, E. Zibetti, S. Ivaldi, and M. Chetouani, "Predicting extraversion from non-verbal features during a face-to-face human-robot interaction," in International Conference on Social Robotics, 2015, pp. 543–553.

[106] S. Anzalone, G. Varni, S. Ivaldi, and M. Chetouani, "Automated prediction of extraversion during human-humanoid interaction," Int. Journal of Social Robotics, vol. 9, no. 3, pp. 385–399, 2017.

[107] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. D. Cock, "Computational personality recognition in social media," User Modeling and User-Adapted Interaction, vol. 22, no. 2, pp. 109–142, 2016.

[108] M. K. Abadi, J. A. M. Correa, J. Wache, H. Yang, I. Patras, and N. Sebe, "Inference of personality traits and affect schedule by analysis of spontaneous reactions to affective videos," in FG, vol. 1, 2015, pp. 1–8.

[109] L. Batrinca, N. Mana, B. Lepri, N. Sebe, and F. Pianesi, "Multimodal personality recognition in collaborative goal-oriented tasks," IEEE Trans. on Multimedia, vol. 18, no. 4, pp. 659–673, 2016.

[110] K. Wolffhechel, J. Fagertun, U. Jacobsen, W. Majewski, A. Hemmingsen, C. Larsen, S. Lorentzen, and H. Jarmer, "Interpretation of appearance: The effect of facial features on first impressions and personality," PLOS ONE, vol. 9, no. 9, pp. 1–8, 2014.

[111] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore, "Using linguistic cues for the automatic recognition of personality in conversation and text," Journal of Artificial Intelligence Research, vol. 30, no. 1, pp. 457–500, Nov. 2007.

[112] O. Celiktutan, E. Skordos, and H. Gunes, "Multimodal human-human-robot interactions (MHHRI) dataset for studying personality and engagement," TAC, 2018.

[113] F. Celli, F. Pianesi, D. Stillwell, and M. Kosinski, "Workshop on computational personality recognition: Shared task," in AAAI Technical Report WS-13-01 Computational Personality Recognition (Shared Task), 2013, pp. 2–5.

[114] H. Kaya and A. A. Salah, "Continuous mapping of personality traits: A novel challenge and failure conditions," in Mapping Personality Traits Challenge and Workshop, 2014, pp. 17–24.

[115] M. Sidorov, S. Ultes, and A. Schmitt, "Automatic recognition of personality traits: A multimodal approach," in Mapping Personality Traits Challenge and Workshop, 2014, pp. 11–15.

[116] A. V. Ivanov and X. Chen, "Modulation spectrum analysis for speaker personality trait recognition," in Interspeech, 2012.

[117] B. Schuller, S. Steidl, A. Batliner, E. Noth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and B. Weiss, "A survey on perceived speaker traits: Personality, likability, pathology, and the first challenge," Computer Speech and Language, vol. 29, no. 1, pp. 100–131, 2015.

[118] J.-I. Biel and D. Gatica-Perez, "Vlogcast yourself: Nonverbal behavior and attention in social media," in ICMI-MLMI, 2010, pp. 50:1–50:4.

[119] A. Dhall, R. Goecke, J. Joshi, J. Hoey, and T. Gedeon, "EmotiW 2016: Video and group-level emotion recognition challenges," in ICMI, 2016, pp. 427–432.

[120] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. T. Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, "AVEC 2016 - depression, mood, and emotion recognition workshop and challenge," ArXiv 1605.01600, 2016.

[121] J. E. J. Ware and C. D. Sherbourne, "The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection," Medical Care, vol. 30, no. 6, 1992.

[122] J. J. Raven, "Raven progressive matrices," in Handbook of Nonverbal Assessment. Springer US, 2003, pp. 223–237.

[123] A. Baron and M. Banaji, "Implicit association test (IAT)," in Encyclopedia of Group Processes & Intergroup Relations. SAGE Publications, Inc., Thousand Oaks, CA, 1992, pp. 433–435.

Julio C. S. Jacques Junior is a postdoctoral researcher at the Computer Science, Multimedia and Telecommunications department at Universitat Oberta de Catalunya (UOC), within the Scene Understanding and Artificial Intelligence (SUNAI) group. He also collaborates with the Computer Vision Center (CVC) and the Human Pose Recovery and Behavior Analysis (HUPBA) group at Universitat Autonoma de Barcelona (UAB) and University of Barcelona (UB), as well as with ChaLearn Looking at People.

Yagmur Gucluturk is an assistant professor at the Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands, where she is working as a member of the Artificial Cognitive Systems Lab and the Neuronal Stimulation for Recovery of Function Consortium. Previously, she did her Ph.D. and M.Sc. in Cognitive Neuroscience at the Radboud University, and B.IT in Artificial Intelligence at the Multimedia University.

Marc Perez obtained the B.S. in Computer Science and Mathematics at University of Barcelona. His research interests include computer vision and machine learning, with special interest in affective computing and self-driving cars.

Umut Guclu is an assistant professor at Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands, and a member of the Artificial Cognitive Systems lab, where he works on combining deep learning and neural coding to systematically investigate the cognitive algorithms in vivo with neuroimaging. He did his postdoctoral, doctoral and master's studies in cognitive neuroscience and bachelor's studies in artificial intelligence.

Carlos Andujar is an associate professor at the Computer Science Department of the Universitat Politecnica de Catalunya - BarcelonaTech, and a senior researcher at the Research Center for Visualization, Virtual Reality and Graphics Interaction, ViRVIG. He received his PhD in Software Engineering in 1999 from UPC. His research interests include 3D modeling, computer graphics, and virtual reality. He is currently the vice-director of the CS department at UPC.

Xavier Baro is an associate professor and researcher at the Computer Science, Multimedia and Telecommunications department at Universitat Oberta de Catalunya (UOC). Since 2015, he has been the director of the Computer Vision Master degree at UOC. He is co-founder of the SUNAI group, and collaborates with the CVC at UAB as a member of the HUPBA group. He is interested in machine learning, evolutionary computation, and pattern recognition, especially their applications to the field of Looking at People.

Hugo Jair Escalante is a research scientist at Instituto Nacional de Astrofisica, Optica y Electronica, INAOE, Mexico, and a director of ChaLearn, a nonprofit dedicated to organizing challenges, since 2011. He has been involved in the organization of several challenges in computer vision and automatic machine learning. He has served as co-editor of special issues in IJCV, PAMI, and TAC, and as a competition/area chair of venues like NeurIPS, PAKDD and IJCNN.

Isabelle Guyon is a chaired professor in "big data" at the Universite Paris-Saclay, specialized in statistical data analysis, pattern recognition and machine learning (ML). Prior to joining Paris-Saclay she worked as an independent consultant and was a researcher at AT&T Bell Laboratories, where she pioneered applications of neural networks to pen computer interfaces (with collaborators including Yann LeCun and Yoshua Bengio) and co-invented, with Bernhard Boser and Vladimir Vapnik, Support Vector Machines. She has organized challenges in ML since 2003, supported by the EU network Pascal2, NSF, and DARPA, with prizes sponsored by Microsoft, Google, Facebook, Amazon, Disney Research, and Texas Instruments. She is an action editor of the Journal of ML Research.

Marcel A. J. van Gerven is professor of Artificial Cognitive Systems at Radboud University and principal investigator in the Donders Institute for Brain, Cognition and Behaviour. His research focuses on brain-inspired computing, understanding the neural mechanisms underlying natural intelligence and the development of new ways to improve the interaction between humans and machines. He is head of the Artificial Intelligence department at Radboud University.

Rob van Lier studied Experimental Psychology, and after that he obtained his PhD on the topic of human visual perception at Radboud University (Nijmegen, The Netherlands). Next, he spent a few postdoc years at the University of Leuven (Belgium) and returned to Nijmegen, based on a two-year grant from the Dutch National Science Foundation (NWO) and a 5-year grant from the Royal Netherlands Academy of Arts and Sciences (KNAW). Currently, he is an associate professor at the Radboud University and a principal investigator at the Donders Institute for Brain, Cognition and Behaviour, heading the "Perception and Awareness" research group.

Sergio Escalera obtained the best Ph.D. award in 2018 at the Computer Vision Center, UAB. He leads the Human Pose Recovery and Behavior Analysis Group (HUPBA). He is an associate professor at the Department of Mathematics and Informatics, Universitat de Barcelona (UB). He is also a member of the Computer Vision Center (CVC) at UAB. He is vice-president of ChaLearn Challenges in Machine Learning and chair of IAPR TC-12: Multimedia and visual information systems. He has been awarded the ICREA Academia. His research interests include the visual analysis of humans, with special interest in affective and personality computing.