

IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 1

The Pictures we Like are our Image: Multiple Instance Regression for Mapping Aesthetic Preferences into Personality Traits

Cristina Segalin1, Student Member, IEEE, Alessandro Perina2, Marco Cristani1, Member, IEEE, and Alessandro Vinciarelli3, Member, IEEE

Abstract—Flickr allows its users to tag the pictures they like as “favorite”. As a result, many users of the popular photo-sharing platform produce galleries of favorite pictures that account, at least in principle, for their aesthetic preferences. This article proposes new approaches, based on Computational Aesthetics and Multiple Instance Regression, capable of inferring the personality traits of Flickr users from the galleries above. In particular, the approaches map low-level features extracted from the pictures into numerical scores corresponding to the Big-Five Traits, both self-assessed and attributed. The experiments were performed over 60,000 pictures tagged as favorite by 300 users (the PsychoFlickr Corpus). The results show that it is possible to predict beyond chance both self-assessed and attributed traits. In line with the state-of-the-art of personality computing, the latter are predicted with higher effectiveness (correlation up to 0.68 between actual and predicted traits).

Index Terms—Computational Aesthetics; Personality Computing; Multiple Instance Regression.

I. INTRODUCTION

Is a picture worth a thousand words? It seems to be so when it comes to mobile technologies and social networking platforms: taking pictures is the action most commonly performed with mobile phones (82% of the American users), followed by exchanging text messages (80% of the users) and accessing the Internet (56% of the users) [1]. Furthermore, 56% of the American Internet users either post online original pictures and videos (46% of the total Internet users) or share and redistribute similar material posted by others (41% of the total Internet users). In other words, “photos and videos have become key social currencies online” [2].

Picture streams are “often seen as a substitute for more direct forms of interaction like email” [3] and interacting with connected individuals appears to be one of the main motivations behind the use of online photo-sharing platforms [4]. The pervasive use of liking mechanisms, online actions allowing users to publicly express appreciation for a given online item, further confirms the adoption of pictures as a social glue [5]: on Flickr, likes between connected users are roughly 10^5 times more frequent than those between non-connected ones [6].

Besides helping to maintain connections and express affiliation, liking mechanisms are a powerful means of possibly involuntary self-disclosure: statistical approaches can infer personality traits and hidden, privacy-sensitive information

1 University of Verona (Italy), 2 Italian Institute of Technology (Italy), 3 University of Glasgow (UK).

(e.g., political views, sexual orientation, alcohol consumption habits, etc.) from likes and Facebook profiles [7]. Furthermore, online expressions of aesthetic preferences convey an impression in terms of characteristics like prestige, differentiation or authenticity [8]. For this reason, this article proposes new approaches, based on Computational Aesthetics and Multiple Instance Regression [9], [10], capable of inferring the personality traits of Flickr users, both self-assessed and attributed by others, from the pictures they tag as favorite. In other words, this article proposes an approach aimed at mapping aesthetic preferences (the pictures people like) into personality traits.

Multiple Instance Regression (MIR) [9], [10] appears to be the most suitable computational framework because Flickr users typically tag several pictures as favorite. Therefore, there are multiple instances (the favorite pictures) in a bag (the user that tags the pictures as favorite) associated with a single value (a personality trait of the user). In particular, this article proposes a set of novel methods that build an intermediate representation of the pictures - using topic models - and then perform regression in the resulting space, thus improving the performance of standard MIR approaches operating on the raw features extracted from the pictures.
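To make the bag/instance setup concrete, the sketch below represents each "user" as a bag of instance feature vectors and fits the simplest possible MIR baseline: each bag is aggregated into the mean of its instances and an ordinary least-squares regressor is fit on the aggregated vectors. The aggregation baseline, the synthetic data and all names are illustrative assumptions; this is not the topic-model method proposed in this article.

```python
import numpy as np

# A bag = one user: a set of instance feature vectors (one per favorite
# picture) paired with a single bag-level label (one personality score).

def aggregate_bags(bags):
    """Map each bag (n_i x d array) to the mean of its instances (d,)."""
    return np.vstack([b.mean(axis=0) for b in bags])

def fit_mir_baseline(bags, y):
    """Fit least-squares regression on the aggregated bag vectors."""
    X = aggregate_bags(bags)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # add intercept column
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict(bags, w):
    X = aggregate_bags(bags)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ w

rng = np.random.default_rng(0)
# 300 synthetic "users", each with 200 "pictures" described by 5 features;
# the trait score depends linearly on the bag mean of the first feature.
bags = [rng.normal(size=(200, 5)) for _ in range(300)]
y = np.array([2.0 * b[:, 0].mean() + 0.5 for b in bags])
w = fit_mir_baseline(bags, y)
pred = predict(bags, y_hat_w := w) if False else predict(bags, w)
```

Because the synthetic labels are an exact linear function of the aggregated features, the baseline recovers them; real bag labels are of course noisier.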

The experiments have been performed over PsychoFlickr, a corpus of 60,000 pictures tagged as favorite by 300 Pro Flickr users1 (200 randomly selected favorite pictures per user). For each user, the corpus includes two personality assessments. The first has been obtained by asking the users to self-assess their own Big-Five Traits, the second by asking 12 independent assessors to rate the Big-Five Traits of the users (see Section III for details). In this way, according to the terminology introduced in [11], it is possible to perform both Automatic Personality Recognition (APR) and Automatic Personality Perception (APP), i.e. the prediction of self-assessed and attributed traits, respectively.

The reason for addressing both APR and APP is that self-assessed and attributed traits tend to relate differently to different aspects of an individual [12] and, therefore, both need to be investigated. In the case of this work, the results suggest that aesthetic preferences account only to a limited extent for self-assessed traits (the best APR result is a correlation of 0.26 between actual and predicted traits) while they have a major impact on attributed ones (the best APP result is a correlation of

1 At the moment of the data collection, Pro users were individuals paying a yearly fee in order to access privileged Flickr functionalities.


0.68 between actual and predicted traits). To the best of our knowledge, this is one of the few works where APP and APR have been compared over the same data (see [11] for an extensive survey). This is an important advantage because it allows one to assess the effectiveness of a given type of behavioural evidence (the favorite pictures in this case) in conveying information about personality.

The rest of this paper is organised as follows: Section II surveys previous work, Section III presents the PsychoFlickr Corpus, Section IV introduces the low-level features extracted from the pictures, Section V describes the new MIR approaches developed for this work, Section VI reports on experiments and results, and the final Section VII draws some conclusions.

II. PREVIOUS WORK

The experiments of this work lie at the crossroads between Computational Aesthetics and Personality Computing (see [11] and [13] for extensive surveys). To the best of our knowledge, no other works have addressed the problem of mapping aesthetic preferences into personality traits. However, several works have addressed separately the inference of aesthetic preferences from pictures and the inference of personality traits (both attributed and self-assessed) from social media material. The rest of this section proposes a survey of the main works presented in both areas.

A. Computational Aesthetics

Computational Aesthetics (CA) targets “[...] computational methods that can make applicable aesthetic decisions in a similar fashion as humans can” [14]. In the particular case of pictures, the goal of CA is typically to predict automatically whether human observers like a given picture or not. In most cases, the task corresponds to a binary classification, i.e. to predict automatically whether a picture has been rated high or low in terms of visual pleasantness [15], [16], [17], [18], [19].

The experiments of [15] use a feature set similar to the one adopted in this work (see Section IV) and are performed over a set of 1,664 images downloaded from the web. The images are split into two classes corresponding to the top and bottom ratings assigned by human observers. The classification accuracy achieved with Support Vector Machines is higher than 70%. Similar experiments are proposed in [16] over 6,000 images (3,000 per class). The features account for composition, lighting, focus control and color. The main difference with respect to other approaches is that the processing focuses on the subject region and on its difference with respect to the background. The classification accuracy is higher than 90%.

In the case of [17], the experiments are performed over digital images of paintings and the task is the discrimination between high and low quality paintings. The features include color distribution, brightness properties (accounting for the use of light), use of blurring, edge distribution, shape of picture segments, color properties of segments, contrast between segments, and focus region. In this case as well, the task is a binary classification and the experiments are performed over 100 images. The best error rate is around 35%. In a similar vein, the approach proposed in [18] first detects the subject of a picture and then extracts features that account for the difference between foreground and background. The features account for sharpness, contrast and exposure, and the experiments are performed over a subset of the pictures used in [15]. Like in the other works presented so far, the task is a binary classification and the accuracy is 78.5%. The approach proposed in [19] adopts an alternative strategy as far as the features are concerned. Rather than using features inspired by good practices in photography, like the other works presented so far, it uses features like SIFT and descriptors like the Bag of Words or the Fisher Vector. While being general purpose, these are expected to encode the properties that distinguish between pleasant and non-pleasant images. The experiments are performed over 12,000 pictures and the accuracy in a binary classification task (6,000 pictures per class) is close to 90%.

B. Personality and Social Media

The literature proposes several works investigating the interplay between the traces that people leave on social media (posts, pictures, profiles, likes, etc.) and personality traits, both self-assessed [20], [21], [22], [23], [24], [25] and attributed [25], [26], [27].

The approach proposed in [20] infers the self-assessed personality traits of 167 Facebook users from the absence or presence of certain items (e.g., political orientation, religion, etc.) in the profile. The results, obtained with regression approaches based on Gaussian Processes and the M5 algorithm, correspond to a mean absolute error lower than 0.15. In a similar way, the experiments presented in [21] predict whether 209 users of Ren Ren (a popular Chinese social networking platform) are in the lowest, middle or highest third of the observed personality scores. The features adopted in such a work include usage measures such as the post frequency, the number of uploads, etc. The results show an F-measure up to 72% depending on the trait. APR on Facebook profiles was the subject of an international benchmarking campaign2, the results of which appear in [22]. The main indication of this initiative is that selection techniques applied to large sets of initial features lead to the highest performances.

Given the difficulty of collecting large amounts of self-assessments, two approaches propose to use measures like the number of connections or the lexical choices in posts as a criterion to assign personality scores to social media users [23], [24]. In the case of [23], the proposed methodology first measures whether features like the use of punctuation (e.g., exclamation marks) or emoticons are stable for a given user, then it uses the most stable features to assign personality traits. The results show an accuracy of 63% in predicting the actual self-assessed traits of 156 users of FriendFeed, an Italian social network. In the case of [24], the authors simply label the users as extravert or introvert depending on how many connections they have. The resulting labels are then predicted automatically using lexical choices, i.e.

2http://mypersonality.org/wiki/doku.php?id=wcpr13


Fig. 1: The figure shows the distribution of self-assessed and attributed traits: one histogram per trait (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism), with the BFI-10 score (-4 to 4) on the x-axis and the number of subjects on the y-axis.

calculating how frequently people use words falling in the different categories of the Linguistic Inquiry and Word Count (LIWC).
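The category-frequency counting just described can be sketched as follows; the two-category lexicon below is a hypothetical toy (the actual LIWC dictionary is proprietary and far larger):

```python
# Toy LIWC-style features: the fraction of a text's words that fall
# into each category. The lexicon is a hypothetical stand-in.
LEXICON = {
    "positive_emotion": {"happy", "love", "nice", "great"},
    "social": {"friend", "talk", "family", "we"},
}

def liwc_features(text):
    """Return, per category, the fraction of words in that category."""
    words = text.lower().split()
    n = len(words)
    return {cat: sum(w in vocab for w in words) / n
            for cat, vocab in LEXICON.items()}

feats = liwc_features("we talk about a great friend")
```

For the example sentence, 3 of the 6 words are "social" and 1 of 6 is "positive_emotion"; a regressor would then map such fractions to trait scores.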

While the works presented so far address the APR problem, the experiments presented in [26], [27] target APP. In particular, both works investigate the agreement between the self-assessed traits of social media users and the traits attributed to them by observers after they post material online. In the case of [26], the focus is on profile pictures and the experiments show that the agreement is higher when the picture subjects smile and do not wear hats. The other work [27] performs a similar analysis over profiles showing personal information and the results show that the agreement improves when people post information about their spiritual and religious beliefs, their most important sources of satisfaction, and material they consider to be funny.

To the best of our knowledge, the work in [25] is the only one that addresses both APR and APP using the same data. The experiments are performed over the PsychoFlickr Corpus, the same dataset used in the experiments of this article (see Section III). The results show that the agreement between self-assessed and attributed traits ranges between 0.32 and 0.55 depending on the trait. Furthermore, experiments based on Counting Grid models and Lasso regression approaches lead to a correlation up to 0.62 for the attributed traits and up to 0.22 for self-assessed ones.

III. PSYCHOFLICKR: PICTURES AND PERSONALITY

The experiments of this work are performed over PsychoFlickr, a corpus designed to investigate the interplay between aesthetic preferences and personality traits. The corpus includes pictures that 300 Flickr users have tagged as favorite (200 pictures per user for a total of 60,000 samples). Furthermore, for each user, the corpus includes both self-assessed and attributed traits (see Section III-B). Therefore, PsychoFlickr allows one to perform both APP and APR experiments.

A. The Subjects

The subjects included in the corpus were recruited through a word-of-mouth process. A few Flickr Pro users were contacted personally and asked to involve other Pro users in the experiment (typically through the social networking facilities available on Flickr). The process was stopped once the first 300 individuals answered positively. The resulting pool of users includes 214 men (71.3% of the total) and 86 women (28.7% of the total). The age at the moment of the data collection is available only for 44 subjects (14.7% of the total)3. These participants are between 20 and 62 years old and the average age is 39. However, it is not possible to know whether this is representative of the entire pool. The nationality is available for 288 users (96.0% of the total), who come from 37 different countries. The most represented ones are Italy (153 subjects, 51% of the total), United Kingdom (31 subjects, 10.3% of the total), United States (28 subjects, 9.3% of the total), and France (13 participants, 4.3% of the total).

B. Personality and its Measurement

Personality is the latent construct that accounts for “individuals’ characteristic patterns of thought, emotion, and behavior together with the psychological mechanisms - hidden or not - behind those patterns” [28]. The literature proposes a large number of personality models and PsychoFlickr adopts the Big-Five (BF) Traits, five broad dimensions that have been shown to capture most individual differences [29]. The reason behind this choice is twofold: on the one hand, the BF is the model most commonly applied in both personality computing [11] and personality science [12]. On the other hand, the BF model represents personality in terms of five numerical scores, a form particularly suitable for computer processing.

The scores above account for how well the behavior of an individual fits the tendencies associated to the BF Traits, i.e. Openness (tendency to be intellectually open, curious and have wide interests), Conscientiousness (tendency to be responsible, reliable and trustworthy), Extraversion (tendency to interact and spend time with others), Agreeableness (tendency to be kind, generous, etc.) and Neuroticism (tendency to experience the negative aspects of life, to be anxious, sensitive, etc.).

In the BF framework, assessing the personality of an individual means calculating the five scores corresponding to the traits above. The literature proposes several questionnaires designed for such a task (see [11] for a list of the most important ones). The personality assessments of PsychoFlickr have been obtained with the BFI-10 [30], a list of ten items associated to 5-point Likert scales ranging from “Strongly Disagree” to “Strongly Agree”. The main advantage of the BFI-10 is that it can be completed in less than one minute while still providing reliable results. Table I shows the BFI-10 for both self-assessment and attribution of personality traits. In the self-assessment case, people fill the questionnaire about themselves and the result is a personality self-assessment (necessary to perform APR experiments). In the attribution case, people fill

3 Personal information is extracted from the Flickr profiles, where the users are allowed to hide the details they prefer to keep private.


| Self-Assessment | Attribution | Trait |
|---|---|---|
| I am reserved | The user is reserved | Ext |
| I am generally trusting | The user is generally trusting | Agr |
| I tend to be lazy | The user tends to be lazy | Con |
| I am relaxed, handle stress well | The user is relaxed, handles stress well | Neu |
| I have few artistic interests | The user has few artistic interests | Ope |
| I am outgoing, sociable | The user is outgoing, sociable | Ext |
| I tend to find fault with others | The user tends to find fault with others | Agr |
| I do a thorough job | The user does a thorough job | Con |
| I get nervous easily | The user gets nervous easily | Neu |
| I have an active imagination | The user has an active imagination | Ope |

TABLE I: The BFI-10 [30] is the short version of the Big-Five Inventory. Each item is associated to a Likert scale whose answers are mapped into integers from -2 (“Strongly disagree”) to 2 (“Strongly agree”); each item contributes to the integer score (in the interval [-4, 4]) of the trait indicated in the third column. The table shows the questionnaire in both its self-assessment and attribution versions.

the questionnaire about others and the result is a personality attribution (necessary to perform APP experiments).
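The scoring implied by Table I can be sketched as follows: each trait is measured by two items, one of them reverse-keyed, and the trait score in [-4, 4] is the sum of the (possibly sign-flipped) answers in [-2, 2]. Which items are reverse-keyed follows the published BFI-10 key, an assumption here since the article does not list it explicitly.

```python
# Items in Table I order; each tuple is (trait, reverse_keyed).
# The reverse-keying is an assumption based on the standard BFI-10 key.
ITEMS = [
    ("Ext", True),   # I am reserved
    ("Agr", False),  # I am generally trusting
    ("Con", True),   # I tend to be lazy
    ("Neu", True),   # I am relaxed, handle stress well
    ("Ope", True),   # I have few artistic interests
    ("Ext", False),  # I am outgoing, sociable
    ("Agr", True),   # I tend to find fault with others
    ("Con", False),  # I do a thorough job
    ("Neu", False),  # I get nervous easily
    ("Ope", False),  # I have an active imagination
]

def bfi10_scores(answers):
    """answers: ten integers in [-2, 2], in Table I order.
    Returns one integer score per trait, each in [-4, 4]."""
    scores = {"Ope": 0, "Con": 0, "Ext": 0, "Agr": 0, "Neu": 0}
    for (trait, reverse), a in zip(ITEMS, answers):
        scores[trait] += -a if reverse else a
    return scores

# Endorsing the keyed pole of every item yields the maximal score (4)
# on every trait:
s = bfi10_scores([-2, 2, -2, -2, -2, 2, -2, 2, 2, 2])
```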

The 300 Flickr users included in the corpus were asked to fill the self-assessment version of the BFI-10 and were offered a short analysis of the outcome as a reward for their participation. The left chart of Figure 1 shows the distribution of the scores for each trait. In line with the observations of the literature, the self-assessments tend to be biased towards socially desirable characteristics (e.g., high Conscientiousness and low Neuroticism) [31]. In the case of PsychoFlickr, this applies in particular to Openness, the trait of intellectual curiosity and artistic inclinations. A possible explanation is that the pool of subjects includes individuals who tend to consider photography a form of artistic expression. However, no information is available about this aspect of the users.

In parallel, 12 independent judges were hired to attribute personality traits to the 300 subjects of the corpus. The judges were fully unacquainted with the users and all came from the same country (Italy) to ensure cultural homogeneity. They were asked to watch the 200 pictures tagged as favorite by each user and, immediately after, to fill the attribution version of the BFI-10 (“Attribution” column of Table I). Each judge assessed all 300 users and the 12 assessments available for each user were averaged to obtain the attributed traits. The judges were paid 95 Euros for their work. The right chart of Figure 1 shows the resulting distribution of scores. Since the judges were fully unacquainted with the users and all they knew about the people they assessed was the favorite pictures, the ratings tend to peak around 0, the score associated to the expression “Neither agree nor disagree”.

The agreement among the assessors was measured with Krippendorff’s α [32], a reliability coefficient suitable for a wide variety of assessments (binary, nominal, ordinal, interval, etc.) and robust to small sample sizes. The value of α is 0.06 for Openness, 0.12 for Conscientiousness, 0.26 for Extraversion, 0.17 for Agreeableness and 0.22 for Neuroticism. The values are statistically significant and comparable to those observed in the literature for zero-acquaintance scenarios [33], i.e. situations where assessors and the subjects being rated do not have any personal contact (as in the PsychoFlickr corpus). Furthermore, the results of the APP experiments (see Section VI) confirm that the judgments are sufficiently reliable to allow statistically significant performances.
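A minimal sketch of Krippendorff's α for interval data with complete ratings (the general coefficient also handles missing entries and other difference functions) might look like:

```python
import numpy as np

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data, no missing values.
    ratings: (n_coders, n_units) array. A sketch for the complete-data
    case: alpha = 1 - Do/De, with squared differences as the metric."""
    r = np.asarray(ratings, dtype=float)
    m, n = r.shape
    # Observed disagreement: mean squared difference between the
    # ratings given to the same unit by different coders.
    do = 0.0
    for u in range(n):
        v = r[:, u]
        diff = v[:, None] - v[None, :]
        do += (diff ** 2).sum() / (m * (m - 1))
    do /= n
    # Expected disagreement: mean squared difference over all pairs
    # of pooled values, regardless of unit.
    pooled = r.ravel()
    N = pooled.size
    diff = pooled[:, None] - pooled[None, :]
    de = (diff ** 2).sum() / (N * (N - 1))
    return 1.0 - do / de

# Perfect agreement gives alpha = 1; within-unit noise lowers it.
alpha_perfect = krippendorff_alpha_interval([[1, 2, 3, 4, 5]] * 3)
alpha_noisy = krippendorff_alpha_interval(
    [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]])
```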

IV. FEATURE EXTRACTION

The goal of this work is to map the aesthetic preferences of Flickr users into personality traits. Therefore, the features adopted in this work focus on the contributions of Computational Aesthetics (see Section II) and do not take into account feature extraction techniques like SIFT or HOG that, while being widely applied in the computer vision community, do not account for aesthetic preferences. A synopsis of the features adopted in this work is available in Table II. The features cover a wide, though not exhaustive, spectrum of visual characteristics and are grouped into three main categories: color, composition and textural properties. This follows the taxonomy proposed in [34], but it excludes the content category to make the process more robust with respect to the wide semantic variability of Flickr images. The only exception is a feature that counts the number of faces, because faces are ubiquitous in the images and, furthermore, the human brain is tuned to their detection in the environment [43].

A. Color

The feature extraction process represents colors with the HSV model, from the initials of Hue, Saturation and Value (the latter is often referred to as Brightness). This section describes features related to colors and their use.

HSV statistics: These features account for the use of colors and are based on statistics collected over the H, S and V pixel values observed in a picture (see Figure 2). The H channel provides information about color diversity through its circular variance R [35]:

A = \sum_{k=1}^{K} \sum_{l=1}^{L} \cos H_{kl}, \qquad B = \sum_{k=1}^{K} \sum_{l=1}^{L} \sin H_{kl}

R = 1 - \frac{1}{KL} \sqrt{A^2 + B^2}

where H_{kl} is the Hue of pixel (k, l), K is the image height and L is the image width. On the S and V channels we compute the average and standard deviation; in particular, the average Saturation indicates chromatic purity, while the average over the V channel is called use of light and corresponds to a fundamental observation of image aesthetics, i.e. that underexposed or overexposed pictures are usually not considered aesthetically appealing [36]. The average of the Hue was not calculated because, being an angular measure, it cannot be associated to an intensity attribute (low, high). Figure 2 provides examples of how the pictures change according to HSV statistics.
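In code, the circular variance of the hue channel can be sketched as follows (hues are assumed to be in radians; for a K×L image, KL is simply the number of pixels):

```python
import math

def hue_circular_variance(hues):
    """Circular variance R of hue angles (radians):
    R = 1 - sqrt(A^2 + B^2) / n, with A, B the summed cos/sin
    and n the number of pixels (K*L for a K x L image)."""
    n = len(hues)
    A = sum(math.cos(h) for h in hues)
    B = sum(math.sin(h) for h in hues)
    return 1.0 - math.sqrt(A * A + B * B) / n

# A single repeated hue gives no diversity (R ~ 0); hues spread
# uniformly around the circle give maximal diversity (R ~ 1).
low = hue_circular_variance([0.5] * 100)
high = hue_circular_variance([2 * math.pi * i / 100 for i in range(100)])
```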


| Category | Name | d | Short Description |
|---|---|---|---|
| Color | HSV statistics | 5 | Average of the S channel and standard deviation of the S, V channels [34]; circular variance in HSV color space [35]; use of light as the average pixel intensity of the V channel [36] |
| Color | Emotion-based | 3 | Measurement of valence, arousal, dominance [34], [37] |
| Color | Color diversity | 1 | Distance w.r.t. a uniform color histogram, by Earth Mover’s Distance (EMD) [36], [34] |
| Color | Color name | 11 | Amount of black, blue, brown, green, gray, orange, pink, purple, red, white, yellow [34] |
| Composition | Edge pixels | 1 | Total number of edge points, extracted with the Canny detector [38] |
| Composition | Level of detail | 1 | Number of regions (after mean shift segmentation) [39], [40] |
| Composition | Average region size | 1 | Average size of the regions (after mean shift segmentation) [39] |
| Composition | Low depth of field (DOF) | 3 | Amount of focus sharpness in the inner part of the image w.r.t. the overall focus [36], [34] |
| Composition | Rule of thirds | 2 | Average of the S, V channels over the inner rectangle [36], [34] |
| Composition | Image size | 1 | Size of the image [36], [38] |
| Textural Properties | Gray distribution entropy | 1 | Image entropy [38] |
| Textural Properties | Wavelet based textures | 12 | Level of spatial graininess measured with a three-level (L1, L2, L3) Daubechies wavelet transform on the HSV channels [36] |
| Textural Properties | Tamura | 3 | Amount of coarseness, contrast, directionality [41] |
| Textural Properties | GLCM features | 12 | Amount of contrast, correlation, energy, homogeneousness for each HSV channel [34] |
| Textural Properties | GIST descriptors | 24 | Output of GIST filters for scene recognition [42] |
| Faces | Faces | 1 | Number of faces (extracted manually) |

TABLE II: Synopsis of the features. Every image is represented with 82 features split into four major categories: Color, Composition, Textural Properties, and Faces. The latter is the only feature that takes the picture content into account.

Emotion-based: Saturation and Brightness can elicit emotions according to the following equations, resulting from psychological studies (Valence, Arousal and Dominance are dimensions commonly adopted to represent emotions) [37]:

Valence = 0.69 · V + 0.22 · S (1)
Arousal = −0.31 · V + 0.60 · S (2)
Dominance = −0.76 · V + 0.32 · S (3)

where V and S are the averages of the V and S channels over an image, respectively. See Figure 2 for examples of pictures with different levels of Valence, Arousal and Dominance.
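Equations (1)-(3) translate directly into code; the example input values below are illustrative:

```python
# Map the mean Value (brightness) and mean Saturation of an image
# to the Valence, Arousal and Dominance scores of equations (1)-(3).
def emotion_features(mean_v, mean_s):
    valence = 0.69 * mean_v + 0.22 * mean_s
    arousal = -0.31 * mean_v + 0.60 * mean_s
    dominance = -0.76 * mean_v + 0.32 * mean_s
    return valence, arousal, dominance

# A bright, saturated image (means close to 1) scores high on valence:
v, a, d = emotion_features(0.9, 0.8)
```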

Color diversity (Colorfulness): The feature distinguishes multi-colored images from monochromatic, sepia or low-contrast pictures. Following the approach in [36], the image under analysis is converted into the CIELUV color space and its color histogram is computed; this representation is compared (in terms of Earth Mover’s Distance) with the histogram of an ideal image where the distribution over the colors is uniform, that is, a histogram where all the bins have the same value (Figure 2 shows examples of pictures with different colorfulness).
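A simplified sketch of this feature, reduced to a 1-D histogram (the paper compares histograms computed in the CIELUV space), exploits the fact that for 1-D distributions of equal mass the EMD is the summed absolute difference of the cumulative sums:

```python
# 1-D Earth Mover's Distance between two normalized histograms,
# computed as the running (cumulative) mass that must be carried
# from one bin to the next.
def emd_1d(hist_a, hist_b):
    total = 0.0
    carried = 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a - b
        total += abs(carried)
    return total

def colorfulness(hist):
    """Distance of a color histogram from the uniform histogram;
    a smaller distance means a more diverse use of colors."""
    n = len(hist)
    uniform = [1.0 / n] * n
    s = sum(hist)
    normalized = [h / s for h in hist]
    return emd_1d(normalized, uniform)

# A monochromatic image (all mass in one bin) is far from uniform,
# while a flat histogram has distance zero:
mono = colorfulness([100, 0, 0, 0])
flat = colorfulness([25, 25, 25, 25])
```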

Color name: Every pixel of an image can be assigned to one of the following classes identified in [44]: black, blue, brown, grey, green, orange, pink, purple, red, white and yellow. The sum of pixels across the classes above accounts not only for how frequently the colors appear in an image, but also for the style of a photographer. The classification of the pixels is performed using the algorithm proposed in [45], which mimics the way humans label chromatic information in an image. The fractions of pixels belonging to each of the classes above are used as features.

Fig. 2: The figure shows examples of how the visual properties of a picture change according to several color-related features (use of light, average saturation, Valence, Dominance, Arousal, hue circular variance and color diversity), with a high-valued and a low-valued example for each.

B. Composition

The organization of visual elements across an image, as distinct from its subject, is referred to as composition. The features described in this section aim at capturing such an aspect of the pictures.

Edge pixels: The structure of an image depends, to a significant extent, on the edges, i.e. on those points where the image brightness shows discontinuities. Therefore, the feature extraction process adopts the Canny detector [38] to identify the edges and calculates the fraction of pixels in an image that lie on an edge (see Figure 3 for an example of edge extraction in an image).

Level of detail: Images can be partitioned, or segmented, into multiple regions, i.e. sets of pixels that share common visual characteristics. The feature extraction process segments the images using the EDISON implementation [39] of the mean shift algorithm [40], and provides two features: i) the number of segments, accounting for the fragmentation of the image, and ii) the normalized average extension of the regions, that is, the mean area of the regions divided by the area of the whole image. On average, the more the details, the more the segments (see Figure 3).
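The edge-pixel feature can be sketched as below. A simple gradient-magnitude threshold stands in for the Canny detector used in the paper; `edge_pixel_fraction` and its `threshold` parameter are illustrative assumptions of this sketch, not part of the original pipeline:

```python
import numpy as np

def edge_pixel_fraction(gray, threshold=0.25):
    """Fraction of pixels lying on an edge.

    Stand-in for the Canny detector: a pixel is counted as an edge pixel
    when its gradient magnitude (central differences) exceeds `threshold`.
    `gray` is a 2-D array with values in [0, 1].
    """
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    return float((magnitude > threshold).mean())

# Left half black, right half white: edges concentrate on the boundary.
gray = np.zeros((8, 8))
gray[:, 4:] = 1.0
```

On the step image above, only the two columns straddling the black/white boundary exceed the threshold, so the fraction is 16/64 = 0.25.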

Low depth of field (DOF) indicator: An image with low depth of field corresponds to a shot where the object of interest is sharper than the background, drawing the attention of the observer [34], [36] (see Figure 3). To detect low DOF, it is assumed that the object of interest is central; the image is thus decomposed into wavelet coefficients (see the next section), which measure the frequency content of a picture: in particular, high frequency coefficients (formally, level 3 in the notation of Eq. (6)) encode fine visual details. The low DOF indicator is the ratio of the high frequency wavelet coefficients of the inner part of the image to those of the whole image. Specifically, the image is divided into 16 equal rectangular blocks M1, . . . , M16, numbered in row-major order. Let w3 = {w3^{h=HL}, w3^{v=LH}, w3^{d=HH}} denote the set of level-3 high frequency wavelet coefficients of the hue image I_H. The low DOF indicator DOF_H for hue is computed as follows:

DOF_H = ( Σ_{(k,l) ∈ M6 ∪ M7 ∪ M10 ∪ M11} w3(k, l) ) / ( Σ_{i=1}^{16} Σ_{(k,l) ∈ Mi} w3(k, l) )    (4)

A high DOF_H indicates an apparent low depth of field in the image. The low depth of field indicators for the Saturation and the Brightness channels are computed similarly on the corresponding image channels. Figure 3 shows the difference between images with different depth of field.
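Given the level-3 coefficient magnitudes of one channel, Eq. (4) reduces to a ratio of block sums. The sketch below assumes the coefficients are already available as a 2-D array (the wavelet decomposition itself is sketched in the next section); the function name is illustrative:

```python
import numpy as np

def low_dof_indicator(w3):
    """Low depth-of-field indicator of Eq. (4).

    `w3` holds the magnitudes of the level-3 high-frequency wavelet
    coefficients of one channel. The coefficient plane is split into a
    4x4 grid of equal blocks; the mass of the four central blocks
    M6, M7, M10, M11 is divided by the total mass.
    """
    h, w = w3.shape
    bh, bw = h // 4, w // 4
    center = w3[bh:3 * bh, bw:3 * bw].sum()   # blocks M6, M7, M10, M11
    return float(center / w3.sum())

# All detail concentrated in the centre -> indicator equal to 1.
coeffs = np.zeros((8, 8))
coeffs[2:6, 2:6] = 1.0
```

A uniform coefficient plane gives 4/16 = 0.25, the chance level of the ratio.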

The rule of thirds: Any image can be ideally divided into nine blocks, arranged in a 3 × 3 grid, by two equally-spaced horizontal lines and two equally-spaced vertical lines. The rule of thirds is a photographic composition guideline that suggests positioning the important visual elements of a picture along such lines or at their intersections. In other words, it suggests where the most salient objects should lie in the image. The rule of thirds feature in image aesthetics simplifies the photographic technique by analyzing the central block of the image, keeping the average values of Saturation and Brightness [34], [36]:

f_S = (9 / KL) Σ_{k=K/3}^{2K/3} Σ_{l=L/3}^{2L/3} S_{kl}    (5)

where K is the image height, L is the image width and S_{kl} is the Saturation at pixel (k, l). A similar feature f_V can be calculated for the Brightness.
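Since the inner block contains K/3 · L/3 pixels, the 9/(KL)-weighted sum of Eq. (5) is just the mean over that block. A minimal sketch (function name assumed):

```python
import numpy as np

def rule_of_thirds_feature(channel):
    """Average of a channel over the central block of a 3x3 grid (Eq. 5).

    With K x L pixels, the inner block has K/3 * L/3 of them, so the
    9/(KL)-weighted sum reduces to the mean over that block. `channel`
    is the Saturation plane (for f_S) or the Brightness plane (for f_V).
    """
    k, l = channel.shape
    inner = channel[k // 3: 2 * k // 3, l // 3: 2 * l // 3]
    return float(inner.mean())

sat = np.zeros((9, 9))
sat[3:6, 3:6] = 1.0  # fully saturated centre, desaturated border
```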

Image size: The size of the image is calculated as the total number of pixels and used as a feature.

Fig. 3: The figure shows the effect of the Canny algorithm (original vs. processed image) and examples of the visual properties associated to Level of Detail (high: 528 segments, norm. average extension 0.002; low: 2 segments, norm. average extension 0.5) and to the Low Depth of Field indicator (strong vs. weak).

C. Textural Properties

A texture is the spatial arrangement of intensity and colors in an image or in an image region. Textures capture perceptual aspects (e.g., they are more evident in sharp images than in blurred ones) and provide information about the subject of an image (e.g., textures tend to be more regular in pictures of artificial objects than in those of natural landscapes). The features described in this section aim at capturing textural properties.

Entropy: The entropy serves as a feature to measure the homogeneousness of an image. The image is first converted into gray levels; then, for each pixel, the distribution of the gray values in a neighborhood of 9 × 9 pixels is calculated (that is, the gray level histogram of the patch) and the entropy of the distribution is computed. Finally, all the entropy values are summed and divided by the size of the image. The more the intensity tends to be uniform across the image, the lower the entropy (see Figure 4 for the impact of Entropy on visual characteristics).

Wavelet textures: The Daubechies wavelet transform can measure the spatial smoothness/graininess of images [36], [34]. The 2D Discrete Wavelet Transform (2D-DWT) of an image analyzes its frequency content, where high frequency can be intuitively associated with high edge density. The output of a 2D-DWT can be visualized as a multilevel organization of square patches (see Figure 5). Each level corresponds to a given frequency analysis of the original image. In the first level of decomposition, the image is separated into four parts, each with a quarter of the size of the original image and a label. The upper left part is labeled LL (LowLow) and is a low-pass version of the original image. The vertical LH (LowHigh), horizontal HL (HighLow) and diagonal HH (HighHigh) parts can be regarded as images where vertical, horizontal and diagonal edges at the finest scale are highlighted. We can call them edge images at level 1. The subdivision can be further applied to find coarser edges, as the figure shows, by performing the wavelet transform again on the coarser (LL coefficients) version at half the resolution, recursively, in order to further decorrelate neighboring pixels of the input image. The Daubechies wavelet transform is a particular kind of wavelet transform, explicitly suited for


Fig. 4: The figure shows examples of pictures where the value of the textural features (Entropy; Tamura directionality, coarseness and contrast; GLCM contrast, correlation, energy and homogeneity on the H, S and V channels) is high and low.

compression and denoising of images.

The feature extraction process computes a three-level wavelet transform on the H, S and V channels separately. At each level, we have three parts representing the edge images, called w_i^h, w_i^v and w_i^d, where i ∈ {1, 2, 3}, d = HH, h = HL and v = LH, to resemble the kind of edges that are highlighted (diagonal, horizontal and vertical, respectively). The wavelet features are defined as follows:

w_i^f = ( Σ_{k,l} w_i^h(k,l) + Σ_{k,l} w_i^v(k,l) + Σ_{k,l} w_i^d(k,l) ) / ( |w_i^h| + |w_i^v| + |w_i^d| )    (6)

for a total of 9 features (three levels for each of the three channels). The values k, l span the spatial domain of the single w taken into account, and the operator | · | denotes the spatial area of the single w. In other words, for each color space channel and wavelet transform level, we average the values of the high frequency coefficients. We extracted three more features by computing the sum of the average wavelet coefficients over all three frequency levels for each HSV channel (see Figure 5).
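The per-level feature of Eq. (6) can be sketched with a Haar transform standing in for the Daubechies wavelet used in the paper; averaging the absolute values of the high-frequency coefficients is an assumption of this sketch (Eq. (6) sums the raw coefficients), as are both function names:

```python
import numpy as np

def haar_level(img):
    """One level of a 2-D Haar transform (a simple stand-in for the
    Daubechies transform). Returns the low-pass part LL and the three
    high-frequency bands at half resolution."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4
    hl = (a - b + c - d) / 4   # column-difference band
    lh = (a + b - c - d) / 4   # row-difference band
    hh = (a - b - c + d) / 4   # diagonal band
    return ll, lh, hl, hh

def wavelet_features(channel, levels=3):
    """Per-level feature in the spirit of Eq. (6): average absolute
    high-frequency coefficient over the three edge images."""
    feats, low = [], channel.astype(float)
    for _ in range(levels):
        low, lh, hl, hh = haar_level(low)
        num = np.abs(lh).sum() + np.abs(hl).sum() + np.abs(hh).sum()
        den = lh.size + hl.size + hh.size
        feats.append(num / den)
    return feats

smooth = np.ones((16, 16))   # constant image: no edges at any level
```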

Tamura: In [41], six texture features corresponding to human visual perception have been proposed: coarseness, contrast, directionality, line-likeness, regularity and roughness. The first three have been found particularly important, since they are tightly correlated with human perception, and have been considered in this work. They are extracted from gray level images.

Coarseness: The feature gives information about the size of texture elements. A coarse texture contains a small number of large texels, while a fine texture contains a large number of small texels (see Figure 4). The coarseness measure is computed as follows. Let X be an N0 × N1 matrix of values X(n0, n1) that can, e.g., be interpreted as gray values:

1) For every point (n0, n1), calculate the average over neighbourhoods whose sizes are powers of two, e.g. 1 × 1, 2 × 2, 4 × 4, . . . , 32 × 32:

A_k(n0, n1) = (1 / 2^{2k}) Σ_{i=1}^{2^k} Σ_{j=1}^{2^k} X(n0 − 2^{k−1} + i, n1 − 2^{k−1} + j)    (7)

2) For every point (n0, n1), calculate the difference between the non-overlapping neighbourhoods on opposite sides of the point in the horizontal and vertical directions:

E_k^h(n0, n1) = |A_k(n0 + 2^{k−1}, n1) − A_k(n0 − 2^{k−1}, n1)|    (8)

and

E_k^v(n0, n1) = |A_k(n0, n1 + 2^{k−1}) − A_k(n0, n1 − 2^{k−1})|    (9)

3) At each point (n0, n1), select the size leading to the highest difference value:

S(n0, n1) = argmax_{k=1...5} max_{d=h,v} E_k^d(n0, n1)    (10)

4) Finally, take the average of 2^S as a coarseness measure for the image:

F_crs = (1 / N0 N1) Σ_{n0=1}^{N0} Σ_{n1=1}^{N1} 2^{S(n0, n1)}    (11)

Contrast: It stands, roughly speaking, for texture quality. It is calculated as

F_con = σ / (α_4)^z,  with  α_4 = µ_4 / σ^4    (12)

where µ_4 = (1 / KL) Σ_{k=1}^{K} Σ_{l=1}^{L} (X(k, l) − µ)^4 is the fourth moment about the mean µ, σ^2 is the variance of the gray values of the image, and z has been experimentally determined to be 1/4. In practice, contrast is influenced by the following two factors: the range of gray levels (large for high contrast) and the polarization of the distribution of black and white on the gray-level histogram (polarized histogram for high contrast). For an example, see Figure 4.

Directionality: It models how polarized the distribution of edge orientations is. High directionality indicates a texture where the edges are homogeneously oriented, and vice versa. Given the directions of all the edge pixels, the entropy E of their distribution is calculated; the directionality is then 1/(E + 1). Textures with edges oriented along a single direction produce a single-peaked distribution, thus E = 0 and maximal directionality (= 1). Vice versa, pictures whose edge orientations are distributed uniformly have low directionality (∼ 0). For an example, see Figure 4.
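The 1/(E + 1) mapping can be sketched as below; the number of orientation bins is an assumption of this sketch:

```python
import numpy as np

def directionality(edge_orientations, bins=16):
    """Tamura directionality: 1 / (E + 1), where E is the entropy of the
    edge-orientation histogram. A single dominant orientation gives E = 0
    and directionality 1; uniformly spread orientations give low values.
    `edge_orientations` holds the angles (radians) of the edge pixels."""
    counts, _ = np.histogram(edge_orientations, bins=bins, range=(0, np.pi))
    p = counts / counts.sum()
    p = p[p > 0]
    entropy = float(-(p * np.log2(p)).sum())
    return 1.0 / (entropy + 1.0)

aligned = np.zeros(100)                                # one orientation
spread = np.repeat(np.arange(16), 10) * np.pi / 16 + 0.01  # 16 even bins
```

With 16 equally filled bins the entropy is log2(16) = 4, so the directionality drops to 1/5 = 0.2.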

Gray-Level Co-occurrence Matrix (GLCM) features: The GLCM is a matrix where the element (i, j) is the probability p(i, j) of observing values i and j for a given channel (H, S or V) in the pixels of the same region W. In the feature extraction process, W includes a pixel and its right neighbor and, therefore, the GLCM includes the probabilities of observing one pixel with value j at the right of a pixel with value i. The GLCM serves as a basis for calculating several features, each obtained separately over


Fig. 5: The figure shows how the wavelet decomposition works: a) the original image is split into LL1, LH1, HL1 and HH1; b) the decomposition is applied again to LL1, yielding LL2, LH2, HL2 and HH2.

the H, S and V channels [46]. Contrast: It is the average value of (i − j)^2, the square difference of values observed in neighboring pixels:

C = Σ_{i,j=0}^{L−1} (i − j)^2 p(i, j)

where L is the number of possible values in a pixel. The value of C ranges between 0 (uniform image) and (L − 1)^2 (see Figure 4 for examples of pictures with high and low contrast).

Correlation: It is the coefficient that measures the covariation between neighboring pixels:

( Σ_{i,j=0}^{L−1} (i − µ)(j − µ) p(i, j) ) / σ^2,    (13)

where µ = Σ_{i,j=0}^{L−1} i p(i, j) and σ^2 = Σ_{i,j=0}^{L−1} p(i, j)(i − µ)^2 + Σ_{i,j=0}^{L−1} p(i, j)(j − µ)^2. The correlation ranges in the interval [−1, 1] (see Figure 4).

Energy: It is the sum of the squared values of the GLCM elements: Σ_{i,j=0}^{L−1} p(i, j)^2. If an image is uniform, the energy is 1 (see Figure 4).

Homogeneity: It is a measure of how frequently neighboring pixels have the same value:

H = Σ_{i,j=0}^{L−1} p(i, j) / (1 + |i − j|)    (14)

The feature tends to be higher when the elements on the diagonal of the GLCM are larger (see Figure 4).
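The four GLCM features can be sketched in a few lines; the function name, the number of gray levels and the handling of the degenerate zero-variance case are assumptions of this sketch:

```python
import numpy as np

def glcm_features(channel, levels=8):
    """GLCM over horizontally adjacent pixels plus the four features used
    in the paper. `channel` is a 2-D array of integer values in
    [0, levels)."""
    p = np.zeros((levels, levels))
    left, right = channel[:, :-1], channel[:, 1:]
    for a, b in zip(left.ravel(), right.ravel()):
        p[a, b] += 1                       # count co-occurrences
    p /= p.sum()                           # normalise to probabilities
    i, j = np.indices((levels, levels))
    contrast = float(((i - j) ** 2 * p).sum())
    mu = float((i * p).sum())
    var = float((p * (i - mu) ** 2).sum() + (p * (j - mu) ** 2).sum())
    # a uniform image has zero variance; report perfect correlation then
    correlation = float(((i - mu) * (j - mu) * p).sum() / var) if var else 1.0
    energy = float((p ** 2).sum())
    homogeneity = float((p / (1 + np.abs(i - j))).sum())
    return contrast, correlation, energy, homogeneity

uniform = np.zeros((4, 4), dtype=int)   # single-valued image
```

For the uniform image, all co-occurrence mass sits on the diagonal element (0, 0), so contrast is 0 while energy and homogeneity are both 1, matching the limits stated above.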

Spatial Envelope (GIST): It is a low dimensional representation of a scene that relies on Gabor filters to capture a set of perceptual dimensions, namely naturalness, openness, roughness, expansion and ruggedness [42]. The outputs of the GIST filters are used as features.

D. Number of Faces

All features presented so far are content independent, i.e. they do not take into account what the images show. This feature is the only exception to such an approach because human faces are frequently portrayed in Flickr pictures and, furthermore, there are neural pathways that make the human brain particularly sensitive to faces [43]. In this work, the number of faces is calculated manually for each of the 60,000 pictures of the dataset.

V. INFERENCE OF PERSONALITY TRAITS

This section presents the regression approaches adopted to map the features described in Section IV into personality traits. The regression is performed separately for each of the Big-Five traits because these result from the application of Factor Analysis to behavioural data [47] and, therefore, they are independent. The goal of the experiments is to infer the personality traits - both self-assessed and attributed - from the multiple pictures that a user tags as favorite. Hence, Multiple Instance Regression (MIR) [9], [10] appears to be the most suitable computational framework because it addresses problems where multiple instances (the favorite pictures of a Flickr user) form a bag (the Flickr user) associated with one value (the score of the Flickr user for a particular trait). Furthermore, MIR approaches can deal with cases where only a subset of the bag instances actually accounts for the value to be predicted. This can be useful in case only part of the pictures tagged as favorite are actually predictive of the traits of an individual.

Before applying the MIR approaches developed for this work, the features are discretized using Q = 6 quantization levels (Q values between 3 and 9 were tested, but no significant performance differences were observed). The intervals corresponding to the levels are obtained by splitting the range of each feature in the training set into Q uniform, non-overlapping intervals. In this way, the 82 features describing each picture (see Section IV) can be interpreted as counts.
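The discretization step can be sketched as follows. Deriving the level boundaries from the training range only and applying them to both sets matches the description above; the function name and toy data are illustrative:

```python
import numpy as np

def quantize_features(train, test, q=6):
    """Discretise each feature into q uniform levels.

    The level boundaries are derived from the *training* range of each
    feature and then applied to both sets; the result can be treated as
    counts / Bag-of-Features vectors.
    """
    lo, hi = train.min(axis=0), train.max(axis=0)
    # q - 1 interior edges per feature, splitting [lo, hi] into q bins
    edges = [np.linspace(l, h, q + 1)[1:-1] for l, h in zip(lo, hi)]
    digit = lambda x: np.column_stack(
        [np.digitize(x[:, z], edges[z]) for z in range(x.shape[1])])
    return digit(train), digit(test)

train = np.array([[0.0, 10.0],
                  [6.0, 40.0],
                  [3.0, 25.0]])
test = np.array([[1.0, 12.0]])
```

Values below the training minimum fall into level 0 and values above the training maximum into level q − 1, so unseen test values always map to a valid level.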

In the following, each Flickr user u corresponds to a bag of favorite pictures B_u (u = 1, . . . , 300) and five traits y_p^u (p = O, C, E, A, N). The trait values predicted by the MIR approaches are denoted by ŷ_p^u. The notation does not distinguish between self-assessed and attributed traits because the two cases are treated independently of each other. The element C_{tz} of the feature matrix C is the value of feature z (z = 1, . . . , 82) for picture t (t = 1, . . . , 6 × 10^4). Finally, υ(t) is a function that takes as input a picture index t and returns the corresponding user index u.

A. Baseline Approaches

The general MIR formulation is NP-hard [9] and this requires the adoption of simplifying assumptions. The most common one is to consider that each bag includes a primary instance that is sufficient to predict correctly the bag label. In the experiments of this work, this means that each bag B_u includes only one picture t - with υ(t) = u - that should be fed to the regressor to obtain as output the trait score ŷ_p^u (for a given p). However, the primary instance cannot be known a-priori for a test bag. Furthermore, the bags of this work include 200 pictures and, therefore, using only one of them means neglecting a large amount of information. For this reason, this work adopts different baseline MIR approaches, more suitable for the PsychoFlickr data.


Naive-MIR [10]: The simplest baseline consists in giving each picture of a bag B_u as input to the regressor. As a result, there is a predicted score ŷ_p^u(t) for each picture t such that υ(t) = u. The final trait score prediction ŷ_p^u is the average of the ŷ_p^u(t) values. The main assumption behind Naive-MIR is that all the pictures of a bag carry task-relevant information and, therefore, they must all influence the predicted score ŷ_p^u.

cit-kNN [48]: Given a test bag B_u, this methodology adopts the minimal Hausdorff distance [49] to identify, among the training bags, both its R nearest neighbors and its C nearest citers (the training bags that have B_u among their C nearest neighbors). The predicted score ŷ_p^u is then the average of the scores of both the R nearest training bags and the C nearest citer training bags. The approach does not include an actual regression step, but still maps a test bag B_u into a continuous predicted score ŷ_p^u.

Clust-Reg [50]: The Clust-Reg MIR includes three main steps. The first consists in clustering all the pictures of the training bags using k-means, thus obtaining C centroids c_j in the feature space (j = 1, . . . , C). The second step considers all the images of a bag B_u that belong to a cluster c_k and averages them to obtain a prototype k. The same task is performed for all training bags B_u; in this way, each training bag is represented by at most C prototypes. The third step trains C regressors r_i (i = 1, . . . , C) - each obtained by training the model in Section V-C over all prototypes corresponding to one of the C clusters - and identifies r_k, the one that performs best on a validation set. At test time, r_k is applied to prototype k of a test bag B_u to obtain the predicted score ŷ_p^u. The regressor operates on a prototype k that represents only the test-bag pictures surrounding the centroid c_k. Therefore, Clust-Reg implements the assumption that only a fraction of the pictures carry task-relevant information.

B. Latent representation-based methods

The baseline approaches presented in Section V-A operate in the feature space where the pictures are represented. Such an approach is not suitable when the number of instances per bag is large - like in the experiments of this work - because it is not possible to know a-priori which samples carry information relevant to the task. For this reason, this section proposes new MIR approaches that map the pictures of the training bags onto an intermediate latent space Z expected to capture most of the information necessary to predict the trait scores. While being proposed for the experiments of this work, the new MIR approaches can be applied to any problem where the number of instances per bag is large.

Topic-Sum: The main assumption of this approach is that the pictures of a test bag B_u distribute over topics, i.e. over frequent associations of features that can be learnt from the images of the training bags. Such an approach is possible because the features extracted from the pictures have been quantized and the feature vectors can then be considered as vectors of counts or Bags of Features (see the beginning of Section V). The topic model adopted in this work is the Latent Dirichlet Allocation (LDA) [51]. The LDA expresses the topics as probability distributions of features p(C_z^u | k), with k = 1, . . . , K and K << D (D is the dimension of the feature vectors). Once the topics are learnt from the training bags, a test bag B_u can be expressed as a mixture of topics:

p(B_u) = Σ_{k=1}^{K} p(C_z^u | k) p(k | u),    (15)

where C_z^u is the feature matrix of the images belonging to B_u, and the coefficients p(k | u), called topic proportions, measure how frequently the topics appear in test bag B_u. The regressor of Section V-C is trained over vectors whose components correspond to the topic proportions of the training bags. At test time, the topic proportions of a test bag B_u are fed to the resulting regressor to obtain ŷ_p^u.

Gen-LDA: This approach learns an LDA model [51] from the pictures of each training bag and then fits a Dirichlet distribution p(·; α^u) on the resulting topic proportions (see the description of Topic-Sum above). The parameter vectors α^u are then used to train the regressor of Section V-C. At the test stage, the α^u parameters of the Dirichlet distribution corresponding to the topic proportions in test bag B_u are given as input to the regressor to predict the trait scores.

Gen-MoG: This approach learns a Mixture of Gaussians with C components from the pictures of the training bags; then it considers all the images of a test bag B_u to estimate the following for c = 1, . . . , C:

Z_u(c) = Σ_{t : υ(t) = u} p(c | t)    (16)

where p(c | t) is the a-posteriori probability of component c in the mixture when the picture is t. The values Z_u(c) are given as input to the regressor to predict the trait scores. Compared to Multiple Instance Cluster Regression [50], a similar methodology, the main difference of Gen-MoG is that an instance is softly attributed to all the components of the Mixture of Gaussians through the probabilities p(c | t).
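Eq. (16) can be sketched as below, assuming the responsibilities p(c|t) have already been obtained from a fitted Mixture of Gaussians; the function and argument names are illustrative:

```python
import numpy as np

def soft_component_profile(responsibilities, user_of_picture, n_users):
    """Z_u(c) of Eq. (16): for each user u, sum the posterior p(c|t) of
    every mixture component c over the user's pictures t.

    `responsibilities` is a (T, C) array of p(c|t) rows (each summing
    to 1); `user_of_picture` maps picture index t to its user u.
    """
    t_count, c_count = responsibilities.shape
    z = np.zeros((n_users, c_count))
    for t in range(t_count):
        z[user_of_picture[t]] += responsibilities[t]
    return z

# Three pictures, two components, two users (pictures 0, 1 -> user 0).
resp = np.array([[0.9, 0.1],
                 [0.5, 0.5],
                 [0.2, 0.8]])
owner = np.array([0, 0, 1])
```

Because each row of p(c|t) sums to 1, Z_u sums to the number of pictures in the bag, i.e. it is a soft count of how often each component occurs among the user's favorites.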

CG: This approach is based on the Counting Grid (CG) [52], a recent generative model which embeds BoF representations like those used in this work in d-dimensional manifolds. The CG allows one to map each picture t of the training bags onto a 2-dimensional grid lying on a smooth manifold, i.e. a manifold where close positions correspond to close images in the original feature space. The grid has E1 rows and E2 columns and, typically, E1 × E2 << N, where N is the number of pictures in the bag. After such a training step, every test bag B_u is projected onto the same manifold and becomes a set of locations L_u = {ℓ_t} on the 2-dimensional grid, i.e. a distribution of the test bag pictures over the grid. The distribution - an E1 × E2 dimensional vector - is first smoothed by averaging over a 5 × 5 window and then given as input to the regressor of Section V-C to predict the trait scores.

C. Regression

All the methods above, except cit-kNN, require a regressor to predict the trait scores. The one adopted in the experiments


Fig. 6: The plot reports the absolute value of the Spearman Correlation Coefficient ρ between features and traits, both self-assessed (upper part) and attributed (lower part), across the Color, Composition, Textural Properties and Faces feature categories. Color and size of the bubbles account for the absolute value of ρ; all values above 0.12 are significant at the 0.05 level.

of this work has the following form:

ŷ_p^u = Σ_{k=1}^{K} β_k x_k^u,    (17)

where β = (β_1, . . . , β_K) are the regressor parameters and the values x_k^u are the parameters that, according to the different methods presented above, represent a test bag B_u (e.g., the Dirichlet distribution parameters in Gen-LDA). The β_k are estimated by minimising the Mean Square Error E(β):

E(β) = Σ_{u=1}^{U} ( y_p^u − Σ_{k=1}^{K} β_k x_k^u )^2,    (18)

where U is the number of training bags. The problem was regularized using LASSO [53], an approach which constrains the L1 norm of the least squares solution, thus acting as a model selection method by enforcing sparsity on the coefficients β. This is particularly relevant to the problem of this work, because it allows one to model the fact that not all the features are correlated with a given trait.
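The L1-regularized version of Eq. (18) can be sketched with cyclic coordinate descent; this is a compact stand-in for the LASSO solver of [53] (the function name, penalty weight and toy data are assumptions of this sketch), illustrating how the penalty zeroes the coefficients of irrelevant features:

```python
import numpy as np

def lasso_coordinate_descent(x, y, lam=0.5, iters=200):
    """Minimise sum_u (y_u - beta . x_u)^2 + lam * ||beta||_1 by cyclic
    coordinate descent with soft-thresholding."""
    n, k = x.shape
    beta = np.zeros(k)
    for _ in range(iters):
        for j in range(k):
            # residual with coordinate j removed from the fit
            residual = y - x @ beta + x[:, j] * beta[j]
            rho = x[:, j] @ residual
            denom = (x[:, j] ** 2).sum()
            # soft threshold at lam/2 (penalty on the unscaled squared loss)
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / denom
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
y = 2.0 * x[:, 0]          # only the first feature matters
beta = lasso_coordinate_descent(x, y, lam=0.5)
```

On the noiseless toy problem, the solver recovers a coefficient close to 2 for the informative feature and (up to shrinkage) zeroes the four uninformative ones, which is exactly the model-selection behaviour exploited in the paper.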

VI. EXPERIMENTS AND RESULTS

This section first presents a correlational analysis aimed at showing the relationship between features and traits, and then the regression experiments performed in this work.

A. Correlational Analysis

Figure 6 shows the absolute covariation, measured with the Spearman Coefficient ρ, of features and traits, both self-assessed and attributed (size and color of the bubbles account for the absolute value of ρ). The covariation with the features is high for the attributed traits, but limited for the self-assessments. In particular, ρ is statistically significant (at the 0.05 level) for 48.5% of the features in the case of the attributed traits and only for 8.3% of the features in the case of the self-assessments. This suggests that the visual properties of the images influence the impression that the judges develop about the Flickr users, but do not account for the self-assessments that the users provide. For this reason, the rest of this section focuses on the attributed traits.

Color properties (see Section IV-A) covary to a significant extent with all traits and, in particular, with Agreeableness and Neuroticism. However, the properties that are positively correlated with one trait tend to be negatively correlated with the other, and vice versa. In other words, Agreeableness and Neuroticism seem to be perceived as complementary with respect to color characteristics. This applies, e.g., to average saturation (ρ = 0.40 for Agreeableness and ρ = −0.55 for Neuroticism), percentage of orange (ρ = 0.45 and ρ = −0.56), blue (ρ = 0.36 and ρ = −0.52) and red (ρ = 0.30 and ρ = −0.40) pixels, arousal (ρ = 0.38 and ρ = −0.52) and valence (ρ = 0.27 and ρ = −0.40). Overall, the judges appear to assess as high in Agreeableness users that like images eliciting pleasant emotions and showing pure colors. Vice versa, the judges consider high in Neuroticism people that like images stimulating intense, unpleasant emotions and containing colors with low saturation.

Complementary assessments can be observed for Openness and Conscientiousness as well when it comes to the relationship with compositional properties (see Section IV-B). The features of this category that covary most with the two traits


Fig. 7: The figure shows APP (upper chart, attributed traits) and APR (lower chart, self-assessed traits) performance, reported as the correlation between actual and predicted scores for Naive MIR, cit-kNN, Clust-Reg, Topic-Sum, Gen-MoG, Gen-LDA and CG over the five traits. The missing bars correspond to correlations that are not significant at the 0.05 level.

are rule of thirds (ρ = −0.21 for Openness and ρ = 0.22 for Conscientiousness) and level of detail (ρ = −0.30 and ρ = 0.19). Therefore, unconventional compositions displaying few details tend to be associated with high Openness (the trait of creativity and artistic inclinations), while conventional compositions with many details tend to be associated with high Conscientiousness (the trait of reliability and thoroughness).

Textural features (see Section IV-C) appear to covary with the perception of most traits, especially when it comes to the properties of the Gray-Level Co-occurrence Matrix (see Section IV-C). In the case of Openness, the highest correlations are observed for exposure (measured in terms of brightness energy) and image homogeneousness (measured in terms of gray distribution entropy). The covariation is positive for the former (ρ = 0.35) and negative for the latter (ρ = −0.27). Therefore, people that like pictures with homogeneous illumination and uniform textural properties tend to be perceived as higher in Openness. For Conscientiousness, the covariation is significant only for the Tamura directionality (ρ = 0.23); hence, there seems to be little relationship between the trait and textural properties. In contrast, several textural properties covary with the attribution of Extraversion. High contrast in hue, meaning large color differences in neighboring pixels, and in saturation, meaning chromatic purity, are associated with high Extraversion scores (ρ = 0.26 and ρ = 0.21, respectively). The same applies to Tamura contrast (ρ = 0.25) and directionality (ρ = −0.33) of the images.

The value of ρ for the number of faces, the only content related feature considered in this work, is statistically significant at the 0.01 confidence level for all traits except Openness. The ρ value is negative for Conscientiousness (ρ = −0.2) and Agreeableness (ρ = −0.17) and positive for the other traits. Not surprisingly, the absolute covariation is particularly high (ρ = 0.53) for Extraversion, the trait of sociability and interest in others, and Neuroticism (ρ = −0.28), the trait of the difficulties in dealing with social interactions.

B. Regression Analysis: Experimental Setup

All the experiments of this work have been performed using a Leave-One-User-Out approach: the models are trained over all the pictures of PsychoFlickr except those tagged as favorite by one of the Flickr users included in the corpus (see Section III). The traits of the latter are then predicted using the excluded pictures as test set. The process is then iterated and, at each iteration, a different user is left out. The hyper-parameters of the methods introduced in Section V-A and Section V-B have been set through cross-validation: all parameter values in a search range were tested over a subset of the training set and the configurations leading to the highest performance were retained for the test. The main advantage of the setup above is that it allows the use of the entire corpus to measure the performance of the inference approaches while still preserving a rigorous separation between training and test set.

Hyper-parameters and search ranges for the methods described above are as follows: for cit-kNN, the number of nearest citers C and the number of nearest neighbors R were searched in the ranges [4, 10] and [2, 8], respectively; for Clust-Reg and Gen-MoG, the number C of clusters was searched in the set {5, 10, 20, . . . , 100}; for Topic-Sum and Gen-LDA, the number of topics K was searched in the set {50, 70, 90, 110, 130, 150}; for CG, the grid sizes were searched in the set {20×20, 25×25, . . . , 65×65}. Whenever there was no risk of confusion, the same symbol has been used for different hyper-parameters in different models.
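The Leave-One-User-Out protocol can be sketched as a generic loop; `fit` and `predict` are placeholders for any of the MIR approaches of Section V, and the "predict the training mean" toy model below is purely illustrative:

```python
import numpy as np

def leave_one_user_out(bags, scores, fit, predict):
    """Leave-One-User-Out: for each user, train on all other bags and
    predict the held-out user's trait score."""
    predictions = []
    for held_out in range(len(bags)):
        train_bags = [b for u, b in enumerate(bags) if u != held_out]
        train_scores = [s for u, s in enumerate(scores) if u != held_out]
        model = fit(train_bags, train_scores)
        predictions.append(predict(model, bags[held_out]))
    return np.array(predictions)

# Toy check with a "predict the training mean" model.
bags = [np.zeros((3, 2)), np.ones((3, 2)), np.ones((3, 2))]
scores = [0.0, 3.0, 3.0]
mean_fit = lambda b, s: float(np.mean(s))
mean_predict = lambda model, bag: model
```

Each prediction is computed from a model that never saw the held-out user's pictures or score, which is the separation between training and test set stressed above.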


Fig. 8: From left to right, the first plot shows the performance (correlation between actual and predicted scores) as a function of the number of topics; the second and the third report the performance as a function of test and training bag size, respectively, for each of the Big-Five traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism).

C. Prediction Results

Figure 7 reports the results obtained with the regression methods described in Section V for both self-assessed and attributed traits. The charts report the Spearman correlation coefficients between the scores predicted automatically and the scores resulting from the BFI-10 questionnaire (see Section III). The missing bars correspond to correlations that are not statistically significant at level 0.05.
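The evaluation metric above is the Pearson correlation of the rank-transformed scores. A self-contained sketch (equivalent in spirit to library routines such as `scipy.stats.spearmanr`, which may equally well be used):

```python
from math import sqrt

def _ranks(values):
    """Average ranks (tied values share the mean of their rank positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank-transformed data."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Being rank-based, the coefficient rewards any monotonic agreement between predicted and questionnaire scores, not just a linear one.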

In line with the state-of-the-art of personality computing [11], APP results tend to be more satisfactory than APR ones. In the case of this work, the reason is that the judges are unacquainted with the users. Therefore, the pictures dominate the personality impressions that the judges develop and, as a result, the correlation between visual features and trait scores is higher (see Figure 6). Furthermore, the consensus across the judges is statistically significant (see Section III-B). These two conditions help the regression approaches to achieve higher performances. When the users self-assess their personality, they take into account information that is not available in the favorite pictures, e.g., personal history, inner state, education, etc. Therefore, the correlation between visual features and trait scores is low, and this does not allow the regression approaches to achieve high performances.

When it comes to attributed traits, the weakest approaches (cit-kNN and Clust-Reg) are those that make hard decisions to exclude part of the pictures in a test bag. This seems to suggest that all pictures carry task-relevant information and that the most effective approach is to make soft decisions by combining complex generative models (e.g., LDA and CG) with sparsity-controlling regressors (see Section V-C). This is the case of the best performing methods, namely Topic-Sum, Gen-MoG, CG and Gen-LDA (the latter achieving the best overall performance). Furthermore, the good performance of Naive confirms that all images in a test bag contribute to the attributed traits and, hence, must be used for the regression. There are two possible explanations for this observation. The first is that all pictures influence the impression that each judge develops about the Flickr users. The second is that each judge is influenced by a different subset of pictures and the attributed traits (the average over the traits attributed individually by each judge) are therefore influenced by all pictures in a test bag.
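The soft-decision intuition behind the Naive baseline can be made concrete: every instance in a bag contributes, and the bag-level prediction is simply the mean of the per-picture outputs. A sketch under the assumption that `instance_model` is any picture-level regressor (a hypothetical name, not from the paper):

```python
def naive_bag_prediction(instance_model, bag):
    """Soft decision: average the per-instance predictions so that
    every picture in the bag influences the final trait score."""
    scores = [instance_model(x) for x in bag]
    return sum(scores) / len(scores)
```

Hard-decision methods like cit-kNN would instead select a subset of instances before regressing, discarding the rest.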

Different types of data allow different traits to emerge with more or less evidence [12]. This is the reason why not all the traits are predicted with the same effectiveness. In the case of the attributed traits, the Krippendorff's α values (see Section III-B) provide a first indication of this phenomenon: there is higher agreement for traits that emerge more clearly or, at least, are perceived to do so by the judges. As a result, the performance tends to be better for traits where α is higher. Extraversion is the best predicted dimension for both attributed and self-assessed traits, in line with the results of both personality computing [11] and personality science [54]. The reason is that this trait is the most socially oriented and, therefore, it leaves more traces in observable behaviour [12]. In the case of attributed traits, the performance tends to be higher than average for Neuroticism. To the best of our knowledge, the literature does not provide indications about this, but it seems to be the effect of the high correlation of the trait with the use of certain colors (orange, blue and red) and chromatic purity, as well as with the emotions elicited by the images. These effects are among the strongest observed in the PsychoFlickr corpus (see Figure 6). The lowest performance corresponds to Openness. The main reason is probably that the judges manifest high uncertainty in assessing the trait. This is evident in Figure 1, where the distribution for attributed Openness shows the highest peak in correspondence of the bin centered around zero. Similarly, Openness is the trait that corresponds to the lowest Krippendorff's α (see Section III-B).
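For reference, Krippendorff's α for interval data, the agreement measure invoked above, is one minus the ratio of observed to expected disagreement. The sketch below follows the standard pairwise formulation (an illustration, not the paper's own implementation).

```python
from itertools import combinations

def krippendorff_alpha_interval(ratings):
    """ratings[r][u]: score given by judge r to unit u (None if missing).
    Interval metric: squared difference between scores. The factor 2
    for ordered pairs cancels between numerator and denominator."""
    n_units = len(ratings[0])
    units = []
    for u in range(n_units):
        vals = [r[u] for r in ratings if r[u] is not None]
        if len(vals) >= 2:          # units rated only once are not pairable
            units.append(vals)
    n = sum(len(v) for v in units)  # total number of pairable scores
    # Observed disagreement: within-unit squared differences.
    d_obs = sum(
        sum((a - b) ** 2 for a, b in combinations(vals, 2)) / (len(vals) - 1)
        for vals in units
    ) / n
    # Expected disagreement: squared differences over all pooled score pairs.
    pooled = [v for vals in units for v in vals]
    d_exp = sum((a - b) ** 2 for a, b in combinations(pooled, 2)) / (n * (n - 1))
    return 1.0 - d_obs / d_exp
```

Perfect agreement gives α = 1; agreement no better than chance gives α = 0, which is why traits with higher α lend themselves to better prediction.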

D. Number of Topics, Bag Size and Performance

This section analyses in more detail the application of Gen-LDA to the prediction of attributed traits, the case for which the experiments above show the best overall performance. The leftmost plot of Figure 8 shows the performance as a function of the number of LDA topics. The range is [50, 150] because outside this interval the performance falls rapidly. No value of the number of topics appears to be optimal for all traits. For the best predicted traits (Extraversion and Neuroticism), the performance remains roughly constant or even grows with the number of topics. For the other traits, the performance reaches its maximum at different numbers of topics and then falls before reaching the 150 limit. A possible explanation is that there is no advantage in increasing the


number of topics when the covariation between features and traits is lower (Figure 6 shows that Openness, Agreeableness and Conscientiousness are more weakly correlated with the features than the other two traits).

The central and rightmost plots of Figure 8 show the relationship between performance and the size of test and training set bags, respectively. In both cases, the performance grows with the number of pictures, but statistically significant performances can still be achieved with small bag sizes, i.e., 5 in the case of the training set and 1 in the case of the test set. This is particularly important in view of applications dealing with users that tag only a few images as favorite.
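The bag-size analysis above amounts to subsampling each gallery before prediction and recomputing the performance at every size. A stdlib sketch, where `predict_traits` and `score` are hypothetical callbacks for the trained model and the evaluation metric:

```python
import random

def performance_vs_bag_size(galleries, sizes, predict_traits, score, seed=0):
    """For each bag size, predict traits from a random subsample of each
    user's gallery and report the resulting performance score."""
    rng = random.Random(seed)   # fixed seed for reproducible subsamples
    results = {}
    for size in sizes:
        predictions = {}
        for user, bag in galleries.items():
            sample = rng.sample(bag, min(size, len(bag)))
            predictions[user] = predict_traits(sample)
        results[size] = score(predictions)
    return results
```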

VII. CONCLUSIONS

This work proposes experiments aimed at mapping aesthetic preferences into personality traits. In particular, the experiments have shown that low-level features, automatically extracted from images tagged as favorite, allow one to infer the personality traits of Flickr users. The experiments have required the development of new regression approaches based on Multiple Instance Regression. The adoption of topic models to represent the instances of a bag has been the main advancement with respect to the state-of-the-art. One of the newly proposed approaches, the Multiple Instance Generative LDA, has achieved the highest overall performance.

From an application point of view, the main interest of this work is twofold. On the one hand, whether it is an opportunity (“Personal data is the new oil of the Internet and the new currency of the digital world” [55]) or a risk of intrusion into private life [7], the experiments show that aesthetic preferences allow the inference of data that people do not necessarily intend to share (the self-assessed traits in this case). On the other hand, the experiments show that it is possible to predict the personality impression that people convey to unacquainted others through aesthetic preferences expressed online. This problem is going to become increasingly important if it is true that online “many meet their social, emotional and economic needs” [56]. A typical example is when potential employers google candidates before an interview, a widespread practice where impressions based on online traces make the difference between getting a job or not [57].

Overall, 67% of the US Internet users access at least one social networking platform [58]. Therefore, managing our digital impression seems to emerge as a new social skill, as important as taking care of appearance in everyday life. Understanding the interplay between online activities and personality attribution is a step in such a direction. Future work can involve the improvement of technical aspects (e.g., feature extraction and inference approaches) and the extension of the methodologies to other online activities. Furthermore, the results of this work open the possibility of investigating whether digital appearance can work as a social signal, i.e., an act capable of engaging and involving others in an interaction, possibly leading to the same effects we observe in face-to-face social exchanges like, e.g., the halo effect (the tendency to attribute socially desirable characteristics to attractive people) [59] or the similarity attraction effect (the tendency to like people similar to us) [60]. Understanding the phenomena above might lead, e.g., to better social media strategies, the production of viral content, or the development of better interfaces.

REFERENCES

[1] M. Duggan and L. Rainie, “Cell phone activities 2012,” Pew Research Center, Tech. Rep., 2012.

[2] L. Rainie, J. Brenner, and K. Purcell, “Photos and videos as social currency online,” Pew Research Center, Tech. Rep., 2012.

[3] N. Van House, “Flickr and public image-sharing: distant closeness and photo exhibition,” in CHI 2007 Extended Abstracts on Human Factors in Computing Systems, 2007, pp. 2717–2722.

[4] ——, “Personal photography, digital technologies and the uses of the visual,” Visual Studies, vol. 26, no. 2, pp. 125–134, 2011.

[5] J. Suler, “Image, word, action: Interpersonal dynamics in a photo-sharing community,” CyberPsychology & Behavior, vol. 11, no. 5, pp. 555–560, 2008.

[6] M. Lipczak, M. Trevisiol, and A. Jaimes, “Analyzing favorite behavior in Flickr,” in Proceedings of the International Conference on Multimedia Modeling, S. Li, A. El Saddik, M. Wang, T. Mei, N. Sebe, S. Yan, R. Hong, and C. Gurrin, Eds., vol. LNCS 7732. Springer Verlag, 2013, pp. 535–545.

[7] M. Kosinski, D. Stillwell, and T. Graepel, “Private traits and attributes are predictable from digital records of human behavior,” Proceedings of the National Academy of Sciences, vol. 110, no. 15, pp. 5802–5805, 2013.

[8] H. Liu, “Social network profiles as taste performances,” Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 252–275, 2007.

[9] S. Ray and D. Page, “Multiple instance regression,” in Proceedings of the International Conference on Machine Learning, 2001, pp. 425–432.

[10] B. Babenko, “Multiple instance learning: Algorithms and applications.”

[11] A. Vinciarelli and G. Mohammadi, “A survey of personality computing,” IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 273–291, 2014.

[12] A. Wright, “Current directions in personality science and the potential for advances through computing,” IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 292–296, 2014.

[13] P. Galanter, “Computational aesthetic evaluation: Past and future,” in Computers and Creativity, J. McCormack and M. d’Inverno, Eds. Springer, 2012, pp. 255–293.

[14] F. Hoenig, “Defining computational aesthetics,” in Proceedings of the Eurographics Conference on Computational Aesthetics in Graphics, Visualization and Imaging, 2005, pp. 13–18.

[15] R. Datta, D. Joshi, J. Li, and J. Wang, “Studying aesthetics in photographic images using a computational approach,” in Proceedings of the European Conference on Computer Vision, 2006, pp. 288–301.

[16] Y. Luo and X. Tang, “Photo and video quality evaluation: Focusing on the subject,” in Proceedings of the European Conference on Computer Vision, 2008, pp. 386–399.

[17] C. Li and T. Chen, “Aesthetic visual quality assessment of paintings,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 236–252, 2009.

[18] L. Wong and K. Low, “Saliency-enhanced image aesthetics class prediction,” in IEEE International Conference on Image Processing, 2009, pp. 997–1000.

[19] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka, “Assessing the aesthetic quality of photographs using generic image descriptors,” in Proceedings of the International Conference on Computer Vision, 2011, pp. 1784–1791.

[20] J. Golbeck, C. Robles, and K. Turner, “Predicting Personality with Social Media,” in Proceedings of the Extended Abstracts on Human Factors in Computing Systems, 2011, pp. 253–262.

[21] S. Bai, T. Zhu, and L. Cheng, “Big-five personality prediction based on user behaviors at social network sites,” Cornell University, Tech. Rep., 2012.

[22] F. Celli, F. Pianesi, D. Stillwell, and M. Kosinski, “Workshop on Computational Personality Recognition (shared task),” in Proceedings of the Workshop on Computational Personality Recognition, 2013.

[23] F. Celli, “Unsupervised personality recognition for social network sites,” in Proceedings of the International Conference on Digital Society, 2012, pp. 59–62.


[24] T. Nguyen, D. Phung, B. Adams, and S. Venkatesh, “Towards discovery of influence and personality traits through social link prediction,” in Proceedings of the International AAAI Conference on Weblogs and Social Media, 2011, pp. 566–569.

[25] M. Cristani, A. Vinciarelli, C. Segalin, and A. Perina, “Unveiling the multimedia unconscious: Implicit cognitive processes and multimedia content analysis,” in Proceedings of the ACM International Conference on Multimedia, 2013, pp. 213–222.

[26] S. Fitzgerald, D. Evans, and R. Green, “Is your profile picture worth 1000 words? Photo characteristics associated with personality impression agreement,” in Proceedings of the AAAI International Conference on Weblogs and Social Media, 2009.

[27] D. Evans, S. D. Gosling, and A. Carroll, “What elements of an online social networking profile predict target-rater agreement in personality impressions,” in Proceedings of the International Conference on Weblogs and Social Media, 2008, pp. 45–50.

[28] D. Funder, “Personality,” Annual Reviews of Psychology, vol. 52, pp. 197–221, 2001.

[29] G. Saucier and L. Goldberg, “The language of personality: Lexical perspectives on the five-factor model,” in The Five-Factor Model of Personality, J. Wiggins, Ed., 1996.

[30] B. Rammstedt and O. John, “Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German,” Journal of Research in Personality, vol. 41, no. 1, pp. 203–212, 2007.

[31] G. Boyle and E. Helmes, “Methods of personality assessment,” in The Cambridge Handbook of Personality Psychology, P. Corr and G. Matthews, Eds. Cambridge University Press, 2009, pp. 110–126.

[32] K. Krippendorff, “Reliability in content analysis,” Human Communication Research, vol. 30, no. 3, pp. 411–433, 2004.

[33] J. Biesanz and S. West, “Personality coherence: Moderating self-other profile agreement and profile consensus,” Journal of Personality and Social Psychology, vol. 79, no. 3, pp. 425–437, 2000.

[34] J. Machajdik and A. Hanbury, “Affective image classification using features inspired by psychology and art theory,” in Proceedings of the ACM International Conference on Multimedia, 2010, pp. 83–92.

[35] K. Mardia and P. Jupp, Directional Statistics. Wiley, 2009.

[36] R. Datta, D. Joshi, J. Li, and J. Wang, “Studying aesthetics in photographic images using a computational approach,” in Proceedings of the European Conference on Computer Vision. Springer Verlag, 2006, vol. 3953, pp. 288–301.

[37] P. Valdez and A. Mehrabian, “Effects of color on emotions,” Journal of Experimental Psychology: General, vol. 123, no. 4, p. 394, 1994.

[38] P. Lovato, A. Perina, N. Sebe, O. Zandona, A. Montagnini, M. Bicego, and M. Cristani, “Tell me what you like and I’ll tell you what you are: discriminating visual preferences on Flickr data,” in Proceedings of the Asian Conference on Computer Vision, 2012.

[39] C. Georgescu, “Synergism in low level vision,” in Proceedings of the International Conference on Pattern Recognition, 2002, pp. 150–155.

[40] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.

[41] H. Tamura, S. Mori, and T. Yamawaki, “Texture features corresponding to visual perception,” IEEE Transactions on Systems, Man and Cybernetics, vol. 8, no. 6, pp. 460–473, 1978.

[42] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[43] M. Johnson, “Subcortical face processing,” Nature Reviews Neuroscience, vol. 6, no. 10, pp. 766–774, 2005.

[44] B. Berlin, Basic Color Terms: Their Universality and Evolution. University of California Press, 1991.

[45] J. van de Weijer, C. Schmid, and J. Verbeek, “Learning color names from real-world images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[46] R. Haralick and L. Shapiro, Computer and Robot Vision. Addison-Wesley, 1992.

[47] P. Costa and R. McCrae, Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO FFI): Professional Manual. Psychological Assessment Resources, 1992.

[48] D. R. Dooly, Q. Zhang, S. A. Goldman, and R. A. Amar, “Multiple instance learning of real valued data,” Journal of Machine Learning Research, vol. 3, pp. 651–678, 2003.

[49] J. Wang and J.-D. Zucker, “Solving the multiple-instance problem: A lazy learning approach,” in Proceedings of the International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 1119–1125.

[50] K. L. Wagstaff, T. Lane, and A. Roper, “Multiple-instance regression with structured data,” in Workshops of the International Conference on Data Mining, 2008, pp. 291–300.

[51] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[52] A. Perina and N. Jojic, “Image analysis by counting on a grid,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1985–1992.

[53] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, vol. 58, pp. 267–288, 1994.

[54] C. Judd, L. James-Hawkins, V. Yzerbyt, and Y. Kashima, “Fundamental dimensions of social judgment: Understanding the relations between judgments of competence and warmth,” Journal of Personality and Social Psychology, vol. 89, no. 6, pp. 899–913, 2005.

[55] World Economic Forum, “Personal data: The emergence of a new asset class,” World Economic Forum, Tech. Rep., 2011.

[56] L. Rainie and B. Wellman, Networked. MIT Press, 2012.

[57] D. Coutu, “We googled you,” Harvard Business Review, vol. 85, no. 6, pp. 1–8, 2007.

[58] M. Duggan and J. Brenner, “The demographics of social media users,” Pew Research Center, Tech. Rep., 2012.

[59] R. Nisbett and T. Wilson, “The halo effect: Evidence for unconscious alteration of judgments,” Journal of Personality and Social Psychology, vol. 35, no. 4, p. 250, 1977.

[60] D. Byrne, W. Griffitt, and D. Stefaniak, “Attraction and similarity of personality characteristics,” Journal of Personality and Social Psychology, vol. 5, no. 1, p. 82, 1967.