HAL Id: hal-02944000
https://hal.archives-ouvertes.fr/hal-02944000v2
Submitted on 31 Dec 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Antoine Mazières, Telmo Menezes, Camille Roth. Computational appraisal of gender representativeness in popular movies. Humanities and Social Sciences Communications, Nature, 2021, 8, pp.137. 10.1057/s41599-021-00815-9. hal-02944000v2



Computational appraisal of gender representativeness in popular movies

Antoine Mazières*¹, Telmo Menezes¹, and Camille Roth¹,²

¹ CNRS, Centre Marc Bloch, Computational Social Science team, Berlin, Germany
² CAMS, Centre d’Analyse et de Mathématique Sociales, CNRS/EHESS, Paris, France

Abstract

Gender representation in mass media has long been mainly studied by qualitatively analyzing content. This article illustrates how automated computational methods may be used in this context to scale up such empirical observations and increase their resolution and significance. We specifically apply a face and gender detection algorithm on a broad set of popular movies spanning more than three decades to carry out a large-scale appraisal of the on-screen presence of women and men. Beyond the confirmation of a strong under-representation of women, we exhibit a clear temporal trend towards a fairer representativeness. We further contrast our findings with respect to movie genre, budget, and various audience-related features such as movie gross and user ratings. We lastly propose a fine description of significant asymmetries in the mise-en-scène and mise-en-cadre of characters in relation to their gender and the spatial composition of a given frame.

Keywords: gender studies; image analysis; film theory; content analysis; face recognition

Introduction

There is assuredly a long tradition of scholarship in the description of sex roles on mass media of various types: already in her seminal review, Linda Busby (1975) described how instructional material, TV, films, advertising, newspapers, cartoons and literature have been used since the late 1950s to study gender-related representations such as sexual stereotypes, biases in occupational roles, body staging, marriage and rape. Back then, she further concluded that “media sex-role studies that have been completed in the 1960s and early 1970s can be used as historical documents to measure future social changes”, emphasizing the need to replicate these analyses at several points in time to capture underlying mutations and trends. As empirical material, such sources provide the opportunity to grasp a certain state of affairs regarding gender representations, together with the intents and conflicts of interest at play in shaping them. Recent reviews of this research (Rudy et al., 2010; Collins, 2011) highlight the ubiquity of gender patterns, most notably the under-representation and sexualization of women, across multiple media and content types, even though some negative results may occasionally be found as well (Kian et al., 2009). Almost half a century after Busby’s review, the roles of females and males in media and fiction remain a prominent domain of inquiry in content analysis and have been subjected to many analyses based on a sometimes substantial quantity of cultural artifacts (Neuendorf, 2017), including for instance broadcast network programs (Lauzen, 2018), popular movies (Lauzen, 2019; Smith et al., 2019) and recurring TV show characters (Townsend et al., 2019).

*Corresponding author: [email protected]

Methodologically, this strand of media gender research principally relies upon manual assessments of text, images and scripts, which occasionally feature complex semantic concepts and possibly subjective interpretations. As a result, these approaches are difficult to scale to a large number of observations: a lot of human coders are required to perform statistical and especially temporal analyses. Some studies do rely on large-scale and automatically collated datasets, for instance through collaborative platforms such as IMDb, the Internet Movie Database, but they are by definition limited to already-available metadata, such as film cast, crew, or budget (Lindner et al., 2015; Yang et al., 2020). The systematic construction and extraction of variables adequate for a given study and a given research question remains a challenge.

Recent advances in artificial intelligence and data science may significantly help in this regard, especially in terms of automated processing of text, image and video, where current technologies are sometimes capable of competing with humans in a wide array of specialized tasks, including automatic text summarization (Mani, 2001), topic detection (Chaney and Blei, 2012), or translation (Hassan et al., 2018); face recognition (Dhomne et al., 2018; Guo and Zhang, 2019), scene intensity estimation (Kataria and Kumar, 2016), narrative element extraction (Guha et al., 2015b; Bost et al., 2016); or even, at the interface of both, text description generation from images (Xu et al., 2015). At the moment, however, these methods have generally been applied to issues that remain quite close to the scientific fields from which they originate: they focus on technological rather than social science applications.

Our contribution explores the possibility of using such advances for the construction of datasets relevant to sex role research. Firstly, we outline a field of inquiry by focusing on cinema, for which we identify a relevant subset of more than 3,500 popular movies spanning over 3 decades. We extract a representative set of frames from this dataset and apply machine learning models to detect human faces and infer their gender. We take the extra precaution of evaluating the performance and fairness of these inferences regarding the target categories (female and male), for these models are typically evaluated in all generality and their potential biases may vary with respect to data corpora. Secondly, we devise a metric to appraise women’s presence in movies, the female face ratio (FFR). We compare it with another well-established measure, the Bechdel test. In aggregate, FFR markedly increases over time, to the point of approaching female-male parity. Also, there are significant differences in how its values are distributed for successive temporal periods. This indicates a noticeable mutation in the popular movie-making culture regarding women’s representation. Thirdly, we explore several more sophisticated and experimental capabilities of automatic face detection to analyze how characters of distinct genders are framed on-screen. Interestingly, this yields mostly negative results in the sense that we observe very little variation. We nevertheless exhibit a few significant patterns related to gender-mixed environments.

A few recent academic endeavors have started exploring methodologies of automated visual content analysis in a social science framework. These works have been denoted with a variety of labels. In the context of digital humanities, for instance, the notion of “distant viewing” (Arnold and Tilton, 2019) has been coined by analogy with the famous concept of “distant reading” (Moretti, 2000). The emerging field of so-called “computational media intelligence” (Somandepalli et al., 2021) covers a variety of initiatives with a more technical focus (Guha et al., 2015b; Kataria and Kumar, 2016). In this area, a case study aimed at tracking female participation in the 100 top-grossing Hollywood films over 6 years is notably relevant here (Guha et al., 2015a; Somandepalli et al., 2021), as it introduced algorithms specifically designed to measure on-screen presence and gender-specific speaking time. In a similar vein, Jang et al. (2019) applied an object detection system on 900 movies to characterize which items were present in association with a face of a given gender, and how often.

Our research belongs to this strand. On the one hand, we rely on a relatively simple and mainstream algorithmic apparatus enabling face detection and gender inference from still frames. In this regard, our contribution is more methodological than technical: we focus particularly on the construction of a sound protocol that pays special attention to a form of criticism prevalent in social sciences regarding the potential biases induced by the use of automated labeling methods, especially when stemming from machine learning approaches (Buolamwini and Gebru, 2018; Crawford and Paglen, 2019). On the other hand, we apply our method to a much larger dataset than has been done so far, and over a much wider period of time. This enables an original analysis of the temporal evolution of gender representativeness in films over decades.

More broadly, we contend that the systematic application of such techniques could contribute to the formulation of ambitious research questions that would be hardly tractable with only a human workforce. This could furthermore enable the creation of well-documented datasets featuring metadata adapted to sex role research for the community, in order to thoroughly and conveniently reproduce and improve experiments. Tackling this challenge could indeed trigger new fields of interest for both qualitative and quantitative approaches. For instance, this could help formulate a theoretical understanding of the distribution of representations over the whole spectrum of a specific medium, or focus on potential outliers in order to unveil their possible contribution to future evolutions.


Dataset and data processing

Corpus scope

Movie studies typically define the corpus scope by relying on box office data as a proxy for movie popularity (e.g., Follows, 2014; Lauzen, 2019; Smith et al., 2019). They essentially outline a selection based on the yearly top-grossing movies over a period of time, i.e., short-term commercial success in movie theaters, which is admittedly related to popularity. Yet popularity relies on complex behaviors: it relates as much to the value given by an individual to the content, as to the value an individual perceives, or anticipates, others will give. Intricate interactions of support, rejection, controversy, advocacy and imitation come into play to establish a cultural object’s influence (Cillessen and Marks, 2011). In short, attendance alone may not fully capture movies that are both characteristic of cultural representations and influential in shaping them. In particular, it may discard some content that may qualify as “mainstream” yet did not attain significant box office success.

We thus devised a different approach based on open collaborative platforms such as peer-to-peer file sharing networks (Vassileva, 2002; Cohen, 2003) or wiki-based knowledge sharing systems (Rafaeli and Ariel, 2008; Yang and Lai, 2010). These online environments are fueled by interactions between a diverse and critical mass of users. Contributors are incentivized by the effort of others to increase the system’s usefulness by creating and maintaining fashionable resources: they act from a variety of motives, including both the perceived value of the content they provide and the peer recognition that it entails. We argue that the intensity of such collaborative activity defines a broader proxy of content mainstreamness than attendance. However, we also acknowledge that it may be biased toward the notoriously younger population of such online communities and their tastes.

Based on this, we focus on films for which data is available on two significantly distinct types of online platforms: (1) a peer-to-peer file sharing network, which is one of the major Torrent communities, YIFY (yts.mx); and (2) a movie-related knowledge-sharing platform, the above-mentioned Internet Movie Database (IMDb, imdb.com), which comprises records on about 500k movies, mostly stemming from user contributions. We first listed all 13,662 movies made available on YIFY, requiring that at least 3 people share them (seeders) as of December 2019. We then linked them to their respective record on IMDb, excluding documentaries and animation movies while requiring that key metadata be available: year of release, genres, user rating, parental rating, runtime, budget and worldwide gross. We find that there are very few movies per year before 1985 (10 on average, no more than 48 for a given year): for the purpose of the temporal analysis, we decide to further focus on the period 1985-2019, during which the number of movies per year is always above 100. This yields a dataset of 3,776 movies. The average runtime is 109 minutes with a standard deviation of 18 minutes, indicating that we essentially gathered feature films. The budget distribution is broad, with a median of $23m while the first and third quartiles are at $10m and $45m, indicating that we focus on a quite diverse array of movie budgets. The same applies to worldwide gross figures: median $43m, first quartile $11m, third quartile $122m. This further substantiates our approach for constructing a filter that is broader than when focusing on top audience figures only.
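The selection criteria above can be sketched as a simple record filter. This is only an illustration: the field names (`seeders`, `user_rating`, `worldwide_gross`, and so on) and the genre labels are hypothetical, since the record layout actually used is not described here; only the thresholds (at least 3 seeders, 1985-2019, no documentaries or animation, required metadata present) come from the text.

```python
# Hypothetical record layout: the field names below are assumptions,
# only the thresholds are taken from the text.
REQUIRED_FIELDS = ("year", "genres", "user_rating", "parental_rating",
                   "runtime", "budget", "worldwide_gross")
EXCLUDED_GENRES = {"Documentary", "Animation"}


def keep_movie(movie, min_seeders=3, first_year=1985, last_year=2019):
    """Return True if a movie record passes the selection criteria."""
    # At least `min_seeders` people must be sharing the file.
    if movie.get("seeders", 0) < min_seeders:
        return False
    # Every key metadata field must be present and non-empty.
    if any(movie.get(field) in (None, "", []) for field in REQUIRED_FIELDS):
        return False
    # Documentaries and animation movies are excluded.
    if EXCLUDED_GENRES & set(movie["genres"]):
        return False
    # Restrict to the period used for the temporal analysis.
    return first_year <= movie["year"] <= last_year
```

Applying `keep_movie` to each linked YIFY/IMDb record would, under these assumptions, reproduce the kind of filter described above; the exact matching between the two platforms is not covered by this sketch.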

Face recognition and gender estimation

The computational extraction of artistic or semantic characteristics of a movie traditionally relies on the extraction of a number of significant images (Guha et al., 2015b; Ko et al., 2019). This is commonly based on keyframes, i.e., frames of a movie’s timeline where new shots commence. This method results in better quality images, since keyframes are used as markers for video compression. Also, it likely captures narrative highlights, since a keyframe captures the first state of scenery (arguably an important one) from which the shot unfolds. For one, the previously cited work of Guha et al. (2015a) relied on this approach to downsample movie frames. However, the duration and pace vary very significantly from one shot to another, and are also strongly influenced by shot type, movie genre and year of production (Cutting and Candan, 2015). Therefore, to ensure the representativeness of our sample with respect to what spectators are shown, even more so for the temporal analysis we aim at, we simply extracted frames on a time frequency basis, similarly to what has been done in Jang et al. (2019). Selecting one image every 2 seconds yielded a collection of more than 12.4 million images.
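As a rough illustration of fixed-interval sampling, the sketch below lists the timestamps at which one frame would be grabbed. The function name and the choice to start at second 0 are assumptions; the actual extraction tooling is not specified in the text.

```python
def sample_timestamps(runtime_seconds, interval_seconds=2.0):
    """Timestamps (in seconds) at which to grab one frame, starting at 0."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Number of whole intervals that fit in the runtime.
    count = int(runtime_seconds // interval_seconds)
    return [i * interval_seconds for i in range(count)]


# A movie of average runtime (109 minutes) sampled every 2 seconds.
stamps = sample_timestamps(109 * 60)
print(len(stamps))  # 3270
```

With 3,270 frames for an average-length movie, 3,776 movies give roughly 12.3 million frames, which is in line with the order of magnitude reported above.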

We processed each of these images with the help of face detection and gender estimation algorithms provided by a common scientific computing software, Wolfram (2020) Mathematica Engine 12. We eventually detect close to 10 million faces over more than 6.6 million images, with an average of 2596 (σ = 1090) faces per movie. For every face, the algorithm provides the coordinates of a bounding box, enabling us to take into account both the position and the size of the surface occupied by the face with respect to the frame dimensions. It also provides an estimation of the likely binary gender of each face (male or female).

Both algorithms are built using conventional machine learning methods. Many questions have been raised over the recent years regarding the accuracy and potential bias of predictions based on these techniques, and our approach is no exception. Previous social scientific-oriented research specifically highlighted the issues associated with the construction of the datasets that are used to train machine learning algorithms (Crawford and Paglen, 2019). In short, a dataset of human-labeled pictures is first gathered, such as ImageNet (Deng et al., 2009). Labels correspond to categories of interest that should be learned from this dataset, in order to predict them on any unknown dataset. In our case, these labels include the visible faces (presence and position) and their gender (male or female). Part of this human-labeled dataset is fed to a learning algorithm (such as a neural network) that will initially improvise predictions and then, iteratively, learn from its mistakes, readjusting and ultimately converging towards better guesses. The learned model is then tested on another part of the dataset to assess if the algorithm managed to generalize well, thereby measuring its performance.

Across the state of the art, both types of algorithms generally reach accuracies well above 90% (Guo and Zhang, 2019; Dhomne et al., 2018). Yet, they also display a strong degree of performance variation depending on the type of dataset at hand and, plausibly, the context and type of images, for instance in medical imagery (Zech et al., 2018; McBee et al., 2018). Movie frames are likely a specific type of data. The work of Buolamwini and Gebru (2018) on designing intersectional benchmarks is also particularly relevant here, in that it highlights how face detection algorithms perform unevenly when tested on faces of specific genders or skin tones. In any event, we need to make sure that the algorithms perform sufficiently well with our dataset for our purposes.

To this end, we set up a simple experimental protocol: we randomly select 1000 frames, each extracted from a distinct movie and on which the algorithm detected only one face, half of which female, the other half male (so, 500 frames for each gender). We built the web interface shown in Fig. 1, displaying one random frame at a time with a bounding box around the detected face, followed by two questions. The first question aimed at checking whether the face detected in the bounding box and its gender are correct. The second question aimed to check whether the frame contains faces outside the bounding box, which would therefore be undetected, since only one face was detected on each image. We sent the link to this website on our research center’s internal mailing list. Participants were invited to review as many frames as they could. Overall, 4,938 reviews were submitted, with an average of 4.94 (σ = 2.29) reviews per frame. For every frame, we considered the most frequent answer. (Narrowing the evaluation only to pictures with identical answers over all reviews actually yielded very similar results.) Raw results are gathered in Table 1. For each image, Table 1a gathers two observations, one for inside the bounding box (true and false positives) and one for the rest of the frame (true and false negatives), thus totaling 2000 observations from 1000 images.
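The aggregation of reviews into one answer per frame can be sketched as a plain majority vote. The function name and the answer labels are hypothetical; how ties were broken is not stated in the text, so this sketch simply keeps the answer that was submitted first among the tied ones.

```python
from collections import Counter


def majority_answer(reviews):
    """Return (most frequent answer, whether all reviews agree) for one frame."""
    if not reviews:
        raise ValueError("a frame needs at least one review")
    counts = Counter(reviews)
    # most_common keeps first-seen order among equal counts, so ties
    # resolve to the earliest submitted answer (an arbitrary choice here).
    answer, _ = counts.most_common(1)[0]
    return answer, len(counts) == 1


print(majority_answer(["female", "female", "male"]))  # ('female', False)
```

The second element of the result flags unanimous frames, which is what a stricter evaluation restricted to identical answers would filter on.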

For face detection, there are 977+863=1840 correct inferences (true positives and true negatives) and 23+137=160 incorrect inferences (false positives and false negatives), thus a high accuracy of 92%, consistent with the literature. Note that there are many more false negatives than false positives, i.e., the algorithm, when wrong, tends to miss a face rather than erroneously detect one. Accuracy for gender inference is weaker, with 304+410=714 correct inferences and 162+75+7+8=252 incorrect ones (discarding the negligible “doubt” category, which indicates that human participants were unable to be conclusive), i.e., a lower yet fairly high 73.9% accuracy. However, we also notice that gender inference performs quite differently between males and females. When it infers a female face, the face is actually that of a woman only 65% of the time, and that of a man 35% of the time. Faces inferred as male are actually male 84.5% of the time, and are actually those of a woman in only 15.5% of the cases.
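The face detection accuracy and the per-gender figures quoted above can be recomputed from the counts of Table 1. The sketch below is only a check of that arithmetic; the dictionary layout and function names are illustrative, and the per-gender rates are computed over the Female and Male columns only, which is what reproduces the 65% and 84.5% figures.

```python
# Counts transcribed from Table 1 (rows: model output, columns: human labels).
face = {"tp": 977, "fp": 23, "fn": 137, "tn": 863}
gender = {
    "female": {"female": 304, "male": 162},
    "male": {"female": 75, "male": 410},
}


def accuracy(tp, fp, fn, tn):
    """Share of correct inferences among all observations."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(matrix, predicted):
    """Share of faces inferred as `predicted` that humans confirmed as such."""
    row = matrix[predicted]
    return row[predicted] / sum(row.values())


face_accuracy = accuracy(**face)                # 1840 / 2000
female_precision = precision(gender, "female")  # 304 / 466
male_precision = precision(gender, "male")      # 410 / 485
print(face_accuracy, round(female_precision, 3), round(male_precision, 3))
```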

Therefore, in aggregate, the model wrongly labels faces as female more often than it wrongly labels them as male. This tells us that the raw counts of female faces, and thus women’s presence, are overestimated by the machine learning algorithm that we used. While it is clear that a 65% accuracy in general would be problematic, we luckily deal here with a dichotomized variable: either female or male. Since the accuracy on male faces is actually very high, it serves as an anchor upon which to build (1) the good accuracy of faces detected as male, by construction, and thus (2) the good accuracy of the correction on what is not detected as male. In this sense, the good accuracy on faces detected as male ensures that a correction based on manual validation of faces detected as female would accurately redress estimations for both genders.

Figure 1: Interface of the human evaluation experiment

Thanks to this contextual validation step, we can now correct inference results appropriately. Knowing the shape and magnitude of model error makes it indeed easy to adjust face counts: for instance, if the algorithm detects a female face, we count 0.65 female faces and 0.35 male faces, using the confusion matrix of Table 1. The same applies for male faces. In a nutshell, we adjust the raw FFR using the following formula:

$$\mathrm{FFR}_{\mathrm{corrected}} = (1-\lambda) + (\lambda + \lambda' - 1)\,\mathrm{FFR}$$

where λ and λ′ are the proportions of true positives for male and female faces, respectively. Furthermore, we observe that algorithm error is not constant across time: female faces are over-estimated significantly more for the earlier than for the later years. In practice, we thus use time-dependent correction factors λ and λ′ (based on time periods defined below for the temporal analysis).
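To make the correction concrete: a face labeled female counts as λ′ female faces, and a face labeled male counts as 1−λ female faces, so the corrected ratio is λ′·FFR + (1−λ)·(1−FFR), which rearranges to the formula above. The sketch below applies it with the aggregate rates quoted earlier (λ = 0.845, λ′ = 0.65) as an assumption; the time-dependent values actually used are not given in this section, and the function and parameter names are illustrative.

```python
def correct_ffr(raw_ffr, lam_male, lam_female):
    """Apply FFR_corrected = (1 - lam_male) + (lam_male + lam_female - 1) * raw_ffr.

    raw_ffr is a fraction in [0, 1]; lam_male and lam_female are the
    proportions of true positives for male and female faces.
    """
    if not 0.0 <= raw_ffr <= 1.0:
        raise ValueError("raw_ffr must be a fraction between 0 and 1")
    return (1.0 - lam_male) + (lam_male + lam_female - 1.0) * raw_ffr


# Aggregate rates from the validation step (assumed constant here).
LAM_MALE, LAM_FEMALE = 0.845, 0.65
corrected = correct_ffr(0.40, LAM_MALE, LAM_FEMALE)
```

Under these assumed rates a raw ratio of 40% is adjusted downwards, consistent with the observation that raw inferences overestimate female faces.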

Women’s presence and its evolution

The content analysis literature has relied on diverse features to assess gender representation in media. It variously mixed field expertise, subjective perceptions and quantifiable variables. These endeavors often led to semantic characterizations such as women appearing “as dependent on men”, “unintelligent”, “less competitive”, “more sexualized” (Busby, 1975), which are identified, annotated and counted throughout the media for further comment. The more formal the feature, the easier it is to scale the analysis to more observations, either by increasing the number of observers or automating the process.

[Figure 2 series: % of female speaking characters in popular movies (Smith et al., 2019); % of female speaking characters on broadcast network programs (Lauzen, 2018); % of female among series regular characters (Townsend et al., 2019).]

Figure 2: Several metrics used in the literature, based on Smith et al. (2019); Lauzen (2018); Townsend et al. (2019)

More recently, various academic and activist projects have undertaken large-scale analysis of visual entertainment media. They often lessened the semantic complexity of the variables they rely on and mainly focused on presence ratios, while being able to increase sample sizes to a point that made temporal analysis possible. Figure 2 gathers some results from three of these projects (Townsend et al., 2019; Smith et al., 2019; Lauzen, 2018). They not only confirm the under-representation of women already widely observed across the literature (Busby, 1975; Collins, 2011), but they also invite the conclusion that this situation has not evolved markedly in any direction during the considered periods.

Female face ratio (FFR)

The face and gender detection algorithms we use provide us, for each movie frame, with three types of information of increasing complexity: number, gender and position of faces. In turn, we derive three types of variables. The first one is the most minimalist: the percentage of faces classified as female among all the detected faces on all frames of a given movie, or female face ratio (FFR).
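The definition above amounts to a single ratio per movie. A minimal sketch, assuming each sampled frame is represented as a list of inferred gender labels (a hypothetical representation, not the output format of the software actually used):

```python
def female_face_ratio(frames):
    """Raw FFR of one movie.

    frames: iterable of per-frame lists of labels, each 'female' or 'male'.
    Returns the share of faces labeled female, or None if no face was found.
    """
    female = total = 0
    for labels in frames:
        total += len(labels)
        female += sum(1 for label in labels if label == "female")
    if total == 0:
        return None  # the ratio is undefined without any detected face
    return female / total


print(female_face_ratio([["female", "male"], [], ["male", "male"]]))  # 0.25
```

Note that faces are pooled over the whole movie rather than averaged frame by frame, so frames with many faces weigh more than frames with a single face.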

The average FFR over all movies is 34.52% (σ = 9.19). This ratio is comparable to what is found in the literature, such as the ratio of females among characters in primetime television programming (39.6%) (Sink and Mastro, 2017) or among speaking characters in broadcast network programs and popular movies (see Fig. 2) (Smith et al., 2019; Lauzen, 2018). However, the FFR markedly differs from one genre to another: we find for example an average FFR of 31.3% for Crime movies while it reaches 37.1% for Romance movies.

Table 1: Evaluation of the detection models.

(a) Face detection

                    Humans: Positive   Humans: Negative
Model: Positive           977                 23
Model: Negative           137                863

(b) Gender inference

                  Humans: Female   Humans: Male   Humans: Doubt   Humans: No face
Model: Female          304             162             18               16
Model: Male             75             410              8                7

[Figure 3 panels, both over the genres Horror, Romance, Comedy, Mystery, Drama, Sci-Fi, Adventure, Thriller, Action and Crime: probability that a movie passes the Bechdel test, i.e. that the movie has at least two named women in it, who talk to each other, about something besides a man; female face ratio (FFR), variation relative to the mean (µ=34.5%).]

Figure 3: Bechdel test and female face ratio (FFR) across a selection of popular movie genres.

To illustrate informally what the FFR means in practice, we provide a few examples of top-grossing movies for some domains of this metric. First, among movies with a high percentage of male faces (i.e. FFR < 25%) we find movies such as Pirates of the Caribbean (2007), Star Wars (2005), Matrix (2003), Independence Day (1996) or Forrest Gump (1994), all with an FFR of around 23%. Movies such as The Hunger Games (2014), Jurassic World (2015), Rogue One (2016) and Gravity (2013) lie around female-male parity, with an FFR of between 45% and 55%. Lastly, the movie with the highest FFR (68%) is Bad Moms (2016), closely followed by movies such as Sisters (2015), Life of the Party (2018) and Cake (2014).

Beyond these few examples, we further check how the FFR is correlated with narrative features by comparing it with the Bechdel (1983) test. This test is referenced and used in numerous studies (Selisker, 2015; Yang et al., 2020; Lindner et al., 2015) and renowned for discarding around half of all reviewed movies with the simple criteria that two named women be present and speak to each other about something besides a man. We rely on data produced by volunteers who manually evaluate whether or not a movie meets the above-cited conditions. This data is available at bechdeltest.com and only covers a subset of our dataset (n=2,454). As the FFR varies along movie genres, so does the test: we compared both metrics across the 10 most frequent movie genres, as shown in Figure 3. We find that they are ordered in almost the same manner (Spearman score > 0.93) even though the FFR varies somewhat less across genres in absolute values.
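For readers unfamiliar with the Spearman score mentioned above, the sketch below computes it from scratch for sequences without ties, using the classical formula 1 − 6·Σd² / (n(n² − 1)) where d is the difference between the ranks of paired values. The inputs shown are toy numbers chosen for illustration, not the per-genre values of the study, and the function names are hypothetical.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy example: two orderings that differ by one adjacent swap.
print(spearman([1, 2, 3, 4], [1, 3, 2, 4]))
```

A score close to 1 means that the two metrics order the items in nearly the same way, regardless of how far apart their absolute values are.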

Temporal analysis. Our aggregate findings on the FFR since 1985 confirm women’s under-representation in terms of on-screen presence. Yet, they also show a significant trend toward less inequality. Our computational approach enables us to go into more detail by providing a relatively high resolution on the FFR distribution across the observation period which, in turn, reveals several features.

We temporally divided our dataset into quartiles, i.e. four consecutive periods featuring the same number of films. As shown in Figure 4a, the FFR markedly increases across time, from an average of 27% between 1985 and 1998 to a mean FFR of 44.9% for the last period (2014-2019), close to a female-male balance. The evolution of FFR ranges is equally significant: most movies shot over 1985-1998 exhibit an FFR of 20-45%, while movies of the most recent period (2014-2019) generally cover the 35-65% range. Besides, the standard deviations of the underlying distributions increase overall (from 5.1 to 7.6). This probably denotes a higher diversity of situations with regard to on-screen gender presence. On the whole, it seems to be slowly evolving in favor of female representation, as distributions appear to be increasingly right-skewed, i.e. towards a higher FFR. Furthermore, considering data from bechdeltest.com restricted to the films of our dataset, over the same periods, we also observe an increase in the percentage of movies passing the test: from 51% between 1985 and 1998 up to 60% for the last period (2014-2019). This evolution is comparable to the increase of the FFR, albeit of a somewhat smaller magnitude: +9 vs. +18 percentage points.

[Figure 4 panels: (a) Distributions of FFR for each period, with averages: 1985-1998 (µ=27%; σ=5.1), 1999-2007 (µ=30.8%; σ=6.4), 2008-2013 (µ=35%; σ=5.9), 2014-2019 (µ=44.9%; σ=7.6). (b) Distributions of several features over the distribution of FFR for 2014-2019, colored by budget, gross, rating value, and female rating count ratio.]

Figure 4: Distributions of female face ratio (FFR): (a) Percentage of movies with a given FFR, one data point every 5%; (b) Percentage of movies with a given FFR, colored by the given variable mean within the bin, the lighter the higher.
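The split into four consecutive periods of equal size can be sketched as follows. This is an assumption-laden illustration: it cuts a sorted list of release years into near-equal groups, whereas the periods reported in the text are whole calendar years, so the actual boundaries were presumably rounded to year limits. The function name is hypothetical.

```python
def split_into_periods(years, n_periods=4):
    """Sort release years and cut them into consecutive near-equal groups."""
    ordered = sorted(years)
    n = len(ordered)
    # Cut points at 0, n/4, n/2, 3n/4 and n (rounded to whole indices).
    bounds = [round(i * n / n_periods) for i in range(n_periods + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(n_periods)]


# Toy input: one release year per movie.
periods = split_into_periods([1990, 1985, 2000, 2019, 2010, 1995, 2005, 2015])
print([(p[0], p[-1]) for p in periods])
```

Because recent years contain more movies than early ones, equal-size groups cover calendar spans of very different lengths, which is why the first period spans 1985-1998 while the last spans only 2014-2019.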

As previously mentioned, while the literature widely acknowledges that women are under-represented in movies and, more broadly, in visual entertainment media, it usually states that this situation does not exhibit any significant evolution (see Fig. 2). As it stands, we observe on our dataset a positive evolution over time of two distinct features, the FFR and the Bechdel test success probability, in apparent contradiction with the hitherto observed stable representation of women. Note however that we exhibit a correlation between the FFR and the Bechdel test, indicating that the FFR nonetheless captures at least in part some semantic features beyond the plain proportion of female faces.

We can think of two phenomena to explain the discrepancy between our study and the previous ones. The first one relates to the way we select content, whereby we focus on a selection of films that may be distinct from what is immediately available on prime-time TV and on-demand streaming platforms. In other words, both our data and the Bechdel test data are based on information contributed by users (on a given website, reflecting the interest of users in a given content), while the traditional data is based on top-grossing films and/or programs (indicating what is offered to, or most successful with, a given audience). The second one may be linked to the potential difference between on-screen presence (which we measure here) and more sophisticated features, such as effective speaking time or regularity of appearance (which are typically measured in the literature).

In essence, the discrepancy may demonstrate that there has been a significant evolution towards more on-screen female presence, close to reaching female-male parity, but that this trend is only moderately related to the actual importance or influence of women in popular movies and their scenarios. In other words, put in perspective with the literature, the evolution that we uncover here may not be of sufficient influence on gender representation in popular movies. Figuring out to what degree the increase of female on-screen presence is potentially a prelude to an upcoming fairer gender representation, or a subtle expression of “purplewashing”, would require a deeper qualitative analysis which is beyond the scope of our study.

Relation between FFR and audience. We could see that distinct genres correspond to differing FFR values. Budget and audience-related metadata enable us to characterize more finely the type of films that correspond to certain areas of the FFR distribution. In Figure 4b we focus on the FFR histogram for the most recent period (2014-2019). On this histogram, we project, for each FFR bar, the average rank of the corresponding movies with respect to budget, gross, rating given by users (rating value) or number of people having rated a movie (rating count). Note that we chose to color histograms from white to black using rankings rather than absolute values, for there are wide variations in the orders of magnitude of the underlying average values. For instance, budget spans several orders of magnitude: if a certain range of absolute values corresponded to a certain tone, we would have had almost only white or only black bars, losing significant resolution and missing the actual ordering and hierarchy between high-budget and low-budget movies. Lighter tones correspond to higher ranks: for instance, the white bar for the “budget” coloration (left-most histogram) denotes the highest movie budgets. It coincides with the main mode and specifically with the bar of the histogram featuring the highest proportion of movies, with an FFR of 35%. The darkest tones, on the other hand, are found for the most extreme values of FFR (very small or very high). Some exceptions are notable: there is a slightly less dark tone for FFR values around 70%, indicating the existence of relatively higher-budget movies on that side as well. On the whole, the same phenomena are visible for worldwide gross and rating. This suggests that the audience and their opinion resonate best with movies close to the main FFR mode, which corresponds to the average under-representation of women in terms of FFR. Interestingly, the higher FFR values that emerged over the recent years (around 60%) also correspond to relatively well-funded and successful movies. The last (right-most) histogram focuses on one of the best-ordered tone scales (i.e., gray levels and FFR values are ordered similarly), with respect to the proportion of user ratings given on IMDb by females. In other words, it reveals a virtually perfect agreement between movies featuring a high FFR and the engagement of women in rating these movies (regardless of the polarity of these ratings, positive or negative).
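The rank-based coloring described above can be illustrated with a minimal sketch. It is only one possible reading of the procedure: the function name `rank_gray_levels`, the 5-point bin width and the sample values are assumptions made for illustration and do not come from the study. Each FFR bin receives a gray level that depends on the rank of its mean value (here, budget), not on the value itself.

```python
from collections import defaultdict

def rank_gray_levels(movies, bin_width=5):
    """Map each FFR bin to a gray level in [0, 1] based on ranks.

    movies: iterable of (ffr_percent, value) pairs, e.g. value = budget.
    Returns {bin_start: gray}; 1.0 (white) marks the bin with the highest
    mean value, 0.0 (black) the bin with the lowest. Ties are not handled.
    """
    by_bin = defaultdict(list)
    for ffr, value in movies:
        bin_start = int(ffr // bin_width) * bin_width
        by_bin[bin_start].append(value)

    means = {b: sum(v) / len(v) for b, v in by_bin.items()}
    ordered = sorted(means, key=means.get)  # lowest mean first
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    # Evenly spaced tones: only the ordering of the means matters.
    return {b: i / (n - 1) for i, b in enumerate(ordered)}

# Hypothetical budgets spanning several orders of magnitude.
sample = [(10, 1e5), (35, 2e8), (37, 1e8), (70, 5e6), (90, 2e5)]
levels = rank_gray_levels(sample)
```

With a linear scale on absolute values, the bin around 70% (mean 5e6) would be almost as dark as the poorest bins, since it amounts to about 3% of the largest mean; with ranks it stays clearly distinguishable from them.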

The framing of gender

Face-ism

From an experimental psychology perspective, little is known about the effect upon observers of visual composition and element framing in a picture (Sammartino and Palmer, 2012). A movie shot composition allegedly helps convey emotional attachment of viewers to characters and narrative elements, driving them through the plot. These elements have been widely discussed and commented on since the early research on modern aesthetics, including film theories (Eisenstein, 1949), and taken as a basis for a more socio-political critique of public displays of information such as gender (Goffman, 1979). While the features extracted in the present study are insufficient for recovering the highly qualitative nature of such editorial choices, they still enable us to discuss character framing, of interest in film theory and its history (Cutting, 2015). In particular, by focusing on simple elements such as face position and surface, we first explore the hypothesis of face-ism made by several gender studies. We further propose a more general appraisal of on-screen gender presence. This analysis is more sophisticated than the computation of FFR: in particular, propagating the above-mentioned inference correction of the algorithm to complex on-screen face positions (bounding box areas) and compositions (one or several faces) would prove to be quite arduous. For this reason and for the sake of simplicity, we now restrict our analysis to the latest period of our dataset (2014-2019), since model error was lowest and least serious there. Indeed, for that period the accuracy of gender detection lies around 78% and, more importantly, it is symmetric across genders: male faces detected as female are in the same proportion as female faces detected as male.

Face-ism is the tendency of an image to reveal more of the subject’s face or head than body. It has been commonly associated with dominance and positive affect in audience perceptions (Archer et al., 1983). Both in mass media and social networks (Smith and Cooley, 2012), research tends to observe that higher face-ism is granted to males over females.

In our dataset, the area of the face occupied on-screen can be assessed by the area of the face’s bounding box. Compared with the size of the frame for a given movie, it yields the percentage of the frame occupied by a detected face, which can then be compared between movies with different aspect ratios or resolutions. The values of face areas across our whole dataset follow a heterogeneous distribution (technically a power law: many are small, few are large), with 80% of faces occupying more than 1.36% of the frame. The median face area is 3.8% of the frame and is almost identical for male and female faces. More precisely, the differences are statistically significant according to the non-parametric Mann-Whitney U test, yet extremely small: the median face area for males is 0.03% above that for females. Furthermore, by genre these small differences appear sometimes in one direction, sometimes in the other: on the whole, a typical signature of an effect that fluctuates around zero. This tends not to confirm the presence of gender biases in the way face-ism is granted to a character. Note however that our metric does not perfectly reflect potential face-ism, for it lacks the ability to compare the area of the face with that of the body. Caution must hence be applied before drawing from this result a refutation of the hypothesis of gender bias in face-ism. Further automated inquiry on the matter should therefore make use of an additional algorithm able to detect and measure the presence of bodies in the picture.
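As a rough illustration of this measurement, the sketch below computes the share of a frame covered by a bounding box and the Mann-Whitney U statistic between two samples of such shares. The `(x, y, w, h)` box format, the function names and the sample values are assumptions for illustration only; the p-value is left out, and a statistics library would normally be used for the full test.

```python
from statistics import median

def face_area_pct(box, frame_w, frame_h):
    """Percentage of the frame covered by a face bounding box.

    box: (x, y, w, h) in pixels (assumed format).
    """
    _, _, w, h = box
    return 100.0 * (w * h) / (frame_w * frame_h)

def mann_whitney_u(xs, ys):
    """U statistic of xs against ys.

    Counts the (x, y) pairs with x > y; ties count for one half.
    Quadratic in the sample sizes, which is fine for a small example.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# The same relative face size at two resolutions gives the same percentage.
hd = face_area_pct((0, 0, 192, 216), 1920, 1080)
sd = face_area_pct((0, 0, 128, 144), 1280, 720)

# Hypothetical face-area percentages for two groups of faces.
male_areas = [3.9, 4.1, 2.0]
female_areas = [3.8, 4.0, 2.5]
u_male = mann_whitney_u(male_areas, female_areas)
median_gap = median(male_areas) - median(female_areas)
```

Normalizing by frame size is what makes values comparable across movies with different resolutions; a U value close to half the number of pairs indicates that neither group tends to have larger faces.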

Gender’s mise-en-scene and mise-en-cadre

Choosing how many characters appear in a given frame is an influential element of the craft of staging, or mise-en-scene. It may direct the viewer’s attention to one face or divide it among several, significantly modifying the perception of actors’ performance and their surroundings. Thus, we analyzed the combinations of character genders appearing in the same frame. As shown in Figure 5, the distribution of the most observed combinations reveals that 9 cases account for more than 95% of all frames with faces and that the one-male-only configuration represents almost half of them.

Let us first focus on frames with only one face, which is the most common case. The distribution of the gender of that face exhibits a more marked bias in favor of male faces than the FFR: 40% of one-face frames feature a female, vs. 60% for males (29.5% and 44.2% of all frames with faces, respectively, out of 29.5% + 44.2% = 73.7%), while the average global FFR for the last period is 44.9%. In other words, there seems to be a stronger bias favoring male presence in situations featuring a single face.
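The counting behind these figures can be sketched as follows. The input format (one list of `"F"`/`"M"` labels per frame) and the function names are assumptions for illustration; only the two shares 29.5% and 44.2% are taken from the text.

```python
from collections import Counter

def gender_combinations(frames):
    """Count frames per (n_female, n_male) combination.

    frames: iterable of lists of "F"/"M" labels, one list per frame
    (assumed format). Frames without any detected face are skipped.
    """
    counts = Counter()
    for labels in frames:
        if not labels:
            continue
        counts[(labels.count("F"), labels.count("M"))] += 1
    return counts

def single_face_female_share(counts):
    """Share of one-face frames whose only face is female."""
    female_only = counts[(1, 0)]
    male_only = counts[(0, 1)]
    return female_only / (female_only + male_only)

# Toy input: five frames, one of them without faces.
toy = gender_combinations([["M"], ["F"], ["M", "F"], [], ["M"]])

# Shares reported in the text, scaled to integers: 44.2% and 29.5%.
reported = Counter({(0, 1): 442, (1, 0): 295})
share = single_face_female_share(reported)
```

Here `share` evaluates to about 0.40, which matches the 40% vs. 60% split quoted above and sits below the global FFR of 44.9% for the same period.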

Furthermore, following the ranking of Figure 5 in decreasing order exhibits a perfect symmetry of gender combinations (0 female/1 male, 1/0, 0/2, 2/0, etc.), with the male-dominated variant of each pair of equivalent configurations appearing first (i.e. 0 female/1 male before 1 female/0 male), in line with the underlying general bias in favor of male face presence. This hints at the idea that there is no significant additional gender bias in the character composition of a frame beyond the general, previously observed 45-55 woman-man representation imbalance for that period.

We used these combinations to see if gender has an impact on the screen location of faces or, in other words, to observe if there is a gender-specific mise-en-cadre depending on these configurations. Figure 6 displays small matrices representing the screen on which a movie would be displayed, split according to the common rule of thirds. Each zone is annotated with the percentage of women or men appearing in it, in the context of the character gender combination mentioned above it.
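A minimal sketch of how a detected face could be assigned to one of the nine rule-of-thirds zones is given below. The text does not state which reference point of the bounding box is used; taking the box center, the `(x, y, w, h)` box format and the top-left origin are assumptions made for illustration.

```python
def thirds_zone(box, frame_w, frame_h):
    """Return (row, col), each in 0..2, for the zone holding the box center.

    Row 0 is the upper third and column 0 the left third.
    box: (x, y, w, h) in pixels, origin at the top-left corner (assumed).
    Boxes are assumed to lie within the frame.
    """
    x, y, w, h = box
    cx = x + w / 2.0
    cy = y + h / 2.0
    # min(..., 2) keeps a center lying exactly on the far edge in the last zone.
    col = min(int(3 * cx / frame_w), 2)
    row = min(int(3 * cy / frame_h), 2)
    return row, col

upper_center = thirds_zone((900, 100, 120, 120), 1920, 1080)
middle_center = thirds_zone((900, 480, 120, 120), 1920, 1080)
bottom_right = thirds_zone((1800, 1000, 120, 80), 1920, 1080)
```

Counting zones per gender and per combination, then dividing by the number of faces in each case, would give percentages of the kind annotated in each grid.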

[Figure 5: horizontal bar chart; x-axis: share of frames with faces (0-40%); bars in decreasing order:]
0 female & 1 male: 44.2%
1 female & 0 male: 29.5%
1 female & 1 male: 8.4%
0 female & 2 males: 6.3%
2 females & 0 male: 3.6%
1 female & 2 males: 1.7%
2 females & 1 male: 1.3%
0 female & 3 males: 1.1%
3 females & 0 male: 0.5%
other: 3.5%

Figure 5: Combinations of character gender (2014-2019).

[Figure 6: eight 3x3 rule-of-thirds grids, each zone annotated with a percentage of faces; color scale from 0 to 50%. Panels: Women when alone; Women when alone with one man; Women when two and without men; Women when alone with two men; Men when alone; Men when alone with one woman; Men when two and without women; Men when alone with two women.]

Figure 6: Distribution of faces position on-screen (2014-2019).

We used chi-square to test the hypothesis of independence of the frequency distributions found in the various matrices. We considered the categorical variable mise-en-cadre, with 9 possible values (one for each position in the 3x3 grid). We generated a contingency table for each pair of face configurations. We also checked for aggregated horizontal and vertical positions, in such cases the mise-en-cadre only having 3 possible categories (in the horizontal case: left, center, right; in the vertical case: top, middle, bottom). For all these cases and all pairwise combinations we found strong support for dependence, with all p-values < 0.005. This leads us to conclude that even differences of small magnitude are statistically significant.
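The test described above can be sketched with the standard library alone. The function names and the counts below are hypothetical and do not reproduce the study's data. The p-value helper relies on the closed form of the chi-square survival function, which holds only for an even number of degrees of freedom; this covers both cases in the text (2 configurations by 9 zones gives 8 degrees of freedom, 2 by 3 gives 2).

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom.

    table: list of rows of observed counts (contingency table).
    All row and column totals are assumed to be positive.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_tot) - 1)
    return stat, dof

def chi2_sf_even_dof(stat, dof):
    """P(X >= stat) for a chi-square variable with an even dof.

    Uses exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = stat / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Hypothetical counts of faces in the left, center and right thirds
# for two face configurations (one row each).
counts = [[10, 20, 30], [20, 20, 20]]
stat, dof = chi_square_independence(counts)
p_value = chi2_sf_even_dof(stat, dof)
```

With very large numbers of frames, as in this kind of corpus, the statistic grows with the sample size, which is why even small differences in proportions can yield very low p-values.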

When in a gender-mixed configuration, women are more present in the middle third of the screen while men seem to appear more frequently in the upper third of the screen. A similar phenomenon can be observed when women and men are alone or in a non-mixed character gender combination, but in these cases, while the observation is still statistically significant, the magnitude of the effect is very small.

We randomly selected hundreds of pictures exhibiting this significant pattern: the woman’s face present in the middle third of the screen while the man’s is located in the upper third. A manual evaluation of this selection revealed that this bias is partly due to the height gap between actors and actresses, as illustrated by Figure 7. As stated earlier, the mise-en-cadre of characters goes beyond face size and position. A more fine-grained analysis would require the ability to assess subtle biases in the depth and perspective of character placement, together with explanatory and evaluation protocols developed with movie experts. We leave this to further research.

Figure 7: Example of gender placement on-screen.

Concluding remarks

In practice, our contribution principally exhibits several gender representation discrepancies in on-screen presence in a large set of movies spanning a wide period of time. More broadly, this article also aims at demonstrating the usefulness and feasibility of automated computational methods for the study of gender representativeness in mass media. We successfully uncovered clear historical trends thanks to the possibility of handily producing empirical observations at a scale that would have been both expensive and difficult for a qualitative endeavor. Nonetheless, our essentially quantitative approach did not prevent us from appraising more sophisticated features and correlating our findings with a variety of metadata. As such, our approach could be easily replicated on other corpora within the visual entertainment industry, such as advertisements and TV shows.

Meanwhile, our study also outlined several challenges for computational methods to efficiently tackle issues related to gender representation in media. Firstly, even though we used face and gender detection algorithms with solid track records from an engineering perspective, we had to acknowledge that the underlying machine learning models still suffer from significant biases, especially with respect to the empirical context of movie content over several decades. Trusting the output of these algorithms at face value would have led to significant errors. The development of a protocol to assess their bias on a case-by-case basis proved to be key: further studies should imperatively estimate the performance of such tools, be it in the framework of gender studies or more broadly in the prospect of carrying out the “distant viewing” of media material. Secondly, our results have shown clear trends towards more representativeness of on-screen woman presence in popular movies, whereas parts of the state of the art tend to report a rather stable (under-)representation. This opens up interesting avenues for further qual-quant analyses: for instance, by focusing on movies quantitatively featuring a gender ratio close to parity and describing qualitatively how women are actually represented with respect to men. On the whole, we hope to have shown that there is a promising potential in the fine qualitative analysis of media material selected on the basis of a large-scale scanning of sizable media datasets.

Data availability

The datasets generated during and/or analysed during the current study are available in the Nakala repository, https://doi.org/10.34847/nkl.543czc59.

Acknowledgments

The authors are grateful to Élise Marsicano, Lilas Duvernois, Cécile Dumas, Jean-Christophe Ribot and Angela Crone for their help and advice in conducting this research.

Competing interests

The authors have declared that no competing interests exist.

References

Archer, D., Iritani, B., Kimes, D. D., and Barrios, M. (1983). Face-ism: Five studies of sex differences in facial prominence. Journal of Personality and Social Psychology, 45(4):725–735.

Arnold, T. and Tilton, L. (2019). Distant viewing: analyzing large visual corpora. Digital Scholarship in the Humanities, 34(S1):i3–i16.

Bechdel, A. (1983). Dykes to Watch Out For. https://dykestowatchoutfor.com/.

Bost, X., Labatut, V., Gueye, S., and Linarès, G. (2016). Narrative smoothing: dynamic conversational network for the analysis of tv series plots. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 1111–1118. IEEE.

Buolamwini, J. and Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91.

Busby, L. J. (1975). Sex-role research on the mass media. Journal of Communication, 25(4):107–131.

Chaney, A. J.-B. and Blei, D. M. (2012). Visualizing topic models. In Proc. 6th ICWSM AAAI Conf on weblogs and social media, pages 419–422.

Cillessen, A. H. and Marks, P. E. (2011). Conceptualizing and measuring popularity. Popularity in the peer system, pages 25–56.

Cohen, B. (2003). Incentives build robustness in bittorrent. In Workshop on Economics of Peer-to-Peer systems, volume 6, pages 68–72.

Collins, R. L. (2011). Content analysis of gender roles in media. Sex roles, 64:290–298.

Crawford, K. and Paglen, T. (2019). Excavating ai: The politics of images in machine learning training sets. https://www.excavating.ai/.

Cutting, J. E. (2015). The framing of characters in popular movies. Art & Perception, 3(2):191–212.

Cutting, J. E. and Candan, A. (2015). Shot durations, shot classes, and the increased pace of popular movies. Projections, 9(2):40–62.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proc. 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.

Dhomne, A., Kumar, R., and Bhan, V. (2018). Gender recognition through face using deep learning. Procedia Computer Science, 132:2–10.

Eisenstein, S. (1949). Film form: Essays in film theory. Harcourt, Inc.

Follows, S. (2014). Gender within film crews. Stephen Follows Film Data and Education, 22.

Goffman, E. (1979). Gender advertisements. Macmillan International Higher Education.

Guha, T., Huang, C.-W., Kumar, N., Zhu, Y., and Narayanan, S. S. (2015a). Gender representation in cinematic content: A multimodal approach. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 31–34.

Guha, T., Kumar, N., Narayanan, S. S., and Smith, S. L. (2015b). Computationally deconstructing movie narratives: an informatics approach. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2264–2268. IEEE.

Guo, G. and Zhang, N. (2019). A survey on deep learning based face recognition. Computer Vision and Image Understanding, 189:102805.

Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., Liu, S., Liu, T.-Y., Luo, R., Menezes, A., Qin, T., Seide, F., Tan, X., Tian, F., Wu, L., Wu, S., Xia, Y., Zhang, D., Zhang, Z., and Zhou, M. (2018). Achieving human parity on automatic Chinese to English news translation. arXiv, 1803.05567.

Jang, J. Y., Lee, S., and Lee, B. (2019). Quantification of gender representation bias in commercial films based on image analysis. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–29.

Kataria, S. and Kumar, A. (2016). Scene intensity estimation and ranking for movie scenes through direct content analysis. Project report, IIT Kanpur.

Kian, E. T. M., Mondello, M., and Vincent, J. (2009). Espn—the women’s sports network? a content analysis of internet coverage of march madness. Journal of Broadcasting & Electronic Media, 53(3):477–495.

Ko, M.-Y., Li, J.-L., and Lee, C.-C. (2019). Learning minimal intra-genre multimodal embedding from trailer content and reactor expressions for box office prediction. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1804–1809. IEEE.

Lauzen, M. M. (2018). Boxed in 2017-18: Women on screen and behind the scenes in television. Center for the Study of Women in Television and Film, San Diego State University.

Lauzen, M. M. (2019). It’s a man’s (celluloid) world: Portrayals of female characters in the top grossing films of 2018. Center for the Study of Women in Television and Film, San Diego State University.

Lindner, A. M., Lindquist, M., and Arnold, J. (2015). Million dollar maybe? the effect of female presence in movies on box office returns. Sociological Inquiry, 85(3):407–428.

Mani, I. (2001). Automatic Summarization. John Benjamins Publishing Company.

McBee, M. P., Awan, O. A., Colucci, A. T., Ghobadi, C. W., Kadom, N., Kansagra, A. P., Tridandapani, S., and Auffermann, W. F. (2018). Deep learning in radiology. Academic radiology, 25(11):1472–1480.

Moretti, F. (2000). Conjectures on world literature. New Left Review, 1:54–68.

Neuendorf, K. A. (2017). The content analysis guidebook. Sage.

Rafaeli, S. and Ariel, Y. (2008). Online motivational factors: Incentives for participation and contribution in wikipedia. Psychological aspects of cyberspace: Theory, research, applications, 2(08):243–267.

Rudy, R. M., Popova, L., and Linz, D. G. (2010). The context of current content analysis of gender roles. Sex roles, 62:705–720.

Sammartino, J. and Palmer, S. E. (2012). Aesthetic issues in spatial composition: Effects of vertical position and perspective on framing single objects. Journal of Experimental Psychology: Human Perception and Performance, 38(4):865.

Selisker, S. (2015). The bechdel test and the social form of character networks. New Literary History, 46(3):505–523.

Sink, A. and Mastro, D. (2017). Depictions of gender on primetime television: A quantitative content analysis. Mass Communication and Society, 20(1):3–22.

Smith, L. R. and Cooley, S. C. (2012). International faces: An analysis of self-inflicted face-ism in online profile pictures. Journal of Intercultural Communication Research, 41(3):279–296.

Smith, S. L., Choueiti, M., Pieper, K., Yao, K., Case, A., and Choi, A. (2019). Inequality in 1,200 Popular Films. http://assets.uscannenberg.org/docs/aii-inequality-report-2019-09-03.pdf.

Somandepalli, K., Guha, T., Martinez, V. R., Kumar, N., Adam, H., and Narayanan, S. (2021). Computational media intelligence: Human-centered machine analysis of media. Proceedings of the IEEE.

Townsend, M., Deerwater, R., Adams, N., Trasandes, M., and Hood, D. (2019). Where we are on TV. GLAAD.

Vassileva, J. (2002). Motivating participation in peer to peer communities. In International Workshop on Engineering Societies in the Agents World, pages 141–155. Springer.

Wolfram (2020). FacialFeatures. https://reference.wolfram.com/language/ref/FacialFeatures.html. Accessed: March 5, 2021.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.

Yang, H.-L. and Lai, C.-Y. (2010). Motivations of wikipedia content contributors. Computers in human behavior, 26(6):1377–1383.

Yang, L., Xu, Z., and Luo, J. (2020). Measuring women representation and impact in films over time. arXiv preprint arXiv:2001.03513.

Zech, J. R., Badgeley, M. A., Liu, M., Costa, A. B., Titano, J. J., and Oermann, E. K. (2018). Confounding variables can degrade generalization performance of radiological deep learning models. arXiv preprint arXiv:1807.00431.