measuring gender inequalities of german professions on wikipedia

SLA 2014/15 Zagovora Olga 1 Institute for Web Science and Technologies · University of Koblenz-Landau, Germany Measuring gender inequalities of German professions on Wikipedia Olga Zagovora Supervisors: Prof. Dr. Claudia Wagner Dr. Fabian Flöck

Upload: olga-zagovora

Post on 20-Mar-2017


Measuring gender inequalities of German professions on Wikipedia

Olga Zagovora

Supervisors: Prof. Dr. Claudia Wagner, Dr. Fabian Flöck

SLA 2014/15 · Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

Gender stereotypes #RedrawTheBalance www.inspiringthefuture.org

Watch video from: https://www.youtube.com/watch?v=kJP1zPOfq_0

Particular professions are implicitly associated with genders. For example, elementary school teachers are implicitly stereotyped as female and engineers as male.

Profession article. Example: Images: http://de.wikipedia.org

Imagine one of these girls wants to learn about the profession of journalist. Can we identify and measure gender biases related to professions in the results of collaborative community work?

1. Redirection analysis
   Classification of professions according to existing Wiki articles, based on the gender of the profession title
   Example: Hebamme or Entbindungspfleger; Kaufmann, Kauffrau, or Kaufleute
2. Images analysis -> People on images
   Identification of the gender of people on images
   Distribution comparison of image categories, based on the persons' gender
3. Textual analysis -> Mentioned people in the text
   Mining of persons' names from articles
   Distribution comparison of the persons' gender

Method. Main dimensions

In order to do so, we first match professions to Wiki articles and check which articles exist on Wiki. In other words, are there articles with the male name of a profession (for example, Entbindungspfleger), with the female name (e.g. Hebamme), or with a neutral name (e.g. Kaufleute)? We then compare the result to offline data (German labor market statistics).

Second, we retrieve all images from articles about professions. We then identify the gender of the people on the images and compare the distributions of genders across profession articles. These results are also compared to labor market statistics.

Third, we mine all people's names from the text of the articles. We then identify the gender of each person by first name and determine which gender is mentioned more often. Again, we compare the results to labor market statistics.

List of professions [based on the profession list from Bundesagentur für Arbeit], n=4457:
   "Lehrer": "Lehrerin",
   "Krankenpfleger": "Krankenschwester",
   "Entbindungspfleger": "Hebamme",
   "PR-Fachkraft", "Fotomodell", "Aufsichtsperson"
Wikipedia:
   Articles about professions
   Images from profession articles
   Mentioned people in profession articles

Datasets

1. Redirection analysis. Terminology

[Diagram, built up over several slides: for each profession, the male-titled and female-titled pages are marked as "Pages exist", "No page", or "Redirects", and each combination is labelled "Neutral case (no bias)", "Male bias", or "Female bias".]

Redirection analysis. Results

885 articles
Most articles have a male title.
Most redirects are from the female to the male title.

Redirection bias groups:
   Male: 812 professions
   Female: 6 professions
   Neutral: 55 professions

Is it only a Wikipedia-specific phenomenon?

Data: Google hits for profession names

The German-speaking web is male biased -> more sources for male than for female profession names

Our findings reveal that the German-speaking web is male biased, meaning that for most professions one can find many more sources for the male profession name than for the female profession name.

Does Wikipedia reflect the general bias on the Web?

Logistic regression models:

                                      coef
   Model 1
      Normalized Google difference    2.44***
      (intercept)                     2.41***
   Model 2
      Normalized Google difference   -5.93**
      (intercept)                    -5.55**

Google hits for male profession names are smaller than the respective Google hits for female profession names among professions that only have an article under the female title. The logistic regression results reveal that as the normalized Google difference increases, the probability of a profession being in the male bias group increases, whereas the probability of it being in the female bias group decreases.
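As a rough illustration of how the reported coefficients translate into probabilities, the sketch below plugs them into the standard logistic function p = 1 / (1 + exp(-(intercept + coef * x))). It assumes that Model 1 predicts membership in the male bias group and Model 2 membership in the female bias group, as the speaker notes suggest; the slides do not define the scale of the normalized Google difference, so the x values are purely hypothetical.

```python
from math import exp

# Coefficients as reported on the slide. The mapping of Model 1 / Model 2
# to the male / female bias outcome is an assumption based on the notes.
MODEL_1 = {"coef": 2.44, "intercept": 2.41}
MODEL_2 = {"coef": -5.93, "intercept": -5.55}


def logistic_probability(x, coef, intercept):
    """Predicted probability of the outcome for predictor value x."""
    return 1.0 / (1.0 + exp(-(intercept + coef * x)))


if __name__ == "__main__":
    # Hypothetical values of the normalized Google difference.
    for x in (-1.0, 0.0, 1.0):
        p1 = logistic_probability(x, **MODEL_1)
        p2 = logistic_probability(x, **MODEL_2)
        print(f"x={x:+.1f}  model 1: {p1:.3f}  model 2: {p2:.3f}")
```

With a positive coefficient (Model 1) the predicted probability rises with x, and with a negative coefficient (Model 2) it falls, which is the direction of effect described in the notes.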

Could we explain the Wikipedia phenomenon with labor market statistics?

Data: German labor market statistics

Rank sum tests (1)

   Bias groups          z
   male & neutral       0.83
   male & female       -3.32**
   neutral & female    -3.35**

(1) with two-stage p-value correction of Benjamini-Hochberg

Logistic regression model:
Dependent variable: binary state of having female bias

                          coef
   percentage of women    0.36**
   (intercept)          -35.5**

The second hypothesis is that professions with more women in the labor market have only a female article title on Wikipedia, professions with more men have a male article title, and professions with equal gender representation have both articles. I would like to draw your attention to the fact that in the male bias group the percentage of women varies between 0 and 100. Mann-Whitney-Wilcoxon rank sum tests (is there a relation between redirection bias group and the ratio of women in a profession?)
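To make the z values in the rank sum table easier to read, here is a minimal sketch of a Wilcoxon rank sum z statistic under the normal approximation. It is not the authors' code: ties receive average ranks and no tie correction is applied to the variance, and the two samples of "percentage of women" are invented for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(x, y):
    """z statistic of the rank sum of x against y (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # standard deviation under H0
    return (w - mean_w) / sd_w


if __name__ == "__main__":
    # Hypothetical percentages of women in two bias groups.
    male_group = [5, 12, 30, 41, 48]
    female_group = [75, 88, 96]
    print(round(rank_sum_z(male_group, female_group), 2))
```

A negative z means the first group tends to have smaller values than the second, which is how the negative entries for "male & female" and "neutral & female" would be read if the first-named group is the first sample.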

2. Images analysis

Data: Images from profession articles; CrowdFlower task

The second part of our research is the analysis of images.

Images analysis. Results


Do Wikipedia images reflect labor market statistics?

Do professions with a female majority have the same distribution of images as professions with a male majority?

The chi2 method produces the exact value of an approximate p-value, whereas the MC method produces an approximation to the exact p-value. In the contingency table case, simulation is done by random sampling from the set of all contingency tables with the given marginals (the same row and column sums). The p-value produced by chi2 is based on the fact that the chi-squared statistic is approximately distributed like a true chi-squared distribution (on 3 degrees of freedom, in this case) if the null hypothesis is true. However, it is possible to obtain exact p-values by calculating the chi-squared statistic for all possible tables of counts with the same row and column sums as the given table.
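The Monte Carlo idea described above can be sketched as follows. This is an illustrative implementation, not the one used in the study: tables with the observed row and column sums are sampled by shuffling the column labels of the individual observations, and the p-value is the share of simulated tables whose chi-squared statistic is at least as large as the observed one. The counts are hypothetical, the default number of simulations is far below the 1e+5 mentioned on the slides, and every row and column is assumed to have a non-zero total.

```python
import random


def chi2_stat(table):
    """Pearson chi-squared statistic of a contingency table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat


def monte_carlo_p(table, n_sim=2000, seed=0):
    """Approximate the exact p-value by sampling tables with fixed marginals."""
    rng = random.Random(seed)
    # One (row label, column label) pair per observation, listed row by row.
    rows = [i for i, r in enumerate(table) for _ in range(sum(r))]
    cols = [j for r in table for j, c in enumerate(r) for _ in range(c)]
    observed = chi2_stat(table)
    n_rows, n_cols = len(table), len(table[0])
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(cols)  # keeps row and column sums, breaks any association
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        if chi2_stat(sim) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Hypothetical image counts. Rows: female-majority and male-majority
    # professions. Columns: women, men, mixed, no person.
    counts = [[30, 25, 10, 35], [8, 60, 7, 25]]
    print(monte_carlo_p(counts))
```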

Relation to labor market statistics

[Diagram: image results compared with German labor market statistics]

Feature 1 ~ Feature 2, with interpretation ("The more images depicting ..." / "The higher the percentage of ..."):

   number of images depicting women ~ number of women in the labor market
      The more images depicting women are in the article, the more women are working in the profession.
   number of images depicting men ~ number of men in the labor market: 0.088
      -
   percentage of images depicting men ~ percentage of men in the labor market
      The higher the percentage of images depicting men is in the article, the higher the percentage of men is in the labor market.
   percentage of images depicting men ~ percentage of women in the labor market
      The higher the percentage of images depicting men is in the article, the lower the percentage of women is in the labor market.
   percentage of images depicting women ~ percentage of women in the labor market
      The higher the percentage of images depicting women is in the article, the higher the percentage of women is in the labor market.

Remaining correlation values shown on the slide, in extraction order: 0.34***, -0.3***, 0.15*, 0.3***

3. Textual analysis

Data: Mentioned people from profession articles
   Gender identification according to the first name (accuracy = 0.97)
   5085 persons (4272 men and 813 women) from 885 articles
   411 articles with at least one person

Distribution of ratios of male names in an article:
   mean      0.83
   median    0.98
   25%       0.8
   75%       1.0
   avg. number of persons per article: 10.4 m, 1.9 f

[Violin plot of the ratios; axis ends labelled "Male bias" and "Female bias"]

A violin plot features a kernel density estimation of the underlying distribution. It is a combination of a boxplot and a probability density function.

Is there an effect of the gender of the article title on the ratio of mentioned men?

Do articles with a male title have a higher ratio of mentioned men than articles with a neutral title? With a female title?
Do articles with a neutral title have a higher ratio of mentioned men than articles with a female title?

Rank sum tests (1)
   H0: The two sets of ratios of mentioned men are drawn from the same distribution
   Halt: Values in one set are more likely to be larger than the values in the other set

   male & female: z = 2.46*
   median, male title: 1.0
   median, female title: 0.65

(1) with two-stage p-value correction of Benjamini-Hochberg

We check the dependency between the ratio of mentioned men and the gender of the article title. I would like to draw your attention to the fact that ... We also looked into professions with a male majority and professions with a female majority, and there is the same trend.

Mann-Whitney-Wilcoxon rank sum tests. Bonferroni correction -> conservative, reduces statistical power. BH controls the false discovery rate. Hochberg's step-up procedure (1988) is performed using the following steps:
   Start by ordering the p-values (from lowest to highest) $P_{(1)} \ldots P_{(m)}$ and let the associated hypotheses be $H_{(1)} \ldots H_{(m)}$.
   For a given $\alpha$, let $R$ be the largest $k$ such that $P_{(k)} \le \frac{\alpha}{m + 1 - k}$.
   Reject the null hypotheses $H_{(1)} \ldots H_{(R)}$.

The two-stage method relies on the same assumption as the Benjamini and Hochberg method, but it is a more clever method. It first examines the distribution of P values to estimate the fraction of the null hypotheses that are actually true. It then uses this information to get more power when deciding when a P value is low enough to be called a discovery.
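The step-up procedure listed in the notes above can be sketched in a few lines. This follows the three steps as written (Hochberg's procedure) and is not the two-stage Benjamini-Hochberg correction applied to the reported tests; the p-values are hypothetical.

```python
def hochberg_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from lowest to highest.
    order = sorted(range(m), key=lambda i: p_values[i])
    r = 0
    for k in range(1, m + 1):
        # Keep the largest k with P(k) <= alpha / (m + 1 - k).
        if p_values[order[k - 1]] <= alpha / (m + 1 - k):
            r = k
    rejected = [False] * m
    for i in order[:r]:
        rejected[i] = True  # reject H(1) ... H(R)
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for three pairwise comparisons.
    print(hochberg_reject([0.01, 0.04, 0.03]))
```

Because the procedure is step-up, a large enough k rescues all smaller p-values: in the example the third-smallest p-value passes its threshold of 0.05, so all three hypotheses are rejected even though the second-smallest fails its own threshold of 0.025.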

Relation to labor market statistics

[Diagram: mentioned people in an article compared with German labor market statistics]

Feature 1 ~ Feature 2: correlation, with interpretation ("The higher the percentage of ..." / "The more ..."):

   percentage of mentioned men ~ percentage of women in the labor market: -0.27
      The higher the percentage of mentioned men is in the article, the lower the percentage of women is in the profession.
   number of mentioned men ~ number of people in the labor market: -0.2
      The more men are mentioned in the article, the fewer people are employed in the profession.
   number of mentioned men ~ number of men in the labor market: -0.15
      The more men are mentioned in the article, the fewer men are employed in the profession.
   number of mentioned men ~ number of women in the labor market: -0.23
      The more men are mentioned in the article, the fewer women are employed in the profession.

Significance markers on the slide: *********** (not separable per row)

No correlation between:
   number of mentioned women & number of women in the labor market
   number of mentioned women & number of men in the labor market

Is there an effect of history on the number of mentioned men?

dbpedia -> birthDate
Divide people into those born before and after 1960.

Before 1960:
                                                           cor
   # mentioned men ~ amount of people in labor market    -0.19**
   # mentioned men ~ amount of men in labor market       -0.15*
   # mentioned men ~ amount of women in labor market     -0.20**

After 1960:
                                                           cor
   # mentioned men ~ amount of people in labor market    -0.12*
   # mentioned men ~ amount of men in labor market       -0.12
   # mentioned men ~ amount of women in labor market     -0.11

A negative correlation remains between the ratio of mentioned men & the percentage of women in the labor market.

Summary

Male bias over all dimensions:
   redirections
   images
   mentioned people

High female bias for some professions
   Examples: Model (mentioned people), Hebamme (images)

Male bias is higher than expected based on labor market statistics
   Examples: Apotheker, Lehrer

Most professions have only an article with the male title of the profession. There are 4 times more images with men than with women. 98% men and 2% women.

Evidence of male bias is higher than we expect according to labor market statistics. 2 times more images .. for professions with >50% women.

Apotheker: 29 men and 1 woman mentioned (97% men), 83% women in the labor market, and 2 images with men.
Lehrer: 21 men and 3 women mentioned (87% men), 70% women in the labor market, and 4 images with men!

Discussion & Outlook

Why does the male bias exist on Wikipedia?
   male editors
   implicit stereotypes of each individual
   male bias across other media (including search engines such as Google)

What can be done to reduce it?
   attraction of more female editors
   development of Wikipedia equality rules
   warning editors before acceptance of a revision
   profession equality lessons for kids

Future directions:
   cross-language analysis of gender inequalities for different Wiki editions
   timestamp analysis of revisions
   software tool for Wikipedia editors

Models for detecting biased language in multilingual corpora

Questions?

[email protected]


Heinrich Heine
Fotografen beim Fußball: Ralf Roletschek / CC-BY-SA-3.0 / http://creativecommons.org/licenses/by-sa/3.0
Fotojournalisten bei der Fußball-Europameisterschaft 2008: Arne Müseler / arne-mueseler.de / CC-BY-SA-3.0 / https://creativecommons.org/licenses/by-sa/3.0/de/deed.de
Lothar Loewe bei einem Vortrag im Juli 2009: Bücherhexe / CC-BY-SA-3.0 / http://creativecommons.org/licenses/by-sa/3.0
1. Jugendolympiade 2012 Innsbruck: Ralf Roletschek / CC-BY-SA-3.0 at / http://creativecommons.org/licenses/by-sa/3.0/at/deed.en
Reporter Heinz abel (PHOENIX) im Gespräch mit Peter Fahrenholz: André Zahn / CC-BY-SA-2.0 de / http://creativecommons.org/licenses/by-sa/2.0/de/deed.en
Bob Woodward, assistant managing editor: Jim Wallace (Smithsonian Institution) / CC-BY-2.0 / http://creativecommons.org/licenses/by/2.0
Journalisten bei der Fußball-Europameisterschaft 2008: Arne Müseler / arne-mueseler.de / CC-BY-SA-3.0 / https://creativecommons.org/licenses/by-sa/3.0/de/deed.de
Oriana Fallaci in Tehran 1979
Oprah Winfrey at the Hotel Bel Air in Los Angeles: Alan Light / CC-BY-2.0 / http://creativecommons.org/licenses/by/2.0

Images


License

Measuring Gender Inequalities of German Professions on Wikipedia by Olga Zagovora is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Based on a work at https://arxiv.org/abs/1702.00829.

Profession article(s):
   with male title only exists: male bias;
   with female title only exists: female bias;
   with both titles exist: neutral (no bias);
   redirects to other gender: bias for redirection source (i.e. female or male bias);
   with neutral title exists: neutral (no bias).

Redirection analysis. Classification rules:

Redirection bias groups:
   Male: 812 professions
   Female: 6 professions
   Neutral: 55 professions
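The classification rules can be written down as a small function. The sketch below is one possible reading, not the authors' implementation: the page states, the function name, and the order in which the rules are checked are all assumptions. In particular, it assumes that a title which merely redirects to the other gender's article counts as bias toward the gender that owns the article, which is consistent with the results (most redirects go from the female to the male title, and the male group is by far the largest).

```python
# Hypothetical page states for a profession title on Wikipedia.
ARTICLE, REDIRECT, MISSING = "article", "redirect", "missing"


def redirection_bias(male, female, neutral=MISSING):
    """Classify a profession from the states of its male, female, and neutral titles."""
    if neutral == ARTICLE:
        return "neutral"        # an article with a neutral title exists
    if male == ARTICLE and female == ARTICLE:
        return "neutral"        # both gendered titles have their own article
    if male == ARTICLE:
        return "male bias"      # female title is missing or redirects to the male article
    if female == ARTICLE:
        return "female bias"    # male title is missing or redirects to the female article
    return "unclassified"       # no article found under any title


if __name__ == "__main__":
    # Invented page states, not looked up on Wikipedia.
    print(redirection_bias(male=ARTICLE, female=REDIRECT))
    print(redirection_bias(male=REDIRECT, female=ARTICLE))
    print(redirection_bias(male=ARTICLE, female=ARTICLE))
```

Checking the neutral title first is a design choice made here for simplicity; the slides do not say how a profession with both a neutral and a gendered article is handled.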


Images analysis. Reliability of agreement

[Formula annotated with: "the degree of agreement that is attainable above chance" and "the degree of agreement actually achieved above chance"]

The reliability of agreement between a fixed number of raters -> how consistent the responses are
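The two annotations match the usual reading of a kappa statistic, kappa = (P - Pe) / (1 - Pe), where P - Pe is the agreement actually achieved above chance and 1 - Pe is the agreement attainable above chance. The slide does not name the statistic in the extracted text, so the sketch below assumes Fleiss' kappa, which fits the description "a fixed number of raters". The rating counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put item i in category j.

    Assumes every item was rated by the same number of raters (at least two)
    and that the ratings are not all in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item: share of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_cat)     # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical: 4 images, 3 raters each.
    # Categories: women, men, mixed, no person.
    ratings = [[3, 0, 0, 0], [0, 3, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2]]
    print(round(fleiss_kappa(ratings), 3))
```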

Do Wiki images reflect labor market statistics?

Tests (1, 2)

   Significant difference:
      male majority & female majority     ***

   Image categories which show a significant difference:
      women in image                           ***
      men in image                             **
      mixed, equal amount of men and women     **
      no person                                *

(1, 2) Based on Monte Carlo simulations (1e+5), with two-stage p-value correction of Benjamini-Hochberg

Images analysis. Results 2

Group articles according to:

   pairs / image categories: female & male & neutral ***
   Article groups:
      female & male        ***
      female & neutral     ***
   Image categories:
      female     ***
      male       *

   pairs / image categories: female & male & neutral *
   Article groups:
      female & male        *
      female & neutral     *
   Image categories:
      female     *

Images analysis. Results 2

Group articles according to:

   pairs / image categories: male & female majority ***
   Image categories:
      female                                        ***
      male                                          **
      mixed, equal amount of male and female        **
      no person                                     *

   pairs / image categories: male & female dominated professions ***
   Image categories:
      female                                        ***
      male                                          **
      mixed, equal amount of male and female        **
      no person                                     **

Images analysis

German labor marketstatistics

Image results


Method 1: outlinks which are in the Wiki Category Women or Men
Method 2:
   Step 1. From text: NLP Named Entity recognition (Al-Rfou et al) -> Person [based on Wikipedia link structure & Freebase attr.]
   Step 2. Gender identification according to first name:
      vocabulary of the program gender by Jörg Michael (40.000)
      database of first names Genderizer (35.000)

Accuracy of gender identification: 97.23% (for dashed set of names)

Textual analysis. Mining of mentioned people
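Step 2 of the second method, followed by the per-article ratio of mentioned men, could look roughly like the sketch below. The lookup table is a tiny hypothetical stand-in for the first-name vocabularies named on the slide, the function names are invented, and the example names are taken from the image credits of this deck rather than from any analysed article.

```python
# Hypothetical stand-in for a first-name vocabulary ("m" = male, "f" = female).
FIRST_NAME_GENDER = {
    "heinrich": "m",
    "lothar": "m",
    "bob": "m",
    "oriana": "f",
    "oprah": "f",
}


def gender_of(full_name, lookup=FIRST_NAME_GENDER):
    """Return "m", "f", or None when the first name is missing or unknown."""
    parts = full_name.split()
    if not parts:
        return None
    return lookup.get(parts[0].lower())


def ratio_of_men(names, lookup=FIRST_NAME_GENDER):
    """Share of men among the mentioned people whose gender could be identified."""
    genders = [gender_of(name, lookup) for name in names]
    men = genders.count("m")
    women = genders.count("f")
    if men + women == 0:
        return None  # no identifiable person in the article
    return men / (men + women)


if __name__ == "__main__":
    mentioned = ["Heinrich Heine", "Lothar Loewe", "Bob Woodward",
                 "Oriana Fallaci", "Oprah Winfrey"]
    print(ratio_of_men(mentioned))
```

People whose first name is not in the lookup are ignored rather than guessed, so the ratio only reflects names that could be classified.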


Gender bias of profession according to redirection analysis [male bias, neutral, female bias]: there are no significant differences (p>0.05). All groups have a ratio of mentioned men above 0.5.

Do ratios of men in an article reflect labor market statistics? Textual analysis. Results 2

   male & female: z = 4.3***
   median male: 1.0
   median female: 0.85

Mann-Whitney-Wilcoxon rank sum tests (1)
(1) with two-stage p-value correction of Benjamini-Hochberg

Distribution of ratios of mentioned men: right plot - ratios of mentioned men among selected people who were born before 1960, left plot - after 1960

Textual analysis. Results 2

Textual analysis

German labor marketstatistics

Mentioned people in an article
