

Towards Debiasing Sentence Representations

Paul Pu Liang, Irene Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, Louis-Philippe Morency

Machine Learning Department and Language Technologies Institute
Carnegie Mellon University
[email protected]

Abstract

As machine learning methods that learn from text data are increasingly deployed in real-world scenarios such as healthcare, legal systems, and social science, it becomes necessary to recognize the role they play in shaping social biases and stereotypes. Although these models are able to accurately model text using neural representations of words and sentences, previous work has revealed the presence of social biases in widely used word embeddings involving gender, race, religion, and other social constructs. While existing methods for debiasing word embeddings have proven largely successful, there has been a recent shift from word embeddings to contextual sentence representations such as ELMo and BERT which have achieved better performance on multiple tasks in NLP. Accordingly, we investigate the presence of social biases in these sentence-level representations and propose a new method, SENT-DEBIAS, to remove these biases. We show that SENT-DEBIAS is effective in removing biases, and at the same time, preserves performance on sentence-level downstream tasks such as sentiment analysis, linguistic acceptability, and natural language understanding. We hope that our work will inspire future research on characterizing and removing social biases from widely adopted sentence representations for fairer NLP.

1 Introduction

Machine learning tools for learning from language are increasingly deployed in real-world scenarios such as healthcare (Velupillai et al., 2018), legal systems (Dale, 2019), and computational social science (Bamman et al., 2016). Key to the success of these models are powerful embedding layers which learn continuous representations of discrete tokens such as words, sentences, and documents from large amounts of data (Devlin et al., 2019; Mikolov et al., 2013).

Although word embeddings (Pennington et al., 2014; Mikolov et al., 2013) are highly informative features useful for a variety of tasks in Natural Language Processing (NLP), recent work has shown that word embeddings reflect and propagate social biases present in training corpora (Lauscher and Glavaš, 2019; Caliskan et al., 2017; Swinger et al., 2019; Bolukbasi et al., 2016). Machine learning systems that incorporate these word embeddings can further amplify biases (Sun et al., 2019b; Zhao et al., 2017; Barocas and Selbst, 2016) and unfairly discriminate against users, particularly those from disadvantaged social groups. Fortunately, researchers on fairness and ethics in NLP have devised methods towards debiasing these word representations for both binary (Bolukbasi et al., 2016) and multiclass (Manzini et al., 2019) bias attributes such as gender, race, and religion.

More recently, contextualized sentence representations such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GPT (Radford et al., 2019) have become the preferred choice for text sequence encoding. These models have achieved better performance on multiple tasks in NLP (Wu and Dredze, 2019), multimodal learning (Zellers et al., 2019; Sun et al., 2019a), and grounded language learning (Urbanek et al., 2019). As their usage proliferates across various real-world applications (Huang et al., 2019; Alsentzer et al., 2019), it becomes necessary to recognize the role they play in shaping social biases and stereotypes. Although there has been some recent work in measuring the presence of bias in sentence representations (May et al., 2019; Basta et al., 2019), none of them have been able to successfully remove bias from pretrained sentence encoders. In particular, Zhao et al. (2019), Park et al. (2018), and Garg et al. (2019) are not able to perform post-hoc debiasing and require changing the data or underlying word embeddings and retraining, which is costly.


Bordia and Bowman (2019) only study word-level language models and also require retraining. Finally, Kurita et al. (2019) only measure bias on BERT by extending the word-level WEAT metric in a manner similar to May et al. (2019).

Debiasing sentence representations is difficult for two reasons. Firstly, it is usually infeasible to fully retrain many of these state-of-the-art sentence encoder models. In contrast with conventional word embeddings such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013), which can be retrained on a single machine within a few hours, the best sentence encoders such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019) are trained on massive amounts of text data over hundreds of machines for several weeks. As a result, it is difficult to retrain a new sentence encoder whenever a new source of bias is uncovered from data. We therefore focus on post-hoc debiasing techniques which add a post-training debiasing step to these sentence representations before they are used in downstream tasks (Bolukbasi et al., 2016; Manzini et al., 2019). Secondly, sentences display variety in how they are composed of individual words, since sentence structure differs between individuals, settings, and between spoken and written text. As a result, it is difficult to scale traditional word-level debiasing approaches (which involve bias-attribute words such as man, woman) (Bolukbasi et al., 2016) to sentences. As a compelling step towards generalizing debiasing methods to sentence representations, we capture the various ways in which bias-attribute words can be used in natural sentences from text corpora. This is performed by contextualizing bias-attribute words using a diverse set of sentence templates from various sources into bias-attribute sentences.

With these insights, we extend the HARD-DEBIAS method (Bolukbasi et al., 2016) to debias sentences for both binary and multiclass bias attributes spanning gender and religion. Key to our approach is the contextualization step in which bias-attribute words are converted into bias-attribute sentences by using a diverse set of sentence templates from text corpora. This is backed up by our experimental results which demonstrate the importance of using a large number of diverse sentence templates when estimating bias subspaces of sentence representations.

Binary Gender | Multiclass Religion
man, woman | jewish, christian, muslim
he, she | torah, bible, quran
father, mother | synagogue, church, mosque
son, daughter | rabbi, priest, imam

Table 1: Word pairs to estimate the binary gender bias subspace and the 3-class religion bias subspace.

Our experiments on two state-of-the-art sentence encoders, BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), show that our approach reduces the bias while preserving performance on downstream sequence tasks. We end with a discussion on possible shortcomings and some directions for future work towards accurately characterizing and removing social biases from sentence representations for fairer NLP.

2 Debiasing Sentence Representations

Our method for debiasing sentence representations, SENT-DEBIAS, consists of four steps: 1) defining the words which exhibit bias attributes, 2) contextualizing these words into bias attribute sentences and subsequently their sentence representations, 3) estimating the sentence representation bias subspace using these bias attribute sentence representations, and finally 4) debiasing general sentences by removing the projection onto this bias subspace. We summarize these steps in Algorithm 1 and describe the algorithmic details in the following subsections.

(I) Defining Bias Attributes: The first step involves identifying the bias attributes and defining a set of bias attribute words that are indicative of these attributes. For example, when characterizing bias across the male and female genders, we use the word pairs (man, woman), (boy, girl) that are indicative of gender. When estimating the 3-class religion subspace across the Jewish, Christian, and Muslim religions, we use the tuples (Judaism, Christianity, Islam), (Synagogue, Church, Mosque). Each tuple should consist of words that have the exact same meaning except for the bias attribute. In general, for $d$-class bias attributes, the set of words forms a dataset $\mathcal{D} = \{(w_1^{(i)}, \ldots, w_d^{(i)})\}_{i=1}^{m}$ of $m$ entries, where each entry $(w_1, \ldots, w_d)$ is a $d$-tuple of words that are each representative of a particular bias attribute (we drop the superscript $(i)$ when it is clear from the context). Table 1 shows some bias attribute words that we use to estimate the bias subspaces for binary gender and multiclass religious attributes (full pairs and triplets in appendix).
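As a minimal sketch, the dataset D could be represented in code as lists of d-tuples built from the gender pairs and religion triples in Table 1; the variable names below are illustrative and not taken from the released implementation.

```python
# Bias attribute word tuples from Table 1 (full pairs and triplets are in the paper's appendix).
# Each entry is a d-tuple whose words differ only in the bias attribute.
gender_attribute_words = [          # d = 2 (binary gender)
    ("man", "woman"),
    ("he", "she"),
    ("father", "mother"),
    ("son", "daughter"),
]

religion_attribute_words = [        # d = 3 (multiclass religion)
    ("jewish", "christian", "muslim"),
    ("torah", "bible", "quran"),
    ("synagogue", "church", "mosque"),
    ("rabbi", "priest", "imam"),
]
```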

Existing methods that investigate biases tend to operate on words which form a set of tokens bounded by the vocabulary size (Bolukbasi et al., 2016).


Algorithm 1 SENT-DEBIAS: a debiasing algorithm for sentence representations.

SENT-DEBIAS:
1: Initialize (usually pretrained) sentence encoder $M_\theta$.
2: Define bias attributes (e.g. binary gender $g_m$ and $g_f$).
3: Obtain words $\mathcal{D} = \{(w_1^{(i)}, \ldots, w_d^{(i)})\}_{i=1}^{m}$ indicative of bias attributes (e.g. Table 1).
4: $\mathcal{S} = \bigcup_{i=1}^{m} \textsc{Contextualize}(w_1^{(i)}, \ldots, w_d^{(i)}) = \{(s_1^{(i)}, \ldots, s_d^{(i)})\}_{i=1}^{n}$ // words into sentences
5: for $j \in [d]$ do
6:   $R_j = \{M_\theta(s_j^{(i)})\}_{i=1}^{n}$ // obtain sentence representations
7: end for
8: $V = \text{PCA}_k\big(\bigcup_{j=1}^{d} \bigcup_{\mathbf{w} \in R_j} (\mathbf{w} - \boldsymbol{\mu}_j)\big)$ // compute bias subspace
9: for a new sentence $\mathbf{h}$ do
10:   $\mathbf{h}_V = \sum_{j=1}^{k} \langle \mathbf{h}, \mathbf{v}_j \rangle \mathbf{v}_j$ // project onto bias subspace
11:   $\hat{\mathbf{h}} = \mathbf{h} - \mathbf{h}_V$ // subtract projection
12: end for
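Below is a minimal, self-contained Python sketch of Algorithm 1. A toy bag-of-words hashing encoder stands in for a pretrained encoder such as BERT or ELMo, and a handful of hard-coded templates stand in for the corpus-derived templates described in Section 2 (II); function and variable names are illustrative and do not come from the released code.

```python
import numpy as np

def encode(sentences, dim=64):
    """Toy stand-in for a pretrained sentence encoder M_theta (e.g. BERT/ELMo)."""
    reps = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        for tok in sent.lower().split():
            rng = np.random.default_rng(abs(hash(tok)) % (2**32))
            reps[i] += rng.standard_normal(dim)
    return reps

def contextualize(word_tuple, templates):
    """Slot each word of a d-tuple into every template (step 4 of Algorithm 1)."""
    return [[t.format(w) for t in templates] for w in word_tuple]

def estimate_bias_subspace(word_tuples, templates, k=2):
    """Steps 4-8: contextualize, encode, mean-center per class, then PCA_k."""
    d = len(word_tuples[0])
    per_class_sents = [[] for _ in range(d)]
    for tup in word_tuples:
        for j, sents in enumerate(contextualize(tup, templates)):
            per_class_sents[j].extend(sents)
    centered = []
    for sents in per_class_sents:
        R_j = encode(sents)
        centered.append(R_j - R_j.mean(axis=0))      # w - mu_j
    X = np.vstack(centered)
    # Top principal directions of the centered representations (PCA via SVD).
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[:k]                                    # bias subspace V = {v_1, ..., v_k}

def sent_debias(h, V):
    """Steps 9-12: remove the projection of h onto the bias subspace."""
    h_V = sum(np.dot(h, v) * v for v in V)
    return h - h_V

if __name__ == "__main__":
    templates = ["This is a {}.", "The {} is coding.", "The {} went to the park."]
    gender_words = [("man", "woman"), ("he", "she"), ("father", "mother")]
    V = estimate_bias_subspace(gender_words, templates, k=2)
    h = encode(["The doctor finished the surgery."])[0]
    h_debiased = sent_debias(h, V)
    print(np.abs([np.dot(h_debiased, v) for v in V]))  # ~0: orthogonal to the bias subspace
```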

This makes it easy to identify the presence of biases using predefined sets of word associations, and to estimate the bias subspace using the predefined bias word pairs. On the other hand, the potential number of sentences is unbounded, which makes it harder to precisely characterize the sentences in which bias is present or absent. Therefore, it is not trivial to directly convert these words to sentences to obtain a representation from pre-trained sentence encoders. In the subsection below, we describe our solution to this problem.

(II) Contextualizing Words into Sentences: A core step in debiasing sentences involves defining a method to contextualize the predefined sets of bias attribute words into sentences so that the target sentence encoders can be used to obtain sentence representations. One option is to use a simple template-based design to simplify the contextual associations a sentence encoder makes with a given term, similar to how May et al. (2019) proposed to measure (but not remove) bias in sentence representations. For example, each word can be slotted into templates such as “This is <word>.”, “I am a <word>.”. We take an alternative perspective and hypothesize that for a given bias attribute (e.g. gender), a single bias subspace exists across all possible sentence representations. For example, the bias subspace should be the same in the sentences “The boy is coding.”, “The girl is coding.”, “The boys at the playground.”, and “The girls at the playground.”. In order to estimate this bias subspace accurately, it becomes important to use sentence templates that are as diverse as possible to account for all occurrences of that word in surrounding contexts. In our experiments, we empirically demonstrate that estimating the bias subspace using diverse templates from text corpora leads to improved bias reduction as compared to using simple templates.

To capture the variety in syntax across sentences, we use large text corpora to find naturally occurring sentences. These naturally occurring sentences therefore become our sentence “templates”. To use these templates to generate new sentences, we replace words representing a single class with another. For example, a sentence containing a male term “he” is used to generate a new sentence by replacing it with the corresponding female term “she”. This contextualization process is repeated for all word tuples in the bias attribute word dataset D, eventually contextualizing the given set of bias attribute words into bias attribute sentences. Since there are multiple templates which a bias attribute word can map to, the contextualization process results in a bias attribute sentence dataset S which is substantially larger in size:

$$\mathcal{S} = \bigcup_{i=1}^{m} \textsc{Contextualize}\big(w_1^{(i)}, \ldots, w_d^{(i)}\big) \qquad (1)$$
$$\phantom{\mathcal{S}} = \big\{(s_1^{(i)}, \ldots, s_d^{(i)})\big\}_{i=1}^{n}, \qquad |\mathcal{S}| > |\mathcal{D}| \qquad (2)$$

where $\textsc{Contextualize}(w_1, \ldots, w_d)$ is a function which returns a set of sentences obtained by matching words with naturally-occurring sentence templates from text corpora.
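A sketch of what such a CONTEXTUALIZE function could look like when the templates are naturally occurring sentences: each corpus sentence containing one of the attribute words is turned into a d-tuple of sentences by swapping in the corresponding word from each class. The function name and the tiny in-memory corpus are illustrative assumptions, not the paper's released implementation.

```python
import re

def contextualize_from_corpus(word_tuples, corpus_sentences):
    """Turn naturally occurring sentences into d-tuples of bias attribute sentences.

    For every corpus sentence that contains some word w_j of a tuple
    (w_1, ..., w_d), emit the d sentences obtained by substituting each of
    w_1, ..., w_d in turn (Section 2, step II).
    """
    sentence_tuples = []
    for sent in corpus_sentences:
        for tup in word_tuples:
            for w in tup:
                if re.search(rf"\b{re.escape(w)}\b", sent, flags=re.IGNORECASE):
                    sentence_tuples.append(tuple(
                        re.sub(rf"\b{re.escape(w)}\b", other, sent, flags=re.IGNORECASE)
                        for other in tup
                    ))
                    break  # one match per (sentence, tuple) is enough
    return sentence_tuples

# Toy "corpus"; the paper draws templates from WikiText-2, SST, Hacker News, Reddit, and POM.
corpus = [
    "i'm not a man in tech",
    "and his family is, like, incredibly confused",
]
gender_words = [("man", "woman"), ("he", "she"), ("his", "her")]
for pair in contextualize_from_corpus(gender_words, corpus):
    print(pair)
```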

Our text corpora originate from the following five sources: 1) WikiText-2 (Merity et al., 2017a), a dataset of formally written Wikipedia articles (we only use the first 10% of WikiText-2, which we found to be sufficient to capture formally written text),


Dataset | Type | Topics | Formality | Length | Samples
WikiText-2 | written | everything | formal | 24.0 | “the mailing contained information about their history and advised people to read several books, which primarily focused on {jewish/christian/muslim} history”
SST | written | movie reviews | informal | 19.2 | “{his/her} fans walked out muttering words like horrible and terrible, but had so much fun dissing the film that they didn't mind the ticket cost.”
Hacker News | written | technology, entrepreneurship | formal | 3.4 | “i'm not a {man/woman} in tech”
Reddit | written | politics, electronics, relationships | informal | 13.6 | “roommate cut my hair without my consent, ended up cutting {himself/herself} and is threatening to call the police on me”
POM | spoken | opinion videos | informal | 16.0 | “and {his/her} family is, like, incredibly confused”

Table 2: Comparison of the various datasets used to find natural sentence templates. Length represents the average length measured by the number of words in a sentence. Words in italics indicate the words used to estimate the binary gender or multiclass religion subspaces, e.g. (man, woman), (jewish, christian, muslim). This demonstrates the variety in our naturally occurring sentence templates in terms of topics, formality, and spoken/written text.

2) Stanford Sentiment Treebank (Socher et al., 2013), a collection of 10,000 polarized written movie reviews, 3) Hacker News, a collection of formal news titles focusing on technology and entrepreneurship, 4) Reddit data collected from discussion forums related to politics, electronics, and relationships, and 5) POM (Park et al., 2014), a dataset of spoken review videos collected across 1,000 individuals spanning multiple topics. These datasets have been the subject of recent research in language understanding (Merity et al., 2017b; Liu et al., 2019; Wang et al., 2019) and multimodal human language (Liang et al., 2018, 2019). Table 2 summarizes these datasets. We also give some examples of the diverse templates that occur naturally across various individuals, settings, and in both written and spoken text.

(III) Estimating the Bias Subspace: Now that we have contextualized all $m$ word $d$-tuples in $\mathcal{D}$ into $n$ sentence $d$-tuples $\mathcal{S}$, we pass the latter through a pre-trained sentence encoder (e.g. BERT, ELMo) to obtain sentence representations. Suppose we have a pre-trained encoder $M_\theta$ with parameters $\theta$. Define $R_j$, $j \in [d]$, as sets that collect all sentence representations of the $j$-th entry in the $d$-tuple, $R_j = \{M_\theta(s_j^{(i)})\}_{i=1}^{n}$. Each of these sets $R_j$ defines a vector space in which a specific bias attribute is present across its contexts. For example, when dealing with binary gender bias, $R_1$ (likewise $R_2$) defines the space of sentence representations with a male (likewise female) context. The only difference between the representations in $R_1$ versus $R_2$ should be the specific bias attribute present. Define the mean of set $j$ as $\boldsymbol{\mu}_j = \frac{1}{|R_j|} \sum_{\mathbf{w} \in R_j} \mathbf{w}$. The bias subspace $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is given by the first $k$ components of principal component analysis (PCA) (Abdi and Williams, 2010):

$$V = \text{PCA}_k \Big( \bigcup_{j=1}^{d} \bigcup_{\mathbf{w} \in R_j} (\mathbf{w} - \boldsymbol{\mu}_j) \Big). \qquad (3)$$

$k$ is a hyperparameter in our experiments which determines the dimension of the bias subspace. Intuitively, $V$ represents the top-$k$ orthogonal directions which most represent the bias subspace.

(IV) Debiasing: Given the estimated bias subspace $V$, we apply a partial version of the HARD-DEBIAS algorithm (Bolukbasi et al., 2016) to remove bias from new sentence representations. Taking the example of binary gender bias, the HARD-DEBIAS algorithm consists of two steps:

(1) Neutralize: Bias components are removed from sentences that are not gendered and should not contain gender bias (e.g., I am a doctor., That nurse is taking care of the patient.) by removing the projection onto the bias subspace. More formally, given a representation $\mathbf{h}$ of a sentence and the previously estimated gender subspace $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$, the debiased representation $\hat{\mathbf{h}}$ is given by first obtaining $\mathbf{h}_V$, the projection of $\mathbf{h}$ onto the bias subspace $V$, before subtracting $\mathbf{h}_V$ from $\mathbf{h}$. This results in a vector that is orthogonal to the bias subspace $V$ and therefore contains no bias:

$$\mathbf{h}_V = \sum_{j=1}^{k} \langle \mathbf{h}, \mathbf{v}_j \rangle \mathbf{v}_j, \qquad (4)$$
$$\hat{\mathbf{h}} = \mathbf{h} - \mathbf{h}_V. \qquad (5)$$

(2) Equalize: Gendered representations are centered and their bias components are equalized (e.g. man and woman should have bias components in opposite directions, but of the same magnitude).


Test | BERT | BERT post SST-2 | BERT post CoLA | BERT post QNLI | ELMo
C6: M/F Names, Career/Family | +0.477 → +0.397 | +0.002 → −0.091 | −0.105 → −0.025 | −0.345 → −0.128 | −0.380 → −0.298
C6b: M/F Terms, Career/Family | +0.108 → +0.057 | +0.114 → +0.080 | −0.031 → −0.096 | −0.230 → −0.310 | −0.345 → −0.327
C7: M/F Terms, Math/Arts | +0.254 → +0.223 | +0.394 → −0.213 | −0.071 → −0.272 | −0.262 → −0.040 | −0.479 → −0.487
C7b: M/F Names, Math/Arts | +0.253 → +0.210 | +0.799 → +0.856 | +0.105 → −0.197 | −0.369 → −0.256 | +0.016 → −0.013
C8: M/F Terms, Science/Arts | +0.636 → +0.353 | −0.292 → +0.281 | −0.419 → −0.137 | −0.607 → −0.013 | −0.296 → −0.327
C8b: M/F Names, Science/Arts | +0.399 → +0.606 | +0.172 → +0.042 | −0.115 → +0.253 | −0.562 → −0.356 | +0.554 → +0.548
Multiclass Caliskan | +0.035 → +0.379 | +1.200 → +1.000 | +0.243 → +0.757 | - | -

Table 3: Debiasing results on BERT and ELMo sentence representations. The first six rows measure binary SEAT effect sizes for sentence-level tests, adapted from Caliskan tests. SEAT scores closer to 0 represent lower bias. CN: test from Caliskan et al. (2017) row N. The last row measures bias in a multiclass religion setting using MAC (Manzini et al., 2019) before and after debiasing. The MAC score ranges from 0 to 2 and closer to 1 represents lower bias. Results are reported as x1 → x2, where x1 represents the score before debiasing and x2 after, with the lower bias score in bold. Our method reduces the bias of BERT and ELMo for the majority of binary and multiclass tests.

This ensures that any neutral words are equidistant to biased words with respect to the bias subspace. In our implementation, we skip this Equalize step because it is hard to identify all or even the majority of sentence pairs to be equalized due to the complexity of natural sentences. For example, we can never find all the sentences that man and woman appear in to equalize them appropriately. Note that even if the magnitudes of sentence representations are not normalized, the debiased representations are still pointing in directions orthogonal to the bias subspace. Therefore, skipping the Equalize step still results in debiased sentence representations as measured by our definition of bias.

3 Experiments

We test the effectiveness of SENT-DEBIAS at removing biases and retaining performance on downstream tasks. Experimental details are in the appendix and code is released at https://github.com/pliang279/sent_debias.

3.1 Evaluating Biases

Biases are traditionally measured using the Word Embedding Association Test (WEAT) (Caliskan et al., 2017). WEAT measures bias in word embeddings by comparing two sets of target words to two sets of attribute words. For example, to measure social bias surrounding genders with respect to careers, one could use the target words programmer, engineer, scientist, and nurse, teacher, librarian, and the attribute words man, male, and woman, female. Unbiased word representations should display no difference between the two sets of target words in terms of their relative similarity to the two sets of attribute words. The relative similarity as measured by WEAT is commonly known as the effect size. An effect size with an absolute value closer to 0 represents lower bias.

To measure the bias present in sentence representations, we use the method described in May et al. (2019), which extended WEAT to the Sentence Encoder Association Test (SEAT). For a given set of words for a particular test, words are converted into sentences using a template-based method. The WEAT metric can then be applied to fixed-length, pre-trained sentence representations. To measure bias over multiple classes, we use the Mean Average Cosine similarity (MAC) metric, which extends SEAT to a multiclass setting (Manzini et al., 2019). For the binary gender setting, we use words from the Caliskan Tests (Caliskan et al., 2017), which measure biases in common stereotypes surrounding gendered names with respect to careers, math, and science (Greenwald et al., 2009). To evaluate biases in the multiclass religion setting, we modify the Caliskan Tests used in May et al. (2019) with lexicons used by Manzini et al. (2019).
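For concreteness, a sketch of the standard WEAT effect-size computation from Caliskan et al. (2017), which SEAT applies to representations of templated sentences; the helper names and the toy random vectors below are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): mean cosine similarity to attribute set A minus to attribute set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT/SEAT effect size (Caliskan et al., 2017); values closer to 0 mean lower bias."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy example with random vectors standing in for (debiased) sentence representations
# of templated target sentences (X, Y) and attribute sentences (A, B).
rng = np.random.default_rng(0)
X, Y, A, B = (rng.standard_normal((8, 32)) for _ in range(4))
print(weat_effect_size(X, Y, A, B))
```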

3.2 Debiasing Setup

We first describe the details of applying SENT-DEBIAS to sentence encoders like BERT1 (Devlin et al., 2019) and ELMo (Peters et al., 2018). Note that the pre-trained BERT encoder must be fine-tuned on task-specific data. This implies that the final BERT encoder used during debiasing changes from task to task. To account for these differences, we report two sets of metrics: 1) BERT: simply debiasing the pre-trained BERT encoder, and 2) BERT post task: first fine-tuning BERT and post-processing (i.e. normalization) on a specific task before the final BERT representations are debiased.

1 We used uncased BERT-Base throughout all experiments.


We apply SENT-DEBIAS on BERT fine-tuned on two single-sentence datasets, Stanford Sentiment Treebank (SST-2) sentiment classification (Socher et al., 2013) and Corpus of Linguistic Acceptability (CoLA) grammatical acceptability judgment (Warstadt et al., 2018). It is also possible to apply BERT (Devlin et al., 2019) on downstream tasks that involve two sentences. The output sentence pair representation can also be debiased (after fine-tuning and normalization). We test the effect of SENT-DEBIAS on Question Natural Language Inference (QNLI) (Wang et al., 2018), which converts the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) into a binary classification task. For ELMo, the encoder stays the same for downstream tasks (no fine-tuning), so we just debias the ELMo encoder. We report this result as ELMo.

3.3 Debiasing Results

We present these debiasing results in Table 3, and see that for both binary gender bias and multiclass religion bias, our proposed method reduces the amount of bias as measured by the given tests and metrics. The reduction in bias is most pronounced when debiasing the pre-trained BERT encoder. We also observe that simply fine-tuning the BERT encoder for specific tasks also reduces the biases present as measured by the Caliskan tests, to some extent. However, fine-tuning does not lead to consistent decreases in bias and cannot be used as a standalone debiasing method. Furthermore, fine-tuning does not give us control over which type of bias to control for and may even amplify bias if the task data is skewed towards particular biases. For example, while the bias effect size as measured by Caliskan test C6 decreases from +0.477 to +0.002 and −0.105 after fine-tuning on SST-2 and CoLA respectively, the effect size as measured by the multiclass Caliskan test increases from +0.035 to +1.200 and +0.243 after fine-tuning on SST-2 and CoLA respectively.

3.4 Comparison with Baselines

We compare to three additional debiasing methods: 1) FastText, which derives debiased sentence embeddings using an average of debiased FastText word embeddings (Bojanowski et al., 2016) using word-level debiasing methods (Bolukbasi et al., 2016), 2) BERT word, which obtains a debiased sentence representation from average debiased BERT word representations, again using word-level debiasing methods (Bolukbasi et al., 2016), and 3) BERT simple, which adapts May et al. (2019) by using simple templates to debias BERT sentence representations.

Debiasing Method | Ave. Abs. Effect Size
FastText (Bojanowski et al., 2016) | +0.565
BERT word (Bolukbasi et al., 2016) | +0.861
BERT simple (May et al., 2019) | +0.298
SENT-DEBIAS BERT (ours) | +0.258

Table 4: Comparison of various debiasing methods on sentence embeddings. FastText (Bojanowski et al., 2016) (and BERT word) derives debiased sentence embeddings with an average of debiased FastText (and BERT) word embeddings using word-level debiasing methods (Bolukbasi et al., 2016). BERT simple adapts May et al. (2019) by using simple templates to debias BERT representations. SENT-DEBIAS BERT represents our method using diverse templates. We report the average absolute effect size across all Caliskan tests. Average scores closer to 0 represent lower bias.

Test | k | SENT-DEBIAS BERT
C7 | 0 | +0.254
C7 | 1 | +0.223
C7 | 2 | +0.184
C7 | 3 | +0.193
C7b | 0 | +0.253
C7b | 1 | +0.210
C7b | 2 | +0.218
C7b | 3 | +0.189

Table 5: SEAT effect sizes for select sentence-level tests, varying the dimensionality k of the bias subspace (k = 0 represents no debiasing at all). SEAT scores closer to 0 represent lower bias. Lower bias scores are in bold. A higher dimensional bias subspace (k = 2/3) works better for debiasing sentence representations.

From Table 4, SENT-DEBIAS achieves a lower average absolute effect size and outperforms the baselines based on debiasing at the word level and averaging across all words. This indicates that it is not sufficient to debias words only and that biases in a sentence could still arise from its debiased word constituents. In comparison with BERT simple, we observe that using diverse sentence templates obtained from naturally occurring written and spoken text makes a significant difference in how well we can remove biases from sentence representations. This supports our hypothesis that using increasingly diverse templates estimates a bias subspace that generalizes to different words in context.

3.5 Dimension of Bias Subspace

We hypothesize that the dimension of the bias subspace V for sentences should be larger than that for words due to the variety in sentence structure. For example, the semantic meaning changes when the word ‘girl' is used as the subject or the object in a sentence. A larger dimensional subspace should help to capture the full bias subspace.


Figure 1: Influence of the number of templates on the effectiveness of bias removal on BERT fine-tuned on SST-2 (left) and BERT fine-tuned on QNLI (right). All templates are from WikiText-2. The solid line represents the mean over different combinations of partitions and the shaded area represents the standard deviation. As increasing subsets of data are used, we observe a decreasing trend and lower variance in average absolute effect size.

This is validated by our empirical results in Table 5, where we vary the dimension k of the bias subspace from 0 (no debiasing) to 3. We find that 2/3 dimensions work well for estimating the bias subspace, as compared to the 1-dimensional subspace (vector) usually used to debias word representations.
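One practical way to probe how many dimensions the bias subspace needs (not prescribed by the paper, but a natural diagnostic) is to inspect the PCA explained-variance ratios of the class-centered bias attribute representations from Eq. (3); the sketch below assumes scikit-learn and uses random stand-in representations.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the union of class-centered representations from Eq. (3);
# in practice this matrix comes from encoding the bias attribute sentences.
centered = rng.standard_normal((400, 768)) * np.linspace(3.0, 0.5, 768)  # scale columns

pca = PCA(n_components=5).fit(centered)
# If the first two or three ratios dominate, a bias subspace with k = 2 or 3
# (the setting that worked best in Table 5) captures most of the variance.
print(pca.explained_variance_ratio_)
```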

3.6 Effect of Templates

We further test the importance of sentence templates through two experiments.

1. Same Domain, More Quantity: Firstly, we ask: how does the number of sentence templates impact debiasing performance? To answer this, we begin with the largest domain, WikiText-2 (13,750 templates), and divide it into 5 partitions, each of size 2,750. We collect sentence templates using all possible combinations of the 5 partitions and apply these sentence templates in the contextualization step of SENT-DEBIAS. We then estimate the corresponding bias subspace, debias, and measure the average absolute values of all 6 SEAT effect sizes. Since different combinations of the 5 partitions result in a set of sentence templates of different sizes (20%, 40%, 60%, 80%, 100%), this allows us to see the relationship between size and debiasing performance. Combinations with the same percentage of data are grouped together, and for each group we compute the mean and standard deviation of the average absolute effect sizes. We perform the above steps to debias BERT fine-tuned on SST-2 and QNLI and plot these results in Figure 1. Please refer to the appendix for experiments with BERT fine-tuned on CoLA, which show similar results.

For BERT fine-tuned on SST-2, we observe a decreasing trend in the effect size as increasing subsets of the data are used. For BERT fine-tuned on QNLI, there is a decreasing trend that quickly tapers off. However, using a larger number of templates reduces the variance in average absolute effect size and improves the stability of the SENT-DEBIAS algorithm. These observations allow us to conclude the importance of using a large number of templates from naturally occurring text corpora.

2. Same Quantity, More Domains: How does the number of domains that sentence templates are extracted from impact debiasing performance? We fix the total number of sentence templates to be 1080 and vary the number of domains these templates are drawn from. Given a target number k, we first choose k domains from our Reddit, SST, POM, and WikiText-2 datasets and randomly sample 1080/k templates from each of the k selected domains. We construct 1080 templates using all possible subsets of k domains and apply them in the contextualization step of SENT-DEBIAS. We estimate the corresponding bias subspace, debias, and measure the average absolute SEAT effect sizes. To see the relationship between the number of domains k and debiasing performance, we group combinations with the same number of domains (k) and for each group compute the mean and standard deviation of the average absolute effect sizes. This experiment is also performed for BERT fine-tuned on the SST-2 and QNLI datasets. Results are plotted in Figure 2.

We draw similar observations: there is a decreasing trend in effect size as templates are drawn from more domains. For BERT fine-tuned on QNLI, using a larger number of domains reduces the variance in effect size and improves the stability of the algorithm. Therefore, it is important to use a large variety of templates across different domains.


Figure 2: Influence of the number of template domains on the effectiveness of bias removal on BERT fine-tuned on SST-2 (left) and BERT fine-tuned on QNLI (right). The domains span the Reddit, SST, POM, and WikiText-2 datasets. The solid line is the mean over different combinations of domains and the shaded area is the standard deviation. As more domains are used, we observe a decreasing trend and lower variance in average absolute effect size.

Figure 3: t-SNE plots of average sentence representations of a word across its sentence templates before (left, pretrained BERT embeddings) and after (right, debiased BERT embeddings) debiasing. After debiasing, non gender-specific concepts (in black) are more equidistant to genders.

3.7 Visualization

As a qualitative analysis of the debiasing process, we visualize how the distances between sentence representations shift after the debiasing process is performed. We average the sentence representations of a concept (e.g. man, woman, science, art) across its contexts (sentence templates) and plot the t-SNE (van der Maaten and Hinton, 2008) embeddings of these points in 2D space. From Figure 3, we observe that BERT average representations of science and technology start off closer to man, while literature and art are closer to woman. After debiasing, non gender-specific concepts (e.g. science, art) become more equidistant to both the man and woman average concepts.
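A rough sketch of how such a visualization could be produced with scikit-learn and matplotlib, assuming a dictionary mapping each concept to its average (pretrained or debiased) sentence representation; the names and the random stand-in vectors are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Average sentence representation per concept (stand-ins; in practice, average the
# encoder outputs of each concept word across its sentence templates).
concept_reps = {c: rng.standard_normal(768)
                for c in ["man", "woman", "science", "technology", "literature", "art"]}

labels = list(concept_reps)
X = np.stack([concept_reps[c] for c in labels])
# Perplexity must be smaller than the number of points when there are only a few concepts.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), c in zip(coords, labels):
    plt.annotate(c, (x, y))
plt.title("t-SNE of average concept representations")
plt.show()
```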

3.8 Performance on Downstream Tasks

To ensure that debiasing does not hurt the performance on downstream tasks, we report the performance of our debiased BERT and ELMo on SST-2 and CoLA by training a linear classifier on top of the debiased sentence representations. From Table 6, we observe that downstream task performance shows a small decrease, ranging from 1–3%, after the debiasing process.

However, the performance of ELMo on SST-2 increases slightly from 89.6 to 90.0. We hypothesize that these differences in performance are due to the fact that CoLA tests for linguistic acceptability, so it is more concerned with low-level syntactic structure such as verb usage, grammar, and tenses. As a result, changes in sentence representations across bias directions may impact its performance more. For example, sentence representations after the gender debiasing steps may display a mismatch between gendered pronouns and the sentence context. For SST, it has been shown that sentiment analysis datasets have labels that correlate with gender information and therefore contain gender bias (Kiritchenko and Mohammad, 2018). As a result, we do expect possible decreases in accuracy after debiasing. Finally, we test the effect of SENT-DEBIAS on QNLI by training a classifier on top of debiased BERT sentence pair representations. We observe little impact on task performance: our debiased BERT fine-tuned on QNLI is able to achieve 90.3% performance as compared to the 90.5% originally reported.
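As a sketch of this evaluation protocol, one could train a simple linear classifier on top of the debiased representations, for example with scikit-learn; the arrays below are placeholders for debiased encoder outputs and task labels, not the paper's experimental setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholders: debiased sentence representations and binary task labels (e.g. SST-2).
X_train, X_test = rng.standard_normal((800, 768)), rng.standard_normal((200, 768))
y_train, y_test = rng.integers(0, 2, 800), rng.integers(0, 2, 200)

# Linear probe on top of the (frozen, debiased) representations.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```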


Test | BERT | debiased BERT | ELMo | debiased ELMo
SST-2 | 93.5 | 90.1 | 89.6 | 90.0
CoLA | 57.4 | 56.5 | 39.1 | 37.1
QNLI | 90.5 | 90.3 | - | -

Table 6: We test the effect of SENT-DEBIAS on both single sentence (BERT and ELMo on SST-2, CoLA) and paired sentence (BERT on QNLI) downstream tasks. The performance (higher is better) of debiased BERT and ELMo sentence representations on downstream tasks is not hurt by the debiasing step.

4 Discussion and Future Work

Firstly, we would like to emphasize that the WEAT, SEAT, and MAC metrics are not perfect since they only have positive predictive ability: they can be used to detect the presence of biases but not their absence (Gonen and Goldberg, 2019). This calls for new metrics that evaluate biases and can scale to the various types of sentences appearing across different individuals, topics, and in both spoken and written text. We believe that our positive results regarding contextualizing words into sentences imply that future work can build on our algorithms and tailor them for new metrics.

Secondly, a particular bias should only be removed from words and sentences that are neutral to that attribute. For example, gender bias should not be removed from the word “grandmother” or the sentence “she gave birth to me”. Previous work on debiasing word representations tackled this issue by listing all attribute-specific words based on dictionary definitions and only debiasing the remaining words. However, given the complexity of natural sentences, it is extremely hard to identify the set of neutral sentences and its complement. Thus, in downstream tasks, we removed bias from all sentences, which could possibly harm downstream task performance if the dataset contains a significant number of non-neutral sentences.

Finally, a fundamental challenge lies in the fact that these representations are trained without explicit bias control mechanisms on large amounts of naturally occurring text. Given that it becomes infeasible (in standard settings) to completely retrain these large sentence encoders for debiasing (Zhao et al., 2018; Zhang et al., 2018), future work should focus on developing better post-hoc debiasing techniques. In our experiments, we needed to re-estimate the bias subspace and perform debiasing whenever the BERT encoder was fine-tuned. It remains to be seen whether there are debiasing methods which are invariant to fine-tuning, or can be efficiently re-estimated as the encoders are fine-tuned.

5 Conclusion

This paper investigated the post-hoc removal of social biases from pretrained sentence representations. We proposed the SENT-DEBIAS method that accurately captures the bias subspace of sentence representations by using a diverse set of templates from naturally occurring text corpora. Our experiments show that we can remove biases that occur in BERT and ELMo while preserving performance on downstream tasks. We also demonstrate the importance of using a large number of diverse sentence templates when estimating bias subspaces. Leveraging these developments will allow researchers to further characterize and remove social biases from sentence representations for fairer NLP.

6 Acknowledgements

PPL and LM were supported in part by the National Science Foundation (Awards #1750439, #1722822) and the National Institutes of Health. RS was supported in part by DARPA SAGAMORE HR00111990016, AFRL CogDeCON FA875018C0014, and NSF IIS1763562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, National Institutes of Health, DARPA, and AFRL, and no official endorsement should be inferred. We also acknowledge NVIDIA's GPU support and the anonymous reviewers for their constructive comments.

References

Herve Abdi and Lynne J. Williams. 2010. Principal component analysis. WIREs Comput. Stat., 2(4):433–459.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

David Bamman, A. Seza Dogruoz, Jacob Eisenstein, Dirk Hovy, David Jurgens, Brendan O'Connor, Alice Oh, Oren Tsur, and Svitlana Volkova. 2016. Proceedings of the First Workshop on NLP and Computational Social Science.

Solon Barocas and Andrew D Selbst. 2016. Big data's disparate impact. Calif. L. Rev., 104:671.


Christine Basta, Marta R. Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 33–39, Florence, Italy. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. of NIPS, pages 4349–4357.

Shikha Bordia and Samuel R. Bowman. 2019. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 7–15, Minneapolis, Minnesota. Association for Computational Linguistics.

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science.

Robert Dale. 2019. Law and word order: NLP in legal tech. Natural Language Engineering, 25(1):211–217.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pages 219–226, New York, NY, USA. Association for Computing Machinery.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Anthony Greenwald, T Poehlman, Eric Luis Uhlmann, and Mahzarin Banaji. 2009. Understanding and using the implicit association test: III. Meta-analysis of predictive validity.

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. CoRR, abs/1904.05342.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguistics.

Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 166–172, Florence, Italy. Association for Computational Linguistics.

Anne Lauscher and Goran Glavaš. 2019. Are we consistently biased? Multidimensional analysis of biases in distributional word vectors. In *SEM 2019.

Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2019. Strong and simple baselines for multimodal utterance embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2599–2609, Minneapolis, Minnesota. Association for Computational Linguistics.

Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621, Minneapolis, Minnesota. Association for Computational Linguistics.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017a. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017b. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.

Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, Brussels, Belgium. Association for Computational Linguistics.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In ICMI.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Chen Sun, Austin Oliver Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019a. VideoBERT: A joint model for video and language representation learning. CoRR, abs/1904.01766.

Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019b. Mitigating gender bias in natural language processing: Literature review. In ACL.

Nathaniel Swinger, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai. 2019. What are the biases in my word embedding? In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19, pages 305–311, New York, NY, USA. Association for Computing Machinery.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. CoRR, abs/1903.03094.

Sumithra Velupillai, Hanna Suominen, Maria Liakata, Angus Roberts, Anoop D. Shah, Katherine Morley, David Osborn, Joseph Hayes, Robert Stewart, Johnny Downs, Wendy Chapman, and Rina Dutta. 2018. Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances. Journal of Biomedical Informatics, 88:11–19.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP.

Chenguang Wang, Mu Li, and Alexander J. Smola. 2019. Language models with transformers. CoRR, abs/1904.09408.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In AIES.


Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 629–634, Minneapolis, Minnesota. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark. Association for Computational Linguistics.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, Belgium. Association for Computational Linguistics.