TRANSCRIPT
Machine Learning & Fairness
Jenn Wortman Vaughan & Hanna Wallach, Microsoft Research New York City
Who?
http://www.microsoft.com/en-us/research/group/fate/
Aether Committee
Bias & Fairness Working Group
Intelligibility Working Group
AI & Machine Learning
NeurIPS Registrations
[Chart: number of NeurIPS registrations per year, 2002–2018]
THE AGE OF AI
OPPORTUNITIES
Microsoft
CHALLENGES
The Media…
Employment
Criminal Justice
Advertising
[Sweeney, 2013]
FAIRNESS
LEARN FROM SECURITY & PRIVACY
SOME HISTORY…
GROWTH MINDSET
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
TYPES OF HARM
[Shapiro et al., 2017]
Allocation
Allocation
Quality of Service
[Buolamwini & Gebru, 2018]
Quality of Service
Stereotyping
Stereotyping
[Caliskan et al., 2017]
Stereotyping
Denigration
Denigration
Over- and Under-Representation
[Kay et al., 2015]
Types of Harm
Examples, each marked on the slide with the harm types it involves (Allocation, Quality of Service, Stereotyping, Denigration, Over- or Under-Representation):
» Hiring system does not rank women as highly as men for technical jobs
» Gender classification software misclassifies darker-skinned women
» Machine translation system exhibits male/female gender stereotypes
» Photo management program labels images of black people as “gorillas”
» Image searches for “CEO” yield only photos of white men on the first page
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
WHO?
Subpopulations
» Protected subpopulations, e.g., race, gender, age
» Historically marginalized subpopulations
» Not always easy to identify subpopulations
» 62% of industry practitioners reported it would be very or extremely useful to have support in this area
» Subpopulations may be application-specific
[Holstein et al., 2019]
Subpopulations
“ [P]eople start thinking about sensitive attributes like your ethnicity, your religion, your sexuality, your gender. But the biggest problem I found is that these cohorts should be defined based on the domain and problem. For example, for [automated writing evaluation] maybe it should be defined based on [... whether the writer is] a native speaker.
Intersectionality
Access to Attributes
» Many teams have no access to relevant attributes
» Makes it hard to audit systems for biases
» One option is to collect attributes purely for auditing
» Raises privacy concerns, users may object
» Another option is to use ML to infer relevant attributes
» Shifts the problem, can introduce new biases
Social Constructs
[Buolamwini & Gebru, 2018]
Individual Fairness
Counterfactual Fairness
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
ML Pipeline
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Task Definition
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Task Definition
[Wu & Zhang, 2016]
Task Definition
Task Definition
» Clearly define the task & model’s intended effects
» Try to identify any unintended effects & biases
» Involve diverse stakeholders & multiple perspectives
» Try to refine task definition & be willing to abort
» Document any unintended effects & biases
Dataset Construction
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Data: Societal Bias
Data: Societal Bias
Data: Skewed Sample
Data: Skewed Sample
Data: Skewed Sample
“ It sounds easy to just say like, “Oh, just add some more images in there,” but [...] there's no person on the team that actually knows what all of [these celebrities] look like [...] If I noticed that there's some celebrity from Taiwan that doesn't have enough images in there, I actually don't know what they look like to go and fix that. It's a non-trivial problem [...] But, Beyoncé, I know what she looks like.
Data: Source
» Think critically before collecting any data
» Check for biases in data source selection process
» Try to identify societal biases present in data source
» Check for biases in cultural context of data source
» Check that data source matches deployment context
Data: Collection Process
» Check for biases in technology used to collect data
» Check for biases in humans involved in collecting data
» Check for biases in strategy used for sampling
» Ensure sufficient representation of subpopulations
» Check that collection process itself is fair & ethical
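As a quick check for the representation bullet above, a minimal sketch (the column names are hypothetical) might tally examples and label rates per subpopulation:

```python
import pandas as pd

# Hypothetical dataset with a column for a relevant attribute and a label.
df = pd.DataFrame({"gender": ["F", "M", "M", "M", "F", "M"],
                   "label":  [1,   0,   1,   1,   0,   1]})

# How many examples, and what fraction of positive labels, per subpopulation?
print(df["gender"].value_counts())
print(df.groupby("gender")["label"].mean())
```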
Data: Labeler Bias
Data: Labeling & Preprocessing
» Check whether discarding data introduces biases
» Check whether bucketing introduces biases
» Check preprocessing software for biases
» Check labeling/annotation software for biases
» Check that human labelers do not introduce biases
Data: Documentation
DATASHEETS
[Gebru et al., 2018]
Datasheets for Datasets
Motivation
Composition
Collection Process
Preprocessing
Distribution
Maintenance
Legal & Ethical
Questions
Composition
Collection Process
Points to Consider
» What is the right set of questions?
» How best to handle continually evolving datastreams?
» Are there legal or PR risks to creating datasheets?
» What is the right process for making a datasheet?
» How best to incentivize developers & PMs?
» How much (if anything) should be automated?
Model Definition
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
What is a Model?
price of house = w1 * number of bedrooms +
w2 * number of bathrooms +
w3 * square feet +
a little bit of noise
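As a rough illustration of this model (not from the talk), the same linear form can be written directly in code; the weight values below are made up purely for the example:

```python
import numpy as np

# Hypothetical weights for illustration only; in practice they are learned from data.
w = np.array([10_000.0, 5_000.0, 150.0])   # w1, w2, w3

def predict_price(bedrooms, bathrooms, square_feet, noise_std=1_000.0):
    """Linear model: price = w1*bedrooms + w2*bathrooms + w3*square_feet + noise."""
    features = np.array([bedrooms, bathrooms, square_feet])
    return float(w @ features + np.random.normal(scale=noise_std))

print(predict_price(3, 2, 1500))
```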
Model: Assumptions
Model: Assumptions
Model: Structure
[image from Moritz Hardt: majority vs. minority population]
Model: Objective Function
Model Definition
» Clearly define all assumptions about model
» Try to identify biases present in assumptions
» Check whether model structure introduces biases
» Check objective function for unintended effects
» Consider including “fairness” in objective function
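One way to read the last bullet is to add a fairness penalty to the training objective. The sketch below is a hypothetical illustration, not the talk's prescription: squared prediction loss plus a penalty on the gap in average predictions between two groups, traded off by a parameter lam.

```python
import numpy as np

def fairness_penalized_loss(w, X, y, group, lam=1.0):
    """Squared loss plus a penalty on the gap in mean predictions between two groups.

    group is a boolean array marking membership in one subpopulation; lam trades
    off accuracy against the fairness penalty. Illustrative only.
    """
    preds = X @ w
    loss = np.mean((preds - y) ** 2)
    gap = preds[group].mean() - preds[~group].mean()
    return loss + lam * gap ** 2

# Toy usage with two equal-sized groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
group = np.array([True] * 5 + [False] * 5)
print(fairness_penalized_loss(np.zeros(3), X, y, group, lam=2.0))
```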
Training Process
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
What is Training?
price of house = w1 * number of bedrooms +
w2 * number of bathrooms +
w3 * square feet +
a little bit of noise
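Training means estimating the weights w1, w2, w3 from data. A minimal sketch, using made-up example houses and an ordinary least-squares fit (the talk does not prescribe a particular estimator):

```python
import numpy as np

# Toy training data: [bedrooms, bathrooms, square_feet] -> price. Illustrative only.
X = np.array([[3, 2, 1500],
              [4, 3, 2200],
              [2, 1, 900],
              [5, 3, 2800]], dtype=float)
y = np.array([310_000, 455_000, 180_000, 560_000], dtype=float)

# Least-squares estimate of w1, w2, w3.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # learned weights
print(X @ w)    # model's predicted prices on the training examples
```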
Training Process
Testing Process
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Testing: Data
Testing: Metrics
Testing: Metrics
Testing: Metrics
Confusion matrix:

            Unqualified   Qualified
  Reject        TN            FN
  Hire          FP            TP
Testing: Metrics
Confusion matrices:

  Men:              Unqualified   Qualified
    Reject              15             5
    Hire                20            60

  Women:            Unqualified   Qualified
    Reject              60            20
    Hire                 5            15
Demographic Parity
(Same Men and Women confusion matrices as above: men are hired at a rate of 80% and women at 20%, so demographic parity, which requires equal hiring rates across groups, does not hold.)
Testing: Metrics
Predictive Parity
(Same confusion matrices as above: 75% of hired men and 75% of hired women are qualified, so predictive parity, which requires equal precision across groups, holds.)
False Positive Rate Balance
(Same confusion matrices as above: unqualified men are hired at a rate of 20/35 ≈ 57% while unqualified women are hired at 5/65 ≈ 8%, so false positive rates are not balanced.)
False Negative Rate Balance
(Same confusion matrices as above: qualified men are rejected at a rate of 5/65 ≈ 8% while qualified women are rejected at 20/35 ≈ 57%, so false negative rates are not balanced.)
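To make these metrics concrete, the sketch below (not from the slides) computes each of them from the two confusion matrices above: demographic parity compares hiring rates, predictive parity compares precision, and the balance criteria compare false positive and false negative rates.

```python
def rates(tn, fn, fp, tp):
    """Selection rate, precision, FPR, and FNR from one confusion matrix."""
    total = tn + fn + fp + tp
    return {
        "hire_rate": (fp + tp) / total,   # demographic parity compares these
        "precision": tp / (tp + fp),      # predictive parity compares these
        "fpr": fp / (fp + tn),            # false positive rate balance
        "fnr": fn / (fn + tp),            # false negative rate balance
    }

men = rates(tn=15, fn=5, fp=20, tp=60)
women = rates(tn=60, fn=20, fp=5, tp=15)

for metric in men:
    print(f"{metric}: men={men[metric]:.2f}, women={women[metric]:.2f}")
# Precision is equal (predictive parity holds), but hiring rates, FPRs,
# and FNRs differ sharply, so the other three criteria are violated.
```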
Testing: Metrics
Testing: Metrics
Impossibility Theorem
Testing: Metrics
Testing Process
» Check that test data matches deployment context
» Ensure test data has sufficient representation
» Involve diverse stakeholders & multiple perspectives
» Clearly state all fairness requirements for model
» Use metrics to check that requirements are met
Deployment Process
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Deployment: Context
[Phillips et al., 2011]
Deployment Process
» Check that data source matches deployment context
» Monitor match between training data & deployment data
» Monitor fairness metrics for unexpected changes
» Invite diverse stakeholders to audit system for biases
» Monitor user reports & user complaints
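For the monitoring bullets above, one simple check (a sketch with hypothetical group labels, not a prescribed method) is to compare subpopulation proportions in the training data against live deployment traffic:

```python
import numpy as np

def group_shift(train_groups, deploy_groups):
    """Compare subpopulation proportions between training data and deployment traffic."""
    groups = sorted(set(train_groups) | set(deploy_groups))
    for g in groups:
        p_train = np.mean(np.asarray(train_groups) == g)
        p_deploy = np.mean(np.asarray(deploy_groups) == g)
        print(f"{g}: train={p_train:.2f} deploy={p_deploy:.2f} shift={p_deploy - p_train:+.2f}")

# Toy example: group B is much more common in deployment than in training.
group_shift(["A"] * 80 + ["B"] * 20, ["A"] * 55 + ["B"] * 45)
```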
Feedback Loop
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Feedback: Non-Adversarial
Feedback: Adversarial
Feedback Loop
» Monitor match between training & deployment data
» Monitor fairness metrics for unexpected changes
» Monitor user reports & user complaints
» Monitor users’ interactions with system
» Consider prohibiting some types of interactions
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
SOFTWARE TOOLS
Academic Response
[image from Moritz Hardt]
AUDITING
Aequitas
IBM AI Fairness 360
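Both toolkits support audits along these lines; the sketch below uses plain pandas with hypothetical column names, rather than either library's API, just to show the shape of a per-group audit:

```python
import pandas as pd

# Hypothetical audit log of a hiring model's decisions.
df = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B"],
    "qualified": [1,   0,   1,   1,   0,   1],
    "hired":     [1,   1,   1,   0,   0,   1],
})

# Selection rate, false positive rate, and false negative rate per group.
audit = df.groupby("group").apply(
    lambda g: pd.Series({
        "selection_rate": g["hired"].mean(),
        "fpr": g.loc[g["qualified"] == 0, "hired"].mean(),
        "fnr": 1 - g.loc[g["qualified"] == 1, "hired"].mean(),
    })
)
print(audit)
```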
Points to Consider
» Fairness is a non-trivial sociotechnical challenge
» Many types of harm relate to a broader cultural context than a single decision-making system
» Many aspects of fairness not captured by metrics
» No free lunch! Can’t simultaneously satisfy all metrics
» Need to make different tradeoffs in different contexts
CLASSIFICATION
[Agarwal et al., 2018]
“Fair” Classification
» Choose fairness metric w/r/t relevant attributes
» ML goal becomes maximizing classifier accuracy while minimizing unfairness according to the metric
» Two technical challenges:
» Choosing an appropriate fairness metric
» Learning an accurate model subject to the metric
fairness-constrained classification ➧ cost-sensitive classification
A Reductions Approach
Many Benefits
» Works with many different fairness metrics
» Agnostic to form of classifier & training algorithm
» Doesn’t need deployment access to relevant attributes
» Important for teams that have no such access
» Important for “disparate treatment” concerns
» Guaranteed to find the most accurate classifier subject to the fairness constraints
In Practice…
Python Library
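Assuming the library referred to here is what was later released as fairlearn, a usage sketch of the reductions approach might look roughly like the following; the class and argument names are taken from that package and may differ across versions, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Toy data: two features, binary labels, and a binary relevant attribute A.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
A = rng.integers(0, 2, size=200)            # e.g., gender, used only during training
y = (X[:, 0] + 0.5 * A + rng.normal(scale=0.5, size=200) > 0).astype(int)

base = LogisticRegression(solver="liblinear")
mitigator = ExponentiatedGradient(base, constraints=DemographicParity())

# The relevant attribute is needed at training time but not at prediction time.
mitigator.fit(X, y, sensitive_features=A)
y_pred = mitigator.predict(X)
print(y_pred[:10])
```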
IBM AI Fairness 360
Points to Consider
» Need to choose an appropriate fairness metric
» Still need to assess other fairness metrics
» Accuracy–fairness tradeoff may be illusory
» Test data may not match deployment context
» Fairness is a non-trivial sociotechnical challenge
» Many aspects of fairness not captured by metrics
WORD EMBEDDINGS
Data: Societal Bias
[Caliskan et al., 2017]
Bias in Word Embeddings
[Bolukbasi et al., 2016]
Python Library
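The core step in Bolukbasi et al. [2016] is to identify a gender "direction" and project it out of gender-neutral word vectors. Below is a minimal numpy sketch of that projection, using tiny made-up vectors rather than the released library or real embeddings:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real ones are learned and much higher-dimensional.
emb = {
    "he":       np.array([ 0.8,  0.1,  0.3,  0.0]),
    "she":      np.array([-0.8,  0.1,  0.3,  0.0]),
    "engineer": np.array([ 0.3,  0.7, -0.2,  0.4]),
}

# Gender direction from a definitional pair (the paper combines several such pairs).
g = emb["he"] - emb["she"]
g = g / np.linalg.norm(g)

def debias(v, direction):
    """Remove the component of v along the given direction."""
    return v - (v @ direction) * direction

neutral = debias(emb["engineer"], g)
print(emb["engineer"] @ g, neutral @ g)   # projection onto gender direction: before vs. after (~0)
```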
Points to Consider
» Works with pre-trained word embeddings
» Harder to integrate into systems that learn embeddings
» Not all subpopulations have a definitional “direction”
» Can’t guarantee that you have eliminated biases
» Need to assess downstream effects on performance
This Class
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
OPEN QUESTIONS
Semi-Structured Interviews
[Holstein et al., 2019]
Anonymous Survey
[Chart: number of survey respondents by application area — Natural Language…, Predictive Analytics, Computer Vision, Decision Support, Search / Info. Retrieval, Recommender Systems, Chatbots / Conversational…, Speech and Voice, User Modeling / Adaptive…, Robotics / Cyberphysical…]
[Chart: number of survey respondents by role — Data Scientist, Researcher, Software Engineer, Technical Lead / Manager, Project / Program Manager, Domain / Content Expert, Executive / General…, Data Labeler, Social Scientist, Product Manager]
High-Level Themes
» Needs for support in auditing systems for biases in a diverse range of applications beyond allocation
» Needs for support in creating “fairer” datasets
» Needs for support in identifying subpopulations
» Needs for support in detecting biases with access only to coarse-grained, partial, or indirect information
TAKEAWAYS
3 Calls to Action
» Prioritize fairness at every stage of ML pipeline
» Fairness should be a first-order priority
» Involve diverse stakeholders & multiple perspectives
» Fairness is a non-trivial sociotechnical challenge
» Adopt a growth mindset & learn from failures
» Can’t solve fairness, can’t debias, can’t neutralize
http://www.microsoft.com/en-us/research/group/fate/
THANKS