TRANSCRIPT
Machine Learning & Fairness
Jenn Wortman Vaughan & Hanna Wallach, Microsoft Research New York City
Who?
http://www.microsoft.com/en-us/research/group/fate/
Aether Committee
Bias & Fairness Working Group
Intelligibility Working Group
AI & Machine Learning
NeurIPS Registrations
[Chart: number of NeurIPS registrations per year, 2002–2018]
THE AGE OF AI
OPPORTUNITIES
Microsoft
CHALLENGES
The Media…
Employment
Criminal Justice
Advertising
[Sweeney, 2013]
FAIRNESS
LEARN FROM SECURITY & PRIVACY
SOME HISTORY…
GROWTH MINDSET
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
TYPES OF HARM
[Shapiro et al., 2017]
Allocation
Allocation
Quality of Service
[Buolamwini & Gebru, 2018]
Quality of Service
Stereotyping
Stereotyping
[Caliskan et al., 2017]
Stereotyping
Denigration
Denigration
Over- and Under-Representation
[Kay et al., 2015]
Types of Harm
Examples, each marked on the slide with the harm types it involves (Allocation, Quality of Service, Stereotyping, Denigration, Over- or Under-Representation):
» Hiring system does not rank women as highly as men for technical jobs
» Gender classification software misclassifies darker-skinned women
» Machine translation system exhibits male/female gender stereotypes
» Photo management program labels images of black people as “gorillas”
» Image searches for “CEO” yield only photos of white men on the first page
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
WHO?
Subpopulations
» Protected subpopulations, e.g., race, gender, age
» Historically marginalized subpopulations
» Not always easy to identify subpopulations
» 62% of industry practitioners reported it would be very or extremely useful to have support in this area
» Subpopulations may be application-specific
[Holstein et al., 2019]
Subpopulations
“ [P]eople start thinking about sensitive attributes like your ethnicity, your religion, your sexuality, your gender. But the biggest problem I found is that these cohorts should be defined based on the domain and problem. For example, for [automated writing evaluation] maybe it should be defined based on [... whether the writer is] a native speaker.
Intersectionality
Access to Attributes
» Many teams have no access to relevant attributes
» Makes it hard to audit systems for biases
» One option is to collect attributes purely for auditing
» Raises privacy concerns, users may object
» Another option is to use ML to infer relevant attributes
» Shifts the problem, can introduce new biases
Social Constructs
[Buolamwini & Gebru, 2018]
Individual Fairness
Counterfactual Fairness
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
ML Pipeline
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Task Definition
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Task Definition
[Wu & Zhang, 2016]
Task Definition
Task Definition
» Clearly define the task & model’s intended effects
» Try to identify any unintended effects & biases
» Involve diverse stakeholders & multiple perspectives
» Try to refine task definition & be willing to abort
» Document any unintended effects & biases
Dataset Construction
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Data: Societal Bias
Data: Societal Bias
Data: Skewed Sample
Data: Skewed Sample
Data: Skewed Sample
“ It sounds easy to just say like, “Oh, just add some more images in there,” but [...] there's no person on the team that actually knows what all of [these celebrities] look like [...] If I noticed that there's some celebrity from Taiwan that doesn't have enough images in there, I actually don't know what they look like to go and fix that. It's a non-trivial problem [...] But, Beyoncé, I know what she looks like.
Data: Source
» Think critically before collecting any data
» Check for biases in data source selection process
» Try to identify societal biases present in data source
» Check for biases in cultural context of data source
» Check that data source matches deployment context
Data: Collection Process
» Check for biases in technology used to collect data
» Check for biases in humans involved in collecting data
» Check for biases in strategy used for sampling
» Ensure sufficient representation of subpopulations
» Check that collection process itself is fair & ethical
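As a quick check for the representation bullet above, a minimal sketch (the column names are hypothetical) might tally examples and label rates per subpopulation:

```python
import pandas as pd

# Hypothetical dataset with a column for a relevant attribute and a label.
df = pd.DataFrame({"gender": ["F", "M", "M", "M", "F", "M"],
                   "label":  [1,   0,   1,   1,   0,   1]})

# How many examples, and what fraction of positive labels, per subpopulation?
print(df["gender"].value_counts())
print(df.groupby("gender")["label"].mean())
```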
Data: Labeler Bias
Data: Labeling & Preprocessing
» Check whether discarding data introduces biases
» Check whether bucketing introduces biases
» Check preprocessing software for biases
» Check labeling/annotation software for biases
» Check that human labelers do not introduce biases
Data: Documentation
DATASHEETS
[Gebru et al., 2018]
Datasheets for Datasets
Motivation
Composition
Collection Process
Preprocessing
Distribution
Maintenance
Legal & Ethical
Questions
Composition
Collection Process
Points to Consider
» What is the right set of questions?
» How best to handle continually evolving datastreams?
» Are there legal or PR risks to creating datasheets?
» What is the right process for making a datasheet?
» How best to incentivize developers & PMs?
» How much (if anything) should be automated?
Model Definition
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
What is a Model?
price of house = w1 * number of bedrooms +
w2 * number of bathrooms +
w3 * square feet +
a little bit of noise
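As a rough illustration of this model (not from the talk), the same linear form can be written directly in code; the weight values below are made up purely for the example:

```python
import numpy as np

# Hypothetical weights for illustration only; in practice they are learned from data.
w = np.array([10_000.0, 5_000.0, 150.0])   # w1, w2, w3

def predict_price(bedrooms, bathrooms, square_feet, noise_std=1_000.0):
    """Linear model: price = w1*bedrooms + w2*bathrooms + w3*square_feet + noise."""
    features = np.array([bedrooms, bathrooms, square_feet])
    return float(w @ features + np.random.normal(scale=noise_std))

print(predict_price(3, 2, 1500))
```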
Model: Assumptions
Model: Assumptions
Model: Structure
[image from Moritz Hardt: majority vs. minority population]
Model: Objective Function
Model Definition
» Clearly define all assumptions about model
» Try to identify biases present in assumptions
» Check whether model structure introduces biases
» Check objective function for unintended effects
» Consider including “fairness” in objective function
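One way to read the last bullet is to add a fairness penalty to the training objective. The sketch below is a hypothetical illustration, not the talk's prescription: squared prediction loss plus a penalty on the gap in average predictions between two groups, traded off by a parameter lam.

```python
import numpy as np

def fairness_penalized_loss(w, X, y, group, lam=1.0):
    """Squared loss plus a penalty on the gap in mean predictions between two groups.

    group is a boolean array marking membership in one subpopulation; lam trades
    off accuracy against the fairness penalty. Illustrative only.
    """
    preds = X @ w
    loss = np.mean((preds - y) ** 2)
    gap = preds[group].mean() - preds[~group].mean()
    return loss + lam * gap ** 2

# Toy usage with two equal-sized groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
group = np.array([True] * 5 + [False] * 5)
print(fairness_penalized_loss(np.zeros(3), X, y, group, lam=2.0))
```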
Training Process
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
What is Training?
price of house = w1 * number of bedrooms +
w2 * number of bathrooms +
w3 * square feet +
a little bit of noise
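Training means estimating the weights w1, w2, w3 from data. A minimal sketch, using made-up example houses and an ordinary least-squares fit (the talk does not prescribe a particular estimator):

```python
import numpy as np

# Toy training data: [bedrooms, bathrooms, square_feet] -> price. Illustrative only.
X = np.array([[3, 2, 1500],
              [4, 3, 2200],
              [2, 1, 900],
              [5, 3, 2800]], dtype=float)
y = np.array([310_000, 455_000, 180_000, 560_000], dtype=float)

# Least-squares estimate of w1, w2, w3.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # learned weights
print(X @ w)    # model's predicted prices on the training examples
```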
Training Process
Testing Process
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Testing: Data
Testing: Metrics
Testing: Metrics
Testing: Metrics
Confusion matrix:

            Unqualified   Qualified
  Reject        TN            FN
  Hire          FP            TP
Testing: Metrics
Confusion matrices:

  Men:              Unqualified   Qualified
    Reject              15             5
    Hire                20            60

  Women:            Unqualified   Qualified
    Reject              60            20
    Hire                 5            15
Demographic Parity
(Same Men and Women confusion matrices as above: men are hired at a rate of 80% and women at 20%, so demographic parity, which requires equal hiring rates across groups, does not hold.)
Testing: Metrics
Predictive Parity
(Same confusion matrices as above: 75% of hired men and 75% of hired women are qualified, so predictive parity, which requires equal precision across groups, holds.)
False Positive Rate Balance
(Same confusion matrices as above: unqualified men are hired at a rate of 20/35 ≈ 57% while unqualified women are hired at 5/65 ≈ 8%, so false positive rates are not balanced.)
False Negative Rate Balance
(Same confusion matrices as above: qualified men are rejected at a rate of 5/65 ≈ 8% while qualified women are rejected at 20/35 ≈ 57%, so false negative rates are not balanced.)
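To make these metrics concrete, the sketch below (not from the slides) computes each of them from the two confusion matrices above: demographic parity compares hiring rates, predictive parity compares precision, and the balance criteria compare false positive and false negative rates.

```python
def rates(tn, fn, fp, tp):
    """Selection rate, precision, FPR, and FNR from one confusion matrix."""
    total = tn + fn + fp + tp
    return {
        "hire_rate": (fp + tp) / total,   # demographic parity compares these
        "precision": tp / (tp + fp),      # predictive parity compares these
        "fpr": fp / (fp + tn),            # false positive rate balance
        "fnr": fn / (fn + tp),            # false negative rate balance
    }

men = rates(tn=15, fn=5, fp=20, tp=60)
women = rates(tn=60, fn=20, fp=5, tp=15)

for metric in men:
    print(f"{metric}: men={men[metric]:.2f}, women={women[metric]:.2f}")
# Precision is equal (predictive parity holds), but hiring rates, FPRs,
# and FNRs differ sharply, so the other three criteria are violated.
```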
Testing: Metrics
Testing: Metrics
Impossibility Theorem
Testing: Metrics
Testing Process
» Check that test data matches deployment context
» Ensure test data has sufficient representation
» Involve diverse stakeholders & multiple perspectives
» Clearly state all fairness requirements for model
» Use metrics to check that requirements are met
Deployment Process
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Deployment: Context
[Phillips et al., 2011]
Deployment Process
» Check that data source matches deployment context
» Monitor match between training data & deployment data
» Monitor fairness metrics for unexpected changes
» Invite diverse stakeholders to audit system for biases
» Monitor user reports & user complaints
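For the monitoring bullets above, one simple check (a sketch with hypothetical group labels, not a prescribed method) is to compare subpopulation proportions in the training data against live deployment traffic:

```python
import numpy as np

def group_shift(train_groups, deploy_groups):
    """Compare subpopulation proportions between training data and deployment traffic."""
    groups = sorted(set(train_groups) | set(deploy_groups))
    for g in groups:
        p_train = np.mean(np.asarray(train_groups) == g)
        p_deploy = np.mean(np.asarray(deploy_groups) == g)
        print(f"{g}: train={p_train:.2f} deploy={p_deploy:.2f} shift={p_deploy - p_train:+.2f}")

# Toy example: group B is much more common in deployment than in training.
group_shift(["A"] * 80 + ["B"] * 20, ["A"] * 55 + ["B"] * 45)
```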
Feedback Loop
task definition
dataset construction
model definition
training process
testing process
deployment process
feedback loop
Feedback: Non-Adversarial
Feedback: Adversarial
Feedback Loop
» Monitor match between training & deployment data
» Monitor fairness metrics for unexpected changes
» Monitor user reports & user complaints
» Monitor users’ interactions with system
» Consider prohibiting some types of interactions
This Talk
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
SOFTWARE TOOLS
Academic Response
[image from Moritz Hardt]
AUDITING
Aequitas
IBM AI Fairness 360
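Both toolkits support audits along these lines; the sketch below uses plain pandas with hypothetical column names, rather than either library's API, just to show the shape of a per-group audit:

```python
import pandas as pd

# Hypothetical audit log of a hiring model's decisions.
df = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B"],
    "qualified": [1,   0,   1,   1,   0,   1],
    "hired":     [1,   1,   1,   0,   0,   1],
})

# Selection rate, false positive rate, and false negative rate per group.
audit = df.groupby("group").apply(
    lambda g: pd.Series({
        "selection_rate": g["hired"].mean(),
        "fpr": g.loc[g["qualified"] == 0, "hired"].mean(),
        "fnr": 1 - g.loc[g["qualified"] == 1, "hired"].mean(),
    })
)
print(audit)
```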
Points to Consider
» Fairness is a non-trivial sociotechnical challenge
» Many types of harm relate to a broader cultural context than a single decision-making system
» Many aspects of fairness not captured by metrics
» No free lunch! Can’t simultaneously satisfy all metrics
» Need to make different tradeoffs in different contexts
CLASSIFICATION
[Agarwal et al., 2018]
“Fair” Classification
» Choose fairness metric w/r/t relevant attributes
» ML goal becomes maximizing classifier accuracy while minimizing unfairness according to the metric
» Two technical challenges:
» Choosing an appropriate fairness metric
» Learning an accurate model subject to the metric
fairness-constrained classification ➧ cost-sensitive classification
A Reductions Approach
Many Benefits
» Works with many different fairness metrics
» Agnostic to form of classifier & training algorithm
» Doesn’t need deployment access to relevant attributes
» Important for teams that have no such access
» Important for “disparate treatment” concerns
» Guaranteed to find the most accurate classifier subject to the fairness constraints
In Practice…
Python Library
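Assuming the library referred to here is what was later released as fairlearn, a usage sketch of the reductions approach might look roughly like the following; the class and argument names are taken from that package and may differ across versions, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Toy data: two features, binary labels, and a binary relevant attribute A.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
A = rng.integers(0, 2, size=200)            # e.g., gender, used only during training
y = (X[:, 0] + 0.5 * A + rng.normal(scale=0.5, size=200) > 0).astype(int)

base = LogisticRegression(solver="liblinear")
mitigator = ExponentiatedGradient(base, constraints=DemographicParity())

# The relevant attribute is needed at training time but not at prediction time.
mitigator.fit(X, y, sensitive_features=A)
y_pred = mitigator.predict(X)
print(y_pred[:10])
```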
IBM AI Fairness 360
Points to Consider
» Need to choose an appropriate fairness metric
» Still need to assess other fairness metrics
» Accuracy–fairness tradeoff may be illusory
» Test data may not match deployment context
» Fairness is a non-trivial sociotechnical challenge
» Many aspects of fairness not captured by metrics
WORD EMBEDDINGS
Data: Societal Bias
[Caliskan et al., 2017]
Bias in Word Embeddings
[Bolukbasi et al., 2016]
Python Library
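The core step in Bolukbasi et al. [2016] is to identify a gender "direction" and project it out of gender-neutral word vectors. Below is a minimal numpy sketch of that projection, using tiny made-up vectors rather than the released library or real embeddings:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real ones are learned and much higher-dimensional.
emb = {
    "he":       np.array([ 0.8,  0.1,  0.3,  0.0]),
    "she":      np.array([-0.8,  0.1,  0.3,  0.0]),
    "engineer": np.array([ 0.3,  0.7, -0.2,  0.4]),
}

# Gender direction from a definitional pair (the paper combines several such pairs).
g = emb["he"] - emb["she"]
g = g / np.linalg.norm(g)

def debias(v, direction):
    """Remove the component of v along the given direction."""
    return v - (v @ direction) * direction

neutral = debias(emb["engineer"], g)
print(emb["engineer"] @ g, neutral @ g)   # projection onto gender direction: before vs. after (~0)
```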
Points to Consider
» Works with pre-trained word embeddings
» Harder to integrate into systems that learn embeddings
» Not all subpopulations have a definitional “direction”
» Can’t guarantee that you have eliminated biases
» Need to assess downstream effects on performance
This Class
» What are (some of) the different types of harm?
» Which subpopulations are likely to be affected?
» Where do these harms come from and what are some effective strategies to help mitigate them?
» Which software tools can help mitigate them?
OPEN QUESTIONS
Semi-Structured Interviews
[Holstein et al., 2019]
Anonymous Survey
[Chart: number of survey respondents by application area — Natural Language…, Predictive Analytics, Computer Vision, Decision Support, Search / Info. Retrieval, Recommender Systems, Chatbots / Conversational…, Speech and Voice, User Modeling / Adaptive…, Robotics / Cyberphysical…]
[Chart: number of survey respondents by role — Data Scientist, Researcher, Software Engineer, Technical Lead / Manager, Project / Program Manager, Domain / Content Expert, Executive / General…, Data Labeler, Social Scientist, Product Manager]
High-Level Themes
» Needs for support in auditing systems for biases in a diverse range of applications beyond allocation
» Needs for support in creating “fairer” datasets
» Needs for support in identifying subpopulations
» Needs for support in detecting biases with access only to coarse-grained, partial, or indirect information
TAKEAWAYS
3 Calls to Action
» Prioritize fairness at every stage of ML pipeline
» Fairness should be a first-order priority
» Involve diverse stakeholders & multiple perspectives
» Fairness is a non-trivial sociotechnical challenge
» Adopt a growth mindset & learn from failures
» Can’t solve fairness, can’t debias, can’t neutralize
http://www.microsoft.com/en-us/research/group/fate/
THANKS