TRANSCRIPT
Data ethics and machine learning
Discrimination, algorithmic bias, and how to discover them.
Dino Pedreschi
KDDLab, Dipartimento di Informatica, Università di Pisa
Opportunities of big data
◦ Spot business trends
◦ Prevent diseases
◦ Fight crime
◦ Improve transportation
◦ Personalised services
◦ Improve wellbeing
Event Detection
Detecting events in a geographic area by classifying the different kinds of users.
Covered geographical region: city of Rome (metropolitan area)
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day
Period: 8 months between 2015 and 2016
[Map: detected events in Rome at San Pietro, San Giovanni, Circo Massimo, Stadio Olimpico]
Personal mobility assistant
[Diagram: end users (traveler, mobility manager); the assistant turns trip data into advice for travelers, and lets the city monitor mobility and set rules & incentives.]
Carpooling Network
Estimating wellbeing with mobility data
Predicting GDP with Retail Market data
[Figure: predicting GDP from retail market data (product, price, quantity needed); a generic utility function (rationality), a personal utility function (diversity), and sophistication; R² = 17.25%, 32.38%, and 85.72%.]
Risks of big data
Big Data, Big Risks
“Big data is algorithmic, therefore it cannot be biased!” And yet…
• All traditional evils of social discrimination, and many new ones, exhibit themselves in the big data ecosystem
• Because of its tremendous power, massive data analysis must be used responsibly
• Technology alone won’t do: we also need policy, user involvement, and education efforts
By 2018, 50% of business ethics violations will occur through improper use of big data analytics
[source: Gartner, 2016]
The danger of black boxes - 1
The COMPAS score (Correctional Offender Management Profiling for Alternative Sanctions)
A 137-question questionnaire and a predictive model for “risk of crime recidivism.” The model is a proprietary secret of Northpointe, Inc.
The data journalists at propublica.org have shown that:
• the prediction accuracy of recidivism is rather low (around 60%)
• the model has a strong ethnic bias:
◦ blacks who did not reoffend were classified as high risk twice as often as whites who did not reoffend
◦ whites who did reoffend were classified as low risk twice as often as blacks who did reoffend
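Both disparities are comparisons of group-conditional error rates. A minimal sketch of that computation, assuming a hypothetical table with columns race, reoffended (0/1), and predicted_high_risk (0/1):

```python
import pandas as pd

def error_rates_by_group(df: pd.DataFrame, group_col: str = "race") -> pd.DataFrame:
    """Per-group false positive rate (non-reoffenders labelled high risk)
    and false negative rate (reoffenders labelled low risk)."""
    rows = {}
    for group, g in df.groupby(group_col):
        fpr = g.loc[g["reoffended"] == 0, "predicted_high_risk"].mean()
        fnr = 1 - g.loc[g["reoffended"] == 1, "predicted_high_risk"].mean()
        rows[group] = {"FPR": fpr, "FNR": fnr}
    return pd.DataFrame(rows).T

# A ratio of about 2 between two groups' FPRs (or FNRs) corresponds to
# the "twice as often" disparities reported above.
```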
The danger of black boxes - 2
The three major US credit bureaus, Experian, TransUnion, and Equifax, which provide credit scores for millions of individuals, are often discordant.
In a study of 500,000 records, 29% of consumers received credit scores that differed by at least fifty points between credit bureaus, a difference that may mean tens of thousands of dollars over the life of a mortgage [CRS+16].
The danger of black boxes - 3
In 2010, some homeowners with a regular mortgage payment history reported a sudden drop of forty points in their credit score, soon after their own enquiry.
The danger of black boxes - 4
During the 1970s and 1980s, St. George’s Hospital Medical School in London used a computer program for initial screening of job applicants.
The program used information from applicants’ forms, which contained no reference to ethnicity.
The program was found to unfairly discriminate against female applicants and ethnic minorities (inferred from surnames and places of birth), who were less likely to be selected for interview [LM88].
The danger of black boxes - 5
In a recent paper at SIGKDD 2016 [RSG16] the authors show how an accurate but untrustworthy classifier may result from an accidental bias in the training data.
In a task of discriminating wolves from huskies in a dataset of images, the resulting deep learning model is shown to classify a wolf in a picture based solely on … the presence of snow in the background!
[RSG16] Ribeiro, Singh, Guestrin. “Why Should I Trust You?” Explaining the Predictions of Any Classifier. SIGKDD 2016.
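[RSG16] is the paper that introduced LIME, released as the Python package lime. A hedged sketch of how it can surface such artifacts; the model and image objects below are assumed to exist, and the classifier wrapper is hypothetical:

```python
import numpy as np
from lime import lime_image

def classifier_fn(images):
    # Hypothetical wrapper: a batch of (H, W, 3) images in, class
    # probabilities out, e.g. [p(husky), p(wolf)] per image.
    return model.predict(np.asarray(images))

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, classifier_fn, top_labels=2, num_samples=1000)

# Superpixels that drive the top prediction; on the biased model of
# [RSG16] they turn out to cover the snowy background, not the animal.
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5)
```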
Deep learning is creating computer systems we don't fully understand
www.theverge.com/2016/7/12/12158238/first-click-deep-learning-algorithmic-black-boxes
Is AI Permanently Inscrutable?
nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable
The danger of black boxes - 6
In a recent study at Princeton University, the authors show that the semantics derived automatically from large text/web corpora contain human biases:
◦ e.g., names associated with whites were found to be significantly easier to associate with pleasant than with unpleasant terms, compared to names associated with black people.
Therefore, any machine learning model trained on text data for, e.g., sentiment or opinion mining has a strong chance of inheriting the prejudices reflected in the human-produced training data.
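Such associations can be measured directly on word embeddings, by comparing how close a name vector lies to pleasant versus unpleasant terms. A minimal sketch, where the embedding lookup emb and the word lists are hypothetical stand-ins for the study's actual corpora and stimuli:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pleasantness(word, pleasant, unpleasant, emb):
    """Mean similarity to pleasant terms minus mean similarity to
    unpleasant terms; higher = easier to associate with pleasant."""
    pos = np.mean([cosine(emb[word], emb[p]) for p in pleasant])
    neg = np.mean([cosine(emb[word], emb[u]) for u in unpleasant])
    return pos - neg

# Averaging pleasantness() over names typical of each group, and then
# comparing the group averages, quantifies the association gap above.
```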
Human Bias
Human Bias can be Learned - 7
As we stated in our 2008 SIGKDD paper that started the field of discrimination-aware data mining [PRT08]:
“learning from historical data recording human decision making may mean to discover traditional prejudices that are endemic in reality, and to assign to such practices the status of general rules, maybe unconsciously, as these rules can be deeply hidden within the learned classifier.”
Policies
BIG DATA ETHICS
Satya Nadella's rules for AI
www.theverge.com/2016/6/29/12057516/satya-nadella-ai-robot-laws
U.S. – F.T.C.
www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf (January 2016)
U.S. – White House
www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf (May 2014)
U.S. – White House
www.whitehouse.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf (May 2016)
U.S. – White House www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf (October 2016)
E.U. - EDPS
secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2015/15-11-19_Big_Data_EN.pdf
E.U. - EDPS
secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2015/15-09-11_Data_Ethics_EN.pdf
Netherlands www.knaw.nl/en/news/publications/ethical-and-legal-aspects-of-informatics-research (September 2016)
Big Data Ethics informationaccountability.org/big-data-ethics-initiative/
Value-Sensitive Design
◦ Design for privacy
◦ Design for security
◦ Design for inclusion
◦ Design for sustainability
◦ Design for democracy
◦ Design for safety
◦ Design for transparency
◦ Design for accountability
◦ Design for human capabilities
EU Projects: SoBigData.eu
Social Mining & Big Data Ecosystem project (SoBigData, H2020-INFRAIA-2014-2015, duration: 2015-2019, www.sobigdata.eu)
Data Ethics Literacy
MIUR report on Big Data, 28 July 2016
◦ www.istruzione.it/allegati/2016/bigdata.pdf
Master UNIPI in Big Data Analytics & Social Mining
◦ masterbigdata.it
Data ethics technologies
DISCRIMINATION DISCOVERY FROM DATA
Discrimination discovery
Given:
◦ a historical database of decision records, each describing the features of an applicant for a benefit
◦ e.g., a credit request to a bank and the corresponding decision on credit approval/denial
◦ some designated categories of applicants, such as groups protected by anti-discrimination laws,
find whether, and in which circumstances, there is evidence of discrimination against the designated categories emerging from the data.
German Credit dataset
How? Fight with the same weapons
Idea: use data mining to discover discrimination
◦ the decision policies hidden in a database can be represented as decision rules and discovered by frequent pattern mining
◦ once all such decision rules have been found, highlight all potential niches of discrimination by filtering the rules with a measure that quantifies discrimination risk; a sketch of this pipeline follows
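As a sketch of this pipeline, using the Apriori and association-rule routines of the mlxtend library (the decision table decisions and the item name CREDIT_bad are illustrative, in the style of the German Credit rule on the next slide):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encode a decision table: one row per application,
# one boolean column per attribute=value item.
onehot = pd.get_dummies(decisions)  # 'decisions' is a hypothetical DataFrame

itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.1)

# Keep only classification rules predicting the negative decision;
# these are then ranked by a discrimination-risk measure such as elift.
neg_rules = rules[rules["consequents"].apply(lambda c: c == {"CREDIT_bad"})]
```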
Discrimination discovery from data
FOREIGN_WORKER=yes & PURPOSE=new_car & HOUSING=own → CREDIT=bad
◦ elift = 5.19, supp = 56, conf = 0.37
elift = 5.19 means that foreign workers have more than 5 times the probability of being refused credit compared to the average population (even if they own their house).
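In formula form, for a rule A, B → C with protected itemset A and context B: elift = conf(A, B → C) / conf(B → C). A minimal sketch over a flat decision table, with column names mirroring the rule above:

```python
def elift(df, protected, context, outcome):
    """elift = conf(protected & context -> outcome) / conf(context -> outcome)."""
    in_ctx = df
    for col, val in context.items():                       # restrict to context B
        in_ctx = in_ctx[in_ctx[col] == val]
    base_conf = (in_ctx[outcome[0]] == outcome[1]).mean()  # conf(B -> C)
    prot = in_ctx[in_ctx[protected[0]] == protected[1]]    # add protected itemset A
    prot_conf = (prot[outcome[0]] == outcome[1]).mean()    # conf(A,B -> C)
    return prot_conf / base_conf

# elift(df, protected=("FOREIGN_WORKER", "yes"),
#       context={"PURPOSE": "new_car", "HOUSING": "own"},
#       outcome=("CREDIT", "bad"))  # ≈ 5.19 in the example above
```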
Case study: grant evaluation
Outcome: funded / not funded / conditionally funded
Dataset attributes
◦ Features of the PI
◦ Project costs
◦ Research Area
◦ Project Evaluation
A potentially discriminatory rule
Antecedent:
◦ project proposals in “Physical and Analytical Chemical Sciences”
◦ young females
◦ total cost of 1,358,000 euros or above
Possible interpretation:
◦ “Peer-reviewers of panel PE4 trusted young females requiring high budgets less than males leading similar projects”
Case study: US Harmonized Tariff System
US Harmonized Tariff System (HTS)
https://hts.usitc.gov/
Detailed tariff classification system for merchandise imported into the US
Chapters 61, 62, 64, 65: apparel
◦ different taxes for the same garments separately produced for males and females
◦ descriptions are in semi-structured form

Different taxes for the same apparel for men and women:

                   Coats              Fur felt hats     Cotton pajamas
Women and girls    64.4¢/kg + 18.8%   96¢/doz + 1.4%    8.5%
Men and boys       38.6¢/kg + 10%     0                 8.9%

Women: 14%, men: 9%. The difference is worth 1.3 billion USD!
Totes-Isotoner Corp. v. U.S.
Rack Room Shoes Inc. and Forever 21 Inc. v. U.S.
Court of International Trade
U.S. Court of Appeals for the Federal Circuit (2014)
“[…] the courts may have concluded that Congress had no discriminatory intent when ruling the HTS, but there is little doubt that gender-based tariffs have discriminatory impact”
Sample rule from the HTS dataset
Soccer Player Ratings
How do humans evaluate sports performance?
[Figures: machine performance computed from technical features alone, and from technical + contextual features, compared against the human evaluation line.]
Wrapping up
Right of explanation
• Applying AI in many domains requires transparency and responsibility:
◦ health care
◦ finance
◦ surveillance
◦ autonomous vehicles
◦ government
• The EU General Data Protection Regulation (April 2016) establishes (?) a right of explanation for all individuals to obtain “meaningful explanations of the logic involved” when automated (algorithmic) individual decision-making, including profiling, takes place.
• In sharp contrast, (big) data-driven AI/ML models are often black boxes.
Accountability
“Why exactly was my loan application rejected?”
“What could I have done differently so that my application would not have been rejected?”
Social Mining & Big Data Ecosystem
www.sobigdata.eu
Knowledge Discovery & Data Mining Lab
http://kdd.isti.cnr.it
Special thanks
• Salvatore Ruggieri
• Franco Turini
• Fosca Giannotti
• Anna Monreale
• Luca Pappalardo
SMARTCATs