innovating compliantly and transparently ‐ road blocks, myths … · 2017-03-29 · innovating...



Innovating compliantly and transparently ‐ road blocks, myths and solutions

Jana Diesner, PhD
Assistant Professor
School of Information Sciences / The iSchool at Illinois
University of Illinois at Urbana‐Champaign


Enablers And Benefits of Openness

[Diagram labels]
• Regulations and norms
• Transparency
• Openness
• Reproducibility
• Trust
• Value added
• Incentive mechanisms, business models
• Infrastructures


Open what? It’s complicated

• Privacy statements
  – Users pay little attention, hard to understand (McDonald & Cranor 2008, Acquisti & Grossklags 2005)
• Regulations for human‐centered and online data
  – Researchers pay little attention, hard to understand
  – IRB (1979): “To protect the rights and welfare of humans participating as subjects in the research”
    • Respect for people, beneficence, minimize risk
    • For intervention or interaction with living individuals and/or identifiable private information
  – Golden times? Listen, don’t ask (passive measurement, Zevenbergen et al. 2015) and measure, don’t estimate

Diesner J (2015) Small Decisions with Big Impact on Data Analytics. Big Data & Society, special issue: Assumptions of Sociality. 


Working with Human‐Centered and Online Data: Some Practical Questions 

• Awareness:
  – If an IRB does not apply to our project, is there an ethics or privacy review board, protocol or process?
  – What governs data use in commercial settings?
• Knowledge:
  – What's the relationship between copyright, terms of service and privacy? What trumps what?
  – Does “personal use” include “research use”?
  – We got different answers from the IRB, legal, and the library. How to make a decision? (Digital literacy)
• Skills:
  – How do we practically implement terms of use?
  – How can we anonymize social network data?
  – How can we guarantee non‐consumptive use?
• So what makes this all complicated?


What Types of Regulations are out there? 

1. Institutional and organizational norms and regulations
   – Health Insurance Portability and Accountability Act (HIPAA), Fair Information Practice Principles (FIPPs), Menlo Report (Ethical Principles Guiding Information and Communication Technology Research) (2012)
2. Privacy regulations and law
3. Security regulations and law
4. Intellectual property law, copyright
   – Snippets of (appropriated) content
5. Terms of use/service
6. Technical constraints (robots.txt, APIs)
7. Personal values
   – People apply them consciously or unconsciously
   – Depend on gender (Gilligan 1987), culture (Graham et al. 2011)
   – Age 16+: conventional morality (comply with (group) norms) versus 10‐15% post‐conventional morality (own principles) (Kohlberg 1984)
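Of the regulation types listed above, the technical constraints in item 6 are the ones that can be checked mechanically. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the user-agent name and the example.org URLs are assumptions made up for illustration, and a real crawler would fetch the file from the site it visits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

AGENT = "research-crawler"  # hypothetical user-agent name

# Parse the rules from the string instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether this agent may fetch two example URLs.
allowed_forum = parser.can_fetch(AGENT, "https://example.org/forum/thread/1")
allowed_private = parser.can_fetch(AGENT, "https://example.org/private/profile/1")

# Requested pause between requests, in seconds (None if not specified).
delay = parser.crawl_delay(AGENT)

print(allowed_forum, allowed_private, delay)
```

Passing this check covers only the technical constraint; terms of use, copyright and privacy regulations still apply independently.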


Different people/fields are driven by different practical/real‐world approaches

Driven by pragmatics
• Utilitarian ethics
• Technical feasibility
• E.g., some of Web Science

Driven by rule compliance
• Vs. learning from examples and common practice
• 73% of respondents (N = 263 from academia, industry, government) consider it permissible to “scrape data from online forums”, 21% hold a neutral opinion, together 94% (Vitak, Shilton & Ashktorab 2016)
• Set quasi standards
• Lack of standards

Driven by ethics/personal values
• Shweder 1997:
  • Autonomy (protect individual rights and justice)
  • Community oriented (preserve institutions and social order; effect: sense of duty, respect, loyalty)
  • Divinity (protect people from degradation, e.g. due to selfishness)


Open, Free!?

• Open Science, Open Intelligence, Open you name it…
  – Gratis versus libre (FLOSS, Stallman, GNU)
  – User‐generated data from 3rd party platforms often “free to see” (alternative copyright models, Lessig, Creative Commons)
• Browsewrap agreements not enforceable:
  – “Terms of Use” hyperlinks “not sufficiently conspicuous” (obvious) for “reasonably prudent internet consumer” (plaintiff did not manifest unambiguous assent to be bound by Terms of Use) (Long v. Provide Commerce, Inc., 2016 WL 1056555, Cal. Ct. App., 03/17/2016)
• Working with online data is kind of like archival research (Kosinski et al. 2015)
  – No consent needed if 1) users consciously made their data public, 2) collected data are anonymized, 3) researchers do not interact with participants, 4) no identifiable user information is published


Solutions

1. Education

2. Compliance

3. Technology (room for improvement) 

4. Policy, Legislation 

5. For pay models, subscriptions


Accuracy and Transparency at Scale

“In viel weiterem Umfange, als man sich klar zu machen pflegt, ruht unsre moderne Existenz von der Wirtschaft, die immer mehr Kreditwirtschaft wird, bis zum Wissenschaftsbetrieb, in dem die Mehrheit der Forscher unzählige, ihnen gar nicht nachprüfbare Resultate anderer verwenden muß, auf dem Glauben an die Ehrlichkeit des andern.”

“To a much wider degree than we often think, our modern existence, from business and trade, which is turning more and more into a credit economy, to the pursuit of science, where the majority of researchers have to work with results that were produced by others and that cannot be verified by the researcher, relies upon the belief in the honesty of other people.”

Simmel, G. (1908). Das Geheimnis und die geheime Gesellschaft. In Soziologie. Untersuchungen über die Formen der Vergesellschaftung (pp. 256‐304). Berlin: Duncker & Humblot.


Entity Resolution in Graphs

[Images: Mark Newman, UMich; Mark Newman, UMich]

• Splitting
  – Same surface form, different social entities
  – 46,157 John Smith, 562 Mark Newman
• Merging/consolidation
  – Collect all references to same unique entity
  – Aka co‐reference resolution, record linkage
  – Craig Evans, Craig S. Evans, C.S. Evans, …

Diesner J, Evans C, Kim J (2015) Impact of entity disambiguation errors on social network properties. International AAAI Conference on Web and Social Media (ICWSM), Oxford, UK
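Merging can be pictured as rewriting every edge endpoint to one canonical name before the network is built. The sketch below is an illustration only, not the procedure from the cited paper: the alias table is hand-written, the edge list is toy data, and "Person A" and "Person B" are placeholder contacts.

```python
# Hypothetical alias table: each surface form maps to one canonical person.
ALIASES = {
    "Craig Evans": "Craig S. Evans",
    "C.S. Evans": "Craig S. Evans",
}

def canonical(name):
    """Return the canonical form of a name, or the name itself if unknown."""
    return ALIASES.get(name, name)

def consolidate(edges):
    """Rewrite both endpoints of every edge; drop duplicates and self-loops."""
    merged = set()
    for source, target in edges:
        s, t = canonical(source), canonical(target)
        if s != t:
            merged.add((s, t))
    return merged

def nodes(edges):
    """Collect the set of distinct endpoints in an edge list."""
    return {name for edge in edges for name in edge}

# Toy edge list (assumption) with three surface forms of the same person.
raw_edges = [
    ("Craig Evans", "Person A"),
    ("Craig S. Evans", "Person A"),
    ("C.S. Evans", "Person B"),
    ("Craig Evans", "C.S. Evans"),
]

clean_edges = consolidate(raw_edges)
print(len(nodes(raw_edges)), "nodes before,", len(nodes(clean_edges)), "after")
```

Splitting is the reverse problem and cannot be solved with a lookup table alone, because the same surface form has to be assigned to different entities based on context.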


Why Bother?

• Impact and propagation of errors (magnitude, upper and lower bound) on (robustness of) (network) data, properties, findings, conclusions largely unknown
• Worth the efforts and costs?
• Highly accurate algorithmic solutions exist (accuracy in the 90% range)
• Payoff from incremental improvements?


Disambiguation: What do we know already?

• Big deal in bibliometrics: heuristics, rules
  – First initial based disambiguation: M. Newman = M. Newman
  – All initial based disambiguation: M. E. Newman != M. W. Newman
  – Justification: upper and lower bound of true number of nodes (Newman, 2001): Is that true?
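The two heuristics differ only in the key used to decide whether two name strings denote the same node. A minimal sketch, assuming names arrive as "Given [Middle] Last" strings; the function names are illustrative and not taken from the cited work.

```python
def split_name(full_name):
    """Return (last name, list of initials) for a 'Given [Middle] Last' string."""
    parts = full_name.replace(".", " ").split()
    return parts[-1], [p[0].upper() for p in parts[:-1]]

def first_initial_key(full_name):
    """Key on the first initial plus last name (merges aggressively)."""
    last, initials = split_name(full_name)
    return f"{initials[0]}. {last}" if initials else last

def all_initials_key(full_name):
    """Key on all initials plus last name (keeps more names apart)."""
    last, initials = split_name(full_name)
    return " ".join([f"{i}." for i in initials] + [last])

names = ["M. E. Newman", "M. W. Newman"]
first_keys = {first_initial_key(n) for n in names}
all_keys = {all_initials_key(n) for n in names}

print(first_keys)        # one node under the first-initial rule
print(sorted(all_keys))  # two nodes under the all-initials rule
```

Neither key uses any context beyond the name string, which is why both can merge distinct people who share a common name.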


Data                   Enron                               MEDLINE
Time range             10/1999-07/2002                     01/2005-12/2009
Number of documents    520,458                             101,162
Domain                 Email                               Co-publishing
Context                Corporate, internal communication   Scientific, external/public communication
Mainly subject to      Merging                             Splitting


Data Preparation: Enron: Consolidation

• Semi‐automated and manually vetted mapping of email addresses to people, incl. full names, job histories, locations (Diesner et al. 2005)
• “Service learning assignment” in graduate courses

# email addresses/person   # people with that # of addresses   Person (* indicted)
26                         1                                   Kenneth Lay, Chairman*
11                         3                                   Jeffrey Skilling, CEO*; David Delainey, Energy Trader*; Vince Kaminski, MD Research
10                         3                                   Susan Scott; Steven Kean, EVP, Chief of Staff; Mark Haedicke, General Counsel
9                          4                                   Mark Taylor, Asst Gen Counsel; Grant Masson, VP Research; Patrice Mims; Jeff Dasovich, Exec - Gov Affairs
8                          5
7                          13
6                          17
5                          36
4                          63
3                          160
2                          1,218
1                          21,753

1,523 people with > 1 email address; average 2.4, median 2


Data Preparation: Enron: Networks

• Raw (worst):
  – Simple directed graph
  – Baseline for no effort
• Disambiguated (better):
  – Actual social entities
  – Only @enron.com email addresses
• Scrubbed (best for now):
  – More consolidation and verification
  – No mailing lists

Number of    Raw        Disambiguated     Scrubbed
                        (diff to raw)     (diff to raw) (diff to disambiguated)
Senders      19,466     6,205 (-68%)      5,441 (-72%) (-12%)
Receivers    72,713     19,700 (-73%)     15,297 (-79%) (-22%)
Addresses    81,811     20,332 (-75%)     15,526 (-81%) (-24%)
Edges        332,683    212,768 (-36%)    188,045 (-43%) (-12%)
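The differences in this table are relative changes, 100 × (new − base) / base, rounded to whole percent. A short sketch that recomputes them from the counts on the slide; the function and variable names are illustrative.

```python
def pct_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base

# Counts from the slide: (raw, disambiguated, scrubbed).
counts = {
    "Senders": (19466, 6205, 5441),
    "Receivers": (72713, 19700, 15297),
    "Addresses": (81811, 20332, 15526),
    "Edges": (332683, 212768, 188045),
}

for name, (raw, disambiguated, scrubbed) in counts.items():
    print(
        f"{name:10s}"
        f" disambiguated vs raw: {pct_change(disambiguated, raw):+.0f}%"
        f" | scrubbed vs raw: {pct_change(scrubbed, raw):+.0f}%"
        f" | scrubbed vs disambiguated: {pct_change(scrubbed, disambiguated):+.0f}%"
    )
```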


Data Preparation: MEDLINE: Disambiguation

• From National Library of Medicine (1950 onwards)
• 2012: 20 million publications
• Medical subject heading (MeSH): brain, 2005‐2009, ~110k articles from 3,700 journals
• Disambiguation: Authority database (Torvik & Smalheiser 2009, 98‐99% accurate), 101K pubs.
• 3 networks
  – Algorithmic (best)
  – All initial (worse)
  – First initial (worst)

                   Algorithmic    All-initials          First-initials
                                  (diff to alg.)        (diff to alg.) (diff to all-initials)
Name instances     557,662        557,662               557,662
Unique entities    258,971        207,256 (-20.0%)      182,421 (-29.6%) (-11.9%)
Edges              1,335,366      1,317,894 (-1.3%)     1,303,957 (-2.4%) (-1.1%)


Email networks (Enron): consolidation of nodes, elimination of errors
Changes in parentheses are relative to Raw.

Network property             Raw (worst)    Manual disamb. (better)    Scrubbed (best)
No. of vertices              81,811         20,332 (-75.15%)           15,526 (-81.02%)
No. of edges                 332,683        212,768 (-36.04%)          188,045 (-43.48%)
Density                      4.97E-05       5.14E-04 (+934%)           7.80E-04 (+1,469%)
Clustering coefficient       0.07637        0.09421 (+18.94%)          0.10698 (+28.61%)
Diameter (directed)          18             10                         10
Diameter (undirected)        15             10                         7
Avg. shortest path length    4.33           3.56 (-17.78%)             3.56 (-17.78%)
No. of components            978            10 (-98.98%)               5 (-99.49%)
Ratio of largest component   96.82%         99.91% (+3.09 pp)          99.95% (+3.13 pp)
Degree centralization        N/A            N/A                        N/A
In-degree centralization     0.01635        0.03052 (+86.67%)          0.03561 (+117.80%)
Out-degree centralization    0.01909        0.07858 (+311.63%)         0.07858 (+311.63%)
Eigenvector centralization   0.99588        0.98552 (-1.04%)           0.98213 (-1.38%)
Betweenness centralization   0.01041        0.02014 (+93.47%)          0.02728 (+164.65%)

Co-publishing networks (MEDLINE): splitting up of nodes, introduction of errors
Changes in parentheses are relative to Algorithmic.

Network property             Algorithmic (best)    All-initials (worse)    First-initial (worst)
No. of vertices              258,971               207,256 (-19.97%)       182,421 (-29.56%)
No. of edges                 1,335,366             1,317,894 (-1.31%)      1,303,957 (-2.35%)
Density                      3.98E-05              6.14E-05 (+54.27%)      7.84E-05 (+96.98%)
Clustering coefficient       0.39                  0.20 (-48.72%)          0.19 (-51.28%)
Diameter                     22                    19 (-13.64%)            18 (-18.18%)
Avg. shortest path length    6.70                  5.21 (-22.24%)          4.78 (-28.66%)
No. of components            10,182                5,028 (-50.62%)         3,100 (-69.55%)
Ratio of largest component   80.91%                90.47% (+9.56 pp)       93.63% (+12.72 pp)
Degree centralization        1.83E-03              6.98E-03 (+281.42%)     8.40E-03 (+359.02%)
In-degree centralization     N/A                   N/A                     N/A
Out-degree centralization    N/A                   N/A                     N/A
Eigenvector centralization   0.212                 0.195 (-8.02%)          0.187 (-11.79%)
Betweenness centralization   9.85E-03              2.26E-02 (+129.44%)     2.09E-02 (+112.18%)
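The density values can be cross-checked from the vertex and edge counts in the same table. The sketch below assumes the standard definitions, m / (n(n−1)) for a directed graph and 2m / (n(n−1)) for an undirected one; the slide does not state which formula was used, but these reproduce the reported figures to three significant digits.

```python
def density(n_nodes, n_edges, directed):
    """Share of possible edges that are present in a simple graph."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2  # each undirected pair is counted once
    return n_edges / possible

# Enron email networks are treated as directed, MEDLINE co-publishing as undirected.
enron_raw = density(81811, 332683, directed=True)
enron_disambiguated = density(20332, 212768, directed=True)
medline_algorithmic = density(258971, 1335366, directed=False)
medline_first_initial = density(182421, 1303957, directed=False)

print(f"{enron_raw:.2E} {enron_disambiguated:.2E}")
print(f"{medline_algorithmic:.2E} {medline_first_initial:.2E}")
```

Because density divides by roughly n², removing three quarters of the vertices raises it by about an order of magnitude even though the edge count drops.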


Results: Most powerful/ influential individuals

Degree centrality

Rank  Enron: Raw         Enron: Disambiguated  Enron: Scrubbed   MEDLINE: Algorithmic  MEDLINE: All initials  MEDLINE: First initial
1     [email protected]  Beck, Sally           Beck, Sally       Krause, W             Wang, Y                Wang, J
2     [email protected]  OUTLOOK TEAM          Lay, Kenneth      Fulop, L              Wang, J                Wang, Y
3     [email protected]  Forster, David        Forster, David    Nawa, H               Wang, X                Lee, J
4     [email protected]  Lay, Kenneth          Jones, Tana       Su, Y                 Chen, Y                Kim, J
5     [email protected]  TECHNOLOGY            Kaminski, Vince   Medarova, Z           Li, X                  Wang, X

Closeness centrality

Rank  Enron: Raw         Enron: Disambiguated  Enron: Scrubbed   MEDLINE: Algorithmic  MEDLINE: All initials  MEDLINE: First initial
1     [email protected]  Lay, Kenneth          Beck, Sally       Trojanowski, JQ       Wang, J                Wang, J
2     [email protected]  Beck, Sally           Lay, Kenneth      Kretzschmar, HA       Wang, Y                Wang, Y
3     [email protected]  OUTLOOK TEAM          Kitchen, Louise   Toga, AW              Wang, X                Wang, X
4     [email protected]  Kitchen, Louise       Kean, Steven      Thompson, PM          Li, X                  Lee, J
5     [email protected]  Lavorato, John        Lavorato, John    Barkhof, F            Zhang, J               Zhang, J

Betweenness centrality

Rank  Enron: Raw         Enron: Disambiguated  Enron: Scrubbed   MEDLINE: Algorithmic  MEDLINE: All initials  MEDLINE: First initial
1     [email protected]  Beck, Sally           Beck, Sally       Toga, AW              Wang, J                Wang, J
2     [email protected]  Kaminski, Vince       Lay, Kenneth      Kretzschmar, HA       Wang, Y                Lee, J
3     [email protected]  Lay, Kenneth          Kaminski, Vince   Thompson, PM          Wang, X                Wang, Y
4     [email protected]  Skilling, Jeffrey     Jones, Tana       Trojanowski, JQ       Li, J                  Wang, X
5     [email protected]  OUTLOOK TEAM          Hayslett, Rod     Barkhof, F            Lee, J                 Zhang, J

Eigenvector centrality

Rank  Enron: Raw         Enron: Disambiguated  Enron: Scrubbed   MEDLINE: Algorithmic  MEDLINE: All initials  MEDLINE: First initial
1     [email protected]  Kitchen, Louise       Kitchen, Louise   Futreal, PA           Wang, Y                Wang, Y
2     [email protected]  Beck, Sally           Beck, Sally       Stratton, MR          Liu, Y                 Wang, J
3     [email protected]  Haedicke, Mark        Haedicke, Mark    Edkins, S             Wang, J                Wang, X
4     [email protected]  Lavorato, John        Lavorato, John    Omeara, S             Wang, X                Liu, Y
5     [email protected]  Forster, David        Forster, David    Stevens, C            Li, X                  Zhang, J
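One way to see why common name keys can climb the MEDLINE rankings under initial-based disambiguation is a toy example. In the sketch below all names and edges are made up for illustration: two distinct authors collapse into the key "Wang, J", and the merged node overtakes the author who was most central before. Degree centrality is computed as degree / (n − 1) on an undirected graph.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree, deg / (n - 1), for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

def top(edges, k=1):
    """Names of the k most central nodes; ties are broken alphabetically."""
    scores = degree_centrality(edges)
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]

# Hypothetical, correctly resolved co-authorship edges.
resolved = [
    ("Wang, Jun", "A"), ("Wang, Jun", "B"),
    ("Wang, Jie", "C"), ("Wang, Jie", "D"),
    ("Lee, K", "A"), ("Lee, K", "B"), ("Lee, K", "C"),
]

# First-initial style key: both Wangs become "Wang, J".
merged = [("Wang, J" if a.startswith("Wang, J") else a, b) for a, b in resolved]

print(top(resolved), top(merged))
```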


Results: Differences in Topologies

[Figures: Enron (left): log‐log plot of node degree (in, out); MEDLINE (right): log‐log plot of node degree]

Email networks:
• Duplicates ‐> network seems bigger, less coherent, less integrated
• Overestimates need for interaction

Co‐publishing networks:
• Failing to split nodes ‐> scientific sector seems more dense, integrated, cohesive, and authors more productive, collaborative, diverse
• Underestimates need for (inter‐disciplinary) collaboration and support


Conclusions

• Majority of metrics heavily biased, topologies misidentified, key players more robust
• Big Data does not fix this issue
• Data preparation and analysis are loaded with decisions
  – Inherent in data collection, tools, algorithms, …
  – Decisions sometimes not considered or not made explicit
  – Poor awareness and understanding of their impact
• Data quality is a key ingredient for reliable results
• Silver lining/possible positive side: Closely interacting with data and forcing ourselves to understand them can help us move from precisely modeling and formally describing effects in society to also understanding and explaining them.


Acknowledgement

• Regulations: Supported by the Ford Foundation and the National Center for Supercomputing Applications (NCSA).

• Disambiguation: Supported by KISTI (Korea Institute of Science and Technology Information). The disambiguated MEDLINE dataset was provided by Vetle Torvik and Brent Fegley from iSchool/ UIUC. 

• Chie‐Li (Julian) Chin and Jinseok Kim from my lab 


References: Citations

• Acquisti, A., & Grossklags, J. (2005). Privacy and rationality in individual decision making. IEEE Security & Privacy, 3(1), 26‐33.
• Dittrich, D., & Kenneally, E. (2012). The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Tech. rep., U.S. Department of Homeland Security.
• Gilligan, C. (1987). Moral orientation and moral development.
• Graham, J., Nosek, B. A., Haidt, J., Iyer, R., Koleva, S., & Ditto, P. H. (2011). Mapping the moral domain. Journal of Personality and Social Psychology, 101(2), 366.
• Kohlberg, L. (1984). The psychology of moral development: The nature and validity of moral stages (Vol. 2). HarperCollins College Div.
• Kosinski, M., Matz, S. C., Gosling, S. D., Popov, V., & Stillwell, D. (2015). Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines. American Psychologist, 70(6), 543‐556.
• McDonald, A. M., & Cranor, L. F. (2008). The cost of reading privacy policies. ISJLP, 4, 543.
• Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2), 404‐409.
• Shweder, R. A., Much, N. C., Mahapatra, M., & Park, L. (1997). The “Big Three” of Morality (Autonomy, Community, Divinity) and the “Big Three” Explanations of Suffering. In A. M. Brandt & P. Rozin (Eds.), Morality and Health, 119‐172.
• Torvik, V. I., & Smalheiser, N. R. (2009). Author Name Disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1‐29.
• Vitak, J., Shilton, K., & Ashktorab, Z. (2016). Beyond the Belmont Principles: Ethical Challenges, Practices, and Beliefs in the Online Data Research Community. Paper presented at the 9th ACM Conference on Computer‐Supported Cooperative Work and Social Computing (CSCW 2016), San Francisco, CA.
• Zevenbergen, B., Mittelstadt, B., Véliz, C., Detweiler, C., Cath, C., Savulescu, J., & Whittaker, M. (2015). Philosophy meets Internet Engineering: Ethics in Networked Systems Research. GTC workshop outcomes paper: Oxford Internet Institute, University of Oxford.


References: Images

• World clock: https://en.wikipedia.org/wiki/File:Globe-with-clock.svg
• Free speech: http://www.gbcnv.edu/rights_responsibilities/free_speech.html
• Free beer: https://openclipart.org/detail/73603/beer
• Flowers: http://publicdomainpictures.net/view-image.php?image=119670&picture=&jazyk=pt


Publications on Regulatory Issues and Impact of Pre‐Processing on Network Analysis

• Diesner J, Chin C (2016) Seeing the forest for the trees: considering applicable types of regulation for the responsible collection and analysis of human centered data. Human‐Centered Data Science (HCDS) Workshop at 19th ACM Conference on Computer‐Supported Cooperative Work and Social Computing (CSCW 2016), San Francisco, CA.

• Diesner J, Chin C (2016) Gratis, Libre, or Something Else? Regulations and Misassumptions Related to Working with Publicly Available Text Data, ETHI‐CA² Workshop (ETHics In Corpus Collection, Annotation & Application), 10th Language Resources and Evaluation Conference (LREC), Portoroz, Slovenia. 

• Diesner J, Chin C (2015) Usable Ethics: Practical Considerations for Responsibly Conducting Research with Social Trace Data. Workshop: Beyond IRBs: Ethical Review Processes for Big Data Research, Future of Privacy Forum, Washington DC.

• Diesner J, Evans C, Kim J (2015) Impact of entity disambiguation errors on social network properties. International AAAI Conference on Web and Social Media (ICWSM), Oxford, UK

• Kim J, Diesner J (accepted) Less than expected: Over‐time measurement of triadic closure in scientific collaboration networks. Journal for Social Network Analysis and Mining (SNAM).

• Kim J, Diesner J (2015) The Effects of Data Pre‐Processing on Understanding the Evolution of Collaboration Networks. Journal of Informetrics, 9(1), 226‐236.


Thank you!

• Questions, comments, feedback, follow‐up:
  Jana Diesner
  Email: [email protected]
  http://jdiesnerlab.ischool.illinois.edu/