hearts and minds of data sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific...

44
The Hearts and Minds of Data Science Cecilia R. Aragon Dept. of Human Centered Design & Engineering eScience Institute University of Washington [email protected] Image by Bill Howe

Upload: others

Post on 19-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

The Hearts and Minds of Data Science

Cecilia R. Aragon Dept. of Human Centered Design & Engineering

eScience Institute University of Washington

[email protected]

Image by Bill Howe

Page 2: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

What is data science?

2

Impact

Page 3: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

From data to knowledge to action

• The ability to extract knowledge from large, heterogeneous, noisy datasets – to move “from data to knowledge to action” – lies at the heart of 21st century discovery

• To remain at the forefront, researchers in all fields will need access to state-of-the-art data science methodologies and tools

• Data science is driven more by intellectual infrastructure (human skills) and software infrastructure (shared tools and services – digital capital) than by hardware

• These methodologies and tools will need to be designed to enable efficient discovery by the scientists who use them

3

Impact

Adapted from: Ed Lazowska et al., eScience Institute, UW

Page 4: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

4

Moore/Sloan Data Science Environment: $37.8M Initiative at UCB, NYU, UW

Impact

Graphic by Ray Hong and eScience Institute, UW

Page 5: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

DSE Goal

• Change the culture of universities to create a data science culture

• “we're a startup in the Eric Ries sense, a ‘human institution designed to deliver a new product or service under conditions of extreme uncertainty.’ Our outcomes are new forms of doing science in academic settings, and we're going against the grain of large institutional priors.” – Fernando Pérez

Fernando Pérez http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html

Page 6: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Six major thrusts

• Education and training • Software tools and environments • Career paths and alternative metrics • Reproducibility and open science • Working spaces and culture • Ethnography and evaluation

Page 7: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

What is ethnography and why use it?

• Qualitative field-based method of inquiry • Can focus on the dynamics and uses of data

and computation in context • Ecological validity • The determinants of culture are often invisible

or unspoken • Understanding why and not just what

Page 8: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Human centered data science

Page 9: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Two human issues

1. How do we build tools that facilitate data-driven science today?

2. How do we ensure that we have enough people with data science skills working in academia to produce the next generation of scientific discoveries?

Page 10: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Two human issues

1. How do we build tools that facilitate data-driven science today?

2. How do we ensure that we have enough people with data science skills working in academia to produce the next generation of scientific discoveries?

Page 11: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Why is this a human issue?

Page 12: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

What is the rate-limiting step in data understanding?

Time

Am

ount

of d

ata

in th

e w

orld

Amount of data in the world

(1 EB = 1,000,000,000,000,000,000 B = 1018 bytes = one quintillion bytes = 100,000 Libraries of Congress)

800 EB

2009 1986

2.6 EB

The World’s Technological Capacity to Store, Communicate, and Compute Information, M. Hilbert and P. López (2011), Science, 332(6025), 60-65.

Page 13: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Time

Amou

nt o

f dat

a in

the

wor

ld

Time

Proc

essi

ng p

ower

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Amount of data in the world

Page 14: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Time

Amou

nt o

f dat

a in

the

wor

ld

Time

Proc

essi

ng p

ower

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Effective Processing Power: Amdahl’s Law

Amount of data in the world

Page 15: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Proc

essin

g pow

er

Time

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Human cognitive capacity

Idea adapted from “Less is More” by Bill Buxton (2001)

Amount of data in the world

Effective Processing Power: Amdahl’s Law

Page 16: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Proc

essin

g pow

er

Time

Processing power: Moore’s Law

Human cognitive capacity

Amount of data in the world

Bridging the Data Science Gaps

• Human centered data science – collaborative software, effective user interfaces – visualization – sociology, psychology,

study of socio-technical systems

• Hardware • Software • Parallel

computing • Machine

learning

Page 17: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

http://sciencewatch.com/articles/multiauthor-papers-onward-and-upward (C. King, Jul 2012)

Collaboration Growth: “Hyperauthorship”

Physics (3,221)

Medicine (2,459)

Page 18: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Number of multi-author papers, 1998-2011

http://sciencewatch.com/articles/multiauthor-papers-onward-and-upward (C. King, Jul 2012)

Page 19: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Collaboration is essential

• To understand vast and complex data sets • To combine different skills • To build interdisciplinary teams to solve the

most challenging scientific, social, political, and human problems

Page 20: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

But it’s not so easy to build collaboration

• Many barriers are social, not just technical – Cummings and Kiesler study (2007) of 491 scientific collaborations,

“Coordination costs and project outcomes in multi-university collaborations.” Research Policy, 36(10), 138-152.

– Charlotte Lee, “Barriers to Adoption of Collaboration Technologies,” presented at CHI 09 workshop

• too little is known about dynamics of complex work teams • collaboration across disciplines is difficult (different languages, methods) • distributed work is difficult (different organizational structures and processes) • need to study how to foster productive collaborations • the human infrastructure of cyberinfrastructure

Page 21: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Data science involves solving both technical and social issues

Page 22: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Astrophysics example

1. New machine learning approach to transient search

2. Image processing for supernova detection 3. Collaboration building 4. Cross-cultural issues

Page 23: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Impact on Supernova Search

Page 24: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Imbalanced Data & Class Uncertainty • Ratio of positives to negatives less than 1/10,000 • Many negatives in region of overlap • Potentially mislabeled examples

Page 25: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Fast Algorithm

• Find good supernova candidates – R1 measures ||F-1|| vs. ||F1|| (elliptical vs. circular) – R2 measures power in higher terms vs. ||F1|| and ||F-1||

• Advantages of these R1 and R2: – Scale invariant (due to normalization by F1) – Rotation invariant (only magnitudes used, no phase) – Fast! Only two transform terms, and they are complex conjugates (F1, F-1), so

intermediate results are shared

• Performance depends on how fast we can obtain (for each point zi = xi +iyi on the contour) the primitives used to form F1 and F-1:

• In the end:

– 41% reduction in false positives, no measurable increase in running time

)cos(),sin(

),sin(),cos(22

22

Ni

iNi

i

Ni

iNi

i

yx

yxππ

ππ

"A Fast Contour Descriptor Algorithm for Supernova Image Classification," C. Aragon and D. Aragon, SPIE Symposium on Electronic Imaging: Real-Time Image Processing, San Jose, CA (2007).

Page 26: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Lightweight, context-linked tools for collaborative work

• “Context-Linked Virtual Assistants in Distributed Teams: An Astrophysics Case Study," S. Poon, R. Thomas, C. Aragon, B. Lee, CSCW 2008, San Diego, CA (2008). Best Paper Award Nominee.

Page 27: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Measurable improvements from human centered approach

• Measurable improvements to transient search and data analysis in many areas. – Improved image processing algorithms:

• Fourier contour analysis (41% reduction in false positives)

– New machine learning algorithms for astronomical transient search:

• support vector machines • boosted decision trees (90% reduction in false positives)

– Improved control interfaces for scanning and vetting • approx. 70% speedup per image

• Search + Scanning: approx. 90% labor savings • from 6-8 people working ~4 hrs/day to 1 person

working 1 hr/day

Page 28: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

The sociotechnical ecosystem of science

Page 29: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Tools that fit within the sociotechnical ecosystem of science

• IPython - Fernando Pérez – base for interactive scientific environment – enables

exploratory computing – browser-based notebooks (rich text, code, plots)

Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org

Page 30: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Tools that fit within the sociotechnical ecosystem of science

• mpld3, a toolkit for visualizing matplotlib graphics in-browser via d3 – Jake Vanderplas

http://jakevdp.github.io/blog/2014/01/10/d3-plugins-truly-interactive/

Page 31: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Research with Social Media Text Data • Ranges along quantitative – qualitative spectrum • Quantitative

– Good: Large quantities of data, efficient – Bad: Relatively shallow, superficial

• Qualitative – Good: deep, explanatory conclusions – Bad: Small, focused samples, inefficient

• How to get best of both worlds? – Visual analytics, combining machine learning with

visualization and qualitative research

Page 32: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Integrate machine learning and interactive visualization into a qualitative research workflow

• Conclusions based on large data sets

• Maintain context, nuance, depth

• Select tools based on transparency and understandability, not just efficiency

Page 33: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Human issues in data science

• Ethics: Lilly Irani and Six Silberman, UCSD, Turkopticon and crowdsourcing Irani, Lilly C., and M. Silberman. Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk. Proc.of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2013, 611–620.

• Design: What happens when you crowd-source design? Daniela Rosner, UW (systems that guarantee banal results?)

Rosner, Daniela, and Jonathan Bean. "Big data, diminished design." interactions 21, no. 3. In press.

• Data expectations: What does data really mean across different fields? Gina Neff, B. Fiore-Silfvast

Brittany Fiore-Silfvast and Gina Neff, UW Dept of Communication, “Communication, Mediation, and the Expectations of Data: Data Valences across Health and Wellness Communities.” 2014. In press.

Page 34: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Human centered data science

Page 35: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Two human issues

1. How do we build tools that facilitate data-driven science today?

2. How do we ensure that we have enough people with data science skills working in academia to produce the next generation of scientific discoveries?

Page 36: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Interdisciplinary Data Science Skills

• Programming Foundations and Data Abstractions • Software Engineering and Systems • Databases and Data Management • User-Centered Software Design; Design Foundations for

Interactive Systems • High-Performance Scientific Computing • Scalable and Data-Intensive Computing in the Cloud • Data Mining / Machine Learning / Statistics • Data and Information Visualization

eScience curriculum, http://escience.washington.edu/ education/proposed-escience-curriculum

Page 37: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Human issues of data science in academia – Fernando Perez

• Academia favors individualism, hyper-specialization and novelty

• Collaborations and interdisciplinary work are less rewarded • Writing software to enable new science is not rewarded

(it’s time not spent writing papers) • Scientists have been trained to think of computation as not

“real science” • Methodologists (e.g. computer scientists, statisticians) are

not rewarded for creating applications • The skills of data science are in tremendous demand in

industry today = “big data brain drain” (see Jake Vanderplas blog)

http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html

Page 38: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

The Big Data Brain Drain

• “the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry.”

• “open, well-documented, and well-tested scientific code is essential not only to reproducibility in modern scientific research, but to the very progression of research itself.”

–Jake Vanderplas

http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/

Page 39: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Human issues of data science in academia

• So what are we going to do about this? • We know that data science skills are necessary

for scientific discovery • Who is going to teach these skills? • Academic scientific disciplines have full

schedules, tend not to teach computation and data skills or only via a course or two.

• And when people acquire data science skills, there is no job path in academia

Page 40: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Data science as an academic department

• Analogy: how the discipline of computer science was created

• No undergraduate CS degree at Caltech in the 1980s

Page 41: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

The emerging discipline of computer science

• “Only failed mathematicians go into computer science.” – Unnamed mathematician, Caltech, 1980s

• “I wrote a program once. It wasn’t that hard.” – Unnamed physicist, Berkeley, 1990s

• “But… this programming… what if it’s all just a fad?” – Unnamed physics professor, U Penn, 2000s

Page 42: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Interdisciplinary departments

• Another exemplar: my own interdisciplinary department – Human Centered Design and Engineering – Tenure case does not involve only publishing in

the “right” journals – Part of the case is that you present evidence of

your impact (conferences, journals, altmetrics) – New department = unranked – But: fastest-growing department in the College of

Engineering at UW

Page 43: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

The hearts and minds of data science

Page 44: Hearts and Minds of Data Sciencemmds-data.org/presentations/2014/aragon_mmds14.pdf · scientific discoveries? Two human issues . 1. How do we build tools that facilitate data-driven

Questions?

Cecilia R. Aragon Human Centered Design & Engineering

University of Washington [email protected]