hearts and minds of data science - human-centered design … · how do we build tools that...

56
The Hearts and Minds of Data Science Cecilia R. Aragon Dept. of Human Centered Design & Engineering eScience Institute University of Washington [email protected] Image by Bill Howe

Upload: others

Post on 19-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

The Hearts and Minds of Data Science

Cecilia R. Aragon Dept. of Human Centered Design & Engineering

eScience Institute University of Washington

[email protected]

Image by Bill Howe

Page 2: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

What is data science?

2

Impact

Page 3: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Exponential improvements in technology and algorithms are enabling a revolution in discovery

• A proliferation of sensors • Ever more powerful models producing data that must be analyzed • The creation of almost all information in digital form • Dramatic cost reductions in storage • Dramatic increases in network bandwidth • Dramatic cost reductions and scalability improvements in

computation • Dramatic algorithmic breakthroughs in areas such as machine

learning

3

Impact

Adapted from: Ed Lazowska et al., eScience Institute, UW

Page 4: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Nearly every field of discovery is transitioning from “data poor” to “data rich”

Astronomy: LSST Physics: LHC

Oceanography: OOI

Sociology: The Web

Biology: Sequencing Economics: POS

terminals

Neuroscience: EEG, fMRI

4

Impact

Ed Lazowska et al., eScience Institute, UW

Page 5: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

The Fourth Paradigm

5

1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive

Impact

Adapted from: Ed Lazowska et al, eScience Institute, UW

Page 6: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

From data to knowledge to action

• The ability to extract knowledge from large, heterogeneous, noisy datasets – to move “from data to knowledge to action” – lies at the heart of 21st century discovery

• To remain at the forefront, researchers in all fields will need access to state-of-the-art data science methodologies and tools

• Data science is driven more by intellectual infrastructure (human skills) and software infrastructure (shared tools and services – digital capital) than by hardware

• These methodologies and tools will need to be designed to enable efficient discovery by the scientists who use them

6

Impact

Adapted from: Ed Lazowska et al., eScience Institute, UW

Page 7: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Interdisciplinary Data Science Skills

• Programming Foundations and Data Abstractions • Software Engineering and Systems • Databases and Data Management • User-Centered Software Design; Design Foundations for

Interactive Systems • High-Performance Scientific Computing • Scalable and Data-Intensive Computing in the Cloud • Data Mining / Machine Learning / Statistics • Data and Information Visualization

UW eScience curriculum, http://escience.washington.edu/ education/proposed-escience-curriculum

Page 8: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

8

Moore/Sloan Data Science Environment: $37.8M Initiative at UCB, NYU, UW

Impact

Graphic by Ray Hong and eScience Institute, UW

Page 9: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

DSE Goal

• Change the culture of universities to create a data science culture

• “we're a startup in the Eric Ries sense, a ‘human institution designed to deliver a new product or service under conditions of extreme uncertainty.’ Our outcomes are new forms of doing science in academic settings, and we're going against the grain of large institutional priors.” – Fernando Pérez

Fernando Pérez http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html

Page 10: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Six major thrusts

• Education and training • Software tools and environments • Career paths and alternative metrics • Reproducibility and open science • Working spaces and culture • Ethnography and evaluation

Page 11: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

What is ethnography and why use it?

• Qualitative field-based method of inquiry • Can focus on the dynamics and uses of data

and computation in context • Ecological validity • The determinants of culture are often invisible

or unspoken • Understanding why and not just what

Page 12: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Human centered data science

Page 13: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Two human issues

1. How do we build tools that facilitate data-driven science today?

2. How do we ensure that we have enough people with data science skills working in academia to produce the next generation of scientific discoveries?

Page 14: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Two human issues

1. How do we build tools that facilitate data-driven science today?

2. How do we ensure that we have enough people with data science skills working in academia to produce the next generation of scientific discoveries?

Page 15: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Why is this a human issue?

Page 16: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

What is the rate-limiting step in data understanding?

Time

Am

ount

of d

ata

in th

e w

orld

Amount of data in the world

800 EB

2009 1986

2.6 EB

The World’s Technological Capacity to Store, Communicate, and Compute Information, M. Hilbert and P. López (2011), Science, 332(6025), 60-65.

(1 EB = 1,000,000,000,000,000,000 B = 1018 bytes = one quintillion bytes = 100,000 Libraries of Congress)

Page 17: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Time

Amou

nt o

f dat

a in

the

wor

ld

Time

Proc

essi

ng p

ower

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Amount of data in the world

Page 18: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Time

Amou

nt o

f dat

a in

the

wor

ld

Time

Proc

essi

ng p

ower

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Effective Processing Power: Amdahl’s Law

Amount of data in the world

Page 19: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Proc

essin

g pow

er

Time

What is the rate-limiting step in data understanding?

Processing power: Moore’s Law

Human cognitive capacity

Idea adapted from “Less is More” by Bill Buxton (2001)

Amount of data in the world

Effective Processing Power: Amdahl’s Law

Page 20: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Proc

essin

g pow

er

Time

Processing power: Moore’s Law

Human cognitive capacity

Amount of data in the world

Bridging the Data Science Gaps

• Human centered data science – collaborative software, effective user interfaces – visualization – sociology, psychology,

study of socio-technical systems

• Hardware • Software • Parallel

computing • Machine

learning

Page 21: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

http://sciencewatch.com/articles/multiauthor-papers-onward-and-upward (C. King, Jul 2012)

Collaboration Growth: “Hyperauthorship”

Physics (3,221)

Medicine (2,459)

Page 22: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Number of multi-author papers, 1998-2011

http://sciencewatch.com/articles/multiauthor-papers-onward-and-upward (C. King, Jul 2012)

Page 23: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Collaboration is essential

• To understand vast and complex data sets • To combine different skills • To build interdisciplinary teams to solve the

most challenging scientific, social, political, and human problems

Page 24: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

But it’s not so easy to build collaboration

• Many barriers are social, not just technical – Cummings and Kiesler study (2007) of 491 scientific collaborations,

“Coordination costs and project outcomes in multi-university collaborations.” Research Policy, 36(10), 138-152.

– Charlotte Lee, “Barriers to Adoption of Collaboration Technologies,” presented at CHI 09 workshop

• too little is known about dynamics of complex work teams • collaboration across disciplines is difficult (different languages, methods) • distributed work is difficult (different organizational structures and processes) • need to study how to foster productive collaborations • the human infrastructure of cyberinfrastructure

Page 25: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Data science involves solving both technical and social issues

Page 26: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Astrophysics example

1. New machine learning approach to transient search

2. Image processing for supernova detection 3. Collaboration building 4. Cross-cultural issues

Page 27: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Machine learning

"Supernova Recognition using Support Vector Machines," R. Romano, C. Aragon, and C. Ding, Proceedings of the 5th International Conference of Machine Learning Applications, Orlando, FL (2006). Winner of Best Application Paper Award.

Page 28: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Impact on Supernova Search

Page 29: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Imbalanced Data & Class Uncertainty • Ratio of positives to negatives less than 1/10,000 • Many negatives in region of overlap • Potentially mislabeled examples

Page 30: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Image processing

"A Fast Contour Descriptor Algorithm for Supernova Image Classification," C. Aragon and D. Aragon, SPIE Symposium on Electronic Imaging: Real-Time Image Processing, San Jose, CA (2007).

Page 31: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Shape Detection Scores • Many “bad subtractions” have shapes which distinguish them

from good supernova candidates. • Add “roundness” scores using Fourier contour descriptor

algorithm

Page 32: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Two New Features Added • R1 – object contour eccentricity; ratio of powers in lowest order

negative and positive Fourier contour descriptors

• R2 – object contour irregularity; power in higher order Fourier contour descriptors divided by total power

>=

0

21

2

2R

kk

kk

F

F

21

211R

FF−=

"A Fast Contour Descriptor Algorithm for Supernova Image Classification," C. Aragon and D. Aragon, SPIE Symposium on Electronic Imaging: Real-Time Image Processing, San Jose, CA (2007).

Page 33: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Fast Algorithm

• Find good supernova candidates – R1 measures ||F-1|| vs. ||F1|| (elliptical vs. circular) – R2 measures power in higher terms vs. ||F1|| and ||F-1||

• Advantages of these R1 and R2: – Scale invariant (due to normalization by F1) – Rotation invariant (only magnitudes used, no phase) – Fast! Only two transform terms, and they are complex conjugates (F1, F-1), so

intermediate results are shared

• Performance depends on how fast we can obtain (for each point zi = xi +iyi on the contour) the primitives used to form F1 and F-1:

• In the end:

– 41% reduction in false positives, no measurable increase in running time

)cos(),sin(

),sin(),cos(22

22

Ni

iNi

i

Ni

iNi

i

yx

yxππ

ππ

"A Fast Contour Descriptor Algorithm for Supernova Image Classification," C. Aragon and D. Aragon, SPIE Symposium on Electronic Imaging: Real-Time Image Processing, San Jose, CA (2007).

Page 34: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Collaboration building

“Context-Linked Virtual Assistants in Distributed Teams: An Astrophysics Case Study," S. Poon, R. Thomas, C. Aragon, B. Lee, CSCW 2008, San Diego, CA (2008). Best Paper Award Nominee.

Page 35: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Lightweight, context-linked tools for collaborative work

• “Context-Linked Virtual Assistants in Distributed Teams: An Astrophysics Case Study," S. Poon, R. Thomas, C. Aragon, B. Lee, CSCW 2008, San Diego, CA (2008). Best Paper Award Nominee.

Page 36: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

36

Discussion and Benefits

• Context-linked information aids in communication and collaborative work – Provides continuous environmental data – Informs awareness of time, schedule – Crucial in coordinating time-critical tasks – Reduction of coordination efforts to establish

common ground (less time spent figuring out state of tasks, beneficial when working under time pressure)

• Lightweight, easy-to-use – Did not require elaborate setup

“Context-Linked Virtual Assistants in Distributed Teams: An Astrophysics Case Study," S. Poon, R. Thomas, C. Aragon, B. Lee, CSCW 2008, San Diego, CA (2008). Best Paper Award Nominee.

Page 37: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Cross-cultural issues

Page 38: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Improved cross-cultural communication via context-linked software tools

"No Sense of Distance: Improving Cross-Cultural Communication with Context-Linked Software Tools," Cecilia R. Aragon and Sarah S. Poon, iConference (2011).

Page 39: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Measurable improvements from human centered approach

• Measurable improvements to transient search and data analysis in many areas. – Improved image processing algorithms:

• Fourier contour analysis (41% reduction in false positives)

– New machine learning algorithms for astronomical transient search:

• support vector machines • boosted decision trees (90% reduction in false positives)

– Improved control interfaces for scanning and vetting • approx. 70% speedup per image

• Search + Scanning: approx. 90% labor savings • from 6-8 people working ~4 hrs/day to 1 person

working 1 hr/day

Page 40: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

The sociotechnical ecosystem of science

Page 41: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Tools that fit within the sociotechnical ecosystem of science

• IPython - Fernando Pérez – base for interactive scientific environment – enables

exploratory computing – browser-based notebooks (rich text, code, plots)

Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org

Page 42: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Tools that fit within the sociotechnical ecosystem of science

• mpld3, a toolkit for visualizing matplotlib graphics in-browser via d3 – Jake Vanderplas

http://jakevdp.github.io/blog/2014/01/10/d3-plugins-truly-interactive/

Page 43: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Research with Social Media Text Data • Ranges along quantitative – qualitative spectrum • Quantitative

– Good: Large quantities of data, efficient – Bad: Relatively shallow, superficial

• Qualitative – Good: deep, explanatory conclusions – Bad: Small, focused samples, inefficient

• How to get best of both worlds? – Visual analytics, combining machine learning with

visualization and qualitative research

"Statistical Affect Detection in Collaborative Chat," Michael Brooks, Katie Kuksenok, Megan Torkildson, Daniel Perry, John Robinson, Paul Harris, Ona Anicello, Taylor Scott, Ariana Zukowski, Cecilia Aragon. CSCW '13 (2013)

Page 44: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Integrate machine learning and interactive visualization into a qualitative research workflow

• Conclusions based on large data sets

• Maintain context, nuance, depth

• Select tools based on transparency and understandability, not just efficiency

"Statistical Affect Detection in Collaborative Chat," Michael Brooks, Katie Kuksenok, Megan Torkildson, Daniel Perry, John Robinson, Paul Harris, Ona Anicello, Taylor Scott, Ariana Zukowski, Cecilia Aragon. CSCW '13 (2013)

Page 45: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Human issues in data science

• Ethics: Lilly Irani and Six Silberman, UCSD, Turkopticon and crowdsourcing Irani, Lilly C., and M. Silberman. Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk. Proc.of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2013, 611–620.

• Design: What happens when you crowd-source design? Daniela Rosner, UW (systems that guarantee banal results?)

Rosner, Daniela, and Jonathan Bean. "Big data, diminished design." interactions 21, no. 3. In press.

• Data expectations: What does data really mean across different fields? Gina Neff, B. Fiore-Silfvast

Brittany Fiore-Silfvast and Gina Neff, UW Dept of Communication, “Communication, Mediation, and the Expectations of Data: Data Valences across Health and Wellness Communities.” 2014. In press.

Page 46: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Human centered data science

Page 47: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Two human issues

1. How do we build tools that facilitate data-driven science today?

2. How do we ensure that we have enough people with data science skills working in academia to produce the next generation of scientific discoveries?

Page 48: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Interdisciplinary Data Science Skills

• Programming Foundations and Data Abstractions • Software Engineering and Systems • Databases and Data Management • User-Centered Software Design; Design Foundations for

Interactive Systems • High-Performance Scientific Computing • Scalable and Data-Intensive Computing in the Cloud • Data Mining / Machine Learning / Statistics • Data and Information Visualization

eScience curriculum, http://escience.washington.edu/ education/proposed-escience-curriculum

Page 49: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Human issues of data science in academia – Fernando Perez

• Academia favors individualism, hyper-specialization and novelty

• Collaborations and interdisciplinary work are less rewarded • Writing software to enable new science is not rewarded

(it’s time not spent writing papers) • Scientists have been trained to think of computation as not

“real science” • Methodologists (e.g. computer scientists, statisticians) are

not rewarded for creating applications • The skills of data science are in tremendous demand in

industry today = “big data brain drain” (see Jake Vanderplas blog)

http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html

Page 50: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

The Big Data Brain Drain

• “the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry.”

• “open, well-documented, and well-tested scientific code is essential not only to reproducibility in modern scientific research, but to the very progression of research itself.”

–Jake Vanderplas

http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/

Page 51: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Human issues of data science in academia

• So what are we going to do about this? • We know that data science skills are necessary

for scientific discovery • Who is going to teach these skills? • Academic scientific disciplines have full

schedules, tend not to teach computation and data skills or only via a course or two.

• And when people acquire data science skills, there is no job path in academia

Page 52: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Data science as an academic department

• Analogy: how the discipline of computer science was created

• No undergraduate CS degree at Caltech in the 1980s

Page 53: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

The emerging discipline of computer science

• “Only failed mathematicians go into computer science.” – Unnamed mathematician, Caltech, 1980s

• “I wrote a program once. It wasn’t that hard.” – Unnamed physicist, Berkeley, 1990s

• “But… this programming… what if it’s all just a fad?” – Unnamed physics professor, U Penn, 2000s

Page 54: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Interdisciplinary departments

• Another exemplar: our own interdisciplinary department – Human Centered Design and Engineering – Tenure case does not involve only publishing in

the “right” journals – Part of the case is that you present evidence of

your impact (conferences, journals, altmetrics) – New department – But: fastest-growing department in the College of

Engineering at UW

Page 55: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

The hearts and minds of data science

Page 56: Hearts and Minds of Data Science - Human-centered design … · How do we build tools that facilitate data-driven science today? 2. ... – base for interactive scientific environment

Questions?

Cecilia R. Aragon Human Centered Design & Engineering

University of Washington [email protected]