eresearch new zealand keynote

Post on 10-May-2015

342 Views

Category:

Technology

5 Downloads

Preview:

Click to see full reader

DESCRIPTION

Bill Howe gave a keynote at eResearch New Zealand in early July 2013

TRANSCRIPT

04/11/2023 1

Bill Howe, PhDDirector of Research,

Scalable Data AnalyticsUniversity of Washington

eScience Institute

eScience and Data Science at the UW eScience Institute

Bill Howe, UW

2

“It’s a great time to be a data geek.”-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”

-- Jeff Hammerbacher, co-founder, Cloudera

Jim Gray

04/11/2023 4

The University of Washington eScience Institute

• Rationale– The exponential increase in physical and virtual sensing tech is

transitioning all fields of science and engineering from data-poor to data-rich

– Techniques and technologies include• Sensors and sensor networks, data management, data mining,

machine learning, visualization, cluster/cloud computing • Mission

– Help position the University of Washington and partners at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them.

• Strategy– Bootstrap a cadre of Research Scientists– Add faculty in key fields– Build out a “consultancy” of students and non-research staff

Bill Howe, UW

04/11/2023 5Bill Howe, UW

# o

f b

yte

s

# of data sources

telescopes

spectra

LSST (~100PB; images, spectra)

PanSTARRS (~40PB; images, trajectories)

OOI (~50TB/year; sims, RSN)IOOS (~50TB/year; sims, satellite, gliders,

AUVs, vessels, more)CMOP (~10TB/year; sims, stations, gliders,

AUVs, vessels, more)

SDSS (~400TB; images, spectra, catalogs)

n-body sims

models

AUVs

stations

cruises, CTDsflow cytometry

gliders

ADCPsatellites

Astronomy

Ocean Sciences

3 V’s of Big Data

Volume

Variety

Velocity

PDB

GenBank

UniProt

Pfam

Spreadsheets, NotebooksLocal, Lost

High throughput experimental methodsIndustrial scaleCommons based productionPublicly data setsCherry picked resultsPreserved

CATH, SCOP(Protein Structure Classification)

ChemSpider

Long Tail of Research Data

[src: Carol Goble]

Types of Data Stored

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

5.20%

11.70%

15.20%

20.50%

22.80%

48.00%

61.20%

71.60%Quantitative/tabular/statistical

Text (literature, transcriptions, field notes)

Images (photographs, maps)

Video recordings

Audio recordings

Multimedia digital objects

Geo-tagged objects/ spatial data

Other

Lewis et al 2008

src: Conversations with Research Leaders (2008)

Where do you store your data?

src: Conversations with Research Leaders (2008)

src: Faculty Technology Survey (2011)

Other

External (non-UW) data center

Department-managed server

My computer

0% 20% 40% 60% 80% 100%

5%6%

12%27%

41%66%

87%

Lewis et al 2011

How much data do you work with?

Wright 2013

The Long Tail is worthy of investment

Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6)

log(NSERC grants) ($) log(NSERC + CIHR rants) ($)

The Long Tail is worthy of investment

Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6)

“In sum, greater productivity is not strongly related to greater funding.”

“…the best article of one rich researcher received, on average, 14% fewer citations than the best article from any random pair of researchers, each of whom received only half as much funding!”

“if maximizing the total impact of the entire pool of grantees is the goal, then the “few big” strategy would be less effective than the “many small” strategy.”

04/11/2023 12Bill Howe, UW

“Ensuring a successful future for the biological sciences will require restraint in the growth of large centers and -omics-like projects, so as to provide more financial support for the critical work of innovative small laboratories striving to understand the wonderful complexity of living systems.”

Alberts B (2012) The End of “Small Science”? Science 337: 1583

The long tail is worthy of investment

04/11/2023 13

04/11/2023 14Bill Howe, UW

100x more impact

How much more productive is a great programmer than a mediocre programmer?

04/11/2023 16

Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”

Bill Howe, UW

1704/11/2023 Bill Howe, UW

Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

COGAnnotation_coastal_sample.txt

SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit

04/11/2023 18Bill Howe, UW

“The future is already here; it’s just not very evenly distributed.”

-- William Gibson

04/11/2023 19

Three Avenues of Attack

• Technological• Educational• Organizational

Bill Howe, UW

SQLSHARE: QUERY-AS-A-SERVICETechnology for the Long Tail

Alex Szalay Jim Gray

How can we deliver 1000 little SDSSs?

1) Upload data “as is”Cloud-hosted; no need to install or design a database; no pre-defined schema

2) Analyze data with SQLRight in your browser, writing queries on top of queries on top of queries ...

SELECT hit, COUNT(*)

FROM tigrfam_surface

GROUP BY hit

ORDER BY cnt DESC

3) Share the results Queries on queries on queries…

04/11/2023 23

Browse English descriptions

04/11/2023 24

Save the results, share them with others

Edit a Query

VizDeck

Python Client

“Flagship” SQLShare App (Python) on EC2

SQLShare REST API

Excel Addin

R Client

Spreadsheet Crawler

ASP.NET

OAuth2

WCF

04/11/2023 26

METADATAAside

Bill Howe, UW

04/11/2023 27

What about metadata?

• Claim: Comprehensive metadata standards* represent a shared consensus about the world

• At the frontier of research, this shared consensus does not exist, by definition

• Any consensus that does emerge will change frequently, by definition

• Data found “in the wild” will typically not conform to any standard, by definition

So are we stuffed?

* ontology / schema / controlled vocabulary / etc.

Bill Howe, UW

04/11/2023 28Bill Howe, UW

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

Maslow’s Needs Hierarchy

04/11/2023 29

A “Needs Hierarchy” of Science Data Management

storage

sharing

Bill Howe, UW

query

curation

analytics

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

04/11/2023 30

A “Needs Hierarchy” of Science Data Management

storage

sharing

Bill Howe, UW

curation

query

analytics

“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”

-- Maslow 43

04/11/2023 31

Views

• A view is just a query with a name• We can use the view just like a real

table

Bill Howe, UW

Why can we do this?

Because we know that every query returns a relation:

We say that the language is “algebraically closed”

04/11/2023 32

Scientific data management reduces to sharing views• Integrate data from multiple sources?

– joins and unions with views

• Standardize on units, apply naming conventions?– rename columns, apply functions with views

• Attach metadata?– add new tables with descriptive names, add new columns

with views

• Data cleaning, quality control?– hide bad values with views

• Maintain provenance?– inspect view dependencies

• Propagate updates? – view maintenance

• Protect sensitive data?– expose subsets with views (assuming views carry

permissions) Bill Howe, UW

Pay-as-you-go metadata

Deeply nested hierarchies of viewsProvenance

Controlled SharingImplicit re-execution

Bring the computation to the data

• Rich query services, not data cemetaries

• Avoid “Transloading:” Pointless data movement from one environment to another– Compute Vis– Server Client– Cloud Cloud

• “Share the soup” and curate incrementally as a side effect of usage

• Pay-as-you-go metadata

04/11/2023 35

USE CASES

Bill Howe, UW

Laser

Microscope Objective

Pine Hole Lens

Nozzle d1

d2

FSC (Forward scatter)

Orange fluo

Red fluo

SeaFlowFrancois Ribalet

Jarred Swalwell

Ginger Armbrust

SeaFlowd1

/ F

SC

d2 / FSC

RE

D f

luor

esce

nce

FSC

Picoplankton

Nanoplankton

IS

Ultraplankton

Prochlorococcus

Continuous observations of various phytoplankton groups from 1-20 mm in size

Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton Based on ORANGE fluo: Synechococcus, Cryptophytes Based on FSC: Coccolithophores

Francois Ribalet

Jarred Swalwell

Ginger Armbrust

SeaFlowFrancois Ribalet

Jarred Swalwell

Ginger Armbrust

04/11/2023 39

Script-oriented computing was killing them• Scripts (typically in R) must be pre-shared with all

collaborators• When the data changes, everybody has to re-run all the

scripts• When the scripts change, everybody has to re-run all the

scripts. • Implicit assumption that all data fits in main memory• Terrible provenance, terrible reproducibility • Pipeline of scripts dependent on intricate file formats and

file naming schemes

Bill Howe, UW

Howe, et al., CISE 2012

04/11/2023 41

Workflow Systems

• Scripts (typically in R) must be pre-shared with all collaborators

• When the data changes, everybody has to re-run all the scripts

• When the scripts change, everybody has to re-run all the scripts.

• Implicit assumption that data fits in main memory

• No provenance• Pipeline of scripts dependent on intricate file

formats and/or file naming schemes

Bill Howe, UW

Steven Roberts

SQL as a lab notebook:http://bit.ly/16Xj2JP

Popular service for Bioinformatics Workflows

Halperin, Howe, et al. SSDBM 2013

04/11/2023 44

Robin Kodner

Bill Howe, UW

“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”

Data management and statistics for biologists

04/11/2023 45Bill Howe, UW

“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces.

Previously, we were using huge directory trees and plain text files.

Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”-- Andrew D White

Andrew White, UW Chemistry

Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap

FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC

Non-programmers can write very complex queries (rather than relying on staff programmers)

Example: Computing the overlaps of two sets of blast results

We see thousands of queries written by non-programmers

04/11/2023 47

DATA SCIENCE

Bill Howe, UW

Education

04/11/2023 48Bill Howe, UW

04/11/2023 49

Drew Conway’s Data Science Venn Diagram

Bill Howe, UW

04/11/2023 50Bill Howe, UW

“I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.”

-- Aaron Kimball, CTO Wibidata

04/11/2023 51Bill Howe, UW

But what are the abstractions of data science?

tools abstr.

“Data Jujitsu”“Data Wrangling”“Data Munging”

Translation: “We have no idea what this is all about”

Claim: Relational Algebra is the universal formalism for data modeling and manipulation, and every data scientist should know it

04/11/2023 52

UW Data Science Education Efforts

Bill Howe, UW

Students Non-StudentsCS/Informatics Non-Major professionals researchersundergrads grads undergrads grads

UWEO Data Science Certificate Graduate Certificate in Big Data CS Data Management Courses eScience workshops Intro to data programming eScience Masters (planned) Coursera Course: Intro to Data Science

Previous courses:Scientific Data Management, Graduate CS, Summer 2006, Portland State UniversityScientific Data Management, Graduate CS, Spring 2010, University of Washington

04/11/2023 53Bill Howe, UW

Bill Howe

“Very very interesting tht there is high correlation in the way the election results were being announced and the way the graph is shaped.”

“I was quite amazined that I was able to obtain this analysis with just 2 days of lecturing and practice!”

“Inspired by all my new learning, I thought about doing a little sentiment analysis myself for our national elections!”

“Darth Grader”

… I was frankly amazed that you were so fast in responding to my query, in spite of the class being so huge. I’d like to thank you so much for teaching this course. This has been one of the most useful courses I’ve ever taken and you’re an awesome instructor. With such great quality education freely available, I wonder why I joined a Master’s program haha. Thanks again and take care. I’ll try my best to finish another competition.

“With such great quality education freely available, I wonder why I

joined a Master’s program haha.”

04/11/2023 57

INTELLECTUAL INFRASTRUCTURE

Institution

Bill Howe, UW

Seek work in “Pasteur’s Quadrant”

Considerations of use

Que

st fo

r fun

dam

enta

l und

erst

andi

ng

Pasteur

Edison

Bohr

04/11/2023 59Bill Howe, UW

Multiple modes of interaction, multiple time scales

1-2 years and up1-2 weeks and down 1-2 quarters

Incubation(projects)

Communication(events)

Collaboration(partnerships)

2018

2008

Some local observations:• Big data work exposes common ground• Every job is becoming “data scientist”• More π-shaped people!• Democratization to the long tail is key• Industry and research aren’t too different

Incubator• Seed grants to students and postdocs• Rotating staff from science and industry• An evolving portfolio of reusable tools• Produce digital capital and human capital

Data Science Incubator

2013

04/11/2023 61Bill Howe, UW

On NoSQL

04/11/2023 62Bill Howe, UW

http://sqlshare.escience.washington.edu

billhowe@cs.washington.edu

http://escience.washington.edu

04/11/2023 63Bill Howe, UW

04/11/2023 64

WHY SQL?

Bill Howe, UW

5/18/10 Garret Cole, eScience Institute

What’s the point?• Conventional wisdom says “Science data isn’t relational”

– This is nonsense

• Conventional wisdom says “Scientists won’t write SQL”– This is nonsense

• So why aren’t databases being used more often?– They’re a PITA

• We implicate difficulty in– installation, configuration– schema design, data loading– performance tuning– app-building (NoGUI?)

We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”

04/11/2023 66Bill Howe, UW

God made the integers; all else is the work of man.

(Leopold Kronecker, 19th Century Mathematician)

slide src: Mike Franklin

Codd made relations;all else is the work of man.

(Raghu Ramakrishnan, DB text book author)

04/11/2023 Bill Howe, UW 67

Key Idea: Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws: 1. (+) identity: x+0 = x2. (/) identity: x/1 = x3. (*) distributes: (n*x+n*y) = n*(x+y)4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:N = (2+3)*z

two operations instead of five, no division operator

Same idea works with the Relational Algebra!

Data Curation

Data Management

Data Analytics

Cyberinfrastructure

Big

Dat

a

Open D

ata

Database & Systems

Researchers

Stats, ML, and Viz

Library Science

Researchers

DataONE

Hadoop

GraphLab

Vertica

Greenplum

Oracle/MS/IBM

Dataverse

R/SPSS/MATLAB/Stata

Dryad

ICPSR

Geodata.gov

Tableau

Weka

GenBank

Intellectual Infrastructure

Spark

Pig

HIVE

Shark

Dremel

04/11/2023 69

SQLShare as a CS Research Platform• Automatic “Starter” Queries

– (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle)

• VizDeck: Automatic Mashups and Visualization – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon)

• Info Extraction from Spreadsheets– (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani)

• Scalable Analytics-as-a-Service– (Dan Suciu, Magda Balazinska, Bill Howe)

• Optimizing Iterative Queries for Machine Learning– (Dan Suciu, Magda Balazinska, Bill Howe)

• Case Studies in Metagenomics, Chemistry, more

SSDBM 2011SIGMOD 2011 (demo)

SSDBM 2011CHI 2012SIGMOD 2012 (demo)

Bill Howe, UW

VLDB 2010Datalog2.0 2012CIDR 2013

Data engineering 2012CiSE 2012

Two Failure Modes Serving the Long Tail

over-abstraction

“uber-system”

“neither fish nor fowl”

tries to address so many requirements, it addresses none

too reactive, ad hoc, one-off

Addresses exactly 1 application

no leverage; doesn’t scale

A stripped-down version of Jim Gray’s “20 questions” methodology

Experimental Engagement Algorithm for the Long Tail

1. Get the data2. Load the data “as is” – no schema design3. Get ~20 questions (in English)4. Translate the questions into SQL (when possible)5. Provide these “starter queries” to the researchers

Q: Can researchers questions be expressed in SQL?Q: Are a few examples sufficient for novices to self-train with SQL?Q: Can we scale this process up?Q: If so, will the use of SQL reduce their data handling overhead?

top related