Big Data + Big Sim: Query Processing over Unstructured CFD Models

Bill Howe Information School Computer Science & Engineering University of Washington Big Data + Big Sim: Query Processing over Unstructured CFD Models 8/7/2017 Bill Howe, UW 1 Scott Moe Applied Math University of Washington

Upload: university-of-washington

Post on 23-Jan-2018


Page 1: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Bill Howe Information School

Computer Science & Engineering

University of Washington

Big Data + Big Sim:

Query Processing over

Unstructured CFD Models

8/7/2017 Bill Howe, UW 1

Scott Moe

Applied Math

University of Washington

Page 2: Big Data + Big Sim: Query Processing over Unstructured CFD Models

This morning…

• Data-intensive science in oceanography

• Background on databases and query

algebras

• Regridding: Integrating ocean models using

a database-style algebra

• If time: Responsible data science


Motivation Algebraic Optimization Regridding End

Page 3: Big Data + Big Sim: Query Processing over Unstructured CFD Models

My position for this talk…

• Simulations are sources of data

• Analysis requires querying across

heterogeneous data sources, including

simulations

• The CS database community has the

right set of concepts and approaches

…but ultimately we’re just plumbers


Page 4: Big Data + Big Sim: Query Processing over Unstructured CFD Models

The Fourth Paradigm

1. Empirical + experimental

2. Theoretical

3. Computational

4. Data-Intensive

Jim Gray


Page 5: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Nearly every field of discovery is transitioning

from “data poor” to “data rich”

Astronomy: LSST

Physics: LHC

Oceanography: OOI

Social Sciences

Biology: Sequencing

Economics

Neuroscience: EEG, fMRI


Page 6: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Complex System

“Little linear windows”

Academic research

Practitioners

One view of “data science” is to streamline the discovery, interpretation, and operationalization of semi-robust local patterns that have predictive power for some task.

In general, these don’t exist. But in specific situations, they do.

Page 7: Big Data + Big Sim: Query Processing over Unstructured CFD Models

slide: John Delaney, UW


Page 8: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Regional Scale Nodes

John Delaney

10s of Gigabits/second from the ocean floor


Page 9: Big Data + Big Sim: Query Processing over Unstructured CFD Models


17 federal organizations named as partners

11 Regional Associations

“a strategy for incorporating observation systems from …

near shore waters as part of … a network of observatories.”


Page 10: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Center for Coastal Margin

Observation and Prediction (CMOP)

Antonio Baptista

Page 11: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Virtual Mekong Basin


img src: Mark Stoermer, UW Center for Environmental Visualization

Jeff Richey

Page 12: Big Data + Big Sim: Query Processing over Unstructured CFD Models

So what?

• Geosciences are transitioning from

expedition-based to observatory-based

science

• Enormous investments in integrating

sensors and models

• The big problem: ad hoc queries over

large, heterogeneous, distributed datasets

and models


Page 13: Big Data + Big Sim: Query Processing over Unstructured CFD Models

So what do we do about querying across

heterogeneous sources?

Raise the level of abstraction and let the

system handle the details


Page 14: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Digression: Relational Database History

Pre-Relational: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

-- Codd 1970

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation


Page 15: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Key Idea: An Algebra of Tables

select

project

join join

Other operators: aggregate, union, difference, cross product
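
To make the operators concrete, here is a minimal sketch of select, project, and join over in-memory tables represented as lists of dictionaries. The table contents and column names (`pname`, `cid`) are made up for illustration; a real engine would not materialize every intermediate result like this.

```python
def select(rows, pred):
    """Keep only the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]


def project(rows, cols):
    """Keep only the named columns of each row."""
    return [{c: r[c] for c in cols} for r in rows]


def join(left, right, key):
    """Equi-join two tables on a column they share."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical example tables (not from the talk).
product = [
    {"pid": 1, "pname": "kayak", "price": 450},
    {"pid": 2, "pname": "map", "price": 12},
]
purchase = [
    {"pid": 1, "cid": 10},
    {"pid": 2, "cid": 11},
]

# Operators compose because each one takes tables and returns a table.
result = project(
    select(join(product, purchase, "pid"), lambda r: r["price"] > 100),
    ["pname", "cid"],
)
print(result)
```

Because every operator maps tables to tables, expressions can be nested and rearranged freely, which is what makes algebraic optimization possible.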


Page 16: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Review: Algebraic Optimization

N = ((4*2)+((4*3)+0))/1

Algebraic Laws:

1. (+) identity: x+0 = x

2. (/) identity: x/1 = x

3. (*) distributes: (n*x+n*y) = n*(x+y)

4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2: N = (2+3)*4

two operations instead of five, no division operator

Same idea works with very large tables, but the payoff is much higher
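
The rewrite can be sketched as a tiny bottom-up rule engine. This is an illustrative toy, not how a database optimizer is built: expressions are nested tuples `(op, left, right)`, and only laws 1 to 3 are implemented, since law 4 merely reorders the operands of the final product.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(e):
    """Rewrite an expression bottom-up using laws 1, 2 and 3."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:  # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:  # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x + y)
        return ("*", a[1], ("+", a[2], b[2]))
    return (op, a, b)


def evaluate(e):
    """Compute the numeric value of an expression tree."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    return OPS[op](evaluate(a), evaluate(b))


def count_ops(e):
    """Count the operators in an expression tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])


original = ("/", ("+", ("*", 4, 2), ("+", ("*", 4, 3), 0)), 1)
optimized = simplify(original)
print(optimized, count_ops(original), count_ops(optimized))
```

Both trees evaluate to the same number; the optimized one simply does less work to get there.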


Page 17: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Algebraic Optimization:

Find a better logical plan

Product(pid, name, price)

Purchase(pid, cid, store)

Customer(cid, name, city)

SELECT DISTINCT x.name, z.name

FROM Product x, Purchase y, Customer z

WHERE x.pid = y.pid and y.cid = z.cid and

x.price > 100 and z.city = 'Seattle'

Logical plan (read from the inside out):

δ( Π x.name,z.name ( σ price>100 and city='Seattle' ( (Product ⋈ pid=pid Purchase) ⋈ cid=cid Customer ) ) )


Page 18: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Algebraic Optimization:

Find a better logical plan

Query optimization = finding cheaper, equivalent expressions

Logical plan with the selections pushed below the joins (read from the inside out):

δ( Π x.name,z.name ( (σ price>100 (Product) ⋈ pid=pid Purchase) ⋈ cid=cid σ city='Seattle' (Customer) ) )

SELECT DISTINCT x.name, z.name

FROM Product x, Purchase y, Customer z

WHERE x.pid = y.pid and y.cid = z.cid and

x.price > 100 and z.city = 'Seattle'
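
A small sketch of why the second plan is cheaper: both plans below return the same answer, but filtering before joining produces fewer intermediate rows. The rows are invented for illustration, and tables are plain tuples in the column order of the schemas above.

```python
# Hypothetical rows: Product(pid, name, price), Purchase(pid, cid, store),
# Customer(cid, name, city).
products = [(1, "kayak", 450), (2, "map", 12), (3, "tent", 300)]
purchases = [(1, 10, "s1"), (2, 10, "s1"), (3, 11, "s2")]
customers = [(10, "Ana", "Seattle"), (11, "Bo", "Portland")]


def plan_late_filter():
    """Join everything first, then apply both selections."""
    joined = [(p, y, c)
              for p in products for y in purchases if p[0] == y[0]
              for c in customers if y[1] == c[0]]
    kept = [(p, y, c) for p, y, c in joined
            if p[2] > 100 and c[2] == "Seattle"]
    # A set plays the role of DISTINCT (the δ operator).
    return {(p[1], c[1]) for p, y, c in kept}, len(joined)


def plan_pushed_down():
    """Apply each selection to its base table before joining."""
    pricey = [p for p in products if p[2] > 100]
    seattle = [c for c in customers if c[2] == "Seattle"]
    joined = [(p, y, c)
              for p in pricey for y in purchases if p[0] == y[0]
              for c in seattle if y[1] == c[0]]
    return {(p[1], c[1]) for p, y, c in joined}, len(joined)


late_result, late_rows = plan_late_filter()
pushed_result, pushed_rows = plan_pushed_down()
print(late_result, late_rows, pushed_rows)
```

On real tables the gap between the two intermediate sizes can be orders of magnitude, which is the "much higher payoff" of algebraic rewriting on large data.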


Page 19: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Same logical expression, different physical algorithms

Which is faster?

SELECT *

FROM Order o, Item i

WHERE o.order = i.order

Logical plan: join(o.order = i.order) over scan(Order o) and scan(Item i)

(N = number of records in Item, M = number of records in Order)

Option 1

for each record i in Item:                        O(N)

    for each record o in Order:                   O(M)

        if o.order = i.order:                     O(1)

            emit (o, i)                           O(1)

overall: O(N*M)

Option 2

for each record i in Item:                        O(N)

    insert into hashtable                         O(1)

for each record o in Order:                       O(M)

    lookup corresponding records in hashtable     O(~1)

return matching pairs

overall: O(N+M)
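
Both options can be written out as runnable Python. This is a sketch under simple assumptions: records are dictionaries with an `order` key, everything fits in memory, and the `sku` and `customer` fields are invented for the example.

```python
from collections import defaultdict


def nested_loop_join(orders, items):
    """Option 1: compare every Item with every Order, O(N*M)."""
    out = []
    for i in items:
        for o in orders:
            if o["order"] == i["order"]:
                out.append((o, i))
    return out


def hash_join(orders, items):
    """Option 2: build a hash table on Item, then probe it, O(N+M)."""
    table = defaultdict(list)
    for i in items:  # build phase
        table[i["order"]].append(i)
    out = []
    for o in orders:  # probe phase
        for i in table.get(o["order"], []):
            out.append((o, i))
    return out


def as_pairs(result):
    """Normalize a join result so the two outputs can be compared."""
    return sorted((o["order"], i["sku"]) for o, i in result)


orders = [{"order": k, "customer": f"c{k}"} for k in range(4)]
items = [{"order": s % 3, "sku": s} for s in range(5)]

print(as_pairs(nested_loop_join(orders, items)))
print(as_pairs(hash_join(orders, items)))
```

The two functions return the same pairs in a different order, which is exactly the point: one logical join, two physical algorithms, and the system rather than the user should pick between them.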


Page 20: Big Data + Big Sim: Query Processing over Unstructured CFD Models

3/12/09 Bill Howe, eScience Institute 20

H0 : (x,y,b) V0 : (z)

A

restrict(0, z >b)

B

color is depth

Algebraic Manipulation of Scientific Datasets,

B. Howe, D. Maier, VLDBJ 2005

H0 : (x,y,b) V0 : ( )

apply(0, z=(surf b) * )

bind(0, surf)

C

color is salinity

GridFields: An Algebra of Meshes

Page 21: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Example (1)

H = Scan(context, "H")

rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)

[Figure: H and rH, color: bathymetry. Callouts on the Restrict call label its predicate and dimension arguments.]

Page 22: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Example: Transect

P


Page 23: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Transect: Bad Query Plan

[Plan diagram: inputs H(x,y,b), V(z) and P; operators r(z>b), b(s), regrid]

1) Construct full-size 3D grid

2) Construct 2D transect grid

3) Interpolate 1) onto 2)


Page 24: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Transect: Optimized Plan

[Plan diagram: inputs P, H(x,y,b) and V(z); operators regrid, b(s), regrid]

1) Find 2D cells containing points

2) Create “stacks” of 2D cells carrying data

3) Create 2D transect grid

4) Interpolate 2) onto 3)


Page 25: Big Data + Big Sim: Query Processing over Unstructured CFD Models

1) Find cells containing points in P

Page 26: Big Data + Big Sim: Query Processing over Unstructured CFD Models


1) Find cells containing points in P

2) Construct “stacks” of cells

4) Interpolate


Page 27: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Transect: Results

[Bar chart: running time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o; dataset: 800 MB (1 timestep)]

Page 28: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Back to integrating models:

What is the right abstraction?

• Claim: Everything reduces to regridding

• Model-data comparison (skill assessment)?

Regrid observations onto model mesh

• Model-model comparison?

Regrid one model’s mesh onto the other’s

• Model coupling?

Regrid a meso-scale atmospheric model onto your regional ocean model

• Visualization?

Regrid onto a 3D mesh, or regrid onto a 2D array of pixels


Page 29: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Status Quo

• “FTP + MATLAB”

• “Nascent Databases”

– File-based, format-specific API

– UniData’s NetCDF, HDF5

– Some IO optimization, some indexing

• “Data Servers”

– Same as file-based systems,

– but supports RPC

– Example: Hyrax

None of this scales

- up with data volumes

- up with number of sources

- down with developer expertise


Page 30: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Summary so far

• “Integration” means “regridding”

– mesh to pixels, mesh to mesh, trajectory to mesh

– satellites to models, models to models, observations to models

• Regridding is hard

– Must be easy, tolerant of unusual grids, numerically conservative, efficient

Our goal

• Define a “universal regridding” operator with nice algebraic

properties

• Use it to implement efficient distributed data sharing applications,

parallel algorithms, and more


Page 31: Big Data + Big Sim: Query Processing over Unstructured CFD Models

What are some complexities we want to

hide?

• Unstructured Grids

• Numerical Conservation

• Choice of Algorithms


Page 32: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 33: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Washington

Oregon

Columbia River Estuary


Page 34: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Washington

Oregon

Columbia River Estuary


Page 35: Big Data + Big Sim: Query Processing over Unstructured CFD Models

SciDB

Hyrax

GridFields

ESMF

VTK/Paraview

easy; good support hard; poor support


Page 36: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Structured grids are easy


The data model…

(Cartesian products of coordinate variables)

…immediately implies a representation,

(multidimensional arrays)

…an API,

(reading and writing subslabs)

…and an efficient implementation

(address calculation using array “shape”)
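
The "address calculation using array shape" step can be sketched in a few lines. This assumes a row-major layout in a flat buffer; the function names are illustrative and not taken from NetCDF, HDF5, or any other library.

```python
from itertools import product


def strides(shape):
    """Row-major strides, measured in elements."""
    out = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        out[d] = out[d + 1] * shape[d + 1]
    return out


def offset(index, shape):
    """Flat position of a multidimensional index."""
    return sum(i * s for i, s in zip(index, strides(shape)))


def read_subslab(flat, shape, start, count):
    """Read a rectangular sub-block from a flat row-major buffer."""
    ranges = [range(s, s + c) for s, c in zip(start, count)]
    return [flat[offset(idx, shape)] for idx in product(*ranges)]


shape = (2, 3, 4)
flat = list(range(24))  # stand-in for values stored on disk
print(strides(shape), offset((1, 2, 3), shape))
print(read_subslab(flat, shape, start=(0, 1, 2), count=(2, 1, 2)))
```

No topology is stored anywhere: neighbors and positions follow from arithmetic on the shape. Unstructured grids lose this property, which is why they are harder to support.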


Page 37: Big Data + Big Sim: Query Processing over Unstructured CFD Models

What are some complexities we want to

hide?

• Unstructured Grids

• Numerical Conservation

• Choice of Algorithms


Page 38: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Naïve Method: Interpolation (Spatial Join)


For each vertex in the target grid,

Find containing cell in the source grid,

Evaluate the basis functions to interpolate
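
The three steps above can be sketched for a 2D triangular source mesh with linear basis functions (barycentric weights). This is a minimal illustration, not the GridFields implementation: the containing-cell search is a linear scan, and the mesh, the values, and the `eps` tolerance are assumptions made for the example.

```python
def barycentric(p, tri):
    """Barycentric weights of point p with respect to a triangle."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    x, y = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def interpolate(targets, vertices, cells, values, eps=1e-12):
    """For each target vertex: find its source cell, then interpolate."""
    out = []
    for p in targets:
        value = None  # stays None when p lies outside the source mesh
        for cell in cells:  # linear scan; a spatial index would be faster
            w = barycentric(p, [vertices[v] for v in cell])
            if all(wi >= -eps for wi in w):
                value = sum(wi * values[v] for wi, v in zip(w, cell))
                break
        out.append(value)
    return out


# Unit square split into two triangles, carrying f(x, y) = 2x + 3y.
vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
cells = [(0, 1, 2), (0, 2, 3)]
values = [2 * x + 3 * y for x, y in vertices]

result = interpolate([(0.5, 0.25), (0.25, 0.75), (2.0, 2.0)],
                     vertices, cells, values)
print(result)
```

Pointwise interpolation like this is easy to implement, but nothing in it preserves the integral of the field, which is the motivation for the conservative methods that follow.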


Page 39: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 40: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Supermeshing [Farrell 10]


For each cell in the target grid,

Find overlapping cells in the source grid,

Compute their intersections

Derive new coefficients to minimize L2 norm

* Guaranteed Conservative

* Minimizes Error

But:

Domains must match exactly
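
A 1D analogue gives the flavor of the intersection-based approach. This is a deliberately simplified sketch, not the algorithm of [Farrell 10]: cells are intervals, fields are piecewise constant, and under those assumptions the L2-optimal target value is the overlap-weighted average of the source values.

```python
def conservative_regrid_1d(src_edges, src_values, dst_edges):
    """Overlap-weighted remap of a piecewise-constant 1D field."""
    dst_values = []
    for j in range(len(dst_edges) - 1):
        lo, hi = dst_edges[j], dst_edges[j + 1]
        total = 0.0
        for i in range(len(src_edges) - 1):
            # Length of the intersection of source cell i and target cell j.
            overlap = min(hi, src_edges[i + 1]) - max(lo, src_edges[i])
            if overlap > 0:
                total += overlap * src_values[i]
        dst_values.append(total / (hi - lo))
    return dst_values


def integral(edges, values):
    """Integral of a piecewise-constant field over its grid."""
    return sum((edges[k + 1] - edges[k]) * values[k]
               for k in range(len(values)))


src_edges, src_values = [0.0, 1.0, 2.0, 4.0], [3.0, 1.0, 2.0]
dst_edges = [0.0, 1.5, 4.0]

dst_values = conservative_regrid_1d(src_edges, src_values, dst_edges)
print(dst_values)
print(integral(src_edges, src_values), integral(dst_edges, dst_values))
```

The integral is preserved only because both grids cover the same interval. If the target extended past the source, the uncovered part would silently contribute zero, which is the 1D version of the "domains must match exactly" caveat.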


Page 41: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 42: Big Data + Big Sim: Query Processing over Unstructured CFD Models

What are some complexities we want to

hide?

• Unstructured Grids

• Numerical Conservation

• Choice of algorithms


Page 43: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 44: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Finding mesh intersections


Page 45: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 46: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 47: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Commutativity of Regrid and Restrict:

Restrict(Regrid(X,Y)) = Regrid(Restrict(X), Restrict(Y))

G0 = Regrid(Restrict0(X), Restrict0(Y))

G1 = Regrid(Restrict1(X), Restrict1(Y))

:

GN = Regrid(RestrictN(X), RestrictN(Y))

R = Stitch(G0, G1, ..., GN)
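
The identity and the partition-then-stitch pattern can be checked on a toy 1D case. This sketch makes several assumptions that are not from the talk: cells are intervals, `regrid` is the overlap-weighted average for piecewise-constant fields, `restrict` keeps cells that overlap a region, and the region boundaries coincide with target cell edges.

```python
def regrid(src, dst):
    """src: (lo, hi, value) cells; dst: (lo, hi) cells. Returns valued dst cells."""
    out = []
    for cell in dst:
        lo, hi = cell[0], cell[1]
        total = 0.0
        for slo, shi, v in src:
            overlap = min(hi, shi) - max(lo, slo)
            if overlap > 0:
                total += overlap * v
        out.append((lo, hi, total / (hi - lo)))
    return out


def restrict(cells, lo, hi):
    """Keep the cells that overlap the open interval (lo, hi)."""
    return [c for c in cells if min(hi, c[1]) - max(lo, c[0]) > 0]


def stitch(pieces):
    """Concatenate regridded pieces back into one grid."""
    return [cell for piece in pieces for cell in piece]


X = [(0.0, 1.0, 3.0), (1.0, 2.0, 1.0), (2.0, 4.0, 2.0)]  # source, with data
Y = [(0.0, 1.5), (1.5, 2.5), (2.5, 4.0)]                 # target cells
regions = [(0.0, 1.5), (1.5, 4.0)]                       # aligned with Y's edges

# Each piece depends only on its own region, so the pieces could be
# computed on different workers.
pieces = [regrid(restrict(X, lo, hi), restrict(Y, lo, hi))
          for lo, hi in regions]
R = stitch(pieces)
print(R)
print(regrid(X, Y))
```

Each piece touches only the source cells that overlap its region, so the work can be distributed without any worker ever holding the full mesh.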


Page 48: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 49: Big Data + Big Sim: Query Processing over Unstructured CFD Models

“Lumping”


Page 50: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 51: Big Data + Big Sim: Query Processing over Unstructured CFD Models


Page 52: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Globally conservative

Parallelizable

Commutes with user-selected restrictions

Masking to handle mismatched domains

Todos:

• Characterize the error relative to plain supermeshing

• Universal Regridding-as-a-Service


Page 53: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Outreach and Usage

• Code is available, but in transition to github

– Search “gridfields” on google code

– http://code.google.com/p/gridfields/

– C++ with Python bindings

• Integrated into the Hyrax Data Server

– OPULS project funded by NOAA

– Server-side processing of unstructured grids

• Other users

– US Geological Survey

– NOAA

Page 54: Big Data + Big Sim: Query Processing over Unstructured CFD Models


• Screenshot of OPeNDAP demo

http://ec2-174-129-186-110.compute-1.amazonaws.com:8088/nc/test4.nc.nc?ugrid_restrict(0,"Y>41.5&Y<42.75&X>-68.0&X<-66.0")

Page 55: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Wrap up

• Integration of big data and big models is the game

• Database-style systems are about hiding complexity

and raising the level of abstraction

• A database-style query algebra for FEMs emphasizing

interpolation and regridding across data and models

made sense to us

• But more broadly: a richer infrastructure for comparing

and sharing model results and data

• One idea: “Virtual datasets” where the model is

executed in response to queries, perhaps with simpler

grids and relaxed assumptions


Page 56: Big Data + Big Sim: Query Processing over Unstructured CFD Models

ProPublica, May 2016

Motivation Regridding Supermeshing

Database Algebras Evaluation

Numerical conservation

Responsible Data Science

Page 57: Big Data + Big Sim: Query Processing over Unstructured CFD Models


The Special Committee on Criminal Justice Reform's

hearing of reducing the pre-trial jail population.

Technical.ly, September 2016

Philadelphia is grappling with the prospect of a racist computer algorithm

Any background signal in the

data of institutional racism is

amplified by the algorithm

operationalized by the algorithm

legitimized by the algorithm

“Should I be afraid of risk assessment tools?”

“No, you gotta tell me a lot more about yourself.

At what age were you first arrested?

What is the date of your most recent crime?”

“And what’s the culture of policing in the

neighborhood in which I grew up in?”


Page 58: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)

Page 59: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)

Page 60: Big Data + Big Sim: Query Processing over Unstructured CFD Models

Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)

Page 61: Big Data + Big Sim: Query Processing over Unstructured CFD Models

The way I think about this… (1)

First decade of Data Science research and practice:

What can we do with massive, noisy, heterogeneous datasets?

Next decade of Data Science research and practice:

What should we do with massive, noisy, heterogeneous datasets?

Page 62: Big Data + Big Sim: Query Processing over Unstructured CFD Models

The way I think about this… (2)

Decisions are based on two sources of information:

1. Past examples, e.g., “prior arrests tend to increase likelihood of future arrests”

2. Societal constraints, e.g., “we must avoid racial discrimination”

8/7/2017 Data, Responsibly / SciTech NW 62

We’ve become very good at automating the use of past examples

We’ve only just started to think about incorporating societal constraints


Page 63: Big Data + Big Sim: Query Processing over Unstructured CFD Models

The way I think about this… (3)

How do we apply societal constraints to algorithmic

decision-making?

Option 1: Rely on human oversight

Ex: EU General Data Protection Regulation requires that a

human be involved in legally binding algorithmic decision-making

Ex: Wisconsin Supreme Court says a human must review

algorithmic decisions made by recidivism models

Issues with scalability, prejudice

Option 2: Build systems to help enforce these constraints

This is the approach we are exploring


Page 64: Big Data + Big Sim: Query Processing over Unstructured CFD Models

The way I think about this… (4)

On transparency vs. accountability:

• For human decision-making, sometimes explanations are

required, improving transparency

– Supreme court decisions

– Employee reprimands/termination

• But when transparency is difficult, accountability takes over

– medical emergencies, business decisions

• As we shift decisions to algorithms, we lose both

transparency AND accountability

• “The buck stops where?”


Page 65: Big Data + Big Sim: Query Processing over Unstructured CFD Models

So what do we do about it?

Fides: A platform for responsible data science

joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]

Data Curation

novel features to support: Fairness, Accountability, Transparency, Privacy, Reproducibility

Page 66: Big Data + Big Sim: Query Processing over Unstructured CFD Models
