big data + big sim: query processing over unstructured cfd models
TRANSCRIPT
Bill Howe Information School
Computer Science & Engineering
University of Washington
Big Data + Big Sim:
Query Processing over
Unstructured CFD Models
8/7/2017 Bill Howe, UW 1
Scott Moe
Applied Math
University of Washington
This morning…
• Data-intensive science in oceanography
• Background on databases and query
algebras
• Regridding: Integrating ocean models using
a database-style algebra
• If time: Responsible data science
Motivation Algebraic Optimization Regridding End
My position for this talk…
• Simulations are sources of data
• Analysis requires querying across
heterogeneous data sources, including
simulations
• The CS database community has the
right set of concepts and approaches
…but ultimately we’re just plumbers
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
Nearly every field of discovery is transitioning
from “data poor” to “data rich”
Astronomy: LSST
Physics: LHC
Oceanography: OOI
Social Sciences
Biology: Sequencing
Economics
Neuroscience: EEG, fMRI
Complex System
“Little linear windows”
Academic research
Practitioners
One view of “data science” is to streamline the discovery, interpretation,
and operationalization of semi-robust local patterns that have predictive
power for some task.
In general, these don’t exist. But in specific situations, they do.
slide: John Delaney, UW
Regional Scale Nodes
John Delaney
10s of Gigabits/second from the ocean floor
17 federal organizations named as partners
11 Regional Associations
“a strategy for incorporating observation systems from …
near shore waters as part of … a network of observatories.”
Center for Coastal Margin
Observation and Prediction (CMOP)
Antonio Baptista
Virtual Mekong Basin
img src: Mark Stoermer, UW Center for Environmental Visualization
Jeff Richey
So what?
• Geosciences are transitioning from
expedition-based to observatory-based
science
• Enormous investments in integrating
sensors and models
• The big problem: ad hoc queries over
large, heterogeneous, distributed datasets
and models
So what do we do about querying across
heterogeneous sources?
Raise the level of abstraction and let the
system handle the details
Digression: Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs should
remain unaffected when the internal representation of data is changed and
even when some aspects of the external representation are changed.”
-- Codd 1979
Key Idea: Programs that manipulate tabular data exhibit an algebraic
structure allowing reasoning and manipulation independently of physical
data representation
Key Idea: An Algebra of Tables
select
project
join join
Other operators: aggregate, union, difference, cross product
Review: Algebraic Optimization
N = ((4*2)+((4*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2: N = (2+3)*4
two operations instead of five, no division operator
Same idea works with very large tables, but the payoff is much higher
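The rewrite on this slide can be replayed mechanically. What follows is a minimal sketch, assuming expressions are encoded as nested tuples `(op, left, right)`; the encoding and the names `rewrite`, `count_ops`, and `evaluate` are illustrative and do not come from the talk.

```python
import operator

# Hypothetical encoding: a number, or a tuple (op, left, right).
OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}

def rewrite(e):
    """Apply the four algebraic laws bottom-up."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = rewrite(a), rewrite(b)
    if op == "+" and b == 0:          # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:          # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x+y); law 4 lets us write it as (x+y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)

def count_ops(e):
    """Number of operator nodes in the expression tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])

def evaluate(e):
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    return OPS[op](evaluate(a), evaluate(b))

# N = ((4*2)+((4*3)+0))/1
original = ("/", ("+", ("*", 4, 2), ("+", ("*", 4, 3), 0)), 1)
optimized = rewrite(original)
```

Both trees evaluate to the same number, but the rewritten one has fewer operator nodes, which is the whole point of optimizing before executing.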
Algebraic Optimization:
Find a better logical plan
Logical plan (root first):
δ
  Π x.name, z.name
    σ price>100 and city='Seattle'
      ⋈ cid=cid
        ⋈ pid=pid
          Product
          Purchase
        Customer
Product(pid, name, price)
Purchase(pid, cid, store)
Customer(cid, name, city)
SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid and y.cid = z.cid and
x.price > 100 and z.city = 'Seattle'
Algebraic Optimization:
Find a better logical plan
Logical plan (root first), with the price selection pushed below the join:
δ
  Π x.name, z.name
    σ city='Seattle'
      ⋈ cid=cid
        ⋈ pid=pid
          σ price>100
            Product
          Purchase
        Customer
Query optimization =
finding cheaper,
equivalent expressions
SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid and y.cid = z.cid and
x.price > 100 and z.city = 'Seattle'
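The two plans for this query can be compared on a toy database. This is a sketch under stated assumptions: the rows are invented for illustration, and `join`, `plan_late_selection`, and `plan_pushed_selection` are hypothetical helper names. It counts intermediate rows as a rough stand-in for cost.

```python
# Invented sample rows matching the schemas on the slide.
product = [(1, "laptop", 1200), (2, "pen", 2), (3, "camera", 450)]   # pid, name, price
purchase = [(1, 10, "A"), (2, 10, "B"), (3, 11, "C"), (1, 11, "A")]  # pid, cid, store
customer = [(10, "Ann", "Seattle"), (11, "Bob", "Portland")]         # cid, name, city

def join(left, right, key_left, key_right):
    """Equi-join returning concatenated tuples."""
    index = {}
    for r in right:
        index.setdefault(r[key_right], []).append(r)
    return [l + r for l in left for r in index.get(l[key_left], [])]

def plan_late_selection():
    pp = join(product, purchase, 0, 0)    # columns 0-2 product, 3-5 purchase
    ppc = join(pp, customer, 4, 0)        # columns 6-8 customer
    rows = [t for t in ppc if t[2] > 100 and t[8] == "Seattle"]
    # The set plays the role of project + duplicate elimination.
    return {(t[1], t[7]) for t in rows}, len(pp) + len(ppc)

def plan_pushed_selection():
    expensive = [p for p in product if p[2] > 100]   # selection on price first
    pp = join(expensive, purchase, 0, 0)
    ppc = join(pp, customer, 4, 0)
    rows = [t for t in ppc if t[8] == "Seattle"]
    return {(t[1], t[7]) for t in rows}, len(pp) + len(ppc)

late_result, late_cost = plan_late_selection()
pushed_result, pushed_cost = plan_pushed_selection()
```

The answers are identical while the pushed-down plan materializes fewer intermediate rows; a real optimizer estimates such costs from statistics instead of running both plans.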
Same logical expression, different physical
algorithms
Which is faster?
SELECT *
FROM Order o, Item i
WHERE o.order = i.order
Plan: join (o.order = i.order) over scan(Order o) and scan(Item i)
Option 1 (nested loops):
for each record i in Item:                      O(N)
  for each record o in Order:                   O(M)
    if o.order = i.order:                       O(1)
      return (o, i)                             O(1)
overall: O(N*M)
Option 2 (hash join):
for each record i in Item:                      O(N)
  insert into hashtable                         O(1)
for each record o in Order:                     O(M)
  lookup corresponding records in hashtable     O(~1)
  return matching pairs
overall: O(N+M)
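Both options can be written out in a few lines. This is a minimal sketch, assuming records are tuples whose first field is the order number; the sample rows and function names are made up for illustration.

```python
from collections import defaultdict

def nested_loop_join(orders, items):
    """Option 1: compare every Item with every Order, O(N*M)."""
    out = []
    for i in items:
        for o in orders:
            if o[0] == i[0]:
                out.append((o, i))
    return out

def hash_join(orders, items):
    """Option 2: build a hash table on Item, probe with Order, O(N+M)."""
    table = defaultdict(list)
    for i in items:
        table[i[0]].append(i)          # build phase
    out = []
    for o in orders:
        for i in table.get(o[0], []):  # probe phase
            out.append((o, i))
    return out

orders = [(1, "alice"), (2, "bob"), (3, "carol")]   # (order, customer)
items = [(1, "sku-a"), (1, "sku-b"), (3, "sku-c")]  # (order, sku)

nested = nested_loop_join(orders, items)
hashed = hash_join(orders, items)
```

The two functions return the same pairs, possibly in a different order, which is why the optimizer is free to choose between them on cost alone.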
3/12/09 Bill Howe, eScience Institute 20
H0 : (x,y,b) V0 : (z)
A
restrict(0, z >b)
B
color is depth
Algebraic Manipulation of Scientific Datasets,
B. Howe, D. Maier, VLDBJ 2005
H0 : (x,y,b) V0 : ( )
apply(0, z=(surf b) * )
bind(0, surf)
C
color is salinity
GridFields: An Algebra of Meshes
Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
H = rH =
dimension, predicate
color: bathymetry
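To make the effect of a dimension-0 restriction concrete, here is a small sketch in plain Python. It is not the GridFields API: `restrict_vertices` is a hypothetical helper, the mesh is invented, and the sketch assumes that a triangle survives only if all of its vertices satisfy the predicate.

```python
def restrict_vertices(vertices, triangles, predicate):
    """Keep vertices passing the predicate and triangles built only from them."""
    keep = [k for k, (x, y) in enumerate(vertices) if predicate(x, y)]
    renumber = {old: new for new, old in enumerate(keep)}
    new_vertices = [vertices[k] for k in keep]
    new_triangles = [tuple(renumber[k] for k in tri)
                     for tri in triangles
                     if all(k in renumber for k in tri)]
    return new_vertices, new_triangles

# Invented mesh: four vertices, two triangles sharing an edge.
vertices = [(330, 290), (340, 295), (335, 300), (350, 295)]
triangles = [(0, 1, 2), (1, 3, 2)]

# Same bounding box as the predicate string in the example above.
def in_box(x, y):
    return 326 < x < 345 and 287 < y < 302

r_vertices, r_triangles = restrict_vertices(vertices, triangles, in_box)
```

The vertex at x = 350 falls outside the box, so it and the triangle that touches it are dropped.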
Transect: Bad Query Plan
H(x,y,b)
V(z)
r(z>b) b(s) regrid
PP V
1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Interpolate 1) onto 2)
Transect: Optimized Plan
P V
V(z)P
H(x,y,b)regrid b(s) regrid
1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Interpolate 2) onto 3)
1) Find cells containing points in P
2) Construct “stacks” of cells
4) Interpolate
Transect: Results
[Figure: bar chart of run time in seconds (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o; dataset: 800 MB (1 timestep)]
Back to integrating models:
What is the right abstraction?
• Claim: Everything reduces to regridding
• Model-data comparison (skill assessment)?
Regrid observations onto model mesh
• Model-model comparison?
Regrid one model’s mesh onto the other’s
• Model coupling?
Regrid a meso-scale atmospheric model onto your regional ocean model
• Visualization?
Regrid onto a 3D mesh, or regrid onto a 2D array of pixels
Status Quo
• “FTP + MATLAB”
• “Nascent Databases”
– File-based, format-specific API
– UniData’s NetCDF, HDF5
– Some IO optimization, some indexing
• “Data Servers”
– Same as file-based systems,
– but supports RPC
– e.g., Hyrax
None of this scales
- up with data volumes
- up with number of sources
- down with developer expertise
Summary so far
• “Integration” means “regridding”
– mesh to pixels, mesh to mesh, trajectory to mesh
– satellites to models, models to models, observations to models
• Regridding is hard
– Must be easy, tolerant of unusual grids, numerically conservative, efficient
Our goal
• Define a “universal regridding” operator with nice algebraic
properties
• Use it to implement efficient distributed data sharing applications,
parallel algorithms, and more
What are some complexities we want to
hide?
• Unstructured Grids
• Numerical Conservation
• Choice of Algorithms
Washington
Oregon
Columbia River Estuary
SciDB
Hyrax
GridFields
ESMF
VTK/Paraview
easy; good support hard; poor support
Structured grids are easy
The data model…
(Cartesian products of coordinate variables)
…immediately implies a representation,
(multidimensional arrays)
…an API,
(reading and writing subslabs)
…and an efficient implementation
(address calculation using array “shape”)
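The chain from data model to implementation is short enough to write down. The sketch below assumes a row-major (C order) layout stored in a flat Python list; `strides`, `address`, and `read_subslab` are illustrative names, not any particular library's API.

```python
from itertools import product

def strides(shape):
    """Row-major strides in elements: the last axis varies fastest."""
    out = [1] * len(shape)
    for k in range(len(shape) - 2, -1, -1):
        out[k] = out[k + 1] * shape[k + 1]
    return out

def address(index, shape):
    """Flat offset of a multidimensional index, computed from the shape alone."""
    return sum(i * s for i, s in zip(index, strides(shape)))

def read_subslab(flat, shape, start, count):
    """Read a rectangular sub-block given its start corner and extent."""
    ranges = [range(s, s + c) for s, c in zip(start, count)]
    return [flat[address(idx, shape)] for idx in product(*ranges)]

shape = (3, 4)
flat = list(range(12))          # a 3 x 4 array stored contiguously
slab = read_subslab(flat, shape, start=(1, 1), count=(2, 2))
```

No per-cell connectivity is stored: the shape is the only metadata needed. Unstructured grids lose exactly this property.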
What are some complexities we want to
hide?
• Unstructured Grids
• Numerical Conservation
• Choice of Algorithms
Naïve Method: Interpolation (Spatial Join)
For each vertex in the target grid,
Find containing cell in the source grid,
Evaluate the basis functions to interpolate
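The three steps above can be sketched for a triangular source mesh with linear basis functions, where evaluating the basis amounts to computing barycentric coordinates. The mesh, the field values, and the function names are assumptions made for illustration; the linear scan over triangles stands in for the spatial index a real system would use.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p in triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return l1, l2, 1.0 - l1 - l2

def interpolate(targets, vertices, triangles, values, eps=1e-12):
    """For each target vertex: find the containing cell, then interpolate."""
    result = []
    for p in targets:
        value = None                       # None marks points outside the mesh
        for tri in triangles:
            w = barycentric(p, *(vertices[k] for k in tri))
            if all(wk >= -eps for wk in w):
                value = sum(wk * values[k] for wk, k in zip(w, tri))
                break
        result.append(value)
    return result

# Unit square split into two triangles, carrying f(x, y) = x + 2y at the vertices.
vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
triangles = [(0, 1, 2), (0, 2, 3)]
values = [0.0, 1.0, 3.0, 2.0]

out = interpolate([(0.5, 0.25), (0.25, 0.75), (2.0, 2.0)],
                  vertices, triangles, values)
```

A linear field is reproduced exactly, but nothing in this method preserves integrals of the field, which is the motivation for conservative schemes.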
Supermeshing [Farrell 10]
For each cell in the target grid,
Find overlapping cells in the source grid,
Compute their intersections
Derive new coefficients to minimize L2 norm
* Guaranteed Conservative
* Minimizes Error
But:
Domains must match exactly
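In one dimension with piecewise-constant fields, the steps above reduce to overlap-weighted averaging, which makes the conservation property easy to check. The sketch below covers only that simplified special case and is not the algorithm of [Farrell 10]; the grids, values, and function names are invented for illustration.

```python
def conservative_remap(src_edges, src_values, dst_edges):
    """Remap cell averages from a source 1D grid onto a target 1D grid."""
    dst_values = []
    for j in range(len(dst_edges) - 1):
        lo, hi = dst_edges[j], dst_edges[j + 1]
        total = 0.0
        for i in range(len(src_edges) - 1):
            # Length of the intersection of target cell j and source cell i.
            overlap = min(hi, src_edges[i + 1]) - max(lo, src_edges[i])
            if overlap > 0:
                total += overlap * src_values[i]
        dst_values.append(total / (hi - lo))
    return dst_values

def integral(edges, values):
    """Integral of a piecewise-constant field over its grid."""
    return sum(v * (edges[k + 1] - edges[k]) for k, v in enumerate(values))

src_edges, src_values = [0.0, 1.0, 2.0, 4.0], [1.0, 3.0, 2.0]
dst_edges = [0.0, 1.5, 4.0]        # same domain [0, 4], different cells
dst_values = conservative_remap(src_edges, src_values, dst_edges)
```

The integral is preserved only because both grids cover the same interval; if the target extended past the source, part of each target cell would have no data, which is the "domains must match exactly" restriction.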
What are some complexities we want to
hide?
• Unstructured Grids
• Numerical Conservation
• Choice of algorithms
Finding mesh intersections
Commutativity of Regrid and Restrict:
Restrict(Regrid(X,Y)) = Regrid(Restrict(X), Restrict(Y))
G0 = Regrid(Restrict0(X), Restrict0(Y))
G1 = Regrid(Restrict1(X), Restrict1(Y))
:
GN = Regrid(RestrictN(X), RestrictN(Y))
R = Stitch(G0, G1, ..., GN)
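The law can be checked on a one-dimensional toy problem. In this sketch the source X is a list of cells `(lo, hi, value)`, the target Y is a list of points, and Regrid assigns each point the value of its containing cell. All names and data are hypothetical and far simpler than the operators in the talk; the point is only that each partition can be regridded independently and the pieces stitched together.

```python
def restrict_cells(cells, lo, hi):
    """Source cells that intersect the window [lo, hi)."""
    return [c for c in cells if c[1] > lo and c[0] < hi]

def restrict_points(points, lo, hi):
    """Target points that fall inside the window [lo, hi)."""
    return [p for p in points if lo <= p < hi]

def regrid(cells, points):
    """Assign each target point the value of the source cell containing it."""
    out = []
    for p in points:
        for a, b, v in cells:
            if a <= p < b:
                out.append((p, v))
                break
    return out

def stitch(parts):
    """Concatenate the per-partition results."""
    return [pv for part in parts for pv in part]

X = [(0, 1, 10.0), (1, 3, 20.0), (3, 6, 30.0)]
Y = [0.5, 1.5, 2.5, 3.5, 5.0]
windows = [(0, 2), (2, 4), (4, 6)]

# Each G_k depends only on its own window, so the loop could run in parallel.
parts = [regrid(restrict_cells(X, lo, hi), restrict_points(Y, lo, hi))
         for lo, hi in windows]
R = stitch(parts)
```

Because each partition needs only the source cells that intersect its window, the work and the data can be distributed, and a user-supplied restriction can be pushed below the regrid in the same way a selection is pushed below a join.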
“Lumping”
Globally conservative
Parallelizable
Commutes with user-selected restrictions
Masking to handle mismatched domains
Todos:
• Characterize the error relative to plain supermeshing
• Universal Regridding-as-a-Service
Outreach and Usage
• Code is available, but in transition to github
– Search “gridfields” on google code
– http://code.google.com/p/gridfields/
– C++ with Python bindings
• Integrated into the Hyrax Data Server
– OPULS project funded by NOAA
– Server-side processing of unstructured grids
• Other users
– US Geological Survey
– NOAA
• Screenshot of OPeNDAP demo
http://ec2-174-129-186-110.compute-1.amazonaws.com:8088/nc/test4.nc.nc?
ugrid_restrict(0,"Y>41.5&Y<42.75&X>-68.0&X<-66.0")
Wrap up
• Integration of big data and big models is the game
• Database-style systems are about hiding complexity
and raising the level of abstraction
• A database-style query algebra for FEMs emphasizing
interpolation and regridding across data and models
made sense to us
• But more broadly: a richer infrastructure for comparing
and sharing model results and data
• One idea: “Virtual datasets” where the model is
executed in response to queries, perhaps with simpler
grids and relaxed assumptions
ProPublica, May 2016
Motivation Regridding Supermeshing
Database Algebras Evaluation
Numerical conservation
Responsible Data Science
The Special Committee on Criminal Justice Reform's
hearing on reducing the pre-trial jail population.
Technical.ly, September 2016
Philadelphia is grappling with the prospect of a racist computer algorithm
Any background signal in the
data of institutional racism is
amplified by the algorithm
operationalized by the algorithm
legitimized by the algorithm
“Should I be afraid of risk assessment tools?”
“No, you gotta tell me a lot more about yourself.
At what age were you first arrested?
What is the date of your most recent crime?”
“And what’s the culture of policing in the
neighborhood in which I grew up in?”
Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)
Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)
Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)
The way I think about this… (1)
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples
e.g., “prior arrests tend to increase likelihood of future arrests”
2. Societal constraints
e.g., “we must avoid racial discrimination”
8/7/2017 Data, Responsibly / SciTech NW 62
We’ve become very good at automating the use of past examples
We’ve only just started to think about incorporating societal constraints
The way I think about this… (3)
How do we apply societal constraints to algorithmic
decision-making?
Option 1: Rely on human oversight
Ex: EU General Data Protection Regulation requires that a
human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review
algorithmic decisions made by recidivism models
Issues with scalability, prejudice
Option 2: Build systems to help enforce these constraints
This is the approach we are exploring
The way I think about this… (4)
On transparency vs. accountability:
• For human decision-making, sometimes explanations are
required, improving transparency
– Supreme court decisions
– Employee reprimands/termination
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both
transparency AND accountability
• “The buck stops where?”
So what do we do about it?
Fides: A platform for responsible data science
joint with Stoyanovich [US], Abiteboul [FR], Miklau [US], Sahuguet [US], Weikum [DE]
Data Curation
novel features to support:
Fairness, Accountability, Transparency, Privacy, Reproducibility