Post on 20-May-2020

Big data for big river science: data intensive tools, techniques, and projects at the USGS/Columbia Environmental Research Center

Ed Bulliner

U.S. Geological Survey, Columbia Environmental Research Center

Goals of Presentation

• How are the data available to us different from those available in the past?

• What different approaches are needed to analyze these data?

• What questions are we asking and answering that we could not before?

• ‘Big river science’ – four examples

• How does this relate to NRDAR/ecological restoration?

“Big Data”

• What is “big data”?

  • Emerging field

  • Several definitions – volume, variety, variability

• Do we work with ‘big data’ or ‘lots of data’?

  • Is that distinction important?

• Regardless of semantics, increasing scale and complexity of problems and necessary data

• What do increasing amounts of data mean for science and scientists?

• How do we get the most value from the data available to us?

• Why is this important?

Data Intensive Science

• Paradigm shift in how we do science

• Can ask (and answer) new kinds of questions

• New tools and techniques

Traditional versus Data-Intensive Analyses

• Where do we see ‘data-intensive’ science?

  • Within river science?

  • Within USGS/government?

• Why now? (what’s different?)

  • Data availability

  • Data resolution

  • Computational power

• What are the different tools and approaches currently used?

Tools for Data-Intensive Analyses

• Data storage

  • Increased hard drive space

  • Databases

• Data manipulation

  • Scripting languages

  • Web scraping/data ‘munging’/data mining

• Modeling

  • Scripting languages

  • Modeling packages

• Data visualization

Python

• General purpose scripting language

• Lots of modules

• Free*

• Tools for:

  • Data management

  • Data filtering/cleaning

  • Scientific computing

  • Geospatial analyses

  • Plotting

  • Collaborating

[Diagram: Python connected to OS operations, web queries, database integration, IDL, ArcGIS & ArcPy, data visualization, and statistics]

Pretty cool, but what can we use it for?

Question: Where do riverine sandbars exist and how do they change over time?

• Create database of rivers and flows

• Mask active channel within overlap of rivers and Landsat images

• Integrate Landsat metadata with corresponding discharge data through relational database

• Query imagery by discharge/date

• Automated download and analysis of imagery – time series of sandbars
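The linking step above (image metadata joined to discharge through a relational database, then queried by discharge/date) could look roughly like the sketch below. It is only an illustration using Python's built-in sqlite3 module; the table names, column names, gage and scene identifiers, and values are all hypothetical and are not taken from the project database.

```python
import sqlite3

# In-memory database; every table, column, and value here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE discharge (
        gage_id TEXT,
        obs_date TEXT,          -- ISO date of the daily value
        discharge_cms REAL      -- mean daily discharge, cubic meters per second
    );
    CREATE TABLE scenes (
        scene_id TEXT,
        gage_id TEXT,           -- gage assigned to the river reach in the scene
        acq_date TEXT,          -- acquisition date from the image metadata
        cloud_cover REAL        -- percent
    );
    """
)
conn.executemany(
    "INSERT INTO discharge VALUES (?, ?, ?)",
    [("G1", "2010-07-04", 850.0), ("G1", "2010-07-20", 1400.0), ("G1", "2010-08-05", 620.0)],
)
conn.executemany(
    "INSERT INTO scenes VALUES (?, ?, ?, ?)",
    [
        ("scene_a", "G1", "2010-07-04", 5.0),
        ("scene_b", "G1", "2010-07-20", 10.0),
        ("scene_c", "G1", "2010-08-05", 60.0),
    ],
)

def scenes_below(conn, max_discharge, max_cloud=20.0):
    """Return (scene_id, date, discharge) for clear scenes acquired at or below a discharge."""
    query = """
        SELECT s.scene_id, s.acq_date, d.discharge_cms
        FROM scenes AS s
        JOIN discharge AS d
          ON d.gage_id = s.gage_id AND d.obs_date = s.acq_date
        WHERE d.discharge_cms <= ? AND s.cloud_cover <= ?
        ORDER BY s.acq_date
    """
    return conn.execute(query, (max_discharge, max_cloud)).fetchall()

# Scenes where sand is more likely to be exposed (low flow, low cloud cover).
low_flow_scenes = scenes_below(conn, max_discharge=900.0)
```

The list returned by the query could then drive a scripted download and analysis loop, one scene at a time.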

• Identified areas of persistent sand

• Investigated flows where sand was exposed

• Examined spatial variation

• Used metrics of exposure to help model success of Least Tern nests

Main Points

• Scripts and databases allow for automated downloading and linking of multiple data types

• Too much data for manual analysis

• Python can be used to batch-process images across programs without manual intervention

• Scripted tools can be used to directly query, plot, and perform statistics on image data

Question: What information can we synthesize from a 400+ day archive of field measurements?

[Figure: cross-section schematic of velocity measurements. Explanation: velocity, in meters per second, on a color scale from 0 (slow) to 2.5 (fast); velocity ensemble; velocity bin; 4-beam depths; water column; river bottom.]

• Velocities and depths measured along regular transects

• Lateral, longitudinal, and vertical variability

ADCP and single-beam survey dates, locations, and discharges, 2000-2015

[Map explanation: flow percentile. Low: <25%; Medium: 25-75%; High: >75%]

• Compiled over 32,000 individual cross-sections from 2000-2015

• Joined dataset to river mile and gage to allow discharge-specific queries

• Can group data by location along river and varying discharge levels to compare
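As a rough illustration of grouping the archive by location and discharge level, the sketch below bins cross-sections into river reaches and into the low/medium/high flow-percentile classes shown on the survey map (<25%, 25-75%, >75%). The record layout, the 50-mile reach length, and the sample values are assumptions made for the example, not the archive's actual schema.

```python
from collections import defaultdict
from statistics import mean

def flow_class(percentile):
    """Classify a flow percentile (0-100) as Low (<25), Medium (25-75), or High (>75)."""
    if percentile < 25:
        return "Low"
    if percentile <= 75:
        return "Medium"
    return "High"

# Hypothetical records: (river_mile, flow_percentile, mean_velocity_m_per_s)
cross_sections = [
    (210.4, 12.0, 0.8),
    (210.9, 18.0, 1.0),
    (211.3, 55.0, 1.3),
    (340.2, 80.0, 1.9),
    (340.7, 91.0, 2.1),
]

def summarize(records, reach_length_miles=50):
    """Return {(reach_start_mile, flow_class): mean velocity} for the given records."""
    groups = defaultdict(list)
    for river_mile, percentile, velocity in records:
        # Assign each cross-section to the reach that contains its river mile.
        reach_start = int(river_mile // reach_length_miles) * reach_length_miles
        groups[(reach_start, flow_class(percentile))].append(velocity)
    return {key: mean(values) for key, values in groups.items()}

summary = summarize(cross_sections)
```

With tens of thousands of cross-sections, the same grouping would more likely be done in a database query or a data-frame library, but the logic is the same.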

• Ongoing restoration question: how does habitat (velocity) compare in river chutes versus the main channel?

• Chutes = restoration

• 37 field days where measurements in chutes were taken incidentally or deliberately

• Can use geospatial tools and scripts to come up with relevant comparisons

Measurement archive in lieu of hydrodynamic model – sturgeon spawning locations?

Main Points

• Scripts and databases allow for efficient querying and cleaning of archived datasets

• Python can be used to quickly and interactively summarize datasets by specific groupings

• Existing data can be repurposed and integrated with new data for value-added analyses using scripting

Question: How can we better visualize field measurements of channel velocity and bathymetry?

• Measurements of velocity collected along ‘regular’ transects

• Python used to interpolate data into structured grid (3D matrix)
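The slides do not say which interpolation method was used. As one simple possibility, the sketch below fills a structured 3D grid from scattered velocity measurements using inverse-distance weighting (a swapped-in technique chosen for brevity, not necessarily the author's). The function names, grid dimensions, and sample measurements are hypothetical.

```python
import math

def idw(points, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` from (x, y, z, value) points."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, z, value in points:
        distance = math.dist((x, y, z), target)
        if distance == 0.0:
            return value  # target coincides with a measurement
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

def grid_axis(start, stop, count):
    """Return `count` evenly spaced coordinates from start to stop, inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

def interpolate_to_grid(points, xs, ys, zs):
    """Return a nested list grid[i][j][k] of estimates at (xs[i], ys[j], zs[k])."""
    return [[[idw(points, (x, y, z)) for z in zs] for y in ys] for x in xs]

# Hypothetical measurements: (x, y, depth below surface, velocity in m/s)
measurements = [
    (0.0, 0.0, 0.0, 1.0),
    (10.0, 0.0, 0.0, 2.0),
    (0.0, 10.0, 2.0, 0.5),
    (10.0, 10.0, 2.0, 1.5),
]
xs = grid_axis(0.0, 10.0, 3)
ys = grid_axis(0.0, 10.0, 3)
zs = grid_axis(0.0, 2.0, 2)
velocity_grid = interpolate_to_grid(measurements, xs, ys, zs)
```

A structured grid like this is the kind of input that 3D visualization software can read directly once it is written to a suitable file format.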

ParaView

• Can visualize flowlines around structures (biology)

• Identified bias in field measurements?

• Noticed systematic bias

• Collaborating with ILWSC

33 million+ data points!

Main Points

• Python scripts allow for interpolation and visualization of field data

• Using open-source (free) tools along with Python allows for replication of abilities from more expensive software

• New insights can be gained from visualizing data in different ways

Question: How can we better characterize inundation patterns along the Missouri River?

• Hydrodynamic (HEC-RAS) model provided by USACE describing water surface elevations at cross sections over time

• Used scripting to extend cross sections across floodplain for Missouri River

• Merged LIDAR and channel data provide high-resolution characterization of floodplain elevation

• Spatial interpolations of water elevation

• Calculations of inundation depths
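The depth calculation above reduces to water-surface elevation minus ground elevation, floored at zero. The sketch below is a deliberately simplified one-dimensional version: it linearly interpolates water-surface elevation between cross sections by river mile, whereas the project interpolated spatially across the floodplain. The station positions, elevations, and cell values are hypothetical.

```python
from bisect import bisect_right

def interpolate_wse(river_mile, stations, elevations):
    """Linearly interpolate water-surface elevation between cross sections.

    `stations` must be river miles sorted in ascending order; values outside
    the modeled range are clamped to the nearest cross section.
    """
    if river_mile <= stations[0]:
        return elevations[0]
    if river_mile >= stations[-1]:
        return elevations[-1]
    hi = bisect_right(stations, river_mile)
    lo = hi - 1
    fraction = (river_mile - stations[lo]) / (stations[hi] - stations[lo])
    return elevations[lo] + fraction * (elevations[hi] - elevations[lo])

def inundation_depth(wse, ground_elevation):
    """Depth of water over the ground surface; zero where the ground is dry."""
    return max(wse - ground_elevation, 0.0)

# Hypothetical cross sections (river mile, modeled water-surface elevation in meters)
stations = [100.0, 101.0, 102.0]
wse_m = [152.0, 152.4, 153.0]

# Hypothetical floodplain cells: (river mile, ground elevation in meters)
cells = [(100.5, 151.7), (101.5, 153.2), (102.0, 152.0)]
depths = [inundation_depth(interpolate_wse(mile, stations, wse_m), ground) for mile, ground in cells]
```

Repeating this for every raster cell and every modeled day yields the daily depth grids used in the return-interval statistics.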

Inundation return interval statistics

Base unit for calculations: 1 date, water depth raster grid (30 m) for 1 area

Time series of rasters, 1 per day for 29,892 modeled days

…n dates…

Stack over time

Structured 3-dimensional matrix of data

x and y are geospatial coordinates (raster dimensions)

z is time coordinate (29,892 days)

Water depth for each x,y,z

[Figure axis label: Time]

Data structured as Hierarchical Data Format (HDF) on disk to allow computationally efficient slicing in time domain

Setting inundation threshold allows for identification of inundated periods per pixel

Can aggregate data by year

Evaluate inundation status by criteria (such as longest consecutive inundated period during growing season)

Summarize metrics across all modeled years
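For a single pixel, the three steps above (aggregate by year, evaluate a criterion, summarize across years) might be sketched as follows. The criterion shown is the longest run of consecutive inundated days within the growing season; the April-September season, the 0.05 m threshold, and the sample depths are assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import mean

def longest_run_by_year(dates, depths, threshold=0.0, season_months=range(4, 10)):
    """For one pixel, return {year: longest run of consecutive in-season days with depth > threshold}."""
    longest = {}
    run = 0
    previous_day = None
    for day, depth in zip(dates, depths):
        longest.setdefault(day.year, 0)
        wet = day.month in season_months and depth > threshold
        # A run only continues across calendar-adjacent days within the same year.
        consecutive = (
            previous_day is not None
            and day - previous_day == timedelta(days=1)
            and day.year == previous_day.year
        )
        if wet:
            run = run + 1 if consecutive else 1
            longest[day.year] = max(longest[day.year], run)
        else:
            run = 0
        previous_day = day
    return longest

def daily(start, values):
    """Return one date per value, starting at `start` and advancing one day at a time."""
    return [start + timedelta(days=i) for i in range(len(values))]

# Hypothetical water depths (meters) for one pixel in two modeled years.
depths_2001 = [0.0, 0.2, 0.3, 0.0, 0.1, 0.4, 0.5, 0.2, 0.0, 0.0]
depths_2002 = [0.3, 0.3, 0.0, 0.0, 0.0, 0.1]
dates = daily(date(2001, 4, 1), depths_2001) + daily(date(2002, 5, 1), depths_2002)
depths = depths_2001 + depths_2002

runs = longest_run_by_year(dates, depths, threshold=0.05)
mean_longest_run = mean(runs.values())  # summary across all modeled years
```

Applied to every x,y position in the stacked matrix, this produces one summary raster per metric.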

…n years…

Main Points

• Python scripts allow for dealing with data too big for one computer

  • Processing across virtual machines

  • Processing large files

• Time-series analyses on large datasets are useful for answering management questions

• Computational models are a useful supplement to field data
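One common way to handle the "processing large files" point above is to stream a file in fixed-size chunks and keep only running totals in memory. The sketch below does this with the standard library; the column name, chunk size, and in-memory sample file are hypothetical stand-ins for a much larger file on disk, and this is not presented as the project's actual approach.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of up to `chunk_size` rows so the full file never sits in memory."""
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def running_mean(handle, column, chunk_size=1000):
    """Compute the mean of one numeric column by accumulating totals chunk by chunk."""
    reader = csv.DictReader(handle)
    total = 0.0
    count = 0
    for chunk in iter_chunks(reader, chunk_size):
        values = [float(row[column]) for row in chunk]
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")

# Small in-memory stand-in for a large CSV file.
sample = io.StringIO("depth_m\n0.5\n1.5\n2.0\n0.0\n")
mean_depth = running_mean(sample, "depth_m", chunk_size=2)
```

Because each chunk is processed independently before the totals are combined, the same pattern extends to handing chunks to separate processes or machines.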

Data Intensive Restoration?

• There have been many attempts at ecological restoration

• Meta-analysis of restoration success is nothing new

• What data are available to us in USGS/DOI that might lend themselves to these approaches?

• What data are needed by people implementing NRDAR restoration?

• How can NRDAR projects contribute useful information?

NRDAR Case Map and Document Library

Conclusions

• As scientists, we work in an expanding world of ‘big data’

• We can’t analyze data by ourselves – need tools

• Sharing data is important

• Ongoing projects are just beginning to utilize the scope of available datasets and capabilities of tools like Python

• What existing data are not fully utilized?

• Think big

• Add value

Questions?