Big data for big river science: data intensive tools, techniques, and projects at the USGS/Columbia Environmental Research Center


  • Big data for big river science: data intensive tools, techniques, and projects at the USGS/Columbia Environmental Research Center

    Ed Bulliner U.S. Geological Survey, Columbia Environmental Research Center

  • Goals of Presentation

    • How are the data available to us different than the past?

    • What different approaches are needed to analyze these data?

    • What questions are we asking and answering that we could not before?

    • ‘Big river science’ – four examples
    • How does this relate to NRDAR/ecological restoration?

  • “Big Data”

    • What is “big data”?
    • Emerging field
    • Several definitions – volume, variety, variability
    • Do we work with ‘big data’ or ‘lots of data’?
    • Is that distinction important?
    • Regardless of semantics, increasing scale and complexity of problems and necessary data
    • What do increasing amounts of data mean for science and scientists?
    • How do we get the most value from the data available to us?
    • Why is this important?

  • Data Intensive Science

    • Paradigm shift in how we do science

    • Can ask (and answer) new kinds of questions

    • New tools and techniques

  • Traditional versus Data-Intensive Analyses

    • Where do we see ‘data-intensive’ science?
      • Within river science?
      • Within USGS/government?
    • Why now? (what’s different?)
      • Data availability
      • Data resolution
      • Computational power
    • What are the different tools and approaches currently used?

  • Tools for Data-Intensive Analyses

    • Data storage
      • Increased hard drive space
      • Databases
    • Data manipulation
      • Scripting languages
      • Web scraping/data ‘munging’/data mining
    • Modeling
      • Scripting languages
      • Modeling packages
    • Data visualization

  • Python

    [Diagram: Python as a hub linking OS operations, web queries, database integration, IDL, ArcGIS & ArcPy, data visualization, and statistics]

    • General purpose scripting language
    • Lots of modules
    • Free*
    • Tools for:
      • Data management
      • Data filtering/cleaning
      • Scientific computing
      • Geospatial analyses
      • Plotting
      • Collaborating
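As a flavor of the kind of data management and filtering described above, here is a minimal sketch using only the standard library; the gage site, dates, and discharge values are hypothetical stand-ins for a downloaded record.

```python
# Minimal sketch: filtering a (hypothetical) table of daily gage
# discharges with the Python standard library alone.
import csv
import io
import statistics

# Stand-in for a downloaded gage record (site, date, discharge in m^3/s)
raw = """site,date,discharge
06934500,2015-06-01,2300
06934500,2015-06-02,2450
06934500,2015-06-03,1980
"""

rows = list(csv.DictReader(io.StringIO(raw)))
flows = [float(r["discharge"]) for r in rows]
mean_flow = statistics.mean(flows)

# Keep only the dates with above-average discharge
high_days = [r["date"] for r in rows if float(r["discharge"]) > mean_flow]
print(mean_flow)   # → 2243.3333333333335
print(high_days)   # → ['2015-06-01', '2015-06-02']
```

In practice the same pattern scales to thousands of records pulled by a web query instead of an inline string.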

  • Pretty cool, but what can we use it for?

  • Question: Where do riverine sandbars exist and how do they change over time?

  • Create database of rivers and flows
    • Mask active channel within overlap of rivers and Landsat images
    • Integrate Landsat metadata with corresponding discharge data through relational database
    • Query imagery by discharge/date
    • Automated download and analysis of imagery – time series of sandbars
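The relational linkage in the workflow above can be sketched with Python's built-in `sqlite3`; the table layout, scene IDs, and discharge threshold here are illustrative, not the project's actual schema.

```python
# Sketch: join (hypothetical) Landsat scene metadata to same-day gage
# discharge, then query for scenes acquired at low flow.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE scenes (scene_id TEXT, acq_date TEXT);
CREATE TABLE discharge (obs_date TEXT, q_cms REAL);
INSERT INTO scenes VALUES ('LC80270332015152', '2015-06-01'),
                          ('LC80270332015168', '2015-06-17');
INSERT INTO discharge VALUES ('2015-06-01', 2300.0),
                             ('2015-06-17', 950.0);
""")

# Scenes acquired when discharge was below 1500 m^3/s (more exposed sand)
low_flow = con.execute("""
    SELECT s.scene_id, d.q_cms
    FROM scenes s JOIN discharge d ON s.acq_date = d.obs_date
    WHERE d.q_cms < 1500
""").fetchall()
print(low_flow)  # → [('LC80270332015168', 950.0)]
```

The query result can then drive automated download and batch analysis of just those images.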

  • Identified areas of persistent sand
    • Investigated flows where sand was exposed
    • Examined spatial variation
    • Used metrics of exposure to help model success of Least Tern nests

  • Main Points

    • Scripts and databases allow for automated downloading and linking of multiple data types
    • Too much data for manual analysis
    • Python can be used to batch-process images across programs without manual intervention
    • Scripted tools can be used to directly query, plot, and perform statistics on image data

  • Question: What information can we synthesize from a 400+ day archive of field measurements?

  • [Figure: ADCP cross section showing velocity ensembles and velocity bins through the water column above the river bottom, with 4-beam depths; velocity in meters per second, fast to slow]

    • Velocities and depths measured along regular transects
    • Lateral, longitudinal, and vertical variability

  • ADCP and single-beam survey dates, locations, and discharges, 2000–2015 (symbolized by flow percentile)

  • Compiled over 32,000 individual cross sections from 2000-2015
    • Joined dataset to river mile and gage to allow discharge-specific queries
    • Can group data by location along river and varying discharge levels to compare
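The grouping step described above can be sketched with pandas; the column names, reach binning, and flow-class cutoff are hypothetical, chosen only to show the pattern.

```python
# Sketch: summarize a cross-section archive by reach and flow class.
import pandas as pd

# Stand-in for the compiled archive (river mile, discharge, mean depth)
xs = pd.DataFrame({
    "river_mile": [620.1, 620.4, 750.2, 750.8],
    "q_cms":      [2300,  900,   2100,  1000],
    "depth_m":    [4.2,   2.1,   5.0,   2.8],
})

# Bin by 10-mile reach and a low/high flow split, then average per group
xs["reach"] = (xs["river_mile"] // 10) * 10
xs["flow_class"] = pd.cut(xs["q_cms"], [0, 1500, 1e9], labels=["low", "high"])
summary = xs.groupby(["reach", "flow_class"], observed=True)["depth_m"].mean()
print(summary)
```

The same one-liner groupby answers "how does depth (or velocity) at this reach compare across discharge levels" interactively.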

  • Ongoing restoration question: how does habitat (velocity) compare in river chutes versus main channel?
    • Chutes = restoration
    • 37 field days where measurements in chutes were taken incidentally or deliberately
    • Can use geospatial tools and scripts to come up with relevant comparisons

  • Measurement archive in lieu of hydrodynamic model – sturgeon spawning locations?

  • Main Points

    • Scripts and databases allow for efficient querying and cleaning of archived datasets

    • Python can be used to quickly and interactively summarize datasets by specific groupings

    • Existing data can be repurposed and integrated with new data for value-added analyses using scripting

  • Question: How can we better visualize field measurements of channel velocity and bathymetry?

  • Measurements of velocity collected along ‘regular’ transects
    • Python used to interpolate data into structured grid (3D matrix)
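The interpolation step above can be sketched with SciPy's `griddata`; the sample points and velocities below are synthetic, and the real workflow grids a third (vertical) dimension as well.

```python
# Sketch: interpolate scattered transect measurements onto a regular grid.
import numpy as np
from scipy.interpolate import griddata

# Scattered (x, y) sample locations and measured velocities (m/s)
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vel = np.array([0.0, 1.0, 0.0, 1.0])  # velocity increases with x

# Structured grid to interpolate onto
gx, gy = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
grid_vel = griddata(pts, vel, (gx, gy), method="linear")
print(grid_vel[0])  # → [0.  0.5 1. ]  (along x at y = 0)
```

Once gridded, the structured array can be written out for a visualization tool to render as volumes or flowlines.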

  • ParaView

  • Can visualize flowlines around structures (biology)
    • Identified bias in field measurements?

  • Noticed systematic bias
    • Collaborating with ILWSC
    • 33 million+ data points!

  • Main Points

    • Python scripts allow for interpolation and visualization of field data

    • Using open-source (free) tools along with Python allows for replication of abilities from more expensive software

    • New insights can be gained from visualizing data in different ways

  • Question: How can we better characterize inundation patterns along the Missouri River?

  • Hydrodynamic (HEC-RAS) model provided by USACE describing water surface elevations at cross sections over time
    • Used scripting to extend cross sections across floodplain for Missouri River

  • Merged LIDAR and channel data provides high-resolution characterization of floodplain elevation
    • Spatial interpolations of water elevation
    • Calculations of inundation depths
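The depth calculation above reduces to a raster subtraction; a minimal NumPy sketch with synthetic elevations (the real grids are 30 m rasters):

```python
# Sketch: inundation depth = interpolated water-surface elevation minus
# floodplain elevation, floored at zero for dry cells. Values synthetic.
import numpy as np

dem = np.array([[180.0, 181.5],
                [182.0, 184.0]])   # merged LIDAR/channel elevations, m
wse = np.full((2, 2), 182.0)       # interpolated water surface, m

depth = np.clip(wse - dem, 0.0, None)  # negative differences -> 0 (dry)
print(depth)  # → [[2.  0.5]
              #    [0.  0. ]]
```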

  • Inundation return interval statistics

  • Base unit for calculations: 1 date, water depth raster grid (30 m) for 1 area
    • Time series of rasters, 1 per day for 29,892 modeled days

  • Stack over time
    • Structured 3-dimensional matrix of data: x and y are geospatial coordinates (raster dims), z is time coordinate (29,892 days)
    • Water depth for each x, y, z

  • Data structured as hierarchical data format (HDF) on disk to allow computationally efficient slicing in time domain
    • Setting inundation threshold allows for identification of inundated periods per pixel
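The per-pixel threshold test can be sketched as below. A small in-memory array stands in for the stack; in practice the stack lives in an HDF5 file (e.g. read via h5py) so a time-axis slice loads efficiently from disk. The threshold value is illustrative.

```python
# Sketch: flag inundated pixels per day, then count wet days per pixel.
import numpy as np

# depth[t, y, x]: water depth per 30 m pixel per modeled day (synthetic)
depth = np.array([
    [[0.0, 0.2]],
    [[0.1, 0.3]],
    [[0.0, 0.0]],
])  # 3 days, 1 x 2 pixels

threshold = 0.05                 # m; illustrative inundation cutoff
inundated = depth > threshold    # boolean stack, same shape as depth
days_wet = inundated.sum(axis=0)  # days inundated per pixel
print(days_wet)  # → [[1 2]]
```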

  • Can aggregate data by year
    • Evaluate inundation status by criteria (such as longest consecutive inundated period during growing season)
    • Summarize metrics across all modeled years
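One of the criteria named above, the longest consecutive inundated period, can be sketched for a single pixel's boolean time series:

```python
# Sketch: longest run of consecutive True values (inundated days)
# in one pixel's boolean time series.
import numpy as np

def longest_run(wet):
    """Length of the longest run of True values in a 1-D boolean array."""
    best = run = 0
    for w in wet:
        run = run + 1 if w else 0  # extend the current run or reset it
        best = max(best, run)
    return best

wet = np.array([True, True, False, True, True, True, False])
print(longest_run(wet))  # → 3
```

Applied pixel-by-pixel within a growing-season slice, then aggregated by year, this yields the summary metrics.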

  • Main Points

    • Python scripts allow for dealing with data too big for one computer
      • Processing across virtual machines
      • Processing large files
    • Time-series analyses on large datasets are useful for answering management questions
    • Computational models are a useful supplement to field data

  • Data Intensive Restoration?

    • There have been many attempts at ecological restoration

    • Meta-analysis of restoration success is nothing new

    • What data are available to us in USGS/DOI that might lend themselves to these approaches?

    • What data are needed by people implementing NRDAR restoration?

    • How can NRDAR projects contribute useful information?

  • NRDAR Case Map and Document Library

  • Conclusions

    • As scientists, we work in an expanding world of ‘big data’
    • We can’t analyze data by ourselves – need tools
    • Sharing data is important
    • Ongoing projects are just beginning to utilize the scope of available datasets and capabilities of tools like Python
    • What existing data are not fully utilized?
    • Think big
    • Add value

  • Questions?

