TRANSCRIPT
Solutions for Today | Options for Tomorrow
Jennifer Bauer, Jenny DiGiulio, Devin Justman, Lucy Romeo, Kelly Rose, Patrick Wingo
Starting “small” to go Big: Building a Living Database
Michael Sabbatino1,2, Baker, D.V. “Vic”3,4, Rose, K.1, Romeo, L.1,2, Bauer, J.1, and Barkhurst, A.3,4
POC: [email protected]
1US Department of Energy, National Energy Technology Laboratory, Albany, OR; 2AECOM, Albany, OR; 3Mid-Atlantic Technology Research & Innovation Center (MATRIC), Morgantown, WV; 4US Department of Energy, National Energy Technology Laboratory, Morgantown, WV
Challenges & Needs of Scientific Data
Data Access
• ~80% loss of published data after 20 years
Data Discovery
• 20% public data versus 80% private
Data Interoperability
• Variety of data makes it difficult to create, exchange, & use data across different applications and systems
Data Analytics & Visualization
• Requires advanced computational capabilities, algorithms, & large data stores to analyze these data
80% Dark Data
Working up the Data Science Hierarchy of Needs (bottom to top)
• Collect: Instrumentation, logging, sensors, external data, user generated content
• Move & Store: Reliable data flow, infrastructure, pipelines, structured and unstructured data storage
• Explore & Transform: Cleaning, anomaly detection, prep
• Aggregate & Label: Analytics, metrics, segments, aggregates, features, training data
• Learn & Optimize: A/B testing, experimentation, simple ML algorithms
• AI & Deep Learning
“small” Beginnings: Developing a Global Oil & Gas Database
Collect
• Discovered & integrated open data sources of information related to oil & gas infrastructure across the globe
• Machine Learning Automated Approach: a tool that scans “seed” resources and identifies relevant keywords, then crawls the web and parses the data for integration
• >700 datasets, >4 million features
https://edx.netl.doe.gov/dataset/global-oil-gas-features-database
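The "seed" keyword idea can be illustrated with a minimal sketch. This is not the NETL tool: the stopword list, the function names (`seed_keywords`, `relevance`), and the sample texts are all hypothetical, and no actual web crawling is performed. It only shows how keywords drawn from seed resources could be used to rank candidate pages before parsing them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "of", "a", "in", "to", "for", "is", "on", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def seed_keywords(seed_texts, top_n=3):
    """Return the most frequent words across all seed resources."""
    counts = Counter()
    for text in seed_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(top_n)]

def relevance(page_text, keywords):
    """Fraction of seed keywords that appear in a candidate page."""
    if not keywords:
        return 0.0
    tokens = set(tokenize(page_text))
    return sum(1 for k in keywords if k in tokens) / len(keywords)

# Made-up seed resources and candidate pages.
seeds = [
    "Pipeline and well locations for oil and gas infrastructure",
    "Gas pipeline network and refinery locations",
]
keywords = seed_keywords(seeds, top_n=3)

candidates = {
    "page_a": "Open dataset of gas pipeline and refinery locations",
    "page_b": "Recipes for sourdough bread",
}
ranked = sorted(
    candidates,
    key=lambda name: relevance(candidates[name], keywords),
    reverse=True,
)
print(keywords, ranked)
```

A real crawler would also need link discovery, rate limiting, and format-specific parsers; the scoring step here is only the simplest part of that pipeline.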
EDX - A Virtual Library & Laboratory for Energy Science
• Virtualizing team analytics
• Continuing innovations to connect researchers to online Earth-Energy system resources
• Increasing number of tools & apps for use in team workspaces
Move & Store
https://edx.netl.doe.gov
Data Workflows & Structure
• Custom “smart search” tool in development
• Digital spatial team “notebook”
• Auto-indexing algorithm that analyzes your search and helps recommend other items
EDX vs Dark Data
80% Dark Data
EDX Smart Search - A machine learning, big data tool for rapid, online, .Zip, & FTP spatial & non-spatial data mining with Hadoop + Bing + ESRI
Explore & Transform
The Living Database
• Store & Share Data in a Structured Secure Database Environment
• Reduce Redundant Acquisition
• Direct Data Access (not file based storage)
• Consistent Data with Staff Turnover
• Enhance Collaboration
• Curation of data and knowledge
• Allows Direct Analysis from Database
Storing Databases with different data types, formats, & resolutions
Includes Data workflow, infrastructure, pipelines, structured & unstructured data
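To make "direct data access" and "direct analysis from database" concrete, here is a minimal sketch that assumes an in-memory SQLite database as a stand-in for whatever database engine the Living Database actually uses. The `features` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite is an assumption made for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE features (
        id INTEGER PRIMARY KEY,
        feature_type TEXT NOT NULL,
        country TEXT NOT NULL,
        source TEXT NOT NULL
    )
    """
)

# Made-up records from two hypothetical open data sources.
rows = [
    ("well", "US", "source_a"),
    ("well", "CA", "source_b"),
    ("pipeline", "US", "source_a"),
]
conn.executemany(
    "INSERT INTO features (feature_type, country, source) VALUES (?, ?, ?)",
    rows,
)
conn.commit()

# Analysis runs as a query against the shared store, not against copied files.
counts = dict(
    conn.execute(
        "SELECT feature_type, COUNT(*) FROM features GROUP BY feature_type"
    ).fetchall()
)
print(counts)
conn.close()
```

Because every user queries the same tables, a new team member sees the same records and structure as their predecessor, which is the point behind "Consistent Data with Staff Turnover".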
[Figure labels: People; Data Lifecycle Apps; Research External Apps]
• Developing tools & approaches to manage multiple heterogeneous datasets
• Develop a probabilistic approach to assess scientific data using big data analyses
• Develop stochastic approaches to reduce uncertainty
Improve joint analysis of multiple datasets: focus on advancing “Big Data” mining, machine learning, and advanced geoprocessing computing
Aggregate & Label
Tools, Analytics, & Metrics
Select relevant datasets
Combine data and tools to…
Highlight resultant data and analysis and reuse in further research
Evaluate correlations and spatio-temporal trends
EDX continues to evolve in response to the needs of its users and NETL’s knowledge transfer goals
Future Big Data Development & Analysis
Learn & Optimize
• EDX Cloud Services
• Living Database
• Common Operating Platform for Data Analytics
• GeoCube Spatial Data Viewer
• Fuzzy Logic Analytics (SIMPA)
• AWS Development
• Integration with decades of DOE R&D
• Federating Open Source data
• EDX & GeoCube (search and location)
• ID Data gaps in subsurface puzzle
• GOGI: Millions of Records
• Oil & Gas, Geothermal Data & Resources: Millions of Records
• Carbon Storage Data & Resources: Millions of Records
• Employing “smart” search tools to include open resources: Billions of Records
*These attributes and data sources are evolving quickly with the implementation of new tools and the engagement of key stakeholders
Advanced computer science & research
Developing Schema Matching AI
• Variety of data sources with diverse data schemas
• Manual schema matching is time consuming & inefficient
• Plan to develop new, and use existing, machine learning algorithms to match disparate data schemas:
• Schema level
• Element Level
• Structure Level
• Linguistic Matching
• Syntactic Techniques
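As an illustration of element-level matching, the sketch below combines a small synonym table (a crude stand-in for linguistic matching) with string similarity from Python's `difflib` (a syntactic technique). It is not the planned NETL approach and uses no machine learning; the synonym table, column names, and threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table standing in for linguistic matching.
SYNONYMS = {"lat": "latitude", "lon": "longitude", "long": "longitude"}

def normalize(name):
    """Split camelCase and punctuation, lowercase, and expand abbreviations."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)
    tokens = re.findall(r"[a-z0-9]+", spaced.lower())
    return " ".join(SYNONYMS.get(t, t) for t in tokens)

def similarity(a, b):
    """Syntactic similarity in [0, 1] between two normalized element names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_schemas(source_cols, target_cols, threshold=0.7):
    """Map each source column to its most similar target column, if any."""
    matches = {}
    for s in source_cols:
        best = max(target_cols, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            matches[s] = best
    return matches

# Made-up schemas from two hypothetical data sources.
source = ["WellName", "LAT", "Lon", "operator_nm"]
target = ["well_name", "latitude", "longitude", "operator_name", "depth_m"]
matches = match_schemas(source, target)
print(matches)
```

Name similarity alone cannot separate columns that share a label but differ in meaning or units, which is where schema-level and structure-level evidence would come in.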
~ Thank you! ~
Michael [email protected]
Questions?
Come check out this awesome poster!