building a data discovery network for sustainability science
DESCRIPTION
This is the slide deck for my SEAD presentation at the 3rd International VIVO Conference, held on August 24, 2012 in Miami, FL.
TRANSCRIPT
© Trustees of Indiana University
Released under Creative Commons 3.0 unported license; license terms on last slide.
Building a Data Discovery Network for Sustainability Science
Robert H. McDonald
Deputy Director, Data to Insight (D2I) Center
Associate Dean – IU Libraries
Indiana University
[email protected] | @mcdonald @SEADdatanet
Presented at the VIVO 2012 Conference
Miami, FL – August 24, 2012
Available from: http://slidesha.re/Q9q8VW
NSF DataNet Program
Motivation: “… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.”
Response: DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.
Current NSF DataNet Projects
• SEAD – http://sead-data.net
• DataONE – http://www.dataone.org
• DataNet Federation Consortium – http://datafed.org
• Terra Populus – https://www.pop.umn.edu/terra_pop
Sustainable Environment Actionable Data (SEAD) - DataNet
• SEAD Strategy
― Serve scientists and researchers in the “long tail” of science
― Leverage social media for discovery of data, interest, and expertise
― Move data curation upstream in the data life cycle of science
― Take advantage of existing domain and institutional infrastructures (Institutional Repositories, ICPSR) for long-term preservation
SEAD Partners - http://sead-data.net
SEAD TEAMS
Michigan: Margaret Hedstrom (PI), Ann Zimmerman (Co-PI), Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR), Jude Yew
Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Robert Ping
Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd
Illinois: Praveen Kumar (Co-PI), Md Aktaruzzaman, Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)
SEAD 18-month Pilot Phase
• Domain Engagement:
– National Center for Earth Systems Dynamics (NCESD), Illinois River Basin Observatory
– Requirements, Use Cases, Prioritization of Data Types and Services
• Active and Social Curation
– Pilot Active Content Repository, VIVO deployments
– Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici)
• CI for Long-term Access (Virtual Archive)
– Data model, protocol design/development
– Pilot Federated Repository infrastructure
• Education, Outreach, and Training
– Post-doc mentoring
– Web site, training materials, meetings, workshops, …
• Project Oversight
– Management, reporting, committees
– Business model development
Sustainability Science
Science
Technology
Economics
Poverty & Justice
Policy
Cooperation
Data challenges
• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets
The long tail of scientific research
• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term value
• Value of data grows when combined & integrated
SEAD notions of defined Data Phases
• Phases of data lifecycle acknowledge and accommodate the difference between public data and data still in work by a researcher.
• Research Data Phase: the data set is a research data collection, owned by an individual and under their control.
– Data need not be licensed at this time because it is not ready for broader release
– Data need not have permanent IDs because it is still a work in progress
– Corresponds to first existence in the Active Curation Repository
• Published Phase: the owner of the research data collection determines that the dataset is ready for publication
– License terms set
– Persistent ID assigned
– Made available as part of public profile in VIVO
– Activated by user-controlled publish event
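The two phases above can be pictured as a small state change. The sketch below is an illustration only: the `Collection` record, its field names, and the `publish` method are hypothetical and are not taken from SEAD's software, and the DOI-like string in the usage note is a made-up placeholder. It shows only that license terms and a persistent ID are absent during the research phase and are set by an owner-triggered publish event.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"    # work in progress, under the owner's control
    PUBLISHED = "published"  # released by the owner


@dataclass
class Collection:
    """Hypothetical record for a research data collection."""
    owner: str
    title: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None  # not required in the research phase
    persistent_id: Optional[str] = None  # not required in the research phase

    def publish(self, actor: str, license_terms: str, persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if actor != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("publication needs license terms and a persistent ID")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED
```

For example, `Collection(owner="alice", title="River gauge readings")` starts in `Phase.RESEARCH` with no license or ID; calling `publish("alice", "CC-BY-3.0", "doi:10.0000/example")` moves it to `Phase.PUBLISHED`.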
SEAD CI Technical Approach
(Architecture diagram; labels as they appear on the slide)
• User / Contributor
• Active and Social Curation
• Curation Boundary
• OAIS Repository Federation
• Data Acquisition, Analysis and Simulation
• Search, Browse, Annotation, Visualization Tools
• Use, Reuse, Repurposing Tools
• Access Mechanisms and E-Scholarship Services
• Scholarly Communication
• Active Content Repository
• VIVO/Linked Data
• Metadata Management: DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …
• Automated Curation Workflow/Rule Engine: operates on Metadata, Content Objects and Trigger Events
• Appraisal and Selection
• Digital Repository Federation (OAIS compliant)
• Preservation Actions
• Migration and Emulation Tools
• Compound Objects - OAI-ORE
• Dissemination Packages
• Ingest, AIPs
• Ingest scripts: fixity, integrity, authentication, transformation
• Wide-Area File System
SEAD CI Technical Approach
(Same architecture diagram as the first SEAD CI Technical Approach slide, here with the added label "SEAD Active Data Systems" and the callout below.)
A standardized data model and federation capability over OAIS-Standard Institutional Repositories
SEAD CI Technical Approach
(Same architecture diagram as the first SEAD CI Technical Approach slide, here with the federation labeled "SEAD Trusted Digital Repository Federation (OAIS compliant)" and the callout below.)
A robust, replicated distributed file system used as a large-scale backing store
SEAD CI Technical Approach
(Same architecture diagram as the first SEAD CI Technical Approach slide, with the callout below.)
An Active Content Repository based on standard global IDs and semantic web technologies, to collect and integrate data, metadata, and provenance information from multiple sources.
Callout labels: Content (repeated four times); Lustre File System; DC:Creator, OPM:wasDerivedFrom, SWAN:isEvidenceFor, …
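As a rough illustration of how relationships such as DC:Creator and OPM:wasDerivedFrom can tie content items together, the sketch below stores statements as plain (subject, predicate, object) tuples and walks the derivation links. The `urn:example:` identifiers and both helper functions are hypothetical; only the predicate names come from the slide, and this is not how the Active Content Repository itself is implemented.

```python
# Statements as (subject, predicate, object) tuples.
# All identifiers are made-up placeholders; predicates are from the slide.
TRIPLES = [
    ("urn:example:dataset/2", "OPM:wasDerivedFrom", "urn:example:dataset/1"),
    ("urn:example:dataset/3", "OPM:wasDerivedFrom", "urn:example:dataset/2"),
    ("urn:example:dataset/1", "DC:Creator", "urn:example:person/a"),
    ("urn:example:dataset/3", "SWAN:isEvidenceFor", "urn:example:claim/x"),
]


def objects(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`, in list order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def provenance_chain(triples, subject):
    """Follow OPM:wasDerivedFrom links breadth-first back to the sources."""
    chain = []
    seen = {subject}
    frontier = [subject]
    while frontier:
        current = frontier.pop(0)
        for parent in objects(triples, current, "OPM:wasDerivedFrom"):
            if parent not in seen:  # guard against cycles and repeats
                seen.add(parent)
                chain.append(parent)
                frontier.append(parent)
    return chain
```

Calling `provenance_chain(TRIPLES, "urn:example:dataset/3")` walks from the derived dataset back through `dataset/2` to `dataset/1`, which is the kind of provenance question the callout has in mind.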
SEAD CI Technical Approach
(Same architecture diagram as the first SEAD CI Technical Approach slide, with the callouts below.)
SEAD will run a VIVO instance and may harvest Linked Data from other sources
VIVO Application: Open Source federatable Researcher Information – people, papers, projects, centers, fields, etc.
SEAD CI Technical Approach
(Same architecture diagram as the first SEAD CI Technical Approach slide, with the callout below.)
Active and Social Curation Services supporting automated and interactive use of SEAD, leveraging standard web application/web service toolkits and virtual machine infrastructure
SEAD CI Technical Approach
(Same architecture diagram as the first SEAD CI Technical Approach slide, with the callout below.)
Curation and Preservation Services also leveraging standard web application/web service toolkits and virtual machine infrastructure
SEAD Data Curation Lifecycle Elements
(Same architecture diagram as the first SEAD CI Technical Approach slide.)
SEAD Active/Social Curation Repository
SEAD VIVO: RIS2N3
SEAD Virtual Archive
Faceted search (Solr-based)
Facets
Search Result
A dataset or file looks like this
Geospatial search (from Postgres index)
Geospatial search results
Login for data upload
Upload file
Files from Medici can also be added
Create collection (can have multiple files)
Upload complete
Data ingested to DSpace (Mississippi example)
SEAD Virtual Archive Architecture
(Architecture diagram; labels as they appear on the slide)
• Client / UI
• SEAD Ingest: receives SIP (Data + Metadata), returns Ack
• Data Validation (Fixity, Virus)
• Preservation Metadata Generation (Events)
• Feature Extraction from Data
• SIP breakdown: AIP + Data; Geospatial + Temporal Metadata; Core Property + Domain Metadata
• Obtain DOI from DataCite (IU DataCite ID Server)
• Store data object, its metadata object, and its relationship record (latter as RDF) in IR as a collection (IR: DSpace)
• Register DOI with VIVO (VIVO server)
• Register metadata to Solr and PostgreSQL for rapid retrieval of metadata (Solr Index, PostgreSQL Index)
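The fixity part of the data validation step above can be sketched as follows. This is an illustration only, assuming a hypothetical in-memory SIP layout (a `files` map of payload bytes and a `manifest` of SHA-256 checksums); SEAD's actual SIP format, virus scanning, and ingest code are not shown on the slide and are not reproduced here. The function names `check_fixity` and `ingest` are likewise made up for the example.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex-encoded SHA-256 digest of a payload."""
    return hashlib.sha256(payload).hexdigest()


def check_fixity(sip: dict) -> list:
    """Compare each file in the SIP against its manifest checksum.

    Returns a list of problem descriptions; an empty list means every
    file matched. The SIP layout is an assumption made for this sketch.
    """
    problems = []
    for name, payload in sorted(sip["files"].items()):
        expected = sip["manifest"].get(name)
        if expected is None:
            problems.append(f"{name}: no checksum in manifest")
        elif sha256_hex(payload) != expected:
            problems.append(f"{name}: checksum mismatch")
    return problems


def ingest(sip: dict) -> dict:
    """Return a simple acknowledgement for the submitting client."""
    problems = check_fixity(sip)
    return {"accepted": not problems, "problems": problems}
```

A client would build the manifest with `sha256_hex` before submission; `ingest` then reports `accepted: True` only when every payload still matches its recorded checksum.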
Key Questions for SEAD Prototype
• What could SEAD capture when?
• How can SEAD provide direct value to data producers, users, and curators?
• How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?
Towards A Shared Data Future
Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010
Trust
Data Curation
Data Generators Users
Community Support Services
Common Data Services
User functionalities, data capture & transfer, virtual research environments
Data discovery & navigation, workflow generation, annotation, interpretability
Persistent storage, identification, authenticity (provenance), workflow execution, data mining
Data Interoperability
• NSF OCI: DataNet and INTEROP now DIBBs
• EUDAT
• Data Web Forum
• IETF Research Data Identifier BOF
• Upcoming Oct. US Meeting of DataNet, INTEROP, Data Web Forum
Acknowledgements
SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824
• For more on SEAD go to: http://sead-data.net
• Follow us on Twitter @SEADdatanet
License terms
• Please cite as: McDonald, R.H. et al. Building a Data Discovery Network for Sustainability Science. 3rd International VIVO Conference, Miami, FL, 24 August 2012. Available from: http://slidesha.re/Q9q8VW
• Thanks to Margaret Hedstrom, who’s guided the team through the (really) lengthy review process and to Jim Myers, Beth Plale, Praveen Kumar, Terry McLaren, Luigi Marini, Kavitha Chandrasekar and others who provided content for this presentation.
• The concepts and software being leveraged in SEAD represent the work of a broad range of people over multiple years – their contributions have been critical to launching SEAD.
• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.