Building a Data Discovery Network for Sustainability Science

Post on 17-Nov-2014


DESCRIPTION

This is the slide deck for my SEAD presentation at the 3rd International VIVO Conference, held on August 24, 2012 in Miami, FL.

TRANSCRIPT

© Trustees of Indiana University. Released under the Creative Commons Attribution 3.0 Unported license; license terms on last slide.

Building a Data Discovery Network for Sustainability Science

Robert H. McDonald

Deputy Director, Data to Insight (D2I) Center

Associate Dean – IU Libraries

Indiana University

rhmcdona@indiana.edu | @mcdonald @SEADdatanet

Presented at the VIVO 2012 Conference, Miami, FL, August 24, 2012

Available from: http://slidesha.re/Q9q8VW

NSF DataNet Program

Motivation: “… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.”

Response: DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.

Current NSF DataNet Projects

• SEAD – http://sead-data.net

• DataONE – http://www.dataone.org

• DataNet Federation Consortium – http://datafed.org

• Terra Populus – https://www.pop.umn.edu/terra_pop

Sustainable Environment Actionable Data (SEAD) - DataNet

• SEAD Strategy
– Serve scientists and researchers in the “long tail” of science
– Leverage social media for discovery of data, interest, and expertise
– Move data curation upstream in the data life cycle of science
– Take advantage of existing domain and institutional infrastructures (Institutional Repositories, ICPSR) for long-term preservation

SEAD Partners - http://sead-data.net

SEAD TEAMS

• Michigan: Margaret Hedstrom (PI), Ann Zimmerman (Co-PI), Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR), Jude Yew

• Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Robert Ping

• Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd

• Illinois: Praveen Kumar (Co-PI), Md Aktaruzzaman, Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)

SEAD 18-Month Pilot Phase

• Domain Engagement
– National Center for Earth Systems Dynamics (NCESD), Illinois River Basin Observatory
– Requirements, Use Cases, Prioritization of Data Types and Services

• Active and Social Curation
– Pilot Active Content Repository, VIVO deployments
– Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici)

• CI for Long-term Access (Virtual Archive)
– Data model, protocol design/development
– Pilot Federated Repository infrastructure

• Education, Outreach, and Training
– Post-doc mentoring
– Web site, training materials, meetings, workshops, …

• Project Oversight
– Management, reporting, committees
– Business model development

Sustainability Science

[Diagram labels: Science, Technology, Economics, Poverty & Justice, Policy, Cooperation]

Data challenges

• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets

The long tail of scientific research

• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term value
• Value of data grows when combined & integrated

SEAD notions of defined Data Phases

• The phases of the data lifecycle acknowledge and accommodate the difference between public data and data a researcher is still working on.

• Research Data Phase: the data set is a research data collection, owned by an individual and under their control.
– Data need not be licensed at this time because it is not ready for broader release
– Data need not have permanent IDs because it is still a work in progress
– Corresponds to first existence in the Active Curation Repository

• Published Phase: the owner of the research data collection determines that the dataset is ready for publication.
– License terms set
– Persistent ID
– Made available as part of public profile in VIVO
– Activated by user-controlled publish event
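The two phases can be read as a small state machine. The sketch below is only an illustration of that reading, not SEAD's implementation: the class name `DataCollection`, its fields, and the placeholder DOI are all hypothetical. It shows a collection starting in the research phase without a license or persistent ID, and moving to the published phase only through a publish event triggered by its owner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"    # work in progress, under the owner's control
    PUBLISHED = "published"  # licensed, identified, publicly visible


@dataclass
class DataCollection:
    """Hypothetical model of a research data collection (not a SEAD class)."""
    owner: str
    title: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None   # not required in the research phase
    persistent_id: Optional[str] = None   # not required in the research phase

    def publish(self, requested_by: str, license_terms: str, persistent_id: str) -> None:
        """User-controlled publish event: only the owner may trigger it."""
        if requested_by != self.owner:
            raise PermissionError("only the owner can publish this collection")
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("publication requires license terms and a persistent ID")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED


# Usage: the owner decides when the collection is ready for release.
collection = DataCollection(owner="alice", title="River basin sediment samples")
collection.publish("alice", license_terms="CC BY 3.0",
                   persistent_id="doi:10.0000/example")  # placeholder identifier
```

Making the license and the identifier arguments of `publish` rather than of the constructor mirrors the slide's point that neither is needed until the owner chooses to release the data.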

CI Technical Approach

SEAD CI Technical Approach

[Architecture diagram. Region labels: Active and Social Curation; OAIS Repository Federation; Curation Boundary. Component labels:]

• User, Contributor
• Data Acquisition, Analysis and Simulation
• Search, Browse, Annotation, Visualization Tools
• Use, Reuse, Repurposing Tools
• Access Mechanisms and E-Scholarship Services
• Scholarly Communication
• Active Content Repository
• VIVO/Linked Data
• Metadata Management: DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …
• Automated Curation Workflow/Rule Engine: operates on Metadata, Content Objects and Trigger Events
• Migration and Emulation Tools
• Appraisal and Selection
• Digital Repository Federation (OAIS compliant)
• Preservation Actions
• Compound Objects - OAI-ORE
• Ingest, AIPs
• Dissemination Packages
• Wide-Area File System
• Ingest scripts: fixity, integrity, authentication, transformation

SEAD CI Technical Approach (diagram repeated)

• SEAD Active Data Systems
• A standardized data model and federation capability over OAIS-standard Institutional Repositories

SEAD CI Technical Approach (diagram repeated)

• SEAD Trusted Digital Repository Federation (OAIS compliant)
• A robust, replicated distributed file system used as a large-scale backing store

SEAD CI Technical Approach (diagram repeated)

• An Active Content Repository based on standard global IDs and semantic web technologies, to collect and integrate data, metadata, and provenance information from multiple sources
• Lustre File System
• Metadata predicates shown: DC:Creator, OPM:wasDerivedFrom, SWAN:isEvidenceFor, …
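To make the idea of integrating metadata and provenance under global IDs more concrete, here is a minimal in-memory sketch of subject/predicate/object statements using the predicate names from the slide. It is an assumption-laden illustration, not the Active Content Repository's data model: the `TripleStore` class, the `urn:example:` identifiers, and the creator name are all hypothetical. It shows how a chain of OPM:wasDerivedFrom statements lets a derived dataset be traced back to its sources.

```python
from collections import defaultdict


class TripleStore:
    """Hypothetical in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Return all objects recorded for a subject under one predicate."""
        return [o for p, o in self._by_subject.get(subject, []) if p == predicate]

    def lineage(self, subject):
        """Follow OPM:wasDerivedFrom links breadth-first back to the sources."""
        chain, seen, frontier = [], {subject}, [subject]
        while frontier:
            current = frontier.pop(0)
            for parent in self.objects(current, "OPM:wasDerivedFrom"):
                if parent not in seen:  # guard against cycles
                    seen.add(parent)
                    chain.append(parent)
                    frontier.append(parent)
        return chain


# Usage with made-up identifiers: a map derived from a cleaned table,
# which was itself derived from a raw sensor log.
store = TripleStore()
store.add("urn:example:derived-map", "DC:Creator", "A. Researcher")
store.add("urn:example:derived-map", "OPM:wasDerivedFrom", "urn:example:cleaned-table")
store.add("urn:example:cleaned-table", "OPM:wasDerivedFrom", "urn:example:raw-sensor-log")

sources = store.lineage("urn:example:derived-map")
# sources == ["urn:example:cleaned-table", "urn:example:raw-sensor-log"]
```

A production system would more likely use an RDF library and a persistent triple store; plain tuples are used here only to keep the sketch self-contained.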

SEAD CI Technical Approach (diagram repeated)

• SEAD will run a VIVO instance and may harvest Linked Data from other sources
• VIVO Application: open source, federatable researcher information (people, papers, projects, centers, fields, etc.)

SEAD CI Technical Approach (diagram repeated)

• Active and Social Curation Services supporting automated and interactive use of SEAD, leveraging standard web application/web service toolkits and virtual machine infrastructure

SEAD CI Technical Approach (diagram repeated)

• Active Content Repository
• Curation and Preservation Services also leveraging standard web application/web service toolkits and virtual machine infrastructure

SEAD Data Curation Lifecycle Elements

[Diagram repeated with the same component labels as the SEAD CI Technical Approach slide.]

SEAD Active/Social Curation Repository

SEAD VIVO: RIS2N3

SEAD Virtual Archive

Faceted search (Solr-based)

Facets

Search Result

A dataset or file looks like this

Geospatial search (from Postgres index)

Geospatial search results

Login for data upload

Upload file

Files from Medici can also be added

Create collection (can have multiple files)

Upload complete

Data ingested to DSpace (Mississippi example)

SEAD Virtual Archive Architecture

[Architecture diagram. Steps and components labeled in the diagram:]

• SEAD Ingest Client/UI: SIP (Data + Metadata), Ack
• Data Validation (Fixity, Virus)
• Preservation Metadata Generation (Events)
• Feature Extraction from Data
• SIP breakdown: AIP + Data; Geospatial + Temporal Metadata; Core Property + Domain Metadata
• Solr Index, PostgreSQL Index
• Obtain DOI from DataCite (IU DataCite ID Server)
• Store data object, its metadata object, and its relationship record (latter as RDF) in IR (DSpace) as a collection
• Register DOI with VIVO (VIVO server)
• Register metadata to Solr and PostgreSQL for rapid retrieval of metadata
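The ingest steps named in the architecture can be sketched as a short pipeline. This is a simplified illustration under stated assumptions, not the Virtual Archive's code: the `SIP` and `AIP` classes, the `ingest` function, and the placeholder DOI are hypothetical. The DOI minter is passed in as a plain callable instead of calling any DataCite service, a dictionary stands in for the Solr and PostgreSQL indexes, and virus scanning, feature extraction, storage in DSpace, and VIVO registration are left out.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SIP:
    """Submission package: data plus metadata and a declared checksum."""
    data: bytes
    metadata: dict
    declared_sha256: str


@dataclass
class AIP:
    """Archival package: data, metadata, preservation events, identifier."""
    data: bytes
    metadata: dict
    events: list = field(default_factory=list)
    doi: str = ""


def validate_fixity(sip: SIP) -> bool:
    """Fixity check: recompute the checksum and compare with the declared one."""
    return hashlib.sha256(sip.data).hexdigest() == sip.declared_sha256


def ingest(sip: SIP, mint_doi, search_index: dict) -> AIP:
    """Validate a SIP, record a preservation event, mint an ID, index metadata."""
    if not validate_fixity(sip):
        raise ValueError("fixity check failed: checksum mismatch")
    aip = AIP(data=sip.data, metadata=dict(sip.metadata))
    aip.events.append({
        "type": "fixity check",
        "outcome": "pass",
        "time": datetime.now(timezone.utc).isoformat(),
    })
    aip.doi = mint_doi(aip.metadata)          # stand-in for the ID server
    search_index[aip.doi] = aip.metadata      # stand-in for the search indexes
    return aip


# Usage with made-up data and a placeholder identifier.
payload = b"site,depth_m\nA,1.2\n"
sip = SIP(payload, {"title": "Example"}, hashlib.sha256(payload).hexdigest())
index = {}
aip = ingest(sip, mint_doi=lambda md: "doi:10.0000/placeholder", search_index=index)
```

Passing the minter and the index in as arguments keeps the validation logic testable without any external services, which is the main reason for this shape.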

Key Questions for SEAD Prototype

• What could SEAD capture when?

• How can SEAD provide direct value to data producers, users, and curators?

• How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?

Towards A Shared Data Future

Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010

Trust

Data Curation

Data Generators Users

Community Support Services

Common Data Services

User functionalities, data capture & transfer, virtual research environments

Data discovery & navigation, workflow generation, annotation, interpretability

Persistent storage, identification, authenticity (provenance), workflow execution, data mining

Data Interoperability

• NSF OCI: DataNet and INTEROP now DIBBs
• EUDAT
• Data Web Forum
• IETF Research Data Identifier BOF
• Upcoming Oct. US Meeting of DataNet, INTEROP, Data Web Forum

Acknowledgements

SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824

• For more on SEAD go to: http://sead-data.net

• Follow us on Twitter @SEADdatanet

License terms

• Please cite as: McDonald, R.H. et al. Building a Data Discovery Network for Sustainability Science. 3rd International VIVO Conference, Miami, FL, 24 August 2012. Available from: [http://slidesha.re/Q9q8VW]

• Thanks to Margaret Hedstrom, who’s guided the team through the (really) lengthy review process and to Jim Myers, Beth Plale, Praveen Kumar, Terry McLaren, Luigi Marini, Kavitha Chandrasekar and others who provided content for this presentation.

• The concepts and software being leveraged in SEAD represent the work of a broad range of people over multiple years – their contributions have been critical to launching SEAD.

• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.

• This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
