ESI Supplemental Webinar 2 - DataONE Presentation Slides
DESCRIPTION
Presented by William Michener on 11-15-2012
TRANSCRIPT
DuraSpace/ARL/DLF E-Science Institute
DataONE: Tools and Approaches for Supporting the Data Life Cycle
Supplemental Webinar
Thursday, November 15, 2012
1:00-2:30 pm EDT
DataONE: Tools and Approaches for Supporting the Data Life Cycle
Presented by William Michener,
University of New Mexico
Professor and Director of e‐Science Initiatives for University Libraries
Three Key Challenges
Plan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze
Innovation
1. Data Preservation and Planning
The Long Tail of Orphan Data
[Figure: Volume (y-axis) vs. rank frequency of datatype (x-axis); regions labeled "Specialized repositories (e.g. GenBank, PDB)" and "Orphan data". (B. Heidorn)]
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray
Planning?
Metadata standard?
Data repository?
DataONE and the DMPTool Support Data Preservation
Three major components for a flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network‐wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
Dryad (>3,000 data products)
Coordinated submission of articles and underlying data
Handshaking with specialized repositories
Promotion of reuse and incentives for deposit
Knowledge Network for Biocomplexity (20,000+ data packages)
Contributors
• Individual investigators
• Field stations and networks
• Government agencies
• Non‐profit partnerships
• Synthesis centers
Data Types
• Ecological
• Environmental
• Demographic
• Social/Legal/Economic
[Figure: histogram of Data Sizes; x-axis bins in MB: < 1, 1‐10, 10‐200, > 200; y-axis in %, ticks at 0, 15, 30, 45, 60]
✔ Check for best practices
✔ Create metadata
✔ Connect to ONEShare
Data & Metadata (EML)
2. Data Discovery
Data Silos
The DataONE Federation
Member Node Functional Tiers
• Tier 1: Read only, public content
ping(), getLogRecords(), getCapabilities(), get(), getSystemMetadata(), getChecksum(), listObjects(), synchronizationFailed()
• Tier 2: Read only, with access control
isAuthorized(), setAccessPolicy()
• Tier 3: Read/Write using client tools
create(), update(), delete()
• Tier 4: Able to operate as a replication target
replicate(), getReplica()
• http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html
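The tier list above can be read as a simple lookup from tier number to required methods. The sketch below is illustrative only: the method names are copied from the slide, but the `TIER_METHODS` table and the `highest_tier` helper are hypothetical and not part of any DataONE client library. It also assumes the tiers are cumulative (a node at tier N supports every method of tiers 1 through N), which the slide implies but does not state.

```python
# Method names per tier, as listed on the slide.
# The table layout and helper below are hypothetical, for illustration only.
TIER_METHODS = {
    1: ["ping", "getLogRecords", "getCapabilities", "get",
        "getSystemMetadata", "getChecksum", "listObjects",
        "synchronizationFailed"],
    2: ["isAuthorized", "setAccessPolicy"],
    3: ["create", "update", "delete"],
    4: ["replicate", "getReplica"],
}


def highest_tier(supported):
    """Return the highest tier whose methods, and those of every lower
    tier, all appear in `supported`. Returns 0 if Tier 1 is incomplete.

    Assumption: tiers are cumulative, so a gap at a lower tier stops
    the count even when higher-tier methods are present.
    """
    supported = set(supported)
    tier = 0
    for level in sorted(TIER_METHODS):
        if not set(TIER_METHODS[level]) <= supported:
            break
        tier = level
    return tier


# A node exposing the Tier 1 and Tier 2 methods is a Tier 2 node.
read_only_node = TIER_METHODS[1] + TIER_METHODS[2]
print(highest_tier(read_only_node))
```

Under this reading, a repository can join the federation at Tier 1 with read-only public content and add access control, write support, and replication later without changing the earlier methods.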
NASA collectors DAAC Users (UWG)
DataONE Users
ORNL DAAC as a DataONE Member Node
Investigator Toolkit
3. Innovation
The Fourth Paradigm:
1. Observational and experimental
2. Theoretical research
3. Computer simulations of natural phenomena
4. Data‐intensive research
• new tools, techniques, and ways of working
“Data Intensive Science” and the “80:20 Rule”
[Figure: axes labeled "Decreasing Spatial Coverage" and "Increasing Process Knowledge"; levels labeled: Remote sensing; Intensive science sites and experiments; Extensive science sites; Volunteer & education networks. Adapted from CENR‐OSTP]
Kepler
DMP-Tool
Investigator Toolkit Support
Plan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze
Exploration, Visualization, and Analysis
Spatio‐Temporal Exploratory Model identifies factors affecting patterns of migration
Diverse bird observations and environmental data from 300,000 locations in the US integrated and analyzed using High Performance Computing Resources
Land Cover
Meteorology
MODIS – Remote sensing data
• Examine patterns of migration
• Infer how climate change may affect bird migration
Model results
Occurrence of Indigo Bunting (2008)
[Figure panels: Jan, Apr, Jun, Sep, Dec]
Scientific workflows
Workflows Evolution with VisTrails
Collaboration environments
Taverna, MyExperiment
Community Engagement
User Assessments
Year 1 Year 2 Year 3 Year 4 Year 5
Scientists: BL
Scientists: FU
Librarians: BL Librarians: FU
Policy Makers: BL Policy Makers: FU
Educators: BL Educators: FU
Library Policies: BL Library Policies: FU
Results
• “More than half of the respondents (56%) reported that they did not use any metadata standard and about 22% of respondents indicated they used their own lab metadata standard.”
• Less than 6% of scientists are making “All” of their data available via some mechanism.
Community Engagement
Best Practices and Software Tools
June 3-21, 2013
University of New Mexico
DataONE: Supporting Scientific Data Preservation, Discovery, and Innovation
Recommendations
• 9 areas where you can help researchers
1. Plan ‐ https://dmp.cdlib.org
2. Collect and assure the data
http://www.dataone.org/best-practices
3. Describe and document the data
http://metavist2.codeplex.com/
http://knb.ecoinformatics.org/morphoportal.jsp
4. Select a repository for the data
http://databib.org/
http://www.dataone.org/best-practices
http://www.opendoar.org/
5. Preserve the data
http://daac.ornl.gov/PI/BestPractices-2010.pdf
6. Use the data
http://www.nutnet.umn.edu/
7. Budget for it – 10‐>25% of total budget
8. Communicate (early and often)
Meetings, web portals, newsletters, phone and video conferences
9. Train (in‐person and/or virtually)
DataONE.org
DataONE Team and Sponsors
• Bertram Ludaescher
• Deborah McGuinness
• Jeff Horsburgh
• Robert Sandusky
• Peter Honeyman
• Carole Goble
• Cliff Duke
• Donald Hobern
• Ewa Deelman
• Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla
• Patricia Cruse, John Kunze
• Dave Vieglais
• Paul Allen, Rick Bonney, Steve Kelling
• Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin
• Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson
• John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanisamy, Line Pouchard
• Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly
• David DeRoure
• Ryan Scherle, Todd Vision
LEON LEVY FOUNDATION
• Randy Butler
Questions?