Data Sharing*, Archiving, and Discovery: Tips and Tools
William Michener
College of University Libraries & Learning Sciences
DataONE
University of New Mexico
*Making data available for others to use
The Data Deluge
Data Entropy
[Figure: Content (vertical axis) versus Time (horizontal axis). Labels: Time of publication; Specific details; General details; Accident; Retirement or career change; Death. (Michener et al. 1997)]
Vines, T. H. et al. Curr. Biol. http://dx.doi.org/10.1016/j.cub.2013.11.014 (2013).
Dark data in the long tail
Specific Data are Hard to Find …
The Rest are Inaccessible
PB Heidorn (2008) Library Trends 57 (2), 280-299
Convergent Science
• “the merging of ideas, approaches and technologies from widely diverse fields of knowledge to stimulate innovation and discovery”
Data Sharing
A brief history of ecological data sharing
• The International Biological Program (IBP): 1964-1974
  • “… data policies and protocols were never elaborated nor even agreed to in principle.” (Porter & Callahan 1994)
Michener (2015) Ecological Informatics 29:33-44
Long Term Ecological Research Network (LTER): 1980-present
• LTER Guidelines for Site Data Management Policies issued in 1990 (Porter & Callahan 1994)
• LTER Network Data Access Policy, Data Access Requirements, and General Data Use Agreement (approved by the LTER Coordinating Committee April 6, 2005)
Approx. 20,000 data packages available
• NSF Policy from Grant General Conditions (April 1, 2001)
  • “NSF … expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work.”
• America COMPETES Act (August 9, 2007)
  • requires civilian federal agencies to provide guidelines, policy and procedures, to facilitate and optimize the open exchange of data and research between agencies, the public and policymakers.
The 2011 Joint Data Archiving Policy (JDAP; see datadryad.org)
• [Journal] requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as [list of approved archives here]. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
• “PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.” (Michener 2015)
• Nature, Science, Ecological Monographs, …
Community Practices and Perceptions
[Figure: chart with an axis from 0 to 100 comparing Baseline (2010) and Follow-up (2014). Items: Use others' datasets if their data were easily accessible; Willing to share data across a broad group; Process for searching; Perception; Satisfaction.]
Views: 35,693; Citations: 188 (published Jun 2011)
Views: 8,342; Citations: 8 (published Aug 2015)
[Figure: panels labeled 2010 and 2014]
Benefits of Data Sharing
• “data sharing accelerates the pace of science by enabling researchers to discover and re-use relevant data, combine data from multiple sources, and ask new questions”
• “public trust increases as science is made more transparent and findings can be reproduced and verified”
• Researchers “benefit from the credit attributed to them when their archived data are cited and used by others” and “citation rates of publication increase when the research data are shared”
Michener (2015) Ecological Informatics 29:33-44
Best Practices
Best Practices for Sharing Data:
1. Create and Follow a Data Management Plan
Michener WK (2015) Ten Simple Rules for Creating a Good Data Management Plan. PLoS Comput Biol 11(10): e1004525. doi:10.1371/journal.pcbi.1004525
Best Practices for Sharing Data:
2. Adopt/follow Data Sharing & Attribution Policies
Joint Data Archiving Policy:
http://datadryad.org/pages/jdap
Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data Archiving. American Naturalist 175(2):145-146, http://dx.doi.org/10.1086/650340
Creative Commons Licenses (https://creativecommons.org)
Best Practices for Sharing Data:
3. Fully Document the Data
• Darwin Core – species and biodiversity collections
• EML – Ecological Metadata Language
• ISO 19115 – for a wide variety of geospatial data
https://knb.ecoinformatics.org/#tools/morpho
http://rs.tdwg.org/dwc/
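As a rough illustration of documenting data with a community standard, the Python sketch below writes one occurrence record to CSV using a handful of Darwin Core term names as column headers. The record values and the helper name `write_occurrences` are hypothetical and exist only for this example; a real data set would follow the full Darwin Core documentation at the URL above.

```python
import csv
import io

# A small subset of Darwin Core terms, used here as column names.
DWC_FIELDS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
]


def write_occurrences(records, handle):
    """Write records (dicts keyed by the terms above) to handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=DWC_FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Hypothetical record; every value is made up for illustration.
example = {
    "occurrenceID": "example-0001",
    "basisOfRecord": "HumanObservation",
    "scientificName": "Quercus alba",
    "eventDate": "2016-12-20",
    "decimalLatitude": "35.0844",
    "decimalLongitude": "-106.6504",
}

buffer = io.StringIO()
write_occurrences([example], buffer)
csv_text = buffer.getvalue()
```

Using shared term names as headers means a reader or an aggregator can interpret each column without asking the data creator.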
Best Practices for Sharing Data:
4. Preserve the Data, Software and Workflows
http://specifyx.specifysoftware.org
Catalog of 1,500+ Data Repositories
Best Practices for Sharing Data:
5. “Publish” and Disseminate the Data Products
http://www.gbif.org
http://www.vertnet.org
http://www.nature.com/sdata/
Archiving
Role of the Data Archive
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds) Ecological Informatics, 4th edn. Springer.
Bad Practices for Preserving Data
Example from Lesson 4 in DataONE education modules (see DataONE.org)
Best Practices for Preserving Data
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds) Ecological Informatics, 4th edn. Springer.
1. “Keep similar measurements together in one data set”
2. Follow standard approaches (e.g., International System) when defining names, units & formats (e.g., yyyy-mm-dd or yyyymmdd for dates, as in 20161220)
3. Use consistent data organization
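The date-format advice in item 2 can be sketched in a few lines of Python. This is only an illustration: the list of accepted input formats and the function name `to_iso_date` are assumptions, not part of the cited chapter.

```python
from datetime import datetime

# Input formats this sketch accepts; the list is an assumption for illustration.
# Note that month/day/year input is ambiguous with day/month/year, which is one
# reason to store dates as yyyy-mm-dd in the first place.
KNOWN_FORMATS = ["%Y-%m-%d", "%Y%m%d", "%m/%d/%Y"]


def to_iso_date(text):
    """Return the date as yyyy-mm-dd, or raise ValueError if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


compact = to_iso_date("20161220")
slashed = to_iso_date("12/20/2016")
```

Both calls return the same yyyy-mm-dd string, so every file in a data set can carry dates in one unambiguous, sortable form.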
Best Practices for Preserving Data
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds) Ecological Informatics, 4th edn. Springer.
4. Use stable file formats
  • Text/CSV, shapefile, GeoTIFF, HDF, netCDF
5. Specify spatial & temporal coordinates
6. Assign descriptive file names
  • “Soil carbon and nitrogen concentrations in Barrow….”
7. Save raw data in read-only format and save processing scripts (R, MATLAB, SAS)
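One way to act on item 7 is to clear the write permission on raw files once they are saved, so that later processing works on copies. The sketch below uses Python's standard library on a temporary file; the file name and the helper `make_read_only` are hypothetical, and file permissions are only one of several ways to protect raw data.

```python
import os
import stat
import tempfile


def make_read_only(path):
    """Clear all write permission bits on a file, leaving the other bits as they are."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


# Demonstration on a temporary file standing in for a raw data file.
with tempfile.TemporaryDirectory() as workdir:
    raw_path = os.path.join(workdir, "raw_measurements.csv")
    with open(raw_path, "w") as handle:
        handle.write("site,date,value\nA,2016-12-20,1.5\n")

    make_read_only(raw_path)
    is_writable = bool(os.stat(raw_path).st_mode & stat.S_IWUSR)

    # Restore write permission so the temporary directory can be cleaned up.
    os.chmod(raw_path, stat.S_IRUSR | stat.S_IWUSR)
```

Keeping the processing scripts next to the protected raw file lets anyone regenerate the derived products from the untouched original.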
Best Practices for Preserving Data
Michener (In press) Quality assurance and quality control. In: Recknagel F, Michener WK (eds) Ecological Informatics, 4th edn. Springer.
8. Assure data quality
9. Provide complete documentation
10. Protect data (1 original, 1 copy on-site, 1 off-site)
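A simple range check is one common form of the quality assurance named in item 8. The sketch below is a minimal illustration, not the procedure from the cited chapter: the field names, the plausible ranges, and the sample rows are all hypothetical.

```python
# Hypothetical plausible ranges; real limits depend on the instrument and site.
PLAUSIBLE_RANGES = {
    "air_temp_c": (-60.0, 60.0),
    "relative_humidity_pct": (0.0, 100.0),
}


def flag_suspect_values(rows):
    """Return (row_index, field, value) for each missing or out-of-range value."""
    problems = []
    for index, row in enumerate(rows):
        for field, (low, high) in PLAUSIBLE_RANGES.items():
            value = row.get(field)
            if value is None or not low <= value <= high:
                problems.append((index, field, value))
    return problems


# Made-up sample rows: the second holds a sentinel value, the third an impossible one.
rows = [
    {"air_temp_c": 21.4, "relative_humidity_pct": 38.0},
    {"air_temp_c": 999.0, "relative_humidity_pct": 41.0},
    {"air_temp_c": 19.8, "relative_humidity_pct": 104.0},
]
issues = flag_suspect_values(rows)
```

The function only reports suspect values; whether to correct, flag, or remove them is a decision to record in the documentation.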
The Data Repository Will Ensure:
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds) Ecological Informatics, 4th edn. Springer.
1. Files are received as sent
2. Documentation describes files
3. Parameters and units are defined
4. File content is consistent
5. Parameter values are reasonable
6. Files are reformatted and reorganized if necessary
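Item 1, "files are received as sent", is commonly checked by comparing checksums. The slides do not say how any particular repository does this, so the sketch below is an assumption: it uses SHA-256 from Python's standard library, and the function names and sample bytes are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def received_as_sent(received: bytes, expected_checksum: str) -> bool:
    """True when the received bytes match the checksum supplied by the sender."""
    return sha256_of(received) == expected_checksum


# The depositor computes a checksum before sending (made-up file content).
sent = b"site,date,value\nA,2016-12-20,1.5\n"
checksum_from_depositor = sha256_of(sent)

# The repository recomputes it on arrival.
intact = received_as_sent(sent, checksum_from_depositor)
corrupted = received_as_sent(sent.replace(b"1.5", b"1.6"), checksum_from_depositor)
```

A single changed character produces a different digest, so the comparison catches both transfer errors and accidental edits. For large files the digest would normally be computed in chunks rather than from one in-memory byte string.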
Discovery
Data Repositories
Best Practices for Data Discovery:
1. Search a Domain Portal or Aggregator
Data Federations (DataONE, GBIF)
carbon cycling; plant biomass; ocean nitrogen; avian distribution
Best Practices for Data Discovery:
2. Refine the Search, Using Relevant Facets
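To make the idea of facets concrete, the sketch below counts and filters a small in-memory list of search results. It does not call any real portal: the records, field names, and the helpers `facet_counts` and `refine` are hypothetical, and an actual portal exposes its own facets through its search interface.

```python
from collections import Counter

# Hypothetical search results; titles, years, and repository names are made up.
RESULTS = [
    {"title": "Plant biomass survey", "keywords": ["plant biomass"],
     "year": 2012, "repository": "Repo A"},
    {"title": "Carbon flux towers", "keywords": ["carbon cycling"],
     "year": 2014, "repository": "Repo B"},
    {"title": "Grassland biomass plots", "keywords": ["plant biomass", "carbon cycling"],
     "year": 2014, "repository": "Repo A"},
]


def facet_counts(results, field):
    """Count how many results carry each value of a facet field."""
    counts = Counter()
    for record in results:
        value = record[field]
        counts.update(value if isinstance(value, list) else [value])
    return counts


def refine(results, **facets):
    """Keep only the results that match every requested facet value."""
    def matches(record, field, wanted):
        value = record[field]
        return wanted in value if isinstance(value, list) else value == wanted

    return [r for r in results if all(matches(r, f, w) for f, w in facets.items())]


by_year = facet_counts(RESULTS, "year")
refined = refine(RESULTS, keywords="plant biomass", year=2014)
```

The counts show how many results each facet value would leave, and each added facet narrows the result list further.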
Best Practices for Data Discovery:
3. Give Back – i.e., Cite the Data Appropriately
Dryad links to journals
Provides citation instructions
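A data citation can be assembled from a few metadata fields. The sketch below is illustrative only: the element order, the function name `format_data_citation`, and the sample values (including the placeholder identifier) are assumptions, and the repository's own citation instructions take precedence.

```python
def format_data_citation(creators, year, title, repository, identifier):
    """Join basic metadata into a one-line data citation (illustrative order)."""
    authors = ", ".join(creators)
    return f"{authors} ({year}) {title}. {repository}. {identifier}"


# Hypothetical data set; none of these values refer to a real record.
citation = format_data_citation(
    creators=["Doe J", "Roe R"],
    year=2016,
    title="Example soil carbon measurements",
    repository="Example Data Repository",
    identifier="https://doi.org/10.xxxx/example",
)
```

Including a persistent identifier lets readers reach the exact data set and lets the creators receive credit when it is reused.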
dataone.org