niso/dcmi webinar: metadata for managing scientific research data
Metadata for Managing Scientific Research Data
NISO/DCMI Webinar: August 22, 2012
Jane Greenberg, Professor and Director of the SILS Metadata Research Center
Overview
▪ Why should we care?
▪ What is data?
▪ What is metadata’s role w.r.t. data?
▪ Selected metadata standards
▪ Challenges, opportunities, and jumping in
▪ Concluding comments
▪ Q&A
BIG stuff
▪ Digital data deluge (Hey & Trefethen, 2003)
▪ Big data (New York Times)
▪ The fourth paradigm (Jim Gray, 2007)
Just as important
▪ The long tail (Heidorn, 2008)
▪ CODATA/Data-at-Risk Task Group
▪ Scholarly communications, data citation
Technological affordances for improving and advancing science
Why should we care?
Cultural shift toward data sharing
▪ National and international policies
– US NSF and NIH [1, 2]
– OECD (Organisation for Economic Co-operation and Development) [3]
– INSPIRE (Infrastructure for Spatial Information in the European Community), EU Commission [4]
– UK Medical Research Council [5]
Dryad “enables scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies.” (http://datadryad.org/)
Data
▪ No single agreed-upon definition
▪ One person’s data is another person’s information
▪ Data often implies the “raw” stuff lacking context
– Scholarly context, written assessment
▪ “Essence of science” (Greenberg, et al, 2009)
▪ What is science?
– The Archaeology Data Service (ADS), archaeologydataservice.ac.uk
Data
I know it when I see it
By example: Traditional observations, numbers, and measures stored in spreadsheets and databases, fossils, phylogenetic trees, and herbarium samples (White, 2008)
Other disciplines
▪ Bioinformatics: Gene expressions, DNA transcription to RNA translation
▪ Geology, agriculture, surveillance, and historical manuscript research: Hyperspectral remote sensing
The Dryad Repository (email w/R. Scherle, July 2012)
Quantity  Type
3162      Plain Text
476       Microsoft Excel
308       Adobe Portable Document Format
302       Comma-separated values
252       Nexus
153       Microsoft Excel OpenXML
108       Microsoft Word
80        Zip file
62        JPEG image
45        Microsoft Word OpenXML
40        Extensible Markup Language
35        Hypertext Markup Language
21        Rich Text Format
16        FASTA sequence file
15        Tag Image File Format
14        Postscript Files
2         Video Quicktime
2         Mathematica Notebook
1         Microsoft Powerpoint
Metadata defined
… data about data
… information about data
▪ “Metadata or ‘data about data’ describes the content, quality, condition, and other characteristics of data.” (FGDC Metadata WG, 1998)
▪ Structured information about an object (data) that facilitates functions associated with the object. (Greenberg, 2002, 2003, 2009)
Typical functions
▪ Discover
▪ Manage
▪ Control rights
▪ Identify versions
▪ Certify authenticity
▪ Indicate status
▪ Mark content structure
▪ Situate geospatially
▪ Describe processes
Metadata for Scientific Research Data
It gets messy really quickly
Metadata for Scientific Research Data
Descriptive
– General to granular
▪ Value (addressing a topic, “aboutness”)
– Topical (ontologies, subject heading lists/thesauri, taxonomies)
▪ Named entities
– Name authority files (people, organizations, geographical jurisdictions, structures, and events)
▪ Geo-spatial (coordinates)
▪ Temporal data (ISO 8601/W3CDTF, or …)
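To make the temporal bullet concrete, here is a minimal Python sketch of a check that a date value follows W3CDTF, the profile of ISO 8601 named on the slide. The function name `is_w3cdtf` is hypothetical, and the pattern covers only the six W3CDTF granularities (year, year-month, full date, and date-times with a required time zone designator); treat it as an illustration, not a complete ISO 8601 validator.

```python
import re
from datetime import date

# The six W3CDTF granularities: YYYY, YYYY-MM, YYYY-MM-DD, and three
# date-time forms that must end in a time zone designator (Z or +hh:mm / -hh:mm).
_W3CDTF = re.compile(
    r"(?P<year>\d{4})"
    r"(?:-(?P<month>\d{2})"
    r"(?:-(?P<day>\d{2})"
    r"(?:T(?P<hour>\d{2}):(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2})(?:\.\d+)?)?"
    r"(?P<tzd>Z|[+-]\d{2}:\d{2})"
    r")?)?)?"
)


def is_w3cdtf(value):
    """Return True if `value` looks like a valid W3CDTF date or date-time."""
    m = _W3CDTF.fullmatch(value)
    if m is None:
        return False
    try:
        # date() rejects impossible months and days such as 2012-02-30.
        # It also rejects year 0000, which this sketch treats as invalid.
        date(int(m["year"]), int(m["month"] or 1), int(m["day"] or 1))
    except ValueError:
        return False
    if m["hour"] is not None:
        if int(m["hour"]) > 23 or int(m["minute"]) > 59:
            return False
        if m["second"] is not None and int(m["second"]) > 59:
            return False
    return True


for candidate in ["2012", "2012-08-22", "2012-08-22T13:00Z", "22/08/2012"]:
    print(candidate, is_w3cdtf(candidate))
```

A check like this is useful at ingest time: values that fail can be routed to a person instead of being stored as free text.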
Given the messiness…
“I cannot tell you exactly what metadata standards, vocabularies, etc. to use…”
Examining metadata schemes
Objectives and principles
• Objectives
• Principles
Domains
• Discipline
• Genre
• Format
Architectural layout
• Structural design
• Extent
• Granularity
Metadata Objectives and principles, Domain, and Architectural Layout (MODAL) framework
(Greenberg, 2005; Willis, et al, JASIST 2012)
Simple schemes [6]
Objectives and principles
• Interoperability
• Easy to generate, lower barrier to produce
Domains
• Multi-disciplinary
• Any genre or format
Architectural layout
• Primarily flat
• Minimal with means to extend
• General (not granular)
Dublin Core Metadata Element Set (DCMES) ver. 1.1
US MARC bibliographic format
• Need training
• Primarily flat
• Extensible
DataCite
• Primarily flat
Dublin Core Application Profile-Dryad [7]
DataCite example, ver.2.2 [8] National Institute for Environmental Studies and Center for Climate System Research Japan
US MARC bibliographic format: World Ocean Circulation Experiment global data (Moss Landing Marine Labs and the Monterey Bay Aquarium Research Institute Library) [9]
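As an illustration of how flat a simple scheme is, the following Python sketch builds a small DCMES 1.1 record with the standard library. The wrapper element `record` and the function name `build_dc_record` are assumptions made for this example; the title and DOI come from the Dryad example cited in footnote [7], and "Dataset" is a term from the DCMI Type Vocabulary.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen DCMES 1.1 elements.
DCMES_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def build_dc_record(fields):
    """Build a flat record; `fields` maps an element name to a list of values."""
    record = ET.Element("record")  # hypothetical wrapper, not part of DCMES
    for name, values in fields.items():
        if name not in DCMES_ELEMENTS:
            raise ValueError(f"not a DCMES 1.1 element: {name}")
        for value in values:  # every element is optional and repeatable
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = build_dc_record({
    "title": ["Data from: Divergence time estimation using fossils as "
              "terminal taxa and the origins of Lissamphibia"],
    "identifier": ["doi:10.5061/dryad.8120"],
    "type": ["Dataset"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

There is no nesting and no required element, which is what keeps the barrier to producing such records low.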
Simple/moderate schemes
Objectives and principles
• Interoperability balanced w/specific needs
• Generation requires more expertise
Domains
• Greater domain focus
• Genre diversity within a domain
Architectural layout
• Primarily flat
• Extensibility via connecting to other schemes
• Slightly more granular
Darwin Core
Access to Biological Collections Data (ABCD)
• Not as flat
Ecological Metadata Language
DCMI Terms
• Graph approach
Wieczorek, et al. (2012). Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS One. 2012; 7(1): e29715: doi: 10.1371/journal.pone.0029715.
<?xml version='1.0' encoding='UTF-8'?>
<DataSets xmlns='http://www.tdwg.org/schemas/abcd/2.06'>
  <DataSet>
    <TechnicalContacts>
      <TechnicalContact>
        <Name>Gerd Müller</Name>
        <Email>[email protected]</Email>
      </TechnicalContact>
    </TechnicalContacts>
    <ContentContacts>
      <ContentContact>
        <Name>A Another</Name>
        <Email>[email protected]</Email>
      </ContentContact>
    </ContentContacts>
    <Metadata>
      <Description>
        <Representation language='en'>
          <Title>PonTaurus collection</Title>
        </Representation>
      </Description>
      <RevisionData>
        <DateModified>2001-03-01T00:00:00</DateModified>
      </RevisionData>
    </Metadata>
    <Units>
      <Unit>
        <SourceInstitutionID>BGBM</SourceInstitutionID>
        <SourceID>PonTaurus</SourceID>
        <UnitID>1136</UnitID>
      </Unit>
    </Units>
  </DataSet>
</DataSets>
Access to Biological Collections Data (ABCD) (A minimum record)
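A short Python sketch of reading such a record follows. It assumes a trimmed copy of the minimum record (contacts and the XML declaration left out) and uses two hypothetical helper names, `dataset_title` and `unit_identifiers`. The point it illustrates is that ABCD is "not as flat": values sit several levels down and have to be addressed by path and namespace.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the record on the slide.
ABCD_NS = {"abcd": "http://www.tdwg.org/schemas/abcd/2.06"}

# Trimmed copy of the minimum record, for illustration only.
SAMPLE = """
<DataSets xmlns='http://www.tdwg.org/schemas/abcd/2.06'>
  <DataSet>
    <Metadata>
      <Description>
        <Representation language='en'>
          <Title>PonTaurus collection</Title>
        </Representation>
      </Description>
    </Metadata>
    <Units>
      <Unit>
        <SourceInstitutionID>BGBM</SourceInstitutionID>
        <SourceID>PonTaurus</SourceID>
        <UnitID>1136</UnitID>
      </Unit>
    </Units>
  </DataSet>
</DataSets>
"""


def dataset_title(xml_text):
    """Return the first dataset title found, or None."""
    root = ET.fromstring(xml_text)
    return root.findtext(".//abcd:Representation/abcd:Title", namespaces=ABCD_NS)


def unit_identifiers(xml_text):
    """Return (institution, source, unit) identifier triples, one per Unit."""
    root = ET.fromstring(xml_text)
    tags = ("SourceInstitutionID", "SourceID", "UnitID")
    return [
        tuple(unit.findtext(f"abcd:{tag}", default="", namespaces=ABCD_NS)
              for tag in tags)
        for unit in root.iterfind(".//abcd:Unit", ABCD_NS)
    ]


print(dataset_title(SAMPLE))
print(unit_identifiers(SAMPLE))
```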
Properties in the /terms/ namespace
abstract, accessRights, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, available, bibliographicCitation, conformsTo, contributor, coverage, created, creator, date, dateAccepted, dateCopyrighted, dateSubmitted, description,
educationLevel, extent, format, hasFormat, hasPart, hasVersion, identifier, instructionalMethod, isFormatOf, isPartOf, isReferencedBy, isReplacedBy, isRequiredBy, issued, isVersionOf, language, license, mediator, medium,
modified, provenance, publisher, references, relation, replaces, requires, rights, rightsHolder, source, spatial, subject, tableOfContents, temporal, title, type, valid
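One way to picture the "graph approach" of DCMI Terms is as subject, property, value statements. The Python sketch below is an assumption-laden illustration: it models statements as plain tuples, treats every value as a literal, and uses the hypothetical helper names `describe` and `to_ntriples`. The property list is the one shown above, and the resource URL and title are taken from footnote [7].

```python
# Namespace of the DCMI Metadata Terms properties.
DCTERMS = "http://purl.org/dc/terms/"

# The properties in the /terms/ namespace, as listed on the slide.
DCTERMS_PROPERTIES = frozenset("""
abstract accessRights accrualMethod accrualPeriodicity accrualPolicy
alternative audience available bibliographicCitation conformsTo contributor
coverage created creator date dateAccepted dateCopyrighted dateSubmitted
description educationLevel extent format hasFormat hasPart hasVersion
identifier instructionalMethod isFormatOf isPartOf isReferencedBy
isReplacedBy isRequiredBy issued isVersionOf language license mediator
medium modified provenance publisher references relation replaces requires
rights rightsHolder source spatial subject tableOfContents temporal title
type valid
""".split())


def describe(subject, properties):
    """Turn {property name: value} into (subject, property IRI, value) triples."""
    triples = []
    for name, value in properties.items():
        if name not in DCTERMS_PROPERTIES:
            raise ValueError(f"not a DCMI Terms property: {name}")
        triples.append((subject, DCTERMS + name, value))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples lines, treating every value as a literal."""
    lines = []
    for s, p, o in triples:
        literal = (o.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


triples = describe(
    "http://datadryad.org/resource/doi:10.5061/dryad.8120",
    {
        "title": "Data from: Divergence time estimation using fossils as "
                 "terminal taxa and the origins of Lissamphibia",
        "identifier": "doi:10.5061/dryad.8120",
    },
)
print(to_ntriples(triples))
```

In practice an RDF library would handle IRIs, datatypes, and language tags; the tuples here only show the shape of the statements.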
Complex schemes
Objectives and principles
• Interoperability level
• Generation requires greater expertise
Domains
• Genre focus
• Format variation
Architectural layout
• Hierarchical
• Extensive
• Granular
FGDC
DDI
Content Standard for Digital Geospatial Metadata (CSDGM)/FGDC
1. Identification Information (M)
2. Data Quality Information
3. Spatial Data Organization Information
4. Spatial Reference Information
5. Entity and Attribute Information
6. Distribution Information
7. Metadata Reference Information (M)
Data Documentation Initiative (DDI)
1. Concept
2. Collecting
3. Processing Archiving
4. Distribution Archiving
5. Discovery
6. Analysis
7. Repurposing
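The CSDGM list marks two of its seven sections as mandatory (M). A minimal Python sketch of a completeness check built on that marking is shown here. The function name `missing_mandatory` is hypothetical, and treating every unmarked section as optional is a simplification made for this sketch, not a statement of the standard's full rules.

```python
# CSDGM top-level sections as listed on the slide: number -> (name, marked M).
CSDGM_SECTIONS = {
    1: ("Identification Information", True),
    2: ("Data Quality Information", False),
    3: ("Spatial Data Organization Information", False),
    4: ("Spatial Reference Information", False),
    5: ("Entity and Attribute Information", False),
    6: ("Distribution Information", False),
    7: ("Metadata Reference Information", True),
}


def missing_mandatory(present):
    """Return names of sections marked (M) that are absent from `present`.

    `present` is a set of section numbers found in a record.
    """
    return [
        name
        for number, (name, mandatory) in sorted(CSDGM_SECTIONS.items())
        if mandatory and number not in present
    ]


print(missing_mandatory({1, 2, 7}))  # complete with respect to the (M) sections
print(missing_mandatory({2, 3}))     # both (M) sections are missing
```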
Summary for descriptive schemes
▪ Simple: interoperable, easy to generate/low barrier, generally multidisciplinary, genre/format agnostic, primarily flat, general (not granular), 15-25 properties
▪ Simple/moderate: interoperability balanced w/specific needs, generation requires more expertise, greater domain focus, extensible via connecting to other schemes, more granular, more properties
▪ Complex: interoperability level, generation requires expertise, genre focus/format variation, hierarchical, granular, and extensive (100+ properties)
Value schemes (addressing a topic, “aboutness”)
Topical (ontologies, subject heading lists/thesauri, taxonomies)
EXAMPLE: DDI Vocabularies
•Analysis Unit
•Character Set
•Commonality Type Coded
•Lifecycle Event Type
•Response Unit
•Software Package
•Summary Statistic Type
•Time Method
Named entities (people, organizations, geographical jurisdictions, structures, and events)
» LC Authorities
» Virtual International Authority File (VIAF)
» Open Researcher and Contributor ID (ORCID)
» Gazetteers
» Getty Thesaurus of Geographic Names
Geo-spatial coordinates: ISO 19111
Temporal data
- Dates: ISO 8601/W3CDTF
- Periods
Code lists
- MIME type
- Language
- Geo.
- Etc.
Challenges and opportunities
Challenge: Workflow: when to generate the metadata?
Opportunity: Educate scientists early (Qin, 2009). Integrate into social setting w/Center for Embedded Networked Sensing (CENS) (Borgman, Mayernik, etc., 2009-current; Mayernik’s dissertation, 2011).
Challenge: Methods for generating metadata (labor intensive).
Opportunity: Use automatic techniques as much as possible, leverage human expertise (Dryad, DataONE Excel project).
Challenge: Too many standards. Which one do I use?
Opportunity: Don’t panic, join communities, look for examples. (If you can’t find them?)
Challenge: Do I need to implement my metadata as linked data?
Opportunity: No. Explore and develop a best practice. Pursue a two-pronged approach (Greenberg, et al, 2009).
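As one illustration of "automatic techniques", the Python sketch below derives a few technical properties directly from a file, leaving descriptive fields such as title, creator, and subject to a person. It is not how Dryad or the DataONE Excel project work; the function name `technical_metadata` and the `checksum` key are assumptions for this example, while `format`, `extent`, and `modified` borrow DCMI Terms property names.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path):
    """Collect properties that can be read from the file without human input."""
    p = Path(path)
    stat = p.stat()
    media_type, _ = mimetypes.guess_type(p.name)  # guess from the extension only
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "format": media_type or "application/octet-stream",
        "extent": f"{stat.st_size} bytes",
        "modified": modified.strftime("%Y-%m-%d"),
        # Hypothetical key: a fixity value, not a DCMI Terms property.
        "checksum": "sha256:" + hashlib.sha256(p.read_bytes()).hexdigest(),
    }


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, technical_metadata(name))
```

Reading the whole file into memory is acceptable for a sketch; large datasets would call for hashing in chunks.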
Jumping in…
1. DCMI/NISO Seminars !!
2. DCMI Science and Metadata Community (http://wiki.dublincore.org/index.php/DCMI_Science_And_Metadata)
3. Digital Curation Centre (DCC) (http://www.dcc.ac.uk/)
4. The Research Data Management Training, or MANTRA project (http://datalib.edina.ac.uk/mantra/)
5. DataONE workshops and tutorials (www.dataone.org/)
Concluding comments
▪ Standards are guidelines; no police
– Aim for reasonable quality
▪ KISS: Keep it simple, stupid
– What’s vital; what will aid reuse?
▪ Help to move the practice forward
– Share what you learn
▪ Nothing new/it’s all new
– Data documentation since ancient times
– SILOS; let’s break them down (Willis, et al, 2012)
– Greater connectivity than ever
– Cross-disciplinary approaches for problem solving
Footnotes
[1] NSF Data Sharing Policy: http://www.nsf.gov/bfa/dias/policy/dmp.jsp.
[2] NIH Data Sharing Policy: http://grants.nih.gov/grants/policy/data_sharing/.
[3] Organisation for Economic Co-operation and Development, Data and Metadata Reporting and Presentation Handbook: http://www.oecd.org/std/37671574.pdf.
[4] INSPIRE (Infrastructure for Spatial Information in the European Community): http://inspire.ec.europa.eu/index.cfm/pageid/48. The directive was released on 15 May 2007 and will be implemented in various stages, with full implementation required by 2019. It aims to create a European Union (EU) spatial data infrastructure.
[5] UK Medical Research Council: http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/datasharing/index.html.
[6] The DCMI Glossary (scroll down for “schema” entry): http://dublincore.org/documents/usageguide/glossary.shtml#schema.
[7] Dublin Core Example: Data from: Divergence time estimation using fossils as terminal taxa and the origins of Lissamphibia (Dryad repository): http://datadryad.org/resource/doi:10.5061/dryad.8120?show=full.
[8] National Institute for Environmental Studies and Center for Climate System Research Japan—animation data (DataCite): http://schema.datacite.org/meta/kernel-2.2/example/datacite-metadata-sample-v2.2.xml.
[9] US MARC bibliographic format: World Ocean Circulation Experiment global data (Moss Landing Marine Labs and the Monterey Bay Aquarium Research Institute Library): http://mlml.kohalibrary.com/cgi-bin/koha/opac-detail.pl?biblionumber=9282.