metadata lecture(9 17-14)
DESCRIPTION
Lecture slides about metadata for a data analytics course
TRANSCRIPT
Matthew Brush
Ontology Development Group
OHSU Library, DMICE
METADATA: PERSPECTIVES FROM THE WEB AND DATABASE SYSTEMS
Sept 17, 2014
“Data about Data”
“Data” broadly covers any information resource
  digital or physical
  narrative, multimedia, structured raw data, processed data, aggregates of datasets, or discrete elements within datasets
More formally, “Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”
METADATA
(NISO (2004) Understanding Metadata. Bethesda, NISO Press)
Descriptive metadata: supports discovery and identification
  e.g. title, author, identifiers, subjects, keywords
Structural metadata: describes how the components of a resource are organized
  e.g. table of contents for a book, schema of database tables, manifest of files in an aggregate ‘research object’
Administrative metadata: helps manage the resource
  Technical - describes technical aspects of a resource
    e.g. file type, version information, how/when created
  Rights management - explains intellectual property rights
    e.g. licensing, use restrictions, privacy concerns
  Preservation - supports maintenance and archiving of a resource
    e.g. provenance/ownership, history of use, authenticity
METADATA SERVES MANY PURPOSES . . .
http://www.niso.org/publications/press/UnderstandingMetadata.pdf
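The three purposes above can be pictured as three groups of fields on one record. The following is a minimal sketch, not a standard: the class name `ResourceMetadata`, the helper `purposes_covered`, and all field values are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMetadata:
    """Hypothetical record that groups metadata by the three purposes."""
    descriptive: dict = field(default_factory=dict)     # discovery and identification
    structural: dict = field(default_factory=dict)      # how components are organized
    administrative: dict = field(default_factory=dict)  # technical, rights, preservation


# Illustrative values only; not copied from a real catalog record.
book = ResourceMetadata(
    descriptive={"title": "Metadata", "keywords": ["metadata", "cataloging"]},
    structural={"table_of_contents": ["Introduction", "Schemas", "Vocabularies"]},
    administrative={
        "technical": {"file_type": "application/pdf", "version": "1.0"},
        "rights": {"license": "CC-BY-4.0"},
        "preservation": {"provenance": "donated by author"},
    },
)


def purposes_covered(record):
    """List which of the three metadata groups hold at least one field."""
    names = ("descriptive", "structural", "administrative")
    return [name for name in names if getattr(record, name)]


print(purposes_covered(book))
```

A record that fills only one group (for example, a photo with a title but no rights information) is still valid under this sketch; the grouping just makes gaps easy to see.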
Metadata comes in many forms, serves many needs, and operates in very diverse settings
I. Resource metadata (on the web)
  Target: information resources as a whole
  Primary goals: resource discovery and use
  Form: structured, separate records
  Users: everyone
  Standards: many metadata frameworks/vocabularies
II. Metadata in database systems
  Target: structured data and data elements
  Primary goals: data consistency, aggregation, analysis
  Form: ER diagrams, summary tables, data dictionaries
  Users: professional data administrators and scientists
  Standards: metadata and CDE registries
. . . AND OPERATES IN MANY CONTEXTS
I. Resource Metadata (on the web)
  A. Overview
  B. Examples
  C. Metadata Frameworks
    i. Schema
    ii. Vocabularies
    iii. Conceptual Models
    iv. Practical Specifications
    v. Encoding Specifications
  D. Metadata Storage and Retrieval
II. Metadata in Database Systems
  A. Overview
  B. Data Elements
  C. Data Dictionaries
  D. Common Data Elements (CDEs)
  E. CDE Registries
OUTLINE
Metadata in the world that all of us have used and created in work and life
Attached to information resources we find on the web
  books, videos, images, websites, datasets, . . .
Helps us to find a resource and understand what it is and how to use it
I. RESOURCE METADATA (ON THE WEB)
Book Catalog Record (screenshot annotated with Descriptive, Structural, and Administrative metadata)
http://ohsulibrary.worldcat.org/title/metadata/oclc/225088362
Digital Photograph Library (screenshot annotated with Descriptive, Structural, and Administrative metadata)
http://crdl.usg.edu/cgi/crdl?query=id:highlander_highlanderphotos_p2-wi3-3
Data Set Description
http://datadryad.org/resource/doi:10.5061/dryad.4ms68
Research Data Sets and Files (datadryad.org)
Data File Description
Resource metadata is increasingly structured according to established schemas and standards
Many standards exist that vary in their:
  complexity (schemas, specifications, conceptual models)
  targets (music, video, images, books, art, datasets)
  goals (descriptive, administrative, and preservation)
  communities served (libraries, museums, research)
Benefits:
  leverage existing resources vetted by the community
  interoperability and integration
STANDARDS ARE KEY
Normative standards for metadata are captured in metadata frameworks.
There are five possible components of a metadata framework:
A. Schema
B. Vocabularies
C. Conceptual Model
D. Practical Specifications
E. Encoding Specifications
METADATA FRAMEWORKS
Core of any framework – specifies the categories of information recorded
Composed of a set of data elements along with descriptions of their attributes and rules for use
  attributes described should minimally include an identifier and/or name and a definition of each element
Can also specify data types and ‘value domains’ that describe allowable values for a given element
  e.g. term lists, CVs, ontologies
Example schemas: Dublin Core, LOM, HCLS Dataset Std.
A. METADATA SCHEMA
First effort at standardizing metadata to improve resource discovery on the web
Very simple core schema consisting of 15 general data elements representing properties of an information resource, with no value restrictions.
Data Elements: title, identifier, type, description, creator, contributor, date, subject, format, language, source, publisher, relation, coverage, rights
Element Attributes: URI, label, definition, domain, range, version, comment
EXAMPLE 1: DUBLIN CORE METADATA INITIATIVE (DCMI)
http://dublincore.org/documents/dcmi-terms/
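To make the 15-element core concrete, here is a minimal sketch that describes this lecture with a few of the elements listed above and flags any key that is not one of the 15. The function name `unknown_elements` and the extra `audience` key are assumptions added for illustration; the record values other than title, creator, and date are invented.

```python
# The 15 core data elements named on the slide.
DC_ELEMENTS = {
    "title", "identifier", "type", "description", "creator", "contributor",
    "date", "subject", "format", "language", "source", "publisher",
    "relation", "coverage", "rights",
}


def unknown_elements(record):
    """Return the keys of a record that are not among the 15 core elements."""
    return sorted(set(record) - DC_ELEMENTS)


# A record may use any subset of the elements, and values are unrestricted.
record = {
    "title": "Metadata lecture",
    "creator": "Matthew Brush",
    "date": "2014-09-17",
    "subject": ["metadata", "data analytics"],
    "audience": "students",  # deliberately outside the 15 core elements
}

print(unknown_elements(record))
```

Because the core schema places no restrictions on values, the check above can only test element names; constraining the values themselves is the job of the vocabularies described later.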
Extensive set of metadata elements describing ‘learning objects’
  “Any digital or non-digital entity that may be used for learning, education, or training”
Based loosely on DCMI schema, but:
  >50 new elements to describe educational attributes of learning objects
  organizes elements into a hierarchical structure
  provides detailed specifications for allowable values
  supports ‘application profiles’ that extend the model for specific domains
EXAMPLE 2: LEARNING OBJECT METADATA (IEEE-LOM)
http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
LOM SCHEMA ELEMENTS AND ATTRIBUTES
The LOM base schema defines 9 categories of metadata elements
Hierarchical structure supports user understanding, metadata organization and aggregation for analysis
LOM ELEMENT HIERARCHY
A unified schema that provides all key metadata fields needed to comprehensively describe research datasets
  what they are, how they are produced, where they are found
  meets pressing need in current research climate to support sharing, discovery, and re-use of public datasets in a standardized way
Metadata elements describe general features, identifiers, provenance and change, availability and distribution, and dataset statistics
Composed entirely of elements (properties) from existing community vocabularies, e.g. DCMI, DCAT, PROV, VOID, FOAF
  attributes and rules for element use defined in source schema
EXAMPLE 3: W3C HCLS DATASET DESCRIPTION STANDARD
http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
B. VOCABULARIES
Set of terms (often structured) that is used to constrain entry of metadata values
Vocabularies represent general concepts
  Word or code lists
  Hierarchical classifications: taxonomies, thesauri, ontologies
  e.g. ICD9, SNOMED, MeSH, NCI Thesaurus
Authority lists provide controlled names for proper nouns
  FundRef (organizations)
  Global Gazetteer (places)
  ORCID (people)
Open Researcher and Contributor IDentifier (ORCID)
  a nonproprietary alphanumeric code that uniquely identifies scientific and other academic authors (a persistent “author DOI” for researchers)
The ORCID identifier set is coming to serve as a de facto authority list to record persons contributing to scholarly research products
ORCIDs facilitate efforts to track productivity, impact, and attribution based on all scholarly outputs (publications, grants, datasets, protocols, presentations, abstracts, code, blogs, etc.)
Services can aggregate scholarly outputs for a given researcher
  resolves to a “CV” listing all scholarly contributions linked across various venues (e.g. PubMed, Scopus, Slideshare, Figshare, GitHub, Dryad, . . .)
ORCID AS AN AUTHORITY LIST
An underlying model that describes how all the information and concepts inherent in a resource are related to one another
Metadata Models conceptualize the metadata schema itself (hierarchical relationships or other mappings between elements)
Domain Models conceptualize the domain in which the metadata schema operates (classes of things that are annotated and the relationships between them)
C. CONCEPTUAL MODELS
EXAMPLE METADATA MODEL: LOM ELEMENT HIERARCHY
The structure of the LOM is an example of a simple conceptual metadata model, which organizes elements into disjoint hierarchies
The summary level describes the dataset in general
The version level describes a specific version
The distribution level describes a representation of a version
EXAMPLE DOMAIN MODEL: HCLS DATASET ‘LEVELS’
Supports recommendations for how each should be described using the standard
D. Practical specifications for use
  provide guidance for how to apply metadata under a given schema
  e.g. HCLS model provides recommendations for when and how to apply certain elements to types of targets in the domain
E. Encoding specifications for presentation & exchange
  rules for binding metadata to syntactic formats such as XML or RDF
  e.g. LOM has precise specifications for binding to XML or RDF
D/E. SPECIFICATIONS
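As a rough illustration of what "binding metadata to XML" means, the sketch below serializes a small record using the Dublin Core element namespace (chosen here because it is simpler than the LOM binding the slide mentions). The wrapper element name `metadata` and the function name `to_xml` are assumptions for this example, not part of any encoding specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_xml(record):
    """Serialize a dict of element name -> value(s) as namespaced XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]  # treat a single value as a one-item list
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml({
    "title": "Metadata lecture",
    "subject": ["metadata", "data analytics"],
})
print(xml_text)
```

Repeated values become repeated elements, which is one of the decisions an encoding specification has to pin down so that different systems can exchange records reliably.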
STORING AND ACCESSING RESOURCE METADATA
Typically lives separately from annotated resources, in databases and/or XML files
Can also be stored within a resource (e.g. photo metadata embedded in the image file itself)
Increasing number of resource catalogs and repositories on the web provide access to metadata and often the resource itself
  we have seen examples for books, images, and datasets
These repositories are indexed by search tools and provide programmatic interfaces to allow for resource discovery and re-use
Serves the same basic needs, but with a different scale and target of annotation, user base, and primary use cases
II. METADATA IN DATABASE SYSTEMS
Two main categories:
1. Structural metadata
describes the structure of database objects and the relationships between them
commonly encoded externally as ER-diagrams, or internally as summary tables
http://www.visn20.med.va.gov/VISN20/V20/DataWarehouse/Images/LabAutopsy.jpg
Example ER diagram for VA autopsy data
2. Content metadata
describes meaning of data at a very fine granularity
specifies attributes of data elements, and rules for recording their values
encoded internally or externally as ‘data dictionaries’
Example of a data set that needs a dictionary to interpret
The notion of a ‘data element’ obtains a more precise meaning and specification in the context of a database.
  elements can be specified at finer granularity in a database holding structured data in a controlled operational system
Conceptually, a data element comprises a concept and a value domain
  concept = the subject of the data recorded for a given element
  value domain = the defined value set for how that data is recorded
Example: PT_ETHNIC
  concept = patient ethnicity
  value domain = [ E1 (caucasian), E2 (hispanic/latino), E3 (african), E4 (asian), E5 (mixed) ]
DATA ELEMENTS
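The concept plus value domain pairing can be sketched directly in code. This is a minimal illustration using the PT_ETHNIC example from the slide; the class name `DataElement` and its methods are assumptions, not part of any database product.

```python
from dataclasses import dataclass


@dataclass
class DataElement:
    """A data element as a concept paired with a value domain."""
    name: str
    concept: str         # the subject of the data recorded
    value_domain: dict   # allowed code -> human-readable meaning

    def is_valid(self, value):
        """True if the value belongs to the element's value domain."""
        return value in self.value_domain

    def decode(self, value):
        """Translate a stored code into its meaning, or fail loudly."""
        if value not in self.value_domain:
            raise ValueError(f"{value!r} is not in the value domain of {self.name}")
        return self.value_domain[value]


PT_ETHNIC = DataElement(
    name="PT_ETHNIC",
    concept="patient ethnicity",
    value_domain={
        "E1": "caucasian",
        "E2": "hispanic/latino",
        "E3": "african",
        "E4": "asian",
        "E5": "mixed",
    },
)

print(PT_ETHNIC.decode("E2"))
print(PT_ETHNIC.is_valid("E9"))
```

Without the value domain, a stored "E2" is just a string; with it, the same value can be both checked and interpreted.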
Provide detailed metadata about data elements
  element identifiers and name(s)
  definitions and descriptions
  value constraints
    data type
    default value
    length
    allowable values
    value frequency (mandatory or not)
  provenance and tracking
    version number, entry and termination dates
    indicate source table(s)
  mappings to elements in other schema dictionaries
DATA DICTIONARIES
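One way to see how these attributes get used is to check incoming rows against a dictionary. The sketch below is illustrative only: the dictionary layout, the `PT_ID` element, and the function `validate_row` are assumptions, and only a few of the attributes listed above (data type, length, allowable values, mandatory or not) are modeled.

```python
# Hypothetical data dictionary with two elements.
DATA_DICTIONARY = {
    "PT_ID": {
        "definition": "Locally assigned patient identifier",
        "data_type": str,
        "max_length": 8,
        "mandatory": True,
        "allowable_values": None,  # free text within the length limit
    },
    "PT_ETHNIC": {
        "definition": "Patient ethnicity",
        "data_type": str,
        "max_length": 2,
        "mandatory": False,
        "allowable_values": {"E1", "E2", "E3", "E4", "E5"},
    },
}


def validate_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when checking a row against the dictionary."""
    problems = []
    for name, spec in dictionary.items():
        value = row.get(name)
        if value is None:
            if spec["mandatory"]:
                problems.append(f"{name}: missing mandatory value")
            continue
        if not isinstance(value, spec["data_type"]):
            problems.append(f"{name}: expected {spec['data_type'].__name__}")
            continue
        if isinstance(value, str) and len(value) > spec["max_length"]:
            problems.append(f"{name}: longer than {spec['max_length']} characters")
        allowed = spec["allowable_values"]
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: {value!r} is not an allowable value")
    for name in row:
        if name not in dictionary:
            problems.append(f"{name}: not defined in the dictionary")
    return problems


print(validate_row({"PT_ID": "A0001", "PT_ETHNIC": "E2"}))
print(validate_row({"PT_ETHNIC": "E9"}))
```

Keeping the dictionary as data rather than hard-coding the rules means the same checker keeps working when elements are added or value sets change.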
DATA DICTIONARIES
http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
Simple example of a data dictionary
Key Functions
  unambiguous and shared understanding of the data by all users (administrators, analysts, and clients)
  consistent data representation and manipulation (addition, extraction, aggregation, and transformation)
  maintenance of the data model
  data integration, exchange, and re-use
Encoding
  as an external document and/or represented as a table in the database itself
DATA DICTIONARIES
1. Clear and thorough element definitions and value set explanations are key
2. Give persistent identifiers to data elements
3. Map data elements to community standards where possible, e.g. common data elements (CDEs)
4. Specify value sets in terms of open controlled vocabularies (CVs) where possible
5. Provide notes and guidelines for context of use
6. Make dictionary easily accessible to all users
DATA DICTIONARY BEST PRACTICES
As research moves toward ‘big data’, information from diverse sources is being shared and aggregated for analysis.
A major challenge for managing this data is the diversity of ways that a given idea can be described in data elements
  Sex/gender definitions can be based on genetics, phenotype, or self-identification.
  Values can be recorded as local codes, abbreviations, full labels, or community vocabularies.
DATABASE METADATA INTEROPERABILITY
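The value-encoding half of this problem can be sketched as follows. Two hypothetical sources record the same idea with different field names and local codes, and each needs its own mapping before the data can be pooled. All site names, field names, codes, and records here are invented for illustration.

```python
from collections import Counter

# Hypothetical local encodings used by two source datasets.
SITE_A_SEX = {"M": "male", "F": "female", "U": "unknown"}
SITE_B_GENDER_CD = {"1": "female", "2": "male", "0": "unknown"}


def harmonize(records, source_field, mapping):
    """Rewrite each record's local code as a shared label; flag unmapped codes."""
    harmonized = []
    for record in records:
        label = mapping.get(record[source_field], "unmapped")
        harmonized.append({"patient": record["patient"], "gender": label})
    return harmonized


site_a = [{"patient": "a1", "sex": "M"}, {"patient": "a2", "sex": "F"}]
site_b = [{"patient": "b1", "gender_cd": "1"}, {"patient": "b2", "gender_cd": "7"}]

combined = (
    harmonize(site_a, "sex", SITE_A_SEX)
    + harmonize(site_b, "gender_cd", SITE_B_GENDER_CD)
)
counts = Counter(row["gender"] for row in combined)
print(counts)
```

Note that a code mapping like this only aligns how values are written. It does not resolve the deeper issue raised above, where one source defines the element by genetics and another by self-identification; that requires agreeing on the element's definition, not just its codes.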
The Common Data Element (CDE) movement aims to address this problem by providing standardized data elements that can be re-used across medical datasets
CDEs are
  owned, managed, & curated by a single authority (NINDS, NCI)
  stored and managed in large repositories called CDE registries
  available for diverse areas of clinical practice and research, and at very fine granularity
  larger repositories hold up to 50,000 elements
CDEs serve as a foundation for interoperability across data systems
COMMON DATA ELEMENTS (CDEs)
Metadata registries that collect common data elements for a defined domain
Resemble large scale data dictionaries, but with key differences:
  Exposed in searchable public repositories with additional services to promote extraction and re-use
  Coverage is wider as they are used across different domains and systems
  Metadata element descriptions are far richer to support discovery, provenance, versioning, mappings, meta-modeling
The NIH maintains a portal to information about existing CDE initiatives, registries, and tools (http://www.nlm.nih.gov/cde/)
CDE REGISTRIES
Houses >20,000 CDEs
  “Core” element set covers general concepts in medical domain
    patient demographics, medical history, assessment & examinations, treatments & interventions, outcomes, and study protocol
  “Supplementary” sets covering specific diseases/research areas
    spinal injury, brain injury, epilepsy and stroke, Parkinson’s disease, ALS
Metadata schema captures 30 element attributes
  this expanded set of attributes supports use cases of enabling discovery and community re-use across different implementations
Portal has search functionality and support for generating clinical forms (CRFs) with CDE mappings embedded in collected data
NINDS CDE REGISTRY
http://www.commondataelements.ninds.nih.gov/
The National Cancer Institute cancer Data Standards Registry (caDSR) is the largest and most widely used CDE registry
  >50,000 total elements
Integrates CDEs from several initiatives under a unified model and technical infrastructure
Broad and deep coverage to fine granularity (as with NINDS)
Metadata model is VERY complex
  captures >100 distinct attributes describing each data element in the registry (vs 30 for NINDS)
  implements a complex conceptual model based on the ISO/IEC 11179 metadata registry standard
  decomposes data elements into component parts that are mapped to NCI Thesaurus terms (formal encoding of semantics)
NCI caDSR CDE REGISTRY
https://cdebrowser.nci.nih.gov/CDEBrowser/
caDSR CONCEPTUAL MODEL
1. To understand the data element table and explain why it is so expansive
2. Follows a standard for database metadata registries called ISO/IEC 11179
  commonly implemented in other efforts you may encounter
  e.g. the Clinical Data Interchange Standards Consortium (CDISC), which has goals similar to the caDSR but across a broader domain
3. Is the basis for semantic mappings to ontologies such as the NCI Thesaurus, which are an important feature of the model
[Diagram: a Data Element is composed of a Concept (Class, Property) and a Value Domain (Value Representation, Valid Values)]
caDSR CONCEPTUAL MODEL
Data Element: PT_GENDER_CODE
  Concept: ‘patient gender’
    Class: ‘person’
    Property: ‘gender’
CONCEPT ELEMENT MAPPING
Concept = idea represented by the data element, described independently of a particular representation
Class = a set of real world objects with shared characteristics
Property = a characteristic common to all members of a class
Data Element: PT_GENDER_CODE
  Concept: ‘patient gender’
    Class: ‘person’ (C25190)
    Property: ‘gender’ (C17357)
Class and property concepts are mapped to NCI taxonomy terms to formally encode their semantics
  Class Mapping: person = C25190
  Property Mapping: gender = C17357
CONCEPT ELEMENT MAPPING
Data Element: PT_GENDER_CODE
  Value Domain
VALUE DOMAIN MAPPING
Value domain = a set of attributes describing representational characteristics of instance data
Value Representation = type of value the data represents (along different dimensions)
Valid Values = the actual allowed values for a given value domain
Value Rep.: ‘person’, ‘gender’, ‘code’
Valid Values: ‘0’, ‘1’, ‘2’, ‘9’
Data Element: PT_GENDER_CODE
  Value Domain
    Value Rep.: ‘person’, ‘gender’, ‘code’
Value Representation Mappings
  ‘person’ = C25190
  ‘gender’ = C17357
  ‘code’ = C25162
Valid Value Mappings
  0 = unknown (C17998)
  1 = female gender (C46110)
  2 = male gender (C46109)
  9 = unspecified (n/a)
VALUE DOMAIN MAPPING
Valid Values: ‘0’, ‘1’, ‘2’, ‘9’
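The PT_GENDER_CODE mappings above can be collected into one nested structure. The sketch below follows the concept/value-domain decomposition shown on the slides and uses the codes given there; the class names, field names, and the `annotate` helper are assumptions for illustration and do not reproduce the actual caDSR data model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodedTerm:
    """A label plus the NCI code it is mapped to (None when unmapped)."""
    label: str
    nci_code: Optional[str]


@dataclass
class Concept:
    name: str
    object_class: CodedTerm   # the 'Class' component
    prop: CodedTerm           # the 'Property' component


@dataclass
class ValueDomain:
    representation: list      # list of CodedTerm
    valid_values: dict        # stored value -> CodedTerm


@dataclass
class DataElement:
    name: str
    concept: Concept
    value_domain: ValueDomain


person = CodedTerm("person", "C25190")
gender = CodedTerm("gender", "C17357")
code = CodedTerm("code", "C25162")

PT_GENDER_CODE = DataElement(
    name="PT_GENDER_CODE",
    concept=Concept("patient gender", object_class=person, prop=gender),
    value_domain=ValueDomain(
        representation=[person, gender, code],
        valid_values={
            "0": CodedTerm("unknown", "C17998"),
            "1": CodedTerm("female gender", "C46110"),
            "2": CodedTerm("male gender", "C46109"),
            "9": CodedTerm("unspecified", None),  # no mapping given on the slide
        },
    ),
)


def annotate(element, raw_value):
    """Attach the meaning and NCI code to a stored value."""
    term = element.value_domain.valid_values[raw_value]
    return {
        "element": element.name,
        "value": raw_value,
        "meaning": term.label,
        "nci_code": term.nci_code,
    }


print(annotate(PT_GENDER_CODE, "1"))
```

With every component tied to a code, two systems that store the value differently can still agree that they mean the same thing, as long as both map to the same term.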
[Diagram: Concept (Class, Property) and Value Domain (Value Representation, Valid Values)]
"SEMANTICALLY UNAMBIGUOUS INTEROPERABILITY"
Semantic mappings of these four elements can support more sophisticated search and analysis
Computational tools can leverage logic in the NCI hierarchy for query expansion and data aggregation
The structure of the NCI taxonomy supports synonym and hierarchical query expansion
LEVERAGING SEMANTICS
User searches ‘cancer biology’ to view all CDEs related to this concept.
The query is expanded (1) to include any children of this term in the taxonomy, and (2) to include elements with text matching any synonym of cancer in the taxonomy
NCI Thesaurus ‘Cancer Biology’ branch
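The two expansion steps described above (children of the term, then synonyms) can be sketched over a toy taxonomy. The hierarchy, synonyms, and element descriptions below are invented for illustration and are not taken from the NCI Thesaurus or any CDE registry; the function names are likewise assumptions.

```python
# Toy taxonomy: parent term -> child terms (invented for illustration).
CHILDREN = {
    "cancer biology": ["tumor growth", "metastasis"],
    "tumor growth": ["tumor burden"],
    "metastasis": [],
    "tumor burden": [],
}

# Toy synonym lists (also invented).
SYNONYMS = {
    "cancer biology": ["tumor biology"],
    "metastasis": ["metastatic spread"],
}


def expand_query(term):
    """Return the term, all of its descendants, and the synonyms of each."""
    expanded = []
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        expanded.append(current)
        expanded.extend(SYNONYMS.get(current, []))  # synonym expansion
        stack.extend(CHILDREN.get(current, []))     # hierarchical expansion
    return expanded


def search(elements, term):
    """Return names of elements whose description matches any expanded term."""
    terms = expand_query(term)
    return [
        name
        for name, description in elements.items()
        if any(t in description.lower() for t in terms)
    ]


# Hypothetical element descriptions.
elements = {
    "TUMOR_BURDEN_PCT": "Tumor burden as percent of organ volume",
    "PT_GENDER_CODE": "Patient gender code",
    "METS_SITE": "Site of metastatic spread",
}

print(search(elements, "cancer biology"))
```

None of the descriptions contain the literal phrase "cancer biology", so a plain text search would return nothing; the matches come entirely from the expanded terms.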
Strategies
Map elements in local data dictionaries to CDEs
  Parkinson’s Disease Biomarkers Program (PDBP) data dictionary
  NINDS registry form builder
Build libraries of re-usable, pre-fabricated forms with embedded CDE metadata
  NINDS Case Report Form (CRF) library
  medical-data-models.org forms
Initialize software with CDEs so that electronic forms automatically carry mappings when they are generated
  caDSR registry and CDISC tools
CDEs IN PRACTICE
CDEs standardize data elements for use across multiple systems
Available in registries that vary in size and complexity
  some resemble simple data dictionaries with expanded attributes to support discovery and provenance (NINDS)
  some are implemented with complex conceptual models and semantic mappings (caDSR)
Tools and standards supporting practical application exist but are not yet state of the art
Worlds collide: the intersection of metadata for web resources and database systems
  CDEs represent discoverable web resources that are used in the context of data collection and description in database systems
  Each registry defines a metadata framework/schema for a given domain
CDE SUMMARY
Promote standardized and systematic data collection
Improve data quality and consistency
Facilitate data sharing and integration
Reduce the cost and time needed to develop data collection tools
Improve opportunities for meta-analysis comparing results across studies
Increase the availability of data for the planning and design of new trials
BENEFITS OF CDEs
Data elements across efforts are not well aligned
Tooling support for discovery & application immature
Limited use of community taxonomy and ontology mappings
Navigating complexity and redundancy
  . . . of medical data itself
    many ways to calculate and represent simple and complex measures such as tumor burden or medical prognosis
  . . . of metadata elements/schemas
    thousands of elements with very nuanced meaning and use
    redundant representation poses challenges for data collection, aggregation, and integrated analysis (even for simple measures)
CHALLENGES FOR DATA INTEGRATION AND ANALYSIS
LINKS
Schema Examples:
  DCMI: http://dublincore.org/documents/dcmi-terms/
  IEEE-LOM: http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
  HCLS Dataset description standard: http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
Data Dictionary Example:
  http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
CDE Sites:
  NIH CDE Portal: http://www.nlm.nih.gov/cde/
  NINDS CDE Registry: http://www.commondataelements.ninds.nih.gov/
  caDSR browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
  caDSR tools: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models
CDEs in Practice:
  PDBP gender data dictionary entry: https://dictionary.pdbp.ninds.nih.gov/portal/publicData/dataElementAction!view.action?dataElementId=5585
  NINDS form builder: http://www.commondataelements.ninds.nih.gov/CRF.aspx?source=formBuilder
  Downloadable forms (CRFs) from NINDS with embedded CDE links: http://www.commondataelements.ninds.nih.gov/CRF.aspx
  medical-data-models.org forms: https://medical-data-models.org/forms/1049
  Suite of tools on caDSR site: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models