metadata lecture(9 17-14)

Matthew Brush, Ontology Development Group, OHSU Library, DMICE. METADATA PERSPECTIVES FROM THE WEB AND DATABASE SYSTEMS. Sept 17, 2014. [email protected]


Post on 22-Apr-2015


DESCRIPTION

lecture slides about metadata for a data analytics course

TRANSCRIPT

Page 1: Metadata lecture(9 17-14)

Matthew Brush

Ontology Development Group

OHSU Library, DMICE

METADATA PERSPECTIVES FROM THE WEB AND DATABASE SYSTEMS

Sept 17, 2014 [email protected]

Page 2: Metadata lecture(9 17-14)

“Data about Data”

“Data” broadly covers any information resource
• digital or physical
• narrative, multimedia, structured
• raw data, processed data, aggregates of datasets, or discrete elements within data sets

More formally, “Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”

METADATA

(NISO (2004) Understanding Metadata. Bethesda, NISO Press)

Page 3: Metadata lecture(9 17-14)

Descriptive metadata: supports discovery and identification
• e.g. title, author, identifiers, subjects, keywords

Structural metadata: describes how the components of a resource are organized
• e.g. table of contents for a book, schema of database tables, manifest of files in an aggregate ‘research object’

Administrative metadata: helps manage the resource
• Technical: describes technical aspects of a resource, e.g. file type, version information, how/when created
• Rights management: explains intellectual property rights, e.g. licensing, use restrictions, privacy concerns
• Preservation: supports maintenance and archiving of a resource, e.g. provenance/ownership, history of use, authenticity

METADATA SERVES MANY PURPOSES . . .

http://www.niso.org/publications/press/UnderstandingMetadata.pdf
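As a rough illustration of the three purposes above, the sketch below groups the fields of one metadata record by purpose. The class names, field names, and sample values are hypothetical and chosen only for this example; they are not taken from the NISO document.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DescriptiveMetadata:
    # Supports discovery and identification.
    title: str
    author: str
    identifiers: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)


@dataclass
class StructuralMetadata:
    # How the components of the resource are organized, e.g. a file manifest.
    components: list[str] = field(default_factory=list)


@dataclass
class AdministrativeMetadata:
    file_type: str    # technical
    version: str      # technical
    license: str      # rights management
    provenance: str   # preservation


@dataclass
class ResourceRecord:
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata


# Hypothetical record for an example dataset.
record = ResourceRecord(
    descriptive=DescriptiveMetadata(
        title="Example survey dataset",
        author="A. Researcher",
        identifiers=["example-id-001"],
        keywords=["survey", "metadata"],
    ),
    structural=StructuralMetadata(components=["readme.txt", "responses.csv"]),
    administrative=AdministrativeMetadata(
        file_type="text/csv",
        version="1.0",
        license="CC0",
        provenance="collected by the example lab",
    ),
)

# A nested dictionary view, convenient for export to JSON or XML.
flat = asdict(record)
```

Keeping the three groups separate makes it easy to expose only the descriptive part for search while reserving administrative fields for curators.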

Page 4: Metadata lecture(9 17-14)

Metadata comes in many forms, serves many needs, and operates in very diverse settings

I. Resource metadata (on the web)
• Target: information resources as a whole
• Primary goals: resource discovery and use
• Form: structured, separate records
• Users: everyone
• Standards: many metadata frameworks/vocabularies

II. Metadata in database systems
• Target: structured data and data elements
• Primary goals: data consistency, aggregation, analysis
• Form: ER diagrams, summary tables, data dictionaries
• Users: professional data administrators and scientists
• Standards: metadata and CDE registries

. . . AND OPERATES IN MANY CONTEXTS

Page 5: Metadata lecture(9 17-14)

I. Resource Metadata (on the web)
    A. Overview
    B. Examples
    C. Metadata Frameworks
        i. Schema
        ii. Vocabularies
        iii. Conceptual Models
        iv. Practical Specifications
        v. Encoding Specifications
    D. Metadata Storage and Retrieval

II. Metadata in Database Systems
    A. Overview
    B. Data Elements
    C. Data Dictionaries
    D. Common Data Elements (CDEs)
    E. CDE Registries

OUTLINE

Page 6: Metadata lecture(9 17-14)

Metadata in the world that all of us have used and created in work and life

Attached to information resources we find on the web: books, videos, images, websites, datasets, . . .

Helps us to find a resource and understand what it is and how to use it

I. RESOURCE METADATA (ON THE WEB)

Page 7: Metadata lecture(9 17-14)

[Figure: Book Catalog Record, with its descriptive, structural, and administrative metadata labeled]

http://ohsulibrary.worldcat.org/title/metadata/oclc/225088362

Page 8: Metadata lecture(9 17-14)

[Figure: Digital Photograph Library record, with its descriptive, structural, and administrative metadata labeled]

http://crdl.usg.edu/cgi/crdl?query=id:highlander_highlanderphotos_p2-wi3-3

Page 9: Metadata lecture(9 17-14)

Research Data Sets and Files (datadryad.org)

[Figure: Dryad record showing a Data Set Description and a Data File Description]

http://datadryad.org/resource/doi:10.5061/dryad.4ms68

Page 10: Metadata lecture(9 17-14)

Resource metadata is increasingly structured according to established schemas and standards

Many standards exist that vary in their:
• complexity (schemas, specifications, conceptual models)
• targets (music, video, images, books, art, datasets)
• goals (descriptive, administrative and preservation)
• communities served (libraries, museums, research)

Benefits:
• leverage existing resources vetted by community
• interoperability and integration

STANDARDS ARE KEY

Page 11: Metadata lecture(9 17-14)

Normative standards for metadata are captured in metadata frameworks.

There are five possible components of a metadata framework:

A. Schema

B. Vocabularies

C. Conceptual Model

D. Practical Specifications

E. Encoding Specifications

METADATA FRAMEWORKS

Page 12: Metadata lecture(9 17-14)

Core of any framework – specifies the categories of information recorded

Composed of a set of data elements along with descriptions of their attributes and rules for use
• attributes described should minimally include an identifier and/or name and a definition of each element

Can also specify data types and ‘value domains’ that describe allowable values for a given element
• e.g. term lists, controlled vocabularies (CVs), ontologies

Example schemas: Dublin Core, LOM, HCLS Dataset Std.

A. METADATA SCHEMA
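The idea of a schema as a set of data elements, each with an identifier, name, definition, data type, and optional value domain, can be sketched as follows. This is a minimal illustration under stated assumptions: the `ex:` identifiers, the two elements, and the allowed values are made up and do not reproduce any published schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    identifier: str
    name: str
    definition: str
    data_type: type = str
    # None means the element places no restriction on its values.
    value_domain: Optional[frozenset] = None


# Hypothetical two-element schema.
SCHEMA = {
    e.name: e
    for e in [
        Element("ex:title", "title", "A name given to the resource."),
        Element(
            "ex:type",
            "type",
            "The nature or genre of the resource.",
            value_domain=frozenset({"Dataset", "Image", "Text"}),
        ),
    ]
}


def validate(record: dict) -> list[str]:
    """Return a list of problems found when checking a record against SCHEMA."""
    problems = []
    for name, value in record.items():
        element = SCHEMA.get(name)
        if element is None:
            problems.append(f"{name}: not defined in schema")
        elif not isinstance(value, element.data_type):
            problems.append(f"{name}: expected {element.data_type.__name__}")
        elif element.value_domain is not None and value not in element.value_domain:
            problems.append(f"{name}: {value!r} not in value domain")
    return problems
```

For example, `validate({"title": "Lecture slides", "type": "Text"})` returns an empty list, while a record with `"type": "Video"` is reported as outside the value domain.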

Page 13: Metadata lecture(9 17-14)

First effort at standardizing metadata to improve resource discovery on the web

Very simple core schema consisting of 15 general data elements representing properties of an information resource, with no value restrictions.

Data Elements: title, identifier, type, description, creator, contributor, date, subject, format, language, source, publisher, relation, coverage, rights

Element Attributes: URI, label, definition, domain, range, version, comment

EXAMPLE 1: DUBLIN CORE METADATA INITIATIVE (DCMI)

http://dublincore.org/documents/dcmi-terms/
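To make the 15-element schema concrete, here is a small sketch that writes a record using Dublin Core element names as XML. The `http://purl.org/dc/elements/1.1/` namespace is the standard one for the 15 core elements; the `metadata` wrapper element and the sample record are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 core elements listed on the slide.
DC_ELEMENTS = {
    "title", "identifier", "type", "description", "creator", "contributor",
    "date", "subject", "format", "language", "source", "publisher",
    "relation", "coverage", "rights",
}


def to_dublin_core_xml(record: dict) -> str:
    """Serialize a flat record to XML, one dc: element per entry."""
    root = ET.Element("metadata")  # wrapper name is an assumption
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"{element!r} is not a Dublin Core element")
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Sample record describing this lecture.
record = {
    "title": "Metadata perspectives from the web and database systems",
    "creator": "Matthew Brush",
    "date": "2014-09-17",
    "type": "Text",
    "language": "en",
}
xml_text = to_dublin_core_xml(record)
```

Because the core schema places no restrictions on values, the function checks only that each key is one of the 15 element names.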

Page 14: Metadata lecture(9 17-14)

Extensive set of metadata elements describing ‘learning objects’
• “Any digital or non-digital entity that may be used for learning, education, or training”

Based loosely on DCMI schema, but:
• >50 new elements to describe educational attributes of learning objects
• organizes elements into a hierarchical structure
• provides detailed specifications for allowable values
• supports ‘application profiles’ that extend model for specific domains

EXAMPLE 2: LEARNING OBJECT METADATA (IEEE-LOM)

http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html

Page 15: Metadata lecture(9 17-14)

LOM SCHEMA ELEMENTS AND ATTRIBUTES

Page 16: Metadata lecture(9 17-14)

The LOM base schema defines 9 categories of metadata elements

Hierarchical structure supports user understanding, metadata organization and aggregation for analysis

LOM ELEMENT HIERARCHY
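A hierarchy like this can be represented with dotted element paths, which makes grouping and counting by category straightforward. The sketch below uses the nine LOM category names; the sample record and the camel-case spelling of the paths are assumptions for illustration, not a binding defined by the standard.

```python
from collections import Counter

# The nine top-level categories of the LOM base schema.
LOM_CATEGORIES = [
    "general", "lifeCycle", "metaMetadata", "technical", "educational",
    "rights", "relation", "annotation", "classification",
]

# Hypothetical learning-object record, keyed by "category.element".
record = {
    "general.title": "Intro to metadata",
    "general.language": "en",
    "technical.format": "application/pdf",
    "educational.difficulty": "easy",
    "rights.cost": "no",
}


def group_by_category(flat_record: dict) -> dict:
    """Rebuild the two-level hierarchy from dotted paths."""
    grouped = {}
    for path, value in flat_record.items():
        category, _, element = path.partition(".")
        if category not in LOM_CATEGORIES:
            raise ValueError(f"unknown LOM category: {category!r}")
        grouped.setdefault(category, {})[element] = value
    return grouped


grouped = group_by_category(record)

# Aggregation for analysis: how many elements were filled in per category.
counts = Counter(path.split(".")[0] for path in record)
```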

Page 17: Metadata lecture(9 17-14)

A unified schema that provides all key metadata fields needed to comprehensively describe research datasets
• what they are, how they are produced, where they are found
• meets pressing need in current research climate to support sharing, discovery, and re-use of public datasets in a standardized way

Metadata elements describe general features, identifiers, provenance and change, availability and distribution, and dataset statistics

Composed entirely of elements (properties) from existing community vocabularies, e.g. DCMI, DCAT, PROV, VOID, FOAF
• attributes and rules for element use defined in source schema

EXAMPLE 3: W3C HCLS DATASET DESCRIPTION STANDARD

http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/

Page 18: Metadata lecture(9 17-14)

B. VOCABULARIES

Set of terms (often structured) that is used to constrain entry of metadata values

Vocabularies represent general concepts
• Word or code lists
• Hierarchical classifications: taxonomies, thesauri, ontologies
• e.g. ICD-9, SNOMED, MeSH, NCI Thesaurus

Authority lists provide controlled names for proper nouns
• FundRef (organizations)
• Global Gazetteer (places)
• ORCID (people)

Page 19: Metadata lecture(9 17-14)

Open Researcher and Contributor IDentifier (ORCID)
• a nonproprietary alphanumeric code that uniquely identifies scientific and other academic authors (a persistent “author DOI” for researchers)

The ORCID identifier set is coming to serve as a de facto authority list to record persons contributing to scholarly research products

ORCIDs facilitate efforts to track productivity, impact, and attribution based on all scholarly outputs (publications, grants, datasets, protocols, presentations, abstracts, code, blogs, etc.)

Services can aggregate scholarly outputs for a given researcher
• resolves to a “CV” listing all scholarly contributions linked across various venues (e.g. PubMed, Scopus, SlideShare, Figshare, GitHub, Dryad, . . .)

ORCID AS AN AUTHORITY LIST
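One practical consequence of using ORCID as an authority list is that entered identifiers can be checked before they are stored. The sketch below assumes the publicly documented ORCID format: 16 characters in four hyphen-separated groups, the last being an ISO 7064 MOD 11-2 check character. It tests only that an identifier is well formed, not that it is registered to anyone.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid: str) -> bool:
    """Check length, digits, and the trailing check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()
```

For instance, `is_well_formed_orcid("0000-0002-1825-0097")` is true, while changing the final character to `8` makes the check fail.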

Page 20: Metadata lecture(9 17-14)

An underlying model that describes how all the information and concepts inherent in a resource are related to one another

Metadata Models
• conceptualize the metadata schema itself (hierarchical relationships or other mappings between elements)

Domain Models
• conceptualize domain in which the metadata schema operates (classes of things that are annotated and the relationships between them)

C. CONCEPTUAL MODELS

Page 21: Metadata lecture(9 17-14)

EXAMPLE METADATA MODEL: LOM ELEMENT HIERARCHY

The structure of the LOM is an example of a simple conceptual metadata model, which organizes elements into disjoint hierarchies

Page 22: Metadata lecture(9 17-14)

• The summary level describes the dataset in general
• The version level describes a specific version
• The distribution level describes a representation of a version

EXAMPLE DOMAIN MODEL: HCLS DATASET ‘LEVELS’

Supports recommendations for how each should be described using the standard
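The three levels can be pictured as nested records: a summary that has versions, each of which has distributions. The sketch below is only a structural illustration; the class names, attribute names, and example.org URLs are hypothetical and are not the properties defined by the HCLS standard.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    # A concrete representation of one version, e.g. a file in a given format.
    file_format: str
    download_url: str


@dataclass
class Version:
    # A specific release of the dataset.
    version_label: str
    distributions: list[Distribution] = field(default_factory=list)


@dataclass
class DatasetSummary:
    # The dataset in general, independent of any release.
    title: str
    versions: list[Version] = field(default_factory=list)

    def formats_for(self, version_label: str) -> list[str]:
        """List the formats in which a given version is distributed."""
        for version in self.versions:
            if version.version_label == version_label:
                return [d.file_format for d in version.distributions]
        raise KeyError(version_label)


# Hypothetical dataset with two versions.
dataset = DatasetSummary(
    title="Example gene annotation dataset",
    versions=[
        Version("1.0", [Distribution("text/csv", "https://example.org/v1/data.csv")]),
        Version(
            "2.0",
            [
                Distribution("text/csv", "https://example.org/v2/data.csv"),
                Distribution("text/turtle", "https://example.org/v2/data.ttl"),
            ],
        ),
    ],
)
```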

Page 23: Metadata lecture(9 17-14)

D. Practical specifications for use
• provide guidance for how to apply metadata under a given schema
• e.g. HCLS model provides recommendations for when and how to apply certain elements to types of targets in the domain

E. Encoding specifications for presentation & exchange
• rules for binding metadata to syntactic formats such as XML or RDF
• e.g. LOM has precise specification for binding to XML or RDF

D/E. SPECIFICATIONS

Page 24: Metadata lecture(9 17-14)

STORING AND ACCESSING RESOURCE METADATA

Typically lives separately from annotated resources, in databases and/or XML files

Can also be stored within a resource (e.g. photo metadata embedded in the image file itself)

An increasing number of resource catalogs and repositories on the web provide access to metadata, and often to the resource itself
• we have seen examples for books, images, and datasets

These repositories are indexed by search tools and provide programmatic interfaces to allow for resource discovery and re-use

Page 25: Metadata lecture(9 17-14)

Serves same basic needs, but different scale and target of annotation, user base, and primary use cases

II. METADATA IN DATABASE SYSTEMS

Two main categories:

1. Structural metadata

describes the structure of database objects and the relationships between them

commonly encoded externally as ER-diagrams, or internally as summary tables

http://www.visn20.med.va.gov/VISN20/V20/DataWarehouse/Images/LabAutopsy.jpg

Example ER diagram for VA autopsy data
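Structural metadata that a database holds internally can usually be queried like any other data. As a small sketch, the code below creates two hypothetical tables in an in-memory SQLite database and reads back the table names, columns, and foreign-key relationships from SQLite's own catalog. The table and column names are invented for this example and are unrelated to the VA diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        pt_ethnic  TEXT NOT NULL
    );
    CREATE TABLE lab_result (
        result_id    INTEGER PRIMARY KEY,
        patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),
        result_value REAL
    );
    """
)

# Table names come from the sqlite_master catalog table.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = {
    table: [
        (name, col_type)
        for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})")
    ]
    for table in tables
}

# PRAGMA foreign_key_list yields (id, seq, table, from, to, ...) per reference.
foreign_keys = [
    (row[2], row[3], row[4])
    for row in conn.execute("PRAGMA foreign_key_list(lab_result)")
]
```

The `foreign_keys` list captures the same relationship an ER diagram would draw as a line between the two tables.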

Page 26: Metadata lecture(9 17-14)

Serves same basic needs, but different scale and target of annotation, user base, and primary use cases

II. METADATA IN DATABASE SYSTEMS

Two main categories:

2. Content metadata

describes meaning of data at a very fine granularity

specifies attributes of data elements, and rules for recording their values

encoded internally or externally as ‘data dictionaries’

Example of a data set that needs a dictionary to interpret

Page 27: Metadata lecture(9 17-14)

The notion of a ‘data element’ obtains a more precise meaning and specification in the context of a database.
• elements can be specified at finer granularity in a database holding structured data in a controlled operational system

Conceptually, a data element is composed of a concept and a value domain
• concept = the subject of the data recorded for a given element
• value domain = the defined value set for how that data is recorded

Example: PT_ETHNIC
• concept = patient ethnicity
• value domain = [ E1 (caucasian), E2 (hispanic/latino), E3 (african), E4 (asian), E5 (mixed) ]

DATA ELEMENTS
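The PT_ETHNIC example can be written down directly as a concept plus a value domain, and then used to decode stored codes. The sketch below takes the codes and labels from the slide; the `DataElement` class, the `decode` helper, and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: str         # the subject of the data recorded
    value_domain: dict   # stored code -> meaning


PT_ETHNIC = DataElement(
    name="PT_ETHNIC",
    concept="patient ethnicity",
    value_domain={
        "E1": "caucasian",
        "E2": "hispanic/latino",
        "E3": "african",
        "E4": "asian",
        "E5": "mixed",
    },
)


def decode(element: DataElement, code: str) -> str:
    """Translate a stored code into its meaning, rejecting unknown codes."""
    try:
        return element.value_domain[code]
    except KeyError:
        raise ValueError(f"{code!r} is not in the value domain of {element.name}") from None


# Hypothetical rows as they might appear in a patient table.
rows = [{"PT_ID": 1, "PT_ETHNIC": "E2"}, {"PT_ID": 2, "PT_ETHNIC": "E4"}]
decoded = [decode(PT_ETHNIC, row["PT_ETHNIC"]) for row in rows]
```

Without the value domain, a stored `E2` is uninterpretable, which is the situation a data dictionary is meant to prevent.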

Page 28: Metadata lecture(9 17-14)

Provide detailed metadata about data elements
• element identifiers and name(s)
• definitions and descriptions
• value constraints
    • data type
    • default value
    • length
    • allowable values
    • value frequency (mandatory or not)
• provenance and tracking
    • version number, entry and termination dates
    • indicate source table(s)
• mappings to elements in other schema dictionaries

DATA DICTIONARIES
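A dictionary entry that records these attributes can also drive automatic checks on incoming values. The sketch below is one possible shape for such an entry, assuming a simple in-memory representation; the class, its field names, and the sample entry are hypothetical, with the allowable codes borrowed from the PT_GENDER_CODE example later in the lecture.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class DictionaryEntry:
    identifier: str
    name: str
    definition: str
    data_type: type
    max_length: Optional[int] = None
    allowable_values: Optional[tuple] = None
    mandatory: bool = False
    default: Any = None
    version: str = "1.0"


def check_value(entry: DictionaryEntry, value: Any) -> list[str]:
    """Return the constraint violations for one value (empty list if none)."""
    if value is None:
        if entry.mandatory and entry.default is None:
            return [f"{entry.name}: missing mandatory value"]
        return []
    problems = []
    if not isinstance(value, entry.data_type):
        problems.append(f"{entry.name}: expected {entry.data_type.__name__}")
    if entry.max_length is not None and len(str(value)) > entry.max_length:
        problems.append(f"{entry.name}: longer than {entry.max_length}")
    if entry.allowable_values is not None and value not in entry.allowable_values:
        problems.append(f"{entry.name}: {value!r} is not an allowable value")
    return problems


# Hypothetical entry; the local identifier is made up.
gender_entry = DictionaryEntry(
    identifier="local:0001",
    name="PT_GENDER_CODE",
    definition="Coded gender of the patient.",
    data_type=str,
    max_length=1,
    allowable_values=("0", "1", "2", "9"),
    mandatory=True,
)
```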

Page 29: Metadata lecture(9 17-14)

DATA DICTIONARIES

http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf

Simple example of a data dictionary

Page 30: Metadata lecture(9 17-14)

Key Functions
• unambiguous and shared understanding of the data by all users (administrators, analysts, and clients)
• consistent data representation and manipulation (addition, extraction, aggregation, and transformation)
• maintenance of the data model
• data integration, exchange, and re-use

Encoding
• as an external document and/or represented as a table in the database itself

DATA DICTIONARIES

Page 31: Metadata lecture(9 17-14)

1. Clear and thorough element definitions and value set explanations are key

2. Give persistent identifiers to data elements

3. Map data elements to community standards where possible, e.g. common data elements (CDEs)

4. Specify value sets in terms of open controlled vocabularies (CVs) where possible

5. Provide notes and guidelines for context of use

6. Make dictionary easily accessible to all users

DATA DICTIONARY BEST PRACTICES

Page 32: Metadata lecture(9 17-14)

As research moves toward ‘big data’, information from diverse sources is being shared and aggregated for analysis.

A major challenge for managing this data is the diversity of ways that a given idea can be described in data elements
• Sex/gender definitions can be based on genetics, phenotype, or self-identification.
• Values can be recorded as local codes, abbreviations, full labels, or community vocabularies.

DATABASE METADATA INTEROPERABILITY
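The value-recording problem can be illustrated with three hypothetical sites that store the same idea as abbreviations, numeric codes, and full labels. The sketch below maps each local value to a shared value set before aggregating; all site names, codes, and records are invented for this example. It does not address the harder problem of differing definitions.

```python
from collections import Counter

# Per-site mapping from local representation to a shared value set.
LOCAL_TO_COMMON = {
    "site_a": {"M": "male", "F": "female", "U": "unknown"},                       # abbreviations
    "site_b": {"1": "female", "2": "male", "0": "unknown"},                       # local codes
    "site_c": {"Male": "male", "Female": "female", "Not recorded": "unknown"},    # full labels
}


def harmonize(site: str, local_value: str) -> str:
    """Translate one site's local value into the shared value set."""
    try:
        return LOCAL_TO_COMMON[site][local_value]
    except KeyError:
        raise ValueError(f"no mapping for {local_value!r} at {site}") from None


# Hypothetical pooled records as (site, local value) pairs.
records = [("site_a", "F"), ("site_b", "2"), ("site_c", "Female"), ("site_b", "0")]
counts = Counter(harmonize(site, value) for site, value in records)
```

Raising an error on unmapped values, rather than guessing, keeps silent misclassification out of the aggregated data.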

Page 33: Metadata lecture(9 17-14)

The Common Data Element (CDE) movement aims to address this problem by providing standardized data elements that can be re-used across medical datasets

CDEs are
• owned, managed, & curated by a single authority (NINDS, NCI)
• stored and managed in large repositories called CDE registries
• available for diverse areas of clinical practice and research, and at very fine granularity
    • larger repositories hold up to 50,000 elements

CDEs serve as a foundation for interoperability across data systems

COMMON DATA ELEMENTS (CDEs)

Page 34: Metadata lecture(9 17-14)

Metadata registries that collect common data elements for a defined domain

Resemble large scale data dictionaries, but with key differences:
• Exposed in searchable public repositories with additional services to promote extraction and re-use
• Coverage is wider as they are used across different domains and systems
• Metadata element descriptions are far richer to support discovery, provenance, versioning, mappings, meta-modeling

The NIH maintains a portal to information about existing CDE initiatives, registries, and tools (http://www.nlm.nih.gov/cde/)

CDE REGISTRIES

Page 35: Metadata lecture(9 17-14)

Houses >20,000 CDEs
• “Core” element set covers general concepts in medical domain
    • patient demographics, medical history, assessment & examinations, treatments & interventions, outcomes, and study protocol
• “Supplementary” sets covering specific diseases/research areas
    • spinal injury, brain injury, epilepsy and stroke, Parkinson’s disease, ALS

Metadata schema captures 30 element attributes
• this expanded set of attributes supports use cases of enabling discovery and community re-use across different implementations

Portal has search functionality and support for generating clinical forms (CRFs) with CDE mappings embedded in collected data

NINDS CDE REGISTRY

http://www.commondataelements.ninds.nih.gov/

Page 36: Metadata lecture(9 17-14)

The National Cancer Institute cancer Data Standards Registry (caDSR) is the largest and most widely used CDE registry
• >50,000 total elements

Integrates CDEs from several initiatives under a unified model and technical infrastructure

Broad and deep coverage to fine granularity (as with NINDS)

Metadata model is VERY complex
• captures >100 distinct attributes describing each data element in the registry (vs 30 for NINDS)
• implements a complex conceptual model based on the ISO/IEC 11179 metadata registry standard
• decomposes data elements into component parts that are mapped to NCI thesaurus terms (formal encoding of semantics)

NCI caDSR CDE REGISTRY

https://cdebrowser.nci.nih.gov/CDEBrowser/

Page 37: Metadata lecture(9 17-14)

caDSR CONCEPTUAL MODEL

Reasons to examine the model:

1. To understand the data element table and explain why it is so expansive

2. Follows a standard for database metadata registries called ISO/IEC 11179
• commonly implemented in other efforts you may encounter
• e.g. the Clinical Data Interchange Standards Consortium (CDISC), which has similar goals to the caDSR but across a broader domain

3. Is the basis for semantic mappings to ontologies such as the NCI thesaurus, which are an important feature of the model

Page 38: Metadata lecture(9 17-14)

Data Element
• Concept
    • Class
    • Property
• Value Domain
    • Value Representation
    • Valid Values

caDSR CONCEPTUAL MODEL

Page 39: Metadata lecture(9 17-14)

Data Element: PT_GENDER_CODE
• Concept: ‘patient gender’
    • Class: ‘person’
    • Property: ‘gender’

CONCEPT ELEMENT MAPPING

Concept = idea represented by the data element, described independently of a particular representation

Class = a set of real world objects with shared characteristics

Property = a characteristic common to all members of a class

Page 40: Metadata lecture(9 17-14)

Data Element: PT_GENDER_CODE
• Concept: ‘patient gender’
    • Class: ‘person’ (C25190)
    • Property: ‘gender’ (C17357)

Class and property concepts are mapped to NCI taxonomy terms to formally encode their semantics

Class Mapping
• person = C25190

Property Mapping
• gender = C17357

CONCEPT ELEMENT MAPPING

Page 41: Metadata lecture(9 17-14)

Data Element: PT_GENDER_CODE
• Value Domain
    • Value Rep.: ‘person’, ‘gender’, ‘code’
    • Valid Values: ‘0’, ‘1’, ‘2’, ‘9’

VALUE DOMAIN MAPPING

Value domain = a set of attributes describing representational characteristics of instance data

Value Representation = type of value the data represents (along different dimensions)

Valid Values = the actual allowed values for a given value domain

Page 42: Metadata lecture(9 17-14)

Data Element: PT_GENDER_CODE
• Value Domain
    • Value Rep.: ‘person’, ‘gender’, ‘code’
    • Valid Values: ‘0’, ‘1’, ‘2’, ‘9’

Value Representation Mappings
• ‘person’ = C25190
• ‘gender’ = C17357
• ‘code’ = C25162

Valid Value Mappings
• 0 = unknown (C17998)
• 1 = female gender (C46110)
• 2 = male gender (C46109)
• 9 = unspecified (n/a)

VALUE DOMAIN MAPPING
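The decomposition shown on the last few slides can be summarized in a handful of record types. The sketch below mirrors the slide's structure and reuses its NCI codes for PT_GENDER_CODE; the Python class and attribute names are assumptions for illustration and do not reproduce the caDSR or ISO/IEC 11179 data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    label: str
    nci_code: Optional[str]  # None where the slide lists "n/a"


@dataclass(frozen=True)
class Concept:
    name: str
    object_class: Term   # "Class" on the slides
    prop: Term           # "Property" on the slides


@dataclass(frozen=True)
class ValueDomain:
    representation: tuple  # Terms describing the type of value
    valid_values: dict     # stored value -> Term


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: Concept
    value_domain: ValueDomain


PERSON = Term("person", "C25190")
GENDER = Term("gender", "C17357")
CODE = Term("code", "C25162")

PT_GENDER_CODE = DataElement(
    name="PT_GENDER_CODE",
    concept=Concept("patient gender", PERSON, GENDER),
    value_domain=ValueDomain(
        representation=(PERSON, GENDER, CODE),
        valid_values={
            "0": Term("unknown", "C17998"),
            "1": Term("female gender", "C46110"),
            "2": Term("male gender", "C46109"),
            "9": Term("unspecified", None),
        },
    ),
)


def meaning(element: DataElement, stored_value: str) -> Term:
    """Look up the term that a stored value stands for."""
    return element.value_domain.valid_values[stored_value]
```

Because every part carries a code rather than only a label, two registries that both map to `C46110` can be recognized as recording the same value even if their stored codes differ.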

Page 43: Metadata lecture(9 17-14)

[Diagram: Concept (Class, Property) and Value Domain (Value Representation, Valid Values)]

"SEMANTICALLY UNAMBIGUOUS INTEROPERABILITY"

Semantic mappings of these four elements can support more sophisticated search and analysis

Computational tools can leverage logic in the NCI hierarchy for query expansion and data aggregation

Page 44: Metadata lecture(9 17-14)

The structure of the NCI taxonomy supports synonym and hierarchical query expansion

LEVERAGING SEMANTICS

User searches ‘cancer biology’ to view all CDEs related to this concept.

The query is expanded (1) to include any children of this term in the taxonomy, and (2) to include elements with text matching any synonym of cancer in the taxonomy

[Figure: NCI Thesaurus ‘Cancer Biology’ branch]
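The two expansion steps described above can be sketched with a toy taxonomy. The child terms, synonyms, and CDE descriptions below are invented for this example and are not taken from the NCI Thesaurus; the matching is plain substring search, which a real registry would likely replace with indexed search.

```python
# Hypothetical fragment of a taxonomy: term -> child terms.
CHILDREN = {
    "cancer biology": ["tumor growth", "metastasis"],
    "tumor growth": [],
    "metastasis": [],
}

# Hypothetical synonyms: term -> alternative labels.
SYNONYMS = {
    "cancer biology": ["tumor biology"],
    "metastasis": ["metastatic spread"],
}


def expand_query(term: str) -> set:
    """Collect the term, all of its descendants, and their synonyms."""
    expanded, stack = [], [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.append(current)
        stack.extend(CHILDREN.get(current, []))
    with_synonyms = set(expanded)
    for t in expanded:
        with_synonyms.update(SYNONYMS.get(t, []))
    return with_synonyms


def search(cdes: dict, term: str) -> list:
    """Return names of CDEs whose description mentions any expanded term."""
    terms = expand_query(term)
    return [
        name
        for name, text in cdes.items()
        if any(t in text.lower() for t in terms)
    ]


# Hypothetical CDE descriptions.
cdes = {
    "TUMOR_GROWTH_RATE": "Rate of tumor growth per month",
    "METS_SITE": "Site of metastatic spread",
    "PT_AGE": "Patient age in years",
}
hits = search(cdes, "cancer biology")
```

Neither matching description contains the words "cancer biology"; they are found only because the query was expanded through the hierarchy and the synonym list.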

Page 45: Metadata lecture(9 17-14)

Strategies

Map elements in local data dictionaries to CDEs
• Parkinson’s Disease Biomarkers Program (PDBP) data dictionary
• NINDS registry form builder

Build libraries of re-usable, pre-fabricated forms with embedded CDE metadata
• NINDS Case Report Form (CRF) library
• medical-data-models.org forms

Initialize software with CDEs so that electronic forms automatically carry mappings when they are generated
• caDSR registry and CDISC tools

CDEs IN PRACTICE

Page 46: Metadata lecture(9 17-14)

CDEs standardize data elements for use across multiple systems

Available in registries that vary in size and complexity
• some resemble simple data dictionaries with expanded attributes to support discovery and provenance (NINDS)
• some are implemented with complex conceptual models and semantic mappings (caDSR)

Tools and standards supporting practical application exist but are not yet state of the art

Worlds collide: the intersection of metadata for web resources and database systems
• CDEs represent discoverable web resources that are used in the context of data collection and description in database systems
• Each registry defines a metadata framework/schema for a given domain

CDE SUMMARY

Page 47: Metadata lecture(9 17-14)

Promote standardized and systematic data collection

Improve data quality and consistency

Facilitate data sharing and integration

Reduce the cost and time needed to develop data collection tools

Improve opportunities for meta-analysis comparing results across studies

Increase the availability of data for the planning and design of new trials

BENEFITS OF CDEs

Page 48: Metadata lecture(9 17-14)

Data elements across efforts are not well aligned

Tooling support for discovery & application immature

Limited use of community taxonomy and ontology mappings

Navigating complexity and redundancy . . .

. . . of medical data itself
• many ways to calculate and represent simple and complex measures such as tumor burden or medical prognosis

. . . of metadata elements/schemas
• thousands of elements with very nuanced meaning and use
• redundant representation poses challenges for data collection, aggregation, and integrated analysis (even for simple measures)

CHALLENGES FOR DATA INTEGRATION AND ANALYSIS

Page 49: Metadata lecture(9 17-14)

LINKS

Schema Examples:
• DCMI: http://dublincore.org/documents/dcmi-terms/
• IEEE-LOM: http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
• HCLS Dataset description standard: http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/

Data Dictionary Example:
• http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf

CDE Sites:
• NIH CDE Portal: http://www.nlm.nih.gov/cde/
• NINDS CDE Registry: http://www.commondataelements.ninds.nih.gov/
• caDSR browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
• caDSR tools: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models

CDEs in Practice:
• PDBP gender data dictionary entry: https://dictionary.pdbp.ninds.nih.gov/portal/publicData/dataElementAction!view.action?dataElementId=5585
• NINDS form builder: http://www.commondataelements.ninds.nih.gov/CRF.aspx?source=formBuilder
• Downloadable forms (CRFs) from NINDS with embedded CDE links: http://www.commondataelements.ninds.nih.gov/CRF.aspx
• medical-data-models.org forms: https://medical-data-models.org/forms/1049
• Suite of tools on caDSR site: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models