Equivalence is in the (ID) of the beholder


Upload: mhaendel

Post on 29-Jan-2018


Page 1: Equivalence is in the (ID) of the beholder

Equivalence is in the (ID) of the beholder

Prefix Commons Biocontext

versus one-“ID” Willie

Biomedical Data Translator

NIH Data Commons

Melissa Haendel @ontowonka

Page 2: Equivalence is in the (ID) of the beholder

Finding treasure requires aligning different knowledge modalities and perspectives

Real-world concepts

Annotations

Mappings

Standards and team science

Page 3: Equivalence is in the (ID) of the beholder

Data science is a young field; the data is crufty, but it is valuable

Page 4: Equivalence is in the (ID) of the beholder

What integrators are aiming to do is non-trivial

But… biological things can have multiple IDs, and each ID can be written in multiple ways. Some persist, some don’t. Some have machine-readable data, some don’t. This makes integration hard.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4981258/figure/fig1/

Page 5: Equivalence is in the (ID) of the beholder

Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene

[Figure: required harmonization (vertical axis) plotted against question complexity (horizontal axis). Labels: What? Why? How? How many?; Identifiers; Identifiers & Metadata; Identifiers & Metadata & Relationships; FAIR.]

Page 6: Equivalence is in the (ID) of the beholder

Identifier Reality: Not all IDs are created equal

Not all communities are equally poised to use them

We need systems that accommodate the heterogeneity

[Figure: identifier maturity (vertical axis) plotted against scholarly output maturity (horizontal axis). Labels: Traditional Literature; Non-Traditional; Persistent; Ephemeral; Non-existent; Connected; Genomic resources; Wild west of identifier tumbleweed.]

Page 7: Equivalence is in the (ID) of the beholder

Goonies may not need IDs, but agents do

Page 8: Equivalence is in the (ID) of the beholder
Page 9: Equivalence is in the (ID) of the beholder

Pharos

DrugBank

PubMed

KEGG

WikiPathways

HGNC

HGNC

NCBI

ClinVar

OMIM

Page 10: Equivalence is in the (ID) of the beholder

https://github.com/biolink/biolink-model/blob/master/biolink-model.yaml

http://github.com/ncats-tangerine/beacon-aggregator

Genes: NCBI Gene, HGNC, Ensembl, OMIM, Organism-DBs, Panther, Mygene.info, Myvariant.info, ClinVar

Diseases: MONDO, OMIM, Orphanet, ICD-9 / ICD-10 / ICD-11, SNOMED-CT

Chemicals: ChEBI, DrugBank, Pharos

Proteins: UniProt/SwissProt, WikiPathways, Reactome

The NCATS Data Translator Knowledge Map: Will define a menu of types, IDs, and paths available to reasoners

Page 11: Equivalence is in the (ID) of the beholder

Genes + Environment = Phenotypes

Biology is not just identifying the bits, but also the relationships between them.

G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to

G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with

P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)

E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)

G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
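The relationship lists above can be read as a small vocabulary of typed edges. The following is a minimal sketch, assuming lowercase category names (`gene`, `phenotype_or_disease`, `environment`) and an `Edge` class that are hypothetical and not part of any Translator or Monarch API; the `EX:` identifiers are placeholders.

```python
from dataclasses import dataclass

# Predicates copied from the slide, keyed by (subject type, object type).
# The category names are illustrative assumptions.
ALLOWED_PREDICATES = {
    ("gene", "phenotype_or_disease"): {
        "causes", "contributes to", "is risk factor for", "protects against",
        "correlates with", "is marker for", "modulates", "involved in",
        "increases susceptibility to",
    },
    ("gene", "gene"): {
        "regulates", "negatively regulates", "positively regulates",
        "directly regulates", "interacts with", "co-localizes with",
        "co-expressed with",
    },
    ("phenotype_or_disease", "phenotype_or_disease"): {
        "part of", "results in", "co-occurs with", "correlates with",
        "hallmark of",  # the slide restricts this one to P->D
    },
    ("environment", "phenotype_or_disease"): {
        "contributes to", "influences", "exacerbates",
    },
    ("phenotype_or_disease", "environment"): {"manifest in"},
    ("gene", "environment"): {
        "expressed in", "expressed during", "contains", "inactivated by",
    },
}


@dataclass(frozen=True)
class Edge:
    subject: str
    subject_type: str
    predicate: str
    object: str
    object_type: str


def is_valid(edge: Edge) -> bool:
    """Return True if the predicate is listed for this pair of types."""
    allowed = ALLOWED_PREDICATES.get((edge.subject_type, edge.object_type), set())
    return edge.predicate in allowed


# Placeholder identifiers, not real records.
good = Edge("EX:gene1", "gene", "contributes to", "EX:disease1", "phenotype_or_disease")
bad = Edge("EX:gene1", "gene", "exacerbates", "EX:disease1", "phenotype_or_disease")
print(is_valid(good), is_valid(bad))
```

Keeping the predicate together with both endpoint types is what lets an integrator reject an edge that is syntactically fine but biologically mistyped.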

Page 12: Equivalence is in the (ID) of the beholder
Page 13: Equivalence is in the (ID) of the beholder

Identifier cleaning is a shared pain point (and a huge time sink), but there are no ways to share the outcomes

Let’s discuss!!

Page 14: Equivalence is in the (ID) of the beholder

CHEMBL1431

Pharos:O15244 (one of 355)

IDG:D908

reactome:R-HSA-374914

The NCATS BioMedical Translator Knowledge Map is a set of actions the agent can perform:

3rd party provider value: APIs and connectivity

reactome:HSA-1430728

clinvar.variant:443497

medgen:87607 Short stature

medgen:892473 Abnormality of cardiovascular system morphology

medgen:427827 Abnormality of the ear

medgen:776570 Polydactyly

Variant

Phenotypes

MonarchPhenotypeToDisease Phenotypes

Disease

HP:0004322 Short stature

HP:0030680 Abnormality of cardiovascular system morphology

HP:0000598 Abnormality of the ear

HP:0010442 Polydactyly

MONDO:0019391 Fanconi Anemia Disease

Phenotype
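The chain on this slide (variant, then phenotypes in one namespace, then the same phenotypes in another, then a disease) can be sketched as a sequence of lookups. This is only an illustration: the three dictionaries are static stand-ins for the provider APIs, the medgen-to-HP pairs are matched by the labels shown on the slide, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

# Step 1: variant to phenotypes, as drawn on the slide.
VARIANT_TO_PHENOTYPES = {
    "clinvar.variant:443497": [
        "medgen:87607", "medgen:892473", "medgen:427827", "medgen:776570",
    ],
}

# Step 2: namespace translation, paired by identical labels on the slide.
MEDGEN_TO_HP = {
    "medgen:87607": "HP:0004322",   # Short stature
    "medgen:892473": "HP:0030680",  # Abnormality of cardiovascular system morphology
    "medgen:427827": "HP:0000598",  # Abnormality of the ear
    "medgen:776570": "HP:0010442",  # Polydactyly
}

# Step 3: stand-in for a phenotype-to-disease service (assumption).
PHENOTYPES_TO_DISEASE = {
    frozenset(MEDGEN_TO_HP.values()): "MONDO:0019391",
}


def variant_to_disease(variant_id: str) -> Tuple[List[str], Optional[str]]:
    """Follow variant -> medgen phenotypes -> HP phenotypes -> disease."""
    medgen_ids = VARIANT_TO_PHENOTYPES.get(variant_id, [])
    hp_ids = [MEDGEN_TO_HP[m] for m in medgen_ids if m in MEDGEN_TO_HP]
    disease = PHENOTYPES_TO_DISEASE.get(frozenset(hp_ids))
    return hp_ids, disease


hp_ids, disease = variant_to_disease("clinvar.variant:443497")
print(hp_ids, disease)
```

Every hop depends on the two sides agreeing on how an ID is written, which is why the equivalence problems on the following slides matter.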

Page 15: Equivalence is in the (ID) of the beholder

download everything

crawl & index

high-level abstraction

trial & error

more details / fewer details

Burden on integrator (less coordination)

Burden on provider (more coordination)

Data integration is a socio-technical problem

Page 16: Equivalence is in the (ID) of the beholder

(Way) beyond linkrot: types of equivalence pain

Bit.ly/evidence-of-identifier-pain

Page 17: Equivalence is in the (ID) of the beholder

Challenge 1: ID Syntax polymorphism

There is no ID that is immune to getting mangled

A fine strategy for 2 variations but …

Page 18: Equivalence is in the (ID) of the beholder

Challenge 1: ID Syntax polymorphism

There is no ID that is immune to getting mangled

SNCA (synuclein alpha) Implicated in Parkinson Disease

bit.ly/ncbi-identifier-permutations

http://view.ncbi.nlm.nih.gov/gene/6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Gene&term=6622
http://www.ncbi.nlm.nih.gov/gene?term=6622
http://www.ncbi.nlm.nih.gov/gene/?term=6622
http://www.ncbi.nlm.nih.gov/gene/6622
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=gene&list_uids=40966&dopt=full_report
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=gene&list_uids=6622
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=6622
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=6469
https://www.ncbi.nlm.nih.gov/gene/6622

https://orcid.org/0000-0002-1825-0097
http://orcid.org/0000-0002-1825-0097
orcid.org/0000-0002-1825-0097
0000-0002-1825-0097
ORCiD:0000-0002-1825-0097
orcid:0000-0002-1825-0097
ORCiD: 0000-0002-1825-0097
orcid: 0000-0002-1825-0097

DOIs, ORCIDs

532 possible combinations of short form and http identifiers for the same gene in the same DB!
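One way to tame this polymorphism is to roll every observed spelling up to a single compact form. The sketch below is illustrative only: it covers just the URL and prefix shapes listed on this slide, and the output prefixes `orcid:` and `NCBIGene:` are assumptions rather than a mandated convention.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# An ORCID iD is four dash-separated blocks; the last character may be X.
ORCID_RE = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$", re.IGNORECASE)

# Query parameters that carry the gene ID in the URLs on the slide.
GENE_QUERY_KEYS = ("list_uids", "term", "termtosearch")


def normalize_orcid(raw: str) -> Optional[str]:
    """Collapse URL, bare, and prefixed ORCID spellings to one form."""
    match = ORCID_RE.search(raw.strip())
    return f"orcid:{match.group(1).upper()}" if match else None


def normalize_ncbi_gene(url: str) -> Optional[str]:
    """Extract the numeric gene ID from path-style or query-style URLs."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("ncbi.nlm.nih.gov"):
        return None
    path_match = re.fullmatch(r"/gene/(\d+)/?", parsed.path)
    if path_match:
        return f"NCBIGene:{path_match.group(1)}"
    # Parameter names vary in case (db, Db, cmd, Cmd), so lowercase them.
    query = {k.lower(): v for k, v in parse_qs(parsed.query).items()}
    databases = [v.lower() for v in query.get("db", ["gene"])]
    if "gene" not in databases:
        return None
    for key in GENE_QUERY_KEYS:
        for value in query.get(key, []):
            if value.isdigit():
                return f"NCBIGene:{value}"
    return None


print(normalize_orcid("ORCiD: 0000-0002-1825-0097"))
print(normalize_ncbi_gene(
    "http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=gene&list_uids=6622"
))
```

A normalizer like this only helps if its rules are shared, which is the point of the mitigation on the next slide: document the expected forms so that everyone rolls up the same way.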

Page 19: Equivalence is in the (ID) of the beholder

Mitigation 1. Syntax polymorphism: minimize it where you can, document it where you can’t

Repos:
• Document how you want others to reference your IDs
• Implement QC to check the IDs of records you xref
• Implement Signposting (Herbert van de Sompel)

Technical Needs:
• Generic identifier QC services (identifiers.org?)
• Tooling to diagnose and roll up equivalents (underway)
• Monarch Initiative / NIH Data Commons / NIH Data Translator
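The "QC to check the IDs of records you xref" item could start as a per-prefix pattern check. This is a minimal sketch: the `PATTERNS` table is a hand-written assumption for illustration, where a real service would load patterns from a registry such as identifiers.org, and `check_curie` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed local-ID patterns per prefix; verify against each source's own docs.
PATTERNS = {
    "HP": r"\d{7}",
    "MONDO": r"\d{7}",
    "OMIM": r"\d{6}",
    "CHEBI": r"\d+",
}


def check_curie(curie: str) -> Optional[str]:
    """Return a problem description, or None if the CURIE passes QC."""
    if curie != curie.strip() or " " in curie:
        return "contains whitespace"
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return "missing prefix"
    if prefix not in PATTERNS:
        return f"unknown prefix {prefix!r}"
    if not re.fullmatch(PATTERNS[prefix], local_id):
        return f"local ID {local_id!r} does not match the expected pattern"
    return None


for candidate in ["HP:0004322", "HP:4322", "hp:0004322", "OMIM: 130000"]:
    print(candidate, "->", check_curie(candidate) or "ok")
```

Running a check like this at load time catches mangled xrefs before they silently fail to join.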

Page 20: Equivalence is in the (ID) of the beholder

PDBsum: http://www.ebi.ac.uk/pdbsum/2gc4

Proteopedia: http://proteopedia.org/wiki/index.php/2gc4

PDB Europe: http://www.pdbe.org/2gc4

RCSB PDB: http://www.rcsb.org/pdb/explore/explore.do?structureId=2gc4

PDBj: http://pdbj.org/mine/summary/2gc4

Challenge 2: Coordinated mirroring of exact record copies within consortium members

2gc4 is a 16 chain structure with sequence from Paracoccus denitrificans

Page 21: Equivalence is in the (ID) of the beholder

Mitigation 2: Use machine-readable documentation about distributions

Find sources (e.g. Identifiers.org) of machine-readable documentation about all possible endpoints

Note this doesn’t work with ad-hoc deposition of a single record in multiple places (e.g. preprint servers, institutional repositories, etc.)

https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000020
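Once the endpoints are documented in machine-readable form, mirrored copies can be recognized as the same record mechanically. The sketch below hard-codes the five URL shapes from the Challenge 2 slide as templates; a real implementation would fetch them from a registry. The `pdb:` output prefix and the four-character ID pattern (a digit followed by three alphanumerics) are assumptions.

```python
import re
from typing import Dict, Optional

# URL templates copied from the mirror links on the Challenge 2 slide.
PDB_MIRRORS = {
    "PDBsum": "http://www.ebi.ac.uk/pdbsum/{id}",
    "Proteopedia": "http://proteopedia.org/wiki/index.php/{id}",
    "PDB Europe": "http://www.pdbe.org/{id}",
    "RCSB PDB": "http://www.rcsb.org/pdb/explore/explore.do?structureId={id}",
    "PDBj": "http://pdbj.org/mine/summary/{id}",
}

PDB_ID = r"([0-9][A-Za-z0-9]{3})"  # assumed shape of a classic PDB ID


def expand(pdb_id: str) -> Dict[str, str]:
    """List every documented mirror URL for one structure ID."""
    return {name: tpl.format(id=pdb_id) for name, tpl in PDB_MIRRORS.items()}


def resolve(url: str) -> Optional[str]:
    """Map any documented mirror URL back to a single compact ID."""
    for template in PDB_MIRRORS.values():
        prefix, suffix = template.split("{id}")
        pattern = re.escape(prefix) + PDB_ID + re.escape(suffix)
        match = re.fullmatch(pattern, url)
        if match:
            return "pdb:" + match.group(1).lower()
    return None


print(resolve("http://pdbj.org/mine/summary/2gc4"))
print(expand("2gc4")["RCSB PDB"])
```

As the note above says, this works only for coordinated mirrors with documented endpoints, not for ad-hoc copies deposited in arbitrary places.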

Page 22: Equivalence is in the (ID) of the beholder

Challenge 3: Knowledge evolution and implications for over-eager collapse

http://www.autism-society.org/dsm-iv-diagnostic-classifications/#autism

This evolution makes it hard to compare diagnoses made at different times.

DSM change

Analogous examples: organizational membership evolution and partonomy

Page 23: Equivalence is in the (ID) of the beholder

Mitigation 3: Capture provenance, versioning, and context

Make use of it in analysis

[Figure legend: 1, no changes in reporting practices and no calendar effect (baseline prevalence); 2, only a calendar effect; 3, a calendar effect and a diagnostic change; 4, a calendar effect and the inclusion of outpatients; 5, a calendar effect, a diagnostic change, and the inclusion of outpatients (total observed prevalence).]

doi:10.1001/jamapediatrics.2014.1893
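Capturing version and context only pays off if the analysis uses them. As a minimal sketch, the record type below carries the vocabulary version and care setting alongside each code so that counts can be stratified before they are compared. All field names, codes, and records are hypothetical placeholders, not data from the cited study.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Diagnosis:
    patient_id: str
    code: str                # code as recorded at the time
    vocabulary: str          # e.g. "DSM"
    vocabulary_version: str  # edition in force when the diagnosis was made
    year: int
    setting: str             # "inpatient" or "outpatient"


def stratify(records: Iterable[Diagnosis]) -> Dict[Tuple[str, str, str], List[Diagnosis]]:
    """Group records by (vocabulary, version, setting) before comparing."""
    groups: Dict[Tuple[str, str, str], List[Diagnosis]] = defaultdict(list)
    for record in records:
        key = (record.vocabulary, record.vocabulary_version, record.setting)
        groups[key].append(record)
    return dict(groups)


# Placeholder records for illustration only.
records = [
    Diagnosis("p1", "CODE-A", "DSM", "IV", 1999, "inpatient"),
    Diagnosis("p2", "CODE-A", "DSM", "IV", 2001, "outpatient"),
    Diagnosis("p3", "CODE-B", "DSM", "5", 2014, "outpatient"),
]
for key, group in stratify(records).items():
    print(key, len(group))
```

Without the version and setting fields, the three records would collapse into one count and the diagnostic change would be indistinguishable from a real change in prevalence.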

Page 24: Equivalence is in the (ID) of the beholder

Challenge 4: Closely-related real-world entities whose distinction matters only to some

Ala

D-Alanine, L-Alanine

CHEBI:16977, CHEBI:16449, CHEBI:15570

Page 25: Equivalence is in the (ID) of the beholder

Mitigation 4: create systems that let users choose “lenses” for fuzzy or exact matching

Exact / Fuzzy

Example implementations: OpenPhacts (drugs), Monarch Initiative (phenotypes)
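A "lens" can be as simple as a switch that decides whether closely related entities are collapsed before comparison. The sketch below uses the alanine example from Challenge 4. It assumes that CHEBI:16449 is alanine with unspecified stereochemistry, CHEBI:16977 is L-alanine, and CHEBI:15570 is D-alanine; check those labels against ChEBI before relying on them. The names `Lens` and `same_entity` are hypothetical and are not the OpenPhacts or Monarch APIs.

```python
from enum import Enum


class Lens(Enum):
    EXACT = "exact"  # stereoisomers are distinct
    FUZZY = "fuzzy"  # stereoisomers collapse to the unspecified form


# Assumed grouping: each stereoisomer points at the generic compound.
STEREO_PARENT = {
    "CHEBI:16977": "CHEBI:16449",  # assumed L-alanine -> alanine
    "CHEBI:15570": "CHEBI:16449",  # assumed D-alanine -> alanine
}


def canonical(curie: str, lens: Lens) -> str:
    """Return the ID to compare on under the chosen lens."""
    if lens is Lens.FUZZY:
        return STEREO_PARENT.get(curie, curie)
    return curie


def same_entity(a: str, b: str, lens: Lens) -> bool:
    return canonical(a, lens) == canonical(b, lens)


print(same_entity("CHEBI:16977", "CHEBI:15570", Lens.EXACT))
print(same_entity("CHEBI:16977", "CHEBI:15570", Lens.FUZZY))
```

The data stays unmerged; the user picks the lens per question, so a chemist and a clinician can both be right.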

Page 26: Equivalence is in the (ID) of the beholder

Challenge 5: Partially recapitulated records that reference the same ID

EHLERS-DANLOS SYNDROME, CLASSIC TYPE, 1; EDSCL1

Ehlers-Danlos syndrome, classic type

https://omim.org/entry/130000 https://monarchinitiative.org/disease/MONDO:0007522

https://www.malacards.org/card/ehlers_danlos_syndrome_classic_type

OMIM:130000

Page 27: Equivalence is in the (ID) of the beholder

Mitigation 5: Use authoritative IDs as redirection/query entry points into related data

Use JSON-LD context files to document in/out paths

[Figure: incoming and outgoing URL forms for OMIM:130000. URLs shown: http://purl.obolibrary.org/obo/OMIM_130000; https://monarchinitiative.org/OMIM:130000; https://bio2rdf.org/omim:130000]

https://github.com/prefixcommons/biocontext/blob/master/registry/monarch_context.jsonld

PrefixCommons: https://github.com/prefixcommons/biocontext
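A JSON-LD context is essentially a prefix-to-base-URI table, so the in/out paths can be sketched as contracting an incoming URL to a CURIE and expanding that CURIE under a different context. The table below is built from URLs that appear on these slides and is an illustrative assumption, not the contents of the actual monarch_context.jsonld file; `contract` and `expand` are hypothetical helper names.

```python
from typing import Optional

# One prefix map per provider, shaped like the "@context" of a JSON-LD file.
CONTEXTS = {
    "obo": {"OMIM": "http://purl.obolibrary.org/obo/OMIM_"},
    "monarch": {"OMIM": "https://monarchinitiative.org/OMIM:"},
    "bio2rdf": {"OMIM": "https://bio2rdf.org/omim:"},
    "omim": {"OMIM": "https://omim.org/entry/"},
}


def contract(uri: str) -> Optional[str]:
    """Incoming: turn any known URL form into a CURIE (longest base wins)."""
    best = None
    for context in CONTEXTS.values():
        for prefix, base in context.items():
            if uri.startswith(base) and (best is None or len(base) > len(best[1])):
                best = (prefix, base)
    if best is None:
        return None
    prefix, base = best
    return f"{prefix}:{uri[len(base):]}"


def expand(curie: str, context_name: str) -> str:
    """Outgoing: write a CURIE as the URL form one provider expects."""
    prefix, local_id = curie.split(":", 1)
    return CONTEXTS[context_name][prefix] + local_id


curie = contract("https://bio2rdf.org/omim:130000")
print(curie)
print(expand(curie, "obo"))
```

The authoritative ID sits in the middle, and every provider only has to publish its own context for others to route in and out.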

Page 28: Equivalence is in the (ID) of the beholder

Challenge 6: Records that [seem to] be about the same real-world concept (or maybe not-so-real)

Too little equivalence: Missing connections

Too much equivalence: False positives

Page 29: Equivalence is in the (ID) of the beholder

Challenge 6 cont’d: Post-hoc harmonization

Page 30: Equivalence is in the (ID) of the beholder

Challenge 6 cont’d: Fuzzy Match on xrefs/content

How are these 11 records for “Ehlers Danlos Syndrome” related to each other?

Narrow synonym? Broad? Exact? Child? Parent?

Bayesian models like k-BOOM can help (Mungall, doi:10.1101/048843)

bit.ly/xref-wildwest

Page 31: Equivalence is in the (ID) of the beholder

[Figure legend: OMIM (brown); MESH (grey); ORDO/Orphanet (yellow); SubClassOf (solid line); Xref (dashed grey line)]

Hemolytic anemia mappings across resources

Each vocabulary is different, and they map to each other inconsistently, leading to poor interoperability and computability

Page 32: Equivalence is in the (ID) of the beholder
Page 33: Equivalence is in the (ID) of the beholder

Mitigation 6: Use algorithms to determine probability of equivalence

Bayesian OWL Ontology Merging (k-BOOM), Mungall et al. http://bit.ly/k-BOOM

Determine probability of equivalence based upon:

1. Synonyms

2. XREFs (ideally semantically typed)

3. Graph structure

4. (Partial) string matching of labels

5. Prior weighting (e.g. if you know a source has specific or poor xref’ing curation strategies)

Example applied to diseases: http://obofoundry.org/ontology/mondo.html
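To make the idea of evidence-weighted equivalence concrete, here is a deliberately simplified stand-in. It is not k-BOOM: it ignores graph structure and ontology axioms entirely and just adds log likelihood ratios for three of the evidence types listed above, starting from a prior. The likelihood ratios, the prior, the label threshold, and the record layout are all hypothetical values chosen for illustration.

```python
import math
from difflib import SequenceMatcher

# Hypothetical likelihood ratios: how much each observation favors equivalence.
LIKELIHOOD_RATIOS = {"shared_xref": 20.0, "shared_synonym": 8.0, "similar_label": 4.0}


def label_similarity(a: str, b: str) -> float:
    """Partial string match score between two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def equivalence_probability(rec_a: dict, rec_b: dict,
                            prior: float = 0.01,
                            label_threshold: float = 0.9) -> float:
    """Combine independent pieces of evidence in log-odds space."""
    log_odds = math.log(prior / (1.0 - prior))
    if set(rec_a["xrefs"]) & set(rec_b["xrefs"]):
        log_odds += math.log(LIKELIHOOD_RATIOS["shared_xref"])
    synonyms_a = {s.lower() for s in rec_a["synonyms"]}
    synonyms_b = {s.lower() for s in rec_b["synonyms"]}
    if synonyms_a & synonyms_b:
        log_odds += math.log(LIKELIHOOD_RATIOS["shared_synonym"])
    if label_similarity(rec_a["label"], rec_b["label"]) >= label_threshold:
        log_odds += math.log(LIKELIHOOD_RATIOS["similar_label"])
    return 1.0 / (1.0 + math.exp(-log_odds))


# Labels and IDs are from the Challenge 5 slide; xrefs and synonyms are illustrative.
omim = {"label": "EHLERS-DANLOS SYNDROME, CLASSIC TYPE, 1",
        "synonyms": ["EDSCL1"], "xrefs": ["OMIM:130000"]}
mondo = {"label": "Ehlers-Danlos syndrome, classic type",
         "synonyms": ["EDSCL1"], "xrefs": ["OMIM:130000"]}
other = {"label": "Hemolytic anemia", "synonyms": [], "xrefs": []}

print(equivalence_probability(omim, mondo))
print(equivalence_probability(omim, other))
```

The prior is where source-specific knowledge enters: a lower prior for a source known to xref loosely makes the same evidence count for less.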

Page 34: Equivalence is in the (ID) of the beholder

ID21C: Tangible, actionable community best practice on identifiers for data integration

doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)

Please comment!

Page 35: Equivalence is in the (ID) of the beholder

Goldilocks approach to (ID) standards

NON-adoption to nano adoption

Only the information useful for action

Page 36: Equivalence is in the (ID) of the beholder

UDP:2542

Impaired platelet aggregation (HP:0003540)

Thrombocytopenia (HP:0001873)

Abnormal platelet activation (MP:0006298)

Thrombocytopenia (MP:0003179)

Genetics in Medicine 18, 608–617 (2016) doi:10.1038/gim.2015.137

MGI:3764834

If we get equivalence right, what does that make possible?

Page 37: Equivalence is in the (ID) of the beholder

• Traceable
  • Attributable
  • Versioned
  • High-level typing of records for desired audience (e.g. is the record for a gene? A disease? An instrumentation readout?)

• Licensed
  • Pick a standard license; encode in file header (using a URL like https://creativecommons.org/publicdomain/zero/1.0 if possible)
  • PIDs for license types?

• Connected:
  • Syntactic and semantic harmonization
  • Use standardized metadata, vocabs, ontologies
  • Document your identifier scheme and follow the ID21C guidelines or other best practices to avoid syntactic variation
  • Capture cross references but ensure that you semantically qualify them
  • Tell others how to link to you
  • APIs please; data dumps are not enough, but if you do nothing else:
    • Tell us what types of ids to expect and how they are related to one another
    • Support bulk identifier operations in APIs, especially for xref’d IDs

FAIR-TLC: Traceable, Licensed, Connected

bit.ly/fair-tlc (https://doi.org/10.5281/zenodo.203295) reusabledata.org
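Two of the "Connected" recommendations, semantically qualified cross references and bulk identifier operations, can be sketched together. The predicate names below are the standard SKOS mapping properties; the `Xref` class, the `bulk_xrefs` function, and the specific qualifications assigned to each pair are illustrative assumptions (the ID pairs themselves appear on earlier slides).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional, Set


class MatchType(Enum):
    EXACT = "skos:exactMatch"
    CLOSE = "skos:closeMatch"
    BROAD = "skos:broadMatch"
    NARROW = "skos:narrowMatch"
    RELATED = "skos:relatedMatch"


@dataclass(frozen=True)
class Xref:
    subject: str
    predicate: MatchType
    object: str


# The qualification chosen for each pair is an assumption for illustration.
XREFS = [
    Xref("MONDO:0007522", MatchType.EXACT, "OMIM:130000"),
    Xref("HP:0001873", MatchType.CLOSE, "MP:0003179"),  # human vs mouse term
]


def bulk_xrefs(ids: Iterable[str],
               allowed: Optional[Set[MatchType]] = None) -> Dict[str, List[Xref]]:
    """Look up many IDs in one call, optionally filtering by match type."""
    result: Dict[str, List[Xref]] = {identifier: [] for identifier in ids}
    for xref in XREFS:
        if xref.subject in result and (allowed is None or xref.predicate in allowed):
            result[xref.subject].append(xref)
    return result


strict = bulk_xrefs(["MONDO:0007522", "HP:0001873"], allowed={MatchType.EXACT})
print({k: [x.object for x in v] for k, v in strict.items()})
```

An unqualified xref forces every consumer to guess whether it means "same as" or merely "see also"; the predicate makes that choice explicit and filterable.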

Page 38: Equivalence is in the (ID) of the beholder

Don’t hoard your data like One-“ID” Willie; help it travel well via better equivalency support.

With thanks to: Julie McMurry, Chris Mungall, Mathias Wawer and whole team

https://monarchinitiative.org/page/team

https://github.com/orgs/NCATS-Tangerine/people

John Kunze, Greg Janee, Sarala Wimalaratne, Nick Juty, others

5 R24 OD011883; OT3 TR002019-01S2; 1 OT OD025464-01; 1 OT3 HL142479-01; U24TR002306