Equivalence is in the (ID) of the beholder

PrefixCommons Biocontext

versus one-“ID” Willie

Biomedical Data Translator

NIH Data Commons

Melissa Haendel @ontowonka

Finding treasure requires aligning different knowledge modalities and perspectives

Real-world concepts

Annotations

Mappings

Standards and team science

Data science is a young field; the data is crufty, but it is valuable

What integrators are aiming to do is non-trivial

But… biological things can have multiple IDs, and each ID can be written in multiple ways. Some persist, some don’t. Some have machine-readable data, some don’t. This makes integration hard.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4981258/figure/fig1/

Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene

What? Why? How?

Identifiers

Identifiers & Metadata

Identifiers & Metadata & Relationships

Figure axes: Required harmonization (vertical) vs. Question complexity (horizontal)

Figure labels: FAIR; How many?

Identifier Reality:
Not all IDs are created equal
Not all communities are equally poised to use them
We need systems that accommodate the heterogeneity

Traditional Literature

Non-Traditional

Persistent

Ephemeral

Non-existent

Identifier Maturity

Scholarly Output Maturity

Genomic resources

Wild west of identifier tumbleweed

Connected

Goonies may not need IDs, but agents do

Pharos

DrugBank

PubMed

KEGG

WikiPathways

HGNC

HGNC

NCBI

ClinVar

OMIM

https://github.com/biolink/biolink-model/blob/master/biolink-model.yaml

http://github.com/ncats-tangerine/beacon-aggregator

Genes: NCBI Gene, HGNC, Ensembl, OMIM, Organism-DBs, Panther, Mygene.info, Myvariant.info, ClinVar

Diseases: MONDO, OMIM, Orphanet, ICD-9 / ICD-10 / ICD-11, SNOMED-CT

Chemicals: ChEBI, DrugBank, Pharos

Proteins: UniProt/SwissProt, WikiPathways, Reactome

The NCATS Data Translator Knowledge Map: Will define a menu of types, IDs, and paths available to reasoners

Genes + Environment = Phenotypes

Biology is not just identifying the bits, but also the relationships between them.

G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to

G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with

P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)

E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)

G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
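A minimal sketch of how typed relationships like those above could be held as subject-predicate-object edges. The entity IDs (GENE:1, DISEASE:1, and so on) are placeholders invented for illustration, not curated assertions; only the predicate names come from the lists above.

```python
from typing import NamedTuple, Optional, Set


class Association(NamedTuple):
    subject: str    # CURIE of the first entity
    predicate: str  # relationship type, taken from the lists above
    obj: str        # CURIE of the second entity


# Hypothetical edges; the IDs are placeholders, not real records.
EDGES = [
    Association("GENE:1", "causes", "DISEASE:1"),
    Association("GENE:1", "interacts with", "GENE:2"),
    Association("GENE:2", "is risk factor for", "DISEASE:1"),
    Association("ENV:1", "exacerbates", "PHENO:1"),
]


def neighbors(edges, node: str, predicates: Optional[Set[str]] = None):
    """Objects linked from `node`, optionally restricted to given predicates."""
    return [
        e.obj
        for e in edges
        if e.subject == node and (predicates is None or e.predicate in predicates)
    ]
```

Keeping the predicate explicit lets a query ask for "causes" edges only, rather than treating every link between a gene and a disease as the same kind of claim.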

Identifier cleaning is a shared pain point (and a huge time sink), but there are no ways to share the outcomes

Let’s discuss!!

CHEMBL1431

Pharos:O15244 (one of 355)

IDG:D908

reactome:R-HSA-374914

The NCATS Biomedical Translator Knowledge Map is a set of actions the agent can perform:
3rd party provider value: APIs and connectivity

reactome:HSA-1430728

clinvar.variant:443497

medgen:87607 Short stature

medgen:892473 Abnormality of cardiovascular system morphology

medgen:427827 Abnormality of the ear

medgen:776570 Polydactyly

Variant

Phenotypes

MonarchPhenotypeToDisease Phenotypes

Disease

HP:0004322 Short stature

HP:0030680 Abnormality of cardiovascular system morphology

HP:0000598 Abnormality of the ear

HP:0010442 Polydactyly

MONDO:0019391 Fanconi Anemia Disease

Phenotype
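A rough sketch of the variant-to-phenotype-to-disease path drawn on this slide, treating each knowledge-map action as a function. The lookup tables are hard-coded from the IDs shown above; the function names and the exact-profile matching rule are assumptions made for illustration, and a real action would call a provider API instead.

```python
# Lookups hard-coded from the IDs on the slide (illustration only).
VARIANT_TO_MEDGEN = {
    "clinvar.variant:443497": [
        "medgen:87607", "medgen:892473", "medgen:427827", "medgen:776570",
    ],
}
MEDGEN_TO_HP = {
    "medgen:87607": "HP:0004322",   # Short stature
    "medgen:892473": "HP:0030680",  # Abnormality of cardiovascular system morphology
    "medgen:427827": "HP:0000598",  # Abnormality of the ear
    "medgen:776570": "HP:0010442",  # Polydactyly
}
# Assumption: a disease is returned only when the full phenotype profile matches.
PROFILE_TO_DISEASE = {frozenset(MEDGEN_TO_HP.values()): "MONDO:0019391"}


def variant_to_phenotypes(variant):
    return VARIANT_TO_MEDGEN.get(variant, [])


def medgen_to_hp(medgen_ids):
    return [MEDGEN_TO_HP[i] for i in medgen_ids if i in MEDGEN_TO_HP]


def phenotypes_to_disease(hp_ids):
    return PROFILE_TO_DISEASE.get(frozenset(hp_ids))


def run_path(start, actions):
    """Feed the output of each action into the next one."""
    value = start
    for action in actions:
        value = action(value)
    return value
```

Each hop only works because the IDs on either side of it can be matched, which is the point of the slide: the path is as strong as its weakest identifier mapping.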

download everything

crawl & index

high-level abstraction

trial & error

more details / fewer details

Burden on integrator (less coordination)

Burden on provider (more coordination)

Data integration is a socio-technical problem

(Way) beyond linkrot: types of equivalence pain

Bit.ly/evidence-of-identifier-pain

Challenge 1: ID syntax polymorphism
There is no ID that is immune to getting mangled

A fine strategy for 2 variations but …

SNCA (synuclein alpha) Implicated in Parkinson Disease

bit.ly/ncbi-identifier-permutations

http://view.ncbi.nlm.nih.gov/gene/6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Gene&term=6622
http://www.ncbi.nlm.nih.gov/gene?term=6622
http://www.ncbi.nlm.nih.gov/gene/?term=6622
http://www.ncbi.nlm.nih.gov/gene/6622
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=gene&list_uids=40966&dopt=full_report
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=gene&list_uids=6622
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=6622
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=6469
https://www.ncbi.nlm.nih.gov/gene/6622

https://orcid.org/0000-0002-1825-0097
http://orcid.org/0000-0002-1825-0097
orcid.org/0000-0002-1825-0097
0000-0002-1825-0097
ORCiD:0000-0002-1825-0097
orcid:0000-0002-1825-0097
ORCiD: 0000-0002-1825-0097
orcid: 0000-0002-1825-0097

DOIs / ORCIDs

532 possible combinations of short form and http identifiers for the same gene in the same DB!

https://orcid.org/0000-0002-1825-0097

http://orcid.org/0000-0002-1825-0097
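A minimal sketch of rolling syntactic variants like those above into one canonical form. The choice of canonical forms (the https ORCID URL, and an "NCBIGene:" CURIE prefix) is an assumption for illustration, as are the function names; the patterns cover only the variants shown on this slide.

```python
import re
from urllib.parse import parse_qs, urlparse

ORCID_RE = re.compile(
    r"^\s*(?:https?://)?(?:orcid\.org/|orcid:\s*)?"
    r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])\s*$",
    re.IGNORECASE,
)


def normalize_orcid(text):
    """Return the https form of an ORCID iD, or None if unrecognized."""
    match = ORCID_RE.match(text)
    return f"https://orcid.org/{match.group(1).upper()}" if match else None


def ncbigene_curie(url):
    """Map an NCBI Gene URL variant to a CURIE (prefix is an assumption)."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host != "ncbi.nlm.nih.gov" and not host.endswith(".ncbi.nlm.nih.gov"):
        return None
    # Path style: /gene/6622
    path_match = re.search(r"/gene/(\d+)/?$", parts.path)
    if path_match:
        return f"NCBIGene:{path_match.group(1)}"
    # Query style: parameter names and casing vary between URL generations.
    query = {k.lower(): v[0] for k, v in parse_qs(parts.query).items()}
    is_gene = (
        parts.path.rstrip("/").endswith("/gene")
        or query.get("db", "").lower() == "gene"
    )
    if is_gene:
        for key in ("list_uids", "term", "termtosearch"):
            value = query.get(key, "")
            if value.isdigit():
                return f"NCBIGene:{value}"
    return None
```

For example, `ncbigene_curie("http://www.ncbi.nlm.nih.gov/gene/?term=6622")` and `ncbigene_curie("http://view.ncbi.nlm.nih.gov/gene/6622")` both yield the same CURIE. A hand-written normalizer like this is exactly the work every integrator ends up repeating, which is the pain point.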

Mitigation 1. Syntax polymorphism: minimize it where you can, document it where you can’t

Repos:
• Document how you want others to reference your IDs
• Implement QC to check the IDs of records you xref
• Implement Signposting (Herbert van de Sompel)

Technical Needs:
• Generic identifier QC services (identifiers.org?)
• Tooling to diagnose and roll up equivalents (underway)
• Monarch Initiative / NIH Data Commons / NIH Data Translator
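As one concrete instance of identifier QC, an ORCID iD carries a check character (ISO 7064 MOD 11-2) that can be verified before an xref is stored. The sketch below is a minimal version; the function names are chosen here for illustration.

```python
def orcid_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for an ORCID iD."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if a bare iD such as 0000-0002-1825-0097 passes the checksum."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_character(digits[:15]) == digits[15].upper()
```

A check like this catches transposed or truncated digits, but not an iD that is well-formed and simply points at the wrong person; that still needs resolution against the source.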

PDBsum: http://www.ebi.ac.uk/pdbsum/2gc4

Proteopedia: http://proteopedia.org/wiki/index.php/2gc4

PDB Europe: http://www.pdbe.org/2gc4

RCSB PDB: http://www.rcsb.org/pdb/explore/explore.do?structureId=2gc4

PDBj: http://pdbj.org/mine/summary/2gc4

Challenge 2: Coordinated mirroring of exact record copies within consortium members

2gc4 is a 16 chain structure with sequence from Paracoccus denitrificans

http://en.wikipedia.org/wiki/Paracoccus_denitrificans

Mitigation 2: Use machine-readable documentation about distributions

Find sources (e.g. Identifiers.org) of machine-readable documentation about all possible endpoints

Note this doesn’t work with ad-hoc deposition of a single record in multiple places (e.g. the same preprint on preprint servers, institutional repositories, etc.)

https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000020
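A minimal sketch of what machine-readable documentation about mirrors makes possible: expanding one local ID to every known location, and recognizing that two URLs are copies of the same record. The templates are taken from the 2gc4 URLs on the Challenge 2 slide; the dictionary layout and function names are assumptions, not the Identifiers.org data model.

```python
# URL templates copied from the 2gc4 examples; "{id}" marks the local ID.
PDB_LOCATIONS = {
    "PDBsum": "http://www.ebi.ac.uk/pdbsum/{id}",
    "Proteopedia": "http://proteopedia.org/wiki/index.php/{id}",
    "PDB Europe": "http://www.pdbe.org/{id}",
    "RCSB PDB": "http://www.rcsb.org/pdb/explore/explore.do?structureId={id}",
    "PDBj": "http://pdbj.org/mine/summary/{id}",
}


def all_locations(local_id):
    """Every documented URL for one local ID."""
    return {name: tpl.format(id=local_id) for name, tpl in PDB_LOCATIONS.items()}


def local_id_of(url):
    """Recover the local ID from any documented URL form, else None."""
    for tpl in PDB_LOCATIONS.values():
        prefix, suffix = tpl.split("{id}")
        if url.startswith(prefix) and url.endswith(suffix):
            candidate = url[len(prefix): len(url) - len(suffix)]
            if candidate and "/" not in candidate:
                return candidate.lower()
    return None


def same_record(url_a, url_b):
    """True when both URLs resolve to the same local ID."""
    id_a, id_b = local_id_of(url_a), local_id_of(url_b)
    return id_a is not None and id_a == id_b
```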

Challenge 3: Knowledge evolution and implications for over-eager collapse

http://www.autism-society.org/dsm-iv-diagnostic-classifications/#autism

This evolution makes it hard to compare diagnoses made at different times.

DSM change

Analogous examples: organizational membership evolution and partonomy

Mitigation 3: Capture provenance, versioning, and context

Make use of it in analysis

Figure legend: 1, no changes in reporting practices and no calendar effect (baseline prevalence); 2, only a calendar effect; 3, a calendar effect and a diagnostic change; 4, a calendar effect and the inclusion of outpatients; 5, a calendar effect, a diagnostic change, and the inclusion of outpatients (total observed prevalence).

doi:10.1001/jamapediatrics.2014.1893
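A small sketch of carrying version and context with each record so that analysis can stratify by it rather than silently pooling diagnoses made under different criteria. The record fields and the sample rows are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    patient: str
    label: str
    criteria: str  # version of the diagnostic standard in force, e.g. "DSM-IV"
    year: int      # when the diagnosis was made
    setting: str   # context, e.g. "inpatient" or "outpatient"


# Hypothetical rows for illustration only.
RECORDS = [
    Diagnosis("p1", "autism", "DSM-III", 1990, "inpatient"),
    Diagnosis("p2", "autism", "DSM-IV", 1996, "inpatient"),
    Diagnosis("p3", "autism", "DSM-IV", 1997, "outpatient"),
]


def counts_by(records, label, field):
    """Count records with `label`, stratified by a provenance field."""
    return Counter(getattr(r, field) for r in records if r.label == label)
```

With the provenance fields present, `counts_by(RECORDS, "autism", "criteria")` separates a change in diagnostic criteria from a change in the underlying population; without them the two are indistinguishable.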

Challenge 4: Closely-related real-world entities whose distinction matters only to some

Ala: CHEBI:16449
D-Alanine: CHEBI:15570
L-Alanine: CHEBI:16977

Mitigation 4: create systems that let users choose “lenses” for fuzzy or exact matching

Exact / Fuzzy

Example implementations: Open PHACTS (drugs), Monarch Initiative (phenotypes)
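A minimal sketch of a "lens": the same pair of IDs is equivalent or not depending on the user's chosen strictness. It reuses the alanine IDs from Challenge 4, assuming CHEBI:16449 is the unspecified-stereochemistry parent of the D and L forms; the lens names and the lookup table are illustrative and not how Open PHACTS or Monarch implement it.

```python
# Assumed grouping for illustration: stereoisomers roll up to the generic entry.
GENERALIZES_TO = {
    "CHEBI:16977": "CHEBI:16449",  # L-alanine -> alanine
    "CHEBI:15570": "CHEBI:16449",  # D-alanine -> alanine
}


def under_lens(curie, lens):
    """Map an ID to the representative used for comparison under a lens."""
    if lens == "exact":
        return curie
    if lens == "fuzzy":
        return GENERALIZES_TO.get(curie, curie)
    raise ValueError(f"unknown lens: {lens}")


def equivalent(a, b, lens="exact"):
    return under_lens(a, lens) == under_lens(b, lens)
```

A chemist comparing enantiomers would pick the exact lens; someone counting papers that mention alanine at all would pick the fuzzy one. Neither answer is wrong, which is why the choice belongs to the user.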

Challenge 5: Partially recapitulated records that reference the same ID

EHLERS-DANLOS SYNDROME, CLASSIC TYPE, 1; EDSCL1

Ehlers-Danlos syndrome, classic type

https://omim.org/entry/130000

https://monarchinitiative.org/disease/MONDO:0007522

https://www.malacards.org/card/ehlers_danlos_syndrome_classic_type

OMIM:130000

Mitigation 5: Use authoritative IDs as redirection/query entry points into related data

Use JSON-LD context files to document in/out paths

http://purl.obolibrary.org/obo/OMIM_130000

Incoming

https://monarchinitiative.org/OMIM:130000

Outgoing

OMIM

OMIM http://purl.obolibrary.org/obo/OMIM_130000

Outgoing

OMIM

Incoming

https://bio2rdf.org/omim:130000

OMIM

https://github.com/prefixcommons/biocontext/blob/master/registry/monarch_context.jsonld

PrefixCommons
https://github.com/prefixcommons/biocontext
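A minimal sketch of what a JSON-LD style prefix map buys you: expanding a CURIE to a URI on the way out and contracting a URI to a CURIE on the way in. The OMIM entry mirrors the PURL shown on this slide; the MONDO entry and the function names are assumptions, and a real context file would be loaded from JSON rather than written inline.

```python
# Prefix -> URI base, in the shape of a JSON-LD "@context" (entries assumed).
CONTEXT = {
    "OMIM": "http://purl.obolibrary.org/obo/OMIM_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
}


def expand(curie, context=CONTEXT):
    """OMIM:130000 -> http://purl.obolibrary.org/obo/OMIM_130000"""
    prefix, _, local_id = curie.partition(":")
    if prefix not in context or not local_id:
        raise KeyError(f"no expansion for {curie!r}")
    return context[prefix] + local_id


def contract(uri, context=CONTEXT):
    """Inverse of expand; the longest matching URI base wins."""
    matches = [(base, prefix) for prefix, base in context.items() if uri.startswith(base)]
    if not matches:
        return None
    base, prefix = max(matches, key=lambda pair: len(pair[0]))
    return f"{prefix}:{uri[len(base):]}"
```

Publishing separate incoming and outgoing contexts lets a resource accept several URI spellings for the same record while always emitting one.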

Challenge 6: Records that [seem to] be about the same real-world concept (or maybe not-so-real)

Too little equivalence: Missing connections

Too much equivalence: False positives

Challenge 6 cont’d: Post-hoc harmonization

Challenge 6 cont’d: Fuzzy Match on xrefs/content

How are these 11 records for “Ehlers Danlos Syndrome” related to each other?

Narrow synonym? Broad? Exact? Child? Parent?
Bayesian models like k-BOOM can help (Mungall, doi:10.1101/048843)

bit.ly/xref-wildwest

Legend:
OMIM (brown)
MESH (grey)
ORDO/Orphanet (yellow)
SubClassOf (solid line)
Xref (dashed grey line)

Hemolytic anemia mappings across resources

Each vocabulary is different and they map to each other inconsistently, leading to poor interoperability and computability

Mitigation 6: Use algorithms to determine probability of equivalence

Bayesian OWL Ontology Merging (kBOOM) Mungall et al. http://bit.ly/k-BOOM

Determine probability of equivalence based upon:

1. Synonyms

2. XREFs (ideally semantically typed)

3. Graph structure

4. (Partial) string matching of labels

5. Prior weighting (e.g. if you know a source has specific or poor xref’ing curation strategies)

Example applied to diseases:http://obofoundry.org/ontology/mondo.html
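To make the idea of combining evidence concrete, here is a deliberately simplified naive-Bayes style sketch. It is not the kBOOM algorithm, which also reasons over the ontology structure; the evidence names, likelihood ratios, and prior below are invented for illustration.

```python
import math

# Hypothetical likelihood ratios: how much more likely each kind of evidence
# is when two records are truly equivalent than when they are not.
LIKELIHOOD_RATIOS = {
    "shared_synonym": 4.0,
    "xref": 9.0,
    "shared_parent": 2.0,
    "partial_label_match": 1.5,
}


def equivalence_probability(evidence, prior=0.01, source_weight=1.0):
    """Combine independent evidence into a posterior via log-odds.

    source_weight below 1.0 discounts evidence from a source whose
    xref curation is known to be loose (the prior weighting above).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for kind in evidence:
        log_odds += source_weight * math.log(LIKELIHOOD_RATIOS[kind])
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Under these made-up numbers an xref alone does not make two records equivalent; it takes several agreeing signals, and evidence from a poorly curated source moves the estimate less.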

ID21C: Tangible, actionable community best practice on identifiers for data integration

doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)

Please comment!

Goldilocks approach to (ID) standards

NON-adoption to nano adoption
Only the information useful for action

UDP:2542

Impaired platelet aggregation (HP:0003540)

Thrombocytopenia (HP:0001873)

Abnormal platelet activation (MP:0006298)

Thrombocytopenia (MP:0003179)

Genetics in Medicine 18, 608–617 (2016)
doi:10.1038/gim.2015.137

MGI:3764834

If we get equivalence right, what does that make possible?

• Traceable
  • Attributable
  • Versioned
  • High-level typing of records for desired audience (e.g. is the record for a gene? A disease? An instrumentation readout?)

• Licensed
  • Pick a standard license; encode in file header (using a URL like https://creativecommons.org/publicdomain/zero/1.0 if possible)
  • PIDs for license types?

• Connected:
  • Syntactic and semantic harmonization
  • Use standardized metadata, vocabs, ontologies
  • Document your identifier scheme and follow the ID21C guidelines or other best practices to avoid syntactic variation
  • Capture cross references but ensure that you semantically qualify them
  • Tell others how to link to you
  • APIs please; data dumps are not enough, but if you do nothing else:
    • Tell us what types of IDs to expect and how they are related to one another
    • Support bulk identifier operations in APIs, especially for xref’d IDs
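A sketch of the last few points taken together: cross references stored with an explicit SKOS mapping predicate rather than a bare "xref", plus a bulk lookup over many IDs at once. The mapping rows are illustrative (the first pairs the two Ehlers-Danlos IDs shown earlier, the "EX:" IDs are placeholders), and the function is a stand-in for what a real API endpoint might offer.

```python
from typing import NamedTuple


class Xref(NamedTuple):
    subject: str
    predicate: str  # a SKOS mapping property
    obj: str


# Illustrative rows only; the "EX:" identifiers are placeholders.
XREFS = [
    Xref("MONDO:0007522", "skos:exactMatch", "OMIM:130000"),
    Xref("MONDO:0007522", "skos:closeMatch", "EX:0001"),
    Xref("EX:0002", "skos:broadMatch", "EX:0003"),
]


def bulk_xrefs(curies, predicates=("skos:exactMatch",), xrefs=XREFS):
    """Return {curie: [mapped IDs]} for many IDs in one call.

    Only mappings whose predicate is in `predicates` are followed, so a
    caller decides how much fuzziness to accept.
    """
    wanted = set(predicates)
    result = {curie: [] for curie in curies}
    for x in xrefs:
        if x.subject in result and x.predicate in wanted:
            result[x.subject].append(x.obj)
    return result
```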

FAIR-TLC

Traceable

Licensed

Connected

bit.ly/fair-tlc (https://doi.org/10.5281/zenodo.203295) reusabledata.org

Don’t hoard your data like One-“ID” Willie; Help it travel well via better equivalency support.

With thanks to: Julie McMurry, Chris Mungall, Mathias Wawer and the whole team
https://monarchinitiative.org/page/team

https://github.com/orgs/NCATS-Tangerine/people
John Kunze, Greg Janee, Sarala Wimalaratne, Nick Juty, others

5 R24 OD011883
OT3 TR002019-01S2
1 OT OD025464-01
1 OT3 HL142479-01

U24TR002306
