Equivalence is in the (ID) of the beholder
PrefixCommons Biocontext
versus one-“ID” Willie
Biomedical Data Translator
NIH Data Commons
Melissa Haendel @ontowonka
Finding treasure requires aligning different knowledge modalities and perspectives
Real-world concepts
Annotations
Mappings
Standards and team science
Data science is a young field; the data is crufty, but it is valuable
What integrators are aiming to do is non-trivial
But… biological things can have multiple IDs, and each ID can be written in
multiple ways. Some persist, some don’t.
Some have machine readable data, some don’t. This makes integration hard.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4981258/figure/fig1/
Identifiers are the invisible bedrock of all scientific inquiry; the more complex the question, the greater the reliance on ID hygiene
What? Why? How?
Identifiers
Identifiers & Metadata
Identifiers & Metadata & Relationships
Required harmonization
Question complexity
FAIR
FAIR
How many?
FAIR
Identifier Reality:
Not all IDs are created equal
Not all communities are equally poised to use them
We need systems that accommodate the heterogeneity
Traditional Literature
Non-Traditional
Persistent
Ephemeral
Non-existent
Identifier Maturity
Scholarly Output Maturity
Genomic resources
Wild west of identifier tumbleweed
Connected
Goonies may not need IDs, but agents do
Pharos
DrugBank
PubMed
KEGG
WikiPathways
HGNC
HGNC
NCBI
ClinVar
OMIM
https://github.com/biolink/biolink-model/blob/master/biolink-model.yaml
http://github.com/ncats-tangerine/beacon-aggregator
Genes: NCBI Gene, HGNC, Ensembl, OMIM, Organism-DBs, Panther, Mygene.info, Myvariant.info, ClinVar
Diseases: MONDO, OMIM, Orphanet, ICD-9 / ICD-10 / ICD-11, SNOMED-CT
Chemicals: ChEBI, DrugBank, Pharos
Proteins: UniProt/SwissProt, WikiPathways, Reactome
The NCATS Data Translator Knowledge Map: Will define a menu of types, IDs, and paths available to reasoners
Genes + Environment = Phenotypes
Biology is not just identifying the bits, but also the
relationships between them.
G-P or D (disease)
• causes
• contributes to
• is risk factor for
• protects against
• correlates with
• is marker for
• modulates
• involved in
• increases susceptibility to
G-G (kind of)
• regulates
• negatively regulates (inhibits)
• positively regulates (activates)
• directly regulates
• interacts with
• co-localizes with
• co-expressed with
P/D - P/D
• part of
• results in
• co-occurs with
• correlates with
• hallmark of (P->D)
E-P
• contributes to (E->P)
• influences (E->P)
• exacerbates (E->P)
• manifest in (P->E)
G-E (kind of)
• expressed in
• expressed during
• contains
• inactivated by
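The relationship lists above can be treated as a small schema: a predicate is only meaningful between certain pairs of entity types. The following is a minimal sketch of that idea, assuming hypothetical type names ("gene", "phenotype", "environment") and only a subset of the predicates from the slide; it is not the Biolink model.

```python
from dataclasses import dataclass

# Predicates copied from the slide lists, keyed by (subject type, object type).
# Only a subset is shown; the type names are assumptions for this sketch.
ALLOWED = {
    ("gene", "phenotype"): {"causes", "contributes to", "is risk factor for", "protects against"},
    ("gene", "gene"): {"regulates", "interacts with", "co-expressed with"},
    ("environment", "phenotype"): {"contributes to", "influences", "exacerbates"},
}


@dataclass(frozen=True)
class Edge:
    subject: str        # identifier of the subject (placeholder CURIEs below)
    subject_type: str
    predicate: str
    object: str
    object_type: str


def is_valid(edge: Edge) -> bool:
    """True if the predicate is listed for this pair of entity types."""
    allowed = ALLOWED.get((edge.subject_type, edge.object_type), set())
    return edge.predicate in allowed


# Placeholder identifiers, not real records.
ok_edge = Edge("EX:gene1", "gene", "regulates", "EX:gene2", "gene")
bad_edge = Edge("EX:gene1", "gene", "regulates", "EX:phenotype1", "phenotype")
```

Here `is_valid(ok_edge)` passes while `is_valid(bad_edge)` fails, because "regulates" is listed only for gene-gene pairs.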
Identifier cleaning is a shared pain point (and a huge time sink),
but there is no way to share the outcomes
Let’s discuss!!
CHEMBL1431
Pharos:O15244 (one of 355)
IDG:D908
reactome:R-HSA-374914
The NCATS Biomedical Translator Knowledge Map is a set of actions the agent can perform.
3rd party provider value: APIs and connectivity
reactome:HSA-1430728
clinvar.variant:443497
medgen:87607 Short stature
medgen:892473 Abnormality of cardiovascular system morphology
medgen:427827 Abnormality of the ear
medgen:776570 Polydactyly
Variant
Phenotypes
MonarchPhenotypeToDisease Phenotypes
Disease
HP:0004322 Short stature
HP:0030680 Abnormality of cardiovascular system morphology
HP:0000598 Abnormality of the ear
HP:0010442 Polydactyly
MONDO:0019391 Fanconi Anemia Disease
Phenotype
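The variant-to-disease path on this slide depends on a hop between MedGen and HPO phenotype IDs. A minimal sketch of that hop follows, using only the IDs and labels shown on the slide. Pairing terms by identical label is an assumption made for illustration, not a curated mapping, and `map_by_label` is a hypothetical helper.

```python
# IDs and labels copied from the slide.
MEDGEN = {
    "medgen:87607": "Short stature",
    "medgen:892473": "Abnormality of cardiovascular system morphology",
    "medgen:427827": "Abnormality of the ear",
    "medgen:776570": "Polydactyly",
}
HPO = {
    "HP:0004322": "Short stature",
    "HP:0030680": "Abnormality of cardiovascular system morphology",
    "HP:0000598": "Abnormality of the ear",
    "HP:0010442": "Polydactyly",
}


def map_by_label(source: dict, target: dict) -> dict:
    """Pair source IDs with target IDs whose labels match, ignoring case."""
    by_label = {label.lower(): curie for curie, label in target.items()}
    return {
        curie: by_label[label.lower()]
        for curie, label in source.items()
        if label.lower() in by_label
    }


medgen_to_hp = map_by_label(MEDGEN, HPO)
```

In practice an agent would need the provider to publish this mapping; label matching only works here because the slide's labels happen to be identical.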
download everything
crawl & index
high-level abstraction
trial & error
more details
fewer details
Burden on integrator (less coordination)
Burden on provider (more coordination)
Data integration is a socio-technical problem
(Way) beyond linkrot: types of equivalence pain
Bit.ly/evidence-of-identifier-pain
Challenge 1: ID Syntax polymorphism
There is no ID that is immune to getting mangled
A fine strategy
for 2 variations
but …
SNCA (synuclein alpha) Implicated in Parkinson Disease
bit.ly/ncbi-identifier-permutations
http://view.ncbi.nlm.nih.gov/gene/6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=6622
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Gene&term=6622
http://www.ncbi.nlm.nih.gov/gene?term=6622
http://www.ncbi.nlm.nih.gov/gene/?term=6622
http://www.ncbi.nlm.nih.gov/gene/6622
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=gene&list_uids=40966&dopt=full_report
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=gene&list_uids=6622
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=6622
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=6469
https://www.ncbi.nlm.nih.gov/gene/6622
https://orcid.org/0000-0002-1825-0097
http://orcid.org/0000-0002-1825-0097
orcid.org/0000-0002-1825-0097
0000-0002-1825-0097
ORCiD:0000-0002-1825-0097
orcid:0000-0002-1825-0097
ORCiD: 0000-0002-1825-0097
orcid: 0000-0002-1825-0097
DOIs
ORCIDs
532 possible
combinations of short form and http identifiers for the same gene in the same DB!
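One way to cope with this many spellings is to roll each variant up to a single canonical form. The sketch below handles only the ORCID and NCBI Gene shapes shown on this slide. The output prefixes (`orcid`, `NCBIGene`) and the function names are assumptions for illustration, and real-world variants will include shapes this does not cover.

```python
import re
from urllib.parse import urlparse, parse_qs

# An ORCID iD at the end of the string, not preceded by another digit.
ORCID_RE = re.compile(r"(?<!\d)(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def normalize_orcid(raw: str):
    """Return 'orcid:<id>' for any of the written forms, else None."""
    match = ORCID_RE.search(raw.strip())
    return f"orcid:{match.group(1)}" if match else None


def normalize_ncbigene(url: str):
    """Return 'NCBIGene:<id>' for the NCBI Gene URL shapes above, else None."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("ncbi.nlm.nih.gov"):
        return None
    # Path form, e.g. /gene/6622
    match = re.search(r"/gene/(\d+)$", parsed.path)
    if match:
        return f"NCBIGene:{match.group(1)}"
    # Query form: parameter names vary in case (db/Db, term/TermToSearch).
    query = {key.lower(): values[0] for key, values in parse_qs(parsed.query).items()}
    if "db" in query and query["db"].lower() != "gene":
        return None
    for key in ("list_uids", "term", "termtosearch"):
        if query.get(key, "").isdigit():
            return f"NCBIGene:{query[key]}"
    return None
```

Note that normalizing syntax does not make the records equivalent: the list above contains URLs for gene IDs other than 6622, and those stay distinct after normalization.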
Mitigation 1. Syntax polymorphism: minimize it where you can, document it where you can’t
Repos:
• Document how you want others to reference your IDs
• Implement QC to check the IDs of records you xref
• Implement Signposting (Herbert van de Sompel)
Technical Needs:
• Generic identifier QC services (identifiers.org?)
• Tooling to diagnose and roll up equivalents (underway)
• Monarch Initiative / NIH Data Commons / NIH Data Translator
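The "QC to check the IDs of records you xref" item can start as a simple pattern check per prefix. This is a minimal sketch: the `PATTERNS` table is hypothetical and hard-coded for illustration, whereas a real service would load documented patterns from a registry rather than guess them.

```python
import re

# Hypothetical local-ID patterns, hard-coded for this sketch only.
PATTERNS = {
    "HP": re.compile(r"\d{7}"),
    "MONDO": re.compile(r"\d{7}"),
    "OMIM": re.compile(r"\d{6}"),
    "CHEBI": re.compile(r"\d+"),
}


def check_xref(curie: str) -> str:
    """Return 'ok' or a short description of what is wrong with the ID."""
    prefix, separator, local_id = curie.partition(":")
    if not separator:
        return "missing prefix"
    if prefix not in PATTERNS:
        return "unknown prefix"
    if not PATTERNS[prefix].fullmatch(local_id):
        return "malformed local ID"
    return "ok"


report = {xref: check_xref(xref) for xref in ["HP:0004322", "hp:0004322", "OMIM:13000", "130000"]}
```

Running the check at ingest time, rather than at query time, is what lets the outcome be shared with the provider of the bad xref.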
PDBsum: http://www.ebi.ac.uk/pdbsum/2gc4
Proteopedia: http://proteopedia.org/wiki/index.php/2gc4
PDB Europe: http://www.pdbe.org/2gc4
RCSB PDB: http://www.rcsb.org/pdb/explore/explore.do?structureId=2gc4
PDBj: http://pdbj.org/mine/summary/2gc4
Challenge 2: Coordinated mirroring of exact record copies
within consortium members
2gc4 is a 16 chain structure with sequence from Paracoccus denitrificans
Mitigation 2: Use machine-readable documentation about distributions
Find sources (e.g. Identifiers.org) of machine-readable documentation about all possible endpoints
Note this doesn’t work with ad-hoc deposition of a single record in multiple places (e.g. preprint servers, institutional repositories, etc.)
https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000020
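With machine-readable documentation of each mirror's URL shape, the five 2gc4 URLs listed for Challenge 2 collapse to one record. The sketch below hard-codes those five shapes as regular expressions. The `pdb` prefix and the patterns are assumptions for illustration; a real implementation would read the endpoint templates from the registry instead.

```python
import re

# URL shapes taken from the five mirrors listed for 2gc4.
MIRRORS = [
    re.compile(r"^http://www\.ebi\.ac\.uk/pdbsum/(\w{4})$"),
    re.compile(r"^http://proteopedia\.org/wiki/index\.php/(\w{4})$"),
    re.compile(r"^http://www\.pdbe\.org/(\w{4})$"),
    re.compile(r"^http://www\.rcsb\.org/pdb/explore/explore\.do\?structureId=(\w{4})$"),
    re.compile(r"^http://pdbj\.org/mine/summary/(\w{4})$"),
]


def to_curie(url: str):
    """Map a known mirror URL to a single 'pdb:<id>' form, else None."""
    for pattern in MIRRORS:
        match = pattern.match(url)
        if match:
            return f"pdb:{match.group(1).lower()}"
    return None


urls = [
    "http://www.ebi.ac.uk/pdbsum/2gc4",
    "http://proteopedia.org/wiki/index.php/2gc4",
    "http://www.pdbe.org/2gc4",
    "http://www.rcsb.org/pdb/explore/explore.do?structureId=2gc4",
    "http://pdbj.org/mine/summary/2gc4",
]
records = {to_curie(url) for url in urls}
```

All five URLs reduce to the single record `pdb:2gc4`, which is the behavior coordinated mirrors should allow.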
Challenge 3: Knowledge evolution and implications for over-eager collapse
http://www.autism-society.org/dsm-iv-diagnostic-classifications/#autism
This evolution makes it hard to compare diagnoses made at different times.
DSM change
Analogous examples: organizational membership evolution and partonomy
Mitigation 3: Capture provenance, versioning, and context
Make use of it in analysis
1, no changes in reporting practices and no calendar effect (baseline prevalence); 2, only a calendar effect; 3, a calendar effect and a diagnostic change; 4, a calendar effect and the inclusion of outpatients; 5, a calendar effect, a diagnostic change, and the inclusion of outpatients (total observed prevalence).
doi:10.1001/jamapediatrics.2014.1893
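Capturing version and context only helps if the analysis uses it. The sketch below attaches the classification scheme and edition to each diagnosis and refuses to treat records as directly comparable across editions. The field names, the `comparable` rule, and the example values are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    label: str    # diagnostic category as recorded
    scheme: str   # classification scheme, e.g. "DSM"
    version: str  # edition in force when the diagnosis was made
    year: int     # calendar year of the diagnosis


def comparable(a: Diagnosis, b: Diagnosis) -> bool:
    """Directly comparable only if label, scheme, and edition all agree."""
    return (a.label, a.scheme, a.version) == (b.label, b.scheme, b.version)


# Illustrative values only.
earlier = Diagnosis("autism", "DSM", "IV", 1996)
later = Diagnosis("autism", "DSM", "5", 2014)
same_edition = Diagnosis("autism", "DSM", "IV", 1999)
```

Under this rule `earlier` and `same_edition` can be pooled, while `earlier` and `later` need an explicit, documented bridge before they are counted together.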
Challenge 4: Closely-related real-world entities
whose distinction matters only to some
Ala
D-Alanine L-Alanine
CHEBI:16977 CHEBI:16449 CHEBI:15570
Mitigation 4: create systems that let users choose “lenses” for fuzzy or exact matching
Exact
Fuzzy
Example implementations: OpenPhacts (drugs), Monarch Initiative (phenotypes)
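A "lens" can be as small as a switch between strict identity and membership in a user-chosen equivalence group. The sketch below uses the three ChEBI IDs from the alanine slide. Grouping them under one fuzzy key (`alanine-any`, a made-up name) is the assumption being illustrated, and this is not how OpenPhacts or Monarch implement lenses.

```python
# The three ChEBI IDs shown on the alanine slide, grouped for fuzzy matching.
# The group name is hypothetical.
FUZZY_GROUPS = {
    "CHEBI:16449": "alanine-any",
    "CHEBI:15570": "alanine-any",
    "CHEBI:16977": "alanine-any",
}


def same_entity(a: str, b: str, lens: str = "exact") -> bool:
    """Compare two IDs under the chosen lens ('exact' or 'fuzzy')."""
    if lens == "exact":
        return a == b
    if lens == "fuzzy":
        if a == b:
            return True
        group = FUZZY_GROUPS.get(a)
        return group is not None and group == FUZZY_GROUPS.get(b)
    raise ValueError(f"unknown lens: {lens}")
```

A chemist who cares about stereochemistry keeps the default exact lens; a text-mining pipeline that only sees "Ala" can opt into the fuzzy one.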
Challenge 5: Partially recapitulated records that reference the same ID
EHLERS-DANLOS SYNDROME, CLASSIC TYPE, 1; EDSCL1
Ehlers-Danlos syndrome, classic type
https://omim.org/entry/130000 https://monarchinitiative.org/disease/MONDO:0007522
https://www.malacards.org/card/ehlers_danlos_syndrome_classic_type
OMIM:130000
Mitigation 5: Use authoritative IDs as redirection/query entry points into related data
Use JSON-LD context files to document in/out paths
Figure: the OMIM prefix links incoming and outgoing URL forms of the same ID:
http://purl.obolibrary.org/obo/OMIM_130000
https://monarchinitiative.org/OMIM:130000
https://bio2rdf.org/omim:130000
https://github.com/prefixcommons/biocontext/blob/master/registry/monarch_context.jsonld
PrefixCommons: https://github.com/prefixcommons/biocontext
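The in/out paths can be documented as two small prefix maps: one for contracting URLs you receive, one for expanding CURIEs you emit. The sketch below is written in the spirit of a JSON-LD context but is not the contents of monarch_context.jsonld. The URL prefixes come from this slide, while the choice of which form is outgoing is an assumption.

```python
# Hypothetical prefix maps; URL prefixes are taken from the slide.
OUTGOING = {"OMIM": "http://purl.obolibrary.org/obo/OMIM_"}
INCOMING = {
    "http://purl.obolibrary.org/obo/OMIM_": "OMIM",
    "https://monarchinitiative.org/OMIM:": "OMIM",
    "https://bio2rdf.org/omim:": "OMIM",
}


def contract(url: str):
    """Turn any known URL form into a CURIE; the longest matching base wins."""
    for base in sorted(INCOMING, key=len, reverse=True):
        if url.startswith(base):
            return f"{INCOMING[base]}:{url[len(base):]}"
    return None


def expand(curie: str):
    """Turn a CURIE into the single URL form this resource emits."""
    prefix, _, local_id = curie.partition(":")
    base = OUTGOING.get(prefix)
    return base + local_id if base else None
```

Accepting several incoming forms while emitting exactly one keeps the resource tolerant of what others send without adding to the syntax polymorphism downstream.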
Challenge 6: Records that [seem to] be about the same real world
concept (or maybe not-so real)
Too little equivalence: Missing connections
Too much equivalence: False positives
Challenge 6 cont’d: Post-hoc harmonization
Challenge 6 cont’d: Fuzzy Match on xrefs/content
How are these 11 records for “Ehlers Danlos Syndrome” related to each other?
Narrow synonym? Broad? Exact? Child? Parent?
Bayesian models like k-BOOM can help (Mungall, doi:10.1101/048843)
bit.ly/xref-wildwest
OMIM (brown)
MESH (grey)
ORDO/Orphanet (yellow)
SubClassOf (solid line)
Xref (dashed grey line)
Hemolytic anemia mappings across resources
Each vocabulary is different, and they map to each other inconsistently, leading to poor interoperability and computability
Mitigation 6: Use algorithms to determine probability of equivalence
Bayesian OWL Ontology Merging (k-BOOM), Mungall et al.: http://bit.ly/k-BOOM
Determine probability of equivalence based upon:
1. Synonyms
2. XREFs (ideally semantically typed)
3. Graph structure
4. (Partial) string matching of labels
5. Prior weighting (e.g. if you know a source has specific or poor xref’ing curation strategies)
Example applied to diseases: http://obofoundry.org/ontology/mondo.html
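To make "probability of equivalence" concrete, the sketch below combines the evidence types listed above with a naive-Bayes update on log-odds. This is deliberately not k-BOOM, which reasons jointly over the ontology and its logical axioms; it scores one candidate pair in isolation. The likelihood ratios, the default prior, and the `source_weight` parameter are made-up values for illustration.

```python
import math

# Made-up likelihood ratios: how much more likely each kind of evidence is
# when two records are equivalent than when they are not.
LIKELIHOOD_RATIO = {
    "exact_synonym": 8.0,
    "xref": 5.0,
    "shared_parent": 2.0,
    "label_partial_match": 1.5,
}


def equivalence_probability(evidence, prior=0.1, source_weight=1.0):
    """Naive-Bayes estimate that two records are equivalent.

    evidence: iterable of keys from LIKELIHOOD_RATIO
    prior: probability of equivalence before seeing any evidence
    source_weight: scales the evidence, e.g. below 1.0 for a source
        known to have poor xref curation
    """
    log_odds = math.log(prior / (1 - prior))
    for kind in evidence:
        log_odds += source_weight * math.log(LIKELIHOOD_RATIO[kind])
    return 1 / (1 + math.exp(-log_odds))


strong = equivalence_probability(["exact_synonym", "xref"])
weak = equivalence_probability(["label_partial_match"])
```

With these illustrative numbers a pair supported by both a synonym and an xref scores well above a pair supported only by a partial label match, and down-weighting a poorly curated source pulls its pairs back toward the prior.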
ID21C: Tangible, actionable community best practice
on identifiers for data integration
doi:10.1371/journal.pbio.2001414 (bit.ly/id21c-plosbio)
Please comment!
Goldilocks approach to (ID) standards
NON-adoption to nano adoption
Only the information useful for action
UDP:2542
Impaired platelet aggregation (HP:0003540)
Thrombocytopenia (HP:0001873)
Abnormal platelet activation (MP:0006298)
Thrombocytopenia (MP:0003179)
Genetics in Medicine 18, 608–617 (2016) doi:10.1038/gim.2015.137
MGI:3764834
If we get equivalence right, what does that make possible?
• Traceable
  • Attributable
  • Versioned
  • High-level typing of records for desired audience (e.g. is the record for a gene? A disease? An instrumentation readout?)
• Licensed
  • Pick a standard license; encode in file header (using a URL like https://creativecommons.org/publicdomain/zero/1.0 if possible)
  • PIDs for license types?
• Connected:
  • Syntactic and semantic harmonization
  • Use standardized metadata, vocabs, ontologies
  • Document your identifier scheme and follow the ID21C guidelines or other best practices to avoid syntactic variation
  • Capture cross references but ensure that you semantically qualify them
  • Tell others how to link to you
  • APIs please; data dumps are not enough, but if you do nothing else:
    • Tell us what types of IDs to expect and how they are related to one another
    • Support bulk identifier operations in APIs, especially for xref’d IDs
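Two of the "Connected" items, semantically qualified cross references and bulk identifier operations, fit in one small sketch. The predicates are standard SKOS mapping properties, but the specific mappings are illustrative pairings of IDs that appear together on earlier slides, not curated assertions, and `bulk_xrefs` is a hypothetical function rather than any existing API.

```python
from collections import defaultdict

# (subject, SKOS mapping predicate, object). Illustrative, not curated.
XREFS = [
    ("HP:0001873", "skos:exactMatch", "MP:0003179"),
    ("HP:0003540", "skos:closeMatch", "MP:0006298"),
    ("OMIM:130000", "skos:exactMatch", "MONDO:0007522"),
]


def bulk_xrefs(curies, predicates=("skos:exactMatch",)):
    """Look up many IDs in one call, returning only xrefs of the requested kinds."""
    index = defaultdict(list)
    for subject, predicate, obj in XREFS:
        if predicate in predicates:
            index[subject].append((predicate, obj))
    return {curie: index.get(curie, []) for curie in curies}


exact_only = bulk_xrefs(["HP:0001873", "HP:0003540", "OMIM:130000"])
with_close = bulk_xrefs(["HP:0003540"], predicates=("skos:exactMatch", "skos:closeMatch"))
```

Because every xref carries its predicate, the caller decides how much equivalence to accept instead of inheriting an untyped "xref" that could mean anything.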
FAIR-TLC
Traceable
Licensed
Connected
bit.ly/fair-tlc (https://doi.org/10.5281/zenodo.203295) reusabledata.org
Don’t hoard your data like One-“ID” Willie; Help it travel well via better equivalency support.
With thanks to: Julie McMurry, Chris Mungall, Mathias Wawer and the whole team
https://monarchinitiative.org/page/team
https://github.com/orgs/NCATS-Tangerine/people
John Kunze, Greg Janee, Sarala Wimalaratne, Nick Juty, others
5 R24 OD011883
OT3 TR002019-01S2
1 OT OD025464-01
1 OT3 HL142479-01
U24TR002306