20090511 manchester biochemistry

Increasingly Accurate Representation of Biochemistry Michel Dumontier, Ph.D. Assistant Professor of Bioinformatics Department of Biology, School of Computer Science Institute of Biochemistry, Ottawa Institute of Systems Biology Carleton University 1 IMG Seminar:Manchester:Michel Dumontier 11/05/2009

Upload: michel-dumontier

Post on 13-Dec-2014


DESCRIPTION

Biochemical ontologies aim to capture and represent biochemical entities and the relations that exist between them in an accurate manner. A fundamental starting point is biochemical identity, but our current approach for generating identifiers is haphazard and consequently integrating data is error-prone. I will discuss plausible structure-based strategies for biochemical identity, whether at the level of the molecule or of some part thereof (e.g. residues, collections of residues, atoms, collections of atoms, functional groups), such that identifiers may be generated in an automatic and curator/database-independent manner. With structure-based identifiers in hand, we will be in a position to more accurately capture context-specific biochemical knowledge, such as how a set of residues in a binding site is involved in a chemical reaction, including the fact that a key nitrogen atom must first be de-protonated. Thus, our current representation of biochemical knowledge may improve such that manual and automatic methods of bio-curation are substantially more accurate.

TRANSCRIPT

Page 1: 20090511 Manchester Biochemistry

Increasingly Accurate Representation of Biochemistry

Michel Dumontier, Ph.D.
Assistant Professor of Bioinformatics

Department of Biology, School of Computer Science
Institute of Biochemistry, Ottawa Institute of Systems Biology

Carleton University

IMG Seminar: Manchester: Michel Dumontier, 11/05/2009

Page 2: 20090511 Manchester Biochemistry

Biochemistry

• Biochemistry aims to understand the structure and function of all living things at the molecular level

http://multimedia.mcb.harvard.edu/media.html

Page 3: 20090511 Manchester Biochemistry

Representational Issues

1. Identity

2. Descriptions

3. Situations

Page 4: 20090511 Manchester Biochemistry

Case Study: HIF1α

Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665)

Master transcriptional regulator of the adaptive response to hypoxia

• Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.

Situation
a) Normoxic
b) Hypoxic
c) Other/Unspecified

Multiple structural forms

Part, named/ unnamed regions

The part is the agent in the process

Selective interaction with parts

Page 5: 20090511 Manchester Biochemistry

Structure-based biochemical identity: Differences between apples and oranges

• HIF1α – au naturel
• HIF1α – hydroxylated @P402
• HIF1α – hydroxylated @P564
• HIF1α – hydroxylated @P402 & @P564
• HIF1α – hydroxylated @P402 & (@P564) – ubiquitinated @K532
• HIF1α – L400A & L397A

Page 6: 20090511 Manchester Biochemistry

Current approach to biochemical identity is erroneous, misleading or underspecified

• Information gathered from multiple structural variants is attributed to the unmodified form.

Uniprot/Genbank

• This conflates functionality arising from similar, but different, structural forms

Inaccurate specification of knowledge

• Incomplete descriptions are just as bad
– Reactome has an internal identifier for referring to different forms, but links to Uniprot entries
– Obfuscates identity between databases

Page 7: 20090511 Manchester Biochemistry


Page 8: 20090511 Manchester Biochemistry

Bio2RDF: 2.3B triples of SPARQL-accessible linked biological data!

Chemical Parts!

Page 9: 20090511 Manchester Biochemistry

1. Precise Biochemical Identifiers

• Identifiers and their exact descriptions are required for these kinds of entities:
– atom: atomic interactions, catalytic mechanism
– collection of atoms: binding/catalytic site, interaction
– residue: post-translational modification
– collection of residues: motif/domain/interaction site
– molecule: metabolism, signalling
– complex: metabolism, signalling, scaffolds, containers

• We need a reproducible methodology

Page 10: 20090511 Manchester Biochemistry

Different molecules must have different identifiers

• IUPAC International Chemical Identifier (InChI)

• A data string that provides
1. the structure of a chemical compound
2. the convention for drawing the structure

• It can be made by anyone, anywhere, at any time
– a deterministic algorithm ensures that it is always written in the same way (syntactic identity), and fully specifies the molecular description (semantic identity).
– It is a data identifier

Page 11: 20090511 Manchester Biochemistry

(S)-Glutamic Acid

InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8/h3H,1-2,6H2,(H,7,8)(H,9,10)/p+1/t3-/m0/s1/i4+1

{version} 1
{formula} C5H9NO4
{connections} c6-3(5(9)10)1-2-4(7)8
{H_atoms} h3H,1-2,6H2,(H,7,8)(H,9,10)
{protons} p+1
{stereo:sp3} t3-
{stereo:sp3:inverted} m0
{stereo:type (1=abs, 2=rel, 3=rac)} s1
{isotopic:atoms} i4+1
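The layered structure of the string above can be pulled apart mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `split_inchi_layers` is hypothetical, and it assumes each layer prefix occurs only once, which holds for this example but not for every InChI.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and one-letter-prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # Every layer after the formula starts with a one-letter prefix (c, h, p, t, ...).
        # Assumption: a prefix is not repeated; a repeated prefix would overwrite here.
        layers[part[0]] = part[1:]
    return layers


glutamic_acid = ("InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8"
                 "/h3H,1-2,6H2,(H,7,8)(H,9,10)/p+1/t3-/m0/s1/i4+1")
layers = split_inchi_layers(glutamic_acid)
for name, value in layers.items():
    print(name, value)
```

The keys `c`, `h`, `p`, `t`, `m`, `s` and `i` correspond to the connections, H_atoms, protons, stereo and isotopic annotations on the slide.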

Page 12: 20090511 Manchester Biochemistry

α-D-Glucose

IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol

SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO)

InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1

CML, SDF

79025

2. Structure Accurate and Extensible Descriptions Required

Page 13: 20090511 Manchester Biochemistry

OWL Has Explicit Semantics

Can therefore be used to capture knowledge in a machine understandable way

Page 14: 20090511 Manchester Biochemistry

http://code.google.com/p/semanticwebopenbabel/

Page 15: 20090511 Manchester Biochemistry

Chemical Ontology

Chemical Knowledge for the Semantic Web. Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008). Evry, France.

Page 16: 20090511 Manchester Biochemistry

hydroxyl group

methyl group

Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization.

Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound.

Describing chemical functional groups in OWL-DL for the classification of chemical compounds

N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.

Ethanol

Page 17: 20090511 Manchester Biochemistry

Describing Functional Groups in DL

HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))

OHR

R group

Page 18: 20090511 Manchester Biochemistry

Fully Classified Ontology

35 FG

Page 19: 20090511 Manchester Biochemistry

And, we define certain compounds

Alcohol: OrganicCompound that (hasPart some HydroxylGroup)
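To make the two definitions concrete (a hydroxyl group as a carbon singly bonded to an oxygen that is singly bonded to a hydrogen, and an alcohol as a compound with such a part), here is a rough Python sketch over a toy molecular graph. It only mimics what a DL reasoner would conclude; the data layout and function names are assumptions made for illustration, and like the slide's definition it would also accept a carboxylic acid as having a hydroxyl group.

```python
# Toy molecular graph for ethanol: atom index -> element, bond -> bond order.
# Hydrogens on the carbons are omitted for brevity (an illustrative simplification).
ETHANOL_ATOMS = {1: "C", 2: "C", 3: "O", 4: "H"}
ETHANOL_BONDS = {(1, 2): 1, (2, 3): 1, (3, 4): 1}

ETHANE_ATOMS = {1: "C", 2: "C"}
ETHANE_BONDS = {(1, 2): 1}


def single_bond_neighbours(atom, bonds):
    """Yield the atoms joined to `atom` by a single bond."""
    for (a, b), order in bonds.items():
        if order != 1:
            continue
        if a == atom:
            yield b
        elif b == atom:
            yield a


def hydroxyl_groups(atoms, bonds):
    """Return (carbon, oxygen, hydrogen) triples matching the pattern C-O-H."""
    found = []
    for carbon, element in atoms.items():
        if element != "C":
            continue
        for oxygen in single_bond_neighbours(carbon, bonds):
            if atoms[oxygen] != "O":
                continue
            for hydrogen in single_bond_neighbours(oxygen, bonds):
                if atoms[hydrogen] == "H":
                    found.append((carbon, oxygen, hydrogen))
    return found


def is_alcohol(atoms, bonds):
    """Alcohol: a compound that has at least one hydroxyl group as a part."""
    return bool(hydroxyl_groups(atoms, bonds))


print(hydroxyl_groups(ETHANOL_ATOMS, ETHANOL_BONDS))
print(is_alcohol(ETHANE_ATOMS, ETHANE_BONDS))
```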

Page 20: 20090511 Manchester Biochemistry

Organic Compound Ontology

28 OC

Page 21: 20090511 Manchester Biochemistry

Question Answering

• Query all attributes

• Query PubChem, DrugBank and dbPedia*

* Requires import of relevant URIs

Page 22: 20090511 Manchester Biochemistry

But...

• Molecules represented as individuals because OWL-DL only allows tree-like class expressions
– No variable binding (e.g. ?x) ... no cyclic molecule/functional group descriptions at the class level

• Boris Motik et al. proposed Description Graphs
– Robert Stevens, Duncan Hull, Uli Sattler (and I) exploring their use for chemical representation and sub-structure reasoning....

Page 23: 20090511 Manchester Biochemistry

turns out that…

• Using InChI’s precise numbering system, we can specify molecular graphs at the class level
• Simple 3-carbon ring system

CarbonAtom that hasPosition value 1 and hasSingleBondTo exactly 1 (CarbonAtom that hasPosition value 2 and hasSingleBondTo exactly 1 (CarbonAtom that hasPosition value 3 and hasSingleBondTo exactly 1 (CarbonAtom that hasPosition value 1)))

(ignoring hydrogens)

InChI=1/C3H6/c1-2-3-1/h1-3H2
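The nested expression for the 3-carbon ring can be generated from the connection layer (`c1-2-3-1`) rather than written by hand. This is a minimal Python sketch under a strong assumption: it handles only an unbranched ring of carbon atoms joined by single bonds, and the helper names are hypothetical.

```python
def ring_positions(connection_layer: str) -> list:
    """Read an unbranched ring such as '1-2-3-1' into the positions [1, 2, 3]."""
    walk = [int(token) for token in connection_layer.split("-")]
    if len(walk) < 4 or walk[0] != walk[-1]:
        raise ValueError("expected a simple ring that closes on its first atom")
    return walk[:-1]


def ring_class_expression(positions: list) -> str:
    """Build the nested class expression, innermost term first."""
    # The innermost term closes the ring by pointing back at the first position.
    expr = f"CarbonAtom that hasPosition value {positions[0]}"
    for pos in reversed(positions):
        expr = (f"CarbonAtom that hasPosition value {pos} "
                f"and hasSingleBondTo exactly 1 ({expr})")
    return expr


cyclopropane = ring_class_expression(ring_positions("1-2-3-1"))
print(cyclopropane)
```

Because the canonical numbering fixes which atom is position 1, the same molecule always yields the same expression.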

Page 24: 20090511 Manchester Biochemistry

InChI for Proteins???

• Possible... but a 1000-residue protein would contain ~15,000 atoms on average....
– Size of the string will be enormous

• We can use InChIKeys (SHA-256-based hash), but then we need to provide a you-submit-InChI, we-store-both and they-look-it-up service.
– OpenBabel seemed to struggle with anything over 100 residues
• Needs some performance tweaking / commercial solutions
– Modularize InChI construction for (linear) polymers?
• Make InChI strings for each residue, and concatenate – rename the atoms according to the residue position
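Since a hash cannot be reversed, the look-up service has to store the key alongside the full string. A minimal in-memory Python sketch of that idea follows; the class `InChIRegistry` is hypothetical, and the key is a plain SHA-256 hex digest used as a stand-in, not a real InChIKey (which is produced by the InChI software).

```python
import hashlib


class InChIRegistry:
    """In-memory stand-in for a you-submit, we-store-both, they-look-it-up service."""

    def __init__(self):
        self._inchi_by_key = {}

    @staticmethod
    def make_key(inchi: str) -> str:
        # Stand-in key: a SHA-256 hex digest, NOT a genuine InChIKey.
        return hashlib.sha256(inchi.encode("utf-8")).hexdigest()

    def submit(self, inchi: str) -> str:
        """Store the full InChI under its fixed-length key and return the key."""
        key = self.make_key(inchi)
        self._inchi_by_key[key] = inchi
        return key

    def look_up(self, key: str):
        """Return the stored InChI, or None if the key was never submitted."""
        return self._inchi_by_key.get(key)


registry = InChIRegistry()
cyclopropane = "InChI=1/C3H6/c1-2-3-1/h1-3H2"
key = registry.submit(cyclopropane)
print(key, registry.look_up(key))
```

The key has a fixed length however large the molecule, which is the property that matters for protein-sized structures.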

Page 25: 20090511 Manchester Biochemistry

Identifiers for Atoms

• Atom identifiers can be consistently retrieved from the InChI model.
– Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match.
– In our plugin, component naming was based on the assigned molecule identifier
e.g. pubchemid#aN, where a is the “atom” label and N is the position
– Use hash of InChI as base?
e.g. id#aN

Page 26: 20090511 Manchester Biochemistry

What about identifiers for collections of atoms?

• Potentially useful in describing residues, PTMs, binding sites, etc.
– Is the lack of connectivity sufficient?

• Contiguous:
– ranges (id#aN-aN)
– enumerations (id#aN,aN,aN)

• Non-contiguous:
– Combination of ranges, enumerations?
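One possible way to combine ranges and enumerations is sketched below in Python. It treats "contiguous" as consecutive canonical positions, which is an assumption (the slide may equally mean bonded atoms), and the function name `atom_collection_id` is hypothetical.

```python
def atom_collection_id(base: str, positions) -> str:
    """Format atom positions as id#aN-aN ranges where consecutive, else id#aN,aN."""
    ordered = sorted(set(positions))
    if not ordered:
        raise ValueError("at least one atom position is required")
    runs = []
    start = prev = ordered[0]
    for pos in ordered[1:]:
        if pos != prev + 1:
            # The current run of consecutive positions has ended.
            runs.append((start, prev))
            start = pos
        prev = pos
    runs.append((start, prev))
    parts = [f"a{s}" if s == e else f"a{s}-a{e}" for s, e in runs]
    return f"{base}#" + ",".join(parts)


print(atom_collection_id("id", range(50, 66)))
print(atom_collection_id("id", [1, 3, 7]))
print(atom_collection_id("id", [1, 2, 3, 7, 9, 10]))
```

Sorting and de-duplicating first means the same set of atoms always produces the same identifier, regardless of input order.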

Page 27: 20090511 Manchester Biochemistry

Can we reuse our positional nomenclature for residues?

• Residues are generally referred to by their absolute position in the biopolymer sequence.

e.g. Pro @ X on Protein Y
id#a50-a65 owl:sameAs id#r5
id#r5_a1-r5_a15 owl:sameAs id#r5

• Collection of residues might follow the same rules as a collection of atoms.
– Useful for defining domains, motifs, etc.

Page 28: 20090511 Manchester Biochemistry

An Alternative Scheme

• We already have a simplified representation for biopolymers...
– Canonical sequence is represented by a string of single letter characters
• DNA: ACGT
• RNA: ACGU
• Proteins: 20 amino acids (not B,J,O,U,X,Z)
– Modifications can be referred to with ChEBI/PSI-MOD ontology (e.g. Prolyl hydroxylated residue @ 402)

• Each (modified) residue must have its InChI description so as to capture explicit structural deviations (de-protonation, etc.)

Page 29: 20090511 Manchester Biochemistry

PSI-MOD contains modified residues with links to structural descriptions

Page 30: 20090511 Manchester Biochemistry

But what if we have a modification that isn’t contained in the ontology!

• No problem... define your own term, with the corresponding structural description (InChI, SMILES), and add to an ontology document...
– If you’re using OWL, you can add the import statement and publish it.
• And, of course, you should submit it to the appropriate ontology development teams. (and later make it equivalent to)

Page 31: 20090511 Manchester Biochemistry

While we’re at it, we could extend our expressive capability to create broader descriptions:

• Specification
– Exactly mod1@posX
– Only mod1@posX

• Minimum:
– At least mod1@posX

• Combination:
– mod1@posX AND mod2@posY, X != Y

• Possibilities/Uncertainty:
– (mod1 OR mod2) @posX

• Exclusion:
– not mod1 @posX
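These description patterns can be read as predicates over the set of modifications a given protein form carries. The Python sketch below is one interpretation, not the author's formalism: it reads "only" as "this modification and no other anywhere on the protein", and all function names and modification labels are illustrative.

```python
def at_least(mod, pos):
    """Minimum: mod is present at pos; other modifications are allowed."""
    return lambda mods: mod in mods.get(pos, set())


def only(mod, pos):
    """Specification: mod at pos is the sole modification on the whole protein."""
    def check(mods):
        present = {p: m for p, m in mods.items() if m}
        return present == {pos: {mod}}
    return check


def all_of(*constraints):
    """Combination: every constraint must hold."""
    return lambda mods: all(c(mods) for c in constraints)


def any_of(*constraints):
    """Possibilities/uncertainty: at least one constraint holds."""
    return lambda mods: any(c(mods) for c in constraints)


def negate(constraint):
    """Exclusion: the constraint must not hold."""
    return lambda mods: not constraint(mods)


# Two HIF1α forms from the case study, as position -> set of modification labels.
p402_hyd = {402: {"hydroxylation"}}
both_hyd = {402: {"hydroxylation"}, 564: {"hydroxylation"}}

doubly_hydroxylated = all_of(at_least("hydroxylation", 402),
                             at_least("hydroxylation", 564))
print(doubly_hydroxylated(both_hyd), doubly_hydroxylated(p402_hyd))
```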

Page 32: 20090511 Manchester Biochemistry

So what if... we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?)

That way we get a unique identifier with a description that is extensible and compatible with the semantic web.
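The idea only works if the serialization is canonical: the same description must always produce the same bytes before hashing. The Python sketch below illustrates this with sorted JSON standing in for an OWL serialization, which is an assumption made to keep the example self-contained; the function name, the toy sequence and the modification labels are all hypothetical.

```python
import hashlib
import json


def proteoform_id(sequence: str, modifications) -> str:
    """Derive an identifier from a canonical description of sequence + PTMs.

    `modifications` is an iterable of (position, modification_name) tuples.
    """
    description = {
        "sequence": sequence,
        # Sort and de-duplicate so that input order cannot change the identifier.
        "modifications": [list(m) for m in sorted(set(modifications))],
    }
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Toy sequence, not the real HIF1α sequence.
form_a = proteoform_id("MPLK", [(2, "hydroxyproline"), (4, "ubiquitination")])
form_b = proteoform_id("MPLK", [(4, "ubiquitination"), (2, "hydroxyproline")])
unmodified = proteoform_id("MPLK", [])
print(form_a == form_b, form_a == unmodified)
```

Anyone holding the same description can regenerate the identifier without consulting a curator or a database, which is the property the slide is after.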

Page 33: 20090511 Manchester Biochemistry

Biological Identifier Service

Page 34: 20090511 Manchester Biochemistry
Page 35: 20090511 Manchester Biochemistry

Extensible to create other class descriptions

• Chemical
– Conformation (e.g. open vs closed form)

• Biological
– Species
– mRNA/Gene from which it was transcribed/encoded

Page 36: 20090511 Manchester Biochemistry

What does this mean to providers and consumers?

• Automatic identifier and description generation
• Data providers can get the identifier that exactly matches their entity.
• Consumers can get the exact description of a reported identifier.

• Registry can keep track of provider to entity
– Discover where additional information can be found

Page 37: 20090511 Manchester Biochemistry

Semantic Science will create a Bio2RDF endpoint to link semantically equivalent biochemical identifiers

Page 38: 20090511 Manchester Biochemistry

Situational Modeling

Page 39: 20090511 Manchester Biochemistry

Uniprot example revisited

Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.

:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
:A hasParticipant (:5 and :Enzyme)

:B rdfs:subClassOf :Interaction
:B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
:B :hasParticipant (:6)

:1 (HIF1α)
:2 (HIF1α + P402hyd)
:3 (HIF1α + P564hyd)
:4 (HIF1α + P402hyd + P564hyd)
:5 (EGLN1)
:6 (VHL)

Please ignore the made up short-hand syntax!

Page 40: 20090511 Manchester Biochemistry

Inferring Protein Participation

• OWL Role Chain

hasParticipant o isPartOf -> hasParticipant
If a process has the part as a participant, then the whole is also a participant

:0#r402 :isPartOf :0
:1#r402 :isPartOf :1

:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)

:A hasParticipant :0
:A hasParticipant :1
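The effect of the role chain can be imitated with a small fixed-point loop over pairs. This Python sketch is only an illustration of the inference pattern, not an OWL reasoner; the function name is hypothetical and the identifiers reuse the slide's shorthand.

```python
def infer_participants(has_participant, is_part_of):
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new appears."""
    inferred = set(has_participant)
    changed = True
    while changed:
        changed = False
        for process, thing in list(inferred):
            for part, whole in is_part_of:
                if part == thing and (process, whole) not in inferred:
                    # The process involves the part, so it also involves the whole.
                    inferred.add((process, whole))
                    changed = True
    return inferred


asserted = {("A", "0#r402"), ("A", "1#r402")}
parthood = {("0#r402", "0"), ("1#r402", "1")}
closure = infer_participants(asserted, parthood)
print(sorted(closure))
```

Repeating the loop until no pair is added means a chain of parts (atom in residue, residue in protein) is followed all the way up.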

Page 41: 20090511 Manchester Biochemistry

Summary

• Biochemical identity is tightly linked to accurate descriptions.

• Automatic and consistent identifier generation will allow anybody to specify findings according to the biopolymers for which they were observed
– No curation required!!!!
– Will be discovered automatically
– Links biochemical knowledge at various levels of granularity

• Situational modeling enables the careful separation of what is known under a particular circumstance.

Page 42: 20090511 Manchester Biochemistry

dumontierlab.com


Special thanks to PhD Student Leonid Chepelev for insightful discussions

semanticscience.org