Post on 27-Feb-2019
InChI keys as standard global identifiers
in chemistry web services
Russ Hillard ACS, Salt Lake City
March 2009
Context of this talk
• We have created a web service
• That aggregates sources built independently
  - Dozens of individual databases
  - Containing molecules and reactions
  - Created using non-standardized business rules (with respect to chemical representation)
• Covers large record sets
  - 30+ million unique molecules from combined sources
  - 5+ million unique reactions from combined sources
• Requires integration across all sources
  - Based on shared chemical entities
  - Where “entity” means chemical compound(s)
  - And “chemical compound” has unique identifiers:
    - Chemical structure elucidated by scientists
    - Systematic chemical name derived from structure
    - Graphic representation of structure assigned at registration
    - Trivial chemical name assigned to structure
    - Registry number assigned to structure
    - Key or string computed from structure
The basic problem . . .
ChemInform (FIZ Chemie)
Beilstein (Elsevier) BRN3936786
Curr. Chem Reactions (Thomson)
BRN3936786
Registry numbers:
  - 5693-99-2 stereochem unspecified
  - 71403-94-6 relative stereochem
  - 121651-02-3 absolute stereochem (2R,3S)
  - 126720-47-6 absolute stereochem (2S,3R)
Names:
  - trans-3-phenyloxirane-carboxaldehyde
  - (2R*,3R*)-2,3-epoxycinnamaldehyde
  - trans-cinnamaldehyde epoxide
  - Epoxyzimtaldehyd
  - (2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)
• Don’t always have or know the BRN, CASRN, ChemSpiderID, MFCD# . . .
• Relationship of Structure:RegNumbers is often 1:many
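The 1:many relationship can be made concrete with a small lookup table. This is only a sketch: it treats the four registry numbers on this slide as belonging to one drawn structure, and the label "structure-1" is a hypothetical internal name, not a real identifier.

```python
from collections import defaultdict

# Registry numbers from the slide, each with the stereochemical
# description attached to it there.
REGISTRY_NUMBERS = {
    "5693-99-2": "stereochem unspecified",
    "71403-94-6": "relative stereochem",
    "121651-02-3": "absolute stereochem (2R,3S)",
    "126720-47-6": "absolute stereochem (2S,3R)",
}

# Hypothetical internal label for the drawn structure (an assumption).
STRUCTURE_ID = "structure-1"


def build_index(structure_id, registry_numbers):
    """Return (structure -> registry numbers, registry number -> structure)."""
    forward = defaultdict(list)
    reverse = {}
    for number in registry_numbers:
        forward[structure_id].append(number)
        reverse[number] = structure_id
    return dict(forward), reverse


forward, reverse = build_index(STRUCTURE_ID, REGISTRY_NUMBERS)
print(len(forward[STRUCTURE_ID]), "registry numbers for", STRUCTURE_ID)
```

Looking up any one of the four numbers in `reverse` leads back to the same structure, which is exactly why a registry number alone is an awkward join key.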
One solution
• Define our own set of registration rules
• Register all structures to one big database
  - Normalize structures according to our rules
• Assign a unique record identifier (URI) to the normalized structures
• Correlate our URIs to the native sources
• Use our URIs to correlate records across different databases
• We have done this but have not exposed the URIs
  - Even with modern computers this is resource intensive
  - Problem is compounded when data is from different providers
  - Does the world really need another “Global Registry Number”?
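The registration steps above can be sketched as follows. Everything here is illustrative: the `Registry` class, the `URI-000001` format, and `toy_normalize` are assumptions made for the example, and the "structures" are placeholder strings rather than real chemistry. Real normalization rules operate on the chemical structure itself.

```python
import itertools


class Registry:
    """Toy registry: normalize, assign an internal id, remember the sources."""

    def __init__(self, normalize):
        self._normalize = normalize
        self._ids = {}       # normalized structure -> internal id
        self._sources = {}   # internal id -> set of (source, native id)
        self._counter = itertools.count(1)

    def register(self, source, native_id, structure):
        key = self._normalize(structure)
        if key not in self._ids:
            # Hypothetical id format, chosen only for this sketch.
            self._ids[key] = f"URI-{next(self._counter):06d}"
        uri = self._ids[key]
        self._sources.setdefault(uri, set()).add((source, native_id))
        return uri

    def correlate(self, uri):
        """All native records, across sources, that share this internal id."""
        return sorted(self._sources.get(uri, ()))


def toy_normalize(structure):
    # Placeholder rule (an assumption): ignore case and whitespace.
    return "".join(structure.split()).lower()


registry = Registry(toy_normalize)
a = registry.register("source_a", "A-1", "Compound X ")
b = registry.register("source_b", "B-7", "compound x")
print(a, registry.correlate(a))
```

Both records land on the same internal id, so the id can be used to correlate them. The cost noted on the slide is that every structure from every provider has to pass through the same normalization step.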
As currently implemented this gives:
ChemInform (FIZ Chemie)
Great for internal correlations:
  - Reactions: commercial availability, toxicity, bioactivity . . . etc.
  - Molecules: synthetic preparations of, organic reactions of, toxicity . . . etc.
But what about external correlations?
  - Anything we don’t/can’t index
  - Commercial data
  - Proprietary data
Will focus on these two options
• Assume structures as registered are correct
  - Accept that we cannot always normalize according to our rules
• Use a derived (calculated) compound identifier
• Is this possible?
  - IUPAC Name
  - Wiswesser Line Notation (WLN)
  - Molfile and its derivatives
  - SEMA Key
  - MDL Line Notation
  - SMILES
  - Chemical Markup Language (CML)
  - InChI Name
  - InChI Key
  - NEMA key
Alternative solution
IUPAC - International Chemical Identifier
The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data compilations.
The initial work focused on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. "Serializing" the result to create a string is the final, straightforward, step in creating an identifier. From: http://www.iupac.org/web/ins/2000-025-1-800
For this presentation all InChI Keys are generated using:
final standard InChI/InChIKey v. 1.02 software
The Morgan Algorithm
Invented by H. L. Morgan, J. Chem. Doc., 5, 107 (1965)
  - Underpins many of the systems in use today
  - The basis of CAS Online
Identifies atoms based on an extended connectivity value. The atom with the highest value becomes the first atom in the name, and its neighbors are then listed in descending order. Ties are resolved using additional parameters, for example bond order and atomic number.
Does not handle stereochemistry
SEMA developed to handle stereoisomers
  - W. T. Wipke and T. M. Dyott, J. Amer. Chem. Soc., 96, 4825 (1974).
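The extended connectivity step of the Morgan algorithm can be sketched in a few lines. This is a minimal illustration, not the CAS or SEMA implementation. It assumes a hydrogen-suppressed graph given as adjacency lists and leaves out tie-breaking by bond order and atomic number. The function name `extended_connectivity` is made up for the example.

```python
def extended_connectivity(adjacency):
    """Return one extended connectivity value per atom.

    adjacency: list of neighbor-index lists, one entry per atom.
    Starts from each atom's number of neighbors, then repeatedly replaces
    every value with the sum of its neighbors' values, for as long as the
    number of distinct values keeps increasing.
    """
    values = [len(neighbors) for neighbors in adjacency]
    distinct = len(set(values))
    while True:
        new_values = [sum(values[n] for n in neighbors) for neighbors in adjacency]
        new_distinct = len(set(new_values))
        if new_distinct <= distinct:
            # No further discrimination gained: keep the previous values.
            return values
        values, distinct = new_values, new_distinct


# Five-atom chain 0-1-2-3-4 (for example, a pentane carbon skeleton).
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
ec = extended_connectivity(chain)
first_atom = max(range(len(ec)), key=ec.__getitem__)
print(ec, first_atom)  # [2, 3, 4, 3, 2] 2
```

The central atom gets the highest value and would be numbered first. The two ends and the two inner atoms remain tied, which reflects the symmetry of the chain.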
NEMA
NEMA produces a unique name and key for a wider range of structures than SEMA. It extends perception to non-tetrahedral stereogenic centers, it supports both 2D and 3D stereochemistry perception, and it does not have an atom limit. It is proprietary to Symyx, but it is exposed in our products; for example, Symyx Draw and Symyx Direct generate NEMA keys.
The work of Wipke et al identified the value of a constitutional key and a stereo key. This approach has been incorporated into NEMA.
W. T. Wipke, S. Krishnan, and G. I. Ouchi, J. Chem. Inf. Comput. Sci., 18, 32 (1978).
Tautomers (mobile H atoms)
Different structures
Different systematic names
Presumably exist in equilibrium
InChI Keys are identical
NEMA Keys are different
Both structures are registered to our collection
57531-38-1 assigned to both structures
4(5)-chloro-5(4)-nitroimidazole
5(4)-chloro-4(5)-nitroimidazole
4-chloro-5-nitroimidazole
5-chloro-4-nitroimidazole
4-chloro-5-nitro-1(3)H-imidazole
Stereoisomers
Pure enantiomer
Enantiomeric pair
No stereo
InChI does not distinguish pure enantiomer from racemate
Concern with stereochem goes back to the basic problem shown earlier: one structure carries four registry numbers, depending on whether stereochem is unspecified, relative, or absolute ((2R,3S) or (2S,3R)).
Layered structure of InChI Keys
AAAAAAAAAAAAAA-BBBBBBBBCD
AAAAAAAAAAAAAA = skeleton
BBBBBBBB = structural features: mobile hydrogens, isotopes, metal bonds ...
C = flag, InChI version . . .
D = check character
The ability to reconstruct InChI Keys into classes of related structures sets them apart
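Grouping keys into classes of related structures can be sketched by splitting each key along the layout drawn above and collecting keys that share the skeleton block. The code follows the 14-character plus 10-character layout exactly as shown on this slide. Keys produced by other versions of the InChI software may be laid out differently, so the length check would need adjusting. The keys used below are placeholders, not real InChI Keys.

```python
from collections import defaultdict
from typing import NamedTuple


class KeyParts(NamedTuple):
    skeleton: str   # AAAAAAAAAAAAAA
    features: str   # BBBBBBBB
    flag: str       # C
    check: str      # D


def split_key(key):
    """Split a key laid out as AAAAAAAAAAAAAA-BBBBBBBBCD into its parts."""
    first, sep, second = key.partition("-")
    if sep != "-" or len(first) != 14 or len(second) != 10:
        raise ValueError(f"not in the 14-10 layout shown on the slide: {key!r}")
    return KeyParts(first, second[:8], second[8], second[9])


def group_by_skeleton(keys):
    """Collect keys that share the same skeleton block."""
    groups = defaultdict(list)
    for key in keys:
        groups[split_key(key).skeleton].append(key)
    return dict(groups)


# Placeholder keys (assumptions for illustration only).
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBCD",
    "AAAAAAAAAAAAAA-EEEEEEEECD",
    "FFFFFFFFFFFFFF-BBBBBBBBCD",
]
groups = group_by_skeleton(keys)
print({skeleton: len(members) for skeleton, members in groups.items()})
```

The first two placeholder keys fall into one class because they share a skeleton block and differ only in the structural-features block.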
There is still plenty to do……
• Biologics
  - Average pipeline contains 22% biologics
  - Some companies are near 50%
  - Peptides & modified peptides
  - Nucleic acid sequences
• Generics
  - Markush structures
• Polymers
  - Repeating monomers
  - Block copolymers
  - Cross-linked polymers
So what should go into our web service?
• Unique chemical structures registered to Compound Index
• Unique reaction structures registered to Reaction Index
• Assigned global identifiers as available
  - Registry numbers (BRN, CASRN, MFCD#s, PubChemIDs . . .)
• Computed global identifiers for all compounds
  - InChI strings
  - InChI Keys
  - NEMA Keys
• Register InChI Keys to ACD and other Symyx databases
• Let the consumer decide which to use
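What “let the consumer decide” could look like is sketched below, using the tautomer pair from the earlier slide. The key values are placeholders, chosen only to reproduce the behavior reported in this talk (identical InChI Keys, different NEMA Keys, one registry number for both). The record fields and the `group_records` helper are assumptions for the example.

```python
from collections import defaultdict

# Two tautomer records; key strings are placeholders, not computed keys.
RECORDS = [
    {"name": "4-chloro-5-nitroimidazole", "registry_number": "57531-38-1",
     "inchi_key": "INCHIKEY-PLACEHOLDER-1", "nema_key": "NEMA-PLACEHOLDER-1"},
    {"name": "5-chloro-4-nitroimidazole", "registry_number": "57531-38-1",
     "inchi_key": "INCHIKEY-PLACEHOLDER-1", "nema_key": "NEMA-PLACEHOLDER-2"},
]


def group_records(records, identifier):
    """Group record names by whichever identifier the consumer picks."""
    groups = defaultdict(list)
    for record in records:
        groups[record[identifier]].append(record["name"])
    return dict(groups)


for identifier in ("inchi_key", "nema_key", "registry_number"):
    print(identifier, len(group_records(RECORDS, identifier)), "group(s)")
```

A consumer who wants tautomers merged would pick the InChI Key. One who needs them kept apart would pick the NEMA Key.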