iccs9 2011 talk

Upload: markus-sitzmann

Post on 26-Jun-2015

Page 1: ICCS9 2011 Talk

NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space ICCS9

Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]

[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany

Page 2: ICCS9 2011 Talk

Chemistry Space Analysis

• how many small molecules are there currently?
• since the early 2000s: enormous increase in the number of databases containing small molecules, e.g. PubChem, ChemSpider, ChEMBL, DrugBank – what is the overlap?
• many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms)
• growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …)

Page 3: ICCS9 2011 Talk

Chemical Identifier Resolver

[Diagram: the Chemical Identifier Resolver links a chemical structure to the following identifiers and representations]

• NCI/CADD Identifiers
• InChI/InChIKey
• ChemSpider ID
• PubChem SID/CID
• chemical names
• CAS Registry Number
• NSC number
• FDA UNII
• ChemNavigator SID
• SMILES
• SD File
• Chemical Formula
• ChEBI ID
• PDB Ligand ID
• MRV
• CML
• SYBYL Line Notation
• GIF image

Page 4: ICCS9 2011 Talk

Chemical Identifier Resolver
NCI/CADD Web Resources

http://cactus.nci.nih.gov/chemical/structure

Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier.

first beta release: July 2009
current release (beta 4): April 2011

Page 5: ICCS9 2011 Talk

Chemical Identifier Resolver
NCI/CADD Web Resources

• it is usable by a simple URL API:

http://cactus.nci.nih.gov/chemical/structure/“identifier”/“representation”

example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas

response: 204255-11-8 (MIME type: text/plain)

XML format: http://cactus.nci.nih.gov/chemical/structure/“identifier”/“representation”/xml

• if a request is not resolvable: HTTP 404 status message
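The URL scheme on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `build_resolver_url` and `resolve` are hypothetical, and percent-encoding the identifier is an assumption of this sketch, not something stated in the talk.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_resolver_url(identifier, representation, xml=False):
    """Build BASE_URL/<identifier>/<representation>, optionally with /xml."""
    # Assumption: the identifier is percent-encoded so that spaces in names
    # or characters such as '#' in SMILES stay inside one path segment.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10):
    """Return the text/plain response, or None for an HTTP 404 (not resolvable)."""
    url = build_resolver_url(identifier, representation)
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:
            return None
        raise  # any other HTTP error is passed on to the caller
```

A call such as `resolve("Tamiflu", "cas")` would request the example URL from the slide; whether the service still answers as it did in 2011 is not something this sketch can guarantee.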

Page 6: ICCS9 2011 Talk

Chemical Identifier Resolver
NCI/CADD Public Web Resources

http://cactus.nci.nih.gov/chemical/structure

“identifier” (input accepted by the resolver):
• chemical names
• IUPAC names (by OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ChemSpider ID
• ChemNavigator SID
• FDA UNII

“representation” (output that can be requested):
• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl, /3d
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid

Page 7: ICCS9 2011 Talk

Chemical Identifier Resolver
NCI/CADD Web Resources

[Flow diagram: from http request to http response]

• http request: “identifier” and “representation”
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): the structure is available directly
• identifier is a hashed structure representation (e.g. InChIKey), trivial name etc. (e.g. CAS number, chemical name): database lookup in the NCI/CADD Chemical Structure Database (CSDB) yields the structure
• calculation of the requested structure representation (e.g. InChI, GIF image) by CACTVS
• http response with the corresponding MIME type
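The branching step of this workflow (full structure representation versus hashed identifier or name) can be illustrated with a small dispatcher. This is a hedged sketch, not the resolver's actual detection code: the function names are hypothetical, only three identifier types are recognized, and everything else is left undetermined because telling a SMILES string from a chemical name needs a real parser.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, joined by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas_number(text):
    """Check the CAS layout and its check digit."""
    match = CAS_RE.match(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    weighted = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return weighted % 10 == int(match.group(3))


def detect_identifier_type(identifier):
    """Return 'inchi', 'inchikey', 'cas', or 'unknown'."""
    if identifier.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(identifier):
        return "inchikey"
    if is_cas_number(identifier):
        return "cas"
    return "unknown"


def route(identifier):
    """Choose the workflow branch for an identifier (simplified)."""
    kind = detect_identifier_type(identifier)
    if kind == "inchi":
        return "calculate"        # full structure: compute the representation
    if kind in ("inchikey", "cas"):
        return "database lookup"  # hashed or registry identifier: look it up
    return "undetermined"         # names and SMILES need further parsing
```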

Page 8: ICCS9 2011 Talk

Chemical Structure Database (CSDB)
Chemical Identifier Resolver

• ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider …
• Commercial Sources / others (~6%): Asinex, Comgenex, eMolecules, ChEMBL, …

currently:
• ~150 chemical structure databases
• ~120 million structure records
• ~81.6 million unique structures by NCI/CADD FICuS Identifier
• ~84 million unique structures by Std. InChIKey

Page 9: ICCS9 2011 Talk

NCI/CADD Structure Identifiers

FICTS, FICuS, uuuuu

Page 10: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

• based on hashcodes calculated by the chemoinformatics toolkit CACTVS
• CACTVS hashcodes:
  • represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
  • high sensitivity to structural features of a compound
  • change if connectivity changes

[Structure drawing of the example compound, with its hashcode 9850FD9F9E2B4E25]

Page 11: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

• calculation of a set of parent structures with different sensitivity to chemical features
• representation of chemical structures on different levels

[Workflow diagram: original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB; read from SDF, SMILES, or a database) → structure normalization → parent structure (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier]

Page 12: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

• adjustable levels of sensitivity; for each feature an identifier is either sensitive or un-sensitive:

feature | normalization applied when un-sensitive
Fragments | keep only largest organic fragment
Isotopes | ignore isotope labels
Charges | uncharge
Tautomers | find canonical tautomer
Stereochemistry | discard stereo information

[Structure drawings: one example pair (sensitive / un-sensitive form) for each of the five features]

Page 13: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

[Figure: example structure pairs for Fragments, Isotopes, Charges, Tautomers, and Stereochemistry, each shown in a sensitive and an un-sensitive form]

Page 14: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

FICTS: sensitive to Fragments, Isotopes, Charges, Tautomers, and Stereochemistry

• representation of the exact drawing

[Figure: the example structure pairs for the five features; the two forms of a pair are marked as different (≠)]

Page 15: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

FICuS: sensitive to Fragments, Isotopes, Charges, and Stereochemistry; un-sensitive (u) to Tautomers

• comes closest to how a chemist perceives a compound

[Figure: the example structure pairs for the five features; the tautomer pair is marked as equal (=), the other pairs as different (≠)]

Page 16: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

uuuuu: un-sensitive to Fragments, Isotopes, Charges, Tautomers, and Stereochemistry

• closely related forms of the same compound

[Figure: the example structure pairs for the five features; all pairs are marked as equal (=)]
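The naming convention of the three tags can be read mechanically: each of the five positions stands for one feature, and a lowercase "u" marks the identifier as un-sensitive to that feature. The following sketch decodes a tag under that reading; the function name `decode_tag` and the dictionary layout are hypothetical.

```python
# Feature order follows the letters of the tag: F, I, C, T, S.
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")


def decode_tag(tag):
    """Map a five-letter tag such as 'FICuS' to {feature: is_sensitive}."""
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected a {len(FEATURES)}-letter tag, got {tag!r}")
    # A 'u' at a position means: un-sensitive to the feature at that position.
    return {feature: letter != "u" for feature, letter in zip(FEATURES, tag)}
```

For example, `decode_tag("FICuS")` reports every feature as sensitive except `tautomers`.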

Page 17: ICCS9 2011 Talk

NCI/CADD Structure Identifiers
Unique Representation of Chemical Structures

identifier | Fragments | Isotopes | Charges | Tautomers | Stereo
FICTS | sensitive | sensitive | sensitive | sensitive | sensitive
FICuS | sensitive | sensitive | sensitive | not sensitive | sensitive
uuuuu | not sensitive | not sensitive | not sensitive | not sensitive | not sensitive

format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>

[Structure drawing: sodium salt of the example compound] with its three identifiers:

4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
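The identifier format `<CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>` lends itself to a simple parser. The sketch below is an illustration only: the field widths (16 hex digits, two-digit version, two-character checksum) are inferred from the three examples on the slide, the names `StructureIdentifier` and `parse_identifier` are hypothetical, and the checksum is not verified because the talk does not describe how it is computed.

```python
import re
from typing import NamedTuple


class StructureIdentifier(NamedTuple):
    hashcode: int   # 64-bit unsigned value of the 16-digit hex hashcode
    tag: str        # 'FICTS', 'FICuS', or 'uuuuu'
    version: str
    checksum: str   # kept as text; its algorithm is not given in the talk


# Assumed field widths, inferred from the examples shown on the slide.
IDENTIFIER_RE = re.compile(
    r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$"
)


def parse_identifier(text):
    """Split an NCI/CADD identifier string into its four fields."""
    match = IDENTIFIER_RE.match(text)
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    hashcode, tag, version, checksum = match.groups()
    return StructureIdentifier(int(hashcode, 16), tag, version, checksum)
```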

Page 18: ICCS9 2011 Talk

[Figure: structure drawing of histidine surrounded by related structure records, grouped as: charged form, tautomer, isotope (15N label), salt, stereoisomers, and “errors”]

Page 19: ICCS9 2011 Talk

FICTS

[Figure: the histidine structure records (charged form, tautomer, isotope, salt, stereoisomers, “errors”), each labeled with its FICTS identifier]

FICTS identifiers shown in the figure:
• 9850FD9F9E2B4E25-FICTS (twice)
• E5F83F10C5DB080A-FICTS (twice)
• A3DAE0788050DDE4-FICTS
• B2FDA68AEDA06DB9-FICTS
• E92E4BA2869F3611-FICTS
• 8A7AD1EB498CC76A-FICTS
• 6C16DE2351F9FF50-FICTS

Page 20: ICCS9 2011 Talk

FICuS

[Figure: the histidine structure records (charged form, tautomer, isotope, salt, stereoisomers, “errors”), each labeled with its FICuS identifier]

FICuS identifiers shown in the figure:
• 9850FD9F9E2B4E25-FICuS (three times)
• E5F83F10C5DB080A-FICuS (twice)
• A3DAE0788050DDE4-FICuS
• B2FDA68AEDA06DB9-FICuS
• E92E4BA2869F3611-FICuS
• 8A7AD1EB498CC76A-FICuS

Page 21: ICCS9 2011 Talk

uuuuu

[Figure: the histidine structure records (charged form, tautomer, isotope, stereoisomers, salt, “errors”), each labeled with its uuuuu identifier]

All records in the figure carry the same hashcode: 9850FD9F9E2B4E25-uuuuu (one label on the slide reads 9850FD9F9E2B4E25-FICuS).

Page 22: ICCS9 2011 Talk

Std. InChIKey

[Figure: the histidine structure records (charged form, tautomer, isotope, stereoisomers, salt, “errors”), each labeled with its Std. InChIKey]

Std. InChIKeys shown in the figure:
• HNDVDQJCIGZPNO-UHFFFAOYSA-N (four times)
• HNDVDQJCIGZPNO-CDYZYAPPSA-N
• HNDVDQJCIGZPNO-RXMQYKEDSA-N
• HNDVDQJCIGZPNO-YFKPBYRVSA-N
• UHPNKBYGGMJTIM-UHFFFAOYSA-M (twice)

Page 23: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Structure Normalization

[Diagram: original structure records are merged into FICTS parent structures, these into FICuS parent structures, and these into uuuuu parent structures]

• 119.8 million original structure records in CSDB
• 83.1 million FICTS parent structures
• 81.6 million FICuS parent structures
• 76.2 million uuuuu parent structures
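How much each normalization level shrinks the set follows directly from the four counts on this slide. A minimal sketch of that arithmetic (the variable and function names are hypothetical; the counts are the slide's values in millions):

```python
# Counts in millions, as given on the slide.
LEVELS = [
    ("original structure records", 119.8),
    ("FICTS parent structures", 83.1),
    ("FICuS parent structures", 81.6),
    ("uuuuu parent structures", 76.2),
]


def reductions(levels):
    """Percentage reduction of each level relative to the level before it."""
    result = []
    for (_, previous), (name, current) in zip(levels, levels[1:]):
        percent = 100.0 * (previous - current) / previous
        result.append((name, round(percent, 1)))
    return result


for name, percent in reductions(LEVELS):
    print(f"{name}: {percent}% fewer entries than the previous level")
```

By this calculation the step from original records to FICTS parent structures removes about 30.6% of the entries, the step to FICuS a further 1.8%, and the step to uuuuu a further 6.6%.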

Page 24: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Structure Normalization

[Diagram as on the previous slide (119.8 million original structure records in CSDB, 83.1 million FICTS, 81.6 million FICuS, and 76.2 million uuuuu parent structures), with the additional label “tautomer-invariant”]

Page 25: ICCS9 2011 Talk

Tautomer Analysis

How much “chemical space” is “just generated” by drawing tautomers?

Page 26: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

• CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
• rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
• rule set is systematically applied to the original structure (and all tautomers that have been generated in previous steps)
• tautomer generation is limited to 1000 SMIRKS transform operations/structure
• all tautomers are ranked by a scoring function
• the highest ranked tautomer is defined as the canonical tautomer
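The enumeration loop described in these bullets can be sketched generically. This is not CACTVS code: structures are plain strings, the two transforms are hypothetical toy stand-ins for SMIRKS rules, counting one operation per (structure, rule) application is an assumption, and the scoring function is supplied by the caller.

```python
from collections import deque


def enumerate_tautomers(start, transforms, max_operations=1000):
    """Apply every transform to the start structure and to all products.

    Returns (set of structures, exhaustive flag). The flag is False when
    the operation budget ran out before all structures were processed.
    """
    seen = {start}
    queue = deque([start])
    operations = 0
    while queue:
        structure = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                return seen, False
            operations += 1
            for product in transform(structure):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


def canonical_tautomer(tautomers, score):
    """Pick the highest-scoring tautomer; ties go to the smallest string."""
    return max(sorted(tautomers), key=score)


# Hypothetical toy transforms on strings, standing in for real SMIRKS rules.
def keto_to_enol(structure):
    return [structure.replace("keto", "enol", 1)] if "keto" in structure else []


def enol_to_keto(structure):
    return [structure.replace("enol", "keto", 1)] if "enol" in structure else []
```

With the toy rules, `enumerate_tautomers("keto-keto", [keto_to_enol, enol_to_keto])` visits four forms; a budget that is too small returns a partial set together with `False`, which mirrors the "not exhaustive" case reported later in the talk.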

Page 27: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

• 21 SMIRKS transform rules:

rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: keten/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxim/nitroso
rule 17: oxim/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids

Page 28: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

rule 1: 1.3 (thio)keto/(thio)enol

[O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][#6;z{1-2}:2]=[C,cz{0-1}R{0-1}:3]

[Structure drawing: 1.3 keto/enol example with atoms numbered 1 to 4 as in the SMIRKS]

rule 6: 1.3 heteroatom H shift

[N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]

[Structure drawing: 1.3 heteroatom H shift example with atoms numbered 1 to 4 as in the SMIRKS]

Page 29: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

[Diagram: 72.0 million FICTS parent structures are mapped onto 70.6 million FICuS parent structures]

FICTS parent structures:
• 8.6% change tautomeric form during FICuS normalization

FICuS parent structures:
• 98.5% have a one-to-one relationship to a single FICTS parent structure
• 1.5% have a one-to-many relationship to several FICTS parent structures (“conflict”)

structure counts are on the basis of the 2009 version of CSDB (103.9 million structure records)
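A “conflict” in the sense of this slide is a FICuS parent structure that is reached from more than one FICTS parent structure. Given a list of (FICTS, FICuS) pairs, such conflicts can be found by grouping. The sketch below uses hypothetical names and made-up placeholder identifiers; it is not the procedure used to build CSDB.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Group FICTS identifiers by FICuS identifier and keep the one-to-many cases.

    pairs: iterable of (ficts_id, ficus_id) tuples.
    Returns (conflicts, percent), where conflicts maps each conflicting
    FICuS identifier to its set of FICTS identifiers and percent is the
    share of FICuS identifiers that are in conflict.
    """
    ficts_by_ficus = defaultdict(set)
    for ficts_id, ficus_id in pairs:
        ficts_by_ficus[ficus_id].add(ficts_id)
    conflicts = {
        ficus_id: ficts_ids
        for ficus_id, ficts_ids in ficts_by_ficus.items()
        if len(ficts_ids) > 1
    }
    percent = 100.0 * len(conflicts) / len(ficts_by_ficus) if ficts_by_ficus else 0.0
    return conflicts, percent


# Made-up placeholder identifiers, for illustration only.
EXAMPLE_PAIRS = [
    ("ficts-A1", "ficus-A"),   # two drawn tautomers share one FICuS parent
    ("ficts-A2", "ficus-A"),
    ("ficts-B1", "ficus-B"),
    ("ficts-C1", "ficus-C"),
    ("ficts-D1", "ficus-D"),
]
```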

Page 30: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

[Histogram: tautomeric overlap within each individual database release (%), x-axis from 0.0 to 2.0; frequency (number of database releases), y-axis from 0 to 90]

average: ~0.3% of original structure records

Page 31: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

[Histogram: tautomeric overlap within each individual database release (%), x-axis from 0.0 to 2.0; frequency (number of database releases), y-axis from 0 to 90; annotated with groups of database names]

average: ~0.3% of original structure records

database groups annotated in the histogram:
• Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs
• Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat
• NCI/DTP, PASS Training Set, SGC-Ox
• ChemDB, ZINC
• ChEBI, ChemSpider

Page 32: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

[Histogram: occurrence of “tautomerism-critical” molecules within each individual database release (%), x-axis from 0.5 to 24.5; frequency (number of database releases), y-axis from 0 to 30]

“tautomerism-critical”: percentage of FICuS parent structures in each database release occurring somewhere in CSDB with a conflict

average: ~9.5% of FICuS parent structures

Page 33: ICCS9 2011 Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

[Structure drawing: HPMBP]

• HPMBP is used in liquid membranes (selective removal of metal ions)
• selectivity and efficiency depend on the tautomeric form of HPMBP
• the tautomeric form depends on solvent and concentration of HPMBP

He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54 (10), 2944-2947.

Page 34: ICCS9 2011 Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

• CACTVS generates 7 tautomers
• one of them is the canonical tautomer by CACTVS
• 5 have potential stereo center on atoms or bonds

[Structure drawings of the 7 tautomers; stereo labels shown: R/S on three drawings, E/Z on two drawings]

Page 35: ICCS9 2011 Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

• 3 have CAS Registry Numbers assigned

[Structure drawings of the 7 tautomers; annotations listed in the order they appear on the slide]

CAS Registry Numbers: 4551-69-1, 33064-14-1, 127117-31-1
reference counts: 859 references, 49 references, 3 references
stereo labels on the registered forms: (no stereo), (Z)

Page 36: ICCS9 2011 Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

occurrences in databases indexed in CSDB

[Structure drawings of the 7 tautomers with their R/S and E/Z labels; occurrence annotations listed in the order they appear on the slide]

• 6 databases
• 16 databases (no stereo), 3 databases (R), 2 databases (S)
• 12 databases
• 1 database (no stereo)

Page 37: ICCS9 2011 Talk

Example for a Tautomer “Conflict”

HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)

occurrences in databases

[Structure drawings of the 7 tautomers with their R/S and E/Z labels; the database lists below are matched to the occurrence counts by their length]

• 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
• 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
• 3 databases (R): ChemSpider, ECOTOX, ZINC
• 2 databases (S): ChemSpider, ZINC
• 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
• 1 database (no stereo): ChemDB

Page 38: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

starting from the set of 70.6 million FICuS parent structures we systematically generated all tautomers based on the 21 SMIRKS rule set available in CACTVS

• how many tautomers are generated?
• how often is each rule applied (type of tautomerism)?
• how many tautomers per structure?

generated 680 million tautomers

for 1.7% of the FICuS parent structures the enumeration was not exhaustive

Page 39: ICCS9 2011 Talk

Tautomer Analysis
NCI/CADD Chemical Structure Database

• usage of SMIRKS rules (1/2):

tautomer rule | generated tautomers (count) | %
rule 1: 1.3 (thio)keto/(thio)enol | 173,002,712 | 25.4
rule 2: 1.5 (thio)keto/(thio)enol | 11,541,452 | 1.7
rule 3: simple (aliphatic) imine | 35,917,415 | 5.3
rule 4: special imine | 4,306,155 | 0.6
rule 5: 1.3 aromatic heteroatom H shift | 25,678,446 | 3.8
rule 6: 1.3 heteroatom H shift | 250,453,882 | 36.8
rule 7: 1.5 (aromatic) heteroatom H shift (1) | 27,542,770 | 4.0
rule 8: 1.5 aromatic heteroatom H shift (2) | 26,819 | <0.1
rule 9: 1.7 (aromatic) heteroatom H shift | 57,242,472 | 8.4
rule 10: 1.9 (aromatic) heteroatom H shift | 5,061,731 | 0.7
rule 11: 1.11 (aromatic) heteroatom H shift | 1,374,235 | 0.2
rule 12: furanones | 17,860,604 | 2.6

Page 40: ICCS9 2011 Talk

Tautomer Analysis
NCI/CADD Chemical Structure Database

• usage of SMIRKS rules (2/2):

tautomer rule | generated tautomers (count) | %
rule 13: keten/ynol exchange | 57,989 | <0.1
rule 14: ionic nitro/aci-nitro | 428,266 | <0.1
rule 15: pentavalent nitro/aci-nitro | 129 | <0.1
rule 16: oxim/nitroso | 505,695 | <0.1
rule 17: oxim/nitroso via phenol | 131,502 | <0.1
rule 18: cyanic/iso-cyanic acids | 181 | <0.1
rule 19: formamidinesulfinic acids | 1,392 | <0.1
rule 20: isocyanides | 229 | <0.1
rule 21: phosphonic acids | 54,926 | <0.1

Page 41: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

• number of tautomers per structure:

FICuS structures with | count | %
no tautomers | 9,756,186 | 13.8
one tautomer | 10,721,845 | 15.2
2-10 tautomers | 33,532,284 | 47.5
11-25 tautomers | 10,870,312 | 15.4
25-50 tautomers | 2,622,587 | 3.7
51-100 tautomers | 1,136,066 | 1.6
101-200 tautomers | 565,199 | 0.8
201-300 tautomers | 104,875 | 0.1
301-400 tautomers | 35,144 | <0.1
401-500 tautomers | 17,241 | <0.1
501-600 tautomers | 4,323 | <0.1
601-700 tautomers | 1,400 | <0.1
701-800 tautomers | 362 | <0.1
801–832 tautomers | 3 | <0.1

Page 42: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Tautomer Analysis

• number of tautomers per structure:

FICuS structures with | count | %
no tautomers | 9,756,186 | 13.8
one tautomer | 10,721,845 | 15.2
2-10 tautomers | 33,532,284 | 47.5
11-25 tautomers | 10,870,312 | 15.4
25-50 tautomers | 2,622,587 | 3.7
51-100 tautomers | 1,136,066 | 1.6
101-200 tautomers | 565,199 | 0.8
201-300 tautomers | 104,875 | 0.1
301-400 tautomers | 35,144 | <0.1
401-500 tautomers | 17,241 | <0.1
501-600 tautomers | 4,323 | <0.1
601-700 tautomers | 1,400 | <0.1
701-800 tautomers | 362 | <0.1
801–832 tautomers | 3 | <0.1

[Structure drawings: a tautomer pair (NH, O versus N, OH)]

many minor tautomeric forms (but you find them in databases)

Page 43: ICCS9 2011 Talk

Tautomer Analysis
Tanimoto Similarities of Tautomers

• canonical tautomer vs. generated tautomers (680 million tautomer set)
• PubChem/CACTVS E_SCREEN bitvector (881 bits)

Tanimoto index range | count | %
>0.0-0.2 | 0 | 0.0
>0.2-0.3 | 6 | <0.1
>0.3-0.4 | 6,580 | <0.1
>0.4-0.5 | 369,331 | <0.1
>0.5-0.6 | 6,304,436 | 0.9
>0.6-0.7 | 36,448,651 | 5.3
>0.7-0.8 | 111,954,384 | 16.4
>0.8-0.9 | 214,747,976 | 31.5
>0.9-1.0 | 310,725,465 | 45.6

~23% below 0.8 Tanimoto similarity (although the same molecule)
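For reference, the Tanimoto index of two fingerprint bitvectors is the number of bits set in both divided by the number of bits set in either. A minimal sketch, with bitvectors held as Python integers (wide enough for an 881-bit fingerprint); the function name is hypothetical, and returning 1.0 for two empty vectors is a convention chosen here, not something taken from the talk:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two bitvectors stored as non-negative integers."""
    common = bin(bits_a & bits_b).count("1")   # bits set in both vectors
    total = bin(bits_a | bits_b).count("1")    # bits set in at least one
    # Convention chosen for this sketch: two empty vectors count as identical.
    return common / total if total else 1.0
```

Two fingerprints that share three of four set bits, for instance `tanimoto(0b1111, 0b0111)`, give 0.75, which would fall into the >0.7-0.8 range of the table.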

Page 44: ICCS9 2011 Talk

Scaffold Analysis

Page 45: ICCS9 2011 Talk

Scaffold Analysis
NCI/CADD Chemical Structure Database

• molecular scaffold tree: Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
• archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
• simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893

[Structure drawings: an example compound and the scaffolds derived from it, including scaffold tree level 2 and level 1]

Page 46: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Scaffold Analysis

[Diagram: the 76.2 million uuuuu compound set of CSDB is reduced to three kinds of scaffolds (molecular scaffold tree, archetype scaffold, simple scaffold), illustrated with the example scaffolds at level 2 and level 1]

scaffold counts shown on the slide: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds

Page 47: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Scaffold Analysis

molecular scaffold tree: number of unique scaffolds per hierarchy level

76.2 million uuuuu compound set (CSDB), 8.1 million scaffolds

[Chart: x-axis Hierarchy Level (1 to 10); left y-axis Number of Unique Scaffolds (in millions, 0 to 8.0); right y-axis Number of unique structures (in million, 0 to 80.0); example scaffolds shown for level 2 and level 1]

Page 48: ICCS9 2011 Talk

NCI/CADD Chemical Structure Database
Scaffold Analysis

[Structure drawings: the example compound and its scaffolds, two of them drawn with substituent positions R1 to R10 and annotated with per-position counts]

occurrence of the three scaffolds in CSDB, in the order shown on the slide:
• 2,281 uuuuu parent structures; 5334 structure records in 64 databases
• 2,726 uuuuu parent structures; 6007 structure records in 66 databases
• 744,469 uuuuu parent structures; 1,069,046 structure records in 66 databases

Page 49: ICCS9 2011 Talk

Atom Neighborhoods

Page 50: ICCS9 2011 Talk

Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database

Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.

[Structure drawing: example compound]

MNA level 1:
HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC

MNA level 2:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))

Page 51: ICCS9 2011 Talk

Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database

Unique MNAs in the 76.2 million uuuuu compound set (CSDB):

level | unique MNAs | relationships | per uuuuu parent structure
level 1 | 13,426 | 1.3 billion | ~17
level 2 | 918,516 | 2.3 billion | ~30

Page 52: ICCS9 2011 Talk

Multilevel Neighborhoods of Atoms (MNA)
NCI/CADD Chemical Structure Database

Unique MNAs in the 76.2 million uuuuu compound set (CSDB):

level | unique MNAs | relationships | per uuuuu parent structure
level 1 | 13,426 | 1.3 billion | ~17
level 2 | 918,516 | 2.3 billion | ~30

424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider

Page 53: ICCS9 2011 Talk

Chemical Structure Web Services
Indexing Chemical Space

[Architecture diagram with the following components, connected via http: Chemical Identifier Resolver; NCI/CADD web services; NCI/CADD Chemical Structure Database (CSDB); CACTVS; external web services; other software packages (e.g. OPSIN)]

Page 54: ICCS9 2011 Talk

Chemical Identifier Resolver
NCI/CADD Web Resources

http://cactus.nci.nih.gov/chemical/structure

http://cactus.nci.nih.gov/blog

Page 55: ICCS9 2011 Talk

Acknowledgments

• ChemNavigator: Scott Hutton, Tad Hurst
• CADD Group, CBL, NCI: Igor Filippov
• University of Cambridge: Daniel Lowe, Peter Murray-Rust
• Noel O'Boyle (University College Cork, Ireland)
• Richard Apodaca (Metamolecular)
• Hans-Juergen Himmler

Thanks to all database providers!

Our web site: http://cactus.nci.nih.gov

Page 56: ICCS9 2011 Talk

Acknowledgments - Software

• CACTVS
• Python Web Framework
• ChemWriter
• Python SQL Library
• Javascript library
• Peter Ertl (Novartis)

Page 57: ICCS9 2011 Talk
