Algorithms for Automatic Tautomer Generation and Their Applications


Upload: nina-jeliazkova

Post on 26-Jan-2015


DESCRIPTION

OpenTox Euro 2013 poster (http://www.opentox.org/meet/opentoxeu2013/opentoxeu2013posters/): 6. Ambit-TAUTOMER – a Software Tool for Automatic Tautomer Generation, Nina Jeliazkova (Ideaconsult Ltd)

Ambit-Tautomer [1] is an open source Java library for automatic generation of all tautomers of a given chemical compound. It is implemented on top of the Chemistry Development Kit (CDK) [2]. The system includes three main algorithms: a pure combinatorial method, an improved combinatorial method and an incremental algorithm. The tautomer generator uses a set of predefined but customizable rules. The rules are defined in Daylight SMILES/SMARTS line notation and support the basic types of tautomerism (1-3, 1-5 and 1-7 proton shifts).

The pure combinatorial method generates all tautomeric forms by considering all possible combinations of the matched rule states. The improved combinatorial method uses sub-combinations based on rule clustering. The incremental algorithm applies depth-first search to handle sophisticated cases of overlapping rules. Additionally, rule pre-filtering and tautomer post-filtering are applied to fine-tune the generation process. The tautomer generator implements tautomer ranking based on empirical rules defined in terms of relative energy differences.

The Ambit-Tautomer library is applied to improve the Ambit database storage of chemical structures and, accordingly, to implement search procedures that take tautomerism information into account. The tautomer sets are also used to calculate modified values of the original molecular descriptors in order to improve existing QSAR/QSPR models.

The Ambit-Tautomer module is implemented as an open source Java package, as part of the Ambit open source software for chemoinformatics and data management [3,4], and is available as a Java library, a command line application [5] and an OpenTox Algorithm API compatible web service [6]. The Ambit package is available as online web services and as a downloadable application. A web page providing online tautomer generation by Ambit-Tautomer and several other software packages is available at http://apps.ideaconsult.net:8080/ambit2/depict/tautomer.

References
[1] Kochev, N. T., Paskaleva, V. H. and Jeliazkova, N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf., 32: 481–504, 2013
[2] Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E. and Willighagen, E., The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci., 43: 493–500, 2003
[3] Jeliazkova, N. and Jeliazkov, V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface. Journal of Cheminformatics, 3:18, 2011, doi:10.1186/1758-2946-3-18
[4] http://ambit.sourceforge.net
[5] https://github.com/ideaconsult/examples-ambit/tree/master/tautomers-example
[6] http://apps.ideaconsult.net:8080/ambit2/algorithm/tautomers

TRANSCRIPT

Page 1: ALGORITHMS FOR AUTOMATIC TAUTOMER GENERATION AND THEIR APPLICATIONS

References
[1] Kochev, N. T., Paskaleva, V. H. and Jeliazkova, N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf., 32: 481–504, 2013
[2] AMBIT project, http://ambit.sourceforge.net
[3] Steinbeck C., Hoppe C., Kuhn S., Guha R., Willighagen E.L., "Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics". Curr. Pharm. Des. 2006; 12(17):2111-2120 (DOI: 10.2174/138161206777585274)
[4] Jeliazkova N., Jeliazkov V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface, Journal of Cheminformatics 2011, 3:18, doi:10.1186/1758-2946-3-18

Ambit-Tautomer Basic Features

Tautomer generation algorithms
• Pure combinatorial algorithm
• Incremental approach (based on a depth-first search algorithm) for rule combination, with local rule corrections and refinement on the way

Customizable set of rules
• Basic set of 1-3 and 1-5 proton shift rules
• Additional rules: 1-7 proton shifts, chlorine atom shifts
• Rule description based on SMARTS

Ambit-Tautomer [1] is part of the Ambit2 software package [2], which is distributed under the LGPL license and uses the Chemistry Development Kit (CDK) library [3] for basic chemoinformatics functionality. Ambit-Tautomer utilizes a depth-first search algorithm combined with a set of rules for tautomeric transformations. The Ambit implementation of the OpenTox web services for predictive toxicology [4] is being extended to include the tautomer generation algorithm. A web page providing online tautomer generation by several different algorithms, including Ambit-Tautomer, is available at http://apps.ideaconsult.net:8080/ambit2/depict/tautomer.

ALGORITHMS FOR AUTOMATIC TAUTOMER GENERATION AND THEIR APPLICATIONS

Nikolay T. Kochev (1), Vesselina H. Paskaleva (1), Nina Jeliazkova (2)

(1) University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry; (2) Ideaconsult Ltd, 4 A. Kanchev str., Sofia 1000, Bulgaria

Similarity search results for methimazole (one column per query tautomer; the structure drawings of the query tautomers and of the retrieved compounds are not recoverable from the transcript)

Rank   Similarity (column 1)   Similarity (column 2)   Similarity (column 3)
1.     0.62                    0.71                    0.47
2.     0.6                     0.71                    0.45
3.     0.59                    0.64                    0.44
4.     0.58                    0.57                    0.44
5.     0.54                    0.57                    0.43

Software characteristics
• Structure representation, input, output and information processing based on CDK (CDK.sf.net)
• Supports standard chemical formats: SMILES, InChI, MOL/SDF file, CML
• Exhaustive tautomer generation
• Customizable set of rules and post-generation filters
• Set of predefined rules
• Tautomer ranking based on simple empirical rules

The structural information was processed according to the presented flow chart. We studied the influence of tautomer information on various processing stages: descriptor calculation (Table 3), similarity searching (Table 1) and QSAR/QSPR modeling of Ames mutagenicity and LogP (Fig. 2 and Table 2).

Table 1. Similarity search results for the three tautomers of methimazole. Each column contains the five structures most similar to the respective tautomer. The similarity search was performed in a database of 553477 compounds (a subset of the PubChem database).
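A similarity search of the kind summarised in Table 1 can be sketched as follows. The poster does not state which fingerprint or similarity coefficient was used, so this sketch assumes the Tanimoto coefficient on bit-vector fingerprints stored as sets of on-bit indices. The compound names and fingerprints are made up for illustration; they are not taken from the PubChem subset.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def top_k(query, database, k=5):
    """Return the k (similarity, name) pairs with the highest similarity."""
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical database and two hypothetical tautomer fingerprints.
database = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {7, 8, 9},
}
query_tautomer_1 = {1, 2, 3}
query_tautomer_2 = {2, 3, 5, 9}

hits_1 = top_k(query_tautomer_1, database, k=2)
hits_2 = top_k(query_tautomer_2, database, k=2)
print(hits_1)
print(hits_2)
```

With these toy fingerprints the two tautomer queries rank the database differently, which is the effect Table 1 illustrates for methimazole.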

No.   Violuric acid tautomer (SMILES notation)   Ames mutagenicity (model)   XLogP
1     O=C1NC(=O)C(=NO)C(=O)N1                    1                            0.135
2     O=C1N=C(O)N=C(O)C1(=NO)                    0                           -0.086
3     O=C1N=C(O)C(=NO)C(O)=N1                    0                            0.267
4     O=C1N=C(O)C(=NO)C(=O)N1                    1                            0.041
5     O=C1N=C(O)NC(=O)C1(=NO)                    1                            0.361
6     O=NC1=C(O)N=C(O)N=C1(O)                    1                           -0.102
7     O=NC=1C(=O)NC(O)=NC=1(O)                   1                           -0.084
8     O=NC=1C(=O)N=C(O)NC=1(O)                   1                            1.230
9     O=NC=1C(O)=NC(=O)NC=1(O)                   1                            0.698
10    O=NC=1C(=O)NC(=O)NC=1(O)                   1                            0.363
11    O=NC1C(O)=NC(=O)N=C1(O)                    1                           -0.277
12    O=NC1C(=O)N=C(O)N=C1(O)                    1                           -1.056
13    O=NC1C(=O)NC(=O)N=C1(O)                    1                           -0.932
14    O=NC1C(=O)N=C(O)NC1(=O)                    1                           -1.038
15    O=NC1C(=O)NC(=O)NC1(=O)                    1                           -1.267

Table 2. The values of the Ames mutagenicity model and the XLogP model for all tautomers of violuric acid.

[Figure 2 chart. Vertical axis: mean error (ticks from 0.70 to 2.10). Horizontal axis: number of tautomers per structure, in bins 2-10, 11-30, 31-50, 52-100, 102-192, 204-292, 302-1318. Series: XLogP (no tautomers) and XLogP (all tautomers).]

Structure       RSD threshold   Number of PaDEL descriptors with RSD > RSD threshold
methimazole     0.1             180
methimazole     0.3             124
methimazole     0.5             99
methimazole     1.0             71
violuric acid   0.1             217
violuric acid   0.3             151
violuric acid   0.5             108
violuric acid   1.0             80
pemoline        0.1             239
pemoline        0.3             168
pemoline        0.5             138
pemoline        1.0             113

Table 3. The number of descriptors (out of a total of 863) whose relative standard deviation (RSD) across the tautomers of a structure exceeds each of the thresholds 0.1, 0.3, 0.5 and 1.0.
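The counts in Table 3 correspond to a simple calculation: for each descriptor, take its values over all tautomers of a structure, compute the RSD, and count how many descriptors exceed each threshold. The sketch below assumes RSD = sample standard deviation divided by the absolute mean; the poster does not say whether a sample or population standard deviation was used. The descriptor names and values are hypothetical.

```python
from statistics import mean, stdev

def rsd(values):
    """Relative standard deviation: sample std divided by |mean| (assumed definition)."""
    m = mean(values)
    s = stdev(values)
    if m == 0:
        # Undefined ratio: treat as infinite spread unless all values are equal.
        return float("inf") if s > 0 else 0.0
    return s / abs(m)

def count_above(descriptor_table, thresholds=(0.1, 0.3, 0.5, 1.0)):
    """Count the descriptors whose RSD across tautomers exceeds each threshold."""
    rsds = [rsd(values) for values in descriptor_table.values()]
    return {t: sum(r > t for r in rsds) for t in thresholds}

# Hypothetical descriptor values, one list entry per tautomer.
descriptor_table = {
    "desc_a": [10.0, 10.0, 10.0],   # unaffected by tautomerism
    "desc_b": [1.0, 2.0, 4.0],      # moderately affected
    "desc_c": [0.0, 0.0, 6.0],      # strongly affected
}
counts = count_above(descriptor_table)
print(counts)
```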

Figure 2. The mean absolute errors of the XLogP model compared with the errors obtained from the averaged model values calculated over all tautomers of each test structure. The statistics are calculated for 8327 test structures.
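The comparison in Figure 2 rests on two quantities: a model value averaged over all tautomers of a structure, and the mean absolute error against reference values. A minimal sketch of both is given below. The violuric acid XLogP values are those listed in Table 2; the two-structure test set, including its "observed" values, is invented for illustration and is not data from the poster.

```python
from statistics import mean

# XLogP model values for the 15 violuric acid tautomers (Table 2).
violuric_xlogp = [0.135, -0.086, 0.267, 0.041, 0.361, -0.102, -0.084, 1.230,
                  0.698, 0.363, -0.277, -1.056, -0.932, -1.038, -1.267]
violuric_averaged = mean(violuric_xlogp)

def mean_absolute_error(predicted, observed):
    """Mean of |predicted - observed| over paired values."""
    return mean([abs(p - o) for p, o in zip(predicted, observed)])

# Hypothetical test set: observed value, prediction for the input form only,
# and predictions for every generated tautomer.
test_set = [
    {"observed": 2.0, "input_form": 1.0, "tautomers": [1.0, 2.0, 3.0]},
    {"observed": 0.5, "input_form": 1.5, "tautomers": [1.5, 0.5]},
]
observed = [s["observed"] for s in test_set]
mae_no_tautomers = mean_absolute_error([s["input_form"] for s in test_set], observed)
mae_all_tautomers = mean_absolute_error([mean(s["tautomers"]) for s in test_set], observed)

print(round(violuric_averaged, 3), mae_no_tautomers, mae_all_tautomers)
```

The toy numbers were chosen so that averaging lowers the error; whether it does so on real data depends on the model and the structures, as Figure 2 reports per bin.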

Figure 1. AMBIT2 Tautomer generation test page

QSAR/QSPR Cheminfo Processing Flow Chart

[Flow chart; structure drawings omitted. Recoverable content:]
• Structure input: C1=CN(C(N1)=S)C (methimazole); accepted formats: SMILES, InChI, *.mol, CML
• CDK representation: connection table (CDK container)
• Generate 2D, generate 3D, generate tautomers (three methimazole tautomers and their 3D models are shown)
• Calculate 1D, 2D, 3D molecular descriptors: NA = 13, Z = 32, NH = 6, W = 40, MW = 114.03, ATSc1 = 0.14, …
• Calculate fingerprints (bit-vectors): hashed fingerprint (1 0 0 0 1 . . . 1 1 1 0 1 1), key-based fingerprint (0 0 1 0 1 . . . 0 0 1 0 1 0)
• Group counts, additive schemes
• QSPR: models of physicochemical properties: LogP, BP, MP, MR, …
• QSAR: models of biological activities: ADME, toxicity, mutagenicity, biodegradation, …
• Similarity search in a chemical database, giving a list of the most similar structures

Overlapping rules

[Four structure drawings omitted; group labels in the drawings: HO, HO, CH3, NH2 / HO, HO, CH2, NH2 / O, HO, CH3, NH2 / HO, HO, CH3, NH]

- simple combinations do not work
- rule conflicts are possible
- some tautomers might be omitted
- more sophisticated approach is needed

Tautomer Generation Flow Chart

[Flow chart; structure drawings omitted. Recoverable steps:]
• Structure input: OC(O)=C(N)C (CDK representation), atoms numbered 0 to 5
• Substructure search
• Initial rule list
• Generation of all possible combinations of the rule states, based on depth-first search with refinement of the rule list at each step
• Post-generation filtering: duplicates, topological equivalency, allene atoms, incorrect structures, …
• Ranking
• Result output

Rule lists shown at the nodes of the depth-first search for OC(O)=C(N)C (each rule is given as a pattern followed by the atom indices it matches):
• unused rules: OC=C at 0 1 3; OC=C at 2 1 3; NC=C at 4 3 1
• used rules: N=CC at 4 3 1; unused rules: N=CC at 4 3 5
• used rules: NC=C at 4 3 1; unused rules: OC=C at 0 1 3; OC=C at 2 1 3
• used rules: N=CC at 4 3 1; N=CC at 4 3 5
• used rules: N=CC at 4 3 1; NC=C at 4 3 5
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; unused rules: OC=C at 0 1 3
• used rules: NC=C at 4 3 1; O=CC at 2 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; O=CC at 0 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; OC=C at 0 1 3
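The depth-first search with rule-list refinement shown in this flow chart can be illustrated with a toy model of the example structure OC(O)=C(N)C. The following is a simplified sketch and not the Ambit-Tautomer implementation: it uses a hand-written atom and bond table instead of CDK, supports only the 1-3 proton shift X(H)-C=C ↔ X=C-C(H) with X = O or N, and omits post-filtering and ranking. All function names are hypothetical.

```python
ELEMENTS = ("O", "C", "O", "C", "N", "C")   # atoms 0..5 of OC(O)=C(N)C
START_H = (1, 0, 1, 0, 2, 3)                # hydrogens attached to each atom
START_BONDS = {(0, 1): 1, (1, 2): 1, (1, 3): 2, (3, 4): 1, (3, 5): 1}

def order(bonds, a, b):
    """Bond order between atoms a and b (0 if not bonded)."""
    return bonds.get((min(a, b), max(a, b)), 0)

def neighbours(bonds, a):
    return [j if i == a else i for (i, j) in bonds if a in (i, j)]

def find_rules(h, bonds):
    """Match 1-3 shift sites (X, Y, Z) in the current structure.

    'fwd' is the state X(H)-Y=Z (OC=C, NC=C); 'rev' is X=Y-Z(H) (O=CC, N=CC).
    """
    matches = []
    for x, element in enumerate(ELEMENTS):
        if element not in ("O", "N"):
            continue
        for y in neighbours(bonds, x):
            if ELEMENTS[y] != "C":
                continue
            for z in neighbours(bonds, y):
                if z == x or ELEMENTS[z] != "C":
                    continue
                xy, yz = order(bonds, x, y), order(bonds, y, z)
                if xy == 1 and yz == 2 and h[x] > 0:
                    matches.append(((x, y, z), "fwd"))
                elif xy == 2 and yz == 1 and h[z] > 0:
                    matches.append(((x, y, z), "rev"))
    return matches

def shift(h, bonds, site, direction):
    """Switch a rule to its other state: move one H and swap the bond orders."""
    x, y, z = site
    donor, acceptor = (x, z) if direction == "fwd" else (z, x)
    new_h = list(h)
    new_h[donor] -= 1
    new_h[acceptor] += 1
    new_bonds = dict(bonds)
    for a, b in ((x, y), (y, z)):
        key = (min(a, b), max(a, b))
        new_bonds[key] = 3 - bonds[key]     # single <-> double
    return tuple(new_h), new_bonds

def dfs(h, bonds, used, found):
    """Branch on the first unused rule; re-match rules after every step."""
    pending = [m for m in find_rules(h, bonds) if m[0] not in used]
    if not pending:                         # no unused rules left: a leaf
        found.add((h, tuple(sorted(bonds.items()))))
        return
    site, direction = pending[0]
    used = used | {site}
    dfs(h, bonds, used, found)                           # keep the current state
    dfs(*shift(h, bonds, site, direction), used, found)  # take the other state

found = set()
dfs(START_H, START_BONDS, frozenset(), found)
print(len(found))
```

Because the rules are matched again at every node, a shift that removes the C=C bond drops the hydroxyl rules that depended on it and can expose a new site (here N=CC at 4 3 5), in the same way as the rule lists above. Results are compared as atom-labelled graphs, so the two keto forms count as distinct; merging them would be the job of the topological equivalency filter.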

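Combinations of non-overlapping rules

[Diagram; structure drawings omitted. Group labels in the drawings: HN/OH, HN/O, H2N/O, H2N/OH. Binary codes shown: 0 1, 1 0, 1 1, 0 0.]
• each tautomer is described as a binary combination
• 1 ↔ 0 and 0 ↔ 1 mark the current rule used to generate two possible states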
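When the matched rules do not overlap, the binary encoding makes the pure combinatorial enumeration straightforward: n rules with two states each give 2^n bit combinations. The sketch below is an illustration only; the rule names, the state labels and the assignment of bit 0 or 1 to a particular state are assumptions, not taken from Ambit-Tautomer.

```python
from itertools import product

# Hypothetical matched rules; each rule has exactly two states (bit 0 and bit 1).
rules = [
    ("amine/imine", ("H2N", "HN")),
    ("hydroxyl/carbonyl", ("OH", "O")),
]

def enumerate_combinations(rules):
    """Yield (bits, labels) for every binary combination of rule states."""
    for bits in product((0, 1), repeat=len(rules)):
        labels = tuple(states[bit] for (_, states), bit in zip(rules, bits))
        yield bits, labels

combinations = list(enumerate_combinations(rules))
for bits, labels in combinations:
    print(bits, labels)
```

This approach is valid only while every rule can be switched independently of the others; overlapping rules need the incremental search instead.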