Algorithms for automatic tautomer generation and their applications
DESCRIPTION
OpenTox Euro 2013 poster, http://www.opentox.org/meet/opentoxeu2013/opentoxeu2013posters/

6. Ambit-TAUTOMER – a Software Tool for Automatic Tautomer Generation, Nina Jeliazkova (Ideaconsult Ltd)

Ambit-Tautomer [1] is an open source Java library for automatic generation of all tautomers of a given chemical compound. It is implemented on top of the Chemistry Development Kit (CDK) [2]. The system includes three main algorithms: a pure combinatorial method, an improved combinatorial method and an incremental algorithm.

The tautomer generator uses a set of predefined but customizable rules. The rules are defined in Daylight SMILES/SMARTS line notation and support the basic types of tautomerism (1-3, 1-5 and 1-7 proton shifts). The pure combinatorial method generates all tautomeric forms by considering all possible combinations of the matched rule states. The improved combinatorial method uses sub-combinations based on rule clustering. The incremental algorithm applies depth-first search to handle complex cases of overlapping rules. In addition, rule pre-filtering and tautomer post-filtering are applied to fine-tune the generation process. The tautomer generator implements tautomer ranking based on empirical rules defined in terms of relative energy differences.

The Ambit-Tautomer library is used to improve the storage of chemical structures in the Ambit database and to implement search procedures that take tautomerism into account. The tautomer sets are also used to calculate modified values of the original molecular descriptors in order to improve existing QSAR/QSPR models.

The Ambit-Tautomer module is implemented as an open source Java package, part of the Ambit open source software for chemoinformatics and data management [3,4]. It is available as a Java library, a command line application [5] and an OpenTox Algorithm API compatible web service [6]. The Ambit package is available as online web services and as a downloadable application.

A web page providing online tautomer generation by Ambit-Tautomer and several other software packages is available at http://apps.ideaconsult.net:8080/ambit2/depict/tautomer.

References
[1] Kochev, N. T., Paskaleva, V. H. and Jeliazkova, N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf., 32: 481–504, 2013
[2] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen, The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics, J. Chem. Inf. Comput. Sci., 43: 493–500, 2003
[3] Jeliazkova N., Jeliazkov V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface, Journal of Cheminformatics 2011, 3:18, doi:10.1186/1758-2946-3-18
[4] http://ambit.sourceforge.net
[5] https://github.com/ideaconsult/examples-ambit/tree/master/tautomers-example
[6] http://apps.ideaconsult.net:8080/ambit2/algorithm/tautomers

TRANSCRIPT
References
[1] Kochev, N. T., Paskaleva, V. H. and Jeliazkova, N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf., 32: 481–504, 2013
[2] AMBIT project, http://ambit.sourceforge.net
[3] Steinbeck C., Hoppe C., Kuhn S., Guha R., Willighagen E.L., "Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics". Curr. Pharm. Des. 2006; 12(17):2111-2120 (DOI: 10.2174/138161206777585274)
[4] Jeliazkova N., Jeliazkov V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface, Journal of Cheminformatics 2011, 3:18, doi:10.1186/1758-2946-3-18
Ambit-Tautomer Basic Features
Tautomer generation algorithms
• Pure combinatorial algorithm
• Incremental approach (based on a depth-first search algorithm) for rule combination, with local rule corrections and refinement on the way

Customizable set of rules
• Basic set of 1-3 and 1-5 proton shift rules
• Additional rules: 1-7 proton shifts, chlorine atom shifts
• Rule description based on SMARTS
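A rule of this kind can be pictured as a pair of SMARTS-like patterns, one per state, matched on the same atoms. The sketch below is a hypothetical Python data structure, not the Ambit-Tautomer Java API: the class name, field names and rule names are assumptions, while the pattern strings are the state labels that appear in the search tree on this poster.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShiftRule:
    """Hypothetical container for one tautomeric rule with two states."""
    name: str                 # illustrative label, not an Ambit identifier
    shift: str                # "1-3", "1-5" or "1-7"
    states: Tuple[str, str]   # the two patterns the rule switches between


# Two 1-3 proton shift rules, using the state labels shown on the poster.
RULES: List[ShiftRule] = [
    ShiftRule("keto-enol", "1-3", ("O=CC", "OC=C")),
    ShiftRule("imine-enamine", "1-3", ("N=CC", "NC=C")),
]


def other_state(rule: ShiftRule, pattern: str) -> str:
    """Return the state a rule switches to from the given state."""
    first, second = rule.states
    if pattern == first:
        return second
    if pattern == second:
        return first
    raise ValueError(f"{pattern!r} is not a state of rule {rule.name!r}")
```

Keeping both states in one object makes it easy to customize the rule set: adding a 1-5 or 1-7 rule would mean appending another entry with longer patterns.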
Ambit-Tautomer [1] is part of the Ambit2 software package [2], distributed under the LGPL license, and uses the Chemistry Development Kit (CDK) library [3] for basic chemoinformatics functionality. Ambit-Tautomer utilizes a depth-first search algorithm, combined with a set of rules for tautomeric transformations. The Ambit implementation of the OpenTox web services for predictive toxicology [4] is being extended to include the tautomer generation algorithm. A web page providing online tautomer generation by several different algorithms, including Ambit-Tautomer, is available at: http://apps.ideaconsult.net:8080/ambit2/depict/tautomer.
ALGORITHMS FOR AUTOMATIC TAUTOMER GENERATION AND THEIR APPLICATIONS
Nikolay T. Kochev (1), Vesselina H. Paskaleva (1), Nina Jeliazkova (2)
(1) University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry; (2) Ideaconsult Ltd, 4 A. Kanchev str., Sofia 1000, Bulgaria
Similarity values of the five most similar structures for each methimazole tautomer (one column per tautomer, in poster order; structure drawings not reproduced):

Rank   Tautomer 1   Tautomer 2   Tautomer 3
1      0.62         0.71         0.47
2      0.6          0.71         0.45
3      0.59         0.64         0.44
4      0.58         0.57         0.44
5      0.54         0.57         0.43
Software characteristics
• CDK.sf.net based structure representation, input, output and info processing
• Supports standard chemical formats: SMILES, InChI, MOL/SDF file, CML
• Exhaustive tautomer generation
• Customizable set of rules and post-generation filters
• Set of predefined rules
• Tautomer ranking based on simple empirical rules
The structural information was processed according to the flow chart presented on this poster. We studied the influence of tautomer information on several processing stages: descriptor calculation (Table 3), similarity searching (Table 1), and QSAR/QSPR modeling of Ames mutagenicity and LogP (Fig. 2 and Table 2).
Table 1. The similarity search results for the three tautomers of methimazole. Each column contains the five structures most similar to the tautomer. The similarity search was performed in a database of 553477 compounds (a subset of the PubChem database).
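A fingerprint-based similarity search of this kind can be sketched as follows. The poster does not state which similarity measure was used; the Tanimoto coefficient is assumed here because it is a common choice for bit-vector fingerprints. The fingerprints and compound names are hypothetical toy data, not PubChem records.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Shared bits divided by bits set in either fingerprint (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def most_similar(query: Fingerprint,
                 database: Dict[str, Fingerprint],
                 k: int = 5) -> List[Tuple[float, str]]:
    """Return the k database entries with the highest similarity to the query."""
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best first, ties by name
    return scored[:k]


# Hypothetical toy data: one query tautomer and three database compounds.
QUERY: Fingerprint = frozenset({1, 2, 3, 4})
DATABASE: Dict[str, Fingerprint] = {
    "A": frozenset({1, 2, 3}),
    "B": frozenset({3, 4, 5, 6}),
    "C": frozenset({7, 8}),
}
HITS = most_similar(QUERY, DATABASE, k=2)
```

Running the same search once per tautomer, each with its own fingerprint, gives one hit list per tautomer, which is how the three columns of Table 1 can differ.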
Violuric acid tautomers /SMILES notations/   Ames Mutagenicity (model)   XLogP
O=C1NC(=O)C(=NO)C(=O)N1                      1                            0.135
O=C1N=C(O)N=C(O)C1(=NO)                      0                           -0.086
O=C1N=C(O)C(=NO)C(O)=N1                      0                            0.267
O=C1N=C(O)C(=NO)C(=O)N1                      1                            0.041
O=C1N=C(O)NC(=O)C1(=NO)                      1                            0.361
O=NC1=C(O)N=C(O)N=C1(O)                      1                           -0.102
O=NC=1C(=O)NC(O)=NC=1(O)                     1                           -0.084
O=NC=1C(=O)N=C(O)NC=1(O)                     1                            1.230
O=NC=1C(O)=NC(=O)NC=1(O)                     1                            0.698
O=NC=1C(=O)NC(=O)NC=1(O)                     1                            0.363
O=NC1C(O)=NC(=O)N=C1(O)                      1                           -0.277
O=NC1C(=O)N=C(O)N=C1(O)                      1                           -1.056
O=NC1C(=O)NC(=O)N=C1(O)                      1                           -0.932
O=NC1C(=O)N=C(O)NC1(=O)                      1                           -1.038
O=NC1C(=O)NC(=O)NC1(=O)                      1                           -1.267
Table 2. The values of the Ames mutagenicity model and the XLogP model for all tautomers of violuric acid.
[Figure 2 plot. Y-axis: mean error (scale 0.70 to 2.10). X-axis: number of tautomers per structure, in bins 2-10, 11-30, 31-50, 52-100, 102-192, 204-292, 302-1318. Series: XLogP (no tautomers), XLogP (all tautomers).]
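Structure       RSD threshold   Number of PaDEL descriptors that have RSD > RSD threshold
methimazole     0.1             180
methimazole     0.3             124
methimazole     0.5             99
methimazole     1.0             71
violuric acid   0.1             217
violuric acid   0.3             151
violuric acid   0.5             108
violuric acid   1.0             80
pemoline        0.1             239
pemoline        0.3             168
pemoline        0.5             138
pemoline        1.0             113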
Table 3. The number of descriptors (out of total 863) which exhibit relative standard deviation (RSD due to the tautomerism) larger than particular thresholds: 0.1, 0.3, 0.5, 1.0
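The counting behind Table 3 can be sketched as follows. The poster does not give the exact RSD formula, so the population standard deviation divided by the absolute mean is assumed here. The descriptor names and values are hypothetical toy data, not PaDEL output.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rsd(values: List[float]) -> float:
    """Relative standard deviation of one descriptor over all tautomers.

    Assumption: population standard deviation divided by |mean|.
    """
    spread = pstdev(values)
    centre = mean(values)
    if centre == 0:
        # RSD is undefined at zero mean; treat any spread as unbounded.
        return float("inf") if spread > 0 else 0.0
    return spread / abs(centre)


def count_above(descriptors: Dict[str, List[float]], threshold: float) -> int:
    """Number of descriptors whose RSD across tautomers exceeds the threshold."""
    return sum(1 for values in descriptors.values() if rsd(values) > threshold)


# Hypothetical values: three descriptors, each computed for three tautomers.
TOY_DESCRIPTORS: Dict[str, List[float]] = {
    "d1": [1.0, 1.0, 1.0],   # insensitive to tautomerism
    "d2": [1.0, 2.0, 3.0],
    "d3": [0.5, 1.5, 4.0],
}
COUNTS = {t: count_above(TOY_DESCRIPTORS, t) for t in (0.1, 0.3, 0.5, 1.0)}
```

As in Table 3, the count can only stay the same or decrease as the threshold grows, since every descriptor that exceeds a higher threshold also exceeds a lower one.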
Figure 2. The mean absolute errors of the XLogP model compared with the errors obtained from the averaged model values calculated over all tautomers of each test structure. The statistics are calculated for 8327 test structures.
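The comparison in Figure 2 rests on two small computations: averaging a model value over all tautomers of a structure, and taking the mean absolute error over the test set. A minimal sketch is given below. It uses the XLogP values listed for violuric acid in Table 2; the function names are assumptions, and no experimental LogP values are included because the poster does not list them.

```python
from statistics import mean
from typing import List

# XLogP model values for the 15 violuric acid tautomers (Table 2).
XLOGP_VIOLURIC: List[float] = [
    0.135, -0.086, 0.267, 0.041, 0.361,
    -0.102, -0.084, 1.230, 0.698, 0.363,
    -0.277, -1.056, -0.932, -1.038, -1.267,
]


def tautomer_averaged(values: List[float]) -> float:
    """Average of the model values over all tautomers of one structure."""
    return mean(values)


def mean_absolute_error(predicted: List[float], observed: List[float]) -> float:
    """Mean absolute difference between predicted and observed values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return mean(abs(p - o) for p, o in zip(predicted, observed))


AVERAGED_XLOGP = tautomer_averaged(XLOGP_VIOLURIC)
```

For a test set, one error would be computed from the single-structure predictions and another from the tautomer-averaged predictions, giving the two series compared in the figure.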
Figure 1. AMBIT2 Tautomer generation test page
Structure input: C1=CN(C(N1)=S)C
Accepted formats: SMILES, InChI, *.mol, CML
QSAR/QSPR Cheminfo Processing Flow Chart
• CDK representation: methimazole as a connection table (CDK container)
• Generate 2D, generate 3D, generate tautomers (tautomer 3D models)
• Calculate 1D, 2D, 3D molecular descriptors, e.g. NA = 13, Z = 32, NH = 6, W = 40, MW = 114.03, ATSc1 = 0.14, …
• Group counts, additive schemes
• Calculate fingerprints (bit-vectors): hashed fingerprint (1 0 0 0 1 . . . 1 1 1 0 1 1), key-based fingerprint (0 0 1 0 1 . . . 0 0 1 0 1 0)
• QSPR: models of physicochemical properties: LogP, BP, MP, MR, …
• QSAR: models of biological activities: ADME, Toxicity, Mutagenicity, Biodegradation, …
• Similarity search: chemical database → list of most similar structures
[Structure drawings of methimazole, its tautomers and the similar structures not reproduced.]
Overlapping rules
[Structure drawings of four tautomeric forms not reproduced.]
- simple combinations do not work
- rule conflicts are possible
- some tautomers might be omitted
- a more sophisticated approach is needed
Tautomer Generation Flow Chart
• Structure input: OC(O)=C(N)C (CDK representation)
• Substructure search
• Initial rule list
• Generation of all possible combinations of the rule states, based on depth-first search with refinement of the rule list at each step
• Post-generation filtering: duplicates, topological equivalency, allene atoms, incorrect structures, …
• Ranking
• Result output

Search tree for OC(O)=C(N)C (atoms numbered 0 to 5 in SMILES order; structure drawings not reproduced). Node states, listed in the order extracted:
• unused rules: OC=C at 0 1 3; OC=C at 2 1 3; NC=C at 4 3 1
• used rules: N=CC at 4 3 1; unused rules: N=CC at 4 3 5
• used rules: NC=C at 4 3 1; unused rules: OC=C at 0 1 3; OC=C at 2 1 3
• used rules: N=CC at 4 3 1; N=CC at 4 3 5
• used rules: N=CC at 4 3 1; NC=C at 4 3 5
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; unused rules: OC=C at 0 1 3
• used rules: NC=C at 4 3 1; O=CC at 2 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; O=CC at 0 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; OC=C at 0 1 3
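The depth-first generation step can be illustrated on the poster's example, OC(O)=C(N)C. The sketch below is a simplified stand-in, not the Ambit-Tautomer algorithm: instead of maintaining used and unused rule lists, it re-matches a single generic 1-3 proton shift (H-a-b=c becomes a=b-c-H, with a heteroatom at one end) after every step. The molecule encoding, the function names and the heteroatom restriction are assumptions made for this illustration.

```python
from typing import Iterator, Set, Tuple

State = Tuple[int, ...]  # bond orders, aligned with BONDS

# Heavy atoms of OC(O)=C(N)C, numbered in SMILES order.
ELEMENTS = ["O", "C", "O", "C", "N", "C"]
VALENCE = {"C": 4, "N": 3, "O": 2}
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
START: State = (1, 1, 2, 1, 1)  # only the C1=C3 bond is double


def order(state: State, a: int, b: int) -> int:
    """Bond order between atoms a and b, or 0 if they are not bonded."""
    for (i, j), bond_order in zip(BONDS, state):
        if {i, j} == {a, b}:
            return bond_order
    return 0


def h_count(state: State, atom: int) -> int:
    """Implicit hydrogens: standard valence minus the sum of bond orders."""
    used = sum(o for (i, j), o in zip(BONDS, state) if atom in (i, j))
    return VALENCE[ELEMENTS[atom]] - used


def shifts(state: State) -> Iterator[State]:
    """Yield every state reachable by one 1-3 proton shift."""
    atoms = range(len(ELEMENTS))
    for a in atoms:
        for b in atoms:
            for c in atoms:
                if a == c:
                    continue
                if order(state, a, b) != 1 or order(state, b, c) != 2:
                    continue
                if h_count(state, a) == 0:
                    continue  # no proton to move
                if ELEMENTS[a] == "C" and ELEMENTS[c] == "C":
                    continue  # assumed: one end must be a heteroatom
                yield tuple(
                    2 if {i, j} == {a, b} else 1 if {i, j} == {b, c} else o
                    for (i, j), o in zip(BONDS, state)
                )


def dfs(state: State, seen: Set[State]) -> Set[State]:
    """Depth-first search over shift applications, collecting each state once."""
    seen.add(state)
    for nxt in shifts(state):
        if nxt not in seen:
            dfs(nxt, seen)
    return seen


TAUTOMERS = dfs(START, set())
```

Because hydrogens are derived from the bond orders, each state is fully described by the position of the double bond. Two of the states found here differ only in which of the two oxygen atoms carries the double bond; they are topologically equivalent, which is the kind of duplicate the post-generation filtering step is meant to remove.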
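Combinations of non-overlapping rules
Each tautomer is described as a binary combination of rule states: 0 0, 0 1, 1 0, 1 1. Each rule switches between its two states (0 ↔ 1).
[Structure drawings of the four combinations, with fragments labelled H2N or HN and OH or O, not reproduced.]
Legend: a marker indicates the current rule used to generate two possible states.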
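For non-overlapping rules, the pure combinatorial enumeration amounts to listing every binary vector of rule states. A minimal sketch follows; the rule labels are hypothetical placeholders, and the sketch is valid only under the assumption that the rule instances share no atoms.

```python
from itertools import product
from typing import List, Tuple

# Hypothetical labels for two matched rule instances that share no atoms.
RULE_INSTANCES: List[str] = ["rule A", "rule B"]


def rule_state_combinations(n_rules: int) -> List[Tuple[int, ...]]:
    """All 2**n assignments of state 0 or 1 to n independent rule instances."""
    return list(product((0, 1), repeat=n_rules))


COMBINATIONS = rule_state_combinations(len(RULE_INSTANCES))
```

Each tuple corresponds to one candidate tautomer. The number of candidates doubles with every additional rule instance, and the enumeration gives no way to handle rule instances that share atoms, which is the case the depth-first approach is designed for.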