Kuwahara and Gao J Cheminform (2021) 13:27 https://doi.org/10.1186/s13321-021-00506-2
RESEARCH ARTICLE
Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach
Hiroyuki Kuwahara and Xin Gao*
Abstract
Two-dimensional (2D) chemical fingerprints are widely used as binary features for the quantification of structural similarity of chemical compounds, which is an important step in similarity-based virtual screening (VS). Here, using an eigenvalue-based entropy approach, we identified 2D fingerprints with little to no contribution to shaping the eigenvalue distribution of the feature matrix as related ones and examined the degree to which these related 2D fingerprints influenced molecular similarity scores calculated with the Tanimoto coefficient. Our analysis identified many related fingerprints in publicly available fingerprint schemes and showed that their presence in the feature set could have substantial effects on the similarity scores and bias the outcome of molecular similarity analysis. Our results have implications for the optimal selection of 2D fingerprints for compound similarity analysis and the identification of potential hits for compounds with target biological activity in VS.
Keywords: Structure-activity relationship, Similarity-based virtual screening, 2D fingerprint, Unsupervised feature selection, Chemoinformatics
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Introduction
Virtual screening (VS) is a computational approach that is widely used as a cost-effective alternative to traditional high-throughput screening for the selection of initial hits in a search for drugs with a given biological activity [1, 2]. The foundation of similarity-based VS is the structure-activity relationship (SAR), a concept in which molecules with similar structures are expected to have similar biological activities. In such VS applications, the quantification of the structural similarity of molecules is thus a crucial step. To quantify the structural similarity of a pair of molecules, the Tanimoto similarity measure is commonly applied to fingerprint features based on their two-dimensional (2D) structures. These 2D fingerprints represent each molecule as a binary (0 or 1) vector characterizing the absence or the presence of specific properties of its 2D structure. Although this feature representation is simple, it has been reported to be more effective than those using more complex features such as 3D structural patterns [3, 4].
There are libraries of predefined 2D chemical fingerprint dictionaries available to represent molecules as binary vectors [5]. Among the most commonly used fingerprint schemes for similarity quantification is molecular access system (MACCS) [6], which was reported to cover many useful 2D features for virtual screening [7]. While these predefined fingerprint dictionaries are easy to use, previous studies demonstrated that selecting relevant 2D fingerprints from the original set resulted in better performance [8–10]. These feature selection methods typically focus on supervised machine learning settings, in which a subset of relevant 2D fingerprints is selected to improve the generalizability of discriminating chemical compounds with a given biological activity from those without. For example, Nisius et al. ranked 2D fingerprints by applying the Kullback-Leibler divergence to each fingerprint to quantify its asymmetric usage between the active compound class and the inactive one [11]. Given the nature of drug discovery, however, these supervised feature selection approaches inevitably face a challenging class imbalance problem in practice, as available compounds with the target biological activity are most likely very scarce. That is, had the number of target bioactive compounds been large enough to begin with, a pipeline to discover more of the same would probably not have warranted a large investment.
*Correspondence: [email protected] Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
Here, we focus on a different issue in the combination of 2D fingerprints and analyze the effects of related fingerprints on the quantification of molecular similarity using eigenvalue-based entropy. The eigenvalue-based entropy was introduced by Alter et al. [12] to indicate the weight distribution of gene expression eigenvectors for the analysis of temporal gene expression patterns. Varshavsky et al. [13] developed an unsupervised feature selection method that ranks each feature by measuring its contribution to the eigenvalue-based entropy. We defined the relatedness of each 2D fingerprint based on the degree to which the shape of the eigenvalue distribution of the feature matrix changes when that fingerprint is removed, and, by using the eigenvalue-based entropy as a scalar value to indicate the distribution of eigenvalues, we determined related 2D fingerprints. Thus, we defined a related 2D fingerprint as a feature that has a (quasi-)linear relationship with some other fingerprints in the feature set, regardless of its relevance and importance for discriminability. As illustrated in Fig. 1, the presence of such related fingerprints can inflate or deflate similarity scores, potentially changing the outcome of molecular similarity analysis and VS.
In this paper, we applied MACCS and Pubchem fingerprint schemes to human metabolite and drug compound datasets and identified many fingerprints as related ones. While the effects of related fingerprints depended on various factors such as query compounds, the general trend was that the presence of related 2D fingerprints mildly lowered overall similarity scores, and some of these negative effects were found to be substantial. Our analysis demonstrated that these effects can pose challenges in ranking similar compounds and qualitatively change the outcome of VS.
Methods
Datasets
From the Human Metabolome Database (HMDB) [14], we retrieved 2D structure data for 25,376 metabolites on September 5, 2019. By filtering for the metabolites found in blood with the metabolite status “Detected and Quantified,” we further obtained information about 3,202 metabolites for the blood specimen.
From DrugBank (version 5.1.7) [15], we downloaded the dataset of approved drugs with 2,636 entries. After preprocessing for unique and valid SMILES data, we obtained 2,466 drug compounds for the DrugBank dataset.
Molecular similarity measure
We used the implementation in CDK (version 2.3) [16] to compute 166-bit MACCS and 881-bit Pubchem fingerprint vectors. To measure the similarity of a pair of compounds, $a$ and $b$, we computed the Tanimoto coefficient of their $l$-bit fingerprint vectors, $v_a$ and $v_b$, as follows:
$$\mathrm{sim}(a, b) = \frac{\sum_{i=1}^{l} v_a(i)\, v_b(i)}{\sum_{i=1}^{l} \left[ v_a(i) + v_b(i) - v_a(i)\, v_b(i) \right]}, \tag{1}$$
where $v(i)$ represents the $i$-th element of vector $v$.
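As a minimal sketch of Eq. 1, the following Python snippet computes the Tanimoto coefficient on bit-packed fingerprints. Packing each fingerprint into a Python integer, the helper names `pack_bits` and `tanimoto`, and the convention of returning 0 for two all-zero vectors are assumptions made for illustration; they are not taken from the CDK implementation used in this study.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a Python int (first bit is most significant)."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    return value


def tanimoto(a, b):
    """Tanimoto coefficient (Eq. 1) of two bit-packed fingerprints."""
    intersection = bin(a & b).count("1")  # bits that are on in both vectors
    union = bin(a | b).count("1")         # bits that are on in at least one vector
    # Assumed convention: two all-zero vectors get similarity 0.
    return intersection / union if union else 0.0


va = pack_bits([1, 0, 1, 1, 0, 1])
vb = pack_bits([1, 1, 0, 1, 0, 0])
print(tanimoto(va, vb))  # 2 shared on bits out of 5 distinct on bits -> 0.4
```

The bitwise OR counts exactly the terms of the denominator in Eq. 1, because $v_a(i) + v_b(i) - v_a(i) v_b(i)$ equals 1 whenever at least one of the two bits is on.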
Fig. 1 contents (fingerprint bits F1 to F9; collinearity: 2 F1 = F2 + F3 + F4):
            F1 F2 F3 F4 F5 F6 F7 F8 F9   Tc with collinearity   Tc without collinearity
query        1  0  1  1  1  0  1  1  0
compound 1   1  0  1  1  0  0  0  0  1   3/7                    0/4
compound 2   1  1  1  0  0  0  1  0  0   3/7                    1/3
compound 3   0  0  0  0  1  1  1  1  0   3/7                    3/4
Fig. 1 An illustrative example for the effects of related fingerprints on similarity measures. A hypothetical fingerprint scheme with nine bit keys (F1 to F9) is used to represent small molecules in a hypothetical compound dataset. The fingerprint matrix of this dataset is found to have a perfect multicollinearity in the first four features with 2 F1 = F2 + F3 + F4. The similarity of a query compound against three compounds is computed using the Tanimoto coefficient (Tc) with and without this collinearity. For the results without the collinearity, the Tanimoto coefficient without the first four features (F1 to F4) is shown
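The example of Fig. 1 can be checked with a few lines of Python. The sketch below uses exact fractions; the bit vectors are the ones shown in the figure, while the function name `tanimoto` and the `keep` argument are illustrative choices.

```python
from fractions import Fraction

# Bit vectors F1..F9 as shown in Fig. 1.
FINGERPRINTS = {
    "query":      [1, 0, 1, 1, 1, 0, 1, 1, 0],
    "compound 1": [1, 0, 1, 1, 0, 0, 0, 0, 1],
    "compound 2": [1, 1, 1, 0, 0, 0, 1, 0, 0],
    "compound 3": [0, 0, 0, 0, 1, 1, 1, 1, 0],
}


def tanimoto(va, vb, keep=None):
    """Exact Tanimoto coefficient restricted to the bit positions in `keep`."""
    positions = range(len(va)) if keep is None else keep
    both = sum(va[i] & vb[i] for i in positions)
    either = sum(va[i] | vb[i] for i in positions)
    return Fraction(both, either) if either else Fraction(0)


# Every vector satisfies the collinearity 2*F1 = F2 + F3 + F4.
assert all(2 * v[0] == v[1] + v[2] + v[3] for v in FINGERPRINTS.values())

query = FINGERPRINTS["query"]
targets = ["compound 1", "compound 2", "compound 3"]
with_collinear = [tanimoto(query, FINGERPRINTS[t]) for t in targets]
# Dropping F1..F4 leaves positions 4..8 (F5..F9).
without_collinear = [tanimoto(query, FINGERPRINTS[t], keep=range(4, 9)) for t in targets]

print([str(x) for x in with_collinear])     # ['3/7', '3/7', '3/7']
print([str(x) for x in without_collinear])  # ['0', '1/3', '3/4']
```

With the four collinear bits included, the three compounds are indistinguishable from the query's point of view; without them, the ranking becomes compound 3, compound 2, compound 1.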
Eigenvalue-based entropy
Let $A$ be an $m$ by $n$ matrix. Then, the $n$ by $n$ symmetric matrix $A^{T}A$ is positive semidefinite and has real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. By defining $q_j$ to be the $j$-th normalized eigenvalue, $q_j = \lambda_j / \sum_{k=1}^{n} \lambda_k$, we computed a single value that indicates the complexity of the distribution of eigenvalues with the normalized entropy of eigenvalues [12] as follows:
$$H = -\frac{1}{\log(n)} \sum_{j=1}^{n} q_j \log(q_j). \tag{2}$$
This entropy ranges from 0 to 1, with 0 indicating that the dataset can be constructed based on a single eigenvector and 1 indicating that each eigenvector has an equal contribution to the dataset. Eigenvalues were computed using the svd function in R.
Eigenvalue-based fingerprint contribution measure
Suppose we have $m$ compounds, each of which is expressed with $n$-bit 2D fingerprints. That is, we have an $m$ by $n$ matrix $A$ whose element $a_{i,j}$ represents the value of the $j$-th fingerprint for the $i$-th compound. Let $A_{[-i]}$ be the $m$ by $n$ matrix that has all but the $i$-th column of $A$, with the $i$-th column replaced by a zero column. We computed the contribution of the $i$-th ($1 \le i \le n$) fingerprint, $h_i$, as $h_i = H(A_{[-i]})$, where $H(M)$ is the eigenvalue-based entropy of matrix $M$ given by Eq. 2. Note that, because we can first compute the $n$-by-$n$ matrix $A^{T}A$, the computation of each fingerprint entropy depends on the number of fingerprints and not on the number of compounds, which is presumed to be very large.
Contribution of related fingerprints to the similarity score
Suppose we have two $l$-bit fingerprint vectors $v_a$ and $v_b$. Further suppose that two $k$-bit fingerprint vectors $u_a$ and $u_b$ are subvectors of $v_a$ and $v_b$, respectively, that represent their related fingerprints. To measure the contribution of the related fingerprints to the Tanimoto coefficient, we compute two scores: the contribution to the union set and the contribution to the intersecting set. The contribution to the union set of $v_a$ and $v_b$ is defined to be the ratio of the union set of the related fingerprint vectors to the union set of the original fingerprint vectors as follows:
$$\frac{\sum_{i=1}^{k} \left[ u_a(i) + u_b(i) - u_a(i)\, u_b(i) \right]}{\epsilon + \sum_{i=1}^{l} \left[ v_a(i) + v_b(i) - v_a(i)\, v_b(i) \right]}, \tag{3}$$
while the contribution to the intersecting set of $v_a$ and $v_b$ is defined to be the ratio of the intersecting set of the related fingerprint vectors to the intersecting set of the original fingerprint vectors as follows:
$$\frac{\sum_{i=1}^{k} u_a(i)\, u_b(i)}{\epsilon + \sum_{i=1}^{l} v_a(i)\, v_b(i)}, \tag{4}$$
where $\epsilon$ is a small constant (e.g., $10^{-10}$) to avoid division by zero.
Results
Presence of highly correlated fingerprints
MACCS keys are 166-bit 2D structure fingerprints that are commonly used to measure molecular similarity. Because each bit is either on (i.e., 1) or off (i.e., 0), the 166 MACCS keys can represent more than $9.3 \times 10^{49}$ distinct fingerprint vectors. After removing 454 entries with duplicate canonical SMILES strings, we generated 24,922 MACCS fingerprint vectors using the metabolite data we obtained from HMDB [14] (see "Methods"). After filtering out duplicates, we ended up with 3,125 unique fingerprint vectors. On average, thus, 8 metabolites were represented by the same MACCS fingerprint vector, indicating a high degree of collisions. The high level of collided metabolites suggests the possibility that many MACCS keys describe related 2D substructure characteristics.
To analyze the use of each MACCS key, we first counted the occurrences of on bits for each key in the 3,125 unique fingerprint vectors (Fig. 2a). We found that 39% of the 166 MACCS keys are on (i.e., 1) for fewer than 10% of the fingerprint vectors, while only 1 key is on for more than 90% of the vectors. This skewed use of molecular fingerprints ($\gamma_1 = 0.777$) indicates that many fingerprint bits are set to be off (i.e., 0) in most of the vectors, resulting in highly similar usage patterns. However, since the Tanimoto coefficient, the most commonly used 2D fingerprint-based similarity measure, does not consider the off bits (see "Methods"), its similarity analysis of these HMDB metabolites may not be influenced by the fingerprints with many off bits.
We next analyzed the association of MACCS keys whose on-bit counts are more moderate and whose effects on the similarity measure are assumed to be more profound. To this end, we focused on a subset of the MACCS keys whose on-bit counts are in a range between 25% and 75% of the total number of the unique vectors. We obtained 68 MACCS keys that satisfied this constraint and computed their pairwise correlation coefficient values (Fig. 2b). We found that a large fraction of the pairs (73%) had positive correlation ($r \ge 0$). Out of 2,278 fingerprint pairs, while none had strong negative correlation ($r \le -0.5$), 105 had strong positive correlation ($r \ge 0.5$). Among these positively correlated pairs,
the 127th and the 143rd fingerprints, both of which had on-bit counts of 1,875, had a perfect positive correlation, suggesting that the 2D structures characterized by these two fingerprints are highly related. Although the correlation coefficient can capture only a limited type of related fingerprints, these results suggest the prevalence of related fingerprints in the predefined 2D fingerprint dictionaries.
Fig. 2 Fingerprint usage patterns of MACCS 166 keys on HMDB metabolite dataset. a The on-bit count of each key, shown as the on fraction (0 to 1) for MACCS fingerprints F1 to F166. b The pairwise Pearson’s correlation coefficient value for each pair of 68 MACCS keys with moderate on-bit counts
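The pairwise correlation analysis of moderately used keys can be sketched as follows. This is an illustrative pure-Python version on a hypothetical toy matrix; the function names, the inclusive 25% to 75% bounds, and the toy data are assumptions, and the original analysis may have been implemented differently.

```python
from itertools import combinations
from math import sqrt


def on_fraction(column):
    """Fraction of compounds for which this key is on."""
    return sum(column) / len(column)


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def strongly_correlated_pairs(matrix, low=0.25, high=0.75, cutoff=0.5):
    """Index pairs of moderately used keys whose correlation is >= cutoff."""
    columns = list(zip(*matrix))  # one tuple per fingerprint key
    # Restricting to moderate on fractions also excludes constant columns,
    # for which the correlation is undefined.
    moderate = [j for j, col in enumerate(columns) if low <= on_fraction(col) <= high]
    return [(i, j) for i, j in combinations(moderate, 2)
            if pearson(columns[i], columns[j]) >= cutoff]


# Hypothetical toy data: 8 compounds (rows) and 5 keys (columns).
keys = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # key 0
    [1, 1, 1, 1, 0, 0, 0, 0],  # key 1: identical usage to key 0
    [1, 0, 1, 0, 1, 0, 1, 0],  # key 2: uncorrelated with key 0
    [1, 0, 0, 0, 0, 0, 0, 0],  # key 3: rarely on, filtered out
    [1, 1, 1, 1, 1, 1, 1, 1],  # key 4: always on, filtered out
]
toy = list(zip(*keys))
print(strongly_correlated_pairs(toy))  # [(0, 1)]
```

Two keys with identical on-bit patterns, like keys 0 and 1 here, are the toy analogue of the perfectly correlated 127th and 143rd MACCS keys.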
Characterization of related fingerprints
We next sought to analyze the extent to which more general types of related fingerprints were present in the MACCS and Pubchem fingerprint dictionaries. To this end, we gathered 3,202 metabolites found in the blood specimen from the HMDB metabolite dataset and filtered out compounds with duplicate 2D structures, duplicate fingerprint vectors, and all-zero MACCS fingerprint vectors. In addition, we removed each compound whose MACCS fingerprint vector has off bits for more than 90% of the fingerprints.
With this data preprocessing, we selected 1023 metabolites that have unique fingerprint vectors. The 1023 by 166 matrix formed with the MACCS fingerprints, where each column represents a fingerprint and each row represents a metabolite (see "Methods"), had a rank of 144,
indicating that the pattern of ∼15% of MACCS fingerprints can be completely captured by the rest. The Pubchem fingerprints resulted in a 1023 by 881 fingerprint matrix that had a rank of 377, indicating even more pronounced rank deficiency, with more than half of the fingerprints completely characterized by linear combinations of 377 fingerprints.
Fig. 2 continued. Panel b: heatmap of the pairwise Pearson correlation (color scale from -1.0 to 1.0). Both axes list the 68 MACCS keys with moderate on-bit counts: F53, F54, F65, F72, F75, F82, F83, F85, F86, F89, F90, F91, F92, F93, F95, F96, F97, F98, F100, F104, F105, F106, F108, F109, F110, F111, F112, F113, F115, F116, F117, F118, F120, F121, F122, F123, F125, F126, F127, F128, F129, F131, F132, F133, F136, F137, F138, F139, F140, F142, F143, F144, F145, F146, F147, F148, F149, F150, F151, F152, F153, F154, F155, F156, F158, F160, F161, F162
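The rank argument above can be illustrated on a small example. The sketch below, which assumes NumPy is available, builds a hypothetical 8 by 5 binary matrix in which one column is an exact copy of another and one column is a logical AND of two others; the matrix and variable names are made up for illustration.

```python
from itertools import product

import numpy as np

# Hypothetical toy matrix: one row per combination of three free bits.
rows = []
for f1, f2, f3 in product([0, 1], repeat=3):
    f4 = f1        # exact copy of F1: linearly dependent on F1
    f5 = f2 & f3   # logical AND: related to F2 and F3, but not linearly
    rows.append([f1, f2, f3, f4, f5])
A = np.array(rows, dtype=float)

rank = np.linalg.matrix_rank(A)
redundant_fraction = 1 - rank / A.shape[1]
print(rank, redundant_fraction)  # rank 4 out of 5 columns
```

The rank drops only because of the duplicated column. The AND column is clearly related to F2 and F3, yet it does not reduce the rank, which is one reason a measure that also captures quasi-linear relationships is useful.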
To assess the degree of related fingerprints, we defined the relatedness using the eigenvalue-based entropy (see "Methods"). This eigenvalue-based entropy measure indicates the shape of the eigenvalue distribution [12], with its value ranging from 0 to 1, where a lower value indicates that the matrix can be reconstructed with a linear combination of a smaller number of eigenvectors. The distribution of the normalized eigenvalues for the MACCS and Pubchem fingerprint matrices shows that the first component has more than 6 times higher weight than the second one in both (Fig. 3a), indicating that their entropy values must be low. Indeed, the entropy values of the original MACCS and Pubchem matrices were 0.474 and 0.355, respectively. To measure the relatedness of the i-th fingerprint with the other fingerprints, we computed the change in the entropy between the original fingerprint-feature matrix and the feature matrix without the i-th fingerprint (see "Methods"). This can indicate the contribution of the i-th fingerprint to shaping the eigenvalue distribution, which, in turn, allows us to evaluate the degree to which the i-th fingerprint is linearly related to some other fingerprints. Figure 3b shows the distribution of the fingerprint entropy values for both the MACCS and Pubchem schemes. We found a high peak at the entropy value of the original fingerprint-feature matrix, with many fingerprints having entropies near the original one, indicating that these fingerprints do not contribute much to the eigenvalue distribution and are highly related to some other fingerprints.
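A minimal sketch of the eigenvalue-based entropy and of the per-fingerprint entropies is given below, assuming NumPy. The study computed eigenvalues with the svd function in R; here `numpy.linalg.svd` plays the same role. The function names, the toy matrix, and the handling of an all-zero matrix are illustrative assumptions.

```python
import numpy as np


def eigen_entropy(A):
    """Normalized entropy of the eigenvalues of A^T A (n >= 2 columns assumed)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[1]
    # Squared singular values of A are the largest min(m, n) eigenvalues of
    # A^T A; any remaining eigenvalues are zero and add nothing to the sum.
    eigenvalues = np.linalg.svd(A, compute_uv=False) ** 2
    total = eigenvalues.sum()
    if total == 0:
        return 0.0  # assumed convention for an all-zero matrix
    q = eigenvalues / total
    q = q[q > 0]  # skip zero eigenvalues (0 * log 0 is treated as 0)
    return float(-(q * np.log(q)).sum() / np.log(n))


def fingerprint_entropies(A):
    """Entropy h_i of the matrix with the i-th column replaced by zeros."""
    A = np.asarray(A, dtype=float)
    entropies = []
    for i in range(A.shape[1]):
        reduced = A.copy()
        reduced[:, i] = 0.0
        entropies.append(eigen_entropy(reduced))
    return np.array(entropies)


# Hypothetical toy matrix: 4 compounds, 3 fingerprints (first two identical).
toy = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
h0 = eigen_entropy(toy)
h = fingerprint_entropies(toy)
print(h0, h, np.abs(h - h0))
```

Fingerprints whose removal leaves the entropy close to `h0` are the candidates for being related. For large compound sets, the same eigenvalues could instead be obtained from the n by n matrix $A^{T}A$, as noted in "Methods".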
By measuring the relatedness based on the distance between the original entropy and the entropy for each fingerprint, we selected related fingerprints from the original MACCS and Pubchem fingerprint dictionaries. Let $h_0$ and $h_i$ ($1 \le i \le n$) be the eigenvalue-based entropy of the original feature matrix and for the $i$-th fingerprint, respectively. Then, we selected the $i$-th fingerprint as a related feature if $h_i$ satisfies the following condition:
$$|h_i - h_0| < z \sqrt{\frac{\sum_{j=1}^{n} (h_j - h_0)^2}{n}}, \tag{5}$$
where $z$ is a reduced-level threshold parameter, which was set to 0.1, 0.2, and 0.3 in this study. Based on this approach, we found that many fingerprints in the MACCS and Pubchem dictionaries are related (Fig. 4a). With the reduced-level threshold being 0.1, 0.2, and 0.3, we identified 28, 48, and 62 related fingerprints in MACCS and 454, 525, and 555 in Pubchem, respectively. A larger fraction of the related fingerprints identified in the Pubchem scheme with the low threshold value was expected given that the distribution of its fingerprint entropies had a higher density of the fingerprints near the original entropy.
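The selection rule of Eq. 5 translates directly into code. In the sketch below, the function name `select_related` and the entropy values are hypothetical and serve only to illustrate how the threshold z changes the selection.

```python
from math import sqrt


def select_related(h0, h, z):
    """Indices i with |h_i - h0| < z * sqrt(mean_j (h_j - h0)^2), as in Eq. 5."""
    rms = sqrt(sum((hj - h0) ** 2 for hj in h) / len(h))
    # If every h_j equals h0, rms is 0 and the strict inequality selects nothing.
    return [i for i, hi in enumerate(h) if abs(hi - h0) < z * rms]


# Hypothetical entropies: original matrix and three leave-one-out values.
h0 = 0.5
h = [0.5, 0.51, 0.6]
print(select_related(h0, h, 0.1))  # [0]
print(select_related(h0, h, 0.3))  # [0, 1]
```

Because the cutoff scales with the root-mean-square deviation of all fingerprint entropies from the original one, a larger z admits fingerprints whose removal changes the entropy somewhat more.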
Fig. 3 Eigenvalue-based analysis of MACCS and Pubchem fingerprint matrices. a Normalized eigenvalues of the first 10 components for MACCS and Pubchem fingerprint matrices (x-axis: component; y-axis: normalized eigenvalue). b The distribution of the eigenvalue-based entropy for MACCS and Pubchem fingerprints (x-axis: entropy; y-axis: density; the original entropy is marked in each panel)
Effects of related fingerprints on molecular similarity scores
Using the related fingerprints identified with the eigenvalue-based entropy approach, we set out to examine their effects on the similarity score. To this end, we first constructed a 1023 by 1023 similarity matrix by computing the Tanimoto coefficient for each metabolite pair and generated the normalized eigenvalues of the similarity matrices (Fig. 4b). The comparison of the first six components suggests that the similarity matrices computed from fingerprint sets with various reduction levels in both MACCS and Pubchem schemes are similar. Next, we measured the absolute difference in the similarity scores of 522,753 distinct metabolite pairs between the original fingerprint set and reduced fingerprint sets. We found that the difference increased as the reduced-level threshold increased in both MACCS and Pubchem fingerprint dictionaries (Fig. 4c). While both fingerprint schemes had quantitatively similar levels of absolute differences with the reduced level at 0.1, the gap between the two schemes widened as the threshold increased, with larger differences in MACCS, suggesting that the effects of removing related fingerprints were greater in the MACCS scheme even though a higher fraction of fingerprints was removed in the Pubchem scheme.
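The pairwise comparison between the original and a reduced fingerprint set can be sketched as follows. The function names are illustrative, and the four toy vectors are the hypothetical ones from Fig. 1 with F1 to F4 treated as the removed fingerprints; they are not taken from the HMDB dataset.

```python
from itertools import combinations
from math import comb


def tanimoto(va, vb, keep):
    """Tanimoto coefficient restricted to the bit positions in `keep`."""
    both = sum(va[i] & vb[i] for i in keep)
    either = sum(va[i] | vb[i] for i in keep)
    return both / either if either else 0.0


def fraction_changed(fingerprints, kept, threshold):
    """Fraction of distinct pairs whose score changes by more than `threshold`."""
    all_bits = range(len(fingerprints[0]))
    pairs = list(combinations(fingerprints, 2))
    changed = sum(
        abs(tanimoto(a, b, all_bits) - tanimoto(a, b, kept)) > threshold
        for a, b in pairs
    )
    return changed / len(pairs)


toy = [
    [1, 0, 1, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1, 0],
]
kept = range(4, 9)  # keep F5..F9 only
print(fraction_changed(toy, kept, 0.2))  # 3 of the 6 pairs -> 0.5
print(comb(1023, 2))                     # 522753 distinct pairs for 1023 compounds
```

Evaluating `fraction_changed` over a grid of thresholds gives a curve of the kind described for Fig. 4c.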
To further analyze the effects of related fingerprints on the Tanimoto similarity scores, we grouped the metabolites in the blood specimen into four classes (drug, microbial, plant, and endogenous) using the metabolite annotation retrieved from HMDB. In each of these four categories, we computed the average of the pairwise Tanimoto similarity scores. The results show that the average similarity scores from the reduced fingerprint sets are quantitatively close to those from the original fingerprint sets (Table 1). Similarity scores from the reduced fingerprint sets were found to be marginally higher than those from the original fingerprint sets. In other words, the inclusion of related fingerprints had negative effects and slightly decreased the Tanimoto similarity score.
To characterize the significance of the negative effects of the related fingerprints on individual compound pairs, we identified the pairs with the 30 largest absolute differences in the similarity score computed with the two reduced-level thresholds 0 and 0.3 for each of the MACCS and Pubchem fingerprint schemes. From the comparison of the similarity scores for different threshold levels in these 60 compound pairs, the negative effects of the related fingerprints on the similarity score were found to be prevalent (Fig. 5). Indeed, the similarity scores using the reduced fingerprints with the threshold 0.3, for example, demonstrated a strong association between the related fingerprints and the negative effects on the similarity scores ($p < 10^{-22}$ with a two-sided exact binomial test).
Fig. 4 Comparison of the Tanimoto similarity score based on different reduction levels. a The fraction of used fingerprints with respect to given reduced levels for MACCS and Pubchem fingerprints. The reduced level 0 indicates the fraction for the original fingerprints. b The comparison of the first 6 principal components with respect to different reduction levels. c The comparison of the fraction of metabolite pairs with the absolute similarity score difference between the original set of fingerprints and a given reduced set of fingerprints exceeding specified threshold values
Panel a values (percentage of fingerprints used at reduced levels 0, 0.1, 0.2, and 0.3): MACCS 100.0, 83.1, 71.1, 62.7; Pubchem 100.0, 48.5, 40.4, 37.0
Effects of related fingerprints for different query compounds
Next, we set out to examine the extent to which related fingerprints can impact the analysis of similar compounds for different query compounds. To this end, we used the drug compound data from DrugBank 5.1.7 (see "Methods") and generated the fingerprint vectors of the 2,466 drugs using MACCS and Pubchem fingerprint schemes. We applied the eigenvalue-based entropy approach to the dataset with 0.3 as the reduced-level threshold, identifying 34 and 470 fingerprint features as related features for the MACCS and Pubchem schemes, respectively. We then randomly selected 10 drugs as the query compounds and measured their structural similarity against the other 2,456 compounds using the Tanimoto coefficient. To analyze the contribution of the related fingerprint features for each query compound, we used the compounds with the 50 highest similarity scores. Because the Tanimoto coefficient of a pair of fingerprint vectors is the ratio of the size of their intersecting set to that of their union set, we measured the contribution to the intersecting set and the union set separately (see "Methods"). When the contribution of the related fingerprints to the intersecting set is higher than that to the union set, they have positive effects on the Tanimoto coefficient. By contrast, when their contribution to the intersecting set is lower than that to the union set, they have negative effects on the Tanimoto coefficient.
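The two contribution scores from "Methods" and the resulting sign of the effect can be sketched as follows. The function names and the label strings are illustrative, and the vectors are the hypothetical ones from Fig. 1, with F1 to F4 playing the role of the related fingerprints.

```python
EPS = 1e-10  # small constant that avoids division by zero


def contributions(va, vb, related):
    """Contribution of the related bits to the union and the intersecting set."""
    union_related = sum(va[i] | vb[i] for i in related)
    inter_related = sum(va[i] & vb[i] for i in related)
    union_all = sum(a | b for a, b in zip(va, vb))
    inter_all = sum(a & b for a, b in zip(va, vb))
    return union_related / (EPS + union_all), inter_related / (EPS + inter_all)


def effect(va, vb, related):
    """Sign of the effect of the related bits on the Tanimoto coefficient."""
    to_union, to_intersection = contributions(va, vb, related)
    if to_intersection > to_union:
        return "positive"
    if to_intersection < to_union:
        return "negative"
    return "neutral"


query = [1, 0, 1, 1, 1, 0, 1, 1, 0]
compound_1 = [1, 0, 1, 1, 0, 0, 0, 0, 1]
compound_3 = [0, 0, 0, 0, 1, 1, 1, 1, 0]
related = [0, 1, 2, 3]  # F1..F4

print(effect(query, compound_1, related))  # positive: 3/7 with, 0/4 without
print(effect(query, compound_3, related))  # negative: 3/7 with, 3/4 without
```

In the first pair, the related bits supply the whole intersecting set but only part of the union set, so they inflate the score; in the second pair, they add only to the union set and deflate it.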
The contribution of related fingerprints to the Tanimoto coefficient was found to vary among different compound pairs as well as between MACCS and Pubchem (Fig. 6). Although, in both schemes, the related fingerprints had positive and negative effects on the similarity scores of drug pairs, their effects on the union set were more significant and prevalent in the MACCS fingerprint scheme. In addition, consistent with the analysis on the HMDB dataset, stronger effects of the related fingerprints were found in those pairs with higher contribution to the union set for both MACCS and Pubchem schemes (i.e., those with negative effects).
Interestingly, the contribution of the related fingerprints displayed clustering patterns based on the query compounds. With Sevoflurane (DB01236) as the query drug, for example, the related fingerprints contributed minimally to the intersecting set compared with the union set for both MACCS and Pubchem, indicating that the related fingerprints had stronger negative effects on the similarity scores for this drug. For Ibrutinib (DB09053), while they contributed more than 10% to both the intersecting and the union sets, the related fingerprints influenced the intersecting set more, indicating that they had positive effects on the similarity score. These results indicate that the impact of related fingerprints can depend strongly on query compounds, suggesting that the presence of related fingerprints can give rise to bias in similarity scores, potentially making fair analysis of similar compounds for various query compounds challenging.
Effects of related fingerprints on drug similarity accuracy
Although our results showed the effects of related fingerprints on molecular similarity analysis based on the Tanimoto coefficient, it is not clear whether those effects can lead to qualitatively significant changes in structural similarity-based molecule screening. To analyze potential effects of related fingerprints in such SAR-based analysis, we used a dataset consisting of 100 drug compound pairs from DrugBank 3.0 [17] that 143 experts analyzed to provide their yes or no binary decisions about the structural similarity [18]. To use this dataset as the correct reference for similar compounds, we selected a subset of the 100 pairs whose similarity was supported by at least 80% of the experts, resulting in 33 pairs of similar compounds. The Tanimoto similarity scores of these compound pairs were computed using the original fingerprint set as well as the reduced one from the DrugBank 5.1.7 dataset, with the reduced-level threshold set to 0.3.
Table 1 The average Tanimoto similarity score for five classes of metabolites in the blood specimen for the MACCS and Pubchem fingerprint schemes with different reduction levels
Scheme   Level  Drug    Microbial  Plant   Endogenous  All
MACCS    0      0.3008  0.3531     0.3662  0.3211      0.3142
MACCS    0.1    0.2990  0.3489     0.3696  0.3190      0.3122
MACCS    0.2    0.2944  0.3536     0.3723  0.3188      0.3118
MACCS    0.3    0.3013  0.3578     0.3785  0.3211      0.3149
Pubchem  0      0.3048  0.3323     0.3873  0.2967      0.2968
Pubchem  0.1    0.3104  0.3390     0.3922  0.3012      0.3016
Pubchem  0.2    0.3167  0.3384     0.3974  0.3043      0.3053
Pubchem  0.3    0.3217  0.3395     0.4010  0.3071      0.3085
We first analyzed the performance of each fingerprint set by computing five measures: the mean of the similarity scores, the standard deviation of the scores, the number of pairs whose similarity scores are $\ge 0.8$, the number of pairs whose similarity scores are $\ge 0.7$, and the minimum similarity score (Table 2). We decided to focus on the number of positives because, in virtual screening, molecular similarity is used to filter out incompatible compounds and to generate initial hits with potentially similar bioactive properties in order to capture them in follow-up screenings [1]. That is, in the filtering for the initial hits, as long as true positives are included, the number of false positives is not as important.
The reduced fingerprint set resulted in a mild yet consistent increase in the average similarity scores and decrease in the variance of the similarity scores in both MACCS and Pubchem schemes. An increase in the overall similarity score is consistent with the results from the HMDB dataset, while a decrease in the variance can be explained by the enhanced generalization achieved with the removal of related fingerprints. The number of pairs with high similarity scores also increased by removing related fingerprints; in the MACCS scheme, the original fingerprint set and the reduced fingerprint set with the threshold 0.3 correctly predicted 73% and 85% of the reference pairs, respectively, with the positive-calling threshold of 0.8, while in the Pubchem scheme, they correctly predicted 97% and 100% of the pairs, respectively, with the positive-calling threshold of 0.7. Furthermore, in both fingerprint schemes, the reduced fingerprint set resulted in a > 6% increase in the minimum similarity score among the 33 reference pairs, indicating that the filtering of related fingerprints was able to improve the similar compound search more inclusively.
To better understand the significance of the results from the reduced fingerprint set, we generated 10,000 sets of reduced fingerprints by randomly pruning the original fingerprint set to have the same size as the one with the reduced-level threshold 0.3. The comparison revealed that the filtering of related fingerprints had a tendency to substantially increase the similarity scores of a number of pairs, while the random pruning did not show such directionality bias (Fig. 7). To further analyze significant changes from the filtering of related fingerprints, we computed the p-value with the null hypothesis that the filtering of related fingerprints is the same as the random pruning. We obtained two and four pairs with significant changes ($p < 0.05$) from the MACCS and Pubchem schemes, respectively, and all of these pairs resulted in an increase in their similarity scores with the removal of related fingerprints. These results indicate that related fingerprints are likely to contribute to the union set more significantly and that the filtering of such fingerprint features can increase the Tanimoto coefficient of structurally similar compound pairs.
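The comparison against random pruning can be sketched for a single compound pair as follows. The text does not specify the test statistic, so the two-sided empirical p-value with an add-one correction used here is an assumption, as are the function names, the fixed seed, and the toy vectors (the hypothetical query and compound 3 of Fig. 1).

```python
import random


def tanimoto(va, vb, keep):
    """Tanimoto coefficient restricted to the bit positions in `keep`."""
    both = sum(va[i] & vb[i] for i in keep)
    either = sum(va[i] | vb[i] for i in keep)
    return both / either if either else 0.0


def pruning_p_value(va, vb, related, n_random=10_000, seed=0):
    """Observed score change from removing `related`, and its empirical p-value
    against random removals of the same number of fingerprints."""
    rng = random.Random(seed)
    all_bits = list(range(len(va)))
    original = tanimoto(va, vb, all_bits)

    def change(removed):
        removed = set(removed)
        kept = [i for i in all_bits if i not in removed]
        return tanimoto(va, vb, kept) - original

    observed = change(related)
    # Count random prunings that change the score at least as much (two-sided).
    extreme = sum(
        abs(change(rng.sample(all_bits, len(related)))) >= abs(observed)
        for _ in range(n_random)
    )
    return observed, (extreme + 1) / (n_random + 1)


query = [1, 0, 1, 1, 1, 0, 1, 1, 0]
compound_3 = [0, 0, 0, 0, 1, 1, 1, 1, 0]
observed, p = pruning_p_value(query, compound_3, related=[0, 1, 2, 3], n_random=1000)
print(observed, p)  # observed change is 3/4 - 3/7
```

A positive observed change with a small p-value corresponds to a pair whose similarity score increases more under the targeted removal than under most random prunings of the same size.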
Fig. 5 Illustration of 60 metabolite pairs with high levels of changes in Tanimoto similarity measures. Heatmap showing the similarity scores (color scale: similarity value, 0.00 to 1.00) of 60 metabolite pairs (y-axis) based on given levels of reduced fingerprint sets (x-axis, 0 to 0.3), with one panel for the MACCS scheme and one for the Pubchem scheme. From the MACCS and Pubchem fingerprint dictionaries, 30 pairs are selected from each based on the difference between the original set of fingerprints and a reduced set of fingerprints with reduced level 0.3

Conclusions
In this study, we defined related fingerprints to be those that do not contribute to the shape of the eigenvalue distribution of the original fingerprint feature matrix and thus are thought to possess a high degree of multicollinearity with other features. By developing a method to identify such related fingerprint features, we studied their effects on the analysis of compound similarity. Analyzing fingerprint feature matrices in the datasets of human metabolites and drug compounds, we found that commonly used predefined 2D structure fingerprint schemes had many related fingerprints and that these fingerprints affected the scoring of structural similarity differently depending on the compound pair, which could bias the outcome of similar-compound rankings and qualitatively change the list of potential hits. Interestingly, our analysis showed that the presence of related fingerprints tended to mildly yet consistently lower the Tanimoto coefficient and that these negative effects were substantial for a subset of compound pairs.
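Why related fingerprints can lower the Tanimoto coefficient becomes clearer when their share of the intersecting set is compared with their share of the union. The sketch below assumes that the "contribution" of related fingerprints is the fraction of on-bits in each set that belong to the related fingerprints; this definition, the function names, and the toy data are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def related_contributions(a, b, related):
    """Fractions of the intersection and of the union made up of related bits."""
    inter, union = a & b, a | b
    c_inter = len(inter & related) / len(inter) if inter else 0.0
    c_union = len(union & related) / len(union) if union else 0.0
    return c_inter, c_union

# Toy pair (hypothetical) in which the related bits fall mostly outside the intersection.
a = {0, 1, 2, 3, 6}
b = {0, 1, 2, 4, 7}
related = {2, 6, 7}

c_inter, c_union = related_contributions(a, b, related)
before = tanimoto(a, b)                      # 3 shared bits out of 7
after = tanimoto(a - related, b - related)   # 2 shared bits out of 4

# Removing the related bits rescales the score by (1 - c_inter) / (1 - c_union),
# so the score increases exactly when c_inter < c_union.
```

Under this definition, a pair lying below the diagonal of a plot of the contribution to the intersecting set against the contribution to the union gains similarity when the related fingerprints are filtered out.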
Fig. 6 Contribution of related fingerprints to the similarity score for 10 randomly selected query compounds in DrugBank. The scatterplot shows two contribution measures from the Tanimoto coefficient (the ratio of the intersecting set to the union) of the drug compounds with the 50 highest similarity scores for each query compound. The x-axis shows the contribution of the related fingerprints to the union set, while the y-axis shows the contribution to the intersecting set. The related fingerprints are defined to be the removed ones based on the reduced level 0.3. Query drugs (legend): DB00655, DB01236, DB00138, DB09071, DB00190, DB09053, DB08799, DB12523, DB00392, DB00204. a MACCS scheme. b Pubchem scheme

Table 2 Summary of the results from 33 pairs of similar compounds with high consensus from 143 experts using the original fingerprint set and the reduced fingerprint set

Scheme    Fingerprints   Mean     Std. dev.   ≥ 0.8 (a)   ≥ 0.7 (b)   Min sim (c)
MACCS     Original       0.8724   0.1308      24          31          0.4686
MACCS     Reduced (d)    0.8794   0.1229      28          31          0.5000
Pubchem   Original       0.9163   0.0809      29          32          0.6970
Pubchem   Reduced        0.9187   0.0750      29          33          0.7402

(a) The number of instances in which the similarity score is greater than or equal to 0.8.
(b) The number of instances in which the similarity score is greater than or equal to 0.7.
(c) The minimum similarity score among the 33 pairs.
(d) The reduced fingerprint set using 0.3 as the threshold.

Previously, we developed a method to predict thermodynamic parameters of biochemical reactions by using predefined 2D chemical fingerprints and chemical descriptors as features [19]. In this regression problem, we dealt with high degrees of multicollinearity in the fingerprint-based features and sought to enhance the generalization ability using LASSO-based feature selection and a regularized linear regression model. Here, to analyze the effects of features with a high degree of multicollinearity on the search for similar compounds, we considered a different approach that computes eigenvalue-based entropy to identify related 2D fingerprint features regardless of their ability to classify similar compounds. Because this eigenvalue-based entropy approach is an unsupervised method, it is expected to integrate seamlessly into existing similarity-based VS pipelines that use 2D fingerprints as feature vectors.
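To make the idea of an eigenvalue-based entropy concrete, the sketch below follows the general SVD-entropy formulation of refs. [12, 13]: the squared singular values of the feature matrix are normalized to sum to one, their Shannon entropy is computed, and each feature is scored by how much the entropy changes when that feature is left out. This is an assumption-laden illustration and not the authors' code; in particular, the unnormalized entropy, the leave-one-out scoring, the use of NumPy, and the toy matrix are choices made here and may differ from the definitions in the Methods section.

```python
import numpy as np

def eigen_entropy(X):
    """Shannon entropy of the normalized eigenvalues of X^T X.

    The eigenvalues are obtained as squared singular values of X
    (rows: compounds, columns: fingerprint features).
    """
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    lam = s ** 2
    total = lam.sum()
    if total == 0:
        return 0.0
    v = lam / total
    v = v[v > 0]  # treat 0 * log(0) as 0
    return float(-(v * np.log(v)).sum())

def leave_one_out_contribution(X):
    """Entropy of the full matrix minus the entropy with each column removed."""
    X = np.asarray(X, dtype=float)
    base = eigen_entropy(X)
    return [base - eigen_entropy(np.delete(X, j, axis=1)) for j in range(X.shape[1])]

# Toy feature matrix (hypothetical): 4 compounds x 3 fingerprints,
# where the third fingerprint duplicates the first.
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 0]])
contrib = leave_one_out_contribution(X)
```

In this toy matrix, the two duplicated columns receive a lower score than the unique column, which is the kind of signal that an unsupervised filter can use without any activity labels.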
Our results indicate that the presence of related fingerprints in predefined fingerprint dictionaries can pose challenges in objectively ranking the degree of similarity among compounds from different datasets and for different query compounds, making the evaluation of structural similarity difficult not only for multiple-molecule queries but also for single-molecule queries. As each compound dataset can have a different set of related fingerprints, our results demonstrate the importance of knowing which fingerprints have high degrees of relatedness and how such related features affect the Tanimoto similarity scores. This study also emphasizes that an increase in the number of structural fingerprints may not always enhance the search performance for similar compounds and that feature selection is a valuable preprocessing step for the task of SAR analysis.
Authors’ contributions
HK designed the study, developed tools, performed analysis, and wrote the paper. XG oversaw the project and wrote the paper. All authors read and approved the final manuscript.

Funding
This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Awards No. BAS/1/1624-01, URF/1/3412-01, URF/1/3450-01, FCC/1/1976-18, FCC/1/1976-23, FCC/1/1976-25, FCC/1/1976-26, and FCS/1/4102-02.

Availability of data and materials
The code and data used in this study have been uploaded to GitHub and are available at https://github.com/hkuwahara/chem-fp-lambda-entropy.

Declarations

Competing interests
We declare that we have no competing interests.
Received: 8 May 2020 Accepted: 13 March 2021
Fig. 7 Relative changes of the Tanimoto similarity measure with the fingerprints in the reduced level 0.3 with respect to the one with the original fingerprints. The relative similarity changes (x-axis) are shown for the 33 similar-compound pairs (y-axis, molecule pair ID) with a high consensus by the 143 experts (≥ 80%). The error bars indicate the 1st and the 3rd quartiles of 10,000 Tanimoto coefficients computed with random pruning of the original fingerprints. These randomly selected fingerprint vectors have the same length as the one for the reduced level 0.3. a MACCS scheme. b Pubchem scheme

References
1. Smith A (2002) Screening for drug discovery: the leading question. Nature 418:453–459
2. Lyne PD (2002) Structure-based virtual screening: an overview. Drug Discovery Today 7:1047–1055
3. Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today 11:1046–1053
4. Scior T, Bender A, Tresadern G, Medina-Franco JL, Martínez-Mayorga K et al (2012) Recognizing pitfalls in virtual screening: a critical review. J Chemical Information Modeling 52:867–881
5. Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S et al (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63
6. Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chemical Information Computer Sci 42:1273–1280
7. Mellor CL, Marchese Robinson RL, Benigni R, Ebbrell D, Enoch SJ et al (2019) Molecular fingerprint-derived similarity measures for toxicological read-across: Recommendations for optimal use. Regulatory Toxicol Pharmacol 101:121–134
8. Bender A, Mussa HY, Glen RC, Reiling S (2004) Molecular similarity searching using atom environments, information-based feature selection, and a naïve bayesian classifier. J Chemical Information Computer Sci 44:170–178
9. Geppert H, Vogt M, Bajorath J (2010) Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. J Chemical Information Modeling 50:205–216
10. Heikamp K, Bajorath J (2011) How do 2D fingerprints detect structurally diverse active compounds? Revealing compound subset-specific fingerprint features through systematic selection. J Chemical Information Modeling 51:2254–2265
11. Nisius B, Vogt M, Bajorath J (2009) Development of a fingerprint reduction approach for Bayesian similarity searching based on Kullback-Leibler divergence analysis. J Chemical Information Modeling 49:1347–1358
12. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America 97:10101–10106
13. Varshavsky R, Gottlieb A, Linial M, Horn D (2006) Novel unsupervised feature filtering of biological data. Bioinformatics (Oxford, England) 22:e507–e513
14. Wishart DS, Feunang YD, Marcu A, Guo AC, Liang K et al (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46:D608–D617
15. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A et al (2018) Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic Acids Res 46:D1074–D1082
16. Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L et al (2017) The chemistry development kit (cdk) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminformatics 9:33
17. Knox C, Law V, Jewison T, Liu P, Ly S et al (2011) DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res 39:D1035–D1041
18. Franco P, Porta N, Holliday JD, Willett P (2014) The use of 2d fingerprint methods to support the assessment of structural similarity in orphan drug legislation. J Cheminformatics 6:5
19. Alazmi M, Kuwahara H, Soufan O, Ding L, Gao X (2019) Systematic selection of chemical fingerprint features improves the Gibbs energy prediction of biochemical reactions. Bioinformatics 35:2634–2643
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.