
Discrete Pattern Recognition by Fitting onto a Continuous Function

ALIETTE COSSÉ-BARBI,* MOURAD RAJI
Institut de Topologie et de Dynamique des Systèmes (ITODYS), CNRS URA-34, Université Paris 7-Denis Diderot, 1 rue Guy de la Brosse, 75005 Paris, France

Received 16 June 1996; accepted 17 June 1997

ABSTRACT: This article outlines an original method for matching discrete structures when atom correspondences are unknown. This method avoids the current (atom-by-atom) treatment and its inherent combinatorial problems and considers the structures to be compared in their totality. The basic idea is to first obtain the atom correspondences by fitting one of the two discrete structures onto a spline approximation of the other, rather than optimizing in discrete space, and, second, to overlap the two discrete structures on the basis of the proposed assignment. As starting data, the method requires only the Cartesian coordinates of the two structures. No connectivity information, atom labeling, or matching tolerance is required. This method can readily handle matches of molecules with a few hundred atoms. It is able to search for a given 3D pattern as well as for a pattern common to two structures. © 1997 John Wiley & Sons, Inc. J Comput Chem 18: 1875–1892, 1997

Introduction

The recognition and the analysis of three-dimensional (3D) similarities between molecules is of fundamental importance for the interpretation and prediction of their physical, chemical, or biological properties; that is, molecules with similar shape features are expected to interact in a similar way with radiation (circular dichroism), reagents (chemical reactivity), or biological receptors (bioreactivity).

For the understanding of molecular properties, both overall shape features and local shape features may be relevant. Indeed, the shapes to be compared can be overall ground states, delocalized three-dimensional arrangements of not necessarily connected functional groups (pharmacophores [1]), or local arrangements around a functional group (chromophores, reactive sites). The molecular shape can be a discrete nuclear arrangement or a 3D molecular body (electron distribution or, at a simpler level, van der Waals volume).

Well-documented 3D data bases, experimentally (x-ray or neutron diffraction [2, 3]) or computationally generated [4], have provided a completely new perspective. Overcoming the viewpoint of quantitative structure activity [5] or rs [6] relationships on particular series of compounds, they make it possible to compare many dissimilar molecules on purely 3D criteria, in order to design compounds belonging to new series and having interesting shape features.

* Presented as M. Raji's thesis at Université Paris 7-Denis Diderot.
Correspondence to: A. Cossé-Barbi; e-mail: cosse@Paris7.jussieu.fr
Journal of Computational Chemistry, Vol. 18, No. 15, 1875–1892 (1997). © 1997 John Wiley & Sons, Inc. CCC 0192-8651/97/151875-18

Since 1973, following Gund's proposals [7], computational programs well adapted to 3D data bases have been developed. Current methods for examining 3D similarities fall into two categories:

1. Comparison of areas or volumes (van der Waals volumes [8–10], electron densities [11, 12], etc.) requiring a continuous optimization procedure.

2. Comparison of atom positions or of interatomic distances requiring an optimization in discrete space [13–18].

Whatever the comparison, type (1) or (2), programs are able to overlap only small structures (a few tens of atoms). It is easy to match discrete structures if the atom correspondences are known. In this case, we have only to determine a translation and rotation step for one molecule to overlap it with the other.

In fact, the atom correspondences are rarely known. The challenge is precisely to find them. Some programs search for a predetermined 3D pattern in a structure (SubStructure search, SS). Others search for a three-dimensional pattern common to two or more structures (Common SubStructure search, CSS). Any method for handling the latter problem must be able to solve the former, but the reverse is not true. Whatever the aim (SS or CSS), it is necessary to find the atom correspondences before overlapping the structures.

This atom correspondence search leads to a combinatorial problem. The number of possible atomic correspondences to be screened increases dramatically with molecular size. Techniques such as exhaustive tree searches with branch-and-bound pruning [19], the use of neural networks [20], or simulated annealing [21] are designed to reduce the combinatorial problem.

Clearly, the difficulty is inherent in the discrete nature of the entities to be matched. We propose to overcome it by going from discrete to continuous space. The basic idea is as follows:

• First, we do not try to match the two discrete entities but simply to move one of the entities (the smaller) on a continuous representation of the other (the greater). This fitting leads to an atom assignment.

• Second, we control this assignment and adjust the atom positions of the two discrete entities more closely.

To prevent possible misunderstanding about the method, it is useful to make some points immediately. As starting data, the method requires only the Cartesian coordinates of the two entities to be compared. Atom labeling (although this could be introduced in further improvements), connectivities, and any kind of prescreening are unnecessary. Moreover, the atom assignments are sought by minimizing a function. Consequently, as opposed to current methods based on atom-by-atom treatment, the match is not constrained by any matching tolerance.

Method for Circumventing the Combinatorial Problem

Current methods for 3D discrete pattern recognition proceed in the following order:

• They establish that the discrete substructure is contained in the discrete structure.

• If this is the case and, if necessary, they compute the substructure translation and rotation step $T$ required to overlap the SS on the structure.

We propose to proceed in the reverse order. We search for the rotation and translation step $T'$ whose existence allows us to assume the inclusion of the substructure in the structure. Consequently, the 3D subgraph search problem becomes a step-search analytical problem.

PRINCIPLE OF THE METHOD

Figure 1 summarizes the principle of the method in two-dimensional (2D) space. Let us suppose that the problem to solve is to find a discrete four-atom pattern included somewhere in a seven-atom structure. Current methods have to screen 840 possible correspondences. We avoid this painstaking screening in the following way. The seven-atom structure is first interpolated by a continuous function called RR (Fig. 1b) and the structure atoms are momentarily put aside (Fig. 1c). Let us call $T'$ the translation and rotation step necessary to fit the four-atom pattern on RR.

FIGURE 1. Principle of method. (a) Problem to solve. (b) Introducing continuous interpolation, RR, of the structure. (c) Setting aside structure atoms. (d) Fitting substructure pattern onto RR. (e) Deducing atom assignments. (f) Return to initial situation, assignments being known. (g) Fitting two discrete patterns.

Then, the four-atom pattern is fitted on RR and $T'$ is computed (Fig. 1d). At this stage, continuous interpolation is abandoned. We turn our attention back to the seven atoms of the structure. Each substructure atom is assigned to the nearest structure atom (Fig. 1e). With the atom assignment being known, we return to the initial situation (Fig. 1f). Let us call $T$ the rotation and translation step necessary to fit the substructure pattern on the structure one. The substructure pattern is fitted on the structure pattern and $T$ is computed (Fig. 1g).
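The count of 840 correspondences quoted for the 2D example is the number of ordered one-to-one assignments of four pattern atoms to seven structure atoms, 7!/(7 − 4)! = 840. A minimal sketch of that arithmetic follows; the function name `count_correspondences` is hypothetical and is not taken from the article.

```python
import math


def count_correspondences(n_structure: int, n_pattern: int) -> int:
    """Ordered one-to-one assignments of pattern atoms to structure atoms."""
    # math.perm(n, k) = n! / (n - k)!
    return math.perm(n_structure, n_pattern)


print(count_correspondences(7, 4))  # 840
```

The same function shows how quickly an exhaustive screening grows with size, which is the combinatorial problem the continuous fit is meant to sidestep.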

GENERALIZATION IN 3D SPACE: DETAILED DESCRIPTION

Going from Discrete to Continuous Space

The continuous representation of the structure is obtained by projecting its $N_s$ atoms onto two planes of a Cartesian coordinate system and interpolating through the projections.

Let us denote, by $P_{xy}$ and $P_{xz}$, the interpolation functions in planes $Oxy$ and $Oxz$, respectively. For each atom $At_k$ of the structure with atom coordinates $x_k$, $y_k$, and $z_k$, we have:

$$
\begin{aligned}
P_{xy}(x_k) &= y_k \\
P_{xz}(x_k) &= z_k
\end{aligned}
\tag{1}
$$

This continuous representation, RR, is not unique and contains an extra item of information (all the points in between the real projections of the discrete structure), but it presents the only relevant property; that is, it contains all the geometrical information. This RR representation is associated with the structure. It can be stored to constitute a structural data base, and used when necessary for particular 3D pattern searches.

The interpolation function may be any function provided that it is differentiable and continuous at each projection point. A polynomial representation was attractive because polynomial parameters can be stored easily. We have discarded Legendre polynomials, whose major drawback is that the degree of the polynomial increases with the structure size, and have chosen a cubic spline interpolation [22]. Here, whatever the structure size, the polynomials have the same degree (three) and it is only the number of parameters that increases with size. However, some caution is necessary to ensure continuity and differentiability at each point and we have to examine more closely: (i) the ends of the RR representation; and (ii) situations in which two atoms have the same projection.

(i) The polynomial representation starts "before" ($x < x_k^{\min}$) and finishes "after" ($x > x_k^{\max}$) the set of projection points. This is obtained by adding two virtual points to the real projections in each plane $xy$ and $xz$. These virtual points are chosen very far away (1000-Å variation in each coordinate) from the "first" ($x_k^{\min}$) and the "last" ($x_k^{\max}$) real projection points to assist the convergence process ($T'$ search).

(ii) Two atoms might be projected onto the same point. In other words, we could have two $y$ or $z$ values (or more) for the same $x$. In this case, a polynomial representation is nevertheless created by increasing one of the two ($y$ or $z$) values by a small amount, $10^{-5}$ Å for instance. However, it is important to notice that the structure coordinates are not modified, the $10^{-5}$ Å increment being introduced only to make it possible to construct the continuous representation.
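The construction of the RR representation can be sketched as follows. This is an illustration under stated assumptions, not the authors' C implementation: the spline uses natural end conditions (the article only cites a cubic spline), the virtual end points are displaced by 1000 Å with an arbitrarily chosen sign, and the $10^{-5}$ Å increment is applied here to the knot abscissa so that the knots are strictly increasing, which may differ from what the authors did. The names `natural_cubic_spline` and `build_rr` are hypothetical.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Interpolating cubic spline through (xs[k], ys[k]); xs strictly increasing.

    Natural end conditions (zero second derivative at both ends) are an
    assumption of this sketch.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system for the second derivatives m[0..n] at the knots.
    sub, diag = [0.0] * (n + 1), [1.0] * (n + 1)
    sup, rhs = [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        sub[i], diag[i], sup[i] = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(1, n + 1):  # forward elimination (Thomas algorithm)
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * (n + 1)
    m[n] = rhs[n] / diag[n]
    for i in range(n - 1, -1, -1):  # back substitution
        m[i] = (rhs[i] - sup[i] * m[i + 1]) / diag[i]

    def evaluate(x):
        # Pick the interval [xs[i], xs[i+1]] containing x (clamped at the ends).
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        t = x - xs[i]
        b = (ys[i + 1] - ys[i]) / h[i] - h[i] * (2.0 * m[i] + m[i + 1]) / 6.0
        c = m[i] / 2.0
        d = (m[i + 1] - m[i]) / (6.0 * h[i])
        return ys[i] + t * (b + t * (c + t * d))

    return evaluate


def build_rr(atoms, far=1000.0, eps=1e-5):
    """Return (P_xy, P_xz) for a list of (x, y, z) atoms; `atoms` is not modified."""
    xs, ys, zs = [], [], []
    for x, y, z in sorted(atoms):
        if xs and x <= xs[-1]:
            x = xs[-1] + eps  # assumption: nudge the knot, never the stored atom
        xs.append(x)
        ys.append(y)
        zs.append(z)
    # Virtual end points, displaced by `far` in each coordinate (sign assumed).
    ys = [ys[0] - far] + ys + [ys[-1] + far]
    zs = [zs[0] - far] + zs + [zs[-1] + far]
    xs = [xs[0] - far] + xs + [xs[-1] + far]
    return natural_cubic_spline(xs, ys), natural_cubic_spline(xs, zs)
```

Calling `build_rr(atoms)` returns two callables that satisfy eq. (1) at every atom whose x value is unique; only the returned functions would need to be stored to reuse the representation.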

T′ Step-Search

$T'$ moves the projection of the substructure atoms, $At_i$, whose coordinates are $x_i$, $y_i$, and $z_i$, onto the RR representation. $T'$ depends on six parameters: the three rotation angles, $\theta_x$, $\theta_y$, and $\theta_z$, around the three axes, $Ox$, $Oy$, and $Oz$, and the three translations, $tr_x$, $tr_y$, and $tr_z$.

Depending on the order chosen for the individual translations and rotations, there are many steps to adjust the 3D discrete pattern on the RR representation. Our purpose being not to find an optimal translation and rotation step, we adopt the following arbitrary order:

$$
T' = T_{tr_z} \circ T_{tr_y} \circ T_{tr_x} \circ T_{\theta_z} \circ T_{\theta_y} \circ T_{\theta_x}
\tag{2}
$$

$T'$ is written as eq. (3):

$$
T' =
\begin{pmatrix}
\cos\theta_y \cos\theta_z & \sin\theta_x \sin\theta_y \cos\theta_z - \cos\theta_x \sin\theta_z & \cos\theta_x \sin\theta_y \cos\theta_z + \sin\theta_x \sin\theta_z & tr_x \\
\cos\theta_y \sin\theta_z & \sin\theta_x \sin\theta_y \sin\theta_z + \cos\theta_x \cos\theta_z & \cos\theta_x \sin\theta_y \sin\theta_z - \sin\theta_x \cos\theta_z & tr_y \\
-\sin\theta_y & \sin\theta_x \cos\theta_y & \cos\theta_x \cos\theta_y & tr_z \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{3}
$$


and denoting by $x'_i$, $y'_i$, and $z'_i$ the $At_i$ coordinates after the $T'$ transformation, we have:

$$
\begin{aligned}
x'_i &= \cos\theta_y \cos\theta_z \, x_i
      + (\sin\theta_x \sin\theta_y \cos\theta_z - \cos\theta_x \sin\theta_z) \, y_i
      + (\cos\theta_x \sin\theta_y \cos\theta_z + \sin\theta_x \sin\theta_z) \, z_i + tr_x \\
y'_i &= \cos\theta_y \sin\theta_z \, x_i
      + (\sin\theta_x \sin\theta_y \sin\theta_z + \cos\theta_x \cos\theta_z) \, y_i
      + (\cos\theta_x \sin\theta_y \sin\theta_z - \sin\theta_x \cos\theta_z) \, z_i + tr_y \\
z'_i &= -\sin\theta_y \, x_i + \sin\theta_x \cos\theta_y \, y_i + \cos\theta_x \cos\theta_y \, z_i + tr_z
\end{aligned}
\tag{4}
$$
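As an illustration, eq. (4) can be transcribed directly. The sketch below assumes angles in radians and a right-handed coordinate system; the function name `apply_t_prime` and the argument layout are hypothetical.

```python
import math


def apply_t_prime(point, angles, shift):
    """Apply eq. (4): rotate about Ox, then Oy, then Oz, then translate.

    point  -- (x, y, z) of a substructure atom
    angles -- (theta_x, theta_y, theta_z), in radians (assumed unit)
    shift  -- (tr_x, tr_y, tr_z)
    """
    x, y, z = point
    cx, sx = math.cos(angles[0]), math.sin(angles[0])
    cy, sy = math.cos(angles[1]), math.sin(angles[1])
    cz, sz = math.cos(angles[2]), math.sin(angles[2])
    xp = cy * cz * x + (sx * sy * cz - cx * sz) * y + (cx * sy * cz + sx * sz) * z + shift[0]
    yp = cy * sz * x + (sx * sy * sz + cx * cz) * y + (cx * sy * sz - sx * cz) * z + shift[1]
    zp = -sy * x + sx * cy * y + cx * cy * z + shift[2]
    return (xp, yp, zp)
```

For example, a quarter turn about Oz with no translation sends the point (1, 0, 0) to (0, 1, 0) up to rounding, which is a quick check that the signs agree with the composition order of eq. (2).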

If the 3D pattern is included in the structure with exactly the same spatial disposition of atoms, its projection after the translation and rotation step $T'$ must be included exactly in the continuous RR representation. We must have:

$$
\begin{aligned}
P_{xy}(x'_i) &= y'_i \\
P_{xz}(x'_i) &= z'_i
\end{aligned}
\tag{5}
$$

In a realistic case, a 3D pattern is never found in a molecule with exactly the same spatial disposition of atoms. Therefore, the most we can do is to find the best fit of the substructure projections on RR. To achieve this, an obvious way is to minimize a quantity (or quantities) related to local distances (parallel to the $y$ and/or $z$ axis) between the substructure atom projections and the polynomial representation (Fig. 2). One can use two different optimizations in the two projection planes, followed by the necessary interrelation between them, or minimize a quantity extended over all substructure atom projections, whatever the projection plane, $xy$ or $xz$. We have found this second way to be the more convenient. Therefore, we propose to search for the six parameters that minimize the following overall quantity, $QT$:

FIGURE 2. Local distances (vertical segments) involved in QT.

$$
QT = \sum_{i=1}^{N_{ss}} \left[ \left( P_{xy}(x'_i) - y'_i \right)^2 + \left( P_{xz}(x'_i) - z'_i \right)^2 \right]
\tag{6}
$$

with $N_{ss}$ being the number of SS atoms.

At this stage, it is clear that computing $QT$ requires only the substructure atom Cartesian coordinates and the parameters of the polynomial representation of the structure.

Solving eq. (6) requires only numerical methods. Many tools are possible. Here we use the BFGS algorithm [23], because it converges, even in cases where we start far from the solution.
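The evaluation of $QT$ in eq. (6) can be sketched as below. The two interpolation functions are passed in as plain callables; in the test-style usage that follows they are simple stand-in lines rather than splines, which is an assumption made only to keep the example short. The names `t_prime` and `qt` are hypothetical, and the minimization itself (BFGS in the article) is not reproduced here.

```python
import math


def t_prime(point, params):
    """Eq. (4) with params = (theta_x, theta_y, theta_z, tr_x, tr_y, tr_z)."""
    x, y, z = point
    ax, ay, az, tx, ty, tz = params
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    return (
        cy * cz * x + (sx * sy * cz - cx * sz) * y + (cx * sy * cz + sx * sz) * z + tx,
        cy * sz * x + (sx * sy * sz + cx * cz) * y + (cx * sy * sz - sx * cz) * z + ty,
        -sy * x + sx * cy * y + cx * cy * z + tz,
    )


def qt(params, sub_atoms, p_xy, p_xz):
    """Eq. (6): sum of squared vertical gaps to the two projected curves."""
    total = 0.0
    for atom in sub_atoms:
        xp, yp, zp = t_prime(atom, params)
        total += (p_xy(xp) - yp) ** 2 + (p_xz(xp) - zp) ** 2
    return total


# Stand-in curves (assumption): straight lines instead of the cubic splines.
pattern = [(0.0, 0.0, 0.0), (1.0, 2.0, -1.0), (2.0, 4.0, -2.0)]
print(qt((0, 0, 0, 0, 0, 0), pattern, lambda x: 2 * x, lambda x: -x))  # 0.0
print(qt((0, 0, 0, 0, 1, 0), pattern, lambda x: 2 * x, lambda x: -x))  # 3.0
```

A pattern lying exactly on both curves gives $QT = 0$, the situation of eq. (5); shifting it by 1 Å along $y$ adds a unit gap per atom. Any general-purpose minimizer over the six parameters could be wrapped around `qt`.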

Atom Assignment

$T'$ moves the substructure atom projections onto the RR representation, on or in between the real atom projections of the structure. The proximity of the substructure and structure atom projections is the basis for atom assignment, each SS atom being assigned to the nearest structure atom by screening a distance matrix. This procedure could be a combinatorial problem in itself. To circumvent this drawback, we proceed in the following way. A first screening deals with all atom pairs $At_i$, $At_k$ with local distances $\|At_i\,At_k\|$ less than or equal to 0.05 Å. Two atoms, $At_i$ and $At_k$, are paired if the local distance between them does not exceed 0.05 Å. These two atoms are then set aside. After this first screening, all local distances $\|At_i\,At_k\|$ to be considered exceed 0.05 Å. We then look for the minimum local distance, and the two atoms $At_i$ and $At_k$ corresponding to this lowest local distance are paired and set aside. This second screening is repeated until each substructure atom is matched with a structure atom.
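The two screenings described above can be sketched as follows. The sketch assumes that the local distance is the ordinary 3D Euclidean distance between the transformed substructure atom and the structure atom, and that the structure has at least as many atoms as the substructure; the function name `assign_atoms` is hypothetical.

```python
import math


def assign_atoms(sub_atoms, struct_atoms, threshold=0.05):
    """Return {substructure index: structure index} after both screenings."""
    dist = {
        (i, k): math.dist(a, b)
        for i, a in enumerate(sub_atoms)
        for k, b in enumerate(struct_atoms)
    }
    free_sub = set(range(len(sub_atoms)))
    free_struct = set(range(len(struct_atoms)))
    assignment = {}

    def pair(i, k):
        assignment[i] = k
        free_sub.discard(i)
        free_struct.discard(k)

    # First screening: every pair closer than the threshold is set aside.
    for (i, k), d in sorted(dist.items(), key=lambda item: item[1]):
        if d > threshold:
            break
        if i in free_sub and k in free_struct:
            pair(i, k)

    # Second screening: repeatedly pair the closest remaining atoms.
    while free_sub and free_struct:
        i, k = min(
            ((i, k) for i in free_sub for k in free_struct),
            key=dist.__getitem__,
        )
        pair(i, k)
    return assignment
```

Each atom is used at most once, so the result is a one-to-one assignment even when the fit is poor; whether that assignment is acceptable is judged afterwards from the distances.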

This atom assignment step provides a unique correspondence for strong similarities between the two patterns as well as for strong dissimilarities. There are two reasons for this:

• First, for chemical patterns, it is impossible to find two substructure atoms $At_i$ located at 0.05 Å (or less) from the same structure atom $At_k$, or the same substructure atom located at 0.05 Å (or less) from two structure atoms.

• Second, it is very improbable to find two substructure atoms (or structure atoms) at the same distance from the same structure (substructure) atom. If this were the case, we would abandon the resulting combinatorial problem, because, by varying the SS initial location (see the "Repeating the Entire Procedure" section), we could recover the lost isomorphism if this latter were of interest.

T Step Determination: Inclusion Accuracy

The accuracy of the assignment must be controlled. For this purpose, we compute the rotation and translation step $T$ by minimizing the root mean square distance between the corresponding atoms of the two discrete entities.

Let us call $x''_i$, $y''_i$, and $z''_i$ the Cartesian coordinates of the SS atoms after the $T$ transformation; the RMS is written:

$$
\mathrm{RMS} = \sqrt{ \frac{1}{N_{ss}} \sum_{i=1}^{N_{ss}} \left[ (x''_i - x_k)^2 + (y''_i - y_k)^2 + (z''_i - z_k)^2 \right] }
\tag{7}
$$

This RMS measures the accuracy of the assignment of the predetermined 3D pattern of atoms to the structure atoms. The isomorphism being known, the range of possible algorithms to determine $T$ is larger than for $T'$. Nevertheless, as for the $T'$ computation, we use the BFGS algorithm here also.

To summarize, the entire procedure just described does not require any local or overall threshold. For this reason, it always provides an atom assignment and the corresponding best match, whatever the similarity or the dissimilarity of the two discrete patterns to be compared. Consequently, in contrast to other methods based on narrow local adjustments and comparisons with local thresholds, our method does not establish the absence of a match, but only gives its accuracy. The user is left free to accept or to reject the match. This decision is helped both by the overall criterion, the RMS, and by an ordered table of local distances $\|At_i\,At_k\|$.
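The two diagnostics handed to the user, the RMS of eq. (7) and the ordered table of local distances, can be sketched as below. The sketch assumes that the substructure coordinates have already been moved by $T$, and it sorts the table from the largest to the smallest distance, a choice the article does not specify. The name `rms_and_table` is hypothetical.

```python
import math


def rms_and_table(sub_atoms, struct_atoms, assignment):
    """Return (RMS of eq. 7, list of (distance, i, k) sorted largest first)."""
    local = [
        (math.dist(sub_atoms[i], struct_atoms[k]), i, k)
        for i, k in assignment.items()
    ]
    rms = math.sqrt(sum(d * d for d, _, _ in local) / len(local))
    return rms, sorted(local, reverse=True)


rms, table = rms_and_table(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    {0: 0, 1: 1},
)
print(round(rms, 4))  # 0.7071
```

Listing the worst pairs first makes it easy to see whether a poor RMS comes from one badly placed atom or from a uniformly loose fit.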

REPEATING THE ENTIRE PROCEDURE

Why a Scan? QT Multiple Minima

The function $QT$ always has several minima and the convergence process may lead to a local one while the absolute one is sought (i). In some applications, several minima are sought (polymeric structures, ii) and not necessarily the best ones (common pattern recognition, iii). Whatever the aim, it is necessary to repeat the entire procedure by modifying the initial location of the substructure with respect to the RR representation.

(i) Seeking the best overlap. If the two patterns are identical (or almost identical), there is a unique way (or very few ways, perhaps two or three) to adjust the substructure on RR. Conversely, if the dissimilarity is strong and the adjustment poor, there are many ways of adjusting SS on RR. The first one found may lead to a nonoptimal assignment. In this case, better solutions must be sought.

(ii) Patterns included many times in a structure. Some organic and bioorganic materials are polymers reproducing the same pattern many times with small variations. Local 3D dissimilarities are of interest for detection in these polymeric structures because they are involved in the reactivity of such systems. For example, it is well known that the reactivity of RNA structures is related to minute variations in the 3D sugar pattern; the opening of DNA base pairs concomitant with the approach of a drug involves the kinking of the double strand; and the interaction of a protein with a receptor involves folding. To recognize many similar patterns, a scan is necessary.

(iii) Common pattern recognition. Later we will extend the algorithm to the recognition of patterns common to two structures. In such a search for pharmacophoric patterns, the best assignment may not be relevant to a particular biological or biochemical application, and other assignments corresponding to poorer overlaps and local $QT$ minima may be of greater interest. Moreover, the overall similarity criterion, the RMS, allows us to compare only common patterns of similar size. For two common patterns of different sizes, there is no means of determining which is the best. To recognize common patterns, our algorithm takes advantage of the multiple $QT$ minima.


How to Scan

The most convenient way to seek other isomorphisms is to scan along the $x$ axis. In this way, we take advantage of the particular form chosen for the RR representation with the same variable, $x_k$, for the two spline interpolations, $P_{xy}$ and $P_{xz}$. To make the scan efficient, we must: (i) prepare the structure; (ii) carefully define the alignment space; and (iii) choose a step for varying the initial location of the substructure pattern:

(i) Before the alignment procedure, the structure is rotated so that its largest extension is along the $x$ axis.

(ii) To begin the scan, the substructure is translated along the $x$ axis to make its greatest $x_i$ coordinate equal to the smallest $x_k$ coordinate of the structure. The scan ends when the smallest substructure $x_i$ coordinate becomes greater than the greatest $x_k$ coordinate of the structure.

(iii) The step along the $x$ axis separating two SS initial locations must be chosen carefully. We have tested this point for structures in which a given pattern is repeated many times with some geometrical variation. If the step is too large, some occurrences will be missing and, if it is too small, the same solution will be obtained many times. In our applications, we have found 1 Å to be a reasonable value, but other applications could require different values.
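Points (ii) and (iii) define the set of starting positions of the scan. A minimal sketch is given below; it returns the $x$ translations to apply to the substructure before each fit, and assumes the structure has already been rotated as in point (i). The function name `scan_offsets` and the default step of 1 Å taken from the text are the only parameters.

```python
def scan_offsets(sub_xs, struct_xs, step=1.0):
    """x translations of the substructure for successive starting locations.

    sub_xs    -- x coordinates of the substructure atoms
    struct_xs -- x coordinates of the structure atoms
    step      -- spacing between two starting locations (1 A in the article)
    """
    # Start: greatest substructure x coincides with smallest structure x.
    start = min(struct_xs) - max(sub_xs)
    # End: beyond this shift the smallest substructure x exceeds the
    # greatest structure x.
    end = max(struct_xs) - min(sub_xs)
    offsets = []
    n = 0
    while start + n * step <= end:
        offsets.append(start + n * step)
        n += 1
    return offsets


print(scan_offsets([0.0, 1.0, 2.0], [10.0, 15.0]))
# [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
```

Each offset would seed one complete run of the procedure ($T'$ search, atom assignment, $T$ determination); duplicate solutions obtained from neighbouring offsets can then be merged.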

Assessment of Algorithm Performances

The algorithm, including the numerical tools (the BFGS algorithm and the spline approximation), is written in C and run on an AlphaServer 2100. The aim of this section is to test its performance by comparing it with other methods.

HOW TO COMPARE OUR RESULTS WITH OTHERS

The literature provides timings for several algorithms. Nevertheless, it is impossible to compare them, because the range of hardware is too large and there is no simple way of putting CPU times on the same scale.

Moreover, some methods, such as simulated annealing, neural networks, and our method, minimize a function sometimes called an objective function. In contrast with atom-by-atom treatments, they do not use any matching tolerance. In these methods, the conditions for stopping the convergence process (iteration number, the smallest allowed difference between two successive values of the function to minimize, etc.) may depend on the tools chosen to ensure convergence (BFGS algorithm, etc.). For these reasons, one cannot reproduce a study similar to that of Brint and Willett comparing four methods by running them on the same computer with the same distance tolerance [17]. Nevertheless, we can try to answer several questions:

1. For substructures exactly included in a structure with the same disposition of atoms, or for substructures very slightly perturbed from the exact one, the exact assignment is known. Is the method able to produce the exact assignment?

2. What is its behavior when the dissimilarity between the two patterns is increased?

3. How does it behave when the sizes of the patterns to be matched increase?

It is not easy to isolate the role of structure size from that of substructure size. One can keep the latter constant and vary the former. In this case, of the total number of atoms, the fraction to be recognized ($N_{ss}/N_s$) varies and this parameter may seriously affect the fit of the discrete pattern onto the continuous representation. One can also work with a fixed $N_{ss}/N_s$ ratio (1, 0.5, ...) and change the structure size. To address these issues, we performed three studies.

First, we tested the ability of the algorithm to produce an exact match by using identical coordinates for the two discrete entities but a different numbering for the atoms and, of course, a different location in 3D space, with the substructure covering half the structure ($N_s = 2N_{ss}$) or the entire structure ($N_s = N_{ss}$). The molecular sizes were 20, 70, 134, and 316 atoms. For the first three ($N_s$ = 20, 70, and 134) [24], the Cartesian coordinates were extracted from the Cambridge Crystallographic Database [2]. The last structure, with 316 atoms, was an oligonucleotide whose coordinates were obtained by an empirical calculation [25] (JUMNA program [26]) with NMR constraints. For the eight cases in Table I, the number of possible atom correspondences varies from $4.7 \times 10^{11}$ ($N_s = 2N_{ss} = 20$) to $4.6 \times 10^{652}$ ($N_s = N_{ss} = 316$).


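TABLE I. Search for a Given Pattern Exactly or Inexactly Included in a Structure. Comparison with Two Methods Based on Minimization of an Objective Function.(a)

Method | Computer | Sizes | SS perturbation | Correct assignment percentage (if available) | Final mean distances (Å) | CPU times (seconds)
------ | -------- | ----- | --------------- | -------------------------------------------- | ------------------------ | -------------------
Neural networks [20] | DS 5000/200 DEC | Ns = 26, Nss = 5 | 0 | 20–82% | | 1.2
Neural networks [20] | DS 5000/200 DEC | Ns = 26, Nss = 5 | ±3% | 15–55% | | 1.8
Simulated annealing [21] | IBM 3084Q | Ns = Nss = 20 | 0 | 86–100% | 0–0.095 | 0.18–1.68
Simulated annealing [21] | IBM 3084Q | Ns = Nss = 70 | 0 | 73–92% | 0.09–0.028 | 10–51
Simulated annealing [21] | IBM 3084Q | Ns = Nss = 150 | 0 | 45–82% | 0.016–0.082 | 132–1490
Simulated annealing [21] | IBM 3084Q | Ns = Nss = 20 | ±16% | n.a. | n.a. | 0.53–4.88
Simulated annealing [21] | IBM 3084Q | Ns = Nss = 150 | ±16% | n.a. | n.a. | 157–>1080
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 20 | 0 | 100% | 0 | 0.65
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 70 | 0 | 100% | 0 | 0.94
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 134 | 0 | 100% | 0 | 2.31
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 316 | 0 | 100% | 0 | 3.78
Discrete versus continuous (this work) | AS 2100 | Ns = 20, Nss = 10 | 0 | 100% | 0 | 0.55
Discrete versus continuous (this work) | AS 2100 | Ns = 70, Nss = 35 | 0 | 100% | 0 | 0.79
Discrete versus continuous (this work) | AS 2100 | Ns = 134, Nss = 67 | 0 | 100% | 0 | 1.28
Discrete versus continuous (this work) | AS 2100 | Ns = 316, Nss = 158 | 0 | 100% | 0 | 2.76
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 20 | ±8% | n.a. | 0.43 | 1.16
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 70 | ±8% | n.a. | 0.53 | 1.69
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 134 | ±8% | n.a. | 0.44 | 4.56
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 20 | ±20% | n.a. | 0.94 | 1.18
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 70 | ±20% | n.a. | 1.01 | 3.90
Discrete versus continuous (this work) | AS 2100 | Ns = Nss = 134 | ±20% | n.a. | 0.99 | 6.63

Scale factors marked beside the rows: simulated annealing, size ×7.5 with CPU time ×>430 (perturbation 0) and ×>140 (±16%); this work, size ×15.8 with CPU time ×5.8 (Ns = Nss) and ×5 (Ns = 2Nss), and size ×6.7 with CPU time ×3.9 (±8%) and ×5.6 (±20%).

(a) Patterns exactly included: mean times for ten different numberings and locations in space. Patterns inexactly included: mean times for ten substructures with different numberings and locations in space and differently perturbed from the exact one.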


In a second test, in addition to the numbering and overall location variations, we introduced some random variation (±8%, ±20%) in the positions of the pattern points.

In a third test, the $N_{ss}/N_s$ fraction to be recognized was systematically changed.

Table I compares our results with those of two methods based on the minimization of an objective function, the neural network method [20] and simulated annealing [21]. Table II compares our results with atom-by-atom treatments [13–18], devoted especially to recognizing small substructure patterns.

ISOMORPHISM

In the case of an exact inclusion, and whatever the atom number (Table I), our method always gives the correct assignment with a zero RMS. This is not the case for neural networks or simulated annealing. For the former, the best solution is obtained for 20% to 82% of the runs and, for the latter, the number of points correctly assigned is better but the similarity criterion deduced from the difference distance matrix (DDM) is rarely zero by the end of the procedure.

The aim of our method is not to find all possibleassignments. On the contrary, we wish to avoidmost of them and, of course, the worse ones, andto detect the best ones as quickly as possible.However, in a study presented elsewhere on geo-metric chirality scales, we compared our methodwith a method proposed by Rassat,27 based on theHausdorff distance.28 In the framework of thiscomparison, it was of interest to determine if allcorrespondences could be obtained. For the over-lap of a dissymmetrical triangle with its ‘‘enanti-omer’’ in 2D space, our method detects the threerelevant isomorphisms of the six possible. Theother three permute two SS atoms so that theseatoms are not paired with the nearest structure

TABLE II. Search for a Given Pattern Exactly Included in a Structure. Comparison with Atom-by-Atom Treatments.a

Method                     Computer     Sizes                 Matching    Correct      CPU Times
                                                              Tolerance   Assignment   (seconds)

Lesk17                     Prime 9950   Ns = 60,  Nss = 5     0.25 Å      100%         1.36
                                        Ns = 60,  Nss = 9     0.25 Å      100%         1.31
                                        Ns = 60,  Nss = 15    0.25 Å      100%         4.54

Set reduction17            Prime 9950   Ns = 60,  Nss = 5     0.25 Å      100%         0.80
                                        Ns = 60,  Nss = 9     0.25 Å      100%         2.62
                                        Ns = 60,  Nss = 15    0.25 Å      100%         8.96

Clique detection17         Prime 9950   Ns = 60,  Nss = 5     0.25 Å      100%         1.21
                                        Ns = 60,  Nss = 9     0.25 Å      100%         2.94
                                        Ns = 60,  Nss = 15    0.25 Å      100%         9.03

Ullman17                   Prime 9950   Ns = 60,  Nss = 5     0.25 Å      100%         0.27
                                        Ns = 60,  Nss = 9     0.25 Å      100%         0.43
                                        Ns = 60,  Nss = 15    0.25 Å      100%         0.92

Discrete versus            AS 2100      Ns = 70,  Nss = 6     none        100%         0.82
continuous (this work)                  Ns = 70,  Nss = 9     none        100%         0.78
                                        Ns = 70,  Nss = 15    none        100%         0.68
                                        Ns = 70,  Nss = 35    none        100%         0.79
                                        Ns = 70,  Nss = 52    none        100%         0.99
                                        Ns = 70,  Nss = 70    none        100%         0.94
                                        Ns = 316, Nss = 14    none        100%         2.94
                                        Ns = 316, Nss = 40    none        100%         2.88
                                        Ns = 316, Nss = 79    none        100%         2.03
                                        Ns = 316, Nss = 158   none        100%         2.76
                                        Ns = 316, Nss = 237   none        100%         3.89
                                        Ns = 316, Nss = 316   none        100%         3.78

a Mean times for ten different numberings and locations in space.


atoms. The same experiment on the overlap of a tetrahedron and its enantiomer leads to the 12 relevant isomorphisms of the 24 possible ones.

INCREASING STRUCTURE SIZES

Even if we cannot compare the absolute CPU times required by different methods, we can compare the time variation when the structure size is increased. For substructures exactly included in a structure (Table I), multiplying the structure size by about 16 multiplies the CPU time by five- or sixfold depending on Nss/Ns (0.5 or 1). This behavior must be appreciated by comparing it with the simulated annealing method in which the CPU time is multiplied by more than 430 when the structure atom number is multiplied by 7.5.

For substructures inexactly included in a structure, the times increase slightly, but the observation is similar. Multiplying the structure size by about seven multiplies the CPU time by about four- or sixfold, depending on the amount of substructure perturbation with respect to the exactly included one (±8% or ±20%). With the simulated annealing method the CPU time is multiplied by more than 140 when the structure size is increased by a factor of 7.5.

INCREASING SUBSTRUCTURE SIZES

Atom-by-atom treatments were applied to recognize small SS patterns up to a quarter of the structure. The time then increases monotonically with the substructure size. Our method behaves differently. Its performance depends only slightly on Nss/Ns. Shallow minima are obtained for substructures covering about a quarter of the structure.

CPU TIME CONSIDERATIONS

The time variations for increasing pattern sizes are summarized in Figures 3 and 4. The time dependence is roughly logarithmic with respect to structure size (Fig. 3) whether the substructure covers half or all of the structure. It follows that our algorithm might be particularly attractive for structures with a few hundred atoms. With respect to SS size (Fig. 4), the time seems to depend on at least two conflicting factors: (i) Substructures covering most of the structure (Nss/Ns = 1) have very few ways to adjust on RR while SS, corresponding to small Nss/Ns values, have many possible locations on RR. (ii) To assign atoms, we screen a distance matrix, whose size (Ns × Nss) increases with the two pattern sizes.
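Factor (ii) can be made concrete with a small sketch. The Python below builds the Ns × Nss distance table and, purely as an assumption for illustration, screens it with a plain nearest-neighbour rule; the actual assignment procedure is the one described in the "Atom Assignment" section, and the function names here are hypothetical.

```python
import math

def distance_matrix(structure, substructure):
    """Return an Ns x Nss table: entry [i][j] is the Euclidean distance
    between structure atom i and substructure atom j."""
    return [[math.dist(s, ss) for ss in substructure] for s in structure]

def nearest_structure_atoms(structure, substructure):
    """For each substructure atom, return the index of the closest
    structure atom (assumed nearest-neighbour screen, illustration only)."""
    dm = distance_matrix(structure, substructure)
    return [min(range(len(structure)), key=lambda i: dm[i][j])
            for j in range(len(substructure))]

# Toy coordinates (hypothetical): four structure atoms, two substructure atoms.
structure = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
substructure = [(0.9, 0.1, 0.0), (0.0, 0.1, 0.9)]
print(nearest_structure_atoms(structure, substructure))  # [1, 3]
```

The table has Ns × Nss entries, so the screening cost grows with both pattern sizes. A plain nearest-neighbour rule may also pair two substructure atoms with the same structure atom, which a complete assignment step has to resolve.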

Finally, increasing the SS size facilitates the fit on RR, but slows down the atom assignment step, while the structure size affects mainly the time required to assign the atoms. The dependence of the CPU time on structure size could be a consequence of the particular technical solutions chosen to simplify the assignment step.

We conclude that going from discrete to continuous space eliminates combinatorial problems and provides a fast and reliable method for determining pattern atom correspondences.

Further Adaptations and Applications

TAKING CERTAIN PROPERTIES INTO ACCOUNT

In the previous tests we fitted the discrete pattern onto the continuous RR representation by minimizing a quantity QT [eq. (6)] roughly related to the "distance" between the discrete pattern and the RR representation. Only the geometrical parameters were taken into account.

However, in some applications, the nature of atoms and/or their ability to bind with a receptor site are relevant properties for a 3D structure comparison. The quantity, QT, to be minimized can be modified to take into account this type of problem.

Let t_i be a numerical substructure atom property. QT could be, for example:

$$Q_T = \sum_{i=1}^{N_{ss}} \left[ \left(P_{xy}(x'_i) - y'_i\right)^2 + \left(P_{xz}(x'_i) - z'_i\right)^2 + \left(P_{xt}(x'_i) - t'_i\right)^2 \right] \tag{8}$$

Thus, t_i is the At_i atomic number if the user wishes to match only similar atoms. The property t_i may be defined differently if we wish to match all the atoms with lone pairs, whatever their nature. In this case, a zero t_i value is assigned to atoms with lone pairs and a value of t_i = 1 to the others.

In addition, the property t_i may be weighted by p_i in eq. (9):

$$Q_T = \sum_{i=1}^{N_{ss}} \left[ \left(P_{xy}(x'_i) - y'_i\right)^2 + \left(P_{xz}(x'_i) - z'_i\right)^2 + p_i \left(P_{xt}(x'_i) - t'_i\right)^2 \right] \tag{9}$$
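As a minimal sketch of how eqs. (8) and (9) could be evaluated, the Python below sums the squared deviations for a list of already-transformed substructure atoms. The three callables standing in for the spline projections (`p_xy`, `p_xz`, `p_xt`) are hypothetical toy functions, not the spline approximation used in this work; with all weights equal to 1 the expression reduces to eq. (8).

```python
def q_t(points, props, p_xy, p_xz, p_xt, weights=None):
    """Evaluate the quantity of eq. (9); with unit weights this is eq. (8).

    points  : transformed substructure coordinates (x, y, z)
    props   : numerical property of each substructure atom
    p_xy, p_xz, p_xt : callables of x standing in for the structure's
                       continuous representation (assumption: any callable)
    weights : optional weights p_i of the property term
    """
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for (x, y, z), t, p in zip(points, props, weights):
        total += (p_xy(x) - y) ** 2 + (p_xz(x) - z) ** 2 + p * (p_xt(x) - t) ** 2
    return total

# Toy stand-ins (hypothetical): a straight line in xy, a flat xz projection,
# and a property curve that is "carbon" (atomic number 6) everywhere.
line_xy = lambda x: 2.0 * x
flat_xz = lambda x: 0.0
carbon_everywhere = lambda x: 6.0

atoms = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]   # both atoms lie on the toy curve
atomic_numbers = [6, 8]                        # second atom is an oxygen

print(q_t(atoms, atomic_numbers, line_xy, flat_xz, carbon_everywhere))              # 4.0
print(q_t(atoms, atomic_numbers, line_xy, flat_xz, carbon_everywhere, [1.0, 0.5]))  # 2.0
```

In this toy case the geometry fits exactly, so the whole penalty comes from the property mismatch of the second atom, and halving its weight halves the penalty.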


FIGURE 3. CPU (seconds) dependence on structure size (number of atoms). (a) Substructure exactly included and covering half the structure (Ns = 2Nss). (b) Substructure exactly included and covering all the structure (Ns = Nss). (c) Substructure randomly perturbed by ±8% from the exact one (Ns = Nss). (d) Substructure randomly perturbed by ±20% from the exact one (Ns = Nss).

FIGURE 4. CPU (seconds) dependence on Nss/Ns. (a) Ns = 70. (b) Ns = 316.


SCHEME 1. P: Phosphodiester; S: Sugar; A: Adenine; T: Thymine; G: Guanine; C: Cytosine.

In more complex applications, some hydrogen-bonding heteroatoms can sometimes act as acceptors and sometimes as donors. Consequently, we have to search not only for the matching of acceptors with acceptors and donors with donors but also for the matching of atoms bearing the two properties with donors or with acceptors. To take into account this possibility, we assign two numerical properties, t_i1 and t_i2, to one atom:

t_i1 = t_i2 = 1        donor
t_i1 = t_i2 = 2        acceptor
t_i1 = 1, t_i2 = 2     donor and acceptor
t_i1 = t_i2 = 4        neither of the properties

QT to be minimized is modified as follows:

$$Q_T = \sum_{i=1}^{N_{ss}} \left[ \left(P_{xy}(x'_i) - y'_i\right)^2 + \left(P_{xz}(x'_i) - z'_i\right)^2 + p_i \left(P_{xt_1}(x'_i) - t'_{i1}\right)^2 \times \left(P_{xt_2}(x'_i) - t'_{i2}\right)^2 \right] \tag{10}$$
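The effect of the product in eq. (10) can be checked on the four property codes. The sketch below evaluates only the property factor for one atom pair, under the simplifying assumption that the two property curves take exactly the code values of the nearby structure atom; the dictionary and function names are hypothetical.

```python
# Property codes (t_i1, t_i2) as defined in the text.
CODES = {
    "donor": (1, 1),
    "acceptor": (2, 2),
    "donor and acceptor": (1, 2),
    "neither": (4, 4),
}

def property_factor(structure_kind, substructure_kind, weight=1.0):
    """Property term of eq. (10) for one atom pair.

    Assumption: the property curves P_xt1 and P_xt2 equal the codes of the
    structure atom located at the fitted position.
    """
    s1, s2 = CODES[structure_kind]
    t1, t2 = CODES[substructure_kind]
    return weight * (s1 - t1) ** 2 * (s2 - t2) ** 2

for s_kind in CODES:
    for ss_kind in CODES:
        print(f"{s_kind:20s} {ss_kind:20s} {property_factor(s_kind, ss_kind)}")
```

Because the two squared differences are multiplied, the factor vanishes as soon as either property matches: a donor-and-acceptor atom is compatible with a donor and with an acceptor, whereas donor against acceptor, or any bonding atom against a "neither" atom, is penalized.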

RECOGNITION OF A 3D PATTERN CONTAINED MANY TIMES WITH SOME STRUCTURAL VARIATION

A group from our laboratory (Dodin and Cordier) is presently working on the interaction of drugs with small oligonucleotides.29 The binding site may be modified by some local irregularity.

The structure studied here (Scheme 1) is a single strand with sequence ATGACGTCAT. It contains ten deoxyriboses (S = Sugar), nine phosphodiesters (P), and ten bases (three adenines, two guanines, two cytosines, and three thymines).

The first nucleotide unit has one sugar H atom more than the others and, for the last, the phosphate group is missing (Scheme 1). Borderline effects are expected for these two units.

Search for Third Nucleotide Unit (GSP)

Our aim is to recognize the third nucleotide, a 33-atom pattern including a guanine base (G), a sugar, and a phosphodiester. Of the ten structure nucleotides, two of them bear a guanine base, the third and the sixth units, and the other eight differ from the third in the nature of the base. To fit the discrete pattern on the spline approximation of the strand, we take into account the atom nature by means of the relationship shown in eq. (8).

Table III gives the best matches ordered according to their decreasing similarity with the third unit. As expected, units bearing a purine base, guanine or adenine, are more similar to the third unit than those bearing a pyrimidine base (thymine, cytosine).

TABLE III. Recognition of a 33-Atom Three-Dimensional Pattern, the Third GSP Nucleotide Unit, in the 316-Atom Structure, and Overlap With the Other Nucleotide Units.

Unit to be      Match with the:   Number of Atoms                RMS (Å)
Recognized                        Assigned in the Unit

Third GSP       Third (GSP)       All                            0.00
Third GSP       Sixth (GSP)       All                            0.56
Third GSP       Ninth (ASP)       S,P atoms, +11 base atoms      1.05
Third GSP       Fourth (ASP)      S,P atoms, +12 base atoms      1.11
Third GSP       First (ASP)       S,P atoms, +9 base atoms       1.29
Third GSP       Second (TSP)      S,P atoms, +7 base atoms       1.31
etc.


Our technique identifies the two (third and sixth) units bearing a guanine base, whereupon each atom of the substructure corresponds to an atom in the nth (third or sixth) unit.

Our technique also matches the GSP substructure with units differing from the third in the nature of the base. The accuracy then decreases and the match is only partial, with some atoms of the GSP substructure corresponding to atoms in the nth unit and others to atoms not in the nth unit. The RMS value and the number of atom correspondences in the nth unit allow us to appreciate the similarity between the third and the nth unit. The search could be continued by the determination of a maximal pattern common to the two units.

Search for the Third Deoxyribose

Our technique finds the third sugar and the nine others differing slightly from the third (Table IV) and each atom of the substructure corresponds to an atom in the nth sugar. If we exclude the first and last units, we have to note that odd units (fifth, seventh, ninth) are more similar to the third unit than even units (second, fourth, sixth, eighth).

An a posteriori examination of the sugar geometries shows that these odd and even sugars belong to different classes30: S with high phases and low amplitudes for the former, X with low phases and high amplitude for the latter.

SEARCH FOR COMMON SUBSTRUCTURES

Up to now, we have put off the introduction of any matching tolerance. However, if common patterns are to be retrieved, this can no longer be avoided.

TABLE IV. Recognition of a 13-Atom Three-Dimensional Pattern, the Third Sugar, Contained 10 Times With Some Variation in the 316-Atom Structure.

Unit to       Match with the:   Number of Atoms          RMS (Å)
Recognize                       Assigned in the Unit

Third S       First S           All                      0.15
Third S       Second S          All                      0.15
Third S       Third S           All                      0.00
Third S       Fourth S          All                      0.15
Third S       Fifth S           All                      0.11
Third S       Sixth S           All                      0.15
Third S       Seventh S         All                      0.10
Third S       Eighth S          All                      0.15
Third S       Ninth S           All                      0.01
Third S       Tenth S           All                      0.22

Technical Adjustments

Let us now recall our simplified example in 2D space ("Principle of the Method" Section and Fig. 1) and suppose that at the end of the entire procedure, the situation is as follows. The four-atom pattern is fitted on the seven-atom pattern. Both the RMS distance and the local distances are computed.

Let us now suppose (Fig. 5) that the user considers that the distance between atom 4 of the substructure and its corresponding atom in the structure is too great. Atom 4 is put aside and the remaining three-atom pattern is fitted on the seven-atom one. The local requirement being achieved, we can conclude that we have found a common three-atom 3D pattern.

FIGURE 5. Common 3D patterns: some necessary technical adjustments. The smaller pattern (four atoms) is sought in the bigger (seven atoms) and, after the full procedure, (a) the preliminary result is inspected: an SS atom, At4, is considered to be too far away from its corresponding atom in the structure. (b) At4 is set aside and the remaining three-atom pattern is fitted on the seven-atom one. The two patterns have this three-atom pattern in common.


More precisely, to search for common patterns in two structures, we proceed in the following way. The smaller structure plays the role of the substructure and we search for its inclusion in the larger structure. Here, the user has to make a decision: to define a local threshold, ξ. The local distances between the corresponding atoms are compared with ξ for each pair of atoms. If the local distances are smaller than ξ for each pair, the two structures are considered to have the smaller structure pattern in common. If the local distance between two or more corresponding atoms exceeds the local threshold, the substructure atoms concerned are deleted and we fit the remaining substructure 3D pattern on the structure. The local requirement is achieved. Thus, we have found a common 3D pattern, the size of which is given by the number of corresponding atom pairs, Npairs.
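The screening step can be sketched in a few lines of Python. The function below only splits the proposed atom pairs according to the local threshold; refitting the retained atoms onto the structure is left out, and the function name, argument layout, and toy coordinates are assumptions made for illustration.

```python
import math

def screen_pairs(sub_coords, struct_coords, assignment, threshold):
    """Split proposed atom pairs into kept and dropped lists.

    assignment[i] is the index of the structure atom paired with
    substructure atom i; a pair is dropped when its local distance
    exceeds the threshold.
    """
    kept, dropped = [], []
    for i, k in enumerate(assignment):
        local_distance = math.dist(sub_coords[i], struct_coords[k])
        if local_distance <= threshold:
            kept.append((i, k))
        else:
            dropped.append((i, k))
    return kept, dropped

# Toy case in the spirit of Fig. 5: the fourth atom is too far away.
sub = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (3.0, 3.0, 0.0)]
struct = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 1.2, 0.0), (5.0, 5.0, 0.0)]
kept, dropped = screen_pairs(sub, struct, [0, 1, 2, 3], threshold=0.5)
print(kept)     # [(0, 0), (1, 1), (2, 2)]
print(dropped)  # [(3, 3)]
```

Each pair is tested independently against the same threshold, so the retained set does not depend on the order in which the atoms are examined. The number of retained pairs is Npairs.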

The RMS, limited to the corresponding atom pairs, measures the accuracy of the common substructure determination:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{N_{pairs}} \left[ (x''_i - x_k)^2 + (y''_i - y_k)^2 + (z''_i - z_k)^2 \right]}{N_{pairs}}} \tag{11}$$
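A direct transcription of eq. (11) into Python might look as follows. It assumes the substructure coordinates have already been superposed on the structure and that the pairs are given as (substructure index, structure index) tuples; both conventions are choices of this sketch.

```python
import math

def rms_over_pairs(sub_coords, struct_coords, pairs):
    """Root mean square distance of eq. (11), restricted to the atom pairs."""
    squared = sum(math.dist(sub_coords[i], struct_coords[k]) ** 2 for i, k in pairs)
    return math.sqrt(squared / len(pairs))

# Toy coordinates (hypothetical): one pair coincides, the other is 1 unit apart.
sub = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
struct = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
print(rms_over_pairs(sub, struct, [(0, 0), (1, 1)]))  # sqrt(1/2), about 0.707
```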

It can easily be seen that screening all substructure atoms whose local distances from the corresponding structure atom exceed ξ provides a unique solution, whatever the order of atom deletion. This solution depends only on the initially proposed isomorphism (see "Atom Assignment" section) and on the chosen local threshold.

Application to Substructure Common to Two Marine Neurotoxins

Saxitoxin (STX) and tetrodotoxin (TTX) are two marine neurotoxins selected by Danziger and Dean19 and by Feuilleaubois et al.20 to test their matching techniques. These toxins act on nerve endings by binding the sodium-channel macromolecule.

Their 3D structures are not obviously similar. For toxins, they are exceptionally small and their crystallographic structures, limited to the 22 (TTX) and 21 (STX) heavy atoms, are known.

Danziger and Dean searched for correspondences only between heteroatoms with one property, donor or acceptor, in common. They used a tree search technique and pruned the tree to limit the search. The objective function to be minimized is the RMS value of the distance difference matrix (Δd). Table V shows the proposed matches, with the atom numbering being the same as in the article by Dean and Chau31 (Fig. 6).

Feuilleaubois et al. looked for a nine-atom 3D pattern of STX in TTX, taking into account the

FIGURE 6. Saxitoxin (left) and tetrodotoxin (right). Numbering from Dean and Chau.31

TABLE V. Tetrodotoxin (TTX) and Saxitoxin (STX) Common Substructures (CSS) Proposed by Danziger and Dean.19

STX:  N7 (D), N11 (D), N9 (D), O17 (A), O20 (A), N19 (A + D), N1 (D), N10 (D), O21 (A + D), O15 (A + D)

CSS Size   TTX atoms                                                                                    Δd     RMS (Å)

8          N1 (D), N15 (D), N3 (D), O20 (A + D), O17 (A + D), O18 (A + D), O21 (A + D), O22 (A + D)     0.80   1.25
6          N15 (D), O22 (A + D), O11 (A), O21 (A + D), N1 (D), O17 (A + D)                              0.40   0.50
4          N1 (D), N15 (D), O21 (A + D), O22 (A + D)                                                    0.10   0.10

Matches of acceptor and/or donor heteroatoms. A: acceptor; D: donor; A + D: acceptor and donor.


ability of the atoms to give or to receive a hydrogen bond. They used Hopfield-like neural networks.32 They minimized the sum F of the squares of the distance difference matrix elements (Table VI).

We fitted the discrete STX pattern onto the TTX spline approximation, taking account of the ability of the atoms to give and/or to receive hydrogen bonds, according to eq. (10).10 First, only donor and acceptor heteroatoms were considered; second, all atoms were considered.

Matching acceptor and/or donor atoms only. In STX, only 10 heteroatoms are able to receive or give hydrogen bonds. In TTX, there are 11 such atoms. Thus, the maximal common substructure contains no more than 10 atoms.

TABLE VI. Tetrodotoxin (TTX) and Saxitoxin (STX) Common Substructure (CSS) Tested by Feuilleaubois et al.20

CSS Size 9; F (Å) = 3.03; RMS (Å) = 1.07

STX   N1      C5   C6    N7   C8   N9   N11   C16   O17
      D       a    a     D    a    D    D     a     A
TTX   O22     C6   C13   N1   C2   N3   N15   C12   O21
      A + D   a    a     D    a    D    D     a     A + D

A: acceptor; D: donor; A + D: acceptor and donor; a: neither donor nor acceptor.

TABLE VII. Tetrodotoxin (TTX) and Saxitoxin (STX) Common Substructures (CSS) Obtained by Fitting the Discrete Structure of STX Onto the Spline Approximation of TTX.

CSS Size 4; RMS (Å) = 0.10; D & D
STX   N1      N7    N11   O17
      D       D     D     A
TTX   O22     N1    N15   O21
      D + A   D     D     D + A

CSS Size 5; RMS (Å) = 0.36
STX   N7    N9    N11   O20     O21
      D     D     D     A       D + A
TTX   N1    N3    N15   O20     O11
      D     D     D     D + A   A

CSS Size 6; RMS (Å) = 0.50; D & D
STX   N1    N7    N10     O17     N19     O20
      D     D     D       A       D + A   A
TTX   N1    N15   O17     O22     O21     O11
      D     D     D + A   D + A   D + A   A

CSS Size 7; RMS (Å) = 0.67
STX   N1    N7    N10     O17     N19     O20   O21
      D     D     D       A       D + A   A     D + A
TTX   N1    N15   O17     O22     O21     O11   O16
      D     D     D + A   D + A   D + A   A     D + A

CSS Size 8; RMS (Å) = 1.14
STX   N1      N7    N9    N10     N11   O15     O17   O21
      D       D     D     D       D     D + A   A     D + A
TTX   O17     N1    N3    O18     N15   O16     O14   O11
      D + A   D     D     D + A   D     D + A   A     A

CSS Size 8; RMS (Å) = 1.14
STX   N1    N7    N10     N11   O17   N19     O20   O21
      D     D     D       D     A     D + A   A     D + A
TTX   N1    N3    O17     N15   O11   O21     O14   O18
      D     D     D + A   D     A     D + A   A     D + A

CSS Size 8; RMS (Å) = 0.96
STX   N7    N9    N10     N11   O15     O17     O20     O21
      D     D     D       D     D + A   A       A       D + A
TTX   N1    N3    O16     N15   O22     O18     O20     O21
      D     D     D + A   D     D + A   D + A   D + A   D + A

Matches limited to heteroatoms able to give and/or receive hydrogen bonds; ξ ≤ 1.5 Å. A: acceptor; D: donor; A + D: acceptor and donor.


The maximal common substructure size depends on the imposed threshold, ξ. With ξ = 2 Å, it is possible to obtain 9 or 10 common atoms. If the threshold is lowered to 1.5 Å, no more than 8 atoms in common are found (Table VII). By comparison of Tables V and VII, it can be seen that our technique provides Danziger and Dean's four-atom and six-atom common substructures. Our best eight-atom common substructure (RMS = 0.96 Å) is very similar to the common substructure of the same size found by Danziger and Dean (RMS = 1.25 Å) and the fit is better (see Fig. 7, the stereoview of the final superposition). In particular, the nitrogen atoms of the STX guanidinium group (N7, N9, and N11) are matched with those of the TTX guanidinium group (N1, N3, and N15) and the gem dihydroxyl oxygens (O15 and O21) of STX are paired with two cage hydroxyl oxygens of TTX (O22 and O21). These nitrogens and hydroxyl oxygens appear to be involved in the binding of the sodium-channel macromolecule.33,34

FIGURE 7. Matching eight acceptor and/or donor atoms. Stereoview of the best superposition. Gray line: saxitoxin; black line: tetrodotoxin.

Matching all atoms. In each structure, 11 atoms are not able to give and/or receive hydrogen bonds. These are the 11 carbon atoms of TTX and the 10 carbon atoms and the N3 nitrogen of STX. Table VIII shows our best matches. The best 10-atom common substructure is similar to the 8-atom common substructure of Danziger and Dean's study: five atoms are identical, namely the three guanidinium N atoms and two hydroxyl oxygen atoms. It is also similar to the 9-atom common substructure sought by Feuilleaubois et al., in that three heteroatoms (the three guanidinium nitrogens) and two carbon atoms are identical.

In conclusion, our common substructure search compares well with the others. However, we must point out a difficulty inherent in any common substructure search: there are many other possible correspondences of interest that are numerically as good as those discussed or judged (see, e.g., the fifth and sixth matches in Table VIII). Neither of the two criteria, the RMS or the local threshold ξ, is sufficient to screen them. Only expert knowledge of the related bioreactivity problem makes it possible to choose.

Conclusion

This article provides a new method for discrete shape similarity recognition. Instead of optimizing in discrete space, our method first optimizes the location of one of the two discrete shapes on a continuous approximation of the other. On the basis of the atom correspondences obtained, the discrete shapes are then overlapped.

Our method requires few computer resources. It needs, as starting data, the atom Cartesian coordinates and the parameters for the spline approximation of the structures. The storage of this information occupies very little memory space. The central processing time increases roughly logarithmically with the size of the structures to be matched. Consequently, we can match structures with a few hundred atoms.

Current methods for recognizing a given 3D pattern are not able to retrieve patterns common to two structures. We have shown applications to the retrieval of a pattern, a nucleotide unit and a sugar, contained many times with some variation in a 316-atom oligonucleotide and to the determination of common patterns for two marine neurotoxins. Our results compare well with those of other investigators.

TABLE VIII. Tetrodotoxin (TTX) and Saxitoxin (STX) Common Substructures Obtained by Fitting the Discrete Structure of STX onto the Spline Approximation of TTX.

CSS Size 4; RMS (Å) = 0.16
STX   C2    C4    C6    C16
      a     a     a     a
TTX   C7    C5    C13   C12
      a     a     a     a

CSS Size 5; RMS (Å) = 0.11
STX   C2    N3    C4    C6    N10
      a     a     a     a     D
TTX   C12   C13   C6    C8    O21
      a     a     a     a     D + A

CSS Size 6; RMS (Å) = 0.30
STX   N3    C4    C5    C6    N10   C14
      a     a     a     a     D     a
TTX   C5    C10   C12   C13   N3    C9
      a     a     a     a     D     a

CSS Size 7; RMS (Å) = 0.29
STX   C2    N3    C4    C5    C6    N10     C16
      a     a     a     a     a     D       a
TTX   C12   C13   C6    C5    C10   O21     C9
      a     a     a     a     a     D + A   a

CSS Size 8; RMS (Å) = 0.46
STX   C2    N3    C4    C5    C6    N10     O15     C16
      a     a     a     a     a     D       D + A   a
TTX   C12   C13   C6    C5    C10   O21     N1      C9
      a     a     a     a     a     D + A   D       a

CSS Size 9; RMS (Å) = 0.36
STX   C2    C4    C5    C6    N7      C13   C16   O20     O21
      a     a     a     a     D       a     a     A       D + A
TTX   C12   C8    C7    C6    O17     C19   C5    O16     O18
      a     a     a     a     D + A   a     a     D + A   D + A

CSS Size 10; RMS (Å) = 0.39
STX   C4    C5    C6    N7    C8    N9    N11   C16   O20     O21
      a     a     a     D     a     D     D     a     A       D + A
TTX   C5    C6    C7    N1    C2    N3    N15   C8    O20     O11
      a     a     a     D     a     D     D     a     D + A   A

Matches of all atoms; ξ ≤ 1 Å. A: acceptor; D: donor; A + D: acceptor and donor; a: neither donor nor acceptor.

Most of the current methods for shape similarity analysis minimize a distance difference matrix and achieve a good local match at the expense of an overall fit. For these two reasons, they are not able to measure the dissimilarity of two enantiomers of a chiral body. Our method, which works on Cartesian coordinates and searches for an overall match, might be able to achieve this. We shall examine this point in a subsequent study.
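The limitation of distance-based criteria for enantiomers can be illustrated with a short, self-contained Python sketch. It uses a hypothetical irregular tetrahedron and its mirror image; the helper names are illustrative only and no part of the method described in this article is reproduced here.

```python
import math

def difference_distance_matrix(a, b):
    """Absolute differences between the interatomic distances of two
    point sets whose atoms are paired by index."""
    n = len(a)
    return [[abs(math.dist(a[i], a[j]) - math.dist(b[i], b[j]))
             for j in range(n)] for i in range(n)]

def signed_triple_product(p0, p1, p2, p3):
    """Determinant of the three edge vectors leaving p0; its sign
    changes under reflection."""
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p0[i] for i in range(3)]
    w = [p3[i] - p0[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))

tetrahedron = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
mirror_image = [(x, y, -z) for x, y, z in tetrahedron]

ddm = difference_distance_matrix(tetrahedron, mirror_image)
print(max(max(row) for row in ddm))            # 0.0
print(signed_triple_product(*tetrahedron))     # 6.0
print(signed_triple_product(*mirror_image))    # -6.0
```

A reflection preserves every interatomic distance, so the difference distance matrix is identically zero although the two bodies have opposite handedness. A criterion built on Cartesian coordinates is not subject to this cancellation.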

References

1. P. Ehrlich and J. Morgenroth, On Haemolysis: Third Communication. The Collected Papers of Paul Ehrlich, Vol. 1, F. Himmelweit, Ed., Pergamon Press, London, 1956, p. 205; L. B. Kier, Molecular Orbital Theory in Drug Research, Academic Press, New York, 1971.

2. F. H. Allen, O. Kennard, and R. Taylor, Acc. Chem. Res., 16, 146 (1983).

3. E. E. Abola, F. C. Bernstein, and T. F. Koetzle, The Protein Data Bank in the Role of Data in Scientific Progress, P. S. Glaeser, Ed., Elsevier, New York, 1985.

4. R. S. Pearlman, Concord User's Manual, Tripos Associates, St. Louis, MO, 1987; R. S. Pearlman, Chem. Des. Automat. News, 2, 1 (1987).

5. C. Hansch and T. Fujita, J. Am. Chem. Soc., 86, 1616 (1964).

6. L. Hammett, Physical Organic Chemistry, McGraw-Hill, New York, 1970.

7. P. Gund, W. T. Wipke, and R. Langridge, Proc. Int. Conf. Comput. Chem. Res. Edu., Ljubljana, 1973, p. 33; P. Gund, W. T. Wipke, and R. Langridge, Comput. Chem. Res. Edu. Technol., 3, 5 (1974).

8. N. L. Allinger, Pharmacology and the Future of Man, Vol. 5, Proceedings of the Fifth International Congress, R. A. Maxwell, Ed., Karger, Basel, 1972, p. 57.

9. A. Y. Meyer and W. G. Richards, J. Comput.-Aid. Mol. Des., 5, 427 (1991).

10. M. Petitjean, J. Comput. Chem., 16, 80 (1995).

11. R. Carbo, L. Leyda, and M. Arnau, Int. J. Quant. Chem., 17, 1185 (1980); R. Carbo and L. Domingo, Int. J. Quant. Chem., 32, 517 (1987).

12. E. E. Hodgkin and W. G. Richards, Int. J. Quant. Chem. Quant. Biol. Symp., 14, 105 (1987); C. Burt, W. G. Richards, and P. Huxley, J. Comput. Chem., 11, 1139 (1990); A. C. Good, E. E. Hodgkin, and W. G. Richards, J. Chem. Inf. Comput. Sci., 32, 188 (1992).

13. A. M. Lesk, Commun. ACM, 22, 219 (1974).

14. S. E. Jakes, N. Watts, P. Willett, D. Bawden, and J. D. Fischer, J. Mol. Graph., 5, 41 (1987).

15. V. Golender and A. Rozenblit, Logical and Combinatorial Algorithms for Drug Design, Research Studies Press, Letchworth, UK, 1983.

16. F. S. Kuhl, G. M. Crippen, and D. K. Friesen, J. Comput. Chem., 5, 24 (1984).

17. A. T. Brint and P. Willett, J. Mol. Graph., 5, 49 (1987).

18. J. R. Ullman, J. ACM, 16, 31 (1976).

19. D. J. Danziger and P. M. Dean, J. Theor. Biol., 116, 215 (1985).

20. E. Feuilleaubois, V. Fabart, and J. P. Doucet, SAR and QSAR in Environ. Res., 1, 97 (1993).

21. M. T. Barakat and P. M. Dean, J. Comput.-Aided Mol. Des., 4, 295 (1990).

22. P. G. Ciarlet, Introduction à l'analyse numérique matricielle et à l'optimisation, Masson, Paris, 1994.

23. J. Broyden, J. Inst. Math. Appl., 6, 222 (1970); R. Fletcher, Comput. J., 13, 317 (1970); B. Goldfarb, Math. Comp., 24, 23 (1970); D. F. Shanno, Math. Comp., 24, 647 (1970).

24. Y. Lin, M. Risk, S. M. Ray, D. Van Engen, J. Clardy, J. Golik, J. C. James, and K. Nakanishi, J. Am. Chem. Soc., 103, 6773 (1981).

25. G. Dodin and C. Cordier, personal communication.

26. R. Lavery, H. Sklenar, K. Zakrzewska, and B. Pullman, J. Biomol. J. Struct. Dynam., 3, 989 (1986); R. Lavery, Structure and Expression, Vol. 3, W. K. Olson, R. H. Sarma, M. H. Sarma, and M. S. Sundaraligam, Eds., Adenine Press, New York, 1988, p. 191.

27. A. Rassat, C. R. Acad. Sci., Paris, 299 (Série II), 53 (1984).

28. F. Hausdorff, Set Theory, 2nd Ed., Chelsea Publishing, New York, 1962, p. 166.

29. G. Dodin, J. M. Kühnel, P. Demerseman, and J. Kotzyba, Anti-Cancer Drug Des., 8, 416 (1993); G. Dodin, B. Bourliataud, C. Cordier, and J. P. Blais, J. Org. Chem., 61, 2561 (1996).

30. M. Poncin, B. Hartmann, and R. Lavery, J. Mol. Biol., 226, 775 (1992).

31. P. M. Dean and P. L. Chau, J. Mol. Graph., 5, 152 (1987).

32. J. J. Hopfield, Biol. Cybernet., 52, 141 (1985); J. J. Hopfield, Proc. Natl. Acad. Sci. USA, 81, 3088 (1984).

33. B. Hille, J. Biophys., 15, 615 (1975).

34. C. Kao and S. E. Walker, J. Physiol. (Lond), 323, 619 (1982).
