
Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search

Milorad Tosic, Ph.D.

Rutgers, The State University of New Jersey

Department of Chemistry

Databases of Chemical Structures: Similarity Searching Features

• Size of the database

– A couple of hundred thousand structures

• Nature of the structure data

– Purified, consistent data

– Raw, inconsistent data

• Search type

– Structure search

– Substructure search [DOW96], [BAR93]

– Substructure similarity search [HAG92], [GWW98], [ART92]

– Superstructure search (structures contained in the target structure)

• Type of similarity (from less general to more general)

– Graph isomorphism

– Subgraph isomorphism

– Maximal common subgraph

Substructure similarity search

• Screening search

– based on substructural features that are typically small fragment substructures

– processes many thousands of structures per second

– precedes the detailed and time-consuming atom-by-atom search

• Atom-by-atom search (MCS, Maximal Common Substructure search)

– The MCS of a pair of structures is the largest substructure that is present in both structures.

– The MCS is interpreted as a similarity measure between two structures that corresponds favorably to an “intuitive” notion of chemical similarity.

– The MCS is our primary concern because of its importance for search quality and its exponential computational complexity.

[DOW96], [BAR93], [HAG92], [GWW98], [ART92]
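The two-stage idea above (a cheap screen first, the expensive atom-by-atom search only on the survivors) can be sketched as follows. This is only an illustration under stated assumptions: the fragment keys, the `tanimoto` and `screen` helpers, and the 0.5 threshold are hypothetical, and the Tanimoto coefficient is swapped in as one common fragment-based score; the slides do not say which screening measure is used.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient of two fragment sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def screen(target: frozenset, database: dict, threshold: float) -> list:
    """Return names of database structures that pass the cheap fragment screen."""
    return [name for name, frags in database.items()
            if tanimoto(target, frags) >= threshold]


# Hypothetical fragment keys; a real system would derive them from the structures.
target = frozenset({"C-C", "C=O", "C-N", "ring6"})
database = {
    "s1": frozenset({"C-C", "C=O", "C-N", "ring6", "C-O"}),
    "s2": frozenset({"C-C", "C-Cl"}),
    "s3": frozenset({"C-C", "C=O", "ring6"}),
}

# Only these candidates would be handed to the atom-by-atom (MCS) stage.
candidates = screen(target, database, threshold=0.5)
print(candidates)
```

Here `s2` shares a single fragment with the target and is dropped before any atom-by-atom comparison is attempted.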

MCS - Maximal Common Substructure search

• NP-complete problem

– Subgraph isomorphism is proven to be an NP-complete problem, which implies that the MCS problem is at least as hard

– No polynomial-time algorithm is known; exact algorithms have (at least) exponential worst-case complexity

• Average run-time can be reduced by:

– using a faster computer

– using various heuristics

– carrying out some of the computation in a pre-processing phase

[XUJ96], [BAR93]

Our strategy for MCS search

• Back-tracking

– Back-tracking is used as a common underlying algorithm for problems with exponential complexity

• Distributed objects

– Distributed computing is explored to increase processing speed

– Persistent objects are essential for the robustness of the search engine

• Topology-based comparison criteria

– Topology-based features of chemical structures are attractive for efficient structure description

– Topological queries and indexing in collections of distributed objects are considered a promising approach in similar applications

– Our heuristics for reducing the average search time, and for postponing the computational explosion to structures as large as possible, are based on substructure-by-substructure rather than atom-by-atom search

[XUJ96], [EST98], [WAN98]

[PSV99]
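To make the back-tracking idea concrete, here is a minimal sketch, not the search engine described in these slides. It assumes a simplified problem statement: graphs are given as label and adjacency dictionaries, the result is a maximum common induced subgraph, and connectivity of the result is not enforced. The function name `mcs` and the pluggable `compatible` predicate are hypothetical.

```python
def mcs(labels1, adj1, labels2, adj2, compatible=None):
    """Largest vertex mapping between two labelled graphs, found by back-tracking.

    labels*: dict vertex -> atom type; adj*: dict vertex -> set of neighbours.
    Returns a dict mapping vertices of graph 1 to vertices of graph 2.
    """
    if compatible is None:
        # Default criterion: atom types must be equal.
        compatible = lambda u, v: labels1[u] == labels2[v]
    nodes1 = list(labels1)
    best = {}

    def extend(i, mapping, used):
        nonlocal best
        # Bound: even matching every remaining vertex cannot beat the best so far.
        if len(mapping) + (len(nodes1) - i) <= len(best):
            return
        if i == len(nodes1):
            best = dict(mapping)
            return
        u = nodes1[i]
        for v in labels2:
            if v in used or not compatible(u, v):
                continue
            # Adjacency must agree with every pair that is already mapped.
            if all((w in adj1[u]) == (mapping[w] in adj2[v]) for w in mapping):
                mapping[u] = v
                used.add(v)
                extend(i + 1, mapping, used)
                del mapping[u]          # undo the choice (back-track)
                used.discard(v)
        extend(i + 1, mapping, used)    # alternative: leave u unmatched

    extend(0, {}, set())
    return best


# Heavy-atom skeletons used purely as toy input: C-C-O and C-C-C-O.
labels_a = {"a": "C", "b": "C", "c": "O"}
adj_a = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
labels_b = {"p": "C", "q": "C", "r": "C", "s": "O"}
adj_b = {"p": {"q"}, "q": {"p", "r"}, "r": {"q", "s"}, "s": {"r"}}

result = mcs(labels_a, adj_a, labels_b, adj_b)
print(result)
```

The `compatible` argument is where a stricter comparison criterion, such as a topology-based one, could be plugged in; a stricter predicate prunes more branches, which is the mechanism by which such criteria can reduce average run-time while the worst case stays exponential.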

Experimental results - question

• Compare searching time with and without topology-based criteria, for the same set of target structures and the same set of database structures.

• A topology criterion based on loop number is used:

An atom X matches an atom Y iff they have the same atom type and the number of loops that X belongs to is not greater than the number of loops that Y belongs to.

• In order to examine how atom types influence the searching process, the same set of target structures is applied both with and without hydrogens.
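The loop-number criterion stated above translates directly into a predicate. The sketch below assumes that the loop number of each atom has already been computed elsewhere; the `Atom` class and the `atoms_match` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str  # atom type, e.g. "C"
    loops: int    # number of loops the atom belongs to (assumed precomputed)


def atoms_match(x: Atom, y: Atom) -> bool:
    """X matches Y iff the atom types are equal and loops(X) <= loops(Y)."""
    return x.element == y.element and x.loops <= y.loops


chain_carbon = Atom("C", 0)
ring_carbon = Atom("C", 1)
fused_ring_carbon = Atom("C", 2)

print(atoms_match(ring_carbon, fused_ring_carbon))   # target atom in 1 loop, database atom in 2
print(atoms_match(fused_ring_carbon, ring_carbon))   # reversed roles
```

Note that the criterion is not symmetric: an atom in one loop may match an atom in two loops, but not the other way around.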

Is there any searching speed-up due to the introduction of topology-based comparison criteria?

Search with Hydrogens excluded

Search with Hydrogens included

Experimental results - answer

Is there any searching speed-up due to the introduction of topology-based comparison criteria? YES

• The searching speed-up is evident when topology-based criteria are applied.

• Oscillations in searching time indicate further potential for improving speed.

• Exponential complexity remains (both curves show the same growth tendency), but by introducing topology-based criteria the point of run-time explosion is shifted toward much more complex structures.

• The relative improvement is higher when structures without hydrogens are considered. If such a conclusion can be drawn for specific atom types, then much better results can be expected for specific substructure types.

Experimental results - question

• Do topology-based comparison criteria improve the substructure similarity measure?

• Compare the sets of resulting structures obtained by searching with and without topology-based criteria, for the same set of target structures and the same set of database structures.

Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria?

Target structure

Two of the resulting structures

The structure is eliminated

Experimental results - answer

Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria? YES

• Decreased number of resulting structures.

• Increased probability that the expected structures are found in the set of resulting structures.

Serializable hyper-graph

• Different characteristic substructures are represented in a uniform way

• Efficient implementation of topology-based comparison criteria

• Pointer-based data structure with no extra delay due to serialization

• Persistent storage of such objects is straightforward

• Easy to adapt to any distributed-object technology

Hyper-graph: definitions

Definition: A hyper-graph HG is an ordered two-tuple

HG = (C, E),

where C is the set of hyper-graphs that are containers of HG, and E is the set of hyper-graphs that are elements of HG (c > HG denotes that c contains HG; e < HG denotes that e is contained in HG):

C = { c | c > HG }, E = { e | e < HG }

Definition: An undirected hyper-graph HG is an ordered two-tuple

HG = ((C, E), I),

where (C, E) is a hyper-graph and I is the set of undirected hyper-graphs that are neighbors of HG. We say that HG is in an undirected connection relation with its neighbors.

Definition: The undirected connection relation is an equivalence relation.

Hyper-graph: definitions (cont’d)

Definition: A directed hyper-graph HG is an ordered three-tuple

HG = ((C, E), I, O),

where (C, E) is a hyper-graph, I is the set of directed hyper-graphs that are input neighbors of HG, and O is the set of directed hyper-graphs that are output neighbors of HG. We say that HG is in a directed connection relation with its neighbors.

Definition: The directed connection relation is an order relation.

Note: We use the undirected hyper-graph in MCS.
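As a rough illustration of the undirected definition HG = ((C, E), I), the sketch below keeps the three sets as pointer lists on each object and round-trips a small structure through a flat, id-based form. All names (`HyperGraph`, `add_element`, `connect`, `dump`, `load`) are hypothetical, and JSON is used only as a convenient stand-in; it is not the serialization method referred to in these slides.

```python
import json


class HyperGraph:
    """Undirected hyper-graph node: containers (C), elements (E), neighbors (I)."""

    def __init__(self, ident, kind):
        self.ident = ident
        self.kind = kind        # e.g. "GRAPH", "VERTEX", "EDGE", "LOOP"
        self.containers = []    # C: hyper-graphs that contain this one
        self.elements = []      # E: hyper-graphs contained in this one
        self.neighbors = []     # I: undirected neighbors

    def add_element(self, child):
        self.elements.append(child)
        child.containers.append(self)

    def connect(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)


def dump(nodes):
    """Serialize nodes to JSON, replacing object pointers by ids."""
    return json.dumps([
        {"id": n.ident, "type": n.kind,
         "elements": [e.ident for e in n.elements],
         "neighbors": [m.ident for m in n.neighbors]}
        for n in nodes])


def load(text):
    """Rebuild the pointer structure from the JSON records."""
    records = json.loads(text)
    by_id = {r["id"]: HyperGraph(r["id"], r["type"]) for r in records}
    for r in records:
        node = by_id[r["id"]]
        for eid in r["elements"]:
            node.add_element(by_id[eid])   # also restores the container links
        node.neighbors = [by_id[nid] for nid in r["neighbors"]]
    return by_id


g = HyperGraph("G1", "GRAPH")
v1, v2, e12 = (HyperGraph("v1", "VERTEX"), HyperGraph("v2", "VERTEX"),
               HyperGraph("e12", "EDGE"))
for part in (v1, v2, e12):
    g.add_element(part)
e12.connect(v1)
e12.connect(v2)

restored = load(dump([g, v1, v2, e12]))
print([n.ident for n in restored["e12"].neighbors])
```

Because containers, elements, and neighbors are all hyper-graphs themselves, one class is enough to represent vertices, edges, loops, and whole graphs.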

Hyper-graph: example

[Figure: example graph G1 with vertices v1, …, v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]

v1: id = v1; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12};

v2: id = v2; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12, e23, e24};

. . .

e12: id = e12; type = EDGE; Container = {G1}; Elements = {}; InElements = {v1, v2};

e23: id = e23; type = EDGE; Container = {G1}; Elements = {}; InElements = {v2, v3};

. . .

G1: id = G1; type = GRAPH; Container = {}; Elements = {v1, … , v8, e12, e23, … , e68}; InElements = {};
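The records listed for this example can be generated from a plain edge list. The sketch below assumes that an edge named eij joins vertices vi and vj, which agrees with the two edge records shown (e12 and e23) but is an inference for the remaining edges; `EDGES` and `build_records` are hypothetical names.

```python
# Edge name -> endpoints; the eij = (vi, vj) reading is an assumption.
EDGES = {
    "e12": ("v1", "v2"), "e23": ("v2", "v3"), "e24": ("v2", "v4"),
    "e35": ("v3", "v5"), "e45": ("v4", "v5"), "e46": ("v4", "v6"),
    "e57": ("v5", "v7"), "e67": ("v6", "v7"), "e68": ("v6", "v8"),
}


def build_records(edges, graph_id="G1"):
    """Build Container / Elements / InElements records for a flat graph."""
    vertices = sorted({v for ends in edges.values() for v in ends})
    records = {
        v: {"type": "VERTEX", "Container": {graph_id},
            "Elements": set(), "InElements": set()}
        for v in vertices
    }
    for name, (a, b) in edges.items():
        records[name] = {"type": "EDGE", "Container": {graph_id},
                         "Elements": set(), "InElements": {a, b}}
        # Each endpoint lists the edge among its own neighbors.
        records[a]["InElements"].add(name)
        records[b]["InElements"].add(name)
    records[graph_id] = {"type": "GRAPH", "Container": set(),
                         "Elements": set(vertices) | set(edges),
                         "InElements": set()}
    return records


records = build_records(EDGES)
print(sorted(records["v2"]["InElements"]))
```

With 8 vertices and 9 edges in one connected piece, the graph has 9 - 8 + 1 = 2 independent loops.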

Hyper-graph: example (cont’d)

After simple-loop reduction

[Figure: the example graph reduced to G2, a chain of four components g1, g2, g3, g4 joined by e1, e2, e3. g1 contains v1, v2, e12; g2 contains v2, v3, v4, v5, e23, e24, e35, e45; g3 contains v4, v5, v6, v7, e45, e46, e57, e67; g4 contains v6, v8, e68]

G2: id = G2; type = GRAPH; Container = {}; Elements = {g1, g2, g3, g4, e1, e2, e3, e4}; InElements = {};

g1: id = g1; type = GRAPH; Container = {G2}; Elements = {v1, v2, e12}; InElements = {e1};

g2: id = g2; type = LOOP; Container = {G2}; Elements = {v2, v3, v4, v5, e23, e24, e35, e45}; InElements = {e1, e2};

e1: id = e1; type = EDGE; Container = {G2}; Elements = {v2}; InElements = {g1, g2};

e2: id = e2; type = EDGE; Container = {G2}; Elements = {v4, v5, e45}; InElements = {g2, g3};
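In the records above, the elements of each connecting edge are exactly the elements shared by the two components it joins (e1 holds v2, e2 holds v4, v5, e45). The sketch below recomputes those shared sets and also reads off per-vertex loop numbers from the reduced graph. The contents and types of g3 and g4 are inferred from the figure rather than taken from listed records, and the function names are hypothetical.

```python
COMPONENTS = {
    "g1": ("GRAPH", {"v1", "v2", "e12"}),
    "g2": ("LOOP", {"v2", "v3", "v4", "v5", "e23", "e24", "e35", "e45"}),
    # g3 and g4 are inferred from the figure, not from listed records.
    "g3": ("LOOP", {"v4", "v5", "v6", "v7", "e45", "e46", "e57", "e67"}),
    "g4": ("GRAPH", {"v6", "v8", "e68"}),
}


def connecting_edges(components):
    """Map each pair of overlapping components to the elements they share."""
    names = sorted(components)
    shared = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = components[a][1] & components[b][1]
            if common:
                shared[(a, b)] = common
    return shared


def loop_number(vertex, components):
    """Number of LOOP components that contain the given vertex."""
    return sum(1 for kind, elems in components.values()
               if kind == "LOOP" and vertex in elems)


links = connecting_edges(COMPONENTS)
for pair, common in links.items():
    print(pair, sorted(common))
print({v: loop_number(v, COMPONENTS) for v in ("v1", "v2", "v4", "v6")})
```

Loop numbers obtained this way are the kind of per-atom quantity that a loop-based comparison criterion needs, which suggests why the reduced representation is convenient for topology-based matching.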

Hyper-graph: class hierarchy

Conclusions

• The experimental analysis confirmed what has already been pointed out in the literature: topological information about a chemical structure (information about loops, in these experiments) can improve substructure similarity searching.

• Because MCS is an NP-complete problem, the efficiency of the applied computing model is very important. Distributed objects are currently the most promising computational approach and should therefore be applied to substructure similarity search in chemical structure databases.

• The proposed hyper-graph model can efficiently represent both the topological and the behavioral characteristics of a chemical structure in a hierarchical way.

• Owing to an efficient serialization method, the object representation of the hyper-graph can be incorporated into any distributed-object technology (e.g., CORBA) without decreasing execution efficiency.

References

[DOW96] Downs, G.M., and Willett, P. (1995), Similarity searching in databases of chemical structures, Rev. Comput. Chem., 7, 1-66.

[GWW98] Gillet, V.J., Wild, D.J., Willett, P., and Bradshaw, J. (1998), Similarity and dissimilarity methods for processing chemical structure databases, The Computer Journal, 41, No. 8, 547-558.

[HAG92] Hagadone, T.R. (1992), Molecule substructure similarity searching: Efficient retrieval in two-dimensional structure databases, J. Chem. Inf. Comput. Sci., 32, 515-521.

[WAN98] Wang, T., and Zhou, J. (1998), 3DFS: A new 3D flexible searching system for use in drug design, J. Chem. Inf. Comput. Sci., 38, 71-77.

[XUJ96] Xu, J. (1996), GMA: A generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications, J. Chem. Inf. Comput. Sci., 36, 25-34.

[PSV99] Papadimitriou, C.H., Suciu, D., and Vianu, V. (1999), Topological queries in spatial databases, J. Comput. Syst. Sci., 58, 29-53.

[ART92] Artymiuk, J., et al. (1992), Similarity searching of three-dimensional molecules and macromolecules, J. Chem. Inf. Comput. Sci., 32, 617-630.

[BAR93] Barnard, J.M. (1993), Substructure searching methods: Old and new, J. Chem. Inf. Comput. Sci., 33, 532-538.

[EST98] Estrada, E. (1998), Spectral moments of the edge adjacency matrix in molecular graphs, J. Chem. Inf. Comput. Sci., 38, 23-27.
