Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search
Milorad Tosic, Ph.D.
Rutgers, The State University of New Jersey
Department of Chemistry
Databases of Chemical Structures: Similarity Searching Features
• Size of the database
– A couple of hundred thousand structures
• Nature of the structure data
– Purified, consistent data
– Raw, inconsistent data
• Search type
– Structure search
– Substructure search [DOW96], [BAR93]
– Substructure similarity search [HAG92], [GWW98], [ART92]
– Superstructure search (structures contained in the target structure)
• Type of similarity (scale: less general to more general)
– Graph isomorphism
– Subgraph isomorphism
– Maximal common subgraph
Substructure similarity search
• Screening search
– Based on substructural features that are typically small fragment substructures
– Many thousands of structures per second
– Precedes the detailed and time-consuming atom-by-atom search
• Atom-by-atom search (MCS, Maximal Common Substructure search)
– The MCS of a pair of structures is the largest substructure that is present in both structures.
– The MCS is interpreted as a similarity measure between two structures that corresponds favorably to an “intuitive” notion of chemical similarity.
– The MCS is our primary concern because of its importance for search quality and its exponential computational complexity.
[DOW96], [BAR93], [HAG92], [GWW98], [ART92]
MCS - Maximal Common Substructure search
• NP-complete problem
– Subgraph isomorphism is proven to be an NP-complete problem, and it is a special case of MCS, so MCS is at least as hard (its decision version is also NP-complete)
– Known exact algorithms have (at least) exponential worst-case run-time
• Average run-time can be reduced by:
– Using a faster computer
– Using various heuristics
– Carrying out some computation in a pre-processing phase
[XUJ96], [BAR93]
Our strategy for MCS search
• Back-tracking
– Back-tracking is used as a common underlying algorithm for problems with exponential complexity
• Distributed objects
– Distributed computing is explored for increasing processing speed
– Persistent objects are essential for the robustness of the search engine
• Topology-based comparison criteria
– Topology-based features of chemical structures are attractive for efficient structure description
– Topological queries and indexing in collections of distributed objects are considered a promising approach in similar applications
– Our heuristics compare substructure-by-substructure instead of atom-by-atom; the aim is to reduce average search time and to postpone the computational explosion to structures that are as large as possible
[XUJ96], [EST98], [WAN98]
[PSV99]
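The back-tracking idea can be illustrated with a minimal sketch. It is not the search engine described in these slides: the function name `max_common_subgraph`, the dictionary-based graph encoding and the choice of an induced (not necessarily connected) common subgraph are all assumptions made for the example.

```python
def max_common_subgraph(labels1, adj1, labels2, adj2):
    """Return the largest label-preserving vertex mapping found by back-tracking.

    labels*: dict vertex -> atom type; adj*: dict vertex -> set of neighbours.
    The result is an induced common subgraph and need not be connected.
    """
    nodes1 = sorted(adj1)
    nodes2 = sorted(adj2)
    best = {}

    def extend(i, mapping, used):
        nonlocal best
        # Bound: stop if mapping every remaining vertex could not beat best.
        if len(mapping) + (len(nodes1) - i) <= len(best):
            return
        if i == len(nodes1):
            best = dict(mapping)
            return
        a = nodes1[i]
        for b in nodes2:
            if b in used or labels1[a] != labels2[b]:
                continue
            # Adjacency to every already-mapped vertex must agree in both graphs.
            if all((c in adj1[a]) == (d in adj2[b]) for c, d in mapping.items()):
                mapping[a] = b
                used.add(b)
                extend(i + 1, mapping, used)
                del mapping[a]       # back-track
                used.remove(b)
        # Also try leaving vertex a unmatched.
        extend(i + 1, mapping, used)

    extend(0, {}, set())
    return best


# Heavy-atom skeletons: C-C-O and C-C-C-O (hypothetical test structures).
labels_a = {0: "C", 1: "C", 2: "O"}
adj_a = {0: {1}, 1: {0, 2}, 2: {1}}
labels_b = {0: "C", 1: "C", 2: "C", 3: "O"}
adj_b = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
mapping = max_common_subgraph(labels_a, adj_a, labels_b, adj_b)
print(len(mapping), mapping)
```

The label test inside the loop is the place where a stricter comparison criterion would prune candidate pairs before the recursion branches.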
Experimental results - question
• Compare search time with and without topology-based criteria, for the same set of target structures and the same set of database structures.
• The topology criterion is based on the loop number:
Atom X matches atom Y iff they have the same atom type and the number of loops that X belongs to is not greater than the number of loops that Y belongs to.
• To examine how atom types influence the search, the same set of target structures is used both with and without hydrogens.
Is there any search speed-up due to the introduction of topology-based comparison criteria?
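The loop-number criterion can be written as a small predicate. The sketch below assumes that X is an atom of the target structure, that Y is an atom of a database structure, and that the loop count of each atom has already been computed. The `Atom` tuple and the function names are hypothetical.

```python
from collections import namedtuple

# loops = number of loops (rings) the atom belongs to, assumed precomputed.
Atom = namedtuple("Atom", ["type", "loops"])


def matches_by_type(x, y):
    """Baseline criterion: atom types must be equal."""
    return x.type == y.type


def matches_with_loops(x, y):
    """Topology-based criterion: same type, and X is in no more loops than Y."""
    return x.type == y.type and x.loops <= y.loops


def candidates(target_atom, database_atoms, criterion):
    """Database atoms that the target atom may be mapped to."""
    return [y for y in database_atoms if criterion(target_atom, y)]


target = Atom("C", 2)
database = [Atom("C", 0), Atom("C", 1), Atom("C", 2), Atom("N", 2), Atom("C", 3)]
by_type = candidates(target, database, matches_by_type)
with_loops = candidates(target, database, matches_with_loops)
print(len(by_type), len(with_loops))
```

In this made-up example the stricter predicate halves the candidate list for a single atom. Fewer candidates per atom mean fewer branches in a back-tracking search.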
Experimental results - answer
Is there any search speed-up due to the introduction of topology-based comparison criteria? - YES
• The search speed-up is evident when topology-based criteria are applied.
• Oscillations in search time indicate further potential for improving speed.
• Exponential complexity remains (both curves show the same growth tendency), but topology-based criteria shift the point of run-time explosion toward much more complex structures.
• The relative improvement is higher when structures without hydrogens are considered. If this holds for specific atom types, even better results can be expected for specific substructure types.
Experimental results - question
• Do topology-based comparison criteria improve the substructure similarity measure?
• Compare the sets of resulting structures obtained by searching with and without topology-based criteria, for the same set of target structures and the same set of database structures.
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria?
Experimental results - answer
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria? - YES
• The number of resulting structures decreases.
• The probability that expected structures are found in the set of resulting structures increases.
Serializable hyper-graph
• Different characteristic substructures are represented in a uniform way
• Efficient implementation of topology-based comparison criteria
• Pointer-based data structure with no extra delay due to serialization
• Persistent storage of such objects is straightforward
• Easy to adapt to any distributed-object technology
Hyper-graph: definitions
Definition: A hyper-graph HG is an ordered two-tuple
HG = (C,E) ,
where C is the set of hyper-graphs that are containers of HG, and E is the set of hyper-graphs that are elements of HG:
C = { c | c > HG }, E = { e | e < HG }
Definition: An undirected hyper-graph HG is an ordered two-tuple
HG = ((C, E), I) ,
where (C,E) is a hyper-graph, and I is the set of undirected hyper-graphs that are neighbors of HG. We say that HG is in an undirected connection relation with its neighbors.
Definition: The undirected connection relation is an equivalence relation.
Hyper-graph: definitions (con’t)
Definition: A directed hyper-graph HG is an ordered three-tuple
HG = ((C, E), I, O) ,
where (C,E) is a hyper-graph, I is the set of directed hyper-graphs that are input neighbors of HG, and O is the set of directed hyper-graphs that are output neighbors of HG. We say that HG is in a directed connection relation with its neighbors.
Definition: The directed connection relation is an order relation.
Note: We use the undirected hyper-graph in MCS.
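The undirected definition HG = ((C, E), I) maps naturally onto an object that holds three pointer sets. The sketch below is one possible reading, not the implementation behind these slides. The class and method names are assumptions, and JSON with id lists stands in for whatever serialization method is actually used.

```python
import json


class HyperGraph:
    """Undirected hyper-graph node: HG = ((C, E), I)."""

    def __init__(self, id, type):
        self.id = id
        self.type = type
        self.containers = set()  # C: hyper-graphs that contain this one
        self.elements = set()    # E: hyper-graphs contained in this one
        self.neighbors = set()   # I: undirected neighbours

    def add_element(self, element):
        """Record containment in both directions."""
        self.elements.add(element)
        element.containers.add(self)

    def connect(self, other):
        """Record the undirected connection on both sides."""
        self.neighbors.add(other)
        other.neighbors.add(self)


def to_records(nodes):
    """Flatten pointers into sorted id lists so the structure can be stored."""
    def ids(group):
        return sorted(n.id for n in group)
    return [{"id": n.id, "type": n.type, "containers": ids(n.containers),
             "elements": ids(n.elements), "neighbors": ids(n.neighbors)}
            for n in nodes]


def from_records(records):
    """Rebuild the pointer-based structure from id records."""
    nodes = {r["id"]: HyperGraph(r["id"], r["type"]) for r in records}
    for r in records:
        node = nodes[r["id"]]
        node.containers = {nodes[i] for i in r["containers"]}
        node.elements = {nodes[i] for i in r["elements"]}
        node.neighbors = {nodes[i] for i in r["neighbors"]}
    return nodes


g = HyperGraph("G1", "GRAPH")
v1, v2 = HyperGraph("v1", "VERTEX"), HyperGraph("v2", "VERTEX")
e12 = HyperGraph("e12", "EDGE")
for part in (v1, v2, e12):
    g.add_element(part)
e12.connect(v1)
e12.connect(v2)

text = json.dumps(to_records([g, v1, v2, e12]))
restored = from_records(json.loads(text))
print(sorted(n.id for n in restored["e12"].neighbors))
```

Vertices, edges and whole graphs share one class here, which is what allows a loop or a sub-graph to appear as a single node of a higher-level graph.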
Hyper-graph: example
[Figure: graph G1 with vertices v1, …, v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]
v1: id = v1; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12};
v2: id = v2; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12, e23, e24};
G1: id = G1; type = GRAPH; Container = {}; Elements = {v1, …, v8, e12, e23, …, e68}; InElements = {};
. . .
e12: id = e12; type = EDGE; Container = {G1}; Elements = {}; InElements = {v1, v2};
e23: id = e23; type = EDGE; Container = {G1}; Elements = {}; InElements = {v2, v3};
. . .
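Records of this shape can be generated from the edge list alone. The sketch below assumes that an edge id such as e24 names its two end vertices (v2 and v4), which agrees with the records listed on the slide. The function name `build_g1` and the use of Python dictionaries are illustrative choices.

```python
# Edge labels as they appear in the example graph.
EDGES = ["e12", "e23", "e24", "e35", "e45", "e46", "e57", "e67", "e68"]


def build_g1(edge_ids):
    """Build id-keyed records with the fields used on the slide."""
    records = {}

    def record(id_, type_):
        # Create the record on first use, then return the stored one.
        if id_ not in records:
            records[id_] = {"id": id_, "type": type_, "Container": {"G1"},
                            "Elements": set(), "InElements": set()}
        return records[id_]

    for e in edge_ids:
        edge = record(e, "EDGE")
        for digit in e[1:]:  # "e24" -> vertices v2 and v4 (naming assumption)
            vertex = record("v" + digit, "VERTEX")
            vertex["InElements"].add(e)             # incident edge
            edge["InElements"].add(vertex["id"])    # end vertex

    # The graph record contains every vertex and edge created so far.
    members = set(records)
    records["G1"] = {"id": "G1", "type": "GRAPH", "Container": set(),
                     "Elements": members, "InElements": set()}
    return records


g1 = build_g1(EDGES)
print(sorted(g1["v2"]["InElements"]))
```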
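Hyper-graph: example (con’t)
After simple-loop reduction
[Figure: G1 split into sub-graphs g1 (v1, v2, e12), g2 (v2, v3, v4, v5, e23, e24, e35, e45), g3 (v4, v5, v6, v7, e45, e46, e57, e67) and g4 (v6, v8, e68); in the reduced graph G2 these are linked by the edges e1, e2, e3]
G2: id = G2; type = GRAPH; Container = {}; Elements = {g1, g2, g3, g4, e1, e2, e3, e4}; InElements = {};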
g1: id = g1; type = GRAPH; Container = {G2}; Elements = {v1, v2, e12}; InElements = {e1};
g2: id = g2; type = LOOP; Container = {G2}; Elements = {v2, v3, v4, v5, e23, e24, e35, e45}; InElements = {e1, e2};
e1: id = e1; type = EDGE; Container = {G2}; Elements = {v2}; InElements = {g1, g2};
e2: id = e2; type = EDGE; Container = {G2}; Elements = {v4, v5, e45}; InElements = {g2, g3};
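In the records above, the Elements of e1 and e2 coincide with what the two sub-graphs they connect have in common. The sketch below checks that observation and derives a loop number per vertex. The contents of g3 and g4 are read off the figure labels rather than from a record, g3 is assumed to be of type LOOP like g2, and the function names are hypothetical.

```python
from itertools import combinations

groups = {
    "g1": {"v1", "v2", "e12"},
    "g2": {"v2", "v3", "v4", "v5", "e23", "e24", "e35", "e45"},
    # g3 and g4 are inferred from the figure labels, not from a listed record.
    "g3": {"v4", "v5", "v6", "v7", "e45", "e46", "e57", "e67"},
    "g4": {"v6", "v8", "e68"},
}
LOOPS = {"g2", "g3"}  # g2 is typed LOOP on the slide; g3 is an assumption


def shared_elements(groups):
    """Pairs of sub-graphs that overlap, with the elements they share."""
    links = {}
    for a, b in combinations(sorted(groups), 2):
        common = groups[a] & groups[b]
        if common:
            links[(a, b)] = common
    return links


def loop_number(vertex, groups, loops):
    """Number of loop sub-graphs that contain the vertex."""
    return sum(1 for name in loops if vertex in groups[name])


links = shared_elements(groups)
for pair, common in sorted(links.items()):
    print(pair, sorted(common))
print({v: loop_number(v, groups, LOOPS) for v in ("v1", "v2", "v4")})
```

Under these assumptions the g1/g2 overlap is {v2} and the g2/g3 overlap is {v4, v5, e45}, the same sets the slide lists as Elements of e1 and e2. The per-vertex loop numbers are the quantity that the loop-based matching criterion compares.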
Conclusions
• The experimental analysis confirmed an observation already made in the literature: topological information about a chemical structure (here, information about loops) can improve substructure similarity searching.
• Because MCS is an NP-complete problem, the efficiency of the applied computing model is very important. Distributed objects are currently the most promising computational approach. Hence, they should be applied to substructure similarity search in chemical structure databases.
• The proposed hyper-graph model can efficiently represent both the topological and the behavioral characteristics of a chemical structure, in a hierarchical way.
• Thanks to an efficient serialization method, the object representation of the hyper-graph can be incorporated into any distributed-object technology (e.g., CORBA) without decreasing execution efficiency.
References
[DOW96] Downs, G.M., and Willett, P. (1995), Similarity searching in databases of chemical structures, Rev. Comput. Chem., 7, 1-66.
[GWW98] Gillet, V.J., Wild, D.J., Willett, P., and Bradshaw, J. (1998), Similarity and dissimilarity methods for processing chemical structure databases, The Computer Journal, 41, No. 8, 547-558.
[HAG92] Hagadone, T.R. (1992), Molecule substructure similarity searching: Efficient retrieval in two-dimensional structure databases, J. Chem. Inf. Comput. Sci., 32, 515-521.
[WAN98] Wang, T., and Zhou, J. (1998), 3DFS: A new 3D flexible searching system for use in drug design, J. Chem. Inf. Comput. Sci., 38, 71-77.
[XUJ96] Xu, J. (1996), GMA: A generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications, J. Chem. Inf. Comput. Sci., 36, 25-34.
[PSV99] Papadimitriou, C.H., Suciu, D., and Vianu, V. (1999), Topological queries in spatial databases, Journal of Comput. and Sys. Sci., 58, 29-53.
[ART92] Artymiuk, J., et al. (1992), Similarity searching of three-dimensional molecules and macromolecules, J. Chem. Inf. Comput. Sci., 32, 617-630.
[BAR93] Barnard, J.M. (1993), Substructure searching methods: Old and new, J. Chem. Inf. Comput. Sci., 33, 532-538.
[EST98] Estrada, E. (1998), Spectral moments of the edge adjacency matrix in molecular graphs, J. Chem. Inf. Comput. Sci., 38, 23-27.