UGM 2006
Miklós Vargyas
Scientific Workshop
Maximum Common Substructure
Slide 2
Workshop overview
• Introduction, concepts, theory
• Clustering, the role of MCS
• Applications
• Future plans
Slide 3
Motivations
• Automated reaction mapping
Slide 4
Mapping chemical reactions
[Figure: example reaction scheme; the structure drawing is not recoverable from this transcript]
Slide 5
ChemAxon’s automapper
• Find parts common to both sides
• Map common parts
[Figure: the example reaction, shown unmapped and with atom map numbers assigned to the common parts on both sides; the drawing is not recoverable from this transcript]
Slide 6
ChemAxon’s automapper
• Map the rest
– Score possible mappings
– Find the one that scores the highest
[Figure: the example reaction with atom map numbers on both sides; the drawing is not recoverable from this transcript]
Slide 7
Concepts and theory
• MCS/MCES/MOS
• MCS complexity O(nm)
Slide 8
MCS search methods / Clique
• Barrow and Burstall, 1976
• Raymond and Willett, RASCAL, 2002
• Details in brief
– Construct the product graph of G1 and G2
  • Node count: |V1| ∙ |V2|
– Find the maximum clique; it corresponds to the largest matching
• Why is it good
– Very elegant, pure graph theory
– MCES can also be found
– Disconnected MCS/MCES can be found
– Node and edge coloring fits easily
• What are the drawbacks
– Product graph is large and dense
• Recent advances in clique detection
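The clique approach outlined on this slide can be illustrated with a minimal Python sketch. It is not the RASCAL implementation and not ChemAxon code. It assumes atoms are given as a list of labels and bonds as index pairs, it ignores bond types, and it uses a plain Bron-Kerbosch search with pivoting. The function names modular_product and max_clique are made up for this illustration.

```python
from itertools import combinations

def modular_product(labels1, edges1, labels2, edges2):
    """Build the product graph of two atom-labelled graphs.

    Nodes are pairs (i, j) of equally labelled atoms, so there are at most
    |V1| * |V2| of them. Two pairs are adjacent when they use distinct atoms
    on both sides and agree on "bonded" versus "not bonded".
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = [(i, j) for i, a in enumerate(labels1)
             for j, b in enumerate(labels2) if a == b]
    adj = {v: set() for v in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        same_bonding = (frozenset((i, k)) in e1) == (frozenset((j, l)) in e2)
        if i != k and j != l and same_bonding:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def max_clique(adj):
    """Return one maximum clique (Bron-Kerbosch with pivoting and a size bound)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best

# Toy input: ethanol heavy atoms (C-C-O) against dimethyl ether (C-O-C).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])
product = modular_product(*ethanol, *ether)
matching = max_clique(product)  # list of (atom in graph 1, atom in graph 2) pairs
print(len(product), "product nodes,", len(matching), "matched atoms")
```

Each clique member is an atom pair, so the clique reads directly as an atom-atom matching. Because adjacency does not require the matched atoms to be bonded, the result may be disconnected, which is the property the slide lists as an advantage.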
Slide 9
MCS search methods / Backtrack
• Crandell-Smith, 1983
• Advantages
– Flexible, easy to add constraints, incorporate chemical knowledge, heuristics
– Dynamic programming
– Various search strategies
• Recent algorithms
– Jun Xu, GMA, 1995
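To make the backtracking idea concrete, here is a minimal Python sketch. It is a generic exhaustive search with a simple size bound, not the Crandell-Smith, GMA, or ChemAxon algorithm. Atoms are assumed to be a list of labels and bonds a list of index pairs. The function name mcs_backtrack and its helpers are hypothetical.

```python
def mcs_backtrack(labels1, edges1, labels2, edges2):
    """Exhaustive backtracking search for a maximum common induced subgraph.

    Returns a dict mapping atom indices of graph 1 to atom indices of graph 2.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    n1 = len(labels1)
    best = {}

    def compatible(i, j, mapping):
        # Constraint hook: labels must match and every already mapped pair
        # must agree on bonded / not bonded. Extra chemical rules would go here.
        if labels1[i] != labels2[j]:
            return False
        return all((frozenset((i, k)) in e1) == (frozenset((j, l)) in e2)
                   for k, l in mapping.items())

    def extend(i, mapping, used):
        nonlocal best
        # Bound: even if all remaining atoms were mapped, best is not beaten.
        if len(mapping) + (n1 - i) <= len(best):
            return
        if i == n1:
            best = dict(mapping)
            return
        for j in range(len(labels2)):
            if j not in used and compatible(i, j, mapping):
                mapping[i] = j
                used.add(j)
                extend(i + 1, mapping, used)
                del mapping[i]
                used.remove(j)
        extend(i + 1, mapping, used)  # also try leaving atom i unmapped

    extend(0, {}, set())
    return best

# Toy input: ethanol heavy atoms (C-C-O) against dimethyl ether (C-O-C).
mapping = mcs_backtrack(["C", "C", "O"], [(0, 1), (1, 2)],
                        ["C", "O", "C"], [(0, 1), (1, 2)])
print(mapping)
```

The flexibility named on the slide shows up in compatible: any additional constraint or heuristic can be added to that one function without touching the search itself.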
Slide 10
Comparison of methods
• Brint and Willett, 1986: Clique based substantially faster
• Recent publication, 2006: backtracking is superior
• We tested both approaches
– Backtracking: 1.2 s (exhaustive search)
– Clique based was stopped after 2 hours!!!
Slide 11
ChemAxon MCS search approach
• Based on Wang and Zhou, EMCSS, 1996
• Backtracking
– Divide and conquer strategy
– Create all spanning trees of the query graph
[Figure: query structure with numbered atoms; the drawing is not recoverable from this transcript]
Slide 12
ChemAxon MCS search approach
– Use this as a route plan to traverse the target graph
[Figure: the numbered query graph and the corresponding traversal of the target graph; the drawing is not recoverable from this transcript]
Slide 13
An application of MCS
• Reaction automapping (live demonstration)
• Average mapping time: 320ms
• Complex structures cannot be mapped efficiently
[Figure: example reaction with large, stereochemically rich structures (R/S labels shown); the drawing is not recoverable from this transcript]
Slide 14
Product development philosophy
Sophisticated technology
High performance (speed, accuracy, features)
Rounded, industry relevant functionality
Customizable
Extendable
Long term relevance
>300 active clients
Client driven development
Fast and reliable support
Comprehensive API
Platform independence (Java)
Slide 15
LibMCS motivations
“However, finding MCS from a pair of molecules has limited usage for our study. When we get hits from HTS, we cluster them into groups and the chemists will eye browse each group to find the scaffolds that are potentially good templates for later expansion. One main use of MCS will be to process multiple compounds of similar structures and automate what chemists have been doing by eyes now.”
“We expect to use MCS tools for two cases: 1) use to analyze hits from HTS screens. 2) use it as a sorting tool for data retrieval, i.e., whenever people export data from our database (compounds across assays), we run MCS so that structurally similar compounds are grouped together. Chemists like this very much (we currently do this by clustering based on overall Tanimoto similarity).”
“The typical hits from screens range from 2000-10000 (in few cases). In lead optimization phase, the compound list is around 3000-5000 in a typical project. So if MCS tools can process 5000 compound under 5 seconds, it can be integrated with online web tools. Otherwise, if it takes several minutes, it will be only used to analyze hits off-line based on user requests. If it takes more than an hour, its usage will be very limited.”
Slide 16
LibMCS is a hard problem to solve
• Exact solution
– Requires the pair-wise comparison of each structure
  • n ∙ (n - 1) / 2 MCS computations
  • Next problem is larger!!
– All CS (above a given size) have to be found
  • n ∙ (n - 1) / 2 CS computations
  • Partitioning O(n³) CS
Slide 17
Pair-wise MCS table
Slide 18
Pair-wise MCS computation
• Average MCS computation: 100 ms
• First step: n ∙ (n - 1) / 2 MCS computations
– 100 structures: 50 ∙ 99 ∙ 100 ms ≈ 8 min
– 1000 structures: ≈ 14 hours
• Second step: larger problem has to be solved
• Not a feasible approach in practice
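The estimates on this slide follow directly from the pair count. The short Python sketch below reproduces them, using the 100 ms average quoted above as the only input. The function name is made up for this illustration.

```python
def pairwise_mcs_millis(n, millis_per_mcs=100):
    """Total time in ms for all n * (n - 1) / 2 pair-wise MCS computations."""
    return n * (n - 1) // 2 * millis_per_mcs

for n in (100, 1000):
    ms = pairwise_mcs_millis(n)
    print(f"{n} structures: {ms / 60_000:.2f} min = {ms / 3_600_000:.2f} h")
```

For 100 structures this is 4950 pairs, or 495 s (about 8 minutes). For 1000 structures it is 499,500 pairs, or 49,950 s (just under 14 hours). Ten times more structures cost roughly a hundred times more time.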
Slide 19
Known approaches / Products
• Stahl and Mauser, 2004, 2005
– Cluster first (ES)
– Find an MCS for each cluster
• Wilkens, Janes and Su, 2004
• BioReason ClassPharmer
• ChemTK
• LeadScope
• Tripos ?
• Daylight ?
Slide 20
ChemAxon’s approach
• Goal
– Reduce the number of MCS pair computations
• Idea: guess which two structures give significant MCS
– Similar compounds are likely to share large MCS
– Similarity guided pair-wise MCS
• Not: cluster by similarity first, then determine the MCS of each cluster
• Which molecular descriptor gives best correlation
– ChemAxon fingerprint
– BCUT (Burden matrix)
• Consequence
– Approximate solution
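The idea of letting similarity decide which pairs deserve an MCS computation can be sketched in a few lines of Python. This is an illustration only, not the LibMCS code. It assumes fingerprints are sets of on-bit indices and uses Tanimoto similarity as the metric. The function names and the threshold parameter are hypothetical.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def candidate_pairs(fingerprints, threshold):
    """Return index pairs at or above the threshold, most similar pair first."""
    scored = [(tanimoto(fingerprints[i], fingerprints[j]), i, j)
              for i, j in combinations(range(len(fingerprints)), 2)]
    return [(i, j) for s, i, j in sorted(scored, reverse=True) if s >= threshold]

# Toy fingerprints: structures 0 and 1 are alike, and so are 2 and 3.
fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 8, 9}]
for i, j in candidate_pairs(fps, threshold=0.5):
    print("compute MCS for structures", i, "and", j)
```

All pair-wise similarities are still computed, but these are cheap bit operations. The expensive MCS search runs only for the pairs that pass the threshold, which is why the result is approximate: a pair with low fingerprint similarity but a large MCS is never examined.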
Slide 21
LibMCS algorithm
[Flowchart, box labels only; the arrows are not recoverable from this transcript]
Steps: Read input structures; Generate fingerprint; Calculate similarity matrix; Make singletons; Get two most similar; Compute MCS; Create new cluster; SSS; Add to cluster
Decisions (y/n): Similarity above threshold; MCS large; More structures; Found
Slide 22
Applications
• Screen analysis
• Data visualization and profiling
• Combinatorial library partitioning
• Buying new compounds
• ?
• Suggest more!!!!
Slide 23
Application 1 / Screen analysis
[Plot: structures retrieved (log scale, 1 to 10000) versus spikes retrieved (0 to 35); series: Euclidean, Optimized Euclidean, Ideal]
Slide 24
Activity filtering
Slide 25
Live demonstration
• Partitioning mixed combinatorial library
– Effect of parameters
– Effect of modes
– Benchmarks
– Quality of clusters
Slide 26
Combichem library scaffolds
Slide 27
Combichem library scaffolds
• Turbo mode distorts clusters
Slide 28
Combichem benchmark
• Influence of normal/fast/turbo mode
• Worth it: the distortion is not significant
[Plot: running time (sec, 0 to 4000) versus structure count (0 to 35000); series: Normal, Fast, Turbo]
Slide 29
Development roadmap
• Soon
– R-Group decomposition
– Stereo care MCS
– Preserving rings
– Lower bound pre-filtering
– Disconnected MCS
– Multi cluster members
• Mid term
– Integrate Ward/Jarvis-Patrick in the new GUI
• Long term
– Integrate molecular descriptors, metrics
– Integrate virtual screening
Slide 30
Coming soon – R-Group decomposition
Slide 31
Coming soon – R-Group decomposition
Slide 32
Coming soon – Multi cluster
Slide 33
Summary
• MCS developed for automatic reaction mapping
• MCS based hierarchical clustering
• Fast method
• Chemical adequacy must be improved
• Various uses, currently focusing on combinatorial library partitioning
Slide 34
Acknowledgements
• Developers
– Péter Vadász
– Nóra Máté
• Ideas
– Szabolcs Csepregi, Ferenc Csizmadia
• Special thanks to