ugm 2006 miklós vargyas scientific workshop maximum common substructure

UGM 2006

Miklós Vargyas

Scientific Workshop

Maximum Common Substructure

Slide 2

Workshop overview

• Introduction, concepts, theory

• Clustering, the role of MCS

• Applications

• Future plans

Slide 3

Motivations

• Automated reaction mapping

Slide 4

Mapping chemical reactions

[Figure: example reaction drawn as chemical structures]

Slide 5

ChemAxon’s automapper

[Figure: the example reaction, unmapped]

• Find parts common to both sides

• Map common parts

[Figure: the example reaction with atom map numbers 1 to 19 assigned to the parts common to both sides]

Slide 6

ChemAxon’s automapper

• Map the rest

  – Score possible mappings
  – Find the one that scores the highest

[Figure: the example reaction with atom map numbers 1 to 19]

Slide 7

Concepts and theory

• MCS/MCES/MOS

• MCS complexity O(n^m)

Slide 8

MCS search methods / Clique

• Barrow and Burstall, 1976

• Raymond and Willett, RASCAL, 2002

• Details in brief
  – Construct the product graph of G1 and G2
    • Node count: |V1| ∙ |V2|
  – Find clique, it corresponds to largest matching

• Why is it good
  – Very elegant, pure graph theory
  – MCES can also be found
  – Disconnected MCS/MCES can be found
  – Node and edge coloring fits easily

• What are the drawbacks
  – Product graph is large and dense

• Recent advances in clique detection
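
The clique approach above can be illustrated with a minimal Python sketch. It is not RASCAL or any ChemAxon code: the graph encoding (a label dictionary plus a set of frozenset edges) and the names `modular_product` and `max_clique` are assumptions made for this example, and the clique search is a plain Bron-Kerbosch with pivoting. Only label-compatible node pairs are kept, so the product graph has at most |V1| ∙ |V2| nodes.

```python
from itertools import combinations


def modular_product(labels1, edges1, labels2, edges2):
    """Build the product graph of two labelled graphs (hypothetical encoding).

    labelsX: dict node -> atom label; edgesX: set of frozenset({u, v}).
    A product node pairs one node from each graph with equal labels.
    """
    nodes = [(u, v) for u in labels1 for v in labels2 if labels1[u] == labels2[v]]
    adj = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue  # a node may be matched only once
        # Compatible if both pairs are bonded or both are non-bonded.
        if (frozenset((u1, u2)) in edges1) == (frozenset((v1, v2)) in edges2):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def max_clique(adj):
    """Largest clique by Bron-Kerbosch with pivoting (exponential worst case)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for n in list(p - adj[pivot]):
            expand(r + [n], p & adj[n], x & adj[n])
            p = p - {n}
            x = x | {n}

    expand([], set(adj), set())
    return best


# Toy input: C-C-O chain against C-C-C-O chain.
labels1 = {0: "C", 1: "C", 2: "O"}
edges1 = {frozenset((0, 1)), frozenset((1, 2))}
labels2 = {"a": "C", "b": "C", "c": "C", "d": "O"}
edges2 = {frozenset(("a", "b")), frozenset(("b", "c")), frozenset(("c", "d"))}

product = modular_product(labels1, edges1, labels2, edges2)
matching = sorted(max_clique(product))
print(len(product), matching)
```

In this toy case the clique pairs the C-C-O chain with the last three atoms of C-C-C-O. Each clique member is one atom-to-atom assignment, which is why the clique size equals the size of the common substructure. Because non-bonded pairs are also compatible, the result may be disconnected, as the slide notes.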

Slide 9

MCS search methods / Backtrack

• Crandell-Smith, 1983

• Advantages
  – Flexible, easy to add constraints, incorporate chemical knowledge, heuristics
  – Dynamic programming
  – Various search strategies

• Recent algorithms
  – Jun Xu, GMA, 1995

Slide 10

Comparison of methods

• Brint and Willett, 1986: Clique based substantially faster

• Recent publication, 2006: backtracking is superior

• We tested both approaches
  – Backtracking: 1.2 s (exhaustive search)
  – Clique based was stopped after 2 hours!!!

Slide 11

ChemAxon MCS search approach

• Based on Wang and Zhou, EMCSS, 1996

• Backtracking
  – Divide and conquer strategy
  – Create all spanning trees of the query graph

[Figure: query structure with atoms numbered 1 to 15]

Slide 12

ChemAxon MCS search approach

– Use this as a route plan to traverse the target graph

[Figure: numbered query and target structures illustrating the traversal]
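
The idea of using a spanning tree as a route plan can be sketched in Python as follows. The slides mention creating all spanning trees of the query. This sketch builds only one, by depth-first search, which is a simplification made to keep the example short. The adjacency-dict encoding and the name `dfs_spanning_tree` are hypothetical and not taken from the ChemAxon implementation.

```python
def dfs_spanning_tree(adjacency, root):
    """Return tree edges (parent, child) in the order a depth-first walk adds them.

    adjacency: dict node -> iterable of neighbouring nodes (hypothetical encoding).
    The edge list can serve as a route plan: each step says from which
    already matched atom the next atom is reached.
    """
    visited = {root}
    plan = []
    stack = [(root, iter(sorted(adjacency[root])))]
    while stack:
        node, neighbours = stack[-1]
        for nxt in neighbours:
            if nxt not in visited:
                visited.add(nxt)
                plan.append((node, nxt))
                stack.append((nxt, iter(sorted(adjacency[nxt]))))
                break
        else:
            stack.pop()  # all neighbours handled, go back one step
    return plan


# Toy query: a four-membered ring (1-2-3-4) with a substituent (5) on atom 4.
query = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3, 5], 5: [4]}
route = dfs_spanning_tree(query, 1)
print(route)
```

The ring-closing bond between atoms 1 and 4 is not part of the tree. A matcher following the plan would have to check such bonds separately once both end atoms are mapped.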

Slide 13

An application of MCS

• Reaction automapping (live demonstration)

• Average mapping time: 320 ms

• Complex structures cannot be mapped efficiently

[Figure: example of a complex reaction, with large structures carrying R/S stereo labels]

Slide 14

Product development philosophy

Sophisticated technology

High performance (speed, accuracy, features)

Rounded, industry relevant functionality

Customizable

Extendable

Long term relevance

>300 active clients

Client driven development

Fast and reliable support

Comprehensive API

Platform independence (Java)

Slide 15

LibMCS motivations

“However, finding MCS from a pair of molecules has limited usage for our study. When we get hits from HTS, we cluster them into groups and the chemists will eye browse each group to find the scaffolds that are potentially good templates for later expansion. One main use of MCS will be to process multiple compounds of similar structures and automate what chemists have been doing by eyes now.”

“We expect to use MCS tools for two cases: 1) use to analyze hits from HTS screens. 2) use it as a sorting tool for data retrieval, i.e., whenever people export data from our database (compounds across assays), we run MCS so that structurally similar compounds are grouped together. Chemists like this very much (we currently do this by clustering based on overall Tanimoto similarity).”

“The typical hits from screens range from 2000-10000 (in few cases). In lead optimization phase, the compound list is around 3000-5000 in a typical project. So if MCS tools can process 5000 compound under 5 seconds, it can be integrated with online web tools. Otherwise, if it takes several minutes, it will be only used to analyze hits off-line based on user requests. If it takes more than an hour, its usage will be very limited.”

Slide 16

LibMCS is a hard problem to solve

• Exact solution
  – Requires the pair-wise comparison of each structure
    • n ∙ (n - 1) / 2 MCS computations
    • Next problem is larger!!
  – All CS (above a given size) have to be found
    • n ∙ (n - 1) / 2 CS computations
    • Partitioning O(n^3) CS

Slide 17

Pair-wise MCS table

Slide 18

Pair-wise MCS computation

• Average MCS computation: 100 ms

• First step: n ∙ (n - 1) / 2 MCS computations
  – 100 structures: 50 ∙ 99 ∙ 100 ms ≈ 8 min
  – 1000 structures: about 14 hours

• Second step: larger problem has to be solved

• Not a feasible approach in practice
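
The estimates above follow directly from the pair count. A short Python check, using the 100 ms average from the slide (the function names are only for illustration):

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n structures: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def total_ms(n: int, ms_per_mcs: int = 100) -> int:
    """Total time in milliseconds if every pair needs one MCS computation."""
    return pair_count(n) * ms_per_mcs


for n in (100, 1000):
    seconds = total_ms(n) / 1000
    print(n, pair_count(n), "pairs,", seconds, "s")
```

For 100 structures this gives 4950 pairs and 495 s (8.25 min). For 1000 structures it gives 499500 pairs and 49950 s (just under 14 hours). The cost grows quadratically, so a tenfold larger set takes roughly a hundred times longer.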

Slide 19

Known approaches / Products

• Stahl and Mauser, 2004, 2005
  – Cluster first (ES)
  – Find an MCS for each cluster

• Wilkens, Janes and Su, 2004

• BioReason ClassPharmer

• ChemTK

• LeadScope

• Tripos ?

• Daylight ?

Slide 20

ChemAxon’s approach

• Goal
  – Reduce the number of MCS pair computations

• Idea: guess which two structures give a significant MCS
  – Similar compounds are likely to share a large MCS
  – Similarity-guided pair-wise MCS

• This is not the same as clustering by similarity and then determining the MCS of each cluster

• Which molecular descriptor gives the best correlation?
  – ChemAxon fingerprint
  – BCUT (Burden matrix)

• Consequence
  – Approximate solution
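
A minimal Python sketch of the similarity-guided idea: rank structure pairs by fingerprint similarity and run the expensive MCS step only on pairs above a threshold. The Tanimoto coefficient is an assumption here, since the slides do not state which metric is used. Fingerprints are modelled as plain sets of "on" bit positions, and the names `tanimoto` and `candidate_pairs` are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of set bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def candidate_pairs(fingerprints, threshold):
    """Pairs whose similarity reaches the threshold, most similar first.

    fingerprints: dict name -> set of bit positions (toy stand-in).
    Only these pairs would be passed on to an MCS computation.
    """
    scored = [
        (tanimoto(fingerprints[x], fingerprints[y]), x, y)
        for x, y in combinations(sorted(fingerprints), 2)
    ]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


fps = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {7, 8, 9}}
pairs = candidate_pairs(fps, 0.5)
print(pairs)
```

Of the three possible pairs in this toy set, only one passes the threshold, so only one MCS computation would be needed. The price is the one named on the slide: a pair with a large common substructure but dissimilar fingerprints is never examined, so the result is approximate.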

Slide 21

LibMCS algorithm

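[Flowchart: box labels in extraction order; the arrows and the targets of the y/n branches are not recoverable]

• Read input structures
• Generate fingerprint
• Calculate similarity matrix
• Make singletons
• Compute MCS
• MCS large? (y/n)
• Create new cluster
• Similarity above threshold? (y/n)
• Get two most similar
• More structures? (y/n)
• SSS
• Found? (y/n)
• Add to cluster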
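
One plausible reading of the flowchart steps, written as a toy Python sketch. It is a guess at the control flow, not the LibMCS implementation. Structures are modelled as sets of fragment labels, the "MCS" is simply the set intersection, and the substructure search (SSS) is a subset test. The function names and the parameters `sim_threshold` and `min_core` are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Set similarity used here as a stand-in for fingerprint similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_by_common_core(structures, sim_threshold=0.5, min_core=2):
    """structures: dict name -> frozenset of fragment labels (toy molecules)."""
    unassigned = set(structures)                 # "make singletons"
    clusters = []                                # (core, sorted member names)
    pairs = sorted(
        combinations(sorted(structures), 2),
        key=lambda p: jaccard(structures[p[0]], structures[p[1]]),
        reverse=True,
    )
    for a, b in pairs:                           # "get two most similar"
        if jaccard(structures[a], structures[b]) < sim_threshold:
            break                                # later pairs are less similar
        if a not in unassigned or b not in unassigned:
            continue
        core = structures[a] & structures[b]     # toy "compute MCS"
        if len(core) < min_core:                 # "MCS large?"
            continue
        members = {a, b}                         # "create new cluster"
        for name in sorted(unassigned - members):
            if core <= structures[name]:         # toy "SSS": contains the core?
                members.add(name)                # "add to cluster"
        unassigned -= members
        clusters.append((core, sorted(members)))
    return clusters, sorted(unassigned)


toy = {
    "m1": frozenset("ABCx"), "m2": frozenset("ABCy"), "m3": frozenset("ABCzw"),
    "m4": frozenset("PQr"), "m5": frozenset("PQs"), "m6": frozenset("K"),
}
clusters, singletons = cluster_by_common_core(toy)
print(clusters, singletons)
```

In the toy data the most similar pair (m1, m2) seeds a cluster with core {A, B, C}, and m3 joins through the subset test without any further pairwise "MCS" step. That is the point of the design: one MCS computation plus cheap substructure checks per cluster, instead of comparing every pair.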

Slide 22

Applications

• Screen analysis

• Data visualization and profiling

• Combinatorial library partitioning

• Buying new compounds

• ?

• Suggest more!!!!

Slide 23

Application 1 / Screen analysis

[Chart: structures retrieved (log scale, 1 to 10000) versus spikes retrieved (0 to 35); series: Euclidean, Optimized Euclidean, Ideal]

Slide 24

Activity filtering

Slide 25

Live demonstration

• Partitioning mixed combinatorial library
  – Effect of parameters
  – Effect of modes
  – Benchmarks
  – Quality of clusters

Slide 26

Combichem library scaffolds

Slide 27

Combichem library scaffolds

• Turbo mode distorts clusters

Slide 28

Combichem benchmark

• Influence of normal/fast/turbo mode

• Worth it: distortion is not significant

[Chart: running time (sec, 0 to 4000) versus structure count (0 to 35000); series: Normal, Fast, Turbo]

Slide 29

Development roadmap

• Soon
  – R-Group decomposition
  – Stereo care MCS
  – Preserving rings
  – Lower bound pre-filtering
  – Disconnected MCS
  – Multi cluster members

• Mid term
  – Integrate Ward/Jarvis-Patrick in the new GUI

• Long term
  – Integrate molecular descriptors, metrics
  – Integrate virtual screening

Slide 30

Coming soon – R-Group decomposition

Slide 31

Coming soon – R-Group decomposition

Slide 32

Coming soon – Multi cluster

Slide 33

Summary

• MCS developed for automatic reaction mapping

• MCS based hierarchical clustering

• Fast method

• Chemical adequacy must be improved

• Various uses, currently focusing on combinatorial library partitioning

Slide 34

Acknowledgements

• Developers
  – Péter Vadász
  – Nóra Máté

• Ideas
  – Szabolcs Csepregi, Ferenc Csizmadia

• Special thanks to
