Page 1: Contemporary QSAR Classifiers Comparedcomp.chem.nottingham.ac.uk/members/bruce-files/cbruce...2007/01/17  · Craig Bruce HPC User Meeting 17th January 2007 5 Datasets Dataset Compound

Contemporary QSAR Classifiers Compared

Craig Bruce
School of Chemistry

Craig Bruce, HPC User Meeting, 17th January 2007

Introduction

QSAR: Quantitative Structure-Activity Relationship

Similar Property Principle
  Similar structure » similar properties

Methods

Support Vector Machine
Decision Tree
Random Forest
Ensemble
  Bagging
  Boosting
Parameter Tuning

Datasets

Dataset  Compound type                        No. compounds  No. descriptors (2.5D)  No. descriptors (Fragments)
ACE      Angiotensin converting enzyme        114            56                      1024
AchE     Acetylcholinesterase inhibitors      111            63                      774
BZR      Benzodiazepine receptor              163            75                      832
COX2     Cyclooxygenase-2 inhibitors          322            74                      660
DHFR     Dihydrofolate reductase inhibitors   397            70                      952
GPB      Glycogen phosphorylase b             66             70                      692
THER     Thermolysin inhibitors               76             64                      575
THR      Thrombin inhibitors                  88             66                      527

Sutherland, J. J.; O'Brien, L. A.; Weaver, D. F. J. Med. Chem. 2004, 47(22), 5541-5554.

Cross-Validation

Trained on full dataset
CV to measure classifier

[Figure: Dataset]
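A minimal sketch of repeated cross-validation as a way to measure a classifier. The slides state only that CV is repeated 10 times; the fold count (`k=10`), the seed, and the majority-class stand-in classifier `majority_fit` are assumptions made for illustration, not details from the talk.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, rng):
    """Shuffle item indices and deal them into k disjoint folds."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_fit(X, y):
    """Hypothetical stand-in classifier: always predicts the commonest training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def repeated_cv_accuracy(X, y, fit, k=10, repeats=10, seed=0):
    """Mean percentage of held-out items classified correctly over repeated k-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        correct = 0
        for fold in k_fold_indices(len(X), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train])
            correct += sum(model(X[i]) == y[i] for i in fold)
        scores.append(100.0 * correct / len(X))
    return sum(scores) / len(scores)


X = [[float(i)] for i in range(20)]
y = [0] * 15 + [1] * 5
print(repeated_cv_accuracy(X, y, majority_fit))
```

Each item is held out exactly once per repeat, so the score estimates performance on unseen compounds, while the model that is finally kept can still be trained on the full dataset.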

Need for HPC

8 datasets
2 descriptor sets
7 classifiers
10 repeats of CV
1120 models to generate
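The model count is the product 8 × 2 × 7 × 10 = 1120. As a sketch, the job grid could be enumerated like this; the dataset and classifier labels are taken from the tables in these slides, and treating each combination as one independent job is an assumption about how the work was organised.

```python
from itertools import product

DATASETS = ["ACE", "AchE", "BZR", "COX2", "DHFR", "GPB", "THER", "THR"]
DESCRIPTOR_SETS = ["2.5D", "Fragments"]
CLASSIFIERS = [
    "Tree", "Bagged Tree", "Boosted Tree", "Random Forest",
    "SVM", "Tuned Forest", "Tuned SVM",
]
CV_REPEATS = range(1, 11)  # 10 repeats of cross-validation

# One tuple per model that has to be generated.
jobs = list(product(DATASETS, DESCRIPTOR_SETS, CLASSIFIERS, CV_REPEATS))
print(len(jobs))
```

Because no job depends on another, the grid can be spread across cluster nodes in any order.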

Results - 2.5D

Dataset  Tree  Bagged Tree  Boosted Tree  Random Forest  SVM   Tuned Forest (a)  Tuned SVM (b)
ACE      86.9  86.5         86.6          85.4           90.3  89.3              89.9
AchE     70.6  71.6         72.7          72.6           72.0  79.5              74.3
BZR      71.7  75.5         75.4          74.0           77.4  79.5              81.6
COX2     75.6  75.7         76.1          73.4           75.4  75.7              75.2
DHFR     78.8  83.2         83.4          83.1           79.6  84.9              82.2
GPB      70.6  74.5         76.2          74.1           73.9  76.7              75.3
THER     67.2  69.2         67.8          69.7           69.5  74.6              74.6
THR      66.5  69.1         68.0          69.1           67.2  72.5              69.0

(a) 100 trees
(b) Polynomial kernel; exponent = 2; complexity constant = 0.05

Results - Fragments

Dataset  Tree  Bagged Tree  Boosted Tree  Random Forest  SVM   Tuned Forest (a)  Tuned SVM (b)
ACE      80.4  82.0         81.0          80.5           78.9  80.0              82.2
AchE     64.1  68.0         68.8          70.5           69.4  70.5              77.1
BZR      74.0  75.0         69.8          67.3           74.0  68.7              75.8
COX2     71.1  71.5         71.0          68.1           72.6  68.7              71.1
DHFR     84.4  85.4         83.1          84.9           83.5  85.5              86.5
GPB      73.8  75.6         76.2          74.5           77.4  75.2              76.7
THER     72.2  75.8         75.5          75.4           75.3  76.7              73.4
THR      71.5  69.2         68.8          66.7           71.1  68.4              69.8

(a) 100 trees
(b) RBF kernel; width = 0.1; complexity constant = 1
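The tuned SVM footnotes name two kernels: polynomial with exponent 2 for the 2.5D descriptors and RBF with width 0.1 for the fragments. A minimal sketch of both follows. The exact conventions are assumptions: the polynomial form here has no added constant, and "width" is read as the multiplier `gamma` in the exponent; the software used in the study may define either differently.

```python
import math


def polynomial_kernel(x, z, exponent=2):
    """Polynomial kernel (x . z) ** exponent, assuming no additive constant."""
    return sum(a * b for a, b in zip(x, z)) ** exponent


def rbf_kernel(x, z, gamma=0.1):
    """RBF kernel exp(-gamma * ||x - z||^2), assuming 'width' means gamma."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))
```

The complexity constant is a separate parameter: it sets the penalty for training errors in the SVM optimisation and does not appear in the kernel itself.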

Statistics

Paired t-test
Multiple Comparison Tests
  Nonparametric Friedman test (with Iman & Davenport correction)
  Post-hoc Nemenyi test

Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1-30.
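A sketch of the multiple-comparison procedure, following the formulas in Demšar (2006) and applied to the 2.5D results from these slides. It computes average ranks, the Friedman statistic, the Iman & Davenport F statistic, and the Nemenyi critical difference. The critical value `q_alpha = 2.949` (7 classifiers, alpha = 0.05) is quoted from memory of Demšar's table and should be checked before use; no tie correction is applied to the Friedman statistic.

```python
import math

CLASSIFIERS = ["Tree", "Bagged Tree", "Boosted Tree", "Random Forest",
               "SVM", "Tuned Forest", "Tuned SVM"]
ACCURACY_25D = {
    "ACE":  [86.9, 86.5, 86.6, 85.4, 90.3, 89.3, 89.9],
    "AchE": [70.6, 71.6, 72.7, 72.6, 72.0, 79.5, 74.3],
    "BZR":  [71.7, 75.5, 75.4, 74.0, 77.4, 79.5, 81.6],
    "COX2": [75.6, 75.7, 76.1, 73.4, 75.4, 75.7, 75.2],
    "DHFR": [78.8, 83.2, 83.4, 83.1, 79.6, 84.9, 82.2],
    "GPB":  [70.6, 74.5, 76.2, 74.1, 73.9, 76.7, 75.3],
    "THER": [67.2, 69.2, 67.8, 69.7, 69.5, 74.6, 74.6],
    "THR":  [66.5, 69.1, 68.0, 69.1, 67.2, 72.5, 69.0],
}


def rank_row(scores):
    """Rank one dataset's scores: best score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def friedman_iman_davenport(table):
    """Return (average ranks, Friedman chi-square, Iman-Davenport F)."""
    rows = list(table.values())
    n, k = len(rows), len(rows[0])
    ranked = [rank_row(r) for r in rows]
    avg_ranks = [sum(r[j] for r in ranked) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return avg_ranks, chi2, f_stat


def nemenyi_cd(k, n, q_alpha):
    """Critical difference in average rank for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


avg_ranks, chi2, f_stat = friedman_iman_davenport(ACCURACY_25D)
for name, rank in sorted(zip(CLASSIFIERS, avg_ranks), key=lambda p: p[1]):
    print(f"{name:14s} {rank:.2f}")
print("Friedman chi2:", round(chi2, 2), "Iman-Davenport F:", round(f_stat, 2))
print("Nemenyi CD:", round(nemenyi_cd(7, 8, q_alpha=2.949), 2))
```

The F statistic is compared against the F distribution with k-1 and (k-1)(N-1) degrees of freedom. If it is significant, two classifiers differ when their average ranks are further apart than the critical difference.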

Statistical Results

10 vs 100 trees in random forest tuning in 2.5D

Across classifiers, statistical difference detected
  Tuned SVM & RF better than decision tree
  Other differences not significant

Problems

Datasets are large
2GB RAM quickly used (unfairly)
Although larger amounts of RAM can be supported, it is very expensive
Problem for larger datasets and running ensemble classifiers

HPC solutions

Split task over many nodes
Parallel
  Random Forest
  Bagging

Tree computation

[Figure: tree computation leading to a Final Classification]
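A sketch of the split-then-combine idea: each tree is built independently on its own bootstrap sample, and the final classification is a majority vote. Everything here is illustrative rather than the implementation used in the talk. One-split stumps stand in for full decision trees, and a thread pool stands in for cluster nodes; on a real cluster each job would run as a separate process or batch job.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(job):
    """Fit a one-split tree on a bootstrap sample drawn with the job's own seed."""
    seed, X, y = job
    rng = random.Random(seed)
    idx = [rng.randrange(len(X)) for _ in X]  # bootstrap: sample with replacement
    best = None
    for f in range(len(X[0])):
        values = sorted({X[i][f] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between neighbouring values
            left = [y[i] for i in idx if X[i][f] <= t]
            right = [y[i] for i in idx if X[i][f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # sample had a single distinct point: predict its majority label
        label = Counter(y[i] for i in idx).most_common(1)[0][0]
        return (0, float("inf"), label, label)
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def train_ensemble(X, y, n_trees=100, workers=4):
    """Tree computation: every job is independent, so jobs can run anywhere."""
    jobs = [(seed, X, y) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, jobs))


def classify(ensemble, x):
    """Final classification: majority vote over all trees."""
    return Counter(predict_stump(s, x) for s in ensemble).most_common(1)[0][0]


X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
ensemble = train_ensemble(X, y, n_trees=25)
print(classify(ensemble, [0.05]), classify(ensemble, [0.95]))
```

Only the small fitted trees need to travel back to the node that votes, which is why bagging and random forests split over many nodes so easily.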

Interpretation

QSAR needs good accuracy and interpretability
SVMs transform the data
Decision trees produce instant classification rules
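A small sketch of why a decision tree yields rules directly: each root-to-leaf path reads as one IF/THEN statement. The `Node` structure, the descriptor names `logP` and `MW`, and the thresholds are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    feature: str                # descriptor tested at this node
    threshold: float
    left: Union["Node", str]    # branch taken when feature <= threshold
    right: Union["Node", str]   # branch taken when feature > threshold


def extract_rules(node, conditions=()):
    """Walk the tree and return one IF/THEN rule per leaf."""
    if isinstance(node, str):  # a leaf holds the predicted class label
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node}"]
    below = conditions + (f"{node.feature} <= {node.threshold}",)
    above = conditions + (f"{node.feature} > {node.threshold}",)
    return extract_rules(node.left, below) + extract_rules(node.right, above)


tree = Node("logP", 3.0, "inactive", Node("MW", 450.0, "active", "inactive"))
for rule in extract_rules(tree):
    print(rule)
```

An SVM offers no comparable readout, because its decision is made in the transformed feature space rather than on individual descriptors.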

Trees

Conclusions

SVM excellent classifier
Ensemble of trees very competitive
Universal parameters for random forest; SVM more dataset specific
Trees have interpretability advantage
Future work
  Extraction of information from ensembles

Bruce, C. L.; Melville, J. L.; Pickett, S. D.; Hirst, J. D. Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007, 47, 219-227.

Acknowledgements

Jonathan Hirst
James Melville
Stephen Pickett
Chris Luscombe
Gavin Harper

Any Questions?