Competitive data science: A tale of two web services
DESCRIPTION
Initial results from the Boehringer Ingelheim Pharmaceuticals, Inc. 'Predicting a biological response' Kaggle competition. Presented at the Fall ACS 2012 #CINF session "When Chemists and Computers Collide: Putting Cheminformatics in the Hands of Medicinal Chemists".
TRANSCRIPT
![Page 1: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/1.jpg)
![Page 2: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/2.jpg)
Competitive data science: A tale of two web services
David C. Thompson, Jörg Bentzien, Ingo Mügge, Ben Hamner
![Page 3: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/3.jpg)
What is about to happen
• about.me
• The Kaggle process
• The data set
• How the competition went
• The models and implementation
• What we have learnt
![Page 4: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/4.jpg)
about.me/dcthompson
My favourite papers from each period: [1] J. Chem. Phys. 122, 124107 (2005) [2] J. Chem. Phys. 128, 224103 (2008) [3] J. Chem. Inf. Model. 49, 1889 (2009) [4] J. Chem. Inf. Model. 51, 93 (2011)
![Page 5: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/5.jpg)
![Page 6: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/6.jpg)
What you should know about this exercise
• We wanted to investigate the utility of the process
• We wanted to move with speed
• We wanted to use a data set the scientific community had previously seen
• We wanted to be inclusive – no domain expertise needed
![Page 7: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/7.jpg)
The data set
• Version 2 of the Hansen Ames mutagenicity data was used
• The following protocol was observed:

| What happened | # of molecules (removed) |
|---|---|
| Download SMILES | 6512 |
| Conversion with Corina | 6503 (9) |
| Remove non-zero formal charge | 6419 (84) |
| Remove if more than 99 atoms | 6414 (5) |
| Remove if contains undesirable atoms* | 6252 (162) |

* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
http://doc.ml.tu-berlin.de/toxbenchmark/ J. Chem. Inf. Model. 49, 2077 (2009)
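The last three filtering steps of the protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the molecule record format (a list of `(element, formal_charge)` tuples), the function names, the reading of "non-zero formal charge" as net molecular charge, and the choice to count all listed atoms toward the 99-atom limit are all hypothetical and not taken from the slides. SMILES download and Corina conversion are not modelled.

```python
from typing import List, Tuple

# Hypothetical record format: a molecule is a list of (element, formal_charge).
Atom = Tuple[str, int]
Molecule = List[Atom]

# Undesirable atoms as listed in the slide footnote.
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}
MAX_ATOMS = 99


def net_charge(mol: Molecule) -> int:
    """Sum of formal charges (assumption: the filter acts on net charge)."""
    return sum(charge for _, charge in mol)


def apply_protocol(mols: List[Molecule]):
    """Apply the three filters in order; report (step, remaining, removed)."""
    steps = [
        ("Remove non-zero formal charge", lambda m: net_charge(m) == 0),
        ("Remove if more than 99 atoms", lambda m: len(m) <= MAX_ATOMS),
        ("Remove if contains undesirable atoms",
         lambda m: not any(elem in UNDESIRABLE for elem, _ in m)),
    ]
    report = []
    for label, keep in steps:
        before = len(mols)
        mols = [m for m in mols if keep(m)]
        report.append((label, len(mols), before - len(mols)))
    return mols, report


# Toy input: one molecule per outcome.
toy = [
    [("C", 0), ("O", 0)],    # passes every filter
    [("N", 1), ("C", 0)],    # net charge +1
    [("C", 0)] * 100,        # more than 99 atoms
    [("C", 0), ("Si", 0)],   # contains Si
]
kept, report = apply_protocol(toy)
```

Reporting the remaining and removed counts per step mirrors the "# of molecules (removed)" column, which makes it easy to compare a rerun against the published numbers.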
![Page 8: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/8.jpg)
Descriptor calculation
• SD file, descriptor calculation – 6252 x 5030
– Filter for low variance (≤ 0.01); removed 2537
– Remove for high correlation (> 0.90); removed 716
– Descriptor normalization resulted in 6252 x 1777 .csv file

| Descriptor Engine | # of descriptors |
|---|---|
| MOE 2D | 76 (186) |
| Atom Pair | 696 (1920) |
| MolConn-Z | 174 (745) |
| Pipeline Pilot Property Counts | 5 (130) |
| Daylight fingerprints | 825 (2048) |
| clogP | 0 (1) |
J. Chem. Inf. Model. 49, 2077 (2009)
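The descriptor filtering on this slide (low-variance filter, high-correlation filter, normalization) can be illustrated with a small standard-library sketch. Several details are assumptions, not taken from the slides: population variance is used, correlation is absolute Pearson correlation with a greedy first-come-first-kept rule, and normalization is min-max scaling to [0, 1]. The function and variable names are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, var_cut=0.01, corr_cut=0.90):
    """columns maps descriptor name -> list of values (one per molecule)."""
    # Step 1: drop descriptors with variance <= var_cut.
    kept = {name: vals for name, vals in columns.items()
            if pvariance(vals) > var_cut}

    # Step 2: greedily keep a descriptor only if its absolute correlation
    # with every descriptor already kept does not exceed corr_cut.
    selected = []
    for name in kept:
        if all(abs(pearson(kept[name], kept[s])) <= corr_cut
               for s in selected):
            selected.append(name)

    # Step 3: min-max normalization (assumed scheme).
    result = {}
    for name in selected:
        lo, hi = min(kept[name]), max(kept[name])
        result[name] = [(v - lo) / (hi - lo) for v in kept[name]]
    return result


toy_columns = {
    "a": [0.0, 1.0, 2.0, 3.0],
    "b": [0.0, 2.0, 4.0, 6.1],    # almost perfectly correlated with "a"
    "c": [1.0, 1.0, 1.0, 1.05],   # near-constant, low variance
    "d": [3.0, 0.0, 2.0, 1.0],
}
filtered = filter_descriptors(toy_columns)
```

With a greedy rule, which member of a correlated pair survives depends on column order, so a different ordering of the descriptor engines could yield a different retained set.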
![Page 9: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/9.jpg)
Testing Framework
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
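As a rough illustration of the public/private idea, the test set can be partitioned once, with a fixed seed, into two disjoint subsets. The 25% public fraction, the seed, and the function name below are hypothetical; the slides do not state how the split was made for this competition.

```python
import random


def split_leaderboard(test_ids, public_fraction=0.25, seed=0):
    """Partition test-set ids into disjoint public and private subsets.

    public_fraction and seed are illustrative assumptions only.
    """
    ids = list(test_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_public = int(len(ids) * public_fraction)
    return set(ids[:n_public]), set(ids[n_public:])


public_ids, private_ids = split_leaderboard(range(100))
```

Scores shown during the competition would be computed on `public_ids` only, while the final ranking would use `private_ids`, so repeated submissions cannot tune against the data that decides the winners.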
![Page 10: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/10.jpg)
Expectations
“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
• 20 models generated with different algorithms and descriptors
• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
• Inter-laboratory accuracy for Ames test reported at 85%
Expectation: Models should have similar accuracy to literature
Goal: Models should be balanced; sensitivity and specificity should be high
J. Chem. Inf. Model. 50, 2094 (2010)
![Page 12: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/12.jpg)
![Page 13: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/13.jpg)
$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$
Performance as a function of time
796 players
703 teams
8841 entries
55 forum topics, 409 posts
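For reference, the log loss metric used to score entries can be computed as in the minimal sketch below, where `y_true` holds the 0/1 labels and `y_pred` the predicted probabilities. Clipping predictions with `eps` is a common safeguard against `log(0)` and is an assumption here, not something stated on the slides.

```python
from math import log


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


# Predicting 0.5 for everything gives log(2), regardless of the labels.
baseline = log_loss([1, 0], [0.5, 0.5])
```

Because the metric works on probabilities rather than class labels, confident wrong predictions are penalized heavily, which is one reason small differences in log loss need not translate into different classification accuracy.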
![Page 14: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/14.jpg)
Final Ranking

| Final rank | Team Name | Public Ranking | Δ (log loss) |
|---|---|---|---|
| 1 | Winter is Coming & Sergey | 11 | 0 |
| 2 | seelary | 26 | 7E-05 |
| 3 | bluehat | 1 | 0.00051 |
| 4 | jazz | 15 | 0.0014 |
| 5 | Wayne Zhang & Gxav & woshialex | 19 | 0.00146 |
| 6 | Indy Actuaries | 38 | 0.00184 |
| 7 | bluemaster & imran | 7 | 0.00231 |
| 8 | Efiimov & Bers & Cragin & vsu | 4 | 0.00241 |
| 9 | y_tag | 18 | 0.0026 |
| 10 | Killian O’Connor | 44 | 0.00285 |
| 11 | PlanetThanet & SirGuessalot | 40 | 0.00298 |
| 12 | AussieTim | 48 | 0.00335 |
| 13 | Jason Farmer | 31 | 0.00347 |
| 14 | GreenPeace | 16 | 0.00356 |
| 15 | mars | 32 | 0.00388 |
| 16 | Fuzzify | 60 | 0.00392 |
| 17 | Emanuele | 63 | 0.00395 |
| 18 | HappyHour | 10 | 0.00431 |
| 19 | Baltic | 30 | 0.00465 |
| 20 | dejavu | 20 | 0.00482 |
| 352 | Random Forest Benchmark | 373 | 0.04184 |
| 541 | Support Vector Machine Benchmark | 522 | 0.12147 |
| 639 | Optimized Constant Value Benchmark | 647 | 0.31414 |
| 642 | Uniform Benchmark | 650 | 0.31959 |
https://github.com/emanuele/kaggle_pbr
https://github.com/benhamner/BioResponse
![Page 15: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/15.jpg)
#FTW Strategies
• Feature selection
• RF + complementary approaches
• Blending
All three winning teams identified D27 as important. What is it? Organon toxicophore*
* J. Med. Chem. 48, 312 (2005)
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
![Page 16: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/16.jpg)
Private Set Performance

| Group | Model | TP | FN | FP | TN | Se | Sp | CCR |
|---|---|---|---|---|---|---|---|---|
| Winning Teams | Team 1 | 873 | 165 | 151 | 687 | 0.84 | 0.82 | 0.83 |
| Winning Teams | Team 2 | 888 | 150 | 165 | 673 | 0.86 | 0.80 | 0.83 |
| Winning Teams | Team 3 | 893 | 145 | 162 | 676 | 0.86 | 0.80 | 0.83 |
| Benchmarks | RF | 888 | 150 | 166 | 672 | 0.86 | 0.80 | 0.83 |
| Benchmarks | SVM | 822 | 216 | 215 | 673 | 0.79 | 0.74 | 0.77 |
| Other | Team 17 | 896 | 142 | 169 | 669 | 0.86 | 0.80 | 0.83 |
| Other | D27 | 781 | 257 | 215 | 623 | 0.75 | 0.74 | 0.75 |

Se: TP/(TP+FN) Sp: TN/(FP+TN)
CCR: (Se + Sp)/2
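The sensitivity, specificity, and CCR definitions on this slide translate directly into code. The sketch below applies them to the Team 1 counts reported here (TP 873, FN 165, FP 151, TN 687); the function name is hypothetical.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (Se, Sp, CCR) from confusion-matrix counts."""
    se = tp / (tp + fn)      # sensitivity: fraction of positives recovered
    sp = tn / (fp + tn)      # specificity: fraction of negatives recovered
    ccr = (se + sp) / 2      # correct classification rate (balanced)
    return se, sp, ccr


# Team 1 confusion-matrix counts from the slide.
se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
```

Rounded to two decimals this reproduces the slide's Team 1 row (Se 0.84, Sp 0.82, CCR 0.83). Averaging Se and Sp weights both classes equally, which matches the stated goal of balanced models.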
![Page 17: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/17.jpg)
Okay, where’s this ‘second’ web service?
BIpredict
• Physicochemical properties are updated as the molecule is built
• Atomistic descriptor values are appended directly to the molecule
* D. C. Thompson, Chemical Computing Group, User Group Meeting, Montreal, 2011
![Page 18: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/18.jpg)
So, what did we learn?
• Was this useful? – Yes
• Participation was high, contributors and contributions were diverse*
• A large number of models were of a high quality
– Differences in top models in log loss metric are small
– Different statistical measures lead to different rankings
– RandomForest benchmark has high correct classification rate (CCR)
* Sort of
![Page 19: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/19.jpg)
‘Machine learning that matters’
Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.
[Figure labels: Domain expertise; Machine learning skill]
![Page 20: Competitive data science: A tale of two web services](https://reader034.vdocument.in/reader034/viewer/2022051513/547b2e45b4795959098b4d15/html5/thumbnails/20.jpg)
Thanks to: Lilly Ackley, Amy Kunkel, Mehul Patel, Alex Renner, PhD, and all Kaggle participants – esp. Winter is Coming & Sergey