Competitive data science: A tale of two web services

David C. Thompson, Jörg Bentzien, Ingo Mügge, Ben Hamner

What is about to happen

• about.me

• The Kaggle process

• The data set

• How the competition went

• The models and implementation

• What we have learnt

about.me/dcthompson

My favourite papers from each period: [1] J. Chem. Phys. 122, 124107 (2005) [2] J. Chem. Phys. 128, 224103 (2008) [3] J. Chem. Inf. Model. 49, 1889 (2009) [4] J. Chem. Inf. Model. 51, 93 (2011)


What you should know about this exercise

• We wanted to investigate the utility of the process

• We wanted to move with speed

• We wanted to use a data set the scientific community had previously seen

• We wanted to be inclusive – no domain expertise needed

The data set

• Version 2 of the Hansen Ames mutagenicity data was used

• The following protocol was observed:

What happened                            # of molecules (removed)
Download SMILES                          6512
Conversion with Corina                   6503 (9)
Remove non-zero formal charge            6419 (84)
Remove if more than 99 atoms             6414 (5)
Remove if contains undesirable atoms*    6252 (162)

* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn

http://doc.ml.tu-berlin.de/toxbenchmark/ J. Chem. Inf. Model. 49, 2077 (2009)
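The filtering protocol above can be sketched as a chain of removal steps that reports how many molecules survive each one. This is a minimal illustration, not the pipeline that was actually run: the record format (`formal_charge`, `n_atoms`, `elements`) and the function name `apply_protocol` are hypothetical, the Corina conversion step is omitted, and treating "non-zero formal charge" as a per-molecule net charge is an assumption.

```python
# Element symbols listed in the slide footnote as undesirable.
UNDESIRABLE = {
    "D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
    "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn",
}


def apply_protocol(molecules, max_atoms=99):
    """Apply the three removal steps in order.

    `molecules` is a list of dicts with the hypothetical keys
    'formal_charge', 'n_atoms' and 'elements'.
    Returns (kept_molecules, report), where report holds one
    (step label, molecules remaining, molecules removed) tuple per step.
    """
    steps = [
        ("non-zero formal charge", lambda m: m["formal_charge"] != 0),
        (f"more than {max_atoms} atoms", lambda m: m["n_atoms"] > max_atoms),
        ("undesirable atoms", lambda m: bool(UNDESIRABLE & set(m["elements"]))),
    ]
    kept = list(molecules)
    report = []
    for label, should_remove in steps:
        survivors = [m for m in kept if not should_remove(m)]
        report.append((label, len(survivors), len(kept) - len(survivors)))
        kept = survivors
    return kept, report


# Toy records, invented purely for illustration.
toy = [
    {"formal_charge": 0, "n_atoms": 12, "elements": ["C", "H", "O"]},
    {"formal_charge": 1, "n_atoms": 20, "elements": ["C", "H", "N"]},
    {"formal_charge": 0, "n_atoms": 120, "elements": ["C", "H"]},
    {"formal_charge": 0, "n_atoms": 30, "elements": ["C", "H", "Si"]},
]
kept, report = apply_protocol(toy)
for label, remaining, removed in report:
    print(f"{label}: {remaining} ({removed})")
```

The report uses the same "remaining (removed)" layout as the table, so the counts at each step can be compared directly.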




Descriptor calculation

• SD file, descriptor calculation – 6252 x 5030

– Filter for low variance (≤ 0.01); removed 2537

– Remove for high correlation (> 0.90); removed 716

– Descriptor normalization resulted in 6252 x 1777 .csv file

Descriptor Engine                 # of descriptors retained (calculated)
MOE 2D                            76 (186)
Atom Pair                         696 (1920)
MolConn-Z                         174 (745)
Pipeline Pilot Property Counts    5 (130)
Daylight fingerprints             825 (2048)
clogP                             0 (1)
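The two descriptor filters (low variance, then high correlation) can be sketched in plain Python as below. The thresholds 0.01 and 0.90 come from the slide; everything else is an assumption made for illustration: the use of population variance, the absolute Pearson correlation, the greedy "keep the first of a correlated pair" order, and the function names. The slide does not say how the original filtering resolved these choices, and the normalization step is left out because its method is not stated.

```python
from math import sqrt
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, min_variance=0.01, max_abs_corr=0.90):
    """Return the names of the descriptor columns that survive both filters.

    `columns` maps a descriptor name to its list of values (one per molecule).
    """
    # Step 1: drop descriptors whose variance is <= min_variance.
    varied = [name for name, values in columns.items()
              if pvariance(values) > min_variance]
    # Step 2 (greedy, an assumption): keep a descriptor only if its absolute
    # correlation with every descriptor already kept is <= max_abs_corr.
    kept = []
    for name in varied:
        if all(abs(pearson(columns[name], columns[other])) <= max_abs_corr
               for other in kept):
            kept.append(name)
    return kept


# Toy descriptor matrix, invented for illustration.
toy_columns = {
    "a": [0, 1, 2, 3, 4],
    "b": [0, 2, 4, 6, 8.1],    # almost perfectly correlated with "a"
    "c": [1, 1, 1, 1, 1.01],   # nearly constant
    "d": [3, 0, 4, 1, 2],
}
print(filter_descriptors(toy_columns))
```

A pairwise loop like this is quadratic in the number of descriptors, so for a matrix with thousands of columns a vectorized correlation matrix would be the more practical choice.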


J. Chem. Inf. Model. 49, 2077 (2009)

Testing Framework

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yurgenson, Tom de Godoy

• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.

• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
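The public/private arrangement described above can be sketched as a fixed random partition of the test rows, with each submission scored separately on the two parts. This is an illustration only: the 25% public fraction, the seed, the accuracy metric and all function names are hypothetical and are not taken from the competition setup.

```python
import random


def split_test_set(row_ids, public_fraction=0.25, seed=0):
    """Partition test row ids into (public, private) lists.

    public_fraction is a placeholder value, not the competition's split.
    """
    ids = sorted(row_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * public_fraction)
    return ids[:cut], ids[cut:]


def leaderboard_score(predictions, truth, row_ids, metric):
    """Score a submission on one split only."""
    return metric([truth[i] for i in row_ids],
                  [predictions[i] for i in row_ids])


def accuracy(y_true, y_pred):
    """Fraction of rows where the thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if int(p >= 0.5) == y)
    return hits / len(y_true)


# Toy labels and predictions, invented for illustration.
truth = {i: i % 2 for i in range(100)}
submission = {i: 0.9 if i % 2 else 0.2 for i in range(100)}

public_ids, private_ids = split_test_set(truth.keys())
# Shown to participants during the competition:
print(leaderboard_score(submission, truth, public_ids, accuracy))
# Held back until the end and used for the final ranking:
print(leaderboard_score(submission, truth, private_ids, accuracy))
```

Because the private rows never produce feedback, repeated submissions cannot be tuned against them, which is why the private score serves as the estimate of generalization error.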

Expectations

“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”

• 20 models generated with different algorithms and descriptors

• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set

• Inter-laboratory accuracy for Ames test reported at 85%

Expectation: Models should have similar accuracy to literature

Goal: Models should be balanced; sensitivity and specificity should be high

J. Chem. Inf. Model. 50, 2094 (2010)

http://www.kaggle.com/c/bioresponse

\[
\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\]
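As a concrete reading of the log loss metric, here is a minimal Python sketch in which each true label is 0 or 1 and each prediction is a probability that the label is 1. The clipping constant `eps` is an assumption added to avoid taking log(0); the slides do not say how the competition scorer handled predictions of exactly 0 or 1.

```python
from math import log


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 1]
print(log_loss(labels, [0.5, 0.5, 0.5, 0.5]))  # a constant 0.5 guess scores log(2)
print(log_loss(labels, [0.9, 0.2, 0.8, 0.7]))  # confident, correct predictions score lower
```

Lower is better, and a confident wrong prediction is penalized far more heavily than a hesitant one, which is why the metric rewards well-calibrated probabilities rather than hard class labels.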

Performance as a function of time

796 players

703 teams

8841 entries

55 forum topics, 409 posts

Final Ranking

Rank  Team Name                             Public Ranking  Δ (log loss)
1     Winter is Coming & Sergey             11              0
2     seelary                               26              7E-05
3     bluehat                               1               0.00051
4     jazz                                  15              0.0014
5     Wayne Zhang & Gxav & woshialex        19              0.00146
6     Indy Actuaries                        38              0.00184
7     bluemaster & imran                    7               0.00231
8     Efiimov & Bers & Cragin & vsu         4               0.00241
9     y_tag                                 18              0.0026
10    Killian O’Connor                      44              0.00285
11    PlanetThanet & SirGuessalot           40              0.00298
12    AussieTim                             48              0.00335
13    Jason Farmer                          31              0.00347
14    GreenPeace                            16              0.00356
15    mars                                  32              0.00388
16    Fuzzify                               60              0.00392
17    Emanuele                              63              0.00395
18    HappyHour                             10              0.00431
19    Baltic                                30              0.00465
20    dejavu                                20              0.00482
352   Random Forest Benchmark               373             0.04184
541   Support Vector Machine Benchmark      522             0.12147
639   Optimized Constant Value Benchmark    647             0.31414
642   Uniform Benchmark                     650             0.31959

https://github.com/emanuele/kaggle_pbr

https://github.com/benhamner/BioResponse

#FTW Strategies

• Feature selection

• RF + complementary approaches

• Blending

All three winning teams identified D27 as important. What is it? Organon toxicophore*

* J. Med. Chem. 48, 312 (2005)

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yurgenson, Tom de Godoy

Private Set Performance

Confusion matrix layout:

TP FN
FP TN

Model      TP    FN    FP    TN    Se    Sp    CCR

Winning Teams
Team 1     873   165   151   687   0.84  0.82  0.83
Team 2     888   150   165   673   0.86  0.80  0.83
Team 3     893   145   162   676   0.86  0.80  0.83

Benchmarks
RF         888   150   166   672   0.86  0.80  0.83
SVM        822   216   215   673   0.79  0.74  0.77

Other
Team 17    896   142   169   669   0.86  0.80  0.83
D27        781   257   215   623   0.75  0.74  0.75

Se: TP/(TP+FN) Sp: TN/(FP+TN)

CCR: (Se + Sp)/2
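The sensitivity, specificity and CCR definitions can be checked against a confusion matrix with a few lines of Python. The sketch below uses the Team 1 counts from the slide; the function name `classification_rates` is introduced here for illustration.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)    # fraction of actual positives predicted positive
    sp = tn / (fp + tn)    # fraction of actual negatives predicted negative
    ccr = (se + sp) / 2    # balanced average of the two
    return se, sp, ccr


# Team 1 confusion matrix from the slide: TP=873, FN=165, FP=151, TN=687.
se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(f"Se={se:.2f} Sp={sp:.2f} CCR={ccr:.2f}")
```

Because CCR averages the two rates, a model cannot score well by favouring one class, which matches the stated goal of balanced models with both sensitivity and specificity high.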

Okay, where’s this ‘second’ web service?

BIpredict

• Physicochemical properties are updated as molecule is built

• Atomistic descriptor values are appended directly to the molecule

* D. C. Thompson, Chemical Computing Group, User Group Meeting, Montreal, 2011

http://www.slideshare.net/dcthompson/thompson-dc-ccgugm2011

So, what did we learn?

• Was this useful? – Yes

• Participation was high, contributors and contributions were diverse*

• A large number of models were of a high quality

– Differences in top models in log loss metric are small

– Different statistical measures lead to different rankings

– Random Forest benchmark has high correct classification rate (CCR)

* Sort of

‘Machine learning that matters’

Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.

[Figure labels: Domain expertise; Machine learning skill]

http://ml.jpl.nasa.gov/papers/wagstaff/wagstaff-MLmatters-icml12.pdf

Thanks to: Lilly Ackley, Amy Kunkel, Mehul Patel, Alex Renner, PhD, and all Kaggle participants, especially Winter is Coming & Sergey
