ChEMBL UGM May 2011
![Page 2: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/2.jpg)
Those who cannot remember the
past are condemned to repeat it
Contents
![Page 3: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/3.jpg)
• Things you would not like to see in your hits
• Specifically: reactive/labile chemical groups
– Is the compound still on the plate?
– Activity due to (selective) non-covalent binding?
– Some overlap with frequent hitters/aggregators
– Peroxides, aldehydes, etc
• Not ‘structural alerts’
– Off-target toxicity
– Toxic compounds after metabolic activation
– hERG binders, anilines, etc
‘Nasties’
![Page 4: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/4.jpg)
• If you are a chemist you know many of these
• If you have been working in pharma you know more of these
• Pharma companies probably all have their in-house list of ‘forbidden/risky/ugly’ structures
• Some publications but no definitive public list
• Thus reinvention of the wheel, wasted effort
This is not a new concept
![Page 5: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/5.jpg)
ChEMBL:
• “the most comprehensive ever seen in a public database.” (Wikipedia)
• “…cover a significant fraction of the SAR and discovery of modern drugs” (ChEMBL website)
• This must be a good source to learn what goes
– Experienced scientists who cared enough about compounds to measure the activity and submit the results to peer-reviewed journals
ChEMBL as a teacher
![Page 6: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/6.jpg)
• To learn we also need to know what not to do:
• Compound vendor catalogues
– Fewer constraints on reactivity / stability
– Drive for diversity
– More customers than just pharma:
– Should be enriched in nasties compared to ChEMBL
ChEMBL as a teacher
![Page 7: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/7.jpg)
• ChEMBL
– Release 7
– Dump all compounds, keep largest fragment
– Unique canonical SMILES: 597,255
• Vendor reagents
– Pipeline Pilot examples: Maybridge + Asinex
– 186,967 unique compounds
• Build Bayesian model ‘reagentlike’
– Vendor “good” vs. ChEMBL “baseline”
– What do reagents have in common that ChEMBL compounds don’t?
Lesson 1
![Page 8: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/8.jpg)
• Training/Test: Random 80% / 20%
• Excellent separation ChEMBL / Reagent
Reagentlike model
[Figure, two panels: “Leave-one-out enrichment” and “Test set enrichment”. Each plots % actives captured (0–100%) against % of samples (0–100%) for the perfect model, the random model, and the Reagentlike model.]
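The enrichment plots on this slide rank compounds by model score and track how quickly the true "actives" (here, reagents) are recovered. A minimal sketch of that calculation follows; the function name and the toy scores and labels are illustrative assumptions, not data from the talk.

```python
def enrichment_curve(scores, labels):
    """Return (fraction of samples, fraction of actives captured) points.

    Compounds are ranked from highest to lowest score; labels are True
    for actives (e.g. reagent-set members) and False otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_actives = sum(1 for _, is_active in ranked if is_active)
    if total_actives == 0:
        raise ValueError("need at least one active to draw the curve")
    points = [(0.0, 0.0)]
    found = 0
    for rank, (_, is_active) in enumerate(ranked, start=1):
        if is_active:
            found += 1
        points.append((rank / len(ranked), found / total_actives))
    return points


# Toy example (made-up scores and labels).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
curve = enrichment_curve(scores, labels)
```

A perfect model would reach 100% of actives after screening exactly the fraction of samples that are active; a random model follows the diagonal.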
![Page 9: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/9.jpg)
Done?
![Page 10: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/10.jpg)
A look at high and low scoring compounds
• Colour atoms by contribution to Bayesian score
– Red: high contribution: reagent-like
– Blue: low contribution: not reagent-like
– Colour gradient over set of molecules
![Page 11: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/11.jpg)
High scoring molecules
![Page 12: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/12.jpg)
More high scoring molecules
They do contain ‘nasty’ groups…
But they don’t stand out against rest of the molecules (all red).
![Page 13: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/13.jpg)
Low scoring molecules
Etc
![Page 14: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/14.jpg)
High scoring features Low scoring features
High and low scoring reagent features
Many variations of peptide bond and other polypeptide features:
Seen 1029 times, of which in reagent set 1024 times
635 out of 639 in reagent set
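To see how counts like these turn into a feature score, here is a sketch of a Laplacian-corrected Bayesian feature weight. The slides do not state the formula, so its use here is an assumption; the base rate is derived from the set sizes quoted under Lesson 1, which ignores the 80/20 split, and the "rare" feature is hypothetical.

```python
import math


def laplacian_weight(n_feature, n_feature_good, base_rate):
    """Log of the Laplacian-corrected estimate for one fingerprint feature.

    n_feature      : number of samples containing the feature
    n_feature_good : how many of those are in the "good" (reagent) set
    base_rate      : overall fraction of "good" samples
    """
    return math.log((n_feature_good + 1.0) / (n_feature * base_rate + 1.0))


# Base rate from the set sizes on the Lesson 1 slide (assumes full sets).
n_reagent = 186_967
n_chembl = 597_255
base_rate = n_reagent / (n_reagent + n_chembl)

w_a = laplacian_weight(1029, 1024, base_rate)  # counts quoted on the slide
w_b = laplacian_weight(639, 635, base_rate)    # counts quoted on the slide
w_rare = laplacian_weight(1000, 5, base_rate)  # hypothetical ChEMBL-heavy feature
```

A compound's score is then the sum of the weights of the features it contains, so a feature that occurs almost only in the reagent set pushes the score up, and an unseen feature contributes zero.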
![Page 15: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/15.jpg)
• Learning the difference was too easy
• Small organic vs large polypeptide
• Both sets contain many series, model learns common core instead of (nasty) decorations
– Metric: compounds / Murcko frames
– ChEMBL: ~6.7, reagent: ~9.0
• Number of frames / in common: ~81k / ~6k
• I need to resit this class
Conclusions from lesson 1
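The compounds-per-frame metric and the frame overlap can be sketched as below. Computing Murcko frames needs a cheminformatics toolkit, so this sketch assumes each compound has already been assigned a frame; the compound IDs and frame strings are made-up placeholders.

```python
from collections import Counter


def compounds_per_frame(frame_of_compound):
    """Average number of compounds sharing one frame.

    frame_of_compound maps a compound ID to its (precomputed) frame key.
    """
    frame_counts = Counter(frame_of_compound.values())
    return len(frame_of_compound) / len(frame_counts)


def shared_frames(frames_a, frames_b):
    """Frames that occur in both collections."""
    return set(frames_a.values()) & set(frames_b.values())


# Hypothetical toy sets: compound ID -> frame key.
chembl_like = {"c1": "c1ccccc1", "c2": "c1ccccc1", "c3": "c1ccncc1", "c4": "C1CCCCC1"}
reagent_like = {"r1": "c1ccccc1", "r2": "c1ccccc1", "r3": "c1ccccc1", "r4": "c1ccoc1"}

ratio_chembl = compounds_per_frame(chembl_like)
ratio_reagent = compounds_per_frame(reagent_like)
common = shared_frames(chembl_like, reagent_like)
```

A high ratio means many compounds per frame, i.e. many series, which is what lets a model separate the sets by core instead of by decoration.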
![Page 16: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/16.jpg)
• Restrict to organic small molecules
– AlogP < 6, Mw < 600, organic compound filter
• Bayesian Model
– ECFP_2 (smaller features compared to ECFP_6)
– Less likely to capture whole core
Lesson 2: Rebalancing the training set
![Page 17: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/17.jpg)
Still a predictive model
![Page 18: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/18.jpg)
A typical high scoring compound
~neutral score for parts presumed common to both sets like phenyl
~positive score for nasty parts
![Page 19: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/19.jpg)
Low scoring example
Many sugars, phosphates, steroids, etc
![Page 20: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/20.jpg)
High scoring features Low scoring features
Some ECFP_2 features
![Page 21: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/21.jpg)
• Less learning of “series by template”
• But it still happens: a feature does not need to capture a whole ring to capture a sugar, steroid, etc
• Some of expected nasty features found
• But many are not
• Better training set needed
– Series: similar in both clean/nasty training set, so that difference is not the template
– Many ChEMBL compounds are odd
• I have still not learned the lesson
Conclusions from lesson 2
![Page 22: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/22.jpg)
• ChEMBL:
– What I should have started with:
• All compounds with IC50 or Ki expressed in nM,
• Against human target,
• Include reference: journal, volume, year, page
• 569,569 activities
• 223,896 compounds
• 14,383 references
Lesson 3: Learning from (big) pharma
![Page 23: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/23.jpg)
Looking up author affiliation in PubMed
This takes ~4 hours, run over a weekend (PubMed usage restriction)
NCBI Entrez Utilities Web Service (Text Analytics component collection)
• 13,410 references
• 564,422 activities
• 214,747 compounds
![Page 24: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/24.jpg)
Something wrong with some BMCL refs?
| Journal | Frequency |
| --- | --- |
| Bioorg. Med. Chem. Lett. | 935 |
| J. Nat. Prod. | 13 |
| J. Med. Chem. | 3 |
![Page 25: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/25.jpg)
Top 10 affiliations
| DocAuthorsAffiliation | Number of published activities in ChEMBL |
| --- | --- |
| Department of Chemistry, University of Wisconsin-Milwaukee, P.O. Box 413, Milwaukee, Wisconsin 53201, USA. | 2775 |
| Laboratorio di Chimica Inorganica e Bioinorganica, Universita degli Studi, Via Gino Capponi 7, I-50121 Florence, Italy. | 1326 |
| Aton Pharma, Inc, 777 Old Sawmill River Road, Tarrytown, New York 10591, USA. | 1264 |
| AstraZeneca, Centre de Recherches, Z.I. La Pompelle, B.P. 1050, Chemin de Vrilly, 51689 Reims, Cedex 2, France. | 1261 |
| Merck Sharp & Dohme Research Laboratories, Neuroscience Research Centre, Terlings Park, Eastwick Road, Harlow, Essex CM20 2QR, U.K. | 1160 |
| Universita degli Studi di Firenze, Laboratorio di Chimica Bioinorganica, Rm. 188, Via della Lastruccia 3, I-50019 Sesto Fiorentino (Firenze), Italy. | 1158 |
| Merck Sharp & Dohme Research Laboratories, West Point, Pennsylvania 19486. | 916 |
| Department of Organic Pharmaceutical Chemistry, Uppsala Biomedical Center, Uppsala University, Sweden. | 895 |
| Lilly Research Laboratories, Eli Lilly and Company, 46285, Indianapolis, IN, USA. | 884 |
| Institute of Pharmacy, Department of Pharmaceutical and Medicinal Chemistry, Eberhard-Karls-University Tubingen, Auf der Morgenstelle 8, 72076 Tubingen, | 877 |
![Page 26: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/26.jpg)
Where is Pfizer?
| DocAuthorsAffiliation | Number of published activities in ChEMBL |
| --- | --- |
| Central Research Division, Pfizer Inc., Groton, Connecticut 06340. | 708 |
| Department of Medicinal Chemistry, Pfizer Global Research & Development, 4901 Searle Parkway, Skokie, IL 60077, USA. | 531 |
| Department of Medicinal Chemistry, and Cancer Pharmacology, Pfizer Global Research and Development, Michigan Laboratories, 2800 Plymouth Road, Ann Arbor, Michigan 48105, USA. | 484 |
| Department of Medicinal and Combinatorial Chemistry, Pharmacia Corporation, 800 North Lindbergh Boulevard, St. Louis, Missouri 63167, USA. | 463 |
| Pfizer Inc, Central Research Division, Groton, CT 06340, USA. | 452 |
| Pfizer Global Research and Development, Groton Laboratories, CT 06340, USA. | 445 |
| Medicinal Chemistry, Cancer Pharmacology, and Pharmacokinetics, Dynamics and Metabolism, Pfizer Global Research and Development, Michigan Laboratories, 2800 Plymouth Road, Ann Arbor, Michigan 48105, USA. | 383 |
| Pfizer Global Research & Development, Fresnes Laboratories, 3 a 9 rue de la Loge, 94265 Fresnes, France. | 358 |
| Pfizer Global Research and Development, 3-9 Rue de la loge 94265 Fresnes, France. | 327 |
| Discovery Chemistry, Pfizer Global Research and Development, Sandwich Laboratories, Sandwich, Kent CT13 9NJ, UK. | 305 |
And 318 more… Similar for other contributors
![Page 27: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/27.jpg)
if DocAuthorsAffiliation rlike 'univers|Faculty|hospital|National.*Institute.*Health|Polytechnic' then
    Published_by := 'Academic';
elsif DocAuthorsAffiliation rlike 'Pfizer' then
    Published_by := 'Pfizer';
elsif DocAuthorsAffiliation rlike 'warner.*lambert|parke.*davis' then
    Published_by := 'Warner-Lambert';
elsif DocAuthorsAffiliation rlike 'Pharmacia|Upjohn' then
    Published_by := 'Pharmacia';
elsif DocAuthorsAffiliation rlike 'Wyeth' then
    Published_by := 'Wyeth';
elsif DocAuthorsAffiliation rlike 'Merck' then
    Published_by := 'Merck';
…
else
    Published_by := 'Other';
end if;
Merging affiliations
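For readers without Pipeline Pilot, the same first-match-wins rule chain can be sketched in Python. The patterns are copied from the slide; case-insensitive matching is an assumption about how `rlike` behaves, and the company rules elided on the slide are left out, so those affiliations fall through to 'Other' here.

```python
import re

# Ordered rules: the first pattern that matches decides the label.
RULES = [
    ("Academic", r"univers|Faculty|hospital|National.*Institute.*Health|Polytechnic"),
    ("Pfizer", r"Pfizer"),
    ("Warner-Lambert", r"warner.*lambert|parke.*davis"),
    ("Pharmacia", r"Pharmacia|Upjohn"),
    ("Wyeth", r"Wyeth"),
    ("Merck", r"Merck"),
    # Further company rules are elided on the slide and not reproduced here.
]
COMPILED = [(label, re.compile(pattern, re.IGNORECASE)) for label, pattern in RULES]


def published_by(affiliation):
    """Return the merged affiliation label for one PubMed affiliation string."""
    for label, pattern in COMPILED:
        if pattern.search(affiliation):
            return label
    return "Other"


examples = [
    "Pfizer Global Research and Development, Groton Laboratories, CT 06340, USA.",
    "Department of Organic Pharmaceutical Chemistry, Uppsala Biomedical Center, Uppsala University, Sweden.",
    "Merck Sharp & Dohme Research Laboratories, West Point, Pennsylvania 19486.",
    "Aton Pharma, Inc, 777 Old Sawmill River Road, Tarrytown, New York 10591, USA.",
]
labels = [published_by(text) for text in examples]
```

Because the academic rule is tested first, an industry paper whose affiliation string mentions a university or hospital is counted as 'Academic'.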
![Page 28: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/28.jpg)
Ranked contributors to ChEMBL
| Affiliation | Activities in ChEMBL | Papers | Avg activities / paper |
| --- | --- | --- | --- |
| Academic | 170,230 | 4,216 | 40.4 ± 82.9 |
| Other | 169,882 | 3,424 | 49.6 ± 70.0 |
| Merck | 50,504 | 685 | 73.7 ± 102.4 |
| Pfizer | 22,224 | 326 | 68.2 ± 86.9 |
| BMS | 19,888 | 306 | 65.0 ± 71.3 |
| Abbott | 19,526 | 272 | 71.8 ± 74.4 |
| Wyeth | 16,095 | 266 | 60.5 ± 61.4 |
| GSK | 15,666 | 301 | 52.0 ± 70.3 |
| J&J | 12,218 | 182 | 67.1 ± 73.1 |
| Novartis | 11,310 | 197 | 57.4 ± 73.4 |
| Lilly | 10,094 | 131 | 77.1 ± 114.2 |
| AZ | 8,249 | 115 | 71.7 ± 126.8 |
| SP | 8,181 | 130 | 62.9 ± 71.4 |
| Roche | 7,108 | 127 | 56.0 ± 58.3 |
| Sanofi | 6,236 | 81 | 77.0 ± 108.6 |
| Warner-Lambert | 5,035 | 64 | 78.7 ± 98.2 |
| Pharmacia | 2,353 | 39 | 60.3 ± 57.1 |
| Organon | 1,305 | 25 | 52.2 ± 67.0 |
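The per-affiliation columns can be reproduced from a list of activity counts per paper, as in the sketch below. Whether the ± value in the table is a sample or a population standard deviation is not stated; the sample form is an assumption, and the per-paper counts are made up.

```python
from statistics import mean, stdev


def summarise(per_paper_counts):
    """Return (total activities, papers, mean per paper, sample SD)."""
    total = sum(per_paper_counts)
    papers = len(per_paper_counts)
    spread = stdev(per_paper_counts) if papers > 1 else 0.0
    return total, papers, mean(per_paper_counts), spread


# Hypothetical counts for three papers from one affiliation.
total, papers, avg, spread = summarise([10, 20, 90])

# Consistency check on the 'Academic' row of the table: total / papers.
academic_avg = 170_230 / 4_216
```

In most rows of the table the spread exceeds the mean, which points to a skewed distribution: a few papers report very many activities.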
![Page 29: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/29.jpg)
Creating balanced training/test sets
• Affiliation: Pharma, Other, Academic
• Keep 602 targets for which measured activities are available for all 3 affiliations
• Same target, same pharmacophore, some me-too work: less series learning
![Page 30: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/30.jpg)
• Bayesian model based on <= 2005 data
• Descriptors: ECFP_6 + Ro5 physical properties
Categorical model: Pharma/Academic/Other
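Training on data up to 2005 and predicting later publications is a time-based split, not a random one. A small sketch, assuming each record carries the publication year of its reference; the field names and records are hypothetical.

```python
def split_by_year(records, cutoff=2005):
    """Split records into (train, test): year <= cutoff versus year > cutoff."""
    train = [record for record in records if record["year"] <= cutoff]
    test = [record for record in records if record["year"] > cutoff]
    return train, test


# Made-up records for illustration.
records = [
    {"compound": "cpd_1", "year": 1999, "published_by": "Pharma"},
    {"compound": "cpd_2", "year": 2005, "published_by": "Academic"},
    {"compound": "cpd_3", "year": 2008, "published_by": "Other"},
]
train, test = split_by_year(records)
```

Unlike a random 80/20 split, a time split keeps later members of a series out of the training data, so the test is closer to a genuine prediction.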
![Page 31: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/31.jpg)
Predicting affiliation post 2005
Academic
Other
Pharma
• Other not different
• Academic/Pharma distinct
![Page 32: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/32.jpg)
What makes a compound ‘Pharma’
Number of times feature observed / how many times in academic / pharma
Aromatic rings, aromatic rings, aromatic rings. IP? Absence of decorations means these are not distinctive.
![Page 33: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/33.jpg)
What makes a compound ‘Academic’
Number of times feature observed / how many times in academic / pharma
Aliphatic, single rings, bold usage of F and other decorations, etc. Maybe not nasty but not very druglike.
![Page 34: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/34.jpg)
Most Pharma-like compounds
For each target, compound with highest ‘Pharma’ score and true origin
![Page 35: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/35.jpg)
Most Academic-like compounds
For each target, compound with highest ‘Academic’ score and true origin
![Page 36: ChEMBL UGM May 2011](https://reader034.vdocument.in/reader034/viewer/2022042715/559b6a171a28ab38188b45f4/html5/thumbnails/36.jpg)
• Set out to learn a ‘nasty’ model, ended up with a (non)drug-like model
• Pharma is ‘a bit’ underrepresented
– 10% of MDDR is in ChEMBL (Dave Rogers)
– ChEMBL could/should include patent literature
• Over the years (big) pharma has delivered the goods and learned what does (not) work in a structure. Some of this knowledge can be extracted from ChEMBL.
• Ignore this at your peril
Conclusions