![Page 1: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/1.jpg)
Exploring Chemical Space with Computers—Challenges and Opportunities
Pierre Baldi, UCI
![Page 2: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/2.jpg)
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
![Page 3: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/3.jpg)
Chemical Space
| | Stars | Small molecules |
|---|---|---|
| Existing | 10²² | 10⁷ |
| Virtual | 0 | 10⁶⁰ (?) |
| Access | Difficult | “Easy” |
| Mode | Individual | Combinatorial |
![Page 4: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/4.jpg)
Chemical Space
![Page 5: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/5.jpg)
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties (classification/regression)
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
![Page 6: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/6.jpg)
Methods
Spectrum: Schrödinger equation, molecular dynamics, machine learning (e.g. SS prediction)
![Page 7: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/7.jpg)
Chemical Informatics
Informatics must be able to deal with variable-size structured data: graphical models, (recursive) neural networks, ILP, GA, SGs, kernels
![Page 8: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/8.jpg)
Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:
Data (GenBank, Swissprot, PDB)
Similarity (BLAST)
![Page 9: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/9.jpg)
Data
Mutag (mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (boiling points): all 150 non-cyclic alkanes (CₙH₂ₙ₊₂) with n < 11 and their boiling points ([-164, 174] °C)
Benzodiazepines (QSAR): 79 1,4-benzodiazepin-2-ones, affinity towards GABA-A
ChemDB: 7M compounds
![Page 10: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/10.jpg)
Similarity
Rapid Searches of Large Databases
Predictive Methods (Kernel Methods)
Why is it not hopeless?
![Page 11: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/11.jpg)
Similarity
Rapid Search of Large Databases
Protein Receptor (Docking) Small Molecule/Ligand
Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why it is not hopeless
Organic Chemicals
![Page 12: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/12.jpg)
Linear Classifiers
![Page 13: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/13.jpg)
Classification
Learning to Classify
Limited number of training examples (molecules, patients, sequences, etc.)
Learning algorithm (how to build the classifier?)
Generalization: should correctly classify test data.
Formalization
X is the input space
Y (e.g. toxic/non toxic, or {1,-1}) is the target class
f: X→Y is the classifier.
![Page 14: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/14.jpg)
Classification
Fundamental Point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points
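A short derivation may help make this point concrete. It assumes a perceptron- or SVM-style learner, which the slide does not name; the coefficients $\alpha_i$ and the offset $b$ are notation introduced here for illustration. If the learned weight vector is a combination of the training points, the classifier can be written using dot products only:

```latex
w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i
\quad\Longrightarrow\quad
f(x) = \operatorname{sign}\bigl(\langle w, x \rangle + b\bigr)
     = \operatorname{sign}\Bigl(\sum_{i=1}^{n} \alpha_i \, y_i \, \langle x_i, x \rangle + b\Bigr)
```

Under this assumption, neither training nor prediction needs the coordinates of the $x_i$ themselves, only their pairwise dot products.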
![Page 15: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/15.jpg)
Non Linear Classification (Kernel Methods)
We can transform a nonlinear problem into a linear one using a kernel.
![Page 16: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/16.jpg)
Non Linear Classification (Kernel Methods)
We can transform a nonlinear problem into a linear one using a kernel K.
Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity matrix K.
K defines the local metric of the embedding space.
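As a small illustration of the kernel property, the sketch below checks numerically that a kernel value equals a dot product in an embedding space. The homogeneous quadratic kernel, the feature map `phi`, and the sample points are choices made for this example only; they are not taken from the slides.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map of the 2D homogeneous quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, z):
    """K(x, z) = <x, z>^2, computed without ever building phi."""
    return dot(x, z) ** 2

def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

# Hypothetical sample points, used only to compare the two computations.
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
K = gram_matrix(points, quadratic_kernel)
K_explicit = gram_matrix(points, lambda p, q: dot(phi(p), phi(q)))

for row, row_explicit in zip(K, K_explicit):
    print(row, row_explicit)
```

The two matrices agree up to floating-point rounding, so a learner that only reads the Gram matrix cannot tell whether the embedding was ever computed.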
![Page 17: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/17.jpg)
Similarity: Data Representations
NC(O)C(=O)O
[Figure: the same molecule drawn as a 2D structure, with atom labels O, OH, NH2, OH]
![Page 18: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/18.jpg)
Molecular Representations
1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution
![Page 19: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/19.jpg)
1D SMILES Kernel
Molecule 1: CCCCCCc1ccc(cc1O)O
Molecule 2: CCCCCc1ccc(cc1)CO
[Figure: 2D structure drawings of the two molecules]

| Kmer | Count1 | Count2 | Product |
|---|---|---|---|
| (cc1 | 1 | 1 | 1 |
| 1)CO | 0 | 1 | 0 |
| 1O)O | 1 | 0 | 0 |
| 1ccc | 1 | 1 | 1 |
| CCCC | 3 | 2 | 6 |
| CCCc | 1 | 1 | 1 |
| CCc1 | 1 | 1 | 1 |
| Cc1c | 1 | 1 | 1 |
| c(cc | 1 | 1 | 1 |
| c1)C | 0 | 1 | 0 |
| c1O) | 1 | 0 | 0 |
| c1cc | 1 | 1 | 1 |
| cc(c | 1 | 1 | 1 |
| cc1) | 0 | 1 | 0 |
| cc1O | 1 | 0 | 0 |
| ccc( | 1 | 1 | 1 |

Total: 15
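The k-mer counting above can be sketched in a few lines. This is a minimal version, assuming k = 4 (inferred from the k-mers in the table) and treating the SMILES string as raw characters, so a two-letter atom such as Cl would be split; the function names are illustrative.

```python
from collections import Counter

def kmer_counts(smiles, k=4):
    """Count every length-k substring (k-mer) of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1, s2, k=4):
    """Sum over shared k-mers of the product of their counts."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    # A Counter returns 0 for missing keys, so unshared k-mers add nothing.
    return sum(n * c2[kmer] for kmer, n in c1.items())

mol1 = "CCCCCCc1ccc(cc1O)O"
mol2 = "CCCCCc1ccc(cc1)CO"
print(spectrum_kernel(mol1, mol2))  # 15, matching the total on the slide
```

Only CCCC occurs more than once (3 times in the first string, 2 in the second), which is where the product of 6 in the table comes from.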
![Page 20: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/20.jpg)
2D Molecule Graph Kernel
For chemical compounds:
atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}
Count labeled paths, e.g. (CsNsCdO)
Fingerprints
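A rough sketch of the labeled-path idea follows. It assumes a molecule is given as a list of atom labels plus a list of labeled bonds, collects paths into a set (a binary fingerprint rather than counts), and compares two fingerprints with the Tanimoto ratio. The depth limit `max_bonds`, the function names, and both example molecules are hypothetical.

```python
def labeled_paths(atoms, bonds, max_bonds=3):
    """Return the set of labeled simple paths with at most max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: list of (i, j, label) tuples, label in {"s", "d", "t", "ar"}
    Paths are recorded in both directions (e.g. "CdO" and "OdC").
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    paths = set()

    def walk(node, visited, text, depth):
        paths.add(text)
        if depth == max_bonds:
            return
        for nxt, label in neighbors[node]:
            if nxt not in visited:  # simple paths only: no atom revisited
                walk(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start], 0)
    return paths

def tanimoto(a, b):
    """Shared paths divided by all distinct paths of the two molecules."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical fragments: C-N-C=O versus C-N-C-O (hydrogens omitted).
atoms = ["C", "N", "C", "O"]
paths_a = labeled_paths(atoms, [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")])
paths_b = labeled_paths(atoms, [(0, 1, "s"), (1, 2, "s"), (2, 3, "s")])
print("CsNsCdO" in paths_a, tanimoto(paths_a, paths_b))
```

Changing a single bond from double to single removes every path that crosses it, which is why the two fragments share only part of their fingerprints.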
![Page 21: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/21.jpg)
Similarity Measures
![Page 22: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/22.jpg)
3D Coordinate Kernel
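[Figure: molecule with interatomic distances marked: 1.4 Å, 2.0 Å, 2.8 Å, 3.4 Å, 4.2 Å]
Atom Distance Histogram (x-axis: Distance (Angstroms); y-axis: Count)

| Distance | Count |
|---|---|
| 0 | 0 |
| 1 | 5 |
| 2 | 7 |
| 3 | 3 |
| 4 | 1 |
| 5 | 0 |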
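The histogram can be computed from 3D coordinates roughly as follows. The binning rule (flooring each distance to a 1 Å bin), the dot product used to compare two histograms, and the four-atom coordinates are all assumptions for this sketch; the slide does not state which conventions were used.

```python
import math
from itertools import combinations

def distance_histogram(coords, bin_width=1.0, n_bins=6):
    """Histogram of all pairwise atom distances, floored into bins."""
    counts = [0] * n_bins
    for p, q in combinations(coords, 2):
        b = int(math.dist(p, q) // bin_width)
        if b < n_bins:  # distances beyond the last bin are ignored
            counts[b] += 1
    return counts

def histogram_kernel(h1, h2):
    """Dot product of two histograms (one possible similarity measure)."""
    return sum(a * b for a, b in zip(h1, h2))

# Hypothetical coordinates in Angstroms, not a real molecule.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0), (0.0, 0.0, 1.5)]
hist = distance_histogram(coords)
slide_hist = [0, 5, 7, 3, 1, 0]  # counts from the table on the slide
print(hist, histogram_kernel(slide_hist, hist))
```

Because the histogram uses distances only, it does not change when the molecule is rotated or translated.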
![Page 23: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/23.jpg)
Example of Results
![Page 24: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/24.jpg)
Results
![Page 25: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/25.jpg)
Results
![Page 26: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/26.jpg)
Results
[Figure: prediction accuracy (y-axis, 0.65 to 0.75) by cell line (x-axis) for three kernels]
1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)
![Page 27: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/27.jpg)
Example of Results
![Page 28: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/28.jpg)
Summary
Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc.
Need for larger data sets and new models of cooperation in the chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10⁷ compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
Chemistry version of the Turing test
![Page 29: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/29.jpg)
ChemDB
7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental, Computational)
Searchable Web interface
Similarity, in silico reactions
![Page 30: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/30.jpg)
Acknowledgements
Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand
Pharmacology: Daniele Piomelli
Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin
Funding: NIH, NSF, IGB
![Page 31: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/31.jpg)
New Questions
Predict drug-like molecules? Toxicity?
New strategies
How can we search efficiently? Intelligently?
New data structures and algorithms
Optimizing old structures
How can we understand this much data?
Cluster and visualize millions of data points
Define commercially accessible space.
Are there other useful things we can do with this?
Discover new polymers, etc.
Wonder about the origin of life.
Combinatorially combine all known chemicals.
![Page 32: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/32.jpg)
Acknowledgements
Jocelyne Bruand, Peter Phung, Liva Ralaivola, S. Joshua Swamidass, Yimeng Dou
NIH/NSF/IGB
Questions
![Page 33: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/33.jpg)
Docking
Database of potential drugs: 6 million small molecules …
Query: Binding Site of Protein
Scoring Function & Efficient Minimizer
![Page 34: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/34.jpg)
Some Targets
P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)
![Page 35: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/35.jpg)
P53
![Page 36: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/36.jpg)
Drug Rescue of P53 Mutants
![Page 37: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/37.jpg)
Docking → ChemDB
~6 million commercially available compounds
Searchable, annotated, downloadable.
Other Databases: Cambridge Structural Database, ChemBank, PubChem
![Page 38: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/38.jpg)
Chemical Toxicity Prediction
By Kernel Methods
Jonathan Chen, S. Joshua Swamidass
The Baldi Lab
![Page 39: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/39.jpg)
Data Flow
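[Figure: data flow. Molecule structures → Kernel → Gram Matrix; Gram Matrix and Toxicity State List → Linear Classifier → Predictions]

Toxicity State List

| ID | Toxic? |
|---|---|
| 1 | No |
| 2 | No |
| 3 | Yes |
| 4 | Yes |

Gram Matrix

| ID | 1 | 2 | 3 | 4 | … |
|---|---|---|---|---|---|
| 1 | 21 | 4 | 5 | 10 | … |
| 2 | 4 | 14 | 5 | 3 | … |
| 3 | 5 | 5 | 15 | 6 | … |
| 4 | 10 | 3 | 6 | 23 | … |
| … | … | … | … | … | … |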
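To make the last step of this data flow concrete, here is a minimal sketch that trains a classifier from the 4×4 corner of the Gram matrix and the four toxicity labels shown on the slide. The slide says only "Linear Classifier"; the kernel perceptron below is a stand-in chosen for brevity, not necessarily the method behind the reported results, and the function names are illustrative.

```python
def train_kernel_perceptron(K, y, max_epochs=100):
    """Kernel perceptron: learns one mistake count per training molecule."""
    n = len(y)
    alpha = [0] * n
    b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * K[j][i] for j in range(n)) + b
            if y[i] * score <= 0:  # wrong side (or on the boundary): update
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: stop
            break
    return alpha, b

def predict(k_column, alpha, y, b):
    """Classify a molecule from its kernel values against the training set."""
    score = sum(a * yj * k for a, yj, k in zip(alpha, y, k_column)) + b
    return 1 if score > 0 else -1

# Gram matrix and labels from the slide (toxic "Yes" = +1, "No" = -1).
K = [[21, 4, 5, 10],
     [4, 14, 5, 3],
     [5, 5, 15, 6],
     [10, 3, 6, 23]]
y = [-1, -1, 1, 1]

alpha, b = train_kernel_perceptron(K, y)
print([predict(K[i], alpha, y, b) for i in range(4)])  # [-1, -1, 1, 1]
```

The classifier never sees the molecular structures, only kernel values, so any of the 1D, 2D, or 3D kernels can be swapped in without changing this code. Reproducing four training labels says nothing about accuracy on unseen molecules.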
![Page 40: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/40.jpg)
Results
[Figure: prediction accuracy (y-axis, 0.50 to 1.00) by cell line (x-axis) for three kernels and a default baseline]
1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)
Default (54.2% avg, 3.49% stdev)
![Page 41: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/41.jpg)
Example of Results
| Kernel/Method | Mutag | MM | FM | MR | FR |
|---|---|---|---|---|---|
| Kashima (2003) | 89.1 | 61.0 | 61.0 | 62.8 | 66.7 |
| Kashima (2003) | 85.1 | 64.3 | 63.4 | 58.4 | 66.1 |
| 1D SMILES spec. | 84.0 | 66.1 | 61.3 | 57.3 | 66.1 |
| 1D SMILES spec+ | 85.6 | 66.4 | 63.0 | 57.6 | 67.0 |
| 2D Tanimoto | 87.8 | 66.4 | 64.2 | 63.7 | 66.7 |
| 2D MinMax | 86.2 | 64.0 | 64.5 | 64.5 | 66.4 |
| 2D Tanimoto, l = 1024, b = 1 | 87.2 | 66.1 | 62.4 | 65.7 | 66.9 |
| 2D Hybrid l = 1024, b = 1 | 87.2 | 65.2 | 61.9 | 64.2 | 65.8 |
| 2D Tanimoto, l = 512, b = 1 | 84.6 | 66.4 | 59.9 | 59.9 | 66.1 |
| 2D Hybrid l = 512, b = 1 | 86.7 | 65.2 | 61.0 | 60.7 | 64.7 |
| 2D Tanimoto, l = 1024 + MI | 84.6 | 63.1 | 63.0 | 61.9 | 66.7 |
| 2D Hybrid l = 1024 + MI | 84.6 | 62.8 | 63.7 | 61.9 | 65.5 |
| 2D Tanimoto, l = 512 + MI | 85.6 | 60.1 | 61.0 | 61.3 | 62.4 |
| 2D Hybrid l = 512 + MI | 86.2 | 63.7 | 62.7 | 62.2 | 64.4 |
| 3D Histogram | 81.9 | 59.8 | 61.0 | 60.8 | 64.4 |
![Page 42: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/42.jpg)
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
![Page 43: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/43.jpg)
Datasets
![Page 44: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/44.jpg)
Small Molecules as Undirected Labeled Graphs of Bonds
atom/node labels: A = {C,N,O,H, … } bond/edge labels: B = {s, d, t, ar, … }
![Page 45: Exploring Chemical Space with Computers—Challenges and Opportunities](https://reader035.vdocument.in/reader035/viewer/2022062314/56812f54550346895d94e578/html5/thumbnails/45.jpg)
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Bioinformatics analogy:
Catalog (GenBank)
Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.