From chemoinformatics to repositioning
Péter [email protected]
Computational Biomedicine (Combine) workgroup
Department of Measurement and Information Systems,
Budapest University of Technology and Economics
Overview
• Chemoinformatics
• The „big data”/omic era of chemo- and bioinformatics
• Data and knowledge fusion in biomedicine
• The semantic unification of pharmacological spaces
• Multi-aspect virtual screening
• Drug repositioning
Chemoinformatics
• Gasteiger, Johann, and Thomas Engel, eds. Chemoinformatics: a textbook. John Wiley & Sons, 2006.
• Bajorath, Jürgen. Chemoinformatics for Drug Discovery. John Wiley & Sons, 2013.
• Karthikeyan, Muthukumarasamy, and Renu Vyas. Practical Chemoinformatics. Springer, 2014.
• Brown, Nathan. In Silico Medicinal Chemistry: Computational Methods to Support Drug Design. No. 8. Royal Society of Chemistry, 2015.
Practical chemoinformatics
1. Open-Source Tools, Techniques, and Data in Chemoinformatics
2. Chemoinformatics Approach for the Design and Screening of Focused Virtual Libraries
3. Machine Learning Methods in Chemoinformatics for Drug Discovery
4. Docking and Pharmacophore Modelling for Virtual Screening
5. Active Site-Directed Pose Prediction Programs for Efficient Filtering of Molecules
6. Representation, Fingerprinting, and Modelling of Chemical Reactions
7. Predictive Methods for Organic Spectral Data Simulation
8. Chemical Text Mining for Lead Discovery
9. Integration of Automated Workflow in Chemoinformatics for Drug Discovery
10. Cloud Computing Infrastructure Development for Chemoinformatics
In Silico Medicinal Chemistry: Computational Methods to Support Drug Design
E. D. Green et al. Nature 470, 204-213 (2011) doi:10.1038/nature09764
Accomplishments of genomics research
Big „omic” data sets in biomedicine
Multiple levels in biomedicine
Genome(s)
Phenome (disease, side effect)
Transcriptome
Proteome
Metabolome
Environment & lifestyle
Drugs
Moore’s Law for Data Explosion (Carlson’s law)
[Figure: sequencing costs per million bases compared with the volume of publicly available genetic data. Source: Nature, Vol 464, April 2010]
• x10 every 2-3 years
• Data volumes and complexity that IT has never faced before…
Large-scale cohorts in UK
UK Biobank:
• >1 million adults
• aged 40-69
• 2006 to 2036 and beyond
• genes × lifestyle × environment → diseases
• open since 2012
Number of genome-wide association studies
[Figure: total number of publications per calendar quarter, 2005-2012, reaching 1350. Source: NHGRI GWA Catalog, www.genome.gov/GWAStudies]
Published Genome-Wide Associations through 12/2012
Published GWA at p ≤ 5×10^-8 for 17 trait categories
Disease network
A.-L. Barabási: The human disease network, PNAS, 2007
Repositories for gene expression
• Gene Expression Omnibus (NCBI)
• http://www.ncbi.nlm.nih.gov/geo/
Gene expression profiles
• Justin Lamb: The Connectivity Map: a new tool for biomedical research, Nature Reviews Cancer, 7, pp. 54-60, 2007
[Figure: matrix of compounds × cell lines; each cell is a transcriptional profile]
STRING - Protein-Protein Interactions
• http://string-db.org/
Unification of biology: Gene Ontology
• Ontologies:
– Gene Ontology (GO): http://www.geneontology.org/
– Enzyme Classification (EC)
– Unified Medical Language System (UMLS)
– OBO
The Human Phenotype Ontology
http://human-phenotype-ontology.github.io/
Number of biomedical publications
Little Science, Big Science, by Derek J. de Solla Price, 1963
[Figure: number of annual papers, 1950-2010; vertical axis from 0 to 1,200,000]
The fusion bottleneck (~limits of personal cognition)
The pharma gap
Semantic publishing: papers vs DBs/KBs
• M. Gerstein: "E-publishing on the Web: Promises, pitfalls, and payoffs for bioinformatics," Bioinformatics, 1999.
• M. Gerstein: "Blurring the boundaries between scientific 'papers' and biological databases," Nature, 2001.
• P. Bourne: "Will a biological database be different from a biological journal?," PLoS Computational Biology, 2005.
• M. Gerstein et al.: "Structured digital abstract makes text mining easy," Nature, 2007.
• M. Seringhaus et al.: "Publishing perishing? Towards tomorrow's information architecture," BMC Bioinformatics, 2007.
• M. Seringhaus: "Manually structured digital abstracts: A scaffold for automatic text mining," FEBS Letters, 2008.
• D. Shotton: "Semantic publishing: the coming revolution in scientific journal publishing," Learned Publishing, 2009.
Network of databases and knowledge bases in biomedicine
• More than 10,000 relevant biological databases and knowledge bases
• Petabytes of sequence and high-throughput gene/protein data
• ~10,000,000 concepts and relations explicitly in knowledge bases
Combination of elements
[Figure: pairwise combinations of entity types: gene, target protein, compound, disease, binding site, product, transcription factor binding site (TFBS), pathway; annotated with the ATC, GO, EC and HPO vocabularies]
“Compound” Google?
“The Science Behind an Answer”
artificial intelligence, natural language processing..(???)
abacavir didanosine lamivudine
Why Can’t My Computer Understand Me?
http://www.newyorker.com/tech/elements/why-cant-my-computer-understand-me
E-science, data-intensive science, the fourth paradigm
Approaches to fusion
• Encyclopedists:
– Wikipedia, Wikidata
– Linked Open Data (LOD)
– Semantic publishing
• Automated cross-domain querying
– Forms
– Workflow systems
– Natural language understanding, machine reading
• Automated reasoning
– Watson
• Automated discovery systems („Automation of science”)
– Adam, Eve
• Large-scale similarity-based fusion (applied in repositioning)
OPS: scientific pharma questions
A problem with public data: parallel works on cleaning...integration
Resource Description Framework (RDF): background
• The data model of the Semantic Web
• RDF statement
– subject: resource identified by an IRI
– predicate (property): resource identified by an IRI
– object: resource or literal (constant value)
• Graph databases of RDF triples
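The statement model above can be sketched in a few lines of plain Python. This is only an illustration using the standard library: the `http://example.org/` namespace, the predicate names and the `match` helper are hypothetical and do not come from any real dataset or RDF library.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF statement: subject and predicate are IRIs, object is an IRI or a literal."""
    subject: str
    predicate: str
    obj: str


EX = "http://example.org/"  # hypothetical namespace, for illustration only

# A graph is simply a set of triples; a new kind of relation needs no schema change.
graph: Set[Triple] = {
    Triple(EX + "drugA", EX + "bindsTo", EX + "proteinX"),
    Triple(EX + "drugA", EX + "label", '"drug A"'),  # literal object
    Triple(EX + "proteinX", EX + "involvedIn", EX + "pathwayP"),
}


def match(g, s=None, p=None, o=None):
    """Return (sorted) the triples that agree with every position that is not None."""
    return sorted(
        t for t in g
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    )


statements_about_drug_a = match(graph, s=EX + "drugA")
```

Leaving a position as `None` makes it a wildcard, so the same function answers "everything about drug A" and "everything linked by `involvedIn`".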
Relational databases vs. triplestores (graph databases)
Relational databases
• Relations are separated from data (cases)
• Tables & keys define the formal model (syntax) for the data (cases)
• Model-based (~predefined)
• Meaning (semantics) is informal (out of scope of the DB)
• Singular databases (~they are separated)
Triplestores
• Unified representation of relations and data
• Triples („graph database”) store the dynamic model for the data, together with the factual data
• Model-free (~relations as data)
• Meaning is defined by the (explicit) relations (~ontology)
• Linked open data space (using universal identifiers & ontologies)
Cf. Neumann’s principle: instructions are data
SPARQL
• A query language specification for querying over RDF triples
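The core of a SPARQL query is a basic graph pattern: a list of triple patterns whose variables must be bound consistently across all patterns. The following is a simplified sketch of that matching step in plain Python, not a SPARQL engine; the toy triples and the convention that variables start with `?` are assumptions made for illustration.

```python
# Toy triples with hypothetical names; a real store would use full IRIs.
TRIPLES = [
    ("drugA", "bindsTo", "proteinX"),
    ("drugB", "bindsTo", "proteinY"),
    ("proteinX", "involvedIn", "pathwayP"),
    ("proteinY", "involvedIn", "pathwayQ"),
]


def is_var(term):
    return term.startswith("?")


def unify(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if new.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to a different value
        elif p_term != t_term:
            return None  # constant does not match
    return new


def query(patterns, triples):
    """Evaluate a basic graph pattern by joining the patterns one by one."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for t in triples:
                extended = unify(pattern, t, b)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Roughly: SELECT ?drug ?pathway WHERE { ?drug bindsTo ?t . ?t involvedIn ?pathway }
rows = query([("?drug", "bindsTo", "?t"), ("?t", "involvedIn", "?pathway")], TRIPLES)
result = sorted((r["?drug"], r["?pathway"]) for r in rows)
```

The shared variable `?t` is what joins the two patterns: a drug is paired with a pathway only when the same target appears in both triples.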
Linked data
• Bio2RDF
• ~11 billion triples
• 35 datasets (clinicaltrials, dbSNP, DrugBank, KEGG, PIR, GOA, OrphaNet, PubMed, SIDER, ...)
• local: chembl, pathwaycommons, reactome, wikipathways
• http://download.bio2rdf.org/release/3/release.html
Chem2Bio2RDF I.
Chem2Bio2RDF II.
• Discovery Platform for cross-domain fusion.
• Public, curated, linked data.
– The data sources you already use, integrated and linked together: compounds, targets, pathways, diseases and tissues.
• Everything in triples: subject-predicate-object
Open Pharmacological Space
Precursor: Gene Ontology: tool for the unification of biology, Nature, 2000
@gray_alasdair Big Data Integration
• Discovery Platform to cross barriers.
• The data sources you already use, integrated and linked together: compounds, targets, pathways, diseases and tissues.
• ChEBI, ChEMBL, ChemSpider, ConceptWiki, DisGeNET, DrugBank, Gene Ontology, neXtProt, UniProt and WikiPathways.
• For questions in drug discovery, answers from publications in peer-reviewed scientific journals.
Top questions in the pharma industry I. (Open PHACTS)
Top questions II.
Open PHACTS: databases
Dataset                    Downloaded    Version      Licence          Triples
Bio Assay Ontology         -             -            CC-By            10,360
CALOHA                     8 Apr 2015    2014-01-22   CC-By-ND         14,552
ChEBI                      4 Mar 2015    125          CC-By-SA         1,012,056
ChEMBL                     18 Feb 2015   20.0         CC-By-SA         445,732,880
ConceptWiki                12 Dec 2013   -            CC-By-SA         4,331,760
DisGeNET                   31 Mar 2015   2.1.0        ODbL             15,011,136
Disease Ontology           -             2015-05-21   CC-By            188,062
DrugBank                   19 Feb 2015   4.1          Non-commercial   4,028,767
ENZYME                     -             2015_11      CC-By-ND         61,467
FDA Adverse Events         9 Jul 2012    -            CC0              13,557,070
Gene Ontology              4 Mar 2015    -            CC-By            1,366,494
Gene Ontology Annotations  17 Feb 2015   -            CC-By            879,448,347
NCATS OPDDR                Nov 2015      Oct 2015     -                2,643
neXtProt (NP)              1 Feb 2014    1.0          CC-By-ND         215,006,108
OPS Chemical Registry      4 Nov 2014    -            CC-By-SA         241,986,722
  HMDB                     -             3.6          HMDB             -
  MeSH                     -             2015         MeSH             -
  PDB Ligands              -             2            PDB              -
OPS Metadata               -             -            CC-By-SA         2,053
UniProt                    -             2015_11      CC-By-ND         1,131,186,434
WikiPathways               -             20151118     CC-By            11,781,627
Total: ~3 Billion triples
OPS: open tools for free academic use
Open PHACTS with non-shared, private data for commercial users
Open PHACTS: advantages I.
Open PHACTS: advantages II.
Attrition in drug discovery
De novo drug discovery and development
• 10-17 year process, around 1B USD
• ~10% probability of success from Phase 1 to market
Drug repositioning
• 3-12 year process, up to 80% cost reduction
• Significantly higher probability of success from Phase 1 to market, due to reduced safety and pharmacokinetic uncertainty
De novo discovery vs. repositioning
Scientific motivations for repositioning/rescue
1, Multiple targets
2, Multifactorial diseases
3, Personalized aspects:
3a, pharmaceutical/phenotypic: efficacy, side effects
3b, genetic/epigenetic
4, Complex pathways (accumulating knowledge)
5, New measurements (accumulating omic data)
ENCODE: tissue specific regulation
6, Drugome (2000-7000, 1941) + failed drugs (~2000, +100 new yearly)
[Figure panels: a disease-disease similarity network; a drug-drug network; a gene regulatory network. Sources: A.-L. Barabási, PNAS, 2007; M. Campillos, Science, 2008; Ingenuity Pathway Analysis]
Scientific motivations for repositioning II.
• Magic bullet vs. promiscuous/dirty drugs
• Monogenic vs. multifactorial disease
• Selective optimisation of side activities (SOSA)
• Network pharmacology
• Personalized („precision”) drugs (for sub-populations)
– Special external applicability conditions
– „Pathway” drugs
Repositioning publications
• Ashburn TT, Thor KB: Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov 2004, 3(8):673-683.
• Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P: Drug target identification using side-effect similarity. Science 2008, 321(5886):263-266.
• von Eichborn J, Murgueitio MS, Dunkel M, Koerner S, Bourne PE, Preissner R: PROMISCUOUS: a database for network-based drug-repositioning. Nucleic Acids Research 2010, 1–7.
• Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P: SIDER: A side effect resource to capture phenotypic effects of drugs. Molecular Systems Biology 2010, 6:343.
• ...
[Figure: number of repositioning publications per year, 2004-2013; vertical axis from 0 to 80]
Repositioning: examples
Li and Jones: Drug repositioning for personalized medicine, Genome Medicine 2012, 4:27
Information sources in repositioning and lead discovery.
Profile           Repositioning   HTS-based   Dimension
Chemical          X               X           100-10000
Target protein    X               X           n x 10000
Taxonomy          X               -           3 (depth)
Side effect       X               -           10000
Literature        X               -           100000
Gene Expression   X               X           k x 1000
Off-label use     X               -           10000
Chemical fingerprints
• MACCS 2D, Molconn-Z, Dragon, 3D, ...
• Schrödinger Canvas using Tanimoto distance
• Structure → fingerprint, for 810 drugs:
011001001011010101...
001010000001110100...
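As a small worked example of the Tanimoto measure named above, the sketch below compares the 18 visible bits of the two truncated fingerprints on the slide. The full fingerprints are longer, so the resulting number only illustrates the arithmetic; the function name `tanimoto` is our own, not a Canvas API.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto (Jaccard) coefficient of two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(fp_a, fp_b))
    either = sum(a == "1" or b == "1" for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


# Only the visible prefixes of the two truncated bit strings from the slide.
fp_1 = "011001001011010101"
fp_2 = "001010000001110100"

similarity = tanimoto(fp_1, fp_2)  # 4 bits set in both, 11 bits set in either
distance = 1.0 - similarity        # the corresponding Tanimoto distance
```

Here four bits are set in both strings and eleven in at least one, so the similarity is 4/11 and the distance 7/11.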
Target profiles I.
• Targets: 10,774
• Compound records: 1,715,667
• Distinct compounds: 1,463,270
• Activities: 13,520,737
• Publications: 59,610
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data).
https://www.ebi.ac.uk/chembl
Target profiles II.
• Compounds: 68,280,589
• Tested Compounds: 2,081,593
• Substances: 196,539,272
• Tested Substances: 3,121,708
• BioAssays: 1,112,184
• RNAi BioAssays: 62
• BioActivities: 228,469,266
• Protein Targets: 9,847
• Gene Targets: 41,361
PCSubstance contains more than 180 million records. PCCompound contains more than 63 million unique structures. PCBioAssay contains more than 1 million BioAssays. Each BioAssay contains a varying number of data points.
Side-effect profiles
• DailyMed text mining
– qualitative: SIDER database (http://sideeffects.embl.de)
– quantitative: exact prevalences
• E.g. Olanzapine
514 drugs
Taxonomies
• Anatomical Therapeutic Chemical Classification System (ATC)
– 5 levels: (1) main anatomical group, (2) main therapeutic group, (3) therapeutic/pharmacological subgroup, (4) chemical/therapeutic/pharmacological subgroup, (5) chemical substance
• Drugs.com
– http://www.drugs.com/
• RxNorm, Aetionomy
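The ATC hierarchy is encoded directly in the 7-character code, so a taxonomy-based similarity can be read off as the number of shared leading levels. The sketch below assumes full level-5 codes; the helper names `atc_levels` and `shared_depth` are hypothetical, and shared depth is only one simple choice of taxonomy similarity.

```python
# Prefix lengths of the five ATC levels, e.g. A10BA02 -> A, A10, A10B, A10BA, A10BA02.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code: str) -> list:
    """Split a full 7-character ATC code into its five hierarchical prefixes."""
    if len(code) != 7:
        raise ValueError("expected a 7-character ATC code")
    return [code[:n] for n in ATC_PREFIX_LENGTHS]


def shared_depth(code_a: str, code_b: str) -> int:
    """Number of leading ATC levels two codes have in common (0 to 5)."""
    depth = 0
    for a, b in zip(atc_levels(code_a), atc_levels(code_b)):
        if a != b:
            break
        depth += 1
    return depth


olanzapine, quetiapine, metformin = "N05AH03", "N05AH04", "A10BA02"
same_class = shared_depth(olanzapine, quetiapine)  # share N, N05, N05A, N05AH
unrelated = shared_depth(olanzapine, metformin)    # differ already at level 1
```

Two drugs in the same level-4 class, such as olanzapine and quetiapine, share four levels, while drugs from different anatomical groups share none.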
[Figure: drug_i and drug_j compared at the chemical, target, pathway, “disease” and side-effect levels]
Combination of chemical and side effect information for better target prediction
M.Campillos: Drug target identification using side-effect similarity, Science, 2008
Potential avenues of drug repositioning
Li and Jones: Drug repositioning for personalized medicine, Genome Medicine 2012, 4:27
In silico/virtual screening using LOD
[Figure: compound representations (chemical, side effects, target proteins, MMoA, pathways) taken from Linked Open Data (LOD), e.g. Open PHACTS, and compared with Tanimoto to give compound-compound similarities. Representation as surrogate: Davis, Shrobe, Szolovits, 1993]
Similarity-based virtual screening
1, The “One-One-One” phase
• Henrickson J, Johnson M, Maggiora G: Concepts and applications of molecular similarity. 1991, New York: John Wiley & Sons.
• Willett P, Barnard J, Downs G: Chemical similarity searching. Journal of Chemical Information and Computer Sciences 1998, 38(6):983-996.
2, The „data fusion” phase “One-Many-One”
• Ginn C, Willett P, Bradshaw J: Combination of molecular similarity measures using data fusion. Perspectives in Drug Discovery and Design 2000, 20(1):1-16.
3, The „group fusion” phase “Many-Many-One”
• Whittle M, Gillet V, Willett P, Loesel J: Analysis of data fusion methods in virtual screening: Similarity and group fusion. Journal of Chemical Information and Modeling 2006, 46(6):2206-2219.
• Keiser M, Roth B, Armbruster B, Ernsberger P, Irwin J, Shoichet B: Relating protein pharmacology by ligand chemistry. Nature Biotechnology 2007, 25(2):197-206.
• Chen B, Mueller C, Willett P: Combination rules for group fusion in similarity-based virtual screening. Molecular Informatics 2010, 29(6-7):533-541.
• Gardiner E, Holliday J, O'Dowd C, Willett P: Effectiveness of 2D fingerprints for scaffold hopping. Future Medicinal Chemistry 2011, 3(4):405-414.
• Svensson F, Karlén A, Sköld C: Virtual screening data fusion using both structure- and ligand-based methods. Journal of Chemical Information and Modeling 2011, 52(1):225-232.
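[Figure: four schemes for scoring a database compound B.
1, Similarity-based approach: one query A, one similarity measure S
2, Data fusion: one query A, several similarity measures S1-S5 fused into S*
3, Group fusion: several queries Q1-Q3, one similarity measure S, fused into S*
4, Query Driven Fusion Framework: several queries Q1-Q3 and several similarity measures S1-S5, fused into S*]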
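The difference between data fusion and group fusion can be made concrete with a toy example. The sketch below uses invented similarity scores for three database compounds; the mean and MAX rules are just two common combination rules picked for illustration, and the helper `fuse` is hypothetical.

```python
def fuse(score_lists, rule=max):
    """Combine several score lists (one score per database compound) position by position."""
    return [rule(scores) for scores in zip(*score_lists)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical similarity scores of three database compounds (B1, B2, B3).

# Data fusion: ONE query A, scored by several similarity measures.
scores_by_measure = [
    [0.9, 0.2, 0.4],  # S1, e.g. a chemical fingerprint
    [0.3, 0.8, 0.4],  # S2, e.g. a target profile
    [0.6, 0.5, 0.1],  # S3, e.g. a side-effect profile
]
data_fused = fuse(scores_by_measure, rule=mean)

# Group fusion: SEVERAL queries Q1..Q3, scored by one similarity measure S.
scores_by_query = [
    [0.7, 0.1, 0.2],  # similarity to Q1
    [0.2, 0.9, 0.3],  # similarity to Q2
    [0.4, 0.3, 0.2],  # similarity to Q3
]
group_fused = fuse(scores_by_query, rule=max)

# Indices of the database compounds, best fused score first.
ranking = sorted(range(len(group_fused)), key=lambda i: group_fused[i], reverse=True)
```

Both cases reduce a matrix of scores to one fused score S* per database compound; they differ only in whether the rows are similarity measures or query compounds.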
Similarity-based fusion in drug repositioning
[Figure: query-based optimal fusion (QDF2) of Tanimoto similarities over chemical, side-effect, target protein, MMoA and pathway profiles. Query: set of corresponding drugs]
On the use of query analysis
• The information content of
– the query,
– the information resources,
– and the unknown observations(!)
• allows a one-class analysis of the query (data description)
• and this induction is used in prioritization.
JOINTLY OPTIMIZED:
1. weighting the members in the query (e.g. detection of outliers in the question): GETTING THE RIGHT/IMPROVED QUESTION
2. weighting the similarity measures (e.g. information resources): GETTING THE SCORING (SIMILARITY) FOR THE RIGHT/IMPROVED QUESTION
3. scoring/ranking the aggregate similarity of the unknown data points to the query.
QDF2
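The three ingredients listed above (query-member weights, similarity-measure weights, aggregate scoring) can be sketched as a doubly weighted sum. This is not the QDF2 implementation: the slide describes the weights as jointly optimized, whereas here they are fixed by hand, and all numbers and names are hypothetical.

```python
def fused_score(sim, measure_weights, query_weights):
    """Weighted aggregate similarity of one candidate to a query set.

    sim[k][q] is the similarity of the candidate to query member q under measure k.
    """
    return sum(
        w_k * sum(v_q * s for v_q, s in zip(query_weights, row))
        for w_k, row in zip(measure_weights, sim)
    )


# Hypothetical setting: 2 similarity measures x 3 query drugs.
query_weights = [0.5, 0.5, 0.0]   # third query member treated as an outlier
measure_weights = [0.75, 0.25]    # first information source trusted more

candidates = {
    "cand1": [[0.8, 0.6, 0.1], [0.2, 0.4, 0.9]],
    "cand2": [[0.1, 0.2, 0.9], [0.9, 0.8, 0.9]],
}

scores = {
    name: fused_score(sim, measure_weights, query_weights)
    for name, sim in candidates.items()
}
prioritized = sorted(scores, key=scores.get, reverse=True)
```

Setting a query weight to zero removes an outlying query drug from the question, which is why `cand2`, similar mainly to the down-weighted third member, ends up ranked below `cand1`.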
The repositome
The „repositome” of FDA approved drugs (rows) for the ATC level 4 classes (columns).
Thank you for your attention!