Day 4: KNIME Practical
George Papadatos, ChEMBL group, EMBL-EBI
Francis Atkinson, ChEMBL group, EMBL-EBI
Outline
• Introduction to KNIME
• Basic components • Desktop, nodes, dialogs, workflows
• Demo • Compound selection for focused screening
• Read chemical data
• Calculate properties
• Apply drug- and lead- likeness filters
• Remove “nasty” compounds
• Pick diverse molecules
• Visualize results and plot properties
• Exercises 1 & 2 (hands-on)
12/12/2013 Resources for Computational Drug Discovery
What is KNIME?
• KNIME = Konstanz Information Miner
• Developed at University of Konstanz in Germany
• Desktop version available free of charge (Open Source)
• Modular platform for building and executing workflows using predefined components, called nodes
• Core functionality available for tasks such as standard data mining, analysis and manipulation
• Extra features and functionality available in KNIME through extensions from various groups and vendors
• Written in Java based on the Eclipse SDK platform
KNIME resources
• Web pages (documentation) • www.knime.org | tech.knime.org | tech.knime.org/installation-0
• Downloads • knime.org/download-desktop
• Community forum • tech.knime.org/forum
• Books and white papers • knime.org/node/33079
• Myself
What can you do with KNIME?
• Data manipulation and analysis • File & database I/O, sorting, filtering, grouping, joining, pivoting
• Data mining / machine learning • R, WEKA, KNIME, interactive plotting
• Chemoinformatics • Conversions, similarity, clustering, (Q)SAR analysis, MMPs, reaction enumeration
• Scripting integration • R, Perl, Python, Matlab, Octave, Groovy
• Reporting
• So much more • Bioinformatics, HTS & image analysis, network & text mining
• Marketing, big data and business analytics
Community contribution nodes
• http://tech.knime.org/community
• Chemoinformatics • ChEMBL and ChEBI (EBI) – SureChEMBL nodes coming soon!
• CDK (EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos (NovaMechanics)
• Bioinformatics • HCS (MPI), NGS (Konstanz), Image analysis
• Text mining • Palladian
• Integration • Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis)
Installation & updates
• Download and unzip KNIME • No further setup required
• Additional nodes after first launch
• knime.ini contains arguments & parameters for launch
• New software (nodes) from update sites • http://tech.knime.org/update/community-contributions/release
• Workflows and data are stored in a workspace • /Users/georgep/knime/workspace_mac_new
• C:\knime_2.8.2\workspace
• Customization in: File > Preferences > KNIME
KNIME Workbench
[Screenshot: KNIME workbench. Labelled areas: workflow editor, console, outline, tabs, node description, node repository, workflow projects, favorite nodes, public server. Toolbar buttons: Auto-layout, Execute, Execute all nodes.]
KNIME nodes: Overview
Node = basic processing unit of a KNIME workflow, which performs a particular task
• Title
• Icon
• Input port(s): on the left of the icon
• Output port(s): on the right of the icon
• Status display (‘traffic lights’) • Red (not ready) • Amber (ready) • Green (executed)
• Blue bar during execution (with percentage or flashing)
• Sequence number
• Right-click menu • Configure and execute the node, display the output views, edit the node, and display data for the ports
KNIME nodes: Dialogs
• Configuration menus for selected nodes
• Double click to configure…
• Explicit column type
An example completed workflow
• Workflows can be imported and exported as .zip files
• With or without the underlying data
• File > Import KNIME workflow…
• File > Export KNIME workflow…
Compound selection for focused screening
1. Read chemical data
2. Calculate phys/chem properties
3. Apply drug- and lead-likeness filters
4. Apply more filters (e.g. remove solubility liabilities)
5. Apply substructural filters (PAINS subset)
6. Pick diverse molecules
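Steps 2 and 3 can also be sketched outside KNIME. The Python below is only an illustration, not the workflow's actual configuration: it assumes the properties have already been calculated, uses the commonly cited rule-of-five limits (MW ≤ 500, logP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10), and the compound records and field names are made up.

```python
from typing import Dict, List, Tuple

# Commonly cited rule-of-five limits; the thresholds configured in the
# KNIME workflow may differ (assumption of this sketch).
RULE_OF_FIVE = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def lipinski_violations(props: Dict[str, float]) -> int:
    """Count how many rule-of-five limits a compound exceeds."""
    return sum(1 for key, limit in RULE_OF_FIVE.items() if props[key] > limit)


def split_by_rule_of_five(
    compounds: List[dict], max_violations: int = 0
) -> Tuple[List[dict], List[dict]]:
    """Split compounds into (passes, fails), like a two-port filter node."""
    passed, failed = [], []
    for compound in compounds:
        if lipinski_violations(compound) <= max_violations:
            passed.append(compound)
        else:
            failed.append(compound)
    return passed, failed


# Hypothetical, pre-calculated property values for three compounds.
compounds = [
    {"id": "cpd-1", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"id": "cpd-2", "mw": 612.8, "logp": 6.3, "hbd": 1, "hba": 9},
    {"id": "cpd-3", "mw": 498.0, "logp": 4.9, "hbd": 6, "hba": 11},
]

passed, failed = split_by_rule_of_five(compounds)
print("pass:", [c["id"] for c in passed])
print("fail:", [c["id"] for c in failed])
```

Setting `max_violations=1` would tolerate a single violation, which is another common reading of the rule.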
First steps - I
• Locate the directory with today’s material
• Copy and paste it to your desktop
• You can take it with you too
• Open the presentation file
• Import the FocusedScreeningSelection.zip to KNIME
• Menu: File > Import workflow to KNIME
First steps - II
• Open a new workflow
• Right click on the workflow projects area
SDF Reader
.\data\SMDC_cleaned_nodups.sdf
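For orientation, an SD file is a sequence of records, each a molfile block followed by `> <FIELD>` data items and closed by a `$$$$` line. The sketch below is not what the SDF Reader node does internally; it is a minimal pure-Python reader for that layout, and the sample record (including its empty atom block and field names) is a made-up stub.

```python
from typing import Dict, Iterable, Iterator, List


def parse_record(lines: List[str]) -> dict:
    """Split one SD record into its molfile block and its data fields."""
    molblock: List[str] = []
    fields: Dict[str, List[str]] = {}
    name = None          # data field currently being read, if any
    in_data = False      # True once the first "> <FIELD>" header was seen
    for line in lines:
        if line.startswith(">") and "<" in line:
            name = line[line.index("<") + 1 : line.rindex(">")]
            fields[name] = []
            in_data = True
        elif name is not None:
            if line.strip() == "":
                name = None              # a blank line closes the data item
            else:
                fields[name].append(line)
        elif not in_data:
            molblock.append(line)
    return {
        "molblock": "\n".join(molblock),
        "fields": {key: "\n".join(value) for key, value in fields.items()},
    }


def iter_sdf_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per '$$$$' terminator."""
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "$$$$":
            yield parse_record(record)
            record = []
        else:
            record.append(line)


# Made-up stub record: the atom block is empty, only the layout matters here.
SAMPLE = """\
ethanol (stub record)
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <Name>
ethanol

> <MW>
46.07

$$$$
"""

records = list(iter_sdf_records(SAMPLE.splitlines()))
print(len(records), records[0]["fields"])
```

Note that all field values come back as strings, which is why the column type may need to be set explicitly before numeric filtering.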
Inspect the structures…
Right click on the node
Inspect the Lipinski fails…
Right click on the node
Inspect the Oprea fails…
Right click on the node
Inspect the Solubility fails…
Right click on the node
Inspect matched structures…
Right click on the node
Inspect the fingerprints…
Right click on the node
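Once fingerprints are available, one common way to pick diverse molecules is MaxMin selection on Tanimoto distance. Whether the demo workflow's picker node uses exactly this algorithm is not stated on the slides, so treat the Python below as an illustrative sketch: fingerprints are modelled as sets of "on" bit positions, and the four example fingerprints are made up.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_pick(fps: List[Set[int]], n_pick: int, seed_index: int = 0) -> List[int]:
    """Greedy MaxMin: repeatedly add the compound whose nearest already-picked
    neighbour is furthest away (distance = 1 - Tanimoto)."""
    if not fps or n_pick <= 0:
        return []
    picked = [seed_index]
    # min_dist[i] = distance from compound i to its closest picked compound
    min_dist = [1.0 - tanimoto(fp, fps[seed_index]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[nxt]))
    return picked


# Made-up fingerprints: 0 and 1 are near-duplicates, 2 is unrelated to 0,
# and 3 shares bits with both 0 and 2.
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {1, 2, 10, 11},
]

print(maxmin_pick(fps, 3))
```

Starting from compound 0, the near-duplicate compound 1 is the one left out, which is the behaviour wanted from a diversity pick.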
Exercise 1
• Read an SD file (SDF) with drug information from ChEMBL
• Inspect the structures and their properties
• Select only drugs that were released after 1990 (First Approval)
• Select only drugs that target human (Homo sapiens)
• How many drugs remain now?
• Save the workflow
• Tips • Open a new workflow
• Use the SDF Reader node
• Use the Numeric Row Splitter node to filter on First Approval >= 1990
• Use the Nominal Value Row Filter node to filter on Organism = Homo sapiens
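The logic of the two filtering tips can be sketched in plain Python to check what the KNIME nodes should return. This is an illustration only: the function names are hypothetical stand-ins for the nodes, the five drug rows are made up, and rows with a missing approval year are simply sent to the second output, which is a choice of this sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]


def numeric_row_split(
    rows: Iterable[Row], column: str, lower_bound: float
) -> Tuple[List[Row], List[Row]]:
    """Return (rows with column >= lower_bound, all other rows)."""
    matching: List[Row] = []
    rest: List[Row] = []
    for row in rows:
        value = row.get(column)
        if isinstance(value, (int, float)) and value >= lower_bound:
            matching.append(row)
        else:
            rest.append(row)  # includes rows where the value is missing
    return matching, rest


def nominal_row_filter(rows: Iterable[Row], column: str, allowed: Set[str]) -> List[Row]:
    """Keep only rows whose nominal value is in the allowed set."""
    return [row for row in rows if row.get(column) in allowed]


# Made-up rows; the real column names come from the SD file's data fields.
drugs: List[Row] = [
    {"name": "drug-A", "First Approval": 1985, "Organism": "Homo sapiens"},
    {"name": "drug-B", "First Approval": 1996, "Organism": "Homo sapiens"},
    {"name": "drug-C", "First Approval": 2003, "Organism": "Escherichia coli"},
    {"name": "drug-D", "First Approval": None, "Organism": "Homo sapiens"},
    {"name": "drug-E", "First Approval": 1990, "Organism": "Homo sapiens"},
]

recent, older = numeric_row_split(drugs, "First Approval", 1990)
human_recent = nominal_row_filter(recent, "Organism", {"Homo sapiens"})
print(len(human_recent), [row["name"] for row in human_recent])
```

The `>=` bound follows the tip above, so a drug first approved in 1990 itself is kept.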
Exercise 2
• Continue from your previous workflow
• Calculate MW and logP of the drug compounds
• Generate a scatter plot of MW and logP
• Can you see any compounds with high MW and logP?
• Tips • Use the Molecule to RDKit node
• Use the RDKit Descriptor Calculator node
• Include the SlogP and ExactMW descriptors
• Use the 2D/3D Scatterplot node
Conclusions
• Compound selection for focused screening
• Typical scenario
• KNIME
• Open and free
• Data analysis
• Chemoinformatics toolkits • Erl Wood, RDKit, Indigo, CDK, etc.
• Lots of other functionality
• More advanced KNIME on Friday around lunch time
Further reading
• Open data and tools
1. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC:
A free tool to discover chemistry for biology. Journal of Chemical Information
and Modeling 2012 ASAP.
2. Saubern, S.; Guha, R.; Baell, J. B., KNIME workflow to assess PAINS filters in
SMARTS format. Comparison of RDKit and Indigo cheminformatics libraries.
Molecular Informatics 2011, 30, (10), 847-850.
3. Barnes, M. R.; Harland, L.; Foord, S. M.; Hall, M. D.; Dix, I.; Thomas, S.;
Williams-Jones, B. I.; Brouwer, C. R., Lowering industry firewalls: pre-competitive
informatics initiatives in drug discovery. Nature Reviews Drug
Discovery 2009, 8, (9), 701-708.
4. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.;
Sieb, C.; Thiel, K.; Wiswedel, B., KNIME: The Konstanz Information Miner. In
Data Analysis, Machine Learning and Applications, Preisach, C.; Burkhardt, H.;
Schmidt-Thieme, L.; Decker, R., Eds. Springer: Berlin, 2008; pp 319-326.
5. Tiwari, A.; Sekhar, A. K. T., Workflow based framework for life science
informatics. Computational Biology and Chemistry 2007, 31, (5-6), 305-319.
Further reading
• High throughput screening
1. Bajorath, J., Integration of virtual and high-throughput screening. Nature
Reviews Drug Discovery 2002, 1, (11), 882-894.
2. Harper, G.; Pickett, S. D.; Green, D. V. S., Design of a compound
screening collection for use in High Throughput Screening. Combinatorial
Chemistry & High Throughput Screening 2004, 7, (1), 63-70.
• Lead- and drug-likeness
1. Chuprina, A.; Lukin, O.; Demoiseaux, R.; Buzko, A.; Shivanyuk, A., Drug- and
lead-likeness, target class, and molecular diversity analysis of 7.9 million
commercially available organic compounds provided by 29 suppliers. Journal of
Chemical Information and Modeling 2010, 50, (4), 470-479.
2. Lipinski, C. A., Lead- and drug-like compounds: the rule-of-five revolution. Drug
Discovery Today: Technologies 2004, 1, (4), 337-341.
3. Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D., Is there a difference
between leads and drugs? A historical perspective. Journal of Chemical
Information and Computer Sciences 2001, 41, (5), 1308-1315.
Further reading
• Physicochemical properties and drug discovery
1. Brüstle, M.; Beck, B.; Schindler, T.; King, W.; Mitchell, T.; Clark, T., Descriptors,
physical properties, and drug-likeness. Journal of Medicinal Chemistry 2002, 45,
(16), 3345-3355.
2. Hill, A. P.; Young, R. J., Getting physical in drug discovery: A contemporary
perspective on solubility and hydrophobicity. Drug Discovery Today 2010, 15,
(15/16), 648-655.
3. Leeson, P. D.; Springthorpe, B., The influence of drug-like concepts on decision-making
in medicinal chemistry. Nature Reviews Drug Discovery 2007, 6, (11), 881-890.
• Structural alerts in HTS
1. Baell, J. B.; Holloway, G. A., New substructure filters for removal of Pan Assay
Interference Compounds (PAINS) from screening libraries and for their exclusion in
bioassays. Journal of Medicinal Chemistry 2010, 53, (7), 2719-2740.
2. Rishton, G. M., Reactive compounds and in vitro false positives in HTS. Drug
Discovery Today 1997, 2, (9), 382-384.
Further reading
• Similarity and diversity
1. Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday,
J.; Lahana, R.; Willett, P., Identification of diverse database subsets using
property-based and fragment-based molecular descriptions. Quantitative
Structure-Activity Relationships 2002, 21, (6), 598-604.
2. Bender, A.; Glen, R. C., Molecular similarity: a key technique in molecular
informatics. Organic and Biomolecular Chemistry 2004, 2, 3204-3218.
3. Gorse, A.-D., Diversity in medicinal chemistry space. Current Topics in Medicinal
Chemistry 2006, 6, (1), 3-18.
4. Maldonado, A.; Doucet, J.; Petitjean, M.; Fan, B.-T., Molecular similarity and
diversity in chemoinformatics: From theory to applications. Molecular Diversity
2006, 10, (1), 39-79.
5. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. Journal of Chemical
Information and Modeling 2010, 50, (5), 742-754.
6. Schuffenhauer, A.; Brown, N., Chemical diversity and biological activity. Drug
Discovery Today: Technologies 2006, 3, (4), 387-395.
7. Willett, P.; Barnard, J. M.; Downs, G. M., Chemical similarity searching. Journal
of Chemical Information and Computer Sciences 1998, 38, (6), 983-996.