Day 4: KNIME Practical

George Papadatos, ChEMBL group, EMBL-EBI

Francis Atkinson, ChEMBL group, EMBL-EBI

Outline

• Introduction to KNIME

• Basic components
  • Desktop, nodes, dialogs, workflows

• Demo
  • Compound selection for focused screening
  • Read chemical data
  • Calculate properties
  • Apply drug- and lead-likeness filters
  • Remove “nasty” compounds
  • Pick diverse molecules
  • Visualize results and plot properties

• Exercises 1 & 2 (hands-on)

12/12/2013 Resources for Computational Drug Discovery

Are there KNIME users among us?

What is KNIME?

• KNIME = Konstanz Information Miner

• Developed at University of Konstanz in Germany

• Desktop version available free of charge (Open Source)

• Modular platform for building and executing workflows using predefined components, called nodes

• Core functionality available for tasks such as standard data mining, analysis and manipulation

• Extra features and functionality available in KNIME through extensions from various groups and vendors

• Written in Java based on the Eclipse SDK platform

KNIME resources

• Web pages (documentation)
  • www.knime.org | tech.knime.org | tech.knime.org/installation-0

• Downloads
  • knime.org/download-desktop

• Community forum
  • tech.knime.org/forum

• Books and white papers
  • knime.org/node/33079

• Myself
  • [email protected]


What can you do with KNIME?

• Data manipulation and analysis
  • File & database I/O, sorting, filtering, grouping, joining, pivoting

• Data mining / machine learning
  • R, WEKA, KNIME, interactive plotting

• Chemoinformatics
  • Conversions, similarity, clustering, (Q)SAR analysis, MMPs, reaction enumeration

• Scripting integration
  • R, Perl, Python, Matlab, Octave, Groovy

• Reporting

• So much more
  • Bioinformatics, HTS & image analysis, network & text mining
  • Marketing, big data and business analytics


Community contribution nodes

• http://tech.knime.org/community

• Chemoinformatics
  • ChEMBL and ChEBI (EBI) – SureChEMBL nodes coming soon!
  • CDK (EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos (NovaMechanics)

• Bioinformatics
  • HCS (MPI), NGS (Konstanz), Image analysis

• Text mining
  • Palladian

• Integration
  • Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis)


Installation & updates

• Download and unzip KNIME
  • No further setup required

• Additional nodes after first launch

• knime.ini contains arguments & parameters for launch

• New software (nodes) from update sites
  • http://tech.knime.org/update/community-contributions/release

• Workflows and data are stored in a workspace
  • /Users/georgep/knime/workspace_mac_new
  • C:\knime_2.8.2\workspace

• Customization in: File → Preferences → KNIME


KNIME Workbench

[Screenshot of the KNIME Workbench. Labeled panels: workflow editor, console, outline, tabs, node description, node repository, workflow projects, favorite nodes, public server. Labeled toolbar buttons: Auto-layout, Execute, Execute all nodes.]

KNIME nodes: Overview

Node = basic processing unit of a KNIME workflow, which performs a particular task

• Title

• Icon

• Input port(s) – on the left of the icon

• Output port(s) – on the right of the icon

• Status display (‘traffic lights’)
  • Red (not ready)
  • Amber (ready)
  • Green (executed)
  • Blue bar during execution (with percentage or flashing)

• Sequence number

• Right-click menu
  • To configure and execute the node, display the output views, edit the node, and display data for the ports


KNIME nodes: Dialogs

Configuration menus for selected nodes

[Screenshots of node configuration dialogs. Callouts: “Explicit column type”, “Double click to configure…”]

An example completed workflow

• Workflows can be imported and exported as .zip files
  • With or without the underlying data
  • File → Import KNIME workflow…
  • File → Export KNIME workflow…


Any questions so far?


Compound selection for focused screening

1. Read chemical data

2. Calculate phys/chem properties

3. Apply drug- and lead-likeness filters

4. Apply more filters (e.g. remove solubility liabilities)

5. Apply substructural filters (PAINS subset)

6. Pick diverse molecules


The objective


First steps - I

• Locate the directory with today’s material

• Copy and paste it to your desktop

• You can take it with you too

• Open the presentation file

• Import the FocusedScreeningSelection.zip to KNIME

• Menu: File → Import KNIME workflow…


First steps - II

• Open a new workflow

• Right click on the workflow projects area


Part 1: Reading chemical data


SDF Reader


.\data\SMDC_cleaned_nodups.sdf

Inspect the structures…


Right click on the node
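For readers who want to see what the SDF Reader node has to deal with, here is a minimal sketch of the SD file layout in plain Python. It assumes the usual conventions: each record ends with a `$$$$` line, the molblock ends with `M  END`, and data fields start with a `> <FieldName>` header. The function names and the toy records are hypothetical, and this is not how the KNIME node is implemented.

```python
from typing import Dict, Iterable, Iterator, List


def parse_record(lines: List[str]) -> Dict[str, object]:
    """Split one SDF record into title, molblock text and data fields."""
    # The molblock runs up to and including the "M  END" line.
    end = next(
        (i for i, line in enumerate(lines) if line.rstrip() == "M  END"),
        len(lines) - 1,
    )
    props: Dict[str, List[str]] = {}
    key = None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Field header, e.g. "> <ID>": take the text between < and >.
            start, stop = line.find("<"), line.rfind(">")
            key = line[start + 1:stop] if start != -1 and stop > start else None
            if key is not None:
                props[key] = []
        elif not line.strip():
            key = None  # a blank line closes the current field
        elif key is not None:
            props[key].append(line)
    return {
        "name": lines[0].strip() if lines else "",
        "molblock": "\n".join(lines[:end + 1]),
        "properties": {k: "\n".join(v) for k, v in props.items()},
    }


def iter_sdf_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one parsed record per '$$$$'-terminated block."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("$$$$"):
            if record:
                yield parse_record(record)
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):
        yield parse_record(record)  # tolerate a missing final delimiter


# Toy input with empty molblocks; the IDs and suppliers are made up.
SAMPLE = """mol-1
  toy

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
cmpd-001

> <Supplier>
vendor A

$$$$
mol-2
  toy

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
cmpd-002

$$$$
"""

records = list(iter_sdf_records(SAMPLE.splitlines()))
for rec in records:
    print(rec["name"], rec["properties"])
```

To read the practical's file instead of the toy text, the same generator could be fed an open file handle, since it only needs an iterable of lines.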

Molecule to RDKit


Part 2: Property-based filtering


Descriptor Calculation


Java Snippet


.\code\Lipinski.txt

Numeric Row Splitter


Inspect the Lipinski fails…
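The contents of Lipinski.txt are not reproduced on the slides, so the following is only a sketch of the logic such a snippet might contain, written in Python rather than Java. It counts violations of the standard rule-of-five cut-offs (MW > 500, logP > 5, more than 5 H-bond donors, more than 10 H-bond acceptors) and then splits the rows on that count, roughly in the way a Numeric Row Splitter would. The `Props` class, the function names, the `max_violations=1` default and the property values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Props:
    """Precomputed descriptors for one compound (illustrative values only)."""
    name: str
    mw: float    # molecular weight
    logp: float  # calculated logP
    hbd: int     # hydrogen-bond donors
    hba: int     # hydrogen-bond acceptors


def lipinski_violations(p: Props) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([p.mw > 500, p.logp > 5, p.hbd > 5, p.hba > 10])


def split_on_violations(
    rows: Iterable[Props], max_violations: int = 1
) -> Tuple[List[Props], List[Props]]:
    """Return (passes, fails); max_violations=1 is an assumed tolerance."""
    passes: List[Props] = []
    fails: List[Props] = []
    for row in rows:
        target = passes if lipinski_violations(row) <= max_violations else fails
        target.append(row)
    return passes, fails


# Hypothetical compounds, not taken from the practical's data set.
compounds = [
    Props("cmpd-001", mw=350.4, logp=2.1, hbd=2, hba=5),
    Props("cmpd-002", mw=520.0, logp=3.0, hbd=1, hba=4),
    Props("cmpd-003", mw=720.9, logp=6.3, hbd=6, hba=12),
]
passes, fails = split_on_violations(compounds)
print([p.name for p in passes], [p.name for p in fails])
```

Setting `max_violations=0` gives the stricter reading in which any single violation sends a compound to the fails output.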



Java Snippet


.\code\Oprea.txt

Inspect the Oprea fails…
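A lead-likeness filter can be sketched as a table of allowed ranges, which also makes it easy to report why a compound failed. The ranges below are ones often quoted for lead-likeness in the literature following Oprea et al.; they are assumptions here and may differ from what Oprea.txt actually encodes, so treat them as placeholders. The descriptor names and compound values are hypothetical as well.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[Optional[float], Optional[float]]  # (minimum, maximum); None = open

# Assumed lead-likeness ranges; placeholders, not the content of Oprea.txt.
LEAD_LIKE_RANGES: Dict[str, Range] = {
    "mw": (None, 450),
    "logp": (-3.5, 4.5),
    "rings": (None, 4),
    "rot_bonds": (None, 10),
    "hbd": (None, 5),
    "hba": (None, 8),
}


def failed_criteria(row: Dict[str, float], ranges: Dict[str, Range]) -> List[str]:
    """Return the names of all descriptors that fall outside their range."""
    failures = []
    for name, (low, high) in ranges.items():
        value = row[name]
        if (low is not None and value < low) or (high is not None and value > high):
            failures.append(name)
    return failures


# Hypothetical descriptor rows.
rows = {
    "cmpd-001": {"mw": 310.0, "logp": 2.4, "rings": 3, "rot_bonds": 4, "hbd": 1, "hba": 4},
    "cmpd-002": {"mw": 480.0, "logp": 5.1, "rings": 3, "rot_bonds": 6, "hbd": 2, "hba": 6},
}
report = {name: failed_criteria(row, LEAD_LIKE_RANGES) for name, row in rows.items()}
print(report)
```

Keeping the ranges in a dictionary means a different rule set can be swapped in without touching the filtering function.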




Inspect the Solubility fails…


Part 3: Substructure-based filtering


Molecule to Indigo



File reader .\data\PAINS_clean_half.sdf


Query Molecule to Indigo


Inspect the SMARTS rules


Chunk Loop Start


Substructure Matcher


Loop End


Inspect matched structures…



Reference Row Filter
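The control flow of this part (loop over chunks of query patterns, collect every molecule that matches, then drop the matched rows from the original table) can be sketched in a few lines of Python. This is an interpretation of how the nodes are chained, not a description of their internals. The matcher is passed in as a function, and the one used here is a plain substring test on SMILES strings, which is NOT real substructure matching; a toolkit such as Indigo or RDKit would be needed for that. The molecules and alert patterns are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Set


def chunks(rows: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` rows (the chunk loop)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def matched_ids(
    molecules: Dict[str, str],
    queries: List[str],
    matches: Callable[[str, str], bool],
    chunk_size: int = 2,
) -> Set[str]:
    """Collect IDs of molecules hit by any query, one chunk of queries at a time."""
    hits: Set[str] = set()
    for chunk in chunks(queries, chunk_size):      # loop start
        for query in chunk:
            for mol_id, mol in molecules.items():
                if matches(mol, query):            # matcher step
                    hits.add(mol_id)
    return hits                                    # loop end: results collected


def exclude_reference_rows(molecules: Dict[str, str], reference: Set[str]) -> Dict[str, str]:
    """Keep only rows whose ID is absent from the reference set."""
    return {k: v for k, v in molecules.items() if k not in reference}


def toy_matcher(smiles: str, pattern: str) -> bool:
    """Stand-in matcher: substring test only, not chemically meaningful."""
    return pattern in smiles


# Hypothetical molecules (SMILES) and hypothetical alert patterns.
library = {"m1": "CCO", "m2": "c1ccccc1N=Nc1ccccc1", "m3": "CC(=O)O"}
alerts = ["N=N", "C(=O)Cl", "S(=O)(=O)F"]

flagged = matched_ids(library, alerts, toy_matcher)
clean = exclude_reference_rows(library, flagged)
print(sorted(flagged), sorted(clean))
```

Processing the queries in chunks does not change the result; it only bounds how much work is done per iteration, which is the usual reason for a chunked loop.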

Part 4: Diversity picking and plotting


RDKit Fingerprint


Inspect the fingerprints…



RDKit Diversity Picker
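One common way to pick diverse molecules from fingerprints is a MaxMin-style selection on Tanimoto distance: start from one molecule, then repeatedly add the candidate whose nearest already-picked neighbour is farthest away. The sketch below illustrates that idea in plain Python with fingerprints represented as sets of on-bit indices. Whether the node in the workflow is configured this way is an assumption; the fixed starting index and the toy fingerprints are illustrative choices, and real pickers often start from a random molecule.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_pick(fps: Sequence[FrozenSet[int]], n_pick: int, first: int = 0) -> List[int]:
    """Greedy MaxMin selection; returns indices in the order they were picked."""
    if n_pick <= 0 or not fps:
        return []
    picked = [first]
    # Distance from every molecule to its nearest picked molecule so far.
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = (i for i in range(len(fps)) if i not in picked)
        # Ties are resolved in favour of the lowest index.
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[nxt]))
    return picked


# Toy fingerprints: two loose families, {1,2,...} and {7,8,...}.
fingerprints = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8, 10}),
]
selection = maxmin_pick(fingerprints, n_pick=3)
print(selection)
```

With these toy fingerprints the second pick comes from the other family, which is the behaviour a diversity picker is meant to show.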


2D/3D Scatterplot


Inspect the plot…



Exercise 1

• Read an SD file with drug information from ChEMBL

• Inspect the structures and their properties

• Select only drugs that were released after 1990 (First Approval)

• Select only drugs that target human (Homo sapiens)

• How many drugs remain now?

• Save the workflow

• Tips
  • Open a new workflow
  • Use the SDF Reader node
  • Use the Numeric Row Splitter node to filter on First Approval >= 1990
  • Use the Nominal Value Row Filter node to filter on Organism = Homo sapiens
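The two filters in the tips can be sketched outside KNIME as follows. The column names First Approval and Organism come from the exercise text; the rows themselves are made up, and the choice to skip rows with a missing or non-numeric year is an assumption. The cut-off follows the tip (>= 1990), which also keeps drugs approved in 1990 itself.

```python
from typing import Dict, List


def filter_drugs(
    rows: List[Dict[str, str]],
    min_year: int = 1990,
    organism: str = "Homo sapiens",
) -> List[Dict[str, str]]:
    """Keep rows approved in or after min_year that target the given organism."""
    kept = []
    for row in rows:
        # SDF data fields arrive as text, so the year must be converted.
        year_text = row.get("First Approval", "").strip()
        if not year_text.isdigit():
            continue  # assumed handling: drop rows without a usable year
        if int(year_text) >= min_year and row.get("Organism", "").strip() == organism:
            kept.append(row)
    return kept


# Hypothetical rows standing in for the ChEMBL drug file.
drugs = [
    {"Name": "drug A", "First Approval": "1985", "Organism": "Homo sapiens"},
    {"Name": "drug B", "First Approval": "1990", "Organism": "Homo sapiens"},
    {"Name": "drug C", "First Approval": "2004", "Organism": "Escherichia coli"},
    {"Name": "drug D", "First Approval": "", "Organism": "Homo sapiens"},
    {"Name": "drug E", "First Approval": "2011", "Organism": "Homo sapiens"},
]
remaining = filter_drugs(drugs)
print(len(remaining), [d["Name"] for d in remaining])
```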


Exercise 2

• Continue from your previous workflow

• Calculate MW and logP of the drug compounds

• Generate a scatter plot of MW and logP

• Can you see any compounds with high MW and logP?

• Tips
  • Use the Molecule to RDKit node
  • Use the RDKit Descriptor Calculator node
  • Include the SlogP and ExactMW descriptors
  • Use the 2D/3D Scatterplot node


Any questions? Last chance!


Conclusions

• Compound selection for focused screening
  • Typical scenario

• KNIME
  • Open and free
  • Data analysis
  • Chemoinformatics toolkits
    • Erl Wood, RDKit, Indigo, CDK, etc.
  • Lots of other functionality

• More advanced KNIME on Friday around lunch time


Further reading

• Open data and tools


1. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 2012 ASAP.

2. Saubern, S.; Guha, R.; Baell, J. B., KNIME workflow to assess PAINS filters in SMARTS format. Comparison of RDKit and Indigo cheminformatics libraries. Molecular Informatics 2011, 30, (10), 847-850.

3. Barnes, M. R.; Harland, L.; Foord, S. M.; Hall, M. D.; Dix, I.; Thomas, S.; Williams-Jones, B. I.; Brouwer, C. R., Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nature Reviews Drug Discovery 2009, 8, (9), 701-708.

4. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B., KNIME: The Konstanz Information Miner. In Data Analysis, Machine Learning and Applications, Preisach, C.; Burkhardt, H.; Schmidt-Thieme, L.; Decker, R., Eds. Springer: Berlin, 2008; pp 319-326.

5. Tiwari, A.; Sekhar, A. K. T., Workflow based framework for life science informatics. Computational Biology and Chemistry 2007, 31, (5-6), 305-319.

Further reading

• High throughput screening

1. Bajorath, J., Integration of virtual and high-throughput screening. Nature Reviews Drug Discovery 2002, 1, (11), 882-894.

2. Harper, G.; Pickett, S. D.; Green, D. V. S., Design of a compound screening collection for use in High Throughput Screening. Combinatorial Chemistry & High Throughput Screening 2004, 7, (1), 63-70.

• Lead- and drug-likeness

1. Chuprina, A.; Lukin, O.; Demoiseaux, R.; Buzko, A.; Shivanyuk, A., Drug- and lead-likeness, target class, and molecular diversity analysis of 7.9 million commercially available organic compounds provided by 29 suppliers. Journal of Chemical Information and Modeling 2010, 50, (4), 470-479.

2. Lipinski, C. A., Lead- and drug-like compounds: the rule-of-five revolution. Drug Discovery Today: Technologies 2004, 1, (4), 337-341.

3. Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D., Is there a difference between leads and drugs? A historical perspective. Journal of Chemical Information and Computer Sciences 2001, 41, (5), 1308-1315.

Further reading

• Physicochemical properties and drug discovery

1. Brüstle, M.; Beck, B.; Schindler, T.; King, W.; Mitchell, T.; Clark, T., Descriptors, physical properties, and drug-likeness. Journal of Medicinal Chemistry 2002, 45, (16), 3345-3355.

2. Hill, A. P.; Young, R. J., Getting physical in drug discovery: A contemporary perspective on solubility and hydrophobicity. Drug Discovery Today 2010, 15, (15/16), 648-655.

3. Leeson, P. D.; Springthorpe, B., The influence of drug-like concepts on decision-making in medicinal chemistry. Nature Reviews Drug Discovery 2007, 6, (11), 881-890.

• Structural alerts in HTS

1. Baell, J. B.; Holloway, G. A., New substructure filters for removal of Pan Assay Interference Compounds (PAINS) from screening libraries and for their exclusion in bioassays. Journal of Medicinal Chemistry 2010, 53, (7), 2719-2740.

2. Rishton, G. M., Reactive compounds and in vitro false positives in HTS. Drug Discovery Today 1997, 2, (9), 382-384.

Further reading

• Similarity and diversity


1. Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday, J.; Lahana, R.; Willett, P., Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships 2002, 21, (6), 598-604.

2. Bender, A.; Glen, R. C., Molecular similarity: a key technique in molecular informatics. Organic and Biomolecular Chemistry 2004, 2, 3204-3218.

3. Gorse, A.-D., Diversity in medicinal chemistry space. Current Topics in Medicinal Chemistry 2006, 6, (1), 3-18.

4. Maldonado, A.; Doucet, J.; Petitjean, M.; Fan, B.-T., Molecular similarity and diversity in chemoinformatics: From theory to applications. Molecular Diversity 2006, 10, (1), 39-79.

5. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 2010, 50, (5), 742-754.

6. Schuffenhauer, A.; Brown, N., Chemical diversity and biological activity. Drug Discovery Today: Technologies 2006, 3, (4), 387-395.

7. Willett, P.; Barnard, J. M.; Downs, G. M., Chemical similarity searching. Journal of Chemical Information and Computer Sciences 1998, 38, (6), 983-996.
