indiana university school of david wild – joint iu, michigan, lilly meeting, october 2006. page 1...

David Wild – Joint IU, Michigan, Lilly Meeting, October 2006. Indiana University School of

Smart Mining Interfaces, Workflows, and Data Mining the NCI DTP Dataset

David Wild

Joint IU / Michigan / Lilly Meeting

Indianapolis, August 2006


Acknowledgements

• Xiao Dong - HTCL database & mining, web services, workflows, smart mining interfaces

• Rajarshi Guha - Smart Mining Interfaces, Workflows, Web Services

• Geoffrey/Marlon’s lab: Smitha Ajay, Sima Patel, Jake Kim - web services, workflows, portlet interfaces

• Others: Junguk Hur, Chris Mueller, Huijun Wang …

• Funding from Microsoft eScience and CICC


Outline

• “Smart mining of drug discovery information” - our approach to connecting scientists with the information they need

• Application of smart mining to post-HTS chemistry analysis

• Current interface-level projects

• Examining the DTP HTCL dataset as a standard for multi-screen data mining and as a surrogate for HTS


Classic approach to designing tools for chemists

• Select the computational chemistry methods that seem to be the most useful for drug discovery (or whatever), such as docking and similarity searching

• Have computational chemists / modelers figure out how to “dumb down” the methods for ordinary chemists

• Wrap a pretty web interface around the command line tools that run on a Unix server

• Tell the chemists to use the tools using their browsers

Result…

• A few chemists use the tools a few times

• Clever tools don’t necessarily directly meet needs, and simple needs can be complex to answer


A better idea…

• Use interviews and follow up techniques (e.g. Contextual Design and Interaction Design) to understand the workflows of chemists

• Design tools using paper prototyping involving personas (or actual chemists)

• Develop tools

• Go through several iterations of usability testing with real scientists and real-life problems


Contextual Inquiry

• Session in which an observer watches scientists do their work in their natural environment

• Observer can ask questions, clarify understanding, etc.

• Helps to record session on a tape recorder

• Want to see them do “real” work, but helps if it is related to the software.

• From tape, build sequence, flow, artifact, culture and physical models

• Helps in understanding the scientists’ work, and in building personas and identifying breakdowns in processes


Example sequence model

Intent: Try to improve the activity of the current XXX-1 Kinase structure

Intent: See what other similar structures are in ISIS

Log onto machine and go into ISIS
Unsure which database to choose
Chooses Master1 database
Draws in structure
Unable to specify aromaticity correctly
Does similarity search
Finds 8 compounds which look similar

Intent: Try docking these molecules using web-based docking program

Opens web browser
Goes to docking program by choosing bookmark
Browser says “page cannot be found”


Interaction Design (Cooper)

• Personas stereotype the people who will be using the software

• Select primary personas, and create a customized interface for each one

• Define the goals of the primary personas, then develop scenarios that reveal the way these goals are reached

Wallace is an engineer. He is aged 45, and lives by himself with his dog Gromit. He enjoys using new gadgets, and likes using his inventiveness to make new things, with differing degrees of success. He is confident using Microsoft Word and Excel, although he sometimes becomes frustrated with these packages. He wants the computer to help him design a rocket to fly him and Gromit to the moon, as he likes cheese and believes the moon to be made of very fine cheese.


Usability testing

• Monitor real people using the software you have written

• Find breakdowns, recurring problems

• Measure a usability score for the software

• Make changes and see if the score improves


Usability of 2D structure drawing tools

• Key difference between “sequential” and “random” drawers

• Huge difference in intuitiveness

• Key factor: how badly you can mess things up

• Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw


This approach is better but…

• Still centered on the “tool” instead of the information

• Doesn’t solve problem of tools only solving one problem

• Not flexible enough for an environment where scientists have constantly changing needs for information which is complex to retrieve (or compute):

– Do we have this compound in-house? Where can I get it?

– I want to know if anyone else does something with structures like these

– I want to improve the activity of these compounds, what directions should I take?

– I wonder if there are any protein targets this compound might bind to other than the project I’m working on?

– I’m worried there might be a degradation problem with these compounds


Simple questions can be complex to answer…

• Oracle Database (HTS): Compounds were tested against related assays and showed activity, including selectivity within target families

• Oracle Database (Genomics): None of these compounds have been tested in a microarray assay

• Computation: The information in the structures and known activity data is good enough to create a QSAR model with a confidence of 75%

• External Database (Patent): Some structures with a similarity > 0.75 to these appear to be covered by a patent held by a competitor

• Computation: All the compounds pass the Lipinski Rule of Five and toxicity filters

• Excel Spreadsheet (Toxicity): One of the compounds was previously tested for toxicology and was found to have no liver toxicity

• Word Document (Chemistry): Several of the compounds had been followed up in a previous project, and solubility problems prevented further development

• Journal Article: A recent journal article reported the effectiveness of some compounds in a related series against a target in the same family

• Word Document (Marketing): A report by a team in Marketing casts doubt on whether the market for this target is big enough to make development cost-effective

SCIENTIST: “These compounds look promising from their HTS results. Should I commit some chemistry resources to following them up?”


An even better idea?

• Develop web services around as much chemoinformatics computation and data sources as possible

• Develop web service workflows for as many real-life workflows as we can, based on contextual design interviews (and other sources)

• Develop generic smart interface components which are able to match what people express they want to do with what workflows and services are available, and even create workflows to meet needs

This is a kind of scatter gun approach, with the onus on the workflows and interface to make it useful

Related to the “lab of the future”: based on information, not tools

Very closely linked to LEAD


3-layer model

Layer: Interaction Layer
Purpose: Interactive software for creative access and exploitation of information by humans
Technologies: Microsoft Smart Clients, portlets, Java applets, email and browser clients, visualization technologies

Layer: Aggregation Layer
Purpose: Workflows and data schemas customized for particular domains, applications and users
Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services

Layer: Web service layer
Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET


Kinds of interface-level interactions

• Passive user / active computation

– Mainly for information and computation request, and single-stream or summary results

– Natural Language interface through email

– RSS, web and .NET tools

– Graphical workflow generation

– May lead to active user interaction as described below

• Active user / passive computation or retrieval

– Facilitates direct interaction with workflows, services and information

– Permits analysis and interpretation of multi-stream, interactive results

– Multi-stream portlets

– Custom multi stream desktop tools (including .NET)

– Visual SAR

– Flagging and annotation

• Active user / active computation

– A “conversation” between the scientist and the computer


Example email natural language interface


Email response after triggering events occur


Natural Language Interface Stage 1: Matching requests to existing workflows

• All existing workflows are given descriptions which reflect the kinds of words people would likely use if requesting them

• When a request is made, workflows are ranked by text similarity between the request and the descriptions

• If similarity is less than a cutoff, it is determined that no match can be found (and thus another workflow should be written!)

• Requests may be parsed by a standard syntactic analyzer
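The matching step described in these bullets can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes a bag-of-words cosine similarity as the text similarity measure, and the workflow names, descriptions and the 0.2 cutoff are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical workflow descriptions; names and wording are illustrative only.
WORKFLOWS = {
    "similarity_search": "find compounds with similar structures to my compound",
    "docking": "dock these molecules into a protein target and score them",
    "toxicity_flags": "check compounds for toxicity problems and drug likeness",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_workflows(request, workflows=WORKFLOWS):
    """Return (score, name) pairs sorted from most to least similar."""
    request_tokens = tokenize(request)
    scored = [(cosine_similarity(request_tokens, tokenize(desc)), name)
              for name, desc in workflows.items()]
    return sorted(scored, reverse=True)

def match_workflow(request, workflows=WORKFLOWS, cutoff=0.2):
    """Return the best-matching workflow name, or None if below the cutoff."""
    score, name = rank_workflows(request, workflows)[0]
    return name if score >= cutoff else None
```

A `None` result corresponds to the "no match can be found" case, which signals that a new workflow may need to be written.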


Natural Language Interface Stage 2: On-the-fly workflow creation

• Requests can be formatted in a “do this THEN do this THEN do this” fashion

• These requests are parsed, and used to attempt to create a workflow on the fly

• Possible existing parsing software includes Python NLP modules and Infocom Z-code (or similar)…
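One way the “do this THEN do this” idea might look in practice is sketched below. It is only an illustration using keyword matching from the standard library, not one of the parsers named above; the keyword table and service names are hypothetical.

```python
import re

# Hypothetical mapping from keywords to available services; illustrative only.
SERVICE_KEYWORDS = {
    "similarity": "similarity_search",
    "cluster": "clustering",
    "dock": "docking",
    "plot": "plotting",
}

def parse_request(request):
    """Split a request on the word THEN (any case) into ordered step phrases."""
    parts = re.split(r"\bthen\b", request, flags=re.IGNORECASE)
    return [p.strip(" .,") for p in parts if p.strip(" .,")]

def build_workflow(request):
    """Map each step phrase to the first service whose keyword it mentions.

    Raises ValueError when a step matches no known service, so the caller
    can report that the workflow could not be created on the fly.
    """
    workflow = []
    for step in parse_request(request):
        lowered = step.lower()
        for keyword, service in SERVICE_KEYWORDS.items():
            if keyword in lowered:
                workflow.append(service)
                break
        else:
            raise ValueError(f"no service matches step: {step!r}")
    return workflow
```

For example, `build_workflow("do a similarity search THEN cluster the hits THEN dock them")` yields the three services in the order requested.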


Desktop tool for multi-stream analysis


Multi-stream portlet interface


PubChemSR .NET desktop search tool (Junguk Hur)

http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR


Structural differences related to activity are projected onto actual 2D structures. See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

VisualiSAR - modal fingerprints


[Figure panels: Original (curated), Breadth-first Search, Degree, Sloan’s Algorithm]

Data: NCI Compound Database - Compounds with positive AIDS screens

Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.

Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.

Student: Christopher Mueller

In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
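To make the effect of vertex ordering concrete, here is a small sketch that builds the dot matrix for a toy graph under two of the orderings named on this slide (degree and breadth-first). It is plain, unoptimized Python rather than the parallel libraries mentioned above; Sloan's algorithm is not implemented, and the function names are assumptions for illustration.

```python
from collections import deque

def adjacency_matrix(vertices, edges, order=None):
    """Return a symmetric 0/1 matrix with rows and columns in the given order."""
    order = list(order) if order is not None else list(vertices)
    index = {v: i for i, v in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix

def degree_order(vertices, edges):
    """Order vertices by decreasing degree (ties keep their input order)."""
    degree = {v: 0 for v in vertices}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(vertices, key=lambda v: -degree[v])

def bfs_order(vertices, edges):
    """Breadth-first ordering, restarting at the next unvisited vertex."""
    neighbours = {v: [] for v in vertices}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    seen, order = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbours[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order

def render(matrix):
    """Draw the matrix as text: '*' for an edge, '.' for no edge."""
    return "\n".join("".join("*" if x else "." for x in row) for row in matrix)
```

Calling `print(render(adjacency_matrix(vertices, edges, bfs_order(vertices, edges))))` on a path-like graph tends to pull the dots toward the diagonal, which is the kind of structure a reordering can reveal.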

Visual Similarity Matrices


VoPlot


Chemistry Decision Support after PubChem data submission

MLSCN submits HTS data to PubChem

Data is stored in PubChem

PubChem interfaces to workflows via SOAP

Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc…

End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis


Simple HTS follow-up workflow

Presented at 222nd ACS, Chicago, 2001. See http://www.lib.uchicago.edu/cinf/222nm/presentations/222nm050.pdf


Example HTS workflow: organization & flagging

A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using the DTP Tumor Cell Line database)

OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs

The compounds are clustered on chemical structure similarity, to group similar compounds together

The compounds along with property and cluster information are converted to VOTable format and displayed in VOPlot


Example of workflow output - LogP vs GI50

Plotting XLogP against GI50 can help identify highly active compounds with good logP profiles (1 - 4 range)
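The same selection can be expressed as a simple filter over the workflow output. The sketch below assumes activity is stored as pGI50 (that is, -log10 of GI50 in molar, so higher means more active); the record layout, compound identifiers, values and the pGI50 threshold of 6.0 are all hypothetical.

```python
# Hypothetical records; identifiers and values are made up for illustration.
COMPOUNDS = [
    {"id": "cmpd-1", "xlogp": 2.5, "pgi50": 7.1},
    {"id": "cmpd-2", "xlogp": 5.8, "pgi50": 7.4},
    {"id": "cmpd-3", "xlogp": 3.1, "pgi50": 4.9},
    {"id": "cmpd-4", "xlogp": 1.0, "pgi50": 6.0},
]

def flag_candidates(compounds, logp_range=(1.0, 4.0), min_pgi50=6.0):
    """Return ids of compounds that are active and inside the logP window.

    Both bounds of logp_range are inclusive; min_pgi50 is an assumed cutoff.
    """
    low, high = logp_range
    return [c["id"] for c in compounds
            if low <= c["xlogp"] <= high and c["pgi50"] >= min_pgi50]
```

With the sample records, `flag_candidates(COMPOUNDS)` keeps the two compounds that are both active and within the 1 to 4 logP range.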


Example of workflow output - Cluster # vs GI50

Plotting Cluster against GI50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information
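A tabular equivalent of reading this plot is to group compounds by cluster and keep the clusters containing several actives. As a sketch only: the records, the pGI50 convention (-log10 of GI50 in molar, higher is more active) and both thresholds are assumptions, not values from the workflow.

```python
from collections import defaultdict

# Hypothetical records; identifiers, clusters and values are made up.
COMPOUNDS = [
    {"id": "cmpd-1", "cluster": 1, "pgi50": 7.1},
    {"id": "cmpd-2", "cluster": 1, "pgi50": 6.4},
    {"id": "cmpd-3", "cluster": 2, "pgi50": 4.9},
    {"id": "cmpd-4", "cluster": 2, "pgi50": 6.8},
    {"id": "cmpd-5", "cluster": 3, "pgi50": 5.0},
]

def active_clusters(compounds, min_pgi50=6.0, min_actives=2):
    """Return {cluster: [ids of active members]} for clusters with enough actives."""
    actives = defaultdict(list)
    for c in compounds:
        if c["pgi50"] >= min_pgi50:
            actives[c["cluster"]].append(c["id"])
    return {k: ids for k, ids in actives.items() if len(ids) >= min_actives}
```

Clusters that mix active and inactive members (cluster 2 in the sample data) are not returned here, although those are often the ones of most interest for QSAR; lowering `min_actives` to 1 would include them.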


Example HTS workflow: finding cell-protein relationships

A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)

The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.

Similar structures to the ligand can be browsed using client portlets.

Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.

Docking results and activity patterns are fed into R services for building of activity models and correlations (LeastSquaresRegression, RandomForests, NeuralNets)


Example workflow output - docked complexes

NSC_ID 685478, Docking score -29.74

NSC_ID 685477, Docking score -35.51



Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED


DTP Human Tumor Cell Line Data Mining

• Collaboration with Melanie Wu at the School of Informatics

• 257,547 compounds, 44,653 with 60-cell line screening data (GI50)

• Local PostgreSQL database with gNova CHORD cartridge for substructure and similarity searching, exposed as a web service

• Aim to build on existing published data mining research on this dataset, doing forms of data mining that are made easier by using web services

• Learned so far

– Most previous research has used small compound subsets (~4000 compounds), and generally falls into organization (SOM, clustering, etc) or correlation of structure, activity and/or expression

– There is little that has approached the dataset as a whole (as it is in 2006)

– Correlations of structure, activity and expression are limited in scope (cf. e.g. association rule mining)

– The 44,653 compounds with screening data are extremely drug-like

• Evaluating set as a standard to use as a surrogate for multi-screen HTS (at least secondary screening data)

• Aim to apply latest Data Mining methods to whole set

• First step in an Active User / Active Computation Oncology Information Portal?


Sample property profiles (hydrogen bond acceptors)

[Figure: four histograms of hydrogen bond acceptor count (x-axis 0 to 20) against number of compounds. Panels: MRTD compound, NCI compound, PubChem one percent set #1, PubChem one percent set #2]


Mean inter-molecular similarity

Set               Mean Similarity
TCL               0.3047
MRTD              0.3199
PubChem Subsets   0.3605

Most-similar HTCL compounds to MRTD: 348/1220 > 0.8
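For readers unfamiliar with these two statistics, the sketch below shows how a mean inter-molecular similarity and a nearest-neighbour count above a threshold could be computed. The slide does not state the similarity measure or fingerprint used; Tanimoto similarity over fingerprints represented as sets of on-bit indices is assumed here, and the function names are illustrative.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all distinct pairs in one set."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

def count_with_close_neighbour(queries, references, threshold=0.8):
    """Count queries whose most similar reference exceeds the threshold."""
    return sum(
        1 for q in queries
        if references and max(tanimoto(q, r) for r in references) > threshold
    )
```

The all-pairs loop is quadratic in the number of compounds, so for sets of the size discussed here it would normally be run on a sample or with optimized fingerprint code.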


Current activities

• More workflows, more services

• Identification of key “customers”, for contextual inquiry sessions

• Advancement of portlet interfaces

• Development of first natural language email interfaces

• Contextualizing information (predictions, SAR, flags, annotations, text mining results, etc) for inclusion in these interfaces

• Further characterization of DTP HTCL dataset, particularly how similar the screens are to HTS screens

• Other things:

– Lab of the future

– Distributed Drug Discovery Database

– OSCAR-3 derivatives

– Clustering of PubChem


[Figure: runtime against number of processors (x-axis 0 to 90; y-axis labeled “Runtime (seconds)”, 250 to 700) for three series: Minsize 1, Minsize 100, Minsize 1000]

MPI Parallel Divkmeans clustering of PubChem

AVIDD Linux cluster, 5,273,852 structures (PubChem compound, Nov 2005)

min_size  ncpus  wall_mins  walltime
1         20     676        11:16:06
1         40     444        7:24:24
1         60     379        6:18:41
1         80     353        5:53:00
100       20     462        7:41:58
100       40     356        5:56:01
100       40     356        5:55:47
100       60     339        5:38:44
100       80     337        5:36:53
1000      20     513        8:32:39
1000      40     376        6:16:25
1000      60     346        5:46:22
1000      80     346        5:45:40
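The scaling in these timings can be summarized as speedup and parallel efficiency relative to the 20-processor run. The sketch below uses the wall_mins values from the table above; taking 20 processors as the baseline is a choice made for this illustration, since no single-processor timing is given.

```python
# wall_mins copied from the timing table: {min_size: {ncpus: wall_mins}}
WALL_MINS = {
    1: {20: 676, 40: 444, 60: 379, 80: 353},
    100: {20: 462, 40: 356, 60: 339, 80: 337},
    1000: {20: 513, 40: 376, 60: 346, 80: 346},
}

def relative_speedup(runs, baseline_cpus=20):
    """Speedup of each run relative to the baseline processor count."""
    base = runs[baseline_cpus]
    return {n: base / t for n, t in sorted(runs.items())}

def relative_efficiency(runs, baseline_cpus=20):
    """Speedup divided by the factor by which the processor count grew."""
    speedups = relative_speedup(runs, baseline_cpus)
    return {n: s / (n / baseline_cpus) for n, s in speedups.items()}

if __name__ == "__main__":
    for min_size, runs in WALL_MINS.items():
        for n, s in relative_speedup(runs).items():
            eff = relative_efficiency(runs)[n]
            print(f"min_size={min_size:5d} ncpus={n:3d} "
                  f"speedup={s:.2f} efficiency={eff:.2f}")
```

For min_size 1, going from 20 to 80 processors gives a speedup of 676/353, a little under 2 for four times the processors, which suggests that returns diminish noticeably beyond about 40 processors on this problem.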
