David Wild – Joint IU, Michigan, Lilly Meeting, October 2006. Page 1 Indiana University School of
Smart Mining Interfaces, Workflows, and Data Mining the NCI DTP Dataset
David Wild
Joint IU / Michigan / Lilly Meeting
Indianapolis, August 2006
Acknowledgements
• Xiao Dong - HTCL database & mining, web services, workflows, smart mining interfaces
• Rajarshi Guha - Smart Mining Interfaces, Workflows, Web Services
• Geoffrey/Marlon’s lab: Smitha Ajay, Sima Patel, Jake Kim - web services, workflows, portlet interfaces
• Others: Junguk Hur, Chris Mueller, Huijun Wang …
• Funding from Microsoft eScience and CICC
Outline
• “Smart mining of drug discovery information” - our approach to connecting scientists with the information they need
• Application of smart mining to post-HTS chemistry analysis
• Current interface-level projects
• Examining the DTP HTCL dataset as a standard for multi-screen data mining and as a surrogate for HTS
Classic approach to designing tools for chemists
• Select the computational chemistry methods that seem to be the most useful for drug discovery (or whatever), such as docking and similarity searching
• Have computational chemists / modelers figure out how to “dumb down” the methods for ordinary chemists
• Wrap a pretty web interface around the command line tools that run on a Unix server
• Tell the chemists to use the tools using their browsers
Result…
• A few chemists use the tools a few times
• Clever tools don’t necessarily directly meet needs, and simple needs can be complex to answer
A better idea…
• Use interviews and follow up techniques (e.g. Contextual Design and Interaction Design) to understand the workflows of chemists
• Design tools using paper prototyping involving personas (or actual chemists)
• Develop tools
• Go through several iterations of usability testing with real scientists and real-life problems
Contextual Inquiry
• Session in which an observer watches scientists do their work in their natural environment
• Observer can ask questions, clarify understanding, etc.
• Helps to record session on a tape recorder
• Want to see them do “real” work, but helps if it is related to the software.
• From tape, build sequence, flow, artifact, culture and physical models
• Helps in understanding the scientists’ work, and in building personas and identifying breakdowns in processes
Example sequence model
Intent: Try to improve the activity of the current XXX-1 Kinase structure
Intent: See what other similar structures are in ISIS
– Log onto machine and go into ISIS
– Unsure which database to choose
– Chooses Master1 database
– Draws in structure
– Unable to specify aromaticity correctly
– Does similarity search
– Finds 8 compounds which look similar
Intent: Try docking these molecules using web-based docking program
– Opens web browser
– Goes to docking program by choosing bookmark
– Browser says “page cannot be found”
Interaction Design (Cooper)
• Personas stereotype the people who will be using the software
• Select primary personas, and create a customized interface for each one
• Define the goals of the primary personas, then develop scenarios that reveal the way these goals are reached
Wallace is an engineer. He is aged 45, and lives by himself with his dog Gromit. He enjoys using new gadgets, and likes using his inventiveness to make new things, with differing degrees of success. He is confident using Microsoft Word and Excel, although he sometimes becomes frustrated with these packages. He wants the computer to help him design a rocket to fly him and Gromit to the moon, as he likes cheese and believes the moon to be made of very fine cheese.
Usability testing
• Monitor real people using the software you have written
• Find breakdowns, recurring problems
• Measure a usability score for the software
• Make changes and see if the score improves
Usability of 2D structure drawing tools
• Key difference between “sequential” and “random” drawers
• Huge difference in intuitiveness
• Key factor: how badly you can mess things up
• Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw
This approach is better but…
• Still centered on the “tool” instead of the information
• Doesn’t solve problem of tools only solving one problem
• Not flexible enough for an environment where scientists have constantly changing needs for information which is complex to retrieve (or compute):
– Do we have this compound in-house? Where can I get it?
– I want to know if anyone else does something with structures like these
– I want to improve the activity of these compounds, what directions should I take?
– I wonder if there are any protein targets this compound might bind to other than the project I’m working on?
– I’m worried there might be a degradation problem with these compounds
Simple questions can be complex to answer…
SCIENTIST
“These compounds look promising from their HTS results. Should I commit some chemistry resources to following them up?”
Oracle Database (HTS)
Compounds were tested against related assays and showed activity, including selectivity within target families
Oracle Database (Genomics)
None of these compounds have been tested in a microarray assay
Computation
The information in the structures and known activity data is good enough to create a QSAR model with a confidence of 75%
External Database (Patent)
Some structures with a similarity > 0.75 to these appear to be covered by a patent held by a competitor
Computation
All the compounds pass the Lipinski Rule of Five and toxicity filters
Excel Spreadsheet (Toxicity)
One of the compounds was previously tested for toxicology and was found to have no liver toxicity
Word Document (Chemistry)
Several of the compounds had been followed up in a previous project, and solubility problems prevented further development
Journal Article
A recent journal article reported the effectiveness of some compounds in a related series against a target in the same family
Word Document (Marketing)
A report by a team in Marketing casts doubt on whether the market for this target is big enough to make development cost-effective
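As an illustration of one of the computation steps on this slide, the Rule of Five check can be sketched in a few lines. This is a minimal sketch, not the filter used in the project: it assumes the four properties have already been computed elsewhere, the function and parameter names are hypothetical, and the choice to tolerate one violation by default is a common convention rather than something stated in the talk.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four Rule of Five limits a compound exceeds."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # calculated logP above 5
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(props, max_violations=1):
    """Return True if the compound has at most max_violations violations.

    props is a dict with keys mol_weight, logp, h_donors, h_acceptors.
    The default of one tolerated violation is an assumption; pass 0 for
    a strict filter.
    """
    return lipinski_violations(**props) <= max_violations


# Hypothetical property values, for illustration only.
example = {"mol_weight": 350.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5}
print(passes_rule_of_five(example))
```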
An even better idea?
• Develop web services around as much chemoinformatics computation and data sources as possible
• Develop web service workflows for as many real-life workflows as we can, based on contextual design interviews (and other sources)
• Develop generic smart interface components which are able to match what people express they want to do with what workflows and services are available, and even create workflows to meet needs
This is a kind of scatter gun approach, with the onus on the workflows and interface to make it useful
Related to the “lab of the future” - based on information, not tools
Very closely linked to LEAD
3-layer model
Interaction Layer
– Purpose: Interactive software for creative access and exploitation of information by humans
– Technologies: Microsoft Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
Aggregation Layer
– Purpose: Workflows and data schemas customized for particular domains, applications and users
– Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
Web service layer
– Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
– Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
Kinds of interface-level interactions
• Passive user / active computation
– Mainly for information and computation request, and single-stream or summary results
– Natural Language interface through email
– RSS, web and .NET tools
– Graphical workflow generation
– May lead to active user interaction as described below
• Active user / passive computation or retrieval
– Facilitates direct interaction with workflows, services and information
– Permits analysis and interpretation of multi-stream, interactive results
– Multi-stream portlets
– Custom multi-stream desktop tools (including .NET)
– Visual SAR
– Flagging and annotation
• Active user / active computation
– A “conversation” between the scientist and the computer
Example email natural language interface
Email response after triggering events occur
Natural Language Interface Stage 1: Matching requests to existing workflows
• All existing workflows are given descriptions which reflect the kinds of words people would likely use if requesting them
• When a request is made, workflows are ranked by text similarity between the request and the descriptions
• If similarity is less than a cutoff, it is determined that no match can be found (and thus another workflow should be written!)
• Requests may be parsed by a standard syntactic analyzer
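The matching step described in these bullets can be sketched as follows. This is a minimal illustration under stated assumptions: the talk does not say which text similarity measure is used, so cosine similarity over word counts stands in for it; the workflow names, descriptions and the cutoff value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_workflow(request, workflows, cutoff=0.2):
    """Rank workflows by similarity between the request and each description.

    Returns (ranked, best). ranked is a list of (score, name) pairs sorted
    from most to least similar; best is the top-ranked name, or None when
    the top score falls below the cutoff (no suitable workflow exists yet).
    """
    req = Counter(tokenize(request))
    ranked = sorted(
        ((cosine_similarity(req, Counter(tokenize(desc))), name)
         for name, desc in workflows.items()),
        reverse=True,
    )
    best = ranked[0][1] if ranked and ranked[0][0] >= cutoff else None
    return ranked, best


# Hypothetical workflow names and descriptions, for illustration only.
WORKFLOWS = {
    "similarity_search": "find compounds with structures similar to this one",
    "cluster_screen": "cluster the active compounds from a screen by structure",
    "dock_ligands": "dock these compounds into a protein target",
}

ranked, best = match_workflow(
    "please find similar compounds to this structure", WORKFLOWS
)
print(best)
```

A request that matches nothing returns `None` as the best match, which is the signal that another workflow should be written.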
Natural Language Interface Stage 2: On-the-fly workflow creation
• Requests can be formatted in a “do this THEN do this THEN do this” fashion
• These requests are parsed, and used to attempt to create a workflow on the fly
• Possible existing parsing software includes Python NLP modules and Infocom Z-code (or similar)…
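A rough sketch of how a “do this THEN do this” request might be turned into an ordered list of service calls is given below. It assumes a simple keyword lookup in place of a real syntactic parser; the service names and the keyword registry are hypothetical and do not correspond to any actual service in the project.

```python
import re

# Hypothetical registry mapping a keyword in a request step to a service name.
SERVICES = {
    "search": "similarity_search_service",
    "cluster": "clustering_service",
    "dock": "docking_service",
    "plot": "plotting_service",
}


def parse_request(request):
    """Split a request into steps on the word THEN (case-insensitive)."""
    steps = re.split(r"\bthen\b", request, flags=re.IGNORECASE)
    return [s.strip(" ,.") for s in steps if s.strip(" ,.")]


def build_workflow(request):
    """Map each step to the first service whose keyword appears in it.

    Returns a list of (service_name, step_text) pairs in request order.
    Raises ValueError for a step that matches no known service, so the
    caller can report which part of the request could not be handled.
    """
    workflow = []
    for step in parse_request(request):
        words = re.findall(r"[a-z]+", step.lower())
        service = next((SERVICES[w] for w in words if w in SERVICES), None)
        if service is None:
            raise ValueError(f"no service matches step: {step!r}")
        workflow.append((service, step))
    return workflow


for service, step in build_workflow(
    "Search for similar structures THEN cluster the hits THEN dock the cluster centres"
):
    print(service, "<-", step)
```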
Desktop tool for multi-stream analysis
Multi-stream portlet interface
PubChemSR .NET desktop search tool (Junguk Hur)
http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR
VisualiSAR - modal fingerprints
Structural differences related to activity are projected onto actual 2D structures.
See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
Visual Similarity Matrices
Student: Christopher Mueller
Data: NCI Compound Database - Compounds with positive AIDS screens
[Figure: the same similarity matrix under four vertex orderings: Original (curated), Breadth-first Search, Degree, Sloan’s Algorithm]
Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.
Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.
In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
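The effect of vertex ordering on a similarity matrix can be shown with a small sketch. It covers two of the orderings named on this slide, degree and breadth-first search, on a toy graph; Sloan’s algorithm is not implemented here. The graph, the function names and the text rendering are all illustrative assumptions, not the high-performance library described above.

```python
from collections import deque


def adjacency(n, edges):
    """Build a symmetric boolean adjacency matrix from undirected edges."""
    m = [[False] * n for _ in range(n)]
    for i, j in edges:
        m[i][j] = m[j][i] = True
    return m


def degree_order(matrix):
    """Vertex indices sorted by decreasing degree (ties broken by index)."""
    degrees = [sum(row) for row in matrix]
    return sorted(range(len(matrix)), key=lambda v: (-degrees[v], v))


def bfs_order(matrix, start=0):
    """Breadth-first vertex ordering beginning at start.

    Vertices in other connected components are visited afterwards,
    taking the lowest unvisited index as the next root.
    """
    n = len(matrix)
    seen = [False] * n
    order = []
    for root in [start] + [v for v in range(n) if v != start]:
        if seen[root]:
            continue
        seen[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in range(n):
                if matrix[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return order


def render(matrix, order):
    """Text rows of the matrix under the given ordering; '*' marks an edge."""
    return ["".join("*" if matrix[u][v] else "." for v in order) for u in order]


# Toy graph: an edge means two compounds are considered similar.
m = adjacency(5, [(0, 3), (3, 4), (1, 2), (2, 3)])
for name, order in [("original", list(range(5))),
                    ("degree", degree_order(m)),
                    ("bfs", bfs_order(m))]:
    print(name, order)
    print("\n".join(render(m, order)))
```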
VOPlot
Chemistry Decision Support after PubChem data submission
MLSCN submits HTS data to PubChem
Data is stored in PubChem
PubChem interfaces to workflows via SOAP
Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc…
End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis
Simple HTS follow-up workflow
Presented at 222nd ACS, Chicago, 2001
See http://www.lib.uchicago.edu/cinf/222nm/presentations/222nm050.pdf
Example HTS workflow: organization & flagging
A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using the DTP Tumor Cell Line database).
The compounds are clustered on chemical structure similarity, to group similar compounds together.
OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs.
The compounds, along with property and cluster information, are converted to VOTable format and displayed in VOPlot.
Example of workflow output - LogP vs GI50
Plotting XLogP against GI50 can help identify highly active compounds with good logP profiles (1 - 4 range)
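The selection that this plot supports visually can also be written as a simple filter. The sketch below assumes activity is stored as pGI50 (negative log10 of the molar GI50, so larger means more active) and uses a hypothetical activity threshold; only the logP range of 1 to 4 comes from the slide. The record layout and compound identifiers are made up for illustration.

```python
def select_candidates(compounds, logp_range=(1.0, 4.0), min_pgi50=6.0):
    """Return compounds inside the logP window with activity at or above the threshold.

    compounds: list of dicts with keys "id", "xlogp" and "pgi50".
    The result is sorted from most to least active.
    """
    lo, hi = logp_range
    hits = [
        c for c in compounds
        if lo <= c["xlogp"] <= hi and c["pgi50"] >= min_pgi50
    ]
    return sorted(hits, key=lambda c: c["pgi50"], reverse=True)


# Hypothetical records, for illustration only (not DTP data).
compounds = [
    {"id": "cmpd-A", "xlogp": 2.5, "pgi50": 7.1},
    {"id": "cmpd-B", "xlogp": 5.2, "pgi50": 7.8},  # active, but logP too high
    {"id": "cmpd-C", "xlogp": 3.0, "pgi50": 5.0},  # good logP, weakly active
    {"id": "cmpd-D", "xlogp": 1.0, "pgi50": 6.0},
]
print([c["id"] for c in select_candidates(compounds)])
```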
Example of workflow output - Cluster # vs GI50
Plotting Cluster against GI50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information
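A per-cluster summary gives a numerical counterpart to this plot. The sketch below assumes activity is expressed as pGI50 (larger means more active) and that a wide activity range inside one cluster is what makes it a candidate for QSAR modelling; both are assumptions for illustration, as are the threshold and the sample records.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(records, active_cutoff=6.0):
    """Summarize activity per cluster.

    records: iterable of (cluster_id, pgi50) pairs.
    Returns a dict mapping cluster_id to its size, the number of compounds
    at or above the cutoff, the mean pGI50 and the pGI50 range.
    """
    by_cluster = defaultdict(list)
    for cluster_id, pgi50 in records:
        by_cluster[cluster_id].append(pgi50)
    summary = {}
    for cluster_id, values in by_cluster.items():
        summary[cluster_id] = {
            "size": len(values),
            "n_active": sum(v >= active_cutoff for v in values),
            "mean_pgi50": mean(values),
            "range_pgi50": max(values) - min(values),
        }
    return summary


# Hypothetical (cluster, pGI50) pairs, for illustration only.
records = [(1, 7.0), (1, 6.5), (1, 6.8), (2, 4.0), (2, 7.5), (2, 5.5), (3, 4.2)]
for cluster_id, stats in sorted(summarize_clusters(records).items()):
    print(cluster_id, stats)
```

In this toy data, cluster 1 is a group of uniformly active compounds, while cluster 2 spans a wide activity range within one structural family.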
Example HTS workflow: finding cell-protein relationships
A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).
The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.
Similar structures to the ligand can be browsed using client portlets.
Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
Docking results and activity patterns are fed into R services for building activity models and correlations: LeastSquaresRegression, RandomForests, NeuralNets.
Example workflow output - docked complexes
NSC_ID 685478: docking score -29.74
NSC_ID 685477: docking score -35.51
NSC_ID 719175: docking score -30.78
NSC_ID 725806: docking score -32.15
Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED
DTP Human Tumor Cell Line Data Mining
• Collaboration with Melanie Wu at the School of Informatics
• 257,547 compounds, 44,653 with 60-cell line screening data (GI50)
• Local PostgreSQL database with gNova CHORD cartridge for substructure and similarity searching, exposed as a web service
• Aim to build on existing published data mining research on this dataset, doing forms of data mining that are made easier by using web services
• Learned so far
– Most previous research has used small compound subsets (~4000 compounds), and generally falls into organization (SOM, clustering, etc.) or correlation of structure, activity and/or expression
– There is little that has approached the dataset as a whole (as it is in 2006)
– Correlations of structure, activity and expression are limited in scope (cf. e.g. association rule mining)
– The 44,653 compounds with screening data are extremely drug-like
• Evaluating set as a standard to use as a surrogate for multi-screen HTS (at least secondary screening data)
• Aim to apply latest Data Mining methods to whole set
• First step in an Active User / Active Computation Oncology Information Portal?
Sample property profiles (hydrogen bond acceptors)
[Figure: four histograms of hydrogen bond acceptor count (x-axis 0 to 20) against number of compounds, one per set: MRTD compounds, NCI compounds, PubChem one percent set #1, PubChem one percent set #2]
Mean inter-molecular similarity
Set               Mean Similarity
TCL               0.3047
MRTD              0.3199
PubChem Subsets   0.3605
Most-similar HTCL compounds to MRTD: 348/1220 > 0.8
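The two statistics on this slide can be computed as in the sketch below. The talk does not state which similarity coefficient or fingerprints were used, so this assumes the Tanimoto coefficient on binary fingerprints represented as sets of on-bit indices. It also assumes that the 348/1220 figure counts query compounds whose most similar compound in the other set scores above 0.8. The fingerprints shown are toy data.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_pairwise_similarity(fingerprints):
    """Mean Tanimoto similarity over all distinct pairs in one set."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def count_with_close_neighbor(queries, references, threshold=0.8):
    """Count queries whose most similar reference scores above the threshold."""
    return sum(
        1 for q in queries
        if max((tanimoto(q, r) for r in references), default=0.0) > threshold
    )


# Toy fingerprints, for illustration only.
fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]
print(round(mean_pairwise_similarity(fps), 4))
print(count_with_close_neighbor([{1, 2, 3}, {7, 8}], fps))
```

Note that the all-pairs mean is quadratic in the number of compounds, so for sets the size of the PubChem subsets a sampled estimate may be more practical.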
Current activities
• More workflows, more services
• Identification of key “customers”, for contextual inquiry sessions
• Advancement of portlet interfaces
• Development of first natural language email interfaces
• Contextualizing information (predictions, SAR, flags, annotations, text mining results, etc.) for inclusion in these interfaces
• Further characterization of DTP HTCL dataset, particularly how similar the screens are to HTS screens
• Other things:
– Lab of the future
– Distributed Drug Discovery Database
– OSCAR-3 derivatives
– Clustering of PubChem
MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem compound, Nov 2005)
[Figure: runtime against number of processors (x-axis 0 to 90) for Minsize 1, Minsize 100 and Minsize 1000; y-axis labeled “Runtime (seconds)”, ticks 250 to 700]
min_size  ncpus  wall_mins  walltime
1         20     676        11:16:06
1         40     444        7:24:24
1         60     379        6:18:41
1         80     353        5:53:00
100       20     462        7:41:58
100       40     356        5:56:01
100       40     356        5:55:47
100       60     339        5:38:44
100       80     337        5:36:53
1000      20     513        8:32:39
1000      40     376        6:16:25
1000      60     346        5:46:22
1000      80     346        5:45:40
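The scaling behaviour in this table can be quantified with a short calculation. The sketch below uses the wall_mins values from the table (the repeated min_size 100, 40 CPU run is listed once) and reports speedup and parallel efficiency relative to the 20-processor run for each min_size. Using 20 processors as the baseline is a choice made here because no single-processor timing is given.

```python
# (min_size, ncpus, wall_mins) rows copied from the timing table.
TIMINGS = [
    (1, 20, 676), (1, 40, 444), (1, 60, 379), (1, 80, 353),
    (100, 20, 462), (100, 40, 356), (100, 60, 339), (100, 80, 337),
    (1000, 20, 513), (1000, 40, 376), (1000, 60, 346), (1000, 80, 346),
]


def relative_scaling(timings, base_cpus=20):
    """Speedup and efficiency of each run relative to the base_cpus run.

    speedup    = time at base_cpus / time at ncpus
    efficiency = speedup / (ncpus / base_cpus)
    Returns rows of (min_size, ncpus, speedup, efficiency), rounded to 2 places.
    """
    base = {m: t for m, n, t in timings if n == base_cpus}
    rows = []
    for min_size, ncpus, wall_mins in timings:
        speedup = base[min_size] / wall_mins
        efficiency = speedup / (ncpus / base_cpus)
        rows.append((min_size, ncpus, round(speedup, 2), round(efficiency, 2)))
    return rows


for row in relative_scaling(TIMINGS):
    print(row)
```

For min_size 1, going from 20 to 80 processors gives a speedup of about 1.92 for four times the hardware, so most of the benefit is already reached by 40 to 60 processors.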