Enabling Discoveries at High Throughput: Small Molecule and RNAi HTS at the NCTT
Rajarshi Guha, NIH Center for Translational Therapeutics
May 3, 2011
Outline
• Informatics for small molecule & RNAi screening
• HCA & automated decision making
– Pretty pictures can lead to more efficient screens
• Large scale cheminformatics
– We can do it, but do we need to?
NIH Chemical Genomics Center
• Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
– NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
– Post-doc program
• Mission
– MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
– Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
– New paradigms & applications of HTS for chemical biology / chemical genomics
• All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
Project Diversity
[Figure: (A) Disease areas, (B) Target types, (C) Detection methods]
Assay formats & detection methods in HTS
Assay formats
• ligand binding – competition binding
• enzymatic activity – biochemical – cellular
• ion or ligand transport – ion-sensitive dyes – membrane potential dyes
• protein-protein interactions – biochemical – cellular
• cellular signal transduction – reporter gene – second messenger
• phenotypic – protein redistribution – cell viability – etc.
Detection modes
• luminescence – chemiluminescence – bioluminescence – BRET – ALPHA
• fluorescence – FI – FRET – TRF – TR-FRET – FP – FCS – FLT
• absorbance
• radioactivity – SPA
Detector Systems: “Reading the assay”
• ViewLux – Multimodal CCD-based imager
• Abs., Luminescence, Fluorescence
• Envision – PMT-based reader
• ALPHA
• Acumen Explorer – Laser Scanning Imager
• “static” cell cytometry
• Hamamatsu FDS 7000 Series – rapid kinetics
• INCell1000 – Subcellular imaging
qHTS: High Throughput Dose Response
[Figure: panels A, B, C]
1536-well plates, inter-plate dilution series. Assay volumes 2 – 5 μL
Assay concentration ranges over 4 logs (high: ~100 μM)
Informatics pipeline. Automated curve fitting and classification. 300K samples
Automated concentration-response data collection, ~1 CRC/sec
Informatics Activities
• High throughput curve fitting
• Data integration, automated cherry picking
• SAR algorithms
– QSAR modeling
– Fragment based analysis
– Activity cliffs
• Tools – standardizer, tautomers, fragment activity browser, kinome browser and more
• RNAi hit selection, OTE analysis
• High content analysis
Kinome Navigator
• Browse kinase panel data
• Currently focused on the Abbott dataset
• View
– Fragments
– Target pairs
– Kinome overlay
http://tripod.nih.gov
Fragment Browser
• View activities on a fragment-wise basis
• Compare activity distributions by fragment
• Currently based around ChEMBL assays, but users can browse their own compounds & activities
http://tripod.nih.gov
Structure Activity Landscapes
• Rugged gorges or rolling hills?
– Small structural changes associated with large activity changes represent steep slopes in the landscape
– But traditionally, QSAR assumes gentle slopes
– We can characterize the landscape using SALI
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
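A minimal sketch of how a landscape can be characterized with SALI, assuming the usual pairwise definition SALI(i,j) = |A_i - A_j| / (1 - sim(i,j)) from the Guha & Van Drie paper cited on the next slide. The class name, the set-of-bits fingerprint representation and the Tanimoto similarity are illustrative assumptions, not the NCGC implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SaliSketch {
    // Tanimoto similarity between two fingerprints given as sets of "on" bit positions.
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 1.0 : (double) common.size() / union;
    }

    // SALI for one pair of molecules: large when a small structural change
    // (similarity near 1) goes with a large activity change, i.e. a cliff.
    // Identical fingerprints give Infinity (or NaN if the activities are also equal).
    static double sali(double activityI, double activityJ, double similarity) {
        return Math.abs(activityI - activityJ) / (1.0 - similarity);
    }

    public static void main(String[] args) {
        // Hypothetical fingerprints and pIC50 values, for illustration only.
        Set<Integer> fp1 = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> fp2 = new HashSet<>(Arrays.asList(1, 2, 3, 5));
        double sim = tanimoto(fp1, fp2);
        System.out.println("sim = " + sim + ", SALI = " + sali(7.5, 5.5, sim));
    }
}
```

Computing this value for every pair of molecules gives the matrix from which cliffs (high SALI) and smooth regions (low SALI) are read off.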
What Can We Do With SALI’s?
• SALI characterizes cliffs & non-cliffs
• For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape
• Models try to encode this landscape
• Use the landscape to guide descriptor or model selection
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
Predicting the Landscape
• Rather than predicting activity directly, we can try to predict the SAR landscape
• Implies that we attempt to directly predict cliffs
– Observations are now pairs of molecules
Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
Original pIC50 RMSE = 0.97
SALI, AbsDiff RMSE = 1.10
SALI, GeoMean RMSE = 1.04
Data Integration
• It’s nice to simplify data, but we can still be faced with a multitude of data types
• We want to explore these data in a linked fashion
• How we explore and what we explore is generally influenced by the task at hand
• At one point, make inferences over all the data
Data Integration
User’s Network
Network of Public Data
Content: Drugs, Compounds, Scaffolds, Assays, Genes, Targets, Pathways, Diseases, Clinical Trials, Documents
Links: manually curated; derived from algorithms
NPC Browser
Record View of an Assay
Access Disease Hierarchy & Network
Articles, Patents, Drug Labels, …
http://tripod.nih.gov/npc/
Going Beyond Exploration?
• Simply being able to explore data in an integrated manner is useful as an idea generator
• Can we integrate heterogeneous data types & sources to get a systems level view?
– Current research problem in genomics and systems biology
– Some attempts have been made to merge chemical data with other data types
Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
RNAi Facility Mission
• Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
• Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs.
Range of Assays
• Simple Phenotypes (viability, cytotoxicity, oxidative stress, etc.)
• Pathway (reporter assays, e.g. luciferase, β-lactamase)
• Complex Phenotypes (high-content imaging, cell cycle, translocation, etc.)
RNAi Effectors
RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.
Issues Using RNAi Effectors
• RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.
• RNAi effectors induce off-target effects!
Examples of Current Projects
• Protein Quality Control
• DNA Re-replication
• Base Excision Repair
• DNA Damage – ELG1 stabilization
• Antioxidant Response
• Hypoxia
• TNFα Response
• Interferon Response
• iPS to RPE
• Poxvirus
• Respiratory Viruses
• Lysosomal Storage Disorders
• Parkinson’s – Mitochondrial Quality Control
• Ewing’s Sarcoma
• Drug Modifiers, Pancreatic Cancer
• Drug Modifiers, TOP1 Clinical Agents
• Immunotoxin-Mediated Cell Death
User Accessible Tools
RNAi Libraries
Qiagen Human Druggable Genome Library, > 7,000 genes, 4 unique siRNAs per gene.
Kinome Libraries, purchased from a number of vendors.
Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library
Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene.
Dharmacon Human Duet Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools.
Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene.
• Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.
• Considerations are being made for additional species and shRNA resources.
Druggable Genome Screening Campaign
• Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).
Pseudo-colored Blue/Green Ratio (Normalized to plate Median)
• 85 genes were selected for follow-up through a variety of threshold-based selection schemes.
• 27 genes were validated as confident hits using siRNAs from multiple vendors.
[Figure: percent reduction in NF-kB signal (average inhibition, %, axis 0–100) for TNFα Receptor, IKKα, RELA and NEMO; Qiagen siRNAs vs. Ambion siRNAs]
Significant enrichment for core NF-kB components
[Figure: proteasome schematic after Murata et al, Nature Reviews Mol. Cell Biol., showing the 20S proteasome (α1-7, β1-7) and the 19S regulator particles (RPT, RPN). Bar chart of percent reduction in NF-kB signal (average inhibition, %, axis 0–100) by PSM gene / PSM protein (A1–A7, B2–B4, C4, C5, D2, D7, D14), grouped as α core 20S, β core 20S, RPT 19S and RPN 19S; Qiagen vs. Ambion]
Significant enrichment for proteins that form the 26S proteasome
An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway
Seed Sequence Analysis
Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.
Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate the NF-kB reporter, adding to the likelihood of this being an off-target effect.
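The seed comparison described for PSMA3 and SLC24A1 can be sketched as follows. This is a rough illustration, not the facility's pipeline: it assumes the seed is nucleotides 2-8 of the guide strand (a common convention; some analyses use 2-7), that the input strings are guide strands written 5' to 3', and that activity is a percent inhibition. The sequences and activity values in `main` are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SeedSketch {
    // Assumed seed definition: positions 2-8 (1-based) of the guide strand.
    static String seed(String guide) {
        return guide.substring(1, 8);
    }

    // Mean activity of all OTHER siRNAs in the screen that share this siRNA's seed.
    // Returns NaN when no other siRNA carries the same seed.
    static double meanActivityOfSeedMates(String guide, Map<String, Double> activityByGuide) {
        String s = seed(guide);
        double sum = 0.0;
        int n = 0;
        for (Map.Entry<String, Double> e : activityByGuide.entrySet()) {
            if (!e.getKey().equals(guide) && seed(e.getKey()).equals(s)) {
                sum += e.getValue();
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    public static void main(String[] args) {
        // Hypothetical screen results: guide sequence -> percent inhibition.
        Map<String, Double> screen = new LinkedHashMap<>();
        screen.put("TACGGGAACTACCATAATTTA", 80.0);
        screen.put("GACGGGAATTTTTTTTTTTTT", 70.0);
        screen.put("CACGGGAACCCCCCCCCCCCC", 60.0);
        screen.put("CAGCATGAGTACTACAGGCCA", 5.0);
        // High activity among seed mates hints at a seed-driven (off-target) effect;
        // low activity supports an on-target interpretation.
        System.out.println(meanActivityOfSeedMates("TACGGGAACTACCATAATTTA", screen));
    }
}
```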
RNAi & Small Molecule Screens
Goal: Develop systems level view of small molecule activity
• Reuse pre-existing MLI data
• Develop new annotated libraries
TACGGGAACTACCATAATTTA
CAGCATGAGTACTACAGGCCA
• Run parallel RNAi screen
What targets mediate activity of siRNA and compound
Pathway elucidation, identification of interactions
Target ID and validation
Link RNAi-generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology
Matching Phenotypes
[Figure: RNAi and small molecule phenotypes]
Merging Screening Technologies
High throughput screening
• Lead identification
• Single (few) read outs
• High-throughput
• Moderate data volumes
High content screening
• Phenotypic profiling
• Multiple parameters
• Moderate throughput
• Very large data volumes
• We’d like to combine the technologies, to obtain rich high-resolution data at high speed
• Is this feasible? What are the trade-offs?
Merging Screening Technologies
• A simple solution is to run HTS & HCS as separate, primary & secondary screens
• Alternatively – Wells to Cells
– Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
– Selective imaging of interesting wells
Wells to Cells Workflow
• Sequential qHTS using laser scanning cytometry followed by high-res microscopy
• Unit of work is a plate series
• The same aliquot is analyzed by both techniques
• A message based system
• The key is deciding which wells go through the workflow
Wells to Cells Assays
• Cell cycle, cell translocation, DNA re-replication
• All assays run against LOPAC1280
• Consistency between cytometry & microscopy is measured by the R² between log AC50’s
– Cell cycle, 0.94 – 0.96
– Cell translocation, 0.66 – 0.94
– DNA re-replication, still in progress
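As a rough sketch of the consistency measure, the snippet below computes R² between the log AC50 values obtained for the same compounds by two techniques. It assumes R² here means the squared Pearson correlation and that AC50s are given in molar units; the slides do not spell out either point, and the class and method names are hypothetical.

```java
final class AgreementSketch {
    // Squared Pearson correlation between two equally long series.
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return (sxy * sxy) / (sxx * syy);
    }

    // R² between log10(AC50) from cytometry and log10(AC50) from microscopy.
    static double logAc50RSquared(double[] ac50Cytometry, double[] ac50Microscopy) {
        double[] lx = new double[ac50Cytometry.length];
        double[] ly = new double[ac50Microscopy.length];
        for (int i = 0; i < lx.length; i++) lx[i] = Math.log10(ac50Cytometry[i]);
        for (int i = 0; i < ly.length; i++) ly[i] = Math.log10(ac50Microscopy[i]);
        return rSquared(lx, ly);
    }
}
```

Working on the log scale keeps a few very weak compounds from dominating the comparison, since AC50 values span several orders of magnitude.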
Cell Translocation Example Hits
Informatics Platform
• Advanced correction and normalization methods
• Sophisticated curve fitting algorithm
• Good performance, allows parallelization of the entire workflow
InCell Layout File
Why Messaging?
• A messaging architecture allows for significant flexibility
– Persistent, can be kept for process tracking, reporting
– Asynchronous, allows individual components of the workflow to proceed at their own pace
– Modular, new components can be introduced at any time without redesigning the whole workflow
• We employ Oracle AQ, but any message queue can be employed
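The asynchronous hand-off between workflow components can be illustrated with an in-memory queue. This is only a sketch: a `LinkedBlockingQueue` stands in for the real message queue (it is not persistent, unlike Oracle AQ), and the message format `plate:well`, the `DONE` sentinel and the class name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class MessagingSketch {
    static final String DONE = "DONE"; // sentinel that tells the consumer to stop

    // The "analytics" side posts selected wells; the "imaging" side consumes them
    // on its own thread and at its own pace.
    static List<String> run(List<String> selectedWells) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> imaged = Collections.synchronizedList(new ArrayList<>());

        Thread imager = new Thread(() -> {
            try {
                while (true) {
                    String msg = queue.take(); // blocks until a message arrives
                    if (msg.equals(DONE)) break;
                    imaged.add("imaged " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        imager.start();

        try {
            for (String well : selectedWells) queue.put(well);
            queue.put(DONE);
            imager.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(imaged);
    }
}
```

A new consumer (for example a reporting component) could be added by giving it its own queue, without touching the producer logic, which is the modularity argument made above.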
Handling Multiple Platforms
• Current examples employ InCell hardware
• We also use Molecular Devices hardware
• As a result we have two orthogonal image stores / databases
• Need to integrate them
– Support seamless data browsing across multiple screens irrespective of imaging platform used
– Support analytics external to vendor code
A Unified Interface
• A client sees a single, simple interface to screening image data
• Transparently extract image data via the MetaXpress database or via custom code
• Currently the interface addresses image serving
• Unified metadata interface in the works
http://host/rest/protocol/plate/well/image
Trade-offs & Opportunities
• Automation reduces the ability to handle unforeseen errors
– Dispense errors and other plate problems
– Well selection based on curve classes may need to be modified on the fly
• Well selection does not consider SAR
– Wells are selected independently of each other
– If we could model SAR on the fly (or from validation screens), we’d select multiple wells, to obtain positive and negative results
Cloud Computing & Cheminformatics
• Cloud computing is a hot topic
• A number of examples of computational chemistry / cheminformatics on the cloud
– MolPlex, hBar, Numerate, Wingu, Sciligence, Pfizer
• Many examples use the cloud for remote storage and remote (hosted) computations
• But providers such as Amazon allow us to run distributed computing applications on the cloud
Map/Reduce
• Map/Reduce is a programming model for efficient distributed computing
• M/R made “famous” by Google, but the idea has been around for a long time
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
Owen O’Malley, http://bit.ly/ecHPvB
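To make the Input | Map | Shuffle & Sort | Reduce | Output analogy concrete, here is a small in-memory sketch of the three stages in plain Java. It does not use Hadoop at all; the class and method names are illustrative only, and a real framework would distribute each stage across nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class MapReduceSketch {
    // Map: one input line -> a list of (token, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return out;
    }

    // Shuffle & sort: group all values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: collapse the values of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Wire the stages together, like the Unix pipeline above.
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

For example, `MapReduceSketch.run(Arrays.asList("C C O", "N C"))` counts three C, one N and one O, the same answer `sort | uniq -c` would give on the individual tokens.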
Map/Reduce
Owen O’Malley, http://bit.ly/ecHPvB
Hadoop & Cheminformatics
• Hadoop is an Open Source implementation of the map/reduce paradigm
• Hadoop is a framework for scalable, distributed computing
– Hadoop, HDFS, Hive, Pig
• Importantly, you can play with all this on your laptop and just copy files to the big cluster when you’re ready for production
Why Hadoop?
• Simple way to make use of large clusters without MPI etc
• AWS supports Hadoop, so easy to scale up to 100’s or 1000’s of cores
• Great for Java code, but non-Java code can also make use of Hadoop
• M/R can be applied to a lot of problems, but one of the simplest is to use it as a “chunker”
Cheminformatics in Parallel
• Many cheminformatics problems are data parallel
– Chunk the data and apply the same technique over each chunk
• This makes many problems amenable to M/R
– Substructure / pharmacophore search
– Descriptor calculations, virtual screening
– Model development (?)
• In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
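The "chunk the data, run the same serial code on every chunk" idea can be sketched without Hadoop using a local thread pool. Everything here is illustrative: the chunk size, the pool size and especially `processChunk`, whose plain text containment test is only a stand-in for real per-molecule work (it is not substructure matching).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkSketch {
    // Split the records into consecutive chunks of at most 'size' items.
    static List<List<String>> chunk(List<String> records, int size) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += size) {
            chunks.add(records.subList(i, Math.min(i + size, records.size())));
        }
        return chunks;
    }

    // Serial, non-parallel work applied to one chunk (stand-in for a toolkit call).
    static List<String> processChunk(List<String> chunk, String needle) {
        List<String> hits = new ArrayList<>();
        for (String smiles : chunk) {
            if (smiles.contains(needle)) hits.add(smiles);
        }
        return hits;
    }

    // Process every chunk independently and merge results in input order.
    static List<String> run(List<String> records, int chunkSize, String needle) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> c : chunk(records, chunkSize)) {
                futures.add(pool.submit(() -> processChunk(c, needle)));
            }
            List<String> hits = new ArrayList<>();
            for (Future<List<String>> f : futures) hits.addAll(f.get());
            return hits;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With Hadoop the chunking and scheduling are handled by the framework, so only the equivalent of `processChunk` has to be written.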
Cheminformatics in Parallel
See http://blog.rguha.net/?tag=hadoop for examples & code
Substructure Searching
• Substructure searching is a trivial extension of atom counting
• If a structure matches, emit (name,1)
• Otherwise (name,0)
• Reducer simply outputs tuples of the form (name,1)
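import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.openscience.cdk.CDKConstants;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;

public class SubSearch {
    // … (elided on the slide; sp, sqt, one and zero used below are presumably declared here)

    public static class MoleculeMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text matches = new Text();
        private String pattern;

        // Read the SMARTS pattern from the job configuration.
        public void setup(Context context) {
            pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
        }

        // Emit (name, 1) if the molecule matches the pattern, otherwise (name, 0).
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                sqt.setSmarts(pattern);
                boolean matched = sqt.matches(molecule);
                matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                if (matched) context.write(matches, one);
                else context.write(matches, zero);
            } catch (CDKException e) {
                e.printStackTrace();
            }
        }
    }

    public static class SMARTSMatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Pass through only the matching molecules, as (name, 1).
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                if (val.compareTo(one) == 0) {
                    result.set(1);
                    context.write(key, result);
                }
            }
        }
    }
    // …
}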
Running on AWS
• All the code was debugged on my laptop with relatively small files
• To test the scalability, I shifted everything to AWS
– Pharmacophore search – 136K structures, single conformer, 560MB
– Created a single JAR file with CDK & application code
– Uploaded data files to S3
• Total cost of experiments was ~ $10
But I Don’t Want to Write Programs
• All these examples require us to write full-fledged Java classes
• An easier way is to use Pig & Pig Latin – a platform and query language built on top of Hadoop
• Lets us write SQL-like queries that make use of Hadoop underneath
• Flexible due to user defined functions (UDF’s)
– UDF’s encapsulate the cheminformatics
Cheminformatics & Pig
• Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
• The complexity is now hidden in the UDF
• Many toolkit functions could be wrapped as UDF’s, allowing flexible queries with much simpler code
• See http://blog.rguha.net/?p=748 for the code
A = load 'medium.smi' as (smiles:chararray);
B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';
Latency
• Hadoop is suited for batch processing
• Significant network I/O involved in distributing data to compute nodes
• Not good for
– Random ad hoc processing of small subsets
– Small volume data
– Real time (low latency) work
• But latency issues can be addressed somewhat by HBase, Hive and other technologies
More than Chunking?
• But all the examples so far could have been done via PBS/Condor or any other job scheduler
– (With Hadoop we don’t have to worry about explicit chunking of the input data)
• But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
– Predictive modeling?
– Graph algorithms?
More than Chunking?
• Both predictive & graph algorithms are increasingly supported in Hadoop
– Mahout for M/L algorithms on massive datasets
– Cloud9 for graph algorithms
• A number of bioinformatics applications make use of M/R at the algorithmic level
• They are all big applications
– Crossbow aligns 3 billion paired/unpaired reads
• Cheminformatics datasets are not very big
Summary
• HTS data is an ample playground for interesting analytics; multiple data types make it more fun
• A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
• Hadoop and M/R provide great opportunities for handling large data in a flexible manner
• But can cheminformatics really make use of it?
Acknowledgments
Informatics
• Ajit Jadhav • Trung Nguyen • Noel Southall • Ruili Huang • Min Shen • Hongmao Sun • Xin Hu • Tongan Zhao
RNAi & Small Molecule
• Scott Martin • Pinar Tuzmen • Yu-Chi Chen • Carleen Klump • Craig Thomas • Jim Inglese • Ron Johnson • Sam Michael • Jennifer Wichterman
Counting Atoms
• The canonical Hadoop program is to count the frequency of words in a text file
– Mapper reads a line, outputs a tuple (word, 1)
– Reducer will receive tuples, keyed on word
• Summing up the 1’s gives us the frequency of word
• By default, Hadoop works on a line-by-line basis
• For cheminformatics problems, SMILES files satisfy this requirement – one line, one molecule
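Counting Atoms
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.InvalidSmilesException;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;

public class HeavyAtomCount {
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Parse one SMILES line and emit (element symbol, 1) for every atom.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                for (IAtom atom : molecule.atoms()) {
                    word.set(atom.getSymbol());
                    context.write(word, one);
                }
            } catch (InvalidSmilesException e) {
                // do nothing for now
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sum the 1's emitted for each element symbol.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    // …
}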
• Uses the CDK to parse SMILES
• For each molecule, loop over atoms
– Emit (symbol,1)
• Reducer simply sums the 1’s for each symbol
Multiline Records
• Lots of cheminformatics applications require 3D
– SMILES won’t do. Need to support SDF
• We implement a custom RecordReader to process SD files
• We’re now ready to tackle pretty much most cheminformatics tasks
Why Hadoop?
• Java and C++ APIs
– In Java use Objects, while in C++ bytes
• Each task can process data sets larger than RAM
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– M/R queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when possible
Owen O’Malley, http://bit.ly/ecHPvB