Enabling Discoveries at High Throughput: Small Molecule and RNAi HTS at the NCTT

Rajarshi Guha, NIH Center for Translational Therapeutics

May 3, 2011

Outline

•  Informatics for small molecule & RNAi screening
•  HCA & automated decision making
    –  Pretty pictures can lead to more efficient screens
•  Large scale cheminformatics
    –  We can do it, but do we need to?

•  Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
    –  NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
    –  Post-doc program
•  Mission
    –  MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
    –  Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
    –  New paradigms & applications of HTS for chemical biology / chemical genomics
•  All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide

NIH Chemical Genomics Center

[Figure: (A) Disease areas, (B) Target types, (C) Detection methods]

Project Diversity

Assay formats & detection methods in HTS

Assay formats
•  ligand binding
    –  competition binding
•  enzymatic activity
    –  biochemical
    –  cellular
•  ion or ligand transport
    –  ion-sensitive dyes
    –  membrane potential dyes
•  protein-protein interactions
    –  biochemical
    –  cellular
•  cellular signal transduction
    –  reporter gene
    –  second messenger
•  phenotypic
    –  protein redistribution
    –  cell viability
    –  etc.

Detection modes
•  luminescence
    –  chemiluminescence
    –  bioluminescence
    –  BRET
    –  ALPHA
•  fluorescence
    –  FI
    –  FRET
    –  TRF
    –  TR-FRET
    –  FP
    –  FCS
    –  FLT
•  absorbance
•  radioactivity
    –  SPA

Detector Systems: “Reading the assay”

•  ViewLux
    –  Multimodal CCD-based imager
        •  Abs., Luminescence, Fluorescence
•  Envision
    –  PMT-based reader
        •  ALPHA
•  Acumen Explorer
    –  Laser scanning imager
        •  “static” cell cytometry
•  Hamamatsu FDS 7000 Series
    –  rapid kinetics
•  INCell1000
    –  Subcellular imaging

1536-well plates, inter-plate dilution series

Assay volumes 2–5 μL

Assay concentration ranges over 4 logs (high: ~100 μM)

Informatics pipeline: automated curve fitting and classification; 300K samples

Automated concentration-response data collection, ~1 CRC/sec

qHTS: High Throughput Dose Response

Informatics Activities

•  High throughput curve fitting
•  Data integration, automated cherry picking
•  SAR algorithms
    –  QSAR modeling
    –  Fragment based analysis
    –  Activity cliffs
•  Tools
    –  standardizer, tautomers, fragment activity browser, kinome browser and more
•  RNAi hit selection, OTE analysis
•  High content analysis

Kinome Navigator

•  Browse kinase panel data

•  Currently focused on the Abbott dataset
•  View
    •  Fragments
    •  Target pairs
    •  Kinome overlay

http://tripod.nih.gov

Fragment Browser

•  View activities on a fragment-wise basis
•  Compare activity distributions by fragment
•  Currently based around ChEMBL assays, but users can browse their own compounds & activities

http://tripod.nih.gov

Structure Activity Landscapes

•  Rugged gorges or rolling hills?
    –  Small structural changes associated with large activity changes represent steep slopes in the landscape
    –  But traditionally, QSAR assumes gentle slopes
    –  We can characterize the landscape using SALI

Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
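A minimal sketch of how a SALI value for one pair of molecules could be computed, assuming the usual definition SALI(i,j) = |A_i − A_j| / (1 − sim(i,j)) with Tanimoto similarity over binary fingerprints. The class name, the BitSet fingerprints and the helper `bits` are illustrative assumptions, not code from the talk.

```java
import java.util.BitSet;

/** Sketch: Structure-Activity Landscape Index for a single pair of molecules. */
final class SaliSketch {

    /** Tanimoto similarity of two binary fingerprints: |A and B| / |A or B|. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        if (or.cardinality() == 0) {
            return 1.0; // two empty fingerprints are treated as identical
        }
        return (double) and.cardinality() / or.cardinality();
    }

    /**
     * SALI = |Ai - Aj| / (1 - similarity).
     * Identical fingerprints with different activities give +Infinity (the steepest possible cliff).
     */
    static double sali(double activityI, double activityJ, double similarity) {
        double diff = Math.abs(activityI - activityJ);
        double distance = 1.0 - similarity;
        if (distance == 0.0) {
            return diff == 0.0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return diff / distance;
    }

    /** Hypothetical helper: builds a fingerprint with the given bits switched on. */
    static BitSet bits(int... on) {
        BitSet fp = new BitSet();
        for (int i : on) {
            fp.set(i);
        }
        return fp;
    }

    public static void main(String[] args) {
        BitSet m1 = bits(1, 2, 3, 4);
        BitSet m2 = bits(1, 2, 3, 5);
        double sim = tanimoto(m1, m2); // 3 shared bits out of 5 set overall
        System.out.println("similarity = " + sim + ", SALI = " + sali(7.5, 5.5, sim));
    }
}
```

Large values flag pairs that are structurally close but far apart in activity; the choice of fingerprint changes the similarity and therefore the landscape.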

What Can We Do With SALI’s?

•  SALI characterizes cliffs & non-cliffs
•  For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape
•  Models try to encode this landscape
•  Use the landscape to guide descriptor or model selection

Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658

Predicting the Landscape

•  Rather than predicting activity directly, we can try to predict the SAR landscape
•  Implies that we attempt to directly predict cliffs
    –  Observations are now pairs of molecules

Scheiber et al., Statistical Analysis and Data Mining, 2009, 2, 115-122

Original pIC50 RMSE = 0.97

SALI, AbsDiff RMSE = 1.10

SALI, GeoMean RMSE = 1.04

Data Integration

•  It’s nice to simplify data, but we can still be faced with a multitude of data types
•  We want to explore these data in a linked fashion
•  How we explore and what we explore is generally influenced by the task at hand
•  At some point, make inferences over all the data

Data Integration

User’s Network

Network of Public Data

Content:
-  Drugs
-  Compounds
-  Scaffolds
-  Assays
-  Genes
-  Targets
-  Pathways
-  Diseases
-  Clinical Trials
-  Documents

Links:
-  Manually curated
-  Derived from algorithms

Record View of an Assay

Access Disease Hierarchy & Network

Articles, Patents, Drug Labels, …

NPC Browser

http://tripod.nih.gov/npc/

Going Beyond Exploration?

•  Simply being able to explore data in an integrated manner is useful as an idea generator
•  Can we integrate heterogeneous data types & sources to get a systems level view?
    –  Current research problem in genomics and systems biology
    –  Some attempts have been made to merge chemical data with other data types

Young, D.W. et al., Nat. Chem. Biol., 2008, 4, 59-68

•  Perform collaborative genome-wide RNAi screening-based projects with intramural investigators

•  Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs

RNAi Facility Mission

Range of Assays

Pathway (Reporter assays, e.g. luciferase, β-lactamase)

Complex Phenotypes (High-content imaging, cell cycle, translocation, etc.)

Simple Phenotypes (Viability, cytotoxicity, oxidative stress, etc.)

RNAi Effectors

RNAi effectors provide an excellent way to conduct gene-specific loss-of-function studies.

•  RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.

•  RNAi effectors induce off-target effects!

Issues Using RNAi Effectors

•  Protein Quality Control

•  DNA Re-replication

•  Base Excision Repair

•  DNA Damage – ELG1 stabilization

•  Antioxidant Response

•  Hypoxia

•  TNFa Response

•  Interferon Response

•  iPS to RPE

•  Poxvirus

•  Respiratory Viruses

•  Lysosomal Storage Disorders

•  Parkinson’s – Mitochondrial Quality Control

•  Ewing’s Sarcoma

•  Drug Modifiers, Pancreatic Cancer

•  Drug Modifiers, TOP1 Clinical Agents

•  Immunotoxin-Mediated Cell Death

Examples of Current Projects

User Accessible Tools

RNAi Libraries

Qiagen Human Druggable Genome Library, > 7,000 genes, 4 unique siRNAs per gene.

Kinome Libraries: purchased from a number of vendors.

•  Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.

•  Considerations are being made for additional species and shRNA resources.

Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library

Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene.

Dharmacon Human Duet Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools.

Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene.

Druggable Genome Screening Campaign

•  Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).

Pseudo-colored Blue/Green Ratio (Normalized to plate Median)

•  85 genes were selected for follow-up through a variety of threshold-based selection schemes.

•  27 genes were validated as confident hits using siRNAs from multiple vendors.

[Figure: bar chart of percent reduction in NF-kB signal (average inhibition, %, axis 0–100) for TNFα Receptor, IKKα, RELA and NEMO, comparing Qiagen and Ambion siRNAs]

Significant enrichment for core NF-kB components

Murata et al Nature Reviews Mol. Cell Biol.

[Figure: proteasome schematic (20S proteasome with α1-7 and β1-7 rings; 19S regulator particles with RPT and RPN subunits) and bar chart of percent reduction in NF-kB signal (average inhibition, %, axis 0–100) by PSM gene (A1–A7, B2–B4, C4, C5, D2, D7, D14), grouped by PSM protein: α core 20S, β core 20S, RPT 19S, RPN 19S]

Significant enrichment for proteins that form the 26S proteasome

An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway.

Druggable Genome Screening Campaign

Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.

Seed Sequence Analysis

Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate the NF-kB reporter, adding to the likelihood of this being an off-target effect.

Seed Sequence Analysis
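A rough sketch of the kind of grouping a seed analysis implies: collect every siRNA that shares a seed and compare their activities. It assumes the seed is taken as nucleotides 2–8 of the guide strand written 5'→3' (a common convention; the exact positions used in the screen are not stated in the slides) and that activities are percent-inhibition values. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: group siRNAs by seed and average their activity per seed. */
final class SeedSketch {

    /** Seed heptamer: positions 2-8 (1-based) of the guide strand, an assumed convention. */
    static String seed(String guide5to3) {
        if (guide5to3.length() < 8) {
            throw new IllegalArgumentException("guide strand too short: " + guide5to3);
        }
        return guide5to3.substring(1, 8).toUpperCase();
    }

    /** Mean activity over all siRNAs sharing each seed; input maps guide sequence -> activity. */
    static Map<String, Double> meanActivityBySeed(Map<String, Double> activityByGuide) {
        Map<String, double[]> sums = new TreeMap<>(); // seed -> {sum, count}
        for (Map.Entry<String, Double> e : activityByGuide.entrySet()) {
            double[] acc = sums.computeIfAbsent(seed(e.getKey()), k -> new double[2]);
            acc[0] += e.getValue();
            acc[1] += 1;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }
}
```

If unrelated siRNAs carrying the same seed are also active, an off-target explanation becomes more likely; if they are inactive, the hit looks more like an on-target effect.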

RNAi & Small Molecule Screens

Goal: Develop systems level view of small molecule activity

•  Reuse pre-existing MLI data
•  Develop new annotated libraries

TACGGGAACTACCATAATTTA

CAGCATGAGTACTACAGGCCA

•  Run parallel RNAi screen

What targets mediate activity of siRNA and compound?

Pathway elucidation, identification of interactions

Target ID and validation

Link RNAi-generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology.

Matching Phenotypes

RNAi

Small Molecule

Merging Screening Technologies

High throughput screening
•  Lead identification
•  Single (few) read outs
•  High-throughput
•  Moderate data volumes

High content screening
•  Phenotypic profiling
•  Multiple parameters
•  Moderate throughput
•  Very large data volumes

•  We’d like to combine the technologies, to obtain rich high-resolution data at high speed

•  Is this feasible? What are the trade-offs?

Merging Screening Technologies

•  A simple solution is to run HTS & HCS as separate primary & secondary screens
•  Alternatively – Wells to Cells
    –  Integrate HTS & HCS in a single screen using a combined platform for robotics & real-time automated HTS analytics
    –  Selective imaging of interesting wells

Wells to Cells Workflow

•  Sequential qHTS using laser scanning cytometry followed by high-res microscopy
•  Unit of work is a plate series
•  The same aliquot is analyzed by both techniques
•  A message-based system
•  The key is deciding which wells go through the workflow

Wells to Cells Assays

•  Cell cycle, cell translocation, DNA rereplication
•  All assays run against LOPAC1280
•  Consistency between cytometry & microscopy is measured by the R² between log AC50s
    –  Cell cycle, 0.94 – 0.96
    –  Cell translocation, 0.66 – 0.94
    –  DNA rereplication, still in progress
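A small sketch of the consistency measure, assuming R² here means the squared Pearson correlation between paired log10 AC50 values from the two readouts (the slides do not say which R² definition was used). Class and method names are illustrative.

```java
/** Sketch: agreement between two readouts as R^2 of paired log10 AC50 values. */
final class Ac50ConsistencySketch {

    /** Converts AC50 values (any consistent concentration unit) to log10. */
    static double[] log10(double[] ac50) {
        double[] out = new double[ac50.length];
        for (int i = 0; i < ac50.length; i++) {
            out[i] = Math.log10(ac50[i]);
        }
        return out;
    }

    /** Squared Pearson correlation between two equally long series. */
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            throw new IllegalArgumentException("a series with zero variance has no defined correlation");
        }
        return (sxy * sxy) / (sxx * syy);
    }
}
```

Typical use would be `rSquared(log10(cytometryAc50), log10(microscopyAc50))` over the compounds that produced a curve in both readouts.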

Cell Translocation Example Hits

Informatics Platform

•  Advanced correction and normalization methods
•  Sophisticated curve fitting algorithm
•  Good performance, allows parallelization of the entire workflow

InCell Layout File

Why Messaging?

•  A messaging architecture allows for significant flexibility
    –  Persistent, can be kept for process tracking, reporting
    –  Asynchronous, allows individual components of the workflow to proceed at their own pace
    –  Modular, new components can be introduced at any time without redesigning the whole workflow
•  We employ Oracle AQ, but any message queue can be employed
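A toy illustration of the asynchronous, modular style described above, using in-process `BlockingQueue`s as a stand-in for a real message queue such as Oracle AQ. The stage names, message strings and the `STOP` sentinel are assumptions made for the sketch; it shows the pattern only, not the production workflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: workflow stages decoupled by queues, each proceeding at its own pace. */
final class MessagingSketch {
    static final String STOP = "STOP"; // sentinel message that shuts a stage down

    static List<String> run(List<String> plateSeries) {
        BlockingQueue<String> scanned = new LinkedBlockingQueue<>();
        BlockingQueue<String> selected = new LinkedBlockingQueue<>();
        List<String> imaged = new ArrayList<>();

        // Stage 2: consumes "scanned" messages and publishes "wells selected" messages.
        Thread selector = new Thread(() -> {
            try {
                for (String msg = scanned.take(); !msg.equals(STOP); msg = scanned.take()) {
                    selected.put(msg + " -> wells selected");
                }
                selected.put(STOP);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: consumes "wells selected" messages and records the imaging step.
        Thread imager = new Thread(() -> {
            try {
                for (String msg = selected.take(); !msg.equals(STOP); msg = selected.take()) {
                    imaged.add(msg + " -> imaged");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        selector.start();
        imager.start();
        try {
            // Stage 1: the producer posts one message per plate series, then the sentinel.
            for (String series : plateSeries) {
                scanned.put(series + " scanned");
            }
            scanned.put(STOP);
            selector.join();
            imager.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return imaged;
    }
}
```

A new stage can be added by inserting another queue and consumer without touching the existing ones, which is the modularity argument in miniature.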

Handling Multiple Platforms

•  Current examples employ InCell hardware
•  We also use Molecular Devices hardware
•  As a result we have two orthogonal image stores / databases
•  Need to integrate them
    –  Support seamless data browsing across multiple screens irrespective of imaging platform used
    –  Support analytics external to vendor code

A Unified Interface

•  A client sees a single, simple interface to screening image data
•  Transparently extract image data via the MetaXpress database or via custom code
•  Currently the interface addresses image serving
•  Unified metadata interface in the works

http://host/rest/protocol/plate/well/image
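One way such a facade could be organized, sketched under the assumption that the `protocol` segment of the path determines which imaging backend holds the data. The `ImageBackend` interface, the registry and the returned locator strings are hypothetical and only illustrate the routing idea.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: one REST-style path in front of several vendor-specific image stores. */
final class UnifiedImageSketch {

    /** Hypothetical abstraction: one implementation per imaging platform. */
    interface ImageBackend {
        String locate(String plate, String well, String image);
    }

    private final Map<String, ImageBackend> backendByProtocol = new HashMap<>();

    /** Associates a screening protocol with the backend that stores its images. */
    void register(String protocol, ImageBackend backend) {
        backendByProtocol.put(protocol, backend);
    }

    /** Resolves a path of the form /rest/protocol/plate/well/image. */
    String resolve(String path) {
        String[] parts = path.split("/");
        // Expected segments: "", "rest", protocol, plate, well, image
        if (parts.length != 6 || !parts[0].isEmpty() || !parts[1].equals("rest")) {
            throw new IllegalArgumentException("unexpected path: " + path);
        }
        ImageBackend backend = backendByProtocol.get(parts[2]);
        if (backend == null) {
            throw new IllegalArgumentException("unknown protocol: " + parts[2]);
        }
        return backend.locate(parts[3], parts[4], parts[5]);
    }
}
```

The client only ever sees the path; whether the image comes from a vendor database or from custom file-system code is hidden behind the backend.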

Trade-offs & Opportunities

•  Automation reduces the ability to handle unforeseen errors
    –  Dispense errors and other plate problems
    –  Well selection based on curve classes may need to be modified on the fly
•  Well selection does not consider SAR
    –  Wells are selected independently of each other
    –  If we could model SAR on the fly (or from validation screens), we’d select multiple wells, to obtain positive and negative results

Cloud Computing & Cheminformatics

•  Cloud computing is a hot topic
•  A number of examples of computational chemistry / cheminformatics on the cloud
    –  MolPlex, hBar, Numerate, Wingu, Scilligence, Pfizer
•  Many examples use the cloud for remote storage and remote (hosted) computations
•  But providers such as Amazon allow us to run distributed computing applications on the cloud

Map/Reduce

•  Map/Reduce is a programming model for efficient distributed computing
•  M/R was made “famous” by Google, but the idea has been around for a long time
•  It works like a Unix pipeline:
    –  cat input | grep | sort | uniq -c | cat > output
    –  Input | Map | Shuffle & Sort | Reduce | Output
•  Efficiency from
    –  Streaming through data, reducing seeks
    –  Pipelining

Owen O’Malley, http://bit.ly/ecHPvB
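The Input | Map | Shuffle & Sort | Reduce | Output stages can be mimicked in a few lines of plain Java. This is a single-process toy with no Hadoop involved; the class and method names are made up for the illustration, and the token-counting task is chosen only because it is the simplest case.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: the map / shuffle-and-sort / reduce stages simulated in memory. */
final class MapReduceSketch {

    /** Map: one input line becomes a list of (token, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle & sort: group all values by key, with keys in sorted order. */
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: collapse the values collected for one key into a single count. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Input | Map | Shuffle & Sort | Reduce | Output. */
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("C C O", "N C")));
    }
}
```

In a real framework the map calls run on different nodes and the shuffle moves data across the network; the logic per stage stays this small.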

Map/Reduce

Owen O’Malley, http://bit.ly/ecHPvB

Hadoop & Cheminformatics

•  Hadoop is an open source implementation of the map/reduce paradigm
•  Hadoop is a framework for scalable, distributed computing
    –  Hadoop, HDFS, Hive, Pig
•  Importantly, you can play with all this on your laptop and just copy files to the big cluster when you’re ready for production

Why Hadoop?

•  Simple way to make use of large clusters without MPI etc.
•  AWS supports Hadoop, so it is easy to scale up to 100s or 1000s of cores
•  Great for Java code, but non-Java code can also make use of Hadoop
•  M/R can be applied to a lot of problems, but one of the simplest is to use it as a “chunker”

Cheminformatics in Parallel

•  Many cheminformatics problems are data parallel
    –  Chunk the data and apply the same technique over each chunk
•  This makes many problems amenable to M/R
    –  Substructure / pharmacophore search
    –  Descriptor calculations, virtual screening
    –  Model development (?)
•  In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
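The "chunk and apply" idea can be sketched on one machine with a thread pool standing in for cluster nodes. The per-item function is passed in, so the chemistry itself stays ordinary serial code; names, chunk sizes and worker counts below are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: split the input into chunks and run the same serial code on each chunk. */
final class ChunkerSketch {

    /** Splits items into consecutive chunks of at most the given size. */
    static <T> List<List<T>> chunk(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("chunk size must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return chunks;
    }

    /** Applies perItem to every element, one task per chunk; results keep the input order. */
    static <T, R> List<R> processInChunks(List<T> items, int size, int workers, Function<T, R> perItem) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> part : chunk(items, size)) {
                futures.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T item : part) {
                        out.add(perItem.apply(item)); // non-parallel code inside each chunk
                    }
                    return out;
                }));
            }
            List<R> all = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With Hadoop the chunking and the scheduling are done by the framework; here they are explicit so the pattern is visible.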

Cheminformatics in Parallel

See http://blog.rguha.net/?tag=hadoop for examples & code

Substructure Searching

•  Substructure searching is a trivial extension of atom counting
•  If a structure matches, emit (name,1)
•  Otherwise emit (name,0)
•  Reducer simply outputs tuples of the form (name,1)

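import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.openscience.cdk.CDKConstants;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;

public class SubSearch {
    // … (declarations of sp, sqt, one and zero are elided on the slide)

    // Emits (name, 1) for molecules that match the SMARTS pattern and (name, 0) otherwise.
    public static class MoleculeMapper extends Mapper<Object, Text, Text, IntWritable> {

        private Text matches = new Text();
        private String pattern;

        // Read the SMARTS pattern from the job configuration.
        public void setup(Context context) {
            pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                sqt.setSmarts(pattern);
                boolean matched = sqt.matches(molecule);
                matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                if (matched) context.write(matches, one);
                else context.write(matches, zero);
            } catch (CDKException e) {
                e.printStackTrace();
            }
        }
    }

    // Passes through only the (name, 1) tuples, i.e. the matching molecules.
    public static class SMARTSMatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                if (val.compareTo(one) == 0) {
                    result.set(1);
                    context.write(key, result);
                }
            }
        }
    }
}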

Running on AWS

•  All the code was debugged on my laptop with relatively small files
•  To test the scalability, I shifted everything to AWS
    –  Pharmacophore search
    –  136K structures, single conformer, 560MB
    –  Created a single JAR file with CDK & application code
    –  Uploaded data files to S3
•  Total cost of experiments was ~ $10

But I Don’t Want to Write Programs

•  All these examples require us to write full-fledged Java classes
•  An easier way is to use Pig & Pig Latin, a platform and query language built on top of Hadoop
•  Lets us write SQL-like queries that make use of Hadoop underneath
•  Flexible due to user defined functions (UDFs)
    –  UDFs encapsulate the cheminformatics

Cheminformatics & Pig

•  Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
•  The complexity is now hidden in the UDF
•  Many toolkit functions could be wrapped as UDFs, allowing flexible queries with much simpler code
•  See http://blog.rguha.net/?p=748 for the code

A = load 'medium.smi' as (smiles:chararray);
B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

Latency

•  Hadoop is suited for batch processing
•  Significant network I/O involved in distributing data to compute nodes
•  Not good for
    –  Random ad hoc processing of small subsets
    –  Small volume data
    –  Real-time (low latency) work
•  But latency issues can be addressed somewhat by HBase, Hive and other technologies

More than Chunking?

•  But all the examples so far could have been done via PBS/Condor or any other job scheduler
    –  (With Hadoop we don’t have to worry about explicit chunking of the input data)
•  But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
    –  Predictive modeling?
    –  Graph algorithms?

More than Chunking?

•  Both predictive & graph algorithms are increasingly supported in Hadoop
    –  Mahout for M/L algorithms on massive datasets
    –  Cloud9 for graph algorithms
•  A number of bioinformatics applications make use of M/R at the algorithmic level
•  They are all big applications
    –  Crossbow aligns 3 billion paired/unpaired reads
•  Cheminformatics datasets are not very big

Summary

•  HTS data is an ample playground for interesting analytics; multiple data types make it more fun
•  A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
•  Hadoop and M/R provide great opportunities for handling large data in a flexible manner
•  But can cheminformatics really make use of it?

Informatics
•  Ajit Jadhav
•  Trung Nguyen
•  Noel Southall
•  Ruili Huang
•  Min Shen
•  Hongmao Sun
•  Xin Hu
•  Tongan Zhao

RNAi & Small Molecule
•  Scott Martin
•  Pinar Tuzmen
•  Yu-Chi Chen
•  Carleen Klump
•  Craig Thomas
•  Jim Inglese
•  Ron Johnson
•  Sam Michael
•  Jennifer Wichterman

Acknowledgments

Counting Atoms

•  The canonical Hadoop program is to count the frequency of words in a text file
    –  Mapper reads a line, outputs a tuple (word, 1)
    –  Reducer will receive tuples, keyed on word
•  Summing up the 1’s gives us the frequency of each word
•  By default, Hadoop works on a line-by-line basis
•  For cheminformatics problems, SMILES files satisfy this requirement: one line, one molecule

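Counting Atoms

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.InvalidSmilesException;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;

public class HeavyAtomCount {
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    // Emits (element symbol, 1) for every atom of the molecule on each SMILES line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                for (IAtom atom : molecule.atoms()) {
                    word.set(atom.getSymbol());
                    context.write(word, one);
                }
            } catch (InvalidSmilesException e) {
                // do nothing for now
            }
        }
    }

    // Sums the 1's received for each element symbol.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    // … (remainder of the class is elided on the slide)
}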

•  Uses the CDK to parse SMILES
•  For each molecule, loop over atoms
    –  Emit (symbol,1)
•  Reducer simply sums the 1’s for each symbol

Multiline Records

•  Lots of cheminformatics applications require 3D
    –  SMILES won’t do. Need to support SDF
•  We implement a custom RecordReader to process SD files
•  We’re now ready to tackle most cheminformatics tasks
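The core of reading SD files is recognizing record boundaries: each molecule record ends with a line containing `$$$$`. The sketch below shows only that splitting logic over a plain `BufferedReader`; it is not the Hadoop `RecordReader` from the talk, which would additionally have to deal with input splits that start in the middle of a record. Names are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: split an SD file stream into multi-line records terminated by "$$$$". */
final class SdfRecordSketch {

    /** Returns each complete record, including its "$$$$" terminator line. */
    static List<String> records(BufferedReader reader) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                current.append(line).append('\n');
                if (line.trim().equals("$$$$")) {
                    out.add(current.toString());
                    current.setLength(0); // start collecting the next record
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Any trailing lines without a terminator are treated as incomplete and dropped.
        return out;
    }
}
```

Each returned string can then be handed to a toolkit's MDL/SD parser in the same way a single SMILES line is handed to a SMILES parser.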

Why Hadoop?

•  Java and C++ APIs
    –  In Java use Objects, while in C++ bytes
•  Each task can process data sets larger than RAM
•  Automatic re-execution on failure
    –  In a large cluster, some nodes are always slow or flaky
    –  Framework re-executes failed tasks
•  Locality optimizations
    –  M/R queries HDFS for locations of input data
    –  Map tasks are scheduled close to the inputs when possible

Owen O’Malley, http://bit.ly/ecHPvB
