Classification and Clustering for Hit Identification in High Content RNAi Screens

Rajarshi Guha, Ph.D., NIH Center for Translational Therapeutics, January 11, 2012

Page 1: Classification and Clustering for Hit Identification in High Content RNAi Screens

Classification and Clustering for Hit Identification in High Content RNAi Screens

Rajarshi Guha, Ph.D.
NIH Center for Translational Therapeutics

January 11, 2012

Page 2: Classification and Clustering for Hit Identification in High Content RNAi Screens

DNA Re-replication

Sivaprasad et al., Cell Division

DNA replication is a tightly controlled and well-studied process. Proteins including geminin, cyclin A, and Emi1 can help prevent DNA re-replication.

Levels of geminin increase as cells enter S phase, which helps to prevent a second round of DNA replication.

After mitosis, levels of geminin and cyclins decrease through ubiquitin-mediated degradation.

Collaborators: Mel Depamphilis, NICHD; Wenge Zhu, Georgetown U

Page 3: Classification and Clustering for Hit Identification in High Content RNAi Screens

DNA Re-replication

Certain cancer cells may have fewer safeguards against DNA re-replication than normal cells (i.e., an Achilles' heel). Induction of re-replication results in apoptosis.

Zhu et al., Cancer Res, 2009

Page 4: Classification and Clustering for Hit Identification in High Content RNAi Screens

Screening Protocol

• HCT-116 colon cancer cells are fixed and stained (Hoechst)

• Image at 4X on ImageXpress

• MetaXpress used to perform cell cycle analysis to quantify cells with >4N DNA content

• Screens were run with singles and pools

Page 5: Classification and Clustering for Hit Identification in High Content RNAi Screens

Screen Summary

• Qiagen druggable genome library (6,866 genes)

• 94 plates, 36K wells including controls

• Good screen performance; some poorer plates were redone

 

[Figure: per-plate quality statistics plotted against Plate Index (0 to 100): Trimmed Z' (axis 0.5 to 0.8) and SSMD (axis 4 to 14)]
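The two plate-quality statistics named in the figure can be sketched as follows. This is a minimal illustration, assuming the usual textbook definitions of Z' and SSMD for two independent control groups; the trimming used for the slide's "Trimmed Z'" is not described in the deck and is not reproduced here, and the well values are made up.

```python
from math import sqrt
from statistics import mean, stdev


def z_prime(pos, neg):
    """Z' factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))


def ssmd(pos, neg):
    """Strictly standardized mean difference for two independent groups."""
    return (mean(pos) - mean(neg)) / sqrt(stdev(pos) ** 2 + stdev(neg) ** 2)


# Hypothetical control-well readouts for a single plate (not screen data).
pos_wells = [40.0, 42.0, 44.0]
neg_wells = [4.0, 5.0, 6.0]

plate_z_prime = z_prime(pos_wells, neg_wells)
plate_ssmd = ssmd(pos_wells, neg_wells)
print(f"Z' = {plate_z_prime:.3f}, SSMD = {plate_ssmd:.2f}")
```

Computing both per plate and plotting them against the plate index gives the kind of overview shown on this slide.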

Page 6: Classification and Clustering for Hit Identification in High Content RNAi Screens

Goals

• Can we identify genes with GMNN-like phenotypes?
– We already identified a set of genes via thresholding the %G2 parameter
– We'd like to see what we get when we use a multi-dimensional representation

• Employ predictive modeling to "learn" the phenotype

• Apply clustering and identify biologically relevant clusters

Page 7: Classification and Clustering for Hit Identification in High Content RNAi Screens

What Do GMNN Wells Look Like?

Page 8: Classification and Clustering for Hit Identification in High Content RNAi Screens

Cell-Level Modeling

• A first approach was to match distributions of individual wells with the overall distribution from the positive control wells
– Expected that the distribution for GMNN wells should match the positive control
– Use the Kolmogorov-Smirnov (KS) test to identify wells with similar distributions
– Doesn't work too well, even for GMNN itself
– Considers 1 parameter at a time (though a 2D KS test is possible)
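The distribution-matching idea above can be sketched with the two-sample KS statistic, the largest gap between the empirical CDFs of one well's cells and the pooled positive-control cells for a single parameter. This is an illustrative sketch only: the function name and the sample values are hypothetical, and no p-value is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-cell values of one parameter (not screen data).
positive_control = [3.0, 4.0, 5.0, 6.0]
well_similar = [3.0, 4.0, 5.0, 6.0]
well_different = [1.0, 2.0, 3.0, 4.0]

print(ks_statistic(well_similar, positive_control))
print(ks_statistic(well_different, positive_control))
```

A small statistic means the well resembles the positive control for that one parameter, which is also the limitation noted on the slide: each parameter is compared separately.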

Page 9: Classification and Clustering for Hit Identification in High Content RNAi Screens

Random Forest Model

• Ensemble of decision trees (Breiman 1984)

• Not always the most accurate, but great for exploratory modeling
– Implicit feature selection
– Does not overfit as more trees are added
– Provides a measure of feature importance

• Employ the randomForest package from R

http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html

Page 10: Classification and Clustering for Hit Identification in High Content RNAi Screens

Cell-Level Modeling

• Removed cells with "incomplete" parameters

• Still leaves 291K positive cases and 3M negative cases

• Developed a random forest model, sampling from negatives to maintain balanced classes
– Predict whether a cell is GMNN-like
– Models from multiple samples of the negative control exhibited similar performance

            Positive    Negative
Positive     220,636      72,498
Negative      35,614     257,520

Overall 18% error, 25% error on positive class and 12% error on negative class
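The class balancing and the quoted error rates can be reproduced with a short sketch. Two assumptions are made here: rows of the table are taken to be the actual class and columns the predicted class (this orientation reproduces the 25% and 12% per-class errors), and `downsample` is a hypothetical helper, not the sampling code used for the screen.

```python
import random


def downsample(negatives, n, seed=0):
    """Draw n negatives at random so both classes have the same size."""
    return random.Random(seed).sample(negatives, k=n)


def error_rates(cm):
    """cm maps (actual, predicted) -> count; returns (overall, per_class)."""
    total = sum(cm.values())
    wrong = sum(n for (actual, pred), n in cm.items() if actual != pred)
    per_class = {}
    for cls in {actual for actual, _ in cm}:
        row_total = sum(n for (actual, _), n in cm.items() if actual == cls)
        per_class[cls] = 1.0 - cm[(cls, cls)] / row_total
    return wrong / total, per_class


# Counts from the slide, assuming rows = actual class, columns = predicted.
confusion = {
    ("positive", "positive"): 220_636,
    ("positive", "negative"): 72_498,
    ("negative", "positive"): 35_614,
    ("negative", "negative"): 257_520,
}

overall, per_class = error_rates(confusion)
print(f"overall {overall:.0%}, positive {per_class['positive']:.0%}, "
      f"negative {per_class['negative']:.0%}")
```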

Page 11: Classification and Clustering for Hit Identification in High Content RNAi Screens

Cell-Level Modeling

• Significant overlap between distributions for the negative and positive controls

Page 12: Classification and Clustering for Hit Identification in High Content RNAi Screens

Cell-Level Predictions

• Aggregate predictions for all cells in a well to label a well as GMNN-like

• Identify genes with >= 2 siRNAs (i.e., wells) labeled as GMNN-like
– 31 genes identified (GMNN, KIF11, ESPL1, …)

• Identified expected genes, and most of the set was functionally relevant
– Also identified a few interesting, novel genes

• Reconfirmation based on Ambion sequences was relatively low (9/31)
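The aggregation and gene-selection steps above might look like the sketch below. The deck does not state how cell calls were aggregated, so the majority rule (`min_fraction=0.5`) is an assumption, and the well IDs, gene names and calls are hypothetical.

```python
from collections import defaultdict


def gmnn_like_wells(cell_calls, min_fraction=0.5):
    """cell_calls maps well id -> list of per-cell booleans (True = GMNN-like).

    Assumed rule: a well is GMNN-like when more than min_fraction of its
    cells are predicted GMNN-like.
    """
    return {
        well
        for well, calls in cell_calls.items()
        if calls and sum(calls) / len(calls) > min_fraction
    }


def genes_with_multiple_hits(well_to_gene, hit_wells, min_sirnas=2):
    """Genes with at least min_sirnas wells (one siRNA per well) labeled as hits."""
    counts = defaultdict(int)
    for well in hit_wells:
        counts[well_to_gene[well]] += 1
    return sorted(gene for gene, n in counts.items() if n >= min_sirnas)


# Hypothetical example: two genes, two siRNA wells each.
cell_calls = {
    "A1": [True, True, False],
    "A2": [True, True, True],
    "B1": [False, False, True],
    "B2": [True, False, True],
}
well_to_gene = {"A1": "GENE_A", "A2": "GENE_A", "B1": "GENE_B", "B2": "GENE_B"}

hit_wells = gmnn_like_wells(cell_calls)
selected = genes_with_multiple_hits(well_to_gene, hit_wells)
print(selected)
```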

Page 13: Classification and Clustering for Hit Identification in High Content RNAi Screens

Well-Level Modeling

• Started with 27 parameters from MetaXpress

• Performed automated feature selection
– Remove undefined, constant features
– Manually removed a few highly correlated features

• Work with 12 parameters

• Convert to Z-scores

• Positive & negative controls are nicely separated

[Figure panels: All Wells; Control Wells]
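The automated part of the feature cleanup and the Z-score conversion can be sketched as follows. The deck does not say which wells supply the mean and standard deviation, so scaling over all values of a parameter is an assumption here; the table values are made up, and only `G2Cells` is a parameter name taken from the deck.

```python
from math import isnan
from statistics import mean, stdev


def drop_uninformative(table):
    """table maps parameter name -> list of per-well values.

    Removes parameters with undefined (None/NaN) or constant values.
    """
    kept = {}
    for name, values in table.items():
        if any(v is None or isnan(v) for v in values):
            continue  # undefined for at least one well
        if len(set(values)) <= 1:
            continue  # constant, carries no information
        kept[name] = values
    return kept


def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical well-level table (not screen data).
table = {
    "G2Cells": [1.0, 2.0, 3.0],
    "ConstantParam": [5.0, 5.0, 5.0],
    "UndefinedParam": [1.0, float("nan"), 2.0],
}

clean = drop_uninformative(table)
scaled = {name: z_scores(values) for name, values in clean.items()}
print(scaled)
```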

Page 14: Classification and Clustering for Hit Identification in High Content RNAi Screens

Parameter Distributions

Page 15: Classification and Clustering for Hit Identification in High Content RNAi Screens

Model Performance

• Classification model trained using the positive (GMNN-like) and negative (not GMNN-like) controls

• Perfect classification!
– Worrying: overfitting?
– Nearly 99% of the control wells were confidently classified as positive or negative

            Positive    Negative
Positive        1504           0
Negative           0        1504

Page 16: Classification and Clustering for Hit Identification in High Content RNAi Screens

Descriptor Importance

• What does the model identify as the most relevant descriptors?

• Some parameters are moderately correlated

[Figure: descriptor importance as MeanDecreaseGini (axis 0 to 300). Descriptors in plot order: Cell.MitoticAverageIntensity, Cell.DNAAverageIntensity, X.SPhase, G2Cells, DNABackgroundValue, Cell.DNAArea, X.G0.G1, Cell.DNAIntegratedIntensity, Cell.MitoticIntegratedIntensity, X.G2, SPhaseCells, G0.G1Cells]
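MeanDecreaseGini accumulates, for each descriptor, the drop in Gini impurity produced by the tree splits that use it, averaged over the trees of the forest. The sketch below shows that drop for a single split; it is an illustration of the quantity rather than the randomForest implementation, and the values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(values, labels, threshold):
    """Impurity drop from splitting one descriptor at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical descriptor values with class labels (1 = GMNN-like).
values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]

print(gini_decrease(values, labels, threshold=2.0))  # separates the classes
print(gini_decrease(values, labels, threshold=1.0))  # weaker split
```

A descriptor that is chosen often for splits that separate the classes well ends up with a large total, which is what the importance plot ranks.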

Page 17: Classification and Clustering for Hit Identification in High Content RNAi Screens

Random Forest Predictions

• We use the model to predict the class for all the remaining wells

• All four siRNAs targeting GMNN are classified as Geminin-like with high confidence

[Figure: histogram of Probability of being Geminin-like (0.0 to 1.0) against Percent of Total (0 to 10)]

Page 18: Classification and Clustering for Hit Identification in High Content RNAi Screens

Random Forest Predictions

• Select genes for which > 75% of their siRNAs are predicted to be Geminin-like with probability > 0.8

• Good overlap with cell-level model

[Figure: Probability of being Geminin-like (0.0 to 1.0) for the selected genes: AURKA, AURKB, BRD8, C8orf79, CDCA5, CDCA8, CRAT, ESPL1, F12, FBXO5, GMNN, GUSB, INCENP, ITPKA, JUN, KCNH6, KIF11, MLL4, OR10A2, PLK1, PSMA1, PSMB4, ROBO2, RPLP2, SNRK, TOP2A, TRIM64, TTK, UBC, WRN]
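The selection rule on this slide can be written down directly. This is a sketch of the rule as stated (both comparisons strict), with hypothetical gene names and probabilities; note that with four siRNAs per gene, three confident siRNAs give exactly 75% and do not pass a strict "> 75%" cutoff.

```python
def select_genes(probabilities, min_prob=0.8, min_fraction=0.75):
    """probabilities maps gene -> predicted probabilities, one per siRNA.

    Keeps genes where more than min_fraction of the siRNAs have a
    Geminin-like probability above min_prob.
    """
    selected = []
    for gene, probs in probabilities.items():
        confident = sum(p > min_prob for p in probs)
        if probs and confident / len(probs) > min_fraction:
            selected.append(gene)
    return sorted(selected)


# Hypothetical predicted probabilities (not screen results).
probabilities = {
    "GENE_A": [0.95, 0.90, 0.85, 0.99],  # 4 of 4 confident
    "GENE_B": [0.95, 0.90, 0.85, 0.50],  # 3 of 4: exactly 75%, not selected
    "GENE_C": [0.20, 0.10, 0.30, 0.40],
}

print(select_genes(probabilities))
```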

Page 19: Classification and Clustering for Hit Identification in High Content RNAi Screens

GO Enrichment

• GO Biological Processes enriched in this set of selected genes are relevant to the biology

• Similarly with pathways (from GeneGo)

Page 20: Classification and Clustering for Hit Identification in High Content RNAi Screens

Clustering

• RF classification is useful, but doesn't directly tell us much about finer groups of genes that might be phenotypically related

• So we apply unsupervised clustering (partitioning around medoids, PAM)
– Explore different numbers of clusters
– Evaluate statistical cluster quality metrics
– Evaluate biologically motivated quality metrics

• We considered both plate-wise and experiment-wise clustering protocols
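To make the medoid idea concrete, here is a simplified k-medoids sketch that alternates between assigning points to the nearest medoid and re-picking each cluster's medoid. It is not the full BUILD/SWAP procedure of PAM, the choice of the first k points as starting medoids is an arbitrary assumption, and the points are hypothetical two-parameter wells.

```python
from math import dist


def k_medoids(points, k, max_iter=100):
    """Simplified k-medoids; returns (medoid indices, cluster label per point)."""
    medoids = list(range(k))  # assumption: start from the first k points
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the member with the smallest total distance to the
        # other members becomes the new medoid.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    labels = [
        min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, labels


# Hypothetical wells described by two Z-scored parameters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, k=2)
print(medoids, labels)
```

Because medoids are actual wells rather than averaged centroids, each cluster can be inspected through a representative well.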

Page 21: Classification and Clustering for Hit Identification in High Content RNAi Screens

Platewise Clustering (k=4)

• Cluster assignments can't be directly compared across plates

• Good to see that control columns are distinctly clustered

• Certain plates show no membership to the 'GMNN cluster'

Page 22: Classification and Clustering for Hit Identification in High Content RNAi Screens

Experimentwise Clustering (k=2)

• Encouraging to see clean separation between control columns

• Bulk of wells are identified as inactive

• We can compare results from this clustering to RF classification
– 6 genes identified, with multiple siRNAs clustered with negative control

Page 23: Classification and Clustering for Hit Identification in High Content RNAi Screens

Experimentwise Clustering (k=2)

• 6 genes identified with multiple siRNAs clustered with the negative control

• These were confidently identified by the RF model

[Figure: Probability of being Geminin-like (0.0 to 1.0) by gene, for the same 30 genes as on Page 18]

Page 24: Classification and Clustering for Hit Identification in High Content RNAi Screens

How Many Clusters?

• A priori, difficult to decide how many clusters there should be
– Manual spot checks did not identify distinctly different morphologies or counts

• Evaluate clusters with varying k and calculate average silhouette width

• Clustering based on the Euclidean metric doesn't do a good job

[Figure: Average Silhouette Width (0.2 to 0.7) against Number of Clusters (2 to 20)]
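The average silhouette width used to compare values of k can be sketched as below, using the standard definition with Euclidean distance. For each point, a is its mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster; the width is (b - a) / max(a, b). The points and labels are hypothetical, and at least two clusters are required.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Mean silhouette width over all points (requires >= 2 clusters)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton clusters get width 0 by convention
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical one-parameter wells with a good and a poor labeling.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = average_silhouette_width(points, [0, 0, 1, 1])
poor = average_silhouette_width(points, [0, 1, 0, 1])
print(f"good labeling: {good:.3f}, poor labeling: {poor:.3f}")
```

Repeating this for each candidate k and plotting the result gives a curve like the one on this slide.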

Page 25: Classification and Clustering for Hit Identification in High Content RNAi Screens

How Many Clusters?

• One approach is to ignore clusterings that spread the GMNN siRNAs across multiple clusters

• The current data suggest that we stick to k = 5
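The biologically motivated filter above could be expressed as a check on how many clusters the GMNN siRNAs fall into for each k. The cutoff `max_spread=2` is an assumption (the deck gives no explicit limit), and the labelings below are hypothetical.

```python
def gmnn_spread(labels, gmnn_indices):
    """Number of distinct clusters that contain a GMNN siRNA well."""
    return len({labels[i] for i in gmnn_indices})


def acceptable_ks(clusterings, gmnn_indices, max_spread=2):
    """clusterings maps k -> cluster label per well; keep k with low spread."""
    return sorted(
        k
        for k, labels in clusterings.items()
        if gmnn_spread(labels, gmnn_indices) <= max_spread
    )


# Hypothetical labelings of six wells; wells 0 to 3 hold the GMNN siRNAs.
gmnn_indices = [0, 1, 2, 3]
clusterings = {
    2: [0, 0, 0, 0, 1, 1],
    5: [0, 0, 0, 1, 2, 3],
    8: [0, 1, 2, 3, 4, 5],
}

print(acceptable_ks(clusterings, gmnn_indices))
```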

Page 26: Classification and Clustering for Hit Identification in High Content RNAi Screens

Biological Enrichment in Clusters

• Considering 5 clusters

• Some clusters are annotated with more relevant terms

[Figure annotation: cluster containing 3 of the 4 GMNN siRNAs]

Page 27: Classification and Clustering for Hit Identification in High Content RNAi Screens

Signal Enhancement in Clusters

• Signal is significantly enhanced in some clusters versus others

• Clusters 1, 2 and 4 did not contain any siRNAs above Z = 3

Page 28: Classification and Clustering for Hit Identification in High Content RNAi Screens

Making a Final Hitlist

• Off-target effects (OTE) are a major confounding factor

• We are able to assess OTE on a gene-by-gene basis using Common Seed Analysis

• Select genes from individual clusters, using %G2 and number of siRNAs as secondary filters

• Combine with hits from random forest model

Marine, S. et al., J. Biomol. Screen., 2011, ASAP

Page 29: Classification and Clustering for Hit Identification in High Content RNAi Screens

Reconfirmation

• 18/211 genes selected based on thresholding from the primary screen reconfirmed using Ambion sequences

• Considering just the genes selected by the random forest and/or clustering methods
– 11/30 genes selected by RF reconfirmed using Ambion libraries
– 5/6 genes identified by RF & clustering reconfirmed using multiple libraries
  • ESPL1, FBXO5, INCENP, KIF11 reconfirmed very strongly

• Based on k = 5 clustering,
– 23/181 genes from cluster 3 reconfirmed
– 5/5 genes from cluster 5 reconfirmed
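Putting the counts above on a common scale makes the comparison easier to read. The snippet only converts the fractions quoted on this slide into rates; the dictionary keys are descriptive labels chosen for this example.

```python
# (reconfirmed, selected) counts as quoted on the slide.
reconfirmed = {
    "threshold on primary screen": (18, 211),
    "random forest": (11, 30),
    "random forest and clustering": (5, 6),
    "cluster 3 (k = 5)": (23, 181),
    "cluster 5 (k = 5)": (5, 5),
}

rates = {name: hits / total for name, (hits, total) in reconfirmed.items()}

# List the selection methods from lowest to highest reconfirmation rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name}: {rate:.1%}")
```

On these counts, thresholding alone reconfirms under 10% of its selections, while the genes picked by both the random forest and clustering reconfirm at over 80%, though the latter set is small.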

Page 30: Classification and Clustering for Hit Identification in High Content RNAi Screens

Outlook

• Complements traditional threshold-based selection methods

• The random forest approach is sufficiently accurate and lets us avoid explicitly selecting features up front

• Combining it with clustering lets us zoom in on biologically relevant clusters of genes

Page 31: Classification and Clustering for Hit Identification in High Content RNAi Screens

Acknowledgements

• Scott Martin
• Pinar Tuzmen
• Carleen Klump
• Eugen Buehler