CSAMA 2013: Computational visualization in genomic data analysis

Vince Carey, PhD
Harvard Medical School

Road map of the talk

• Motivations for computational visualization
  – Concrete accompaniment to more abstract statistical tabulations/inferences
  – Exploring an observational frontier
• Plotting and basic statistics (design, univariate, multivariate, multiple comparisons)
• Grammar of graphics concepts, deployment against Yeast Cell Cycle expression archive
• ggbio applications to GWAS
• Object designs for CCLE and a cancer regulatory network

Three principles

• Visualizations should be obtained using a program: no Photoshop
  – Steps by which snapshots of interactive visualizations are obtained should have a textual representation
• The process by which a given visualization was selected (from among relevant variations) should be disclosed
• Uncertainty and variability can be hard to depict, but attempts should be made to indicate:
  – Roles of modeling assumptions
  – Scope and sources of measurement variation
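
The first principle can be sketched in a few lines of base R. This is only an illustration under assumptions: the output file name, the choice of the built-in iris data, and the idea of saving sessionInfo() next to the figure are not from the slides.

```r
# Render a figure entirely from code, so the script is its textual record.
out_file <- file.path(tempdir(), "iris_sepal.pdf")        # hypothetical name

pdf(out_file, width = 7, height = 5)
plot(Sepal.Width ~ Sepal.Length, data = iris,
     col = as.integer(iris$Species), pch = 19,
     main = "Sepal dimensions by species")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
dev.off()

# Keep a record of the software versions that produced the figure.
info_file <- file.path(tempdir(), "iris_sepal_sessionInfo.txt")
writeLines(capture.output(sessionInfo()), info_file)
```

Rerunning the script regenerates the same figure, and any variation that was tried and rejected can be kept in the same file as a commented alternative.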

Report to an academy

“looks like there is something there”

Honest pairwise comparisons
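
The figure for this slide is not preserved. Assuming the title alludes to Tukey's "honest significant differences", one way to produce multiplicity-adjusted pairwise comparisons in base R is sketched below; the iris data and the one-way model are stand-ins, not the speaker's example.

```r
# One-way ANOVA followed by Tukey's HSD: simultaneous confidence intervals
# for all pairwise differences in group means.
fit <- aov(Sepal.Length ~ Species, data = iris)
hsd <- TukeyHSD(fit, conf.level = 0.95)

print(hsd)   # columns: diff, lwr, upr, p adj
plot(hsd)    # intervals that exclude 0 indicate a detectable difference
```

Showing the intervals rather than only the group means keeps the display tied to the inference it is meant to support.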

A basic dilemma

• Use visualization to discover/communicate relationships in data
• Avoid choosing visualizations that overstate the strength of relationship
  – Are there trustworthy practices?
  – Can we guide the viewer to results of a principled statistical analysis? Relationships that are highly likely to “independently replicate”?

Communicate about the experimental design

• Objectives
• Demonstration of validity and satisfactory power of proposed test procedure
• Illustration of development of the dataset from screening to analysis

R/Bioconductor and scientific visualization

• Integrative, self-describing containers
  – ExpressionSet, SummarizedExperiment, GRanges
• Packages that define workflow components
  – affy, oligo, limma, ShortRead, DESeq, GSEAlm
  – these often include their own visualization functions
• Packages that specialize in visualization
  – geneplotter, ggbio (focused ggplot2), Gviz, Rgraphviz, HilbertVis

By the end of the course

• Understand the rationale for the container designs, and the ones you will actually use
• Understand why functions have been packaged up as they have been
• Get a sense of the discipline required to use the containers and packages vs. files and scripts
• Sharpen your statistical acumen … recognize the good arguments, improve the flawed ones

A reproducible experiment?

Four labs assay a sample of blinded origin on two quantities. As regulator, do you declare them consistent?

When we look at the data…

Moral of the Anscombe data?

• “High-quality” summary statistics (e.g., estimators that are unbiased, efficient under mild regularity conditions …) can hide a lot of scientifically relevant information
  – Sometimes a good visualization will reveal important structures
• Always create an opportunity to look
• Fold visualizations at various scales into your reports
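
The point can be checked directly with the anscombe data frame that ships with R. The sketch below is not the speaker's code; it computes the usual summaries for each of the four x/y pairs and then plots them side by side.

```r
# Summaries for the four Anscombe pairs (columns x1..x4, y1..y4).
summarize_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]))
}
summ <- t(sapply(1:4, summarize_set))
rownames(summ) <- paste0("set", 1:4)
print(round(summ, 2))   # the rows agree to about two decimal places

# The same four datasets, plotted with their least-squares lines.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i),
       xlim = range(anscombe[, 1:4]), ylim = range(anscombe[, 5:8]))
  abline(lm(y ~ x))
}
par(op)
```

The table suggests four interchangeable datasets; the panels show a linear cloud, a curve, an outlier-driven line, and a single leverage point.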

Displaying 150x4 iris measurements with pairs()
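
A call along the following lines produces such a display. The colours and plotting symbol are arbitrary choices made here, not taken from the slide.

```r
# Scatterplot matrix of the four iris measurements, coloured by species.
measurements <- iris[, 1:4]                               # 150 rows, 4 columns
species_col  <- c("red", "green3", "blue")[as.integer(iris$Species)]

pairs(measurements, col = species_col, pch = 19,
      main = "Iris measurements by species")
```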

A unified view with ggplot2

Marginal distributions for species discrimination
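
The slide figure is not preserved. One plausible way to draw per-species marginal distributions with ggplot2 is sketched below, assuming the package is installed; the long-format reshaping and the choice of boxplots are assumptions made here.

```r
library(ggplot2)

# Reshape iris to long format: one row per (flower, feature) measurement.
long <- data.frame(
  Species = rep(iris$Species, times = 4),
  feature = rep(names(iris)[1:4], each = nrow(iris)),
  value   = unlist(iris[, 1:4], use.names = FALSE)
)

# One panel per feature; within each panel, one boxplot per species.
p <- ggplot(long, aes(x = Species, y = value, colour = Species)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y") +
  labs(y = "measurement (cm)")
print(p)
```

Panels in which the three boxes barely overlap point to features that discriminate species well on their own.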

A detour: large-scale use of boxplots

• The ReportingTools package is an important new contribution that allows developers to synthesize approaches to analysis and visualization for high-level communications
• Target medium is a modern browser with significant client-side capabilities permitting search, sorting, etc.

Summary thus far

• Visuals are to help with interpretation
  – Interpretation is difficult without clear sense of objectives of underlying experiment
  – Experiments are often messy … data filtering diagram like CONSORT should be standard practice
• Marginal and pairwise joint distributions are easy to visualize
  – Marginal and pairwise statistics/tests can obscure important patterns
  – Efficient enhancements to tables (ReportingTools)

Principal components reexpression of a multivariate dataset

PC1 is the linear combination of the original variables that has maximum variance among all linear combinations of the original variables with unit-length coefficient vectors; PC2 is the maximum-variance linear combination (MVLC) among all such linear combinations (LC) uncorrelated with PC1, …
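
These defining properties can be checked numerically. The sketch below uses prcomp() on the iris measurements, which is a choice made here rather than the slide's example, and compares PC1 against randomly drawn unit-length linear combinations.

```r
X   <- as.matrix(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = FALSE)

scores <- pca$x                      # the data re-expressed on PC1..PC4
pc_var <- apply(scores, 2, var)      # variances, in decreasing order
print(round(pc_var, 4))

# The component scores are mutually uncorrelated.
print(round(cor(scores), 10))

# No random unit-length linear combination has more variance than PC1.
set.seed(1)
Xc <- scale(X, center = TRUE, scale = FALSE)
rand_var <- replicate(1000, {
  w <- rnorm(ncol(X))
  w <- w / sqrt(sum(w^2))            # normalize to unit length
  var(as.vector(Xc %*% w))
})
print(max(rand_var) <= pc_var[1])
```

The unit-length constraint matters: without it, the variance of a linear combination could be made arbitrarily large simply by scaling the coefficients.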

Interpreting the “loading” on PCs
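
In prcomp() output the loadings are the columns of the rotation matrix. As a small sketch (again with iris as a stand-in dataset), each PC score is the centered data multiplied by the corresponding loading vector, so a loading says how much each original variable contributes to that component.

```r
pca      <- prcomp(iris[, 1:4], center = TRUE, scale. = FALSE)
loadings <- pca$rotation             # rows: variables, columns: PC1..PC4
print(round(loadings, 3))

# Each loading vector has unit length.
print(colSums(loadings^2))

# Rebuild the PC1 scores by hand from the centered data and the loadings.
Xc <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
manual_pc1 <- as.vector(Xc %*% loadings[, "PC1"])
print(all.equal(manual_pc1, as.vector(pca$x[, "PC1"])))
```

The sign of a loading vector is arbitrary, so interpretation should rest on relative magnitudes and on which variables share a sign.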

Heatmaps: distances, clustering criteria, annotation
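
Each of the three ingredients named in the title is an explicit argument to base R's heatmap(). The sketch below makes them visible; the Manhattan distance, average linkage, and colour coding are arbitrary choices made here to show that the display is tunable.

```r
X <- scale(as.matrix(iris[, 1:4]))   # standardize each measurement
species_col <- c("red", "green3", "blue")[as.integer(iris$Species)]

hm <- heatmap(
  X,
  distfun       = function(m) dist(m, method = "manhattan"),   # distance
  hclustfun     = function(d) hclust(d, method = "average"),   # clustering criterion
  RowSideColors = species_col,                                 # annotation
  scale = "none", labRow = NA, margins = c(8, 2)
)
```

Changing distfun or hclustfun reorders rows and columns, so two heatmaps of the same matrix can look quite different.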

Under the hood with dendrograms: c1 = hclust(dist(iris[,1:4]))
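
The object returned by that call can be inspected directly. The sketch below looks at its merge history and cuts the tree; cutting at three groups and tabulating against species are illustrative choices made here, not from the slide.

```r
c1 <- hclust(dist(iris[, 1:4]))      # Euclidean distance, complete linkage (defaults)

print(c1$method)
print(head(c1$merge))    # negative entries are single observations,
                         # positive entries refer to earlier merge steps
print(head(c1$height))   # dissimilarity at which each merge happened

plot(as.dendrogram(c1), leaflab = "none", main = "iris, complete linkage")

# Cut the tree into three groups and compare with the known species.
groups <- cutree(c1, k = 3)
print(table(groups, iris$Species))
```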

Summary

• PCA is widely used for “dimension reduction”, “eigengene” reexpression, QA, removal of extraneous variation
• Cluster analyses are commonly used and underlie many prominent displays/analyses
  – Highly tunable
  – R and other tools hide complexity: “don’t believe in magic”, know how to open the box

“Grammar of graphics”

Deploying grammar of graphics directly on an expression archive

In the brixvis2013 lab (optional)

• Function clquad() reexpresses a piece of a cluster analysis of cell-cycle expression trajectories
• Function clpts() breaks up a cluster into its constituent trajectories
• Neither is written for general use, but both help to illustrate exposure of variability at different scales for a “holistic” workflow step

Some examples with ggbio package: autoplot method sensitive to input class

ggbio comments

• Containers for genomic annotation and experimental results have autoplot methods
• ggplot2 factorization of visualization tasks allows programmatically efficient embellishments
• You can go your own way

Lightweight containers for substantial experimental data

Simple syntax for CCLE surveys

Genetics of topotecan sensitivity

Summary of CCLE visualization support

• Regularities in data structure across cell lines identified for retention in hierarchical object structure
• plot() methods defined for inputs at different levels of the hierarchy
• Use of ggplot2 infrastructure permits immediate use of tunable statistical visualization patterns, factorization of embellishments
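
The idea of plot() methods at different levels of an object hierarchy can be illustrated with a toy example. Everything below is hypothetical: the class names, constructors, and the synthetic dose-response formula are invented for illustration and are not the CCLE objects from the talk, which are not shown in the transcript. S3 classes are used here for brevity.

```r
# Lower level: one cell line with a synthetic (made-up) dose-response curve.
new_cell_line <- function(name, dose, ic50) {
  structure(list(name = name, dose = dose,
                 response = 1 / (1 + dose / ic50)),
            class = "toy_cell_line")
}

# Upper level: a collection of cell lines.
new_collection <- function(...) {
  structure(list(lines = list(...)), class = "toy_collection")
}

# plot() for a single line: one curve.
plot.toy_cell_line <- function(x, ...) {
  plot(x$dose, x$response, log = "x", type = "b", ylim = c(0, 1),
       xlab = "dose", ylab = "relative response", main = x$name, ...)
  invisible(x)
}

# plot() for the collection: all curves overlaid on shared axes.
plot.toy_collection <- function(x, ...) {
  dose <- x$lines[[1]]$dose
  resp <- sapply(x$lines, function(l) l$response)   # one column per line
  matplot(dose, resp, log = "x", type = "b", pch = 1, ylim = c(0, 1),
          xlab = "dose", ylab = "relative response", ...)
  legend("topright", legend = sapply(x$lines, function(l) l$name),
         col = seq_along(x$lines), lty = seq_along(x$lines))
  invisible(x)
}

dose <- 10^seq(-3, 1, length.out = 8)
a   <- new_cell_line("lineA", dose, ic50 = 0.05)
b   <- new_cell_line("lineB", dose, ic50 = 1)
enc <- new_collection(a, b)

plot(a)     # dispatches to plot.toy_cell_line
plot(enc)   # dispatches to plot.toy_collection
```

The caller writes plot(obj) in both cases; the class of the object decides which display is appropriate.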

Two levels of encyclopedia structure

Envoi: Variant-TF-Phenotype network structures

Conclusions

• R/Bioconductor provide substantial infrastructure for flexible approaches to visualization
• Emphasis: statistical integrity, reproducibility, acknowledgment of variability, uncertainty
• Productive approach: separately conceptualize the underlying data structure, visualization objectives, and code to render the information in a tunable, extensible way