Comparative Genomics and Visualisation – Part 1
Leighton Pritchard

Upload: leighton-pritchard

Post on 26-Jan-2015

DESCRIPTION

Slides from a Comparative Genomics and Visualisation course (part 1) presented at the University of Dundee, 7th March 2014. Other materials are available at GitHub (https://github.com/widdowquinn/Teaching)

TRANSCRIPT

Page 1: Comparative Genomics and Visualisation - Part 1

Comparative Genomics and Visualisation – Part 1

Leighton Pritchard

Page 2: Comparative Genomics and Visualisation - Part 1

Part 1
• What is comparative genomics?
• Levels of genome comparison
  - bulk, whole sequence, features
• A Brief History of Comparative Genomics
  - experimental comparative genomics
• Computational Comparative Genomics
  - Bulk properties
  - Whole genome comparisons

Part 2
• Genome feature comparisons

Page 3: Comparative Genomics and Visualisation - Part 1

What is Comparative Genomics?

The combination of genomic data and comparative and evolutionary biology to address questions of genome structure, evolution and function.

Page 4: Comparative Genomics and Visualisation - Part 1

What is Comparative Genomics?

“Nothing in biology makes sense, except in the light of evolution”

Theodosius Dobzhansky

Page 5: Comparative Genomics and Visualisation - Part 1

Why Comparative Genomics?
• Genomes describe heritable characteristics
• Related organisms share ancestral genomes
• Functional elements encoded in genomes are common to related organisms
• Functional understanding of model systems (E. coli, A. thaliana, D. melanogaster) can be transferred to non-model systems on the basis of genome comparisons
• Genome comparisons can be informative, even for distantly-related organisms

Page 6: Comparative Genomics and Visualisation - Part 1

Why Comparative Genomics?
• BUT:
  - Context: epigenetics, tissue differentiation, mesoscale systems, etc.
  - Phenotypic plasticity: responses to temperature, stress, environment, etc.

Page 7: Comparative Genomics and Visualisation - Part 1

Why Comparative Genomics?
• Genomic differences can underpin phenotypic (morphological or physiological) differences.
• Where phenotypes or other organism-level properties are known, comparison of genomes may give mechanistic or functional insight into differences (e.g. GWAS).
• Genome comparisons aid identification of functional elements on the genome.
• Studying genomic changes reveals evolutionary processes and constraints.

 

Page 8: Comparative Genomics and Visualisation - Part 1

Why Comparative Genomics?
[Figure, adapted from Hardison (2003) PLoS Biol. doi:10.1371/journal.pbio.0000058. Labels: species; time; contemporary organisms]
• Comparison within species (e.g. isolate-level – or even within individuals): which genome features may account for unique characteristics of organisms/tumours? Epigenetics in an individual.

Page 9: Comparative Genomics and Visualisation - Part 1

Why Comparative Genomics?
[Figure labels: genus; time; contemporary organisms]
• Comparison within genus (e.g. species-level): what genome features show evidence of selective pressure, and in which species?

Page 10: Comparative Genomics and Visualisation - Part 1

Why Comparative Genomics?
[Figure labels: subgroup; time; contemporary organisms]
• Comparison within subgroup (e.g. genus-level): what is the core set of genome features that defines a subgroup or genus?

Page 11: Comparative Genomics and Visualisation - Part 1

The E. coli long-term evolution experiment
• Run by the Lenski lab, Michigan State University, since 1988
  - http://myxo.css.msu.edu/ecoli/
• 12 flasks, citrate usage selection
• 50,000 generations of Escherichia coli!
  - Cultures propagated every day
  - Every 500 generations (75 days), mixed-population samples stored
  - Mean fitness estimated at 500 generation intervals
Jeong et al. (2009) J. Mol. Biol. doi:10.1016/j.jmb.2009.09.052
Barrick et al. (2009) Nature doi:10.1038/nature08480
Wiser et al. (2013) Science. doi:10.1126/science.1243357

Page 12: Comparative Genomics and Visualisation - Part 1

Comparative Genomics in the News
Sankararaman et al. (2014) Nature. doi:10.1038/nature12961
• Neanderthal alleles:
  - Aid adaptation outwith Africa
  - Associated with disease risk
  - Reduce male fertility

Page 13: Comparative Genomics and Visualisation - Part 1

Levels of Genome Comparison

Genomes are complex, and can be compared on a range of conceptual levels - both practically and in silico.

Page 14: Comparative Genomics and Visualisation - Part 1

Three broad levels of comparison
• Bulk Properties
  - chromosome/plasmid counts and sizes
  - nucleotide content, etc.
• Whole Genome Sequence
  - sequence similarity
  - organisation of genomic regions (synteny), etc.
• Genome Features/Functional Components
  - numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  - organisation of features (synteny, operons, regulons, etc.)
  - complements of features
  - selection pressure, etc.

Page 15: Comparative Genomics and Visualisation - Part 1

A Brief History of Experimental Comparative Genomics

You don’t have to sequence genomes to compare them (but it helps).

Page 16: Comparative Genomics and Visualisation - Part 1

Genome Comparisons Predate NGS
• Sequence data was not always cheap and abundant
• Practical, experimental genome comparisons were needed

Page 17: Comparative Genomics and Visualisation - Part 1

Bulk  Genome  Property  Comparisons  

Values  calculated  for  individual  genomes,  and  subsequently  compared.  

Page 18: Comparative Genomics and Visualisation - Part 1

Bulk Genome Properties
• Large-scale summary measurements
• Measure genomes independently – compare values later
  - Number of chromosomes
  - Ploidy
  - Chromosome size
  - Nucleotide (A, C, G, T) frequency/percentage

Page 19: Comparative Genomics and Visualisation - Part 1

Chromosome Counts/Size
• The chromosome counts/ploidy of organisms can vary widely
  - Escherichia coli: 1 (but plasmids…)
  - Rice (Oryza sativa): 24 (but mitochondria, plastids etc…)
  - Human (Homo sapiens): 46, diploid
  - Adders-tongue (Ophioglossum reticulatum): up to 1260
  - Domestic (but not wild) wheat somatic cells hexaploid, gametes haploid
• Physical genome size (related to sequence length) can also vary greatly
• Genome size and chromosome count do not indicate organism ‘complexity’
• Still surprises to be found in physical study of chromosomes! (e.g. Hi-C)
Kamisugi et al. (1993) Chromosome Res. 1(3): 189-96
Wang et al. (2013) Nature Rev Genet. doi:10.1038/nrg3375

Page 20: Comparative Genomics and Visualisation - Part 1

Nucleotide Content
• Experimental approaches for accurate measurement
  - e.g. use radiolabelled monophosphates, calculate proportions using chromatography
Karl (1980) Microbiol. Rev. 44(4) 739-796
Krane et al. (1991) Nucl. Acids Res. doi:10.1093/nar/19.19.5181

Page 21: Comparative Genomics and Visualisation - Part 1

Whole  Genome  Comparisons  

Comparisons of one whole or draft genome with another (or many others)

Page 22: Comparative Genomics and Visualisation - Part 1

Whole Genome Comparisons
• Requires two genomes: “reference” and “comparator”
  - Experiment produces a comparative result, dependent on the choice of genomes
• Methods mostly based around direct or indirect DNA hybridisation
  - DNA-DNA hybridisation
  - Comparative Genomic Hybridisation (CGH)
  - Array Comparative Genomic Hybridisation (aCGH)

Page 23: Comparative Genomics and Visualisation - Part 1

DNA-DNA Hybridisation (DDH)
• Several methods based around the same principle
  1. Denature a mixture of genomic DNA from organisms A and B
  2. Allow to anneal – hybrids result (reassociation ≈ similarity)
Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1

Page 24: Comparative Genomics and Visualisation - Part 1

DNA-DNA Hybridisation (DDH)
• Several methods - same principle
  1. Find homoduplex Tm1
  2. Denature reference, comparator gDNA + mix
  3. Allow to anneal – hybrids result (reassociation ≈ similarity), find heteroduplex Tm2
  4. ∆Tm = Tm1 – Tm2
  5. High ∆Tm implies greater genomic difference (fewer H-bonds)
• Proxy for sequence similarity
Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1

Page 25: Comparative Genomics and Visualisation - Part 1

DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in 1980s: Homo shares more recent common ancestor with Pan than with Gorilla (this was previously in dispute)
Sibley & Ahlquist (1984) J. Mol. Evol. doi:10.1007/BF02101980

Page 26: Comparative Genomics and Visualisation - Part 1

Comparative Genomic Hybridisation
• Two genomes: “reference” and “test” are labelled (red and green – a bad convention to choose, for visualisation), then hybridised against a third “normal” genome
• Differences in red/green intensity mapped by microscopy correspond to relative relationship of reference and test to “normal” genome
• Comparisons within species (or individual, for tumours); copy number variations (CNV)
• Labour-intensive, low-resolution

Page 27: Comparative Genomics and Visualisation - Part 1

Comparative Genomic Hybridisation
• Image analysis required – intensity along medial axis.
• Epigenetics: hybridising methylated DNA
Kallioniemi et al. (1992) Science doi:10.1126/science.1359641
Fraga et al. (2005) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0500398102

Page 28: Comparative Genomics and Visualisation - Part 1

Array Comparative Genomic Hybridisation
• Uses DNA microarrays: thousands of short DNA probes (genome fragments) immobilised on a surface
• gDNA, cDNA, etc. fluorescently-labelled and hybridised to the array
• Smaller sample sizes cf. CGH, automatable, high-throughput, high-res
• Identifies copy number variation (CNV) and segmental duplication
Pollack et al. (1999) Nat. Genet. doi:10.1038/12640

Page 29: Comparative Genomics and Visualisation - Part 1

Genome  Feature  Comparisons  

Comparisons  on  the  basis  of  a  restricted  set  of  genome  features  

Page 30: Comparative Genomics and Visualisation - Part 1

Chromosomal Rearrangements
• Genomes are dynamic, and undergo large-scale changes
• Hybridisation used to map genome rearrangement/duplication
  - Separate chromosomes electrophoretically
  - Apply single gene hybridising probes
  - Reciprocal hybridisations indicate translocations
Fischer et al. (2000) Nature. doi:10.1038/35013058

Page 31: Comparative Genomics and Visualisation - Part 1

Diagnostic PCR/MLST
• Define a set of regions (usually genes):
  - conserved enough that PCR primers can be designed to amplify the same region in multiple organisms
  - and: divergent enough that hybridising probes can distinguish between groups
  - or: sequence the amplification products
• Sequence variants given numbers
• Number profiles define groups
• Track evolution by minimum spanning trees (MST)
• http://pubmlst.org/
Maiden et al. (2006) Ann. Rev. Microbiol. doi:10.1146/annurev.micro.59.030804.121325
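The "sequence variants given numbers, number profiles define groups" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the locus names, toy sequences and both helper functions are hypothetical, and it is not how PubMLST itself assigns allele numbers.

```python
def assign_profiles(isolates, loci):
    """Number sequence variants per locus; return one allele profile per isolate.

    isolates: dict of isolate name -> dict of locus -> amplified sequence.
    loci: ordered list of locus names (hypothetical names in this sketch).
    """
    allele_numbers = {locus: {} for locus in loci}  # locus -> {sequence: number}
    profiles = {}
    for name, sequences in isolates.items():
        profile = []
        for locus in loci:
            known = allele_numbers[locus]
            seq = sequences[locus]
            # A previously unseen variant gets the next free number at this locus
            if seq not in known:
                known[seq] = len(known) + 1
            profile.append(known[seq])
        profiles[name] = tuple(profile)
    return profiles


def group_by_profile(profiles):
    """Group isolates that share an identical allele-number profile."""
    groups = {}
    for name, profile in profiles.items():
        groups.setdefault(profile, []).append(name)
    return groups


# Toy data: three isolates typed at two made-up loci
isolates = {
    "iso1": {"locusA": "ACGT", "locusB": "GGCA"},
    "iso2": {"locusA": "ACGT", "locusB": "GGCA"},
    "iso3": {"locusA": "ACTT", "locusB": "GGCA"},
}
profiles = assign_profiles(isolates, ["locusA", "locusB"])
groups = group_by_profile(profiles)
```

Here iso1 and iso2 share a profile and so fall into one group, while iso3 differs at a single locus.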

Page 32: Comparative Genomics and Visualisation - Part 1

Array Comparative Genomic Hybridisation
• aCGH can also be applied across species for classification/diagnostics:
  - Microarray probes represent genes from one or more organisms
  - “Off-species” gDNA fragmented, labelled, and hybridised
  - Hybridisation ≈ sequence similarity ≈ gene presence
• Heatmap of 217 Staphylococcus aureus isolates on 7-strain array.
  - columns=isolates
  - yellow/red=gene present
  - blue/white/grey=gene absent
  - Lower bars coloured by lineage and host (green=cattle, blue=horse, purple=human)
Sung et al. (2008) Microbiol. doi:10.1099/mic.0.2007/015289-0

Page 33: Comparative Genomics and Visualisation - Part 1

But This Happened…
• High-throughput sequencing

Page 34: Comparative Genomics and Visualisation - Part 1

…And Then It Rained Sequence Data
• Modern high-throughput sequencing (454, Illumina) completely changed the landscape.
• Complete, (mainly) accurate sequence data much cheaper, enabling:
  - more precise sequence comparison
  - novel analyses, insights and visualisations
• Genomic & exomic comparisons
• 19/2/2014 at GOLD:
  - 3,011 “finished” genomes
  - 9,891 “permanent draft” genomes
• 19/2/2014 at NCBI WGS:
  - 17,023 whole genome projects

Page 35: Comparative Genomics and Visualisation - Part 1

…And Then It Rained Sequence Data
• In 2012, GOLD added 3736 genomes, NCBI added 4585
• Mostly prokaryotes (archaea and bacteria)
• We’re a little ahead of Su’s (Scripps, La Jolla) projections
Figures and code from: http://sulab.org/2013/06/sequenced-genomes-per-year/

Page 36: Comparative Genomics and Visualisation - Part 1

Computational Comparative Genomics

Massively enabled by high-throughput sequencing, much more powerful and precise.

Page 37: Comparative Genomics and Visualisation - Part 1

Three broad levels of comparison
• Bulk Properties
  - chromosome/plasmid counts and sizes
  - nucleotide content, etc.
• Whole Genome Sequence
  - sequence similarity
  - organisation of genomic regions (rearrangements), etc.
• Genome Features/Functional Components
  - numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  - organisation of features (synteny, operons, regulons, etc.)
  - complements of features
  - selection pressure, etc.

Page 38: Comparative Genomics and Visualisation - Part 1

Bulk  Genome  Property  Comparisons  

Values  calculated  for  individual  genomes,  and  subsequently  compared.  

Page 39: Comparative Genomics and Visualisation - Part 1

Nucleotide Frequencies/Genome Size
• Very easy to calculate from complete or draft genome sequence
  - (or in a region of genome sequence)
• GC content/chromosome size can be characteristic of an organism
• [ACTIVITY]
  - bacteria_size_gc iPython notebook
  - ipython notebook --pylab inline in bacteria_size directory
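As a rough illustration of how little is needed to get these bulk values from sequence, the sketch below computes total length and %GC in plain Python. It is not taken from the course notebook: the function names and the minimal FASTA reader are assumptions made for this example, and ambiguity symbols such as N are counted in the length but not as G or C.

```python
def read_fasta(path):
    """Yield sequences from a FASTA file (minimal parser, no validation)."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def genome_size_and_gc(sequences):
    """Return (total length, %GC) over an iterable of sequence strings."""
    total = 0
    gc = 0
    for seq in sequences:
        seq = seq.upper()
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
    return total, (100.0 * gc / total if total else 0.0)


# Toy example with two short "replicons" instead of a real genome file
size, gc_percent = genome_size_and_gc(["ATGCGC", "ATAT"])
```

For a real genome, `genome_size_and_gc(read_fasta("genome.fasta"))` would give the same two values, where the file name is a placeholder.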

Page 40: Comparative Genomics and Visualisation - Part 1

Blobology
• Metazoan sequence data can be contaminated by microbial symbionts.
• Host and symbiont DNA have different %GC (and are present in different amounts/coverage)
• Preliminary genome assembly, followed by read mapping
• Plot contig coverage against %GC = Blobology
• http://nematodes.org/bioinformatics/blobology/
Kumar & Blaxter (2011) Symbiosis doi:10.1007/s13199-012-0154-6
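The two axes of such a plot can be derived per contig as sketched below. This is an assumption-laden illustration, not the Blobology code: `blob_points` and its inputs are hypothetical, and coverage is approximated as mapped read bases divided by contig length.

```python
def blob_points(contigs, mapped_bases):
    """Return (name, %GC, mean coverage) for each contig.

    contigs: dict of contig name -> sequence.
    mapped_bases: dict of contig name -> total read bases mapped to it
                  (assumed to come from a separate read-mapping step).
    """
    points = []
    for name, seq in contigs.items():
        seq = seq.upper()
        gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
        coverage = mapped_bases.get(name, 0) / len(seq)
        points.append((name, gc, coverage))
    return points


# Toy assembly: a GC-rich, low-coverage contig and an AT-rich, high-coverage one
contigs = {"c1": "GGCCGGCC", "c2": "ATATATGC"}
mapped_bases = {"c1": 80, "c2": 400}
points = blob_points(contigs, mapped_bases)
```

Plotting the second value against the third for every contig gives the scatter in which host and symbiont contigs tend to separate.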

Page 41: Comparative Genomics and Visualisation - Part 1

Nucleotide k-mers
• Sequence data is required to determine k-mers
• Nucleotide frequencies:
  - A, C, G, T
• Dinucleotide frequencies:
  - AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
• Trinucleotide frequencies:
  - 64 trinucleotides
• k-nucleotide frequencies:
  - 4^k k-mers
• [ACTIVITY]
  - runApp("shiny/nucleotide_frequencies") in RStudio
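Counting k-mers is a sliding-window operation. The sketch below is one way to do it in Python; the function name is an assumption for this example, only one strand is counted, and windows containing symbols other than A, C, G, T are left out of the frequency total.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative frequencies for all 4**k possible k-mers."""
    sequence = sequence.upper()
    # Slide a window of width k along the sequence, one base at a time
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(observed[kmer] for kmer in kmers)
    return {kmer: (observed[kmer] / total if total else 0.0) for kmer in kmers}


freqs = kmer_frequencies("ACGTACGT", 2)
```

With k=2 the result has 16 entries, matching the dinucleotide list above; k=3 gives 64, and so on.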

Page 42: Comparative Genomics and Visualisation - Part 1

k-mer Spectra
• k-mer spectrum:
  - Frequency distribution of observed k-mer counts
• Most species have a unimodal k-mer spectrum
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
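A k-mer spectrum is a histogram of k-mer counts: for each count n, how many distinct k-mers occur n times. The sketch below is a minimal version, assuming single-strand counting and ignoring k-mers that never occur; the published analysis may treat both points differently.

```python
from collections import Counter


def kmer_spectrum(sequence, k):
    """Map each occurrence count to the number of distinct k-mers seen that often."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Histogram of the counts themselves
    return Counter(counts.values())


spectrum = kmer_spectrum("ACGTACGT", 2)
```

In the toy sequence, three dinucleotides (AC, CG, GT) occur twice and one (TA) occurs once.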

Page 43: Comparative Genomics and Visualisation - Part 1

k-mer Spectra
• k-mer spectrum:
  - All mammals tested (and some other species) have a multimodal k-mer spectrum
  - Genomic regions differ in this property
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108

Page 44: Comparative Genomics and Visualisation - Part 1

Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0

Page 45: Comparative Genomics and Visualisation - Part 1

Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
• Original method emulates physical experiment:
  1. break genome into 1020nt fragments
  2. align fragments using BLASTN
  3. ANI = mean identity of all BLASTN matches with >30% identity over 70% alignable length
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
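The three steps can be sketched as below, with the BLASTN run itself left out. This is one reading of the thresholds on the slide, not the reference implementation: the tuple layout for hits, both function names, and the choice to test ">30% identity" against the whole fragment length are assumptions.

```python
def fragment(sequence, size=1020):
    """Cut a sequence into consecutive fragments of `size` nt (last may be shorter)."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def ani_from_hits(hits, min_identity=0.30, min_coverage=0.70):
    """Mean % identity over the fragment matches that pass both filters.

    hits: iterable of (identical_bases, alignment_length, fragment_length),
          one tuple per fragment's best BLASTN match (hypothetical layout).
    """
    kept = []
    for identical, aln_len, frag_len in hits:
        overall_identity = identical / frag_len  # identity over the whole fragment
        coverage = aln_len / frag_len            # alignable fraction of the fragment
        if overall_identity > min_identity and coverage >= min_coverage:
            kept.append(100.0 * identical / aln_len)
    return sum(kept) / len(kept) if kept else None


pieces = fragment("A" * 2500)
hits = [(969, 1020, 1020), (900, 1000, 1020), (200, 400, 1020)]
ani = ani_from_hits(hits)
```

The third toy hit is discarded because it is both short and poorly matched, so only the first two contribute to the mean.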

Page 46: Comparative Genomics and Visualisation - Part 1

Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
• ANIm and TETRA introduced (2009)
• ANIm:
  1. Align sequences using NUCmer
  2. ANI = mean %identity of matches
• TETRA:
  1. Calculate tetranucleotide frequencies
  2. Determine each tetramer deviation from expectation (Z-score)
  3. TETRA = Pearson correlation coefficient of tetramer Z-scores
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
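The TETRA steps can be sketched as follows. The slide does not say how "expectation" is computed, so this example assumes a maximal-order Markov estimate from the flanking trinucleotide and central dinucleotide counts, with an approximate variance; it counts one strand only, and JSpecies may differ in these details. All function names are illustrative.

```python
import math
from collections import Counter
from itertools import product


def count_kmers(seq, k):
    """Count overlapping k-mers on one strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-score for each of the 256 tetranucleotides (assumed Markov expectation)."""
    seq = seq.upper()
    c2, c3, c4 = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = []
    for tetramer in ("".join(p) for p in product("ACGT", repeat=4)):
        left, right = c3[tetramer[:3]], c3[tetramer[1:]]
        middle = c2[tetramer[1:3]]
        if middle == 0:
            zscores.append(0.0)
            continue
        expected = left * right / middle
        variance = expected * (middle - left) * (middle - right) / middle ** 2
        observed = c4[tetramer]
        zscores.append((observed - expected) / math.sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def tetra(seq_a, seq_b):
    """TETRA-style score: correlation of the two genomes' tetramer Z-scores."""
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))
```

Because the score depends only on composition, two unrelated genomes with similar tetranucleotide usage can still correlate highly.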

Page 47: Comparative Genomics and Visualisation - Part 1

Average Nucleotide Identity (ANI)
• ANIb discards useful information that ANIm retains
• TETRA reflects bulk genome properties rather than selection on sequence
  - Data for Anaplasma marginale (3), A. phagocytophilum (4), A. centrale (1)
• TETRA scores are prone to false positives; ANIb scores are prone to false negatives

Page 48: Comparative Genomics and Visualisation - Part 1

Average Nucleotide Identity (ANI)
• JSpecies (http://www.imedea.uib.es/jspecies/)
  - WebStart
  - java -jar -Xms1024m -Xmx1024m jspecies1.2.1.jar
• Python script
  - scripts/calculate_ani.py
• [ACTIVITY]
  - average_nucleotide_identity/README.md Markdown
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106

Page 49: Comparative Genomics and Visualisation - Part 1

Diagnostic PCR/MLST
• PCR/MLST still cheap
  - (but for how much longer?)
• Use whole genomes to identify unique/diagnostic regions for PCR/MLST
Slezak et al. (2003) Brief. Bioinf. doi:10.1093/bib/4.2.133
Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498

Page 50: Comparative Genomics and Visualisation - Part 1

Whole  Genome  Sequence  Comparisons  

Comparisons of one whole or draft genome sequence with another (or many others)

Page 51: Comparative Genomics and Visualisation - Part 1

Whole  Genome  Alignment  

Page 52: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• Which genomes should you align? (or not bother aligning)
• For reasonable analysis, genomes should:
  - derive from a sufficiently recent common ancestor: so that homologous regions can be identified.
  - derive from a sufficiently distant common ancestor: so that sufficiently “interesting” changes are likely to have occurred
  - help answer your biological question:
    - is your question organism or phenotype specific?
    - are you investigating a process?
• This may be more involved for metazoans (vertebrates, arthropods, nematodes, etc.) than prokaryotes…

Page 53: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• Naïve alignment algorithms (e.g. Needleman-Wunsch/Smith-Waterman) are not appropriate:
  - Do not handle rearrangements
  - Computationally expensive on large sequences
• Many whole-genome alignment algorithms proposed, including:
  - LASTZ (http://www.bx.psu.edu/~rsharris/lastz/)
  - BLAT (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  - Mugsy (http://mugsy.sourceforge.net/)
  - megaBLAST (http://www.ncbi.nlm.nih.gov/blast/html/megablast.html)
  - MUMmer (http://mummer.sourceforge.net/)
  - LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  - WABA, etc…
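To see why the naïve approach is expensive, the sketch below computes a Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) are arbitrary choices for illustration. The nested loops visit every pair of positions, so two 5 Mbp genomes would need about 2.5 × 10^13 cell updates, and a single global alignment still cannot represent a rearrangement.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming, keeping two rows in memory."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two bases, gap in b, gap in a
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


score = needleman_wunsch_score("ACGT", "AGT")
```

Recovering the alignment itself, rather than just its score, would also need the full table or a divide-and-conquer variant.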

Page 54: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• BLAT
  - BLAT is broadly similar to BLAST
  - Main differences:
    - optimised to find only exact or near-exact matches, for speed
    - indexes the subject genome, retains the index and scans the query
    - connects homologous match regions into a single alignment (BLAST reports them separately)
    - reports mRNA match intron-exon boundaries exactly (BLAST tends to extend)
  - Advantages: fast; exact exon boundaries; UCSC integration
  - Disadvantages: does not find more remote/very divergent matches
Kent (2002) Genome Res. doi:10.1101/gr.229202

Page 55: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• megaBLAST
  - Optimised for speed over BLASTN (see http://www.ncbi.nlm.nih.gov/blast/Why.shtml):
    - genome-level searches
    - queries on large sequence sets
    - long alignments of very similar sequence (sequencing errors/SNPs)
  - Uses Zhang et al. (2000) greedy algorithm
  - Concatenates queries to improve performance (“query packing”)
    - NOTE: this is good practice for large query sets!
  - Two modes: megaBLAST, and discontinuous megaBLAST (dc-megablast)
    - dc-megablast intended for more divergent sequences
Zhang et al. (2000) J. Comp. Biol. 7(1-2) 203-14
Korf et al. (2003) “BLAST”, O’Reilly & Associates, Sebastopol, CA

Page 56: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• MUMmer
  - Uses suffix trees for pattern matching: very fast even for large sequences
    - Finds maximal exact matches
    - Memory use depends only on reference sequence size
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12

Page 57: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• MUMmer
  - Uses suffix trees for pattern matching: very fast even for large sequences
    - Finds maximal exact matches
    - Memory use depends only on reference sequence size
• Suffix Tree:
  - Can be constructed and searched in O(n) time
  - Useful algorithms are nontrivial
  - BANANA$
    - B followed by ANANA$ only
    - A followed by $, NA$, NANA$
    - N followed by A$, ANA$
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
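The BANANA$ example can be reproduced by listing every suffix and grouping it by its first character, which is what the top level of a suffix tree encodes. The sketch below is deliberately naïve: it takes O(n^2) time and space and is not one of the linear-time constructions the slide refers to. The function name is an assumption.

```python
def suffix_branches(text):
    """Group all suffixes of `text` by first character; values are what follows it."""
    branches = {}
    for start in range(len(text)):
        suffix = text[start:]
        branches.setdefault(suffix[0], []).append(suffix[1:])
    return {first: sorted(rest) for first, rest in branches.items()}


branches = suffix_branches("BANANA$")
```

Any substring of the text is a prefix of some suffix, which is why walking such a structure from the root answers pattern-matching queries.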

Page 58: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• MUMmer
  - Process:
    - 1) Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs - though not always unique)
    - 2) Cluster into alignment anchors
    - 3) Extend between anchors to produce a final gapped alignment
  - Very flexible approach: a suite of programs (mummer, nucmer, promer, …)
    - nucleotide and “conceptual protein” (more sensitive) alignments
    - used for genome comparisons, assembly scaffolding, repeat detection, etc.
    - forms the basis for other aligners/assemblers, e.g. Mugsy, AMOS
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
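Step 1 can be illustrated with a brute-force search for matches that are maximal (cannot be extended left or right) and unique in both sequences. This sketch is quadratic or worse and uses no suffix tree, so it shows only what a MUM is, not how MUMmer finds them. The function names and `min_length` default are assumptions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of `pattern` in `text`, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(ref, query, min_length=4):
    """Return (ref_start, query_start, length) for each maximal unique match."""
    mums = []
    for i in range(len(ref)):
        for j in range(len(query)):
            if ref[i] != query[j]:
                continue
            # Not left-maximal: the same match was already seen one step earlier
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue
            length = 0
            while (i + length < len(ref) and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            match = ref[i:i + length]
            if (length >= min_length
                    and count_occurrences(ref, match) == 1
                    and count_occurrences(query, match) == 1):
                mums.append((i, j, length))
    return mums


mums = find_mums("GGACGTCATT", "CCACGTCAGG")
```

The two toy sequences share the 6-mer ACGTCA at position 2 in each, and nothing else of length 4 or more.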

Page 59: Comparative Genomics and Visualisation - Part 1

Whole Genome Alignment
• [ACTIVITY]
  - whole_genome_alignments_A.md Markdown
  - https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_A.md

Page 60: Comparative Genomics and Visualisation - Part 1

Multiple Genome Alignment
• Several tools:
  - Mugsy (http://mugsy.sourceforge.net/)
  - MLAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  - TBA/MultiZ (http://www.bx.psu.edu/miller_lab/)
  - Mauve (http://gel.ahabs.wisc.edu/mauve/)
• Positional homology vs. glocal

Page 61: Comparative Genomics and Visualisation - Part 1

Multiple Genome Alignment
• LAGAN: rapid alignment of two homologous genome sequences
  - Generate local alignments (anchors, B)
  - Construct rough global map (maximal-scoring ordered subset, C)
    - Join anchors that lie within a threshold distance, the same way
  - Compute global alignment by dynamic programming (D)
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
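The "maximal-scoring ordered subset" step can be sketched as a chaining problem: pick anchors whose positions increase in both genomes so that their summed score is as large as possible. The example below is a simple O(n^2) dynamic programme over point anchors with positive scores. It ignores anchor lengths and gap penalties, and is an illustration rather than LAGAN's own chaining code.

```python
def best_chain(anchors):
    """Highest-scoring chain of anchors increasing in both coordinates.

    anchors: list of (ref_pos, query_pos, score) tuples (hypothetical layout).
    """
    if not anchors:
        return []
    anchors = sorted(anchors)
    best = [score for _, _, score in anchors]  # best chain score ending at each anchor
    back = [None] * len(anchors)               # predecessor in that best chain
    for i, (xi, yi, si) in enumerate(anchors):
        for j in range(i):
            xj, yj, _ = anchors[j]
            if xj < xi and yj < yi and best[j] + si > best[i]:
                best[i] = best[j] + si
                back[i] = j
    # Trace back from the best-scoring end point
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(anchors[i])
        i = back[i]
    return chain[::-1]


anchors = [(10, 12, 5), (20, 80, 9), (30, 35, 4), (50, 55, 6), (70, 60, 3)]
chain = best_chain(anchors)
```

The anchor at (20, 80) scores well on its own but is out of order relative to the rest, so it is left out of the rough global map.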

Page 62: Comparative Genomics and Visualisation - Part 1

Multiple Genome Alignment
• MLAGAN: multiple genome alignment of k genomes in k-1 alignment steps, using a phylogenetic tree (CLUSTAL-like):
  - Make rough global maps between each pair of sequences (step C in LAGAN)
  - Progressive multiple alignment with anchors (iterated)
    1. Perform global alignment between closest pair of sequences with LAGAN: alignments are “multi-sequences”
    2. Find rough global maps of this multi-sequence to all other multi-sequences.
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603

Page 63: Comparative Genomics and Visualisation - Part 1

Human-Mouse-Rat Alignment
• Three-way progressive alignment, identifying:
  - Homologous (H/M/R), rodent-only (M/R) and human-mouse or human-rat (H/M, H/R) homologous regions
• Three-way synteny
• Initial alignments by BLAT
• Syntenous regions aligned with LAGAN
[Figure: synteny mapped to rat genome]
Brudno et al. (2004) Genome Res. doi:10.1101/gr.2067704

Page 64: Comparative Genomics and Visualisation - Part 1

Draft Genome Alignment

Page 65: Comparative Genomics and Visualisation - Part 1

Draft Genome Alignment
• Whole genome alignments useful for scaffolding assemblies
  - High-throughput sequence assemblies come in fragments (contigs)
  - Contigs can sometimes be ordered if paired reads or long read technologies are used
  - Can also align to a known reference genome
• MUMmer
  - Can use NUCmer or, for more distant relations, PROmer
• Mauve/Progressive Mauve
  - http://gel.ahabs.wisc.edu/mauve/
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

Page 66: Comparative Genomics and Visualisation - Part 1

Mauve
• Mauve’s alignment algorithm
  1. Find local alignments (multi-MUMs – seed & extend)
  2. Construct phylogenetic guide tree from multi-MUMs
  3. Select subset of multi-MUMs as anchors.
    - Partition anchors into Local Collinear Blocks (LCBs) – consistently-ordered subsets
  4. Perform recursive anchoring to identify further anchors
  5. Perform progressive alignment (similar to CLUSTAL), against guide tree
• Mauve Contig Mover (MCM) for ordering contigs
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

Page 67: Comparative Genomics and Visualisation - Part 1

Mauve
• Mauve alignment of LCBs in nine enterobacterial genomes
  - Rearrangement of homologous backbone sequence
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

Page 68: Comparative Genomics and Visualisation - Part 1

Draft Genome Alignment
• [OPTIONAL ACTIVITY] (useful for exercise)
  - Alignment and reordering of draft genome contigs
  - whole_genome_alignments_B.md Markdown
  - https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_B.md
• [ACTIVITY]
  - Visualisation of whole genome alignment with Biopython
  - biopython_visualisation iPython notebook

Page 69: Comparative Genomics and Visualisation - Part 1

Collinearity and Synteny
• Rearrangements may occur post-speciation
• Different species still exhibit conservation of sequence similarity and order
  - Two elements are collinear if they lie in the same linear sequence
  - Two elements are syntenous (syntenic) if:
    - (orig.) they lie on the same chromosome
    - (mod.) conservation of blocks of order within the same chromosome
• Signs of evolutionary constraints, including synteny, may indicate functional genome regions
• More about this in Part 2, related to genome features

Page 70: Comparative Genomics and Visualisation - Part 1

Syntenous
• example1.png from biopython_visualisation activity

Page 71: Comparative Genomics and Visualisation - Part 1

Nonsyntenous
• example2.png from biopython_visualisation activity

Page 72: Comparative Genomics and Visualisation - Part 1

Whole Genome Duplication
• Puffer fish Tetraodon nigroviridis (smallest known vertebrate genome)
  - Whole-genome duplication, subsequent to divergence from mammals.
  - Ancestral vertebrate genome inferred to have 12 chromosomes.
[Figure: duplicated genes (ExoFish) on 21 chromosomes]
Jaillon et al. (2004) Nature doi:10.1038/nature03025

Page 73: Comparative Genomics and Visualisation - Part 1

VISTA, mVISTA, VISTA-Point
• Alignment/visualisation tools:
  - http://genome.lbl.gov/vista/index.shtml
• mVISTA: align and compare submitted sequences (up to 2Mbp)
• VISTA-Point: visualise precomputed alignments
Frazer et al. (2004) Nucl. Acids Res. doi:10.1093/nar/gkh458

Page 74: Comparative Genomics and Visualisation - Part 1

UCSC
• http://genome.ucsc.edu/
• Many vertebrate/invertebrate model genomes
Kent et al. (2002) Genome Res. doi:10.1101/gr.229102

Page 75: Comparative Genomics and Visualisation - Part 1

Conclusion
• Physical and computational genome comparisons:
  - Similar biological questions -> similar concepts
• Lots of sequence data in modern biology
• Conservation ≈ evolutionary constraint
• Many choices of algorithms/analysis software
• Many choices of visualisation software/tools
• Coming in Part 2: genomic functional elements

Page 76: Comparative Genomics and Visualisation - Part 1

Credits
• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.

Page 77: Comparative Genomics and Visualisation - Part 1

Nucleotide Content
• A, C, G, T composition
  - Varies between, and within genomes
  - staining varies across genomes, due to variation in GC content
• “isochores”: regions with little internal GC variation (homogeneous)
  - long a point of discussion – difficult to define
• In humans:
  - L1, L2 isochores: low GC (≲41%)
  - H1, H2, H3 isochores: high GC (≳41%)
• Imprecise bulk measurement
[Figure: hybridisation of H3 isochore to human genome]
Sadoni et al. (1999) J. Cell Biol. doi:10.1083/jcb.146.6.1211

Page 78: Comparative Genomics and Visualisation - Part 1

DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in 1980s
• Not without controversy:
  - Suggestions of data manipulation (see here)
  - Close evolutionary relationships difficult to resolve due to paralogy (more on paralogy later…)
• Still hanging on as a de facto “gold standard” in microbiological taxonomic classification.
Sibley & Ahlquist (1987) J. Mol. Evol. doi:10.1007/BF02111285

Page 79: Comparative Genomics and Visualisation - Part 1

Finding isochores
• Isochores: homogeneous regions of %GC content
• Easy to find with windowed (100kbp) %GC calculation, from sequenced genomes.
• 3200 isochores characterised in the human genome, consistent with 5 levels (L1, L2, H1, H2, H3) found by staining/hybridisation.
Costantini et al. (2006) Genome Res. doi:10.1101/gr.4910606
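A windowed %GC calculation of this kind can be sketched as below. The function name is an assumption, windows are non-overlapping, and any trailing partial window is dropped. Merging neighbouring windows into isochores is a further step that is not shown.

```python
def windowed_gc(sequence, window=100_000):
    """Return (window start, %GC) for consecutive non-overlapping windows."""
    sequence = sequence.upper()
    values = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append((start, 100.0 * gc / window))
    return values


# Toy example with a 4 nt window so the result is easy to check by eye
profile = windowed_gc("GGCC" + "ATAT" + "GCAT", window=4)
```

On a real chromosome the default 100 kbp window matches the scale quoted on the slide.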

Page 80: Comparative Genomics and Visualisation - Part 1

Comparative Genomic Hybridisation
• Two genomes: “reference” and “test” labelled (red and green), then hybridised against a “normal” genome
• semiquantitative:
  - Red: loss (<2 copies) in tumour
  - Green: gain (3-4 copies) in tumour
  - Amplifications (>4 copies) in BOLD
  - Cases with the same Copy Number Aberration (CNA) are numbered
De Bortoli et al. (2006) BMC Cancer doi:10.1186/1471-2407-6-223

Page 81: Comparative Genomics and Visualisation - Part 1

Array Comparative Genomic Hybridisation
• Early approaches took a threshold score (present/absent)
• Later approaches used known reference genome sequence context (HMMs, synteny) to improve presence/absence calls
• No hybridisation = “absent” or “divergent”?
• Not nearly as good as sequencing directly!
Pritchard et al. (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000473

Page 82: Comparative Genomics and Visualisation - Part 1

k-mer Spectra
• k-mer spectrum:
  - CpG suppression (CGs are uncommon in vertebrate genomes) explains multimodality, but (by simulation) only when in combination with a particular %GC
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108