Comparative Genomics and Visualisation – Part 1

Leighton Pritchard

Part 1
• What is comparative genomics?
• Levels of genome comparison
  • bulk, whole sequence, features
• A Brief History of Comparative Genomics
  • experimental comparative genomics
• Computational Comparative Genomics
  • Bulk properties
  • Whole genome comparisons

Part 2
• Genome feature comparisons

What is Comparative Genomics?

The combination of genomic data with comparative and evolutionary biology to address questions of genome structure, evolution and function.

What is Comparative Genomics?

“Nothing in biology makes sense, except in the light of evolution”

Theodosius Dobzhansky

Why Comparative Genomics?

• Genomes describe heritable characteristics
• Related organisms share ancestral genomes
• Functional elements encoded in genomes are common to related organisms
• Functional understanding of model systems (E. coli, A. thaliana, D. melanogaster) can be transferred to non-model systems on the basis of genome comparisons
• Genome comparisons can be informative, even for distantly-related organisms

Why Comparative Genomics?

• BUT:
  • Context: epigenetics, tissue differentiation, mesoscale systems, etc.
  • Phenotypic plasticity: responses to temperature, stress, environment, etc.

Why Comparative Genomics?

• Genomic differences can underpin phenotypic (morphological or physiological) differences.
• Where phenotypes or other organism-level properties are known, comparison of genomes may give mechanistic or functional insight into differences (e.g. GWAS).
• Genome comparisons aid identification of functional elements on the genome.
• Studying genomic changes reveals evolutionary processes and constraints.


Adapted from Hardison (2003) PLoS Biol. doi:10.1371/journal.pbio.0000058

[Figure: contemporary organisms; axis labels: species, time]

• Comparison within species (e.g. isolate-level, or even within individuals): which genome features may account for unique characteristics of organisms/tumours? Epigenetics in an individual.

[Figure; axis labels: genus, time]

• Comparison within genus (e.g. species-level): which genome features show evidence of selective pressure, and in which species?

[Figure; axis labels: subgroup, time]

• Comparison within subgroup (e.g. genus-level): what is the core set of genome features that defines a subgroup or genus?

The E. coli long-term evolution experiment

• Run by the Lenski lab, Michigan State University, since 1988
  • http://myxo.css.msu.edu/ecoli/
• 12 flasks, citrate usage selection
• 50,000 generations of Escherichia coli!
  • Cultures propagated every day
  • Every 500 generations (75 days), mixed-population samples stored
  • Mean fitness estimated at 500-generation intervals

Jeong et al. (2009) J. Mol. Biol. doi:10.1016/j.jmb.2009.09.052
Barrick et al. (2009) Nature doi:10.1038/nature08480
Wiser et al. (2013) Science doi:10.1126/science.1243357

Comparative Genomics in the News

Sankararaman et al. (2014) Nature doi:10.1038/nature12961

• Neanderthal alleles:
  • Aid adaptation outside Africa
  • Associated with disease risk
  • Reduce male fertility

Levels of Genome Comparison

Genomes are complex, and can be compared on a range of conceptual levels, both practically and in silico.

Three broad levels of comparison

• Bulk Properties
  • chromosome/plasmid counts and sizes
  • nucleotide content, etc.
• Whole Genome Sequence
  • sequence similarity
  • organisation of genomic regions (synteny), etc.
• Genome Features/Functional Components
  • numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  • organisation of features (synteny, operons, regulons, etc.)
  • complements of features
  • selection pressure, etc.

A Brief History of Experimental Comparative Genomics

You don’t have to sequence genomes to compare them (but it helps).

Genome Comparisons Predate NGS

• Sequence data was not always cheap and abundant
• Practical, experimental genome comparisons were needed

Bulk Genome Property Comparisons

Values calculated for individual genomes, and subsequently compared.

Bulk Genome Properties

• Large-scale summary measurements
• Measure genomes independently, and compare values later
  • Number of chromosomes
  • Ploidy
  • Chromosome size
  • Nucleotide (A, C, G, T) frequency/percentage

Chromosome Counts/Size

• The chromosome counts/ploidy of organisms can vary widely
  • Escherichia coli: 1 (but plasmids…)
  • Rice (Oryza sativa): 24 (but mitochondria, plastids, etc.…)
  • Human (Homo sapiens): 46, diploid
  • Adder's-tongue (Ophioglossum reticulatum): up to 1260
  • Domestic (but not wild) wheat: somatic cells hexaploid, gametes haploid
• Physical genome size (related to sequence length) can also vary greatly
• Genome size and chromosome count do not indicate organism ‘complexity’
• Still surprises to be found in the physical study of chromosomes! (e.g. Hi-C)

Kamisugi et al. (1993) Chromosome Res. 1(3): 189-96
Wang et al. (2013) Nature Rev. Genet. doi:10.1038/nrg3375

Nucleotide Content

• Experimental approaches for accurate measurement
  • e.g. use radiolabelled monophosphates, calculate proportions using chromatography

Karl (1980) Microbiol. Rev. 44(4) 739-796
Krane et al. (1991) Nucl. Acids Res. doi:10.1093/nar/19.19.5181

Whole Genome Comparisons

Comparisons of one whole or draft genome with another (or many others)

Whole Genome Comparisons

• Requires two genomes: “reference” and “comparator”
  • Experiment produces a comparative result, dependent on the choice of genomes
• Methods mostly based around direct or indirect DNA hybridisation
  • DNA-DNA hybridisation
  • Comparative Genomic Hybridisation (CGH)
  • Array Comparative Genomic Hybridisation (aCGH)

DNA-DNA Hybridisation (DDH)

• Several methods based around the same principle
  1. Denature a mixture of genomic DNA from organisms A and B
  2. Allow to anneal: hybrids result (reassociation ≈ similarity)

Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1

DNA-DNA Hybridisation (DDH)

• Several methods, same principle
  1. Find the homoduplex melting temperature, Tm1
  2. Denature reference and comparator gDNA, and mix
  3. Allow to anneal: hybrids result (reassociation ≈ similarity); find the heteroduplex melting temperature, Tm2
  4. ∆Tm = Tm1 - Tm2
  5. A high ∆Tm implies greater genomic difference (fewer H-bonds)
• Proxy for sequence similarity

Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1

DNA-DNA Hybridisation (DDH)

• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s: Homo shares a more recent common ancestor with Pan than with Gorilla (this was previously in dispute)

Sibley & Ahlquist (1984) J. Mol. Evol. doi:10.1007/BF02101980

Comparative Genomic Hybridisation

• Two genomes, “reference” and “test”, are labelled (red and green: a bad convention to choose, for visualisation), then hybridised against a third “normal” genome
• Differences in red/green intensity, mapped by microscopy, correspond to the relative relationship of reference and test to the “normal” genome
• Comparisons within species (or individual, for tumours); copy number variations (CNV)
• Labour-intensive, low-resolution

Comparative Genomic Hybridisation

• Image analysis required: intensity along medial axis.

Kallioniemi et al. (1992) Science doi:10.1126/science.1359641
Fraga et al. (2005) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0500398102

Epigenetics: hybridising methylated DNA

Array Comparative Genomic Hybridisation

• Uses DNA microarrays: thousands of short DNA probes (genome fragments) immobilised on a surface
• gDNA, cDNA, etc. fluorescently labelled and hybridised to the array
• Smaller sample sizes than CGH; automatable, high-throughput, high-resolution
• Identifies copy number variation (CNV) and segmental duplication

Pollack et al. (1999) Nat. Genet. doi:10.1038/12640

Genome Feature Comparisons

Comparisons on the basis of a restricted set of genome features

Chromosomal Rearrangements

• Genomes are dynamic, and undergo large-scale changes
• Hybridisation used to map genome rearrangement/duplication
  • Separate chromosomes electrophoretically
  • Apply single-gene hybridising probes
  • Reciprocal hybridisations indicate translocations

Fischer et al. (2000) Nature doi:10.1038/35013058

Diagnostic PCR/MLST

• Define a set of regions (usually genes):
  • conserved enough that PCR primers can be designed to amplify the same region in multiple organisms
  • and: divergent enough that hybridising probes can distinguish between groups
  • or: sequence the amplification products
• Sequence variants given numbers
• Number profiles define groups
• Track evolution by minimum spanning trees (MST)
• http://pubmlst.org/

Maiden et al. (2006) Ann. Rev. Microbiol. doi:10.1146/annurev.micro.59.030804.121325
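The numbering of sequence variants into profiles can be sketched in a few lines of Python. This is a minimal illustration only: the isolate names, loci (`adk`, `gyrB`) and sequences are made up, and a real MLST scheme takes its allele numbers from a curated database (such as PubMLST) rather than assigning them on the fly as this hypothetical `assign_profiles` helper does.

```python
def assign_profiles(isolates):
    """Map each isolate to a tuple of allele numbers, one per locus.

    isolates: dict of isolate name -> dict of locus name -> sequence.
    Allele numbers are assigned in order of first appearance (illustrative only).
    """
    allele_ids = {}   # locus -> {sequence: allele number}
    profiles = {}
    for name, loci in isolates.items():
        profile = []
        for locus in sorted(loci):
            seen = allele_ids.setdefault(locus, {})
            # A new sequence variant gets the next unused number for this locus
            profile.append(seen.setdefault(loci[locus], len(seen) + 1))
        profiles[name] = tuple(profile)
    return profiles


# Hypothetical data: three isolates typed at two loci
isolates = {
    "iso1": {"adk": "ACGT", "gyrB": "TTGA"},
    "iso2": {"adk": "ACGT", "gyrB": "TTGC"},
    "iso3": {"adk": "ACGA", "gyrB": "TTGA"},
}
profiles = assign_profiles(isolates)
```

Isolates with identical profiles fall into the same group; the number of loci at which two profiles differ is the kind of distance a minimum spanning tree could be built on.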

Array Comparative Genomic Hybridisation

• aCGH can also be applied across species for classification/diagnostics:
  • Microarray probes represent genes from one or more organisms
  • “Off-species” gDNA fragmented, labelled, and hybridised
  • Hybridisation ≈ sequence similarity ≈ gene presence
• Heatmap of 217 Staphylococcus aureus isolates on 7-strain array.
  • columns = isolates
  • yellow/red = gene present
  • blue/white/grey = gene absent
  • Lower bars coloured by lineage and host (green = cattle, blue = horse, purple = human)

Sung et al. (2008) Microbiol. doi:10.1099/mic.0.2007/015289-0

But This Happened…

• High-throughput sequencing

…And Then It Rained Sequence Data

• Modern high-throughput sequencing (454, Illumina) completely changed the landscape.
• Complete, (mainly) accurate sequence data became much cheaper, enabling:
  • more precise sequence comparison
  • novel analyses, insights and visualisations
• Genomic & exomic comparisons
• 19/2/2014 at GOLD:
  • 3,011 “finished” genomes
  • 9,891 “permanent draft” genomes
• 19/2/2014 at NCBI WGS:
  • 17,023 whole genome projects

…And Then It Rained Sequence Data

• In 2012, GOLD added 3,736 genomes; NCBI added 4,585
• Mostly prokaryotes (archaea and bacteria)
• We’re a little ahead of Su’s (Scripps, La Jolla) projections

Figures and code from: http://sulab.org/2013/06/sequenced-genomes-per-year/

Computational Comparative Genomics

Massively enabled by high-throughput sequencing; much more powerful and precise.

Three broad levels of comparison

• Bulk Properties
  • chromosome/plasmid counts and sizes
  • nucleotide content, etc.
• Whole Genome Sequence
  • sequence similarity
  • organisation of genomic regions (rearrangements), etc.
• Genome Features/Functional Components
  • numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  • organisation of features (synteny, operons, regulons, etc.)
  • complements of features
  • selection pressure, etc.

Bulk Genome Property Comparisons

Values calculated for individual genomes, and subsequently compared.

Nucleotide Frequencies/Genome Size

• Very easy to calculate from complete or draft genome sequence
  • (or in a region of genome sequence)
• GC content/chromosome size can be characteristic of an organism
• [ACTIVITY]
  • bacteria_size_gc iPython notebook
  • ipython notebook --pylab inline in bacteria_size directory
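As a rough illustration of how little is needed to get these bulk values from sequence, the sketch below computes total length and %GC from FASTA-formatted text. It is not taken from the activity notebook; the function name `genome_size_and_gc` and the toy input are assumptions for illustration, and ambiguity symbols (e.g. N) are simply left out of the %GC denominator.

```python
def genome_size_and_gc(fasta_text):
    """Return (total sequence length, %GC) for FASTA-formatted text.

    All records (chromosomes, plasmids, contigs) are pooled together.
    """
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")   # skip header and empty lines
    ).upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return len(seq), (100.0 * gc / acgt if acgt else 0.0)


# Toy input: one "chromosome" and one "plasmid"
size, gc_percent = genome_size_and_gc(">chr\nGGCC\nAATT\n>plasmid\nGC\n")
```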

Blobology

• Metazoan sequence data can be contaminated by microbial symbionts.
• Host and symbiont DNA have different %GC (and are present in different amounts/coverage)
• Preliminary genome assembly, followed by read mapping
• Plot contig coverage against %GC = Blobology
• http://nematodes.org/bioinformatics/blobology/

Kumar & Blaxter (2011) Symbiosis doi:10.1007/s13199-012-0154-6

Nucleotide k-mers

• Sequence data is required to determine k-mers
• Nucleotide frequencies:
  • A, C, G, T
• Dinucleotide frequencies:
  • AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
• Trinucleotide frequencies:
  • 64 trinucleotides
• k-nucleotide frequencies:
  • 4^k k-mers
• [ACTIVITY]
  • runApp("shiny/nucleotide_frequencies") in RStudio
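For comparison with the Shiny activity, k-mer frequencies can also be sketched in Python. This is an illustrative sketch rather than the activity's own code: `kmer_frequencies` is a hypothetical helper, overlapping k-mers are counted on one strand only, and k-mers containing symbols other than A, C, G, T are ignored.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return the relative frequency of each of the 4^k possible k-mers in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


dinucleotides = kmer_frequencies("ACGTACGT", 2)   # 16 entries
trinucleotides = kmer_frequencies("ACGTACGT", 3)  # 64 entries
```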

k-mer Spectra

• k-mer spectrum:
  • Frequency distribution of observed k-mer counts
  • Most species have a unimodal k-mer spectrum

Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
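A k-mer spectrum in this sense is a histogram of counts: how many distinct k-mers occur once, twice, and so on. The sketch below is a minimal illustration under that reading; `kmer_spectrum` is a hypothetical helper, and it ignores details such as strand handling and normalisation that a published analysis would have to specify.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Return a Counter mapping 'times observed' -> 'number of distinct k-mers'."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(counts.values())


# In ACGTACGT the 2-mers AC, CG and GT occur twice each, and TA occurs once
spectrum = kmer_spectrum("ACGTACGT", 2)
```

Plotting the spectrum for a real genome (count on the x-axis, number of k-mers on the y-axis) is what reveals whether it has one mode or several.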

k-mer Spectra

• k-mer spectrum:
  • All mammals tested (and some other species) have a multimodal k-mer spectrum
  • Genomic regions differ in this property


Average Nucleotide Identity (ANI)

• ANI introduced as a substitute for DDH in 2007:
  • 70% identity (DDH) = “gold standard” prokaryotic species boundary
  • 70% identity (DDH) ≈ 95% identity (ANI)

Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0

• Original method emulates physical experiment:
  1. break genome into 1020 nt fragments
  2. align fragments using BLASTN
  3. ANI = mean identity of all BLASTN matches with >30% identity over 70% alignable length

Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
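The fragment-and-average logic of the three steps above can be sketched without running BLASTN. In this hedged illustration the alignment step is left out: `hits` is an assumed list of (percent identity, alignment length) pairs, one per fragment, standing in for parsed BLASTN output, and the helper names and the reading of the 70% cut-off as a fraction of fragment length are assumptions rather than the published implementation.

```python
def fragment(seq, size=1020):
    """Split seq into consecutive, non-overlapping fragments of exactly `size` nt."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def ani_from_hits(hits, fragment_size=1020, min_identity=30.0, min_coverage=0.7):
    """Mean percent identity over hits that pass the identity and coverage filters.

    hits: iterable of (percent_identity, alignment_length), e.g. parsed from
    tabular BLASTN output (the parsing is not shown here).
    Returns None if no hit passes the filters.
    """
    kept = [
        identity
        for identity, aln_len in hits
        if identity > min_identity and aln_len >= min_coverage * fragment_size
    ]
    return sum(kept) / len(kept) if kept else None


fragments = fragment("A" * 2500)
# Made-up hits: the third is too short, the fourth too divergent
ani = ani_from_hits([(98.0, 1000), (96.0, 900), (99.0, 300), (25.0, 1020)])
```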




• ANIm and TETRA introduced (2009)
• ANIm:
  1. Align sequences using NUCmer
  2. ANI = mean %identity of matches
• TETRA:
  1. Calculate tetranucleotide frequencies
  2. Determine each tetramer's deviation from expectation (Z-score)
  3. TETRA = Pearson correlation coefficient of tetramer Z-scores

Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
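The TETRA steps can be sketched as below. This is an approximation under stated assumptions, not the JSpecies implementation: the expected tetramer count is taken from a maximal-order Markov model (trinucleotide and dinucleotide counts), which is one common formulation; only one strand is counted; and tetramers whose variance estimate is zero are given a Z-score of 0. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def _counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-scores for all 256 tetranucleotides, in a fixed (alphabetical) order."""
    seq = seq.upper()
    c2, c3, c4 = _counts(seq, 2), _counts(seq, 3), _counts(seq, 4)
    zscores = []
    for tet in map("".join, product("ACGT", repeat=4)):
        left, right, mid = c3[tet[:3]], c3[tet[1:]], c2[tet[1:3]]
        if mid == 0:
            zscores.append(0.0)
            continue
        expected = left * right / mid   # Markov-model expectation (assumed model)
        variance = expected * (mid - left) * (mid - right) / (mid * mid)
        zscores.append((c4[tet] - expected) / math.sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def tetra(seq_a, seq_b):
    """TETRA-style score: correlation of the two genomes' tetramer Z-scores."""
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))
```

Because the score depends only on composition, two unrelated genomes with similar tetranucleotide usage can still correlate highly.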

Average Nucleotide Identity (ANI)

• ANIb discards useful information that ANIm retains
• TETRA reflects bulk genome properties rather than selection on sequence
  • Data for Anaplasma marginale (3), A. phagocytophilum (4), A. centrale (1)
• TETRA scores are prone to false positives; ANIb scores are prone to false negatives

Average Nucleotide Identity (ANI)

• JSpecies (http://www.imedea.uib.es/jspecies/)
  • WebStart
  • java -jar -Xms1024m -Xmx1024m jspecies1.2.1.jar
• Python script
  • scripts/calculate_ani.py
• [ACTIVITY]
  • average_nucleotide_identity/README.md Markdown

Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106

Diagnostic PCR/MLST

• PCR/MLST still cheap
  • (but for how much longer?)
• Use whole genomes to identify unique/diagnostic regions for PCR/MLST

Slezak et al. (2003) Brief. Bioinf. doi:10.1093/bib/4.2.133
Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498

Whole Genome Sequence Comparisons

Comparisons of one whole or draft genome sequence with another (or many others)

Whole Genome Alignment

Whole Genome Alignment

• Which genomes should you align? (or not bother aligning)
• For reasonable analysis, genomes should:
  • derive from a sufficiently recent common ancestor, so that homologous regions can be identified
  • derive from a sufficiently distant common ancestor, so that sufficiently “interesting” changes are likely to have occurred
  • help answer your biological question:
    • is your question organism- or phenotype-specific?
    • are you investigating a process?
• This may be more involved for metazoans (vertebrates, arthropods, nematodes, etc.) than prokaryotes…

Whole Genome Alignment

• Naïve alignment algorithms (e.g. Needleman-Wunsch/Smith-Waterman) are not appropriate:
  • Do not handle rearrangements
  • Computationally expensive on large sequences
• Many whole-genome alignment algorithms proposed, including:
  • LASTZ (http://www.bx.psu.edu/~rsharris/lastz/)
  • BLAT (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  • Mugsy (http://mugsy.sourceforge.net/)
  • megaBLAST (http://www.ncbi.nlm.nih.gov/blast/html/megablast.html)
  • MUMmer (http://mummer.sourceforge.net/)
  • LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  • WABA, etc…

Whole Genome Alignment

• BLAT
  • BLAT is broadly similar to BLAST
  • Main differences:
    • optimised to find only exact or near-exact matches, for speed
    • indexes the subject genome, retains the index and scans the query
    • connects homologous match regions into a single alignment (BLAST reports them separately)
    • reports mRNA match intron-exon boundaries exactly (BLAST tends to extend)
  • Advantages: fast; exact exon boundaries; UCSC integration
  • Disadvantages: does not find more remote/very divergent matches

Kent (2002) Genome Res. doi:10.1101/gr.229202

Whole Genome Alignment

• megaBLAST
  • Optimised for speed over BLASTN (see http://www.ncbi.nlm.nih.gov/blast/Why.shtml):
    • genome-level searches
    • queries on large sequence sets
    • long alignments of very similar sequence (sequencing errors/SNPs)
  • Uses Zhang et al. (2000) greedy algorithm
  • Concatenates queries to improve performance (“query packing”)
    • NOTE: this is good practice for large query sets!
  • Two modes: megaBLAST, and discontinuous megaBLAST (dc-megablast)
    • dc-megablast intended for more divergent sequences

Zhang et al. (2000) J. Comp. Biol. 7(1-2) 203-14
Korf et al. (2003) “BLAST”, O’Reilly & Associates, Sebastopol, CA

Whole Genome Alignment

• MUMmer
  • Uses suffix trees for pattern matching: very fast even for large sequences
    • Finds maximal exact matches
    • Memory use depends only on reference sequence size

Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12

  • Suffix Tree:
    • Can be constructed and searched in O(n) time
    • Useful algorithms are nontrivial
    • BANANA$
      • B followed by ANANA$ only
      • A followed by $, NA$, NANA$
      • N followed by A$, ANA$
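The BANANA$ example can be reproduced with a naive suffix trie built from nested dictionaries. This is a teaching sketch only: it takes O(n²) time and space, whereas the linear-time suffix tree constructions referred to above are considerably more involved, and it is not how MUMmer is implemented. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern occurs anywhere in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("BANANA$")
first_letters = sorted(trie)         # ['$', 'A', 'B', 'N']
after_a = sorted(trie["A"])          # ['$', 'N']: A is followed by $ or by N...
has_nan = contains(trie, "NAN")      # True
```

Searching for a pattern walks one edge per character, so lookup time depends on the pattern length and not on the size of the indexed text.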



  • Process:
    1. Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs, though not always unique)
    2. Cluster into alignment anchors
    3. Extend between anchors to produce a final gapped alignment
  • Very flexible approach: a suite of programs (mummer, nucmer, promer, …)
    • nucleotide and “conceptual protein” (more sensitive) alignments
    • used for genome comparisons, assembly scaffolding, repeat detection, etc.
    • forms the basis for other aligners/assemblers, e.g. Mugsy, AMOS
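The first step of the process, finding exact matches that are unique in both sequences, can be illustrated with a naive seed-and-extend sketch. This is not MUMmer's algorithm (which uses a suffix tree and finds all such matches): here matches are seeded only from k-mers that occur exactly once in each sequence, so a unique match whose k-mers are all repeated elsewhere would be missed. The function name, the seed length `k` and the toy sequences are assumptions.

```python
from collections import Counter


def unique_exact_matches(ref, qry, k=4):
    """Return sorted (ref_start, qry_start, length) for maximal exact matches
    that contain at least one k-mer unique in both ref and qry."""

    def unique_kmers(seq):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        return {seq[i:i + k]: i for i in range(len(seq) - k + 1)
                if counts[seq[i:i + k]] == 1}

    ref_index, qry_index = unique_kmers(ref), unique_kmers(qry)
    matches = set()
    for kmer, i in ref_index.items():
        j = qry_index.get(kmer)
        if j is None:
            continue
        start_r, start_q = i, j
        # Extend the seed to the left while the sequences agree
        while start_r > 0 and start_q > 0 and ref[start_r - 1] == qry[start_q - 1]:
            start_r -= 1
            start_q -= 1
        end_r, end_q = i + k, j + k
        # Extend the seed to the right while the sequences agree
        while end_r < len(ref) and end_q < len(qry) and ref[end_r] == qry[end_q]:
            end_r += 1
            end_q += 1
        matches.add((start_r, start_q, end_r - start_r))
    return sorted(matches)


# Toy sequences sharing the 9 nt block ACGTACGAT at offset 3
mums = unique_exact_matches("GGGACGTACGATTT", "CCCACGTACGATAA", k=4)
```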


Whole Genome Alignment

• [ACTIVITY]
  • whole_genome_alignments_A.md Markdown
  • https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_A.md

Multiple Genome Alignment

• Several tools:
  • Mugsy (http://mugsy.sourceforge.net/)
  • MLAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  • TBA/MultiZ (http://www.bx.psu.edu/miller_lab/)
  • Mauve (http://gel.ahabs.wisc.edu/mauve/)
• Positional homology vs. glocal

Multiple Genome Alignment

• LAGAN: rapid alignment of two homologous genome sequences
  • Generate local alignments (anchors, B)
  • Construct rough global map (maximal-scoring ordered subset, C)
    • Join anchors that lie within a threshold distance, the same way
  • Compute global alignment by dynamic programming (D)

Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
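The “maximal-scoring ordered subset” of anchors is a chaining problem, which can be sketched with a simple quadratic-time dynamic programme. This is an illustration of the idea rather than LAGAN's own chaining code: the anchor layout `(start_in_A, start_in_B, score)` is an assumption, anchors are treated as points instead of intervals, and no gap penalties are applied.

```python
def best_chain(anchors):
    """Return the highest-scoring subset of anchors whose positions increase
    in both genomes. anchors: list of (pos_a, pos_b, score) tuples."""
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [a[2] for a in anchors]   # best chain score ending at each anchor
    prev = [-1] * len(anchors)       # back-pointers for traceback
    for i, (a_i, b_i, score_i) in enumerate(anchors):
        for j in range(i):
            a_j, b_j, _ = anchors[j]
            if a_j < a_i and b_j < b_i and best[j] + score_i > best[i]:
                best[i] = best[j] + score_i
                prev[i] = j
    # Trace back from the best-scoring chain end
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


# Made-up anchors: (20, 50, 3) is out of order relative to the others
chain = best_chain([(10, 12, 5), (20, 50, 3), (30, 25, 4), (40, 40, 6)])
```

The anchors retained in the chain bound the regions in which the final, more expensive alignment then has to be computed.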

Multiple Genome Alignment

• MLAGAN: multiple genome alignment of k genomes in k-1 alignment steps, using a phylogenetic tree (CLUSTAL-like):
  • Make rough global maps between each pair of sequences (step C in LAGAN)
  • Progressive multiple alignment with anchors (iterated)
    1. Perform global alignment between closest pair of sequences with LAGAN: alignments are “multi-sequences”
    2. Find rough global maps of this multi-sequence to all other multi-sequences.


Human-Mouse-Rat Alignment

• Three-way progressive alignment, identifying:
  • Homologous (H/M/R), rodent-only (M/R) and human-mouse or human-rat (H/M, H/R) homologous regions
• Three-way synteny

[Figure: synteny mapped to rat genome]

Initial alignments by BLAT
Syntenous regions aligned with LAGAN

Draft Genome Alignment

Draft Genome Alignment

• Whole genome alignments useful for scaffolding assemblies
  • High-throughput sequence assemblies come in fragments (contigs)
  • Contigs can sometimes be ordered if paired reads or long read technologies are used
  • Can also align to a known reference genome
• MUMmer
  • Can use NUCmer or, for more distant relations, PROmer
• Mauve/Progressive Mauve
  • http://gel.ahabs.wisc.edu/mauve/

Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

Mauve

• Mauve’s alignment algorithm
  1. Find local alignments (multi-MUMs: seed & extend)
  2. Construct phylogenetic guide tree from multi-MUMs
  3. Select subset of multi-MUMs as anchors.
     • Partition anchors into Locally Collinear Blocks (LCBs): consistently-ordered subsets
  4. Perform recursive anchoring to identify further anchors
  5. Perform progressive alignment (similar to CLUSTAL), against guide tree
• Mauve Contig Mover (MCM) for ordering contigs


Mauve

• Mauve alignment of LCBs in nine enterobacterial genomes
  • Rearrangement of homologous backbone sequence


Draft Genome Alignment

• [OPTIONAL ACTIVITY] (useful for exercise)
  • Alignment and reordering of draft genome contigs
  • whole_genome_alignments_B.md Markdown
  • https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_B.md
• [ACTIVITY]
  • Visualisation of whole genome alignment with Biopython
  • biopython_visualisation iPython notebook

Collinearity and Synteny

• Rearrangements may occur post-speciation
• Different species still exhibit conservation of sequence similarity and order
  • Two elements are collinear if they lie in the same linear sequence
  • Two elements are syntenous (syntenic) if:
    • (orig.) they lie on the same chromosome
    • (mod.) conservation of blocks of order within the same chromosome
• Signs of evolutionary constraints, including synteny, may indicate functional genome regions
• More about this in Part 2, related to genome features
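One simple way to make “same linear sequence” concrete is to check whether matched elements appear in the same order in both genomes. The sketch below is a hedged illustration with a hypothetical helper, `is_collinear`; it takes pairs of positions for homologous elements and deliberately ignores strand, so an inverted block is reported as non-collinear.

```python
def is_collinear(pairs):
    """True if elements ordered by position in genome A are also in strictly
    increasing order in genome B. pairs: list of (pos_in_a, pos_in_b)."""
    ordered = sorted(pairs)
    b_positions = [pos_b for _, pos_b in ordered]
    return all(x < y for x, y in zip(b_positions, b_positions[1:]))


conserved_order = is_collinear([(1, 10), (5, 20), (9, 30)])
rearranged = is_collinear([(1, 10), (5, 30), (9, 20)])
```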

Syntenous

• example1.png from biopython_visualisation activity

Nonsyntenous

• example2.png from biopython_visualisation activity

Whole Genome Duplication

• Puffer fish Tetraodon nigroviridis (smallest known vertebrate genome)
  • Whole-genome duplication, subsequent to divergence from mammals.
  • Ancestral vertebrate genome inferred to have 12 chromosomes.

[Figure: duplicated genes (ExoFish) on 21 chromosomes]

Jaillon et al. (2004) Nature doi:10.1038/nature03025

VISTA, mVISTA, VISTA-Point

• Alignment/visualisation tools:
  • http://genome.lbl.gov/vista/index.shtml
• mVISTA: align and compare submitted sequences (up to 2 Mbp)
• VISTA-Point: visualise precomputed alignments

Frazer et al. (2004) Nucl. Acids Res. doi:10.1093/nar/gkh458

UCSC

• http://genome.ucsc.edu/
• Many vertebrate/invertebrate model genomes

Kent et al. (2002) Genome Res. doi:10.1101/gr.229102

Conclusion

• Physical and computational genome comparisons:
  • Similar biological questions -> similar concepts
• Lots of sequence data in modern biology
• Conservation ≈ evolutionary constraint
• Many choices of algorithms/analysis software
• Many choices of visualisation software/tools
• Coming in Part 2: genomic functional elements

Credits

• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute (http://www.hutton.ac.uk)
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.

Nucleotide Content

• A, C, G, T composition
  • Varies between, and within, genomes
  • staining varies across genomes, due to variation in GC content
• “isochores”: regions with little internal GC variation (homogeneous)
  • long a point of discussion; difficult to define
• In humans:
  • L1, L2 isochores: low GC (≲41%)
  • H1, H2, H3 isochores: high GC (≳41%)
• Imprecise bulk measurement

Sadoni et al. (1999) J. Cell Biol. doi:10.1083/jcb.146.6.1211

[Figure: hybridisation of H3 isochore to human genome]

DNA-DNA Hybridisation (DDH)

• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s:
• Not without controversy:
  • Suggestions of data manipulation
  • Close evolutionary relationships difficult to resolve due to paralogy (more on paralogy later…)
• Still hanging on as a de facto “gold standard” in microbiological taxonomic classification.

Sibley & Ahlquist (1987) J. Mol. Evol. doi:10.1007/BF02111285

Finding isochores

• Isochores: homogeneous regions of %GC content
• Easy to find with windowed (100 kbp) %GC calculation, from sequenced genomes.
• 3200 isochores characterised in the human genome, consistent with 5 levels (L1, L2, H1, H2, H3) found by staining/hybridisation.

Costantini et al. (2006) Genome Res. doi:10.1101/gr.4910606
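A windowed %GC calculation of the kind described can be sketched as follows. This is a minimal illustration, not the method of the cited paper: `windowed_gc` is a hypothetical helper, windows are non-overlapping by default, a trailing partial window is dropped, and the sequence is assumed to be upper-case with no ambiguity symbols.

```python
def windowed_gc(seq, window=100_000, step=None):
    """Return a list of (window_start, %GC) for successive windows along seq."""
    step = step or window
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, 100.0 * gc / window))
    return profile


# Toy sequence: a GC-rich block followed by an AT-rich block, 10 nt windows
profile = windowed_gc("GC" * 5 + "AT" * 5, window=10)
```

Runs of adjacent windows with similar %GC are the candidate homogeneous regions; deciding where one ends and the next begins is the difficult part.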

Comparative Genomic Hybridisation

• Two genomes, “reference” and “test”, labelled (red and green), then hybridised against a “normal” genome
• semiquantitative:
  • Red: loss (<2 copies) in tumour
  • Green: gain (3-4 copies) in tumour
  • Amplifications (>4 copies) in BOLD
  • Cases with the same Copy Number Aberration (CNA) are numbered

De Bortoli et al. (2006) BMC Cancer doi:10.1186/1471-2407-6-223

Array Comparative Genomic Hybridisation

• Early approaches took a threshold score (present/absent)
• Later approaches used known reference genome sequence context (HMMs, synteny) to improve presence/absence calls
• No hybridisation = “absent” or “divergent”?
• Not nearly as good as sequencing directly!

Pritchard et al. (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000473

k-mer Spectra

• k-mer spectrum:
  • CpG suppression (CGs are uncommon in vertebrate genomes) explains multimodality, but (by simulation) only in combination with a particular %GC