comparative genomics and visualisation - part 1
DESCRIPTION
Slides from a Comparative Genomics and Visualisation course (part 1) presented at the University of Dundee, 7th March 2014. Other materials are available at GitHub (https://github.com/widdowquinn/Teaching)
TRANSCRIPT
![Page 1: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/1.jpg)
Comparative Genomics and Visualisation – Part 1
Leighton Pritchard
![Page 2: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/2.jpg)
Part 1
- What is comparative genomics?
- Levels of genome comparison
  - bulk, whole sequence, features
- A Brief History of Comparative Genomics
  - experimental comparative genomics
- Computational Comparative Genomics
  - Bulk properties
  - Whole genome comparisons
Part 2
- Genome feature comparisons
![Page 3: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/3.jpg)
What is Comparative Genomics?
The combination of genomic data and comparative and evolutionary biology to address questions of genome structure, evolution and function.
![Page 4: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/4.jpg)
What is Comparative Genomics?
“Nothing in biology makes sense, except in the light of evolution”
Theodosius Dobzhansky
![Page 5: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/5.jpg)
Why Comparative Genomics?
- Genomes describe heritable characteristics
- Related organisms share ancestral genomes
- Functional elements encoded in genomes are common to related organisms
- Functional understanding of model systems (E. coli, A. thaliana, D. melanogaster) can be transferred to non-model systems on the basis of genome comparisons
- Genome comparisons can be informative, even for distantly-related organisms
![Page 6: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/6.jpg)
Why Comparative Genomics?
- BUT:
  - Context: epigenetics, tissue differentiation, mesoscale systems, etc.
  - Phenotypic plasticity: responses to temperature, stress, environment, etc.
![Page 7: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/7.jpg)
Why Comparative Genomics?
- Genomic differences can underpin phenotypic (morphological or physiological) differences.
- Where phenotypes or other organism-level properties are known, comparison of genomes may give mechanistic or functional insight into differences (e.g. GWAS).
- Genome comparisons aid identification of functional elements on the genome.
- Studying genomic changes reveals evolutionary processes and constraints.
![Page 8: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/8.jpg)
Why Comparative Genomics?
Adapted from Hardison (2003) PLoS Biol. doi:10.1371/journal.pbio.0000058
[Figure labels: species; time; contemporary organisms]
- Comparison within species (e.g. isolate-level, or even within individuals): which genome features may account for unique characteristics of organisms/tumours? Epigenetics in an individual.
![Page 9: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/9.jpg)
Why Comparative Genomics?
[Figure labels: genus; time; contemporary organisms]
- Comparison within genus (e.g. species-level): what genome features show evidence of selective pressure, and in which species?
![Page 10: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/10.jpg)
Why Comparative Genomics?
[Figure labels: subgroup; time; contemporary organisms]
- Comparison within subgroup (e.g. genus-level): what is the core set of genome features that defines a subgroup or genus?
![Page 11: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/11.jpg)
The E. coli long-term evolution experiment
- Run by the Lenski lab, Michigan State University since 1988
  - http://myxo.css.msu.edu/ecoli/
- 12 flasks, citrate usage selection
- 50,000 generations of Escherichia coli!
  - Cultures propagated every day
  - Every 500 generations (75 days), mixed-population samples stored
  - Mean fitness estimated at 500 generation intervals
Jeong et al. (2009) J. Mol. Biol. doi:10.1016/j.jmb.2009.09.052
Barrick et al. (2009) Nature doi:10.1038/nature08480
Wiser et al. (2013) Science. doi:10.1126/science.1243357
![Page 12: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/12.jpg)
Comparative Genomics in the News
Sankararaman et al. (2014) Nature. doi:10.1038/nature12961
- Neanderthal alleles:
  - Aid adaptation outwith Africa
  - Associated with disease risk
  - Reduce male fertility
![Page 13: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/13.jpg)
Levels of Genome Comparison
Genomes are complex, and can be compared on a range of conceptual levels, both practically and in silico.
![Page 14: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/14.jpg)
Three broad levels of comparison
- Bulk Properties
  - chromosome/plasmid counts and sizes,
  - nucleotide content, etc.
- Whole Genome Sequence
  - sequence similarity
  - organisation of genomic regions (synteny), etc.
- Genome Features/Functional Components
  - numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  - organisation of features (synteny, operons, regulons, etc.)
  - complements of features
  - selection pressure, etc.
![Page 15: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/15.jpg)
A Brief History of Experimental Comparative Genomics
You don’t have to sequence genomes to compare them (but it helps).
![Page 16: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/16.jpg)
Genome Comparisons Predate NGS
- Sequence data was not always cheap and abundant
- Practical, experimental genome comparisons were needed
![Page 17: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/17.jpg)
Bulk Genome Property Comparisons
Values calculated for individual genomes, and subsequently compared.
![Page 18: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/18.jpg)
Bulk Genome Properties
- Large-scale summary measurements
- Measure genomes independently, compare values later
  - Number of chromosomes
  - Ploidy
  - Chromosome size
  - Nucleotide (A, C, G, T) frequency/percentage
![Page 19: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/19.jpg)
Chromosome Counts/Size
- The chromosome counts/ploidy of organisms can vary widely
  - Escherichia coli: 1 (but plasmids…)
  - Rice (Oryza sativa): 24 (but mitochondria, plastids etc…)
  - Human (Homo sapiens): 46, diploid
  - Adders-tongue (Ophioglossum reticulatum): up to 1260
  - Domestic (but not wild) wheat somatic cells hexaploid, gametes haploid
- Physical genome size (related to sequence length) can also vary greatly
- Genome size and chromosome count do not indicate organism ‘complexity’
- Still surprises to be found in physical study of chromosomes! (e.g. Hi-C)
Kamisugi et al. (1993) Chromosome Res. 1(3): 189-96
Wang et al. (2013) Nature Rev Genet. doi:10.1038/nrg3375
![Page 20: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/20.jpg)
Nucleotide Content
- Experimental approaches for accurate measurement
  - e.g. use radiolabelled monophosphates, calculate proportions using chromatography
Karl (1980) Microbiol. Rev. 44(4) 739-796
Krane et al. (1991) Nucl. Acids Res. doi:10.1093/nar/19.19.5181
![Page 21: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/21.jpg)
Whole Genome Comparisons
Comparisons of one whole or draft genome with another (or many others)
![Page 22: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/22.jpg)
Whole Genome Comparisons
- Requires two genomes: “reference” and “comparator”
- Experiment produces a comparative result, dependent on the choice of genomes
- Methods mostly based around direct or indirect DNA hybridisation
  - DNA-DNA hybridisation
  - Comparative Genomic Hybridisation (CGH)
  - Array Comparative Genomic Hybridisation (aCGH)
![Page 23: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/23.jpg)
DNA-DNA Hybridisation (DDH)
- Several methods based around the same principle
1. Denature a mixture of genomic DNA from organisms A and B
2. Allow to anneal: hybrids result (reassociation ≈ similarity)
Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1
![Page 24: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/24.jpg)
DNA-DNA Hybridisation (DDH)
- Several methods, same principle
1. Find homoduplex melting temperature Tm1
2. Denature reference and comparator gDNA, and mix
3. Allow to anneal: hybrids result (reassociation ≈ similarity); find heteroduplex melting temperature Tm2
4. ∆Tm = Tm1 – Tm2
5. High ∆Tm implies greater genomic difference (fewer H-bonds)
- Proxy for sequence similarity
Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1
![Page 25: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/25.jpg)
DNA-DNA Hybridisation (DDH)
- Used for taxonomic classification in prokaryotes from 1960s
- Sibley & Ahlquist redefined bird and primate phylogeny with DDH in 1980s: Homo shares more recent common ancestor with Pan than with Gorilla (this was previously in dispute)
Sibley & Ahlquist (1984) J. Mol. Evol. doi:10.1007/BF02101980
![Page 26: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/26.jpg)
Comparative Genomic Hybridisation
- Two genomes, “reference” and “test”, are labelled (red and green: a bad convention to choose, for visualisation), then hybridised against a third “normal” genome
- Differences in red/green intensity mapped by microscopy correspond to relative relationship of reference and test to “normal” genome
- Comparisons within species (or individual, for tumours); copy number variations (CNV)
- Labour-intensive, low-resolution
![Page 27: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/27.jpg)
Comparative Genomic Hybridisation
- Image analysis required: intensity along medial axis.
- Epigenetics: hybridising methylated DNA
Kallioniemi et al. (1992) Science doi:10.1126/science.1359641
Fraga et al. (2005) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0500398102
![Page 28: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/28.jpg)
Array Comparative Genomic Hybridisation
- Uses DNA microarrays: thousands of short DNA probes (genome fragments) immobilised on a surface
- gDNA, cDNA, etc. fluorescently-labelled and hybridised to the array
- Smaller sample sizes cf. CGH, automatable, high-throughput, high-res
- Identifies copy number variation (CNV) and segmental duplication
Pollack et al. (1999) Nat. Genet. doi:10.1038/12640
![Page 29: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/29.jpg)
Genome Feature Comparisons
Comparisons on the basis of a restricted set of genome features
![Page 30: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/30.jpg)
Chromosomal Rearrangements
- Genomes are dynamic, and undergo large-scale changes
- Hybridisation used to map genome rearrangement/duplication
  - Separate chromosomes electrophoretically
  - Apply single gene hybridising probes
  - Reciprocal hybridisations indicate translocations
Fischer et al. (2000) Nature. doi:10.1038/35013058
![Page 31: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/31.jpg)
Diagnostic PCR/MLST
- Define a set of regions (usually genes):
  - conserved enough that PCR primers can be designed to amplify the same region in multiple organisms
  - and:
    - divergent enough that hybridising probes can distinguish between groups
  - or:
    - sequence the amplification products
    - Sequence variants given numbers
    - Number profiles define groups
    - Track evolution by minimum spanning trees (MST)
- http://pubmlst.org/
Maiden et al. (2006) Ann. Rev. Microbiol. doi:10.1146/annurev.micro.59.030804.121325
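The allele-numbering step of MLST can be sketched in a few lines of Python. This is a minimal illustration only: the isolate names and toy sequences are hypothetical, allele numbers are assigned in order of first appearance rather than taken from a curated database such as PubMLST, and the locus names are used purely as labels.

```python
def assign_profiles(isolates):
    """Map each isolate to a tuple of allele numbers, one per locus.

    isolates: dict of isolate name -> dict of locus name -> sequence.
    Allele numbers start at 1 and are assigned per locus in order of
    first appearance (an assumption made for this sketch).
    """
    loci = sorted({locus for seqs in isolates.values() for locus in seqs})
    allele_numbers = {locus: {} for locus in loci}  # locus -> {sequence: number}
    profiles = {}
    for name in sorted(isolates):
        profile = []
        for locus in loci:
            known = allele_numbers[locus]
            seq = isolates[name][locus]
            # Reuse the number for a known variant, otherwise assign the next one
            profile.append(known.setdefault(seq, len(known) + 1))
        profiles[name] = tuple(profile)
    return profiles


def group_by_profile(profiles):
    """Group isolates that share an identical allele-number profile."""
    groups = {}
    for name, profile in profiles.items():
        groups.setdefault(profile, []).append(name)
    return groups


# Hypothetical toy data: three isolates typed at two loci
toy = {
    "isolate_A": {"adk": "ACGT", "gyrB": "TTGA"},
    "isolate_B": {"adk": "ACGT", "gyrB": "TTGC"},
    "isolate_C": {"adk": "ACGT", "gyrB": "TTGA"},
}
profiles = assign_profiles(toy)
groups = group_by_profile(profiles)
print(profiles)
print(groups)
```

Counting the loci at which two profiles differ gives a distance that could then be fed to a minimum spanning tree method.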
![Page 32: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/32.jpg)
Array Comparative Genomic Hybridisation
- aCGH can also be applied across species for classification/diagnostics:
  - Microarray probes represent genes from one or more organisms
  - “Off-species” gDNA fragmented, labelled, and hybridised
  - Hybridisation ≈ sequence similarity ≈ gene presence
- Heatmap of 217 Staphylococcus aureus isolates on 7-strain array.
  - columns=isolates
  - yellow/red=gene present
  - blue/white/grey=gene absent
  - Lower bars coloured by lineage and host (green=cattle, blue=horse, purple=human)
Sung et al. (2008) Microbiol. doi:10.1099/mic.0.2007/015289-0
![Page 33: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/33.jpg)
But This Happened…
- High-throughput sequencing
![Page 34: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/34.jpg)
…And Then It Rained Sequence Data
- Modern high-throughput sequencing (454, Illumina) completely changed the landscape.
- Complete, (mainly) accurate sequence data much cheaper, enabling:
  - more precise sequence comparison
  - novel analyses, insights and visualisations
- Genomic & exomic comparisons
- 19/2/2014 at GOLD:
  - 3,011 “finished” genomes
  - 9,891 “permanent draft” genomes
- 19/2/2014 at NCBI WGS:
  - 17,023 whole genome projects
![Page 35: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/35.jpg)
…And Then It Rained Sequence Data
- In 2012, GOLD added 3736 genomes, NCBI added 4585
- Mostly prokaryotes (archaea and bacteria)
- We’re a little ahead of Su’s (Scripps, La Jolla) projections
Figures and code from: http://sulab.org/2013/06/sequenced-genomes-per-year/
![Page 36: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/36.jpg)
Computational Comparative Genomics
Massively enabled by high-throughput sequencing, much more powerful and precise.
![Page 37: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/37.jpg)
Three broad levels of comparison
- Bulk Properties
  - chromosome/plasmid counts and sizes,
  - nucleotide content, etc.
- Whole Genome Sequence
  - sequence similarity
  - organisation of genomic regions (rearrangements), etc.
- Genome Features/Functional Components
  - numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  - organisation of features (synteny, operons, regulons, etc.)
  - complements of features
  - selection pressure, etc.
![Page 38: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/38.jpg)
Bulk Genome Property Comparisons
Values calculated for individual genomes, and subsequently compared.
![Page 39: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/39.jpg)
Nucleotide Frequencies/Genome Size
- Very easy to calculate from complete or draft genome sequence
  - (or in a region of genome sequence)
- GC content/chromosome size can be characteristic of an organism
- [ACTIVITY]
  - bacteria_size_gc iPython notebook
  - ipython notebook --pylab inline in bacteria_size directory
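As a minimal sketch of how easily these bulk values fall out of sequence data, the following plain-Python example computes total size and GC fraction for a set of sequences. The function names and toy sequences are hypothetical and are not taken from the course notebook; in practice the sequences would be read from a FASTA file.

```python
from collections import Counter


def gc_content(seq):
    """Return the GC fraction of a sequence, ignoring non-ACGT symbols."""
    counts = Counter(seq.upper())
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / acgt


def genome_summary(records):
    """Summarise a genome given as a dict of sequence name -> sequence.

    Each entry might be a chromosome, plasmid, or draft contig.
    """
    total = "".join(records.values())
    return {"size": len(total), "gc": gc_content(total)}


# Hypothetical toy genome: one chromosome and one plasmid
example = {"chromosome": "ATGCGCGCTA", "plasmid": "ATATAT"}
summary = genome_summary(example)
print(summary)
```

Because each genome is summarised independently, the resulting (size, GC) pairs can be collected for many organisms and compared or plotted afterwards.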
![Page 40: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/40.jpg)
Blobology
- Metazoan sequence data can be contaminated by microbial symbionts.
- Host and symbiont DNA have different %GC (and are present in different amounts/coverage)
- Preliminary genome assembly, followed by read mapping
- Plot contig coverage against %GC = Blobology
- http://nematodes.org/bioinformatics/blobology/
Kumar & Blaxter (2011) Symbiosis doi:10.1007/s13199-012-0154-6
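The data behind such a plot can be sketched as one (%GC, coverage) point per contig. The example below is an illustration under stated assumptions, not the Blobology pipeline itself: contig names, sequences and mapped-base totals are hypothetical, and coverage is approximated as mapped read bases divided by contig length.

```python
def blob_points(contigs, mapped_bases):
    """Return (name, percent_gc, coverage) for each contig.

    contigs: dict of contig name -> sequence.
    mapped_bases: dict of contig name -> total read bases mapped to it
    (assumed to come from a separate read-mapping step).
    """
    points = []
    for name, seq in contigs.items():
        s = seq.upper()
        percent_gc = 100.0 * (s.count("G") + s.count("C")) / len(s)
        coverage = mapped_bases.get(name, 0) / len(s)
        points.append((name, percent_gc, coverage))
    return points


# Hypothetical toy assembly: a low-GC, high-coverage host contig and a
# high-GC, low-coverage symbiont contig
contigs = {"host_1": "AATTAATTGC", "symbiont_1": "GCGCGCGCAT"}
mapped = {"host_1": 300, "symbiont_1": 50}
points = blob_points(contigs, mapped)
for name, gc, cov in points:
    print(name, gc, cov)
```

Plotting coverage against %GC for these points would place contigs from different organisms in separate clusters when their composition and abundance differ.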
![Page 41: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/41.jpg)
Nucleotide k-mers
- Sequence data is required to determine k-mers
- Nucleotide frequencies:
  - A, C, G, T
- Dinucleotide frequencies:
  - AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
- Trinucleotide frequencies:
  - 64 trinucleotides
- k-nucleotide frequencies:
  - 4^k k-mers
- [ACTIVITY]
  - runApp("shiny/nucleotide_frequencies") in RStudio
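A short Python sketch makes the 4^k growth concrete. It counts overlapping k-mers on a single strand with a sliding window; the function name and toy sequence are hypothetical, and this is independent of the Shiny activity above.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies for all 4**k possible k-mers.

    Overlapping windows are counted on the given strand only; windows
    containing symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    valid = set("ACGT")
    total = sum(n for kmer, n in counts.items() if set(kmer) <= valid)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


freqs = kmer_frequencies("ACGTACGT", 2)
print(len(freqs))   # number of possible dinucleotides
print(freqs["AC"])  # AC occurs in 2 of the 7 windows
```

Every possible k-mer appears in the result, including those with zero frequency, so vectors from different genomes can be compared position by position.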
![Page 42: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/42.jpg)
k-mer Spectra
- k-mer spectrum:
  - Frequency distribution of observed k-mer counts
- Most species have a unimodal k-mer spectrum
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
![Page 43: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/43.jpg)
k-mer Spectra
- k-mer spectrum:
  - All mammals tested (and some other species) have a multimodal k-mer spectrum
- Genomic regions differ in this property
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
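A k-mer spectrum is a histogram over k-mer counts: for each abundance c, how many distinct k-mers occur exactly c times. The sketch below computes it for a toy sequence; the function name is hypothetical and only one strand is considered.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Return a Counter mapping abundance -> number of distinct k-mers
    observed with that abundance (overlapping windows, single strand)."""
    kmer_counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer_counts.values())


# Toy sequence: AA occurs four times; AC, CG and GT occur once each
spectrum = kmer_spectrum("AAAAACGT", 2)
print(sorted(spectrum.items()))
```

On a real genome, plotting the number of distinct k-mers against abundance shows whether the spectrum has one mode or several.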
![Page 44: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/44.jpg)
Average Nucleotide Identity (ANI)
- ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
![Page 45: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/45.jpg)
Average Nucleotide Identity (ANI)
- ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
- Original method emulates physical experiment:
1. break genome into 1020 nt fragments
2. align fragments using BLASTN
3. ANI = mean identity of all BLASTN matches with >30% identity over 70% alignable length
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
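The three steps above can be sketched as follows. This is a rough illustration, not the published implementation: BLASTN is not run here, so the hits are hypothetical (percent identity, alignment length) pairs standing in for parsed BLASTN output, and discarding the short trailing fragment is an assumption made for simplicity.

```python
def fragment(seq, size=1020):
    """Split a sequence into consecutive fragments of `size` nucleotides.

    A trailing piece shorter than `size` is discarded (an assumption
    of this sketch).
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def ani_from_hits(hits, fragment_len=1020, min_identity=30.0, min_coverage=0.7):
    """Mean percent identity over hits that pass both filters.

    hits: iterable of (percent_identity, alignment_length) pairs, one
    best hit per query fragment. Returns None if no hit qualifies.
    """
    kept = [
        identity
        for identity, aln_len in hits
        if identity > min_identity and aln_len >= min_coverage * fragment_len
    ]
    return sum(kept) / len(kept) if kept else None


fragments = fragment("A" * 2500)
# Hypothetical hits: two pass, one is too short, one has too low identity
hits = [(98.0, 1000), (96.0, 900), (99.0, 400), (25.0, 1020)]
ani = ani_from_hits(hits)
print(len(fragments), ani)
```

In a real analysis each fragment would be searched against the other genome with BLASTN, and the best hit per fragment passed to the filtering step.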
![Page 46: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/46.jpg)
Average Nucleotide Identity (ANI)
- ANI introduced as a substitute for DDH in 2007:
  - 70% identity (DDH) = “gold standard” prokaryotic species boundary
  - 70% identity (DDH) ≈ 95% identity (ANI)
- ANIm and TETRA introduced (2009)
- ANIm:
1. Align sequences using NUCmer
2. ANI = mean %identity of matches
- TETRA:
1. Calculate tetranucleotide frequencies
2. Determine each tetramer deviation from expectation (Z-score)
3. TETRA = Pearson correlation coefficient of tetramer Z-scores
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
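The TETRA steps can be sketched in plain Python. The slides do not say how the expectation is modelled, so this sketch assumes one common choice: a maximal-order Markov model in which the expected count of a tetramer is estimated from its two overlapping trinucleotides and the shared dinucleotide. It also counts one strand only. Treat it as an illustration of the idea rather than a reimplementation of JSpecies.

```python
from collections import Counter
from itertools import product
from math import sqrt


def count_kmers(seq, k):
    """Count overlapping k-mers on the given strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-score for each of the 256 tetranucleotides.

    Assumed model: expected = N(xyz) * N(yzw) / N(yz), with the
    matching variance approximation; Z is 0 where it is undefined.
    """
    seq = seq.upper()
    c2, c3, c4 = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = {}
    for tet in ("".join(p) for p in product("ACGT", repeat=4)):
        n_mid = c2[tet[1:3]]
        n_left, n_right = c3[tet[:3]], c3[tet[1:]]
        if n_mid == 0:
            zscores[tet] = 0.0
            continue
        expected = n_left * n_right / n_mid
        variance = expected * (n_mid - n_left) * (n_mid - n_right) / n_mid ** 2
        zscores[tet] = (c4[tet] - expected) / sqrt(variance) if variance > 0 else 0.0
    return zscores


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def tetra(seq_a, seq_b):
    """Correlation between the tetranucleotide Z-score vectors of two sequences."""
    za, zb = tetra_zscores(seq_a), tetra_zscores(seq_b)
    keys = sorted(za)
    return pearson([za[k] for k in keys], [zb[k] for k in keys])


toy = "AACAAGAACAAT"
print(tetra_zscores(toy)["CAAG"])
print(tetra(toy, toy))
```

Because the score depends only on oligonucleotide composition, no alignment between the two genomes is needed.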
![Page 47: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/47.jpg)
Average Nucleotide Identity (ANI)
- ANIb discards useful information that ANIm retains
- TETRA reflects bulk genome properties rather than selection on sequence
- Data for Anaplasma marginale (3), A. phagocytophilum (4), A. centrale (1)
- TETRA scores are prone to false positives; ANIb scores are prone to false negatives
![Page 48: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/48.jpg)
Average Nucleotide Identity (ANI)
- JSpecies (http://www.imedea.uib.es/jspecies/)
  - WebStart
  - java -jar -Xms1024m -Xmx1024m jspecies1.2.1.jar
- Python script
  - scripts/calculate_ani.py
- [ACTIVITY]
  - average_nucleotide_identity/README.md Markdown
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
![Page 49: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/49.jpg)
Diagnostic PCR/MLST
- PCR/MLST still cheap
  - (but for how much longer?)
- Use whole genomes to identify unique/diagnostic regions for PCR/MLST
Slezak et al. (2003) Brief. Bioinf. doi:10.1093/bib/4.2.133
Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498
![Page 50: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/50.jpg)
Whole Genome Sequence Comparisons
Comparisons of one whole or draft genome sequence with another (or many others)
![Page 51: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/51.jpg)
Whole Genome Alignment
![Page 52: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/52.jpg)
Whole Genome Alignment
- Which genomes should you align? (or not bother aligning)
- For reasonable analysis, genomes should:
  - derive from a sufficiently recent common ancestor: so that homologous regions can be identified.
  - derive from a sufficiently distant common ancestor: so that sufficiently “interesting” changes are likely to have occurred
  - help answer your biological question:
    - is your question organism or phenotype specific?
    - are you investigating a process?
- This may be more involved for metazoans (vertebrates, arthropods, nematodes, etc.) than prokaryotes…
![Page 53: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/53.jpg)
Whole Genome Alignment
- Naïve alignment algorithms (e.g. Needleman-Wunsch/Smith-Waterman) are not appropriate:
  - Do not handle rearrangements
  - Computationally expensive on large sequences
- Many whole-genome alignment algorithms proposed, including:
  - LASTZ (http://www.bx.psu.edu/~rsharris/lastz/)
  - BLAT (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  - Mugsy (http://mugsy.sourceforge.net/)
  - megaBLAST (http://www.ncbi.nlm.nih.gov/blast/html/megablast.html)
  - MUMmer (http://mummer.sourceforge.net/)
  - LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  - WABA, etc…
![Page 54: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/54.jpg)
Whole Genome Alignment
- BLAT
  - BLAT is broadly similar to BLAST
  - Main differences:
    - optimised to find only exact or near-exact matches, for speed
    - indexes the subject genome, retains the index and scans the query
    - connects homologous match regions into a single alignment (BLAST reports them separately)
    - reports mRNA match intron-exon boundaries exactly (BLAST tends to extend)
  - Advantages: fast; exact exon boundaries; UCSC integration
  - Disadvantages: does not find more remote/very divergent matches
Kent (2002) Genome Res. doi:10.1101/gr.229202
![Page 55: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/55.jpg)
Whole Genome Alignment
- megaBLAST
  - Optimised for speed over BLASTN (see http://www.ncbi.nlm.nih.gov/blast/Why.shtml):
    - genome-level searches
    - queries on large sequence sets
    - long alignments of very similar sequence (sequencing errors/SNPs)
  - Uses Zhang et al. (2000) greedy algorithm
  - Concatenates queries to improve performance (“query packing”)
    - NOTE: this is good practice for large query sets!
  - Two modes: megaBLAST, and discontinuous megaBLAST (dc-megablast)
    - dc-megablast intended for more divergent sequences
Zhang et al. (2000) J. Comp. Biol. 7(1-2) 203-14
Korf et al. (2003) “BLAST”, O’Reilly & Associates, Sebastopol, CA
![Page 56: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/56.jpg)
Whole Genome Alignment
- MUMmer
  - Uses suffix trees for pattern matching: very fast even for large sequences
    - Finds maximal exact matches
    - Memory use depends only on reference sequence size
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
![Page 57: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/57.jpg)
Whole Genome Alignment
- MUMmer
  - Uses suffix trees for pattern matching: very fast even for large sequences
    - Finds maximal exact matches
    - Memory use depends only on reference sequence size
- Suffix Tree:
  - Can be constructed and searched in O(n) time
  - Useful algorithms are nontrivial
  - BANANA$
    - B followed by ANANA$ only
    - A followed by $, NA$, NANA$
    - N followed by A$, ANA$
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
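The BANANA$ example can be reproduced by listing every suffix and grouping it by its first character, which corresponds to the first level of branching in the tree. This is a naive quadratic sketch for illustration only, not a linear-time suffix tree construction, and the function names are hypothetical.

```python
def suffix_children(text):
    """Group the suffixes of `text` by first character.

    Returns a dict of character -> sorted list of what follows that
    character across all suffixes starting with it.
    """
    groups = {}
    for i in range(len(text)):
        suffix = text[i:]
        groups.setdefault(suffix[0], set()).add(suffix[1:])
    return {char: sorted(rest) for char, rest in groups.items()}


def occurs(text, pattern):
    """A pattern occurs in text exactly when it is a prefix of some suffix."""
    return any(text[i:].startswith(pattern) for i in range(len(text)))


children = suffix_children("BANANA$")
print(children["B"])
print(children["A"])
print(children["N"])
print(occurs("BANANA$", "NAN"))
```

A real suffix tree stores these shared prefixes once, which is what allows search time to depend on the pattern length rather than on repeated scans of the text.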
![Page 58: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/58.jpg)
Whole Genome Alignment
- MUMmer
  - Process:
    - 1) Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs, though not always unique)
    - 2) Cluster into alignment anchors
    - 3) Extend between anchors to produce a final gapped alignment
  - Very flexible approach: a suite of programs (mummer, nucmer, promer, …)
    - nucleotide and “conceptual protein” (more sensitive) alignments
    - used for genome comparisons, assembly scaffolding, repeat detection, etc.
    - forms the basis for other aligners/assemblers, e.g. Mugsy, AMOS
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
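To make "maximal exact match" concrete, here is a brute-force sketch that reports every exact match between two sequences that cannot be extended to the left or right. It scans all position pairs, so it is far slower than the suffix tree approach MUMmer uses, and it does not check uniqueness; the function name and toy sequences are hypothetical.

```python
def maximal_exact_matches(ref, qry, min_len=3):
    """Return (ref_start, qry_start, length) for each maximal exact match.

    A match is maximal if the characters immediately before it differ
    (or a sequence starts there) and the characters after it differ
    (or a sequence ends there).
    """
    matches = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Skip starts that could be extended to the left
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right as far as the sequences agree
            length = 0
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches


mems = maximal_exact_matches("GATTACAGG", "TTTACAC", min_len=4)
print(mems)
```

Matches like these are the raw material for the anchoring step: a consistent, non-overlapping subset is selected and the gaps between anchors are then aligned.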
![Page 59: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/59.jpg)
Whole Genome Alignment
- [ACTIVITY]
  - whole_genome_alignments_A.md Markdown
  - https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_A.md
![Page 60: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/60.jpg)
Multiple Genome Alignment
- Several tools:
  - Mugsy (http://mugsy.sourceforge.net/)
  - MLAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  - TBA/MultiZ (http://www.bx.psu.edu/miller_lab/)
  - Mauve (http://gel.ahabs.wisc.edu/mauve/)
- Positional homology vs. glocal
![Page 61: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/61.jpg)
Multiple Genome Alignment
- LAGAN: rapid alignment of two homologous genome sequences
  - Generate local alignments (anchors, B)
  - Construct rough global map (maximal-scoring ordered subset, C)
    - Join anchors that lie within a threshold distance, the same way
  - Compute global alignment by dynamic programming (D)
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
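The "maximal-scoring ordered subset" step can be illustrated with a small dynamic programming chain over anchors. This is a simplified quadratic sketch rather than LAGAN's own chaining code: anchors are hypothetical ungapped matches given as (start_a, start_b, length, score), and two anchors may be chained only if the second begins after the first ends in both sequences.

```python
def best_chain(anchors):
    """Return the highest-scoring chain of collinear, non-overlapping anchors.

    anchors: list of (start_a, start_b, length, score) tuples.
    """
    if not anchors:
        return []
    anchors = sorted(anchors)
    n = len(anchors)
    best = [a[3] for a in anchors]  # best chain score ending at each anchor
    prev = [None] * n               # back-pointer to the previous anchor
    for i in range(n):
        start_a, start_b, _, score = anchors[i]
        for j in range(i):
            pa, pb, plen, _ = anchors[j]
            # Anchor j must end before anchor i starts in both sequences
            if pa + plen <= start_a and pb + plen <= start_b:
                if best[j] + score > best[i]:
                    best[i] = best[j] + score
                    prev[i] = j
    # Trace back from the best-scoring end point
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical anchors: (20, 50, 10, 10) lies off the main diagonal
toy_anchors = [(0, 0, 10, 10), (20, 50, 10, 10), (30, 30, 15, 15),
               (60, 70, 10, 10), (15, 12, 5, 5)]
chain = best_chain(toy_anchors)
print(chain)
```

The anchors left out of the chain are those inconsistent with the dominant ordering; the regions between chained anchors are what the final dynamic programming step has to align.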
![Page 62: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/62.jpg)
Multiple Genome Alignment
- MLAGAN: multiple genome alignment of k genomes in k-1 alignment steps, using a phylogenetic tree (CLUSTAL-like):
  - Make rough global maps between each pair of sequences (step C in LAGAN)
  - Progressive multiple alignment with anchors (iterated)
1. Perform global alignment between closest pair of sequences with LAGAN: alignments are “multi-sequences”
2. Find rough global maps of this multi-sequence to all other multi-sequences.
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
![Page 63: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/63.jpg)
Human-Mouse-Rat Alignment
- Three-way progressive alignment, identifying:
  - Homologous (H/M/R), rodent-only (M/R) and human-mouse or human-rat (H/M, H/R) homologous regions
- Three-way synteny
[Figure: synteny mapped to rat genome]
- Initial alignments by BLAT
- Syntenous regions aligned with LAGAN
Brudno et al. (2004) Genome Res. doi:10.1101/gr.2067704
![Page 64: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/64.jpg)
Draft Genome Alignment
![Page 65: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/65.jpg)
Draft Genome Alignment
- Whole genome alignments useful for scaffolding assemblies
- High-throughput sequence assemblies come in fragments (contigs)
- Contigs can sometimes be ordered if paired reads or long read technologies are used
- Can also align to a known reference genome
- MUMmer
  - Can use NUCmer or, for more distant relations, PROmer
- Mauve/Progressive Mauve
  - http://gel.ahabs.wisc.edu/mauve/
Darling et al. (2003) Genome Res. doi:10.1101/gr.2289704
![Page 66: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/66.jpg)
Mauve
• Mauve’s alignment algorithm
1. Find local alignments (multi-MUMs – seed & extend)
2. Construct phylogenetic guide tree from multi-MUMs
3. Select subset of multi-MUMs as anchors.
◦ Partition anchors into Local Collinear Blocks (LCBs) – consistently-ordered subsets
4. Perform recursive anchoring to identify further anchors
5. Perform progressive alignment (similar to CLUSTAL), against guide tree
• Mauve Contig Mover (MCM) for ordering contigs
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704
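The idea of partitioning anchors into consistently-ordered subsets can be illustrated for the simplest case of two genomes. The sketch below is not Mauve's implementation (which handles many genomes and filters low-weight blocks); it assumes anchors are already available as unique `(pos_a, pos_b, strand)` tuples, a layout chosen here for illustration, and simply starts a new block wherever neighbouring anchors in genome A are not also neighbours in genome B.

```python
def partition_into_lcbs(anchors):
    """Group pairwise anchors into locally collinear blocks.

    anchors: list of unique (pos_a, pos_b, strand) tuples, strand +1 or -1.
    Two anchors adjacent in genome A stay in the same block when they are
    also adjacent in genome B, on the same strand, in a consistent direction.
    """
    by_a = sorted(anchors)
    # Rank of each anchor along genome B
    rank_b = {
        anchor: i
        for i, anchor in enumerate(sorted(anchors, key=lambda x: x[1]))
    }
    blocks = []
    for anchor in by_a:
        if blocks:
            prev = blocks[-1][-1]
            same_strand = anchor[2] == prev[2]
            step = rank_b[anchor] - rank_b[prev]
            # Forward anchors advance by +1 in B, reverse anchors by -1
            if same_strand and step == anchor[2]:
                blocks[-1].append(anchor)
                continue
        blocks.append([anchor])
    return blocks


# Made-up anchors: a forward block, an inverted block, then a lone anchor
example_anchors = [
    (100, 5100, 1), (200, 5200, 1), (300, 5300, 1),
    (1000, 900, -1), (1100, 800, -1),
    (2000, 2000, 1),
]
lcbs = partition_into_lcbs(example_anchors)
```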
![Page 67: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/67.jpg)
Mauve
• Mauve alignment of LCBs in nine enterobacterial genomes
• Rearrangement of homologous backbone sequence
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704
![Page 68: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/68.jpg)
Draft Genome Alignment
• [OPTIONAL ACTIVITY] (useful for exercise)
• Alignment and reordering of draft genome contigs
• whole_genome_alignments_B.md Markdown
• https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_B.md
• [ACTIVITY]
• Visualisation of whole genome alignment with Biopython
• biopython_visualisation iPython notebook
![Page 69: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/69.jpg)
Collinearity and Synteny
• Rearrangements may occur post-speciation
• Different species still exhibit conservation of sequence similarity and order
• Two elements are collinear if they lie in the same linear sequence
• Two elements are syntenous (syntenic) if:
◦ (orig.) they lie on the same chromosome
◦ (mod.) they lie within blocks of conserved order on the same chromosome
• Signs of evolutionary constraints, including synteny, may indicate functional genome regions
• More about this in Part 2, related to genome features
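The two definitions above can be sketched in code. This is an illustrative sketch, not a standard tool: the gene names, the `locations` mapping and both function names are hypothetical. The first function tests the original sense of synteny (same chromosome); the second measures how much gene order is conserved between two genomes, as the longest increasing subsequence of shared genes (forward orientation only).

```python
from bisect import bisect_left


def same_chromosome(gene_x, gene_y, locations):
    """Original sense of synteny: both genes lie on one chromosome."""
    return locations[gene_x][0] == locations[gene_y][0]


def largest_collinear_subset(order_a, order_b):
    """Size of the largest set of shared genes in the same relative order.

    Computed as the longest increasing subsequence of genome B ranks,
    read in genome A order.
    """
    rank_b = {gene: i for i, gene in enumerate(order_b)}
    ranks = [rank_b[gene] for gene in order_a if gene in rank_b]
    tails = []  # tails[k]: smallest last rank of an increasing run of length k+1
    for rank in ranks:
        k = bisect_left(tails, rank)
        if k == len(tails):
            tails.append(rank)
        else:
            tails[k] = rank
    return len(tails)


# Made-up example data: gene -> (chromosome, position)
locations = {
    "geneA": ("chr1", 100),
    "geneB": ("chr1", 5000),
    "geneC": ("chr2", 300),
}
order_in_a = ["a", "b", "c", "d", "e"]
order_in_b = ["a", "b", "d", "c", "e"]  # c and d swapped
collinear = largest_collinear_subset(order_in_a, order_in_b)
```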
![Page 70: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/70.jpg)
Syntenous
• example1.png from biopython_visualisation activity
![Page 71: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/71.jpg)
Nonsyntenous
• example2.png from biopython_visualisation activity
![Page 72: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/72.jpg)
Whole Genome Duplication
• Puffer fish Tetraodon nigroviridis (smallest known vertebrate genome)
• Whole-genome duplication, subsequent to divergence from mammals.
• Ancestral vertebrate genome inferred to have 12 chromosomes.
Duplicated genes (ExoFish) on 21 chromosomes
Jaillon et al. (2004) Nature doi:10.1038/nature03025
![Page 73: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/73.jpg)
VISTA, mVISTA, VISTA-Point
• Alignment/visualisation tools:
◦ http://genome.lbl.gov/vista/index.shtml
• mVISTA: align and compare submitted sequences (up to 2Mbp)
• VISTA-Point: visualise precomputed alignments
Frazer et al. (2004) Nucl. Acids Res. doi:10.1093/nar/gkh458
![Page 74: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/74.jpg)
UCSC
• http://genome.ucsc.edu/
• Many vertebrate/invertebrate model genomes
Kent et al. (2002) Genome Res. doi:10.1101/gr.229102
![Page 75: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/75.jpg)
Conclusion
• Physical and computational genome comparisons:
◦ Similar biological questions -> similar concepts
• Lots of sequence data in modern biology
• Conservation ≈ evolutionary constraint
• Many choices of algorithms/analysis software
• Many choices of visualisation software/tools
• Coming in Part 2: genomic functional elements
![Page 76: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/76.jpg)
Credits
• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.
![Page 77: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/77.jpg)
Nucleotide Content
• A, C, G, T composition
• Varies between, and within, genomes
• Staining varies across genomes, due to variation in GC content
• “Isochores”: regions with little internal GC variation (homogeneous)
◦ long a point of discussion – difficult to define
• In humans:
◦ L1, L2 isochores: low GC (≲41%)
◦ H1, H2, H3 isochores: high GC (≳41%)
• Imprecise bulk measurement
Sadoni et al. (1999) J. Cell Biol. doi:10.1083/jcb.146.6.1211
Hybridisation of H3 isochore to human genome
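With a sequenced genome, A, C, G, T composition is a direct count rather than a bulk measurement. A minimal sketch using only the standard library (the function names and the toy sequence are illustrative; ambiguity codes such as N are ignored):

```python
from collections import Counter


def base_composition(sequence):
    """Count A, C, G and T in a sequence, ignoring any other symbols."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_fraction(sequence):
    """Fraction of counted bases (A, C, G, T only) that are G or C."""
    comp = base_composition(sequence)
    total = sum(comp.values())
    return (comp["G"] + comp["C"]) / total if total else 0.0


toy_sequence = "ATGCGCGTTA"
composition = base_composition(toy_sequence)
gc = gc_fraction(toy_sequence)
```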
![Page 78: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/78.jpg)
DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s
• Not without controversy:
◦ Suggestions of data manipulation
◦ Close evolutionary relationships difficult to resolve due to paralogy (more on paralogy later…)
• Still hanging on as a de facto “gold standard” in microbiological taxonomic classification.
Sibley & Ahlquist (1987) J. Mol. Evol. doi:10.1007/BF02111285
![Page 79: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/79.jpg)
Finding isochores
• Isochores: homogeneous regions of %GC content
• Easy to find with windowed (100kbp) %GC calculation, from sequenced genomes.
• 3200 isochores characterised in the human genome, consistent with 5 levels (L1, L2, H1, H2, H3) found by staining/hybridisation.
Costantini et al. (2006) Genome Res. doi:10.1101/gr.4910606
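A windowed %GC calculation can be sketched as below, with each window then assigned to an isochore family. Only the 41% low/high boundary comes from these slides; the other cut-offs (37%, 46%, 53%) are commonly quoted values and are treated here as assumptions. The window size is a parameter so the toy example can use a small one.

```python
# Upper %GC bound for each family. The 41% split is from the slides;
# the remaining cut-offs are assumed, commonly quoted values.
FAMILY_UPPER_BOUNDS = [
    ("L1", 37.0), ("L2", 41.0), ("H1", 46.0), ("H2", 53.0), ("H3", 100.0),
]


def windowed_gc(sequence, window=100_000):
    """Yield (start, percent_GC) for consecutive non-overlapping windows.

    A trailing partial window is dropped; non-ACGT symbols count as non-GC.
    """
    seq = sequence.upper()
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield start, 100.0 * gc / window


def isochore_family(percent_gc):
    """Assign a %GC value to the first family whose upper bound it meets."""
    for name, upper in FAMILY_UPPER_BOUNDS:
        if percent_gc <= upper:
            return name
    return "H3"


# Toy sequence with three 20 bp windows of very different composition
toy = "AT" * 10 + "GC" * 10 + "ATGC" * 5
profile = list(windowed_gc(toy, window=20))
families = [isochore_family(pct) for _, pct in profile]
```

Merging neighbouring windows that fall in the same family would give candidate isochore boundaries.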
![Page 80: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/80.jpg)
Comparative Genomic Hybridisation
• Two genomes: “reference” and “test” labelled (red and green), then hybridised against a “normal” genome
• Semiquantitative:
◦ Red: loss (<2 copies) in tumour
◦ Green: gain (3-4 copies) in tumour
◦ Amplifications (>4 copies) in BOLD
• Cases with the same Copy Number Aberration (CNA) are numbered
De Bortoli et al. (2006) BMC Cancer doi:10.1186/1471-2407-6-223
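The figure legend above amounts to a simple classification rule. A minimal sketch, assuming an integer copy-number estimate is already available for each region (the function name and the "normal" label for two copies are illustrative, not taken from the paper):

```python
def call_copy_number(copies):
    """Classify an integer copy-number estimate using the legend above."""
    if copies < 2:
        return "loss"           # shown in red
    if copies == 2:
        return "normal"         # assumed label for the diploid state
    if copies <= 4:
        return "gain"           # shown in green
    return "amplification"      # shown in bold


calls = [call_copy_number(n) for n in (0, 1, 2, 3, 4, 5, 8)]
```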
![Page 81: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/81.jpg)
Array Comparative Genomic Hybridisation
• Early approaches took a threshold score (present/absent)
• Later approaches used known reference genome sequence context (HMMs, synteny) to improve presence/absence calls
• No hybridisation = “absent” or “divergent”?
• Not nearly as good as sequencing directly!
Pritchard et al. (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000473
![Page 82: Comparative Genomics and Visualisation - Part 1](https://reader033.vdocument.in/reader033/viewer/2022052522/54c63d024a7959c9388b4726/html5/thumbnails/82.jpg)
k-mer Spectra
• k-mer spectrum:
• CpG suppression (CG dinucleotides are uncommon in vertebrate genomes) explains multimodality, but (by simulation) only in combination with a particular %GC
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
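A k-mer spectrum can be sketched as a histogram of how many distinct k-mers occur once, twice, and so on. The sketch below uses only the standard library and a toy sequence; k-mers that never occur are not represented. It also computes the conventional CpG observed/expected ratio, (number of CG × sequence length) / (number of C × number of G), as one way to quantify CpG suppression. Function names are illustrative.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence (one strand only)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_spectrum(sequence, k):
    """Map occurrence count n -> number of distinct k-mers seen n times."""
    return Counter(kmer_counts(sequence, k).values())


def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = sequence.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    n_cg = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return n_cg * len(seq) / (n_c * n_g) if n_c and n_g else 0.0


toy = "ACGTACGTAC"
spectrum = kmer_spectrum(toy, 2)
ratio = cpg_obs_exp(toy)
```

In a vertebrate genome the ratio is typically well below 1, which is the suppression referred to above.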