WGS data for bacterial typing
Karin Lagesen
@karinlag
NMDD presentation
2015-12-09
Bacterial genomes
Four letters: A, C, T, G
Two strands complementary:
A : T, C : G
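The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` is made up for this example, and the input is the first fragment shown on the slide.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("ATCCGGAG"))  # CTCCGGAT
```

Applying the function twice gives back the original strand, which is a quick sanity check on the pairing table.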
Genes: DNA that encodes proteins
Often regarded as the “functional” regions of the genome
Bacteria: genes make up approx. 90% of the genome
ATCCGGAG GAGGACGG
Mutations: single-letter character changes
TGAGGGACCAAACCGAT
TGAGGGACGAAACCGAT
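A minimal sketch of how the change between the two sequences above could be located, assuming both sequences have the same length and are already aligned. The function name `point_differences` is hypothetical.

```python
def point_differences(a: str, b: str) -> list[tuple[int, str, str]]:
    """List (1-based position, base in a, base in b) where a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (already aligned)")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


original = "TGAGGGACCAAACCGAT"
mutated = "TGAGGGACGAAACCGAT"
print(point_differences(original, mutated))  # [(9, 'C', 'G')]
```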
Bacterial genomes are most often circular
Campylobacter jejuni genome: 1.68 million base pairs
Bacterial typing
Typing: identifying a bacterial isolate at the strain level
Goal: discriminate between different bacterial isolates
● Effectively: a distance measure is often sought
Traditionally done by distinguishing isolates on phenotypic characteristics
Molecular strain typing has taken over
Goal: figure out how different sequences are
Advances in bacterial genomics
Phylum                         Number of genomes   % of total
Actinobacteria                 4,059               13
Bacteroidetes/Chlorobi group   932                 3
Cyanobacteria                  340                 1
Firmicutes                     9,628               31
Proteobacteria                 14,268              46
Spirochaetes                   525                 2
Other                          1,500               5
Number of sequenced genomes for 6 selected phyla and the percent of all genomes found in the phyla
Source: GenBank prokaryotes.txt file downloaded 4 February 2015
Land et al., Functional & Integrative Genomics, 2015
Development of sequencing technologies
Genome assembly
http://knowgenetics.org/whole-genome-sequencing/
Sequencing machine
Reads
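Molecular bacterial typing
Figure: typing methods arranged along two axes
Vertical axis: how differences are counted (categorical, ordinal, continuous)
Horizontal axis: amount of sequence used (one region, some regions, many regions, all)
Methods shown: single gene, MLST, MLVA, MLSA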
MLVA – Multi-locus VNTR analysis
Find loci with known repeats
Discover copy number of repeat – becomes identifier for the locus
Strain identified by copy numbers for a defined set of loci
Similarity is # of loci with identical copy numbers
http://www.applied-maths.com/applications/mlva
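The MLVA similarity described above can be sketched as a simple count. The locus names and copy numbers below are invented for illustration and do not come from any real MLVA scheme.

```python
def mlva_similarity(profile_a: dict[str, int], profile_b: dict[str, int]) -> int:
    """Count loci present in both profiles that have the same repeat copy number."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] == profile_b[locus])


# Hypothetical profiles: locus name -> repeat copy number.
isolate_1 = {"VNTR1": 4, "VNTR2": 7, "VNTR3": 2, "VNTR4": 10}
isolate_2 = {"VNTR1": 4, "VNTR2": 6, "VNTR3": 2, "VNTR4": 10}

print(mlva_similarity(isolate_1, isolate_2))  # 3
```

Note that the measure is categorical: a copy number of 6 versus 7 counts the same as 6 versus 20.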
Multi Locus Sequence Typing
Set of genes
Each variant is assigned a categorical number
Cluster types on # shared variants
Numbers become the sequence type (ST)
Similarity is # of identical loci numbers
MLST: 7 genes
rMLST: ribosomal genes
http://www.applied-maths.com/applications/mlst
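A toy sketch of the MLST bookkeeping, assuming allele numbers and STs are handed out in order of first appearance. Real schemes look these numbers up in a curated database; the class `MlstScheme`, the gene names and the sequences here are all hypothetical, and only three genes are used to keep the example short.

```python
class MlstScheme:
    """Toy MLST scheme: numbers alleles and STs in order of first appearance."""

    def __init__(self, genes):
        self.genes = list(genes)
        self.alleles = {gene: {} for gene in self.genes}  # sequence -> allele number
        self.sequence_types = {}  # allele profile -> ST number

    def profile(self, isolate: dict[str, str]) -> tuple[int, ...]:
        """Translate each gene sequence into its categorical allele number."""
        numbers = []
        for gene in self.genes:
            known = self.alleles[gene]
            seq = isolate[gene]
            if seq not in known:
                known[seq] = len(known) + 1
            numbers.append(known[seq])
        return tuple(numbers)

    def sequence_type(self, isolate: dict[str, str]) -> int:
        """Each distinct allele profile gets its own ST number."""
        prof = self.profile(isolate)
        if prof not in self.sequence_types:
            self.sequence_types[prof] = len(self.sequence_types) + 1
        return self.sequence_types[prof]


def shared_alleles(p: tuple[int, ...], q: tuple[int, ...]) -> int:
    """Similarity: number of loci with identical allele numbers."""
    return sum(a == b for a, b in zip(p, q))


scheme = MlstScheme(["geneA", "geneB", "geneC"])
isolate_a = {"geneA": "ATGC", "geneB": "GGTA", "geneC": "TTAC"}
isolate_b = {"geneA": "ATGC", "geneB": "GGTT", "geneC": "TTAC"}  # one base differs

st_a = scheme.sequence_type(isolate_a)
st_b = scheme.sequence_type(isolate_b)
print(st_a, st_b)  # 1 2
print(shared_alleles(scheme.profile(isolate_a), scheme.profile(isolate_b)))  # 2
```

The two toy isolates differ by a single base, yet they end up with different allele numbers for that gene and therefore different STs.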
Clustering categorical data
Feil, Nature Rev. Microbiol. 2004
Phylogeny – tracing ancestry
Many algorithms
● Distance matrix methods (sequence similarity)
● Maximum parsimony methods
● Maximum likelihood methods
Based on similarity between sequences
Can become very computationally intensive, especially for longer sequences (e.g. WGS)
Examples:
● 16S rRNA phylogenetic trees
● Multi Locus Sequence Analyses – phylogenies of concatenated MLST genes
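As an illustration of the distance matrix family only, the sketch below computes the proportion of differing sites (p-distance) between aligned sequences. This is one simple choice of distance, not necessarily the one used in the trees shown; the sequences and isolate names are invented, and building a tree from the matrix is left out.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """All-against-all p-distances, keyed by (name, name)."""
    names = sorted(seqs)
    return {(m, n): p_distance(seqs[m], seqs[n]) for m in names for n in names}


# Hypothetical aligned sequences.
aligned = {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "ACCTACGA"}
matrix = distance_matrix(aligned)
print(matrix[("iso1", "iso2")])  # 0.125
print(matrix[("iso1", "iso3")])  # 0.25
```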
Campylobacter 16S tree
Friis et al. PLOS One 2013
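Molecular bacterial typing
Figure: typing methods arranged along two axes, now including whole genome methods
Vertical axis: how differences are counted (categorical, ordinal, continuous)
Horizontal axis: amount of sequence used (one region, some regions, many regions, all)
Methods shown: single gene, MLST, MLVA, MLSA, wgMLST, core genome, core SNPs, pairwise SNPs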
Ideal whole genome comparisons
Bacterial species definition:
● 70% of genome should be able to anneal to each other – i.e. «match»
Converted to whole genome sequences:
● Based on % identity between conserved regions
● Average Nucleotide Identity (ANI) ~95%
All-against-all sequence alignment is required
● Time complexity: O(n^2)
● Not feasible in most cases
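A small worked example of why all-against-all comparison grows quadratically: with n genomes there are n(n-1)/2 distinct pairs. The genome counts below are arbitrary illustrations, and the function name is made up.

```python
from itertools import combinations


def pairwise_comparisons(n_genomes: int) -> int:
    """Number of distinct genome pairs in an all-against-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n))
# 10 45
# 100 4950
# 1000 499500

# Cross-check the formula by enumerating the pairs for a small case.
assert pairwise_comparisons(10) == len(list(combinations(range(10), 2)))
```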
Alternatives:
● Focus on core regions of the genome (core genes)
● Find just the variations (SNPs), make trees from those
Core genome – # “shared genes”
Sequences q and s have a matching region
n: length of match
k: % of matching characters in matching region
Regarded as “shared” iff k and n are large enough
Similarity = # “shared” genes
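The shared-gene rule could be sketched as below. The slide does not say what counts as "large enough", so the cut-offs passed in here (500 bases, 90%) are placeholders chosen purely for illustration, as are the `Match` class and the example values.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Best match of a gene from genome q against genome s."""
    length: int      # n: length of the matching region
    identity: float  # k: % of matching characters in the matching region


def is_shared(match: Match, min_length: int, min_identity: float) -> bool:
    """A gene counts as shared only if both n and k pass their cut-offs."""
    return match.length >= min_length and match.identity >= min_identity


def shared_gene_count(matches: list[Match], min_length: int, min_identity: float) -> int:
    """Similarity between two genomes = number of genes regarded as shared."""
    return sum(is_shared(m, min_length, min_identity) for m in matches)


# Hypothetical matches and illustrative cut-offs (not taken from the talk).
matches = [Match(900, 98.0), Match(300, 99.0), Match(1200, 60.0)]
print(shared_gene_count(matches, min_length=500, min_identity=90.0))  # 1
```

The second match fails on length and the third on identity, so only one gene counts as shared. The result is ordinal: each gene is either in or out, depending on the thresholds chosen.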
Core genome tree, Campylobacter
Friis et al. PLOS One 2013
Core SNP trees
Approach A: External core gene set
● Map each genome’s reads to genes
● Examine reads mapping to the same gene to
find sequence variations (variant calling)
● Create genome/SNP matrix
Approach B: Intrinsic core set
● Use suffix graphs to get Maximal Unique Matches
● Extend alignments from MUMs to get shared
core set
● Find variants in alignments
● Create genome/SNP matrix
Similarity: genomes that share the same SNPs
Tools: Snippy, snpTree, Parsnp
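Both approaches end with a genome/SNP matrix. A minimal sketch of that last step, assuming a gap-free core alignment is already available; this is not how any of the tools named above is implemented, and the genome names and sequences are invented.

```python
def snp_matrix(alignment: dict[str, str]) -> tuple[list[int], dict[str, str]]:
    """Keep only the alignment columns where at least two genomes differ.

    Returns the 0-based variable positions and, per genome, the bases at
    those positions (one row of the genome/SNP matrix).
    """
    if not alignment:
        return [], {}
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences in the core alignment must have equal length")
    variable = [
        i for i in range(length)
        if len({alignment[name][i] for name in names}) > 1
    ]
    rows = {name: "".join(alignment[name][i] for i in variable) for name in names}
    return variable, rows


# Hypothetical core alignment of three genomes.
core = {"g1": "ACGTAC", "g2": "ACGTTC", "g3": "ATGTTC"}
positions, rows = snp_matrix(core)
print(positions)  # [1, 4]
print(rows)       # {'g1': 'CA', 'g2': 'CT', 'g3': 'TT'}
```

The rows of this matrix are what a tree-building method would take as input.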
Campylobacter jejuni, core SNP tree
Maximum likelihood phylogeny derived from the core-genome alignment of 131 C. jejuni
isolates. Isolates with a known hyper-invasive phenotype have their taxa identifier names
highlighted in red. The three clades identified as containing hyper-invasive strains have
branches indicated in red
Baig et al. BMC Genomics 2015 16:852 doi:10.1186/s12864-015-2087-y
k-mer based SNP trees
k-mer: piece of sequence, k nucleotides long
Split genomes/reads into k-mers
Find k-mers in different genomes that vary in their middle character
Create genome/SNP matrix
● Note: this is not core, but pairwise all-against-all
Create trees
Similarity is # shared SNPs
Genome A: TGAGGGACCAAACCGAT
Genome B: TGAGGGACGAAACCGAT
kSNP
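The middle-character idea can be sketched on the two example genomes above. This is a simplified illustration and not the kSNP implementation: it ignores the reverse strand, assumes an odd k, and skips flanks that occur with more than one middle base inside a single genome. The function names are hypothetical.

```python
def flank_table(genome: str, k: int) -> dict[tuple[str, str], str]:
    """Map (left flank, right flank) of every k-mer to its middle base."""
    if k < 3 or k % 2 == 0:
        raise ValueError("k must be odd and at least 3")
    half = k // 2
    seen: dict[tuple[str, str], set[str]] = {}
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        flanks = (kmer[:half], kmer[half + 1:])
        seen.setdefault(flanks, set()).add(kmer[half])
    # Drop flanks that are ambiguous within this genome.
    return {flanks: next(iter(mids)) for flanks, mids in seen.items() if len(mids) == 1}


def kmer_snps(genome_a: str, genome_b: str, k: int) -> dict[tuple[str, str], tuple[str, str]]:
    """Flanks found in both genomes whose middle base differs."""
    table_a = flank_table(genome_a, k)
    table_b = flank_table(genome_b, k)
    return {
        flanks: (table_a[flanks], table_b[flanks])
        for flanks in table_a.keys() & table_b.keys()
        if table_a[flanks] != table_b[flanks]
    }


genome_a = "TGAGGGACCAAACCGAT"
genome_b = "TGAGGGACGAAACCGAT"
print(kmer_snps(genome_a, genome_b, k=5))  # {('AC', 'AA'): ('C', 'G')}
```

With k = 5 the only k-mer pair that shares both flanks but differs in the middle is ACCAA versus ACGAA, which is the single substitution between the two genomes. No alignment or reference genome is needed to find it.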
Acinetobacter whole genome SNP tree
Sahl et al., PLOS One, 2013
Classification of distance measures
Categorical
● Loci defined as either equal/different
● Similarity calculated as # shared loci
Ordinal
● Regions defined as “shared” based on sequence
similarity levels
● Similarity calculated as # shared sequences
Continuous
● Find all sequence differences (SNPs)
● Similarity calculated as # shared SNPs
(Some) sources of variation
Small changes
● Nucleotide substitution
● Insertions and deletions
Recombination
● Shuffling regions of the genome
“Jumping genes”: insertion sequences and transposons
● Small sequences that jump
● Can move other sequences with them
Horizontal gene transfer.
Gene tree != genome tree
Rose et al., Biology Direct 2007
So… what do we do?
No real answers (yet)
Could sequence the lot, but that is expensive
However: gain so much more with sequencing
● Very high discriminatory power (resolution)
● Access to virulence genes, ++
Be aware of possible fragility in MLST data
● One mutation = changed ST
● Should probably double check STs with MLSA
Compare MLSTs with WGS data, see how stable the MLSTs are relative to the whole genome
Questions? And thank you!