2015 12-09 nmdd

WGS data for bacterial typing

Karin Lagesen

@karinlag

NMDD presentation

2015-12-09

Bacterial genomes

Four letters: A, C, T, G

Two strands complementary:

A : T, C : G
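The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` is chosen here, and the input is the first half of the example sequence shown on this slide.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATCCGGAG"))  # CTCCGGAT
```

Applying the function twice returns the original strand, which is what complementarity implies.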

Genes: DNA that encodes proteins

Often regarded as the “functional” regions of the genome

Bacteria: genes make up approx. 90% of the genome

ATCCGGAG GAGGACGG

Mutations: single-letter character changes

TGAGGGACCAAACCGAT

TGAGGGACGAAACCGAT
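A minimal sketch of how the single-letter change in the two sequences above could be located, assuming the sequences are already aligned and of equal length. The helper name `point_differences` is made up for this example.

```python
def point_differences(a: str, b: str) -> list[tuple[int, str, str]]:
    """List (0-based position, base in a, base in b) wherever two
    equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


before = "TGAGGGACCAAACCGAT"
after = "TGAGGGACGAAACCGAT"
print(point_differences(before, after))  # [(8, 'C', 'G')]
```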

Bacterial genomes are most often circular

Campylobacter jejuni genome: 1.68 million base pairs

Bacterial typing

Typing: identifying a bacterial isolate at the strain level

Goal: discriminate between different bacterial isolates

● Effectively: a distance measure is often sought

Traditionally done by distinguishing isolates on phenotypic characteristics

Molecular strain typing has taken over

Goal: figure out how different sequences are

Advances in bacterial genomics

Phylum                        Number of genomes    % of total
Actinobacteria                4,059                13
Bacteroidetes/Chlorobi group  932                  3
Cyanobacteria                 340                  1
Firmicutes                    9,628                31
Proteobacteria                14,268               46
Spirochaetes                  525                  2
Other                         1,500                5

Number of sequenced genomes for 6 selected phyla and the percent of all genomes found in each phylum

Source: GenBank prokaryotes.txt file downloaded 4 February 2015

Land et al., Functional & Integrative Genomics, 2015


Development of sequencing technologies

Genome assembly

http://knowgenetics.org/whole-genome-sequencing/

Figure labels: sequencing machine, reads

Molecular bacterial typing

Figure: typing methods placed by how differences are counted (categorical, ordinal, continuous) against the amount of sequence used (one region, some regions, many regions, all)

Methods shown: single gene, MLST, MLVA, MLSA

MLVA – Multi-locus VNTR analysis

Find loci with known repeats

Discover copy number of repeat – becomes identifier for the locus

Strain identified by copy numbers for a defined set of loci

Similarity is # of identical loci numbers

http://www.applied-maths.com/applications/mlva
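The MLVA similarity described above (number of loci with identical copy numbers) can be sketched as follows. The locus names and copy numbers are invented for illustration and do not come from any real MLVA scheme.

```python
def shared_loci(profile_a: dict[str, int], profile_b: dict[str, int]) -> int:
    """Count loci where two isolates carry the same repeat copy number."""
    common = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in common if profile_a[locus] == profile_b[locus])


# Hypothetical profiles: locus name -> repeat copy number.
isolate_1 = {"locus_A": 3, "locus_B": 7, "locus_C": 2, "locus_D": 5}
isolate_2 = {"locus_A": 3, "locus_B": 6, "locus_C": 2, "locus_D": 5}

print(shared_loci(isolate_1, isolate_2))  # 3
```

Note that the measure is categorical: copy numbers 7 and 6 at `locus_B` count as one mismatch, exactly as 7 and 1 would.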

Multi Locus Sequence Typing

Set of genes

Each variant is assigned a categorical number

Cluster types on # shared variants

Numbers become the sequence type (ST)

Similarity is # of identical loci numbers

MLST: 7 genes

rMLST: ribosomal genes

http://www.applied-maths.com/applications/mlst
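A toy sketch of the MLST idea: each distinct gene variant gets an arbitrary number, and each distinct combination of numbers gets a sequence type. Everything here is an assumption made for illustration: the gene names and sequences are invented, only three loci are used instead of seven, and the numbers are assigned locally, whereas real schemes take allele numbers and STs from a curated database.

```python
from collections import defaultdict


def assign_alleles(isolates: dict[str, dict[str, str]]) -> dict[str, tuple[int, ...]]:
    """Give each distinct sequence at each locus an arbitrary allele number
    and return one allele profile (tuple of numbers) per isolate."""
    allele_ids: dict[str, dict[str, int]] = defaultdict(dict)
    profiles = {}
    for isolate, loci in isolates.items():
        profile = []
        for locus in sorted(loci):
            seen = allele_ids[locus]
            profile.append(seen.setdefault(loci[locus], len(seen) + 1))
        profiles[isolate] = tuple(profile)
    return profiles


def assign_sequence_types(profiles: dict[str, tuple[int, ...]]) -> dict[str, int]:
    """Give each distinct allele profile an arbitrary sequence type number."""
    st_ids: dict[tuple[int, ...], int] = {}
    return {iso: st_ids.setdefault(p, len(st_ids) + 1) for iso, p in profiles.items()}


# Hypothetical isolates with three made-up loci.
isolates = {
    "iso1": {"geneA": "ACGT", "geneB": "GGCA", "geneC": "TTAG"},
    "iso2": {"geneA": "ACGT", "geneB": "GGCA", "geneC": "TTAG"},
    "iso3": {"geneA": "ACGA", "geneB": "GGCA", "geneC": "TTAG"},
}
profiles = assign_alleles(isolates)
print(profiles)                         # iso3 differs from iso1 at geneA only
print(assign_sequence_types(profiles))  # iso1 and iso2 share an ST, iso3 does not
```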

Clustering categorical data

Feil, Nature Rev. Microbiol. 2004

Phylogeny – tracing ancestry

Many algorithms

● Distance matrix methods (sequence similarity)

● Maximum parsimony methods

● Maximum likelihood methods

Based on similarity between sequences

Can become very computationally intensive, especially for longer sequences (e.g. WGS)

Examples:

● 16S rRNA phylogenetic trees

● Multi Locus Sequence Analyses – phylogenies of concatenated MLST genes
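As a rough illustration of the input to a distance matrix method, the sketch below computes the proportion of differing sites (p-distance) between every pair of aligned sequences. The sequences are invented, the alignment is assumed to be given, and no substitution model or tree building is included.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(aligned: dict[str, str]) -> dict[str, dict[str, float]]:
    """Symmetric matrix of pairwise p-distances, keyed by sequence name."""
    names = list(aligned)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        matrix[n][m] = matrix[m][n] = p_distance(aligned[n], aligned[m])
    return matrix


# Hypothetical aligned sequences.
aligned = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTACGA"}
for name, row in distance_matrix(aligned).items():
    print(name, row)
```

A distance-based method such as neighbour joining would take this matrix as its starting point.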

Campylobacter 16S tree

Friis et al. PLOS One 2013

Molecular bacterial typing

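Figure: typing methods placed by how differences are counted (categorical, ordinal, continuous) against the amount of sequence used (one region, some regions, many regions, all), now including whole-genome methods

Methods shown: single gene, MLST, MLVA, MLSA, wgMLST, core genome, core SNPs, pairwise SNPs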

Ideal whole genome comparisons

Bacterial species definition:

● 70% of genome should be able to anneal to each other – i.e. «match»

Converted to whole genome sequences:

● Based on % identity between conserved regions

● Average Nucleotide Identity (ANI) ~95%

All-against-all sequence alignment is required

● Time complexity: O(n²)

● Not feasible in most cases
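To make the quadratic growth concrete, the number of genome pairs in an all-against-all comparison of n genomes is n(n-1)/2. The short sketch below only counts pairs; it says nothing about the cost of each individual alignment.

```python
from math import comb


def pairwise_comparisons(n_genomes: int) -> int:
    """Number of unordered genome pairs in an all-against-all comparison."""
    return comb(n_genomes, 2)


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n))
# 10 45
# 100 4950
# 1000 499500
```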

Alternatives:

● Focus on core regions of the genome (core genes)

● Find just the variations (SNPs), make trees from those

Core genome – # ”shared genes”

Sequences q and s have a matching region

● n: length of the match

● k: % of matching characters in the matching region

Regarded as ”shared” iff k and n are large enough

Similarity = # ”shared” genes
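The "shared" rule can be sketched as a simple threshold test on k and n. The slide gives no cut-off values, so the defaults below (`min_length=100`, `min_identity=90.0`) are placeholders chosen for this example, as are the `Match` record and the gene names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    gene: str        # name of the query gene (hypothetical)
    length: int      # n: length of the matching region
    identity: float  # k: % matching characters in that region


def is_shared(match: Match, min_length: int = 100, min_identity: float = 90.0) -> bool:
    """A gene counts as shared only if both n and k pass their thresholds.
    The default thresholds are illustrative assumptions, not published values."""
    return match.length >= min_length and match.identity >= min_identity


def shared_gene_count(matches: list[Match]) -> int:
    """Similarity between two genomes = number of distinct shared genes."""
    return len({m.gene for m in matches if is_shared(m)})


matches = [
    Match("g1", 900, 98.5),  # long and similar: shared
    Match("g2", 50, 99.0),   # similar but too short
    Match("g3", 800, 70.0),  # long but too divergent
    Match("g4", 300, 91.0),  # shared
]
print(shared_gene_count(matches))  # 2
```

Because a gene is either shared or not once the thresholds are set, the resulting similarity is a count of matches rather than a measure of how different the sequences are.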

Core genome tree, Campylobacter

Friis et al. PLOS One 2013

Core SNP trees

Approach A: External core gene set

● Map each genome’s reads to genes

● Examine reads mapping to the same gene to find sequence variations (variant calling)

● Create genome/SNP matrix

Approach B: Intrinsic core set

● Use suffix graphs to get Maximal Unique Matches

● Extend alignments from MUMs to get shared core set

● Find variants in alignments

● Create genome/SNP matrix

Similarity: genomes that share the same SNP

Tools: Snippy, snpTree, Parsnp
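The final step shared by both approaches, building a genome/SNP matrix, can be sketched from an existing core alignment. This is not how any of the tools named above work internally; it is a minimal illustration on invented sequences that ignores gaps, ambiguous bases and quality filtering.

```python
def snp_matrix(alignment: dict[str, str]) -> tuple[list[int], dict[str, str]]:
    """Return the variable alignment columns (0-based) and, per genome,
    the bases found at those columns."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all sequences must be aligned to the same length")
    # A column is a SNP site if more than one base is seen across genomes.
    positions = [
        i for i in range(length)
        if len({alignment[n][i] for n in names}) > 1
    ]
    matrix = {n: "".join(alignment[n][i] for i in positions) for n in names}
    return positions, matrix


# Hypothetical core alignment of three genomes.
core = {
    "g1": "ACGTACGTAC",
    "g2": "ACGAACGTAC",
    "g3": "ACGAACGTTC",
}
positions, matrix = snp_matrix(core)
print(positions)  # [3, 8]
print(matrix)     # {'g1': 'TA', 'g2': 'AA', 'g3': 'AT'}
```

The per-genome strings in the matrix are what a tree-building program would take as input.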

Campylobacter jejuni, core SNP tree

Maximum likelihood phylogeny derived from the core-genome alignment of 131 C. jejuni isolates. Isolates with a known hyper-invasive phenotype have their taxa identifier names highlighted in red. The three clades identified as containing hyper-invasive strains have branches indicated in red

Baig et al. BMC Genomics 2015 16:852 doi:10.1186/s12864-015-2087-y

k-mer based SNP trees

k-mer: piece of sequence, k nucleotides long

Split genomes/reads into k-mers

Find k-mers in different genomes that vary in their middle character

Create genome/SNP matrix

● Note: this is not core, but pairwise all-against-all

Create trees

Similarity is # shared SNPs

Genome A: TGAGGGACCAAACCGAT

Genome B: TGAGGGACGAAACCGAT

kSNP
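A minimal sketch of the k-mer idea above, using the two example genomes from this slide: k-mers are indexed by their flanking sequence, and a SNP is reported where the same flanks surround different middle bases in different genomes. This is a simplification written for illustration and not the kSNP implementation; it ignores reverse complements and discards flanks that occur with more than one middle base within a single genome.

```python
def kmer_middles(seq: str, k: int) -> dict[tuple[str, str], set[str]]:
    """Map (left flank, right flank) -> middle bases seen with those flanks."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that a k-mer has a middle character")
    half = k // 2
    middles: dict[tuple[str, str], set[str]] = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        flanks = (kmer[:half], kmer[half + 1:])
        middles.setdefault(flanks, set()).add(kmer[half])
    return middles


def kmer_snps(genomes: dict[str, str], k: int) -> dict[tuple[str, str], dict[str, str]]:
    """Flanks shared between genomes whose middle base differs."""
    per_genome = {name: kmer_middles(seq, k) for name, seq in genomes.items()}
    all_flanks = set().union(*per_genome.values())
    snps = {}
    for flanks in all_flanks:
        calls = {}
        for name, middles in per_genome.items():
            bases = middles.get(flanks)
            if bases and len(bases) == 1:  # skip ambiguous loci within a genome
                calls[name] = next(iter(bases))
        if len(set(calls.values())) > 1:
            snps[flanks] = calls
    return snps


genomes = {"A": "TGAGGGACCAAACCGAT", "B": "TGAGGGACGAAACCGAT"}
print(kmer_snps(genomes, k=17))
# {('TGAGGGAC', 'AAACCGAT'): {'A': 'C', 'B': 'G'}}
```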

Acinetobacter whole genome SNP tree

Sahl et al., PLOS One, 2013

Classification of distance measures

Categorical

● Loci defined as either equal/different

● Similarity calculated as # shared loci

Ordinal

● Regions defined as “shared” based on sequence similarity levels

● Similarity calculated as # shared sequences

Continuous

● Find all sequence differences (SNPs)

● Similarity calculated as # shared SNPs

(Some) sources of variation

Small changes

● Nucleotide substitution

● Insertions and deletions

Recombination

● Shuffling regions of the genome

“Jumping genes”: insertion sequences and transposons

● Small sequences that jump

● Can move other sequences with them

Horizontal gene transfer.

Gene tree != genome tree

Rose et al., Biology Direct 2007

So… what do we do?

No real answers (yet)

Could sequence the lot, but that is expensive

However: gain so much more with sequencing

● Very high discriminatory power (resolution)

● Access to virulence genes, ++

Be aware of possible fragility in MLST data

● One mutation = changed ST

● Should probably double check STs with MLSA
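The fragility can be illustrated with a toy comparison: counting mismatching alleles (the MLST view) gives the same answer whether a gene differs by one base or by several, while counting nucleotide differences (the sequence-level view used in MLSA) separates the two cases. The gene names and sequences below are invented.

```python
def allele_mismatches(loci_a: dict[str, str], loci_b: dict[str, str]) -> int:
    """Categorical view: number of loci whose sequences are not identical."""
    return sum(loci_a[locus] != loci_b[locus] for locus in loci_a)


def nucleotide_differences(loci_a: dict[str, str], loci_b: dict[str, str]) -> int:
    """Sequence view: number of differing bases summed over all loci
    (assumes equal-length, aligned sequences per locus)."""
    return sum(
        x != y
        for locus in loci_a
        for x, y in zip(loci_a[locus], loci_b[locus])
    )


# Hypothetical isolates with two made-up loci.
reference = {"geneA": "ACGTACGT", "geneB": "GGCATTAG"}
one_snp = {"geneA": "ACGTACGA", "geneB": "GGCATTAG"}
many_snps = {"geneA": "TCGAACCA", "geneB": "GGCATTAG"}

print(allele_mismatches(reference, one_snp), nucleotide_differences(reference, one_snp))      # 1 1
print(allele_mismatches(reference, many_snps), nucleotide_differences(reference, many_snps))  # 1 4
```

Both variants would receive a new allele number and hence a new ST, even though one of them is a single mutation away from the reference.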

Compare MLSTs with WGS data to see how stable the MLSTs are relative to the whole genome

Questions? Thank you!
