2015 12-09 nmdd


Upload: karinlag

Post on 14-Jan-2017


Page 1: 2015 12-09 nmdd

WGS data for bacterial typing

Karin Lagesen

@karinlag

NMDD presentation

2015-12-09

Page 2: 2015 12-09 nmdd

Bacterial genomes

Four letters: A, C, T, G

Two strands complementary:

A : T, C : G
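The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration only; the function names `complement` and `reverse_complement` are chosen here and are not from the slides.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base that pairs with each position of the strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("ATCCGGAG"))
print(reverse_complement("ATCCGGAG"))
```

Because the two strands run in opposite directions, tools usually report the reverse complement rather than the plain complement.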

Genes: DNA that encodes proteins

Often regarded as the “functional” regions of the genome

Bacteria: genes make up approx. 90% of the genome

ATCCGGAG GAGGACGG

Mutations: single-letter character changes

TGAGGGACCAAACCGAT

TGAGGGACGAAACCGAT
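A minimal sketch of how the single-letter change in the two sequences above could be located, assuming both sequences have the same length and are already aligned position by position. The function name `point_differences` is illustrative.

```python
def point_differences(a: str, b: str):
    """List (position, base_in_a, base_in_b) for every mismatching position.

    Positions are 0-based. Both sequences must have the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


original = "TGAGGGACCAAACCGAT"
mutated = "TGAGGGACGAAACCGAT"
print(point_differences(original, mutated))
```

For the two sequences on this slide, the only difference is a C to G change at 0-based position 8.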

Bacterial genomes are most often circular

Campylobacter jejuni genome: 1.68 million base pairs

Page 3: 2015 12-09 nmdd

Bacterial typing

Typing: identifying a bacterial isolate at the strain level

Goal: discriminate between different bacterial isolates

● Effectively: a distance measure is often sought

Traditionally done by distinguishing isolates based on phenotypic characteristics

Molecular strain typing has taken over

Goal: figure out how different sequences are

Page 4: 2015 12-09 nmdd

Advances in bacterial genomics

Phylum                        Number of genomes   % of total
Actinobacteria                4,059               13
Bacteroidetes/Chlorobi group  932                 3
Cyanobacteria                 340                 1
Firmicutes                    9,628               31
Proteobacteria                14,268              46
Spirochaetes                  525                 2
Other                         1,500               5

Number of sequenced genomes for 6 selected phyla and the percentage of all genomes found in each phylum

Source: GenBank prokaryotes.txt file downloaded 4 February 2015

Land et al., Functional & Integrative Genomics, 2015

Page 5: 2015 12-09 nmdd

Development of sequencing technologies

[Figure: timeline of sequencing technologies; the only surviving label is "2002"]

Page 6: 2015 12-09 nmdd

Genome assembly

http://knowgenetics.org/whole-genome-sequencing/

[Figure labels: Sequencing machine, Reads]

Page 7: 2015 12-09 nmdd

Molecular bacterial typing

[Figure: typing methods arranged along two axes. Axis 1, how differences are counted: Categorical, Ordinal, Continuous. Axis 2, amount of sequence used: One region, Some regions, Many regions, All. Methods shown: Single gene, MLST, MLVA, MLSA]

Page 8: 2015 12-09 nmdd

MLVA – Multi-locus VNTR analysis

Find loci with known repeats

Discover copy number of repeat – becomes identifier for the locus

Strain identified by copy numbers for defined set of loci

Similarity is # of loci with identical copy numbers
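A minimal sketch of the MLVA similarity count, assuming each isolate is represented as a mapping from locus name to repeat copy number. The locus names and copy numbers below are made up for illustration.

```python
def mlva_similarity(profile_a: dict, profile_b: dict) -> int:
    """Count loci, typed in both isolates, that have identical copy numbers."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] == profile_b[locus])


# Hypothetical profiles: locus name -> copy number of the repeat.
isolate_1 = {"locus_1": 3, "locus_2": 7, "locus_3": 2, "locus_4": 5}
isolate_2 = {"locus_1": 3, "locus_2": 6, "locus_3": 2, "locus_4": 5}

print(mlva_similarity(isolate_1, isolate_2))
```

Note that the comparison is categorical: a copy number of 6 versus 7 counts as one mismatch, exactly like 6 versus 20 would.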

http://www.applied-maths.com/applications/mlva

Page 9: 2015 12-09 nmdd

Multi Locus Sequence Typing

Set of genes

Each variant is assigned a categorical number

Cluster types on # shared variants

The combination of numbers becomes the Sequence type (ST)

Similarity is # of loci with identical allele numbers

MLST: 7 genes

rMLST: ribosomal genes
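The numbering scheme above can be sketched as follows. This is a toy model, not how any MLST database is implemented: the class `AlleleRegistry`, the gene names and the short sequences are all hypothetical, and only three genes are used instead of seven to keep the example small.

```python
class AlleleRegistry:
    """Hands out categorical numbers to gene variants and to allele profiles."""

    def __init__(self):
        self.alleles = {}         # gene -> {sequence: allele number}
        self.sequence_types = {}  # allele profile (tuple) -> ST number

    def allele_number(self, gene: str, sequence: str) -> int:
        """Return the number for this variant, assigning the next free one if new."""
        variants = self.alleles.setdefault(gene, {})
        return variants.setdefault(sequence, len(variants) + 1)

    def sequence_type(self, isolate: dict, genes: list):
        """Return (allele profile, ST) for an isolate given as gene -> sequence."""
        profile = tuple(self.allele_number(g, isolate[g]) for g in genes)
        st = self.sequence_types.setdefault(profile, len(self.sequence_types) + 1)
        return profile, st


def shared_alleles(profile_a, profile_b) -> int:
    """Similarity: number of loci with identical allele numbers."""
    return sum(a == b for a, b in zip(profile_a, profile_b))


genes = ["gene_a", "gene_b", "gene_c"]  # hypothetical scheme
iso_1 = {"gene_a": "ATGC", "gene_b": "GGCA", "gene_c": "TTAG"}
iso_2 = {"gene_a": "ATGC", "gene_b": "GGCA", "gene_c": "TTAG"}
iso_3 = {"gene_a": "ATGC", "gene_b": "GGTA", "gene_c": "TTAG"}

registry = AlleleRegistry()
profile_1, st_1 = registry.sequence_type(iso_1, genes)
profile_2, st_2 = registry.sequence_type(iso_2, genes)
profile_3, st_3 = registry.sequence_type(iso_3, genes)
print(profile_1, st_1)
print(profile_3, st_3, shared_alleles(profile_1, profile_3))
```

The allele numbers carry no information about how different two variants are; they only say whether the variants are identical or not.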

http://www.applied-maths.com/applications/mlst

Page 10: 2015 12-09 nmdd

Clustering categorical data

Feil, Nature Rev. Microbiol. 2004

Page 11: 2015 12-09 nmdd

Phylogeny – tracing ancestry

Many algorithms

● Distance matrix methods (sequence similarity)

● Maximum parsimony methods

● Maximum likelihood methods

Based on similarity between sequences

Can become very computationally intensive, especially for longer sequences (e.g. WGS)
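As an illustration of the first family, distance matrix methods start from a table of pairwise distances. The sketch below computes the proportion of differing positions (often called the p-distance) between aligned sequences of equal length. The isolate names and the third sequence are made up; tree building itself (for example neighbour joining) is not shown.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(aligned: dict):
    """Return (names, matrix) with matrix[i][j] = p-distance between i and j."""
    names = list(aligned)
    matrix = [[p_distance(aligned[r], aligned[c]) for c in names] for r in names]
    return names, matrix


# Hypothetical aligned sequences, 17 positions each.
aligned = {
    "iso_1": "TGAGGGACCAAACCGAT",
    "iso_2": "TGAGGGACGAAACCGAT",
    "iso_3": "TGAGCGACGAAACCGTT",
}

names, matrix = distance_matrix(aligned)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

The matrix has one cell per pair of sequences, which is one reason these methods get expensive as the number and length of sequences grow.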

Examples:

● 16S rRNA phylogenetic trees

● Multi Locus Sequence Analyses – phylogenies of concatenated MLST genes

Page 12: 2015 12-09 nmdd

Campylobacter 16S tree

Friis et al., PLOS One 2013

Page 13: 2015 12-09 nmdd

Molecular bacterial typing

[Figure: typing methods arranged along two axes. Axis 1, how differences are counted: Categorical, Ordinal, Continuous. Axis 2, amount of sequence used: One region, Some regions, Many regions, All. Methods shown: Single gene, MLST, MLVA, MLSA, wgMLST, Core genome, Core SNPs, Pairwise SNPs]

Page 14: 2015 12-09 nmdd

Ideal whole genome comparisons

Bacterial species definition:

● 70% of the genomes of two isolates should be able to anneal to each other (DNA-DNA hybridization), i.e. «match»

Converted to whole genome sequences:

● Based on % identity between conserved regions

● Average Nucleotide Identity (ANI) ~95%

All-against-all sequence alignment is required

● Time complexity: O(n^2)

● Not feasible in most cases
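To make the quadratic growth concrete: comparing every genome against every other genome once requires n(n-1)/2 alignments. The sketch below only counts the pairs; the genome names are placeholders.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Number of unordered genome pairs among n genomes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (10, 100, 1000):
    print(n, "genomes ->", pairwise_comparisons(n), "alignments")

# The closed form agrees with explicitly enumerating the pairs.
genomes = ["g1", "g2", "g3", "g4"]  # placeholder names
pairs = list(combinations(genomes, 2))
print(len(pairs))
```

Going from 100 to 1000 genomes multiplies the number of whole genome alignments by roughly 100, not 10.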

Alternatives:

● Focus on core regions of the genome (core genes)

● Find just the variations (SNPs), make trees from those

Page 15: 2015 12-09 nmdd

Core genome – # ”shared genes”

Sequences q and s have a matching region

Regarded as ”shared” iff k and n are large enough

Similarity = # ”shared” genes

[Figure: query q aligned to subject s. n = length of match, k = % of matching characters in the matching region]
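A minimal sketch of the ”shared gene” rule, assuming each q/s comparison has already been reduced to a match length n and a percent identity k (for example from an alignment tool). The slides do not give cut-off values, so `MIN_LENGTH` and `MIN_IDENTITY` below are hypothetical, as are the gene names and numbers.

```python
from dataclasses import dataclass


@dataclass
class Match:
    gene: str
    length: int              # n: length of the matching region
    percent_identity: float  # k: % of matching characters in that region


# Hypothetical thresholds; the slides only say "large enough".
MIN_LENGTH = 100
MIN_IDENTITY = 80.0


def is_shared(match: Match, min_length: int = MIN_LENGTH,
              min_identity: float = MIN_IDENTITY) -> bool:
    """A gene counts as shared only if both n and k pass their thresholds."""
    return match.length >= min_length and match.percent_identity >= min_identity


def count_shared(matches) -> int:
    """Similarity between two genomes = number of shared genes."""
    return sum(is_shared(m) for m in matches)


matches = [
    Match("gene_1", 900, 97.5),   # long and similar: shared
    Match("gene_2", 60, 99.0),    # similar but too short
    Match("gene_3", 1200, 62.0),  # long but too divergent
    Match("gene_4", 450, 88.0),   # shared
]
print(count_shared(matches))
```

Because the result depends on where the thresholds are set, the same pair of genomes can get different similarity values under different cut-offs.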

Page 16: 2015 12-09 nmdd

Core genome tree, Campylobacter

Friis et al., PLOS One 2013

Page 17: 2015 12-09 nmdd

Core SNP trees

Approach A: External core gene set

● Map each genome’s reads to genes

● Examine reads mapping to the same gene to find sequence variations (variant calling)

● Create genome/SNP matrix

Approach B: Intrinsic core set

● Use suffix graphs to get Maximal Unique Matches

● Extend alignments from MUMs to get shared core set

● Find variants in alignments

● Create genome/SNP matrix

Similarity: genomes that share the same SNP

Tools: Snippy, snpTree, Parsnp
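Both approaches end with the same step: turning aligned core sequences into a genome/SNP matrix. The sketch below shows that final step only, assuming the core alignment already exists as equal-length strings. It is not how Snippy, snpTree or Parsnp are implemented, and the isolate names and sequences are made up.

```python
def snp_matrix(core_alignment: dict):
    """Keep only alignment columns where at least two genomes differ.

    Returns (variable_positions, matrix), where matrix maps each genome name
    to its bases at the variable positions (0-based).
    """
    names = list(core_alignment)
    seqs = [core_alignment[name] for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all core sequences must be aligned to the same length")
    variable = [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]
    matrix = {name: "".join(seq[i] for i in variable)
              for name, seq in zip(names, seqs)}
    return variable, matrix


def snp_distance(row_a: str, row_b: str) -> int:
    """Number of SNP positions at which two genomes carry different bases."""
    return sum(x != y for x, y in zip(row_a, row_b))


# Hypothetical core alignment.
core = {
    "iso_1": "TGAGGGACCAAACCGAT",
    "iso_2": "TGAGGGACGAAACCGAT",
    "iso_3": "TGAGCGACGAAACCGTT",
}

positions, matrix = snp_matrix(core)
print(positions)
print(matrix)
print(snp_distance(matrix["iso_1"], matrix["iso_3"]))
```

The matrix is much smaller than the alignment it came from, which is what makes tree building on core SNPs tractable.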

Page 18: 2015 12-09 nmdd

Campylobacter jejuni, core SNP tree

Maximum likelihood phylogeny derived from the core-genome alignment of 131 C. jejuni isolates. Isolates with a known hyper-invasive phenotype have their taxa identifier names highlighted in red. The three clades identified as containing hyper-invasive strains have branches indicated in red.

Baig et al. BMC Genomics 2015 16:852 doi:10.1186/s12864-015-2087-y

Page 19: 2015 12-09 nmdd

k-mer based SNP trees

k-mer: piece of sequence, k nucleotides long

Split genomes/reads into k-mers

Find k-mers in different genomes that vary in their middle character

Create genome/SNP matrix

● Note: this is not core, but pairwise all-against-all

Create trees

Similarity is # shared SNPs

Genome A: TGAGGGACCAAACCGAT

Genome B: TGAGGGACGAAACCGAT

kSNP
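The k-mer idea can be sketched on Genome A and Genome B from this slide. This is a simplified illustration, not kSNP's actual implementation: it ignores reverse complements, takes k = 7 as an arbitrary choice, and skips flanks that occur with more than one middle base within a genome. The function names are invented here.

```python
from collections import defaultdict


def middle_variants(sequence: str, k: int):
    """Map (left flank, right flank) -> set of middle bases seen in the sequence."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that every k-mer has a middle character")
    half = k // 2
    variants = defaultdict(set)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        variants[(kmer[:half], kmer[half + 1:])].add(kmer[half])
    return variants


def kmer_snps(genome_a: str, genome_b: str, k: int):
    """Find k-mers with identical flanks in both genomes but a different middle base."""
    a = middle_variants(genome_a, k)
    b = middle_variants(genome_b, k)
    snps = {}
    for flanks in a.keys() & b.keys():
        # Only report unambiguous cases: one middle base per genome.
        if len(a[flanks]) == 1 and len(b[flanks]) == 1 and a[flanks] != b[flanks]:
            snps[flanks] = (next(iter(a[flanks])), next(iter(b[flanks])))
    return snps


genome_a = "TGAGGGACCAAACCGAT"
genome_b = "TGAGGGACGAAACCGAT"
print(kmer_snps(genome_a, genome_b, k=7))
```

With k = 7 the only hit is the k-mer pair GACCAAA / GACGAAA, that is, flanks GAC and AAA with middle base C in Genome A and G in Genome B. No alignment is needed, which is the attraction of the approach.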

Page 20: 2015 12-09 nmdd

Acinetobacter whole genome SNP tree

Sahl et al., PLOS One, 2013

Page 21: 2015 12-09 nmdd

Classification of distance measures

Categorical

● Loci defined as either equal/different

● Similarity calculated as # shared loci

Ordinal

● Regions defined as “shared” based on sequence similarity levels

● Similarity calculated as # shared sequences

Continuous

● Find all sequence differences (SNPs)

● Similarity calculated as # shared SNPs

Page 22: 2015 12-09 nmdd

(Some) sources of variation

Small changes

● Nucleotide substitution

● Insertions and deletions

Recombination

● Shuffling regions of the genome

“Jumping genes”: insertion sequences and transposons

● Small sequences that jump

● Can move other sequences with them

Page 23: 2015 12-09 nmdd

Horizontal gene transfer.

Page 24: 2015 12-09 nmdd

Gene tree != genome tree

Rose et al., Biology Direct 2007

Page 25: 2015 12-09 nmdd

So… what do we do?

No real answers (yet)

Could sequence the lot, but that is expensive

However: we gain so much more with sequencing

● Very high discriminatory power (resolution)

● Access to virulence genes, ++

Be aware of possible fragility in MLST data

● One mutation = changed ST

● Should probably double check STs with MLSA

Compare MLSTs with WGS data to see how stable the MLSTs are relative to the whole genome
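The fragility point can be shown with a toy comparison: one nucleotide change in one locus gives a different allele, and therefore a different ST, while a nucleotide-level comparison of the concatenated loci (the MLSA view) still shows the isolates as nearly identical. The gene names and sequences are made up, and the loci are assumed to be aligned and of equal length.

```python
def allele_mismatches(a: dict, b: dict, genes: list) -> int:
    """Categorical view (MLST): number of loci whose sequences are not identical."""
    return sum(a[g] != b[g] for g in genes)


def concatenated_differences(a: dict, b: dict, genes: list) -> int:
    """Nucleotide-level view (MLSA): differing positions in the concatenated loci."""
    concat_a = "".join(a[g] for g in genes)
    concat_b = "".join(b[g] for g in genes)
    return sum(x != y for x, y in zip(concat_a, concat_b))


genes = ["gene_1", "gene_2", "gene_3"]  # hypothetical scheme
isolate_x = {"gene_1": "ATGCATGC", "gene_2": "GGCATTAG", "gene_3": "TTAGGCCA"}
isolate_y = {"gene_1": "ATGCATGC", "gene_2": "GGCATTAC", "gene_3": "TTAGGCCA"}

# One mismatching locus is enough to give the two isolates different STs.
print(allele_mismatches(isolate_x, isolate_y, genes))
# Yet only 1 of the 24 concatenated positions differs.
print(concatenated_differences(isolate_x, isolate_y, genes))
```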

Page 26: 2015 12-09 nmdd

Questions? And thank you!