
Whole Genome based Phylogeny

Johanne Ahrenfeldt PhD student

DTU Bioinformatics


Short about me

Johanne Ahrenfeldt

•  PhD student at DTU Bioinformatics
    –  Whole Genome based Phylogeny

•  Graduate Engineer in Systems Biology and Bioinformatics from Technical University of Denmark

•  Working in the CGE project since 2012 – started as a student helper


Overview

•  What is Phylogeny
•  SNP methods
    –  CSI Phylogeny
•  Nucleotide Differences
    –  NDtree
•  Controlled Evolution study
•  Good advice


What is phylogeny?

•  Early phylogeny
    –  Classification
    –  Based on phenotypes
•  Current phylogeny
    –  Based on genotypes
    –  DNA mutations as basis for evolution


Classification

Carl Linnaeus 1707-1778

Hierarchical system: Kingdom, Phylum, Class, Order, Family, Genus, Species


Classification depicted as a tree


Species Genus Family Order Class


Molecular Basis for Variation: DNA Mutation

DNA mutations as basis for evolution


What are phylogenetic trees

•  Phylogenetic trees are a visual representation of the genetic relationship between species

•  Think of them as family trees
•  Phylogeny can also be represented by distance matrices


What are phylogenetic trees

•  Trees were traditionally made using aligned sequences of single genes or proteins

•  Whole genome data can be used to create trees based on
    –  SNP calling
    –  K-mer overlap
    –  Alignment of genomes


What is a SNP

•  A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation that occurs commonly within a population (e.g. 1%), in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes.


How does it work

Strain A  ATTCAGTAGT
Strain B  ATGCAGTTGA
Strain C  ATGCAATTGT
Strain D  ATCCATTAGC


Construct distance matrix

Strain A  ATTCAGTAGT
Strain B  ATGCAGTTGA
Strain C  ATGCAATTGT
Strain D  ATCCATTAGC

    A  B  C  D
A
B
C
D


Make Tree

Strain A  ATTCAGTAGT
Strain B  ATGCAGTTGA
Strain C  ATGCAATTGT
Strain D  ATCCATTAGC

    A  B  C  D
A   0  3  3  3
B   3  0  2  4
C   3  2  0  4
D   3  4  4  0

[Figure: tree connecting strains A, B, C and D; the branches are labelled with length 1]
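The distance matrix on this slide can be reproduced by counting, for each pair of strains, the positions at which the aligned sequences differ. A minimal Python sketch; the function names are illustrative and not part of any CGE tool:

```python
from itertools import combinations

strains = {
    "A": "ATTCAGTAGT",
    "B": "ATGCAGTTGA",
    "C": "ATGCAATTGT",
    "D": "ATCCATTAGC",
}

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Symmetric matrix of pairwise differences, keyed by strain name."""
    names = list(seqs)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        matrix[n][m] = matrix[m][n] = hamming(seqs[n], seqs[m])
    return matrix

matrix = distance_matrix(strains)
for name in strains:
    # One row per strain, in the same column order as the slide.
    print(name, *(matrix[name][other] for other in strains))
```

B and C differ at only two positions, which is why they end up as the closest pair in the matrix.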


How to read phylogenetic trees

that would have been classified as Homo sapiens were born. However, rainbow trout and humans are contemporary species, meaning that the lineages of which they are currently terminal nodes have been evolving for exactly the same amount of time since their divergence from a distant common ancestor. As a result, any notion that one of these lineages is “more evolved” or that one has had more time to accumulate differences is flawed.

Misconception #8: Backwards Time Axes

Among the common misconceptions identified by Meir et al. (2007) was the tendency for many students to misread the time axis on evolutionary trees. Many students interpreted the location of the terminal nodes as indicating time, for example by reading from left to right or from the leftmost tip to the root. In Fig. 17a, for example, many students read time as proceeding from birds (oldest) to the root W (youngest) or from birds (oldest) to kangaroos (youngest). Neither is correct, as time extends from the root to the terminal nodes, all of which are contemporary. This misinterpretation may have been exacerbated by the fact that the tree used in the quiz placed mammals (which many students assume to be the most “advanced” and hence most recent group) alone on the less diverse branch at the far right of an unbalanced, ladderized tree (unfortunately, a tendency to place humans or some other preferred taxon at the top or right of every tree appears to be an unshakable habit among many phylogeneticists, although there is no objective reason for doing so). As indicated in Fig. 5, even on cladograms, in which the lengths of the branches are not

Fig. 16 The lineages leading to contemporary species have all been evolving for exactly the same amount of time. Rates of morphological change may vary among lineages, but the amount of time that separates two living lineages from their common ancestor does not. This figure shows the relationships among a sample of vertebrate lineages, all of which have been evolving for exactly the same amount of time, even if some lineages have undergone more change or more branching than others or if some taxonomically identifiable subsets of those lineages (e.g., teleost fishes) arose earlier than others (e.g., mammals). It is therefore a fallacy to describe one modern species as “more evolved” than another. Note, however, that this is a cladogram rather than an ultrametric tree, such that one cannot assume that any or all of G, H, E, F, C, and B are equal, only that the total amount of time between root and tip is the same along each of the lineages

Fig. 17 The number of intervening nodes does not indicate overall relatedness between lineages. The tree in a is the same in topology as the one used in the study of Meir et al. (2007), which showed that many readers have a tendency to misread the directionality of time on phylogenies and to count nodes when asked to determine evolutionary relatedness among species. Confusion may arise in this particular case because many people maintain the erroneous assumption that mammals are the most “advanced” and therefore must be the youngest group. More generally, because the tree is unbalanced, students may tend to consider birds and mammals (separated by four internal nodes on this tree, Z, Y, X, and W) as more distantly related than turtles and mammals (separated by two internal nodes, X and W). However, this is simply an artifact of the species chosen for inclusion on the tree. All species descended from ancestor X are equally related to kangaroos, with which they all share the same last common ancestor, W. To demonstrate this, b illustrates the same tree with different patterns for each branch, which are then spliced together in c to reveal the identical total distance from the common ancestor W to all of the terminal nodes

T. Ryan Gregory. Understanding Evolutionary Trees. Evo Edu Outreach (2008) 1:121–137. DOI 10.1007/s12052-008-0035-x


How to read phylogenetic trees

T. Ryan Gregory. Understanding Evolutionary Trees. Evo Edu Outreach (2008) 1:121–137. DOI 10.1007/s12052-008-0035-x

with their shared ancestor represent a clade (amniotes) in which the first two clades are nested. Adding frogs and the ancestor linking them to the aforementioned species creates a yet larger clade (tetrapods). Adding fishes and the common ancestor of all species on this tree creates the final and largest clade (vertebrates). Because frogs can be included in a clade with humans before fishes can (in other words, because frogs and humans share a common ancestor that is not shared with fishes), frogs are more closely related to humans than to fishes. Indeed, frogs and humans are exactly equally related to fishes through this common ancestor (recall that two cousins are equally related to a third, more distant relative).

A more rapid approach is to mentally rotate a few internal nodes with no effect on the topology of the tree, as shown in Fig. 11b. In this modified tree, humans are still sister to cats and birds are sister to lizards, frogs are then sister to amniotes, and fishes are the outgroup to the tetrapods. This second tree is identical in topology and is therefore equally accurate as the first tree. However, it

Fig. 11 The order of terminal nodes is meaningless. One of the most common misconceptions about evolutionary trees is that the order of the terminal nodes provides information about their relatedness. Only branching order (i.e., the sequence of internal nodes) provides this information; because all internal nodes can be rotated without affecting the topology (Fig. 6), the order of the tips is meaningless. Nevertheless, there is a strong tendency for readers to take the tree in a as indicating that frogs are more closely related to fishes than humans are. They are not: both frogs and humans (and birds and lizards and cats) are equally closely related to fishes because as tetrapods they share a common ancestor to the exclusion of bony fishes. On the other hand, humans and cats are more closely related to each other than either is to any of the other species depicted because they share a recent common ancestor to the exclusion of the other species. The tree in b exhibits an identical topology to the one in a and is therefore equally valid. In this case, the same misinterpretation of “reading across the tips” would lead to the erroneous conclusion that birds are more closely related to fishes than cats are or that humans are more closely related to frogs than to lizards and birds. Because they share a common ancestor as amniotes, birds, cats, lizards, and humans are all equally related to frogs. It is good practice to rotate a few internal nodes mentally when first examining a tree to dispel misinterpretations based on reading the order of tips

Fig. 12 Evolutionary trends cannot be identified by reading across the tips. In addition to resulting in incorrect interpretations of relatedness (Fig. 11), reading across the tips can engender a false impression of evolutionary trends. For example, many readers confronted with the tree in a might be tempted to infer an evolutionary trend toward increased body size in snail species over time (or, in Fig. 11a, an increase in complexity or intelligence over time). Unfortunately, misinterpretations such as this can be found even in the primary scientific literature. Once again, this can be corrected simply by rotating a few internal nodes, as has been done in b, in which the topology is the same but where the supposed trend is no longer apparent. c shows evidence of a real evolutionary trend toward increased body size. The important consideration is internal branching: In this case, there is information about ancestral states (e.g., from fossils), and it is evident that in every branching event, the two descendant species have been larger than their shared ancestor. Despite this being a clear evolutionary trend, there is no pattern evident across the terminal nodes. Thus, reading across the tips can create apparent trends where there are none and can mask real trends that are strongly supported by historical information

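Rotating internal nodes changes the order of the tips but not the set of clades, and the clades are what the topology consists of. A small Python sketch of that idea, with trees written as nested tuples and the taxa from Fig. 11 typed in by hand (the tuple encoding and the `clades` helper are assumptions made for illustration):

```python
def clades(tree):
    """Return every clade in a nested-tuple tree as a frozenset of tip names."""
    found = set()

    def tips(node):
        if isinstance(node, str):  # a terminal node
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found

tree_a = ("fishes", ("frogs", (("lizards", "birds"), ("cats", "humans"))))
# The same tree with several internal nodes rotated: tip order changes, branching does not.
tree_b = (((("humans", "cats"), ("birds", "lizards")), "frogs"), "fishes")
# A genuinely different topology, in which frogs are grouped with fishes.
tree_c = (("fishes", "frogs"), (("lizards", "birds"), ("cats", "humans")))

print(clades(tree_a) == clades(tree_b))  # True
print(clades(tree_a) == clades(tree_c))  # False
```

Comparing clade sets rather than tip order is the programmatic equivalent of "mentally rotating a few internal nodes".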


What is phylogeny used for

•  Classify taxonomy
    –  The classic use
•  Outbreak detection
    –  Increasing with WGS data


What is phylogeny used for

•  Cholera outbreak in Haiti 2010
•  Listeria outbreak 2014

Whole-genome Sequencing Used to Investigate a Nationwide Outbreak of Listeriosis Caused by Ready-to-eat Delicatessen Meat, Denmark, 2014. Kvistholm Jensen et al. Clin Infect Dis. (2016) 63 (1): 64-70. doi: 10.1093/cid/ciw192


Case story

•  Vibrio cholerae outbreak in Haiti followed the 2010 earthquake

•  Rumors said that the outbreak may have come from Nepal, travelling along with UN soldiers from Nepal

•  No proof had been given of this until the Hendriksen et al. paper in 2011

Population Genetics of Vibrio cholerae from Nepal in 2010: Evidence on the Origin of the Haitian Outbreak. Hendriksen et al. 23 August 2011 mBio vol. 2 no. 4 e00157-11. doi: 10.1128/mBio.00157-11


Case story

•  Data
    –  24 recent V. cholerae strains from Nepal
    –  10 previously sequenced V. cholerae isolates, including 3 from the Haitian outbreak
•  Analysis
    –  Antimicrobial susceptibility testing
    –  PFGE (pulsed-field gel electrophoresis) to analyze for genetic relatedness
    –  Whole genome sequencing, SNP identification and phylogenetic analysis


Case story - Results

Resistance profile (legend: Susceptible, Decreased susceptibility, Resistant)

Nepalese strains (Hendriksen et al. 2011): Tetracycline, Ciprofloxacin, Trimethoprim, Sulfamethoxazole, Nalidixic acid

Haitian outbreak strains (Centers for Disease Control and Prevention, 2010): Tetracycline, Ciprofloxacin, Trimethoprim, Sulfamethoxazole, Nalidixic acid


Case story - Results

•  Pulsed-field gel electrophoresis (PFGE)
    –  Nepalese isolates divided in 4 groups
    –  Most common Haitian type in same group as four Nepalese strains


Case story - Results

FIG 1 Genetic relationships among V. cholerae isolates from Nepal and Haiti. A single maximum parsimony tree was reconstructed using 752 SNPs from 34 whole-genome sequences. There were 184 parsimony-informative SNPs, of which 6 were homoplastic, resulting in a CI of 0.97 (excluding uninformative characters). The branch lengths are labeled in red, and for branches affected by homoplasy, minimum and maximum branch lengths are designated. Members of SNP genotypic group V (16) are indicated. SNP differences among the three most closely related Nepali groups and the Haitian group are shown and characterized in Table S1 in the supplemental material.

TABLE 1 Different point mutations observed among the three sequenced isolates from the Haiti outbreak and the three most closely related isolates from Nepal (a)

Nucleotide (amino acid) in the reference strain, Haitian isolates 1786, 1792 and 1798, and Nepalese isolates 14, 25 and 26:

Chromosome  Position  Reference  Haiti 1786  Haiti 1792  Haiti 1798  Nepal 14  Nepal 25  Nepal 26
I           2787016   C (Gly)    C (Gly)     C (Gly)     C (Gly)     T (Arg)   T (Arg)   T (Arg)
I           1090536   T (Ile)    T (Ile)     T (Ile)     T (Ile)     T (Ile)   T (Ile)   G (Ser)
II          962762    C (Ala)    C (Ala)     C (Ala)     C (Ala)     T (Ala)   C (Ala)   C (Ala)

(a) The reference strain is Vibrio cholerae O1 biovar El Tor strain N16961 (Bangladesh 1971). The NCBI reference sequences or accession numbers are NC_002505 for chromosome I and NC_002506 for chromosome II.


10 minutes break!


snpTree

•  First online webserver for constructing phylogenetic trees based on whole genome sequencing

snpTree - a web-server to identify and construct SNP trees from whole genome sequence data. Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. BMC Genomics. 2012;13 Suppl 7:S6.


snpTree flow


CSI Phylogeny

https://cge.cbs.dtu.dk/services/CSIPhylogeny/

•  SNP identification same as snpTree
•  Strict sorting of SNPs
    –  Depth
    –  Relative depth
    –  Distance between SNPs
    –  SNP quality
    –  Read mapping quality

Rolf S. Kaas, Pimlapas Leekitcharoenphon, Frank M. Aarestrup, Ole Lund. Solving the Problem of Comparing Whole Bacterial Genomes across Different Sequencing Platforms. PLoS ONE 2014; 9(8): e104984.
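The five sorting criteria can be pictured as a chain of filters applied to every candidate SNP. The Python sketch below is only an illustration of that idea: the `Snp` record, its field names and all threshold values are assumptions made for the example and are not taken from the CSI Phylogeny source, and dropping both members of a too-close pair is likewise an assumed pruning rule.

```python
from dataclasses import dataclass

@dataclass
class Snp:
    position: int           # position in the reference genome
    depth: int              # number of reads covering the position
    mean_depth: float       # average depth across the genome
    quality: float          # SNP (variant) quality
    mapping_quality: float  # read mapping quality at the position

# Placeholder thresholds, chosen for illustration only.
MIN_DEPTH = 10
MIN_RELATIVE_DEPTH = 0.10   # fraction of the genome-wide average depth
MIN_DISTANCE = 10           # minimum distance (bp) between neighbouring SNPs
MIN_QUALITY = 30
MIN_MAPPING_QUALITY = 25

def passes_quality(snp):
    """Depth, relative depth, SNP quality and mapping quality checks."""
    return (snp.depth >= MIN_DEPTH
            and snp.depth >= MIN_RELATIVE_DEPTH * snp.mean_depth
            and snp.quality >= MIN_QUALITY
            and snp.mapping_quality >= MIN_MAPPING_QUALITY)

def prune_close(snps, min_distance=MIN_DISTANCE):
    """Drop SNPs lying closer than min_distance to a neighbouring SNP."""
    snps = sorted(snps, key=lambda s: s.position)
    kept = []
    for i, snp in enumerate(snps):
        close_left = i > 0 and snp.position - snps[i - 1].position < min_distance
        close_right = i + 1 < len(snps) and snps[i + 1].position - snp.position < min_distance
        if not (close_left or close_right):
            kept.append(snp)
    return kept

def filter_snps(snps):
    return prune_close([s for s in snps if passes_quality(s)])

candidates = [
    Snp(100, 40, 50.0, 60.0, 50.0),   # good, but 5 bp from the next SNP
    Snp(105, 42, 50.0, 55.0, 50.0),   # good, but 5 bp from the previous SNP
    Snp(500, 45, 50.0, 70.0, 60.0),   # passes every filter
    Snp(900, 3, 50.0, 80.0, 60.0),    # depth too low
]
print([s.position for s in filter_snps(candidates)])
```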


CSI Phylogeny

•  Requires all SNPs to be significant
    –  Z-score higher than 1.96 for all SNPs
•  X is the number of reads with the most common nucleotide at that position, and Y the number of reads with any other nucleotide.

$$Z = \frac{X - Y}{\sqrt{X + Y}}$$
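A short Python sketch of the Z-score, assuming the formula reads Z = (X - Y) / sqrt(X + Y); the read counts in the loop are made-up examples:

```python
from math import sqrt

def z_score(x, y):
    """Z = (X - Y) / sqrt(X + Y), with X majority reads and Y all other reads."""
    if x + y == 0:
        raise ValueError("no reads cover this position")
    return (x - y) / sqrt(x + y)

for x, y in [(10, 0), (9, 1), (6, 4), (60, 40)]:
    z = z_score(x, y)
    print(f"X={x:3d} Y={y:3d}  Z={z:.2f}  significant={z > 1.96}")
```

The last two cases have the same 60:40 split, yet only the deeper one passes: Z is about 0.63 at depth 10 and exactly 2.0 at depth 100, so the score rewards both agreement among the reads and sequencing depth.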


CSI Phylogeny

Output

Tree built by the FastTree algorithm, in Newick format
•  Branch lengths are substitutions per site at the variable sites

Matrix of SNP pair counts in text (.txt) format
•  Diagonal SNP matrix


CSI Phylogeny


NDtree

https://cge.cbs.dtu.dk/services/NDtree/

Nucleotide calling
•  A different approach: the main distinction is not whether a SNP should be called, but whether there is solid evidence for the nucleotide at a given position.

Real-Time Whole-Genome Sequencing for Routine Typing, Surveillance, and Outbreak Detection of Verotoxigenic Escherichia coli. Joensen KG, Scheutz F, Lund O, Hasman H, Kaas RS, Nielsen EM, Aarestrup FM. J Clin Microbiol. 2014 May;52(5):1501-10.


NDtree

Simple mapping approach
•  Cuts all reads into K-mers
•  Maps all K-mers to the reference genome
•  Makes ungapped consensus sequences of equal lengths
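The three steps can be sketched in Python as a toy seed-and-vote mapper. This is a simplified illustration, not NDtree's actual implementation: the function names, the k-mer size of 5, the offset voting and the use of `N` for uncovered positions are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def pileup(reference, reads, k=5):
    """Place each read, ungapped, at the offset most of its k-mers agree on."""
    index = build_index(reference, k)
    counts = [Counter() for _ in reference]
    for read in reads:
        offsets = Counter()
        for j in range(len(read) - k + 1):          # cut the read into k-mers
            for pos in index.get(read[j:j + k], ()):
                offsets[pos - j] += 1               # implied start of the read
        if not offsets:
            continue                                # read does not map
        offset = offsets.most_common(1)[0][0]
        for j, base in enumerate(read):
            if 0 <= offset + j < len(reference):
                counts[offset + j][base] += 1
    return counts

def consensus(counts):
    """Most common base per position; 'N' where no read mapped."""
    return "".join(c.most_common(1)[0][0] if c else "N" for c in counts)

reference = "ACGTTGCAAGGCTTAACCGT"
# Reads from a sample that differs from the reference at one position (G -> T).
reads = ["ACGTTGCAAGTC", "TGCAAGTCTTAA", "AGTCTTAACCGT"]
sample = consensus(pileup(reference, reads))
print(reference)
print(sample)
```

Because the reads are laid down without gaps, every consensus sequence has exactly the length of the reference, which is what makes position-by-position comparison between genomes possible.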


Mapping

[Figure: K-mers are mapped to the reference genome and combined into a consensus sequence; the consensus sequences of Genome 1 to Genome 6 are lined up against the reference genome]


NDtree Nucleotide calling

–  When all reads have been mapped, the significance of the base call at each position is evaluated from the number of reads X having the most common nucleotide at that position and the number of reads Y supporting other nucleotides.

A Z-score is calculated:

$$Z = \frac{X - Y}{\sqrt{X + Y}}$$

A base is called when
–  Z > 1.96 (or 3.29)
–  >90% of reads support the same base
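Both criteria can be combined into a single base-calling function. A minimal Python sketch, assuming Z = (X - Y) / sqrt(X + Y) and using `N` for positions without solid evidence (the function name and the `N` convention are assumptions for this example):

```python
from collections import Counter
from math import sqrt

def call_base(counts, z_threshold=1.96, min_fraction=0.9):
    """Return the majority base if both criteria hold, otherwise 'N'."""
    total = sum(counts.values())
    if total == 0:
        return "N"                       # no reads mapped to this position
    base, x = Counter(counts).most_common(1)[0]
    y = total - x
    z = (x - y) / sqrt(total)
    if z > z_threshold and x / total > min_fraction:
        return base
    return "N"

print(call_base({"A": 20}))                     # deep and unanimous
print(call_base({"A": 19, "G": 1}))             # 95% support
print(call_base({"A": 9, "G": 1}))              # exactly 90% support: not called
print(call_base({"A": 5}))                      # called at Z > 1.96 ...
print(call_base({"A": 5}, z_threshold=3.29))    # ... but not at Z > 3.29
```

The last two calls show why the stricter threshold leaves more positions uncalled: five unanimous reads give Z of about 2.24, which clears 1.96 but not 3.29.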


NDtree

Count nucleotide differences
–  Method 1: Each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted.
    •  More accurate (Z=1.96 is used as threshold)
–  Method 2: Each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted.
    •  More robust (Z=3.29 is used as threshold)


Method 1 – all called
[Figure: significant positions in Genome 1, Genome 2 and Genome 3; positions used for phylogeny]

Method 2 – pairwise significance
[Figure: significant positions in Genome 1, Genome 2 and Genome 3; positions used between 1 and 2, between 1 and 3, and between 2 and 3]
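The difference between the two counting methods shows up as soon as some positions are uncalled. A Python sketch with three made-up consensus sequences, where `N` marks a position that was not called (the sequences, the `N` convention and the function names are assumptions for illustration):

```python
from itertools import combinations

def differences_all_called(seqs):
    """Method 1: use only positions called (not 'N') in every sequence."""
    names = list(seqs)
    length = len(seqs[names[0]])
    usable = [i for i in range(length) if all(seqs[n][i] != "N" for n in names)]
    return {
        (a, b): sum(seqs[a][i] != seqs[b][i] for i in usable)
        for a, b in combinations(names, 2)
    }

def differences_pairwise(seqs):
    """Method 2: for each pair, use positions called in both sequences."""
    return {
        (a, b): sum(x != y for x, y in zip(seqs[a], seqs[b]) if x != "N" and y != "N")
        for a, b in combinations(seqs, 2)
    }

consensus = {
    "Genome 1": "ACGTACGTAC",
    "Genome 2": "ACGAACGTNC",
    "Genome 3": "NCGTACCTTC",
}
method1 = differences_all_called(consensus)
method2 = differences_pairwise(consensus)
for pair in method1:
    print(pair, "method 1:", method1[pair], "method 2:", method2[pair])
```

Genome 1 and Genome 3 differ at a position that is uncalled in Genome 2. Method 1 discards that position for every pair, while Method 2 still counts it for the pair in which both bases were called.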


NDtree

Uses two different algorithms to make two different trees
•  UPGMA
•  Neighbor Joining

Both algorithms are part of the PHYLIP Neighbor program package and make trees from distance matrices
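To show how a tree can be built from nothing but a distance matrix, here is a compact UPGMA sketch in Python, run on the four-strain matrix from the earlier example. It is a stand-alone illustration written for this text, not the PHYLIP code; the function name and the returned `merges` list are assumptions, and ties between equally close pairs are broken by dictionary order.

```python
def upgma(names, dist):
    """Cluster by average linkage; return a Newick string and the merge history."""
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}  # id -> (newick, size, height)
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        (a, b), dab = min(d.items(), key=lambda item: item[1])  # closest pair
        newick_a, size_a, height_a = clusters.pop(a)
        newick_b, size_b, height_b = clusters.pop(b)
        height = dab / 2                                        # node sits halfway
        newick = f"({newick_a}:{height - height_a:g},{newick_b}:{height - height_b:g})"
        merges.append((newick_a, newick_b, height))
        # Distance from the new cluster to each remaining one: size-weighted average.
        updated = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            updated[c] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in updated.items():
            d[(c, next_id)] = v                                 # new ids are always larger
        clusters[next_id] = (newick, size_a + size_b, height)
        next_id += 1
    (newick, _, _), = clusters.values()
    return newick + ";", merges

names = ["A", "B", "C", "D"]
dist = [
    [0, 3, 3, 3],
    [3, 0, 2, 4],
    [3, 2, 0, 4],
    [3, 4, 4, 0],
]
tree, merges = upgma(names, dist)
print(tree)
```

B and C, the closest pair, are joined first at height 1. Every tip ends up at the same distance from the root, which is the defining property of a UPGMA tree.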


UPGMA vs. Neighbor Joining

•  UPGMA works when samples have been taken at the same time

•  Neighbor Joining is better when samples have been taken at different times


NDtree

Output
•  distance.txt: Distance matrix - tab separated
•  dist.mat: Distance matrix - PHYLIP format
•  tree.nj.newick: Neighbor Joining tree - Newick format
    –  Branch lengths are the number of nucleotide differences
•  tree.upgma.newick: UPGMA tree - Newick format
    –  Branch lengths are the number of nucleotide differences


Controlled Evolution study

[Figure: culturing scheme running from Day 1 through Day 2 and Day 3 to Day 8 (128x)]
•  Single colonies of CSH114
•  Choose colony, grow for 8 h
•  Plate out, grow for 16 h
•  Choose colonies, grow for 8 h
•  Plate out, grow for 16 h
•  Choose colonies, grow for 8 h
•  For each 8 hour culture a sample was saved for DNA sequencing

J. Ahrenfeldt, C. Skaarup, H. Hasman, A. G. Pedersen, F. M. Aarestrup and O. Lund. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods. BMC Genomics (2017) 18:19


Naming the descendants

Day 1: S
Day 2: S1, S2
Day 3: S11, S12, S21, S22
Day 4: S111, S112, S121, S122, S211, S212, S221, S222
Day 5: S1111, S1112, S1121, S1122, S1211, S1212, S1221, S1222, S2111, S2112, S2121, S2122, S2211, S2212, S2221, S2222
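The naming scheme is regular enough to generate: each culture gives rise to two descendants, named by appending 1 or 2 to the parent's name. A short Python sketch (the function name is illustrative):

```python
def descendant_names(root="S", days=5):
    """Names per day: every culture has two descendants, suffixed 1 and 2."""
    generations = [[root]]
    for _ in range(days - 1):
        previous = generations[-1]
        generations.append([parent + suffix for parent in previous for suffix in "12"])
    return generations

generations = descendant_names()
for day, names in enumerate(generations, start=1):
    print(f"Day {day}: {', '.join(names)}")
```

Because a name encodes the full ancestry (S212 descends from S21, which descends from S2), the true tree is known in advance, which is what makes the data set usable as a benchmark.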


Mutations


Phylogenetic tree using NDtree (UPGMA)

[Figure: UPGMA tree with tips labelled by culture name (S, S1, S2, …); scale bar: 2.0]


Phylogenetic tree using NDtree (Neighbor Joining)

[Figure: Neighbor Joining tree with tips labelled by culture name (S, S1, S2, …)]


UPGMA vs. Neighbor Joining

•  UPGMA works when samples have been taken at the same time

•  Neighbor Joining is better when samples have been taken at different times


CSI Phylogeny – Default settings

[Figure: CSI Phylogeny tree built with default settings; scale bar: 0.2]

CSI Phylogeny – Pruning disabled

[Figure: CSI Phylogeny tree built with pruning disabled, tips labelled by culture name (S, S1, S2, …); scale bar: 0.2]


So… What should I use when?

CSI Phylogeny
•  Has very good statistics and a good graphical overview.
•  Advantageous to use when you expect the differences between the isolates to be larger than 5-10 mutations.
•  Is faster

NDtree
•  Is able to find very small differences.
•  Does not take recombination into consideration.
•  Works best on raw reads. If given assembled genomes, it simulates reads.


Choosing a reference genome

For comparison of very closely related isolates, a better level of detail is given by using a closely related reference genome.


What defines an outbreak

•  We can’t tell for certain
•  It depends on the species
•  But a rule of thumb is:
    –  Within 10 SNPs it is definitely an outbreak
    –  Within 30 SNPs it might be an outbreak
    –  Above 60 SNPs it is most likely not an outbreak
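Written as code, the rule of thumb looks like the Python sketch below. The function name and the wording of the return values are illustrative. The slide gives no guidance for distances between 31 and 60 SNPs, so the sketch reports that range as undetermined rather than guessing, and the cut-offs remain species dependent.

```python
def outbreak_call(snp_distance):
    """Apply the 10 / 30 / 60 SNP rule of thumb to a pairwise SNP distance."""
    if snp_distance < 0:
        raise ValueError("SNP distance cannot be negative")
    if snp_distance <= 10:
        return "definitely an outbreak"
    if snp_distance <= 30:
        return "might be an outbreak"
    if snp_distance > 60:
        return "most likely not an outbreak"
    return "undetermined"  # 31-60 SNPs: not covered by the rule of thumb

for distance in (4, 25, 45, 120):
    print(distance, "SNPs:", outbreak_call(distance))
```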


Thank you for listening

•  Questions?