A Probabilistic Approach to Whole Genome based Phylogeny
Johanne Ahrenfeldt
PhD student - Genomic Epidemiology
DTU Bioinformatics, Technical University of Denmark
Overview
• Whole Genome Based Phylogeny
• A data set with known phylogeny
• Base calling revisited
• A probabilistic approach to distance calculation
Whole genome based phylogeny
• Use of whole genome sequencing (WGS) is increasing as the price falls (or at least used to fall)
• A very useful tool for outbreak analysis of infectious diseases
• Used in the Haiti cholera outbreak to find the source of the infection
Mapping
[Figure. Labels: reads, reference genome, base calling, consensus sequence; Genomes 1 to 6 aligned to the reference genome.]
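Base calling

| Position | A | T | G | C |
|---|---|---|---|---|
| 1 | 40 | 6 | 4 | 10 |
| 2 | 50 | 10 | 4 | 6 |
| 3 | 5 | 50 | 0 | 5 |
| 4 | 0 | 0 | 60 | 0 |
| 5 | 5 | 0 | 2 | 53 |
| 6 | 3 | 5 | 50 | 2 |
| 7 | 0 | 1 | 0 | 59 |
| 8 | 55 | 0 | 5 | 0 |
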
Each position is evaluated by calculating a Z-score, where X is the number of reads having the most common nucleotide at that position and Y is the number of reads supporting other nucleotides.
To trust a base call we require:
• Z > 1.96
• more than 90% of the reads supporting the same base
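The two requirements can be sketched in a few lines of Python. This is a minimal illustration, assuming the Z-score has the form Z = (X - Y) / sqrt(X + Y); that formula is an assumption here, since it is not written out in the text. The function name `call_base`, the parameter names, and the `"N"` placeholder for untrusted positions are hypothetical choices made for this example.

```python
import math

BASES = "ATGC"  # column order of the count table


def call_base(counts, z_min=1.96, frac_min=0.9):
    """Return the called base, or "N" if the position is not trusted.

    counts: read counts per base, in the order of BASES.
    Assumption: Z = (X - Y) / sqrt(X + Y).
    """
    depth = sum(counts)
    if depth == 0:
        return "N"  # no reads, nothing to call
    x = max(counts)   # reads supporting the most common base
    y = depth - x     # reads supporting any other base
    z = (x - y) / math.sqrt(depth)
    # Both requirements must hold for the call to be trusted.
    if z > z_min and x / depth > frac_min:
        return BASES[counts.index(x)]
    return "N"


# Counts in A, T, G, C order, as in the table.
print(call_base([0, 0, 60, 0]))   # all reads agree
print(call_base([40, 6, 4, 10]))  # only two thirds of the reads agree
```

Under these assumptions a position can pass the Z-score test and still be rejected by the 90% rule, which is why both checks are kept separate.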
A data set with known phylogeny
To test various methods it is useful to have a data set with known phylogeny: the true tree structure is known, so each method's ability to infer the correct phylogeny can be evaluated.
Such a data set was created by in vitro evolution of E. coli.
J. Ahrenfeldt, C. Skaarup, H. Hasman, A. G. Pedersen, F. M. Aarestrup and O. Lund. Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods. BMC Genomics (2017) 18:19
In vitro evolution of E. coli
Data set with known phylogeny
Tree by NDtree - regular base calling
Base calling revisited
• Do we get all the information from simple base calling?
• How about vectorizing the counts and comparing these for each position to get the difference? If the scalar product of two normalized count vectors is 1, the angle between them is 0 degrees, since cos(0°) = 1, telling us that the vectors are identical.
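Base calling

| Position | A | T | G | C |
|---|---|---|---|---|
| 1 | 40 | 6 | 4 | 10 |
| 2 | 50 | 10 | 4 | 6 |
| 3 | 5 | 50 | 0 | 5 |
| 4 | 0 | 0 | 60 | 0 |
| 5 | 5 | 0 | 2 | 53 |
| 6 | 3 | 5 | 50 | 2 |
| 7 | 0 | 1 | 0 | 59 |
| 8 | 55 | 0 | 5 | 0 |

| Position | A | T | G | C |
|---|---|---|---|---|
| 1 | 30 | 6 | 4 | 10 |
| 2 | 40 | 10 | 4 | 6 |
| 3 | 5 | 40 | 0 | 5 |
| 4 | 0 | 0 | 50 | 0 |
| 5 | 5 | 0 | 2 | 43 |
| 6 | 3 | 5 | 40 | 2 |
| 7 | 0 | 1 | 0 | 49 |
| 8 | 4 | 0 | 5 | 0 |
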
Vectorized counts
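A small Python sketch of the vector comparison, assuming each position is represented by its four read counts in A, T, G, C order and that the scalar product is taken after normalizing both vectors to unit length. The function name `angle_degrees` is hypothetical and only serves this illustration.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two count vectors (A, T, G, C)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a position with zero coverage has no direction")
    # Scalar product of the normalized vectors = cosine of the angle.
    cos_theta = dot / (norm_u * norm_v)
    # Clamp against floating-point drift before calling acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Same base proportions at different depth: the vectors point the same way.
print(angle_degrees([0, 0, 60, 0], [0, 0, 50, 0]))
# No base in common: the vectors are orthogonal.
print(angle_degrees([0, 0, 60, 0], [0, 1, 0, 49]))
```

Because the vectors are normalized, sequencing depth drops out and only the base proportions at the position are compared.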
Base calling revisited
• Do we get all the information from simple base calling?
• How about vectorizing the counts and comparing these for each position to get the difference? If the scalar product of two normalized count vectors is 1, the angle between them is 0 degrees, since cos(0°) = 1, telling us that the vectors are identical.
(Been there, done that, didn't really work)
• Bayesian probabilistic phylogeny
Bayesian probabilistic phylogeny
• Assume two consensus sequences x and y
• At each position i we have counts Xi(a) for base a in genome x and similarly Yi(a) for the other genome. The total counts at position i are called Xi and Yi
• Distance could be the expected number of positions where the two genomes differ
• The correct base at position i is called xi in genome x and yi in genome y
• Calculate the probability that the two genomes differ for each position
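Written out, the expected number of differing positions is the sum of the per-position probabilities, which follows from linearity of expectation. In this sketch the symbols d, N (the number of compared positions) and D_i (shorthand for all observed counts Xi(a) and Yi(a) at position i) are notation introduced here, not taken from the slides.

```latex
d(x, y)
  = \mathbb{E}\bigl[\,\#\{\, i : x_i \neq y_i \,\}\,\bigr]
  = \sum_{i=1}^{N} P\bigl(x_i \neq y_i \mid D_i\bigr)
```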
Frequentist interpretation of Bayes' theorem
In the frequentist interpretation, probability measures a “proportion of outcomes.” For example, suppose an experiment is performed many times. P(A) is the proportion of outcomes with property A, and P(B) that with property B. P(B | A ) is the proportion of outcomes with property B out of outcomes with property A, and P(A | B ) the proportion of those with A out of those with B.
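For reference, the theorem that relates these four proportions can be stated as follows (valid whenever P(B) > 0):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```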
Bayesian probabilistic phylogeny
• The probability of observing the counts given that x and y are different
• The probability of x and y being different: the prior, tested at 0.01
• The probability of observing the counts given that x and y are the same
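One way these three terms could combine through Bayes' theorem is sketched below. This is a reconstruction under assumptions, not necessarily the exact expression used in the talk; D_i is notation introduced here for the observed counts Xi(a) and Yi(a) at position i.

```latex
P(x_i \neq y_i \mid D_i)
  = \frac{P(D_i \mid x_i \neq y_i)\, P(x_i \neq y_i)}
         {P(D_i \mid x_i \neq y_i)\, P(x_i \neq y_i)
          + P(D_i \mid x_i = y_i)\, \bigl(1 - P(x_i \neq y_i)\bigr)}
```

With the prior tested at 0.01, the factor 1 - P(x_i ≠ y_i) in the denominator is 0.99.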
Single position calculation: Example 1

Prior P(X=Y) = 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions).

| Genome y base (count) | Genome x: A (10) | C (2) | G (0) | T (0) |
|---|---|---|---|---|
| A (100) | 6.4*10^-21 | 6.9*10^-37 | 1.1*10^-42 | 1.1*10^-42 |
| C (20) | 1.4*10^-180 | 1.5*10^-196 | 2.4*10^-202 | 2.4*10^-202 |
| G (0) | 5.9*10^-243 | 6.4*10^-259 | 1*10^-264 | 1*10^-264 |
| T (0) | 5.9*10^-243 | 6.4*10^-259 | 1*10^-264 | 1*10^-264 |

Total reads: genome x 12, genome y 120
Diagonal: 6.4*10^-21
Total: 6.4*10^-21
P(x!=y): 1.2*10^-18 5.9*10^-12
Single position calculation: Example 2

Prior P(X=Y) = 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions).

| Genome y base (count) | Genome x: A (10) | C (2) | G (0) | T (0) |
|---|---|---|---|---|
| A (10) | 4.3*10^-11 | 4.6*10^-27 | 7.2*10^-33 | 7.2*10^-33 |
| C (1) | 9.4*10^-32 | 1.0*10^-47 | 1.5*10^-53 | 1.5*10^-53 |
| G (0) | 5.9*10^-35 | 6.4*10^-51 | 1*10^-56 | 1*10^-56 |
| T (5) | 2.4*10^-21 | 2.6*10^-37 | 4.1*10^-43 | 4.1*10^-43 |

Total reads: genome x 12, genome y 16
Diagonal: 4.3*10^-11
Total: 4.3*10^-11
P(x!=y): 5.8*10^-13 2.9*10^-06
Single position calculation: Example 3

Prior P(X=Y) = 0.99. Sets the prior belief that two positions are identical (= fraction of identical positions).

| Genome y base (count) | Genome x: A (10) | C (0) | G (5) | T (0) |
|---|---|---|---|---|
| A (2) | 3.6*10^-35 | 1.3*10^-58 | 3.8*10^-45 | 1.3*10^-58 |
| C (10) | 4.7*10^-17 | 1.7*10^-40 | 5.0*10^-27 | 1.7*10^-40 |
| G (0) | 2.7*10^-41 | 1*10^-64 | 2.8*10^-51 | 1*10^-64 |
| T (5) | 1.5*10^-27 | 5.8*10^-51 | 1.6*10^-37 | 5.8*10^-51 |

Total reads: genome x 15, genome y 17
Diagonal: 3.6*10^-35
Total: 4.7*10^-17
P(x!=y): 1 5000000
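The structure of a single position calculation (a 4 x 4 table of likelihoods, its diagonal, and the resulting probability of a difference) can be sketched in Python. The read-count model below is an assumption chosen for illustration: independent reads with a uniform per-read error rate of 0.01. It is not the model behind the tables and will not reproduce their numbers. The names `log_likelihood`, `prob_different` and `error_rate` are hypothetical.

```python
import math

BASES = "ACGT"  # column order of the example tables


def log_likelihood(counts, true_base, error_rate=0.01):
    """log P(counts | true base) under an ASSUMED uniform error model.

    The multinomial coefficient is left out: it is the same for every
    candidate base and cancels in the posterior.
    """
    right = counts[BASES.index(true_base)]
    wrong = sum(counts) - right
    return right * math.log(1 - error_rate) + wrong * math.log(error_rate / 3)


def prob_different(x_counts, y_counts, prior_same=0.99, error_rate=0.01):
    """Posterior probability that the true bases of x and y differ."""
    # Rows are the candidate base of y, columns the candidate base of x.
    table = [[log_likelihood(x_counts, a, error_rate)
              + log_likelihood(y_counts, b, error_rate)
              for a in BASES] for b in BASES]
    # Prior mass prior_same is spread over the 4 identical pairs,
    # the remainder over the 12 differing pairs.
    log_same = [table[k][k] + math.log(prior_same / 4) for k in range(4)]
    log_diff = [table[r][c] + math.log((1 - prior_same) / 12)
                for r in range(4) for c in range(4) if r != c]
    # Subtract the largest term so the exponentials do not underflow.
    top = max(log_same + log_diff)
    same = sum(math.exp(v - top) for v in log_same)
    diff = sum(math.exp(v - top) for v in log_diff)
    return diff / (same + diff)


# Counts in A, C, G, T order, taken from the first and last example.
print(prob_different([10, 2, 0, 0], [100, 20, 0, 0]))  # both dominated by A
print(prob_different([10, 0, 5, 0], [2, 10, 0, 5]))    # A versus C
```

Working with logarithms matters here: likelihoods as small as those in the tables (down to 10^-264) sit close to the limit of double precision, and products of many such terms would underflow to zero. Summing `prob_different` over all positions would give the expected number of differing positions used as the distance.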
Tree
Acknowledgements
Anders Krogh
Anders Gorm Pedersen
Ole Lund