
Phylogenies from Large Samples of Bacterial Genomes

Bernhard Haubold
MPI for Evolutionary Biology, Plön

June 10, 2016

Bernhard Haubold (MPI Plön) Whole Genome Phylogenies RaMi-NGS 2016 1 / 17

Overview

From genomes to phylogenies

Approximate alignments


From Genomes to Phylogenies—1

Genomes → Alignment → Distance Matrix → Tree

[Figure: four genomes, S1 to S4, are aligned; the alignment yields a matrix of pairwise distances, from which a tree with leaves S1 to S4 is computed.]

      S1    S2    S3    S4
S1    0
S2    d2,1  0
S3    d3,1  d3,2  0
S4    d4,1  d4,2  d4,3  0
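The last step of the pipeline, from distance matrix to tree, can be sketched in a few lines. The slides do not name the tree-building method, so this sketch assumes average-linkage clustering (UPGMA); the function name `upgma` and the example distances are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by average linkage and return the tree as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    # Each cluster is stored as (subtree, number of leaves).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = f"({a},{b})"
        # Distance from the merged cluster to each remaining cluster
        # is the size-weighted mean of the two old distances.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances for the four genomes S1 to S4.
example_names = ["S1", "S2", "S3", "S4"]
example_dist = {
    frozenset(("S1", "S2")): 0.02,
    frozenset(("S3", "S4")): 0.03,
    frozenset(("S1", "S3")): 0.10,
    frozenset(("S1", "S4")): 0.10,
    frozenset(("S2", "S3")): 0.10,
    frozenset(("S2", "S4")): 0.10,
}
example_tree = upgma(example_names, example_dist)
```

With these made-up distances, S1 groups with S2 and S3 with S4. This step only touches the small distance matrix, which fits the point that tree computation is fast compared with alignment.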

From Genomes to Phylogenies—2

Genomes → Alignment → Distance Matrix → Tree

[Figure: the same pipeline, genomes S1 to S4, and distance matrix as on the previous slide, with each step labeled by its speed.]

The three steps, left to right: slow, fast, fast

From Genomes to Phylogenies—3

Genomes → Alignment → Distance Matrix → Tree

[Figure: the same pipeline and step labels (slow, fast, fast) as on the previous slide, with andi added.]

Approximate Alignment

Only consider pairs of sequences.

[Figure: the two sequences of a pair, Q and S]

Anchors

[Figure: anchors between Q and S]

Anchors:

- Unique
- Cannot be extended (maximal)
- Longer than random match
- Equidistant
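The anchor criteria can be illustrated with a deliberately naive sketch. It assumes a greedy left-to-right scan of Q, plain substring search instead of an index, and a fixed minimum length standing in for "longer than random match"; the function names, the threshold, and the example sequences are all hypothetical and not taken from andi.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_anchors(query, subject, min_len):
    """Return (q_start, s_start, length) for long matches that are unique in subject."""
    anchors = []
    i = 0
    while i < len(query):
        # Grow the match starting at query[i] while it still occurs in subject.
        length = 0
        while i + length < len(query) and query[i:i + length + 1] in subject:
            length += 1
        match = query[i:i + length]
        # Keep the match only if it is long enough and occurs exactly once.
        if length >= min_len and count_overlapping(subject, match) == 1:
            anchors.append((i, subject.find(match), length))
        # Continue after the match and the character that ended it.
        i += length + 1
    return anchors


def equidistant_pairs(anchors):
    """Consecutive anchors separated by the same distance in both sequences."""
    return [
        (a, b)
        for a, b in zip(anchors, anchors[1:])
        if b[0] - a[0] == b[1] - a[1]
    ]


# Hypothetical toy sequences: two matches separated by one mismatch.
example_q = "ACGTTAGACATCAGAA"
example_s = "CCACGTTAGGCATCAGTT"
example_anchors = find_anchors(example_q, example_s, min_len=5)
example_pairs = equidistant_pairs(example_anchors)
```

In the toy example the two anchors start 8 positions apart in both Q and S, so the single position between them can be compared directly without computing an alignment.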

Anchor Distance

g1 AATGCCACCGGGTGATGATAGCCTCGATAGGCCGCAGGTCTCGCGGGGAAATC

g2 GCGAGAGCGCACCAGCGGGTGATGATAGCCTGGATAGGCCGCAGGACGGT

$d_a = \frac{1}{20 + 13} \approx 0.03$
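The arithmetic on this slide can be checked directly. The sketch reads the formula as one mismatch over 20 + 13 compared positions, which is an interpretation of the slide rather than something it states; `mismatch_fraction` is a hypothetical helper for equal-length segments.

```python
def mismatch_fraction(segment_q, segment_s):
    """Fraction of positions at which two equal-length segments differ."""
    if len(segment_q) != len(segment_s):
        raise ValueError("segments must have equal length")
    mismatches = sum(x != y for x, y in zip(segment_q, segment_s))
    return mismatches / len(segment_q)


# Numbers from the slide: 1 mismatch, 20 + 13 positions (interpretation assumed).
d_a = 1 / (20 + 13)
d_a_rounded = round(d_a, 2)
```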

Searching

[Figure: Q searched against S]

Compute index of S:
- Time- & memory-intensive step
- Parallelize

Search index of S with Q: quick
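The split into a costly indexing step and a quick search step can be sketched with a suffix array. This is only one possible index, and the slide does not say which one is used; the construction below is the simplest correct one, not an efficient one, and both function names are hypothetical.

```python
def build_suffix_array(subject):
    """Sort all suffix start positions lexicographically (the slow, one-off step)."""
    return sorted(range(len(subject)), key=lambda i: subject[i:])


def occurrences(subject, suffix_array, pattern):
    """Return the sorted start positions of pattern in subject by binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return subject[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


# Hypothetical subject sequence; the index is built once and queried many times.
example_subject = "CCACGTTAGGCATCAGTT"
example_index = build_suffix_array(example_subject)
example_hits = occurrences(example_subject, example_index, "CA")
```

Each query needs only a logarithmic number of string comparisons against the index, and a hit count of exactly one is what the uniqueness criterion for anchors asks for.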

Implementation

Program: andi (ANchor DIstances)

Code: www.github.com/evolbioinf/andi


Accuracy

[Figure: log-log plot of the anchor distance d_a (y-axis, 10^-5 to 10^0) against Substitutions per Site (K) (x-axis, 10^-4 to 10^-1); series: d_a, ideal]

Problems at high Substitution Rates

[Figure: d_a (left axis, 0.1 to 1.0) and Failed d_a Estimation (right axis, 0 to 1) against Substitutions per Site (K) (x-axis, 0.4 to 0.7); series: d_a, ideal, failed d_a]

29 Escherichia coli Genomes

mugsy: 2 h 29 min
andi: 16.7 s

[Figure: two phylogenies of the same 29 genomes, one computed with mugsy and one with andi, each with a scale bar of 0.002. Taxa, in the order listed for both trees:]

E. coli IAI1
E. coli SE11
E. coli E24377A
S. sonnei Ss046
S. boydii Sb227
S. boydii CDC 3083-94
S. flexneri 5 str. 8401
S. flexneri 2a str. 2457T
S. flexneri 2a str. 301
E. coli ATCC 8739
E. coli HS
E. coli str. K-12 substr. MG1655
E. coli str. K-12 substr. W3110
E. coli str. K-12 substr. DH10B
E. coli BW2952
S. dysenteriae Sd197
E. coli O55:H7 str. CB9615
E. coli O157:H7 EDL933
E. coli O157:H7 str. Sakai
E. coli UMN026
E. coli IAI39
E. coli SMS-3-5
E. coli O127:H6 E2348/69
E. coli 536
E. coli ED1a
E. coli CFT073
E. coli S88
E. coli UTI89
E. coli APEC O1

500-fold speedup
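The speedup follows from the two run times on this slide; a quick check, using only those two numbers:

```python
# Run times from the slide, converted to seconds.
mugsy_seconds = 2 * 3600 + 29 * 60   # 2 h 29 min
andi_seconds = 16.7

speedup = mugsy_seconds / andi_seconds   # a little over 500
```

The ratio comes out slightly above 500, which the slide rounds to "500-fold".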

Time & Memory

[Figure: Time (s) (left axis, 0 to 250) and Memory (GB) (right axis, 0 to 5) against the number of Processors (x-axis, 0 to 30); series: Time, Memory; a "Laptop Zone" is marked on the plot]

3085 Streptococcus pneumoniae Genomes (2.2 Mb)

4 h 37 min on 24-core computer; 9.2 GB RAM

Chewapreecha et al. (2014). Nature Genetics, 46:305–309.
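To put the run time in proportion, the number of genome pairs and a rough cost per pair can be worked out from the figures on this slide. The sketch assumes each unordered pair is counted once and that all 24 cores were busy for the whole run; neither assumption comes from the slide.

```python
from math import comb

genomes = 3085
wall_seconds = 4 * 3600 + 37 * 60   # 4 h 37 min
cores = 24

# Entries below the diagonal of the distance matrix.
pairs = comb(genomes, 2)

# Upper bound on core time per pair, assuming perfect use of all cores.
core_seconds_per_pair = wall_seconds * cores / pairs
```

Under these assumptions there are about 4.8 million pairs, each costing less than a tenth of a core-second.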

Summary

From genomes to phylogenies

genome → alignment → distance matrix → tree

Approximate alignments

genome → alignment → distance matrix → tree

ANchor DIstances: andi
- accurate & scalable to thousands of genomes
- www.github.com/evolbioinf/andi
- Ubuntu 16.04 (Xenial Xerus)

Acknowledgments

Fabian Klötzl, Plön

Peter Pfaffelhuber, Freiburg
