eve 161: microbial phylogenomics class #10: genome sequencing · microbial phylogenomics class #10:...

46
EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant: Cassie Ettinger 1

Upload: others

Post on 27-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

EVE 161:Microbial Phylogenomics

Class #10: Genome Sequencing

UC Davis, Winter 2018 Instructor: Jonathan Eisen

Teaching Assistant: Cassie Ettinger

!1

Page 2: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

RESEARCH ARTICLE

Whole-Genome RandomSequencing and Assembly ofHaemophilus influenzae Rd

Robert D. Fleischmann, Mark D. Adams, Owen White, Rebecca A. Clayton,Ewen F. Kirkness, Anthony R. Kerlavage, Carol J. Bult, Jean-Francois Tomb,Brian A. Dougherty, Joseph M. Merrick, Keith McKenney, Granger Sutton,

Will FitzHugh, Chris Fields,* Jeannine D. Gocayne, John Scott, Robert Shirley,Li-lng Liu, Anna Glodek, Jenny M. Kelley, Janice F. Weidman, Cheryl A. Phillips,

Tracy Spriggs, Eva Hedblom, Matthew D. Cotton, Teresa R. Utterback,Michael C. Hanna, David T. Nguyen, Deborah M. Saudek, Rhonda C. Brandon,Leah D. Fine, Janice L. Fritchman, Joyce L. Fuhrmann, N. S. M. Geoghagen,

Cheryl L. Gnehm, Lisa A. McDonald, Keith V. Small, Claire M. Fraser,Hamilton O. Smith, J. Craig Ventert

An approach for genome analysis based on sequencing and assembly of unselectedpieces of DNA from the whole chromosome has been applied to obtain the completenucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Hae-mophilus influenzae Rd. This approach eliminates the need for initial mapping efforts andis therefore applicable to the vast array of microbial species for which genome maps areunavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase ac-cession number L42023) represents the only complete genome sequence from a free-living organism.

A prerequisite to understanding the com-plete biology of an organism is the deter-mination of its entire genome sequence.Several viral and organellar genomes havebeen completely sequenced. Bacterio-phage 0X174 [5386 base pairs (bp)] wasthe first to be sequenced, by Fred Sangerand colleagues in 1977 (1). Sanger et al.were also the first to use strategy based onrandom (unselected) pieces of DNA, com-pleting the genome sequence of bacterio-phage X (48,502 bp) with cloned restric-tion enzyme fragments (1). Subsequently,the 229-kb genome of cytomegalovirus(CMV) (2), the 192-kb genome of vaccin-ia (3), and the 187-kb mitochondrial and121-kb chloroplast genomes of Marchantiapolymorpha (4) have been sequenced. The186-kb genome of variola (smallpox) wasthe first to be completely sequenced withautomated technology (5).

At the present time, there are active ge-nome projects for many organisms, includingDrosophila melanogaster (6), Escherichia coli(7), Saccharomyces cerevisiae (8), Bcillus sub-tilis (9), Caenorhabditis elegans (10), andJ.-F. Tomb, B. A. Dougherty, and H. O. Smith are with theJohns Hopkins University School of Medicine, Baltimore,MD 21205, USA. J. M. Merrick is with the State Universityof New York, Department of Microbiology, Buffalo, NY,14214, USA. K. McKenney is with the National Institutefor Standards and Technology, Gaithersburg, MD 20878,USA. All other authors are with The Institute for GenomicResearch (TIGR), Gaithersburg, MD, 20878, USA. Theaddress for TIGR as of 9 September 1995 is 9712 Med-ical Center Drive, Rockville, MD 20850, USA.*Present address: The National Center for Genome Re-sources, Santa Fe, NM, 87505, USA.tTo whom correspondence should be addressed.

496

Homo sapiens (11). These projects, as well as

viral genome sequencing, have been basedprimarily on the sequencing of clones usuallyderived from extensively mapped restrictionfragments, or or cosmid clones. Despiteadvances in DNA sequencing technology(12) the sequencing of genomes has not

progressed beyond clones on the order of thesize of (-40 kb). This has been primarilybecause of the lack of sufficient computa-tional approaches that would enable the ef-ficient assembly of a large number (tens ofthousands) of independent, random se-

quences into a single assembly.The computational methods developed

to create assemblies from hundreds of thou-sands of 300- to 500-bp complementaryDNA (cDNA) sequences (13) led us to testthe hypothesis that segments of DNA sev-

eral megabases in size, including entire mi-crobial chromosomes, could be sequencedrapidly, accurately, and cost-effectively byapplying a shotgun sequencing strategy towhole genomes. With this strategy, a singlerandom DNA fragment library may be pre-pared, and the ends of a sufficient numberof randomly selected fragments may be se-

quenced and assembled to produce thecomplete genome. We chose the free-livingorganism Haemophilus influenzae Rd as a

pilot project because its genome size (1.8Mb) is typical among bacteria, its G+Cbase composition (38 percent) is close tothat of human, and a physical clone mapdid not exist.

Haemophilus influenzae is a small, nonmo-tile, Gram-negative bacterium whose onlySCIENCE * VOL. 269 * 28 JULY 1995

natural host is human. Six H. influenzaeserotype strains (a through f) have beenidentified on the basis of immunologicallydistinct capsular polysaccharide antigens.Non-typeable strains also exist and are dis-tinguished by their lack of detectable capsu-lar polysaccharide. They are commensal res-idents of the upper respiratory mucosa ofchildren and adults and cause otitis mediaand respiratory tract infections, mostly inchildren. More serious invasive infection iscaused almost exclusively by type b strains,with meningitis producing neurological se-quelae in up to 50 percent of affected chil-dren. A vaccine based on the type b capsularantigen is now available and has dramatical-ly reduced the incidence of the disease inEurope and North America.

Genome sequencing. The strategy for ashotgun approach to whole genome se-quencing is outlined in Table 1. The theoryfollows from the Lander and Waterman(14) application of the equation for thePoisson distribution. The probability that abase is not sequenced is P0 = e-", where mis the sequence coverage. Thus after 1.83Mb of sequence has been randomly gener-ated for the H. influenzae genome (m = 1, 1X coverage), P0 = e-1 = 0.37 and approx-imately 37 percent of the genome is unse-quenced. Fivefold coverage (approximately9500 clones sequenced from both insertends and an average sequence read lengthof 460 bp) yields P, = e-5 = 0.0067, or 0.67percent unsequenced. If L is genome lengthand n is the number of random sequencesegments done, the total gap length is Le-",and the average gap size is L/n. Fivefoldcoverage would leave about 128 gaps aver-aging about 100 bp in size.

To approximate the random model dur-ing actual sequencing, procedures for libraryconstruction (15) and cloning (16) weredeveloped. Genomic DNA from H. influen-zae Rd strain KW20 (17) was mechanicallysheared, digested with BAL 31 nuclease toproduce blunt ends, and size-fractionated byagarose gel electrophoresis. Mechanicalshearing maximizes the randomness of theDNA fragments. Fragments between 1.6and 2.0 kb in size were excised and recov-ered. This narrow range was chosen to min-imize variation in growth of clones. In ad-dition, we chose this maximum size to min-imize the number of complete genes thatmight be present in a single fragment, andthus might be lost as a result of expressionof deleterious gene products. These frag-ments were ligated to Sma I-cut, phos-phatase-treated pUC18 vector, and the li-gated products were fractionated on an aga-rose gel. The linear vector plus insert bandwas excised and recovered. The ends of thelinear recombinant molecules were repairedwith T4 polymerase, and the moleculeswere then ligated into circles. This two-

Lla(.%·BlllaB.LPr.lterrilr.l.

on February 8, 2018

http://science.sciencemag.org/

Dow

nloaded from

Page 3: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

• Presenters?

Page 4: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

• Questions / To Cover?

Page 5: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

RESEARCH ARTICLE

Whole-Genome RandomSequencing and Assembly ofHaemophilus influenzae Rd

Robert D. Fleischmann, Mark D. Adams, Owen White, Rebecca A. Clayton,Ewen F. Kirkness, Anthony R. Kerlavage, Carol J. Bult, Jean-Francois Tomb,Brian A. Dougherty, Joseph M. Merrick, Keith McKenney, Granger Sutton,

Will FitzHugh, Chris Fields,* Jeannine D. Gocayne, John Scott, Robert Shirley,Li-lng Liu, Anna Glodek, Jenny M. Kelley, Janice F. Weidman, Cheryl A. Phillips,

Tracy Spriggs, Eva Hedblom, Matthew D. Cotton, Teresa R. Utterback,Michael C. Hanna, David T. Nguyen, Deborah M. Saudek, Rhonda C. Brandon,Leah D. Fine, Janice L. Fritchman, Joyce L. Fuhrmann, N. S. M. Geoghagen,

Cheryl L. Gnehm, Lisa A. McDonald, Keith V. Small, Claire M. Fraser,Hamilton O. Smith, J. Craig Ventert

An approach for genome analysis based on sequencing and assembly of unselectedpieces of DNA from the whole chromosome has been applied to obtain the completenucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Hae-mophilus influenzae Rd. This approach eliminates the need for initial mapping efforts andis therefore applicable to the vast array of microbial species for which genome maps areunavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase ac-cession number L42023) represents the only complete genome sequence from a free-living organism.

A prerequisite to understanding the com-plete biology of an organism is the deter-mination of its entire genome sequence.Several viral and organellar genomes havebeen completely sequenced. Bacterio-phage 0X174 [5386 base pairs (bp)] wasthe first to be sequenced, by Fred Sangerand colleagues in 1977 (1). Sanger et al.were also the first to use strategy based onrandom (unselected) pieces of DNA, com-pleting the genome sequence of bacterio-phage X (48,502 bp) with cloned restric-tion enzyme fragments (1). Subsequently,the 229-kb genome of cytomegalovirus(CMV) (2), the 192-kb genome of vaccin-ia (3), and the 187-kb mitochondrial and121-kb chloroplast genomes of Marchantiapolymorpha (4) have been sequenced. The186-kb genome of variola (smallpox) wasthe first to be completely sequenced withautomated technology (5).

At the present time, there are active ge-nome projects for many organisms, includingDrosophila melanogaster (6), Escherichia coli(7), Saccharomyces cerevisiae (8), Bcillus sub-tilis (9), Caenorhabditis elegans (10), andJ.-F. Tomb, B. A. Dougherty, and H. O. Smith are with theJohns Hopkins University School of Medicine, Baltimore,MD 21205, USA. J. M. Merrick is with the State Universityof New York, Department of Microbiology, Buffalo, NY,14214, USA. K. McKenney is with the National Institutefor Standards and Technology, Gaithersburg, MD 20878,USA. All other authors are with The Institute for GenomicResearch (TIGR), Gaithersburg, MD, 20878, USA. Theaddress for TIGR as of 9 September 1995 is 9712 Med-ical Center Drive, Rockville, MD 20850, USA.*Present address: The National Center for Genome Re-sources, Santa Fe, NM, 87505, USA.tTo whom correspondence should be addressed.

496

Homo sapiens (11). These projects, as well as

viral genome sequencing, have been basedprimarily on the sequencing of clones usuallyderived from extensively mapped restrictionfragments, or or cosmid clones. Despiteadvances in DNA sequencing technology(12) the sequencing of genomes has not

progressed beyond clones on the order of thesize of (-40 kb). This has been primarilybecause of the lack of sufficient computa-tional approaches that would enable the ef-ficient assembly of a large number (tens ofthousands) of independent, random se-

quences into a single assembly.The computational methods developed

to create assemblies from hundreds of thou-sands of 300- to 500-bp complementaryDNA (cDNA) sequences (13) led us to testthe hypothesis that segments of DNA sev-

eral megabases in size, including entire mi-crobial chromosomes, could be sequencedrapidly, accurately, and cost-effectively byapplying a shotgun sequencing strategy towhole genomes. With this strategy, a singlerandom DNA fragment library may be pre-pared, and the ends of a sufficient numberof randomly selected fragments may be se-

quenced and assembled to produce thecomplete genome. We chose the free-livingorganism Haemophilus influenzae Rd as a

pilot project because its genome size (1.8Mb) is typical among bacteria, its G+Cbase composition (38 percent) is close tothat of human, and a physical clone mapdid not exist.

Haemophilus influenzae is a small, nonmo-tile, Gram-negative bacterium whose onlySCIENCE * VOL. 269 * 28 JULY 1995

natural host is human. Six H. influenzaeserotype strains (a through f) have beenidentified on the basis of immunologicallydistinct capsular polysaccharide antigens.Non-typeable strains also exist and are dis-tinguished by their lack of detectable capsu-lar polysaccharide. They are commensal res-idents of the upper respiratory mucosa ofchildren and adults and cause otitis mediaand respiratory tract infections, mostly inchildren. More serious invasive infection iscaused almost exclusively by type b strains,with meningitis producing neurological se-quelae in up to 50 percent of affected chil-dren. A vaccine based on the type b capsularantigen is now available and has dramatical-ly reduced the incidence of the disease inEurope and North America.

Genome sequencing. The strategy for ashotgun approach to whole genome se-quencing is outlined in Table 1. The theoryfollows from the Lander and Waterman(14) application of the equation for thePoisson distribution. The probability that abase is not sequenced is P0 = e-", where mis the sequence coverage. Thus after 1.83Mb of sequence has been randomly gener-ated for the H. influenzae genome (m = 1, 1X coverage), P0 = e-1 = 0.37 and approx-imately 37 percent of the genome is unse-quenced. Fivefold coverage (approximately9500 clones sequenced from both insertends and an average sequence read lengthof 460 bp) yields P, = e-5 = 0.0067, or 0.67percent unsequenced. If L is genome lengthand n is the number of random sequencesegments done, the total gap length is Le-",and the average gap size is L/n. Fivefoldcoverage would leave about 128 gaps aver-aging about 100 bp in size.

To approximate the random model dur-ing actual sequencing, procedures for libraryconstruction (15) and cloning (16) weredeveloped. Genomic DNA from H. influen-zae Rd strain KW20 (17) was mechanicallysheared, digested with BAL 31 nuclease toproduce blunt ends, and size-fractionated byagarose gel electrophoresis. Mechanicalshearing maximizes the randomness of theDNA fragments. Fragments between 1.6and 2.0 kb in size were excised and recov-ered. This narrow range was chosen to min-imize variation in growth of clones. In ad-dition, we chose this maximum size to min-imize the number of complete genes thatmight be present in a single fragment, andthus might be lost as a result of expressionof deleterious gene products. These frag-ments were ligated to Sma I-cut, phos-phatase-treated pUC18 vector, and the li-gated products were fractionated on an aga-rose gel. The linear vector plus insert bandwas excised and recovered. The ends of thelinear recombinant molecules were repairedwith T4 polymerase, and the moleculeswere then ligated into circles. This two-

Lla(.%·BlllaB.LPr.lterrilr.l.

on February 8, 2018

http://science.sciencemag.org/

Dow

nloaded from

Page 6: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 7: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 8: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

What Genomes Done Before?

• What Genomes Had Been Sequenced?

Page 9: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 10: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

How Were Genomes Being Sequenced?

• How Were Genomes Being Sequenced?

Page 11: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 12: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

What is new here?

• What did they do that was new in terms of sequencing?

Page 13: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 14: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 15: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Organisms Chosen

• What is H. Influenzae?

• Why did they choose it?

Page 16: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 17: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Sequencing Details?

• Outline of sequencing approach?

Page 18: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Shotgun Sequencing

• What is shotgun sequencing?

• How does it work?

Page 19: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Shotgun Sequencing

• What is shotgun sequencing?

• How does it work?

• How can you complete a genome this way?

Page 20: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 21: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Lander Waterman

P0=e-m

P0 = probability a base is

not sequenced

m = coverage

Page 22: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

?

P0=e-m

P0 = probability a base is

not sequenced

WHY DOES RANDOMNESS

MATTER?

Page 23: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Show Curve

00.10.20.30.40.50.60.70.80.9

1

0 2000 4000 6000 8000 10000 12000Perc

ent U

nseq

uenc

ed

Number of Reads

Lander Waterman Curve

100 Bp 500 bp 1000 bp

Page 24: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Other issues

• Why do they talk about ESTS?

• Why do they need a database?

Page 25: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Assembly

Page 26: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Assembly

Page 27: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Assembly Method

• TIGR Assembler

• Smith Waterman

• Paired Ends

Page 28: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Assembly Results

• Contigs

• Sequencing Gaps

• Physical Gaps

Page 29: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Gap Filling

Page 30: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Gap Filling

• DNA Hybridization

• Peptide Links

• Phage lambda libraries

• PCR

Page 31: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Accuracy Checks?

Page 32: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Accuracy Checks?

• Frameshift in protein sequences vs. Homologs

• Coverage > 1x

• Few ambiguities

• Comparison to H. Influenza sequence in DBs

Page 33: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Cost

Page 34: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Cost

• 0.48$ / finished bp

Page 35: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Annotation

• Structural

• Functional

Page 36: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

Annotation

• Structural

• Functional

Page 37: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 38: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

n0002 P0007 fdo« *ooot dh· HIOOHI *·C

10001O gap~ 10(taO 10001 fdol IHOui 10010 -- 100K^^KII001_«r«B ·003* *r*D

K00031 nIOoM~hD n0010 rLl ot r lOOlwB01l«» B0017 KIOOM nPOMO i"003 B0 oier 10024oItO 10024 lipfc 1002 K002daoA) MMOUT11010033 I1003510040 n0043 10(

K000410011 bOlDPO1001 noO~100« ·» n0033 oil 10027 lip» 10030 rlpK 10033pbp2 n0034 n0041 xthk n0043~~~~~~~ad r I~~~~~~~~~~~~~~~~~~~~~~~~~~~~02»«

BOM* o^»2 ........'gax47 ioi«4 B~~~~~~~~~~i((_aa ioi««Y

·I1JlO ag___________________ 1 P_____voI_________________________"0-? "« nol* r"gl" laIglg lrlS1^ ^ O? 1!^ 1

^____________1013 duQnmmm r04 D-f 104 uo«'1n0404 0e £C 01 0* ef11S a 05 ~ 0S 111gr113115 07 0*

1

0Or~~~~~~~~~04 nolr*n P0144 glk1~ B01S1 hfUr Bpldoc OolS~fjM 001S* x»U21P II n017«10l7»actr

n0(*artr rmnxn 10« up O00 ra I00 rrm 10310nlrlll~ nmm~ iT----t vaiq a--sa ----P01=) i030 [=) IISn~OIL ollPrrOnudolLborOI~ore~r·01 ~Q10~

102* nazQr 102M zpo l VU I07 102*0D I4 101 0* r123 125zo130- 031ty 00 00 00 rC 139x

«e20 027 ytS 1024 1«n027«O~1027* P02«l 1<>2»2 102*3 PoP C *2*nuX102*0oop*t ·102*2 *u 0* o0«2« i»P2»pl 00*131133*10271 10273xyh 1027K027*102t~tC 102*» *toC 102*1 102*4 autS B~f~pUC 10312 rwf 10314 r

1042 d~l> 0444top _g«p~f l~»P 0471hi» 1073n02tO 04 cX 04 0« 04 an 145 0 B0(wa 0* la 143 - ^iet rn04«« hipan0470 hi»Cn0472 hi- K

n02»- 03 143e 03 0- 143 P104t»X14*t« 04*141 ...'*".no lel MOb onpr*B** t 04 0«

no0430u P10432o t034- 10434 Ulflo103 cir>r 1*447 IwtS 10450 norrOb n0454 1045(lrr n~~oor nrpocoo P04»1043 oc-a r ~ urrrrrno~t ·r011 P P~nglUno~r PO1o ·01~1nolo PO~1 n~or l PQIO oe Por0 r· P1Ir opD ol~roiu oo,,piune~e ^, 10«0toy«rll

Origin ^no-^i5* o0* r»»7nnl~ KOO tx I O.r-*-« " 00 y* n«5n« 00 zs9 S

a ~ ~ 01 --- EI0720hl B72 II1 I III II 077 af 1000-l171 n 10713_B^I071 o~lpl B071norWn 101»^^ 072 JlKI~u Iwrlo ,..,-~r K,.,,,.,,l Be724 B0725 B07210rioQ 107291 pro* ·03177

BOTUt~pi1071 10721072 na»Tra 102 yt,...lg,73 .',71I7MIC74y* 03

Po1Ir our noI11 PO111 PO ·101~ oa· ~O1~1 oor IDIIII trru POI) norrr ·I~I~ P~lll~oU n~lrt pC PorI) PO PO~r1 uO norrr10T32

Origin~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~01 rpL21·r

·IQIII~ ~ ~ 1»> roAo11001 ·1004 11007 111 to 01^ 107l» 12

n133orrB oni3 rTBnls tU Blgt-X KIoltf BUpft 111451 ph^ ~ 11152jPt"11* nUtalt* Bll(4^k IllCSnI~ il OnrllIOO Cl 11171OtrPr nor1173 ncO ~ ~~ ~ ~ ~ ~ ~ ~ ~~~14 1114 B15 115 IlOyC Bltr 110y« 1( 1( i 1« 17 r

norrr~~~~~~~~~iiiii numua SSSH VEQ~r ^*^^^* iir~r ---norI moQdidO B^i ~ --(1*

11420r12 1143 tzp B14>h11451)·- B14C11 1 aodT

110 14* 122123 **112 tr 112 140 13 trpK 11434 a1437 1143* B1444 -tr B141o 10145I)·0'3 B1450 114( 12

B1401~__10_mlB40m41111 11 nlr 12 B44 o141pm143- 145 13B40***142rB* 14 tO 114 o143 iB 15 pa 14*10

B1501I~l~ol 11V 15KB15 B15»«pei 115*51 57 -- 115» 11oo1(11(1 tyrt 11(12''' 11(14I~lld~g p--11(~obiOl(2 1(4 1(5 12 dadfc111(3jg55Il) B15> mn15*1n11I Iz 11(00 110 KltA1(* K2 Hla

^v...O^..*-- C===3 C=> ---- « C=3 C=)~~~~~~~~~~~~~~~~~~~~~~nooo pl

1I0055 uoan n00(3 poaB 1l00(71 itL 1100(* gin!

-r ·I004( ~ ~ OI105_uXr 100(2C dluA 1~00(4 folK 100O((-iB 100(0r txpX

144 0047 ·da 1I004* MB*( 110051 n10053 B005( 1~00571 UXC 110059 l00(0 -hA

n0045 t0040 hhar ·10050 110052 u0050 M- u00(1 r~o2

110111 10194 wmS u0197 -pX 102000 **ID, ·1020 n~h ·I0200 wro

·II lgl<, 101«( 1II01171 11010* gIdhX 101*2e **«X ·10195 101»( aroC 101»»r «01»p ubn·1020 10207ua *roK1020*d-

110104 10105 101*0 fur 10193 *coC u1001 rpLI* ·I0203

·I01*1 LId* 00~O~tCTD 10204 rp~l«

10321 TagCu0319 10324nmt 110331 oapB 10335l dffkA u042 110344 1034(

31510314c n0317 upfl u0320 n0323 lI0328 *fp ·I0332 ruc0 n0337 g~LnB K~033 pri*733 VnnS'~sxztm'M~rC

i~k P0475 hl«B 104090 u1097 h»10 10502 rb«X 10504 rtb· ·BIOS~rbttl

O474 hiT Il047( n0400 Z0104901 10492 049(0l ll0499·6*ldB 1000 lt0501 rb·D *10503 rbC ·I0505 fb-( l0507

~10470 ·tpC ·10410 *tp0 u1040 *tpn I10416 gidB 10490 10494109n0490~IoocpotD 1050101047*tp 10411 *tk 140 td 0nIOIIS911 I~llltpl IO110415 atpB

l

1011al« 0(1 t 0(0 po 1(3 *cB ^

K0l aA 01 010gp l 02 ~t 1(4K(2tt 027 1(**l* T> 032tf O3

icl 10(15 tucX 10(20 hip* *~~~10(21 l·h I(2 031oa 1(3

1074dcuf 10750 dapf 1 107(0lrp·~IO~O

107* d- 074 BI74 plB 175 taO 175 1070xp3

iol ·xoclr ~~~ol ·I1019- or» nL0C199fl 1090 nI0907 It0915 ooir> M91

·f0'19 _n BIO»__m>dO·1111

11034 Iotr1103( ~11041 M·I0 o · 1) zl 11050l-urP 'l 1 I'1 u I'1 t o~

JOi 1132(m r

n~~t39 dn~c·no~1132 11329 1 BX9133 ·I1340 1134 u07 134 po 1134rpoc Zlll ^ " K130 EaPI -^ j4

^ T o5 T _4 pt

110093~~~~~~~~~~~117·rB* 11412J 11414 114K0 114*1 rdg114* 1147 101500 o~

·114(3 u1144fl 141 81 41 tu»0 114731cedP 11475 nitCn p ·11410 1140wi1140 14*

H4(5tt- 11 41 470 t lpC 0147 11474_t<»C K147(UC IfO' ~ 10111 f00 1101 n11 rb 113

_I(3p^ n_3Ip KO* 9 ni(5(

14 df>D«I1(37 ~011(4 hi«T nl(4( cdcH49dl 11(1 *KM 11(5

B|Energy metabolismHiatty acid/Phospholipid metabolismHPurines, pyrimidines, nucleosides and nucleotides^BRegulatory functions

Replication

HI Transport/binding proteins^g~Translation (.2.,~ ,,

1 Transcription ^-NIB Other categories____~ ~~ ~ ~ ~ ~~~t--i__ k!bKHia Hypothetical||Ulnknown

1I172 bC» 1~1730 172 _____rn73^ l3ltl13 11 ri1tta 5«-23»-14» PI170 r*00n1742xpoS112» - I11 I11 I13-« 1l4«pT I14 -

B|Amino acid biosynthesisBBBiosynthesis of cofactors, prosthetic groups, carriersBBCell envelope^BCellular processes

Central intermediary metabolism

on February 8, 2018

http://science.sciencemag.org/

Dow

nloaded from

Page 39: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:

II """""""".GI.ULICC.UII..I.CC.CCIII.u,,.., rrr·rr - -_ --_ --_ .r··r rrr··r

*zoo0

10210 HI10212ribX10214 prIC 10215868h esdM 1026 h 1(0218 prrO Qlu *10321 go- *Q10232gaX 10224Irp 1022« brnQ ne0229pSWssin02»0 ~ r10231 d-0 10 123noa17

10211 pap» 80213 oppX 102119 10220 arc rlIut 5»-23»-16« 10223 rur010225s ahX n10227102110234 102Ks wC 10219-O 10 241

nesss ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~~~~~~~~~01 «xaa_<n<s16e

n0358 foAD n0369 hi«8I03»«

Issnesss~~~~~105 IUJD nei" 103«s lys C== n03»< «XO»»»ssas~a110349mn pmfc 035 10352orrm 10359 n038 "03( tin*. 103( 10372 tdx111037 107 07 031pl 133_o*135CoQP3lISM IO4pf

1es 8173031al8101(1 f«C«s n0371 103731Lhec66 n0375 neifS n03*0 toUspr 1 103t 101*Iessp

c~~~mr~~~mn Pmrmd~~~~~~~nmmr ~~~~~mzma ron~~~~~~~sna I I I I~~~1052 a*01041 or10510 10522 10527 fdxl 10511 zp«21h0533 rpoDs 10542L~ neepB n0549 togfcs rl nss

10U hk 013hncI154 pC155 pB 158 pl 018<oo 059 050053102 fa 056102 tr 150 c103 M* 103 a-0570« 059WC154 f 054Kp»104 «- «

10509~~~~~~~nes1051BlaoII101 pl 02 02 gc 102 d 01 c»15*u« 150*- 155xUB51i

nesse nesssnesss FQ nesss rplss nest, rpoD nesss rop·nesss lPI~~~~~~~~~~~~~~~~~~~~~~~~054 -i

~~~rmm, ~~ ~ ~ ~ 104 10C4 gla gI 0« yC ZC9*rr~~llit~ 1061purB~hoclnesI --- -- AZ« pp nssrplIl^---IO1102 es nsser nssnss y·nssanss g est^rlt a ss rr es sPss p es p I

K~~~tl«~Z~K04ZP1Xr 105 1(3oyDB1C BC7 OT1- 0«1t

1063tb92 10C3« 10C43tbi»Cs10644 toxCsBO4 -n04el6 05 062MA1(4tsg BOC5 __1<51 (5 KOC Upftsa0« 10«<5 asT~p BOC71 B0<7BOC7C a C B0<7BOC37 ~~ ~ ~ ~ ~ ~ AetrarC5pd 04 169rp 105 d 1C5uo 1(7tp 1COBCC 07 OC4BT

B0771 xpl>4 10712 rpL22 1015 rpL29 10790 rpL5 10793 pL< 10796 xpL30 10(01 zpS

107«507C9 ffbl B0777 zpl. 10710x^ ^rp83 rp817 rp2 vlt z8 OT -T1(0 p 1(3Am ~ ~ ~ ~es1076rib lOTftB sOrfZB75 07 plO 171z89 177 171 p1 09 pr5179 p1 00xo 00

1076210766107 «X 07 n0774 ctfA 077 rpL3 17( zpI E7«rp1 179 t004 00 n0«0» frr 1M1 nelscysigalP0*1 no«&10(1 l essKO10763IInadXa "{fife L077 m 100 ~»n»9ph 01 ~fc1(4*« 01 «» 1(9-

narll~~~~~102 hoi*03 o~b~ePIIoCIorurrcls nronro a ouro o~10939 n0941 PO09145 htrt r 10(54dae'llp B095« 7) POI~nesss trpl ness psd· nesss nessrop nessshdU nsss as nesss to~ nesICDs0943 ----- -- D

nesss nesss ir~~~~r nesss rp~~~s l[T01~~0 rp~~s rpsf ass as nesss s1090 1092 1194 ni09 110991103cy

KI1061__lpxXb 110<4 llCstD HH06( nrt» 81710723 n1074 H075cyd 1107«^p llptaln ni0«2 K10«4 1100« 110»srodfcn1100 11«snse 11101 XtUetu1106 IpxBm116 fabZ1108 nrt 16 nrf 1107 117 1107 eyd 117 gla 11» 103 115107*lP1«

n1210 nedh*I1213 xprXss nss1125121 nes1319 3nesode essr~s es 11221 hijDs 11224 pyrTs 1123 n1317 dpaK s 1121 dnessJ r IgiaipP111245Kr10 *uA 121 y 1 a2pr m117bgm n21 c» 22 Ia 12 up 121 pt 122 q 123 n* 135129pg 14 ni42c 114

11209~ ar"~ *XOK mUl~ufX rr1229 4uX P1231 Ipdk 11236W 0134~1 PI

-11501~~srrs I--I1Il ~r8799or 10 150*11507 11510 11512 ssgyQ 101~90neso 11514 ni5Ksn K1520 1152 n1521 parn1515 11517 lie*s r19LioC

110 151 B 155 *I59 11 1151 111 vmmmm157 519 mm1521 Tm 112112 112 ihp 112p~rrC 111 l«153 Z 11 156 151li»K50lo 14

HI1SO<1152 »BdB153 ri^Cs11831 tnb»154 1^ 154 lg«4 11547rroa 01549^O~~~~~~~~~~~~~~~~~~~12 an 11544r 11541mnmHiss rb

l^^---l ~ ~ nes nses 'ss 1ss sssnss

11(59 ~ ~ ~ ~~~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~~~~~~~~~oBmd 11(72mn 11(|4mm 111 1911(60 nrd» n1(71 11(77 "xl<7> nspsney n1(79n1(12 nsohi 1161 111 ol1(1 1111(9 1(4B

HH66^uc H((3r 11665 1166 1166 pr 116 17 *& 17 *a 11 1X<X1(1 O 1(1 15 1(7 19 10

P1OH662__»C^^_ nrCH((4 0116661 P1(01tiaOlSd 11(74 po~l~~sn P11oa0~nO~1011692tLd KUMQ 101--(~

300,000mnBMW ~ ~~ ~ ~ ~~~~~~~~~~·~~102( h

' ' ^_^^^^^^ ~~~~~~~~~~~~~~~~0~10( raxQ

~~ ~B»245_--f igallr

Prrr1 at iw«(6" c^ 450,000 nt^ 140^ l n4» «m ________________040 P«^ r SS Pli N* *SSSamm m MU9wt"** "S " "y 0j

0400r·- MUBria B0414) 1042(_«dklBO- B057(mc, _o rzm

r I ~ ~ ~ ~ ~ ~~~~~~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ZD 60~~~~~~~~~~~~~~~~~~70,000 nt

R~~~ ~ ~ rr Bllll tx« B0111· 101 Cl

15,20.00 nt

mmmm~~~~~~~~~~~~~~~UBT 1.35.0

00 ,00nBU51T rnrBUM BUMr BU-U PrM BU1 pI 12( 37

I124(~ ~ Pr-~BUMmmm BUI BUM~rBU U5U *- BU M112 B1272 fmBrrU47rTZ» ~rrrBU41r BUqftd BrU O BUMl PrmcA BU B 7 to

l74

US I^C1I 1 gm "

'B117(IB1177 ·k- BUMrrne BUM ~BUM r«- U W Br- 1,050,000ntBU7»y~ ~BlfUfClt U BMhtB~ U141 10 10

BUM 11*74-^. ^ u,<^ BUT~~~~~~~~~~~~~t.~ 1650.0,00 ntP ~~ ~ ~~ BM ~lBM o k UT yf 17 i B1577 11513)«rylt·

150bloO~~~~~~~~~~~~~~lM2~~~ ~ BUmmrBMmmUUM BMBM BM U1111 11

mu~ur1,500,00ntB12 o- 127*g

2??ur"i^JS BiDly Buo^ga Bua^ira lrC 1u1171 I^BU BI1ft B10 ku BU 11 TB71»- 11 B7 11 1U*- B72

^BUM B17M^ B171 Buj^ B1711 urwOPr r u~riic

B011J ta

B0107r n0101 10110 -aorr101ku

li"aa·s~

B~ ~ ~ Bill*lr

urrrr I·.1I·.·U .rru ur··rO amknmmmml IBII IU2 -«C BCIIU - BIU BIU7~ BUM

I

BoI_«od0 150.000mn112«dk BOU4~110115

10077 10010*SSBS& 1=3l 1~0013 CI·tY·

1000~lQ071 grpK7 nrdPl107 B 07 pB1010012 Olyll0wb 0ltCaOI1I00741Il0015 <Mh 10088 thrB

010091 *ftr

010091 P00931·0094005 TC 01001 00102 du«B 04htpO n lQf

B10092 B0101 B10103

on February 8, 2018

http://science.sciencemag.org/

Dow

nloaded from

Page 40: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 41: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 42: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 43: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 44: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 45: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant:
Page 46: EVE 161: Microbial Phylogenomics Class #10: Genome Sequencing · Microbial Phylogenomics Class #10: Genome Sequencing UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant: