© 1999 Macmillan Magazines Ltd
NATURE | VOL 402 | 16 DECEMBER 1999 | www.nature.com 761

articles

Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana

Xiaoying Lin*, Samir Kaul*, Steve Rounsley†, Terrance P. Shea†, Maria-Ines Benito, Christopher D. Town, Claire Y. Fujii, Tanya Mason, Cheryl L. Bowman, Mary Barnstead, Tamara V. Feldblyum, C. Robin Buell, Karen A. Ketchum, John Lee, Catherine M. Ronning, Hean L. Koo, Kelly S. Moffat, Lisa A. Cronin, Mian Shen, Grace Pai, Susan Van Aken, Lowell Umayam, Luke J. Tallon, John E. Gill, Mark D. Adams†, Ana J. Carrera, Todd H. Creasy, Howard M. Goodman†, Chris R. Somerville†, Greg P. Copenhaver†, Daphne Preuss†, William C. Nierman, Owen White, Jonathan A. Eisen, Steven L. Salzberg, Claire M. Fraser & J. Craig Venter†

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA

* These authors contributed equally to this work

† Present addresses: Cereon Genomics, 270 Albany Street, Cambridge, Massachusetts 02139, USA (S.R., T.P.S.); Celera Genomics Corporation, 45 West Gude Drive, Rockville, Maryland 20850, USA (M.D.A., J.C.V.); Department of Molecular Biology, Massachusetts General Hospital, Boston, Massachusetts 02114, USA (H.M.G.); Department of Plant Biology, Carnegie Institution of Washington, 260 Panama St., Stanford, California 94305, USA (C.R.S.); Department of Molecular Genetics and Cell Biology, University of Chicago, Chicago, Illinois 60637, USA (G.P.C., D.P.).

Arabidopsis thaliana (Arabidopsis) is unique among plant model organisms in having a small genome (130–140 Mb), excellent physical and genetic maps, and little repetitive DNA. Here we report the sequence of chromosome 2 from the Columbia ecotype in two gap-free assemblies (contigs) of 3.6 and 16 megabases (Mb). The latter represents the longest published stretch of uninterrupted DNA sequence assembled from any organism to date. Chromosome 2 represents 15% of the genome and encodes 4,037 genes, 49% of which have no predicted function. Roughly 250 tandem gene duplications were found in addition to large-scale duplications of about 0.5 and 4.5 Mb between chromosomes 2 and 1 and between chromosomes 2 and 4, respectively. Sequencing of nearly 2 Mb within the genetically defined centromere revealed a low density of recognizable genes, and a high density and diverse range of vestigial and presumably inactive mobile elements. More unexpected is what appears to be a recent insertion of a continuous stretch of 75% of the mitochondrial genome into chromosome 2.

Arabidopsis thaliana (thale cress), a member of the mustard family, is a dicotyledonous flowering plant with a diploid number of 10 that has become a widely used model for the study of plant biology because of its small size, short generation time, facile genetics and ease of transformation. Whole-genome analysis began ten years ago with the production of cosmid and yeast artificial chromosome (YAC)-based physical maps¹, while early large-scale sequencing efforts focused on ordered cosmids². On August 20–21 1996, representatives of six research groups from Japan, Europe and the USA met in Washington DC to discuss strategies for facilitating international cooperation in completing the sequencing of the Arabidopsis genome (the Arabidopsis Genome Initiative (AGI)) by the year 2004. The group agreed on a workplan to ensure that the sequencing of all five chromosomes was completed except for the difficult-to-sequence repetitive regions: the nucleolar organizer regions (NORs) and centromeres (http://genome-www.stanford.edu/Arabidopsis/AGI/AGI_memo.html)³.

The main molecular and cytological features of the A. thaliana chromosomes had been defined by the time high-throughput sequencing began (Fig. 1). The NORs in A. thaliana are found on the upper ends of chromosomes 2 and 4. Each region, 3.5–4 Mb in length, consists of an array of tandemly repeated ribosomal RNA genes⁴. Using a combination of restriction analysis, polymerase chain reaction (PCR) and sequencing, it has been shown that the rDNA of the NOR on chromosome 4 immediately abuts the telomeric repeats with less than 500 base pairs (bp) of intervening sequence⁵. Restriction analysis is consistent with a similar arrangement on chromosome 2, with the rDNA terminating at a different point within the repeat unit. By contrast, the genomic organization of A. thaliana centromeres is not well characterized. These regions are heterochromatic⁶ and contain large tandem arrays of a family of 180-bp elements⁷⁻⁹ that are reminiscent of the 170-bp alphoid repeats found in primate centromeres¹⁰. Large blocks of DNA consisting primarily of 180-bp repeats have been characterized by 2D gel electrophoresis and assigned to specific centromeres on the basis of restriction-fragment-length polymorphisms¹¹. The 180-bp repeat block on chromosome 2 is 820–830 kb in size¹¹, which should be considered as a minimum estimate for the size of the unsequenced centromeric region on this chromosome.

Features of chromosome 2

Chromosome 2 is acrocentric and was originally estimated to be 13.5 Mb in length from the YAC-based physical map¹². Sequencing was initiated using bacterial artificial chromosomes (BACs) that had been placed onto the physical map of chromosome 2 by hybridization to YACs and genetic markers. Subsequently, BAC end sequences and BAC fingerprint data¹³ allowed extension from these initial seed points and completion of the entire chromosome. A total of 257 BAC and P1 clones (including 5 BACs completed by other groups) were sequenced to produce over 24 Mb of finished sequence, which has been assembled as two contigs terminating in blocks of 180-bp repeats. These repeats represent the inner boundaries of our finished sequence.

The upper (short) arm, measured from the lower end of the NOR to the centromeric 180-bp repeats, is 3.6 Mb. The northern-most BAC (F23H14) in this contig contains a single rDNA repeat unit that is oriented in the same direction as the telomere-proximal repeat, suggesting that all the rDNA units are arranged in a head-to-tail fashion running 5′ to 3′ from the telomere towards the centromere¹⁴. Immediately adjoining the rDNA is about 60 kb of highly repetitive sequence containing many transposons. The lower (long) arm from centromere to telomere is 16 Mb. The sequence of the southern-most BAC in this contig (F11L15) is joined by a small PCR fragment to the A. thaliana sequence in pAtT51, a telomere-containing clone previously mapped to TEL2S¹⁵ (http://genome-www.stanford.edu/Arabidopsis/ww/Nov98RImaps/index.html). Within this region, a putative transcriptional co-activator gene was identified. Because pAtT51 was derived from the Landsberg ecotype, the last 2.5 kb of this contig may not be identical to the sequence in the Columbia ecotype.



At 19.6 Mb, the overall length of chromosome 2 (excluding the NOR and centromere) is 45% larger than the original estimate. This difference is due in part to gaps in the YAC-based physical map and to a deletion in one of the YACs (see Fig. 1). Similar increases in the actual physical lengths of the other A. thaliana chromosomes indicate that the genome size is actually 130–140 Mb, rather than the earlier estimates of 70–100 Mb.
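The length comparison can be checked with a short calculation. The sketch below, in Python, uses only figures quoted in the text (the two arm lengths and the original YAC-based estimate of 13.5 Mb); the variable names are illustrative.

```python
# Figures quoted in the text, in megabases (Mb).
short_arm_mb = 3.6
long_arm_mb = 16.0
yac_estimate_mb = 13.5  # original estimate from the YAC-based physical map

# Sequenced length excludes the NOR and the unsequenced centromeric repeats.
sequenced_mb = short_arm_mb + long_arm_mb

# Relative increase of the sequenced length over the original estimate.
increase_pct = (sequenced_mb - yac_estimate_mb) / yac_estimate_mb * 100

print(f"sequenced length: {sequenced_mb:.1f} Mb")
print(f"increase over YAC-based estimate: {increase_pct:.0f}%")
```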

Using sequence information for 37 markers from the Arabidopsis Recombinant Inbred genetic map (http://genome-www.stanford.edu/Arabidopsis/ww/Nov98RImaps/index.html), the relationship between physical and genetic distances along the chromosome was examined. As the centromere represents a discontinuity, regression analysis for the two arms was performed separately. The ratios of physical to genetic distance on the short arm and long arm were not significantly different, with values of 244 kb cM⁻¹ and 223 kb cM⁻¹, respectively (Fig. 2a). Although these two values indicate a similar overall rate of recombination along the two arms of the chromosome, local distortions are evident which may reflect chromosomal rearrangements between the two ecotypes used to construct the mapping population (Fig. 2b).
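A per-arm regression of this kind can be sketched as follows. This is a minimal illustration in Python, assuming an ordinary least-squares fit of physical position on genetic position; the text does not state which regression model was used, and the marker coordinates below are hypothetical, not the 37 markers from the Recombinant Inbred map.

```python
def slope_kb_per_cm(markers):
    """Ordinary least-squares slope of physical position (kb) on genetic
    position (cM). `markers` is a list of (cM, kb) pairs."""
    n = len(markers)
    mean_cm = sum(cm for cm, _ in markers) / n
    mean_kb = sum(kb for _, kb in markers) / n
    sxy = sum((cm - mean_cm) * (kb - mean_kb) for cm, kb in markers)
    sxx = sum((cm - mean_cm) ** 2 for cm, _ in markers)
    return sxy / sxx


def local_ratios(markers):
    """Physical/genetic ratio between successive marker pairs, in the spirit
    of Fig. 2b. Assumes every marker has a distinct cM position."""
    ordered = sorted(markers)
    return [
        (kb2 - kb1) / (cm2 - cm1)
        for (cm1, kb1), (cm2, kb2) in zip(ordered, ordered[1:])
    ]


# Hypothetical (cM, kb) marker positions, invented for illustration only.
short_arm = [(2.0, 300.0), (6.0, 1300.0), (10.0, 2250.0), (14.0, 3250.0)]
long_arm = [(20.0, 4000.0), (40.0, 8500.0), (60.0, 12900.0), (85.0, 18500.0)]

# The centromere is a discontinuity, so each arm is fitted on its own.
for name, arm in (("short arm", short_arm), ("long arm", long_arm)):
    print(name, round(slope_kb_per_cm(arm), 1), "kb per cM")
```

Fitting the arms separately keeps the unsequenced centromeric block from distorting the slope, while the pairwise ratios expose the local departures that a single slope averages away.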

Gene content

Using a combination of gene prediction programs and database searches, the chromosome was annotated (Fig. 3 (facing page 768); and Table 1). Statistics of the chromosomal composition are listed in Table 2, along with corresponding data from other eukaryotic genome projects. A graphical distribution of various features is shown in Fig. 4. Of the 4,037 genes identified, 2,078 (51.5%) were assigned either a definite or putative function, 865 (21.4%) were labelled as unknown genes and 1,094 genes (27.1%) were designated as hypothetical. In addition, 400 pseudogenes, most of which are related to proteins found in retrotransposons and are located near the centromere, and 79 genes encoding structural RNAs (73 tRNA, 4 tRNA and 2 small nucleolar (snRNA)) were found.

On average, a gene occurs every 4.4 kb, is about 2 kb in length (from start to stop codon) and contains 4.6 exons. More than 50% of the chromosome is therefore intergenic sequence. Twenty-three per cent of the genes consist of a single exon, whereas the largest gene (T20F21.17) contains 52 predicted exons and encodes a protein that is 50% identical to the human ch-TOG protein¹⁶. At least 1,352 (33.5%) genes are expressed, as reflected by the presence of a matching complementary DNA or expressed sequence tag (EST) entry in GenBank. Genes that are most highly represented by ESTs include a glycine-rich RNA-binding protein, aquaporins, ribulose bisphosphate carboxylase activase, a chlorophyll a/b binding protein and lipid transfer proteins. The distribution of genes and ESTs along the chromosome is shown in Fig. 4b, c.

Classification by biological role or biochemical function of the known and putative genes shows that most major cellular processes appear to be represented, with genes involved in regulatory functions and signal transduction (DNA-binding proteins/transcription factors and protein kinases) comprising the largest functional groups (Table 1). These homology-based assignments are consistent with results from searching the hidden Markov models (HMMs) of protein domains in the pFam database¹⁷. Of the 4,037 genes predicted to encode a protein, 1,436 match at least one of the 434 distinct pFam HMMs (22% of the total HMMs). The most frequently seen matches are to the LRR (leucine-rich repeat, 239 matches), the pkinase (eukaryotic protein kinase, 161 matches) and the zf-C3HC4 (zinc finger, C3HC4 type, 46 matches) HMMs. The cellular destinations of the proteins were also predicted. Over 17% (698) contained signal peptides, whereas over 10% (416) were predicted to be organelle-targeted (135 for the chloroplast and 281 for the mitochondrion). In addition, more than 16% of the proteins are probably located in cellular membranes, with 665 sequences predicted to have 4 or more transmembrane segments (TMSs). Furthermore, about 25% of the proteins with 1, 2 or 3 predicted TMSs (1,438) are predicted to be localized to the membrane.
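The three domain-search summaries quoted above (genes with at least one match, number of distinct HMMs matched, and the most frequent matches) can be derived from a per-gene list of hits. The Python sketch below assumes such a list is already available; the gene names and hits are invented, and it does not reproduce any particular search program or its output format.

```python
from collections import Counter

# Hypothetical search results: gene identifier -> names of the HMMs matched.
# A repeated name means the domain was matched more than once in that gene.
hits = {
    "geneA": ["LRR", "LRR", "pkinase"],
    "geneB": ["pkinase"],
    "geneC": [],
    "geneD": ["zf-C3HC4"],
    "geneE": ["LRR"],
}


def summarize(hits):
    """Return (genes with >= 1 match, distinct HMMs matched, match counts
    per HMM sorted from most to least frequent)."""
    genes_with_hit = sum(1 for domains in hits.values() if domains)
    all_matches = [d for domains in hits.values() for d in domains]
    distinct_hmms = len(set(all_matches))
    return genes_with_hit, distinct_hmms, Counter(all_matches).most_common()


genes_with_hit, distinct_hmms, ranked = summarize(hits)
print(genes_with_hit, "genes match", distinct_hmms, "distinct HMMs")
for name, count in ranked:
    print(name, count)
```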

Figure 1 Features of Arabidopsis thaliana chromosome 2. The chromosome is roughly 23 Mb long (including the NOR), spans a genetic map distance of 90 cM and contains 15% of the unique sequence of the A. thaliana genome. The locations of a number of genetic markers are shown. The original YAC contigs are represented by the orange line, and the location of the deletion noted in the text is shown (Δ).
[Figure 1 labels: telomeres, NOR and centromere; genetic markers with map positions in cM: kk1 (3.5), mi421 (19), er (49), ve018 (69), mi79a (87).]

Figure 2 Representation of physical/genetic distance along chromosome 2. a, Physical distance (in kb) is plotted against position of each marker on the Recombinant Inbred (RI) map. Data above and below the centromere are treated as separate sets for regression analysis. Note that genetic distance is measured from the telomere (TEL2N), whereas physical distance is measured from the bottom of the NOR. b, The ratio of physical to genetic distance between successive pairs of markers is plotted against map position for each marker pair on the RI map.
[Figure 2 axes: a, physical distance (kb) against genetic map distance (cM); b, physical/genetic distance (kb cM⁻¹) against genetic map distance (cM).]

Gene and chromosomal duplications

Chromosome 2 contains many duplicated genes. More than 60% of the predicted proteins (2,542 out of 4,037) have a significant match (FASTA, P < 10⁻¹⁰) to another A. thaliana protein, with 2,138 matching at least one other protein encoded on chromosome 2. Of these, 593 are found in 239 tandem duplications that range in size from 2 to 9 genes (average 2.5). A particularly striking example is found in BAC F16P2, which contains tandem repeats of three gene families (Fig. 5). Of the 12 intact copies of a putative short-chain dehydrogenase/reductase gene in this region, 9 are found in tandem. This gene family is most similar to tropinone reductase, a branch point in the biosynthesis of tropane alkaloids including such medicinally important compounds as atropine, scopolamine and cocaine¹⁸,¹⁹. Although these alkaloids are not known to occur in the Brassicaceae, at least one of these genes is expressed, suggesting a role in secondary metabolism which has yet to be elucidated. This same BAC also contains a tandem array of seven genes encoding glutathione S-transferase (GST), which has been implicated in numerous plant processes including stress and pathogen responses, xenobiotic tolerance and vacuolar sequestration²⁰. In total, 13 GST genes have been found on chromosome 2, with 4 of the other genes occurring as tandem pairs. The GSTs within the long tandem repeat are more closely related to one another than to any other of the 25 GSTs known in Arabidopsis, indicating that this expansion is of relatively recent origin. There is also a smaller duplication consisting of three pumilio-like genes, two of which are in tandem. The proximity of these gene duplications suggests either that this is a repeat-prone region of the chromosome or that they arose as part of the same expansion event.
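One simple way to count tandem arrays is to scan genes in chromosomal order and record runs of neighbours that belong to the same family. The Python sketch below assumes that "tandem" means strictly adjacent genes and that each gene has already been assigned a family label (for example from pairwise protein matches); the text does not give the exact criteria used, and the gene order shown is a hypothetical arrangement loosely modelled on the BAC described above.

```python
from itertools import groupby


def tandem_arrays(families):
    """Return (family, start_index, size) for every run of two or more
    adjacent genes sharing a family label. `families` lists one label per
    gene in chromosomal order; None marks a gene with no significant match."""
    arrays = []
    index = 0
    for family, run in groupby(families):
        size = len(list(run))
        if family is not None and size >= 2:
            arrays.append((family, index, size))
        index += size
    return arrays


# Hypothetical gene order: seven GST genes, three tropinone-reductase-like
# genes (TR) and three pumilio-like genes (PUM), two of them adjacent.
order = ["GST"] * 7 + [None, "TR", "TR", "TR", None, "PUM", "PUM", None, "PUM"]

arrays = tandem_arrays(order)
mean_size = sum(size for _, _, size in arrays) / len(arrays)

for family, start, size in arrays:
    print(f"{family}: {size} genes starting at gene index {start}")
print("average array size:", mean_size)
```

Under this strict-adjacency assumption the isolated third pumilio-like gene is not counted as part of an array; a looser definition that tolerates intervening genes would need an explicit gap parameter.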

Some of the duplicated genes are found within large chromosomal duplications (Fig. 6). A segment of the bottom arm of chromosome 2 (13.4–14.1 Mb, containing 170 genes) from clone F20M17 to F4P9 was found to be duplicated on chromosome 1 (from clone F19P19 to F3F20) with an inversion in the middle (13.8–13.9 Mb on chromosome 2). Within this region, 57 gene pairs (33%) are duplicated between the two chromosomes. Several instances of tandem duplications on just one of the two chromosomes were also observed. For example, F4P9.8 encodes an auxin-inducible protein that matched two adjacent copies of a similar protein-coding sequence (F19P19.31 and F19P19.32) on chromosome 1; and a single copy of a gene encoding an ion-channel protein on chromosome 1 (YUP8H12.9) matched two adjacent copies on chromosome 2 (T32F6.9 and T32F6.8). Sequence comparison of the proteins in these asymmetric duplications suggests that the tandem duplication events occurred after the large-scale duplication. An additional smaller section near the top of chromosome 2 (200–500 kb) shares 16 genes with a region on chromosome 1 (T10B6). An even larger duplication (4.6 Mb) between chromosome 2 (6.7–11.3 Mb, from clone F9O13 to F18A8) and chromosome 4 (from clone F20O9 to the bottom end of the chromosome) was found. In this region, 39% of the genes (430 out of 1,100) are duplicated between the two chromosomes, including some singleton-tandem duplication pairs, as described above. In addition, although this region contains megabase-scale rearrangements (translocations or inversions), within each rearranged segment gene order is still preserved between the two chromosomes. A small part of this duplication (45 kb) has been described recently²¹.
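The legend of Fig. 6 states that aligned matches were kept if they showed at least 75% identity over 100 bp. A filter of that form can be sketched in Python as below. The sketch assumes ungapped, equal-length aligned segments for simplicity and does not reproduce the MUMmer or dds programs named in the legend; the function names and toy sequences are illustrative.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two aligned sequences of
    equal length (ungapped, a simplifying assumption)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def passes_filter(a, b, min_identity=75.0, min_length=100):
    """Keep a match only if it is long enough and similar enough."""
    return len(a) >= min_length and percent_identity(a, b) >= min_identity


# Toy 100-bp segments: seg2 differs from seg1 at every fifth base,
# seg3 at two bases in every five.
seg1 = "ACGTA" * 20
seg2 = "ACGTC" * 20
seg3 = "ACTTC" * 20

print(percent_identity(seg1, seg2), passes_filter(seg1, seg2))
print(percent_identity(seg1, seg3), passes_filter(seg1, seg3))
```

Each match that passes would contribute one point to a dot plot, with its coordinate on chromosome 2 on one axis and its coordinate on the other chromosome on the second axis.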

Table 1 Classification of chromosome-2 proteins according to role or function

Role or function                                                     Percentage of chromosome-2 proteins
Regulatory functions                                                  7.8
Signal transduction                                                   4.0
Cellular structure, organization and biogenesis                       3.6
Transport and binding proteins                                        3.6
Protein fate                                                          3.0
Energy metabolism                                                     2.7
Growth and development                                                2.6
Secondary metabolism                                                  2.5
Cellular processes                                                    2.4
Pathogen responses                                                    2.0
Protein synthesis                                                     1.9
Fatty acid and phospholipid metabolism                                1.5
General transcription                                                 1.4
Environmental response                                                1.0
DNA metabolism                                                        0.7
Central intermediary metabolism                                       0.7
Amino-acid biosynthesis                                               0.6
Purine and pyrimidine base, nucleoside and nucleotide metabolism      0.6
Biosynthesis of cofactors, prosthetic groups and carriers             0.4
Other categories                                                      6.9
Unclassified                                                          1.4
Unknown protein                                                      21.4
Hypothetical protein                                                 27.1

4,037 genes are included in this analysis.

Table 2 Compositional analysis of Arabidopsis chromosome 2 and comparison with other eukaryotic chromosomes and genomes

Feature                   A. thaliana*      P. falciparum*    S. cerevisiae†    C. elegans†
Overall length
  Total                   19.6 Mb           0.95 Mb           13.4 Mb           97 Mb
  Short arm               3.6 Mb
  Long arm                16.0 Mb
Base composition (%GC)
  Overall                 35.8%             19.7%             38%               36%
  Coding                  43.6%             24.3%             40%               §
  Non-coding              32.1%             13.3%             35%               §
Number of genes           4,037             209               6,339             19,099
Gene density              4.4 kb per gene   4.5 kb per gene   2.1 kb per gene   5.0 kb per gene
Exon statistics‡
  Exons                   20,352            355               6,561             §
  Exons per gene          4.6               1.7               1.04              5
  bp per exon             295               728               §                 §
Intron statistics
  Introns                 15,926            144               §                 §
  Introns per gene        3.6               1.6               §                 §
  bp per intron           178               207               §                 §

* Chromosome 2 only. † Complete genome. ‡ Includes intronless genes. § Not published.

Comparative and evolutionary genomics

Comparison of the predicted proteins on chromosome 2 to other A. thaliana proteins, to proteins from other plants, and to all available complete genome sequences provides insight into the evolution of plant genomes. More specifically, the timing of gene duplications can be better understood. For example, as most of the 2,542 chromosome-2 proteins (83%) that have paralogues within the A. thaliana genome are more similar to their paralogue than to any protein in the available completed genomes, it is likely that these gene duplications occurred after the separation of plants from the animal and fungal lineages. This analysis also suggests that all plants have a common set of genes for many functions. Even though sequencing in other plants has focused primarily on particular types of genes, more than 40% (1,656) of the proteins encoded on chromosome 2 have a significant match to a protein previously identified in other plant species. Of these, 48% (803) do not have a match in any complete genome sequence. In addition, comparison with other species indicates that many of the genes on chromosome 2 probably originated within the plant lineage. More than 65% (2,745) of the predicted proteins do not show significant similarity to any other completed genome, suggesting that they are unique to plants. Of the remaining proteins (1,293), the majority are most similar to proteins from Saccharomyces cerevisiae (368 or 28%) or Caenorhabditis elegans (633 or 49%). The number of proteins whose best match is to one of these two proteomes is proportional to the total number of proteins encoded in each genome, suggesting that A. thaliana is roughly equally evolutionarily distant from animals and fungi. The remaining 292 proteins (23%) are most similar to bacterial and archaeal proteins, which may be due in part to transfers of genes from organellar genomes to the nucleus (see below).
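The best-match breakdown can be reproduced from the counts quoted above. The Python sketch below first shows, with invented scores, how a protein might be assigned to the source of its highest-scoring hit (the scoring scheme is an assumption, not taken from the text), and then recomputes the quoted shares from the published counts.

```python
def best_match(hits):
    """Return the source with the highest similarity score, or None when a
    protein has no hit. `hits` maps a source name to a score (higher is
    better); the scores are hypothetical."""
    if not hits:
        return None
    return max(hits, key=hits.get)


# Assignment of a single protein, with invented scores.
print(best_match({"Saccharomyces cerevisiae": 120.0,
                  "Caenorhabditis elegans": 310.5}))

# Counts quoted in the text for proteins with a match in a completed genome.
best_match_counts = {
    "Saccharomyces cerevisiae": 368,
    "Caenorhabditis elegans": 633,
    "bacteria and archaea": 292,
}

total = sum(best_match_counts.values())
shares = {
    source: round(100 * count / total)
    for source, count in best_match_counts.items()
}

print(total)
for source, share in shares.items():
    print(f"{source}: {share}%")
```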

Lateral transfers of genes from organelles to the nucleus have been well established²². In most cases, such genes were identified because their products were known to function in the organelle. The sequence of chromosome 2 allows a reverse type of analysis, identification of nuclear genes that originated in the chloroplast by their evolutionary ancestry. Genes most similar to cyanobacterial genes relative to genes from other complete genomes were considered to have been acquired from the chloroplast genome. Using this methodology, 135 putative chloroplast-derived genes were identified. Of these, 71 (52.5%) are predicted to have chloroplast-targeting peptides. The putative functions of these genes include DNA repair, protein synthesis, regulatory functions, cell division, photosynthesis, transport and oxidation–reduction. As these genes are not found in any of the completed plant chloroplast genomes, most of these gene transfers probably occurred soon after the symbiosis with the chloroplast evolved. Many other genes are of unknown function and will be worth pursuing as potential novel chloroplast-targeted functions.

A mitochondrial genome in the nucleusCompletion of the chromosome has revealed a large and

articles

764 NATURE | VOL 402 | 16 DECEMBER 1999 | www.nature.com

Trinucleotide compositionvariation

GC content (%)

Gene density

EST density

Tandem gene duplications

tRNA gene density

Transposon density

Intermediate repeats

20304050607080

020406080

02468

0

2

4

6

0

5

10

15

Chromosome position (Mb)

0 5 10 15 20

c

a

b

g

f

e

d

0

10

20

30

Figure 4 Distribution of various features along chromosome 2. The two arms of the

chromosome are represented as a continuous sequence of DNA, beginning immediately

after the NOR (position 1) and continuing to the centromere (position 3,606,929), where it

is joined to the long arm (which terminates with the most telomeric base (position

19,647,005)) by a 120-bp spacer. The various features are plotted relative to the

nucleotide position on the chromosome. a, Chi-squared statistics of trinucleotide

frequency50 and GC composition in a 10-kb sliding window. b, Gene density per 100 kb;

c, EST density per 100 kb. d, Tandem repeats, showing repeat number as well as

location. e, tRNAs per 100 kb. f, Density of transposon-related sequences per 100 kb.

g, Position of pericentromeric repeat classes along chromosome 2. From top to bottom

they are 180 bp, Athila, 106B, 11B7RE, 163A, 164A, 278A, mi167, pAtT27, new repeats

1±4. The GC peak (45%) and the gaps in gene and transposon density from 3.2 to 3.5 Mb

reveal the site of the mitochondrial DNA insertion.

© 1999 Macmillan Magazines Ltd

unexpected organellar-to-nuclear gene-transfer event. Within thegenetically de®ned centromere, there is a stretch of 270 kb ofsequence that is nearly identical to that of the Arabidopsis mito-chondrial genome. The authenticity of this insertion in the Colum-bia ecotype was con®rmed by PCR ampli®cation across thejunctions of mitochondrial and unique nuclear DNA, followed bythe sequencing of the corresponding fragments. This insertion is

much larger than any of the previously reported organellar-nucleartransfers23, and is 99% identical to the mitochondrial genome,suggesting that the transfer event was very recent. As the publishedsequence is derived from the C24 ecotype24, it is not possibledetermine whether the observed sequence polymorphisms are dueto divergence of the insertion from the mitochondrial genome or todifferences among ecotypes. The organization of the mitochondrial

articles

NATURE | VOL 402 | 16 DECEMBER 1999 | www.nature.com 765

[Figure 6 plots. Panel a: chromosome 2 coordinate (Mb) against chromosome 1 relative position (Mb). Panel b: chromosome 2 coordinate (Mb) against chromosome 4 relative position (Mb).]

Figure 6 Large chromosomal duplications. Regions of similarity were initially detected

using the MUMmer program after which sequences were aligned using the dds program

with criteria of at least 75% identity over 100 bp. Points in the dot plot represent the

coordinates of each such match. Values show positions (in bp) within each contig

assembled from all available sequence. a, Duplication between chromosomes 1 and 2.

b, Duplication between chromosomes 4 and 2. Arrows indicate the small, previously

described duplication (ref. 21).

[Figure 5 image: gene models on BAC F16P2 drawn along a nucleotide coordinate axis, with genes labelled G, T and P.]

Figure 5 Organization of genes on BAC F16P2 showing the three tandem gene

duplications. The display from TIGR Annotator shows the exon–intron structure of the

annotated genes. The glutathione S-transferase and tropinone reductase genes are

labelled G and T, respectively. A smaller duplication of pumilio-like protein (P) is also

present.


DNA in the nucleus differs from that of the published mitochondrial genome and its predicted alternate forms (ref. 25); it corresponds to another possible isoform with an internal deletion (Fig. 7). This deletion may have occurred during or after transfer or may represent an alternate form of the Columbia mitochondrial genome.

The pericentromeric region

Until the completion of chromosome 2, it was unclear what features would be discovered in the regions around the centromere (the pericentromeric region). In addition to the detection of the mitochondrial genome insertion, an analysis of this region has revealed the presence, distribution and diversity of specific repetitive elements on a single chromosome. Transposons were identified by searching for predicted amino-acid sequences that are similar to characterized transposases, reverse-transcriptases, polyproteins or other transposon-related coding sequences. Of the 563 transposon proteins annotated, 50% are pseudogenes found either as fragments of previously intact elements or as genes interrupted by stop codons, frameshifts or other mutations. A breakdown of the classes of transposons identified is shown in Table 3. Representative members of most of the large transposon families seen in other plants (Mutator, En-Spm and Ac-Ds (hAT/mariner)) are present. Retroelements including members of the LINE-like Ta11 elements, long terminal repeat (LTR) elements in both the Ty3-gypsy and the Ty1-copia family, and the Arabidopsis Athila family represent the largest class of transposons. This accumulation of retroelements and other transposons may be due to either preferential insertion into the regions flanking the centromere or their elimination from the rest of the genome at a higher rate (ref. 26). In addition, representatives of previously described intermediate repeats as well as some new repeats were identified (Fig. 4f), a distribution consistent with previous hybridization data (refs 27–30).

As shown in Fig. 4, the number of recognizable genes decreases close to the centromere, whereas the number of transposable elements and various classes of intermediate repeats increases. Nevertheless, within this region there are some identifiable genes: a protein kinase (T25N22.14); a plasma membrane proton ATPase (F9A16.7); an ABC transporter (T5E7.1); a DNA-replication licensing factor (T12J2.1); an RNA helicase (T12J2.7); and a 40S ribosomal protein S16 (F7B19.13). The RNA helicase, protein kinase and 40S ribosomal protein appear to be actively expressed genes, as each is represented in the EST collection.

Conclusions

Arabidopsis has been considered to be not only a tractable species for plant biology studies, but also a suitable model from which a fundamental understanding of basic plant processes could be applied to agriculturally relevant species. The analysis of the complete sequence of chromosome 2 validates this concept because many genes that have previously been characterized in other plants have been found. As observed with other eukaryotic genome projects, however, roughly half of the predicted genes have no known function. These genes represent a large, untapped resource of information, and understanding their functions will require a combination of knockout approaches, antisense technologies and global gene-expression analyses.

Both gene duplication and large-scale chromosomal duplications are observed, events that apparently occurred after the divergence of plants from other lineages. These duplications are larger than any reported in the C. elegans genome (ref. 31), but less extensive than the complete genome duplications proposed for maize (ref. 32) and S. cerevisiae (ref. 33). As gene duplication is often accompanied by functional divergence (ref. 34), the identification of such duplications in the plant lineage may help identify new plant functions. Many of the duplications occurred in tandem, generating closely linked gene families which may provide some of the allelic diversity that breeders and agricultural researchers are seeking. Some new families, such as the family of genes that may encode one step in alkaloid biosynthesis, could provide a reservoir of molecular tools to modify plant metabolism and productivity in a beneficial manner.


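[Figure 7 image. Panels a to c: diagrams of the A. thaliana mitochondrial genome (366,923 bp) built from segments A, B, C and D and repeats 1 and 3, with primes marking inverted segments and a possible insertion point indicated. Panel d: dot plot of mitochondrial genome position (Mito, kb) against chromosome 2 position (Chr 2, kb).]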

Figure 7 Proposed model for insertion of mitochondrial DNA into chromosome 2.

a, Diagram of the major form of the published A. thaliana C24 mitochondrial genome (ref. 25).

The different colours correspond to the major segments of the genome: A (nucleotide

297,579–44,698); B (48,895–112,146); C (118,737–178,862); D (183,060–

290,989); and two major repeats, 1 (44,698–48,894, 178,863–183,059) and 3

(112,147–118,736, 297,579–290,990). A small repeat (repeat number 2) is not shown.

The orientation of one of the three repeats is inverted relative to the other. The arrows are

used to indicate orientation for comparison with the alternative form. b, Hypothetical

alternative form, generated by in silico rearrangements around some of the repeats.

Arrows and primes indicate inversions. c, Linearized form of genome shown in b if

inserted at the point between repeat 3 and segment D′. d, Dot plot of the linearized

alternate form against the region of chromosome 2 containing the mitochondrial DNA. The

gap in the diagonal indicates that there was a deletion of a large segment (see text).

Table 3 Representation of transposable elements on chromosome 2

Transposon class / Subclass / Family: Number of open reading frames

DNA elements
  Mutator: 73
  En-Spm/Tam1/Psl: 71
  Ac-Ds: 12
  mariner: 2
  hAT: 1

Retroelements
  Non-LTR LINE-like retrotransposons
    Ta11-like: 152
    TSCL: 1
  LTR retrotransposons
    Athila: 75
    Other: 141
    Ty1-copia-like (Ta1)
    Ty3-gypsy-like (Tat1)
  Retroviral-like
    Replication protein: 28
    Helicase: 7

Total: 563
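As a quick consistency check on Table 3, the listed open-reading-frame counts can be summed and compared with the stated total. The sketch below simply transcribes the counts shown in the table; the dictionary keys are labels chosen for illustration, and the rounding of the 50% pseudogene estimate is an assumption.

```python
# Open-reading-frame counts transcribed from Table 3.
table3_counts = {
    "Mutator": 73,
    "En-Spm/Tam1/Psl": 71,
    "Ac-Ds": 12,
    "mariner": 2,
    "hAT": 1,
    "Ta11-like": 152,
    "TSCL": 1,
    "Athila": 75,
    "Other LTR": 141,
    "Replication protein": 28,
    "Helicase": 7,
}

total = sum(table3_counts.values())
print("Total ORFs:", total)

# The text states that 50% of the annotated transposon proteins are pseudogenes.
approx_pseudogenes = round(0.5 * total)
print("Approximate pseudogenes:", approx_pseudogenes)
```

The listed rows add up to the stated total of 563, so no separate counts appear to be missing for the remaining family labels.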


The completion of chromosomes 2 and 4 (ref. 35), which accounts for 30% of the entire Arabidopsis genome, represents a landmark achievement in eukaryotic genomics. The 16-Mb contig on the lower arm of chromosome 2 is the largest sequence assembly published to date. Sequencing within the genetically defined centromere has shown not only an abundance of degenerate transposable elements, but also a number of recognizable intact genes, some of which appear to be expressed. As researchers move on to sequence larger and more complex genomes, it remains to be seen whether the level of genome closure achieved in this model plant can be duplicated.

Note added in proof: Since this paper was accepted for publication, a 23-Mb contig from human chromosome 22 has been published (see Nature 402, 489–495 (1999)).

Methods

Sequencing

Three libraries made from the Columbia ecotype of A. thaliana were used in sequencing chromosome 2: the TAMU BAC library (clones prefixed with T; ref. 36); the IGF BAC library (clones prefixed with F; ref. 37); and the Mitsui P1 library (clones prefixed with M; ref. 38). Sheared BAC DNA (1.6–2 kb) was ligated to a modified pUC19 vector and transformed into Escherichia coli. Sequencing reactions were performed using either BigDye primers, BigDye Terminators or Dichlororhodamine Terminators (P.E. Biosystems), and were run on ABI 377 and ABI 3700 sequencers (P.E. Biosystems). Shotgun clones were sequenced to generate 7–8-fold coverage of each BAC. In total, 590,165 reactions were carried out to generate the sequence of the chromosome. In collaboration with Celera Genomics, 99,072 reactions were run on ABI 3700 sequencers, which accounted for nearly 20% of the overall sequence for the chromosome. The discrepancy rate on these machines, as determined by comparison of overlapping BACs, was comparable to that of ABI 377 machines (1 unresolved discrepancy per 32,000 bases). Such base differences may be due to residual heterozygosity or to different DNA sources used to make the various libraries. Individual BACs were assembled from the shotgun sequences using the TIGR Assembler program (ref. 39). Following assembly into contigs, the Grouper program (TIGR, unpublished software) was used to link individual assemblies within the BACs. The BACs were then closed using a combination of BAC walking, directed PCR and resequencing of individual shotgun clones. To confirm the assembly of the entire BAC sequence, predicted and actual restriction digest patterns were compared. The collinearity of the BACs with the Arabidopsis chromosome 2 sequence was confirmed by alignment of the sequence of chromosome 2 with markers obtained from the Arabidopsis Recombinant Inbred map (http://genome-www.stanford.edu/Arabidopsis/ww/Nov98RImaps/index.html) and with YAC end sequences. Several markers that clearly mapped to single genetic loci were excluded from this analysis because they contained repetitive sequence and could not be assigned a unique location on the sequence map.
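The comparison of predicted and actual restriction digest patterns can be sketched as follows. This is an illustration under stated assumptions, not the procedure used here: the enzyme is not named in the text, so EcoRI (site GAATTC, cut one base into the site) is used purely as an example, the function names are hypothetical, and the 5% size tolerance is an arbitrary choice.

```python
def digest_fragments(seq: str, site: str, cut_offset: int) -> list[int]:
    """Return sorted fragment lengths from cutting a linear sequence.

    site is the recognition sequence and cut_offset is how many bases
    into the site the cut falls. Only the given strand is searched,
    which is sufficient for palindromic sites such as GAATTC.
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


def patterns_agree(predicted: list[int], observed: list[int],
                   tolerance: float = 0.05) -> bool:
    """Compare two fragment-size lists, allowing a relative size error."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)


if __name__ == "__main__":
    # Toy "assembly" with two example sites (assumed enzyme: EcoRI).
    assembly = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
    predicted = digest_fragments(assembly, "GAATTC", cut_offset=1)
    print(predicted)
    print(patterns_agree(predicted, [4, 7, 10]))
```

A mismatch in fragment number or size between the predicted and observed patterns would point to a misassembly or a missing region in the finished BAC sequence.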

Two BACs (T5M2 and T5E7) within the genetically defined centromere (ref. 26) were found to contain long stretches of sequence with very high similarity to the Arabidopsis C24 mitochondrial genome (69 kb on T5M2, and 74 kb on T5E7) joined directly to nuclear sequence. Two overlapping BACs provided an additional 122 kb of purely mitochondrial sequence to close the gap, suggesting that the total size of the organellar DNA insertion is about 270 kb. Verification of the mitochondrial insertion involved PCR amplification of the two flanking junction sequences from Arabidopsis Columbia genomic DNA. PCR products with the expected size were amplified that had sequence identical to the junction BACs. Similar sized junction PCR products were obtained using additional BACs from both the TAMU and IGF libraries. Determination of the exact size and sequence of the insertion will be facilitated once the sequence of the Columbia mitochondrial genome is known, and it becomes possible to distinguish between nuclear- and organellar-derived mitochondrial sequences.

The sequence of chromosome 2 identified by the BAC clone F11L15 did not contain the telomeric repeat consensus sequence 5′-TTTAGGG-3′ (ref. 15). However, the sequence of the lower telomere of chromosome 2 had been previously identified in the clone pAtT51 (ref. 15) (http://genome-www.stanford.edu/Arabidopsis/ww/Nov98RImaps/index.html). pAtT51 was sequenced to identify primer sites that could be used to walk inwards from the telomere. Southern hybridization using probes derived from F11L15 and pAtT51 suggested that the gap between the BAC contig and pAtT51 was less than 1 kb. A 702-bp PCR product that spanned this physical gap was amplified from Arabidopsis Columbia genomic DNA and sequenced.
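Checking a clone sequence for the telomeric repeat consensus can be sketched with a simple pattern search. This is a minimal illustration, not the method used here: the function names are hypothetical, only perfect copies of the TTTAGGG unit are counted, and both strands are scanned.

```python
import re

TELOMERE_UNIT = "TTTAGGG"  # consensus repeat quoted in the text


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_telomere_run(seq: str, unit: str = TELOMERE_UNIT) -> int:
    """Return the largest number of back-to-back perfect copies of unit
    found on either strand of seq (0 if the unit is absent)."""
    best = 0
    pattern = re.compile(f"(?:{unit})+")
    for strand in (seq.upper(), reverse_complement(seq)):
        for match in pattern.finditer(strand):
            best = max(best, len(match.group(0)) // len(unit))
    return best


if __name__ == "__main__":
    print(longest_telomere_run("ACGT" + "TTTAGGG" * 3 + "ACGT"))
    print(longest_telomere_run("CCCTAAA" * 2))
    print(longest_telomere_run("ACGTACGT"))
```

A result of zero for a terminal clone, as described above for F11L15, indicates that the sequence stops short of the telomere and that a gap remains to be bridged.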

Analysis

Annotation of chromosome 2 involved both DNA and protein database searches and gene prediction programs. Gene predictions were made by Genscan (ref. 40), Genefinder (P. Green and L. Hillier, unpublished data) and GRAIL (ref. 41). Splice sites were identified by these programs, by NetPlantGene (ref. 42) and by alignment with EST and protein sequences using the dds and dps programs (ref. 43). Predicted protein sequences were searched against a non-redundant amino-acid database and against HMMs of the protein domains from pFAM3 (1,407) and TIGR built HMMs (502) with the HMMer2 program (ref. 17). tRNAs were identified using tRNAscan-SE (ref. 44). Repeats were identified using RepeatMasker (A. F. A. Smit and P. Green, http://ftp.genome.washington.edu/RM/RepeatMasker.html) and a modified version of the suffix tree algorithm used in the MUMmer system (ref. 45). Output from the gene finding and signal detection programs was displayed using the Annotator genome viewer (L. Zhou, unpublished data) and manually edited. The accuracy of computational gene finders for Arabidopsis is quite variable and all gene predictions that are not supported by independent evidence, such as protein or EST homology, should be regarded as tentative, pending further research.

Putative membrane-spanning proteins were identified using TopPred (ref. 46). SignalP (ref. 47) was used to detect signal peptide sequences, while Predotar (N. Peeters et al., unpublished) searched for organellar targets. In addition, ChloroP (ref. 48) and Mitoprot (ref. 49) were used to find proteins targeted for the chloroplast and mitochondria, respectively. The results presented are the concordance of the predictions.

We analysed the proteome of chromosome 2 by comparing each predicted protein with a database of all proteins from chromosome 2, all available completed genomes and all proteins from A. thaliana and C. elegans using described methods (ref. 50). To identify genes on other chromosomes that were related to chromosome-2 genes, nucleotide searches were also used, as much of the nucleotide sequence in GenBank is unannotated. Each predicted transcript was searched against all available genomic sequence using a cut-off of 75% identity over 50% of the transcript length in order to take intron sequences into account.
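The cut-off described above (75% identity over 50% of the transcript length) can be expressed as a simple filter on search hits. The sketch below is illustrative only: the `Hit` record and its field names are hypothetical and do not correspond to the output format of any particular search program.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One transcript-to-genome match (hypothetical record layout)."""
    transcript_id: str
    transcript_length: int  # length of the predicted transcript, in bases
    aligned_length: int     # transcript bases covered by the match
    identity: float         # percent identity over the aligned bases


def passes_cutoff(hit: Hit, min_identity: float = 75.0,
                  min_coverage: float = 0.5) -> bool:
    """Apply the stated thresholds: identity of at least 75% over at
    least 50% of the transcript length."""
    coverage = hit.aligned_length / hit.transcript_length
    return hit.identity >= min_identity and coverage >= min_coverage


if __name__ == "__main__":
    hits = [
        Hit("geneA", 1000, 600, 82.0),  # kept
        Hit("geneB", 1000, 300, 95.0),  # too little of the transcript covered
        Hit("geneC", 1000, 900, 70.0),  # identity too low
    ]
    print([h.transcript_id for h in hits if passes_cutoff(h)])
```

Requiring coverage of only half the transcript leaves room for matches that are broken up by intron sequence in the genomic target.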

To search for large-scale duplications within the genome, we assembled pseudomolecules representing all available sequence of chromosomes 1 and 4. The sequence of chromosome 2 was then compared with chromosomes 1 and 4 using the MUMmer program (ref. 45) with a word size of 20. Long-range duplications were found by plotting the results, and a local alignment program, dds (ref. 43), was used to identify matches between the exact repeats within the duplicated regions (stringency >75%). Because of insufficient sequence or overlap information, it is difficult at the present time to construct chromosome-size pseudomolecules of chromosomes 3 and 5 and assess the possibility of other large duplications.
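The idea of seeding a dot plot with exact matches of a fixed word size can be sketched with a dictionary of words. Note the substitution: MUMmer itself uses a suffix tree to find maximal unique matches, whereas this sketch uses a plain hash index of fixed-length words, searches the forward strand only, and uses hypothetical function names. It is meant to show the principle, not to reproduce the program.

```python
from collections import defaultdict


def shared_words(seq_a: str, seq_b: str, word_size: int = 20):
    """Return sorted (pos_a, pos_b) pairs at which both sequences carry
    the same exact word of length word_size (forward strand only)."""
    index = defaultdict(list)
    for i in range(len(seq_a) - word_size + 1):
        index[seq_a[i:i + word_size]].append(i)

    matches = []
    for j in range(len(seq_b) - word_size + 1):
        for i in index.get(seq_b[j:j + word_size], ()):
            matches.append((i, j))
    return sorted(matches)


if __name__ == "__main__":
    # Toy sequences sharing one 8-base word; a small word size keeps the
    # example readable (the text uses a word size of 20).
    a = "AAAA" + "ACGTTGCA" + "CCCC"
    b = "GG" + "ACGTTGCA" + "TT"
    print(shared_words(a, b, word_size=8))
```

Plotting the resulting coordinate pairs gives a dot plot in which a large duplication appears as a long diagonal run of points.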

Received 5 October; accepted 29 October 1999.

1. Schmidt, R. & Dean, C. Towards construction of an overlapping YAC library of the Arabidopsis thaliana genome. Bioessays 15, 63–69 (1993).
2. Bevan, M. et al. Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391, 485–488 (1998).
3. Bevan, M. et al. Objective: the complete sequence of a plant genome. Plant Cell 9, 476–478 (1997).
4. Copenhaver, G. P. & Pikaard, C. S. Two-dimensional RFLP analyses reveal megabase-sized clusters of rRNA gene variants in Arabidopsis thaliana, suggesting local spreading of variants as the mode for gene homogenization during concerted evolution. Plant J. 9, 273–282 (1996).
5. Copenhaver, G. P. & Pikaard, C. S. RFLP and physical mapping with an rDNA-specific endonuclease reveals that nucleolus organizer regions of Arabidopsis thaliana adjoin the telomeres on chromosomes 2 and 4. Plant J. 9, 259–272 (1996).
6. Schweizer, D., Loidl, J. & Hamilton, B. Heterochromatin and the phenomenon of chromosome banding. Results Probl. Cell Differ. 14, 235–254 (1987).
7. Martinez-Zapater, J. M., Estelle, M. A. & Somerville, C. R. A highly repeated DNA sequence in Arabidopsis thaliana. Mol. Gen. Genet. 204, 417–423 (1986).
8. Maluszynska, J. & Heslop-Harrison, J. S. Localization of tandemly repeated DNA sequences in Arabidopsis thaliana. Plant J. 1, 159–166 (1991).
9. Simoens, C. R., Gielen, J., Van Montagu, M. & Inze, D. Characterization of highly repetitive sequences of Arabidopsis thaliana. Nucleic Acids Res. 16, 6753–6766 (1988).
10. Pluta, A. F., Mackay, A. M., Ainsztein, A. M., Goldberg, I. G. & Earnshaw, W. C. The centromere: hub of chromosomal activities. Science 270, 1591–1594 (1995).
11. Round, E. K., Flowers, S. K. & Richards, E. J. Arabidopsis thaliana centromere regions: genetic map positions and repetitive DNA structure. Genome Res. 7, 1045–1053 (1997).
12. Zachgo, E. A. et al. A physical map of chromosome 2 of Arabidopsis thaliana. Genome Res. 6, 19–25 (1996).
13. Marra, M. A. et al. High throughput fingerprint analysis of large-insert clones. Genome Res. 7, 1072–1084 (1997).
14. Copenhaver, G. P., Doelling, J. H., Gens, J. S. & Pikaard, C. S. Use of RFLPs larger than 100 kbp to map the position and internal organization of the nucleolus organizer region on chromosome 2 in Arabidopsis thaliana. Plant J. 7, 273–286 (1995).
15. Richards, E. J., Chao, S., Vongs, A. & Yang, J. Characterization of Arabidopsis thaliana telomeres isolated in yeast. Nucleic Acids Res. 20, 4039–4046 (1992).
16. Charrasse, S. et al. Characterization of the cDNA and pattern of expression of a new gene overexpressed in human hepatomas and colonic tumors. Eur. J. Biochem. 234, 406–413 (1995).
17. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. & Durbin, R. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26, 320–322 (1998).
18. Leete, E. Recent developments in the biosynthesis of the tropane alkaloids. Planta Med. 56, 339–352 (1990).
19. Yamada, Y. et al. in Secondary Products from Plant Tissue Culture (eds Charlwood, B. V. & Rhodes, M. J. C.) 227–242 (Clarendon, Oxford, 1990).
20. Marrs, K. A. The functions and regulation of glutathione S-transferases in plants. Annu. Rev. Plant Physiol. 47, 127–158 (1996).
21. Terryn, N. et al. Evidence for an ancient chromosomal duplication in Arabidopsis thaliana by sequencing and analyzing a 400-kb contig at the APETALA2 locus on chromosome 4. FEBS Lett. 445, 237–245 (1999).
22. Martin, W. & Herrmann, R. G. Gene transfer from organelles to the nucleus: how much, what happens, and why? Plant Physiol. 118, 9–17 (1998).
23. Blanchard, J. L. & Schmidt, G. W. Pervasive migration of organellar DNA to the nucleus in plants. J. Mol. Evol. 41, 397–406 (1995).
24. Unseld, M., Marienfeld, J. R., Brandt, P. & Brennicke, A. The mitochondrial genome of Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nature Genet. 15, 57–61 (1997).
25. Klein, M. et al. Physical mapping of the mitochondrial genome of Arabidopsis thaliana by cosmid and YAC clones. Plant J. 6, 447–455 (1994).
26. Copenhaver, G. P. et al. Genetic definition and sequence analysis of Arabidopsis centromeres. Science (in the press).
27. Thompson, H. L., Schmidt, R. & Dean, C. Identification and distribution of seven classes of middle-repetitive DNA in the Arabidopsis thaliana genome. Nucleic Acids Res. 24, 3017–3022 (1996).


28. Thompson, H., Schmidt, R., Brandes, A., Heslop-Harrison, J. S. & Dean, C. A novel repetitive sequence associated with the centromeric regions of Arabidopsis thaliana chromosomes. Mol. Gen. Genet. 253, 247–252 (1996).
29. Thompson, H. L., Schmidt, R. & Dean, C. Analysis of the occurrence and nature of repeated DNA in an 850 kb region of Arabidopsis thaliana chromosome 4. Plant Mol. Biol. 32, 553–557 (1996).
30. Brandes, A., Thompson, H., Dean, C. & Heslop-Harrison, J. S. Multiple repetitive DNA sequences in the paracentromeric regions of Arabidopsis thaliana L. Chromosome Res. 5, 238–246 (1997).
31. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
32. Gale, M. D. & Devos, K. M. Comparative genetics in the grasses. Proc. Natl Acad. Sci. USA 95, 1971–1974 (1997).
33. Wolfe, K. H. & Shields, D. C. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708–713 (1997).
34. Hughes, A. L. The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. B Biol. Sci. 256, 119–124 (1994).
35. The European Union Arabidopsis Genome Sequencing Consortium & The Cold Spring Harbor, Washington University in St Louis and PE Biosystems Arabidopsis Sequencing Consortium. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402, 769–777 (1999).
36. Choi, S., Creelman, R. A., Mullet, J. E. & Wing, R. A. Construction and characterization of bacterial artificial chromosome library of Arabidopsis thaliana. Plant Mol. Biol. Rep. 13, 124–128 (1995).
37. Mozo, T., Fischer, S., Meier-Ewert, S., Lehrach, H. & Altmann, T. Use of the IGF BAC library for physical mapping of the Arabidopsis thaliana genome. Plant J. 16, 377–384 (1998).
38. Liu, Y.-G., Mitsukawa, N., Vasquez-Tell, A. & Whittier, R. F. Generation of a high-quality P1 library of Arabidopsis suitable for chromosome walking. Plant J. 7, 351–358 (1995).
39. Sutton, G. G., White, O., Adams, M. D. & Kerlavage, A. R. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome 1, 9–19 (1995).
40. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
41. Uberbacher, E. C. & Mural, R. J. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA 88, 11261–11265 (1991).
42. Hebsgaard, S. M. et al. Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information. Nucleic Acids Res. 24, 3439–3452 (1996).
43. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997).
44. Lowe, T. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
45. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
46. Claros, M. G. & von Heijne, G. TopPred II: an improved software for membrane protein structure predictions. Comput. Appl. Biosci. 10, 685–686 (1994).
47. Nielsen, H., Brunak, S. & von Heijne, G. Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 12, 3–9 (1999).
48. Emanuelsson, O., Nielsen, H. & von Heijne, G. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 8, 978–984 (1999).
49. Claros, M. G. & Vincens, P. Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur. J. Biochem. 241, 779–786 (1996).
50. Nelson, K. E. et al. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399, 323–329 (1999).

Acknowledgements

We thank S. Jackson, J. Jiang, K. Meyer, E. Richards and S. Tabata for helpful contributions in completing the chromosome. In addition, we would like to thank the TIGR faculty, system administrators, sequencing facility, bioinformatics department and administrative staff. This work was supported by the National Science Foundation, the Department of Energy and the US Department of Agriculture.

Correspondence and requests for materials should be addressed to J. C. Venter (e-mail: [email protected]).


Figure 3 Gene map of Arabidopsis thaliana chromosome 2. Predicted gene models are

shown with arrowheads indicating the direction of transcription. Genes are colour coded

according to broad role categories as shown in the key. The region highlighted in pink

indicates the genetically defined centromere within which the cross-hatched gene-free

region represents the mitochondrial insertion. The gap indicates the highly repetitive

centromeric region. Note that the NOR is not drawn to scale. Gene identification numbers

follow the recently adopted nomenclature (At2gNNNNN), where At is Arabidopsis thaliana,

2 is chromosome number, g indicates that the feature is a gene and numbering is from top

to bottom of each chromosome in increments of 10.
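The gene identification scheme described in this caption can be illustrated with a small parser. This is a sketch only: the function and class names are hypothetical, a five-digit serial is assumed from the NNNNN placeholder, and the example identifiers are used purely for illustration.

```python
import re
from typing import NamedTuple


class GeneId(NamedTuple):
    chromosome: int  # chromosome number (2 for this chromosome)
    serial: int      # position-ordered number, assigned in steps of 10


# "At" = Arabidopsis thaliana, one digit = chromosome, "g" = gene,
# followed by five digits (assumed from the At2gNNNNN template).
_GENE_ID = re.compile(r"^At(\d)g(\d{5})$", re.IGNORECASE)


def parse_gene_id(name: str) -> GeneId:
    """Split an identifier such as At2g01010 into its parts."""
    match = _GENE_ID.match(name)
    if match is None:
        raise ValueError(f"not an AtNgNNNNN identifier: {name!r}")
    return GeneId(int(match.group(1)), int(match.group(2)))


def format_gene_id(chromosome: int, serial: int) -> str:
    """Build an identifier from a chromosome number and serial."""
    return f"At{chromosome}g{serial:05d}"


if __name__ == "__main__":
    first = parse_gene_id("At2g01010")
    print(first)
    # With increments of 10, the next gene down the chromosome would be:
    print(format_gene_id(first.chromosome, first.serial + 10))
```

Numbering in steps of 10 leaves room to insert newly recognized genes between existing identifiers without renumbering the chromosome.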

[Figure 3 image: gene map of chromosome 2 with gene identification numbers printed along the map. Key entries: nucleolar organizer region, telomere, tRNA, 18S RNA, 5.8S RNA, 26S RNA, snRNA, sequence gap, mitochondrial region, genetic centromere.]

0

4303

0

1994

019

930

4302

0

1992

0

4301

0

1991

019

900

1989

019

880

2960

029

590

2958

0

2957

0

2957

0

2956

029

550

2954

029

530

4170

4180

4190

2741

0

4200

2740

0

4210

2739

027

380

2737

027

360

2735

027

340

2733

0

3799

037

980

3797

037

960

3795

037

940

3793

037

920

3791

037

900

1772

017

670

1765

017

800

1777

017

740

1771

017

690

1768

0

3789

037

880

3787

037

860

3785

037

840

3783

037

820

3781

0

3980

0

3780

0

3981

039

820

3983

039

840

3985

0

1766

0

3638

036

390

3640

036

410

3642

036

430

3644

036

450

3646

0

3779

037

780

3714

037

130

3712

037

110

3710

037

090

3398

0

3708

0

3399

0

3560

3400

0

3707

0

3570

3401

0

3706

0

3402

0

3580

3705

0

3403

0

3590

3404

0

3600

3405

0

3610

3620

3406

034

070

3630

3640

3408

034

090

3410

0

2732

0

3411

0

2731

0

3412

0

2730

0

3413

034

140

2729

0

3415

0

2728

027

270

3416

0

2726

0

3417

0

2725

027

240

2723

0

2838

028

370

2836

028

350

2834

028

330

2832

028

310

3879

038

800

3881

038

820

3883

038

840

3721

037

200

3719

037

180

3717

037

160

3715

0

3808

038

090

3810

0

3811

038

120

3813

038

140

3815

038

160

3029

030

300

3031

030

320

3033

030

340

3035

030

360

3037

030

380

4632

046

310

4630

046

290

4628

046

270

4626

0

4762

047

610

2590

2580

2570

2560

2550

2540

2530

2520

2510

2696

0

3055

0

4307

0

3054

0

2697

026

980

3053

0

4308

0

3052

0

2699

0

4309

0

3051

0

2700

0

4310

0

3050

0

2701

0

4311

0

3049

0

2702

0

4312

043

130

3047

0

2703

0

4314

0

3046

0

2704

0

4315

0

2500

2490

2480

4812

048

130

4814

048

150

4816

0

3044

030

420

3041

030

400

3039

0

3039

030

480

3043

030

710

3072

030

730

3074

030

750

3076

030

770

3078

030

790

1920

1930

1111

011

120

1113

011

140

1115

011

160

1117

011

180

4800

0

1119

0

4801

048

020

4803

048

040

1409

014

100

1411

014

120

1413

014

140

1415

014

160

4805

048

060

4807

048

080

4809

048

100

4811

0

4290

042

890

4288

042

870

4286

042

850

4284

042

830

4282

042

810

4280

042

790

4278

042

770

4276

042

750

4274

042

730

4272

042

710

2230

2240

2232

022

310

2250

2260

2230

0

2270

2229

0

2280

2280

2290

2227

0

2300

2226

022

250

2565

025

660

2567

025

680

2569

0

4270

042

690

2570

025

710

4268

042

670

2572

0

4266

0

2573

0

3577

035

760

3575

035

740

3573

035

720

3020

030

170

3224

032

230

2168

0

3222

032

210

2166

0

3220

0

1909

0

2165

0

3219

032

180

1910

0

2164

0

1911

0

3217

032

160

1912

019

130

1914

0

2686

0

1915

0

2685

0

1916

0

2684

0

1917

0

2683

026

820

2681

026

800

2679

0

3692

036

930

1918

0

3694

0

1919

0

3695

036

960

2786

0

3697

0

1920

0

3698

036

990

2784

027

830

3368

0

3700

0

2782

0

3370

0

3701

0

4033

0

3371

0

2781

0

4034

0

3373

0

4035

0

4035

0

3374

0

4036

0

3375

0

4037

0

3376

0

4038

0

3377

033

790

4039

040

400

4041

040

420

3063

0

3702

0

3064

0

3703

037

040

3065

030

660

3067

030

680

3069

030

700

1692

016

930

1694

016

950

1696

016

970

1698

0

1084

010

830

1082

010

810

2587

0

1080

0

2586

025

850

1079

010

780

2584

025

830

2582

025

810

2580

025

790

2578

0

1921

019

220

1923

019

240

1925

019

260

1927

019

280

1929

0

2577

025

760

2575

0

1311

013

120

2574

0

1313

013

140

1315

013

160

1016

010

170

1018

0

1120

0

1019

0

1121

0

1020

0

1122

0

1021

0

1123

011

240

1022

0

3158

0

1125

0

1023

0

3159

0

1126

0

1024

0

3160

031

610

3162

031

630

3164

031

650

3166

031

670

4149

041

500

4151

041

520

4153

041

540

4155

041

560

4157

0

3168

031

690

3170

031

710

3172

0

2930

2950

2960

2970

2980

2990

3000

3010

3020

3990

4000

4010

4020

4030

3030

3040

3050

3060

3070

3080

3090

3964

039

650

3966

039

670

3968

039

690

3970

0

7620

7610

7600

7590

7580

7570

7560

7550

7540

4397

043

980

4399

044

000

1863

0

4401

044

020

1864

018

650

4403

044

040

1866

018

670

4405

0

1868

0

4406

0

1869

018

700

1871

0

3021

030

220

3023

030

240

3025

030

260

3027

030

280

4407

0

1077

0

4408

044

090

1076

010

750

4410

044

110

1074

0

4412

0

1073

010

720

4413

044

140

1071

010

700

4415

044

160

4417

044

180

4419

0

4726

047

270

4728

047

290

4730

047

310

4732

047

330

4734

047

350

4489

044

880

4487

044

860

4485

044

840

4483

044

820

4481

0

3324

033

230

3322

0

3935

0

3321

033

200

3934

039

330

3319

0

3932

0

3318

0

3931

0

3317

0

3930

0

1819

0

3929

039

280

1818

0

4736

047

370

1817

0

3927

0

1816

0

4738

047

390

1815

018

140

4740

0

1813

0

4509

045

080

4507

045

060

4505

045

040

4503

045

010

3117

031

180

3119

031

200

3121

031

220

3123

031

240

3125

0

2125

021

260

4500

0

2127

021

280

4499

044

980

2129

0

4497

0

2130

0

4496

0

2131

0

4495

0

2132

021

330

4494

0

6900

6910

6920

6930

6940

6950

6960

6970

6980

1256

012

570

1258

012

590

1260

012

610

1262

012

630

1220

1230

1240

1250

1260

1270

1280

1290

1300

1310

4306

043

050

2694

026

930

1320

1684

0

2692

0

1330

1685

0

2691

026

900

1686

0

1340

2689

0

1687

0

1350

2688

0

1688

0

1360

2687

0

1370

1689

016

900

1380

1691

0

4717

047

180

4719

047

200

4721

047

220

4723

047

240

4725

0

3817

038

180

3819

038

200

3821

038

220

3823

038

240

3825

038

260

1058

0

3325

033

260

1057

010

560

3327

0

1055

0

3328

0

1054

0

3329

0

1053

0

3330

033

310

1052

0

3332

0

1051

010

500

4597

0

3827

038

280

4596

045

950

3829

0

4594

045

930

3276

032

750

3274

032

730

3272

032

710

3270

032

690

3268

032

670

4180

041

790

3266

0

4178

041

770

3265

0

3264

0

4176

041

750

4174

041

730

4172

041

710

1157

011

560

1155

011

540

1153

011

520

1151

0

3148

0

1150

0

3147

0

1149

011

480

3146

031

450

3144

031

430

3142

0

4170

0

1273

012

720

1271

012

700

1269

012

680

1267

0

1147

0

2390

023

910

2392

023

930

2394

023

950

1062

010

610

2396

0

1060

0

2397

023

980

1059

0

2399

024

000

2401

024

020

2403

024

040

2405

024

060

2407

024

080

2409

0

7680

4548

045

470

4546

045

450

4544

045

430

4542

045

410

4540

0

4461

044

620

4463

044

640

4465

0

3683

0

4466

0

3684

036

850

4467

044

680

3687

0

4469

0

3688

0

4470

0

3689

036

900

1417

014

180

3691

0

1419

0

4471

044

720

4473

044

740

4475

044

760

4477

044

780

4672

0

4480

044

790

4010

040

110

7690

7700

7710

7720

7730

7740

7750

7760

7770

3470

3460

3450

3440

3430

3420

3410

3400

3390

3380

3370

3360

3350

3340

3330

3320

2811

0

3310

3300

2810

028

090

3290

2808

0

3280

2807

028

060

2805

0

1930

0

2804

0

1931

0

2803

0

1932

019

330

1934

019

350

1936

019

370

1938

019

390

3841

038

400

3839

038

380

3837

038

360

3835

038

340

3833

038

320

3270

1532

015

310

1530

0

1530

0

1529

015

280

1527

015

260

1940

019

410

2327

023

260

2325

023

240

2323

0

3831

038

300

2322

023

210

2320

023

190

5430

2318

0

5420

5410

1726

0

5400

1727

0

5390

1728

0

5380

5370

1730

0

5360

1731

017

320

1733

0

2317

023

160

2315

023

140

2313

0

4620

4610

4600

4590

4580

2058

020

570

2988

029

890

2990

029

910

2289

022

900

2291

022

920

2293

022

940

2295

0

2030

2535

025

340

2040

2533

0

2050

2532

0

2060

1590

0

4121

0

1589

0

4081

0

4120

041

190

4082

0

1588

015

870

4083

0

4118

041

170

1586

0

4084

0

4116

0

1585

0

4085

040

860

1584

0

4115

0

4087

0

1583

0

4114

0

4114

0

1582

0

4763

0

4088

040

890

4764

0

1581

0

4765

047

660

4767

047

680

4769

0

4769

047

700

2051

0

2760

2750

2052

020

530

2740

2730

2054

0

2498

024

970

2055

0

2720

2056

0

2710

2496

0

2700

2495

0

2690

2494

024

930

2492

024

910

2490

0

1580

015

790

2489

0

1578

015

770

1576

0

2488

024

870

2486

024

850

2484

024

830

2482

024

810

2480

024

790

3062

030

610

3060

030

590

3058

030

570

3056

0

4300

0

2478

0

2280

022

810

2282

022

830

1598

0

1598

0

2284

0

1597

0

2285

0

1596

0

2286

0

1594

0

2287

0

6550

1593

0

2288

0

6560

1592

0

6570

1591

0

6580

6590

6600

6610

1538

015

370

1536

015

350

1534

015

330

1470

014

690

1468

014

670

1439

014

380

1437

014

360

1435

014

340

1433

014

320

3560

0

4716

0

3559

035

580

3557

035

560

3555

035

540

3553

0

4682

046

830

4684

046

850

4685

0

2296

022

970

2298

022

990

2300

023

010

2303

0

1640

016

390

1638

016

370

1636

016

350

1634

016

330

1632

016

310

1630

016

290

1628

0

4299

042

980

1627

016

260

4296

0

1625

0

4295

042

940

1623

0

4293

042

920

4291

0

2507

025

060

2505

025

040

2503

025

020

2501

025

000

2499

0

2846

028

470

2848

028

490

2850

028

510

4880

2852

028

530

4890

4900

4910

4920

4930

4940

4950

4960

1724

017

250

2854

028

550

2856

028

570

1040

1050

1060

1070

1080

1090

1100

1110

1120

3333

033

340

3335

0

1599

0

3336

0

1600

0

3337

0

1601

0

3338

0

1602

0

3339

0

1603

0

3340

0

1604

0

3341

0

1605

0

3342

0

1606

016

070

2114

021

130

2112

0

2111

021

100

2109

021

080

2107

021

060

2105

0

1049

0

3343

0

1048

0

3344

0

1047

0

3345

033

460

3347

033

480

3349

033

500

3351

033

520

1900

019

010

1902

0

2104

0

1903

0

2103

0

1904

0

2102

0

1905

0

2101

0

2840

1906

0

2100

020

990

2850

1907

019

080

2860

2098

0

2870

2097

0

2880

2096

020

950

2890

2910

3353

033

540

3355

0

2673

0

3356

033

570

2674

026

750

3358

0

2676

0

3359

033

600

2677

026

780

3361

033

620

3847

038

460

3845

038

440

2094

0

3843

038

420

3363

033

640

3367

0

4229

042

300

4231

042

320

4233

042

340

4235

042

360

4237

0

1659

016

580

1657

016

560

1655

016

540

1653

0

1452

014

510

1450

014

490

1448

014

470

1446

014

450

1444

014

430

1025

010

260

1027

0

3260

3250

3240

4238

042

390

4240

042

410

4243

042

440

4245

042

460

4247

0

1442

014

410

1440

0

1942

019

430

1944

019

450

1946

019

470

4248

0

1948

0

4249

0

1949

0

4250

0

4970

4980

4990

5000

5010

5020

5030

5040

5050

5060

1203

0

1203

012

040

1205

012

060

1207

0

5070

1208

0

5080

1209

012

100

5090

1211

0

4840

4830

4820

4810

4800

4790

4780

4770

4592

045

910

4590

045

890

4588

045

870

4586

0

4760

4585

0

4750

4740

4584

045

830

4730

4720

4710

4700

4690

4680

4670

3944

039

450

4582

0

3946

039

470

4581

045

800

3948

039

490

4579

045

780

3950

039

510

4577

045

760

4660

3952

039

530

4575

0

4650

4640

4574

045

730

4630

3954

039

550

3956

0

4572

0

3957

0

4571

045

700

3958

039

590

3962

039

630

1341

013

400

1339

013

380

1337

013

360

1335

013

340

1745

017

460

3752

037

530

1747

017

480

3754

0

1749

0

3755

0

1750

0

3756

0

1751

0

3757

0

1752

0

3758

0

1753

0

3759

0

3089

0

5200

3090

0

5210

3091

0

5220

3092

030

930

5230

3094

0

5240

3095

0

5250

5260

3096

030

970

5270

5280

3098

0

5290

1293

012

920

1291

012

900

1289

012

880

1287

0

2469

0

3099

0

5300

3100

0

2468

024

670

3101

0

5310

5320

3102

0

2466

0

3103

0

5330

2465

0

5340

2464

024

630

2462

024

610

2256

022

550

2254

022

530

2252

022

510

1652

0

3900

1651

0

3910

1650

0

3920

1649

0

3930

3940

3950

3960

3970

3980

4493

044

920

4491

044

900

2819

028

180

2817

028

160

2815

028

140

2813

028

120

4557

045

580

4559

045

600

4561

045

620

4563

045

640

4565

045

660

4625

046

240

4623

046

220

4621

046

200

4619

046

180

4617

046

160

4567

045

680

4569

0

4615

046

140

4613

046

110

4610

0

4609

0

4609

0

4608

046

070

4606

0

3316

033

150

3313

033

120

3311

033

100

3722

0

3309

0

3723

0

3308

0

3724

037

250

3307

0

3726

0

4327

0

3727

037

280

3728

0

4328

0432

90

3729

037

300

4331

043

320

4333

043

340

4335

042

650

4264

0

2336

0

2336

0

4605

0

2337

0

4604

0

2338

0

4603

0

2339

0

3477

0

4602

0

2340

0

4601

0

2341

0

3476

0

4600

0

2342

0

3475

034

740

4599

0

2343

0

4598

0

2344

0

3473

034

720

3306

0

3471

0

3305

0

3470

0

3304

033

030

3302

033

010

1399

014

000

1401

014

020

1403

014

040

1405

014

060

1407

014

080

2722

027

210

2720

027

190

2718

0

4420

044

210

4422

0

1494

0

4423

0

1495

0

4424

044

250

1496

0

4426

044

270

4428

0

4160

4150

4140

2893

0

4130

2894

0

4120

2895

0

4110

2896

028

970

4100

4090

2898

028

990

4080

2900

029

010

7780

4024

040

250

4026

0

3926

0

4027

0

3925

0

4028

0

3924

0

4029

0

1175

0

3923

0

4030

0

1174

0

3922

0

1173

0

4031

0

3921

0

1170

0

4032

0

3920

0

1282

0

3919

0

1169

0

1281

0

1168

0

3918

0

1280

0

1167

0

3917

0

1279

0

1172

0

1278

0

1171

0

1277

0

1130

1276

0

1140

1275

0

1150

1274

0

1170

1180

1190

1200

1210

1480

1490

1500

1510

1520

1530

1540

1550

1560

1570

3916

039

150

3914

0

1516

015

150

1514

015

130

1580

1590

1600

4055

040

540

1788

0

4053

0

1787

0

4052

0

1786

0

4051

0

1785

0

4050

0

1784

017

830

4049

040

480

1782

0

4047

0

2514

0

4046

0

2266

022

670

2268

022

690

2270

022

710

2272

022

730

2274

0

4045

040

440

4043

0

2515

0

1846

018

450

1844

018

420

1812

018

410

1811

018

400

2516

0

2869

028

680

2987

0

2867

028

660

2986

029

850

2865

028

640

2984

0

2863

0

2983

0

2862

0

2982

029

810

2093

020

920

2091

020

900

2089

020

880

2087

020

860

2085

020

840

2517

0

2124

021

230

2122

0

2083

020

820

2081

0

4641

0

2080

0

4642

0

2079

0

4643

0

2078

020

770

4644

046

450

2076

020

750

4646

0

2074

0

4647

0

2518

0

2073

020

720

2071

0

2845

028

440

2070

020

690

2843

0

2068

0

2842

028

410

2840

028

390

4379

043

780

4377

043

760

4375

043

740

4373

043

720

4371

043

700

4072

040

710

4070

0

2509

0

4069

040

680

4067

040

660

4065

040

640

3426

034

270

3428

034

290

3430

034

310

3432

034

330

4369

043

680

3434

034

350

4367

043

660

4365

043

640

4363

043

620

4361

043

600

1302

013

010

1300

012

990

1298

012

970

1296

0

3436

0

1295

012

940

3437

0

4359

043

580

4357

0

2275

022

760

2277

022

780

2279

0

1547

015

480

1549

015

500

1551

015

520

1553

015

540

4158

041

590

4160

041

610

4162

041

630

4164

041

650

4166

041

670

2634

026

350

2636

026

370

2638

0

1466

0

2639

0

1465

0

2640

0

1464

0

2641

0

1463

0

2642

0

1462

014

610

1460

0

4168

0

1459

0

4169

0

1458

0

2802

028

010

2800

027

990

1660

016

610

4570

1662

016

630

4560

4550

1664

0

4540

1665

0

1754

0

3760

0

4530

1666

0

1755

0

3761

037

620

4520

1667

0

1756

017

570

4510

1668

0

3763

0

1758

0

4500

3764

037

650

1759

017

600

3766

037

670

3768

037

690

3770

037

710

3772

0

5550

3773

0

5540

3774

0

5530

3775

0

5520

3776

0

5510

3777

0

5500

5490

5480

5470

5460

5450

5440

5790

5780

5770

3230

5760

5750

3220

3210

3200

3190

3180

3170

3160

3150

3140

3130

3120

3110

3100

4270

4260

4250

4240

3116

0

4290

3115

0

4230

4220

3114

031

130

3112

031

110

3110

0

2470

024

710

2472

024

730

2474

024

750

2476

0

2477

0

2310

2320

2330

2340

2350

2360

1810

2370

2380

1800

2390

1790

1780

2400

1770

1760

1750

1740

2035

020

340

2033

020

320

2031

020

300

2029

0

2029

0

2028

020

270

2026

0

3296

0

2410

3295

0

2224

022

230

2420

3294

0

2430

3293

0

2221

0

2440

3292

0

2450

3291

0

2220

0

2460

3290

0

2219

022

180

2470

3289

0

2217

0

3288

0

2216

0

3287

0

2215

0

2025

020

240

2023

020

220

1069

0

2021

0

1068

0

2020

0

1067

010

660

1065

010

640

1063

0

3286

0

2214

0

3285

0

2067

0

3284

0

2066

0

3283

0

2065

020

640

2063

020

620

2061

020

600

2059

0

1375

013

760

1377

013

780

1379

0

1380

0

1184

011

830

1182

011

810

1180

011

790

1178

0

4633

0

1177

0

4634

0

1176

0

4635

046

360

4637

046

380

4639

046

400

2042

020

410

2040

020

390

2038

020

370

2036

0

1244

012

450

1246

012

470

1248

012

490

1250

012

510

1252

0

2881

028

820

6420

2883

028

840

6430

3608

036

090

6440

2885

0

6450

2886

0

3610

036

110

2887

0

3612

0

2888

0

3613

0

2889

0

4648

0

5740

2890

0

4649

0

5730

4650

0

3616

0

5720

4651

0

5710

5700

4652

046

530

5690

4654

0

5680

4655

0

5670

4656

0

4017

0

1046

010

450

1044

010

430

1042

0

2891

028

920

1041

010

400

6820

6810

6800

6790

6780

6770

6760

6750

1555

015

560

1557

015

580

1559

015

600

1561

015

620

1563

015

640

1347

013

480

1349

013

500

1351

013

520

1353

013

540

1355

0

2823

028

220

2821

028

200

1565

015

660

1567

015

680

1569

015

700

1571

015

720

1573

015

740

1575

0

1940

1950

1960

1970

1980

1990

2000

2010

2020

1031

010

300

1029

0

1028

0

1028

0

2664

026

650

2666

026

670

2668

026

690

2670

026

710

2672

0

3080

030

810

3082

030

830

3084

030

860

3087

030

880

1366

013

670

1368

0

9960

1369

0

9970

1370

0

9980

1371

013

720

9990

1000

0

1373

013

740

1001

010

020

1003

010

040

1005

0

2178

021

790

2180

021

810

2182

021

830

2184

021

850

2186

021

870

1006

010

070

7470

1008

0

7460

1009

010

100

1011

010

120

1013

0

1242

0

1014

0

1243

0

1015

0

2188

021

890

2190

0

4760

0

2191

0

4759

0

2192

0

4758

0

2193

0

4757

0

2194

0

2194

0

4756

0

2195

021

960

4755

047

540

2197

0

4753

047

520

2019

0

4751

0

2018

020

170

2016

020

150

2014

020

130

2012

020

110

2198

0219

9022

000

4750

0

2201

0

4749

0

2202

0

4748

047

470

4746

047

450

4744

047

430

4742

047

410

1708

0

3647

036

480

3649

036

500

3653

036

540

3655

036

560

1385

013

840

6460

1383

0

6470

1382

013

810

6480

6490

4673

0

6500

4674

0

6510

4675

0

6520

4676

0

3657

0

6530

4677

0

3658

0

6540

4678

0

3659

0

4679

046

800

4681

0

1093

010

920

1091

010

900

1089

010

880

1087

010

860

1085

0

1862

018

610

1860

018

590

1858

018

570

1856

0

4135

041

340

4133

041

320

4131

041

300

4018

0

4129

041

280

4127

041

260

4125

0

2608

0

4124

0

2607

0

4123

0

2606

0

4122

0

2605

026

040

2603

026

020

2601

026

000

4715

047

140

4713

047

120

4711

047

090

4708

0

2599

025

980

2597

025

960

2595

025

940

2593

0

3571

035

700

3569

035

680

3567

035

660

3565

035

630

3562

035

610

3186

031

850

3184

031

830

3182

0

3564

0

3181

031

800

3179

031

780

3177

031

760

3175

0

7330

7340

3174

0

7350

3173

0

7360

7370

7380

7390

7400

7410

7420

1763

017

620

1761

0

7430

7440

7450

4080

040

790

4078

040

770

4076

040

750

4074

040

730

4201

042

000

4199

041

980

4197

041

960

3731

0

4194

0

3732

0

4193

0

3733

0

4192

0

3734

0

4191

0

3735

037

360

3737

037

380

3739

037

400

1233

012

340

1235

012

360

1237

012

380

1239

012

400

1241

0

4190

041

890

4188

041

870

4186

0

3741

0

4195

0

3742

0

1253

0

3743

0

1254

0

3744

0

1255

0

3745

037

460

3747

037

4803

7490

3750

0

3987

039

880

3989

039

900

3991

039

920

3993

0

3751

0

5100

1803

018

020

1801

018

000

1799

017

980

1797

017

960

1795

017

940

1497

014

980

1499

015

000

1501

015

020

1503

015

040

1505

0

3016

030

150

3014

030

130

3012

030

110

3010

030

090

3008

030

070

1839

018

380

1837

018

360

1835

018

340

1833

018

320

1830

0

2531

025

300

2529

025

280

2527

025

260

2525

0

3006

0

2524

0

3005

0

2523

025

220

2592

025

910

2590

0

1829

0

2589

0

1828

018

270

2588

0

1826

018

250

1824

0

7670

1823

018

220

7660

4336

0

7650

1821

0

4337

0

7640

1820

0

4338

043

390

4340

043

410

4342

043

430

2521

0

4185

0

4344

0

2520

0

4345

0

4184

0

2519

0

4183

0

4183

0

4182

041

810

7140

7150

7160

7170

7180

7190

7200

7210

7220

7220

7230

4346

043

470

1716

0

4348

0

1717

0

4349

0

1718

0

4350

0

1719

0

4351

0

1720

0

4352

0

1721

0

4353

0

1722

0

4354

0

1723

0

4355

0

1950

0

1951

019

520

1953

019

540

4023

0

1955

019

560

1957

019

580

1959

0

1431

0

4356

0

1430

014

290

1428

014

270

1426

014

250

1424

014

230

1960

019

610

1962

019

630

1964

019

650

1966

0

4016

0

4263

042

620

4261

042

600

4259

042

580

4257

042

560

4255

042

540

1146

011

450

1144

011

430

1142

0

1141

0

4253

0

1140

0

4252

0

1139

0

4251

0

3141

031

400

3139

031

380

3137

031

360

3135

031

340

3133

0

2205

022

040

2203

0

3807

038

060

3805

038

040

3802

038

010

3800

0

4142

041

430

4144

041

450

1326

0

4146

0

1325

0

4147

0

1324

0

4148

0

1323

013

220

1321

013

200

1319

013

180

1317

0

3550

3540

3530

3520

3510

3500

3480

1457

014

560

1455

0

3552

0

1454

014

530

3551

035

500

5110

3549

035

480

5120

5130

3547

035

460

5140

5150

3545

0

5160

3544

0

5170

3543

0

5180

5190

4219

042

200

4221

042

220

4223

042

240

4225

042

260

4227

042

280

3542

035

410

3540

035

390

3538

035

370

3536

035

350

3534

035

330

3994

039

950

3996

0

2626

0

3997

0

2625

0

3998

0

2624

0

3999

0

2623

0

4000

0

2622

0

4001

0

2621

0

4002

0

2620

0

4003

0

2619

026

180

2617

0

1103

011

040

1105

011

060

1107

0

3617

0

1108

011

380

1109

011

100

1137

011

360

1135

011

340

1133

011

320

1131

0

4004

0

1130

0

4005

0

1129

0

4006

040

070

4008

040

090

2410

024

110

2412

0

2412

024

130

2414

024

150

2416

024

170

2418

0

2345

023

460

1128

0

2347

023

480

1127

0

2349

023

500

2351

023

520

2353

023

540

2355

0

1484

0

2356

0

1483

0

2357

0

1482

0

2358

0

1481

0

2359

0

1480

0

2360

0

1479

0

2361

0

1478

0

2362

0

1477

0

1855

018

540

1853

018

520

1851

018

500

1849

018

480

1847

0

3149

0

2705

0

3150

0

2706

027

070

3151

0

2708

0

3152

031

530

2709

027

100

3154

0

3254

0

2711

0

3155

0

3255

0

3156

0

2712

0

3256

0

2713

0

3157

0

3257

032

580

3469

0

3259

032

600

3468

0

3261

0

3467

034

660

3262

032

630

2714

027

150

2716

027

170

4316

0

2695

0

4317

043

180

4319

043

200

4321

043

220

4323

043

240

4325

0

1648

016

470

5660

1646

0

5650

1645

0

5640

1644

016

430

1642

016

410

3238

032

370

3236

032

350

3234

032

330

3232

0

2643

0

3231

0

2644

0

3230

0

2645

026

460

3229

0

2647

0

4326

0

2648

026

490

2650

026

510

2652

0

2257

022

590

4657

046

580

2260

022

610

4659

046

600

2262

0

4661

0

2263

0

4662

0

2264

022

650

3228

032

270

3226

032

250

2653

026

540

2655

026

560

2657

026

580

1266

0

2659

026

600

1265

0

2861

0

2661

0

2860

0

2662

0

2859

028

580

3670

0

2169

0

3671

036

720

2171

0

3673

0

2172

0

3674

0

2173

0

3675

0

2174

0

3676

0

2175

0

3677

036

780

2177

0

3679

0

2663

0

7060

7070

7080

7090

7100

7110

7120

7130

2980

029

790

2419

0

2978

0

2420

0

2977

029

760

2421

0

3680

0

2975

0

2422

0

3681

0

2974

0

2423

0

3682

0

2973

0

2424

024

250

2972

029

710

2426

024

270

2428

0

2970

029

690

2968

029

670

2966

029

650

2964

029

630

2962

029

610

3523

035

210

3520

035

190

3518

035

170

3516

035

150

3514

035

130

3512

035

110

3510

035

090

3508

035

070

1525

0

4663

046

640

4665

046

660

4667

046

680

4669

046

700

4671

0

1546

0

3456

0

1545

0

3455

0

1544

0

3454

0

1543

015

420

3453

0

1541

0

3452

0

1540

0

3451

0

1539

0

3450

034

490

3448

034

470

2121

021

200

2119

021

180

2117

021

160

2115

0

3446

034

450

3444

034

430

3442

034

410

3440

034

390

3438

0

5630

5620

5610

5600

5590

5580

5570

5560

3599

035

980

3597

035

960

3595

035

940

3593

035

920

3591

035

900

3195

031

940

3193

031

920

3191

031

900

3189

031

880

3187

0

3187

0

1910

3589

035

880

3587

035

860

3585

035

840

3583

0

3582

0

3582

0

3581

035

800

3245

032

440

3243

032

420

3241

032

400

3239

0

2870

028

710

3579

0

2872

0

3578

0

3478

0

3605

0

3479

034

800

3481

034

820

3483

034

840

3485

034

860

2508

0

3637

036

360

3635

036

340

3633

036

320

3631

036

300

1194

011

950

1196

011

970

1198

011

990

6210

1200

0

6200

1201

0

6190

1202

0

6180

6170

3650

6160

6150

3670

3680

3690

3971

039

720

3710

3973

0

3720

3974

0

3730

3975

0

3740

3976

039

770

3978

039

790

2510

0

3750

3760

3126

031

270

3770

3128

0

3780

3129

0

3800

3130

031

310

1890

0

3132

0

1891

018

9201

8930

1894

018

950

1896

018

970

4452

044

530

1898

0

4454

0

1899

0

4455

044

560

4457

0

2511

0

4458

044

590

4460

0

2437

024

390

2440

024

410

2442

024

430

2444

024

450

2446

0

2512

0

2447

024

480

2449

024

500

2451

0

1391

013

920

1393

013

940

1395

0

4063

0

1396

0

4062

0

1397

0

4061

0

1398

0

4060

040

590

4058

040

570

4056

0

2513

0

9870

9880

9890

9900

9910

9920

9930

9940

9950

2616

026

150

2614

026

130

2612

026

110

2610

0

2213

022

120

2010

020

090

2210

0

2008

0

2209

0

2007

0

2208

0

2006

0

2207

022

060

2005

020

040

2003

020

020

2001

0

3532

035

310

3530

035

290

3528

035

270

3526

0

1820

1830

3525

0

1840

3524

0

1850

18601

870

1880

1890

1900

Unk

now

nU

ncla

ssifi

edO

ther

cat

egor

ies

Con

serv

ed h

ypot

hetic

alT

rans

port

and

bin

ding

pro

tein

sP

rote

in fa

te/p

rote

in s

ynth

esis

Sec

onda

ry m

etab

olis

mR

egul

ator

y fu

nctio

ns/s

igna

l tra

nsdu

ctio

nN

ucle

osid

es, a

nd n

ucle

otid

es

Tra

nscr

iptio

n

Env

ironm

enta

l res

pons

e/pa

thog

en r

espo

nse

Fat

ty a

cid

and

phos

phol

ipid

met

abol

ism

Ene

rgy

met

abol

ism

DN

A m

etab

olis

m

Gro

wth

and

dev

elop

men

tC

ellu

lar

stru

ctur

e, o

rgan

izat

ion

and

biog

enes

isC

ellu

lar

proc

esse

sC

ofac

tors

, pro

sthe

tic g

roup

s, a

nd c

arrie

rsA

min

o ac

id b

iosy

nthe

sis

Cen

tral

inte

rmed

iary

met

abol

ism