

    BioMed Central
    BMC Genomics

    Open Access
    Research article

    Cross genome comparisons of serine proteases in Arabidopsis and rice

    Lokesh P Tripathi and R Sowdhamini*

    Address: National Centre for Biological Sciences, Tata Institute of Fundamental Research, GKVK Campus, Bellary Road, Bangalore 560 065, India

    Email: Lokesh P Tripathi - [email protected]; R Sowdhamini* - [email protected]

    * Corresponding author

    Abstract

    Background: Serine proteases are one of the largest groups of proteolytic enzymes found across all kingdoms of life and are associated with several essential physiological pathways. The availability of Arabidopsis thaliana and rice (Oryza sativa) genome sequences has permitted the identification and comparison of the repertoire of serine protease-like proteins in the two plant species.

    Results: Despite the differences in genome sizes between Arabidopsis and rice, we identified a very similar number of serine protease-like proteins in the two plant species (206 and 222, respectively). Nearly 40% of the above sequences were identified as potential orthologues. Atypical members could be identified in the plant genomes for Deg, Clp, Lon, rhomboid proteases and species-specific members were observed for the highly populated subtilisin and serine carboxypeptidase families suggesting multiple lateral gene transfers. DegP proteases, prolyl oligopeptidases, Clp proteases and rhomboids share a significantly higher percentage orthology between the two genomes indicating substantial evolutionary divergence was set prior to speciation. Single domain architectures and paralogues for several putative subtilisins, serine carboxypeptidases and rhomboids suggest they may have been recruited for additional roles in secondary metabolism with spatial and temporal regulation. The analysis reveals some domain architectures unique to either or both of the plant species and some inactive proteases, like in rhomboids and Clp proteases, which could be involved in chaperone function.

    Conclusion: The systematic analysis of the serine protease-like proteins in the two plant species has provided some insight into the possible functional associations of previously uncharacterised serine protease-like proteins. Further investigation of these aspects may prove beneficial in our understanding of similar processes in commercially significant crop plant species.

    Published: 09 August 2006

    BMC Genomics 2006, 7:200 doi:10.1186/1471-2164-7-200

    Received: 11 April 2006
    Accepted: 09 August 2006

    This article is available from: http://www.biomedcentral.com/1471-2164/7/200

    © 2006 Tripathi and Sowdhamini; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    Background

    The proper functioning of a cell is ensured by the precise regulation of protein levels that in turn are regulated by a balance between the rates of protein synthesis and degradation. Protein degradation is mediated by proteolysis, which paves the way for recycling of amino acids into the cellular pool. In addition to degradation, regulated proteolysis plays a major role in diverse cellular processes by regulating biochemical activities of proteins in concert with other cellular mechanisms such as post-translational modification[1,2]. Serine proteases are one of the largest groups of proteolytic enzymes involved in numerous regulatory processes. They catalyse the hydrolysis of specific peptide bonds in their substrates and this activity depends on a set of amino acids in the active site of the enzyme, one of which is always a serine. They include both exopeptidases that act on the termini of polypeptide chains and endopeptidases that act in the interior of polypeptide chains and belong to many different protein families that are grouped into clans[3,4]. A clan is defined as a group of families, the members of which are likely to have a common ancestor. Currently there are over 50 serine-protease families known as classified by MEROPS database[5], an information resource for peptidases and their inhibitors.

    Serine proteases appear to be the largest class of proteases in plants[2]. Consistent with their regulatory role, plant serine proteases have been shown to be involved in diverse processes regulating plant development and defense responses[2,6-8]. In addition, some plant serine carboxypeptidases (Family S10) function as acyltransferases rather than hydrolases, suggesting diversification of function to regulate secondary metabolism in plants [9-11]. However, the function and regulation of plant proteases are poorly understood, primarily due to lack of identification of their physiological substrates. While there is extensive literature available on plant serine proteases, particularly in Arabidopsis thaliana, most of these studies are restricted either to specific families [11-13] or to specific physiological aspects (please see the references above). Arabidopsis thaliana and Oryza sativa constitute model systems for Dicotyledons and Monocotyledons respectively, the major evolutionary lineages in angiosperms[14,15]. The availability of the complete genomic sequence of these two species renders it possible to carry out a comprehensive analysis of serine-protease families to gain insights into their function and to ascertain their evolutionary relationships.

    The current analysis aims at the identification and analysis of serine protease families encoded in the genomes of the two plant species. The domain organization of the putative serine proteases in the two plant genomes has been analysed employing several sensitive sequence search methods [16-18] that have provided insights into different domain architectures. To understand the evolutionary relationships among the Arabidopsis and rice serine protease-like proteins, phylogenetic analysis was performed based on their protease domains. Cross genome comparison of the serine proteases across the two genomes has led to the identification of members likely to be conserved across the two species. To obtain further insight into their possible functional associations, the sub-cellular localizations of serine protease-like proteins were predicted using TargetP[19]. Comparison of plant serine proteases with other major eukaryotic lineages reveals selective expansion of serine protease families in plants and identification of plant specific serine proteases.

    This paper reports the first bioinformatics genome-wide survey of plant serine proteases and cross-comparisons of several serine protease families across monocots and dicots. Species-specific proteases and existence of evolutionary changes like lateral gene transfer and gene shuffling were observed at the catalytic domain. Further experimental characterization of Arabidopsis and rice serine proteases that are likely to be involved in basic functions associated with various physiological processes in plants would be useful in expanding the current understanding of these processes in regulating plant growth and development. This would prove beneficial in understanding protease processes in plants and application of this knowledge to important crop species.

    Results and discussion

    Despite the differences in genome sizes between Arabidopsis and rice (125 Mb and 389 Mb respectively) and encoded number of genes (25,498 and 37,544 non-transposable-element-related-protein-coding[14,15] and 55,890 in total as in TIGR[20] rice database), the two plant species appear to encode a very similar number of serine protease-like proteins (206 and 222 putative numbers, respectively; Table 1; see additional files 1, 2: Table S1.pdf, Table S2.pdf). These include several putative serine protease homologues that are apparently catalytically inactive, since they either lack the amino acid residues that are believed to be essential for catalysis or carry amino acid substitutions at positions corresponding to catalytically active sites. Such inactive-enzyme homologues are not restricted to serine proteases, but have been identified in diverse enzyme families, chiefly among proteins involved in signaling pathways and proteins that are apparently extracellular in function and are conserved across metazoan species. They are believed to have acquired newer and more distinct functions in the course of evolution, thereby adding to the complexity of various regulatory networks in the cell[21]. We discuss below the occurrence and phylogenetic patterns of serine protease domains in the plant genomes, grouped according to the MEROPS classification[5]. Only serine protease families with non-zero occurrence of putative members in the two genomes are discussed.
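The comparison above can be made concrete with a little arithmetic. The following is a minimal sketch that only reuses the genome sizes, gene counts and serine protease counts quoted in the text; the dictionary layout and function names are illustrative assumptions, not part of the original analysis.

```python
# Figures quoted in the text: genome size (Mb), non-transposable-element
# protein-coding genes, and putative serine protease-like proteins.
GENOMES = {
    "Arabidopsis": {"size_mb": 125, "genes": 25498, "ser_proteases": 206},
    "rice": {"size_mb": 389, "genes": 37544, "ser_proteases": 222},
}


def protease_share(entry):
    """Percentage of protein-coding genes that are serine protease-like."""
    return 100.0 * entry["ser_proteases"] / entry["genes"]


def proteases_per_mb(entry):
    """Serine protease-like genes per megabase of genome."""
    return entry["ser_proteases"] / entry["size_mb"]


for name, entry in GENOMES.items():
    print(f"{name}: {protease_share(entry):.2f}% of genes, "
          f"{proteases_per_mb(entry):.2f} per Mb")
```

On these numbers the absolute counts are close, while the share of the gene complement and the density per megabase are both lower in rice, which is the sense in which the repertoire does not scale with genome size.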

    DegP proteases (Family S1)

    DegP proteases are a group of ATP-independent serine proteases that belong to MEROPS[5] family S1 (trypsin family). E. coli DegP is a peripheral membrane protein located on the periplasmic side of the plasma membrane. It is a heat-shock protein that combines both chaperone and proteolytic activities that switch in a temperature-dependent manner. The chaperone activity is predominant at low temperatures, while the proteolytic activity takes over at higher temperatures[22]. The crystal structure of E. coli DegP reveals that the proteolytic site is inaccessible at low temperatures and thus the enzyme exists in the chaperone conformation (see additional file 3: Table S3.pdf)[22]. PDZ domains are known to modulate the function and/or localization of their associated proteins and it has been shown that PDZ domains are involved in substrate recognition and binding in certain proteases [23-26].

    DegP-like proteins have been identified in bacteria and eukaryotes but appear to be absent in archaea, and are believed to have spread to eukaryotic lineages via horizontal gene transfer events [27-29]. The exact functions of DegP-like proteins in plants remain unknown. A plant DegP homolog, DegP1, was first identified in Arabidopsis and was found to be strongly associated with the lumenal side of the thylakoid membrane. It was found to be rapidly upregulated in response to elevated temperatures and DegP homologue associated membrane fraction was found to have serine-type proteolytic activity, suggesting a possible role in chloroplast heat response[30,31]. More recently, a chloroplast DegP was shown to function in the initial cleavage of D1 protein of PSII (Photosystem II) subsequent to photoinhibition[32].

    Sixteen DegP-like proteins could be identified in the Arabidopsis proteome, higher than an earlier estimate of 14[6]. Likewise, 15 DegP-like proteins were identified in rice. A majority of these are predicted to localise to either chloroplast or mitochondria (see additional file 2: Table S2.pdf). Of these, four gene products (At3g03380, At3g27925, At5g27660, At5g39830) from Arabidopsis and four (LOC_Os02g48180, LOC_Os04g38640, LOC_Os05g49380, LOC_Os11g14170) from rice are predicted to contain PDZ-like domains C-terminal to the protease domain, while the rest appear to be proteins containing a single protease domain. The absence of PDZ-like domains in the majority of plant DegP-like proteins has been suggested to have as yet unclear functional implications for these proteins[6]. We identified seven orthologous pairs of DegP-like proteins across the two genomes (see additional file 4: Table S4.pdf). Arabidopsis DegP1 (At3g27925) was found to be chloroplast localised and was rapidly upregulated in response to elevated temperatures[30]. We identified an At3g27925 orthologue in the rice proteome (LOC_Os05g49380) nearly identical in size that shares a high sequence similarity with At3g27925 and may function in a similar manner. The significant number of predicted DegP-like proteins in the two plant species suggests the presence of a conserved repertoire of proteins that are likely to be upregulated in response to elevated temperatures in thylakoid lumen and mitochondrial matrix in plants. Some putative gene products in both the genomes are likely to be proteolytically inactive due to either mutations or deletions in the active site residues (see additional file 5: Figure SF1.pdf). It remains unclear whether these proteolytically inactive members retain their chaperone activity at low temperatures; they may modulate the function of active gene products by regulating substrate association and complex formation.
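The locus identifiers quoted above follow two naming schemes, and a mistyped identifier is easy to miss by eye. The following is a small sketch of a format check. The regular expressions are assumptions: the Arabidopsis pattern follows the usual AGI locus-code form (At, chromosome 1-5 or C/M, g, five digits), and the rice pattern follows the LOC_OsXXg##### form described in the Figure 1 legend (chromosome 01-12, five-digit number). The function names are hypothetical.

```python
import re

# Assumed identifier formats (see lead-in); not taken from the paper's methods.
AGI_PATTERN = re.compile(r"^At[1-5CM]g\d{5}$")
RICE_PATTERN = re.compile(r"^LOC_Os(0[1-9]|1[0-2])g\d{5}$")


def classify_locus(locus):
    """Return 'Arabidopsis', 'rice', or None if the identifier is malformed."""
    if AGI_PATTERN.match(locus):
        return "Arabidopsis"
    if RICE_PATTERN.match(locus):
        return "rice"
    return None


def short_rice_name(locus):
    """Shorten LOC_OsXXg##### to OsXXg#####, as done in the Figure 1 legend."""
    if not RICE_PATTERN.match(locus):
        raise ValueError(f"not a rice locus identifier: {locus!r}")
    return locus[len("LOC_"):]


for locus in ["At3g27925", "LOC_Os05g49380", "LOC_0s11g14170"]:
    print(locus, "->", classify_locus(locus))
```

The third example uses a digit zero in place of the letter O and is therefore reported as malformed, which is the kind of slip such a check is meant to catch.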

    Phylogenetic analysis of the protease domains of Arabidopsis and rice DegP-like proteins showed that a majority of sequences cluster into three major clades, supported by high bootstrap values, with additional members less related to the three clusters (see additional file 5: Figure SF1.pdf). Clade I, the largest of three clusters, consists of 14 DegP-like proteins that are more varied in sequence and share between 35 and 95% pairwise amino acid sequence identity. Clade II consists of six closely related sequences that share between 50 and 93% pairwise sequence identity with each other. Clade II includes two sequence pairs, At3g27925:LOC_Os05g49380 and At5g39830:LOC_Os04g38640, that carry a PDZ domain each C-terminal to the protease domain and share 93% and 87% pairwise sequence identity respectively. Clade III, the smallest of three clades, consists of five sequences that share between 51 and 94% pairwise sequence identity with each other. In addition, there are two pairs of PDZ domain containing proteins, At5g27660:LOC_Os11g14170 and At3g03380:LOC_Os02g48180, that share 58 and 50% pairwise sequence identity respectively.

    Table 1: Serine proteases identified in the two plant genomes, by sensitive sequence search methods

    Clan    Family                                    Pfam[37] accession    Arabidopsis    Rice
    PA(S)   S1 (Chymotrypsin)                         PF00089               16             15
    SB      S8 (Subtilisin)                           PF00082               56             63
    SC      S9 (Prolyl oligopeptidase)                PF00326               23             23
            S10 (Carboxypeptidase)                    PF00450               54             66
            S28 (Lys. Pro-x Carboxypeptidase)         PF05577               7              5
    SE      S12 (D-Ala-D-Ala carboxypeptidase B)      PF00144               1              1
    SF      S16 (Lon protease)                        PF05362               4              4
            S26 (Signal peptidase I)                  PF00717               9              7
    SK      S14 (Clp Endopeptidase)                   PF00574               9              13
    SM      S41 (C-terminal processing peptidase)     PF03572               3              3
            S49 (Protease IV)                         PF01343               1              1
            S54 (Rhomboid)                            PF01694               20             18
    SP      S59 (Nucleoporin autopeptidase)           PF04096               3              3
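As a quick consistency check, the per-family counts in Table 1 can be summed and compared with the totals quoted in the text (206 for Arabidopsis, 222 for rice). The sketch below simply transcribes the table; the variable and function names are illustrative.

```python
# (family, Arabidopsis count, rice count), transcribed from Table 1.
TABLE_1 = [
    ("S1", 16, 15),
    ("S8", 56, 63),
    ("S9", 23, 23),
    ("S10", 54, 66),
    ("S28", 7, 5),
    ("S12", 1, 1),
    ("S16", 4, 4),
    ("S26", 9, 7),
    ("S14", 9, 13),
    ("S41", 3, 3),
    ("S49", 1, 1),
    ("S54", 20, 18),
    ("S59", 3, 3),
]


def totals(rows):
    """Return (Arabidopsis total, rice total) over all families."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


def largest_families(rows, n=2):
    """Families ranked by combined count in the two genomes."""
    ranked = sorted(rows, key=lambda r: r[1] + r[2], reverse=True)
    return [r[0] for r in ranked[:n]]


print(totals(TABLE_1))
print(largest_families(TABLE_1))
```

The two most populated families by this tally are the subtilisins (S8) and the serine carboxypeptidases (S10), in line with the description of these families as highly populated.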

    Subtilisins (Family S8)

    The subtilisin family is the second largest family of serine proteases identified to date, with over 200 known members across lineages in eubacteria, archaebacteria, eukaryotes and viruses. Crystal structures reveal that subtilisins utilize a highly conserved catalytic triad similar to the members of chymotrypsin and carboxypeptidase clans but have a different order of Asp, His and Ser residues in the sequence (D137, H168, S325), with no other structural similarity[33], posing a case of convergent evolution. MEROPS database[5] subdivides the subtilisin family into two subfamilies: S8A (subtilisin subfamily) consisting of true subtilisins and S8B (kexin subfamily) consisting of proprotein-processing enzymes. An interesting feature of the subtilisin family is that some members appear to be mosaic, with little or no sequence similarity to any other known proteins[33] and with large N- and C-terminal extensions.

    A large number of subtilisin-like serine proteases have been identified in various plant species, where they have been implicated in diverse processes (see additional file 3: Table S3.pdf) and all appear to correspond to S8A subfamily, while kexin-type proteins appear to be absent from plants[2,12,34,35]. It has been reported that both Arabidopsis and rice are characterised by a significantly larger number of subtilases in comparison to other organisms such as human, Caenorhabditis and Drosophila, suggesting that expansion of the subtilase gene family in the two plant species is due to multiple duplication events[35].

    We have identified 56 subtilisin-like proteins in the Arabidopsis proteome, similar to the figures reported elsewhere[12,35], and 63 in rice. A majority of these gene products, as expected, are predicted to be secretory pathway proteins, though a few are predicted to localise to chloroplasts and mitochondria (see additional files 1, 2: Table S1.pdf, Table S2.pdf). The analysis of the domain architectures of Arabidopsis and rice subtilisin-like proteins reveals that most of these gene products consist of three domains: an N-terminal Subtilisin_N domain that is removed prior to activation of the enzyme, the Peptidase_S8 domain and the protease associated (PA) domain, which occurs as a C-terminal insertion in the Peptidase_S8 domain of a majority of subtilisin-like proteins identified in these two plant species (see additional files 1, 2: Table S1.pdf, Table S2.pdf). The PA domain is believed to be involved in protein-protein interactions or to mediate substrate recognition by peptidases and is believed to have been associated with ancestral subtilases[35,36]. In addition, most subtilisin-like proteins identified in the two plant species carry short insertions ranging from 15–45 amino acid residues in length in the N-terminal region of their predicted protease domains. These insertions do not share any obvious sequence similarities with any other known module and are likely to be unique in their association with Arabidopsis and rice subtilisin-like proteins (data not shown). However, the presence of gene products in both the proteomes with only the protease domain suggests that this domain alone is sufficient for the activity of these enzymes (see additional files 1, 2: Table S1.pdf, Table S2.pdf). The Arabidopsis proteome contains two gene products, At2g19170 and At4g30020, that appear to possess an additional module, classified as DUF1034 (Pfam[37] accession: PF06280), C-terminal to the PA domain, while a single similar gene product, LOC_Os06g48650, was identified in the rice proteome, suggesting a duplication of this gene in Arabidopsis subsequent to dicot-monocot divergence (Figure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). The lack of diverse domain architecture in plant subtilisin-like proteins suggests that specific expansion of the subtilisin gene family in the two plant species has been facilitated by extensive gene duplication. This is further supported by the observation that only 16 orthologous pairs were identified across the two species, suggesting that the majority of subtilisin-like proteins in the two species arose after dicot-monocot divergence, possibly due to gene duplication (see additional file 4: Table S4.pdf). It is likely that spatial and temporal regulation of gene expression patterns may play a major role in regulating the activity of subtilisin-like proteins in plants. The presence of a large number of subtilisin-like proteins in the two plant species suggests possible functional redundancy, while on the other hand it may be indicative of functional diversification. A significant number of plant subtilisins may have been recruited for functions in plant secondary metabolism, in a manner similar to some serine carboxypeptidase-like proteins. The subtilisin-like proteins identified here in Arabidopsis and rice as well as those reported from other plant species show maximum similarity to the MEROPS[5] S8A subfamily of subtilisins. A few gene products are likely to be proteolytically inactive due to mutations in the catalytic residues and are most likely to be evolutionary intermediates (see additional file 6: Figure SF2.pdf).

    Phylogenetic analysis of Arabidopsis and rice subtilisin-like proteins was carried out with the S8 protease domainminus the insertions. The analysis reveals the presence offive major gene clusters of which four consist of subtilisin-like proteins from both Arabidopsis and rice includingorthologous pairs (Figure 1). Clade I consists of 13

    Page 4 of 31(page number not for citation purposes)

  • BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

    Page 5 of 31(page number not for citation purposes)

    Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) subtilisin domainsFigure 1Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) subtilisin domains. Subtili-sin-like protease domains were aligned using ClustalW [95] program and the alignments were exported to Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent different evolutionary clades iden-tified in the analysis (see text for details). Clade I is represented in purple, Clade II is shaded orange, Clade III in green, Clade IV in brown and Clade V in yellow. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterix, circles represent bootstrap values from 60%–80% while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. A few species specific gene clusters were also identified in the analysis (see text for details).

    At5g11940 At1g32970At4g10530

    At4g10520

    At1g32940

    At1g32960

    At1g32950

    At4g10550

    At4g10540

    At4g10510

    At1g66220

    At1g66210

    At4g21650

    At4g21630

    At4g21640

    At4g21323 At4g21326

    Os11g15520

    Os04g02960

    Os04g03100

    Os04g03060

    Os04g03710

    Os04g02980

    Os02g17080

    Os02g17060

    Os02g17000

    Os02g16940

    Os02g17150

    Os02g17090

    Os04g03800

    Os04g03850Os04g03810

    Os01g58270

    Os01g58240

    Os01g58260

    Os01g58290

    Os01g58280

    Os09g36110

    At4g26330Os03g06290

    Os06g41880

    At2g39850

    At5g59130

    At5g59100

    At3g46840

    At3g46850

    At5g58830

    At5g58820

    At5g58840

    At5g59120

    At5g59090

    At4g15040

    At5g59190

    At4g00230

    At5g03620Os01g17160

    At1g20150At1g20160

    Os08g23740

    Os01g50680

    Os06g40700

    Os02g10520

    At2g04160

    At5g59810

    Os09g30250

    Os01g52750

    At5g45640

    At5g45650

    At1g62340

    Os04g45960

    Os06g48650

    At4g30020

    At2g19170

    Os01g56320

    At1g30600

    At4g20430

    At5g44530

    Os04g47160

    Os12g23980

    Os10g38080Os03g02750

    Os04g10360

    Os04g47150 Os02g44590

    Os05g30580

    Os07g48650Os03g31630

    Os09g26920

    Os10g25450

    At2g05920

    Os03g55350

    Os03g40830

    At5g67360

    Os04g48420

    At5g51750 Os08g35090

    At3g14240

    At4g34980

    Os03g13930

    Os02g53850

    Os02g53970

    Os02g53910At3g14067

    Os02g53860

    At1g01900

    Os07g39020At1g04110

    Os03g04950

    At5g67090Os04g35140

    Os01g64850

    Os05g36010

    Os01g64860

    Os06g06800At5g19660

    At4g20850

    Os02g44520

    *

    ***

    *

    *

    *

    *

    *

    Clade II

    Clade III

    Clade IV

    Clade V

    Clade I

    IVC

    IVB IVA

  • BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

    Page 6 of 31(page number not for citation purposes)

    Domain Architectures identified in Arabidopsis and rice serine Protease-like proteinsFigure 2Domain Architectures identified in Arabidopsis and rice serine Protease-like proteins. At -Arabidopsis thaliana; Os – Rice (Oryza sativa); B- Bacteria; A- Archaea; E- Eukaryota; Sxx- Serine protease family Sxx domain, where Sxx refers to the serine protease family as per MEROPS [5] classification (see text for details). PDZ- PDZ domain (Pfam [37] accession: PF00595); PA- Protease associated domain (Pfam [37] accession: PF02225); SUB N- Subtilisin N-terminal region (Pfam [37] accession: PF005922); DUF1034- Domain of unknown function (Pfam [37] accession: PF06280); Arf- ADP-ribosylation factor family (Pfam [37] acces-sion: PF00025); C2- C2 domain (Pfam [37] accession: PF00168); zf-CCHC- Zinc knuckle (Pfam [37] accession: PF00098); rve- Integrase core domain (Pfam accession:PF00665); Extensin 2- Extensin-like region (Pfam [37] accession: PF04554); S9 N- Prolyl oligopeptidase, N-terminal beta-propeller domain (Pfam [37] accession: PF02897); PD40- WD40-like beta propeller repeat (Pfam [37] accession: PF07676); DPPIV N- Dipeptidyl peptidase (DPP IV) N-terminal region (Pfam [37] accession: PF00930); Transposase 21- Transposase family tnp2 (Pfam [37] accession: PF02992); Retrotrans gag- Retrotransposon gag protein (Pfam [37] accession: PF03732); ABC1- ABC1 family (Pfam [37] accession: PF03109); LON- ATP-dependent protease La (LON) domain (Pfam [37] accession: PF02190); AAA- ATPase family associated with various cellular activities (Pfam [37] accession: PF00004); UBA- UBA/TN-S (ubiquitin associated) domain (Pfam [37] accession: PF000627); zf-RanBP- Zinc finger in Ran bind-ing protein and others (Pfam [37] accession: PF00641).

    Serine Proteasefamily

    Domain Architecture Abundance inArabidopsisandRice

    Occurrence inother lineages

    S1 (Trypsinfamily)

    12/16 ; 11/15

    4/16 ; 4/15

    B , A , E

    B , E

    S8 (Subtilisinfamily)

    3/56 ; 3/63

    1/56 ; 7/63

    3/56 ; -

    46/56 ; 48/63

    2/56 ; 1/63

    - ; 1/63

    - ; 1/63

    - ; 1/63

    - ; 1/63

    B , A , E

    B , E

    B , E

    B , E

    B , E

    E (Os)

    E (Os)

    E (Os)

    E (Os)

    S9 (Prolyloligopeptidasefamily)

    18/23 ; 18/23

    4/23 ; 3/23

    - ; 1/23

    1/23 ; 1/23

    B , A , E

    B , E

    B , A , E (Os)

    B , E

    S10 (SerineCarboxypeptidase) 54/54 ; 64/66

    - ; 1/66

    - ; 1/66

    B , A , E

    E (Os)

    E (Os)

    S12 (Serine BetaLactamase)

    1/1 ; 1/1 E (At & Os)

    S1

    PDZS1

    S8

    S8 PA

    SUB N S8

    SUB N S8 PA

    SUB N S8 PA DUF1034

    SUB N S8 PA Arf C2

    S8SUB N PA zf-CCHC

    S8 zf-CCHC rve

    S8 zf-CCHC rveExtensin 2

    S9

    S9S9 N

    S9PD40

    DPPIV N S9

    S10

    S10Transposase 21

    S10 Retrotrans gag

    S12ABC1

    S14 (Clp protease) 9/9 ; 13/13 B , A , E

    S16 (LonProtease)

    4/4 ; 3/4

    - ; 1/4

    B , A , E

    B , A , E

    S26 (SignalPeptidase)

    9/9 ; 7/7 B , A , E

    S28 (LysosomalPro-XCarboxypeptidase)

    7/7 ; 5/5 B , E

    S41 (C-terminalprocessingpeptidase)

    3/3 ; 3/3 B , A , E

    S49 (Protease IV) 1/1 ; 1/1 B , E

    S54 (Rhomboid)17/21 ; 16/18

    3/21 ; 2/18

    1/21 ; -

    B , A , E

    E (At & Os)

    E (At)

    S59 (NuceloporinAutopeptidase) 3/3 ; 3/3 E

    S16LON AAA

    AAA S16

    S14

    S26

    S28

    S41PDZ

    S49 S49

    S54

    S54 UBA

    S54 zf-RanBP

    S59

  • BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

    sequences that share between 44 and 76% pairwise sequence identity with each other. Clade II, the smallest of the five clusters, consists of six sequences from Arabidopsis and three from rice that share between 48 and 84% sequence identity with each other. Clade III consists of 31 sequences from both plant species that share between 40 and 80% sequence identity with each other. The members of Clade III may be grouped into two sub-clusters. Clade IV appears to be the largest and most diverse cluster of Arabidopsis and rice subtilisin-like proteins. It includes 40 sequences that share between 35 and 92% pairwise percentage identity with each other. On the basis of pairwise percentage identities between the constituent sequences, Clade IV may be further subdivided into three subclusters: Clade IVA, IVB and IVC. Clade IVA consists exclusively of Arabidopsis sequences that share between 47 and 91% sequence identity with each other and between 35 and 60% sequence identity with the rest of the Clade IV sequences. In contrast, Clade IVB and Clade IVC consist exclusively of rice subtilisin-like proteins. Clade IVB, the smallest of the three subclusters, consists of sequences that share between 45 and 81% sequence identity with each other, while sharing between 35 and 60% sequence identity with the remaining members of Clade IV. Clade IVC sequences share between 54 and 92% sequence identity with each other and between 44 and 61% sequence identity with the rest of the constituents of Clade IV. Clade V is an exclusive cluster of Arabidopsis subtilisin-like proteins that are between 41 and 85% identical with each other and most likely represent an Arabidopsis-specific subfamily of subtilisin-like proteins. Our observations appear consistent with a recent report that identifies four major clusters of orthologous groups and a fifth cluster of Arabidopsis sequences within the phylogenetic tree constructed with Arabidopsis and rice subtilisin-like proteins[35]. We identified a few gene sub-clusters that consist exclusively of Arabidopsis or rice subtilisin-like gene products, indicating the existence of Arabidopsis- or rice-specific subtilisin-like genes and, by extension, putative dicot- and monocot-specific subtilisin-like genes (Figure 1).
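The identity ranges quoted for each clade come from pairwise comparisons of aligned sequences. The following is a minimal sketch of such a calculation, assuming that identity is counted over alignment columns in which neither sequence carries a gap. The text does not state which convention was used, and the function names and toy sequences here are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_range(aligned: dict) -> tuple:
    """Lowest and highest pairwise identity among all sequences in a cluster."""
    values = [
        percent_identity(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    ]
    return min(values), max(values)


# Hypothetical aligned fragments, for illustration only.
toy = {
    "seqA": "GMAIFDTM-RH",
    "seqB": "GMAIFDTMKHI",
    "seqC": "GLAIYDTMQYI",
}

low, high = identity_range(toy)
print(f"cluster members share between {low:.0f} and {high:.0f}% identity")
```

Other denominators (alignment length, or the length of the shorter sequence) are also in common use and would shift the reported ranges slightly.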

    Prolyl oligopeptidases (Family S9)
    The members of the prolyl oligopeptidase (POP) family of serine peptidases adopt the α/β hydrolase fold and are grouped into the SC clan of serine peptidases. The family includes members of different types and with distinct specificities. For example, prolyl oligopeptidase is an endopeptidase localised to the cytosol, while dipeptidyl peptidase IV (DPPIV) is a membrane-bound exopeptidase. According to the MEROPS database[5], POPs may be classified into four subfamilies, with prolyl oligopeptidase (S9A), dipeptidyl-peptidase IV (S9B), aminoacyl-peptidase (S9C) and glutamyl endopeptidase (S9D) being the typical examples of their respective subfamilies.

    POPs have been implicated in the degradation of biologically important peptides such as peptide hormones and neuropeptides associated with learning and memory, and have therefore become significant targets for drug design[38]. They have been identified in organisms from all forms of life, though members of this family display differences in mutation rates and appear to have undergone several changes in their localization over the course of evolution[39,40].

    We identified 23 genes encoding prolyl oligopeptidase-like proteins in Arabidopsis and 23 in rice. A majority of these either lack a predicted signal sequence or are predicted to localise to chloroplasts and mitochondria. Only a few of the gene products identified here are predicted to carry a β-propeller N-terminal to the protease domain (See additional files 1, 2: Table S1.pdf, Table S2.pdf). The absence of the N-terminal domain in a majority of plant POP-like proteins is intriguing since it is likely to have a functional implication for these proteins. Following a careful inspection of the amino acid residues in the vicinity of the active site serine residue, we were able to unambiguously assign a few POP-like proteins in the two plant species to different subfamilies of POP-like proteins as classified by MEROPS[5]. These include At1g76140, LOC_Os01g01830 (GGSNGGLL; S9A); At5g24260 (GWSYGGY; S9B); At2g47390, LOC_Os03g19410, LOC_Os07g48970 (GGHSYGAFMT; S9D) (See additional file 7: Figure SF3.pdf). We identified 14 orthologous pairs of POP-like proteins across the two plant species, suggesting a high degree of conservation in the function of POP-like proteins across the two species (See additional file 4: Table S4.pdf). It also suggests that a majority of these enzymes were present prior to the dicot-monocot divergence, consistent with the ancient evolutionary origin of these proteins. Phylogenetic analysis, however, reveals the presence of several gene clusters of Arabidopsis and rice POP-like proteins that share low pairwise sequence identity with each other and significant bootstrap values (Figure 3). The largest of these clusters, Clade I, consists of 12 POP-like gene products from the two plant species that are closely related; the pairwise percent sequence identity between any two sequences within the cluster ranges from 60 to 96%. The second major cluster, Clade II, is more varied in composition and consists of 10 sequences that may further be grouped into two subclusters, Clade IIA and Clade IIB, based on the pairwise percent sequence identity between any two sequences falling into the cluster. Clade IIA includes four sequences (At1g20380, At1g76140, LOC_Os01g01830 and LOC_Os04g47360) that are between 72 and 89% identical with each other. Clade IIB includes six sequences that share between 32 and 85% pairwise sequence identity with each other. The members of the two subclusters share between 22 and 32% pairwise sequence identity with each other. Clade IIA includes two


    Figure 3. Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) prolyl oligopeptidase domains. Prolyl oligopeptidase-like domains were aligned using the ClustalW [95] program and the alignments were exported to the Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent the two evolutionary clades identified in the analysis (see text for details). Clade I is represented in orange, Clade II is shaded green. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterisk, circles represent bootstrap values from 60%–80%, while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and ##### to a 5 digit number assigned to each gene. Subfamily assignments where possible are indicated in parentheses below the gene name (see Figure SF3 and text for details).

    [Tree image not reproduced. Labels recovered from the figure: Clade I; Clade II with subclusters IIA and IIB; subfamily annotations (S9A), (S9B) and (S9D).]


    Table 2: A list of the most recent serine protease gene duplications in the Arabidopsis thaliana genome. The most recent gene duplicates identified in the segmentally duplicated regions of the Arabidopsis genome are suffixed with (S). See text for details.

    S. No.  Arabidopsis serine protease-like protein  Biological function if known  Most recent duplicate

    Family S1 (Deg protease family)
    1.   At1g51150          At5g54745
    2.   At1g65640 (S)      At5g36950 (S)
    3.   At2g47940          At5g40200
    4.   At3g16540          At3g16550
    5.   At3g16550          At3g16540
    6.   At3g27925          At5g39830

    Family S8 (Subtilisin family)
    1.   At1g20150          At1g20160
    2.   At1g32940          At1g32960
    3.   At1g66210          At1g66220
    4.   At2g04160          At5g59810
    5.   At2g19170 (S)      At4g30020 (S)
    6.   At3g14240          At4g34980
    7.   At3g46840          At3g46850
    8.   At4g00230          At5g03620
    9.   At4g10520          At4g10530
    10.  At4g10540          At4g10550
    11.  At4g20430 (S)      At5g44530 (S)
    12.  At4g21630          At4g21640
    13.  At5g45640          At5g45650
    14.  At5g51750          At5g67360
    15.  At5g58820          At5g58830
    16.  At5g59090          At5g59120

    Family S9 (Prolyl oligopeptidase family)
    1.   At1g20380 (S)      At1g76140 (S)
    2.   At1g52700          At3g15650
    3.   At1g69020          At5g66960
    4.   At3g02410 (S)      At5g15860 (S)
    5.   At3g23540 (S)      At4g14290 (S)
    6.   At4g14570          At5g36210

    Family S10 (Serine carboxypeptidase family)
    1.   At1g11080 (S)      At1g61130 (S)
    2.   At1g28110          At2g33530
    3.   At1g61130          At1g11080
    4.   At1g73300          At5g36180
    5.   At2g22920          At2g22970
    6.   At2g24000 (S)      At4g30610 (S)
    7.   At2g35780          At3g07990
    8.   At3g12230          At3g12240
    9.   At3g25420 (S)      At4g12910 (S)
    10.  At3g45010          At5g22980
    11.  At3g52000          At3g52010
    12.  At3g52020          At3g63470
    13.  At5g42230          At5g42240

    Family S14 (Clp protease family)
    1.   At1g09130          At1g49970
    2.   At1g66670          At5g45390

    Family S16 (Lon protease family)
    1.   At3g05790          At5g26860

    Family S26 (Signal Peptidase family)
    1.   At1g06870          At2g30440


    sequences (At1g76140, LOC_Os01g01830) identified as members of the POP (S9A) subfamily of prolyl oligopeptidase-like proteins (see above), suggesting that the other members of this subcluster may also belong to the S9A subfamily (Figure 3). No species-specific clusters were identified in this family.
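The subfamily assignments above rest on the residues flanking the active site serine. A minimal sketch of that idea follows, using the three motifs quoted in the text. Exact substring matching is a simplifying assumption made here for illustration (the assignments in the text came from inspection of the alignment), and the function name and toy sequences are hypothetical.

```python
from typing import Optional

# Active-site motifs quoted in the text, keyed by MEROPS subfamily.
ACTIVE_SITE_MOTIFS = {
    "S9A": "GGSNGGLL",
    "S9B": "GWSYGGY",
    "S9D": "GGHSYGAFMT",
}


def assign_subfamily(sequence: str) -> Optional[str]:
    """Return a subfamily only when exactly one motif is found; otherwise None."""
    seq = sequence.upper()
    hits = [name for name, motif in ACTIVE_SITE_MOTIFS.items() if motif in seq]
    return hits[0] if len(hits) == 1 else None


# Hypothetical fragments, not real protein sequences.
examples = {
    "toy_s9a": "MKTLAIEGGSNGGLLVGAC",
    "toy_s9b": "PDRIAIWGWSYGGYVTSMV",
    "toy_none": "MAAAKLLPPTTRRS",
}

for name, seq in examples.items():
    print(name, assign_subfamily(seq))
```

Returning `None` for sequences with no hit (or more than one) mirrors the cautious wording of the text, where only a few proteins could be assigned unambiguously.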

    Serine carboxypeptidases (Family S10)
    Serine carboxypeptidases catalyze the hydrolysis of the C-terminal peptide bond in proteins and peptides. Crystal structures show that serine carboxypeptidases adopt the α/β hydrolase fold and possess a catalytic triad similar to that of members of the chymotrypsin and subtilisin families, in the order Ser, Asp and His (S257, D449, H508)[5].

    Serine carboxypeptidase-like proteins (SCPLs) have been identified in several plant species, where they form large and diverse gene families in contrast with prokaryotes and other eukaryotes. Plant SCPLs have been implicated in several biochemical processes including secondary metabolism (See additional file 3: Table S3.pdf) and are believed to be essential to plant growth and development[2,9-11,41-43].

    We identified 54 genes coding for serine carboxypeptidase (SCP)-like proteins in Arabidopsis and 66 in rice. A majority of these gene products are predicted to be secretory pathway proteins, though a number of rice SCP-like proteins are either predicted to localise to mitochondria and chloroplasts or lack a predictable signal sequence (See additional files 1, 2: Table S1.pdf, Table S2.pdf). The analysis of domain architectures of Arabidopsis and rice SCP-like proteins reveals that nearly all gene products identified here are single domain proteins containing a single protease domain, with the exception of LOC_Os01g22970, which carries a single zinc binding module (zf-CCHC; Pfam[37] accession: PF00098) N-terminal to the protease domain, and LOC_Os06g32740, which carries a conserved transposable element module (Transposase_21; Pfam[37] accession: PF01359) C-terminal to the protease domain (Figure 2). This clearly indicates that the predicted serine carboxypeptidase domain is usually sufficient for the biological activity of the majority of gene products identified in the two plant species. The lack of diverse domain architectures suggests that spatial and temporal regulation of gene expression patterns may play a major role in regulating the activity of SCP-like proteins in Arabidopsis and rice. We identified 17 orthologous pairs across the two species, suggesting that, as in the case of subtilisin-like proteins, a majority of SCP-like proteins in Arabidopsis and rice probably arose through extensive gene duplication subsequent to the dicot-monocot divergence, which has contributed to their forming a large gene family in the two species (Tables 2, 3; see additional files 4, 8: Table S4.pdf, Figure SF4.pdf). The presence of a large number of SCP-like proteins in the two plant species suggests a possible functional redundancy; on the other hand, it may indicate functional diversification, where a significant number of these proteins may have been recruited for functions in plant secondary metabolism to catalyze novel reactions (See additional file 3: Table S3.pdf). SCP-like enzymes have been shown to function as acyltransferases in several plant species including Arabidopsis. Arabidopsis SCP-like proteins SCT/SNG2

    Table 2 (Continued): A list of the most recent serine protease gene duplications in the Arabidopsis thaliana genome. The most recent gene duplicates identified in the segmentally duplicated regions of the Arabidopsis genome are suffixed with (S). See text for details.

    Family S26 (Signal Peptidase family), continued
    2.   At1g23465          At1g29960
    3.   At1g52600 (S)      At3g15710 (S)

    Family S28 (Lysosomal Pro-X Carboxypeptidase family)
    1.   At2g24280          At5g65760
    2.   At4g36190          At4g36195

    Family S41 (C-terminal processing peptidase family)
    1.   At3g57680          At4g17740

    Family S54 (Rhomboid family)
    1.   At1g12750 (S)      At1g63120 (S)
    2.   At1g74130          At1g74140
    3.   At2g41160 (S)      At3g56740 (S)

    Family S59 (Nucleoporin autopeptidase family)
    1.   At1g10390          At1g59660


    Table 3: A list of the most recent serine protease gene duplications in the rice genome. The most recent gene duplicates identified in the segmentally duplicated regions of the rice genome are suffixed with (S). See text for details.

    S. No.  Rice serine protease-like protein  Gene duplicate

    Family S1 (DegP protease family)
    1.   LOC_Os02g50880 (S)     LOC_Os06g12780 (S)
    2.   LOC_Os04g38640         LOC_Os05g49380
    3.   LOC_Os12g04740         LOC_Os12g04740

    Family S8 (Subtilisin family)
    1.   LOC_Os01g56320         LOC_Os06g48650
    2.   LOC_Os01g58240         LOC_Os01g58260
    3.   LOC_Os01g58280         LOC_Os01g58290
    4.   LOC_Os01g64860 (S)     LOC_Os05g36010 (S)
    5.   LOC_Os02g10520 (S)     LOC_Os06g40700 (S)
    6.   LOC_Os02g16940         LOC_Os02g17090
    7.   LOC_Os02g17000         LOC_Os02g17060
    8.   LOC_Os02g17090         LOC_Os02g16940
    9.   LOC_Os02g44590 (S)     LOC_Os04g47150 (S)
    10.  LOC_Os02g53910         LOC_Os02g53970
    11.  LOC_Os03g02750 (S)     LOC_Os10g38080 (S)
    12.  LOC_Os03g31630         LOC_Os07g48650
    13.  LOC_Os03g40830         LOC_Os03g55350
    14.  LOC_Os04g02980         LOC_Os04g03060
    15.  LOC_Os04g03800         LOC_Os04g03810
    16.  LOC_Os09g26920         LOC_Os10g25450

    Family S9 (Prolyl oligopeptidase family)
    1.   LOC_Os01g01830         LOC_Os04g47360
    2.   LOC_Os04g47360         LOC_Os01g01830
    3.   LOC_Os06g11180         LOC_Os06g11190
    4.   LOC_Os07g48970         LOC_Os07g48970
    5.   LOC_Os09g28040         LOC_Os09g29950
    6.   LOC_Os10g28020         LOC_Os10g28030

    Family S10 (Serine carboxypeptidase family)
    1.   LOC_Os01g06490         LOC_Os01g61690
    2.   LOC_Os01g11670         LOC_Os05g50600
    3.   LOC_Os01g22980         LOC_Os05g06660
    4.   LOC_Os02g02320         LOC_Os07g29620
    5.   LOC_Os02g42310 (S)     LOC_Os04g44410 (S)
    6.   LOC_Os03g26920         LOC_Os03g26930
    7.   LOC_Os03g52040         LOC_Os03g52080
    8.   LOC_Os04g25560         LOC_Os12g15470
    9.   LOC_Os04g32540         LOC_Os11g31980
    10.  LOC_Os05g50570         LOC_Os05g50580
    11.  LOC_Os08g44640         LOC_Os08g44640
    12.  LOC_Os10g01110         LOC_Os11g42390
    13.  LOC_Os11g24290         LOC_Os11g24410
    14.  LOC_Os11g24510         LOC_Os12g39170
    15.  LOC_Os11g27270         LOC_Os11g27350

    Family S14 (Clp protease family)
    1.   LOC_Os01g32350         LOC_Os03g19510
    2.   LOC_Os02g42290 (S)     LOC_Os04g44400 (S)
    3.   LOC_Os03g22430         LOC_Os05g51450
    4.   LOC_Os08g15270         LOC_Os12g10590

    Family S16 (Lon protease family)
    1.   LOC_Os01g54040 (S)     LOC_Os05g44590 (S)
    2.   LOC_Os03g19350         LOC_Os07g48960
    3.   LOC_Os03g29540         LOC_Os07g32560


    (At5g09640) and SMT/SNG1 (At2g22990) have been shown to function as acyltransferases in phenylpropanoid metabolism[41,43]. Studies have shown that Arabidopsis SMT is a heavily glycosylated protein that localises to the central vacuoles of leaf tissue, suggesting that some SCP-like proteins identified in the two plant species may also be vacuole-localised[44].

    Phylogenetic analysis indicates that the majority of Arabidopsis and rice SCPL proteins cluster into two major clades, but there are some additional SCPL proteins that are less closely related to members of the clades (See additional file 8: Figure SF4.pdf). Clade I may further be subdivided into three subgroups, Clade IA (includes only Arabidopsis SCPL proteins), Clade IB (includes only rice SCPL proteins) and Clade IC (includes a pair of Arabidopsis and rice SCPL proteins). Clade IA includes known SCPL acyltransferases such as SMT (At2g22990) and SCT (At5g09640). Clade IA and IB consist exclusively of Arabidopsis and rice sequences respectively that have no identifiable orthologues in either subcluster. Therefore these two subclades may represent early diversification of certain SCPL proteins followed by their lineage-specific delineation and expansion subsequent to speciation (See additional file 8: Figure SF4.pdf).
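Whether a subclade such as Clade IA or IB is species-specific can be read directly from the gene identifiers of its members. The sketch below assumes the naming convention used in this article (`At…` for Arabidopsis, `LOC_Os…` or the shortened `Os…` for rice). The function names and the membership lists are hypothetical illustrations and do not reproduce the actual composition of any clade.

```python
def species_of(gene_id: str) -> str:
    """Infer the species from an identifier such as At2g22990 or LOC_Os01g22980."""
    name = gene_id[4:] if gene_id.startswith("LOC_") else gene_id
    if name.startswith("At"):
        return "Arabidopsis"
    if name.startswith("Os"):
        return "rice"
    raise ValueError(f"unrecognised gene identifier: {gene_id}")


def clade_composition(members) -> str:
    """Label a clade as species-specific or mixed from its member identifiers."""
    species = {species_of(gene) for gene in members}
    if len(species) == 1:
        return f"{species.pop()}-specific"
    return "mixed"


# Illustrative groupings only; real memberships are given in Figure SF4.
clades = {
    "clade_x": ["At2g22990", "At5g09640"],
    "clade_y": ["LOC_Os01g22980", "Os05g06660"],
    "clade_z": ["At2g22920", "LOC_Os11g24290"],
}

for label, members in clades.items():
    print(label, clade_composition(members))
```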

    Clade II, the larger of the two major clades, consists of sequences that share between 33 and 82% pairwise sequence identity with each other. Earlier observations have suggested that Clade II consists of more varied sequences in terms of pairwise sequence identity, conservation of key residues and exon/intron splice positions[45]. In contrast to Clade I, Arabidopsis and rice sequences falling into Clade II are interspersed with each other, with the exception of a small number of rice sequences that form an exclusive subcluster. The members of the two clades mostly share between 20 and 30% pairwise sequence identity. However, of the three subclusters identified among Clade I proteins, the members of Clade IC appear to be closer to Clade II sequences than members of Clade IA and Clade IB, and share between 25 and 35% pairwise sequence identity with the members of Clade II.
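Statements of the form "members of a clade share X to Y% identity with each other and A to B% with another clade" can be derived from a table of pairwise identities plus a clade assignment for each sequence. The following sketch shows one way to summarise such a table. All names and numbers are hypothetical and chosen only to make the bookkeeping visible.

```python
from itertools import combinations


def identity_ranges(identity, clusters):
    """Return (within, between) dictionaries of (min, max) pairwise identity.

    identity: maps frozenset({gene_a, gene_b}) to percent identity
    clusters: maps gene to cluster label
    """
    within, between = {}, {}
    for a, b in combinations(sorted(clusters), 2):
        value = identity[frozenset((a, b))]
        ca, cb = clusters[a], clusters[b]
        if ca == cb:
            table, key = within, ca
        else:
            table, key = between, tuple(sorted((ca, cb)))
        low, high = table.get(key, (value, value))
        table[key] = (min(low, value), max(high, value))
    return within, between


# Hypothetical genes, cluster labels and identities.
clusters = {"g1": "I", "g2": "I", "g3": "II", "g4": "II"}
identity = {
    frozenset(("g1", "g2")): 70.0,
    frozenset(("g3", "g4")): 55.0,
    frozenset(("g1", "g3")): 25.0,
    frozenset(("g1", "g4")): 28.0,
    frozenset(("g2", "g3")): 22.0,
    frozenset(("g2", "g4")): 30.0,
}

within, between = identity_ranges(identity, clusters)
print(within)
print(between)
```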

    Serine beta-lactamases (Family S12)
    Serine beta-lactamases are a group of hydrolases that catalyze the hydrolysis of the beta-lactam ring of β-lactam antibiotics such as penicillins. They were first identified in penicillin-resistant bacterial isolates and have been found to be widely distributed in eubacterial genomes. On the basis of sequence similarity, they have been classified into three classes, A, C and D, whereas class B consists of metallo-beta-lactamases. Class A (penicillinase type) proteins were the first to be identified and are the most common beta-lactamases. All these proteins share sequence similarity within a class but not with members of other classes. However, proteins of different classes share significant structural similarity and are believed to be homologues that have undergone divergent evolution. Structure-based phylogeny studies suggest that beta-lactamases are ancient enzymes that probably evolved as a means of protection against beta-lactam antibiotics[46]. Site-directed mutagenesis and structural analyses have helped to unravel the catalytic mechanism of β-lactamases and led to the identification of their active site residues (S93, K96, Y190)[47,48].

    Serine beta-lactamases are widespread in bacterial genomes, and their corresponding genes may occur on

    Table 3 (Continued): A list of the most recent serine protease gene duplications in the rice genome. The most recent gene duplicates identified in the segmentally duplicated regions of the rice genome are suffixed with (S). See text for details.

    Family S24 (Signal peptidase family)
    1.   LOC_Os03g55640         LOC_Os09g28000
    2.   LOC_Os05g23260         LOC_Os06g16260

    Family S28 (Lysosomal Pro-X Carboxypeptidase family)
    1.   LOC_Os01g56150         LOC_Os06g43930
    2.   LOC_Os10g36780         LOC_Os10g36780

    Family S41 (C-terminal processing peptidase family)
    1.   LOC_Os01g47450         LOC_Os06g21380

    Family S54 (Rhomboid family)
    1.   LOC_Os01g18100         LOC_Os03g44830
    2.   LOC_Os03g24390         LOC_Os07g46170 (S)
    3.   LOC_Os08g43320 (S)     LOC_Os09g35730 (S)

    Family S59 (Nucleoporin autopeptidase family)
    1.   LOC_Os12g06870         LOC_Os12g06890


    bacterial chromosomes or on plasmids. This allows for their transfer to distant species and may account for their distribution and diversity[48]. Serine beta-lactamase homologues have been identified in eukaryotes, mainly as a consequence of ongoing genome sequencing programs. However, little functional information is available about these proteins in eukaryotes.

    We have identified a single serine beta-lactamase-like gene product each in the Arabidopsis and rice proteomes, both of which are predicted to localise to mitochondria. An interesting feature is the presence of a single ABC1 domain (Pfam[37] accession: PF03109) N-terminal to the serine β-lactamase domain (Figure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). ABC1 domains are associated with a family of proteins that includes members from yeast and bacteria and that display a nuclear or mitochondrial subcellular location in eukaryotes. Some members of the family are believed to regulate mRNA translation and are essential for electron transfer in the bc1 complex[37] (See additional file 9: Figure SF5.pdf).

    Clp ATP-dependent proteases (Family S14)
    Clp proteases are a group of ATP-dependent serine endopeptidases that belong to MEROPS[5] family S14. The E. coli Clp protease is an ATP-dependent serine protease that consists of a smaller protease subunit, ClpP, and a larger chaperone regulatory ATPase subunit (either ClpA or ClpX). Though the protease domain is capable of proteolysis on its own, the ATPase subunits are essential for effective levels of proteolysis. Structural analysis reveals that the catalytic triad residues Ser-His-Asp (S111, H136, D185) are enclosed in a single cavity that allows for degradation of small peptides but precludes the entry of large folded polypeptides (See additional file 3: Table S3.pdf)[6,7,49].

    Clp proteases have been identified in bacteria and eukaryotes but appear to be absent from archaea. They have been implicated in protein quality control and degradation of misfolded nascent peptides[50]. Plant Clp proteases were first identified with the discovery of the plastid-encoded gene clpP1, orthologous to E. coli clpP, which in most higher plants is co-transcribed with the genes rps12 and rps120[6,7]. Though the exact function of Clp proteases remains to be worked out, they appear to be essential for chloroplast function and viability. Clp proteases are likely to play a role in protein degradation in the stroma and thylakoid membrane under particular stress conditions, thereby ensuring removal of aberrant polypeptides and recycling of the nutrient pool[7,13]. They have also been implicated in shoot development[51], suggesting a broad regulatory role for Clp proteases in plants.

    We have identified nine different ClpP genes encoding the proteolytic subunit of the Clp protease in the Arabidopsis nuclear proteome and 13 in the rice nuclear proteome. Eight of the nine Arabidopsis ClpP-like proteins are predicted to localise to the chloroplast, while the At5g23140 gene product is predicted to be mitochondria-localised, though experimental evidence suggests that the At5g23140 gene product localises to both chloroplasts and mitochondria[52]. Likewise, in rice, 11 ClpP-like proteins are predicted to be chloroplast-localised and the LOC_Os08g15270 gene product is predicted to localise to mitochondria, while the subcellular localization of LOC_Os12g1059 could not be determined (See additional files 1, 2: Table S1.pdf, Table S2.pdf). We observed that some ClpP gene products identified in the two species have lost the catalytic protease site and therefore do not appear to take part in proteolysis. However, they are believed to play a significant role in the proper assembly of the functional ClpP protein complex and potentially influence chaperone interaction (Figure 4). Interestingly, we identified orthologues for all nine Arabidopsis ClpP gene products in rice, suggesting a high degree of evolutionary conservation of Clp function in the two species (See additional file 4: Table S4.pdf). Recently, a 325-kDa complex consisting of 10 Clp isoforms, including the chloroplast-encoded pClpP and nuclear-encoded Clp proteins, was identified in Arabidopsis chloroplasts, and a 320-kDa homotetradecameric complex consisting of only ClpP2 gene products was identified in mitochondria. These protein complexes are believed to be involved in the degradation of misfolded or unassembled peptides in an ATP-dependent manner[52]. The high evolutionary conservation of ClpP-like proteins across the two species suggests that some rice ClpP homologues may participate in the assembly of similar regulatory complexes in rice chloroplasts and mitochondria.

    Phylogenetic analysis of Arabidopsis and rice Clp-like proteins reveals the presence of several gene clusters (Clades I-VIII) that share comparatively low pairwise sequence identity with each other, as opposed to the members within a cluster, which share significant sequence similarity with each other (Figure 5). We identified eight Clp-like proteins from Arabidopsis and rice that are likely to have lost their catalytic protease activity due to mutations of key residues in the protease active site (Figure 4). Six of these gene products (At1g09130, At1g49970, At4g17040 from Arabidopsis and LOC_Os01g16530, LOC_Os03g22430, LOC_Os05g51450 from rice) fall into a single gene cluster (Clade I) and share between 36 and 72% pairwise sequence identity with each other (Figure 5). The above sequences have a 9–10 amino acid insert (L1 insert) that is believed to affect the presentation of the substrate to the neighboring catalytic triads in ClpP proteins and is also known to contribute residues to putative lateral openings likely to stimulate or inhibit the exit of peptide fragments from the core[52] (Figure 4). The other ClpP-like gene products identified in the two species mostly cluster into independent gene clusters, Clades II-VIII (Figure 5). The sequences constituting each of these gene clusters share high pairwise sequence identity with each other, but are between 25 and 45% identical to the rest of the ClpP-like proteases identified across the two species.

    Figure 4. Multiple sequence alignment of the Clp protease domain region of the annotated Arabidopsis and rice Clp protease-like proteins. The catalytic triad residues are indicated. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and ##### to a 5 digit number assigned to each gene. Several gene products that display mutation in one or more catalytic triad residues can be visualised here (see text for details).

    At1g09130  ----------PRTPPPDLPSMLLDGRIVYIGMPLVPAVTELVVAELMYLQWLDPKEPIYIYINSTGTTRDDGETVGMESE
    Os03g22430 -----------RRPPPDLPSLLLHGRIVYIGMPLVPAVTELVVAQLMYLEWMNSKEPVYIYINSTGTARDDGEPVGMESE
    At1g49970  -----GGSERPRTAPPDLPSLLLDARICYLGMPIVPAVTELLVAQFMWLDYDNPTKPIYLYINSPGTQNEKMETVGSETE
    Os05g51450 ------------SSAPDLPSLLLDSRIIFLGMPIVPAVTELIAAQFLWLDYDDRTKPIYLYINSTGTMDENNELVASETD
    At4g17040  -----SKGSAHEQPPPDLASYLFKNRIVYLGMSLVPSVTELILAEFLYLQYEDEEKPIYLYINSTGTTK-NGEKLGYDTE
    Os01g16530 -----LRGTAWQQPPPDLASFLYKNRIVYLGMCLVPSVTELMLAEFLYLQYDDAEKPIYLYINSTGTTK-NGEKLGYETE
    At1g02560  ---------MVQERFQSIISQLFQYRIIRCGGAVDDDMANIIVAQLLYLDAVDPTKDIVMYVNSP---------GGSVTA
    Os03g19510 ---------MVMERFQSVVSQLFQHRIIRCGGPVEDDMANIIVAQLLYLDAIDPNKDIIMYVNSP---------GGSVTA
    At1g66670  -----------SFEELDTTNMLLRQRIVFLGSQVDDMTADLVISQLLLLDAEDSERDITLFINSP---------GGSITA
    Os01g32350 -----------RLEELDTTNMLLRQRIVFLGSPVDDMSADLIISQLLLLDAEDKTKDIKLFINSP---------GGSITA
    At5g45390  -----------RGAESDVMGLLLRERIVFLGSSIDDFVADAIMSQLLLLDAKDPKKDIKLFINSP---------GGSLSA
    Os10g43050 ---------PGAGAGGDVMGLLLRERIVFLGNEIEDFLADAVVSQLLLLDAVDPNSDIRLFVNSP---------GGSLSA
    Os02g42290 ---------SRGERAYDIFSRLLKERIVLIHGPIADETASLVVAQLLFLESENPLKPVHLYINSP---------GGVVTA
    Os04g44400 ---------SRGERAYDIFSRLLKERIVCIHGPITDDTASLVVAQLLFLESENPAKPVHLYINSP---------GGVVTA
    At5g23140  ---------SRGERAYDIFSRLLKERIICINGPINDDTSHVVVAQLLYLESENPSKPIHMYLNSP---------GGHVTA
    At1g11750  -----------PGGPLDLSSVLFRNRIIFIGQPINAQVAQRVISQLVTLASIDDKSDILMYLNCP---------GGSTYS
    Os03g29810 -----------PAGALDLATVLLGNRIIFIGQYINSQVAQRVISQLVTLAAVDEEADILIYLNCP---------GGSLYS
    At1g12410  ------NREEGTWQWVDIWNALYRERVIFIGQNIDEEFSNQILATMLYLDTLDDSRRIYMYLNGP---------GGDLTP
    Os06g04530 -------PGEGTWQWLDIWNALYRERIIFIGDSIDEEFSNQVLASMLYLDSVDNTKKILLYINGP---------GGDLTP
    Os08g15270 --------------------------------------------------------------------------------
    Os12g10590 -------PGDEEVTWVDLYNVMYRERTLFLGQEIRCEVTNHITGLMVYLSIEDGISDIFLFINSP---------GGWLIS
    Os11g11210 DLPLLPFQPAEVLIPSECKTLHLYEARYLALLEEALYRTNNSFVHLVLDPVVSGSPKASFAVERLDIGALVSIRGVCRVN

    At1g09130  GFAIYDSLMQLKNEVHTVCVGAAIGQACLLLSAGTKGKRFMMPHAK------AMIQQPRVPSSGLMPASDVLIRAKEVIT
    Os03g22430 GFAIYDAMMRMKTEIHTLCIGAAAGHACLVLAAGKKGKRYMFPHAK------AMIQQPRIPSYGMMQASDVVIRAKEVVH
    At1g49970  AYAIADTISYCKSDVYTINCGMAFGQAAMLLSLGKKGYRAVQPHSS------TKLYLPKVNRS-SGAAIDMWIKAKELDA
    Os05g51450 AFAIADFINRSKSKVYTINLSMAYGQAAMLLSLGVKGKRGVLPNSI------TKLYLPKVHKS-GGAAIDMWIKAKELDT
    At4g17040  AFAIYDVMGYVKPPIFTLCVGNAWGEAALLLTAGAKGNRSALPSST--------IMIKQPIARFQGQATDVEIARKEIKH
    Os01g16530 AFAIYDAMRYVKVPIFTLCVGNAWGEAALLLAAGAKVSVVFASKKYEDNDLILLTFNVQPIGRFQGQATDVDIARKEIRN
    At1g02560  GMAIFDTMRHIRPDVSTVCVGLAASMGAFLLSAGTKGKRYSLPNSR-------IMIHQPLGGAQ-GGQTDIDIQANEMLH
    Os03g19510 GMAIFDTMKHIRPDVSTVCIGLAASMGAFLLSAGTKGKRYSLPNSR-------IMIHQPLGGAQ-GQETDLEIQANEMLH
    At1g66670  GMGIYDAMKQCKADVSTVCLGLAASMGAFLLASGSKGKRYCMPNSK-------VMIHQPLGTAG-GKATEMSIRIREMMY
    Os01g32350 GMGVYDAMKFCKADISTVCFGLAASMGAFLLAAGTKGKRFCMPNAR-------IMIHQPSGGAG-GKVTEMGLQIREMMY
    At5g45390  TMAIYDVVQLVRADVSTIALGIAASTASIILGAGTKGKRFAMPNTR-------IMIHQPLGGAS-GQAIDVEIQAKEVMH
    Os10g43050 TMAIYDVMQLVRADVSTIGLGIAGSTASIILGGGTKGKRFAMPNTR-------IMIHQPVGGAS-GQALDVEVQAKEILT
    Os02g42290 GLAIYDTMQYIRCPVTTLCIGQAASMGSLLLAAGARGERRALPNAR-------VMIHQPSGGAQ-GQATDIAIQAKEILK
    Os04g44400 GLAIYDTMQYIRSPVTTLCIGQAASMASLLLAAGARGERRALPNAR-------VMIHQPSGGAS-GQASDIAIHAKEILK
    At5g23140  GLAIYDTMQYIRSPISTICLGQAASMASLLLAAGAKGQRRSLPNAT-------VMIHQPSGGYS-GQAKDITIHTKQIVR
    At1g11750  VLAIYDCMSWIKPKVGTVAFGVAASQGALLLAGGEKGMRYAMPNTR-------VMIHQPQTGCG-GHVEDVRRQVNEAIE
    Os03g29810 ILAIYDCMSWIKPKVGTVCFGVVASQAAIILAGGEKGMRYAMPNAR-------VMIHQPQGVSE-GNVEEVRRQVGETIY
    At1g12410  SLAIYDTMKSLKSPVGTHCVGLAYNLAGFLLAAGEKGHRFAMPLSR-------IALQSPAGAAR-GQADDIQNEAKELSR
    Os06g04530 CMALYDTMLSLKSPIGTHCLGFAFNLAGFILAAGEKGSRTGMPLCR-------ISLQSPAGAAR-GQADDIENEANELIR
    Os08g15270 ----------------------MAS---FILLGGEPTKRIAFPHAR-------IMLHQPASAYYRARTPEFLLEVEELHK
    Os12g10590 GMAIFDTMQTVTPDIYTICLGIAASMVSFILLGGEPTKRIAFPHAR-------IMLHQPASAYYRARTPEFLLEVEELHK
    Os11g11210 IINLLQMEPYLRGDVSPIMDISSESIELGLRISKLRESMCNLHSLQMK-----LKVPEDEPLQTNIKASLLWSEKEIFEE

    At1g09130  NRDILVELLSKHTGNSVETVANVMRRPYYMDAPKAKEFGVIDRILWRG----------
    Os03g22430 NRNTLVRLLARHTGNPPEKIDKVMRGPFYMDSLKAKEFGVIDKILWRG----------
    At1g49970  NTEYYIELLAKGTGKSKEQINEDIKRPKYLQAQAAIDYGIADKIADSQ----------
    Os05g51450 NTDYYLELLSKGVGKPKEELAEFLKGPRYFRAQEAIDYGLADTILHSL----------
    At4g17040  IKTEMVKLYSKHIGKSPEQIEADMKRPKYFSPTEAVEYGIIDKVVYNE----------
    Os01g16530 VKIEMIKLLSRHIGKSVEEIAQDIKRPKYFSPSEAVDYGIIDKVLYNE----------
    At1g02560  HKANLNGYLAYHTGQSLEKINQDTDRDFFMSAKEAKEYGLIDGVIMNP----------
    Os03g19510 HKANLNGYLAYHTGQPLDKINVDTDRDYFMSAKEAKEYGLIDGVIMNP----------
    At1g66670  HKIKLNKIFSRITGKPESEIESDTDRDNFLNPWEAKEYGLIDAVIDDG----------
    Os01g32350 EKIKINKILSRITGKPEEQIDEDTKFDYFMSPWEAKDYGIVDSVIDEG----------
    At5g45390  NKNNVTSIIAGCTSRSFEQVLKDIDRDRYMSPIEAVEYGLIDGVIDGD----------
    Os10g43050 NKRNVIRLISGFTGRTPEQVEKDIDRDRYMGPLEAVDYGLIDSVIDGD----------
    Os02g42290 LRDRLNKIYQKHTGQEIDKIEQCMERDLFMDPEEARDWGLIDEVIENR----------
    Os04g44400 VRDRLNKIYAKHTSQAIDRIEQCMERDMFMDPEEAHDWGLIDEVIEHR----------
    At5g23140  VWDALNELYVKHTGQPLDVVANNMDRDHFMTPEEAKAFGIIDEVIDERS---------
    At1g11750  ARQKIDRMYAAFTGQPLEKVQQYTERDRFLSASEALEFGLIDGLLETE----------
    Os03g29810 ARDKVDKMFAAFTGQTLDMVQQWTERDRFMSSSEAMDFGLVDALLETR----------
    At1g12410  IRDYLFNELAKNTGQPAERVFKDLSRVKRFNAEEAIEYGLIDKIVRPP----------
    Os06g04530 IKNYLYSKLSEHTGHPVDKIHEDLSRVKRFDAEGALEYGIIDRIIRPS----------
    Os08g15270 VREMITRVYALRTGKPFWVVSEDMERDVFMSADEAKAYGLVDIVGDEM----------
    Os12g10590 VHEMITRVYALRTGKPFWVVSEDMERDVFMSADKAKAYGLVDIVGDEM----------
    Os11g11210 YNEGFIPALPERLSFAAYQTVSGMSEAELLSLQKYKIQAMDSTNTLERLNSGIEYVEH
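The loss of catalytic residues described above can be checked mechanically once the alignment columns holding the Ser-His-Asp triad are known. The sketch below shows that check on a toy alignment. The column indices, sequences and names are hypothetical, since the triad columns of the Clp alignment are not stated in the text.

```python
CANONICAL_TRIAD = ("S", "H", "D")  # Ser-His-Asp, as described for ClpP


def triad_status(aligned_seq: str, columns) -> tuple:
    """Return (is_intact, observed_residues) for the given 0-based columns."""
    observed = tuple(aligned_seq[c].upper() for c in columns)
    return observed == CANONICAL_TRIAD, observed


# Toy alignment; columns 2, 5 and 8 are assumed to hold the triad.
toy_alignment = {
    "toy_active": "ACSGGHKLDE",
    "toy_mutant": "ACAGGHKLDE",
    "toy_gapped": "ACSGG-KLNE",
}
toy_columns = (2, 5, 8)

report = {
    name: triad_status(seq, toy_columns)
    for name, seq in toy_alignment.items()
}
for name, (intact, residues) in report.items():
    label = "intact" if intact else "altered"
    print(f"{name}: {''.join(residues)} ({label})")
```

A substitution or gap at any triad column marks the sequence as a candidate non-proteolytic member, which is only a first screen and says nothing about its role in complex assembly.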

    Lon proteases (Family S16)
    Lon proteases are also a group of ATP-dependent serine proteases. Unlike the Clp proteases, the catalytic and ATPase domains of the Lon proteases reside in the same polypeptide. They are conserved in prokaryotes and eukaryotic organelles such as chloroplasts and mitochondria. E. coli Lon protease was the first ATP-dependent protease to be described and consists of three functional

    Figure 5. Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) Clp protease domains. Clp protease-like domains were aligned using the ClustalW [95] program and the alignments were exported to the Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent different evolutionary clades identified in the analysis (see text for details). Clade I is represented in orange, while Clades II-VIII are shaded in black. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterisk, circles represent bootstrap values from 60%–80%, while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and ##### to a 5 digit number assigned to each gene.

    [Tree image not reproduced. Labels recovered from the figure: Clade I through Clade VIII.]


    domains: the N-terminal domain (LON), a central ATPase domain (AAA+ module) and a C-terminal proteolytic domain (Lon_C). The N-terminal LON domain along with the AAA+ module is believed to impart substrate specificity to Lon proteases[6,53]. Structure and sequence analysis along with site-directed mutagenesis indicate that Lon proteases employ a Ser-Lys (S679, K/R722) catalytic dyad instead of a canonical serine protease catalytic triad[53]. Based on the conservation of residues around the catalytic residues, the Lon proteases have recently been proposed to be divided into two subfamilies: LonA and LonB[53].

    Unlike Clp proteases, Lon proteases appear to be present in organisms from all forms of life. They are believed to play a significant role in removal of aberrant proteins and thereby maintain cellular stability. Interestingly, an archaeal membrane-bound Lon protease was found to have both ATP-dependent and ATP-independent proteolytic activities towards folded and unfolded proteins, respectively[54]. Little is known of their function in plants, but Lon1 from Arabidopsis has been implicated in the control of cytoplasmic male sterility[55].

    We have identified four protein sequences containing the Lon proteolytic domain in Arabidopsis and an equal number in rice. In addition, we also identified five protein sequences containing only the Lon N-terminal domain in Arabidopsis and four in rice (data not shown). Of the Arabidopsis Lon protease-like proteins, the At3g05790 and At5g26860 gene products are predicted to localise to chloroplasts, while the At3g05780 and At5g47040 gene products are predicted to be mitochondria localised. All four Lon protease-like proteins identified in rice are predicted to localise to chloroplasts. The Lon protease-like proteins identified in the two plant species share higher similarities with the LonA subfamily (See additional file 10: Figure SF6.pdf). LON-only sequences are known in several prokaryotic species, and their presence in the Arabidopsis and rice proteomes suggests that the LON domain may have been recruited as a general module for regulating protein-protein interactions, particularly for targeted protein degradation[56].

    Sequence analysis reveals that Arabidopsis and rice Lon protease-like sequences fall into two clusters on the basis of pairwise sequence identity (data not shown). The larger of these clusters consists of five sequences (At3g05790, At5g26860, At3g05780, LOC_Os03g19350, LOC_Os07g48960), all of which, except At3g05780, are predicted to localise to chloroplasts and share between 76 and 91% pairwise sequence identity with each other. The second cluster consists of two sequences (At5g47040 and LOC_Os09g36300) that are 82% identical to each other but predicted to have different subcellular localizations. At5g47040 is predicted to localise to mitochondria and LOC_Os09g36300 is predicted to localise to chloroplasts. The members of the two clusters display between 40 and 45% pairwise sequence identity with each other. A Lon protease-like gene product from rice, LOC_Os06g05820, is less closely related and shares between 19 and 25% pairwise sequence identity with the rest of the Lon protease-like gene products from the two species.
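The clustering above rests on pairwise sequence identity between aligned sequences. As a minimal sketch of how such a percentage might be computed, the following Python function counts identical residues over the columns in which both sequences carry a residue. The choice of denominator (gapped columns excluded) is an assumption, since the text does not state which convention was used, and the two aligned fragments are invented for illustration and are not taken from the Lon alignments.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: columns containing a gap in either sequence are ignored.
    Other conventions (e.g. dividing by alignment length) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical aligned fragments, invented for illustration only.
seq_x = "MKT-LLVAGS"
seq_y = "MKTALIVA-S"
print(percent_identity(seq_x, seq_y))  # 7 identical out of 8 compared columns -> 87.5
```

Applying such a function to every pair of sequences in a family yields the identity matrix from which clusters like the two described above can be read off.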

    Signal peptidases (Family S26)

    Signal peptidases (SPases or leader peptidases) are a diverse group of serine endopeptidases that belong to MEROPS[5] family S26 and are responsible for the removal of signal peptides from preproteins within the cell[57]. Signal peptides comprise the N-terminal part of the polypeptide chain. They are essential for targeting various proteins to their correct destination and are cleaved off subsequent to the transfer of the protein across the membrane. The failure to remove signal peptides often leads to protein inactivation and/or mislocalisation[57].

    The Type I SPases are membrane-bound serine endopeptidases that were first identified in bacteria (E. coli lepB) and homologues have subsequently been identified in archaea, the mitochondrial inner membrane, the thylakoid membrane of chloroplasts and the endoplasmic reticulum membranes of yeast and higher eukaryotes[57,58]. The Type I SPase family is further divided into two subfamilies on the basis of the catalytic mechanism: members of the S26A subfamily, which include prokaryotic, chloroplast and mitochondrial peptidases, employ a serine-lysine dyad for catalysis (S91, K146), while members of the S26B subfamily, which include ER SPC and archaeal peptidases, employ a serine-histidine dyad (S44, H83)[57,58].

    In bacteria, Type I SPases appear to be essential for cell viability and their inactivation invariably leads to cell death. Therefore they have been attractive targets for antibacterial drugs[57,58]. In higher plants, homologues of Type I SPases, known as thylakoidal processing peptidases (TPPs), have been identified and are involved in processing of protein presequences required for integration into or translocation across thylakoid membranes. TPP, an Arabidopsis homologue of E. coli LepB, is a thylakoid membrane protein with its active site situated at the lumenal side[59,60]. An Arabidopsis plastidic SPase I (Plsp1, At3g24590) has been found to be required for complete maturation of the plastid protein translocation channel Toc75. It is proposed that Plsp1 may have multiple functions and is believed to be essential for biogenesis of plastid internal membranes[61].

    We have identified nine genes coding for signal peptidase-like proteins in Arabidopsis and seven in rice. A majority of these gene products are predicted to localise to mitochondria, though some are predicted to be chloroplast localised, membrane bound and even secreted (See additional files 1, 2: Table S1.pdf, Table S2.pdf). Arabidopsis TPP (At2g30440), though predicted by TargetP[19] to be mitochondria localised, has been experimentally determined to be associated with the chloroplast fraction. The enzyme, when overexpressed in vitro, was found to possess catalytic activity against the lumenal targeting presequence of the 23-kDa component of photosystem II[59], suggesting a possible role for TPP and similar proteins in processing of polypeptides during their transport through cellular membranes. Most of the gene products analysed here possess Ser and Lys as active site residues and are likely to employ a serine-lysine catalytic dyad for proteolysis. Some gene products (At1g52600, At3g15710, LOC_Os05g23260, LOC_Os06g16260, LOC_Os02g58140) appear to possess Ser and His as active site residues and consequently may employ a serine-histidine catalytic dyad for proteolysis (Figure 6). We identified five orthologous pairs of signal peptidase-like proteins across the two species, suggesting conservation of signal-peptidase function in the two species (See additional file 4: Table S4.pdf).

    Phylogenetic analysis indicates that Arabidopsis and rice Type I SPase-like proteins cluster into two major clades supported by high bootstrap values and high pairwise sequence identity (See additional file 11: Figure SF7.pdf). Clade I consists of 5 sequences (At1g06870, At2g30440, At3g24590, LOC_Os09g28000, LOC_Os03g55640) that are between 67 and 86% identical to each other. Clade II consists of 5 sequences (At1g52600, At3g15710, LOC_Os05g23260, LOC_Os06g16260, LOC_Os02g58140) that share between 67 and 96% pairwise sequence identity with each other. The members of either cluster share low sequence identity with the rest of the gene products. Interestingly, all the members of clade II carry histidine as an active site residue instead of lysine and consequently are believed to employ a Ser-His dyad for proteolysis and possibly correspond to the S26B subfamily of Type I SPases (Figure 6; see above; see additional file 11: Figure SF7.pdf). The remaining sequences carry

    Figure 6. Multiple sequence alignment of the Type I SPase domain region of the annotated Arabidopsis and rice Type I SPase-like proteins. The catalytic dyad residues are indicated. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. The variations in the second residue (K/H) of the catalytic dyad can be identified here (see text for details).

    [Conservation and catalytic dyad markers from the original figure are not reproduced.]

    At1g06870  SIPSTSMLPTLDVGDR-------VIAEKVSYFFRKPEVSDIVIFKAPPILVEH---GYSCADVFIKRIVASEGDWVEVCD
    At2g30440  SIPSTSMYPTLDKGDR-------VMAEKVSYFFRKPEVSDIVIFKAPPILLEYPEYGYSSNDVFIKRIVASEGDWVEVRD
    At3g24590  YIPSLSMYPTFDVGDR-------LVAEKVSYYFRKPCANDIVIFKSPPVLQEVG---YTDADVFIKRIVAKEGDLVEVHN
    Os03g55640 SIPSKSMYPTFDVGDR-------ILADKVSYVFREPNILDIVIFRAPPVLQALG---CSSGDVFIKRIVAKGGDTVEVRD
    Os09g28000 SIPSKSMYPTFDVGDR-------ILAEKVSYIFREPEILDIVIFRAPPALQDWG---YSSGDVFIKRVVAKAGDYVEVRD
    At3g08980  PVRGDSMSPTFNPQRNSYLDDYVLVDKFCLKD-YKFARGDVVVFSSP----------THFGDRYIKRIVGMPGEWISS--
    Os04g08340 TVRGTSMNPTLESQQG----DRALVSRLCLDARYGLSRGDVVVFRSP----------TEHRSLLVKRLIALPGDWIQVPA
    At1g23465  YAYGPSMIPTLHPSGN------MLLAERISKRYQKPSRGDIVVIRSP----------ENPNKTPIKRVVGVEGDCISFVI
    At1g29960  YAYGPSMTPTLHPSGN------VLLAERISKRYQKPSRGDIVVIRSP----------ENPNKTPIKRVIGIEGDCISFVI
    At1g53530  HVHGPSMLPTLNLTGD------VILAEHLSHRFGKIGLGDVVLVRSP----------RDPKRMVTKRILGLEGDRLTFSA
    Os11g40500 LVMGPSMLPAMNLAGD------VVAVDLVSARLGRVASGDAVLLVSP----------ENPRKAVVKRVVGMEGDAVTFLV
    Os05g23260 VVLSGSMEPGFKRGDI----------LFLHMSKDPIRTGEIVVFNID-----------GREIPIVHRVIKVHEREESAEV
    Os06g16260 VVLSGSMEPGFKRGDI----------LFLHMSKDPIRTGEIVVFNVD-----------GREIPIVHRVIKVHERQESAEV
    At1g52600  VVLSGSMEPGFKRGDI----------LFLHMSKDPIRAGEIVVFNVD-----------GRDIPIVHRVIKVHERENTGEV
    Os02g58140 VVLSESMELGFERGDI----------LFLQMSKHPIRTGDIVVFND------------GREIPIVHRVIEVHERRDNAQV
    At3g15710  VVLSESMEPGFQRGDI----------LFLRMTDEPIRAGEIVVFSVD-----------GREIPIVHRAIKVHERGDTKAV

    At1g06870  GKLLVNDT-------------
    At2g30440  GKLFVNDI-------------
    At3g24590  GKLMVNGV-------------
    Os03g55640 GKLLVNGV-------------
    Os09g28000 GKLIVNGV-------------
    At3g08980  ------SRDVIRVPEG-----
    Os04g08340 ------AQEIRQIPVG-----
    At1g23465  DPVKSDESQTIVVPKG-----
    At1g29960  DSRKSDESQTIVVPKG-----
    At1g53530  DPLVGDAS-------------
    Os11g40500 DPGNSDASKTVVVPKG-----
    Os05g23260 DILTK----------------
    Os06g16260 DILTK----------------
    At1g52600  DVLTKGDNNYGDDRLLYAEG-
    Os02g58140 DFLTKGDNNPMDDRILYTHGQ
    At3g15710  DVLTKGDNNDIDDI-------


    lysine as an active site residue and are consequently believed to employ a Ser-Lys dyad for proteolysis, and are likely to correspond to the S26A subfamily of Type I SPases (Figure 6; see above; see additional file 11: Figure SF7.pdf). Clade I sequences share between 8 and 10% pairwise sequence identity with members of Clade II but share between 21 and 43% pairwise sequence identity with the remaining Type I SPase-like gene products. Clade II members appear to be less closely related and share between 8 and 15% pairwise sequence identity with the rest of the Type I SPase-like gene products. Taken together, these observations suggest that functional clusters of Type I SPases employing slightly varied catalytic mechanisms were established early during evolution and were subsequently retained, possibly for adaptation to different physiological processes.
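The assignment of sequences to S26A-like or S26B-like groups described above depends on which residue occupies the second dyad position in the alignment. A minimal Python sketch of that lookup is given below. The toy alignment, the sequence names and the two column indices are all hypothetical and chosen only to make the example runnable; they are not the columns of Figure 6, and the labels returned are illustrative rather than a definitive classification scheme.

```python
def classify_dyad(aligned_seq: str, ser_col: int, second_col: int) -> str:
    """Label a sequence by the residues found at the two assumed dyad columns."""
    first = aligned_seq[ser_col]
    second = aligned_seq[second_col]
    if first != "S":
        return "no catalytic Ser"      # nucleophile missing or replaced
    if second == "K":
        return "Ser-Lys (S26A-like)"
    if second == "H":
        return "Ser-His (S26B-like)"
    return "atypical"


# Hypothetical toy alignment, invented for illustration only.
toy_alignment = {
    "seqA": "GPSMLP--IKRV",
    "seqB": "SGSMEPGIVHRV",
    "seqC": "GPAMLP--IKRV",
}
SER_COL, SECOND_COL = 2, 9  # assumed 0-based alignment columns of the dyad

labels = {
    name: classify_dyad(seq, SER_COL, SECOND_COL)
    for name, seq in toy_alignment.items()
}
for name, label in labels.items():
    print(name, label)
```

In practice the two columns would be taken from the positions aligned to the experimentally characterised dyad residues of a reference enzyme.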

    Lysosomal Pro-X carboxypeptidase family (Family S28)

    This family includes several eukaryotic enzymes such as lysosomal Pro-X carboxypeptidase (PCP), dipeptidyl-peptidase II and thymus-specific serine peptidase and is grouped in MEROPS[5] clan SC along with MEROPS[5] peptidase families such as S9, S10, S15 and S33. The predicted active site residues for the members of this family (S179, D430, H455) occur in the same order in the sequence as that of family S10.

    Members of this family have been identified exclusively in eukaryotes and have been implicated in several processes (See additional file 3: Table S3.pdf)[4,62,63]. No functional information is available regarding their role in plants.

    Analysis of the Arabidopsis proteome revealed the presence of seven genes encoding members of the S28 family, while five genes were identified in rice. Of these, a majority are predicted to be members of the secretory pathway, while two rice gene products, LOC_Os01g56150 and LOC_Os10g36780, are predicted to be mitochondria localised (See additional files 1, 2, 12: Table S1.pdf, Table S2.pdf, Figure SF8.pdf). We further identified three orthologous pairs across the two species. The S28 protease-like proteins identified here, like their eukaryotic counterparts, are believed to be involved in processing of polypeptides containing a penultimate proline in the two plant species (See additional file 4: Table S4.pdf).

    Phylogenetic analysis reveals that Arabidopsis and rice proteins belonging to the S28 family cluster into two clades (Figure 7). Clade I appears to be the more diverse of the two clades and consists of six sequences that share between 40 and 69% pairwise sequence identity among themselves. Clade II consists of five sequences that are more closely related to each other than Clade I sequences and are between 61 and 92% identical. The members of the two clusters share between 19 and 26% pairwise sequence identity with each other. These observations suggest the presence of two evolutionary clades among Arabidopsis and rice S28-like proteins prior to speciation.

    C-terminal processing peptidases (Family S41)

    C-terminal processing peptidases (Ctps) belong to MEROPS[5] family S41. E. coli Tsp, a periplasmic endoprotease, was one of the first members of the family to be identified (See additional file 3: Table S3.pdf). It was subsequently determined that most family members are associated with PDZ domains that are protein-protein interaction modules and are important for substrate recognition[24]. The C-terminal processing peptidase family consists of two subfamilies, which differ greatly in activity, sensitivity to inhibitors and molecular structure[5]. The two subfamilies employ different mechanisms for catalysis. The C-terminal processing peptidase subfamily (S41A) is found chiefly in bacteria and some eukaryotes and employs a Ser-Lys catalytic dyad (S452, K477), while the tricorn core protease subfamily (S41B) is largely confined to archaea, with few bacterial representatives, and employs a catalytic tetrad consisting of Ser, His, Ser and Glu residues (S745, H746, S965, E1023).

    In bacteria, Ctps (such as E. coli Tsp) are believed to play an important role in degradation of improperly synthesised proteins. For instance, proteins translated from damaged mRNA are modified by addition of a C-terminal tag, which enables their recognition and degradation by Ctps[64]. In plants, Ctps have been implicated in processing the D1 protein of photosystem II, which is essential for correct assembly and functioning of the plant photosynthetic apparatus[60,65,66].

    We identified three genes each coding for Ctp-like proteins in the Arabidopsis and rice proteomes. While the Arabidopsis proteins are all predicted to be soluble and have been shown to be translocated into the thylakoid lumen, two of the three rice gene products (LOC_Os01g47450 and LOC_Os02g57060) are predicted to be mitochondria localised, while a single gene product, LOC_Os06g21380, is predicted to be translocated to chloroplasts (See additional files 1, 2: Table S1.pdf, Table S2.pdf). All Ctp-like proteins identified in the two plant species carry a PDZ domain N-terminal to the protease domain (Figure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). PDZ domains are protein-protein interaction modules that are likely to mediate substrate recognition for plant Ctp-like proteins[24]. Sequence analysis suggests that the Ctp-like proteins identified here may employ a serine-lysine catalytic dyad for proteolysis and possibly belong to the S41A subfamily of C-terminal processing peptidases (See additional file 13: Figure SF9.pdf). We identified three orthologous pairs (At3g57680:LOC_Os06g21380, At4g17740:LOC_Os02g57060,


    Figure 7. Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) family S28 protease domains. S28-like protease domains were aligned using the ClustalW [95] program and the alignments were exported to the Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors represent the two evolutionary clades identified in the analysis (see text for details). Clade I is represented in orange, Clade II is shaded green. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterisk, circles represent bootstrap values from 60%–80% while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene.

    [Tree image not reproduced. Leaf labels: At2g18080, At4g36195, At4g36190, Os10g36780, Os10g36760, At5g22860, Os11g05760, At5g65760, Os01g56150, At2g24280, Os06g43930. Clade labels: Clade I, Clade II.]


    At5g46390:LOC_Os01g47450) of six Ctp-like proteins across the two plant species (See additional file 4: Table S4.pdf). Each gene product displays a higher sequence identity with its orthologue in the corresponding genome than with similar gene products from the same species. This suggests that these groups were most likely established early in evolution and disseminated in the two plant species through speciation. Like prokaryotic Ctps, plant Ctp-like proteins are believed to play a role in degradation of incorrectly synthesised proteins in chloroplasts and mitochondria.

    SppA/protease IV family (Family S49)

    This group of serine peptidases belongs to MEROPS[5] family S49 and was first identified on the basis of its ability to degrade signal peptides that accumulate in the cytosol subsequent to their removal from precursor polypeptides by signal peptidases. Signal peptides are responsible for targeting various proteins to their correct subcellular localizations and are subsequently removed from the preproteins. The signal peptides generated following this cleavage need to be rapidly degraded since they may be harmful to cells by interfering with protein translocation or may accumulate in the membrane leading to cell lysis. E. coli SppA (protease IV), an inner membrane protein, was the first protein identified with signal peptide peptidase activity (See additional file 3: Table S3.pdf)[57,67]. The predicted active site residues for members of the protease IV family and the Clp protease family occur in the same order in the sequence (S, R/H, D), which forms the basis for their assignment to the SK clan[5].

    Members of the protease IV family have been identified in viruses, archaea and bacteria; among the higher eukaryotes, they have been identified only in plants. Two proteins, SppA1 and SppA2, have been identified in Synechocystis, while a single homologue, SppA, has been identified in Arabidopsis that lacks the putative transmembrane spanning segments predicted from the E. coli sequence[68].

    The protease IV family appears to be represented by single members in the Arabidopsis and rice proteomes and is not represented in any other eukaryotes. Though predicted by TargetP[19] to be mitochondria localised, experimental evidence indicates that Arabidopsis SppA (At1g73390) is a light-inducible membrane-bound protein with an unusual monotopic arrangement, associated with thylakoid membranes, and it has been proposed to be involved in light-dependent degradation of photosystem II and LHCII antenna proteins[68]. A single SppA homologue in rice (LOC_Os02g49570) is also predicted to be mitochondria localised and, in the absence of experimental evidence, it is tempting to speculate that the rice protein may also localise to chloroplasts. At1g73390 and LOC_Os02g49570 each carry two protease IV domains (Figure 2; see additional files 1, 2, 14: Table S1.pdf, Table S2.pdf, Figure SF10.pdf). Interestingly, the N-terminal protease IV domain of At1g73390 is 63% identical to the C-terminal protease IV domain of the rice counterpart LOC_Os02g49570, while sharing only 17% identity with the At1g73390 C-terminal protease IV domain and 14% identity with the LOC_Os02g49570 N-terminal protease IV domain. Likewise, the LOC_Os02g49570 N-terminal protease IV domain is 75% identical to the At1g73390 C-terminal protease IV domain and only 18% identical to the LOC_Os02g49570 C-terminal protease IV domain. This suggests that independent gene shuffling events subsequent to acquisition of SppA homologues in the two plant species may be responsible for the current domain architecture. Similar to their prokaryotic counterparts, they may function in association with signal peptidases to degrade leader peptides removed from precursor proteins.
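The crosswise relationship between the protease IV domains can be made explicit by tabulating the identities reported above and asking, for each N-terminal domain, which other domain it resembles most. The Python sketch below does this using only the five percentages stated in the text. The domain labels with `_N` and `_C` suffixes are naming conventions introduced here for illustration, and the identity between the two C-terminal domains is not reported in the text and is therefore absent from the table.

```python
# Pairwise identities (%) between protease IV domains, as reported in the text.
# Labels such as "At1g73390_N" are hypothetical names for the N-/C-terminal domains.
reported_identity = {
    frozenset({"At1g73390_N", "LOC_Os02g49570_C"}): 63,
    frozenset({"At1g73390_N", "At1g73390_C"}): 17,
    frozenset({"At1g73390_N", "LOC_Os02g49570_N"}): 14,
    frozenset({"LOC_Os02g49570_N", "At1g73390_C"}): 75,
    frozenset({"LOC_Os02g49570_N", "LOC_Os02g49570_C"}): 18,
}


def best_partner(domain: str) -> tuple[str, int]:
    """Return the domain sharing the highest reported identity with `domain`."""
    candidates = []
    for pair, identity in reported_identity.items():
        if domain in pair:
            (other,) = pair - {domain}  # the second member of the pair
            candidates.append((identity, other))
    identity, other = max(candidates)
    return other, identity


print(best_partner("At1g73390_N"))       # ('LOC_Os02g49570_C', 63)
print(best_partner("LOC_Os02g49570_N"))  # ('At1g73390_C', 75)
```

Each N-terminal domain finds its closest match in the C-terminal domain of the other species, which is the pattern the text interprets as evidence for domain shuffling.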

    Rhomboids (Family S54)

    Rhomboid proteins are a family of serine proteases that cleave substrates within transmembrane domains[69,70]. Rhomboids are one of the most widespread membrane protein families and are found in archaea, bacteria and eukaryotes including plants[70,71], suggesting that while their signaling mechanism, particularly protease activity, may be conserved across lineages, they may have additional roles (See additional file 3: Table S3.pdf). A proposed catalytic triad comprising Asn169, Ser217 and His281 has been identified using site-directed mutagenesis and each of the proposed active site residues is located within a transmembrane domain[72]. However, a recent report suggests that while Ser217 and His281 along with a glycine residue (two residues N-terminal to Ser217) are essential for catalysis, Asn169 is not required for catalytic activity and that rhomboids likely function as endopeptidases with a serine-histidine dyad[73].

    Rhomboid-like proteins have been identified in plants; however, no experimental information was available about them until recently. A rhomboid homologue from Arabidopsis, AtRBL2, was recently isolated and shown to have proteolytic activity and substrate specificity[74]. AtRBL2 was shown to cleave the Drosophila ligands Spitz and Keren, but not similar proteins like TGFα, when expressed in mammalian cells, thereby releasing soluble ligands from the cell into the media. AtRBL2, along with AtRBL1, another Arabidopsis rhomboid homologue, was found to localise to the Golgi apparatus in plant cells, consistent with previous observations where several eukaryotic rhomboid proteins have been predicted to localise within the Golgi apparatus[72,74]. Interestingly, plants appear to lack any component of the EGF signaling pathway other than rhomboid and it has been suggested that rhomboid-like


    proteins in plants may be involved in processing as yet uncharacterised substrates that may be conserved in other organisms as well[74].

    We identified 20 genes coding for rhomboid-like proteins in Arabidopsis and 18 in rice. Subsequent to multiple sequence alignment, it became evident that several proteins appeared to be catalytically inactive, owing to loss of or mutation(s) in residues believed to be essential for catalysis. While the role of these catalytically compromised proteins remains unclear, it is believed that they may function as regulators of active rhomboid proteases[71] (See additional files 1, 2, 15: Table S1.pdf, Table S2.pdf, Figure SF11.pdf). Of all the rhomboid-like proteins identified in the two plant species, only a few are likely to carry additional domains. Three Arabidopsis gene products (At2g41160, At3g56740, At3g58460) and two gene products from rice (LOC_Os01g16330, LOC_Os03g44830) are predicted to possess a ubiquitin-associated (UBA) domain C-terminal to the rhomboid protease domain. An Arabidopsis gene product, At3g17611, is predicted to have a zinc finger-like module C-terminal to the predicted rhomboid protease domain (Figure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). We identified 10 orthologous pairs across Arabidopsis and rice, suggesting a high degree of conservation of rhomboid mechanisms across the two plant species (See additional file 4: Table S4.pdf). TargetP[19] predictions suggest that a few rhomboid-like proteins from both the plant species may localise to mitochondria or chloroplasts (See additional files 1, 2: Table S1.pdf, Table S2.pdf). Recently, a rhomboid protease was identified in yeast mitochondria, where it is involved in processing of substrates localised to the mitochondrial intermembrane space[75]. Although rhomboid-like proteins are yet to be identified in chloroplasts, identification of a yeast mitochondrial rhomboid protease coupled with the possibility that a few rhomboid-like proteins in the two plant species may localise to mitochondria and/or chloroplasts suggests that rhomboid-like proteins may be involved in regulatory processes in mitochondria and chloroplasts of higher plants.

    Phylogenetic analysis of Arabidopsis and rice rhomboid-like proteins reveals the presence of several gene clusters that share low pairwise sequence identity with each ot