Journal of Infection and Public Health (2014) 7, 296-307

Structural and functional analysis of hypothetical and conserved proteins of Clostridium tetani

Shymaa Enany

Department of Microbiology and Immunology, Faculty of Pharmacy, Suez Canal University, Ismailia, Egypt

Received 4 December 2013; received in revised form 1 February 2014; accepted 14 February 2014

KEYWORDS: Clostridium tetani; Hypothetical proteins; Conserved proteins; Structure; Function; Bioinformatics

Summary The progress in biological technologies has led to rapid accumulation of microbial genomic sequences with a vast number of uncharacterized genes. Proteins encoded by these genes are usually uncharacterized, hypothetical, and/or conserved. In Clostridium tetani (C. tetani), these proteins constitute up to 50% of the expressed proteins. In this regard, understanding the functions and the structures of these proteins is crucially important, particularly in C. tetani, which is a medically important pathogen. Here, we used a variety of bioinformatics tools and databases to analyze 10 hypothetical and conserved proteins in C. tetani. We were able to provide a detailed overview of the functional contributions of some of these proteins in several cellular functions, including (1) evolving antibiotic resistance, (2) interaction with enzyme pathways, and (3) involvement in drug transportation. Among these candidates, we postulated the involvement of one of these hypothetical proteins in the pathogenic activity of tetanus. The structural and functional prediction of these proteins should serve in uncovering and better understanding the function of C. tetani cells to ultimately discover new possible drug targets.

© 2014 King Saud Bin Abdulaziz University for Health Sciences. Published by Elsevier Ltd. All rights reserved.

Correspondence to: Department of Microbiology and Immunology, Faculty of Pharmacy, Suez Canal University, Ismailia 41522, Egypt. Tel.: +1 619 335 7176. E-mail address: [email protected]

http://dx.doi.org/10.1016/j.jiph.2014.02.002
1876-0341/© 2014 King Saud Bin Abdulaziz University for Health Sciences. Published by Elsevier Ltd. All rights reserved.

Introduction

Clostridium tetani (C. tetani) is an obligate anaerobic bacillus. Its spores survive in soil and cause infection by contaminating wounds [1]. C. tetani can be found globally, but it is more common in third-world countries [2] and is usually recovered from soil, feces, and house dust. These rod-shaped bacilli produce a potent neurotoxin that causes tetanus wound infection [3]. Tetanus is a neuromuscular disease in which C. tetani exotoxin (tetanospasmin) produces muscle spasms, incapacitating its host [2]. The global incidence
of tetanus has been estimated at approximately one million cases annually [4], and the incidence in the USA has been reported to be approximately 0.10 individuals per 1 million persons, with a 13.2% case-fatality rate among cases with known outcomes and 31.3% among persons aged ≥65 years [5]. Mortality rates from tetanus vary greatly across the world, depending on access to healthcare, and approach 100% in the absence of medical treatment [6]. Tetanus is a highly lethal zoonotic disease, and cases usually occur sporadically. Outbreaks among humans are often associated with catastrophic events such as earthquakes, traumas, tsunami, and war wounds [7].

The C. tetani genome is composed of a chromosome that contains 2,799,250 bp and a plasmid, pE88, that contains 74,082 bp [8]. The C. tetani chromosome harbors some virulence genes, including open reading frames (ORFs) encoding tetanolysin O, hemolysin, and fibronectin-binding proteins [9,10]. The 74-kb C. tetani plasmid pE88 harbors the genes for the tetanus toxin (tetX) and its direct transcriptional regulator TetR [3,11]. Brüggemann et al. have identified 2372 ORFs in the genome and 61 ORFs in the plasmid, 28 of which display similarities to known proteins with assigned functions [8]. These proteins include 9 transport proteins similar to ATP-binding cassette (ABC) transporters, 5 transposition proteins with similarity to protein peptide-transporting systems, 3 peptidases, 7 regulatory proteins, 3 putative replication proteins, and an rRNA methyltransferase protein. More than 31 proteins of unknown function have been identified from the C. tetani genome, and they have been assigned as unknown, uncharacterized, hypothetical, or conserved proteins [8]. A hypothetical protein is a predicted protein expressed from ORFs for which there is no experimental evidence of translation [12,13]. The majority of hypothetical proteins are the product of pseudo-genes, but some hypothetical proteins have a high probability of being expressed [14]. Conserved proteins are defined as a fraction of gene products found in organisms from several phylogenetic lineages that have not been functionally characterized or described at the protein chemical level [15,16]. Such proteins may represent up to half of the potential protein-coding regions of a genome.

Bioinformatics tools, including free online servers and databases, have recently become available for use in the study of genomics and proteomics. These tools are predictive and form a core for genomics and proteomics analyses not only in microorganisms but also in humans, animals, and plants. Thus, they can provide plentiful data about hypothetical and conserved proteins. Taking into consideration that these types of proteins constitute a major part of C. tetani genome products, we aimed to structurally characterize these unknown, hypothetical, or conserved proteins to elucidate the structure-function relationship of the uncharacterized gene products, to predict protein-protein interactions, and to determine the protein physicochemical properties. The structural comparison with known proteins provides a valuable piece of information that helps in predicting the cellular functions of these hypothetical proteins.

Methods

Bacterial strain used and sequence retrieval

Toxigenic C. tetani E88, variant strain Massachusetts, was used in the current study. The complete genome sequence of E88 was first reported in 2003 [8]. Ten of its hypothetical and conserved protein sequences were retrieved from the Universal Protein Resource (UniProt) database (http://www.uniprot.org/). Their sequence IDs were Q892L8, Q898Y4, Q899E1, Q899X6, Q891V3, Q899W4, Q898T0, Q899Y4, Q899A6, and Q897V7 (Table 1).

Physicochemical properties and functional characterization

ExPASy's ProtParam server (http://web.expasy.org/protparam/) was used for detection of the physicochemical properties of these proteins [17]. The number of amino acids, molecular weight, theoretical isoelectric point (pI), extinction coefficient [18], instability index [19], aliphatic index [20], grand average hydropathy (GRAVY) [21], and total number of positive and negative residues were computed for each protein.
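As a rough illustration of how some of these quantities are defined, the sketch below computes GRAVY, the aliphatic index, the charged-residue counts, and a 280 nm extinction coefficient directly from a sequence. It is not ProtParam itself: the function names and the toy sequences are assumptions made for illustration. The hydropathy values are the standard Kyte-Doolittle scale, and the aliphatic-index coefficients (2.9 and 3.9) and molar absorbances (Trp 5500, Tyr 1490, cystine 125) are the commonly published ones. The instability index is omitted because it requires a dipeptide weight table.

```python
# Minimal ProtParam-style calculations (illustrative sketch, not the ExPASy code).

# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathy: summed residue hydropathy / length."""
    return sum(KD[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Relative volume of aliphatic side chains, from mole percentages."""
    def pct(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))


def charged_counts(seq):
    """Return (Asp + Glu, Arg + Lys) residue counts."""
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


def extinction_coefficient(seq, cystines=True):
    """Estimated molar extinction coefficient at 280 nm.

    With cystines=True, every pair of Cys residues is assumed to form a cystine.
    """
    value = 5500 * seq.count("W") + 1490 * seq.count("Y")
    if cystines:
        value += 125 * (seq.count("C") // 2)
    return value
```

For a real analysis, the retrieved UniProt sequence would be passed to these functions in place of the toy strings; the values should then be compared against the server output rather than trusted on their own.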

Identification of protein family, domain, and clan

For identifying protein families, domains, and clans, the Pfam 27.0 (March 2013, 14,831 families) database (http://pfam.sanger.ac.uk/) was used. Pfam is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). There are two components to the Pfam database: Pfam-A and Pfam-B. Pfam-A entries are high-quality, manually curated families. Pfam-B is useful for identifying functionally conserved regions when no Pfam-A entries are found [22].

Table 1 The ten hypothetical and conserved proteins used in this study.

| UniProt ID | Protein names | Gene names | Organism | Length |
|---|---|---|---|---|
| Q892L8 | Hypothetical lipoprotein | CTC_02078 | Clostridium tetani (strain Massachusetts/E88) | 202 |
| Q898Y4 | Hypothetical membrane protein | CTC_00303 | Clostridium tetani (strain Massachusetts/E88) | 200 |
| Q899E1 | Hypothetical membrane protein | CTC_00241 | Clostridium tetani (strain Massachusetts/E88) | 548 |
| Q899X6 | Hypothetical membrane-spanning protein | CTC_p38 | Clostridium tetani (strain Massachusetts/E88) | 169 |
| Q891V3 | Hypothetical sugar-binding protein | CTC_02260 | Clostridium tetani (strain Massachusetts/E88) | 448 |
| Q899W4 | Hypothetical membrane-spanning protein | CTC_p51 | Clostridium tetani (strain Massachusetts/E88) | 158 |
| Q898T0 | Hypothetical membrane-associated protein | CTC_00362 | Clostridium tetani (strain Massachusetts/E88) | 386 |
| Q899Y4 | Hypothetical membrane-spanning protein | CTC_p30 | Clostridium tetani (strain Massachusetts/E88) | 646 |
| Q899A6 | Conserved protein | CTC_00279 | Clostridium tetani (strain Massachusetts/E88) | 300 |
| Q897V7 | Conserved protein | CTC_00617 | Clostridium tetani (strain Massachusetts/E88) | 256 |

Conserved domain detection

To search for conserved domains in the protein query sequences, NCBI's CD-Search interface was used (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. It uses RPS-BLAST, a variant of PSI-BLAST, to quickly scan a set of pre-calculated position-specific score matrices (PSSMs) with a protein query [23].
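The idea of scanning a query against a position-specific score matrix can be sketched in a few lines. This is a toy, ungapped scan and not RPS-BLAST: the three-column matrix, the default score for unlisted residues, and the function names are all hypothetical and chosen only to make the scoring step visible.

```python
# Toy position-specific scoring: slide a PSSM along a query and keep the best window.
# The matrix values are made up for illustration; real CDD models are far longer.

DEFAULT_SCORE = -2  # hypothetical score for residues not listed at a position

TOY_PSSM = [
    {"C": 5, "A": -1},  # column 1 favors Cys
    {"G": 4},           # column 2 favors Gly
    {"H": 6},           # column 3 favors His
]


def window_score(window, pssm, default=DEFAULT_SCORE):
    """Sum the per-position scores of one sequence window."""
    return sum(column.get(aa, default) for aa, column in zip(window, pssm))


def best_window(seq, pssm, default=DEFAULT_SCORE):
    """Return (start_index, score) of the highest-scoring ungapped window."""
    width = len(pssm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the PSSM")
    scored = [
        (window_score(seq[i:i + width], pssm, default), i)
        for i in range(len(seq) - width + 1)
    ]
    # max() keeps the first window when several share the top score.
    score, start = max(scored, key=lambda item: item[0])
    return start, score
```

In this sketch each column only lists the residues it rewards or penalizes explicitly, which keeps the example short; a full PSSM stores a score for all 20 residues at every position.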

Protein three-dimensional (3D) structure construction

The Template-based Protein Structure Prediction Online Server (PS)2-v2 (http://ps2v2.life.nctu.edu.tw/) was used to construct the 3D structures of the hypothetical and conserved proteins. (PS)2-v2 is an automated homology modeling server that accepts protein query sequences in FASTA format. The method uses an effective consensus strategy by combining PSI-BLAST, IMPALA, and T-Coffee in both template selection and target-template alignment. The final 3D structure is built using the modeling package MODELLER depending on the best scored orthologous template. (PS)2-v2 adds a multiple model strategy and utilizes ProQ to assess and select the final model [24,25].

Determination of soluble or trans-membrane proteins

To determine whether the protein is a soluble or a trans-membrane located protein, the SOSUI server (SOSUI engine version 1.10) was used (http://bp.nuap.nagoya-u.ac.jp/sosui/). SOSUI is an online tool that predicts a part of the secondary structure of proteins from a given amino acid sequence to determine whether these proteins are soluble or trans-membrane proteins [26].
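A much simpler relative of this kind of prediction is a sliding-window hydropathy scan, sketched below. It is not the SOSUI algorithm, which also weighs amphiphilicity, charge, and sequence length. The window of 19 residues and the 1.6 threshold are the common Kyte-Doolittle rule of thumb, the function names are hypothetical, and the flanking residues in the toy sequence are invented. The hydrophobic stretch is the trans-membrane region reported for Q897V7 in Table 5.

```python
# Sliding-window hydropathy scan (illustrative sketch, not SOSUI).

# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment):
    """Average Kyte-Doolittle hydropathy of a sequence segment."""
    return sum(KD[aa] for aa in segment) / len(segment)


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Return 0-based start indices of windows whose mean hydropathy >= threshold."""
    return [
        i for i in range(len(seq) - window + 1)
        if mean_hydropathy(seq[i:i + window]) >= threshold
    ]


# Trans-membrane region reported for Q897V7 in Table 5.
TM_Q897V7 = "IVAALVAVMVVGTTVFAAGKIFS"

# Invented charged flanks around the reported segment, for demonstration only.
toy_sequence = "DEKRDEKRDE" + TM_Q897V7 + "DEKRDEKRDE"
tm_starts = candidate_tm_windows(toy_sequence)
```

Windows that start inside the charged flank score well below the threshold, while windows lying within the reported segment pass it, which is the behavior the rule of thumb is meant to capture.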

Prediction of protein sub-cellular localization

Sub-cellular localization of each hypothetical and conserved protein was predicted by PSORTb version 3.0.2 (http://www.psort.org/psortb/). PSORTb is the most precise web-based tool used for the prediction of sub-cellular localization in bacterial protein sequences; it accepts one or more Gram-positive or Gram-negative bacterial sequences or archaeal sequences in FASTA format [27].

Table 2 The physicochemical properties of the involved hypothetical and conserved proteins.

| UniProt ID | Molecular weight | Theoretical pI | Extinction coefficients | Instability index | Aliphatic index | Grand average of hydropathicity (GRAVY) | Total number of negatively charged residues (Asp + Glu) | Total number of positively charged residues (Arg + Lys) |
|---|---|---|---|---|---|---|---|---|
| Q892L8 | 21,895 | 9.62 | 22,015 | 27.88 (stable) | 76.14 | −0.157 | 12 | 24 |
| Q898Y4 | 23,873.7 | 9.77 | 18,005 | 37.20 (stable) | 121.85 | 0.297 | 17 | 32 |
| Q899E1 | 61,752.6 | 9.28 | 45,395 | 33.19 (stable) | 143.54 | 0.855 | 34 | 47 |
| Q899X6 | 20,194.3 | 9.61 | 30,035 | 24.22 (stable) | 140.77 | 0.683 | 9 | 18 |
| Q891V3 | 52,969.6 | 9.10 | 71,170 | 39.50 (stable) | 79.17 | −0.660 | 59 | 71 |
| Q899W4 | 19,194.1 | 9.35 | 33,015 | 30.69 (stable) | 132.03 | 0.639 | 10 | 17 |
| Q898T0 | 45,834.6 | 8.95 | 50,895 | 45.31 (unstable) | 107.80 | −0.095 | 47 | 55 |
| Q899Y4 | 74,255.2 | 9.08 | 69,400 | 37.27 (stable) | 121.75 | 0.181 | 48 | 66 |
| Q899A6 | 33,719.7 | 8.83 | 25,580 | 25.47 (stable) | 94.23 | −0.346 | 38 | 44 |
| Q897V7 | 28,981.9 | 8.67 | 30,495 | 18.43 (stable) | 71.13 | −0.590 | 30 | 34 |

Cysteine disulfide bonding state and connectivity prediction

For predicting the disulfide bonding state of cysteines and their disulfide connectivity starting from the FASTA protein sequence, the DISULFIND server (http://disulfind.dsi.unifi.it/) was used. The server predicts disulfide patterns in two computational stages: (1) the disulfide bonding state of each cysteine is predicted by a BRNN-SVM binary classifier; and (2) cysteines that are known to participate in the formation of bridges are paired using a Recursive Neural Network to obtain a connectivity pattern [28].
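To see why the second stage is a non-trivial search, it helps to enumerate the candidate connectivity patterns for a set of bonded cysteines. The sketch below does only that enumeration; it does not reproduce the DISULFIND classifiers, and the function names and example positions are hypothetical. For 2n bonded cysteines there are (2n − 1)!! possible patterns, so 4 cysteines give 3 patterns and 6 give 15.

```python
# Enumerating candidate disulfide connectivity patterns (illustrative sketch).


def cysteine_positions(seq):
    """Return the 1-based positions of every cysteine in the sequence."""
    return [i for i, aa in enumerate(seq, start=1) if aa == "C"]


def _pairings(positions):
    """Yield every way to split the positions into unordered pairs."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in _pairings(remaining):
            yield [(first, partner)] + tail


def connectivity_patterns(bonded_positions):
    """List all pairings of an even number of bonded cysteines into bridges."""
    if len(bonded_positions) % 2:
        raise ValueError("an even number of bonded cysteines is required")
    return list(_pairings(list(bonded_positions)))
```

A predictor then has to rank these candidates; the rapid growth of the list with the number of cysteines is one reason the bonding-state step is performed first.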

Protein-protein interaction prediction

To predict the protein interactions, the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING version 9.1) was used (http://string-db.org/). STRING is a database of known and predicted protein interactions covering 5,214,234 proteins from 1133 organisms. The interactions include physical and functional associations and are derived from genomic context, high-throughput experiments, conserved expression, and previous knowledge. It quantitatively integrates interaction data from these sources for a large number of organisms and transfers information between these organisms where applicable [29,30]. Protein scores greater than 0.444 were included in the results.
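Applying a score cutoff of this kind to a list of predicted partners is a simple filter, sketched below. The partner names and scores are invented placeholders, not STRING output, and treating the cutoff as inclusive is an assumption made here for illustration.

```python
# Filtering predicted interaction partners by confidence score (illustrative sketch).


def select_partners(scores, cutoff=0.444):
    """Return (partner, score) pairs with score >= cutoff, highest score first."""
    kept = [(name, score) for name, score in scores.items() if score >= cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical partners and scores, for demonstration only.
example_scores = {"partner_A": 0.90, "partner_B": 0.444, "partner_C": 0.20}
```

Raising the cutoff to 0.5 in this example drops `partner_B`, which shows how sensitive the reported network can be to the threshold that is chosen.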

Results and discussion

The major contribution of C. tetani to the high mortality rate worldwide and particularly in developing countries has motivated many researchers to study this pathogenic organism [2]. Protein products of C. tetani are considered one of the virulence factors involved in its pathogenesis and thus contribute to diseases [1]. Hypothetical and conserved proteins constitute up to 50% of C. tetani-produced proteins [8]. It was reported in the last update of C. tetani annotation in the UniProt database that it has nearly 538 hypothetical proteins, and of those, 452 are listed as "no product" [31]. Analysis of these hypothetical and conserved proteins can provide a lot of information about the pathogenesis of this important pathogen. Although many studies have analyzed hypothetical proteins in different microorganisms such as Staphylococcus aureus [13], Helicobacter pylori [12], Mycobacterium tuberculosis [32], and Haemophilus influenzae [33], there is no similar study on C. tetani. To our knowledge, this study provides the first structural and functional analysis of C. tetani hypothetical and conserved proteins.

The physicochemical properties of each hypothetical and conserved protein computed by ProtParam, including the amino acid composition, molecular weight, theoretical pI, extinction coefficient, instability index, aliphatic index, grand average of hydropathicity (GRAVY), and total number of negatively and positively charged residues, are shown in Table 2. Theoretical isoelectric points (pI) for these proteins range from 8.67 to 9.77. The theoretical pI is the pH at which a particular molecule or surface carries no net electrical charge, and it is useful for understanding the protein charge stability [17]. The extinction coefficient of the hypothetical and conserved proteins at 280 nm ranges from 18,005 to 71,170 with respect to the concentration of cysteine (Cys), tryptophan (Trp), and tyrosine (Tyr). It is defined as a measurement of how strongly a protein absorbs light at a given wavelength. A high extinction coefficient indicates the presence of a high concentration of Cys, Trp, and Tyr in the hypothetical and/or conserved proteins [17]. The computed extinction coefficients aid in the quantitative study of protein-protein and protein-ligand interactions in solution [13]. The instability index provides an estimate of the stability of a protein in a test tube. A protein whose instability index is smaller than 40 is predicted as stable [19]. The instability index values for our hypothetical and conserved proteins range from 18.43 to 45.31. All proteins were stable except for the hypothetical membrane-associated protein (Q898T0), which was unstable. The aliphatic index of a protein is defined as the relative volume occupied by aliphatic side chains [alanine (Ala), valine (Val), isoleucine (Ile), and leucine (Leu)]. It may be regarded as a positive factor for the increase in thermo-stability of globular proteins [17,20]. The aliphatic index for the hypothetical and conserved protein sequences involved in this study ranged from 71.13 to 143.54. A high aliphatic index indicates that a protein is thermo-stable over a wide temperature range [20]. The GRAVY value for a peptide or protein is calculated as the sum of hydropathy values of all of the amino acids divided by the number of residues in the sequence [17,21]. The hypothetical and conserved proteins in this study had GRAVY indexes ranging from −0.66 to 0.855. This low GRAVY range indicates the possibility of being a globular (hydrophilic) protein rather than membranous (hydrophobic). This information might be useful for localizing these proteins.
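The ranges and the stability call quoted above can be re-derived from the Table 2 values with a short script. The numbers below are transcribed from Table 2; the dictionary layout, the function names, and the reading of "smaller than 40" as a strict cutoff are assumptions made for this sketch.

```python
# Re-deriving the reported ranges and stability calls from Table 2 (illustrative sketch).

TABLE2 = {
    # UniProt ID: (theoretical pI, instability index, aliphatic index, GRAVY)
    "Q892L8": (9.62, 27.88, 76.14, -0.157),
    "Q898Y4": (9.77, 37.20, 121.85, 0.297),
    "Q899E1": (9.28, 33.19, 143.54, 0.855),
    "Q899X6": (9.61, 24.22, 140.77, 0.683),
    "Q891V3": (9.10, 39.50, 79.17, -0.660),
    "Q899W4": (9.35, 30.69, 132.03, 0.639),
    "Q898T0": (8.95, 45.31, 107.80, -0.095),
    "Q899Y4": (9.08, 37.27, 121.75, 0.181),
    "Q899A6": (8.83, 25.47, 94.23, -0.346),
    "Q897V7": (8.67, 18.43, 71.13, -0.590),
}

STABILITY_CUTOFF = 40.0  # an instability index below this is predicted as stable


def unstable_proteins(table, cutoff=STABILITY_CUTOFF):
    """Return IDs whose instability index is at or above the cutoff."""
    return [pid for pid, row in table.items() if row[1] >= cutoff]


def column_range(table, index):
    """Return (min, max) of one column of the table."""
    values = [row[index] for row in table.values()]
    return min(values), max(values)
```

Calling `unstable_proteins(TABLE2)` singles out Q898T0, and `column_range` applied to each column gives the minima and maxima quoted in the paragraph above.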

Generally, proteins are composed of one or more functional and/or structural regions, commonly termed domains. Different combinations of domains give rise to the diverse range of protein families found in nature [22]. In our analysis, the proteins were classified into particular families based on the presence of a specific domain in their sequence using a Pfam database search and NCBI/CDD-BLAST. The Pfam domains and families found in the 10 hypothetical and conserved proteins are summarized in Table 3. The specific domains identified in the Pfam search include NlpC/P60, SBP_bac_1, DUF58, Peptidase_U57, and DUF4367. Six proteins out of the 10 used proteins have a Pfam A family, and one of them also has a Pfam B family. These families include the ABC-2 family, YabG peptidase U57 family, extracellular solute-binding proteins, and unknown function families. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries that are related by similarity of sequence, structure, or profile hidden Markov models (HMM) [22]. Four clans were identified in our analysis: the peptidase clan (CL0125), periplasmic binding protein clan (CL0177), Von Willebrand factor type A (CL0128), and ABC-2-transporter-like clan (CL0181). The description of each Pfam family and clan is provided in Table 3.

Another set of protein classification results, based on the presence of a conserved domain in the protein sequence using NCBI/CDD-BLAST, is listed in Table 4. The proteins have conserved domains such as NlpC/P60, Spr, SBP_bac_1, SBP_bac_8, DUF58, COG1721, Acyl_transf_3, and Peptidase_U57. Some of these domains are involved in important biological operations such as enzymatic processes, antibiotic resistance, and bacterial transport systems. Five proteins have no domains in their sequence according to the CDD search. The same results for these proteins were obtained from the Pfam analysis except for the protein Q897V7, which contains a Pfam A domain (DUF4367) with unknown function. The identification of domains in the hypothetical and conserved proteins increases the possibility of protein involvement in the same domain function [13]. The 5 proteins containing conserved domains were assigned by the NCBI BLAST to super-families. A super-family is a set of conserved domain models that generate overlapping annotation on the same protein sequences. These models are assumed to represent evolutionarily related domains and may be redundant with each other [23]. The curated super-families here were relatively close to the identified Pfam families (Table 4).

Table 3 The Pfam families, domains, and clans of the hypothetical and conserved proteins.

| UniProt ID | Pfam-A | Pfam-B | Pfam family description | Domain | Clan | Clan description |
|---|---|---|---|---|---|---|
| Q892L8 | NlpC/P60 | - | The function of this domain is unknown. It is found in several lipoproteins | NlpC/P60 | CL0125 Peptidase clan CA | This clan includes peptidases with the papain-like fold. This clan contains 60 families, and the total number of domains in the clan is 76,571. The clan was built by A Bateman. |
| Q898Y4 | - | - | | | | |
| Q899E1 | - | - | | | | |
| Q899X6 | - | - | | | | |
| Q891V3 | SBP_bac_1 | - | Bacterial extracellular solute-binding protein | SBP_bac_1 | CL0177 Periplasmic binding protein clan | Periplasmic binding proteins (PBPs) have fused with genes for integral membrane proteins in the process of evolution and genes encoding. Mammalian receptors contain extracellular ligand binding domains that are homologous to the PBPs as glutamate/glycine-gated ion channels. This clan contains 23 families, and the total number of domains in the clan is 242,652. The clan was built by A Bateman. |
| Q899W4 | - | - | | | | |
| Q898T0 | DUF58 | Pfam-B 68,838 | This family of prokaryotic proteins have unknown function. | DUF58 | CL0128 von Willebrand factor type A | This clan contains 16 families, and the total number of domains in the clan is 35,661. The clan was built by RD Finn. |
| Q899Y4 | ABC2_memb_4 | - | ABC-2 family transporter protein | - | CL0181 ABC-2-transporter-like clan | These families are similar to the ABC-2 transporter subfamily. Members of this family are involved in drug transport and resistance. This clan contains 10 families, and the total number of domains in the clan is 49,701. The clan was built by M Fenech. |
| Q899A6 | Peptidase_U57 | - | YabG peptidase U57; is a protease involved in the proteolysis and maturation of SpoIVA and YrbA proteins, conserved with the cortex, and/or coat assembly by Bacillus subtilis | Peptidase_U57 | n/a | |
| Q897V7 | DUF4367 | - | This family of proteins is found in bacteria, archaea and eukaryotes. Proteins in this family are typically between 229 and 435 amino acids in length | DUF4367 | n/a | |

Table 4 The identified domains and their family description for the ten hypothetical and conserved proteins using NCBI/CDD-BLAST.

| UniProt ID | Domains | Super-family description |
|---|---|---|
| Q892L8 | NlpC/P60 | The function of this domain is unknown. It is found in several lipoproteins. |
| | Spr | Cell wall-associated hydrolases (invasion-associated proteins). |
| Q898Y4 | - | - |
| Q899E1 | - | - |
| Q899X6 | - | - |
| Q891V3 | SBP_bac_1 | This family includes the bacterial extracellular solute-binding protein POTD/POTF. |
| | SBP_bac_8 | This family includes bacterial extracellular solute-binding proteins. |
| Q899W4 | - | - |
| Q898T0 | DUF58 | This family of prokaryotic proteins have unknown function. |
| | COG1721 | Uncharacterized conserved protein. |
| Q899Y4 | Acyl_transf_3 | This family includes a range of acyltransferase enzymes. |
| Q899A6 | Peptidase_U57 | YabG Peptidase U57 is a protease involved in the proteolysis and maturation of SpoIVA and YrbA proteins. |
| Q897V7 | - | - |
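The agreement between the two domain searches can be checked with a few set operations. The assignments below are transcribed from Tables 3 and 4; the variable and function names are hypothetical, and the sketch compares only whether a protein received any hit, not whether the two tools named the same domain.

```python
# Comparing Pfam-A and CDD domain hits per protein (illustrative sketch).

ALL_IDS = [
    "Q892L8", "Q898Y4", "Q899E1", "Q899X6", "Q891V3",
    "Q899W4", "Q898T0", "Q899Y4", "Q899A6", "Q897V7",
]

# Pfam-A family per protein, as listed in Table 3.
PFAM_A = {
    "Q892L8": "NlpC/P60",
    "Q891V3": "SBP_bac_1",
    "Q898T0": "DUF58",
    "Q899Y4": "ABC2_memb_4",
    "Q899A6": "Peptidase_U57",
    "Q897V7": "DUF4367",
}

# CDD domains per protein, as listed in Table 4.
CDD = {
    "Q892L8": ["NlpC/P60", "Spr"],
    "Q891V3": ["SBP_bac_1", "SBP_bac_8"],
    "Q898T0": ["DUF58", "COG1721"],
    "Q899Y4": ["Acyl_transf_3"],
    "Q899A6": ["Peptidase_U57"],
}


def compare_hits(all_ids, pfam, cdd):
    """Return (IDs with a Pfam-A hit only, IDs with no CDD hit)."""
    pfam_only = set(pfam) - set(cdd)
    no_cdd = [pid for pid in all_ids if pid not in cdd]
    return pfam_only, no_cdd


pfam_only, no_cdd = compare_hits(ALL_IDS, PFAM_A, CDD)
```

The result reproduces the counts in the text: six proteins with a Pfam-A family, five without a CDD domain, and Q897V7 as the only protein annotated by Pfam alone.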

A list of C. tetani E88 variant strain Massachusetts protein structures was deposited in the RCSB Protein Data Bank (PDB). Hypothetical and conserved protein 3D-structures in this study were modeled by the (PS)2-v2 server as shown in Figure 1. The templates used by the (PS)2-v2 server were 2k1gA for Q892L8, 1f9zA for Q898Y4, 1xmeA for Q899E1, 2vt4B for Q899X6, 3c4mA for Q891V3, 1f9zA for Q899W4, 1yvrA for Q898T0, 2cfqA for Q899Y4, 1tmyA for Q899A6, and 2p3yA for Q897V7. The modeling templates in this analysis were automatically selected by the server and revised manually. MODELLER is used for homology or comparative modeling of protein 3D structures. An alignment of a sequence to be modeled was provided with known related structures, and MODELLER automatically calculated a model containing all non-hydrogen atoms [34].

Figure 1 Predicted 3D-structure models of the hypothetical and conserved proteins used in this study. The UniProt IDs and templates of these proteins are (1) Q892L8 and 2k1gA, (2) Q898Y4 and 1f9zA, (3) Q899E1 and 1xmeA, (4) Q899X6 and 2vt4B, (5) Q891V3 and 3c4mA, (6) Q899W4 and 1f9zA, (7) Q898T0 and 1yvrA, (8) Q899Y4 and 2cfqA, (9) Q899A6 and 1tmyA, and (10) Q897V7 and 2p3yA, respectively.

Table 5 SOSUI identification of the soluble/trans-membrane hypothetical and conserved proteins.

| UniProt ID | N-terminal | Trans-membrane region | C-terminal | Type | Length | Comments |
|---|---|---|---|---|---|---|
| Q892L8 | 10 | KNLLISMVMFCAILFGLNKVNIV | 32 | Primary | 23 | This amino acid sequence is of a membrane protein that has 1 trans-membrane helix. |
| Q898Y4 | 19 | FMLIMGFIFFILPFILLFFKILD | 41 | Primary | 23 | This amino acid sequence is of a membrane protein that has 2 trans-membrane helices. |
| | 46 | TYLAAIELLIVIAIIAKINMER | 67 | Primary | 22 | |
| Q899E1 | 23 | AITLVILLAIGFAPLLKSFAMAT | 45 | Primary | 23 | This amino acid sequence is of a membrane protein that has 12 trans-membrane helices. |
| | 61 | LLYTGLSATSLAIFIFAIFHVMN | 83 | Primary | 23 | |
| | 107 | SKLILIVIYEYVTEIIFLLPILI | 129 | Primary | 23 | |
| | 146 | AIIFLVNPIVPTILCTILIMIIM | 168 | Primary | 23 | |
| | 181 | KMISGVIALGIGIGINLLIQGTF | 203 | Secondary | 23 | |
| | 249 | GFINICLFLIITLLFLVIFIYIA | 271 | Primary | 23 | |
| | 315 | IKVLLRTPIYFINCILMNFLWPI | 337 | Primary | 23 | |
| | 366 | MKTVVIIAFSVILFITATNGIAA | 388 | Primary | 23 | |
| | 413 | ISKILTAVIIEVVSTFLILLLVN | 435 | Primary | 23 | |
| | 445 | AMSIVIIAIPAMFFSSLMGLIID | 467 | Primary | 23 | |
| | 485 | NLNIVGNMVLNIIIAVITAITFI | 507 | Primary | 23 | |
| | 514 | TITVLLIMAIYTLIDIFMYNFLK | 536 | Primary | 23 | |
| Q899X6 | 101 | FSINKLIFTIIIFTIISITTMQN | 123 | Primary | 23 | This amino acid sequence is of a membrane protein that has 3 trans-membrane helices. |
| | 143 | LGPQTLNFKIIEVLKWIIPHIFL | 165 | Secondary | 23 | |
| | 200 | ISLLIITFFYFFIGFIIVLALAM | 222 | Primary | 23 | |
| Q891V3 | 88 | RTIMYCTLIIILLFASGIYR | 107 | Primary | 20 | This amino acid sequence is of a membrane protein that has 1 trans-membrane helix. |
| Q899W4 | 102 | TSIIIFNCFFMTILFYYNNIIII | 124 | Primary | 23 | This amino acid sequence is of a membrane protein that has 4 trans-membrane helices. |
| | 153 | LMELTYIIKIVFLLIFLLEFFYL | 175 | Primary | 23 | |
| | 189 | NIIIYLAIGFGIYCVSFLFIK | 209 | Primary | 21 | |
| | 216 | RLFMTLISTEIFSLILLKLVLK | 237 | Primary | 22 | |
| Q898T0 | 95 | FLLIFLGAFLYAFFEGGYFANAI | 117 | Secondary | 23 | This amino acid sequence is of a membrane protein that has 2 trans-membrane helices. |
| | 119 | ICLVIVALFSFMYILITAKLMKV | 141 | Primary | 23 | |
| Q899Y4 | 95 | FILAFLVYLLPSIIFYTALCLFL | 117 | Primary | 23 | This amino acid sequence is of a membrane protein that has 8 trans-membrane helices. |
| | 123 | HSIISVVLCIFLFIFSDTMPEK | 144 | Primary | 22 | |
| | 172 | IVNRLLYILLAIILTTNTPTYKL | 194 | Primary | 23 | |
| | 199 | YLISFLLGQVILILLFLLAWLKD | 221 | Primary | 23 | |
| | 226 | ILEKYSIVLLYSLFLSLLGLLIS | 248 | Primary | 23 | |
| | 255 | VIGYIVPSIYCLSQIIIGISYLE | 277 | Primary | 23 | |
| | 297 | MGNIYFTIFSVIILLIINFFYLS | 319 | Primary | 23 | |
| | 327 | NLIFSTIILIICCCTTLGIYSFI | 349 | Primary | 23 | |
| Q899A6 | - | - | - | - | - | This amino acid sequence is of a soluble protein. |
| Q897V7 | 120 | IVAALVAVMVVGTTVFAAGKIFS | 142 | Primary | 23 | This amino acid sequence is of a membrane protein that has 1 trans-membrane helix. |

Table 6 Sub-cellular localization of the hypotheticaland conserved proteins using PSORTb.

UniProt ID Subcellular localization PSORTbscore

Q892L8 Extracellular 9.72Q898Y4 Cytoplasmic 7.50Q899E1 Cytoplasmic membrane 10Q899X6 Cytoplasmic membrane 9.55Q891V3 Unknown —Q899W4 Cytoplasmic membrane 10Q898T0 Unknown —Q899Y4 Cytoplasmic membrane 10

ius

tbttbittomrmoilawtnitdt[

Sbwpiot

304

To distinguish whether the hypothetical and con-served proteins used in this study are soluble ortrans-membrane proteins, the SOSUI server wasused. This online tool looks for �-helices, takinginto account the known helical potentials of thegiven amino acid sequence, and then differenti-ates between the �-helices in soluble proteins andthose in trans-membrane proteins. Four character-istics of the amino acid sequence are applied by theSOSUI predictions: hydropathy index, amphiphilic-ity index, amino acid sequence charge, and aminoacid sequence length [26]. The results indicatedthat all applied proteins were membrane proteinsexcept for the protein with ID Q899A6, which wassoluble. Membrane proteins have a hydropathy pro-file and at least one trans-membrane helix. Themaximum amount of trans-membrane helices foundwas 12 in the protein Q899E1. The number of trans-membrane domains for each protein and their N-and C-terminals are shown in Table 5. We shouldmention that membrane proteins that do not havetrans-membrane �-helices, such as porins, or thatare fixed with a covalent bond cannot be found bySOSUI [35].

Psortb2 software was used to predict the sub-cellular localization of the proteins (Table 6). Outof the ten proteins used, one protein is extracel-lular, two are cytoplasmic, four are cytoplasmicmembrane, and three have unknown locations. Thehighest Psortb scores are found in the cytoplasmicmembrane proteins Q899E1, Q899W4, and Q899Y4,which have a Psortb score of 10, and in the proteinQ899X6, which has a Psortb score of 9.55 (Table 6).Detection of the sub-cellular localization for thehypothetical and conserved proteins is essentialfor interpreting their functions in bacterial cellularprocesses and thus improving target identificationin the drug discovery process [13].

Table 6 (continued)

UniProt ID   Localization   Score
Q899A6       Cytoplasmic    7.50
Q897V7       Unknown        -

Disulfide bridges play a major role in the stabilization of the folding process for several proteins. Prediction of disulfide bridges from hypothetical and conserved protein sequences is valuable for the study of structural and functional properties of these proteins. In addition, knowledge about the disulfide bonding state of cysteines may aid the experimental structure determination process and may be useful in other genomic annotation tasks [28]. An analysis of the ten proteins revealed two proteins, Q899X6 and Q899W4, with disulfide bonds in their annotated sequences. The disulfide bonding state with its confidence, the position in the sequence of the pair of cysteines predicted to form a disulfide bridge, and the confidence of connectivity of the individual bonding states are given in Table 7. Disulfide bond formation is a post-translational modification that is important for stabilizing protein structures and for understanding the structural-functional relationship [36].
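As a small illustration of how the Table 7 output can be read programmatically, the sketch below transcribes the bonding state and bonded cysteine pair for each protein and selects the disulfide-bonded ones. It also shows how candidate cysteine pairs are enumerated from a sequence, which is the search space a connectivity predictor has to rank. The function names and the toy sequence are hypothetical; no predictor is called.

```python
from itertools import combinations

# Rows transcribed from Table 7: (UniProt ID, DB state, bonded cysteine pair).
# DB state: 1 = disulfide bonded, 0 = not bonded, None = not determined (ND).
TABLE_7 = [
    ("Q892L8", 0, None),
    ("Q898Y4", 0, None),
    ("Q899E1", 0, None),
    ("Q899X6", 1, (5, 53)),
    ("Q891V3", None, None),
    ("Q899W4", 1, (22, 115)),
    ("Q898T0", 0, None),
    ("Q899Y4", 0, None),
    ("Q899A6", 0, None),
    ("Q897V7", 0, None),
]


def bonded_proteins(rows):
    """Map each disulfide-bonded protein to its predicted cysteine pair."""
    return {uid: pair for uid, state, pair in rows if state == 1}


def cysteine_positions(seq):
    """1-based positions of cysteine residues in a protein sequence."""
    return [i for i, aa in enumerate(seq, start=1) if aa == "C"]


def candidate_pairs(seq):
    """All cysteine pairs that a connectivity predictor would have to rank."""
    return list(combinations(cysteine_positions(seq), 2))


print(bonded_proteins(TABLE_7))
print(candidate_pairs("MACDEFCGHC"))  # toy sequence, not a C. tetani protein
```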

Although the functional analysis features of the hypothetical and conserved proteins have been primarily determined by domain detection, family prediction, sub-cellular localization, trans-membrane helix identification, and disulfide bridge locating, the prediction of protein-protein interactions provides essential information about the biological and cellular functions. Typically, two or more proteins bind together to carry out their biological function. Most important molecular processes in the cell, such as DNA replication, are performed by large molecular machines that are built from a large number of proteins organized by their protein-protein interactions [37]. It has been previously established that the function and activity of a protein are often modulated by other proteins with which it interacts [37]. Experimental elucidation and computational analysis of the complex networks formed by individual protein-protein interactions are some of the major challenges in the post-genomic era. Protein-protein interaction databases have become a major resource for investigating biological networks and pathways in cells [13].

For functional protein association networks, STRING was used for the prediction of interactions between our hypothetical and conserved proteins and other partners (Figure 2). Only the partner proteins with a score of more than 0.5 were included in our results, except for Q899W4, which had only one partner with a score of 0.444. Though many proteins perform their functions independently, the vast majority of proteins interact with other proteins for proper biological activity. Characterizing protein-protein interactions is critical to understanding protein function and the biology of the


Table 7 Cysteine disulfide bonds and their confidence in the hypothetical and conserved proteins.

UniProt ID   DB state (a)   DB conf (b)   DB bond (c)      Conn conf (d)
Q892L8       0              6–7           -                -
Q898Y4       0              6–7           -                -
Q899E1       0              7             -                -
Q899X6       1              2, 3          Bond (5,53)      1
Q891V3       ND             ND            ND               ND
Q899W4       1              2, 3          Bond (22,115)    1
Q898T0       0              7–8           -                -
Q899Y4       0              2–7           -                -
Q899A6       0              7–8           -                -
Q897V7       0              1–2           -                -

(a) DB state: 1 = disulfide bonded, 0 = not disulfide bonded.
(b) DB conf: confidence of disulfide bond; 0 = low to 9 = high.
(c) DB bond: positions in the sequence of the cysteines forming the disulfide bridge.
(d) Conn conf: confidence of connectivity assignment given the predicted disulfide bonding state.

Figure 2 Protein-protein interaction network established by STRING (Version 9.1) for hypothetical conserved proteins and their partners.


cell. Protein Q892L8 interacts with three proteins: glutamate racemase, seryl-tRNA synthetase, and methionine aminopeptidase. Seryl-tRNA synthetase catalyzes the attachment of serine to tRNA, and it is also able to aminoacylate tRNA(Sec) with serine to form the misacylated L-seryl-tRNA(Sec), whereas methionine aminopeptidase removes the amino-terminal methionine from nascent proteins. Protein Q898Y4 has peptide chain release factor 1 as a partner, which directs the termination of translation in response to the peptide chain termination codons UAG and UAA. One of the hypothetical membrane proteins studied (Q899E1) interacts with an ABC transporter ATP-binding protein, and another hypothetical membrane protein (Q899X6) interacts with a putative membrane-spanning permease. Protein Q891V3 is found to interact with three proteins: virulence factor MviN, succinoglycan biosynthesis protein exoA, and glycosyl transferase. The protein-protein interaction output for Q898T0 involves a methanol dehydrogenase regulatory protein. The output for Q899A6 includes a stage III sporulation protein (SpoAB), dimethyladenosine transferase, and stage V sporulation protein T. The output for Q897V7 is RNA polymerase sigma factor. Q899Y4 interacts with the tetanus toxin, which acts by inhibiting neurotransmitter release. The toxin binds to peripheral neuronal synapses, is internalized, and moves by retrograde transport up the axon into the spinal cord, where it can move between postsynaptic and presynaptic neurons. It inhibits neurotransmitter release by acting as a zinc endopeptidase. Tetanus protein and its partners, including Q899Y4, play a central role in tetanus pathogenesis. Protein interactions are fundamentally characterized as stable or transient. Tetanus protein and Q899Y4 have a stable interaction, as they are proteins that are purified as multi-subunit complexes [38]. A common surface domain that facilitates stable protein-protein interactions is the leucine zipper, which consists of α-helices on each protein that bind to each other in a parallel fashion through the hydrophobic bonding of regularly spaced leucine residues in each α-helix that project between the adjacent helix peptide chains [38]. Interestingly, we found that Q899Y4 is predicted to have 8 α-helices, as shown in Table 5. The predicted functional partner proteins, together with their confidence scores for each hypothetical and conserved protein involved in this study, are summarized in Table 8. Protein-protein interactions are of central importance for almost every process in a living cell. Information about these interactions improves our understanding of pathogenesis and can provide the basis for new therapeutic approaches.

Table 8 Hypothetical and conserved proteins with their functionally important partner proteins.

UniProt ID   Predicted functional partner proteins                Score
Q892L8       Glutamate racemase                                   0.644
             Seryl-tRNA synthetase                                0.635
             Methionine aminopeptidase                            0.600
Q898Y4       Peptide chain release factor 1                       0.800
             Zinc uptake transporter                              0.632
             Methyltransferase                                    0.603
Q899E1       ABC transporter ATP-binding protein                  0.968
             DNA polymerase                                       0.721
Q899X6       Putative membrane-spanning permease                  0.938
Q891V3       Virulence factor MviN                                0.804
             Succinoglycan biosynthesis protein exoA              0.771
             Glycosyl transferase                                 0.740
Q899W4       Integrase/recombinase                                0.444
Q898T0       Methanol dehydrogenase regulatory protein (MoxR)     0.941
Q899Y4       Tetanus toxin TetX                                   0.735
Q899A6       Stage III sporulation protein SpoAB                  0.741
             Dimethyladenosine transferase                        0.704
             Stage V sporulation protein T                        0.692
Q897V7       RNA polymerase sigma factor                          0.865
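The score cutoff applied to the STRING output can be illustrated with the Table 8 values. The sketch below is only a reading aid under stated assumptions: the scores are transcribed from Table 8, the function names are hypothetical, and no call to the STRING service is made. It keeps partners scoring above 0.5 and lists the proteins that would be left without any partner, which reproduces the exception made for Q899W4.

```python
# Partner proteins and STRING confidence scores transcribed from Table 8.
TABLE_8 = {
    "Q892L8": [("Glutamate racemase", 0.644),
               ("Seryl-tRNA synthetase", 0.635),
               ("Methionine aminopeptidase", 0.600)],
    "Q898Y4": [("Peptide chain release factor 1", 0.800),
               ("Zinc uptake transporter", 0.632),
               ("Methyltransferase", 0.603)],
    "Q899E1": [("ABC transporter ATP-binding protein", 0.968),
               ("DNA polymerase", 0.721)],
    "Q899X6": [("Putative membrane-spanning permease", 0.938)],
    "Q891V3": [("Virulence factor MviN", 0.804),
               ("Succinoglycan biosynthesis protein exoA", 0.771),
               ("Glycosyl transferase", 0.740)],
    "Q899W4": [("Integrase/recombinase", 0.444)],
    "Q898T0": [("Methanol dehydrogenase regulatory protein (MoxR)", 0.941)],
    "Q899Y4": [("Tetanus toxin TetX", 0.735)],
    "Q899A6": [("Stage III sporulation protein SpoAB", 0.741),
               ("Dimethyladenosine transferase", 0.704),
               ("Stage V sporulation protein T", 0.692)],
    "Q897V7": [("RNA polymerase sigma factor", 0.865)],
}

CUTOFF = 0.5  # inclusion threshold stated in the text


def partners_above(table, cutoff=CUTOFF):
    """Partners whose score is strictly greater than the cutoff, per protein."""
    return {uid: [(name, s) for name, s in partners if s > cutoff]
            for uid, partners in table.items()}


def exceptions(table, cutoff=CUTOFF):
    """Proteins with no partner above the cutoff."""
    return [uid for uid, kept in partners_above(table, cutoff).items() if not kept]


# Highest-scoring association across all proteins.
top = max(((uid, name, s) for uid, ps in TABLE_8.items() for name, s in ps),
          key=lambda t: t[2])

print(exceptions(TABLE_8))
print(top)
```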

Conclusion

Genomic sequencing and structural genomics projects have introduced many protein sequences that cannot be annotated and have unknown functions, designated hypothetical/conserved proteins. These proteins constitute up to half of the proteins in microorganism genomes. Particularly in pathogenic bacteria, they hamper the search for new and effective drug targets and weaken our understanding of virulence and pathogenicity. In this study, we used various bioinformatics tools and databases to elucidate the structures and functions of C. tetani hypothetical and conserved proteins and to identify their domains, families, trans-membrane helices, and disulfide connectivity. We found that some of these proteins have functionally significant domains and families, being implicated in enzymatic processes, in drug transport and resistance mechanisms, and in cell transporter systems. One of these proteins acts as a partner of an important C. tetani virulence factor, tetanus toxin, through a stable interaction mediated by their α-helices. Therefore, these proteins are likely to be crucial for the organism's lifestyle and host invasion. As the number of such proteins keeps growing at an escalating pace, there is an increasing need for additional studies that combine protein and gene expression analyses, computational biology, and comparative genomics to provide a better understanding of C. tetani biology and to identify new potential drug targets.

Conflict of interest

Funding: No funding sources.
Competing interests: None declared.
Ethical approval: Not required.

References

[1] Hassel B. Toxins 2013;5:73–83.
[2] Hallit RR, Afridi M, Sison R, Salem E, Boghossian J, Slim J. J Med Microbiol 2013;62:155–6.
[3] Eisel U, Jarausch W, Goretzki K, Henschen A, Engels J, Weller U, et al. EMBO J 1986;5:2495–502.
[4] Thwaites CL, Farrar JJ. BMJ 2003;326:117–8.
[5] Centers for Disease Control Prevention (CDC). Tetanus surveillance, United States, 2001–2008. MMWR 2011;60(12):365–9.
[6] Blencowe H, Lawn J, Vandelaer J, Roper M, Cousens S. Int J Epidemiol 2010;39(Suppl. 1):i102–9.
[7] World Health Organization, http://whqlibdoc.who.int/hq/2010/WHO HSE GAR DCE 2010.2 eng.pdf; 2010. WHO/HSE/GAR/DCE/2010.2.
[8] Bruggemann H, Baumer S, Fricke WF, Wiezer A, Liesegang H, Decker I, et al. Proc Natl Acad Sci U S A 2003;100:1316–21.
[9] Rood JI. Annu Rev Microbiol 1998;52:333–60.
[10] Shimizu T, Ohtani K, Hirakawa H, Ohshima K, Yamashita A, Shiba T, et al. Proc Natl Acad Sci U S A 2002;99:996–1001.
[11] Finn Jr CW, Silver RP, Habig WH, Hardegree MC, Zon G, Garon CF. Science 1984;224:881–4.
[12] Park SJ, Son WS, Lee BJ. Int J Mol Sci 2012;13:7109–37.
[13] Mohan R, Venugopal S. Bioinformation 2012;8:722–8.
[14] Desler C, Suravajhala P, Sanderhoff M, Rasmussen M, Rasmussen LJ. BMC Bioinform 2009;10:289.
[15] Galperin MY. Comp Funct Genom 2001;2:14–8.
[16] Galperin MY, Koonin EV. Nucleic Acids Res 2004;32:5452–63.
[17] Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. In: Walker JM, editor. The Proteomics Protocols Handbook. Humana Press; 2005. p. 571–607.
[18] Gill SC, von Hippel PH. Anal Biochem 1989;182:319–26.
[19] Guruprasad K, Reddy BV, Pandit MW. Protein Eng 1990;4:155–61.
[20] Ikai A. J Biochem 1980;88:1895–8.
[21] Kyte J, Doolittle RF. J Mol Biol 1982;157:105–32.
[22] Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. Nucleic Acids Res 2012;40:D290–301.
[23] Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, et al. Nucleic Acids Res 2011;39:D225–9.
[24] Chen CC, Hwang JK, Yang JM. Nucleic Acids Res 2006;34:W152–7.
[25] Chen CC, Hwang JK, Yang JM. BMC Bioinform 2009;10:366.
[26] Hirokawa T, Boon-Chieng S, Mitaku S. Bioinformatics 1998;14:378–9.
[27] Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, et al. Bioinformatics 2005;21:617–23.
[28] Ceroni A, Passerini A, Vullo A, Frasconi P. Nucleic Acids Res 2006;34:W177–81.
[29] Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al. Nucleic Acids Res 2011;39:D561–8.
[30] Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, et al. Nucleic Acids Res 2013;41:D808–15.
[31] Universal Protein Resource (UniProt) database release 2014, Clostridium tetani (strain Massachusetts/E88) [212717] (www.uniprot.org), 2014.
[32] Mazandu GK, Mulder NJ. Int J Mol Sci 2012;13:7283–302.
[33] Kolker E, Makarova KS, Shabalina S, Picone AF, Purvine S, Holzman T, et al. Nucleic Acids Res 2004;32:2353–61.
[34] Sali A, Blundell TL. J Mol Biol 1993;234:779–815.
[35] Ikeda M, Arai M, Okuno T, Shimizu T. Nucleic Acids Res 2003;31:406–9.
[36] Tang HY, Speicher DW. Curr Protoc Protein Sci 2004 [chapter 11; unit 11.11].
[37] Phizicky EM, Fields S. Microbiol Rev 1995;59:94–123.
[38] Humeau Y, Doussau F, Grant NJ, Poulain B. Biochimie 2000;82:427–46.
