
Identifying Bacterial Virulent Proteins by Fusing a Set of Classifiers Based on Variants of Chou's Pseudo Amino Acid Composition and on Evolutionary Information

Loris Nanni, Alessandra Lumini, Dinesh Gupta, and Aarti Garg

Abstract—The availability of a reliable method for the prediction of bacterial virulent proteins has several important applications in research efforts aimed at finding novel drug targets and vaccine candidates and at understanding virulence mechanisms in pathogens. In this work, we have studied several feature extraction approaches for representing proteins and propose a novel bacterial virulent protein prediction method based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein. We have evaluated and compared several ensembles obtained by combining six feature extraction methods and several classification approaches based on two general-purpose classifiers (i.e., Support Vector Machine and a variant of the input decimated ensemble) and their random subspace versions. An extensive evaluation was performed according to a blind testing protocol, where the parameters of the system are optimized using the training set and the system is validated on three different independent data sets, allowing selection of the best-performing system and demonstrating the validity of the proposed method. Based on the results obtained using the blind test protocol, it is interesting to note that even if the best-performing stand-alone method is not always the same in each independent data set, the fusion of different methods enhances prediction efficiency in all the tested independent data sets.

Index Terms—Virulent proteins, machine learning, ensemble of classifiers, support vector machines.


1 INTRODUCTION

Bacterial virulent proteins are proteins that enhance the relative ability of bacteria to cause a disease. Bacterial virulent proteins have been classified on the basis of their mechanisms of virulence. For example, bacterial adhesins play an important role in the process of adherence of bacteria to the host cells. This class of proteins includes fimbriae and pili in Escherichia coli, Vibrio cholerae, Pseudomonas aeruginosa, and Neisseria species. Surface-exposed adhesins make important vaccine candidates. Colonization factors are another class of virulent proteins, which enable certain bacteria to colonize within the host cells; for example, Helicobacter pylori survives in the acidic milieu of the human stomach by producing the urease enzyme, which catalyzes the formation of carbon dioxide and ammonia that can neutralize the acidic pH. The virulence of different strains of Helicobacter pylori correlates with the level of production of urease. Invasion factors, another class of virulent proteins in certain bacteria, are proteins which disrupt the host cell membranes, stimulating endocytosis and hence facilitating the entry of bacteria into the host body across protective epithelial tissue layers. Other commonly known virulence factors are the bacterial toxins that poison the host cells and cause tissue damage [1]. The virulent proteins are important for host invasion and pathogenesis and consist of a diverse set of proteins. Though evolutionarily related, several of the virulent proteins have limited sequence similarities and few or no conserved motifs, making computational prediction of bacterial virulent proteins a difficult task [2], [3]. Moreover, due to the emergence of novel drug-resistant varieties of various bacterial pathogens, there is an urgent need to identify novel virulent proteins which may have the potential to be drug targets or vaccine candidates [4]. The first bacterial genome sequenced was Haemophilus influenzae in 1995 [5]. Currently, there are more than 1,000 completely sequenced bacterial genomes [6], and around 6,000 bacterial genomes are being sequenced thanks to the availability of next-generation sequencing technologies. However, a large number of virulent proteins are yet to be discovered in these genomes. The aim of this work is to develop a novel, efficient, and fast ensemble-based classifier for the prediction of bacterial virulent proteins [2], [3].

Several methods for predicting virulent proteins have been proposed, based on different strategies; for example, the first developed methods were based on similarity search tools like BLAST [7] and PSI-BLAST [8]. More recently, machine learning algorithms for predicting virulent proteins have been reported: for instance, in [9], the authors proposed a neural network-based prediction of virulence factors;


. L. Nanni is with the Department of Information Engineering, University of Padua, Via Gradenigo 6, Padova 35131, Italy. E-mail: [email protected].

. A. Lumini is with the DEIS, University of Bologna, via Venezia 52, Cesena 47023, Italy. E-mail: [email protected].

. D. Gupta and A. Garg are with the Structural and Computational Biology Group, International Centre for Genetic Engineering and Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110067, India. E-mail: {dinesh, aarti}@icgeb.res.in.

Manuscript received 13 Feb. 2011; revised 14 June 2011; accepted 26 June 2011; published online 18 Aug. 2011. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2011-02-0035. Digital Object Identifier no. 10.1109/TCBB.2011.117.


furthermore, an ensemble of Support Vector Machines (SVMs), where the different SVM classifiers were trained with sequence features of bacterial virulent proteins such as amino acid (AA) composition, 2-gram (2G) composition, higher order dipeptide composition, and evolutionary information, has also been reported.

Retrospectively, several methods have been suggested based on the extraction of feature vectors from the primary sequence of a protein, as in several methods recommended for subcellular location prediction [14]. Additionally, in [12], it was also shown that the prediction of virulent proteins based on features extracted directly from the amino acid sequence does not perform as well as feature extraction methods based on evolutionary information. Recently, an ensemble of SVMs based on different physicochemical properties was reported [31], which outperformed the modules based on 2-gram composition and higher order dipeptide composition.

Moreover, SVM-based machine learning algorithms have been used for the development of methods for the prediction of membrane protein types [42], [43], protein subcellular location [44], protein structural class [45], [46], the specificity of GalNAc-transferase [47], HIV protease cleavage sites in proteins [48], protein signal sequences and their cleavage sites [49], alpha-turn types [50], and catalytic triads of serine hydrolases [51], as well as several other classification problems in computational biology involving complex data sets.

According to a recent comprehensive review [41], the following points are vital in the development of an efficient predictor for protein systems:

1. benchmark data set construction or selection,
2. protein sample formulation or feature extraction,
3. operating algorithm (or engine),
4. anticipated accuracy, and
5. webserver establishment.

Keeping in mind these points, the following paragraphs elaborate our methods.

2 FEATURE EXTRACTION

Extracting features from proteins is a deeply studied issue in Bioinformatics [14], since several problems (e.g., subcellular localization, protein-protein interactions) need a compact representation of protein sequences for classification. In most cases, a fixed-length encoding is used so that it can be coupled to a general-purpose classifier.

In the following sections, the best-known encoding methods for proteins are briefly detailed. For a more detailed review of feature extraction from proteins, see [14].

2.1 2-Grams

The 2-Gram representation [14] counts the number of occurrences of each pair of amino acids in a protein. Hence, the descriptor is a vector of 20^2 = 400 values c_i, each counting the number of occurrences of a given pair of amino acids v_i in the protein sequence. The descriptor is finally scaled according to the length of the sequence. With respect to the standard orthonormal representation, this encoding technique does not consider the sequence order.
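To make the construction concrete, the following Python sketch (our own illustration, not the code used in the paper; the function and variable names are arbitrary) builds the length-scaled 2-gram descriptor:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def two_gram_descriptor(sequence):
    """Count occurrences of each amino acid pair and scale by the sequence length."""
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:          # skip pairs containing non-standard residues
            counts[pair] += 1
    length = max(len(sequence), 1)
    return [counts[p] / length for p in PAIRS]   # 400-dimensional, length-scaled

features = two_gram_descriptor("MKTAYIAKQR")
print(len(features))  # 400
```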

2.2 Quasi-Residue Couple (RC)

The Quasi-Residue Couple model is inspired by Chou's quasi-sequence-order model and Yuan's Markov chain model [36]. This encoding considers the sequence order effect only indirectly and also takes into account a fixed physicochemical property of the protein. In this work, we use the Residue Couple model with order m <= 3, which for a physicochemical property d is given by

P1^{m}_{i,j} = \frac{1}{L-m}\left[\sum_{n=1}^{L-m} H1_{i,j}(n, n+m, d)\right], \quad i, j \in \{1, 2, \ldots, 20\},

P2^{m}_{i,j} = \frac{1}{L-m}\left[\sum_{n=1}^{L-m} H2_{i,j}(n, n+m, d)\right], \quad i, j \in \{1, 2, \ldots, 20\},

where i and j denote the 20 different amino acids; H1_{i,j}(n, n+m, d) = index(i, d) if the amino acid in location n is i and the one in location n+m is j, otherwise H1_{i,j}(n, n+m, d) = 0; H2_{i,j}(n, n+m, d) = index(j, d) if the amino acid in location n is i and the one in location n+m is j, otherwise H2_{i,j}(n, n+m, d) = 0; L is the length of the protein sequence; index(p, d) is the function returning the value of the physicochemical property d for the amino acid p; and the parameter m is called the order of the residue couple model. The vector that describes a given protein is obtained as the sum of P1 and P2 for each pair i, j.

In this paper, we extract the features using the first three orders (i.e., m ranging from 1 to 3), and then we concatenate the resulting descriptors so that the final vector is 1,200 dimensional. This encoding technique considers the sequence order effect indirectly.
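As an illustration only, a minimal Python sketch of the residue couple descriptor follows; the property dictionary `prop` and all function names are our own assumptions, not the authors' implementation:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_couple(sequence, prop, m):
    """P1 + P2 of order m for one physicochemical property `prop`
    (a dict mapping each amino acid to a scalar value); returns 400 features."""
    L = len(sequence)
    p1 = {(i, j): 0.0 for i in AMINO_ACIDS for j in AMINO_ACIDS}
    p2 = {(i, j): 0.0 for i in AMINO_ACIDS for j in AMINO_ACIDS}
    for n in range(L - m):
        a, b = sequence[n], sequence[n + m]
        if a in prop and b in prop:
            p1[(a, b)] += prop[a]   # H1 contributes the property value of the first residue
            p2[(a, b)] += prop[b]   # H2 contributes the property value of the second residue
    scale = 1.0 / max(L - m, 1)
    return [scale * (p1[k] + p2[k]) for k in sorted(p1)]  # sum of P1 and P2 per (i, j)

def residue_couple_orders_1_to_3(sequence, prop):
    """Concatenate the descriptors for m = 1, 2, 3 (3 x 400 = 1,200 features)."""
    return sum((residue_couple(sequence, prop, m) for m in (1, 2, 3)), [])
```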

2.3 Pseudo Chou’s Amino Acid (PA)

Chou's pseudo amino acid (PseAA) composition [15], [17] is one of the most used methods for feature extraction and is based on the selection of a fixed physicochemical property of the protein. This technique represents a protein with 20 + λ features (λ is a parameter denoting the maximum distance between two considered amino acids): the first 20 features are the amino acid composition for a given property, and the features from 20 + 1 to 20 + λ reflect the effect of the sequence order. The PA features are extracted using the webserver available at http://chou.med.harvard.edu/bioinf/PseAA/,1,2 which provides the scales of Eisenberg ("Hydrophobicity property") and of Hopp-Woods ("Hydrophilicity property") for the codification of the residue hydropathic character, among the many scales that have been proposed in the literature throughout the last three decades.
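For illustration, a simplified single-property (type 1-like) PseAA sketch is shown below; the webserver used in this work computes the type 2 variant with both the hydrophobicity and hydrophilicity scales (see footnote 1), so this is only an approximation with names of our own choosing:

```python
def pseaa_type1(sequence, prop, lam=30, w=0.05):
    """Simplified Chou-type pseudo amino acid composition (20 + lam features)
    using a single normalized property scale `prop` (dict residue -> value)."""
    seq = [r for r in sequence if r in prop]
    L = max(len(seq), 1)
    # first 20 components: amino acid occurrence frequencies
    f = [seq.count(a) / L for a in sorted(prop)]
    # sequence-order correlation factors theta_1 .. theta_lam
    thetas = []
    for k in range(1, lam + 1):
        if L - k <= 0:
            thetas.append(0.0)
            continue
        corr = sum((prop[seq[i]] - prop[seq[i + k]]) ** 2 for i in range(L - k))
        thetas.append(corr / (L - k))
    denom = sum(f) + w * sum(thetas)
    return [v / denom for v in f] + [w * t / denom for t in thetas]
```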

2.4 Artificial Chou’s Pseudo Amino Acid (GP)

Since the pseudo Chou's amino acid features can be extracted for each physicochemical property, a feature selection method can be used to extract, from the several thousands of features, a subset of useful features. In [38], a small set of "artificial" features is created by genetic programming, starting from the pseudo Chou's amino acid features and combining one or more "original" features by means of some mathematical operators. The "original" feature set is composed of the Chou's PseAA composition features calculated on the set of the physicochemical properties included in the Amino Acid index database [25] (available at http://www.genome.jp/dbget/aaindex.html).

1. The used parameters were: PseAA mode Type 2; Amino acid character Hydrophobicity Hydrophilicity; Weight factor = 0.05; λ = 30.

2. A Matlab implementation of the PseAA-based features is available at: http://bias.csr.unibo.it/nanni/toolVR.rar.

2.5 Position Specific Scoring Matrix (PSSM)3

Additionally, information from multiple sequence alignment, in the form of the PSI-BLAST generated Position Specific Scoring Matrix profile, was used as the feature vector for the training of the SVM model. Herein, three iterative searches with a cutoff E-value of 0.001 were carried out against the nonredundant NCBI database. PSI-BLAST generates a PSSM from a multiple alignment of the high-scoring hits by calculating position-specific scores for each position in the alignments, and the PSSM generated in each step is used to perform subsequent iterative searches, thereby increasing the sensitivity of the search in each step. After three iterations, the PSSM with the highest score is formed, which contains 20 × N elements, where N is the length of the target sequence, and each element represents the frequency of occurrence of each of the 20 amino acids at a particular position in the alignment.

Subsequently, the final PSSM was normalized using a sigmoid function by which each matrix element was scaled to the range [0, 1] (see [12] for details). Finally, to generate an input vector of fixed dimensions, all the rows in the PSSM corresponding to the same amino acid in the sequence were summed, followed by division of each element by the length of the sequence, which resulted in an input vector of 400 dimensions.
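A minimal sketch of this collapsing step is given below, assuming the PSSM is already available as an L x 20 matrix; the sigmoid shown is one common choice and the function name is ours, not the authors' code:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_to_400_features(pssm, sequence):
    """Collapse an L x 20 PSSM (list of L rows of 20 scores) into a 400-dim vector."""
    # 1) sigmoid scaling of every matrix element to [0, 1]
    scaled = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in pssm]
    # 2) sum the rows that correspond to the same amino acid in the sequence
    summed = {a: [0.0] * 20 for a in AMINO_ACIDS}
    for residue, row in zip(sequence, scaled):
        if residue in summed:
            summed[residue] = [s + r for s, r in zip(summed[residue], row)]
    # 3) divide by the sequence length and concatenate -> 20 x 20 = 400 values
    L = max(len(sequence), 1)
    return [v / L for a in AMINO_ACIDS for v in summed[a]]
```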

2.6 AAIndexLoc (AA)

AAIndexLoc [13] describes a given protein P by:

- Amino acid composition (20 features): the fraction of each amino acid y in P;

- Weighted AA composition (20 features): defined, for amino acid y, as (amino acid composition of y) × (index value a for the amino acid y);

- Five-level grouping composition (25 features): the amino acids are classified by k-means clustering into five groups considering their amino acid index values, and then the five-level dipeptide composition is computed. The five-level dipeptide composition is defined as the composition of the occurrences of two consecutive groups (see [13] for more details).
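As a hedged illustration, the first two groups of features could be computed as follows (the 25 five-level grouping features, which require k-means clustering of the index values, are omitted here; the names are ours):

```python
def aaindexloc_basic_features(sequence, index):
    """First 40 AAIndexLoc-style features: amino acid composition (20) and
    weighted composition (20), where `index` maps each residue to its AAindex value."""
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    L = max(len(sequence), 1)
    comp = [sequence.count(a) / L for a in AMINO_ACIDS]          # composition
    weighted = [c * index[a] for c, a in zip(comp, AMINO_ACIDS)]  # composition x index value
    return comp + weighted
```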

3 CLASSIFICATION SYSTEMS

For each feature extraction method, we have tested two different classification systems:

- Support Vector Machine;4

- a variant of the Input Decimated Ensemble (IDE).

Moreover, for each classifier, we have tested both its stand-alone version and its random subspace (RS) ensemble version.

The Random Subspace Method [24] modifies the training data set by generating K (K = 50 in this paper) new training sets, builds classifiers on these modified training sets (each new training set contains only a random subset of 50 percent of all the features), and then combines them by the sum rule.
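A minimal random subspace sketch using scikit-learn is shown below (the paper used the OSU SVM Matlab toolbox; the scikit-learn classifier, the helper names, and the probability-based sum rule are our own assumptions):

```python
import random
from sklearn.base import clone
from sklearn.svm import SVC

def train_random_subspace(X, y, base=None, K=50, frac=0.5, seed=0):
    """Train K copies of a base classifier, each on a random 50% subset of the features."""
    base = base or SVC(probability=True)
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(K):
        cols = sorted(rng.sample(range(n_features), max(1, int(frac * n_features))))
        clf = clone(base).fit([[row[c] for c in cols] for row in X], y)
        ensemble.append((cols, clf))
    return ensemble

def predict_sum_rule(ensemble, X):
    """Combine the members by the sum rule over their posterior scores."""
    total = None
    for cols, clf in ensemble:
        scores = clf.predict_proba([[row[c] for c in cols] for row in X])
        total = scores if total is None else total + scores
    return total.argmax(axis=1)   # class with the highest summed score
```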

3.1 Support Vector Machines

The support vector machine is a technique for classification from the field of statistical learning theory [23]. SVM is a binary-class prediction method trained to find the equation of a hyperplane that divides the training set, leaving all the points of the same class on the same side while maximizing the distance between the two classes and the hyperplane. In cases where a linear decision boundary does not exist, a kernel function can be used: a kernel function projects the data onto a higher dimensional feature space in which they may be separable by a hyperplane. Typical kernels are polynomial kernels and radial basis function kernels.

Notice that, before classification, all the features used for training the SVM are linearly normalized to [0, 1] considering the training data.
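For example, this normalization-plus-SVM step could be sketched with scikit-learn as follows (X_train, y_train, and X_test are assumed to exist; the kernel choice here is illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Fit the [0, 1] scaling on the training data only, then apply the same
# transform to the test data before classification (illustrative sketch).
model = make_pipeline(MinMaxScaler(feature_range=(0, 1)),
                      SVC(kernel="rbf", probability=True))
model.fit(X_train, y_train)           # X_train / y_train assumed to exist
scores = model.predict_proba(X_test)  # posterior-like scores, usable for sum-rule fusion
```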

3.2 Input Decimated Ensemble Based on Neighborhood Preserving Embedding (NPE)5

The standard Input Decimated Ensemble [32] trains a different Decision Tree D_i using the data transformed by the principal component analysis (PCA) transform obtained using the training patterns that belong to the class i. Thus, the number of classifiers that build the ensemble is bounded by the number of classes.

In [34], [33], a different Decision Tree6 D_{i,j} is trained using the data transformed by the PCA transform obtained using the training patterns that belong to the jth subset of the class i. The set of classifiers is combined by the sum rule. In [39] (used in this work), a variant based on Neighborhood Preserving Embedding subspace projections is proposed.

In Fig. 1, the pseudocode for the variant of the Input Decimated Ensemble based on NPE is reported (it is based on the pseudocode of [39]). The inputs are the training set TR, the corresponding class labels yTR (yTR(i) = c indicates that the pattern x_i belongs to the class c), the test set TE, the corresponding class labels yTE, the number of classes Nclass, and the value of the parameter ne (the number of classifiers built for each class). The procedure EXTRACTCLASS(TR, yTR, c) extracts from the training set the subset of patterns that belong to the class c. The procedure SUBSET(TS) randomly extracts a subset of patterns from TS. The procedures PCA and NPE project the data onto a PCA space or an NPE space. The procedure DT trains a decision tree, and finally the scores are combined by the sum rule in the procedure SUMRULE. The PCA projection preserves 98 percent of the variance. In this paper, we show (see Section 4) that IDE can be coupled with RS, obtaining a further performance improvement.
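Since Fig. 1 is not reproduced in this text, the following sketch follows the textual description only; PCA from scikit-learn stands in for the NPE projection (for which no standard scikit-learn implementation exists), the inputs are assumed to be NumPy arrays, and all names are ours rather than the authors':

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def train_ide(X_train, y_train, n_classes, ne=5, subset_frac=0.8, seed=0):
    """Input-decimated ensemble sketch: for each class, build `ne` projections
    from random subsets of that class's patterns and train a decision tree on
    the whole (projected) training set. PCA is used here as a stand-in for NPE."""
    rng = np.random.default_rng(seed)
    members = []
    for c in range(n_classes):
        Xc = X_train[y_train == c]
        for _ in range(ne):
            size = min(len(Xc), max(2, int(subset_frac * len(Xc))))
            idx = rng.choice(len(Xc), size=size, replace=False)
            # keep 98% of the variance, as in the description above
            proj = PCA(n_components=0.98, svd_solver="full").fit(Xc[idx])
            tree = DecisionTreeClassifier().fit(proj.transform(X_train), y_train)
            members.append((proj, tree))
    return members

def predict_ide(members, X_test):
    """Combine the member scores by the sum rule."""
    total = sum(tree.predict_proba(proj.transform(X_test)) for proj, tree in members)
    return total.argmax(axis=1)
```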


3. Webserver: http://203.92.44.101/pssm/.

4. SVM is implemented as in the OSU SVM toolbox: http://sourceforge.net/projects/svm/.

5. The Matlab code is available at: http://bias.csr.unibo.it/nanni/toolVR.rar.

6. The DTs are implemented as in the PRTools 3.1.7 Matlab toolbox.

4 EXPERIMENTS

This section reports an experimental evaluation of the proposed ensemble performed on the data sets most used for testing virulent protein classification approaches. The training and testing sets used in this work are described in Section 4.1, and then the performance of the different approaches combined in this work is evaluated and compared with the performance of the proposed ensemble in Section 4.2.

4.1 Data Sets and Testing Protocol7

The proposed approach has been evaluated on the same data sets used in [12] and [31], which are freely available for download at the VirulentPred web server site (http://bioinfo.icgeb.res.in/virulent).

Virulent data set (VIR). This data set contains bacterial virulent protein sequences which were retrieved from SWISS-PROT [10] and VFDB (an integrated and comprehensive database of virulence factors of bacterial pathogens [11]). It consists of 1,025 virulent and 1,030 nonvirulent bacterial sequences.

ADHESINS data set (ADH). The ADHESINS data set (which was used to validate the SPAAN approach [9]) consists of 469 adhesin and 703 nonadhesin proteins (including several archaebacterial, viral, and yeast nonvirulent proteins).

Independent data set 1 (IND1). This data set has been obtained in the same way as the Virulent data set, but avoiding overlap with it. It consists of 83 SWISS-PROT sequences (40 virulent and 43 nonvirulent protein sequences).

Independent data set 2 (IND2). This data set consists of 141 virulent and 143 nonvirulent sequences from bacterial pathogens of organisms which were not represented in the VIR data set:

- Campylobacter (39 virulent and 40 nonvirulent protein sequences);
- Neisseria (25 virulent and 24 nonvirulent);
- Bordetella (27 virulent and 27 nonvirulent);
- Haemophilus (35 virulent and 35 nonvirulent);
- Listeria (15 virulent and 17 nonvirulent).

A summary of the characteristics of these data sets is reported in Table 1.

First, in order to perform the parameter optimization, a 10-fold cross-validation testing protocol has been adopted on the VIR data set. Among the independent data set test, the subsampling (e.g., five- or 10-fold cross-validation) test, and the jackknife test, which are often used for examining the accuracy of a statistical prediction method [52], the jackknife test was deemed the least arbitrary, since it can always yield a unique result for a given benchmark data set, as elucidated in [53], [54] and demonstrated by (28)-(32) of [41]. Therefore, the jackknife test has been increasingly and widely adopted by investigators to test the power of various prediction methods (see, e.g., [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67]). However, to reduce the computational time, we adopted the fivefold cross validation in this study, as done by many investigators with SVM as the prediction engine. Then, the comparison of the different methods is performed using VIR as training set and the others (ADH, IND1, and IND2) as independent test sets.

Fig. 1. Pseudocode of the proposed Input Decimated Ensemble based on NPE algorithm.

7. We have used exactly the data sets used in the literature. We have checked the redundancy using Blastclust. Reducing the redundancy from 40 to 25 percent doesn't make significant changes in the numbers (only 15 in the positive set and nine in the negative set of the training set, i.e., the Virulent data set; no reduction in the independent data sets).

TABLE 1
Characteristics of the Data Sets Used in the Experimentation

As performance indicators, we use

1. the area under the ROC curve (AUC)8 [20], a scalar measure which can be interpreted as the probability that the classifier will assign a lower score to a randomly picked positive pattern (a virulent protein) than to a randomly picked negative pattern (a nonvirulent protein). An ROC curve is defined by the true positive rate (TPR) and the false positive rate (FPR).

2. the DET curve [40], a two-dimensional measure of classification performance that plots the false positive rate (a nonvirulent protein erroneously classified as virulent) against the false negative rate (a virulent protein erroneously classified as nonvirulent).
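For instance, with scikit-learn the AUC and the points of the ROC curve can be obtained as follows (the `labels` and `scores` arrays are assumed to exist, and the score sign convention must match the definition above):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# labels: 1 = virulent, 0 = nonvirulent; scores: classifier output per protein
auc = roc_auc_score(labels, scores)                 # area under the ROC curve
fpr, tpr, thresholds = roc_curve(labels, scores)    # points defining the ROC curve
```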

4.2 Experimental Results

The first experiment is aimed at comparing different solutions (some stand-alone classifiers and some ensembles) on a single classification problem, where the training and the test sets are related to similar patterns. This experiment is useful for parameter tuning and for selecting the best configurations to be fused. The results reported in Table 2 are obtained by combining each feature extraction method described in Section 2 with a different classification approach and using a 10-fold cross-validation testing protocol in the VIR data set. The classification approaches tested in Table 2 are a stand-alone SVM (SVM), an Input Decimated Ensemble based on NPE (NPE), a random subspace ensemble of SVM (RS-SVM), a random subspace ensemble of NPE (RS-NPE), and two ensembles based on feature perturbation where 10 different features are selected by sequential forward floating selection9 [21] and used to train a RS-SVM (E-RS-SVM) or a RS-NPE classifier (E-RS-NPE). The two last methods are not applicable to feature sets not related to different physicochemical properties (i.e., 2G, PA, and PSSM); otherwise, the 10 best physicochemical properties10 are selected by sequential forward floating selection in order to minimize the error rate in the training set (as in [22]).

Considering the results obtained in Table 2, it is possible to design an ensemble by combining the best configurations; in the following, we denote with FUSION the combination by sum rule among: 1) RS-NPE trained with PA, 2) E-RS-SVM11 trained with RC, and 3) RS-NPE trained with PSSM. The set of methods chosen for building the ensemble is the combination, by sum rule, of the approaches that obtain the highest AUC in the 10-fold cross validation on the training data. The performance of FUSION in the VIR data set (0.899 AUC) is better than the results obtained by the state-of-the-art approaches VirulentPred [12] (0.860 AUC) and ENS_AAIndexLoc [31] (0.803 AUC).

VirulentPred [12] is a prediction method based on a two-layer cascade of Support Vector Machines. The first-layer SVM classifiers are trained and optimized with different individual protein sequence features like amino acid composition, dipeptide composition, higher order dipeptide composition, and Position Specific Iterated BLAST (PSI-BLAST) generated Position Specific Scoring Matrices (PSSM). In addition, a similarity-search based module has also been developed using a data set of virulent and nonvirulent proteins as the BLAST database. The results from the first layer (SVM scores and PSI-BLAST result) are then cascaded to the second-layer SVM classifier.

ENS_AAIndexLoc [31] is an ensemble of SVM classifiers based on the features described in Section 2.6, which has been proven to outperform both the 2-gram composition and the higher order dipeptide composition on this problem. It is, however, based only on an amino acid sequence-based ensemble; it does not use the evolutionary information features. The reported results show that the new approach clearly outperforms [31].

TABLE 2
AUC Obtained by Different Methods in the VIR Data Set

The acronym C.I. means "computationally unfeasible."

8. Implemented as in the DDtool 0.95 Matlab Toolbox: http://homepage.tudelft.nl/n9d04/dd_tools.html.

9. Implemented as in the PRTools 3.1.7 Matlab Toolbox: www.prtools.org/prtools.html.

10. Partition energy; Normalized relative frequency of double bend; Negative charge; Transfer free energy to surface; Solvation free energy; Hydrophobic parameter pi; Transfer free energy; Atom-based hydrophobic moment; Net charge; Steric parameter.

11. We have used the Gaussian kernel with "Cost of the constraint violation" = 0.1 and Gamma = 100.

The second experiment is aimed at comparing the best solutions (features + classifier or ensemble) found by the previous experiment, considering independent data sets. From the results of Table 2, we select the three best feature descriptors with their best classifier (which are the same methods combined in FUSION). Moreover, in Table 3, the combination at score level between FUSION and VirulentPred [12] is reported, where the scores of the two approaches are normalized to mean 0 and standard deviation 1 before fusion (the normalization parameters are calculated using only the training set). In Figs. 2, 3, and 4, we plot the DET curves obtained by FUSION, by VirulentPred, and by FUSION+VirulentPred.
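A minimal sketch of this score-level fusion is given below; the variable names are our own, and the training-set scores of the two approaches are assumed to be available:

```python
import numpy as np

def fuse_scores(train_a, train_b, test_a, test_b):
    """Normalize two score sets to zero mean / unit standard deviation using
    statistics estimated on the training scores only, then fuse by the sum rule."""
    mu_a, sd_a = np.mean(train_a), np.std(train_a)
    mu_b, sd_b = np.mean(train_b), np.std(train_b)
    return (np.asarray(test_a) - mu_a) / sd_a + (np.asarray(test_b) - mu_b) / sd_b

# Hypothetical usage with scores from the two systems:
# fused = fuse_scores(fusion_train, virulentpred_train, fusion_test, virulentpred_test)
```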

The following conclusions can be drawn from the results reported in this section:

- Our experiments, executed by training and testing the approaches on different sets constructed using proteins from different organisms, show that there is no "best" stand-alone method that performs better than the others in all the case studies; in each data set, the best method is different;

- It is interesting to note that the PSSM features work very well in the training data set and in the IND2 data set, but they work poorly in the ADH data set and in the IND1 data set; in particular, in the ADH data set, which is made of archaebacterial, viral, and yeast nonvirulent proteins, the performance of the PSSM feature set is very poor compared with the other approaches; this is probably due to the fact that the VIR data set, used as the training set, contains only bacterial virulent protein sequences;

- Better performance stability among different test sets is obtained by combining different methods, while the performance of a single approach is influenced by the origin of the proteins evaluated, i.e., it may degrade if the training and test sets are related to different organisms. Combined approaches seem to be more robust against this problem, and in our experiments the ensemble named FUSION+VirulentPred obtains the best performance in all the independent data sets.

TABLE 3
AUC Obtained by Different Methods in the Independent Data Sets

Fig. 2. DET curve obtained by FUSION, VirulentPred, and their combination in the ADH data set.

Fig. 3. DET curve obtained by FUSION, VirulentPred, and their combination in the IND1 data set.

A further test is performed by varying the number of approaches that build the ensemble. In Table 4, we report the AUC obtained combining {3, 5, 7, 9} approaches; also in this test, the methods to be combined are selected using a 10-fold cross validation in the training set. It is clear that the combination of several methods also outperforms (on average across the different data sets) each stand-alone approach. Increasing the number of methods to be combined, the performance slightly decreases (with respect to when 3/5 methods are combined), since low-performance approaches are also combined.

Due to the low performance of PSSM in the ADH data set (notice that ADH also contains nonbacterial sequences), we have tested it using as training set the concatenation of VIR and ADH (VIR+ADH). In Table 5, we report the performance obtained using a 10-fold cross validation in VIR+ADH and in the IND1 and IND2 data sets. The reported results show that PSSM obtains better results when ADH proteins are also available in the training set. Moreover, the larger training set also permits a slight improvement (with respect to the methods trained only with VIR) in IND1 and IND2.

5 CONCLUSIONS

In this paper, we have presented a method based on an ensemble of classifiers for virulent protein prediction, where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein.

We obtain a number of statistically robust observations regarding the robustness of the system. An extensive evaluation on a large data set according to a blind testing protocol has demonstrated that the proposed approach works well on the independent data sets even if it is trained only with bacterial virulent protein sequences and the test sets are related to different organisms.

To obtain this finding, we have studied six different feature extraction methods and several classification approaches based on two general-purpose classifiers (i.e., support vector machine and a variant of the input decimated ensemble) and their random subspace versions.

Since user-friendly and publicly accessible webservers represent the future direction for developing practically more useful models, simulated methods, or predictors [72], we shall make efforts in our future work to provide a webserver for the method presented in this paper.


Fig. 4. DET curve obtained by FUSION, VirulentPred, and their combination in the IND2 data set.

TABLE 4
AUC Obtained Varying the Number of Combined Approaches

TABLE 5
AUC Obtained Using the Concatenation of VIR and ADH as Training Set

REFERENCES

[1] K.A. Brogden, J.A. Roth, T.B. Stanton, C.A. Bolin, F.C. Minion, and M.J. Wannemuehler, Virulence Mechanisms of Bacterial Pathogens, third ed. ASM Press, 2000.

[2] R.A. Weiss, "Virulence and Pathogenesis," Trends in Microbiology, vol. 10, pp. 314-317, 2002.

[3] I.M. Hastings, S. Paget-McNicol, and A. Saul, "Can Mutation and Selection Explain Virulence in Human P. Falciparum Infections?," Malaria J., vol. 2, p. 3, 2004.

[4] D.M. Morens, G.K. Folkers, and A.S. Fauci, "The Challenge of Emerging and Re-Emerging Infectious Diseases," Nature, vol. 430, pp. 242-249, 2004.

[5] R.D. Fleischmann, M.D. Adams, O. White, R.A. Clayton, E.F. Kirkness, A.R. Kerlavage, C.J. Bult, J.F. Tomb, B.A. Dougherty, J.M. Merrick, K. McKenney, G.G. Sutton, W. FitzHugh, C.A. Fields, J.D. Gocayne, J.D. Scott, R. Shirley, L.I. Liu, A. Glodek, J.M. Kelley, J.F. Weidman, C.A. Phillips, T. Spriggs, E. Hedblom, M.D. Cotton, T.R. Utterback, M.C. Hanna, D.T. Nguyen, D.M. Saudek, R.C. Brandon, L.D. Fine, J.L. Fritchman, J.L. Fuhrmann, N.S.M. Geoghagen, C.L. Gnehm, L.A. McDonald, K.V. Small, C.M. Fraser, H.O. Smith, and J.C. Venter, "Whole-Genome Random Sequencing and Assembly of Haemophilus Influenzae Rd," Science, vol. 269, pp. 496-512, 1995.

[6] K. Liolios, N. Tavernarakis, P. Hugenholtz, and N.C. Kyrpides, "The Genomes On Line Database (GOLD) v.2: A Monitor of Genome Projects Worldwide," Nucleic Acids Research, vol. 34, pp. D332-D334, 2006.

[7] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, "Basic Local Alignment Search Tool," J. Molecular Biology, vol. 215, pp. 403-410, 1990.

[8] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.

[9] G. Sachdeva, K. Kumar, P. Jain, and S. Ramachandran, "SPAAN: A Software for Prediction of Adhesins and Adhesin-Like Proteins Using Neural Networks," Bioinformatics, vol. 21, pp. 483-491, 2005.

[10] A. Bairoch and R. Apweiler, "The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000," Nucleic Acids Research, vol. 28, pp. 45-48, 2000.

[11] L. Chen, J. Yang, J. Yu, Z. Yao, L. Sun, Y. Shen, and Q. Jin, "VFDB: A Reference Database for Bacterial Virulence Factors," Nucleic Acids Research, vol. 33, pp. D325-D328, 2005.

[12] A. Garg and D. Gupta, "VirulentPred: A SVM Based Prediction Method for Virulent Proteins in Bacterial Pathogens," BMC Bioinformatics, vol. 9, article 62, 2008, doi:10.1186/1471-2105-9-62.

[13] E. Tantoso and K.-B. Li, "AAIndexLoc: Predicting Subcellular Localization of Proteins Based on a New Representation of Sequences Using Amino Acid Indices," Amino Acids, vol. 35, pp. 343-353, 2007.

[14] K.C. Chou and H.B. Shen, "Review: Recent Progresses in Protein Subcellular Location Prediction," Analytical Biochemistry, vol. 370, pp. 1-16, 2007.

[15] K.C. Chou and H.B. Shen, "MemType-2L: A Web Server for Predicting Membrane Proteins and Their Types by Incorporating Evolution Information through Pse-PSSM," Biochemical and Biophysical Research Comm., vol. 360, pp. 339-345, 2007.

[16] K.C. Chou and H.B. Shen, "Signal-CF: A Subsite-Coupled and Window-Fusing Approach for Predicting Signal Peptides," Biochemical and Biophysical Research Comm., vol. 357, pp. 633-640, 2007.

[17] K.C. Chou and H.B. Shen, "Euk-mPLoc: A Fusion Classifier for Large-Scale Eukaryotic Protein Subcellular Location Prediction by Incorporating Multiple Sites," J. Proteome Research, vol. 6, pp. 1728-1734, 2007.

[18] S.K. Riis and A. Krogh, "Improving Prediction of Protein Secondary Structure Using Neural Networks and Multiple Sequence Alignments," J. Computational Biology, vol. 3, pp. 163-183, 1996.

[19] H.B. Shen and K.C. Chou, "Ensemble Classifier for Protein Fold Pattern Recognition," Bioinformatics, vol. 22, pp. 1717-1722, 2006.

[20] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers," technical report, HP Laboratories, Palo Alto, USA, 2004.

[21] P. Pudil, J. Novovicova, and J. Kittler, "Floating Search Methods in Feature Selection," Pattern Recognition Letters, vol. 15, pp. 1119-1125, 1994.

[22] L. Nanni and A. Lumini, "An Ensemble of K-Local Hyperplane for Predicting Protein-Protein Interactions," Bioinformatics, vol. 22, no. 10, pp. 1207-1210, 2006.

[23] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge Univ. Press, 2000.

[24] T.K. Ho, "The Random Subspace Method for Constructing Decision Forests," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, Aug. 1998.

[25] S. Kawashima and M. Kanehisa, "AAindex: Amino Acid Index Database," Nucleic Acids Research, vol. 28, p. 374, 2000.

[26] J. Kittler, "On Combining Classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, Mar. 1998.

[27] L. Nanni and A. Lumini, "A Genetic Approach for Building Different Alphabets for Peptide and Protein Classification," BMC Bioinformatics, vol. 9, p. 45, Jan. 2008.

[28] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Kluwer Academic, 1989.

[29] D.E. Goldberg, The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Addison-Wesley, 2002.

[30] M. Lilic, M. Vujanac, and C.E. Stebbins, "A Common Structural Motif in the Binding of Virulence Factors to Bacterial Secretion Chaperones," Molecular Cell, vol. 21, pp. 653-664, 2006.

[31] L. Nanni and A. Lumini, "An Ensemble of Support Vector Machines for Predicting Virulent Proteins," Expert Systems with Applications, vol. 36, no. 4, pp. 7458-7462, May 2009.

[32] K. Tumer and N.C. Oza, "Input Decimated Ensembles," Pattern Analysis and Applications, vol. 6, pp. 65-77, 2003.

[33] L. Nanni and A. Lumini, "Ensemble Generation and Feature Selection for the Identification of Students with Learning Disabilities," Expert Systems with Applications, vol. 36, pp. 3896-3900, 2009.

[34] L. Nanni and A. Lumini, "Using Ensemble of Classifiers in Bioinformatics," Machine Learning Research Progress, Nova Publishers, 2008.

[35] X. He, D. Cai, S. Yan, and H.-J. Zhang, "Neighborhood Preserving Embedding," Proc. 10th IEEE Int'l Conf. Computer Vision (ICCV '05), 2005.

[36] J. Guo, Y. Lin, and Z. Sun, "A Novel Method for Protein Subcellular Localization: Combining Residue-Couple Model and SVM," Proc. Third Asia-Pacific Bioinformatics Conf., pp. 117-129, 2005.

[37] D. Sarda, G.H. Chua, K. Li, and A. Krishnan, "pSLIP: SVM Based Protein Subcellular Localization Prediction Using Multiple Physicochemical Properties," BMC Bioinformatics, vol. 6, article 152, 2005.

[38] L. Nanni and A. Lumini, "Genetic Programming for Creating Chou's Pseudo Amino Acid Based Features for Submitochondria Localization," Amino Acids, vol. 34, no. 4, pp. 653-660, 2008.

[39] L. Nanni and A. Lumini, "Input Decimated Ensemble Based on Neighborhood Preserving Embedding for Spectrogram Classification," Expert Systems with Applications, vol. 36, pp. 11257-11261, 2009, doi:10.1016/j.eswa.2009.02.072.

[40] A. Martin et al., "The DET Curve in Assessment of Detection Task Performance," Proc. EuroSpeech, pp. 1895-1898, 1997.

[41] K.C. Chou, "Some Remarks on Protein Attribute Prediction and Pseudo Amino Acid Composition (50th Anniversary Year Review)," J. Theoretical Biology, vol. 273, pp. 236-247, 2011.

[42] Y.D. Cai, G.P. Zhou, and K.C. Chou, "Support Vector Machines for Predicting Membrane Protein Types by Using Functional Domain Composition," Biophysical J., vol. 84, pp. 3257-3263, 2003.

[43] Y.D. Cai, R. Pong-Wong, K. Feng, J.C.H. Jen, and K.C. Chou, "Application of SVM to Predict Membrane Protein Types," J. Theoretical Biology, vol. 226, pp. 373-376, 2004.

[44] K.C. Chou and Y.D. Cai, "Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location," J. Biological Chemistry, vol. 277, pp. 45765-45769, 2002.

[45] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou, "Prediction of Protein Structural Classes by Support Vector Machines," Computers and Chemistry, vol. 26, pp. 293-296, 2002.

[46] Y.S. Ding, T.L. Zhang, and K.C. Chou, "Prediction of Protein Structure Classes with Pseudo Amino Acid Composition and Fuzzy Support Vector Machine Network," Protein and Peptide Letters, vol. 14, pp. 811-815, 2007.

[47] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou, "Support Vector Machines for Predicting the Specificity of GalNAc-Transferase," Peptides, vol. 23, pp. 205-208, 2002.

[48] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou, "Support Vector Machines for Predicting HIV Protease Cleavage Sites in Protein," J. Computational Chemistry, vol. 23, pp. 267-274, 2002.

[49] Y.D. Cai, S. Lin, and K.C. Chou, "Support Vector Machines for Prediction of Protein Signal Sequences and Their Cleavage Sites," Peptides, vol. 24, pp. 159-161, 2003.

[50] Y.D. Cai, K.Y. Feng, Y.X. Li, and K.C. Chou, "Support Vector Machine for Predicting Alpha-Turn Types," Peptides, vol. 24, pp. 629-630, 2003.

[51] Y.D. Cai, G.P. Zhou, C.H. Jen, S.L. Lin, and K.C. Chou, "Identify Catalytic Triads of Serine Hydrolases by Support Vector Machines," J. Theoretical Biology, vol. 228, pp. 551-557, 2004.

[52] K.C. Chou and C.T. Zhang, "Review: Prediction of Protein Structural Classes," Critical Rev. Biochemistry and Molecular Biology, vol. 30, pp. 275-349, 1995.

[53] K.C. Chou and H.B. Shen, "Cell-PLoc: A Package of Web Servers for Predicting Subcellular Localization of Proteins in Various Organisms," Nature Protocols, vol. 3, pp. 153-162, 2008.

[54] K.C. Chou and H.B. Shen, "Cell-PLoc 2.0: An Improved Package of Web-Servers for Predicting Subcellular Localization of Proteins in Various Organisms," Natural Science, vol. 2, pp. 1090-1103, 2010, http://www.scirp.org/journal/NS/.

[55] G. Ji, X. Wu, Y. Shen, J. Huang, and Q. Li, "A Classification-Based Prediction Model of Messenger RNA Polyadenylation Sites," J. Theoretical Biology, vol. 265, pp. 287-296, 2010.

[56] K.K. Kandaswamy, K.C. Chou, T. Martinetz, S. Moller, P.N. Suganthan, S. Sridharan, and G. Pugalenthi, "AFP-Pred: A Random Forest Approach for Predicting Antifreeze Proteins from Sequence-Derived Properties," J. Theoretical Biology, vol. 270, pp. 56-62, 2011.

[57] H. Lin and H. Ding, "Predicting Ion Channels and Their Types by the Dipeptide Mode of Pseudo Amino Acid Composition," J. Theoretical Biology, vol. 269, pp. 64-69, 2011.

[58] T. Liu and C. Jia, "A High-Accuracy Protein Structural Class Prediction Algorithm Using Predicted Secondary Structural Information," J. Theoretical Biology, vol. 267, pp. 272-275, 2010.

[59] M. Masso and I.I. Vaisman, "Knowledge-Based Computational Mutagenesis for Predicting the Disease Potential of Human Non-Synonymous Single Nucleotide Polymorphisms," J. Theoretical Biology, vol. 266, pp. 560-568, 2010.

[60] C. Chen, L. Chen, X. Zou, and P. Cai, "Prediction of Protein Secondary Structure Content by Using the Concept of Chou's Pseudo Amino Acid Composition and Support Vector Machine," Protein and Peptide Letters, vol. 16, pp. 27-31, 2009.

[61] H. Ding, L. Luo, and H. Lin, "Prediction of Cell Wall Lytic Enzymes Using Chou's Amphiphilic Pseudo Amino Acid Composition," Protein and Peptide Letters, vol. 16, pp. 351-355, 2009.

[62] F.M. Li and Q.Z. Li, "Predicting Protein Subcellular Location Using Chou's Pseudo Amino Acid Composition and Improved Hybrid Approach," Protein and Peptide Letters, vol. 15, pp. 612-616, 2008.

[63] H. Lin, H. Ding, F.-B. Guo, A.Y. Zhang, and J. Huang, "Predicting Subcellular Localization of Mycobacterial Proteins by Using Chou's Pseudo Amino Acid Composition," Protein and Peptide Letters, vol. 15, pp. 739-744, 2008.

[64] H. Mohabatkar, "Prediction of Cyclin Proteins Using Chou's Pseudo Amino Acid Composition," Protein and Peptide Letters, vol. 17, pp. 1207-1214, 2010.

[65] X. Xiao, P. Wang, and K.C. Chou, "GPCR-2L: Predicting G Protein-Coupled Receptors and Their Types by Hybridizing Two Different Modes of Pseudo Amino Acid Compositions," Molecular Biosystems, vol. 7, pp. 911-919, 2011.

[66] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, "Using the Concept of Chou's Pseudo Amino Acid Composition for Risk Type Prediction of Human Papillomaviruses," J. Theoretical Biology, vol. 263, pp. 203-209, 2010.

[67] Y.H. Zeng, Y.Z. Guo, R.Q. Xiao, L. Yang, L.Z. Yu, and M.L. Li, "Using the Augmented Chou's Pseudo Amino Acid Composition for Predicting Protein Submitochondria Locations Based on Auto Covariance Approach," J. Theoretical Biology, vol. 259, pp. 366-372, 2009.

[68] K.C. Chou and H.B. Shen, "Review: Recent Progresses in Protein Subcellular Location Prediction," Analytical Biochemistry, vol. 370, pp. 1-16, 2007.

[69] K.C. Chou and H.B. Shen, "A New Method for Predicting the Subcellular Localization of Eukaryotic Proteins with Both Single and Multiple Sites: Euk-mPLoc 2.0," PLoS ONE, vol. 5, p. e9931, 2010.

[70] K.C. Chou and H.B. Shen, "Plant-mPLoc: A Top-Down Strategy to Augment the Power for Predicting Plant Protein Subcellular Localization," PLoS ONE, vol. 5, p. e11335, 2010.

[71] K.C. Chou, "Pseudo Amino Acid Composition and Its Applications in Bioinformatics, Proteomics and System Biology," Current Proteomics, vol. 6, pp. 262-274, 2009.

[72] K.C. Chou and H.B. Shen, "Review: Recent Advances in Developing Web-Servers for Predicting Protein Attributes," Natural Science, vol. 2, pp. 63-92, 2009, http://www.scirp.org/journal/NS/.

Loris Nanni received the master's degree cum laude in 2002 from the University of Bologna, and the PhD degree in computer engineering at DEIS, University of Bologna, in 2006. His research interests include pattern recognition, bioinformatics, and biometric systems (fingerprint classification and recognition, signature verification, face recognition).

Alessandra Lumini received the master's degree from the University of Bologna, Italy, in 1996. In 1998, she started the PhD studies at DEIS, University of Bologna, and in 2001 she received the PhD degree in computer engineering for her work on "image databases." She is now an associate researcher at the University of Bologna. Her research interests include pattern recognition, bioinformatics, biometric systems, multidimensional data structures, digital image watermarking, and image generation.

Dinesh Gupta received the PhD degree from the All India Institute of Medical Sciences, New Delhi, India, in 1998. He was a coordinator for the WHO-funded ICGEB Regional Bioinformatics Training Centre for Tropical Disease Researchers. The Centre has organized several international bioinformatics workshops. He was a founder of the Bioinformatics course for PhD students at ICGEB New Delhi. He has been involved with international bioinformatics workshops, including those organized by WHO, NAS (USA), and regional universities. His scientific interests include the use of computational biology tools to solve research problems in the postgenomic era.

Aarti Garg is a senior research fellow (SRF) at the Bioinformatics Centre, Institute of Microbial Technology, Sector 39-A, Chandigarh, India. Her scientific interests include the use of computational biology tools to solve research problems in the postgenomic era.

