

Research Article
Application of Hybrid Functional Groups to Predict ATP Binding Proteins

Andreas N. Mbah

Center for Bioinformatics & Computational Biology, Department of Biology, Jackson State University, Jackson, MS 39217, USA

Correspondence should be addressed to Andreas N. Mbah; nji41@yahoo.com

Received 2 September 2013; Accepted 29 October 2013; Published 8 January 2014

Academic Editors: S.-A. Marashi and B. Oliva

Hindawi Publishing Corporation, ISRN Computational Biology, Volume 2014, Article ID 581245, 11 pages, http://dx.doi.org/10.1155/2014/581245

Copyright © 2014 Andreas N. Mbah. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The ATP binding proteins exist as a hybrid of proteins with the Walker A motif and universal stress proteins (USPs) having an alternative motif for binding ATP. There is an urgent need to find a reliable and comprehensive hybrid predictor for ATP binding proteins using whole sequence information. In this paper the open source LIBSVM toolbox was used to build a classifier at 10-fold cross-validation. The best hybrid model was the combination of amino acid and dipeptide composition, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) of 0.693. This classifier proves to be better than many classical ATP binding protein predictors. The general trend observed is that combinations of descriptors performed better than, and improved on, individual descriptors, particularly when combined with amino acid composition. The work developed a comprehensive model for predicting ATP binding proteins irrespective of their functional motifs. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins irrespective of their functional motifs.

1. Introduction

Recent advances in next-generation sequencing and the human genome projects have resulted in a rapid increase in protein sequences, thus widening the protein sequence-structure gap [1, 2] and leading to diverse protein functions within a common family. Computational tools for predicting protein structure and function are highly needed to narrow this widening gap [3]. The ATP binding proteins (ATP-BPs) are a diverse family of proteins in terms of amino acid sequence, function, and three-dimensional structure. These proteins hydrolyze ATP to provide the energy necessary to drive biochemical reactions in the cell [4]. There are two distinct functional groups of ATP binding proteins.

The first functional group has the Walker A motif [GXXXXGK(T/S) or G-4X-GK(T/S)] in its sequences for ATP binding [5]. Many members are transmembrane proteins responsible for transporting a wide variety of substrates across extra- and intracellular membranes [6]. The biochemical functions of ATP binding proteins are well exhibited within the ABC transporter group. In bacterial cells, ABC transporters pump substances such as sugars, vitamins, and metal ions into the cell, while in eukaryotes they transport molecules out of the cell [7]. They are also known to transport lipids and to protect the developing fetus against xenobiotics [7]. ABC transporters are crucial in the development of multidrug resistance, with the ATP binding sites exploitable as targets for chemotherapeutic agents [8]. The mechanism of action in multidrug transport is unclear. However, one model, called the hydrophobic vacuum cleaner, states that, in P-glycoprotein, the drugs are bound indiscriminately from the lipid phase based on their hydrophobicity [9].

The second, evolutionarily diverse functional class of ATP binding proteins is the universal stress proteins (USPs). USPs are found in a diverse group of organisms, including archaea, eubacteria, yeast, fungi, and plants; their expression is triggered by a variety of environmental stressors [10]. These stressors include, but are not limited to, starvation of nutrients such as carbon, nitrogen, phosphate, sulfate, and required amino acids, and a variety of toxicants and other agents such as heavy metals, oxidants, acids, heat shock, DNA damage, phosphate, uncouplers of the electron transport chain, and ethanol [11, 12]. The USPs bind ATP through the ATP binding motif [G-2X-G-9X-G(S/T)] [13]. Members of the USP family segregate into two groups based on whether or not they bind ATP [13].
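The two motif definitions above translate naturally into regular expressions. The following is a minimal sketch, not part of the original method, assuming that X stands for any single residue. The function name `find_motifs` and the toy sequences are hypothetical and serve only as illustration.

```python
import re

# Walker A motif G-4X-GK(T/S): G, any four residues, G, K, then T or S.
WALKER_A = re.compile(r"G.{4}GK[TS]")
# USP ATP binding motif G-2X-G-9X-G(S/T): G, two residues, G, nine residues, G, then S or T.
USP_ATP = re.compile(r"G.{2}G.{9}G[ST]")


def find_motifs(sequence):
    """Return the 0-based start positions of each motif in a protein sequence."""
    seq = sequence.upper()
    return {
        "walker_a": [m.start() for m in WALKER_A.finditer(seq)],
        "usp_atp": [m.start() for m in USP_ATP.finditer(seq)],
    }


# Toy sequences invented for illustration; they are not real proteins.
toy_walker = "MKLAGPSGSGKSTLLRAL"
toy_usp = "MAVGTDGSEASKRALEGSHH"
print(find_motifs(toy_walker))
print(find_motifs(toy_usp))
```

A motif scan of this kind only flags candidate sites; the classifier described in this paper works on whole-sequence descriptors instead.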

Experimental efforts are underway to determine the function of newly discovered proteins [14], but these experimental methods are costly and time consuming and are at times unsuccessful because of the complexity of the protein crystallization process. Several methods have been studied for predicting ATP binding residues from known structural features, but with low accuracies [15, 16]. Some predictors of ATP binding proteins have been developed with promising results, such as those in [17, 18], including the article by Green et al. [19] on an effective method to recognize ATP binding proteins by testing parallel cascade identification and KNN. Unfortunately, these methods were adapted to ATP binding proteins containing only the classical Walker A motif [G-4X-GK(T/S)] in their sequences. The objective of the research reported here was to introduce a classifier built from a pool of protein sequences containing both ATP binding motifs, G-4X-GK(T/S) and G-2X-G-9X-G(S/T). To achieve this objective, a support vector machine (SVM) approach is proposed, which predicts protein functions based on the discriminative features that map protein sequences to biological functions [20–23], using the sequence pool of ATP hybrid motifs.

There is a need to develop an automated predictor for ATP binding USP-encoded proteins to speed up experimental design and the study of how these proteins function under diverse environmental stressors. This research has developed a hybrid ATP binding protein predictor using the open source LIBSVM toolbox for classification. The best model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) of 0.693. This model shows strong overall performance in sensitivity (82.46%), specificity (87.00%), and precision (87.85%), with an area under the ROC curve (AUC) of 0.849219. The general trend shows that combinations of descriptors perform better than, and improve on, individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse motif groups of ATP binding proteins.

2. Materials and Method

2.1. Datasets. Balanced datasets of ATP and non-ATP binding proteins were constructed from the UniProt protein database (UniProt release 2011_11) (http://www.uniprot.org), the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do), the IMG/M database (http://img.jgi.doe.gov/cgi-bin/m/main.cgi), and published literature [24–26], which contain diverse universal stress proteins.

2.1.1. Extraction of Walker A Motif Dataset. A total of 2000 protein sequences belonging to the Walker A motif positive dataset were retrieved. Redundancy due to homologous sequences was removed using the CD-HIT [27] and PISCES [28] servers at a threshold of 25%. This threshold statistically retains an adequate number of protein sequences for analysis and avoids the bias that might result from high homology. The dataset obtained was manually reviewed through literature search and information from the Protein Data Bank [2] to ensure that the sequences represent ATP binding proteins. A total of 100 sequences were randomly selected from the original dataset and retained for training and testing to represent the Walker A motif positive (ATP binding) dataset. The Walker A motif negative dataset (non-ATP binding) was taken from Yu et al., 2006 [29]. This was the "negative" dataset used for nucleic acid binding proteins. Because ATP binding proteins are members of the nucleotide binding protein family, the negative dataset used in [29] for predicting the nucleotide binding protein family was considered useful. Redundancy was also maintained at the 25% threshold, and each protein was verified to be non-ATP binding using both the literature and Protein Data Bank information. A total of 100 sequences were also randomly selected from [29] and retained for training and testing to represent the Walker A motif negative (non-ATP binding) dataset.

2.1.2. Extraction of USP Protein Dataset. The extracted USP sequences were tested for the presence or absence of the G-2X-G-9X-G(S/T) motif using the NCBI conserved domain search tool [30]. The USP sequences were divided into two groups based on the presence or absence of the ATP binding motif [13]. Redundancy was also maintained at the 25% threshold, and 100 sequences were selected for each class of proteins (200 sequences in total).

The overall summary of the data prepared for analysis was as follows: (i) 100 ATP binding proteins with the Walker A motif, (ii) 100 non-ATP binding proteins without the Walker A motif, (iii) 100 USP sequences with the ATP binding motif [G-2X-G-9X-G(S/T)], and (iv) 100 USP sequences without the ATP binding motif [G-2X-G-9X-G(S/T)]. The 400 sequences were separated into two hybrid groups, 200 ATP binding sequences and 200 sequences without ATP binding motifs, and were used to generate the feature vector. The feature vector was generated from the entire sequences of the proteins (not only the ATP-binding domains) via the PROFEAT server using a set of 1497 descriptors [31]. Biologically informative physicochemical and sequence attributes were prioritized for investigation. The attributes were incorporated into the LIBSVM classifier to find the best hybrid model for predicting ATP binding proteins.
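To make the two descriptors that matter most in this study concrete, the sketch below computes amino acid composition (20 values) and dipeptide composition (400 values) directly from a sequence and concatenates them. It is an illustration only: the study obtained its descriptors from the PROFEAT server, whose exact scaling and output format may differ, and the function names here are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [c / total for c in counts]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent pairs (400 values)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


features = hybrid_features("MKLAGPSGSGKSTLLRAL")  # toy sequence, not a real protein
print(len(features))
```

Each protein, whatever its length, is mapped to a vector of fixed size, which is what allows whole sequences to be fed to an SVM.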

2.2. LIBSVM Classifier. Support vector machines (SVMs) treat the objects to be classified as points in a high-dimensional space that need a hyperplane to separate them [32]. The biological molecules are represented by a descriptor set. With a proper mapping furnished by a kernel function, SVM classifiers separate the transformed data with a hyperplane in a high-dimensional space to predict the correct classification of protein functional classes. SVMs have been widely used in supervised classification problems in bioinformatics, such as [33–36]. The LIBSVM package, which is freely downloadable at http://www.csie.ntu.edu.tw/~cjlin/libsvm, was adopted and used to evaluate the attributes and build the final classifier using the radial basis function (RBF) as the kernel function [37–39].

A "grid search" was employed to select proper values of the RBF parameter $\gamma$ and the penalty parameter $C$ of the soft margin SVM. $C$ was set to $2^{-5}, 2^{-3}, \ldots, 2^{15}$ and $\gamma$ to $2^{-15}, 2^{-13}, \ldots, 2^{3}$. All combinations of $C$ and $\gamma$ were tested, and the pair with the best cross-validation accuracy for each feature set or combination of feature sets was selected. A smaller $\gamma$ value makes the decision boundary smoother. The SVM training parameter $C$ is the regularization factor that controls the tradeoff between low training error and large margin [37, 40]. Throughout this work the parameter $C$ was maintained at $C = 4$, the best value found after trial and error assessment. The optimal value of $\gamma$ was obtained for each descriptor set. The entire set of attributes was evaluated in terms of association with ATP binding proteins, and a final subset with good predictive power was selected. In this research, 10-fold cross-validation (10CV) was implemented. The objective of training is to maximize the ability of the SVM predictor to discriminate between classes while avoiding overfitting.
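The grid search described above can be sketched in a few lines. This is a hedged illustration rather than the LIBSVM tooling itself: `exponent_grid`, `grid_search`, and `rbf_kernel` are hypothetical helper names, and `cv_accuracy` stands for any user-supplied function that trains an RBF SVM with the given $C$ and $\gamma$ and returns its cross-validation accuracy.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2); a smaller gamma gives a smoother kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def exponent_grid(start, stop, step=2):
    """Powers of two 2**start, 2**(start + step), ..., 2**stop (inclusive)."""
    return [2.0 ** e for e in range(start, stop + 1, step)]


C_GRID = exponent_grid(-5, 15)      # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = exponent_grid(-15, 3)  # 2^-15, 2^-13, ..., 2^3


def grid_search(cv_accuracy, c_values=C_GRID, gamma_values=GAMMA_GRID):
    """Try every (C, gamma) pair and return (best_accuracy, best_C, best_gamma)."""
    best = None
    for c in c_values:
        for gamma in gamma_values:
            acc = cv_accuracy(c, gamma)
            if best is None or acc > best[0]:
                best = (acc, c, gamma)
    return best


# Toy scoring function standing in for a real cross-validation run (assumption).
def toy_score(c, g):
    return -abs(math.log2(c) - 3) - abs(math.log2(g) + 5)


print(grid_search(toy_score))
```

In practice `cv_accuracy` would wrap an actual SVM training run, so the full grid costs one cross-validation per pair (110 pairs for the two grids above).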

2.3. Tenfold Cross-Validation Analysis. Choosing a technique to evaluate a newly developed method is a major challenge for investigators. Jackknife leave-one-out cross-validation (LOOCV) [41–43] is a popular technique for evaluating models. During this procedure, one sequence is used for testing and the remaining sequences are used for training. This process is repeated until each sequence has been used once for testing. Even though this method is popular, it is computationally intensive and time consuming.

In this work, 10-fold cross-validation was used to train and test the dataset, with sequences randomly partitioned into ten sets. This cross-validation ensures that the dataset was split at the protein level in addition to the stratified partition, thus ensuring a more rigorous evaluation. During the procedure, the positive and negative data samples are distributed randomly into 10 sets, or folds. In each of the 10 rounds, 9 of the 10 sets are used to construct a classifier (training), and the classifier is then evaluated using the remaining set (testing). This procedure was repeated ten times so that each set was used once for testing [44, 45]. The overall performance was the average of the performances over all 10 sets.
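The partitioning step can be illustrated as follows. This is a minimal sketch under the assumption that stratification means spreading each class evenly over the folds; the function name `stratified_k_fold` and the fixed random seed are hypothetical choices, not details taken from the study.

```python
import random


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices); each class is spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)  # random assignment within each class
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# 200 positive and 200 negative samples, as in the dataset summary above.
labels = [1] * 200 + [0] * 200
for train, test in stratified_k_fold(labels):
    pass  # train a classifier on `train`, evaluate it on `test`, then average the scores
```

With 200 sequences per class, each test fold holds 20 positives and 20 negatives, and every sequence is tested exactly once.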

2.4. The LIBSVM Performance Evaluation. The standard parameters used in evaluating the performance of LIBSVM are indicated below. The overall accuracy (Acc) is the intuitive measure of performance on a balanced dataset, whereas the Matthews correlation coefficient (MCC) [46] is more realistic than Acc when measuring performance on an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better. In addition to Acc and MCC, the parameters below were also calculated. Sensitivity is the percentage of correctly predicted binding proteins out of the total binding proteins.

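True positive (TP)
True negative (TN)
False positive (FP) (false alarm)
False negative (FN)
False positive rate (FPR)
Sensitivity/recall or true positive rate (TPR): $\mathrm{TPR} = \mathrm{TP}/P = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$
Precision: $\mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$
Accuracy (Acc): $(\mathrm{TP} + \mathrm{TN})/(P + N) = (\mathrm{TP} + \mathrm{TN})/(\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$
Specificity (SPC): $\mathrm{SPC} = \mathrm{TN}/N = \mathrm{TN}/(\mathrm{FP} + \mathrm{TN}) = 1 - \mathrm{FPR}$
Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TN} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TP} + \mathrm{FP})}} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{P N P' N'}} \tag{1}$$

where, matching the two forms of (1) term by term, $P = \mathrm{TP} + \mathrm{FN}$, $N = \mathrm{TN} + \mathrm{FP}$, $P' = \mathrm{TP} + \mathrm{FP}$, and $N' = \mathrm{TN} + \mathrm{FN}$.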

Here, TP is the number of true positives (ATP-BPs), TN is the number of true negatives (non-ATP-BPs), FP is the number of false positives, and FN is the number of false negatives.
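The definitions above can be checked numerically with a short helper. This is a sketch only: the function name `classification_metrics` is hypothetical, and the confusion-matrix counts in the usage line are invented for illustration and are not reported in the article.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, precision, accuracy, and MCC from counts."""
    p, n = tp + fn, tn + fp            # actual positives and negatives
    p_pred, n_pred = tp + fp, tn + fn  # predicted positives and negatives (P' and N')
    denominator = math.sqrt(p * n * p_pred * n_pred)
    return {
        "sensitivity": tp / p,
        "specificity": tn / n,
        "precision": tp / p_pred,
        "accuracy": (tp + tn) / (p + n),
        # MCC is conventionally set to 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=94, tn=87, fp=13, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four cells of the confusion matrix, it stays informative even when one class dominates, which is why it is reported alongside accuracy.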

2.5. Area under the ROC Curve (AUC) for LIBSVM. The ROC curve is a plot of the true positive proportion, TP/(TP + FN), against the false positive proportion, FP/(FP + TN). The StatsDirect package was used to plot the ROC curve and to calculate the area under it directly by an extended trapezoidal rule [49]. The confidence interval was constructed using DeLong's variance estimate [50], embedded in the statistical package.
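As an illustration of the area calculation, the sketch below builds ROC points from classifier scores and integrates them with the plain trapezoidal rule. It is not StatsDirect's implementation; the function names and the toy scores are hypothetical, and the code assumes labels of 1 (positive) and 0 (negative) with at least one sample of each.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) obtained by lowering the decision threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores and labels invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc_trapezoid(roc_points(scores, labels)))
```

For untied scores this area equals the fraction of positive-negative pairs ranked correctly, which is the link to the Wilcoxon/Mann-Whitney statistic.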

3. Results and Discussion

The ATP binding proteins are known to play key roles in the biochemical functioning of the cell. In signaling pathways, ATP molecules are substrates for protein kinase phosphorylation. It is difficult to identify ATP binding proteins because of the lack of experimentally determined protein structures [51–53]. The growth of protein sequences from various genomic projects exceeds the capacity of experimental techniques for determining protein structures and their binding reactions, which are time consuming and at times unsuccessful. Therefore, there is an urgent need to develop automated expert methods for determining the functional class of proteins, such as ATP binding proteins, from their primary sequence information.

The general assumption here is that every protein that binds an ATP molecule, whether a USP or a protein with the Walker A motif, will have some common features embedded in its sequence. In both the USP (G-2X-G-9X-G(S/T)) and Walker A (G-4X-GK(T/S)) motifs, G, K, T, and S denote glycine, lysine, threonine, and serine, respectively, and X denotes any amino acid residue. The lysine (K) residue in the Walker A motif is crucial for nucleotide binding [54] in this class of proteins. It interacts with the phosphate groups of the nucleotide and with the magnesium ion, which coordinates the β- and γ-phosphates of the ATP molecule [55, 56].

Figure 1: The performances of descriptors with LIBSVM in terms of accuracy (Acc_svm, plotted per feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of accuracy. The best descriptor was the combination of amino acid and dipeptide composition (84.57%), followed by amino acid composition (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order. The pseudo amino acid and quasi-sequence-order descriptors perform poorly.

The universal stress proteins bind ATP through the ATP binding motif G-2X-G-9X-G(S/T), with the G(S/T) residues essential for ATP binding and phosphorylation [13]. Members of this class of proteins therefore segregate into two groups based on whether or not they bind ATP [13, 57]. Thus, it is important to identify ATP binding USPs and other ATP binding proteins. Several methods have been studied for predicting ATP interacting residues when the protein structures are known, with some results showing very low accuracies [15, 16, 58, 59]. This work has predicted ATP binding proteins in general with high accuracy, irrespective of their structural information, using an SVM classifier. The training and prediction statistics for each of the descriptor sets used are visualized and discussed below. The visualizations were constructed using Tableau Public software (http://www.tableausoftware.com/public).

The objective in this report was to find the best descriptor set that can be used to build a predictive model for a reliable and effective server for predicting ATP-BPs in general, irrespective of their subfunctional classes. Throughout this work the parameter $C$ was maintained at $C = 4$, while the optimal value of $\gamma$ for each descriptor was obtained and used in evaluating its performance. The performances were evaluated based on five computed parameters, namely accuracy, sensitivity, specificity, precision, and MCC, after a 10-fold cross-validation (CV10).

Figure 2: The performances of descriptors with LIBSVM in terms of the Matthews correlation coefficient (MCC, plotted per feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of MCC. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order.

The performance of pseudo amino acid composition was evaluated with accuracy only, due to lack of sufficient sequence information. The lengths of the color-coded descriptor bars were used as a measure of performance. In terms of accuracy, the best descriptor was the combination of amino acid with dipeptide composition (84.57%), followed by amino acid composition alone (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order (Figure 1). The pseudo amino acid and quasi-sequence-order descriptors performed poorly compared to the other descriptors. However, the overall performances of the other descriptors were better, as most of them registered accuracy values greater than 70.00%. These high performances might be due to the rigorous refinement of the protein sequences. Thus, protein function classification with SVM classifiers can be improved drastically using rigorously refined protein sequences.

Figure 3: The performances of descriptors with LIBSVM in terms of sensitivity (plotted per feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of sensitivity. The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

The individual accuracies of amino acid composition (83.64%) and dipeptide composition (83.17%) increased to 84.57% when the two descriptors were combined. This indicates that combining descriptors can enhance the performance of individual descriptors, particularly when they are combined with amino acid composition. This is a binary classification problem involving a balanced dataset, and accuracy (Acc) is the best parameter for evaluating performance on a balanced dataset, whereas the Matthews correlation coefficient (MCC) is more realistic than Acc when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better.

Figure 4: The performances of descriptors with LIBSVM in terms of specificity (plotted per feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of specificity. The most specific descriptors were amino acid composition and amino acid/dipeptide composition (0.87), followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order.

The performances of the models were also evaluated based on MCC (Figure 2). The pyramidal view and the length of the color-coded descriptor bars were used for performance visualization. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order. This order is in line with their performances measured using accuracy as the parameter. This result justifies the performance of the overall model. In general, combinations of descriptor sets perform better than individual descriptors, particularly when combined with amino acid composition.

Figure 5: The performances of descriptors with LIBSVM in terms of precision (plotted per feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of precision. The most precise descriptor set was the quasi-sequence-order descriptors (0.9626), followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order.

Therefore, from a statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. Amino acid composition generally increases the overall accuracy of other descriptors in combination. One shortcoming of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences because sequence order is lost [28, 60]. This sequence order information can be partially recovered by combination with dipeptide composition, but dipeptide composition itself lacks information on the fraction of each individual residue in the sequence; as such, a combination set is expected to give a better prediction result [27, 61], as shown above, due to a masking effect.
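The loss of sequence order can be seen with two toy sequences that share the same residue counts but differ in residue order. The sketch below is illustrative only; the sequences and helper names are hypothetical.

```python
from collections import Counter


def residue_counts(seq):
    """Count each residue; sequence order is ignored entirely."""
    return Counter(seq)


def dipeptide_counts(seq):
    """Count each adjacent residue pair; this keeps local order information."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


seq_a = "GKTSGKTS"  # toy sequences built from the same eight residues
seq_b = "GGKKTTSS"

print(residue_counts(seq_a) == residue_counts(seq_b))      # same composition
print(dipeptide_counts(seq_a) == dipeptide_counts(seq_b))  # different pair counts
```

Amino acid composition cannot tell the two sequences apart, whereas their dipeptide counts differ, which is the complementarity the combined descriptor set relies on.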

The models were further investigated based on their sensitivity in predicting ATP-BPs, and the results are displayed in a pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

Figure 6: The ROC plot (sensitivity against 1 − specificity). The plot shows the performance of the LIBSVM model, generated with the StatsDirect package using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

These descriptors were among the best four performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the quasi-sequence-order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).

The overall model evaluation shows that the combination of amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole sequence information. The use of the "all descriptors" set did not generally result in a better classification model. The accuracy of the "all features" descriptor set was 79.9%, against 84.57% for amino acid/dipeptide composition in combination. This finding is in accordance with [62, 63], which studied molecular descriptors for predicting compounds of specific properties using the "all features" set. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. The accuracy of classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance in solving the classification problem in question. The ROC plot of the SVM model (Figure 6) has an AUC of 0.849219. This highlights a better model based on whole sequence analysis.

4. Conclusions

The prediction of ATP-binding proteins has been explored using a battery of descriptor sets and a hybrid functional group. Also, for the first time, the prediction of ATP binding in universal stress proteins has been investigated using the support vector machine. The best hybrid model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) value of 0.693. The general trend is that combinations of descriptors perform better and improve on the performance of individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins.

Conflict of Interests

The author reports no conflict of interests in this work, including the mentioned trademarks.

Acknowledgments

The research reported was supported by the National Institutes of Health (NIH-NIGMS-1T36GM095335) and the National Science Foundation (EPS-0903787, EPS-1006883). The content is solely the responsibility of the author and does not necessarily represent the official views of the funding agencies.

References

[1] A. Bairoch and R. Apweiler, "The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000," Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] H. M. Berman, J. Westbrook, Z. Feng et al., "The protein data bank," Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] J. Guo, H. Chen, Z. Sun, and Y. Lin, "A novel method for protein secondary structure prediction using dual-layer SVM and profiles," Proteins, vol. 54, no. 4, pp. 738–743, 2004.

[4] C. Bustamante, Y. R. Chemla, N. R. Forde, and D. Izhaky, "Mechanical processes in biochemistry," Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004.

[5] J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, "Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold," The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] N. Hirokawa and R. Takemura, "Biochemical and molecular characterization of diseases linked to motor proteins," Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003.

10 ISRN Computational Biology

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, "Transport of glyburide by placental ABC transporters: implications in fetal drug exposure," Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, "The ATP-binding site of type II topoisomerases as a target for antibacterial drugs," Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, "Kinetic analysis of the mechanism of action of the multidrug transporter," Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, "The bacterial universal stress protein: function and regulation," Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, "Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli," Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, "The universal stress protein a of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway," Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. Mckay, "Structure of the universal stress protein of Haemophilus influenzae," Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, "Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey," Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, "Identification of ATP binding residues of a protein from its primary sequence," BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, "A novel statistical ligand-binding site predictor: application to ATP-binding sites," Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, "ATPsite: sequence-based prediction of ATP-binding residues," Proteome Science, vol. 9, supplement 1, article S4, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, "Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features," BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, "Recognition of adenosine triphosphate binding sites using parallel cascade system identification," Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, "Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search," The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, "Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information," Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, "GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes," Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, "Prediction of RNA binding sites in a protein using SVM and PSSM profile," Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., "Functional annotation analytics of bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant bacillus megaterium," Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., "Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni," Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, "Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites," Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, "Clustering of highly homologous sequences to reduce the size of large protein databases," Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., "PISCES: a protein sequence culling server," Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, "Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines," Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., "CDD: conserved domains and protein three-dimensional structure," Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, "PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence," Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., "Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein," PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, "Effect of training datasets on support vector machine prediction of protein-protein interactions," Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, "Predicting protein-protein interactions from sequences in a hybridization space," Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, "Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality," Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., "RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis," in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 4352–4355, 2009.

[39] C.-C. Chang and C.-J. Lin, "Training nu-support vector classifiers: theory and algorithms," Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, "Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, "Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine," Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, "Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition," Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., "Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS," Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, "Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis," Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, "Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information," Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, "Predicting deleterious nsSNPs: an analysis of sequence and structural attributes," BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. Mcneil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. Delong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach," Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, "The relation between the divergence of sequence and structure in proteins," The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, "How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins," Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, "Structural relationships of homologous proteins as a fundamental principle in homology modeling," Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, "AAA+ proteins: have engine, will work," Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, "The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins," The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. Mcpherson, A. H. J. Wang, and A. Rich, "Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu," The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., "Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, "An empirical approach for detecting nucleotide-binding sites on proteins," Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, "Automated analysis of interatomic contacts in proteins," Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, "Boostexter: a boosting-based system for text categorization," Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, "Efficacy of different protein descriptors in predicting protein functional families," BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, "Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening," Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, "Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity," Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.


[11, 12]. The USPs bind ATP through the ATP binding motif [G-2X-G-9X-G(S/T)] [13]. Members of the USPs segregate into two groups based on whether or not they bind ATP [13].

Experimental efforts are underway to determine the function of newly discovered proteins [14], but these experimental methods are costly and time consuming and at times are unsuccessful owing to the complexity of the protein crystallization process. Several methods have been studied that predict ATP binding residues from known structural features, but with low accuracies [15, 16]. Some predictors of ATP binding proteins have been developed with promising results, such as those in [17, 18], and Green et al. [19] described an effective method for recognizing ATP binding proteins by testing parallel cascade identification and KNN. Unfortunately, these methods were adapted to ATP binding proteins containing only the classical Walker A motif [G-4X-GK(T/S)] in their sequences. The objective of the research reported here was to introduce a classifier built from a pool of protein sequences containing both ATP binding motifs, G-4X-GK(T/S) and G-2X-G-9X-G(S/T). To achieve this objective, a support vector machine (SVM) approach is proposed, which predicts protein functions based on the discriminative features that map protein sequences to biological functions [20–23], using the pooled sequences carrying the hybrid ATP binding motifs.

There is a need to develop an automated predictor for ATP binding USP encoded proteins to speed experimental designs and to study how these proteins function under diverse environmental stressors. This research has developed a hybrid ATP binding protein predictor using the open source LIBSVM classification toolbox. The best model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) value of 0.693. This model shows a striking overall performance in sensitivity (82.46%), specificity (87.00%), and precision (87.85%), with an area under the ROC curve (AUC) value of 0.849219. The general trend shows that combinations of descriptors perform better and improve on the performance of individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse motif groups of ATP binding proteins.

2. Materials and Method

2.1. Datasets. Balanced datasets of ATP and non-ATP binding proteins were constructed from the UniProt protein database (UniProt release 2011_11) (http://www.uniprot.org/), the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do), the IMG/M database (http://img.jgi.doe.gov/cgi-bin/m/main.cgi), and published literature [24–26], which contain diverse universal stress proteins.

2.1.1. Extraction of Walker A Motif Dataset. A total of 2000 protein sequences belonging to the Walker A motif positive dataset were retrieved. Redundancy due to homologous sequences was removed using the CD-HIT [27] and PISCES [28] servers at a threshold of 25%. This threshold retains an adequate number of protein sequences for analysis and avoids the bias that might result from high homology. The dataset obtained was manually reviewed through literature search and information from the Protein Data Bank [2] to ensure that the sequences represent ATP binding proteins. A total of 100 sequences were randomly selected from the original dataset and retained for training and testing to represent the Walker A motif positive (ATP binding) dataset. The Walker A motif negative (non-ATP binding) dataset was taken from Yu et al., 2006 [29], where it served as the "negative" dataset for nucleic acid binding proteins. Because ATP binding proteins are members of the nucleotide binding protein family, the negative dataset used in [29] for predicting the nucleotide binding protein family was considered useful here. Redundancy was also maintained at the 25% threshold, and each protein was verified to be non-ATP binding using both the literature and Protein Data Bank information. A total of 100 sequences were also randomly selected from [29] and retained for training and testing to represent the Walker A motif negative (non-ATP binding) dataset.

2.1.2. Extraction of USP Protein Dataset. The extracted USP sequences were tested for the presence or absence of the G-2X-G-9X-G(S/T) motif using the NCBI conserved domain search tool [30]. The USP sequences were divided into two groups based on the presence or absence of the ATP binding motif [13]. The redundancy was also maintained at the 25% threshold, and 100 sequences were selected for each class of proteins (200 sequences in total).
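The motif test described above can be illustrated with a minimal sketch. It uses plain regular expressions rather than the NCBI conserved domain search used in this work, so it is a simplified stand-in and not the procedure that produced the datasets. The pattern forms and the function name `motif_flags` are assumptions made for illustration.

```python
import re

# Assumed regular-expression forms of the two motifs quoted in the text.
# "." stands for X (any residue); these are illustrative, not the CDD models.
WALKER_A = re.compile(r"G.{4}GK[TS]")      # G-4X-GK(T/S)
USP_ATP = re.compile(r"G.{2}G.{9}G[ST]")   # G-2X-G-9X-G(S/T)


def motif_flags(sequence: str) -> dict:
    """Report which of the two ATP binding motifs occurs in a sequence."""
    seq = sequence.upper()
    return {
        "walker_a": WALKER_A.search(seq) is not None,
        "usp_atp": USP_ATP.search(seq) is not None,
    }
```

A pattern match only flags a candidate; a domain-level search such as the one cited in the text is more selective than a bare regular expression.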

The overall summary of the data prepared for analysis was as follows: (i) 100 ATP binding proteins with the Walker A motif, (ii) 100 non-ATP binding proteins without the Walker A motif, (iii) 100 USP sequences with the ATP binding motif [G-2X-G-9X-G(S/T)], and (iv) 100 USP sequences without the ATP binding motif [G-2X-G-9X-G(S/T)]. The 400 sequences were separated into two hybrid groups, 200 ATP binding sequences and 200 sequences without ATP binding motifs, and were used to generate the feature vector. The feature vector was generated from the entire sequences of the proteins (not only the ATP binding domains) via the PROFEAT server using a set of 1497 descriptors [31]. Biologically informative physicochemical and sequence attributes were prioritized for investigation. The attributes were incorporated into the LIBSVM classifier to find the best hybrid model for predicting ATP binding proteins.
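As an illustration of the two simplest descriptor sets, the sketch below computes amino acid composition (20 values) and dipeptide composition (400 values) from a whole sequence and concatenates them. It is not the PROFEAT implementation: the function names are hypothetical, values are returned as fractions rather than percentages, and non-standard residues are simply dropped, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def _clean(sequence: str) -> str:
    # Assumption: residues outside the 20 standard letters are discarded,
    # so neighbours of a discarded residue become adjacent.
    return "".join(r for r in sequence.upper() if r in AMINO_ACIDS)


def aa_composition(sequence: str) -> list:
    """Fraction of each of the 20 standard amino acids."""
    seq = _clean(sequence)
    counts = Counter(seq)
    return [counts[a] / len(seq) if seq else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(sequence: str) -> list:
    """Fraction of each of the 400 ordered residue pairs."""
    seq = _clean(sequence)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [counts[d] / len(pairs) if pairs else 0.0 for d in DIPEPTIDES]


def hybrid_features(sequence: str) -> list:
    """Concatenated 420-dimensional feature vector."""
    return aa_composition(sequence) + dipeptide_composition(sequence)
```

Each sequence, whatever its length, maps to a vector of fixed size, which is what allows whole-sequence information to be passed to an SVM.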

2.2. LIBSVM Classifier. Support vector machines (SVMs) treat the objects to be classified as points in a high-dimensional space that are separated by a hyperplane [32]. The biological molecules are represented by a descriptor set. With a proper mapping furnished by a kernel function, SVM classifiers separate the transformed data with a hyperplane in a high-dimensional space to predict the correct classification of protein functional classes. SVMs have been widely used in supervised classification problems in bioinformatics, such as [33–36]. The LIBSVM package, which is freely downloadable at http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was adopted and used to evaluate the attributes and build the final classifier using the radial basis function (RBF) as the kernel function [37–39].

A "grid-search" was employed to select proper values of the RBF kernel parameter ($\gamma$) and the penalty parameter ($C$) of the soft margin SVM. $C$ was set to $2^{-5}, 2^{-3}, \ldots, 2^{15}$ and $\gamma$ to $2^{-15}, 2^{-13}, \ldots, 2^{3}$. All combinations of $C$ and $\gamma$ were tested, and the pair with the best cross-validation accuracy for each feature set or combination of feature sets was selected. A smaller $\gamma$ value makes the decision boundary smoother. The SVM training parameter $C$ is the regularization factor, which controls the tradeoff between low training error and large margin [37, 40]. Throughout this work, the parameter $C$ was maintained at $C = 4$, identified as the best value after trial and error assessment. The optimal value of $\gamma$ was obtained for each descriptor set. The entire set of attributes was evaluated in terms of association with ATP binding proteins, and a final subset with good predictive power was selected. In this research, a 10-fold cross-validation (10CV) was implemented. The objective of training is to maximize the ability of the SVM predictor to discriminate between classes while avoiding overfitting.
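The grid search over the two parameters can be sketched as follows. The exponent ranges follow the values quoted above; everything else is an assumption for illustration. In particular, `cv_accuracy` is a hypothetical callback that would train an RBF SVM and return its cross-validation accuracy, and `toy_cv_accuracy` is an invented smooth function used only so that the sketch runs without an SVM library.

```python
import math
from itertools import product

# Exponent grids as quoted in the text: C = 2^-5, 2^-3, ..., 2^15
# and gamma = 2^-15, 2^-13, ..., 2^3.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-15, 4, 2)]


def grid_search(cv_accuracy):
    """Return (best_accuracy, best_C, best_gamma) over the full grid.

    cv_accuracy(C, gamma) is a caller-supplied function (hypothetical here)
    that returns the cross-validation accuracy for one parameter pair.
    """
    best = None
    for c, gamma in product(C_GRID, GAMMA_GRID):
        acc = cv_accuracy(c, gamma)
        if best is None or acc > best[0]:
            best = (acc, c, gamma)
    return best


def toy_cv_accuracy(c, gamma):
    # Invented stand-in for a real SVM run; peaks at C = 2^1, gamma = 2^-3.
    return 1.0 - 0.01 * (abs(math.log2(c) - 1) + abs(math.log2(gamma) + 3))
```

Calling `grid_search(toy_cv_accuracy)` scans all 110 pairs. With a real classifier the callback would wrap one cross-validation run per pair, which is why the grid is kept coarse.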

2.3. Tenfold Cross-Validation Analysis. Choosing a technique to evaluate a newly developed method has become a major challenge for investigators. The jack-knife, or leave-one-out cross-validation (LOOCV) [41–43], is a popular technique for evaluating models. During this procedure, one sequence is used for testing and the remaining sequences are used for training. The process is repeated until each sequence has been used once for testing. Even though this method is popular, it is computationally intensive and time consuming.

In this work, 10-fold cross-validation was used to train and test the dataset, with sequences randomly partitioned into ten sets. This cross-validation ensures that the dataset was split at the protein level in addition to the stratified partition, thus ensuring a more rigorous evaluation. During the procedure, the positive and negative data samples are distributed randomly into 10 sets, the so-called folds. In each of the 10 rounds, 9 of the 10 sets are used to construct a classifier (training), and the classifier is then evaluated using the remaining set (testing). This procedure was repeated ten times so that each set was used once for testing [44, 45]. The overall performance was the average of the performances over all 10 sets.
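The fold construction described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the LIBSVM routine used in this work: the function names, the fixed random seed, and the `evaluate` callback (which would train on one index list and score on the other) are all hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class proportions equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validate(labels, evaluate, k=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the k held-out folds."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

With 200 positive and 200 negative sequences, each fold holds 20 of each class, so every round trains on 360 sequences and tests on 40.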

2.4. The LIBSVM Performance Evaluation. The standard parameters used in evaluating the performance of LIBSVM are indicated below. The overall accuracy (Acc) is the intuitive measurement of performance on a balanced dataset, whereas Matthews correlation coefficient (MCC) [46] is more realistic than Acc in measuring performance when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better. In addition to Acc and MCC, the parameters below were also calculated. Sensitivity is the percentage of correctly predicted binding proteins out of the total binding proteins.

The quantities used are true positive (TP), true negative (TN), false positive (FP, false alarm), false negative (FN), and false positive rate (FPR):

\begin{align}
\text{Sensitivity (recall, TPR)} &= \frac{TP}{P} = \frac{TP}{TP + FN}, \notag \\
\text{Precision} &= \frac{TP}{TP + FP}, \notag \\
\text{Accuracy (Acc)} &= \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN}, \notag \\
\text{Specificity (SPC)} &= \frac{TN}{N} = \frac{TN}{FP + TN} = 1 - FPR, \notag \\
\text{MCC} &= \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{P\,N\,P'\,N'}}, \tag{1}
\end{align}

where $P = TP + FN$, $N = TN + FP$, $P' = TP + FP$, and $N' = TN + FN$.

Here, TP is the number of true positives (ATP-BPs), TN is the number of true negatives (non-ATP-BPs), FP is the number of false positives, and FN is the number of false negatives.
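The measures defined above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name and the counts used in the usage note are invented for illustration and are not results from this study.

```python
from math import sqrt


def _ratio(numerator, denominator):
    # Return 0.0 instead of failing when a class or prediction group is empty.
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, accuracy, and MCC from counts."""
    mcc_denominator = sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, fp + tn),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For example, `binary_metrics(80, 90, 10, 20)` describes an invented test set with 100 positives and 100 negatives in which 80 positives and 90 negatives are classified correctly.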

2.5. Area under the ROC Curve (AUC) for LIBSVM. The ROC curve is a plot of the true positive proportion (TP/(TP + FN)) against the false positive proportion (FP/(FP + TN)). The StatsDirect package was used to plot the ROC curve and to calculate the area under it directly by an extended trapezoidal rule [49]. The confidence interval was constructed using DeLong's variance estimate [50], embedded in the statistics package.
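To make the two AUC calculations concrete, the sketch below builds ROC points from classifier scores, integrates them with the trapezoidal rule, and compares the result with the rank-based (Wilcoxon/Mann-Whitney) estimate. It is an independent illustration, not the StatsDirect implementation, and it omits the DeLong confidence interval. The function names are assumptions, and labels are taken to be 1 for ATP binding and 0 otherwise.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), lowering the threshold one distinct score at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The two estimates agree on the same data, which is why the trapezoidal area can be read as the probability that a randomly chosen ATP binding protein receives a higher score than a randomly chosen non-binding one.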

3. Results and Discussion

The ATP binding proteins are known to play key roles in the biochemical functioning of the cell. In signaling pathways, ATP molecules are substrates for protein kinase phosphorylation. It is difficult to identify ATP binding proteins owing to the lack of experimentally determined protein structures [51–53]. This is because the growth of protein sequences from various genomic projects exceeds the capacity of experimental techniques for determining protein structures and their binding reactions, which are time consuming and at times unsuccessful. Therefore, there is an urgent need to develop automated expert methods for determining the functional class of proteins, such as ATP binding proteins, from their primary sequence information.

The general assumption here is that every protein that binds an ATP molecule, whether a USP or a protein with the Walker A motif, will have some common features embedded in its sequence. In both the USP (G-2X-G-9X-G(S/T)) and Walker A (G-4X-GK(T/S)) motifs, G, K, T, and S denote glycine, lysine, threonine, and serine, respectively, and X denotes any amino acid residue. The lysine (K) residue in the Walker A motif is crucial for nucleotide binding [54] in this class of proteins. It interacts with the phosphate groups of the nucleotide and with the magnesium ion, which coordinates the β- and γ-phosphates of the ATP molecule [55, 56].

Figure 1: The performances of descriptors with LIBSVM in terms of accuracy (bar chart of Accsvm for each feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of accuracy (Accsvm). The best descriptor was the combination of amino acid and dipeptide composition (84.57%), followed by amino acid composition (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order. The pseudo amino acid composition and Quasi sequence order descriptors perform poorly.

The universal stress proteins bind ATP through the ATP binding motif G-2X-G-9X-G(S/T), with the G(S/T) residues essential for ATP binding and phosphorylation [13]. Therefore, members of this class of proteins segregate into two groups based on whether or not they bind ATP [13, 57]. Thus, it is important to identify ATP binding USPs and other ATP binding proteins. Several methods have been studied that predict ATP interacting residues when the protein structures are known, with some results showing very low accuracies [15, 16, 58, 59]. This work has predicted ATP binding proteins in general with high accuracy, irrespective of their structural information, using an SVM classifier. The training and prediction statistics for each of the descriptor sets used are visualized and discussed below. The visualizations were constructed using Tableau Public software (http://www.tableausoftware.com/public/).

The objective in this report was to find the best descriptor set that can be used to build a predictive model for a reliable and effective server for predicting ATP-BPs in general, irrespective of their subfunctional classes. Throughout this work, the parameter $C$ was maintained at $C = 4$, while the optimal value of $\gamma$ for each descriptor was obtained and used in evaluating its performance. The performances were evaluated based on five computed parameters, namely, accuracy, sensitivity, specificity, precision, and MCC, after a 10-fold cross-validation (10CV).

Figure 2: The performances of descriptors with LIBSVM in terms of Matthews correlation coefficient (MCC) (bar chart of MCC for each feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of MCC. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order.

The performance of pseudo amino acid composition was evaluated with accuracy only, owing to lack of sufficient sequence information. The lengths of the color-coded descriptor bars were used as a measure of performance. In terms of accuracy, the best descriptor was the combination of amino acid with dipeptide composition (84.57%), followed by amino acid composition alone (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order (Figure 1). The pseudo amino acid composition and Quasi sequence order descriptors performed poorly compared to the other descriptors. However, the overall performance of the other descriptors was better, as most of them registered accuracy values greater than 70.00%. This high performance might be due to the rigorous refinement of the protein sequences. Thus, protein function classification with SVM classifiers can be improved drastically using rigorously refined protein sequences.

Figure 3: The performances of descriptors with LIBSVM in terms of sensitivity (bar chart of sensitivity for each feature). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of sensitivity. The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

The individual performances of amino acid composition (83.64%) and dipeptide composition (83.17%) increased to 84.57% when both descriptors were combined. This indicates that combining descriptors can enhance their individual performances, particularly when the combination includes amino acid composition. This is a binary classification problem involving a balanced dataset; accuracy (Acc) is the best parameter for evaluating performance on a balanced dataset, whereas Matthews correlation coefficient (MCC) is more realistic than Acc when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better.
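As a worked illustration of why MCC is the more informative measure on unbalanced data, consider a hypothetical test set with 10 positives and 90 negatives, in which a classifier yields TP = 1, FN = 9, TN = 89, and FP = 1. These counts are assumptions chosen for easy arithmetic; they are not results from this study.

```latex
\begin{align*}
\mathrm{Acc} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{1 + 89}{100} = 0.90,\\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
              = \frac{89 - 9}{\sqrt{2 \times 10 \times 90 \times 98}}
              = \frac{80}{420} \approx 0.19.
\end{align*}
```

The accuracy looks high only because negatives dominate the set, while the MCC stays low because just one of the ten positives is recovered.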

Figure 4: The performances of descriptors with LIBSVM in terms of specificity. The length of each color-coded descriptor and the pyramidal view are a measure of their performances in terms of specificity. The most specific descriptors were amino acid composition and amino acid/dipeptide composition in combination (0.87), followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order.

The performances of the models were also evaluated based on MCC (Figure 2). The pyramidal view and the length of the color-coded descriptors were used for performance visualization. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order. This order is in line with their performances measured using accuracy as the parameter. This result justifies the performance of the overall model. In general, the combination of descriptor sets performs better than individual descriptors, particularly when combined with amino acid composition.

Figure 5: The performances of descriptors with LIBSVM in terms of precision. The length of each color-coded descriptor and the pyramidal view are a measure of their performances in terms of precision. The most precise descriptor was the quasi-sequence-order descriptors (0.9626), followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order.

Therefore, from the statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. The amino acid composition generally increases the overall accuracies of other descriptors in combination. One of the shortcomings of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences, because sequence order is lost [28, 60]. This sequence-order information can be partially recovered by combination with dipeptide composition, but dipeptide composition itself lacks information on the fraction of each individual residue in the sequence; as such, a combination set is expected to give a better prediction result [27, 61], as shown above, due to the masking effect.
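The complementarity of the two descriptors can be sketched in a few lines of Python. This is a minimal illustration, not the descriptor code used in the study: it assumes both compositions are plain fractions over the 20 standard residues, and the function names and the two toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Fraction of each standard residue in seq (20 features)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair in seq (400 features)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def hybrid_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return aa_composition(seq) + dipeptide_composition(seq)


# Two hypothetical sequences with identical residue counts but different order.
s1, s2 = "GKTGKT", "GGKKTT"
same_aac = aa_composition(s1) == aa_composition(s2)
same_dpc = dipeptide_composition(s1) == dipeptide_composition(s2)
print(same_aac, same_dpc, len(hybrid_features(s1)))
```

The two toy sequences are indistinguishable by amino acid composition alone, yet their dipeptide vectors differ, which is the order information the combined set adds.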

The models were further investigated based on their sensitivity to predict ATP-BPs, and the results are displayed in pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

These descriptors were among the best four performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This information highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the quasi-sequence-order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).

Figure 6: The ROC plot (sensitivity against 1 − specificity). The plot shows the performance of the LIBSVM model; it was generated with the StatsDirect package, using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

The overall model evaluation shows that the amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole sequence information. The use of the "all descriptors" set did not generally result in a better classification model. The "all features" descriptor accuracy was 79.9%, against 84.57% for amino acid/dipeptide composition in combination. This finding is in accordance with [62, 63], which used an "all features" set of molecular descriptors for predicting compounds of specific properties. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. Hence, the accuracy of classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance in solving the classification problem in question. The ROC plot of the SVM model (Figure 6) has an AUC value of 0.849219. This highlights a better model based on whole sequence analysis.

4. Conclusions

The prediction of ATP-binding proteins has been explored using a battery of descriptor sets and a hybrid functional group. Also, for the first time, the prediction of ATP binding in universal stress proteins has been investigated using the support vector machine. The best hybrid model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) value of 0.693. The general trend is that combinations of descriptors perform better and improve the overall performances of individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins.

Conflict of Interests

The author reports no conflict of interests in this work, including the mentioned trademarks.

Acknowledgments

The research reported was supported by the National Institutes of Health (NIH-NIGMS-1T36GM095335) and the National Science Foundation (EPS-0903787, EPS-1006883). The content is solely the responsibility of the author and does not necessarily represent the official views of the funding agencies.

References

[1] A. Bairoch and R. Apweiler, "The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000," Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] H. M. Berman, J. Westbrook, Z. Feng et al., "The protein data bank," Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] J. Guo, H. Chen, Z. Sun, and Y. Lin, "A novel method for protein secondary structure prediction using dual-layer SVM and profiles," Proteins, vol. 54, no. 4, pp. 738–743, 2004.

[4] C. Bustamante, Y. R. Chemla, N. R. Forde, and D. Izhaky, "Mechanical processes in biochemistry," Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004.

[5] J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, "Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold," The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] N. Hirokawa and R. Takemura, "Biochemical and molecular characterization of diseases linked to motor proteins," Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003.

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, "Transport of glyburide by placental ABC transporters: implications in fetal drug exposure," Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, "The ATP-binding site of type II topoisomerases as a target for antibacterial drugs," Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, "Kinetic analysis of the mechanism of action of the multidrug transporter," Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, "The bacterial universal stress protein: function and regulation," Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, "Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli," Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, "The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway," Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. McKay, "Structure of the universal stress protein of Haemophilus influenzae," Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, "Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey," Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, "Identification of ATP binding residues of a protein from its primary sequence," BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, "A novel statistical ligand-binding site predictor: application to ATP-binding sites," Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, "ATPsite: sequence-based prediction of ATP-binding residues," Proteome Science, vol. 9, supplement 1, article S4, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, "Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features," BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, "Recognition of adenosine triphosphate binding sites using parallel cascade system identification," Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, "Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search," The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, "Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information," Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, "GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes," Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, "Prediction of RNA binding sites in a protein using SVM and PSSM profile," Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., "Functional annotation analytics of Bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant Bacillus megaterium," Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., "Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni," Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, "Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites," Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, "Clustering of highly homologous sequences to reduce the size of large protein databases," Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., "PISCES: a protein sequence culling server," Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, "Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines," Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., "CDD: conserved domains and protein three-dimensional structure," Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, "PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence," Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., "Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein," PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, "Effect of training datasets on support vector machine prediction of protein-protein interactions," Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, "Predicting protein-protein interactions from sequences in a hybridization space," Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, "Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality," Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., "RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis," in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 4352–4355, 2009.

[39] C.-C. Chang and C.-J. Lin, "Training nu-support vector classifiers: theory and algorithms," Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, "Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, "Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine," Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, "Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition," Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., "Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS," Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, "Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis," Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, "Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information," Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, "Predicting deleterious nsSNPs: an analysis of sequence and structural attributes," BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach," Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, "The relation between the divergence of sequence and structure in proteins," The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, "How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins," Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, "Structural relationships of homologous proteins as a fundamental principle in homology modeling," Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, "AAA+ proteins: have engine, will work," Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, "The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins," The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. McPherson, A. H. J. Wang, and A. Rich, "Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu," The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., "Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, "An empirical approach for detecting nucleotide-binding sites on proteins," Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, "Automated analysis of interatomic contacts in proteins," Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, "Efficacy of different protein descriptors in predicting protein functional families," BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, "Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening," Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, "Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity," Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.


downloadable at (http://www.csie.ntu.edu.tw/~cjlin/libsvm), was adopted and used to evaluate the attributes and build the final classifier, using the radial basis function (RBF) as the kernel function [37–39].

A "grid search" was employed to select the proper values of the RBF parameter γ and the penalty parameter (C) of the soft-margin SVM. C was set to 2^-5, 2^-3, ..., 2^15 and γ to 2^-15, 2^-13, ..., 2^3. All the combinations of C and γ were tested, and the pair with the best cross-validation accuracy for each feature set or combination of feature sets was selected. A smaller γ value makes the decision boundary smoother. The SVM training parameter C is the regularization factor, which controls the tradeoff between low training error and large margin [37, 40]. Throughout this work, the parameter C was maintained at C = 4, identified as the best value after trial-and-error assessment. The optimal value of γ was obtained for each descriptor set for best results. The entire sets of attributes were evaluated in terms of their association with ATP binding proteins, and a final subset with good predictive power was selected. In this research, a 10-fold cross-validation (10CV) was implemented. The objective of training is to maximize the ability of the SVM predictor to discriminate between classes while avoiding overfitting.
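The grid search described above can be sketched as follows. This is an illustrative outline rather than the LIBSVM tooling itself: the exponent ranges come from the text, while `grid_search`, `cv_accuracy`, and `toy_cv_accuracy` are hypothetical names, and the toy scorer merely stands in for training an RBF-kernel SVM and returning its cross-validation accuracy.

```python
from itertools import product

# Exponent grids from the text: C = 2^-5, 2^-3, ..., 2^15; gamma = 2^-15, ..., 2^3.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-15, 4, 2)]


def grid_search(cv_accuracy, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return (best_score, best_C, best_gamma) over every (C, gamma) pair.

    cv_accuracy is a caller-supplied function (hypothetical here) that should
    train an RBF-kernel SVM with the given C and gamma and return its
    cross-validation accuracy.
    """
    best = None
    for c, gamma in product(c_grid, gamma_grid):
        score = cv_accuracy(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_cv_accuracy(c, gamma):
    """Stand-in scorer peaking at C = 8, gamma = 2^-3; not a real SVM."""
    return 1.0 / (1.0 + abs(c - 8.0) + abs(gamma - 0.125))


best_score, best_c, best_gamma = grid_search(toy_cv_accuracy)
print(best_score, best_c, best_gamma)
```

In practice the scorer would wrap the actual SVM training and cross-validation; only that function needs to change.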

2.3. Tenfold Cross-Validation Analysis. Choosing a technique to evaluate a newly developed method has become a major challenge to investigators. The jackknife, or leave-one-out cross-validation (LOOCV) [41–43], is a popular technique for evaluating models. During this procedure, one sequence is used for testing and the remaining sequences are used for training. This process is repeated until each sequence has been used once for testing. Even though this method is popular, it is computationally intensive and time consuming.

In this work, 10-fold cross-validation was used to train and test the dataset, with sequences randomly partitioned into ten sets. This cross-validation ensures that the dataset was split at the protein level, in addition to the stratified partition, thus ensuring a more rigorous evaluation. During the procedure, the positive and negative data samples are distributed randomly into 10 sets, or so-called folds. In each of the 10 rounds, 9 of the 10 sets are used to construct a classifier (training), and the classifier is then evaluated using the remaining set (testing). This procedure was repeated ten times, such that each set was used once for testing [44, 45]. The overall performance was the average of the performances over all 10 sets.
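The stratified 10-fold procedure can be sketched as below. This is a minimal illustration under stated assumptions, not the LIBSVM implementation: `stratified_folds` and `cross_validate` are hypothetical helper names, the `evaluate` callback stands in for training and testing a classifier, and the 50/50 label list is a made-up balanced dataset.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class proportions balanced."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validate(labels, evaluate, k=10, seed=0):
    """Average a per-fold score; evaluate(train_idx, test_idx) is caller-supplied."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for t in range(k):
        test_idx = folds[t]
        train_idx = [i for f in range(k) if f != t for i in folds[f]]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


# Hypothetical balanced dataset: 50 positives (1) and 50 negatives (0).
labels = [1] * 50 + [0] * 50
folds = stratified_folds(labels)
print([len(f) for f in folds])
```

Each sample lands in exactly one test fold, so every sequence is used once for testing and nine times for training.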

2.4. The LIBSVM Performance Evaluation. The standard parameters used in evaluating the performance of LIBSVM are indicated below. The overall accuracy (Acc) is the intuitive measurement of performance on a balanced dataset, whereas Matthews correlation coefficient (MCC) [46] is more realistic than Acc in measuring performance when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better. In addition to Acc and MCC, the following parameters were also calculated. Sensitivity is the percentage of correctly predicted binding proteins out of the total binding proteins.

True positive (TP)
True negative (TN)
False positive (FP) (false alarm)
False negative (FN)
False positive rate (FPR)
Sensitivity (recall) or true positive rate (TPR): TPR = TP/P = TP/(TP + FN)
Precision = TP/(TP + FP)
Accuracy (Acc) = (TP + TN)/(P + N) = (TP + TN)/(TP + TN + FP + FN)
Specificity (SPC): SPC = TN/N = TN/(FP + TN) = 1 − FPR
Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{(\mathrm{TP} \times \mathrm{TN}) - (\mathrm{FP} \times \mathrm{FN})}{\sqrt{(\mathrm{TN} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TP} + \mathrm{FP})}} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{P\,N\,P'\,N'}} \qquad (1)$$

Here, TP is the number of true positives (ATP-BPs), TN is the number of true negatives (non-ATP-BPs), FP is the number of false positives, and FN is the number of false negatives. In the compact form, P = TP + FN and N = TN + FP are the numbers of actual positives and negatives, and P′ = TP + FP and N′ = TN + FN are the numbers of predicted positives and negatives.
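The measures above can be computed directly from the four confusion-matrix counts, as in the following sketch. The function names are hypothetical, returning `None` for an undefined ratio is a design choice made here, and the example counts are invented for illustration rather than taken from this study.

```python
from math import sqrt


def _ratio(num, den):
    """Return num / den, or None when the denominator is zero (undefined)."""
    return num / den if den else None


def evaluate_counts(tp, tn, fp, fn):
    """Compute sensitivity, specificity, precision, accuracy, and MCC."""
    mcc_den = sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, fp + tn),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for a balanced test set (100 positives, 100 negatives).
metrics = evaluate_counts(tp=85, tn=80, fp=20, fn=15)
for name, value in metrics.items():
    print(name, round(value, 4))
```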

2.5. Area under the ROC Curve (AUC) for LIBSVM. The ROC curve is a plot of the true positive proportion, TP/(TP + FN), against the false positive proportion, FP/(FP + TN). The StatsDirect package was used to plot the ROC curve and to calculate the area under it directly by an extended trapezoidal rule [49]. The confidence interval was constructed using DeLong's variance estimate [50], embedded in the statistics package.
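For readers who want to see the calculation itself, the sketch below computes the AUC two ways: by the trapezoidal rule over the ROC points, and by the equivalent rank-based (Mann-Whitney) count of correctly ordered positive/negative pairs. It is not the StatsDirect implementation; the function names and the eight scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        # Handle tied scores together so a tie yields one diagonal segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical decision scores and true labels (1 = ATP-BP, 0 = non-ATP-BP).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc_trapezoid(roc_points(scores, labels)), auc_rank(scores, labels))
```

Both routes give the same value on these data, which is why the trapezoidal area can be interpreted as the probability that a randomly chosen positive is scored above a randomly chosen negative.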

3. Results and Discussion

The ATP binding proteins are known to play key roles in the biochemical functioning of the cell. In signaling pathways, ATP molecules are substrates for protein kinase phosphorylation. It is difficult to identify ATP binding proteins due to the lack of experimentally determined protein structures [51–53]. This is because the growth of protein sequences from various genomic projects exceeds the capacity of experimental techniques for determining protein structures and their binding reactions, which are time consuming and at times unsuccessful. Therefore, there is an urgent need to develop automated expert methods for determining the functional class of proteins, such as ATP binding proteins, from their primary sequence information.

Figure 1: The performances of descriptors with LIBSVM in terms of accuracy. The length of each color-coded descriptor and the pyramidal view are a measure of their performances in terms of accuracy (Accsvm). In terms of accuracy, the best descriptor was the combination of amino acid and dipeptide composition (84.57%), followed by amino acid composition (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order. The pseudo amino acid and quasi-sequence-order descriptors performed poorly.

The general assumption here is that every protein that binds to an ATP molecule, whether a USP or a protein having the Walker A motif, will have some common features embedded in its sequence. In both the USP (G-2X-G-9X-G(S/T)) and Walker A (G-4X-GK(T/S)) motifs, G, K, T, and S denote glycine, lysine, threonine, and serine, respectively, and X denotes any amino acid residue. The lysine (K) residue in the Walker A motif is crucial for nucleotide binding [54] in this class of proteins. It interacts with the phosphate groups of the nucleotide and with the magnesium ion, which coordinates the β- and γ-phosphates of the ATP molecule [55, 56].
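Read literally, the two motifs translate into simple regular expressions, as sketched below. This is only an illustration of the motif notation, assuming X matches any single character; the function name and the two toy sequences are hypothetical, and a literal pattern match is not a substitute for the whole-sequence SVM classifier developed in this work.

```python
import re

# Motifs as written in the text, with X read as "any residue".
WALKER_A = re.compile(r"G.{4}GK[TS]")      # G-4X-GK(T/S)
USP_ATP = re.compile(r"G.{2}G.{9}G[ST]")   # G-2X-G-9X-G(S/T)


def find_motifs(seq):
    """Return {motif name: [(start index, matched text), ...]} for seq."""
    hits = {}
    for name, pattern in (("walker_a", WALKER_A), ("usp", USP_ATP)):
        hits[name] = [(m.start(), m.group()) for m in pattern.finditer(seq)]
    return hits


# Hypothetical toy sequences constructed to contain one motif each.
walker_seq = "MAGAGGVGKSALT"
usp_seq = "MK" + "G" + "AA" + "G" + "A" * 9 + "GS" + "LL"

print(find_motifs(walker_seq))
print(find_motifs(usp_seq))
```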

The universal stress proteins bind ATP through the ATP binding motif G-2X-G-9X-G(S/T), with the -G(S/T) as essential residues for ATP binding and phosphorylation [13]. Therefore, members of this class of proteins segregate into two groups based on whether or not they bind ATP [13, 57]. Thus, it is important to identify ATP binding USPs and other ATP binding proteins. Several methods have been studied for predicting ATP interacting residues when the protein structures are known, with some results showing very low accuracies [15, 16, 58, 59]. This work has predicted ATP binding proteins in general with high accuracy, irrespective of their structural information, using an SVM classifier. The training and prediction statistics for each of the descriptor sets used are visualized and discussed below. The visualizations were constructed using Tableau Public software (http://www.tableausoftware.com/public).

The objective in this report was to find the best descriptorset which can be use to build a predictive model for a reliableand effective server for predicting ATP-BPs in generalirrespective of their subfunctional classes Throughout thiswork the parameter 119862 was maintained at 119862 = 4 whilethe optimal value of 120574 for each descriptor was obtained andused in evaluating their performances Their performances

ISRN Computational Biology 5

Feature

AA composition

All feature set

Amino acid anddipeptide composition

Composition

Dipeptide comp

Distribution

Geary autocorrelation

Moran autocorrelation

Norm M-Bautocorrelation

Pseudo amino acidcomposition

Quasi-sequenceorder descriptors

Sequence-order-coupling number

Transition

MCC MCC

06765

06041

06931

05897

06637

04767

05982

06355

06449

Null

02369

0486

05301

Null02360

04767

0486

05301

05897

05082

06041

06355

06449

06637

06765

06931

Figure 2 The performances of descriptors with LIBSVM in terms of Mathewrsquos correlation coefficient (MCC) The length of each colorcoded descriptor and the pyramidal view are a measure of their performances in terms of MCC The best performer was amino acid anddipeptide composition in combination (06931) followed by amino acid composition (06765) dipeptide composition (06637) and NormM-B autocorrelation (06449) in that order

were evaluated based on five computed parameters, consisting of their accuracies, sensitivities, specificities, precisions, and MCC, after a 10-fold cross validation (CV10).
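The tuning protocol described here (C held at 4, γ chosen per descriptor set, scores taken from 10-fold cross-validation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's script: stratified fold assignment is an assumption, `stratified_folds` and `select_gamma` are hypothetical helper names, and `cv_accuracy` stands for any user-supplied routine that trains an SVM on the training indices and returns the accuracy on the held-out indices. The `toy_accuracy` callback is a stand-in so the sketch runs without an SVM library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def select_gamma(labels, gammas, cv_accuracy, C=4.0, k=10, seed=0):
    """Return (gamma, mean CV accuracy) for the best gamma, with C fixed."""
    folds = stratified_folds(labels, k, seed)
    best_gamma, best_score = None, float("-inf")
    for gamma in gammas:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(labels)) if i not in held]
            fold_scores.append(cv_accuracy(train, sorted(held), C, gamma))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_gamma, best_score = gamma, mean_score
    return best_gamma, best_score


def toy_accuracy(train_idx, test_idx, C, gamma):
    """Placeholder scorer (not an SVM): peaks at gamma = 0.125."""
    return 1.0 - abs(gamma - 0.125)


labels = [1] * 20 + [0] * 20  # invented balanced toy labels
print(select_gamma(labels, [2 ** -5, 2 ** -3, 2 ** -1], toy_accuracy))
```

In practice the callback would wrap the actual SVM training and prediction calls; the grid of γ values shown is arbitrary.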

The performance of pseudo amino acid composition was evaluated with only accuracy due to lack of sufficient sequence information. The lengths of the color coded descriptors were used as a measure of their performances. In terms of accuracy, the best descriptor was the combination of amino acid with dipeptide composition (84.57%), followed by amino acid composition alone (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order (Figure 1). The pseudo amino acid and quasi-sequence-order descriptors performed poorly compared to the other descriptors. However, the overall performances of the other

Figure 3: The performances of descriptors with LIBSVM in terms of sensitivity. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of sensitivity. The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

[Bar chart not reproduced; axes: Feature (descriptor set) versus sensitivity.]

descriptors were better, as most of them registered accuracy values greater than 70.00%. These high performances might be due to the rigorous refinement of protein sequences. Thus, protein function classification with SVM classifiers can be improved drastically using rigorously refined protein sequences.

The individual performances of amino acid composition (83.64%) and dipeptide composition (83.17%) were increased to 84.57% when both descriptors were combined. This indicates that the combination of descriptors can enhance the individual performance of other descriptors, particularly those combined with amino acid composition. This is a binary classification problem involving a balanced dataset, and accuracy (Acc) is the best parameter for evaluating performance on a balanced dataset, whereas Matthews correlation coefficient (MCC) is more realistic than Acc when using an unbalanced dataset [47, 48]. But when both MCC and Acc values are high, the overall performance of the predicted model is better.
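The contrast between Acc and MCC on balanced and unbalanced data can be made concrete with the standard confusion-matrix definitions. The sketch below assumes those standard formulas match the ones used in this work; the function name `binary_metrics` and all counts are invented for illustration and are not results from the article.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Five evaluation measures computed from confusion-matrix counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        # By a common convention, MCC is reported as 0 when undefined.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Balanced toy case: 100 positives and 100 negatives.
balanced = binary_metrics(tp=90, tn=80, fp=20, fn=10)

# Unbalanced toy case: 95 positives, 5 negatives, everything called positive.
skewed = binary_metrics(tp=95, tn=0, fp=5, fn=0)

print(balanced["accuracy"], round(balanced["mcc"], 4))
print(skewed["accuracy"], skewed["mcc"])
```

In the unbalanced toy case the accuracy is high even though the classifier never identifies a negative, while MCC falls to zero, which is the behavior the paragraph above describes.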

Figure 4: The performances of descriptors with LIBSVM in terms of specificity. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of specificity. The most specific descriptors were amino acid composition and amino acid/dipeptide composition (0.87), followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order.

[Bar chart not reproduced; axes: Feature (descriptor set) versus specificity.]

The performances of the models were evaluated based on MCC (Figure 2). The pyramidal view and the length of the color coded descriptors were used for performance visualization. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order. This order is in line with their performances measured using accuracy

Figure 5: The performances of descriptors with LIBSVM in terms of precision. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of precision. The most precise descriptor was quasi-sequence-order descriptors (0.9626), followed by amino acid and dipeptide composition in combination (0.8785), all feature set (0.8692), and Transition (0.8411), in that order.

[Bar chart not reproduced; axes: Feature (descriptor set) versus precision.]

as the parameter. This result justifies the performance of the overall model. In general, the combination of descriptor sets performs better than individual descriptors, particularly when combined with amino acid composition.

Therefore, from the statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. The amino acid composition generally increases the overall accuracies of other descriptors in combination. One of the shortcomings of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences due to the loss of sequence order [28, 60]. This sequence order information can be partially covered by combination with dipeptide composition, but dipeptide composition itself lacks information on the fraction of the individual residues in the sequence; as such, a combination set is expected to give a better prediction result [27, 61], as shown above, due to the masking effect.
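The loss of sequence order in amino acid composition, and its partial recovery through dipeptide composition, can be shown with a short sketch. It assumes both descriptors are plain fractions over the 20 standard residues (the exact normalization used by the descriptor server in this work may differ); the function names are hypothetical and the two peptides are invented toy strings.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Fraction of each of the 20 standard residues (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return aa_composition(seq) + dipeptide_composition(seq)


a, b = "GKTGKT", "GGKKTT"  # same residues, different order
print(aa_composition(a) == aa_composition(b))
print(dipeptide_composition(a) == dipeptide_composition(b))
print(len(hybrid_features(a)))
```

The two toy peptides are indistinguishable by amino acid composition alone but differ in dipeptide composition, so the concatenated vector separates them.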

The models were further investigated based on their sensitivity to predict ATP-BPs, and the results are displayed in

Figure 6: The ROC plot. The plot shows the performance of the LIBSVM model, generated with the StatsDirect package using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

[ROC plot not reproduced; axes: sensitivity versus 1 − specificity, each from 0.00 to 1.00.]

pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

These descriptors were among the best four performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This information highlights the vital role played by amino acid composition in protein function predictions in general. Interestingly, the quasi-sequence-order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).

The overall model evaluation shows that the amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole sequence information. The use of the “all descriptors” set did not generally result in a better classification model. The “all features” descriptor accuracy was 79.9% against 84.57% for amino acid/dipeptide in combination. This finding is in accordance with [62, 63] in their work on molecular descriptors for predicting compounds of specific properties using an “all features” set. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. Hence, the accuracy of the classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance in solving the classification problem in question. The performance of the SVM model using the ROC plot (Figure 6) has an AUC value of 0.849219. This highlights a better model based on whole sequence analysis.
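The two routes to the area under the ROC curve named in the Figure 6 caption, a trapezoidal rule and a rank-based method analogous to the Wilcoxon/Mann-Whitney test, give the same value and can be sketched as follows. This is not the StatsDirect implementation; the function names are hypothetical, labels are assumed to be coded 1 (ATP binding) and 0 (nonbinding), and the scores are invented.

```python
def roc_points(scores, labels):
    """ROC points as (1 - specificity, sensitivity), ties grouped."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample sharing this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(scores, labels):
    """Probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]  # invented decision values
labels = [1, 1, 0, 1, 0, 0]
print(auc_trapezoid(roc_points(scores, labels)), auc_rank(scores, labels))
```

Both functions return 8/9 for the toy data, since eight of the nine positive/negative pairs are ranked correctly.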

4. Conclusions

The prediction of ATP-binding proteins has been exploited using a battery of descriptor sets and a hybrid functional group. Also, for the first time, the prediction of ATP binding in universal stress proteins has been investigated using the support vector machine. The best hybrid model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) value of 0.693. The general trend is that combinations of descriptors perform better and improve the overall performances of individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins.

Conflict of Interests

The author reports no conflict of interests in this work, including the mentioned trademarks.

Acknowledgments

The research reported was supported by the National Institutes of Health (NIH-NIGMS-1T36GM095335) and the National Science Foundation (EPS-0903787, EPS-1006883). The content is solely the responsibility of the author and does not necessarily represent the official views of the funding agencies.

References

[1] A. Bairoch and R. Apweiler, “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] H. M. Berman, J. Westbrook, Z. Feng et al., “The protein data bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] J. Guo, H. Chen, Z. Sun, and Y. Lin, “A novel method for protein secondary structure prediction using dual-layer SVM and profiles,” Proteins, vol. 54, no. 4, pp. 738–743, 2004.

[4] C. Bustamante, Y. R. Chemla, N. R. Forde, and D. Izhaky, “Mechanical processes in biochemistry,” Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004.

[5] J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, “Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold,” The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] N. Hirokawa and R. Takemura, “Biochemical and molecular characterization of diseases linked to motor proteins,” Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003.

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, “Transport of glyburide by placental ABC transporters: implications in fetal drug exposure,” Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, “The ATP-binding site of type II topoisomerases as a target for antibacterial drugs,” Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, “Kinetic analysis of the mechanism of action of the multidrug transporter,” Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, “The bacterial universal stress protein: function and regulation,” Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, “Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli,” Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, “The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway,” Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. Mckay, “Structure of the universal stress protein of Haemophilus influenzae,” Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, “Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey,” Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, “Identification of ATP binding residues of a protein from its primary sequence,” BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, “A novel statistical ligand-binding site predictor: application to ATP-binding sites,” Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, “ATPsite: sequence-based prediction of ATP-binding residues,” Proteome Science, vol. 9, article S4, supplement 1, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, “Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features,” BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, “Recognition of adenosine triphosphate binding sites using parallel cascade system identification,” Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, “Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search,” The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, “Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information,” Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, “GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes,” Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, “Prediction of RNA binding sites in a protein using SVM and PSSM profile,” Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., “Functional annotation analytics of bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant bacillus megaterium,” Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., “Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni,” Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, “Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites,” Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., “PISCES: a protein sequence culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, “Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines,” Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., “CDD: conserved domains and protein three-dimensional structure,” Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, “PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence,” Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., “Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein,” PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, “Effect of training datasets on support vector machine prediction of protein-protein interactions,” Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, “Predicting protein-protein interactions from sequences in a hybridization space,” Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, “Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality,” Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., “RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis,” in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC ’09), pp. 4352–4355, 2009.

[39] C.-C. Chang and C.-J. Lin, “Training nu-support vector classifiers: theory and algorithms,” Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, “Practical selection of SVM parameters and noise estimation for SVM regression,” Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, “Prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, “Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine,” Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, “Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition,” Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., “Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS,” Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, “Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis,” Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, “Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information,” Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, “Predicting deleterious nsSNPs: an analysis of sequence and structural attributes,” BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. Mcneil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. Delong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,” Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, “The relation between the divergence of sequence and structure in proteins,” The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins,” Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, “Structural relationships of homologous proteins as a fundamental principle in homology modeling,” Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, “AAA+ proteins: have engine, will work,” Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, “The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins,” The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. Mcpherson, A. H. J. Wang, and A. Rich, “Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu,” The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., “Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, “An empirical approach for detecting nucleotide-binding sites on proteins,” Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, “Automated analysis of interatomic contacts in proteins,” Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, “Boostexter: a boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, “Efficacy of different protein descriptors in predicting protein functional families,” BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, “Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening,” Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, “Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity,” Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.


Figure 1: The performances of descriptors with LIBSVM in terms of accuracy. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of accuracy (Accsvm). In terms of accuracy, the best descriptor was the combination of amino acid and dipeptide composition (84.57%), followed by amino acid composition (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order. The pseudo amino acid and quasi-sequence-order descriptors perform poorly.

[Bar chart not reproduced; axes: Feature (descriptor set) versus Accsvm.]

amino acid residue. The lysine (K) residue in the Walker A motif is crucial for nucleotide binding [54] in this class of proteins. It interacts with the phosphate groups of the nucleotide and with the magnesium ion, which coordinates the β- and γ-phosphates of the ATP molecule [55, 56].

The universal stress proteins bind to ATP through theATP binding motif G-2X-G-9X-G(ST) with the -G(S)T asessential residues for ATP binding and phosphorylation [13]Therefore members of this class of proteins will segregateinto two groups based on whether or not they bind to ATP[13 57] Thus it is important to identify ATP binding USPsand other ATP binding proteins Several methods have beenstudied based on predicting ATP interacting residues if theprotein structures are known with some results showing very

low accuracies [15 16 58 59] This work has predicted ATPbinding proteins in general with high accuracy irrespectiveof their structural information using SVM classifier Thetraining and prediction statistics for each of the descriptorsets used were visualized and discussed below The visu-alizations were constructed using Tableau Public Software(httpwwwtableausoftwarecompublic)

The objective in this report was to find the best descriptorset which can be use to build a predictive model for a reliableand effective server for predicting ATP-BPs in generalirrespective of their subfunctional classes Throughout thiswork the parameter 119862 was maintained at 119862 = 4 whilethe optimal value of 120574 for each descriptor was obtained andused in evaluating their performances Their performances

ISRN Computational Biology 5

Feature

AA composition

All feature set

Amino acid anddipeptide composition

Composition

Dipeptide comp

Distribution

Geary autocorrelation

Moran autocorrelation

Norm M-Bautocorrelation

Pseudo amino acidcomposition

Quasi-sequenceorder descriptors

Sequence-order-coupling number

Transition

MCC MCC

06765

06041

06931

05897

06637

04767

05982

06355

06449

Null

02369

0486

05301

Null02360

04767

0486

05301

05897

05082

06041

06355

06449

06637

06765

06931

Figure 2 The performances of descriptors with LIBSVM in terms of Mathewrsquos correlation coefficient (MCC) The length of each colorcoded descriptor and the pyramidal view are a measure of their performances in terms of MCC The best performer was amino acid anddipeptide composition in combination (06931) followed by amino acid composition (06765) dipeptide composition (06637) and NormM-B autocorrelation (06449) in that order

were evaluated based on five computed parameters consistingof their accuracies sensitivities specificities precisions andMCC after a 10-fold cross validation (CV10)

The performance of pseudo amino acid compositionwas evaluated with only accuracy due to lack of suffi-cient sequence information The lengths of the color codeddescriptors were used as a measure of their performances In

terms of accuracy the best descriptor was the combination ofamino acid with dipeptide composition (8457) followedby amino acid composition alone (8364) dipeptide com-position (8317) and Norm M-B autocorrelation in thatorder (Figure 1)The pseudo amino acids andQuasi sequenceorder descriptors performed poorly compared to the otherdescriptors However the overall performances of the other

6 ISRN Computational Biology

Feature

AA composition

All feature set

Amino acid anddipeptide composition

Composition

Dipeptide comp

Distribution

Geary autocorrelation

Moran autocorrelation

Norm M-Bautocorrelation

Pseudo amino acidcomposition

Quasi-sequenceorder descriptors

Sequence-order-coupling number

Transition

Sensitivity Sensitivity

0875

07623

08246

07788

08381

07429

08019

08148

08224

Null

Null

05421

07453

07258

05421

07258

07429

07453

07623

07788

08019

08148

08224

08248

08381

0875

Figure 3The performances of descriptors with LIBSVM in terms of sensitivityThe length of each color coded descriptor and the pyramidalview are a measure of their performances in terms of sensitivity The most sensitive descriptor was amino acid composition (0875) followedby dipeptide composition (08381) amino acid and dipeptide composition in combination (08246) and NormM-B autocorrelation (08224)in that order

descriptors were better as most of them registered accuracyvalues greater than 7000 These high performers mightbe due to the rigorous refinement of protein sequencesThus protein function classification with SVM classifierscan be improved drastically using rigorously refined proteinsequences

The individual performances of amino acid composi-tion (8364) and dipeptide composition (8317) wereincreased to 8457 when both descriptors were com-bined together This indicates that the combination of

descriptors can enhance the individual performance ofother descriptors particularly those combining with aminoacid composition This is a binary classification probleminvolving a balance dataset and accuracy (Acc) is the bestparameter for evaluating performance based on balancedataset whereas Matthewrsquos correlation coefficient (MCC)is more realistic than Acc when using an unbalanceddataset [47 48] But when both MCC and Acc values arehigh the overall performance of the predicted model isbetter

ISRN Computational Biology 7

Feature

AA composition

All feature set

Amino acid anddipeptide composition

Composition

Dipeptide comp

Distribution

Geary autocorrelation

Moran autocorrelation

Norm M-Bautocorrelation

Pseudo amino acidcomposition

Quasi-sequenceorder descriptors

Sequence-order-coupling number

Transition

087

08478

087

08119

08257

07339

07963

08208

08224

Null

08333

07407

08111

Null07339

07407

07963

08111

08119

08208

08224

08257

08333

08478

087

Specificity Specificity

Figure 4The performances of descriptors with LIBSVM in terms of SpecificityThe length of each color coded descriptor and the pyramidalview are a measure of their performances in terms of specificity The most specific descriptor was amino acid composition and aminoaciddipeptide composition (087) followed by all using all the feature set (08478) Quasi sequence order descriptors (08333) and dipeptidecomposition (08257) in that order

The performances of the models were also evaluated based on MCC (Figure 2). The pyramidal view and the length of the color-coded descriptor bars were used to visualize performance. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order. This order is in line with their performances measured using accuracy as the parameter. This result justifies the performance of the overall model. In general, combinations of descriptor sets perform better than individual descriptors, particularly when combined with amino acid composition.

Figure 5: The performance of the descriptors with LIBSVM in terms of precision. The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of precision. The most precise descriptor set was the quasi-sequence-order descriptors (0.9626), followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and transition (0.8411), in that order.

Precision by descriptor, as labeled in the figure: AA composition, 0.785; all feature set, 0.8692; amino acid and dipeptide composition, 0.8785; composition, 0.8224; dipeptide composition, 0.8224; distribution, 0.729; Geary autocorrelation, 0.7944; Moran autocorrelation, 0.8224; Norm M-B autocorrelation, 0.8224; pseudo amino acid composition, null; quasi-sequence-order descriptors, 0.9626; sequence-order-coupling number, 0.7383; transition, 0.8411.

Therefore, from a statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. Amino acid composition generally increases the overall accuracy of other descriptors in combination. One shortcoming of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences, because sequence-order information is lost [28, 60]. This sequence-order information can be partially recovered by combining it with dipeptide composition, but dipeptide composition itself lacks information on the fraction of each individual residue in the sequence; as such, a combination set is expected to give a better prediction result [27, 61], as shown above, due to the masking effect.
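The loss of sequence order in amino acid composition, and its partial recovery by dipeptide composition, can be sketched as follows. This is a minimal illustration that assumes the usual definitions (residue fractions over the 20 standard amino acids, and fractions of overlapping residue pairs over the 400 possible dipeptides); the function names and the two toy sequences are hypothetical, and the code is not the feature-extraction software used in this study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Fraction of each of the 20 standard residues (20 features)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among the overlapping
    pairs of the sequence (400 features). Assumes len(seq) >= 2."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return aa_composition(seq) + dipeptide_composition(seq)


# Two hypothetical toy sequences with identical residue counts
# but a different residue order.
seq_a = "GKTGKT"
seq_b = "TKGTKG"

print(aa_composition(seq_a) == aa_composition(seq_b))                # True
print(dipeptide_composition(seq_a) == dipeptide_composition(seq_b))  # False
print(len(hybrid_features(seq_a)))                                   # 420
```

The two toy sequences are indistinguishable by amino acid composition alone, but their dipeptide compositions differ, so the concatenated vector separates them.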

Figure 6: The ROC plot (sensitivity versus 1 − specificity). The plot shows the performance of the LIBSVM model. It was generated with the StatsDirect package, using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

The models were further investigated based on their sensitivity in predicting ATP-BPs, and the results are displayed in a pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

These descriptors were among the four best performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the quasi-sequence-order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and transition (0.8411), in that order (Figure 5).

The overall model evaluation shows that the combination of amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole-sequence information. Using the “all features” set did not generally result in a better classification model: its accuracy was 79.9%, against 84.57% for amino acid/dipeptide composition in combination. This finding is in accordance with [62, 63], which examined molecular descriptors for predicting compounds with specific properties using an “all features” set. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. The accuracy of classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance in solving the classification problem in question. The ROC plot of the SVM model (Figure 6) gave an AUC of 0.849219, which points to a better model based on whole-sequence analysis.
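The two ways of obtaining the area under the ROC curve named in the caption of Figure 6, the trapezoidal rule and a Wilcoxon/Mann-Whitney style rank statistic, can be sketched as below. This is only an illustration of the general technique: it does not reproduce the StatsDirect procedure, and the labels and decision scores are hypothetical toy values rather than outputs of the LIBSVM model.

```python
def roc_points(labels, scores):
    """ROC points (1 - specificity, sensitivity), obtained by lowering
    the decision threshold over the distinct scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(labels, scores):
    """Mann-Whitney form: fraction of (positive, negative) pairs in which
    the positive scores higher, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = ATP-BP, 0 = non-ATP-BP.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

print(auc_trapezoid(roc_points(labels, scores)))  # 0.75
print(auc_rank(labels, scores))                   # 0.75
```

Both functions return the same value on the toy data, which is why the rank-based statistic can serve as a nonparametric estimate of the area under the curve.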

4. Conclusions

The prediction of ATP-binding proteins has been explored using a battery of descriptor sets and a hybrid functional group. Also, for the first time, the prediction of ATP binding in universal stress proteins has been investigated using the support vector machine. The best hybrid model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) value of 0.693. The general trend is that combinations of descriptors perform better and improve the overall performance of individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins.

Conflict of Interests

The author reports no conflict of interests in this work, including the mentioned trademarks.

Acknowledgments

The research reported was supported by the National Institutes of Health (NIH-NIGMS-1T36GM095335) and the National Science Foundation (EPS-0903787, EPS-1006883). The content is solely the responsibility of the author and does not necessarily represent the official views of the funding agencies.

References

[1] A. Bairoch and R. Apweiler, “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] H. M. Berman, J. Westbrook, Z. Feng et al., “The protein data bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] J. Guo, H. Chen, Z. Sun, and Y. Lin, “A novel method for protein secondary structure prediction using dual-layer SVM and profiles,” Proteins, vol. 54, no. 4, pp. 738–743, 2004.

[4] C. Bustamante, Y. R. Chemla, N. R. Forde, and D. Izhaky, “Mechanical processes in biochemistry,” Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004.

[5] J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, “Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold,” The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] N. Hirokawa and R. Takemura, “Biochemical and molecular characterization of diseases linked to motor proteins,” Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003.


[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, “Transport of glyburide by placental ABC transporters: implications in fetal drug exposure,” Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, “The ATP-binding site of type II topoisomerases as a target for antibacterial drugs,” Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, “Kinetic analysis of the mechanism of action of the multidrug transporter,” Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, “The bacterial universal stress protein: function and regulation,” Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, “Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli,” Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, “The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway,” Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. Mckay, “Structure of the universal stress protein of Haemophilus influenzae,” Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, “Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey,” Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, “Identification of ATP binding residues of a protein from its primary sequence,” BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, “A novel statistical ligand-binding site predictor: application to ATP-binding sites,” Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, “ATPsite: sequence-based prediction of ATP-binding residues,” Proteome Science, vol. 9, article S4, supplement 1, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, “Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features,” BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, “Recognition of adenosine triphosphate binding sites using parallel cascade system identification,” Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, “Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search,” The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, “Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information,” Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, “GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes,” Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, “Prediction of RNA binding sites in a protein using SVM and PSSM profile,” Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., “Functional annotation analytics of bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant bacillus megaterium,” Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., “Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni,” Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, “Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites,” Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., “PISCES: a protein sequence culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, “Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines,” Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., “CDD: conserved domains and protein three-dimensional structure,” Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, “PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence,” Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., “Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein,” PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, “Effect of training datasets on support vector machine prediction of protein-protein interactions,” Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, “Predicting protein-protein interactions from sequences in a hybridization space,” Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, “Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality,” Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., “RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis,” in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC ’09), pp. 4352–4355, 2009.


[39] C.-C. Chang and C.-J. Lin, “Training nu-support vector classifiers: theory and algorithms,” Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, “Practical selection of SVM parameters and noise estimation for SVM regression,” Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, “Prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, “Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine,” Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, “Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition,” Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., “Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS,” Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, “Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis,” Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, “Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information,” Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, “Predicting deleterious nsSNPs: an analysis of sequence and structural attributes,” BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. Mcneil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. Delong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,” Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, “The relation between the divergence of sequence and structure in proteins,” The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins,” Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, “Structural relationships of homologous proteins as a fundamental principle in homology modeling,” Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, “AAA+ proteins: have engine, will work,” Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, “The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins,” The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. Mcpherson, A. H. J. Wang, and A. Rich, “Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu,” The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., “Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, “An empirical approach for detecting nucleotide-binding sites on proteins,” Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, “Automated analysis of interatomic contacts in proteins,” Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, “Boostexter: a boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, “Efficacy of different protein descriptors in predicting protein functional families,” BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, “Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening,” Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, “Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity,” Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.


Figure 2: The performance of the descriptors with LIBSVM in terms of the Matthews correlation coefficient (MCC). The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of MCC. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order.

MCC by descriptor, as labeled in the figure: AA composition, 0.6765; all feature set, 0.6041; amino acid and dipeptide composition, 0.6931; composition, 0.5897; dipeptide composition, 0.6637; distribution, 0.4767; Geary autocorrelation, 0.5982; Moran autocorrelation, 0.6355; Norm M-B autocorrelation, 0.6449; pseudo amino acid composition, null; quasi-sequence-order descriptors, 0.2369; sequence-order-coupling number, 0.486; transition, 0.5301.

were evaluated based on five computed parameters, namely accuracy, sensitivity, specificity, precision, and MCC, after 10-fold cross-validation (CV10).
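The 10-fold cross-validation scheme mentioned above can be sketched as an index split. The sketch assumes a stratified split, in which each class is spread evenly over the folds; whether the original LIBSVM runs were stratified is not stated, so this is an assumption. The function name `stratified_kfold`, the seed, and the label vector are likewise hypothetical.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds, dealing the
    shuffled members of every class round-robin across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f in range(k) if f != held_out for i in folds[f])
        yield train, test


# Hypothetical balanced dataset: 50 ATP-BPs (1) and 50 non-ATP-BPs (0).
labels = [1] * 50 + [0] * 50
splits = list(stratified_kfold(labels, k=10))

for train, test in splits:
    # A classifier would be trained on `train` and scored on `test` here;
    # the five parameters are then averaged over the 10 folds.
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

Every sequence is used for testing exactly once, and each test fold keeps the class balance of the full dataset, so per-fold accuracy remains comparable across folds.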

The performance of pseudo amino acid composition was evaluated with accuracy only, due to a lack of sufficient sequence information. The lengths of the color-coded descriptor bars were used as a measure of their performances. In terms of accuracy, the best descriptor was the combination of amino acid with dipeptide composition (84.57%), followed by amino acid composition alone (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order (Figure 1). The pseudo amino acid and quasi-sequence-order descriptors performed poorly compared to the other descriptors. However, the overall performances of the other descriptors were better, as most of them registered accuracy values greater than 70.00%. These high performances might be due to the rigorous refinement of the protein sequences. Thus, protein function classification with SVM classifiers can be improved drastically by using rigorously refined protein sequences.

Figure 3: The performance of the descriptors with LIBSVM in terms of sensitivity. The length of each color-coded descriptor bar and the pyramidal view are a measure of performance in terms of sensitivity. The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

Sensitivity by descriptor, as labeled in the figure: AA composition, 0.875; all feature set, 0.7623; amino acid and dipeptide composition, 0.8246; composition, 0.7788; dipeptide composition, 0.8381; distribution, 0.7429; Geary autocorrelation, 0.8019; Moran autocorrelation, 0.8148; Norm M-B autocorrelation, 0.8224; pseudo amino acid composition, null; quasi-sequence-order descriptors, 0.5421; sequence-order-coupling number, 0.7453; transition, 0.7258.


References

[1] A Bairoch and R Apweiler ldquoThe SWISS-PROT protein se-quence database and its supplement TrEMBL in 2000rdquo NucleicAcids Research vol 28 no 1 pp 45ndash48 2000

[2] H M Berman J Westbrook Z Feng et al ldquoThe protein databankrdquo Nucleic Acids Research vol 28 no 1 pp 235ndash242 2000

[3] J Guo H Chen Z Sun and Y Lin ldquoA novel method forprotein secondary structure prediction using dual-layer SVMand profilesrdquo Proteins vol 54 no 4 pp 738ndash743 2004

[4] C Bustamante Y R Chemla N R Forde and D IzhakyldquoMechanical processes in biochemistryrdquo Annual Review ofBiochemistry vol 73 pp 705ndash748 2004

[5] J EWalkerM SarasteM J Runswick andN J Gay ldquoDistantlyrelated sequences in the alpha- and beta-subunits of ATP syn-thase myosin kinases and other ATP-requiring enzymes anda common nucleotide binding foldrdquo The EMBO Journal vol 1no 8 pp 945ndash951 1982

[6] N Hirokawa and R Takemura ldquoBiochemical and molecularcharacterization of diseases linked to motor proteinsrdquo Trends inBiochemical Sciences vol 28 no 10 pp 558ndash565 2003

10 ISRN Computational Biology

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, “Transport of glyburide by placental ABC transporters: implications in fetal drug exposure,” Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, “The ATP-binding site of type II topoisomerases as a target for antibacterial drugs,” Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, “Kinetic analysis of the mechanism of action of the multidrug transporter,” Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, “The bacterial universal stress protein: function and regulation,” Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, “Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli,” Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, “The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway,” Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. Mckay, “Structure of the universal stress protein of Haemophilus influenzae,” Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, “Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey,” Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, “Identification of ATP binding residues of a protein from its primary sequence,” BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, “A novel statistical ligand-binding site predictor: application to ATP-binding sites,” Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, “ATPsite: sequence-based prediction of ATP-binding residues,” Proteome Science, vol. 9, article S4, supplement 1, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, “Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features,” BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, “Recognition of adenosine triphosphate binding sites using parallel cascade system identification,” Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, “Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search,” The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, “Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information,” Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, “GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes,” Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, “Prediction of RNA binding sites in a protein using SVM and PSSM profile,” Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., “Functional annotation analytics of bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant bacillus megaterium,” Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., “Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni,” Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, “Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites,” Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., “PISCES: a protein sequence culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, “Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines,” Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., “CDD: conserved domains and protein three-dimensional structure,” Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, “PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence,” Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., “Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein,” PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, “Effect of training datasets on support vector machine prediction of protein-protein interactions,” Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, “Predicting protein-protein interactions from sequences in a hybridization space,” Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, “Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality,” Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., “RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis,” in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC ’09), pp. 4352–4355, 2009.


[39] C.-C. Chang and C.-J. Lin, “Training nu-support vector classifiers: theory and algorithms,” Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, “Practical selection of SVM parameters and noise estimation for SVM regression,” Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, “Prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, “Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine,” Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, “Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition,” Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., “Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS,” Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, “Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis,” Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, “Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information,” Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, “Predicting deleterious nsSNPs: an analysis of sequence and structural attributes,” BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. Mcneil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. Delong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,” Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, “The relation between the divergence of sequence and structure in proteins,” The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins,” Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, “Structural relationships of homologous proteins as a fundamental principle in homology modeling,” Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, “AAA+ proteins: have engine, will work,” Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, “The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins,” The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. Mcpherson, A. H. J. Wang, and A. Rich, “Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu,” The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., “Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, “An empirical approach for detecting nucleotide-binding sites on proteins,” Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, “Automated analysis of interatomic contacts in proteins,” Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, “Boostexter: a boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, “Efficacy of different protein descriptors in predicting protein functional families,” BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, “Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening,” Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, “Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity,” Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.


Figure 3: The performances of descriptors with LIBSVM in terms of sensitivity. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of sensitivity. The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

descriptors were better, as most of them registered accuracy values greater than 70.00%. These high performances might be due to the rigorous refinement of protein sequences. Thus, protein function classification with SVM classifiers can be improved drastically using rigorously refined protein sequences.

The individual accuracies of amino acid composition (83.64%) and dipeptide composition (83.17%) increased to 84.57% when both descriptors were combined. This indicates that combining descriptors can enhance the individual performance of other descriptors, particularly those combined with amino acid composition. This is a binary classification problem involving a balanced dataset, and accuracy (Acc) is the best parameter for evaluating performance on a balanced dataset, whereas Matthews correlation coefficient (MCC) is more realistic than Acc when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better.
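The contrast between Acc and MCC on balanced and unbalanced data can be illustrated with a minimal sketch. The function name `binary_metrics` and the confusion-matrix counts below are hypothetical and chosen only for illustration; they are not the counts obtained in this study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total                      # accuracy
    sens = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall)
    spec = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    prec = tp / (tp + fp) if (tp + fp) else 0.0  # precision
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sens": sens, "spec": spec, "prec": prec, "mcc": mcc}


# Hypothetical balanced dataset: 100 positives and 100 negatives.
balanced = binary_metrics(tp=85, fp=15, tn=85, fn=15)

# Hypothetical unbalanced dataset: 10 positives and 190 negatives.
skewed = binary_metrics(tp=5, fp=5, tn=185, fn=5)

print(balanced["acc"], balanced["mcc"])
print(skewed["acc"], skewed["mcc"])
```

In the unbalanced case the accuracy is high even though only half of the positives are recovered, while MCC stays moderate, which is the behaviour described above.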

Figure 4: The performances of descriptors with LIBSVM in terms of specificity. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of specificity. The most specific descriptors were amino acid composition and amino acid/dipeptide composition (0.87), followed by the entire feature set (0.8478), Quasi sequence order descriptors (0.8333), and dipeptide composition (0.8257), in that order.

The performances of the models were evaluated based on MCC (Figure 2). The pyramidal view and the length of the color coded descriptors were used for performance visualization. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order. This order is in line with their performances measured using accuracy as the parameter. This result justifies the performance of the overall model. In general, the combination of descriptor sets performs better than individual descriptors, particularly when combined with amino acid composition.

Figure 5: The performances of descriptors with LIBSVM in terms of precision. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of precision. The most precise descriptor was Quasi sequence order descriptors (0.9626), followed by amino acid and dipeptide composition in combination (0.8785), all feature set (0.8692), and Transition (0.8411), in that order.

Therefore, from the statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. Amino acid composition generally increases the overall accuracies of other descriptors in combination. One shortcoming of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences, because sequence order is lost [28, 60]. This sequence order information can be partially recovered by combination with dipeptide composition, but dipeptide composition itself lacks information on the fraction of the individual residues in the sequence; as such, a combination set is expected to give a better prediction result [27, 61], as shown above, due to the masking effect.
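The loss of sequence order in amino acid composition, and its partial recovery by dipeptide composition, can be sketched as follows. This is an illustrative sketch only: the function names and the two toy sequences are hypothetical, the descriptors are computed as plain fractions over the 20 standard residues, and the feature tool used in this work may scale or order the values differently.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aa_composition(seq):
    """Fraction of each of the 20 residues in the sequence (20 values)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent positions (400 values)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return [pairs.count(d) / n for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return aa_composition(seq) + dipeptide_composition(seq)


# Two toy sequences (hypothetical) with identical residue counts but different order.
seq_a = "GKTAGKTA"
seq_b = "AAGGKKTT"

same_aac = aa_composition(seq_a) == aa_composition(seq_b)
same_dpc = dipeptide_composition(seq_a) == dipeptide_composition(seq_b)
print(same_aac, same_dpc, len(hybrid_features(seq_a)))
```

The two sequences are indistinguishable by amino acid composition alone but differ in dipeptide composition, so the concatenated vector separates them.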

The models were further investigated based on their sensitivity to predict ATP-BPs, and the results are displayed in pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

Figure 6: The ROC plot (sensitivity versus 1 - specificity). The plot shows the performance of the LIBSVM model, generated with the StatsDirect package using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

These descriptors were among the best four performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), Quasi sequence order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This information highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the Quasi sequence order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).

The overall model evaluation shows that the amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole sequence information. The use of the “all the descriptor” set did not generally result in a better classification model. The accuracy of the “all features” descriptor was 79.9%, against 84.57% for amino acid/dipeptide composition in combination. This finding is in accordance with [62, 63], which examined molecular descriptors for predicting compounds of specific properties using an “all features” set. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. Hence, the accuracy of classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance in solving the classification problem in question. The ROC plot of the SVM model (Figure 6) has an AUC of 0.849219. This highlights a better model based on whole sequence analysis.
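The two routes to the area under the ROC curve named in the Figure 6 caption, a trapezoidal rule and a method analogous to the Wilcoxon/Mann-Whitney test, can be sketched as below. This is not the StatsDirect implementation; the function names, labels, and decision scores are hypothetical and serve only to show that both routes give the same area on the same data.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    Tied scores are processed together so that each threshold yields one point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_mann_whitney(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = ATP binding) and SVM decision scores.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]

auc_a = auc_trapezoid(roc_points(labels, scores))
auc_b = auc_mann_whitney(labels, scores)
print(auc_a, auc_b)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the value reported above.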


[59] V Sobolev A Sorokine J Prilusky E E Abola and M Edel-man ldquoAutomated analysis of interatomic contacts in proteinsrdquoBioinformatics vol 15 no 4 pp 327ndash332 1999

[60] R E Schapire and Y Singer ldquoBoostexter a boosting-basedsystem for text categorizationrdquo Machine Learning vol 39 no2-3 pp 135ndash168 2000

[61] S A Ong H H Lin Y Z Chen Z R Li and Z Cao ldquoEfficacyof different protein descriptors in predicting protein functionalfamiliesrdquo BMC Bioinformatics vol 8 article 300 2007

[62] L Xue and J Bajorath ldquoMolecular descriptors in chemoin-formatics computational combinatorial chemistry and virtualscreeningrdquo Combinatorial Chemistry and High ThroughputScreening vol 3 no 5 pp 363ndash372 2000

[63] L Xue JW Godden and J Bajorath ldquoEvaluation of descriptorsand mini-fingerprints for the identification of molecules withsimilar activityrdquo Journal of Chemical Information and ComputerSciences vol 40 no 5 pp 1227ndash1234 2000

ISRN Computational Biology 7

Figure 4: The performances of descriptors with LIBSVM in terms of specificity. The length of each color-coded descriptor and the pyramidal view are a measure of their performances in terms of specificity. The most specific descriptors were amino acid composition and amino acid/dipeptide composition (0.87), followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order.

Figure 5: The performances of descriptors with LIBSVM in terms of precision. The length of each color-coded descriptor and the pyramidal view are a measure of their performances in terms of precision. The most precise descriptor set was the quasi-sequence-order descriptors (0.9626), followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order.

The performances of the models were evaluated based on MCC (Figure 2). The pyramidal view and the length of the color-coded descriptors were used for performance visualization. The best performer was amino acid and dipeptide composition in combination (0.6931), followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449), in that order. This order is in line with their performances measured using accuracy as the parameter. This result justifies the performance of the overall model. In general, the combination of descriptor sets performs better than individual descriptors, particularly when combined with amino acid composition.
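To make the two ranking measures concrete, the following minimal sketch computes accuracy and MCC from the four confusion-matrix counts, assuming the standard definitions of both measures. The function names and the counts are hypothetical illustrations and are not taken from this study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here), i.e. when a whole row or column of the matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (NOT results from the paper): 100 positives, 100 negatives.
tp, tn, fp, fn = 90, 80, 20, 10
print("Acc:", round(accuracy(tp, tn, fp, fn), 4))
print("MCC:", round(matthews_corrcoef(tp, tn, fp, fn), 4))
```

Because MCC uses all four counts, it can separate models whose accuracies are close, which is one reason the two measures are reported together.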

Therefore, from the statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. The amino acid composition generally increases the overall accuracies of other descriptors in combination. One shortcoming of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences, because sequence order is lost [28, 60]. This sequence order information can be partially captured by combination with dipeptide composition, but dipeptide composition itself lacks information on the fraction of each individual residue in the sequence. As such, a combination set is expected to give a better prediction result [27, 61], as shown above, due to the masking effect.
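The loss of sequence order can be illustrated with a short sketch. It assumes the usual definitions: amino acid composition as the fraction of each of the 20 standard residues, and dipeptide composition as the fraction of each of the 400 ordered residue pairs. The function names and the two toy sequences are hypothetical, and the code is not the descriptor software used in this work.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """20-dimensional vector: fraction of each standard residue in seq."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector: fraction of each ordered residue pair in seq."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n = len(seq) - 1
    return [pairs[d] / n if n > 0 else 0.0 for d in DIPEPTIDES]


def hybrid_features(seq):
    """Concatenated amino acid + dipeptide composition (420 values)."""
    return aa_composition(seq) + dipeptide_composition(seq)


# Two toy sequences with identical residue content but different order.
a, b = "GKTGKS", "KGSTGK"
print(aa_composition(a) == aa_composition(b))                # same composition
print(dipeptide_composition(a) == dipeptide_composition(b))  # different pairs
print(len(hybrid_features(a)))
```

The two sequences are indistinguishable by amino acid composition alone but differ in dipeptide composition, which is the complementarity the combined descriptor relies on. Non-standard residue codes are simply ignored by this sketch.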

Figure 6: The ROC plot (sensitivity against 1 − specificity). The plot shows the performance of the LIBSVM model and was generated with the StatsDirect package, using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

The models were further investigated based on their sensitivity to predict ATP-BPs, and the results are displayed in pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

These descriptors were among the best four performers in terms of accuracy (Acc) and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the quasi-sequence-order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).
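Sensitivity, specificity, and precision can rank descriptors differently because each uses a different part of the confusion matrix. The sketch below assumes the standard definitions of the three measures; the descriptor names and counts are hypothetical and are not results from this study.

```python
def rate(numerator, denominator):
    """Return numerator / denominator, or None when the rate is undefined."""
    return numerator / denominator if denominator else None


def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and precision from confusion-matrix counts."""
    return {
        "sensitivity": rate(tp, tp + fn),  # fraction of true binders recovered
        "specificity": rate(tn, tn + fp),  # fraction of non-binders rejected
        "precision": rate(tp, tp + fp),    # fraction of predicted binders that are real
    }


# Hypothetical (tp, tn, fp, fn) counts for two descriptor sets evaluated on
# the same 100 positive and 100 negative sequences.
models = {
    "descriptor_A": (88, 80, 20, 12),
    "descriptor_B": (70, 97, 3, 30),
}
for name, counts in models.items():
    metrics = confusion_metrics(*counts)
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

In this toy case descriptor_B is the more precise and more specific of the two while being clearly less sensitive, which mirrors how a descriptor can lead on precision without leading on the other measures.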

The overall model evaluation shows that the combination of amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole sequence information. The use of the “all the descriptor” set did not generally result in a better classification model. The “all features” descriptor accuracy was 79.9%, against 84.57% for amino acid/dipeptide composition in combination. This finding is in accordance with [62, 63], which used an “all features” set of molecular descriptors to predict compounds with specific properties. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. Hence, the accuracy of classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance in solving the classification problem in question. The ROC plot of the SVM model (Figure 6) gives an AUC of 0.849219. This indicates a well-performing model based on whole sequence analysis.
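The area under the ROC curve can be obtained either by trapezoidal integration of the curve or from a rank statistic of the Wilcoxon/Mann-Whitney type. The following sketch shows both on a small set of hypothetical decision scores and labels; it is an illustration under those assumptions and not the StatsDirect implementation used for Figure 6.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, sweeping the threshold downward."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_mann_whitney(scores, labels):
    """Probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores and true labels (1 = ATP-binding).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(auc_trapezoid(roc_points(scores, labels)))
print(auc_mann_whitney(scores, labels))
```

Both routes give the same value on the same data, so the rank-based form can serve as a check on the integrated curve.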

4. Conclusions

The prediction of ATP-binding proteins has been explored using a battery of descriptor sets and a hybrid functional group. Also, for the first time, the prediction of ATP binding in universal stress proteins has been investigated using the support vector machine. The best hybrid model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) value of 0.693. The general trend is that combinations of descriptors perform better and improve the overall performances of individual descriptors, particularly when combined with amino acid composition. This model provides a high probability of success for molecular biologists in predicting and selecting diverse groups of ATP binding proteins.

Conflict of Interests

The author reports no conflict of interests in this work, including the mentioned trademarks.

Acknowledgments

The research reported was supported by the National Institutes of Health (NIH-NIGMS-1T36GM095335) and the National Science Foundation (EPS-0903787, EPS-1006883). The content is solely the responsibility of the author and does not necessarily represent the official views of the funding agencies.

References

[1] A. Bairoch and R. Apweiler, “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] H. M. Berman, J. Westbrook, Z. Feng et al., “The protein data bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] J. Guo, H. Chen, Z. Sun, and Y. Lin, “A novel method for protein secondary structure prediction using dual-layer SVM and profiles,” Proteins, vol. 54, no. 4, pp. 738–743, 2004.

[4] C. Bustamante, Y. R. Chemla, N. R. Forde, and D. Izhaky, “Mechanical processes in biochemistry,” Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004.

[5] J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, “Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold,” The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] N. Hirokawa and R. Takemura, “Biochemical and molecular characterization of diseases linked to motor proteins,” Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003.

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, “Transport of glyburide by placental ABC transporters: implications in fetal drug exposure,” Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, “The ATP-binding site of type II topoisomerases as a target for antibacterial drugs,” Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, “Kinetic analysis of the mechanism of action of the multidrug transporter,” Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, “The bacterial universal stress protein: function and regulation,” Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, “Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli,” Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, “The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway,” Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. McKay, “Structure of the universal stress protein of Haemophilus influenzae,” Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, “Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey,” Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, “Identification of ATP binding residues of a protein from its primary sequence,” BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, “A novel statistical ligand-binding site predictor: application to ATP-binding sites,” Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, “ATPsite: sequence-based prediction of ATP-binding residues,” Proteome Science, vol. 9, article S4, supplement 1, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, “Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features,” BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, “Recognition of adenosine triphosphate binding sites using parallel cascade system identification,” Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, “Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search,” The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, “Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information,” Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, “GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes,” Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, “Prediction of RNA binding sites in a protein using SVM and PSSM profile,” Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., “Functional annotation analytics of Bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant Bacillus megaterium,” Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., “Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni,” Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, “Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites,” Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., “PISCES: a protein sequence culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, “Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines,” Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., “CDD: conserved domains and protein three-dimensional structure,” Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, “PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence,” Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., “Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein,” PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, “Effect of training datasets on support vector machine prediction of protein-protein interactions,” Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, “Predicting protein-protein interactions from sequences in a hybridization space,” Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, “Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality,” Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., “RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis,” in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC ’09), pp. 4352–4355, 2009.

[39] C.-C. Chang and C.-J. Lin, “Training nu-support vector classifiers: theory and algorithms,” Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, “Practical selection of SVM parameters and noise estimation for SVM regression,” Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, “Prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, “Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine,” Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, “Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition,” Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., “Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS,” Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, “Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis,” Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, “Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information,” Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, “Predicting deleterious nsSNPs: an analysis of sequence and structural attributes,” BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,” Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, “The relation between the divergence of sequence and structure in proteins,” The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins,” Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, “Structural relationships of homologous proteins as a fundamental principle in homology modeling,” Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, “AAA+ proteins: have engine, will work,” Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, “The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins,” The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. McPherson, A. H. J. Wang, and A. Rich, “Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu,” The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., “Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, “An empirical approach for detecting nucleotide-binding sites on proteins,” Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, “Automated analysis of interatomic contacts in proteins,” Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, “Boostexter: a boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, “Efficacy of different protein descriptors in predicting protein functional families,” BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, “Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening,” Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, “Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity,” Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.

8 ISRN Computational Biology

Feature

AA composition

All feature set

Amino acid anddipeptide composition

Composition

Dipeptide comp

Distribution

Geary autocorrelation

Moran autocorrelation

Norm M-Bautocorrelation

Pseudo amino acidcomposition

Quasi-sequenceorder descriptors

Sequence-order-coupling number

Transition

PrecisionPrecision

0785

08692

08785

08224

08224

0729

07944

08224

08224

Null

09626

07383

08411

Null0729

07383

0765

07944

08224

08411

08502

08785

09626

Figure 5 The performances of descriptors with LIBSVM in terms of Precision The length of each color coded descriptor and the pyramidalview are a measure of their performances in terms of precision The most precise descriptor was Quasi sequence order descriptors (09626)followed by amino acid and dipeptide composition in combination (08785) all feature set (08692) and Transition (08411) in that order

as the parameter This result justifies the performance ofthe overall model In general the combination of descriptorsets performs better than individual descriptors particularlywhen combined with amino acid composition

Therefore from the statistical point of view the use ofcombination sets particularly with amino acid compositiontend to give better prediction performance than individual-sets [53]The amino acid composition generally increases theoverall accuracies of other descriptors in combination Oneof the shortcoming of amino acid composition as a descriptor

is that the same amino acid composition may correspond todiverse sequences due to the loss of sequence order [28 60]This sequence order information can be partially coveredby combination with dipeptide composition but dipeptidecomposition itself lacks information on the fraction of theindividual residue in the sequence as such a combination setis expected to give a better prediction result [27 61] as shownabove due to masking effect

The models were further investigated based on theirsensitivity to predict ATP-BPs and the results displayed in

ISRN Computational Biology 9

100

075

050

025

000

100075050025000

Sens

itivi

ty

1 minus specificity

Figure 6 The ROC plot the plot shows the performance ofthe LIBSVM model generated with StatsDirect package using anextended trapezoidal rule and a nonparametric method analogousto the WilcoxonMann-Whitney test to calculate the area under theROC curve The calculated AUA was 0849219

pyramidal view (Figure 3) The most sensitive descriptorwas amino acid composition (0875) followed by dipep-tide composition (08381) amino aciddipeptide compositionin combination (08246) and Norm M-B autocorrelation(08224) in that order

These descriptors were among the best four performersin terms of Acc and MCC Evaluation based on specificityindicates that amino acid composition (087) was more spe-cific followed by using the entire feature set (08478) Quasisequence order descriptors (08333) and dipeptide compo-sition (08257) in that order (Figure 4) This informationhighlights the vital role played by amino acid compositionin protein function predictions in general Interestingly theQuasi sequence order descriptors (09626) had the highestprecision followed by amino acid and dipeptide compositionin combination (08785) entire feature set (08692) andTransition (08411) in that order (Figure 5)

The overall model evaluation shows that the amino acidsand dipeptide composition was the best model for predict-ing ATP-BPs from diverse functional classes using wholesequence information The use of ldquoall the descriptorrdquo set didnot generally result in a better model in classification Theldquoall featuresrdquo descriptor accuracy was 799 against 8457for amino acidsdipeptide in combination This finding isin accordance with [62 63] on their work on moleculardescriptors for predicting compounds of specific propertiesusing ldquoall featuresrdquo set The reduction in accuracy might bedue to noise generated by the use of many overlapping andredundant descriptors Hence the accuracy of the classifier

algorithms can be severely degraded by the presence of noisyor irrelevant features or if the feature scales are not consistentwith their importance in solving the classification problemin question The performance of the SVM model using ROCplot (Figure 6) has a value of AUCof 0849219This highlightsa better model based on whole sequence analysis

4 Conclusions

The prediction of ATP-binding proteins has been exploitedusing a battery of descriptor sets and a hybrid functionalgroup Also for the first time the prediction of ATP bindingin universal stress proteins had been investigated using thesupport vector machine The best hybrid model was thecombination of amino acid and dipeptide composition of thesequences with an accuracy of 8457 and Mathews corre-lation coefficient (MCC) value of 0693 The general trendis that combination of descriptors will perform better andimprove the overall performances of individual descriptorsparticularly when combined with amino acid compositionThis model provides a high probability of success for molec-ular biologists in predicting and selecting diverse groups ofATP binding proteins

Conflict of Interests

The author reports no conflict of interests in this workincluding the mentioned trademarks

Acknowledgments

The research reported was supported by the NationalInstitutes of Health (NIH-NIGMS-1T36GM095335) and theNational Science Foundation (EPS-0903787 EPS-1006883)The content is solely the responsibility of the author and doesnot necessarily represent the official views of the fundingagencies

References

[1] A. Bairoch and R. Apweiler, "The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000," Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] H. M. Berman, J. Westbrook, Z. Feng et al., "The protein data bank," Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] J. Guo, H. Chen, Z. Sun, and Y. Lin, "A novel method for protein secondary structure prediction using dual-layer SVM and profiles," Proteins, vol. 54, no. 4, pp. 738–743, 2004.

[4] C. Bustamante, Y. R. Chemla, N. R. Forde, and D. Izhaky, "Mechanical processes in biochemistry," Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004.

[5] J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, "Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold," The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] N. Hirokawa and R. Takemura, "Biochemical and molecular characterization of diseases linked to motor proteins," Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003.

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, "Transport of glyburide by placental ABC transporters: implications in fetal drug exposure," Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, "The ATP-binding site of type II topoisomerases as a target for antibacterial drugs," Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, "Kinetic analysis of the mechanism of action of the multidrug transporter," Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, "The bacterial universal stress protein: function and regulation," Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, "Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli," Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, "The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway," Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. Mckay, "Structure of the universal stress protein of Haemophilus influenzae," Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, "Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey," Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, "Identification of ATP binding residues of a protein from its primary sequence," BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, "A novel statistical ligand-binding site predictor: application to ATP-binding sites," Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, "ATPsite: sequence-based prediction of ATP-binding residues," Proteome Science, vol. 9, article S4, supplement 1, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, "Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features," BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, "Recognition of adenosine triphosphate binding sites using parallel cascade system identification," Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, "Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search," The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, "Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information," Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, "GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes," Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, "Prediction of RNA binding sites in a protein using SVM and PSSM profile," Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., "Functional annotation analytics of Bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant Bacillus megaterium," Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., "Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni," Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, "Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites," Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, "Clustering of highly homologous sequences to reduce the size of large protein databases," Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., "PISCES: a protein sequence culling server," Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, "Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines," Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., "CDD: conserved domains and protein three-dimensional structure," Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, "PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence," Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., "Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein," PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, "Effect of training datasets on support vector machine prediction of protein-protein interactions," Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, "Predicting protein-protein interactions from sequences in a hybridization space," Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, "Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality," Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., "RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis," in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 4352–4355, 2009.

[39] C.-C. Chang and C.-J. Lin, "Training nu-support vector classifiers: theory and algorithms," Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, "Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, "Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine," Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, "Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition," Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., "Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS," Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, "Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis," Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, "Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information," Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, "Predicting deleterious nsSNPs: an analysis of sequence and structural attributes," BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. Mcneil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach," Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, "The relation between the divergence of sequence and structure in proteins," The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, "How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins," Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, "Structural relationships of homologous proteins as a fundamental principle in homology modeling," Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, "AAA+ proteins: have engine, will work," Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, "The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins," The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. Mcpherson, A. H. J. Wang, and A. Rich, "Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu," The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., "Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, "An empirical approach for detecting nucleotide-binding sites on proteins," Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, "Automated analysis of interatomic contacts in proteins," Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, "Efficacy of different protein descriptors in predicting protein functional families," BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, "Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening," Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, "Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity," Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.

Figure 6: The ROC plot (sensitivity versus 1 − specificity). The plot shows the performance of the LIBSVM model, generated with the StatsDirect package using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.
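The two ways of obtaining the area under the ROC curve named in the caption can be illustrated with a short sketch. It assumes each protein receives a real-valued classifier score and that higher scores indicate the positive class; the function names are hypothetical, and this is not the StatsDirect implementation.

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def auc_trapezoid(scores_pos, scores_neg):
    """AUC by the trapezoidal rule over the empirical ROC curve."""
    # Sweep the decision threshold from the highest score down to the lowest.
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]  # (1 - specificity, sensitivity)
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

On the same set of scores the two functions agree, since the trapezoidal area under the empirical ROC curve equals the Mann-Whitney statistic with ties counted as one half.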

pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875), followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224), in that order.

These descriptors were among the best four performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), quasi-sequence-order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the quasi-sequence-order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).
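The measures compared above (sensitivity, specificity, precision, Acc, and MCC) all derive from the four counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: report 0.0 when the measure is undefined.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the positive and negative sets differ in size.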

10 ISRN Computational Biology

[7] C. Gedeon, J. Behravan, G. Koren, and M. Piquette-Miller, "Transport of glyburide by placental ABC transporters: implications in fetal drug exposure," Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006.

[8] A. Maxwell and D. M. Lawson, "The ATP-binding site of type II topoisomerases as a target for antibacterial drugs," Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] H. Ashida, T. Oonishi, and N. Uyesaka, "Kinetic analysis of the mechanism of action of the multidrug transporter," Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998.

[10] K. Kvint, L. Nachin, A. Diez, and T. Nystrom, "The bacterial universal stress protein: function and regulation," Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003.

[11] T. Nystrom and F. C. Neidhardt, "Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli," Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992.

[12] A. Diez, N. Gustavsson, and T. Nystrom, "The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway," Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000.

[13] M. C. Sousa and D. B. McKay, "Structure of the universal stress protein of Haemophilus influenzae," Structure, vol. 9, no. 12, pp. 1135–1141, 2001.

[14] V. J. Promponas, C. A. Ouzounis, and I. Iliopoulos, "Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey," Briefings in Bioinformatics, 2012.

[15] J. S. Chauhan, N. K. Mishra, and G. P. Raghava, "Identification of ATP binding residues of a protein from its primary sequence," BMC Bioinformatics, vol. 10, article 434, 2009.

[16] T. Guo, Y. Shi, and Z. Sun, "A novel statistical ligand-binding site predictor: application to ATP-binding sites," Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005.

[17] K. Chen, M. J. Mizianty, and L. Kurgan, "ATPsite: sequence-based prediction of ATP-binding residues," Proteome Science, vol. 9, article S4, supplement 1, 2011.

[18] Y. N. Zhang, D. J. Yu, S. S. Li, Y. X. Fan, Y. Huang, and H. B. Shen, "Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features," BMC Bioinformatics, vol. 13, article 118, 2012.

[19] J. R. Green, M. J. Korenberg, R. David, and I. W. Hunter, "Recognition of adenosine triphosphate binding sites using parallel cascade system identification," Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003.

[20] A. Garg, M. Bhasin, and G. P. S. Raghava, "Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search," The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005.

[21] S. Ahmad, M. M. Gromiha, and A. Sarai, "Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information," Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004.

[22] X. Xiao, P. Wang, and K.-C. Chou, "GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes," Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009.

[23] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, "Prediction of RNA binding sites in a protein using SVM and PSSM profile," Proteins, vol. 71, no. 1, pp. 189–194, 2008.

[24] B. S. Williams, R. D. Isokpehi, A. N. Mbah et al., "Functional annotation analytics of Bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant Bacillus megaterium," Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012.

[25] R. D. Isokpehi, O. Mahmud, A. N. Mbah et al., "Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni," Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011.

[26] A. N. Mbah, O. Mahmud, O. R. Awofolu, and R. D. Isokpehi, "Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites," Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013.

[27] W. Li, L. Jaroszewski, and A. Godzik, "Clustering of highly homologous sequences to reduce the size of large protein databases," Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] G. Wang and R. L. Dunbrack Jr., "PISCES: a protein sequence culling server," Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[29] X. Yu, J. Cao, Y. Cai, T. Shi, and Y. Li, "Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines," Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006.

[30] A. Marchler-Bauer, C. Zheng, F. Chitsaz et al., "CDD: conserved domains and protein three-dimensional structure," Nucleic Acids Research, vol. 41, pp. D348–D352, 2013.

[31] Z. R. Li, H. H. Lin, L. Y. Han, L. Jiang, X. Chen, and Y. Z. Chen, "PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence," Nucleic Acids Research, vol. 34, pp. W32–W37, 2006.

[32] Z. Bikadi, I. Hazai, D. Malik et al., "Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein," PLoS ONE, vol. 6, no. 10, Article ID e25815, 2011.

[33] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. M. Chung, "Effect of training datasets on support vector machine prediction of protein-protein interactions," Proteomics, vol. 5, no. 4, pp. 876–884, 2005.

[34] M. P. Brown, W. N. Grundy, D. Lin et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000.

[35] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.

[36] K.-C. Chou and Y.-D. Cai, "Predicting protein-protein interactions from sequences in a hybridization space," Journal of Proteome Research, vol. 5, no. 2, pp. 316–322, 2006.

[37] M. E. Matheny, F. S. Resnic, N. Arora, and L. Ohno-Machado, "Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality," Journal of Biomedical Informatics, vol. 40, no. 6, pp. 688–697, 2007.

[38] F. Javed, G. S. Chan, A. V. Savkin et al., "RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis," in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 4352–4355, 2009.

[39] C.-C. Chang and C.-J. Lin, "Training nu-support vector classifiers: theory and algorithms," Neural Computation, vol. 13, no. 9, pp. 2119–2147, 2001.

[40] V. Cherkassky and Y. Ma, "Practical selection of SVM parameters and noise estimation for SVM regression," Neural Networks, vol. 17, no. 1, pp. 113–126, 2004.

[41] K. C. Chou and C. T. Zhang, "Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, pp. 275–349, 1995.

[42] C. Chen, L. Chen, X. Zou, and P. Cai, "Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine," Protein and Peptide Letters, vol. 16, no. 1, pp. 27–31, 2009.

[43] H. Ding, L. Luo, and H. Lin, "Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition," Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.

[44] J. Bondia, C. Tarin, W. Garcia-Gabin et al., "Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS," Journal of Diabetes Science and Technology, vol. 2, pp. 622–629, 2008.

[45] S. Chen, S. Zhou, F.-F. Yin, L. B. Marks, and S. K. Das, "Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis," Medical Physics, vol. 34, no. 10, pp. 3808–3814, 2007.

[46] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, no. 2, pp. 442–451, 1975.

[47] L. Bao and Y. Cui, "Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information," Bioinformatics, vol. 21, no. 10, pp. 2185–2190, 2005.

[48] R. J. Dobson, P. B. Munroe, M. J. Caulfield, and M. A. S. Saqi, "Predicting deleterious nsSNPs: an analysis of sequence and structural attributes," BMC Bioinformatics, vol. 7, article 217, 2006.

[49] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.

[50] E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach," Biometrics, vol. 44, no. 3, pp. 837–845, 1988.

[51] C. Chothia and A. M. Lesk, "The relation between the divergence of sequence and structure in proteins," The EMBO Journal, vol. 5, no. 4, pp. 823–826, 1986.

[52] A. M. Lesk and C. Chothia, "How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins," Journal of Molecular Biology, vol. 136, no. 3, pp. 225–270, 1980.

[53] M. Hilbert, G. Bohm, and R. Jaenicke, "Structural relationships of homologous proteins as a fundamental principle in homology modeling," Proteins, vol. 17, no. 2, pp. 138–151, 1993.

[54] P. I. Hanson and S. W. Whiteheart, "AAA+ proteins: have engine, will work," Nature Reviews Molecular Cell Biology, vol. 6, no. 7, pp. 519–529, 2005.

[55] K. M. Ferguson, T. Higashijima, M. D. Smigel, and A. G. Gilman, "The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins," The Journal of Biological Chemistry, vol. 261, no. 16, pp. 7393–7399, 1986.

[56] F. Jurnak, A. McPherson, A. H. J. Wang, and A. Rich, "Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu," The Journal of Biological Chemistry, vol. 255, no. 14, pp. 6751–6757, 1980.

[57] T. I. Zarembinski, L. I.-W. Hung, H.-J. Mueller-Dieckmann et al., "Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 26, pp. 15189–15193, 1998.

[58] M. Saito, M. Go, and T. Shirai, "An empirical approach for detecting nucleotide-binding sites on proteins," Protein Engineering, Design and Selection, vol. 19, no. 2, pp. 67–75, 2006.

[59] V. Sobolev, A. Sorokine, J. Prilusky, E. E. Abola, and M. Edelman, "Automated analysis of interatomic contacts in proteins," Bioinformatics, vol. 15, no. 4, pp. 327–332, 1999.

[60] R. E. Schapire and Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning, vol. 39, no. 2-3, pp. 135–168, 2000.

[61] S. A. Ong, H. H. Lin, Y. Z. Chen, Z. R. Li, and Z. Cao, "Efficacy of different protein descriptors in predicting protein functional families," BMC Bioinformatics, vol. 8, article 300, 2007.

[62] L. Xue and J. Bajorath, "Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening," Combinatorial Chemistry and High Throughput Screening, vol. 3, no. 5, pp. 363–372, 2000.

[63] L. Xue, J. W. Godden, and J. Bajorath, "Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity," Journal of Chemical Information and Computer Sciences, vol. 40, no. 5, pp. 1227–1234, 2000.
