Research Article

A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features

Jinlin Ma,1,2 Ziping Ma,2 Baosheng Kang,1 and Ke Lu3

1 School of Information and Technology, Northwest University, Xi'an 710120, China
2 School of Mathematics and Information Science, North University of Nationalities, Yinchuan 750021, China
3 College of Computing & Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China

Correspondence should be addressed to Jinlin Ma; 624160@gmail.com

Received 20 May 2014; Accepted 30 July 2014; Published 1 September 2014

Academic Editor: Shengyong Chen

Copyright © 2014 Jinlin Ma et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper, we propose a novel visual method for protein model classification and retrieval. Different from the conventional methods, the key idea of the proposed method is to extract image features of proteins and measure the visual similarity between proteins. Firstly, the multiview images are captured from the vertices and planes of a given octahedron surrounding the protein. Secondly, the local features are extracted from each image of the different views by the SURF algorithm and are vector quantized into visual words using a visual codebook. Finally, the KLD is employed to calculate the similarity distance between two feature vectors. Experimental results show that the proposed method has encouraging performance for protein retrieval and categorization, as shown in the comparison with other methods.

1. Introduction

The classification and retrieval of protein models is widely applicable in biomedical science. Biologists have a great demand for protein retrieval and classification tools to identify the functions of unknown proteins and to discover new functions of known proteins.

The most widely used protein structure classification systems are CATH [1] and SCOP [2], both of which are created by experts based on their experience. With the rapid growth in the number of 3D protein structures, manual classification has been unable to meet the demand. It is desirable to do classification and retrieval in a more automated way. So, more and more researchers are dedicated to studying automatic classification methods which are based on the biological function of the protein molecules.

Protein molecules have specific shapes which can be described by their biological function, for example, the amino acid sequences and 3D structures. According to the different biological functions, there are three kinds of methods for protein retrieval and classification. They are based, respectively, on molecular sequence, protein secondary structure (SS) elements, and 3D structural coordinates.

The methods based on the molecular sequence aim to determine the amino acid sequences, since the amino acid sequences of proteins are easily understandable and simple to classify. These methods include FASTA [3], BLAST [4], PSI-BLAST [5], and Hidden Markov Models [6].

In most cases, the protein is represented by a set of SS elements. Many researchers have therefore designed algorithms that build feature vectors from SS elements or compute a similarity distance between SS elements. Milledge et al. [7] created a geometric hashing scheme using interatomic distances to identify triples of atoms. Zotenko et al. [8] mapped the structure to a high-dimensional vector and used the distance between the corresponding vectors to approximate the structural similarity. Aung and Tan [9] extracted feature vectors from the contact regions of the secondary structure elements (SSEs). Camoglu et al. [10] used an R-tree to index the feature vectors represented by SS elements. Cantoni et al. [11] proposed a protein structural motif retrieval approach based on the Generalized Hough Transform, which evaluates triplets of secondary structures by midpoint distance. Mavridis et al. [12] compared the performance of six algorithms, namely, Contact Maps, 3DZernike, Group Integration, Genocrypt, Spherical Trace Transform, and 3DBlast, by classifying protein structures according to the CATH superfamily classification. The experimental results showed that Contact Maps and 3DBlast are conceived specifically to compare the structures of proteins.

Hindawi Publishing Corporation, Computational and Mathematical Methods in Medicine, Volume 2014, Article ID 269394, 10 pages, http://dx.doi.org/10.1155/2014/269394

The methods based on the 3D structure coordinates try to describe protein shape by identifying or comparing structural alignments. MAMMOTH [13] modeled a portion of the target structure and compared the protein structure with an arbitrary low-resolution protein model. TM-align [14] identified the best structural alignment between protein pairs using Dynamic Programming. FAST [15] compared the intramolecular residue-residue relationships of two structures by using a directionality-based scoring scheme. In order to reduce the coordinate-independent space of protein structures, Holm and Sander [16] proposed the optimal pairwise structural alignment algorithm using Monte Carlo. Shindyalov and Bourne [17] studied heuristics, combinatorial extension, and similarity evaluation of the structural alignment path algorithm.

In recent years, methods based on image distance matrices have appeared. Ankerst et al. [18] introduced a 3D shape histogram algorithm to compare protein models or molecules. Chi et al. [19] compared protein structures by using signatures extracted from image-based distance matrices and a multidimensional index. Yeh et al. [20] compared protein models from multiple 3D projection views. The image-based retrieval methods exhibited a higher degree of precision than the three kinds of traditional methods.

In this paper, we propose an image-based protein retrieval and classification method that uses the SURF algorithm to extract features and k-means to cluster the features, thus generating a codebook. We use a histogram determined by BOVF (bag-of-visual-features) vectors to represent the characteristics of the identified models.

We construct an image-based method to avoid an exhaustive search over molecular sequences, structure coordinates, and chain structure alignments. Our major contribution is an efficient protein model retrieval and classification method using bag-of-visual-features. The performance is encouraging: our experimental retrieval precision is 96% on average.

This paper is organized as follows. Section 2 discusses the related algorithms for SURF and bag-of-features and then details the proposed method. Experimental results are presented and analyzed in Section 3. In the final section, we conclude this paper.

2. Materials and Methods

2.1. Bag-of-Features. The bag-of-words method was first used in document retrieval and was later applied to 3D shape retrieval and categorization because of its many advantages, such as simplicity, flexibility, and efficiency. The bag-of-features was first proposed by Liu et al. [21] for both global comparison and partial matching. It relies on the extraction of spin image signatures, which are later grouped into clusters. Yu et al. [22] built an effective image retrieval system based on the bag-of-features model. They integrated the SIFT and LBP descriptors and the HOG and LBP descriptors, respectively, and proposed patch-based and image-based integration models. The experimental results showed that image-based SIFT-LBP integration with clustering by the weighted k-means algorithm achieves the best performance. A simple yet powerful approach to background subtraction using bag-of-features was presented in [23]. The authors supposed that encoding the local color and texture information can effectively attenuate the texture variations in background scenes; domain-range features were then encoded in a soft-assignment coding procedure governed by appropriate kernel variances. Nanni and Lumini [24] applied bag-of-features and a heterogeneous set of texture descriptors to object recognition. Their method is based on a simple exhaustive extraction of subwindows and random subspace classification by support vector machine (SVM), and it can reduce dimensions by principal component analysis. Moreover, Zhou et al. [25] proposed a method for scene classification using a multiresolution bag-of-features model. The bag-of-features approach has also been applied to music classification [26], distinguishing handwritten from machine-printed text [27], noise filtering [28], and so forth.

The bag-of-features has also been applied to local-feature-based 3D shape retrieval and classification. Ohbuchi et al. [29] extracted local features from each range image of different views using the Scale Invariant Feature Transform (SIFT) algorithm [30] for retrieving rigid models and articulated models. In [31], a novel framework is employed to combine spectral clustering with region growing based on fast marching for 3D object categorization.

BOF was also used in analyzing medical images and computer-aided diagnosis (CAD). Shen et al. [32] proposed a human epithelial type 2 (HEp-2) classification framework using intensity order pooling based on gradient features and bag-of-words. The pooled gradient feature extracted by the intensity orders of local grid points is rotationally invariant and outperformed the SIFT feature significantly. Wang et al. [33] investigated two issues of the bag-of-features strategy for tissue classification and developed a novel algorithm named Joint-ViVo to learn the vocabulary and visual word weights jointly. The test results showed that the algorithm is better than the state-of-the-art methods at classifying breast tissue density in mammograms and lung tissue in high-resolution computed tomography (HRCT) images and at identifying brain tissue type in magnetic resonance imaging (MRI).

2.2. Speeded-Up Robust Features (SURF). SURF was proposed by Bay et al. [34]. It is based on sums of 2D Haar wavelet responses and a Hessian-matrix-based interest point detector, and it makes efficient use of integral images. It is a robust local feature detector and descriptor that can be used in computer vision tasks such as object recognition or 3D reconstruction. Though SURF is partly inspired by the SIFT descriptor, it is reported to be several times faster and more robust than SIFT.

Gui et al. [35] proposed a novel point-pattern matching method based on SURF and Shape Context. They applied SURF bidirectional matching to match the feature points in two images preliminarily and then calculated Shape Context descriptors of the feature points. Experimental results show that the method can eliminate incorrect matching point pairs and improve the accuracy of point-pattern matching. A fully affine invariant SURF algorithm was proposed by Pang et al. [36], which has the affine invariance of ASIFT and the efficiency of SURF. Alcantarilla et al. [37] proposed a descriptor named Gauge-SURF, which is evaluated relative to the gradient direction at every pixel. Because of the use of integral images, the descriptors are fast and robust.

Figure 1: An illustration of our method (PDB code "1hdd"). Panels: protein; regular octahedron view; multiview shape (views 1 to 14); local feature extraction; clustering; visual words (codebook); bag-of-features histogram.

Recently, SURF was used in iris retrieval and recognition. A hierarchical approach was proposed to retrieve an iris image from a large iris database [38]. The approach combines iris color and texture, and the iris texture features are obtained by the SURF algorithm. Mehrotra et al. [39] proposed a robust segmentation and an adaptive SURF descriptor for iris recognition. In their method, the adaptive strip transformed from the annular region between the iris and pupil boundaries is enhanced using a gamma correction approach. Then, features are extracted from the adaptive strip using SURF. Feulner et al. [40] presented a method for automatically estimating the body region of a CT volume image. The method is based on 1D registration of histograms of visual words, which serve as a description of a CT slice. The SURF descriptor was extended to N dimensions, named N-SURF. Because of its simpler and more efficient functioning, they used 2D upright SURF descriptors for estimating the body region.

2.3. The Proposed Method. The key idea of our method is to extract the features of proteins and measure the visual similarity between proteins. Our algorithm is implemented in four successive steps, as shown in Figure 1.

(1) Multiview Rendering. Render multiview images of the protein from different perspectives. The viewing angle is determined using the vertices and planes of a given octahedron structure surrounding the protein, as shown in Figure 2.

(2) Local Feature Extraction. Extract the local visual features of the multiview images by using the SURF algorithm. For each view, we calculate the SURF descriptors.

(3) Visual Words Generation and Word Histogram Construction. Generate visual words from feature vectors using a visual dictionary (i.e., the codebook). The visual dictionary is obtained by k-means clustering, so each local feature is represented in a discrete form. The frequencies of visual words are counted and stored in a histogram, which becomes the feature vector of the corresponding protein model.

Figure 2: The multiviews are captured from the vertices (A to F) and the planes (a to d and a′ to d′) on a given regular octahedron.

(4) Distance Computation. The dissimilarity between a pair of feature vectors (the histograms) is computed by the Kullback-Leibler divergence (KLD).

2.3.1. Multiview Rendering. The multiview shapes are captured from the six vertices and the eight planes on a given regular octahedron. Figure 2 shows the six vertices and the eight planes on the octahedron. Along the x+, x−, y+, y−, z+, and z− axes, we capture the protein's right view, left view, top view, bottom view, front view, and rear view. Along the normal direction of each plane, we get the protein's eight oblique views. The size of the captured image is set to 100 × 100 pixels.
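The 14 viewing directions described above can be sketched as follows. This is a minimal illustration, assuming the octahedron is centred at the origin with its vertices on the coordinate axes; the function name `octahedron_view_directions` is hypothetical and not part of the authors' program.

```python
import math
from itertools import product


def octahedron_view_directions():
    """Return 14 unit view directions: 6 vertex directions and 8 face normals."""
    # Assumption: the six vertices lie on the coordinate axes
    # (x+ right, x- left, y+ top, y- bottom, z+ front, z- rear).
    vertices = [
        (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
        (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
        (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
    ]
    # Each of the eight faces has a normal pointing toward (+-1, +-1, +-1),
    # scaled to unit length; these give the eight oblique views.
    s = 1.0 / math.sqrt(3.0)
    normals = [(sx * s, sy * s, sz * s)
               for sx, sy, sz in product((1, -1), repeat=3)]
    return vertices + normals


directions = octahedron_view_directions()
print(len(directions))
```

A renderer would place the camera along each direction, looking at the protein's centre, and save one 100 × 100 image per direction.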

As mentioned above, 14 different views are captured for each protein. Figure 3 illustrates all 14 views of the protein structure with PDB code "1hdd" after adjusting the viewing perspective.

To reduce manual interference and standardize the rendering process, we wrote a program for capturing views of protein models from CATH and SCOP in the PDB. The program can automatically load protein models, rotate models, capture images, and save images following a certain naming rule. The protein models are also selected automatically by the program in advance.

2.3.2. Local Feature Extraction. After the range images are rendered, the SURF algorithm is applied to each of the range images to detect interest points and then to extract SURF descriptors, as presented in [30]. The SURF algorithm detects interest points and then computes features at these interest points. SURF first finds the positions of features that are salient. The saliency detection is based on a multiscale and multiorientation Fast-Hessian detector and a distribution-based descriptor for gray-level change, so that each SURF feature can encode this information. The SURF descriptor is calculated using the OpenSURF C++ source code by Evans [41].

Figure 4 shows examples of the interest points generated for an image and for its rotated, affine-transformed, and scaled versions. The numbers of interest points found by the SURF algorithm are 113, 112, 102, and 93, as shown in the second row. The interest points appear at similar locations in these four images in spite of the geometrical transformations. This robustness against geometric transformations contributes to the protein model retrieval performance.

Figure 5 shows examples of SURF interest point matching for the images in Figure 4. The numbers of matched interest points are 59, 52, 54, 35, 45, and 40. Every image is successfully matched at some feature points. Because the interest points extracted by SURF differ in number, size, and position, the numbers of matched feature points differ.

2.3.3. Visual Words Generation and Word Histogram Construction. It is time consuming to compare the models' local SURF features directly, especially for a large number of views. Therefore, it is necessary to quantize the SURF descriptors extracted from a multiview image into visual words. Firstly, a visual codebook is generated by off-line k-means clustering of the features of every view. Then, the codebook is searched linearly to find the visual word closest to each feature. As a result, the feature vectors of the visual words are the centers of the clusters (called barycenters), and the number of clusters determines the codebook size.
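As an illustration of the codebook generation step, the following is a minimal sketch of Lloyd's k-means in plain Python. It is not the authors' C++ implementation; the deterministic initialisation from the first k points and the toy 2D points (standing in for SURF descriptors) are assumptions made to keep the example small and repeatable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=20):
    """Cluster points into k groups and return the cluster centres (codebook)."""
    # Assumption: initialise the centres with the first k points.
    centres = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        new_centres = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centres.append(tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centres.append(centres[i])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # assignments are stable
        centres = new_centres
    return centres


# Toy 2D "descriptors" forming two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
codebook = kmeans(points, k=2)
print(codebook)
```

Each returned centre plays the role of one visual word; with real data, k would be the codebook size.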

Figure 3: Fourteen views of protein "1hdd" (PDB code): (1) DO direction, (2) CO direction, (3) BO direction, (4) AO direction, (5) FO direction, (6) EO direction, (7) plane a, (8) plane b, (9) plane c, (10) plane d, (11) plane a′, (12) plane b′, (13) plane c′, (14) plane d′.

Figure 4: Interest points of the SURF algorithm are robust. Top row: original image (I1), rotated image (I2), affined image (I3), scaled image (I4). Bottom row: interest points of I1, I2, I3, and I4.

After generating the codebook, we construct a word histogram over the codebook, which is also an off-line process. The word histogram is constructed by counting the frequencies of visual words. Each view is represented as a word histogram, which is the feature extracted by bag-of-visual-features. The protein model's histogram is then produced by combining the histograms of all view images of the protein model.
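The quantization and histogram construction can be sketched as below. This is a hedged illustration: the codebook is assumed to be given already, the toy 2D descriptors stand in for SURF descriptors, and combining per-view histograms by summing their counts is an assumption, since the text does not state how the view histograms are combined. All function names are hypothetical.

```python
def nearest_word(descriptor, codebook):
    """Linear search for the index of the codebook centre closest to descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, centre in enumerate(codebook):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, centre))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def view_histogram(descriptors, codebook):
    """Count how often each visual word occurs among one view's descriptors."""
    hist = [0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, codebook)] += 1
    return hist


def model_histogram(views, codebook):
    """Combine per-view histograms into one model histogram.

    Assumption: the combination is a bin-wise sum of the view histograms.
    """
    total = [0] * len(codebook)
    for descriptors in views:
        for i, count in enumerate(view_histogram(descriptors, codebook)):
            total[i] += count
    return total


codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]   # three toy visual words
views = [
    [(0.1, -0.1), (0.9, 1.2)],                    # descriptors of view 1
    [(4.8, 5.1), (5.2, 4.9), (0.2, 0.0)],         # descriptors of view 2
]
hist = model_histogram(views, codebook)
print(hist)  # [2, 1, 2]
```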

2.3.4. Distance Computation and Range Models Matching. The last step of our method is the distance computation (also called model matching) between two models. The distance between a pair of feature vectors (the histograms) is computed by using the Kullback-Leibler divergence (KLD). The KLD in its standard form is not a distance metric, for it is not symmetric; the form used here, (1), is symmetric in $\mathbf{x}$ and $\mathbf{y}$. Consider

$$D(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} (y_i - x_i) \ln \frac{y_i}{x_i}, \tag{1}$$

where $\mathbf{x} = (x_i)$ and $\mathbf{y} = (y_i)$ are the feature vectors and $n$ is the dimension of the vectors.
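Equation (1) can be evaluated directly, as in the sketch below. The small constant `eps` added to every bin is an assumption introduced here to avoid division by zero and the logarithm of zero for empty histogram bins; the text does not say how empty bins are handled.

```python
import math


def kld_distance(x, y, eps=1e-9):
    """Evaluate D(x, y) = sum_i (y_i - x_i) * ln(y_i / x_i) as in equation (1)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    total = 0.0
    for xi, yi in zip(x, y):
        # Assumption: smooth each bin by eps so that empty bins are usable.
        xi, yi = xi + eps, yi + eps
        total += (yi - xi) * math.log(yi / xi)
    return total


a = [2, 1, 2]
b = [1, 3, 1]
print(kld_distance(a, b))
print(kld_distance(b, a))
```

Swapping the arguments changes the sign of both factors in every term, so the two printed values agree up to rounding, and the value is zero when the two histograms are identical.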

Figure 5: Interest points matching for the image pairs I1 and I2, I1 and I3, I1 and I4, I2 and I3, I2 and I4, and I3 and I4.

3. Results and Discussion

In order to evaluate the efficacy and generalization capacity of the proposed method, we tested it with several different retrieval and categorization tasks. The first experiment compares the classification capability of our method with that of the comparison methods. The second experiment tests the impact of the number of clusters on the method's retrieval capability. The last experiment tests the influence of the size of the training data on the method.

We implemented the experiments in Matlab R2010a, while the SURF and k-means code is written in C++. All algorithms were run under 32-bit Windows 7 on a personal computer with a Core 2 Quad 2.66 GHz CPU, 3.00 GB of DDR2 memory, and a 512 MB ATI Radeon HD4600 graphics card.

To evaluate the method's efficiency, we measured the feature extraction time. We measured the computation time on a database of 2000 protein models, which has 28000 view images. The computation time for the SURF feature extraction algorithm is about 21948 seconds (an off-line process). Clustering the features by k-means and generating the histograms by bag-of-visual-features take 4249 seconds and 0.66 seconds, respectively, as off-line processes for the same database of 2000 protein models.

3.1. Classification. In the first experiment, we evaluate the classification performance of our approach by using the protein models database of SHREC 2010 [12], which includes 1000 protein structures chosen from 100 CATH 3.3 superfamilies. In the dataset, each superfamily consists of at least 10 structures, and each structure contains at least 50 amino acids. We use two different measures (nearest neighbor and ROC plot) to test the performance of our method by comparing it with the 3DBlast, 3DZernike, Genocrypt, Contact Maps, Group Integration, and Spherical Trace Transform methods.

Table 1: Nearest neighbour results.

Method                       Correct predictions
3DBlast                      68
3DZernike                    8
Genocrypt                    56
Contact maps                 80
Group integration            52
Spherical trace transform    0
Proposed method              89

A nearest neighbor was counted as a correct prediction when the first protein of each ranked list was found to be a member of the same superfamily as the query.
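The nearest neighbor criterion can be sketched as follows. This is an illustrative example on a toy distance matrix, assuming that the query itself is excluded from its own ranked list; the function name and the data are hypothetical.

```python
def nearest_neighbour_correct(distances, labels):
    """Count queries whose closest other model has the same label.

    distances[i][j] is the dissimilarity between models i and j;
    labels[i] is the superfamily label of model i.
    """
    n = len(labels)
    correct = 0
    for query in range(n):
        # Assumption: the query is not allowed to retrieve itself.
        candidates = [j for j in range(n) if j != query]
        best = min(candidates, key=lambda j: distances[query][j])
        if labels[best] == labels[query]:
            correct += 1
    return correct


labels = ["a", "a", "b", "b"]
distances = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
print(nearest_neighbour_correct(distances, labels))  # 3
```

In this toy case, model 2 retrieves model 1 from a different class first, so three of the four queries count as correct.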

The ROC (receiver operating characteristic) plot is a graphical plot which illustrates the true positive rate against the false positive rate. The area under the curve (AUC) is a single numerical performance measure of each ROC plot. The perfect value of AUC is 1.0 [12].
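For a single query, the AUC can be computed from the ranked list of relevance flags, as in the sketch below. It uses the standard equivalence between the area under the ROC curve and the fraction of (relevant, non-relevant) pairs that are ranked in the correct order; it is not taken from the evaluation code of [12], and the function name is hypothetical.

```python
def roc_auc(ranked_relevance):
    """Area under the ROC curve for one ranked list.

    ranked_relevance[k] is 1 if the item at rank k is relevant
    (same superfamily as the query), otherwise 0.
    """
    positives = sum(ranked_relevance)
    negatives = len(ranked_relevance) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one relevant and one non-relevant item")
    seen_positives = 0
    area = 0
    for relevant in ranked_relevance:
        if relevant:
            seen_positives += 1        # ROC curve steps up
        else:
            area += seen_positives     # ROC curve steps right; add column height
    return area / (positives * negatives)


print(roc_auc([1, 1, 0, 0]))  # 1.0
print(roc_auc([1, 0, 1, 0]))  # 0.75
```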

Table 1 summarizes the retrieval rate for all the methods. The values for the comparison algorithms are taken from [12]. According to Table 1, the proposed method gives more correct predictions than the comparison algorithms.

Figure 6 shows, in bar graphs, the query results of each algorithm on 50 protein structures. As Figure 6 shows, the proposed method can successfully identify each query protein, including those that the comparison algorithms failed to identify. Compared with the comparison algorithms, the proposed method may identify some superfamilies more easily. It achieves a very encouraging result that the comparison algorithms cannot reach. It is worth noting that the classification correctness of the proposed method is almost the same as that achieved by human experts.

Figure 6: Bar chart for each method (3DBlast, 3DZernike, Genocrypt, Contact maps, Group integration, Spherical trace transform, and the proposed method) over the 50 query structures. The upper charts (a) show the total AUC, whereas the lower charts (b) show the AUCs calculated for the top 10% of the database. The results of the comparison algorithms are from [12].

Figure 7: The precision-recall curves showing the influence of the codebook size (average precision versus average recall; codebook sizes 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, and 11K).

3.2. Influence of the Size of the Codebook. In the second experiment, the influence of the vocabulary size (codebook size) on retrieval performance is studied. The test dataset includes 800 protein structures chosen from the SCOP protein database.

The number of visual words in the codebook (the codebook size) is a very important parameter in our algorithm, because it not only determines the storage requirement but also significantly affects the retrieval performance. Figure 7 demonstrates that the precision-recall curves of our method increase steadily with the codebook size. We observe that, as the codebook size grows from 1000 to 11000, the precision-recall values are comparatively stable, and the precision rate is between 0.96 and 1 for all codebook sizes. Meanwhile, the difference in precision between the best case and the worst case is 0.02. So the influence of the codebook size on retrieval precision is small. According to the experimental result, we set the number of visual words in the codebook to 3000 in this paper.

3.3. Influence of the Size of the Training Data. In the third experiment, we investigate the influence of the training data size on the retrieval efficiency. The test dataset, different from the training data, is selected from different structural classifications of the SCOP protein database. Figure 8 shows the precision-recall curves when the number of training data is 400, 800, 1200, 1600, and 2000, respectively. As the test results show, the precision rates are all between 0.96 and 1 regardless of the training data size.
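For reference, the precision and recall values behind such curves can be computed per query as in the following sketch. It assumes a binary relevance flag for each item of a ranked retrieval list and reports one point per relevant item retrieved; how the authors averaged the points over queries is not stated, so no averaging is shown. The function name is hypothetical.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) pairs, one at each rank holding a relevant item."""
    total_relevant = sum(ranked_relevance)
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            recall = hits / total_relevant   # share of relevant items found so far
            precision = hits / rank          # share of retrieved items that are relevant
            points.append((recall, precision))
    return points


# Toy ranked list: relevant items at ranks 1, 3, and 4.
points = precision_recall_points([1, 0, 1, 1])
print(points)
```

A method that ranks all relevant models first yields precision 1 at every recall level, which is the upper end of the range reported above.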

4. Conclusion

In this paper, we proposed a novel feature extraction algorithm for protein retrieval and classification. The proposed method employs a powerful local image feature called SURF together with bag-of-visual-features. The key idea is to describe a view as a word histogram, which is obtained by the vector quantization of the view's local features, and to apply the KLD to calculate the distance between two models.

Figure 8: The precision-recall curves showing the influence of the size of the training data (average precision versus average recall; training data sizes 400, 800, 1200, 1600, and 2000).

A set of experiments was carried out to investigate several critical issues of our method on the CATH and SCOP protein model databases from the PDB. The experimental results indicate that our method achieves satisfying performance for protein retrieval and protein categorization, exceeding that of the comparison methods.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank those who created the protein databases and those who made available the codes for their shape features. This work was supported by the National Natural Science Foundation of China (Grant nos. 61462002 and 61261043), the College of Scientific Research Project of Ningxia Province (Grant NGY2012147), and the Science Research Project of North University of Nationalities (Grant 2013XYZ024).

References

[1] A I Cuff I Sillitoe T Lewis et al ldquoThe CATH classificationrevisited-architectures reviewed and new ways to characterizestructural divergence in superfamiliesrdquo Nucleic Acids Researchvol 37 no 1 pp D310ndashD314 2009

[2] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

[3] D J Lipman and W R Pearson ldquoRapid and sensitive proteinsimilarity searchesrdquo Science vol 227 no 4693 pp 1435ndash14411985

Computational and Mathematical Methods in Medicine 9

[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[5] S. F. Altschul, T. L. Madden, A. A. Schaffer et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.

[6] K. Karplus, C. Barrett, and R. Hughey, "Hidden Markov models for detecting remote protein homologies," Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.

[7] T. Milledge, G. Zheng, T. Mullins, and G. Narasimhan, "SBLAST: structural basic local alignment searching tools using geometric hashing," in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE '07), pp. 1343–1347, Boston, Mass, USA, October 2007.

[8] E. Zotenko, R. Islamaj Dogan, W. J. Wilbur, D. P. O'Leary, and T. M. Przytycka, "Structural footprinting in protein structure comparison: the impact of structural fragments," BMC Structural Biology, vol. 7, article 53, 2007.

[9] Z. Aung and K. Tan, "Rapid 3D protein structure database searching using information retrieval techniques," Bioinformatics, vol. 20, no. 7, pp. 1045–1052, 2004.

[10] O. Camoglu, T. Kahveci, and A. K. Singh, "PSI: indexing protein structures for fast similarity search," Bioinformatics, vol. 19, supplement 1, pp. i81–i83, 2003.

[11] V. Cantoni, A. Ferone, O. Ozbudak, and A. Petrosino, "Protein motifs retrieval by SS terns occurrences," Pattern Recognition Letters, vol. 34, no. 5, pp. 559–563, 2013.

[12] L. Mavridis, V. Venkatraman, D. W. Ritchie et al., "SHREC'10 track: protein models," in Proceedings of the Eurographics Workshop on 3D Object Retrieval, 2010.

[13] A. R. Ortiz, C. E. M. Strauss, and O. Olmea, "MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison," Protein Science, vol. 11, no. 11, pp. 2606–2621, 2002.

[14] Y. Zhang and J. Skolnick, "TM-align: a protein structure alignment algorithm based on the TM-score," Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.

[15] J. Zhu and Z. Weng, "FAST: a novel protein structure alignment algorithm," Proteins: Structure, Function and Genetics, vol. 58, no. 3, pp. 618–627, 2005.

[16] L. Holm and C. Sander, "Protein structure comparison by alignment of distance matrices," Journal of Molecular Biology, vol. 233, no. 1, pp. 123–138, 1993.

[17] I. N. Shindyalov and P. E. Bourne, "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path," Protein Engineering, vol. 11, no. 9, pp. 739–747, 1998.

[18] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl, "Nearest neighbor classification in 3D protein database," in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 34–43, Heidelberg, Germany, 1999.

[19] P. Chi, G. Scott, and C. Shyu, "A fast protein structure retrieval system using image-based distance matrices and multidimensional index," in Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering (BIBE '04), pp. 522–529, May 2004.

[20] J.-S. Yeh, D.-Y. Chen, and M. Ouhyoung, "A web-based protein retrieval system by matching visual similarity," in Proceedings of the Emerging Information Technology Conference, pp. 108–110, Taipei, Taiwan, August 2005.

[21] Y. Liu, H. Zha, and H. Qin, "Shape topics: a compact representation and new algorithms for 3D partial shape retrieval," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 2025–2032, June 2006.

[22] J. Yu, Z. Qin, T. Wan, and X. Zhang, "Feature integration analysis of bag-of-features model for image retrieval," Neurocomputing, vol. 120, pp. 355–364, 2013.

[23] S. Yoo and C. Kim, "Background subtraction using hybrid feature coding in the bag-of-features framework," Pattern Recognition Letters, vol. 34, no. 16, pp. 2086–2093, 2013.

[24] L. Nanni and A. Lumini, "Heterogeneous bag-of-features for object/scene recognition," Applied Soft Computing Journal, vol. 13, no. 4, pp. 2171–2178, 2013.

[25] L. Zhou, Z. Zhou, and D. Hu, "Scene classification using a multi-resolution bag-of-features model," Pattern Recognition, vol. 46, no. 1, pp. 424–433, 2013.

[26] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, "Music classification via the bag-of-features approach," Pattern Recognition Letters, vol. 32, no. 14, pp. 1768–1777, 2011.

[27] K. Zagoris, I. Pratikakis, A. Antonacopoulos, B. Gatos, and N. Papamarkos, "Distinction between handwritten and machine-printed text based on the bag of visual words model," Pattern Recognition, vol. 47, no. 3, pp. 1051–1062, 2014.

[28] Z. Wu, J. Cao, H. Tao, and Y. Zhuang, "A novel noise filter based on interesting pattern mining for bag-of-features images," Expert Systems with Applications, vol. 40, no. 18, pp. 7555–7561, 2013.

[29] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, Salient Local Visual Features for Shape-Based 3D Model Retrieval, 2008.

[30] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[31] R. Toldo, U. Castellani, and A. Fusiello, "A bag of words approach for 3D object categorization," in Proceedings of the 4th International Conference on Computer Vision/Computer Graphics Collaboration Techniques, pp. 116–127, 2009.

[32] L. Shen, J. Lin, S. Wu, and S. Yu, "HEp-2 image classification using intensity order pooling based features and bag of words," Pattern Recognition, vol. 47, no. 7, pp. 2419–2427, 2014.

[33] J. J. Wang, H. Bensmail, and X. Gao, "Joint learning and weighting of visual vocabulary for bag-of-feature based tissue classification," Pattern Recognition, vol. 46, no. 12, pp. 3249–3255, 2013.

[34] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[35] Y. Gui, A. Su, and J. Du, "Point-pattern matching method using SURF and Shape Context," Optik, vol. 124, no. 14, pp. 1869–1873, 2013.

[36] Y. Pang, W. Li, Y. Yuan, and J. Pan, "Fully affine invariant SURF for image matching," Neurocomputing, vol. 85, pp. 6–10, 2012.

[37] P. F. Alcantarilla, L. M. Bergasa, and A. J. Davison, "Gauge-SURF descriptors," Image and Vision Computing, vol. 31, no. 1, pp. 103–116, 2013.

[38] U. Jayaraman, S. Prakash, and P. Gupta, "An efficient color and texture based iris image retrieval technique," Expert Systems with Applications, vol. 39, no. 5, pp. 4915–4926, 2012.

[39] H. Mehrotra, P. K. Sa, and B. Majhi, "Fast segmentation and adaptive SURF descriptor for iris recognition," Mathematical and Computer Modelling, vol. 58, no. 1-2, pp. 132–146, 2013.


[40] J. Feulner, S. K. Zhou, E. Angelopoulou et al., "Comparing axial CT slices in quantized N-dimensional SURF descriptor space to estimate the visible body region," Computerized Medical Imaging and Graphics, vol. 35, no. 3, pp. 227–236, 2011.

[41] "OpenSURF," https://code.google.com/p/opensurf1/.


Spherical Trace Transform and 3DBlast by classifying protein structures according to the CATH superfamily classification. The experimental results showed that contact maps and 3DBlast are conceived specifically to compare the structures of proteins.

The methods based on 3D structure coordinates try to describe protein shape by identifying or comparing structural alignments. MAMMOTH [13] modeled a portion of the target structure and compared the protein structure with an arbitrary low-resolution protein model. TM-align [14] identified the best structural alignment between protein pairs using dynamic programming. FAST [15] compared the intramolecular residue-residue relationships of two structures by using a directionality-based scoring scheme. In order to reduce the coordinate-independent space of protein structures, Holm and Sander [16] proposed an optimal pairwise structural alignment algorithm using Monte Carlo optimization. Shindyalov and Bourne [17] studied heuristics, combinatorial extension, and similarity evaluation for a structural alignment path algorithm.

In recent years, methods based on image distance matrices have appeared. Ankerst et al. [18] introduced a 3D shape histogram algorithm to compare protein models or molecules. Chi et al. [19] compared protein structures by using signatures extracted from image-based distance matrices and a multidimensional index. Yeh et al. [20] compared protein models from multiple 3D projection views. The image-based retrieval methods exhibited a higher degree of precision than the three kinds of traditional methods.

In this paper, we propose an image-based protein retrieval and classification method that uses the SURF algorithm to extract features and k-means to cluster the features, thus generating a codebook. We use histograms determined by BOVF (bag-of-visual-features) vectors to represent the characteristics of the identified models.

We construct an image-based method to avoid an exhaustive search over the molecular sequences, structure coordinates, and chain structure alignments. Our major contribution is an efficient protein model retrieval and classification method based on bag-of-visual-features. The performance is encouraging: the retrieval precision in our experiments is 96% on average.

This paper is organized as follows. Section 2 discusses the related work on SURF and bag-of-features and then details the proposed method. Experimental results are presented and analyzed in Section 3. In the final section, we conclude this paper.

2. Materials and Methods

2.1. Bag-of-Features. The bag-of-words method was first used in document retrieval and was later applied to 3D shape retrieval and categorization because of its many advantages, such as simplicity, flexibility, and efficiency. The bag-of-features was first proposed by Liu et al. [21] for both global comparison and partial matching. It relies on the extraction of spin image signatures, which are later grouped in clusters. Yu et al. [22] built an effective image retrieval system based on the bag-of-features model. They integrated the SIFT and LBP descriptors and the HOG and LBP descriptors, respectively, and proposed patch-based and image-based integration models. The experimental results showed that the image-based SIFT-LBP integration, clustered by a weighted k-means algorithm, achieves the best performance. A simple yet powerful approach to background subtraction using bag-of-features was presented in [23]. The authors supposed that encoding the local color and texture information can effectively attenuate the texture variations in background scenes; domain-range features were then encoded in a soft-assignment coding procedure governed by appropriate kernel variances. Nanni and Lumini [24] applied bag-of-features and a heterogeneous set of texture descriptors to object recognition. Their method is based on a simple exhaustive extraction of subwindows and classification by random subspaces of support vector machines (SVMs), and it can reduce dimensionality by principal component analysis. Moreover, Zhou et al. [25] proposed a method for scene classification using a multiresolution bag-of-features model. The bag-of-features approach has also been applied to music classification [26], the distinction between handwritten and machine-printed text [27], noise filtering [28], and so forth.

The bag-of-features has also been applied to local-feature-based 3D shape retrieval and classification. Ohbuchi et al. [29] extracted local features from each range image of different views using the Scale Invariant Feature Transform (SIFT) algorithm [30] for retrieving rigid models and articulated models. In [31], a novel framework combines spectral clustering with region growing based on fast marching for 3D object categorization.

BOF has also been used in medical image analysis and computer-aided diagnosis (CAD). Shen et al. [32] proposed a human epithelial type 2 (HEp-2) classification framework using intensity order pooling based gradient features and bag-of-words. The pooled gradient feature, extracted according to the intensity orders of local grid points, is rotationally invariant and outperformed the SIFT feature significantly. Wang et al. [33] investigated two issues of the bag-of-feature strategy for tissue classification and developed a novel algorithm, named Joint-ViVo, to learn the vocabulary and the visual word weights jointly. The test results showed that the algorithm is better than the state-of-the-art methods at classifying breast tissue density in mammograms and lung tissue in high-resolution computed tomography (HRCT) images and at identifying brain tissue type in magnetic resonance imaging (MRI).

2.2. Speeded-Up Robust Features (SURF). SURF was proposed by Bay et al. [34]. It is based on sums of 2D Haar wavelet responses and a Hessian-matrix-based interest point detector, and it makes efficient use of integral images. It is a robust local feature detector and descriptor that can be used in computer vision tasks such as object recognition or 3D reconstruction. Though SURF is partly inspired by the SIFT descriptor, it is several times faster than SIFT and is reported by its authors to be more robust.
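The role of integral images mentioned above can be illustrated with a short sketch. It is not taken from the SURF implementation used in this paper; the function names `integral_image` and `box_sum` are hypothetical, and the point is only that, once the summed-area table is built, the sum over any rectangular box costs four lookups regardless of the box size, which is what makes box-filter responses cheap.

```python
def integral_image(img):
    """Return the summed-area table of a 2D list, padded with a zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1 (four lookups)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


# Toy 3 x 3 "image"; real views in this paper are 100 x 100 pixels.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
print(box_sum(table, 1, 1, 3, 3))  # lower-right 2 x 2 block: 5 + 6 + 8 + 9 = 28
```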

Gui et al. [35] proposed a novel point-pattern matching method based on SURF and Shape Context. They applied SURF bidirectional matching to match the feature points in two images preliminarily and then calculated Shape Context descriptors of the feature points. Experimental results show that the method can eliminate incorrectly matched point pairs and improve the accuracy of point-pattern matching. A fully affine invariant SURF algorithm was proposed by Pang et al. [36]; it has the affine invariance of ASIFT and the efficiency of SURF. Alcantarilla et al. [37] proposed a descriptor named Gauge-SURF, which is evaluated relative to the gradient direction at every pixel. Because of the use of integral images, the descriptors are fast and robust.

Figure 1: An illustration of our method (PDB code "1hdd"): regular octahedron view, multiview shape (views 1 to 14), local feature extraction, clustering, visual words (codebook), histogram, and bag-of-features.

Recently, SURF has been used in iris retrieval and recognition. A hierarchical approach was proposed to retrieve an iris image from a large iris database [38]. The approach combines iris color and texture, and the iris texture features are obtained by the SURF algorithm. Mehrotra et al. [39] proposed a robust segmentation and an adaptive SURF descriptor for iris recognition. In their method, the adaptive strip transformed from the annular region between the iris and pupil boundaries is enhanced using a gamma correction approach. Then, features are extracted from the adaptive strip using SURF. Feulner et al. [40] presented a method for automatically estimating the body region of a CT volume image. The method is based on 1D registration of histograms of visual words, which serve as a description of a CT slice. The SURF descriptor was extended to N dimensions and named N-SURF. Because they are simpler and more efficient, 2D upright SURF descriptors were used for estimating the body region.

2.3. The Proposed Method. The key idea of our method is to extract the features of proteins and measure the visual similarity between proteins. Our algorithm proceeds in four sequential steps, as shown in Figure 1.

(1) Multiview Rendering. Render multiview images of the protein from different perspectives. The viewing angles are determined by the vertices and planes of a given octahedron surrounding the protein, as shown in Figure 2.

(2) Local Feature Extraction. Extract the local visual features of the multiview images by using the SURF algorithm; that is, for each view we calculate the SURF descriptors.

(3) Visual Words Generation and Word Histogram Construction. Generate visual words from the feature vectors using a visual dictionary (i.e., the codebook). The visual dictionary is obtained by k-means clustering, so each local feature is represented in a discrete form. The frequencies of the visual words are counted and stored in a histogram, which becomes the feature vector of the corresponding protein model.

(4) Distance Computation. The dissimilarity between a pair of feature vectors (the histograms) is computed by the Kullback-Leibler divergence (KLD).

Figure 2: The multiviews are captured from the vertices and the planes of a given regular octahedron (vertices A to F; planes a, b, c, d and a′, b′, c′, d′; origin O with axes X, Y, and Z).

2.3.1. Multiview Rendering. The multiview shapes are captured from the six vertices and the eight planes of a given regular octahedron. Figure 2 shows the six vertices and the eight planes of the octahedron. Along the x+, x−, y+, y−, z+, and z− axes, we capture the protein's right view, left view, top view, bottom view, front view, and rear view. Along the normal direction of each plane, we get the protein's eight oblique views. The size of each captured image is set to 100 × 100 pixels.
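The 14 viewing directions can be enumerated with a few lines of code. The following is a minimal sketch, assuming an axis-aligned regular octahedron centered on the protein (consistent with the axis views described above); the function name is hypothetical, and the camera distance and the rendering itself are not covered.

```python
import itertools
import math


def octahedron_view_directions():
    """Return 14 unit vectors: 6 vertex directions and 8 face normals."""
    # Six vertex directions: the positive and negative coordinate axes.
    vertices = []
    for axis in range(3):
        for sign in (1.0, -1.0):
            v = [0.0, 0.0, 0.0]
            v[axis] = sign
            vertices.append(tuple(v))
    # Eight face normals: every sign combination of (1, 1, 1), normalized.
    s = 1.0 / math.sqrt(3.0)
    normals = [
        tuple(s * sign for sign in signs)
        for signs in itertools.product((1, -1), repeat=3)
    ]
    return vertices + normals


views = octahedron_view_directions()
print(len(views))  # 14
```

Each direction would then be used as the camera position (scaled by a suitable distance) looking toward the center of the model.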

As mentioned above, 14 different views are available for each protein. Figure 3 illustrates all 14 views of the protein structure with PDB code "1hdd" after adjusting the viewing perspective.

To reduce manual interference and standardize the rendering process, we wrote a program that captures views of models from CATH and SCOP in the PDB. The program automatically loads protein models, rotates them, captures images, and saves the images following a fixed naming rule. The protein models are also selected automatically by the program in advance.

2.3.2. Local Feature Extraction. After the range images are rendered, the SURF algorithm is applied to each range image to detect interest points and then to extract SURF descriptors at these points, as presented in [30]. SURF first finds the positions of salient features. The saliency detection is based on a multiscale and multiorientation Fast-Hessian detector, and a distribution-based descriptor encodes the gray-level changes around each interest point, so that each SURF feature captures this information. The SURF descriptors are calculated using the OpenSURF C++ source code by Evans [41].

Figure 4 shows an example image and its rotated, affine-transformed, and scaled versions, together with the interest points detected in each. The numbers of interest points found by the SURF algorithm are 113, 112, 102, and 93, as shown in the second row. The interest points appear at similar locations in these four images in spite of the geometrical transformations. This robustness against geometric transformations contributes to the protein model retrieval performance.

Figure 5 shows examples of SURF interest point matching between the images in Figure 4. The numbers of matched interest points are 59, 52, 54, 35, 45, and 40. In every pair of images, matching feature points are found. Because the interest points extracted by SURF differ (in number, size, and position) from image to image, the numbers of matched points differ as well.

2.3.3. Visual Words Generation and Word Histogram Construction. It is time consuming to compare the local SURF features of models directly, especially when the number of views is large. Therefore, it is necessary to quantize the SURF descriptors extracted from the multiview images into visual words. Firstly, a visual codebook is generated by off-line k-means clustering of the features of every view. Then, the codebook is searched linearly to find the visual word closest to each feature. The visual words are thus the centers of the clusters (called barycenters), and the number of clusters determines the codebook size.
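The codebook generation and the linear search can be sketched as follows. This is an illustration under stated assumptions, not the C++ code used in the experiments: it uses plain Lloyd k-means with random initialization, toy two-dimensional points in place of SURF descriptors, and hypothetical function names (`build_codebook`, `nearest_word`).

```python
import random


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_word(codebook, descriptor):
    """Linear search for the index of the closest visual word."""
    return min(range(len(codebook)),
               key=lambda j: squared_distance(codebook[j], descriptor))


def build_codebook(descriptors, k, iterations=20, seed=0):
    """Plain Lloyd k-means; returns k cluster centers (the visual words)."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(centers, d)].append(d)
        for j, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy 2D "descriptors" forming two well-separated groups
# (an assumption for illustration; real SURF descriptors are much longer vectors).
toy = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
codebook = build_codebook(toy, k=2)
words = [nearest_word(codebook, d) for d in toy]
print(words)  # the first three and the last three descriptors share a word each
```

With a codebook of several thousand words, the linear search is the dominant cost of quantization, which is one reason the process is run off-line.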

After generating the codebook, we construct a word histogram over the codebook, which is also an off-line process. The word histogram is constructed by counting the frequencies of visual words. Each view is represented as a word histogram, which is the feature extracted by bag-of-visual-features. The protein model's histogram is then produced by combining the histograms of all view images of the protein model.

Figure 3: Fourteen views of protein "1hdd" (PDB code): (1) DO direction, (2) CO direction, (3) BO direction, (4) AO direction, (5) FO direction, (6) EO direction, (7) plane a, (8) plane b, (9) plane c, (10) plane d, (11) plane a′, (12) plane b′, (13) plane c′, and (14) plane d′.

Figure 4: Interest points of the SURF algorithm are robust. Top row: original image (I1), rotated image (I2), affined image (I3), and scaled image (I4); bottom row: interest points of I1 to I4.
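The histogram construction can be sketched as below. The text does not state how the per-view histograms are combined; summing them bin by bin is an assumption made here for illustration, and the function names are hypothetical.

```python
def word_histogram(word_indices, codebook_size):
    """Count how often each visual word occurs in one view."""
    hist = [0] * codebook_size
    for w in word_indices:
        hist[w] += 1
    return hist


def model_histogram(views, codebook_size):
    """Combine the per-view histograms by summing them (an assumption)."""
    total = [0] * codebook_size
    for word_indices in views:
        for j, count in enumerate(word_histogram(word_indices, codebook_size)):
            total[j] += count
    return total


# Three hypothetical views, each already quantized into visual-word indices.
example_views = [[0, 2, 2], [1, 2], [0, 0, 3]]
print(model_histogram(example_views, codebook_size=4))  # [3, 1, 3, 1]
```

Concatenating the 14 per-view histograms instead of summing them would keep the view identity at the cost of a 14 times longer feature vector.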

2.3.4. Distance Computation and Range Models Matching. The last step of our method is the distance computation (also called model matching) between two models. The distance between a pair of feature vectors (the histograms) is computed by using the Kullback-Leibler divergence (KLD). The KLD in its standard form is not a distance metric because it is not symmetric. We use the following form, which takes the same value when x and y are exchanged:

$$D(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} (y_i - x_i) \ln \frac{y_i}{x_i}, \tag{1}$$

where $\mathbf{x} = (x_i)$ and $\mathbf{y} = (y_i)$ are the feature vectors and $n$ is the dimension of the vectors.
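Equation (1) can be evaluated directly on two histograms. The sketch below makes two assumptions that the text does not specify: the histograms are normalized to sum to one, and a small constant `eps` is added to every bin so that empty bins do not cause a division by zero or a logarithm of zero. The function name is hypothetical.

```python
import math


def kld_distance(x, y, eps=1e-10):
    """Evaluate sum_i (y_i - x_i) * ln(y_i / x_i) for two histograms."""
    sx, sy = sum(x), sum(y)
    d = 0.0
    for xi, yi in zip(x, y):
        p = xi / sx + eps  # normalized and smoothed bin of x (assumption)
        q = yi / sy + eps  # normalized and smoothed bin of y (assumption)
        d += (q - p) * math.log(q / p)
    return d


a = [3, 1, 3, 1]
b = [1, 3, 1, 3]
print(kld_distance(a, b))
print(kld_distance(a, a))  # 0.0
```

Every term in the sum is nonnegative, because (y_i − x_i) and ln(y_i / x_i) always have the same sign, so the value is zero only when the two normalized histograms coincide.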

Figure 5: Interest points matching for the image pairs I1 and I2, I1 and I3, I1 and I4, I2 and I3, I2 and I4, and I3 and I4.

3. Results and Discussion

In order to evaluate the efficacy and generalization capacity of the proposed method, we tested it on several different retrieval and categorization tasks. The first experiment compares the classification capability of our method with that of the compared methods. The second experiment tests the impact of the number of clusters on the method's retrieval capability. The last experiment tests the influence of the size of the training data on the method.

We implemented the experiments in Matlab R2010a, while the SURF and k-means code is written in C++. All algorithms were run under 32-bit Windows 7 on a personal computer with a 2.66 GHz Core 2 Quad CPU, 3.00 GB of DDR2 memory, and a 512 MB ATI Radeon HD4600 graphics card.

To evaluate the method's efficiency, we measured the feature extraction time on a database of 2000 protein models, which contains 28000 view images. The SURF feature extraction (an off-line process) takes about 21948 seconds. For the same database, clustering the features by k-means and generating the histograms by bag-of-visual-features, both off-line processes, take 4249 seconds and 0.66 seconds, respectively.

3.1. Classification. In the first experiment, we evaluate the classification performance of our approach on the protein model database of SHREC 2010 [12], which includes 1000 protein structures chosen from 100 CATH 3.3 superfamilies. In the dataset, each superfamily consists of at least 10 structures and each structure contains at least 50 amino acids. We use two different measures (nearest neighbor and ROC plot) to compare the performance of our method with that of the 3DBlast, 3DZernike, Genocrypt, Contact Maps, Group Integration, and Spherical Trace Transform methods.

Table 1: Nearest neighbour results.

Method                      Correct predictions
3DBlast                     68
3DZernike                   8
Genocrypt                   56
Contact maps                80
Group integration           52
Spherical trace transform   0
Proposed method             89

A nearest neighbor was counted as a correct prediction when the first protein of the ranked list was found to be a member of the same superfamily as the query.
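The nearest neighbor criterion can be written down in a few lines. This is a minimal sketch, assuming a precomputed query-by-database distance matrix; the superfamily labels and the function name are hypothetical.

```python
def nearest_neighbor_rate(dist, query_labels, db_labels):
    """Fraction of queries whose closest database model has the query's label."""
    correct = 0
    for q, row in enumerate(dist):
        first = min(range(len(row)), key=row.__getitem__)  # top of the ranked list
        if db_labels[first] == query_labels[q]:
            correct += 1
    return correct / len(dist)


dist = [
    [0.2, 0.9, 0.7],  # query 0 is closest to database model 0
    [0.8, 0.6, 0.1],  # query 1 is closest to database model 2
]
query_labels = ["A", "A"]
db_labels = ["A", "B", "B"]
print(nearest_neighbor_rate(dist, query_labels, db_labels))  # 0.5
```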

The ROC (receiver operating characteristic) plot is a graphical plot of the true positive rate against the false positive rate. The area under the curve (AUC) is a single numerical performance measure of each ROC plot. The perfect value of AUC is 1.0 [12].
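For a single ranked retrieval list, the AUC equals the fraction of (relevant, irrelevant) pairs in which the relevant model is ranked ahead of the irrelevant one. The sketch below uses this pairwise view; it ignores ties in the ranking, and the function name is hypothetical rather than part of the evaluation code of [12].

```python
def auc_from_ranking(relevant_flags):
    """AUC of a ranked list; relevant_flags[i] is True if the i-th result is relevant."""
    positives = sum(relevant_flags)
    negatives = len(relevant_flags) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("AUC needs at least one relevant and one irrelevant item")
    # Count (relevant, irrelevant) pairs in which the relevant item is ranked first.
    ordered_pairs = 0
    negatives_seen = 0
    for flag in relevant_flags:
        if flag:
            ordered_pairs += negatives - negatives_seen
        else:
            negatives_seen += 1
    return ordered_pairs / (positives * negatives)


print(auc_from_ranking([True, True, False, False]))  # 1.0
print(auc_from_ranking([True, False, True, False]))  # 0.75
```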

Table 1 summarizes the nearest neighbor results for all the methods. The values for the comparison algorithms are taken from [12]. According to Table 1, the proposed method yields more correct predictions than any of the comparison algorithms.

Figure 6 shows, as bar charts, the query results of each algorithm on 50 protein structures. As Figure 6 shows, the proposed method successfully identifies each query protein, including those that the comparison algorithms failed to identify. Compared with the comparison algorithms, the proposed method identifies some superfamilies more easily, giving an encouraging result that the comparison algorithms do not reach. It is worth noting that the classification correctness of the proposed method is almost the same as that achieved by human experts.

Figure 6: Bar charts for each method (3DBLAST, 3DZernike, Genocrypt, Contact maps, Group integration, Spherical trace transform, and the proposed method) over the 50 queries. (a) The total AUC. (b) The AUCs calculated for the top 10% of the database. The results of the comparison algorithms are taken from [12].

Figure 7: The precision-recall curves showing the influence of the codebook size (1K to 11K visual words). Horizontal axis: average recall; vertical axis: average precision.

3.2. Influence of the Size of the Codebook. In the second experiment, the influence of the vocabulary size (codebook size) on retrieval performance is studied. The test dataset includes 800 protein structures chosen from the SCOP protein database.

The number of visual words in the codebook (the codebook size) is a very important parameter in our algorithm because it not only determines the storage requirement but can also affect the retrieval performance. Figure 7 demonstrates that the precision-recall curves of our method rise steadily, but only slightly, with the codebook size. We observe that, as the codebook size grows from 1000 to 11000, the precision-recall values remain comparatively stable, and the precision is between 0.96 and 1 for all codebook sizes. Meanwhile, the difference in precision between the best case and the worst case is 0.02. So the influence of the codebook size on retrieval precision is small. According to the experimental results, we set the number of visual words in the codebook to 3000 in this paper.
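The precision-recall curves used in this and the following experiment are built from ranked retrieval lists. A minimal sketch of how the points of one curve can be computed for a single query is given below; averaging over queries, as implied by the axis labels "average recall" and "average precision", is not shown, and the function name is hypothetical.

```python
def precision_recall_points(relevant_flags):
    """Return (recall, precision) after each relevant item in a ranked list."""
    total_relevant = sum(relevant_flags)
    points = []
    hits = 0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


# Ranked list with three relevant models, the last one found at rank 4.
points = precision_recall_points([True, True, False, True])
print(points[-1])  # (1.0, 0.75)
```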

3.3. Influence of the Size of the Training Data. In the third experiment, we investigate the influence of the training data size on the retrieval performance. The test dataset, which is different from the training data, is selected from different structural classifications of the SCOP protein database. Figure 8 shows the precision-recall curves when the number of training models is 400, 800, 1200, 1600, and 2000, respectively. As the test results show, the precision is between 0.96 and 1 regardless of the training data size.


[27] K Zagoris I Pratikakis A Antonacopoulos B Gatos and NPapamarkos ldquoDistinction between handwritten and machine-printed text based on the bag of visual words modelrdquo PatternRecognition vol 47 no 3 pp 1051ndash1062 2014

[28] Z Wu J Cao H Tao and Y Zhuang ldquoA novel noise filterbased on interesting patternmining for bag-of-features imagesrdquoExpert Systems with Applications vol 40 no 18 pp 7555ndash75612013

[29] R Ohbuchi K Osada T Furuya and T Banno Salient LocalVisual Features for Shape-Based 3d Model Retrieval 2008

[30] D G Lowe ldquoDistinctive image features from scale-invariantkeypointsrdquo International Journal of Computer Vision vol 60 no2 pp 91ndash110 2004

[31] R Toldo U Castellani and A Fusiello ldquoA bag of wordsapproach for 3d object categorizationrdquo in Proceedings of the4th International Conference on Computer VisionComputerGraphics Collaboration Techniques pp 116ndash127 2009

[32] L Shen J Lin S Wu and S Yu ldquoHEp-2 image classificationusing intensity order pooling based features and bag of wordsrdquoPattern Recognition vol 47 no 7 pp 2419ndash2427 2014

[33] J J Wang H Bensmail and X Gao ldquoJoint learning andweighting of visual vocabulary for bag-of-feature based tissueclassificationrdquo Pattern Recognition vol 46 no 12 pp 3249ndash3255 2013

[34] H Bay A Ess T Tuytelaars and L van Gool ldquoSURF SpeededUp Robust Featuresrdquo Computer Vision and Image Understand-ing vol 110 no 3 pp 346ndash359 2008

[35] Y Gui A Su and J Du ldquoPoint-pattern matching method usingSURF and Shape ContextrdquoOptik vol 124 no 14 pp 1869ndash18732013

[36] Y Pang W Li Y Yuan and J Pan ldquoFully affine invariant SURFfor image matchingrdquo Neurocomputing vol 85 pp 6ndash10 2012

[37] P F Alcantarilla L M Bergasa and A J Davison ldquoGauge-SURF descriptorsrdquo Image and Vision Computing vol 31 no 1pp 103ndash116 2013

[38] U Jayaraman S Prakash and P Gupta ldquoAn efficient color andtexture based iris image retrieval techniquerdquo Expert Systemswith Applications vol 39 no 5 pp 4915ndash4926 2012

[39] H Mehrotra P K Sa and B Majhi ldquoFast segmentation andadaptive SURF descriptor for iris recognitionrdquo Mathematicaland Computer Modelling vol 58 no 1-2 pp 132ndash146 2013

10 Computational and Mathematical Methods in Medicine

[40] J Feulner S K Zhou E Angelopoulou et al ldquoComparing axialCT slices in quantized N-dimensional SURF descriptor spaceto estimate the visible body regionrdquo Computerized MedicalImaging and Graphics vol 35 no 3 pp 227ndash236 2011

[41] ldquoOpenSURFrdquo httpscodegooglecompopensurf1

Page 3: A Method of Protein Model Classification and Retrieval ......ResearchArticle A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features JinlinMa,1,2 ZipingMa,2

Computational and Mathematical Methods in Medicine 3

Figure 1: An illustration of our method (PDB code “1hdd”). Stages shown: protein, regular octahedron view (vertices A to F, views 1 to 14), multiview shape, local feature extraction, clustering, visual words (codebook), and histogram (bag-of-features).

in two images preliminarily and then calculated Shape Context descriptors of the feature points. Experimental results show that the method can eliminate the incorrect matching point pairs and improve the accuracy of point-pattern matching. A fully affine invariant SURF algorithm was proposed by Pang et al. [36], which has the affine invariant advantage of ASIFT and the efficiency of SURF. Alcantarilla et al. [37] proposed a descriptor named Gauge-SURF, which is evaluated relative to the gradient direction at every pixel. Because of the use of integral images, the descriptors are fast and robust.

Recently, SURF has been used in iris retrieval and recognition. A hierarchical approach was proposed to retrieve an iris image from a large iris database [38]. The approach combines iris color and texture, and the iris texture features are obtained by the SURF algorithm. Mehrotra et al. [39] proposed a robust segmentation and an adaptive SURF descriptor for iris recognition. In their method, the adaptive strip transformed from the annular region between the iris and pupil boundaries is enhanced using a gamma correction approach. Then, features are extracted from the adaptive strip using SURF. Feulner et al. [40] presented a method for automatically estimating the body region of a CT volume image. The method is based on 1D registration of histograms of visual words, which serve as a description of a CT slice. The SURF descriptor was extended to N dimensions, named N-SURF. Because 2D upright SURF descriptors are simpler and more efficient, they used them for estimating the body region.

2.3. The Proposed Method. The key idea of our method is to extract the features of proteins and measure the visual similarity between proteins. Our algorithm is implemented sequentially in four steps, as shown in Figure 1.

(1) Multiview Rendering. Render multiview images of the protein from different perspectives. The viewing angles are determined using the vertices and planes of a given octahedron surrounding the protein, as shown in Figure 2.

(2) Local Feature Extraction. Extract the local visual features of the multiview images by using the SURF algorithm; that is, for each view we calculate the SURF descriptors.

(3) Visual Words Generation and Word Histogram Construction. Generate visual words from feature vectors using a visual dictionary (i.e., the codebook). The visual dictionary is obtained by k-means clustering, so that each local feature is represented in a discrete form. The frequencies of visual words are counted and stored in a histogram, which becomes the feature vector of the corresponding protein model.

Figure 2: The multiviews are captured from the vertices and the planes on a given regular octahedron (vertices A to F; planes a, b, c, d and a′, b′, c′, d′; axes X, Y, Z with origin O).

(4) Distance Computation. The dissimilarity between a pair of feature vectors (the histograms) is computed by the Kullback-Leibler divergence (KLD).

2.3.1. Multiview Rendering. The multiview shapes are captured from the six vertices and the eight planes of a given regular octahedron. Figure 2 shows the six vertices and the eight planes of the octahedron. Along the x+, x−, y+, y−, z+, and z− axes, we capture the protein’s right view, left view, top view, bottom view, front view, and rear view. Along the normal direction of each plane, we get the protein’s eight oblique views. The size of each captured image is set to 100 × 100 pixels.
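The 14 viewing directions described above can be enumerated with a short sketch. It assumes, for illustration only, that the octahedron is centered at the origin with its vertices on the coordinate axes, so that the face normals point into the eight octants; the function name `octahedron_view_directions` is hypothetical and is not part of the authors' rendering program.

```python
import itertools
import math


def octahedron_view_directions():
    """Return 14 unit vectors: 6 vertex directions and 8 face normals."""
    # Vertex directions: the positive and negative x, y, and z axes.
    vertices = []
    for axis in range(3):
        for sign in (1.0, -1.0):
            direction = [0.0, 0.0, 0.0]
            direction[axis] = sign
            vertices.append(tuple(direction))

    # Face normals: one per octant, (+-1, +-1, +-1) scaled to unit length.
    scale = 1.0 / math.sqrt(3.0)
    faces = [
        (sx * scale, sy * scale, sz * scale)
        for sx, sy, sz in itertools.product((1.0, -1.0), repeat=3)
    ]
    return vertices + faces


views = octahedron_view_directions()
```

Each direction would then be used as a camera direction looking at the protein, giving the six axis-aligned views and the eight oblique views.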

As mentioned above, 14 different views exist for each protein. Figure 3 illustrates all 14 views of the protein structure with PDB code “1hdd” after adjusting the viewing perspective.

To reduce manual intervention and to standardize the rendering process, we wrote a program for capturing views of models from CATH and SCOP in the PDB. The program can automatically load protein models, rotate models, capture images, and save images following a certain naming rule. The protein models are also selected automatically by the program in advance.

2.3.2. Local Feature Extraction. After the range images are rendered, the SURF algorithm is applied to each of them to detect interest points and then to extract SURF descriptors at these points, as presented in [30]. SURF first finds the positions of salient features. The saliency detection is based on a multiscale and multiorientation Fast-Hessian detector and a distribution-based descriptor for gray-level change, so that each SURF feature can encode this information. The SURF descriptor is calculated using the OpenSURF C++ source code by Evans [41].

Figure 4 shows an example image together with its rotated, affine-transformed, and scaled versions, and the interest points detected in each of them. The numbers of interest points found by the SURF algorithm are 113, 112, 102, and 93, as shown in the second row. The interest points appear at similar locations in these four images in spite of the geometrical transformations. This robustness against geometric transformations contributes to the protein model retrieval performance.

Figure 5 shows examples of SURF interest point matching between the images in Figure 4. The numbers of matched interest points are 59, 52, 54, 35, 45, and 40. Every image is successfully matched at some feature points. Because the interest points extracted by SURF differ in number, size, and position, the numbers of matched feature points also differ.

2.3.3. Visual Words Generation and Word Histogram Construction. It is time consuming to compare models’ local SURF features directly, especially for a large number of views. Therefore, it is necessary to quantize the SURF descriptors extracted from a multiview image into visual words. Firstly, a visual codebook is generated off-line by k-means clustering of the features of every view. Then, the codebook is searched linearly to find the visual word closest to each feature. The visual words are the centers of the clusters (called barycenters), and the number of clusters determines the codebook size.
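As a minimal sketch of the codebook generation described above, the following plain Lloyd's k-means clusters a handful of toy 2D points; real SURF descriptors are typically 64-dimensional, and the authors' own k-means code is written in C++. The function names, the random initialization, and the fixed iteration count are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_center(point, centers):
    """Linear search for the index of the closest center."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers (the visual words)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                dim = len(members[0])
                centers[j] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return centers


# Toy "descriptors": two well-separated groups, so k = 2 recovers their means.
descriptors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
codebook = kmeans(descriptors, k=2)
```

The number of centers `k` plays the role of the codebook size, which the paper sets to 3000 for its experiments.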

Figure 3: Fourteen views of protein “1hdd” (PDB code): (1) DO direction, (2) CO direction, (3) BO direction, (4) AO direction, (5) FO direction, (6) EO direction, (7) plane a, (8) plane b, (9) plane c, (10) plane d, (11) plane a′, (12) plane b′, (13) plane c′, and (14) plane d′.

Figure 4: Interest points of the SURF algorithm are robust. Top row: original image (I1), rotated image (I2), affined image (I3), and scaled image (I4). Bottom row: interest points of I1, I2, I3, and I4.

After generating the codebook, we construct a word histogram over the codebook, which is also an off-line process. The word histogram is constructed by counting the frequencies of visual words. Each view is represented as a word histogram, which is the feature extracted by bag-of-visual-features. The protein model’s histogram is then produced by combining the histograms of all view images of the protein model.
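The quantization and histogram construction can be sketched as follows. The text does not state how the per-view histograms are combined, so the element-wise summation used here is an assumption (concatenation would be another reading); the function names and the toy 2D codebook are likewise illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(descriptor, codebook):
    """Linear search of the codebook for the closest visual word."""
    return min(range(len(codebook)), key=lambda j: squared_distance(descriptor, codebook[j]))


def view_histogram(descriptors, codebook):
    """Count how often each visual word occurs in one view."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[quantize(descriptor, codebook)] += 1
    return histogram


def model_histogram(per_view_descriptors, codebook):
    """Combine view histograms by element-wise summation (an assumption)."""
    total = [0] * len(codebook)
    for descriptors in per_view_descriptors:
        for word, count in enumerate(view_histogram(descriptors, codebook)):
            total[word] += count
    return total


# Toy example: a two-word codebook and two views of one model.
codebook = [(0.0, 0.0), (1.0, 1.0)]
view_1 = [(0.1, 0.0), (0.9, 1.0), (1.1, 0.9)]
view_2 = [(0.0, 0.2)]
histogram = model_histogram([view_1, view_2], codebook)
```

In this toy case the first view contributes one occurrence of word 0 and two of word 1, and the second view contributes one occurrence of word 0.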

2.3.4. Distance Computation and Range Models Matching. The last step of our method is the distance computation (also called model matching) between two models. The distance between a pair of feature vectors (the histograms) is computed by using the Kullback-Leibler divergence (KLD). The KLD is not a distance metric, for it is not symmetric. Consider

$$D(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} (y_i - x_i) \ln \frac{y_i}{x_i}, \tag{1}$$

where $\mathbf{x} = (x_i)$ and $\mathbf{y} = (y_i)$ are the feature vectors and $n$ is the dimension of the vectors.
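Equation (1) can be evaluated with a few lines of code. The sketch below is an illustration, not the authors' Matlab implementation: the small constant `eps` added to every bin (to avoid taking the logarithm of zero for empty bins) and the normalization of each histogram to unit sum are assumptions that the text does not specify.

```python
import math


def normalize(histogram, eps=1e-9):
    """Smooth empty bins with eps, then scale to sum to 1 (both are assumptions)."""
    smoothed = [h + eps for h in histogram]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kld_distance(x, y):
    """Equation (1): sum over i of (y_i - x_i) * ln(y_i / x_i)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    p, q = normalize(x), normalize(y)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


a = [2, 2, 0]
b = [1, 2, 1]
d_ab = kld_distance(a, b)
d_ba = kld_distance(b, a)
```

Note that each term of (1) is non-negative and unchanged when x and y are swapped, so the value computed by this particular form does not depend on the order of the two histograms, even though the standard KLD does.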

Figure 5: Interest points matching for the image pairs I1 and I2, I1 and I3, I1 and I4, I2 and I3, I2 and I4, and I3 and I4.

3. Results and Discussion

In order to evaluate the efficacy and generalization capacity of the proposed method, we tested it on several retrieval and categorization tasks. The first experiment compares the classification performance of our method with that of the compared methods. The second experiment tests the impact of the number of clusters on the method’s retrieval capability. The last experiment tests the influence of the size of the training data on the method.

We implemented the experiments in Matlab R2010a, while the SURF and k-means code is written in C++. All algorithms were run under 32-bit Windows 7 on a personal computer with a Core 2 Quad 2.66 GHz CPU, 3.00 GB of DDR2 memory, and a 512 MB ATI Radeon HD4600 graphics card.

To evaluate the method’s efficiency, we measured the feature extraction time on a database of 2000 protein models, which has images of 28000 views. The computation time of the SURF feature extraction algorithm is about 21948 seconds (an off-line process). Clustering the features by k-means and generating the histograms by bag-of-visual-features take 4249 seconds and 0.66 seconds, respectively, both as off-line processes on the same database of 2000 protein models.

3.1. Classification. In the first experiment, we evaluate the classification performance of our approach on the protein model database of SHREC 2010 [12], which includes 1000 protein structures chosen from 100 CATH 3.3 superfamilies. In the dataset, each superfamily consists of at least 10 structures, and each structure contains at least 50 amino acids. We use two different measures (nearest neighbor and ROC plot) to test the performance of our method in comparison with the 3DBlast, 3DZernike, Genocrypt, Contact Maps, Group Integration, and Spherical Trace Transform methods.

Table 1: Nearest neighbour results.

Method                       Correct predictions
3DBlast                      68
3DZernike                    8
Genocrypt                    56
Contact maps                 80
Group integration            52
Spherical trace transform    0
Proposed method              89

Nearest neighbor was counted as a correct prediction when the first protein of each ranked list was found to be a member of the same superfamily as the query.

The ROC (receiver operating characteristic) plot is a graphical plot that illustrates the true positive rate against the false positive rate. The area under the curve (AUC) is a single numerical performance measure of each ROC plot. The perfect value of AUC is 1.0 [12].
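Both measures can be computed from a ranked retrieval list. The following sketch assumes that each query yields a list of booleans, ordered from the most to the least similar database entry, where `True` marks a member of the query's superfamily, and that there are no tied scores; the function names are hypothetical and the evaluation protocol of [12] may differ in detail.

```python
def nearest_neighbor_correct(ranked_relevance):
    """A query counts as correct when the top-ranked entry is relevant."""
    return bool(ranked_relevance[0])


def roc_auc(ranked_relevance):
    """Area under the ROC curve for one ranked list without ties."""
    positives = sum(1 for r in ranked_relevance if r)
    negatives = len(ranked_relevance) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one relevant and one irrelevant entry")
    # Each irrelevant entry contributes the number of relevant entries ranked above it.
    seen_positives = 0
    area = 0
    for relevant in ranked_relevance:
        if relevant:
            seen_positives += 1
        else:
            area += seen_positives
    return area / (positives * negatives)


perfect = roc_auc([True, True, False, False])
mixed = roc_auc([True, False, True, False])
```

A ranking that places every relevant entry before every irrelevant one gives an AUC of 1.0, matching the perfect value mentioned above.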

Table 1 summarizes the nearest neighbor results for all the methods. The values of the comparison algorithms are from [12]. According to Table 1, the proposed method makes more correct predictions than the comparison algorithms.

Figure 6 shows, as bar graphs, the query results of each algorithm on 50 protein structures. As Figure 6 shows, the proposed method successfully identifies each query protein, including those that the comparison algorithms failed to identify. Compared with the comparison algorithms, the proposed method identifies some superfamilies more easily, giving an encouraging result that the comparison algorithms do not reach. It is worth noting that the classification correctness of the proposed method is almost the same as that achieved by human experts.

Figure 6: Bar chart for each method (3DBLAST, 3DZernike, Genocrypt, Contact maps, Group integration, Spherical trace transform, and Proposed method) over the 50 query structures. The upper charts (a) show the total AUC, whereas the lower charts (b) show the AUCs calculated for the top 10% of the database. The results of the comparison algorithms are from [12].

Figure 7: The precision-recall curves showing the influence of the codebook size (average precision versus average recall for codebook sizes 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, and 11K).

3.2. Influence of the Size of the Codebook. In the second experiment, the influence of the vocabulary size (codebook size) upon retrieval performance is studied. The test dataset includes 800 protein structures chosen from the SCOP protein database.

The number of visual words in the codebook (the codebook size) is a very important parameter in our algorithm, because it not only determines the storage requirement but also affects the retrieval performance. Figure 7 demonstrates that the precision-recall curves of our method increase steadily with the codebook size, although the change is small. We observe that, as the codebook size grows from 1000 to 11000, the precision-recall values are comparatively stable, and the precision rate is between 0.96 and 1 for all codebook sizes. Meanwhile, the difference in precision between the best case and the worst case is 0.02. So the influence of the codebook size on retrieval precision is small. According to the experimental results, we set the number of visual words in the codebook to 3000 in this paper.
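For readers unfamiliar with precision-recall curves, the sketch below shows how the points of such a curve can be obtained for a single query. It assumes a ranked list of booleans in which `True` marks a relevant model; how the paper averages these points over queries is not stated, so only the per-query computation is shown, and the function name is hypothetical.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) at the rank of each relevant entry."""
    total_relevant = sum(1 for r in ranked_relevance if r)
    if total_relevant == 0:
        raise ValueError("the ranked list contains no relevant entries")
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Recall: fraction of relevant entries found so far.
            # Precision: fraction of retrieved entries that are relevant.
            points.append((hits / total_relevant, hits / rank))
    return points


points = precision_recall_points([True, False, True, False])
```

Averaging such points over all queries at fixed recall levels would give curves of the kind plotted against average recall and average precision.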

3.3. Influence of the Size of the Training Data. In the third experiment, we investigate the influence of the training data size on the retrieval performance. The test dataset, which is different from the training data, is selected from different structural classifications of the SCOP protein database. Figure 8 shows the precision-recall curves when the number of training models is 400, 800, 1200, 1600, and 2000, respectively. As the test results show, the precision rates are all between 0.96 and 1, regardless of the training data size.

Figure 8: The precision-recall curves showing the influence of the size of the training data (average precision versus average recall for training sizes 400, 800, 1200, 1600, and 2000).

4. Conclusion

In this paper, we proposed a novel feature extraction algorithm for protein retrieval and classification. The proposed method employs a powerful local image feature called SURF together with bag-of-visual-features. The key idea is to describe a view as a word histogram, which is obtained by the vector quantization of the view’s local features, and to apply KLD to calculate the distance between two models.

A set of experiments was carried out to investigate several critical issues of our method on the CATH and SCOP protein model databases from the PDB. The experimental results indicate that our method achieves satisfactory performance for protein retrieval and protein categorization and outperforms the compared methods in these experiments.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank those who created protein databases and those who made available codes for their shape features. This work was supported by the National Natural Science Foundation of China (Grant nos. 61462002 and 61261043), the College of Scientific Research Project of Ningxia Province (Grant NGY2012147), and the Science Research Project of North University of Nationalities (Grant 2013XYZ024).

References

[1] A. I. Cuff, I. Sillitoe, T. Lewis et al., “The CATH classification revisited-architectures reviewed and new ways to characterize structural divergence in superfamilies,” Nucleic Acids Research, vol. 37, no. 1, pp. D310–D314, 2009.

[2] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[3] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985.

[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[5] S. F. Altschul, T. L. Madden, A. A. Schaffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.

[6] K. Karplus, C. Barrett, and R. Hughey, “Hidden Markov models for detecting remote protein homologies,” Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.

[7] T. Milledge, G. Zheng, T. Mullins, and G. Narasimhan, “SBLAST: structural basic local alignment searching tools using geometric hashing,” in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE ’07), pp. 1343–1347, Boston, Mass, USA, October 2007.

[8] E. Zotenko, R. Islamaj Dogan, W. J. Wilbur, D. P. O’Leary, and T. M. Przytycka, “Structural footprinting in protein structure comparison: the impact of structural fragments,” BMC Structural Biology, vol. 7, article 53, 2007.

[9] Z. Aung and K. Tan, “Rapid 3D protein structure database searching using information retrieval techniques,” Bioinformatics, vol. 20, no. 7, pp. 1045–1052, 2004.

[10] O. Camoglu, T. Kahveci, and A. K. Singh, “PSI: indexing protein structures for fast similarity search,” Bioinformatics, vol. 19, supplement 1, pp. i81–i83, 2003.

[11] V. Cantoni, A. Ferone, O. Ozbudak, and A. Petrosino, “Protein motifs retrieval by SS terns occurrences,” Pattern Recognition Letters, vol. 34, no. 5, pp. 559–563, 2013.

[12] L. Mavridis, V. Venkatraman, D. W. Ritchie et al., “SHREC’10 track: protein models,” in Proceedings of the Eurographics Workshop on 3D Object Retrieval, 2010.

[13] A. R. Ortiz, C. E. M. Strauss, and O. Olmea, “MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison,” Protein Science, vol. 11, no. 11, pp. 2606–2621, 2002.

[14] Y. Zhang and J. Skolnick, “TM-align: a protein structure alignment algorithm based on the TM-score,” Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.

[15] J. Zhu and Z. Weng, “FAST: a novel protein structure alignment algorithm,” Proteins: Structure, Function and Genetics, vol. 58, no. 3, pp. 618–627, 2005.

[16] L. Holm and C. Sander, “Protein structure comparison by alignment of distance matrices,” Journal of Molecular Biology, vol. 233, no. 1, pp. 123–138, 1993.

[17] I. N. Shindyalov and P. E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, vol. 11, no. 9, pp. 739–747, 1998.

[18] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl, “Nearest neighbor classification in 3D protein database,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 34–43, Heidelberg, Germany, 1999.

[19] P. Chi, G. Scott, and C. Shyu, “A fast protein structure retrieval system using image-based distance matrices and multidimensional index,” in Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering (BIBE ’04), pp. 522–529, May 2004.

[20] J.-S. Yeh, D.-Y. Chen, and M. Ouhyoung, “A web-based protein retrieval system by matching visual similarity,” in Proceedings of the Emerging Information Technology Conference, pp. 108–110, Taipei, Taiwan, August 2005.

[21] Y. Liu, H. Zha, and H. Qin, “Shape topics: a compact representation and new algorithms for 3D partial shape retrieval,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’06), pp. 2025–2032, June 2006.

[22] J. Yu, Z. Qin, T. Wan, and X. Zhang, “Feature integration analysis of bag-of-features model for image retrieval,” Neurocomputing, vol. 120, pp. 355–364, 2013.

[23] S. Yoo and C. Kim, “Background subtraction using hybrid feature coding in the bag-of-features framework,” Pattern Recognition Letters, vol. 34, no. 16, pp. 2086–2093, 2013.

[24] L. Nanni and A. Lumini, “Heterogeneous bag-of-features for object/scene recognition,” Applied Soft Computing Journal, vol. 13, no. 4, pp. 2171–2178, 2013.

[25] L. Zhou, Z. Zhou, and D. Hu, “Scene classification using a multi-resolution bag-of-features model,” Pattern Recognition, vol. 46, no. 1, pp. 424–433, 2013.

[26] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “Music classification via the bag-of-features approach,” Pattern Recognition Letters, vol. 32, no. 14, pp. 1768–1777, 2011.

[27] K. Zagoris, I. Pratikakis, A. Antonacopoulos, B. Gatos, and N. Papamarkos, “Distinction between handwritten and machine-printed text based on the bag of visual words model,” Pattern Recognition, vol. 47, no. 3, pp. 1051–1062, 2014.

[28] Z. Wu, J. Cao, H. Tao, and Y. Zhuang, “A novel noise filter based on interesting pattern mining for bag-of-features images,” Expert Systems with Applications, vol. 40, no. 18, pp. 7555–7561, 2013.

[29] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, Salient Local Visual Features for Shape-Based 3D Model Retrieval, 2008.

[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[31] R. Toldo, U. Castellani, and A. Fusiello, “A bag of words approach for 3D object categorization,” in Proceedings of the 4th International Conference on Computer Vision/Computer Graphics Collaboration Techniques, pp. 116–127, 2009.

[32] L. Shen, J. Lin, S. Wu, and S. Yu, “HEp-2 image classification using intensity order pooling based features and bag of words,” Pattern Recognition, vol. 47, no. 7, pp. 2419–2427, 2014.

[33] J. J. Wang, H. Bensmail, and X. Gao, “Joint learning and weighting of visual vocabulary for bag-of-feature based tissue classification,” Pattern Recognition, vol. 46, no. 12, pp. 3249–3255, 2013.

[34] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[35] Y. Gui, A. Su, and J. Du, “Point-pattern matching method using SURF and Shape Context,” Optik, vol. 124, no. 14, pp. 1869–1873, 2013.

[36] Y. Pang, W. Li, Y. Yuan, and J. Pan, “Fully affine invariant SURF for image matching,” Neurocomputing, vol. 85, pp. 6–10, 2012.

[37] P. F. Alcantarilla, L. M. Bergasa, and A. J. Davison, “Gauge-SURF descriptors,” Image and Vision Computing, vol. 31, no. 1, pp. 103–116, 2013.

[38] U. Jayaraman, S. Prakash, and P. Gupta, “An efficient color and texture based iris image retrieval technique,” Expert Systems with Applications, vol. 39, no. 5, pp. 4915–4926, 2012.

[39] H. Mehrotra, P. K. Sa, and B. Majhi, “Fast segmentation and adaptive SURF descriptor for iris recognition,” Mathematical and Computer Modelling, vol. 58, no. 1-2, pp. 132–146, 2013.

[40] J. Feulner, S. K. Zhou, E. Angelopoulou et al., “Comparing axial CT slices in quantized N-dimensional SURF descriptor space to estimate the visible body region,” Computerized Medical Imaging and Graphics, vol. 35, no. 3, pp. 227–236, 2011.

[41] “OpenSURF,” https://code.google.com/p/opensurf1/.

Page 4: A Method of Protein Model Classification and Retrieval ......ResearchArticle A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features JinlinMa,1,2 ZipingMa,2

4 Computational and Mathematical Methods in Medicine

O

Y

XZ

ba

cd

F

A

E

C

D

B

a998400

b998400c

998400

d998400

Figure 2 The multiviews are captured from the vertices and the planes on a given regular octahedron

(4) Distance Computation The dissimilarity among a pair offeature vectors (the histogram) is computed by the Kullback-Leibler divergence (KLD)

231 Multiview Rendering The multiview shapes are cap-tured from the six vertices and the eight planes on a givenregular octahedron Figure 2 shows the six vertices and theeight planes on the octahedron Along the 119909+ 119909minus 119910+ 119910minus119911+ and 119911minus axes we capture the proteinrsquos right view left viewtop view bottom view front view and rear view Along thenormal direction of each plane we get the proteinrsquos eightoblique views The size of the captured image is set as 100 times100 pixels

As we mentioned above 14 different views still existfor a protein Figure 3 illustrates all the 14 views of proteinstructure encoded as ldquo1hddrdquo with PDB code after adjustingthe viewing perspective

To reduce the interference in operating and standardizingthe rendering process we write a program for view capturingfrom CATH and SCOP in the PDB The program canautomatically load protein models rotate models captureimages and save images following a certain naming rule Ofcourse the protein models are also selected automatically bythe program in advance

232 Local Feature Extraction After the range images arerendered the SURF algorithm is applied to each of therange images to detect interest points and then to extractSURF descriptors as presented in [30] The SURF algorithmdetects interest points and then computes features at theseinterest points The SURF firstly finds positions of featuresthat are salientThe saliency detection is based on amultiscaleand multiorientation Fast-Hessian detector and distribution-based descriptor for gray-level change so that each SURF

feature can encode this information The SURF descriptor iscalculated using the OpenSURF C++ source code by Evans[41]

Figure 4 shows examples of an interest point generatedand its images that are rotated affined and scaled Thenumbers of interest points of SURF algorithm are 113 112102 and 93 as shown in the second row The interest pointsappear at similar locations in these four images in spiteof the geometrical transformations This robustness againstgeometric transformations contributes to the protein modelretrieval performance

Figure 5 shows the examples of SURF interest pointsmatch of images in Figure 4 The numbers of interest pointsmatching are 59 52 54 35 45 and 40 Every image issuccessfully matched to a certain feature point Because ofthe different interest feature points (in number size andposition) extracted by SURF the numbers of the featurepoints that match are different

233 Visual Words Generation and Word Histogram Con-struction It is time consuming to compare modelrsquos localSURF feature directly especially for the large number ofviewsTherefore it is necessary to quantize the SURF descrip-tors extracted from a multiview image into visual wordsFirstly a visual codebook is generated by using off-line 119896-means clustering of the features of every view Then thecodebook is searched linearly to find a visual word closestfor the feature As a result the feature vectors of visualwords are selected through the centers of the clusters (calledbarycenter) and the number of the clusters determines thecodebook size

Figure 3: Fourteen views of protein “1hdd” (PDB code): (1) DO direction, (2) CO direction, (3) BO direction, (4) AO direction, (5) FO direction, (6) EO direction, (7) Plane a, (8) Plane b, (9) Plane c, (10) Plane d, (11) Plane a′, (12) Plane b′, (13) Plane c′, (14) Plane d′.

Figure 4: Interest points of the SURF algorithm are robust. First row: original image (I1), rotated image (I2), affined image (I3), and scaled image (I4). Second row: interest points of I1, I2, I3, and I4.

After generating the codebook, we construct a word histogram over the codebook, which is also an off-line process. The word histogram is constructed by counting the frequencies of visual words. Each view is represented as a word histogram, which is the feature extracted by bag-of-visual-features. The histogram of the protein model is then produced by combining the histograms of all view images of the protein model.
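The histogram construction can be illustrated with a short Python sketch. The text does not state how the per-view histograms are combined, so summing them bin by bin is an assumption made here for illustration (concatenation would be another reading). The names `view_histogram` and `model_histogram` are hypothetical.

```python
def view_histogram(word_indices, codebook_size):
    """Count how often each visual word occurs in one view."""
    hist = [0] * codebook_size
    for w in word_indices:
        hist[w] += 1
    return hist


def model_histogram(views, codebook_size):
    """Combine per-view histograms by bin-wise summation (an assumption)."""
    total = [0] * codebook_size
    for word_indices in views:
        per_view = view_histogram(word_indices, codebook_size)
        for k, count in enumerate(per_view):
            total[k] += count
    return total


# Three toy views quantized against a 4-word codebook.
# Each inner list holds the visual word index of every feature in a view.
views = [[0, 0, 2], [1, 2, 2], [3]]
print(view_histogram(views[0], 4))  # [2, 0, 1, 0]
print(model_histogram(views, 4))    # [2, 1, 3, 1]
```

With summation, the model histogram keeps the dimension of the codebook regardless of the number of views, so two models can be compared bin by bin.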

2.3.4. Distance Computation and Range Models Matching. The last step of our method is the distance computation (also called model matching) between two models. The distance between a pair of feature vectors (the histograms) is computed by using the Kullback-Leibler divergence (KLD). The KLD itself is not a distance metric, for it is not symmetric. Consider the following form, which takes the same value when x and y are swapped:

$$D(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} (y_i - x_i) \ln \frac{y_i}{x_i}, \tag{1}$$

where x = (x_i) and y = (y_i) are the feature vectors and n is the dimension of the vectors.
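Equation (1) can be evaluated directly, as in the following minimal Python sketch. The text does not say how empty histogram bins are handled, so the small constant `eps` added to every bin is an assumption introduced here to avoid taking the logarithm of zero. The function name `symmetric_kld` is hypothetical.

```python
import math


def symmetric_kld(x, y, eps=1e-9):
    """Evaluate D(x, y) = sum_i (y_i - x_i) * ln(y_i / x_i).

    eps is added to every bin so that empty bins do not produce
    log(0) or a division by zero (this smoothing is an assumption).
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    total = 0.0
    for xi, yi in zip(x, y):
        xi, yi = xi + eps, yi + eps
        total += (yi - xi) * math.log(yi / xi)
    return total


a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
print(symmetric_kld(a, a))  # identical histograms give 0.0
print(math.isclose(symmetric_kld(a, b), symmetric_kld(b, a)))  # True
```

Each term (y_i - x_i) ln(y_i / x_i) is nonnegative, because the two factors always have the same sign, so the value is zero only when the two histograms coincide.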

Figure 5: Interest points matching for the image pairs I1 and I2, I1 and I3, I1 and I4, I2 and I3, I2 and I4, and I3 and I4.

3. Results and Discussion

In order to evaluate the efficacy and generalization capacity of the proposed method, we tested it on several retrieval and categorization tasks. The first experiment compares the classification capability of our method with that of the comparison methods. The second experiment tests the impact of the number of clusters on the retrieval capability of the method. The last experiment tests the influence of the size of the training data on the method.

We implemented the experiments in Matlab R2010a, while the SURF and the k-means code is written in C++. All algorithms are run under Windows 7 (32 bit) on a personal computer with a Core 2 Quad 2.66 GHz CPU, 3.00 GB DDR2 memory, and a 512 MB ATI Radeon HD4600 graphics card.

To evaluate the efficiency of the method, we measured the feature extraction time. The computation time was measured on a database of 2000 protein models, which has images of 28000 views (14 views per model). The computation time for the SURF feature extraction algorithm is about 21948 seconds (an off-line process). Clustering the features by k-means and generating the histograms by bag-of-visual-features take 4249 seconds and 0.66 seconds, respectively, as off-line processes for the same database of 2000 protein models.

3.1. Classification. In the first experiment, we evaluate the classification performance of our approach by using the protein models database of SHREC 2010 [12], which includes 1000 protein structures chosen from 100 CATH 3.3 superfamilies. In the dataset, each superfamily consists of at least 10 structures and each structure contains at least 50 amino acids. We use two different measures (nearest neighbor and ROC plot) to test the performance of our method by comparing it with the 3DBlast, 3DZernike, GENOCRIPT, Contact Maps, Group Integration, and Spherical Trace Transform methods.

Table 1: Nearest neighbour results.

Method                      Correct predictions
3DBlast                     68
3DZernike                   8
Genocrypt                   56
Contact maps                80
Group integration           52
Spherical trace transform   0
Proposed method             89

Nearest neighbor was counted as a correct prediction when the first protein of each ranked list was found to be a member of the same superfamily as the query.
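The nearest neighbor criterion can be written down in a few lines. The Python sketch below is an illustration only: it assumes that a full pairwise distance matrix is available and that the query itself is excluded from its own ranked list, neither of which is stated in the text. The labels and distances are toy values, and the function name `nearest_neighbor_hits` is hypothetical.

```python
def nearest_neighbor_hits(distances, labels):
    """Count queries whose closest other model shares their label.

    distances[q][j] is the distance from query q to model j;
    labels[j] is the superfamily of model j.
    """
    hits = 0
    for q, row in enumerate(distances):
        # Exclude the query itself from its own ranked list (assumption).
        candidates = [j for j in range(len(row)) if j != q]
        best = min(candidates, key=lambda j: row[j])
        if labels[best] == labels[q]:
            hits += 1
    return hits


labels = ["A", "A", "B", "B"]
distances = [
    [0.00, 0.10, 0.90, 0.05],
    [0.10, 0.00, 0.70, 0.90],
    [0.90, 0.70, 0.00, 0.20],
    [0.05, 0.90, 0.20, 0.00],
]
hits = nearest_neighbor_hits(distances, labels)
print(hits, "of", len(labels), "queries correct")  # 2 of 4 queries correct
```

In this toy matrix, models 0 and 3 are each other's nearest neighbors but belong to different superfamilies, so both count as incorrect predictions.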

The ROC (receiver operating characteristic) plot is a graphical plot which illustrates the true positive rate against the false positive rate. The area under the curve (AUC) is a single numerical performance measure of each ROC plot. The perfect value of AUC is 1.0 [12].
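For a single ranked list, the AUC can be computed by walking down the list and accumulating the area under the step curve. The Python sketch below illustrates this under the assumption that the ranking contains no tied distances. It is not the evaluation code of [12], and the function name `roc_auc` is hypothetical.

```python
def roc_auc(ranked_relevance):
    """Area under the ROC curve of one ranked list.

    ranked_relevance[i] is True when the i-th retrieved model belongs
    to the superfamily of the query. The curve steps up on a true
    positive and steps right on a false positive.
    """
    positives = sum(1 for r in ranked_relevance if r)
    negatives = len(ranked_relevance) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    true_positives = 0
    area = 0
    for relevant in ranked_relevance:
        if relevant:
            true_positives += 1
        else:
            # Each false positive adds a column whose height is the
            # number of true positives retrieved so far.
            area += true_positives
    # Normalize so that both axes run from 0 to 1.
    return area / (positives * negatives)


print(roc_auc([True, True, False, False]))  # perfect ranking: 1.0
print(roc_auc([True, False, True, False]))  # 0.75
print(roc_auc([False, False, True, True]))  # worst ranking: 0.0
```

A ranking that places every member of the query superfamily ahead of all other models therefore reaches the perfect value of 1.0.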

Table 1 summarizes the retrieval rate for all the methods. The values for the comparison algorithms are taken from [12]. According to Table 1, the proposed method yields more correct predictions than the comparison algorithms.

Figure 6 shows, in bar graphs, the query results of each algorithm on 50 protein structures. As Figure 6 shows, the proposed method can successfully identify each query protein, including those that the comparison algorithms failed to identify. Compared with the comparison algorithms, the proposed method may identify some superfamilies more easily. This is an encouraging result that the comparison algorithms do not reach. It is worth noting that the classification correctness of the proposed method is almost the same as that achieved by human experts.

Figure 6: Bar chart for each method (3DBLAST, 3DZernike, Genocrypt, Contact maps, Group integration, Spherical trace transform, and the proposed method). The upper charts (a) show the total AUC, whereas the lower charts (b) show the AUCs calculated for the top 10% of the database. The results of the comparison algorithms are from [12].

Figure 7: The precision-recall curves showing the influence of the codebook size (average precision versus average recall for codebook sizes 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, and 11K).

3.2. Influence of the Size of the Codebook. In the second experiment, the influence of the vocabulary size (codebook size) upon retrieval performance is studied. The test dataset includes 800 protein structures chosen from the SCOP protein database.

The number of visual words in the codebook (the codebook size) is an important parameter in our algorithm, because it determines the storage requirement and can also affect the retrieval performance. Figure 7 shows the precision-recall curves of our method for different codebook sizes. We observe that, as the codebook size grows from 1000 to 11000, the precision-recall values are comparatively stable, and the precision rate is between 0.96 and 1 for all codebook sizes. Meanwhile, the difference in precision between the best case and the worst case is 0.02. So the influence of the codebook size on retrieval precision is small. According to the experimental result, we set the number of visual words in the codebook to 3000 in this paper.
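The points of a precision-recall curve for one query can be obtained from its ranked list, as in the following Python sketch. It is an illustration with a toy ranked list. The averaging over queries that produces the curves of Figure 7 is not shown, and the function name `precision_recall_points` is hypothetical.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) after each relevant model is retrieved."""
    total_relevant = sum(1 for r in ranked_relevance if r)
    if total_relevant == 0:
        raise ValueError("the ranked list contains no relevant model")
    points = []
    found = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            found += 1
            # Recall: fraction of relevant models found so far.
            # Precision: fraction of retrieved models that are relevant.
            points.append((found / total_relevant, found / rank))
    return points


points = precision_recall_points([True, False, True, True])
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

For this list the three points are (0.33, 1.00), (0.67, 0.67), and (1.00, 0.75). A precision that stays close to 1 over the whole recall range means that almost all relevant models are ranked ahead of the irrelevant ones.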

3.3. Influence of the Size of the Training Data. In the third experiment, we investigate the influence of the training data size on the retrieval efficiency. The test dataset, which is different from the training data, is selected from different structural classifications of the SCOP protein database. Figure 8 shows the precision-recall curves when the size of the training data is 400, 800, 1200, 1600, and 2000, respectively. As the test results show, the precision rates are all between 0.96 and 1 regardless of the training data size.

4. Conclusion

In this paper, we proposed a novel feature extraction algorithm for protein retrieval and classification. The proposed method employs a powerful local image feature called SURF together with bag-of-visual-features. The key idea is to describe a view as a word histogram, which is obtained by the vector quantization of the local features of the view, and to apply KLD to calculate the distance between two models.

Figure 8: The precision-recall curves showing the influence of the size of the training data (average precision versus average recall for training data sizes 400, 800, 1200, 1600, and 2000).

A set of experiments was carried out to investigate several critical issues of our method on the CATH and SCOP protein model databases from PDB. The experimental results indicate that our method achieves satisfying performance for protein retrieval and protein categorization, which is not reached by the other comparison methods.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank those who created protein databases and those who made available codes for their shape features. This work was supported by the National Natural Science Foundation of China (Grant nos. 61462002 and 61261043), the College of Scientific Research Project of Ningxia Province (Grant NGY2012147), and the Science Research Project of North University of Nationalities (Grant 2013XYZ024).

References

[1] A. I. Cuff, I. Sillitoe, T. Lewis, et al., “The CATH classification revisited-architectures reviewed and new ways to characterize structural divergence in superfamilies,” Nucleic Acids Research, vol. 37, no. 1, pp. D310–D314, 2009.

[2] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[3] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985.

[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[5] S. F. Altschul, T. L. Madden, A. A. Schaffer, et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.

[6] K. Karplus, C. Barrett, and R. Hughey, “Hidden Markov models for detecting remote protein homologies,” Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.

[7] T. Milledge, G. Zheng, T. Mullins, and G. Narasimhan, “SBLAST: structural basic local alignment searching tools using geometric hashing,” in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE ’07), pp. 1343–1347, Boston, Mass, USA, October 2007.

[8] E. Zotenko, R. Islamaj Dogan, W. J. Wilbur, D. P. O’Leary, and T. M. Przytycka, “Structural footprinting in protein structure comparison: the impact of structural fragments,” BMC Structural Biology, vol. 7, article 53, 2007.

[9] Z. Aung and K. Tan, “Rapid 3D protein structure database searching using information retrieval techniques,” Bioinformatics, vol. 20, no. 7, pp. 1045–1052, 2004.

[10] O. Camoglu, T. Kahveci, and A. K. Singh, “PSI: indexing protein structures for fast similarity search,” Bioinformatics, vol. 19, supplement 1, pp. i81–i83, 2003.

[11] V. Cantoni, A. Ferone, O. Ozbudak, and A. Petrosino, “Protein motifs retrieval by SS terns occurrences,” Pattern Recognition Letters, vol. 34, no. 5, pp. 559–563, 2013.

[12] L. Mavridis, V. Venkatraman, D. W. Ritchie, et al., “SHREC’10 track: protein models,” in Proceedings of the Eurographics Workshop on 3D Object Retrieval, 2010.

[13] A. R. Ortiz, C. E. M. Strauss, and O. Olmea, “MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison,” Protein Science, vol. 11, no. 11, pp. 2606–2621, 2002.

[14] Y. Zhang and J. Skolnick, “TM-align: a protein structure alignment algorithm based on the TM-score,” Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.

[15] J. Zhu and Z. Weng, “FAST: a novel protein structure alignment algorithm,” Proteins: Structure, Function and Genetics, vol. 58, no. 3, pp. 618–627, 2005.

[16] L. Holm and C. Sander, “Protein structure comparison by alignment of distance matrices,” Journal of Molecular Biology, vol. 233, no. 1, pp. 123–138, 1993.

[17] I. N. Shindyalov and P. E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, vol. 11, no. 9, pp. 739–747, 1998.

[18] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl, “Nearest neighbor classification in 3D protein database,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 34–43, Heidelberg, Germany, 1999.

[19] P. Chi, G. Scott, and C. Shyu, “A fast protein structure retrieval system using image-based distance matrices and multidimensional index,” in Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering (BIBE ’04), pp. 522–529, May 2004.

[20] J.-S. Yeh, D.-Y. Chen, and M. Ouhyoung, “A web-based protein retrieval system by matching visual similarity,” in Proceedings of the Emerging Information Technology Conference, pp. 108–110, Taipei, Taiwan, August 2005.

[21] Y. Liu, H. Zha, and H. Qin, “Shape topics: a compact representation and new algorithms for 3D partial shape retrieval,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’06), pp. 2025–2032, June 2006.

[22] J. Yu, Z. Qin, T. Wan, and X. Zhang, “Feature integration analysis of bag-of-features model for image retrieval,” Neurocomputing, vol. 120, pp. 355–364, 2013.

[23] S. Yoo and C. Kim, “Background subtraction using hybrid feature coding in the bag-of-features framework,” Pattern Recognition Letters, vol. 34, no. 16, pp. 2086–2093, 2013.

[24] L. Nanni and A. Lumini, “Heterogeneous bag-of-features for object/scene recognition,” Applied Soft Computing Journal, vol. 13, no. 4, pp. 2171–2178, 2013.

[25] L. Zhou, Z. Zhou, and D. Hu, “Scene classification using a multi-resolution bag-of-features model,” Pattern Recognition, vol. 46, no. 1, pp. 424–433, 2013.

[26] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “Music classification via the bag-of-features approach,” Pattern Recognition Letters, vol. 32, no. 14, pp. 1768–1777, 2011.

[27] K. Zagoris, I. Pratikakis, A. Antonacopoulos, B. Gatos, and N. Papamarkos, “Distinction between handwritten and machine-printed text based on the bag of visual words model,” Pattern Recognition, vol. 47, no. 3, pp. 1051–1062, 2014.

[28] Z. Wu, J. Cao, H. Tao, and Y. Zhuang, “A novel noise filter based on interesting pattern mining for bag-of-features images,” Expert Systems with Applications, vol. 40, no. 18, pp. 7555–7561, 2013.

[29] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, Salient Local Visual Features for Shape-Based 3d Model Retrieval, 2008.

[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[31] R. Toldo, U. Castellani, and A. Fusiello, “A bag of words approach for 3d object categorization,” in Proceedings of the 4th International Conference on Computer Vision/Computer Graphics Collaboration Techniques, pp. 116–127, 2009.

[32] L. Shen, J. Lin, S. Wu, and S. Yu, “HEp-2 image classification using intensity order pooling based features and bag of words,” Pattern Recognition, vol. 47, no. 7, pp. 2419–2427, 2014.

[33] J. J. Wang, H. Bensmail, and X. Gao, “Joint learning and weighting of visual vocabulary for bag-of-feature based tissue classification,” Pattern Recognition, vol. 46, no. 12, pp. 3249–3255, 2013.

[34] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[35] Y. Gui, A. Su, and J. Du, “Point-pattern matching method using SURF and Shape Context,” Optik, vol. 124, no. 14, pp. 1869–1873, 2013.

[36] Y. Pang, W. Li, Y. Yuan, and J. Pan, “Fully affine invariant SURF for image matching,” Neurocomputing, vol. 85, pp. 6–10, 2012.

[37] P. F. Alcantarilla, L. M. Bergasa, and A. J. Davison, “Gauge-SURF descriptors,” Image and Vision Computing, vol. 31, no. 1, pp. 103–116, 2013.

[38] U. Jayaraman, S. Prakash, and P. Gupta, “An efficient color and texture based iris image retrieval technique,” Expert Systems with Applications, vol. 39, no. 5, pp. 4915–4926, 2012.

[39] H. Mehrotra, P. K. Sa, and B. Majhi, “Fast segmentation and adaptive SURF descriptor for iris recognition,” Mathematical and Computer Modelling, vol. 58, no. 1-2, pp. 132–146, 2013.

[40] J. Feulner, S. K. Zhou, E. Angelopoulou, et al., “Comparing axial CT slices in quantized N-dimensional SURF descriptor space to estimate the visible body region,” Computerized Medical Imaging and Graphics, vol. 35, no. 3, pp. 227–236, 2011.

[41] “OpenSURF,” https://code.google.com/p/opensurf1.


[41] ldquoOpenSURFrdquo httpscodegooglecompopensurf1


6 Computational and Mathematical Methods in Medicine

Figure 5: Interest points matching (panels: I1 and I2, I1 and I3, I1 and I4, I2 and I3, I2 and I4, I3 and I4).

3. Results and Discussion

In order to evaluate the efficacy and generalization capacity of the proposed method, we tested it on several retrieval and categorization tasks. The first experiment compares the classification capability of our method with that of the comparison methods. The second experiment tests the impact of the number of clusters on the method's retrieval capability. The last experiment tests the influence of the size of the training data on the method.

We implemented the experiments in Matlab R2010a, while the SURF and k-means code is written in C++. All algorithms were run under Windows 7 (32 bit) on a personal computer with a Core 2 Quad 2.66 GHz CPU, 3.00 GB DDR2 memory, and a 512 MB ATI Radeon HD4600 graphics card.

To evaluate the method's efficiency, we measured the feature extraction time on a database of 2000 protein models, which has images of 28000 views. The SURF feature extraction algorithm takes about 21948 seconds (an off-line process). For the same database, clustering the features by k-means and generating the histograms by bag-of-visual-features take 4249 seconds and 0.66 seconds, respectively, also as off-line processes.

3.1. Classification. In the first experiment, we evaluate the classification performance of our approach using the protein models database of SHREC 2010 [12], which includes 1000 protein structures chosen from 100 CATH 3.3 superfamilies. In the dataset, each superfamily consists of at least 10 structures and each structure contains at least 50 amino acids. We use two measures (nearest neighbor and ROC plot) to test the performance of our method, comparing it with the 3DBlast, 3DZernike, Genocrypt, Contact Maps, Group Integration, and Spherical Trace Transform methods.

Table 1: Nearest neighbour results.

Method                       Correct predictions
3DBlast                      68
3DZernike                    8
Genocrypt                    56
Contact maps                 80
Group integration            52
Spherical trace transform    0
Proposed method              89

The nearest neighbor was counted as a correct prediction when the first protein of each ranked list was found to be a member of the same superfamily as the query.
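The nearest neighbor criterion can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Matlab/C++ code: the function name `nearest_neighbor_correct`, the toy distance matrix, and the superfamily labels are assumptions made for the example.

```python
def nearest_neighbor_correct(distances, labels):
    """Count queries whose top-ranked match shares the query's label.

    distances[q][j] is the dissimilarity between model q and model j
    (hypothetical input; any distance between feature vectors would do).
    """
    n = len(labels)
    correct = 0
    for q in range(n):
        # The first entry of the ranked list is the closest model other than q.
        best = min((j for j in range(n) if j != q), key=lambda j: distances[q][j])
        if labels[best] == labels[q]:
            correct += 1
    return correct


# Toy data: two superfamilies "A" and "B" with two members each.
labels = ["A", "A", "B", "B"]
distances = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(nearest_neighbor_correct(distances, labels))
```

In this toy case every query retrieves a member of its own superfamily first, so all four queries count as correct predictions.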

The ROC (receiver operating characteristic) plot is a graphical plot which illustrates the true positive rate against the false positive rate. The area under the curve (AUC) is a single numerical performance measure of each ROC plot. The perfect value of AUC is 1.0 [12].
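As a rough illustration of how an AUC value can be obtained from a ranked retrieval list, the sketch below walks down the list and accumulates the area under the ROC step curve. The function name `roc_auc` and the boolean relevance lists are assumptions made for this example; the evaluation code used in [12] may differ in its details.

```python
def roc_auc(ranked_relevance):
    """Area under the ROC curve of one ranked list (best match first).

    ranked_relevance[i] is True when the i-th retrieved model belongs to
    the same class as the query.
    """
    positives = sum(1 for r in ranked_relevance if r)
    negatives = len(ranked_relevance) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one relevant and one irrelevant item")
    true_positives = 0
    area = 0.0
    for relevant in ranked_relevance:
        if relevant:
            # Step up: the true positive rate increases.
            true_positives += 1
        else:
            # Step right: add a column whose height is the current true positive rate.
            area += true_positives / positives
    return area / negatives


print(roc_auc([True, True, False, False]))   # perfect ranking
print(roc_auc([True, False, True, False]))   # one relevant item ranked too low
```

A ranking that places every relevant model ahead of every irrelevant one reaches the perfect value of 1.0, while the second list scores 0.75.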

Table 1 summarizes the correct predictions of all the methods. The values for the comparison algorithms are taken from [12]. According to Table 1, the proposed method makes more correct predictions than any of the comparison algorithms.

Figure 6 shows, as bar graphs, the query results of each algorithm on 50 protein structures. As Figure 6 shows, the proposed method successfully identifies each query protein, including those that the comparison algorithms failed to identify. Compared with the comparison algorithms, the proposed method identifies some superfamilies more easily, and it achieves results that the comparison algorithms do not reach. It is worth noting that the classification correctness of the proposed method is almost the same as that achieved by human experts.

Figure 6: Bar chart for each method (panels: 3DBLAST, 3DZernike, Genocrypt, Contact maps, Group integration, Spherical trace transform, and Proposed method). The upper charts (a) show the total AUC, whereas the lower charts (b) show the AUCs calculated for the top 10% of the database. The results of the comparison algorithms are from [12].

Figure 7: Precision-recall curves showing the influence of the codebook size (x-axis: average recall; y-axis: average precision; legend: 1K, 2K, 3K, 4K, 5K, 6K, 7K, 8K, 9K, 11K).

3.2. Influence of the Size of the Codebook. In the second experiment, the influence of the vocabulary size (codebook size) upon retrieval performance is studied. The test dataset includes 800 protein structures chosen from the SCOP protein database.

The number of visual words in the codebook (the codebook size) is a very important parameter in our algorithm, because it not only determines the space requirement but also affects the retrieval performance. Figure 7 shows that the precision-recall curves of our method rise steadily with the codebook size. Nevertheless, as the codebook size grows from 1000 to 11000, the precision-recall values remain comparatively stable, and the precision rate is between 0.96 and 1 for all codebook sizes. The difference in precision between the best case and the worst case is 0.02, so the influence of the codebook size on retrieval precision is small. According to the experimental results, we set the number of visual words in the codebook to 3000 in this paper.
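To make the role of the codebook size concrete, the following sketch quantizes the local descriptors of one view into a visual-word histogram whose length equals the codebook size. It is a minimal illustration under stated assumptions: the function name `quantize`, the 2-D toy descriptors, and the three-word codebook are hypothetical, whereas in the paper the descriptors come from SURF and the codebook from k-means clustering.

```python
def quantize(descriptors, codebook):
    """Return the visual-word histogram of one view.

    Each descriptor votes for its nearest codeword, so the histogram has
    exactly len(codebook) bins.
    """
    histogram = [0] * len(codebook)
    for d in descriptors:
        # Nearest codeword under the squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        histogram[nearest] += 1
    return histogram


# Toy codebook with three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
print(quantize(descriptors, codebook))
```

Because each view is stored as one histogram, a larger codebook directly increases the storage needed per view, which is the space requirement mentioned in the text.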

3.3. Influence of the Size of the Training Data. In the third experiment, we investigate the influence of the training data size on the retrieval efficiency. The test dataset, which is different from the training data, is selected from different structural classifications of the SCOP protein database. Figure 8 shows the precision-recall curves when the number of training data is 400, 800, 1200, 1600, and 2000, respectively. As the test results show, the precision rates are all between 0.96 and 1 regardless of the training data size.
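For readers unfamiliar with precision-recall curves, the sketch below shows one common way to compute the points of such a curve from a single ranked list. The function name `precision_recall_points` and the toy relevance list are assumptions for illustration only; how the curves in Figures 7 and 8 were averaged over queries is not specified here.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) pairs, one per relevant item retrieved.

    ranked_relevance[i] is True when the i-th retrieved model is relevant
    to the query (best match first).
    """
    total_relevant = sum(1 for r in ranked_relevance if r)
    if total_relevant == 0:
        raise ValueError("the ranked list contains no relevant item")
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # recall = fraction of relevant items found so far,
            # precision = fraction of retrieved items that are relevant.
            points.append((hits / total_relevant, hits / rank))
    return points


for recall, precision in precision_recall_points([True, False, True, False]):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Averaging the precision values at matching recall levels over all queries gives curves of the kind plotted against average recall.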

Figure 8: Precision-recall curves showing the influence of the size of the training data (x-axis: average recall; y-axis: average precision; legend: 400, 800, 1200, 1600, 2000).

4. Conclusion

In this paper, we proposed a novel feature extraction algorithm for protein retrieval and classification. The proposed method employs a powerful local image feature called SURF together with bag-of-visual-features. The key idea is to describe a view as a word histogram, which is obtained by the vector quantization of the view's local features, and to apply KLD to calculate the distance between two models.
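The distance step can be illustrated with a small sketch of the Kullback-Leibler divergence (KLD) between two word histograms. The smoothing constant `eps` and the symmetrized variant are assumptions added for this example (plain KLD is asymmetric and undefined for empty bins); the exact form used by the authors is not restated in this section.

```python
import math


def kld(p_counts, q_counts, eps=1e-9):
    """Kullback-Leibler divergence D(P || Q) between two word histograms.

    Counts are normalized to probabilities; eps (an assumed smoothing
    constant) keeps empty bins from producing log(0) or division by zero.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    divergence = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        divergence += p * math.log(p / q)
    return divergence


def symmetric_kld(p_counts, q_counts, eps=1e-9):
    """Symmetrized variant, D(P || Q) + D(Q || P), usable as a distance-like score."""
    return kld(p_counts, q_counts, eps) + kld(q_counts, p_counts, eps)


print(kld([3, 1], [3, 1]))            # identical histograms
print(symmetric_kld([3, 1], [1, 3]))  # dissimilar histograms
```

Identical histograms give a divergence of zero, and the score grows as the two word distributions move apart.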

A set of experiments was carried out to investigate several critical issues of our method on the CATH and SCOP protein model databases from PDB. The experimental results indicate that our method achieves satisfying performance for protein retrieval and protein categorization, which the other comparison methods do not reach.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank those who created the protein databases and those who made available the codes for their shape features. This work was supported by the National Natural Science Foundation of China (Grant nos. 61462002 and 61261043), the College of Scientific Research Project of Ningxia Province (Grant NGY2012147), and the Science Research Project of North University of Nationalities (Grant 2013XYZ024).

References

[1] A. I. Cuff, I. Sillitoe, T. Lewis et al., “The CATH classification revisited-architectures reviewed and new ways to characterize structural divergence in superfamilies,” Nucleic Acids Research, vol. 37, no. 1, pp. D310–D314, 2009.

[2] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[3] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985.


[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[5] S. F. Altschul, T. L. Madden, A. A. Schaffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.

[6] K. Karplus, C. Barrett, and R. Hughey, “Hidden Markov models for detecting remote protein homologies,” Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.

[7] T. Milledge, G. Zheng, T. Mullins, and G. Narasimhan, “SBLAST: structural basic local alignment searching tools using geometric hashing,” in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE ’07), pp. 1343–1347, Boston, Mass, USA, October 2007.

[8] E. Zotenko, R. Islamaj Dogan, W. J. Wilbur, D. P. O’Leary, and T. M. Przytycka, “Structural footprinting in protein structure comparison: the impact of structural fragments,” BMC Structural Biology, vol. 7, article 53, 2007.

[9] Z. Aung and K. Tan, “Rapid 3D protein structure database searching using information retrieval techniques,” Bioinformatics, vol. 20, no. 7, pp. 1045–1052, 2004.

[10] O. Camoglu, T. Kahveci, and A. K. Singh, “PSI: indexing protein structures for fast similarity search,” Bioinformatics, vol. 19, supplement 1, pp. i81–i83, 2003.

[11] V. Cantoni, A. Ferone, O. Ozbudak, and A. Petrosino, “Protein motifs retrieval by SS terns occurrences,” Pattern Recognition Letters, vol. 34, no. 5, pp. 559–563, 2013.

[12] L. Mavridis, V. Venkatraman, D. W. Ritchie et al., “SHREC’10 track: protein models,” in Proceedings of the Eurographics Workshop on 3D Object Retrieval, 2010.

[13] A. R. Ortiz, C. E. M. Strauss, and O. Olmea, “MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison,” Protein Science, vol. 11, no. 11, pp. 2606–2621, 2002.

[14] Y. Zhang and J. Skolnick, “TM-align: a protein structure alignment algorithm based on the TM-score,” Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.

[15] J. Zhu and Z. Weng, “FAST: a novel protein structure alignment algorithm,” Proteins: Structure, Function and Genetics, vol. 58, no. 3, pp. 618–627, 2005.

[16] L. Holm and C. Sander, “Protein structure comparison by alignment of distance matrices,” Journal of Molecular Biology, vol. 233, no. 1, pp. 123–138, 1993.

[17] I. N. Shindyalov and P. E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, vol. 11, no. 9, pp. 739–747, 1998.

[18] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl, “Nearest neighbor classification in 3D protein database,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 34–43, Heidelberg, Germany, 1999.

[19] P. Chi, G. Scott, and C. Shyu, “A fast protein structure retrieval system using image-based distance matrices and multidimensional index,” in Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering (BIBE ’04), pp. 522–529, May 2004.

[20] J.-S. Yeh, D.-Y. Chen, and M. Ouhyoung, “A web-based protein retrieval system by matching visual similarity,” in Proceedings of the Emerging Information Technology Conference, pp. 108–110, Taipei, Taiwan, August 2005.

[21] Y. Liu, H. Zha, and H. Qin, “Shape topics: a compact representation and new algorithms for 3D partial shape retrieval,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’06), pp. 2025–2032, June 2006.

[22] J. Yu, Z. Qin, T. Wan, and X. Zhang, “Feature integration analysis of bag-of-features model for image retrieval,” Neurocomputing, vol. 120, pp. 355–364, 2013.

[23] S. Yoo and C. Kim, “Background subtraction using hybrid feature coding in the bag-of-features framework,” Pattern Recognition Letters, vol. 34, no. 16, pp. 2086–2093, 2013.

[24] L. Nanni and A. Lumini, “Heterogeneous bag-of-features for object/scene recognition,” Applied Soft Computing Journal, vol. 13, no. 4, pp. 2171–2178, 2013.

[25] L. Zhou, Z. Zhou, and D. Hu, “Scene classification using a multi-resolution bag-of-features model,” Pattern Recognition, vol. 46, no. 1, pp. 424–433, 2013.

[26] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “Music classification via the bag-of-features approach,” Pattern Recognition Letters, vol. 32, no. 14, pp. 1768–1777, 2011.

[27] K. Zagoris, I. Pratikakis, A. Antonacopoulos, B. Gatos, and N. Papamarkos, “Distinction between handwritten and machine-printed text based on the bag of visual words model,” Pattern Recognition, vol. 47, no. 3, pp. 1051–1062, 2014.

[28] Z. Wu, J. Cao, H. Tao, and Y. Zhuang, “A novel noise filter based on interesting pattern mining for bag-of-features images,” Expert Systems with Applications, vol. 40, no. 18, pp. 7555–7561, 2013.

[29] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, Salient Local Visual Features for Shape-Based 3d Model Retrieval, 2008.

[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[31] R. Toldo, U. Castellani, and A. Fusiello, “A bag of words approach for 3d object categorization,” in Proceedings of the 4th International Conference on Computer Vision/Computer Graphics Collaboration Techniques, pp. 116–127, 2009.

[32] L. Shen, J. Lin, S. Wu, and S. Yu, “HEp-2 image classification using intensity order pooling based features and bag of words,” Pattern Recognition, vol. 47, no. 7, pp. 2419–2427, 2014.

[33] J. J. Wang, H. Bensmail, and X. Gao, “Joint learning and weighting of visual vocabulary for bag-of-feature based tissue classification,” Pattern Recognition, vol. 46, no. 12, pp. 3249–3255, 2013.

[34] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[35] Y. Gui, A. Su, and J. Du, “Point-pattern matching method using SURF and Shape Context,” Optik, vol. 124, no. 14, pp. 1869–1873, 2013.

[36] Y. Pang, W. Li, Y. Yuan, and J. Pan, “Fully affine invariant SURF for image matching,” Neurocomputing, vol. 85, pp. 6–10, 2012.

[37] P. F. Alcantarilla, L. M. Bergasa, and A. J. Davison, “Gauge-SURF descriptors,” Image and Vision Computing, vol. 31, no. 1, pp. 103–116, 2013.

[38] U. Jayaraman, S. Prakash, and P. Gupta, “An efficient color and texture based iris image retrieval technique,” Expert Systems with Applications, vol. 39, no. 5, pp. 4915–4926, 2012.

[39] H. Mehrotra, P. K. Sa, and B. Majhi, “Fast segmentation and adaptive SURF descriptor for iris recognition,” Mathematical and Computer Modelling, vol. 58, no. 1-2, pp. 132–146, 2013.


[40] J. Feulner, S. K. Zhou, E. Angelopoulou et al., “Comparing axial CT slices in quantized N-dimensional SURF descriptor space to estimate the visible body region,” Computerized Medical Imaging and Graphics, vol. 35, no. 3, pp. 227–236, 2011.

[41] “OpenSURF,” https://code.google.com/p/opensurf1/.

Page 7: A Method of Protein Model Classification and Retrieval ......ResearchArticle A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features JinlinMa,1,2 ZipingMa,2

Computational and Mathematical Methods in Medicine 7

00

04

08

00

04

08

3DBLAST2 6 10

2 6 10 2 6 10 2 6 10

2 6 10

2 6 10 2 6 1014 18 22 26 30 34 38 42 46 50

14 18 22 26 30 34 38 42 46 50

14 18 22 26 30 34 38 42 46 50

14 18 22 26 30 34 38 42 46 50 14 18 22 26 30 34 38 42 46 50

14 18 22 26 30 34 38 42 46 50 14 18 22 26 30 34 38 42 46 50

3DZernike Genocrypt

Contact maps Group integration Spherical trace transform

Proposed method

00

04

08

00

04

08

00

04

08

00

04

08

00

04

08

(a)

14 17 20 23 26 29 32 35 38 41 44 47 50

1 5 9 14 17 20 23 26 29 32 35 38 41 44 47 50 14 17 20 23 26 29 32 35 38 41 44 47 50

14 17 20 23 26 29 32 35 38 41 44 47 50

14 17 20 23 26 29 32 35 38 41 44 47 50

14 17 20 23 26 29 32 35 38 41 44 47 50 14 17 20 23 26 29 32 35 38 41 44 47 50

000

002

004

006

008

010 3DBLAST

000

002

004

006

008

010 3DZernike

000

002

004

006

008

010 Genocrypt

000

002

004

006

008

010Contact maps

000

002

004

006

008

010 Group integration

000

002

004

006

008

010 Spherical trace transform

000

002

004

006

008

010Proposed method

1 5 9 1 5 9 1 5 9

1 5 91 5 9

1 5 9

(b)

Figure 6 Bar chart for each method The upper charts show the total AUC whereas the lower charts show the AUCs calculated for the top10 of the database The comparison algorithm is come from literature [12]

8 Computational and Mathematical Methods in Medicine

0 01 02 03 04 05 06 07 08 09 1096

0965097

0975098

0985099

09951

1005

Average recall

Aver

age p

reci

sion

1K2K3K4K5K

6K7K8K9K11K

Figure 7 The precision-recall curves of influence of the codebooksize

32 Influence of the Size of the Codebook In the second ex-periment the influence of the vocabulary size (codebooksize) upon retrieval performance is studied The test datasetincludes 800 protein structures chosen from SCOP proteindatabase

The number of visual words in the codebook (the code-book size) is a very important parameter in our algorithmBecause the codebook size not only determines the spa-tial requirement but also significantly affects the retrievalperformance Figure 7 demonstrates that the precision-recallcurves of our methods increase steadily with the codebooksize We observe that with the number of codebook sizeenlarging from 1000 to 11000 the precision-recall values arecomparatively stable and the precision rate is between 096and 1 for all codebook sizes Meanwhile the difference ofprecision is 002 between the best case and the worst caseSo the influence of the number of the codebook on retrievalprecision is small According to the experimental result weset the number of visual words in the codebook as 3000 inthis paper

33 Influence of the Size of the Training Data In the thirdexperiment we investigate the influence of the training datasize on the retrieval efficiency The test dataset differentfrom the training data is selected from different structuralclassifications of SCOP protein database Figure 8 shows theprecision-recall curves when the number of training datais 400 800 1200 1600 and 2000 respectively As the testresults show the precision rates are all between 096 and 1regardless of the number of training data size

4 Conclusion

In this paper we proposed a novel feature extracting algo-rithm for protein retrieval and classification The proposedmethod employs a powerful local image feature called SURFand bag-of-visual-feature The key idea is to describe aview as a word histogram which is obtained by the vector

0 01 02 03 04 05 06 07 08 09 1096

0965097

0975098

0985099

09951

1005

Average recall

Aver

age p

reci

sion

4008001200

16002000

Figure 8 The precision-recall curves of influence of the size oftraining data

quantization of the viewrsquos local features and to apply KLD tocalculate the distance between two models

A set of experiments were carried out to investigateseveral critical issues of our method in the CATH and SCOPprotein models database from PDBThe experimental resultsindicate that our method has satisfying performances forprotein retrieval and protein categorization that cannot bereached by other comparison methods

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank those who created proteindatabases and those who made available codes for theirshape features This work was supported by the NationalNatural Science Foundation of China (Grant nos 61462002and 61261043) the College of Scientific Research Projectof Ningxia Province (Grant NGY2012147) and the ScienceResearch Project of North University of Nationalities (Grant2013XYZ024)

References

[1] A I Cuff I Sillitoe T Lewis et al ldquoThe CATH classificationrevisited-architectures reviewed and new ways to characterizestructural divergence in superfamiliesrdquo Nucleic Acids Researchvol 37 no 1 pp D310ndashD314 2009

[2] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

[3] D J Lipman and W R Pearson ldquoRapid and sensitive proteinsimilarity searchesrdquo Science vol 227 no 4693 pp 1435ndash14411985

Computational and Mathematical Methods in Medicine 9

[4] S F AltschulW GishWMiller EWMyers and D J LipmanldquoBasic local alignment search toolrdquo Journal ofMolecular Biologyvol 215 no 3 pp 403ndash410 1990

[5] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[6] K Karplus C Barrett and R Hughey ldquoHiddenMarkovmodelsfor detecting remote protein homologiesrdquo Bioinformatics vol14 no 10 pp 846ndash856 1998

[7] T Milledge G Zheng T Mullins and G NarasimhanldquoSBLAST structural basic local alignment searching tools usinggeometric hashingrdquo in Proceeding of the 7th IEEE InternationalConference on Bioinformatics and Bioengineering (BIBE 07) pp1343ndash1347 Boston MassUSA October 2007

[8] E Zotenko R Islamaj Dogan W J Wilbur D P OrsquoLeary andT M Przytycka ldquoStructural footprinting in protein structurecomparison the impact of structural fragmentsrdquo BMC Struc-tural Biology vol 7 article 53 2007

[9] Z Aung and K Tan ldquoRapid 3D protein structure databasesearching using information retrieval techniquesrdquo Bioinformat-ics vol 20 no 7 pp 1045ndash1052 2004

[10] O Camoglu T Kahveci and A K Singh ldquoPSI indexing proteinstructures for fast similarity searchrdquo Bioinformatics vol 19supplement 1 pp i81ndashi83 2003

[11] V Cantoni A Ferone O Ozbudak and A Petrosino ldquoProteinmotifs retrieval by SS terns occurrencesrdquo Pattern RecognitionLetters vol 34 no 5 pp 559ndash563 2013

[12] L Mavridis V Venkatraman D W Ritchie et al ldquoSHRECrsquo10track protein modelsrdquo in Proceedings of the Eurographics Work-shop on 3D Object Retrieval 2010

[13] A R Ortiz C E M Strauss and O Olmea ldquoMAMMOTH(matching molecular models obtained from theory) an auto-mated method for model comparisonrdquo Protein Science vol 11no 11 pp 2606ndash2621 2002

[14] Y Zhang and J Skolnick ldquoTM-align a protein structurealignment algorithm based on the TM-scorerdquo Nucleic AcidsResearch vol 33 no 7 pp 2302ndash2309 2005

[15] J Zhu and ZWeng ldquoFAST a novel protein structure alignmentalgorithmrdquo Proteins Structure Function and Genetics vol 58no 3 pp 618ndash627 2005

[16] L Holm and C Sander ldquoProtein structure comparison byalignment of distance matricesrdquo Journal of Molecular Biologyvol 233 no 1 pp 123ndash138 1993

[17] I N Shindyalov and P E Bourne ldquoProtein structure alignmentby incremental combinatorial extension (CE) of the optimalpathrdquo Protein Engineering vol 11 no 9 pp 739ndash747 1998

[18] M Ankerst G Kastenmuller H-P Kriegel and T SeidlldquoNearest neighbor classification in 3D protein databaserdquo inProceedings of the 7th International Conference on IntelligentSystems for Molecular Biology pp 34ndash43 Heidelberg Germany1999

[19] P Chi G Scott and C Shyu ldquoA fast protein structure retrievalsystem using image-based distance matrices and multidimen-sional indexrdquo in Proceedingsof the 4th IEEE Symposium onBioinformatics and Bioengineering (BIBE rsquo04) pp 522ndash529 May2004

[20] J-S Yeh D-Y Chen and M Ouhyoung ldquoA web-based proteinretrieval system bymatching visual similarityrdquo in Proceedings ofthe Emerging Information Technology Conference pp 108ndash110Taipei Taiwan August 2005

[21] Y Liu H Zha and H Qin ldquoShape topics a compact repre-sentation and new algorithms for 3D partial shape retrievalrdquoin Proceedings of the IEEE Computer Society Conference onComputer Vision and Pattern Recognition (CVPR rsquo06) pp 2025ndash2032 June 2006

[22] J Yu Z Qin T Wan and X Zhang ldquoFeature integrationanalysis of bag-of-features model for image retrievalrdquo Neuro-computing vol 120 pp 355ndash364 2013

[23] S Yoo and C Kim ldquoBackground subtraction using hybridfeature coding in the bag-of-features frameworkrdquo PatternRecognition Letters vol 34 no 16 pp 2086ndash2093 2013

[24] L Nanni and A Lumini ldquoHeterogeneous bag-of-features forobjectscene recognitionrdquo Applied Soft Computing Journal vol13 no 4 pp 2171ndash2178 2013

[25] L Zhou Z Zhou andDHu ldquoScene classification using amulti-resolution bag-of-features modelrdquo Pattern Recognition vol 46no 1 pp 424ndash433 2013

[26] Z Fu G Lu KM Ting andD Zhang ldquoMusic classification viathe bag-of-features approachrdquo Pattern Recognition Letters vol32 no 14 pp 1768ndash1777 2011

[27] K Zagoris I Pratikakis A Antonacopoulos B Gatos and NPapamarkos ldquoDistinction between handwritten and machine-printed text based on the bag of visual words modelrdquo PatternRecognition vol 47 no 3 pp 1051ndash1062 2014

[28] Z Wu J Cao H Tao and Y Zhuang ldquoA novel noise filterbased on interesting patternmining for bag-of-features imagesrdquoExpert Systems with Applications vol 40 no 18 pp 7555ndash75612013

[29] R Ohbuchi K Osada T Furuya and T Banno Salient LocalVisual Features for Shape-Based 3d Model Retrieval 2008

[30] D G Lowe ldquoDistinctive image features from scale-invariantkeypointsrdquo International Journal of Computer Vision vol 60 no2 pp 91ndash110 2004

[31] R Toldo U Castellani and A Fusiello ldquoA bag of wordsapproach for 3d object categorizationrdquo in Proceedings of the4th International Conference on Computer VisionComputerGraphics Collaboration Techniques pp 116ndash127 2009

[32] L Shen J Lin S Wu and S Yu ldquoHEp-2 image classificationusing intensity order pooling based features and bag of wordsrdquoPattern Recognition vol 47 no 7 pp 2419ndash2427 2014

[33] J J Wang H Bensmail and X Gao ldquoJoint learning andweighting of visual vocabulary for bag-of-feature based tissueclassificationrdquo Pattern Recognition vol 46 no 12 pp 3249ndash3255 2013

[34] H Bay A Ess T Tuytelaars and L van Gool ldquoSURF SpeededUp Robust Featuresrdquo Computer Vision and Image Understand-ing vol 110 no 3 pp 346ndash359 2008

[35] Y Gui A Su and J Du ldquoPoint-pattern matching method usingSURF and Shape ContextrdquoOptik vol 124 no 14 pp 1869ndash18732013

[36] Y Pang W Li Y Yuan and J Pan ldquoFully affine invariant SURFfor image matchingrdquo Neurocomputing vol 85 pp 6ndash10 2012

[37] P F Alcantarilla L M Bergasa and A J Davison ldquoGauge-SURF descriptorsrdquo Image and Vision Computing vol 31 no 1pp 103ndash116 2013

[38] U Jayaraman S Prakash and P Gupta ldquoAn efficient color andtexture based iris image retrieval techniquerdquo Expert Systemswith Applications vol 39 no 5 pp 4915ndash4926 2012

[39] H Mehrotra P K Sa and B Majhi ldquoFast segmentation andadaptive SURF descriptor for iris recognitionrdquo Mathematicaland Computer Modelling vol 58 no 1-2 pp 132ndash146 2013

10 Computational and Mathematical Methods in Medicine

[40] J Feulner S K Zhou E Angelopoulou et al ldquoComparing axialCT slices in quantized N-dimensional SURF descriptor spaceto estimate the visible body regionrdquo Computerized MedicalImaging and Graphics vol 35 no 3 pp 227ndash236 2011

[41] ldquoOpenSURFrdquo httpscodegooglecompopensurf1

Page 8: A Method of Protein Model Classification and Retrieval ......ResearchArticle A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features JinlinMa,1,2 ZipingMa,2

8 Computational and Mathematical Methods in Medicine

0 01 02 03 04 05 06 07 08 09 1096

0965097

0975098

0985099

09951

1005

Average recall

Aver

age p

reci

sion

1K2K3K4K5K

6K7K8K9K11K

Figure 7 The precision-recall curves of influence of the codebooksize

32 Influence of the Size of the Codebook In the second ex-periment the influence of the vocabulary size (codebooksize) upon retrieval performance is studied The test datasetincludes 800 protein structures chosen from SCOP proteindatabase

The number of visual words in the codebook (the code-book size) is a very important parameter in our algorithmBecause the codebook size not only determines the spa-tial requirement but also significantly affects the retrievalperformance Figure 7 demonstrates that the precision-recallcurves of our methods increase steadily with the codebooksize We observe that with the number of codebook sizeenlarging from 1000 to 11000 the precision-recall values arecomparatively stable and the precision rate is between 096and 1 for all codebook sizes Meanwhile the difference ofprecision is 002 between the best case and the worst caseSo the influence of the number of the codebook on retrievalprecision is small According to the experimental result weset the number of visual words in the codebook as 3000 inthis paper

33 Influence of the Size of the Training Data In the thirdexperiment we investigate the influence of the training datasize on the retrieval efficiency The test dataset differentfrom the training data is selected from different structuralclassifications of SCOP protein database Figure 8 shows theprecision-recall curves when the number of training datais 400 800 1200 1600 and 2000 respectively As the testresults show the precision rates are all between 096 and 1regardless of the number of training data size

4. Conclusion

Figure 8: The precision-recall curves of influence of the size of training data (x-axis: average recall; y-axis: average precision; one curve per training data size: 400, 800, 1200, 1600, and 2000).

In this paper, we proposed a novel feature extraction algorithm for protein retrieval and classification. The proposed method employs a powerful local image feature, SURF, together with the bag-of-visual-features model. The key idea is to describe a view as a word histogram, which is obtained by vector quantization of the view's local features, and to apply KLD to calculate the distance between two models.
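As an illustration of the distance step, the sketch below computes the Kullback-Leibler divergence between two word histograms, $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log (p_i / q_i)$. The additive smoothing constant `eps` and the symmetric variant are assumptions added here so that empty bins do not cause division by zero and so that the result does not depend on argument order. The exact form of KLD used by the method may differ.

```python
import numpy as np

def kld(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(P || Q) between two histograms.
    eps is an assumed smoothing term that avoids log(0) and division by zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    # Renormalize so both smoothed histograms sum to 1.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kld(p, q, eps=1e-10):
    """Symmetrized divergence (an assumption, not confirmed by the source)."""
    return kld(p, q, eps) + kld(q, p, eps)

# Toy word histograms (hypothetical values).
h1 = [0.5, 0.0, 0.25, 0.25]
h2 = [0.25, 0.25, 0.25, 0.25]
print(kld(h1, h1))            # identical histograms give 0
print(symmetric_kld(h1, h2))  # positive for different histograms
```

KLD is not symmetric in general, so a retrieval system has to either fix the argument order (query first, for example) or use a symmetrized form such as the one sketched here.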

A set of experiments was carried out on the CATH and SCOP protein model databases from the PDB to investigate several critical issues of our method. The experimental results indicate that our method achieves satisfying performance for protein retrieval and protein categorization, which the other methods in the comparison do not reach.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank those who created protein databases and those who made available codes for their shape features. This work was supported by the National Natural Science Foundation of China (Grant nos. 61462002 and 61261043), the College of Scientific Research Project of Ningxia Province (Grant NGY2012147), and the Science Research Project of North University of Nationalities (Grant 2013XYZ024).

References

[1] A. I. Cuff, I. Sillitoe, T. Lewis, et al., “The CATH classification revisited-architectures reviewed and new ways to characterize structural divergence in superfamilies,” Nucleic Acids Research, vol. 37, no. 1, pp. D310–D314, 2009.

[2] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[3] D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985.


[4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

[5] S. F. Altschul, T. L. Madden, A. A. Schaffer, et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.

[6] K. Karplus, C. Barrett, and R. Hughey, “Hidden Markov models for detecting remote protein homologies,” Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.

[7] T. Milledge, G. Zheng, T. Mullins, and G. Narasimhan, “SBLAST: structural basic local alignment searching tools using geometric hashing,” in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE ’07), pp. 1343–1347, Boston, Mass, USA, October 2007.

[8] E. Zotenko, R. Islamaj Dogan, W. J. Wilbur, D. P. O’Leary, and T. M. Przytycka, “Structural footprinting in protein structure comparison: the impact of structural fragments,” BMC Structural Biology, vol. 7, article 53, 2007.

[9] Z. Aung and K. Tan, “Rapid 3D protein structure database searching using information retrieval techniques,” Bioinformatics, vol. 20, no. 7, pp. 1045–1052, 2004.

[10] O. Camoglu, T. Kahveci, and A. K. Singh, “PSI: indexing protein structures for fast similarity search,” Bioinformatics, vol. 19, supplement 1, pp. i81–i83, 2003.

[11] V. Cantoni, A. Ferone, O. Ozbudak, and A. Petrosino, “Protein motifs retrieval by SS terns occurrences,” Pattern Recognition Letters, vol. 34, no. 5, pp. 559–563, 2013.

[12] L. Mavridis, V. Venkatraman, D. W. Ritchie, et al., “SHREC’10 track: protein models,” in Proceedings of the Eurographics Workshop on 3D Object Retrieval, 2010.

[13] A. R. Ortiz, C. E. M. Strauss, and O. Olmea, “MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison,” Protein Science, vol. 11, no. 11, pp. 2606–2621, 2002.

[14] Y. Zhang and J. Skolnick, “TM-align: a protein structure alignment algorithm based on the TM-score,” Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.

[15] J. Zhu and Z. Weng, “FAST: a novel protein structure alignment algorithm,” Proteins: Structure, Function and Genetics, vol. 58, no. 3, pp. 618–627, 2005.

[16] L. Holm and C. Sander, “Protein structure comparison by alignment of distance matrices,” Journal of Molecular Biology, vol. 233, no. 1, pp. 123–138, 1993.

[17] I. N. Shindyalov and P. E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, vol. 11, no. 9, pp. 739–747, 1998.

[18] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl, “Nearest neighbor classification in 3D protein database,” in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, pp. 34–43, Heidelberg, Germany, 1999.

[19] P. Chi, G. Scott, and C. Shyu, “A fast protein structure retrieval system using image-based distance matrices and multidimensional index,” in Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering (BIBE ’04), pp. 522–529, May 2004.

[20] J.-S. Yeh, D.-Y. Chen, and M. Ouhyoung, “A web-based protein retrieval system by matching visual similarity,” in Proceedings of the Emerging Information Technology Conference, pp. 108–110, Taipei, Taiwan, August 2005.

[21] Y. Liu, H. Zha, and H. Qin, “Shape topics: a compact representation and new algorithms for 3D partial shape retrieval,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’06), pp. 2025–2032, June 2006.

[22] J. Yu, Z. Qin, T. Wan, and X. Zhang, “Feature integration analysis of bag-of-features model for image retrieval,” Neurocomputing, vol. 120, pp. 355–364, 2013.

[23] S. Yoo and C. Kim, “Background subtraction using hybrid feature coding in the bag-of-features framework,” Pattern Recognition Letters, vol. 34, no. 16, pp. 2086–2093, 2013.

[24] L. Nanni and A. Lumini, “Heterogeneous bag-of-features for object/scene recognition,” Applied Soft Computing Journal, vol. 13, no. 4, pp. 2171–2178, 2013.

[25] L. Zhou, Z. Zhou, and D. Hu, “Scene classification using a multi-resolution bag-of-features model,” Pattern Recognition, vol. 46, no. 1, pp. 424–433, 2013.

[26] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “Music classification via the bag-of-features approach,” Pattern Recognition Letters, vol. 32, no. 14, pp. 1768–1777, 2011.

[27] K. Zagoris, I. Pratikakis, A. Antonacopoulos, B. Gatos, and N. Papamarkos, “Distinction between handwritten and machine-printed text based on the bag of visual words model,” Pattern Recognition, vol. 47, no. 3, pp. 1051–1062, 2014.

[28] Z. Wu, J. Cao, H. Tao, and Y. Zhuang, “A novel noise filter based on interesting pattern mining for bag-of-features images,” Expert Systems with Applications, vol. 40, no. 18, pp. 7555–7561, 2013.

[29] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, Salient Local Visual Features for Shape-Based 3D Model Retrieval, 2008.

[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[31] R. Toldo, U. Castellani, and A. Fusiello, “A bag of words approach for 3D object categorization,” in Proceedings of the 4th International Conference on Computer Vision/Computer Graphics Collaboration Techniques, pp. 116–127, 2009.

[32] L. Shen, J. Lin, S. Wu, and S. Yu, “HEp-2 image classification using intensity order pooling based features and bag of words,” Pattern Recognition, vol. 47, no. 7, pp. 2419–2427, 2014.

[33] J. J. Wang, H. Bensmail, and X. Gao, “Joint learning and weighting of visual vocabulary for bag-of-feature based tissue classification,” Pattern Recognition, vol. 46, no. 12, pp. 3249–3255, 2013.

[34] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[35] Y. Gui, A. Su, and J. Du, “Point-pattern matching method using SURF and Shape Context,” Optik, vol. 124, no. 14, pp. 1869–1873, 2013.

[36] Y. Pang, W. Li, Y. Yuan, and J. Pan, “Fully affine invariant SURF for image matching,” Neurocomputing, vol. 85, pp. 6–10, 2012.

[37] P. F. Alcantarilla, L. M. Bergasa, and A. J. Davison, “Gauge-SURF descriptors,” Image and Vision Computing, vol. 31, no. 1, pp. 103–116, 2013.

[38] U. Jayaraman, S. Prakash, and P. Gupta, “An efficient color and texture based iris image retrieval technique,” Expert Systems with Applications, vol. 39, no. 5, pp. 4915–4926, 2012.

[39] H. Mehrotra, P. K. Sa, and B. Majhi, “Fast segmentation and adaptive SURF descriptor for iris recognition,” Mathematical and Computer Modelling, vol. 58, no. 1-2, pp. 132–146, 2013.


[40] J. Feulner, S. K. Zhou, E. Angelopoulou, et al., “Comparing axial CT slices in quantized N-dimensional SURF descriptor space to estimate the visible body region,” Computerized Medical Imaging and Graphics, vol. 35, no. 3, pp. 227–236, 2011.

[41] “OpenSURF,” https://code.google.com/p/opensurf1/.
