combining evolutionary information extracted from frequency profiles with sequence-based kernels for...



Upload: leo-mosley

Post on 18-Jan-2018



DESCRIPTION

ABSTRACT  Availability and implementation(source code)  Supplementary information: Supplementary data are available at Bioinformatics online  Contact: or

TRANSCRIPT

Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection
Name: ZhuFangzhi
ID: 14S051034

ABSTRACT
A key step in further improving protein remote homology detection is finding an optimal way to extract the evolutionary information in the profiles. In this paper, three top-performing sequence-based kernels (SVM-Ngram, SVM-pairwise and SVM-LA) were combined with the profile-based protein representation. Various tests were conducted on a SCOP benchmark dataset that contains 54 families and 23 superfamilies.
Results: The new approach is promising and clearly improves the performance of the three kernels.
Availability and implementation (source code)
Supplementary information: Supplementary data are available at Bioinformatics online

1. INTRODUCTION
Background: Considerable difference in magnitude (2013-5) between the protein structures in the Protein Data Bank and the protein sequences in the Swiss-Prot database
Goal: Predict the target protein's family
Some methods:
Based on the generative model: the hidden Markov model (HMM)
Based on the discriminative model: support vector machine (SVM)
The kernel combination methodology (VBKC)

2. MATERIALS AND METHOD

2.1. SCOP BENCHMARK
The 4352 proteins in S can be classified into 853 superfamilies and 1356 families.
These proteins were selected from SCOP version 1.53.

PROTEIN FREQUENCY PROFILE
The frequency profile M for protein P with L amino acids can be represented by target frequencies, calculated from the multiple sequence alignments generated by running PSI-BLAST.

PROFILE-BASED PROTEIN REPRESENTATION
Sorting: p1, p2, ..., p20

SEQUENCE-BASED KERNELS
Discriminative methods based on SVM are the most effective and accurate methods for protein remote homology detection.
Three kernels were used to validate whether the proposed approach could improve their performance: SVM-Ngram, SVM-pairwise and SVM-LA.
At the heart of the SVM is a kernel function K(X, Y), where X and Y are two proteins in the dataset. The kernel normalization step was also used by SVM-pairwise and SVM-LA.

MULTIPLE KERNEL LEARNING
The MKL technique aims to combine different kernels to improve performance. The weight of each kernel can be optimized under different criteria, which fall into two groups:
1. One-stage kernel learning methods
2. Two-stage kernel learning methods (showed better performance with reduced training cost)
KTA (kernel target alignment): optimizes the weight of each kernel

m training samples: x1, x2, ..., xm
Labels: y1, y2, ..., ym
Ideal kernel matrix: yy^T, where y = [y1, y2, ..., ym]
n kernels: K1, K2, ..., Kn

Steps:
1. Learn the weight of each kernel (KTA)
2. Normalization
3. Combination
Cortes, C. et al. (2010) Two-stage learning kernel algorithms. In: Proceedings of the 27th International Conference on Machine Learning, pp. 239-246.

EVALUATION METHODOLOGY: ROC & ROC50
The positive and negative samples are not evenly distributed: the test sets have many more negative than positive samples. The best way to evaluate the trade-off between specificity and sensitivity is to use a receiver operating characteristic (ROC) score.
Another performance measure: ROC50 (the area under the ROC curve up to the first 50 false positives)

3. RESULTS AND DISCUSSION

PROFILE-BASED PROTEIN REPRESENTATION CAN IMPROVE THE PERFORMANCE OF METHODS BASED ON SEQUENCE COMPOSITION
The frequency profile of a protein P can be converted into 20 profile-based proteins (p1, p2, ..., p20) by using the proposed approach. In this study, only the top n most important profile-based proteins (p1, ..., pn) were used in the prediction.
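The conversion from a frequency profile to profile-based proteins can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the slides mention sorting but do not spell out the rule, so the sketch assumes that p_k is built by taking the k-th most frequent amino acid at every sequence position. The names profile_based_proteins, column and AMINO_ACIDS, the alphabet order, and the toy frequencies are all hypothetical.

```python
# Assumed order of the 20 standard amino acids within each profile column
# (hypothetical; a real profile would follow the order used by its generator).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_based_proteins(columns, top_n=3):
    """Build the top_n profile-based proteins from a frequency profile.

    columns: one entry per sequence position, each a list of 20 frequencies
             (one per amino acid, in AMINO_ACIDS order) that sum to 1.
    Assumption: p_k takes the k-th most frequent amino acid at each position.
    """
    if not 1 <= top_n <= len(AMINO_ACIDS):
        raise ValueError("top_n must be between 1 and 20")
    proteins = [[] for _ in range(top_n)]
    for col in columns:
        if len(col) != len(AMINO_ACIDS):
            raise ValueError("each column needs exactly 20 frequencies")
        # Rank amino acid indices by descending frequency. sorted() is stable,
        # so ties keep the alphabet order.
        ranked = sorted(range(len(col)), key=lambda i: -col[i])
        for k in range(top_n):
            proteins[k].append(AMINO_ACIDS[ranked[k]])
    return ["".join(p) for p in proteins]


def column(**freqs):
    """Helper for toy data: unspecified amino acids get frequency 0."""
    return [freqs.get(a, 0.0) for a in AMINO_ACIDS]


# Toy profile with two positions (made-up numbers, for illustration only).
toy = [column(A=0.6, C=0.3, D=0.1), column(W=0.5, Y=0.4, A=0.1)]
print(profile_based_proteins(toy))  # ['AW', 'CY', 'DA']
```

Each returned string has the same length as the original protein, so it can be passed to a sequence-based kernel in place of the raw sequence. The default of three follows the number of profile-based proteins used in this study.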
To select the value of n: The frequencies of the 20 standard amino acids in each column of a frequency profile add up to 1. The average frequency is therefore 1/20 = 0.05. Therefore, in this study, only the top three profile-based proteins were used in the prediction: p1, p2 and p3 (99.99%, 99.60% and 98.13%, respectively).

COMBINING DIFFERENT METHODS VIA MKL
The MKL framework was used to combine these methods. The KTA method was used to automatically optimize the weight of each kernel on the training set. These kernels were then combined, with their weights, into a single kernel for the SVM-based prediction.
VBKC is another method based on MKL.
Four string kernels: SVM-pairwise, SVM-LA, SVM-MM and SVM-Mono

CORRELATIONS BETWEEN DISCRIMINATIVE POWER OF NGRAMS AND PROTEIN FAMILIES
The weight vector of a set of M sequences: α = [α1, α2, α3, ..., αM]
F is the matrix of sequence representatives.
Each element of w represents the discriminative power of the corresponding feature.
Calculate the discriminant weight for each Ngram (Lingner and Meinicke, 2008).
These Ngrams would be the important sequence patterns for maintaining the structure and function of this protein family.

4. CONCLUSION
It is anticipated that the proposed method for detecting remote homology proteins will enhance the power of homology modeling, and hence have an impact on drug development as well.

THANK YOU!