Page 1: [IEEE Communication Technologies, Research, Innovation, and Vision for the Future (RIVF) - Ho Chi Minh City, Vietnam (2012.02.27-2012.03.1)] 2012 IEEE RIVF International Conference

GA SVM: A genetic algorithm for improving gene regulatory activity prediction

Dong Do Duc∗, Tri-Thanh Le†, Trung-Nghia Vu‡, Huy Q. Dinh§, Hoang Xuan Huan¶

∗Institute of Information Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi, Vietnam
†Department of Information Technology, Vietnam Maritime University, Hai Phong, Vietnam
‡Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
§Center for Integrative Bioinformatics, Max F. Perutz Laboratories, Vienna, Dr Bohrgasse 9, 1030 Vienna, Austria and Gregor Mendel Institute of Molecular Plant Biology, Vienna, Austrian Academy of Sciences, Dr Bohrgasse 3, 1030 Vienna, Austria
¶University of Technology (UET), Vietnam National University, Hanoi, 144 Xuan Thuy, Hanoi

Email: {dongdoduc, huanhx}@vnu.edu.vn, [email protected], [email protected], [email protected]

Abstract: Predicting gene regulatory activity is an important step toward understanding the factors that govern gene regulation. Recent advances in sequencing technologies allow this task to be addressed efficiently. Among the computational methods applied to such data, the Support Vector Machine (SVM) has been used successfully, reaching more than 80% accuracy in predicting gene regulatory activity in Drosophila embryonic development. In this paper, we introduce a metaheuristic based on a genetic algorithm (GA) to select the best parameters for regulatory prediction from transcription factor binding profiles. Our approach improves accuracy by more than 10% compared to the traditional grid search. The improvements are also supported by biological experimental data. Thus, the proposed method helps to boost not only the prediction performance but also the potential biological insight.

I. INTRODUCTION & RELATED WORKS

Since its double-helix structure was discovered in 1953, the DNA (deoxyribonucleic acid) sequence, which consists of just four letters (Adenine, Cytosine, Guanine, Thymine), has been considered the natural blueprint of organism development. The genome contains a variety of information encoded in a long sequence of DNA letters. One interesting example is the gene regulatory information that shapes different gene expression patterns. An enhancer, or cis-regulatory module (CRM), is a DNA fragment that carries the information needed to regulate its associated genes. It contains binding sites for the specific transcription factor (TF) proteins that correspond to a certain regulatory activity. Understanding CRM activity and its requirements is therefore a fundamental problem in biology [1]. The authors of [2] proposed a simple model in which CRM activity depends on the respective TF bindings, i.e., either eliminating a TF or disrupting its binding changes the CRM function. This model has been supported by several small-scale studies using ChIP (Chromatin Immunoprecipitation) experiments with Polymerase Chain Reaction amplification. Recently, one of the first genome-wide experiments [3] was carried out using microarray technologies in the model organism Drosophila melanogaster. This work used ChIP on tiling microarrays to obtain the first high-resolution atlas of mesodermal cis-regulatory modules. The data provided strong experimental support for the model mentioned above. In addition, the authors used transcription factor binding profiles measured by ChIP signals [4] to predict the expression patterns of the genes regulated by the respective enhancers. The prediction performance was good and, more importantly, they predicted some novel enhancers with highly accurate expression categories. Learning the regulatory code that drives different expression patterns by computational methods is thus a very attractive branch of computational biology [1].

To predict the expression patterns of genes, the authors of [3] applied a traditional grid search to optimize the parameters of a radial kernel Support Vector Machine (SVM, [5]) and obtained up to 82% accuracy under the leave-one-out cross validation (LOOCV) framework. The cost C and γ are the two parameters of the radial kernel SVM. The former determines the trade-off between minimizing the fitting error and maximizing the classification margin, whereas the latter affects the efficiency of the kernel function, especially for high-dimensional data. Parameter optimization plays an important role in the prediction performance of SVM, especially with the radial kernel [6]. Metaheuristic approaches (e.g., Genetic Algorithm and Ant Colony Optimization) have been successfully applied to optimize SVM parameters ([6], [7]) in different problem contexts. The grid search used by the authors of [3] was a quick method for approximating effective parameters for SVM prediction. However, it explored the parameter space only sparsely. As a consequence, three out of five test cases achieved only 70% accuracy on average, and just one case reached more than 80%. Notably, those three cases were the ones in which the expression pattern corresponds uniquely to one enhancer activity. More intensive methods are therefore needed to search for the best parameters, particularly for very strict datasets in which the available information may not be sufficient for standard prediction.

978-1-4673-0309-5/12/$31.00 ©2012 IEEE

We introduce a genetic algorithm approach to improve the performance of enhancer activity prediction. Using a GA, the method searches the parameter space more intensively than the traditional grid search in order to find better parameters for the prediction. Consequently, the proposed approach outperforms the previous method [3] and obtains more than 80% LOOCV accuracy on average across all cases. More importantly, our results are significantly better when predicting the regulatory activity of novel enhancers with in vivo validated data. Our study demonstrates the need for parameter selection and optimization in SVM prediction on this specific biological dataset.

II. BIOLOGICAL DATA AND PREDICTION PROBLEM

A. Transcriptional binding landscapes in embryonic Drosophila development

Drosophila is a model organism for research on embryonic development because of the well-established time-course experiments for several important transcription factors such as Twist or Tinman [3]. It is also well known for the very early stages of cell development, at which only DNA information may exist. This allows us to investigate the importance of DNA information (e.g., DNA motifs) for the developmental regulation of the cell. ChIP is a method to selectively enrich for DNA sequences bound by a particular protein. Recently, this technology has been used to identify active CRMs systematically at whole-genome scale, by either tiling microarray (ChIP-chip) or deep sequencing (ChIP-Seq). Using ChIP-chip, the authors of [3] used a tiling array to obtain transcription factor binding data for five important mesodermal factors (Twist, Tinman, Mef2, Bagpipe, and Biniou) at five crucial time points during embryogenesis (Fig. 1).

Subsequently, each CRM is assigned one expression category (mesoderm, somatic muscle, or visceral muscle (Fig. 1), referred to as meso, sm, and vm from here on) using a well-known database (e.g., the REDfly database [8]) consisting of 310 CRMs. In this dataset, a number of CRMs belong to ambiguous expression categories, i.e., their patterns are observed in both meso and sm (called meso sm) or in both vm and sm (called vm sm). In addition, the authors of [3] identified in vivo the expression category of 35 de novo CRMs that are absent from the REDfly database, determining their expression patterns with transgenic reporter assay experiments. This is important because it allows one to test a prediction approach by predicting the activities of those novel CRMs while using the known REDfly CRMs for training.

B. Spatio-temporal cis-regulatory activity prediction in machine learning context

Researchers in [3] applied a Support Vector Machine to establish a framework for predicting transcriptional regulatory activity, i.e., the expression category, from the binding profiles of the corresponding transcription factors. The prediction helped to indicate which specific transcription factors, and to what degree, influence the expression patterns they regulate. In the machine learning context, each CRM was represented by a data object with at most 15 features, each a combination of a transcription factor and a time point. The SVM method was applied to predict the expression pattern of each CRM. In detail, a binary SVM was used to predict the group of an enhancer from the binding of 5 transcription factors at 5 embryonic development time points. The groups were mesoderm, somatic muscle, and visceral muscle (meso, SM, VM). The combinations Meso+SM and VM+SM were also considered because they are naturally observed in the expression data.

Fig. 1. Regulatory activity prediction based on transcription factor binding measured by ChIP-chip peak heights. The peak height indicates the ChIP binding of the respective TF at a specific time point. In this figure, Twist (Twi) and Tin are measured at early time points (5-7h, 8-9h, 10-11h) and Bin at late time points (10-11h, 12-13h, 13-15h), whereas Bap is measured only at 10-11h and Mef2 at all time points. The binding profile is then used to predict the group of the enhancer activity. The three groups (mesoderm, somatic muscle, visceral muscle) are shown on the right side. Part of the figure is from [1].

III. METHODS

A. SVM prediction of regulatory activity based on transcription factor binding profiles

An SVM constructs a hyperplane in an N-dimensional space that optimally separates the data into two categories. Given an individual enhancer and its corresponding binding profiles from ChIP-chip data, a binary SVM is used to predict its transcriptional category. An SVM model is built to learn how to classify an enhancer x into two classes, e.g., mesodermal or not mesodermal, from a training set of m enhancers with known activities. The SVM classifier is based on the following decision function: $f(x) = \sum_{i=1}^{m} \lambda_i K(x_i, x)$, where K is a kernel function and the $\lambda_i$ are coefficients learned during the training process. Usually, the linear kernel function is used for simple data and the radial kernel function for more complex cases.
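As an illustration of the decision function above, the following minimal Python sketch evaluates $f(x)$ with a radial kernel. The function names, the toy training points, and the coefficient values are hypothetical and chosen only for illustration; the kernel form $\exp(-\gamma \lVert u - v \rVert^2)$ is the standard radial kernel and is assumed here rather than quoted from [3].

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial (Gaussian) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, train_points, coeffs, gamma):
    """f(x) = sum_i lambda_i * K(x_i, x), following the formula in the text."""
    return sum(lam * rbf_kernel(xi, x, gamma)
               for xi, lam in zip(train_points, coeffs))

# Hypothetical toy data: two training points with coefficients +1 and -1.
train = [(0.0, 0.0), (1.0, 1.0)]
lambdas = [1.0, -1.0]
f_val = decision_value((0.0, 0.0), train, lambdas, gamma=1.0)
```

A positive value of `f_val` would assign the query point to the first class. A larger γ makes the kernel decay faster with distance, which is one reason the choice of γ matters for the radial kernel.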

SVM is a parameter-sensitive classification method, particularly with the radial kernel function. Researchers in [3] used a fine-grained grid search to find the optimal result, in which C and γ were set to values ranging from $10^{-2}$ to $10^{5}$ and from $10^{-6}$ to $10^{2}$, respectively. This resulted in an average SVM accuracy of 78% under LOOCV. In this paper, we investigate the optimization of the two important parameters C and γ using a Genetic Algorithm. The GA searches the parameter space more finely, so better results are expected.
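To make the grid search concrete, the sketch below enumerates a logarithmic grid over the two ranges quoted above. Stepping the exponent by one is an assumption, since the step size is not stated here, and `toy_score` is a hypothetical stand-in for the LOOCV accuracy of an SVM, not a real evaluation.

```python
import math
from itertools import product

# Exponent ranges from the text: C in 10^-2..10^5, gamma in 10^-6..10^2.
# A step of one in the exponent is an assumption made for illustration.
C_GRID = [10.0 ** e for e in range(-2, 6)]
GAMMA_GRID = [10.0 ** e for e in range(-6, 3)]

def grid_search(score, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return (best_score, C, gamma) over the Cartesian product of both grids."""
    return max((score(c, g), c, g) for c, g in product(c_grid, gamma_grid))

def toy_score(c, gamma):
    # Hypothetical score surface whose optimum (C=30, gamma=0.003)
    # lies between grid points, so the grid can only approximate it.
    return -((math.log10(c) - math.log10(30.0)) ** 2
             + (math.log10(gamma) - math.log10(0.003)) ** 2)

best_score, best_c, best_gamma = grid_search(toy_score)
```

On this toy surface the grid returns the nearest grid point rather than the true optimum, which is the kind of gap a finer search such as a GA is meant to close.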

B. Genetic Algorithm

The GA works as follows (see the pseudocode in Algorithm 1). The t-th generation, called P(t), consists of N solutions, i.e., N sets of parameters (C, γ). Each solution is evaluated with a fitness function, here an AUC value. The next generation, the (t+1)-th, is created by selecting the best individuals via a lottery cycle procedure and applying GA operators, including mutation and crossover. More details about GAs can be found in [9]. The construction of the chromosome and the fitness function for our problem are discussed in the next section.

Algorithm 1: GA algorithm to improve the prediction
Data: An enhancer set with known regulation activity
Output: The best solution

begin
    t ← 0 (generation index);
    Initialize the generation P(t);
    Evaluate P(t);
    while termination condition is not met do
        t ← t + 1 (next generation);
        Select new generation Q(t) from P(t − 1);
        Create P(t) from Q(t) by GA operators;
        Evaluate P(t) and Select the best individuals;
    Output the best solution;
end
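The loop of Algorithm 1 can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the genalg implementation: fitness-proportional selection, one-point crossover, bit-flip mutation, a fixed generation budget as the termination condition, and all parameter defaults (`pop_size`, `generations`, `p_mut`, `seed`) are choices made here for the example. The fitness function must return non-negative values and `pop_size` must be even.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Toy binary GA that mirrors the steps of Algorithm 1."""
    rng = random.Random(seed)
    # Initialize the generation P(0) and evaluate it.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):  # termination: fixed generation budget
        # Select Q(t) from P(t-1) with probability proportional to fitness.
        weights = [fitness(ind) + 1e-9 for ind in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Create P(t) from Q(t) by one-point crossover and bit-flip mutation.
        nxt = []
        for a, b in zip(parents[0::2], parents[1::2]):
            cut = rng.randrange(1, n_bits)
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                nxt.append([bit ^ 1 if rng.random() < p_mut else bit
                            for bit in child])
        pop = nxt
        # Evaluate P(t) and keep the best individual seen so far.
        best = max(pop + [best], key=fitness)
    return best

# Usage with a placeholder fitness (number of one bits) instead of an AUC value.
best = run_ga(sum, n_bits=16)
```

In the setting of this paper the fitness call would train and evaluate an SVM for the decoded (C, γ), which is why each GA run costs far more than one grid search.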

The standard implementation of the GA, with default parameters, is taken from the R package genalg (see footnote 1).

C. Fitness function and representation of parameters in GA

The main issue in a GA is how to represent the problem as a chromosome. In our method, the two parameters C and γ are encoded in a chromosome as a binary vector. In detail, each chromosome consists of a 51-bit binary vector that represents the real values of the parameters. The first 24 bits are reserved for C and the remaining 27 bits represent the value of γ. Figure 2 gives an example of a chromosome and of the mutation and crossover operations. In mutation, the zero bit in the dark cell of a chromosome is changed to a one bit in the resulting chromosome. In crossover, two chromosomes are split at the same position, and then the heads and tails of the two chromosomes are exchanged.
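The two operators described above can be sketched on 51-bit chromosomes as follows. The helper names `mutate` and `crossover`, the all-zero and all-one parents, and the chosen positions are hypothetical and serve only to make the effect of each operator easy to see.

```python
N_C_BITS, N_GAMMA_BITS = 24, 27  # 51 bits in total, as stated in the text

def mutate(chrom, pos):
    """Flip the bit at index pos and return a new chromosome."""
    return chrom[:pos] + [1 - chrom[pos]] + chrom[pos + 1:]

def crossover(a, b, cut):
    """Split both parents at the same position and exchange their tails."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

# Illustrative parents: all zeros and all ones.
parent_a = [0] * (N_C_BITS + N_GAMMA_BITS)
parent_b = [1] * (N_C_BITS + N_GAMMA_BITS)

# Cutting at the C/gamma boundary swaps the whole gamma segment.
child_a, child_b = crossover(parent_a, parent_b, cut=N_C_BITS)
mutant = mutate(parent_a, pos=3)
```

Here `child_a` carries the C bits of `parent_a` and the γ bits of `parent_b`. A cut at any other position would mix bits within one parameter, which changes its decoded value rather than swapping it whole.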

At each step, the GA evolves the population in silico and selects the best individuals for the next generation according to the fitness function, which is defined as the Area Under the Curve (AUC) value computed with [10]. At the last stage, the best binary vectors are transformed back to real-valued parameters, normalized by a factor of $10^{2}$ (for C) and $10^{6}$ (for γ).
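One plausible reading of this decoding step is sketched below: each bit segment is read as an unsigned integer, most significant bit first, and divided by the stated factor. Both the bit order and the interpretation of "normalized" as division are assumptions. Under this reading the upper bounds are about $1.7 \times 10^{5}$ for C and about $1.3 \times 10^{2}$ for γ, with resolutions of $10^{-2}$ and $10^{-6}$, which is consistent with the ranges searched by the grid method.

```python
def bits_to_int(bits):
    """Read a list of 0/1 values as an unsigned integer, MSB first (assumed)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def decode(chrom):
    """Map a 51-bit chromosome to (C, gamma) by dividing by 10^2 and 10^6."""
    if len(chrom) != 51:
        raise ValueError("expected a 51-bit chromosome")
    c = bits_to_int(chrom[:24]) / 1e2       # first 24 bits encode C
    gamma = bits_to_int(chrom[24:]) / 1e6   # remaining 27 bits encode gamma
    return c, gamma

c_max, gamma_max = decode([1] * 51)
```

Note that the all-zero chromosome decodes to C = 0 and γ = 0 under this reading, so a practical implementation would presumably need to reject or clamp such values before training an SVM.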

IV. EXPERIMENTAL RESULTS

A. Data & Evaluation

We used two published datasets from the model organism Drosophila melanogaster: the first consisted of 310 CRMs with known regulatory activity, and the second was a selected collection of 35 novel enhancers, chosen from more than 8000 enhancers, whose expression category was tested in vivo [3]. The 310 enhancers are from the CRM Activity Database (CAD), with the expression driven by published CRMs taken from the REDfly database [8]. For the second set, we used the 310 known enhancers as the training set. The novel enhancers were selected and tested in vivo in [3].

Fig. 2. The 51-bit binary representation consists of 24 bits for C and 27 bits for γ. After a generation, GA operators such as mutation and crossover are applied to generate a new representation.

1 http://cran.r-project.org/web/packages/genalg/index.html

It is worth noting that the majority of the datasets were imbalanced, i.e., the numbers of active and non-active enhancers were not equal. To evaluate this type of data, we used the Balanced Accuracy (BACC), defined as the average of the Sensitivity and Specificity of the prediction results. In addition, we used the traditional Area Under the Curve (AUC) to estimate the trade-off between the two measurements. All evaluations were computed under unbiased leave-one-out cross validation (LOOCV). The proposed method was run 20 times and the results were recorded. The initial GA parameters were the defaults of the genalg package. The run time was about an hour on a 3.3 GHz PC with 4 GB of RAM, while the traditional grid search took about 5 minutes because of its simple strategy. However, this is not a significant problem given today's increasingly powerful machines.
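The Balanced Accuracy defined above can be computed as in the following sketch. The labels and predictions are a hypothetical imbalanced example (2 active and 8 non-active enhancers), not data from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """BACC = (sensitivity + specificity) / 2 for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of active enhancers recovered
    specificity = tn / (tn + fp)  # fraction of non-active enhancers recovered
    return (sensitivity + specificity) / 2

# Hypothetical example: one of two actives is missed, one false alarm.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
bacc = balanced_accuracy(y_true, y_pred)  # 0.6875
```

Plain accuracy on this example is 0.8, since 8 of 10 predictions are correct, while BACC is 0.6875 because the minority class is only half recovered. This difference is why BACC is preferred on imbalanced data.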

B. Comparative Study

1) Known enhancer dataset: The GA SVM outperforms the previous study on the Meso, VM, SM and VM SM datasets (Fig. 3). In the case of Meso SM, the performances of the two methods are similar, both reaching up to 82%. Remarkably, the GA SVM improved the performance of SVM prediction by up to 10% on average for the three cases of unique regulatory activity (Meso, VM and SM). This large gap demonstrates the value of SVM parameter optimization for this particular type of data.

Fig. 3. Comparison of Balanced Accuracy (BACC) between the GA SVM method and the grid search (GS SVM) method [3] for five experimental categories. The GA SVM (for 20 runs) outperforms the other method in all cases.

In terms of AUC, the mean and deviation over 20 runs were recorded (see Table I). The proposed method again has significantly higher performance than the grid search method in the cases of unique regulatory activities. The ROCR package [10] is used for the computation.
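For readers unfamiliar with the measure, AUC can be computed directly from its rank interpretation: the fraction of (active, non-active) pairs in which the active enhancer receives the higher score, with ties counted as one half. The sketch below illustrates this definition and is not the ROCR code used in the paper; the labels and scores are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical decision values for 2 active and 3 non-active enhancers.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc_value = auc(labels, scores)  # 5 of 6 pairs are ordered correctly
```

Because it depends only on the ordering of the scores, AUC does not require choosing a decision threshold, which makes it a convenient fitness value for comparing (C, γ) settings.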

Regulatory category    GS SVM [3]    GA SVM
Meso                   0.66          0.71±0.01
VM                     0.67          0.78±0.01
SM                     0.71          0.75±0.01
Meso SM                0.82          0.83±0.01
VM SM                  0.74          0.82±0.02

TABLE I
THE COMPARISON BETWEEN THE GA SVM METHOD AND THE GRID SEARCH METHOD (GS SVM) [3] IN TERMS OF AREA UNDER THE CURVE (AUC) FOR ALL EXPERIMENTAL CATEGORIES.

2) In vivo enhancer test: In [3], the authors carried out in vivo experiments for 35 of more than 8000 new enhancers and reported their specific regulatory activities. In this paper, we evaluate the performance of the two methods by predicting these data. We also counted a prediction as partially correct if the enhancer was predicted to have one of the observed expression categories. Both methods perform well, correctly predicting up to approximately 80% of the novel CRM regulatory activities (see Fig. 4). Interestingly, the GA SVM significantly improves the number of correct CRM activity predictions for partial expression. It also significantly decreases the number of false positive CRM activity predictions compared to the previous results [3]. This indicates that well-suited prediction parameters are necessary for learning rules from known CRM datasets in order to predict the activity of novel ones, where the training information may not fit the information to be predicted well.

V. CONCLUSIONS

We proposed a new way to improve the prediction of gene regulatory activity based on transcription factor binding profiles. Accuracy improved by roughly 10% or more compared to the previous method. In particular, we obtained significantly better results for the unique expression categories, where the prediction information needs to be more precise. In addition, we also outperformed the previous method on the novel enhancers when using the known enhancers as the training set. This indicates the importance of optimization in biological prediction. Biological data are emerging rapidly, which creates a need for effective computational optimization. Future work includes tackling a variety of prediction problems in biology and then building an automatic system of evolutionary computation algorithms that learns the prediction parameters from the biological data itself.

Fig. 4. Comparison between the GA SVM method and the grid search method [3] for the novel enhancers. True Positive and False Positive indicate the CRMs with unique regulatory activities for which the prediction results are true or false, respectively. Partial indicates the number of CRMs for which the predicted regulatory activity is one of the expression categories detected by in vivo experiments.

ACKNOWLEDGMENT

This work is partially supported by Vietnam's National Foundation for Science and Technology Development (NAFOSTED).

REFERENCES

[1] A. Stark, “Learning the transcriptional regulatory code,” Mol. Syst. Biol., vol. 5, p. 329, 2009.

[2] M. I. Arnone and E. H. Davidson, “The hardwiring of development: organization and function of genomic regulatory systems,” Development, vol. 124, pp. 1851–1864, May 1997.

[3] R. P. Zinzen, C. Girardot, J. Gagneur, M. Braun, and E. E. Furlong, “Combinatorial binding predicts spatio-temporal cis-regulatory activity,” Nature, vol. 462, pp. 65–70, Nov 2009.

[4] P. J. Park, “ChIP-seq: advantages and challenges of a maturing technology,” Nat. Rev. Genet., vol. 10, pp. 669–680, Oct 2009.

[5] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995. [Online]. Available: http://dx.doi.org/10.1007/BF00994018

[6] X. Zhang, X. Chen, and Z. He, “An ACO-based algorithm for parameter optimization of support vector machines,” Expert Syst. Appl., vol. 37, pp. 6618–6628, September 2010. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2010.03.067

[7] C.-L. Huang and C.-J. Wang, “A GA-based feature selection and parameters optimization for support vector machines,” Expert Systems with Applications, vol. 31, no. 2, pp. 231–240, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417405002083

[8] M. S. Halfon, S. M. Gallo, and C. M. Bergman, “REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila,” Nucleic Acids Res., vol. 36, pp. D594–598, Jan 2008.

[9] C. Reeves, Genetic Algorithms and Combinatorial Optimisation: Applications of Modern Heuristic Techniques. In V. J. Rayward-Smith (Ed.), Alfred Waller Ltd, Henley-on-Thames, UK, 1995.

[10] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, pp. 3940–3941, Oct 2005.