
Proceedings of ICISIP - 2005

Cleavage Knowledge Extraction in HIV-1 Protease using Hidden Markov Model

Jayavardhana Rama G L 1, Palaniswami M 2, Department of Electrical and Electronics Engineering,

The University of Melbourne, Parkville - 3010, Australia. [email protected]. 2 [email protected]

Abstract

Inactive HIV is a polyprotein precursor. This protein chain has to be cleaved at 9 specific positions to produce the individual functional mature proteins responsible for making up a new active virus. Cleavage knowledge extraction in HIV protease will assist in designing effective inhibitors used in the treatment of AIDS. Although much progress has been made in sequencing the viral protease, little progress has been made in understanding the specificity. Several machine learning techniques have been used in understanding the specificity of HIV-1 protease, with the highest prediction rate being 91%. In this paper the Hidden Markov Model is used for analyzing the specificity of HIV-1 protease. The objective is to learn the lock and key mechanism of the protease and protein precursor using the Hidden Markov Model (HMM) from a set of experimental observations. A good self consistency rate of 96% and a recognition accuracy of 95.24% on unseen data are achieved. The HIV protease specificity to cleave between a phenylalanine or tyrosine and proline is also validated by our experiments, indicating that the HMM is successful in learning the complex lock and key rule between protease and precursor protein. Used with other techniques, HMM can be an effective tool for designing new drugs.

0-7803-8840-2/05/$20.00 ©2005 IEEE

1. INTRODUCTION

HIV belongs to the retrovirus class, whose members carry genetic information in the form of RNA. It infects the T cells that carry the CD4 antigen on their surface. Once the virus enters the cell, its RNA is transcribed into DNA with the help of a viral enzyme called reverse transcriptase. The viral DNA enters the cell nucleus, where it integrates itself into the genetic material of the cell with the help of the enzyme integrase. Activation of the host cell results in the transcription of the viral DNA into messenger RNA, which is then translated into viral proteins. This viral protein precursor must be cleaved at 9 specific positions to produce mature proteins. The viral RNA and the mature proteins give rise to functional HIV ([8], [20], [3]). This cleaving or cutting activity is carried out by HIV protease.

HIV-1 protease is an aspartyl protease. It has 99 residues that are highly substrate specific and cleavage specific. It has a dimeric structure containing identical monomers but a single active site. The active site of HIV-1 protease is estimated to be eight residues long, although it occasionally occurs as seven or nine residues. The interaction between the protease and the protein precursor can be visualized as a lock and key model where the protease active site represents the lock and the sequence in the protein precursor defines the key (refer to Figure 1).

Because of its crucial role in virus maturation, protease is a potential target for drug design to control the spread of HIV, leading to control of AIDS ([23]). A substance which can control or inhibit the protease activity is called a protease inhibitor. Currently Indinavir, Ritonavir and Saquinavir are the three popularly available inhibitors used in the treatment of AIDS. But the adaptive changes in the protease cleavage sites limit the use of such available drugs, which forces researchers to learn the specificity of such adaptive proteases ([2], [9], [4], [13]). The process of protease activity and inhibition is summarized in Figure 2. Knowledge about the specificity of the cleavage is extremely important in designing effective protease inhibitors used in the treatment of AIDS.

As the sequence is eight residues long and as many as 20 amino acids can take each of these positions, a total of 20^8 (about 2.56 x 10^10) key combinations are possible, which is too many to test individually in the lab. The objective of this investigation is to learn the lock and key mechanism of the protease and protein precursor using the Hidden Markov Model (HMM) from a set of experimental observations, so that we can reduce the number of candidate combinations to one that can be verified by testing.
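The size of the key space quoted above follows directly from the sequence length and the alphabet size. A minimal sketch of that arithmetic (the constant names are illustrative, not from the paper):

```python
AMINO_ACIDS = 20      # residues that can occupy each position
OCTAMER_LENGTH = 8    # number of positions in the key sequence

# Every position is chosen independently, so the counts multiply.
key_space = AMINO_ACIDS ** OCTAMER_LENGTH
print(f"{key_space:,} possible octamers")
```

Even at a hypothetical rate of one assay per second, enumerating this space would take centuries, which is the practical motivation for a learned model.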

Previous Work

Several machine learning techniques have been applied recently to model the HIV-1 specificity. [7] applied a standard feed-forward multi-layer perceptron (MLP) with eight hidden units, reporting an accuracy of 88%. [5] furthered the previous work with an expanded data set and reported an accuracy of 92%. [6] applied Support Vector Machines (SVM) with different kernel types and found that the Gaussian kernel would produce the best result. They reported 87% recognition with SVMs. [17] reported similar results to [5] using MLP but they used different combinations of training and testing sets. They also used Symbolic Learning Techniques and found that MLP outperforms the others. [21] reported that the data sets used in all the above were linearly separable and did not require non-linear classifiers, as there were no improved results from these classifiers. They also tested three other out-of-sample data sets with good results. Work on learning drug resistance can be found in [10].

Fig. 1: Lock and Key Model of the Protein Precursor Sequence and the Protease Active Site. Our task is to learn the sequences which can get cleaved by the protease so that the inhibitor we design succeeds in blocking the protease action by attaching itself to the drug we inject. S represents the protease active site and P represents the protein precursor.

[Figure 2 panels: (a) Protein Precursor, (b) Cleavage Point, (c) Protein Precursor, (d) Protein Precursor]

Fig. 2: Protease Activity (a, b, c) and its Inhibition (d). (a) HIV protease is ready to cleave the protein precursor. Protein precursors are long polypeptide chains which should be cleaved by protease at 9 different positions to produce active HIV. (b) The HIV protease attaches itself to the cleavage position between P1 and P1'. (c) Protease successfully cleaves at one position and moves on to cleave another position. (d) Protease inhibitors attach to these proteases and make them ineffective.

In this paper we describe the use of HMM in learning the specificity of HIV-1 protease to cleave the protein precursor, and also show that the recognition accuracy on the HIV-1 data sets is better than that of the SVM, MLP and symbolic language approaches used in the literature for the same data set.

2. METHODS

A. Representation

Symbolically, the 8 amino acid residues are represented as P4 P3 P2 P1 P1' P2' P3' P4'. The corresponding amino acids in the protease can be represented as S4 S3 S2 S1 S1' S2' S3' S4'. This representation was first used by Schechter and Berger ([22]).

Our task is to find the specificity of S which will effectively cleave P between P1 and P1'. The sequence S can be represented as a Lock and the sequence P can be visualized as the Key. There are numerous keys for the lock and our task is to learn the Lock and Key Rule from a set of experimental observations. In our experiments, the 20 amino acids were numbered 1 through 20 as shown in Table 1.

B. Dataset

362 sequences of 8 residue length made up the data set. It contained 114 cleaved and 248 non-cleaved samples. This set was divided into a training set, which included 60 cleavable and 239 non-cleavable samples and was used for training the HMM, and a testing set consisting of 54 cleavable and 9 non-cleavable samples. Several researchers used the same dataset ([5], [6], [17], [21]), which was a superset of the data used by [7].
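The split sizes stated above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch; the dictionary and function names are illustrative and the counts are copied from the paragraph above.

```python
# Counts as stated in the Dataset paragraph.
TOTAL = {"cleavable": 114, "non_cleavable": 248}
TRAIN = {"cleavable": 60, "non_cleavable": 239}
TEST = {"cleavable": 54, "non_cleavable": 9}

def split_is_consistent():
    """True when train + test reproduce the per-class totals and the 362 total."""
    per_class = all(TRAIN[k] + TEST[k] == TOTAL[k] for k in TOTAL)
    return per_class and sum(TOTAL.values()) == 362

def cleavable_fraction(split):
    """Share of cleavable samples in a split."""
    return split["cleavable"] / sum(split.values())

print(split_is_consistent())
print(round(cleavable_fraction(TRAIN), 3), round(cleavable_fraction(TEST), 3))
```

The two fractions make the class balance explicit: the training set is dominated by non-cleavable samples while the test set is dominated by cleavable ones, which is worth keeping in mind when reading the accuracy figures.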


TABLE 1: LIST OF AMINO ACIDS, RESPECTIVE BIOCHEMICAL SYMBOLS AND THE OBSERVED OUTPUT FROM HMM. THE THIRD COLUMN IS NOTHING BUT THE SET O

Amino Acid       Biochemical Symbol   Observed Symbol
Alanine          A                    1
Arginine         R                    2
Asparagine       N                    3
Aspartic Acid    D                    4
Cysteine         C                    5
Glutamine        Q                    6
Glutamic Acid    E                    7
Glycine          G                    8
Histidine        H                    9
Isoleucine       I                    10
Leucine          L                    11
Lysine           K                    12
Methionine       M                    13
Phenylalanine    F                    14
Proline          P                    15
Serine           S                    16
Threonine        T                    17
Tryptophan       W                    18
Tyrosine         Y                    19
Valine           V                    20
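The numbering in Table 1 amounts to a lookup from one-letter amino acid codes to observation symbols. A minimal sketch of that encoding step, assuming sequences are supplied as one-letter strings (the names `AMINO_ACID_ORDER`, `SYMBOL_OF` and `encode` are illustrative, not from the paper):

```python
# One-letter codes in the order of Table 1, so index + 1 is the observed symbol.
AMINO_ACID_ORDER = "ARNDCQEGHILKMFPSTWYV"
SYMBOL_OF = {aa: i for i, aa in enumerate(AMINO_ACID_ORDER, start=1)}

def encode(octamer: str) -> list[int]:
    """Map an 8-residue sequence to its list of observation symbols (1..20)."""
    if len(octamer) != 8:
        raise ValueError("expected a sequence of exactly 8 residues")
    return [SYMBOL_OF[aa] for aa in octamer.upper()]

print(encode("AETFYVDK"))  # [1, 7, 17, 14, 19, 20, 4, 12]
```

An unknown letter raises `KeyError`, which is a deliberate choice in this sketch so that malformed input is not silently mapped to a valid symbol.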

C. Hidden Markov Models

The Hidden Markov Model is a well known stochastic modelling technique mainly used in the speech modeling literature. For a detailed description one can refer to [19]. In bioinformatics it has been used in applications like protein modeling ([15]) and sequence alignment ([14]) and is fast becoming an increasingly popular tool. [18] apply HMM for yeast DNA segmentation. The application of HMMs to bioinformatics is comprehensively reviewed in [12].

A first order HMM ([11], [1]) is defined by a finite set of states N, a discrete alphabet O of symbols, a probability transition matrix T, a probability emission matrix E and an initial distribution vector of the Markov chain $\pi$. It is represented as

$$\lambda = (T, E, \pi) \tag{1}$$

The system randomly evolves from state to state while emitting symbols from O (refer to Table 1). Only the symbols emitted by the system are observable; the random walk from state to state is "hidden". Estimating the model parameters from data, as well as evaluating the test data, can be done with several algorithms available in the literature [19].

We use the Baum-Welch algorithm for estimating the model parameters. Given a sequence P and some initial model $\lambda$ with a given number of states N, the algorithm adjusts the parameters of the HMM so as to maximize the likelihood $\Pr(P \mid \lambda)$. It is an iterative version of the expectation-maximization technique that converges to a local maximum. One of the following two conditions is chosen as the convergence criterion:

• The number of iterations reaches a maximum equal to the number of training sequences, that is 60 for the cleavable case and 239 for the non-cleavable case.
• The difference between the previous log likelihood and the present one is less than or equal to 10^-3.
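The training procedure described above can be sketched in plain Python as follows. This is an illustrative implementation, not the authors' code, and it rests on several assumptions that are not stated in the paper: random initialisation of T and E, a uniform initial π, a small pseudocount on emissions so that no symbol ends up with zero probability, 0-based symbols (Table 1 symbols minus one), and both stopping criteria applied together (whichever is met first). The toy sequences are synthetic and only exercise the code.

```python
import math
import random

def forward_backward(obs, T, E, pi):
    """Scaled forward-backward pass; returns alpha, beta and per-step scales."""
    n, N = len(obs), len(pi)
    alpha = [[0.0] * N for _ in range(n)]
    beta = [[1.0] * N for _ in range(n)]
    scale = [0.0] * n
    for t in range(n):
        for j in range(N):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * T[i][j] for i in range(N))
            alpha[t][j] = prev * E[j][obs[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(n - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(T[i][j] * E[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]
    return alpha, beta, scale

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def baum_welch(sequences, n_states, n_symbols, tol=1e-3, max_iter=None, seed=0):
    """Fit lambda = (T, E, pi); returns the model and the log-likelihood history."""
    rng = random.Random(seed)
    T = [normalize([rng.random() for _ in range(n_states)]) for _ in range(n_states)]
    E = [normalize([rng.random() for _ in range(n_symbols)]) for _ in range(n_states)]
    pi = [1.0 / n_states] * n_states
    # Criterion 1: cap the iterations at the number of training sequences.
    max_iter = len(sequences) if max_iter is None else max_iter
    history = []
    for _ in range(max_iter):
        T_num = [[0.0] * n_states for _ in range(n_states)]
        E_num = [[1e-6] * n_symbols for _ in range(n_states)]  # assumed pseudocount
        pi_num = [0.0] * n_states
        log_lik = 0.0
        for obs in sequences:  # E-step: accumulate expected counts
            alpha, beta, scale = forward_backward(obs, T, E, pi)
            log_lik += sum(math.log(c) for c in scale)
            for t in range(len(obs)):
                for i in range(n_states):
                    gamma = alpha[t][i] * beta[t][i]
                    E_num[i][obs[t]] += gamma
                    if t == 0:
                        pi_num[i] += gamma
                    if t + 1 < len(obs):
                        for j in range(n_states):
                            T_num[i][j] += (alpha[t][i] * T[i][j] * E[j][obs[t + 1]]
                                            * beta[t + 1][j] / scale[t + 1])
        history.append(log_lik)
        # M-step: renormalise the expected counts into probabilities.
        T = [normalize(r) for r in T_num]
        E = [normalize(r) for r in E_num]
        pi = normalize(pi_num)
        # Criterion 2: stop once the log likelihood changes by at most tol.
        if len(history) > 1 and abs(history[-1] - history[-2]) <= tol:
            break
    return T, E, pi, history

# Synthetic toy data: 30 random octamers over 20 symbols (not the paper's data).
toy_rng = random.Random(1)
toy_sequences = [[toy_rng.randrange(20) for _ in range(8)] for _ in range(30)]
T, E, pi, history = baum_welch(toy_sequences, n_states=8, n_symbols=20)
print(len(history), round(history[0], 2), round(history[-1], 2))
```

Each entry of `history` is the total log likelihood of the training set under the parameters at the start of that iteration, so plotting it against the iteration index gives a convergence curve of the kind the paper discusses.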


Fig. 3: Plot showing the log likelihood convergence over numerous iterations for four numbers of states (6, 7, 8, 9).

Fig. 4: Schematic Representation of 2 HMMs in parallel, one each modelling the Cleavable and Non-cleavable sets.

The evolution of the likelihood was evaluated over numerous iterations of the Baum-Welch algorithm on the data set. The number of states in both models is chosen as 8, the length of the sequence that is to be recognized. However, we tried out the training using different numbers of states (6, 7, 8 and 9). We found that the number of iterations was minimal and the self consistency tests were better with N = 8. Figure 3 shows the convergence of the log-likelihood over the iterations with 6, 7, 8 and 9 states. The experiment was carried out without fixing the maximum number of iterations and was allowed to run until the log likelihood difference between the present and previous iteration was of the order of 10^-3. The choice with the highest log likelihood was chosen to be the number of states for our experiments.

D. Classification

The problem of cleavability and non-cleavability can be viewed as a two class problem with two models, one each for cleavability and non-cleavability, as represented in Figure 4. Given the observed sequence O and the model $\lambda$, we need to find $P(O \mid \lambda)$. The model which gives the higher likelihood for the sequence is chosen to be the classification result.
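The two-model decision rule can be sketched with the scaled forward algorithm, which computes log P(O|λ) for each model. The sketch below is illustrative only: the tiny 2-state, 3-symbol models are made-up stand-ins for the trained 8-state, 20-symbol models, symbols are 0-based, and sending ties to "non-cleavable" is an assumption, since the paper does not say how ties are handled.

```python
import math

def log_likelihood(obs, T, E, pi):
    """Scaled forward algorithm: returns log P(O | lambda) for lambda = (T, E, pi)."""
    n_states = len(pi)
    alpha = [pi[j] * E[j][obs[0]] for j in range(n_states)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * T[i][j] for i in range(n_states)) * E[j][obs[t]]
                     for j in range(n_states)]
        c = sum(alpha)              # scaling factor for this step
        log_p += math.log(c)
        alpha = [a / c for a in alpha]
    return log_p

def classify(obs, cleavable_model, non_cleavable_model):
    """Pick the model with the higher log likelihood; also return the margin."""
    ll_c = log_likelihood(obs, *cleavable_model)
    ll_n = log_likelihood(obs, *non_cleavable_model)
    label = "cleavable" if ll_c > ll_n else "non-cleavable"
    return label, ll_c - ll_n

# Hypothetical toy models: the first favours symbol 0, the second symbol 2.
T_toy = [[0.9, 0.1], [0.1, 0.9]]
cleavable_toy = (T_toy, [[0.8, 0.1, 0.1], [0.6, 0.2, 0.2]], [0.5, 0.5])
non_cleavable_toy = (T_toy, [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]], [0.5, 0.5])

print(classify([0, 0, 0, 1], cleavable_toy, non_cleavable_toy)[0])
print(classify([2, 2, 1, 2], cleavable_toy, non_cleavable_toy)[0])
```

Returning the margin alongside the label is a design choice of this sketch: it makes it easy to see how decisively a sequence was assigned, which matters for borderline cases.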

3. RESULTS AND DISCUSSION

On the self consistency test the following results were obtained. Sensitivity, also called the "Hit-Rate" and defined as TP/(TP + FN) (refer to Table 2 for abbreviations), was 89.39% for the self consistency test. Specificity, defined as TP/(TP + FP), was obtained to be 98.33%. Matthew's correlation coefficient ([16]), calculated using

$$C = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

is obtained to be 0.92. A value of C approaching -1 indicates total disagreement of the data and +1 indicates total agreement of the data. The overall accuracy, defined by

$$\text{Recognition Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

in the self consistency test is obtained to be 97.32%. On the unseen data, the values obtained for sensitivity, specificity, Matthew's correlation coefficient and accuracy are respectively 100%, 75%, 0.84 and 95.24%. The results are summarized in Tables 2 and 3.

TABLE 2: RESULT OF SELF CONSISTENCY AND ON THE UNSEEN DATA.

                       Self-Consistency   Unseen Data
True Positive (TP)     59/60              51/54
True Negative (TN)     232/239            9/9
False Positive (FP)    1/60               3/54
False Negative (FN)    7/239              0/9

TABLE 3: SENSITIVITY, SPECIFICITY, MATTHEW'S CORRELATION COEFFICIENT AND OVERALL RECOGNITION ACCURACY OF SELF CONSISTENCY TEST AND ON THE UNSEEN DATA

                                    Self-Consistency   Unseen Data
Sensitivity                         89.39%             100%
Specificity                         98.33%             75%
Matthew's Correlation Coefficient   0.92               0.84
Recognition Accuracy                97.32%             95.24%
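The reported figures can be recomputed from the counts in Table 2. The sketch below is a reader's cross-check with illustrative names, not part of the paper. One observation from the arithmetic, not a statement made by the authors: the formula TP/(TP + FP) given in the text reproduces the 98.33% self consistency value, whereas the 75% reported for the unseen data matches TN/(TN + FP), the more conventional definition of specificity. Both quantities are computed so the two readings can be compared.

```python
import math

def metrics(tp, tn, fp, fn):
    """Evaluation measures from confusion counts, labelled as in Table 2."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity_as_in_text": tp / (tp + fp),   # formula given in the text
        "true_negative_rate": tn / (tn + fp),        # conventional specificity
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

self_consistency = metrics(tp=59, tn=232, fp=1, fn=7)
unseen = metrics(tp=51, tn=9, fp=3, fn=0)

for name, result in (("self consistency", self_consistency), ("unseen", unseen)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```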

Observing the probability emission matrix E, we can derive the following specificity rules:

1) Chance of cleavage is high if
   a) position 1 is Serine and position 5 is Proline
   b) position 4 is Phenylalanine or Tyrosine
   c) position 5 is Proline
   d) position 6 is Glutamic acid

2) Chance of cleavage is low if
   a) Isoleucine is present after position 5
   b) both position 1 and position 7 are Alanine
   c) position 8 is Leucine
   d) position 2 or position 8 is Glutamine

3) Very little chance of cleavage if Serine is present after the cleavage position
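The rules above can be written down as simple predicates on an octamer, which makes it easy to see which rules a given sequence triggers. This is a hedged sketch rather than the authors' procedure: it assumes positions 1 to 8 correspond to P4 through P4', reads "after position 5" in rule 2a as positions 6 to 8, and leaves out rule 3 because the exact position it refers to is not specified. The function names are illustrative.

```python
def _positions(octamer):
    """Pad the sequence so that p[1]..p[8] match the 1-based positions in the rules."""
    if len(octamer) != 8:
        raise ValueError("expected a sequence of exactly 8 residues")
    return " " + octamer.upper()

def high_chance_rules(octamer):
    """Labels of the 'high chance of cleavage' rules (1a-1d) that the octamer meets."""
    p = _positions(octamer)
    fired = []
    if p[1] == "S" and p[5] == "P":
        fired.append("1a")
    if p[4] in ("F", "Y"):
        fired.append("1b")
    if p[5] == "P":
        fired.append("1c")
    if p[6] == "E":
        fired.append("1d")
    return fired

def low_chance_rules(octamer):
    """Labels of the 'low chance of cleavage' rules (2a-2d) that the octamer meets."""
    p = _positions(octamer)
    fired = []
    if "I" in p[6:9]:               # assumption: positions 6, 7 and 8
        fired.append("2a")
    if p[1] == "A" and p[7] == "A":
        fired.append("2b")
    if p[8] == "L":
        fired.append("2c")
    if p[2] == "Q" or p[8] == "Q":
        fired.append("2d")
    return fired

print(high_chance_rules("HYGFPTYG"), low_chance_rules("HYGFPTYG"))
```

These predicates only summarise the qualitative rules; the classifier itself decides by comparing likelihoods under the two HMMs, so a sequence can trigger a high-chance rule and still be classified as non-cleavable.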

Rules 1-a and 1-b can be considered to be the most important as they characterize the HIV-1 protease specificity. HIV protease is unique in that it can cleave between either tyrosine or phenylalanine and proline. This is a very important fact because no human enzyme has the ability to do this. Although the emission matrix for the model (cleavage absent) does not lead to any specific rule for an amino acid in a particular position, the presence of Serine after the cleavage position does show the property of non-cleavage activity. The list of misclassified sequences is shown in Table 4. Although the sequence "AETFYVDK" is misclassified, we find that the difference in log likelihood was of the order of 0.01, which is very small compared to the log-likelihood of at least 10 for other sequences.

TABLE 4: MISCLASSIFIED SEQUENCES

Sequence    Actual Class   Predicted Class
AETFYVDK    Cleavable      Non-cleavable
HLVEALYL    Cleavable      Non-cleavable
HYGFPTYG    Cleavable      Non-cleavable

4. CONCLUSION

We provide a new approach based on HMM for learning the specificity rule of protease to cleave the protein precursor. The results, as can be seen from Matthew's correlation coefficient, which is close to 1 both for the self consistency test and on the unseen data, indicate close agreement of our model with the given data set. The predicted results are compared and they demonstrate that the HMM based approach is superior to the ones reported in the literature, indicating that it can be effectively used as an assistant for designing HIV protease inhibitors. The proposed method does not need knowledge of any physical or biochemical properties of the protease and is wholly dependent on the sequence of the protein. The algorithm can also be used on other proteases such as HIV-2 protease and Hepatitis-C Virus (HCV) with minor modifications.

REFERENCES

[1] Baldi P., Brunak S. (2000) Bioinformatics - The Machine Learning Approach. 2nd Edition. The MIT Press, Massachusetts.
[2] Zachary Q. Beck, Ying-Chuan Lin, John H. Elder (2003) "Molecular Basis for the Relative Substrate Specificity of Human Immunodeficiency Virus Type 1 and Feline Immunodeficiency Virus Proteases". Journal of Virology, 75-19, 9458-9469.
[3] Brik A. and Wong C.-H. (2003) "HIV-1 Protease: mechanism and drug discovery". Organic and Biomolecular Chemistry, 1, 5-14.
[4] Bernd Bühler, Ying-Chuan Lin, Garrett Morris, Arthur J. Olson, Chi-Huey Wong, Douglas D. Richman, John H. Elder and Bruce E. Torbett (2001) "Viral Evolution in Response to the Broad-Based Retroviral Protease Inhibitor TL-3 Proteases". Journal of Virology, 75-19, 9502-9508.
[5] Cai Y.-D., Chou K.C. (1998) "Artificial Neural Network Model for Predicting HIV Protease Cleavage Sites in Protein". Advances in Engineering Software, 29, 119-128.
[6] Cai Y.-D., Liu X.J., Xu X.B. and Chou K.C. (2002) "Support Vector Machines for Predicting HIV Protease Cleavage Sites in Protein". Journal of Computational Chemistry, 23, 267-274.
[7] Chou K.C. (1996) "Prediction of Human Immunodeficiency Virus protease cleavage sites in Proteins". Analytical Biochemistry, 233, 1-14.
[8] Helene C. F. Cote, Zabrina L. Brumme and P. Richard Harrigan (2000) "Human Immunodeficiency Virus Type 1 Protease Cleavage Site Mutations Associated with Protease Inhibitor Cross-Resistance Selected by Indinavir, Ritonavir, and/or Saquinavir". Journal of Virology, 75-2, 589-594.
[9] Tulio de Oliveira, Susan Engelbrecht, Estrelita Janse van Rensburg, Michelle Gordon, Karen Bishop, Jan zur Megede, Susan W. Barnett and Sharon Cassol (2003) "Variability at Human Immunodeficiency Virus Type 1 Subtype C Protease Cleavage Sites: an Indication of Viral Fitness?". Journal of Virology, 77-17, 9422-9430.
[10] Sorin Drăghici and R. Brian Potter (2003) "Predicting HIV drug resistance with neural networks". Bioinformatics, 19-1, 98-107.
[11] Durbin R., Eddy S. R., Krogh A. and Mitchison G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
[12] Sean R. Eddy (1998) "Profile Hidden Markov Models". Bioinformatics, 14-9.
[13] David S. Goodsell (2002) "A hierarchical model of HIV-1 protease drug resistance". Applied Bioinformatics, 1-1, 3-12.
[14] Hughey R. and Krogh A. (1996) "Hidden Markov models for sequence analysis: Extension and analysis of the basic method". CABIOS, 12-2, 95-107.
[15] A. Krogh, M. Brown, I. S. Mian, K. Sjolander and D. Haussler (1994) "Hidden Markov models in computational biology: Applications to protein modeling". Journal of Molecular Biology, 235, 1501-1531.
[16] Matthews B. W. (1975) "Comparison of the Predicted and Observed Secondary Structure of T4 phage Lysozyme". Biochim. Biophys. Acta, 405, 442-451.
[17] Ajit Narayanan, Xikun Wu and Z. Rong Yang (2002) "Mining Viral Protease Data to Extract Cleavage Knowledge". Bioinformatics, 18-Suppl 1, S5-S11.
[18] Peshkin L., Gelfand M. S. (1999) "Segmentation of Yeast DNA using Hidden Markov Models". Bioinformatics, 15-12, 980-986.
[19] Rabiner L. R. (1989) "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proc. IEEE, 77, 257-286.
[20] Todd W. Ridky, Craig E. Cameron, John Cameron, Jonathan Leis, Terry Copeland, Alexander Wlodawer, Irene T. Weber and Robert W. Harrison (2001) "Molecular Basis for AMV/RSV and HIV-1 PR Specificity". FEBS Letters, lSl-1, 77-80.
[21] Rognvaldsson, Thorsteinn and You, Liwen (2004) "Why neural networks should not be used for HIV-1 protease cleavage site prediction". Bioinformatics, In Press.
[22] Schechter I. and Berger A. (1967) "On the size of active sites in proteases". Biochem. Biophys. Res. Commun., 27, 157-162.
[23] Tözsér J., Copeland T. D., Wondrak E. M. and Oroszlan S. (1991) "Comparison of the HIV-1 and HIV-2 proteinases using oligopeptide substrates representing cleavage sites in Gag and Gag-Pol polyproteins". Bioinformatics, 15-12, 980-986.
