1 computational analysis of protein-dna interactions changhui (charles) yan department of computer...

1

Computational Analysis of Protein-DNA Interactions

Changhui (Charles) Yan

Department of Computer Science

Utah State University

2

Problem I

Identifying amino acid residues involved in protein-DNA interactions from sequence

3

Materials And Methods

56 double-stranded DNA binding proteins previously used in the study of Jones et al. (2003)

Encoding

4

Materials And Methods

5

Leave-one-out cross-validation

Naïve Bayes

Naïve Bayes Classifier
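A minimal sketch of protein-level leave-one-out cross-validation, as named on this slide: each protein is held out in turn and a model trained on the remaining proteins predicts its residues. The `train_majority` and `predict_majority` helpers are hypothetical placeholders (a majority-class baseline), not the classifier used in the study; any train/predict pair with the same shape could be plugged in.

```python
from typing import Callable, List, Tuple

# (sequence, per-residue labels: 1 = interface residue, 0 = non-interface)
Protein = Tuple[str, List[int]]

def leave_one_out(proteins: List[Protein],
                  train: Callable[[List[Protein]], object],
                  predict: Callable[[object, str], List[int]]):
    """Hold out each protein once; return (actual, predicted) label lists."""
    results = []
    for i, (seq, labels) in enumerate(proteins):
        training_set = proteins[:i] + proteins[i + 1:]  # everything but protein i
        model = train(training_set)
        results.append((labels, predict(model, seq)))
    return results

# Hypothetical baseline: predict the majority class seen in training.
def train_majority(training_set: List[Protein]) -> int:
    all_labels = [l for _, labels in training_set for l in labels]
    return 1 if 2 * sum(all_labels) > len(all_labels) else 0

def predict_majority(model: int, seq: str) -> List[int]:
    return [model] * len(seq)

toy = [("AK", [0, 1]), ("KR", [1, 1]), ("GA", [0, 0])]
folds = leave_one_out(toy, train_majority, predict_majority)
```

With the 56 proteins of the dataset this loop would run 56 times, so no residue of the test protein is ever seen during training.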

6

Naïve Bayes

\[
\frac{P(c=1 \mid X = x_1 x_2 \ldots x_n)}{P(c=0 \mid X = x_1 x_2 \ldots x_n)}
= \frac{P(c=1)\,\prod_{i=1}^{n} P(x_i \mid c=1)}{P(c=0)\,\prod_{i=1}^{n} P(x_i \mid c=0)}
\]

Naïve Bayes Classifier

Leave-one-out cross-validation
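The Naïve Bayes likelihood ratio on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the study's implementation: the slides do not give the encoding details, so the sliding window of amino acid identities, the window size, the `-` padding symbol, and the add-one smoothing (`alpha`) are all hypothetical choices. A residue would be predicted as interface when the log ratio is positive.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed padding symbol for window positions beyond the sequence ends

def windows(seq, size):
    """One window of `size` residues centred on each residue of `seq`."""
    half = size // 2
    padded = PAD * half + seq + PAD * half
    return [padded[i:i + size] for i in range(len(seq))]

def train(examples, size):
    """examples: iterable of (window, label) with label 1 = interface, 0 = not."""
    class_counts = Counter()
    position_counts = {0: [Counter() for _ in range(size)],
                       1: [Counter() for _ in range(size)]}
    for window, label in examples:
        class_counts[label] += 1
        for i, symbol in enumerate(window):
            position_counts[label][i][symbol] += 1
    return class_counts, position_counts

def log_odds(model, window, alpha=1.0):
    """log [P(c=1) prod P(x_i|c=1)] - log [P(c=0) prod P(x_i|c=0)]."""
    class_counts, position_counts = model
    n_symbols = len(AMINO_ACIDS) + 1  # 20 amino acids plus the padding symbol
    score = math.log(class_counts[1]) - math.log(class_counts[0])  # prior ratio
    for i, symbol in enumerate(window):
        p1 = (position_counts[1][i][symbol] + alpha) / (class_counts[1] + alpha * n_symbols)
        p0 = (position_counts[0][i][symbol] + alpha) / (class_counts[0] + alpha * n_symbols)
        score += math.log(p1) - math.log(p0)
    return score

# Toy data: a five-residue sequence with made-up interface labels.
seq, labels = "AKRKA", [0, 1, 1, 1, 0]
toy_windows = windows(seq, 3)
model = train(zip(toy_windows, labels), 3)
```

Working in log space avoids underflow when many per-position probabilities are multiplied.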

7

Leave-One-Out Cross-Validations

                          Sequence-based                   Sequence/structure-based
                          Identities (ID)   ID + entropy   ID + rASA   ID + rASA + entropy
Correlation coefficient   0.25              0.29           0.28        0.30
Accuracy (%)              77                75             76          77
Specificity+ (%)          37                37             36          39
Sensitivity+ (%)          43                53             51          52

8

Predictions in The Context of 3-D Structures

Pit-1, PDB 1au7

TP: 30  FP: 16  TN: 86  FN: 14

CC: 0.51 (2nd)  Accuracy: 79%

Figure panels: Predicted, Actual
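The per-protein measures can be recomputed from the TP/FP/TN/FN counts. The slides do not define the measures, so the formulas below are assumptions: specificity+ is taken as TP/(TP+FP), sensitivity+ as TP/(TP+FN), and the correlation coefficient as the Matthews correlation coefficient. Small differences from the values on the slides are therefore possible.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, specificity+, sensitivity+ and correlation coefficient (assumed definitions)."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "specificity_plus": tp / (tp + fp),   # fraction of predicted interface residues that are real
        "sensitivity_plus": tp / (tp + fn),   # fraction of real interface residues that are predicted
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Counts reported for Pit-1 (PDB 1au7).
pit1 = confusion_metrics(tp=30, fp=16, tn=86, fn=14)
```

For the Pit-1 counts the accuracy is 116/146, about 79%, in line with the slide.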

9

Predictions in The Context of 3-D Structures

λ-Cro, PDB 6cro

TP: 10  FP: 5  TN: 34  FN: 10

CC: 0.37 (19th)  Accuracy: 73%

Figure panels: Predicted, Actual

10

Predictions Compared With PROSITE Motifs

Predicted binding sites substantially overlap with 34 of the 37 “DNA-binding” PROSITE motifs

In 52 of the 56 proteins, the predictor identifies at least 20% of the DNA-binding residues

28 of the 56 proteins contain no PROSITE motifs that are annotated as “DNA-binding”

11

Comparison With Previous Study

Method                    Naïve Bayes classifier   Ahmad and Sarai method*
Correlation coefficient   0.26                     0.23
Accuracy (%)              80                       66
Specificity+ (%)          29                       21
Sensitivity+ (%)          48                       68

*Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.

12

Summary

A simple sequence-based Naïve Bayes classifier predicts interface residues in DNA-binding proteins with 75% accuracy, 37% specificity+, 53% sensitivity+, and a correlation coefficient of 0.29

Predicted binding sites

correctly indicate the locations of actual binding sites

substantially overlap with known PROSITE motifs

13

Problem II

Identification of Helix-Turn-Helix (HTH) DNA-binding motifs

14

HTH Motifs

Sequences sharing low similarity can fold into a similar HTH structure

Identifying HTH motifs from sequence is extremely challenging

15

Trick 1

Including more information

Amino acid sequence

Secondary structure

16

Hidden Markov Model (HMM)

LQQITHIANQL-GLE----KDVVRVWF

17

Hidden Markov Model (HMM_AA_SS)

LQQITHIANQL-GLE----KDVVRVWF

HHHEEHEEEHMHE----HHEEMMEH
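One way an HMM could use both the amino acid sequence and the secondary structure is to let each state emit a pair of symbols. The sketch below scores a pair of equal-length strings with the forward algorithm. It is an illustration only: the slides do not say how HMM_AA_SS combines the two tracks, so the factored emission P(aa | state) · P(ss | state), the two-state topology, and every probability in the toy model are assumptions.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(aa_seq, ss_seq, start, trans, emit_aa, emit_ss):
    """Log P(aa_seq, ss_seq | model) for equal-length strings.

    Assumes all probabilities used are strictly positive.
    """
    n_states = len(start)

    def emit(state, aa, ss):
        # Assumed factored emission: amino acid and secondary structure
        # are independent given the hidden state.
        return math.log(emit_aa[state][aa]) + math.log(emit_ss[state][ss])

    alpha = [math.log(start[s]) + emit(s, aa_seq[0], ss_seq[0])
             for s in range(n_states)]
    for aa, ss in zip(aa_seq[1:], ss_seq[1:]):
        alpha = [log_sum_exp([alpha[p] + math.log(trans[p][s])
                              for p in range(n_states)]) + emit(s, aa, ss)
                 for s in range(n_states)]
    return log_sum_exp(alpha)

# Toy two-state model with uniform parameters (hypothetical values).
start = [0.5, 0.5]
trans = [[0.5, 0.5], [0.5, 0.5]]
emit_aa = [{"A": 0.5, "K": 0.5}, {"A": 0.5, "K": 0.5}]
emit_ss = [{"H": 0.5, "E": 0.5}, {"H": 0.5, "E": 0.5}]
ll = forward_log_likelihood("AK", "HE", start, trans, emit_aa, emit_ss)
```

In this uniform toy model every position contributes a factor of 0.25, so the two-residue example has likelihood 0.0625; a trained profile HMM would instead carry position-specific emission tables.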

18

Trick 2Trick 2

There are similarities among the 20 naturally occurring amino acids

Reduced alphabets

19

Reduced Alphabets

Schemes for reducing amino acid alphabet based on the BLOSUM50 matrix by Henikoff and Henikoff (1992), derived by grouping and averaging the similarity matrix elements as described in the text. (Murphy et al. 2000)
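Re-encoding a sequence with a reduced alphabet amounts to a lookup table that maps each amino acid to a representative of its group. The sketch below shows the mechanics. The 15-group scheme listed here is an assumption recalled from the literature and is not taken from these slides; it should be checked against Murphy et al. (2000) before use. The function names are hypothetical.

```python
# Assumed 15-letter grouping (verify against Murphy et al. 2000).
MURPHY_15_GROUPS = ["LVIM", "C", "A", "G", "S", "T", "P", "FY",
                    "W", "E", "D", "N", "Q", "KR", "H"]

def build_mapping(groups):
    """Map every amino acid to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}

def reduce_sequence(seq, mapping):
    """Re-encode a sequence; gap characters and unknown symbols pass through."""
    return "".join(mapping.get(ch, ch) for ch in seq)

mapping = build_mapping(MURPHY_15_GROUPS)
reduced = reduce_sequence("LQQITHIANQL-GLE----KDVVRVWF", mapping)
```

Under this grouping, I, V and M collapse onto L and R collapses onto K, so sequences that differ only by such substitutions receive the same encoding.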

20

Cross-Families Evaluations

Model                     True Positive¹   False Positive²
HMM_AA                    3                0
HMM_AA_SS(20 letters)³    227              0
HMM_AA_SS(Murphy_15)³     474              0
[label missing]           470              3
[label missing]           431              5

1. True positive: HTH motifs that are correctly identified as such.
2. False positive: Non-HTH motifs that are identified as HTH motifs.
3. The alphabet used to encode amino acid sequences.

21

Questions

22

Within-family Three-Fold Cross-Validations

Family (number of HTH motifs in the family)   HMM_AA   HMM_AA_SS(Murphy_15)
PF00126 (1635)                                1594     1622
PF00165 (90)                                  63       80
PF00196 (30)                                  26       30
PF04545 (164)                                 137      164
PF01022 (42)                                  39       39
PF00046 (189)                                 176      188
PF03965 (48)                                  48       48

23

Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations

Total HTH motifs: 563

Recognized by both FFAS03 and HMM_AA_SS: 135

Recognized by FFAS03 only: 24

Recognized by HMM_AA_SS only: 71
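The per-method totals implied by these counts follow from simple addition. The short sketch below derives them; the function and key names are hypothetical, and the only inputs are the four numbers on this slide.

```python
def overlap_summary(total, both, first_only, second_only):
    """Per-method totals and the number of motifs recognized by neither method."""
    return {
        "first_total": both + first_only,
        "second_total": both + second_only,
        "neither": total - both - first_only - second_only,
    }

# FFAS03 is the first method, HMM_AA_SS the second.
summary = overlap_summary(total=563, both=135, first_only=24, second_only=71)
```

On these numbers FFAS03 recognizes 159 motifs and HMM_AA_SS recognizes 206, with 333 of the 563 recognized by neither.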
