Post on 20-Dec-2015
1
Computational Analysis of Protein-DNA Interactions
Changhui (Charles) Yan
Department of Computer Science
Utah State University
2
Problem I
Identifying amino acid residues involved in protein-DNA interactions from sequence
3
Materials And Methods
56 double-stranded DNA-binding proteins previously used in the study of Jones et al. (2003)
Encoding
4
Materials And Methods
5
Leave-one-out cross-validation
Naïve Bayes
Naïve Bayes Classifier
6
Naïve Bayes
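\[
\frac{P(c = 1 \mid X = x_1 x_2 \ldots x_n)}{P(c = 0 \mid X = x_1 x_2 \ldots x_n)}
= \frac{P(c = 1) \prod_{i=1}^{n} P(x_i \mid c = 1)}{P(c = 0) \prod_{i=1}^{n} P(x_i \mid c = 0)}
\]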
Naïve Bayes Classifier
Leave-one-out cross-validation
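The Naïve Bayes odds ratio on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the class name `NaiveBayesOdds`, the use of short residue windows as the features x_1 ... x_n, the Laplace pseudocount, and the decision threshold of 0 on the log-odds are all assumptions made for the example.

```python
import math
from collections import defaultdict


class NaiveBayesOdds:
    """Binary Naive Bayes over fixed-length windows of categorical symbols."""

    def __init__(self, alphabet, pseudocount=1.0):
        self.alphabet = list(alphabet)
        self.pseudocount = pseudocount  # Laplace smoothing (assumed, not from the slides)

    def fit(self, windows, labels):
        n = len(windows[0])
        self.class_counts = {0: 0, 1: 0}
        # counts[c][i][s] = number of class-c windows with symbol s at position i
        self.counts = {c: [defaultdict(int) for _ in range(n)] for c in (0, 1)}
        for window, c in zip(windows, labels):
            self.class_counts[c] += 1
            for i, symbol in enumerate(window):
                self.counts[c][i][symbol] += 1
        return self

    def _log_cond(self, symbol, i, c):
        # Smoothed estimate of log P(x_i = symbol | c)
        num = self.counts[c][i][symbol] + self.pseudocount
        den = self.class_counts[c] + self.pseudocount * len(self.alphabet)
        return math.log(num / den)

    def log_odds(self, window):
        # log of [P(c=1) prod_i P(x_i|c=1)] / [P(c=0) prod_i P(x_i|c=0)]
        # Both classes must appear in the training data, otherwise log(0) is hit.
        score = math.log(self.class_counts[1]) - math.log(self.class_counts[0])
        for i, symbol in enumerate(window):
            score += self._log_cond(symbol, i, 1) - self._log_cond(symbol, i, 0)
        return score

    def predict(self, window, threshold=0.0):
        # Label 1 (interface residue) when the odds ratio exceeds the threshold
        return 1 if self.log_odds(window) > threshold else 0


# Toy data (hypothetical): basic windows labeled 1, hydrophobic windows labeled 0
windows = ["KRK", "RKR", "KKR", "ALV", "LVA", "VAL"]
labels = [1, 1, 1, 0, 0, 0]
model = NaiveBayesOdds("AKLRV").fit(windows, labels)
print(model.predict("KRR"), model.predict("LAV"))
```

Working in log space avoids numerical underflow when n is large. In a leave-one-out setting, `fit` would be called on all proteins except one and `predict` applied to the windows of the held-out protein.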
7
Leave-One-Out Cross-Validations
                          Sequence-based              Sequence/structure-based
                          ID      ID + entropy        ID + rASA    ID + rASA + entropy
Correlation coefficient   0.25    0.29                0.28         0.30
Accuracy (%)              77      75                  76           77
Specificity+ (%)          37      37                  36           39
Sensitivity+ (%)          43      53                  51           52
ID = identities
8
Predictions in The Context of 3-D Structures
Pit-1, PDB 1au7
TP: 30, FP: 16, TN: 86, FN: 14
CC: 0.51 (2nd), Accuracy: 79%
[Figure panels: Predicted, Actual]
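The performance measures quoted on these slides can be computed from the four confusion counts. The sketch below is hedged: the slides do not define the measures, so it assumes specificity+ = TP/(TP+FP), sensitivity+ = TP/(TP+FN), and the Matthews form of the correlation coefficient. The function name `binary_metrics` is hypothetical, and the input counts are the Pit-1 values from the slide.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, specificity+, sensitivity+ and CC from confusion counts.

    The definitions are assumptions; the slides do not spell them out.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    specificity_plus = tp / (tp + fp)   # fraction of predicted positives that are correct
    sensitivity_plus = tp / (tp + fn)   # fraction of actual positives that are recovered
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "specificity_plus": specificity_plus,
        "sensitivity_plus": sensitivity_plus,
        "cc": cc,
    }


# Counts for Pit-1 (PDB 1au7) as listed on the slide
pit1 = binary_metrics(tp=30, fp=16, tn=86, fn=14)
for name, value in pit1.items():
    print(f"{name}: {value:.3f}")
```

For these counts the accuracy is 116/146, about 79%, in line with the slide. Small differences from other reported values may come from rounding or from a slightly different formula in the original study.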
9
Predictions in The Context of 3-D Structures
λ-Cro, PDB 6cro
TP: 10, FP: 5, TN: 34, FN: 10
CC: 0.37 (19th), Accuracy: 73%
[Figure panels: Predicted, Actual]
10
Predictions Compared With PROSITE Motifs
Predicted binding sites substantially overlap with 34 of the 37 “DNA-binding” PROSITE motifs
In 52 of the 56 proteins, the predictor identifies at least 20% of the DNA-binding residues
28 of the 56 proteins contain no PROSITE motifs that are annotated as “DNA-binding”
11
Comparison With Previous Study
Method                    Naïve Bayes classifier    Ahmad and Sarai method*
Correlation coefficient   0.26                      0.23
Accuracy (%)              80                        66
Specificity+ (%)          29                        21
Sensitivity+ (%)          48                        68
*Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.
12
Summary
A simple sequence-based Naïve Bayes classifier predicts interface residues in DNA-binding proteins with 75% accuracy, 37% specificity+, 53% sensitivity+ and a correlation coefficient of 0.29
Predicted binding sites
correctly indicate the locations of actual binding sites
substantially overlap with known PROSITE motifs
13
Problem II
Identification of Helix-Turn-Helix (HTH) DNA-binding motifs
14
HTH Motifs
Sequences with low similarity can fold into similar HTH structures
Identifying HTH motifs from sequence is extremely challenging
15
Trick 1
Including more information
Amino acid sequence
Secondary structure
16
Hidden Markov Model (HMM)
LQQITHIANQL-GLE----KDVVRVWF
17
Hidden Markov Model (HMM_AA_SS)
LQQITHIANQL-GLE----KDVVRVWF
HHHEEHEEEHMHE----HHEEMMEH
18
Trick 2
There are similarities among the 20 naturally occurring amino acids
Reduced alphabets
19
Reduced Alphabets
Schemes for reducing amino acid alphabet based on the BLOSUM50 matrix by Henikoff and Henikoff (1992) derived by grouping and averaging the similarity matrix elements as described in the text. (Murphy et al. 2000)
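To make the idea of a reduced alphabet concrete, the sketch below maps each amino acid to a representative letter of its group. The groupings are the ones commonly quoted for the 15-, 10- and 8-letter schemes of Murphy et al. (2000) and should be checked against that paper. The names `MURPHY_GROUPS`, `build_mapping` and `reduce_sequence` are hypothetical, and using the first letter of each group as its representative is an arbitrary choice.

```python
# Groupings as commonly quoted from Murphy et al. (2000); verify before relying on them.
MURPHY_GROUPS = {
    15: ["LVIM", "C", "A", "G", "S", "T", "P", "FY", "W", "E", "D", "N", "Q", "KR", "H"],
    10: ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"],
    8: ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"],
}


def build_mapping(groups):
    """Map every amino acid to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}


def reduce_sequence(seq, size):
    """Re-encode a sequence in a reduced alphabet; gaps and unknown symbols pass through."""
    mapping = build_mapping(MURPHY_GROUPS[size])
    return "".join(mapping.get(ch, ch) for ch in seq)


# The aligned example sequence from the HMM slide, re-encoded with 10 letters
reduced = reduce_sequence("LQQITHIANQL-GLE----KDVVRVWF", 10)
print(reduced)
```

Merging similar residues reduces the number of emission parameters an HMM has to estimate, which is one plausible reason a reduced alphabet could help when training sequences share little similarity.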
20
Cross-Families Evaluations
Method                       True positives (1)    False positives (2)
HMM_AA                       3                     0
HMM_AA_SS (20 letters) (3)   227                   0
HMM_AA_SS (Murphy_15) (3)    474                   0
HMM_AA_SS (Murphy_10) (3)    470                   3
HMM_AA_SS (Murphy_8) (3)     431                   5
(1) True positive: HTH motifs that are correctly identified as such.
(2) False positive: Non-HTH motifs that are identified as HTH motifs.
(3) The alphabet used to encode amino acid sequences.
21
Questions
22
Within-family Three-Fold Cross-Validations
Family (number of HTH motifs in the family)    HMM_AA    HMM_AA_SS (Murphy_15)
PF00126 (1635)                                 1594      1622
PF00165 (90)                                   63        80
PF00196 (30)                                   26        30
PF04545 (164)                                  137       164
PF01022 (42)                                   39        39
PF00046 (189)                                  176       188
PF03965 (48)                                   48        48
23
Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations
Total HTH motifs                              563
Recognized by both FFAS03 and HMM_AA_SS       135
Recognized by FFAS03 only                     24
Recognized by HMM_AA_SS only                  71
24
Putative HTH motifs in Ureaplasma parvum
Protein                   Location    Annotation from Uniprot
sp|Q9PQE5|SCPB_UREPA      176-214     Participates to chromosomal partition during cell division
sp|Q9PQV6|RPOB_UREPA      540-587     DNA-directed RNA polymerase
sp|Q9PR27|SYY_UREPA       340-380     Tyrosyl-tRNA synthetase
sp|Q9PQC2|SYA_UREPA       217-265     Alanyl-tRNA synthetase
sp|Q9PQ74|DPO3A_UREPA     365-400     DNA polymerase III subunit alpha
sp|Q9PQX7|Y166_UREPA      507-553     Hypothetical protein