
Computational Analysis of Protein-DNA Interactions

Changhui (Charles) Yan

Department of Computer Science

Utah State University

Problem I

Identifying amino acid residues involved in protein-DNA interactions from sequence

Materials And Methods

56 double-stranded DNA binding proteins previously used in the study of Jones et al. (2003)

Encoding


Leave-one-out cross-validation
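
A minimal sketch of leave-one-out cross-validation, assuming the hold-out unit is a whole protein (the slides do not say whether proteins or residues are held out). The names `leave_one_out`, `train_fn`, and `predict_fn` and the toy majority-label model are hypothetical placeholders, not the classifier used in the study.

```python
from typing import Callable, List, Sequence, Tuple

# One protein = list of (residue, label) pairs; label 1 marks an interface residue.
Protein = List[Tuple[str, int]]

def leave_one_out(proteins: Sequence[Protein],
                  train_fn: Callable[[List[Protein]], object],
                  predict_fn: Callable[[object, str], int]) -> List[List[int]]:
    """Hold out each protein once, train on the rest, predict the held-out one."""
    all_predictions = []
    for i, held_out in enumerate(proteins):
        training = [p for j, p in enumerate(proteins) if j != i]
        model = train_fn(training)
        all_predictions.append([predict_fn(model, res) for res, _ in held_out])
    return all_predictions

# Toy stand-in model (assumption): always predict the majority training label.
def train_majority(training: List[Protein]) -> int:
    labels = [lab for prot in training for _, lab in prot]
    return int(sum(labels) * 2 > len(labels))

def predict_majority(model: int, residue: str) -> int:
    return model

toy = [[("A", 0), ("K", 1)], [("R", 1), ("G", 0), ("L", 0)], [("K", 1)]]
preds = leave_one_out(toy, train_majority, predict_majority)
print(preds)
```

With 56 proteins the loop would train 56 models, each evaluated only on the protein it never saw.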

Naïve Bayes Classifier

$$\frac{P(c=1 \mid X = x_1 \ldots x_n)}{P(c=0 \mid X = x_1 \ldots x_n)} = \frac{P(c=1)\prod_{i=1}^{n} P(X_i = x_i \mid c=1)}{P(c=0)\prod_{i=1}^{n} P(X_i = x_i \mid c=0)}$$
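
The ratio above can be sketched in code as follows. This is an illustration under stated assumptions, not the study's implementation: each residue is encoded as a window of neighbouring amino acid identities, the window width, the `-` padding symbol, and the add-one pseudocount are all hypothetical choices, and the class name `NaiveBayesSketch` is invented for the example.

```python
import math
from collections import defaultdict

N_SYMBOLS = 21  # 20 amino acids plus the '-' padding symbol (assumption)

def windows(seq, half_width):
    """Return one window of 2*half_width+1 symbols centred on each residue."""
    padded = "-" * half_width + seq + "-" * half_width
    return [padded[i:i + 2 * half_width + 1] for i in range(len(seq))]

class NaiveBayesSketch:
    def __init__(self, half_width=1, pseudocount=1.0):
        self.half_width = half_width
        self.alpha = pseudocount
        self.class_counts = {0: 0, 1: 0}
        self.counts = defaultdict(float)  # (class, position, symbol) -> count

    def fit(self, sequences, labels):
        for seq, labs in zip(sequences, labels):
            for win, c in zip(windows(seq, self.half_width), labs):
                self.class_counts[c] += 1
                for pos, sym in enumerate(win):
                    self.counts[(c, pos, sym)] += 1
        return self

    def log_ratio(self, window):
        """log of P(c=1) prod P(x_i|c=1) over P(c=0) prod P(x_i|c=0)."""
        score = 0.0
        for c, sign in ((1, 1.0), (0, -1.0)):
            total = self.class_counts[c]
            # Unnormalised smoothed prior; the normaliser cancels in the ratio.
            log_joint = math.log(total + self.alpha)
            for pos, sym in enumerate(window):
                count = self.counts.get((c, pos, sym), 0.0)
                log_joint += math.log((count + self.alpha) /
                                      (total + self.alpha * N_SYMBOLS))
            score += sign * log_joint
        return score

    def predict(self, seq):
        # Label a residue as interface (1) when the ratio exceeds 1, i.e. log > 0.
        return [int(self.log_ratio(w) > 0.0) for w in windows(seq, self.half_width)]

model = NaiveBayesSketch(half_width=1).fit(["AKKA", "GKKG"],
                                           [[0, 1, 1, 0], [0, 1, 1, 0]])
prediction = model.predict("AKA")
print(prediction)
```

Working in log space avoids numerical underflow when the window is wide, and the decision threshold of 0 could be shifted to trade specificity+ against sensitivity+.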

Leave-One-Out Cross-Validations

                          Sequence-based              Sequence/structure-based
                          Identities   ID + entropy   ID + rASA   ID + rASA + entropy
Correlation coefficient   0.25         0.29           0.28        0.30
Accuracy (%)              77           75             76          77
Specificity+ (%)          37           37             36          39
Sensitivity+ (%)          43           53             51          52

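Predictions in the Context of 3-D Structures

Pit-1, PDB 1au7
TP: 30  FP: 16  TN: 86  FN: 14  CC: 0.51 (2nd)  Accuracy: 79%
[Figure: predicted and actual binding residues shown on the structure]

λ-Cro, PDB 6cro
TP: 10  FP: 5  TN: 34  FN: 10  CC: 0.37 (19th)  Accuracy: 73%
[Figure: predicted and actual binding residues shown on the structure]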
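
The summary measures can be derived from the four confusion counts. The sketch below uses the Pit-1 counts reported above and assumes the usual definitions: accuracy as the fraction of correct predictions, CC as the Matthews correlation coefficient, specificity+ as TP/(TP+FP), and sensitivity+ as TP/(TP+FN). The slides do not spell these definitions out, so treat them as assumptions; the function name `binary_metrics` is hypothetical.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, CC, specificity+ and sensitivity+ from confusion counts."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        # Matthews correlation coefficient; defined as 0 when a margin is empty.
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "specificity_plus": tp / (tp + fp),   # assumed definition
        "sensitivity_plus": tp / (tp + fn),   # assumed definition
    }

pit1 = binary_metrics(tp=30, fp=16, tn=86, fn=14)
for name, value in pit1.items():
    print(f"{name}: {value:.3f}")
```

For Pit-1 this reproduces the 79% accuracy shown on the slide.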

Predictions Compared With PROSITE Motifs

Predicted binding sites substantially overlap with 34 of the 37 “DNA-binding” PROSITE motifs

In 52 of the 56 proteins, the predictor identifies at least 20% of the DNA-binding residues

28 of the 56 proteins contain no PROSITE motifs that are annotated as “DNA-binding”

Comparison With Previous Study

Method                    Naïve Bayes classifier   Ahmad and Sarai method*
Correlation coefficient   0.26                     0.23
Accuracy (%)              80                       66
Specificity+ (%)          29                       21
Sensitivity+ (%)          48                       68

*Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.

Summary

A simple sequence-based Naïve Bayes classifier predicts interface residues in DNA-binding proteins with 75% accuracy, 37% specificity+, 53% sensitivity+, and a correlation coefficient of 0.29.

Predicted binding sites correctly indicate the locations of actual binding sites and substantially overlap with known PROSITE motifs.

Problem II

Identification of Helix-Turn-Helix (HTH) DNA-binding motifs

HTH Motifs

Sequences sharing low similarities can fold into a similar HTH structure

Identifying HTH motifs from sequence is extremely challenging

Trick 1

Including more information: amino acid sequence and secondary structure

Hidden Markov Model (HMM)

LQQITHIANQL-GLE----KDVVRVWF

Hidden Markov Model (HMM_AA_SS)

LQQITHIANQL-GLE----KDVVRVWF
HHHEEHEEEHMHE----HHEEMMEH
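
One way to feed both kinds of information to a single model is to pair each aligned amino acid with its secondary-structure symbol. The sketch below is only an illustration of that idea: the slides do not state how HMM_AA_SS combines the two tracks, the function name `pair_symbols` is hypothetical, and the short input strings are made up rather than taken from the slide.

```python
from typing import List

def pair_symbols(aa: str, ss: str) -> List[str]:
    """Combine aligned amino acid and secondary-structure strings column by column."""
    if len(aa) != len(ss):
        raise ValueError("the two tracks must be aligned to the same length")
    return [a + s for a, s in zip(aa, ss)]

# Made-up aligned fragment: amino acids on top, secondary structure below.
paired = pair_symbols("LQQ-", "HHE-")
print(paired)
```

An HMM trained on such composite symbols would have a larger emission alphabet than one trained on amino acids alone, which is one motivation for shrinking the amino acid alphabet.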

Trick 2

There are similarities among the 20 naturally occurring amino acids

Reduced alphabets

Reduced Alphabets

Schemes for reducing amino acid alphabet based on the BLOSUM50 matrix by Henikoff and Henikoff (1992) derived by grouping and averaging the similarity matrix elements as described in the text. (Murphy et al. 2000)
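
A reduced alphabet can be applied as a simple character mapping. The sketch below assumes the 15-letter grouping commonly quoted from Murphy et al. (2000); the grouping is written from memory and should be checked against the paper before use. The names `MURPHY_15_GROUPS`, `build_mapping`, and `reduce_sequence` are hypothetical, and each group is represented by its first letter.

```python
# Assumed Murphy_15 grouping (verify against Murphy et al. 2000).
MURPHY_15_GROUPS = ["LVIM", "C", "A", "G", "S", "T", "P", "FY",
                    "W", "E", "D", "N", "Q", "KR", "H"]

def build_mapping(groups):
    """Map every amino acid to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}

def reduce_sequence(seq, mapping):
    """Rewrite a sequence in the reduced alphabet; gaps and unknowns pass through."""
    return "".join(mapping.get(ch, ch) for ch in seq)

MURPHY_15 = build_mapping(MURPHY_15_GROUPS)
reduced = reduce_sequence("LQQITHIANQL-GLE----KDVVRVWF", MURPHY_15)
print(reduced)
```

Merging similar residues lowers the number of emission parameters the HMM has to estimate, which can help when a family has few training motifs.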

Cross-Families Evaluations

True Positive (1)   False Positive (2)

HMM_AA 3 0

HMM_AA_SS (20 letters) (3)

HMM_AA_SS (Murphy_15) (3)

(1) True positive: HTH motifs that are correctly identified as such.
(2) False positive: Non-HTH motifs that are identified as HTH motifs.
(3) The alphabet used to encode amino acid sequences.

Questions

Within-Family Three-Fold Cross-Validations

Family (number of HTH motifs in the family)   HMM_AA   HMM_AA_SS (Murphy_15)
PF00126 (1635)                                1594     1622
PF00165 (90)                                  63       80
PF00196 (30)                                  26       30
PF04545 (164)                                 137      164
PF01022 (42)                                  39       39
PF00046 (189)                                 176      188
PF03965 (48)                                  48       48
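
The counts above are easier to compare as recognition rates. The short sketch below converts each row to the fraction of motifs recognized by each model; the counts come from the table, while the names `RESULTS` and `recognition_rates` are hypothetical.

```python
# family -> (motifs in family, recognized by HMM_AA, recognized by HMM_AA_SS Murphy_15)
RESULTS = {
    "PF00126": (1635, 1594, 1622),
    "PF00165": (90, 63, 80),
    "PF00196": (30, 26, 30),
    "PF04545": (164, 137, 164),
    "PF01022": (42, 39, 39),
    "PF00046": (189, 176, 188),
    "PF03965": (48, 48, 48),
}

def recognition_rates(results):
    """Return, per family, the fraction of motifs each model recognized."""
    return {fam: (aa / n, aa_ss / n) for fam, (n, aa, aa_ss) in results.items()}

rates = recognition_rates(RESULTS)
totals = tuple(sum(col) for col in zip(*RESULTS.values()))

for fam, (r_aa, r_ss) in rates.items():
    print(f"{fam}: HMM_AA {r_aa:.1%}, HMM_AA_SS {r_ss:.1%}")
print("totals (motifs, HMM_AA, HMM_AA_SS):", totals)
```

In every family HMM_AA_SS recognizes at least as many motifs as HMM_AA.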

Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations

Total HTH motifs                          563
Recognized by both FFAS03 and HMM_AA_SS   135
Recognized by FFAS03 only                 24
Recognized by HMM_AA_SS only              71
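
The per-method totals follow from these four numbers by simple addition. The sketch below assumes the last count (71) refers to motifs recognized by HMM_AA_SS alone, in parallel with the "FFAS03 only" column; the variable names are hypothetical.

```python
total_motifs = 563
both = 135
ffas03_only = 24
hmm_only = 71  # assumption: recognized by HMM_AA_SS but not by FFAS03

hmm_total = both + hmm_only          # all motifs recognized by HMM_AA_SS
ffas03_total = both + ffas03_only    # all motifs recognized by FFAS03
either = both + ffas03_only + hmm_only
neither = total_motifs - either      # motifs missed by both methods

print(f"HMM_AA_SS: {hmm_total}, FFAS03: {ffas03_total}, "
      f"either: {either}, neither: {neither}")
```

Under that reading the two methods are complementary: each recognizes motifs the other misses.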

Putative HTH motifs in Ureaplasma parvum

Protein                  Location   Annotation from UniProt
sp|Q9PQE5|SCPB_UREPA     176-214    Participates in chromosomal partition during cell division
sp|Q9PQV6|RPOB_UREPA     540-587    DNA-directed RNA polymerase
sp|Q9PR27|SYY_UREPA      340-380    Tyrosyl-tRNA synthetase
sp|Q9PQC2|SYA_UREPA      217-265    Alanyl-tRNA synthetase
sp|Q9PQ74|DPO3A_UREPA    365-400    DNA polymerase III subunit alpha
sp|Q9PQX7|Y166_UREPA     507-553    Hypothetical protein
