Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method Wen-Lian Hsu Institute of Information Science Academia Sinica, Taiwan

Post on 31-Dec-2015


Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method
Wen-Lian Hsu
Institute of Information Science
Academia Sinica, Taiwan

Outline
• Introduction
  – PSSP
  – Motivation
• Knowledge-Based Method
  – PROSP
• An Improved Hybrid Method
  – PROSP II
  – HYPROSP II+
• Conclusion

Protein Structures
• Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• Secondary structures: helices, strands, loops
• Tertiary structures: three-dimensional packing of secondary structures

Introduction to PSSP
• Protein Secondary Structure Prediction (PSSP) aims to predict a protein's secondary structure from its amino acid sequence alone.
• Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E), or Coil (C or L).
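As a minimal illustration of the input and output of PSSP, the sketch below pairs each residue with a three-state label. The function names and the toy label string are hypothetical and not from the talk; the only facts taken from the slide are the three states and that coil may be written C or L.

```python
VALID_STATES = {"H", "E", "C"}


def normalize_sse(labels: str) -> str:
    """Map the alternative coil symbol L to C and check the three-state alphabet."""
    normalized = labels.upper().replace("L", "C")
    unknown = set(normalized) - VALID_STATES
    if unknown:
        raise ValueError(f"unexpected structure symbols: {sorted(unknown)}")
    return normalized


def pair_residues_with_sse(sequence: str, labels: str) -> list:
    """Return one (amino acid, SSE) pair per residue."""
    labels = normalize_sse(labels)
    if len(sequence) != len(labels):
        raise ValueError("sequence and structure must have the same length")
    return list(zip(sequence, labels))


# First ten residues of the example sequence; the labels are made up.
pairs = pair_residues_with_sse("MTYKLILNGK", "LEEEEEELLL")
```

Every residue receives exactly one label, so a prediction is a string of the same length as the sequence.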

Motivation
• PSSP plays an important role in tertiary structure prediction.
  – Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs.
  – In Yang's 2003 work, tertiary structure prediction accuracy improved from 71.9 to 79.0 by using PSIPRED to predict SSEs.
• Predicted SSEs can also be used as features in other prediction algorithms to improve their performance.

Treat PSSP as a Translation Problem
• Secondary structure prediction translates a language with a 20-letter alphabet (amino acids) into a language with a 3-letter alphabet (structure elements).

Treating Genomic/Proteomic Sequences as a Language
For proteomic data:
• Amino acid ↔ letter of the alphabet
• Motif ↔ word
• Protein ↔ sentence, paragraph
• Protein structure or function ↔ sentence meaning
Finding the interrelationships in the data: data mining, knowledge discovery

Matching by Semantics (prediction based on evolutionary information)
• Existing sentences in the database (understood):
  – His old father gave me a book.
  – Joan loves Andy.
• Understanding a new sentence:
  – Mary's lovely daughter does not like John.
• Techniques:
  – Corpus analysis
  – Pattern discovery and matching
• Sequence, semantics (classification, transformation)
  – Structure prediction

Speech Recognition: Example
Sense Disambiguation in English
• Selection of homonyms (or senses) in speech recognition
• Target sentence: 台北市一位小孩走失了 ("A child in Taipei City has gone missing")
• Competing candidate words with similar pronunciation: 台北市 (Taipei City), 台北 (Taipei), 適宜 (suitable), 事宜 (matters), 小孩 (child), 走失 (to go missing), 一位 (one person), 一味 (blindly), 移位 (to shift position)

How do we represent the context in a protein sequence (or sentence)?
• Using motifs as words?
  – Motifs could be too specific and may not provide enough coverage.
• What about using k-mers?
  – We can build (k-mer, structure) pairs.
  – How many k-mers can we get?
  – How do we define similar k-mers (under the context)?
  – How do we combine the structural information from the k-mers?
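The idea of (k-mer, structure) pairs can be sketched with a sliding window. This is only an illustration: the function names and the toy structure labels are hypothetical, and the count of 20^k is simply the upper bound on distinct k-mers over the 20 amino acids, which gives a rough answer to "how many k-mers can we get?".

```python
def kmer_structure_pairs(sequence: str, structure: str, k: int) -> list:
    """Slide a window of length k and emit (k-mer, structure fragment) pairs."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return [
        (sequence[i:i + k], structure[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]


def max_distinct_kmers(k: int, alphabet_size: int = 20) -> int:
    """Upper bound on the number of distinct k-mers over the amino acid alphabet."""
    return alphabet_size ** k


# Toy input: the labels are made up for illustration.
example_pairs = kmer_structure_pairs("MTYKLIL", "CEEEEEE", k=5)
```

A protein of length n yields n - k + 1 overlapping k-mers, so each residue away from the ends is covered by up to k windows.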

PROSP
• Our knowledge-based method for PSSP:
  – Construct a peptide Sequence-Structure Knowledge Base (SSKB).
  – Use PSI-BLAST to find all peptides similar to those of the target protein.
  – Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.

Using PSI-BLAST to Amplify the Effect of the DSSP Database (create more synonyms)
• The number of peptide words is still small (~5 million).
• Identify similar peptides:
  – For each protein p of known structure, apply PSI-BLAST against the NR database to find its HSPs (high-scoring segment pairs).
  – An HSP is an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure.
  – Assign the structure of "selected" peptides of p to those of q.
• These peptides comprise our dictionary (~100 million).
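The transfer of structure labels along an HSP can be sketched as follows. This is a simplified illustration, not the authors' implementation: the talk does not say how peptides are "selected", so the sketch transfers every aligned position. The function name, the gap symbol, the "?" placeholder, and the toy labels are assumptions.

```python
def transfer_structure(aligned_p: str, aligned_q: str, structure_p: str,
                       gap: str = "-") -> str:
    """Copy labels from p (known structure) to q (unknown) along one alignment.

    aligned_p and aligned_q are the two rows of the alignment (equal length,
    gaps allowed); structure_p holds one label per ungapped residue of p.
    Residues of q aligned to a gap in p receive '?'.
    """
    if len(aligned_p) != len(aligned_q):
        raise ValueError("alignment rows must have the same length")
    if len(structure_p) != len(aligned_p.replace(gap, "")):
        raise ValueError("structure_p must label every residue of p")

    labels_q = []
    p_index = 0
    for res_p, res_q in zip(aligned_p, aligned_q):
        label = None
        if res_p != gap:
            label = structure_p[p_index]
            p_index += 1
        if res_q != gap:
            labels_q.append(label if label is not None else "?")
    return "".join(labels_q)


# Ungapped toy alignment: q inherits the labels of p position by position.
ungapped = transfer_structure("MYKKILY", "MYSKILL", "CHHHHHC")
# Gapped toy alignment: the residue S of q has no partner in p.
gapped = transfer_structure("MY-KIL", "MYSK-L", "CHHHE")
```

Each labelled peptide of q produced this way becomes a new "synonym" entry, which is how the dictionary can grow far beyond the peptides of known structure.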

SSKB construction (synonyms)
[Figure: an example of a high-scoring segment pair (HSP) from a PSI-BLAST search result, aligning a protein of known structure with a protein of unknown structure.]

Prediction at a position x
[Figure: similar peptides found through PSI-BLAST and the SSKB contribute their structure labels at position x; the voting scores H(x), E(x), and C(x) are accumulated, and in the example x is assigned as helix.]
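A minimal sketch of per-position voting, assuming every matched peptide casts one equally weighted vote per residue. The actual PROSP weighting is not given in the talk, and the function names, the tie-breaking order (H, then E, then C), and the "-" symbol for positions without votes are assumptions made for illustration.

```python
from collections import Counter


def vote_position(labels):
    """Return the voting scores H(x), E(x), C(x) and the winning state."""
    counts = Counter(labels)
    scores = {state: counts.get(state, 0) for state in "HEC"}
    # max() keeps the first maximal state, so ties resolve in the order H, E, C.
    winner = max("HEC", key=lambda state: scores[state])
    return scores, winner


def predict(target_length, matches):
    """matches: (start, structure) pairs for peptides aligned to the target."""
    votes = [[] for _ in range(target_length)]
    for start, structure in matches:
        for offset, label in enumerate(structure):
            position = start + offset
            if 0 <= position < target_length:
                votes[position].append(label)
    return "".join(vote_position(v)[1] if v else "-" for v in votes)


scores, winner = vote_position("HHE")
prediction = predict(4, [(0, "HH"), (1, "HC"), (2, "EC")])
```

The state with the largest score at x becomes the prediction for that residue.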

Two problems of searching for homologous peptides in protein sequence databases
• Redundant information generated by duplicate peptides: the voting bias problem in PROSP
• Poor prediction accuracy due to insufficient knowledge base matching: coverage needs to be boosted

The voting bias problem
Query: KTYQCQY…
Subject peptides from the PSI-BLAST results, with their structures in the SSKB:
Subject peptide   SSKB structure   Occurrences
KPYQCQY           HHHHHH           5
KVYQCQY           CCHHHC           1
QPYRCKY           CCHHHC           1
Dominant result: the structure of the duplicated peptide.

Clustering HSPs
Query: …MYKKILYPTDFSETAEIALK…
Similar HSPs, grouped:
• MYSKILL (×3)
• MYKKIYL (×4)
• MYSSILY (×2)
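The effect of redundant hits on voting can be sketched with the peptides from the voting bias slide. Two caveats: the pairing of peptides with structure strings is our reading of that slide, and the talk clusters similar HSPs whereas this sketch only collapses exact duplicates, which is a deliberate simplification. The function names are hypothetical.

```python
from collections import Counter


def column_votes(hits):
    """Count structure labels per column over (peptide, structure) hits."""
    width = min(len(structure) for _, structure in hits)
    return [Counter(structure[i] for _, structure in hits) for i in range(width)]


def collapse_duplicates(hits):
    """Keep one representative per distinct (peptide, structure) pair, in order."""
    return list(dict.fromkeys(hits))


hits = [("KPYQCQY", "HHHHHH")] * 5 + [
    ("KVYQCQY", "CCHHHC"),
    ("QPYRCKY", "CCHHHC"),
]
biased = column_votes(hits)
unbiased = column_votes(collapse_duplicates(hits))
```

With all seven hits the first column is voted helix 5 to 2; after collapsing duplicates the same column is voted coil 2 to 1, which is the bias the clustering step is meant to remove.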

Measuring the amount of structural information
[Figure: HSPs aligned along the target protein, with regions marked as found or unfound in the SSKB. In a region with a low local match rate there is no information from SSKB7.]
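The talk does not give a formula for the local match rate, so the following is only one plausible reading: the fraction of residues in a window centred on position x that are covered by at least one peptide found in the SSKB. The window size, the function names, and the definition itself are assumptions.

```python
def coverage_from_matches(length, match_starts, k=7):
    """Mark residues covered by at least one matched k-mer starting at match_starts."""
    covered = [False] * length
    for start in match_starts:
        for position in range(max(start, 0), min(start + k, length)):
            covered[position] = True
    return covered


def local_match_rate(covered, x, half_window=3):
    """Fraction of covered residues in the window [x - half_window, x + half_window]."""
    if not 0 <= x < len(covered):
        raise IndexError("position x is outside the protein")
    lo = max(0, x - half_window)
    hi = min(len(covered), x + half_window + 1)
    window = covered[lo:hi]
    return sum(window) / len(window)


# A 12-residue protein with a single 7-mer match at the N-terminus.
covered = coverage_from_matches(12, [0], k=7)
rate_inside = local_match_rate(covered, 3)
rate_border = local_match_rate(covered, 6)
rate_outside = local_match_rate(covered, 10)
```

Under this reading the rate lies between 0 and 1, with 0 corresponding to a region that has no information from the knowledge base.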

Construct SSKBs with different window lengths (to boost coverage)
• Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 7) → SSKB with window length = 7
• Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 5) → SSKB with window length = 5

Boost match rate using peptide records of different lengths
Protein: MYKKILYPTDFSETAEIALK…

HSPs from SSKB7 (window length = 7):
H  1 2 1 3 6 7 8 …
E  1 2 2 0 0 0 1 …
C  2 3 8 8 5 4 2 …

HSPs from SSKB5 (window length = 5):
H  1 3 2 5 5 5 2 …
E  1 3 2 0 0 0 1 …
C  2 4 7 7 6 6 7 …

New PROSP system
Protein: MYKKILYPTDFSETAEIALK…

Scores from the SSKB with window length = 7:
H  1 2 1 3 6 7 8 …
E  1 2 2 0 0 0 1 …
C  2 3 8 8 5 4 2 …

Scores from the SSKB with window length = 5:
H  1 3 2 5 5 5 2 …
E  1 3 2 0 0 0 1 …
C  2 4 7 7 6 6 7 …

Combination rule:
\[
\begin{aligned}
H_{\mathrm{PROSPII}}(x) &\leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times H_7(x) + \bigl(1 - \mathrm{LMR}_{7\mathrm{mer}}(x)\bigr) \times H_5(x) \\
E_{\mathrm{PROSPII}}(x) &\leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times E_7(x) + \bigl(1 - \mathrm{LMR}_{7\mathrm{mer}}(x)\bigr) \times E_5(x) \\
C_{\mathrm{PROSPII}}(x) &\leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times C_7(x) + \bigl(1 - \mathrm{LMR}_{7\mathrm{mer}}(x)\bigr) \times C_5(x)
\end{aligned}
\]

Combined PROSP II scores:
H  1 3 2 5 7 6 7 …
E  1 3 2 0 0 0 1 …
C  2 4 8 8 4 5 6 …
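The combination rule on this slide is a convex mix of the 7-mer and 5-mer scores weighted by the local match rate. A minimal sketch follows, assuming the rate is expressed as a fraction between 0 and 1 (the results chart reports it in percent) and using a hypothetical function name and hypothetical rate values.

```python
def combine_scores(lmr7, scores7, scores5):
    """Mix SSKB7 and SSKB5 scores at one position x.

    lmr7    : local match rate of the 7-mer knowledge base at x, in [0, 1]
    scores7 : dict with the H, E, C scores from SSKB7 at x
    scores5 : dict with the H, E, C scores from SSKB5 at x
    """
    if not 0.0 <= lmr7 <= 1.0:
        raise ValueError("lmr7 must be a fraction between 0 and 1")
    return {
        state: lmr7 * scores7[state] + (1.0 - lmr7) * scores5[state]
        for state in "HEC"
    }


# Scores taken from the fifth column of the two tables; the rates are made up.
s7 = {"H": 6, "E": 0, "C": 5}
s5 = {"H": 5, "E": 0, "C": 6}
trust_7mer = combine_scores(1.0, s7, s5)
trust_5mer = combine_scores(0.0, s7, s5)
halfway = combine_scores(0.5, s7, s5)
```

Where SSKB7 matches well the 7-mer scores dominate; where it has little or no information the prediction falls back on the shorter and more easily matched 5-mers.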

Hybrid by Neural Network
[Figure: the query protein is processed by three components whose outputs feed a neural network that produces the final result.]
• PSIPRED: H score, E score, C score (3 features)
• PROSP: H score, E score, C score (3 features)
• PSI-BLAST: PSSM (20 features)
• Neural network → final result
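The input to the network can be sketched as a per-residue concatenation of the three feature groups, 3 + 3 + 20 = 26 values. The function name and the toy values are hypothetical, and the talk does not say whether neighbouring residues are also included through a sliding window, so the sketch covers a single residue only.

```python
def build_feature_vector(psipred_scores, prosp_scores, pssm_row):
    """Concatenate the features of one residue into a 26-element vector."""
    if len(psipred_scores) != 3 or len(prosp_scores) != 3:
        raise ValueError("expected H, E, C scores (3 values) from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected one PSSM value per amino acid (20 values)")
    return list(psipred_scores) + list(prosp_scores) + list(pssm_row)


# Toy values for one residue.
features = build_feature_vector(
    psipred_scores=[0.7, 0.1, 0.2],
    prosp_scores=[6.0, 0.0, 5.0],
    pssm_row=[0] * 20,
)
```

Because the groups are on different scales, some normalisation before training would be a reasonable design choice, although the talk does not describe one.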

Data Sets
• Two widely used test sets: CB513 and EVAc4
• Derivation of the training sets:
  – Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database.
  – Further remove protein chains with over 25% sequence identity to the respective test dataset to obtain the respective training datasets.
  – The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.
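The filtering step can be sketched as below. From the counts on the slide, 4,572 - 4,395 = 177 chains were removed for EVAc4 and 4,572 - 4,055 = 517 for CB513. The identity function here is a toy position-wise comparison used only to make the example runnable; a real pipeline would compute identity from sequence alignments, and all names are hypothetical.

```python
def toy_identity(a: str, b: str) -> float:
    """Toy stand-in: fraction of identical positions, without any alignment."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


def filter_training_set(candidates, test_set, identity, threshold=0.25):
    """Drop candidate chains with identity above the threshold to any test chain."""
    return [
        chain
        for chain in candidates
        if all(identity(chain, test_chain) <= threshold for test_chain in test_set)
    ]


removed_for_evac4 = 4572 - 4395
removed_for_cb513 = 4572 - 4055
kept = filter_training_set(["AAAA", "CCCC", "AACC"], ["AAAA"], toy_identity)
```

Removing chains similar to the test proteins keeps the knowledge base from trivially containing the answers for the test set.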

The respective performance improvement using SSKB5 and SSKB7
[Figure: Q3 (%) on the vertical axis (55 to 85) against LMR7mer (%) on the horizontal axis (bins [0,10), [10,20), [20,30), [30,40), [40,50)) for the 7-mer SSKB, the 5-mer SSKB, and PROSP II.]
Performance of prediction on CB513 by SSKB5, SSKB7, and PROSP II for LMR7mer lower than 50%.

Performance of HYPROSP II+

Method       Q3     SOV    QH_o   QH_p   QE_o   QE_p   QC_o   QC_p   Info
HYPROSPII+   80.35  78.66  78.65  83.85  61.10  71.27  81.79  76.35  0.44
Errsig        0.84   1.20   1.87   1.75   2.33   2.15   1.05   1.15  0.02
PROFsec      76.54  75.39  67.30  74.00  43.70  43.20  76.80  73.50  0.38
PSIPRED      77.62  76.05  72.90  71.50  38.60  42.30  73.50  76.40  0.38
SAM-T99sec   77.64  75.05  75.50  69.60  38.80  47.30  72.40  75.70  0.39
YASSPP       79.34  78.65  --     --     --     --     --     --     0.42
HYPROSPII    79.32  76.51  81.49  77.85  60.91  68.83  76.98  77.78  0.41
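For readers unfamiliar with the column names, the sketch below computes Q3 and the per-state accuracies on a toy example. It assumes the common convention that Q3 is the percentage of residues whose three-state label is predicted correctly, that the suffix _o means "percentage of observed residues in that state predicted correctly", and that _p means "percentage of residues predicted in that state that are correct". The talk does not define these columns, so this reading is an assumption; SOV and Info are not covered, and the function names and toy strings are hypothetical.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues with a correctly predicted three-state label."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed: str, predicted: str, state: str):
    """Return (Q_obs, Q_prd) for one state; None where the denominator is zero."""
    both = sum(o == state and p == state for o, p in zip(observed, predicted))
    n_observed = observed.count(state)
    n_predicted = predicted.count(state)
    q_obs = 100.0 * both / n_observed if n_observed else None
    q_prd = 100.0 * both / n_predicted if n_predicted else None
    return q_obs, q_prd


observed = "HHHHEECC"
predicted = "HHHCEECH"
overall = q3(observed, predicted)
helix = per_state_accuracy(observed, predicted, "H")
strand = per_state_accuracy(observed, predicted, "E")
coil = per_state_accuracy(observed, predicted, "C")
```

Reporting both variants separates under-prediction of a state (low _o) from over-prediction (low _p), which Q3 alone hides.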

Conclusion
• HYPROSP II+
  – Uses a more robust knowledge-based algorithm, PROSP II
  – More structural information, better prediction
  – Incremental learning
• The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.

People
• Wen-Lian Hsu
• Ting-Yi Sung
• Hsin-Nan Lin
• Jia-Ming Chang
• Ei-Wen Yang