Fast Search Protein Structure Prediction Algorithm for Almost Perfect Matches

By

Jayakumar Rudhrasenan

S3047315

Primary Supervisor: Prof. Heiko Schroder

Secondary Supervisor: Dr. Margaret Hamilton


Introduction

Bio-Informatics

What is Bio-Informatics?

Bio-Informatics is the science of developing computer databases and algorithms to facilitate biological research, especially in the area of genomics.

Genomics is the study of genes and their functions.


Background - Protein Structure

How can we find the structure of a protein?

• X-ray Crystallography

• NMR Spectroscopy

[Figure: Protein Structure, showing a chain of amino acids (a, k, r, n, d, c, ...) with the Phi and Psi angles marked on an amino acid]


Where does Computer Science come into it?

Limitations of traditional lab-work

• Expensive

Finding the structure through these methods is costly.

• Time Consuming

It takes 6 to 12 months to determine the structure of a single protein.

REASONS:

Some proteins don’t crystallise.

Some don’t give good diffraction patterns.

All proteins are fragile and difficult to handle.


Methods Available

There are many ways by which this problem is being tackled.

These methods are basically classified into two groups:

• ab initio

• Homology modelling


What is homology modelling?

Homology modelling works on the principle that, although each protein adopts a unique structure, there are only ~2,000 common folds among the various superfamilies identified thus far.

If two protein sequences are aligned and their percentage similarity is above the ‘twilight zone’ (about 20%), we can conclude that the sequences are homologous, that is, they share a common ancestry. Below this zone it is not possible to say whether the identical amino acid residues are in fact evolutionarily linked or have arisen by chance.


What is Protein Structure Prediction?

In its most general form

- It is the prediction of the relative position of each amino acid in the protein structure with the knowledge of the structural details of other known proteins.


Why predict protein structure?

• The sequence structure gap

– 750 000 known sequences, 17 000 known structures

• Structural knowledge brings understanding of function and mechanism of action

• Can help in prediction of function

• Predicted structures can be used in structure based drug design

• It can help us understand the effects of mutations on structure or function

• It is a very interesting scientific problem

– still unsolved in its most general form after more than 20 years of effort


Protein Structure Prediction Algorithm

Protein sequence for which the structure is unknown:

n f s b c a r . . . . .

Protein Database:

a r n d c q e g h i l k m n f s s d

e g h i l n f s e a r l k s p q g a

n h e . . . . . . . . . . .

A window slides along the unknown sequence, and each sub sequence inside the window is looked up in the protein database.

Window size = 3. The algorithm can also be implemented with a window size of 5, 7 or 9. With a window size of 9, we look for almost perfect matches, as we won't get a perfect match with the database we have.
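The window scan can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in this work: the function names `windows` and `count_exact_matches` are made up for the example, and the toy database holds just the two complete sequences shown on the slide.

```python
def windows(sequence, size):
    """Return every contiguous sub sequence of the given window size."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def count_exact_matches(unknown, database, size=3):
    """For each window of the unknown sequence, count its exact occurrences
    among all windows of all database sequences (plain brute force)."""
    db_windows = [w for protein in database for w in windows(protein, size)]
    return {w: db_windows.count(w) for w in windows(unknown, size)}


# Toy data copied from the slide; a real database holds thousands of proteins.
database = ["arndcqeghilkmnfssd", "eghilnfsearlkspqga"]
counts = count_exact_matches("nfsbcar", database, size=3)
```

Here the window "nfs" is found once in each database sequence, while the remaining windows of the unknown sequence have no exact match. Widening the window makes such exact hits rarer, which is why almost perfect matches are used for a window size of 9.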


Algorithm – continued..

[Figure: Phi graph and Psi graph, each plotting Number of Occurrences]


Limitations of this algorithm

Time Consuming

Time taken to predict the structure of a protein: 2 hr PC time

Time taken to predict the structure of 20,000 proteins: 2 x 20,000 = 40,000 hrs PC time


Why does it take time?

Each sub sequence of the unknown protein is compared with all the sub sequences of the proteins in the database.

With a window size of 9, the number of sub sequences in the database will be around 2 million.

So, there will be 2 million comparisons for each sub sequence in the unknown protein.

“Unknown protein” here means a protein whose sequence is known but whose structure is not.
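To get a feel for the total work, the count can be written out as a small calculation. The protein length of 300 amino acids is an assumed figure chosen only for illustration, and `brute_force_comparisons` is a hypothetical helper name; the 2 million database sub sequences come from the estimate above.

```python
def brute_force_comparisons(protein_length, window_size, db_sub_sequences):
    """Number of window-to-window comparisons when every window of the
    unknown protein is checked against every database sub sequence."""
    query_windows = protein_length - window_size + 1
    return query_windows * db_sub_sequences


# Assumed example: a 300-residue protein, window size 9, 2 million database windows.
total = brute_force_comparisons(protein_length=300, window_size=9,
                                db_sub_sequences=2_000_000)
```

Under these assumptions the unknown protein has 292 windows, giving 584 million comparisons for a single protein.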


Fast Search Protein Structure Prediction Algorithm for Almost Perfect Matches

• Arrange the sub sequences so that there is a hamming distance of one between neighbouring sub sequences.

What is hamming distance?

The number of disagreeing bits between two binary vectors.

Used as a measure of dissimilarity.

E.g. 1000011 and 1000001: these two binary numbers differ by one bit.

A hamming distance of one here means that each sub sequence differs from the one next to it by just one amino acid.
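The same measure applies directly to amino acid sub sequences by counting the positions at which two equal-length strings differ. The sketch below is illustrative only; the function names are hypothetical, and treating "almost perfect" as at most one mismatch is an assumption.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_almost_perfect_match(a, b, max_mismatches=1):
    """Assumed definition: a match with at most `max_mismatches` differences."""
    return hamming_distance(a, b) <= max_mismatches


bits = hamming_distance("1000011", "1000001")         # the binary example above
residues = hamming_distance("arndcqegh", "arndcqegk")  # one amino acid differs
```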


Continued…

• Maintain a table which stores the hop index value for a mismatch. For example:

Row number    Sub Sequence    Jump to row number
1023          111110000       1027
1024          111110001
1025          111110002
1026          111110003
1027          111110013       1031
1028          111110012
1029          111110011
1030          111110010
1031          111110020       1035
. . .
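One way such a table could speed up the search is sketched below. The slides do not spell out how the jump column is used, so everything here is an assumed interpretation: rows are listed in a reflected Gray-code order over a toy four-symbol alphabet, the jump entry points to the first row whose leading symbols (all but the last) differ, and a block is skipped as soon as its shared leading part already has too many mismatches. All function names are hypothetical.

```python
def gray_order(alphabet, length):
    """List all sequences so that neighbouring rows differ in one position."""
    if length == 0:
        return [""]
    rows = []
    for i, prefix in enumerate(gray_order(alphabet, length - 1)):
        # Reverse the last symbol on every other block (reflected Gray code).
        symbols = alphabet if i % 2 == 0 else alphabet[::-1]
        rows.extend(prefix + s for s in symbols)
    return rows


def build_jump_table(rows):
    """jump[i] = index of the first later row whose prefix (all symbols but
    the last) differs from the prefix of row i."""
    jump = [len(rows)] * len(rows)
    for i in range(len(rows) - 2, -1, -1):
        same_prefix = rows[i][:-1] == rows[i + 1][:-1]
        jump[i] = jump[i + 1] if same_prefix else i + 1
    return jump


def search(query, rows, jump, max_mismatches=1):
    """Return rows within `max_mismatches` of the query, skipping whole
    blocks whose shared prefix already mismatches too often."""
    hits, comparisons, i = [], 0, 0
    while i < len(rows):
        comparisons += 1
        prefix_mismatches = sum(a != b for a, b in zip(query[:-1], rows[i][:-1]))
        if prefix_mismatches > max_mismatches:
            i = jump[i]  # every row in this block shares the bad prefix
            continue
        if prefix_mismatches + (query[-1] != rows[i][-1]) <= max_mismatches:
            hits.append(rows[i])
        i += 1
    return hits, comparisons


rows = gray_order("0123", 3)   # toy alphabet and length, 64 rows
jump = build_jump_table(rows)
hits, comparisons = search("013", rows, jump)
```

In this sketch a jump value is stored for every row, whereas the example table lists it only on the first row of each block; both carry the same information. Skipping is safe because all rows in a block share the same leading symbols, so a block whose leading part already exceeds the allowed mismatches cannot contain an almost perfect match.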
