barcelona sabatica

Post on 18-Nov-2014


Protein loop classification using Artificial Neural Networks

Armando Vieira (1) and Baldomero Oliva (2)

(1) ISEP and Centro de Física Computacional, Coimbra, Portugal, www.defi.isep.ipp.pt/~asv

(2) Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain

XXI: the century of BIO

BIOINFORMATICS: joining two worlds apart

Outline

Brief review of protein structure

Statement of the problem and why it is so hard

Data pre-processing, corrections, updates and beyond multiple alignments…

Neural Networks in protein structure prediction

HLVQ

Results and future work

Proteins

All proteins are chains built from 20 types of amino acids

Not all chains of amino acids are proteins

Fold rapidly and repeatedly

Proteins are the machinery of life

Essential to all (known) organisms

The Gist of it

Amino acid sequence

Physical structure

Function

Typical globular protein

MEKKEFHIVAETGIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ

Coarse-Grained Model

+180

b b b p o M e e e

b b b p o M M e e

b b b p . l l s e

a a a T . l l g N

N a a a . U l g N

N a a a . U g g N

I a a a . G G G I

e F F F o e e e e

b b b p o e e e e

-180

-180 +180
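The grid above can be read as a lookup table from backbone angles to letters. A minimal sketch, assuming (the slide does not say so) that columns are φ from -180° to +180° left to right, rows are ψ from +180° at the top to -180° at the bottom, and each cell spans 40°. The name `rama_letter` is hypothetical.

```python
# Coarse-grained grid copied from the slide, top row first.
# Assumption: rows are psi (+180 at the top), columns are phi (-180 at the left).
GRID = [
    "bbbpoMeee",
    "bbbpoMMee",
    "bbbp.llse",
    "aaaT.llgN",
    "Naaa.UlgN",
    "Naaa.UggN",
    "Iaaa.GGGI",
    "eFFFoeeee",
    "bbbpoeeee",
]
BIN = 360 // len(GRID)  # 40 degrees per cell for a 9 x 9 grid


def rama_letter(phi, psi):
    """Return the grid letter for backbone angles given in degrees."""
    if not (-180 <= phi <= 180 and -180 <= psi <= 180):
        raise ValueError("angles must lie in [-180, 180]")
    last = len(GRID) - 1
    col = min(int((phi + 180) // BIN), last)  # clamp phi = +180 into the last column
    row = min(int((180 - psi) // BIN), last)  # clamp psi = -180 into the last row
    return GRID[row][col]
```

Under these assumptions a typical helical residue (φ ≈ -60°, ψ ≈ -45°) falls in an `a` cell and a typical extended residue (φ ≈ -120°, ψ ≈ +130°) in a `b` cell, which is at least consistent with the usual Ramachandran layout.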

Ramachandran Alphabet

[Figure: Ramachandran plot with axes φ and ψ, each running from -180° to 180°; regions labeled A, B, E, G]

5-letter alphabet

Residue Sequence

MEKKEFHIVAETGIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ...

3° Structure

ACCDECBAABDECBDABCDBEABDBCBDBAEBDBDBAEBABDCBBDBADDCBDBCBDBEBDBCBBDCAABDEDCDCEAABACAAAADC…

What shall we do?

• Ab initio: Quantum Mechanics + big computers + large # configurations = huge problems…

• Machine Learning: Use known cases to learn a suitable map: sequence → structure

Machine Learning Approach

Artificial Neural Networks

• A problem-solving paradigm modeled after the physiological functioning of the human brain.

• Neurons in the brain are modeled by computational nodes, and synapses by the weighted connections between them.

• The firing of a neuron is modeled by input, output, and threshold functions.

• The network “learns” based on problems to which answers are known (supervised learning).

• The network can then produce answers to entirely new problems of the same type.
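The bullets above can be made concrete with a very small network of threshold units. This is only an illustrative sketch: the weights and thresholds are set by hand rather than learned, and the names `neuron` and `xor_network` are hypothetical. In supervised learning these numbers would instead be adjusted from examples with known answers.

```python
def neuron(inputs, weights, threshold):
    """Threshold unit: fires (returns 1) when the weighted input sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def xor_network(x1, x2):
    """Two-layer network computing XOR, a map a single threshold unit cannot represent."""
    # Hidden layer: one unit behaves like OR, the other like AND.
    h_or = neuron([x1, x2], [1, 1], 0.5)
    h_and = neuron([x1, x2], [1, 1], 1.5)
    # Output layer: fires when OR is on and AND is off.
    return neuron([h_or, h_and], [1, -1], 0.5)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", xor_network(a, b))
```

The hidden layer is what lets the network separate inputs that no single linear threshold can.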

Neural Networks

Output Layer

Input Layer

Hidden Layers

Overfitting: high risk!

A less complicated hypothesis can have a lower error rate on new data

Hidden Layer Vector Quantization (HLVQ)

[Figure: scatter plots of two classes (x and o) with a point z; panels: Traditional NN, HLVQ]

Main advantage: detects and corrects predictions for outliers

Loops, loops everywhere!!!

Look for a loop…

Geometry of the Motif

Loop Types

β-α: β-strand - α-helix

α-α: α-helix - α-helix

α-β: α-helix - β-strand

β-hairpin: β-strand - β-strand

β-link: β-strand - β-strand

Similar conformation: aa{b}aa / aa{p}aa

Identical geometry: (4,6)(0,45)(45,90)(180,225)

1.3.1 aa{p}aa

1.1.2 aa{b}aa

Pro 75%

Ser 75%

© Baldomero Oliva

Class

ArchDB database

~ 20 000 loops classified into ~ 3000 classes.

EE-3.4.1

Loop type - loop size . consensus . motif
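A class label such as EE-3.4.1 can be split into its four fields following the pattern given above. A minimal sketch, assuming the three fields after the hyphen are always integers; `LoopClass` and `parse_loop_class` are hypothetical names, not part of ArchDB.

```python
from typing import NamedTuple


class LoopClass(NamedTuple):
    loop_type: str   # e.g. "EE"
    loop_size: int
    consensus: int
    motif: int


def parse_loop_class(label):
    """Split 'TYPE-size.consensus.motif' into its four fields."""
    loop_type, sep, rest = label.strip().partition("-")
    parts = rest.split(".")
    if not sep or not loop_type or len(parts) != 3:
        raise ValueError(f"unexpected class label: {label!r}")
    # Assumption: the numeric fields are plain integers; int() raises ValueError otherwise.
    size, consensus, motif = (int(p) for p in parts)
    return LoopClass(loop_type, size, consensus, motif)


if __name__ == "__main__":
    print(parse_loop_class("EE-3.4.1"))
```

Keeping the fields separate makes it easy to group loops by type and size alone when the full class cannot be predicted.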

TASK: classify a loop from sequence alone

If not possible, get as much information as possible

Problems

• Coding of amino acids

• Huge search space, sparsely populated

• How to assign the loop classes?

• High dimensionality → Large Networks → poor generalization

Amino acid coding: the classical way

A → (1, 0, …0)

C → (0, 1, …0)

Y → (0, 0, …1)

Useful but not efficient!!!
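The classical coding above is a one-hot vector per residue. A minimal sketch, assuming the 20 one-letter codes are ordered alphabetically (which matches A first, C second and Y last on the slide); `one_hot` and `encode` are hypothetical names.

```python
# Assumption: alphabetical order of the 20 standard one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-element list with a single 1 at the residue's position."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown letters
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


def encode(sequence):
    """Concatenate the one-hot vectors of all residues into one flat input vector."""
    vector = []
    for residue in sequence:
        vector.extend(one_hot(residue))
    return vector


if __name__ == "__main__":
    x = encode("MEKKEFHIVA")
    print(len(x), sum(x))  # vector length and number of non-zero entries
```

A loop of n residues becomes 20·n inputs of which only n are non-zero, which is one way to see why this coding drives up dimensionality and network size.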

I am working to improve it…

Theory; but how about applications?!

β-link and β-hairpins from sequence

HLVQ (MLP)        Predicted β-link   Predicted β-hairpin
Real β-link       88.4 (79.4)        11.6 (20.6)
Real β-hairpin    12.5 (16.1)        87.5 (83.9)
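The two sets of percentages above can be compared with a single number per method. The sketch below averages the diagonal of each table (mean per-class accuracy) and reports the relative drop in error; this summary is an illustration chosen here, not a metric taken from the slides, and the function names are hypothetical.

```python
# Rows: real class (link, hairpin); columns: predicted class. Values in percent.
HLVQ = [[88.4, 11.6], [12.5, 87.5]]
MLP = [[79.4, 20.6], [16.1, 83.9]]


def mean_per_class_accuracy(matrix):
    """Average of the diagonal of a row-normalized confusion matrix, in percent."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def error_reduction(baseline, improved):
    """Relative drop in mean per-class error when moving from baseline to improved."""
    err_base = 100.0 - mean_per_class_accuracy(baseline)
    err_new = 100.0 - mean_per_class_accuracy(improved)
    return (err_base - err_new) / err_base


if __name__ == "__main__":
    print(round(mean_per_class_accuracy(HLVQ), 2))
    print(round(mean_per_class_accuracy(MLP), 2))
    print(round(error_reduction(MLP, HLVQ), 3))
```

By this measure HLVQ reaches roughly 88% against roughly 82% for the MLP, about a third fewer errors on this two-class task.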

Prediction of all loop types from sequence alone

        β-lk   α-β   β-hp   β-α   α-α
β-lk    45.9  28.5    3.7  19.8   2.1
α-β      8.8  67.4    1.2  18.0   4.6
β-hp     0.4   0.9   96.1   2.1   0.5
β-α      4.4   6.2    2.4  79.5   7.6
α-α      4.0  15.7    1.3  20.3  58.6
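The five-class table can be summarized in the same spirit. A minimal sketch, assuming rows are real classes and columns are predicted classes (each row sums to about 100, as in the two-class table) and that the Greek letters lost from some labels are β; the function names are hypothetical.

```python
LABELS = ["β-lk", "α-β", "β-hp", "β-α", "α-α"]

# Assumption: rows are real classes, columns are predicted classes, values in percent.
MATRIX = [
    [45.9, 28.5, 3.7, 19.8, 2.1],
    [8.8, 67.4, 1.2, 18.0, 4.6],
    [0.4, 0.9, 96.1, 2.1, 0.5],
    [4.4, 6.2, 2.4, 79.5, 7.6],
    [4.0, 15.7, 1.3, 20.3, 58.6],
]


def per_class_accuracy(matrix, labels):
    """Map each class label to its diagonal entry (percent correctly predicted)."""
    return {labels[i]: matrix[i][i] for i in range(len(labels))}


def largest_confusion(matrix, labels):
    """Return (percent, real label, predicted label) for the largest off-diagonal entry."""
    best = None
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i != j and (best is None or value > best[0]):
                best = (value, labels[i], labels[j])
    return best


if __name__ == "__main__":
    acc = per_class_accuracy(MATRIX, LABELS)
    print(acc)
    print(sum(acc.values()) / len(acc))
    print(largest_confusion(MATRIX, LABELS))
```

Read this way, hairpins are by far the easiest class, while link loops are the hardest and are most often mistaken for α-β loops.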

What does it all mean?

Given a loop residue sequence, we can (usually) identify its native structure.

Not ab initio: We cannot tell the structure of a novel sequence.

HLVQ is superior to MLP

Future Work

Better coding of amino acids

Larger sequences / low complexity

Going beyond structure

Clever alphabet that exploits similarities

Multiobjective Genetic Algorithms

Beyond Multiple Alignments

• Alignments are good … but expensive and boring ...

• Information contained in a multiple alignment can, in principle, be expressed using an adequate amino acid coding scheme

• How?

Sensibility

Genetic Algorithm

Coded Amino Acids

Alanine (A) Arginine (R) Asparagine (N) Aspartic Acid (D) Cysteine (C)

Glutamic Acid (E) Glutamine (Q) Glycine (G) Histidine (H) Isoleucine (I)

Leucine (L) Methionine (M) Lysine (K) Phenylalanine (F) Proline (P)

Serine (S) Threonine (T) Tryptophan (W) Tyrosine (Y) Valine (V)

http://www.chemie.fu-berlin.de/chemistry/bio/

ArchDB database

Protein Data Bank (PDB) http://www.rcsb.org contains ~ 25 000 proteins with known structure, out of ~ 10^6 entries in SWISS-PROT

ArchDB ~ 20 000 classified loops
