barcelona sabatica

Protein loop classification using Artificial Neural Networks Armando Vieira 1 and Baldomero Oliva 2 1 ISEP and Centro de Física Computacional, Coimbra, Portugal www.defi.isep.ipp.pt/~asv 2 Structural Bioinformatics Laboratory (GRIB) IMIM/Universitat Pompeu Fabra,

Upload: armando-vieira

Post on 18-Nov-2014


Page 1: Barcelona sabatica

Protein loop classification using Artificial Neural Networks

Armando Vieira (1) and Baldomero Oliva (2)

(1) ISEP and Centro de Física Computacional, Coimbra, Portugal, www.defi.isep.ipp.pt/~asv

(2) Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain

Page 2: Barcelona sabatica

XXI: the century of BIO

Page 3: Barcelona sabatica

BIOINFORMATICS: joining two worlds apart

Page 4: Barcelona sabatica

Outline

Brief review of protein structure

Statement of the problem and why it is so hard

Data pre-processing, corrections, updates and beyond multiple alignments…

Neural Networks in protein structure prediction

HLVQ

Results and future work

Page 5: Barcelona sabatica

Proteins

All proteins are chains built from the same 20 amino acids

Not all chains of amino acids are proteins

Fold rapidly and repeatedly

Proteins are the machinery of life

Essential to all (known) organisms

Page 6: Barcelona sabatica

The Gist of it

Amino acid sequence

Physical structure

Function

Page 7: Barcelona sabatica

Typical globular protein

MEKKEFHIVAETGIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ

Page 8: Barcelona sabatica

Coarse-Grained Model

Page 9: Barcelona sabatica

+180

b b b p o M e e e

b b b p o M M e e

b b b p . l l s e

a a a T . l l g N

N a a a . U l g N

N a a a . U g g N

I a a a . G G G I

e F F F o e e e e

b b b p o e e e e

-180

-180 +180
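A minimal sketch of how the grid above could be used to turn a pair of backbone angles into a one-letter code. The slide only shows the grid and the -180/+180 axis limits, so the following are assumptions: the horizontal axis is φ and the vertical axis is ψ, the top row is the highest ψ bin, and all cells are equal 40° bins. The function names are hypothetical.

```python
# Coarse-grained map copied row by row from the grid on this slide.
# ASSUMPTIONS (not stated on the slide): columns run from phi = -180 (left)
# to phi = +180 (right), rows from psi = +180 (top) to psi = -180 (bottom),
# and every cell is an equal 40-degree bin.
GRID = [
    "bbbpoMeee",
    "bbbpoMMee",
    "bbbp.llse",
    "aaaT.llgN",
    "Naaa.UlgN",
    "Naaa.UggN",
    "Iaaa.GGGI",
    "eFFFoeeee",
    "bbbpoeeee",
]
BIN_WIDTH = 360.0 / len(GRID)  # 40 degrees per cell


def angle_bin(angle):
    """Map an angle in [-180, 180] degrees to a bin index 0..8."""
    if not -180.0 <= angle <= 180.0:
        raise ValueError("angle must lie in [-180, 180] degrees")
    # +180 itself would fall into a tenth bin, so clamp it into the last one.
    return min(int((angle + 180.0) // BIN_WIDTH), len(GRID) - 1)


def conformation_letter(phi, psi):
    """Look up the coarse-grained letter for one (phi, psi) pair."""
    col = angle_bin(phi)
    row = len(GRID) - 1 - angle_bin(psi)  # top row holds the highest psi bin
    return GRID[row][col]


print(conformation_letter(-60.0, -45.0))   # typical helical angles
print(conformation_letter(-120.0, 130.0))  # typical extended-strand angles
```

Under these assumptions, typical helical angles fall in an "a" cell and typical strand angles in a "b" cell, which is at least consistent with the letters chosen on the slide.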

Page 10: Barcelona sabatica

Ramachandran Alphabet

[Figure: Ramachandran plot with axes φ and ψ, each running from -180° to 180° with ticks at -90°, 0° and 90°; regions labeled A, B, E and G]

Page 11: Barcelona sabatica

5-letter alphabet

Residue Sequence

MEKKEFHIVAETGIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ...

3° Structure

ACCDECBAABDECBDABCDBEABDBCBDBAEBDBDBAEBABDCBBDBADDCBDBCBDBEBDBCBBDCAABDEDCDCEAABACAAAADC…

Page 12: Barcelona sabatica

What shall we do?

• Ab initio: Quantum Mechanics + big computers + large # configurations = huge problems…

• Machine Learning: use known cases to learn a suitable map: sequence → structure

Page 13: Barcelona sabatica

Machine Learning Approach

Page 14: Barcelona sabatica

Artificial Neural Networks

• A problem-solving paradigm modeled after the physiological functioning of the human brain.

• Neurons in the brain are modeled by computational nodes, and synapses by weighted connections between them.

• The firing of a neuron is modeled by input, output, and threshold functions.

• The network “learns” based on problems to which answers are known (supervised learning).

• The network can then produce answers to entirely new problems of the same type.
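The bullets above can be illustrated with a single threshold node trained on examples with known answers. This is a generic textbook perceptron on a toy problem (logical AND), not the network used in this work; all names and values are illustrative.

```python
def fire(weights, bias, inputs):
    """Threshold unit: output 1 if the weighted input sum exceeds 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, rate=1.0):
    """Supervised learning with the classic perceptron update rule."""
    n_inputs = len(examples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - fire(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy problem with known answers: logical AND.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
print([fire(weights, bias, x) for x, _ in AND])
```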

Page 15: Barcelona sabatica

Neural Networks

[Figure: network diagram with an Input Layer, Hidden Layers and an Output Layer]
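A sketch of how a signal passes from the input layer through the hidden layers to the output layer. The sigmoid activation, the layer sizes and the weights are all made up for illustration; the slide does not specify them.

```python
import math


def sigmoid(z):
    """Smooth threshold function squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one node."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Propagate an input vector through a list of (weights, biases) layers."""
    activation = inputs
    for weights, biases in layers:
        activation = layer(activation, weights, biases)
    return activation


# Hypothetical network: 3 inputs, 2 hidden nodes, 1 output (made-up weights).
network = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                              # output layer
]
output = forward([1.0, 0.0, 1.0], network)
print(output)
```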

Page 16: Barcelona sabatica

Overfitting – high risk!

A less complicated hypothesis often has a lower error rate on unseen data

Page 17: Barcelona sabatica

Hidden Layer Vector Quantization (HLVQ)

[Figure: two scatter plots of the same two classes of points (x and o), one panel labeled "Traditional NN" and one labeled "HLVQ", with a point labeled z]

Main advantage: detect and correct prediction for outliers
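The slides do not spell out the HLVQ algorithm. The sketch below only illustrates the general idea suggested by the name, under the assumption that labeled prototype vectors are placed in the space of hidden-layer activations and adjusted with a standard LVQ1 step. The activations, prototypes and learning rate are made up, and this should not be read as the authors' implementation.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(prototypes, h):
    """Index of the prototype closest to the hidden-layer vector h."""
    return min(range(len(prototypes)),
               key=lambda i: sq_dist(prototypes[i][0], h))


def lvq1_step(prototypes, h, label, rate=0.1):
    """LVQ1: pull the winning prototype toward h if labels agree, else push it away."""
    i = nearest(prototypes, h)
    vec, proto_label = prototypes[i]
    sign = 1.0 if proto_label == label else -1.0
    prototypes[i] = ([p + sign * rate * (x - p) for p, x in zip(vec, h)],
                     proto_label)


def classify(prototypes, h):
    """Label of the nearest prototype."""
    return prototypes[nearest(prototypes, h)][1]


# Made-up hidden-layer activations for the two classes "x" and "o".
hidden = [([0.9, 0.1], "x"), ([0.8, 0.2], "x"),
          ([0.1, 0.9], "o"), ([0.2, 0.7], "o")]
prototypes = [([0.5, 0.4], "x"), ([0.4, 0.5], "o")]
for _ in range(10):
    for h, label in hidden:
        lvq1_step(prototypes, h, label)

print(classify(prototypes, [0.85, 0.15]), classify(prototypes, [0.15, 0.8]))
```

In a scheme like this, a point whose hidden-layer vector lies far from every prototype of its predicted class could be flagged as a possible outlier, which is one way to read the "detect and correct" claim on the slide.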

Page 18: Barcelona sabatica

Loops, loops everywhere!!!

Page 19: Barcelona sabatica

Look for a loop…

Page 20: Barcelona sabatica

Geometry of the Motif

Page 21: Barcelona sabatica

Loop Types

β-α: β-strand - α-helix

α-α: α-helix - α-helix

α-β: α-helix - β-strand

β-hairpin: β-strand - β-strand

β-link: β-strand - β-strand

Page 22: Barcelona sabatica

Similar conformation: aa{b}aa / aa{p}aa

Identical geometry: (4,6)(0,45)(45,90)(180,225)

1.3.1 aa{p}aa

1.1.2 aa{b}aa

Pro 75%

Ser 75%

© Baldomero Oliva

Page 23: Barcelona sabatica

Class

Page 24: Barcelona sabatica

ArchDB database

~20 000 loops classified into ~3000 classes.

Example class label: EE-3.4.1

Label format: loop type - loop size . consensus . motif
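A small sketch of how a class label in the format above could be split into its four fields. The type and function names are hypothetical, and treating the three dotted fields as integers is an assumption based only on the example EE-3.4.1.

```python
from typing import NamedTuple


class LoopClass(NamedTuple):
    loop_type: str
    loop_size: int
    consensus: int
    motif: int


def parse_loop_class(label: str) -> LoopClass:
    """Split a label such as 'EE-3.4.1' into type, size, consensus and motif."""
    loop_type, _, rest = label.partition("-")
    # ASSUMPTION: the three dotted fields are always integers.
    size, consensus, motif = (int(part) for part in rest.split("."))
    return LoopClass(loop_type, size, consensus, motif)


print(parse_loop_class("EE-3.4.1"))
```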

TASK: classify a loop from sequence alone

If not possible, get as much information as possible

Page 25: Barcelona sabatica

Problems

• Coding of amino acids

• Huge search space, sparsely populated

• How to assign the loop classes?

• High dimensionality → Large Networks → poor generalization

Page 26: Barcelona sabatica

Amino acid coding: the classical way

A → (1, 0, …0)

C → (0, 1, …0)

Y → (0, 0, …1)

Useful but not efficient!!!

I am working to improve it…
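The classical coding on this slide is one-hot encoding: each amino acid gets a vector of 20 entries with a single 1. A minimal sketch, assuming the positions follow the alphabetical order of the one-letter codes (which matches A first, C second and Y last on the slide); the function names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes, alphabetical


def one_hot(residue):
    """Vector of 20 entries with a single 1 at the residue's position."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec


def encode(sequence):
    """Concatenate the one-hot vectors of every residue in the sequence."""
    return [bit for residue in sequence for bit in one_hot(residue)]


inputs = encode("MEKK")
print(len(inputs), sum(inputs))
```

A fragment of n residues becomes 20·n network inputs of which only n are non-zero, which is why this coding is wasteful and feeds directly into the dimensionality problem listed earlier.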

Page 27: Barcelona sabatica

Theory; but how about applications?!

Page 28: Barcelona sabatica

β-link and β-hairpins from sequence

Values in %; HLVQ first, MLP in parentheses.

                 Predicted β-link   Predicted β-hairpin
Real β-link      88.4 (79.4)        11.6 (20.6)
Real β-hairpin   12.5 (16.1)        87.5 (83.9)

Page 29: Barcelona sabatica

Prediction of all loop types from sequence alone

         β-lk   α-β    β-hp   β-α    α-α
β-lk     45.9   28.5    3.7   19.8    2.1
α-β       8.8   67.4    1.2   18.0    4.6
β-hp      0.4    0.9   96.1    2.1    0.5
β-α       4.4    6.2    2.4   79.5    7.6
α-α       4.0   15.7    1.3   20.3   58.6
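A short sketch for summarizing the table above, with the loop-type labels written out with their Greek letters. It assumes that rows are the real classes and columns the predicted ones (each row sums to roughly 100, so the entries read as percentages of each real class). The unweighted mean is a choice made here, not a figure from the slides.

```python
LABELS = ["β-lk", "α-β", "β-hp", "β-α", "α-α"]

# Values copied from the slide.
# ASSUMPTION: rows = real class, columns = predicted class, entries in %.
CONFUSION = [
    [45.9, 28.5, 3.7, 19.8, 2.1],
    [8.8, 67.4, 1.2, 18.0, 4.6],
    [0.4, 0.9, 96.1, 2.1, 0.5],
    [4.4, 6.2, 2.4, 79.5, 7.6],
    [4.0, 15.7, 1.3, 20.3, 58.6],
]

# The diagonal holds the share of each real class that was predicted correctly.
per_class = {label: CONFUSION[i][i] for i, label in enumerate(LABELS)}
mean_correct = sum(per_class.values()) / len(per_class)  # unweighted mean
best = max(per_class, key=per_class.get)
worst = min(per_class, key=per_class.get)

print(per_class)
print(f"mean {mean_correct:.1f}%, best {best}, worst {worst}")
```

Read this way, β-hairpins are recognized most reliably and β-links least, with β-links most often mistaken for α-β loops.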

Page 30: Barcelona sabatica

What does it all mean?

Given a loop residue sequence, we can (usually) identify its native structure.

Not ab initio: We cannot tell the structure of a novel sequence.

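HLVQ is superior to MLP: on the β-link / β-hairpin task it classifies 88.4% and 87.5% correctly, against 79.4% and 83.9% for MLP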

Page 31: Barcelona sabatica

Future Work

Better coding of amino acids

Larger sequences / low complexity

Going beyond structure

A clever alphabet that exploits similarities

Multiobjective Genetic Algorithms

Page 32: Barcelona sabatica

Beyond Multiple Alignments

• Alignments are good … but expensive and boring ...

• Information contained in a multiple alignment can, in principle, be expressed using an adequate amino acid coding scheme

• How? Sensibility

Genetic Algorithm
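The slide names a genetic algorithm as the route to a better coding but gives no details. The following is a generic sketch only, assuming each amino acid is given a short real-valued code and that some score function rates a candidate coding. In practice that score would presumably be classifier performance; the `toy_fitness` used here is a meaningless stand-in, and every parameter value is made up.

```python
import random


def evolve(fitness, genome_len, pop_size=20, generations=30, sigma=0.1, seed=0):
    """Minimal elitist genetic algorithm over real-valued genomes."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)    # one-point crossover
            child = mum[:cut] + dad[cut:]
            children.append([g + rng.gauss(0, sigma) for g in child])  # mutation
        pop = parents + children                  # parents survive unchanged
    best = max(pop, key=fitness)
    return best, history + [fitness(best)]


def toy_fitness(genome):
    """Placeholder score (NOT a real coding-quality measure)."""
    return -sum(g * g for g in genome)


# Hypothetical layout: 20 amino acids x 2 code dimensions = 40 genes.
best, history = evolve(toy_fitness, genome_len=20 * 2)
print(history[0], history[-1])
```

Because the parents are carried over unchanged, the best score never gets worse from one generation to the next.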

Page 33: Barcelona sabatica

Coded Amino Acids

Alanine (A) Arginine (R) Asparagine (N) Aspartic Acid (D) Cysteine (C)

Glutamic Acid (E) Glutamine (Q) Glycine (G) Histidine (H) Isoleucine (I)

Leucine (L) Methionine (M) Lysine (K) Phenylalanine (F) Proline (P)

Serine (S) Threonine (T) Tryptophan (W) Tyrosine (Y) Valine (V)

http://www.chemie.fu-berlin.de/chemistry/bio/

Page 34: Barcelona sabatica

ArchDB database

Protein Data Bank (PDB), http://www.rcsb.org, contains ~25 000 proteins with known structure, versus ~10^6 entries in SWISS-PROT

ArchDB ~ 20 000 classified loops