barcelona sabatica
Post on 18-Nov-2014
Protein loop classification using Artificial Neural
Networks
Armando Vieira (1) and Baldomero Oliva (2)
(1) ISEP and Centro de Física Computacional, Coimbra, Portugal, www.defi.isep.ipp.pt/~asv
(2) Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain
XXI: the century of BIO
BIOINFORMATICS: joining two worlds apart
Outline
Brief review of protein structure
Statement of the problem and why it is so hard
Data pre-processing, corrections, updates and beyond multiple alignments…
Neural Networks in protein structure prediction
HLVQ
Results and future work
Proteins
All proteins are chains of 20 amino acids
Not all chains of amino acids are proteins
Fold rapidly and repeatedly
Proteins are the machinery of life
Essential to all (known) organisms
The Gist of it
Amino acid sequence
Physical structure
Function
Typical globular protein
MEKKEFHIVAETGIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ
Coarse-Grained Model
+180
b b b p o M e e e
b b b p o M M e e
b b b p . l l s e
a a a T . l l g N
N a a a . U l g N
N a a a . U g g N
I a a a . G G G I
e F F F o e e e e
b b b p o e e e e
-180
-180 +180
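The grid above can be read as a lookup table from backbone angles to a coarse-grained letter. The following is a minimal sketch of that lookup, not the authors' implementation. It assumes that the 9 x 9 grid covers -180° to +180° in equal 40° bins, that the horizontal axis is φ and the vertical axis is ψ (the usual Ramachandran convention, not stated on the slide), and that the top row corresponds to +180°. The names `coarse_letter` and `structure_string` are hypothetical.

```python
# Coarse-grained Ramachandran grid, copied row by row from the slide.
# Assumption: the top row is psi near +180, the leftmost column is phi near -180.
GRID = [
    "bbbpoMeee",
    "bbbpoMMee",
    "bbbp.llse",
    "aaaT.llgN",
    "Naaa.UlgN",
    "Naaa.UggN",
    "Iaaa.GGGI",
    "eFFFoeeee",
    "bbbpoeeee",
]
BIN_WIDTH = 360.0 / len(GRID)  # 40 degrees per cell (assumed equal bins)


def wrap(angle):
    """Map any angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def coarse_letter(phi, psi):
    """Return the grid letter for one (phi, psi) pair in degrees."""
    col = int((wrap(phi) + 180.0) // BIN_WIDTH)
    row_from_bottom = int((wrap(psi) + 180.0) // BIN_WIDTH)
    row = len(GRID) - 1 - row_from_bottom  # row 0 is the top of the plot
    return GRID[row][col]


def structure_string(angles):
    """Turn a list of (phi, psi) pairs into a string of coarse letters."""
    return "".join(coarse_letter(phi, psi) for phi, psi in angles)
```

Under these assumptions a typical helical pair such as (-60°, -45°) falls in an `a` cell and an extended pair such as (-120°, 130°) falls in a `b` cell.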
Ramachandran Alphabet
[Figure: Ramachandran plot with axes φ and ψ, each running from -180° to 180°; regions labelled A, B, E and G]
5-letter alphabet
Residue Sequence
MEKKEFHIVAETGIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ...
3° Structure
ACCDECBAABDECBDABCDBEABDBCBDBAEBDBDBAEBABDCBBDBADDCBDBCBDBEBDBCBBDCAABDEDCDCEAABACAAAADC…
What shall we do?
• Ab initio: Quantum Mechanics + big computers + large # configurations = huge problems…
• Machine Learning: use known cases to learn a suitable map: sequence → structure
Machine Learning Approach
Artificial Neural Networks
• A problem-solving paradigm modeled after the physiological functioning of the human brain.
• Neurons in the brain are modeled by computational nodes.
• The firing of a neuron is modeled by input, output, and threshold functions.
• The network “learns” based on problems to which answers are known (supervised learning).
• The network can then produce answers to entirely new problems of the same type.
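To make the supervised-learning idea concrete, here is a minimal sketch of a single threshold node trained with the classic perceptron rule on a toy problem (logical AND). It illustrates the general principle only and is not the network used in this work. The toy data, the learning rate and the function names are all assumptions made for illustration.

```python
def step(x):
    """Threshold function: the node fires (1) when its net input is >= 0."""
    return 1 if x >= 0 else 0


def predict(weights, bias, inputs):
    """Output of one threshold node for a given input vector."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(net)


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron rule: nudge weights whenever the known answer is missed."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Toy training set with known answers (hypothetical example: logical AND).
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
```

After training, `predict(weights, bias, (1, 1))` fires while the other three inputs do not. A real loop classifier needs hidden layers and many more inputs, which is where the overfitting risk comes from.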
Neural Networks
[Figure: network diagram with Input Layer, Hidden Layers and Output Layer]
Overfitting – high risk!
A less complicated hypothesis has a lower error rate
Hidden Layer Vector Quantization (HLVQ)
[Figure: two panels, "Traditional NN" and "HLVQ", each showing clusters of x and o points; one point is labelled z]
Main advantage: detect and correct prediction for outliers
Loops, loops everywhere!!!
Look for a loop…
Geometry of the Motif
Loop Types
β-α: β-strand - α-helix
α-α: α-helix - α-helix
α-β: α-helix - β-strand
β-hairpin: β-strand - β-strand
β-link: β-strand - β-strand
Similar conformation: aa{b}aa / aa{p}aa
Identical geometry: (4,6)(0,45)(45,90)(180,225)
1.3.1 aa{p}aa
1.1.2 aa{b}aa
Pro 75%
Ser 75%
© Baldomero Oliva
Class
ArchDB database
~ 20 000 loops classified into ~ 3000 classes.
Example class label: EE-3.4.1
Label format: loop type - loop size . consensus . motif
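A class label in the format shown on this slide can be split into its four fields with a few lines of code. This is a minimal sketch that assumes the three fields after the hyphen are always integers. The names `LoopClass` and `parse_loop_class` are hypothetical and are not part of any ArchDB tool.

```python
from typing import NamedTuple


class LoopClass(NamedTuple):
    """Fields of a label of the form 'type-size.consensus.motif'."""
    loop_type: str
    loop_size: int
    consensus: int
    motif: int


def parse_loop_class(label: str) -> LoopClass:
    """Split a label such as 'EE-3.4.1' into its four fields."""
    loop_type, _, rest = label.partition("-")
    # Assumption: size, consensus and motif are always integers.
    size, consensus, motif = (int(part) for part in rest.split("."))
    return LoopClass(loop_type, size, consensus, motif)


example = parse_loop_class("EE-3.4.1")
```

Here `example.loop_type` is `"EE"` and `example.loop_size` is `3`. Grouping labels by `loop_type` alone gives the coarser task of predicting only the loop type.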
TASK: classify a loop from sequence alone
If not possible, get as much information as possible
Problems
• Coding of amino acids
• Huge search space, sparsely populated
• How to assign the loop classes?
• High dimensionality → large networks → poor generalization
Amino acid coding: the classical way
A → (1, 0, …0)
C → (0, 1, …0)
Y → (0, 0, …1)
Useful but not efficient!!!
I am working to improve it…
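The classical coding above is a one-hot (orthogonal) encoding: each residue becomes a vector of 20 entries with a single 1. Below is a minimal sketch of it. It assumes the 20 standard residues are ordered alphabetically by one-letter code, which matches the slide's A first, C second, Y last. The function names are hypothetical.

```python
# The 20 standard amino acids, ordered alphabetically by one-letter code
# (an assumed ordering; it agrees with A first, C second and Y last).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a 20-element vector with a single 1 at the residue's position."""
    vector = [0] * len(AMINO_ACIDS)
    vector[INDEX[residue]] = 1
    return vector


def encode_sequence(sequence):
    """Concatenate the one-hot vectors of every residue in the sequence."""
    encoded = []
    for residue in sequence:
        encoded.extend(one_hot(residue))
    return encoded
```

A loop of n residues becomes 20n network inputs, almost all of them zero. This is one reason the coding is useful but not efficient: the input dimension grows quickly and every pair of residues looks equally dissimilar.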
Theory; but how about applications?!
β-link and β-hairpins from sequence
HLVQ (MLP)        Predicted β-link    Predicted β-hairpin
Real β-link       88.4 (79.4)         11.6 (20.6)
Real β-hairpin    12.5 (16.1)         87.5 (83.9)
Prediction of all loop types from sequence alone
        β-lk    α-β     β-hp    β-α     α-α
β-lk    45.9    28.5     3.7    19.8     2.1
α-β      8.8    67.4     1.2    18.0     4.6
β-hp     0.4     0.9    96.1     2.1     0.5
β-α      4.4     6.2     2.4    79.5     7.6
α-α      4.0    15.7     1.3    20.3    58.6
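Each row of the table above sums to roughly 100, which is what a row-normalised confusion matrix looks like. The sketch below shows how such a table could be computed from lists of real and predicted classes. It assumes that rows are the real classes and columns the predicted ones, as in the two-class table earlier. The function name and the toy labels are hypothetical and are not the data behind these results.

```python
from collections import Counter


def confusion_percentages(real, predicted, classes):
    """Row-normalised confusion matrix: table[real_class][predicted_class] in %."""
    counts = Counter(zip(real, predicted))
    table = {}
    for r in classes:
        row_total = sum(counts[(r, p)] for p in classes)
        table[r] = {
            p: (100.0 * counts[(r, p)] / row_total if row_total else 0.0)
            for p in classes
        }
    return table


# Toy labels for illustration only (hypothetical, not the slide's data).
real_labels = ["hp", "hp", "lk", "lk", "lk", "lk"]
predicted_labels = ["hp", "lk", "lk", "lk", "lk", "hp"]
table = confusion_percentages(real_labels, predicted_labels, ["lk", "hp"])
```

The diagonal entries `table[c][c]` are the per-class success rates. The off-diagonal entries show which loop types get confused with each other.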
What’s it all mean?
Given a loop residue sequence, we can (usually) identify its native structure.
Not ab initio: We cannot tell the structure of a novel sequence.
HLVQ is superior to MLP
Future Work
Better coding of amino acids
Larger sequences / low complexity
Going beyond structure
Clever alphabet that exploits similarities
Multiobjective Genetic Algorithms
Beyond Multiple Alignments
• Alignments are good … but expensive and boring ...
• Information contained in a multiple alignment can, in principle, be expressed using an adequate amino acid coding scheme
• How?
Sensibility
Genetic Algorithm
Coded Amino Acids
Alanine (A) Arginine (R) Asparagine (N) Aspartic Acid (D) Cysteine (C)
Glutamic Acid (E) Glutamine (Q) Glycine (G) Histidine (H) Isoleucine (I)
Leucine (L) Methionine (M) Lysine (K) Phenylalanine (F) Proline (P)
Serine (S) Threonine (T) Tryptophan (W) Tyrosine (Y) Valine (V)
http://www.chemie.fu-berlin.de/chemistry/bio/
ArchDB database
Protein Data Bank (PDB), http://www.rcsb.org, contains ~ 25 000 proteins with known structure, out of ~ 10^6 entries in SWISS-PROT
ArchDB ~ 20 000 classified loops