![Page 1: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/1.jpg)
Protein Folding Pathway Prediction
Supervised by
Prof. Ibrahim M.El-Henawy
Dr. Ahmed H.Kamal
Dr. Hisham Al-Shishiny
by
Haitham Ahmad Gamal
![Page 2: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/2.jpg)
• Problem Statement
• Motivation
• Approach
• Previous Work
• Biological Background
• What Affects Folding
• Why is it difficult
• Data Set
• Methodology (the 4 stages)
• Hypothesis (formally stated)
• Results
• Conclusion
![Page 3: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/3.jpg)
Proteins are the most vital agents in living bodies.
Their function is what concerns scientists.
Hydrophobicity → 3D Structure → Function
Much effort has gone into structure prediction, but with limited success. Results are:
• premature, due to the huge conformational search space;
• or insufficiently accurate, due to simplifications.
![Page 4: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/4.jpg)
![Page 5: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/5.jpg)
Knowledge of how a protein folds enables us to understand how it functions.
With this level of understanding we can affect a protein either by enhancement or by suppression.
Drugs can be built to affect certain proteins directly or through other proteins interacting with the protein under investigation.
![Page 6: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/6.jpg)
The approach used in this study is a statistical machine learning approach. We use this approach to answer the previous questions.
Clustering → Distribution Fitting
![Page 7: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/7.jpg)
In our study we are not developing a prediction algorithm. We are proving some hypotheses that can improve several types of prediction algorithms.
Prediction algorithms/techniques can be classified based on different criteria:
• Ab initio / Homology
• On-lattice / Off-lattice
• Heuristic / Statistics
• Protein-based / Subsequence-based
Our study fits in the coloured classes across all these criteria.
![Page 8: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/8.jpg)
![Page 9: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/9.jpg)
![Page 10: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/10.jpg)
The tertiary structure is the minimum free energy structure of a protein (for single chain proteins)
![Page 11: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/11.jpg)
It has been proven that the function of a protein depends on its 3D structure, not its primary structure.
The most effective factor in folding proteins (especially globular proteins) is the hydrophobicity of their constituent amino acids.
Amino acids are either charged (soluble) or contain aromatic groups (insoluble).
The set of hydrophobicity values of all the 20 known amino acids is called the hydrophobicity scale.
![Page 12: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/12.jpg)
![Page 13: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/13.jpg)
An exact simulation of a short peptide folding may take months on a supercomputer.
The number of possible conformations is huge: $20^{l}$, such that $l$ is the length of the peptide.
Scientists proved that solving the problem for the HP model (a simplified model) is NP-complete.
Current technologies cannot keep pace with this God-created miracle.
![Page 14: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/14.jpg)
A collection of more than 1000 proteins is taken randomly from the SCOP protein databank.
Each SCOP entry (file) represents one protein with all its features, including its exact atom coordinates.
Angles are extracted using the three-dimensional coordinates of each Cα atom.
![Page 15: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/15.jpg)
Angle Extraction
Chopping to Subsequences
K-means Clustering
Distribution Fitting
![Page 16: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/16.jpg)
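[Figure: atom records from a protein file, with the following fields labelled: atom serial number, residue name, residue sequence number, X coordinate, Y coordinate, Z coordinate. The records of the 3rd, 4th and 5th residues are marked in turn.]
Continue doing the same until the end.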
![Page 17: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/17.jpg)
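The angle that lies between each three consecutive Cα atoms is called angle θ.
As shown in the figure, the angles are calculated at each Cα atom starting from $C\alpha_{1}$ until $C\alpha_{L-1}$, where $L$ is the protein length.
[Figure: a backbone trace with the angles θ1, θ2, θ3, ... marked at successive Cα atoms, and a close-up of $C\alpha_{i-1}$, $C\alpha_{i}$ and $C\alpha_{i+1}$, each with its (x, y, z) coordinates and the angle θ between them.]
Let $a$ be the vector from $C\alpha_{i}$ to $C\alpha_{i-1}$: $a = C\alpha_{i-1} - C\alpha_{i}$
Let $b$ be the vector from $C\alpha_{i}$ to $C\alpha_{i+1}$: $b = C\alpha_{i+1} - C\alpha_{i}$
θ can then be calculated using the cosine law:
$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$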
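The angle calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function names `backbone_angle` and `backbone_angles` are hypothetical, coordinates are assumed to be plain (x, y, z) tuples, and the result is reported in degrees.

```python
import math

def backbone_angle(prev_ca, ca, next_ca):
    """Angle in degrees at `ca` between the vectors to its two neighbours.

    Hypothetical helper: a points from ca to prev_ca, b from ca to next_ca.
    """
    a = [p - c for p, c in zip(prev_ca, ca)]
    b = [q - c for q, c in zip(next_ca, ca)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

def backbone_angles(ca_coords):
    """One angle per interior C-alpha atom of a chain of coordinates."""
    return [
        backbone_angle(ca_coords[i - 1], ca_coords[i], ca_coords[i + 1])
        for i in range(1, len(ca_coords) - 1)
    ]
```

For a chain of L coordinates this sketch returns L - 2 angles, since the two terminal atoms have only one neighbour each.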
![Page 18: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/18.jpg)
After all the angles of all the proteins are extracted, each protein sequence is divided into subsequences of length n.
A subsequence must contain an odd number of residues.
A sliding window technique is used to chop the whole protein sequence into pieces.
The value of n is crucial in our study as will be shown in the results section.
![Page 19: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/19.jpg)
Let’s take n = 5 as an example
[Figure: a chain of residues aa0 to aa8 with the backbone angles Θ0 to Θ7 marked along it, and a window of n = 5 residues sliding over the chain.]
The first subsequence runs from aa0 to aa4, and the effect of this subsequence on the central angle Θ1 is what concerns us in this study.
Similarly, the effect of every following subsequence, running in general from $aa_{i}$ to $aa_{i+n-1}$, on the measurement of the central angle $\Theta_{i+\lfloor n/2 \rfloor-1}$ is studied.
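The sliding window and the central-angle index given above can be sketched as follows. The function name `windows_with_central_angle` is hypothetical, and the sketch assumes that `angles[k]` holds Θk as numbered in the example (so Θ0 sits at residue aa1); it is an illustration of the indexing, not the study's implementation.

```python
def windows_with_central_angle(residues, angles, n):
    """Pair every length-n window of residues with its central angle.

    Window i covers residues[i : i + n]; its central angle is taken as
    angles[i + n // 2 - 1], following the index formula in the text.
    """
    if n % 2 == 0:
        raise ValueError("a subsequence must contain an odd number of residues")
    pairs = []
    for i in range(len(residues) - n + 1):
        k = i + n // 2 - 1
        if 0 <= k < len(angles):  # skip windows whose central angle is undefined
            pairs.append((residues[i:i + n], angles[k]))
    return pairs
```

With n = 5, the first window covers residues 0 to 4 and is paired with `angles[1]`, matching the Θ1 example above.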
![Page 20: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/20.jpg)
Since hydrophobicity is the main factor affecting protein folding, the centroids were determined accordingly.
The choice of centroids is meant to cover all the possible hydrophobicity patterns of a subsequence of length n.
Let’s take n = 3 as an example: each residue is either hydrophobic or hydrophilic, so the patterns range from all hydrophilic to all hydrophobic.
The number of initial centroids is $2^{n}$ (8 for n = 3).
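The set of hydrophobicity patterns that seeds the clustering can be enumerated with a few lines of Python. This is only a sketch: the letters `P` (hydrophilic) and `H` (hydrophobic), the function name `hydrophobicity_patterns`, and the ordering of the patterns are assumptions made for illustration, and the slides do not state how a pattern index maps to a centroid label Ci.

```python
from itertools import product

def hydrophobicity_patterns(n):
    """All 2**n hydrophilic/hydrophobic patterns for a window of n residues.

    'P' marks a hydrophilic residue and 'H' a hydrophobic one; the order
    (all-P first, all-H last) is an illustrative choice.
    """
    return ["".join(pattern) for pattern in product("PH", repeat=n)]
```

Calling `hydrophobicity_patterns(3)` yields eight patterns, starting with the all-hydrophilic window and ending with the all-hydrophobic one.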
![Page 21: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/21.jpg)
Clustered as well as unclustered data are compared, using the Kolmogorov-Smirnov test, against 66 continuous probability distributions, which are:
Beta, Burr, Burr (4P), Cauchy, Chi-Squared, Chi-Squared (2P), Dagum, Dagum (4P), Erlang, Erlang (3P), Error, Error Function, Exponential, Exponential (2P), Fatigue Life, Fatigue Life (3P), Frechet, Frechet (3P), Gamma, Gamma (3P), Gen. Extreme Value, Gen. Gamma, Gen. Gamma (4P), Gen. Logistic, Gen. Pareto, Gumbel Max, Gumbel Min, Hypersecant, Inv. Gaussian, Inv. Gaussian (3P), Johnson SB, Johnson SU, Kumaraswamy, Laplace, Levy, Levy (2P), Log-Gamma, Log-Logistic, Log-Logistic (3P), Log-Pearson 3, Logistic, Lognormal, Lognormal (3P), Nakagami, Normal, Pareto, Pareto 2, Pearson 5, Pearson 5 (3P), Pearson 6, Pearson 6 (4P), Pert, Phased Bi-Exponential, Phased Bi-Weibull, Power Function, Rayleigh, Rayleigh (2P), Reciprocal, Rice, Student's t, Triangular, Uniform, Wakeby, Weibull and Weibull (3P).
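To make the comparison step concrete, here is a small sketch of the one-sample Kolmogorov-Smirnov statistic and of choosing the candidate distribution with the smallest statistic. It uses only the Python standard library; the helper names (`ks_statistic`, `normal_cdf`, `uniform_cdf`, `best_fit`) are hypothetical, only two candidate distributions with fixed parameters are shown, and parameter estimation for each distribution is not covered.

```python
import math

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def normal_cdf(mu, sigma):
    """CDF of a normal distribution with fixed parameters."""
    return lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def uniform_cdf(lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    return lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo)))

def best_fit(sample, candidates):
    """Return the candidate name with the smallest KS statistic, plus all scores."""
    scores = {name: ks_statistic(sample, cdf) for name, cdf in candidates.items()}
    return min(scores, key=scores.get), scores
```

A smaller statistic means the fitted curve stays closer to the data, which is how the KS statistic values reported in the results can be read.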
![Page 22: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/22.jpg)
![Page 23: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/23.jpg)
n = 3

| Distribution | Centroids in this distribution (i = Ci) |
|---|---|
| Burr | 1, 4 |
| Burr (4P) | 7 |
| Gen. Extreme Value | 6 |
| Gen. Pareto | 2, 3, 5 |
| Johnson SB | 0 |
![Page 24: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/24.jpg)
n = 5

| Distribution | Centroids in this distribution (i = Ci) |
|---|---|
| Dagum (4P) | 0, 5, 7, 19 |
| Gumbel Min | 1, 2, 3, 17, 20 |
| Gen. Extreme Value | 4, 32 |
| Burr (4P) | 6, 8, 10, 11, 14, 18, 21, 22, 23, 24, 27, 30, 31 |
| Weibull (3P) | 9, 12, 13, 15, 16, 25, 26, 28, 29 |
![Page 25: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/25.jpg)
n = 7

| Distribution | Centroids in this distribution (i = Ci) |
|---|---|
| Weibull (3P) | 3, 21, 79 |
| Burr (4P) | 20, 32, 40, 60, 67, 71, 74, 75, 83, 85, 105 |
| Dagum | 4, 80 |
| Dagum (4P) | 41, 90 |
| Gen. Gamma (4P) | 69, 84, 106 |
| Gen. Logistic | 2, 6, 7, 9, 12, 14, 15, 19, 33, 34, 35, 36, 37, 45, 46, 47, 49, 79, 87, 89, 94, 95, 107, 117, 125 |
| Gumbel Min | 66 |
| Log-Logistic | 42, 116, 118 |
| Wakeby | 1, 5, 8, 10, 11, 13, 16, 17, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 38, 39, 43, 44, 48, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 68, 70, 72, 73, 76, 77, 78, 81, 82, 86, 88, 91, 92, 93, 96, 98, 99, 100, 101, 102, 103, 104, 108, 109, 110, 111, 112, 113, 114, 115, 119, 120, 121, 122, 123, 124, 126, 127 |
![Page 26: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/26.jpg)
Tricky: KS-statistic values alone are not enough for a complete interpretation.

| | KS statistic for unclustered data | KS statistic for clustered data |
|---|---|---|
| n = 3 | 0.09041 | 0.0937 |
| n = 5 | 0.012 | 0.0243 |
| n = 7 | 0.013 | 0.0202 |
![Page 27: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/27.jpg)
| | No. of rejected values for unclustered data | No. of rejected values for clustered data |
|---|---|---|
| n = 3 | All 5 values | All 5 values |
| n = 5 | All 5 values | 2.94 |
| n = 7 | All 5 values | Zero |

The number of tested critical values is 5.
The number of rejected critical values shows that the fits of the unclustered data are fake fits.
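The rejected-value count can be illustrated with a short sketch that compares a KS statistic against five critical values. The significance levels used here (0.2, 0.1, 0.05, 0.02, 0.01) and the large-sample approximation for the critical value are assumptions for illustration; the slides do not say which levels or which tables were used. The names `ks_critical_value` and `count_rejections` are hypothetical.

```python
import math

# Assumed significance levels; the slides only say that five values were tested.
ALPHAS = (0.2, 0.1, 0.05, 0.02, 0.01)

def ks_critical_value(alpha, n):
    """Large-sample approximation of the KS critical value for sample size n."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)

def count_rejections(d, n, alphas=ALPHAS):
    """Number of significance levels at which a KS statistic d rejects the fit."""
    return sum(1 for alpha in alphas if d > ks_critical_value(alpha, n))
```

Because the critical value shrinks as the sample grows, a very large unclustered sample can be rejected at every level even when its KS statistic looks small. A fractional entry such as 2.94 would be consistent with averaging this count over all clusters, though that reading is also an assumption.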
![Page 28: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/28.jpg)
For the clustered data, the KS statistic shows that the larger the value of n, the better the fit.
![Page 29: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/29.jpg)
Looking deeper at the rejected value test, all 5 test values are rejected for n = 3, while n = 7 gives zero rejected values, which supports our hypothesis.
![Page 30: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/30.jpg)
It is now clear that there exists a direct relationship between the hydrophobicity of the residues of a subsequence (local neighbours) and the measurements of the backbone angles. Classifying a subsequence into one of the available clusters gives good insight into the angle measurements and, consequently, the structure of the subsequence.
The length of the subsequence is also an effective factor in the angle measurement prediction process. Longer subsequences achieve better fits to one of the standard continuous probability distributions.
![Page 31: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/31.jpg)
These results can be used to guide the search process in a complete protein structure prediction algorithm.
The local angle-hydrophobicity relationship can be combined with heuristic techniques such as genetic algorithms to restrict the initial population to statistically familiar conformations.
Approximations of our results can be applied to crystalline lattice protein models such as the cuboctahedron lattice model, which allows the use of several possible angles: 60°, 90°, 120° and 180°.
It is possible to investigate applying the same approach to subsequences longer than 7 residues and to try to minimize the required processing time.
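As an illustration of the lattice approximation mentioned above, a measured or predicted angle could be mapped to the nearest angle the lattice allows. This is a hypothetical sketch, not part of the study: the function name `snap_to_lattice` and the nearest-angle rule are assumptions.

```python
# Angles (in degrees) allowed by the lattice model mentioned in the text.
LATTICE_ANGLES = (60.0, 90.0, 120.0, 180.0)

def snap_to_lattice(theta_degrees, allowed=LATTICE_ANGLES):
    """Return the allowed lattice angle closest to theta_degrees."""
    return min(allowed, key=lambda angle: abs(angle - theta_degrees))
```

Other approximation rules are possible, for example sampling an allowed angle in proportion to the fitted distribution of the matching cluster.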
![Page 32: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/32.jpg)
Title
A CENTRAL-3-RESIDUES-BASED CLUSTERING APPROACH FOR STUDYING THE EFFECT OF HYDROPHOBICITY ON PROTEIN BACKBONE ANGLES
Authors
Prof. Ibrahim M.El-Henawy, Dr. Ahmed H.Kamal, Dr. Hisham Al-Shishiny, Haitham Gamal
Published in the Egyptian Computer Science Journal (ECS Journal), ISSN-1110-2586, Volume 32, Number 1, May 2009.
![Page 33: Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal](https://reader035.vdocument.in/reader035/viewer/2022062322/56649cef5503460f949bd83f/html5/thumbnails/33.jpg)