Carnegie Mellon School of Computer Science
Protein Quaternary Fold Recognition Using Conditional Graphical Models
Yan Liu, Jaime Carbonell, Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
IJCAI-2007 – Hyderabad, India
Snapshot of Cell Biology
[Figure labels: protein sequence (DSCTFTTAAAAKAGKAKAG), protein structure, protein function. Image source: Nobelprize.org]
Example Protein Structures
[Figure panels: adenovirus fibre shaft; virus capsid]
Triple beta-spiral fold in Adenovirus Fiber Shaft
Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Resolving protein structures experimentally in vitro by crystallography is very expensive, and NMR can only resolve very small proteins
• The gap between known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
• Therefore we need to predict structures in silico
Quaternary Folds and Alignments
• Protein fold: identifiable regular arrangement of secondary structural elements
• Thus far, a limited number of protein folds have been discovered (~1000)
• Very little research on quaternary folds: complex structures and few labeled data
• Quaternary fold recognition
  Seq 1: APA FSVSPA … SGACGP ECAESG
  Seq 2: DSCTFT…TAAAAKAGKAKCSTITL
Biology task → AI task
  Protein fold → Pattern to be induced
  Membership and non-membership proteins → Training data (seq-struc pairs + physics)
  Will the protein take the fold? → Does the pattern appear in the testing sequence?
Previous Work
• Sequence similarity perspective
  Sequence similarity searches, e.g. PSI-BLAST [Altschul et al, 1997]
  Profile HMMs, e.g. HMMER [Durbin et al, 1998] and SAM [Karplus et al, 1998]
  Window-based methods, e.g. PSIPRED [Jones, 2001]
• Physical forces perspective
  Homology modeling or threading, e.g. Threader [Jones, 1998]
• Structural biology perspective
  Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al, 2001]
• Limitations
  Generative models based on rough approximations of free energy perform very poorly on complex structures
  Very hard to generalize due to built-in constants and fixed features
  Fail to capture structural properties and long-range dependencies
Conditional Random Fields
• Hidden Markov model (HMM) [Rabiner, 1989]
$$P(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} P(x_i \mid y_i)\, P(y_i \mid y_{i-1})$$
• Conditional random fields (CRFs) [Lafferty et al, 2001]
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_0} \exp\Big(\sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, i, y_{i-1}, y_i)\Big)$$
$$f_k(\mathbf{x}, i, y_{i-1}, y_i) = f'_k(\mathbf{x}, i)\, I(y_{i-1} = s,\ y_i = s')$$
  Model conditional probability directly (discriminative models, directly optimizable)
  Allow arbitrary dependencies in observation
  Adaptive to different loss functions and regularizers
  Promising results in multiple applications
  But, need to scale up (computationally) and extend to long-distance dependencies
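As a rough illustration of the linear-chain CRF definition on this slide, the sketch below scores a label sequence with weighted features and normalizes with a forward recursion. The label alphabet `STATES`, the two toy features and the weights are hypothetical choices made for the example; they are not taken from the talk.

```python
import itertools
import math

STATES = ("H", "C")  # hypothetical two-state label alphabet


def features(x, i, y_prev, y_cur):
    """Toy feature vector f(x, i, y_{i-1}, y_i); both features are illustrative."""
    return [
        1.0 if (x[i] == "A" and y_cur == "H") else 0.0,  # observation feature
        1.0 if y_prev == y_cur else 0.0,                  # transition feature
    ]


def local_score(x, i, y_prev, y_cur, weights):
    """sum_k lambda_k * f_k(x, i, y_{i-1}, y_i) for one position."""
    return sum(w * f for w, f in zip(weights, features(x, i, y_prev, y_cur)))


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(x, weights):
    """log of the normalizer, computed with the forward recursion."""
    alpha = {s: local_score(x, 0, None, s, weights) for s in STATES}
    for i in range(1, len(x)):
        alpha = {
            s: log_sum_exp([alpha[p] + local_score(x, i, p, s, weights) for p in STATES])
            for s in STATES
        }
    return log_sum_exp(list(alpha.values()))


def conditional_prob(x, y, weights):
    """P(y | x) = exp(sum of local scores) / normalizer."""
    score = sum(
        local_score(x, i, y[i - 1] if i > 0 else None, y[i], weights)
        for i in range(len(x))
    )
    return math.exp(score - log_partition(x, weights))


# Usage: the probabilities of all label sequences for one input sum to 1.
x, weights = "AGA", [1.5, 0.8]
total = sum(
    conditional_prob(x, y, weights)
    for y in itertools.product(STATES, repeat=len(x))
)
print(round(total, 6))
```

The forward recursion keeps normalization linear in the sequence length, which is the property that is lost once long-range edges are added to the graph.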
Our Solution: Conditional Graphical Models
• Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i} (M: number of segments; p_i, q_i, s_i: start position, end position and state of segment i)
• Feature definition
  Node feature
  $$f_k(w_i, \mathbf{x}) = f'_k(\mathbf{x}, p_i, q_i)\, I(s_i = s',\ q_i - p_i + 1 = d')$$
  Local interaction feature
  $$f_k(w_{i-1}, w_i, \mathbf{x}) = I(s_{i-1} = s,\ s_i = s',\ p_i = q_{i-1} + 1)$$
  Long-range interaction feature
  $$f_k(w_i, w_j, \mathbf{x}) = g'_k(\mathbf{x}, p_i, q_i, p_j, q_j)\, I(s_i = s,\ s_j = s')$$
[Figure labels: local dependency, long-range dependency]
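The three feature types on this slide can be sketched as small functions over segments. This is a minimal sketch under stated assumptions: positions are 0-based and inclusive, and the `Segment` class, the state names and the helpers `hydrophobic_fraction` and `same_length` (standing in for f' and g') are hypothetical.

```python
from dataclasses import dataclass

HYDROPHOBIC = set("VILFMWY")  # hypothetical residue grouping for the toy f'


@dataclass(frozen=True)
class Segment:
    p: int  # start position (0-based, inclusive)
    q: int  # end position (inclusive)
    s: str  # segment state


def hydrophobic_fraction(x, p, q):
    """Toy f'(x, p, q): fraction of hydrophobic residues in the segment."""
    seg = x[p:q + 1]
    return sum(c in HYDROPHOBIC for c in seg) / len(seg)


def same_length(x, p_i, q_i, p_j, q_j):
    """Toy g'(x, p_i, q_i, p_j, q_j): 1 if both segments have equal length."""
    return 1.0 if (q_i - p_i) == (q_j - p_j) else 0.0


def node_feature(w, x, state, length, f_prime):
    """f'(x, p, q) gated by the segment's state and length."""
    if w.s == state and w.q - w.p + 1 == length:
        return f_prime(x, w.p, w.q)
    return 0.0


def local_feature(w_prev, w, state, next_state):
    """Indicator on the states of two adjacent segments."""
    adjacent = w.p == w_prev.q + 1
    return 1.0 if (w_prev.s == state and w.s == next_state and adjacent) else 0.0


def long_range_feature(w_i, w_j, x, state_i, state_j, g_prime):
    """g'(...) gated by the states of two possibly distant segments."""
    if w_i.s == state_i and w_j.s == state_j:
        return g_prime(x, w_i.p, w_i.q, w_j.p, w_j.q)
    return 0.0


# Usage on a made-up sequence with two strands separated by a turn.
x = "GVIVGGAVLV"
w1, w2, w3 = Segment(1, 3, "B1"), Segment(4, 5, "T1"), Segment(7, 9, "B1")
print(node_feature(w1, x, "B1", 3, hydrophobic_fraction))
print(local_feature(w1, w2, "B1", "T1"))
print(long_range_feature(w1, w3, x, "B1", "B1", same_length))
```

Because features are defined on whole segments rather than single residues, segment length and the pairing of distant segments can be scored directly.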
Linked Segmentation CRF
• Node: secondary structure elements and/or simple folds
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of the joint labels y given x is defined as
$$P(\mathbf{y}_1, \ldots, \mathbf{y}_R \mid \mathbf{x}_1, \ldots, \mathbf{x}_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big(\sum_{k} \lambda_k f_k(\mathbf{x}_i, y_{i,j})\Big) \prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big(\sum_{l} \mu_l\, g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b})\Big)$$
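To make the node and edge terms of the L-SCRF concrete, the sketch below computes the unnormalized log-score of one joint labeling over two chains. Everything specific here is an assumption for illustration: the graph encoding (a node list plus index pairs for edges), the toy features `segment_length` and `same_state`, and the weight names `lam` and `mu`.

```python
def segment_length(x, seg):
    """Toy node feature: length of the segment (p, q, state)."""
    p, q, _state = seg
    return float(q - p + 1)


def same_state(x_a, x_b, seg_a, seg_b):
    """Toy edge feature: 1 if the two linked segments share a state."""
    return 1.0 if seg_a[2] == seg_b[2] else 0.0


def lscrf_log_potential(chains, nodes, edges, node_feats, edge_feats, lam, mu):
    """Sum of weighted node features plus weighted edge features.

    nodes: list of (chain_id, (p, q, state)); edges: list of node index pairs,
    which may connect segments within one chain or across chains.
    """
    total = 0.0
    for chain_id, seg in nodes:
        x = chains[chain_id]
        total += sum(w * f(x, seg) for w, f in zip(lam, node_feats))
    for a, b in edges:
        (chain_a, seg_a), (chain_b, seg_b) = nodes[a], nodes[b]
        total += sum(
            w * g(chains[chain_a], chains[chain_b], seg_a, seg_b)
            for w, g in zip(mu, edge_feats)
        )
    return total


# Usage: two made-up chains, one intra-chain edge and one inter-chain edge.
chains = {"A": "GVIVGG", "B": "AVLVGG"}
nodes = [("A", (1, 3, "B1")), ("A", (4, 5, "T1")), ("B", (1, 3, "B1"))]
edges = [(0, 1), (0, 2)]
value = lscrf_log_potential(
    chains, nodes, edges, [segment_length], [same_state], lam=[0.5], mu=[2.0]
)
print(value)
```

Exponentiating this value and dividing by the normalizer Z would give the conditional probability; computing Z exactly is the expensive part on graphs with inter-chain edges.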
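Linked Segmentation CRF (II)
• Classification:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}_c, \mathbf{y}_c)$$
• Training: learn the model parameters λ
  Minimize the regularized negative log-likelihood, i.e. maximize
  $$L_\lambda = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}_c, \mathbf{y}_c) - \log Z - \Omega(\|\lambda\|^2)$$
  Iterative search algorithms that seek the direction in which the empirical feature values agree with their expectations
  $$\frac{\partial L_\lambda}{\partial \lambda_k} = \sum_{c \in C_G} \Big(f_k(\mathbf{x}_c, \mathbf{y}_c) - E_{p(\mathbf{Y} \mid \mathbf{x})}\big[f_k(\mathbf{x}_c, \mathbf{Y}_c)\big]\Big) - \frac{\partial\, \Omega(\|\lambda\|^2)}{\partial \lambda_k} = 0$$
• Complex graphs result in huge computational complexity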
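The training rule on this slide (empirical feature values matched against their expectations) can be checked on a model small enough to enumerate. The sketch below is a toy under stated assumptions: binary labels, two made-up global features, and a squared-norm penalty `eta * ||lam||^2` standing in for the regularizer; none of these specifics come from the talk.

```python
import itertools
import math

LABELS = (0, 1)  # hypothetical binary labels


def feats(x, y):
    """Toy global features: label/observation matches and equal neighbours."""
    return [
        float(sum(xi == yi for xi, yi in zip(x, y))),
        float(sum(a == b for a, b in zip(y, y[1:]))),
    ]


def dot(lam, f):
    return sum(l * v for l, v in zip(lam, f))


def all_labelings(x):
    return list(itertools.product(LABELS, repeat=len(x)))


def log_z(x, lam):
    scores = [dot(lam, feats(x, y)) for y in all_labelings(x)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def objective(x, y, lam, eta):
    """Regularized log-likelihood with the assumed penalty eta * ||lam||^2."""
    return dot(lam, feats(x, y)) - log_z(x, lam) - eta * dot(lam, lam)


def gradient(x, y, lam, eta):
    """Empirical features minus expected features minus the penalty gradient."""
    logz = log_z(x, lam)
    expected = [0.0] * len(lam)
    for cand in all_labelings(x):
        p = math.exp(dot(lam, feats(x, cand)) - logz)
        for k, v in enumerate(feats(x, cand)):
            expected[k] += p * v
    return [e - ex - 2.0 * eta * l for e, ex, l in zip(feats(x, y), expected, lam)]


def predict(x, lam):
    """y* = argmax_y sum_k lam_k f_k(x, y), by exhaustive search."""
    return max(all_labelings(x), key=lambda y: dot(lam, feats(x, y)))


def train(x, y, eta=0.1, rate=0.2, steps=200):
    """Plain gradient ascent on the regularized log-likelihood."""
    lam = [0.0, 0.0]
    for _ in range(steps):
        lam = [l + rate * g for l, g in zip(lam, gradient(x, y, lam, eta))]
    return lam


# Usage: fit the two weights on one labeled example, then decode it.
x = y = (1, 1, 0, 0)
lam = train(x, y)
print(lam, predict(x, lam))
```

Exhaustive enumeration is only feasible for tiny label spaces; on the segmentation graphs used here both the expectation and the argmax require approximate inference.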
Approximate Inference of L-SCRF
• Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so…
• Reversible jump MCMC sampling [Green, 1995; Schmidler et al, 2001] with four types of Metropolis operators
  State switching
  Position switching
  Segment split
  Segment merge
• Simulated annealing reversible jump MCMC [Andrieu et al, 2000]
  Replace the sample with RJ MCMC
  Theoretically converges to the global optimum
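The four move types can be illustrated with a simplified annealed search over segmentations of a single sequence. This is a sketch, not the sampler from the talk: it uses a plain Metropolis acceptance rule with a geometric cooling schedule and omits the proposal-ratio correction that true reversible jump MCMC requires. The scoring rule, the state names and all parameter values are hypothetical.

```python
import math
import random

STATES = ("B", "T")  # hypothetical segment states


def score(segs, x):
    """Toy log-score: +1 per residue with the 'right' state, -0.5 per segment."""
    total = -0.5 * len(segs)
    for start, end, state in segs:
        for i in range(start, end + 1):
            wanted = "B" if x[i] in "VIL" else "T"
            total += 1.0 if state == wanted else 0.0
    return total


def propose(segs, rng):
    """Apply one of the four move types; return None if the move is invalid."""
    segs = [list(s) for s in segs]
    move = rng.choice(("state", "position", "split", "merge"))
    i = rng.randrange(len(segs))
    if move == "state":  # state switching
        segs[i][2] = rng.choice([s for s in STATES if s != segs[i][2]])
    elif move == "position":  # shift the boundary between segments i and i+1
        if i + 1 >= len(segs):
            return None
        new_end = segs[i][1] + rng.choice((-1, 1))
        if new_end < segs[i][0] or new_end + 1 > segs[i + 1][1]:
            return None
        segs[i][1], segs[i + 1][0] = new_end, new_end + 1
    elif move == "split":  # cut segment i into two
        start, end, state = segs[i]
        if start == end:
            return None
        cut = rng.randrange(start, end)  # last index of the left part
        segs[i:i + 1] = [[start, cut, state], [cut + 1, end, rng.choice(STATES)]]
    else:  # merge segments i and i+1, keeping the state of segment i
        if i + 1 >= len(segs):
            return None
        segs[i:i + 2] = [[segs[i][0], segs[i + 1][1], segs[i][2]]]
    return [tuple(s) for s in segs]


def anneal(x, steps=2000, t0=2.0, cooling=0.998, seed=0):
    """Annealed Metropolis search; returns the best segmentation seen."""
    rng = random.Random(seed)
    current = [(0, len(x) - 1, STATES[0])]
    cur_score = score(current, x)
    best, best_score = current, cur_score
    temp = t0
    for _ in range(steps):
        cand = propose(current, rng)
        if cand is not None:
            delta = score(cand, x) - cur_score
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current, cur_score = cand, cur_score + delta
                if cur_score > best_score:
                    best, best_score = current, cur_score
        temp *= cooling
    return best, best_score


# Usage on a made-up sequence.
best, best_score = anneal("GGVILGGVVLGG")
print(best, best_score)
```

Split and merge change the number of segments, which is why a sampler over a fixed-size graph is not enough; lowering the temperature turns the sampler into a search for a high-scoring segmentation.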
Experiments: Target Quaternary Fold
• Triple beta-spirals [van Raaij et al. Nature 1999]
Virus fibers in adenovirus, reovirus and PRD1
• Double barrel trimer [Benson et al, 2004]
Coat protein of adenovirus, PRD1, STIV, PBCV
Tertiary Fold Recognition: β-Helix fold
• Histogram and ranks for known β-helices against PDB-minus dataset
The chain graph model reduces the actual running time of the SCRF model by a factor of about 50
Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation
Discovery of New Potential β-helices
• Run the structural predictor to seek potential β-helices in the UniProt (structurally unresolved) databases
  Full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
• Verification on 3 proteins from different organisms whose structures were later resolved experimentally
  1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  1PXZ: The Major Allergen From Cedar Pollen
  GP14 of Shigella bacteriophage as a β-helix protein
No single false positive!
Experiment Results: Fold Recognition
[Figure panels: double barrel trimer; triple beta-spirals]
Experiment Results: Alignment Prediction
Triple beta-spirals
Four states: B1, B2, T1 and T2
Correct alignment: B1: i-o; B2: a-h
Predicted alignment: [figure panels B1 and B2]
Experiment Results: Discovery of New Membership Proteins
• Predicted membership proteins of triple beta-spirals can be accessed at
http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
• Membership proteins of double barrel-trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions
Conclusion
• Conditional graphical models for protein structure prediction
  Effective representation for protein structural properties
  Feasibility to incorporate different kinds of informative features
  Efficient inference algorithms for large-scale applications
• A major extension compared with previous work
  Knowledge representation through graphical models
  Ability to handle long-range interactions within one chain and between chains
• Future work
  Automatic learning of graph topology
  Applications to other domains
Graphical Models
• A graphical model is a graph representation of probability dependencies [Pearl 1993; Jordan 1999]
  Nodes: random variables
  Edges: dependency relations
• Directed graphical model (Bayesian networks)
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(x_i))$$
• Undirected graphical model (Markov random fields)
$$P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \Phi_c(x_c) = \frac{1}{Z} \exp\Big(\sum_{c \in C} H_c(x_c)\Big)$$