INFORMS 2004
Hyun-suk Yoon
Joel Sokol
School of Industrial and Systems Engineering
Georgia Institute of Technology
Optimization Approaches to HP Lattice Protein Folding
Table of contents
Introduction to Protein Folding
Integer Programming (IP) Approach
Introduction to Constraint Programming (CP)
CP Approach
Discussion
Protein
• Sequence of amino acids
• Size: 30 ~ 10,000 amino acids, a few hundred on average
• Folds quickly into a compact 3-D structure in a minimum-energy state.
• Exponential number of possible 3-D structures.
Problem description
How can we find a 3D structure of a protein
given a sequence of amino acids?
Motivation
1. Design drugs
• Most drugs work by attaching themselves to a protein.
• Knowing the 3-D shapes of proteins will help to design drugs.
2. Detect misfolding
• Proteins occasionally fail to fold into the correct 3-D shape.
• Misfolded proteins are known causes of a number of diseases, e.g., Alzheimer’s disease and Parkinson’s disease.
Protein folding
How to figure out protein folding
• Experimental techniques: X-ray crystallography and NMR spectroscopy
• Computational techniques: e.g., Folding@Home
Protein Data Bank (PDB)
• http://www.rcsb.org/pdb
• Worldwide repository for 3-D structure data of large molecules (proteins and nucleic acids).
HP model
• Each amino acid is classified as Hydrophobic (H) or Polar (P).
• 20 types of amino acids: 8 H’s and 12 P’s
Lattice model
• Locate each amino acid on a point of a cubic lattice.
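• Parity problem: on a cubic lattice, two amino acids an even number of positions apart in the sequence can never be adjacent; triangular or diagonal lattice models are alternatives.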
HP model and Lattice model
• HP model + Lattice model: the simplest protein model
- Advantage: enumeration techniques can be used to locate amino acids.
- Disadvantages: low resolution, no explicit local interactions, equal bond lengths
• Lau and Dill (1989): minimizing total energy in the HP lattice model = maximizing the number of H-H contacts.
HP lattice model
[Figure: example of HP lattice model. Legend: hydrophobic amino acid, polar amino acid, peptide bond, H-H contacts.]
Number of H-H contacts
= Number of adjacencies between hydrophobic amino acids
(except for peptide bonds)
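The counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `count_hh_contacts`, the 2-D square lattice, and the four-residue example chain are all assumptions made for the example.

```python
def count_hh_contacts(sequence, coords):
    """Count H-H contacts of a 2-D lattice fold, excluding peptide bonds.

    sequence: string over {"H", "P"}; coords[k] is the (x, y) point of residue k.
    """
    position_of = {xy: k for k, xy in enumerate(coords)}
    contacts = 0
    for k, (x, y) in enumerate(coords):
        if sequence[k] != "H":
            continue
        # Look only right and up so that each lattice edge is visited once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            m = position_of.get(neighbour)
            # abs(m - k) > 1 skips residues joined by a peptide bond.
            if m is not None and sequence[m] == "H" and abs(m - k) > 1:
                contacts += 1
    return contacts


# Hypothetical example: the chain HPPH folded around a unit square.
square_fold = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(count_hh_contacts("HPPH", square_fold))
```

In the example, residues 0 and 3 are lattice neighbours but not consecutive in the sequence, so they form the only contact.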
Literature review
Protein topology
• Levitt and Chothia (1976) represent the 2-D structural topology of proteins in diagrammatic form.
• Richardson (1977) presents the first systematic survey of protein topology.
HP lattice model
• Lau and Dill (1989) study an HP model on the square and cubic lattices.
• Berger and Leighton (1998) and Crescenzi et al. (1998) prove that folding in the HP lattice model is NP-complete.
Table of contents
Introduction to Protein Folding
Integer Programming (IP) Approach
Introduction to Constraint Programming (CP)
CP Approach
Discussion
General model
Max   the number of H-H contacts
s.t.  1. (Assignment) Each amino acid must occupy one lattice point.
      2. (Non-overlapping) No two amino acids may share the same lattice point.
      3. (Connectivity) Every two amino acids that are consecutive in the protein's sequence must also occupy adjacent lattice points.
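For very short chains, the general model can be solved by plain enumeration, which makes the three constraints concrete. The sketch below is an illustrative brute-force search on a 2-D square lattice, not the IP or CP method of the talk; the name `best_fold` and the choice to anchor the first amino acid at (0, 0) are assumptions.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def best_fold(sequence):
    """Exhaustively search 2-D self-avoiding walks; return (contacts, coords)."""
    if not sequence:
        return (0, [])
    best = (-1, None)

    def contacts(coords):
        where = {xy: k for k, xy in enumerate(coords)}
        total = 0
        for k, (x, y) in enumerate(coords):
            for nb in ((x + 1, y), (x, y + 1)):
                m = where.get(nb)
                if (m is not None and abs(m - k) > 1
                        and sequence[k] == "H" and sequence[m] == "H"):
                    total += 1
        return total

    def extend(coords, used):
        nonlocal best
        if len(coords) == len(sequence):   # assignment: every amino acid is placed
            score = contacts(coords)
            if score > best[0]:
                best = (score, list(coords))
            return
        x, y = coords[-1]
        for dx, dy in MOVES:               # connectivity: step to a lattice neighbour
            nxt = (x + dx, y + dy)
            if nxt in used:                # non-overlapping: skip occupied points
                continue
            coords.append(nxt)
            used.add(nxt)
            extend(coords, used)
            coords.pop()
            used.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best


score, fold = best_fold("HPPH")
print(score, fold)
```

The number of walks grows exponentially with chain length, so this is only practical for a handful of amino acids.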
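Two IP models
• Model IP-1: Uses the coordinate of each amino acid.
• Model IP-2: Uses the direction (Up, Down, Left, Right).
[Figure: three amino acids 1, 2, 3 at (0,0), (0,1), and (1,1). In coordinate form these points are stored directly; in direction form the moves are 1 → 2 Up and 2 → 3 Right.]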
2-D vs 3-D
• The 2-D model is often used instead of the 3-D model, with attempts to extend 2-D into 3-D afterwards.
• Our models extend easily from 2-D to 3-D:
- Model IP-1: (x,y) → (x,y,z)
- Model IP-2: add two more directions: forward and backward.
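The two encodings carry the same information, which a short conversion sketch can show. The helper names and the mapping of Forward and Backward to the z axis are assumptions for illustration; the talk only names the directions.

```python
# Assumed axis convention: Right/Left move along x, Up/Down along y,
# Forward/Backward along z (the two extra 3-D directions).
STEP = {
    "Right": (1, 0, 0), "Left": (-1, 0, 0),
    "Up": (0, 1, 0), "Down": (0, -1, 0),
    "Forward": (0, 0, 1), "Backward": (0, 0, -1),
}


def directions_to_coords(directions, start=(0, 0, 0)):
    """Direction encoding (IP-2 style) to coordinate encoding (IP-1 style)."""
    coords = [start]
    for d in directions:
        dx, dy, dz = STEP[d]
        x, y, z = coords[-1]
        coords.append((x + dx, y + dy, z + dz))
    return coords


def coords_to_directions(coords):
    """Coordinate encoding back to direction encoding."""
    name_of = {step: name for name, step in STEP.items()}
    return [
        name_of[tuple(b - a for a, b in zip(p, q))]
        for p, q in zip(coords, coords[1:])
    ]


path = directions_to_coords(["Up", "Right"])
print(path)
print(coords_to_directions(path))
```

A 2-D fold is simply the special case in which Forward and Backward are never used and z stays at 0.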
Solving IP Models
1. Defining decision variables
2. Formulating the problem
3. Preprocessing
4. Running it with CPLEX
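\[
\begin{aligned}
\max\quad & \sum_{i,j} \sum_{d} y_{ijd} \\
\text{s.t.}\quad
& \sum_{i} \sum_{j} x_{ijk} = 1 && \forall k && \text{(Assignment)} \\
& \sum_{k} x_{ijk} \le 1 && \forall i, j && \text{(Non-overlapping)} \\
& x_{ijk} - x_{(i+1)j(k+1)} - x_{(i-1)j(k+1)} - x_{i(j+1)(k+1)} - x_{i(j-1)(k+1)} \le 0 && \forall i, j, k && \text{(Connectivity)} \\
& y_{ijd} \le \sum_{k} h_k x_{ijk}, \quad y_{ijd} \le \sum_{k} h_k x_{((i,j)+d)k} && \forall i, j, d && \text{(Define } y\text{)} \\
& x_{ijk},\ y_{ijd}\ \text{binary}
\end{aligned}
\]
x_{ijk} = 1 if the kth amino acid is located at (i,j), 0 otherwise.
y_{ijd} = 1 if the amino acids at (i,j) and at the adjacent point (i,j)+d are both hydrophobic, 0 otherwise.
h_k = 1 if the kth amino acid is hydrophobic, 0 otherwise.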
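To make the assignment, non-overlapping, and connectivity constraints tangible, the following sketch builds the binary x variables from a given fold and evaluates each constraint family directly. It is a checker, not a solver. The function name `ip1_feasible`, the zero-based indexing, the square grid of side `size`, and the restriction of connectivity to amino acids that have a successor are assumptions for illustration.

```python
def ip1_feasible(coords, size):
    """Check the coordinate-model constraints for a fold on a size x size grid.

    coords[k] is the (i, j) lattice point of amino acid k (zero-based).
    """
    n = len(coords)
    grid = [(i, j) for i in range(size) for j in range(size)]
    # x[i, j, k] = 1 if amino acid k is located at (i, j), 0 otherwise.
    x = {(i, j, k): 0 for (i, j) in grid for k in range(n)}
    for k, (i, j) in enumerate(coords):
        x[i, j, k] = 1

    def at(i, j, k):
        return x.get((i, j, k), 0)   # missing keys count as 0

    # Assignment: each amino acid occupies exactly one grid point.
    assignment = all(sum(at(i, j, k) for (i, j) in grid) == 1 for k in range(n))
    # Non-overlapping: each grid point holds at most one amino acid.
    non_overlapping = all(sum(at(i, j, k) for k in range(n)) <= 1 for (i, j) in grid)
    # Connectivity, assumed here only for k < n - 1 (the last amino acid has no successor).
    connectivity = all(
        at(i, j, k) - at(i + 1, j, k + 1) - at(i - 1, j, k + 1)
        - at(i, j + 1, k + 1) - at(i, j - 1, k + 1) <= 0
        for (i, j) in grid for k in range(n - 1)
    )
    return assignment and non_overlapping and connectivity


print(ip1_feasible([(0, 0), (0, 1), (1, 1)], size=3))   # a connected, non-overlapping fold
print(ip1_feasible([(0, 0), (1, 1), (0, 1)], size=3))   # first step is diagonal
```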
Computational results
Instance: 1PSV
• 28 amino acids: one of the smallest human proteins.
• Obtained data from PDB.
• Truncate to different sizes: 12, 18, 23, 28.
• Optimal solution:
Computational results (cont)
• CPLEX running times (seconds)
- IP does not work well.
- The 23- and 28-amino-acid instances take a long time to solve.
          N = 12    N = 18    N = 23    N = 28
IP-1      9.32      72.61     30,000+   30,000+
IP-2      13.85     30,000+   30,000+   30,000+
IP did not work well
• Why?
- High degeneracy: many structures have the same minimum energy.
- Symmetry: the IP formulation contains much symmetry.
• CP is known to work better than IP when the IP formulation contains much symmetry.
• So we move on to CP.
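The symmetry point can be illustrated on the 2-D square lattice: rotating or reflecting a fold gives a different coordinate assignment with exactly the same number of H-H contacts. The sketch below is an assumed illustration (function names and the example fold are not from the talk) that lists the 8 images of a fold under the symmetries of the square.

```python
def hh_contacts(sequence, coords):
    """Count H-H lattice adjacencies between non-consecutive amino acids."""
    n = len(coords)
    return sum(
        1
        for a in range(n) for b in range(a + 2, n)
        if sequence[a] == "H" and sequence[b] == "H"
        and abs(coords[a][0] - coords[b][0]) + abs(coords[a][1] - coords[b][1]) == 1
    )


def symmetric_images(coords):
    """Images of a 2-D fold under the 4 rotations and 4 reflections of the square lattice."""
    transforms = [
        lambda x, y: (x, y), lambda x, y: (-y, x),
        lambda x, y: (-x, -y), lambda x, y: (y, -x),
        lambda x, y: (-x, y), lambda x, y: (y, x),
        lambda x, y: (x, -y), lambda x, y: (-y, -x),
    ]
    return [[t(x, y) for x, y in coords] for t in transforms]


fold = [(0, 0), (0, 1), (1, 1), (1, 0)]   # hypothetical fold of the chain HPPH
images = symmetric_images(fold)
print(len({tuple(img) for img in images}))          # number of distinct images
print({hh_contacts("HPPH", img) for img in images})  # contact counts over all images
```

Every optimal fold therefore appears several times in the search space, which is one reason a branch-and-bound search can spend a long time proving optimality.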
Table of contents
Introduction to Protein Folding
Integer Programming (IP) Approach
Introduction to Constraint Programming (CP)
CP Approach
Discussion
Concepts of CP
Constraint programming (CP)
• Study of modeling and solving a system of logical
constraints using search techniques.
• Began in the 1980s as part of artificial intelligence
research.
• Two main procedures: domain reduction and constraint
propagation
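A toy example may help to show how domain reduction and constraint propagation interact. The sketch below handles a single all-different constraint in the simplest possible way: whenever a variable has one value left, that value is removed from every other domain, and the removals are repeated until nothing changes. The function name and the example domains are assumptions; real CP solvers use considerably stronger propagation.

```python
def propagate_all_different(domains):
    """Repeatedly remove values fixed by one variable from all other domains."""
    domains = {var: set(dom) for var, dom in domains.items()}
    changed = True
    while changed:
        changed = False
        for var, dom in domains.items():
            if len(dom) != 1:
                continue
            (value,) = dom
            for other, other_dom in domains.items():
                if other != var and value in other_dom:
                    other_dom.discard(value)   # domain reduction
                    changed = True             # forces another propagation pass
    return domains


reduced = propagate_all_different({"a": {1}, "b": {1, 2}, "c": {1, 2, 3}})
print(reduced)
```

Here fixing `a` reduces `b` to a single value, which in turn reduces `c`, so the whole assignment is found without any search.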
CP vs IP
• Advantages and disadvantages
• Unified methodologies combining CP and IP have been designed in recent years.
      Advantages                                       Disadvantages
CP    More expressive; more effective in some cases    Less predictable; a lower bound may not exist
IP    A lower bound always exists                      Less expressive
CP previous research
• Smith (1996) shows environments where CP may work better than IP.
• Barták (1999), Smith (1995), and the ILOG Solver 5.0 manual (2000) show successful applications of CP in many areas.
• Easton (2003) and Milano (2004) deal with combining CP and IP.
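Three CP models
• Models CP-1, CP-2: Use the direction (Up, Down, Left, Right).
• Model CP-3: Uses the combination of coordinates.
[Figure: three amino acids 1, 2, 3 at (0,0), (0,1), and (1,1). Direction form: 1 → 2 Up, 2 → 3 Right. Coordinate-combination form: (0,0) → 0×3+0 = 0, (0,1) → 0×3+1 = 1, (1,1) → 1×3+1 = 4.]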
Models        Description
Model CP-1    Similar to the IP models, but uses the max function and the if-then function.
Model CP-2    Similar to CP-1, but makes the formulation simpler using Boolean functions and absolute values.
Model CP-3    Uses the alldifferent function.
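The coordinate-combination idea behind Model CP-3 can be sketched as follows: each lattice point is mapped to a single integer, so the non-overlapping requirement becomes one all-different condition over those integers. The multiplier is assumed here to be the grid width (3 in the slide's small example), and the function names are illustrative rather than solver API calls.

```python
def combine(coord, width):
    """Map a lattice point (x, y) with 0 <= y < width to the integer x * width + y."""
    x, y = coord
    return x * width + y


def all_different(values):
    """True if no value occurs twice."""
    return len(set(values)) == len(values)


fold = [(0, 0), (0, 1), (1, 1)]
codes = [combine(c, width=3) for c in fold]
print(codes)
print(all_different(codes))
```

Because the mapping is one-to-one on the grid, two amino acids receive the same code exactly when they occupy the same point.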
How to solve the problem faster
CP strategies to solve the problem faster
• Use a known solution.
• Fix the direction from the first amino acid to the next.
• Any two amino acids an even number of positions apart in the sequence cannot be adjacent.
• The lattice distance between two amino acids is bounded above by their distance along the sequence.
• Variable ordering: choose first the variables with the smallest domain.
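Several of these strategies amount to shrinking the search space before or during the search. The sketch below shows one possible reading on a 2-D square lattice: the parity rule limits which H-H pairs can ever be in contact, the distance bound limits where amino acid k can lie once amino acid 0 is fixed at the origin, and the ordering rule picks the variable with the fewest remaining values. All function names and the origin anchoring are assumptions for illustration.

```python
def possible_contact_pairs(sequence):
    """H-H pairs that can be lattice neighbours: odd sequence distance of at least 3."""
    n = len(sequence)
    return [
        (a, b)
        for a in range(n) for b in range(a + 3, n, 2)
        if sequence[a] == "H" and sequence[b] == "H"
    ]


def initial_domain(k):
    """Points amino acid k can occupy when amino acid 0 is fixed at (0, 0).

    Distance bound: at most k lattice steps from the origin.
    Parity: the coordinate sum has the same parity as k.
    """
    return {
        (x, y)
        for x in range(-k, k + 1) for y in range(-k, k + 1)
        if abs(x) + abs(y) <= k and (x + y) % 2 == k % 2
    }


def smallest_domain_first(domains):
    """Variable ordering: pick the variable with the fewest remaining values."""
    return min(domains, key=lambda var: len(domains[var]))


print(possible_contact_pairs("HHHHHH"))
domains = {k: initial_domain(k) for k in range(1, 4)}
print({k: len(dom) for k, dom in domains.items()})
print(smallest_domain_first(domains))
```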
Computational results
• Same instance as IP (1PSV): 12, 18, 23, 28 amino
acids.
• Use ILOG Solver to run CP.
[Figure panels: N = 23, N = 28]
Computational result - IP vs CP
• IP vs CP best running times (seconds)
- Models used: IP-1 for IP, CP-1 (with strategies) for CP.
- CP is faster than IP with our models.
          IP (CPLEX)    CP (Solver)
N = 12    9.32          0.18
N = 18    72.61         18.83
N = 23    30,000+       7,347.74 (≈ 2 hrs)
N = 28    30,000+       209,127.89 (≈ 58 hrs)
Proposed research
1. Try other CP approaches such as dual modeling and dynamic variable ordering.
2. Consider a unified methodology of IP and CP
- Decompose the problem, and apply IP to one part and CP to the other part.
3. Attempt other approaches such as heuristic algorithms to find better bounds.
Contribution
1. Optimization field
• Help to show how CP can be an alternative to or a complement to IP.
2. Biological field
• Success of our research can help in the prediction of 3-D protein structures, which may assist in medical development.