UNC Chapel Hill David A. O’Brien

Chain Growing Using Statistical Energy Functions

David A. O'Brien, Balasubramanian Krishnamoorthy, Jack Snoeyink, Alex Tropsha, Andrew Leaver-Fey, Shuquan Zong

Overview

Lattice Chain Growth Algorithm

Statistical Energy Functions
  2-body Miyazawa-Jernigan Potential
  4-body Potential
  Local Shape Potential

Results
  Chains
  Identifying Good Decoys

Current Work
  New Scoring Functions
  Incremental Tetrahedralization

Future work

Chain Growing - Introduction

Lattice chain growing goals:

Test measures of proteins
  Build protein chains that maximize a given measure.
  If these chains appear native-like, this confirms that the measure is valid.

Predict protein structures ab initio, from sequence information alone.

Develop an algorithm to build 3D folded protein decoys from the sequence that are similar to the native structure.

Evaluate these decoys and determine which are native-like. In short, be able to pick the most native-like structure from the large set of decoys we generate.

Lattice Chain Growth Algo.

Cubic lattice (311) with 24 possible moves {(3,1,1), (3,1,-1), …, (-3,1,1)}

Generate a chain configuration by sequential addition of links until the full length of the chain is reached.

New links cannot be placed in the zone of exclusion of other links and must satisfy angle constraints.
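The 24 moves are the sign and position variants of (3,1,1), which can be enumerated directly. The sketch below builds them and filters candidate nodes with a simple distance test. The `min_sq_dist` threshold is a hypothetical stand-in for the zone of exclusion, whose size the slides do not give, and the angle constraints are left out.

```python
from itertools import permutations, product

def lattice_moves():
    """Return the 24 move vectors of the (3,1,1) cubic lattice."""
    moves = set()
    for perm in set(permutations((3, 1, 1))):     # which axis carries the 3
        for signs in product((1, -1), repeat=3):  # sign of each component
            moves.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(moves)

MOVES = lattice_moves()

def add_link(position, move):
    """Lattice node reached from `position` by one move."""
    return tuple(p + m for p, m in zip(position, move))

def sq_dist(a, b):
    """Squared Euclidean distance between two lattice nodes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def open_nodes(chain, min_sq_dist):
    """Candidate next nodes that stay outside the exclusion zone.

    min_sq_dist is a hypothetical threshold (squared lattice units).
    Angle constraints are not modeled in this sketch.
    """
    head = chain[-1]
    earlier = chain[:-1]
    candidates = []
    for move in MOVES:
        node = add_link(head, move)
        if all(sq_dist(node, other) >= min_sq_dist for other in earlier):
            candidates.append(node)
    return candidates
```

Every move has squared length 3² + 1² + 1² = 11, so all links in the chain have the same length.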

Lattice Chain Growth Algo.: Adding a new link

Generate a set of possible open lattice nodes.
For each, calculate a temperature-dependent transition probability.
Choose one of these open lattice nodes with a Monte Carlo step.
Variations: look two steps ahead, or build from the middle of the chain.

Temperature-Dependent Transition Probability

Probability at step i of picking configuration x' from the C candidates x_1 … x_C:

$$P_i(x') = \frac{\exp\left[-\frac{1}{k_B T} E(x')\right]}{\sum_{j=1}^{C} \exp\left[-\frac{1}{k_B T} E(x_j)\right]}$$

T = temperature
k_B = Boltzmann constant
E = energy (lower is better)
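A minimal sketch of this Monte Carlo step, assuming the candidate energies are already computed. The function names and the single `kT` parameter (standing for the product k_B T) are illustrative choices, not taken from the slides.

```python
import math
import random

def transition_probabilities(energies, kT=1.0):
    """Boltzmann weights exp(-E/kT), normalized over the C candidates."""
    e_min = min(energies)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def choose_candidate(energies, kT=1.0, rng=random):
    """Pick the index of one candidate at random with Boltzmann probabilities."""
    probs = transition_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]
```

A low `kT` makes the choice nearly greedy (the lowest-energy node almost always wins), while a high `kT` approaches a uniform random choice among the open nodes.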

Statistical Energy Functions

Statistical energy functions assume that “contact” energies between amino acid residues in native proteins are related to their observed frequency in a representative structural database.

If a candidate configuration (decoy) has a set of nearby residues that is common in nature, it gets a good score.

The score for the entire protein is the sum of all contact energies.

We use three statistical energy functions:
  2-body Miyazawa-Jernigan
  4-body Potential
  Local Shape Potential

Statistical Energy Functions - Overview

Global vs. local
  Global: measures the entire protein (or a partial fragment) well
  Local: measures just a small sequence of consecutive residues

2-body Miyazawa-Jernigan
  Easy to calculate
  Can be global or local

4-body Potential
  Expensive to calculate
  Works better as a global measure
  Good for determining native-like folded structures

Local Shape Potential
  Easy to calculate
  Defined as a local measure
  Global measure?

Two-body Statistical Energy Function

For two-body potentials:

$$Q_{ij} = -k_B T \ln\left[F_{ij} / P_{ij}\right]$$

F_ij = observed contact frequency
P_ij = reference state

Actual Q_ij values are taken from the Miyazawa-Jernigan matrix as reevaluated in 1996.

Miyazawa S, Jernigan RL. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623-644.
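The log-ratio form can be illustrated with a short sketch. The table `q` passed to `chain_energy` is a hypothetical lookup keyed by sorted residue-type pairs. Real work would load the published Miyazawa-Jernigan values, which are not reproduced here.

```python
import math

def contact_energy(f_obs, p_ref, kT=1.0):
    """Q = -kT ln(F / P): negative (favorable) when a contact is over-represented."""
    return -kT * math.log(f_obs / p_ref)

def chain_energy(sequence, contacts, q):
    """Sum the pair energies over all contacts.

    sequence: string of one-letter residue codes
    contacts: iterable of (a, b) chain-index pairs that are in contact
    q:        hypothetical table {(type1, type2): energy}, keys sorted
    """
    total = 0.0
    for a, b in contacts:
        key = tuple(sorted((sequence[a], sequence[b])))
        total += q[key]
    return total
```

A pair seen twice as often as the reference state predicts gets Q = -kT ln 2, and a pair seen exactly as often as predicted gets zero.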

Four-Body Statistical Energy Function

Calculates the energy based on sets of 4 nearby residues (quads).

Quads are calculated from the Delaunay tessellation.
  The 4 vertices of each tetrahedron define a quad.
  Each quad is given a statistical score.

[Figure: convex hull formed by the tetrahedral edges. Each tetrahedron corresponds to a cluster of four residues.]

Four-Body Statistical Energy Function - Overview

The four-body potential is written $Q^{\alpha}_{ijkl}$.

A training set of 1166 proteins was tessellated.
The frequency of each quad type is counted.
Each quad is typed in two ways:
  by the combination of the four residue types {i,j,k,l}
  by the number of consecutively appearing residues (α)

[Figure: share of quads in each of the five α-types: 25.5%, 35.6%, 11.4%, 22.1%, 5.4%]
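One way to see why typing by consecutive residues gives exactly five classes: the four chain indices of a quad split into runs of sequence neighbors, and there are five ways to partition 4 into run lengths. The sketch below returns the run-length pattern. How the patterns map onto numbered types is not stated on the slide, so no numbering is assumed.

```python
def adjacency_pattern(indices):
    """Run lengths of sequence-consecutive residues among four chain indices.

    Returns one of (1,1,1,1), (2,1,1), (2,2), (3,1), (4,).
    """
    idx = sorted(indices)
    runs, run = [], 1
    for prev, cur in zip(idx, idx[1:]):
        if cur == prev + 1:
            run += 1          # still inside a run of chain neighbors
        else:
            runs.append(run)  # run ended; start a new one
            run = 1
    runs.append(run)
    return tuple(sorted(runs, reverse=True))
```

For example, residues 3, 4, 27 and 41 give the pattern (2, 1, 1): one bonded pair plus two residues that are distant along the chain.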

Four-Body Statistical Energy Function - Classifying quadruplets

Denote each quad by {i,j,k,l}.

i, j, k and l can be any of the 20 amino acids (L20)
  e.g. AALV, TLKM, TTLK, YYYY, etc.
  8855 possible combinations

Or the 20 amino acids can be grouped into just 6 types (L6)
  Groups are defined by chemical properties of the amino acids
  126 possible combinations

c = {cysteine}
f = {phenylalanine, tyrosine, tryptophan}
h = {histidine, arginine, lysine}
n = {asparagine, aspartic acid, glutamine, glutamic acid}
s = {serine, threonine, proline, alanine, glycine}
v = {methionine, isoleucine, leucine, valine}
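Both counts are numbers of unordered selections of 4 items with repetition: C(20+3, 4) = 8855 and C(6+3, 4) = 126. The sketch below checks this and builds an order-independent key for a quad. The one-letter codes and the `quad_key` helper are illustrative additions. The group membership follows the six groups listed above.

```python
from itertools import combinations_with_replacement
from math import comb

# Six-letter alphabet (L6), using standard one-letter amino acid codes.
L6_GROUPS = {
    "c": "C",
    "f": "FYW",
    "h": "HRK",
    "n": "NDQE",
    "s": "STPAG",
    "v": "MILV",
}
TO_L6 = {aa: group for group, members in L6_GROUPS.items() for aa in members}

def quad_key(residues, alphabet="L20"):
    """Order-independent key for a quad of one-letter residue codes."""
    if alphabet == "L6":
        residues = [TO_L6[r] for r in residues]
    return "".join(sorted(residues))

def count_quads(n_types):
    """Number of distinct unordered quads over n_types residue types."""
    return sum(1 for _ in combinations_with_replacement(range(n_types), 4))
```

Under L6, the quad TLKM maps to the key "hsvv", so it shares statistics with every other quad made of one h, one s and two v residues.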

Four-Body Statistical Energy Function - Classifying quadruplets

L20 case:
  5 α-types x 8855 combinations ==> 44,275 quad types
  Not all quad types are observed in the training set
  Potential of unobserved types is set to some fraction of the lowest score for a represented quad type

L6 case:
  5 α-types x 126 combinations ==> 630 quad types
  All but a few quad types are observed in the training set

Four-Body Statistical Energy Function - Formulation

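Formulation is an extension of the previous 2-body formula:

$$Q^{\alpha}_{ijkl} = -kT \ln\left[f^{\alpha}_{ijkl} / P^{\alpha}_{ijkl}\right]$$

where

$$f^{\alpha}_{ijkl} = \frac{\text{observed occurrences of type } \alpha\ (ijkl) \text{ neighbors}}{\text{total number for type } \alpha}$$

$$a_i = \frac{\text{observed occurrences of amino acid type } i}{\text{total number of residues in data set}}$$

$$P^{\alpha} = \frac{\text{\# of type } \alpha \text{ tetrahedra observed in training set}}{\text{total \# of tetrahedra in training set}}$$

$$P^{\alpha}_{ijkl} = P^{\alpha} P_{ijkl} = P^{\alpha} \frac{4!}{\prod_{i=1}^{N} t_i!}\, a_i a_j a_k a_l$$

t_i = number of residues of each type i in the quad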
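A small sketch of the reference term may help. The factor 4!/∏ t_i! counts the orderings of a quad with repeated residue types, for example 12 for AALV and 1 for YYYY. The function names, the `kT` parameter and the uniform composition used in the check are assumptions for illustration only.

```python
import math
from collections import Counter

def reference_probability(quad, a, p_alpha):
    """P^alpha_ijkl = P^alpha * (4! / prod t_i!) * a_i a_j a_k a_l.

    quad:    four one-letter residue codes, e.g. "AALV"
    a:       {residue type: fraction of residues in the data set}
    p_alpha: fraction of training tetrahedra with this alpha-type
    """
    multiplicity = math.factorial(4)
    for t in Counter(quad).values():  # t_i for each distinct type in the quad
        multiplicity //= math.factorial(t)
    prob = p_alpha * multiplicity
    for residue in quad:
        prob *= a[residue]
    return prob

def four_body_score(f_obs, quad, a, p_alpha, kT=1.0):
    """Q = -kT ln(f / P) for one quad type, given its observed frequency f_obs."""
    return -kT * math.log(f_obs / reference_probability(quad, a, p_alpha))
```

With `p_alpha` set to 1, the reference probabilities of all 8855 unordered quads sum to (Σ a_i)^4 = 1, which is a quick sanity check on the multiplicity factor.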

Local Shape Statistical Energy Function

Motivation:
  Fragment libraries model protein structures accurately.
  Use the frequency of common fragments to construct a statistical function that supplements the 2- and 4-body energy functions to grow better decoys.
  Good fragment libraries exist, but for lattice-chain building we need fragments that fit in the 311 lattice.

Main idea:
  For each possible consecutive sequence of four residues i, j, k and l, calculate in which shape these residues most often occur.
  If Shape A is found more often in nature than Shape B, try to build the chain accordingly.

[Figure: two example shapes, Shape A and Shape B]

Local Shape Statistical Energy Function

Create a set of canonical lattice shapes of length 4 (and 5)
  Calculate the ways to embed a chain of length 4 (or 5) in the 311 lattice.
  155 canonical shapes for length 4 (2789 for length 5)
  For L6, there are 6^4 = 1,296 sequences
  155 x 1,296 = 200,880 combinations

Parse a representative set of 971 proteins into segments.
  For each length-4 segment, calculate the RMSD against each canonical shape.

[Figure: a sample protein segment compared against Shape 1, Shape 2, …, Shape 155]

Local Shape Statistical Energy Function

Turning RMSD values into frequencies
  If only the canonical shape with the best RMSD is counted, not all 200,880 combinations are found in the training set.
  If two canonical shapes have low RMSD, give each some credit.
  For each RMSD^s_{i,j,k,l}, where i,j,k,l = residue type and s = shape, sum over the training segments n with residue types i,j,k,l:

$$\mathrm{Freq}^{s}_{i,j,k,l} = \sum_{n} \exp\left(-\mathrm{RMSD}^{s}_{i,j,k,l}(n)\right)$$

  Normalize the 155 RMSD values
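The credit-sharing idea can be sketched as a soft assignment. The version below assumes a weight of exp(-RMSD) per shape, normalized so that each training segment contributes a total credit of 1. Both the weight function and the normalization are assumptions, since the slide does not spell them out.

```python
import math

def shape_credit(rmsds):
    """Split one segment's credit among the canonical shapes.

    rmsds: RMSD of the segment against each canonical shape.
    Assumed weighting: exp(-RMSD), normalized to sum to 1.
    """
    weights = [math.exp(-r) for r in rmsds]
    total = sum(weights)
    return [w / total for w in weights]

def accumulate(freq, seq_key, rmsds):
    """Add one segment's credit to the frequency table.

    freq:    {sequence key: list of accumulated credit per shape}
    seq_key: key for the segment's four residue types, e.g. an L6 string
    """
    credit = shape_credit(rmsds)
    row = freq.setdefault(seq_key, [0.0] * len(rmsds))
    for shape, c in enumerate(credit):
        row[shape] += c
```

Two shapes with equally low RMSD split the credit evenly, while a badly fitting shape receives almost none, so far fewer of the 200,880 combinations end up with a zero count.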

Results - Building Decoys

Decoys produced by chain growing are still not good enough.

There is relatively good correlation between RMSD and 4-body energy.

[Figure: 2mhu. Panels: built with MJ Potential; Local Shape Potential. Native state marked. Vertical axis in both panels: four-body energy per residue.]

Identifying Good Decoys

20L or 6L Non-bonded: sum only the contribution of α-type 0 tetrahedra.

Discriminating Native & Non-Native

Non-bonded L20 scoring function applied to a set of folded and unfolded decoys.

[Figure: Non-bonded log-likelihoods for the Shakhnovich instances and the native structure (20L1T, SC). Horizontal axis: instances s6A1 to s6A6, GF01 to GF20, and 2CI2 (yellow: pre (6), blue: post (20), red: native). Vertical axis: log-likelihood score, 0 to 40.]

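Adjustments to Scoring Functions

20L or 6L Non-bonded: sum only the contribution of α-type 0 tetrahedra.

20L or 6L 5T: sum the contribution of all tetrahedra.

20L Ratio All: as above, but define:

$$P^{\alpha}_{test} = \frac{\text{\# of type } \alpha \text{ tetrahedra in test protein}}{\text{total \# of tetrahedra in test protein}}$$

$$r^{\alpha} = \frac{P^{\alpha}_{test}}{P^{\alpha}}, \qquad Q^{\alpha,\mathrm{RatioAll}}_{ijkl} = r^{\alpha} Q^{\alpha}_{ijkl}$$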
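The three variants differ only in which tetrahedra are summed and how each is weighted, which a short sketch can make concrete. The data layout (a list of `(alpha, quad_key)` pairs and lookup tables `q` and `p_train`) is a hypothetical choice for illustration.

```python
from collections import Counter

def non_bonded_score(test_quads, q):
    """Sum only the contribution of alpha-type 0 tetrahedra."""
    return sum(q[(alpha, key)] for alpha, key in test_quads if alpha == 0)

def five_type_score(test_quads, q):
    """Sum the contribution of all tetrahedra (5T)."""
    return sum(q[(alpha, key)] for alpha, key in test_quads)

def ratio_all_score(test_quads, q, p_train):
    """As 5T, but weight each term by r = P_test(alpha) / P_train(alpha).

    test_quads: list of (alpha, quad_key) for the test protein
    q:          {(alpha, quad_key): score}
    p_train:    {alpha: fraction of training tetrahedra of that type}
    """
    counts = Counter(alpha for alpha, _ in test_quads)
    total = len(test_quads)
    score = 0.0
    for alpha, key in test_quads:
        r = (counts[alpha] / total) / p_train[alpha]
        score += r * q[(alpha, key)]
    return score
```

When the test protein has the same mix of α-types as the training set, every r equals 1 and Ratio All reduces to the 5T sum.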

Incremental Tetrahedralization

Maintain a constant tetrahedralization and only add and remove single vertices.

When evaluating a new candidate, update the total energy by tagging new quadruplets as well as any that have been removed.

Add the effect of the new quadruplets, and subtract the effect of those removed.

[Figure panels: add candidate and evaluate; add next candidate and reevaluate; remove candidate and reset state.]
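The bookkeeping side of this can be sketched without a tessellation library. The code below assumes some incremental Delaunay routine (not shown, hypothetical) has already reported which quads a candidate vertex would create and destroy, and only shows the energy update and the reset between candidates.

```python
def apply_update(total, quads, added, removed, score):
    """Return (new total energy, new quad set) after inserting one vertex.

    added / removed: sets of quad identifiers reported by the (assumed)
    incremental tessellation step; score maps quad identifier -> energy.
    """
    new_total = total + sum(score[q] for q in added) - sum(score[q] for q in removed)
    new_quads = (quads - removed) | added
    return new_total, new_quads

def evaluate_candidates(total, quads, candidates, score):
    """Energy the chain would have with each candidate, leaving the base state untouched.

    candidates: {name: (added_quads, removed_quads)}
    """
    results = {}
    for name, (added, removed) in candidates.items():
        results[name], _ = apply_update(total, quads, added, removed, score)
    return results
```

Because `apply_update` returns new values instead of mutating its inputs, "remove candidate and reset state" amounts to discarding the returned pair and reusing the original `total` and `quads`.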

References

H.H. Gan, A. Tropsha and T. Schlick. Generating folded protein structures with a lattice chain-growth algorithm, J. Chem. Phys., 113, 5511-5524 (2000).

H.H. Gan, A. Tropsha and T. Schlick. Lattice protein folding with two and four-body statistical potentials, Proteins: Structure, Function, and Genetics, 43, 161-174 (2001).

S. Miyazawa and R.L. Jernigan. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading, J. Mol. Biol., 256, 623-644 (1996).

A. Tropsha, R.K. Singh and I.I. Vaisman. Delaunay tessellation of proteins: four body nearest neighbor propensities of amino acid residues, J. Comput. Biol., 3(2), 213-222 (1996).

R. Kolodny, P. Koehl, L. Guibas and M. Levitt. Small libraries of protein fragments model native protein structures accurately, J. Mol. Biol., 323, 297-307 (2002).