
Summary

Protein design seeks to find amino acid sequences which stably fold into specific 3-D structures. Modeling the inherent flexibility of the protein backbone in the design algorithm is necessary to capture the behaviour of real proteins and is a prerequisite for the accurate exploration of sequence space.

We present a broad exploration of protein sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. An application is demonstrated, wherein designed sequences are used to increase the utility of comparative modeling, in place of natural sequence homologues.

Results

• We designed hundreds of thousands of diverse sequences for 264 naturally-occurring proteins, in 55 fold classes.

• Protein folds show distinct variation in “designability”.

• Our novel “reverse BLAST” approach uses designed sequences to identify up to 5-fold more high-quality structural templates for comparative modeling than standard PSI-BLAST.

• Reverse BLAST identifies at least one new modeling target in 41 of 49 genomes tested.

Protein design

Challenges in computational protein design:

• choosing sufficiently accurate energy functions

• finding intelligent ways to efficiently search the large (O(10^n)) space of protein sequences

• modeling peptide backbone flexibility
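To give a feel for the size of the search space named above, the short sketch below counts sequences for a chain of n residues drawn from the 20 standard amino acids. The function name and the example chain length are illustrative assumptions, not values from the poster.

```python
import math

def log10_sequence_space(n_residues, alphabet_size=20):
    """Return log10 of the number of possible sequences of a given length.

    With 20 amino acids there are 20**n sequences, i.e. 10**(n * log10(20)),
    so the exponent grows linearly with chain length.
    """
    return n_residues * math.log10(alphabet_size)

# Example: a hypothetical 50-residue domain.
print(f"about 10^{log10_sequence_space(50):.0f} possible sequences")
```

Even for small domains the count is far beyond exhaustive enumeration, which is why a guided search is needed.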

Some highlights of the design algorithm (SPA):

• initial rotamer filtering step

• Amber/OPLS parameter set; implicit solvation

• amino acid baseline corrections to maintain reasonable sequence compositions

• genetic algorithm to search for low energy sequences to match the target structure
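The genetic-algorithm step in the list above can be illustrated with a minimal sketch. This is not the SPA implementation: the energy function here is a hypothetical stand-in (a mismatch count against a fixed string), and the population size, mutation rate, and selection scheme are all assumptions chosen only to keep the example short.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_energy(seq, target):
    """Hypothetical stand-in for a physical energy: number of mismatches to target."""
    return sum(a != b for a, b in zip(seq, target))

def mutate(seq, rate, rng):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else a for a in seq)

def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences (length >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_search(energy, length, pop_size=50, generations=200, rate=0.05, seed=0):
    """Search for a low-energy sequence; lower `energy(seq)` is better."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=energy)
        survivors = pop[: pop_size // 2]  # keep the better half unchanged (elitism)
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rate, rng)
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return min(pop, key=energy)

target = "ACDEFGHIKL"  # illustrative only
best = genetic_search(lambda s: toy_energy(s, target), length=len(target))
print(best, toy_energy(best, target))
```

In a real design run the energy callback would score a sequence threaded onto the target backbone rather than compare strings.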

Peptide backbone flexibility through structural ensembles

Ten representative backbone traces from the structural ensemble used in designing sequences for 1abo, the SH3 domain from Abl tyrosine kinase. The structural variants appear in yellow, with the original crystal structure backbone traced in purple. All structures are within 1 Å rmsd of each other.
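The 1 Å criterion on ensemble members can be checked with a simple root-mean-square deviation calculation. The sketch below assumes the backbone coordinates have already been superimposed and are given as equal-length lists of (x, y, z) tuples in Å; how the poster's ensembles were generated and aligned is not stated, so this only illustrates the distance measure.

```python
import math
from itertools import combinations

def rmsd(coords_a, coords_b):
    """RMSD between two pre-superimposed coordinate lists of equal length."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def ensemble_within(ensemble, cutoff=1.0):
    """True if every pair of structures in the ensemble is within `cutoff` RMSD."""
    return all(rmsd(a, b) <= cutoff for a, b in combinations(ensemble, 2))
```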

Increasing sequence diversity with size of structural ensemble

[Figure: mean sequence entropy (y-axis, 0 to 6) plotted against the number of structural variants in the ensemble (x-axis, 0 to 100).]

Designing to a structural ensemble generates more diverse sequences than fixed-backbone methods.
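Sequence diversity of a designed set can be quantified with a per-position entropy. The poster does not give its exact formula, so the sketch below assumes the Shannon entropy (natural log) of the amino acid frequencies in each alignment column, averaged over positions; exp(S) then reads as an effective number of residue types per position.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (nats) of the residue frequencies in one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in Counter(column).values())

def mean_sequence_entropy(sequences):
    """Average column entropy over an ungapped alignment of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    columns = ([s[i] for s in sequences] for i in range(length))
    return sum(column_entropy(col) for col in columns) / length

designed = ["ACDK", "ACEK", "GCDR", "ACDK"]  # illustrative sequences only
s = mean_sequence_entropy(designed)
print(f"S = {s:.3f}, exp(S) = {math.exp(s):.3f}")
```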

More non-native-like sequences are designed

[Figure: Identity of designed sequences to parent native sequence. Frequency (y-axis, 0 to 0.6) plotted against identity in % (x-axis, 0 to 100). Series: structural ensemble, full sequence; single structure, full sequence; structural ensemble, hydrophobic core positions; single structure, hydrophobic core positions.]

Distribution of identity to the native parent sequence for 253 proteins. Identity to the native sequence was calculated for the set of sequences designed using only the fixed parent backbone as a target template (all residues: black dashed line; buried residues: grey dashed line) and for the set of sequences designed using a structural ensemble target (all residues: black solid line; buried residues: grey solid line).

Using structural ensembles of 100 structural variants as target templates narrows and lowers the distribution of identity to the parent native sequence, indicating broader exploration of sequence space.
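The identity measure discussed here can be sketched as follows, assuming the designed and native sequences have the same length and are compared position by position without gaps. The optional `positions` argument is a hypothetical way to restrict the comparison to buried (hydrophobic core) residues; how the poster defined the core is not stated.

```python
def percent_identity(designed, native, positions=None):
    """Percent of compared positions where the designed residue equals the native one.

    `positions` is an optional iterable of 0-based indices (e.g. core residues);
    when omitted, all positions are compared.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    idx = list(range(len(native))) if positions is None else list(positions)
    if not idx:
        raise ValueError("no positions to compare")
    matches = sum(designed[i] == native[i] for i in idx)
    return 100.0 * matches / len(idx)

# Illustrative sequences only.
print(percent_identity("ACDE", "ACDF"))          # all positions
print(percent_identity("ACDE", "ACDF", [0, 3]))  # selected positions
```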

Overall sequence diversity is determined by the protein fold

[Figure: Sequence entropy distribution of designed folds. Frequency (y-axis, 0 to 0.8) plotted against sequence entropy, exp(S) (x-axis, 4.8 to 6.8). Series: antifreeze, toxin, copper-bind, rubredoxin, Kunitz_BPTI, Phage_DNA_bind.]

Sequence entropy distributions of designed sequences, grouped by structure into folds. The six folds are identified by their PFAM families.

The relatively tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest (a) that the diversity of the designed sequence set for a structure is primarily determined by its overall fold and (b) that the designability principle postulated from studies of simple models may hold in real proteins.

Designed sequences identify structural homologues accurately

[Figure: E-value of the most significant hit (y-axis, log scale, 1E-16 to 10) for each protein in the test set, ranked by lowest E-value (x-axis, 1 to 264).]

The E-value of the most significant hit from each of 264 “reverse BLAST” searches is plotted. Dark grey columns represent predictions that are true structural homologues; light grey columns represent false positives.

Our novel “reverse BLAST searching” uses alignments of designed sequences as PSI-BLAST queries against a genome to identify structural templates for structure prediction of gene sequences. 251 of the 264 designed sequence alignments produced hits (against PDB as a test set) with E-values below 10. At a significance level of E<0.01, a commonly used threshold in comparative modeling, all hits were against true structural homologues, with 47% (124/264) coverage.
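The coverage figure quoted above follows from counting queries whose best hit passes the E-value threshold and is a true structural homologue. The sketch below assumes each query's best hit has already been parsed into an (E-value, is_true_homologue) pair; the data structure and function name are illustrative, and no BLAST output parsing is shown.

```python
def summarize_hits(best_hits, n_queries, threshold=0.01):
    """Return (coverage, precision) for best hits with E-value below `threshold`.

    best_hits: list of (e_value, is_true_homologue) tuples, one per query that hit.
    coverage:  true positives below threshold divided by all queries.
    precision: true positives divided by all hits below threshold (None if no hits).
    """
    significant = [(e, ok) for e, ok in best_hits if e < threshold]
    true_pos = sum(1 for _, ok in significant if ok)
    precision = true_pos / len(significant) if significant else None
    return true_pos / n_queries, precision

# Toy data, not the poster's results.
hits = [(1e-12, True), (3e-4, True), (0.5, False), (2.0, True)]
print(summarize_hits(hits, n_queries=5))

# The poster's own numbers: 124 of 264 queries.
print(f"{100 * 124 / 264:.0f}% coverage")
```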

“Reverse BLAST” identifies more templates for homology modeling

[Figure: number of structural templates identified (y-axis, 0 to 35) for each genome searched (x-axis). Genomes: Pyrococcus horikoshii, Sulfolobus solfataricus, Thermoplasma acidophilum, Thermoplasma volcanium, Treponema pallidum, Helicobacter pylori 26695, Helicobacter pylori J99, Campylobacter jejuni, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Rickettsia prowazekii, Chlamydophila pneumoniae AR39, Chlamydophila pneumoniae CWL029, Chlamydophila pneumoniae J138, Mycobacterium leprae, Chlamydia muridarum, Chlamydia trachomatis, Aquifex aeolicus, Mycoplasma genitalium, Mycoplasma pneumoniae, Mycoplasma pulmonis, Streptococcus pyogenes, Mesorhizobium loti, Methanococcus jannaschii, Borrelia burgdorferi, Deinococcus radiodurans, Ureaplasma urealyticum, Halobacterium sp, Caulobacter crescentus, Lactococcus lactis, Archaeoglobus fulgidus, Pyrococcus abyssi, Methanobacterium thermoautotrophicum, Neisseria meningitidis MC58, Neisseria meningitidis Z2491, Haemophilus influenzae, Xylella fastidiosa, Buchnera sp, Staphylococcus aureus Mu50, Staphylococcus aureus N315, Pasteurella multocida, Thermotoga maritima, Vibrio cholerae, Bacillus subtilis, Pseudomonas aeruginosa, Synechocystis PCC6803, Escherichia coli O157H7 EDL933, Escherichia coli O157H7, Escherichia coli K12.]

Light grey: the number of genes for which structural templates were identified by PSI-BLAST searching against the set of 264 structures in the test set. Dark grey: the number of novel genes for which structural templates were identified by “reverse BLAST” searching using 264 alignments of computationally designed sequences.

Reverse BLAST searching identified at least one additional structural template for use in homology modeling (not identified by standard PSI-BLAST) for 41 of 49 genomes. In ten cases, the reverse BLAST method more than doubled the number of structural templates identified.
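The per-genome comparison can be tallied as in the sketch below. It assumes a mapping from genome name to a pair of counts (templates found by PSI-BLAST, additional templates found only by reverse BLAST) and reads "more than doubled" as the additional count exceeding the PSI-BLAST count; both the data layout and that reading are assumptions, and the numbers are toy values.

```python
def compare_template_counts(counts):
    """counts maps genome name -> (psi_blast_templates, novel_reverse_blast_templates).

    Returns (gained, doubled): genomes with at least one novel template, and
    genomes where the novel templates outnumber the PSI-BLAST ones.
    """
    gained = [g for g, (_, novel) in counts.items() if novel > 0]
    doubled = [g for g, (psi, novel) in counts.items() if psi > 0 and novel > psi]
    return gained, doubled

example = {"genome_A": (10, 3), "genome_B": (4, 5), "genome_C": (7, 0)}  # toy data
gained, doubled = compare_template_counts(example)
print(len(gained), "of", len(example), "genomes gained a template")
print("more than doubled:", doubled)
```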

Conclusions

• Large-scale protein sequence design has been parallelized efficiently on a massive scale.

• Design to a structural ensemble greatly increases the diversity of sequences generated, without loss of sequence quality.

• Similar structures produce sequence sets of similar diversity, and the distributions of sequence entropies for different folds segregate, supporting the designability postulate seen in simple models.

• “Reverse BLAST searching” uses designed sequences to accurately identify structural homologues.

• Reverse BLAST searching allows increased identification of structural templates for homology modeling without the need for natural sequence homologues.

Future Directions

• Use sequence profiles for specific proteins to generate biased combinatorial libraries for protein synthesis. This will experimentally test the ability of the design algorithm to produce viable sequences.

• Introduce functional constraints into the design process to produce new sequences which are both stable and functional.

• Refine methods for generating high sequence diversity for a given structure, allowing more extensive sampling of sequence space.

• Use computational design to redesign peptide ligands for applications in drug discovery and understanding protein-protein interactions.
