summary protein design seeks to find amino acid sequences which stably fold into specific 3-d...
Post on 21-Dec-2015
![Page 1: Summary Protein design seeks to find amino acid sequences which stably fold into specific 3-D structures. Modeling the inherent flexibility of the protein](https://reader030.vdocument.in/reader030/viewer/2022032801/56649d555503460f94a32463/html5/thumbnails/1.jpg)
Summary
Protein design seeks to find amino acid sequences which stably fold into specific 3-D structures. Modeling the inherent flexibility of the protein backbone in the design algorithm is necessary to capture the behaviour of real proteins and is a prerequisite for the accurate exploration of sequence space.
We present a broad exploration of protein sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. An application is demonstrated, wherein designed sequences are used to increase the utility of comparative modeling, in place of natural sequence homologues.
Results
• We designed hundreds of thousands of diverse sequences for 264 naturally occurring proteins, in 55 fold classes.
• Protein folds show distinct variation in “designability”.
• Our novel “reverse BLAST” approach uses designed sequences to identify up to 5-fold more high-quality structural templates for comparative modeling than standard PSI-BLAST.
• Reverse BLAST identifies at least one new modeling target in 41 of 49 genomes tested.
Protein design
Challenges in computational protein design:
• choosing sufficiently accurate energy functions
• finding intelligent ways to efficiently search the large (O(10^n)) space of protein sequences
• modeling peptide backbone flexibility
Some highlights of the design algorithm (SPA):
• initial rotamer filtering step
• Amber/OPLS parameter set; implicit solvation
• amino acid baseline corrections to maintain reasonable sequence compositions
• genetic algorithm to search for low energy sequences to match the target structure
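The genetic-algorithm step can be illustrated with a toy sketch. This is not the SPA implementation: the energy function below (`toy_energy`, a mismatch count against a made-up target pattern) is a hypothetical stand-in for the Amber/OPLS-based energy, and the population size, mutation rate and crossover scheme are assumptions chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_energy(sequence, target):
    """Hypothetical stand-in energy: number of mismatches to a target pattern."""
    return sum(1 for a, b in zip(sequence, target) if a != b)

def genetic_search(energy, length, pop_size=50, generations=200,
                   mutation_rate=0.05, seed=0):
    """Search for a low-energy sequence with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=energy)
        survivors = population[:pop_size // 2]   # keep the lower-energy half
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)       # single-point crossover
            child = list(mother[:cut] + father[cut:])
            for i in range(length):              # point mutations
                if rng.random() < mutation_rate:
                    child[i] = rng.choice(AMINO_ACIDS)
            children.append("".join(child))
        population = survivors + children
    return min(population, key=energy)

# Made-up 12-residue target pattern, for illustration only.
target = "MKVLAAGIVDEW"
best = genetic_search(lambda s: toy_energy(s, target), len(target))
```

Because the lower-energy half of each generation is carried over unchanged, the best energy in the population can never get worse from one generation to the next.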
Peptide backbone flexibility through structural ensembles
Ten representative backbone traces from the structural ensemble used in designing sequences for 1abo, the SH3 domain from Abl tyrosine kinase. The structural variants appear in yellow, with the original crystal structure backbone traced in purple. All structures are within 1 Å rmsd of each other.
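The 1 Å rmsd criterion can be checked with a short sketch. It assumes the backbone coordinates are already superimposed (no optimal superposition is performed here), and the function names and the three-atom coordinates are hypothetical, for illustration only.

```python
import math
from itertools import combinations

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def ensemble_within(ensemble, cutoff=1.0):
    """True if every pair of structures in the ensemble is within the cutoff (in Å)."""
    return all(rmsd(a, b) <= cutoff for a, b in combinations(ensemble, 2))

# Made-up coordinates: a reference trace and a copy shifted by 0.5 Å along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 0.5, y, z) for x, y, z in reference]
ok = ensemble_within([reference, shifted], cutoff=1.0)
```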
Increasing sequence diversity with size of structural ensemble
[Figure: mean sequence entropy (y-axis, 0 to 6) versus number of structural variants in ensemble (x-axis, 0 to 100).]
Designing to a structural ensemble generates more diverse sequences than fixed-backbone methods.
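One way to quantify this diversity is sketched below. The slides do not define the entropy measure, so this assumes the Shannon entropy of each alignment column (natural log), averaged over positions; the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log) of the residues observed at one position."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def mean_sequence_entropy(sequences):
    """Average per-position entropy over a set of equal-length designed sequences."""
    columns = list(zip(*sequences))
    return sum(column_entropy(col) for col in columns) / len(columns)

# Identical sequences carry no diversity; a 50/50 split at one position gives ln 2.
identical = mean_sequence_entropy(["ACDE", "ACDE", "ACDE"])
split = mean_sequence_entropy(["A", "C"])
```

Under this assumed definition, exp(S) can be read as an effective number of residue types per position.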
More non-native-like sequences are designed
Identity of designed sequences to parent native sequence
[Figure: frequency (y-axis, 0 to 0.6) versus identity to the parent native sequence in % (x-axis, 0 to 100). Series: structural ensemble, full sequence; single structure, full sequence; structural ensemble, hydrophobic core positions; single structure, hydrophobic core positions.]
Distribution of identity to the native parent sequence for 253 proteins. Identity to the native sequence was calculated for the set of sequences designed using only the fixed parent backbone as a target template (all residues: black dashed line; buried residues: grey dashed line) and for the set of sequences designed using a structural ensemble target (all residues: black solid line; buried residues: grey solid line).
Using structural ensembles of 100 structural variants as target templates narrows and lowers the distribution of identity to the parent native sequence, indicating broader exploration of sequence space.
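The identity measure can be sketched as follows. Designed and native sequences share a backbone and therefore a length, so this sketch compares positions directly without alignment; the function names and the choice of core positions are hypothetical.

```python
def percent_identity(designed, native):
    """Percentage of positions at which a designed sequence matches the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(designed, native))
    return 100.0 * matches / len(native)

def core_identity(designed, native, core_positions):
    """Identity restricted to a given set of (0-based) buried core positions."""
    matches = sum(designed[i] == native[i] for i in core_positions)
    return 100.0 * matches / len(core_positions)

# Made-up four-residue example: one mismatch at the last position.
full = percent_identity("ACDE", "ACDF")
core = core_identity("ACDE", "ACDF", [0, 1])
```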
Overall sequence diversity is determined by the protein fold
Sequence entropy distribution of designed folds
[Figure: frequency (y-axis, 0 to 0.8) versus sequence entropy [exp(S)] (x-axis, 4.8 to 6.8). Series: antifreeze, toxin, copper-bind, rubredoxin, Kunitz_BPTI, Phage_DNA_bind.]
Sequence entropy distributions of designed sequences, grouped by structure into folds. The six folds are identified by their PFAM families.
The relatively tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest (a) that the diversity of the designed sequence set for a structure is primarily determined by its overall fold and (b) that the designability principle postulated from studies of simple models may hold in real proteins.
Designed sequences identify structural homologues accurately
[Figure: E-value of most significant hit (y-axis, log scale, 1E-16 to 10) versus proteins in test set, ranked by lowest E-value (x-axis, 50 to 250).]
The E-value of the most significant hit from each of 264 “reverse BLAST” searches is plotted. Dark grey columns represent predictions that are true structural homologues; light grey columns represent false positives.
Our novel “reverse BLAST searching” uses alignments of designed sequences as PSI-BLAST queries against a genome to identify structural templates for structure prediction of gene sequences. 251 of the 264 designed sequence alignments produced hits (against PDB as a test set) with E-values below 10. At a significance level of E<0.01, a commonly used threshold in comparative modeling, all hits were against true structural homologues, with 47% (124/264) coverage.
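The coverage and accuracy figures quoted above follow from a simple thresholding of the best hit per search, sketched below. The hit list is made-up toy data, not the results of this study, and the function name and data layout are hypothetical.

```python
def coverage_and_accuracy(best_hits, threshold=0.01):
    """Summarize best hits at an E-value cutoff.

    best_hits holds one entry per search: (e_value, is_true_homologue),
    or None when the search produced no hit at all.
    """
    significant = [hit for hit in best_hits
                   if hit is not None and hit[0] < threshold]
    coverage = len(significant) / len(best_hits)
    if not significant:
        return coverage, None
    accuracy = sum(1 for _, is_true in significant if is_true) / len(significant)
    return coverage, accuracy

# Toy data for five searches (illustrative values only).
toy_hits = [(1e-12, True), (5e-4, True), (0.5, False), (3.0, True), None]
coverage, accuracy = coverage_and_accuracy(toy_hits)

# The coverage quoted on the slide: 124 of 264 searches.
slide_coverage_percent = round(100 * 124 / 264)
```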
“Reverse BLAST” identifies more templates for homology modeling
[Figure: number of structural templates identified (y-axis, 0 to 35) for each genome searched (x-axis). Genomes: Pyrococcus horikoshii, Sulfolobus solfataricus, Thermoplasma acidophilum, Thermoplasma volcanium, Treponema pallidum, Helicobacter pylori 26695, Helicobacter pylori J99, Campylobacter jejuni, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis H37Rv, Rickettsia prowazekii, Chlamydophila pneumoniae AR39, Chlamydophila pneumoniae CWL029, Chlamydophila pneumoniae J138, Mycobacterium leprae, Chlamydia muridarum, Chlamydia trachomatis, Aquifex aeolicus, Mycoplasma genitalium, Mycoplasma pneumoniae, Mycoplasma pulmonis, Streptococcus pyogenes, Mesorhizobium loti, Methanococcus jannaschii, Borrelia burgdorferi, Deinococcus radiodurans, Ureaplasma urealyticum, Halobacterium sp, Caulobacter crescentus, Lactococcus lactis, Archaeoglobus fulgidus, Pyrococcus abyssi, Methanobacterium thermoautotrophicum, Neisseria meningitidis MC58, Neisseria meningitidis Z2491, Haemophilus influenzae, Xylella fastidiosa, Buchnera sp, Staphylococcus aureus Mu50, Staphylococcus aureus N315, Pasteurella multocida, Thermotoga maritima, Vibrio cholerae, Bacillus subtilis, Pseudomonas aeruginosa, Synechocystis PCC6803, Escherichia coli O157H7 EDL933, Escherichia coli O157H7, Escherichia coli K12.]
Light grey: the number of genes for which structural templates were identified by PSI-BLAST searching against the set of 264 structures in the test set. Dark grey: the number of novel genes for which structural templates were identified by “reverse BLAST” searching using 264 alignments of computationally designed sequences.
Reverse BLAST searching identified at least one additional structural template for use in homology modeling (not identified by standard PSI-BLAST) for 41 of 49 genomes. In ten cases, the reverse BLAST method more than doubled the number of structural templates identified.
Conclusions
• The task of large-scale protein sequence design has been efficiently parallelized on a massive scale.
• Design to a structural ensemble greatly increases the diversity of sequences generated, without loss of sequence quality.
• Similar structures produce sequence sets of similar diversity, and the distributions of sequence entropies for different folds segregate, supporting the designability postulate seen in simple models.
• “Reverse BLAST searching” uses designed sequences to accurately identify structural homologues.
• Reverse BLAST searching allows increased identification of structural templates for homology modeling without the need for natural sequence homologues.
Future Directions
• Use sequence profiles for specific proteins to generate biased combinatorial libraries for protein synthesis. This will experimentally test the ability of the design algorithm to produce viable sequences.
• Introduce functional constraints into the design process to produce new sequences which are both stable and functional.
• Refine methods for generating high sequence diversity for a given structure, allowing more extensive sampling of sequence space.
• Use computational design to redesign peptide ligands for applications in drug discovery and understanding protein-protein interactions.