Gerstein.info/talks (c) 2003
Do not reproduce without permission
Permissions Statement
This Presentation is copyright Mark Gerstein, Yale University, 2005.
Feel free to use images in it with PROPER acknowledgement (via citation to relevant papers or link to gersteinlab.org).
Computational Proteomics: Networks & Structures
Mark B Gerstein
Yale (Comp. Bio. & Bioinformatics)
Ottawa Health Research Institute
2006.10.23, 14:30 – 15:30
Omics Research at GersteinLab.org
• Human Genome Analysis (pseudogenes): Finding genes, characterizing the function of intergenic regions, and analyzing protein fossils (pseudogenes)
• Eukaryotic Proteome Analysis (networks): Using molecular networks to integrate functional genomics information and describe protein function on a genomic scale
• Structural Genomics (macromolecular motions): Analyzing select populations of 3D-structures in detail, trying to understand their flexibility in terms of packing
Outline
Computational Proteomics: Networks & Structures
• 3-D Structural Analysis of Protein Interaction Networks Gives New Insight Into Protein Function, Network Topology and Evolution
  - Interaction Networks and their properties
  - A 3-D structural point of view
  - Network properties revisited
  - TopNet Website
• Surveying Structural Motions in a DB Framework
  - Motions DB based on Simple Models for Protein Flexibility
  - Detailed Classification based on Interface Packing
    • Hinge & Shear
    • Packing Tools
  - Comprehensive Statistics on Flexibility over all Structures
    • [ Distributions ]
    • Hinge Survey
TopNet – an automated web tool (vers. 2)
[Yu et al., 2004; Yip et al., 2005; similar tools include Cytoscape.org, Ideker, Sander et al.]
Normal website + Downloaded code (JAVA) + Web service (SOAP) with Cytoscape plugin
SVGA visualization, Network Mgt. (Multiple Network Support, tagging with DB)
Surveying structural flexibility on a proteomic scale
• Originally identified in early structures
  - Hb, ATCase, hexokinase
• Why study it?
  - Complicated biological phenomena that can be studied in quantitative detail
    • changes in 1000s of coordinates
  - Motions link structure & function (many functions carried out by motions)
    • catalysis, regulation, transport, formation of assemblies, cellular locomotion
    • ligand binding
  - Structural genomics will produce many structures with slight variations on same fold
• Next step after fold classification is flexibility classification
New structures increasingly don't give new folds
from "Expectations for Structural Genomics" [Levitt, Protein Science 9: 197]
Surveying structural flexibility on a proteomic scale
• Questions
  - How do we describe a wide range of structural variability in standard terms?
  - Can we develop simple models to explain constraints on protein flexibility?
  - What information about flexible hinge location is encoded in sequence?
Computational Proteomics: Understanding Protein Flexibility in a Database Framework
1) Motions DB based on Simple Models for Protein Flexibility
- Rigid Core Models, Pathway Interpolation, NMA.
2) Detailed Classification of Small Subset of Motions based on Interface Packing
- Packing constrains motions. Two mechanisms, Hinge and Shear, depending on whether there is a well-packed interface, account for many motions (CS vs LF). More involved motions exist (e.g. Ig, GroEL).
3) Comprehensive Statistics on Flexibility over all Structures
- Putting individual motions into perspective from distributions. Some initial conclusions from datamining.
Example "Morph": MBP
2 Known Crystal Structures (endpoints, not necessarily same seq.)
Std. Geometric Stats. (from structure comparison)
Pathway Interpolation
Motions: collecting together and annotating individual morphs into logical units
~19K morphs (instances of conformational variability)
(384 canonical ones)
~200 classified motions
Morph Analysis System to Standardize Protein Motions and Create Pathways
Alignment
Superposition
Screw-Axis Orientation
Homogenization
Pathway Generation
Visual Rendering
Web Report
Simple 2 Rigid Core Model of Protein Motions: to what degree does it apply?
[Figure: Struc-1 and Struc-2 compared via Core-1 Fit, Core-2 Fit, and Overall Fit]
Std. Statistics: RMS of Core 1 & 2 fits; Rot. & Trans.; Max. Disp.; Centroid-Screw-Axis Dist.
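The standard statistics listed for the two-rigid-core model can be illustrated with a minimal sketch. It assumes the two structures are already aligned atom-for-atom and superimposed on the first core; the superposition itself is not shown. The function names and the toy coordinates are hypothetical and are not taken from the Morph server.

```python
import math

def rmsd(coords_a, coords_b, indices=None):
    """RMS deviation over matched atoms; `indices` restricts it to one core."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    indices = list(range(len(coords_a)) if indices is None else indices)
    if not indices:
        raise ValueError("no atoms selected")
    total = sum(math.dist(coords_a[i], coords_b[i]) ** 2 for i in indices)
    return math.sqrt(total / len(indices))

def max_displacement(coords_a, coords_b):
    """Largest distance moved by any single matched atom."""
    return max(math.dist(p, q) for p, q in zip(coords_a, coords_b))

# Toy data (invented): atoms 0-1 form the fixed core, atoms 2-3 the moving core.
struc_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
struc_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 3.0, 0.0)]

core_1_rms = rmsd(struc_1, struc_2, indices=[0, 1])  # fit quality of core 1
core_2_rms = rmsd(struc_1, struc_2, indices=[2, 3])  # core 2 before refitting
overall_rms = rmsd(struc_1, struc_2)
largest_move = max_displacement(struc_1, struc_2)
```

A low core-1 RMS together with a high core-2 RMS (before refitting on core 2) is the pattern a two-rigid-core description would be expected to produce.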
Visualizing Pathways: Interpolation via Adiabatic Mapping
1 Interpolation step
1' Energy minimization (VDW + bonds) [Charmm, Encad]
2 Re-interpolate, re-minimize….
* Slows down over humps
[Plot: Energy vs. Interpolation Step (steps 1 to 9)]
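The interpolate, minimize, re-interpolate loop on this slide can be sketched as follows. This is a toy under stated assumptions, not the server's method: the energy here is only a harmonic bond term between consecutive atoms, minimized by plain gradient descent, whereas the slide cites VDW + bond terms evaluated with Charmm or Encad. All function names and parameters are hypothetical.

```python
import math

def interpolate(current, target, fraction):
    """Move every atom a given fraction of the way from current to target."""
    return [tuple(c + fraction * (t - c) for c, t in zip(p, q))
            for p, q in zip(current, target)]

def minimize(coords, bond_length, k=1.0, step=0.05, iterations=200):
    """Gradient descent on E = sum k * (d - bond_length)**2 over consecutive atoms."""
    coords = [list(p) for p in coords]
    for _ in range(iterations):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i in range(len(coords) - 1):
            d = math.dist(coords[i], coords[i + 1])
            if d == 0.0:
                continue  # direction undefined; skip this bond for this iteration
            scale = 2.0 * k * (d - bond_length) / d
            for ax in range(3):
                g = scale * (coords[i][ax] - coords[i + 1][ax])
                grads[i][ax] += g
                grads[i + 1][ax] -= g
        for p, g in zip(coords, grads):
            for ax in range(3):
                p[ax] -= step * g[ax]
    return [tuple(p) for p in coords]

def adiabatic_path(start, end, n_frames, bond_length):
    """Alternate small interpolation steps with minimization (steps 1, 1', 2, ...)."""
    frames = [start]
    current = start
    for frame in range(1, n_frames):
        remaining = n_frames - frame  # interpolation steps left, including this one
        current = interpolate(current, end, 1.0 / remaining)
        current = minimize(current, bond_length)
        frames.append(current)
    return frames

# Toy case: one bond of length 1 rotating by 90 degrees.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
linear_mid = interpolate(start, end, 0.5)        # bond shrinks to about 0.71
frames = adiabatic_path(start, end, 3, bond_length=1.0)
```

In the toy case the purely linear midpoint shortens the bond, while the minimized intermediate frame restores it, which is the kind of distortion the minimization step is meant to relieve.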
Adiabatic Mapping vs Linear Interpolation Strategies
Compared with Calmodulin
cm: 1ctr-1cl1
[Figure: Frame 4 (adiabatic) vs. Frame 4 (linear); labeled "Collapsed"; cm: 1ctr-1cl1]
Other Dynamic Calculations to Model Pathway
• Most simple possible calculations here, but....
• Progressively add and subtract energy terms from pathway calculation
• Interoperate DB within framework of dynamics groups
• Normal Modes …
Normal Mode Analysis
• Describe flexibility of system by characteristic harmonic modes
• Calculate 20 lowest-freq. modes for 1 conformation of each pair in morph, using simple mass distribution and spring potential
• Find best linear combination of modes (v) fitting initial direction of observed motion
• Measure degree to which fit matches initial direction of the observed motion. Measure how concentrated linear combination is in a few modes (entropy ~ v ln v)
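The last two bullets can be illustrated with a small sketch. It assumes the modes are supplied as orthonormal vectors, so that the best linear combination is simply the set of projections, and it takes the "entropy" to be the Shannon entropy of the normalized squared coefficients. Both choices are assumptions for illustration; the slide only indicates "entropy ~ v ln v". Function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_modes(modes, displacement):
    """Project the observed initial displacement onto orthonormal modes.

    Returns the coefficients and the overlap, i.e. the fraction of the
    displacement's length captured by the fitted combination (0 to 1).
    """
    coeffs = [dot(m, displacement) for m in modes]
    norm = math.sqrt(dot(displacement, displacement))
    if norm == 0.0:
        raise ValueError("displacement must be non-zero")
    overlap = math.sqrt(sum(c * c for c in coeffs)) / norm
    return coeffs, overlap

def mode_entropy(coeffs):
    """Entropy of the normalized squared coefficients.

    0 when a single mode carries the whole fit, ln(n) when n modes
    contribute equally.
    """
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        raise ValueError("fit has no component along any mode")
    weights = [c * c / total for c in coeffs if c != 0.0]
    return -sum(w * math.log(w) for w in weights)

# Toy example: two orthonormal "modes" in a 3-dimensional coordinate space.
modes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
coeffs, overlap = fit_modes(modes, (3.0, 4.0, 0.0))
spread = mode_entropy(coeffs)
```

A high overlap with a low entropy would indicate that the start of the observed motion is well described by only a few low-frequency modes.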
Breakdown of the Database
• Submitted: >4400 user-submitted morphs
• Manual: ~200 manually classified motions
• Automatic: >14000 automatically classified motions
Classification of motions by packing
[Figure labels: Submitted, Manual, Automatic]
Interdigitating structure of protein interfaces constrains motion
Sliding Shear Motion Between two Close Packed Helices
2 Ideal Mechanisms, depending on whether a well-packed interface is maintained continuously over motion

Aspect | Shear | Hinge
Mainchain Packing | Constrained by close packing | Free to kink
Mainchain Torsions | Many small changes | A few large changes
Motion Overall | Concatenation of small local motions | Identical to twisting at hinge
Motion at Interface | Parallel to plane of interface (shear) | Perpendicular to plane of interface, exposing & burying surfaces
Sidechain Packing | Same packing in both forms | New contacts created; packing at base of hinge crucial
Sidechain Torsions | Mostly small changes | Some large changes
Glutamate mutase: Intradomain Shear Motion
[Krautler]
Small Shearing Domain Motions: Molybdenum-binding protein & GAPDH
[Lawson] [Wonacott]
Citrate Synthase: Domain Motion with Shearing Helices
Ras: Hinged Loop
[SH Kim]
Troponin: fragment hinge motion of secondary structures
Absence of packing at joint
[M James]
Transferrin: Interdomain Hinges
[Baker]
Transferrin hinge involves absence of steric constraints (continuously maintained interfaces), esp. at hinge
Packing Tools - Voronoi software to calculate packing volumes
Volume Distribution and Std. Volume Typing for Atoms
Optimized Radii for Proteins and Nucleic Acids
Goal of helix.gersteinlab.org: to provide a comprehensive suite of tools for analyzing helix packing
helix.gersteinlab.org
enter PDB ID / upload PDB file
STRIDE processing
helix-helix interaction report (distance-based)
visualization of helix interactions (Jmol)
Voronoi calculation
visualization of helix-helix interface (VRML)
helix-helix contact area calculation (Jmol)
sequence motif search (Jmol)
report of atom-atom contacts from Voronoi calculation
PDB file verification and tool selection menu
[Figure, panels (a)-(c): PDB ID 1C3W, helices 3 and 7; Arg and Asp labeled; intersection area = 23.3 Å², crossing angle = 24.6°]
[Figure, panel (d): PDB ID 1C3W, GxxxG motif (residues 116-120A, GIMIG)]
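A sequence motif search of the kind shown in panel (d) can be sketched with a regular expression. This is only an illustration of the GxxxG pattern (G, any three residues, G); it is not the helix.gersteinlab.org implementation, and the function name and numbering argument are hypothetical.

```python
import re

# Lookahead so that overlapping motifs (e.g. GxxxGxxxG) are all reported.
GXXXG = re.compile(r"(?=(G...G))")

def find_gxxxg(sequence, first_residue_number=1):
    """Return (residue_number, motif) for every GxxxG occurrence.

    `first_residue_number` is the residue number of sequence[0], so that
    hits can be reported in the numbering of the structure file.
    """
    return [(m.start() + first_residue_number, m.group(1))
            for m in GXXXG.finditer(sequence.upper())]

hits = find_gxxxg("GIMIG", first_residue_number=116)
overlapping = find_gxxxg("GAAAGAAAG")
```

With the five-residue fragment from the slide and numbering starting at 116, the single hit spans residues 116 to 120.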
Global Statistics from Comprehensive Analysis of Flexibility in the PDB
[Figure labels: Submitted, Manual, Automatic]
• "Unbiased" view of flexibility in PDB
• Automatic structural alignments of all pairs in the PDB (based on fold classification)
• One subset of ~14K is 3814 pairs with large structural differences (& acceptable morphs) but great seq. similarity
Rotation Distribution
[Histogram: Frequency vs. Rotation Angle of 2nd Core (degrees)]
Translation Distribution
[Histogram: Frequency vs. Translation of 2nd Core (Angstroms)]
Max Displacement Distribution
[Histogram: Frequency vs. Max Atomic Displacement (Angstroms)]
An individual in the population: typical or unusual
Average Displacement of Moving Core
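Placing one motion within the population can be sketched as a simple rank calculation. The percentages quoted on the individual example slides fall as the measured value rises (60 Å is listed at 9.6%, 7.8 Å at 92%), which is consistent with reading them as the share of the population with an equal or larger value; that reading is an assumption, as are the function name and the toy population.

```python
from bisect import bisect_left

def percent_at_least(population, value):
    """Percentage of the population whose statistic is >= value."""
    if not population:
        raise ValueError("population must not be empty")
    ordered = sorted(population)
    n_below = bisect_left(ordered, value)  # entries strictly smaller than value
    return 100.0 * (len(ordered) - n_below) / len(ordered)

# Invented toy population of maximum displacements (Angstroms).
max_disp_population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
share = percent_at_least(max_disp_population, 9)
```

Under this reading a small percentage marks an unusually large motion and a percentage near 100 marks a typical or small one.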
TGL: Motion of Small Fragment
[Derewenda]
max-Disp: 13 Å (82%)
Trans: 1.7 Å (92%)
Rot: 2.7° (91%)
max-E: 30 (83%)
cAMP-dependent Protein Kinase: Complex Motion
[Taylor]
max-Disp: 7.8 Å (92%)
Trans: 0.98 Å (95%)
Rot: 4.9 ° (84%)
max-E: 23 (89%)
Diphtheria Toxin: Domain Swapping
[Figure: domains A and B in the two forms]
[Eisenberg]
max-Disp: 60 Å (9.6%)
Trans: 66 Å (20%)
Rot: 62° (37%)
max-E: 482 (11%)
Real Motion between Diverged Members of Periplasmic Binding Protein II Superfamily
(oligo-peptide & dipeptide binding proteins) [~26% identity]
max-Disp: 30 Å (53%), Trans: 8.7 Å (66%), Rot: 34° (48%), max-E: 59 (69%)
[Quiocho]
Degree to which initial direction of motion can be fit by a few modes
Flexibility Prediction from a Single Structure
• Hinge Atlas: a resource for statistical studies of protein flexibility
• Hinge information in sequence, using the Hinge Atlas
• Structure-based hinge predictors, tested using Hinge Atlas Gold
Relation between Vectors of Lowest Normal Mode and Obs. Motion for 4 Most Mobile Atoms: intra-domain motions in Calmodulin & bR
[Figure labels: N-terminus, C-terminus; THR 5, VAL 101, VAL 177, PHE 153, ASP 64, LYS 148, SER 147, THR 146]
Relation between Vectors of Lowest Normal Mode and Obs. Motion for 4 Most Mobile Atoms: inter-domain motion in T7 RNA polymerase
[Figure labels: atoms 1 to 4; N-terminus, C-terminus]
Data Mining & Clustering on Corpus of Statistics
• Datamining on statistics…
  - The Hinge Atlas: Hinge information in sequence
• Auto characterize submitted motion as being similar to previous observed motion
• Develop canonical set of motions
The Hinge Atlas
• Hundreds of protein pairs (morphs) observed
• Hinge regions manually selected
• Useful for testing hinge predictors or for statistical studies of hinge properties
• Hinge information can be transferred to homologs for which hinges are unknown
• Public involvement
• 214 nonredundant proteins annotated, and growing
Viewer and public interface
• Highlight hinges from Hinge Atlas annotation
• ‘Public hinge’ submissions taken from users
• We used this annotation to look for hinge information in sequence...
Glycine and serine are significantly more likely to occur in hinges.
Phenylalanine, valine, alanine, and leucine are less likely to occur.
[Plot: Log-odds and p-value of amino acid occurrence in hinges (amino acid frequency of occurrence in hinges); axes: p-value and LOD score; x-axis order: PHE, CYS, TRP, VAL, ALA, ILE, LEU, LYS, GLU, MET, ARG, ASP, GLN, ASN, THR, TYR, HIS, PRO, GLY, SER]
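A log-odds score of this kind can be sketched from raw counts. The sketch assumes the LOD for a residue type is the base-10 log of its frequency among hinge residues divided by its frequency among all residues; how the p-values on the plot were obtained is not stated here and is not reproduced. The function name and the counts are invented for illustration.

```python
import math

def log_odds(hinge_counts, all_counts):
    """log10 of (frequency in hinges) / (frequency overall), per residue type.

    Residue types never seen in a hinge are skipped, because the log of
    zero is undefined.
    """
    n_hinge = sum(hinge_counts.values())
    n_all = sum(all_counts.values())
    scores = {}
    for aa, total in all_counts.items():
        in_hinge = hinge_counts.get(aa, 0)
        if in_hinge == 0 or total == 0:
            continue
        scores[aa] = math.log10((in_hinge / n_hinge) / (total / n_all))
    return scores

# Invented toy counts, not data from the Hinge Atlas.
all_counts = {"GLY": 100, "PHE": 100, "ALA": 800}
hinge_counts = {"GLY": 20, "PHE": 5, "ALA": 75}
lod = log_odds(hinge_counts, all_counts)
```

A positive score marks a residue type that is enriched in hinges and a negative score one that is depleted.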
Are hinges segregated by secondary structure?
Hinges were found to occur preferentially in turns and disordered regions, and to avoid alpha helices.
[Plot: Hinge coincidence with secondary structure; p-value and LOD for Other, Alpha helix, Beta sheet, Turn, Random coil]
Are certain physicochemical properties preferred in hinge residues?
High confidence that small residues are preferred; aliphatic and hydrophobic residues are avoided.
[Plot: Hinge coincidence with physicochemical property; p-value and LOD for Aliphatic, Aromatic, Hydrophobic, Negative, Charged, Positive, Polar, Small, Tiny]
Do hinges coincide with active sites?
We found that computer annotated hinges had a significant tendency to coincide with active site residues from the Catalytic Site Atlas. No significant coincidence was found for residues near the hinge.
[Plot: Hinge and active site residues; p-value and LOD vs. distance from active site (residues), 0 to 10]
Are hinge residues conserved in evolution?
Hinge residues were anti-conserved, or rather hypermutable.
[Plot: Hinge occurrence vs. conservation; p-value and LOD by conservation bin (Top 1/5th through Bottom 1/5th)]
Hypermutability is probably due to appearance on surface
[Plot: Hinge occurrence vs. solvent accessible surface area; p-value and HI by ASA bin (1 to 5); annotated p-values: 3∙10^-12, 3∙10^-9, 1∙10^-6, <10^-30]
Hinge residues tended to occur on the surface with extremely high significance.
HingeSeq: A sequence-based hinge predictor
• Uses the definition of the Hinge Index:
\[ HI(a_i) = \log_{10} \frac{p(a_i \mid h)}{p(a_i)} \]
• And sums Hinge Indices for residue type, secondary structure, and active site annotation, assuming independence:
\[ HingeSeq(i) = \log_{10} \frac{p(a_j \mid h)\, p(a_k \mid h)\, p(a_l \mid h)}{p(a_j)\, p(a_k)\, p(a_l)} = HI_{aa}(i) + HI_{ss}(i) + HI_{active}(i) \]
[Plot: ROC curve for HingeSeq; TP/(TP+FN) vs. FP/(FP+TN)]
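The summation of per-feature Hinge Indices and the ROC evaluation can be sketched as follows. The sketch assumes the three Hinge Index tables have already been computed and are passed in as dictionaries; the table values, feature labels, and function names are invented for illustration and are not HingeSeq's actual parameters.

```python
def hinge_seq(residues, hi_aa, hi_ss, hi_active):
    """Per-residue score: sum of the three Hinge Indices (independence assumed).

    `residues` is a list of (amino_acid, secondary_structure, is_active_site).
    Features missing from a table contribute 0.
    """
    return [hi_aa.get(aa, 0.0) + hi_ss.get(ss, 0.0) + hi_active.get(active, 0.0)
            for aa, ss, active in residues]

def roc_points(scores, is_hinge):
    """(FP/(FP+TN), TP/(TP+FN)) for each score threshold, highest first.

    Requires at least one hinge and one non-hinge residue.
    """
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = fp = fn = tn = 0
        for score, hinge in zip(scores, is_hinge):
            predicted = score >= threshold
            if predicted and hinge:
                tp += 1
            elif predicted:
                fp += 1
            elif hinge:
                fn += 1
            else:
                tn += 1
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Invented toy tables and residues.
hi_aa = {"GLY": 0.2, "PHE": -0.2}
hi_ss = {"turn": 0.4, "helix": -0.3}
hi_active = {True: 0.3, False: 0.0}
residues = [("GLY", "turn", False), ("PHE", "helix", False), ("GLY", "helix", True)]

scores = hinge_seq(residues, hi_aa, hi_ss, hi_active)
curve = roc_points(scores, is_hinge=[True, False, False])
```

Sweeping the threshold from the highest score downward traces the curve from the lower left toward the upper right, matching the axes of the ROC plot.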