
Folding@Home and Genome@home: Protein folding and design with distributed computing

Stefan Larson

Pande Group, Dept. of Chemistry and Biophysics Program

Stanford University

Credits

Pande Group

Dr. Vijay Pande

– Folding@home
  • Siraj Khaliq
  • Young Min Rhee
  • Michael Shirts
  • Chris Snow
  • Eric Sorin
  • Bojan Zagrovic
  • Sidney Elmer

– Genome@home
  • Stefan Larson
  • Vishal Vaidyanathan
  • Amit Garg
  • Guha Jayachandran

Collaborators

• Adam Beberg (Mithral)
• Dr. Jed Pitera (IBM)
• Dr. Bill Swope (IBM)
• Dr. Jay Ponder (Wash U)
• Folding@home users

• Dr. John Desjarlais (Xencor)
• Jeremy England (Harvard)
• Genome@home users

Molecular simulations in computational biology

Common challenges of Computational Biology

• Problems related to folding
  – Structure prediction
  – Binding
  – Protein-protein interaction

• Issues:
  – Models
    • Force fields (e.g. CHARMM, Amber)
    • Lots of parameters, constrained by experiment: good enough?
  – Sampling
    • Can simulate 1 ns = 10^-9 sec in a day
    • Need to sample 10^4 to 10^6 ns!
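The sampling gap can be made concrete with a little arithmetic. The sketch below takes the slide's figure of about 1 ns of simulated time per CPU-day at face value and converts the 10^4 to 10^6 ns target into wall-clock time on a single processor. The function and constant names are illustrative only.

```python
NS_PER_CPU_DAY = 1.0    # from the slide: roughly 1 ns of simulation per day on one CPU
DAYS_PER_YEAR = 365.25  # assumed calendar conversion


def single_cpu_years(target_ns, ns_per_day=NS_PER_CPU_DAY):
    """Years of wall-clock time for one CPU to simulate target_ns nanoseconds."""
    return target_ns / ns_per_day / DAYS_PER_YEAR


for target_ns in (1e4, 1e6):
    years = single_cpu_years(target_ns)
    print(f"{target_ns:.0e} ns -> {years:,.0f} years on one CPU")
```

Under these assumptions the target range corresponds to roughly 27 to 2,700 years of continuous single-CPU time, which is the gap that a sampling strategy has to close.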

Why simulate?

• Physics → chemistry → biology
  – Start from the laws of physics and chemistry, explain the properties of biomolecules

• Experiments: less detailed
  – Spectroscopies, FRET, NMR, etc.
  – Crystals are static

• Simulations: very detailed
  – Femtosecond time resolution
  – Angstrom spatial resolution
  – Much like having thousands of completely detailed single-molecule experiments

Goals

• Can we characterize folding computationally?
  – Accurate rates
  – Detailed mechanisms

• Can we design proteins?
  – Specific stable structure
  – Retention of function

Challenges of simulation

Models (force fields)

Sampling (tractability)

Analysis (insight)

Simulating protein folding

The Challenges of Protein Folding Simulation

1. How can we overcome the long timescales?
  • Fastest proteins fold in 10's to 100's of μs
  • Simulations are orders of magnitude shorter

2. Are force fields good enough?
  • Would we reach the native state (w/o NS info)?
  • Would we quantitatively predict folding rates, ΔG, etc. under experimental conditions (30 °C)?

3. Can we use simulation to learn about folding?
  • By what mechanism do they fold?
  • Do we agree with any folding theories?

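Relevant timescales

[Figure: logarithmic time axis from 10^-15 s (femto) through 10^-12 (pico), 10^-9 (nano), 10^-6 (micro) and 10^-3 (milli) to 10^0 (seconds). Events marked along the axis, from fastest to slowest: bond vibration, isomerisation, water dynamics, helix forms, fastest folders, typical folders, slow folders. Annotations: "MD step", "long MD run", "where we need to be", "where we'd love to be".]

• 16 orders of magnitude range
  – Femtosecond timesteps
  – Need to simulate micro to milliseconds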

Traditional parallel MD: Few, long trajectories

• Divide the force calculations between processors
  – Spatial decomposition for work division
  – Requires fast communication

[Images: T3E supercomputer; IBM Blue Gene]

Duan and Kollman, Science (1998)

Problem: we need WAY more time than is available at current supercomputer centers

Our method: Many, short trajectories

• Advantages of exponential kinetics:
  – Number that fold in time t:

    M f(t) = M [1 − exp(−kt)] ≈ M k t   for small kt

    M ~ 10,000 procs, k ~ 1/(10,000 ns), t ~ 20 ns/proc
    expect M k t ~ 20 simulations to fold

• Computationally economical
  – Doesn't waste resources on communication
  – Natural for large, heterogeneous clusters

• Important for folding
  – Heterogeneity of paths, statistics
  – Ergodicity
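The estimate of about 20 folding events can be checked directly. The sketch below assumes single-exponential kinetics, as the slide does, and compares the exact expression M[1 − exp(−kt)] with the small-kt approximation Mkt using the slide's numbers. The function name is illustrative.

```python
import math


def expected_folded(m, k_per_ns, t_ns):
    """Expected number of m independent trajectories that fold within t_ns,
    assuming single-exponential kinetics with rate k_per_ns."""
    exact = m * (1.0 - math.exp(-k_per_ns * t_ns))
    linear = m * k_per_ns * t_ns  # small-kt approximation, M*k*t
    return exact, linear


# Numbers from the slide: 10,000 processors, k ~ 1/(10,000 ns), 20 ns per processor.
exact, linear = expected_folded(m=10_000, k_per_ns=1.0 / 10_000, t_ns=20.0)
print(f"exact: {exact:.2f}, linear approximation: {linear:.2f}")
```

With kt = 0.002 the two values differ by about 0.1%, so the linear form is a safe shorthand in this regime.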

http://folding.stanford.edu

Distributed computing

home… lab/office… anywhere

The client uses the spare CPU cycles on a user’s computer to run the simulation algorithm on the assigned structure. Results are automatically returned and exchanged for a new work unit on a daily basis.

The server sends and receives the work units (essentially just protein structures and sequences). It verifies, collates and stores the returned data, completes initial analyses, and computes user statistics for the website.
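The exchange described above can be sketched as a small in-memory model. This is only an illustration of the assign, compute, return, reassign cycle; it is not the Folding@home or Genome@home protocol, and every name in it (`WorkUnit`, `Server`, `simulate`, `run_client`) is hypothetical. The "simulation" and the "verification" are trivial stand-ins.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    unit_id: int
    structure: str  # placeholder for a protein structure or sequence


@dataclass
class Server:
    queue: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)
    credit: dict = field(default_factory=dict)  # per-user statistics

    def assign(self):
        """Hand out the next work unit, or None when nothing is left."""
        return self.queue.popleft() if self.queue else None

    def submit(self, user, unit, result):
        """Check and store a returned result, credit the user, assign new work."""
        if not isinstance(result, float):  # stand-in for real verification
            raise ValueError("malformed result")
        self.results[unit.unit_id] = result
        self.credit[user] = self.credit.get(user, 0) + 1
        return self.assign()


def simulate(unit):
    """Stand-in for the simulation algorithm; returns a dummy number."""
    return float(len(unit.structure))


def run_client(server, user):
    """Client loop: each returned result is exchanged for a new work unit."""
    unit = server.assign()
    while unit is not None:
        unit = server.submit(user, unit, simulate(unit))


server = Server(queue=deque(WorkUnit(i, "A" * (i + 1)) for i in range(3)))
run_client(server, "alice")
print(server.results, server.credit)
```

Because each work unit is independent, clients never need to talk to each other, which is what makes this pattern suit slow, heterogeneous home connections.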

Worldwide distributed computing

Protein folding results

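What to fold? … fastest folders

[Figure: candidate systems (PPA, alpha helix, beta hairpin, BBA5, villin) placed on a logarithmic axis of nanoseconds (1 to 10^5), with a parallel scale of computational cost in CPU-days and CPU-years.]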

Rates: predicted vs experiment

[Figure: log-log plot of predicted folding time (nanoseconds) against experimental measurement (nanoseconds), both axes running from 1 to 100,000. Points: PPA, alpha helix, beta hairpin, villin, BBAW.]

Experiments:

• villin: Raleigh et al., SUNY Stony Brook
• BBAW: Gruebele et al., UIUC
• beta hairpin: Eaton et al., NIH
• alpha helix: Eaton et al., NIH
• PPA: Gruebele et al., UIUC

Mechanism: How did these proteins fold?

• Form secondary structure first
  – Form helices & hairpins
  – Hierarchical, decrease in entropy

• Collapse first
  – Hydrophobically driven
  – Need to remove water to form hydrogen bonds

• Form rough native shape first
  – Need to find the right “topology” first
  – Then pack side chains

What have we learned?

• Can tackle sampling today

• Force fields sufficient?
  – Folding to the native state
  – Folding rate prediction

• Role of water
  – Explicit solvent not crucial to rate determination?
  – Compare to explicit solvent simulation

• Universal mechanism of folding?
  – Maybe no universal mechanism: all proteins could be different?


Protein design

Stanford University: Stefan Larson, Amit Garg, Guha Jayachandran, Dr. Vijay Pande

Harvard University: Jeremy England

Xencor, Inc.: Dr. John Desjarlais

gah.stanford.edu

Exploring sequence space: large scale protein design

Utility of large sequence libraries

Directed evolution

• constrain and guide mutagenesis steps

• enrich starting material in “structured” sequences.

Homology modeling

• broader sequence database for finding homologues

• generate sequence profiles for alignments, etc.

Drug design

• In silico screening of peptide and peptide-mimetic ligands to reduce lead libraries for drug design.

Computational exploration of sequence space

Approach

• Detailed all-atom protein representations

• Standard molecular mechanics force-fields

• Generate large sequence libraries

• Apply results to relevant biomedical questions

Challenges

• modeling backbone flexibility

• generating sequence diversity

• large scale iteration of design process

Sequence prediction algorithm

Wollacott AM, Desjarlais JR. “Virtual interaction profiles of proteins.” J Mol Biol. 2001, 313(2):317-42.

Raha K, Wollacott AM, Italia MJ, Desjarlais JR. “Prediction of amino acid sequence from structure.” Protein Sci. 2000, 9(6):1106-19.

Johnson EC, Lazar GA, Desjarlais JR, Handel TM. “Solution structure and dynamics of a designed hydrophobic core variant of ubiquitin.” Structure Fold Des. 1999, 7(8):967-76.

Desjarlais JR, Handel TM. “Side-chain and backbone flexibility in protein core design.” J Mol Biol. 1999, 290(1):305-18.

Lazar GA, Desjarlais JR, Handel TM. “De novo design of the hydrophobic core of ubiquitin.” Protein Sci. 1997, 6(6):1167-78.

Desjarlais JR, Handel TM. “De novo design of the hydrophobic cores of proteins.” Protein Sci. 1995, 4(10):2006-18.

Energy function

• Amber/OPLS parameters

• implicit solvation

Sampling

• genetic algorithm

• structure-dependent rotamer space
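The sampling step names a genetic algorithm. The sketch below shows the general shape of such a search over amino-acid sequences, under loud assumptions: the energy is a toy score (mismatches against an arbitrary target string) standing in for an Amber/OPLS energy with implicit solvation, there is no rotamer or backbone model, and all function names and parameters are hypothetical. It is not the algorithm from the papers cited above.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq, target):
    """Toy score: number of positions that differ from an arbitrary target.
    A stand-in for a real force-field energy; lower is better."""
    return sum(a != b for a, b in zip(seq, target))


def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else a for a in seq)


def crossover(p1, p2, rng):
    """Single-point crossover between two parent sequences."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def genetic_search(target, pop_size=50, generations=60, seed=0):
    rng = random.Random(seed)
    n = len(target)
    pop = ["".join(rng.choice(AMINO_ACIDS) for _ in range(n)) for _ in range(pop_size)]
    history = []  # best energy seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda s: toy_energy(s, target))
        history.append(toy_energy(pop[0], target))
        parents = pop[: pop_size // 2]  # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    pop.sort(key=lambda s: toy_energy(s, target))
    history.append(toy_energy(pop[0], target))
    return pop[0], history


best, history = genetic_search("MKTAYIAKQR")
print(best, history[0], history[-1])
```

Keeping the better half of the population unchanged means the best energy can never get worse from one generation to the next; a real design run would replace `toy_energy` with a structure-based energy evaluated over rotamer choices.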

Structural ensembles

Increased sequence diversity

[Figure: sequence entropy, <exp(S(i))> (axis 0 to 8), plotted against number of structural variants (0 to 50).]

Decreased identity to native sequence

[Figure: frequency (0 to 0.6) against identity to native sequence (%, 0 to 90). Series: "Structural ensemble, full sequence" and "Single structure, full sequence".]
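The two quantities plotted here can be computed from a set of designed sequences of equal length. The slides do not define S(i), so the sketch below assumes it is the Shannon entropy (natural log) of the residue distribution at position i, which makes exp(S(i)) an effective number of residue types at that position. The function names and the four example sequences are made up for illustration.

```python
import math
from collections import Counter


def position_entropy(column):
    """Shannon entropy S(i), in nats, of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mean_exp_entropy(sequences):
    """<exp(S(i))>: effective number of residue types per position,
    averaged over all positions of equal-length sequences."""
    values = [math.exp(position_entropy(col)) for col in zip(*sequences)]
    return sum(values) / len(values)


def percent_identity(seq, native):
    """Percentage of positions at which seq matches the native sequence."""
    return 100.0 * sum(a == b for a, b in zip(seq, native)) / len(native)


designed = ["ACDE", "ACDF", "AGDE", "ACDF"]  # made-up designed sequences
native = "ACDE"                               # made-up native sequence

print(f"<exp(S)> = {mean_exp_entropy(designed):.3f}")
print([percent_identity(s, native) for s in designed])
```

On this reading, a fully conserved position contributes 1 and a position split evenly between two residues contributes 2, so higher values mean a more diverse library.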

Large scale sequence generation

Total structures: 253
Total backbone variants: 25,300
Total time of data collection: 62 days
Processors available: 3,000
Total sequences generated: 188,725

Diversity study: Sequence quality

[Figure: for each designed sequence profile, ranked by E-value (0 to 225), the E-value of the best PDB hit (log scale, 1E-17 to 10) and the average identity to the native sequence (%, 0 to 30).]

Designability

[Figure: frequency (0 to 0.8) against sequence entropy, exp(S) (3 to 7). Series: antifreeze, toxin, copper-bind, rubredoxin, Kunitz_BPTI, Phage_DNA_bind, and all 253 structures.]

Ongoing work and new directions

• Characterization of sequence space

• Natural sequence diversity (SH3)

• Homology modeling database

• SH3 peptide ligand design

• Experimental validation of designed sequences

• Hybrid approaches to protein design

• Design of peptide-mimetic ligands

• Design of functional proteins

• New design algorithms and parameter sets
