Scientific Benchmarks for Structure Prediction Codes
Jack Snoeyink & Matt O’Meara
Dept. Computer Science, UNC Chapel Hill
With thanks to:
Collaborators
  Brian Kuhlman, UNC Biochem
  Many other members of the RosettaCommons
  Richardson lab, Duke Biochem
Funding
  NIH
  NSF
Key Points…
Scientific Models, esp. for Structural Molecular Biology
  Models are the lens through which we view data
  Models are predominantly geometric
  Computational models are complex
  Models evolve, so testing becomes crucial
Focus on statistical/computational models with a sample source, observable local features, chosen functional form, fit parameters, & visualization/testing methods
Capture assumptions and data used to build models to:
  Visualize for making design decisions while building
  Fit parameters to ensure best performance
  Record as scientific benchmarks
Case Study: Rosetta protein structure prediction software [B]
Science views nature thru models
Scientists view nature thru models
People view the world thru models
Geometric molecular models
Model complexity
Physical and Conceptual models
  Kept simple to aid understanding
Statistical and Computational models
  Evolve by combining simple models
  Even when complex, can still be effective at Validation (MolProbity) or Prediction (Rosetta)
Computational model life cycle
Spiral development, much like software
  Discover problematic features in some data
  Create an energy function to adjust them
  Fit parameters to improve results
  Check into the software as a new option
  Make default option if everyone likes it
  Occasionally refactor and rewrite, removing outdated or unused models
But less support for testing…
Computational model testing
Our goal: Capture data and assumptions from model building for use in model visualization and testing.
Our computational models
Abstraction: A simple component of a complex computational model consists of:
  One or more sample sources (e.g., PDB files from natives or decoys), giving
  Observable local features (e.g., hydrogen bond distances and angles), having a
  Chosen functional form (e.g., energy from distances and angles), that
  Depends on fitting parameters (e.g., weights for combining terms)
KMB’03
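The four-part abstraction above can be written down as a small data structure. The following is only a minimal sketch: the class name `ModelComponent`, the feature names `ah_dist` and `cos_ahd`, the parameter values, and the quadratic `toy_hbond_energy` are all hypothetical and chosen for illustration. They are not Rosetta's actual hydrogen bond term.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]
Params = Dict[str, float]


@dataclass
class ModelComponent:
    """One simple component of a larger computational model (hypothetical)."""
    sample_sources: List[str]                            # e.g. native or decoy structures
    feature_names: List[str]                             # observable local features
    functional_form: Callable[[Features, Params], float]
    parameters: Params                                   # fit parameters / weights

    def energy(self, features: Features) -> float:
        # Evaluate the chosen functional form with the current parameters.
        return self.functional_form(features, self.parameters)


def toy_hbond_energy(features: Features, params: Params) -> float:
    # Toy functional form (an assumption, not the real score term):
    # weighted squared deviations from "ideal" distance and angle cosine.
    dist_term = (features["ah_dist"] - params["ideal_dist"]) ** 2
    angle_term = (features["cos_ahd"] - params["ideal_cos"]) ** 2
    return params["w_dist"] * dist_term + params["w_angle"] * angle_term


hbond = ModelComponent(
    sample_sources=["native", "decoy"],
    feature_names=["ah_dist", "cos_ahd"],
    functional_form=toy_hbond_energy,
    parameters={"w_dist": 1.0, "w_angle": 0.5,
                "ideal_dist": 1.9, "ideal_cos": -1.0},
)

ideal_energy = hbond.energy({"ah_dist": 1.9, "cos_ahd": -1.0})
stretched_energy = hbond.energy({"ah_dist": 2.9, "cos_ahd": -1.0})
```

Keeping the sample sources and parameters next to the functional form is one way to record the data and assumptions that went into a component, so they can be revisited when the model is tested later.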
Tool schematic
[Figure: data sets A, B, …, Z; gather features; SQL query (filter, transform); ggplot2 spec; plots; statistics]
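The stages named in the schematic (gather features, SQL query, statistics) can be imitated in a few lines. This is a sketch under stated assumptions, not the authors' tool: the table name `hbond`, its columns, the distance cutoff, and the sample values are made up, and an in-memory SQLite database with a per-source summary stands in for the real feature database and the ggplot2 plotting step.

```python
import sqlite3
import statistics


def build_feature_db(samples):
    """Gather features: one row per observed distance, tagged by sample source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hbond (sample_source TEXT, ah_dist REAL)")
    rows = [(src, d) for src, dists in samples.items() for d in dists]
    conn.executemany("INSERT INTO hbond VALUES (?, ?)", rows)
    return conn


def summarize(conn, max_dist):
    """Filter with an SQL query, then compute (count, mean) per sample source."""
    query = ("SELECT sample_source, ah_dist FROM hbond "
             "WHERE ah_dist <= ? ORDER BY sample_source")
    grouped = {}
    for src, dist in conn.execute(query, (max_dist,)):
        grouped.setdefault(src, []).append(dist)
    return {src: (len(v), statistics.mean(v)) for src, v in grouped.items()}


# Made-up distances for two hypothetical sample sources.
samples = {"native": [1.8, 1.9, 2.0, 3.5], "decoy": [2.1, 2.3, 2.5]}
conn = build_feature_db(samples)
summary = summarize(conn, max_dist=3.0)
```

Because the filter lives in the query rather than in the plotting code, the same feature database can be reused for different comparisons without regathering features.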
Visualization
Implemented tools
  Compare distributions from sample sources
  Tufte’s small multiples via ggplot
  Kernel density estimation
  Normalization
Opportunities for
  Statistical analysis
  Dimension reduction
  …
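Kernel density estimation for comparing two sample sources can be sketched in plain Python. The Gaussian kernel, the fixed bandwidth of 0.1, and the distance values below are assumptions for illustration. The slides do not say which kernel or bandwidth the implemented tools use.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up distances for two hypothetical sample sources.
native = [1.8, 1.85, 1.9, 1.95, 2.0]
decoy = [2.1, 2.2, 2.3, 2.4, 2.5]

native_density = gaussian_kde(native, bandwidth=0.1)
decoy_density = gaussian_kde(decoy, bandwidth=0.1)

# Evaluate both densities on a shared grid so the curves can be overlaid.
step = 0.01
grid = [1.0 + step * i for i in range(251)]
native_curve = [native_density(x) for x in grid]
decoy_curve = [decoy_density(x) for x in grid]
```

Evaluating every sample source on the same grid is what makes the distributions directly comparable, whether they are overlaid in one panel or drawn as small multiples.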
Normalization
[KMB’03] Histogram of Hbond A-H distances in natives
[Figure: histogram; x-axis bins labeled 1.45 to 2.85 in steps of 0.10, y-axis ticks 0 to 1400 in steps of 200]
Tool uses…
Scientific unit tests
  Sample sources: native, HEAD, ^HEAD
  Run on a continuous testing server
Knowledge-base score term creation
  Sample sources: native, release, experimental
  Turn exploration into living benchmarks
Test design hypotheses
  Sample sources: native, protocol, designs
  How strange is this geometry?
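One way a scientific unit test could compare a feature distribution between sample sources is a two-sample Kolmogorov-Smirnov statistic with a tolerance. This is a hedged sketch, not the test the authors run: the choice of statistic, the function names, the threshold of 0.3, and the data are all assumptions.

```python
import bisect


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The empirical CDFs only change at data points, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def distributions_agree(reference, candidate, threshold):
    """Pass if the candidate distribution stays within threshold of the reference."""
    return ks_statistic(reference, candidate) <= threshold


# Made-up feature values for three hypothetical sample sources.
native = [1.8, 1.9, 2.0, 2.1]
head = [1.8, 1.9, 2.0, 2.1]            # structures produced by the current code
previous_head = [2.8, 2.9, 3.0, 3.1]   # structures produced by an older revision

head_ok = distributions_agree(native, head, threshold=0.3)
previous_ok = distributions_agree(native, previous_head, threshold=0.3)
overlap_gap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Run on each revision, a check like this would flag a code change that moves a feature distribution away from the native one, which is the kind of regression a continuous testing server is meant to catch.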
Rotamer recovery