Two bioinformatics applications of dynamic Bayesian networks
William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
Outline
• Segmenting genomic data
  – Background: DNA, chromatin and DNase I
  – Simple solution
  – Wavelets
  – Hierarchical model
• Matching peptides to mass spectra
  – Background: tandem mass spectrometry
  – Modeling peptide fragmentation
[Figure labels: genes; gene 'domains'; DNase I hypersensitive site; trans-factor complex; chromatin fiber; nucleus; genomic DNA packaged into chromatin]
The human genome in vivo
Measuring chromatin accessibility
A simple hidden Markov model
• Each state contains a single Gaussian.
• The model has six parameters (two transitions, two means, two standard deviations).
• The parameters are initialized randomly and trained in an unsupervised fashion via expectation-maximization.
• EM is restarted 100 times, and we select the parameters that yield the highest likelihood.
• The original data set is then segmented using either Viterbi or posterior decoding.
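The decoding step can be sketched as follows. This is a minimal illustration rather than the speaker's implementation: it covers Viterbi decoding only (no EM training or restarts), the six parameter values are made up, the function names are hypothetical, and the start distribution is assumed to be uniform.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi_two_state(obs, stay, means, sds):
    """Most likely state path for a two-state HMM with one Gaussian per state.

    stay[k] is the self-transition probability of state k (an assumed
    parameterization of the two transition parameters). The start
    distribution is assumed uniform.
    """
    log_trans = [
        [math.log(stay[0]), math.log(1 - stay[0])],
        [math.log(1 - stay[1]), math.log(stay[1])],
    ]
    score = [math.log(0.5) + gaussian_logpdf(obs[0], means[k], sds[k]) for k in (0, 1)]
    back = []  # back[t][k] = best predecessor of state k at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for k in (0, 1):
            cands = [score[j] + log_trans[j][k] for j in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + gaussian_logpdf(x, means[k], sds[k]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy signal: state 0 stands for closed chromatin, state 1 for open chromatin.
signal = [0.1, -0.2, 0.0, 0.3, 2.9, 3.2, 3.0, 2.8, 0.1, -0.1]
path = viterbi_two_state(signal, stay=(0.9, 0.9), means=(0.0, 3.0), sds=(0.5, 0.5))
```

Working in log space avoids numerical underflow on long sequences, which matters for signals that span megabases.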
[Figure labels: open chromatin; closed chromatin; 1.5 megabases]
A problem, and two solutions
• Problem: We are interested in phenomena occurring at multiple scales.
• Solution #1: Perform a wavelet smooth prior to HMM analysis.
• Solution #2: Build a more complex probability model.
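A rough sketch of what Solution #1 could look like. The talk does not say which wavelet or how many levels were used, so the Haar wavelet, the `levels` parameter, and the function name here are assumptions chosen for brevity.

```python
def haar_smooth(signal, levels):
    """Smooth by zeroing Haar detail coefficients at the `levels` finest scales.

    With the Haar wavelet this is equivalent to replacing each block of
    2**levels samples with its mean. The length must be a multiple of
    2**levels.
    """
    block = 2 ** levels
    if len(signal) % block != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    # Analysis: repeatedly average adjacent pairs and discard the details.
    for _ in range(levels):
        approx = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
    # Synthesis with zeroed details: each coarse value fills its whole block.
    return [value for value in approx for _ in range(block)]

smoothed = haar_smooth([1.0, 3.0, 2.0, 4.0, 10.0, 12.0, 11.0, 13.0], levels=2)
```

Running the HMM on smoothed signals at several values of `levels` is one way to look at the same region at multiple scales.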
Change point model
• Four-state model:
  – major DNase hypersensitive site (DHS),
  – minor DHS,
  – intermediate sensitivity region, and
  – insensitive region.
• Continuous mixture of Gaussians at each state.
• Gamma distribution of lengths within each region.
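To make the generative picture concrete, here is a simplified sampling sketch. All numbers are hypothetical, each state uses a single Gaussian instead of the continuous mixture described above, and the rule for choosing the next state (uniform over the other three) is an assumption, not something stated in the talk.

```python
import random

# Hypothetical per-state parameters: (gamma shape, gamma scale, mean, sd).
STATES = {
    "insensitive": (2.0, 50.0, 0.0, 0.3),
    "intermediate": (2.0, 20.0, 1.0, 0.4),
    "minor DHS": (2.0, 5.0, 2.5, 0.5),
    "major DHS": (2.0, 5.0, 5.0, 0.8),
}

def sample_segments(n_segments, seed=0):
    """Sample (state, values) segments with gamma-distributed lengths."""
    rng = random.Random(seed)
    names = list(STATES)
    segments = []
    previous = None
    for _ in range(n_segments):
        # Assumed transition rule: pick uniformly among the other states.
        state = rng.choice([s for s in names if s != previous])
        shape, scale, mean, sd = STATES[state]
        # Region length drawn from a gamma distribution, at least one probe.
        length = max(1, round(rng.gammavariate(shape, scale)))
        values = [rng.gauss(mean, sd) for _ in range(length)]
        segments.append((state, values))
        previous = state
    return segments

segments = sample_segments(5)
```

Unlike the geometric lengths implied by a plain HMM self-transition, a gamma distribution with shape greater than one places little mass on very short regions.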
Spanning the gaps
Beginning in State 1 (Insensitive)
Spanning the gaps
Beginning in State 4 (Major DHS)
Selecting the number of states
Improved fit to the data
Each panel is a QQ plot of the difference between the observed residuals and the theoretical Gaussian.
[Panels: insensitive; intermediate sensitivity; minor DHS; major DHS]
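For reference, the points of a Gaussian QQ plot can be computed as below. This is a generic sketch with made-up residuals; the plotting positions (i + 0.5) / n and the function name are assumptions, not details from the talk.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a Gaussian QQ plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n avoid the undefined quantiles at 0 and 1.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

points = qq_points([-1.2, 0.4, 0.1, -0.3, 1.5, 0.0, -0.6, 0.9])
```

If the residuals within a state are well described by the fitted Gaussian, the points fall close to the diagonal.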
Capturing different scales
Enrichment of biologically relevant features
Future directions
• Many types of genomic data
  – Phylogenetic conservation scores
  – Various histone modifications
  – Replication timing, etc.
• Perform segmentations in multiple dimensions simultaneously.
• Assign statistical significance to observed segments.
Shotgun proteomics
[Figure: workflow diagram with boxes labeled training PSMs, probability model, trained model, test PSMs, and evaluation]
PSM = peptide-spectrum match
Peptide sequence influences peak height
Bayesian network
• We model peptide fragmentation using a Bayesian network.
• Nodes represent random variables, and edges represent conditional dependencies.
• Each node stores a conditional probability table (CPT) giving Pr(node|parents).
Nodes: "Is b-ion observed?" and "b-ion intensity"

                     intensity > 50%    intensity < 50%
no b-ion observed    1.00               0.00
b-ion observed       0.75               0.25
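A two-node network of this kind can be written down directly as lookup tables. The sketch below is illustrative only: the probabilities are invented and are not the values on the slide, and the names `p_observed`, `p_intensity_given_observed`, and `joint` are hypothetical.

```python
# Hypothetical CPTs for a two-node network: observed -> intensity.
# These numbers are made up for illustration.
p_observed = {True: 0.6, False: 0.4}
p_intensity_given_observed = {
    True: {"high": 0.3, "low": 0.7},
    False: {"high": 0.1, "low": 0.9},
}

def joint(observed, intensity):
    """Pr(observed, intensity) = Pr(observed) * Pr(intensity | observed)."""
    return p_observed[observed] * p_intensity_given_observed[observed][intensity]

# The joint distribution over all four configurations sums to one.
total = sum(joint(o, i) for o in (True, False) for i in ("high", "low"))
```

Each row of a CPT is a distribution over the child's values for one setting of its parents, so every row must sum to one.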
Ion series modeled in a Markov chain
[Figure: a chain of five node pairs, each consisting of an "Is b-ion observed?" node and a "b-ion intensity" node]
~ PepHMM (Han et al., 2005).
A more realistic model
[Figure nodes: Is b-ion observed?; b-ion intensity; N-term AA; C-term AA; Is ion detectable?; fractional m/z; Is proton mobile?]
Ion series modeled in a Markov chain
$$\mathrm{LOR}_b = \log \frac{\Pr(\text{b-ions}, \text{peptide} \mid \text{model})}{\Pr(\text{b-ions}, \text{peptide} \mid \text{null model})}$$
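A small sketch of how a log-odds ratio like the one on this slide might be computed. It assumes the probability under each model factorizes into per-position terms, as it would along a Markov chain; the function name and the numbers are hypothetical.

```python
import math

def log_odds_ratio(p_model, p_null):
    """Sum of log Pr(term | model) - log Pr(term | null model) over all terms.

    p_model and p_null hold the probabilities of the same observations
    under the trained model and the null model, respectively.
    """
    return sum(math.log(pm) - math.log(pn) for pm, pn in zip(p_model, p_null))

# Two positions, each twice as likely under the model as under the null.
lor_b = log_odds_ratio([0.8, 0.5], [0.4, 0.25])
```

A positive value means the observed ions are better explained by the fragmentation model than by the null model.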
Vectors of log-odds ratios
[Figure labels: correct peptide-spectrum matches; incorrect peptide-spectrum matches; binary classifier]
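The slide does not name the classifier. As a stand-in, the sketch below trains a plain logistic regression by gradient descent on toy log-odds vectors; the choice of classifier, the data, and the function names are all assumptions made for illustration.

```python
import math

def train_logistic(vectors, labels, steps=500, rate=0.5):
    """Fit logistic regression weights by batch gradient descent."""
    dim = len(vectors[0])
    n = len(vectors)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err
            for d in range(dim):
                grad_w[d] += err * x[d]
        bias -= rate * grad_b / n
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict(weights, bias, x):
    """True if the vector is classified as a correct PSM."""
    return bias + sum(w * v for w, v in zip(weights, x)) > 0

# Toy log-odds vectors (one entry per hypothetical ion series).
correct = [[2.0, 1.5], [1.8, 2.2], [2.5, 1.0]]
incorrect = [[-1.0, -0.5], [-2.0, -1.2], [-0.5, -1.5]]
weights, bias = train_logistic(correct + incorrect, [1, 1, 1, 0, 0, 0])
predictions = [predict(weights, bias, x) for x in correct + incorrect]
```

Any binary classifier that accepts fixed-length real-valued vectors could take the place of logistic regression here.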
Model Evaluation: Accuracy
Model        Redundant TP/FP    Unique TP/FP
Bayes Net    285/300, 95%       137/144, 95.1%
SEQUEST      288/300, 96%       136/144, 94.4%
InsPecT      274/300, 91.3%     131/144, 90.9%
An incorrect identification
SEQUEST: LRPGAELLEGAHVGNFVEMK
Bayes net: HQDETQDALNALDLLTNEK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2
This peptide does not appear in E. coli, the organism from which this protein sample was derived.
Co-eluting peptides
SEQUEST: AFPEAVLFIHPLDAK
Bayes net: DVFVHFSALQGNQFK
Blue = b and y, green = a, red = ammonia loss, magenta = water loss, sienna = +2
Future directions
• Build a single Bayesian network that includes all ion types.
• Produce more descriptive outputs from the Bayesian network for input to the classifier.
• Add more biophysical details to the model: chromatography retention time, a better mass-to-charge estimate, etc.
• Generate a better (larger, more accurate) gold standard data set.
Acknowledgments
• DNase I hypersensitivity
  – John Stamatoyannopoulos
  – Pete Sabo
  – Scott Kuehn
  – many others in the Stam lab
• Wavelet analysis: Bob Thurman
• Change point model
  – Charles Lawrence
  – Heng Lian
  – William Thompson
• Mass spectrometry
  – Aaron Klammer
  – Jeff Bilmes
  – Sheila Reynolds
  – Michael MacCoss