Post on 21-Dec-2015
Noncoding RNA Genes, Pt. 2: SCFGs
CS374
Vincent Dorie
Motivation
Noncoding RNA genes can be anywhere
Noncoding RNA genes can do anything
Location
rRNA, snRNA
Exons?
Introns
Viral vectors
Function
Function, pt. 2
Overview
“RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003)
“Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)
Comparison - Methodology
RSEARCH DART (Stemloc)
Sequence
Comparison, Pt. 2 - Uses
RSEARCH: Find parts of a genome which may be homologous to the query sequence
More practical in comparative genomics
DART (Stemloc): Investigate a specific sequence suspected of being homologous to the query sequence
Comparison, Pt. 3 - Complexity
RSEARCH: O((M - B)LD + BLD^2) to scan, O(M^4) to calculate statistics
DART (Stemloc): Between O(LM) and O(L^3 M^3)
Background: Context Free Grammars
Four-tuple {N, T, S, P}
N is a set of nonterminals
T is a set of terminals
S is the start symbol, S ∈ N
P is a set of productions
Context Free Grammars, pt. 2: Sample Grammar
N = {S, A, B}
T = {a, u, c, g}
P = {
S -> A | B,
A -> aAc | aBc | g,
B -> g
}
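Context Free Grammars, pt. 3: Parse Trees
Parse: aagcc
[Figure: parse tree for aagcc. S -> A; the outer A emits the outer a...c pair (A -> aAc); the inner A emits the inner a...c pair (A -> aBc); B -> g.]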
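The sample grammar is small enough to check by brute force. The following is a minimal sketch, assuming the productions exactly as listed on the sample-grammar slide; the names GRAMMAR and count_parses are illustrative and not from the talk. It counts the distinct parse trees of a string, and it suggests the grammar is ambiguous: the g in aagcc can come from A -> g or from B -> g.

```python
from functools import lru_cache

# Productions from the sample-grammar slide; each right-hand side is a tuple of symbols.
GRAMMAR = {
    "S": [("A",), ("B",)],
    "A": [("a", "A", "c"), ("a", "B", "c"), ("g",)],
    "B": [("g",)],
}


def count_parses(text, start="S"):
    """Count the distinct parse trees that derive `text` from `start`.

    Assumes the grammar has no empty productions and no cycles of
    unit productions (true for the sample grammar).
    """

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of ways `symbol` derives text[i:j].
        if symbol not in GRAMMAR:  # terminal symbol
            return 1 if j - i == 1 and text[i] == symbol else 0
        return sum(match(rhs, i, j) for rhs in GRAMMAR[symbol])

    def match(rhs, i, j):
        # Number of ways the symbol sequence `rhs` derives text[i:j].
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        total = 0
        # The first symbol covers text[i:k]; every remaining symbol
        # needs at least one character, which bounds k from above.
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * match(rhs[1:], k, j)
        return total

    return derive(start, 0, len(text))


print(count_parses("aagcc"))
```

A stochastic grammar resolves this kind of ambiguity by scoring each parse, which is the subject of the next slide.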
Stochastic CFG
Each production associated with a probability
Probabilities for all productions starting from a given nonterminal sum to one
Superset of HMM
Assigns a probability to a parse
E.g. S -> A, 0.3 | B, 0.7
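The probability of a parse is the product of the probabilities of the productions it uses. A minimal sketch follows: the S probabilities (0.3 and 0.7) are the ones on the slide, while the probabilities for A are assumed values chosen only so that they sum to one. RULE_PROBS, check_normalized and parse_probability are illustrative names.

```python
import math

RULE_PROBS = {
    "S": {"A": 0.3, "B": 0.7},                # from the slide
    "A": {"aAc": 0.5, "aBc": 0.3, "g": 0.2},  # assumed values, not from the slide
    "B": {"g": 1.0},                          # only one production, so it must be 1
}


def check_normalized(rule_probs):
    """Raise if the productions of any nonterminal do not sum to one."""
    for nonterminal, rules in rule_probs.items():
        if not math.isclose(sum(rules.values()), 1.0):
            raise ValueError(f"productions of {nonterminal} do not sum to 1")


def parse_probability(derivation, rule_probs):
    """Multiply the probabilities of the productions used in a parse."""
    return math.prod(rule_probs[lhs][rhs] for lhs, rhs in derivation)


check_normalized(RULE_PROBS)

# One parse of aagcc: S -> A, A -> aAc, A -> aBc, B -> g
parse = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]
print(parse_probability(parse, RULE_PROBS))
```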
Pairwise (profile) SCFG
Terminals in each production can exist in each of two strings
E.g. W -> x_i y_k V x_j y_l
RSEARCH: pSCFG Simplified
Each secondary structure specifies (most of) a grammar, creating a “Model Architecture”
Eschews probabilistic interpretation
Problem becomes fitting target to model architecture
Sequence
Node Types vs. Node States
Node types are what we want to do given the model (e.g. MATP is a match pair)
Node states represent what happens when scanning a target sequence
E.g. the node type is MATP but the target sequence doesn't have a pair in that location -> insert a gap
Node States
Set of node states possible for node type
Gap Classes
Gap class per node type/state pair
Transition Scores
Gap class determines transition scores
Gap penalties are affine
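An affine penalty charges one amount to open a gap and a smaller amount for each further position. A minimal sketch, assuming the common convention that the open penalty covers the first gapped position; conventions differ between tools, and the function name and numeric values below are illustrative, not RSEARCH's.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for a gap of `length` positions under an affine model.

    Assumed convention: `gap_open` pays for the first position and
    `gap_extend` for each additional one.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One long gap costs less than several short gaps of the same total length.
one_gap = affine_gap_penalty(4, gap_open=10.0, gap_extend=2.0)
four_gaps = 4 * affine_gap_penalty(1, gap_open=10.0, gap_extend=2.0)
print(one_gap, four_gaps)
```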
Emission Scores
Emission scores determined empirically
Parameterizing the Model: Emission Scores
      AA      AU      AC      AG      UA      …
AA    sAAAA   sAAAU   sAAAC   sAAAG   sAAUA   …
AU    -       sAUAU   sAUAC   sAUAG   sAUUA   …
AC    -       -       sACAC   sACAG   sACUA   …
AG    -       -       -       sAGAG   sAGUA   …
UA    -       -       -       -       sUAUA   …
…     …       …       …       …       …       …
Substitution Matrices
s_{ij} = \log_2 \frac{f_{ij}}{g_i g_j}
     A     U     C     G
A    sAA   sAU   sAC   sAG
U    -     sUU   sUC   sUG
C    -     -     sCC   sCG
G    -     -     -     sGG
s_{ijkl} = \log_2 \frac{f_{ijkl}}{g_i g_j g_k g_l}
Scores are log-odds ratios: observed frequency / frequency expected at random
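Both scores have the same shape: the observed frequency divided by the product of the background frequencies, on a log2 scale. A minimal sketch with made-up frequencies (not RIBOSUM values); log_odds_score is an illustrative name.

```python
import math


def log_odds_score(observed, background):
    """log2 of the observed frequency over the frequency expected at random.

    `background` holds one frequency per residue: two for a single-base
    substitution (g_i, g_j), four for a base-pair substitution
    (g_i, g_j, g_k, g_l).
    """
    expected = math.prod(background)
    return math.log2(observed / expected)


# Hypothetical frequencies, chosen so the arithmetic is easy to follow.
print(log_odds_score(0.125, [0.25, 0.25]))            # seen twice as often as chance
print(log_odds_score(0.0625, [0.25, 0.25]))           # seen exactly as often as chance
print(log_odds_score(1 / 64, [0.25, 0.25, 0.25, 0.25]))  # pair seen 4x as often as chance
```

A positive score means the substitution is seen more often than chance would predict, zero means it is seen exactly as often, and a negative score means it is seen less often.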
RIBOSUM Matrices
Start with MSA
Whose MSA?
RIBOSUM[X, Y]
Sequences X% identical are reweighted to sum to 1
Only sequences Y% identical are counted in making matrices
Model Parameters
Gap open penalty (single and pair)
Gap extension penalty (single and pair)
Internal start penalty
Internal end penalty
Solution
Guess and check
“We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”
Digression: Biostatistics
Confidence intervals
Expectation values
Gumbel Distribution
Parameterized by λ and K
E = K N e^{-λx}, P = 1 - e^{-E}
Gumbel Distribution, pt. 2
K and λ depend on G+C content of target database
For a database with heterogeneous G+C content, compute K and λ for G+C bins
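The two Gumbel formulas translate directly into code. A minimal sketch, writing λ for the parameter paired with K and treating N as the search-space size (an assumption; the slide does not define N). The function names and the numeric values are hypothetical, not fitted RSEARCH parameters.

```python
import math


def e_value(score, k, lam, n):
    """Expected number of chance hits scoring at least `score`: E = K N e^(-lambda x)."""
    return k * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - e^(-E).

    expm1 keeps precision when E is very small, where P is close to E.
    """
    return -math.expm1(-e)


# Hypothetical parameters for illustration only.
e = e_value(score=20.0, k=0.1, lam=0.5, n=1000)
print(e, p_value(e))
```

With per-bin parameters, the same functions would be called with the K and λ of the G+C bin that the hit falls into.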
Putting it All Together
Run against database substrings of length two times the query
Greedily take K best, non-overlapping hits
Recover alignments
Report: score, position in database, alignment, E-value, P-value
Statistics need to be calculated for every query and target database
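The greedy step can be sketched as follows. This is an illustration of the idea rather than RSEARCH's own code: hits are assumed to be (score, start, end) tuples over half-open database intervals, and the function name and sample hits are hypothetical.

```python
def greedy_non_overlapping(hits, k):
    """Pick up to `k` hits in descending score order, skipping any hit
    that overlaps one already chosen.

    Each hit is (score, start, end) with a half-open interval [start, end).
    """
    if k <= 0:
        return []
    chosen = []
    for score, start, end in sorted(hits, key=lambda hit: hit[0], reverse=True):
        # Two half-open intervals are disjoint when one ends before the other starts.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append((score, start, end))
            if len(chosen) == k:
                break
    return chosen


# Hypothetical hits: the second-best hit overlaps the best one and is dropped.
hits = [(50, 0, 100), (45, 50, 150), (30, 100, 200), (10, 300, 400)]
print(greedy_non_overlapping(hits, 2))
```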
Time
For a 113 nt sequence against a 2.1 * 10^7 nt database, 2.9 CPU days, 2% of it computing statistics
For a 330 nt sequence against a 2.1 * 10^7 nt database, 38 CPU days, 7% of it computing statistics
Parallelized to 33 minutes and 7.4 hours, respectively
Shifting Gears: Fold Envelopes
Pre-enumerates the pSCFG search space
Presents conditional versions of dynamic programming algorithms
User-defined complexity
Fold Envelopes, pt. 2
Conceptualize search over grammars and parse trees
Each node in tree accounts for subsequence
[Figure: parse tree node W_u. The subtree rooted at W_u accounts for the inside sequence, X_i..j; the rest of the tree accounts for the outside sequence, X_0..i and X_j..L.]
Analogy: Message Passing
Inside algorithm: likelihood of sequence over all possible parses
Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence
Inside-Outside algorithm: expected number of times each grammar production is used
Use fold envelopes to limit messages by restricting subsequences considered
The Inside Algorithm
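To compute
a(i, j, V) = P(x_i … x_j is produced by V)
a(i, j, V) = \sum_{X, Y} \sum_{k=i}^{j-1} a(i, k, X) \, a(k+1, j, Y) \, P(V \to XY)
[Figure: V spans x_i … x_j and splits into X, spanning x_i … x_k, and Y, spanning x_{k+1} … x_j. Credit: Batzoglou]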
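The inside recursion fills a table over ever longer subsequences. A minimal sketch for a grammar in Chomsky normal form, assuming the usual base case a(i, i, V) = P(V -> x_i), which the slide does not show. The toy grammar (S -> S S with 0.4, S -> a with 0.6) is hypothetical, and no fold envelope is applied, so every subsequence is considered.

```python
from collections import defaultdict


def inside(seq, binary_rules, terminal_rules):
    """Inside table for an SCFG in Chomsky normal form.

    binary_rules maps (V, X, Y) to P(V -> X Y);
    terminal_rules maps (V, t) to P(V -> t).
    Returns a dict with a[i, j, V] = P(x_i ... x_j is produced by V),
    using 0-based, inclusive indices.
    """
    n = len(seq)
    a = defaultdict(float)
    # Base case (assumed): single characters come from terminal productions.
    for i, symbol in enumerate(seq):
        for (v, t), p in terminal_rules.items():
            if t == symbol:
                a[i, i, v] += p
    # Recursion: sum over productions V -> X Y and over split points k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, x, y), p in binary_rules.items():
                total = sum(a[i, k, x] * a[k + 1, j, y] for k in range(i, j))
                a[i, j, v] += p * total
    return a


# Hypothetical toy grammar: S -> S S (0.4) | a (0.6)
binary = {("S", "S", "S"): 0.4}
terminal = {("S", "a"): 0.6}
table = inside("aaa", binary, terminal)
print(table[0, 2, "S"])
```

A fold envelope would restrict which (i, j) pairs the two outer loops visit, which is where the savings come from.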
Constructing Fold Envelopes
Constrain to possible secondary structures
Constrain to primary sequence alignment
Summary
RSEARCH finds a set of possible homologs, sorted by score and statistics
Fold Envelopes permit greater search depth in case of unfolded comparisons
RSEARCH employs simplified pSCFGs
Fold Envelopes are useful over the full spectrum of comparisons but represent more computationally complex situations