ALGORITHMS FOR ANALYSIS OF MULTIPLE BIOLOGICAL
SEQUENCES: THEORY AND PRACTICE
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Eugene Davydov
December 2009
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/fk368by4307
© 2010 by Eugene V Davydov. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
David Dill
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Arend Sidow
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Availability of massive amounts of genomic data from hundreds of species has introduced many challenging computational problems as well as the need for efficient algorithmic tools that leverage multiple species information to facilitate biological analysis. This dissertation discusses two such problems: noncoding RNA multiple structural alignment and constrained element detection.
Noncoding RNA genes (ncRNAs) are regions of the genome that are transcribed but not translated into protein, and fold directly into secondary and tertiary structures which can have a variety of important biological functions. Because their function depends closely on the secondary structure, ncRNAs often do not exhibit enough primary sequence conservation to be properly aligned using standard sequence-based methods. I therefore consider the problem of RNA multiple structural alignment, i.e., performing sequence alignment and secondary structure prediction simultaneously. In the first part of this dissertation I introduce a novel graph theoretic framework for analyzing this problem and prove that when the number of sequences is not fixed it is NP-complete. I also provide a polynomial time algorithm that approximates the optimal solution to within a factor of $O(\log^2 n)$.
Constrained elements are regions of the human genome exhibiting evidence of purifying selection and therefore biological function. Computational identification of such elements is one of the major goals of comparative genomics. In the second part of this dissertation I present GERP++, a new tool for efficient constrained element detection that significantly improves on one of the current leading methods, GERP. While retaining GERP’s biological transparency and metric for quantifying position-specific constraint, GERP++ uses a more rigorous method for computing evolutionary rates and a novel algorithm for element identification that uses statistical significance directly to evaluate and rank candidate elements. These algorithmic improvements decrease the running time by several orders of magnitude in practice, enabling high-throughput analysis of large data sets. Furthermore, I present analysis and biological interpretation of constrained elements identified by GERP++ in the human genome from recently available multiple species alignments.
Acknowledgments
First and foremost, I would like to thank my advisor, Serafim Batzoglou, for his guidance,
patience, encouragement, and advice throughout my graduate career. I would also like to
thank all my co-authors and collaborators, especially Arend Sidow for all his help, advice,
and incredible insights he has shared with me during my work on the GERP++ project;
David Dill, for taking his valuable time to be on my dissertation reading committee and my
oral examination; and Atul Butte and Vijay Pande for their thought-provoking questions
during the aforementioned exam.
To all past and present members of the Batzoglou lab: it was a great pleasure working
and interacting with all of you, and I am grateful to Stanford University for the opportunity
to be around and learn from so many brilliant people, both students and professors. In
particular, I owe a great debt to Marina Sirota, George Asimenos, and Tom Do for the
countless times they’ve helped me and everything they’ve taught me.
Last, but certainly not least, I would like to thank my family: my brother Konstantin
and my parents Vladimir and Irina, for all their patience and support throughout the years.
Contents
Abstract
Acknowledgments
1 Introduction
2 RNA Structural Alignment
2.1 Background
2.2 A Graph Theoretic Formulation
2.3 Complexity Analysis of MAX-NLS
2.4 Approximating MAX-NLS with MAX-FLS
2.5 Approximating MAX-FLS with MAX-LLS
2.6 A Polynomial-Time Algorithm for MAX-LLS
2.7 Discussion
3 Constrained Element Detection with GERP++
3.1 Background
3.2 Results
3.2.1 Overview of the Algorithm
3.2.2 Constraint in the Human Genome
3.2.3 Estimating Detectable Constraint
3.2.4 Association of CEs with Known Functional Elements
3.2.5 Comparison with PhastCons
3.3 Methods
3.3.1 Estimation of Evolutionary Rates and RS Scores
3.3.2 Computation of p-values and Element Prediction
3.3.3 Overview of the Data
3.4 Discussion
Bibliography
List of Tables
3.1 Fraction of Functional Regions Covered by GERP++ Constrained Elements
3.2 Mean 3-Periodicity Bias for Different Types of Regions
List of Figures
2.1 Linear Graph Representation of RNA Sequences
2.2 Largest Common Nested Linear Subgraph
2.3 Thick Edges in Nested Linear Graphs
2.4 Overview of Reduction from 3-SAT to D-NLS
2.5 Tree Representation of Nested Linear Graph
2.6 Upper Bound on Tree Size as Function of Flat Order
2.7 Largest Possible Trees with Flat Order at Most 3
2.8 Trees Attaining Asymptotic Upper Bound Between Size and Flat Order
2.9 Level Graphs as Points on Level Signature
2.10 Level Signature and Largest Level Subgraph
2.11 Algorithm for MAX-LLS
3.1 Overview of GERP++
3.2 GERP++ Constrained Element Length Distribution
3.3 Per-chromosome Constraint Intensity
3.4 Estimating Detectable Constraint
3.5 Relationship Between Constrained and Known Functional Elements
3.6 3-Periodicity Bias Distributions for Different Element Types
3.7 GERP++ vs PhastCons Predictions
3.8 Constrained Exon on Human Chromosome 3
3.9 Mammalian Phylogenetic Tree
Chapter 1
Introduction
While proteins have long been known to play a major role in many cellular processes, it is deoxyribonucleic acid (DNA) that acts as the carrier of genetic information in all living organisms. DNA is a polymer of the nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T), and the specific nucleotide sequence is essentially a blueprint for development and function at the cellular level. Understanding exactly how this occurs is a major goal of modern genetics. The central dogma of molecular biology (40) states that DNA, in addition to being replicated from generation to generation, can be transcribed into ribonucleic acid (RNA), which can then be translated into the proteins that perform most biological functions in the cell. Regions of DNA that are transcribed and translated in this way are called genes, which in complex organisms consist of one or more exons, pieces which are actually translated into amino acids that make up the protein, interspersed with introns, which are spliced out during RNA processing; the translated region is flanked on both sides by untranslated regions (UTRs). Recent research, however, has revealed this to be a gross oversimplification. Since all cells in our body share the same DNA, how do some become brain neurons while others become muscle or vascular tissue? The key is precise regulation of protein synthesis at both the transcription and translation levels. Promoters, enhancers, silencers, and binding sites for transcription factors are all important components of this regulation. In addition, some genes are transcribed into RNA but never translated into protein, instead performing structural, regulatory, and catalytic functions in a folded RNA state.
The sequencing of the human genome (41) has created an unprecedented opportunity
for high-throughput computational analysis of the 3 billion letter long sequence in order to
identify and annotate important functional elements. Yet aside from the genetic code, a long
known mapping from DNA triplets called codons to amino acids that make up proteins, the
DNA sequence alone is surprisingly difficult to decipher without additional information.
That information, however, is becoming available at a rapidly growing pace.
DNA occasionally mutates during replication, and these mutations have a chance of proliferating to the entire population, a process called fixation. These mutations may be harmful, helpful, or neutral in terms of the reproductive fitness of the organism. Mutations in nonfunctional regions are likely to be neutral, while those inside genes or regulatory elements tend to mostly disrupt biological function, and are thus selected against in the course of evolution. Therefore comparing DNA sequences of related organisms and identifying conserved regions can help guide the search for biological function. This insight is the cornerstone of comparative genomics.
Hundreds of genomes have already been sequenced, including dozens of mammals. As more efficient and cheaper methods for obtaining sequences become available, the amount of genomic data will continue to grow, as will the need for effective computational tools to analyze it. For example, a fundamental problem in comparative genomics is sequence alignment, i.e., arranging two or more sequences in a way that reflects the minimum edit distance between them and represents a hypothesis about common origin of individual nucleotides and larger regions. While dynamic programming algorithms for minimizing string edit distance in this particular context have existed for decades (42; 43), practical applications have necessitated the development of new sequence alignment tools that use heuristics such as anchored and progressive alignment in order to greatly reduce the computational cost of aligning multiple whole genomes.
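As a point of reference for the dynamic programming idea mentioned above, the following is a minimal sketch of unit-cost edit distance between two strings. The function name `edit_distance` and the unit costs for insertion, deletion, and substitution are assumptions made for illustration; the cited alignment algorithms use more general scoring schemes.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions turning `a` into `b` (unit costs are an assumption)."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

The table has one cell per pair of prefixes, so the running time is proportional to the product of the two sequence lengths, which is what makes whole-genome inputs costly without heuristics.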
In order to motivate the design of such tools, it is important to understand the nature of the computational problem at the level of algorithm complexity: some problems may not be tractable no matter how fast or cheap computer hardware gets, and may need to be reformulated or solved approximately. Equally important is the development of practical analytical tools that can be used to answer a specific biological question. In this dissertation, I describe my contributions to two biologically important problems: RNA multiple structural alignment, and constrained element detection. The remainder of this thesis is organized as follows. In Chapter 2, I present a novel graph theoretic framework for analyzing the computational complexity of simultaneous alignment and folding of RNA sequences, showing the problem to be computationally intractable (NP-Complete) in the stated formulation, and present an approximation algorithm with a provable error bound. In Chapter 3, I introduce a new tool, GERP++, for the problem of constrained element detection, i.e., quantifying intensity of conservation and annotating significantly constrained regions within a multiple sequence alignment. This method relies on an improved evolutionary rate estimation procedure as well as a novel dynamic programming algorithm that directly assesses statistical significance for a richer set of candidate elements than previous approaches. I discuss the results of GERP++ analysis of recently generated alignments of the entire human genome to 33 other mammalian species. Although no true gold standard is available due to the open-ended nature of the problem and lack of exhaustive annotations, GERP++ predicts more constrained positions at lower false positive rates and shows better correspondence with known functional elements.
Chapter 2
A Computational Model for RNA
Multiple Structural Alignment
This chapter addresses the problem of aligning multiple sequences of non-coding RNA genes. I approach this problem with the biologically motivated paradigm that scoring of ncRNA alignments should be based primarily on secondary structure rather than nucleotide conservation. I introduce a novel graph theoretic model (NLG) for analyzing algorithms based on this approach, prove that the RNA multiple alignment problem is NP-Complete in this model, and present a polynomial time algorithm that approximates the optimal structure of size $S$ within a factor of $O(\log^2 S)$.
2.1 Background
Noncoding RNA (ncRNA) genes are among the biologically active features in genomic DNA. They are polymers of four nucleotides: A (adenine), C (cytosine), G (guanine), and U (uracil). Unlike regular genes, ncRNAs are not translated into protein, but rather fold directly into secondary and tertiary structures, which can have a variety of structural, catalytic, and regulatory functions (6).
The structural stability and function of ncRNA genes are largely determined by the formation of stable secondary structures through complementarity of nucleotides, whereby G-C, A-U, and G-U form hydrogen bonds that are energetically favored. This secondary structure can be predicted from the nucleotide sequence as one minimizing (some approximation of) the free energy (18; 19), which is largely determined by the formation of the hydrogen bonds. In ncRNAs, such bonds almost always occur in a nested fashion, which allows the optimal structure to be computed in time $O(n^3)$ in the length $n$ of the input sequence using a dynamic programming approach (13; 11). Algorithms that do not assume a nested structure are even more computationally expensive (15). However, the stability of ncRNA secondary structures is not sufficiently different from the predicted stability of random genomic fragments to yield a discernible statistical signal (16), limiting the application of current ncRNA detection methods to the simplest and best understood structures, such as tRNAs (12).
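The cubic-time dynamic program for nested structures can be illustrated with a simplified objective. The sketch below maximizes the number of nested base pairs rather than minimizing free energy, which is a deliberate simplification and not the energy model of the cited methods. The names `PAIRS` and `max_nested_pairs` are illustrative, each nucleotide is assumed to pair at most once, and no minimum hairpin loop length is enforced.

```python
# Complementary pairs named in the text: G-C, A-U, and G-U (both orientations).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_nested_pairs(seq: str) -> int:
    """Largest number of non-crossing base pairs in `seq` (simplified objective)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of nested pairs within seq[i..j] inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j):   # case 2: position j pairs with position k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + inner + 1)
            best[i][j] = score
    return best[0][n - 1]
```

The three nested loops over `span`, `i`, and `k` give the $O(n^3)$ running time; an energy-based method would replace the `+ 1` with a structure-dependent score.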
One of the most promising ways of detecting ncRNA genes and predicting reliable secondary structures for them is comparative sequence analysis. During the course of genome evolution, mutations that occur in functional regions of the genome tend to be deleterious, and therefore unlikely to fix, while mutations that occur in non-functional regions tend to be neutral and accumulate. As a result, functional regions of the genome tend to exhibit significant sequence similarity across related genomes, whereas regions that are not functional are usually much less conserved. This difference in the rate of sequence conservation is used as a powerful signal for detecting protein-coding genes (1; 2) and regulatory sites (14; 10), and could be applied to ncRNA genes. However, their function is largely determined by their secondary structure, which in turn is determined by nucleotide complementarity: RNA genes across different species are similar in the pattern of nucleotide complementarity rather than in the genomic sequence. As a result, conventional sequence alignment methods are not able to properly align ncRNAs (7).
One biologically meaningful approach to ncRNA multiple alignment is finding the largest secondary structure common to all the sequences, lining up the nucleotides forming this structure, and then aligning corresponding leftover pieces as one would align genomic sequences which have no evolutionary pressure favoring complementary substitution. However, this approach has never been applied in practice because the task of finding the largest common secondary structure among several sequences is computationally challenging: the straightforward extension of the dynamic programming algorithm using stochastic context-free grammars (SCFGs) has a running time of $O(n^{3k})$, where $k$ is the number of sequences being aligned, which is prohibitive even for two sequences of moderate length (9).
The problem of aligning multiple DNA sequences has been proven to be NP-Complete for certain scoring schemes and metrics (17; 3). However, when analyzing the computational complexity of ncRNA multiple alignment, it is more relevant to focus on the complexity of finding the largest common secondary structure, because for most biologically meaningful ncRNAs the remaining pieces should be relatively short and easy to align.
In this chapter we introduce a novel theoretical framework for analyzing the problem of ncRNA multiple structural alignment. We present the Nested Linear Graph (NLG) model and formulate the problem of computing the largest common secondary structure in this model in terms of finding the largest common nested subgraph. We then prove this problem to be NP-Complete, and present a polynomial-time algorithm which approximates the optimal solution within a factor of $O(\log^2 S)$, where $S$ is the size of the optimal solution. We conclude with a discussion of the NLG model in general and our algorithm and results in particular.
Figure 2.1: A linear graph representation of the RNA sequence UACGUG. The nucleotides are represented by points on a line in the same order as in the sequence. Each edge is represented by an arc to emphasize that it does not pass through the nodes between its two endpoints. Edges are drawn between nodes representing complementary nucleotide pairs A-U, C-G, and G-U.
2.2 A Graph Theoretic Formulation
A linear graph is a graph whose vertices, $V$, are points on some line $L$. Genomic sequences naturally lend themselves to linear graph representations, because each of their nucleotides can correspond to a point, and the sequence can correspond to the line. For modeling ncRNA folding and secondary structure, we form the linear graph with edges connecting pairs of vertices that represent complementary nucleotide pairs (A-U, C-G, and G-U). A typical linear graph induced by an RNA sequence is shown in Fig 2.1.
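The construction of a linear graph from a sequence can be sketched as follows. Vertices are taken to be 0-based positions in the sequence, and the names `COMPLEMENTARY` and `linear_graph` are illustrative choices rather than notation from the text. Following the description above, every complementary pair is connected, including adjacent positions.

```python
from itertools import combinations

# Unordered complementary pairs named in the text.
COMPLEMENTARY = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def linear_graph(seq: str) -> list[tuple[int, int]]:
    """Edges (i, j), i < j, joining positions that hold complementary nucleotides."""
    return [
        (i, j)
        for i, j in combinations(range(len(seq)), 2)
        if frozenset((seq[i], seq[j])) in COMPLEMENTARY
    ]
```

For the sequence UACGUG of Fig 2.1, this yields eight edges under the stated pairing rules.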
Two edges $ab$ and $cd$ of a linear graph intersect if exactly one of $c$ and $d$ lies on the line segment $ab$ (and vice versa). A linear graph is nested if no two edges of the graph intersect each other. For a linear graph derived from an RNA sequence, a nested subgraph represents a plausible fold of that sequence. Thus, in the NLG model, the problem of finding the largest secondary structure of an ncRNA is precisely the problem of finding the largest nested subgraph in the linear graph derived from the sequence. For multiple ncRNA alignment, where we seek the largest common secondary structure, the appropriate NLG formulation is finding the largest common nested linear subgraph (MAX-NLS) among the linear graphs induced by the sequences (Fig 2.2). We now formulate this problem precisely and formally analyze its computational complexity.
Figure 2.2: The MAX-NLS of several linear graphs; its edges have been emphasized in bold to distinguish them from the edges of the original linear graphs. Note that the MAX-NLS is not necessarily unique, but its size is.
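The definitions of intersecting edges and nested graphs translate directly into two small predicates. In this sketch edges are pairs of integer positions, and the names `edges_intersect` and `is_nested` are illustrative. Edges that merely share an endpoint are treated as non-intersecting, which is an assumption about a boundary case the definition leaves open.

```python
from itertools import combinations


def edges_intersect(e, f) -> bool:
    """True if exactly one endpoint of `f` lies strictly inside `e`."""
    (a, b), (c, d) = sorted(e), sorted(f)
    return a < c < b < d or c < a < d < b


def is_nested(edges) -> bool:
    """True if no two edges of the linear graph intersect."""
    return not any(edges_intersect(e, f) for e, f in combinations(edges, 2))
```

For example, the edges (0, 3) and (1, 4) intersect, while (0, 5), (1, 4), and (2, 3) together form a nested graph.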
2.3 Complexity Analysis of MAX-NLS
Let $G_1, \ldots, G_m$ be the linear graphs derived from ncRNA sequences $S_1, \ldots, S_m$ respectively. The MAX-NLS of these graphs is the largest nested graph $G_c$ such that $G_c$ is a subgraph of $G_i$ for all $i = 1, \ldots, m$. For any problem instance $I = G_1, \ldots, G_m$, we write MAX-NLS($I$) to indicate this $G_c$. To represent the size (number of edges) of this graph, we use the notation $|\text{MAX-NLS}(I)|$.
Note that the MAX-NLS problem represents a slight generalization of the RNA multiple alignment problem, in that we do not constrain the linear graphs to be derived from RNA strings by connecting every pair of nodes corresponding to complementary nucleotides with an edge. We motivate this relaxation in the discussion section of the chapter.
MAX-NLS is an optimization problem, because our objective is to maximize the size of the common nested subgraph of $G_1, \ldots, G_m$. We now formulate the corresponding decision problem, where our objective is to answer a boolean query.
Definition 1. The NLS decision problem (D-NLS) is to determine, given an input $G_1, \ldots, G_m$ and a positive integer $k$ (where $1 < k < \min_i |G_i|$), whether there exists a common nested linear subgraph of $G_1, \ldots, G_m$ of size $\geq k$.
Theorem 1. D-NLS is NP-Complete.
Proof. We proceed by demonstrating a polynomial reduction from 3-SAT, a well-known
NP-Complete problem (4).
Definition 2. Let $x_1, \ldots, x_k$ be boolean variables. Let $\psi_1, \ldots, \psi_n$ be logical clauses, with each clause $\psi_i$ being a disjunction of 3 literals, where each literal is either a variable $x_j$ or its negation, $\neg x_j$. The 3-SAT problem is to determine, given this as input, whether there exists an assignment for the $k$ variables which satisfies all $n$ clauses simultaneously.
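To make the input format of Definition 2 concrete, the following sketch checks satisfiability by exhaustive search. The integer encoding of literals (`+j` for $x_j$, `-j` for $\neg x_j$) and the function name `satisfying_assignment` are assumptions introduced for illustration. The search takes time exponential in the number of variables and is meant only to illustrate the definition.

```python
from itertools import product


def satisfying_assignment(num_vars: int, clauses):
    """Return a tuple of booleans satisfying every clause, or None if none exists.

    Each clause is a tuple of 3 nonzero integers: +j means x_j, -j means NOT x_j.
    """
    for values in product([False, True], repeat=num_vars):
        # A literal l is true when the variable's value matches the literal's sign.
        if all(any(values[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return values
    return None
```

For instance, the clauses $(x_1 \vee x_2 \vee x_3)$ and $(\neg x_1 \vee \neg x_2 \vee \neg x_3)$ are encoded as `[(1, 2, 3), (-1, -2, -3)]`.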
To establish the reduction we need to demonstrate that the existence of a polynomial-time algorithm for D-NLS yields a polynomial-time algorithm for 3-SAT. As such, we show that given any input instance $I_{\text{3-SAT}}$ and a polynomial-time algorithm $A$ for D-NLS, we can construct, in polynomial time and space, an instance $I_{\text{D-NLS}}$ such that computing $A(I_{\text{D-NLS}})$ will allow us to answer whether the instance $I_{\text{3-SAT}}$ is satisfiable. However, to simplify the description of this construction, we must define the notion of a $c$-thick edge (Fig 2.3).
Figure 2.3: A 4-thick edge intersecting a 5-thick edge. Any edge not shown must intersect
either all edges in either stack, or none at all.
Definition 3. In a linear graph, an edge $ab$ contains an edge $cd$ if and only if both $c$ and $d$ lie strictly between $a$ and $b$ on the line. $ab$ directly contains $cd$ whenever $ab$ contains $cd$ and there is no other edge $e$ that contains $cd$ but is itself contained by $ab$.
A $c$-thick “edge” in a linear graph is a set of $c$ edges $e_1, \ldots, e_c$ with the properties that:
(i) for all $i, j$ such that $i > j$, $e_i$ contains $e_j$
(ii) for any other edge $e'$, either $e'$ intersects all $e_i$, or it intersects none of them
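The notions in Definition 3 can be checked mechanically. The sketch below assumes edges are tuples `(a, b)` of integer positions with `a < b`; the function names are illustrative, and `intersects` repeats the strict crossing test so that the block is self-contained.

```python
def contains(outer, inner) -> bool:
    """True if both endpoints of `inner` lie strictly inside `outer`."""
    (a, b), (c, d) = sorted(outer), sorted(inner)
    return a < c and d < b


def directly_contains(outer, inner, edges) -> bool:
    """True if `outer` contains `inner` with no edge of `edges` strictly in between."""
    if not contains(outer, inner):
        return False
    return not any(contains(outer, e) and contains(e, inner) for e in edges)


def intersects(e, f) -> bool:
    """True if exactly one endpoint of `f` lies strictly inside `e`."""
    (a, b), (c, d) = sorted(e), sorted(f)
    return a < c < b < d or c < a < d < b


def is_thick_edge(stack, edges) -> bool:
    """Check properties (i) and (ii) for the candidate thick edge `stack`."""
    ordered = sorted(stack, key=lambda e: e[1] - e[0])  # innermost edge first
    # (i) each edge contains the next smaller one; containment is transitive.
    chain = all(contains(ordered[i + 1], ordered[i]) for i in range(len(ordered) - 1))
    # (ii) every other edge intersects all members of the stack or none of them.
    others = [e for e in edges if e not in ordered]
    uniform = all(len({intersects(e, s) for s in ordered}) <= 1 for e in others)
    return chain and uniform
```

With `stack = [(0, 18), (2, 16), (4, 14)]`, adding the edges (6, 30) and (8, 28) gives two thick edges that intersect each other, in the spirit of Fig 2.3, whereas adding (1, 20) breaks property (ii) because it crosses only the outermost member of the stack.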
We can now describe the construction of $I_{\text{D-NLS}}$ from $I_{\text{3-SAT}}$, as depicted in Fig 2.4. Given the set of variables $x_1, \ldots, x_k$ and the clauses $\psi_1, \ldots, \psi_n$, we construct $k + 1$ linear graphs: one corresponding to each boolean variable in $I_{\text{3-SAT}}$, and an extra graph $x'$ whose purpose will be clarified shortly. Each of the first $k$ graphs consists of two intersecting $c_1$-thick edges, each of which contains a sequence of $n$ similar groups of edges, where each group corresponds to a particular clause $\psi_i$. Such a group is depicted in detail at the bottom of Fig 2.4.
The edge group varies slightly depending on which $x_j$ and $\psi_i$ it corresponds to. The portion common to all such groups consists of three $c_2$-thick edges, none of which intersects or contains another. Beyond these, each group has up to three of the following set of mutually intersecting edges: an edge that contains the first and second $c_2$-thick edges, an edge that contains only the second, and an edge that contains the third, as illustrated in Fig 2.4. An edge is missing from the group only if the corresponding literal in the clause $\psi_i$ is not in agreement with the truth assignment induced by the $c_1$ edge to $x_j$. To be more precise, let $\psi_i = \eta_a x_a \vee \eta_b x_b \vee \eta_c x_c$, where $a \leq b \leq c$, and each $\eta$ is either the identity or negation $\neg$. The edge corresponding to $\eta_a x_a$ is absent if and only if $a = j$ and $\eta_a x_a$ is false under the truth assignment induced by the $c_1$ edge of $x_j$. If $a = b$ and $\eta_a = \neg\eta_b$, the edge is present: a clause containing the disjunction $x_j \vee \neg x_j$ is automatically satisfied, so the edge should exist.
Figure 2.4: Constructing an instance of D-NLS from an instance of 3-SAT. Each variable $x_j$ gives rise to a linear graph $x_j$, which consists of two intersecting $c_1$-thick edges, each of which contains $n$ edge groups corresponding to clauses $\psi_i$. Every clause group consists of 3 $c_2$-thick edges in sequence, as well as up to 3 mutually intersecting selection edges, which are present if $\psi_i$ does not depend on $x_j$, or the truth value induced upon $x_j$ by the label of the $c_1$ edge makes $\psi_i$ TRUE. Finally, $I_{\text{D-NLS}}$ contains an extra linear graph $x'$, consisting of only one $c_1$ edge, which contains the standard collection of $3n$ $c_2$-thick edges, as well as all possible selection edges. The goal of the $x'$ graph is to force an alignment in which, for every other graph $x_j$, $x'$ aligns to either the TRUE or FALSE portion of $x_j$, thus corresponding to a truth assignment to all variables of the 3-SAT problem.
The $(k+1)$st graph, $x'$, consists of only one $c_1$ edge and $n$ clause groups, each of which contains all 3 selector edges in addition to the 3 $c_2$-thick edges. The basic premise of this construction is that if (and only if) there is a satisfying assignment, we will be able to match the $x'$ graph to the corresponding $c_1$ edge in each of the $k$ graphs, and align the $n$ clause groups within. Only because the assignment is satisfying will we be able to align one additional selector edge from every clause, giving us the largest possible common subgraph.
Lemma 1. Let $c_2 = n + 1$ and let $c_1 = 3n^2 + 4n + 1$. Under the scheme described above, the $k + 1$ linear graphs have a common nested subgraph of size $6n^2 + 8n + 1$ if and only if $\psi_1, \ldots, \psi_n$ are satisfiable.
Proof. Suppose the clauses are satisfiable, that is, there exists some assignment to $x_1, \ldots, x_k$ which satisfies them all. We align the $c_1$ edge of the $x'$ graph with the $c_1$ edge of graph $j$ that corresponds to the value of $x_j$ in this truth assignment. We then align the $c_2$ edges to each other. Now consider a particular selector edge in some clause $\psi_i$. Because of the way we aligned the $c_1$ edges, if this edge is absent in any of the half-graphs we selected, it is because its corresponding literal is false in that clause given the truth assignment. However, since we assumed our assignment is satisfying, every clause must have a literal that evaluates to TRUE. The corresponding selector edge must be present in every graph.
We can choose at most one selector edge per clause, since they all intersect each other. Because we can choose one from every clause, we have a total of $c_1 + 3nc_2 + n = 6n^2 + 8n + 1$ edges.
Now suppose we indeed have a common nested subgraph of size $6n^2 + 8n + 1$. As the $c_2$-thick edges comprise a total of $6nc_2$ edges and up to $2n$ selector edges may be chosen simultaneously, we could only have $6n^2 + 8n$ edges without choosing a $c_1$ edge. Thus, we must align a $c_1$ edge, in which case we might as well align the whole stack of them. That leaves $3n^2 + 4n$ edges to be included. Note that each $c_2$ stack contributes more than the selector edges could simultaneously, so we must choose all $3n$ $c_2$ stacks for a total of $3n^2 + 3n$ edges. This leaves $n$ edges to be accounted for, all of which must be selector edges, one from each clause.
Note that the $c_1$ alignment we choose induces a truth assignment to our variables. As we just showed, the size of our alignment implies not only that the $c_1$ and $c_2$ edges are aligned, but also that under this truth assignment, every clause has a selector edge that is present in every graph’s chosen $c_1$ half. In particular, that edge is present in the graph corresponding to its literal, meaning that under this induced truth assignment, the clause is satisfied because the literal is TRUE. Since this applies to all the clauses, $\psi_1, \ldots, \psi_n$ are all satisfied.
The time required for this construction is $O(kn^2)$; thus, we have demonstrated a polynomial reduction to D-NLS from 3-SAT, and D-NLS is NP-Complete.
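The edge counts used in the proof of Lemma 1 can be verified by substituting $c_2 = n + 1$ and $c_1 = 3n^2 + 4n + 1$. The first line is the size reached when one $c_1$ stack, all $3n$ $c_2$ stacks inside it, and one selector edge per clause are aligned; the second is the most that can be collected without any $c_1$ edge, which falls one short of the target.

```latex
\begin{align*}
c_1 + 3n\,c_2 + n &= (3n^2 + 4n + 1) + 3n(n + 1) + n = 6n^2 + 8n + 1, \\
6n\,c_2 + 2n &= 6n(n + 1) + 2n = 6n^2 + 8n < 6n^2 + 8n + 1.
\end{align*}
```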
2.4 Approximating MAX-NLS with MAX-FLS
In view of Theorem 1 there is little hope for a tractable exact algorithm for MAX-NLS. Therefore, we present a polynomial time approximation algorithm that guarantees optimality within a factor of $O(\log^2 S)$, where $S$ is the size of the optimal solution. The polynomial running time is achieved by restricting attention to a subclass of nested linear graphs and finding the optimal member of this restricted subclass. The main tradeoff here is the choice of the restriction: if the subclass is too narrow, our approximation will be poor; if it is too broad, finding the optimal member of the subclass may still be NP-Complete.
The restriction that yields our algorithm is best presented as a composition of two restrictions. First, we consider the subclass of NLGs that are flat.
Definition 4. A branching edge in a nested linear graph is an edge e that contains two
edges e1 and e2, neither of which contains the other. A nested linear graph is flat if it
contains no branching edges. The flat order of a nested linear graph is the size of its
largest flat subgraph.
The optimization problem corresponding to this restriction is that of finding the largest common flat nested linear graph (MAX-FLS). We now show that this restriction yields a solution that is suboptimal by a factor of at most $O(\log S)$.
Theorem 2. Every nested linear graph $G$ with flat order $F_G$ satisfies $|G| \leq F_G \log(F_G)$.
Proof. We begin by introducing the tree representation of nested linear graphs in order to
relate the main notions of our argument to familiar data structures. The basic transformation
is mapping each NLG edge to a node in the tree, as shown in Fig 2.5. We first add an
edge containing the entire NLG, for the sake of uniformity. We then construct the tree by
mapping each edge ei to a tree node ni. A node ni is a parent of another node nj whenever
Figure 2.5: A tree representation of a nested linear graph. Each node in the tree corre-
sponds to an edge in the graph. Node i is an ancestor of node j in the tree if and only
if the corresponding edge i contains j in the graph. For unity of representation, an edge
containing all other edges in the NLG is added so that the result of the transformation is a
tree rather than a forest. This edge and the corresponding root vertex are represented with
dashed lines in the diagram.
its corresponding edge ei directly contains ej (see Definition 3 for the notion of direct
containment). While this transformation is rather elementary, it affords us insights into the
notion of flat order. Noting that the notion of a branching edge in an NLG corresponds
precisely to a branching node in the tree, we observe the following:
(i) When viewed as a subtree, the path from the root to any leaf contains no branching
nodes and is therefore flat. Thus, the flat order FT satisfies FT ≥ h(T ), where h(T )
is the node height of T (number of nodes in the longest root-leaf path).
(ii) Consider any disjoint subtrees of T satisfying the property that nodes in different
subtrees cannot be descendants or ancestors of each other in T . The union of their
flat subtrees will also be flat, as no branching nodes can be introduced by taking the
union of flat constituents that have no ancestor relationships amongst one another.
Consequently, for any split node in the tree, the sum of the flat orders of its subtrees
is ≤ FT .
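Observations (i) and (ii) suggest a simple recursion for the flat order of a tree, sketched below in Python. The recursion F(v) = max(h(v), sum of F over the children of v) is our reading of the two observations rather than a formula stated in the text. The encoding of edges as (left, right) pairs and all function names are illustrative assumptions, and the auxiliary root edge is omitted, so the result is the flat order of the graph itself.

```python
def build_forest(edges):
    """Group nested edges (left, right) into a containment forest.

    Assumes the edges are pairwise nested or disjoint and share no endpoints.
    """
    order = sorted(edges, key=lambda e: (e[0], -e[1]))
    children = {e: [] for e in order}
    roots, stack = [], []
    for e in order:
        # Pop edges that end before e starts: they cannot contain e.
        while stack and stack[-1][1] < e[0]:
            stack.pop()
        (children[stack[-1]] if stack else roots).append(e)
        stack.append(e)
    return roots, children


def height_and_flat_order(node, children):
    """Return (node height, flat order) of the subtree rooted at node."""
    height, flat_sum = 0, 0
    for child in children[node]:
        h, f = height_and_flat_order(child, children)
        height = max(height, h)
        flat_sum += f
    height += 1
    # Either keep this node (then only a chain below it can be kept), or
    # drop it and take the union of flat subgraphs of its subtrees.
    return height, max(height, flat_sum)


def flat_order(edges):
    roots, children = build_forest(edges)
    return sum(height_and_flat_order(r, children)[1] for r in roots)


# Edge (1, 10) is branching: it contains (2, 5) and (6, 9), which are disjoint.
print(flat_order([(1, 10), (2, 5), (3, 4), (6, 9)]))  # 3
```

In the example the whole graph is not flat, but dropping any one of (1, 10), (6, 9), or (3, 4) leaves a flat subgraph of three edges.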
We now examine an arbitrary tree $T_n$ with flat order n. We show that $|T_n| \le n \log(n) + 1$.
We establish the general result by strong induction: assuming the formula holds for
every n′ < n, we show that it holds for n. We enumerate the required base cases in Fig 2.7.
Figure 2.6: An upper bound on a tree’s size NT as a function of its flat order FT . Every
tree is representable as an ℓ-edge (ℓ ≥ 0) chain from its root to the first node where a
split into subtrees T1, . . . , Tk1, Tb1 occurs. These subtrees (labeled with their flat order) are
arranged from left to right according to increasing flat order. We continue splitting the
largest subtree, bi, recursively, until we have no subtrees with flat order > FT/2. This
allows us to prove that NT = O(FT log(FT )).
Each tree can be represented as an initial trunk of length ℓ ≥ 0, followed by a split into
some number of subtrees. Among these we then consider the subtree with the largest flat
order. If its flat order is > n/2, we recursively divide that subtree into a trunk, a splitting
node, and the subtrees at that node. We continue this process until no subtree has flat order
> n/2, as shown in Fig 2.6. Note that there can only be one subtree with flat order > n/2,
so we will never have to subdivide more than one subtree at each level.
We can now write the formula for the number of nodes in $T_n$. From the diagram,
\[ |T_n| = \sum_{i=1}^{k_r} |T_{a_i}| + |T_{b_r}| + \sum_{j=1}^{r} \ell_j. \tag{2.1} \]
By the inductive assumption $|T_{a_i}| \le a_i \log(a_i) + 1$ and $|T_{b_r}| \le b_r \log(b_r) + 1$, so
\[ |T_n| \le \sum_{i=1}^{k_r} \bigl(a_i \log(a_i) + 1\bigr) + \bigl(b_r \log(b_r) + 1\bigr) + \sum_{j=1}^{r} \ell_j. \tag{2.2} \]
By construction, all $a_i$ and $b_r$ are $\le n/2$. Furthermore, since $n \ge \sum_{i=1}^{k_r} a_i + b_r$, at most
3 of $a_1, \ldots, a_{k_r}, b_r$ may be $> n/4$. When $a_i \le n/4$,
$a_i \log(a_i) + 1 \le a_i \log(a_i) + a_i \le a_i \log(2a_i) \le a_i \log(n/2)$, similarly for $b_r$. Thus,
\[ |T_n| \le \sum_{i=1}^{k_r} a_i \log(n/2) + b_r \log(n/2) + 3 + \sum_{j=1}^{r} \ell_j. \]
To prove that this implies $|T_n| \le n \log(n) + 1$, we now consider 3 cases:
(1) $h(b_r) \ge 2$
Then, according to observation (i), $n \ge \sum_{j=1}^{r} \ell_j + h(b_r) \ge \sum_{j=1}^{r} \ell_j + 2$, therefore,
\[ |T_n| \le \log(n/2) \Bigl( b_r + \sum_{i=1}^{k_r} a_i \Bigr) + 1 + n \le n \log(n/2) + n + 1 = n \log(n) + 1. \]
(2) $h(b_r) = 0$
Then $T_{b_r}$ has no nodes, and since by construction it is the largest subtree in its
level, it must be that the splitting node at the bottom of trunk $\ell_r$ has no children.
This means that either the entire tree is a single trunk, in which case
$|T_n| = n \le n \log(n) + 1$, or that $\ell_r > n/2$, since we had to subdivide $T_{b_{r-1}}$. In this case, we have
$|T_n| \le \sum_{i=1}^{k_{r-1}} (a_i \log(a_i) + 1) + \sum_{j=1}^{r} \ell_j$. Since $a_i \le n/2$ by construction, we have
$a_i \log(a_i) + 1 \le a_i \log(2a_i) \le a_i \log(n)$, and therefore
$|T_n| \le \log(n) \sum_{i=1}^{k_{r-1}} a_i + \sum_{j=1}^{r} \ell_j$, which transforms to
$|T_n| < (n/2) \log(n) + n$ since observation (ii) implies
$\sum_{i=1}^{k_{r-1}} a_i \le n - \ell_r < n/2$. Finally, since $n \le (n/2) \log(n)$ for $n \ge 4$, we have
\[ |T_n| < n \log(n). \]
Figure 2.7: The largest possible trees with flat order 1, 2, and 3, respectively.
(3) $h(b_r) = 1$
In this case $T_{b_r}$ consists of a single node, so $b_r = 1$. We may now write
$|T_n| \le \log(n/2) \sum_{i=1}^{k_r} a_i + 3 + 1 \log(1) + \sum_{j=1}^{r} \ell_j$, since at most 3 elements of $a_i$ may
be $> n/4$. Noting that $1 = b_r \log(2) \le b_r \log(n/2)$ as long as $n \ge 4$, we have
$|T_n| \le \log(n/2) \bigl( b_r + \sum_{i=1}^{k_r} a_i \bigr) + 1 + \sum_{j=1}^{r} \ell_j$. Applying the results of observations
(i) and (ii), we have the familiar inequalities $b_r + \sum_{i=1}^{k_r} a_i \le n$ and $1 + \sum_{j=1}^{r} \ell_j \le n$,
yielding
\[ |T_n| \le n \log(n/2) + n + 1 = n \log(n) + 1. \]
The assumption n ≥ 4 can be eliminated by noting that the largest trees with flat
order < 4 still obey the inequality. These trees are shown in Fig 2.7.
Thus, for an arbitrary tree T with flat order $F_T$, $|T| \le F_T \log(F_T) + 1 = O(F_T \log(F_T))$,
which is precisely the statement of the theorem for nested linear graphs.
Notably, this bound is asymptotically tight. Consider the family
of trees $T_i$ defined recursively as:
• T0 = a single node.
• $T_{i+1}$ = a trunk of $2^i$ nodes, which splits into two subtrees $T_i$, as shown in
Fig 2.8.
Figure 2.8: A family of trees Ti with flat order Fi that attain the asymptotic upper bound of
O(log(Fi)) on the ratio |Ti|/Fi. The particular tree depicted in the diagram is Ti+1.
By induction, it is clear that both the height and the flat order of $T_i$ are equal to $2^i$. The
number of nodes is defined by the recurrence $|T_{i+1}| = 2|T_i| + 2^i$, the solution to which is
$|T_i| = 2^{i-1}(i + 2)$. Thus, for any tree T of this family,
\[ |T| = \tfrac{1}{2} F_T \bigl(2 + \log(F_T)\bigr) = \Theta(F_T \log(F_T)). \]
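The closed form can be checked numerically for small i. The sketch below builds $T_i$ explicitly and measures its size, height, and flat order. The nested-list encoding of trees and the recursion used for the flat order (the maximum of the node height and the sum of the children's flat orders, our reading of observations (i) and (ii)) are assumptions made for illustration.

```python
def make_tree(i):
    """Return T_i as nested lists: a node is the list of its children."""
    if i == 0:
        return []  # T_0 is a single node with no children
    # Splitting node at the bottom of the trunk, with two copies of T_{i-1}.
    node = [make_tree(i - 1), make_tree(i - 1)]
    # The remaining 2^(i-1) - 1 trunk nodes sit above the splitting node.
    for _ in range(2 ** (i - 1) - 1):
        node = [node]
    return node


def stats(node):
    """Return (number of nodes, node height, flat order) of a tree."""
    size, height, flat_sum = 1, 0, 0
    for child in node:
        s, h, f = stats(child)
        size += s
        height = max(height, h)
        flat_sum += f
    height += 1
    return size, height, max(height, flat_sum)


for i in range(6):
    size, height, flat = stats(make_tree(i))
    closed_form = 2 ** i * (i + 2) // 2  # equals 2^(i-1) * (i + 2)
    print(i, size, closed_form, height, flat)
```

For each i the measured size agrees with $2^{i-1}(i + 2)$, and both the height and the flat order equal $2^i$.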
2.5 Approximating MAX-FLS with MAX-LLS
We now further restrict the subclass of NLGs to examine by introducing the notion of level
flat graphs, and the corresponding optimization problem MAX-LLS. First, however, we
prove a useful property of flat linear graphs.
Theorem 3. Any flat nested linear graph G can be written as a union of k ≥ 0 disjoint
subsets, $G = \bigcup_{i=1}^{k} C_i$, where each $C_i$ is a column of edges, i.e. a $|C_i|$-thick edge.
Proof. Consider any edge e ∈ G, and let E be the set of edges that either contain or are
contained by e. Because G is flat, E must form a column: if two distinct edges in E both
contain e, they must contain each other or intersect; if they are both contained by e, they
must contain each other, otherwise e is a branching edge. Now note that by exactly the
same reasoning, there can be no edge e′ ∈ E that contains or is contained by an edge
g ∈ G − E, since g and e cannot contain or be contained by one another: if e′ contains
them both it must be a branching edge, if e′ is contained by both then they must intersect,
and if e′ contains one and is contained by the other, then one must contain the other.
Thus, E is completely disjoint with respect to containment from G− E. Thus, we can
let $C_1 = E$, and continue subdividing G − E in this manner to obtain $C_2, \ldots, C_k$. In the
end, the $C_i$ are columns separate from one another, and $G = \bigcup_{i=1}^{k} C_i$.
Definition 5. Consider any flat nested linear graph $G = \bigcup_{i=1}^{k} C_i$, where each $C_i$ is a
column. G is level if $|C_1| = \ldots = |C_k|$.
The MAX-LLS optimization problem is therefore to find the largest level flat subgraph
in a set of linear graphs. We now show that this further restriction yields an approximation
within a factor of O(log |GF |) of the optimal solution GF to MAX-FLS.
Theorem 4. For any flat nested linear graph GF , its largest level subgraph GL with size
L = |GL| satisfies |GF | = O(L logL).
Proof. We first define two properties of linear graphs that are particularly important for
level graphs.
Definition 6. The length $\ell_G$ of a linear graph G is the size of the largest subgraph of G that
consists solely of edges that do not intersect or contain one another, i.e. a flat graph where
$|C_i| = 1$ for all i. The height $h_G$ of G is the size of the largest subgraph of G that consists
solely of one column, i.e. a flat graph consisting of one $h_G$-thick edge.
These definitions are applicable to any linear graphs, but for level graphs they induce
a compact representation since each level graph corresponds to an ordered pair (h, ℓ), as
shown in Fig 2.9.
Figure 2.9: Level graphs [a] (h, ℓ) = (2, 7) and [b] (h, ℓ) = (4, 2). These particular graphs
represent points on the level signature of the flat graph shown in Fig 2.10.
We now consider an arbitrary flat graph GF with height hG and length ℓG. For each
h = 1, . . . , hG, we let F (h) be the largest value such that the level graph (h, F (h)) is a
subgraph, noting that 1 ≤ F (h) ≤ ℓG (Fig 2.10). The discrete function F is thus uniquely
defined for any flat graph GF . We call this function the level signature of a flat graph.
Note that the level signature is unique for any flat graph, although two distinct flat graphs
may produce the same level signature simply because of different order of the columns.
Each point (h, F(h)) corresponds to a level subgraph of $G_F$, as depicted in Fig 2.9. The
size of this subgraph is hF(h); therefore, the largest level subgraph of $G_F$ corresponds to
the point with the largest hF(h), say $(h^*, \ell^*)$. Thus, $L = |G_L| = h^* \ell^*$.
Let $\hat{F}$ be the hyperbola passing through $(h^*, \ell^*)$ with the equation $h\ell = L$, i.e. $\hat{F}(h) = L/h$. By definition
of $(h^*, \ell^*)$, all points of the level signature F must lie on or below this hyperbola. Note that the area under F
given by $\sum_{h=1}^{h_G} F(h)$ gives the size of the original flat graph $G_F$, because F(h) counts the
number of columns containing an edge at height h. We now rewrite the sum as
$|G_F| = \ell_G + \sum_{h=2}^{h_G} F(h)$, noticing that the area represented by the sum is a subset of the area under
$\hat{F}$ from h = 1 to h = $h_G$. Thus,
\[ |G_F| \le \ell_G + \int_{1}^{h_G} \hat{F}(h)\,dh. \]
Since $(1, \ell_G)$ and $(h_G, 1)$ are both points of the level signature, $\ell_G \le L$ and $h_G \le L$.
Figure 2.10: Representing the possible level subgraphs of a flat graph G with a discrete
nonincreasing function F , its level signature. Each point (h, F (h)) corresponds to a level
graph with F (h) columns of height h that is the largest level subgraph of G of height h.
The shaded area represents L, the size of the largest level subgraph of G. The hyperbola $\hat{F}$
has the equation hℓ = L and lies on or above all points of F.
Evaluating the integral along the hyperbola hℓ = L, we have
\[ \int_{1}^{h_G} \hat{F}(h)\,dh = \int_{1}^{h_G} \frac{L}{h}\,dh = L \log h_G. \]
Thus, $|G_F| \le L + L \log L = O(L \log L)$.
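The bound of Theorem 4 can be illustrated on a small flat graph. By Theorem 3 a flat graph is a union of columns, so the sketch below encodes it simply as a list of column heights. That encoding, the function names, and the example heights are illustrative assumptions. The logarithm is natural, matching the integral above.

```python
import math


def level_signature(column_heights):
    """F(h) = number of columns with at least h edges, for h = 1..h_G."""
    h_max = max(column_heights)
    return [sum(1 for c in column_heights if c >= h) for h in range(1, h_max + 1)]


def largest_level_subgraph(column_heights):
    """Return (h*, l*) maximizing h * F(h) over the level signature."""
    sig = level_signature(column_heights)
    h_star = max(range(1, len(sig) + 1), key=lambda h: h * sig[h - 1])
    return h_star, sig[h_star - 1]


heights = [4, 4, 2, 2, 2, 1, 1]            # hypothetical flat graph with 7 columns
sig = level_signature(heights)              # [7, 5, 2, 2]
h_star, l_star = largest_level_subgraph(heights)
L = h_star * l_star                         # size of the largest level subgraph
size = sum(heights)                         # |G_F|, also equal to sum(sig)
print(sig, (h_star, l_star), size, L + L * math.log(L))
```

Here the largest level subgraph is (2, 5) with L = 10, while the flat graph has 16 edges, well within L + L log L.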
2.6 A Polynomial-Time Algorithm for MAX-LLS
To briefly summarize the results of Theorems 2 and 4, a nested linear graph $G_N$ of size
S has a flat subgraph $G_F$ of size F satisfying S = O(F log F). Rewriting, we have
F = Ω(S / log F) = Ω(S / log S) since F ≤ S. The flat graph $G_F$ in turn has a level
subgraph $G_L$ of size L satisfying F = O(L log L), which can be similarly rewritten as
L = Ω(F / log S) (since L ≤ S). Combining these equations yields $L = \Omega(S / \log^2 S)$.
Since the largest level flat subgraph of $G_N$ has size $L = \Omega(S / \log^2 S)$, and the optimal
common level subgraph MAX-LLS has by definition size ≥ L, we have thus shown that
MAX-LLS approximates MAX-NLS within a factor of at most $O(\log^2 S)$. We now present
an algorithm to compute MAX-LLS in polynomial time.
The main idea of the algorithm is to efficiently search the space of level subgraphs
for the one with the largest size. For an input instance I consisting of k linear graphs
$G_1, \ldots, G_k$, let $\ell_I = \min_{i=1}^{k} \mathrm{length}(G_i)$ and $h_I = \min_{i=1}^{k} \mathrm{height}(G_i)$; these will be
computed in the course of the algorithm. We now demonstrate how to find, for any $h \le h_{G_i}$,
the largest level graph (h, ℓ) which is a subgraph of $G_i$.
For any edge $e = x_i x_j$ where $x_i < x_j$, we compute a subset S(e) of the edges containing
e. Each edge in S(e) is indexed by its left coordinate. Iterating through all edges of $G_i$,
we only add an edge $e' = x_{i'} x_{j'}$ if i′ < i and j′ > j. If S(e) already contains an edge
$e_c$ with left coordinate $x_{i'}$, we will only keep whichever of e′ and $e_c$ has a smaller right
coordinate. This ensures that we only keep the smallest edge containing e for each left
coordinate. Thus, S(e) will have size O(n) for every edge.
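The construction of S(e) can be sketched in a few lines of Python. Edges are represented as (left, right) index pairs. This representation and the function name are assumptions made for illustration.

```python
def containing_edges(e, edges):
    """S(e): for each left coordinate, the smallest edge strictly containing e."""
    i, j = e
    best = {}  # left coordinate -> smallest right coordinate seen so far
    for a, b in edges:
        if a < i and b > j:                  # (a, b) strictly contains e
            if a not in best or b < best[a]:
                best[a] = b
    return sorted(best.items())


edges = [(1, 10), (1, 8), (2, 9), (3, 4), (5, 6)]
print(containing_edges((3, 4), edges))  # [(1, 8), (2, 9)]
```

Of the two containing edges that start at position 1, only the shorter one, (1, 8), is kept, so the result has at most one entry per left coordinate.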
Using S(e) we can compute the height of every edge in the graph (the height of an
edge e is the height of the tallest column in which e is the top edge). We can think of the
edges of our linear graph $G_i$ as nodes in a directed acyclic graph (DAG) $G_i^*$, where the edge
e → e′ is present in $G_i^*$ if and only if e′ ∈ S(e). Furthermore, we add an auxiliary source
node s that has edges to every other node of $G_i^*$, and assign weight of −1 to all edges of
$G_i^*$. Clearly, an edge of height h in $G_i$ will have shortest path distance −h − 1 from s in
$G_i^*$. Thus, computing edge heights in $G_i$ is equivalent to computing shortest path distances
from s in $G_i^*$.
Thus, to label the edges of $G_i$ with their height we construct the DAG $G_i^*$, and use
the Bellman-Ford algorithm for DAGs (5) to compute the shortest path distances from s to
every node of $G_i^*$. This computation will be linear in the size of $G_i^*$. We call this procedure
vertical labeling.
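A minimal sketch of vertical labeling follows. Since every DAG edge has weight −1, the shortest path to a node is the longest chain of nested edges ending there, so the sketch tracks chain lengths directly and relaxes edges in a topological order (increasing span), which is the usual single-pass scheme for shortest paths in a DAG. The (left, right) edge encoding, the function names, and the convention that a lone edge has height 1 are assumptions.

```python
def containing_edges(e, edges):
    """S(e): for each left coordinate, the smallest edge strictly containing e."""
    i, j = e
    best = {}
    for a, b in edges:
        if a < i and b > j and (a not in best or b < best[a]):
            best[a] = b
    return list(best.items())


def vertical_labeling(edges):
    """Label each edge with the size of the tallest column it tops."""
    # Any edge contained in e has a smaller span, so increasing span is a
    # topological order of the DAG whose arcs go from e to the edges in S(e).
    order = sorted(set(edges), key=lambda e: e[1] - e[0])
    height = {e: 1 for e in order}
    for e in order:
        for outer in containing_edges(e, order):
            height[outer] = max(height[outer], height[e] + 1)
    return height


edges = [(1, 10), (2, 9), (3, 4), (5, 8), (6, 7), (11, 12)]
for edge, h in sorted(vertical_labeling(edges).items()):
    print(edge, h)
```

In this example the column (6, 7), (5, 8), (2, 9), (1, 10) gives the outermost edge height 4, while the isolated edge (11, 12) has height 1.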
Similarly, we compute R(e), a subset of edges that lie to the right of e. We only add an
edge if its left coordinate $x_{i'} > x_j$, and we only keep one such edge per left coordinate, the
one with the smaller right coordinate, ensuring |R(e)| = O(n) for any edge e. Using the
same approach as with vertical labeling, but with the edges of $G_i^*$ given by R(e) instead of
S(e), we obtain a labeling of edges according to the length of the largest flat sequence of
non-intersecting edges ending at the given edge. The largest label in the graph will have
value $\ell_{G_i}$, the length of the graph.
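The same scheme with R(e) in place of S(e) can be sketched as follows. Sorting edges by left coordinate gives a topological order, because every edge in R(e) starts after e ends. As before, the (left, right) encoding and the function names are illustrative assumptions.

```python
def edges_to_right(e, edges):
    """R(e): edges starting after e ends, one per left coordinate."""
    _, j = e
    best = {}  # left coordinate -> smallest right coordinate seen so far
    for a, b in edges:
        if a > j and (a not in best or b < best[a]):
            best[a] = b
    return list(best.items())


def horizontal_labeling(edges):
    """Label each edge with the longest run of disjoint edges ending at it."""
    order = sorted(set(edges))  # increasing left coordinate
    length = {e: 1 for e in order}
    for e in order:
        for nxt in edges_to_right(e, order):
            length[nxt] = max(length[nxt], length[e] + 1)
    return length


edges = [(1, 4), (2, 3), (5, 8), (6, 7), (9, 10)]
labels = horizontal_labeling(edges)
print(max(labels.values()))  # 3
```

The largest label, 3, is the length of the graph: for instance (2, 3), (6, 7), (9, 10) are pairwise disjoint.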
We generalize this approach to produce the largest level subgraph of height h. Starting
with the labeling of the edges by height obtained during the vertical labeling phase, we
compute Rh(e), which is the same as R(e) in the subset of Gi that has height ≥ h. In
other words, we disregard all edges of height < h, and calculate the length of each edge
Figure 2.11: The algorithm for finding the MAX-LLS of a linear graph. First, all edges
in the graph are marked with their height (see [a]), using the vertical labeling procedure.
Next, for each h, all edges of height < h are ignored, and the remaining edges are marked
with their length using horizontal labeling. For h = 1 and h = 2 the results are shown in
[b] and [c] respectively.
in the remaining subgraph (Fig 2.11). The largest level subgraph of Gi with height h will
be (h, F (h)), where F (h) is the largest label in the graph obtained in this manner. We call
this horizontal labeling.
Using this procedure, we can now find MAX-LLS for an instance I as follows:
1. Label the edges of each graph G ∈ I according to height using vertical labeling.
2. Let $h_I = \min_{G \in I} h_G$.
3. For h = 1, . . . , $h_I$ and each G ∈ I, compute the length $F_G(h)$ of the largest level
subgraph of G with height h, using horizontal labeling. For each h, let
$\ell_h = \min_{G \in I} F_G(h)$. The level graph $(h, \ell_h)$ is the largest common level subgraph for the
instance I of height h.
4. While iterating from h = 1 to h = hI , keep track of the largest level subgraph (h, ℓh)
produced in the previous step. Return this subgraph.
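The four steps above can be assembled into a compact sketch. It assumes each input graph is a non-empty list of (left, right) edges. All function names are illustrative, and the code favors brevity over the O(ne) bound per labeling pass discussed below.

```python
def pruned(candidates):
    """Keep one edge per left coordinate: the one with the smallest right end."""
    best = {}
    for a, b in candidates:
        if a not in best or b < best[a]:
            best[a] = b
    return list(best.items())


def vertical_labeling(edges):
    """Height of each edge: size of the tallest column it tops (lone edge = 1)."""
    order = sorted(edges, key=lambda e: e[1] - e[0])
    height = {e: 1 for e in order}
    for i, j in order:
        for outer in pruned((a, b) for a, b in order if a < i and b > j):
            height[outer] = max(height[outer], height[(i, j)] + 1)
    return height


def horizontal_labeling(edges):
    """Length of the longest run of pairwise disjoint edges ending at each edge."""
    order = sorted(edges)
    length = {e: 1 for e in order}
    for i, j in order:
        for nxt in pruned((a, b) for a, b in order if a > j):
            length[nxt] = max(length[nxt], length[(i, j)] + 1)
    return length


def level_signature(edges):
    """F(h) for h = 1..h_G: ignore edges of height < h, then label horizontally."""
    height = vertical_labeling(edges)
    signature = []
    for h in range(1, max(height.values()) + 1):
        tall = [e for e in edges if height[e] >= h]
        signature.append(max(horizontal_labeling(tall).values()))
    return signature


def max_lls(graphs):
    """Return (h, l) describing the largest common level subgraph."""
    signatures = [level_signature(g) for g in graphs]
    h_max = min(len(s) for s in signatures)        # h_I
    best = (0, 0)
    for h in range(1, h_max + 1):
        l_h = min(s[h - 1] for s in signatures)    # largest common length at height h
        if h * l_h > best[0] * best[1]:
            best = (h, l_h)
    return best


g1 = [(1, 4), (2, 3), (5, 8), (6, 7), (9, 10)]
g2 = [(1, 6), (2, 5), (3, 4), (7, 10), (8, 9)]
print(level_signature(g1), level_signature(g2), max_lls([g1, g2]))
```

For these two graphs the level signatures are [3, 2] and [2, 2, 1], so the largest common level subgraph is (h, ℓ) = (2, 2) with four edges.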
Suppose the k linear graphs in the input I each have ≤ n nucleotides and ≤ e edges.
Each horizontal or vertical labeling procedure takes O(ne), as the DAG constructed for
the Bellman-Ford computation will have O(e) nodes and e · O(n) edges. Horizontal labeling
must be performed for every h and both types of labeling must be done for each of
the k linear graphs. Thus, the overall running time, dominated by horizontal labeling, is
$O(khne) = O(kn^2 e)$, since h ≤ n.
2.7 Discussion
We have introduced a novel computational model for RNA multiple structural alignment,
by representing each RNA sequence as a linear graph and the multiple alignment as a com-
mon nested subgraph. We noted earlier that the MAX-NLS problem represents a relaxation
of RNA multiple structural alignment, because a linear graph derived from an RNA se-
quence by connecting all complementary nucleotide pairs has certain constraints dictating
which edges must exist.
There are sound biological and computational reasons to adopt the more general NLG
model. At times the complementarity of nucleotides is not sufficient for the formation of a
stable hydrogen bond. For instance, adjacent complementary nucleotides are rarely paired
in real structures, because geometric constraints prevent them from achieving an orientation
that allows participation in hydrogen bonding. It is therefore common to explicitly prevent
the structure from pairing such nucleotides (or more generally, nucleotides that are less
than some fixed number of bases apart) by modifying the algorithm used to compute it. In
the NLG model, this can be accomplished simply by not adding such edges to the linear
graphs constructed from each sequence. In general, the NLG model is flexible enough
to allow easy incorporation of biological insights that modify the space of permissible
pairings. Insights that reduce this space are particularly valuable because by decreasing
the number of edges in the resulting linear graphs, the running time of our approximation
algorithm improves accordingly. In addition, heuristic approaches to prune certain edges,
which are deemed unlikely to be included in the final structure, could be combined with
our algorithm in order to reduce running time further. Such enhancements are likely to be
incorporated into any practical algorithm that finds biologically meaningful structures.
The approximation quality, while bounded by $O(\log^2 S)$ in the worst case, will vary depending
on the class of ncRNAs being aligned. When mapped back to the RNA sequence, a
level graph consists of ℓ groups of stems, each consisting of h complementary pairs. Thus,
for ncRNA families whose secondary structure fits this pattern well, such as tRNAs, our
algorithm will perform more accurately.
Compared to the elaborate free energy functions used by several structure-prediction
programs (18; 19), the NLG model uses a fairly rough approximation. The main advan-
tage of the NLG model is the ability to incorporate multiple sequence information without
having a fixed alignment. The approximation algorithm we presented could be used to
obtain a rough alignment and structure, which could then be refined using heuristic meth-
ods with more elaborate scoring models. Such a hybrid would combine theoretical bounds
on approximation quality derived in the NLG framework with the benefits of heuristic ap-
proaches.
Chapter 3
Constrained Element Detection with
GERP++
3.1 Background
Identification and annotation of all functional elements in the human genome is one of the
main goals of contemporary genetics in general, and the ENCODE project in particular
(20; 21; 22). Comparative sequence analysis has become a powerful tool for such analysis,
as sequence conservation due to negative selection is often a strong signal of biological
function. After constructing a multiple sequence alignment, the goal is to quantify evolu-
tionary constraint at the level of individual positions and identify segments of the alignment
that show significantly elevated levels of conservation.
Several computational methods for constrained element (CE) detection have been de-
veloped, with most falling into one of two broad categories: generative model-based ap-
proaches, which attempt to explicitly model the quantity and distribution of constraint
CHAPTER 3. CONSTRAINED ELEMENT DETECTION WITH GERP++ 30
within an alignment, and bottom-up approaches, which first estimate constraint at individ-
ual positions and then look for clusters of highly constrained positions. The main generative
approach, phastCons (23), uses a Hidden Markov Model to find the most likely parse of
the alignment into constrained and neutral hidden states. While HMMs are widely used in
modeling biological sequences, they have known drawbacks: transition probabilities imply
a specific geometric state duration distribution, which in the context of phastCons means
predicted constrained and neutral segment length. This biases the resulting estimates of
element length and total genomic fraction under constraint.
One of the leading bottom-up approaches is GERP (24), which quantifies position-
specific constraint in terms of rejected substitutions (RS), the difference between the neutral
rate of substitution and the observed rate as estimated by maximum likelihood, and heuris-
tically extends contiguous segments of constrained positions (RS > 0) in a BLAST-like
(25) manner. However, GERP is computationally slow because its maximum likelihood
computation uses the Expectation Maximization algorithm (26) to estimate a new set of
branch lengths for each position of the alignment; this step is also undesirable methodolog-
ically because it involves estimating k real-valued parameters from k nucleotides of data.
Furthermore, the extension heuristic used by GERP (and other bottom-up methods (27))
may induce biases in the length of predicted CEs.
In this chapter we present GERP++, which as the name suggests represents a signifi-
cant improvement on the GERP methodology and addresses these weaknesses. GERP++
achieves over 100x speedup over the original GERP algorithm while using a more sta-
tistically robust maximum likelihood estimation procedure. In addition, we introduce a
novel criterion of grouping constrained positions into constrained elements using statisti-
cal significance as a guide and assigning p-values to our predictions. We use a dynamic
programming approach to globally predict a set of constrained elements ranked by their
p-values and coupled with a rigorous false positive rate estimate. Using GERP++ we an-
alyzed an alignment of the human genome and 33 other mammalian species, identifying
over 1.3 million constrained elements spanning over 7% of the human genome with high
confidence. Compared to previous methods, we predict a significantly larger fraction of the
human genome positions under constraint, grouped in a much smaller number of
predicted CEs, with very low false positive rate.
3.2 Results
3.2.1 Overview of the Algorithm
Like other bottom-up approaches, the GERP++ algorithm consists of two components: cal-
culation of position-specific constraint scores for each column of a multiple alignment, and
subsequent annotation of segments that score significantly higher than expected by chance
(Fig 3.1; see Section 3.3 for more detailed description). These are largely independent
procedures: the GERP++ score for a specific position depends almost entirely on the nu-
cleotides at that position and not on any global element predictions, while identification
of statistically significant high-scoring segments depends only on the additivity of individ-
ual position scores and can potentially be used in conjunction with other position-specific
scoring metrics.
Constraint intensity at individual alignment positions is quantified in terms of rejected
substitutions (RS), defined as the number of substitutions expected under neutrality minus
the number of substitutions observed at the position (24). Thus, positive scores represent
constraint or substitution deficit, while negative scores represent a substitution surplus.
Figure 3.1: Overview of GERP++. (1) For each position of the multiple alignment we
compute the conservation score in rejected substitutions by subtracting the estimated evo-
lutionary rate from the neutral rate. The neutral rate is computed by removing species
gapped at that position from the phylogenetic tree and summing the branch lengths of the
resulting projected tree; the evolutionary rate is estimated by computing the maximum like-
lihood rescaling of the projected tree. (2) Given position-specific conservation scores, we
generate a set of candidate elements. (3) For each candidate element, we compute a p-value
to represent the likelihood of observing a segment of equal length and greater than or equal
score. We then select a non-overlapping set of elements in order of increasing p-value.
Since it is impossible to truly observe which and how many substitution events actually
occurred during the course of evolution, we approximate this quantity by estimating the
optimal scaling factor that maximizes the probability of the observed nucleotides in the
scaled neutral tree.
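As a rough illustration of this computation, the sketch below scores a single alignment column. It is a simplified stand-in, not the GERP++ implementation: it assumes a Jukes-Cantor substitution model, a uniform root distribution, and a plain grid search over scaling factors capped at 3, none of which is specified in this section. The nested (child, branch length) tree encoding and all names are hypothetical.

```python
import math

NUCS = "ACGT"


def jc_prob(same, t):
    """Jukes-Cantor probability of ending in the same / one given other base."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial(node, scale):
    """Felsenstein pruning: likelihood of the leaves below node for each base."""
    if isinstance(node, str):                       # leaf: observed nucleotide
        return [1.0 if x == node else 0.0 for x in NUCS]
    like = [1.0] * 4
    for child, t in node:                           # internal: (child, length) pairs
        below = partial(child, scale)
        for a in range(4):
            like[a] *= sum(jc_prob(a == b, scale * t) * below[b] for b in range(4))
    return like


def total_length(node):
    """Neutral rate: sum of branch lengths of the (projected) tree."""
    if isinstance(node, str):
        return 0.0
    return sum(t + total_length(child) for child, t in node)


def rs_score(tree, steps=300, max_scale=3.0):
    """Rejected substitutions: neutral rate minus ML-rescaled rate."""
    neutral = total_length(tree)
    scales = [max_scale * k / steps for k in range(steps + 1)]
    best = max(scales, key=lambda s: sum(0.25 * x for x in partial(tree, s)))
    return neutral * (1.0 - best)


conserved = [([("A", 0.1), ("A", 0.1)], 0.05), ("A", 0.2)]
variable = [([("A", 0.1), ("C", 0.1)], 0.05), ("G", 0.2)]
print(rs_score(conserved), rs_score(variable))
```

A fully conserved column is best explained by a scaling factor of 0, so its score equals the full neutral rate of the tree (0.45 here), while a column with three different bases scores lower.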
Statistically significant constrained regions are detected by first generating a set of can-
didate elements and then computing p-values based on their score and length that represent
the probability of such a region occurring in the null model (element score is defined as the
sum of RS scores for each position within the element). These p-values are used to rank
CEs in order of significance and report a set of non-overlapping predictions, starting with
the lowest (best) p-value. Rather than using a fixed cutoff, GERP++ estimates the false
positive rate by randomly permuting the input RS-scores and treating any prediction within
the shuffled sequence as a false positive, similar to the first version of GERP (22; 24).
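The shuffling idea can be sketched independently of how elements are called. In the code below, `call_elements` is a deliberately crude, hypothetical stand-in (maximal runs of positive scores of length at least 4) and not the GERP++ candidate generation or p-value ranking. Only the permutation-based estimate of false positive bases mirrors the text.

```python
import random


def call_elements(scores, min_len=4):
    """Hypothetical caller: maximal runs of positive scores, at least min_len long."""
    elements, start = [], None
    for pos, s in enumerate(list(scores) + [0.0]):   # sentinel closes a trailing run
        if s > 0 and start is None:
            start = pos
        elif s <= 0 and start is not None:
            if pos - start >= min_len:
                elements.append((start, pos))        # half-open interval
            start = None
    return elements


def covered(elements):
    """Number of positions covered by a list of half-open intervals."""
    return sum(end - start for start, end in elements)


def shuffled_fp_rate(scores, trials=100, seed=0):
    """Bases called in shuffled scores, as a fraction of bases called in the real ones."""
    rng = random.Random(seed)
    real = covered(call_elements(scores))
    if real == 0:
        return 0.0
    fake = 0
    for _ in range(trials):
        permuted = list(scores)
        rng.shuffle(permuted)                        # destroys positional clustering
        fake += covered(call_elements(permuted))
    return (fake / trials) / real


scores = [-1.0] * 50 + [1.0] * 10 + [-1.0] * 50
print(call_elements(scores), shuffled_fp_rate(scores))
```

Because shuffling scatters the ten positive scores, runs long enough to be called are rare in the permuted sequences, which is what makes predictions there usable as an estimate of false positives.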
3.2.2 Constraint in the Human Genome
We used GERP++ to analyze the TBA alignment of the human genome to 33 other mam-
malian species spanning over 3 billion positions and 5.83 substitutions per site in phylo-
genetic scope. We identified 1,354,034 constrained elements covering 214,749,502 nu-
cleotides, or approximately 7% of the human genome, with an estimated false positive rate
of 0.86% at the nucleotide level (see Section 3.3 for details). Compared to a slightly nega-
tive background average of −0.125 RS, GERP++ predictions and certain known functional
elements show elevated amount of constraint, in excess of 1.7 RS. GERP++ elements range
in size from 4 to nearly 2000 bases, with mean length of 158.6 nucleotides. The minimum
and maximum lengths are parameters of the algorithm, and the tail of the length distribution
(Fig 3.2) suggests that with a more permissive upper bound even longer elements could be
Figure 3.2: Distribution of GERP++ constrained element lengths.
identified.
We observe significant variation at the level of entire chromosomes in both average RS
score and fraction predicted to belong to constrained elements (Fig 3.3). The mean con-
straint level varied from −0.3 to −0.05 RS with the exception of chromosome X, which
was the only chromosome with a positive average RS score, just under 0.1 RS. This result
is consistent with earlier work of (28), which suggested reduced mutation rate of the X
chromosome in rodents. We also observe substantial fluctuation in the chromosome frac-
tion predicted to be inside constrained elements, which varied from 1% of chromosome
Y to 4-9% for other chromosomes. Admittedly this metric is skewed for chromosome Y
because a large portion of the alignment there does not have enough species for rate esti-
mation, but even when adjusting for “effective” chromosome size (Fig 3.3B) much of the
fluctuation remains. Surprisingly, despite the low fraction within constrained elements, Y
does not have a particularly low average RS score, while X does not exhibit a high CE frac-
tion despite the positive average RS. In fact, there appears to be at best weak correlation
between these two metrics of constraint.
3.2.3 Estimating Detectable Constraint
The only major parameter for GERP++ is a false positive rate cutoff that determines at
what point the algorithm should stop generating predictions in order to avoid too many
false discoveries. Throughout its execution GERP++ keeps track of the constrained ele-
ments predicted so far, as well as estimates of the number and total size of false positive
predictions for that cutoff level. Examining how these quantities grow as the cutoff pa-
rameter increases permits us to estimate the amount of total constraint that can be detected
using this methodology and give an approximate upper bound on the amount of constraint
within the human genome.
Let B(c) be the number of bases within constrained elements predicted at false positive
cutoff c, and let B∗(c) = B(c) − F (c) be the same quantity adjusted for false positive
predictions by subtracting the estimated number of false positive bases (as found in shuf-
fled alignments) at cutoff c. Figure 3.4 shows B and B∗ as a function of c from 0 to 50%:
while B continues to increase, B∗ starts to level off right as B begins to grow linearly.
This suggests that $\max_c B^*(c)$ can be used to estimate the total number of bases in constrained
elements that can be annotated using this method in any given region or the entire
Figure 3.3: Per-chromosome constraint intensity. (A) Mean RS score for all alignment
positions where evolutionary rate was computed. Note the elevated average score for chro-
mosome X. (B) Fraction of chromosome that falls into predicted constrained elements.
Light green bars show fraction of entire chromosome, while dark green bars show fraction
adjusted for regions where no rate computation was performed and no elements could span
(see Section 3.3).
Figure 3.4: Estimating detectable constraint. The red curve represents the number of bases
within predicted constrained element as a function of the false positive cutoff parameter.
The blue curve represents the number of predicted bases minus the expected number of
false positive bases, also as a function of the false positive cutoff.
genome. Approximately 225 megabases, or nearly 7.3% of the human genome can be de-
tected as constrained using GERP++ at the mammalian phylogenetic scope. If we adjust
for the portions of the genome where rate estimation was not performed (but with a deeper
alignment might be in the future), we estimate that up to 8% of the human genome consists
of constrained elements detectable using this kind of methodology. Combined with the
observation that about 190 megabases, or 6.2% can be detected at a false positive cutoff
of 0 (Fig 3.4), we obtain a fairly narrow estimate of 6-8% of the human genome under
detectable mammalian constraint.
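The estimate $\max_c B^*(c)$ amounts to a one-line computation once B and F have been tabulated over a range of cutoffs. The sketch below uses small made-up numbers purely for illustration; they are not values from the analysis, and the function name is hypothetical.

```python
def detectable_constraint(cutoffs, predicted_bases, false_positive_bases):
    """Return (cutoff, B*) at which B*(c) = B(c) - F(c) is largest."""
    adjusted = [b - f for b, f in zip(predicted_bases, false_positive_bases)]
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return cutoffs[best], adjusted[best]


# Hypothetical toy values: B keeps growing with the cutoff, B* levels off.
cutoffs = [0.0, 0.1, 0.2, 0.3, 0.4]
B = [100, 140, 170, 190, 220]
F = [0, 14, 34, 57, 88]
print(detectable_constraint(cutoffs, B, F))  # (0.2, 136)
```

In this toy series B rises steadily, but the adjusted count peaks at the middle cutoff, which is the behavior the estimate relies on.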
3.2.4 Association of CEs with Known Functional Elements
We next examine the relationship between evolutionary constraint and several classes of
biologically important regions. Overall, coding exons exhibit by far the strongest lev-
els of constraint, as quantified both by the average RS score within functional elements
(Fig 3.5A), and by fraction of bases that overlap the predicted CEs (Table 3.1). Both 5’
and 3’ UTR regions show weaker but noticeable constraint levels and, somewhat surpris-
ingly, introns on average have slightly lower RS scores than the overall genomic baseline.
However, a nontrivial fraction of introns does exhibit evidence of constraint, as nearly 7%
of intron positions overlap predicted elements (Table 3.1), and these positions make up a
large fraction of constrained element bases (Fig 3.5B).
Over 94% of the coding exons in the human genome overlap at least one predicted CE;
conversely, only about 16% of constrained elements overlap a coding exon. Such CEs tend
to be about 60 nucleotides or 40% longer on average compared to elements that do not
overlap exons, with more than a twofold difference in score (both t-tests significant at
p-value $< 2.2 \times 10^{-16}$). While overall these results are consistent with earlier findings 3.6, the
Figure 3.5: Relationship between CEs and known functional elements. (A) Mean rejected
substitution scores for the entire human genome, constrained elements predicted by GERP++,
and known annotated exons, introns, and UTR regions. (B) Breakdown of constrained
element positions by region type.
Annotation    % Coverage by CEs
Exons         84.6%
Introns        6.9%
UTR 5’        23.7%
UTR 3’        33.9%
ncRNA         10.1%
Table 3.1: Fraction of Functional Regions Covered by Constrained Elements on a Nucleotide Level
length difference between exon-associated and non-overlapping CEs is somewhat smaller
than what was previously found. This can be partially explained by considering the differ-
ences in the pattern of constraint between coding exons and other regions. Because GERP
by default only merges blocks of contiguous constrained positions if they are separated by
at most one unconstrained position, it is far more likely to generate longer elements in ex-
onic regions where most unconstrained bases correspond to 3rd positions of a codon and are
usually flanked by constrained positions. In noncoding regions where unconstrained posi-
tions are distributed more irregularly and often occur consecutively, the GERP algorithm
ends up fragmenting longer constrained regions and generating shorter elements. Because
GERP++ does not base merging decisions on any such fixed threshold it is able to better
annotate longer noncoding CEs.
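The fixed-gap merging rule described above can be illustrated with a minimal sketch. It assumes that a position counts as constrained when its score is positive and that elements are reported as half-open intervals; the function name, the `max_gap` parameter, and the toy score patterns are hypothetical and are not taken from the GERP implementation.

```python
def merge_constrained(scores, max_gap=1):
    """Merge constrained positions (score > 0) into elements.

    Runs of constrained positions are joined when they are separated by at
    most `max_gap` unconstrained positions. Returns half-open (start, end) pairs.
    """
    elements = []
    start = None  # start of the element currently being built
    last = None   # index of the most recent constrained position
    for i, s in enumerate(scores):
        if s <= 0:
            continue
        if start is None:
            start = i
        elif i - last - 1 > max_gap:
            # Too many unconstrained positions in between: close the element.
            elements.append((start, last + 1))
            start = i
        last = i
    if start is not None:
        elements.append((start, last + 1))
    return elements


# Coding-like pattern: every third position is unconstrained.
coding = [1.0, 1.0, -0.5] * 4
# Noncoding-like pattern: unconstrained positions occur consecutively.
noncoding = [1.0, 1.0, -0.5, -0.5, 1.0, -0.5, -0.5, -0.5, 1.0, 1.0, 1.0, -0.5]

coding_elements = merge_constrained(coding)
noncoding_elements = merge_constrained(noncoding)
```

With these toy inputs the coding-like pattern yields a single element, while the irregular pattern is split into three shorter ones, which is the length bias discussed above.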
To further test this hypothesis, and to investigate a potentially useful signal for detecting
coding exons, we introduce a metric that rigorously quantifies this pattern of constraint for
any region. For any given segment, we define the 3-periodicity bias as the maximum over
the 3 possible reading frames of the mean RS score at positions 1 and 2 minus the mean
RS score at position 3. This metric quantifies a periodic bias in constraint and effectively
deals with unknown reading frame location and lack of a reading frame altogether, since
the maximum is taken over all 3 possibilities. As figure 3.6 shows, the 3-periodicity bias
Figure 3.6: Distributions (smoothed histograms) of 3-periodicity bias for known exons
(red), introns (green), CEs that overlap exons (orange), and CEs not overlapping exons
(blue).
Type                         Mean 3-Periodicity Bias
Exons                        2.96
UTR 5’                       0.57
UTR 3’                       0.32
Introns                      0.18
CEs overlapping exons        2.46
CEs not overlapping exons    0.55
Table 3.2: Mean 3-Periodicity Bias for Different Types of Regions
is a strong signal characteristic of coding exons (mean 2.96) compared to other regions
such as UTRs, introns, and ncRNAs (mean 0.13-0.38, difference significant at p-value
$< 2.2 \times 10^{-16}$). We partitioned the constrained elements predicted by GERP++ according
to exon overlap, and found that CEs overlapping coding exons have a much greater mean
3-periodicity bias (Table 3.2). However, the difference between CEs that did not overlap
any annotated exons, and known nonexonic regions such as introns was still significant,
suggesting some of these CEs intersect unannotated exonic regions. To test this hypothesis,
we checked the constrained elements that did not overlap any known coding exons against
exon predictions made by the computational gene prediction tool CONTRAST (29). We
found 16881 CEs (making up 1.5% of all CEs that did not overlap known genes) that
overlapped CONTRAST predictions, and these CEs had a significantly higher 3-periodicity
bias (1.33) than those that did not overlap CONTRAST predictions (0.54). As this is still
higher than the average 3-periodicity of clearly non-exonic elements, it is possible that
these elements may overlap unannotated exons or pseudogenes with recently lost function.
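The 3-periodicity bias defined earlier in this section can be written down in a few lines. The sketch below follows the verbal definition directly (maximum over the three frames of the mean RS score at codon positions 1 and 2 minus the mean at position 3); the function name and the handling of segments too short to contain a third position are assumptions made for illustration.

```python
def three_periodicity_bias(rs_scores):
    """Max over 3 frames of mean(RS at positions 1 and 2) - mean(RS at position 3).

    Returns None if no frame has both kinds of positions (very short segments).
    """
    best = None
    for frame in range(3):
        # In this frame, index i is a third codon position when (i - frame) % 3 == 2.
        third = [s for i, s in enumerate(rs_scores) if (i - frame) % 3 == 2]
        first_two = [s for i, s in enumerate(rs_scores) if (i - frame) % 3 != 2]
        if not third or not first_two:
            continue
        bias = sum(first_two) / len(first_two) - sum(third) / len(third)
        if best is None or bias > best:
            best = bias
    return best


# Toy segments (hypothetical scores): a codon-like pattern in two different
# frames, and a uniformly constrained segment with no periodic signal.
in_frame = three_periodicity_bias([2.0, 2.0, -1.0] * 5)
shifted = three_periodicity_bias([-1.0, 2.0, 2.0] * 5)
flat = three_periodicity_bias([1.0] * 9)
```

Because the maximum is taken over all three frames, the two codon-like segments receive the same bias regardless of where the reading frame starts, while the uniformly constrained segment scores zero.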
3.2.5 Comparison with PhastCons
We compared the GERP++ constrained element predictions in placental mammals (see Sec-
tion 3.3) to phastCons (23), the leading generative model-based tool. Not surprisingly, we
found significant overlap between GERP++ and phastCons predictions: 80% of GERP++
predictions overlapped at least one phastCons prediction, and vice versa. However, aside
from both algorithms detecting clearly constrained areas, there are substantial differences:
GERP++ predicts significantly fewer elements, which are much longer on average and
cover a substantially larger portion of the human genome—almost twice as much as the
4% predicted by phastCons (Fig 3.7A). As a result, on a nucleotide level GERP++ overlaps
90% of phastCons predictions while only half of GERP++ CE positions are covered by
phastCons.
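The distinction between element-level and nucleotide-level overlap used above can be made concrete with a small sketch. It assumes that each prediction set is a list of non-overlapping half-open intervals on a single sequence; the function names and toy coordinates are hypothetical, and the quadratic loops are meant for illustration rather than genome-scale use.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def element_level_overlap(query, reference):
    """Fraction of query elements overlapping at least one reference element."""
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return hits / len(query)


def nucleotide_level_overlap(query, reference):
    """Fraction of query positions covered by some reference element."""
    covered = set()
    for start, end in reference:
        covered.update(range(start, end))
    total = sum(end - start for start, end in query)
    shared = sum(1 for start, end in query for p in range(start, end) if p in covered)
    return shared / total


# Toy data: one long prediction versus three short fragments of the same region.
long_set = [(0, 100)]
short_set = [(10, 30), (40, 60), (70, 90)]

elem_long_in_short = element_level_overlap(long_set, short_set)
nuc_short_in_long = nucleotide_level_overlap(short_set, long_set)
nuc_long_in_short = nucleotide_level_overlap(long_set, short_set)
```

In this toy case both sets overlap completely at the element level, yet only 60% of the long element's positions are covered by the fragments. This is the same kind of asymmetry as the one reported above for GERP++ and phastCons.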
Part of the reason for these differences is that phastCons often predicts multiple elements
where GERP++ makes one longer prediction. PhastCons thus skips intermediate positions
which may be under weaker constraint yet still part of one large functional element,
as in the example in Fig 3.8. In order to demonstrate that this is not an isolated occurrence
and to quantify fragmentation of known functional elements, we computed the number of
distinct predicted constrained elements overlapping each annotated coding exon. While
the total number of exons that overlap at least one constrained element prediction is ap-
proximately the same between the two methods, GERP++ is significantly more effective at
identifying entire exons as a single predicted CE, rather than fragmented between two or
more CEs like phastCons (Fig 3.7C,D).
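The fragmentation measure used above, the number of distinct predicted elements overlapping each annotated exon, can be sketched as follows. Intervals are assumed to be half-open and on a single sequence; the function names and toy coordinates are hypothetical.

```python
from collections import Counter


def ces_per_exon(exons, elements):
    """For each exon, count the distinct predicted elements that overlap it."""
    counts = []
    for ex_start, ex_end in exons:
        n = sum(1 for ce_start, ce_end in elements
                if ce_start < ex_end and ex_start < ce_end)
        counts.append(n)
    return counts


def fragmentation_histogram(exons, elements):
    """Histogram: number of overlapping elements -> number of exons."""
    return Counter(ces_per_exon(exons, elements))


# Toy annotation and two hypothetical prediction sets.
exons = [(100, 200), (300, 400), (500, 600)]
single = [(90, 210), (295, 405)]                   # one element per detected exon
fragmented = [(100, 130), (150, 200), (300, 400)]  # first exon split in two

single_counts = ces_per_exon(exons, single)
fragmented_counts = ces_per_exon(exons, fragmented)
```

Both toy prediction sets detect the same two exons, but only the histogram of per-exon counts reveals that the second set splits the first exon into two elements.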
Due in part to its ability to annotate larger elements in one piece, GERP++ is more
effective at predicting constraint within several types of known functional regions. At the
nucleotide level GERP++ elements cover a substantially larger fraction of several major
types of functional elements, especially coding exons and UTRs (Fig 3.7B). The improved
ability to detect known functional elements suggests GERP++ may also be more effective
at predicting unannotated regions that are not only constrained but also functional.
Figure 3.7: GERP++ vs phastCons predictions. (A) Mean length (left), number (middle)
and total length (right) of constrained elements predicted by GERP++ (blue) and phastCons
(yellow). (B) Nucleotide-level fraction of annotated exons, introns, UTRs and noncoding
RNA genes covered by GERP++ (blue) and phastCons (yellow) predictions. (C, D) Histogram
of the number of distinct predicted GERP++ (blue, D) and phastCons (yellow, C) constrained
elements overlapping each annotated coding exon. Note the difference in scale on the y-axis.
Figure 3.8: Constrained region slightly over 200 base pairs in length containing known
exon as annotated by GERP++ (labeled ’GERP++’, black) and phastCons (purple track
labeled ’Mammal El’). Note the fragmentation of the single functional region into multiple
CE predictions by phastCons.
3.3 Methods
3.3.1 Estimation of Evolutionary Rates and RS Scores
Given a phylogenetic tree with branch lengths and a multiple sequence alignment of the
species within that tree, GERP++ quantifies constraint intensity at each individual position
in terms of rejected substitutions (24), the difference between the neutral rate and the es-
timated evolutionary rate at the position. For our analysis the alignment was compressed
to remove gaps in the reference sequence (human), although the RS score computation
algorithm does not assume any specific reference sequence. In order to estimate the evo-
lutionary rate we model nucleotide evolution as a continuous-time Markov process, which
specifies for each pair of nucleotides a and b and duration t the probability of a
transforming into b over time t, designated by $p_{ab}(t)$. Many such evolutionary models have been
developed (30; 31; 32), each with its own set of simplifying assumptions. GERP++ implements
the HKY85 model (32), but any time-reversible model (where $p_a p_{ab}(t) = p_b p_{ba}(t)$
for all pairs of nucleotides a and b, with $p_a$ the equilibrium frequency of a) that permits
efficient computation of $p_{ab}(t)$ can be used instead.
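The time-reversibility condition can be checked numerically for HKY85. The sketch below builds the standard HKY85 rate matrix (off-diagonal rate from a to b proportional to the frequency of b, multiplied by a factor kappa for transitions) and obtains $p_{ab}(t)$ from a generic series-based matrix exponential. The frequencies, the value of kappa, and all function names are hypothetical choices for illustration; this is not how GERP++ itself evaluates $p_{ab}(t)$.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}


def hky85_rate_matrix(freqs, kappa):
    """HKY85 rate matrix, scaled to one expected substitution per unit time."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCS):
        for j, b in enumerate(NUCS):
            if i != j:
                # Transitions (within purines or within pyrimidines) get factor kappa.
                is_transition = (a in PURINES) == (b in PURINES)
                q[i][j] = freqs[b] * (kappa if is_transition else 1.0)
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so this is the row sum
    rate = -sum(freqs[a] * q[i][i] for i, a in enumerate(NUCS))
    return [[v / rate for v in row] for row in q]


def mat_mul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t) by a truncated Taylor series with scaling and squaring."""
    scale = t / (2 ** squarings)
    a = [[v * scale for v in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms + 1):
        term = [[v / n for v in row] for row in mat_mul(term, a)]  # A^n / n!
        result = [[result[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical frequencies
P = transition_probs(hky85_rate_matrix(freqs, kappa=2.0), t=0.5)

# Detailed balance: p_a * p_ab(t) should equal p_b * p_ba(t) for every pair.
max_imbalance = max(abs(freqs[a] * P[i][j] - freqs[b] * P[j][i])
                    for i, a in enumerate(NUCS) for j, b in enumerate(NUCS))
```

The quantity `max_imbalance` should be zero up to floating-point error, and each row of `P` should sum to one.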
For each individual alignment column GERP++ labels the leaves of the phylogenetic
tree with the corresponding nucleotides $c_1, \dots, c_k$; gapped species are projected out.
Although this is not necessarily ideal and sometimes leads to information loss, it avoids some
of the common difficulties and potentially serious biases that accompany modeling gaps
in alignments: aligner errors and artifacts that result from simplified gap penalties and
incorrect handling of duplications and rearrangements, assembly mistakes, and missing se-
quence data. Furthermore, this treatment of gaps avoids penalizing constrained elements
that have undergone lineage-specific deletion (24).
Once the gapped species are removed, the site-specific neutral rate is computed as the
sum of the branch lengths in the trimmed tree. When there are fewer than 3 species remain-
ing no rate estimation is performed for that position, as there are not enough species to even
form a valid tree. To estimate the evolutionary rate we introduce a scaling parameter r that
represents the site’s rate of evolution relative to neutrality. When r < 1 the quantity 1 − r
can be naturally interpreted as the fraction of neutral substitutions “rejected” by evolutionary
selection. GERP++ estimates r by maximum likelihood, where the likelihood is given
by $L(r) = \Pr(c_1, \dots, c_k \mid T_r)$, where $T_r$ is the neutral tree T scaled by r. For any given
r, and therefore fixed tree $T_r$, this function can be computed efficiently using a dynamic
programming algorithm due to Felsenstein (33). If n is an internal node with children $n_1$
and $n_2$, and $c_1, \dots, c_{k_n}$ represents the subset of the leaves corresponding to the subtree
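rooted at n, then
\begin{align*}
\Pr(c_1, \dots, c_{k_n} \mid n = a)
  &= \Pr(c_1, \dots, c_{k_{n_1}} \mid n = a) \cdot \Pr(c_1, \dots, c_{k_{n_2}} \mid n = a) \\
  &= \Big( \sum_b \Pr(c_1, \dots, c_{k_{n_1}} \mid n_1 = b) \cdot p_{ab}(T_r(n, n_1)) \Big)
     \cdot \Big( \sum_b \Pr(c_1, \dots, c_{k_{n_2}} \mid n_2 = b) \cdot p_{ab}(T_r(n, n_2)) \Big)
\end{align*}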
where $T_r(x, y)$ is the branch length in $T_r$ between two neighboring nodes x and y. Since
the leaf nucleotides are observed, this equation can be used to compute the subtree probability
for all internal nodes, starting at the bottom and reaching the root, where we can
compute $L(r) = \Pr(c_1, \dots, c_k \mid T_r) = \sum_a \Pr(c_1, \dots, c_k \mid \text{root} = a) \cdot p_a$. Assuming a
fixed alphabet and an evolutionary model where the probabilities $p_{ab}(t)$ are computable in
constant time, this algorithm runs in time O(k) where k is the number of species in the
phylogenetic tree.
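A minimal sketch of this bottom-up computation is given below. To keep it short it substitutes the Jukes-Cantor model (30), whose $p_{ab}(t)$ has a simple closed form and uniform $p_a = 1/4$, for HKY85. The tree encoding, the species names, and the branch lengths are hypothetical, and the recursion accepts any number of children per node rather than exactly two.

```python
import math
from itertools import product

NUCS = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor probability of a changing into b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def subtree_probs(node, column, r):
    """Return {a: Pr(leaves below node | node = a)} on the tree scaled by r.

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == column[node] else 0.0 for a in NUCS}
    probs = {a: 1.0 for a in NUCS}
    for child, length in node:
        below = subtree_probs(child, column, r)
        for a in NUCS:
            # Sum over the child's state b, weighted by p_ab on the scaled branch.
            probs[a] *= sum(jc69(a, b, r * length) * below[b] for b in NUCS)
    return probs


def likelihood(tree, column, r):
    """L(r) = sum over root states a of Pr(leaves | root = a) * p_a."""
    root = subtree_probs(tree, column, r)
    return sum(0.25 * root[a] for a in NUCS)


# Toy tree ((human, chimp), mouse) with hypothetical neutral branch lengths.
tree = [([("human", 0.1), ("chimp", 0.1)], 0.05), ("mouse", 0.4)]
conserved = {"human": "A", "chimp": "A", "mouse": "A"}

slow = likelihood(tree, conserved, r=0.1)
fast = likelihood(tree, conserved, r=1.0)

# The likelihoods of all possible columns must sum to one for any r.
total = sum(likelihood(tree, dict(zip(["human", "chimp", "mouse"], col)), 0.7)
            for col in product(NUCS, repeat=3))
```

For the fully conserved column the likelihood is higher at the slower rate, which is what drives the maximum likelihood estimate of r toward small values at constrained sites.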
Using this algorithm as a subroutine to calculate L(r), GERP++ computes the maxi-
mum likelihood value of r using Brent’s method (34; 35), a numerical optimization tech-
nique that tends to require relatively few computations of the function being optimized.
The evolutionary rate for a site with neutral rate n is estimated to be rn, and the final RS
score is computed as n − rn = n(1 − r). As maximum likelihood may estimate very large
or even infinite values of r, we impose a cap of r = 3 on GERP++ rate estimates, yielding
RS scores that range between −2n and +n. These scores are then used as the basis for
prediction of constrained elements within the region.
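The sketch below traces the path from one alignment column to an RS score under several simplifying assumptions, all of which are ours: a star tree in place of a real phylogeny, the Jukes-Cantor model in place of HKY85, and a plain golden-section search in place of Brent's method. Species names and branch lengths are hypothetical. What it keeps from the description above is the projection of gapped species, the three-species minimum, the cap of r = 3, and the formula RS = n(1 − r).

```python
import math

NUCS = ("A", "C", "G", "T")


def jc69(a, b, t):
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def star_likelihood(column, branch_lengths, r):
    """L(r) for a star tree: sum over root states of the product over leaves."""
    total = 0.0
    for a in NUCS:
        prod = 0.25
        for species, nuc in column.items():
            prod *= jc69(a, nuc, r * branch_lengths[species])
        total += prod
    return total


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function on [lo, hi] (stand-in for Brent's method)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:   # maximum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:         # maximum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    return (lo + hi) / 2.0


def rs_score(column, branch_lengths, r_cap=3.0):
    """Return (r_hat, RS) for one column, or None with fewer than 3 species."""
    present = {sp: nuc for sp, nuc in column.items() if nuc in NUCS}  # drop gaps
    if len(present) < 3:
        return None
    neutral_rate = sum(branch_lengths[sp] for sp in present)  # trimmed star tree
    r_hat = golden_section_max(
        lambda r: star_likelihood(present, branch_lengths, r), 0.0, r_cap)
    return r_hat, neutral_rate * (1.0 - r_hat)


lengths = {"s1": 0.5, "s2": 0.5, "s3": 0.5, "s4": 0.5}  # hypothetical
conserved = rs_score({"s1": "A", "s2": "A", "s3": "A", "s4": "-"}, lengths)
divergent = rs_score({"s1": "A", "s2": "C", "s3": "G", "s4": "-"}, lengths)
too_shallow = rs_score({"s1": "A", "s2": "A", "s3": "-", "s4": "-"}, lengths)
```

In this toy setting the conserved column receives the maximum score of +n with n = 1.5, the fully divergent column hits the cap and receives −2n, and the column with only two ungapped species is skipped.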
3.3.2 Computation of p-values and Element Prediction
Given position-specific constraint scores, GERP++ generates a list of elements that exhibit
evidence of evolutionary constraint beyond what is likely to occur by chance. For each
element, we compute a p-value that represents the probability of a random neutral segment
of equal length having an equal or higher RS score. In addition to being used to select
final predictions from the set of candidate elements, these p-values in conjunction with
position-specific scores provide useful information for biological analysis.
Every segment of contiguous multiple alignment columns is a candidate element. Be-
cause considering all possible segments within the alignment is computationally infeasible,
GERP++ generates a list of candidate elements using several simple biological heuristics
to prune the possibilities. First, we impose a user-specified minimum and maximum on
candidate element length; while real functional elements vary in length, very few extend
beyond several thousand bases, and even these will not be missed entirely as GERP++
will identify their most constrained parts. Second, since positive RS scores indicate con-
straint, GERP++ allows only candidate elements that start and end at positions with RS =
0 and cannot be extended further in either direction; this rule has the additional benefit of
imposing sensible boundary conditions on predicted elements. Finally, we only consider
candidate elements with score above a certain value, which is a function of the element
length and the median neutral rate of the region. This allows pruning of candidate elements
that have low score relative to their length and will end up with poor p-values anyway;
ignoring them early reduces the memory requirements considerably since most candidate
elements would otherwise fall into this category.
Using neutrality as the null hypothesis, we can now define p-values for candidate and
predicted elements on the basis of score and length. If the probability of a single neutral
position having RS score x is given by P(x), then for an element of length L and score S
the p-value is the probability of having score at least S in exactly L positions, and is given
by
\[
\mathrm{pval}(L, S) = \sum_x \mathrm{pval}(L - 1, S - x) \cdot P(x)
\]
These p-values can be computed using dynamic programming, for $L = 1, \dots, L_{\max}$,
starting from the base case pval(0, S) = 1 for S ≤ 0 and 0 otherwise,
provided the distribution P(x) can be computed and the space of possible scores x is not
too large. The latter is accomplished by discretizing to within a specified tolerance t; since
individual scores range from −2n to +n, there are 3n/t possible discretized scores. We
now build a histogram of these discrete scores from the alignment, with two exceptions.
First, we exclude long consecutive runs of “shallow” positions, i.e., positions with neu-
tral rate below specified cutoff, as there are many such primate-specific regions and they
tend to skew the score distribution. Additionally, remaining shallow positions are given
a small penalty to discourage GERP++ from predicting CEs consisting mostly of shallow
positions. Second, we exclude positions that belong to clearly constrained regions, which
are identified using a preliminary pass of the algorithm. Also, in order to eliminate artifacts
caused by zero probabilities, we add a small uniform prior to the histogram to ensure every
discretized score appears at least once.
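The recurrence and the discretization step can be sketched as follows. Scores are represented as integer bins (score divided by the tolerance and rounded), and the base case pval(0, S) = 1 for S ≤ 0 and 0 otherwise is the natural one for a sum over zero positions. The function names, the pseudocount value, and the toy inputs are hypothetical, and the exclusion of shallow and clearly constrained positions is left out.

```python
def score_distribution(rs_scores, tolerance, min_bin, max_bin, prior=1.0):
    """Histogram of discretized scores with a uniform prior on every bin."""
    counts = {b: prior for b in range(min_bin, max_bin + 1)}
    for s in rs_scores:
        b = min(max(round(s / tolerance), min_bin), max_bin)
        counts[b] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def tail_lookup(table, lo, hi, s):
    """pval for a fixed length, extended beyond the tabulated score range."""
    if s <= lo:
        return 1.0  # every outcome reaches the target
    if s > hi:
        return 0.0  # no outcome reaches the target
    return table[s]


def pvalue_tables(score_probs, max_length):
    """tables[L][S] = Pr(sum of L neutral scores >= S), for L = 1..max_length."""
    lo, hi = min(score_probs), max(score_probs)
    prev, prev_lo, prev_hi = {0: 1.0}, 0, 0  # length 0: pval is 1 iff S <= 0
    tables = {}
    for length in range(1, max_length + 1):
        cur = {}
        for s in range(length * lo, length * hi + 1):
            # pval(L, S) = sum_x pval(L - 1, S - x) * P(x)
            cur[s] = sum(tail_lookup(prev, prev_lo, prev_hi, s - x) * p
                         for x, p in score_probs.items())
        tables[length] = cur
        prev, prev_lo, prev_hi = cur, length * lo, length * hi
    return tables


# Toy neutral distribution: scores -1 and +1 are equally likely.
tables = pvalue_tables({-1: 0.5, 1: 0.5}, max_length=2)

# Toy histogram: four raw scores, tolerance 0.5, bins -2..2.
hist = score_distribution([-0.9, 0.1, 0.1, 1.2], 0.5, -2, 2)
```

The running time is proportional to the maximum length times the number of reachable discretized sums times the number of score bins, which is why the tolerance and the score range matter in practice.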
Once all candidate elements have been assigned p-values, GERP++ selects elements in
a greedy manner, from lowest to highest p-value, discarding any elements that overlap
previously reported elements. As the p-value increases so does the expected false pos-
itive rate of our predictions; when this reaches a user-specified threshold the algorithm
terminates. While it would be ideal to compute this directly from the p-values, the multiple
hypothesis correction in this case is non-trivial because GERP++ reports a non-overlapping
set of predictions. Therefore, we adopt the approach of Cooper et al. (24) and estimate the
false positive rate from independent permuted alignments.
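The greedy selection step can be sketched as below. One simplification is ours: the loop stops at a plain p-value cutoff (`max_pvalue`), which stands in for the false positive rate threshold that the actual procedure estimates from permuted alignments. Candidates are assumed to be (start, end, p-value) triples with half-open coordinates; the toy values are hypothetical.

```python
def select_elements(candidates, max_pvalue):
    """Greedily keep non-overlapping candidates in order of increasing p-value."""
    chosen = []
    for start, end, p in sorted(candidates, key=lambda c: c[2]):
        if p > max_pvalue:
            break  # all remaining candidates are even less significant
        # Keep the candidate only if it overlaps nothing selected so far.
        if all(end <= s or e <= start for s, e, _ in chosen):
            chosen.append((start, end, p))
    return sorted(chosen)


candidates = [
    (0, 50, 1e-8),
    (40, 90, 1e-5),     # overlaps the first, more significant candidate
    (100, 150, 1e-6),
    (200, 220, 0.5),    # fails the cutoff
]
selected = select_elements(candidates, max_pvalue=1e-3)
```

The overlap check here is linear in the number of selected elements; an interval tree or a sorted list would be a natural replacement at genome scale.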
3.3.3 Overview of the Data
TBA (36) alignments of the human genome (hg18) to 43 other vertebrate species were ob-
tained from the UCSC genome browser (37; 38) together with a phylogenetic tree with the
generally accepted topology (Fig 3.9) and neutral branch lengths estimated from 4-fold de-
generate sites. Both the tree and alignments were projected to the 34 mammalian species.
The alignment was compressed to remove gaps in the human sequence, and GERP++
scores were computed for every position with at least 3 ungapped species present, or ap-
proximately 88.9% of the 3.08 billion positions on the 22 autosomes and X/Y chromo-
somes. We used the HKY85 (32) model of evolution with the transition/transversion ratio
set to 2.0 and nucleotide frequencies estimated from the multiple alignment.
To limit memory requirements and allow parallelization of the constrained element
computation, each chromosome was broken up into regions of approximately 2 megabases,
with long segments where no RS score was computed chosen as boundaries. These bound-
ary segments contain no information usable by GERP++ and because the algorithm never
annotates constrained elements spanning them, excluding such segments did not sacrifice
any predictive ability. These boundary regions made up approximately 6.8% of the human
genome, including a 30.2 megabase region that made up more than half of chromosome Y.
Constrained element predictions were generated using default parameters and a 5% false
positive cutoff measured in terms of number of predictions; the estimated nucleotide-level
false positive rate was under 1%. Other constrained element prediction methods are generally
tuned around a 5% nucleotide-level false positive rate.
Figure 3.9: Phylogenetic tree used for GERP++ analysis. Tree is drawn to scale with
respect to estimated neutral branch lengths.
Gene, noncoding RNA, and PhastCons conserved element annotations were obtained
from the UCSC genome browser’s (38; 39) Known Genes, RNA Genes, and Conservation
(23) tracks respectively. To avoid skewed statistics due to alternative splicing, gene an-
notations were resolved to a consistent nonoverlapping set where any segment belonging
to multiple conflicting annotations was assigned a single annotation in the following order
of priority: coding exon, 5’ UTR, 3’ UTR, intron. For meaningful comparison against
phastCons, separate GERP++ scores and constrained elements were generated according
to the same procedure as above but using only placental mammal data (ignoring platypus
and opossum in the alignment and projecting them out of the phylogenetic tree).
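The priority rule for conflicting gene annotations described above can be sketched with a per-position map. The label strings, the function name, and the toy intervals are hypothetical; a per-position dictionary is used only for clarity and would be replaced by an interval-based structure for whole-genome data.

```python
# Highest priority first, as stated in the text.
PRIORITY = ["coding exon", "5' UTR", "3' UTR", "intron"]


def resolve_annotations(intervals):
    """Assign each position the highest-priority label among its annotations.

    intervals: list of (start, end, label) with half-open coordinates.
    Returns a dict mapping position -> label.
    """
    rank = {label: i for i, label in enumerate(PRIORITY)}
    resolved = {}
    for start, end, label in intervals:
        for pos in range(start, end):
            if pos not in resolved or rank[label] < rank[resolved[pos]]:
                resolved[pos] = label
    return resolved


# Toy example: an intron overlapped by a coding exon and a 3' UTR that come
# from alternative transcripts.
annotations = [(0, 10, "intron"), (3, 6, "coding exon"), (5, 8, "3' UTR")]
resolved = resolve_annotations(annotations)
```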
3.4 Discussion
One of the main challenges in constrained element detection is the lack of a clear gold stan-
dard for evaluating the quality of predictions. Human functional elements are sometimes
unconstrained at the mammalian scope or missed at the assembly or alignment stages,
and constrained predictions that do not correspond to any known annotations may have
unknown function, and cannot be definitively considered false positives. Given these limi-
tations, we have shown with a variety of metrics that GERP++ compares favorably with its
predecessor GERP as well as other leading methods such as phastCons. Previous bottom-
up approaches have been limited largely by the simple heuristics used to merge constrained
positions into longer elements; these heuristics may introduce biases in element length
due to patterned constraint such as the 3-periodicity in coding exons. With GERP++ we
evaluate a much richer set of candidate elements, selecting and ranking final predictions
according to statistically meaningful p-values. Despite the added computational cost at
this stage, GERP++ overall is more than 100x faster than GERP due to the speedup in
rate estimation. Because GERP++ estimates a single parameter that directly translates into
evolutionary rate, rather than an independent parameter for each branch of the tree, the
computation is not only faster but also benefits from deeper alignments, resulting in more
statistically robust estimates.
Our understanding of the evolutionary forces responsible for the observed constraint,
especially in noncoding regions, is still limited. This presents a tremendous challenge
for generative model-based approaches, which model implicitly or explicitly the distribu-
tion of length and intensity of constrained elements and the total genomic fraction under
constraint. In such situations, generative models tend to make more assumptions than nec-
essary, potentially introducing biases and making new insights difficult to incorporate. In
contrast, rate estimation and element prediction in GERP++ are largely independent proce-
dures, and while we believe the rejected substitution metric introduced by Cooper et al. (24)
most accurately quantifies constraint intensity at individual positions, any additive position-
specific scoring scheme could potentially be used instead. For example, more elaborate or
context-dependent models could be easily incorporated without drastically changing the
overall algorithm.
One drawback of GERP++ and other similar approaches is sensitivity to variation in
and erroneous estimates of the neutral rate of substitution. Neutral rate estimates are often
subject to some uncertainty and can vary depending on the methodology, alignment quality,
and regions used. To test the ability of GERP++ to tolerate a reasonable amount of error
in neutral rate estimates, we repeated our analysis with the neutral tree scaled up or down
by 5 or 10%. Not surprisingly, overestimating the neutral rate leads to overprediction of
constraint, and vice versa. For a fixed false positive cutoff, we observed a linear relation-
ship between the input neutral rate and the amount of constrained element bases predicted;
a 5% or 10% change in neutral rate leads to approximately an 8% or 15% change, respectively,
in the number of predicted constrained bases.
GERP++ recapitulates known biology, both at nucleotide-level resolution and on the
scale of entire functional elements and even chromosomes. GERP++ scores are accurate
enough to obtain a strong signal of synonymous substitution in coding exons, and the ele-
vated average RS score for chromosome X (Fig 3.3A) agrees with earlier findings (20; 21).
Compared to phastCons, GERP++ predictions overlap a larger fraction of known functional
elements (Fig 3.7B) and have better one-to-one correspondence to constrained coding exons
(Fig 3.7C,D). Our analysis also yielded interesting biological insights, including the
likelihood of unannotated coding exons among our predicted constrained elements. We
predict around 7% of the human genome to be constrained across the mammalian scope,
a slightly larger amount than previous methods, yet with a lower estimated false positive
rate. While this estimate is inexact, our analysis suggests 6% and 8% as reasonable lower
and upper bounds, a somewhat tighter range than earlier estimates (20; 21). Computation-
ally, GERP++ is efficient enough to perform whole-genome analysis of deep mammalian
alignments within a few cpu-days, making it suitable for high-throughput analysis of the
ever increasing amounts of genomic data. We hope GERP++ will prove to be a useful
tool in analyzing, quantifying, and annotating constraint and discovering novel functional
elements in the human and other genomes for which sufficient comparative data exist.
Bibliography
[1] Bafna V, Huson DH. The conserved exon method for gene finding. Proceedings of
the Fifth International Conference on Intelligent Systems for Molecular Biology 2000,
3-12.
[2] Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES. Human and mouse gene
structure: Comparative analysis and application to exon prediction. Genome Research
2000, 10:950-958.
[3] Bonizzoni P, Vedova GD. The complexity of multiple sequence alignment with SP-
score that is a metric. Theoretical Computer Science 2001, 259:63-79.
[4] Cook SA. The complexity of theorem-proving procedures. Proceedings of the Third
ACM Symposium on Theory of Computing 1971, 151-158.
[5] Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, Second
Edition 2001.
[6] Eddy SR. Noncoding RNA genes and the modern RNA world. Nature Reviews Genetics
2001, 2:919-929.
[7] Eddy SR. Computational genomics of noncoding RNA genes. Cell 2002, 109:137-140.
[8] Hardison RC, Oeltjen J, Miller W. Long Human-Mouse Sequence Alignments Reveal
Novel Regulatory Elements: A Reason to Sequence the Mouse Genome. Genome Re-
search 1997, 7:959-966.
[9] Holmes I, Rubin GM. Pairwise RNA Structure Comparison with Stochastic Context-
Free Grammars. Pacific Symposium on Biocomputing 2002, 7:175-186.
[10] Jareborg N, Birney E, Durbin R. Comparative analysis of noncoding regions of 77
orthologous mouse and human gene pairs. Genome Research 1999, 9:815-824.
[11] Kasami T. An efficient recognition and syntax algorithm for context-free languages.
Technical Report AF-CRL-65-758, 1965, Air Force Cambridge Research Laboratory,
Bedford, MA.
[12] Lowe TM, Eddy SR. tRNAscan-SE: a Program For Improved Detection of Transfer
RNA genes in Genomic Sequence. Nucleic Acids Research 1997, 25:955-964.
[13] Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for loop matching.
SIAM Journal on Applied Mathematics 1978, 35:68-82.
[14] Pennacchio L, Rubin E. Genomic strategies to identify mammalian regulatory se-
quences. Nature Reviews 2001, 2:100-109.
[15] Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction
including pseudoknots. Journal of Molecular Biology 1999, 285:2053-2068.
[16] Rivas E, Eddy SR. Secondary structure alone is generally not statistically significant
for the detection of noncoding RNAs. Bioinformatics 2000, 16:573-583.
[17] Wang L, Jiang T. On the complexity of multiple sequence alignment. Journal of Com-
putational Biology 1994, 1:337-348.
[18] Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using ther-
modynamics and auxiliary information. Nucleic Acids Research 1981, 9:133-148.
[19] Zuker M. Computer Prediction of RNA structure. Methods in Enzymology 1989,
180:262-288.
[20] The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004,
306(5696):636-640.
[21] Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH,
Weng Z, Snyder M, Dermitzakis ET, Thurman RE et al. Identification and analysis of
functional elements in 1% of the human genome by the ENCODE pilot project. Nature
2007, 447(7146):799-816.
[22] Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Bir-
ney E, Keefe D, Schwartz AS, Hou M et al. Analyses of deep mammalian sequence
alignments and constraint predictions for 1% of the human genome. Genome Research
2007, 17(6):760-774.
[23] Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H,
Spieth J, Hillier LW, Richards S et al. Evolutionarily conserved elements in vertebrate,
insect, worm, and yeast genomes. Genome Research 2005, 15(8):1034-1050.
[24] Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution
and intensity of constraint in mammalian genomic sequence. Genome Research 2005,
15(7):901-913.
[25] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search
tool. Journal of Molecular Biology 1990, 215(3):403-410.
[26] Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via
the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological)
1977, 39(1):1-38.
[27] Margulies EH, Blanchette M, Haussler D, Green ED. Identification and characterization
of multi-species conserved sequences. Genome Research 2003, 13(12):2507-2518.
[28] McVean GT, Hurst LD. Evidence for a selectively favourable reduction in the muta-
tion rate of the X chromosome. Nature 1997, 386(6623):388-392.
[29] Gross SS, Do CB, Sirota M, Batzoglou S. CONTRAST: a discriminative, phylogeny-
free approach to multiple informant de novo gene prediction. Genome Biology 2007,
8(12):R269.
[30] Jukes TH, Cantor CR. Evolution of protein molecules. Munro HN, ed. Mammalian
protein metabolism 1969, 21-123.
[31] Kimura M. A simple method for estimating evolutionary rate of base substitution
through comparative studies of nucleotide sequences. Journal of Molecular Evolution
1980, 16:111-120.
[32] Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular
clock of mitochondrial DNA. Journal of Molecular Evolution 1985, 22(2):160-174.
[33] Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood ap-
proach. Journal of Molecular Evolution 1981, 17:368-376.
[34] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C, Sec-
ond Edition: the Art of Scientific Computing 1992.
[35] Brent RP. Algorithms for Minimization without Derivatives 1973, chapter 4.
[36] Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R,
Rosenbloom K, Clawson H, Green ED et al. Aligning multiple genomic sequences
with the threaded blockset aligner. Genome Research 2004, 14(4):708-715.
[37] Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D.
The human genome browser at UCSC. Genome Research 2002, 12(6):996-1006.
[38] Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M,
Smith KE, Rosenbloom KR, Raney BJ et al. The UCSC genome browser database:
update 2010. Nucleic Acids Research 2009.
[39] Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known
Genes. Bioinformatics 2006, 22(9):1036-1046.
[40] Crick F. Central Dogma of Molecular Biology. Nature 1970, 227:561-563.
[41] International Human Genome Sequencing Consortium. Initial sequencing and analysis
of the human genome. Nature 2001, 409:860-921.
[42] Needleman SB, Wunsch CD. A general method applicable to the search for similarities
in the amino acid sequence of two proteins. Journal of Molecular Biology 1970,
48(3):443-453.
[43] Smith TF, Waterman MS. Identification of Common Molecular Subsequences. Journal
of Molecular Biology 1981, 147:195-197.