
ALGORITHMS FOR ANALYSIS OF MULTIPLE BIOLOGICAL

SEQUENCES: THEORY AND PRACTICE

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Eugene Davydov

December 2009

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/fk368by4307

© 2010 by Eugene V Davydov. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.





I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou, Primary Adviser


David Dill


Arend Sidow

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

Availability of massive amounts of genomic data from hundreds of species has introduced

many challenging computational problems as well as the need for efficient algorithmic

tools that leverage multiple species information to facilitate biological analysis. This dis-

sertation discusses two such problems: noncoding RNA multiple structural alignment and

constrained element detection.

Noncoding RNA genes (ncRNAs) are regions of the genome that are transcribed but not

translated into protein, and fold directly into secondary and tertiary structures which can

have a variety of important biological functions. Because their function depends closely

on the secondary structure, ncRNAs often do not exhibit enough primary sequence conser-

vation to be properly aligned using standard sequence-based methods. I therefore consider

the problem of RNA multiple structural alignment, i.e., performing sequence alignment

and secondary structure prediction simultaneously. In the first part of this dissertation I in-

troduce a novel graph theoretic framework for analyzing this problem and prove that when

the number of sequences is not fixed it is NP-complete. I also provide a polynomial time

algorithm that approximates the optimal solution to within a factor of O(log^2 n).

Constrained elements are regions of the human genome exhibiting evidence of puri-

fying selection and therefore biological function. Computational identification of such


elements is one of the major goals of comparative genomics. In the second part of this dis-

sertation I present GERP++, a new tool for efficient constrained element detection that sig-

nificantly improves on one of the current leading methods, GERP. While retaining GERP’s

biological transparency and metric for quantifying position-specific constraint, GERP++

uses a more rigorous method for computing evolutionary rates and a novel algorithm for

element identification that uses statistical significance directly to evaluate and rank can-

didate elements. These algorithmic improvements decrease the running time by several

orders of magnitude in practice, enabling high-throughput analysis of large data sets. Fur-

thermore, I present analysis and biological interpretation of constrained elements identified

by GERP++ in the human genome from recently available multiple species alignments.


Acknowledgments

First and foremost, I would like to thank my advisor, Serafim Batzoglou, for his guidance,

patience, encouragement, and advice throughout my graduate career. I would also like to

thank all my co-authors and collaborators, especially Arend Sidow for all his help, advice,

and incredible insights he has shared with me during my work on the GERP++ project;

David Dill, for taking his valuable time to be on my dissertation reading committee and my

oral examination; and Atul Butte and Vijay Pande for their thought-provoking questions

during the aforementioned exam.

To all past and present members of the Batzoglou lab: it was a great pleasure working

and interacting with all of you, and I am grateful to Stanford University for the opportunity

to be around and learn from so many brilliant people, both students and professors. In

particular, I owe a great debt to Marina Sirota, George Asimenos, and Tom Do for the

countless times they’ve helped me and everything they’ve taught me.

Last, but certainly not least, I would like to thank my family: my brother Konstantin

and my parents Vladimir and Irina, for all their patience and support throughout the years.


Contents

Abstract

Acknowledgments

1 Introduction

2 RNA Structural Alignment

2.1 Background

2.2 A Graph Theoretic Formulation

2.3 Complexity Analysis of MAX-NLS

2.4 Approximating MAX-NLS with MAX-FLS

2.5 Approximating MAX-FLS with MAX-LLS

2.6 A Polynomial-Time Algorithm for MAX-LLS

2.7 Discussion

3 Constrained Element Detection with GERP++

3.1 Background

3.2 Results

3.2.1 Overview of the Algorithm

3.2.2 Constraint in the Human Genome

3.2.3 Estimating Detectable Constraint

3.2.4 Association of CEs with Known Functional Elements

3.2.5 Comparison with PhastCons

3.3 Methods

3.3.1 Estimation of Evolutionary Rates and RS Scores

3.3.2 Computation of p-values and Element Prediction

3.3.3 Overview of the Data

3.4 Discussion

Bibliography

List of Tables

3.1 Fraction of Functional Regions Covered by GERP++ Constrained Elements

3.2 Mean 3-Periodicity Bias for Different Types of Regions

List of Figures

2.1 Linear Graph Representation of RNA Sequences

2.2 Largest Common Nested Linear Subgraph

2.3 Thick Edges in Nested Linear Graphs

2.4 Overview of Reduction from 3-SAT to D-NLS

2.5 Tree Representation of Nested Linear Graph

2.6 Upper Bound on Tree Size as Function of Flat Order

2.7 Largest Possible Trees with Flat Order at Most 3

2.8 Trees Attaining Asymptotic Upper Bound Between Size and Flat Order

2.9 Level Graphs as Points on Level Signature

2.10 Level Signature and Largest Level Subgraph

2.11 Algorithm for MAX-LLS

3.1 Overview of GERP++

3.2 GERP++ Constrained Element Length Distribution

3.3 Per-chromosome Constraint Intensity

3.4 Estimating Detectable Constraint

3.5 Relationship Between Constrained and Known Functional Elements

3.6 3-Periodicity Bias Distributions for Different Element Types

3.7 GERP++ vs PhastCons Predictions

3.8 Constrained Exon on Human Chromosome 3

3.9 Mammalian Phylogenetic Tree

Chapter 1

Introduction

While proteins have been long known to play a major role in many cellular processes, it

is deoxyribonucleic acid (DNA) that acts as the carrier of genetic information in all living

organisms. DNA is a polymer of nucleotides adenine (A), cytosine (C), guanine (G), and

thymine (T), and the specific nucleotide sequence is essentially a blueprint for development

and function at the cellular level. Understanding exactly how this occurs is a major goal of

modern genetics. The central dogma of molecular biology (40) states that DNA, in addi-

tion to being replicated from generation to generation, can be transcribed into ribonucleic

acid (RNA), which can then be translated into the proteins that perform most biological

functions in the cell. DNA regions that encode proteins in this way are called genes, which in complex organisms consist

of one or more exons, pieces which are actually translated into amino acids that make up

the protein, interspersed with introns, which are spliced out during RNA processing; the

translated region is flanked on both sides by untranslated regions (UTRs). Recent research,

however, has revealed this to be a gross oversimplification. Since all cells in our body share

the same DNA, how do some become brain neurons while others become muscle or vascular tissue?

The key is precise regulation of protein synthesis, at both the transcriptional and translational levels.


Promoters, enhancers, silencers, and binding sites for transcription factors are all important

components of this regulation. In addition, some genes are transcribed into RNA but never

translated into protein, instead performing structural, regulatory, and catalytic functions in

a folded RNA state.

The sequencing of the human genome (41) has created an unprecedented opportunity

for high-throughput computational analysis of the 3 billion letter long sequence in order to

identify and annotate important functional elements. Yet aside from the genetic code, a

long-known mapping from DNA triplets called codons to amino acids that make up proteins, the

DNA sequence alone is surprisingly difficult to decipher without additional information.

That information, however, is becoming available at a rapidly growing pace.

DNA occasionally mutates during replication, and these mutations have a chance of

proliferating to the entire population, a process called fixation. These mutations may be

harmful, helpful, or neutral in terms of the reproductive fitness of the organism. Mutations

in nonfunctional regions are likely to be neutral, while those inside genes or regulatory ele-

ments tend to mostly disrupt biological function, and are thus selected against in the course

of evolution. Therefore comparing DNA sequences of related organisms and identifying

conserved regions can help guide the search for biological function. This insight is the

cornerstone of comparative genomics.

Hundreds of genomes have already been sequenced, including dozens of mammals. As

more efficient and cheaper methods for obtaining sequences become available, the amount

of genomic data will continue to grow, as will the need for effective computational tools

to analyze it. For example, a fundamental problem in comparative genomics is sequence

alignment, i.e., arranging two or more sequences in a way that reflects the minimum edit

distance between them and represents a hypothesis about common origin of individual


nucleotides and larger regions. While dynamic programming algorithms for minimizing

string edit distance in this particular context have existed for decades (42; 43), practical

applications have necessitated the development of new sequence alignment tools that use

heuristics such as anchored and progressive alignment in order to greatly reduce the

computational cost of aligning multiple whole genomes.
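
The dynamic programming idea behind minimum edit distance can be illustrated with a minimal sketch. It assumes unit costs for substitution, insertion, and deletion, which is a simplification and not necessarily the scoring scheme of the cited algorithms (42; 43). The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between strings a and b (illustrative only)."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

The table has (|a| + 1)(|b| + 1) cells, so the cost is quadratic for two sequences; this growth is what motivates the heuristics mentioned above when whole genomes are aligned.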

In order to motivate the design of such tools, it is important to understand the nature of

the computational problem at the level of algorithm complexity: some problems may not be

tractable no matter how fast or cheap computer hardware gets, and may need to be reformu-

lated or solved approximately. Equally important is the development of practical analytical

tools that can be used to answer a specific biological question. In this dissertation, I de-

scribe my contributions to two biologically important problems: RNA multiple structural

alignment, and constrained element detection. The remainder of this thesis is organized as

follows. In chapter 2, I present a novel graph theoretic framework for analyzing the

computational complexity of simultaneous alignment and folding of RNA sequences, showing

the problem to be computationally intractable (NP-Complete) in the stated formulation, and

present an approximation algorithm with a provable error bound. In chapter 3, I introduce

a new tool, GERP++, for the problem of constrained element detection, i.e., quantifying

intensity of conservation and annotating significantly constrained regions within a multiple

sequence alignment. This method relies on an improved evolutionary rate estimation pro-

cedure as well as a novel dynamic programming algorithm that directly assesses statistical

significance for a richer set of candidate elements than previous approaches. I discuss the

results of GERP++ analysis of recently generated alignments of the entire human genome

to 33 other mammalian species. Although no true gold standard is available due to the

open-ended nature of the problem and lack of exhaustive annotations, GERP++ predicts


more constrained positions at lower false positive rates and shows better correspondence

with known functional elements.

Chapter 2

A Computational Model for RNA

Multiple Structural Alignment

This chapter addresses the problem of aligning multiple sequences of non-coding RNA

genes. I approach this problem with the biologically motivated paradigm that scoring of

ncRNA alignments should be based primarily on secondary structure rather than nucleotide

conservation. I introduce a novel graph theoretic model (NLG) for analyzing algorithms

based on this approach, prove that the RNA multiple alignment problem is NP-Complete in

this model, and present a polynomial time algorithm that approximates the optimal

structure of size S within a factor of O(log^2 S).

2.1 Background

Noncoding RNA (ncRNA) genes are among the biologically active features in genomic

DNA. They are polymers of four nucleotides: A (adenine), C (cytosine), G (guanine),

and U (uracil). Unlike regular genes, ncRNAs are not translated into protein, but rather


fold directly into secondary and tertiary structures, which can have a variety of structural,

catalytic, and regulatory functions (6).

The structural stability and function of ncRNA genes are largely determined by the

formation of stable secondary structures through complementarity of nucleotides, whereby

G-C, A-U, and G-U form hydrogen bonds that are energetically favored. This secondary

structure can be predicted from the nucleotide sequence as one minimizing (some approx-

imation of) the free energy (18; 19), which is largely determined by the formation of the

hydrogen bonds. In ncRNAs, such bonds almost always occur in a nested fashion, which

allows the optimal structure to be computed in time O(n^3) in the length of the input

sequence using a dynamic programming approach (13; 11). Algorithms that do not assume

a nested structure are even more computationally expensive (15). However, the stability

of ncRNA secondary structures is not sufficiently different from the predicted stability of

random genomic fragments to yield a discernible statistical signal (16), limiting the appli-

cation of current ncRNA detection methods to the simplest and best understood structures,

such as tRNAs (12).
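
As a rough illustration of why nesting permits an O(n^3) dynamic program, the sketch below maximizes the number of nested complementary pairs in a single sequence. This is an assumed simplification: it counts pairs instead of minimizing free energy, imposes no minimum loop length, and is not the method of the cited works. The names `PAIRS` and `max_nested_pairs` are hypothetical.

```python
# Complementary pairs named in the text: G-C, A-U, and G-U (either orientation).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def max_nested_pairs(seq: str) -> int:
    """Maximum number of nested complementary pairs in seq (pair counting only)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of nested pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            value = best[i + 1][j]  # case 1: position i stays unpaired
            for k in range(i + 1, j + 1):  # case 2: position i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    inside = best[i + 1][k - 1] if k - 1 >= i + 1 else 0
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    value = max(value, 1 + inside + outside)
            best[i][j] = value
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_nested_pairs("GGGCCC"))
```

There are O(n^2) table cells and each takes O(n) work, which gives the cubic running time. Splitting at the partner k is valid only because pairs are nested: nothing inside the pair (i, k) can pair with anything outside it.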

One of the most promising ways of detecting ncRNA genes and predicting reliable sec-

ondary structures for them is comparative sequence analysis. During the course of genome

evolution, mutations that occur in functional regions of the genome tend to be deleterious,

and therefore unlikely to fix, while mutations that occur in non-functional regions tend to

be neutral and accumulate. As a result, functional regions of the genome tend to

exhibit significant sequence similarity across related genomes, whereas regions that are not

functional are usually much less conserved. This difference in the rate of sequence conser-

vation is used as a powerful signal for detecting protein-coding genes (1; 2) and regulatory

sites (14; 10), and could be applied to ncRNA genes. However, their function is largely


determined by their secondary structure, which in turn is determined by nucleotide com-

plementarity: RNA genes across different species are similar in the pattern of nucleotide

complementarity rather than in the genomic sequence. As a result, conventional sequence

alignment methods are not able to properly align ncRNAs (7).

One biologically meaningful approach to ncRNA multiple alignment is finding the

largest secondary structure common to all the sequences, lining up the nucleotides form-

ing this structure, and then aligning corresponding leftover pieces as one would align ge-

nomic sequences which have no evolutionary pressure favoring complementary substitu-

tion. However, this approach has never been applied in practice because the task of find-

ing the largest common secondary structure among several sequences is computationally

challenging: the straightforward extension of the dynamic programming algorithm using

stochastic context-free grammars (SCFGs) has a running time of O(n^{3k}), where k is the

number of sequences being aligned, which is prohibitive even for two sequences of moder-

ate length (9).

The problem of aligning multiple DNA sequences has been proven to be NP-Complete

for certain scoring schemes and metrics (17; 3). However, when analyzing the computational

complexity of ncRNA multiple alignment, it is more relevant to focus on the complexity

of finding the largest common secondary structure, because for most biologically

meaningful ncRNAs the remaining pieces should be relatively short and easy to align.

In this chapter we introduce a novel theoretical framework for analyzing the problem of

ncRNA multiple structural alignment. We present the Nested Linear Graph (NLG) model

and formulate the problem of computing the largest common secondary structure in this

model in terms of finding the largest common nested subgraph. We then prove this prob-

lem to be NP-Complete, and present a polynomial-time algorithm which approximates the


Figure 2.1: A linear graph representation of the RNA sequence UACGUG. The nu-

cleotides are represented by points on a line in the same order as in the sequence. Each

edge is represented by an arc to emphasize that it does not pass through the nodes between

its two endpoints. Edges are drawn between nodes representing complementary

nucleotide pairs A-U, C-G, and G-U.

optimal solution within a factor of O(log^2 S), where S is the size of the optimal solution.

We conclude with a discussion of the NLG model in general and our algorithm and results

in particular.

2.2 A Graph Theoretic Formulation

A linear graph is a graph whose vertices, V, are points on some line L. Genomic sequences

naturally lend themselves to linear graph representations, because each of their nucleotides

can correspond to a point, and the sequence can correspond to the line. For modeling

ncRNA folding and secondary structure, we form the linear graph with edges connecting

pairs of vertices that represent complementary nucleotide pairs (A-U, C-G, and G-U). A

typical linear graph induced by an RNA sequence is shown in Fig 2.1.
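
The construction just described can be sketched as follows, assuming vertices are identified with 0-based positions in the sequence and edges are stored as position pairs (i, j) with i < j. The names `COMPLEMENTARY` and `linear_graph_edges` are hypothetical.

```python
from itertools import combinations

# Unordered complementary pairs from the text: A-U, C-G, and G-U.
COMPLEMENTARY = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def linear_graph_edges(seq: str) -> list[tuple[int, int]]:
    """Edges (i, j), i < j, joining every pair of complementary nucleotides."""
    return [
        (i, j)
        for i, j in combinations(range(len(seq)), 2)
        if frozenset((seq[i], seq[j])) in COMPLEMENTARY
    ]


if __name__ == "__main__":
    # The example sequence of Fig 2.1.
    for edge in linear_graph_edges("UACGUG"):
        print(edge)
```

For a sequence of length n this produces up to n(n - 1)/2 edges, so the linear graph is quadratic in size at worst.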

Two edges ab and cd of a linear graph intersect if exactly one of c and d lies on the

line segment ab (and vice versa). A linear graph is nested if no two edges of the graph

intersect each other. For a linear graph derived from an RNA sequence, a nested subgraph

represents a plausible fold of that sequence. Thus, in the NLG model, the problem of

finding the largest secondary structure of an ncRNA is precisely the problem of finding the


Figure 2.2: The MAX-NLS of several linear graphs; its edges have been emphasized in

bold to distinguish them from the edges of the original linear graphs. Note that the MAX-

NLS is not necessarily unique, but its size is.

largest nested subgraph in the linear graph derived from the sequence. For multiple ncRNA

alignment, where we seek the largest common secondary structure, the appropriate NLG

formulation is finding the largest common nested linear subgraph (MAX-NLS) among the

linear graphs induced by the sequences (Fig 2.2). We now formulate this problem precisely

and formally analyze its computational complexity.
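
The intersection and nestedness definitions can be expressed directly as predicates. The sketch below assumes edges are pairs of positions and reads "lies on the line segment" as lying strictly between the endpoints, so two edges that share an endpoint are not reported as intersecting. That reading is an assumption; a secondary structure would additionally require each nucleotide to appear in at most one edge. The function names are hypothetical.

```python
from itertools import combinations


def intersects(e: tuple[int, int], f: tuple[int, int]) -> bool:
    """True if exactly one endpoint of f lies strictly between the endpoints of e."""
    a, b = sorted(e)
    c, d = sorted(f)
    return (a < c < b) != (a < d < b)


def is_nested(edges: list[tuple[int, int]]) -> bool:
    """True if no two edges of the linear graph intersect each other."""
    return not any(intersects(e, f) for e, f in combinations(edges, 2))


if __name__ == "__main__":
    print(is_nested([(0, 5), (1, 4), (2, 3)]))  # a stack of pairs
    print(is_nested([(0, 3), (1, 4)]))          # two crossing arcs
```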

2.3 Complexity Analysis of MAX-NLS

Let G1, . . . , Gm be the linear graphs derived from ncRNA sequences S1, . . . , Sm respec-

tively. The MAX-NLS of these graphs is the largest nested graph Gc such that Gc is a

subgraph of Gi for all i = 1, . . . , m. For any problem instance I = (G1, . . . , Gm), we

write MAX-NLS(I) to indicate this Gc. To represent the size (number of edges) of this

graph, we use the notation |MAX-NLS(I)|.

Note that the MAX-NLS problem represents a slight generalization of the RNA multi-

ple alignment problem, in that we do not constrain the linear graphs to be derived from RNA


strings by connecting every pair of nodes corresponding to complementary nucleotides with

an edge. We motivate this relaxation in the discussion section of the chapter.

The MAX-NLS is an optimization problem, because our objective is to maximize the

size of the common nested subgraph of G1, . . . , Gm. We now formulate the correspond-

ing decision problem, where our objective is to answer a boolean query.

Definition 1. The NLS decision problem (D-NLS) is to determine, given an input G1, . . . , Gm

and a positive integer k (where 1 < k < min_i |Gi|), whether there exists a common nested

linear subgraph of G1, . . . , Gm of size ≥ k.

Theorem 1. D-NLS is NP-Complete.

Proof. We proceed by demonstrating a polynomial reduction from 3-SAT, a well-known

NP-Complete problem (4).

Definition 2. Let x1, . . . , xk be boolean variables. Let ψ1, . . . , ψn be logical clauses, with

each clause ψi being a disjunction of 3 literals, where each literal is either a variable xj

or its negation, ¬xj . The 3-SAT problem is to determine, given this as input, whether there

exists an assignment for the k variables which satisfies all n clauses simultaneously.
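
To make the input format of 3-SAT concrete, here is a brute-force satisfiability check. It is exponential in the number of variables and is only an illustration of the problem statement, not part of the reduction. The encoding is an assumption: the literal xj is the integer j and ¬xj is -j. The name `satisfying_assignment` is hypothetical.

```python
from itertools import product


def satisfying_assignment(num_vars: int, clauses: list[tuple[int, int, int]]):
    """Return a satisfying tuple of booleans (index 0 holds x1), or None."""
    for values in product([False, True], repeat=num_vars):
        # A clause holds if at least one literal agrees with the assignment.
        if all(
            any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return values
    return None


if __name__ == "__main__":
    # (x1 or x2 or x3) and (not x1 or not x2 or not x3)
    print(satisfying_assignment(3, [(1, 2, 3), (-1, -2, -3)]))
```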

To establish the reduction we need to demonstrate that the existence of a polynomial-

time algorithm for D-NLS yields a polynomial-time algorithm for 3-SAT. As such, we

show that given any input instance I3−SAT and a polynomial-time algorithm A for D-NLS,

we can construct, in polynomial time and space, an instance ID−NLS such that computing

A(ID−NLS) will allow us to answer whether the instance I3−SAT is satisfiable. However,

to simplify the description of this construction, we must define the notion of a c-thick edge

(Fig 2.3).


Figure 2.3: A 4-thick edge intersecting a 5-thick edge. Any edge not shown must intersect

either all edges in either stack, or none at all.

Definition 3. In a linear graph, an edge ab contains an edge cd if and only if both c and

d lie strictly between a and b on the line. ab directly contains cd whenever ab contains cd

and there is no other edge e that contains cd but is itself contained by ab.

A c-thick “edge” in a linear graph is a set of c edges e1, . . . , ec with the properties that:

(i) for all i, j such that i > j, ei contains ej

(ii) for any other edge e′, either e′ intersects all ei, or it intersects none of them
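
Definition 3 and the two properties of a c-thick edge can be checked mechanically. The sketch below assumes edges are pairs of positions with distinct endpoints; the helper names `contains`, `intersects`, and `is_thick_edge` are hypothetical.

```python
def contains(outer: tuple[int, int], inner: tuple[int, int]) -> bool:
    """True if both endpoints of inner lie strictly between those of outer."""
    a, b = sorted(outer)
    c, d = sorted(inner)
    return a < c and d < b


def intersects(e: tuple[int, int], f: tuple[int, int]) -> bool:
    """True if exactly one endpoint of f lies strictly between those of e."""
    a, b = sorted(e)
    c, d = sorted(f)
    return (a < c < b) != (a < d < b)


def is_thick_edge(stack: list[tuple[int, int]], all_edges: list[tuple[int, int]]) -> bool:
    """Check properties (i) and (ii) for a candidate stack of edges."""
    # (i) Ordered by span, each edge must be contained in the next larger one.
    ordered = sorted(stack, key=lambda e: abs(e[1] - e[0]))
    chain = all(contains(ordered[i + 1], ordered[i]) for i in range(len(ordered) - 1))
    # (ii) Every other edge intersects all edges of the stack or none of them.
    others = [e for e in all_edges if e not in stack]
    uniform = all(len({intersects(e, s) for s in stack}) <= 1 for e in others)
    return chain and uniform


if __name__ == "__main__":
    stack = [(0, 90), (10, 80), (20, 70)]
    print(is_thick_edge(stack, stack + [(50, 120), (30, 40)]))
```

Property (ii) is what lets a stack behave like a single weighted edge in the reduction: an alignment gains nothing by taking only part of a stack.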

We can now describe the construction of ID−NLS from I3−SAT , as depicted in Fig 2.4.

Given the set of variables x1, . . . , xk and the clauses ψ1, . . . , ψn, we construct k + 1 linear

graphs: one corresponding to each boolean variable in I3−SAT , and an extra graph x′ whose

purpose will be clarified shortly. Each of the k variable graphs consists of two intersecting c1-thick

edges, each of which contains a sequence of n similar groups of edges, where each group

corresponds to a particular clause ψi. Such a group is depicted in detail at the bottom of

Fig 2.4.

The edge group varies slightly depending on which xj and ψi it corresponds to. The

portion common to all such groups consists of three c2-thick edges none of which intersects

or contains the other. Beyond these, each group has up to three of the following set of

mutually intersecting edges: an edge that contains the first and second c2-thick edges, an

edge that contains only the second, and an edge that contains the third, as illustrated in

Fig 2.4. An edge is missing from the group only if the corresponding literal in the clause


Figure 2.4: Constructing an instance of D-NLS from an instance of 3-SAT. Each variable

xj gives rise to a linear graph xj , which consists of two intersecting c1-thick edges, each of

which contains n edge groups corresponding to clauses ψi. Every clause group consists of

3 c2-thick edges in sequence, as well as up to 3 mutually intersecting selection edges, which

are present if ψi does not depend on xj , or the truth value induced upon xj by the label of

the c1 edge makes ψi TRUE. Finally, ID−NLS contains an extra linear graph x′, consisting

of only one c1 edge, which contains the standard collection of 3n c2-thick edges, as well as all

possible selection edges. The goal of the x′ graph is to force an alignment in which, for every

other graph xj, x′ aligns to either the TRUE or the FALSE portion of xj, thus corresponding to

a truth assignment to all variables of the 3-SAT problem.


ψi is not in agreement with the truth assignment induced by the c1 edge to xj . To be more

precise, let ψi = ηa xa ∨ ηb xb ∨ ηc xc, where a ≤ b ≤ c, and the corresponding η can either

be the identity or negation ¬. The edge corresponding to ηa xa is absent if and only if a = j

and ηa xa is false under the truth assignment induced by the c1 edge of xj. If a = b and

ηa = ¬ηb, the edge is present, since a clause containing the disjunction xj ∨ ¬xj is

automatically satisfied and the edge should exist.

The (k + 1)st graph consists of only one c1 edge and n clause groups, each of which contains

all 3 selector edges in addition to the 3 c2 edges. The basic premise of this construction

is that if (and only if) there is a satisfying assignment, we will be able to match the x′ graph

to the corresponding c1 edge in each of the k graphs, and align the n clause groups within.

Only because the assignment is satisfying will we be able to align one additional selector

edge from every clause, giving us the largest possible common subgraph.

Lemma 1. Let c2 = n + 1 and let c1 = 3n^2 + 4n + 1. Under the scheme described above,

the k + 1 linear graphs have a common nested subgraph of size 6n^2 + 8n + 1 if and only if

ψ1, . . . , ψn are satisfiable.

Proof. Suppose the clauses are satisfiable, that is, there exists some assignment to x1, . . . , xk

which satisfies them all. We align the c1 edge of the x′ graph with the c1 edge of graph j

that corresponds to the value of xj in this truth assignment. We then align the c2 edges to

each other. Now consider a particular selector edge in some clause ψi. Because of the way

we aligned the c1 edges, if this edge is absent in any of the half-graphs we selected, it is

because its corresponding literal is false in that clause given the truth assignment. How-

ever, since we assumed our assignment is satisfying, every clause must have a literal that

evaluates to TRUE. The corresponding selector edge must be present in every graph.

We can choose at most one selector edge per clause, since they all intersect each other.


Because we can choose one from every clause, we have a total of

c1 + 3n·c2 + n = 6n^2 + 8n + 1 edges.
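
For reference, the total follows by substituting the values of c1 and c2 from Lemma 1. The subscripted symbols c_1 and c_2 below are the same quantities written c1 and c2 in the text.

```latex
\begin{align*}
c_1 + 3n\,c_2 + n
  &= (3n^2 + 4n + 1) + 3n(n + 1) + n \\
  &= 3n^2 + 4n + 1 + 3n^2 + 3n + n \\
  &= 6n^2 + 8n + 1 .
\end{align*}
```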

Now suppose we indeed have a common nested subgraph of size 6n^2 + 8n + 1. As there

are a total of 6n·c2 edges in c2 stacks and up to 2n selector edges that may be chosen simultaneously,

we could only have 6n^2 + 8n edges without choosing a c1 edge. Thus, we must align a c1

edge, in which case we might as well align the whole stack of them. That leaves 3n^2 + 4n

edges to be included. Note that each c2 stack contributes n + 1 edges, more than the n selector edges that could be chosen

simultaneously, so we must choose all 3n c2 stacks for a total of 3n^2 + 3n edges. This leaves

n edges to be accounted for, all of which must be selector edges, one from each clause.
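
The counting in this direction of the proof can be checked numerically for small n. The sketch below only evaluates the quantities named in Lemma 1 and in the paragraph above; the function name `reduction_counts` and the dictionary keys are hypothetical.

```python
def reduction_counts(n: int) -> dict[str, int]:
    """Edge counts used in the proof of Lemma 1 for a formula with n clauses."""
    c2 = n + 1
    c1 = 3 * n * n + 4 * n + 1
    target = 6 * n * n + 8 * n + 1
    return {
        "target": target,
        # Without any c1 edge: 6n stacks of c2 edges plus at most 2n selectors.
        "max_without_c1": 6 * n * c2 + 2 * n,
        # Edges still needed once a full c1 stack has been aligned.
        "left_after_c1": target - c1,
        # All 3n c2 stacks inside the chosen c1 half.
        "all_c2_stacks": 3 * n * c2,
        "c2": c2,
    }


if __name__ == "__main__":
    for n in range(1, 6):
        r = reduction_counts(n)
        assert r["max_without_c1"] == r["target"] - 1      # a c1 edge is forced
        assert r["left_after_c1"] == 3 * n * n + 4 * n
        assert r["left_after_c1"] - r["all_c2_stacks"] == n  # one selector per clause
        assert r["c2"] > n  # dropping a c2 stack cannot be offset by selectors
    print("counts consistent for n = 1..5")
```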

Note that the c1 alignment we choose induces a truth assignment to our variables. As

we just showed, the size of our alignment implies not only that the c1 and c2 edges are

aligned, but also that under this truth assignment, every clause has a selector edge that is

present in every graph’s chosen c1 half. In particular, that edge is present in the graph

corresponding to its literal, meaning that under this induced truth assignment, the clause is

satisfied because the literal is TRUE. Since this applies to all the clauses, ψ1, . . . , ψn are all

satisfied.

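The time required for this construction is O(kn^2); thus, we have demonstrated a polynomial

reduction to D-NLS from 3-SAT. Since a candidate common subgraph, together with its embedding in each Gi, can be verified in polynomial time, D-NLS is in NP and is therefore NP-Complete.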


2.4 Approximating MAX-NLS with MAX-FLS

In view of Theorem 1 there is little hope for a tractable exact algorithm for MAX-NLS.

Therefore, we present a polynomial time approximation algorithm that guarantees optimality

within a factor of O(log^2 S), where S is the size of the optimal solution. The polynomial

time is achieved by restricting attention to a subclass of nested linear graphs and finding

the optimal member of this restricted subclass. The main tradeoff here is the choice of the

restriction: if the subclass is too narrow, our approximation will be poor; otherwise, finding

the optimal member of the subclass may still be NP-Complete.

The restriction that yields our algorithm is best presented as a composition of two re-

strictions. First, we consider the subclass of NLGs that are flat.

Definition 4. A branching edge in a nested linear graph is an edge e that contains two

edges e1 and e2, neither of which contains the other. A nested linear graph is flat if it

contains no branching edges. The flat order of a nested linear graph is the size of its

largest flat subgraph.
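
Definition 4 translates into a direct check for branching edges. The sketch assumes the input is already a nested linear graph whose edges are pairs of positions with distinct endpoints; the names `contains`, `is_branching`, and `is_flat` are hypothetical.

```python
from itertools import combinations


def contains(outer: tuple[int, int], inner: tuple[int, int]) -> bool:
    """True if both endpoints of inner lie strictly between those of outer."""
    a, b = sorted(outer)
    c, d = sorted(inner)
    return a < c and d < b


def is_branching(edge: tuple[int, int], edges: list[tuple[int, int]]) -> bool:
    """True if edge contains two edges, neither of which contains the other."""
    inside = [f for f in edges if contains(edge, f)]
    return any(
        not contains(f, g) and not contains(g, f)
        for f, g in combinations(inside, 2)
    )


def is_flat(edges: list[tuple[int, int]]) -> bool:
    """True if the nested linear graph has no branching edges."""
    return not any(is_branching(e, edges) for e in edges)


if __name__ == "__main__":
    print(is_flat([(0, 9), (1, 8), (2, 7)]))  # a single stack
    print(is_flat([(0, 9), (1, 4), (5, 8)]))  # (0, 9) branches
```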

The optimization problem corresponding to this restriction is that of finding the largest

common flat nested linear graph (MAX-FLS). We now show that this restriction yields a

solution that is suboptimal by a factor of at most O(log S).

Theorem 2. Every nested linear graph G with flat order FG satisfies |G| ≤ FG log(FG).

Proof. We begin by introducing the tree representation of nested linear graphs in order to relate the main notions of our argument to familiar data structures. The basic transformation is mapping each NLG edge to a node in the tree, as shown in Fig 2.5. We first add an edge containing the entire NLG, for the sake of uniformity. We then construct the tree by mapping each edge $e_i$ to a tree node $n_i$. A node $n_i$ is a parent of another node $n_j$ whenever its corresponding edge $e_i$ directly contains $e_j$ (see Definition 3 for the notion of direct containment). While this transformation is rather elementary, it affords us insights into the notion of flat order. Noting that the notion of a branching edge in an NLG corresponds precisely to a branching node in the tree, we observe the following:

(i) When viewed as a subtree, the path from the root to any leaf contains no branching nodes and is therefore flat. Thus, the flat order $F_T$ satisfies $F_T \ge h(T)$, where $h(T)$ is the node height of T (number of nodes in the longest root-leaf path).

(ii) Consider any disjoint subtrees of T satisfying the property that nodes in different subtrees cannot be descendants or ancestors of each other in T. The union of their flat subtrees will also be flat, as no branching nodes can be introduced by taking the union of flat constituents that have no ancestor relationships amongst one another. Consequently, for any split node in the tree, the sum of the flat orders of its subtrees is $\le F_T$.

Figure 2.5: A tree representation of a nested linear graph. Each node in the tree corresponds to an edge in the graph. Node i is an ancestor of node j in the tree if and only if the corresponding edge i contains j in the graph. For unity of representation, an edge containing all other edges in the NLG is added so that the result of the transformation is a tree rather than a forest. This edge and the corresponding root vertex are represented with dashed lines in the diagram.

We now examine an arbitrary tree Tn with flat order n. We show that |Tn| ≤ n log(n)+

1. We establish the general result by strong induction: assuming the formula holds for

every n′ < n, we show that it holds for n. We enumerate the required base cases in Fig 2.7.


Figure 2.6: An upper bound on a tree’s size NT as a function of its flat order FT . Every

tree is representable as an ℓ-edge (ℓ ≥ 0) chain from its root to the first node where a

split into subtrees T1, . . . , Tk1, Tb1 occurs. These subtrees (labeled with their flat order) are

arranged from left to right according to increasing flat order. We continue splitting the

largest subtree, bi, recursively, until we have no subtrees with flat order > FT/2. This

allows us to prove that NT = O(FT log(FT )).

Each tree can be represented as an initial trunk of length ℓ ≥ 0, followed by a split into

some number of subtrees. Among these we then consider the subtree with the largest flat

order. If its flat order is > n/2, we recursively divide that subtree into a trunk, a splitting

node, and the subtrees at that node. We continue this process until no subtree has flat order

> n/2, as shown in Fig 2.6. Note that there can only be one subtree with flat order > n/2,

so we will never have to subdivide more than one subtree at each level.

We can now write the formula for the number of nodes in $T_n$. From the diagram,

$$|T_n| = \sum_{i=1}^{k_r} |T_{a_i}| + |T_{b_r}| + \sum_{j=1}^{r} \ell_j. \tag{2.1}$$

By the inductive assumption $|T_{a_i}| \le a_i \log(a_i) + 1$ and $|T_{b_r}| \le b_r \log(b_r) + 1$, so

$$|T_n| \le \sum_{i=1}^{k_r} \bigl(a_i \log(a_i) + 1\bigr) + \bigl(b_r \log(b_r) + 1\bigr) + \sum_{j=1}^{r} \ell_j. \tag{2.2}$$

By construction, all $a_i$ and $b_r$ are $\le n/2$. Furthermore, since $n \ge \sum_{i=1}^{k_r} a_i + b_r$, at most 3 of $a_1, \ldots, a_{k_r}, b_r$ may be $> n/4$. When $a_i \le n/4$, $a_i \log(a_i) + 1 \le a_i \log(a_i) + a_i \le a_i \log(2a_i) \le a_i \log(n/2)$, similarly for $b_r$. Thus,

$$|T_n| \le \sum_{i=1}^{k_r} a_i \log(n/2) + b_r \log(n/2) + 3 + \sum_{j=1}^{r} \ell_j.$$

To prove that this implies |Tn| ≤ n log(n) + 1, we now consider 3 cases:

(1) $h(b_r) \ge 2$

Then, according to observation (i), $n \ge \sum_{j=1}^{r} \ell_j + h(b_r) \ge \sum_{j=1}^{r} \ell_j + 2$, therefore,

$$|T_n| \le \log(n/2)\Bigl(b_r + \sum_{i=1}^{k_r} a_i\Bigr) + 1 + n \le n \log(n/2) + n + 1 = n \log(n) + 1.$$

(2) $h(b_r) = 0$

Then $T_{b_r}$ has no nodes, and since by construction it is the largest subtree in its level, it must be that the splitting node at the bottom of trunk $\ell_r$ has no children. This means that either the entire tree is a single trunk, in which case $|T_n| = n \le n \log(n) + 1$, or that $\ell_r > n/2$, since we had to subdivide $T_{b_{r-1}}$. In this case, we have

$$|T_n| \le \sum_{i=1}^{k_{r-1}} \bigl(a_i \log(a_i) + 1\bigr) + \sum_{j=1}^{r} \ell_j.$$

Since $a_i \le n/2$ by construction, we have $a_i \log(a_i) + 1 \le a_i \log(2a_i) \le a_i \log(n)$, and therefore

$$|T_n| \le \log(n) \sum_{i=1}^{k_{r-1}} a_i + \sum_{j=1}^{r} \ell_j,$$

which transforms to $|T_n| < (n/2) \log(n) + n$ since observation (ii) implies $\sum_{i=1}^{k_{r-1}} a_i \le n - \ell_r < n/2$. Finally, since $n \le (n/2) \log(n)$ for $n \ge 4$, we have $|T_n| < n \log(n)$.


Figure 2.7: The largest possible trees with flat order 1, 2, and 3, respectively.

(3) $h(b_r) = 1$

In this case $T_{b_r}$ consists of a single node, so $b_r = 1$. We may now write

$$|T_n| \le \log(n/2) \sum_{i=1}^{k_r} a_i + 3 + 1 \cdot \log(1) + \sum_{j=1}^{r} \ell_j,$$

since at most 3 elements of $a_i$ may be $> n/4$. Noting that $1 = b_r \log(2) \le b_r \log(n/2)$ as long as $n \ge 4$, we have

$$|T_n| \le \log(n/2)\Bigl(b_r + \sum_{i=1}^{k_r} a_i\Bigr) + 1 + \sum_{j=1}^{r} \ell_j.$$

Applying the results of observations (i) and (ii), we have the familiar inequalities $b_r + \sum_{i=1}^{k_r} a_i \le n$ and $1 + \sum_{j=1}^{r} \ell_j \le n$, yielding

$$|T_n| \le n \log(n/2) + n + 1 = n \log(n) + 1.$$

The assumption $n \ge 4$ can be eliminated by noting that the largest trees with flat order $< 4$ still satisfy the bound. These trees are shown in Fig 2.7.

Thus, for an arbitrary tree T with flat order $F_T$, $|T| \le F_T \log(F_T) + 1 = O(F_T \log(F_T))$, which is precisely the statement of the theorem for nested linear graphs.

It is worth noting that this bound is asymptotically tight. Consider the family of trees $T_i$ defined recursively as:

• $T_0$ = a single node.

• $T_{i+1}$ = a trunk of length $2^i$ nodes, which splits into two subtrees $T_i$, as shown in Fig 2.8.


Figure 2.8: A family of trees Ti with flat order Fi that attain the asymptotic upper bound of

O(log(Fi)) on the ratio |Ti|/Fi. The particular tree depicted in the diagram is Ti+1.

By induction, it is clear that both the height and the flat order of $T_i$ are equal to $2^i$. The number of nodes is defined by the recurrence $|T_{i+1}| = 2|T_i| + 2^i$, the solution to which is $|T_i| = 2^{i-1}(i + 2)$. Thus, for any tree T of this family,

$$|T| = \tfrac{1}{2} F_T \bigl(2 + \log(F_T)\bigr) = \Theta(F_T \log(F_T)).$$
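The closed form for this family can be confirmed numerically. The sketch below builds $T_i$ as nested lists and compares its size, height, and flat order with the stated values. The recursion used for the flat order (either keep the node and a single chain beneath it, or drop the node and combine the flat subsets of its child subtrees) is our own reading of observations (i) and (ii), not a procedure given in the text, and all function names are illustrative.

```python
import math


def build(i):
    """Tree T_i as nested lists: a node is the list of its child subtrees."""
    if i == 0:
        return []                          # T_0: a single node
    node = [build(i - 1), build(i - 1)]    # split node (last node of the trunk)
    for _ in range(2 ** (i - 1) - 1):      # remaining trunk nodes above it
        node = [node]
    return node


def size(t):
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t)


def height(t):
    """Number of nodes on the longest root-leaf path."""
    return 1 + max((height(c) for c in t), default=0)


def flat_order(t):
    """Largest flat node subset: a chain through the root, or a union of
    flat subsets taken from the child subtrees (assumed recursion)."""
    return max(height(t), sum(flat_order(c) for c in t))


for i in range(6):
    t = build(i)
    F = flat_order(t)
    assert size(t) == 2 ** i * (i + 2) // 2     # |T_i| = 2^(i-1) (i + 2)
    assert height(t) == F == 2 ** i
    assert size(t) <= F * math.log2(F) + 1      # bound proved above, log base 2
```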

2.5 Approximating MAX-FLS with MAX-LLS

We now further restrict the subclass of NLGs to examine by introducing the notion of level

flat graphs, and the corresponding optimization problem MAX-LLS. First, however, we

prove a useful property of flat linear graphs.

Theorem 3. Any flat nested linear graph G can be written as a union of $k \ge 0$ disjoint subsets, $G = \bigcup_{i=1}^{k} C_i$, where each $C_i$ is a column of edges, i.e. a $|C_i|$-thick edge.

Proof. Consider any edge e ∈ G, and let E be the set of edges that either contain or are

contained by e. Because G is flat, E must form a column: if two distinct edges in E both

contain e, they must contain each other or intersect; if they are both contained by e, they

must contain each other, otherwise e is a branching edge. Now note that by exactly the

same reasoning, there can be no edge e′ ∈ E that contains or is contained by an edge

g ∈ G − E, since g and e cannot contain or be contained by one another: if e′ contains


them both it must be a branching edge, if e′ is contained by both then they must intersect,

and if e′ contains one and is contained by the other, then one must contain the other.

Thus, E is completely disjoint with respect to containment from $G - E$. We can therefore let $C_1 = E$, and continue subdividing $G - E$ in this manner to obtain $C_2, \ldots, C_k$. In the end, each $C_i$ is a column separate from the others, and $G = \bigcup_{i=1}^{k} C_i$.
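The decomposition in Theorem 3 can be sketched in a few lines. This is a minimal illustration, assuming the input is already a flat nested linear graph given as `(left, right)` pairs with distinct endpoints and no crossing edges; under those assumptions a left-to-right sweep suffices, because each edge either nests inside the innermost edge of the current column or starts a new column. The names `columns` and `is_level` are hypothetical.

```python
def columns(flat_edges):
    """Split a flat nested linear graph into its columns C_1, ..., C_k.

    Assumes the input is flat, non-crossing, and has distinct endpoints.
    """
    cols = []
    # Sort by left endpoint; for equal lefts, the wider edge comes first.
    for e in sorted(flat_edges, key=lambda e: (e[0], -e[1])):
        if cols and cols[-1][-1][0] < e[0] and e[1] < cols[-1][-1][1]:
            cols[-1].append(e)   # e nests inside the innermost edge so far
        else:
            cols.append([e])     # e lies to the right: start a new column
    return cols


def is_level(flat_edges):
    """True if all columns have the same number of edges."""
    return len({len(c) for c in columns(flat_edges)}) <= 1
```

For example, `columns([(1, 10), (2, 9), (11, 14), (3, 8), (12, 13)])` yields one column of three edges followed by one column of two edges, so that graph is flat but not level.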

Definition 5. Consider any flat nested linear graph $G = \bigcup_{i=1}^{k} C_i$, where each $C_i$ is a column. G is level if $|C_1| = \ldots = |C_k|$.

The MAX-LLS optimization problem is therefore to find the largest level flat subgraph

in a set of linear graphs. We now show that this further restriction yields an approximation

within a factor of O(log |GF |) of the optimal solution GF to MAX-FLS.

Theorem 4. For any flat nested linear graph GF , its largest level subgraph GL with size

L = |GL| satisfies |GF | = O(L logL).

Proof. We first define two properties of linear graphs that are particularly important for

level graphs.

Definition 6. The length $\ell_G$ of a linear graph G is the size of the largest subgraph of G that consists solely of edges that do not intersect or contain one another, i.e. a flat graph where $|C_i| = 1$ for all i. The height $h_G$ of G is the size of the largest subgraph of G that consists solely of one column, i.e. a flat graph consisting of one $h_G$-thick edge.

These definitions are applicable to any linear graphs, but for level graphs they induce

a compact representation since each level graph corresponds to an ordered pair (h, ℓ), as

shown in Fig 2.9.


Figure 2.9: Level graphs [a] (h, ℓ) = (2, 7) and [b] (h, ℓ) = (4, 2). These particular graphs

represent points on the level signature of the flat graph shown in Fig 2.10.

We now consider an arbitrary flat graph GF with height hG and length ℓG. For each

h = 1, . . . , hG, we let F (h) be the largest value such that the level graph (h, F (h)) is a

subgraph, noting that 1 ≤ F (h) ≤ ℓG (Fig 2.10). The discrete function F is thus uniquely

defined for any flat graph GF . We call this function the level signature of a flat graph.

Note that the level signature is unique for any flat graph, although two distinct flat graphs

may produce the same level signature simply because of different order of the columns.

Each point $(h, F(h))$ corresponds to a level subgraph of $G_F$, as depicted in Fig 2.9. The size of this subgraph is $h F(h)$; therefore, the largest level subgraph of $G_F$ corresponds to the point with the largest $h F(h)$, say $(h^*, \ell^*)$. Thus, $L = |G_L| = h^* \ell^*$.

Let $\mathcal{F}$ be the hyperbola passing through $(h^*, \ell^*)$ with the equation $h\ell = L$. By definition of $(h^*, \ell^*)$, all points of the level signature F must lie on or below this hyperbola. Note that the area under F given by $\sum_{h=1}^{h_G} F(h)$ gives the size of the original flat graph $G_F$, because $F(h)$ counts the number of columns containing an edge at height h. We now rewrite the sum as $|G_F| = \ell_G + \sum_{h=2}^{h_G} F(h)$, noticing that the area represented by the sum is a subset of the area under $\mathcal{F}$ from $h = 1$ to $h = h_G$. Thus,

$$|G_F| \le \ell_G + \int_{1}^{h_G} \mathcal{F}(h)\,dh.$$

Since (1, ℓG) and (hG, 1) are both points of the level signature, ℓG ≤ L and hG ≤ L.


Figure 2.10: Representing the possible level subgraphs of a flat graph G with a discrete

nonincreasing function F , its level signature. Each point (h, F (h)) corresponds to a level

graph with F (h) columns of height h that is the largest level subgraph of G of height h.

The shaded area represents L, the size of the largest level subgraph of G. The hyperbola $\mathcal{F}$ has the equation $h\ell = L$ and lies on or above all points of F.


Evaluating the integral over the hyperbola $\mathcal{F}(h) = L/h$, we have $\int_{1}^{h_G} \mathcal{F}(h)\,dh = \int_{1}^{h_G} (L/h)\,dh = L \log h_G$. Thus, $|G_F| \le L + L \log L = O(L \log L)$.
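The level signature argument can be illustrated on a toy flat graph described only by its column heights. The sketch below is an illustration under assumptions: the column heights `[4, 4, 2, 2, 2, 2, 2]` are a hypothetical example chosen to be consistent with the two level graphs (2, 7) and (4, 2) mentioned for Fig 2.9, and the function names are ours.

```python
import math


def level_signature(column_heights):
    """F(h) = number of columns with at least h edges, for h = 1..h_G."""
    h_max = max(column_heights, default=0)
    return {h: sum(1 for c in column_heights if c >= h)
            for h in range(1, h_max + 1)}


def largest_level_subgraph(column_heights):
    """Point (h*, l*) of the signature that maximizes h * F(h)."""
    sig = level_signature(column_heights)
    return max(sig.items(), key=lambda p: p[0] * p[1], default=(0, 0))


heights = [4, 4, 2, 2, 2, 2, 2]           # hypothetical flat graph
h_star, l_star = largest_level_subgraph(heights)
L = h_star * l_star                       # size of the largest level subgraph
size_GF = sum(heights)                    # |G_F| = sum over h of F(h)
assert size_GF <= L + L * math.log(L)     # bound of Theorem 4 (natural log)
```

Here the signature is `{1: 7, 2: 7, 3: 2, 4: 2}`, the best point is (2, 7) with L = 14, and the flat graph has 18 edges, comfortably within the bound.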

2.6 A Polynomial-Time Algorithm for MAX-LLS

To briefly summarize the results of theorems 2 and 4, a nested linear graph $G_N$ of size S has a flat subgraph $G_F$ of size F satisfying $S = O(F \log F)$. Rewriting, we have $F = \Omega(S/\log F) = \Omega(S/\log S)$ since $F \le S$. The flat graph $G_F$ in turn has a level subgraph $G_L$ of size L satisfying $F = O(L \log L)$, which can be similarly rewritten as $L = \Omega(F/\log S)$ (since $L \le S$). Combining these equations yields $L = \Omega(S/\log^2 S)$.

Since the largest level flat subgraph of $G_N$ has size $L = \Omega(S/\log^2 S)$, and the optimal common level subgraph MAX-LLS has by definition size $\ge L$, we have thus shown that MAX-LLS approximates MAX-NLS within a factor of at most $O(\log^2 S)$. We now present an algorithm to compute MAX-LLS in polynomial time.

The main idea of the algorithm is to efficiently search the space of level subgraphs

for the one with the largest size. For an input instance I consisting of k linear graphs $G_1, \ldots, G_k$, let $\ell_I = \min_{i=1}^{k} \mathrm{length}(G_i)$ and $h_I = \min_{i=1}^{k} \mathrm{height}(G_i)$; these will be computed in the course of the algorithm. We now demonstrate how to find, for any $h \le h_{G_i}$, the largest level graph $(h, \ell)$ which is a subgraph of $G_i$.

For any edge $e = x_i x_j$ where $x_i < x_j$, we compute a subset $S(e)$ of the edges containing e. Each edge in $S(e)$ is indexed by its left coordinate. Iterating through all edges of $G_i$, we only add an edge $e' = x_{i'} x_{j'}$ if $i' < i$ and $j' > j$. If $S(e)$ already contains an edge $e_c$ with left coordinate $x_{i'}$, we keep only whichever of $e'$ and $e_c$ has the smaller right coordinate. This ensures that we only keep the smallest edge containing e for each left coordinate. Thus, $S(e)$ will have size $O(n)$ for every edge.

Using $S(e)$ we can compute the height of every edge in the graph (the height of an edge e is the height of the tallest column in which e is the top edge). We can think of the edges of our linear graph $G_i$ as nodes in a directed acyclic graph (DAG) $G_i^*$, where the edge $e \to e'$ is present in $G_i^*$ if and only if $e' \in S(e)$. Furthermore, we add an auxiliary source node s that has edges to every other node of $G_i^*$, and assign a weight of $-1$ to all edges of $G_i^*$. Clearly, an edge of height h in $G_i$ will have shortest path distance $-h$ from s in $G_i^*$, since the longest path that reaches it from s traverses h edges of weight $-1$. Thus, computing edge heights in $G_i$ is equivalent to computing shortest path distances from s in $G_i^*$.

Thus, to label the edges of $G_i$ with their height we construct the DAG $G_i^*$, and use the Bellman-Ford algorithm for DAGs (5) to compute the shortest path distances from s to every node of $G_i^*$. This computation will be linear in the size of $G_i^*$. We call this procedure vertical labeling.
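Vertical labeling can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation: edges are assumed to be `(left, right)` pairs, an innermost edge is taken to have height 1 so that its label is the negated distance from the source, and the shortest paths are computed by relaxing arcs in a topological order (edges sorted by span, since a containing edge is always strictly wider than the edge it contains). The function names are hypothetical.

```python
def containing_sets(edges):
    """S(e): for each edge e, the smallest containing edge per left coordinate."""
    S = {}
    for (i, j) in edges:
        best = {}  # left coordinate -> smallest right coordinate seen so far
        for (a, b) in edges:
            if a < i and b > j and (a not in best or b < best[a]):
                best[a] = b
        S[(i, j)] = [(a, b) for a, b in best.items()]
    return S


def vertical_labels(edges):
    """Label every edge with its height via shortest paths in the DAG G*."""
    S = containing_sets(edges)
    dist = {e: -1 for e in S}                      # arc s -> e has weight -1
    for e in sorted(S, key=lambda e: e[1] - e[0]):  # topological order
        for outer in S[e]:                          # arc e -> outer, weight -1
            dist[outer] = min(dist[outer], dist[e] - 1)
    return {e: -d for e, d in dist.items()}
```

On `[(1, 10), (2, 9), (3, 8), (11, 14)]` the labels are 3, 2, 1 and 1 respectively. Note that, because of the pruning, an edge that shares its left endpoint with a smaller containing edge is never reached through $S(e)$ in this sketch and may receive a smaller label; the smaller edge can stand in for it in any column.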

Similarly, we compute $R(e)$, a subset of edges that lie to the right of e. We only add an edge if its left coordinate $x_{i'} > x_j$, and we only keep one such edge per left coordinate, the one with the smallest right coordinate, ensuring $|R(e)| = O(n)$ for any edge e. Using the same approach as with vertical labeling, but with the edges of $G_i^*$ given by $R(e)$ instead of $S(e)$, we obtain a labeling of edges according to the length of the largest flat sequence of non-intersecting edges ending at the given edge. The largest label in the graph will have value $\ell_{G_i}$, the length of the graph.

We generalize this approach to produce the largest level subgraph of height h. Starting with the labeling of the edges by height obtained during the vertical labeling phase, we compute $R_h(e)$, which is the same as $R(e)$ in the subset of $G_i$ that has height $\ge h$. In other words, we disregard all edges of height $< h$, and calculate the length of each edge in the remaining subgraph (Fig 2.11). The largest level subgraph of $G_i$ with height h will be $(h, F(h))$, where $F(h)$ is the largest label in the graph obtained in this manner. We call this horizontal labeling.

Figure 2.11: The algorithm for finding the MAX-LLS of a linear graph. First, all edges in the graph are marked with their height (see [a]), using the vertical labeling procedure. Next, for each h, all edges of height < h are ignored, and the remaining edges are marked with their length using horizontal labeling. For h = 1 and h = 2 the results are shown in [b] and [c] respectively.

Using this procedure, we can now find MAX-LLS for an instance I as follows:

1. Label the edges of each graph G ∈ I according to height using the iteration in the

vertical direction.

2. Let hI = minG∈I hG.

3. For $h = 1, \ldots, h_I$ and each $G \in I$, compute the length $F_G(h)$ of the largest level subgraph of G with height h, using horizontal iteration. For each h, let $\ell_h = \min_{G \in I} F_G(h)$. The level graph $(h, \ell_h)$ is the largest common level subgraph for the instance I of height h.

4. While iterating from h = 1 to h = hI , keep track of the largest level subgraph (h, ℓh)

produced in the previous step. Return this subgraph.
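The four steps above can be sketched end to end. This is a simplified illustration, not the procedure analyzed in the text: heights are computed by a direct dynamic program over all contained edges instead of the pruned sets $S(e)$, and the horizontal step uses greedy interval scheduling (sort by right endpoint) in place of the $R(e)$ shortest-path computation. Both substitutions produce the same level signature on inputs with distinct endpoints but do not reproduce the stated running time. Edge encoding and function names are assumptions.

```python
def edge_heights(edges):
    """Height of each edge: 1 + tallest column strictly inside it."""
    height = {}
    # Contained edges are strictly narrower, so process by increasing span.
    for (i, j) in sorted(set(edges), key=lambda e: e[1] - e[0]):
        inner = [height[(a, b)] for (a, b) in height if i < a and b < j]
        height[(i, j)] = 1 + max(inner, default=0)
    return height


def level_signature(edges):
    """F(h) for h = 1..h_G: most side-by-side edges of height >= h."""
    height = edge_heights(edges)
    sig = {}
    for h in range(1, max(height.values(), default=0) + 1):
        count, last_right = 0, None
        # Greedy interval scheduling over edges of height >= h.
        for (i, j) in sorted((e for e in height if height[e] >= h),
                             key=lambda e: e[1]):
            if last_right is None or i > last_right:
                count, last_right = count + 1, j
        sig[h] = count
    return sig


def max_lls(graphs):
    """Largest common level subgraph (h, l) of a list of linear graphs."""
    sigs = [level_signature(g) for g in graphs]
    h_I = min((max(s, default=0) for s in sigs), default=0)
    best = (0, 0)
    for h in range(1, h_I + 1):
        l_h = min(s[h] for s in sigs)          # step 3: common length at h
        if h * l_h > best[0] * best[1]:        # step 4: track the largest
            best = (h, l_h)
    return best
```

For the two toy graphs `g1 = [(1, 10), (2, 9), (3, 8), (11, 20), (12, 19), (21, 24)]` and `g2 = [(1, 6), (2, 5), (7, 12), (8, 11), (13, 18), (14, 17)]`, the signatures are `{1: 3, 2: 2, 3: 1}` and `{1: 3, 2: 3}`, and `max_lls([g1, g2])` returns `(2, 2)`.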

Suppose the k linear graphs in the input I each have $\le n$ nucleotides and $\le e$ edges. Each horizontal or vertical labeling procedure takes $O(ne)$ time, as the DAG constructed for the Bellman-Ford computation will have $O(e)$ nodes and $e \cdot O(n)$ edges. Horizontal labeling must be performed for every h and both types of labeling must be done for each of the k linear graphs. Thus, the overall running time, dominated by horizontal iteration, is $O(khne) = O(kn^2 e)$.

2.7 Discussion

We have introduced a novel computational model for RNA multiple structural alignment,

by representing each RNA sequence as a linear graph and the multiple alignment as a com-

mon nested subgraph. We noted earlier that the MAX-NLS problem represents a relaxation

of RNA multiple structural alignment, because a linear graph derived from an RNA se-

quence by connecting all complementary nucleotide pairs has certain constraints dictating

which edges must exist.

There are sound biological and computational reasons to adopt the more general NLG

model. At times the complementarity of nucleotides is not sufficient for the formation of a

stable hydrogen bond. For instance, adjacent complementary nucleotides are rarely paired

in real structures, because geometric constraints prevent them from achieving an orientation

that allows participation in hydrogen bonding. It is therefore common to explicitly prevent

the structure from pairing such nucleotides (or more generally, nucleotides that are less

than some fixed number of bases apart) by modifying the algorithm used to compute it. In

the NLG model, this can be accomplished simply by not adding such edges to the linear

graphs constructed from each sequence. In general, the NLG model is flexible enough

to allow easy incorporation of biological insights that modify the space of permissible

pairings. Insights that reduce this space are particularly valuable because by decreasing

the number of edges in the resulting linear graphs, the running time of our approximation


algorithm improves accordingly. In addition, heuristic approaches to prune certain edges,

which are deemed unlikely to be included in the final structure, could be combined with

our algorithm in order to reduce running time further. Such enhancements are likely to be

incorporated into any practical algorithm that finds biologically meaningful structures.

The approximation quality, while bounded by $O(\log^2 S)$ in the worst case, will vary depending

on the class of ncRNAs being aligned. When mapped back to the RNA sequence, a

level graph consists of ℓ groups of stems, each consisting of h complementary pairs. Thus,

for ncRNA families whose secondary structure fits this pattern well, such as tRNAs, our

algorithm will perform more accurately.

Compared to the elaborate free energy functions used by several structure-prediction

programs (18; 19), the NLG model uses a fairly rough approximation. The main advan-

tage of the NLG model is the ability to incorporate multiple sequence information without

having a fixed alignment. The approximation algorithm we presented could be used to

obtain a rough alignment and structure, which could then be refined using heuristic meth-

ods with more elaborate scoring models. Such a hybrid would combine theoretical bounds

on approximation quality derived in the NLG framework with the benefits of heuristic ap-

proaches.

Chapter 3

Constrained Element Detection with

GERP++

3.1 Background

Identification and annotation of all functional elements in the human genome is one of the

main goals of contemporary genetics in general, and the ENCODE project in particular

(20; 21; 22). Comparative sequence analysis has become a powerful tool for such analysis,

as sequence conservation due to negative selection is often a strong signal of biological

function. After constructing a multiple sequence alignment, the goal is to quantify evolu-

tionary constraint at the level of individual positions and identify segments of the alignment

that show significantly elevated levels of conservation.

Several computational methods for constrained element (CE) detection have been de-

veloped, with most falling into one of two broad categories: generative model-based ap-

proaches, which attempt to explicitly model the quantity and distribution of constraint


within an alignment, and bottom-up approaches, which first estimate constraint at individ-

ual positions and then look for clusters of highly constrained positions. The main generative

approach, phastCons (23), uses a Hidden Markov Model to find the most likely parse of

the alignment into constrained and neutral hidden states. While HMMs are widely used in

modeling biological sequences, they have known drawbacks: transition probabilities imply

a specific geometric state duration distribution, which in the context of phastCons means

predicted constrained and neutral segment length. This biases the resulting estimates of

element length and total genomic fraction under constraint.

One of the leading bottom-up approaches is GERP (24), which quantifies position-

specific constraint in terms of rejected substitutions (RS), the difference between the neutral

rate of substitution and the observed rate as estimated by maximum likelihood, and heuris-

tically extends contiguous segments of constrained positions (RS > 0) in a BLAST-like

(25) manner. However, GERP is computationally slow because its maximum likelihood

computation uses the Expectation Maximization algorithm (26) to estimate a new set of

branch lengths for each position of the alignment; this step is also undesirable methodolog-

ically because it involves estimating k real-valued parameters from k nucleotides of data.

Furthermore, the extension heuristic used by GERP (and other bottom-up methods (27))

may induce biases in the length of predicted CEs.

In this chapter we present GERP++, which as the name suggests represents a signifi-

cant improvement on the GERP methodology and addresses these weaknesses. GERP++

achieves over 100x speedup over the original GERP algorithm while using a more sta-

tistically robust maximum likelihood estimation procedure. In addition, we introduce a

novel criterion of grouping constrained positions into constrained elements using statisti-

cal significance as a guide and assigning p-values to our predictions. We use a dynamic


programming approach to globally predict a set of constrained elements ranked by their

p-values and coupled with a rigorous false positive rate estimate. Using GERP++ we an-

alyzed an alignment of the human genome and 33 other mammalian species, identifying

over 1.3 million constrained elements spanning over 7% of the human genome with high

confidence. Compared to previous methods, we predict a significantly larger fraction of the

human genome positions under constraint, grouped in a much smaller number of

predicted CEs, with very low false positive rate.

3.2 Results

3.2.1 Overview of the Algorithm

Like other bottom-up approaches, the GERP++ algorithm consists of two components: cal-

culation of position-specific constraint scores for each column of a multiple alignment, and

subsequent annotation of segments that score significantly higher than expected by chance

(Fig 3.1; see Section 3.3 for more detailed description). These are largely independent

procedures: the GERP++ score for a specific position depends almost entirely on the nu-

cleotides at that position and not on any global element predictions, while identification

of statistically significant high-scoring segments depends only on the additivity of individ-

ual position scores and can potentially be used in conjunction with other position-specific

scoring metrics.

Constraint intensity at individual alignment positions is quantified in terms of rejected

substitutions (RS), defined as the number of substitutions expected under neutrality minus

the number of substitutions observed at the position (24). Thus, positive scores represent

constraint or substitution deficit, while negative scores represent a substitution surplus.


Figure 3.1: Overview of GERP++. (1) For each position of the multiple alignment we

compute the conservation score in rejected substitutions by subtracting the estimated evo-

lutionary rate from the neutral rate. The neutral rate is computed by removing species

gapped at that position from the phylogenetic tree and summing the branch lengths of the

resulting projected tree; the evolutionary rate is estimated by computing the maximum like-

lihood rescaling of the projected tree. (2) Given position-specific conservation scores, we

generate a set of candidate elements. (3) For each candidate element, we compute a p-value

to represent the likelihood of observing a segment of equal length and greater than or equal

score. We then select a non-overlapping set of elements in order of increasing p-value.


Since it is impossible to truly observe which and how many substitution events actually

occurred during the course of evolution, we approximate this quantity by estimating the

optimal scaling factor that maximizes the probability of the observed nucleotides in the

scaled neutral tree.

Statistically significant constrained regions are detected by first generating a set of can-

didate elements and then computing p-values based on their score and length that represent

the probability of such a region occurring in the null model (element score is defined as the

sum of RS scores for each position within the element). These p-values are used to rank

CEs in order of significance and report a set of non-overlapping predictions, starting with

the lowest (best) p-value. Rather than using a fixed cutoff, GERP++ estimates the false

positive rate by randomly permuting the input RS-scores and treating any prediction within

the shuffled sequence as a false positive, similar to the first version of GERP (22; 24).
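The two scoring quantities defined in this overview can be written down directly. The sketch below is illustrative only: it assumes the observed number of substitutions at a position equals the neutral rate multiplied by the maximum likelihood scaling factor (our reading of the rescaling described in the Fig 3.1 caption), it takes that scaling factor as an input rather than estimating it, and all names are hypothetical rather than part of GERP++.

```python
def rejected_substitutions(neutral_rate, scale):
    """RS = substitutions expected under neutrality minus those observed.

    neutral_rate: sum of branch lengths of the projected neutral tree.
    scale: assumed maximum likelihood scaling factor for that tree, so the
           observed rate is taken to be scale * neutral_rate.
    """
    return neutral_rate * (1.0 - scale)


def element_score(rs_scores, start, end):
    """Score of a candidate element: sum of RS scores in [start, end)."""
    return sum(rs_scores[start:end])
```

A fully conserved position (`scale = 0`) scores the whole neutral rate, while a position evolving faster than neutral (`scale > 1`) receives a negative score, matching the interpretation of positive scores as a substitution deficit and negative scores as a surplus.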

3.2.2 Constraint in the Human Genome

We used GERP++ to analyze the TBA alignment of the human genome to 33 other mam-

malian species spanning over 3 billion positions and 5.83 substitutions per site in phylo-

genetic scope. We identified 1,354,034 constrained elements covering 214,749,502 nu-

cleotides, or approximately 7% of the human genome, with an estimated false positive rate

of 0.86% at the nucleotide level (see Section 3.3 for details). Compared to a slightly nega-

tive background average of −0.125 RS, GERP++ predictions and certain known functional

elements show elevated amount of constraint, in excess of 1.7 RS. GERP++ elements range

in size from 4 to nearly 2000 bases, with mean length of 158.6 nucleotides. The minimum

and maximum lengths are parameters of the algorithm, and the tail of the length distribution

(Fig 3.2) suggests that with a more permissive upper bound even longer elements could be identified.

Figure 3.2: Distribution of GERP++ constrained element lengths.

We observe significant variation at the level of entire chromosomes in both average RS

score and fraction predicted to belong to constrained elements (Fig 3.3). The mean con-

straint level varied from −0.3 to −0.05 RS with the exception of chromosome X, which

was the only chromosome with a positive average RS score, just under 0.1 RS. This result

is consistent with earlier work of (28), which suggested reduced mutation rate of the X

chromosome in rodents. We also observe substantial fluctuation in the chromosome frac-

tion predicted to be inside constrained elements, which varied from 1% of chromosome


Y to 4-9% for other chromosomes. Admittedly this metric is skewed for chromosome Y

because a large portion of the alignment there does not have enough species for rate esti-

mation, but even when adjusting for “effective” chromosome size (Fig 3.3B) much of the

fluctuation remains. Surprisingly, despite the low fraction within constrained elements, Y

does not have a particularly low average RS score, while X does not exhibit a high CE frac-

tion despite the positive average RS. In fact, there appears to be at best weak correlation

between these two metrics of constraint.

3.2.3 Estimating Detectable Constraint

The only major parameter for GERP++ is a false positive rate cutoff that determines at

what point the algorithm should stop generating predictions in order to avoid too many

false discoveries. Throughout its execution GERP++ keeps track of the constrained ele-

ments predicted so far, as well as estimates of the number and total size of false positive

predictions for that cutoff level. Examining how these quantities grow as the cutoff pa-

rameter increases permits us to estimate the amount of total constraint that can be detected

using this methodology and give an approximate upper bound on the amount of constraint

within the human genome.

Figure 3.3: Per-chromosome constraint intensity. (A) Mean RS score for all alignment positions where evolutionary rate was computed. Note the elevated average score for chromosome X. (B) Fraction of chromosome that falls into predicted constrained elements. Light green bars show fraction of entire chromosome, while dark green bars show fraction adjusted for regions where no rate computation was performed and no elements could span (see Section 3.3).

Figure 3.4: Estimating detectable constraint. The red curve represents the number of bases within predicted constrained elements as a function of the false positive cutoff parameter. The blue curve represents the number of predicted bases minus the expected number of false positive bases, also as a function of the false positive cutoff.

Let $B(c)$ be the number of bases within constrained elements predicted at false positive cutoff c, and let $B^*(c) = B(c) - F(c)$ be the same quantity adjusted for false positive predictions by subtracting the estimated number of false positive bases (as found in shuffled alignments) at cutoff c. Figure 3.4 shows $B$ and $B^*$ as a function of c from 0 to 50%: while $B$ continues to increase, $B^*$ starts to level off right as $B$ begins to grow linearly. This suggests that $\max_c B^*(c)$ can be used to estimate the total number of bases in constrained elements that can be annotated using this method in any given region or the entire genome. Approximately 225 megabases, or nearly 7.3% of the human genome, can be detected as constrained using GERP++ at the mammalian phylogenetic scope. If we adjust for the portions of the genome where rate estimation was not performed (but with a deeper alignment might be in the future), we estimate that up to 8% of the human genome consists of constrained elements detectable using this kind of methodology. Combined with the observation that about 190 megabases, or 6.2%, can be detected at a false positive cutoff of 0 (Fig 3.4), we obtain a fairly narrow estimate of 6-8% of the human genome under detectable mammalian constraint.
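The adjustment $B^*(c) = B(c) - F(c)$ and the estimate $\max_c B^*(c)$ amount to a few lines of arithmetic. The sketch below uses made-up toy numbers purely to show the computation; they are not values from the analysis, and the function names are hypothetical.

```python
def adjusted_bases(B, F):
    """B*(c) = B(c) - F(c) at each cutoff, given parallel lists."""
    return [b - f for b, f in zip(B, F)]


def detectable_constraint(cutoffs, B, F):
    """Cutoff and value at which the adjusted base count B* peaks."""
    adj = adjusted_bases(B, F)
    k = max(range(len(adj)), key=adj.__getitem__)
    return cutoffs[k], adj[k]


# Toy inputs (invented for illustration only): predicted bases B keep
# growing with the cutoff while estimated false positive bases F grow faster.
cutoffs = [0.0, 0.1, 0.2, 0.3]
B = [10, 14, 18, 24]
F = [0, 1, 4, 11]
best_cutoff, best_value = detectable_constraint(cutoffs, B, F)
```

With these toy inputs the adjusted counts are `[10, 13, 14, 13]`, so the peak is 14 at cutoff 0.2, mirroring the leveling off of $B^*$ described above.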

3.2.4 Association of CEs with Known Functional Elements

We next examine the relationship between evolutionary constraint and several classes of

biologically important regions. Overall, coding exons exhibit by far the strongest lev-

els of constraint, as quantified both by the average RS score within functional elements

(Fig 3.5A), and by fraction of bases that overlap the predicted CEs (Table 3.1). Both 5’

and 3’ UTR regions show weaker but noticeable constraint levels and, somewhat surpris-

ingly, introns on average have slightly lower RS scores than the overall genomic baseline.

However, a nontrivial fraction of introns does exhibit evidence of constraint, as nearly 7%

of intron positions overlap predicted elements (Table 3.1), and these positions make up a

large fraction of constrained element bases (Fig 3.5B).

Over 94% of the coding exons in the human genome overlap at least one predicted CE;

conversely, only about 16% of constrained elements overlap a coding exon. Such CEs tend

to be about 60 nucleotides or 40% longer on average compared to elements that do not

overlap exons, with more than a twofold difference in score (both t-tests significant at p-

value< 2.2×10−16). While overall these results are consistent with earlier findings 3.6, the


(a)

(b)

Figure 3.5: Relationship between CEs and known functional elements. (A) Mean rejected

substitution scores for entire human genome, constrained elements predicted by GERP++,

and known annotated exons, introns, and UTR regions. (B) Breakdown of constrained

element positions by region type.


Annotation    % Coverage by CEs
Exons         84.6%
Introns       6.9%
UTR 5’        23.7%
UTR 3’        33.9%
ncRNA         10.1%

Table 3.1: Fraction of Functional Regions Covered by Constrained Elements on a Nucleotide Level

length difference between exon-associated and non-overlapping CEs is somewhat smaller

than what was previously found. This can be partially explained by considering the differ-

ences in the pattern of constraint between coding exons and other regions. Because GERP

by default only merges blocks of contiguous constrained positions if they are separated by

at most one unconstrained position, it is far more likely to generate longer elements in ex-

onic regions where most unconstrained bases correspond to 3rd positions of a codon and are

usually flanked by constrained positions. In noncoding regions where unconstrained posi-

tions are distributed more irregularly and often occur consecutively, the GERP algorithm

ends up fragmenting longer constrained regions and generating shorter elements. Because

GERP++ does not base merging decisions on any such fixed threshold it is able to better

annotate longer noncoding CEs.
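The fixed-gap merging rule described above can be illustrated with a small sketch. It assumes each position has already been labeled constrained or unconstrained; the function name `merge_constrained`, the `max_gap` parameter, and the two toy patterns are hypothetical and are not taken from the GERP implementation.

```python
def merge_constrained(flags, max_gap=1):
    """Merge runs of constrained positions separated by at most `max_gap`
    unconstrained positions; return half-open (start, end) intervals."""
    elements = []
    start = None   # start of the element currently being built
    last = None    # index of the most recent constrained position
    for i, constrained in enumerate(flags):
        if not constrained:
            continue
        if start is None:
            start = i
        elif i - last - 1 > max_gap:
            # Gap too wide: close the current element and open a new one.
            elements.append((start, last + 1))
            start = i
        last = i
    if start is not None:
        elements.append((start, last + 1))
    return elements


# Codon-like pattern (assumed for illustration): every third position unconstrained.
exonic = [c == "C" for c in "CCUCCUCCUCCU"]
# Irregular pattern with consecutive unconstrained positions.
noncoding = [c == "C" for c in "CCCUUCCUUUCC"]

print(merge_constrained(exonic))     # one long element
print(merge_constrained(noncoding))  # several short fragments
```

With a one-position gap limit the codon-like pattern merges into a single element, while the irregular pattern of the same length is split into three, which is the length bias the paragraph attributes to the fixed threshold.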

To further test this hypothesis, and to investigate a potentially useful signal for detecting

coding exons, we introduce a metric that rigorously quantifies this pattern of constraint for

any region. For any given segment, we define the 3-periodicity bias as the maximum over

the 3 possible reading frames of the mean RS score at positions 1 and 2 minus the mean

RS score at position 3. This metric quantifies a periodic bias in constraint and effectively

deals with unknown reading frame location and lack of a reading frame altogether, since

the maximum is taken over all 3 possibilities. As figure 3.6 shows, the 3-periodicity bias


Figure 3.6: Distributions (smoothed histograms) of 3-periodicity bias for known exons

(red), introns (green), CEs that overlap exons (orange), and CEs not overlapping exons

(blue).


Type                         Mean 3-Periodicity Bias
Exons                        2.96
UTR 5’                       0.57
UTR 3’                       0.32
Introns                      0.18
CEs overlapping exons        2.46
CEs not overlapping exons    0.55

Table 3.2: Mean 3-Periodicity Bias for Different Types of Regions

is a strong signal characteristic of coding exons (mean 2.96) compared to other regions

such as UTRs, introns, and ncRNAs (mean 0.13-0.38, difference significant at p-value

$< 2.2 \times 10^{-16}$). We partitioned the constrained elements predicted by GERP++ according

to exon overlap, and found that CEs overlapping coding exons have a much greater mean

3-periodicity bias (Table 3.2). However, the difference between CEs that did not overlap

any annotated exons, and known nonexonic regions such as introns was still significant,

suggesting some of these CEs intersect unannotated exonic regions. To test this hypothesis,

we checked the constrained elements that did not overlap any known coding exons against

exon predictions made by the computational gene prediction tool CONTRAST (29). We

found 16881 CEs (making up 1.5% of all CEs that did not overlap known genes) that

overlapped CONTRAST predictions, and these CEs had a significantly higher 3-periodicity

bias (1.33) than those that did not overlap CONTRAST predictions (0.54). As this is still

higher than the average 3-periodicity of clearly non-exonic elements, it is possible that

these elements may overlap unannotated exons or pseudogenes with recently lost function.
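A minimal sketch of the 3-periodicity bias as defined in this section: for each of the three reading frames, take the mean RS score at codon positions 1 and 2 minus the mean at position 3, then keep the maximum. The function name and the toy score vectors are assumptions made for illustration.

```python
from statistics import mean


def three_periodicity_bias(rs_scores):
    """Max over the 3 reading frames of
    mean RS at codon positions 1-2 minus mean RS at codon position 3."""
    if len(rs_scores) < 3:
        raise ValueError("need at least 3 positions")
    best = float("-inf")
    for frame in range(3):
        # In this frame, index i is a third codon position when (i - frame) % 3 == 2.
        third = [s for i, s in enumerate(rs_scores) if (i - frame) % 3 == 2]
        first_two = [s for i, s in enumerate(rs_scores) if (i - frame) % 3 != 2]
        best = max(best, mean(first_two) - mean(third))
    return best


coding_like = [3.0, 3.0, 0.0] * 4   # toy scores: every third position unconstrained
shifted = [0.0, 3.0, 3.0] * 4       # same pattern in a different frame
flat = [1.0] * 12                   # no periodic pattern
```

Because the maximum is taken over all frames, `coding_like` and `shifted` receive the same bias, while a segment with no periodic pattern scores zero.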

3.2.5 Comparison with PhastCons

We compared the GERP++ constrained element predictions in placental mammals (see Sec-

tion 3.3) to phastCons (23), the leading generative model-based tool. Not surprisingly, we


found significant overlap between GERP++ and phastCons predictions: 80% of GERP++

predictions overlapped at least one phastCons prediction, and vice versa. However, aside

from both algorithms detecting clearly constrained areas, there are substantial differences:

GERP++ predicts significantly fewer elements, which are much longer on average and

cover a substantially larger portion of the human genome—almost twice as much as the

4% predicted by phastCons (Fig 3.7A). As a result, on a nucleotide level GERP++ overlaps

90% of phastCons predictions while only half of GERP++ CE positions are covered by

phastCons.

Part of the reason for these differences is that often phastCons predicts multiple ele-

ments where GERP++ makes one longer prediction. PhastCons thus skips intermediate po-

sitions which may be under weaker constraint yet still part of one large functional element,

as in the example in Fig 3.8. In order to demonstrate that this is not an isolated occurrence

and to quantify fragmentation of known functional elements, we computed the number of

distinct predicted constrained elements overlapping each annotated coding exon. While

the total number of exons that overlap at least one constrained element prediction is ap-

proximately the same between the two methods, GERP++ is significantly more effective at

identifying entire exons as a single predicted CE, rather than fragmented between two or

more CEs like phastCons (Fig 3.7C,D).
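The fragmentation count described above (number of distinct predicted elements overlapping each annotated exon) can be sketched with two binary searches per exon. The sketch assumes half-open intervals and a sorted, non-overlapping prediction set; the coordinates are invented for illustration and are not real predictions.

```python
from bisect import bisect_left, bisect_right


def count_overlaps(exons, elements):
    """For each exon (start, end), count the elements that overlap it.

    Intervals are half-open; `elements` must be sorted and pairwise disjoint,
    so both their starts and their ends are in increasing order.
    """
    starts = [s for s, _ in elements]
    ends = [e for _, e in elements]
    counts = []
    for ex_start, ex_end in exons:
        lo = bisect_right(ends, ex_start)   # first element ending after the exon start
        hi = bisect_left(starts, ex_end)    # first element starting at or after the exon end
        counts.append(max(0, hi - lo))
    return counts


# Hypothetical coordinates: one prediction set covers each exon in one piece,
# the other splits the first exon into three pieces.
exons = [(100, 200), (300, 350), (500, 600)]
single_piece = [(90, 210), (290, 360)]
fragmented = [(100, 130), (150, 170), (180, 205), (300, 350)]
```

A histogram of these counts over all annotated coding exons gives the kind of comparison shown in Fig 3.7C,D.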

Due in part to its ability to annotate larger elements in one piece, GERP++ is more

effective at predicting constraint within several types of known functional regions. At the

nucleotide level GERP++ elements cover a substantially larger fraction of several major

types of functional elements, especially coding exons and UTRs (Fig 3.7B). The improved

ability to detect known functional elements suggests GERP++ may also be more effective

at predicting unannotated regions that are not only constrained but also functional.



Figure 3.7: GERP++ vs phastCons predictions. (A) Mean length (left), number (middle) and total length (right) of constrained elements predicted by GERP++ (blue) and phastCons (yellow). (B) Nucleotide-level fraction of annotated exons, introns, UTRs and noncoding RNA genes covered by GERP++ (blue) and phastCons (yellow) predictions. (C, D) Histogram of number of distinct predicted GERP++ (blue, D) and phastCons (yellow, C) constrained elements overlapping each annotated coding exon. Note the difference in scale on the y-axis.


Figure 3.8: Constrained region slightly over 200 base pairs in length containing known

exon as annotated by GERP++ (labeled ’GERP++’, black) and phastCons (purple track

labeled ’Mammal El’). Note the fragmentation of the single functional region into multiple

CE predictions by phastCons.

3.3 Methods

3.3.1 Estimation of Evolutionary Rates and RS Scores

Given a phylogenetic tree with branch lengths and a multiple sequence alignment of the

species within that tree, GERP++ quantifies constraint intensity at each individual position

in terms of rejected substitutions (24), the difference between the neutral rate and the es-

timated evolutionary rate at the position. For our analysis the alignment was compressed

to remove gaps in the reference sequence (human), although the RS score computation

algorithm does not assume any specific reference sequence. In order to estimate the evo-

lutionary rate we model nucleotide evolution as a continuous-time Markov process, which

specifies for each pair of nucleotides a and b and duration t the probability of a transform-

ing into b over time t, designated by pab(t). Many such evolutionary models have been


developed (30; 31; 32), each with its own set of simplifying assumptions. GERP++ implements the HKY85 model (32), but any time-reversible model (where $p_a\,p_{ab}(t) = p_b\,p_{ba}(t)$ for all pairs of nucleotides a and b, with $p_a$ the equilibrium frequency of nucleotide a) that permits efficient computation of $p_{ab}(t)$ can be used instead.
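As an illustration of the quantities involved, the following sketch builds an HKY85 rate matrix and computes $p_{ab}(t)$ numerically, then checks the time-reversibility condition. It is not the GERP++ implementation: the matrix exponential via a Taylor series with scaling and squaring is a generic stand-in, `kappa` is assumed to be the transition/transversion rate ratio, and all function names and the example frequencies are hypothetical.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(pi, kappa):
    """HKY85 rate matrix Q (rows/columns in ACGT order), scaled so that
    one unit of time corresponds to one expected substitution per site."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = (kappa if (a, b) in TRANSITIONS else 1.0) * pi[j]
        q[i][i] = -sum(q[i])   # diagonal is still 0.0 here, so this sums the off-diagonals
    rate = -sum(pi[i] * q[i][i] for i in range(4))
    return [[x / rate for x in row] for row in q]


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_probs(q, t, squarings=10, terms=12):
    """P(t) = exp(Q t) via a truncated Taylor series plus scaling and squaring."""
    scale = t / 2 ** squarings
    step = [[x * scale for x in row] for row in q]
    p = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in p]
    for n in range(1, terms + 1):
        term = [[x / n for x in row] for row in matmul(term, step)]
        p = [[p[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    for _ in range(squarings):
        p = matmul(p, p)
    return p


pi = [0.3, 0.2, 0.2, 0.3]   # assumed equilibrium frequencies for A, C, G, T
p = transition_probs(hky85_rate_matrix(pi, 2.0), 0.5)
```

Each row of `p` sums to one, and `pi[a] * p[a][b]` equals `pi[b] * p[b][a]` up to rounding, which is the reversibility condition stated above.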

For each individual alignment column GERP++ labels the leaves of the phylogenetic

tree with the corresponding nucleotides c1, . . . , ck; gapped species are projected out. Al-

though this is not necessarily ideal and sometimes leads to information loss, it avoids some

of the common difficulties and potentially serious biases that accompany modeling gaps

in alignments: aligner errors and artifacts that result from simplified gap penalties and

incorrect handling of duplications and rearrangements, assembly mistakes, and missing se-

quence data. Furthermore, this treatment of gaps avoids penalizing constrained elements

that have undergone lineage-specific deletion (24).

Once the gapped species are removed, the site-specific neutral rate is computed as the

sum of the branch lengths in the trimmed tree. When there are fewer than 3 species remain-

ing no rate estimation is performed for that position, as there are not enough species to even

form a valid tree. To estimate the evolutionary rate we introduce a scaling parameter r that

represents the site’s rate of evolution relative to neutrality. When r < 1 the quantity 1 − r

can be naturally interpreted as the fraction of neutral substitutions “rejected” by evolution-

ary selection. GERP++ estimates r by maximum likelihood, where the likelihood is given

by L(r) = Pr(c1, . . . , ck|Tr), where Tr is the neutral tree T scaled by r. For any given

r, and therefore fixed tree Tr, this function can be computed efficiently using a dynamic

programming algorithm due to Felsenstein (33). If n is an internal node with children n1

and n2, and c1, . . . , ckn represents the subset of the leaves corresponding to the subtree


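rooted at n, then

$$
\begin{aligned}
\Pr(c_1, \dots, c_{k_n} \mid n = a) &= \Pr(c_1, \dots, c_{k_{n_1}} \mid n = a) \cdot \Pr(c_1, \dots, c_{k_{n_2}} \mid n = a) \\
&= \Bigl(\sum_b \Pr(c_1, \dots, c_{k_{n_1}} \mid n_1 = b) \cdot p_{ab}(T_r(n, n_1))\Bigr) \cdot \Bigl(\sum_b \Pr(c_1, \dots, c_{k_{n_2}} \mid n_2 = b) \cdot p_{ab}(T_r(n, n_2))\Bigr)
\end{aligned}
$$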

where $T_r(x, y)$ is the branch length in $T_r$ between two neighboring nodes x and y. Since the leaf nucleotides are observed, this equation can be used to compute the subtree probability for all internal nodes, starting at the bottom and reaching the root, where we can compute $L(r) = \Pr(c_1, \dots, c_k \mid T_r) = \sum_a \Pr(c_1, \dots, c_k \mid \text{root} = a) \cdot p_a$. Assuming a fixed alphabet and an evolutionary model where the probabilities $p_{ab}(t)$ are computable in constant time, this algorithm runs in time O(k) where k is the number of species in the phylogenetic tree.
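The recursion can be sketched in a few lines. To keep the example short, it uses the Jukes-Cantor model (30) in place of HKY85, since its $p_{ab}(t)$ has a simple closed form; the nested-tuple tree encoding, the function names, and the example column are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor p_ab(t), with t in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def subtree_probs(node, r):
    """Return {a: Pr(leaves below node | node = a)} on the tree scaled by r.

    A node is either a leaf nucleotide (str) or a pair of
    (child, branch_length) tuples, one per child.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    probs = {a: 1.0 for a in BASES}
    for child, length in node:
        below = subtree_probs(child, r)
        for a in BASES:
            probs[a] *= sum(below[b] * jc69(a, b, r * length) for b in BASES)
    return probs


def likelihood(root, r, freqs=None):
    """L(r) = sum over a of Pr(leaves | root = a) * p_a."""
    freqs = freqs or {a: 0.25 for a in BASES}
    at_root = subtree_probs(root, r)
    return sum(at_root[a] * freqs[a] for a in BASES)


# Hypothetical 3-species column on the tree ((A:0.1, A:0.1):0.05, G:0.2).
tree = (((("A", 0.1), ("A", 0.1)), 0.05), ("G", 0.2))
print(likelihood(tree, 1.0))
```

Each node is visited once and does a constant amount of work for a fixed alphabet, which matches the O(k) running time stated above.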

Using this algorithm as a subroutine to calculate L(r), GERP++ computes the maxi-

mum likelihood value of r using Brent’s method (34; 35), a numerical optimization tech-

nique that tends to require relatively few computations of the function being optimized.

The evolutionary rate for a site with neutral rate n is estimated to be rn, and the final RS

score is computed as $n - rn = n(1 - r)$. As maximum likelihood may estimate very large

or even infinite values of r, we impose a cap of r = 3 on GERP++ rate estimates, yielding

RS scores that range between −2n and +n. These scores are then used as the basis for

prediction of constrained elements within the region.
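The following sketch shows how a capped maximum likelihood estimate of r turns into an RS score. Several simplifications are assumptions and not the GERP++ method: golden-section search replaces Brent's method, the likelihood is evaluated on a tiny star tree under the Jukes-Cantor model, and all names are hypothetical.

```python
import math

R_MAX = 3.0  # cap on r stated in the text


def jc69(a, b, t):
    """Jukes-Cantor p_ab(t), with t in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def star_likelihood(column, lengths, r):
    """L(r) when every leaf hangs directly off the root (toy tree)."""
    return sum(
        0.25 * math.prod(jc69(a, c, r * t) for c, t in zip(column, lengths))
        for a in "ACGT"
    )


def argmax_golden(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = hi - inv_phi * (hi - lo), lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:            # maximum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
        else:                  # maximum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
    return (lo + hi) / 2.0


def rs_score(column, lengths):
    """Return (r_hat, RS) with RS = n * (1 - r_hat), n being the neutral rate."""
    n = sum(lengths)           # neutral rate: total branch length of the tree
    r_hat = argmax_golden(lambda r: star_likelihood(column, lengths, r), 0.0, R_MAX)
    return r_hat, n * (1.0 - r_hat)
```

For a perfectly conserved column such as `"AAA"` the estimate goes to r = 0 and the score approaches +n; for a column with all leaves different the estimate hits the cap and the score approaches -2n, the two extremes of the range given above.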


3.3.2 Computation of p-values and Element Prediction

Given position-specific constraint scores, GERP++ generates a list of elements that exhibit

evidence of evolutionary constraint beyond what is likely to occur by chance. For each

element, we compute a p-value that represents the probability of a random neutral segment

of equal length having an equal or higher RS score. In addition to being used to select

final predictions from the set of candidate elements, these p-values in conjunction with

position-specific scores provide useful information for biological analysis.

Every segment of contiguous multiple alignment columns is a candidate element. Be-

cause considering all possible segments within the alignment is computationally infeasible,

GERP++ generates a list of candidate elements using several simple biological heuristics

to prune the possibilities. First, we impose a user-specified minimum and maximum on

candidate element length; while real functional elements vary in length, very few extend

beyond several thousand bases, and even these will not be missed entirely as GERP++

will identify their most constrained parts. Second, since positive RS scores indicate con-

straint, GERP++ allows only candidate elements that start and end at positions with RS =

0 and cannot be extended further in either direction; this rule has the additional benefit of

imposing sensible boundary conditions on predicted elements. Finally, we only consider

candidate elements with score above a certain value, which is a function of the element

length and the median neutral rate of the region. This allows pruning of candidate elements

that have low score relative to their length and will end up with poor p-values anyway;

ignoring them early reduces the memory requirements considerably since most candidate

elements would otherwise fall into this category.

Using neutrality as the null hypothesis, we can now define p-values for candidate and

predicted elements on the basis of score and length. If the probability of a single neutral


position having RS score x is given by P (x), then for an element of length L and score S

the p-value is the probability of having score at least S in exactly L positions, and is given

by

$$
\mathrm{pval}(L, S) = \sum_x \mathrm{pval}(L - 1, S - x) \cdot P(x)
$$

These p-values can be computed using dynamic programming, for L = 1, . . . , Lmax,

provided the distribution P (x) can be computed and the space of possible scores x is not

too large. The latter is accomplished by discretizing to within a specified tolerance t; since

individual scores range from −2n to +n, there are 3n/t possible discretized scores. We

now build a histogram of these discrete scores from the alignment, with two exceptions.

First, we exclude long consecutive runs of “shallow” positions, i.e., positions with a neutral

rate below a specified cutoff, as there are many such primate-specific regions and they

tend to skew the score distribution. Additionally, remaining shallow positions are given

a small penalty to discourage GERP++ from predicting CEs consisting mostly of shallow

positions. Second, we exclude positions that belong to clearly constrained regions, which

are identified using a preliminary pass of the algorithm. Also, in order to eliminate artifacts

caused by zero probabilities, we add a small uniform prior to the histogram to ensure every

discretized score appears at least once.
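The p-value recurrence can be sketched as a bottom-up dynamic program over integer (discretized) scores. The base case used here, pval(0, S) = 1 if S <= 0 and 0 otherwise, is an assumption consistent with the recurrence but not spelled out in the text; the function name and the toy distributions are hypothetical.

```python
def build_pvalue(score_probs, max_len):
    """Return pval(L, S) = Pr(sum of L i.i.d. neutral scores >= S).

    `score_probs` maps integer discretized per-position scores x to P(x).
    """
    x_min, x_max = min(score_probs), max(score_probs)

    def lookup(table, length, score):
        if score <= length * x_min:
            return 1.0   # every outcome reaches the target
        if score > length * x_max:
            return 0.0   # no outcome can reach the target
        return table[score]

    tables = [{}]        # length 0 is fully handled by the bounds in lookup
    for length in range(1, max_len + 1):
        prev = tables[-1]
        tables.append({
            s: sum(p * lookup(prev, length - 1, s - x)
                   for x, p in score_probs.items())
            for s in range(length * x_min + 1, length * x_max + 1)
        })
    return lambda length, score: lookup(tables[length], length, score)


# Toy neutral distribution: score 0 or 1 with equal probability.
pval = build_pvalue({0: 0.5, 1: 0.5}, max_len=3)
print(pval(3, 2))   # probability that three positions sum to at least 2
```

Each table row depends only on the previous one, so an implementation that needs only the largest length could keep two rows at a time; the sketch stores all rows so that any length up to `max_len` can be queried.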

Once all candidate elements have been assigned p-values, GERP++ selects elements in

a greedy manner, from lowest to highest p-value, discarding any elements that overlap

previously reported elements. As the p-value increases so does the expected false pos-

itive rate of our predictions; when this reaches a user-specified threshold the algorithm

terminates. While it would be ideal to compute this directly from the p-values, the multiple

hypothesis correction in this case is non-trivial because GERP++ reports a non-overlapping


set of predictions. Therefore, we adopt the approach of Cooper et al. (24) and estimate the

false positive rate from independent permuted alignments.
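The greedy selection step can be sketched as follows. The stopping rule here is a plain p-value cutoff, `max_pvalue`, which is only a hypothetical stand-in for the false positive rate estimated from permuted alignments; the candidate tuples are invented for illustration.

```python
def select_elements(candidates, max_pvalue=1.0):
    """Greedy selection of non-overlapping elements in order of increasing p-value.

    `candidates` are (start, end, pvalue) tuples with half-open coordinates.
    """
    chosen = []
    for start, end, pvalue in sorted(candidates, key=lambda c: c[2]):
        if pvalue > max_pvalue:
            break   # every remaining candidate has a larger p-value
        # Keep the candidate only if it overlaps nothing reported so far.
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, pvalue))
    return sorted(chosen)


candidates = [
    (0, 50, 1e-8),
    (40, 90, 1e-5),
    (100, 120, 1e-3),
    (45, 60, 1e-10),
]
print(select_elements(candidates))
```

The overlap check is linear in the number of elements already chosen, which is fine for a sketch; a genome-scale implementation would more likely use an interval index.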

3.3.3 Overview of the Data

TBA (36) alignments of the human genome (hg18) to 43 other vertebrate species were ob-

tained from the UCSC genome browser (37; 38) together with a phylogenetic tree with the

generally accepted topology (Fig 3.9) and neutral branch lengths estimated from 4-fold de-

generate sites. Both the tree and alignments were projected to the 34 mammalian species.

The alignment was compressed to remove gaps in the human sequence, and GERP++

scores were computed for every position with at least 3 ungapped species present, or ap-

proximately 88.9% of the 3.08 billion positions on the 22 autosomes and X/Y chromo-

somes. We used the HKY85 (32) model of evolution with the transition/transversion ratio

set to 2.0 and nucleotide frequencies estimated from the multiple alignment.

To limit memory requirements and allow parallelization of the constrained element

computation, each chromosome was broken up into regions of approximately 2 megabases,

with long segments where no RS score was computed chosen as boundaries. These bound-

ary segments contain no information usable by GERP++ and because the algorithm never

annotates constrained elements spanning them, excluding such segments did not sacrifice

any predictive ability. These boundary regions made up approximately 6.8% of the human

genome, including a 30.2 megabase region that made up more than half of chromosome Y.

Constrained element predictions were generated using default parameters and a 5% false

positive cutoff measured in terms of number of predictions; the estimated nucleotide-level

false positive rate was under 1%. Other constrained element prediction methods are generally

tuned around a 5% nucleotide-level false positive rate.


Figure 3.9: Phylogenetic tree used for GERP++ analysis. Tree is drawn to scale with

respect to estimated neutral branch lengths.


Gene, noncoding RNA, and PhastCons conserved element annotations were obtained

from the UCSC genome browser’s (38; 39) Known Genes, RNA Genes, and Conservation

(23) tracks respectively. To avoid skewed statistics due to alternative splicing, gene an-

notations were resolved to a consistent nonoverlapping set where any segment belonging

to multiple conflicting annotations was assigned a single annotation in the following order

of priority: coding exon, 5’ UTR, 3’ UTR, intron. For meaningful comparison against

phastCons, separate GERP++ scores and constrained elements were generated according

to the same procedure as above but using only placental mammal data (ignoring platypus

and opossum in the alignment and projecting them out of the phylogenetic tree).
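The priority-based resolution of conflicting gene annotations can be sketched per position. The label strings, the function name, and the toy intervals are assumptions; only the priority order is taken from the text.

```python
PRIORITY = ["coding exon", "5' UTR", "3' UTR", "intron"]   # highest priority first


def resolve_annotations(annotations, length):
    """Assign each position at most one label, the highest-priority label winning.

    `annotations` are (start, end, label) tuples with half-open coordinates.
    """
    rank = {label: i for i, label in enumerate(PRIORITY)}
    labels = [None] * length
    # Paint the lowest priority first so higher-priority labels overwrite it.
    for start, end, label in sorted(annotations, key=lambda a: -rank[a[2]]):
        for pos in range(max(0, start), min(length, end)):
            labels[pos] = label
    return labels


# Hypothetical overlapping annotations from alternative transcripts.
annotations = [(0, 10, "intron"), (2, 6, "5' UTR"), (4, 8, "coding exon")]
print(resolve_annotations(annotations, 10))
```

A per-position array is the simplest way to show the rule; for whole chromosomes an interval-based sweep would use far less memory.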

3.4 Discussion

One of the main challenges in constrained element detection is the lack of a clear gold stan-

dard for evaluating the quality of predictions. Human functional elements are sometimes

unconstrained at the mammalian scope or missed at the assembly or alignment stages,

and constrained predictions that do not correspond to any known annotations may have

unknown function, and cannot be definitively considered false positives. Given these limi-

tations, we have shown with a variety of metrics that GERP++ compares favorably with its

predecessor GERP as well as other leading methods such as phastCons. Previous bottom-

up approaches have been limited largely by the simple heuristics used to merge constrained

positions into longer elements; these heuristics may introduce biases in element length

due to patterned constraint such as the 3-periodicity in coding exons. With GERP++ we

evaluate a much richer set of candidate elements, selecting and ranking final predictions

according to statistically meaningful p-values. Despite the added computational cost at

this stage, GERP++ overall is more than 100x faster than GERP due to the speedup in


rate estimation. Because GERP++ estimates a single parameter that directly translates into

evolutionary rate, rather than an independent parameter for each branch of the tree, the

computation is not only faster but also benefits from deeper alignments, resulting in more

statistically robust estimates.

Our understanding of the evolutionary forces responsible for the observed constraint,

especially in noncoding regions, is still limited. This presents a tremendous challenge

for generative model-based approaches, which model implicitly or explicitly the distribu-

tion of length and intensity of constrained elements and the total genomic fraction under

constraint. In such situations, generative models tend to make more assumptions than nec-

essary, potentially introducing biases and making new insights difficult to incorporate. In

contrast, rate estimation and element prediction in GERP++ are largely independent proce-

dures, and while we believe the rejected substitution metric introduced by Cooper et al. (24)

most accurately quantifies constraint intensity at individual positions, any additive position-

specific scoring scheme could potentially be used instead. For example, more elaborate or

context-dependent models could be easily incorporated without drastically changing the

overall algorithm.

One drawback of GERP++ and other similar approaches is sensitivity to variation in

and erroneous estimates of the neutral rate of substitution. Neutral rate estimates are often

subject to some uncertainty and can vary depending on the methodology, alignment quality,

and regions used. To test the ability of GERP++ to tolerate a reasonable amount of error

in neutral rate estimates, we repeated our analysis with the neutral tree scaled up or down

by 5 or 10%. Not surprisingly, overestimating the neutral rate leads to overprediction of

constraint, and vice versa. For a fixed false positive cutoff, we observed a linear relation-

ship between the input neutral rate and the amount of constrained element bases predicted;


a 5/10% change in neutral rate leads to approximately 8/15% change in the number of

predicted constrained bases.

GERP++ recapitulates known biology, both at nucleotide-level resolution and on the

scale of entire functional elements and even chromosomes. GERP++ scores are accurate

enough to obtain a strong signal of synonymous substitution in coding exons, and the ele-

vated average RS score for chromosome X (Fig 3.3A) agrees with earlier findings (20; 21).

Compared to phastCons, GERP++ predictions overlap a larger fraction of known functional

elements (Fig 3.7B) and have better one-to-one correspondence to constrained coding exons

(Fig 3.7C,D). Our analysis also yielded interesting biological insights, including the

likelihood of unannotated coding exons among our predicted constrained elements. We

predict around 7% of the human genome to be constrained across the mammalian scope,

a slightly larger amount than previous methods, yet with a lower estimated false positive

rate. While this estimate is inexact, our analysis suggests 6% and 8% as reasonable lower

and upper bounds, a somewhat tighter range than earlier estimates (20; 21). Computation-

ally, GERP++ is efficient enough to perform whole-genome analysis of deep mammalian

alignments within a few cpu-days, making it suitable for high-throughput analysis of the

ever increasing amounts of genomic data. We hope GERP++ will prove to be a useful

tool in analyzing, quantifying, and annotating constraint and discovering novel functional

elements in the human and other genomes for which sufficient comparative data exist.

Bibliography

[1] Bafna V, Huson DH. The conserved exon method for gene finding. Proceedings of

the Fifth International Conference on Intelligent Systems for Molecular Biology 2000,

3-12.

[2] Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES. Human and mouse gene

structure: Comparative analysis and application to exon prediction. Genome Research

2000, 10:950-958.

[3] Bonizzoni P, Vedova GD. The complexity of multiple sequence alignment with SP-

score that is a metric. Theoretical Computer Science 2001, 259:63-79.

[4] Cook SA. The complexity of theorem-proving procedures. Proceedings of the Third

ACM Symposium on Theory of Computing 1971, 151-158.

[5] Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, Second

Edition 2001.

[6] Eddy SR. Noncoding RNA genes and the modern RNA world. Nature Reviews Genetics

2001, 2:919-929.

[7] Eddy SR. Computational genomics of noncoding RNA genes. Cell 2002, 109:137-140.


[8] Hardison RC, Oeltjen J, Miller W. Long Human-Mouse Sequence Alignments Reveal

Novel Regulatory Elements: A Reason to Sequence the Mouse Genome. Genome Re-

search 1997, 7:959-966.

[9] Holmes I, Rubin GM. Pairwise RNA Structure Comparison with Stochastic Context-Free

Grammars. Pacific Symposium on Biocomputing 2002, 7:175-186.

[10] Jareborg N, Birney E, Durbin R. Comparative analysis of noncoding regions of 77

orthologous mouse and human gene pairs. Genome Research 1999, 9:815-824.

[11] Kasami T. An efficient recognition and syntax algorithm for context-free languages.

Technical Report AF-CRL-65-758, 1965, Air Force Cambridge Research Laboratory,

Bedford, MA.

[12] Lowe TM, Eddy SR. tRNAscan-SE: a Program For Improved Detection of Transfer

RNA genes in Genomic Sequence. Nucleic Acids Research 1997, 25:955-964.

[13] Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for loop matching.

SIAM Journal of Applied Mathematics 1978, 35:68-82.

[14] Pennacchio L, Rubin E. Genomic strategies to identify mammalian regulatory se-

quences. Nature Reviews Genetics 2001, 2:100-109.

[15] Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction

including pseudoknots. Journal of Molecular Biology 1999, 285:2053-2068.

[16] Rivas E, Eddy SR. Secondary structure alone is generally not statistically significant

for the detection of noncoding RNAs. Bioinformatics 2000, 16:573-583.


[17] Wang L, Jiang T. On the complexity of multiple sequence alignment. Journal of Com-

putational Biology 1994, 1:337-348.

[18] Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using ther-

modynamics and auxiliary information. Nucleic Acids Research 1981, 9:133-148.

[19] Zuker M. Computer Prediction of RNA structure. Methods in Enzymology 1989,

180:262-288.

[20] The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004,

306(5696):636-640.

[21] Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH,

Weng Z, Snyder M, Dermitzakis ET, Thurman RE et al. Identification and analysis of

functional elements in 1% of the human genome by the ENCODE pilot project. Nature

2007, 447(7146):799-816.

[22] Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Bir-

ney E, Keefe D, Schwartz AS, Hou M et al. Analyses of deep mammalian sequence

alignments and constraint predictions for 1% of the human genome. Genome Research

2007, 17(6):760-774.

[23] Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H,

Spieth J, Hillier LW, Richards S et al. Evolutionarily conserved elements in vertebrate,

insect, worm, and yeast genomes. Genome Research 2005, 15(8):1034-1050.

[24] Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution

and intensity of constraint in mammalian genomic sequence. Genome Research 2005,

15(7):901-913.


[25] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search

tool. Journal of Molecular Biology 1990, 215(3):403-410.

[26] Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via

the EM Algorithm. Journal of the Royal Statistical Society 1977. Series B (Method-

ological) 39(1):1-38.

[27] Margulies EH, Blanchette M, Haussler D, Green ED. Identification and characteri-

zation of multi-species conserved sequences. Genome Research 2003, 13 (12):2507-

2518.

[28] McVean GT, Hurst LD. Evidence for a selectively favourable reduction in the muta-

tion rate of the X chromosome. Nature 1997, 386(6623):388-392.

[29] Gross SS, Do CB, Sirota M, Batzoglou S. CONTRAST: a discriminative, phylogeny-

free approach to multiple informant de novo gene prediction. Genome Biology 2007,

8(12):R269.

[30] Jukes TH, Cantor CR. Evolution of protein molecules. Munro HN, ed. Mammalian

protein metabolism 1969, 21-123.

[31] Kimura M. A simple method for estimating evolutionary rate of base substitution

through comparative studies of nucleotide sequences. Journal of Molecular Evolution

1980, 16:111-120.

[32] Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular

clock of mitochondrial DNA. Journal of Molecular Evolution 1985, 22(2):160-174.

[33] Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood ap-

proach. Journal of Molecular Evolution 1981, 17:368-376.


[34] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C, Sec-

ond Edition: the Art of Scientific Computing 1992.

[35] Brent RP. Algorithms for Minimization without Derivatives 1973, chapter 4.

[36] Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R,

Rosenbloom K, Clawson H, Green ED et al. Aligning multiple genomic sequences

with the threaded blockset aligner. Genome Research 2004, 14(4):708-715.

[37] Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D.

The human genome browser at UCSC. Genome Research 2002, 12(6):996-1006.

[38] Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M,

Smith KE, Rosenbloom KR, Raney BJ et al. The UCSC genome browser database:

update 2010. Nucleic Acids Research 2009.

[39] Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known

Genes. Bioinformatics 2006, 22(9):1036-1046.

[40] Crick F. Central Dogma of Molecular Biology. Nature 1970. 227:561-563.

[41] International Human Genome Sequencing Consortium. Initial sequencing and analy-

sis of the human genome. Nature 2001. 409:860-921.

[42] Needleman SB, Wunsch CD. A general method applicable to the search for similar-

ities in the amino acid sequence of two proteins. Journal of Molecular Biology 1970.

48(3):443-453.

[43] Smith TF, Waterman MS. Identification of Common Molecular Subsequences. Jour-

nal of Molecular Biology 1981. 147:195-197.
