rna bioinformatics - dartmouth computer sciencerockmore/csss2006/stadler...rna bioinformatics peter...

RNA Bioinformatics

Peter F. Stadler

Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center forBioinformatics, University of Leipzig

Institute for Theoretical Chemistry, Univ. of Vienna (external faculty)The Santa Fe Institute (external faculty)

CSSS, June 2006

Overview

PART 1: RNA Structures and How to Compute Them

PART 2: RNA Landscapes

PART 3: The Modern RNA World

PART1Why RNA?

until relatively recently:Central Dogma of Molecular BiologyDNA → RNA → ProteinDNA = “genetic memory”, RNA = working copy, proteins dothe work

around 1980: discovery of catalytic RNAs (Nobelprize for TomCech and Sidney Altman)nevertheless long considered “exotic” remnants from theancient RNA world

around 2000: structure of the ribosome showns that theribosome is an “RNA enzyme”

around 2000: microRNAs are discovered as a large class ofregulatory RNAs that inhibit translation of proteins

2006: the ENCODE project shows that human geneexpression is quite different from textbook knowledge

RNA Bioinformatics

RNA Secondary Structures are an appropriate level of description

explain the thermodynamics of RNA Structures

often highly conserved in evolution

can be computed efficiently

Many Functional RNAs are Structured

(a) Group I intron P4-P6 domain(b) Hammerhead ribozyme(c) HDV ribozyme

(d) Yeast tRNAphe

(e) L1 domain of 23S rRNA

Hermann & Patel, JMB 294, 1999

The RNA Model

GCGGGAAU

AGUUGG U A

G A G CA

AAGGUCGGGGU

CG C G A G

U U CGA

GUCUCGU

UUCCCGC

GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA

Formal Definition

A secondary structure on a sequence s is a collection of pairs (i , j)with i < j such that

Base pairing rules are respected, i.e., (i , j) ∈ Ω implies (si , sj)form an allowed pair (GC, CG, AU, UA, GU, UG)

Each base is involved in at most one pair, i.e., Ω is amatching, (i , j), (i , k) ∈ Ω implies j = k and (i , k), (j , k) ∈ Ωin implies i = j .

(i , j)Ω implies |j − i | > 3 (sterical constraint)

No-crossing rule: (i , j), (k, l) ∈ Ω and i < k implies eitheri < k < l < j or i < j < k < l .This excludes so-called pseudoknots

Let’s count the structures . . .

Counting secondary structures. Given a sequence of length n.Πkl = 1 if sequence positions k, l can form a pair GC, CG, AU,

UA, GU, UG and Πkl = 0 otherwise.Nkl = number of structures of the subsequence from k to l .Basic recursion:

• xxxxxxx +∑

(xxxx)xxxx

Nkl = Nk+1,l +l∑

ΠkjNk+1,j−1Nj+1,l

RNA Folding in a nutshell

i jj i i+1 j i i+1 k−1 k k+1|=

Nij = Ni+1,j +∑

k(i,k)pair

Ni+1,k−1Nk+1,j

Eij = minEi+1,j + min

k(i,k)pair

(Ei+1,k−1 + Ek+1,j + εik)

Zij = Zi+1,j +∑

k(i,k)pair

Zi+1,k−1Zk+1,j exp(−εik/RT )

Partition function: Z =∑

Ω exp(−E (Ω)/RT )

A word on the Partition Function

The partition function is the link between the combinatorics of thestructures (in general: states in an ensemble) and thethermodynamic properties of the physical ensemble, e.g.:

Free energy G = −RT lnZ

Expected Energy 〈E 〉 = RT 2 ∂ lnZ∂T

Heat Capacity Cp = −T ∂2G∂T 2

0 20 37 100

CUGUAUUGUUGUAUAGCCCGUGUGGUAAUAUGG

cal•(

•K)-1

Realistic Energy Model

closing base pair

interior base pairsclosing base pair

closing base pair

multi-loop

interior base pair

hairpin loop

closing base pair

stacking pair

interior loop

Parameters from large number of melting experiments by Douglas

Turner, David Matthews, John Santa Lucia, and others

Recursions for Linear RNAs

i u u+1|

i j i+1 j i

hairpin Cinterior

i j i i k l j

k k+1 j

=i ij j−1

ui+1 u+1i j−1 j

j−1 ji u u+1|C

Fij free energy of the optimal substructure on the subsequencex [i , j].

Cij free energy of the optimal substructure on the subsequencex [i , j] subject to the constraint that i and j form a base pair.

Mij free energy of the optimal substructure on the subsequencex [i , j] subject to the constraint that that x [i , j] is part of amultiloop and has at least one component, i.e., asub-sequence that is enclosed by a base pair.

M1ij free energy of the optimal substructure on the subsequence

x [i , j] subject to the constraint that that x [i , j] is part of amultiloop and has exactly one component, which has theclosing pair i , h for some h satisfying i ≤ h < j .

Fij = minFi+1,j , min

i<k≤jCik + Fk+1,j

Cij = minH(i , j), min

i<k<l<jCkl + I(i , j ; k, l),

mini<u<j

Mi+1,u + M1u+1,j−1 + a

Mij = min

mini<u<j

(u − i − 1)c + Cu+1,j + b,

mini<u<j

Mi ,u + Cu+1,j + b, Mi ,j−1 + c

M1ij = min

i ,j−1 + c , Cij + b

Backward Recursion: Base Pairing Probabilities

pij =Z1,i−1Zi ,jZj+1,n

pklΞij ,kl .

Ξij ,kl is a ratio of the two partition functions:

Zij ,kl ... both i , j and k, l pair

Zkl ... k, l pair.Simplest case:Zij ,kl = Zk+1,i−1ZijZj+1,l−1ζkl where ζkl = exp(−βkl/RT ) is theBoltzman factor of the pairing energy

Backward recursion: full modelBackward recursion:

Pkl = Pkl +

p<k;q>l

e−I(p,q,k,l)

ZMp+1,uZ

M1u+1,k−1)

e−(a+(q−l−1)c)

ZMl+1,uZ

M1v+1,q−1)

e−(a+(k−p−1)c)

+ ZMp+1,k−1Z

Ml+1,q−1

k lp q

Single-Stranded Circular RNAs

Viroid RNA

Hepatitis Delta Virus Genome

Cryptic by-products of splicing formed intronic sequence

Circularized C/D box snoRNAs were recently reported inPyrococcus furiosus

Synthetic constructs for in vitro selection

Circular, Linear, and Interacting RNAs

In the maximum matching case=⇒ same algorithm for all three cases

CIRCULAR FOLDING LINEAR FOLDING BINARY COFOLDING

Linear versus Circular Folding

Linear folding: energy contributions inside a pair (i , j) only.Co-folding: additional contribution for loop spanning [n, 1].

no energy contribution for external loop

no external loop

extra contribution

external loop

Strategy 1 (e.g. Michael Zucker’s mfold)For each pair (i , j): compute energy both inside and outsidethe pair⇒ doubles memory requirements

Strategy 2 (Vienna RNA Package)First compute linear folding energies. Then compute energiesfor the loop spanning [n, 1].

hairpin loop interior loop or bulge multi−branch loop

Implementing Circular Folding

Relative to linear folding, only the loop containing the cut has tobe re-evaluated.Three cases: cut in Hairpin, Interior-, or Multi-loop

F = minF H ,F

= | |n1

interior

in kk+1

k nu u+1

Exterior Hairpin.

F H = min

p<qCpq + H(q, p)

Exterior Interior Loop.

F I = min

k<l<p<qCpq + Ckl + I(q, p, l , k)

Exterior Multi-Loop.

Modified decomposition: one or more components M1,k +exactly two components M2

M2kn = min

ku + M1u+1,n

F M = min

M1,k + M2

k+1,n + a

Folding energy: F = minF H ,F

Application

larFold

Itdoes

erence

120140

160180

200220

240260

284starting point of linearized sequence

0 10 20 30 40 50

Structure distance

-6 -4 -2 0 2 4Energy difference

GGGGAAUUUCUCUGCGGGACCAA

A AACAGCUUGUGGAGGGAACAUACCUGAAGAG

AUCCCCGGGG

CUUCAG

UCGUCGAGGGGAGGGCG

CCGCGGAUC

CGUCCAGC

CAGGAG

CU C G

UCUCCUU

CGCUGGCU

CAUCCGAUCG

UCGCUUCUUCCUUCGCGA

CCUGAG

ACCCGGUG

ACUCUUGGGUUGUUCCUCCCAGGCUUGUU A

GGCCCGCGUUUG

AGACCCC

RNAfold-circ

eViennaRNAPackage

Local structures

Idea: Restrict Recursion to base pairs (i , j) with j − i < L.Special interest in robust structures:Z

u,Lij . . . partition function of sub-sequence [i , j] when sequence

window [u, u + L] is folded

pu,Lij . . . probability that i and j form a base pair when window

[u, u + L] is folded.

Zu,Lij =

Zij if [i , j] ⊆ [u, u + L]0 otherwise

pu,Lij =

Zu,L1,i−1Z

u,Li ,j Z

u,Lj+1,n

Zu,Lu,u+L

pu,Lkl Ξu,L

ij ,kl

=Zu,i−1Zi ,jZj+1,u+L

Zu,u+L

pu,Lkl Ξij ,kl .

Robust local structuresAverage probability of an (i , j) pair over all folding windowscontaining the sequence interval [i , j]

πLij =

L − (j − i) + 1

u=j−L

pu,Lij .

Direct Recursion:

πLij =

L − (j − i) + 1

u=j−L

Zu,L1,i−1

bZu,Li,j Z

u,Lj+1,n

Zu,L1,n

| z π∗L

L − (j − i) + 1

u=j−L

pu,Lkl

Ξij,kl

= π∗Lij +

i−1X

k=j−L

u=l−L

pu,Lkl

Ξij,kl

L − (j − i) + 1= π∗L

i−1X

k=j−L

L − (k − l) + 1

L − (j − i) + 1πL

klΞij,kl .

C C U U G G C C A U G U A A A A G U G C U U A C A G U G C A G G U A G C U U U U U G A G A U C U A C U G C A A U G U A A G C A C U U C U U A C A U U A C C A U G G U G A U U U A G U C A A U G G C U A C U G A G A A C U G U A G U U U G U G C A U A A U U A A G U A G U U G A U G C U U U U G A G C U G C U U C U U A U A A U G U G U C U C U U G U G U U A A G G U G C A U C U A G U G C A G U U A G U G A A G C A G C U U A G A A U C U A C U G C C C U A A A U G C C C C U U C U G G C A C A G G C U G C C U A A U A U A C A G C A U U U U A A A A G U A U G C C U U G A G U A G U A A U U U G A A U A G G A C A C A U U U C A G U G G U U U G U U U U U U G C C U U U U U A U U G U U U G U U G G G A A C A G A U G G U G G G G A C U G U G C A G U G U A C A G U U G U G U A C A G A G G A U A A G A U U G G G U C C U A G U A G U A C C A A A G U G C U C A U A G U G C A G G U A G U U U U G G C A U G A C U C U A C U G U A G U A U G G G C A C U U C C A G U A C U C U U G G A U A A C A A A U C U C U U G U U G A U G G A G A G A A U A U U C A A A G A C A U U G C U A C U U A C A A U U A G U U U U G C A G G U U U G C A U U U C A G C G U A U A U A U G U A U A U G U G G C U G U G C A A A U C C A U G C A A A A C U G A U U G U G A U A A U G U G U G C U U C C U A C G U C U G U G U G A A C A C A C C U U C A U G C G U A U C U C C A G C A C U C A U G C C C A U U C A U C C C U G G G U G G G G A U U U G U U G C A U U A C U U G U G U U C U A U A U A A A G U A U U G C A C U U G U C C C G G C C U G U G G A A G A

133029828 133029088Human chromosome X (minus strand)

mir-92-2mir-19b-2mir-20bmir-18bmir-106a

Local structures (L=100) in a 740 nt region of human X chromosome

Cofold: How to deal with Concentration?

Algorithmically that same as linear foldingspecial energy contribution for “loop with the cut”

Additional energy contribution for forming duplex

At least 5 molecular species need to be taken into account(Dmitrov & Zuker, 2005): A, B , A2, B2, AB .

Their folding energies and partition functions are easilycomputed

Cofold

A U G A A G A U G A C U G U C U G U C U U G A G A C A

A U G A A G A U G A C U G U C U G U C U U G A G A C AAU

AUGAAG

Dot plot (left) and mfe structure representation (right) of thecofolding structure of the two RNA molecules AUGAAGAUGA (red)and CUGUCUGUCUUGAGACA.

Cofold: Concentration dependencies

Q = V n a!b! × (Z ′A)nA(Z ′AA)nAA(Z ′AB)nAB (Z ′BB)nBB (Z ′B)nB

nA!nB ! 2 nAA! 2 nBB !nAB !

where a = nA + 2nAA + nAB . The system minimizes the freeenergy −kT lnQ.solving this optimization problem yields the equilibria:[AA] = KAA [A]2 , [BB ] = KBB [B ]2 . [AB ] = KAB [A] [B ].with [A] = 6.023 × 1023nA, etc., and

KAA =Z ′AA

(ZA)2=

(ZAA − (ZA)2)e−ΘI /RT/2

(ZA)2=

2e−ΘI /RT

(ZA)2− 1

KBB =1

2e−ΘI /RT

(ZB)2− 1

KAB = e−ΘI /RT

ZAZB− 1

1 10total siRNA concentration b [nmol]

]A.siA.AAsiA’.si’A’.A’A’si’

Example for the concentration dependency for two mRNA-siRNAbinding experiments.

RNAup: Small RNAs Binding to Large Ones

RNA folding excludes pseudoknots, i.e., non outerplanargraphs

cofold thus does not allow small RNA binding into loopregions of large ones

... but this happens in reality

Remedy: Compute energy/partition function

Pu[i , j] =Z [1, i − 1] × 1 × Z [j + 1,N]

Z︸︷︷︸exterior

p,qp<i≤j<q

Ppq ×Zpq[i , j]

Z b[p, q]︸︷︷︸enclosed

that subsequence [i , j] is unpaired and the energy of binding ashort molecule in this location

qp p q

1 n 1 n

qp p q

(a) (b’) (b”) (c) (d) (e)

Zpq[i , j ] = exp(−βH(p, q))︸︷︷︸(a)

p < i ≤ j < k or

l < i ≤ j < q

Z b[k , l ]e−βI (p,q;k,l)

︸︷︷︸(b)

p<i≤j<q

Zm2[p + 1, i − 1]e−βc(q−i)

︸︷︷︸(c)

p<i≤j<q

Zm[p + 1, i − 1]Zm[j + 1, q − 1]e−βc(j−i+1)

︸︷︷︸(d)

p<i≤j<q

Zm2[j + 1, q − 1]e−βc(j−p)

︸︷︷︸(e)

RNAup: Interaction part

Z I [i , j , i∗, j∗] =∑

i<k<ji∗>k∗>j∗

Z I [i , k, i∗, k∗]e−βI (k,k∗ ;j ,j∗)

Z ∗[i , j] = Pu[i , j]∑

i∗>j∗

Z I [i , j , i∗, j∗];

P∗[i , j] = Z ∗[i , j]/∑

Z ∗[k, l ]n (3’)

m (3’)

1 (5’)

1 (5’)j*

RNAup: Application

VR1 straight VR1 HP5_16 VR1 HP5_11

Sequence position

160 180 160 180 160 180 160 180

VR1 HP5_6

1060 1080

Binding of siRNAs to VR mRNA.Pu[i , i ] (dashed line), P∗

i (thick black line), ∆Gi (thick red line).Below: activity of siRNA

Alternative Approach

Consider RNA Folding as a Machine Learning ProblemContext Free Grammar + probabilities for production rules⇒ Stochastic Context Free Grammarssee work by Sean Eddy, Jotun Hein, and collaborators

Folding Kinetics

RNA molecules may have kinetic traps which prevent them from reachingequilibrium within the lifetime of the molecule. Long molecules are oftentrapped in such meta-stable states during transcription.Possible solutions are

Stochastic folding simulations can predict folding pathways and finalstructures. Computationally expensive, few programs available.

Predicting structures for growing fragments of the sequence canshow whether large scale re-folding will occur during transcription.Cheap but inaccurate.

Analysis of the energy landscape based on complete suboptimalfolding can identify possible traps (local minima).

Kinetic Folding Algorithm

Simulate folding kinetics by a Monte-Carlo type algorithm:Generate all neighbors using the move-setAssign rates to each move, e.g.

Pi = min

1, exp

Select a move with probability proportional toits rateAdvance clock 1/

∑i Pi .

Characterization of Landscapes

A landscape consists of a configuration space V , a move set within thatconfiguration space and an energy function f : V → R.Simplest move set for secondary structure: opening and closing of basepairs.Speed of optimization depends on the roughness of the Landscape.Measures of roughness suggested in the literature:

Number of local optima

Correlation lengths (e.g. along a random walk)

Lengths of adaptive walks

Folding temperature vs. glass temperature Tf /Tg

Energy barriers between the local optima. Especially, themaximum barrier height (“depth” in SA literature)

Energy barriers

E [s,w ] = min

[f (z)

∣∣z ∈ p] ∣∣∣∣ p : path from s to w

B(s) = minE [s,w ] − f (s)

∣∣w : f (w) < f (s)

Depth and Difficulty(borrowed from simulated annealing theory)

D = maxB(s)

∣∣s is not a global minimum

ψ = max

f (s) − f (min)

∣∣∣∣s is not a global minimum

Energy Barriers and Barrier Trees

Some topological definitions:A structure is a

local minimum if its energy is lowerthan the energy of all neighbors

local maximum if its energy ishigher than the energy of all

neighbors

saddle point if there are at leasttwo local minima that can bereached by a downhill walk startingat this point

Calculating barrier trees

The flooding algorithm:Read conformations in energy sortedorder.For each confirmation x we havethree cases:

(a) x is a local minimum if it hasno neighbors we’ve alreadyseen

(b) x belongs to basin B(s), if allknown neighbors belong toB(s)

(c) if x has neighbors in severalbasins B(s1) . . .B(sk) then it’sa saddle point that merges

these basins.Basins B(s1), . . . ,B(sk) arethen united and are assigned tothe deepest of local minimum.

Information from the Barrier Trees

Local minima Saddle points Barrier heights Gradient basins Partition functions and free energies of (gradient) basins Depth and Difficulty of the landscape

N.B.: A gradient basin is the set of all initial points from which a

gradient walk (steepest descent) ends in the same local minimum.

Energy Landscape of a Toy Sequence

G C U A U U A

G C U A U UCG

A G C U A U UCG

C C G G A

G C U A U UCG

G C U A G UCG

G C G U G GCG

G U G G G ACG

C G G GG

G GUGG

C GUUUAGGG

8 9 10

UUUAGGG

Steps [arbitrary]

2.45[5]

[9]3.80 1 [11]

1.60 3 [7]

141.30 9

1.40 192.10 11

3.80 2 [1][3] 12

1.20 8 [4]

173.60 4

2.20 73.61 5

1.40 161.90 10

1.50 152.00 205.02 6

3.90 18E

Folding Kinetics

Transition rates from x to y :

ryx = r0e−

E6=yx−E(x)

RT for x 6= y

rxx = −∑

Kinetics as a Markov process:

rxypy (t) .

Transition states:E 6=

yx = maxE (x),E (y)

or more complex models (Tacker et al 1994, Schmitz et al 1996)

Reduced Description of the Folding Dynamics

Macrostates = Classes of a partition of the state space.Partition function for a macro state:

Zα =∑

x∈α

exp(−E (x)/RT )

Free energy of a macro state:

G(α) = −RT ln Zα

rβα =∑

y∈β

x∈α

ryxProb[x |α] for α 6= β

y∈β

x∈α

ryxe−E(x)/RT

rβα “on flight” while executing the barriers program.Transition state free energy:

G6=βα = −RT ln

y∈β

x∈α

e−E6=xy

0.5 1.3

lillyA simple model sequence

mfe23456

mfe255680

Refolding of a tRNA molecule.

Summary I:

RNA structures can be computed efficiently by means ofdynamic programming

Computations are based on a set of carefully measures energyparameters and an additive energy model

Algorithms exist for ground state energy and structure, fullpartition functions, density of states, interacting structures,. . .

The folding kinetics of a given RNA Sequence can also beinvestigated as the level of secondary structures

VIENNA RNA PACKAGE

PART II: How Do RNAs Evolve

Basic Assumption

Selection Acts on Secondary Structures, Mutations acts on theunderlying sequences⇒ We need to understand the sequence-to-structure map of RNAs(hang on, we’ll discuss the empirical evidence for that a bit later)

Sewall Wright’s Fitness Landscapes

Phenotype

How do realistic fitness landscapes look like?

Biological Landscapes

The RNA case is a special case of a very general paradigm:

genotype 7→ phenotype 7→ fitness

What is the relationship between Genotyp and Phenotype?

Central topic in any theory of evolutionbecause:* Selection acts on the Phenotype* Mutation/Recombination acts on the GenotypeBiopolymers as the simplest model:The molecule is both genotype (sequence) and phenotype (structure).

The map from genotype to genotype is determined by physical chemistry:

⇐⇒ folding problem

Computational Analysis of the RNA Map

There are many more sequences than structures.(.)-string: 3-letters (with constraints)

=⇒ less than 3n structures

but 4n sequences.

=⇒ Redundancy

How are sequences folding into the same structure distributed insequence space?Neutral Set S(ψ) = x ∈ Qn

α|f (x) = ψ

Sensitivity and Neutrality

GCGGGAAU

AGUUGG U A

G A G CA

AAGGUCGGGGU

CG C G A G

U U CGA

GUCUCGU

UUCCCGC

GCGGGUAUA

GCUCAGU

UGG U A

G A G CA C G

A C CUU G CC A A

C G G G GU CG C G A G

U U CGA

GUCUCGU

UUCCCGCUCC

Effect of a single

point mutation0 100 200 300

Structure Distance

Distribution of structure distances

The Random Graph Model

Approach:Model S(ψ) as a random induced subgraph Γ with a given value

λ =〈#neutral neighbors〉

(α− 1)n

Threshold value:

λ∗ = 1 −

) 1α−1

Theorem. [Reidys, Stadler, Schuster]If λ > λ∗ then Γ is a.s. dense and connected,if λ < λ∗ then Γ is a.s. neither dense nor connected

A complication: Base Pairing Rules

Unpaired bases:Alphabet A = A,U,G,C

Paired bases: 5’ and 3’ side correlated:Alphabet: B = AU,UA,GC,CG,GU,UG, .

Thus consider only the set of compatible sequences C (ψ):S(ψ) ⊆ C (ψ) ≡ Qnu

4 ×Qnp

6 .=⇒ Two neutrality parameters λu and λp

Connected Components of Neutral Networks

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0λu

gray many small components red 1 connected componentgreen 2 equal sized components yellow 3 components size 2:1blue 4 equal sized components

Explanation: for this deviation from the random graph model in terms of the energy model. Some structures can

be made only with a significant bias in the G/C ratio.

x??P (d)

Length of Neutral Path: d !0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0

0.250.200.150.100.050.00

0 20 40 60 80 100 120

Chain lenght n

from enumeration from inverse folding lower bound

Distance to Target structure Covering radiuslength neutral paths

Closest Approach

Intersection Theorem. For any two secondary structures φ, andψ holds

C (φ) ∩ C (ψ) 6= ∅

What is the distance of neutral networks

δ(φ,ψ) = mind(x , y)|f (x) = φ and f (y) = ψ

Random graph Theory: If λ > λ∗ then δ(φ,ψ) ≈ 2.Computer simulations: upper bounds on δ(φ,ψ):

n GC AU AUGC

50 5.6 2.6 2.170 9.3 4.6 3.4100 13.0 7.8 5.6

AccessibilityFontana & Schuster 1998

Idea: The “interface” between two structures is large is they are“similar”.More precisely: Structure ψ is accessible for φ if x ∈ S(φ) is like tohave neighbor (mutant) x ′ ∈ S(ψ).Structural characterization of “easy” (continuous) transitions:

Shorteningof stacks

Elongationof stacks

Opening ofconstrained stacks

Closing ofconstrained stacks

SUMMARY: Sequence-Structure Map of RNA

1. Redundancy: Many more sequences than structures

2. Sensitivity: Small changes in the sequences may lead to largechanges in the structure

3. Neutrality: A substantial fraction of mutations does not alter thestructure.

4. Isotropy: S(ψ) is “randomly” embedded in C (ψ).

Implications:

1. Neutral Networks: S(ψ) forms a connected “percolating” network insequence space for all “common” structures.

2. Shape Space Covering: Almost all structures can be found in arelatively small neighborhood of almost every sequence.

3. Mutual Accessibility: The neutral networks of any two structuresalmost touch each other somewhere in sequence space.

Simulated Trajectories

0.5 1.0 1.5time (arbitrary units)

Punctuated equilibra = diffusion of neutral networks +constant rate of innovation +exponential selection of rare mutants

Proc.Natl.Acad.Sci. 93: 397-401 (1996)

Diffusion Constant

. . . can be deduced from Moran model:

D = λ6Anp

3 + 4Np(1 + 1/N) ∼

(3/2)A(n/N) p ≫ 0 orN ≫ 1

2Anp p ≪ 1

A . . . replication raten . . . sequence lengthN . . . population sizep . . . mutation rateλ . . . neutrality of network

Dynamics of Interacting Replicators

Ik + Ij −→ Il + Ik + Ij

With mutation:

xk = xk

Akjxj −∑

Aijxixj

(QklAljxjxl − QlkAkjxkxj)

Qkl = (1 − p)n−d(k,l)

α− 1

)d(k,l)

How does this behave in sequence space?

Simplest case: Simplest case: Akl = A0(1 − d(Ik , Il )/n):

0 1000 2000 3000 4000 5000time

0 2×105

4×105

6×105

8×105

time lag τ

g(τ) =1

T2 − T1 + 1

‖p(t + τ) − p(t)‖2

B.M.R. Stadler, Adv. Complex Syst. (2003)

Left: Diffusion coefficient D as a function of the mutation rate for N = 10, 20, 30, 40, 80 and

n = 10, 20, 30, 40, 80 such that N/n = 1 after equilibration for 105 timesteps. Right: Dependence of the ratio

D/p on N/n.

An RNA-Based Model in the Plane

Target hypercycle with 8 mem-bers.

Model:Hypercyclically coupledspecies, each sequence hasa function that depends onits structure.

Spatial Extension: CA Model

Possible Catalysts Actual CatalystsPossible Replicators

Rules of replication. For each of the neighbors (•) of the empty cell (marked by a bold outline) the replication rate

ρz is computed taking into account their neighbors in the direction of the replication () as potential catalysts.

The neighbor with the largest values of ρz invades the empty position. In this example, for the chosen replicator,

only three of its neighbours are catalysts according to the hypercycle topology.

Spirals formed after 3000 generations in an evolution experimentstarted with 300 random sequences in the absence of parasites.see also Borlijst & Hogeweg (1993)

Diffusion in Sequence Space

2000 4000 6000 8000 10000time lag (tau)

0.0005 0.0010 0.0015 0.0020Mutation rate

Summary

Neutrality of the Sequence-Structure Map impliesdiffusion/drift-like motion in sequence independent of detailsof the selection/mutation mechanisms and whether spatialextension is taken into account or not.

=⇒ The basic assumption of molecular phylogenetics, namelya dominating influence of drift in sequence evolution, holdstrue even when phenotypic evolution is dominated byinteractions(co-evolution).

TODO Development of a rigorous mathematical theorydescribing the motion in sequence space of a population withstrong interactions.

Evolutionary histories of some structured RNAs

Ribosomal RNAs (rRNAs) are the most frequently used sequencedata for reconstructing phylogenies from molecular dataHow does that work:In a nutshell:(1) compute evolutionary distances from the sequence data(2) “fit” an additive tree to the distances(In reality, there are other methods such as maximum parsimonyand maximum likelihood approaches, but the basic idea is thesame)Observation: all tRNAs have more or less the same clover-leafstructure.

MicroRNAs

processed from precursorhairpins

short (∼ 22nt) RNAmolecules

highly conserved

Function

bind to 3’UTRs of mRNAtargets

supress expression of thismRNA

mark mRNA molecule fordegradation

in plants involved in DNAmethylation

GUGUAUUUGACAAGCU

U GGACACUC

G U GGU

GUGUCAGUUUGUCAAAUACC

MicroRNAs — processing and function

MicroRNAs ...

transcribed inlonger transcripts(primary-miRNA)

in some cases:polycistronic

“clusters”

Drosha processing→ precursor

export to cytoplasmExportin-5 pathway

Dicer processing →mature miRNA

Evolution of microRNA Families: mir-17 clusters

Many miRNAs are transcribed from polycistronic transcriptsMost spectacular example: Human mir-17 clusters

Chr−13

Chr−X

Chr−7

19b−1 92−1

18X 20X 19b−2 92−2

106b 93 25

18 19a 2017

91 17 18 19a 20 19b 92

106a 18X 20X 19b 92

106b 93 25

II−3

J. Mol. Biol. 339: 327-335 (2004)

Case Study: mir-17 clusters

X-106a

X-19b-2

X-92-2

. . . . . . . . ( ( ( ( ( ( ( ( ( ( ( ( ( ( . . . . ( ( ( ( ( . ( ( ( ( ( ( . . . ) ) ) . . ) ) ) ) ) ) ) ) . . . . ) ) ) ) ) ) ) ) . ) ) ) ) ) ) ( ( ( . . . . ) ) ) . . . ( ( ( . ( ( ( ( ( . . . . . . . . . . . . . . . . . . . . . . . . . ( ( ( ( . ( ( ( . . . ( ( ( . . . . . . . . ( ( ( ( ( . . ( ( ( ( . . . . . . . . ( ( ( . . . ( ( ( ( ( ( ( ( ( ( . ( ( ( ( . ( ( ( . ( ( ( ( ( . . . . . ( ( ( . . . ) ) ) . . . . . . . ) ) ) ) ) . ) ) ) . ) ) ) ) . ) ) ) ) . . ) ) ) ) ) ) ) ) ) ( ( ( . . . . . . . . . . . . ( ( ( ( ( ( ( . . . . . . . . . . . . . . . . . . . . ) ) ) ) ) ) ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ( ( ( ( ( ( ( ( ( . . . . . . . . . . . ) ) ) ) ) ) ) ) ) . . . . . . . . . . . . . . . . . . . . ) ) ) . . . . . ) ) ) ) . . ) ) ) ) ) . . . ( ( ( . . . . ( ( ( ( ( . . ( ( ( ( ( ( ( ( ( ( ( . ( ( ( ( ( . ( ( ( . . . . . . . . . . . . . ) ) ) ) ) ) ) ) . ) ) ) ) ) ) ) ) ) ) ) . . . ) ) ) ) ) . . . . ) ) ) . . . . ) ) ) . . . ) ) ) . ) ) ) ) . . ( ( ( . . . . . . . ) ) ) . . . . . ) ) ) ) ) ) ) ) . . . . . . . ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( . . . ( ( ( ( . . . . . . . . . . . . . . . . . ) ) ) ) ) ) ) ) ) ) ) ) . . ) ) ) ) ) ) ) ) ) ) ) . . . . . . . . . . . ( ( ( ( . . . . . . . ) ) ) ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ( ( ( . . . . ) ) ) . . . ( ( ( . ( ( ( . ( ( ( ( ( . ( ( ( ( ( . ( ( ( ( . . . . ( ( ( ( ( ( ( ( ( . . . ) ) ) ) . ) ) ) ) ) . . . ) ) ) ) . ) ) ) ) ) . ) ) ) ) ) . ) ) ) . ) ) ) . . . . . . . .

X-106a

X-19b-2

X-92-2

(a) (b)

Structure of the pri-pre-mir-17 at the human X chromosome.

Construction of Gene Treesfrom concatenated sequences in the cluster

HsXPtX

Pt1Hs1 Mm1

233617

TnCTrC

Pt−IIHs−II

Mm−IIRn−II

Xt−II

Dr−II−D

Tr−II−D

Distant Homologies with unreliable Alignments

How to quantify sequence similarity when we cannot get a goodalignment?

measure pairwise sequence similarity s(x , y)

compare to the distribution of similarity values of alignmentsof shuffled sequences

define a z-score

z(x , y) =s(m, y) − 〈s(π(x), π′(y))〉π,π′

√varπ,π′(s(π(x), π′(y)))

use z(x , y) as similarity measure in WPGMA clustering

19b1−

Hs1−17

Hs1−18

Hs1−19a

Hs1−19b−1

Hs1−20

Hs1−92−1

Mm1−17

Mm1−18

Mm1−19a

Mm1−19b−1

Mm1−20

Mm1−92−1

Rn1−20

HsX−19b−2

HsX−92−2

HsX−106a

MmX−19b−2

MmX−92−2

MmX−106a

Hs3−25

Hs3−93

Hs3−106b

Mm3−25

Mm3−93

Mm3−106b

Dm−92aDm−92b

Ce−235

Pt1−17

Pt1−18

Pt1−19a

Pt1−19b−1

Pt1−20

Pt1−92−1

Rn1−17

Rn1−18

Rn1−19a

Rn1−19b−1

Rn1−92−1

Xt−17

Xt−18

Xt−19a

Xt−20

Xt−19b

Xt−92

HsX−18

HsX−20

PtX−106a

PtX−18

PtX−19b−2

PtX−20

MmX−18

MmX−20

RnX−106a

RnX−18

RnX−19b−2

RnX−20

RnX−92−2

CcX−19b−2

CcX−92

Pt3−93

Pt3−25

Pt3−106bRn3−106b

Rn3−25

Rn3−93

Cc3−25DrD−25

DrD−19b

DrD−93

TrD−17/106b

TrD−20/93

TrD−25

TrD−19b

XtB−93

XtB−25

DrA−17

DrA−18

DrA−92−1

DrB−17

DrB−18

DrB−19a

DrB−19b−1

DrB−92−1

Dr14−20

Dr14−18

Dr14−19−b1

TrA−17

TrA−18

TrA−19a

TrA−19b−1

TrA−20

TrA−92−1

TrB−17

TrB−18

TrB−19a

TrB−19b−1

TrB−20

TrB−92−1

TrC−18

TrC−19b

TnB−17

TnB−18

TnB−19a

TnB−20

TnB−19b

TnB−92

TnC−18

TnC−19b

TnA−17

TnA−18

TnA−19a

TnA−20

TnA−19b

TnA−92

10.08.0

z−sc

Collapsed tree of microRNA subgroups

z−score

I−17

I−18

I−19a

I−20

I−19b

I−92

II−106b

II−93

II−19b

II−25

Ce−92

mir−17 groupmir−92 group mir−19 group

obtained by collapsingvertebrate, insect,and nematode speciestrees to single vertices

next step:combine gene treesand syntenyinformation to aduplication history

Scenario for the evolution of the mir17 familyancestral mir17 cluster probably contained

mir-17, mir-19 and mir-92

181720

93106b 19b

19−3

deletion

Scenario for the evolution of the mir17 familyfirst detectable duplication event:

branch mir-17 and mir-18

181720

93106b 19b

19−3

deletion

19b is copy of 19

18 is copy of 17

Scenario for the evolution of the mir17 familyseries of duplications:

branch mir-19 and mir19b, mir-17 and mir-93

181720

93106b 19b

19−3

deletion

19b is copy of 19

18 is copy of 17

93 is copy of 17

Scenario for the evolution of the mir17 familygenome wide duplication:

duplication of whole cluster and loss of individual miRNAs

181720

93106b 19b

19−3

deletion

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

Scenario for the evolution of the mir17 familyindependent miRNA duplications

in type I cluster

181720

93106b 19b

19−3

deletion

20 is copy of 17

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

Scenario for the evolution of the mir17 familysplit of teleosts and mammalia

teleost specific genome duplication

Mammalia

I−1I−XII

181720

93106b 19b

19−3

deletion

20 is copy of 17

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

Scenario for the evolution of the mir17 familysplit of teleosts and mammalia

teleost specific genome duplication

TeleosteiMammalia

I−BI−AI−CII−D

I−1I−XII

181720

93106b 19b

19−3

deletion

20 is copy of 17

cluster duplication

Deletions

19b is copy of 19

18 is copy of 17

93 is copy of 17

History of the mir-17 cluster: updated data

TeleosteiTetrapoda

deletion

cluster duplication

IIII−D

2519b−II

17/106a

20 is copy of 17

18 is copy of 17

93 is copy of 17

Further Examples: let-7 family

a1 f1 df2 m

k g i c1 e a2jb

Ancestral Eubilaterian

Tetrapoda only

Ancestral vertebrate

lost in mammals loss of mir100 or mir125 in one paralog in most species

duplicationteleost genome

Mammalian transcription from

intron of coding sequenceintron of non−coding sequenceexon

Further Examples: mir-1 and mir-30

1−1 1−2Teleosts Teleosts

TetrapodaTetrapoda

Nematoda

Urochordata

Arthropoda

Sea Urchin

0.10.0

Gga−1b

teleosts

tetrapoda

teleosts

tetrapoda

teleosts

tetrapoda

teleosts

tetrapoda

0.10.0

mir-1 mir-30

Further Examples: mir-9, mir-23, mir130/301

Sea Urchin

Nematoda

Schistosoma (?)

Tetrapods

Teleosts Diptera

mir−79

mir−9a

mir−9b

mir−9c

9c 79 9b306 9a

lost in Tn/Tr

Hs19 Hs9

mir-23 cluster

Rn, Mm

Cf, Pt, Hs

mir−301

mir−130b

mir−130a

mir−301

mir−130a

mir-130 cluster

Expansion of the Metazoan MicroRNA Repertoire

0 10 20 30 40 50 60 70 80

Rn Bt Cf PtMdGgXtDr Tn TrCsCiOdSpAgBmTcAmCb Ce D.sp. Mm Hs

miRNA innovationsnon−local duplicationslocal duplications

Similar Situation: snoRNA

snoRNAs direct chemical modification of other RNAs (mostlyrRNA, snRNA, and (some?) messenger RNAs

two classes: box-H/ACA and box-C/D

known in eukaryotes and archea, not in eubacteria

Mm_4_E1_1Mm_4_E1_2

Rn_8_E1Mm_9_E1

Hs_1_E1_1Pt_1_E1_1

Hs_1_E1_2

Pt_1_E1_2Oc_E1_1r

Ss_E1_1Cf_2_E1_1

Cf_2_E1_2Bt_E1

Gg_E1Xl_E1_6r

Xl_E1_1rXl_E1_5r

Xl_E1_4rXl_E1_3r

Xl_E1_2rTr_E1_3

Tr_E1_4Tr_E1_5Tr_E1_6

Ol_E1_1Tn_E1_1

Tn_E1_2Tn_E1_4

Tn_E1_3Om_E1_r

St_E1_rDr_E1_3r

Dr_E1_2rDr_E1_5r

Dr_E1_4r

DrE2_1

DrE2_2

Cf_23_E2_1

Hs_3_E2_1

Mm_9_E2_1

Rn_8_E2_1

Gg_E2_2

Xt_E2_2

Xt_E2_1

Mm_9_E2

Rn_8_E2

Hs_3_E2

Cf_23_E2

Gg_E2_1 0.00

Dr_25_E1_1Dr_25_E3_3

Dr_25_E3_5Dr_25_E3_2

Dr_25_E3_4Dr_25_E3_6

Tr_E3_1Tn_E3_2

Tr_E3_2Tn_E3_1

Xl_E3_2Gg_9_E3_1

Xl_E3_1Cf_34_E3_2

Hs_3_E3_2Pt_2_E3_2

Rn_11_E3_1Mm_16_E3_2

Cf_34_E3_1Hs_3_E3_1

Pt_2_E3_1Mm_16_E3_1

Rn_11_E3_2

FrTnTn

GgMmRn

XlYa XlY3

OmY1HsY4

Summary

The genotype-phenotype map of RNA is charcterized by aninterplay of “ruggedness” and neutrality

Selection plus drift results in diffusion on neutral networks

Many non-coding RNAs have highly constrained (i.e.,evolutionarily very well conserved) structures but fairly rapidlyevolving sequences

Drift of sequences is independent of the details of theselection mechanism

Ongoing research: elucidate the evolutionary histories ofstructured ncRNAs

PART III: The Modern RNA World

pre−mRNA

tRNA rRNA

pre−miRNA7SL + proteinsY + proteins

NUCLEOLUS

CAJAL BODY

7SL Y RNA vRNApre−tRNA RNAse_P

MRPmiRNA

pre−miRNA

pri−miRNA

pre−rRNAsnoRNA

rRNASplicosome

snRNA scaRNA

Drosha

Introns mRNA

translational inhibition

RISCRNAi

Ro RNP SRP

Ribosome

vRNA + proteins

Proteins

NUCLEUS

Multiple Origins of ncRNAs

CraniataCephalochordata

UrochordataEchinodermataHemichordata

ChoanoflagellataFungiMicrosporidia

AlveolatesStramenophilesRhodophyta

Chordata

Metazoa

Green Plantsother protists

Eukarya

Protostomia

Bacteria

ArcheaLUCA

snoRNA C/DsnoRNA H/ACA

RNAse MRPtelomerase RNAmost snRNAsvault RNAs ?

U7 Y RNA

tRNARNAseP7SL/SRP small bacterial RNAs

in Kinetoplastids onlygRNA

microRNA in multicellular animals and plants only ?

Surveys for noncoding RNAs

> 5% of the human genome is under stabilizing selection(from man/mouse comparison), less than 1/3 of this codes forprotein

Virtually the entire genome is transcribed as primary nucleartranscripts in at least one direction(ENCODE Genes&Transcripts group, unpublished data)

∼ 80% of the ENCODE regions are transcribed in as parts ofprotein coding transcripts including introns and UTRs

Only a tiny part of the primary transcripts is protein coding

Large fraction of apparently non-protein-coding cDNAs

The functions of most of these transcripts are unclear.

“There is need for reliable experimental and computational methods

for comprehensive identification of non-coding RNAs.”

–International Human Genome Sequencing Consortium, Nature 431, p.943, October 2004

The ENCODE Project

ENCyclopedia Of DNA Elements

Public research consortium launched by NHGRI in 2003

Purpose: “testing and comparing existing methods torigorously analyze a defined portion of the human genomesequence”.

Focus: specified 30 megabases ( 1% of genome) in more than20 species

Informally organized in subgroups: Sequencing Technology,Comparative Genomics, Genes and Transcripts, GeneticVariation, ...

Results from 1st phase currently under review

Phase 2: scale-up to complete genome

Highlights from

ENCODE Genes and Transcripts Analysis Group

(Data presented by Tom Gingeras in Bethesda, Jan 12 2006)

Only a fraction of processed RNA transcripts correspond toGeneCode annotated transcripts:70% correlated with annotated (m)RNAs52% correlate with annotated protein coding sequences

Substantial fraction of transcription is specific of cellularconditionsonly 2.6% of transfrags are common to all 11 cell-lines.

The same genomic sequence may be processed into multipleRNA sequences with different fates

Virtually the entire genome is transcribed as primary nucleartranscript in at least one direction.

Transcriptional output is MUCH more extensive AND much more

complex than previously thought.

Recall: Sequence-Structure Map of RNA

1. Redundancy: Many more sequences than structures

2. Sensitivity: Small changes in the sequences may lead to largechanges in the structure

3. Neutrality: A substantial fraction of mutations does not alter thestructure.

4. Isotropy: S(ψ) is “randomly” embedded in C (ψ).

Implications:1. Neutral Networks: S(ψ) forms a connected “percolating” network in

sequence space for all “common” structures.

2. Shape Space Covering: Almost all structures can be found in arelatively small neighborhood of almost every sequence.

3. Mutual Accessibility: The neutral networks of any two structuresalmost touch each other somewhere in sequence space.

Proc.Roy.Soc.B 255 279-284 (1994), Proc. Natl. Acad. Sci. USA 93, 397-401 (1996),

Bull. Math. Biol. 59, 339-397 (1997), RNA 7: 254-265 (2000).

RNA Sequencs

Multiple Sequence Alignment

CLUSTAL W

Minimum Energy Folding

Mountain Representations

Vienna RNA Package

Secondary Structures

Aligned Structures

Detect conserved sub-structures Confirmed conserved sub-structures

CHECK:

compensatory

mutations

. . .. . .. . .. . .

. . .. . .. . .

.. .. ..

... . .

. . .. . .

. . ... .. ... . .. . ..

.... .

.. . .

......

.. . . ..

......

. . ... .. ... . ... .. ..

RNA Sequencs

Dot Plots

Multiple Sequence Alignment

Combined Pair

Conserved sub-structures

McCaskill’s

Algorithm

CLUSTAL W

UGUGGUCGAUAU 0.99

sequence and pairing probability

compensatory

mutations

Credibility RankingReduce Pair List

Minimum Energy Base Pairing Probabilities

Nucl. Acids Res. 26: 3825-3836 (1998), Comp. & Chem. 23: 401-414 (1999)

Examples: HIV-1 TAR-hairpin

. . . . . (((((((((((. (((((. . . ((((. . . . . . )))))))))))))))))))).

0 10 20 30 40 50

. . . . . ( ( ( ( ( ( ( ( ( ( ( . ( ( ( ( ( . . . ( ( ( ( . . . . . . ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) . ..

Flaviviridae: Nucl. Acids Res. 29: 5079-5089 (2001), Picornaviridae: J. Gen. Virol. 85: 1113-1124 (2004), Broad

survey: Bioinformatics 20: 1495-1499 (2004)

Examples: Picornaviridae: Cis-acting-Replication Element

The function of the CRE probably involves the initiation of the synthesisof the negative-sense strand template RNA during virus replication.

CGACGGUUA

CAA C C A

AGACCGUCG C

AGUUCAAG

A A U GCCG

A UU A

ACGGCCAC

CC C A A U

GCCGU A

UCAUAUACCG

A A C ACUA

GUGAUGAU G

AAGUCAUCGUUG

A A C GAAA

ACGGUGGCC

ACGGCUA

CAA A C A

AAGCUGU

UUUUGCAUUUUG

CAAGAUGUAGAG

U AG U

Aphthovirus Enterovirus Cardiovirus HRV-A HRV-B Teschov. Hepatov.region:2C 2C 1B 2A 1B 2C 2C

Aphto ~~~~CGAC-GGUU------ACA-CCAAGCA------GACCGUCG~~~~~Entero CAUACAGU-UCAAG--------UCCAAAU-GCCGUAUUGAACCUGUAUGCardio ~~~~~ACG-GCCA---CAAACACCCAAUCAACUGU-UGGCCGU~~~~~~HRV-A ~~~AUCAUAUACCGAACAAACA---------CUAUAGGUGAUGAU~~~~HRV-B GAAGUCAU-CGUUGAGAAAACG---AAACA------GACGGUGGCCUC~Tescho ~~~~~~AC-GGCU--ACAAACA-----ACA------AGCUGU~~~~~~~Hepato UUUUGCAU-UUUG---CAAA--------------UUCAAGAUGUAGAG~ ~~~(((((-((((.......................)))))))))~~~~ 1.......10........20........30........40.........

predicted in Nucl. Acids Res. 29 5079-5089 (2001),

experimentally detected by Gerber, Wimmer Paul J.Virol. 75 10979-10990 (2001).

A Method for Large Genomes: RNAz

∗ Two ingredients: Thermodynamic Stability & StructureConservation

Measuring thermodynamic stability of ncRNAs

Naturally occurring structured RNAs have a lower foldingenergy compared to random sequences of the same size andbase composition?

1. Calculate native MFE m.2. Calculate mean µ and standard deviation σ of MFEs of a large

number of shuffled random sequences.3. Express significance in standard deviations from the mean as

z-score

z =m − µ

Negative z-scores indicate that the native RNA is more stablethan the random RNAs.

Efficient calculation of stability z-scores

The mean µ and standard deviation σ ofrandom samples of a given sequence arefunctions of the length and the basecomposition:

µ, σ(length,GC

Calculating z-scores is thus a 5 dimensionalregression problem.

The regression problem is solved using aSupport Vector Machine regression algorithm.

The SVM was trained on 10,000 syntheticsequences spaced evenly in the variable space.

The regression calculation is of the sameaccuracy as the sampling procedure.

s-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

Sampled z-scores

z-scores of known ncRNAs

ncRNA Type No. of Seqs. Mean z-score

tRNA 579 −1.845S rRNA 606 −1.62Hammerhead ribozyme III 251 −3.08Group II catalytic intron 116 −3.88SRP RNA 73 −3.37U5 spliceosomal RNA 199 −2.73

Functional RNAs are clearly more stable than randomsequences.

However: The scores are too small to discriminate reliably in agenome-wide screens since the z-score distributions haveheavy tails.

Consensus folding using RNAalifold

RNAalifold uses the same algorithms and energy parametersas RNAfold

Energy contributions of the single sequences are averaged

Covariance information (e.g. compensatory mutations) isincorporated in the energy model.

It calculates a consensus MFE consisting of an energy termand a covariance term:

J.Mol.Biol. 319:1059-1066 (2002)

The structure conservation index

The SCI is an efficient and convenient measure for secondarystructure conservation.

Separation of native ncRNAs from random controls in two

dimensions

5S rRNA tRNASignal recognitionparticle RNA

RNAseP U2 spliceosomal RNA U5 spliceosomalRNA

z-score

Classification based on both scores

Implementation and availability

The approach is implemented in ANSI C in the program RNAz.

The z-score regression is limited to 400 nucleotides.

The classification model is currently limited to alignments ofsix sequences.

At least an order of magnitude faster than other programs.

RNAz is freely available:Download from www.tbi.univie.ac.at/∼wash/RNAz

Proc. Natl. Acad. Sci. USA 102: 2454-2459 (2005)

Screening the human genome

Large scale comparative screen including: human, mouse, rat, dog chicken fugu, zebrafish

Reduction of the ≈ 3.095 MB human genome: Take ≈ 5% of the best conserved regions Remove all annotated coding exons Only take alignments strictly conserved in all 4 mammals.

→ 438,788 alignments alignments covering 82.64 MB

92.0M 94.0M 96.0M 98.0MMost conserved noncoding regions (present in at least human/mouse/rat/dog)

RNAz structural RNAs (P>0.5)

RefSeq Genes

90801000 90801500RNAz structural RNAs (P>0.9)

miRNAsmir-17 mir-19a mir-19b-1

mir-18 mir-20mir-92-1

(((((..((((((..((((((((.((.(((((...(((........)))...))))).)).))))))))...))))))....)))))

GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATGT-GCATCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-GTGAC

GTCAGAATAATGTCAAAGTGCTTACAGTGCAGGTAGTGATGTGT-GCATCTACTGCAGTGAGGGCACTTGTAGCATTA-TG-CTGAC

GTCAGGATAATGTCAAAGTGCTTACAGTGCAGGTAGTGGTGTGT-GCATCTACTGCAGTGAAGGCACTTGTGGCATTG-TG-CTGAC

GTCAGAGTAATGTCAAAGTGCTTACAGTGCAGGTAGTGATATATAGAACCTACTGCAGTGAAGGCACTTGTAGCATTA-TG-TTGAC

GTCAATGTATTGTCAAAGTGCTTACAGTGCAGGTAGTATTATGGAATATCTACTGCAGTGGAGGCACTTCTAGCAATA-CACTTGAC

GTCTGTGTATTGCCAAAGTGCTTACAGTGCAGGTAGTTCTATGTGACACCTACTGCAATGGAGGCACTTACAGCAGTACTC-TTGAC

HumanMouseRatChickenZebrafishFugu

G U C A GA A

U A A U G UC A

A A G U G C U UA C A

G U G C A GG

U AG U G

UACUGCA

GUGAAGGCACUU

GUAGCAUUA

_UG_UUGAC

93104k 93106k 93108kRNAz structural RNAs (P>0.5)

H/ACA snoRNAs

C/D-box snoRNAs

ACA25ACA32

ACA1ACA8

ACA18ACA40

mgh28S-2412mgh28S-2410

Chr. 13

Chr. 11

Results of Human Genome Screen

Genome Coverage Alignments RNAz hits p > 0.9Size Fraction Number Size Fraction of Number

(MB) (%) (MB) input (%)Human genome 3,095.02 100.00 –PhastCons most conserved 137.85 4.81 1,601,903without coding regions 110.04 3.84 1,291,385without alignments < 50nt 103.83 3.33 564,455Set 1: 4 Mammals 82.64 2.88 438,788 5.46 6.62 35,985Set 2: + Chicken 24.00 0.85 104,266 1.34 5.50 8,802Set 3: + Fugu or zebrafish 6.86 0.24 30,896 0.14 2.03 996

Nature Biotechn. 23: 1383-1390 (2005)

Pictures instead of Numbers

91,676

35,958

20,391

8,8022,916 9965,6616,898

26,508

2087950

30,000

60,000

90,000

ents Observed

Expected

P > 0.5 P > 0.9

6.6%15.1%

Structural RNA

Estimated false positives

Other conservednoncoding element s

4 Mammals

4 Mammals+ chicken

All vertebrates

P > 0.5 P > 0.5 P > 0.5P > 0.9 P > 0.9 P > 0.9

Distribution related to known protein gene annotation

Known gene< 10 kb from nearest gene> 10 kb from nearest gene

Intron of coding region3’−UTR (exon or intron)

1538016860

283011205

5’−UTR (exon or intron)

Sensitivity on known classes of ncRNAs

Detected ( > 0.9)P

Detected (0.5 < < 0.9)P

Not detected

Not in input setmicroRNA

(207)C/D snoRNA

(256)H/ACA

Not all ncRNAs have conserved secondary structures!

chr7: RNAz_set1_50

EvoFoldsno/miRNA

Conservation

RepeatMasker

26.90m 26.95m 27.00m 27.05m

chr7.279

chr7.283

chr7.287

HOXA10HOXA9

chr7.290

HOXA11

hoxa11-as

HOXA13 chr7.295

HOXA1HOXA1

AC004079.7AC004079.7AC004079.7

AC004079.7HOXA2

HOXA3HOXA3

AC010990.1HOXA3HOXA3

AC010990.1AC010990.1

AC010990.1

AC004080.14HOXA5HOXA5

HOXA6HOXA6

AC004080.14HOXA6

AC004080.14AC004080.14

HOXA7HOXA7

HOXA9HOXA9

HOXA10HOXA10

HOXA10

HOXA10HOXA10

HOXA11HOXA11

HOXA13 EVX1

AC004080.12AC004080.12AC004080.12

AC004080.13

AC004080.15AC004080.15

AC004080.1AC004080.1

AC004080.1AC004080.17

AC004080.18AC004080.19

Affy Transcription

GENCODE

GENCODEputative

alternativesplicing

Other RNAz Screens

Urochordates: Ciona intestinalis & Ciona savignyi

only a few conserved RNA with Oikopleura dioica

Bioinformatics 21(S2): i77-i78

Nematodes: Caernorhabditis elegans & Caenorhabditis

briggsae

JEZ:MDE 2006 epub

Teleost fishes: Danio rerio, Takifugu rubripes, Tetraodon

nigroviridis, Oryzias latipes (partial)(in progress)

Trypanosomatids: Trypananosoma and Leishmania species

Yeasts. (joint work with Kay Nieselt and Stephan Steigele)

Summary

Predicted structured RNAs (RNAz predictions, p > 0.9)

Teleosts Mammals36000

Yeasts Nematods Insects Urochordates4000 20002500

Trypanosomatids

rRNA, tRNA, snoRNA, snRNA ...

>10000

1000 unknownconserved

500<200

Novel Human ncRNA Candidates

GGAGGCCU

UUGUCCGCUG

G C A G CGUU A U

GG G A

A G G C CA C CUUCCAAAGCCUGCACAAGGG

UGG A G G

CCCCUC

0 10 20 30 40 50 60 70 80 90 100

Novel ncRNA Candidates in Caenorhabditis

_ACCUU

AU _ A C C

UCGAUGAA

CCACUAA

GUCU C

AGUGGGUUAU

UAUUUCUCUC

CGGG G

GACCG U

U CGGUC

AGUCAAUGGGUU

UU C A A

CCAUUGACAA

CeN23 (UM1) CeN74 (UM3) CeN77 (UM3)unknown sb-RNA sb-RNA

__GAUCA

CAUGCU

CUCAACCAG

UUA C C C U A C C

UGG CUGUGG

C C C A C A GU

GUACAG

GC A C

CCG A C AC

C C A CA

A C CG

CAUGACACU

A U AAC

C C G G GU

AUACGGU

AGGUGGCG

CAAGAG

C A GAAA

UAAUCGAU

GUUUGAAUUGUUUCAAUUGU

UGCAAGGAAACAAU C C

UUCAAA

AUCAAUCU

AGCAUA

CUCAGAUAUCAAUU__UCUACAACA

UGAUUAGCAU

AA A C C A A A A

ACAAACCAC

513253 515948 513590(UM2) (UM3) (UM1)

Efforts to Annotate the RNAz Results

ongoing effort

Large number of microRNA candidates approximately 30-40 good H/ACA-box snoRNAs only 6% of hits (comparable to estimated false positive rate)

overlaps with predicted coding regions few clusters of signals with high sequence-similarity

work in progress: structure-based clustering (joint work withRolf Backofen’s lab in Freiburg)

BOTTOM LINE: most signals still unclassified.We need MUCH better methods to recognize members of knownRNA classes

RNAmicro: A classificator for microRNA Precursors

Input: Multiple Sequence alignment

Preprocessing: non-restrictive check for almost-hairpinstructureSome known microRNA precursors, notably some let-7

family members have small branches!

SVM Classification with few descriptors:Property # DescriptorsStructure 2 ls , lhSequence composition 1 G+CSequence conservation 4 S5′ , S3′ , S0 , SminThermodynamic stability 4 E , ǫ, η, z

Structure conservation 1 Econs

ISMB 2006, in press

Results: Caenorhabditis

351 158

RNAmicroP > 0.5 P > 0.9

626 31 5

1251452675

miRNA registry 7.1

Grad et al 2003

other RNAs

Results: Mammals

5440 1491

RNAmicroP > 0.5 P > 0.9

208481

miRNA registry 7.1

Berezikov et al. 2005

203014 3826 1260

Clustering

AB DE C

dot.ps

U A C G A C G G A C U U A C G G A C U U A C G

U A C G A C G G A C U U A C G G A C U U A C GUA

dot.ps

A U C A C U C G U A C U G U A C

A U C A C U C G U A C U G U A CAU

dot.ps

U A C G A C G G A C U U A C G G A C U U A C G

U A C G A C G G A C U U A C G G A C U U A C GUA

Arg/Asn2 2 Arg 2 Thr 2 ~ Ile2 Phe Cys 2 2 Lys 2 Gln 3 Gly 2 2 Val Glu 2 Pro~ 2 Met Arg Met Gln Ala ~~~4 2 Ile 4 Ser Leu Tyr

0.050.1

0.150.2

0.250.3

0.350.4

Summary

Some classes of ncRNAs, namely the structures ones, can befound efficiently by means of comparative genomics

There are Tens of Thousands of structured RNAs of unknownfunction in the human genome

Some of them probably act, like microRNA and snoRNAs bybinding to other RNAs. These could be investigated usingRNA cofolding approaches (ongoing research).

So far, we know only of the proverbial tip of the iceberg of the

complexity of cellular regulation

& RNA bioinformatics is a really cool research topic ...

Acknowledgments: It’s not my fault . . .

Leipzig: Kristin Missal, Dominic Rose, Jana Hertel, ManjaLindemeyer, Matthias Kruspe, Sonja J. Prohaska, ClaudiaFried, Roman Stocsits, Axel Mosig Bettina Muller (FH

Weihenstephan), Katrin Sameith (U Jena)

Vienna: Stefan Washietl, Ivo L. Hofacker, Christoph Flamm,Andrea Tanzer, Stefan BernhartSusanne Rauscher, Caroline Thurner, Christina Witwer, Ingrid

Abfalter, and many others

Havard: Walter Fontana

Beijing: Xiaopeng Zhu, Wei Deng, Geir Skogerbø, RunshengChen

Tubingen: Stephan Steigele, Kay Nieselt,

Copenhagen: Jan Gorodkin, Stefan Seemann

Freiburg: Rolf Backofen, Sebastian Will

rna bioinformatics - dartmouth computer sciencerockmore/csss2006/stadler...rna bioinformatics peter...

Documents

rna and protein structure prediction bioinformatics...

biol335: rna bioinformatics

rna-seqdata analysis - cornell...

the rna world, fitness landscapes, and all thatthe rna...

bioinformatics for dna - seq and rna- seq experiments

rna bioinformatics - göteborgs...

cs5263 bioinformatics rna secondary structure prediction

rna bioinformatics genes and secondary structure

introduction to bioinformatics - tutorial no. 9 rna...

rna bioinformatics marcela davila-lopez...

lecture 6. topics in rna bioinformatics the chinese...

introduction to bioinformatics - rna seq - marcel willemsen...

bioinformatics pages 1–7 - arxivarxiv:0908.0597v1...

a bioinformatics characterization of secondary metabolism...

bioinformatics for rna-seq - github pages

rna sequencing and bioinformatics analysis of human lens...

rna informatics unit 12 biol221t: advanced bioinformatics...

fr3d userâ€™s manual - bgsu rna bioinformatics lab

bioinformatics tools for rna data...

introduction to rna-seq and rna-seq data analysis (ueb-uat...