1 10 ראוני 03 - bgumichaluz/seminar/rna_guy_itai.pdf · 2010-01-03 · biological sequence...


Page 1: 1 10 ראוני 03 - BGUmichaluz/seminar/RNA_guy_itai.pdf · 2010-01-03 · Biological sequence analysis: probabilistic models of proteins and nucleic acids by Richard Durbin 38 10


RNA

Ribonucleic acid (RNA) is a biologically

important type of molecule that consists of a

long chain of nucleotide units.

RNA is very similar to DNA, but differs in a

few important structural details.

There are several types of RNA, including mRNA, tRNA and miRNA.


DNA & RNA

Similarity

DNA and RNA are both part of the process of building a protein in the cell.

RNA and DNA are both nucleic acids.

They share three bases: Adenine (A), Cytosine (C) and Guanine (G).


DNA & RNA

Difference

DNA contains the base Thymine (T), whereas RNA contains the base Uracil (U).

Unlike DNA, which is double-stranded, RNA is a single-stranded molecule in most of its biological roles.

RNA is the next stage after DNA in protein synthesis.

DNA is more stable than RNA.


Protein synthesis

Protein synthesis is the process in which cells

build proteins.

The synthesis has two stages: transcription and

translation.

At the end of the process, the cell produces

protein from DNA.
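The two stages can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are not in the slides: the input is the DNA coding strand (so transcription reduces to replacing T with U), and `CODON_TABLE` is a small illustrative subset of the standard genetic code, not the full 64-codon table. The function names `transcribe` and `translate` are hypothetical.

```python
# Illustrative subset of the standard genetic code (assumption: not the full table).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala", "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand: str) -> str:
    """Stage 1: return the mRNA for a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list:
    """Stage 2: read codons from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE.get(mrna[pos:pos + 3], "???")  # "???" = codon not in the subset
        if amino == "Stop":
            break
        protein.append(amino)
    return protein


mrna = transcribe("ATGTTTGGCTAA")
print(mrna)             # AUGUUUGGCUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly']
```

A real cell transcribes from the template strand and uses the complete codon table; the sketch only shows the flow from DNA to RNA to an amino-acid chain.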


Protein synthesis - Transcription

In transcription, an RNA chain is generated from DNA.

The DNA double helix is locally "unzipped" by the RNA polymerase complex, leaving a single strand open to be copied.

RNA polymerase reads the unzipped strand and produces a complementary RNA chain.

In the end, the DNA is "zipped" again.

In eukaryotes, transcription occurs inside the nucleus.


[Figure: Protein synthesis - Transcription]

Protein synthesis - Translation

The synthesis of proteins from RNA is known as translation, and it occurs in the ribosome.

In translation, the mRNA is decoded: the ribosome builds a chain of amino acids from the RNA template.

At the end of this process, we have a protein.

Translation occurs outside the nucleus, in the cytoplasm.


[Figure: Protein synthesis - Translation]

[Figure: Protein synthesis]

RNA secondary structure

A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given just the nucleic acid sequence.

In RNA, much of the final structure is determined by the secondary structure, that is, the base-pairing interactions of the molecule.

What are base pairs?

[Figure: Base-pairs]

Base-pairs

Two nucleotides (DNA/RNA letters) on opposite complementary DNA or RNA strands that are connected via hydrogen bonds are called a base pair. The more base pairs a molecule has, the more stable its three-dimensional structure is.

In DNA, base pairs are: A + T, G + C.

In RNA, base pairs are: A + U, G + C.

The human genome contains an

estimated three billion base pairs.
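The pairing rules above can be written down as a small lookup. This is a minimal sketch: the names `RNA_PAIRS`, `DNA_PAIRS`, `can_pair` and `dna_complement` are hypothetical, and only the pairs listed on the slide are modelled.

```python
# Pairing rules exactly as listed on the slide (other pairings are not modelled).
RNA_PAIRS = {("A", "U"), ("G", "C")}
DNA_PAIRS = {("A", "T"), ("G", "C")}


def can_pair(x: str, y: str, pairs=RNA_PAIRS) -> bool:
    """True if bases x and y form a base pair, in either order."""
    return (x, y) in pairs or (y, x) in pairs


def dna_complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand (not reversed)."""
    partner = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(partner[base] for base in strand)


print(can_pair("A", "U"))             # True
print(can_pair("A", "T"))             # False: T does not occur in RNA
print(can_pair("A", "T", DNA_PAIRS))  # True
print(dna_complement("ATGC"))         # TACG
```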


RNA sequence evolution by structure

It is relatively common to find examples of

homologous RNAs that have a common

secondary structure without sharing significant

sequence similarity.

It would be advantageous to be able to search

for conserved secondary structure in addition

to conserved sequence when searching

databases for homologous RNAs.


RNA sequence evolution by structure

It is possible to search genomes using an RNA pattern-matching program, which searches for deterministic motifs with secondary structure constraints as extra terms.

This works well for small, well-defined patterns, but not for less well conserved structures.

Currently, database searches may be done by writing a carefully customized program for each RNA structure of interest.

As the number of different interesting RNAs grows, this is an increasingly unsatisfactory state of affairs.


RNA secondary structure prediction

One application of bioinformatics uses

predicted RNA secondary structures in

searching a genome for forms of RNA.

One of the methods to predict RNA secondary

structure is to maximize the number of base-

pairs.

One of the issues when predicting RNA secondary structure is that the standard recursions exclude pseudoknots.

What are pseudoknots?

Pseudoknots

A pseudoknot is an RNA secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem.

Elena Rivas and Sean Eddy published a dynamic programming algorithm that could handle pseudoknots. However, the time and memory requirements of the method are prohibitive.
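In terms of indices, the intercalated stems of a pseudoknot show up as crossing base pairs: pairs (i,j) and (k,l) with i < k < j < l. The following minimal sketch tests a set of pairs for such a crossing; the function name `has_pseudoknot` and the representation of a structure as a collection of index pairs are assumptions made for illustration.

```python
def has_pseudoknot(pairs) -> bool:
    """Return True if any two base pairs (i, j) and (k, l) cross: i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False


print(has_pseudoknot([(0, 5), (1, 4)]))  # False: nested pairs (a plain stem)
print(has_pseudoknot([(0, 1), (2, 3)]))  # False: side-by-side pairs
print(has_pseudoknot([(0, 3), (2, 5)]))  # True: the pairs cross
```

The check is quadratic in the number of pairs, which is enough for illustration.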

[Figure: Pseudoknots]

RNA structure prediction by base-pair maximization

Input:

A string over the alphabet {A, C, G, U}.

A can pair with U, and C can pair with G.

Output:

A subset of the possible base pairs of maximal size such that no two base pairs intersect, that is, no base belongs to two pairs and no two pairs cross.
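One way to make the output concrete is to store a structure as a set of 0-based index pairs and print it in dot-bracket notation. Both choices are assumptions for illustration and do not come from the slides, and all function names are hypothetical. The validity check uses the fact that a set of pairs is non-crossing exactly when parsing its dot-bracket string gives the same pairs back.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def to_dot_bracket(length: int, pairs) -> str:
    """Render pairs as a string of '(', ')' and '.' (unpaired)."""
    chars = ["."] * length
    for i, j in pairs:
        i, j = min(i, j), max(i, j)
        if i == j or chars[i] != "." or chars[j] != ".":
            raise ValueError("a base takes part in more than one pair")
        chars[i], chars[j] = "(", ")"
    return "".join(chars)


def from_dot_bracket(text: str) -> set:
    """Recover the nested pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def is_valid_structure(seq: str, pairs) -> bool:
    """Check complementarity, one pair per base, and no crossing pairs."""
    normalized = {(min(i, j), max(i, j)) for i, j in pairs}
    if any((seq[i], seq[j]) not in COMPLEMENTARY for i, j in normalized):
        return False
    try:
        text = to_dot_bracket(len(seq), normalized)
    except ValueError:
        return False
    return from_dot_bracket(text) == normalized


structure = {(1, 8), (2, 7), (5, 6)}
print(to_dot_bracket(9, structure))                # .((..()))
print(is_valid_structure("GGGAAAUCC", structure))  # True
```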


Motivation

We want to maximize the number of base pairs for several reasons.

As we said before, a higher number of base pairs means a more stable molecule.

In addition, researchers want to find known RNA structures that are similar to a given RNA. The number of base pairs can be used to predict the structure, which makes the search for similar RNAs much faster than the naive approach.


Ruth Nussinov

Professor in the Department of

Human Genetics, School of Medicine,

Tel Aviv University.

Proposed the dynamic programming algorithm

for RNA secondary structure prediction, first

by maximizing the number of base pairs

(1978) and later introducing the so-called

'energy rules' into the algorithm (1980).


RNA secondary structure prediction

algorithms

Suppose we wish to predict the secondary

structure of a single RNA.

Many secondary structures can be drawn for a

sequence.

The number increases exponentially with

sequence length.

An RNA with only 200 bases has over 10^50 possible base-paired structures.
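The growth can be explored with a short counting recursion. This is a sketch under stated assumptions, not the calculation behind the figure quoted above: it counts non-crossing structures for one specific sequence, allows only A-U and G-C pairs, imposes no minimum loop length, and counts the fully unpaired structure as well. The name `count_structures` is hypothetical.

```python
from functools import lru_cache


def count_structures(seq: str) -> int:
    """Count non-crossing base-paired structures of seq (including the empty one)."""
    allowed = {"AU", "UA", "GC", "CG"}

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if i >= j:
            return 1  # empty or single-base range: only the unpaired structure
        total = count(i + 1, j)  # position i stays unpaired
        for k in range(i + 1, j + 1):  # position i pairs with position k
            if seq[i] + seq[k] in allowed:
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)


print(count_structures("AU"))   # 2: unpaired, or A paired with U
print(count_structures("AUA"))  # 3
for n in (4, 8, 16, 32):
    print(n, count_structures("GC" * (n // 2)))
```

Printing the counts for the repeated "GC" sequences of doubling length gives a feel for how quickly the number of candidate structures grows.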


RNA secondary structure prediction

algorithms cont’d

We must distinguish the biologically correct

structure from all the incorrect structures.

We need both a function that assigns the correct structure the highest score, and an algorithm for evaluating the scores of all possible structures.


RNA secondary structure prediction algorithms cont’d

One approach might be to find the structure with the most base pairs.

Nussinov introduced an efficient dynamic programming algorithm for this problem in 1978.

Although this criterion is too simplistic, the mechanics of this algorithm are the same as those of more sophisticated energy minimization folding algorithms and of SCFG-based algorithms.

Nussinov folding algorithm

The algorithm is recursive: it calculates the best structure for small subsequences and works its way outwards to larger and larger subsequences.

The key idea of the recursive calculation is that

there are only four possible ways of getting the

best structure for i,j from the best structures of

the smaller subsequences.


The Nussinov algorithm looks at four ways in which the best RNA structure for a subsequence i,j can be made:

1. add the i,j pair onto the best structure found for subsequence i+1, j-1;

2. add unpaired position i onto the best structure for subsequence i+1, j;

3. add unpaired position j onto the best structure for subsequence i, j-1;

4. combine two optimal substructures i, k and k+1, j.

[Figure: the four cases: i,j pair; i unpaired; j unpaired; bifurcation.]

More formally, the Nussinov RNA folding algorithm is as follows. We are given a sequence x of length L with symbols x_1, x_2, …, x_L.

Let P(i,j) = 1 if x_i and x_j are a complementary base pair, and P(i,j) = 0 otherwise.

We will recursively calculate scores S(i,j), the maximal number of base pairs that can be formed for the subsequence x_i, …, x_j.


Nussinov RNA folding, fill stage

Initialization:

S(i,i-1) = 0, for i = 2 to L

S(i,i) = 0, for i = 1 to L

Recursion: starting with all subsequences of length 2, up to length L:

\[
S(i,j) = \max \begin{cases}
S(i+1,j-1) + P(i,j), \\
S(i+1,j), \\
S(i,j-1), \\
\max_{i<k<j} \left[ S(i,k) + S(k+1,j) \right].
\end{cases}
\]
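The fill stage can be sketched in Python as follows. The sketch uses 0-based indices rather than the 1-based indices of the slides, takes P(i,j) to allow only A-U and G-C pairs, and relies on the zero-initialized matrix to supply the S(i,i-1) = 0 and S(i,i) = 0 entries. The names `pair_score` and `nussinov_fill` are hypothetical.

```python
def pair_score(x: str, y: str) -> int:
    """P(i,j): 1 if the two bases are complementary, else 0."""
    return 1 if (x, y) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")} else 0


def nussinov_fill(seq: str) -> list:
    """Return the matrix S, where S[i][j] is the maximal number of base pairs
    for the subsequence seq[i..j] (0-based, inclusive)."""
    n = len(seq)
    # Zero initialization covers S(i,i) and S(i,i-1).
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):  # span = j - i, so subsequence lengths 2..n
        for i in range(n - span):
            j = i + span
            best = max(
                S[i + 1][j - 1] + pair_score(seq[i], seq[j]),  # i,j pair
                S[i + 1][j],                                   # i unpaired
                S[i][j - 1],                                   # j unpaired
            )
            for k in range(i + 1, j):                          # bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


S = nussinov_fill("GGGAAAUCC")
print(S[0][len(S) - 1])  # 3
```

The three nested loops over the span, i and k are where the cubic running time of the fill stage comes from.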


for the sequence GGGAAAUCC

Initialization of the half-diagonal matrix.

for the sequence GGGAAAUCC

The matrix after scores for subsequences of length 2 have been calculated.

Recap of the recursion: S(i,j) is the maximum of S(i+1,j-1)+P(i,j) (i,j pair), S(i+1,j) (i unpaired), S(i,j-1) (j unpaired), and the maximum over i<k<j of S(i,k)+S(k+1,j) (bifurcation).

for the sequence GGGAAAUCC

An example of two different optimal substructures for the same subsequence.

For the subsequence AAAU, either the A at i and the U at j can be paired (diagonal path), or i can be added to a substructure that already pairs the A at i+1 with the U at j (vertical path).

[Figure: the two alternative structures for GGGAAAUCC.]

for the sequence GGGAAAUCC

The final matrix. The value in the upper right, S(1,L), indicates that the maximally paired structure has three base pairs.

Algorithm complexity

What is the time complexity of the algorithm?

The fill step is the limiting step, as it is O(L^2) in memory and O(L^3) in time.


Nussinov RNA folding, trace back

stage

There are often a number of alternative

structures with the same number of base pairs.

To find one of these maximally base-paired

structures, we trace back through the values we

calculated in the dynamic programming

matrix, beginning from S(1,L).

How can we do it?


Nussinov RNA folding, trace back stage cont’d

Initialization: push (1,L) onto the stack.

Recursion: repeat until the stack is empty:

- pop (i,j);
- if i >= j, continue;
- else if S(i+1,j) = S(i,j), push (i+1,j);
- else if S(i,j-1) = S(i,j), push (i,j-1);
- else if S(i+1,j-1)+P(i,j) = S(i,j):
    - record the i,j base pair;
    - push (i+1,j-1);
- else for k = i+1 to j-1: if S(i,k)+S(k+1,j) = S(i,j):
    - push (k+1,j);
    - push (i,k);
    - break.
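The trace back steps above can be sketched in Python as follows. The block repeats a compact fill stage so that it runs on its own; indices are 0-based, only A-U and G-C pairs are allowed, and the names `pair_score`, `fill` and `traceback` are hypothetical.

```python
def pair_score(x: str, y: str) -> int:
    """P(i,j): 1 if the two bases are complementary, else 0."""
    return 1 if (x, y) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")} else 0


def fill(seq: str) -> list:
    """Fill stage: S[i][j] is the maximal number of base pairs in seq[i..j]."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j - 1] + pair_score(seq[i], seq[j]),
                       S[i + 1][j], S[i][j - 1])
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def traceback(seq: str, S: list) -> list:
    """Recover one maximally base-paired structure as a list of (i, j) pairs."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if S[i + 1][j] == S[i][j]:          # i unpaired
            stack.append((i + 1, j))
        elif S[i][j - 1] == S[i][j]:        # j unpaired
            stack.append((i, j - 1))
        elif S[i + 1][j - 1] + pair_score(seq[i], seq[j]) == S[i][j]:
            pairs.append((i, j))            # record the i,j base pair
            stack.append((i + 1, j - 1))
        else:                               # bifurcation
            for k in range(i + 1, j):
                if S[i][k] + S[k + 1][j] == S[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return pairs


seq = "GGGAAAUCC"
print(sorted(traceback(seq, fill(seq))))  # [(1, 8), (2, 7), (5, 6)]
```

Because the unpaired cases are tested first, the sketch returns only one of the possibly many optimal structures; a different order of the tests can return a different one with the same number of pairs.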


Trace back complexity

What is the time complexity of the trace back?

The trace back is linear in time and memory.

The trace back we have shown is unbranched,

so the need for the pushdown stack is not

apparent. The pushdown stack becomes

important when bifurcated structures are

traced back.


Bibliography

Wikipedia

openwetware.org

Biological sequence analysis: probabilistic

models of proteins and nucleic acids by

Richard Durbin
