03 January 2010 - bgumichaluz/seminar/rna_guy_itai.pdf · 2010-01-03 · biological sequence...
TRANSCRIPT
RNA
Ribonucleic acid (RNA) is a biologically
important type of molecule that consists of a
long chain of nucleotide units.
RNA is very similar to DNA, but differs in a
few important structural details.
There are several types of RNA, among them
mRNA, tRNA, rRNA and miRNA.
DNA & RNA
Similarity
DNA and RNA are both part of the process of
building a protein in the cell.
RNA and DNA are both nucleic acids.
They share three bases: Adenine (A),
Cytosine (C) and Guanine (G).
DNA & RNA
Difference
DNA contains the base Thymine (T) and RNA contains the base Uracil (U).
Unlike DNA, which is double-stranded, RNA is a single-stranded molecule in most of its biological roles.
RNA is the next stage after DNA in protein synthesis.
DNA is more stable than RNA.
Protein synthesis
Protein synthesis is the process in which cells
build proteins.
The synthesis has two stages: transcription and
translation.
At the end of the process, the cell produces
protein from DNA.
Protein synthesis - Transcription
In transcription, an RNA chain is generated from DNA.
The DNA is "unzipped" locally by the enzyme RNA polymerase, leaving a single strand open to be copied.
RNA polymerase reads the exposed template strand and produces a complementary RNA.
In the end, the DNA is "zipped" again.
In eukaryotic cells, transcription occurs inside the nucleus.
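As a rough illustration of the copying step, the sketch below builds the RNA that is complementary to a DNA template strand. The function name `transcribe` and the example strand are made up for this illustration, and everything else that happens during real transcription (promoters, termination, processing) is ignored.

```python
# Complement rules when RNA is built on a DNA template:
# A in the template gives U in the RNA, T gives A, G gives C, C gives G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (written 5'->3') for a DNA template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


if __name__ == "__main__":
    # Hypothetical template strand, chosen only for the example.
    print(transcribe("TACCGGATT"))
```

Writing the template 3'->5' keeps the sketch to a single position-by-position lookup; with a 5'->3' template the result would also have to be reversed.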
[Figure: Protein synthesis - Transcription]
Protein synthesis - Translation
The synthesis of proteins is known as
translation and it occurs in the ribosome.
In translation, RNA is decoded. The ribosome
makes chains of amino acids from the RNA
template.
At the end of this process, we have a protein.
Translation occurs outside the nucleus, in the cytoplasm.
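The decoding step can be sketched as a lookup of three-letter codons. The slides do not give a codon table, so the table below (the standard genetic code, with `*` for stop codons) is an assumption of this example, as are the names `CODON_TABLE` and `translate`. Start-codon search and everything the ribosome does beyond the lookup are left out.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
# "*" marks a stop codon. The table is an assumption of this sketch.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read codons from the start of mrna until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))  # hypothetical mRNA fragment
```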
[Figure: Protein synthesis - Translation]
[Figure: Protein synthesis]
Protein synthesis video
http://www.weizmann.ac.il/zemed/net_activities.php?cat=1747&incat=1428&article_id=648&act=forumPrint
RNA secondary structure
A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given just the nucleic acid sequence.
In RNA, much of the final
structure is determined by the
secondary structure or base-
pairing interactions of the
molecule.
What are base pairs?
[Figure: Base-pairs]
Base-pairs
Two nucleotides (DNA/RNA letters) on opposite complementary DNA or RNA strands that are connected via hydrogen bonds are called a base pair.
The more base pairs a molecule forms, the more stable its three-dimensional structure tends to be.
In DNA, base pairs are: A + T, G + C.
In RNA, base pairs are: A + U, G + C.
The human genome contains an
estimated three billion base pairs.
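The pairing rules on this slide can be written down as two small lookup tables. This is only a sketch: the names `DNA_PAIRS`, `RNA_PAIRS`, `is_base_pair` and `count_duplex_pairs` are invented for the example, and only the pairs listed on the slide are treated as complementary.

```python
# Complementary bases as listed on the slide.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def is_base_pair(x: str, y: str, pairs: dict = RNA_PAIRS) -> bool:
    """True if bases x and y are complementary under the given rules."""
    return pairs.get(x) == y


def count_duplex_pairs(strand: str, other: str, pairs: dict = RNA_PAIRS) -> int:
    """Count paired positions when `other` runs antiparallel to `strand`."""
    return sum(is_base_pair(a, b, pairs) for a, b in zip(strand, reversed(other)))


if __name__ == "__main__":
    print(is_base_pair("A", "U"))              # RNA rules
    print(is_base_pair("A", "T", DNA_PAIRS))   # DNA rules
    print(count_duplex_pairs("GGAU", "AUCC"))  # two short hypothetical strands
```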
RNA sequence evolution by structure
It is relatively common to find examples of
homologous RNAs that have a common
secondary structure without sharing significant
sequence similarity.
It would be advantageous to be able to search
for conserved secondary structure in addition
to conserved sequence when searching
databases for homologous RNAs.
RNA sequence evolution by structure
It is possible to search genomes using an RNA pattern-matching program, which searches for deterministic motifs with secondary structure constraints as extra terms.
This works well for small, well-defined patterns but not for less well conserved structures.
Currently, database searches might be done by writing a carefully customized program for each RNA structure of interest.
As the number of different interesting RNAs grows, this is an increasingly unsatisfactory state of affairs.
RNA secondary structure prediction
One application of bioinformatics uses
predicted RNA secondary structures in
searching a genome for forms of RNA.
One of the methods to predict RNA secondary
structure is to maximize the number of base-
pairs.
One of the issues when predicting RNA
secondary structure is that the standard
recursions exclude pseudoknots.
What are pseudoknots?
Pseudoknots
A pseudoknot is an RNA secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem.
Elena Rivas and Sean Eddy published a dynamic programming algorithm that could handle pseudoknots. However, the time and memory requirements of the method are prohibitive.
[Figure: Pseudoknots]
RNA structure prediction by base pairs maximization
Input:
A string over A,C,G,U.
A can pair with U, C can pair with G.
Output:
A subset of possible base-pairs of maximal
size such that no two base-pairs intersect, that is, there are no two pairs (i,j) and (k,l) with i < k < j < l.
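A candidate output of this problem can be checked mechanically. The sketch below is an illustration, not part of the original slides: it uses 0-based positions, the name `is_valid_structure` is made up, and it additionally assumes that each base takes part in at most one pair, which the slide leaves implicit.

```python
def is_valid_structure(seq: str, pairs: set) -> bool:
    """Check a set of 0-based (i, j) pairs with i < j against the problem rules."""
    allowed = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    used = set()
    for i, j in pairs:
        # Positions must be in range and the two bases must be complementary.
        if not (0 <= i < j < len(seq)) or (seq[i], seq[j]) not in allowed:
            return False
        # Assumption: each base joins at most one pair.
        if i in used or j in used:
            return False
        used.update((i, j))
    # No two pairs may intersect: i < k < j < l is forbidden.
    return not any(i < k < j < l for i, j in pairs for k, l in pairs)


if __name__ == "__main__":
    seq = "GGGAAAUCC"
    print(is_valid_structure(seq, {(0, 8), (1, 7), (3, 6)}))  # nested pairs
    print(is_valid_structure(seq, {(0, 7), (1, 8)}))          # crossing pairs
```

The check only decides whether a set of pairs is admissible; finding an admissible set of maximal size is the job of the algorithm discussed on the following slides.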
Motivation
We want to maximize the number of base pairs for two reasons.
As we said before, a high number of base pairs means a more stable condition of the molecule.
In addition, researchers want to find known RNA structures that are similar to a given RNA. The number of base pairs can be used to predict the structure, which makes finding similar RNAs much faster than a naive search.
Ruth Nussinov
Professor in the Department of
Human Genetics, School of Medicine,
Tel Aviv University.
Proposed the dynamic programming algorithm
for RNA secondary structure prediction, first
by maximizing the number of base pairs
(1978) and later introducing the so-called
'energy rules' into the algorithm (1980).
RNA secondary structure prediction algorithms
Suppose we wish to predict the secondary
structure of a single RNA.
Many secondary structures can be drawn for a
sequence.
The number increases exponentially with
sequence length.
An RNA with only 200 bases has over 10^50
possible base-paired structures.
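To get a feel for how quickly the number of structures grows, one can count, for a short sequence, all sets of A-U and C-G pairs in which no two pairs cross. The sketch below is an assumption-laden illustration (the name `count_structures`, the 0-based indexing and the absence of any minimum loop length are choices of this example). It is not meant to reproduce the figure quoted on the slide, and its recursion is only suitable for short sequences.

```python
from functools import lru_cache


def count_structures(seq: str) -> int:
    """Count non-crossing sets of A-U / C-G pairs on seq, including the empty set."""
    allowed = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if i >= j:                      # zero or one base: only the empty structure
            return 1
        total = count(i + 1, j)         # position i stays unpaired
        for k in range(i + 1, j + 1):   # position i pairs with position k
            if (seq[i], seq[k]) in allowed:
                total += count(i + 1, k - 1) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, count_structures("AU" * (n // 2)))
```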
RNA secondary structure prediction algorithms cont’d
We must distinguish the biologically correct
structure from all the incorrect structures.
We need both a function that assigns the correct
structure the highest score, and an algorithm
for evaluating the scores of all possible
structures.
RNA secondary structure prediction algorithms cont’d
One approach might be to find the structure with the most base pairs.
Nussinov introduced an efficient dynamic programming algorithm for this problem in 1978.
Although this criterion is too simplistic, the mechanics of this algorithm are the same as those of more sophisticated energy minimization folding algorithms and of SCFG-based algorithms.
Nussinov folding algorithm
The algorithm is recursive: it calculates the
best structure for small subsequences and works
its way outwards to larger and larger
subsequences.
The key idea of the recursive calculation is that
there are only four possible ways of getting the
best structure for i,j from the best structures of
the smaller subsequences.
The Nussinov algorithm looks at four ways in which the best RNA structure for a subsequence i,j can be made:
1. add the i,j pair onto the best structure found for subsequence i+1, j-1;
2. add unpaired position i onto the best structure for subsequence i+1, j;
3. add unpaired position j onto the best structure for subsequence i, j-1;
4. combine two optimal substructures i, k and k+1, j.
[Figure: the four cases: i,j pair (around i+1, j-1); i unpaired (before i+1, j); j unpaired (after i, j-1); bifurcation (i, k joined with k+1, j)]
More formally, the Nussinov RNA folding algorithm is as follows. We are given a sequence x of length L with symbols x1,x2,…,xL.
Let P(i,j) = 1, if xi and xj are a complementary base pair, else P(i,j) = 0.
We will recursively calculate scores S(i,j) which are the maximal number of base pairs that can be formed for subsequence xi,…,xj.
Nussinov RNA folding, fill stage
Initialization:
S(i,i-1) = 0 , for i=2 to L
S(i,i) = 0, for i=1 to L
Recursion: starting with all subsequences of length 2, to length L:
\[
S(i,j) = \max \begin{cases}
S(i+1,j-1) + P(i,j) \\
S(i+1,j) \\
S(i,j-1) \\
\max_{i<k<j} \left[ S(i,k) + S(k+1,j) \right]
\end{cases}
\]
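The fill stage can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: indices are 0-based instead of the 1-based indices of the slides, the names `can_pair` and `nussinov_fill` are invented here, and only A-U and C-G count as complementary.

```python
def can_pair(a: str, b: str) -> int:
    """P(i,j): 1 if the two bases are complementary (A-U or C-G), else 0."""
    return int((a, b) in {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")})


def nussinov_fill(x: str) -> list:
    """Return the score matrix S with 0-based indices: S[i][j] scores x[i..j]."""
    L = len(x)
    # All zeros covers the initialization S(i,i) = 0 and S(i,i-1) = 0;
    # the lower triangle is never written and stays 0.
    S = [[0] * L for _ in range(L)]
    for length in range(2, L + 1):              # subsequence lengths 2 .. L
        for i in range(0, L - length + 1):
            j = i + length - 1
            best = max(
                S[i + 1][j - 1] + can_pair(x[i], x[j]),  # i,j pair
                S[i + 1][j],                             # i unpaired
                S[i][j - 1],                             # j unpaired
            )
            for k in range(i + 1, j):                    # bifurcation, i < k < j
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


if __name__ == "__main__":
    S = nussinov_fill("GGGAAAUCC")
    print(S[0][-1])  # score of the whole sequence, S(1,L) in the slides' notation
```

Filling by increasing subsequence length guarantees that every entry on the right-hand side of the recursion has already been computed when it is needed.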
[Figure: initialization of the half-diagonal matrix for the sequence GGGAAAUCC]
[Figure: the matrix for the sequence GGGAAAUCC after scores for subsequences of length 2 have been calculated]
An example of two different optimal substructures for the same subsequence.
For the subsequence AAAU, either the A at i and the U at j can be paired (diagonal path), or i can be added to a substructure that already pairs the A at i+1 to the U at j (vertical path).
[Figure: the two alternative structures drawn for the sequence]
The final matrix for the sequence GGGAAAUCC.
The value in the upper right, S(1,L), indicates that the maximally paired structure has three base pairs.
Algorithm complexity
What is the time complexity of the algorithm?
The fill step is the limiting step, as it is O(L^2) in
memory and O(L^3) in time.
Nussinov RNA folding, trace back stage
There are often a number of alternative
structures with the same number of base pairs.
To find one of these maximally base-paired
structures, we trace back through the values we
calculated in the dynamic programming
matrix, beginning from S(1,L).
How can we do it?
Nussinov RNA folding, trace back stage cont’d
Initialization: push (1,L) onto the stack.
Recursion: repeat until the stack is empty:
- pop (i,j);
- if i >= j, continue;
- else if S(i+1,j) = S(i,j), push (i+1,j);
- else if S(i,j-1) = S(i,j), push (i,j-1);
- else if S(i+1,j-1)+P(i,j) = S(i,j):
    - record the i,j base pair;
    - push (i+1,j-1);
- else for k = i+1 to j-1: if S(i,k)+S(k+1,j) = S(i,j):
    - push (k+1,j);
    - push (i,k);
    - break.
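The trace back steps above can be sketched with an explicit stack. As before, this is an illustration under assumptions: positions are 0-based, the helper names are invented for the example, and the fill stage is repeated in compact form only so that the block runs on its own.

```python
def can_pair(a: str, b: str) -> int:
    """P(i,j): 1 for A-U or C-G, else 0."""
    return int((a, b) in {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")})


def nussinov_fill(x: str) -> list:
    """Compact fill stage; S[i][j] is the best score for x[i..j] (0-based)."""
    L = len(x)
    S = [[0] * L for _ in range(L)]
    for length in range(2, L + 1):
        for i in range(L - length + 1):
            j = i + length - 1
            best = max(S[i + 1][j - 1] + can_pair(x[i], x[j]), S[i + 1][j], S[i][j - 1])
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S


def nussinov_traceback(x: str, S: list) -> list:
    """Return one maximally paired structure as a list of 0-based (i, j) pairs."""
    pairs = []
    stack = [(0, len(x) - 1)] if x else []
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if S[i + 1][j] == S[i][j]:                  # i unpaired
            stack.append((i + 1, j))
        elif S[i][j - 1] == S[i][j]:                # j unpaired
            stack.append((i, j - 1))
        elif S[i + 1][j - 1] + can_pair(x[i], x[j]) == S[i][j]:
            pairs.append((i, j))                    # record the i,j base pair
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):               # bifurcation
                if S[i][k] + S[k + 1][j] == S[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return pairs


if __name__ == "__main__":
    seq = "GGGAAAUCC"
    print(nussinov_traceback(seq, nussinov_fill(seq)))
```

The order of the tests decides which of several equally good structures is returned; a different order would still yield the same number of pairs.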
Trace back complexity
What is the time complexity of the trace back?
The trace back is linear in time and memory.
The trace back we have shown is unbranched,
so the need for the pushdown stack is not
apparent. The pushdown stack becomes
important when bifurcated structures are
traced back.
Bibliography
Wikipedia
openwetware.org
Biological sequence analysis: probabilistic
models of proteins and nucleic acids, by
Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison