cse182-l16 - university of california, san diego · cse182-l16 non-coding rna. biol. data analysis:...

CSE182-L16

Non-coding RNA

Biol. Data analysis: Review

Protein Sequence Analysis

Sequence Analysis

Gene Finding

Assembly

Much other analysis is possible

Protein Sequence Analysis

Sequence Analysis

Gene Finding

Assembly

ncRNA

Genomic Analysis/Pop. Genetics

A Static picture of the cell is insufficient

• Each cell is continuously active:
– Genes are being transcribed into RNA
– RNA is translated into proteins
– Proteins are post-translationally (PT) modified and transported
– Proteins perform various cellular functions
• Can we probe the cell dynamically?

[Figure labels: Gene Regulation, Proteomic profiling, Transcript profiling]

ncRNA gene finding

• The gene is transcribed but not translated.
• What are the clues to non-coding genes?
– Look for signals selecting the start of transcription and translation. Many non-coding genes (e.g., tRNA genes) are transcribed by Pol III.
– Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure.
• Structure: Given a sequence, what is the structure into which it can fold with minimum energy?

tRNA structure

RNA structure: Basics

• Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
• The complementary bases form pairs.
• Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

RNA structure: pseudoknots

Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots.

RNA structure prediction

• Any set of non-crossing base-pairs defines a secondary structure.
• Abstract Question:
– Given an RNA string, find a structure that maximizes the number of non-crossing base-pairs
– Incorporate the true energetics of folding
– Incorporate pseudo-knots

A combinatorial problem

• Input:
– A string over A, C, G, U
– A pairs with U, C pairs with G
• Output:
– A subset of possible base-pairs of maximum size such that no two base-pairs intersect
• How can we compute this set efficiently?
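To make the output condition concrete, here is a minimal sketch of a checker for a candidate set of base-pairs. The function name `is_valid_structure`, the 0-based indexing, and the reading of "intersect" as "share a base or cross" are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import combinations

# Complementary pairs allowed by the slide: A-U and C-G.
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def is_valid_structure(rna, pairs):
    """Return True if `pairs` (0-based (i, j) tuples with i < j) is a set of
    complementary base-pairs on `rna` in which no two pairs intersect."""
    for i, j in pairs:
        if not (0 <= i < j < len(rna)):
            return False
        if (rna[i], rna[j]) not in COMPLEMENT:
            return False
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        # After sorting, i <= k. The pairs intersect if they share a base
        # or interleave as i < k < j < l (a crossing, i.e. a pseudoknot).
        if len({i, j, k, l}) < 4 or (i < k < j < l):
            return False
    return True
```

For example, on `"ACGAUU"` the nested set `[(0, 5), (1, 2), (3, 4)]` is accepted, while `[(0, 4), (3, 5)]` is rejected because the two pairs cross.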

RNA structure

1. Nussinov’s algorithm
   1. Score B for every base-pair. No penalty for loops. No pseudo-knots.
   2. Let W(i,j) be the score of the best structure of the subsequence from i to j.

for i = n down to 1 {
  for j = i+1 to n {
    compute W(i,j) from the recurrence below
  }
}

\[
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1, j-1) \\
W(i, j-1) \\
W(i+1, j) \\
\max_{i \le k < j} \{ W(i,k) + W(k+1,j) \}
\end{cases}
\]

Obtaining RNA structure

for i = n downto 1 {
  for j = i+1 to n {
    compute W(i,j) and record in S(i,j) which case attains the maximum
  }
}

\[
W(i,j) = \max
\begin{cases}
B(r_i, r_j) + W(i+1, j-1) & (1) \\
W(i, j-1) & (2) \\
W(i+1, j) & (3) \\
W(i,k) + W(k+1,j) & (4)
\end{cases}
\]

if (1) S(i,j) = /
else if (2) S(i,j) = |
else if (3) S(i,j) = -
else S(i,j) = k

Obtaining RNA Structure

Procedure print_RNA(i,j) {
  if (i >= j) return;
  if (S(i,j) = /) {
    print “(i,j)”;
    print_RNA(i+1,j-1);
  } else if (S(i,j) = -) {
    print_RNA(i+1,j);
  } else if (S(i,j) = |) {
    print_RNA(i,j-1);
  } else {
    k = S(i,j);
    print_RNA(i,k);
    print_RNA(k+1,j);
  }
}
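The fill and traceback steps can be sketched in Python as follows. This is an illustrative sketch, not the lecture's own code: it assumes B = 1 for A-U and C-G pairs (other pairs are not allowed), W(i,j) = 0 whenever j ≤ i, 0-based positions, and ties broken in the order (1), (2), (3), (4). The names `nussinov`, `traceback`, and `fold` are made up for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def nussinov(rna, score=1):
    """Fill W (best score) and S (which case won) for every i < j."""
    n = len(rna)
    W = [[0] * n for _ in range(n)]      # W[i][j] = 0 for j <= i
    S = [[None] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best, how = -1, None
            if (rna[i], rna[j]) in PAIRS:            # case (1): pair i with j
                best, how = score + W[i + 1][j - 1], "/"
            if W[i][j - 1] > best:                   # case (2): j unpaired
                best, how = W[i][j - 1], "|"
            if W[i + 1][j] > best:                   # case (3): i unpaired
                best, how = W[i + 1][j], "-"
            for k in range(i, j):                    # case (4): split at k
                v = W[i][k] + W[k + 1][j]
                if v > best:
                    best, how = v, k
            W[i][j], S[i][j] = best, how
    return W, S

def traceback(S, i, j, out):
    """Mirror of print_RNA: collect the chosen base-pairs into `out`."""
    if i >= j:
        return
    how = S[i][j]
    if how == "/":
        out.append((i, j))
        traceback(S, i + 1, j - 1, out)
    elif how == "-":
        traceback(S, i + 1, j, out)
    elif how == "|":
        traceback(S, i, j - 1, out)
    else:                                            # how is the split point k
        traceback(S, i, how, out)
        traceback(S, how + 1, j, out)

def fold(rna):
    """Return (best score, list of 0-based base-pairs) for `rna`."""
    if len(rna) < 2:
        return 0, []
    W, S = nussinov(rna)
    out = []
    traceback(S, 0, len(rna) - 1, out)
    return W[0][len(rna) - 1], out
```

Calling `fold("ACGAUU")` gives a score of 3 with pairs (0,5), (1,2), (3,4), that is, A1-U6 enclosing C2-G3 and A4-U5 in the slides' 1-based numbering. The three nested loops make the fill O(n^3) in time and O(n^2) in space.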

RNA structure: example

Sequence: A C G A U U (positions 1 to 6)

W(i,j) for i < j (values consistent with B = 1 per base-pair):

      j=2  j=3  j=4  j=5  j=6
i=1    0    1    1    2    3
i=2         1    1    2    2
i=3              0    1    1
i=4                   1    1
i=5                        0

RNA Structure: Details

Base-pairing & Loops

• Base-pairs arise from complementary nucleotides
• RNA is single-stranded
• A stack is formed when 2 base-pairs are contiguous
• Loops arise when there are unpaired bases
• Loops are characterized by the number of base-pairs that close them:
– Hairpin: closed by 1 base-pair
– Bulge/Interior loops: closed by 2 base-pairs
– Multiple internal loops: closed by k base-pairs

Scoring Loops, multi-loops

• Zuker-Turner Energy Rules
• http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html
• Stacking Energies
• Energy for Bulges and Interior Loops
• Energy for Multi-loops

Other tricks for obtaining structure

• Alignment and Covariance

RNA: unsolved problems

• The structure problem is still unsolved.
– De novo prediction does not work as well.
– Co-variance models require prior alignment.
• Many undiscovered non-coding genes
– miRNA and others have only just been discovered.
– It is very hard to detect a signal for these genes.
– Random sequences also fold into low-energy structures.

Other ncRNA: miRNA

• ncRNA ~22 nt in length
• Pairs to sites within the 3’ UTR, specifying translational repression.
• Similar to siRNA (involved in RNAi)
• Unlike siRNA, miRNA do not need perfect base complementarity
• Until recently, no computational techniques to predict miRNA
• Most predictions based on cloning small RNAs from size-fractionated samples

Gene Regulation

Gene expression

• The expression of transcripts and protein in the cell is not static. It changes in response to signals.
• The expression can be measured using micro-arrays.
• What causes the change in expression?

Transcriptional machinery

• RNA polymerase II (Pol II) scans the genome, initiating transcription and terminating it.
• The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.

TF binding

• Other transcription factors interact with the core machinery and upstream DNA to provide specificity.
• TFs bind to TF binding sites, which are clustered in upstream enhancer and promoter elements.
• The enhancer elements may be located many kb upstream of the core promoter.

[Figure labels: Upstream elements, Transcription factors]

TF binding sites

• TF binding sites are a weak signal (about 10 bp with 5 bp conserved)
• If two genes are co-regulated, they are likely to share binding sites
• Discovery of binding site motifs is an important research problem.

[Figure: genes g1 to g5, each carrying an upstream instance of a shared motif: TGAGGAG, TCAGGAG, TCAGGTG, TGAGGTG, TCAGGTG]
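As a rough illustration of what a "weak signal" looks like, the sketch below tabulates per-position base counts for the five motif instances on this slide and derives a consensus. Splitting the slide's fused strings into five 7-mers is an assumption, and the function names are hypothetical; real motif discovery must also locate the instances, which this sketch takes as given.

```python
from collections import Counter

# Five 7-mers read off the slide figure (assumed split of the fused strings).
SITES = ["TGAGGAG", "TCAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]

def position_counts(sites):
    """Return one Counter of bases per aligned position."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]

def consensus(sites):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))

def conserved_positions(sites):
    """0-based positions where every site carries the same base."""
    return [i for i, c in enumerate(position_counts(sites)) if len(c) == 1]
```

On `SITES`, only positions 0, 2, 3, 4, and 6 are identical across all five instances, so an exact-match search for the consensus would miss some of the sites.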

http://www.gene-regulation.com/pub/databases.html#transfac

Discovering TF binding sites

• Identification of these TF binding sites/switches is critical.
• Requires identification of co-regulated genes (genes containing the same set of switches).
• How do we find co-regulated genes?

Idea 1: Use orthologous genes from different species

ACGGCAGCTCGCCGCCGCGC
||||| |  ||||||| |||
ACGGC-GGGCGCCGCCCCGC

ACGGCAGCTCGCCGCCGC-C
|  || |  ||||||| |
AGTGC-GGGCGCCGCCTCAT

ACGGC-GC-TCGCCGCCGCGC
|   |    | || |    |
AT-ACGAAGTAGCGG-ATGGT

1. The species are too close (e.g., humans and chimps). Binding and non-binding sites are both conserved.
2. The species are distant. Binding sites are conserved but not other sequence.
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
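The three cases can be told apart roughly by how much of the aligned upstream region is identical. The sketch below recomputes the match line and percent identity for a gapped pairwise alignment. The helper names are hypothetical, and counting gap columns in the denominator is one convention among several.

```python
def match_line(a, b):
    """Return '|' where aligned characters are identical bases, ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))

def percent_identity(a, b):
    """Identical columns as a percentage of all alignment columns (gaps included)."""
    return 100.0 * match_line(a, b).count("|") / len(a)

# The three alignments shown on the slide, closest species pair first.
ALIGNMENTS = [
    ("ACGGCAGCTCGCCGCCGCGC", "ACGGC-GGGCGCCGCCCCGC"),
    ("ACGGCAGCTCGCCGCCGC-C", "AGTGC-GGGCGCCGCCTCAT"),
    ("ACGGC-GC-TCGCCGCCGCGC", "AT-ACGAAGTAGCGG-ATGGT"),
]
```

Applied to `ALIGNMENTS`, the identity drops from 80% to 60% to about 33%. In the middle alignment the longest run of matches is the 7-base block CGCCGCC, which is the kind of conserved island a binding-site search would hope to find.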

Idea 2: Measure expression of genes

• Northern Blot:
– Quantitative expression of a few genes

Microarray

• Expression level of all genes

Protein Expression using MS

Pathways

• Proteins interact to transduce signals, catalyze reactions, etc.
• The interactions can be captured in a database.
• Queries on this database are about looking for interesting sub-graphs in a large graph.
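One of the simplest sub-graph queries is "find a shortest chain of interactions from one protein to another". The sketch below answers it with breadth-first search over an adjacency dictionary. The graph and its node names (`receptor`, `kinaseA`, and so on) are entirely hypothetical and stand in for whatever an interaction database would return.

```python
from collections import deque

# Hypothetical directed interaction graph: protein -> proteins it acts on.
INTERACTIONS = {
    "receptor": ["kinaseA"],
    "kinaseA": ["kinaseB", "phosphatase"],
    "kinaseB": ["TF"],
    "phosphatase": [],
    "TF": [],
}

def shortest_path(graph, src, dst):
    """Return a shortest src -> dst path as a list of nodes, or None."""
    prev = {src: None}            # also serves as the visited set
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:  # walk predecessors back to src
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in graph.get(u, ()):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None
```

Here `shortest_path(INTERACTIONS, "receptor", "TF")` returns the chain receptor, kinaseA, kinaseB, TF. Richer queries, such as matching a small pattern graph, are considerably more expensive than this path search.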

Biological databases in NAR

• http://www3.oup.co.uk/nar/database/c
• 548 databases in various categories

Examples: Rfam, Genbank, SwissProt, Stanford microarray db, PDB, Kegg, dbSNP/OMIM/seattleSNPs, SWISS 2D-page

Summary

• Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data.
• While database technology (XML, relational and OO databases, text formats) is used to store this data, its use is (often) transparent for bioinformatics people.
• In this course, we looked at various data-streams, and pointed to databases that store these data-streams.
• Nucleic Acids Research brings out a database issue every January (2004: 548 databases).
