Alon “Alonzo” Sade & Harel “Hipoptam” Shein Advisor: Prof. Michal Linial (AKA “M”) 29.7.08


Page 1:

Alon “Alonzo” Sade & Harel “Hipoptam” Shein
Advisor: Prof. Michal Linial (AKA “M”)

29.7.08

Page 2:

Estimating the amount of functional sequence in the genome

The ENCODE pilot project
◦ Research on ncRNAs
◦ Research on alternative splicing

Fun in the sun…

Page 3:

Understanding how our genome encodes information

How that information underpins differences between individuals and species

Page 4:

Currently estimated number of protein-coding genes:
◦ Human: ∼20,000-25,000
◦ Sea urchin: ∼23,000
◦ Nematode worm: ∼19,000
◦ Tetrahymena thermophila: ∼27,000 (because we have no yeast)

We are complex, so where is the information?

Protein-coding sequences account for <1.5% of the human genome

What is the function of the remainder?

Page 5:

Alternative splicing?

Non-protein-coding sequences contain large amounts of regulatory information?

Recent discoveries say that the vast majority of the mammalian genome is transcribed
◦ We’ll get back to that…

Page 6:

Non-coding RNA

An RNA that is not translated into a protein

Many members in this family

It was assumed that leftover RNA was “junk”

2001 – Mattick claims: “more than 97% of RNA is ncRNA!”

Page 7:

Ancient Repeats

A CNE that was inserted into the early mammalian lineage

Primarily transposon derived

Has since become dormant

Most are thought to be neutrally evolving

Page 8:

Required for replication & structural integrity of the chromosome

Encode functional products

Required for regulation or processing
◦ Includes sequences that may act as spacers

Page 9:
Page 10:

At least 70% of the mammalian genome is transcribed

Many of these transcripts are ncRNAs that show cell-specific or developmental regulation

Functionality?

Noise, or by-products of late evolution

But all of this may also indicate functionality!

Page 11:

Recent evidence implicates ncRNAs in the control of:
◦ Chromatin structure
◦ Epigenetic memory
◦ Transcription
◦ Translation
◦ Splicing (possibly)

Most are evolving quickly but can retain highly conserved regions

Page 12:

∼5% of small segments in mouse & human are under selection (may range between 3% and 8%)

Does not include sequences that have diverged for reasons other than evolution

At the time, only ∼1.2% was thought to be protein-coding

[Diagram: 5%]

Page 13:

Conservation is relative

Taken to be the substitution rate, measured under the assumption “functional ⇔ evolving slower than the neutral rate”

Requires an estimate of the “neutral rate of evolution”

Estimated from classes of sequence expected to be evolving free of constraint

Yes, everything is relative

[Diagram labels: 5%, conservation, natural evolution]

Page 14:

The classes chosen have included:

1. Mainly ancient repeats (ARs)
2. Lineage-specific nonexonic sequences
3. Synonymous sites in codons

[Diagram labels: 5%, conservation, natural evolution, 3 features]

Page 15:

Estimates based on ARs may be biased:
◦ The annotated and aligned ARs may comprise mainly a slowly evolving subset
◦ ARs may themselves be under purifying selection

Lineage-specific nonexonic sequences and synonymous sites have also been found to be biased

Page 16:

The 5% study

Conservation

Neutral rate: the classes chosen have included:
◦ Lineage-specific nonexonic sequences
◦ Synonymous sites in codons
◦ ARs

None of which is unbiased

[Diagram labels: 5%, conservation, natural evolution, 3 features]

Page 17:

Conclusions:

Functional RNAs illustrate:
◦ Low conservation ↮ loss of functionality
◦ Many functional transcripts have more relaxed structure-function constraints

Many functional elements are unconstrained: biologically active, but providing no specific benefit to the organism

Page 18:

CFTR - cystic fibrosis transmembrane conductance regulator

Page 19:

Figure 1. Conservation in the ENCODE CFTR locus

Page 20:

Amount and function of the transcriptional output

Conservation Functionality estimates

The fraction of the genome under purifying selection may have been underestimated

May reach 11.8%

Page 21:
Page 22:

The ENCyclopedia Of DNA Elements

Page 23:

Gene
From Wikipedia, the free encyclopedia

“A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”

Page 24:

A public research consortium

Launched by the US National Human Genome Research Institute in September 2003

Goal: identify all functional elements in the human genome sequence

Top-down research

Project phases:
◦ Pilot phase
◦ Technology development phase
◦ Production phase

Page 25:

• Goal: Evaluate a variety of different methods for use in later stages

• Using a number of existing techniques to analyse a portion of the genome equal to about 1% (30 Mb)

• 35 groups provided more than 200 experimental & computational data sets

Page 26:

• 50% were selected manually, 50% were selected randomly

• The two main criteria for manual selection:
  • The presence of well-studied genes or other known sequence elements
  • The existence of a substantial amount of comparative sequence data

Page 27:

The randomly selected sequences
• composed of 500 kb regions
• selected according to a stratified random-sampling strategy based on
  • gene density: #bases in genes / #other bases
  • level of non-exonic conservation: 125-base windows; base alignment with mouse of 75% or more; score (prediction); took the low score
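
The stratified random-sampling idea on this slide can be sketched as follows. This is a minimal illustration, not the consortium's actual procedure: the three-level (tertile) strata, the region records, and the names `gene_density`, `tertile`, `stratify`, and `sample_per_stratum` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def gene_density(genic_bases, total_bases):
    """Ratio from the slide: bases in genes / all other bases."""
    other = total_bases - genic_bases
    return genic_bases / other if other else float("inf")


def tertile(value, sorted_values):
    """Return 0, 1 or 2 (low, medium, high) by rank within sorted_values."""
    rank = sum(v < value for v in sorted_values)
    return min(2, 3 * rank // len(sorted_values))


def stratify(regions):
    """Group regions by (gene-density level, conservation level)."""
    densities = sorted(r["density"] for r in regions)
    conservations = sorted(r["conservation"] for r in regions)
    strata = defaultdict(list)
    for r in regions:
        key = (tertile(r["density"], densities),
               tertile(r["conservation"], conservations))
        strata[key].append(r)
    return strata


def sample_per_stratum(strata, k, seed=0):
    """Draw up to k regions at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return {key: rng.sample(members, min(k, len(members)))
            for key, members in strata.items()}
```

Sampling within each stratum, rather than over the whole genome at once, is what keeps gene-poor or weakly conserved regions from being drowned out by the more common classes.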

Page 28:

• The technology development phase is concurrent with the pilot phase

• Goal: Investigate and develop new, high-throughput techniques and protocols suitable for the production phase

Page 29:
Page 30:
Page 31:
Page 32:
Page 33:
Page 34:
Page 35:

• One major challenge in the ENCODE project is annotating the large number of ncRNAs

• They are difficult to find by computational or experimental means

• Why?

Page 36:
Page 37:

• We must consider secondary structure as well as nucleotide sequence

• Structure can be detected more reliably from a set of related sequences

• RNA secondary structure is imperative when searching for structured ncRNAs

• So RNA search algorithms are expensive…
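
To give a feel for why structure-aware search is costly, here is a minimal sketch of folding a single sequence by dynamic programming. It uses Nussinov-style base-pair maximization, a deliberate simplification swapped in for illustration: real tools score structures with thermodynamic energy models. The names `can_pair` and `max_base_pairs` and the `min_loop` default are assumptions. Even this toy version takes O(n^3) time for one sequence.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

For example, `max_base_pairs("GGGAAAUCC")` finds the three-pair hairpin formed by G-C, G-C, and G-U around the AAA loop.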

Page 38:

You start as strong as you can, and then, little by little, you turn it up!

• In 1985, Sankoff proposed performing sequence alignment and minimum free energy folding simultaneously

• For two sequences of length n it’s O(n^6)
• Exponential in the number of sequences
• Given the high cost, for many years it rested in oblivion...
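
To put the cost on this slide in context: the Sankoff algorithm is commonly quoted as needing the following time T and memory M for N sequences of length n. The symbols T, M, and N are introduced here for illustration only.

```latex
T(n, N) = O\!\left(n^{3N}\right), \qquad M(n, N) = O\!\left(n^{2N}\right)
\quad\Longrightarrow\quad
T(n, 2) = O\!\left(n^{6}\right), \qquad M(n, 2) = O\!\left(n^{4}\right)
```

Under the pairwise bound, doubling the sequence length multiplies the running time by 2^6 = 64, and every additional sequence raises the exponent by three.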

Page 39:

• Several approximations have been developed:
  • FOLDALIGN
  • Dynalign
  • Stemloc
  • Consan

• All try to improve performance without sacrificing accuracy

• They still remain relatively expensive

Page 40:

First align, then fold

More attractive nowadays

RNAz & EvoFold use existing alignments
◦ Thousands of new potential structured ncRNAs
◦ Restricted to highly conserved segments

Page 41:

X As sequence similarity drops, frequent compensating base changes cause misalignments

X Assumes RNA structure is present in all sequences in the alignment

X Global alignments within a fixed-width sliding window
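
The first drawback can be made concrete with a small sketch. A compensating base change alters both partners of a base pair while keeping them able to pair, so the structure is preserved even as sequence identity falls. The function names, the toy sequences in the usage note, and the choice to count only double substitutions are assumptions made for illustration.

```python
# Ordered base pairs that can form: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def compensating_changes(seq_a, seq_b, paired_columns):
    """Count paired columns where both bases differ but pairing is kept."""
    count = 0
    for i, j in paired_columns:
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        both_pair = pair_a in PAIRS and pair_b in PAIRS
        both_changed = pair_a[0] != pair_b[0] and pair_a[1] != pair_b[1]
        if both_pair and both_changed:
            count += 1
    return count


def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns with identical bases."""
    same = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * same / len(seq_a)
```

With the toy hairpins `"GCGAAAACGC"` and `"AUGAAAACAU"` and stem columns `[(0, 9), (1, 8), (2, 7)]`, two of the three pairs carry compensating changes while sequence identity is only 60%, which is the situation in which a purely sequence-based aligner tends to go wrong.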

Page 42:

CMFinder
◦ Searches a set of orthologous, unaligned sequences for conservation
◦ Doesn’t use external alignments (other than to establish orthology)
◦ Doesn’t use sliding windows

Page 43:

We scanned 2 × 56,017 blocks from UCSC MULTIZ multiple alignment files

We restricted the analysis to blocks that don’t overlap exons or conserved elements

8.68 Mb (of 30 Mb), of which 3.87 Mb are repetitive sequences (RepeatMasker)

Page 44:

10,106 predicted motifs meeting the cutoff
◦ Composite score > 5
◦ Free energy < -5

Estimated false-positive rate of 50%

Page 45:

Some predicted motifs overlap

Sense/antisense to each other

Considering these as single candidates, we have 6587 candidate regions
◦ Average region length: 80 nt
◦ Covering 6.1% of input
◦ Denser in nonrepetitive regions (7.9% against 3.9%)
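
Collapsing overlapping motifs into single candidate regions is an interval-merging step. The sketch below assumes half-open `(start, end)` coordinates on one sequence and ignores strand. The function names and the toy coordinates in the usage note are illustrative, not taken from the study.

```python
def merge_overlapping(motifs):
    """Merge overlapping (start, end) half-open intervals into regions."""
    regions = []
    for start, end in sorted(motifs):
        if regions and start < regions[-1][1]:
            # Overlaps the previous region: extend it if needed.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


def coverage(regions, input_length):
    """Fraction of the input covered by non-overlapping regions."""
    return sum(end - start for start, end in regions) / input_length
```

For instance, `merge_overlapping([(10, 50), (40, 90), (200, 260)])` yields two regions, `(10, 90)` and `(200, 260)`. The figures on this slide are consistent with each other: 6587 regions × 80 nt is about 0.53 Mb, roughly 6.1% of an 8.68 Mb input.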

Page 46:

ENCODE regions are poor in known ncRNAs

Only one known ncRNA fully overlapped our input (hsa-miR-483)

It received a high score (8.6, -31.4)

Also scored high as miRNA by RNAmicro

Page 47:

GENCODE annotations aim to identify all human protein-coding genes in the ENCODE regions

40% of our candidates are intergenic

60% overlap some non-exonic part of a coding gene

Page 48:
Page 49:

Elfar Torarinsson et al. Genome Res. 2008; 18: 242-251

Figure 3. Average pairwise sequence similarity of the predicted motifs versus the fraction that has been realigned compared to the original alignments

Page 50:

To explore the biological relevance of our prediction methods, we selected 11 high-confidence candidates
◦ score>9, energy<-15, length>60, base change>5

We tested expression of these 11 candidates and found that 8 of 11 candidates could be detected in human RNA by RT-PCR
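
The two sets of thresholds in the talk (the general cutoff and the stricter high-confidence criteria) amount to simple filters over the predicted motifs. The sketch below assumes a hypothetical `Motif` record; its field names and the example records in the usage note are invented for illustration and are not the study's data format.

```python
from dataclasses import dataclass


@dataclass
class Motif:
    """Hypothetical record for one predicted motif."""
    name: str
    score: float        # composite score
    energy: float       # free energy
    length: int         # motif length in nt
    base_changes: int   # the "base change" count from the slide


def passes_cutoff(m):
    """General cutoff: composite score > 5 and free energy < -5."""
    return m.score > 5 and m.energy < -5


def is_high_confidence(m):
    """Stricter criteria used to pick candidates for RT-PCR."""
    return (m.score > 9 and m.energy < -15
            and m.length > 60 and m.base_changes > 5)
```

A made-up motif with score 9.5, energy -20, length 75, and 7 base changes passes both filters. One with score 8.6 and energy -31.4 passes only the general cutoff, whatever its length.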

Page 51:

Expression of predicted ncRNA candidates by RT-PCR analysis

Page 52:
Page 53:
Page 54:

Expression of predicted ncRNA candidates by RT-PCR analysis

Page 55:

ncRNAs are receiving increasing attention

First large-scale search for structured ncRNAs using a local structure motif-finding algorithm

One can benefit from realignment that considers both sequence and structure

Identified several thousand new ncRNA candidates

Need for high-throughput methods to identify potential functions for the results

Page 56:
Page 57:

2.53 protein-coding variants per locus

Key to understanding how human complexity can be encoded by so few genes

Page 58:

Alternative splicing has to be demonstrated at the protein level

Many of the alternative isoforms are not likely to add functionality

So what is the advantage?

Page 59:

Prof. Michal Linial for the guidance

You for listening

Page 60:
Page 61:

Pheasant and Mattick (2007), Raising the estimate of functional human sequences, Genome Res., 17: 1245-1253.

The ENCODE Project Consortium (2007), Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature, 447: 799-816.

Torarinsson et al. (2008), Comparative genomics beyond sequence-based alignments: RNA structures in the ENCODE regions, Genome Res., 18:242-251.

Tress et al. (2007), The implications of alternative splicing in the ENCODE protein complement, Proc. Natl Acad. Sci. USA, 104: 5495–5500.

Page 62: