multiply aligning rna sequences

Multiply Aligning RNA Sequences -RNA -Phylogeny -SAR -Re-Sequencing Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program


Post on 08-Jan-2016



Page 1: Multiply Aligning  RNA  Sequences

Multiply Aligning RNA Sequences

– RNA
– Phylogeny
– SAR
– Re-Sequencing

Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program

Page 2: Multiply Aligning  RNA  Sequences

Open Questions in Multiple Sequence Alignments

Aligning Protein Sequences

Aligning RNA Sequences

Page 3: Multiply Aligning  RNA  Sequences

Accurately Aligning Protein Sequences

Remains challenging with sequences of less than 20% identity

These sequences can be structural homologues

Correct alignments can help discover functional sites

Expresso/3D-Coffee is currently the most accurate way of combining sequence and structural information

Available on www.tcoffee.org

Page 4: Multiply Aligning  RNA  Sequences

Comparing ncRNAs

Page 5: Multiply Aligning  RNA  Sequences

ncRNAs Comparison

And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)

How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins

Page 6: Multiply Aligning  RNA  Sequences

Detecting ncRNAs in silico: a long way to go…

RNase P (Not in ENCODE)

Page 7: Multiply Aligning  RNA  Sequences

Lizard        ---GG--TGGAGACTAGTCTGAATTGGGTTATGAAG--CCA--
Rat           GGCGG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--
Hedgehog      GACGG--GGGAGAGTAGTCTGAATTAGGTTATGGGG--CCC--
Shrew         GACGG-CGGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--
Medaka        GTGAG--TGGAGAGTAGTCTGAATTGGGT---------TCT--
X.tropicalis  AGCGG-CGGGAGAGTAGTCTGACTTGGGTTATGAGG--TGC--
Cat           GACGG--GGGAGAGTAGTCTGAATTGGGTTATGAGGCCCCC--
Dog           -------------------------------------------
Rhesus        GGCGG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--TCC--
Mouse         GGCGG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--
Chimp         GGCGG--AGGAGAGTAGTCTGAATTGGGTTATGAGG--TCC--
Human         GGCGG--AGGAGAGTAGTCTGAATTGGGTTATGAGG--TCC--
TreeShrew     GCGCG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--

[Diagram labels: alignment from UCSC or RFAM; structure from RNAalifold prediction or RFAM; Search (CMsearch); Genome]

Page 8: Multiply Aligning  RNA  Sequences

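Results for RNase P

Set         Alignment  Structure  Results
-------------------------------------------
Mammalian   UCSC       Predicted  Nothing
Mammalian   RFAM       Predicted  Nothing
Mammalian   UCSC       RFAM       Nothing
Mammalian   RFAM       RFAM       OK
Vertebrate  UCSC       Predicted  Nothing
Vertebrate  RFAM       Predicted  Nothing
Vertebrate  UCSC       RFAM       OK
Vertebrate  RFAM       RFAM       OK
-------------------------------------------

Matthias Zytneki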

Page 9: Multiply Aligning  RNA  Sequences

Results for RNase P

Better Alignments = Better Predictions

Matthias Zytneki, Thomas Derrien, Roderic Guigo, Ramin Shiekhattar

Qualitative Improvement

Quantitative Improvement

Page 10: Multiply Aligning  RNA  Sequences

ncRNAs can have different sequences and Similar Structures

Page 11: Multiply Aligning  RNA  Sequences

ncRNAs Can Evolve Rapidly

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

[Figure: hairpin secondary structures drawn for the two sequences]
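The conservation line on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for the example, and the two sequences are copied from the slide.

```python
# Hypothetical helper names; the two sequences are the ones shown on the slide.
SEQ_1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ_2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"


def conservation_line(seq_a: str, seq_b: str) -> str:
    """Mark identical columns with '*' and differing columns with '-'."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return "".join("*" if a == b else "-" for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of columns holding the same nucleotide in both sequences."""
    line = conservation_line(seq_a, seq_b)
    return 100.0 * line.count("*") / len(line)


if __name__ == "__main__":
    print(SEQ_1)
    print(SEQ_2)
    print(conservation_line(SEQ_1, SEQ_2))
    print(f"{percent_identity(SEQ_1, SEQ_2):.1f}% identity")
```

Only 10 of the 29 columns are identical, which is the point of the slide: the sequence signal is weak even when both molecules can form a similar hairpin.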

Page 12: Multiply Aligning  RNA  Sequences

ncRNAs are Difficult to Align

Same Structure, Low Sequence Identity

Small Alphabet, Short Sequences: Alignments often Non-Significant

Page 13: Multiply Aligning  RNA  Sequences

Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment

Page 14: Multiply Aligning  RNA  Sequences

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

Page 15: Multiply Aligning  RNA  Sequences

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

Simultaneous Folding and Alignment

– Time Complexity: O(L^(2n))
– Space Complexity: O(L^(3n))

In Practice, for Two Sequences:

Length            Time      Memory
50 nucleotides    1 min     6 M
100 nucleotides   16 min    256 M
200 nucleotides   4 hours   4 G
400 nucleotides   3 days    3 T

Forget about
– Multiple sequence alignments
– Database searches
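A quick arithmetic check of the running times quoted on this slide. The sketch takes the slide's time exponent at face value (L to the power 2n, read here as L = sequence length and n = number of sequences, which is an assumption about the notation) and calibrates on the 50-nucleotide figure of 1 minute. The function name and its parameters are invented for the example.

```python
def scaled_minutes(length: int, n_sequences: int = 2,
                   ref_length: int = 50, ref_minutes: float = 1.0) -> float:
    """Extrapolate running time with a pure power law L ** (2 * n).

    Assumption: the only effect is the polynomial growth quoted on the slide,
    calibrated on one reference point (50 nucleotides = 1 minute).
    """
    exponent = 2 * n_sequences
    return ref_minutes * (length / ref_length) ** exponent


if __name__ == "__main__":
    for length in (50, 100, 200, 400):
        minutes = scaled_minutes(length)
        print(f"{length:4d} nt: {minutes:6.0f} min "
              f"= {minutes / 60:5.1f} h = {minutes / 1440:4.1f} days")
```

For two sequences every doubling of the length multiplies the time by 16, giving 16 minutes, 256 minutes (a little over 4 hours) and 4096 minutes (just under 3 days), in line with the figures on the slide.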

Page 16: Multiply Aligning  RNA  Sequences

The next best Thing: Consan

Consan = Sankoff + a few constraints

Use of Stochastic Context Free Grammars

– Tree-shaped HMMs
– Made sparse with constraints

The constraints are derived from the most confident positions of the alignment

Equivalent of Banded DP
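To make the banded DP analogy concrete, here is a generic banded global alignment for plain sequences. It is not Consan and involves no grammar or structure: it only shows how restricting the dynamic programming to cells close to the diagonal cuts the work. The function name, the scoring values and the fixed-width band are assumptions chosen for the example.

```python
def banded_alignment_score(a: str, b: str, band: int,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -1) -> int:
    """Global alignment score, filling only cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    # Row 0: the empty prefix of `a` against b[:j] costs j gaps.
    prev = {j: j * gap for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            candidates = []
            if j == 0:
                candidates.append(i * gap)           # a[:i] against nothing
            if j > 0 and j - 1 in prev:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append(prev[j - 1] + sub)  # align a[i-1] with b[j-1]
            if j in prev:
                candidates.append(prev[j] + gap)      # gap in b
            if j > 0 and j - 1 in cur:
                candidates.append(cur[j - 1] + gap)   # gap in a
            cur[j] = max(candidates)
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(banded_alignment_score("ACGT", "AGT", band=1))
```

With a band of width k the table holds roughly (2k + 1) cells per row instead of a full row, which is the kind of saving the constraints buy in the structural case.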

Page 17: Multiply Aligning  RNA  Sequences

Going Multiple….

Structural Aligners

Page 18: Multiply Aligning  RNA  Sequences

Game Rules

Using Structural Predictions
– Produces better alignments
– Is computationally expensive

Use as much structural information as possible while doing as little computation as possible…

Page 19: Multiply Aligning  RNA  Sequences

Adapting T-Coffee To

RNA Alignments

Page 20: Multiply Aligning  RNA  Sequences

T-Coffee and Consistency…

Page 21: Multiply Aligning  RNA  Sequences

T-Coffee and Consistency…

Page 22: Multiply Aligning  RNA  Sequences

T-Coffee and Consistency…

Page 23: Multiply Aligning  RNA  Sequences

T-Coffee and Consistency…

Page 24: Multiply Aligning  RNA  Sequences

Consistency: Conflicts and Information

[Figure: residues W, X, Y and Z linked across pairwise alignments]

Partly Consistent: Less Reliable

Fully Consistent: More Reliable

Y-Z is unhappy, X-W is unhappy
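The idea that consistent pairs are more reliable can be sketched with a toy version of the T-Coffee library extension: a pair of residues gains weight every time a third sequence supports it, and the support is the weaker of the two links through that third residue. The library contents, residue labels and function names below are invented for illustration, and the real implementation handles many more details.

```python
from collections import defaultdict

# A residue is (sequence name, position); weights are made-up primary scores.
LIBRARY = {
    (("X", 1), ("Y", 1)): 80,   # supported through Z
    (("X", 1), ("Z", 1)): 90,
    (("Z", 1), ("Y", 1)): 70,
    (("X", 1), ("Y", 2)): 30,   # conflicting pair, no third-sequence support
}


def build_index(library):
    """Symmetric lookup: residue -> {aligned residue: weight}."""
    neighbours = defaultdict(dict)
    for (r1, r2), weight in library.items():
        neighbours[r1][r2] = weight
        neighbours[r2][r1] = weight
    return neighbours


def extended_weight(library, r1, r2):
    """Primary weight plus min(W(r1, r3), W(r3, r2)) over supporting r3."""
    neighbours = build_index(library)
    weight = neighbours[r1].get(r2, 0)
    for r3, w13 in neighbours[r1].items():
        if r3[0] in (r1[0], r2[0]):
            continue  # the intermediate must come from a third sequence
        w32 = neighbours[r3].get(r2)
        if w32 is not None:
            weight += min(w13, w32)
    return weight


if __name__ == "__main__":
    print(extended_weight(LIBRARY, ("X", 1), ("Y", 1)))
    print(extended_weight(LIBRARY, ("X", 1), ("Y", 2)))
```

In this toy library the X1-Y1 pair rises from 80 to 150 because Z agrees with it, while the conflicting X1-Y2 pair stays at 30.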

Page 25: Multiply Aligning  RNA  Sequences

R-Coffee: Modifying T-Coffee at the Right Place

Incorporation of Secondary Structure information within the Library

Two Extra Components for the T-Coffee Scoring Scheme

– A new Library
– A new Scoring Scheme

Page 26: Multiply Aligning  RNA  Sequences

[Flowchart]

RNA Sequences
→ Secondary Structures (RNAplfold)
→ Primary Library (Consan, or Mafft / Muscle / ProbCons)
→ R-Coffee Extension
→ R-Coffee Extended Primary Library
→ R-Score
→ Progressive Alignment Using The R-Score

Page 27: Multiply Aligning  RNA  Sequences

R-Coffee Extension

[Figure: a C-G base pair in one sequence shown against a C-G base pair in another sequence]

TC Library
G G Score X
C C Score Y

Goal: Embedding RNA Structures Within The T-Coffee Libraries

The R-extension can be added on top of any existing method.

Page 28: Multiply Aligning  RNA  Sequences

R-Coffee Scoring Scheme

[Figure: a C-G base pair in one sequence shown against a C-G base pair in another sequence]

R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
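A small sketch of the MAX rule above. It assumes each sequence comes with a predicted pairing table (position to partner position) and that TC scores are available for pairs of positions. The dictionaries, positions and scores are invented for illustration and are not the R-Coffee data structures.

```python
def r_score(tc_scores, pairs_a, pairs_b, i, j):
    """R-Score of aligning position i of sequence A with position j of B.

    tc_scores: {(pos_in_a, pos_in_b): TC score}
    pairs_a, pairs_b: {position: base-paired partner position}
    If both positions are base-paired, the score is the best of the TC score
    of the pair itself and the TC score of their two partners.
    """
    score = tc_scores.get((i, j), 0)
    partner_i = pairs_a.get(i)
    partner_j = pairs_b.get(j)
    if partner_i is not None and partner_j is not None:
        score = max(score, tc_scores.get((partner_i, partner_j), 0))
    return score


# Hypothetical example: C2 pairs with G8 in A, C3 pairs with G9 in B.
PAIRS_A = {2: 8, 8: 2}
PAIRS_B = {3: 9, 9: 3}
TC = {(2, 3): 40,   # C against C
      (8, 9): 90,   # G against G
      (5, 5): 10}   # an unpaired column

if __name__ == "__main__":
    print(r_score(TC, PAIRS_A, PAIRS_B, 2, 3))
    print(r_score(TC, PAIRS_A, PAIRS_B, 5, 5))
```

Here the weakly supported C-C column inherits the strong score of its G-G partners, while the unpaired column keeps its plain TC score.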

Page 29: Multiply Aligning  RNA  Sequences

Validating R-Coffee

Page 30: Multiply Aligning  RNA  Sequences

RNA Alignments are harder to validate than Protein Alignments

Protein Alignments: Use of structure-based reference alignments

RNA Alignments: No real structure-based reference alignments
– The structures are mostly predicted from sequences
– Circularity

Page 31: Multiply Aligning  RNA  Sequences

BraliBase and the BraliScore

Database of Reference Alignments

388 multiple sequence alignments.

Evenly distributed between 35 and 95 percent average sequence identity

Contain 5 sequences selected from the RNA family database Rfam

The reference alignment is based on a SCFG model based on the full Rfam seed dataset (~100 sequences).

Page 32: Multiply Aligning  RNA  Sequences

BraliBase SPS Score

RFam MSA

SPS = (Number of Identically Aligned Pairs) / (Number of Aligned Pairs)
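A minimal sketch of the SPS ratio. It assumes the denominator counts the residue pairs aligned in the reference alignment, and that both alignments contain the same sequences with all rows of equal length. The function names and the toy alignments are made up for the example.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of (seq_a, residue_a, seq_b, residue_b) aligned in some column."""
    names = sorted(alignment)
    columns = {}
    for name in names:
        index, cols = 0, []
        for char in alignment[name]:
            if char == "-":
                cols.append(None)      # gap: no residue in this column
            else:
                cols.append(index)     # residue number, ignoring gaps
                index += 1
        columns[name] = cols
    pairs = set()
    for a, b in combinations(names, 2):
        for res_a, res_b in zip(columns[a], columns[b]):
            if res_a is not None and res_b is not None:
                pairs.add((a, res_a, b, res_b))
    return pairs


def sps(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


REFERENCE = {"s1": "ACGU", "s2": "A-GU", "s3": "ACG-"}
TEST = {"s1": "ACGU", "s2": "AG-U", "s3": "ACG-"}

if __name__ == "__main__":
    print(sps(TEST, REFERENCE))
```

In the toy case the misplaced gap in s2 breaks two of the eight reference pairs, so the test alignment scores 0.75.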

Page 33: Multiply Aligning  RNA  Sequences

BraliBase: SCI Score

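RNApfold (one structure and one ΔG per sequence)

(((…)))…((..))  ΔG Seq1
(((…)))…((..))  ΔG Seq2
(((…)))…((..))  ΔG Seq3
(((…)))…((..))  ΔG Seq4
(((…)))…((..))  ΔG Seq5
(((…)))…((..))  ΔG Seq6

RNAalifold (one consensus structure and one ΔG for the alignment)

(((…)))…((..))  ΔG ALN

SCI is computed from ΔG ALN, the average ΔG Seq, and the covariance (Cov)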

Page 34: Multiply Aligning  RNA  Sequences

BraliScore

BraliScore = SCI × SPS

Page 35: Multiply Aligning  RNA  Sequences

R-Coffee + Regular Aligners

Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T     +R
-----------------------------------------------------------
Poa             0.62     0.65   0.70      48    154
Pcma            0.62     0.64   0.67      34    120
Prrn            0.64     0.61   0.66     -63     45
ClustalW        0.65     0.65   0.69      -7     83
Mafft_fftnts    0.68     0.68   0.72      17     68
ProbConsRNA     0.69     0.67   0.71     -49     39
Muscle          0.69     0.69   0.73     -17     42
Mafft_ginsi     0.70     0.68   0.72     -49     39
-----------------------------------------------------------

Improvement = # R-Coffee wins - # R-Coffee losses
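The net improvement column can be read as a simple win/loss tally over the benchmark datasets. In the sketch below the per-dataset scores are invented, the function name is hypothetical, and ties are assumed to count for neither side, which the slide does not specify.

```python
def net_improvement(baseline_scores, rcoffee_scores):
    """Number of datasets where R-Coffee scores higher, minus where it scores lower."""
    if len(baseline_scores) != len(rcoffee_scores):
        raise ValueError("one score per dataset is expected for both methods")
    wins = sum(r > b for b, r in zip(baseline_scores, rcoffee_scores))
    losses = sum(r < b for b, r in zip(baseline_scores, rcoffee_scores))
    return wins - losses


if __name__ == "__main__":
    baseline = [0.60, 0.70, 0.80, 0.65]   # made-up scores on 4 datasets
    with_r = [0.66, 0.70, 0.78, 0.70]
    print(net_improvement(baseline, with_r))
```

A tally of this kind rewards consistency across datasets, so a method can show a flat average score and still have a clearly positive or negative net improvement.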

Page 36: Multiply Aligning  RNA  Sequences

RM-Coffee + Regular Aligners

Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T     +R
-----------------------------------------------------------
Poa             0.62     0.65   0.70      48    154
Pcma            0.62     0.64   0.67      34    120
Prrn            0.64     0.61   0.66     -63     45
ClustalW        0.65     0.65   0.69      -7     83
Mafft_fftnts    0.68     0.68   0.72      17     68
ProbConsRNA     0.69     0.67   0.71     -49     39
Muscle          0.69     0.69   0.73     -17     42
Mafft_ginsi     0.70     0.68   0.72     -49     39
-----------------------------------------------------------
RM-Coffee4      0.71     /      0.74      /      84

Page 37: Multiply Aligning  RNA  Sequences

R-Coffee + Structural Aligners

Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T     +R
-----------------------------------------------------------
Stemloc         0.62     0.75   0.76     104    113
Mlocarna        0.66     0.69   0.71     101    133
Murlet          0.73     0.70   0.72    -132    -73
Pmcomp          0.73     0.73   0.73     142    145
T-Lara          0.74     0.74   0.69     -36     -8
Foldalign       0.75     0.77   0.77      72     73
-----------------------------------------------------------
Dyalign         ---      0.63   0.62     ---    ---
Consan          ---      0.79   0.79     ---    ---
-----------------------------------------------------------
RM-Coffee4      0.71     /      0.74      /      84

Page 38: Multiply Aligning  RNA  Sequences

How Best is the Best….

Method         vs. R-Coffee-Consan   vs. RM-Coffee4
----------------------------------------------------
M-Locarna      234 ***               183 **
Stral          169 ***                62
FoldalignM     146                    61
Murlet         130 *                 -12
Rnasampler     129 *                 -27
T-Lara         125 *                 -30
Poa            241 ***               217 ***
T-Coffee       241 ***               199 ***
Prrn           232 ***               198 ***
Pcma           218 ***               151 ***
Proalign       216 ***               150 **
Mafft fftns    206 ***               148 *
ClustalW       203 ***               136 ***
Probcons       192 ***               128 *
Mafft ginsi    170 ***               115
Muscle         169 ***               111

Page 39: Multiply Aligning  RNA  Sequences

Range of Performances

Effect of Compensated Mutations

Page 40: Multiply Aligning  RNA  Sequences

Split Alignments and RNA

Few of the new long RNAs are reported with a secondary structure

Two explanations
– They do not have a secondary structure
– It is hard to predict the structure

To predict the structure
– One needs homologues to build an MSA

To find homologues one needs to find them

Page 41: Multiply Aligning  RNA  Sequences

Split Alignments and RNA

– Protein Split Alignments
– Guided by Primary structure

Transcript

genome

Page 42: Multiply Aligning  RNA  Sequences

Split Alignments and RNA

CCAGGCAAGACGGGACGAGAGTTGCCTGG

CCTCCGTTC AGAGGTGCATA GAACGGAGG

Page 43: Multiply Aligning  RNA  Sequences

Split Alignments and RNA

Homology appears through secondary structures

One needs to evaluate all possible secondary structures

Very computationally intensive
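To give a feel for the cost of scoring candidate structures, here is a Nussinov-style base-pair maximisation. This is a classic textbook simplification swapped in for illustration: it is not the method used in these slides and it ignores folding energies. The function name, the minimum loop size of 3 and the treatment of T as U are assumptions.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov-style DP, O(L^3) time)."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    s = seq.upper().replace("T", "U")   # accept DNA-style input
    n = len(s)
    if n == 0:
        return 0
    # best[i][j]: maximum number of pairs within s[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # position j left unpaired
            for k in range(i, j - min_loop):       # position j paired with k
                if (s[k], s[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))
    print(max_base_pairs("CCTCCGTTCAGAGGTGCATAGAACGGAGG"))
```

Even this simplified single-sequence fold is cubic in the sequence length. Repeating such an evaluation for every candidate placement of a transcript on a genome is what makes the split alignment problem so expensive.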

Page 44: Multiply Aligning  RNA  Sequences

Conclusion/Future Directions

T-Coffee/Consan is currently the best MSA protocol for ncRNAs

Testing how important the accuracy of the secondary structure prediction is

Going deeper into Sankoff’s territory: predicting and aligning simultaneously

Solving the split alignment problem

Page 45: Multiply Aligning  RNA  Sequences

www.tcoffee.org

Credits and Web Servers

Andreas Wilm (UCD)
Des Higgins (UCD)
Sebastien Moretti (SIB)
Ioannis Xenarios (SIB)
Matthias Zytneki (CRG)
Thomas Derrien (CRG)
Roderic Guigo (CRG)
Ramin Shiekhattar (CRG)

CRG, SIB, UCD

Page 46: Multiply Aligning  RNA  Sequences