getting the best out of multiple sequence alignment methods in the genomic era cédric notredame...
TRANSCRIPT
Getting the best out of multiple sequence alignment methods in
the genomic era
Cédric NotredameComparative Bioinformatics GroupBioinformatics and Genomics Program
Which Tool for Which Sequence ?
In- SilVo Biology
In Silico Biology – Making Sense of digital data
In Vivo Biology– Recording data in a living Cell
In SilVo Biology– Connect In-Vitro and In-Vivo
In-Vivo: High-throughput recording In-Silico: High-Throughput analysis
Is it Possible to Compare all Types of Sequences ?
Non Transcribed World– Genes/Full Genomes
Lagan, TBA
– Promoter Regions Meta-Aligner Motifs Finders
– Nucleosome ???
Multiple Genome Aligners– Not Very Accurate– Very Fast– Deal with rearrangements
Multiple Genome Alignments and
re-sequencing
Before– Re-sequence Human
Genomes– Map the Reads onto the
reference genome
Now– Re-sequence– Assemble– Align– Non trivial with very large
datasets
Is it Possible to Compare all Types of Sequences ?
RNA Comparison– Less Accurate than Proteins– Secondary Structures
ncRNA World– Sankoff
Time O(L2n) Space O(L3n)
– Consan– R-Coffee
Is it Possible to Compare all Types of Sequences ?
Protein Comparisons– Very Accurate– 3D-Structure Improves it
Protein Aligners– ClustalW– T-Coffee– 3D-Coffee
What Changes with 1000 Genomes?
Phylogeny Vs Function
Function– Low level => Biochemistry => Protein Domains– High Level => Metabolic Pathway => Orthology
Orthology– Phylogenetic Analysis– Phylogenetic Analysis =>Accurate Alignments
Using The tree
Correct Tree
Correct Orthologous Assignment
Correct Functional Prediction
The Alignment that Hides The Forest…
Why So Much interest for MSA methods???
Phylogenetic Trees and
Multiple Sequence Alignments
Phylogenetic Trees and
Multiple Sequence Alignments
Positive Selection and MSAs
Positive Selection and MSAs
Positive Selection and MSAs
Multiple Genome Alignments
Multiple Genome Alignments
Genomic Era: The Goal
10.000 Sequences: interspecies 1 Billion: Re-sequencing
Incorporation of ALL experimental Data– Structure, Genomic, ChIp-Chip, ChIp-Seq…
Alignments suitable for all applications of comparative genomics– Homology Modeling (function)– Functional Analysis– Phylogenetic Reconstruction– 3D-Modelling
Accurate Alignments for ALL kind of data
Non Transcribed DNA Transcribed DNA Translated DNA
Genomic Era Challenges
Accuracy– Proteins: 30% is the limit– DNA/RNA 70% is the limit
Scale– With too many sequences
algorithms lose in accuracy
Data Integration– Structure– Homology– Genomic Structure– Function– Proteomics
Methods– Wealth of alternative methods– Poorly Characterized
Consistency and Data Integration
Most methods rely on the progressive algorithm
Consistency based methods have been designed as an extension
Consistency based alignment methods have been designed to:
– Better extract the signal contained in the data– Integrate/Confront existing methods– Integrate/Confront heterogeneous types of Information
The Progressive Alignment Algorithm
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
SeqA GARFIELD THE LAST FA-T CATSeqB GARFIELD THE FAST CA-T ---SeqC GARFIELD THE VERY FAST CATSeqD -------- THE ---- FA-T CAT
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88SeqB GARFIELD THE FAST CAT ---
SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELD THE VERY FAST CAT
SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100SeqD -------- THE ---- FAT CAT
SeqB GARFIELD THE ---- FAST CAT Prim. Weight =100SeqC GARFIELD THE VERY FAST CAT
SeqC GARFIELD THE VERY FAST CAT Prim. Weight =100SeqD -------- THE ---- FA-T CAT
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT Prim. Weight =88SeqB GARFIELD THE FAST CAT ---
SeqA GARFIELD THE LAST FA-T CAT Prim. Weight =77 SeqC GARFIELD THE VERY FAST CAT
SeqA GARFIELD THE LAST FAT CAT Prim. Weight =100SeqD -------- THE ---- FAT CAT
SeqB GARFIELD THE ---- FAST CAT Prim. Weight =100SeqC GARFIELD THE VERY FAST CAT
SeqC GARFIELD THE VERY FAST CAT Prim. Weight =100SeqD -------- THE ---- FA-T CAT
SeqA GARFIELD THE LAST FAT CAT Weight =88SeqB GARFIELD THE FAST CAT ---
SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELD THE VERY FAST CATSeqB GARFIELD THE ---- FAST CAT
SeqA GARFIELD THE LAST FA-T CAT Weight =100SeqD -------- THE ---- FA-T CATSeqB GARFIELD THE ---- FAST CAT
T-Coffee and Concistency…
SeqA GARFIELD THE LAST FAT CAT Weight =88SeqB GARFIELD THE FAST CAT ---
SeqA GARFIELD THE LAST FA-T CAT Weight =77 SeqC GARFIELD THE VERY FAST CATSeqB GARFIELD THE ---- FAST CAT
SeqA GARFIELD THE LAST FA-T CAT Weight =100SeqD -------- THE ---- FA-T CATSeqB GARFIELD THE ---- FAST CAT
T-Coffee and Concistency…
T-Coffee and Concistency…
Methods
Data
Scalability
A Brief History of Consistency
A Long Chain of Small Contributions…
Consistency Based Algorithms
Gotoh (1990)– Iterative strategy using consistency
Martin Vingron (1991)– Dot Matrices Multiplications– Accurate but too stringeant
Dialign (1996, Morgenstern)– Concistency– Agglomerative Assembly
T-Coffee (2000, Notredame)– Concistency– Progressive algorithm
ProbCons (2004, Do)– T-Coffee with a Bayesian Treatment– Biphasic Gap Penalty
AMAP (Schwarz, 2007)– ProbCons Consistency– Replace Progressive alignment with
simulated Annealing– Hard to distinguish from ProbCons
FSA ( Patcher, 2009)– AMAP with automated parameter
estimation– Hard to distinguish from ProbCons
Choosing the right modeling methodM-Coffee
Combining Many MSAs into ONE
MUSCLE
MAFFT
ClustalW
???????
T-Coffee
Consistency and Accuracy
Integrating New Types of DataTemplate Based Sequence
Alignments
ExperimentalData
…
TARGET
ExperimentalData
…
TARGETTemplate
Aligner
Template-Sequence Alignment
Primary Library
Template Alignment
Template based Alignmentof the Sequences
Templates Templates
TARGET
Exploring The Template World
Template Generator Alignment Method
RNA Structure Prediction RNA Aligner
Protein Structure BLAST vs PDB 3D Aligner
Profile BLAST vs NR Profile/Profile Alignment
Gene Structure ENSEMBL Genome Aligner
Promoter Transfac Meta-Aligner
Exploring The Template World
Template Generator Alignment Method
Mode
RNA Structure Prediction RNA Aligner R-Coffee
Protein Structure BLAST /PDB 3D Aligner 3D-Coffee
Profile BLAST/NR Profile/Profile PSI-Coffee
Gene Structure ENSEMBL Genome Aligner Exoset
Promoter Transfac Meta-Aligner Meta-Coffee
3D-Coffee/ExpressoIncorporating
Structural Information
Expresso: Finding the Right Structure
Sources
Templates
Library
BLAST BLAST
SAP
Template Alignment
Source Template Alignment
Remove Templates
Templates
PSI-CoffeeHomology Extension
Exploring The Template World
What is Homology Extension ?
L L
L
?
-Simple scoring schemes result in alignment ambiguities
What is Homology Extension ?
L L
L
LLLLLL
LLIVIL
LLLLLL
Profile 1
Profile 2
What is Homology Extension ?
L L
L
LLLLLL
LLIVIL
LLLLLL
Profile 1
Profile 2
PSI-Coffee: Homology Extension
Sources
Templates
Library
BLAST BLAST
Template Alignment
Source Template Alignment
Remove Templates
TemplatesProfile Aligner
Benchmarks
Method Method Template Score Comment
ClustalW-2 Progressive NO 22.74
PRANK Gap NO 26.18 Science2008
MAFFT Iterative NO 26.18
Muscle Iterative NO 31.37
ProbCons Consistency NO 40.80
ProbCons MonoPhasic NO 37.53
T-Coffee Consistency NO 42.30
M-Coffe4 Consistency NO 43.60
PSI-Coffee Consistency Profile 53.71
PROMAL Consistency Profile 55.08
PROMAL-3D Consistency PDB 57.60
3D-Coffee Consistency PDB 61.00 Expresso
Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Method Method Template Score Comment
ClustalW-2 Progressive NO 22.74
PRANK Gap NO 26.18 Science2008
MAFFT Iterative NO 26.18
Muscle Iterative NO 31.37
ProbCons Consistency NO 40.80
ProbCons MonoPhasic NO 37.53
T-Coffee Consistency NO 42.30
M-Coffe4 Consistency NO 43.60
PSI-Coffee Consistency Profile 53.71
PROMAL Consistency Profile 55.08
PROMAL-3D Consistency PDB 57.60
3D-Coffee Consistency PDB 61.00 Expresso
Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Consistency
Method Method Template Score Comment
ClustalW-2 Progressive NO 22.74
PRANK Gap NO 26.18 Science2008
MAFFT Iterative NO 26.18
Muscle Iterative NO 31.37
ProbCons Consistency NO 40.80
ProbCons MonoPhasic NO 37.53
T-Coffee Consistency NO 42.30
M-Coffe4 Consistency NO 43.60
PSI-Coffee Consistency Profile 53.71
PROMAL Consistency Profile 55.08
PROMAL-3D Consistency PDB 57.60
3D-Coffee Consistency PDB 61.00 Expresso
Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Homology Extension
Method Method Template Score Comment
ClustalW-2 Progressive NO 22.74
PRANK Gap NO 26.18 Science2008
MAFFT Iterative NO 26.18
Muscle Iterative NO 31.37
ProbCons Consistency NO 40.80
ProbCons MonoPhasic NO 37.53
T-Coffee Consistency NO 42.30
M-Coffe4 Consistency NO 43.60
PSI-Coffee Consistency Profile 53.71
PROMAL Consistency Profile 55.08
PROMAL-3D Consistency PDB 57.60
3D-Coffee Consistency PDB 61.00 Expresso
Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
Structural Extension
T-Coffee and The World
BLAST/SOAP
-Some Templates are obtained with a BLAST-Queries can be sent to the EBI or the NCBI-No Need for a Local BLAST installation
Users sequences
Incorporating RNA Information Within the T-Coffee Algorithm
ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGGCCTCCGTTCAGAGGTGCATAGAACGGAGG**-------*--**---*-**------**
GAACGGACC
CTTGCCTGG
GG
AAC CA
CGG
AG
AC G
CTTGCCTCC
GAACGGAGG
GG
AAC CA
CGG
AG
AC G
ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGGCCTCCGTTCAGAGGTGCATAGAACGGAGG**-------*--**---*-**------**
CC--AGGCAAGACGGGACGAGAGTTGCCTGGCCTCCGTTCAGAGGTGCATAGAAC--GGAGG** * *** * * *** **
Sequence Alignment (Maximizing Identity)
-Incorrect-Not Predictive
The Holy Grail of RNA Comparison:Sankoff’ Algorithm
CC
R-Coffee Extension
GG
TC Library
G G Score XC C Score Y
CC
GG
Goal: Embedding RNA Structures Within The T-Coffee Libraries The R-extension can be added on the top of any existing method.
R-Coffee + Structural Aligners
Method Avg Braliscore Net Improv.direct +T +R +T +R
-----------------------------------------------------------Stemloc 0.62 0.75 0.76 104 113Mlocarna 0.66 0.69 0.71 101 133Murlet 0.73 0.70 0.72 -132 -73Pmcomp 0.73 0.73 0.73 142 145T-Lara 0.74 0.74 0.69 -36 -8Foldalign 0.75 0.77 0.77 72 73-----------------------------------------------------------Dyalign --- 0.63 0.62 --- ---Consan --- 0.79 0.79 --- --------------------------------------------------------------
Improvement= # R-Coffee wins - # R-Coffee looses over 170 test sets
R-Coffee + Regular Aligners
Method Avg Braliscore Net Improv.direct +T +R +T +R
-----------------------------------------------------------Poa 0.62 0.65 0.70 48 154Pcma 0.62 0.64 0.67 34 120Prrn 0.64 0.61 0.66 -63 45ClustalW 0.65 0.65 0.69 -7 83Mafft_fftnts 0.68 0.68 0.72 17 68ProbConsRNA 0.69 0.67 0.71 -49 39Muscle 0.69 0.69 0.73 -17 42Mafft_ginsi 0.70 0.68 0.72 -49 39-----------------------------------------------------------
Improvement= # R-Coffee wins - # R-Coffee looses over 388 test sets
Genomic Era ChallengesConclusion
TemplateBased
Alignments
Meta-MethodsM-Coffee
Homology Extension(Proteins)
R-Coffee
Scaled Consistency
Open Questions
Accurately Aligning non transcribed DNA Coping with One Billion Human Genomes
Comparative Bioinformatics
University College Dublin– Des Higgins– Orla O’Sullivan– Iain Wallace (UCD, IE)
Berlin Free University– Knut Reinert– Tobias Rausch
Swiss Intitute of Bioinformatics– Ioannis Xenarios– Sebastien Morreti
Comparative Bioinformatics– Merixell Oliva– Giovanni Bussoti– Carsten Kemena– Emanuele Rainieri– Ionas Erb– Jia Ming Chang– Matthias Zytneki
Why So Much Interest For Multiple Alignments ?
Extrapolation
Motifs/Patterns
Phylogeny
Profiles
Structure Prediction
SNP Analysis
Reactivity Analysis
Regulatory Elements
Phylogeny Vs Function:Applications
Comparative Genomics => New Medium New Medium => New Clinical Test
Detecting ncRNAs in silico: a long way to go…
RNAse P
Obtaining the Structure of a ncRNA is difficult
Hard to Align The Sequences Without the Structure
Hard to Predict the Structures Without an Alignment
R-Coffee: Modifying T-Coffee at the Right Place
Incorporation of Secondary Structure information within the Library
Two Extra Components for the T-Coffee Scoring Scheme
– A new Library– A new Scoring Scheme
G-INS-i, H-INS-i and F-INS-i use pairwise alignment information when constructing a multiple alignment. The two options ([HF]-INS-i) incorporate local alignment information and do NOT USE FFT.
Molecular BiologyWithin the System Biology Era
Protein A Interacts with
RegulatesInhibits
Protein B
Molecular BiologyIn the 1000 Genomes Era
Protein A Interacts with
RegulatesInhibits
Protein B
Variation Within Species: CNVs of A and SNPs
Conservation Across Species
System Biology vs
Comparative Genomics
Systems Biology
Systems can be Understood
Comparative Genomics
Systems can Evolve through Selection
Phylogeny Vs Function:Applications
– Important Application
– Possible Many New Genomes
– Challenging Too Many New Genomes
3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
Comparing Methods
MAFFT
Some Benchmark: BB11 BaliBase
BB11: 38 highly divergent (less than 25% id) datasets from BaliBaseBB11: predicts 78% of the results measured on other datasets
Blackshield, Higgins
PhD Fellowships www.crg.es
What ‘s in a Multiple Sequence Alignment
Evolution Inertia
Common Ancestry Shows up
In the sequences
Selection
Important FeaturesAre Preserved
Functional Constraint
Same FunctionSame Sequence
ConvergencePhylogenetic Footprint, Evolutionary Trace …
Which Tool for Which Sequence ?
Is it Possible to Compare all Types of Sequences
Non Transcribed World– Genes/Full Genomes
Lagan, TBA
– Promoter Regions Meta-Aligner Motifs Finders
– Nucleosome ???
Multiple Genome Aligners– Not Very Accurate– Very Fast– Deal with rearrangements
Is it Possible to Compare all Types of Sequences
RNA Comparison– Less Accurate than Proteins– Secondary Structures
ncRNA World– Consan– R-Coffee
Is it Possible to Compare all Types of Sequences
Protein Comparisons– Very Accurate– 3D-Structure Improves it
Protein Aligners– ClustalW– T-Coffee– 3D-Coffee
Why So Much Interest For Multiple Alignments ?
Extrapolation
Motifs/Patterns
Phylogeny
Profiles
Structure Prediction
SNP Analysis
Reactivity Analysis
Regulatory Elements
What’s in a Multiple Alignment ?
The MSA contains what you put inside:
– Structural Similarity– Evolutive Similarity– Sequence Similarity
You can view your MSA as:
– A record of evolution– A summary of a protein family– A collection of experiments made for you by Nature…
Producing The Right Alignment
Multiple Sequence Alignments Influence Phylogenetic Trees
Choice of Method is not Neutral
– Different Methods– Different Alignments– Different Trees
Using The Right Models insures Producing the right Tree
Model Based Alignments vs Naïve Alignments
Naïve Alignment– Lexicographic Alignment– Maximizing the number of identities– At best using a substitution matrix
Model Based Alignments– Using a model– Protein structure information– RNA Structure information– Combining/Confronting Modeling
methods
Template based Alignments– Model based Alignments through the
use of Templates
T-Coffee and Model Based Alignments
T-Coffee Algorithm
Expresso: Aligning Protein Structures
R-Coffee: Aligning RNA structures
M-Coffee: Combining methods
T-Coffee and Concistency…
When Sequences Are not Enough
3D-Coffee and Expresso
3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
Where to Trust Your Alignments
Most Methods Agree
Most Methods Disagree
Conclusion
Model Based Alignments Give the best Accuracy
Template based alignment is a very efficient way to turn Naïve aligners into model based aligners
Sequence Alignments are not necessarily reliable over their entire lengths
Manguel M, Samaniego F.J., Abraham Wald’s Work on Aircraft Suvivability, J. American Statistical Association. 79, 259-270, (1984)
Building and Using Models
35.67 Angstrom
Computing the Correct Alignment is a Complicated Problem
Stochastic Optimization
Stochastic Optimization
Exploration of Complex Optimization Problems With Multiple Constraints
– Genomic Alignments– RNA Alignments
Generation of Population of Suboptimal Solutions
– Quality=f( optimality )
Specification of Concistency Objective Function of T-Coffee
Three Types of Algorithms
Progressive: ClustalW
Iterative: Muscle
Concistency Based: T-Coffee and Probcons
T-Coffee and Concistency…
Each Library Line is a Soft Constraint (a wish)
You can’t satisfy them all
You must satisfy as many as possible (The easy ones)
Concistency Based Algorithms: T-Coffee
Gotoh (1990)– Iterative strategy using consistency
Martin Vingron (1991)– Dot Matrices Multiplications– Accurate but too stringeant
Dialign (1996, Morgenstern)– Concistency– Agglomerative Assembly
T-Coffee (2000, Notredame)– Concistency– Progressive algorithm
ProbCons (2004, Do)– T-Coffee with a Bayesian Treatment
How Good Is My Method ?
Structures Vs Sequences
T-Coffee Results
Validation Using BaliBase
Too Many Methods for ONE AlignmentM-Coffee
Estimating the Accuracy of your MSA
What To Do Without Structures
3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
Expresso: Finding the Right Structure
Why Not Using Structure Based Alignments
Template Based Multiple Sequence Alignments
Template Based Multiple Sequence Alignments
-Structure-Profile-…
Sources
Templates
Library
TemplateAligner
Template Alignment
Source Template Alignment
Remove Templates
Templates-Structure-Profile-…
Method Score Templates Prefab Homstrad --------------------------------------------------------------ClustalW Matrix ---- 61.80 ----Kalign Matrix ---- 63.00 ----MUSCLE Matrix ---- 68.00 45.0--------------------------------------------------------------
T-Coffee Consistency ---- 69.97 44.0ProbCons Consistency ---- 70.54 ----Mafft Consistency ---- 72.20 ----M-Coffee Consistency ---- 72.91 ----MUMMALS Consistency ---- 73.10 ------------------------------------------------------------------Clustal-db Matrix Profiles ---- ----PRALINE Matrix Profiles ---- 50.2PROMALS Consistency Profiles 79.00 ----SPEM Matrix Profiles 77.00 ------------------------------------------------------------------EXPRESSO Consistency Structures ---- 71.9 *T-Lara Consistency Structures ---- ------------------------------------------------------------------
Table 1. Summary of all the methods described in the review. Validation figures were compiled from several sources, and selected for the compatibility. Prefab refers to some validation made on Prefab Version 3. The HOMSTRAD validation was made on datasets having less than 30% identity. The source of each figure is indicated by a reference.*The EXPRESSO figure comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.
Improving The Evaluation
How Do We Perform In The Twilight Zone?
Concistency Based Methods Have an Edge Hard to tell Methods Apart Sequence Alignment is NOT solved
More Than Structure based Alignments
Structural Correctness Is Only the Easy Side of the Coin.
In practice MSA are intermediate models used to generate other models:
Data Model Type BenchmarkHomology Profile Yes
Evolution Trees No
Structure 3D-Structure CASP
Function Annotation No
Conclusion
Template based Multiple Sequence Alignments Projecting any relevant information onto the sequences Using this Information
Need for new evaluation procedures Functional Analysis Phylogenetic Analysis Homology Search (Profiles) Homology Modelling
Integrating data Making sure your bits of data can fight with one another
Turning Data into Models
DataColumbus, considered that the landmass occupied 225°, leaving only 135° of water (Marinus of Tyre, 70 AD).
Columbus believed that 1° represented only 56 miles (Alfraganus, XIth century)
He knew there was an island named Japan off the cost of China…
ModelCircumference of the Earth as 25,255 km at most,Canary Island to Japan : 3,700 km (Reality: 12,000 km.)
The More Structures The Merrier
Average Improvement over
T-Coffee
Struc/Seq Ratio
The Right Mixt of Methods
3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
Applications
Looking-Up The DNA Behind The Sequences: PROTOGENE
SAR Analysis
Correlate Alignment Variations with Reactivity
Application to the Human Kinome Collaboration with Sanofi-Aventis
Main Issue: – Training problem Proper Benchmarking
ncRNA Multiple Alignments with R-Coffee
Laundering the Genome Dark Matter
Cédric NotredameComparative Bioinformatics GroupBioinformatics and Genomics Program
No Plane Today…
ncRNAs Comparison
And ENCODE said…“nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
Who Are They?– tRNA, rRNA, snoRNAs, – microRNAs, siRNAs– piRNAs– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of them– Open question– 30.000 is a common guess– Harder to detect than proteins
.
ncRNAs can have different sequences and Similar Structures
ncRNAs are Difficult to Align
Same Structure Low Sequence Identity
Small Alphabet, Short Sequences Alignments often Non-Significant
Obtaining the Structure of a ncRNA is difficult
Hard to Align The Sequences Without the Structure
Hard to Predict the Structures Without an Alignment
The Holy Grail of RNA Comparison:Sankoff’ Algorithm
The Holy Grail of RNA ComparisonSankoff’ Algorithm
Simultaneous Folding and Alignment
– Time Complexity: O(L2n)– Space Complexity: O(L3n)
In Practice, for Two Sequences:
– 50 nucleotides: 1 min. 6 M.– 100 nucleotides 16 min. 256 M.– 200 nucleotides 4 hours 4 G.– 400 nucleotides 3 days 3 T.
Forget about– Multiple sequence alignments– Database searches
The next best Thing: Consan
Consan = Sankoff + a few constraints
Use of Stochastic Context Free Grammars
– Tree-shaped HMMs– Made sparse with constraints
The constraints are derived from the most confident positions of the alignment
Equivalent of Banded DP
Going Multiple….
Structural Aligners
Game Rules
Using Structural Predictions– Produces better alignments– Is Computationally expensive
Use as much structural information as possible while doing as little computation as possible…
Adapting T-Coffee To
RNA Alignments
T-Coffee and Concistency…
T-Coffee and Concistency…
T-Coffee and Concistency…
T-Coffee and Concistency…
Consistency: Conflicts and Information
X
Y
X
Z
Y
W Z
X
Z
Y
ZW
Y
W
X
Z
Y
Z
X
WY
Z
X
W
Partly Consistent
Less Reliable
Fully Consistent
More Reliable
Y is unhappy X is unhappy
X
Y
RNA Sequences
Secondary Structures
Primary Library
R-Coffee ExtendedPrimary Library
Progressive AlignmentUsing The R-Score
RNAplfoldConsan
orMafft / Muscle / ProbCons
R-CoffeeExtension
R-Score
CC
R-Coffee Scoring Scheme
GG
R-Score (CC)=MAX(TC-Score(CC), TC-Score (GG))
Validating R-Coffee
RNA Alignments are harder to validate than Protein Alignments
Protein Alignments Use of Structure based Reference Alignments
RNA Alignments No Real structure based reference alignments– The structures are mostly predicted from
sequences– Circularity
BraliBase and the BraliScore
Database of Reference Alignments
388 multiple sequence alignments.
Evenly distributed between 35 and 95 percent average sequence identity
Contain 5 sequences selected from the RNA family database Rfam
The reference alignment is based on a SCFG model based on the full Rfam seed dataset (~100 sequences).
BraliBase SPS Score
RFam MSA
Number of Identically Aligned PairsSPS=Number of Aligned Pairs
BraliBase: SCI Score
RNApfold
(((…)))…((..)) G Seq1(((…)))…((..)) G Seq2(((…)))…((..))G Seq3(((…)))…((..)) G Seq4(((…)))…((..)) G Seq5(((…)))…((..)) G Seq6
RNAlifold
(((…)))…((..)) ALN G
Average G Seq X Cov
G ALN
SCI=
Covariance
BRaliScore
Braliscore= SCI*SPS
RM-Coffee + Regular Aligners
Method Avg Braliscore Net Improv.direct +T +R +T +R
-----------------------------------------------------------Poa 0.62 0.65 0.70 48 154Pcma 0.62 0.64 0.67 34 120Prrn 0.64 0.61 0.66 -63 45ClustalW 0.65 0.65 0.69 -7 83Mafft_fftnts 0.68 0.68 0.72 17 68ProbConsRNA 0.69 0.67 0.71 -49 39Muscle 0.69 0.69 0.73 -17 42Mafft_ginsi 0.70 0.68 0.72 -49 39-----------------------------------------------------------RM-Coffee4 0.71 / 0.74 / 84
How Best is the Best….
M-Locarna 234 *** 183 **
Stral 169 *** 62
FoldalignM 146 61
Murlet 130 * -12
Rnasampler 129 * -27
T-Lara 125 * -30
Poa 241 *** 217 ***
T-Coffee 241 *** 199 ***
Prrn 232 *** 198 ***
Pcma 218 *** 151 ***
Proalign 216 *** 150 **
Mafft fftns 206 *** 148 *
ClustalW 203 *** 136 ***
Probcons 192 *** 128 *
Mafft ginsi 170 *** 115
Muscle 169 *** 111
Methodvs. R-Coffee-Consan
vs. RM-Coffee4
Range of Performances
Effect of Compensated Mutations
Conclusion/Future Directions
T-Coffee/Consan is currently the best MSA protocol for ncRNAs
Testing how important is the accuracy of the secondary structure prediction
Going deeper into Sankoff’s territory: predicting and aligning simultaneously
Credits and Web Servers
Andreas Wilm Des Higgins Sebastien Moretti Ioannis Xenarios Cedric Notredame
CGR, SIB, UCD
Prank Vs Prank
Prank Vs Prank
Gop=0Gep=0
Prank Vs Prank
The reconstruction of evolutionary homology -- including the correct placement of insertion and deletion events -- is only feasible for rather closely-related sequences. PRANK is not meant for the alignment of very diverged protein sequences. If sequences are very different, the correct homology cannot be reconstructed with confidence and
http://www.ebi.ac.uk/goldman-srv/prank/
Do Benchmarks All Tell the same story?
Based on
Probcons: Different Primary Library
Score= (MIN(xz,zk))/MAX(xz,zk)Score(xi ~ yj | x, y, z)
∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)