Participation funded by
Labex NUMEV
http://www.lirmm.fr/numev/
Master STIC pour la Santé
[email protected]@um2.fr
http://www.master-stic-sante.univ-montp2.fr/
Specialty: « Bioinformatique, Connaissances, Données » (Bioinformatics, Knowledge, Data)
[email protected]@inserm.fr
http://www.lirmm.fr/BCD/
Improving the computation of guide trees for
genome multiple alignments in Ensembl
Compara API
Nicolas Fiorini1,2, Paul Flicek2 and Javier Herrero2
1 [email protected], 2 {fiorini,flicek,herrero}@ebi.ac.uk
1 Universite de Montpellier 2 Sciences et Techniques, Place Eugene Bataillon, 34095 Montpellier Cedex 5, France
2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
Introduction
The Ensembl Compara database stores the results of genome-wide species comparisons calculated for each data release. The EPO pipeline (Enredo [1], Pecan [1,2], Ortheus [3]) considers genome duplications. From the alignments, conservation scores and constrained elements are determined using GERP [4].
EPO requires a guide tree to align the sequences, and GERP requires a neutral tree to detect deviations from the expected number of substitutions in the alignment. One can use the neutral species tree estimated from 4D (four-fold degenerate) sites. This works well for Mercator-Pecan alignments, a simpler pipeline that does not consider duplications. For EPO alignments that contain segmental duplications, however, a new tree has to be estimated for the sequences involved.
Using a correct tree is key to obtaining reliable alignments and conservation data.
Guide trees
– The species tree does not work for duplications
– Series of draft trees/alignments are computed until the tree is stable
– Once stable, the final tree is used to guide the alignment
This process is prone to getting trapped in a local maximum.
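The draft tree/alignment loop described above can be sketched as follows. This is only a minimal illustration, not the EPO implementation: `refine_guide_tree`, `align` and `infer_tree` are hypothetical names, the aligner and the tree builder are passed in as plain functions, and trees are assumed to be compared as canonical Newick strings.

```python
from typing import Callable, List, Tuple


def refine_guide_tree(
    sequences: List[str],
    initial_tree: str,
    align: Callable[[List[str], str], List[str]],
    infer_tree: Callable[[List[str]], str],
    max_rounds: int = 10,
) -> Tuple[str, int]:
    """Alternate draft alignments and draft trees until the tree stops changing.

    `align` and `infer_tree` are placeholders (assumptions) for a real aligner
    and a real tree-building program. Trees are compared as strings, which
    assumes a canonical Newick representation.
    """
    tree = initial_tree
    for round_number in range(1, max_rounds + 1):
        alignment = align(sequences, tree)  # draft alignment guided by the current tree
        new_tree = infer_tree(alignment)    # draft tree estimated from that alignment
        if new_tree == tree:                # stable: the same tree twice in a row
            return tree, round_number
        tree = new_tree
    return tree, max_rounds                 # no stable tree within max_rounds
```

The stopping rule only checks that the tree no longer changes, so the result depends on the starting tree. This is how such a loop can settle on a local maximum.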
Analysis
We need to compare both methods. Because Pecan is considered reliable, it will serve as the reference for improving EPO.
– As Pecan uses the species tree as its guide tree, trees inferred from its alignments should not show discrepancies with the species tree
– However, trees inferred from Pecan alignments frequently (81%) show an unexpected global topology (in the Primates/Rodents/Laurasiatheria branching)
– The overall topology of EPO trees is correct, apart from the many discrepancies within the Laurasiatheria group (due to the duplications)
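A comparison of this kind can be sketched by checking, for each inferred tree, whether an expected grouping appears as a clade. The code below is an illustration under assumptions: trees are rooted and written as nested tuples of leaf names, and the function names and the three-species example are hypothetical, not the data or tools used for the poster.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple, Union

# A rooted tree is either a leaf name or a tuple of subtrees (assumed encoding).
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the set of leaves found under every internal node of the tree."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def fraction_missing_clade(trees: List[Tree], expected: Iterable[str]) -> float:
    """Fraction of trees in which the expected grouping is not a clade."""
    if not trees:
        return 0.0
    target = frozenset(expected)
    missing = sum(1 for tree in trees if target not in clades(tree))
    return missing / len(trees)


# Hypothetical example: four small trees, two of which do not group human with mouse.
example_trees: List[Tree] = [
    (("human", "mouse"), "dog"),
    ("human", ("mouse", "dog")),
    (("human", "dog"), "mouse"),
    (("mouse", "human"), "dog"),
]
print(fraction_missing_clade(example_trees, {"human", "mouse"}))
```

Comparing leaf sets rather than strings makes the check independent of the order in which subtrees are written.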
If we can combine the strengths of both methods, we will be able to get better trees.
We tried adding a new step before the tree computation in Pecan. This step comes from the EPO pipeline: the gapped-sites filter.
– From a Pecan alignment, we remove every site containing at least one gap
– We compute a new tree from these data
– The resulting tree is the same as the species tree
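The first step above, removing every site that contains at least one gap, can be sketched as below. This is a minimal illustration and not the Ensembl Compara code: the function name, the dictionary representation of the alignment and the toy sequences are assumptions.

```python
from typing import Dict


def remove_gapped_columns(alignment: Dict[str, str], gap_chars: str = "-") -> Dict[str, str]:
    """Keep only the alignment columns (sites) in which no sequence has a gap.

    `alignment` maps a sequence name to its aligned sequence; all aligned
    sequences must have the same length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) iterates over columns; a column is kept only if it has no gap.
    kept = [column for column in zip(*rows) if not any(c in gap_chars for c in column)]
    return {name: "".join(column[i] for column in kept) for i, name in enumerate(names)}


# Hypothetical toy alignment: columns 2 and 4 each contain a gap and are dropped.
toy = {
    "human": "ACG-TA",
    "mouse": "A-GCTA",
    "dog":   "ACGCTT",
}
filtered = remove_gapped_columns(toy)
print(filtered)
```

The filtered alignment would then be passed to the tree-building step in place of the full alignment.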
Implementing this EPO step in Pecan allowed us to recover the expected species tree at the first computation on these data. This shows that the different processes used in EPO and Pecan can be complementary, so EPO can be improved by using some Pecan steps.
Future work
The most consistent process in the Pecan pipeline that can affect the alignment is the use of the species tree as a guide tree. As noted in the Guide trees section, EPO cannot benefit from this accuracy, because the species tree does not work for duplications.
However, it is possible to build a reconciled tree. This tree would account for duplications in the evolution of the sequences. As long as it contains no mistakes, it would be the equivalent of the species tree, specific to each data set containing duplications.
Therefore, after the gap filter, we could run a reconciliation tool to get the most parsimonious trees instead of testing random trees with the ML (maximum likelihood) approach. We expect better topologies, especially in the Laurasiatheria group, as the guide tree is the key parameter for getting this group right in the Pecan method. The final alignment would thus be better, and so would the GERP analysis.
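To illustrate what a parsimony criterion for reconciliation can look like, the sketch below counts the duplications implied by a gene tree using the classical LCA mapping onto the species tree. It is an assumption-laden illustration, not the tool planned for this work: trees are nested tuples of leaf names, and the function names, the sequence names and the `species_of` mapping are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # leaf name or tuple of subtrees (assumed encoding)


def species_clades(species_tree: Tree) -> List[FrozenSet[str]]:
    """Leaf set of every node of the species tree, leaves included."""
    out: List[FrozenSet[str]] = []

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(walk(child) for child in node))
        out.append(below)
        return below

    walk(species_tree)
    return out


def count_duplications(gene_tree: Tree, species_tree: Tree, species_of: Dict[str, str]) -> int:
    """Count gene-tree nodes that the LCA mapping labels as duplications."""
    clade_list = species_clades(species_tree)

    def lca(species: FrozenSet[str]) -> FrozenSet[str]:
        # The smallest species-tree clade containing all the given species.
        return min((c for c in clade_list if species <= c), key=len)

    duplications = 0

    def walk(node: Tree) -> FrozenSet[str]:
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([species_of[node]]))
        child_maps = [walk(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:  # maps to the same species node as one of its children
            duplications += 1
        return here

    walk(gene_tree)
    return duplications


# Hypothetical example: one duplication shared by human and mouse.
species_tree: Tree = (("human", "mouse"), "dog")
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "d1": "dog"}
gene_tree: Tree = ((("h1", "m1"), ("h2", "m2")), "d1")
print(count_duplications(gene_tree, species_tree, species_of))
```

Under this criterion, a candidate tree that contradicts the species tree needs extra duplications to be explained, so candidate topologies can be ranked by their counts. A full reconciliation tool would also account for gene losses.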
References
[1] Paten B., Herrero J., Beal K., Fitzgerald S. and Birney E., Enredo and Pecan: Genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Research, Nov;18(11):1814-28, 2008.
[2] Paten B., Herrero J., Beal K. and Birney E., Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics, Feb 1;25(3):295-301, 2009.
[3] Paten B., Herrero J., Fitzgerald S., Beal K., Flicek P., Holmes I. and Birney E., Genome-wide nucleotide level mammalian ancestor reconstruction. Genome Research, Nov;18(11):1829-43, 2008.
[4] Cooper GM et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 15:901-913, 2005.
Acknowledgements
We thank the EMBL-EBI for funding the internship of Nicolas Fiorini as well as the Labex NUMEV (http://www.lirmm.fr/numev/) for funding the JOBIM participation. We also thank Vincent Berry for his guidance concerning the tools we will use, the Ensembl Compara team for their day-to-day help and Alban Mancheron for giving us the template of this poster.