improving the computation of guide trees forjobim2012.inria.fr/sources/p70.pdfneutral species tree...

Participation funded by Labex NUMEV

http://www.lirmm.fr/numev/

Master STIC pour la Santé

[email protected]@um2.fr

http://www.master-stic-sante.univ-montp2.fr/

Specialization "Bioinformatics, Knowledge, Data" (Bioinformatique, Connaissances, Données)

[email protected]@inserm.fr

http://www.lirmm.fr/BCD/


Improving the computation of guide trees for genome multiple alignments in Ensembl Compara API

Nicolas Fiorini1,2, Paul Flicek2 and Javier Herrero2

1 [email protected], 2 {fiorini,flicek,herrero}@ebi.ac.uk

1 Université de Montpellier 2 Sciences et Techniques, Place Eugène Bataillon, 34095 Montpellier Cedex 5, France

2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom

Introduction

The Ensembl Compara database stores the results of genome-wide species comparisons calculated for each data release. The EPO pipeline (Enredo [1], Pecan [1,2], Ortheus [3]) considers genome duplications. From the alignments, conservation scores and constrained elements are determined using GERP [4].

EPO requires a guide tree to align the sequences, and GERP requires a neutral tree to detect deviations from the expected number of substitutions in the alignment. One can use the neutral species tree estimated from 4D sites. This works well for Mercator-Pecan alignments, a simpler pipeline that does not consider duplications. However, for EPO alignments, which include segmental duplications, a new tree has to be estimated for these sequences.

Using a correct tree is key to obtaining reliable alignments and conservation data.

Guide trees

– The species tree does not work for duplications

– Series of draft trees/alignments are computed until the tree is stable

– Once stable, the final tree is used to guide the alignment

This process is prone to getting stuck in a local maximum.
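
The iteration described above can be sketched as a fixed-point loop. This is only an illustration under stated assumptions: `refine_guide_tree`, `align` and `infer_tree` are hypothetical names, not part of the Ensembl Compara API, and the toy callables at the end merely stand in for a real aligner and a real tree builder.

```python
def refine_guide_tree(sequences, initial_tree, align, infer_tree, max_rounds=20):
    """Alternate alignment and tree inference until the tree stops changing.

    `align(sequences, tree)` and `infer_tree(alignment)` are caller-supplied
    callables (hypothetical placeholders for the real pipeline steps).
    Returns (stable_tree, final_alignment, rounds_used).
    """
    tree = initial_tree
    for round_number in range(1, max_rounds + 1):
        alignment = align(sequences, tree)   # draft alignment guided by the current tree
        new_tree = infer_tree(alignment)     # draft tree estimated from that alignment
        if new_tree == tree:                 # stable: the tree reproduces itself
            return tree, alignment, round_number
        tree = new_tree
    raise RuntimeError("guide tree did not stabilise within max_rounds")


# Toy stand-ins (assumption): a "tree" is an integer that stops changing at 3.
def toy_align(sequences, tree):
    return (tuple(sequences), tree)


def toy_infer_tree(alignment):
    return min(alignment[1] + 1, 3)


final_tree, final_alignment, rounds = refine_guide_tree(
    ["ACGT", "ACGA"], 0, toy_align, toy_infer_tree
)
```

The loop stops at the first tree that reproduces itself, so the result depends on the starting tree. This is one way to see why such a process can settle in a local maximum.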

Analysis

We need to compare both methods. As Pecan is considered reliable, it will serve as the reference for improving EPO.

– Since Pecan uses the species tree as its guide tree, trees inferred from its alignments should not show discrepancies with the species tree

– However, trees inferred from Pecan alignments frequently (81%) show an unexpected global topology (in the Primates/Rodents/Laurasiatheria branching)

– The overall topology of EPO trees is correct, apart from the many discrepancies within the Laurasiatheria group (due to the duplications)
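
One possible way to flag an unexpected topology is to compare the sets of clades of an inferred tree and of the species tree. The sketch below assumes rooted trees written as nested tuples of leaf names. The function names and the three-group example are illustrative only and are not taken from the analysis itself.

```python
def clade_set(tree):
    """Return the set of clades (frozensets of leaf names) of a nested-tuple tree."""
    clades = set()

    def collect(node):
        if isinstance(node, str):            # a leaf is a plain string
            return frozenset([node])
        leaves = frozenset().union(*(collect(child) for child in node))
        clades.add(leaves)                   # record the clade below this internal node
        return leaves

    collect(tree)
    return clades


def same_topology(tree_a, tree_b):
    """Two rooted trees have the same topology if they define the same clades."""
    return clade_set(tree_a) == clade_set(tree_b)


def fraction_discordant(inferred_trees, species_tree):
    """Share of inferred trees whose topology differs from the species tree."""
    discordant = sum(not same_topology(t, species_tree) for t in inferred_trees)
    return discordant / len(inferred_trees)


# Illustrative example (assumption): three groups as leaves.
species_tree = (("Primates", "Rodents"), "Laurasiatheria")
inferred = [
    (("Rodents", "Primates"), "Laurasiatheria"),   # same clades, child order differs
    (("Primates", "Laurasiatheria"), "Rodents"),   # different branching
]
share = fraction_discordant(inferred, species_tree)
```

Comparing clade sets ignores the order in which children are written, so only genuine differences in branching are counted.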

If we can combine the strengths of both methods, we will be able to get better trees.

We added a new step before the tree computation in Pecan. This step, the gapped sites filter, is already used in the EPO pipeline.

– From a Pecan alignment, we remove every site containing at least one gap

– We compute a new tree from these data

– The resulting tree is the same as the species tree
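
The gapped sites filter from the first step above can be sketched as follows. This is a minimal illustration, assuming the alignment is held as a mapping from sequence name to aligned string and that gaps are written as `-`. The function name `remove_gapped_sites` is hypothetical and does not come from the EPO or Pecan code.

```python
def remove_gapped_sites(alignment, gap_chars="-"):
    """Drop every alignment column that contains at least one gap character.

    `alignment` maps a sequence name to its aligned sequence; all sequences
    must have the same length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned sequences must all have the same length")

    # zip(*rows) iterates over columns (sites); keep only the gap-free ones.
    kept_columns = [
        column for column in zip(*rows)
        if not any(char in gap_chars for char in column)
    ]
    return {
        name: "".join(column[index] for column in kept_columns)
        for index, name in enumerate(names)
    }


# Illustrative input (assumption): only columns 0 and 3 are gap-free.
example = {"human": "AC-GT", "mouse": "ACTG-", "dog": "A-TGT"}
filtered = remove_gapped_sites(example)
```

The filtered alignment would then be passed to the tree-building step in place of the original one.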

Adding this EPO step to Pecan allowed us to recover the expected species tree at the first computation on these data. This shows that the different processes used in EPO and Pecan can be complementary, so EPO can be improved by using some Pecan steps.

Future work

The most consistent process in the Pecan pipeline that can impact the alignment is the use of the species tree as a guide tree. As stated in the Guide trees section, EPO cannot benefit from this accuracy.

However, it is possible to build a reconciled tree. This tree would take into account the duplications in the evolution of the sequences. As long as it does not contain mistakes, it would be equivalent to the species tree, but specific to each dataset containing duplications.

Therefore, after the gap filter, we could run a reconciliation tool to get the most parsimonious trees instead of testing random trees with the ML approach. We expect better topologies, especially in the Laurasiatheria group, as the guide tree is the key parameter to get this group right in the Pecan method. The final alignment would thus be better, as would the GERP analysis.
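
To make the idea of a parsimonious reconciliation concrete, here is a simplified sketch. It is not the reconciliation tool mentioned above, which is not named. It uses the classical LCA mapping of a gene tree onto a species tree to count implied duplications, ignores losses, and picks the candidate tree with the fewest duplications. Trees are nested tuples, gene-tree leaves are labelled by the species they come from, and all names are hypothetical.

```python
def species_clades(species_tree):
    """List every clade (frozenset of species) of a nested-tuple species tree."""
    clades = []

    def collect(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(collect(child) for child in node))
        clades.append(leaves)
        return leaves

    collect(species_tree)
    return clades


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes implied by the LCA mapping (losses are ignored)."""
    clades = species_clades(species_tree)

    def lca(species_set):
        # Smallest species-tree clade containing all the given species.
        return min((c for c in clades if species_set <= c), key=len)

    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([node]))
        child_maps = [visit(child) for child in node]
        mapped = lca(frozenset().union(*child_maps))
        if mapped in child_maps:             # node maps to the same place as a child
            duplications += 1
        return mapped

    visit(gene_tree)
    return duplications


def most_parsimonious(candidate_trees, species_tree):
    """Pick the candidate gene tree that implies the fewest duplications."""
    return min(candidate_trees, key=lambda t: count_duplications(t, species_tree))


# Illustrative data (assumption): three species, two candidate gene trees.
species_tree = (("human", "mouse"), "dog")
candidates = [
    (("human", "dog"), ("mouse", "dog")),          # implies one duplication at the root
    ((("human", "mouse"), "dog"), "dog"),          # also one duplication
    (("human", "mouse"), "dog"),                   # congruent with the species tree
]
best = most_parsimonious(candidates, species_tree)
```

A real reconciliation tool would also account for gene losses and for branch support, so this sketch only conveys the scoring principle.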

References

[1] Paten B., Herrero J., Beal K., Fitzgerald S. and Birney E., Enredo and Pecan: Genome-wide mammalian consistency based multiple alignment with paralogs. Genome Research, Nov;18(11):1814-28, 2008.

[2] Paten B., Herrero J., Beal K. and Birney E., Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics, Feb 1;25(3):295-301, 2009.

[3] Paten B., Herrero J., Fitzgerald S., Beal K., Flicek P., Holmes I. and Birney E., Genome-wide nucleotide level mammalian ancestor reconstruction. Genome Research, Nov;18(11):1829-43, 2008.

[4] Cooper G.M. et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 15:901-913, 2005.

Acknowledgements

We thank the EMBL-EBI for funding the internship of Nicolas Fiorini as well as the Labex NUMEV (http://www.lirmm.fr/numev/) for funding the JOBIM participation. We also thank Vincent Berry for his guidance concerning the tools we will use, the Ensembl Compara team for their day-to-day help and Alban Mancheron for giving us the template of this poster.
