1
Phylogenetic Service Set
Web services and workflows to infer and use phylogenetic information in biodiversity studies
Saverio Vicario CNR-ITB, Bari (Italy)
2
Meaning of a phylogeny
• It is a summary of the evolutionary history of a group of organisms
• Topology summarizes the relationships
• Branch length summarizes the expected change along a given section
• The object of this description could be the full organism, groups, or single hereditary units (SNPs or genes)
• Whatever attribute of an organism has a heritable component can be “mapped” on the tree: its history can be inferred based on the phylogeny of the species/gene
3
Contribution to INVA
• Finding the population of origin of invasive species and the modes of invasion
– Phylogeography
• Detecting selection acting on alien species, or on native species after alien intervention
– Molecular Evolution
• Diagnosing an alien species from a mixed sample of individuals
– Molecular Systematics -> Barcode
• Describing the impact of aliens on a community: biodiversity profiling with phylogenetic diversity (based on species list or environmental sequencing)
4
Contribution to ECOS
• Annotation of metagenome
– Phylogenomics: comparing the divergence of different gene families within and across samples to find which metabolic pathway is more active in a given environment (Phylogenetic inference, Molecular Evolution)
– Biodiversity profiling: phylogenetic differentiation of housekeeping genes across samples to find out the mode of ecosystem formation (Molecular Systematics -> Barcode; Phylogenetic diversity)
• Biodiversity profiling from species list and known phylogeny (Phylogenetic diversity)
5
Overall plan of phylogenetic set
Phylogenetic inference
• Align based on HMM profile or using scoring matrix (HMMer3.0, Muscle)
• User interface to describe model of substitution
• Phylogenetic inference format translator
• Infer phylogeny (MrBayes, RAxML, …)
• Assess convergence of numerical parameters and topology for MCMC/MC inference (CODA R pkg and GeoKS)
• Assess overall goodness of fit of the phylogenetic inference with a posterior predictive test, and relative fit with Akaike (HyPhy)
Use of phylogenetic information
• Estimate phylogenetic diversity (Phylocom, …)
• Estimate evolutionary parameters (HyPhy)
Utilities
• Add new sequences to a phylogenetic inference (Pplacer)
• Permute, resample, thin tree list (scripting with R)
6
Phylogenetic inference
Observe:
W AGCTGCG
X ACCGGTG
Z AGTTGTG
Y AGTTGCG
+ Evolutionary model
Guess with an inference: Tree
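As a minimal sketch of the step from observed sequences plus a model towards a tree, the four sequences on this slide can be turned into pairwise distances. The Jukes-Cantor correction is an illustrative choice of evolutionary model, not one named on the slide, and the function names are hypothetical.

```python
import math
from itertools import combinations

# The four aligned sequences shown on the slide.
ALIGNMENT = {
    "W": "AGCTGCG",
    "X": "ACCGGTG",
    "Z": "AGTTGTG",
    "Y": "AGTTGCG",
}

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance; infinite when p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(alignment):
    """Map each pair of taxa to (p-distance, Jukes-Cantor distance)."""
    result = {}
    for s, t in combinations(sorted(alignment), 2):
        p = p_distance(alignment[s], alignment[t])
        result[(s, t)] = (p, jukes_cantor(p))
    return result
```

Calling `distance_matrix(ALIGNMENT)` gives the input a distance-based tree method would need; the corrected distance is always at least as large as the raw proportion of differences.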
7
Gathering data
• At the moment, data should be user supplied
• In the future, data gathering should be based on taxonomic, geographic, and trait (i.e. gene) availability
• I assume data input will be possible from the taxonomic service set
8
Best Practice and Robust Workflow
• BioVeL would like to promote interdisciplinary work by offering robust workflows, so that scientists from other fields can use the state-of-the-art methods of a given discipline
• Phylogenetic inference is easy to misuse, and best practice is often not implemented
9
Phylogenetic inference pitfall – I
• Dependent on the homology calls of the alignment
– Workflow that will give conservative quality scores for single sites
• Highly dependent on the model (and on the prior, if Bayesian):
– Need to test absolute fit of model to data
– Need to compare models
– Need to help user in describing the model
• Difference between gene and species tree:
– Use species phylogeny tools
• Check for paralogy
– Homogeneity of model and check several gene phylogenies
• Is the numerical estimation of parameters satisfactory?
– Test of convergence of MCMC/MC
10
Phylogenetic inference pitfall – II
• Prior on branches is still very problematic
• Not really possible to produce a robust workflow with Bayesian inference for branch length estimation (molecular clock)
• Demographically explicit models are probably less problematic, because they try to tackle the problem explicitly
11
Phylogenetic inference pitfall – III: Comparing models
• Comparing models with Akaike is not appropriate under a Bayesian framework; it is probably the state of the art for maximum likelihood, but gives only a relative evaluation
• Good estimates of the Bayes factor are difficult to obtain and not yet standard (see the Phycas and PhyloBayes implementations), and it is still a relative evaluation
• The posterior predictive test and the L statistic seem to be more robust tests, applicable to both maximum likelihood and Bayesian inference
Using mixed models?
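The relative nature of an Akaike comparison can be sketched in a few lines. The log-likelihoods below are hypothetical numbers chosen for illustration, and the parameter counts cover only the substitution model (branch lengths are assumed to be shared by all models on the same tree).

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def rank_models(fits):
    """fits maps model name -> (lnL, k). Returns (name, AIC, delta) sorted by AIC."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores.values())
    rows = [(name, score, score - best) for name, score in scores.items()]
    return sorted(rows, key=lambda row: row[1])

# Hypothetical maximum-likelihood fits of three nucleotide models to one alignment.
FITS = {
    "JC": (-1250.0, 0),   # no free substitution parameters
    "HKY": (-1210.0, 4),  # kappa + 3 free base frequencies
    "GTR": (-1208.5, 8),  # 5 free exchange rates + 3 free base frequencies
}

for name, score, delta in rank_models(FITS):
    print(f"{name}: AIC={score:.1f} delta={delta:.1f}")
```

The delta column only ranks the candidates against each other; the best-ranked model can still fit the data poorly in absolute terms, which is the limitation raised in the first bullet.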
12
Services and Workflow
13
Robust Alignment
14
Alignment WF for coding sequences
• Given a set of nucleotide coding sequences
– Perform all possible translations, changing frame and genetic code
– Perform the gene homology call with HMMsearch on PFAMdb and find the frame and compatible genetic codes
– Align the proteins on the protein profile (HMMalign), obtaining site quality scores
– Guide the DNA alignment on the protein alignment and import the quality scores
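Two of the steps above, translating in every frame and genetic code and then guiding the DNA alignment on the protein alignment, can be sketched as follows. This is an illustration under assumptions: only the three forward frames and two NCBI translation tables are covered, the homology search and quality scores are left out, and the function names are hypothetical.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# NCBI translation tables 1 and 2, amino acids listed in TCAG codon order.
GENETIC_CODES = {
    "standard": "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    "vertebrate_mito": "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}

def translate(dna, frame=0, code="standard"):
    """Translate one forward frame; codons with unknown bases become 'X'."""
    table = dict(zip(CODONS, GENETIC_CODES[code]))
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)

def all_translations(dna):
    """Every (genetic code, frame) combination for the forward strand."""
    return {
        (code, frame): translate(dna, frame, code)
        for code in GENETIC_CODES
        for frame in range(3)
    }

def thread_dna(aligned_protein, dna, frame=0):
    """Lay codons under an aligned protein: each protein gap becomes '---'."""
    out, pos = [], frame
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)
```

For example, `translate("ATGGCCTGA")` ends in a stop under the standard code but reads through under the vertebrate mitochondrial code, which is the kind of signal that lets the homology call pick compatible genetic codes.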
15
Other Alignment WF planned
• Generic = Muscle + Gblocks
• RNA = Infernal (covariance models, the RNA counterpart of profile HMMs)
• …
16
Phylogenetic inference
17
Different problems
• Access the correct software and approach for the question
• Describe the model in the input file
• Check for convergence
• Evaluate model
19
Select software
• Depending on the divergence time, and on whether the history being reconstructed is within or between species, different simplifying assumptions can be used in the model
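[Figure: software choice along a Divergence axis, with the species barrier marked. Problems labelled on the figure: signal saturation leading to LBA and heterogeneity of rates; mismatch between gene and species tree; demographic complexity. Software placed on the figure: Beast?, PhyloBayes?, TNT?, RAxML, MrBayes, MrBayes+Best, Garli?]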
20
Oh Evolutionary Model Description Language, where art thou?
• Our first user interface is based on the MrBayes nexus description
• The HyPhy batch language is very rich, but has no priors
• BEAST XML input file …
• …
21
Details in the model description
[Diagram: evolutionary model]
• Alignment (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG) divided into groups of sites: S1, S2, S3
• Groups of branches: B1, B2, B3, B4, B5
• Topology1, Topology2
• Each group carries an evolutionary model, e.g.
– Transition matrix: e.g. GTR, mtREV, HKY
– BaseFreq: e.g. equal, empirical, estimate
– Site Var: e.g. equal, gamma, ..
• Relations between groups: B1 = a * B3; B1 <- demographic/geographic model X
22
Our interface for model
[Diagram: evolutionary model]
• Alignment (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG) divided into groups of sites: S1, S2, S3
• Groups of branches: B1, B3
• Topology1, Topology2, …
• Each group carries an evolutionary model, e.g.
– Transition matrix: e.g. GTR, mtREV, HKY
– BaseFreq: e.g. equal, empirical, estimate
– Site Var: e.g. equal, gamma, ..
• Relations between groups: B1 = a * B3; Topology1 == Topology2 or Topology1 != Topology2
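A model description of this kind could be translated into a MrBayes block roughly as below. This is a sketch, not the interface itself: the site boundaries and the dictionary layout are hypothetical, only nucleotide matrices are mapped, relations between branch groups and topologies are not covered, and the generated block has not been run through MrBayes here.

```python
# Mapping from the slide's vocabulary to MrBayes settings.
NST = {"GTR": 6, "HKY": 2}
FREQ_PRIOR = {
    "equal": "fixed(equal)",
    "empirical": "fixed(empirical)",
    "estimate": "dirichlet(1,1,1,1)",
}

# Hypothetical description of three groups of sites in a 7-column alignment.
EXAMPLE_GROUPS = [
    {"name": "S1", "sites": "1-2", "matrix": "GTR", "basefreq": "estimate", "sitevar": "gamma"},
    {"name": "S2", "sites": "3-5", "matrix": "HKY", "basefreq": "empirical", "sitevar": "equal"},
    {"name": "S3", "sites": "6-7", "matrix": "HKY", "basefreq": "equal", "sitevar": "gamma"},
]

def mrbayes_block(site_groups):
    """Render one charset, lset and prset line per group of sites."""
    lines = ["begin mrbayes;"]
    for group in site_groups:
        lines.append(f"  charset {group['name']} = {group['sites']};")
    names = ", ".join(group["name"] for group in site_groups)
    lines.append(f"  partition site_groups = {len(site_groups)}: {names};")
    lines.append("  set partition=site_groups;")
    for index, group in enumerate(site_groups, start=1):
        nst = NST[group["matrix"]]
        lines.append(f"  lset applyto=({index}) nst={nst} rates={group['sitevar']};")
        prior = FREQ_PRIOR[group["basefreq"]]
        lines.append(f"  prset applyto=({index}) statefreqpr={prior};")
    lines.append("end;")
    return "\n".join(lines)

print(mrbayes_block(EXAMPLE_GROUPS))
```

Keeping the description as plain data and rendering it per target program is one way to cope with the lack of a shared model description language: the same dictionaries could feed a second renderer for another input format.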
23
Convergence
• GeoKS for convergence of trees in MCMC (web application: http://mblabproject.it/geoks/ess_options.html)
• R pkgs (coda or boa) for convergence of continuous parameters
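For continuous parameters, one widely used check is the Gelman-Rubin potential scale reduction factor, which compares variance within and between independent chains. The sketch below is a basic textbook version written for illustration; it does not reproduce the coda or boa code, and the function name is hypothetical.

```python
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for equally long chains of one parameter."""
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean(variance(c) for c in chains)          # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: variance of chain means, scaled by n
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return (pooled / within) ** 0.5
```

Values close to 1 suggest the chains sample the same distribution, while values well above 1 indicate that the chains have not mixed. Chains with zero within-chain variance are not handled in this sketch.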
24
GEOKS
• Based on Billera’s tree space
• Compares the distribution of Billera tree distances (topology + branch length) of two clouds of trees versus a mean tree
• Second round of revision at Sys. Bio.
25
Evaluating model
Relative comparison
• Akaike Information Criterion
• L of Ibrahim (to be implemented with HyPhy)
Absolute assessment
• Posterior predictive test, to be implemented in HyPhy
Not so keen to include MrModeltest -> too much emphasis on selecting among transition matrices that are all submodels of the same GTR
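The logic of an absolute assessment with a posterior predictive test can be sketched independently of HyPhy. The code below assumes the multinomial likelihood of site patterns as the test statistic, which is one common choice rather than one named on the slide; simulating replicate alignments from the fitted model is outside the sketch, so the simulated statistics are passed in as plain numbers.

```python
import math
from collections import Counter

def multinomial_statistic(sequences):
    """Sum of n_k ln(n_k) over site patterns, minus N ln(N) for N sites."""
    columns = list(zip(*sequences))
    n_sites = len(columns)
    counts = Counter(columns)
    return sum(n * math.log(n) for n in counts.values()) - n_sites * math.log(n_sites)

def posterior_predictive_p(observed, simulated):
    """Fraction of simulated statistics at least as large as the observed one."""
    return sum(s >= observed for s in simulated) / len(simulated)
```

A p-value near 0 or 1 means the observed alignment looks unlike data generated by the model, which is an absolute statement about fit, unlike the relative ranking given by the Akaike criterion.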
26
What transition matrix?
• Nucleotide models require a 4×4 matrix
• Some RNA models require a 16×16 matrix
• Protein models require a 20×20 matrix, but often it is pre-calculated (e.g. Blosum62)
• Codon models require a 61×61 matrix. 61×61 matrices are quite time consuming for the CPU and are generally used only when the tree is known, but GPU availability makes these models more accessible.
Codon models are much more realistic for coding sequences, and they are the only way to parse the different selective forces (ω, dN, dS)
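The 61 states of a codon model and the reason they are costly can be checked with a short calculation. The sketch assumes the standard genetic code and assumes that the dominant cost grows with the cube of the number of states, as it does for a dense matrix decomposition; real run times depend on the implementation.

```python
from itertools import product

# Stop codons of the standard genetic code; they are excluded from codon models.
STOP_CODONS = {"TAA", "TAG", "TGA"}

all_codons = ["".join(triplet) for triplet in product("ACGT", repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

def relative_cost(n_states, baseline=4):
    """Cost relative to a nucleotide model, assuming cubic scaling in states."""
    return (n_states / baseline) ** 3

print(len(all_codons), len(sense_codons))
print(round(relative_cost(16)), round(relative_cost(len(sense_codons))))
```

Under that assumption a 61-state codon model costs a few thousand times more per matrix operation than a 4-state nucleotide model, which is consistent with restricting codon models to cases where the tree is already known.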
27
Ontology for phylogenetic workflow
28
Use of Ontology in Workflow
• Connect the input and output of two workflows that are semantically coherent
• Substitute services, or make them redundant, within a workflow
29
What Ontology?
• EDAM (http://edamontology.sourceforge.net/)
– Data and methods of general bioinformatics, including basic phylogeny
• CDAO (https://www.nescent.org/wg_evoinfo/CDAO)
– Data only, but highly specialized on comparative studies and phylogeny
30
What to do with it
• Annotate the input and output of services/workflows
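How such annotations could be used is sketched below. The service names and the `example:` identifiers are placeholders invented for illustration; they are not real EDAM or CDAO terms, and exact string matching stands in for proper reasoning over an ontology's class hierarchy.

```python
# Hypothetical annotations of service inputs and outputs with ontology terms.
SERVICES = {
    "align": {"input": "example:SequenceSet", "output": "example:Alignment"},
    "infer_tree": {"input": "example:Alignment", "output": "example:Tree"},
    "infer_tree_ml": {"input": "example:Alignment", "output": "example:Tree"},
    "phylo_diversity": {"input": "example:Tree", "output": "example:DiversityReport"},
}

def can_connect(upstream, downstream, services=SERVICES):
    """True when the upstream output carries the same term as the downstream input."""
    return services[upstream]["output"] == services[downstream]["input"]

def substitutes(name, services=SERVICES):
    """Other services annotated with the same input and output terms."""
    reference = services[name]
    return [other for other, ann in services.items() if other != name and ann == reference]
```

With these annotations `align` can feed `infer_tree` but not `phylo_diversity`, and `infer_tree_ml` is found as a candidate replacement for `infer_tree`.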
31
Using phylogeny
32
Getting already inferred phylogeny
• Where to find them?
– TreeBase/NESCent web services plan (https://www.nescent.org/wg/evoinfo/index.php?title=PhyloWS)
– REST service not yet there, but Phylr is a first sketch of it
• How likely is it that a phylogeny can be re-used?
– Taxon lists need to match exactly! Taxonomic services are needed to check the match, taking synonymy into account
– Possible tree operations to match the taxon list:
  • Subsetting or pruning (easy and clean)
    – Tree objects of several scripting languages can do the job
  • Patching several trees or making a supertree (difficult and choice dependent)
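Pruning is simple enough to sketch without a tree library. The representation below, a node as `(name, branch_length, children)`, and the example tree are assumptions made for illustration; tree objects in existing packages offer the same operation.

```python
def prune(node, keep):
    """Restrict a tree to the tips in `keep`; returns None if no tip remains.

    A node is (name, branch_length, children); tips have an empty children list.
    """
    name, length, children = node
    if not children:
        return node if name in keep else None
    kept = [c for c in (prune(child, keep) for child in children) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # A node left with a single child is collapsed; branch lengths add up.
        child_name, child_length, grandchildren = kept[0]
        return (child_name, length + child_length, grandchildren)
    return (name, length, kept)

def tips(node):
    """Tip names in left-to-right order."""
    name, _, children = node
    return [name] if not children else [t for child in children for t in tips(child)]

# Hypothetical four-taxon tree: ((W:1,Y:1):1,(Z:1,X:2):0.5)
TREE = ("root", 0.0, [
    ("n1", 1.0, [("W", 1.0, []), ("Y", 1.0, [])]),
    ("n2", 0.5, [("Z", 1.0, []), ("X", 2.0, [])]),
])

print(tips(prune(TREE, {"W", "Y", "X"})))
```

Summing branch lengths when a node is collapsed keeps the distances between the remaining tips unchanged, which is why pruning is the clean option compared with patching trees together.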
33
Phylogenetic Diversity
There is also one general formula that includes Rao's and Faith's phylogenetic diversity (PD) and a corrected version of Allen’s PD, which better generalizes Shannon entropy
I implemented the formula in a Python script in order to estimate phylogenetic beta diversity across communities as the mutual information of the communities
34
Phylogenetic diversity
• It was recently considered at a GEOBON meeting as an Essential Biodiversity Variable (although in a more general sense than used here)
• It allows describing the amount of variation within a sample, and also where in the tree, and by how much, samples differ
• It could be a powerful tool to summarize environmental sequencing data
35
Example across 3 localities: NI, CI, SI
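[Figure: example tree with five tips of relative abundance p1 to p5; internal branches carry the summed abundances p1+p2, p4+p3 and p4+p3+p5. A second panel shows only CI and SI.]

$$H(S)_p = -\sum_{i=1}^{2S-1} \frac{L_i}{T}\, p_i \log(p_i), \qquad T = \sum_{i=1}^{2S-1} L_i\, p_i$$

Anne Chao, et al. 2010 Phil. Trans. R. Soc. B, 365:3599-3609

$$D_{\beta\,\mathrm{Env}} = \exp\bigl(H(S) - H(S \mid E)\bigr)$$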
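The two quantities on this slide can be sketched in Python as follows. This is not the script mentioned earlier in the talk: it is an illustration that uses the Shannon sign convention (entropy is positive), assumes an ultrametric tree, weights each locality by its share of the total counts, and uses a hypothetical four-tip tree with made-up abundances.

```python
import math

def phylo_entropy(branches, tip_abundance):
    """-sum (L_i / T) p_i ln p_i over branches, with T = sum L_i p_i.

    branches: list of (length, set of descendant tips);
    tip_abundance: dict tip -> relative abundance (sums to 1).
    """
    pairs = [(length, sum(tip_abundance.get(t, 0.0) for t in tips))
             for length, tips in branches]
    total_height = sum(length * p for length, p in pairs)
    return -sum((length / total_height) * p * math.log(p)
                for length, p in pairs if p > 0)

def beta_diversity(branches, localities):
    """exp(H(S) - H(S|E)) with localities given as dict name -> tip counts."""
    grand_total = sum(sum(counts.values()) for counts in localities.values())
    pooled, h_conditional = {}, 0.0
    for counts in localities.values():
        n = sum(counts.values())
        relative = {tip: c / n for tip, c in counts.items()}
        h_conditional += (n / grand_total) * phylo_entropy(branches, relative)
        for tip, c in counts.items():
            pooled[tip] = pooled.get(tip, 0.0) + c / grand_total
    return math.exp(phylo_entropy(branches, pooled) - h_conditional)

# Hypothetical ultrametric tree ((A:1,B:1):1,(C:1,D:1):1), one entry per branch.
BRANCHES = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]

print(beta_diversity(BRANCHES, {"NI": {"A": 5, "B": 5}, "SI": {"C": 5, "D": 5}}))
```

Two localities that share no lineage give a beta diversity of about 2 (two fully distinct communities), while localities with identical composition give 1.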
36
Hypothesis of workflow on phylogenetic differentiation across localities
[Workflow diagram; boxes:]
• Define Taxonomic group
• Get Reference Sequence from NCBI/EMBL/BOLD
• Build Reference Alignment
• Clean Environmental Sequences
• Filter Locus and Taxa with HMM profile
• Add Sequence to Alignment
• Add Sequence to Phylogeny (pplacer)
• Describe region differentiation with Phylogenetic diversity
• Identify Species, alpha and beta diversity
• Phylogenetic Inference
37
Other post-phylogenetic-inference applications
• Reconstructing the past history of a given trait on a species phylogeny (e.g. R pkg ape, but BayesTraits could be more interesting, or Phylocom)
• Biogeography: comparison of phylogenies across groups of species to infer geographical barriers and events of general impact on biodiversity
• ChronoBiogeography: the same thing but with dating, to distinguish the effects of recurrent climate change
• …
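The first application, reconstructing the history of a trait on a known phylogeny, can be illustrated with Fitch parsimony. This is a deliberately simple stand-in and not the method used by the packages named above, which also offer likelihood or Bayesian reconstructions; the binary tree and trait values are hypothetical.

```python
def fitch(node, tip_states):
    """Bottom-up Fitch pass on a binary tree.

    node is a tip name (str) or a pair of child nodes.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(node, str):
        return {tip_states[node]}, 0
    left, right = node
    left_set, left_changes = fitch(left, tip_states)
    right_set, right_changes = fitch(right, tip_states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, changes
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, changes + 1

# Hypothetical species tree and trait (e.g. native vs alien range).
TREE = (("W", "Y"), ("Z", "X"))
TRAIT = {"W": "native", "Y": "native", "Z": "alien", "X": "alien"}

print(fitch(TREE, TRAIT))
```

Here the trait needs a single change on the tree and the root state is ambiguous; a trait scattered across both clades would need more changes.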
38
Acknowledgments
• CNR – ITB
– Bachir Balech
– Arianna Consiglio
– Giorgio Grillo
• INFN-IGI (Italian Grid Initiative)
– Giacinto Donvito
– Pasquale Notarangelo
Model Definition GUI
Testing Workflow
ICT infrastructure