phylogenetic service set

Post on 31-Dec-2015

Page 1: Phylogenetic  Service Set

1

Phylogenetic Service Set

Web services and workflows to infer and use phylogenetic information in biodiversity studies

Saverio Vicario CNR-ITB, Bari (Italy)

Page 2: Phylogenetic  Service Set

2

Meaning of a phylogeny

• It is a summary of the evolutionary history of a group of organisms
• Topology summarizes the relationships
• Branch lengths summarize the expected change along a given section
• The object of this description can be the full organism, a group, or a single hereditary unit (SNPs or genes)
• Any attribute of an organism that has a heritable component can be "mapped" on the tree: its history can be inferred from the phylogeny of the species/gene

Page 3: Phylogenetic  Service Set

3

Contribution to INVA

• Finding the population of origin of an invasive species and the modes of the invasion
– Phylogeography
• Detecting selection acting on alien species, or on native species after alien intervention
– Molecular Evolution
• Diagnosing an alien species from a mixed sample of individuals
– Molecular Systematics -> Barcode
• Describing the impact of aliens on a community: biodiversity profiling with phylogenetic diversity (based on species lists or environmental sequencing)

Page 4: Phylogenetic  Service Set

4

Contribution to ECOS

• Annotation of metagenomes
– Phylogenomics: comparing the divergence of different gene families within and across samples to find which metabolic pathways are more active in a given environment (Phylogenetic inference, Molecular Evolution)
– Biodiversity profiling: phylogenetic differentiation of a housekeeping gene across samples to find out the mode of ecosystem formation (Molecular Systematics -> Barcode; Phylogenetic diversity)
• Biodiversity profiling from a species list and a known phylogeny (Phylogenetic diversity)

Page 5: Phylogenetic  Service Set

5

Overall plan of phylogenetic set

Phylogenetic inference
• Align based on an HMM profile or using a scoring matrix (HMMER 3.0, Muscle)
• User interface to describe the model of substitution
• Phylogenetic inference format translator
• Infer phylogeny (MrBayes, RAxML, …)
• Assess convergence of numerical parameters and topology for MCMC/MC inference (CODA R pkg and GeoKS)
• Assess the overall goodness of fit of the phylogenetic inference with a posterior predictive test, and the relative fit with Akaike (HyPhy)

Use of phylogenetic information
• Estimate phylogenetic diversity (Phylocom, …)
• Estimate evolutionary parameters (HyPhy)

Utilities
• Add new sequences to a phylogenetic inference (Pplacer)
• Permute, resample, thin tree list (scripting with R)

Page 6: Phylogenetic  Service Set

6

Phylogenetic inference

Observe:

W AGCTGCG
X ACCGGTG
Z AGTTGTG
Y AGTTGCG

+ Evolutionary model

Guess with an inference: Tree
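To make the "observed sequences + evolutionary model" idea concrete, here is a minimal sketch that applies the simplest substitution model (Jukes-Cantor, JC69) to the four toy sequences of this slide. It only computes pairwise distances; it is not the inference pipeline of the service set, and the function names are illustrative.

```python
import math
from itertools import combinations

# Toy alignment copied from the slide.
alignment = {
    "W": "AGCTGCG",
    "X": "ACCGGTG",
    "Z": "AGTTGTG",
    "Y": "AGTTGCG",
}

def p_distance(a, b):
    """Proportion of aligned sites that differ between two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(p):
    """Jukes-Cantor correction: expected substitutions per site given p."""
    if p >= 0.75:
        return math.inf  # signal is saturated, the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Observed (p) and model-corrected (JC69) distance for every pair.
distances = {}
for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
    p = p_distance(s1, s2)
    distances[(n1, n2)] = (p, jc69_distance(p))
    print(f"{n1}-{n2}: p={p:.3f}  JC69={distances[(n1, n2)][1]:.3f}")
```

The corrected distance is always at least as large as the observed one, because the model accounts for multiple substitutions at the same site.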

Page 7: Phylogenetic  Service Set

7

Gathering data

• At the moment, the data must be supplied by the user
• In the future, data gathering should be based on taxonomic, geographic, and trait (e.g. gene) availability
• I assume that data input from the taxonomic service set will be possible

Page 8: Phylogenetic  Service Set

8

Best Practice and Robust Workflow

• BioVeL would like to promote interdisciplinary work by offering robust workflows, so that scientists from other fields can use the state-of-the-art methods of a given discipline
• Phylogenetic inference is easy to misuse, and best practice is often not implemented

Page 9: Phylogenetic  Service Set

9

Phylogenetic inference pitfall -I

• Dependent on the homology calls of the alignment
– Workflow that gives a conservative quality score for single sites
• Highly dependent on the model (and on the prior, if Bayesian):
– Need to test the absolute fit of the model to the data
– Need to compare models
– Need to help the user in describing the model
• Difference between gene tree and species tree:
– use species phylogeny tools
• Check for paralogy
– homogeneity of the model and check several gene phylogenies
• Is the numerical estimation of parameters satisfactory?
– Test of convergence of MCMC/MC

Page 10: Phylogenetic  Service Set

10

Phylogenetic inference pitfall – II

• The prior on branches is still very problematic
• It is not really possible to produce a robust workflow with Bayesian inference for branch length estimation (molecular clock)
• Demographically explicit models are probably less problematic, because they try to tackle the problem explicitly

Page 11: Phylogenetic  Service Set

11

Phylogenetic inference pitfall – III: Comparing models

• Comparing models with Akaike is not appropriate under a Bayesian framework; it is probably the state of the art for maximum likelihood, but it gives only a relative evaluation
• Good estimates of the Bayes factor are difficult to obtain and not yet standard (see the Phycas and PhyloBayes implementations), and it is still a relative evaluation
• The posterior predictive test and the L statistic seem to be more robust tests, applicable to both maximum likelihood and Bayesian inference

Using mixed models?

Page 12: Phylogenetic  Service Set

12

Services and Workflow

Page 13: Phylogenetic  Service Set

13

Robust Alignment

Page 14: Phylogenetic  Service Set

14

Alignment WF for coding sequences

• Given a set of nucleotide coding sequences
– Perform all possible translations, changing frame and genetic code
– Perform the gene homology call with HMMsearch on PFAMdb, and find the frame and the compatible genetic codes
– Align the protein sequences on the protein profile (HMMalign), obtaining site quality scores
– Guide the DNA alignment on the protein alignment and import the quality scores
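The first and last steps of this workflow can be sketched in a few lines of Python. The sketch below is an illustration only: it assumes the standard genetic code (NCBI translation table 1), forward frames only, and a protein alignment that already exists (in the workflow it would come from HMMalign). Function names are hypothetical, and the quality scores are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0, code=STANDARD_CODE):
    """Translate one forward frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(code.get(codon, "X") for codon in codons)

def all_forward_frames(dna, code=STANDARD_CODE):
    """Candidate translations, one per forward frame."""
    return {frame: translate(dna, frame, code) for frame in range(3)}

def thread_dna_on_protein(aligned_protein, dna, frame=0):
    """Copy the gaps of a protein alignment row onto its DNA, codon by codon."""
    out, pos = [], frame
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")          # one protein gap = three nucleotide gaps
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)

dna = "ATGGCTAAA"
print(all_forward_frames(dna))              # frame 0 reads M, A, K
print(thread_dna_on_protein("M-AK", dna))   # ATG---GCTAAA
```

Trying other genetic codes would amount to passing a different `code` dictionary; choosing among the candidate translations is the job of the HMMsearch step.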

Page 15: Phylogenetic  Service Set

15

Other Alignment WF planned

• Generic = Muscle + Gblocks
• RNA = Infernal (HMM for RNA)
• …

Page 16: Phylogenetic  Service Set

16

Phylogenetic inference

Page 17: Phylogenetic  Service Set

17

Different problems

• Access the correct software and approach for the question
• Describe the model in the input file
• Check for convergence
• Evaluate model

Page 18: Phylogenetic  Service Set

19

Select software

• Depending on the divergence time, and on whether the history being reconstructed is within or between species, different simplifying assumptions can be used in the model

[Figure: candidate software placed along a divergence axis that crosses the species barrier. Problems marked on the axis: demographic complexity; mismatch between gene and species tree; signal saturation leading to LBA and heterogeneity of rates. Software shown: Beast?, PhyloBayes?, TNT?, RAxML, MrBayes, MrBayes+Best, Garli?]

Page 19: Phylogenetic  Service Set

20

Oh Evolutionary Model Description Language, where art thou?

• Our first user interface is based on the MrBayes nexus description
• The HyPhy batch language is very rich, but has no priors
• BEAST XML input file …
• …

Page 20: Phylogenetic  Service Set

21

Details in the model description

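[Figure: schema of an evolutionary model description. The alignment (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG) is split into groups of sites (S1, S2, S3); the tree is split into groups of branches (B1 to B5); different topologies are possible (Topology1, Topology2).]

Each evolutionary model (Evol) combines:
• Transition matrix, e.g. GTR, mtREV, HKY
• Base frequencies (BaseFreq), e.g. equal, empirical, estimated
• Site variation (Site Var), e.g. equal, gamma, ..

Parameters can be linked, e.g.:
• B1 = a * B3
• B1 <- demographic/geographic model X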

Page 21: Phylogenetic  Service Set

22

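Our interface for the model

[Figure: schema of the model description offered by the interface. The alignment (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG) is split into groups of sites (S1, S2, S3); branches are grouped (B1, B3); several topologies are possible (Topology1, Topology2, …).]

Each evolutionary model combines:
• Transition matrix, e.g. GTR, mtREV, HKY
• Base frequencies (BaseFreq), e.g. equal, empirical, estimated
• Site variation (Site Var), e.g. equal, gamma, ..

Available constraints:
• B1 = a * B3
• Topology1 == Topology2 or Topology1 != Topology2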
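The elements of the model description on this slide (groups of sites, groups of branches, a transition matrix, base frequencies and site variation per group, plus constraints) could be held in a small data structure before being translated to a MrBayes, HyPhy or BEAST input file. The sketch below is only an illustration of that idea; every class and field name is hypothetical, and the site ranges are invented for the seven-column toy alignment.

```python
from dataclasses import dataclass, field

@dataclass
class SubstitutionModel:
    transition_matrix: str = "GTR"   # e.g. GTR, mtREV, HKY
    base_freq: str = "estimated"     # equal, empirical or estimated
    site_var: str = "gamma"          # equal or gamma

@dataclass
class ModelDescription:
    site_groups: dict      # group name -> (first, last) alignment column, 1-based
    branch_groups: dict    # group name -> list of branch labels
    models: dict           # site-group name -> SubstitutionModel
    same_topology: bool = True                        # Topology1 == Topology2
    constraints: list = field(default_factory=list)   # e.g. "B1 = a * B3"

    def validate(self):
        """Check that every group of sites has a model assigned."""
        missing = sorted(set(self.site_groups) - set(self.models))
        if missing:
            raise ValueError(f"no model for site groups: {missing}")
        return True

# Hypothetical description of the toy alignment (7 columns, 3 site groups).
description = ModelDescription(
    site_groups={"S1": (1, 2), "S2": (3, 5), "S3": (6, 7)},
    branch_groups={"B1": ["W", "Y"], "B3": ["X", "Z"]},
    models={
        "S1": SubstitutionModel("HKY", "empirical", "equal"),
        "S2": SubstitutionModel(),
        "S3": SubstitutionModel("GTR", "estimated", "gamma"),
    },
    constraints=["B1 = a * B3"],
)
print(description.validate())
```

A translator for each target program would then only need to read this one structure, which is the role an evolutionary model description language would play.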

Page 22: Phylogenetic  Service Set

23

Convergence

• GeoKS for convergence of trees in MCMC (web application: http://mblabproject.it/geoks/ess_options.html)

• R pkgs (Coda or Boa) for convergence of continuous parameters
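For continuous parameters, one widely used diagnostic of this kind is the Gelman-Rubin potential scale reduction factor, which compares the variance within chains to the variance between chains. The sketch below is a simplified, standard-library version written for illustration; it is not the CODA or Boa code, and its values may differ slightly from what those packages report.

```python
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Simplified Gelman-Rubin statistic for one parameter.

    chains: list of equally long lists of samples, one list per MCMC run.
    Values close to 1 suggest that the runs sample the same distribution.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return (pooled / within) ** 0.5

# Two runs stuck in different regions give a value far above 1.
print(potential_scale_reduction([[0, 1, 0, 1], [10, 11, 10, 11]]))
```

The same check would be repeated for every continuous parameter of the model (rates, base frequencies, tree length), while the topology needs a tree-space method such as GeoKS.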

Page 23: Phylogenetic  Service Set

24

GEOKS

• Based on Billera’s tree space

• Compares the distribution of Billera tree distances (topology + branch lengths) of two clouds of trees against a mean tree

Second round of revision Sys. Bio.

Page 24: Phylogenetic  Service Set

25

Evaluating model

Relative comparison
• Akaike Information Criterion
• L of Ibrahim (to be implemented with HyPhy)

Absolute assessment
• Posterior predictive test, to be implemented in HyPhy

Not so keen to include MrModeltest: it puts too much emphasis on selecting among transition matrices that are all submodels of the same GTR
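As a reminder of what the relative comparison involves, the Akaike Information Criterion is AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood. The sketch below ranks three nested substitution models; the log-likelihood values are invented for illustration only, and k counts substitution parameters alone, assuming all models are fitted on the same tree.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# (maximized lnL, free substitution parameters); lnL values are made up.
fits = {
    "JC69": (-1050.0, 0),
    "HKY": (-1030.0, 4),   # kappa + 3 free base frequencies
    "GTR": (-1028.5, 8),   # 5 free exchange rates + 3 free base frequencies
}

scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
best = min(scores, key=scores.get)
delta = {name: score - scores[best] for name, score in scores.items()}

for name in sorted(delta, key=delta.get):
    print(f"{name}: AIC={scores[name]:.1f}  delta={delta[name]:.1f}")
```

The output only says which model is preferred among those compared; it says nothing about whether the best one fits the data well, which is why an absolute test is listed separately.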

Page 25: Phylogenetic  Service Set

26

What transition matrix?

• Nucleotide models require a 4×4 matrix
• Some RNA models require a 16×16 matrix
• Protein models require a 23×23 matrix, but they are often pre-calculated (e.g. Blosum62)
• Codon models require a 61×61 matrix. 61×61 matrices are quite time-consuming for the CPU and are generally used only when the tree is known, but GPU availability makes these models more accessible

Codon models are much more realistic for coding sequences, and are the only way to parse the different selective forces (ω, dN, dS)
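The 61 in the codon model comes from the 64 possible codons minus the three stop codons of the standard genetic code. The short sketch below checks that count and compares the number of matrix entries for the state spaces listed above; it assumes the standard code (NCBI translation table 1), and other genetic codes can have a different number of sense codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Sense codons are the states of a codon substitution model.
sense_codons = [codon for codon, aa in CODE.items() if aa != "*"]
stop_codons = sorted(codon for codon, aa in CODE.items() if aa == "*")

state_spaces = {
    "nucleotide": 4,
    "RNA doublet": 16,
    "codon": len(sense_codons),
}
matrix_entries = {name: n * n for name, n in state_spaces.items()}

print(stop_codons)       # the three stop codons excluded from the model
print(matrix_entries)    # entries of the rate matrix for each state space
```

The codon matrix has more than two hundred times as many entries as the nucleotide one, which is where the computational cost mentioned on the slide comes from.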

Page 26: Phylogenetic  Service Set

27

Ontology for phylogenetic workflow

Page 27: Phylogenetic  Service Set

28

Use of Ontology in Workflow

• Connect the input and output of two workflows that are semantically coherent
• Substitute services within a workflow, or make them redundant

Page 28: Phylogenetic  Service Set

29

What Ontology?

• EDAM (http://edamontology.sourceforge.net/)
– Data and methods of general bioinformatics, including basic phylogeny
• CDAO (https://www.nescent.org/wg_evoinfo/CDAO)
– Data only, but highly specialized on comparative studies and phylogeny

Page 29: Phylogenetic  Service Set

30

What to do with it

• Annotate the input and output of services/workflows
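A minimal sketch of how such annotations could be used: each service declares the semantic type of its input and output, and two services can be chained when the upstream output is the same as, or a subtype of, the downstream input. The term labels and the small hierarchy below are placeholders invented for the example; they are not real EDAM or CDAO identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    name: str
    input_type: str
    output_type: str

# Hypothetical "is a" hierarchy standing in for an ontology.
IS_A = {
    "nucleotide_alignment": "sequence_alignment",
    "protein_alignment": "sequence_alignment",
    "sequence_alignment": "data",
    "phylogenetic_tree": "data",
}

def is_subtype(term, ancestor):
    """True if term equals ancestor or reaches it by following IS_A links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False

def can_connect(upstream, downstream):
    """Semantic coherence check between two annotated services."""
    return is_subtype(upstream.output_type, downstream.input_type)

aligner = Service("align", "sequence_set", "nucleotide_alignment")
tree_builder = Service("infer_tree", "sequence_alignment", "phylogenetic_tree")
diversity = Service("phylo_diversity", "phylogenetic_tree", "diversity_profile")

print(can_connect(aligner, tree_builder))   # alignment feeds tree inference
print(can_connect(aligner, diversity))      # an alignment is not a tree
```

The same check also identifies interchangeable services: two services with compatible input and output annotations could substitute for each other inside a workflow.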

Page 30: Phylogenetic  Service Set

31

Using phylogeny

Page 31: Phylogenetic  Service Set

32

Getting already inferred phylogeny

• Where to find them?
– TreeBase/NESCent web services plan (https://www.nescent.org/wg/evoinfo/index.php?title=PhyloWS)
– The REST service is not there yet, but Phylr is a first sketch of it
• How likely is it that a phylogeny can be re-used?
– Taxon lists need to match exactly! Taxonomic services are needed to check the match, taking synonymy into account
– Possible tree operations to match the taxon list:
• Subsetting or pruning (easy and clean): the tree objects of several scripting languages can do the job
• Patching several trees or making a supertree (difficult and choice-dependent)
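Pruning is indeed simple enough to sketch with a bare nested-tuple tree. The example below is an illustration, not code from the service set: the topology is hypothetical and branch lengths are ignored (a real tree library would add up the lengths of the branches that get merged).

```python
def prune(tree, keep):
    """Restrict a tree to the taxa in keep.

    tree: a leaf name (str) or a tuple of subtrees.
    Returns the pruned tree, or None if no requested taxon is below it.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(sub, keep) for sub in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]      # collapse a node left with a single child
    return tuple(children)

def to_newick(tree):
    """Topology-only Newick string, without the trailing semicolon."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"

# Hypothetical topology for four taxa.
full_tree = (("W", "Y"), ("Z", "X"))
subset = prune(full_tree, {"W", "Z", "X"})
print(to_newick(subset) + ";")   # (W,(Z,X));
```

The taxon names passed in `keep` must match the tip labels exactly, which is where the taxonomic services and synonymy checks come in.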

Page 32: Phylogenetic  Service Set

33

Phylogenetic Diversity

There is also one general formula that includes Rao's and Faith's phylogenetic diversity (PD), and a corrected version of Allen's PD that better generalizes Shannon entropy

I implemented the formula in a Python script in order to estimate phylogenetic beta diversity across communities as the mutual information of the communities

Page 33: Phylogenetic  Service Set

34

Phylogenetic diversity

• It was recently considered as an Essential Biodiversity Variable at a GEO BON meeting (although in a more general sense than the one used here)

• It describes the amount of variation within a sample, and also where in the tree, and by how much, samples are differentiated

• It could be a powerful tool to summarize environmental sequencing data

Page 34: Phylogenetic  Service Set

35

Example across 3 localities: NI, CI, SI

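[Figure: example tree with five tips. Terminal branches carry the relative abundances p1 to p5; internal branches carry the summed abundances of their descendants (p1+p2, p4+p3, p4+p3+p5). One annotation reads "Only CI and SI".]

$$H(S)_p = -\sum_{i=1}^{2S-1} \frac{L_i}{T}\, p_i \log(p_i)$$

$$T = \sum_{i=1}^{2S-1} L_i\, p_i$$

where L_i is the length of branch i, p_i is the total relative abundance descending from branch i, and S is the number of species.

Anne Chao, et al. 2010 Phil. Trans. R. Soc. B, 365:3599-3609

$$D_{\beta Env} = \exp\big(H(S) - H(S \mid E)\big)$$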
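The entropy formula of this slide can be evaluated directly once every branch is described by its length L_i and the relative abundance p_i descending from it. The sketch below is a minimal illustration, not the author's script: the branch lengths and the equal abundances are invented, and only the tree shape follows the figure (p1+p2, p4+p3 and p4+p3+p5 as internal branches).

```python
import math

def mean_branch_length(branches):
    """T = sum of L_i * p_i over all branches."""
    return sum(length * p for length, p in branches)

def phylogenetic_entropy(branches):
    """H = -sum of (L_i / T) * p_i * log(p_i) over all branches."""
    t = mean_branch_length(branches)
    return -sum((length / t) * p * math.log(p) for length, p in branches if p > 0)

# Hypothetical ultrametric tree of height 2; each tip has abundance 0.2.
# Each entry is (branch length L_i, abundance p_i below the branch).
branches = [
    (1.0, 0.2), (1.0, 0.2),   # tips 1 and 2
    (0.5, 0.2), (0.5, 0.2),   # tips 3 and 4
    (1.0, 0.2),               # tip 5
    (1.0, 0.4),               # internal branch p1+p2
    (0.5, 0.4),               # internal branch p4+p3
    (1.0, 0.6),               # internal branch p4+p3+p5
]

t = mean_branch_length(branches)
h = phylogenetic_entropy(branches)
print(f"T={t:.3f}  H={h:.3f}  exp(H)={math.exp(h):.3f}")
```

With a star tree (all branches of equal length, no internal branches) the function reduces to the ordinary Shannon entropy of the abundances, which is the sense in which the formula generalizes it.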

Page 35: Phylogenetic  Service Set

36

Hypothesis of workflow on phylogenetic differentiation across localities

[Figure: workflow diagram with the following boxes]

• Define taxonomic group
• Get reference sequences from NCBI/EMBL/BOLD
• Build reference alignment
• Clean environmental sequences
• Filter locus and taxa with HMM profile
• Add sequences to alignment
• Add sequences to phylogeny (pplacer)
• Describe region differentiation with phylogenetic diversity
• Identify species, alpha and beta diversity
• Phylogenetic inference

Page 36: Phylogenetic  Service Set

37

Other post phylogenetic inference application

• Reconstructing the past history of a given trait on a species phylogeny (e.g. the R pkg ape, although BayesTraits could be more interesting, or phylocom)

• Biogeography: comparison of phylogenies across groups of species to infer geographical barriers and events of general impact on biodiversity

• Chronobiogeography: the same, but with dating, to distinguish the effects of recurrent climate change

• …

Page 37: Phylogenetic  Service Set

38

Acknowledgments

• CNR – ITB
– Bachir Balech
– Arianna Consiglio
– Giorgio Grillo
• INFN-IGI (Italian Grid Initiative)
– Giacinto Donvito
– Pasquale Notarangelo

Model Definition GUI

Testing Workflow

ICT infrastructure