-
7/28/2019 Workshop Tips to align DNA sequences and build phylogenetic trees
Tips to align DNA sequences and build phylogenetic trees
A QCBS workshop - Annie Archambault
July 2013
-
Intro
Level and content: practical
How to get sequences
How to align
How to build a tree
Your expectations
Tomorrow: Your turn
Participants
Jorge Ramirez
Pedram Samani
Genevieve Guay
Annie Archambault
Nomie Blanchet
Ariane Pelletier
-
Get sequences
In GenBank: the database
Use the Nucleotide, the Gene, or the Population (PopSet) database
https://www.ncbi.nlm.nih.gov/nuccore/advanced
https://www.ncbi.nlm.nih.gov/popset/
-
Get the sequences
Use the BLAST similarity search
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Use automation if many similar searches
e.g. http://qcbs.ca/wiki/commandline_remote_blast
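One way to automate many similar searches is to generate one remote BLAST+ command per query file. The sketch below only builds the command lines. It assumes the NCBI BLAST+ program `blastn` is installed; the query file names, the `nt` database and the e-value cutoff are placeholders for illustration, not values from the workshop.

```python
import shlex


def remote_blast_command(query_fasta, out_file, db="nt", evalue=1e-10):
    """Build a blastn command that sends the search to the NCBI servers."""
    return [
        "blastn",
        "-query", query_fasta,    # FASTA file with the query sequence(s)
        "-db", db,                # database name (assumed here: nt)
        "-remote",                # run the search at NCBI, not locally
        "-evalue", str(evalue),   # hypothetical cutoff
        "-outfmt", "6",           # tabular output
        "-out", out_file,
    ]


# Hypothetical query files, one search per file.
queries = ["sample1.fasta", "sample2.fasta"]
commands = [
    remote_blast_command(q, q.replace(".fasta", "_hits.tsv")) for q in queries
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.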
-
Get the sequences
Alternative: in reference databases (GenBank is not error-free) [...] organisms
SILVA http://www.arb-silva.de : aligned ribosomal RNA (16S/18S, SSU; 23S/28S, LSU) gene sequences from the Bacteria, Archaea and Eukaryota
RDP http://rdp.cme.msu.edu/ : The Ribosomal Database Project. Bacterial and Archaeal 16S rRNA sequences
Greengenes http://greengenes.lbl.gov : 16S rRNA gene sequence alignment
UNITE http://unite.ut.ee/ : reference records of ITS sequences for ectomycorrhizal (ECM) fungi
Barcode of Life Data System (BOLD) http://www.boldsystems.org/ : animal mitochondrial cytochrome c oxidase I (COI)
Protist Ribosomal Reference Database http://ssu-rrna.org/ : un[...] eukaryotes Small SubUnit rRNA (18S)
-
Practical steps for alignment and [...]
Align sequences with automated programs
Test a few algorithms
Adjust alignment by eye - Controversial
Trim (or exclude) ends, and regions with alignment you don't [...]
Identify the best-fit model for your data
Build trees (typically ML or Bayesian)
Share data, codes and models
Account for uncertainty
Branch support (Bootstrap or Bayesian trees)
Inferring evolutionary forces (e.g. positive selection)
-
>Protea_witzenbergiana Some description here
CGCGAGAAGTCCACTGAACCTTATCATTTAGAGGAA
TTCCGTAGGTGAACCTGCGGAAGGATCATTGTCGAT
CCCGCGAACACGTCGAACGGTGACC
>Protea_wentzeliana Different description here
CGCGAGAAGTCCACTGAACCTTATCATTTAGAGGAA
TTCCGTAGGTGAACCTGCGGAAGGATCATTGTCGAT
CCCGCGAACACGTCGAACGGTGACC
>Protea_vogtsiae Other here
CGCGAGAAGTCCACTGAACCTTATCATTTAGAGGAA
TTCCGTAGGTGAACCTGCGGAAGGATCATTGTCGAT
CCCGCGAACACGTCGAACGGTGACCGGGGGGCGA
>Protea_witzenbergiana Some description here
----C-------GCGA--------
GAAGTCCACTGAACCTTATCATTTAGAGGAAGGAGA
TAGGTGAACCTGCGGAAGGATCATTGTCGATGCCTG
GAACACGTC-G-AACGGT-GACC-
>Protea_wentzeliana Different description here
----C-------GCGA--------
GAAGTCCACTGAACCTTATCATTTAGAGGAAGGAGA
TAGGTGAACCTGCGGAAGGATCATTGTCGATGCCTG
GAACACGTC-G-AACGGT-GACC-
>Protea_vogtsiae Other here
----C-------GCGA--------
GAAGTCCACTGAACCTTATCATTTAGAGGAAGGAGA
TAGGTGAACCTGCGGAAGGATCATTGTCGATGCCTG
GAACACGTC-G-AACGGT-GACC-GGGG-G-G-CGA-G-TG----------
Fasta format
All your sequences go into one file, in the fasta format
A text file
Greater-than sign (>)
Sequence name + description
Return
The sequence (with - or not)
Return
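The record layout above (greater-than sign, name and description, return, sequence, return) can be read with a few lines of Python. This is a minimal sketch rather than a full parser; the function name `read_fasta` is made up for illustration, and the example sequences are shortened, hypothetical versions of the ones shown on this slide.

```python
def read_fasta(text):
    """Return a list of (name, description, sequence) tuples from FASTA text."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence lines (gaps "-" are kept) are joined into one string.
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


example = """>Protea_vogtsiae Other here
CGCGAGAAGTCC
ACTGAACC
>Protea_wentzeliana Different description here
CGCG-GAAGTCC
"""
records = read_fasta(example)
for name, description, sequence in records:
    print(name, len(sequence))
```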
-
Let's align!
A plethora of alignment algorithms
Very different calculation methods
Will try only five (listed on the wiki):
Clustal
Muscle
PRANK
SATé
FAST
JalView, SuiteMSA, BioEdit (viewer)
On 3 datasets
PR10_fabaceae_11seq
ITS_oxytropis_84seq
Fungal_refseq_ITS_301seq
-
Why care about alignment?
To have a reliable tree!
Because your matrix will be public
http://treebase.org
http://datadryad.org/ ($80)
-
Let's align!
Questions to ponder
Would you trust a[...] alignments?
Does one algorithm outperform all th[...] circumstances?
Which algorithm [...] likely to use for your projects?
-
Let's align!
Advantages, by program
Clustal: Easy. Used everywhere.
Muscle: Fast distance estimation, progressive alignment, refinement by restricted partitioning.
Prank: Corrects for insertions and deletions. Good for codons.
SATé: Co-estimates alignments and trees. Runs relatively fast. Divide-and-conquer realignment.
-
Let's align!
Results by dataset and program (Clustal, Muscle, Prank, SATé)
PR10_fabaceae_11seq = With long gaps + CDS
  Clustal: 8240 bp. Does not locate the cds regions.
  Muscle: 8560 bp. Found similar cds region of a divergent gene.
  Prank: 9470 bp. Low confidence in cds region of a divergent gene.
  SATé: 8370 bp. Found the similar cds region of the divergent gene.
ITS_oxytropis_84seq = highly conserved
  Clustal: good. Muscle: good. Prank: good. SATé: good.
Fungal_refseq_ITS_301seq = Highly diverging
  Clustal: 1070 bp. Is not careful in aligning low similarity areas.
  Muscle: 1850 bp. Aligns together stretches with low similarity. Finds conserved regions in the middle.
  Prank: 21630 bp. Finds conserved region in the middle.
  SATé: 1420 bp. Aligns together stretches with low similarity. Finds conserved regions in the middle.
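The alignment lengths compared above are easy to compute for your own output, together with the share of gap characters, which tends to grow when a program opens many gaps. The sketch below is one possible way to do it; the function name and the toy sequences are hypothetical, and it assumes the aligned sequences are already loaded in a dictionary.

```python
def alignment_summary(aligned):
    """Return (alignment length, fraction of gap characters).

    `aligned` maps sequence names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    length = lengths.pop()
    total_cells = length * len(aligned)
    gaps = sum(seq.count("-") for seq in aligned.values())
    return length, gaps / total_cells


# Toy alignment, not one of the workshop datasets.
toy = {
    "taxonA": "ACG-TT--",
    "taxonB": "ACGGTTAA",
    "taxonC": "AC--TTAA",
}
length, gap_fraction = alignment_summary(toy)
print(f"{length} bp, {gap_fraction:.1%} gaps")
```

Running it on the output of each program for the same dataset gives a quick first comparison, before looking at the alignments by eye.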
-
Phylogenetic trees - Basics
Branch = edge: a lineage through time
Node: branching of a lineage into two, by speciation or gene duplication
Internal node
Leaf (terminal node) = Tip (OTU)
Branch length: typically the number of substitutions per site; is often not constant
Outgroup: used to find the root
Topology: The branching pattern of the tree
Terminology
It is a drawing = can be re-rooted, branches swapped
-
Phylogenetic trees - Basics
Useful in :
Gene duplication events
Recombination or horizontal gene transfer
Variation of selective pressures and adaptive evolution
Divergence times between species
Origin of epidemics
Host-parasite cospeciation events
Genealogies of somatic cells in cancer
-
Phylogenetic trees
Yang, Z., and B. Rannala. 2012. Molecular phylogenetics: principles and practice. Nature Reviews Genetics 13: 303-314. http://www.nature.com/nrg/journal/v13/n5/full/nrg3186.html
Hall, Barry G. Phylogenetic Trees Made Easy: A How-To Manual, Third Edition.
Aris-Brosou, S., and X. Xia. 2008. Phylogenetic Analyses: A Toolbox Expanding towards Bayesian Methods. International Journal of Plant Genomics. http://www.hindawi.com/journals/ijpg/2008/683509/
Roquet, C., W. Thuiller, and S. Lavergne. 2013. Building megaphylogenies for macroecology: taking up the challenge. Ecography 36: 013-026. http://onlinelibrary.wiley.com/doi/10.1111/j.1600-0587.2012.07773.x/abstract
http://treethinkers.org/ Workshops in applied phylogenetics
http://www.molecularevolution.org/ Software description and glossary
-
Building the tree
Methods compared: Distance, Parsimony, Max. Likelihood, Bayesian
Based on
  Distance: Distance matrix
  Parsimony: Informative characters
  Max. Likelihood: All characters
  Bayesian: All characters
Approach
  Distance: Clustering
  Parsimony: Which tree explains the data with least evolutionary changes?
  Max. Likelihood: What tree and values give the highest likelihood to this alignment?
  Bayesian: What is [...] probab[...] distribu[...] based [...] data?
Score for choosing trees
  Parsimony: Steps: minimum number of changes
  Max. Likelihood: Log likelihood. A relative number, cannot compare across alignments
  Bayesian: Posterio[...] that th[...] correc[...] unders[...]
Yang, Z., and B. Rannala. 2012. Molecular phylogenetics: principles and practice. Nature Reviews Genetics 13: 303-314.
Hall, Barry G. Phylogenetic Trees Made Easy: A How-To Manual, Third Edition.
-
Building the tree
Methods compared: Distance, Parsimony, Max. Likelihood, Bayesian
When not to use
  Distance: With numerous long gaps
  Parsimony: With highly divergent sequences
  Max. Likelihood: With a model not fit to your data
  Bayesian: When y[...] Priors a[...] approp[...] data
Strength
  Distance: Quick
  Parsimony: Efficient, and generally reliable
  Max. Likelihood: Consistent, efficient. Can be used for tests of evolution
  Bayesian: Consist[...] Can b[...] tests of [...]
Weaknesses
  Distance: Sensitive to gap and to divergent sequences
  Parsimony: Calculation cannot be improved, because no model of sequence evolution
  Max. Likelihood: Computationally demanding. Depends on model of evolution.
  Bayesian: Comp[...] deman[...] Posterio[...] probab[...] be hig[...]
-
Building the tree
Programs, by method (Distance, Parsimony, Max. Likelihood, Bayesian)
Programs - commercial: PAUP*
Programs - free
  Distance: MEGA
  Parsimony: MEGA5, TNT
  Max. Likelihood: MEGA5, RAxML, GARLi, PAML, Hyphy
  Bayesian: BEAST, Phycas, BayesP[...], BUCKy, Phy[...]
Citations
  Kumar 1994 (cited 2459 times); 2001 (cited 6481 times); 2004 (cited 11433 times); 2007 (cited 19499 times); 2011 (cited 6226 times)
  RAxML, Stamatakis 2006. Cited 3536 times
  GARLi, Zwickl 2006. Cited 1574 times
  BEAST, [...] 2007. C[...]
  MrBaye[...] 2003. C[...] times; Cited 2[...]
-
Distance
            HugeFl   BigFl   MediumFl
HugeFl
BigFl       2/12
MediumFl    3/12     3/12
SmallFl     4/12     4/12    3/12
TinyFl      5/12     5/12    4/12
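Entries such as 2/12 read as "2 differing characters out of 12 compared". A minimal sketch of that count follows. The three 12-character sequences are invented so that the counts match the first rows of the matrix above; they are not the workshop data, and skipping gapped sites is an assumption of this sketch.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Return (number of differing sites, number of compared sites)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Sites with a gap in either sequence are skipped (an assumption).
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    differences = sum(1 for a, b in compared if a != b)
    return differences, len(compared)


# Hypothetical aligned sequences, 12 sites each.
toy = {
    "HugeFl": "AAAAAAAAAAAA",
    "BigFl": "AAAAAAAAAACC",
    "MediumFl": "AAAAAAAAGGCA",
}
matrix = {
    (name_a, name_b): p_distance(toy[name_a], toy[name_b])
    for name_a, name_b in combinations(toy, 2)
}
for (name_a, name_b), (diff, sites) in matrix.items():
    print(f"{name_a} vs {name_b}: {diff}/{sites}")
```

Distance methods then cluster the taxa from such a matrix, usually after correcting the raw proportions with a model of sequence evolution.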
-
Parsimony
Only informative characters [...]
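A site is parsimony-informative when at least two different character states each occur in at least two sequences; constant sites and sites where the only variants are singletons do not help choose between trees. The sketch below flags such columns. The function name and the toy alignment are hypothetical, and ignoring "-", "?" and "N" is an assumption of this sketch.

```python
from collections import Counter


def parsimony_informative_sites(aligned_seqs):
    """Return the 0-based indices of parsimony-informative columns."""
    informative = []
    for index, column in enumerate(zip(*aligned_seqs)):
        # Count character states, ignoring gaps and ambiguous bases.
        counts = Counter(base for base in column if base not in "-?N")
        shared_states = [base for base, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            informative.append(index)
    return informative


# Toy alignment: column 0 is constant, column 3 has only singletons.
toy = ["AAGT", "AAGC", "ATCA", "ATCG"]
sites = parsimony_informative_sites(toy)
print(sites)
```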
-
Max. likelihood; Bayesian; distance
Model of sequence evolution
Includes
Rates of substitution between nucleotides
More complex: 6 [...]
Rate variation among sites
e.g. codon position; gamma
Proportion of invariant sites
Which model fits your data?
jModelTest
MEGA5
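Model-testing programs such as jModelTest rank candidate models with criteria like the Akaike information criterion, AIC = 2k - 2 lnL, where k is the number of free parameters and lnL the log likelihood; the lowest AIC is preferred. The sketch below shows the arithmetic only. The log-likelihood values are invented, and the parameter counts cover the substitution model alone (branch lengths are left out, which is an assumption of this sketch).

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


# model name: (hypothetical log likelihood, free substitution parameters)
candidates = {
    "JC69": (-2450.0, 0),    # equal rates, equal base frequencies
    "HKY85": (-2391.5, 4),   # ts/tv ratio + 3 base frequencies
    "GTR": (-2389.9, 8),     # 5 relative rates + 3 base frequencies
    "GTR+G": (-2372.3, 9),   # GTR + gamma shape for among-site rate variation
}
scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("best-fit model:", best)
```

With these made-up numbers the extra parameters of GTR alone are not worth their cost compared with HKY85, while adding gamma rate variation is.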
-
Newick format
tree ML_tree = [&U](SmallFlower:0.0,TinyFlower:0.0937:0.086766,(HugeFlower:0.08978,Big):0.118638):0.228517);
tree ML_tree = [&U](SmallFlower:0.0,TinyFlower:0.0937:0.086766,(HugeFlower:0.08978,Big)[&"bootstrapproportion"=78.0"]:0.118638)[&"bootproportion"=87.0"]:0.228517);
The tree file
Newick format
Branch [...]
Branch [...]
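To show how the nested parentheses and the `name:length` pairs of a Newick string map onto a tree, here is a minimal recursive reader. It is a sketch for well-formed, plain Newick only: it does not handle bracketed annotations such as `[&U]` or quoted names, and the example tree with its branch lengths is hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = "".join(text.split())  # drop spaces and line breaks
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # skip "("
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label, then optional ":branch_length".
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        node["name"] = text[start:pos]
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("a Newick tree must end with ';'")
    return root


def tip_lengths(node):
    """Map each tip name to the length of its terminal branch."""
    if not node["children"]:
        return {node["name"]: node["length"]}
    result = {}
    for child in node["children"]:
        result.update(tip_lengths(child))
    return result


tree = parse_newick("(SmallFlower:0.10,(HugeFlower:0.09,BigFlower:0.12):0.05);")
print(tip_lengths(tree))
```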
-
Build trees, general steps
Test which model of sequence evolution fits your data
Choose method
Distance: MEGA5 - Quick, not [...] publications
Parsimony: MEGA5 - Not very popular
Max. likelihood: GARLi - Efficient
Max. likelihood: RAxML - Very large datasets, e.g. thousands of taxa + hundreds of genes. Use CAT.
Bayesian: BEAST 2.0 - To answer questions with a range of probable trees
At your com[...]
Download 2 alignments: http://qcbs.ca/wp-content/uploads/2013/06/ProteaFaurea_ITS.txt ; http://qcbs.ca/wp-content/uploads/2013/06/ProteaFaurea_trnL.txt
Analyze with RAxML on yo[...]
Use BEAST
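For the RAxML step, a command line can be assembled as below. This is a sketch under several assumptions: the executable is called `raxmlHPC` (the name varies between builds), the alignment has been converted to a PHYLIP file with the hypothetical name `ProteaFaurea_ITS.phy`, and the seed, run names and number of bootstrap replicates are arbitrary. The code only builds the commands; it does not run RAxML.

```python
import shlex


def raxml_command(alignment, run_name, model="GTRGAMMA", seed=12345, bootstraps=None):
    """Build a RAxML command line as a list of arguments."""
    cmd = [
        "raxmlHPC",
        "-s", alignment,   # input alignment
        "-n", run_name,    # suffix used for the output files
        "-m", model,       # e.g. GTRGAMMA, or GTRCAT for very large datasets
        "-p", str(seed),   # random seed for the starting parsimony trees
    ]
    if bootstraps is not None:
        # Rapid bootstrap plus best ML tree search in a single run.
        cmd += ["-f", "a", "-x", str(seed), "-#", str(bootstraps)]
    return cmd


ml_search = raxml_command("ProteaFaurea_ITS.phy", "its_ml")
with_support = raxml_command("ProteaFaurea_ITS.phy", "its_boot", bootstraps=100)
large_dataset = raxml_command("ProteaFaurea_ITS.phy", "its_cat", model="GTRCAT")
for cmd in (ml_search, with_support, large_dataset):
    print(shlex.join(cmd))
```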
-
BEAST 2.0
Start from one of the tutorials
Know the model of substitution (e.g. from jModelTest)
Carefully read the BEAST FAQ on the wiki: http://www.beast2.org/wiki/index.php/FAQ
For any problem, search and browse through the BEAST forum first: https://groups.google.com/forum/
-
Branch support
From a large set of trees (many hundreds)
Bootstrap replicates
Bayesian analysis
Compute consensus tree
Strict = clades found in 100% of the trees
Majority = clades found in at least a % of the trees, set by the user
Report on your best tree (e.g. DendroPy sumtrees: http://pythonhosted.org/DendroPy/scripts/sumtrees.html)
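The idea behind branch support and consensus trees can be sketched by counting how often each clade appears in a set of trees. The code below is a simplified illustration, not what sumtrees does internally: trees are written as nested tuples, they are treated as rooted (real bootstrap trees are usually compared through bipartitions), and the four replicate trees are invented.

```python
from collections import Counter


def tips(tree):
    """Return the frozenset of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def clades(tree):
    """Return the set of clades (tip sets of internal nodes) in one tree."""
    if isinstance(tree, str):
        return set()
    found = {tips(tree)}
    for child in tree:
        found |= clades(child)
    return found


def clade_support(replicates):
    """Return the proportion of replicate trees containing each clade."""
    counts = Counter()
    for tree in replicates:
        counts.update(clades(tree))
    return {clade: n / len(replicates) for clade, n in counts.items()}


# Four hypothetical replicate trees on the same four tips.
replicates = [
    (("HugeFl", "BigFl"), ("SmallFl", "TinyFl")),
    (("HugeFl", "BigFl"), ("SmallFl", "TinyFl")),
    ((("HugeFl", "BigFl"), "SmallFl"), "TinyFl"),
    (("HugeFl", "SmallFl"), ("BigFl", "TinyFl")),
]
support = clade_support(replicates)
# Majority rule: keep clades present in more than half of the trees.
majority = {clade for clade, value in support.items() if value > 0.5}
for clade, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), value)
```

A strict consensus keeps only the clades with a proportion of 1.0; a majority-rule consensus keeps those above the chosen threshold.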
-
Why so many programs?
Umbrella programs (commercial)
Geneious 400$
-
General recommendations
Anisimova, et al. 2013. State-of the art methodologies dictate new standards for phylogenetic analysis. BMC Evolutionary Biology 13: 161. http://www.biomedcentral.com/1471-2148/13/161/abstract
Write a question that the analysis will answer
Justify the choice of methods, test alternatives
Account for uncertainty (branch support, confidence intervals)
Share the data
-
End!