Benchmarking and comparing
software for phylogenetic analysis
from genome-scale data
COMP4560
Qiuyue Wang
U6342378
May 2019
Supervisors: Minh Bui and Yu Lin
Table of contents
Figure List
Table List
Equation List
Acknowledgments
Abstract
1. Introduction
1.1 The theoretical basis of phylogeny
1.2 Molecular data
1.2.1 Deoxyribonucleic Acid
1.2.2 Amino Acid
1.3 Phylogenetic tree
1.4 Phylogenetic tree reconstruction method
1.5 Aim
2. Background
2.1 Phylogenetic inference and tree-space search
2.1.1 Nearest Neighbor Interchange
2.1.2 Subtree-Pruning-and-Regrafting
2.1.3 Time complexity
2.2 Phylogenetic inference by maximum likelihood
2.2.1 RAxML
2.2.2 IQ-TREE
2.2.3 RAxML-NG
2.2.4 Related work
3. Method
3.1 Benchmark framework
3.2 Tree inference software
3.3 Cluster configuration
3.4 Data sets
3.5 Evaluation method
4. Results & Discussion
4.1 Maximum log-likelihood score
4.2 Running time
4.3 Parallel efficiency
4.4 Memory usage
5. Conclusion
6. References
7. Appendix
7.1 Final project description
7.2 Project contract
7.3 Artefacts
7.3.1 List of all program code files
7.3.2 Details of testing code
7.3.3 Experimental environment
7.4 README file
7.5 Maximum log-likelihood scores for all tree inferences
7.6 Runtime, memory usage and parallel efficiency for all tree inferences
Figure List
Figure 1: The same fragment of DNA alignment of different mammals, derived from a DNA dataset.
Figure 2: The same fragment of amino acid alignment of different land plants, derived from a protein dataset.
Figure 3: An example phylogenetic tree of using IQ-TREE to build a phylogenetic tree for a dataset and FigTree (v1.4.3) to visualize the tree file.
Figure 4: Simple visualization of rooted and unrooted trees.
Figure 5: Flowchart of constructing phylogenetic tree.
Figure 6: Pruning and grafting of heuristic search algorithm.
Figure 7: Simple schematic of NNI.
Figure 8: Simple schematic of SPR.
Figure 9: The overall pipeline of my benchmark framework.
Figure 10: Difference of maximum log-likelihood score to the score of best-known trees.
Figure 11: Wall-clock execution time in hours (16 threads)
Figure 12: The percentage of efficiency that runs in parallel using multiple threads.
Figure 13: Maximum memory usage in GB during the running.
Table List
Table 1: Software and related information.
Table 2: Hardware and software configuration of GDU server.
Table 3: Data sets with varying characteristics for evaluating.
Table 4: Command lines used for tree inference.
Table 5: The number of maximum log-likelihood tree inferences (out of 5) which yield the best-known tree per dataset and inference software.
Table 6: The ratio of average RAxML-NG wall-clock running time relative to RAxML and IQ-TREE.
Table 7: All likelihood scores extracted from the output files and the difference with the maximum likelihood score of the best tree.
Table 8: All running time, memory usage and parallel efficiency calculated from the output files.
Equation List
Equation 1: Formula for calculating the percentage of parallel efficiency
Equation 2: Formula for calculating theoretical maximum memory value
Acknowledgement
First of all, I would like to thank my supervisors, Minh Bui and Yu Lin. Thank you for
your careful guidance, which gave me a deeper understanding of phylogenetic analysis
and of how to evaluate software. Thank you for giving me the opportunity to practice.
Thank you for your patience whenever I was confused. Thank you for
teaching me how to take academic issues seriously.
Secondly, I would like to thank Cameron Jack from the ANU bioinformatics
consultancy. Thank you for helping me access the GDU server so that I could use its
large computing resources. Thanks to Alexey M. Kozlov, the developer of RAxML-NG, for
promptly answering my question on GitHub when I was confused about how to run the software.
I would also like to thank my colleague Yu Zhang. Thank you for your help and
encouragement on weekdays.
Thanks to all those who have helped me.
Abstract
With the rapid development of sequencing technologies in recent years, nucleotide
and amino acid sequences are being collected at an increasing pace. Biologists infer
evolutionary relationships between species (i.e. phylogenies) from aligned DNA or
protein sequences. A number of maximum-likelihood-based phylogenomic programs such
as IQ-TREE, RAxML and PhyML have emerged, raising the need to benchmark the
available approaches and thus help users make an appropriate choice. This project
designed an automated pipeline to benchmark different phylogenomic software on
different datasets. The results were evaluated through a multi-faceted assessment,
including maximum likelihood score, running speed, parallel efficiency, and memory
usage. Twenty datasets with distinct characteristics (type, length, number of taxa,
etc.) were collected to compare IQ-TREE, RAxML-NG, and RAxML.
In terms of maximum log-likelihood score, RAxML-NG performed best in general. It
found the best-known tree for fifteen of the twenty data sets, followed by IQ-TREE,
which found the best tree for twelve data sets. RAxML-NG and IQ-TREE performed well
on both DNA and protein data sets. In contrast, RAxML found the best tree for only
six data sets, five of which were protein data sets. For some taxon-rich data sets,
only RAxML-NG found the best tree, and only in one or two of the five runs; neither
IQ-TREE nor RAxML found it. Across repeated runs, IQ-TREE gave the most stable
results: it found the best tree in all five tree inferences for eleven data sets
(RAxML-NG: 9; RAxML: 5). RAxML-NG had a considerable advantage in speed. It was the
fastest software for seventeen of the twenty data sets. For the remaining three data
sets, RAxML-NG was slower than IQ-TREE or RAxML but found better ML trees. RAxML-NG
and RAxML had the best parallel efficiency, and there remains considerable room for
IQ-TREE to optimize its parallel implementation, especially when there are multiple
partitions. The memory usage of the three programs was broadly in line with
expectations, but on some huge data sets all three used more memory than expected.
This assessment helps users understand the strengths, weaknesses and performance of
each program and select the one that best suits their needs. It also gives developers
insight into how to improve their software, e.g. by improving the parallel efficiency
of partition models in IQ-TREE.
1. Introduction
1.1 The theoretical basis of phylogeny
In the mid-19th century, Charles Darwin proposed his theory of biological evolution,
now known as Darwinism (C. Darwin, 1859). He gave a systematic explanation of the
origin and development of the biological world, thus overthrowing the dominance of
the idealistic metaphysics of special creationism in biology and bringing about a
revolutionary change in the field (Mayr, 2003). The theory of a last universal common
ancestor is the theoretical basis for constructing a phylogenetic tree (Woese et al.,
1990). It states that all forms of life on Earth have a common origin. Whether
animals, plants, fungi, protists, or prokaryotes, they share a common evolutionary
history and are related to one another, whether closely or distantly. At the
molecular level, the genetic code is highly consistent across all organisms.
A significant problem in the field of phylogeny is to reconstruct the evolutionary history
of all species and to use phylogenetic trees to represent evolutionary relationships
between biological groups. It is an important part of evolutionary biology research (Nei
and Kumar, 2000). Establishing a reliable phylogenetic relationship is not only the basis
for taxonomic classification and naming, but also a prerequisite for elucidating the
origin and spread of the genus, exploring the evolution of traits, and revealing the
mechanism of species formation (Futuyma, 1998; Soltis, 2000).
Due to technical limitations, biologists in the past could only rely on the
morphological characteristics of living things to infer the genetic relationships
between species. However, these characteristics have limitations. Organisms with
very different morphology can still be related, such as whales and bats. Advances in
molecular biology have made phylogenetic reconstruction possible. Molecular data,
especially DNA alignments, are abundant and comparable between species, and can be
analyzed in a standardized way (Nei, 1987). They have become an important tool in
evolutionary biology research. The theory and methods of constructing phylogenetic
trees based on mathematics and statistics have also developed rapidly, forming a new
research field named molecular phylogeny. It refers to the use of information from
biological molecules to infer the evolutionary history of organisms, or to
reconstruct the phylogenetic relationships of biological groups.
1.2 Molecular data
1.2.1 Deoxyribonucleic Acid
DNA is a long-chain polymer composed of nucleotides (Watson and Crick, 1953). The
nitrogenous bases of the nucleotides are adenine (A), guanine (G), cytosine (C), and
thymine (T). Almost all organisms store genetic information in the DNA. Since DNA
contains genetic information, one can infer the evolutionary relationships of
organisms by comparing DNA sequences.
Usually a string of letters is used to display the primary structure of a DNA sequence
(Nei, 1987). Each letter represents a base, and the only possible letters are A, T, C, and
G. Because evolutionary rates differ greatly between DNA fragments, DNA sequences can
be compared to study the evolutionary relationships of organisms at almost all levels.
The genetic information contained in DNA sequences is important for elucidating the
evolution of multigene families and understanding adaptive evolution at the molecular
level (Nei, 1987). Figure 1 shows how the same DNA fragment differs between species.
Figure 1: The same fragment of DNA alignment of different mammals, derived from a
DNA dataset (Tarver et al., 2016).
1.2.2 Amino Acid
Amino acids are the basic units of proteins. Prior to the invention of rapid DNA
sequencing methods (Maxam and Gilbert, 1977; Sanger et al., 1977), most molecular
evolution research was based on amino acid sequences. Amino acid sequences are more
conserved than DNA sequences and therefore provide more helpful information about the
long-term evolution of genes and species (Nei and Kumar, 2000). Figure 2 shows the same
fragment of an amino acid alignment of some land plants. Different amino acids are
represented by different single letters. With this type of data, it is possible to determine
evolutionary differences between sequences and begin to construct phylogenetic trees.
Figure 2: The same fragment of amino acid alignment of different land plants, derived
from a protein dataset (Wickett et al., 2013).
1.3 Phylogenetic tree
The process of biological evolution cannot be observed directly. People can only learn
what happened in the past from relevant clues, and scientists use these clues to
establish hypotheses, models, and even the history of life. In the study of
systematic classification, the most common way to visualize evolutionary
relationships is as a phylogenetic tree (e.g. Figure 3).
Charles Darwin introduced the concept of evolutionary "trees" in his groundbreaking
work "The Origin of Species" (C. Darwin, 1859). The phylogenetic tree uses a tree-like
branch diagram to represent the relationship between various organisms. The history of
species evolution is inferred by studying biological sequences, mainly DNA sequences
and amino acid sequences.
Figure 3: An example phylogenetic tree for a dataset (Tarver et al., 2016), inferred
with IQ-TREE (1.6.10) and visualized with FigTree (v1.4.3).
The phylogenetic relationships of organisms are often represented by a rooted or unrooted
tree structure. A rooted tree contains a unique root node that acts as the most recent
common ancestor of all species in the tree. Removing the root from a rooted tree
yields an unrooted tree. An unrooted tree has no direction: evolution could have
proceeded in either direction along each branch. The internal nodes of the tree
represent evolutionary events or common ancestors in the evolutionary process. The
external nodes are also called leaf nodes and represent different species.
Figure 4: Simple visualization of (a) rooted and (b) unrooted trees. The internal nodes
represent potential common ancestors and leaf nodes A-F represent different species
such as human, chimpanzee and gorilla.
Building a phylogenetic tree involves four steps: selecting sequences from different
species, aligning the sequences, inferring phylogenetic trees, and evaluating the
phylogenetic trees. The process is shown in Figure 5. This report focuses on the last
two steps. The main task of phylogenetic tree inference is to find the optimal tree
topology and estimate the branch lengths. The related algorithms and common software
are described in detail later.
Figure 5: Flowchart of constructing phylogenetic tree.
1.4 Phylogenetic tree reconstruction method
Phylogenetic tree inference methods based on molecular data include distance-based
methods, the maximum likelihood method (Felsenstein, 1981), the maximum parsimony
method (Fitch, 1971) and the Bayesian method (Rannala and Yang, 1996; Mau and Newton,
1997). The latter three are based on discrete characters. A commonly used
distance-based approach is the Neighbor-Joining method (Saitou and Nei, 1987). It
minimizes the total branch length of the phylogenetic tree by repeatedly joining the
closest (neighboring) pair of taxonomic units. The distance-based method usually
finds not the exact minimum tree but an approximation of it. Its disadvantage is that
it is sensitive to differences in mutation rates between species. The maximum
parsimony method evaluates the candidate topologies and picks the one that requires
the smallest number of substitutions as the optimal phylogenetic tree. It assumes
that the evolutionary relationship with fewer mutations is more likely to be the true
evolutionary relationship between species (Sober, 1988). The maximum likelihood
method analyzes a given set of sequences under a specific substitution model,
maximizes the likelihood of each topology considered, and selects the topology with
the largest likelihood score as the best phylogenetic tree. Given a model of
molecular evolution, the Bayesian method uses Markov chain Monte Carlo (MCMC)
sampling to estimate the posterior probabilities of all parameters and finally
selects the tree with the highest posterior probability. Unlike the maximum
likelihood method, the Bayesian method first specifies the structure of the tree and
the evolution model, and then calculates the probability of the sequence composition
to infer the corresponding phylogenetic tree.
There are many methods for inferring a phylogenetic tree, each with its own advantages
and disadvantages. In practice, it is therefore often necessary to combine different
construction methods, chosen to suit the research question, to obtain the best results.
In general, the maximum likelihood method is considered to be more efficient than the
distance and parsimony methods when the evolution model is chosen reasonably, and
the results are in good agreement with the facts of evolutionary history (Yang and
Rannala, 2012; Whelan and Morrison, 2017). For closely related species sequences, the
maximum parsimony method is usually used; for distant species sequences, the
Neighbor-Joining method or the maximum likelihood method is generally used.
1.5 Aim
The purpose of this report is to design a framework for evaluating several commonly used
phylogenetic programs based on the maximum likelihood method, to compare
their performance, and to draw conclusions.
The specific objectives of this study are:
I. Design and develop an automated pipeline that invokes different software to infer
the phylogenetic tree for large input data sets.
II. Collect and benchmark the performance of different software from various
aspects (stability, speed, parallel efficiency, etc.).
III. Analyze the results and provide suggestions to both users and developers.
To achieve these goals, I designed an automated benchmark pipeline. When the user
inputs an alignment file and a partition file, the pipeline automatically calls each
program to infer the phylogenetic tree multiple times. After the inferences finish,
the pipeline collects the maximum log-likelihood scores, run times, parallel
efficiency and other relevant information, and generates a report containing a CSV
table of the raw data and corresponding violin plots for multifaceted assessment.
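The repeated-run part of such a pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline built in this project: the function names `run_benchmark` and `to_csv` are hypothetical, only wall-clock time and exit code are recorded, and the demo uses a placeholder command instead of real tree-inference command lines.

```python
import csv
import io
import subprocess
import sys
import time

def run_benchmark(commands, repeats=5):
    """Run every command `repeats` times and record wall-clock seconds.

    `commands` maps a program label to an argument list (hypothetical layout).
    """
    rows = []
    for program, argv in commands.items():
        for run in range(1, repeats + 1):
            start = time.perf_counter()
            result = subprocess.run(argv, capture_output=True, text=True)
            elapsed = time.perf_counter() - start
            rows.append({"program": program, "run": run,
                         "seconds": round(elapsed, 3),
                         "exit_code": result.returncode})
    return rows

def to_csv(rows):
    """Serialize the collected rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["program", "run", "seconds", "exit_code"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

if __name__ == "__main__":
    # Placeholder command: replace with the real tree-inference command lines.
    demo = {"placeholder": [sys.executable, "-c", "pass"]}
    print(to_csv(run_benchmark(demo, repeats=2)))
```

A real run would additionally parse each program's log file for the final log-likelihood and memory usage before writing the table.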
Different software may exhibit distinct performance for data sets with varying
characteristics. The numerical results and plot results provided by the pipeline help
users select software that is more suitable for their chosen data set. These results may
also help developers find the advantages and disadvantages of the software itself and
improve it.
2. Background
2.1 Phylogenetic inference and tree-space search
Building a phylogenetic tree is a typical NP-complete problem when the number of
species is large (Foulds and Graham, 1982). This means that an exact solution is
unlikely to be computable efficiently, so in practice only an approximate solution
can be found. Heuristic search algorithms have therefore been developed to search
tree space (Chor and Tuller, 2005). An iterative "hill-climbing" optimization
technique is applied to this problem. The current tree is modified using a
rearrangement algorithm, and the modified tree replaces it if it is better according
to the maximum likelihood criterion. The heuristic search algorithm swaps subtree
branches or grafts branches onto other locations of the best tree found so far, which
produces a tree with a topology similar to that of the current tree (Figure 6). This
process is repeated until the algorithm terminates. The heuristic search algorithm
greatly reduces the number of trees that have to be evaluated, thus keeping the
amount of computation manageable.
Figure 6: Pruning and grafting of heuristic search.
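To make the scale of the problem and the hill-climbing idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from any of the evaluated programs. The count (2n-5)!! is the standard number of unrooted binary topologies on n taxa. The function `hill_climb` is a generic loop in which `neighbors` stands in for a tree rearrangement algorithm and `score` for the likelihood; both are hypothetical placeholders, and the demo climbs over integers rather than trees.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted binary topologies on n_taxa leaves: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

def hill_climb(start, neighbors, score):
    """Move to a better neighbor until no neighbor improves the score."""
    current, current_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > current_score:
                # Accept the first improvement and restart from it.
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current, current_score

if __name__ == "__main__":
    print(count_unrooted_trees(10))  # 2027025 topologies for only 10 taxa
    # Toy stand-in: integers instead of trees, one peak at x = 7.
    print(hill_climb(0, lambda x: (x - 1, x + 1), lambda x: -(x - 7) ** 2))
```

With more than one peak the same loop stops at whichever local optimum it reaches first, which is the weakness that the search strategies described later try to overcome.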
Although the different phylogenetic programs are based on the same maximum likelihood
method, they differ in the choice and implementation of tree rearrangement algorithms.
The tree rearrangement algorithms in current use are the Nearest-Neighbor-Interchange
(NNI) algorithm and the Subtree-Pruning-and-Regrafting (SPR) algorithm.
2.1.1 Nearest Neighbor Interchange
Nearest Neighbor Interchange (NNI) rearranges the four subtrees around an internal
branch of the main tree, swapping subtrees to try to obtain a tree with a higher
likelihood (Robinson, 1971). More specifically, an internal branch and the four
branches adjacent to it (five branches in total) are first removed, which disconnects
four subtrees. The four subtrees are then reconnected in a different way. There are
three possible ways to connect four subtrees, so in addition to the original
connection, the interchange creates two new trees. This process is repeated for other
subtrees until no better tree is generated. Exhaustively searching the nearest
neighbors of every possible set of subtrees is the slowest but most thorough way to
perform the search. Figure 7 is an example of NNI in which subtrees B and C are
exchanged or subtrees B and D are exchanged.
Figure 7: Simple schematic of NNI. The two possible exchanges on an internal edge.
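The two exchanges on one internal edge can be written down directly. The following Python sketch is only an illustration: the nested-tuple representation and the function name `nni_neighbors` are assumptions made here, not the data structures used by any of the evaluated programs.

```python
def nni_neighbors(a, b, c, d):
    """Return the two NNI alternatives of the arrangement ((a, b), (c, d)).

    a, b, c and d are the four subtrees around one internal edge; each can
    be a leaf label or a nested tuple describing a larger subtree.
    """
    return [
        ((a, c), (b, d)),  # subtrees b and c exchanged
        ((a, d), (c, b)),  # subtrees b and d exchanged
    ]

if __name__ == "__main__":
    for tree in nni_neighbors("A", "B", "C", "D"):
        print(tree)
```

Together with the original arrangement ((A, B), (C, D)), these are the three possible ways to connect four subtrees.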
2.1.2 Subtree-Pruning-and-Regrafting
Subtree-Pruning-and-Regrafting (SPR) is a much broader heuristic search method. It
selects a subtree, detaches it from the main tree and reinserts it on another branch
of the main tree. Each such move creates a new tree topology, whose likelihood is
then calculated and evaluated for potential improvement (Swofford et al., 1996). This
process is repeated for subtrees within the specified level until no significant
improvement is made. In Figure 8, the red subtree is cut and reinserted at another
position. This method explores changes to a known tree that is already approaching
the minimum length. This reduces the amount of computation because it is more
efficient than checking a large number of alternative trees of unknown length.
Figure 8: Simple schematic of SPR. The red subtree is pruned and grafted
into other branches.
2.1.3 Time complexity
A single pass of the NNI algorithm checks O(N) topologies, where N is the number of
leaves in the original tree. In contrast, a single pass of the SPR algorithm checks
O(N^2) new trees. The SPR method considers more trees than the NNI method and is
therefore more time consuming. However, SPR explores a larger part of tree space
than NNI, which sometimes fails to find shorter trees.
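The linear and quadratic growth can be made concrete with the standard neighborhood sizes for an unrooted binary tree with N leaves: 2(N-3) NNI neighbors and 2(N-3)(2N-7) SPR neighbors. These closed forms are textbook results added here for illustration and are not stated in the sources cited above; the function names are hypothetical.

```python
def nni_neighbor_count(n_leaves):
    """NNI neighbors of an unrooted binary tree: 2 per internal edge."""
    return 2 * (n_leaves - 3)

def spr_neighbor_count(n_leaves):
    """SPR neighbors of an unrooted binary tree: 2(N-3)(2N-7)."""
    return 2 * (n_leaves - 3) * (2 * n_leaves - 7)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, nni_neighbor_count(n), spr_neighbor_count(n))
```

For 100 leaves this gives 194 NNI neighbors against 37442 SPR neighbors, which is why a full SPR pass costs far more likelihood evaluations than a full NNI pass.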
2.2 Phylogenetic inference by maximum likelihood
All of the phylogenetic programs evaluated in this report are based on the maximum
likelihood method. Maximum likelihood is a statistical method that explicitly uses a
probability model (Felsenstein, 1981). Its goal is to find the phylogenetic tree
under which the observed data have the highest probability. It is a commonly used
statistical method for phylogenetic tree reconstruction.
The maximum likelihood method was first applied to phylogenetic analysis in the
analysis of gene frequency data. The principle is to consider the likelihood of the
residues at each site, summing the probabilities of all possible residue
substitutions at that site to produce a likelihood value for the site (Felsenstein,
1981). The maximum likelihood method computes a likelihood value for each candidate
phylogenetic tree, so that every tree is assigned a probability of producing the
data. The tree with the largest likelihood value is the most likely phylogenetic
tree. To use the maximum likelihood method to infer a phylogenetic tree from an
alignment, we first need to choose a model of sequence evolution. Common substitution
models for nucleotide sequences include the Jukes-Cantor model (Jukes and Cantor,
1969) and the Kimura two-parameter model (Kimura, 1980). For protein sequences, the
Poisson correction model is commonly chosen. The maximum likelihood method is based
on statistical properties and is supported by good mathematical theory. Its
disadvantage is that it requires a considerable amount of computation and can
sometimes be time-consuming.
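As a small worked example of a likelihood calculation, the following Python sketch evaluates the Jukes-Cantor likelihood of the simplest possible "tree": two aligned sequences joined by one branch of length d (expected substitutions per site). It uses the standard Jukes-Cantor transition probabilities and the closed-form distance estimate. The toy sequences and function names are assumptions made for illustration, and real programs evaluate full trees with richer models.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned DNA sequences at distance d under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a base is unchanged
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base in the first sequence.
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l

def jc69_distance(seq_a, seq_b):
    """Closed-form maximum likelihood estimate of d under JC69."""
    mismatches = sum(x != y for x, y in zip(seq_a, seq_b))
    p = mismatches / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    a, b = "ACGTACGTAC", "ACGTACGTTT"  # toy alignment, 2 of 10 sites differ
    d_hat = jc69_distance(a, b)
    for d in (d_hat - 0.05, d_hat, d_hat + 0.05):
        print(round(d, 4), round(jc69_log_likelihood(a, b, d), 4))
```

The log-likelihood printed for the middle value is the largest of the three, which is what maximizing the likelihood over a branch length means. Tree inference programs repeat this kind of optimization for every branch of every candidate topology, which explains the computational cost.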
Three popular fast phylogenetic programs based on maximum likelihood were chosen for
our evaluation: RAxML, IQ-TREE and RAxML-NG. They all support partitioned analysis as
well as common and custom models, but they differ in the choice and implementation of
topological moves and in the trade-off between speed and performance.
2.2.1 RAxML
RAxML (Randomized Axelerated Maximum Likelihood) is a program for inferring large
phylogenetic trees by maximum likelihood. It uses a fast tree search algorithm and
returns trees with good likelihood scores (Stamatakis et al., 2006, 2014). The latest
version, 8.2.12, employs an SPR-based heuristic search with “lazy subtree
rearrangements” to reduce the number of unpromising SPR alternatives (Stamatakis et
al., 2005). On the one hand, the lazy subtree rearrangement algorithm limits the
distance between the regrafting position and the pruning position. On the other hand,
when a regrafting results in a worse likelihood score, all branches further away from
that regrafting position are no longer considered (Stamatakis et al., 2007). RAxML
also adjusts the rearrangement distance dynamically (Stamatakis et al., 2006).
Several rearrangement distances are tried on the starting tree, and the smallest
distance that produces the best likelihood improvement is then used for inference.
RAxML has also been parallelized with MPI to run multiple bootstraps and inferences
from multiple starting trees in parallel. RAxML performs excellently in terms of
accuracy and speed (Stamatakis et al., 2006).
2.2.2 IQ-TREE
IQ-TREE is an emerging and widely used software for phylogenetic analysis from
genome-scale data. IQ-TREE (latest version 1.6.10) combines existing phylogenetic and
combinatorial optimization techniques into a new, efficient tree search strategy
designed to escape local optima (Nguyen et al., 2015). This strategy combines
extensive sampling of initial starting trees, an NNI-based hill-climbing search
algorithm, and a stochastic perturbation of the current best tree to escape the local
NNI optima in which pure "hill-climbing" methods get stuck. In detail, IQ-TREE
generates multiple initial trees and stores and updates a set of candidate trees
throughout the search. In each iteration, IQ-TREE randomly selects a candidate tree
and modifies it using stochastic perturbations. An NNI-based hill-climbing tree
search is then applied to this tree. If a tree with a higher likelihood score is
generated, it replaces the worst of the current candidate trees; otherwise, the
iteration is counted as failed, and the analysis terminates once the number of failed
iterations exceeds a limit. In addition, IQ-TREE uses some elements of evolution
strategies to extend the tree space search (Nguyen et al., 2015). IQ-TREE shows good
performance and is a time-efficient and search-efficient ML tree reconstruction
program.
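The control flow of this kind of search (candidate set, random perturbation, hill climbing, stop after a number of failed iterations) can be sketched generically in Python. This is a toy illustration and not IQ-TREE's implementation: integers stand in for trees, `toy_score` stands in for the likelihood, and the exact acceptance rule (an iteration succeeds only if it beats the best candidate, while any result better than the worst candidate replaces it) is an assumption made for this sketch.

```python
import math
import random

def toy_score(x):
    """Bumpy toy objective with many local optima (stand-in for likelihood)."""
    return -(x - 40) ** 2 / 50 + 8 * math.cos(x)

def local_search(x, score):
    """Greedy +/-1 hill climbing; stands in for the NNI hill-climbing step."""
    while True:
        best = max((x - 1, x + 1), key=score)
        if score(best) <= score(x):
            return x
        x = best

def stochastic_search(starts, score, max_failed=20, seed=0):
    """Perturb a random candidate, re-optimize, and keep the best results."""
    rng = random.Random(seed)
    # Candidate set sorted by score: worst first, best last.
    candidates = sorted((local_search(s, score) for s in starts), key=score)
    failed = 0
    while failed < max_failed:
        picked = rng.choice(candidates)
        perturbed = picked + rng.randint(-10, 10)  # random perturbation
        new = local_search(perturbed, score)
        if score(new) > score(candidates[-1]):
            failed = 0   # improved on the best candidate: reset the counter
        else:
            failed += 1  # unsuccessful iteration
        if score(new) > score(candidates[0]) and new not in candidates:
            candidates[0] = new  # replace the worst candidate
            candidates.sort(key=score)
    return candidates[-1]

if __name__ == "__main__":
    best = stochastic_search([0, 5, 90], toy_score)
    print(best, round(toy_score(best), 3))
```

Because a candidate is only ever replaced by a better one, the best score in the set never decreases, while the random perturbations give the search a chance to leave the local optimum that plain hill climbing would stop at.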
2.2.3 RAxML-NG
In 2018, RAxML's development team developed RAxML-NG (Next Generation). It is a
from-scratch re-implementation of RAxML's established greedy tree search algorithm
(Kozlov et al., 2018). Compared to RAxML, RAxML-NG supports more evolution
models and provides optimization of all model parameters. The user can also set a fixed
value for any parameter, including branch lengths, as needed. It fixes an issue where the
subtree enumeration method used in RAxML occasionally skipped promising topologies.
RAxML-NG incorporates several recently published methods to improve performance.
It employs a technique called site repeats (Kobert et al., 2017), which optimizes the
likelihood computation, to increase speed. It also integrates the load balancing
algorithms and parallel I/O optimization techniques used in ExaML, a program for
large concatenated datasets (Kozlov et al., 2015), to improve parallel efficiency.
RAxML-NG offers greater accuracy, speed, scalability and usability.
2.2.4 Related work
Comprehensive benchmarks of the latest phylogenetic software are relatively scarce.
Most evaluations are made by software developers who compare their own program with
similar programs available at the time, and these studies are often outdated. For
example, in 2006, RAxML's developers compared RAxML with other phylogeny
programs of the same period, including GARLI (Zwickl, 2006), MrBayes (Ronquist
and Huelsenbeck, 2003), IQPNNI (Minh et al., 2005) and PHYML (Guindon and
Gascuel, 2003). Their results indicated that RAxML performed best. Some of these
programs are no longer maintained or have fallen out of use.
A recent paper (Zhou et al., 2018) evaluated four fast maximum likelihood-based
phylogenetic programs in terms of likelihood score, topology and computational speed:
IQ-TREE (1.5.5), RAxML (8.2.11), FastTree (2.1.10) and PhyML (20160531).
IQ-TREE achieved the best tree inference accuracy, closely followed by RAxML. In
contrast, PhyML generated trees with lower likelihood scores. In terms of running
time, IQ-TREE was faster than RAxML on both protein and DNA data. On average,
PhyML was 1.5 times faster than RAxML on protein data sets but 3.1 times slower on
DNA data sets. FastTree was the fastest of the four programs but produced the trees
with the worst likelihood scores.
In another, more recent paper (Kozlov et al., 2018), the developers of RAxML-NG
compared IQ-TREE (1.6.7), RAxML-NG (0.6.0), ExaML (3.0.19) and RAxML (8.2.10).
RAxML-NG obtained the best likelihood scores on both DNA and protein datasets,
followed by IQ-TREE. RAxML and ExaML performed similarly in terms of likelihood
scores. RAxML-NG was also the fastest program on most datasets. On the few remaining
datasets, RAxML-NG was slower than ExaML or RAxML but found trees with better
likelihood scores.
Systematic evaluations and comparisons of phylogenetic software are therefore
relatively scarce. Benchmarking the available programs is needed to help users make
an appropriate choice.
3. Method
3.1 Benchmark framework
Figure 9: The overall pipeline of the benchmark framework. When the user inputs an
alignment file and a partition file, the pipeline automatically invokes the different
programs to infer phylogenetic trees multiple times. It then collects and calculates
all results, including maximum likelihood scores, run time, parallel efficiency and
memory usage, and generates a report containing tables of raw data and the
corresponding violin plots.
As Figure 9 shows, the main strength of the framework is its automation. The user only
needs to enter the directory that contains the alignment and partition files and
specify how many times each program should be run and how many threads to use. The
pipeline then traverses all alignments in the directory and invokes each program to
infer phylogenetic trees. It collects information from all output files and summarizes
the results, including the maximum likelihood score of each tree, the time and memory
needed to infer it, and the parallel efficiency. Finally, a table containing the raw
data and the corresponding violin plots are returned to the user. Another advantage of
the framework is that it evaluates parallel efficiency and memory usage, which the
previous benchmarks discussed above did not cover.
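The directory traversal step can be sketched as follows. This is only an illustration of the idea, not the code in pipeline.py: the function name find_datasets and the file extensions ".phy" and ".partitions" are assumptions made for the example, and the sketch is written for Python 3.

```python
import os


def find_datasets(root, alignment_ext=".phy", partition_ext=".partitions"):
    """Walk `root` and pair every alignment with its partition file, if any.

    The extensions are hypothetical defaults; an alignment without a matching
    partition file is treated as unpartitioned (partitions is None).
    """
    datasets = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(alignment_ext):
                continue
            stem = name[: -len(alignment_ext)]
            partition_name = stem + partition_ext
            has_partitions = partition_name in filenames
            datasets.append(
                {
                    "name": stem,
                    "alignment": os.path.join(dirpath, name),
                    "partitions": os.path.join(dirpath, partition_name)
                    if has_partitions
                    else None,
                }
            )
    # Sort so that repeated runs process the datasets in the same order.
    return sorted(datasets, key=lambda d: d["alignment"])
```

Each returned entry then provides the inputs for one round of tree inferences per program.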
3.2 Tree inference software
Considering the results of other papers, IQ-TREE, RAxML and the newly released
RAxML-NG were selected for evaluation in this report. The version, release date, and
reference of each program are shown in Table 1. The three programs were compared in
terms of maximum likelihood score, run time, memory usage, and parallel efficiency.
Software   Version  Release date    References
IQ-TREE    1.6.10   March 2019      Nguyen et al., 2015; Chernomor et al., 2016
RAxML      8.2.12   May 2018        Stamatakis, 2014
RAxML-NG   0.6.0    September 2018  Kozlov et al., 2018
Table 1: Software and related information.
3.3 Cluster configuration
All evaluation code was run on the GDU server, a cluster provided by the Research
School of Biology at the Australian National University. The hardware and software
configuration is shown in Table 2.
System: GDU Server

Hardware
CPU model          Intel(R) Xeon(R) CPU E5-2630
CPU architecture   Haswell
CPU cores          96
Memory size        378 GB

Software
Operating system   CentOS Linux release 6.6
Compiler           GCC version 8.2.0
MPI                Open MPI 2.0.2
Table 2: Hardware and software configuration of GDU server.
3.4 Data sets
The three programs may behave differently on phylogenetic datasets with different
characteristics. Ten protein and ten DNA datasets were selected to evaluate the
three programs as comprehensively as possible. Seventeen of these datasets were
collected by Zhou et al. (2018) for the evaluation of phylogenetic software. The other
three were collected by the developers of RAxML-NG (Kozlov et al., 2018) to evaluate
the performance of RAxML-NG. The datasets have varying numbers of taxa and genes as
well as different alignment lengths and cover a range of organisms such as animals,
plants, and fungi (Table 3). During data collection, all alignment files were first
converted to PHYLIP format to ensure that they could be read by most phylogenetic
software. Twelve datasets come with partition files and were analysed with a partition
model. The partitioning scheme and substitution models were kept consistent with the
original studies of these datasets.
Dataset   Data type  Taxa   Length      Distinct patterns  Partitions  Reference
SongD1    DNA        37     1,338,678   746,408            1           (Song et al., 2012)
MisoD2b   DNA        144    413,459     371,434            50          (Misof et al., 2014)
WickD3a   DNA        103    436,077     422,676            14          (Wickett et al., 2014)
WickD3b   DNA        103    290,718     277,375            8           (Wickett et al., 2014)
XiD4      DNA        46     239,763     165,781            1           (Xi et al., 2014)
PrumD6    DNA        200    394,684     236,674            75          (Prum et al., 2015)
TarvD7    DNA        36     21,410,970  8,520,738          1           (Tarver et al., 2016)
PeteD8    DNA        174    3,011,099   2,248,590          4,116       (Peters et al., 2017)
ShiD9     DNA        815    20,364      13,311             29          (Shi and Rabosky, 2015)
StamD10   DNA        436    1,371       1,011              1           (Stamatakis et al., 2010)
NagyA1    AA         60     172,073     156,312            594         (Nagy et al., 2014)
MisoA2    AA         144    413,459     406,963            479         (Misof et al., 2014)
WickA3    AA         103    145,359     144,342            11          (Wickett et al., 2014)
ChenA4    AA         58     1,806,035   1,547,914          1           (Chen et al., 2015)
StruA5    AA         100    189,193     178,600            1           (Struck et al., 2015)
BoroA6    AA         36     384,981     376,803            831         (Borowiec et al., 2015)
WhelA7    AA         70     59,725      58,419             210         (Whelan et al., 2015)
YangA8    AA         95     504,850     476,259            1,122       (Yang et al., 2015)
ShenA9    AA         96     609,899     583,199            1           (Shen et al., 2016)
GitzA12   AA         1,897  18,328      18,303             1           (Gitzendanner et al., 2018)
Table 3: Data sets with varying characteristics used for the evaluation. The letter
before the number in each name indicates the data type: A for amino acid, D for DNA.
3.5 Evaluation method
Repeating each experiment reduces the variation in the results that is caused by
random starting trees. Each program performed five phylogenetic tree inferences on
each dataset, using five distinct seeds for the random number generator. Each dataset
therefore yielded fifteen phylogenetic trees in total. The tree with the highest
log-likelihood score among these fifteen trees was called the best-known tree for
that dataset.
Comparing the likelihood score of each tree obtained by a program with the likelihood
score of the best-known tree for the same dataset shows how well that program performs
at tree inference. The likelihood score itself, the run time, the memory usage, and
the parallel efficiency were also used as comparison criteria.
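The selection of the best-known tree and the per-program counts can be sketched as follows. This is a minimal illustration rather than the pipeline's own code: the function names are hypothetical, the scores in the example are invented, and the tolerance of 0.01 log-likelihood units is an assumption, since no numerical tolerance is stated in this report.

```python
def best_known_score(scores_by_software):
    """Highest log-likelihood over all runs of all programs for one dataset."""
    return max(s for runs in scores_by_software.values() for s in runs)


def count_best_known(scores_by_software, tolerance=0.01):
    """Count, per program, the runs whose score is within `tolerance` of the best.

    `tolerance` is an assumed value in log-likelihood units.
    """
    best = best_known_score(scores_by_software)
    return {
        software: sum(1 for s in runs if best - s <= tolerance)
        for software, runs in scores_by_software.items()
    }


# Invented scores for one dataset (three runs per program for brevity).
example_scores = {
    "IQ-TREE": [-1000.50, -1000.50, -1002.00],
    "RAxML": [-1005.00, -1004.00, -1003.00],
    "RAxML-NG": [-1000.50, -1001.20, -1000.505],
}
example_counts = count_best_known(example_scores)
print(example_counts)
```

Log-likelihoods are negative, so the highest score is the one closest to zero, and a run counts as having found the best-known tree when its score lies within the tolerance of that maximum.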
The benchmark pipeline was written in Python. It runs the tree inferences, calculates
and summarizes the run time and parallel efficiency, and records the memory usage.
The results were then visualized with ggplot() in R. The core command lines are shown
in Table 4.
Mode        Software   Command line
Inference   IQ-TREE    iqtree -s <ALIGNMENT> -q <PARTITIONS>
                       -seed <RSEED> -nt 16 -pre <OUTDIR>
Inference   RAxML      raxmlHPC-PTHREADS-AVX -m <MODEL>
                       -s <ALIGNMENT> -q <PARTITIONS> -p <RSEED>
                       -n <RUNNAME> -w <OUTDIR> -T 16
Inference   RAxML-NG   raxml-ng --search --msa <ALIGNMENT> --model
                       <PARTITIONS> --seed <RSEED> --prefix <OUTDIR>
                       --threads 16 --site-repeats on
Evaluation  IQ-TREE    iqtree -s <ALIGNMENT> -q <PARTITIONS>
                       -te <ML_TREE> -nt 16 -pre <OUTDIR>
Time                   /usr/bin/time
Table 4: Command lines used for tree inference, likelihood evaluation and timing.
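As an illustration of how the inference commands in Table 4 might be assembled programmatically, the sketch below builds one argument list per program. It is not the code in pipeline.py: the function name and its parameters are hypothetical, the "-v" flag of /usr/bin/time (verbose output of GNU time, which reports the peak memory) is an assumption because Table 4 only names the timer, and the RAxML-NG options are written in the double-dash form used by its command-line interface.

```python
def build_inference_commands(alignment, partitions, seed, threads, outdir,
                             raxml_model, raxml_ng_model, run_name):
    """Return {program: argv list} for one inference run of each program.

    `partitions` may be None for an unpartitioned dataset; in that case RAxML-NG
    receives `raxml_ng_model` as its --model argument (hypothetical parameter).
    """
    timer = ["/usr/bin/time", "-v"]  # assumption: GNU time in verbose mode

    iqtree = ["iqtree", "-s", alignment]
    raxml = ["raxmlHPC-PTHREADS-AVX", "-m", raxml_model, "-s", alignment]
    if partitions:
        iqtree += ["-q", partitions]
        raxml += ["-q", partitions]
    iqtree += ["-seed", str(seed), "-nt", str(threads), "-pre", outdir]
    raxml += ["-p", str(seed), "-n", run_name, "-w", outdir, "-T", str(threads)]

    raxml_ng = [
        "raxml-ng", "--search", "--msa", alignment,
        "--model", partitions or raxml_ng_model,
        "--seed", str(seed), "--prefix", outdir,
        "--threads", str(threads), "--site-repeats", "on",
    ]
    return {
        "IQ-TREE": timer + iqtree,
        "RAxML": timer + raxml,
        "RAxML-NG": timer + raxml_ng,
    }


example_commands = build_inference_commands(
    alignment="example.phy", partitions="example.partitions", seed=1,
    threads=16, outdir="/tmp/out", raxml_model="GTRGAMMA",
    raxml_ng_model="GTR+G", run_name="run1",
)
for program, argv in example_commands.items():
    print(program, " ".join(argv))
```

Building argument lists instead of shell strings means each list can be passed directly to subprocess.run without quoting file names.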
4. Results & Discussion
4.1 Maximum log-likelihood score
The log-likelihood score is one of the criteria for judging an inferred phylogenetic
tree. When comparing the likelihood scores of all trees, RAxML-NG obtained the best
results, followed closely by IQ-TREE. For each program and data set, the number of
inferences that found the best-known tree, within a numerical tolerance, was counted
(Table 5). RAxML-NG found the best-known tree on fifteen of the twenty data sets, and
IQ-TREE on twelve. RAxML found the best-known tree on only six data sets, five of
which were protein data sets. This suggests that RAxML reaches good likelihood scores
more easily on protein alignments than on DNA alignments. IQ-TREE and RAxML-NG
performed well on both protein and DNA data sets. For some data sets with a small
number of taxa (XiD4, ChenA4, YangA8 and ShenA9), three of which were analysed without
a partition model, all three programs found the best-known tree. Conversely, on some
large and complex data sets (PeteD8, ShiD9, GitzA12, etc.), only RAxML-NG found the
best-known tree. This may be because the SPR moves used by RAxML-NG are better suited
to data sets with a very large number of taxa than the NNI moves used by IQ-TREE. It
is worth noting that no program found the best-known tree on all data sets. In other
words, no tree rearrangement algorithm outperformed the others in all cases.
Dataset    ML tree searches which found the best-known tree
           IQ-TREE   RAxML   RAxML-NG
SongD1     5         0       5
MisoD2b    0         0       5
WickD3a    0         0       5
WickD3b    5         0       3
XiD4       5         4       5
PrumD6     5         0       0
TarvD7     5         0       0
PeteD8     0         0       1
ShiD9      0         0       1
StamD10    0         0       2
NagyA1     5         0       5
MisoA2     5         0       0
WickA3     0         0       5
ChenA4     5         5       5
StruA5     4         0       5
BoroA6     0         5       0
WhelA7     5         5       0
YangA8     5         5       5
ShenA9     5         5       3
GitzA12    0         0       1
Number of datasets for which the best-known tree was found
           12        6       15
Table 5: The number of maximum likelihood tree inferences (out of 5) which
yielded the best-known tree, per dataset and inference software.
Figure 10 shows the difference between the likelihood score of each tree obtained by
each program and the score of the best-known tree. Judging by the distributions,
IQ-TREE was the most stable: it tended to produce similar results with different
random seeds. This may be because IQ-TREE uses multiple initial trees and maintains a
set of candidate trees throughout the search, which helps to avoid local optima to
some extent. RAxML-NG and RAxML sometimes showed large differences between runs. With
software that is less stable, users may need to repeat the analysis several times to
obtain a satisfactory result. RAxML-NG and RAxML would therefore benefit from
improved stability and a reduced dependence on the random starting tree.
Figure 10: Difference between each maximum log-likelihood score and the score of the
best-known tree.
4.2 Running Time
RAxML-NG was the fastest program on seventeen of the twenty data sets (Figure 11,
Table 6). Table 6 gives the speedup of RAxML-NG, that is, the average running time of
each of the other two programs divided by the average running time of RAxML-NG. On
these seventeen data sets, the speedup ranged from 1.02 (relative to IQ-TREE on
WhelA7) to 6.87 (relative to IQ-TREE on MisoD2b). IQ-TREE was the fastest on only one
data set (NagyA1), where it also found the best-known tree in all five runs (Table 5).
RAxML was the fastest on StamD10 and GitzA12, but on these two data sets only
RAxML-NG found the best-known tree. RAxML-NG implements an efficient phylogenetic
likelihood kernel together with efficient parallelization and load balancing, which
may explain its good running times. Although IQ-TREE took the longest time on most
data sets, it still found better trees than RAxML.
Dataset    RAxML-NG speedup (x) compared to
           IQ-TREE   RAxML
SongD1     5.83      2.00
MisoD2b    6.87      1.90
WickD3a    2.01      1.14
WickD3b    2.17      1.10
XiD4       3.93      1.09
PrumD6     2.38      1.33
TarvD7     4.23      1.71
PeteD8     1.42      1.44
ShiD9      6.38      1.36
StamD10    21.75     0.25
NagyA1     0.72      2.37
MisoA2     1.54      1.69
WickA3     1.64      1.14
ChenA4     1.84      1.15
StruA5     1.21      2.08
BoroA6     1.68      1.64
WhelA7     1.02      2.16
YangA8     1.13      1.51
ShenA9     1.36      1.23
GitzA12    2.59      0.93
Table 6: Speedup of RAxML-NG, computed as the average wall-clock running time of
IQ-TREE or RAxML divided by the average wall-clock running time of RAxML-NG.
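The counts quoted in the text can be re-derived from the ratios in Table 6. The short sketch below does this; the dictionary simply transcribes the table, and the helper name fastest_program is hypothetical. A ratio above 1 means RAxML-NG was faster than the other program, and a ratio below 1 means the other program was faster.

```python
# (speedup vs IQ-TREE, speedup vs RAxML), transcribed from Table 6.
SPEEDUP = {
    "SongD1": (5.83, 2.00), "MisoD2b": (6.87, 1.90), "WickD3a": (2.01, 1.14),
    "WickD3b": (2.17, 1.10), "XiD4": (3.93, 1.09), "PrumD6": (2.38, 1.33),
    "TarvD7": (4.23, 1.71), "PeteD8": (1.42, 1.44), "ShiD9": (6.38, 1.36),
    "StamD10": (21.75, 0.25), "NagyA1": (0.72, 2.37), "MisoA2": (1.54, 1.69),
    "WickA3": (1.64, 1.14), "ChenA4": (1.84, 1.15), "StruA5": (1.21, 2.08),
    "BoroA6": (1.68, 1.64), "WhelA7": (1.02, 2.16), "YangA8": (1.13, 1.51),
    "ShenA9": (1.36, 1.23), "GitzA12": (2.59, 0.93),
}


def fastest_program(vs_iqtree, vs_raxml):
    """Name the fastest program given the two RAxML-NG speedup ratios.

    Each ratio is (time of other program) / (time of RAxML-NG), so the program
    with the smallest relative time is the fastest; RAxML-NG itself has ratio 1.
    """
    relative_time = {"RAxML-NG": 1.0, "IQ-TREE": vs_iqtree, "RAxML": vs_raxml}
    return min(relative_time, key=relative_time.get)


winners = {name: fastest_program(*ratios) for name, ratios in SPEEDUP.items()}
ng_wins = [name for name, winner in winners.items() if winner == "RAxML-NG"]
ng_ratios = [r for name in ng_wins for r in SPEEDUP[name]]
print(len(ng_wins), min(ng_ratios), max(ng_ratios))
```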
Figure 11: Wall-clock execution time in hours (16 threads).
4.3 Parallel efficiency
Compared with existing papers, one of the innovations of this study is that it
evaluates the parallel efficiency of the different programs. Given that all evaluated
programs support multi-threaded operation, we hope that measuring the actual parallel
efficiency helps developers to identify problems and improve their software. Sixteen
threads were used in this experiment. The parallel efficiency is calculated with the
following formula, where K is the number of threads, C is the CPU time and W is the
wall-clock time.
Parallel efficiency = C ÷ (K × W)
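As a worked example of this formula, the sketch below computes the parallel efficiency from timing values. The function names are hypothetical, the numbers are invented, and taking the CPU time C as the sum of user and system time is an assumption. The helper parse_elapsed converts the "h:mm:ss" or "m:ss" wall-clock format printed by GNU time into seconds.

```python
def parse_elapsed(text):
    """Convert a wall-clock string such as '1:02:03' or '2:03.45' to seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parallel_efficiency(cpu_seconds, wall_seconds, threads):
    """C / (K * W): the fraction of the available thread time spent computing."""
    if wall_seconds <= 0 or threads <= 0:
        raise ValueError("wall_seconds and threads must be positive")
    return cpu_seconds / (threads * wall_seconds)


# Invented run: 57,000 s user time and 500 s system time in 1 h on 16 threads.
cpu = 57000.0 + 500.0          # assumption: C = user time + system time
wall = parse_elapsed("1:00:00")
example_efficiency = parallel_efficiency(cpu, wall, threads=16)
print(round(example_efficiency * 100, 1), "%")
```

An efficiency of 1 would mean that all K threads were busy for the entire wall-clock time; idle threads that wait for others lower the value.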
From the perspective of parallel efficiency (Figure 12), both RAxML-NG and RAxML
implement excellent parallelization. With sixteen threads, their parallel efficiency
approached 99.8%, with RAxML slightly ahead of RAxML-NG. In contrast, the parallel
efficiency of IQ-TREE reached only about 75% on average, which indicates that the
current version of IQ-TREE does not balance the load between threads as well. Put
simply, when the number of partitions in a data set is not a multiple of the number of
threads, IQ-TREE cannot balance the load well, which lengthens the running time and
lowers the parallel efficiency. In fact, both IQ-TREE and RAxML parallelize
computations over alignment sites. However, RAxML can divide partitions into smaller
chunks and distribute the chunks evenly over the K threads, whereas IQ-TREE does not
implement this. When a data set mixes long and short partitions, threads that have
finished short partitions sit idle while waiting for the computations on long
partitions to finish. This benchmark therefore identifies a concrete problem whose
solution would improve IQ-TREE.
Figure 12: Parallel efficiency in percent when running with multiple threads
(16 threads).
4.4 Memory usage
Another innovation is that we also evaluated the memory usage of the software during
tree inference. Figure 13 shows the maximum memory usage of the different programs.
Even on the same data set, the programs used different amounts of memory. The red line
represents the theoretical minimum memory requirement. Since the implementation of the
software and other operations besides the likelihood computation also take up memory,
the actual usage is higher than the theoretical value. The theoretical value in bytes
is calculated as follows:
Memory = N × M × K × R × sizeof(double)
N is the number of sequences. M is the number of distinct site patterns. K is the
number of character states: four bases for DNA and twenty amino acids for proteins.
R is the number of rate categories. In this experiment the G4 model was used, so R=4.
The size of a double is generally eight bytes. This is the memory required to store
the likelihoods on the tree according to the pruning algorithm (Felsenstein, 1981).
It should account for the bulk of the RAM used, but a particular program may allocate
more.
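As a worked example, the theoretical value can be computed directly from the numbers in Table 3. The sketch below applies the formula to TarvD7 (DNA, 36 taxa, 8,520,738 distinct patterns) and ChenA4 (protein, 58 taxa, 1,547,914 distinct patterns). The function name is hypothetical, and reporting the result in GB as bytes divided by 10^9 is an assumption about the unit convention.

```python
def theoretical_memory_bytes(n_sequences, n_patterns, n_states,
                             n_rate_categories=4, bytes_per_double=8):
    """N x M x K x R x sizeof(double): memory for the likelihood vectors."""
    return (n_sequences * n_patterns * n_states
            * n_rate_categories * bytes_per_double)


# Values taken from Table 3; K = 4 states for DNA and 20 for amino acids.
tarvd7_bytes = theoretical_memory_bytes(36, 8520738, 4)
chena4_bytes = theoretical_memory_bytes(58, 1547914, 20)

for name, value in [("TarvD7", tarvd7_bytes), ("ChenA4", chena4_bytes)]:
    # Assumption: 1 GB = 10**9 bytes.
    print(name, value, "bytes =", round(value / 10**9, 2), "GB")
```

Both data sets need several tens of gigabytes for the likelihood vectors alone, which illustrates why data sets with many distinct site patterns dominate the memory measurements.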
In most cases the actual memory usage was only slightly higher than the theoretical
value, which is reasonable and in line with expectations. However, for some data sets
with a very large number of sites (TarvD7, PeteD8, etc.), all programs used
considerably more memory than the theoretical value to infer the phylogenetic tree.
Figure 13: Maximum memory usage in GB during the run.
5. Conclusion
This study designed an automated pipeline for benchmarking phylogenetic software. The
pipeline calls multiple phylogenetic programs to perform phylogenetic inference on the
input data sets and compares the results in four respects: maximum likelihood score,
run time, parallel efficiency, and memory usage. Compared with previous benchmarks
(Zhou et al., 2018; Kozlov et al., 2018), ours additionally addresses parallel
efficiency and memory usage. Three popular maximum likelihood-based phylogenetic
programs were selected for comparison. To test the pipeline, twenty recent
genome-scale data sets with varying characteristics were collected. In terms of the
likelihood score of the inferred trees, RAxML-NG was the best program, followed by
IQ-TREE, although IQ-TREE was more stable. RAxML-NG was also the fastest program on
most data sets. Both RAxML and RAxML-NG implement good parallelization techniques,
whereas our comparison revealed that IQ-TREE has considerable room for improvement in
parallel efficiency. Moreover, the memory usage of the programs was in most cases
close to the theoretical value, but on some data sets with a very large number of
sites all programs used considerably more memory than the theoretical value to infer
the phylogenetic tree.
The pipeline can help users choose the software that works best for them. For software
developers, benchmarking ML-based phylogenetic software also helps to reveal the
strengths and weaknesses of their software and how to improve it.
The results show that algorithms and their implementation have a great impact on the
performance of the software. Each program has excellent algorithmic ideas and
innovative implementations. It is the developers who constantly integrate new and
better methods that drive progress in the field of phylogenetics.
In the future, we plan to add more software to the pipeline, such as MEGA and PhyML.
We also intend to collect more reliable data sets for evaluation. We hope to provide a
web service so that users can easily upload data sets through a web page and obtain a
summary of the results of the different programs.
6. References
Borowiec, M. L. et al. (2015). Extracting phylogenetic signal and accounting for bias
in whole-genome data sets supports the Ctenophora as sister to remaining Metazoa. BMC
Genomics, 16(1), 987.
Darwin, C. (1859). The origin of species by means of natural selection. (Murray,
London, 1859).
Chernomor, O. et al. (2016). Terrace aware data structure for phylogenomic inference
from supermatrices. Systematic Biology, 65(6), 997–1008.
Chen, M.-Y. et al. (2015). Selecting question-specific genes to reduce incongruence in
phylogenomics: A case study of jawed vertebrate backbone phylogeny. Systematic
Biology, 64(6), 1104–1120.
Chor B, Tuller T. (2005). Maximum Likelihood of Evolutionary Trees Is Hard. In:
Miyano S,Mesirov J, Kasif S, Istrail S, Pevzner PA,Waterman M, editors. Research in
Computational Molecular Biology: 9th Annual International Conference, RECOMB
2005, Cambridge, MA, USA; 2005 May 14–18, Proceedings. Berlin, Heidelberg
(Germany): Springer. p. 296–310.
Felsenstein J. (1981). A likelihood approach to character weighting and what it tells us
about parsimony and compatibility. Biological Journal of the Linnean Society, 16(3):
183-196.
Fitch WM. (1971). Toward defining the course of evolution: minimum change for a
specific tree topology. Systematic Zoology, 20: 406-416.
Foulds LR, Graham RL. (1982). The steiner tree problem in phylogeny is NP-complete.
Advances in Applied Mathematics. 3: 4-49.
Futuyma DJ. (1998). Evolutionary biology. Sunderland, MA: Sinauer Associates.
Gitzendanner, M. A. et al. (2018). Plastid phylogenomic analysis of green plants: A
billion years of evolutionary history. American Journal of Botany, 105(3), 291–301.
Guindon S., Gascuel O. (2003) A simple, fast, and accurate algorithm to estimate large
phylogenies by maximum likelihood, Systematic Biology, vol. 52 (pg. 696-704)
Jukes TH, Cantor CR. (1969). Evolution of protein molecules. In: Mammalian Protein
Metabolism. New York: Academic Press.
Kimura M. (1980). A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences. Journal of
Molecular Evolution, 16(2): 111-120.
Kobert K, et al. (2017). Efficient detection of repeating sites to accelerate phylogenetic
likelihood calculations. Systematic Biology, 66(2), 205-217.
Kozlov AM, Aberer AJ, Stamatakis A. (2015). ExaML version 3: a tool for
phylogenomic analyses on supercomputers. Bioinformatics, 31(15), 2577-2579.
Kozlov AM, et al. (2018). RAxML-NG: A fast, scalable, and user-friendly tool for maximum
likelihood phylogenetic inference. bioRxiv 447110.
Mau B, Newton M. (1997). Phylogenetic inference for binary data on dendrograms
using Markov chain Monte Carlo. J.Comput.Graph.Stat, 6, 122-131.
Maxam AM, Gilbert W. (1977). A new method for sequencing DNA. Proc Natl Acad
Sci U S A, 74(2):560–564.
Mayr, E. (2003). The growth of biological thought. Cambridge, Mass.: The Belknap
Press of Harvard Univ. Press.
Nagy, L. G. et al. (2014). Latent homology and convergent regulatory evolution
underlies the repeated emergence of yeasts. Nature communications, 5, 4471.
Minh B.Q., et al. (2005). IQPNNI: parallel reconstruction of large maximum likelihood
phylogenies, Bioinformatics, vol. 21, 3794-3796
Misof, B. et al. (2014). Phylogenomics resolves the timing and pattern of insect
evolution. Science, 346(6210), 763–767.
Nei M. (1987). Molecular evolutionary genetics. New York: Columbia University Press.
Nei M, Kumar S. (2000). Molecular evolution and phylogenetics. Oxford: Oxford
University Press.
Nguyen, L.-T. et al. (2015). IQ-TREE: A fast and effective stochastic algorithm for
estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32(1),
268–274.
Robinson DF. (1971). Comparison of labeled trees with valency three. J Comb Theory.
B 11(2): 105–119.
Ronquist F., Huelsenbeck J. (2003). MrBayes 3: Bayesian phylogenetic inference
under mixed models, Bioinformatics, vol. 19 (pg. 1572-1574)
Peters, R. S. et al. (2017). Evolutionary history of the hymenoptera. Current Biology,
27(7), 1013-1018
Prum, R. O. et al. (2015). A comprehensive phylogeny of birds (Aves) using targeted
next-generation DNA sequencing. Nature, 526(7574), 569–573.
Rannala, B., Yang, Z. (1996). Probability distribution of molecular evolutionary trees: a
new method of phylogenetic inference. Journal of Molecular Evolution, 43, 304-311.
Saitou N, Nei M. (1986). The number of nucleotides required to determine the
branching order of three species, with special reference to the
human-chimpanzee-gorilla divergence. Journal of Molecular Evolution, 24(1-2): 189-204.
Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA,
Slocombe PM, Smith M. (1977). Nucleotide sequence of bacteriophage phi X174
DNA. Nature. 265(5596):687–695.
Shen, X.-X. et al. (2016). Reconstructing the backbone of the saccharomycotina yeast
phylogeny using genome-scale data. G3: Genes, Genomes, Genetics, pages g3–116.
Shi, J. J. and Rabosky, D. L. (2015). Speciation dynamics during the global radiation
of extant bats. Evolution, 69(6), 1528–1545.
Sober E. (1988). Reconstructing the Past: Parsimony, Evolution, and Inference.
Cambridge (MA): MIT Press.
Soltis ED, Soltis PS. (2000). Contributions of plant molecular systematics to studies of
molecular evolution. Plant Molecular Biology 42: 45-75.
Stamatakis A. (2014). RAxML version 8: a tool for phylogenetic analysis
and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313.
Stamatakis A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic
analyses with thousands of taxa and mixed models. Bioinformatics, 22(21): 2688–2690.
Stamatakis A, Blagojevic F, Nikolopoulos DS, Antonopoulos CD. (2007). Exploring new
search algorithms and hardware for phylogenetics: RAxML meets the IBM Cell.
J VLSI Signal Process Syst Signal Image Video Technol.
483: 271–286. http://dx.doi.org/10.1007/s11265-007-0067-4
Stamatakis, A. et al. (2010). Maximum likelihood analyses of 3,490 rbcl sequences:
Scalability of comprehensive inference versus group-specific taxon sampling.
Evolutionary Bioinformatics, 6, EBO.S4528.
Struck, T. H. et al. (2015). The evolution of annelids reveals two adaptive routes to the
interstitial realm. Current Biology, 25(15), 1993–1999.
Song, S. et al. (2012). Resolving conflict in eutherian mammal phylogeny using
phylogenomics and the multispecies coalescent model. Proceedings of the National
Academy of Sciences, 109(37), 14942–14947.
Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. (1996). Phylogenetic inference.
In: Hillis DM, Moritz C, Mable BK, editors. Molecular systematics. Sunderland (MA):
Sinauer Associates. p. 407–514.
Tarver, J. E. et al. (2016). The interrelationships of placental mammals and the limits
of phylogenetic inference. Genome Biology and Evolution, 8(2), 330–344.
Watson, J. and Crick, F. (1953). Molecular Structure of Nucleic Acids: A Structure for
Deoxyribose Nucleic Acid.
Whelan S, Morrison DA. (2017). Inferring trees. Methods Mol Biol.1525:349–377.
Whelan, N. V. et al. (2015). Error, signal, and the placement of Ctenophora sister to all
other animals. Proceedings of the National Academy of Sciences, 112(18), 5773–5778.
Wickett, N. J. et al. (2014). Phylotranscriptomic analysis of the origin and early
diversification of land plants. Proceedings of the National Academy of Sciences,
111(45), E4859–E4868.
Woese, C., Kandler, O. and Wheelis, M. (1990). Towards a natural system of organisms:
proposal for the domains Archaea, Bacteria, and Eucarya.
Xi, Z. et al. (2014). Coalescent versus concatenation methods and the placement of
amborella as sister to water lilies. Systematic biology, 63(6), 919–932.
Yang Z, Rannala B. (2012). Molecular phylogenetics: principles and practice.
Nat Rev Genet. 13(5): 303–314.
Yang, Y. et al. (2015). Dissecting molecular evolution in the highly diverse plant clade
caryophyllales using transcriptome sequencing. Molecular Biology and Evolution,
32(8), 2001–2014.
Zwickl D. (2006) Genetic algorithm approaches for the phylogenetic analysis of large
biological sequence datasets under the maximum likelihood criterion, TX University of
Texas at Austin PhD thesis
Zhou, X. et al. (2018). Evaluating fast maximum likelihood-based phylogenetic
programs using empirical phylogenomic data sets. Molecular biology and evolution,
35(2), 486-503
7. Appendix
7.1 Final project description
IQ-TREE is a widely used software for phylogenetic analysis from genome-scale data.
It has accumulated >1,300 citations for three novel and fast methods for model selection,
tree search algorithm and bootstrap approximation. This project was to test, verify,
benchmark and compare IQ-TREE with other existing software in phylogenetics,
including RAxML and RAxML-NG. The phylogenetic analysis was applied to a large
collection of publicly available datasets of DNA and amino acid alignments with
comprehensive metadata.
By the completion of this project, the student had a good understanding of the literature
review of phylogenetic analysis and showed good skills in developing and
benchmarking large-scale software. The student also wrote a report to communicate the
knowledge gained under the project.
7.2 Project contract
7.3 Artefacts
7.3.1 List of all program code files
The folder "projectcode" contains the scripts "pipeline.py" and "plotforproject.R"
that were used for the twenty datasets, two csv tables containing the results, and a
subfolder called "plot" containing all plots for the twenty datasets.
The folder "testcode" contains the scripts "test.py" and "plotfortest.R" that you can
run on your own computer to test the pipeline.
The folder "example" contains two simple datasets that you can use as inputs for the
test script.
Since the twenty datasets are complex, we provide two simple alignments and a test
version of the pipeline. The two scripts "pipeline.py" and "test.py" differ in some
parameters. If you want to run the pipeline on your own computer, we recommend that
you use test.py and the example alignments we provide. Before testing, please make
sure that all software is installed on your computer and that the versions are the
same as those listed in this report.
All code was implemented by Qiuyue Wang.
7.3.2 Details of testing code
All code has been tested for correctness. The test code is included in test.py.
7.3.3 Experimental environment
See Table 2 in chapter 3.3: Hardware and software configuration of GDU server.
See Table 3 in chapter 3.4: Data sets with varying characteristics for evaluating.
Compilers and versions: Python 2.7, R 3.5.3.
7.4 README file
This project provides an automated pipeline for evaluating several phylogenetic
programs from different aspects.
Author: Qiuyue Wang(u6342378) Please contact [email protected] if you have
any questions.
FOLDER DESCRIPTION
-----------------
The folder "datasets" contains following 20 datasets.
https://cloudstor.aarnet.edu.au/plus/s/hdnxvQaSyr225pC
The folder "outputs" contains all output files of these 20 datasets.
https://cloudstor.aarnet.edu.au/plus/s/7ClfKmP82mKP42K
The above two folders are so large that I provide links here to download them.
The folder "projectcode" contains the scripts "pipeline.py" and "plotforproject.R"
that were used for the twenty datasets, two csv tables containing the results, and a
subfolder called "plot" containing all plots for the twenty datasets.
The folder "testcode" contains the scripts "test.py" and "plotfortest.R" that you can
run on your own computer to test the pipeline.
The folder "example" contains two simple datasets that you can use as inputs for the
test script.
Hint: Since the twenty datasets are complex, we provide two simple alignments and
a test version of the pipeline. The two scripts "pipeline.py" and "test.py" differ in
some parameters. If you want to run the pipeline on your own computer, we recommend
that you use test.py and the example alignments we provide. Before testing, please
make sure that all software is installed on your computer and that the versions are
the same as those listed below.
TEST OPERATION EXAMPLE
-----------------
1. Type "python test.py" on your command line and press Enter.
2. Enter the path of the folder that contains all datasets and press Enter.
e.g. "example"
(On Linux, you also need to type the quotation marks. Enter the folder's name rather
than an alignment's name, since the code traverses all subfolders under this folder.)
3. Enter the number of threads you want to use and press Enter.
e.g. 2
(We recommend a small number since the example alignments are simple; otherwise some
software, such as RAxML-NG, may report an error.)
4. Enter how many trees you would like to obtain from each software.
e.g. 5
(We recommend 5: with too few trees the violin plots may contain only points, while
too many trees will take more time.)
5. You will then get all output files and two csv files. "lhscore.csv" contains each
tree's log-likelihood score. "result.csv" contains the running time, parallel efficiency
and memory usage for each tree inference.
6. If the rpy2 package is installed in your Python environment, R can be called directly
to draw the violin plots; otherwise, open RStudio or another R tool and run
"plotfortest.R" after the two csv tables have been produced. You will then get four violin plots.
Hint: RAxML only supports an absolute path for the output directory. You need to change it
before use.
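Step 2 relies on the script visiting every subfolder of the folder you enter. The sketch below illustrates that kind of traversal. It is not taken from test.py: the function name find_alignments and the list of file extensions are assumptions made for illustration only.

```python
import os


def find_alignments(root, extensions=(".phy", ".fasta", ".nex")):
    """Return sorted paths of alignment-like files under root.

    The extension list is a guess for illustration; adjust it to
    whatever file types your datasets actually use.
    """
    found = []
    # os.walk visits root and every subfolder below it.
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.lower().endswith(extensions):
                found.append(os.path.join(folder, name))
    return sorted(found)


if __name__ == "__main__":
    # "example" is the folder name used in step 2.
    for path in find_alignments("example"):
        print(path)
```

Because the traversal starts from a folder, passing a single alignment file instead of its parent folder would yield nothing, which is why step 2 asks for the folder's name.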
SOFTWARE VERSIONS
-----------------
RAxML(8.2.12)
IQ-TREE(1.6.10)
RAxML-NG(0.6.0)
COMPARISON
-----------------
Maximum log-likelihood score
Running time
Parallel efficiency
Memory usage
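Of these four criteria, parallel efficiency is the least self-explanatory. One common way to estimate it is to divide the CPU time by the wall-clock time multiplied by the number of threads. The sketch below uses that definition, which may differ from the one used in pipeline.py; the function name and its arguments are illustrative assumptions.

```python
def parallel_efficiency(cpu_seconds, wall_seconds, threads):
    """Percentage of the available CPU capacity that was actually used.

    Assumed definition: CPU time / (wall-clock time * threads) * 100.
    """
    if wall_seconds <= 0 or threads <= 0:
        raise ValueError("wall_seconds and threads must be positive")
    return 100.0 * cpu_seconds / (wall_seconds * threads)
```

Under this definition, a value close to 100% means all threads were busy for almost the whole run, while a lower value means some threads were idle for part of it.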
INPUT DATA (in project)
----------
For historical reasons, directory/file names are slightly different from the dataset
names used in the paper.
Please see the mapping below:
Paper Directory/file name
------ ---------------------
SongD1 dna_rokasD1
MisoD2b dna_rokasD2b
WickD3a dna_rokasD3a
WickD3b dna_rokasD3b
XiD4 dna_rokasD4
PrumD6 dna_rokasD6
TarvD7 dna_rokasD7
PeteD8 dna_hymeALL
ShiD9 dna_ShiD9
StamD10 dna_StamD10
NagyA1 aa_rokasA1
MisoA2 aa_rokasA2
WickA3 aa_rokasA3
ChenA4 aa_rokasA4
StruA5 aa_rokasA5
BoroA6 aa_rokasA6
WhelA7 aa_rokasA7
YangA8 aa_rokasA8
ShenA9 aa_rokasA9
GitzA12 aa_GitzA12
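For scripted lookups, the mapping above can be kept as a dictionary. The entries are copied from the table; the variable names are illustrative and are not part of the project code.

```python
# Paper dataset name -> directory/file name, copied from the mapping table.
PAPER_TO_DIR = {
    "SongD1": "dna_rokasD1",
    "MisoD2b": "dna_rokasD2b",
    "WickD3a": "dna_rokasD3a",
    "WickD3b": "dna_rokasD3b",
    "XiD4": "dna_rokasD4",
    "PrumD6": "dna_rokasD6",
    "TarvD7": "dna_rokasD7",
    "PeteD8": "dna_hymeALL",
    "ShiD9": "dna_ShiD9",
    "StamD10": "dna_StamD10",
    "NagyA1": "aa_rokasA1",
    "MisoA2": "aa_rokasA2",
    "WickA3": "aa_rokasA3",
    "ChenA4": "aa_rokasA4",
    "StruA5": "aa_rokasA5",
    "BoroA6": "aa_rokasA6",
    "WhelA7": "aa_rokasA7",
    "YangA8": "aa_rokasA8",
    "ShenA9": "aa_rokasA9",
    "GitzA12": "aa_GitzA12",
}

# Reverse lookup: directory/file name -> paper dataset name.
DIR_TO_PAPER = {directory: paper for paper, directory in PAPER_TO_DIR.items()}
```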
INPUT DATA (test)
----------
Dataset Source
------ ---------------------
dna_WoroD1 https://github.com/roblanf/BenchmarkAlignments
aa_NguyA1 https://github.com/roblanf/BenchmarkAlignments
7.5 Maximum log-likelihood scores for all tree inferences
dataset software seed ML score Best ML score
aa_GitzA12 iqtree p1 -3403341.11 -3403037.59
aa_GitzA12 iqtree p2 -3403135.13 -3403037.59
aa_GitzA12 iqtree p3 -3403246.50 -3403037.59
aa_GitzA12 iqtree p4 -3403179.69 -3403037.59
aa_GitzA12 iqtree p5 -3403186.61 -3403037.59
aa_GitzA12 raxng p1 -3403037.59 -3403037.59
aa_GitzA12 raxng p2 -3403045.17 -3403037.59
aa_GitzA12 raxng p3 -3403072.51 -3403037.59
aa_GitzA12 raxng p4 -3403076.12 -3403037.59
aa_GitzA12 raxng p5 -3403228.34 -3403037.59
aa_GitzA12 raxml p1 -3403216.13 -3403037.59
aa_GitzA12 raxml p2 -3403307.63 -3403037.59
aa_GitzA12 raxml p3 -3403587.25 -3403037.59
aa_GitzA12 raxml p4 -3403203.77 -3403037.59
aa_GitzA12 raxml p5 -3403448.12 -3403037.59
aa_rokasA1 iqtree p1 -5017861.13 -5017860.86
aa_rokasA1 iqtree p2 -5017861.13 -5017860.86
aa_rokasA1 iqtree p3 -5017861.13 -5017860.86
aa_rokasA1 iqtree p4 -5017861.13 -5017860.86
aa_rokasA1 iqtree p5 -5017861.10 -5017860.86
aa_rokasA1 raxng p1 -5017860.86 -5017860.86
aa_rokasA1 raxng p2 -5017860.86 -5017860.86
aa_rokasA1 raxng p3 -5017860.86 -5017860.86
aa_rokasA1 raxng p4 -5017860.86 -5017860.86
aa_rokasA1 raxng p5 -5017860.86 -5017860.86
aa_rokasA1 raxml p1 -5017872.46 -5017860.86
aa_rokasA1 raxml p2 -5017872.46 -5017860.86
aa_rokasA1 raxml p3 -5017872.46 -5017860.86
aa_rokasA1 raxml p4 -5017872.46 -5017860.86
aa_rokasA1 raxml p5 -5017872.46 -5017860.86
aa_rokasA2 iqtree p1 -30793593.82 -30793593.82
aa_rokasA2 iqtree p2 -30793593.83 -30793593.82
aa_rokasA2 iqtree p3 -30793593.83 -30793593.82
aa_rokasA2 iqtree p4 -30793593.83 -30793593.82
aa_rokasA2 iqtree p5 -30793593.83 -30793593.82
aa_rokasA2 raxng p1 -30793748.05 -30793593.82
aa_rokasA2 raxng p2 -30793748.03 -30793593.82
aa_rokasA2 raxng p3 -30793748.04 -30793593.82
aa_rokasA2 raxng p4 -30793748.04 -30793593.82
aa_rokasA2 raxng p5 -30793757.03 -30793593.82
aa_rokasA2 raxml p1 -30793646.44 -30793593.82
aa_rokasA2 raxml p2 -30793646.44 -30793593.82
aa_rokasA2 raxml p3 -30793646.44 -30793593.82
aa_rokasA2 raxml p4 -30793646.44 -30793593.82
aa_rokasA2 raxml p5 -30793646.44 -30793593.82
aa_rokasA3 iqtree p1 -8481070.89 -8481068.17
aa_rokasA3 iqtree p2 -8481070.89 -8481068.17
aa_rokasA3 iqtree p3 -8481070.89 -8481068.17
aa_rokasA3 iqtree p4 -8481070.89 -8481068.17
aa_rokasA3 iqtree p5 -8481070.89 -8481068.17
aa_rokasA3 raxng p1 -8481068.18 -8481068.17
aa_rokasA3 raxng p2 -8481068.17 -8481068.17
aa_rokasA3 raxng p3 -8481068.17 -8481068.17
aa_rokasA3 raxng p4 -8481068.17 -8481068.17
aa_rokasA3 raxng p5 -8481068.17 -8481068.17
aa_rokasA3 raxml p1 -8481070.89 -8481068.17
aa_rokasA3 raxml p2 -8481070.89 -8481068.17
aa_rokasA3 raxml p3 -8481070.89 -8481068.17
aa_rokasA3 raxml p4 -8481070.89 -8481068.17
aa_rokasA3 raxml p5 -8481070.89 -8481068.17
aa_rokasA4 iqtree p1 -40909393.07 -40909392.88
aa_rokasA4 iqtree p2 -40909393.07 -40909392.88
aa_rokasA4 iqtree p3 -40909393.07 -40909392.88
aa_rokasA4 iqtree p4 -40909393.07 -40909392.88
aa_rokasA4 iqtree p5 -40909393.07 -40909392.88
aa_rokasA4 raxng p1 -40909392.97 -40909392.88
aa_rokasA4 raxng p2 -40909392.98 -40909392.88
aa_rokasA4 raxng p3 -40909392.88 -40909392.88
aa_rokasA4 raxng p4 -40909392.88 -40909392.88
aa_rokasA4 raxng p5 -40909392.88 -40909392.88
aa_rokasA4 raxml p1 -40909393.03 -40909392.88
aa_rokasA4 raxml p2 -40909393.03 -40909392.88
aa_rokasA4 raxml p3 -40909393.03 -40909392.88
aa_rokasA4 raxml p4 -40909393.03 -40909392.88
aa_rokasA4 raxml p5 -40909393.03 -40909392.88
aa_rokasA5 iqtree p1 -5028496.39 -5028495.38
aa_rokasA5 iqtree p2 -5028495.41 -5028495.38
aa_rokasA5 iqtree p3 -5028496.38 -5028495.38
aa_rokasA5 iqtree p4 -5028496.38 -5028495.38
aa_rokasA5 iqtree p5 -5028495.60 -5028495.38
aa_rokasA5 raxng p1 -5028495.38 -5028495.38
aa_rokasA5 raxng p2 -5028495.39 -5028495.38
aa_rokasA5 raxng p3 -5028495.38 -5028495.38
aa_rokasA5 raxng p4 -5028495.40 -5028495.38
aa_rokasA5 raxng p5 -5028495.50 -5028495.38
aa_rokasA5 raxml p1 -5028663.61 -5028495.38
aa_rokasA5 raxml p2 -5028689.75 -5028495.38
aa_rokasA5 raxml p3 -5028655.72 -5028495.38
aa_rokasA5 raxml p4 -5028683.45 -5028495.38
aa_rokasA5 raxml p5 -5028665.45 -5028495.38
aa_rokasA6 iqtree p1 -15164451.68 -15164441.89
aa_rokasA6 iqtree p2 -15164451.68 -15164441.89
aa_rokasA6 iqtree p3 -15164451.68 -15164441.89
aa_rokasA6 iqtree p4 -15164451.68 -15164441.89
aa_rokasA6 iqtree p5 -15164451.68 -15164441.89
aa_rokasA6 raxng p1 -15164453.22 -15164441.89
aa_rokasA6 raxng p2 -15164453.22 -15164441.89
aa_rokasA6 raxng p3 -15164453.06 -15164441.89
aa_rokasA6 raxng p4 -15164453.06 -15164441.89
aa_rokasA6 raxng p5 -15164453.22 -15164441.89
aa_rokasA6 raxml p1 -15164442.04 -15164441.89
aa_rokasA6 raxml p2 -15164442.04 -15164441.89
aa_rokasA6 raxml p3 -15164442.04 -15164441.89
aa_rokasA6 raxml p4 -15164442.04 -15164441.89
aa_rokasA6 raxml p5 -15164441.89 -15164441.89
aa_rokasA7 iqtree p1 -2894956.80 -2894956.74
aa_rokasA7 iqtree p2 -2894956.80 -2894956.74
aa_rokasA7 iqtree p3 -2894956.80 -2894956.74
aa_rokasA7 iqtree p4 -2894956.80 -2894956.74
aa_rokasA7 iqtree p5 -2894956.80 -2894956.74
aa_rokasA7 raxng p1 -2894961.62 -2894956.74
aa_rokasA7 raxng p2 -2894961.62 -2894956.74
aa_rokasA7 raxng p3 -2894961.62 -2894956.74
aa_rokasA7 raxng p4 -2894961.62 -2894956.74
aa_rokasA7 raxng p5 -2894961.62 -2894956.74
aa_rokasA7 raxml p1 -2894956.74 -2894956.74
aa_rokasA7 raxml p2 -2894956.74 -2894956.74
aa_rokasA7 raxml p3 -2894956.74 -2894956.74
aa_rokasA7 raxml p4 -2894956.74 -2894956.74
aa_rokasA7 raxml p5 -2894956.74 -2894956.74
aa_rokasA8 iqtree p1 -20012134.65 -20012134.63
aa_rokasA8 iqtree p2 -20012134.65 -20012134.63
aa_rokasA8 iqtree p3 -20012134.65 -20012134.63
aa_rokasA8 iqtree p4 -20012134.65 -20012134.63
aa_rokasA8 iqtree p5 -20012134.65 -20012134.63
aa_rokasA8 raxng p1 -20012134.65 -20012134.63
aa_rokasA8 raxng p2 -20012134.65 -20012134.63
aa_rokasA8 raxng p3 -20012134.65 -20012134.63
aa_rokasA8 raxng p4 -20012134.65 -20012134.63
aa_rokasA8 raxng p5 -20012134.65 -20012134.63
aa_rokasA8 raxml p1 -20012134.63 -20012134.63
aa_rokasA8 raxml p2 -20012134.63 -20012134.63
aa_rokasA8 raxml p3 -20012134.63 -20012134.63
aa_rokasA8 raxml p4 -20012134.63 -20012134.63
aa_rokasA8 raxml p5 -20012134.63 -20012134.63
aa_rokasA9 iqtree p1 -53493548.18 -53493548.14
aa_rokasA9 iqtree p2 -53493548.18 -53493548.14
aa_rokasA9 iqtree p3 -53493548.19 -53493548.14
aa_rokasA9 iqtree p4 -53493548.18 -53493548.14
aa_rokasA9 iqtree p5 -53493548.18 -53493548.14
aa_rokasA9 raxng p1 -53494590.10 -53493548.14
aa_rokasA9 raxng p2 -53493548.28 -53493548.14
aa_rokasA9 raxng p3 -53493548.25 -53493548.14
aa_rokasA9 raxng p4 -53493548.31 -53493548.14
aa_rokasA9 raxng p5 -53494590.03 -53493548.14
aa_rokasA9 raxml p1 -53493548.14 -53493548.14
aa_rokasA9 raxml p2 -53493548.14 -53493548.14
aa_rokasA9 raxml p3 -53493548.14 -53493548.14
aa_rokasA9 raxml p4 -53493548.14 -53493548.14
aa_rokasA9 raxml p5 -53493548.14 -53493548.14
dna_hymeALL iqtree p1 -74022353.87 -74019633.13
dna_hymeALL iqtree p2 -74022287.04 -74019633.13
dna_hymeALL iqtree p3 -74022287.98 -74019633.13
dna_hymeALL iqtree p4 -74022275.78 -74019633.13
dna_hymeALL iqtree p5 -74022286.41 -74019633.13
dna_hymeALL raxng p1 -74019818.05 -74019633.13
dna_hymeALL raxng p2 -74019654.31 -74019633.13
dna_hymeALL raxng p3 -74019827.67 -74019633.13
dna_hymeALL raxng p4 -74019633.13 -74019633.13
dna_hymeALL raxng p5 -74019861.01 -74019633.13
dna_hymeALL raxml p1 -74022094.82 -74019633.13
dna_hymeALL raxml p2 -74022104.77 -74019633.13
dna_hymeALL raxml p3 -74022094.82 -74019633.13
dna_hymeALL raxml p4 -74022076.52 -74019633.13
dna_hymeALL raxml p5 -74021926.97 -74019633.13
dna_rokasD1 iqtree p1 -12715376.04 -12715375.68
dna_rokasD1 iqtree p2 -12715376.04 -12715375.68
dna_rokasD1 iqtree p3 -12715376.04 -12715375.68
dna_rokasD1 iqtree p4 -12715376.04 -12715375.68
dna_rokasD1 iqtree p5 -12715376.04 -12715375.68
dna_rokasD1 raxng p1 -12715375.70 -12715375.68
dna_rokasD1 raxng p2 -12715375.74 -12715375.68
dna_rokasD1 raxng p3 -12715375.69 -12715375.68
dna_rokasD1 raxng p4 -12715375.68 -12715375.68
dna_rokasD1 raxng p5 -12715375.80 -12715375.68
dna_rokasD1 raxml p1 -12715378.39 -12715375.68
dna_rokasD1 raxml p2 -12715378.32 -12715375.68
dna_rokasD1 raxml p3 -12715378.41 -12715375.68
dna_rokasD1 raxml p4 -12715378.47 -12715375.68
dna_rokasD1 raxml p5 -12715378.29 -12715375.68
dna_rokasD2b iqtree p1 -13252627.81 -13230654.38
dna_rokasD2b iqtree p2 -13252628.01 -13230654.38
dna_rokasD2b iqtree p3 -13252628.34 -13230654.38
dna_rokasD2b iqtree p4 -13252628.20 -13230654.38
dna_rokasD2b iqtree p5 -13252628.06 -13230654.38
dna_rokasD2b raxng p1 -13230654.51 -13230654.38
dna_rokasD2b raxng p2 -13230654.46 -13230654.38
dna_rokasD2b raxng p3 -13230654.62 -13230654.38
dna_rokasD2b raxng p4 -13230654.45 -13230654.38
dna_rokasD2b raxng p5 -13230654.38 -13230654.38
dna_rokasD2b raxml p1 -13230836.81 -13230654.38
dna_rokasD2b raxml p2 -13230840.71 -13230654.38
dna_rokasD2b raxml p3 -13230836.81 -13230654.38
dna_rokasD2b raxml p4 -13230866.91 -13230654.38
dna_rokasD2b raxml p5 -13230866.91 -13230654.38
dna_rokasD3a iqtree p1 -18545579.18 -18545572.54
dna_rokasD3a iqtree p2 -18545579.17 -18545572.54
dna_rokasD3a iqtree p3 -18545579.22 -18545572.54
dna_rokasD3a iqtree p4 -18545579.25 -18545572.54
dna_rokasD3a iqtree p5 -18545579.17 -18545572.54
dna_rokasD3a raxng p1 -18545572.54 -18545572.54
dna_rokasD3a raxng p2 -18545572.56 -18545572.54
dna_rokasD3a raxng p3 -18545572.63 -18545572.54
dna_rokasD3a raxng p4 -18545572.54 -18545572.54
dna_rokasD3a raxng p5 -18545572.56 -18545572.54
dna_rokasD3a raxml p1 -18545730.62 -18545572.54
dna_rokasD3a raxml p2 -18546391.82 -18545572.54
dna_rokasD3a raxml p3 -18545730.62 -18545572.54
dna_rokasD3a raxml p4 -18545969.03 -18545572.54
dna_rokasD3a raxml p5 -18545937.62 -18545572.54
dna_rokasD3b iqtree p1 -8832955.09 -8832955.07
dna_rokasD3b iqtree p2 -8832955.10 -8832955.07
dna_rokasD3b iqtree p3 -8832955.09 -8832955.07
dna_rokasD3b iqtree p4 -8832955.10 -8832955.07
dna_rokasD3b iqtree p5 -8832955.10 -8832955.07
dna_rokasD3b raxng p1 -8832956.74 -8832955.07
dna_rokasD3b raxng p2 -8832955.08 -8832955.07
dna_rokasD3b raxng p3 -8832955.07 -8832955.07
dna_rokasD3b raxng p4 -8832956.74 -8832955.07
dna_rokasD3b raxng p5 -8832955.08 -8832955.07
dna_rokasD3b raxml p1 -8832957.31 -8832955.07
dna_rokasD3b raxml p2 -8832957.31 -8832955.07
dna_rokasD3b raxml p3 -8832957.31 -8832955.07
dna_rokasD3b raxml p4 -8832957.31 -8832955.07
dna_rokasD3b raxml p5 -8832957.31 -8832955.07
dna_rokasD4 iqtree p1 -4442102.57 -4442102.52
dna_rokasD4 iqtree p2 -4442102.60 -4442102.52
dna_rokasD4 iqtree p3 -4442102.57 -4442102.52
dna_rokasD4 iqtree p4 -4442102.57 -4442102.52
dna_rokasD4 iqtree p5 -4442102.57 -4442102.52
dna_rokasD4 raxng p1 -4442102.54 -4442102.52
dna_rokasD4 raxng p2 -4442102.52 -4442102.52
dna_rokasD4 raxng p3 -4442102.57 -4442102.52
dna_rokasD4 raxng p4 -4442102.53 -4442102.52
dna_rokasD4 raxng p5 -4442102.54 -4442102.52
dna_rokasD4 raxml p1 -4442103.37 -4442102.52
dna_rokasD4 raxml p2 -4442102.94 -4442102.52
dna_rokasD4 raxml p3 -4442103.11 -4442102.52
dna_rokasD4 raxml p4 -4442103.46 -4442102.52
dna_rokasD4 raxml p5 -4442104.64 -4442102.52
dna_rokasD6 iqtree p1 -9514265.34 -9514265.28
dna_rokasD6 iqtree p2 -9514265.32 -9514265.28
dna_rokasD6 iqtree p3 -9514265.29 -9514265.28
dna_rokasD6 iqtree p4 -9514265.28 -9514265.28
dna_rokasD6 iqtree p5 -9514265.35 -9514265.28
dna_rokasD6 raxng p1 -9514285.91 -9514265.28
dna_rokasD6 raxng p2 -9514285.92 -9514265.28
dna_rokasD6 raxng p3 -9514285.93 -9514265.28
dna_rokasD6 raxng p4 -9514285.91 -9514265.28
dna_rokasD6 raxng p5 -9514285.92 -9514265.28
dna_rokasD6 raxml p1 -9514496.52 -9514265.28
dna_rokasD6 raxml p2 -9514491.93 -9514265.28
dna_rokasD6 raxml p3 -9514438.41 -9514265.28
dna_rokasD6 raxml p4 -9514496.51 -9514265.28
dna_rokasD6 raxml p5 -9514461.89 -9514265.28
dna_rokasD7 iqtree p1 -115119111.10 -115119111.08
dna_rokasD7 iqtree p2 -115119111.08 -115119111.08
dna_rokasD7 iqtree p3 -115119111.08 -115119111.08
dna_rokasD7 iqtree p4 -115119111.08 -115119111.08
dna_rokasD7 iqtree p5 -115119111.30 -115119111.08
dna_rokasD7 raxng p1 -115119132.06 -115119111.08
dna_rokasD7 raxng p2 -115119131.93 -115119111.08
dna_rokasD7 raxng p3 -115119131.89 -115119111.08
dna_rokasD7 raxng p4 -115119131.90 -115119111.08
dna_rokasD7 raxng p5 -115119131.88 -115119111.08
dna_rokasD7 raxml p1 -115119112.52 -115119111.08
dna_rokasD7 raxml p2 -115119112.50 -115119111.08
dna_rokasD7 raxml p3 -115119112.51 -115119111.08
dna_rokasD7 raxml p4 -115119112.51 -115119111.08
dna_rokasD7 raxml p5 -115119112.78 -115119111.08
dna_ShiD9 iqtree p1 -584879.25 -584549.80
dna_ShiD9 iqtree p2 -584920.55 -584549.80
dna_ShiD9 iqtree p3 -584868.20 -584549.80
dna_ShiD9 iqtree p4 -584861.39 -584549.80
dna_ShiD9 iqtree p5 -584885.19 -584549.80
dna_ShiD9 raxng p1 -584549.80 -584549.80
dna_ShiD9 raxng p2 -584580.02 -584549.80
dna_ShiD9 raxng p3 -584565.51 -584549.80
dna_ShiD9 raxng p4 -585070.55 -584549.80
dna_ShiD9 raxng p5 -584568.11 -584549.80
dna_ShiD9 raxml p1 -584920.52 -584549.80
dna_ShiD9 raxml p2 -584925.20 -584549.80
dna_ShiD9 raxml p3 -584934.56 -584549.80
dna_ShiD9 raxml p4 -584952.16 -584549.80
dna_ShiD9 raxml p5 -584928.13 -584549.80
dna_StamD10 iqtree p1 -30629.44 -30623.28
dna_StamD10 iqtree p2 -30626.44 -30623.28
dna_StamD10 iqtree p3 -30625.74 -30623.28
dna_StamD10 iqtree p4 -30627.09 -30623.28
dna_StamD10 iqtree p5 -30629.19 -30623.28
dna_StamD10 raxng p1 -30627.57 -30623.28
dna_StamD10 raxng p2 -30623.28 -30623.28
dna_StamD10 raxng p3 -30627.01 -30623.28
dna_StamD10 raxng p4 -30623.73 -30623.28
dna_StamD10 raxng p5 -30627.61 -30623.28
dna_StamD10 raxml p1 -30624.78 -30623.28
dna_StamD10 raxml p2 -30636.85 -30623.28
dna_StamD10 raxml p3 -30633.61 -30623.28
dna_StamD10 raxml p4 -30625.94 -30623.28
dna_StamD10 raxml p5 -30631.30 -30623.28
Table 7: All log-likelihood scores extracted from the output files, together with the
log-likelihood score of the best tree found for each dataset. For ease of display, all
values are rounded to two decimal places.
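The gap between each score and the best score for its dataset can be recomputed from the listed values. The sketch below does this for a few rows copied from Table 7. The column names in the sample text are assumptions, since the exact layout of "lhscore.csv" is not shown here, and the function name is illustrative.

```python
import csv
import io

# A few rows copied from Table 7; the header names are hypothetical.
SAMPLE = """dataset,software,seed,ml_score
dna_StamD10,iqtree,p1,-30629.44
dna_StamD10,raxng,p2,-30623.28
dna_StamD10,raxml,p1,-30624.78
dna_ShiD9,iqtree,p4,-584861.39
dna_ShiD9,raxng,p1,-584549.80
dna_ShiD9,raxml,p1,-584920.52
"""


def gaps_to_best(csv_text):
    """Return (best score per dataset, gap to that best score per run)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = {}
    for row in rows:
        score = float(row["ml_score"])
        name = row["dataset"]
        # Log-likelihoods are negative; the best one is the largest.
        if name not in best or score > best[name]:
            best[name] = score
    gaps = {}
    for row in rows:
        key = (row["dataset"], row["software"], row["seed"])
        gaps[key] = round(float(row["ml_score"]) - best[row["dataset"]], 2)
    return best, gaps


if __name__ == "__main__":
    best, gaps = gaps_to_best(SAMPLE)
    for key, gap in sorted(gaps.items()):
        print(key, gap)
```

A gap of zero marks the run that found the best tree for that dataset; more negative gaps indicate trees further from the best score.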
7.6 Runtime, memory usage and parallel efficiency for all tree inferences
dataset software seed time(h) Memory(GB) efficiency(%)
aa_GitzA12 iqtree p1 88.39 89.22 75.06
aa_GitzA12 iqtree p2 106.50 89.22 76.19
aa_GitzA12 iqtree p3 59.00 89.22 71.50
aa_GitzA12 iqtree p4 79.03 89.20 72.50
aa_GitzA12 iqtree p5 53.33 89.21 73.81
aa_GitzA12 raxng p1 29.12 43.32 99.88
aa_GitzA12 raxng p2 35.14 43.41 99.88
aa_GitzA12 raxng p3 27.49 44.46 99.88
aa_GitzA12 raxng p4 29.48 44.26 99.88
aa_GitzA12 raxng p5 27.73 44.21 99.63
aa_GitzA12 raxml p1 26.50 82.99 99.88
aa_GitzA12 raxml p2 27.66 82.99 99.88
aa_GitzA12 raxml p3 28.76 82.99 99.88
aa_GitzA12 raxml p4 29.24 83.00 99.88
aa_GitzA12 raxml p5 25.93 82.99 99.88
aa_rokasA1 iqtree p1 2.53 16.96 75.25
aa_rokasA1 iqtree p2 2.73 16.96 78.44
aa_rokasA1 iqtree p3 2.65 16.96 76.69
aa_rokasA1 iqtree p4 2.68 16.96 76.44
aa_rokasA1 iqtree p5 2.55 16.92 75.25
aa_rokasA1 raxng p1 2.93 36.68 99.88
aa_rokasA1 raxng p2 3.97 37.52 98.25
aa_rokasA1 raxng p3 3.08 36.47 99.63
aa_rokasA1 raxng p4 4.03 37.58 99.69
aa_rokasA1 raxng p5 4.19 37.15 99.88
aa_rokasA1 raxml p1 9.97 30.42 99.88
aa_rokasA1 raxml p2 8.14 30.42 99.81
aa_rokasA1 raxml p3 9.52 30.42 99.88
aa_rokasA1 raxml p4 8.08 30.42 99.88
aa_rokasA1 raxml p5 7.47 30.42 99.88
aa_rokasA2 iqtree p1 48.31 163.78 56.44
aa_rokasA2 iqtree p2 48.48 163.78 59.19
aa_rokasA2 iqtree p3 55.49 163.81 60.69
aa_rokasA2 iqtree p4 38.74 163.81 55.38
aa_rokasA2 iqtree p5 43.12 163.80 50.31
aa_rokasA2 raxng p1 24.12 114.00 99.88
aa_rokasA2 raxng p2 30.84 113.82 99.88
aa_rokasA2 raxng p3 39.99 114.68 99.75
aa_rokasA2 raxng p4 28.16 110.74 99.88
aa_rokasA2 raxng p5 29.27 110.19 99.88
aa_rokasA2 raxml p1 58.60 148.18 99.88
aa_rokasA2 raxml p2 61.15 148.18 99.88
aa_rokasA2 raxml p3 53.91 148.18 99.88
aa_rokasA2 raxml p4 43.85 148.18 99.88
aa_rokasA2 raxml p5 39.30 148.18 99.88
aa_rokasA3 iqtree p1 17.20 38.25 75.31
aa_rokasA3 iqtree p2 12.04 38.25 70.25
aa_rokasA3 iqtree p3 11.17 38.24 74.50
aa_rokasA3 iqtree p4 14.64 38.26 73.75
aa_rokasA3 iqtree p5 9.86 38.24 73.44
aa_rokasA3 raxng p1 7.66 23.75 99.88
aa_rokasA3 raxng p2 8.88 23.91 99.50
aa_rokasA3 raxng p3 8.90 22.34 99.88
aa_rokasA3 raxng p4 7.50 24.59 99.88
aa_rokasA3 raxng p5 6.60 22.59 99.81
aa_rokasA3 raxml p1 10.78 35.55 99.88
aa_rokasA3 raxml p2 8.34 35.55 99.88
aa_rokasA3 raxml p3 8.03 35.40 99.88
aa_rokasA3 raxml p4 8.29 35.54 99.88
aa_rokasA3 raxml p5 9.82 35.55 99.88
aa_rokasA4 iqtree p1 44.61 233.36 59.88
aa_rokasA4 iqtree p2 45.42 233.35 59.06
aa_rokasA4 iqtree p3 41.97 233.32 59.06
aa_rokasA4 iqtree p4 45.02 233.34 62.13
aa_rokasA4 iqtree p5 40.48 233.34 68.13
aa_rokasA4 raxng p1 21.77 121.88 99.19
aa_rokasA4 raxng p2 21.45 129.32 99.19
aa_rokasA4 raxng p3 16.91 125.51 99.13
aa_rokasA4 raxng p4 24.63 135.95 99.38
aa_rokasA4 raxng p5 33.56 125.48 99.25
aa_rokasA4 raxml p1 31.53 212.06 99.88
aa_rokasA4 raxml p2 27.86 212.04 99.88
aa_rokasA4 raxml p3 23.05 212.06 99.88
aa_rokasA4 raxml p4 24.21 212.06 99.88
aa_rokasA4 raxml p5 29.47 212.05 99.88
aa_rokasA5 iqtree p1 10.05 46.36 76.31
aa_rokasA5 iqtree p2 12.44 46.35 76.44
aa_rokasA5 iqtree p3 12.39 46.35 72.81
aa_rokasA5 iqtree p4 10.87 46.35 75.94
aa_rokasA5 iqtree p5 14.63 46.36 74.75
aa_rokasA5 raxng p1 9.75 22.35 99.81
aa_rokasA5 raxng p2 10.83 21.57 99.88
aa_rokasA5 raxng p3 8.97 21.36 99.81
aa_rokasA5 raxng p4 11.09 23.70 99.81
aa_rokasA5 raxng p5 9.28 22.18 99.75
aa_rokasA5 raxml p1 14.42 42.49 99.88
aa_rokasA5 raxml p2 26.27 42.49 99.88
aa_rokasA5 raxml p3 24.39 42.46 99.88
aa_rokasA5 raxml p4 18.31 42.49 99.88
aa_rokasA5 raxml p5 20.21 42.46 99.88
aa_rokasA6 iqtree p1 7.08 38.66 70.94
aa_rokasA6 iqtree p2 7.36 38.80 67.75
aa_rokasA6 iqtree p3 6.69 38.80 74.81
aa_rokasA6 iqtree p4 6.80 38.79 71.94
aa_rokasA6 iqtree p5 6.46 38.81 73.56
aa_rokasA6 raxng p1 4.09 56.29 98.63
aa_rokasA6 raxng p2 4.01 56.37 99.81
aa_rokasA6 raxng p3 4.19 55.04 99.81
aa_rokasA6 raxng p4 4.42 57.52 99.81
aa_rokasA6 raxng p5 3.80 55.68 99.63
aa_rokasA6 raxml p1 7.01 42.42 99.81
aa_rokasA6 raxml p2 6.68 42.42 99.81
aa_rokasA6 raxml p3 7.07 42.43 99.88
aa_rokasA6 raxml p4 6.41 42.42 99.81
aa_rokasA6 raxml p5 6.50 42.42 99.81
aa_rokasA7 iqtree p1 1.63 10.67 78.56
aa_rokasA7 iqtree p2 1.63 10.69 78.56
aa_rokasA7 iqtree p3 1.59 10.69 76.00
aa_rokasA7 iqtree p4 1.48 10.67 78.94
aa_rokasA7 iqtree p5 1.62 10.69 79.75
aa_rokasA7 raxng p1 1.85 15.88 97.94
aa_rokasA7 raxng p2 1.32 16.23 99.69
aa_rokasA7 raxng p3 1.80 15.86 97.94
aa_rokasA7 raxng p4 1.41 16.47 99.69
aa_rokasA7 raxng p5 1.41 15.76 99.81
aa_rokasA7 raxml p1 2.85 12.78 99.88
aa_rokasA7 raxml p2 3.16 10.55 99.81
aa_rokasA7 raxml p3 4.31 10.55 99.81
aa_rokasA7 raxml p4 3.20 10.55 99.88
aa_rokasA7 raxml p5 3.28 10.55 99.88
aa_rokasA8 iqtree p1 21.53 135.01 66.81
aa_rokasA8 iqtree p2 26.64 135.03 64.00
aa_rokasA8 iqtree p3 19.82 135.01 73.56
aa_rokasA8 iqtree p4 27.62 135.01 60.56
aa_rokasA8 iqtree p5 18.03 135.02 66.94
aa_rokasA8 raxng p1 16.41 124.19 99.88
aa_rokasA8 raxng p2 20.46 123.17 99.38
aa_rokasA8 raxng p3 18.78 121.59 99.88
aa_rokasA8 raxng p4 22.99 123.89 99.88
aa_rokasA8 raxng p5 21.69 121.74 99.88
aa_rokasA8 raxml p1 25.66 124.89 99.81
aa_rokasA8 raxml p2 34.99 124.87 99.88
aa_rokasA8 raxml p3 36.09 124.89 99.88
aa_rokasA8 raxml p4 29.44 124.89 99.88
aa_rokasA8 raxml p5 25.00 124.88 99.88
aa_rokasA9 iqtree p1 37.52 143.35 62.13
aa_rokasA9 iqtree p2 34.96 143.33 59.94
aa_rokasA9 iqtree p3 32.61 143.33 69.75
aa_rokasA9 iqtree p4 25.87 143.33 73.44
aa_rokasA9 iqtree p5 30.37 143.33 64.44
aa_rokasA9 raxng p1 21.13 81.18 99.81
aa_rokasA9 raxng p2 23.01 84.10 98.25
aa_rokasA9 raxng p3 26.42 88.91 99.75
aa_rokasA9 raxng p4 25.69 85.11 98.13
aa_rokasA9 raxng p5 22.29 91.97 99.75
aa_rokasA9 raxml p1 24.66 130.73 99.88
aa_rokasA9 raxml p2 35.60 130.73 99.00
aa_rokasA9 raxml p3 35.26 130.73 99.88
aa_rokasA9 raxml p4 25.84 130.72 99.88
aa_rokasA9 raxml p5 24.62 130.73 99.88
dna_hymeALL iqtree p1 78.03 234.09 50.19
dna_hymeALL iqtree p2 88.47 234.08 49.19
dna_hymeALL iqtree p3 77.89 234.09 51.13
dna_hymeALL iqtree p4 71.86 234.09 53.00
dna_hymeALL iqtree p5 69.16 234.08 50.94
dna_hymeALL raxng p1 42.22 219.61 97.19
dna_hymeALL raxng p2 52.59 219.86 99.88
dna_hymeALL raxng p3 81.85 217.66 94.38
dna_hymeALL raxng p4 55.81 219.10 99.69
dna_hymeALL raxng p5 38.26 223.39 99.63
dna_hymeALL raxml p1 89.82 207.43 99.88
dna_hymeALL raxml p2 83.57 207.45 99.88
dna_hymeALL raxml p3 95.41 207.44 99.88
dna_hymeALL raxml p4 67.57 207.46 99.81
dna_hymeALL raxml p5 53.09 207.44 99.81
dna_rokasD1 iqtree p1 2.54 18.45 70.69
dna_rokasD1 iqtree p2 2.85 18.44 75.00
dna_rokasD1 iqtree p3 2.30 18.43 72.69
dna_rokasD1 iqtree p4 2.44 18.43 72.25
dna_rokasD1 iqtree p5 2.80 18.43 74.63
dna_rokasD1 raxng p1 0.45 10.29 99.25
dna_rokasD1 raxng p2 0.56 9.14 98.31
dna_rokasD1 raxng p3 0.39 8.14 98.38
dna_rokasD1 raxng p4 0.40 10.38 98.00
dna_rokasD1 raxng p5 0.41 10.38 98.88
dna_rokasD1 raxml p1 0.82 13.47 99.31
dna_rokasD1 raxml p2 0.84 13.47 99.38
dna_rokasD1 raxml p3 0.99 13.47 99.56
dna_rokasD1 raxml p4 0.88 13.48 99.50
dna_rokasD1 raxml p5 0.91 13.47 99.50
dna_rokasD2b iqtree p1 24.71 33.88 31.31
dna_rokasD2b iqtree p2 25.88 33.90 31.75
dna_rokasD2b iqtree p3 39.71 33.88 28.25
dna_rokasD2b iqtree p4 27.38 33.88 34.06
dna_rokasD2b iqtree p5 29.06 33.88 31.50
dna_rokasD2b raxng p1 4.00 14.56 99.88
dna_rokasD2b raxng p2 4.69 15.88 99.69
dna_rokasD2b raxng p3 4.71 15.44 99.81
dna_rokasD2b raxng p4 3.93 15.68 99.88
dna_rokasD2b raxng p5 4.03 14.81 99.44
dna_rokasD2b raxml p1 8.15 26.33 99.88
dna_rokasD2b raxml p2 8.10 26.33 99.88
dna_rokasD2b raxml p3 9.40 26.33 99.88
dna_rokasD2b raxml p4 7.52 26.33 99.88
dna_rokasD2b raxml p5 7.38 26.33 99.88
dna_rokasD3a iqtree p1 2.60 27.31 80.75
dna_rokasD3a iqtree p2 8.62 27.64 79.19
dna_rokasD3a iqtree p3 8.83 27.65 78.25
dna_rokasD3a iqtree p4 9.10 27.65 79.13
dna_rokasD3a iqtree p5 9.85 27.63 79.69
dna_rokasD3a raxng p1 3.96 14.71 99.88
dna_rokasD3a raxng p2 3.86 13.19 99.75
dna_rokasD3a raxng p3 4.90 13.39 99.38
dna_rokasD3a raxng p4 3.86 13.76 99.56
dna_rokasD3a raxng p5 2.84 13.58 99.81
dna_rokasD3a raxml p1 4.00 21.31 99.88
dna_rokasD3a raxml p2 4.30 21.31 99.88
dna_rokasD3a raxml p3 4.59 21.31 99.88
dna_rokasD3a raxml p4 4.86 21.31 99.88
dna_rokasD3a raxml p5 4.42 21.31 99.88
dna_rokasD3b iqtree p1 4.42 18.18 77.81
dna_rokasD3b iqtree p2 4.33 18.18 78.25
dna_rokasD3b iqtree p3 4.75 18.17 79.31
dna_rokasD3b iqtree p4 5.22 18.17 80.50
dna_rokasD3b iqtree p5 5.19 18.19 79.88
dna_rokasD3b raxng p1 1.45 9.43 99.88
dna_rokasD3b raxng p2 1.90 9.10 99.88
dna_rokasD3b raxng p3 2.26 8.79 99.44
dna_rokasD3b raxng p4 2.45 9.19 99.88
dna_rokasD3b raxng p5 2.97 8.60 99.75
dna_rokasD3b raxml p1 2.65 13.99 99.88
dna_rokasD3b raxml p2 2.42 13.99 99.88
dna_rokasD3b raxml p3 2.57 13.99 99.88
dna_rokasD3b raxml p4 2.23 13.99 99.81
dna_rokasD3b raxml p5 2.20 13.99 99.88
dna_rokasD4 iqtree p1 1.03 5.14 76.44
dna_rokasD4 iqtree p2 1.03 5.11 78.25
dna_rokasD4 iqtree p3 0.78 5.13 80.94
dna_rokasD4 iqtree p4 0.94 5.10 80.00
dna_rokasD4 iqtree p5 0.82 5.11 79.56
dna_rokasD4 raxng p1 0.23 2.91 99.56
dna_rokasD4 raxng p2 0.24 3.10 99.88
dna_rokasD4 raxng p3 0.23 3.38 99.88
dna_rokasD4 raxng p4 0.23 3.39 99.88
dna_rokasD4 raxng p5 0.24 3.06 99.50
dna_rokasD4 raxml p1 0.29 3.73 99.75
dna_rokasD4 raxml p2 0.22 3.73 99.75
dna_rokasD4 raxml p3 0.24 3.73 99.75
dna_rokasD4 raxml p4 0.24 3.73 99.75
dna_rokasD4 raxml p5 0.28 3.73 99.75
dna_rokasD6 iqtree p1 6.12 29.74 80.38
dna_rokasD6 iqtree p2 6.41 29.73 79.13
dna_rokasD6 iqtree p3 7.09 29.74 81.38
dna_rokasD6 iqtree p4 8.48 29.73 81.88
dna_rokasD6 iqtree p5 7.05 29.74 75.88
dna_rokasD6 raxng p1 2.70 14.76 99.88
dna_rokasD6 raxng p2 2.58 14.14 99.88
dna_rokasD6 raxng p3 3.17 13.00 99.81
dna_rokasD6 raxng p4 2.95 14.32 99.81
dna_rokasD6 raxng p5 3.35 13.79 99.00
dna_rokasD6 raxml p1 2.93 23.40 99.81
dna_rokasD6 raxml p2 5.39 23.53 99.81
dna_rokasD6 raxml p3 4.22 23.54 99.81
dna_rokasD6 raxml p4 2.98 23.54 99.81
dna_rokasD6 raxml p5 4.10 23.53 99.81
dna_rokasD7 iqtree p1 34.01 199.07 71.13
dna_rokasD7 iqtree p2 30.66 199.07 72.25
dna_rokasD7 iqtree p3 29.93 199.08 73.44
dna_rokasD7 iqtree p4 31.26 199.14 72.25
dna_rokasD7 iqtree p5 35.06 199.07 72.94
dna_rokasD7 raxng p1 11.82 88.70 97.69
dna_rokasD7 raxng p2 7.14 91.86 98.56
dna_rokasD7 raxng p3 6.18 81.76 98.38
dna_rokasD7 raxng p4 7.15 114.44 98.06
dna_rokasD7 raxng p5 5.77 115.90 98.50
dna_rokasD7 raxml p1 16.09 149.66 98.81
dna_rokasD7 raxml p2 9.38 149.67 96.56
dna_rokasD7 raxml p3 12.23 149.68 98.63
dna_rokasD7 raxml p4 12.89 149.68 98.69
dna_rokasD7 raxml p5 14.41 149.68 98.81
dna_ShiD9 iqtree p1 10.01 1.96 94.44
dna_ShiD9 iqtree p2 9.98 1.96 94.50
dna_ShiD9 iqtree p3 13.90 1.96 95.19
dna_ShiD9 iqtree p4 8.47 1.96 93.75
dna_ShiD9 iqtree p5 7.08 1.96 92.69
dna_ShiD9 raxng p1 1.92 3.68 99.25
dna_ShiD9 raxng p2 1.49 3.77 99.88
dna_ShiD9 raxng p3 1.55 3.67 98.38
dna_ShiD9 raxng p4 1.78 3.76 99.38
dna_ShiD9 raxng p5 1.02 3.71 99.88
dna_ShiD9 raxml p1 1.46 5.73 99.81
dna_ShiD9 raxml p2 2.39 5.90 99.81
dna_ShiD9 raxml p3 2.32 5.91 99.81
dna_ShiD9 raxml p4 2.26 5.91 99.81
dna_ShiD9 raxml p5 2.09 5.92 99.81
dna_StamD10 iqtree p1 2.19 0.34 61.19
dna_StamD10 iqtree p2 5.25 0.34 60.94
dna_StamD10 iqtree p3 3.79 0.34 61.44
dna_StamD10 iqtree p4 3.63 0.34 61.44
dna_StamD10 iqtree p5 3.06 0.35 61.50
dna_StamD10 raxng p1 0.54 0.66 93.63
dna_StamD10 raxng p2 0.06 0.67 99.81
dna_StamD10 raxng p3 0.06 0.66 99.81
dna_StamD10 raxng p4 0.10 0.67 97.81
dna_StamD10 raxng p5 0.07 0.66 99.88
dna_StamD10 raxml p1 0.05 0.32 99.75
dna_StamD10 raxml p2 0.03 0.32 99.75
dna_StamD10 raxml p3 0.04 0.32 99.81
dna_StamD10 raxml p4 0.04 0.32 99.75
dna_StamD10 raxml p5 0.04 0.32 99.81
Table 8: Running time, memory usage and parallel efficiency for every tree inference,
calculated from the output files. For ease of display, all values are rounded to two
decimal places.
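To compare software on a single dataset, the five runs per program can be averaged. The sketch below does this for the dna_rokasD4 running times copied from Table 8. The tuple layout and the function name are illustrative assumptions and do not reflect how "result.csv" is actually structured.

```python
from collections import defaultdict
from statistics import mean

# (dataset, software, time in hours), copied from Table 8 for dna_rokasD4.
SAMPLE_RUNS = [
    ("dna_rokasD4", "iqtree", 1.03),
    ("dna_rokasD4", "iqtree", 1.03),
    ("dna_rokasD4", "iqtree", 0.78),
    ("dna_rokasD4", "iqtree", 0.94),
    ("dna_rokasD4", "iqtree", 0.82),
    ("dna_rokasD4", "raxng", 0.23),
    ("dna_rokasD4", "raxng", 0.24),
    ("dna_rokasD4", "raxng", 0.23),
    ("dna_rokasD4", "raxng", 0.23),
    ("dna_rokasD4", "raxng", 0.24),
    ("dna_rokasD4", "raxml", 0.29),
    ("dna_rokasD4", "raxml", 0.22),
    ("dna_rokasD4", "raxml", 0.24),
    ("dna_rokasD4", "raxml", 0.24),
    ("dna_rokasD4", "raxml", 0.28),
]


def mean_runtime(runs):
    """Average the running time over all runs of each (dataset, software)."""
    grouped = defaultdict(list)
    for dataset, software, hours in runs:
        grouped[(dataset, software)].append(hours)
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    for key, hours in sorted(mean_runtime(SAMPLE_RUNS).items()):
        print(key, round(hours, 3))
```

The same grouping works for the memory and efficiency columns by swapping in the corresponding values.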