Running head:
AffyTrees
Corresponding author:
Georg Weiller
Research School of Biological Sciences
Australian National University
2602 Canberra Australia
Email:[email protected]
Tel: +61 2 6125 5916
Research area: Bioinformatics
Plant Physiology Preview. Published on December 7, 2007, as DOI:10.1104/pp.107.109603
Copyright 2007 by the American Society of Plant Biologists
Downloaded from www.plantphysiol.org on February 17, 2018. Copyright © 2007 American Society of Plant Biologists. All rights reserved.
AffyTrees: facilitating comparative analysis of Affymetrix plant microarray chips.
Tancred Frickey 1, Vagner Augusto Benedito 2, Michael Udvardi 2 and Georg Weiller 1*
1 ARC Centre of Excellence for Integrative Legume Research and Bioinformatics Laboratory,
Genomic Interactions Group, Research School of Biological Sciences, Australian National
University, GPO Box 475, Canberra, ACT 2601 Australia
2 The Samuel Roberts Noble Foundation, Ardmore, Oklahoma 73401
*to whom correspondence should be addressed
Email:
Tancred Frickey: [email protected]
Vagner Benedito: [email protected]
Michael Udvardi: [email protected]
Georg Weiller: [email protected]
Financial source:
This research was funded by an Australian Research Council Centre of Excellence grant. Funding to
pay for the publication charges was provided by the same grant.
Species: Arabidopsis thaliana and Medicago truncatula
Abstract:
Microarrays measure the expression of large numbers of genes simultaneously and can be used to
delve into interaction networks involving many genes at a time. However, it is often difficult to
decide to what extent knowledge about the expression of genes gleaned in one model organism can
be transferred to other species. This can be examined either by measuring the expression of genes of
interest under comparable experimental conditions in other species, or by gathering the necessary
data from comparable microarray experiments. However, it is essential to know which genes to
compare between the organisms. To facilitate comparison of expression data across different
species, we have implemented a web-based software tool that provides information about sequence
orthologs across a range of Affymetrix microarray chips.
AffyTrees provides a quick and easy way of determining which probe sets on different Affymetrix
chips measure the expression of orthologous genes. Even in cases where gene or genome
duplications have complicated the assignment, groups of comparable probe sets can be identified.
The phylogenetic trees provide a resource that can be used to improve sequence annotation and
detect biases in the sequence complement of Affymetrix chips. Being able to identify sequence
orthologs and recognize biases in the sequence complement of chips is necessary for reliable cross-
species microarray comparison. As the amount of work required to generate a single phylogeny in a
non-automated manner is considerable, AffyTrees can greatly reduce the workload for scientists
interested in large-scale cross-species comparisons.
Introduction
Microarray experiments have made it possible to rapidly quantify the expression of large numbers
of genes for a given experimental condition. The rapidity and ease of use of this technology has
enabled research into complex aspects of growth and development involving multiple genes at a
time. However, it remains difficult to extend findings from one organism to another, as it is often
not known which of the spots on different microarray chips measure the expression of comparable
(i.e. orthologous) genes.
The basic idea of using “model organisms” is that the knowledge gained from studying such an
organism will, to a large extent, be transferable to other species. Taking the regulatory feedback
loop controlling branching in Arabidopsis thaliana as an example, validating analyses needed to be
performed in a range of other species to determine to what extent this mechanism was conserved
and how far the knowledge gained in Arabidopsis thaliana could be applied to other plants
(Johnson 2006).
Approaches to validate such regulatory networks range from crudely determining whether the
necessary genes might be present in another genome and then assuming the complete network of
gene interaction to be conserved, to quantifying the expression of the corresponding genes under
comparable experimental conditions and verifying that the genes actually do behave in a similar
manner. The former is a crude but quick, cheap and easy approach, while the latter is more refined,
but work intensive, expensive and complicated. Data-mining available microarray data may provide
an intermediate solution to the problem. Microarray data repositories such as the Gene Expression
Omnibus (GEO) (Edgar 2002) provide a wealth of information about how an organism responds to
a wide variety of experimental conditions and may provide information about the expression of a
gene of interest in a species of interest under an experimental condition of interest.
Regardless of the approach used, it is necessary to know which genes can be compared between
organisms. In many cases, available gene annotation or best BLAST (Altschul 1997) hits are used.
However, gene annotation is not always correct or up to date and best BLAST hits do not always
correspond to the closest phylogenetic relative (Koski 2001). The orthology of genes, i.e. gene
copies that arose due to a speciation event, is the quintessential feature to look for when attempting
to compare genes or gene-products. The underlying assumption is that a gene in an emergent
species will continue to perform the same function it had in the ancestral species. Genes that arose
via duplication (i.e. paralogous genes) are a different matter: because two copies of the gene are present in
the genome of the organism, changes in one of the duplicates are less likely to lead to a
noticeable reduction in fitness and are therefore more likely to be passed on to the next
generation. The paralogous genes we observe today were therefore less restrained in their ability to
change, be lost, be inactivated or evolve towards a new function. Alternatively, both of the
duplicates may have changed only slightly, each continuing to perform a subset of the original
gene's tasks or both may have remained fully functional, accumulating only minor changes in the
regulation of their expression to counteract potential dosage effects. This freedom of paralogs to
change is the main reason why comparison of paralogous genes is unlikely to be beneficial or
intended and cross-species comparisons should be confined to orthologous or co-orthologous genes.
A number of tools and databases exist that attempt to determine which genes are orthologous and
therefore comparable across organisms (for example COG (Tatusov 1997), Orthomcl (Li 2003),
KOG (Tatusov 2003), Genome Clusters Database (Horan 2005), Inparanoid (O'Brien 2005),
Multiparanoid (Alexeyenko 2006) and Orthologid (Chiu 2006)). Unfortunately, some of these
provide orthology assignments only for a very restricted set of species while others require
completed genomes to base their predictions on. Both points make these databases of little
use to researchers wanting to compare sequences from organisms for which completed
genomes are not yet available and that were not among the species included
in the databases. For such organisms, researchers generally have to rely on sequence similarity
searches to determine potential sequence orthologs in better described species. In addition, the
majority of the methods do not base their orthology predictions on phylogenetic trees but on other
clustering methods and only use phylogenies to visualize the results. Finally, none of the methods
provide an easy lookup of which Affymetrix sequences are comparable across chips, making an
additional mapping of Affymetrix exemplar sequences to predicted sequence orthologs necessary.
Our web-based software tool provides a quick and easy way of assessing the orthology of protein-
coding genes for a variety of plant microarray chips, irrespective of whether the genome of the
organism is completed or not. We focused on Affymetrix chips as the overwhelming majority of
microarray data present in public repositories is based on these (GEO (Edgar 2002)). These chips
generally provide a reasonable coverage of the transcriptome of an organism and the corresponding
sequence data is readily available. As many chips are designed and sold before the corresponding
organism is completely sequenced, there may be cases where sequences spotted on a chip are
thought no longer to be present in the genome or some genes in the genome may be represented
multiple times or missing on the chip. In contrast to other methods, we do not use ORFs predicted
from genomic data, but the sequences from which the probe sets for a given chip were derived,
hereafter referred to as either exemplar or consensus sequences. We thereby avoid problems arising
from inaccurate ORF prediction, genome sequences being revised and changed, as well as errors in
assigning the various probe sets to predicted genomic ORFs. For each of the consensus sequences,
we provide the results of sequence similarity searches against a number of sequence databases, a
Profile-Hidden-Markov-Model (HMM) representative of the sequence family, as well as a multiple
sequence alignment and phylogenetic tree for that family. An additional utility permits determining
sequence orthologs in a species of choice to the sequences present on an Affymetrix chip. A web-
interface is provided to PHAT, part of the PhyloGenie package (Frickey 2004), that allows the
repository of phylogenetic trees to be mined for trees corresponding to specific topological or
species constraints.
Construction and content:
The NCBI non-redundant protein database “nr” and 6-frame translations of the plant microarray
chip consensus sequences provided by Affymetrix provide the set of sequences we base our
predictions on. The 6-frame translations of the consensus sequences provide information as to what
proteins are represented on the various microarray chips. The “nr” database contains a wide variety
of species suitable as outgroups for the phylogenies as well as providing sequences that may have
failed to be included on the microarray chips of the various organisms. The latter are of special
importance as they provide critical data when attempting to assess whether two sequences are
orthologous or paralogous (Fig. 1).
PhyloGenie is used to automatically search for sequence homologs and infer phylogenetic trees for
all consensus sequences on a chip. This tool was originally developed to generate and analyze
phylomes with regard to gene duplications and lateral gene transfers and can be briefly described as
follows: A) Each microarray consensus sequence is compared against the above-mentioned
databases using BLAST. The results of these sequence similarity searches are used to identify
potential sequence homologs. B) BLAST High-Scoring-Segment-Pairs (HSPs) with greater than
70% coverage of the query and E-values better than 1e-5 are extracted and aligned to one another.
These parameters were chosen lax enough to detect non-trivial sequence similarities yet stringent
enough to exclude high-scoring local similarities that would, by themselves, not warrant the
assignment of two sequences as being orthologous. The resulting alignment contains the sequence
regions we regard as homologous to the query. C) Hmmer (http://hmmer.janelia.org/) is used to
derive a HMM from this alignment and search the full-length sequences of all BLAST-HSPs with
E-values better than 1. Deriving a HMM from the above alignment gives a better representation of
the sequence family. Using this HMM to search against full-length sequences of even marginal
BLAST hits allows detection of more of the distant sequence homologs and better defines the start
and end of homologous sequence regions than a single BLAST search could. D) Sequence regions
matching the full-length HMM with E-values better than 1e-5 are combined into a multiple sequence
alignment. E) A phylogenetic tree with 100 bootstrap replicates is inferred from this alignment. Due
to limited computational resources, we use Neighbor-Joining (Saitou 1987) to infer phylogenies. All
intermediary files are made available so that the process can be followed from beginning to end and
alternative approaches, for example a different method of tree inference, could be used. The trees
are rooted at the phylogenetic node closest to the “Last Universal Common Ancestor”, as described
in the PhyloGenie manuscript (Frickey 2004).
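The two BLAST thresholds used in steps B and C can be illustrated with a minimal sketch. It is not the PhyloGenie implementation: the `Hsp` record, the function names and the definition of coverage as the aligned fraction of the query length are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical record for one BLAST high-scoring segment pair."""
    subject_id: str
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive
    evalue: float


def query_coverage(hsp, query_length):
    # Assumption: coverage is the aligned fraction of the query sequence.
    return (hsp.query_end - hsp.query_start + 1) / query_length


def select_alignment_hsps(hsps, query_length, min_coverage=0.70, max_evalue=1e-5):
    """Step B: HSPs with more than 70% query coverage and E-value better than 1e-5."""
    return [
        h for h in hsps
        if query_coverage(h, query_length) > min_coverage and h.evalue < max_evalue
    ]


def select_hmm_search_targets(hsps, max_evalue=1.0):
    """Step C: subjects of all HSPs with E-value better than 1, without duplicates."""
    targets = []
    for h in hsps:
        if h.evalue < max_evalue and h.subject_id not in targets:
            targets.append(h.subject_id)
    return targets


# Toy data for a query of length 100 (all identifiers are invented).
toy_hsps = [
    Hsp("subjA", 1, 95, 1e-40),   # high coverage, strong E-value
    Hsp("subjB", 10, 60, 1e-20),  # strong E-value, but only 51% coverage
    Hsp("subjC", 5, 90, 1e-3),    # good coverage, marginal E-value
    Hsp("subjD", 1, 100, 5.0),    # E-value worse than 1
]
seed_hsps = select_alignment_hsps(toy_hsps, query_length=100)
hmm_targets = select_hmm_search_targets(toy_hsps)
```

In this toy case only `subjA` seeds the initial alignment, while `subjB` and `subjC` are still searched with the HMM, which mirrors the idea that marginal BLAST hits get a second chance against the family profile.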
The set of trees generated by PhyloGenie provides the basis of our prediction of sequence
orthologs. The actual prediction requires a number of user-specified parameters and is performed
on-the-fly, allowing for a high degree of flexibility. Detection of sequence orthologs is based on the
number of nodes separating the query sequence, i.e. the sequence for which a tree was derived, from
sequences of any given species in the tree. In the following examples we assume that the user
selected the Arabidopsis thaliana ATH1-121501 chip and was attempting to find sequence
orthologs in Medicago truncatula.
Determining sequence orthologs is done in the following manner (Fig. 2). The number of nodes
separating each Medicago truncatula sequence (yellow) from the query (purple) is determined
(minimum number: 4, standard deviation: 2.87). An additional scaling factor (default: 0.5) allows the
user to specify the range within which Medicago truncatula sequences are accepted as
potential sequence orthologs. Increasing this value causes the program to take into account more
distant sequence relatives as potential orthologs while decreasing this value causes the program to
focus on the most closely related sequences only. In the presented analysis, we used a value of 0.5
as this allowed us to determine orthologs for most of the chip sequences while not causing too many
of the query sequences to be assigned multiple orthologs in the other species. The distance within
which sequences are accepted as potential sequence orthologs is referred to as the “permissive
range” in this manuscript. The permissive range is calculated as the minimal number of nodes
separating the query sequence from a Medicago truncatula homolog in the tree plus the standard
deviation multiplied by the scaling factor. The standard deviation reflects the dispersal pattern of
Medicago truncatula sequences throughout the tree. The more clades in a tree contain Medicago
truncatula sequences, the greater the uncertainty about which of these clades contain sequences
orthologous to the query. We therefore use the standard deviation of the number of nodes separating
Medicago truncatula sequences from the query as a measure for how uncertain we are that the
sequences closest to each other, in number of nodes, really are the sequence orthologs. For the tree
shown in Figure 2, the permissive range is highlighted in green and encompasses all sequences less
than 6 nodes removed from the query. Affymetrix Arabidopsis thaliana ATH1-121501 sequences
less than 6 nodes removed from the query are regarded as sequence paralogs of the query
(here, 260439_at). Medicago truncatula sequences within the permissive range are regarded as potential
sequence orthologs (Mtr.28509.1.S1_at, Mtr.17370.1.S1_at and Mtr.21922.1.S1_at).
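The permissive range described above is the minimum node distance plus the scaling factor times the standard deviation. The sketch below reproduces the values given for Figure 2 (minimum 4, standard deviation 2.87, scaling factor 0.5). The function names, the toy distances and the choice of the population standard deviation are assumptions; the text does not state which estimator is used.

```python
import statistics


def permissive_range(min_nodes, sd_nodes, scaling_factor=0.5):
    """Minimum node distance plus the scaled standard deviation."""
    return min_nodes + scaling_factor * sd_nodes


def candidates_within_range(distances, scaling_factor=0.5):
    """Names of sequences whose node distance to the query lies in the permissive range.

    `distances` maps a sequence name to its node distance from the query.
    Assumption: the population standard deviation is used.
    """
    values = list(distances.values())
    limit = permissive_range(min(values), statistics.pstdev(values), scaling_factor)
    return sorted(name for name, d in distances.items() if d <= limit)


# Values reported for the tree in Figure 2.
figure2_limit = permissive_range(4, 2.87, 0.5)  # 4 + 0.5 * 2.87 = 5.435

# Invented node distances for five target-species sequences.
toy_distances = {"mtr_a": 4, "mtr_b": 5, "mtr_c": 5, "mtr_d": 10, "mtr_e": 11}
toy_candidates = candidates_within_range(toy_distances)
```

Because node distances are integers, a limit of 5.435 accepts exactly the sequences fewer than 6 nodes from the query, in line with the text.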
For each of the potential orthologs we subsequently perform a reverse lookup. We calculate the
minimum and standard deviation of the number of nodes separating each potential ortholog from
the Affymetrix Arabidopsis thaliana ATH1-121501 sequences present in the tree. As the minimum
and standard deviation are greatly influenced by the position in the tree of the sequence for which
the values are being calculated, the permissive ranges of the potential orthologs may be quite
different from one another. A red and blue line show the permissive ranges for two of our three
potential orthologs. The query sequence does not lie within the permissive range of
Mtr.21922.1.S1_at (blue line). This sequence is therefore removed from the set of potential
orthologs as it appears much more closely related to the Affymetrix Arabidopsis thaliana sequence
“257728_at” than to the query. Mtr.28509.1.S1_at (red line) and Mtr.17370.1.S1_at (not shown)
recover the query sequence in their permissive ranges and both are retained as sequence orthologs
to the query. Analysis of this tree therefore tells us that our query sequence “245641_at” has a
sequence paralog (260439_at) on the Affymetrix Arabidopsis thaliana ATH1-121501 chip and two
sequence orthologs (or co-orthologs) on the Affymetrix Medicago truncatula chip.
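The forward and reverse lookups can be combined as in the following sketch. It assumes node distances have already been read from the tree; all sequence names and distances are invented, and the population standard deviation is again an assumption rather than a documented detail of AffyTrees.

```python
import statistics


def within_permissive_range(distances, scaling_factor=0.5):
    """Set of names whose node distance is at most min + scaling_factor * sd."""
    values = list(distances.values())
    limit = min(values) + scaling_factor * statistics.pstdev(values)
    return {name for name, d in distances.items() if d <= limit}


def reciprocal_orthologs(query, forward, reverse, scaling_factor=0.5):
    """Keep forward candidates whose own permissive range recovers the query.

    forward: node distances from the query to each target-species sequence.
    reverse: for each target-species sequence, its node distances to the
             query-species chip sequences in the tree (including the query).
    """
    kept = []
    for candidate in sorted(within_permissive_range(forward, scaling_factor)):
        if query in within_permissive_range(reverse[candidate], scaling_factor):
            kept.append(candidate)
    return kept


forward = {"mtr_a": 4, "mtr_b": 5, "mtr_c": 5, "mtr_d": 10, "mtr_e": 11}
reverse = {
    "mtr_a": {"query": 4, "ath_p": 5, "ath_q": 9},
    "mtr_b": {"query": 5, "ath_p": 6, "ath_q": 10},
    # mtr_c sits next to ath_q, so the query falls outside its permissive range.
    "mtr_c": {"query": 5, "ath_p": 8, "ath_q": 1},
}
orthologs = reciprocal_orthologs("query", forward, reverse)
```

Here three candidates pass the forward test, and one is discarded in the reverse lookup because it is much closer to another chip sequence than to the query, the same pattern as in the Figure 2 example.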
The aim of this tool is twofold: it offers a fully automated way of retrieving sequence orthologs for
microarray consensus sequences from a wide variety of species and provides the results of a
BLAST search, multiple sequence alignment and phylogenetic inference for every consensus
sequence on a chip. This allows manual validation of any dubious orthology predictions by
comparing the various intermediate results leading to the phylogeny against the corresponding
phylogenetic trees and alignments. In addition, the large number of alignments generated in the
process of constructing the phylogenies are a useful resource on which to base further analyses, as
they provide sets of aligned sequence homologs for every consensus sequence on a chip.
Utility:
User interface: The user interface has five webpages. The home page allows querying of individual
genes and links to the remaining pages, some help and supplemental data. The other four pages of
the interface deal with batch requests, analysis of chip phylomes, generating phylogenies for
sequences provided by the user and predicting sequence orthologs between the consensus sequences
represented on a chip and other species.
The results of an individual query are shown in Figure 3. Tabs at the top of the page allow
navigation between the results of a BLAST search (BLAST), alignment of high-scoring HSPs
(CLN), the derived HMM (HMM), results of the HMM-search (HMS), alignment of high-scoring
HMM-hits (HLN) and either a textual or applet-based representation of a Neighbor-Joining tree
(TRE). The tabs allow the user to retrace every step leading from query sequence to phylogeny and
are very useful to gain a better understanding of why two genes were regarded as homologous,
included in the same tree, or predicted to be sequence orthologs. To facilitate interpretation of batch
requests and complete phylome analyses, intermediate pages can be generated that gather the
results, order them and link to the results pages of the various genes. Prediction of sequence
orthologs between microarray chip consensus sequences and a species of choice generates a tab-
delimited list containing information about which sequences on the chip could be assigned sequence
orthologs in another species, which sequences should be regarded as co-orthologous or paralogous,
and which other homologous sequences were present in the phylogenies but could not be assigned a
more precise relationship.
Supplemental data, providing further information about the programs used, the individual steps
performed to generate the data as well as the parameters the user can tweak, is available at
http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php. Results of phylome analyses, custom
phylogenetic trees and orthology predictions are stored for a week and can be accessed by referring
to the job identifier provided in the results.
This tool differs from other databases and programs in a number of ways. It provides the data on
which tree inference and orthology prediction is based and thereby allows the user to re-trace each
step of the decision process. Our trees include sequences from the “nr” database which greatly
facilitates correct rooting and interpretation. In addition, this allows us to potentially detect
sequence orthologs for any species represented in “nr” instead of being limited to those species for
which complete genomes or proteomes are available. The use of a user-defined “scaling factor”
avoids problems co-orthologous genes cause for approaches relying solely on reciprocal best hits
between genomes. If, for example, a species has a gene of interest, gene A, that was duplicated in
another species, giving rise to genes B and B', reciprocal best hit approaches may identify genes A
and B or A and B' as reciprocal best hits and assign them as sequence orthologs. However, if A
appears most similar to B but B' appears most similar to A, a possible scenario if non-symmetric
scoring schemes are used, such as employed by BLAST, then no reciprocal best hits can be
determined and no sequence orthologs are assigned. All of the above cases produce an incorrect
assignment of gene orthology, as B and B' are co-orthologous to A (i.e. duplicates derived from a
gene that was orthologous to A) and should be treated as such.
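The first failure mode of reciprocal best hits, where only one of two co-orthologs is reported, can be shown with a toy example. The similarity scores and function names below are invented for illustration and do not come from any real BLAST run.

```python
def best_hit(scores, query):
    """Subject with the highest score for `query`."""
    hits = scores[query]
    return max(hits, key=hits.get)


def reciprocal_best_hits(scores_1_to_2, scores_2_to_1):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    pairs = []
    for a in sorted(scores_1_to_2):
        b = best_hit(scores_1_to_2, a)
        if best_hit(scores_2_to_1, b) == a:
            pairs.append((a, b))
    return pairs


# Species 1 has gene A; species 2 has the duplicates B and B' (written B2 here).
scores_1_to_2 = {"A": {"B": 200, "B2": 190}}
scores_2_to_1 = {"B": {"A": 200}, "B2": {"A": 190}}

pairs = reciprocal_best_hits(scores_1_to_2, scores_2_to_1)

# Co-orthologs that the reciprocal-best-hit criterion leaves unassigned.
missed = sorted(set(scores_2_to_1) - {b for _, b in pairs})
```

Only the pair (A, B) is returned, and B2 is left without an assignment even though both duplicates are co-orthologous to A. A range-based criterion such as the scaling factor can accept both.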
Another part of this tool allows the user to search through the trees of a given species or chip for
those corresponding to specific topological selection criteria. For example, to find all trees in which
a clade contains at least one Medicago truncatula and Arabidopsis thaliana sequence, but no
sequences from the Arabidopsis thaliana ATH1-121501 chip, the selection string “((Medicago
truncatula & Arabidopsis thaliana) & !Arabidopsis ATH1-121501)” could be used. Trees containing
such clades could identify sequences present in Medicago truncatula, the orthologs of which can
not be measured using the Affymetrix Arabidopsis thaliana ATH1-121501 chip, as no sequence
orthologs are present on that chip. As an example of such a case (Figure 4), we show a tree derived
for a hypothetical protein from Medicago truncatula, the ortholog of which was not included on the
ATH1-121501 chip even though orthologous sequences are present in the Arabidopsis thaliana
genome as well as throughout the plant, fungal and animal kingdoms.
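The meaning of such a selection can be sketched on a toy tree representation. This is not the PHAT parser or its data model: trees are written as nested lists, and the `Leaf` class, its fields and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    species: str
    source: str  # where the sequence came from, e.g. "nr" or a chip name


def leaves(node):
    """All leaves below a node; internal nodes are lists of children."""
    if isinstance(node, Leaf):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def clades(node):
    """Yield every internal node (clade) of the tree, root included."""
    if isinstance(node, Leaf):
        return
    yield node
    for child in node:
        yield from clades(child)


def clade_matches(clade):
    """Medicago and Arabidopsis sequences present, none from the ATH1-121501 chip."""
    members = leaves(clade)
    species = {m.species for m in members}
    sources = {m.source for m in members}
    return (
        "Medicago truncatula" in species
        and "Arabidopsis thaliana" in species
        and "ATH1-121501" not in sources
    )


def tree_matches(tree):
    return any(clade_matches(c) for c in clades(tree))


tree_without_chip_ortholog = [
    [Leaf("Medicago truncatula", "nr"), Leaf("Arabidopsis thaliana", "nr")],
    Leaf("Oryza sativa", "nr"),
]
tree_with_chip_ortholog = [
    [Leaf("Medicago truncatula", "nr"), Leaf("Arabidopsis thaliana", "ATH1-121501")],
    Leaf("Oryza sativa", "nr"),
]
```

The first toy tree matches the selection, whereas the second does not, because every clade that joins Medicago and Arabidopsis also contains a chip sequence.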
Future developments include, as a first step, extending this tool beyond the currently available 7
chips to include all publicly available Affymetrix plant microarray chips. Since this system is not
limited as to what species can be analyzed, provided some sequence information for the species is
available, it is conceivable that the system may be extended to cover all available Affymetrix
microarray chips. Beyond that, the aim will be to develop and implement methods that further
facilitate comparative analysis of microarray expression data across species.
Results and discussion:
To determine whether the AffyTrees orthology predictions were comparable to, less or more
accurate than reciprocal best BLAST hits, the most widely used method to identify sequence
orthologs, we compared the orthology predictions generated by both methods. Phylogenetically
orthologous sequences are generally expected to fulfill the same function in different species and
functionally orthologous sequences are expected to be similarly expressed across different species.
Therefore, phylogenetic orthologs can be expected to show a certain degree of similarity in their
expression across species. We based our comparison on prediction of sequence orthologs between
the Arabidopsis thaliana ATH1-121501 and Medicago truncatula Affymetrix chips. These species
were chosen specifically because sets of comparable microarray experiments were available and
provided us with the opportunity to test whether and how well sequence orthology, as predicted by
reciprocal best BLAST hits and AffyTrees, was reflected in similarity of expression.
The results of comparing the orthology predictions for these two microarray chips are shown in
Figure 5A. BLAST produced many more reciprocal best hits (7025) than AffyTrees predicted
orthologs (5793). Of these, 2926 predictions of sequence orthologs coincided, 4099 orthology
predictions were unique to the reciprocal best BLAST hits and 2867 orthology predictions were
unique to AffyTrees. Even though BLAST produced nearly 30% more orthology predictions, fewer
individual sequences were assigned an ortholog in BLAST than in AffyTrees. This was due to many
of the BLAST hits having multiple ortholog assignments. On average, each Medicago truncatula
chip sequence was assigned 1.78 Arabidopsis thaliana chip sequences as reciprocal best BLAST
hits and every Arabidopsis thaliana chip sequence was assigned 1.57 Medicago truncatula chip
sequences. This artificially inflated the number of “orthology” predictions provided by BLAST.
Dividing the number of BLAST-only reciprocal best hits (4099) by the average number of assignments per sequence for each
species gives us the number of individual genes for each species that could be assigned at least one
ortholog in the other species: the exclusively BLAST based predictions assigned 2303 sequences
from Medicago one or more orthologs in Arabidopsis and 2611 sequences in Arabidopsis could be
assigned one or more orthologs in Medicago. The exclusively AffyTrees based predictions assigned
2515 Medicago sequences orthologs in Arabidopsis and 2537 Arabidopsis sequences orthologs in
Medicago; 138 more sequences than assigned by reciprocal best BLAST hits.
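The counts in this paragraph can be re-derived from one another, as the short check below shows. It assumes that the division is applied to the 4099 predictions unique to BLAST and that results are rounded to whole sequences; both assumptions reproduce the reported figures. Variable names are illustrative.

```python
# Prediction counts reported for ATH1-121501 versus the Medicago truncatula chip.
blast_pairs = 7025
affytrees_pairs = 5793
shared_pairs = 2926

blast_only_pairs = blast_pairs - shared_pairs          # predictions unique to BLAST
affytrees_only_pairs = affytrees_pairs - shared_pairs  # predictions unique to AffyTrees

# Average number of partners per sequence among reciprocal best BLAST hits.
ath_per_mtr = 1.78  # Arabidopsis sequences per Medicago sequence
mtr_per_ath = 1.57  # Medicago sequences per Arabidopsis sequence

blast_only_mtr = round(blast_only_pairs / ath_per_mtr)  # Medicago sequences with an ortholog
blast_only_ath = round(blast_only_pairs / mtr_per_ath)  # Arabidopsis sequences with an ortholog
blast_only_sequences = blast_only_mtr + blast_only_ath

# Sequences assigned orthologs only by AffyTrees, as reported in the text.
affytrees_only_sequences = 2515 + 2537

difference = affytrees_only_sequences - blast_only_sequences
```

Under these assumptions the numbers are internally consistent: 4099 and 2867 unique predictions, 2303 and 2611 BLAST-only sequences, and a difference of 138 sequences in favour of AffyTrees.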
To determine which of the methods provided a more accurate orthology prediction, we compared
the expression of predicted sequence orthologs in two sets of microarray experiments, one for
Arabidopsis thaliana (Schmid 2005) and one for Medicago truncatula (Benedito et al., Medicago
Gene Atlas, manuscript in preparation, ArrayExpress accession: E-MEXP-1097). The expression of
genes was compared across 7 tissue types: stems, petioles, leaves, vegetative buds, flowers, roots
and seeds. Different laboratories generated the data, and differences in harvesting, preparation,
experimental procedure, growth conditions and of course the plants themselves, undoubtedly will
have affected the experiments and provide ample explanation for why some sequence orthologs
might not be correlated in their expression in these two species. Therefore, we did not expect all
sequence orthologs to show a strong positive correlation in their expression, but a general positive
trend in correlation was certainly expected. However, our aim was not to show that sequence
orthologs share similar expression patterns, but to use the available expression data to assess the
accuracy of the two prediction methods.
Accepting the 2926 orthology assignments both BLAST and AffyTrees agreed upon as “true”
orthologs, we used the Pearson (linear) correlation coefficient of the expression values to measure
the co-expression of all predicted ortholog pairs. The histogram in Figure 5B shows the number of
predicted ortholog pairs for a given correlation coefficient as well as a fitted, scaled extreme-value
distribution (EVD). Most of the predicted ortholog pairs produced positive correlation
coefficients, supporting our expectation that sequence orthologs, in general, should show similar
expression across different organisms. In addition, the graph provides us with a means of testing the
accuracy of reciprocal best BLAST hits and AffyTrees orthology predictions as seen in Figure 5C.
Rather than comparing histograms directly, we approximated the histograms by a distribution with a
small number of parameters to facilitate comparison of multiple datasets. The EVD approximates
the various histograms depicted in Figure 5 quite well. The more accurate the set of orthologs
predicted by each method, the better the corresponding fitted EVD should approximate the EVD
derived from our set of 2926 “true” orthologs.
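The co-expression measure used here is the Pearson correlation coefficient over the seven tissue types. A minimal sketch follows; the expression values are invented, and the function is a plain textbook implementation rather than the code used for the analysis.

```python
import math

TISSUES = ("stems", "petioles", "leaves", "vegetative buds", "flowers", "roots", "seeds")


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Invented expression values, one per tissue, for a predicted ortholog pair.
ath_profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
mtr_profile = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]

r_same_trend = pearson(ath_profile, mtr_profile)
r_opposite_trend = pearson(ath_profile, list(reversed(mtr_profile)))
```

A pair whose expression rises and falls across the same tissues gives a coefficient near 1, and an opposite trend gives a value near -1. The histograms in Figure 5 summarize such coefficients over all predicted pairs.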
We then compared the sets of genes for which sequence orthologs could only be predicted by either
BLAST or AffyTrees. Whenever one gene was assigned multiple sequence orthologs, we averaged
their correlation coefficients to reflect that the method generating the prediction could not decide in
more detail which of the predicted orthologs should be used. 4914 genes were assigned sequence
orthologs only in reciprocal BLAST hits and 5052 genes were assigned sequence orthologs only in
AffyTrees. The graphs of the histograms and fitted EVDs for these sets of genes are shown in
Figure 5C. Both BLAST and AffyTrees were able to predict orthologs for similar numbers of genes;
however, the maximum of the BLAST-EVD lies at 0.47, while the maximum of the AffyTrees-EVD
lies at 0.66. The EVD based on the AffyTrees predictions also better approximates the EVD based
on the set of “true” orthologs. Taking the median of the correlation coefficients as the comparison
metric leads to similar results (Figure 5B-D). Bootstrap sampling of the BLAST and AffyTrees
distributions (10000 samples, 1000 replicates) showed the median values of the distributions to be
very resilient to change. The probability of generating a randomly sampled distribution with the
median value observed in the other method was, in both cases, very small (BLAST: 2.1e-36,
AffyTrees: 6.2e-26). Both the median values of the distributions as well as the maximum of the fitted
EVDs show that the histogram of the AffyTrees predictions (blue) is more similar to the histogram
of the “true” orthologs (green), than the histogram of the best BLAST-based predictions (yellow) is
to the “true” orthologs. This points to the AffyTrees predictions being more reliable than the
predictions based on best BLAST hits.
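Two steps of this comparison are sketched below: averaging the correlation coefficients of a gene with several predicted orthologs, and bootstrap resampling of the median. The function names, the toy data and the fixed random seed are assumptions; the text gives only the sample size (10000) and the number of replicates (1000), not the implementation.

```python
import random
import statistics


def mean_correlation_per_gene(pairs):
    """Average the coefficients of all predicted orthologs of each gene.

    `pairs` is an iterable of (gene, correlation_coefficient) tuples.
    """
    grouped = {}
    for gene, r in pairs:
        grouped.setdefault(gene, []).append(r)
    return {gene: sum(rs) / len(rs) for gene, rs in grouped.items()}


def bootstrap_medians(values, sample_size, replicates, seed=0):
    """Medians of `replicates` samples drawn with replacement from `values`."""
    rng = random.Random(seed)
    return [
        statistics.median(rng.choices(values, k=sample_size))
        for _ in range(replicates)
    ]


# Invented coefficients: gene g1 has two predicted orthologs, g2 has one.
per_gene = mean_correlation_per_gene([("g1", 0.25), ("g1", 0.75), ("g2", 0.5)])

# Small toy run; the settings reported in the text would be
# sample_size=10000 and replicates=1000.
toy_values = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_medians = bootstrap_medians(toy_values, sample_size=101, replicates=50, seed=1)
```

The spread of the resampled medians indicates how stable the observed median is. Comparing that spread with the median of the other method gives a rough sense of how unlikely it is that the two methods differ by chance.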
However, it was recently shown that GCRMA (Wu 2004) normalization can lead to overprediction
of correlated genes (Lim 2007). To see whether this was affecting our results, we repeated the above
analysis using MAS5 (Hubbell 2002) normalized data. The median values of the resulting
distributions were 0.417 for our set of 'true' orthologs, 0.339 for the AffyTrees orthologs, 0.275 for
the BLAST predictions, 0.267 for AffyTrees homologs and 0.018 for random sequence pairs. These
values are similar to those calculated based on the GCRMA normalized data, indicating that,
although GCRMA normalization does seem to increase the median value of the distributions, the
increase is slight and no qualitative difference in how the methods compare to one another is
apparent.
In an attempt to determine why the BLAST-based predictions fare poorly, we examined how various
modes of orthology assignment influence the fitted EVD. We show the histograms and fitted EVD
for two further datasets (Figure 5D). The first set was generated by randomly pairing sequences
from within our set of “true” orthologs (black) and the second by accepting all sequence homologs
present in the AffyTrees phylogenies as sequence orthologs (pink). These phylogenies provide a
large number of groupings of homologous sequences. We know that a large number of the trees
contain paralogous sequences, and mis-assigning sequence paralogs as orthologs is one of the key
difficulties in accurately detecting sequence orthologs. The graph shows that an EVD fitted to the
random orthology assignments (black) has its maximum close to zero. Indiscriminately assigning all
sequence homologs present in a tree as sequence orthologs generates many more orthology
predictions, as visible by the increased amplitude of the EVD. However, the maximum of the fitted
EVD is close to 0.5, well below the 0.68 maximum we determined for the EVD of the set of “true”
orthologs (green). We therefore expect the maximum of EVDs fitted to various methods of
orthology assignment, for this dataset, to lie between 0 and 0.7. The closer the maximum lies to 0.7 or
above, the better the prediction method is likely to be. Not differentiating between “orthology” and
“homology”, thereby causing too many sequences to be assigned as sequence orthologs, shifts the
maximum of the fitted EVD to around 0.5. BLAST-based predictions more frequently assigned
multiple sequence orthologs to genes than the AffyTrees predictions. This might explain why the
maximum of the BLAST-EVD lies at 0.47. The best BLAST approach, while quite suited to
detecting sequence homologs, therefore does not appear very accurate when used to distinguish
between sequence orthologs and other homologs. The AffyTrees method, in contrast, appears far
better at reliably determining orthologous sequences.
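
The two control datasets discussed above (random pairing within the set of “true” orthologs, and treating every homolog in a tree as an ortholog) could be constructed along the following lines. This is only a sketch under assumed data structures: the pair list, the `trees` mapping and both function names are hypothetical and do not reflect the AffyTrees implementation.

```python
import random


def random_pairing(true_pairs, seed=0):
    """Shuffle the partners among the reference ortholog pairs.

    `true_pairs` is a list of (medicago_id, arabidopsis_id) tuples.
    A shuffled partner may coincide with the original one by chance.
    """
    rng = random.Random(seed)
    left = [a for a, _ in true_pairs]
    right = [b for _, b in true_pairs]
    rng.shuffle(right)
    return list(zip(left, right))


def all_homologs_as_orthologs(trees, target_species):
    """Pair each query with every tree member from `target_species`.

    `trees` maps a query id to a list of (sequence_id, species) tuples,
    an assumed stand-in for the contents of one phylogeny.
    """
    pairs = []
    for query, members in trees.items():
        for seq_id, species in members:
            if species == target_species:
                pairs.append((query, seq_id))
    return pairs
```

Computing the expression correlation for each pair in these two sets then gives the lower (random) and intermediate (all homologs) reference distributions against which a prediction method can be placed.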
Conclusions:
AffyTrees provides a repository of phylogenetic trees inferred from every consensus sequence
represented on a variety of Affymetrix plant microarray chips. This repository can be used to gain
insights into the relationship of sequence homologs, improve annotation data or automatically
generate a list of sequence orthologs between a species and the consensus sequences represented on
a specific microarray chip. The inclusion of sequences from the “nr” database and our method of
detecting sequence orthologs circumvent the problems reciprocal best hit approaches have when
dealing with co-orthologous genes. For sequences represented on Affymetrix plant microarray
chips, AffyTrees can identify sequence orthologs present on other Affymetrix plant microarray
chips, as well as sequence orthologs present in the “nr” database.
The ability to filter chip phylomes for specific selection criteria allows discrepancies or systematic
biases between the sequence complements of chips and the corresponding genomes to be detected.
Affymetrix chips were designed to measure the transcription of genes and therefore are biased
towards highly expressed and protein coding genes. This is a known and useful bias of these chips.
However, other biases, for example systematic preference for long or short sequences, differences in
the EST-libraries on which the chips were based or differences in the ability to successfully predict
short genes in different species, will have affected which sequences were included on a chip and
thereby influence the results.
We provide a means of comparing the sequence complement of microarray chips to the publicly
available sequence data of the corresponding organism as well as to the microarrays of other
species. Robust ways of assessing sequence orthologs and knowledge about systematic differences
in the sequence complement of various chips are prerequisites to making cross-species analyses of
microarray expression data feasible. Without knowledge of the sequence orthologs present on other
microarray chips, there is no way of determining which probe sets are comparable across chips.
Similarly, without a way of estimating sequence biases or genes missing on a chip, the conclusions
drawn from presence or absence of groups of genes derived from expression data are likely to be
flawed.
We show, to the extent that the limitations of the available experimental data permitted, that the
majority of genes predicted to be orthologous show a similar expression across the two examined
species. We also show that AffyTrees is able to assign sequence orthologs to more genes than a
comparable approach relying on reciprocal best BLAST hits and, by comparing the expression of
predicted sequence orthologs, that the AffyTrees orthologs appear more reliable than the BLAST-
based predictions.
AffyTrees provides prediction of sequence orthologs for a wide variety of species at greater
accuracy than reciprocal best BLAST hits. Combined with the available phylogenetic trees,
sequence alignments and additional utilities, AffyTrees should provide a useful resource for
comparative analyses of transcriptomes and proteomes.
Methods:
The sequences we based our sequence-similarity searches on originated from either the “nr”
database, downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), or from
6-frame translations of exemplar sequences for a variety of Affymetrix chips. The nucleotide exemplar
sequences were downloaded, after registration, from the Affymetrix website by following the links
to the various species (http://www.affymetrix.com/support/technical/byproduct.affx?cat=exparrays).
BLAST searches were performed against the NCBI non-redundant protein database “nr” and 6-
frame translation of consensus sequences for the Affymetrix microarray chips ATH1-121501,
AtGenome1, Barley1, Citrus, Cotton, Grape, Maize, Medicago, Poplar, Rice, Soybean, Sugar Cane,
Tomato and Wheat. The BLAST results for sequences represented on the Arabidopsis thaliana
ATH1-121501 and Medicago truncatula chips were retrieved via the AffyTrees web-interface.
Putative sequence orthologs between Medicago truncatula and Arabidopsis thaliana sequences
were predicted as described above (scaling factor = 0.5) based on the phylogenies provided by
AffyTrees. To keep the results as comparable as possible, the same cutoffs used to generate the
phylogenies (i.e. >70% coverage of the query and E-values better than 1e-5) were used as a lower
limit for analysis of the reciprocal best BLAST hits. BLAST hits that did not satisfy these cutoffs
were not taken into account. In cases where multiple BLAST hits had identical best E-values, all of
these best hits were taken into account. This made it possible for some genes to be assigned
multiple reciprocal best BLAST hits. The method of orthology prediction we describe allows genes
in one species to be assigned multiple orthologs in another. In such cases, all of the predicted
sequence orthologs were taken into account. A noticeable discrepancy was apparent in the number
of predicted sequence orthologs compared to the number of reciprocal best BLAST hits. To keep
both approaches of detecting sequence orthologs as comparable as possible, we compared
reciprocal AffyTrees orthologs to the reciprocal best BLAST hits. This allowed both methods to use
“reciprocality” as a further criterion to reduce the number of false positive orthology predictions.
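
The reciprocal best BLAST hit procedure described above, including the cutoffs and the handling of tied best E-values, can be sketched as follows. The input layout (tuples of query, subject, E-value and query coverage) and the function names are assumptions made for illustration; treating the cutoffs as strict inequalities is likewise an assumption.

```python
from collections import defaultdict

MIN_COVERAGE = 0.70  # ">70% coverage of the query" (from the text)
MAX_EVALUE = 1e-5    # "E-values better than 1e-5" (from the text)


def best_hits(hits):
    """Map each query to the set of subjects sharing its best E-value.

    `hits` is an iterable of (query, subject, evalue, query_coverage).
    Hits failing either cutoff are ignored; ties are all retained.
    """
    by_query = defaultdict(list)
    for query, subject, evalue, coverage in hits:
        if coverage > MIN_COVERAGE and evalue < MAX_EVALUE:
            by_query[query].append((evalue, subject))
    best = {}
    for query, scored in by_query.items():
        top = min(evalue for evalue, _ in scored)
        best[query] = {subject for evalue, subject in scored if evalue == top}
    return best


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Pairs (a, b) where b is a best hit of a and a is a best hit of b."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {
        (a, b)
        for a, subjects in forward.items()
        for b in subjects
        if a in reverse.get(b, set())
    }
```

Because tied best hits are all kept, a single gene can end up with several reciprocal partners, which is the behaviour the text describes for both methods.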
For each plant species, the Affymetrix CEL files of the experiments we wanted to compare were
normalized using both GCRMA (Wu 2004) and MAS5 (Hubbell 2002) for comparison. All
experimental files for a species were normalized at the same time, as normalizing each set of
experiments individually would have artificially increased the differences observed between the
experimental conditions. Linear correlation coefficients were calculated using the average
expression value of each gene over the three available experimental replicates.
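
As an illustration of the correlation step, the following sketch averages replicate measurements per experimental condition and computes the linear (Pearson) correlation coefficient between two resulting expression profiles. The function names and the nested-list input layout are hypothetical, and the sketch assumes that the conditions of the two species have already been matched in the same order.

```python
import math


def replicate_means(replicates):
    """Average each condition over its replicates.

    `replicates` is a list of conditions, each a list of replicate values
    (three per condition in the analysis described in the text).
    """
    return [sum(values) / len(values) for values in replicates]


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

Applying `replicate_means` to each gene of a predicted ortholog pair and passing the two profiles to `pearson` yields one coefficient per pair, which is the quantity summarized in the histograms.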
Availability and requirements:
The tool is freely accessible at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/. Further
information and help are available at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php.
JavaScript should be enabled in the browser and a Java 1.5 or above browser plugin should be
installed for visualization of phylogenetic trees.
Acknowledgements:
This research was funded by an Australian Research Council Centre of Excellence grant. Funding to
pay for the publication charges was provided by the same grant.
Literature cited
References:
Alexeyenko A, Tamas I, Liu G, Sonnhammer EL (2006). Automatic clustering of orthologs and
inparalogs shared by multiple proteomes. Bioinformatics. 22:e9-15.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids
Res. 25:3389-3402.
Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, DeSalle R (2006) OrthologID: automation of
genome-scale ortholog identification within a parsimony framework. Bioinformatics. 22:699-707.
Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and
hybridization array data repository. Nucleic Acids Res. 30:207-210.
Frickey T, Lupas AN (2004). PhyloGenie: automated phylome generation and analysis. Nucleic
Acids Res. 32:5231-5238.
Horan K, Lauricha J, Bailey-Serres J, Raikhel N, Girke T (2005) Genome cluster database. A
sequence family analysis platform for Arabidopsis and rice. Plant Physiol. 138:47-54.
Hubbell E, Liu WM, Mei R (2002) Robust estimators for expression analysis. Bioinformatics
18:1585-92.
Johnson X, Brcich T, Dun EA, Goussot M, Haurogne K, Beveridge CA, Rameau C (2006)
Branching genes are conserved across species. Genes controlling a novel signal in pea are
coregulated by other long-distance signals. Plant Physiol. 142:1014-1026.
Koski LB, Golding GB (2001) The closest BLAST hit is often not the nearest neighbor. J Mol Evol.
52:540-542.
Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic
genomes. Genome Res. 13:2178-2189.
Lim WK, Wang K, Lefebvre C, Califano A (2007) Comparative analysis of microarray
normalization procedures: effects on reverse engineering gene networks. Bioinformatics 23:282-288.
O'Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic
orthologs. Nucleic Acids Res. 33:D476-480.
Saitou N, Nei M (1987). The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Mol Biol Evol. 4:406-425.
Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D,
Lohmann JU (2005) A gene expression map of Arabidopsis thaliana development. Nat Genet.
37:501-506.
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science
278:631-637.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM,
Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S,
Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes.
BMC Bioinformatics 4:41.
Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F (2004) A Model Based Background
Adjustment for Oligonucleotide Expression Arrays. Technical Report. Johns Hopkins University,
Department of Biostatistics Working Papers, Baltimore, MD.
Figure legends
Figure 1:
An ancestral gene undergoes a duplication and gives rise to two paralogous genes A and B. Some
time later a speciation event gives rise to two species (light and dark). Each of these has retained
both paralogs in their genome, but only genes A for the dark species and B for the light species are
included on the chip. Simple pairwise comparison of the chip sequences alone would predict A
(dark) and B' (light) to be sequence orthologs as these would appear to be reciprocal closest
relatives. Including additional sequence data, such as sequences of outgroup species or the
sequences A' (light) and B (dark) missing on the chips but present in the genomes of the blue and
red species, can help clarify relationships and allow unambiguous assignment of sequence orthologs.
Figure 2:
Determining sequence orthologs based on the number of nodes separating them from the query.
This example provides a case where multiple clades containing both Medicago truncatula and
closely related Arabidopsis thaliana homologs are present. Sequences from the Arabidopsis
thaliana microarray chip ATH1-121501 are highlighted in blue, the query sequence for which this
tree was computed is highlighted in magenta and sequences from the Medicago truncatula
microarray chip are highlighted in yellow. The “permissive range” for the query is shown with a
colored background (green), red and blue lines, above and below the tree, respectively, show the
permissive range for the reverse lookup for two of the three potential sequence orthologs. Circles
show which of the Arabidopsis thaliana ATH1-121501 sequences were recovered in the respective
reverse lookups.
Figure 3:
Screenshot of results using the Arabidopsis thaliana ATH1-121501 chip consensus sequence
261590_at as a query. Part of the corresponding phylogenetic tree is displayed. Red (dark) dots
highlight Medicago truncatula sequences, yellow (light) dots highlight ATH1-121501 sequences
and a blue dot (bottom) highlights the query sequence. The tabs at the top of the page allow
navigation between BLAST results (BLAST), the alignment of HSPs (CLN), the derived HMM
(HMM), the HMM-search results (HMS), the alignment from which the phylogeny is inferred
(HLN) and either a text or graphical representation of the phylogenetic tree (TRE).
Figure 4:
Phylogenetic tree of a protein coding gene present in a wide variety of eukaryotes that is not
represented on either of the Affymetrix Arabidopsis thaliana chips. This is recognizable by the
sequence identifiers. The Arabidopsis thaliana sequences (yellow, light dot) have NCBI gi-numbers
instead of Affymetrix identifiers, signifying that these sequences were taken from the “nr” database
and not one of the 6-frame translations of the microarray chip consensus sequences. The bottom-
most sequence is the Medicago truncatula query sequence for which this tree was generated. Other
Medicago truncatula sequences are highlighted with a red (dark) dot.
Figure 5:
A) Overlap of reciprocal best BLAST hits (yellow) with AffyTrees orthology predictions (blue).
B) Histogram and fitted EVD of ortholog pairs predicted by both BLAST and AffyTrees over the
correlation coefficient of their expression values across the microarray experiments. For comparison
purposes, the fitted EVD curve (green) for this data is represented in 5C and 5D as well. A vertical
dotted line is placed at the peak of the EVD and the correlation coefficient at which the peak is
found is stated in black numbers at the bottom. The median value of each dataset is marked in the
top left corner.
C) Histogram and fitted EVD of the genes assigned orthologs in either BLAST (yellow) or
AffyTrees (blue) over the average correlation coefficient of the assigned orthologs.
D) Histogram and fitted EVD over the average correlation coefficient for genes assigned orthologs
randomly (black) or by indiscriminately using any sequences present in the AffyTrees phylogenies
as orthologs (magenta).