annotation of expressed sequence tags for the east …the exceptionally diverse species flocks of...

50
Salzburger et al. 1 Annotation of expressed sequence tags for the East African cichlid fish Astatotilapia burtoni and evolutionary analyses of cichlid ORFs Walter Salzburger 1,2 *, Susan C. P. Renn 3 *, Dirk Steinke 1,4 *, Ingo Braasch 1,5 , Hans A. Hofmann 6 and Axel Meyer 1 Lehrstuhl für Zoologie und Evolutionsbiologie, Department of Biology, University of Konstanz, 78467 Konstanz, Germany 2 present address: Zoological Institute, University of Basel, 4051, Switzerland 3 Department of Biology, Reed College, Portland, Oregon 97202, USA 4 present address: Guelph Centre for DNA Barcoding, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario N1G 2W1, Canada 5 present address: Physiological Chemistry I, Biozentrum, University of Würzburg, 97074 Würzburg, Germany 6 Section of Integrative Biology, University of Texas at Austin, Austin, Texas 78712, USA *These authors contributed equally to this work Corresponding author Email addresses: WS: [email protected] SCPR: [email protected] DS: [email protected] IB: [email protected] HAH: [email protected] AM: [email protected]

Upload: others

Post on 06-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 1

Annotation of expressed sequence tags for the East

African cichlid fish Astatotilapia burtoni and

evolutionary analyses of cichlid ORFs

Walter Salzburger1,2

*, Susan C. P. Renn

3*, Dirk Steinke

1,4*, Ingo Braasch

1,5 , Hans A.

Hofmann6 and Axel Meyer

1Lehrstuhl für Zoologie und Evolutionsbiologie, Department of Biology, University of

Konstanz, 78467 Konstanz, Germany 2present address: Zoological Institute, University of Basel, 4051, Switzerland

3Department of Biology, Reed College, Portland, Oregon 97202, USA

4present address: Guelph Centre for DNA Barcoding, Biodiversity Institute of Ontario,

University of Guelph, Guelph, Ontario N1G 2W1, Canada 5present address: Physiological Chemistry I, Biozentrum, University of Würzburg, 97074

Würzburg, Germany 6Section of Integrative Biology, University of Texas at Austin, Austin, Texas 78712,

USA

*These authors contributed equally to this work ¶Corresponding author

Email addresses:

WS: [email protected]

SCPR: [email protected]

DS: [email protected]

IB: [email protected]

HAH: [email protected]

AM: [email protected]

Page 2: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 2

Abstract

Background

The cichlid fishes in general, and the exceptionally diverse East African haplochromine

cichlids in particular, are famous examples of adaptive radiation and explosive

speciation. Here we report the collection and annotation of more than 12,000 expressed

sequence tags (ESTs) generated from three different cDNA libraries obtained from the

East African haplochromine cichlid species Astatotilapia burtoni and Metriaclima zebra.

Results

We first annotated more than 12,000 newly generated cichlid ESTs using the Gene

Ontology classification system. For evolutionary analyses, we combined these ESTs with

all available sequence data for haplochromine cichlids, which resulted in a total of more

than 45,000 ESTs. The ESTs represent a broad range of molecular functions and

biological processes. We compared the haplochromine ESTs to sequence data from those

available for other fish model systems such as pufferfish (Takifugu rubripes and

Tetraodon nigroviridis), trout, and zebrafish. We characterized genes that show a faster

or slower rate of base substitutions in haplochromine cichlids compared to other fish

species, as this is indicative of a relaxed or reinforced selection regime. Four of these

genes showed the signature of positive selection as revealed by calculating Ka/Ks ratios.

Conclusions

About 22 % of the surveyed ESTs were found to have cichlid specific rate differences

suggesting that these genes might play a role in lineage specific characteristics of

cichlids. We also conclude that the four genes with a Ka/Ks ratio greater than one appear

as good candidate genes for further work on the genetic basis of evolutionary success of

haplochromine cichlid fishes.

Page 3: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 3

Background

The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes

Tanganyika, Malawi and Victoria are prime examples for adaptive radiations and

explosive speciation [1-3]. More than 2,000 cichlid species have evolved in the last few

million years in the rivers and lakes of East Africa [1, 4-6]. Together with an additional

~1,000 species that are found in other parts of Africa, in South and Central America, in

Madagascar, and in India, the family Cichlidae represents one of the most species-rich

families of vertebrates. In addition to their unparalleled species-richness, cichlids are

famous for their ecological, morphological and behavioral diversity [1, 2, 7], for their

propensity for rapid speciation [5], for their capacity for sympatric speciation [8, 9], and

for the formation of parallel characters in independently evolved species flocks [10-12].

For these reasons, the cichlid fishes are an excellent model system to study basic

dynamics of evolution, adaptation and speciation. However, while the phylogenetic

relationships between the main cichlid lineages are largely established and some of the

cichlids’ evolutionary innovations have been identified (see e.g., [1, 2, 4, 7, 14]), little is

known about the genomic and transcriptional basis of the evolutionary success of the

cichlids.

The cichlid model system provides many advantages for evolutionary genomic

research. The hundreds of closely related yet morphologically diverse species in East

Africa’s cichlid species flocks are even more powerful than a ‘mutagenic screen’ (to

which these species assemblages have been compared to [1, 12]) in that they represent

combinations of alleles that confer a selective advantage under various ecological

pressures. Because of the possibility to produce viable crosses between different cichlid

Page 4: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 4

species in the lab [15], these alleles can be tied to particular phenotypic traits by means of

classical genetic experiments (see e.g., [16-19]). The close relatedness of the different

species allows the design of primer sets for the amplification of particular genomic DNA

regions such as candidate gene loci, microsatellites, or SNPs, which are applicable to a

wide range of species (see e.g., [18, 20-22]). The same is true for expression profiling

with cDNA microarrays that, once developed for one species, can be used for any East

African cichlid species [23].

A variety of genomic resources have already been established for East African

cichlid species. Genetic maps are available for the Nile tilapia Oreochromis niloticus [24,

25] and the Lake Malawi species Metriaclima zebra [18]. BAC libraries have been

constructed for O. niloticus [26] and M. zebra (available at the Hubbard Center for

Genome Studies), for the Lake Victoria haplochromine Paralabidochromis chilotes [27]

and for Astatotilapia burtoni from Lake Tanganyika and surrounding rivers [28]. cDNA

microarrays are available for A. burtoni [23] and for Lake Victoria haplochromines [29,

30]. Also, EST sequencing projects have been initiated (see e.g., [31]), and a BLAST

server for cichlid resources has been established

(http://cichlids.cgr.harvard.edu/cichlidblast/). Recently, the National Institute of Health

(NIH) has committed to sequencing four cichlid genomes. A detailed description of

genomic resources developed for cichlid fishes is available at

http://www.cichlidgenome.org.

Expressed sequence tags (ESTs) derived from the partial sequencing of cDNA

clones provide an economical approach to identify large numbers of genes that can be

used for comparative genomic and gene expression studies as well as for the detection of

Page 5: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 5

splice variants (see e.g. [32, 33]). Furthermore, EST projects facilitate genome annotation

and are therefore often applied in addition to genome sequencing projects. Due to the

large amount of data available in public databases, ESTs emerge as important resources

for comparative genome-wide surveys both among closely and more distantly related

taxa [34, 35]. A series of software applications have been developed to date to perform

such EST-based analyses (see e.g., [36-38]). Since ESTs reflect the coding portions of a

genome, they can also be used to test for different evolutionary rates in particular genes

when comparing different lineages, and to detect genes that have undergone positive

selection [34]. It is generally assumed that genes with a statistically significant increase in

substitution rates have experienced relaxed functional constraints, while genes, which

have not undergone accelerated substitution rates, have experienced purifying selection

and, thus, could not accumulate substitutions at random. Positive Darwinian selection, on

the other hand, is a phenomenon where selective pressure is favoring change. Natural

selection is commonly thought of as a process of editing genetic change so that only a

small number of mutational events are retained in natural populations. Under positive

selection, the retention of mutations is much closer to the rate at which mutations occur.

Here we report the collection and annotation of more than 12,000 ESTs generated

from two different cDNA libraries obtained from the East African cichlid species

Astatotilapia burtoni, as well as a smaller cDNA library from the Lake Malawi species

Metriaclima zebra. Astatotilapia burtoni has long been used as a model system to study

cichlid spawning behavior [7, 39, 40], social interactions [40-43], neural and behavioral

plasticity [44, 45], endocrinology [46], the visual system [47], as well as cichlid

development and embryogenesis [48]. In addition, the phylogenetic position of A. burtoni

Page 6: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 6

makes this species an ideal model system for comparative genomic research [28].

Astatotilapia burtoni, which belongs to the most species-rich lineage of cichlids, the

haplochromines, was shown to be a sister group to both the Lake Victoria region

superflock (~ 600 species) and the species flock of Lake Malawi (~ 1,000 species) [4, 5,

49, 50]. Three highly specialized haplochromine species from two species assemblages,

Paralabidochromis chilotes and Ptyochromis sp. “redtail sheller” from Lake Victoria and

Metriaclima zebra from Lake Malawi, have already been established as genomic models

[17, 27, 29, 31]. Important insight into cichlid (genome) evolution will be afforded by the

comparison of their genomes to that of A. burtoni, which has a more generalist life style

and is likely to resemble the ancestral lineage that seeded the cichlid adaptive radiations

in these two lakes [4, 7].

For EST sequencing, we utilized a cDNA library from A. burtoni brain tissue

(‘brain’) that was used for the construction of a cDNA microarray [23] and a newly

generated normalized cDNA library constructed from different A. burtoni tissues at

different developmental stages (‘pinky’). We annotated the ESTs on the basis of

similarity searches with BLAST and using the structured vocabulary provided by the

Gene Ontology Consortium [51], based on molecular studies of gene function in various

model organisms [52]. For evolutionary analyses, we combined our newly generated

ESTs with all available sequence data for haplochromine cichlids [31] and a previously

constructed library from skin tissue of the Lake Malawi species Metriaclima zebra (W.

Salzburger, H. A. Hofmann & A. Meyer; unpublished data), which resulted in a total of

more than 45,000 ESTs. We then compared the haplochromine ESTs to sequence data

from two pufferfish species (Takifugu rubripes and Tetraodon nigroviridis), trout, and

Page 7: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 7

zebrafish, and identified those ESTs with cichlid specific differences in evolutionary rates

with EverEST [36].

Page 8: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 8

Results

The 14,592 initial sequences were trimmed of vector and low-quality sequences and

filtered for minimum length (200 bp cut-off), identifying 12,070 high-quality ESTs

(Table 1). More than 11,000 of these ESTs (from 13,056 initial sequences) are derived

from two different Astatotilapia burtoni cDNA libraries – one made from brain tissue

(‘brain’), the other one from different tissues (‘pinky’) including brain, muscle, skin and

fin. The overall quality as measured by sequencing success rate and read-length was

better in the ‘pinky’ library. Also, there was much less redundancy in the ‘pinky’ library

(16% versus 30%), which might be the consequence of the normalization step applied to

this library or the use of different source tissues.

A total of 8,636 A. burtoni sequences assembled into EST contigs have an open

reading frame (ORF) of at least 400 bp. Of these, 1,219 (14%) had matches in the

Takifugu database and 7,417 (86%) had no matches when an expected value threshold (e-

value) of < 1 x 10-50 was used. 2,902 (34%) had matches in the Takifugu database with an

expected value threshold of < 1 x 10-15 and 3,460 (40%) had matches with an expected

value of < 1 x 10-5. Similar proportions were retrieved with other databases (Fig. 1).

Among the 8,363 A. burtoni assembled sequences, 2,977 could be annotated

according to Gene Ontology (GO) terms. Supplementary Tables 1, 2 and 3 use the

generic GO slim subset of terms (http://www.geneontology.org/GO.slims.shtml; Generic

GO slim; Mundodi and Ireland; downloaded 04/06/2007) that have been developed to

provide a useful summary of GO annotation for comparison of genomes, microarrays, or

cDNA collections when a broad overview of the ontology content is required. 2,692

ESTs could be assigned to genes listed in the molecular function ontology, 2,532 to genes

Page 9: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 9

listed in the biological process ontology, and 2,293 to genes listed in the cellular

components ontology, when using an e-value of < 1 x 10-12. Supplementary Figure 1, 2,

and 3 provide more detail of the specific fine-grained terms. Because a single A. burtoni

assembled sequence may be annotated in all three ontologies and according to multiple

ontology terms, a total of 27,451 annotations have been applied (10,926 among biological

process, 9,414 among molecular function, and 7,111 among cellular component).

For the comparative evolutionary analyses, we combined our newly generated

ESTs with previously published data from Paralabidochromis chilotes and P. sp. “redtail

sheller” [31] and about 1,000 sequences obtained from a Metriaclima zebra skin cDNA

library (W. Salzburger, H. A. Hofmann & A. Meyer; unpublished data). When using this

set of haplochromine cichlid ESTs as reference, we identified 759 open reading frames

that are present in all six databases used for comparative analyses (haplochromine

cichlids, Danio rerio, Homo sapiens, Oncorhynchus mykiss, Takifugu rubripes, and

Tetraodon nigroviridis).

In order to identify sequences that evolve significantly more rapidly or more

slowly in the haplochromine cichlid, we applied the triangle method implemented in

EverEST [36] to calculate the p-distance for each of this 759 ORFs in all fish species

relative to the human ortholog. The relative p-distances for three fish species were then

mapped in ternary diagrams. An example of such a ternary diagram is shown in Fig. 2a,

in this case showing the relative p-distances of cichlid, Takifugu rubripes, and Danio

rerio amino acid sequences with respect to the homologous Homo sapiens genes. Figure

2b depicts a diagram with Oncorhynchus mykiss amino acid sequence divergence instead

of haplochromine cichlid. The ternary diagrams show that in all combinations most genes

Page 10: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 10

are clustered around the center of the respective triangle, which indicates that, in general,

the p-distances relative to the human outgroup are similar in all fish species.

When compared to the green-spotted pufferfish (Tetraodon nigroviridis) and fugu

(Takifugu rubripes) (always with human as outgroup), 49 gene fragments appeared to

have a significantly faster rate of evolution in haplochromine cichlids, and 213 had a

slower rate. In the comparison including zebrafish and fugu, 52 genes were found to have

evolved faster and 185 genes slower in cichlids. When trout and zebrafish were used, 69

genes were faster and 139 genes evolved slower. In a comparison including trout and

fugu, 68 genes appeared to have a faster rate in haplochromines, and 132 had a slower

rate. In total 69 genes were found to have evolved faster, and 213 genes appeared to have

evolved with a significantly slower mutation rate in haplochromines compared to other

fish species. Altogether, about 22% of the surveyed ESTs were found to have

haplochromine specific rate differences in at least one of the comparisons suggesting that

these genes might play a role in lineage specific features of haplochromine cichlids. A set

of 170 cichlid genes appeared in all comparisons. Forty-eight cichlid genes were found to

have a higher rate of amino-acid substitution compared to the other fish species included

in this study, while 122 cichlid genes were found to have a slower rate. Cichlid sequences

that match Danio rerio, Takifugu rubripes, Tetraodon nigroviridis, and Oncorhynchus

mykiss genes and have a significantly higher or lower p-distance compared to the other

fish genes relative to the human outgroup are listed in Supplementary Table 4 and 5,

respectively.

A histogram of the abundance of amino acid sequence divergences of all five fish

species with respect to homologous human genes is depicted in Fig. 3. The p-distances

Page 11: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 11

appear normally distributed. With 0.211, cichlids show the lowest average distance

followed by Oncorhynchus mykiss (0.216), Danio rerio (0.239), Takifugu rubripes

(0.242), and Tetraodon nigroviridis (0.258). The average distance of all five fish species

to Homo sapiens is 0.233. We also used the 482 redundant sequences that were found in

all three large haplochromine cichlid EST datasets (P. chilotes and P. sp. “redtail sheller”

[31]; Astatotilapia burtoni, this study) to calculate mean pairwise p-distances. Within

these three cichlid species, we found a mean p-distance of 0.14 between A. burtoni and P.

chilotes, 0.17 between A. burtoni and P. sp. “redtail sheller”, and 0.08 between the two

Lake Victoria species P. chilotes and P. sp. “redtail sheller”.

We then calculated Ka/Ks ratios for all of these genes. Ka/Ks ratios greater than

one, which are indicative of positive selection in that gene, were found in four genes that

evolve more slowly in cichlids compared to the other fish species. The highest Ka/Ks ratio

(3.77) was found in the neuroendocrine convertase subtilisin/kexin type 1 that is

responsible for processing large precursor proteins into mature peptide hormones [53,

54]. In claudin 3, a member of the claudin family involved in the formation of tight

junctions in various tissues [55], the Ka/Ks ratio was 1.55. A Ka/Ks ratio of 1.30 was

observed in the catalyzing enzyme glutathione peroxidase 3, and a ratio of 1.19 was

found in ménage a trois 1 (MNAT1), which is a member of the CDK7-cyclin H complex

that functions in cell cycle progression [56], basal transcription, and DNA repair.

Page 12: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 12

Discussion

Expressed sequence tags are important genomic resources and their numbers in public

databases such as GenBank are rapidly increasing. Full-length cDNA and EST

sequencing projects typically accompany genome sequencing projects, as these data are

essential for the recognition and annotation of genes, the characterization of the

transcriptome, the identification of intron-exon boundaries and the detection of splice

variants in eukaryotes, etc. (see e.g. [32, 33, 57-59]). In addition, the standardized

procedure of cDNA library construction and normalization, and the comparably low costs

of large-scale DNA sequencing facilitate EST projects in organisms for which the whole

genome sequencing has not (yet) been completed. Thus, EST sequencing projects

outnumber genome-sequencing projects – particularly in groups with larger genome sizes

such as plants and vertebrates – leading to a large body of sequence data available for

comparative analyses. Large-scale EST analyses have been used in many other contexts,

such as primary gene expression assays [60, 61], the estimation of the total number of

genes in an organism [62], cDNA microarray annotations [63], or the construction of

genetic linkage maps [64-66]. Expressed sequence tags can further be used for

phylogenomics [35, 67], and for the identification of microRNAs [68].

Despite their many advantages, there are also some problems associated with

ESTs. For example, EST sequences typically cover only parts of a gene, so that two

sequences of the same gene might not necessarily overlap. That only fragments of a gene

are available also leads to problems with homology-based analyses such as BLAST.

Then, EST sequences often contain the untranslated regions (UTRs) that are present in

mRNAs but do not translate into amino acids. Finally, it is often difficult to figure out the

Page 13: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 13

proper reading frame, particularly in shorter ESTs, which impedes certain analyses. A

combination of multiple EST projects (as we have done here) helps to alleviate some of

the shortcomings inherent in EST data.

We have sequenced, annotated and conducted evolutionary analysis of ESTs of

haplochromine cichlids for several reasons. First, this large set of sequence data for

cichlid ORFs provides insight into the genome of a representative of haplochromine

cichlids, which are a main model system for the study of adaptive evolution and

explosive speciation [1-3]. Second, we wanted to extend the existing genomic resources

for Astatotilapia burtoni such as a genomic BAC library [28] by establishing cDNA

libraries from different tissues. Furthermore, these cDNA libraries provide the basis for

annotated cDNA microarrays that are being used for expression analyses in a variety of

cichlid species [23, 29, 69]. Finally, we were interested in identifying genes with a

different evolutionary rate in the rapidly radiating cichlid lineage compared to other fish

species, as well as in identifying genes that show the signature of adaptive evolution in

cichlids.

Of the two A. burtoni cDNA libraries that were used for EST sequencing, the

normalized mixed tissue library (‘pinky’) was of better quality. Not only were there much

fewer redundant sequences as compared to the brain library, which was mainly due to the

normalization step, but also the average insert size was larger and the average read length

was longer. Altogether, about 85% of the sequenced cDNA clones led to high-quality

ESTs of a length of >200 bp (86% in pinky, and 85% in brain). In the BLAST searches

against Takifugu rubripes, Tetraodon nigroviridis, and Danio rerio, between 14% (when

compared to T. rubripes; e-value ! 10-50) and 43 % (when compared to D. rerio; e-value

Page 14: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 14

! 10-5) of the A. burtoni ESTs led to hits (Fig. 1). This lies well in the range of other EST

sequencing projects (see e.g., [61, 63, 70]).

About 8,600 A. burtoni ORFs (or 75% of the high quality ESTs) were longer than

400 bp, and about 3,000 sequences could unambiguously be annotated and classified

following the vocabulary provided by the Gene Ontology Consortium (Supplementary

Tables 1-3; Supplementary Figs. 1-3). According to the Gene Ontology classification, it

appears that a broad range of genes involved in functions, processes and compartments

are represented in our EST set. This cichlid specific GO slim offers several advantages.

First, it offers a rapid visual interpretation of gene subsets. Second, because the cichlid

specific slim is built from those sequences used to build a cDNA microarray, it offers

maximal power when testing for over- or under-representation of gene lists while

reducing the need for correction for multiple hypothesis testing. Finally, it allows for a

less experimenter-biased interpretation of microarray results, or other genomics analyses

in a manner that can be easily compared between experiments.

One of our main goals was to characterize genes in haplochromine cichlids that

show a faster or slower rate of base substitutions in cichlids compared to other fish

species, as this is indicative of a relaxed or reinforced selection regime, respectively [34].

To this end, we combined our newly generated ESTs with previously published

sequences for Lake Victoria haplochromine cichlids [31] and about 1,000 sequences

obtained from a Metriaclima zebra skin cDNA library, which resulted in a total of about

45,000 ORFs. By means of homology searches against human, the two pufferfishes,

trout, and zebrafish using local BLAST, we identified a set of 759 ORFs that are present

in all species and that show a sufficient degree of homology (e-value ! 10-50) for further

Page 15: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 15

analyses with EverEST [36]. The number of genes with a cichlid-specific faster or slower

rate of molecular evolution (always with human as outgroup) varied when different fish

taxa were used in addition to the cichlid ORFs. However, we found a set of 170 genes (48

“faster” and 122 “slower”; Supplementary Tables 4, 5) that appeared in all comparisons

and are, thus, good candidates for playing an important role in the evolution of

(haplochromine) cichlid fishes.

When characterizing these genes further, by means of calculating Ka/Ks ratios, we

found that four genes (or 2.35 % of all deviating genes) showed the signature of adaptive

evolution in the haplochromine lineage. The highest Ka/Ks ratio (3.77) was found in the

neuroendocrine convertase subtilisin/kexin type 1, followed by claudin 3, (1.55),

glutathione peroxidase 3 (1.50), and ménage a trois 1 (1.19). All gene fragments that

show a Ka/Ks>1 are found among the more slowly evolving genes. These genes are now

candidate genes for further investigations. The gene with the highest Ka/Ks ratio appears

particularly interesting. It is known that neuroendocrine, such as gonadotropin releasing

hormone (GnRH), are involved in regulation of reproduction and behavior in A. burtoni

[54, 71].

In order to generate hypotheses regarding possible mechanisms by which the

rapidly or slowly evolving cichlid genes might contribute to the process of adaptive

radiation, we made use of the GO term annotations and cichlid specific slim. Over- and

under-represented terms were identified among the annotations for the rapidly and slowly

evolving cichlid genes (Table 2). Among the 759 ORFs for which p-distances were

calculated, over 6,000 total annotations were applied to 647, 675, and 619 ORFs

according to biological process, molecular function, and cellular component respectively.

Page 16: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 16

Therefore the majority of the 122 slowly evolving and 48 rapidly evolving genes could be

classified bioinformatically.

There was a relatively even distribution of rapidly evolving genes across all GO

categories. Only three terms, “response to stimulus”, “cellular process” and “actin

cytoskeleton” deviated significantly from the distribution expected by chance alone. The

most significant disproportionate under-representation for the rapidly evolving genes was

the category of response to stimulus for which only 1 of the 86 possible annotated ORFs

was included on the list.

The distribution across GO categories was highly non-uniform for the slowly

evolving genes. Many categories from each ontology were represented by significantly

more or fewer ORFs than would be expected by chance. Among those terms over-

represented we found several relating to cellular processes such as protein kinase

cascade, mitosis, and cell signaling as well as growth and cell adhesion, while metabolic

process was under-represented along with the secretory pathway category.

The GO analysis highlights the possible categories of genes that may play an

important role in the evolution of the haplochromine cichlid fishes. This analysis presents

hypotheses to be tested through focused experimental or sequence analysis. An

interesting contrast in GO analysis results was observed between the rapidly evolving

genes that showed little tendency to derive from a particular class and slowly evolving

genes that were more structured in their distribution. The lack of structure to the

distribution of rapidly evolving genes may reflect the possibility that specialization

among cichlids occurs along diverse biological pathways rather than a repeated

divergence of a given biological process or molecular function. The GO categories that

Page 17: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 17

are over-represented among slowly evolving genes could represent genes whose

functions are important for phenotypic plasticity or other traits linked to the successful

adaptive radiation, while those categories that are under-represented by slowly evolving

genes represent categories that are not as tightly constrained.

Our p-distance comparisons between the five fish species and human (as

outgroup) also revealed that cichlids show the lowest average p-distance compared to

Homo sapiens (Fig. 3). This might be an artifact that is due to the use of the

haplochromine cichlid sequence as query for all BLAST searches. Alternatively, as we

also found 122 slowly evolving genes in haplochromine cichlids, there might be a

tendency in haplochromines to retain ancestral forms and functions. The pairwise average

p-distance comparisons between the three cichlid species Paralabidochromis chilotes,

Ptyochromis sp. “redtail sheller”, and Astatotilapia burtoni revealed that the coalescence

time between the two Lake Victoria species (0.08) is about half compared to their

coalescence time with A. burtoni (0.14 and 0.17, respectively), which is in concordance

to the phylogenetic relationships between these three taxa [4].

Page 18: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 18

Conclusions

Here we report the sequencing and annotation of more than 11,000 ESTs from the East

African haplochrormine cichlid Astatotilapia burtoni. Our EST set comprises a broad

range of genes involved in functions, processes and compartments. By combining the A.

burtoni ESTs with publicly available ORFs from two Lake Victoria haplochromines and

subsequent comparisons to other fish model systems, we identify a set of 170 genes with

haplochromine-specific differences in evolutionary rates. These genes appear as good

candidates for playing an important role in the evolution of the exceptional diversity

found in (haplochromine) cichlids. Interestingly, genes that were more slowly evolving in

the cichlid lineage were not evenly distributed across Gene Ontology categories; classes

that are over-represented could represent genes whose functions are important for

successful adaptive radiation. We also identify four genes with a Ka/Ks ratio greater than

one, which are, hence, likely to have undergone positive selection in haplochromines.

The A. burtoni ESTs provide novel insights into the genome of haplochromine cichlids

and will serve as valuable resource for researchers working in the field of (cichlid)

evolutionary genomics, particularly in the light of the forthcoming sequencing of four

cichlid genomes.

Page 19: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 19

Methods

Fishes

Astatotilapia burtoni were kept at Stanford, and at the Tierforschungsanlage of the

University of Konstanz under standard conditions (12 h light, 12 h dark; 26°C). For RNA

isolation, fishes were sacrificed after anesthetization with MS 222 (Sigma).

Pinky cDNA Library Construction

For the preparation of the pinky cDNA library, total RNA was isolated from the

following tissues of adult A. burtoni: brain, caudal fin, anal fin (male), lips, muscle, ovary

(female), and skin. Additionally, we isolated total RNA from a juvenile individual (about

30 days after fertilization). Total RNA was isolated by guanidine thiocyanate/phenol-

chlorophorm-isoamyl alcohol extraction and lithium-chloride precipitation. The different

RNA samples were pooled and cDNA was synthesized using the SMART PCR cDNA

Synthesis Kit (Clontech) following the manufacturers protocol. Amplified cDNA was

purified using the QIAquick PCR Purification Kit (Qiagen) and concentrated by ethanol

precipitation. The pellet was dissolved in 10 !l H2O. For normalization, three microliters

of purified cDNA were mixed with 1 !l hybridization buffer (200 mM HEPES-HCl, pH

8.0; 2 M NaCl) and incubated at 95°C for 5 minutes and at 70°C overnight. Then, 1 !l of

DNAse buffer (500 mM Tris-HCl, pH 8.0; 50 mM MgCl2, 10 mM DTT) and 0.5 !l of

DSN enzyme (duplex-specific nuclease; Evrogen, Russia) were added, and the mix was

incubated at 65°C for 20 minutes. The normalization reaction was terminated by adding 1

!l 50 mM EDTA and incubation at 95°C for 7 minutes. Normalized cDNA was PCR

amplified (20 cycles) and cloned into pAL 16 vectors.

Page 20: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 20

Brain cDNA Library Construction

A full-length, directional (EcoRI – XhoI) cDNA library was constructed in Lambda ZapII

phage vector (Stratagene) with mRNA from A. burtoni brains (both sexes at all stages of

development and social condition were included). Construction of this library has

previously been described in ref. [23]. For cDNA sequencing, we used 2!l of purified

PCR products, which were also used for the construction of a cDNA microarray [23].

DNA-sequencing and Sequence Analysis

For sequencing of the normalized pinky cDNA library we used purified plasmid DNA

from 1 ml colonies that were grown overnight. Plasmid DNA was directly sequenced

using T7 primers and the BigDye Termination Reaction Kit v3.0 (Applied Biosystems)

on ABI 3730 and ABI 3100 automated capillary DNA sequencers (Applied Biosystems).

Sequences of the brain cDNA library were determined on an ABI 3100 DNA sequencer

after cycle sequencing reactions from purified PCR products that were available from the

construction of a cDNA microarray [23] using the primer CSVP3 (5’ –

AAGCGCGCAATTAACCCTCACTA – 3’) and the BigDye Termination Reaction Kit

v3.0 (Applied Biosystems).

Base-calling and quality trimming were performed with PHRED

(http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm) using a quality

score >20. Vectors were trimmed with Sequencher 4.2.2 (Genecodes). Those ESTs

having a total length of >200 bp after quality and vector trimming were considered “high-

quality ESTs”. Screens for possible contaminations were conducted by blastn searches

against the E. coli genome, and the EST_human, EST_mouse and EST_others databases

Page 21: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 21

(downloaded in March 2005). Sequences have been deposited in GenBank under

accession numbers CN468542 – CN472211 (brain library) and DY625779 – DY 632420

(pinky library).

Annotation of A. burtoni ESTs

High quality A. burtoni ESTs were screened by tblastx searches against protein data from

Danio rerio (Zebrafish Sequencing Group at the Sanger Institute), Homo sapiens

(GenBank) and Takifugu rubripes (JGI Fugu v3.0) as well as ESTs from Oncorhynchus

mykiss and Tetraodon nigroviridis (GenBank) using the standard vertebrate code for

translation into amino acids. The expected value thresholds (e-values) were set to < 1 x

10-5

, < 1 x 10-15

, and < 1 x 10-50

. The proper open reading frame for A. burtoni ESTs was

determined with EverEST [36], based on the results from these BLAST searches.

For functional annotation of A. burtoni ESTs, we followed the vocabulary

provided by the Gene Ontology Consortium using the GO database

(http://www.geneontology.org). Gene Ontology terms were applied to the cichlid

assembled sequences by BLAST comparison to the Gene Ontology database (release

200704), which represents protein sequence for all contributed genes for which at least

one GO annotation has been applied based on experimental evidence rather than only

inferred electronic annotation of sequence. All GO annotations at any confidence level

were then transferred from the single best-hit gene using e-value < 10-12

as a threshold.

The collection of GO terms used was "slimmed" in order to produce useful summaries of

the annotations.

Page 22: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 22

This cichlid specific slim (Supplementary Figs. 1-3) is based upon statistical

consideration for analysis of microarray results. The leaf most nodes have been selected

for which 20 or more A. burtoni assembled sequences were annotated with this term.

Parent nodes were retained only when an additional 20 A. burtoni assembled sequences

were included. To assess the enrichment of particular classes of genes among the genes

showing deviating rate of molecular evolution, Gene Ontology annotation terms were

tested for significant over- and under-representation in either the higher or lower p-

distance list using a hypergeometric test implemented in the BINGO plugin [72] for

Cytoscape (http://www.cytoscape.org/). Due to the exploratory nature of this analysis and

controversial application of correction techniques [73], reported p-values are not

corrected for multiple testing. Only the representation for the leaf most node is reported

except in cases when a larger, parent node showed increased significance. The directed

acyclic graphs (DAGs) were created using hierarchical visualization in Cytoscape and

manually adjust to facilitate comprehension.

Evolutionary Analyses

For evolutionary analyses of ESTs from haplochromine cichlids, we combined our newly

generated high-quality ESTs from A. burtoni with previously published ESTs from

Paralabidochromis chilotes and Ptyochromis sp. “redtail sheller” [31] and with about

1,000 ESTs obtained from a cDNA library made from Metriaclima zebra skin tissue (W.

Salzburger, H. A. Hofmann & A. Meyer, unpublished). The combined dataset, including

more than 45,000 ESTs, was BLASTed against protein data from Danio rerio, Homo

sapiens and Takifugu rubripes as well as ESTs from Oncorhynchus mykiss and Tetraodon

Page 23: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 23

nigroviridis (see above for source of data) using the translated BLAST routine and the

standard vertebrate code. BLAST searches were performed with an e-value of < 1 x 10-50

in order to achieve high levels of confidence in the similarity searches. The cichlid query

sequences and the best hits from every single BLAST search against the different

databases were imported into EverEST [36].

In order to identify coding sequences showing a deviating rate of molecular

evolution in haplochromine cichlids compared to other fish lineages we applied the

triangle method implemented in EverEST. In this approach, the query sequences are

aligned to their best BLAST hits in two ingroup and one outgroup taxa using the T-

Coffee algorithm [74] as implemented in EverEST [36]. This reveals multiple sequence

alignments consisting of four taxa. Then, uncorrected pairwise p-distances are calculated

for all taxon pairs in each alignment, which are used to construct neighbor-joining trees

and, after rooting with the outgroup sequences, for a global ternary representation. A

relative rate test was applied to each of the orthologous groups. We applied the

nonparametric rate test developed by Tajima (ref. [75]), and compared the genes with

their human and their fish orthologs in order to identify higher or lower substitution rates.

For these analyses, we used the human sequences as outgroup since tetrapods are

valid outgroup taxa for teleost fish and the human genome is the most complete and best

annotated genome among those. In addition to our haplochromine cichlid query

sequences, we used different sets of ingroup taxa in order to minimize biasing effects due

to sparse taxon sampling. We used the following combinations of taxa for our

evolutionary rate analyses using 759 ORFs that have been found in all datasets: (human,

(haplochromine cichlid, Danio rerio, Takifugu rubripes)) (Fig. 2a), (human,

Page 24: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 24

(haplochromine cichlid, Danio rerio, Tetraodon nigroviridis)) (not shown), (human,

(haplochromine cichlid, Danio rerio, Oncorhynchus mykiss)) (not shown). As a control,

we also analyzed a data set without the cichlid-query sequences for the same set of ORFs

(human, (Danio rerio, Oncorhynchus mykiss, Takifugu rubripes)) (Fig. 2b). We note that

this approach might lead to an underestimation of the number of faster evolving genes, as

genes that accumulated too many mutations are likely to not get chosen in the stringent

initial BLAST searches.

For orthologous groups, where the p-distance in the haplochromine cichlids were

significantly (p < 0.05) higher or lower compared to other fish, the ratio of the number of

nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous

substitutions per synonymous site (Ks) was calculated based on a likelihood approach

[76] to evaluate the selective forces acting on those proteins. The Ka/Ks ratio is an

indicator of the form of sequence evolution, with Ka/Ks >>1 providing strong evidence

that positive selection has acted to change the protein sequence.

We also constructed a histogram of amino acid sequence divergence of all five

fish datasets with respect to homologous human sequences. We finally used the

redundant sequences in the three datasets P. chilotes, P. sp. “redtail sheller”, and A.

burtoni to calculate pairwise average p-distances.

List of abbreviations used:

DAG, directed acyclic graph; EST, expressed sequence tag; GO, gene ontology; ORF,

open reading frame

Page 25: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 25

Author’s contributions

WS, HAH and AM designed the study. WS and HAH were involved in library

construction; WS and IB carried out the molecular work; WS, DS, SCPR, and IB

performed the analyses. All authors contributed to the preparation of the manuscript.

They read and approved the final version.

Acknowledgements

We thank E. Hespeler for technical assistance in the laboratory, P. Jantzen for assistance

with GO figures and tables, and R. D. Fernald, in whose laboratory the brain cDNA

library was constructed; W.S. was supported by a Marie Curie Fellowship of the EU, and

grants from the Landesstiftung-Baden Württemberg gGmbH and the Center for Junior

Research Fellows, University of Konstanz; S.C.P. R was supported by an NIH-NRSA

grant; H.A.H. was supported by a NIH-NIGMS grant GM068763, the Bauer Center for

Genomics Research at Harvard University and the Institute for Cellular and Molecular

Biology at the University of Texas, Austin; A.M. was supported by the Deutsche

Forschungsgemeinschaft (DFG) and the University of Konstanz.

Page 26: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 26

References 1. Kocher TD: Adaptive evolution and explosive speciation: the cichlid fish

model. Nature Reviews Genetics 2004, 5:288-298.

2. Salzburger W, Meyer A: The species flocks of East African cichlid fishes:

recent advances in molecular phylogenetics and population genetics.

Naturwissenschaften 2004, 91:277-290.

3. Kornfield I, Smith PF: African Cichlid Fishes: Model systems for evolutionary

biology. Annu Rev Ecol Syst 2000, 31:163-196.

4. Salzburger W, Mack T, Verheyen E, Meyer A: Out of Tanganyika: Genesis,

explosive speciation, key-innovations and phylogeography of the

haplochromine cichlid fishes. BMC Evolutionary Biology 2005, 5:17.

5. Verheyen E, Salzburger W, Snoeks J, Meyer A: Origin of the superflock of

cichlid fishes from Lake Victoria, East Africa. Science 2003, 300:325-329.

6. Genner MJ, Seehausen O, Lunt DH, Joyce DA, Shaw PW, Carvalho GR, Turner

GF: Age of cichlids: new dates for ancient lake fish radiations. Mol Biol Evol

2007, 24:1269-1282.

7. Fryer G, Iles TD: The cichlid fishes of the Great Lakes of Africa: Their biology

and Evolution. Edinburgh: Oliver & Boyd; 1972.

8. Barluenga M, Stolting KN, Salzburger W, Muschick M, Meyer A: Sympatric

speciation in Nicaraguan crater lake cichlid fish. Nature 2006, 439:719-723.

9. Schliewen UK, Tautz D, Paabo S: Sympatric speciation suggested by

monophyly of crater lake cichlids. Nature 1994, 368:629-632.

10. Kocher TD, Conroy JA, McKaye KR, Stauffer JR: Similar morphologies of

cichlid fish in lakes Tanganyika and Malawi are due to convergence. Mol

Phylogenet Evol 1993, 2:158-165.

11. Stiassny MLJ, Meyer A: Cichlids of the Rift Lakes. Scientific American 1999,

280:64-69.

12. Meyer A: Phylogenetic relationships and evolutionary processes in East

African cichlids. Trends in Ecology and Evolution 1993, 8:279-284.

13. Barlow GW: The Cichlid Fishes. Nature's Grand Experiment in Evolution.

Cambridge, MA: Perseus Publishing; 2000.

14. Liem KF: Evolutionary strategies and morphological innovations: cichlid

pharyngeal jaws. Systematic Zoology 1973, 22:425-441.

15. Crapon de Caprona MD, Fritzsch B: Interspecific fertile hybrids of

haplochromine Cichlidae (Teleostei) and their possible importance for

speciation. Netherlands Journal of Zoology 1984, 34:503-538.

16. Albertson RC, Kocher TD: Genetic architecture sets limits on transgressive

segregation in hybrid cichlid fishes. Evolution Int J Org Evolution 2005,

59:686-690.

17. Streelman JT, Albertson RC, Kocher TD: Genome mapping of the orange

blotch colour pattern in cichlid fishes. Mol Ecol 2003, 12:2465-2471.

18. Albertson RC, Streelman JT, Kocher TD: Directional selection has shaped the

oral jaws of Lake Malawi cichlid fishes. Proc Natl Acad Sci U S A 2003,

100:5252-5257.

Page 27: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 27

19. Albertson RC, Streelman JT, Kocher TD, Yelick PC: Integration and evolution

of the cichlid mandible: the molecular basis of alternate feeding strategies.

Proc Natl Acad Sci U S A 2005, 102:16287-16292.

20. Terai Y, Morikawa N, Okada N: The evolution of the pro-domain of bone

morphogenetic protein 4 (Bmp4) in an explosively speciated lineage of East

African cichlid fishes. Mol Biol Evol 2002, 19:1628-1632.

21. Sugie A, Terai Y, Ota R, Okada N: The evolution of genes for pigmentation in

African cichlid fishes. Gene 2004, 343:337-346.

22. Carleton KL, Kocher TD: Cone opsin genes of african cichlid fishes: tuning

spectral sensitivity by differential gene expression. Mol Biol Evol 2001,

18:1540-1550.

23. Renn SC, Aubin-Horth N, Hofmann HA: Biologically meaningful expression

profiling across species using heterologous hybridization to a cDNA

microarray. BMC Genomics 2004, 5:42.

24. Lee BY, Lee WJ, Streelman JT, Carleton KL, Howe AE, Hulata G, Slettan A,

Stern JE, Terai Y, Kocher TD: A Second Generation Genetic Linkage Map of

Tilapia (Oreochromis spp.). Genetics 2005, 170:237-244.

25. Kocher TD, Lee WJ, Sobolewska H, Penman D, McAndrew B: A genetic linkage

map of a cichlid fish, the tilapia (Oreochromis niloticus). Genetics 1998,

148:1225-1232.

26. Katagiri T, Asakawa S, Minagawa S, Shimizu N, Hirono I, Aoki T: Construction

and characterization of BAC libraries for three fish species; rainbow trout,

carp and tilapia. Anim Genet 2001, 32:200-204.

27. Watanabe M, Kobayashi N, Fujiyama A, Okada N: Construction of a BAC

library for Haplochromis chilotes, a cichlid fish from Lake Victoria. Genes

Genet Syst 2003, 78:103-105.

28. Lang M, Miyake T, Braasch I, Tinnemore D, Siegel N, Salzburger W, Amemiya

CT, Meyer A: A BAC library of the East African haplochromine cichlid fish

Astatotilapia burtoni. J Exp Zoolog B Mol Dev Evol 2006, 306B:35-44.

29. Kijimoto T, Watanabe M, Fujimura K, Nakazawa M, Murakami Y, Kuratani S,

Kohara Y, Gojobori T, Okada N: cimp1, a novel actin family metalloproteinase

gene from East African cichlids, is differentially expressed between species

during growth. Mol Biol Evol 2005, 22:1649-1660.

30. Kobayashi N, Watanabe M, Kijimoto T, Fujimura K, Nakazawa M, Ikeo K,

Kohara Y, Gojobori T, Okada N: magp4 gene may contribute to the

diversification of cichlid morphs and their speciation. Gene 2006, 373:126-

133.

31. Watanabe M, Kobayashi N, Shin-i T, Horiike T, Tateno Y, Kohara Y, Okada N:

Extensive analysis of ORF sequences from two different cichlid species in

Lake Victoria provides molecular evidence for a recent radiation event of the

Victoria species flock: identity of EST sequences between Haplochromis

chilotes and Haplochromis sp. "Redtailsheller". Gene 2004, 343:263-269.

32. Gibson G, Muse SV: A Primer of Genome Science. Sunderland, MA: Sinauer

Associates, Inc.; 2002.

33. Gerhold D, Caskey CT: It's the genes! EST access to human genome content.

Bioessays 1996, 18:973-981.

Page 28: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 28

34. Steinke D, Salzburger W, Braasch I, Meyer A: Many genes in fish have species-

specific asymmetric rates of molecular evolution. BMC Genomics 2006, 7:20.

35. Steinke D, Salzburger W, Meyer A: Higher teleostean relationships revealed

from genome-wide phylogenetic analyses. J Mol Evol 2006, in press.

36. Steinke D, Salzburger W, Meyer A: EverEST - a phylogenomic EST database

approach. PhyloInformatics 2004, 6:1-4.

37. Wasmuth JD, Blaxter ML: prot4EST: translating expressed sequence tags

from neglected genomes. BMC Bioinformatics 2004, 5:187.

38. Nilsson RH, Rajashekar B, Larsson KH, Ursing BM: galaxieEST: addressing

EST identity through automated phylogenetic analysis. BMC Bioinformatics

2004, 5:87.

39. Wickler W: 'Egg-dummies' as natural releasers in mouth-breeding cichlids.

Nature 1962, 194:1092-1093.

40. Wickler W: Zur Stammesgeschichte funktionell korrelierter Organ- und

Verhaltensmerkmale: Ei-Attrappen und Maulbrüten bei afrikanischen

Cichliden. Zeitschrift für Tierpsychologie 1962, 19:129-164.

41. Wickler W: Haplochromis burtoni (Cichlidae) Ablaichen. In Encyclopedia

Cinematographica. Göttingen: Institut für den wissenschaftlichen Film; 1969.

42. Crapon de Caprona MD: Olfactory communication in a cichlid fish,

Haplochromis burtoni. Zeitschrift für Tierpsychologie 1980, 52:113-134.

43. Grosenick L, Clement TS, Fernald RD: Fish can infer social rank by

observation alone. Nature 2007, 445:429-432.

44. Hofmann HA, Fernald RD: What cichlids tell us about the social regulation of

brain and behavior. Journal of Aquariculture and Aquatic Sciences 2001, 9:1-

15.

45. Hofmann HA: Functional genomics of neural and behavioral plasticity. J

Neurobiol 2003, 54:272-282.

46. Robison RR, White RB, Illing N, Troskie BE, Morley M, Millar RP, Fernald RD:

Gonadotropin-releasing hormone receptor in the teleost Haplochromis

burtoni: structure, location, and function. Endocrinology 2001, 142:1737-

1743.

47. Kroger RH, Campbell MC, Fernald RD: The development of the crystalline

lens is sensitive to visual input in the African cichlid fish, Haplochromis

burtoni. Vision Res 2001, 41:549-559.

48. Hagedorn M, Mack AF, Evans B, Fernald RD: The embryogenesis of rod

photoreceptors in the teleost fish retina, Haplochromis burtoni. Brain Res

Dev Brain Res 1998, 108:217-227.

49. Meyer A, Kocher TD, Basasibwaki P, Wilson AC: Monophyletic origin of Lake

Victoria cichlid fishes suggested by mitochondrial DNA sequences. Nature

1990, 347:550-553.

50. Salzburger W, Meyer A, Baric S, Verheyen E, Sturmbauer C: Phylogeny of the

Lake Tanganyika cichlid species flock and its relationship to the Central and

East African haplochromine cichlid fish faunas. Syst Biol 2002, 51:113-135.

51. Consortium TGO: Creating the gene ontology resource: design and

implementation. Genome Res 2001, 11:1425-1433.

Page 29: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 29

52. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP,

Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification

of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.

53. Jansen E, Ayoubi TA, Meulemans SM, Van de Ven WJ: Neuroendocrine-

specific expression of the human prohormone convertase 1 gene. Hormonal

regulation of transcription through distinct cAMP response elements. J Biol

Chem 1995, 270:15391-15397.

54. Hofmann HA: Gonadotropin-releasing hormone signaling in behavioral

pasticity. Current Opinion in Neurobiology 2006, 16:343-350.

55. Morita K, Furuse M, Fujimoto K, Tsukita S: Claudin multigene family

encoding four-transmembrane domain protein components of tight junction

strands. Proc Natl Acad Sci U S A 1999, 96:511-516.

56. Talukder AH, Mishra SK, Mandal M, Balasenthil S, Mehta S, Sahin AA, Barnes

CJ, Kumar R: MTA1 interacts with MAT1, a cyclin-dependent kinase-

activating kinase complex ring finger factor, and regulates estrogen receptor

transactivation functions. J Biol Chem 2003, 278:11676-11685.

57. Gupta S, Zink D, Korn B, Vingron M, Haas SA: Strengths and weaknesses of

EST-based prediction of tissue-specific alternative splicing. BMC Genomics

2004, 5:72.

58. Banfi S, Borsani G, Rossi E, Bernard L, Guffanti A, Rubboli F, Marchitiello A,

Giglio S, Coluccia E, Zollo M, et al: Identification and mapping of human

cDNAs homologous to Drosophila mutant genes through EST database

searching. Nat Genet 1996, 13:167-174.

59. Bailey LC, Jr., Searls DB, Overton GC: Analysis of EST-driven gene

annotation in human genomic sequence. Genome Res 1998, 8:362-376.

60. Schmitt AO, Specht T, Beckmann G, Dahl E, Pilarsky CP, Hinzmann B,

Rosenthal A: Exhaustive mining of EST libraries for genes differentially

expressed in normal and tumour tissues. Nucleic Acids Res 1999, 27:4251-

4260.

61. Habermann B, Bebin AG, Herklotz S, Volkmer M, Eckelt K, Pehlke K, Epperlein

HH, Schackert HK, Wiebe G, Tanaka EM: An Ambystoma mexicanum EST

sequencing project: analysis of 17,352 expressed sequence tags from

embryonic and regenerating blastema cDNA libraries. Genome Biol 2004,

5:R67.

62. Ewing B, Green P: Analysis of expressed sequence tags indicates 35,000

human genes. Nat Genet 2000, 25:232-234.

63. Whitfield CW, Band MR, Bonaldo MF, Kumar CG, Liu L, Pardinas JR,

Robertson HM, Soares MB, Robinson GE: Annotated expressed sequence tags

and cDNA microarrays for studies of brain and behavior in the honey bee.

Genome Res 2002, 12:555-566.

64. Smith JJ, Kump DK, Walker JA, Parichy DM, Voss SR: A Comprehensive EST

Linkage Map for Tiger Salamander and Mexican Axolotl: Enabling Gene

Mapping and Comparative Genomics in Ambystoma. Genetics 2005.

65. Scheetz TE, Raymond MR, Nishimura DY, McClain A, Roberts C, Birkett C,

Gardiner J, Zhang J, Butters N, Sun C, et al: Generation of a high-density rat

EST map. Genome Res 2001, 11:497-502.

Page 30: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 30

66. Lorenzen MD, Doyungan Z, Savard J, Snow K, Crumly LR, Shippy TD, Stuart JJ,

Brown SJ, Beeman RW: Genetic linkage maps of the red flour beetle,

Tribolium castaneum, based on bacterial artificial chromosomes and

expressed sequence tags. Genetics 2005, 170:741-747.

67. Philippe H, Lartillot N, Brinkmann H: Multigene analyses of bilaterian animals

corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia.

Mol Biol Evol 2005, 22:1246-1253.

68. Zhang BH, Pan XP, Wang QL, Cobb GP, Anderson TA: Identification and

characterization of new plant microRNAs using EST analysis. Cell Res 2005,

15:336-360.

69. Aubin-Horth N, Desjardins JK, Martei YM, Balshine S, Hofmann HA:

Masculinized dominant females in a cooperatively breeding species. Mol Ecol

2007, 16:1349-1358.

70. Rise ML, von Schalburg KR, Brown GD, Mawer MA, Devlin RH, Kuipers N,

Busby M, Beetz-Sargent M, Alberto R, Gibbs AR, et al: Development and

application of a salmonid EST database and cDNA microarray: data mining

and interspecific hybridization characteristics. Genome Res 2004, 14:478-490.

71. Trainor BC, Hofmann HA: Somatostatin regulates aggressive behavior in an

African cichlid fish. Endocrinology 2006, 147:5119-5125.

72. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess

overrepresentation of gene ontology categories in biological networks.

Bioinformatics 2005, 21:3448-3449.

73. Ge YC, Dudoit S, Speet TP: Resampling-based multiple testing for microarray

data analysis. Test 2003, 12:1-77.

74. Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and

accurate multiple sequence alignment. J Mol Biol 2000, 302:205-217.

75. Tajima F: Simple methods for testing the molecular evolutionary clock

hypothesis. Genetics 1993, 135:599-607.

76. Yang Z: Likelihood ratio tests for detecting positive selection and application

to primate lysozyme evolution. Mol Biol Evol 1998, 15:568-573.

Page 31: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 31

Tables

Table 1 - Expressed sequence tag (EST) summary

Total sequences 13,056

High quality sequences 12,070 (between 200 and 1,564 bp)

Brain library (A. burtoni) 4,570

Mixed tissue library (A. burtoni) 6,541

Skin library (P. zebra) 959

Page 32: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 32

Table 2 - Gene Ontology terms which are over- or under-represented among the

rapidly or slowly evolving cichlid ORFs. Hypergeometic p-values are reported

uncorrected for multiple testing. The number of ORFs of deviating evolutionary rate (#)

relative to the number of core set ORFs (total) is given.

Representation GO-ID p-value # total Description

biological process 42 with higher p-distance (647 annotated)

over none

under GO:0050896 0.0161 1 86 response to stimulus

GO:0009987 0.0439 12 273 cellular process

molecular function 44 with higher p-distance (675 annotated)

none

cellular component 40 with higher p-distance (619 annotated)

over GO:0015629 0.0327 6 39 actin cytoskeleton

under none

biological process 103 with lower p-distance (647 annotated)

over GO:0009987 0.0024 57 273 cellular process

over GO:0007243 0.0052 8 19 protein kinase cascade

over GO:0007155 0.0205 7 19 cell adhesion

over GO:0040007 0.0208 6 15 growth

over GO:0007154 0.0230 25 109 cell communication

over GO:0007267 0.0071 7 16 cell-cell signaling

over GO:0016477 0.0290 5 12 cell migration

over GO:0040008 0.0290 5 12 regulation of growth

over GO:0007409 0.0308 3 5 axonogenesis

over GO:0007610 0.0308 3 5 behavior

over GO:0015674 0.0308 3 5 di-, tri-valent inorganic cation transport

over GO:0019752 0.0376 10 35 carboxylic acid metabolic process

over GO:0007067 0.0402 4 9 mitosis

over GO:0007417 0.0402 4 9 central nervous system development

under GO:0008152 0.0016 63 477 metabolic process

under GO:0046907 0.0180 2 44 intracellular transport

under GO:0045045 0.0295 0 20 secretory pathway

under GO:0009117 0.0421 0 18 nucleotide metabolic process

molecular function 110 with lower p-distance (675 annotated)

over GO:0004930 0.0157 4 7 G-protein

over GO:0003774 0.0233 6 15 motor activity

over GO:0005262 0.0264 2 2 calcium channel

over GO:0008047 0.0324 6 16 enzyme activator activity

over GO:0005509 0.0333 12 43 calcium ion binding

over GO:0019899 0.0435 4 9 enzyme binding

under GO:0005525 0.0116 1 36 GTP binding

under GO:0005198 0.0407 8 85 structural molecule activity

under GO:0051082 0.0467 0 17 unfolded protein binding

under GO:0003743 0.0467 0 17 translation initiation factor activity

under GO:0003924 0.0481 1 27 GTPase activity

under GO:0003676 0.0483 17 147 nucleic acid binding

cellular component 97 with lower p-distance (619 annotated)

over GO:0016021 0.0096 22 88 integral to membrane

over GO:0015630 0.0388 5 13 microtubule cytoskeleton

over GO:0005625 0.0479 6 18 soluble fraction

over GO:0005615 0.0479 6 18 extracellular space

under GO:0032991 0.0001 19 222 macromolecular complex

under GO:0043234 0.0015 18 195 protein complex

under GO:0043226 0.0089 56 425 organelle

under GO:0030529 0.0139 4 65 ribonucleoprotein complex

under GO:0005829 0.0267 3 51 cytosol

under GO:0005739 0.0311 6 75 mitochondrion

Page 33: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 33

Figures Figure 1 - The proportion of assembled haplochromine cichlid sequences with and

without BLAST matches compared to three databases (Takifugu rubripes, Danio

rerio, and Oncorhynchus mykiss). The pie charts indicate the relative number of

BLAST hits (blue) versus the percentage fraction, for which no BLAST hit was retrieved

(red) for three different e-values (< 10-50

, < 10-15

, and <10-5

, respectively).

Page 34: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 34

Figure 2 - Ternary representation of relative distances of ORFs of three fish

species compared to their human orthologs. (a) Haplochromine cichlid, Danio rerio,

and Takifugu rubripes, (b) Danio rerio, Oncorhynchus mykiss, and Takifugu rubripes.

Each dot represents a single ORF, the position of the dot within the ternary diagram

indicates the relative distance of this ORF in each of the three fish species compared to

the orthologous ORF in human. We were interested in identifying those ORFs that show

a faster or slower rate of molecular evolution in the haplochromine cichlids.

Page 35: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 35

Figure 3 - Histogram of the abundance of amino acid sequence divergences of all

five fish species (haplochromine cichlid, Danio rerio, Takifugu rubripes, Tetraodon

nigroviridis, and Oncorhynchus mykiss) with respect to human genes. P-distances

have been calculated for a set of 759 ORFs found in all five fish species and plotted in

categories of 0.1.

Page 36: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 36

Supplementary Tables

Supplementary Table 1 - Gene ontology (generic GO slim subset for molecular

function). The classification is hierarchical. Indented terms are children of parent terms

listed above. For each term, the number of A. burtoni assembled sequences that match

genes to which Gene Ontology annotations have been assigned at, or below, this general

level is given. Note that genes may be assigned to more than one term and child terms

may have more than one parent term. For parent terms, the total number of A. burtoni

assembled sequences is given in parentheses. Match means that the annotation derives

from a gene that was the “best hit” for the A. burtoni sequence at and e-value <10-12

.

Page 37: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 37

Molecular Function 2692

antioxidant activity 18

binding 684 (2097)

calcium ion 154

carbohydrate binding 43

chromatin binding 21

lead ion binding 0

lipid binding 82

nucleic acid binding 130 (586)

DNA binding 193 (227)

transcription factor activity 89

RNA binding 293

translation factor activity, nucleic acid binding 55

nucleotide binding 459

oxygen binding 24

protein binding 1096 (1183)

cytoskeletal protein binding 52 (110)

actin binding 65

receptor binding 110

triplet codon-amino acid adaptor activity 0

catalytic activity 574 (1239)

electron carrier activity 98

hydrolase activity 396 (473)

nuclease 9

peptidase activity 117

phosphoprotein phosphatase 25

transferase activity 229 (277)

kinase activity 103 (125)

protein kinase activity 55

chaperone regulator activity 4

enzyme regulator activity 127

motor activity 35

nutrient reservoir activity 0

protein tag 1

signal transducer activity 63 (170)

receptor activity 116

structural molecule activity 382

transcription regulator activity 116 (167)

translation regulator activity 5 (59)

transporter activity 338 (415)

ion channel activity 84

neurotransmitter transporter activity 4

Page 38: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 38

Supplementary Table 2 - Gene ontology (generic GO slim subset for biological

process). The classification is hierarchical. Indented terms are children of parent terms

listed above. Genes may be assigned to more than one term. For each term, the number of

A. burtoni assembled sequences that match genes to which Gene Ontology annotations

have been assigned at, or below, this general level is given. Note that genes may be

assigned to more than one term and child terms may have more than one parent term. For

parent terms, the total number of A. burtoni assembled sequences is given in parentheses.

Match means that the annotation derives from a gene that was the “best hit” for the A.

burtoni sequence at and e-value <10-12

.

Page 39: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 39

Biological Process 2532

anatomical structure morphogenesis 131 (152)

cell growth 26

behavior 24

cell communication 9 (437)

cell-cell signaling 83

signal transduction 381

cell cycle 163

cell differentiation 166 (283)

cell death 138

cell homeostasis 132

cell proliferation 88

cell recognition 8

cellular component organization and biogenesis 359 (659)

cytoplasm organization and biogenesis 1

organelle organization and biogenesis 189 (353)

cytoskeleton organization and biogenesis 159

mitochondrion organization and biogenesis 13

death 0 (138)

growth 18 (42)

metabolic process 213 (1731)

biosynthetic process (583)

translation 313

catabolic process 194

generation of precursor metabolites and energy 82 (299)

electron transport 226

primary metabolic process 0 (1425)

amino acid and derivative metabolic process 65

carbohydrate metabolic process 123

lipid metabolic process 81

nucleobase, nucleoside, nucleotide and nucleic acid metabolic process 230 (563)

DNA metabolic process 146

transcription 288

protein metabolic process 335 (805)

protein modification process 240

secondary metabolic process 83

multicellular organismal development 330 (365)

embryonic development 67

regulation of biological process 660

regulation of gene expression, epigenetic 7

reproduction 66

response to abiotic stimulus 46

response to biotic stimulus 34

response to endogenous stimulus 68

response to external stimulus 41

response to stress 180

symbiosis, encompassing mutualism through parasitism 1

transport 522 (746)

ion transport 305

protein transport 155

viral reproduction 5

Page 40: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 40

Supplementary Table 3 - Gene ontology (generic GO slim subset for cellular

component). The classification is hierarchical. Indented terms are children of parent

terms listed above. Genes may be assigned to more than one term. For each term, the

number of A. burtoni assembled sequences that match genes to which Gene Ontology

annotations have been assigned at, or below, this general level is given. Note that genes

may be assigned to more than one term and child terms may have more than one parent

term. For parent terms, the total number of A. burtoni assembled sequences is given in

parentheses. Match means that the annotation derives from a gene that was the “best hit”

for the A. burtoni sequence at and e-value <10-12

.

Page 41: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 41

Cellular Component 2293

cell 804 (2216)

intracellular 540 (1883)

chromosome 49 (62)

nuclear chromosome 20

cilium 7

cytoplasm 488 (1367)

cytoplasmic membrane-bound vesicle 61

cytosol 343

endoplasmic reticulum 104

endosome 10

Golgi 65

microtubule organizing center 15

mitochondrion 414

peroxisome 8

plastid 4

ribosome 244

vacuole 13 (39)

lysosome 27

cytoskeleton 204 (216)

nucleus 576 (630)

nuclear envelope 31

nucleolus 60

nucleoplasm 87

plasma membrane 252

extracellular region 62 (184)

extracellular space 134

proteinaceous extracellular matrix 35

organelle 17 (1538)

protein complex 808

unlocalized protein complex 1

Page 42: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 42

Supplementary Table 4 - ESTs where the p-distance between Homo sapiens and

haplochromine cichlid amino acid sequences is significantly higher as compared

to other fish species (Danio rerio, Takifugu rubripes, Tetraodon nigroviridis and

Oncorhynchus mykiss). Annotation means that the Homo sapiens gene was “best hit”

for the cichlid sequence (and e-value <10-50

).

Homo sapien best hit

GenBank

acc. cichlid

Ka/Ks

ratio p-dist.

|NP_742034.1| retinol dehydrogenase 10 BJ671564 0.0035 0.27

|NP_057267.2| ribosomal protein P0-like protein; ribosomal protein, large, CN470249 0.0038 0.26

|NP_000017.1| adenylosuccinate lyase; adenylosuccinase BJ673035 0.0044 0.251

|XP_030559.1| PREDICTED: PAR-6 beta BJ697326 0.0044 0.272

|NP_003272.2| troponin I, skeletal, slow; Troponin-I, skeletal, slow DY629797 0.0044 0.385

|NP_001210.1| calumenin precursor DY627455 0.0072 0.424

|NP_003583.2| cullin 1 CN468603 0.0219 0.317

|NP_059133.1| cleavage and polyadenylation specific factor 2; cleavage and BJ683745 0.0267 0.25

|NP_009097.1| phosphatidylinositol-binding clathrin assembly protein; clathrin BJ691629 0.0295 0.469

|NP_001959.1| eukaryotic translation initiation factor 4E; eukaryotic translation BJ686392 0.0302 0.306

|NP_002063.2| guanine nucleotide binding protein (G protein), q polypeptide BJ698454 0.0432 0.301

|NP_056170.1| joined to JAZF1 CN469599 0.0438 0.257

|NP_036265.1| coatomer protein complex, subunit gamma 2; coat protein, BJ672808 0.0484 0.272

|NP_006764.3| DEAD (Asp-Glu-Ala-Asp) box polypeptide 18; Myc-regulated DEAD box BJ700024 0.0547 0.286

|NP_055205.1| staphylococcal nuclease domain containing 1; EBNA-2 co-activator BJ673580 0.0561 0.255

|NP_060104.2| potassium channel tetramerisation domain containing 9 BJ687766 0.0564 0.331

|NP_001737.1| calnexin BJ685700 0.0574 0.4

|NP_006612.2| S-adenosylhomocysteine hydrolase-like 1; S-adenosyl homocysteine DY632021 0.061 0.374

|NP_036205.1| chaperonin containing TCP1, subunit 5 (epsilon) [Homo sapiens] BJ698817 0.0619 0.275

|NP_003679.1| calcium/calmodulin-dependent serine protein kinase (MAGUK family) BJ682470 0.0639 0.286

|NP_612639.1| taube nuss; TAF8 RNA polymerase II, TATA box binding protein BJ703081 0.0664 0.298

|NP_114366.1| poly(rC)-binding protein 2 isoform b; alpha-CP2; poly(rC)-binding CN468947 0.0699 0.327

|NP_036475.2| nicotinamide nucleotide transhydrogenase DY629769 0.0733 0.325

|NP_003926.1| topoisomerase (DNA) III beta; topoisomerase III beta BJ701292 0.0736 0.328

|NP_060802.1| DDX19-like protein; RNA helicase DY625929 0.0757 0.256

|NP_689953.1| 3-hydroxyisobutyrate dehydrogenase BJ690821 0.0759 0.282

|NP_005854.2| neuroepithelial cell transforming gene 1; guanine nucleotide BJ680834 0.0772 0.28

|NP_003793.1| myosin, heavy polypeptide 13, skeletal muscle; extraocular muscle BJ679618 0.0778 0.318

|NP_003306.1| DnaJ (Hsp40) homolog, subfamily C, member 7; tetratricopeptide BJ676351 0.0798 0.31

|NP_078939.3| aminopeptidase-like 1 DY629835 0.0836 0.254

|NP_002586.2| PCTAIRE protein kinase 2; serine/threonine-protein kinase BJ703098 0.0868 0.286

|NP_001410.2| ELAV-like 1; embryonic lethal, abnormal vision, drosophila, BJ685085 0.0871 0.25

|NP_003743.1| eukaryotic translation initiation factor 3, subunit 8, 110kDa; DY626506 0.0895 0.434

|NP_002289.1| L-plastin; plastin 2; Lymphocyte cytosolic protein-1 (plasmin) BJ696087 0.0906 0.283

|NP_002070.1| aspartate aminotransferase 1 BJ700931 0.0944 0.263

|NP_057388.1| ribosomal protein L24-like; homolog of yeast ribosomal like protein DY625961 0.1002 0.323

|NP_000174.1| hydroxyacyl dehydrogenase, subunit B; 3-ketoacyl-Coenzyme A DY627495 0.1026 0.269

|NP_002568.2| p21-activated kinase 2; S6/H4 kinase BJ692414 0.1037 0.405

|NP_443133.1| solute carrier family 25 (mitochondrial carrier; phosphate DY629151 0.1174 0.297

Page 43: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 43

|NP_006383.2| nucleolar protein 5A; nucleolar protein 5A (56kD with KKE/D repeat) DY626311 0.1251 0.25

|NP_001961.1| eukaryotic translation initiation factor 5A; eIF5AI BJ692285 0.1317 0.273

|NP_004362.1| coatomer protein complex, subunit alpha; xenin; alpha coat protein BJ690224 0.1395 0.286

|NP_115915.1| 5'-nucleotidase, cytosolic IA; cytosolic 5' nucleotidase, type 1A; BJ701030 0.1501 0.379

|XP_377129.2| PREDICTED: similar to golgi autoantigen, golgin subfamily a, 7 BJ686655 0.1522 0.284

|NP_006400.2| actin related protein 2/3 complex subunit 1A; actin binding protein DY630466 0.1682 0.359

|NP_006730.2| minichromosome maintenance deficient protein 5; DNA replication BJ690838 0.1858 0.284

|NP_039234.1| chloride intracellular channel 4; chloride intracellular channel 4 BJ701465 0.1977 0.402

|NP_060103.1| chromosome 6 open reading frame 37; retinal expressed gene C6orf37 BJ674373 0.2952 0.259

Page 44: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 44

Supplementary Table 5 - ESTs where the p-distance between Homo sapiens and

haplochromine cichlid amino acid sequences is significantly smaller as compared

to other fish species (Danio rerio, Takifugu rubripes, Tetraodon nigroviridis, and

Oncorhynchus mykiss). Annotation means that the Homo sapiens gene was “best hit”

for the Cichlid sequence (and e-value <10-50

).

Homo sapien best hit

GenBank

acc. cichlid

Ka/Ks

ratio

p-

distance

|NP_000276.1| Xaa-Pro dipeptidase; proline dipeptidase DY628286 0 0.006

|NP_000060.1| calcium channel, voltage-dependent, L type, alpha 1S subunit; calcium DY627704 0 0.025

|NP_002492.1| nuclear factor I/X (CCAAT-binding transcription factor) BJ684881 0.0188 0.037

|NP_006574.1| peropsin DY628165 0.6 0.04

|NP_694592.1| protein-tyrosine kinase fyn isoform b; proto-oncogene CN469350 0.0121 0.044

|XP_371813.2| PREDICTED: kinesin family member C1 DY627797 0.5 0.051

|NP_742055.1| eyes absent 1 isoform b; Eyes absent, Drosophila, homolog of, 1; BJ693416 0.0349 0.058

|NP_000977.1| ribosomal protein L24; 60S ribosomal protein L24; ribosomal protein BJ684650 0.0419 0.059

|NP_001264.1| chromodomain helicase DNA binding protein 4; Mi-2b BJ687855 0.0305 0.068

|NP_057399.1| GULP, engulfment adaptor PTB domain containing 1; engulfment CN469384 0.0219 0.073

|NP_061133.1| peroxisomal membrane protein 2, 22kDa; peroxisomal membrane protein DY628735 0.4231 0.074

|NP_004437.1| glutamyl-prolyl tRNA synthetase; glutamate tRNA ligase BJ676918 0.012 0.08

|NP_001924.2| dihydrolipoamide S-succinyltransferase (E2 component of BJ679359 0.0255 0.094

|NP_001369.1| dynein, cytoplasmic, intermediate polypeptide 2 BJ703324 0.0073 0.099

|NP_077022.1| mitogen-activated protein kinase associated protein 1; ras DY626351 0.0353 0.101

|NP_073557.3| DiGeorge syndrome critical region gene 8 CN469327 0.0364 0.104

|NP_002075.2| plasma glutathione peroxidase 3 precursor DY627877 1.3043 0.106

|NP_004495.2| HIV-1 Rev binding protein; nucleoporin-like protein RIP; Rab, CN468756 0.0241 0.107

|NP_056067.1| NPF/calponin-like protein; EH domain-binding protein 1 CN469453 0.0015 0.108

|NP_003984.2| CDC-like kinase 2 isoform 1; dual specificity protein kinase CLK2; BJ699169 0.0544 0.109

|NP_056371.1| signal-induced proliferation-associated 1 like 1; signal-induced BJ668716 0.0063 0.11

|NP_000993.1| ribosomal protein P0; 60S acidic ribosomal protein P0; acidic BJ701638 0.053 0.111

|NP_001297.1| claudin 3; Clostridium perfringens enterotoxin receptor 2; rat DY631785 1.5556 0.115

|NP_004312.2| axonal transport of synaptic vesicles; kinesin, heavy chain, member DY632057 0.0393 0.123

|NP_005494.2| amyloid beta A4 precursor protein-binding, family A, member 2; CN469652 0.02 0.127

|NP_006304.1| ubiquitin specific protease 15; deubiquitinating enzyme BJ693203 0.048 0.128

|NP_004969.2| Shaw-related voltage-gated potassium channel protein 4 isoform a; DY631758 0.7808 0.13

|NP_057365.2| STE20-like kinase; STE2-like kinase BJ685775 0.0266 0.137

|NP_783860.1| synaptotagmin IX CN470765 0.0194 0.14

|NP_003281.1| tropomyosin 4 BJ674753 0.0384 0.145

|NP_000981.1| ribosomal protein L27a; 60S ribosomal protein L27a BJ682982 0.0861 0.145

|NP_001261.1| chromodomain helicase DNA binding protein 1 BJ671483 0.0072 0.147

|NP_005075.2| phospholipase A2, group VII; platelet-activating factor DY628278 0.7794 0.15

|NP_000430.3| proprotein convertase subtilisin/kexin type 1 preproprotein; CN469887 3.7763 0.15

|NP_004244.1| phospholipase A2-activating protein; phospholipase A2 activating BJ703212 0.0631 0.15

|NP_056271.2| cofactor of BRCA1; negative elongation factor protein B BJ689040 0.0374 0.153

|NP_001545.2| cysteine-rich, angiogenic inducer, 61; cysteine-rich, anigogenic BJ692405 0.0759 0.154

|NP_004073.2| dynactin 1 isoform 1; p150, Glued (Drosophila) homolog; dynactin 1 CN471126 0.0874 0.155

|NP_114072.1| frizzled 8; frizzled (Drosophila) homolog 8 DY626075 0.0422 0.157

Page 45: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 45

|NP_542197.1| alpha 1 type XI collagen isoform C preproprotein; collagen XI, BJ668630 0.035 0.159

|NP_078974.1| mitochondrial glutamate carrier 1 BJ688909 0.0905 0.162

|NP_005678.2| phenylalanine-tRNA synthetase-like, beta subunit; phenylalanyl-tRNA BJ692171 0.0757 0.162

|NP_006197.1| platelet-derived growth factor receptor alpha precursor BJ690723 0.0017 0.163

|NP_003240.1| thimet oligopeptidase 1 BJ691014 0.0778 0.164

|NP_009186.1| xylosylprotein beta 1,4-galactosyltransferase 7; BJ695358 0.0745 0.165

|NP_872579.2| fetal Alzheimer antigen isoform 1; fetal Alz-50 reactive clone 1; BJ701592 0.0349 0.165

|NP_057165.2| palladin; CGI-151 protein BJ690747 0.0521 0.166

|NP_004521.1| matrix metalloproteinase 2 preproprotein; gelatinase neutrophil; BJ694547 0.0645 0.166

|NP_056111.1| dedicator of cytokinesis 9 CN469939 0.0667 0.167

|NP_006651.2| ClpX caseinolytic protease X homolog; energy-dependent regulator of BJ673221 0.0162 0.167

|NP_060004.2| myosin, heavy polypeptide 2, skeletal muscle, adult BJ674015 0.0441 0.167

|NP_954580.1| ubiquitin-conjugating enzyme E2 variant 1 BJ685697 0.1036 0.168

|NP_149351.1| surfeit 4; surfeit locus protein 4; surface 4 integral membrane BJ677148 0.0335 0.169

|NP_000503.1| N-acetylgalactosamine-6-sulfatase precursor; chondroitinase; BJ693255 0.0526 0.169

|NP_001164.1| Rho GTPase activating protein 5 BJ686112 0.0363 0.172

|NP_003613.2| PTPRF interacting protein binding protein 1 isoform 1; liprin-beta 1; BJ693077 0.0186 0.172

|NP_001449.1| gamma filamin; filamin C, gamma (actin-binding protein-280); filamin BJ684635 0.063 0.173

|NP_001348.1| DEAH (Asp-Glu-Ala-His) box polypeptide 9 isoform 1; ATP-dependent

RNA DY630855 0.0189 0.174

|NP_000850.1| 3-hydroxy-3-methylglutaryl-Coenzyme A reductase BJ700149 0.0228 0.176

|NP_631898.1| dipeptidylpeptidase 9; dipeptidyl peptidase 9; dipeptidyl peptidase CN470065 0.0915 0.178

|NP_000628.1| glutathione reductase BJ683618 0.0457 0.178

|NP_005955.1| myosin, heavy polypeptide 10, non-muscle; myosin heavy chain, BJ696841 0.0186 0.179

|NP_932332.1| glucosamine-phosphate N-acetyltransferase 1 BJ691214 0.0418 0.182

|NP_002100.2| histidyl-tRNA synthetase; histidine-tRNA ligase; HisRS; histidine BJ694216 0.0521 0.183

|NP_005782.1| M-phase phosphoprotein 10 DY628233 0.7037 0.184

|NP_065116.2| X-prolyl aminopeptidase (aminopeptidase P) 1, soluble; X-prolyl BJ701510 0.0129 0.184

|NP_203750.3| mitogen-activated protein kinase 8 interacting protein 3 isoform 2; CN468708 0.095 0.185

|NP_037457.3| APG4 autophagy 4 homolog B isoform a; autophagin-1 BJ693415 0.0586 0.185

|NP_002878.2| arginyl-tRNA synthetase BJ697670 0.0023 0.186

|NP_006074.1| RED protein; RD element; prer protein; IK factor; DY627427 0.0392 0.187

|NP_031401.1| TAR DNA binding protein; TAR DNA-binding protein-43 CN468754 0.0486 0.188

|NP_476500.1| endothelial differentiation, lysophosphatidic acid DY630256 0.0536 0.188

|NP_003280.2| tropomyosin 2 (beta) isoform 1; arthrogryposis multiplex BJ683045 0.0466 0.189

|NP_001339.1| death-associated protein kinase 3 BJ674422 0.0642 0.19

|NP_002145.3| heat shock 70kDa protein 4 isoform a; heat shock 70kD protein 4 BJ691037 0.0392 0.19

|NP_000180.2| hexokinase 2; hexokinase-2, muscle BJ689488 0.0042 0.191

|NP_061957.2| cyclin J BJ699157 0.0351 0.193

|XP_371474.2| PREDICTED: plexin B2 BJ684903 0.0552 0.195

|NP_004748.2| small inducible cytokine subfamily E, member 1; endothelial BJ685307 0.0541 0.195

|NP_000149.1| glucan (1,4-alpha-), branching enzyme 1 (glycogen branching BJ688578 0.0542 0.195

|NP_542164.2| oxysterol-binding protein-like 1A isoform B; oxysterol-binding CN468818 0.0156 0.197

|NP_001027.2| ryanodine receptor 3 BJ694127 0.0848 0.198

|NP_000084.2| alpha 1 type V collagen preproprotein BJ670714 0.1171 0.2

|NP_055316.1| ubiquitin carrier protein; ubiquitin-conjugating enzyme E2-24 kD; BJ674752 0.0923 0.202

|NP_036322.2| formyltetrahydrofolate dehydrogenase isoform a CN470753 0.2445 0.203

|NP_002787.2| proteasome beta 4 subunit; proteasome subunit, beta type, 4; DY626588 0.1461 0.203

|NP_002513.1| neuronal pentraxin I precursor CN468907 0.0914 0.204

|NP_057531.2| membrane protein, palmitoylated 6; MAGUK protein p55T; protein CN469732 0.0431 0.204

Page 46: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 46

|NP_075053.2| zinc transporter ZTL1; zinc transporter 5 BJ700354 0.0023 0.205

|NP_037428.2| G-protein signalling modulator 2 (AGS3-like, C. elegans); LGN BJ695914 0.0205 0.207

|NP_000687.2| aldehyde dehydrogenase 9A1; gamma-aminobutyraldehyde

dehydrogenase; BJ701067 0.0726 0.207

|NP_056323.2| TCDD-inducible poly(ADP-ribose) polymerase DY625879 0.0033 0.208

|NP_115984.1| protein phosphatase 1, regulatory subunit 9B; neurabin II; BJ690319 0.0715 0.209

|NP_004664.1| angiopoietin-like 1 precursor; angiopoietin 3; angiopoietin Y1 BJ690922 0.0204 0.213

|NP_057375.1| inositol hexaphosphate kinase 2; mammalian inositol DY631850 0.0558 0.213

|NP_001223.1| cardiac calsequestrin 2; calsequestrin 2, cardiac muscle; BJ670795 0.0642 0.217

|NP_114402.1| N-myc downstream-regulated gene 3 isoform a; N-myc CN470617 0.0521 0.219

|NP_002422.1| menage a trois 1 (CAK assembly factor); cyclin G1 interacting DY628021 1.1881 0.221

|NP_055179.1| glyceraldehyde-3-phosphate dehydrogenase, spermatogenic; BJ690842 0.1228 0.222

|NP_060729.2| tetratricopeptide repeat domain 17 DY627279 0.0152 0.222

|NP_775105.1| SNF2 histone linker PHD RING helicase; 2610103K11Rik BJ697939 0.0346 0.222

|NP_006138.1| interferon regulatory factor 6; Popliteala pterygium syndrome BJ685456 0.0388 0.223

|NP_002538.1| oligophrenin 1; oligophrenin-1, Rho-GTPase activating protein BJ672227 0.064 0.224

|NP_057456.1| seven transmembrane domain orphan receptor; transmembrane domain DY629566 0.0786 0.226

|NP_006789.1| UDP glycosyltransferase 2 family, polypeptide A1; UDP DY629774 0.0155 0.228

|NP_116262.2| nm23-phosphorylated unknown substrate; SH3 domain-containing 70 kDa BJ693084 0.0885 0.231

|NP_004206.1| estrogen receptor binding site associated antigen 9; cancer BJ701417 0.0706 0.232

|NP_000184.1| sonic hedgehog preproprotein BJ703040 0.0777 0.233

|NP_849189.1| sideroflexin 2 BJ687393 0.1028 0.233

|NP_110448.2| phospholipase A2, group XIIA; group XII secreted phospholipase A2; BJ703348 0.0386 0.235

|NP_001783.2| cadherin 2, type 1 preproprotein; cadherin 2, N-cadherin BJ680852 0.0642 0.237

|NP_071939.1| zinc finger, DHHC domain containing 6 DY628782 0.3895 0.239

|NP_055048.1| upstream binding transcription factor, RNA polymerase I CN470191 0.0785 0.239

|NP_000144.1| galactosylceramidase precursor; galactocerebrosidase; BJ676310 0.0262 0.24

|NP_000689.1| arachidonate 5-lipoxygenase BJ698470 0.0846 0.242

|NP_006613.1| polo-like kinase 2; serum-inducible kinase BJ697217 0.0071 0.243

|NP_003238.2| thrombospondin 2 precursor BJ689989 0.05 0.244

|NP_060752.1| paraspeckle protein 1 CN468873 0.0041 0.244

|NP_001740.1| calpain, small subunit 1; calcium-activated neutral proteinase; BJ670938 0.0626 0.245

|NP_056210.1| DKFZP434B0335 protein BJ695451 0.4116 0.246

|NP_078932.2| hypothetical protein FLJ22329 BJ697568 0.2964 0.246

|NP_056349.1| zinc finger, ZZ domain containing 3 BJ688763 0.1254 0.247

Page 47: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 47

Supplementary Figure 1 – Directed acyclic graph (DAG) of the cichlid specific

Gene ontology (GO) slim for molecular function. Molecular function terms were

selected for inclusion in the ontologies such that leaf nodes include approximately 20

annotated genes. Circle size represents relative number of genes annotated to each parent

node.

Page 48: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 48

Supplementary Figure 2 – Directed acyclic graph (DAG) of the cichlid specific

Gene ontology (GO) slim for biological process. Biological process terms were

selected for inclusion in the ontologies such that leaf nodes include approximately 20

annotated genes. Circle size represents relative number of genes annotated to each parent

node.

Page 49: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 49

Page 50: Annotation of expressed sequence tags for the East …The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are

Salzburger et al. 50

Supplementary Figure 3 – Directed acyclic graph (DAG) of the cichlid specific

Gene ontology (GO) slim for cellular component. Cellular component terms were

selected for inclusion in the ontologies such that leaf nodes include approximately 20

annotated genes. Circle size represents relative number of genes annotated to each parent

node.