Supplemental Text - G3: Genes, Genomes, Genetics · 2015/03/04

SUPPLEMENTAL TEXT

The Muller elements nomenclature

Orthologous chromosomes among the different Drosophila species often have different

chromosome numbers. For example, chromosome 4 in D. melanogaster is orthologous to

chromosome 6 in D. grimshawi, D. mojavensis, and D. virilis. Hermann Muller developed a

nomenclature (A–F) to refer to orthologous chromosomes among the different Drosophila

species (Muller 1940). Using this nomenclature, chromosome 4 in D. melanogaster is

known as the Muller F element while chromosome 3L is known as the Muller D element.

Introduction to Position Effect Variegation (PEV)

PEV describes the phenomenon whereby a euchromatic gene is either partially or

completely silenced when it is moved (by translocation or transposition) to a region next to

a heterochromatic domain (Muller 1930). For example, insertion of a transgenic reporter

(hsp70-driven white) into a euchromatic environment (via P-element transposition) results

in a red eye phenotype because the white gene is expressed in all the ommatidia of the

compound eye. In contrast, insertion of the transgenic reporter into a heterochromatic

environment results in a variegating phenotype because the white gene is silenced in a

subset of the ommatidia. The white gene is required for deposition of pigment in the eye.

Identifying subfamilies of the DINE-1 element

Transposons in the species-specific transposon libraries are classified as DINE-1 fragments

based on sequence similarity to the conserved core element within block A of the DINE-1

W. Leung et al. 1

consensus (Yang and Barbash 2008). Comparison of the DINE-1 fragments identified by

RepeatMasker using the species-specific library versus the RepBase Drosophila library

(Jurka et al. 2005) allows us to further categorize the subfamilies of DINE-1 elements in

each species. (See File S5 for a complete list of RepBase repeats that overlap with DINE-1

fragments in all of the analysis regions.) This comparison shows that there are additional

DINE-1 elements in the D. grimshawi, D. mojavensis, and D. erecta species-specific

transposon libraries that are not in the Drosophila RepBase library (see Table S5).

Comparison of the repeats identified by the two repeat libraries also shows that the DINE-1

fragments on the D. mojavensis F element can be partitioned into at least two subfamilies:

67.4% of the DINE-1 fragments overlap with Homo6, while 21.8% overlap with

Helitron1_Dmoj. Homo6 is classified as a member of the HOBO family (a hAT DNA

transposon) (de Freitas Ortiz and Loreto 2009), but part of the Homo6 consensus sequence

was masked in a subsequent RepBase release (17.07) because of a helitron insertion.

Previous analyses have shown that the IsBu1 element in D. buzzatii (another species in the

repleta group) is homologous to the DINE-1 element in D. mojavensis (Cáceres et al. 2001;

Casals et al. 2005). A CENSOR search of IsBu1 (GenBank record AY756162.1) against the

Drosophila RepBase library shows that IsBu1 has ~94% sequence similarity to the Homo6

consensus sequence (Figure SM1). Given the similarity of the D. mojavensis fragments to

the core element of DINE-1 and the ambiguity associated with the annotation of Homo6, we

have retained the assignment of these repeats as putative DINE-1 fragments.

Nc versus CAI comparisons indicate response to selective pressure

While most of the F element genes within each Nc versus CAI scatterplot follow a similar

trend, there are also a few outliers (as denoted by the inverted V shape in the LOESS

regression lines in Figure 6C). Heat map analyses of Nc and CAI (Eisen et al. 1998) show

that two of these outliers, ATPsyn-beta and RpS3A, consistently show strong selection on

codon bias (high CAI and low Nc) in all four species (Figure S7). (Heat maps for all analysis

regions are available in Figure SM2.) The heat map also shows substantial differences in

the Nc and CAI among D. mojavensis F element genes relative to their putative orthologs in

D. melanogaster and D. erecta (e.g., Thd1, Ephrin, sv, Actbeta, and Rfabg show a higher CAI,

indicating more optimal codon usage).

Changes in codon preference for each amino acid

In addition to the changes in the distribution of Nc (Effective Number of Codons) and CAI

(Codon Adaptation Index), analysis of the proportions of codon usage for each amino acid

showed that there were also substantial differences in codon usage among the four species

(File S7). For example, F element genes showed weak preference for the various codons for

leucine (L), while genes on the D element showed a strong preference for CTG and a weak

preference for TTA. This preference pattern remains the same across all the D euchromatic

reference genes even though the genes at the base of the D. mojavensis and D. grimshawi D

elements differed from those found at the base of the D. melanogaster and D. erecta D

elements (Figure SM3).

There were also different codon usage patterns among the F elements. For example, the D.

melanogaster and D. erecta F element genes preferred the codon CAA instead of CAG to

encode the amino acid glutamine (Q), while D. mojavensis and D. grimshawi F element

genes showed almost equal codon usage. In contrast, genes in the D euchromatic reference

regions in all four species showed a strong preference for CAG over CAA. Another example

is seen in the codon preferences for valine (V); all the euchromatic reference regions and

the D. grimshawi F element showed a strong preference for GTG over GTA, while the other F

elements showed a strong preference for GTT over GTC.
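
The per-amino-acid codon proportions discussed above can be illustrated with a minimal Python sketch. It is not the procedure used to produce File S7: the function name `codon_proportions`, the input format (in-frame coding sequences as plain strings), and the decision to skip codons containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_proportions(coding_sequences):
    """Return {codon: fraction of its amino acid encoded by that codon}.

    coding_sequences is an iterable of in-frame DNA strings (an assumed
    input format). Codons with non-ACGT characters are skipped.
    """
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        for pos in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[pos:pos + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    # Total number of observed codons for each amino acid.
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n

    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


# Toy example: glutamine is encoded twice by CAA and once by CAG.
toy = codon_proportions(["CAACAACAGATG"])
```

Under these assumptions, comparing `toy["CAA"]` with `toy["CAG"]` corresponds to the CAA versus CAG comparison for glutamine described in the text.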

SUPPLEMENTAL METHODS

Rationale for sequence improvement

The Drosophila 12 Genomes Consortium has previously analyzed D. melanogaster and the

Comparative Analysis Freeze 1 (CAF1) assemblies of 11 Drosophila species, including D.

erecta, D. mojavensis, and D. grimshawi analyzed in this study. The CAF1 assemblies of D.

erecta, D. mojavensis, and D. grimshawi are each based on the reconciliation (Zimin et al.

2008) of two independent assemblies constructed by the Arachne (Batzoglou et al. 2002)

and the Celera (Myers et al. 2000) assemblers. The reconciliation procedure improved the

overall quality of the assemblies by reducing the number of misassemblies and increasing

the length of the assembled scaffolds (Drosophila 12 Genomes Consortium et al. 2007).

The estimated genome coverage for the D. erecta genome assembly was ~10.6x, and we

found that this assembly was of sufficient quality for gene annotation and genomic analysis

without further sequence enhancement (see GenBank record AAPQ00000000.1). Of the

1338 isoforms we have annotated in all of the D. erecta analysis regions, we have identified

29 isoforms (nine genes) that contained potential errors in the consensus. Additional

details on these putative consensus errors are described in File S2. The locations of these

putative consensus errors are shown in the “Consensus Errors” track on the GEP UCSC

Genome Browser (http://gander.wustl.edu).

In contrast, both the D. mojavensis and D. grimshawi assemblies have approximately six- to

eight-fold genome coverage. Examination of the D. grimshawi CAF1 assembly near the base

of the D element suggested that this region was of sufficient quality for genomic analysis

without sequence improvement. Of the 138 isoforms we have annotated in the D.

grimshawi euchromatic reference regions, four isoforms (two genes) contained putative

consensus errors (see File S2 for details).

Examination of the D. mojavensis and D. grimshawi F element scaffolds and the base of the

D. mojavensis D element suggested that there was still substantial room for sequence

improvement. Consequently, prior to performing the genomic analysis, we manually

improved these regions to a quality standard similar to the one used for the mouse genome

project (Mouse Genome Sequencing Consortium et al. 2002).

Defining the analysis regions

Defining the F element analysis regions: Schaeffer and colleagues used physical and genetic

markers to anchor many genomic scaffolds in the CAF1 assemblies to the polytene

chromosomes (Schaeffer et al. 2008). That analysis assigned scaffold 4512 to the D. erecta F

element and scaffolds 14822 and 14592 to the D. grimshawi F element. However, they were

unable to visualize the D. mojavensis F element in polytene chromosomes. Based on

previous analysis by the Drosophila 12 Genomes Consortium, which shows that most

Drosophila genes tend to remain on the same Muller element (Drosophila 12 Genomes

Consortium et al. 2007), the 3.4 Mb scaffold 6498 was assigned to the D. mojavensis F

element.

Examination of this scaffold using FlyBase GBrowse showed that the F element genes were

only found within the first 2 Mb of this scaffold. Most of the genes at the end of scaffold

6498 were found on the D. melanogaster A element. The last F element gene (CG31999) on

scaffold 6498 has a gene span of 157 kb, with a total transposon density of 90%. These

properties suggested that CG31999 is located within a heterochromatic environment in D.

mojavensis. The 143 kb region between the last F element gene (CG31999) and the next

non-F element gene (CG42450) contained multiple large gaps, with a total estimated gap

size of ~27 kb.

Examination of the reads placement file (reads.placed) for the D. mojavensis CAF1 assembly

(available through the AAA: 12 Drosophila Genomes website at

http://rana.lbl.gov/drosophila/) showed that only mate pairs from fosmid end reads

supported the large gaps found in this region. Examination of the A element genes found at

the end of the D. mojavensis scaffold 6498 in D. virilis showed that most of these genes are

found in the middle of other non-F element scaffolds in the D. virilis CAF1 assembly.

Collectively, there was insufficient evidence to support the hypothesis that the A element

genes found at the end of scaffold 6498 were part of the D. mojavensis F element.

Consequently, we have restricted our analysis of the D. mojavensis F element to the first 2

Mb of scaffold 6498 where all of the F element genes are found.

The ends of the genomic scaffolds in a whole genome assembly often contain

misassemblies and are highly repetitive. To reduce the bias introduced by these potential

assembly errors, we have restricted our analysis of the F element scaffolds to the region

that extended from the start of the coding span of the first gene to the end of the coding

span of the last gene.

Defining the D element analysis regions: In order to compare and contrast the evolution of

the F element with the evolution of a euchromatic domain, we have also analyzed the

repeat and gene characteristics of a euchromatic region in D. melanogaster, D. erecta, D.

mojavensis, and D. grimshawi. Because of the low rate of recombination near the base of the

chromosome arms, the region near the base of the chromosomes might exhibit different

genomic properties compared to other regions (Talbert and Henikoff 2010). To account for

the potential differences that could be introduced by the proximity to the centromere, we

selected ~1 Mb euchromatic reference regions near the base of the D element and

compared their properties against those of the F element.

The D. melanogaster release 5 assembly has well-defined heterochromatin boundaries

based on both cytogenomic (Hoskins et al. 2007) and epigenomic criteria (Riddle et al.

2011). However, these types of evidence were unavailable for the other Drosophila species.

Previous studies in D. melanogaster have shown that there is a 4.7-fold increase in

transposon density near the centromeres of the long chromosome arms (Kaminker et al.

2002) and that changes in repeat density could be used to demarcate the approximate

boundaries of heterochromatic and euchromatic regions (Yasuhara and Wakimoto 2008).

Consequently, we analyzed the changes in repeat density for each species using a sliding

window analysis (1 kb window, 500 bp step size) across the entire genome assembly (Figure

S1). The ends of the D element scaffold in the D. melanogaster, D. erecta, and D. mojavensis

assemblies contained highly repetitive regions that likely correspond to the locations of the

heterochromatic domains. In contrast, the ends of the large D. grimshawi scaffolds only

contained short regions that exhibit high repeat density (Figure S1). This difference in the

extent to which the heterochromatic regions have been assembled might lead to an

underestimate of the true repeat content of the D. grimshawi genome.
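
As a rough illustration of the sliding window analysis described above (1 kb window, 500 bp step), the following Python sketch computes the fraction of each window that is covered by repeats on a single scaffold. The function name, the 0-based half-open interval convention, and the choice to omit a trailing partial window are assumptions; the source does not state how these details were handled.

```python
def repeat_density_windows(seq_length, repeat_intervals, window=1000, step=500):
    """Return a list of (window_start, window_end, repeat_fraction).

    repeat_intervals holds 0-based, half-open (start, end) repeat
    annotations (an assumed convention); overlapping annotations are
    counted once. A trailing partial window is omitted.
    """
    if seq_length <= 0:
        return []

    # Mark every base that falls inside at least one repeat annotation.
    masked = bytearray(seq_length)
    for start, end in repeat_intervals:
        s, e = max(0, start), min(seq_length, end)
        if e > s:
            masked[s:e] = b"\x01" * (e - s)

    results = []
    last_start = max(seq_length - window, 0)
    for w_start in range(0, last_start + 1, step):
        w_end = min(w_start + window, seq_length)
        density = sum(masked[w_start:w_end]) / (w_end - w_start)
        results.append((w_start, w_end, density))
    return results


# Toy scaffold of 2 kb with repeats at both ends.
densities = repeat_density_windows(2000, [(0, 500), (1500, 2000)])
```

A sharp and sustained change in the reported fraction along a scaffold is the kind of signal that the text uses to approximate the heterochromatin-euchromatin border.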

Among all the autosomes, the D element showed the clearest demarcation between regions

with high and low repeat density, which may correspond to the heterochromatin-

euchromatin border in all four species (Figure S1). Hence we decided to focus our analysis

on an ~1 Mb region upstream of the heterochromatin-euchromatin border and we referred

to this region as the “base” of the chromosome in the main text. We have also selected a 1.4

Mb region upstream of the base of the D. erecta D element as well as a 1.3 Mb region near

the telomere of the D element as additional comparison regions. The exact coordinates of

all the analysis regions are listed in Table S1.

Additional issues with the CAF1 assemblies

We encountered two issues with the CAF1 assemblies during the course of our analysis:

duplicated regions in the D. yakuba assembly and misassembled scaffolds in the D.

grimshawi assembly.

Duplicate scaffolds in the D. yakuba CAF1 assembly: Because the modENCODE project did

not generate RNA-Seq data for D. erecta, we utilized the D. yakuba RNA-Seq data to assist in

the annotation of the D. erecta analysis regions. Initial mapping of the modENCODE RNA-

Seq reads to the D. yakuba genome assembly (see protocol below) resulted in large regions

of the D. yakuba genome that have no RNA-Seq read coverage. Comparison of these

genomic regions with known genes in D. melanogaster indicated that these D. yakuba

genomic regions likely contained multiple genes. Comparison of these genomic regions

against the entire D. yakuba CAF1 assembly revealed that some of the “random” scaffolds in

the CAF1 assembly have substantial overlap with the larger scaffolds that have previously

been assigned to specific Muller elements.

In order to identify these redundant scaffolds in the CAF1 assembly, we compared the

sequences of all of the “random” scaffolds against the rest of the scaffolds in the D. yakuba

assembly using NCBI BLASTN (Altschul et al. 1990) with default parameters and an Expect

threshold of 1e-20. “Random” scaffolds that have substantial overlap with the main

scaffolds along their entire lengths were removed from the assembly in order to construct

a filtered genome assembly. The D. yakuba RNA-Seq reads were then mapped against this

filtered assembly.
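
The filtering step described above can be sketched in Python as follows. The sketch assumes that the BLASTN hits have already been parsed into (query, query start, query end) tuples with 1-based inclusive coordinates, and it flags a "random" scaffold as redundant when the union of its hits covers at least 90% of its length. The 90% cutoff and all function names are hypothetical; the text only states that scaffolds with substantial overlap along their entire lengths were removed.

```python
def covered_fraction(length, intervals):
    """Fraction of a sequence covered by the union of 1-based inclusive intervals."""
    covered = 0
    current_end = 0
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        # Only count bases to the right of what has already been counted.
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / length


def redundant_scaffolds(scaffold_lengths, hits, min_fraction=0.9):
    """Return the sorted names of query scaffolds covered by hits.

    scaffold_lengths maps scaffold name -> length; hits is an iterable of
    (query, qstart, qend). min_fraction is a hypothetical cutoff.
    """
    by_query = {}
    for query, qstart, qend in hits:
        by_query.setdefault(query, []).append((qstart, qend))
    return sorted(
        query
        for query, intervals in by_query.items()
        if covered_fraction(scaffold_lengths[query], intervals) >= min_fraction
    )


# Toy example: r1 is fully covered by two overlapping hits, r2 is not.
lengths = {"r1": 100, "r2": 100}
toy_hits = [("r1", 1, 50), ("r1", 40, 100), ("r2", 1, 30)]
to_remove = redundant_scaffolds(lengths, toy_hits)
```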

Highly repetitive scaffolds in the D. grimshawi CAF1 assembly: The first step in most de novo

repeat discovery algorithms involves comparing the genome assembly against itself in

order to discover repetitive regions. During the construction of the species-specific

transposon libraries (protocol described below), we found that the comparison of the D.

grimshawi assembly against itself took a substantially longer time to complete than those of

the other Drosophila species. The D. grimshawi assembly also produced substantially more

alignments (after filtering trivial self alignments) than the other assemblies.

Examination of the D. grimshawi alignment results revealed that the large number of D.

grimshawi alignments could be attributed to a few scaffolds. For example, when we

compared the D. grimshawi assembly against itself using the Pairwise Aligner for Long

Sequences (PALS) program (Edgar and Myers 2005), we found that two of the scaffolds

(scaffold_17366 and scaffold_14590) produced more than 1 million alignments (Table

SM1).

Aligning these two scaffolds against each other with BLAST 2 Sequences (Tatusova and

Madden 1999) showed a tandem array of high-scoring segment pairs (HSPs) across the

entire length of the two sequences (Figure SM4A). Searching scaffold_17366 against the

RepBase Drosophila repeat library (Jurka et al. 2005) with CENSOR (Kohany et al. 2006)

showed that the scaffold consists of a tandem array of a 155 bp Gypsy-5 LTR fragment

(Figure SM4B).

To ascertain whether these scaffolds have been correctly assembled, we examined all the

reads used to construct these scaffolds and determined the percentage of reads that were

missing their mate pair (the reads.placed file for the D. grimshawi CAF1 assembly is

available at http://rana.lbl.gov/drosophila/). A large number of missing or inconsistent

mate pairs in these regions would support the hypothesis that these scaffolds have been

misassembled (Phillippy et al. 2008).
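
The mate pair check described above can be sketched as a per-scaffold tally. The sketch below does not parse the reads.placed file; it assumes that the read placements and mate relationships have already been loaded into two dictionaries, and both the function name and these input structures are hypothetical.

```python
from collections import Counter


def unpaired_fraction_by_scaffold(placements, mates):
    """Return {scaffold: fraction of its reads whose mate is not placed}.

    placements maps read name -> scaffold name for every placed read.
    mates maps read name -> mate read name (reads without a mate are absent).
    Both structures are assumed inputs, not the reads.placed layout.
    """
    total = Counter()
    unpaired = Counter()
    for read, scaffold in placements.items():
        total[scaffold] += 1
        mate = mates.get(read)
        # A read counts as unpaired if it has no mate or its mate was not placed.
        if mate is None or mate not in placements:
            unpaired[scaffold] += 1
    return {scaffold: unpaired[scaffold] / total[scaffold] for scaffold in total}


# Toy example: both mates of pair "a" are placed; the reads on s2 are unpaired.
toy_placements = {"a1": "s1", "a2": "s1", "b1": "s2", "c1": "s2"}
toy_mates = {"a1": "a2", "a2": "a1", "b1": "b2", "b2": "b1"}
fractions = unpaired_fraction_by_scaffold(toy_placements, toy_mates)
```

A scaffold with a fraction close to 1 in this tally would be a candidate for the kind of misassembly discussed in the text.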

Examination of the reads.placed file showed that most of the reads placed in scaffolds with

a large number of PALS alignments were unpaired. In addition, most of the missing mate

pairs were not placed in the CAF1 assembly (Figure SM5). These observations suggested

that the scaffolds that contained a tandem array of the LTR fragments were likely to be an

artifact produced by the assembly process and did not accurately reflect the organization of

the D. grimshawi genome. These misassembled scaffolds were removed from the D.

grimshawi assembly prior to the construction of the species-specific transposon library.

Using modENCODE RNA-Seq datasets to improve gene annotations

The modENCODE project has produced RNA-Seq data for six Drosophila species (in

addition to D. melanogaster) in order to study the changes in transcriptional profiles across

multiple Drosophila species (Graveley et al. 2011). Among the three Drosophila species

that were part of this study, the modENCODE project only produced RNA-Seq data (from

head tissues, adult males, and adult females) for D. mojavensis.

While the other species of interest (i.e. D. erecta and D. grimshawi) were not included, we

could use the RNA-Seq data from nearby species to improve the annotations of these two

species. Specifically, we used the RNA-Seq data from D. yakuba to improve the D. erecta

annotations and the D. virilis and D. mojavensis RNA-Seq data to improve the D. grimshawi

annotations. The limited amount of comparative RNA-Seq data nonetheless provides us

with additional confidence in the annotation of D. erecta, D. mojavensis, and D. grimshawi,

especially when there are substantial changes in the gene model compared to the putative

D. melanogaster ortholog.

The RNA-Seq data also enabled us to identify potential errors in the D. melanogaster gene

annotations. For example, the original D. erecta GLEAN-R annotations produced by the

Drosophila 12 Genomes Consortium predicted two genes, GG16094 and GG16095, in

scaffold_4784 (within the euchromatic reference region of D. erecta analyzed in this study).

These two features are predicted to be the orthologs of the D. melanogaster genes CG13814

and rdgC, respectively (Figure SM6, top). However, GG16094 only shows weak sequence

similarity to its putative D. melanogaster ortholog CG13814. Examination of the alignment

between the predicted proteins assembled from D. yakuba RNA-Seq reads and this region

of the D. erecta genome assembly, as well as the TopHat splice junction predictions,

suggested that GG16094 was actually an unannotated coding exon of rdgC (Figure SM6,

bottom). FlyBase has subsequently revised the annotation for this region in release 5.51,

and CG13814 was merged with rdgC in D. melanogaster.

Building the RNA-Seq transcriptome libraries

For our analysis, we retrieved the RNA-Seq datasets for D. yakuba, D. mojavensis, and D.

virilis from the NCBI Gene Expression Omnibus (GEO) database (available under accession

number GSE44612) and used them to construct transcriptome libraries for each species.

Because the CAF1 assemblies are relatively high quality, we used the align-then-assemble

strategy to construct the transcriptome library for each species (reviewed in (Martin and

Wang 2011)).

Mapping RNA-Seq reads with TopHat: The RNA-Seq transcriptome libraries for each species

were built by mapping RNA-Seq reads against the assembly of the corresponding species

with TopHat and Bowtie2 to discover splice junctions (Trapnell et al. 2009; Langmead and

Salzberg 2012). To improve the accuracy of the mapping and splice junction predictions,

only reads that mapped reliably to a single location in the assembly were kept.

Using the strategy first described by Cabili and colleagues (Cabili et al. 2011), we ran

TopHat twice in order to identify splice junctions. The first TopHat run was used to

discover splice junctions and the second run used this initial set of splice junctions as raw

junctions in “no-novel-juncs” mode. The two-stage TopHat mapping strategy improved the

TopHat splice junction predictions and reduced the number of unmapped reads.

The following parameters were used for both TopHat runs: -g 1 --no-mixed --no-discordant

--b2-very-sensitive --min-intron-length 30 --max-intron-length 150000. We used both

SAMtools (Li et al. 2009) and BamTools (Barnett et al. 2011) to manipulate and analyze the

BAM files produced by TopHat.
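
To make the two-stage mapping strategy concrete, the following Python sketch builds (but does not execute) the two TopHat command lines. The shared options are taken from the text; the output directory names, the index and read file names, and the name of the raw junctions file are hypothetical. How the junctions from the first run were converted into the raw junctions file is not stated in the text, so that step is left out.

```python
# Options listed in the text as shared by both TopHat runs.
SHARED_OPTIONS = [
    "-g", "1",
    "--no-mixed",
    "--no-discordant",
    "--b2-very-sensitive",
    "--min-intron-length", "30",
    "--max-intron-length", "150000",
]


def tophat_commands(index, reads_1, reads_2, juncs_file="run1.juncs"):
    """Return the argument lists for the first and second TopHat runs.

    The first run discovers splice junctions. The second run supplies those
    junctions via --raw-juncs and disables novel junction discovery with
    --no-novel-juncs. File and directory names are placeholders.
    """
    first = ["tophat", *SHARED_OPTIONS, "-o", "run1", index, reads_1, reads_2]
    second = [
        "tophat", *SHARED_OPTIONS,
        "--raw-juncs", juncs_file,
        "--no-novel-juncs",
        "-o", "run2",
        index, reads_1, reads_2,
    ]
    return first, second


first_run, second_run = tophat_commands("dmoj_index", "reads_1.fastq", "reads_2.fastq")
```

Each list could be passed to `subprocess.run` once TopHat and the Bowtie2 index are available; building the lists separately makes it easy to confirm that both runs share the same stringent options.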

Creating RNA-Seq transcripts and predicted protein libraries: For all three species, the

mapped RNA-Seq reads were assembled into transcripts using Cufflinks (v2.1.1) with the

following parameters: --min-intron-length 30 --max-intron-length 150000 (Trapnell et al.

2010). For D. mojavensis, we also assembled the transcripts using CEM (--max-pe-span

150000) (Li and Jiang 2012). After combining the transcripts from multiple replicates and

filtering redundant transcripts, we used TransDecoder (Grabherr et al. 2011) to produce a

predicted protein library for each species (-m 50 --search_pfam Pfam-A.hmm). The

complete workflow for building the transcriptome library is summarized in Figure SM7A.

Assembling unmapped RNA-Seq reads: Depending on the quality of the assembly, some of

the RNA-Seq reads might not have been mapped to the reference assembly even though

they are part of a gene (e.g., because of gaps in the assembly). The use of the stringent

TopHat mapping criteria described above also resulted in a larger fraction of unmapped

reads compared to the default parameters. In order to utilize these additional unmapped

RNA-Seq data, we assembled the unmapped RNA-Seq reads and then aligned the

assembled RNA-Seq scaffolds against the target assembly using the protocol described

below. (For example, we aligned the scaffolds assembled from unmapped D. yakuba RNA-

Seq reads against the D. erecta assembly to identify additional splice junctions.)

For each sample, the BAM file that contained the unmapped RNA-Seq reads produced by

the second TopHat run was converted into fastq format using the bam2FastQ program in

BamUtil (available at http://genome.sph.umich.edu/wiki/BamUtil). The assembly process

only used the subset of reads where both paired end reads were found in the collection of

unmapped RNA-Seq reads. We partitioned the fastq file into smaller subgroups (~1Gb

each) in order to reduce the amount of memory required to assemble the unmapped RNA-

Seq reads. RNA-Seq reads in each subgroup were assembled together using ABySS

(Robertson et al. 2010) with the following parameters: lib=pe200 k=25 n=10.
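
The requirement that both paired end reads be present among the unmapped reads can be sketched as a simple set operation. The sketch assumes each unmapped read has been reduced to a (fragment identifier, mate number) tuple; this representation and the function name are hypothetical and do not reflect the output format of bam2FastQ.

```python
def fragments_with_both_mates(unmapped_reads):
    """Return the set of fragment ids for which mate 1 and mate 2 are both present.

    unmapped_reads is an iterable of (fragment_id, mate_number) tuples,
    where mate_number is 1 or 2 (an assumed representation).
    """
    mates_seen = {}
    for fragment_id, mate_number in unmapped_reads:
        mates_seen.setdefault(fragment_id, set()).add(mate_number)
    return {
        fragment_id
        for fragment_id, mates in mates_seen.items()
        if mates >= {1, 2}
    }


# Toy example: only fragment f1 has both mates in the unmapped collection.
toy_reads = [("f1", 1), ("f1", 2), ("f2", 1), ("f3", 2)]
complete_pairs = fragments_with_both_mates(toy_reads)
```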

Contigs and singleton reads constructed by ABySS from all subgroups were then assembled

together using CAP3 (Huang and Madan 1999) with default parameters to construct an

unmapped transcript library. TransDecoder (Grabherr et al. 2011) was used to identify

coding regions in the unmapped transcript library using the following parameters: -m 50

--search_pfam Pfam-A.hmm. The complete workflow for assembling the unmapped RNA-Seq

reads is summarized in Figure SM7B.

Identifying transposons that are being actively transcribed: In addition to using RNA-Seq

data to improve gene annotations, we can also use the assembled unmapped RNA-Seq data

to examine other transcribed sequences in the genome. BLAT searches of the scaffolds

assembled from unmapped D. mojavensis RNA-Seq reads against the D. mojavensis genome

assembly showed that most of these scaffolds overlap with the transcripts assembled by

Cufflinks and CEM. However, we also found a subset of the assembled scaffolds that overlap

with transposon fragments identified by RepeatMasker (see protocol for creating the

species-specific transposon libraries below). CD-Search (Marchler-Bauer et al. 2011) of

these assembled contigs against the NCBI Conserved Domain Database shows that many of

these contigs contain conserved domains commonly found in transposable elements (e.g.,

Gypsy, HTH_Tnp_Tc3_2, Rnase_H, Helitron_like_N), which suggests that some of these

transposons are being actively transcribed in the D. mojavensis genome.

Cross-species transcript and protein mapping

Annotated proteins from D. melanogaster and predicted proteins assembled from RNA-Seq

reads in the other Drosophila species were mapped against the target genome using a

serial alignment strategy (Korf 2003). Proteins from the source genome were first mapped

against the genome assembly of the target genome using WU TBLASTN (Gish 1996) with

the following parameters: matrix=BLOSUM62, hspsepSmax=40000, hitdist=40,

topComboN=1, e=1e-20, W=4, T=20, B=10000000, V=10000000, filter=seg+xnu,

hspmax=0. We then collected the high-scoring segment pairs (HSPs) for each aligned

protein to define a search window (with 10 kb padding at both ends). The same protein

was re-aligned against this search window using SPALN (Iwata and Gotoh 2012) with

cross-species parameters (-Tdromel -yS -yX).
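
The search window used in the second stage of the serial alignment strategy can be sketched as follows. The function takes the subject coordinates of the HSPs for one protein and returns a window that spans all of them, with 10 kb of padding on each side and clipped to the scaffold boundaries. The function name, the 1-based inclusive coordinates, and the clipping behavior are assumptions; the text only specifies the 10 kb padding.

```python
def search_window(hsps, subject_length, padding=10_000):
    """Return (start, end) of the padded window that spans all HSPs.

    hsps is a list of (subject_start, subject_end) pairs in 1-based
    inclusive coordinates; minus-strand HSPs may have start > end.
    """
    if not hsps:
        raise ValueError("at least one HSP is required to define a window")
    lowest = min(min(start, end) for start, end in hsps)
    highest = max(max(start, end) for start, end in hsps)
    # Clip the padded window to the boundaries of the subject sequence.
    return max(1, lowest - padding), min(subject_length, highest + padding)


# Toy example: two HSPs (one on the minus strand) near the end of a 45 kb scaffold.
window = search_window([(25_000, 26_000), (40_000, 39_000)], subject_length=45_000)
```

Restricting the second, more sensitive alignment to this window is what allows the slower aligner to be run on a small region rather than the whole assembly.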

We used a similar search strategy to map the D. melanogaster transcripts and the predicted

transcripts assembled from RNA-Seq reads in the other Drosophila species to the target

genome. Transcripts were first mapped against the target assembly using WU BLASTN

(Gish 1996) with the following parameters: M=5, N=4, Q=20, R=10, hspsepQmax=1000,

hspsepSmax=20000, hspmax=5000, B=10000000, V=10000000, topComboN=1. We then

collected the HSPs for each aligned transcript to define the search window (with 10 kb

padding at both ends). The transcript was then re-aligned against this search window using

sim4cc (Zhou et al. 2009) with default parameters.

Creating the species-specific transposon libraries

Overview of the pipeline used to construct the species-specific repeat libraries: Previous

studies have shown that many transposons are species-specific (Jurka et al. 2011) and that

using transposon libraries even from closely related Drosophila species could lead to a

substantial underestimate of the total repeat content (Leung et al. 2010). Consequently,

prior to analyzing the types and distribution of transposable elements on the F element, we

needed to first construct species-specific transposon libraries.

A plethora of computational tools has been developed for constructing de novo repeat

libraries from a genome assembly (reviewed in (Bergman and Quesneville 2007; Lerat

2010)). Most alignment-based repeat finders construct de novo transposon libraries using

three major steps: compare the genome against itself, cluster the alignments, and then

generate a consensus sequence for each cluster of alignments (Flutre et al. 2011).

Alternatively, tools such as ReAS (Li et al. 2005) and RepeatScout (Price et al. 2005) use

over-represented k-mers in genomic reads or in a whole genome assembly to identify

repeats. Flutre and colleagues have previously shown that different de novo repeat

discovery approaches can recover sequences that were missed by other algorithms and

that one should utilize multiple computational approaches when constructing species-

specific repeat libraries (Flutre et al. 2011).

Using the REPET pipeline developed by Flutre and colleagues as a template (Flutre et al.

2011), we utilized six different approaches to construct the species-specific library: ReAS,

RepeatModeler, BLASTN+RECON+MAP, dcblast+GROUPER+MAP, PALS+PILER+MUSCLE,

and Tallymer+CD-HIT. All the consensus sequences identified by these different

approaches were combined into a single library. In order to reduce redundancy in the

combined repeat library, we used the UCLUST (Edgar 2010) recentering strategy to

construct a centroid for each cluster of sequences. Each repeat library was classified by a

combination of TEClass (Abrusán et al. 2009), RepClass (Feschotte et al. 2009), and

sequence similarity to the conserved core within block A of the DINE-1 consensus, as

previously defined by Yang and Barbash (Yang and Barbash 2008).

ReAS repeat library: The Drosophila 12 Genomes Consortium has previously created

species-specific transposon libraries for 12 Drosophila species using ReAS (Li et al. 2005;

Drosophila 12 Genomes Consortium et al. 2007). The species-specific ReAS transposon

libraries (v2) are available for download through FlyBase at

ftp://ftp.flybase.net/12_species_analysis/genomes/aaa/transposable_elements/ReAS/v2/.

Transposons in the ReAS library were classified using the protocol described below.

RepeatModeler library: RepeatModeler (Smit and Hubley 2008) was run on the whole

genome assemblies with the WU BLAST search engine using default parameters. While

RepeatModeler included a module for classifying repeats, we found that RepeatModeler

could not classify a substantial fraction of the consensus sequences in the RepeatModeler

libraries. Hence we applied our repeat classification protocol (described below) to re-

classify the RepeatModeler consensus sequences.

BLASTN+RECON+MAP library: Each genome assembly was aligned against itself using the

BLASTN program in the NCBI BLAST+ suite (Camacho et al. 2009) with the following

parameters: -max_target_seqs 10000000 -evalue 1e-300 -perc_identity 90 -reward 1

-penalty -1 -gapopen 2 -gapextend 2. Trivial self-alignments and alignments shorter than

100 bp in length were filtered from the BLASTN output.
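
The filtering of the BLASTN output described above can be sketched as a single predicate. The sketch assumes each alignment has been parsed into a record with query and subject names, coordinates, and alignment length, and it treats an alignment as a trivial self-alignment when a sequence is aligned to the identical coordinates on itself. The `Hit` record, the function name, and this definition of "trivial" are assumptions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed alignment (field layout is an assumption)."""
    query: str
    subject: str
    qstart: int
    qend: int
    sstart: int
    send: int
    length: int


def keep_hit(hit, min_length=100):
    """Return True if the hit is neither a trivial self-alignment nor too short."""
    trivial = (
        hit.query == hit.subject
        and hit.qstart == hit.sstart
        and hit.qend == hit.send
    )
    return not trivial and hit.length >= min_length


toy_hits = [
    Hit("s1", "s1", 1, 5000, 1, 5000, 5000),      # trivial self-alignment
    Hit("s1", "s1", 100, 400, 2100, 2400, 301),   # repeat within the same scaffold
    Hit("s1", "s2", 10, 80, 500, 570, 71),        # shorter than 100 bp
]
filtered = [hit for hit in toy_hits if keep_hit(hit)]
```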

We then used RECON to analyze the collection of filtered alignments using default

parameters in order to cluster the BLASTN alignments. Sequences within each cluster were

aligned with MAP to construct the consensus sequences using the following parameters:

gap size=50, mismatch=-8, gap open=16, gap extend=4.

Note that we used RECON version 1.07 (maintained by the developers of RepeatModeler,

available at http://www.repeatmasker.org/RepeatModeler.html) instead of the official

release (1.05; Bao and Eddy 2002) because it contains a bug fix for running RECON on

64-bit machines. We used the implementation of the MAP program (rpt_map) provided by

the developers of the REPET package (Flutre et al. 2011) instead of the original version

(Huang 1994) because it has been optimized for handling multiple sequences.

dcblast+GROUPER+MAP library: Each genome was soft-masked using WindowMasker

(Morgulis et al. 2006) and then aligned against itself using the discontiguous megablast

program in the NCBI BLAST+ suite (Morgulis et al. 2008) with the following parameters:

-evalue 1e-20 -max_target_seqs 100 -task dc-megablast -db_soft_mask 30.

The dc-megablast alignments were clustered together using GROUPER (Quesneville et al.

2003) with the following parameters: -j -Z 3 -X 2 -G -1. Sequences within each GROUPER

cluster were aligned with MAP to construct the consensus sequences using the following

parameters: gap size=50, mismatch=-8, gap open=16, gap extend=4.

PALS+PILER+MUSCLE library: Each genome assembly was aligned against itself using PALS

(with the -self parameter) and PILER-DF was used with the -trfs parameter and default

parameters to produce alignment clusters (Edgar and Myers 2005). Sequences within each

cluster were aligned against each other using MUSCLE (Edgar 2004) to produce the final

consensus sequences using the following parameters: -maxiters 1 -diags 1.

Tallymer+CD-HIT library: Using the occratio program in Tallymer (Kurtz et al. 2008), we

analyzed the entire genome assembly to determine the size of k such that approximately

95% of the k-mers were unique (k=18 for D. melanogaster, k=17 for D. erecta, k=18 for D.

mojavensis, k=21 for D. grimshawi). Using this species-specific k-mer size (-mersize k), we

identified k-mers that appear at least 40 times (-minocc 40) using the Tallymer mkindex

program. Regions that matched these high frequency k-mers were identified using the

“search” program in Tallymer with default parameters. Only Tallymer matches with a

minimum length of 80 bp were kept.
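
The choice of k and the selection of high-frequency k-mers can be illustrated with a small pure-Python sketch. It does not call Tallymer and is not the authors' code: it assumes that "unique" means the fraction of distinct k-mers that occur exactly once, it counts k-mers on one strand only, and all function names are hypothetical. It is practical only for toy sequences.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every k-mer on the given strand of `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def unique_kmer_fraction(sequence, k):
    """Fraction of distinct k-mers that occur exactly once (assumed definition)."""
    counts = kmer_counts(sequence, k)
    if not counts:
        return 0.0
    unique = sum(1 for n in counts.values() if n == 1)
    return unique / len(counts)


def smallest_k_reaching(sequence, target=0.95, k_min=1, k_max=32):
    """Return the smallest k whose unique fraction is at least `target`, else None."""
    for k in range(k_min, k_max + 1):
        if unique_kmer_fraction(sequence, k) >= target:
            return k
    return None


def frequent_kmers(sequence, k, min_occurrences=40):
    """Return the k-mers that occur at least `min_occurrences` times."""
    return {kmer for kmer, n in kmer_counts(sequence, k).items()
            if n >= min_occurrences}


# Toy input: k=4 is the first size at which every k-mer is unique.
toy_k = smallest_k_reaching("ACGTACGA")
```

On a real genome the same idea applies at much larger scale, which is why an index-based tool such as Tallymer is used instead of an in-memory counter.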

The Tallymer matches were clustered together using the cd-hit-est program in the CD-HIT

package (Li and Godzik 2006) to construct the final consensus sequences with the

following parameters: -c 0.9 -n 8 -r 1.

Combined repeat library: Low complexity sequences in each unclassified de novo repeat

library were identified by TRF (Benson 1999) and nseg (Wootton and Federhen 1993). TRF

was run using the following parameters: 2 7 7 80 10 50 2000, while nseg (part of the WU-

BLAST package) was run with -x and default parameters (window=21, locut=1.4,

hicut=1.6). Consensus sequences were removed from the library if more than 70% of the

sequence was masked by a combination of TRF and nseg.
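
The 70% criterion can be expressed as a short sketch. It assumes that the masked regions reported by TRF and nseg have already been parsed into half-open (start, end) intervals on each consensus sequence; the parsing step, the interval convention, and the function names are assumptions for illustration, not part of the published protocol.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def is_low_complexity(seq_length, trf_intervals, nseg_intervals, max_masked=0.70):
    """True if more than `max_masked` of the sequence is masked by TRF and nseg combined."""
    masked = merged_length(list(trf_intervals) + list(nseg_intervals))
    return masked / seq_length > max_masked


# Hypothetical 100 bp consensus: TRF masks 0-50, nseg masks 40-80 (80% combined).
remove_it = is_low_complexity(100, [(0, 50)], [(40, 80)])
```

Merging the two sets of intervals before measuring coverage avoids counting bases masked by both programs twice.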

Consensus sequences from the six de novo repeat libraries were combined together into a

single file, and the UCLUST algorithm implemented by USEARCH (Edgar 2010) was used to

remove redundant sequences in the combined repeat library. Sequences were clustered

together using USEARCH with the following parameters: -cluster_fast -id 0.95 -target_cov

0.98 -sizeout. The clusters were then sorted by size (-sortbysize) and the centroids for the

combined library were produced using the following parameters: -cluster_smallmem -id

0.97 -centroids.

Repeat classification: We initially used TEClass (Abrusán et al. 2009) with default

parameters to classify the sequences in each de novo repeat library. Repeats that were in

the “Unclear” class were extracted from the repeat library and then classified by RepClass

(Feschotte et al. 2009) using RepBase release 17.09.

Because neither TEClass nor RepClass was specifically designed to identify DINE-1

elements, we reclassified repeats in each de novo repeat library based on sequence

similarity to the DINE-1 element in each species. The sequence for the conserved core

element within block A of the DINE-1 consensus for each species was obtained from the

multiple sequence alignment of DINE-1 elements produced by Yang and Barbash (Yang and

Barbash 2008). Sequences in each de novo repeat library were compared against the

species-specific conserved core sequence of DINE-1 using CENSOR (Kohany et al. 2006)

with the following parameters: -redundant -nofilter -s.

Comparing results from different de novo repeat analysis pipelines: To evaluate the efficacy

of the various approaches for constructing de novo repeat libraries, we analyzed the repeat

density of the D. melanogaster F element and the base of the D element using the six de novo repeat

libraries and the combined repeat library. In addition, we also analyzed the two regions

using the RepBase Drosophila library (Jurka et al. 2005) as reference.

Most of the repeat libraries produced repeat density estimates that were similar to the

results obtained using the Drosophila RepBase library. However, the PALS+PILER+MUSCLE

pipeline consistently underestimated the true repeat content because it failed to identify

most of the DINE-1 fragments. Despite differences in the estimated repeat content, the F

element showed higher repeat density than the euchromatic reference region from the D

element, irrespective of the repeat libraries used (Figure SM8). An Excel workbook with all

the repeat analysis results is available in File S4.

Omitting genes from the analysis of gene movement

Of the 79 D. melanogaster F element genes annotated by FlyBase in release

5.50, two (CG11231 and JYalpha) were omitted from the analysis of

gene movement.

Comparison between the genes found in the F elements of D. melanogaster and D. erecta

showed that the D. erecta F element scaffold (scaffold_4512) is in the reverse

complemented orientation relative to the D. melanogaster F element. The gene order is

completely syntenic between the D. erecta and D. melanogaster F elements, with the

exception of CG11231, which cannot be placed in the D. erecta CAF1 assembly. A TBLASTN

search of the D. melanogaster CG11231 protein against the genome assemblies of D. erecta

and D. mojavensis showed a large number of weak matches with similar E-values. The best

match to CG11231 was found in scaffold_4797, but it contained multiple in-frame stop

codons. A TBLASTN search of the D. melanogaster CG11231 protein against the D.

grimshawi assembly failed to detect any regions with significant (E-value < 1e-5) similarity.
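
The gene-order comparison described at the start of this paragraph can be illustrated with a minimal sketch. The gene labels below are hypothetical placeholders rather than the actual F element gene order, and the function is an assumed simplification: it only tests whether two ordered gene lists are identical or exact reverses of each other after excluding genes that cannot be placed.

```python
def is_collinear(reference_order, query_order, ignore=()):
    """True if the gene order matches in either orientation.

    Genes listed in `ignore` (for example, a gene that cannot be placed in
    the query assembly) are dropped from both lists before comparison.
    """
    ignored = set(ignore)
    ref = [gene for gene in reference_order if gene not in ignored]
    qry = [gene for gene in query_order if gene not in ignored]
    return qry == ref or qry == ref[::-1]


# Hypothetical example: the query scaffold is in the reverse orientation
# and lacks the unplaced gene "g2".
reference = ["g1", "g2", "g3", "g4"]
query = ["g4", "g3", "g1"]
same_order = is_collinear(reference, query, ignore=["g2"])
```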

A BLASTN (Altschul et al. 1990) search of the D. melanogaster CG11231 transcript against

the D. melanogaster release 5 assembly showed multiple significant matches to chrU, which

contained all the unplaced scaffolds in the D. melanogaster assembly. A CENSOR search of

the CG11231 transcript against the Drosophila RepBase library revealed multiple

significant matches to transposon fragments. A BLASTN search of CG11231 against the

assemblies of species more closely related to D. melanogaster (e.g., D. simulans) also

showed multiple weak matches. Collectively, the comparative analysis indicates that

CG11231 is either a gene specific to D. melanogaster or a misannotation.

The D. melanogaster F element also contained a partial gene (JYalpha); the complete gene

model (CG40625) is placed on chrU. The putative ortholog of CG40625 is found on the D.

erecta F element scaffold (4512), but this ortholog is found on an unplaced scaffold (3030) in

D. mojavensis and cannot be found by TBLASTN in the D. grimshawi genome assembly.

Hence we have also omitted this gene from the gene movement analysis.

Assignment of wanderer genes to Muller elements

The assignment of genomic scaffolds to Muller elements is primarily based on the work of

Schaeffer and colleagues (Schaeffer et al. 2008). The assignment of the putative orthologs

of the PRY gene to the Y chromosome is based on previous work by Koerich and colleagues

(Koerich et al. 2008). The D. virilis Or13a gene is found in scaffold_13050, which has not

been assigned to a Muller element by Schaeffer and colleagues. However, seven of the eight

genes on this 3.4 Mb scaffold (Or13a, Stim, CG8578, CG33172, Myb, Ranbp16, and Rrp47) are

found on the A element in D. melanogaster, and the remaining gene (CG42617) is found in

the heterochromatic region of 3R (chr3RHet). Because most genes remained on the same

Muller element across the different Drosophila species (Bhutkar et al. 2008), we assigned

the D. virilis ortholog of Or13a to the A element.
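
The reasoning behind this assignment can be sketched as a simple majority vote over the D. melanogaster locations of the genes on the scaffold. The gene locations below are taken from the text; the majority rule and the function name are assumptions used for illustration and are not presented as the authors' procedure.

```python
from collections import Counter

# D. melanogaster locations of the eight genes on D. virilis scaffold_13050,
# as stated in the text.
gene_locations = {
    "Or13a": "A",
    "Stim": "A",
    "CG8578": "A",
    "CG33172": "A",
    "Myb": "A",
    "Ranbp16": "A",
    "Rrp47": "A",
    "CG42617": "chr3RHet",
}


def assign_by_majority(locations):
    """Return the most common location and the fraction of genes supporting it."""
    counts = Counter(locations.values())
    element, votes = counts.most_common(1)[0]
    return element, votes / len(locations)


element, support = assign_by_majority(gene_locations)
```

Here seven of the eight genes support the A element, which matches the assignment made in the text.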

Software versions

The bioinformatics tools used in this analysis and their versions are listed in Table SM2.

LITERATURE CITED

Abrusán, G., N. Grundmann, L. DeMester, and W. Makalowski, 2009 TEclass — a tool for

automated classification of unknown eukaryotic transposable elements. Bioinforma.

Oxf. Engl. 25: 1329–1330.

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, 1990 Basic local alignment

search tool. J. Mol. Biol. 215: 403–410.

Bao, Z., and S. R. Eddy, 2002 Automated de novo identification of repeat sequence families

in sequenced genomes. Genome Res. 12: 1269–1276.

Barnett, D. W., E. K. Garrison, A. R. Quinlan, M. P. Strömberg, and G. T. Marth, 2011

BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinforma.

Oxf. Engl. 27: 1691–1692.

Batzoglou, S., D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre et al., 2002 ARACHNE: a whole-

genome shotgun assembler. Genome Res. 12: 177–189.

Benson, G., 1999 Tandem repeats finder: a program to analyze DNA sequences. Nucleic

Acids Res. 27: 573–580.

Bergman, C. M., and H. Quesneville, 2007 Discovering and detecting transposable elements

in genome sequences. Brief. Bioinform. 8: 382–392.

Bhutkar, A., S. W. Schaeffer, S. M. Russo, M. Xu, T. F. Smith et al., 2008 Chromosomal

rearrangement inferred from comparisons of 12 Drosophila genomes. Genetics 179:

1657–1680.

Cabili, M. N., C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega et al., 2011 Integrative annotation

of human large intergenic noncoding RNAs reveals global properties and specific

subclasses. Genes Dev. 25: 1915–1927.

Cáceres, M., M. Puig, and A. Ruiz, 2001 Molecular characterization of two natural hotspots

in the Drosophila buzzatii genome induced by transposon insertions. Genome Res.

11: 1353–1364.

Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos et al., 2009 BLAST+:

architecture and applications. BMC Bioinformatics 10: 421.

Casals, F., M. Cáceres, M. H. Manfrin, J. González, and A. Ruiz, 2005 Molecular

characterization and chromosomal distribution of Galileo, Kepler and Newton, three

foldback transposable elements of the Drosophila buzzatii species complex. Genetics

169: 2047–2059.

De Freitas Ortiz, M., and E. L. S. Loreto, 2009 Characterization of new hAT transposable

elements in 12 Drosophila genomes. Genetica 135: 67–75.

Drosophila 12 Genomes Consortium, A. G. Clark, M. B. Eisen, D. R. Smith, C. M. Bergman et

al., 2007 Evolution of genes and genomes on the Drosophila phylogeny. Nature 450:

203–218.

Edgar, R. C., 2004 MUSCLE: multiple sequence alignment with high accuracy and high

throughput. Nucleic Acids Res. 32: 1792–1797.

Edgar, R. C., 2010 Search and clustering orders of magnitude faster than BLAST.

Bioinforma. Oxf. Engl. 26: 2460–2461.

Edgar, R. C., and E. W. Myers, 2005 PILER: identification and classification of genomic

repeats. Bioinforma. Oxf. Engl. 21 Suppl 1: i152–158.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein, 1998 Cluster analysis and display

of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 95: 14863–14868.

Feschotte, C., U. Keswani, N. Ranganathan, M. L. Guibotsy, and D. Levine, 2009 Exploring

repetitive DNA landscapes using REPCLASS, a tool that automates the classification

of transposable elements in eukaryotic genomes. Genome Biol. Evol. 1: 205–220.

Flutre, T., E. Duprat, C. Feuillet, and H. Quesneville, 2011 Considering transposable element

diversification in de novo annotation approaches. PloS One 6: e16526.

Gish, W., 1996 http://blast.wustl.edu.

Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson et al., 2011 Full-length

transcriptome assembly from RNA-Seq data without a reference genome. Nat.

Biotechnol. 29: 644–652.

Graveley, B. R., A. N. Brooks, J. W. Carlson, M. O. Duff, J. M. Landolin et al., 2011 The

developmental transcriptome of Drosophila melanogaster. Nature 471: 473–479.

Hoskins, R. A., J. W. Carlson, C. Kennedy, D. Acevedo, M. Evans-Holm et al., 2007 Sequence

finishing and mapping of Drosophila melanogaster heterochromatin. Science 316:

1625–1628.

Huang, X., 1994 On global sequence alignment. Comput. Appl. Biosci. CABIOS 10: 227–235.

Huang, X., and A. Madan, 1999 CAP3: A DNA sequence assembly program. Genome Res. 9:

868–877.

Iwata, H., and O. Gotoh, 2012 Benchmarking spliced alignment programs including Spaln2,

an extended version of Spaln that incorporates additional species-specific features.

Nucleic Acids Res. 40: e161.

Jurka, J., W. Bao, and K. K. Kojima, 2011 Families of transposable elements, population

structure and the origin of species. Biol. Direct 6: 44.

Jurka, J., V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany et al., 2005 Repbase Update, a

database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110: 462–467.

Kaminker, J. S., C. M. Bergman, B. Kronmiller, J. Carlson, R. Svirskas et al., 2002 The

transposable elements of the Drosophila melanogaster euchromatin: a genomics

perspective. Genome Biol. 3: RESEARCH0084.

Koerich, L. B., X. Wang, A. G. Clark, and A. B. Carvalho, 2008 Low conservation of gene

content in the Drosophila Y chromosome. Nature 456: 949–951.

Kohany, O., A. J. Gentles, L. Hankus, and J. Jurka, 2006 Annotation, submission and screening

of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC

Bioinformatics 7: 474.

Korf, I., 2003 Serial BLAST searching. Bioinforma. Oxf. Engl. 19: 1492–1496.

Kurtz, S., A. Narechania, J. C. Stein, and D. Ware, 2008 A new method to compute K-mer

frequencies and its application to annotate large repetitive plant genomes. BMC

Genomics 9: 517.

Langmead, B., and S. L. Salzberg, 2012 Fast gapped-read alignment with Bowtie 2. Nat.

Methods 9: 357–359.

Lerat, E., 2010 Identifying repeats and transposable elements in sequenced genomes: how

to find your way through the dense forest of programs. Heredity 104: 520–533.

Leung, W., C. D. Shaffer, T. Cordonnier, J. Wong, M. S. Itano et al., 2010 Evolution of a distinct

genomic domain in Drosophila: comparative analysis of the dot chromosome in

Drosophila melanogaster and Drosophila virilis. Genetics 185: 1519–1534.

Li, W., and A. Godzik, 2006 Cd-hit: a fast program for clustering and comparing large sets of

protein or nucleotide sequences. Bioinforma. Oxf. Engl. 22: 1658–1659.

Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., 2009 The Sequence

Alignment/Map format and SAMtools. Bioinforma. Oxf. Engl. 25: 2078–2079.

Li, W., and T. Jiang, 2012 Transcriptome assembly and isoform expression level estimation

from biased RNA-Seq reads. Bioinforma. Oxf. Engl. 28: 2914–2921.

Li, R., J. Ye, S. Li, J. Wang, Y. Han et al., 2005 ReAS: Recovery of ancestral sequences for

transposable elements from the unassembled reads of a whole genome shotgun.

PLoS Comput. Biol. 1: e43.

Marchler-Bauer, A., S. Lu, J. B. Anderson, F. Chitsaz, M. K. Derbyshire et al., 2011 CDD: a

Conserved Domain Database for the functional annotation of proteins. Nucleic Acids

Res. 39: D225–229.

Martin, J. A., and Z. Wang, 2011 Next-generation transcriptome assembly. Nat. Rev. Genet.

12: 671–682.

Morgulis, A., G. Coulouris, Y. Raytselis, T. L. Madden, R. Agarwala et al., 2008 Database

indexing for production MegaBLAST searches. Bioinforma. Oxf. Engl. 24: 1757–

1764.

Morgulis, A., E. M. Gertz, A. A. Schäffer, and R. Agarwala, 2006 WindowMasker: window-

based masker for sequenced genomes. Bioinforma. Oxf. Engl. 22: 134–141.

Mouse Genome Sequencing Consortium, R. H. Waterston, K. Lindblad-Toh, E. Birney, J.

Rogers et al., 2002 Initial sequencing and comparative analysis of the mouse

genome. Nature 420: 520–562.

Muller, H., 1930 Types of visible variations induced by X-rays in Drosophila. J. Genet. 22:

299–334.

Muller, H. J., 1940 Bearings of the “Drosophila” work on systematics, pp. 185–268 in The

New Systematics, edited by J. Huxley. Oxford: Clarendon Press.

Myers, E. W., G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo et al., 2000 A whole-genome

assembly of Drosophila. Science 287: 2196–2204.

Phillippy, A. M., M. C. Schatz, and M. Pop, 2008 Genome assembly forensics: finding the

elusive mis-assembly. Genome Biol. 9: R55.

Price, A. L., N. C. Jones, and P. A. Pevzner, 2005 De novo identification of repeat families in

large genomes. Bioinforma. Oxf. Engl. 21 Suppl 1: i351–358.

Quesneville, H., D. Nouaud, and D. Anxolabéhère, 2003 Detection of new transposable

element families in Drosophila melanogaster and Anopheles gambiae genomes. J.

Mol. Evol. 57 Suppl 1: S50–59.

Riddle, N. C., A. Minoda, P. V. Kharchenko, A. A. Alekseyenko, Y. B. Schwartz et al., 2011

Plasticity in patterns of histone modifications and chromosomal proteins in

Drosophila heterochromatin. Genome Res. 21: 147–163.

Robertson, G., J. Schein, R. Chiu, R. Corbett, M. Field et al., 2010 De novo assembly and

analysis of RNA-seq data. Nat. Methods 7: 909–912.

Schaeffer, S. W., A. Bhutkar, B. F. McAllister, M. Matsuda, L. M. Matzkin et al., 2008 Polytene

chromosomal maps of 11 Drosophila species: the order of genomic scaffolds

inferred from genetic and physical maps. Genetics 179: 1601–1655.

Smit, A. F. A., and R. Hubley, 2008 RepeatModeler Open-1.0.

Talbert, P. B., and S. Henikoff, 2010 Centromeres convert but don’t cross. PLoS Biol. 8:

e1000326.

Tatusova, T. A., and T. L. Madden, 1999 BLAST 2 Sequences, a new tool for comparing

protein and nucleotide sequences. FEMS Microbiol. Lett. 174: 247–250.

Trapnell, C., L. Pachter, and S. L. Salzberg, 2009 TopHat: discovering splice junctions with

RNA-Seq. Bioinforma. Oxf. Engl. 25: 1105–1111.

Trapnell, C., B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan et al., 2010 Transcript assembly

and quantification by RNA-Seq reveals unannotated transcripts and isoform

switching during cell differentiation. Nat. Biotechnol. 28: 511–515.

Wootton, J. C., and S. Federhen, 1993 Statistics of local complexity in amino acid sequences

and sequence databases. Comput. Chem. 17: 149–163.

Yang, H.-P., and D. A. Barbash, 2008 Abundant and species-specific DINE-1 transposable

elements in 12 Drosophila genomes. Genome Biol. 9: R39.

Yasuhara, J. C., and B. T. Wakimoto, 2008 Molecular landscape of modified histones in

Drosophila heterochromatic genes and euchromatin-heterochromatin transition

zones. PLoS Genet. 4: e16.

Zhou, L., M. Pertea, A. L. Delcher, and L. Florea, 2009 Sim4cc: a cross-species spliced

alignment program. Nucleic Acids Res. 37: e80.

Zimin, A. V., D. R. Smith, G. Sutton, and J. A. Yorke, 2008 Assembly reconciliation.

Bioinforma. Oxf. Engl. 24: 42–45.

SUPPLEMENTAL METHODS FIGURES

Figure SM1 CENSOR search of the D. buzzatii ISBu1 transposon insertion sequence (AY756165.1) against the Drosophila RepBase library shows that it has 94% sequence similarity to the Homo6 consensus sequence in RepBase. The RepBase Homo6 record was last updated in release 17.07, where part of the Homo6 consensus sequence has been masked as a helitron.

Figure SM2 Heat maps of CAI versus Nc show that D. grimshawi F element genes are under stronger selective pressure than is the case for other species. Genes with high z-scores are shown in blue in the heat map, while genes with low z-scores are shown in red. Most of the genes on the D. melanogaster, D. erecta, and D. mojavensis F elements show a red-red or blue-blue pattern in the heat map for CAI and Nc, indicating that most of the observed codon bias in these genes can be attributed to mutational biases rather than selection. In contrast, the heat maps for the D. grimshawi F element and the base of the D elements show a red-blue pattern for CAI and Nc, which indicates that most of the codon biases observed in these regions are the result of selection. Similar to the base of the D. erecta D element, genes on the extended (D. ere: D (ext.)) and the telomeric (D. ere: D (tel.)) regions of the D. erecta D element also showed the red-blue pattern for CAI and Nc. The order of the genes in each heat map is determined by Ward hierarchical clustering.

Figure SM3 Different codon usage patterns in the F element genes compared to the euchromatic reference regions. (Left) The heat map shows the proportion of codon usage for each amino acid. Preferred codons are in blue and less frequent codons are in red. The heat map shows that F element genes have a different codon usage pattern compared to genes in the D euchromatic reference regions. The codon usage patterns in all the euchromatic reference regions are similar, despite the fact that the genes on the D. mojavensis and D. grimshawi D euchromatic reference regions differ from those found on the D. melanogaster and D. erecta euchromatic reference regions. (Right) Genes in the euchromatic reference regions show a strong preference for CTG over TTA for leucine (L), while genes in the F element do not show the same bias. CAA is preferred over CAG for glutamine (Q) in the D. melanogaster and D. erecta F elements, while there is no preference in the D. mojavensis and D. grimshawi F elements. In contrast, the D euchromatic reference regions show a strong preference for CAG over CAA. The euchromatic reference regions show a strong preference for GTG over GTA for valine (V), and the D. grimshawi F element shows the same preference. In contrast, the other F elements show a preference for GTT over GTC.

Figure SM4 D. grimshawi scaffolds with tandem arrays of LTRs generate a large number of whole genome alignments. (A) BLAST 2 alignment of the D. grimshawi scaffolds 17366 and 14590 shows a tandem array of matches. The unaligned regions correspond to gaps in either scaffold 17366 or 14590. (B) Searching scaffold 17366 against the Drosophila RepBase library shows a tandem array of matches (~155 bp) to the D. grimshawi Gypsy-5 LTR across the entire length of this scaffold.

Figure SM5 Analysis of missing mate pairs indicates that most of the highly repetitive scaffolds in the D. grimshawi assembly are caused by misassemblies. Analysis of the reads.placed file from the CAF1 assembly shows the majority of the reads placed in scaffolds with a large number of PALS hits are unpaired. Most of the missing mate pairs are unplaced in the CAF1 assembly.

Figure SM6 Improving the D. melanogaster and D. erecta gene annotations using the D. yakuba RNA-Seq data. In release 5.50, FlyBase annotated the gene CG13814 next to rdgC in D. melanogaster. The GLEAN-R gene prediction GG16094 is assigned as a putative ortholog of CG13814 in D. erecta (black arrows). TopHat splice junction predictions and mapping of assembled proteins from D. yakuba RNA-Seq reads suggest that GG16094 is likely part of an unannotated isoform of rdgC in D. erecta (red arrows). FlyBase has subsequently revised the D. melanogaster annotation for this region in release 5.51 and merged CG13814 with rdgC.

Figure SM7 Pipelines used to construct the species-specific modENCODE RNA-Seq transcriptome library. (A) RNA-Seq reads are mapped against the corresponding genome assembly using TopHat to discover novel splice junctions. We re-ran TopHat in ‘--no-novel-juncs’ mode using the splice junctions from the initial TopHat run as raw junctions to increase the number of mapped reads. The mapped reads are assembled independently using Cufflinks and CEM to construct the transcript library. Coding regions in the assembled transcripts are identified using TransDecoder. (B) Unmapped RNA-Seq reads are assembled together to produce an unmapped transcript library. To reduce memory requirements, the unmapped RNA-Seq reads are divided into smaller partitions and reads in each partition are assembled together using ABySS. Contigs produced by ABySS are combined into a single collection and assembled together using CAP3. Coding regions within the unmapped assembled transcripts are identified using TransDecoder.

Figure SM8 Repeat density estimates for the D. melanogaster F element and the base of the D elements using eight different repeat libraries with RepeatMasker. Irrespective of the repeat library used in the analysis, the D. melanogaster F element shows higher repeat density than the D element. Except for the PALS+PILER+MUSCLE pipeline, the other de novo repeat discovery pipelines produce estimates of total repeat density that are similar to the results obtained using the RepBase Drosophila library. The PALS+PILER+MUSCLE pipeline has a lower estimate of the total repeat density than the other libraries in both the F element and the euchromatic reference region, particularly in the density of the DINE-1 elements.

SUPPLEMENTAL METHODS TABLES

Table SM1 The list of D. grimshawi scaffolds that show the largest number of PALS alignments when the scaffold is aligned against the entire D. grimshawi assembly

D. grimshawi scaffold    Number of PALS hits
scaffold_17366           1,043,884
scaffold_14590           1,015,371
scaffold_14979           975,948
scaffold_14591           965,159
scaffold_6903            859,208
scaffold_13968           806,230
scaffold_17208           799,044
scaffold_6381            774,881
scaffold_15098           772,658
scaffold_15099           747,890

Table SM2 Version information for the main bioinformatics tools used in this analysis

Program                     Version
ABySS                       1.3.3
BamTools                    2.3.0
BamUtil                     1.0.7
BEDTools                    2.16.1
BLAT                        34x13
Bowtie 2                    2.1.0
CAP3                        12/21/07
CD-HIT                      4.6
CEM                         0.9.1
CENSOR                      4.2.28
Cufflinks                   2.1.1
EMBOSS                      6.5.7.0
GNU Parallel                20130422
GRIMM                       2.0.1
GROUPER                     2.25
LAST                        291
MAP                         REPET 2.0
MUSCLE                      3.8.31
NCBI BLAST+                 2.2.27+
PALS                        1.0_p1
Phred, Phrap, and Consed    0.071220.b, 1.090518, 15.0
PILER                       1.0
R                           3.0.2
RECON                       1.0.7
RepClass                    1.0.1
RepeatMasker                open-3.4.0
RepeatModeler               open-1.0.5
RepeatScout                 1.0.5
REPET                       2.0
SAMtools                    0.1.18
scnRCA                      04/11/2013
sim4cc                      2010-11-22
SPALN                       2.1.2
Tallymer                    gt 1.4.2
tantan                      13
TEClass                     2.1
TopHat                      2.0.8b
TransDecoder                2012-08-15
TRF                         4.0.4
UCSC Genome Browser         270
USEARCH                     6.0.203
WindowMasker                1.0.0
WU BLAST                    2.0MP-WashU
