
LARGE-SCALE BIOLOGY ARTICLE

Evolutionary Origins of Pseudogenes and Their Association with Regulatory Sequences in Plants [OPEN]

Jianbo Xie,a,b,c Ying Li,a,b,c Xiaomin Liu,b,c Yiyang Zhao,a,b,c Bailian Li,a,b,c,d Pär K. Ingvarsson,e and Deqiang Zhang a,b,c,1

a Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People’s Republic of China
b National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People’s Republic of China
c Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People’s Republic of China
d Department of Forestry, North Carolina State University, Raleigh, North Carolina 27695-8203
e Linnean Center for Plant Biology, Department of Plant Biology, Swedish University of Agricultural Sciences, Box 7080, SE-750 07 Uppsala, Sweden

ORCID IDs: 0000-0002-8650-7675 (J.X.); 0000-0001-6005-1174 (Y.L.); 0000-0002-8418-2870 (X.L.); 0000-0002-3077-9401 (Y.Z.); 0000-0002-5310-4466 (B.L.); 0000-0001-9225-7521 (P.K.I.); 0000-0002-8849-2366 (D.Z.)

Pseudogenes (Cs), nonfunctional relatives of functional genes, form by duplication or retrotransposition, and loss of gene function by disabling mutations. Evolutionary analysis provides clues to C origins and effects on gene regulation. However, few systematic studies of plant Cs have been conducted, hampering comparative analyses. Here, we examined the origin, evolution, and expression patterns of Cs and their relationships with noncoding sequences in seven angiosperm plants. We identified ~250,000 Cs, most of which are more lineage specific than protein-coding genes. The distribution of Cs on the chromosome indicates that genome recombination may contribute to C elimination. Most Cs evolve rapidly in terms of sequence and expression levels, showing tissue- or stage-specific expression patterns. We found that a surprisingly large fraction of nontransposable element regulatory noncoding RNAs (microRNAs and long noncoding RNAs) originate from transcription of C proximal upstream regions. We also found that transcription factor binding sites preferentially occur in putative C proximal upstream regions compared with random intergenic regions, suggesting that Cs have conditioned genome evolution by providing transcription factor binding sites that serve as promoters and enhancers. We therefore propose that rapid rewiring of C transcriptional regulatory regions is a major mechanism driving the origin of novel regulatory modules.

INTRODUCTION

Pseudogenes (Cs) are disabled copies of protein-coding genes and are often referred to as genomic fossils (Balasubramanian et al., 2009; Sisu et al., 2014). Protein-coding genes become Cs if degenerated features are present, such as frameshifts, in-frame stop codons, and truncations of full-length genes (Zhang et al., 2003). Depending on the mechanism of the duplication event, Cs can be classified into two categories: nonprocessed and processed. Nonprocessed Cs originated from genomic DNA duplication or unequal crossing-over; processed Cs originated from reverse transcription and integration events (Zhang et al., 2003; Zou et al., 2009). Cs have been defined as nonfunctional sequences and thus are expected to evolve neutrally (Torrents et al., 2003); consistent with this, the majority of Cs evolve neutrally in the human (Homo sapiens), worm (Caenorhabditis elegans), and fruitfly (Drosophila melanogaster) genomes (Sisu et al., 2014).

Although Cs are disabled copies of protein-coding genes, a small fraction of Cs have been shown to function as versatile regulators in fundamental processes, acting by producing regulatory RNAs (Guo et al., 2009; Wen et al., 2011). For example, several studies suggest that Cs could serve as sources of endogenous small interfering RNAs (Tam et al., 2008; Watanabe et al., 2008; Wen et al., 2011). Cs have also been shown to regulate gene expression by sequestering microRNAs (miRNAs; Poliseno et al., 2010). These observations suggest that Cs play regulatory roles in gene expression and have motivated scientists to investigate the functions of Cs in different organisms.

Evolutionary analyses of Cs, including their expression patterns and associations with noncoding RNAs (Guo et al., 2009), have provided important clues into lineage-specific genomic evolutionary histories and the genetic basis of C functions. However, despite growing interest in Cs, such analyses remain scarce. Further evolutionary studies of Cs can be informative for identifying the origin and regulation of RNA genes.

1 Address correspondence to [email protected].
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantcell.org) is: Deqiang Zhang ([email protected]).
[OPEN] Articles can be viewed without a subscription.
www.plantcell.org/cgi/doi/10.1105/tpc.18.00601

The Plant Cell, Vol. 31: 563–578, March 2019, www.plantcell.org © 2019 ASPB.

The evolutionary forces that affect the chromosomal distribution of Cs are poorly understood. Genome duplication (paleopolyploidy) is common in flowering plants (Wendel, 2000). The long-term evolution of paleopolyploids often involves extensive genome reorganization and elimination of a large fraction of duplicate genes (Wolfe, 2001). This may produce thousands of Cs in plant genomes. Recombination has been recognized as one of the key factors that shapes genomic features, such as the elimination/retention of duplicated genes after whole-genome duplication (WGD), distribution of transposable elements (TEs) and genes, and nucleotide variation in eukaryotes (Gaut et al., 2007; Tian et al., 2009; Du et al., 2012). Cs in the human and fruitfly genomes are enriched in regions of low recombination, but Cs in worms show the opposite trend (Sisu et al., 2014). These observations may conflict because of differing distributions of recombination events in specific genomes, based on differing distributions of heterochromatin.

Few studies have performed genome-wide, multispecies analyses of Cs regarding their rates of evolution and surrounding chromatin environment. As a result, little is known about the evolutionary forces that shape the patterns of Cs in paleopolyploid organisms. The only cross-species comparison of Cs in plants concerned the identification and evolution of Cs in Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) genomes (Zou et al., 2009). Therefore, the evolution of plant Cs requires further examination. The availability of complete annotations of several plant genomes, including rice, Arabidopsis, and Populus trichocarpa, allowed us to embark on a comprehensive, cross-species comparison to discover common features of the evolution of Cs across different organisms. In addition, the recombination data from soybean (Glycine max; Du et al., 2012) provided us with the opportunity to study whether recombination shaped the distribution of Cs in detail. A surprisingly large fraction of non–TE long noncoding RNA (lncRNA) transcripts originate from transcription at putative C proximal upstream regions, indicating a common mechanism for the origin of novel regulatory modules.

RESULTS

Identification of Cs in Seven Angiosperm Species

To systematically identify candidate pseudogenic regions in seven species, Arabidopsis, Brachypodium distachyon, soybean, Medicago truncatula, rice, Populus trichocarpa, and Sorghum bicolor, we used a combination of homology searches and stringent filters to minimize noise and increase positive signals (Figure 1). First, repeat sequences in the intergenic regions were masked using RepeatMasker (RM) to avoid alignment errors. We identified 90,000 to 800,000 intergenic homologous contigs with significant similarity (identity ≥ 20%; match length ≥ 5% of the query sequence) to known proteins from the non-redundant database in the seven taxa. We then examined Cs near the cut-off (length 30 to 107 amino acids; match length coverage ratio, 0.050 to 0.052) and found that eight were WGD-derived Cs and three were syntenic Cs located on the syntenic blocks between P. trichocarpa and Arabidopsis (Supplemental Figure 1). By this method, it is possible that we missed some intergenic regions that resemble protein-coding sequences in repeat regions. After stringent filtering, most of the initial homologous contigs did not remain in the final C data set; these may represent artifacts or may be too diverged to display characteristics of Cs (such as match length ≥ 30 amino acids). The application of these stringent filters retained 5128 to 73,811 putative Cs per species (Figure 2A; Supplemental Data Sets 1 to 7), and 146 to 2524 of the Cs are derived from WGD events (Figure 2A). We also found that 11.6 to 25% of the total C pool have introns; these could be Cs that retained their original intron structure.
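The similarity cut-offs quoted above can be sketched as a simple filter over homology-search hits. This is a minimal illustration only: the `Hit` record, its field names, and the example values are hypothetical, and the full set of filters in the authors' pipeline (Figure 1) is not reproduced here. Only the three thresholds named in the text (identity ≥ 20%, match length ≥ 5% of the query, match length ≥ 30 amino acids) are applied.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One intergenic contig aligned to a known protein (hypothetical record)."""
    contig_id: str
    identity: float     # fraction of identical residues in the alignment (0..1)
    match_length: int   # aligned length, in amino acids
    query_length: int   # full length of the query protein, in amino acids


# Thresholds taken from the text; everything else in this sketch is assumed.
MIN_IDENTITY = 0.20   # identity >= 20%
MIN_COVERAGE = 0.05   # match length >= 5% of the query sequence
MIN_MATCH_AA = 30     # match length >= 30 amino acids


def passes_similarity_cutoffs(hit: Hit) -> bool:
    """Initial homology cut-offs: identity and query coverage."""
    coverage = hit.match_length / hit.query_length
    return hit.identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE


def candidate_pseudogenes(hits):
    """Keep hits that pass the similarity cut-offs and the length filter."""
    return [
        h for h in hits
        if passes_similarity_cutoffs(h) and h.match_length >= MIN_MATCH_AA
    ]


# Hypothetical hits, for illustration only.
example_hits = [
    Hit("contig_a", identity=0.25, match_length=40, query_length=500),   # kept
    Hit("contig_b", identity=0.15, match_length=60, query_length=500),   # identity too low
    Hit("contig_c", identity=0.30, match_length=20, query_length=300),   # match too short
    Hit("contig_d", identity=0.50, match_length=30, query_length=1000),  # coverage too low
]
kept_ids = [h.contig_id for h in candidate_pseudogenes(example_hits)]
```

With these example values only `contig_a` survives, which mirrors the observation that most initial homologous contigs do not remain after filtering.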


We observed a moderate, but not significant, trend of a higher number of Cs in larger genomes (Pearson’s correlation = 0.71, P = 0.07), with the most Cs present in soybean and the fewest in Arabidopsis. S. bicolor is an exception: it has the second largest genome size but a low C number. As expected, we found a strong correlation between the C and protein-coding gene densities in all the seven taxa (Figure 2B). A closer inspection revealed that the distribution of Cs among the chromosomes is also proportional to the chromosome length (Pearson’s correlation > 0.72, P < 0.02) and gene density (Pearson’s correlation > 0.90, P < 0.02; Figure 2B; Supplemental Figure 2).

Among the species examined, the soybean Cs appear to be more fragmented than those in other species (Figure 2A). The soybean lineage has undergone two rounds of WGD within the last 60 million years (Myr), with a recent event (~13 Myr ago) resulting in a highly duplicated genome with nearly 75% of the genes having multiple copies (Schmutz et al., 2010). Thus, the shorter extent of soybean Cs may result from the rapid gene loss that occurred in the early stages of genome reshaping shortly after the recent WGD (Inoue et al., 2015). Consistent with this, we found that gene pairs of WGD blocks containing Cs have peaks with a synonymous substitution rate (Ks) of ~0.13, corresponding to a soybean lineage-specific paleotetraploidization (~13 million years ago; Supplemental Figure 3). The highest alignment coverage of Cs to their closest functional paralogs (FPs) was found in P. trichocarpa, which is known to have a slower evolution rate (Tuskan et al., 2006). We determined the evolution rate of the Cs by estimating the Ks, nonsynonymous substitution rate (Ka), and Ka:Ks ratio between Cs and their FPs (Figure 2C). In general, the majority of the C–FP pairs had Ka:Ks ratios that were much greater than that of functional WGD (FG–FG) pairs. The median Ka:Ks ratio for FG–FG pairs was <0.40, representing selection on FGs. However, large differences in Ka and Ks for both C–FP and FG–FG were detected across the seven species. We detected lower Ks values for soybean FG–FG pairs compared with other species (Wilcox one-tailed test, P < 0.05), which also suggested a recent WGD event. The sole exception to this was P. trichocarpa, which is known to have a slow molecular clock due to long generation times (Tuskan et al., 2006). Also, the large variation of FG–FG Ks values in P. trichocarpa may indicate divergent selection after WGD (Tuskan et al., 2006). The lowest median Ka value was detected in M. truncatula and the highest were detected in P. trichocarpa.
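As a rough illustration of the Ka:Ks comparison described above, the sketch below computes the ratio for each pair and compares the medians of two sets of pairs. It assumes Ka and Ks have already been estimated elsewhere (the estimation from codon alignments is not shown), and the numeric values are hypothetical placeholders, not data from this study.

```python
from statistics import median


def ka_ks(ka: float, ks: float):
    """Return the Ka:Ks ratio, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


def median_ratio(pairs):
    """Median Ka:Ks over (Ka, Ks) pairs, skipping pairs with an undefined ratio."""
    ratios = [r for r in (ka_ks(ka, ks) for ka, ks in pairs) if r is not None]
    return median(ratios)


# Hypothetical (Ka, Ks) values, for illustration only.
c_fp_pairs = [(0.30, 0.40), (0.25, 0.30), (0.50, 0.55)]   # pseudogene vs. closest FP
fg_fg_pairs = [(0.05, 0.40), (0.10, 0.35), (0.08, 0.30)]  # functional WGD pairs

c_fp_median = median_ratio(c_fp_pairs)
fg_fg_median = median_ratio(fg_fg_pairs)
```

A ratio well below 1 is the usual signature of purifying selection on a functional gene, whereas a ratio approaching 1 is what relaxed constraint would produce, which is the contrast the C–FP versus FG–FG comparison draws.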

Asymmetric Elimination Rate of Ancient Full-Length Cs

Next, we inferred the age of the Cs by examining their sequence similarity to their FPs. We observed that most species show a stepwise increase in the number of Cs at similar time points and a stepwise decrease after that time point (Figure 2D; Supplemental Figure 4). By contrast, in B. distachyon, we found a stepwise decrease at most time points. Since Cs are expected to evolve neutrally (Zou et al., 2009), we examined three types of disablements: insertions, deletions, and stop codons in the Cs. The average number of deletions was lower than the number of insertions in all species except M. truncatula (Supplemental Figure 5). Of the three kinds of disablement, we observed a higher density of stop codons in P. trichocarpa, M. truncatula, and soybean and a higher density of insertions in Arabidopsis, B. distachyon, rice, and S. bicolor.

Considering that Cs evolve neutrally, the strength of selection at all sites of the ancient full-length C is expected to be identical. When we examined where the C fragments overlap with their FPs, we did, however, observe an asymmetric elimination rate of the ancient full-length C, with peaks of higher retention at both ends (Supplemental Figure 6). When analyzing data simulating random loss, we observed a much more uniform distribution of sequence elimination (Supplemental Figure 7). Our results thus suggest that the 5′ and 3′ ends were under stronger selection than the middle of the ancient Cs, resulting in small C fragment peaks located at each end.

Dynamic Repertoire of Cs in Different Species

We compared the distribution pattern of Cs with their FPs (Figure 2E; Supplemental Figure 8) and found that on average only 23.3% of genes had C counterparts, resulting in a highly uneven distribution of Cs per gene. By investigating the distribution of paralogs per FP across all seven genomes, we found little overlap between the genes with many paralogs and those with many C counterparts (Figure 2E; Supplemental Figure 8). Interestingly, among all seven species, we found numerous types of FPs that were enriched in Cs and depleted in paralogs and vice versa.

We further assessed the overrepresented or underrepresented function Pfam domains of Cs by examining the annotations of their FPs. In general, defense domain families (leucine-rich repeat, NB-ARC [for nucleotide binding adaptor shared by APAF-1, R proteins, and CED-4]) had significantly overrepresented numbers of Cs, whereas transcription factor–associated domains were underrepresented (Figure 3). Several domains related to secondary metabolism were overrepresented in Cs of P. trichocarpa, B. distachyon, M. truncatula, and G. max. The top C domain family in P. trichocarpa was wound-induced protein WI12 (Supplemental Data Set 8), possibly reflecting the family’s rapid evolution (Ma et al., 2013). Interestingly, M. truncatula, soybean, S. bicolor, Arabidopsis, and rice shared reverse transcriptase as their dominant domain, an indication of the activity of retrotransposons.

Figure 1. The C Identification Pipeline.

The overall procedure for identifying Cs from seven plant species: P. trichocarpa, Arabidopsis, B. distachyon, soybean, M. truncatula, rice (O. sativa subsp japonica), and S. bicolor. E-value, BLAST expect value; match length, alignment length.

To directly compare Cs from different species and identify core and specific families shared by all species, we grouped the Cs into 38,278 families according to the similarity of their FPs (Supplemental Data Set 9). This method detected only 43 core C families (>0.1%) across the seven species, compared with >22% of protein-coding genes and >6% of small RNA primary transcripts (Supplemental Figure 9). Since the two closest species examined were rice and B. distachyon, and any other two species were separated by at least 48 Myr of parallel evolution, we identified only 543 common C families (Supplemental Figure 10). Some of the core C families such as defense genes (leucine-rich repeat) were tandem duplicates, suggesting that these domain families are experiencing repeated gene gain and loss and therefore have higher chances of being present across species.

Recombination Contributes to C Elimination

Figure 2. Identification of Cs in Seven Plant Species.

(A) Number of Cs identified in each genome.
(B) Distribution of genes and the different types of Cs across chromosomes (Chr) in P. trichocarpa, Arabidopsis, and B. distachyon.
(C) Comparison of evolution between pairs of FG–FG (functional WGDs) and C–FP (closest FPs).
(D) Distribution of sequence similarity to their closest FPs (as a function of C age) for P. trichocarpa. CDS, coding sequence.
(E) Comparative distribution of Cs and paralogs per gene in P. trichocarpa.

The majority of Cs are under no selective constraint and are free to accumulate non-gene-like features such as frameshifts and stop codons. Therefore, we wondered whether recombination rates would affect the pattern of C distribution. Deleterious mutations accumulate more easily in regions with suppressed recombination, such as pericentromeric regions, due to Hill–Robertson effects. We therefore expect to observe an enrichment of Cs in regions of low recombination and near the pericentromeric regions. As expected, we found that Cs were relatively enriched in the pericentromeric regions in genomes for which centromere positions were available. In soybean, rice, and Arabidopsis, pericentromeric regions were often associated with low recombination rates (Supplemental Table 1; Fisher’s exact test, P < 0.05). One striking feature of the soybean genome is that 57% of the genomic sequence occurs in pericentromeric regions (Du et al., 2012). Examination of recombination rate and C density in the soybean genome revealed a significant negative correlation (P < 0.03; Pearson’s correlation less than –0.32; Figure 4A), with a negative Spearman’s r in the 0.32 to 0.73 range.
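The recombination analysis can be sketched as two steps: count Cs in fixed-width windows along a chromosome (Figure 4A uses 1-Mb bins) and correlate the per-bin counts with per-bin recombination rates. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the 0-based coordinates, and all numbers are hypothetical.

```python
import math

BIN_SIZE = 1_000_000  # 1-Mb bins, as in Figure 4A


def count_per_bin(positions, chrom_length, bin_size=BIN_SIZE):
    """Count features (0-based start positions) in consecutive fixed-width bins."""
    n_bins = math.ceil(chrom_length / bin_size)
    counts = [0] * n_bins
    for pos in positions:
        # Clamp to the last bin so a position at the chromosome end is still counted.
        counts[min(pos // bin_size, n_bins - 1)] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical 4-Mb chromosome: pseudogene starts cluster in the middle bins,
# where the (equally hypothetical) recombination rate is lowest.
pseudogene_starts = [200_000, 1_100_000, 1_400_000, 1_900_000,
                     2_200_000, 2_500_000, 2_800_000, 3_700_000]
recombination_rate = [4.0, 1.0, 1.0, 4.0]  # one value per 1-Mb bin

density = count_per_bin(pseudogene_starts, chrom_length=4_000_000)
r = pearson(recombination_rate, density)
```

In this toy example the middle bins hold more pseudogenes and have the lower rate, so `r` is negative, which is the direction of the relationship reported for soybean.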

Several studies have reported asymmetric evolution of protein-coding genes between high and low recombination regions (Hamblin and Aquadro, 1996; Du et al., 2012). Our study extended these analyses to compare the evolution between Cs and their FPs within various genomic features. We began by aligning the annotated Cs in the soybean genome to their respective FPs using an empirical codon model and removing low confidence C–FP pairs. Next, the Ka and Ks of each C–FP pair were calculated (Supplemental Data Set 10). We observed a significantly higher Ka for the Cs in chromosomal arms and FPs in pericentromeric regions (P < 0.001; Figure 4B). By contrast, no significant difference in Ka was observed between C–FP pairs when both were located within chromosomal arms or pericentromeric regions (Figure 4B), suggesting that Cs in both regions have experienced similar levels of selective constraints. This suggests a higher mutation rate for Cs in pericentromeric regions, although the median Ka:Ks ratio for Cs in pericentromeric regions was significantly higher than that of Cs in chromosome arms (P < 0.05; Figure 4B). Alternatively, if there are differences in age in different genomic compartments, the pace of evolution could be the same across the genome, but in some regions, such as pericentromeric regions, Cs are simply retained longer, thus accumulating more mutations. Similarly, for FPs in pericentromeric regions, the corresponding Cs in pericentromeric regions displayed significantly higher Ks values. Overall, our findings suggest that genome recombination rates may be an essential contributor to C elimination.

Cis-Regulatory Elements Are Enriched in the Proximal Upstream Regions of Cs

Although many Cs appear to have a high turnover rate and do not encode proteins, some may still produce RNA. To examine the expression of Cs, we reanalyzed the RNA sequencing (RNA-seq) data from six species and acquired strand-specific RNA-seq data for Populus under four abiotic stress treatments. In each species, we used RNA-seq reads from at least three samples (Supplemental Table 2), and all libraries were prepared with poly(A)-selected RNA. Expression was detected for 75.5% (on average) of the protein-coding genes but for only 32.5% (on average) of the Cs (significantly fewer by Fisher’s exact test, P < 2e-16). In Arabidopsis, 0.29 to 0.44 of Cs were expressed in each sample, and in P. trichocarpa 0.02 to 0.11 of Cs were expressed. Across gene expression profiles, the median expression level of Cs was significantly lower than that of protein-coding genes in seven species (Wilcox test, P < 2e-16; Figure 5A). We also found that some Cs showed highly tissue-specific expression (Figure 5C), with the highest median tissue specificity found in M. truncatula. The lowest specificity was detected in B. distachyon, which may be due to its small sample size.

Figure 3. Representative Overrepresented and Underrepresented Pfam Domains of Cs in Seven Plant Species.

Graph shows overrepresented (top) and underrepresented (bottom) Pfam domains, respectively. Colors represent different gene families: orange, defense-related genes; purple, enzymes involved in secondary metabolism; green, kinase; and yellow, transcription factor–associated domains. The degree of shading represents the significance in the Fisher’s exact test (P-value; red, overrepresentation; blue, underrepresentation).

A detailed analysis revealed that the Cs with detectable ex-pression (expressed Cs) tended to have a significantly highersequence identity to their FPs anda significantly lowerKa,Ks, andKa:Ks ratio compared with Cs without detectable expression(nonexpressed Cs) in the seven species (Wilcox test, P # 0.015;Figure 5B; Supplemental Figure 11). This suggests that the ex-pressed Cs may be derived from relatively recent duplicationevents. One possible explanation is that the ancient parental cis-regulatory elements of the expressed Cs have not completelydegenerated. In this case, the expression of the Cs should behighly associatedwith the expression of their FPs. As expected, in

all seven species, Spearman’s correlation coefficient for C–FPpairs was 0.19, on average, a value that is higher than that ofrandomly selected gene pairs or C–C pairs but lower than thatobserved for pairs of WGDs (0.29; Figure 5D).The low expression levels and high tissue specificity ofCs raise

the question of whether cis-regulatory elements are enriched inthe proximal upstream regions (i.e., their promoters) ofCs. To testthis hypothesis, we analyzed the frequency of transcription factorbinding sites (TFBSs). Using a genome-wide set of TFBSs pre-dicted in silico, we found that proximal upstream regions of Cswere more frequently associated with TFBSs compared withrandom intergenic regions (Figure 5E).Many transcriptional units are associated with chromatin-

modifying complexes, and their expression is affected byhistone modifications. We collected published chromatin immu-noprecipitation sequencing (ChIP-seq) data profiling three kinds

Figure 4. Recombination Rate Shapes the Pattern of the Elimination of Cs.

(A) Pearson correlation between recombination rate and the frequency of Cs in the G. max genome. Each chromosome is divided into 1-Mb bins. Therecombination rates in each bin are indicated above the x axis, and the frequency of Cs is indicated below the x axis. GR, genome recombination.(B)Comparison of evolution rates ofCs in chromosomal arms and pericentromeric regions. The statistical analysiswas conducted between each set ofCsby Wilcoxon one-tailed test. An “a” above each column indicates P < 0.001 and “b” indicates P < 0.05.

568 The Plant Cell

Page 7: Evolutionary Origins of Pseudogenes and Their Association with ... · Pseudogenes (Cs), nonfunctional relatives of functional genes, form by duplication or retrotransposition, and

of histonemarks in Arabidopsis (Jin et al., 2017) and rice (He et al.,2010), including two positive marks, acetylated histone 3 lysine 9and trimethylated histone 3 lysine, and one negative mark, tri-methylated histone 3 lysine 27. We reanalyzed the data andcompared the histone modification marks within Cs and pub-lished lncRNA loci by genomic position. In total, ;64.9% ofArabidopsis and 33.1%of riceC proximal upstream regions wereassociated with either positive or negative histone modificationpeaks in selected samples, which is higher than the associationwith histone peaks of lncRNAs (Supplemental Table 3). Further-more, the Populus ChIP-seq data set for transcription factors(including members of class I KNOX, class III HD ZIP, BEL1-likefamilies; Liu et al., 2015) showed that 6.1% of the peaks wereassociated with C proximal upstream regions. Consistent with

this, the frequency of DNase I hypersensitive (DH) peaks, also anessential indicator of cis-regulatory elements (Zhang et al., 2012),was significantly higher than randomly intergenic regions(Supplemental Figure 12). Taken together, these results suggestthatcis-regulatoryelementsareenriched in theproximal upstreamregions of Cs.We next assessed the evolutionary conservation of C ex-

pression patterns. To this end, we first estimated the presence ofshared transcriptional activities across species and found thatCtranscription evolves rapidly. Only ;52.1% of expressed rice C

familieswerealsoexpressed inB.distachyon, andonly;52.4%ofexpressed soybean C families were expressed in M. truncatula(Figure 5F).However,more than74%of theprotein-codinggenesfrom all seven plant species showed conserved expression

Figure 5. Enrichment of Cis-Regulatory Elements in the Proximal Upstream Regions of Cs.

(A) Comparison the maximum expression for protein-coding genes and Cs.(B) Pseudo-protein identity, Ka, and Ks and the ratio between Ka and Ks of C–FP pairs.(C) Tissue specificity of protein-coding genes and Cs.(D) Expression Pearson correlation among different gene sets. Genome duplicates were obtained from the Plant Genome Duplication Database(http://chibba.agtec.uga.edu/duplication/). Error bars indicate 95% confidence intervals generated by 1000 bootstrap replicates.(E)Frequency of in silico–predicted binding sites for different proximal regions (2 kb upstream) of gene set, including genes, oldCs (pseudo-protein identity< 0.8),Cs (totalC set), youngCs (pseudo-protein identity$ 0.8), and random intergenic regions. Error bars indicate 95%confidence intervals generated by1000 bootstrap replicates.(F) Percentage of shared transcription ofC families between species pairs with varying divergence times: 47 Myr, B. distachyon versus O. sativa; 48 Myr,S. bicolor versusO. sativa; 52Myr,G.max versusM. truncatula; 108Myr,P. trichocarpa versusA. thaliana; and149Myr,B.distachyon versusP. trichocarpa.


(Figure 5F). Comparisons of the transcript ratios highlighted this discrepancy in conservation of expression among Cs and protein-coding genes (Supplemental Figure 13).

Most homologous Cs that are conserved in syntenic blocks between Arabidopsis and P. trichocarpa were found to be divergently transcribed (Supplemental Data Set 11). For example, Chr1|20148959-20149332 from Arabidopsis and Chr01|13707000-13707886 from P. trichocarpa are syntenic sequences that are conserved between the two species, yet their transcription is not conserved. Chr1|20148959-20149332 is expressed in four tissues of Arabidopsis but is not expressed in P. trichocarpa (Supplemental Figure 14). Overall, these results indicate that rapid transcriptional evolution is a genuine feature of Cs.

Noncoding RNA Genes Are Associated with mRNA Genes and Cs

To explore the contribution of Cs to the makeup and regulation of noncoding RNAs, we compiled a catalog of lncRNA species by combining published and unpublished lncRNA data for species where extensive lncRNA data sets are available (Supplemental Table 4). For the analysis, each lncRNA in the initial pool was required to have a 5′ end that originated from a genomic site and to be at least 200 nucleotides long. Furthermore, to exclude possible association with TEs, we excluded lncRNAs whose bodies or proximal regions (2 kb upstream and downstream) overlap with TEs by 10 bp or more (Supplemental Data Sets 12 to 16). We refer to the remaining lncRNAs as non–TE lncRNAs.
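The filtering rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: coordinates are taken to be 1-based and inclusive on a single chromosome, and the names `overlap_bp` and `is_non_te_lncrna` are hypothetical helpers, not part of the authors' pipeline.

```python
from typing import List, Tuple

# (start, end): 1-based inclusive coordinates on one chromosome (an assumption)
Interval = Tuple[int, int]


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_non_te_lncrna(lnc: Interval, tes: List[Interval],
                     flank: int = 2000, min_overlap: int = 10,
                     min_len: int = 200) -> bool:
    """Keep an lncRNA only if it is >= min_len nt long and neither its body
    nor its 2-kb flanks overlap any TE by min_overlap bp or more."""
    if lnc[1] - lnc[0] + 1 < min_len:
        return False
    # The body plus both flanks form one contiguous window to test against TEs.
    window = (max(1, lnc[0] - flank), lnc[1] + flank)
    return all(overlap_bp(window, te) < min_overlap for te in tes)
```

For instance, `is_non_te_lncrna((5000, 5300), [(7292, 7400)])` keeps the transcript because the TE overlaps the downstream flank by only 9 bp.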

Previous studies showed that the observed lncRNA species vary due to differences in genome sequence and RNA-seq data quantity and quality, as well as differences in the diversity of samples used for sequencing in different species (Necsulea et al., 2014). Inspection of the position of origin for non–TE lncRNAs revealed that the majority were located closer to genes than to Cs (70.0% on average) and the minority were located closer to Cs than to genes (Figure 6A).

Most of the non–TE lncRNAs that were closer to mRNA genes were found to originate within a 2-kb region surrounding the transcriptional start sites of protein-coding genes (ranging from 52.1% in P. trichocarpa to 70.0% in Arabidopsis), while a smaller fraction (37.9% on average) were more distant (>2 kb) from protein-coding genes (Figure 6A). Further examination of the non–TE lncRNAs that were closer to transcriptional start sites of genes (<2 kb) revealed that the majority (62.0% on average; 31.1% in P. trichocarpa and 95.3% in Arabidopsis) were associated with the promoters of protein-coding genes, suggesting that a surprising fraction of lncRNAs originate from the promoters of mRNA genes. A large number of non–TE lncRNA species were found closer to the 5′ end of Cs than to protein-coding genes (21.1.4% of the total non–TE lncRNA pool on average). A visual inspection of individual genes suggested that many of the lncRNA species are transcribed from the proximal upstream regions of Cs. For example, a 256-bp lncRNA transcript of Arabidopsis originated ~1082 bp upstream of the C locus Chr5:5286482-5286293 and was divergently transcribed in relation to the C (Figure 6B). An analysis of the entire non–TE lncRNA population revealed that 12.5% (Arabidopsis) to 22.1% (M. truncatula) of the total non–TE lncRNA pool was associated with proximal upstream regions of Cs (Figure 6A). For the non–TE lncRNAs that were closer to the Cs (<2 kb), the majority were associated with the proximal regions of Cs in all five species, ranging from the smallest proportion in Arabidopsis (71.8%) to the largest in rice (83.5%).

We randomly selected 30 proximal upstream regions that were associated with non–TE lncRNA loci (found in antisense of Cs) for a transient expression experiment (Supplemental Figure 15; Supplemental Data Sets 17 and 18). The selected sequences were synthesized and cloned to the binary vector 3302Y3 by replacing the cauliflower mosaic virus 35S promoter. Twenty-one yellow fluorescent protein (YFP) signals out of 30 localized to the cell membrane, trichome, and nucleus (Figures 6C to 6F; Supplemental Figure 16). Analysis of the miRNA species suggested that several non–TE miRNAs could also be transcribed from the proximal upstream regions of Cs (Figure 7). Altogether, these results suggest that a substantial fraction of lncRNAs/miRNAs are transcribed from proximal upstream regions of Cs.

DISCUSSION

Cs Have a High Turnover Rate in Plants

In this study, we performed a systematic investigation of Cs using seven representative plant species with highly accurate expression profiles, thereby providing an important resource for future studies. Our results suggest that Cs are highly lineage specific and have a high turnover rate, indicating that a large number of plant Cs appear to be evolving under relaxed selective constraints and therefore tend to be rapidly eliminated during pseudogenization. Indeed, we detected only 43 core C families (>0.1%) across the seven species, which is substantially lower than the numbers for protein-coding genes and miRNAs. The distributions of Cs across the genome have been documented in human, worm, and fly genomes (Sisu et al., 2014). However, these analyses do not provide consistent results, and the accumulation, elimination, and distribution of Cs in local genomic regions have not been comprehensively investigated. Here, we have investigated the distribution of Cs in three plant species and determined that they are uniformly enriched in the centromeric regions.

We further analyzed the genomic recombination data in soybean, and the results show that the distribution of Cs is significantly negatively correlated with the local genomic recombination rate, indicating that Cs are organized along recombinational gradients on chromosomes. Recombination is typically initiated by double-stranded breaks that trigger strand exchange (Schuermann et al., 2005). An increasing body of evidence indicates that recombination plays an essential role in genome evolution by generating mutations (Rattray and Strathern, 2003), increasing microsatellite instability, and contributing to gross chromosomal rearrangements (Pearson et al., 2005). In regions of high recombination, the neutral mutation and rearrangement rates increase, which may help explain the greater rate of loss of Cs in these regions. Cs are thus expected to accumulate in regions of low recombination rates. This study provides an in-depth analysis of the distribution and rates of elimination of Cs in relation to genomic recombination rates in plant genomes. All these findings suggest that genomic recombination is an essential contributor to C elimination.


Using the 90th percentile of the distribution of intron probe intensities as a threshold, multiple Cs are likely expressed (32.5% on average) but at a lower level compared with functional genes. C expression also tends to be spatially and temporally more restricted than that of functional genes. Additionally, expressed Cs appear to have lower Ka:Ks ratios and are more complete, indicating that they are derived from relatively recent duplication events. Indeed, the protein sequence identity between expressed Cs and their FPs is significantly higher than that of other C–FP pairs. The Spearman’s correlation coefficient for pairs of Cs and

Figure 6. Many lncRNAs Are Transcribed from the Proximal Upstream Regions of Protein-Coding Genes and Cs.

(A) Summary of various types and numbers of lncRNA loci in five species, including P. trichocarpa, Arabidopsis, M. truncatula, O. sativa (O. sativa subsp japonica), and S. bicolor.
(B) Example of an lncRNA locus whose 5′ end occurs within 2 kb of the 5′ end of a C (proximal upstream region–associated lncRNA). The x axis represents the linear sequence of genomic DNA, and the y axis represents the total number of RNA-seq mapped reads in root of Arabidopsis. RNA-seq reads that map to lncRNA (blue) and C (red) of genomic DNA are shown separately. The scale is indicated in the top right.
(C) to (H) Subcellular localization of four constructs in tobacco leaf epidermal cells by Agrobacterium-mediated transient expression. The fluorescence signal of the YFP under the putative promoter regions of Chr5|5286482-5287564 (C), Chr2|852936-853952 (D), Chr1|26153967-26154985 (E), and Chr3|16675465-16676590 (F). Fluorescence signal of the negative control, YFP under no promoter (G), and the positive control, fABD2 with YFP fusion under a Cauliflower mosaic virus 35S promoter (H).
(I) Schematic diagram of the constructs. Bar = 10 µm.


their respective FPs is lower than that between WGDs, and at the promoter level, this dynamic includes a loss and gain of TFBSs. Together, the fast-evolving expression patterns, the highly dynamic C families in distinct lineages, and the multiple mechanisms affecting the turnover of Cs suggest that both sequences and regulatory regions of Cs can evolve extremely rapidly between closely related plant species.

Intergenic Noncoding RNAs Are Derived from Divergent Cs

Expressed Cs are unlikely to only represent transcriptional noise, as many Cs exhibit specific expression patterns and are associated with abiotic stress (Zou et al., 2009). In our study, a fraction of Cs are actively transcribed (32.5% on average), and proximal upstream regions of these Cs are enriched in TFBSs, suggesting the proximal upstream regions of many Cs are still active and some Cs may still be active as RNA genes. We further observed extensive divergence of expression patterns between C–FP pairs, suggesting that a vast majority of C–FP pairs have diverged in expression through random degeneration in their cis-regulatory regions. Nonetheless, our collections of expression profiles are by no means complete, and more precise expression data will provide more evidence regarding C expression.

Because the available RNA-seq data were generated with oligo(dT)-based RNA purification, we focused on polyadenylated Cs, which are more stable and abundant than nonpolyadenylated transcripts. However, this data set is missing some types of C transcripts. Therefore, the numbers of expressed Cs are underestimated to some extent. Recent high-throughput efforts to characterize the transcriptomes of eukaryotes have uncovered thousands of lncRNAs (Liu et al., 2012; Qi et al., 2013; Hezroni et al., 2015). lncRNA catalogs are far from exhaustive and also contain false positives (Kapusta et al., 2013), indicating that the complexity of lncRNAs may exceed our current estimates.

The transcriptional control and origin of lncRNAs have been the subject of intense study; yet, most of these investigations have focused on protein-coding genes or transposons (Kapusta et al., 2013; Sigova et al., 2013). A large fraction of lncRNAs are predicted to originate from divergent transcription from promoters of active protein-coding genes based on high-throughput RNA-seq analysis (Sigova et al., 2013). Using strand-specific RNA-seq, we found that many intergenic non–TE lncRNAs and non–TE miRNAs in plant species are divergently transcribed at the proximal upstream regions of Cs in all seven species. Only a few studies have described several non–TE lncRNAs that originated from Cs and have essential roles in development and disease (Milligan and Lipovich, 2015). Our study found that on average, 20.2% of non–TE lncRNAs are located within the 2-kb proximal upstream regions of Cs, and only a minority of these overlap with the C body.

The complexity of plant transcriptomes is further demonstrated by the frequent overlap between different transcript categories or between lncRNAs and other genomic elements. For example, lncRNAs can act as miRNA precursors, or function as miRNA sponges (Tian et al., 2016). The evidence described here reveals that many intergenic non–TE lncRNAs are derived from transcription at the proximal upstream regions of Cs and provides insight into the evolution of novel regulatory modules.

One implication of this finding is that the transcription of lncRNAs undergoes evolutionary dynamics. Large-scale investigations of these data sets have only recently begun and should provide a rich source of information for additional studies into the functions of these noncoding RNA species and the control of their expression. Thus, the strong association of noncoding RNAs with C proximal upstream regions is probably a common characteristic of plant lncRNA repertoires that distinguishes them from those that are derived from genes and transposons. This provides another important mechanism for the origin of noncoding RNAs. Future investigations of lncRNA–C pairs and the lncRNAs described here could provide insights into the contributions of Cs to transcriptomic complexity.

Do Novel Regulatory Sequences Originate De Novo or from Preexisting Regulatory Sequences?

Understanding the genomic origins of transcriptional novelties can provide insight into the construction of the regulatory system

Figure 7. Many Non–TE miRNAs Are Transcribed from the Proximal Upstream Regions of Protein-Coding Genes and Cs. Summary of various types and numbers of non–TE miRNA loci in seven plant species.


and thus into evolutionary biology. Many noncoding RNA species, such as lncRNAs and miRNAs, are transcribed from intergenic regions within the genome (Xie et al., 2017). The poorly conserved profiles and spatio-temporal expression patterns of lncRNAs raise the following question: What are the mechanisms of sequence evolution leading to the rapid formation and loss of regulatory sites?

New patterns of gene expression could be generated by two main mechanisms: de novo evolution and rewriting of the preexisting regulatory information. The second mechanism may fall into three general categories: transposition, promoter switching, and co-option (Rebeiz et al., 2011). Gene regulation is controlled by coordinated binding of transcription factors at the TFBSs in the promoters of genes. In many species, TFBSs tend to occur as homotypic or heterotypic clusters, possessing complicated regulatory motifs (Gupta and Liu, 2005). The stretches of these intergenic regions in the genome often harbor sequences that contain various TFBSs, and such regions could acquire a series of random point mutations, small indels, or TE transfers that subsequently generate functional regulatory sequences. The high frequency of TFBSs at promoters and the expression patterns of lncRNAs suggest that their transcription is actively regulated overall (Necsulea et al., 2014). However, the extent to which regulatory elements occur de novo is unknown, and we are unaware of any empirical examples of their occurrence.

Compared with the de novo mechanism, generating new expression patterns that are founded on preexisting regulatory sequences seems to be more plausible, based on our findings. This mode of regulatory system evolution is supported by several lines of evidence. TEs are currently thought to provide a common route by which regulatory DNA sequences evolve (Hezroni et al., 2015). In the case of pesticide resistance in fruit fly, gene expression is driven by a preexisting TFBS in TE sequences (Daborn et al., 2002). In addition, transcriptional data from embryonic stem cells show that mRNA genes could share regulatory activity with their adjacent lncRNAs (Sigova et al., 2013). Statistical analysis of 346 cis-regulatory modules in fruit fly shows that local sequence duplication is an essential mechanism that transports and produces cis-regulatory information (Nourmohammad and Lässig, 2011). In this study, we found that from 12.5 to 22.1% of the total non–TE lncRNA pool is derived from proximal upstream regions of Cs. Further analyses show that the proximal upstream regions of Cs are more enriched in TFBSs and DH peaks than are random intergenic regions. Consistent with this, a number of plant Cs are likely expressed and show a low expression correlation with their FPs. Studies also indicate that some proximal upstream regions of Cs are highly active and have the potential to contribute to novel transcriptional systems (Scarola et al., 2015; Ma et al., 2016). Thus, it appears that for lncRNAs and miRNAs, evolution rarely produces novelties from scratch but works on the promiscuous activities that existed previously, and this may reflect a general mechanism whereby new transcripts evolve.

METHODS

Data Set for Populus trichocarpa

Populus trichocarpa plants were grown in a greenhouse under a 16-h-light/8-h-dark photoperiod, with light provided by cool white fluorescent lights (at 250 µmol m−2 s−1 photosynthetic photon flux density [PPFD]). For stress treatments, 12 plants obtained from a single genotype were used for chilling stress (three plants; at 4°C for 6 h, 250 µmol m−2 s−1 PPFD), heat stress (three plants; at 42°C for 6 h, 250 µmol m−2 s−1 PPFD), exposure to 150 mM NaCl, 30% polyethylene glycol 6000 (three plants; for 6 h), and drought stress (three plants; at 25°C, 250 µmol m−2 s−1 PPFD, soil moisture content 15% to 20%). Leaves were collected from P. trichocarpa for RNA extraction with different treatments (three biological replicates per treatment). For expression analyses of genes and Cs in different species, filtered transcriptome reads were mapped to the corresponding reference genome using hisat2 (Kim et al., 2015), with parameters -q -x -S -p. Gene and C quantification was determined using StringTie (Pertea et al., 2016), with parameters -e -G. To measure the expression specificity of Cs, the specificity score (Liao and Zhang, 2006) was computed.

TE Annotation

TE annotations used in this study were obtained from the outputs of RM 4.0.6 software (Chen, 2009) with the combined database (Dfam_Consensus-20170127, RepBase-20170127; species parameter: Arabidopsis thaliana [Arabidopsis]; P. trichocarpa: Populus; G. max: Glycine; M. truncatula: Medicago; O. sativa: Oryza; B. distachyon: Brachypodium; S. bicolor: Panicoideae). These RM outputs were filtered to remove non–TE elements (satellites, simple repeats, low complexity, rRNA).

Identification of Cs in the Seven Taxa

The selected taxa, including rosids (Arabidopsis, P. trichocarpa, soybean [Glycine max], and M. truncatula) and monocots (rice [Oryza sativa], B. distachyon, and S. bicolor), were used for C identification. The genome information is provided in Supplemental Table 5. The overall pipeline for identification is outlined in Figure 1 and is generally based on the previous PseudoPipe workflow (Zhang et al., 2006; Zou et al., 2009), with modifications. Generally, the pipeline consisted of five major steps: (1) identify intergenic regions (masked genic and transposon regions) with sequence similarity to known proteins using exonerate; (2) quality control, identity ≥ 20%, match length ≥ 30 amino acids, match length ≥ 5% of the query sequence, and only the best match is retained; (3) link homologous segments into contigs (set I Cs); (4) realign using tfasty to identify features that disrupt contiguous protein sequences; and (5) distinguish WGD-derived Cs and set II Cs.

In the first step, RM-masked genomes were used to mask the genic regions (annotated transcription unit in the genome annotation) and generate a file of intergenic regions. Thus, our following steps of C identification focused on intergenic non–TE regions.

The second step in the annotation pipeline was to identify all regions in the genome that share sequence similarity with any known protein, using exonerate (Slater and Birney, 2005) with parameters --model protein2genome --showquerygff no --showtargetgff yes --maxintron 5000 --showvulgar yes --ryo "%ti\t%qi\t%tS\t%qS\t%tl\t%ql\t%tab\t%tae\t%tal\t%qab\t%qae\t%qal\t%pi\n". In addition to the filters already included in PseudoPipe (overlap > 30 bp between a hit and a functional gene), we did not accept alignments with E-value > 1e-5, identity < 20%, match length < 30 amino acids, and match length (proportion aligned) < 5%. Then, the best match of alignment hits was selected in places where a given chromosomal segment had multiple hits.
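The quality-control thresholds of this step can be illustrated with a short Python sketch. It is not the authors' code: the `Hit` record and the function names are hypothetical, the E-value filter is omitted, and ranking the "best" hit by identity and then aligned length is an assumption, since the text does not state the ranking criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One protein-to-genome alignment (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    query: str
    identity: float   # percent identity, 0-100
    aligned_aa: int   # aligned length in amino acids
    query_len: int    # full length of the query protein


def passes_filters(h: Hit, min_identity: float = 20.0,
                   min_len: int = 30, min_frac: float = 0.05) -> bool:
    """Apply the identity, match-length, and proportion-aligned thresholds."""
    return (h.identity >= min_identity
            and h.aligned_aa >= min_len
            and h.aligned_aa / h.query_len >= min_frac)


def best_hits(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the best-ranked hit wherever hits overlap on a chromosome.
    Ranking by (identity, aligned length) is an assumption of this sketch."""
    ranked = sorted((h for h in hits if passes_filters(h)),
                    key=lambda h: (h.identity, h.aligned_aa), reverse=True)
    kept: List[Hit] = []
    for h in ranked:
        clash = any(k.chrom == h.chrom and h.start <= k.end and k.start <= h.end
                    for k in kept)
        if not clash:
            kept.append(h)
    return sorted(kept, key=lambda h: (h.chrom, h.start))
```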

The third step was to link C contigs based on the distance between the hits on the chromosome (Gc) and the distance on the query protein (Gq). In our workflow, these gaps Gc could arise from low complexity or very decayed regions of the C that were discarded by exonerate. We set this distance to 50 bp.
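A minimal sketch of the linking idea follows, assuming that all hits passed in belong to the same parent protein, chromosome, and strand, and that only the chromosomal gap Gc is checked. How Gq enters the decision is not specified in the text, so it is left out here; `link_hits` is a hypothetical name.

```python
from typing import List, Tuple

# (start, end) chromosomal coordinates of hits to one parent protein,
# all on the same chromosome and strand (an assumption of this sketch)
Span = Tuple[int, int]


def link_hits(hits: List[Span], max_gap: int = 50) -> List[Span]:
    """Merge neighboring hits whose chromosomal gap (Gc) is <= max_gap bp."""
    contigs: List[Span] = []
    for start, end in sorted(hits):
        # Gap = number of bases strictly between the previous contig and this hit.
        if contigs and start - contigs[-1][1] - 1 <= max_gap:
            contigs[-1] = (contigs[-1][0], max(contigs[-1][1], end))
        else:
            contigs.append((start, end))
    return contigs
```

With the 50-bp setting, hits at 100-200 and 240-300 are joined into one contig, whereas a hit starting at 400 begins a new contig.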

In the fourth step, the set I Cs were realigned using a more accurate alignment program, tfasty34, with parameters “-A -m 3 ‘q”. This step produced accurate sequence similarity values and annotated the positions of disablements (frameshifts and stop codons) as well as insertions and deletions.

In the final step, WGD-derived Cs were detected using MCScanX (Wang et al., 2012) based on the DAGchainer algorithm (Haas et al., 2004) with parameters -k 50 -g -1 -s 5 -m 25, and blocks with a minimum of five gene pairs were selected. We used protein pairs from each organism with a BLASTP E-value of <1e-5 and C–FP pairs as the input data when running MCScanX. Pairs of C–FPs in the syntenic block were considered WGD derived.

C Family Identification

Parent protein-coding gene information was downloaded from the Ensembl database (http://plants.ensembl.org/index.html) and Michigan State University Rice Genome Annotation Project and Phytozome (http://phytozome.jgi.doe.gov/pz/portal.html; Supplemental Table 5). First, pairwise sequence similarities between all input protein sequences from selected species were calculated using BLASTP with an E-value cutoff of 1e-5. Markov clustering of the resulting similarity matrix was used to define the gene family, using an inflation value of 1.5. The OrthoMCL clustering results list the gene family members from all plant species. Second, the corresponding Cs can be grouped into OrthoMCL families according to their closest FPs (Supplemental Data Set 9).

We first constructed a phylogenetic tree with the seven species studied. The maximum likelihood phylogenetic tree was generated by RAxML (Stamatakis, 2014) using the PROTGAMMALGF model with 100 bootstrap replicates based on 124 single-copy proteins (Supplemental Data Set 19) that were identified by OrthoMCL (Li et al., 2003). Branch lengths reflect evolutionary divergence times in millions of years as inferred from TimeTree (http://www.timetree.org/). TimeTree assembles the public data from thousands of published studies into a searchable tree of life scaled to time. The median molecular time estimates were selected from this study.

Expression Conservation Analyses

For the qualitative assessment of transcription conservation of Cs, we analyzed the expression ratio of the total shared C families between the two species across different divergence times: 47 Myr, O. sativa versus B. distachyon; 48 Myr, S. bicolor versus rice; 52 Myr, soybean versus M. truncatula; 108 Myr, P. trichocarpa versus Arabidopsis; and 149 Myr, B. distachyon versus P. trichocarpa. In the analysis, one C family was defined to be expressed if at least one member of the family was expressed.
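The family-level rule above can be sketched as follows. The data layout (family name mapped to member identifiers, plus a set of expressed members per species) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def expressed_families(members: Dict[str, List[str]],
                       expressed: Set[str]) -> Set[str]:
    """A family counts as expressed if at least one member is expressed."""
    return {fam for fam, ids in members.items()
            if any(i in expressed for i in ids)}


def percent_shared(fams_a: Dict[str, List[str]], fams_b: Dict[str, List[str]],
                   expr_a: Set[str], expr_b: Set[str]) -> float:
    """Among families present in both species and expressed in species A,
    the percentage that are also expressed in species B."""
    shared = set(fams_a) & set(fams_b)
    in_a = expressed_families(fams_a, expr_a) & shared
    in_b = expressed_families(fams_b, expr_b)
    if not in_a:
        return 0.0
    return 100.0 * len(in_a & in_b) / len(in_a)
```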

To study the coexpression patterns of C–FP, WGDs, random gene pairs, and random C pairs, Spearman correlations of expression levels (fragments per kilobase of exon per million reads mapped values) across different samples were calculated.

Measurement of Expression Specificity

To measure the expression specificity of Cs, the specificity score (Liao and Zhang, 2006) was computed. We let a_ij be the average expression of gene i in tissue/treatment j. Then, the expression specificity of gene i was given by

$$\frac{1}{n-1}\sum_{j=1}^{n}\left(1-\frac{a_{ij}}{\max_{j}(a_{ij})}\right)$$

where n is the number of tissues or treatments. Thus, if a gene was expressed in only one tissue the score was 1, and if the average expression of a gene was the same in all tissues the score was 0.
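As a worked illustration of the score defined above, the following Python sketch computes it for one gene from a list of per-tissue average expression values. The function name and the error handling for unexpressed genes are choices made for this example, not details taken from the study.

```python
from typing import Sequence


def specificity_score(expr: Sequence[float]) -> float:
    """Expression specificity of one gene across n tissues/treatments:
    (1 / (n - 1)) * sum_j (1 - a_ij / max_j a_ij)."""
    n = len(expr)
    if n < 2:
        raise ValueError("need at least two tissues or treatments")
    peak = max(expr)
    if peak <= 0:
        raise ValueError("gene is not expressed in any sample")
    return sum(1 - a / peak for a in expr) / (n - 1)
```

A gene expressed in a single tissue, such as `[5, 0, 0, 0]`, scores 1, whereas uniform expression, such as `[2, 2, 2]`, scores 0.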

Frequency of In Silico–Predicted TFBSs

We used a genome-wide set of transcription factor binding sites of the seven species that were manually curated, nonredundant, and high-quality transcription factor binding motifs derived from experiments (Plant Transcription Factor Database; http://planttfdb.cbi.pku.edu.cn/download.php). Predictions were performed using the MEME package (fimo --oc . --verbosity 1 --thresh 1.0E-5). Average frequency of in silico–predicted binding sites was calculated for different categories of proximal regions (2 kb upstream) of genes, including genes, old Cs (pseudo-protein identity < 0.8), Cs (total C set), young Cs (pseudo-protein identity ≥ 0.8), and random intergenic regions. Frequency of TFBSs refers to the average number of binding sites per fraction of promoters or regions. Error bars indicate 95% confidence intervals generated by 1000 bootstrap replicates.
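The bootstrap confidence intervals mentioned above could be obtained along the following lines. This sketch uses a percentile bootstrap of the mean, which is an assumption: the text does not say which bootstrap variant was used. The function name and the fixed seed are illustrative.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each resample mean.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[round((alpha / 2) * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Here `values` would hold, for example, the number of predicted binding sites in each 2-kb upstream region of one category.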

We examined proximal upstream regions within 2 kb of the annotated start sites or 5′ end for all Cs, genes, and lncRNAs. For analyzing the upstream sequence activity, we used the acetylated histone 3 lysine 9 and trimethylated histone 3 lysine, and one negative mark, trimethylated histone 3 lysine 27, in Arabidopsis and rice. We also analyzed Populus ChIP-seq data for transcription factors (Liu et al., 2015). Regions were labeled as active if ChIP-seq peaks overlapped with them. The frequency of the DH peaks in proximal upstream regions was reanalyzed. The DNase I hypersensitive data of Arabidopsis, B. distachyon, and rice were downloaded from PlantDHS (http://plantdhs.org/). As a control, we also analyzed the frequency of DNase I hypersensitive peaks for 2000 randomly generated intergenic regions.

The Evolution Analyses of Cs

WGD pairs for each organism were detected using MCScanX (Wang et al., 2012), based on the DAGchainer algorithm (Haas et al., 2004). We used protein pairs from each organism with a BLASTP E-value of <1e-5, and blocks with a minimum of five gene pairs were selected. C–FP pairs in the syntenic block were considered WGD derived. To evaluate the level of selective constraint on Cs, we calculated the Ks and Ka between each C and its parent gene in all selected plant species. First, the protein alignments of C–FP were extracted from the pipeline output and regions representing gaps in any of the aligned sequences were removed. Then, the corresponding codon alignment was obtained on the basis of protein alignment using Python scripts. Second, the evolutionary rates were determined using the yn00 program in the PAML program package (Yang, 1997). Pairs with errors or pairs that were too divergent (Ks > 3) were excluded.

Differences in the dynamics of genome evolution make it difficult to directly estimate the age of Cs. The C ages were estimated using the sequence similarity to FPs as an indicator. Thus, older Cs have a lower sequence similarity to their FPs. Three different types of C disabling mutations (insertions, deletions, and stop codons) were extracted from the pipeline and the average defect density per kilobase was calculated for each plant species.

Pfam Domain Analyses of Cs

We annotated all the Cs according to their FPs in the Pfam database (Finn et al., 2014). Fisher’s exact test was used to test whether the annotated Pfam domains were significantly overrepresented or underrepresented.

Genome Recombination Rate and C Density

The soybean genome recombination rate was obtained from a previous study (Du et al., 2012). Each chromosome was subdivided into 1-Mb bins, and C density and the recombination rate in each chromosome were used for analysis of potential correlation. The genome recombination-suppressed pericentromeric regions were defined based on the comparison of soybean physical and genetic maps as previously described.
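The binning and correlation described above might look like the following sketch. The use of a Pearson coefficient is an assumption, because the text does not name the correlation statistic; `density_per_bin` and `pearson` are hypothetical helpers, and positions are taken to be 1-based.

```python
from math import sqrt
from typing import List, Sequence


def density_per_bin(positions: Sequence[int], chrom_len: int,
                    bin_size: int = 1_000_000) -> List[int]:
    """Count features (1-based start positions) in consecutive 1-Mb bins."""
    n_bins = (chrom_len + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        counts[(pos - 1) // bin_size] += 1
    return counts


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)
```

The per-bin counts for a chromosome would then be correlated with the per-bin recombination rates for the same bins.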


Positions of C Overlap with Their FPs

The study was based on the PseudoPipe output; this output provided the alignment position of the C fragments and their FPs. The relative positions of the full-length parental genes were calculated by start/L and end/L, where L is the full length of FPs, start is the alignment start position, and end is the alignment end position. Thus, the density of the C fragments relative to the position of FPs was calculated. For the randomization test, 1000 randomly generated gene fractions from 100 genes of seven species were aligned to their full-length genes, and their positions were plotted. A different volume of simulated data was also generated in this test.
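The start/L and end/L calculation can be illustrated as follows. The binning of the parent gene into equal slices is an assumption added to show how a density profile could be built; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def relative_span(start: int, end: int, parent_len: int) -> Tuple[float, float]:
    """Relative alignment position on the parent: (start / L, end / L)."""
    return start / parent_len, end / parent_len


def coverage_profile(alignments: Sequence[Tuple[int, int, int]],
                     n_bins: int = 10) -> List[int]:
    """For (start, end, L) triples, count how many fragments touch each of
    n_bins equal slices along the parent gene."""
    profile = [0] * n_bins
    for start, end, parent_len in alignments:
        rel_start, rel_end = relative_span(start, end, parent_len)
        first = min(int(rel_start * n_bins), n_bins - 1)
        last = min(int(rel_end * n_bins), n_bins - 1)
        for b in range(first, last + 1):
            profile[b] += 1
    return profile
```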

Association of Non–TE Noncoding RNA and Proximal Upstream Regions of Cs

The positions of non–TE noncoding transcripts (lncRNAs and miRNAs) relative to the proximal upstream regions of Cs (2-kb sequences preceding the 5′ end of each annotated C sequence) were determined. The location of noncoding transcripts was divided into four categories: (1) proximal upstream region–associated lncRNA loci, (2) gene body–associated lncRNA loci, (3) tail-to-tail lncRNA loci, and (4) distant lncRNA loci. Their relative positions were determined using an in-house Python script. The data for the four categories are provided in Supplemental Data Sets 20 to 24.

For expression analyses of genes and Cs in different species, filtered transcriptome reads were mapped to the corresponding reference genome using hisat2 (Kim et al., 2015), with parameters -q -x -S -p. Gene and C quantification was conducted using StringTie (Pertea et al., 2016), with parameters -e -G.

Identification of lncRNA Catalogs

For prediction of lncRNAs in Populus, the clean reads were first aligned to the reference genome using hisat2 v2.0.5 (Pertea et al., 2016) with parameters -q -x -U -p --rna-strandness -S. The mapped reads were used to merge and assemble transcripts using the samtools v1.3.1 sort function (Li et al., 2009) and the cuffcompare package in Cufflinks v2.1.1 (Trapnell et al., 2012) with default parameters. The different sets of lncRNAs, including intergenic, TE-containing, sense, and natural antisense lncRNA transcripts, were identified using Evolinc-I (Nelson et al., 2017) by searching the corresponding repeat database.

Transient Expression in Nicotiana benthamiana

Arabidopsis plants used in this study were the wild-type Columbia ecotype, and genomic DNA of Arabidopsis was extracted for PCR. N. benthamiana plants were grown in a greenhouse at 25°C under long-day conditions (16-h-light/8-h-dark cycle). Four-week-old N. benthamiana plants were used for transient expression experiments. To test the activity of proximal upstream regions of Cs associated with non–TE lncRNAs, 30 randomly selected proximal upstream regions of expressed Arabidopsis Cs were synthesized and cloned into the binary vector 3302Y3 by replacing the cauliflower mosaic virus 35S promoter (GeneRay; Supplemental Data Set 18). Vector 3302Y3 without any promoter was used as a negative control, and a vector with a YFP-fABD2 fusion with the 35S promoter was used as a positive control. Transient expression in N. benthamiana was performed as previously described (Sun et al., 2018). Microscopy analysis was performed for 2 d after infiltration. Fluorescence was observed with an SP5 confocal laser scanning microscope (Leica) and captured with a charge-coupled device camera.

Quantitative RT-PCR Validation of Expression Profiles

Quantitative RT-PCR was performed on a DNA Engine Opticon 2 machine (MJ Research) using the LightCycler FastStart DNA master SYBR Green I kit (Roche). The cDNA template for reactions was reverse transcribed using total RNA extracted from leaves with or without stress treatment. Poplar Actin was used as the internal control for gene expression measurements. The PCR program was as described previously (Zhang et al., 2011). The primers used were Chr1|20148959-20149332F (5′-GTTGTTGGTAACACGACCGC-3′) and Chr1|20148959-20149332R (5′-GTCCGCTCCCATGTTCAAGA-3′) for Arabidopsis and Chr01|13707000-13707886F (5′-TGAGTTTGCCACCACTGGG-3′) and Chr01|13707000-13707886R (5′-ACCTTTCCGGCAGATGGATT-3′) for Populus.

Accession Numbers

Raw data are available for download at the Beijing Institute of Genomics Data Center Genome Sequence Archive under accession number CRA000471. Bioinformatics analysis pipelines, Singularity images, recipe files, and clear instructions are available online at GitHub (https://github.com/bjfupoplar/PlantPseudo.git). Cs identified using the PlantPseudo pipeline are reported in Supplemental Data Sets 1 to 7. Non–TE lncRNA data sets are reported in Supplemental Data Sets 12–16. Associations of non–TE lncRNA and C/gene proximal regions are reported in Supplemental Data Set 20. Poplar Actin sequence is available under the accession number EF145577.

Supplemental Data

Supplemental Figure 1. Some Cs near the cutoff are in syntenic positions between closely related species (P. trichocarpa and Arabidopsis) or in WGD blocks.

Supplemental Figure 2. Distribution of genes and the different types of Cs across chromosomes.

Supplemental Figure 3. Distribution of Ks values between gene pairs on WGD blocks that contain Cs.

Supplemental Figure 4. Distribution of Cs in P. trichocarpa, Arabidopsis, B. distachyon, G. max, M. truncatula, O. sativa (O. sativa subsp. japonica), and S. bicolor as a function of age (sequence similarity to parents).

Supplemental Figure 5. Distribution of disablements in Cs as functions of type and C age.

Supplemental Figure 6. Position of C fragment overlaps with their FPs.

Supplemental Figure 7. Randomly generated gene fractions uniformly overlap with their FPs.

Supplemental Figure 8. Orthologs, paralogs, and families.

Supplemental Figure 9. Dynamic repertoire of Cs in different species.

Supplemental Figure 10. Common C families of the seven plant species. Seven plant species were used in the study.

Supplemental Figure 11. Recently created Cs tend to have higher expression values.

Supplemental Figure 12. Frequency of DNase I hypersensitive peaks.

Supplemental Figure 13. Comparison of the transcript ratio between Cs and protein-coding genes in seven species.

Supplemental Figure 14. Rapid transcriptional evolution of Cs.

Supplemental Figure 15. Position of randomly selected proximal upstream regions associated with non–TE lncRNA loci.

Supplemental Figure 16. Positive transient expression assays of the other 16 proximal upstream sequences.

Supplemental Table 1. Comparison of the Cs located on chromosome arms and centromeres.

Supplemental Table 2. RNA-seq data used in this study.

Supplemental Table 3. Association of histone marks with lncRNAs and Cs.

Supplemental Table 4. lncRNA data used in this study.

Supplemental Table 5. Genomes used in this study.

Supplemental Data Set 1. Cs identified in P. trichocarpa.

Supplemental Data Set 2. Cs identified in Arabidopsis.

Supplemental Data Set 3. Cs identified in B. distachyon.

Supplemental Data Set 4. Cs identified in G. max.

Supplemental Data Set 5. Cs identified in M. truncatula.

Supplemental Data Set 6. Cs identified in O. sativa.

Supplemental Data Set 7. Cs identified in S. bicolor.

Supplemental Data Set 8. Top 30 pfam domains of Cs in seven species.

Supplemental Data Set 9. C families shared by all species.

Supplemental Data Set 10. Values for Ka and Ks between Cs and their functional WGDs.

Supplemental Data Set 11. Syntenic blocks containing C between Arabidopsis and Populus.

Supplemental Data Set 12. Non–TE lncRNA data sets of P. trichocarpa.

Supplemental Data Set 13. Non–TE lncRNA data sets of A. thaliana.

Supplemental Data Set 14. Non–TE lncRNA data sets of B. distachyon.

Supplemental Data Set 15. Non–TE lncRNA data sets of M. truncatula.

Supplemental Data Set 16. Non–TE lncRNA data sets of O. sativa.

Supplemental Data Set 17. Thirty randomly selected proximal upstream regions associated with non–TE lncRNA loci.

Supplemental Data Set 18. Proximal upstream sequences used in the transient expression experiment.

Supplemental Data Set 19. Single-copy genes used in the phylogenetic analysis.

Supplemental Data Set 20. Association of non–TE lncRNA and C/gene proximal regions in P. trichocarpa.

Supplemental Data Set 21. Association of non–TE lncRNA and C/gene proximal regions in Arabidopsis.

Supplemental Data Set 22. Association of non–TE lncRNA and C/gene proximal regions in B. distachyon.

Supplemental Data Set 23. Association of non–TE lncRNA and C/gene proximal regions in M. truncatula.

Supplemental Data Set 24. Association of non–TE lncRNA and C/gene proximal regions in O. sativa.

ACKNOWLEDGMENTS

We thank Ronald R. Sederoff (North Carolina State University) for specific suggestions and detailed comments to improve the manuscript. This work was supported by the State “13.5” Key Research Program of China (2016YFD0600102), the Project of the National Natural Science Foundation of China (31600537 and 31670333), Young Elite Scientists Sponsorship Program by CAST (2018QNRC001), and the Program of Introducing Talents of Discipline to Universities (111 project, B13007).

AUTHOR CONTRIBUTIONS

D.Z. designed the research. J.X. performed the research, analyzed the data, contributed a new computational pipeline, and wrote the paper. X.L. performed the transient expression experiment. Y.L., X.L., Y.Z., and D.Z. revised the manuscript. B.L. and P.K.I. provided valuable suggestions on the manuscript. D.Z. obtained funding and is responsible for this article. All authors read and approved the manuscript.

Received October 11, 2018; revised December 3, 2018; accepted February 12, 2019; published February 13, 2019.

REFERENCES

Balasubramanian, S., Zheng, D., Liu, Y.J., Fang, G., Frankish, A., Carriero, N., Robilotto, R., Cayting, P., and Gerstein, M. (2009). Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes. Genome Biol. 10: R2.

Chen, N. (2009). Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25: 4.10.1–4.10.14.

Daborn, P.J., et al. (2002). A single p450 allele associated with insecticide resistance in Drosophila. Science 297: 2253–2256.

Du, J., Tian, Z., Sui, Y., Zhao, M., Song, Q., Cannon, S.B., Cregan, P., and Ma, J. (2012). Pericentromeric effects shape the patterns of divergence, retention, and expression of duplicated genes in the paleopolyploid soybean. Plant Cell 24: 21–32.

Finn, R.D., et al. (2014). Pfam: The protein families database. Nucleic Acids Res. 42: D222–D230.

Gaut, B.S., Wright, S.I., Rizzon, C., Dvorak, J., and Anderson, L.K. (2007). Recombination: An underappreciated factor in the evolution of plant genomes. Nat. Rev. Genet. 8: 77–84.

Guo, X., Zhang, Z., Gerstein, M.B., and Zheng, D. (2009). Small RNAs originated from pseudogenes: Cis- or trans-acting? PLOS Comput. Biol. 5: e1000449.

Gupta, M., and Liu, J.S. (2005). De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl. Acad. Sci. USA 102: 7079–7084.

Haas, B.J., Delcher, A.L., Wortman, J.R., and Salzberg, S.L. (2004). DAGchainer: A tool for mining segmental genome duplications and synteny. Bioinformatics 20: 3643–3646.

Hamblin, M.T., and Aquadro, C.F. (1996). High nucleotide sequence variation in a region of low recombination in Drosophila simulans is consistent with the background selection model. Mol. Biol. Evol. 13: 1133–1140.

He, G., et al. (2010). Global epigenetic and transcriptional trends among two rice subspecies and their reciprocal hybrids. Plant Cell 22: 17–33.

Hezroni, H., Koppstein, D., Schwartz, M.G., Avrutin, A., Bartel, D.P., and Ulitsky, I. (2015). Principles of long noncoding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell Reports 11: 1110–1122.

Inoue, J., Sato, Y., Sinclair, R., Tsukamoto, K., and Nishida, M. (2015). Rapid genome reshaping by multiple-gene loss after whole-genome duplication in teleost fish suggested by mathematical modeling. Proc. Natl. Acad. Sci. USA 112: 14918–14923.

Jin, J., Tian, F., Yang, D.C., Meng, Y.Q., Kong, L., Luo, J., and Gao, G. (2017). PlantTFDB 4.0: Toward a central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Res. 45: D1040–D1045.

Kapusta, A., Kronenberg, Z., Lynch, V.J., Zhuo, X., Ramsay, L., Bourque, G., Yandell, M., and Feschotte, C. (2013). Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 9: e1003470.

Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12: 357–360.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25: 2078–2079.

Li, L., Stoeckert, C.J., Jr., and Roos, D.S. (2003). OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13: 2178–2189.

Liao, B.Y., and Zhang, J. (2006). Evolutionary conservation of expression profiles between human and mouse orthologous genes. Mol. Biol. Evol. 23: 530–540.

Liu, J., Jung, C., Xu, J., Wang, H., Deng, S., Bernad, L., Arenas-Huertero, C., and Chua, N.H. (2012). Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis. Plant Cell 24: 4333–4345.

Liu, L., Ramsay, T., Zinkgraf, M., Sundell, D., Street, N.R., Filkov, V., and Groover, A. (2015). A resource for characterizing genome-wide binding and putative target genes of transcription factors expressed during secondary growth and wood formation in Populus. Plant J. 82: 887–898.

Ma, T., et al. (2013). Genomic insights into salt adaptation in a desert poplar. Nat. Commun. 4: 2797.

Ma, H.W., Xie, M., Sun, M., Chen, T.Y., Jin, R.R., Ma, T.S., Chen, Q.N., Zhang, E.B., He, X.Z., De, W., and Zhang, Z.H. (2016). The pseudogene derived long noncoding RNA DUXAP8 promotes gastric cancer cell proliferation and migration via epigenetically silencing PLEKHO1 expression. Oncotarget 8: 52211–52224.

Milligan, M.J., and Lipovich, L. (2015). Pseudogene-derived lncRNAs: Emerging regulators of gene expression. Front. Genet. 5: 476.

Necsulea, A., Soumillon, M., Warnefors, M., Liechti, A., Daish, T., Zeller, U., Baker, J.C., Grützner, F., and Kaessmann, H. (2014). The evolution of lncRNA repertoires and expression patterns in tetrapods. Nature 505: 635–640.

Nelson, A.D.L., Devisetty, U.K., Palos, K., Haug-Baltzell, A.K., Lyons, E., and Beilstein, M.A. (2017). Evolinc: A tool for the identification and evolutionary comparison of long intergenic non-coding RNAs. Front. Genet. 8: 52.

Nourmohammad, A., and Lässig, M. (2011). Formation of regulatory modules by local sequence duplication. PLOS Comput. Biol. 7: e1002167.

Pearson, C.E., Nichol Edamura, K., and Cleary, J.D. (2005). Repeat instability: Mechanisms of dynamic mutations. Nat. Rev. Genet. 6: 729–742.

Pertea, M., Kim, D., Pertea, G.M., Leek, J.T., and Salzberg, S.L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11: 1650–1667.

Poliseno, L., Salmena, L., Zhang, J., Carver, B., Haveman, W.J., and Pandolfi, P.P. (2010). A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465: 1033–1038.

Qi, X., Xie, S., Liu, Y., Yi, F., and Yu, J. (2013). Genome-wide annotation of genes and noncoding RNAs of foxtail millet in response to simulated drought stress by deep sequencing. Plant Mol. Biol. 83: 459–473.

Rattray, A.J., and Strathern, J.N. (2003). Error-prone DNA polymerases: When making a mistake is the only way to get ahead. Annu. Rev. Genet. 37: 31–66.

Rebeiz, M., Jikomes, N., Kassner, V.A., and Carroll, S.B. (2011). Evolutionary origin of a novel gene expression pattern through co-option of the latent activities of existing regulatory sequences. Proc. Natl. Acad. Sci. USA 108: 10036–10043.

Scarola, M., Comisso, E., Pascolo, R., Chiaradia, R., Marion, R.M., Schneider, C., Blasco, M.A., Schoeftner, S., and Benetti, R. (2015). Epigenetic silencing of Oct4 by a complex containing SUV39H1 and Oct4 pseudogene lncRNA. Nat. Commun. 6: 7631.

Schmutz, J., et al. (2010). Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183.

Schuermann, D., Molinier, J., Fritsch, O., and Hohn, B. (2005). The dual nature of homologous recombination in plants. Trends Genet. 21: 172–181.

Sigova, A.A., Mullen, A.C., Molinie, B., Gupta, S., Orlando, D.A., Guenther, M.G., Almada, A.E., Lin, C., Sharp, P.A., Giallourakis, C.C., and Young, R.A. (2013). Divergent transcription of long noncoding RNA/mRNA gene pairs in embryonic stem cells. Proc. Natl. Acad. Sci. USA 110: 2876–2881.

Sisu, C., et al. (2014). Comparative analysis of pseudogenes across three phyla. Proc. Natl. Acad. Sci. USA 111: 13361–13366.

Slater, G.S., and Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6: 31.

Stamatakis, A. (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.

Sun, Q., Li, J., Cheng, W., Guo, H., Liu, X., and Gao, H. (2018). AtPAP2, a unique member of the PAP family, functions in the plasma membrane. Genes (Basel) 9: E257.

Tam, O.H., Aravin, A.A., Stein, P., Girard, A., Murchison, E.P., Cheloufi, S., Hodges, E., Anger, M., Sachidanandam, R., Schultz, R.M., and Hannon, G.J. (2008). Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocytes. Nature 453: 534–538.

Tian, J., Song, Y., Du, Q., Yang, X., Ci, D., Chen, J., Xie, J., Li, B., and Zhang, D. (2016). Population genomic analysis of gibberellin-responsive long non-coding RNAs in Populus. J. Exp. Bot. 67: 2467–2482.

Tian, Z., Rizzon, C., Du, J., Zhu, L., Bennetzen, J.L., Jackson, S.A., Gaut, B.S., and Ma, J. (2009). Do genetic recombination and gene density shape the pattern of DNA elimination in rice long terminal repeat retrotransposons? Genome Res. 19: 2221–2230.

Torrents, D., Suyama, M., Zdobnov, E., and Bork, P. (2003). A genome-wide survey of human pseudogenes. Genome Res. 13: 2559–2567.

Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7: 562–578.

Tuskan, G.A., et al. (2006). The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604.

Wang, Y., Tang, H., Debarry, J.D., Tan, X., Li, J., Wang, X., Lee, T.H., Jin, H., Marler, B., Guo, H., Kissinger, J.C., and Paterson, A.H. (2012). MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40: e49.

Watanabe, T., et al. (2008). Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes. Nature 453: 539–543.

Wen, Y.Z., Zheng, L.L., Liao, J.Y., Wang, M.H., Wei, Y., Guo, X.M., Qu, L.H., Ayala, F.J., and Lun, Z.R. (2011). Pseudogene-derived small interference RNAs regulate gene expression in African Trypanosoma brucei. Proc. Natl. Acad. Sci. USA 108: 8345–8350.

Wendel, J.F. (2000). Genome evolution in polyploids. Plant Mol. Biol. 42: 225–249.

Wolfe, K.H. (2001). Yesterday’s polyploids and the mystery of diploidization. Nat. Rev. Genet. 2: 333–341.

Xie, J., Yang, X., Song, Y., Du, Q., Li, Y., Chen, J., and Zhang, D. (2017). Adaptive evolution and functional innovation of Populus-specific recently evolved microRNAs. New Phytol. 213: 206–219.

Yang, Z. (1997). PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.

Zhang, W., Zhang, T., Wu, Y., and Jiang, J. (2012). Genome-wide identification of regulatory DNA elements and protein-binding footprints using signatures of open chromatin in Arabidopsis. Plant Cell 24: 2719–2731.

Zhang, Z., Harrison, P.M., Liu, Y., and Gerstein, M. (2003). Millions of years of evolution preserved: A comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13: 2541–2558.

Zhang, Z., Carriero, N., Zheng, D., Karro, J., Harrison, P.M., and Gerstein, M. (2006). PseudoPipe: An automated pseudogene identification pipeline. Bioinformatics 22: 1437–1439.

Zhang, Z.L., Ogawa, M., Fleet, C.M., Zentella, R., Hu, J., Heo, J.O., Lim, J., Kamiya, Y., Yamaguchi, S., and Sun, T.P. (2011). Scarecrow-like 3 promotes gibberellin signaling by antagonizing master growth repressor DELLA in Arabidopsis. Proc. Natl. Acad. Sci. USA 108: 2160–2165.

Zou, C., Lehti-Shiu, M.D., Thibaud-Nissen, F., Prakash, T., Buell, C.R., and Shiu, S.H. (2009). Evolutionary and expression signatures of pseudogenes in Arabidopsis and rice. Plant Physiol. 151: 3–15.

Jianbo Xie, Ying Li, Xiaomin Liu, Yiyang Zhao, Bailian Li, Pär K. Ingvarsson, and Deqiang Zhang. Evolutionary Origins of Pseudogenes and Their Association with Regulatory Sequences in Plants. Plant Cell 2019;31;563-578; originally published online February 13, 2019; DOI 10.1105/tpc.18.00601

ADVANCING THE SCIENCE OF PLANT BIOLOGY © American Society of Plant Biologists