

The Long Intergenic Noncoding RNA (LincRNA) Landscape of the Soybean Genome 1[OPEN]

Agnieszka A. Golicz, Mohan B. Singh, and Prem L. Bhalla 2

Plant Molecular Biology and Biotechnology Laboratory, Faculty of Veterinary and Agricultural Sciences, University of Melbourne, Parkville, Melbourne, Victoria 3010, Australia

ORCID IDs: 0000-0002-9711-4826 (A.A.G.); 0000-0001-9427-8975 (M.B.S.); 0000-0002-2910-0393 (P.L.B.).

Long intergenic noncoding RNAs (lincRNAs) are emerging as important regulators of diverse biological processes. However, our understanding of lincRNA abundance and function remains very limited, especially for agriculturally important plants. Soybean (Glycine max) is a major legume crop plant providing over a half of global oilseed production. Moreover, soybean can form symbiotic relationships with Rhizobium bacteria to fix atmospheric nitrogen. Soybean has a complex paleopolyploid genome and exhibits many vegetative and floral development complexities. Soybean cultivars have photoperiod requirements restricting its use and productivity. Molecular regulators of these legume-specific developmental processes remain enigmatic. Long noncoding RNAs may play important regulatory roles in soybean growth and development. In this study, over one billion RNA-seq read pairs from 37 samples representing nine tissues were used to discover 6,018 lincRNA loci. The lincRNAs were shorter than protein-coding transcripts and had lower expression levels and more sample-specific expression. Few of the loci were found to be conserved in two other legume species (chickpea [Cicer arietinum] and Medicago truncatula), but almost 200 homeologous lincRNAs in the soybean genome were detected. Protein-coding gene-lincRNA coexpression analysis suggested an involvement of lincRNAs in stress response, signal transduction, and developmental processes. Positional analysis of lincRNA loci implicated involvement in transcriptional regulation. lincRNA expression from centromeric regions was observed especially in actively dividing tissues, suggesting possible roles in cell division. Integration of publicly available genome-wide association data with the lincRNA map of the soybean genome uncovered 23 lincRNAs potentially associated with agronomic traits.

Recently, it has been elucidated that eukaryotic genomes, including plant genomes, encode a multitude of noncoding RNAs (ncRNAs; Chekanova et al., 2007; Kapranov et al., 2007). One class of ncRNAs is the long noncoding RNAs (lncRNAs), which are defined as transcripts >200 bp in length that harbor no discernible coding potential (Jin et al., 2013; Wang et al., 2014a; Chekanova, 2015). The location of lncRNA loci relative to protein-coding genes identifies a further subgroup known as long intergenic noncoding RNAs (lincRNAs), which do not overlap protein-coding genes. lncRNAs were long considered little more than transcriptional noise; however, current evidence points to important roles in diverse biological processes across eukaryotes (van Werven et al., 2012; Ulitsky and Bartel, 2013; Flynn and Chang, 2014). In Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa), lncRNAs have been shown to be involved in flowering time regulation, reproduction, and root organogenesis (Swiezewski et al., 2009; Cifuentes-Rojas et al., 2011; Heo and Sung, 2011; Ariel et al., 2014; Bardou et al., 2014; Matzke et al., 2015; Wang et al., 2014b; Zhang et al., 2014; Berry and Dean, 2015; Khemka et al., 2016). lncRNAs are found in both the nucleus and the cytoplasm, which suggests a diversity of modes of action, including chromatin modification (Heo et al., 2013); acting as decoys that prevent access of regulatory proteins, including splicing machinery, and microRNAs to their true RNA and DNA targets (Franco-Zorrilla et al., 2007; Wu et al., 2013; Bardou et al., 2014); and acting as scaffolds for assembly of larger protein-RNA complexes (Lai et al., 2013; Pefanis et al., 2015). Recently, a large number of lncRNAs have been found to be associated with ribosomes and coexpressed with ribosomal proteins, although not translated, which suggests possible roles in translation regulation (Carlevaro-Fita et al., 2016). Despite increasing efforts (Jin et al., 2013; Szcześniak et al., 2016), plant-specific lncRNA databases are scarce, and genome-wide lncRNA discovery and especially functional annotation in agriculturally important plant species remain unavailable.

Legumes are a large family of plant species characterized by butterfly-like flowers and pod-shaped

1 This work was supported by Australian Research Council Discovery Grant ARC DP0988972 and by Melbourne Bioinformatics at the University of Melbourne (project UOM0033).

2 Address correspondence to [email protected]. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Prem L. Bhalla ([email protected]).

A.A.G. designed the experiments, performed the analysis, and wrote the manuscript; M.B.S. and P.L.B. conceived the research and wrote the manuscript.

[OPEN] Articles can be viewed without a subscription. www.plantphysiol.org/cgi/doi/10.1104/pp.17.01657

Plant Physiology®, March 2018, Vol. 176, pp. 2133–2147, www.plantphysiol.org © 2018 American Society of Plant Biologists. All Rights Reserved.



fruits. They provide an invaluable contribution to ecosystems due to their ability to form symbiotic relationships with Rhizobium bacteria. This symbiosis results in dinitrogen capture from the air and its subsequent fixation, making legumes one of the major sources of bioavailable nitrogen. Legume seeds are the second most important source of human and animal food after cereals and include soybeans (Glycine max), peanuts (Arachis hypogaea), garden peas (Pisum sativum), and broad beans (Vicia faba). Additionally, soybean is responsible for over half of global oilseed production. Due to its economic importance as a source of food and oils, soybean has increasingly become a target of genomic and transcriptomic research efforts. Sequencing of the soybean genome revealed its complex paleopolyploid structure (Schmutz et al., 2010). Although comparisons between soybean and the model plant species Arabidopsis can be drawn, the two species are suggested to have diverged from a common ancestor 92 million years ago (Zhu et al., 2003), and soybean has undergone at least two genome duplication events resulting in homeologous relationships between chromosomes and gene loci (Shoemaker et al., 2006). One of the most interesting questions from the genomics point of view is, “Which genomic features of soybean define its characteristics and are responsible for its vegetative and floral complexities?” Considering that many of the key developmental control genes in soybean exist in multiple copies, a complex interplay and additional control for fine-tuning are expected. Our recent awareness of the prevalence and importance of lncRNAs highlights that these may play important regulatory roles in soybean growth and development. lncRNAs could provide the additional level of control and signal integration that is missing when only protein-coding genes are considered.

This study presents genome-wide discovery, characterization, and functional annotation of lincRNAs in the soybean genome. Genome-wide lincRNA discovery was performed using a combination of de novo and reference-guided assembly approaches, generating a most comprehensive lincRNA database. Comparative analysis between soybean lincRNAs and those of other legume species was performed to identify lincRNAs that could play universal roles in all legumes and lincRNAs that are soybean-specific. Functional analysis was conducted to uncover biological processes that could be influenced by lincRNA action. Finally, publicly available genome-wide association data were used to further characterize the lincRNAs discovered and find potential links to agronomic traits.

RESULTS AND DISCUSSION

Genome-Wide Discovery of 6,018 Long Noncoding Intergenic Loci

LincRNAs are a class of RNA molecules that are >200 bp long and have no discernible coding potential. High-throughput technologies offer an opportunity for both coding and noncoding transcript detection and quantification. In total, 1,025,323,161 read pairs from 37 soybean samples were used in the analysis. The sampled soybean tissues included 28 samples representing stem (germination and trefoil stage), flower (flower bud, unopened flower, florescence, and 5 d after flowering), leaf bud (germination, trefoil, and differentiation stage), leaf (trefoil, flower bud differentiation stage, and senescent leaves), pod (3, 4, and 5 weeks), seed (3, 5, 6, 8, and 10 weeks), seed and pod (2, 3, and 4 weeks), shoot meristem (flower bud differentiation stage), cotyledon (germination and trefoil stage), and root (Shen et al., 2014). Additionally, nine samples (four from leaf tissue and five from shoot apical meristem [SAM] tissue) representing time points during the floral transition period following short-day treatment (Wong et al., 2013) were used. Both de novo and reference-guided transcriptome assembly strategies were applied. StringTie reference-guided assembly resulted in 68,190 loci and 160,337 transcripts. Trinity assemblies rendered 448,338 transcripts using the de novo approach and 337,955 transcripts using the reference-guided approach. The PASA comprehensive transcript database built using StringTie and Trinity assemblies comprised 147,825 loci and 293,537 transcripts. Both StringTie and PASA annotations were subjected to the lincRNA discovery pipeline, and PASA-derived lincRNAs that did not appear in the StringTie annotation were used to supplement StringTie-derived lincRNAs (Supplemental Fig. S1). Loci were considered to encode lincRNAs if they did not produce any protein-coding transcripts (open reading frame [ORF] size ≤100 amino acids and no similarity to protein-coding genes) and did not overlap any protein-coding loci. The lincRNAs were filtered to remove loci producing transcripts with similarity to tRNAs, rRNAs, and snoRNAs (58 loci) found in the Rfam database, transcripts that were nested (entirely contained) within other lincRNAs (63 loci), and transcripts that overlapped protein-coding genes in the Gmax_275_v2.0 genome annotation (126 loci). LincRNAs are known to be expressed at low levels (Li et al., 2014; Zhang et al., 2014; Hao et al., 2015). Choosing an expression cutoff requires balancing a trade-off between retaining the largest possible set of lincRNAs and discarding spurious transcription and mapping artifacts. Two lincRNA sets were generated: a larger set (9,766 loci) with a permissive cutoff of >0.1 fragments per kilobase per million mapped fragments (FPKM) in at least one of the samples (Supplemental Table S2), and a filtered set generated using a more stringent cutoff (≥1.0 FPKM in at least one sample, or ≥0.5 FPKM in at least two samples, or gene size of at least 1,000 bp). The filtered lincRNA set consisted of 6,018 lincRNA loci (6,134 transcripts), including 3,435 StringTie-derived and 2,583 PASA-derived loci (Supplemental Table S3). The full set is provided for the benefit of readers, but only the filtered lincRNA set was used in the analysis.
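The two expression cutoffs described above can be expressed as a small classification rule. The following is a minimal sketch rather than the authors' pipeline: the function name `classify_locus` and the assumption that the filtered set is drawn from loci that already pass the permissive cutoff are illustrative choices, not details stated in the text.

```python
from typing import Iterable


def classify_locus(fpkm_by_sample: Iterable[float], gene_size_bp: int) -> str:
    """Assign a candidate lincRNA locus to a set using the stated cutoffs.

    Hypothetical helper. Assumes (not stated in the text) that the
    stringent criteria are applied to loci passing the permissive cutoff.
    """
    values = list(fpkm_by_sample)

    # Permissive cutoff: > 0.1 FPKM in at least one sample.
    if not any(v > 0.1 for v in values):
        return "rejected"

    # Stringent cutoff: any one of the three criteria is sufficient.
    n_at_least_1 = sum(v >= 1.0 for v in values)
    n_at_least_half = sum(v >= 0.5 for v in values)
    if n_at_least_1 >= 1 or n_at_least_half >= 2 or gene_size_bp >= 1000:
        return "filtered"

    return "permissive_only"


# Toy loci (made-up values, for illustration only).
examples = {
    "high_in_one_sample": classify_locus([1.2, 0.0, 0.0], 300),
    "moderate_in_two": classify_locus([0.6, 0.5], 300),
    "weak_and_short": classify_locus([0.6, 0.2], 300),
    "weak_but_long": classify_locus([0.2, 0.0], 1500),
    "below_permissive": classify_locus([0.05, 0.1], 5000),
}
```

Under these assumptions a short, weakly expressed locus stays in the permissive set only, whereas a locus of at least 1,000 bp is retained even at low expression.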


LincRNAs Have Distinct Properties When Compared to Protein-Coding Genes

The lincRNA and protein-coding loci were examined for main gene characteristics. The lincRNA transcripts were on average shorter than protein-coding transcripts (Fig. 1A). The median length of lincRNA transcripts was 320 bp (mean: 467.3 bp), whereas the median length of protein-coding transcripts was 3,657 bp (mean: 4,450 bp). The lincRNA transcripts contained a lower number of exons than protein-coding transcripts (Fig. 1C). The majority of lincRNA transcripts (90.3%) contained a single exon. The maximum number of exons found in a lincRNA transcript was 4. The lincRNA genes had a lower number of isoforms compared to protein-coding genes (Fig. 1B). A vast majority of lincRNA genes (98.5%) had a single isoform. Finally, lincRNAs showed lower overall expression levels compared to coding genes (Fig. 1D). These observations are consistent with lincRNA studies in other plant species. LincRNAs in rice, cucumber (Cucumis sativus), and chickpea (Cicer arietinum) were reported to be shorter than protein-coding genes (Zhang et al., 2014; Hao et al., 2015; Khemka et al., 2016). LincRNAs in cucumber, maize (Zea mays), and chickpea were reported to have predominantly one exon only (Li et al., 2014; Hao et al., 2015; Khemka et al., 2016). Also, low expression levels of lincRNAs were observed in Arabidopsis, rice, and maize (Liu et al., 2012; Li et al., 2014; Zhang et al., 2014; Hao et al., 2015). Although usually lacking sequence homology (Hao et al., 2015; Mohammadin et al., 2015; Wang et al., 2015a), lincRNAs appear to share similar characteristics across different species that include short length and a low number of exons and splice variants.

Centromeric Regions of Soybean Chromosomes Show LincRNA Expression

The distribution of lincRNAs across chromosomes can provide clues regarding possible functions and mechanisms of action. For example, lincRNAs located among protein-coding genes could modulate expression of their neighbors, while lincRNAs found close to centromeres or in gene deserts may act distally or have additional roles. Centromeric regions of soybean chromosomes are enriched in transposable elements (TEs) and depleted in protein-coding loci (Schmutz et al., 2010). In contrast, lincRNA loci display an even distribution across chromosomes (Fig. 1E), with active transcription from centromeric regions. LincRNAs transcribed from centromeric regions have been implicated in centromere maintenance and cellular division (Rošić and Erhardt, 2016). In total, 32 centromeric lincRNAs (as defined by transcription from regions delimited by GmCent-1 and GmCent-2 repeats) were identified on chromosomes 1, 3, 5, 7, 13, 16, 17, and 19. The number of lincRNAs identified was weakly positively correlated with the identified centromere size (r = 0.25). No centromeric lincRNAs were expressed in all the samples, and the median number of samples showing centromeric lincRNA expression was seven. The median expression value in samples that expressed centromeric lincRNA (FPKM > 0.1) was 0.31 FPKM. Centromeric lincRNAs showed higher transcriptional activity in actively dividing tissues (flower bud, leaf bud, and SAM; Mann-Whitney U test, P value < 0.01; Fig. 2B). The most common transposable element type found within centromeric lincRNAs was the LTR Gypsy retrotransposon (Fig. 2C), which is consistent with the high prevalence of Gypsy transposable elements in the vicinity of centromeres (Schmutz et al., 2010).
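The per-lincRNA summaries and the tissue comparison above can be illustrated with a short sketch. All values below are invented toy data, and the helper names are hypothetical. Only the U statistic is computed here; obtaining a P value would additionally require a null distribution, for example from a statistics library such as SciPy's `scipy.stats.mannwhitneyu`.

```python
from statistics import median

EXPRESSED_FPKM = 0.1  # a sample counts as expressing when FPKM > 0.1


def expressing_samples(fpkm_by_sample):
    """Keep only the samples in which the lincRNA is expressed."""
    return {s: v for s, v in fpkm_by_sample.items() if v > EXPRESSED_FPKM}


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b, with ties counted as 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Toy expression profile of one hypothetical centromeric lincRNA.
toy_profile = {"flower_bud": 0.42, "leaf_bud": 0.31, "SAM": 0.25,
               "root": 0.0, "seed": 0.05}

expressed = expressing_samples(toy_profile)
n_expressing = len(expressed)                    # samples above the cutoff
median_expressed = median(expressed.values())    # median among those samples

# Compare actively dividing tissues with the remaining tissues.
dividing = [toy_profile[t] for t in ("flower_bud", "leaf_bud", "SAM")]
other = [toy_profile[t] for t in ("root", "seed")]
u_dividing = mann_whitney_u(dividing, other)
```

In this toy profile every dividing-tissue value exceeds every other value, so U reaches its maximum of 3 × 2 = 6.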

Although centromeric lincRNA expression was observed, similar to rice and maize (Wang et al., 2015a), the majority of lincRNAs were found relatively close to neighboring protein-coding genes. The median distance from lincRNA to protein-coding gene was 1,064 bp (mean distance: 3,497 bp). LincRNAs found a short distance from protein-coding genes could modulate their expression by actively recruiting activators, repressors, and epigenetic modifiers or simply by transcription from the lincRNA locus (Wang and Chang, 2011; Kornienko et al., 2013).
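A distance of this kind can be computed from interval coordinates alone. The sketch below assumes 1-based closed coordinates and defines distance as the number of bases strictly between a lincRNA and a gene on the same chromosome; both conventions and the function names are assumptions made for illustration, since the text does not specify how distance was measured.

```python
from typing import List, Optional, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 1-based closed


def bases_between(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases strictly between two intervals (0 if they touch or overlap)."""
    return max(0, b_start - a_end - 1, a_start - b_end - 1)


def nearest_gene_distance(linc: Interval, genes: List[Interval]) -> Optional[int]:
    """Distance from a lincRNA to the closest gene on the same chromosome."""
    chrom, start, end = linc
    distances = [bases_between(start, end, g_start, g_end)
                 for g_chrom, g_start, g_end in genes if g_chrom == chrom]
    return min(distances) if distances else None


# Toy annotation (made-up coordinates).
genes = [("chr1", 100, 200), ("chr1", 5000, 6000), ("chr2", 10, 50)]
d_near = nearest_gene_distance(("chr1", 1300, 1600), genes)
d_none = nearest_gene_distance(("chr3", 1, 500), genes)
```

Taking the median and mean of such distances over all lincRNA loci would give summary values comparable to those reported above.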

Nearly a Fifth of lincRNA Transcripts Has Sequence Similarity to Transposable Elements

The relatively high abundance of lincRNAs proximal to centromeres sparked an investigation of the contribution of transposable elements to lincRNA transcript composition. In total, 18.3% of lincRNA transcripts were predicted to harbor TEs, a higher proportion than among coding transcripts (10.8%). Among transcripts that harbored TEs, TEs contributed a larger amount of sequence to lincRNAs (median lincRNA coverage by TEs: 100%, mean: 82.8%) than to protein-coding transcripts (median coding transcript coverage by TEs: 19%, mean: 36.5%). A similar pattern was observed in the human genome, where two-thirds of mature noncoding transcripts showed similarity to TEs and TEs were found to contribute signals essential for biogenesis of many lncRNAs (Kapusta et al., 2013). The lincRNAs were found to harbor more retrotransposons than DNA transposons (Fig. 2A), which reflects the overall TE landscape of the soybean genome (Du et al., 2010; Schmutz et al., 2010).
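Transcript coverage by TEs is the fraction of transcript positions that fall inside at least one TE match, so overlapping matches must be merged before lengths are summed. The following sketch shows one way to do this; the 0-based half-open coordinates in transcript space and the name `te_coverage` are assumptions for illustration, not the method used in the study.

```python
from typing import Iterable, Tuple


def te_coverage(transcript_len: int, te_hits: Iterable[Tuple[int, int]]) -> float:
    """Fraction of a transcript covered by the union of TE hits.

    Hits are (start, end) pairs in transcript coordinates, 0-based half-open,
    with start < end. Hits are clipped to the transcript before merging.
    """
    clipped = sorted(
        (max(0, s), min(transcript_len, e))
        for s, e in te_hits
        if e > 0 and s < transcript_len
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent hit: extend the current block.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / transcript_len


# Toy examples (made-up hits).
full = te_coverage(320, [(0, 200), (150, 320)])        # fully TE-derived
partial = te_coverage(1000, [(100, 200), (150, 290)])  # 190 of 1,000 bp
empty = te_coverage(500, [])
```

Merging first avoids counting the overlap between the two hits twice, which would otherwise overstate coverage.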

Soybean LincRNAs Have Low Levels of Sequence and Positional Conservation in Chickpea and Medicago truncatula

Information about conservation of lincRNAs across species can provide further inputs regarding their possible functions and the processes in which they are involved. If a lincRNA is well conserved in a number of species, it can be assumed to play a generally important role. Conversely, if a lincRNA is species specific, it


Figure 1. Comparison of properties of protein-coding and lincRNA genes. LincRNA genes differ from protein-coding genes with respect to transcript length, number of exons per transcript, number of transcripts per gene, and transcriptional profile. A, Comparison of transcript lengths of coding and noncoding genes. Noncoding genes have shorter transcripts. B, Comparison of the number of transcripts found in coding and noncoding genes. Noncoding genes have fewer isoforms. C, Comparison of the number of exons found in transcripts of coding and noncoding genes. Transcripts of noncoding genes have a lower number of exons. D,


may play a role unique to a given organism or provide a modulatory function that alters the otherwise conserved system. It has been noted that the sequence conservation of lincRNAs is much lower than that of protein-coding genes (Hao et al., 2015; Mohammadin et al., 2015), but higher levels of position-based conservation have been postulated (Mohammadin et al., 2015; Wang et al., 2015a). In total, 6,018 soybean, 2,248 chickpea, 5,794 M. truncatula, and 6,480 Arabidopsis lincRNAs were available for analysis. Reciprocal best BLAST comparison uncovered 143 soybean lincRNAs that have sequence similarity to lincRNAs in other species, with four lincRNAs showing similarity to lincRNAs in both chickpea and M. truncatula. Because different tissue samples and discovery pipelines were used, it is possible that some conserved lincRNA pairs were missed. To address this, soybean lincRNAs were compared against full genome assemblies, which resulted in the discovery of 787 additional loci with sequence similarity to genomes of other species (Fig. 3A). Those could correspond to unannotated noncoding transcripts. However, in the absence of evidence of transcription, their function remains unknown, and those loci were not considered in further analysis.

Positional conservation between lncRNA loci has been suggested to extend across longer evolutionary distances than sequence conservation (Mohammadin et al., 2015; Wang et al., 2015a). A long noncoding RNA is often considered positionally conserved if it is found in the same orientation (upstream or downstream) relative to an orthologous protein-coding gene in at least two species (Mohammadin et al., 2015; Wang et al., 2015a). If the direction of transcription of the lincRNA is known, transcription from the same strand is also required. However, it is conceivable that if a large number of lincRNAs are considered, a number of those will show positional similarity across species (found in the same orientation relative to protein-coding genes) by chance only, rather than as a result of evolutionary conservation. To test this, the number of soybean lincRNAs that had positional similarity with chickpea, M. truncatula, and Arabidopsis lincRNAs was compared with control data sets constructed by random redistribution of lincRNAs across genomes of all four species. Two properties of lincRNA loci were considered while constructing the control data sets: (1) a proportion of lincRNA loci is found in clusters of two or more loci (mirroring this property in the control data sets will result in a more realistic distribution of lincRNAs), and (2) lincRNA loci are enriched proximal to transcription factors (uneven distribution of lincRNAs relative to transcription factors could affect the results if transcription factors are preferentially retained or lost from syntenic regions). To accommodate those, four types of control data sets (5 simulations each, 20 data sets in total) were constructed: (1) random redistribution of lincRNAs relative to protein-coding loci; (2) random redistribution of lincRNAs relative to protein-coding loci, but maintaining the proportion of lincRNAs found adjacent to transcription factors; (3) random redistribution of existing lincRNA clusters relative to protein-coding loci; and (4) random redistribution of existing lincRNA clusters relative to protein-coding loci, but maintaining the proportion of lincRNAs found adjacent to transcription factors. The true biological lincRNA data set and the simulated control data sets were analyzed using the same positional conservation discovery pipeline. The number of positionally similar lincRNAs in the biological data set (1,201) and in simulation data sets 1 and 2 were not significantly different (Fisher’s test, P value > 0.01 for the majority of comparisons; Fig. 3B). However, more positionally similar lincRNAs were found in the biological data set when compared to simulation data sets 3 and 4 (Fisher’s test, P value < 0.01 for all comparisons). Although comparison with simulation data sets 3 and 4 suggests that the positional similarity observed is somewhat higher than expected by chance alone, the difference is not large (Fig. 3B). Results of analyses of positional conservation ought to be interpreted with caution, especially across larger evolutionary distances, and considered in conjunction with sequence similarity and analysis of expression patterns.
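The positional similarity criterion and the simplest control (type 1, random redistribution relative to protein-coding loci) can be sketched as follows. This is an illustrative simplification, not the discovery pipeline used in the study: each lincRNA is reduced to its neighboring gene, its side, and an optional strand relative to that gene, and all names (`LincRNA`, `positionally_similar`, `redistribute`) and gene identifiers are hypothetical.

```python
import random
from typing import Dict, List, NamedTuple, Optional


class LincRNA(NamedTuple):
    name: str
    neighbor_gene: str                     # adjacent protein-coding gene
    side: str                              # "upstream" or "downstream" of it
    relative_strand: Optional[str] = None  # "same"/"opposite", None if unknown


def positionally_similar(query: LincRNA, targets: List[LincRNA],
                         orthologs: Dict[str, str]) -> bool:
    """True if a target lincRNA sits on the same side of the orthologous gene."""
    ortholog = orthologs.get(query.neighbor_gene)
    if ortholog is None:
        return False
    for t in targets:
        if t.neighbor_gene != ortholog or t.side != query.side:
            continue
        # Strand agreement is required only when both strands are known.
        if (query.relative_strand and t.relative_strand
                and query.relative_strand != t.relative_strand):
            continue
        return True
    return False


def count_similar(queries, targets, orthologs) -> int:
    return sum(positionally_similar(q, targets, orthologs) for q in queries)


def redistribute(lincs: List[LincRNA], gene_ids: List[str],
                 rng: random.Random) -> List[LincRNA]:
    """Control type 1: place each lincRNA next to a random gene and side."""
    return [l._replace(neighbor_gene=rng.choice(gene_ids),
                       side=rng.choice(("upstream", "downstream")))
            for l in lincs]


# Toy data with made-up gene identifiers.
orthologs = {"GmA": "MtA", "GmB": "MtB", "GmC": "MtC"}
soy = [LincRNA("s1", "GmA", "upstream", "same"),
       LincRNA("s2", "GmB", "downstream"),
       LincRNA("s3", "GmC", "upstream", "same")]
medicago = [LincRNA("m1", "MtA", "upstream", "same"),
            LincRNA("m2", "MtB", "upstream"),
            LincRNA("m3", "MtC", "upstream", "opposite")]

observed = count_similar(soy, medicago, orthologs)
control = redistribute(soy, list(orthologs), random.Random(0))
expected_by_chance = count_similar(control, medicago, orthologs)
```

Repeating the redistribution many times and comparing the resulting counts with the observed count gives the chance expectation against which positional similarity is judged. Control types 2 to 4 would additionally constrain the redistribution (cluster membership, adjacency to transcription factors), which this sketch omits.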

Strong support for positional conservation of lincRNAs rather than chance positional similarity would be any sequence similarity between transcripts. Comparison of positionally similar transcript pairs uncovered 48 soybean lincRNAs that show positional similarity and sequence similarity with lncRNAs in other species. Sequence comparison of the positionally similar lincRNA pairs in simulated data sets (100 simulated data sets using random redistribution of lincRNA relative to protein-coding loci) showed them to have no sequence similarity (median number of pairs with sequence similarity per data set: 0), suggesting that the sequence similarity observed was not due to chance alone (permutation test, P value < 0.01). Subsequently, the 48 loci were analyzed in more detail. Protein-coding genes and short RNA primary transcripts are known to have

Figure 1. (Continued.) Comparison of log2(FPKM) values of coding and noncoding genes. FPKM values calculated based on counts produced by featureCounts. Noncoding genes show lower expression levels compared to protein-coding genes. E, Plot presenting distribution of protein-coding and lincRNA loci across 20 soybean chromosomes. LincRNA loci are evenly distributed across chromosomes, whereas protein-coding genes show lower density in centromeric regions. Starting from the outer ring: (1) protein-coding genes, (2) all lincRNA genes, (3) non-TE lincRNA loci, and (4) lincRNA loci with transcripts harboring TEs.


higher conservation levels than long noncoding RNAs (Hezroni et al., 2015). Some lncRNAs are known to be sRNA precursors. The 48 putative conserved lincRNAs were inspected to check whether they (1) show similarity to TEs, (2) could encode small conserved peptides that would be missed by the lincRNA discovery pipeline (peptides <100 amino acids and with no similarity to proteins as evaluated by BLASTX), or (3) could be sRNA precursors. They were also compared against the NCBI RefSeq-RNA database to check for similarity with any other known ncRNAs. Only one of the 48 lincRNAs had similarity to TEs (Supplemental Table S5). Three of the 48 lincRNAs had significant similarity to tens of sequences annotated as ncRNAs in the RefSeq database. However, a detailed analysis of the homologous region revealed it to contain a short 25 amino acid ORF encoding the peptide RPL41 (ribosomal protein L41), which was embedded within a much longer transcript. Because of the short length of the peptide, transcripts carrying RPL41 were annotated as lncRNAs by the pipeline used in this study as well as by the NCBI annotation pipeline. Following this discovery, the entire lincRNA data set was reanalyzed to check for the presence of other RPL41 ORFs. However, only five lincRNA loci (including the three conserved ones) carried the RPL41 ORF. This finding does suggest that some of the transcripts classified as lncRNAs based on the discovery algorithm parameters used could, in fact, encode small peptides (Niazi and Valadkhan, 2012; Ruiz-Orera et al., 2014; Nelson et al., 2016). Short of extremely well-conserved examples like RPL41, in the absence of proteomic data, these are impossible to discern. Six of the lincRNAs showed 100% identity to microRNAs, suggesting that they could be precursors of short RNAs. Reanalysis of the whole lincRNA data set suggested that 56 lincRNAs could be microRNA precursors, and the microRNA precursors were overrepresented among the positionally and sequence conserved lincRNAs (Fisher’s test, P value < 0.01). Finally, 19 lincRNAs showed similarity to other lncRNA transcripts in RefSeq, and those represented other species.

Almost 200 Homeologous LincRNA Loci Can Be Traced to a Soybean Lineage-Specific Whole-Genome Duplication That Occurred ~13 Million Years Ago

The soybean genome has a paleopolyploid structure resulting in extensive homeology across chromosomes (Shoemaker et al., 2006). It has undergone two rounds of whole-genome duplication, a more ancient event that occurred ~59 million years ago (MYA) and a soybean lineage-specific paleotetraploidization, which took place ~13 MYA. As a result, the soybean genome is composed of large blocks of homeologous regions (Schmutz et al., 2010). It is possible that, akin to protein-coding loci, homeologous lincRNA loci exist in the soybean genome. Following a procedure similar to the lincRNA positional similarity analysis performed between species, analysis of positional similarity of lincRNA loci within the soybean genome was performed. Again, control data sets 1, 2, 3, and 4 were used to compare the number of positionally similar lincRNA loci found to the number that would be expected by chance alone. The number of positionally similar lincRNA loci in the true biological data set was significantly larger than the number found in any of the control data sets (Fisher’s test, P value < 0.01 for all comparisons; Fig. 3C). The difference between biological and control data sets suggested that at least 200 to 300 lincRNA loci with homeologs in the soybean genome were to be expected. Sequences of the lincRNA pairs with positional similarity within the soybean genome were compared, which allowed identification of 103 pairs of homeologous loci (Supplemental Table S6). Sequence comparison of the positionally similar lincRNA pairs in simulated data sets (100 simulated data sets using random redistribution of lincRNA relative to protein-coding loci) showed them to have no sequence similarity (median number of pairs with sequence similarity per data set: 0), again suggesting that the sequence similarity observed was not due to chance alone (permutation test, P value < 0.01). The number also roughly corresponds to the predictions based on comparison of positional similarity in biological and control data sets.

Figure 2. Transposable element composition of lincRNAs. A, Types of TEs found within lincRNA transcripts and in 50,000 randomly selected regions of soybean genome. TE composition of lincRNAs follows that of the soybean genome. RLG, LTR Gypsy; RLC, LTR Copia; RIu, LINE; RIL, LINE L1; DTT, Tc1-Mariner; DTO, PONG; DTM, Mutator; DTH, PIF-Harbinger; DTC, CACTA; DHH, Helitron; *P value < 0.01, Fisher’s test. B, Expression patterns of lincRNAs located in centromeric regions (n = 32). C, TE composition of centromeric lincRNAs.

2138 Plant Physiol. Vol. 176, 2018

Golicz et al.

www.plantphysiol.orgon October 7, 2020 - Published by Downloaded from Copyright © 2018 American Society of Plant Biologists. All rights reserved.

Figure 3. Conservation of lincRNA loci in chickpea, M. truncatula, and Arabidopsis. A, Number of soybean lincRNA loci showing sequence similarity with lincRNAs or genomes of other species. B, Number of soybean lincRNA loci in biological and control data sets showing positional similarity with lincRNAs in other species. C, Number of soybean lincRNA loci in biological and control data sets showing positional similarity with other lincRNAs in the soybean genome. D, Ks values calculated for protein-coding gene pairs flanking homeologous lincRNA loci and a random selection of homeologous protein-coding gene pairs (n = 444). The distribution of Ks values representing the random selection has two peaks corresponding to two duplication events. The protein pairs flanking homeologous loci mostly represent a single, more recent duplication. E, Correlation of expression between homeologous lincRNA pairs and a random selection of lincRNA pairs (n = 3,000). Homeologous loci have higher levels of coexpression.

The age of homeologous blocks can be established using pairwise synonymous distance (Ks values) of paralogs (Schlueter et al., 2004; Pfeil et al., 2005; Schmutz et al., 2010). In the case of soybean, Ks values of 0.06 to 0.39 correspond to the 13-million-year genome duplication and Ks values of 0.40 to 0.80 to the 59-million-year genome duplication (Schmutz et al., 2010). The vast majority of Ks values of protein-coding gene pairs flanking homeologous lincRNA loci fall within the 0.06 to 0.39 range (Fig. 3D), suggesting a more recent origin resulting from the soybean lineage-specific paleotetraploidization. It is also possible that some homeologous loci representing the ~59 MYA duplication do exist, but sequence divergence prevents their identification. Taken together, results of inter- and intraspecies comparisons suggest that while the lifespan of a soybean lincRNA can exceed 15 million years, it is unlikely to extend over 60 million years.
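As an illustration of how the Ks ranges quoted above partition gene pairs between the two duplication events, a minimal Python sketch follows. The function names (classify_ks, count_by_event) and any input values are hypothetical and are not part of the original analysis; Ks values falling outside both published ranges are simply left unassigned.

```python
from collections import Counter

# Ks ranges for the two soybean duplication events, as quoted in the text
# (Schmutz et al., 2010).
RECENT_RANGE = (0.06, 0.39)   # ~13 MYA, soybean lineage-specific duplication
ANCIENT_RANGE = (0.40, 0.80)  # ~59 MYA, older duplication


def classify_ks(ks):
    """Assign one Ks value to a duplication event, or 'unassigned'."""
    if RECENT_RANGE[0] <= ks <= RECENT_RANGE[1]:
        return "recent"
    if ANCIENT_RANGE[0] <= ks <= ANCIENT_RANGE[1]:
        return "ancient"
    return "unassigned"


def count_by_event(ks_values):
    """Count how many gene pairs fall into each duplication event."""
    return Counter(classify_ks(ks) for ks in ks_values)
```

Applied to the Ks values of gene pairs flanking homeologous lincRNA loci, a tally dominated by the "recent" class would correspond to the pattern reported in Figure 3D.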

Functional enrichment of proteins flanking homeologous loci revealed overrepresentation of genes involved in response to abiotic stimuli, including cellular response to phosphate starvation and response to absence of light (Supplemental Table S7). Finally, the coexpression of homeologous lincRNA loci was significantly higher (Fig. 3E, Mann-Whitney U test, P value < 0.01) when compared to randomly selected lincRNA locus pairs, suggesting at least partial conservation of expression patterns.

The LincRNAs Show Highly Tissue-Specific Expression

Expression of lincRNAs across all tissues was investigated using a combination of a straightforward counting method and the Tau specificity index, which were recently shown to be the most successful methods of expression characterization (Kryuchkova-Mostacci and Robinson-Rechavi, 2017). LincRNAs displayed more tissue-specific expression than protein-coding genes (Fig. 4A). Any given lincRNA was on average expressed in eight samples (median: 6.0), whereas any given protein-coding gene was on average expressed in 23 samples (median: 30). Only 27 lincRNAs were expressed in all the samples. The tissue with the highest number of lincRNAs expressed (FPKM > 0.1) was floral tissue, followed by shoot apical meristem and leaf, suggesting an active role of lincRNAs in flowering and developmental processes. The sample with the highest number of lincRNAs expressed in total and uniquely was flower bud (flower1; 1,891 lincRNAs expressed in total, 51 expressed uniquely; Fig. 4, B and C; Supplemental Table S3). The large number of lincRNAs expressed in the SAM is consistent with previous observations in chickpea and other plants (Khemka et al., 2016). Overall, samples from the same tissue show similar expression patterns (Fig. 4D). Samples representing SAM, leaf, flower, and seed are grouped together. Nine of the samples from two tissues (leaf and SAM) represent the floral transition period following short-day treatment. In total, 366 lincRNAs were uniquely expressed in the floral transition samples, and of these, 363 (99% of all lincRNAs) were expressed following short-day treatment, with 89, 128, and 149 lincRNAs expressed in leaf only, SAM only, and leaf and SAM, respectively. These lincRNAs represent an interesting target for the study of the mechanism of soybean floral transition.

The specificity of lincRNA expression can be better contextualized when compared with different groups of protein-coding genes. The lincRNA tissue expression patterns were compared with expression patterns of protein-coding genes representing different specificity groups (transcription factors, high specificity; protein phosphorylation, medium specificity; translation, low specificity). LincRNAs have higher tissue specificity than any of the protein-coding gene groups, but the expression pattern is closest to that of the transcription factors (Fig. 4E). Transcription factors are known master regulators of gene expression, and the parallels observed can suggest similar roles of lincRNAs. The highly tissue-specific lincRNA expression supports the idea of their highly specialized, possibly regulatory functions. It also allows for the possibility of using lincRNAs as tissue type and state markers.

The LincRNA-Protein-Coding Gene Coexpression Network and Position of lincRNAs Relative to Protein-Coding Neighbors Allow Functional Annotation of Noncoding RNAs

Functional annotation of long noncoding RNAs poses a considerable challenge. In the case of protein-coding genes, often extensive information about the function of a gene in a model organism is available, and sequence homology can be used to transfer existing annotation to newly discovered loci. In the case of lincRNAs, very few functional assignments exist, and lack of sequence homology hampers interspecies comparisons (Rinn and Chang, 2012; Smith and Mattick, 2017). The primary form of annotation involves construction of a coexpression network and use of the so-called “guilt-by-association” method. Correlation of expression between lincRNAs and protein-coding genes can imply involvement in common biological processes. Spearman correlation between expression of lincRNA and protein-coding loci was calculated. Only significant correlations were used in the analysis (P value < 0.05, P value adjusted for multiple comparisons using method “holm”). The resulting distribution of correlation coefficients is presented in Supplemental Figure S2A. The minimum absolute value of correlation coefficient used in the analysis was 0.84. A higher number of positive than negative correlations was observed, and a large number of perfect correlations (r = 1) were observed. A similar observation was made in a human lncRNA annotation project, noting a higher number of positive correlations (Derrien et al., 2012). The high number of perfect correlations was due to high tissue specificity of lincRNA expression. LincRNAs were annotated using a hub-based approach (Liao et al., 2011). Gene Ontology (GO) enrichment analysis of protein-coding first-degree neighbors resulted in functional annotation of 1,574 lincRNAs (Supplemental Table S8). The summary of the GO annotation mapped to GOslim terms is presented in Supplemental Figure S2B. Overall, lincRNAs are annotated with a range of functions including stress response, signal transduction, and DNA methylation. Genes that are specifically or highly expressed in a given tissue are considered likely to contribute to relevant biological processes (Boyle et al., 2017). Clustering of lincRNAs based on their expression across tissues showed that genes that have peak expression in a given tissue are likely to have overall similar expression profiles (Supplemental Fig. S3), implying involvement in common biological processes.

Figure 4. Sample-specific lincRNA expression. A, Comparison of the number of samples showing expression of coding and noncoding genes. Expression of noncoding genes is more sample specific. B, Number of total and unique lincRNA genes expressed in each tissue (samples from different time points for the same tissue were combined). Tissues with the highest total number of lincRNAs expressed and the highest number of uniquely expressed lincRNAs are floral tissue and shoot apical meristem. C, Number of lincRNAs expressed in each sample. The sample with most lincRNAs expressed is flower 1. D, Heat map showing relationships between samples. Samples from the same tissue have similar lincRNA expression profiles and cluster together. The color corresponds to distance calculated as 1-cor(log1p(FPKM)). Clustering was performed using hclust, method = complete. E, Tau expression specificity index calculated for lincRNA loci and three groups of protein-coding genes representing different biological processes. Higher values of Tau correspond to more sample-specific expression. Cotyledon 1, germination stage; cotyledon 2, trefoil stage; flower 1, flower bud differentiation stage; flower 2, flowering stage, bud before flowering; flower 3, flowering stage, florescence; flower 4, flowering stage, 5 d after flowering; flower 5, flowering stage, florescence, different stage; leaf 1, trefoil stage; leaf 2, flower bud differentiation stage; leaf 3, senescent leaves; leaf bud 1, germination stage; leaf bud 2, trefoil stage; leaf bud 3, flower bud differentiation stage; pod seed 1, 2 weeks; pod seed 2, 3 weeks; pod seed 3, 4 weeks; pod 1, 3 weeks; pod 2, 4 weeks; pod 3, 5 weeks; seed 1, 3 weeks; seed 2, 5 weeks; seed 3, 6 weeks; seed 4, 8 weeks; seed 5, 10 weeks; shoot meristem, flower bud differentiation stage; stem 1, germination; stem 2, trefoil stage; root, germination stage; sam sd0, SAM before short-day treatment; sam sd1-4, SAM short days 1 to 4; leaf sd0, leaf before short-day treatment; leaf sd1-3, leaf short days 1 to 3.

The lincRNAs were divided based on the tissue with peak expression (each set contained lincRNAs with peak expression in a given tissue), and GO enrichment was calculated for each of the lincRNA sets (peak expression in cotyledon, SAM, flower, leaf, leaf bud, pod, pod seed, seed, stem, and root; Fig. 5). The enrichment of highly or specifically expressed lincRNA functions correlated well with the tissue-associated biological processes. For example, functionally annotated lincRNAs expressed in SAM, floral tissue, and root were highly enriched with processes associated with regulation of photoperiodism, sexual reproduction, and phloem transport, respectively. The results suggest possible involvement of lincRNAs in tissue-specific biological processes.

Finally, lincRNAs often exert their function on neighboring protein-coding genes; therefore, analysis of overrepresentation of classes of protein-coding genes flanking lincRNA loci provides an additional source of functional annotation. The genes flanking lincRNAs were enriched in functions associated with transcription and development, suggesting possible lincRNA involvement in these processes (Supplemental Table S9).

Figure 5. Top significantly enriched biological processes among lincRNAs that show peak expression in a given tissue. Enrichment calculated using topGO, adjustment for multiple comparisons using method ‘weight’. P value < 0.01.

Several LincRNAs Are Potentially Related to Agronomic Traits

Genome-wide association studies (GWAS) have been successful in uncovering the genetic basis of trait variation and linking causal loci to phenotypic traits. However, only a portion of variants identified by GWA studies can be assigned to protein-coding genes (Sonah et al., 2015; Zhang et al., 2015; Zhou et al., 2015a; Zhou et al., 2015b). Some of the remaining intergenic trait-associated variants can potentially be assigned to lincRNAs and serve as an additional source of functional annotation. In total, 316 single-nucleotide polymorphisms (SNPs) identified as associated with agronomic traits were used in the analysis. A lincRNA was identified as potentially related to a trait if the SNP was found within the lincRNA locus or if the locus was closer to the SNP than any protein-coding gene. In total, 23 lincRNA candidates were identified (Supplemental Fig. S4). Six of the lincRNAs overlapped trait-associated SNPs, and the remainder were found in close proximity (median distance: 981 bp). The putative trait-related lincRNAs are enriched in multiexon loci (Fisher’s test, P value < 0.01). The SNPs proximal to candidate lincRNA loci were related to traits such as number of days to flowering, number of days from flowering to maturity, and number of seeds per pod.

Figure 6. Analysis of potential trait-related lincRNAs. A, For each of the putative trait-related lincRNAs, the plot presents 11 genes found in close vicinity of the trait-associated SNP (1 putative trait-related lincRNA + 5 downstream protein-coding genes + 5 upstream protein-coding genes). Each dot represents a gene (red, lincRNA; blue, protein-coding gene) and is labeled with the sample with peak expression (y axis represents actual expression value). B and C, Expression of lincRNA in soybean samples and its positional ortholog in chickpea. The x axis corresponds to samples. The y axis corresponds to the FPKM values.

Several loci are typically found in the vicinity of a trait-associated SNP, and it is usually not immediately obvious which one may contribute to the trait. Accordingly, although the 23 lincRNAs were found closer to the SNP than any protein-coding gene, it is possible that a more distal coding gene contributes to the trait instead of the lincRNA (an interaction between the lincRNA and a neighboring protein-coding gene is also possible). To add more confidence to the functional predictions, the genomic-position-only-based analysis was supplemented with investigation of expression patterns of neighboring genes. For each of the 23 putative trait-related lincRNAs, the samples with peak expression for the lincRNA as well as five upstream and five downstream protein-coding genes were investigated. The lincRNAs were considered more likely to influence the trait if they showed peak expression in a relevant tissue (for example, a lincRNA associated with days to flowering being highly expressed in shoot apical meristem upon short-day treatment; Supplemental Table S10; Supplemental Fig. S4). As a result, the top six lincRNAs that were found in the vicinity of trait-associated SNPs and showed consistent expression patterns were analyzed in more detail (Fig. 6A). Interestingly, four out of six had positionally similar lincRNAs in other species. Two of them showed expression in similar tissue types across species (NC_GMAXST00018683 and NC_GMAXPA00061260); for the remaining two, expression data in relevant tissue in chickpea and M. truncatula were unavailable. One of the lincRNAs (NC_GMAXST00018683), which overlapped a SNP associated with the number of days to flowering and had peak expression in shoot apical meristem upon short-day treatment, had positional similarity with a lincRNA in chickpea (Fig. 6B). Comparison of expression patterns across samples in soybean and chickpea showed the lincRNAs to be expressed in flower buds and SAM in both species (Fig. 6B). The other lincRNA (NC_GMAXPA00061260) was found 223 bp from a SNP associated with number of seeds per pod and again had a positionally similar lincRNA in chickpea. Both lincRNAs showed peak expression in mature flowers (Fig. 6C). The proximity to trait-associated SNPs, expression in relevant tissues, and conservation of expression patterns across species make them likely candidates for trait-related lincRNAs.

Combination of proximity to trait-associated SNPs and expression profile, as well as interspecies conservation, has been successfully used for functional annotation of lncRNAs in other species, including human, zebra fish, rice, and maize (Ulitsky et al., 2011; Gong et al., 2015; Wang et al., 2015a; Hon et al., 2017). In human, the study incorporating expression and genetic data found that lncRNAs that harbored trait-associated SNPs were also specifically expressed in tissues relevant to the trait, leading the authors to conclude that the lncRNAs are likely functional and play important roles in disease (Hon et al., 2017). Furthermore, the putative functional lncRNAs also exhibited higher levels of conservation (Hon et al., 2017). Similarly, in maize, SNPs associated with leaf morphological traits were significantly enriched in genomic loci encoding maize lincRNAs, leading the authors to suggest roles of lincRNAs in control of agronomic traits (Wang et al., 2015a). Even without the support of GWAS data, lncRNA conservation itself was also found to be indicative of functionality. In zebra fish, lincRNAs selected based on their tissue-specific expression and synteny with mammalian lincRNAs were shown to be important for developmental processes (Ulitsky et al., 2011). Taken together, the availability of evidence from several sources, and earlier studies suggesting that GWAS, expression profile, and conservation evidence are highly indicative of lncRNA functionality, further support the functional predictions.

CONCLUSION

The soybean genome encodes several thousand lincRNAs, and several lincRNAs may be related to agronomic traits. Further investigations of detailed function and regulation, including identification of interacting partners and regulators of the lincRNAs, will elucidate their mechanism of action. This study also provides evidence that the network controlling and implementing biological processes in soybean involves complex interactions between proteins and long and short noncoding RNAs. Furthermore, this study presents a comprehensive atlas of lincRNAs in the soybean genome and paves the way for future research.

MATERIALS AND METHODS

Data

RNA-seq sequence data corresponding to Sequence Read Archive projects SRP020868 and PRJNA238493 were downloaded (full list of accessions can be found in Supplemental Table S1). The soybean (Glycine max) genome assembly (Gmax_275_v2.0) and corresponding annotation (Gmax_275_Wm82.a2.v1) were downloaded from Phytozome v11.

LincRNA Annotation

Reads were mapped to the reference genome using HISAT2 v2.0.5 (Kim et al., 2015; --min-intronlen 20 --max-intronlen 2000). For each accession, transcripts were assembled and subsequently merged using StringTie v1.3.0 (Pertea et al., 2015; --merge -F 0.5 -T 0.5 -G Gmax_275_Wm82.a2.v1.gene_exons.gff3). Reads were also assembled using Trinity v2.3.2 (Grabherr et al., 2011). Both de novo (--seqType fq --max_memory 50G --verbose --normalize_reads --trimmomatic --CPU 16) and reference-guided (--genome_guided_bam --genome_guided_max_intron 10000 --max_memory 50G --verbose --CPU 16; reads trimmed and normalized during the de novo Trinity run were used) assemblies were performed. The resulting StringTie and Trinity assemblies were supplied to PASA (Haas et al., 2003) in order to build a comprehensive transcriptome database using the procedure described in the PASA user guide (http://pasapipeline.github.io/). The aligner used was BLAT and MAX_INTRON_LENGTH was set to 2000. StringTie-only and PASA transcripts were processed in parallel to identify potential lncRNAs (Supplemental Fig. S1). Transcripts >200 bp in length were subjected to ORF discovery using OrfPredictor v3.0 (Min et al., 2005). Transcripts with ORFs >300 bp (100 amino acids) were considered coding. Remaining transcripts were extracted and subjected to a DIAMOND v0.8.25 (Buchfink et al., 2015) BLASTX search (--more-sensitive --evalue 0.01) against the NCBI nr database (obtained on 23.10.2016). Transcripts that had a significant BLASTX match were considered coding. The remaining transcripts fulfilling the three criteria (1) length >200 bp, (2) ORF size ≤300 bp, and (3) no significant BLASTX hit were considered putative lncRNAs. A gene was considered coding if at least one transcript was coding. A gene was considered noncoding if none of the transcripts were coding. The positions of noncoding genes from StringTie and PASA annotations were compared against positions of coding genes in both annotations. If the putative lncRNA gene did not overlap any coding loci, it was considered a lincRNA gene. LincRNA loci from both annotations were merged. If lincRNA loci from both annotations had positional overlap, the StringTie annotation was kept. Finally, reads mapping to genes were counted using Subread v1.5.1 featureCounts (Liao et al., 2014; -p -B -P -d 0 -D 1000) and FPKM values were calculated for each gene (10^9 × fragments mapped to exons / [assigned fragments × total length of exons]). LincRNAs that did not have FPKM values larger than 0.1 in at least one of the samples were discarded.
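The filtering criteria and the FPKM formula described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Transcript record and all function names are hypothetical, the ORF length and BLASTX result are assumed to have been computed already (by OrfPredictor and DIAMOND in the original pipeline), and the overlap check against coding loci is omitted.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 200   # transcripts must be longer than this
MAX_ORF_BP = 300      # ORFs longer than this (100 amino acids) mark a transcript as coding
MIN_FPKM = 0.1        # expression threshold applied per sample


@dataclass
class Transcript:
    """Hypothetical record holding precomputed per-transcript evidence."""
    gene_id: str
    length_bp: int
    longest_orf_bp: int
    has_blastx_hit: bool


def is_coding(t):
    """Coding if the ORF exceeds 300 bp or a significant BLASTX match exists."""
    return t.longest_orf_bp > MAX_ORF_BP or t.has_blastx_hit


def is_putative_lncrna(t):
    """All three criteria: length >200 bp, ORF <=300 bp, no BLASTX hit."""
    return t.length_bp > MIN_LENGTH_BP and not is_coding(t)


def gene_is_noncoding(transcripts):
    """A gene is noncoding only if none of its transcripts are coding."""
    return not any(is_coding(t) for t in transcripts)


def fpkm(exon_fragments, assigned_fragments, exon_length_bp):
    """FPKM = 10^9 * fragments on exons / (assigned fragments * exon length)."""
    return 1e9 * exon_fragments / (assigned_fragments * exon_length_bp)


def passes_expression_filter(fpkm_per_sample):
    """Keep a locus only if FPKM exceeds 0.1 in at least one sample."""
    return any(value > MIN_FPKM for value in fpkm_per_sample)
```

For instance, a gene with 50 fragments on 1,000 bp of exons in a library of 10 million assigned fragments has an FPKM of 5 under this formula.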

LincRNA Functional Annotation

LincRNA functional annotation was performed by building a lincRNA-protein-coding gene coexpression network. Coexpression was measured between identified lincRNA loci and protein-coding loci from the Gmax_275_Wm82.a2.v1 annotation updated by StringTie. FPKM values were used for Spearman correlation calculation. Correlation coefficients and corresponding P values were calculated using the corr.test function of the R package psych. Adjustment for multiple comparisons was performed using method ‘holm’. Only lincRNA-protein-coding gene pairs with P values < 0.05 were retained. All the protein partners were functionally annotated using Blast2GO (Conesa et al., 2005; nr subset corresponding to “Arabidopsis” [porgn] OR “Oryza” [porgn] OR “Sorghum” [porgn] OR “Glycine” [porgn] OR “Medicago” [porgn] OR “Brachypodium” [porgn]). For each of the lincRNAs, all the proteins that were significantly correlated were gathered and GO enrichment of the biological process category was calculated using topGO v2.22.0 (Alexa et al., 2006). All proteins in correlation with lincRNAs were used as background. Adjustment for multiple comparisons was performed using method ‘weight’. GO terms that were significantly enriched were assigned to the corresponding lincRNA as functional annotation (P value cutoff 0.05). The GO terms were mapped to the plant GOslim terms using the Map2Slim option of owltools.
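The two numerical steps of the coexpression analysis, rank correlation and Holm adjustment, can be sketched in plain Python. The study itself used corr.test from the R package; the functions below (average_ranks, spearman, holm_adjust) are hypothetical stand-ins written for illustration only, and the computation of the P values themselves is not shown.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one profile is constant
    return cov / (var_x * var_y) ** 0.5


def holm_adjust(pvalues):
    """Holm step-down adjustment, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted
```

Two genes that are each expressed in only one and the same sample, such as spearman([0, 0, 5, 0], [0, 0, 9, 0]), receive a coefficient of 1, which illustrates why highly tissue-specific expression produces many perfect correlations.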

Sequence-Based Similarity of LincRNAs

Sequence-based similarity of lincRNAs was measured using reciprocal best BLAST (BLAST+ v2.5.0; -task blastn -evalue 1e-3). Best hits were identified by lowest e-value. Coordinates of chickpea (Cicer arietinum) lincRNAs were obtained from Khemka et al. (2016), Medicago truncatula lincRNAs from Wang et al. (2015b), and Arabidopsis (Arabidopsis thaliana) lincRNAs from http://chualab.rockefeller.edu/gbrowse2/homepage.html. The lincRNA sequences were extracted from genome assemblies (chickpea Cicer_arietinum_GA_v1.0, Medicago Mt4.0v1, and Arabidopsis TAIR9). Comparisons against genome sequence were performed using BLAST+ v2.5.0 (-task dc-megablast -evalue 1e-3). In order to remove spurious hits due to the presence of transposable elements or repetitive sequences, lincRNAs that had more than three matches in the genome were excluded. Additionally, the most significant high-scoring pair between lincRNAs and the genome was required to cover at least 10% of the lincRNA.
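A minimal sketch of the reciprocal-best-hit logic and of the genome-hit filter is given below. It assumes BLAST results have already been parsed into (query, subject, e-value) tuples; the function names and the tuple layout are hypothetical, and ties in e-value are resolved arbitrarily by keeping the first hit seen.

```python
def best_hits(hits):
    """Map each query to the subject with the lowest e-value.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _evalue) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (query, subject)
        for query, subject in forward.items()
        if reverse.get(subject) == query
    )


def passes_genome_filter(n_matches, top_hsp_length, lincrna_length,
                         max_matches=3, min_coverage=0.10):
    """Drop repeat-like lincRNAs (more than three genome matches) and
    require the top HSP to cover at least 10% of the lincRNA."""
    if n_matches > max_matches:
        return False
    return top_hsp_length >= min_coverage * lincrna_length
```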

TE Composition of LincRNAs

The soybean TE database was obtained from SoyBase (SoyBase_TE_Fasta.txt). The lincRNA transcripts were compared against the TE database using BLAST+ v2.2.30 (blastn -task megablast -evalue 1e-5). The 50,000 random nonoverlapping intervals that did not overlap lincRNAs were identified in the soybean genome using regioneR (Gel et al., 2016). The corresponding sequences were extracted and compared against the TE database with the same BLAST parameters as for lincRNAs.

Centromeric LincRNA Identification

Centromeres were identified by the presence of two soybean centromere-specific repeats: CentGm-1 and CentGm-2. CentGm-1 and CentGm-2 were compared against the soybean genome (Gmax_275_v2.0) using BLAST+ v2.2.30 (blastn -task megablast). The coordinates of a centromere for a given chromosome corresponded to the first and third quartiles of CentGm-1 and CentGm-2 match coordinates. LincRNAs that fell within centromeres were identified as centromeric lincRNAs.
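The quartile-based definition of a centromere can be sketched with the Python standard library. The text does not state which quartile estimator was used, so the linear-interpolation ("inclusive") method chosen here is an assumption, as are the function names; the input is assumed to be the list of repeat match coordinates on a single chromosome.

```python
from statistics import quantiles


def centromere_interval(match_coordinates):
    """Return (Q1, Q3) of the centromeric repeat match coordinates
    on one chromosome. Requires at least two coordinates."""
    q1, _median, q3 = quantiles(match_coordinates, n=4, method="inclusive")
    return q1, q3


def is_centromeric(start, end, interval):
    """True if a locus lies entirely within the centromere interval."""
    centromere_start, centromere_end = interval
    return centromere_start <= start and end <= centromere_end
```

Using the interquartile range rather than the minimum and maximum match positions makes the interval robust to a few stray repeat matches far from the main repeat array.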

Position-Based Similarity of LincRNAs

Syntenic blocks between genomes of soybean (Gmax_275_v2.0), chickpea (Cicer_arietinum_GA_v1.0), Medicago (Mt4.0v1), and Arabidopsis (TAIR10) were identified using MCScanX (Wang et al., 2012). The syntenic blocks were used to identify positional similarity between soybean lincRNAs and lincRNAs from other species. For each lincRNA, five protein-coding neighbors upstream and downstream were extracted. The neighbors were then compared with collinear blocks identified by MCScanX. The lincRNA was said to belong to a collinear block if at least 3 out of 10 protein-coding neighbors were found in the block. LincRNAs from two species were said to be positionally similar if they belonged to the same collinear block, at least one of the two pairs of flanking protein-coding genes was identified as orthologous, and the lincRNAs shared the same relative position (upstream or downstream) with respect to the orthologous gene/genes. The lincRNA loci that shared positional similarity were compared using BLAST+ v2.5.0 (-task blastn -evalue 1e-3). Comparison against the RefSeq RNA database (downloaded on: 27.06.2017) was also performed with BLAST+ v2.5.0 (-task blastn -evalue 1e-3).
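The positional similarity rules above can be sketched as follows. This is a simplified illustration, not the authors' code: the Locus record, the function names, and the representation of orthologs as a set of (soybean gene, other-species gene) pairs are all assumptions, and comparing upstream with upstream and downstream with downstream neighbors is a simplification that ignores inverted blocks.

```python
from typing import NamedTuple, Optional


class Locus(NamedTuple):
    """Hypothetical lincRNA record with its block and immediate neighbors."""
    block_id: Optional[str]   # collinear block the locus was assigned to
    upstream_gene: str        # nearest upstream protein-coding gene
    downstream_gene: str      # nearest downstream protein-coding gene


def in_collinear_block(neighbors, block_genes, min_shared=3):
    """A lincRNA belongs to a block if at least 3 of its 10 protein-coding
    neighbors (5 upstream, 5 downstream) are members of the block."""
    shared = sum(1 for gene in neighbors if gene in block_genes)
    return shared >= min_shared


def positionally_similar(locus_a, locus_b, orthologs):
    """Same block, and at least one flanking gene pair is orthologous on
    the same side (upstream or downstream) of both lincRNAs."""
    if locus_a.block_id is None or locus_a.block_id != locus_b.block_id:
        return False
    upstream_match = (locus_a.upstream_gene, locus_b.upstream_gene) in orthologs
    downstream_match = (locus_a.downstream_gene, locus_b.downstream_gene) in orthologs
    return upstream_match or downstream_match
```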

Generation of Control Data Sets

The control data sets were generated by assigning existing lincRNAs to new protein-coding neighbors, taken from the pool of all protein-coding genes found in the genome. For data sets 1 and 2, a coordinate-sorted full list of protein-coding genes was shuffled using the Linux shuf function, which generates random permutations, and the first n genes, corresponding to the number of lincRNAs in a given data set, were assigned to existing lincRNAs. The assigned protein-coding gene became the new downstream protein-coding neighbor, and the new lincRNA position was immediately upstream of the protein-coding gene assigned. For data sets 3 and 4, the procedure was similar, but the existing lincRNA clusters were kept together.
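A rough Python equivalent of the shuffle-and-assign step used for control data sets 1 and 2 is sketched below. The original work used the Linux shuf utility; the function name, the seed parameter, and the dictionary output are hypothetical, and the cluster-preserving variant used for data sets 3 and 4 is not covered.

```python
import random


def make_control_assignment(lincrna_ids, protein_coding_genes, seed=None):
    """Assign each lincRNA a randomly drawn protein-coding gene as its
    new downstream neighbor (no gene is used twice)."""
    if len(lincrna_ids) > len(protein_coding_genes):
        raise ValueError("more lincRNAs than protein-coding genes")
    rng = random.Random(seed)
    shuffled = list(protein_coding_genes)
    rng.shuffle(shuffled)                  # random permutation, like `shuf`
    first_n = shuffled[:len(lincrna_ids)]  # first n genes of the permutation
    return dict(zip(lincrna_ids, first_n))
```

Repeating the call with different seeds yields independent control data sets against which the biological number of positionally similar loci can be compared.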

Calculation of Synonymous Substitution Rate

The synonymous substitution rates were computed between pairs of genes identified as homeologous by MCScanX. Proteins were aligned by Clustal Omega v1.2.0 (Sievers et al., 2011). The protein alignments were converted to nucleotide alignments using PAL2NAL v14 (Suyama et al., 2006). The Ks values were calculated using PAML v4.7 (yn00) (Yang, 2007).

Selections of Protein Groups for Comparison of Tissue-Specific Expression with LincRNAs

The protein-coding genes were divided into three categories: genes expressed in no more than 15 samples (high specificity expression pattern), genes expressed in 16 to 35 samples (medium specificity expression pattern), and genes expressed in more than 35 samples (low specificity expression pattern). For each group, GO biological process term enrichment was performed using topGO (Alexa et al., 2006), using all protein-coding genes as background. Adjustment for multiple comparisons was performed using method ‘weight’. For each category, a representative process was chosen (process with the highest number of significant genes among top 10 enriched GO terms). All the genes from a given category annotated with the representative process were gathered and Tau specificity indices were calculated (Yanai et al., 2005).
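For reference, the Tau index of Yanai et al. (2005) for a gene with expression values x_1, ..., x_n across n samples is the sum of (1 - x_i / max(x)) divided by (n - 1); it is 0 for uniform expression and 1 for expression in a single sample. The sketch below implements that definition together with the sample-count grouping described above. The function names are hypothetical, and any transformation of FPKM values before computing Tau (for example a log transform) is not specified in the text and is therefore not applied.

```python
def tau(expression):
    """Tau specificity index for one gene across samples."""
    n = len(expression)
    peak = max(expression) if n else 0
    if n < 2 or peak <= 0:
        raise ValueError("need at least two samples and nonzero expression")
    return sum(1 - value / peak for value in expression) / (n - 1)


def specificity_group(n_samples_expressed):
    """Group a protein-coding gene by the number of samples expressing it."""
    if n_samples_expressed <= 15:
        return "high"
    if n_samples_expressed <= 35:
        return "medium"
    return "low"
```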

Plant Physiol. Vol. 176, 2018 2145

Long Noncoding RNA Landscape of Soybean Genome

Downloaded from www.plantphysiol.org on October 7, 2020. Copyright © 2018 American Society of Plant Biologists. All rights reserved.


Identification of LincRNAs Potentially Related to Agronomic Traits

The positions of SNPs associated with agronomic traits identified by Zhou et al. (2015a), Zhang et al. (2015), Zhou et al. (2015b), Sonah et al. (2015), and Fang et al. (2017) were obtained. Some of the SNPs were originally discovered against an older version of the soybean genome (NCBI accession GCA_000004515.1); therefore, their coordinates were transferred to the Gmax_275_v2.0 genome assembly using the NCBI Remap tool (https://www.ncbi.nlm.nih.gov/genome/tools/remap). A lincRNA was considered potentially related to an agronomic trait if it either harbored a SNP identified in association studies or was closer to a SNP than any protein-coding gene and no further than 10 kb from it.
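The selection rule can be sketched in Python as follows. This is one reading of the criterion rather than the code used in the study: all names are hypothetical, intervals are assumed to be inclusive (start, end) pairs on the same chromosome as the SNPs, and "closer than any protein-coding gene" is interpreted as a strictly smaller distance to the SNP.

```python
def distance_to_interval(pos, start, end):
    """Distance in bp from a position to an inclusive interval (0 if inside)."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end


def is_trait_candidate(lincrna, snp_positions, gene_intervals, max_dist=10_000):
    """Apply the two-part rule to one lincRNA on one chromosome.

    lincrna and each entry of gene_intervals are (start, end) tuples;
    snp_positions are trait-associated SNP coordinates (assumed inputs).
    """
    for snp in snp_positions:
        d_linc = distance_to_interval(snp, *lincrna)
        if d_linc == 0:
            return True   # the lincRNA harbors the SNP
        if d_linc > max_dist:
            continue      # further than 10 kb from this SNP
        d_gene = min(
            (distance_to_interval(snp, s, e) for s, e in gene_intervals),
            default=float("inf"),
        )
        if d_linc < d_gene:
            return True   # nearer to the SNP than every protein-coding gene
    return False
```

For instance, a lincRNA at (1000, 2000) with a SNP at position 5000 qualifies when the nearest protein-coding gene lies at (20000, 21000), but not when a gene lies at (5500, 6000).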

Code and Data Availability

The code used for generation of all the figures can be found at https://github.com/agolicz/lncRNAs-Plots. The data set described in the manuscript can be downloaded from https://osf.io/d7qz2/.

Accession Numbers

Sequence data from this article can be found in the GenBank/EMBL data libraries under accession numbers SRP020868 and PRJNA238493.

Supplemental Data

Supplemental Figure S1. Workflow combining StringTie and PASA annotation to create the final nonredundant set of putative lincRNAs.

Supplemental Figure S2. Functional annotation of lincRNAs.

Supplemental Figure S3. Clustering of lincRNAs based on expression across all tissues.

Supplemental Figure S4. Candidate lincRNAs associated with agronomic traits.

Supplemental Table S1. Sequencing and mapping statistics for the libraries used.

Supplemental Table S2. Full list of putative lincRNA loci.

Supplemental Table S3. List of confident lincRNA loci used in the analysis and their expression.

Supplemental Table S4. Number of conserved lincRNAs identified using different e-value cutoffs.

Supplemental Table S5. List of conserved lincRNA loci.

Supplemental Table S6. List of homeologous lincRNA loci.

Supplemental Table S7. GO enrichment of coding genes flanking homeologous lincRNAs.

Supplemental Table S8. GO annotation of lincRNAs.

Supplemental Table S9. GO enrichment of all coding genes flanking lincRNAs.

Supplemental Table S10. Tissues considered relevant to a given trait.

Received November 20, 2017; accepted December 21, 2017; published December 28, 2017.

LITERATURE CITED

Alexa A, Rahnenführer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22: 1600–1607

Ariel F, Jegu T, Latrasse D, Romero-Barrios N, Christ A, Benhamed M, Crespi M (2014) Noncoding transcription by alternative RNA polymerases dynamically regulates an auxin-driven chromatin loop. Mol Cell 55: 383–396

Bardou F, Ariel F, Simpson CG, Romero-Barrios N, Laporte P, Balzergue S, Brown JWS, Crespi M (2014) Long noncoding RNA modulates alternative splicing regulators in Arabidopsis. Dev Cell 30: 166–176

Berry S, Dean C (2015) Environmental perception and epigenetic memory: mechanistic insight through FLC. Plant J 83: 133–148

Boyle EA, Li YI, Pritchard JK (2017) An expanded view of complex traits: from polygenic to omnigenic. Cell 169: 1177–1186

Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12: 59–60

Carlevaro-Fita J, Rahim A, Guigó R, Vardy LA, Johnson R (2016) Cytoplasmic long noncoding RNAs are frequently bound to and degraded at ribosomes in human cells. RNA 22: 867–882

Chekanova JA (2015) Long non-coding RNAs and their functions in plants. Curr Opin Plant Biol 27: 207–216

Chekanova JA, Gregory BD, Reverdatto SV, Chen H, Kumar R, Hooker T, Yazaki J, Li P, Skiba N, Peng Q, et al (2007) Genome-wide high-resolution mapping of exosome substrates reveals hidden features in the Arabidopsis transcriptome. Cell 131: 1340–1353

Cifuentes-Rojas C, Kannan K, Tseng L, Shippen DE (2011) Two RNA subunits and POT1a are components of Arabidopsis telomerase. Proc Natl Acad Sci USA 108: 73–78

Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–3676

Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, et al (2012) The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 22: 1775–1789

Du J, Grant D, Tian Z, Nelson RT, Zhu L, Shoemaker RC, Ma J (2010) SoyTEdb: a comprehensive database of transposable elements in the soybean genome. BMC Genomics 11: 113

Fang C, Ma Y, Wu S, Liu Z, Wang Z, Yang R, Hu G, Zhou Z, Yu H, Zhang M, et al (2017) Genome-wide association studies dissect the genetic networks underlying agronomical traits in soybean. Genome Biol 18: 161

Flynn RA, Chang HY (2014) Long noncoding RNAs in cell-fate programming and reprogramming. Cell Stem Cell 14: 752–761

Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, Rubio-Somoza I, Leyva A, Weigel D, García JA, Paz-Ares J (2007) Target mimicry provides a new mechanism for regulation of microRNA activity. Nat Genet 39: 1033–1037

Gel B, Díez-Villanueva A, Serra E, Buschbeck M, Peinado MA, Malinverni R (2016) regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics 32: 289–291

Gong J, Liu W, Zhang J, Miao X, Guo A-Y (2015) lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human and mouse. Nucleic Acids Res 43: D181–D186

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al (2011) Trinity: reconstructing a full-length transcriptome without a genome from RNA-seq data. Nat Biotechnol 29: 644–652

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31: 5654–5666

Hao Z, Fan C, Cheng T, Su Y, Wei Q, Li G (2015) Genome-wide identification, characterization and evolutionary analysis of long intergenic noncoding RNAs in cucumber. PLoS One 10: e0121800

Heo JB, Lee YS, Sung S (2013) Epigenetic regulation by long noncoding RNAs in plants. Chromosome Res 21: 685–693

Heo JB, Sung S (2011) Vernalization-mediated epigenetic silencing by a long intronic noncoding RNA. Science 331: 76–79

Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I (2015) Principles of long noncoding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell Reports 11: 1110–1122

Hon C-C, Ramilowski JA, Harshbarger J, Bertin N, Rackham OJL, Gough J, Denisenko E, Schmeier S, Poulsen TM, Severin J, et al (2017) An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543: 199–204

Jin J, Liu J, Wang H, Wong L, Chua NH (2013) PLncDB: plant long non-coding RNA database. Bioinformatics 29: 1068–1071

Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, et al (2007) RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316: 1484–1488

Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, Yandell M, Feschotte C (2013) Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS Genet 9: e1003470


Khemka N, Singh VK, Garg R, Jain M (2016) Genome-wide analysis of long intergenic non-coding RNAs in chickpea and their potential role in flower development. Sci Rep 6: 33297

Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12: 357–360

Kornienko AE, Guenzl PM, Barlow DP, Pauler FM (2013) Gene regulation by the act of long non-coding RNA transcription. BMC Biol 11: 59

Kryuchkova-Mostacci N, Robinson-Rechavi M (2017) A benchmark of gene expression tissue-specificity metrics. Brief Bioinform 18: 205–214

Lai F, Orom UA, Cesaroni M, Beringer M, Taatjes DJ, Blobel GA, Shiekhattar R (2013) Activating RNAs associate with Mediator to enhance chromatin architecture and transcription. Nature 494: 497–501

Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, Chettoor AM, Givan SA, Cole RA, Fowler JE, et al (2014) Genome-wide discovery and characterization of maize long non-coding RNAs. Genome Biol 15: R40

Liao Q, Liu C, Yuan X, Kang S, Miao R, Xiao H, Zhao G, Luo H, Bu D, Zhao H, et al (2011) Large-scale prediction of long non-coding RNA functions in a coding-non-coding gene co-expression network. Nucleic Acids Res 39: 3864–3878

Liao Y, Smyth GK, Shi W (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30: 923–930

Liu J, Jung C, Xu J, Wang H, Deng S, Bernad L, Arenas-Huertero C, Chua N-H (2012) Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis. Plant Cell 24: 4333–4345

Matzke MA, Kanno T, Matzke AJ (2015) RNA-directed DNA methylation: the evolution of a complex epigenetic pathway in flowering plants. Annu Rev Plant Biol 66: 243–267

Min XJ, Butler G, Storms R, Tsang A (2005) OrfPredictor: predicting protein-coding regions in EST-derived sequences. Nucleic Acids Res 33: W677–W680

Mohammadin S, Edger PP, Pires JC, Schranz ME (2015) Positionally-conserved but sequence-diverged: identification of long non-coding RNAs in the Brassicaceae and Cleomaceae. BMC Plant Biol 15: 217

Nelson BR, Makarewich CA, Anderson DM, Winders BR, Troupes CD, Wu F, Reese AL, McAnally JR, Chen X, Kavalali ET, et al (2016) A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCA activity in muscle. Science 351: 271–275

Niazi F, Valadkhan S (2012) Computational analysis of functional long noncoding RNAs reveals lack of peptide-coding capacity and parallels with 3′ UTRs. RNA 18: 825–843

Pefanis E, Wang J, Rothschild G, Lim J, Kazadi D, Sun J, Federation A, Chao J, Elliott O, Liu ZP, et al (2015) RNA exosome-regulated long non-coding RNA transcription controls super-enhancer activity. Cell 161: 774–789

Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33: 290–295

Pfeil BE, Schlueter JA, Shoemaker RC, Doyle JJ (2005) Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families. Syst Biol 54: 441–454

Rinn JL, Chang HY (2012) Genome regulation by long noncoding RNAs. Annu Rev Biochem 81: 145–166

Rošić S, Erhardt S (2016) No longer a nuisance: long non-coding RNAs join CENP-A in epigenetic centromere regulation. Cell Mol Life Sci 73: 1387–1398

Ruiz-Orera J, Messeguer X, Subirana JA, Alba MM (2014) Long non-coding RNAs as a source of new peptides. eLife 3: e03523

Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC (2004) Mining EST databases to resolve evolutionary events in major crop species. Genome 47: 868–876

Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463: 178–183

Shen Y, Zhou Z, Wang Z, Li W, Fang C, Wu M, Ma Y, Liu T, Kong L-A, Peng D-L, Tian Z (2014) Global dissection of alternative splicing in paleopolyploid soybean. Plant Cell 26: 996–1008

Shoemaker RC, Schlueter J, Doyle JJ (2006) Paleopolyploidy and gene duplication in soybean and other legumes. Curr Opin Plant Biol 9: 104–109

Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7: 539

Smith MA, Mattick JS (2017) Structural and functional annotation of long noncoding RNAs. In JM Keith, ed, Bioinformatics: Volume II: Structure, Function, and Applications. Springer, New York, pp 65–85

Sonah H, O’Donoughue L, Cober E, Rajcan I, Belzile F (2015) Identification of loci governing eight agronomic traits using a GBS-GWAS approach and validation by QTL mapping in soya bean. Plant Biotechnol J 13: 211–221

Suyama M, Torrents D, Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34: W609–W612

Swiezewski S, Liu F, Magusin A, Dean C (2009) Cold-induced silencing by long antisense transcripts of an Arabidopsis Polycomb target. Nature 462: 799–802

Szcześniak MW, Rosikiewicz W, Makałowska I (2016) CANTATAdb: A collection of plant long non-coding RNAs. Plant Cell Physiol 57: e8

Ulitsky I, Bartel DP (2013) lincRNAs: genomics, evolution, and mechanisms. Cell 154: 26–46

Ulitsky I, Shkumatava A, Jan CH, Sive H, Bartel DP (2011) Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution. Cell 147: 1537–1550

van Werven FJ, Neuert G, Hendrick N, Lardenois A, Buratowski S, van Oudenaarden A, Primig M, Amon A (2012) Transcription of two long noncoding RNAs mediates mating-type control of gametogenesis in budding yeast. Cell 150: 1170–1181

Wang H, Chung PJ, Liu J, Jang IC, Kean MJ, Xu J, Chua NH (2014a) Genome-wide identification of long noncoding natural antisense transcripts and their responses to light in Arabidopsis. Genome Res 24: 444–453

Wang H, Niu QW, Wu HW, Liu J, Ye J, Yu N, Chua NH (2015a) Analysis of non-coding transcriptome in rice and maize uncovers roles of conserved lncRNAs associated with agriculture traits. Plant J 84: 404–416

Wang KC, Chang HY (2011) Molecular mechanisms of long noncoding RNAs. Mol Cell 43: 904–914

Wang T-Z, Liu M, Zhao M-G, Chen R, Zhang W-H (2015b) Identification and characterization of long non-coding RNAs involved in osmotic and salt stress in Medicago truncatula using genome-wide high-throughput sequencing. BMC Plant Biol 15: 131

Wang Y, Fan X, Lin F, He G, Terzaghi W, Zhu D, Deng XW (2014b) Arabidopsis noncoding RNA mediates control of photomorphogenesis by red light. Proc Natl Acad Sci USA 111: 10359–10364

Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, Lee TH, Jin H, Marler B, Guo H, Kissinger JC, Paterson AH (2012) MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res 40: e49

Wong CE, Singh MB, Bhalla PL (2013) The dynamics of soybean leaf and shoot apical meristem transcriptome undergoing floral initiation process. PLoS One 8: e65319

Wu HJ, Wang ZM, Wang M, Wang XJ (2013) Widespread long noncoding RNAs as endogenous target mimics for microRNAs in plants. Plant Physiol 161: 1875–1884

Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, Lancet D, Shmueli O (2005) Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21: 650–659

Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24: 1586–1591

Zhang J, Song Q, Cregan PB, Nelson RL, Wang X, Wu J, Jiang G-L (2015) Genome-wide association study for flowering time, maturity dates and plant height in early maturing soybean (Glycine max) germplasm. BMC Genomics 16: 217

Zhang YC, Liao JY, Li ZY, Yu Y, Zhang JP, Li QF, Qu LH, Shu WS, Chen YQ (2014) Genome-wide screening and functional analysis identify a large number of long noncoding RNAs involved in the sexual reproduction of rice. Genome Biol 15: 512

Zhou L, Wang S-B, Jian J, Geng Q-C, Wen J, Song Q, Wu Z, Li G-J, Liu Y-Q, Dunwell JM, et al (2015a) Identification of domestication-related loci associated with flowering time and seed size in soybean with the RAD-seq genotyping method. Sci Rep 5: 9350

Zhou Z, Jiang Y, Wang Z, Gou Z, Lyu J, Li W, Yu Y, Shu L, Zhao Y, Ma Y, et al (2015b) Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat Biotechnol 33: 408–414

Zhu YL, Song QJ, Hyten DL, Van Tassell CP, Matukumalli LK, Grimm DR, Hyatt SM, Fickus EW, Young ND, Cregan PB (2003) Single-nucleotide polymorphisms in soybean. Genetics 163: 1123–1134
