long read annotation (lorean): automated eukaryotic …nov 06, 2018  · 1 1 short title: genome...

52
1 Short title: Genome annotation with LoReAn 1 2 Long Read Annotation (LoReAn): automated eukaryotic genome 3 annotation based on long-read cDNA sequencing 4 5 David E. Cook 1† , Jose Espejo Valle-Inclan 1† , Alice Pajoro 2 , Hanna Rovenich 1 , Bart PHJ 6 Thomma 1 *, and Luigi Faino 1 * 7 1 Laboratory of Phytopathology, Wageningen University and Research, Droevendaalsesteeg 1, 8 6708 PB Wageningen, The Netherlands 9 2 Laboratory of Molecular Biology, Wageningen University and Research, Droevendaalsesteeg 1, 10 6708 PB Wageningen, The Netherlands 11 12 These authors contributed equally to this work. 13 * These authors contributed equally to this work. 14 15 One sentence summary 16 The Long Read Annotation (LoReAn) pipeline provides automated genome annotation with 17 superior performance by incorporating single-molecule cDNA sequencing data. 18 19 Footnotes: 20 List of author contributions 21 LF and BT conceived the project. DEC performed data collection for the Illumina sequencing, and 22 JVI performed cDNA normalization and sequencing on the Minion with help from DEC. AP 23 performed the Arabidopsis short- and long-read experiments. HR performed experiments to 24 confirm the annotation results for the Ave1 locus. LF and JVI wrote the LoReAn Python script. LF 25 ran the annotations, and LF and DEC performed the analysis. DEC wrote the paper with LF and 26 BT. Funding, guidance and oversight of the project were provided by BT. 27 Plant Physiology Preview. Published on November 6, 2018, as DOI:10.1104/pp.18.00848 Copyright 2018 by the American Society of Plant Biologists https://plantphysiol.org Downloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Upload: others

Post on 01-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

1

Short title: Genome annotation with LoReAn 1

2

Long Read Annotation (LoReAn): automated eukaryotic genome 3

annotation based on long-read cDNA sequencing 4

5

David E. Cook1†

, Jose Espejo Valle-Inclan1†

, Alice Pajoro2, Hanna Rovenich

1, Bart PHJ 6

Thomma1*, and Luigi Faino

1*

7

1Laboratory of Phytopathology, Wageningen University and Research, Droevendaalsesteeg 1, 8

6708 PB Wageningen, The Netherlands 9

2Laboratory of Molecular Biology, Wageningen University and Research, Droevendaalsesteeg 1, 10

6708 PB Wageningen, The Netherlands 11

12

†These authors contributed equally to this work. 13

*These authors contributed equally to this work. 14

15

One sentence summary 16

The Long Read Annotation (LoReAn) pipeline provides automated genome annotation with 17

superior performance by incorporating single-molecule cDNA sequencing data. 18

19

Footnotes: 20

List of author contributions 21

LF and BT conceived the project. DEC performed data collection for the Illumina sequencing, and 22

JVI performed cDNA normalization and sequencing on the Minion with help from DEC. AP 23

performed the Arabidopsis short- and long-read experiments. HR performed experiments to 24

confirm the annotation results for the Ave1 locus. LF and JVI wrote the LoReAn Python script. LF 25

ran the annotations, and LF and DEC performed the analysis. DEC wrote the paper with LF and 26

BT. Funding, guidance and oversight of the project were provided by BT. 27

Plant Physiology Preview. Published on November 6, 2018, as DOI:10.1104/pp.18.00848

Copyright 2018 by the American Society of Plant Biologists

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 2: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

2

28

29

Funding Information 30

Funding was provided to DEC by the Human Frontiers in Science Program (HFSP) Long-term 31

fellowship (LT000627/2014-L). Work in the laboratory of BPJHT is supported by a Vici grant of 32

the Research Council for Earth and Life Sciences (ALW) of the Netherlands Organization for 33

Scientific Research (NWO). The Arabidopsis long read sequencing was performed within the 34

ZonMw-project number 435003020 entitled “Arabidopsis transcript isoform identification using 35

PacBio sequencing technology". 36

37

Present Address: David E. Cook, Department of Plant Pathology, Kansas State University, 38

Manhattan KS 66056, USA; Jose Espejo Valle-Inclan, Department of Genetics, Center for 39

Molecular Medicine, University Medical Center Utrecht, Utrecht University. Utrecht, The 40

Netherlands; Alice Pajoro, Department of Plant Developmental Biology, Max Planck Institute for 41

Plant Breeding Research, 50829 Köln, Germany; Hanna Rovenich, Botanical Institute, Cluster of 42

Excellence on Plant Sciences (CEPLAS), University of Cologne, 50674 Cologne, Germany; Luigi 43

Faino, Department of Environmental Biology, University La Sapienza, P.le Aldo Moro 5, 00185, 44

Rome, Italy 45

46

* To whom correspondence should be addressed. Email: [email protected] 47

and [email protected] 48

49

50

51

52

53

54

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 3: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

3

ABSTRACT 55

Single-molecule full-length cDNA sequencing can aid genome annotation by revealing transcript 56

structure and alternative splice forms, yet current annotation pipelines do not incorporate such 57

information. Here we present Long Read Annotation (LoReAn) software, an automated 58

annotation pipeline utilizing short- and long-read cDNA sequencing, protein evidence, and ab 59

initio prediction to generate accurate genome annotations. Based on annotations of two fungal 60

genomes (Verticillium dahliae and Plicaturopsis crispa) and two plant genomes (Arabidopsis 61

thaliana and Oryza sativa), we show that LoReAn outperforms popular annotation pipelines by 62

integrating single-molecule cDNA sequencing data generated from either the Pacific Biosciences 63

(PacBio) or MinION sequencing platforms, correctly predicting gene structure, and capturing 64

genes missed by other annotation pipelines. 65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 4: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

4

83

INTRODUCTION 84

Genome sequencing has advanced nearly every discipline within the biological sciences, as the 85

ongoing decreasing sequencing costs and increasing computational capacity allows many 86

laboratories to pursue genomics-based answers to biological questions. New sequencing 87

technologies designed to sequence longer contiguous DNA molecules, such as Pacific 88

Biosciences’ (PacBio) Single Molecule Real Time sequencing (SMRT) and Oxford Nanopore 89

Technologies’ (ONT) MinION, have ushered the most recent genomics revolution (Koren and 90

Phillippy, 2015). These advances are further enhancing the ability to generate high-quality 91

genome assemblies of large, complex eukaryotic genomes (Faino et al., 2015; Chin et al., 2016; 92

Davey et al., 2016; Jiao and Schneeberger, 2017). 93

A high-quality genome assembly, represented by (near-)chromosome completion, helps 94

address many biological questions but often requires functional features to be further defined 95

(Thomma et al., 2016). The process of genome annotation, i.e. the identification of protein-coding 96

genes and their structural features, such as intron-exons boundaries, is important to capture 97

biological values of a genome assembly (Yandell and Ence, 2012). Genomes can be annotated 98

using computer algorithms in so-called ab initio gene predictions and using wet-lab generated 99

data, such as cDNA or protein datasets for evidence-based predictions, and current annotation 100

pipelines typically incorporate both types of data (Cantarel et al., 2008; Yandell and Ence, 2012). 101

Ab initio gene prediction tools are based on statistical models, most often Hidden Markov Models 102

(HMMs), which are trained using known proteins, and typically perform well at predicting 103

conserved or core genes (Goodswen et al., 2012; Yandell and Ence, 2012). However, the ab 104

initio prediction accuracy decreases for organism-specific genes, for genes encoding small 105

proteins and across organisms with differing intron-exon features (Ter-Hovhannisyan et al., 2008; 106

Hoff et al., 2016). Furthermore, ab initio annotation of non-model genomes remains challenging 107

as appropriate training data is not always available and genome characteristics across organisms 108

can vary (Reid et al., 2014). To improve genome annotations, cDNA sequencing (RNA-seq) data 109

can be incorporated to train ab initio software (Hoff et al., 2016) and to provide additional 110

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 5: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

5

evidence for defining accurate gene models (Wang et al., 2009). Most genome annotations to 111

date rely on a combination of short-read mapping data and ab initio gene prediction. However, 112

errors occur during this process because short-read RNA-seq data cannot always be 113

unequivocally mapped, because a single read does not span a gene’s full length, and because of 114

differences in evidence weighting leading to gene prediction. LoReAn was developed to also use 115

information from long-read sequencing data to help address issues of mapping and gene 116

structure with greater emphasis given to empirical mapping data during the gene prediction 117

process. 118

Current annotation pipelines use a combination of ab initio and evidence-based 119

predictions to generate accurate consensus annotations. MAKER2 is a user-friendly, fully 120

automated annotation pipeline that incorporates multiple sources of gene prediction information 121

and has been extensively used to annotate eukaryotic genomes (Cantarel et al., 2008), (Holt and 122

Yandell, 2011) (Smith et al., 2011; Amemiya et al., 2013; Smith et al., 2013; Ming et al., 2015; 123

Lamichhaney et al., 2016). The Broad Institute Eukaryotic Genome Annotation Pipeline (here 124

referred to as BAP) has mainly been used to annotate fungal genomes (Linde et al., 2015; Muñoz 125

et al., 2015; Ma et al., 2016) and integrates multiple programs and evidences for genome 126

annotation (Haas et al., 2008; Haas et al., 2011). BRAKER1 and CodingQuarry are two gene 127

prediction software packages that utilize RNA-seq data and genome sequence to predict gene 128

models. BRAKER1 is a pipeline for unsupervised RNA-seq-based genome annotation that 129

combines the advantages of GeneMark-ET and Augustus. BRAKER1 is a two-step software that 130

combines GeneMark-ET and intron-hints, derived from mapped RNA-seq, to generate a species-131

specific database that Augustus can use for gene prediction (Hoff et al., 2016). CodingQuarry is a 132

pipeline for RNA-Seq assembly-supported training and gene prediction, which is only 133

recommended for application to fungi. In this tool, Cufflink’s assembled RNA-seq is used to build 134

a HMM model, which is used for gene prediction (Testa et al., 2015). A limitation of these 135

annotation pipelines is that experimental evidence from short read RNA-seq mapping can be lost 136

due to evidence weighting, and the pipelines cannot natively exploit gene structure information 137

from single-molecule cDNA sequencing. 138

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 6: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

6

In addition to improving genome assembly (Phillippy, 2017), long-read sequencing data 139

can be used to improve genome annotation. The use of single-molecule cDNA sequencing can 140

increase the accuracy of automated genome annotation by improving genome mapping of 141

sequencing data, correctly identifying intron-exon boundaries, directly identifying alternatively 142

spliced transcripts, identifying transcription start and end sites, and providing precise strand 143

orientation to single-exon genes (Minoche et al., 2015; Abdel-Ghany et al., 2016; Wang et al., 144

2016). However, several hurdles limit the implementation of long-read sequencing data into 145

automated genome annotation, such as the higher per-base costs when compared to short-read 146

data, the relatively high error rates for long-read sequencing technologies, and the lack of 147

bioinformatics tools to integrate long-read data into current annotation pipelines (Faino and 148

Thomma, 2014; Laver et al., 2015). The first two limitations are addressed by the continual 149

reduction in sequencing cost and improved base calling by long-read sequencing providers and 150

the development of bioinformatics methods to correct for sequencing errors (Loman et al., 2015; 151

Laehnemann et al., 2016). To address the disconnect between genome annotation pipelines and 152

the latest sequencing technologies, we developed the Long Read Annotation (LoReAn) pipeline. 153

LoReAn is an automated annotation pipeline that takes full advantage of MinION or PacBio 154

SMRT long-read sequencing data in combination with protein evidence and ab initio gene 155

predictions for full genome annotation. Short-read RNA-seq can be used in LoReAn to train ab 156

initio software. Based on the re-annotation of two fungal and two plant species, we demonstrate 157

that LoReAn can provide annotations with increased accuracy by incorporating single-molecule 158

cDNA sequencing data from different sequencing platforms. 159

160

RESULTS 161

LoReAn design and implementation 162

The LoReAn pipeline can be conceptualized in two phases. The first phase involves genome 163

annotation based on ab initio and evidence-based predictions (Figure 1A: blue arrows) and 164

largely follows the workflow previously described in the BAP (Haas et al., 2008; Haas et al., 165

2011). This first phase produces a full-genome annotation and requires the minimum input of a 166

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 7: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

7

reference genome, protein sequence of known and, possibly, related species, and a species 167

name from the Augustus prediction software database (Stanke et al., 2008). We implemented two 168

changes into the first phase of LoReAn, which we refer to as BAP+. First, LoReAn used RNA-seq 169

reads as input in combination with the BRAKER1 software (Hoff et al., 2016) to produce a 170

species-specific database for the Augustus prediction software. Additionally, RNA-seq data was 171

assembled into full-length cDNA using Trinity software (Grabherr et al., 2011), and the assembled 172

transcripts were aligned to the genome using both the Program to Assemble Spliced Alignments 173

(PASA; Haas et al., 2008) and the Genomic Mapping and Alignment Program (GMAP; Wu and 174

Watanabe, 2005). The output of PASA software was passed to the Evidence Modeler (EVM) 175

(Haas et al., 2008) as cDNA evidence while the output of GMAP was given to EVM as ab initio 176

evidence. 177

The second phase of LoReAn incorporates single-molecule cDNA sequencing with the 178

annotation results of the first phase by utilizing an alternative approach to reconstruct full-length 179

transcripts (Figure 1A: red arrows). Single-molecule long-read sequencing reads are mapped to 180

the genome using GMAP, which allows the determination of transcript structure (i.e. start, stop 181

and exon boundaries) from a single cDNA molecule (Križanovic et al., 2017). The underlying 182

reference sequence is extracted to overcome sequence errors associated with long-read 183

sequencing, and these sequences are combined with the gene models from the first phase in a 184

process we refer to as ‘clustered transcript reconstruction’ (Figure 1A and B). Through this 185

process, consensus gene models are built by combining the first and second phase gene models 186

that cluster at the same locus. Optionally, model clustering can be conducted in a strand-specific 187

manner (LoReAn stranded, see Supplemental text for details) where only gene models mapping 188

on the same DNA coding strand are used to build a consensus model. These high-confidence 189

models are mapped back to the reference using GMAP to correct open reading frames, and 190

subsequently, PASA is used to update the gene models by identifying untranslated regions 191

(UTRs) and alternatively spliced transcripts to generate a final annotation. Sequence-based 192

support for the final gene models (Figure 1B orange models) can come from the first phase 193

annotation alone (Figure 1B i), the second phase, given a sufficient level of support (Figure 1B ii, 194

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 8: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

8

iii), or through a combination of the two phases (Figure 1B iv, v). If a single consensus annotation 195

cannot be reached between the two phases, both annotations are kept in the final output (Figure 196

1B v). 197

198

LoReAn produces high quality gene predictions based on gene, exon and intron metrics 199

compared to the reference annotation and empirical data 200

To test the performance of LoReAn, we re-annotated the genome sequence of the haploid fungus 201

Verticillium dahliae, an important pathogen of hundreds of plant species including many crops 202

(Fradin and Thomma, 2006; Klosterman et al., 2009). The genome of V. dahliae strain JR2 was 203

used for testing LoReAn because it is assembled into complete chromosomes and has a 204

manually curated annotation, providing a high-confidence resource for reference (Faino et al., 205

2015). A total of 55 annotations were generated and compared, of which 24 were produced using 206

LoReAn, 12 using BAP and 12 using BAP+ with different genome masking and ab initio options 207

(description in Supporting text), along with output from the annotation software MAKER2, 208

BRAKER1, Augustus and two each from CodingQuarry and GeneMark-ES. We determined the 209

quality of annotation outputs by comparing each to the reference annotation for exact matches to 210

either genes or exon locations (Figure 2A; Table S1 and S2; reference annotation contains 211

11,385 gene models and 28,142 exons). These comparisons were used to calculate sensitivity 212

(how much of the reference is correctly predicted), specificity (how much of the prediction is in the 213

reference), and F1 score (the harmonic mean of sensitivity and specificity). These metrics were 214

calculated based on commonly described methods used within the gene prediction community 215

(see methods and references (Keibler and Brent, 2003; Yandell and Ence, 2012; Chan et al., 216

2017)). Hard masking, where short DNA repeat (>10 bp) sequences are removed from the 217

genome prior to annotation, significantly affected the quality of predicted gene and exon models, 218

with partially masked or non-masked genome inputs producing significantly improved annotations 219

(Figure S1A and S2A; Table S3 and S4). On average, the ‘fungus’ option (-f) of the ab initio 220

software GeneMark-ES produced the best gene and exon predictions (Figure S1B and S2B; 221

Table S3 and S4). Gene predictions from LoReAn using coding strand information (LoReAn-s) 222

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 9: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

9

produced statistically similar results to LoReAn for exact match genes and similar results to 223

LoReAn and BAP for exact match exons (Table S2 – S3). However, the F1 score for exact match 224

gene and exon predictions were significantly higher for LoReAn stranded compared to the other 225

three outputs (Figure S1C and S2C), indicating that using strand information improves the overall 226

quality of the annotation. 227

A single output from LoReAn and BAP were selected for further comparison to the 228

outputs from the other annotation software (i.e. MAKER2, CodingQuarry, GeneMark, BRAKER1 229

and Augustus). The LoReAn-stranded run using the ‘fungus’ option of GeneMark-ES (referred to 230

as LoReAn-sF throughout) and the BAP run using the fungus option of GeneMark-ES (referred to 231

as BAP-F throughout) using a non-masked genome were selected because they had the highest 232

F1 scores and used similar settings to the other pipelines, thereby enabling comparisons (Figure 233

2A, highlighted by horizontal yellow lines). The 7 annotation pipelines were compared by 234

computing the number of predicted genes that matched the reference and the number of 235

reference genes that were matched (i.e. numbers for specificity and sensitivity). A chi-square test 236

of independence indicated a significant, non-equal association between the 7 annotation 237

pipelines and the tested annotation metrics Pearson’s chi-square test of independence, 2= 238

913.61, p-value < 2.2e-16 Plotting the residuals of the chi-square test indicated that LoReAn-sF 239

had the largest positive association for correctly predicting gene models (column 1 and 3, Figure 240

2B) and the largest negative association for predicting wrong gene models or missing gene 241

models (Column 2 and 4, Figure 2B). These results show that there are statistically significant 242

differences between the tested pipelines for annotation quality, and the residual analysis indicates 243

that LoReAn-sF has the best association with desirable annotation metrics. 244

To further characterize gene prediction differences between annotation pipelines, four 245

outputs were selected for comparison. MAKER2 and CodingQuarry were selected as they 246

represent popular annotation choices and performed well, along with the output of LoReAn-sF 247

and BAP-F, as they are the focus of the study. The gene predictions were compared head-to-248

head in the absence of a reference annotation by determining the overlap between exact match 249

genes. There were 4,584 genes with the same predicted structure (i.e. start, stop, intron position) 250

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 10: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

10

from the four pipelines, equivalent to approximately 40% of the genes in the reference annotation 251

(Figure 3A). BAP-F predicted the fewest unique genes (1,352), while MAKER2 predicted the most 252

(3,157) (Figure 3A). However, the use of exact match gene structure to identify unique coding 253

sequences is potentially misleading, as two gene predictions can code for the same or a similar 254

protein without the exact same structure. To generate a more biologically relevant comparison of 255

unique protein coding differences, we grouped translated protein sequences of each annotation 256

into homologous groups using orthoMCL (Li et al., 2003; Chen et al., 2006). Using these groups, 257

we identified protein coding sequences that were unique to a single annotation pipeline, referred 258

to as singletons. We identified 1,429 singletons across the four annotations, with CodingQuarry 259

predicting the most (461) and BAP-F predicting the fewest (180) (Figure 3D). To validate the 260

predicted singletons, the percent coverage of the predicted transcripts was calculated using 261

short-read RNA-seq data. Singletons from LoReAn-sF were covered on average across 80% of 262

their length by mapped RNA-seq reads, while the next highest was CodingQuarry with 63% 263

average coverage. Non-parametric statistical test for the rank of the mean coverage shows that 264

singletons predicted by LoReAn-sF had significantly higher RNA-seq coverage compared to the 265

other pipelines (Figure 3B, Kruskal-Wallis test). Analyzing the length of the predicted singletons 266

indicated that CodingQuarry predicted the shortest singletons, while the other pipelines were not 267

statistically different than one another (Figure 3C, Kruskal-Wallis test). The singletons were 268

further grouped by the presence/absence of introns and RNA-seq coverage, as gene models with 269

introns and RNA-seq support are more likely to be true genes. Singletons that contained at least 270

one intron and had RNA-seq reads covering at least 75% of their length were considered the 271

highest-confidence models. Pairwise, two-sample test for equal proportions indicated the outputs 272

of LoReAn-sF and MAKER2 were statistically similar for the proportion of singletons in this high 273

confidence group, but both were significantly higher than the other two pipelines (Table S5). The 274

collective, comprehensive comparison of LoReAn versus other commonly used annotation 275

software shows that LoReAn outperforms these for many gene prediction quality metrics. 276

277

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 11: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

11

LoReAn predicts the most accurate intron locations compared to empirical RNA-seq 278

predictions 279

Annotation comparisons are commonly made against a reference annotation. This carries 280

inherent problems, however, as many organisms do not have a high-confidence reference 281

annotation. Additionally, the overlap between the software used to generate a given reference 282

annotation and that being tested introduces bias in the analysis. To evaluate the annotation 283

outputs produced here in the absence of a reference annotation, we devised an approach to 284

quantify annotation quality based on empirical RNA-seq output rather than a reference 285

annotation. The locations of predicted introns from the annotation outputs were compared to the 286

locations of the inferred introns from long- and short-read mapped data, using the same 287

annotation outputs compared earlier for gene and exon locations (Figure 4A). This analysis also 288

allowed the reference annotation to be compared to the intron locations inferred from the RNA-289

seq data. To formally test for statistical differences, the same annotation pipelines compared 290

earlier were compared using a chi-square test of independence, which indicated a significant, 291

non-equal association between the annotation pipelines and intron predictions Pearson’s chi-292

square test of independence, 2= 3220.7, p-value < 2.2e-16 The residual plot of the chi-square 293

test shows the magnitude and direction of the association between the pipelines and the 294

predictions (Figure 4B). Multiple two-sample tests of proportions between the annotation 295

predictions and the reference annotation, testing if the annotation prediction showed improved 296

metrics when compared with the reference, showed that only CodingQuarry outperformed the 297

reference annotation for the specificity metric (Table S6). This indicated the reference annotation 298

was the most similar to the RNA-seq data, which is not surprising given the high-quality reference 299

annotation. The analysis was re-run to test if the LoReAn-sF outperformed the other pipelines, 300

and LoReAn-sF outperformed the other software in nearly every instance (Table 1). This further 301

indicates that the LoReAn software offers improved annotation performance when evaluated 302

against RNA-seq data in the absence of a reference annotation. 303

304

Only the LoReAn pipeline correctly annotates the Ave1 effector locus 305

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 12: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

12

Plant-pathogenic fungi encode in planta-secreted proteins, termed effectors, which serve to 306

facilitate infection (Cook et al., 2015; Presti et al., 2015). Effectors are generally characterized as 307

lineage-specific small, secreted, cysteine-rich proteins with generally no characterized protein 308

domains or homology, characteristics which can make effectors difficult to predict with automated 309

annotation (Sperschneider et al., 2015). To test how LoReAn and the other annotation pipelines 310

performed at a specific effector locus, we detailed the annotation results for the V. dahliae Ave1 311

locus, which encodes a small-secreted protein that functions to increase virulence during plant 312

infection (de Jonge et al., 2012). As previously reported, a considerable number of short RNA-seq 313

reads uniquely map to the Ave1 locus (de Jonge et al., 2012), along with single-molecule cDNA 314

reads identified here (Figure 5A). Interestingly, MAKER2, BAP, Augustus, GeneMark-ES and 315

CodingQuarry (default) each failed to predict the previously characterized Ave1 gene despite the 316

abundance of uniquely-mapped reads (Figure 5B; Figure S3). The MAKER2, BAP and Augustus 317

pipelines predicted a separate gene on the opposite strand located to the 3’ end of the Ave1 gene 318

that is absent in the reference annotation. The CodingQuarry software run with the ‘effector’ 319

option predicted the correct coding sequence model for Ave1. The LoReAn-sF, LoReAn-s, PASA, 320

GMAP and BAP+ software predicted two genes at the locus, one corresponding to the known 321

Ave1 gene, and an additional gene to the 3’ end of Ave1 (called Ave1c), similar to the gene 322

model identified by MAKER2 (Figure 5B; Figure S3). 323

LoReAn-sF additionally predicted two mRNAs corresponding to the previously 324

characterized Ave1 gene, termed isoform-1 and -2 (Figure 5B). To confirm the presence of two 325

Ave1 isoforms, cDNAs were amplified and cloned into vectors, and 18 clones were randomly 326

selected for sequencing. A majority of the sequenced transcripts, 15 of 18, had a sequence 327

corresponding to isoform-1, the known Ave1 transcript, while the other 3 were the isoform-2 328

sequence (Figure 5C). The isoform-2 transcript is the result of an alternative splice junction 5 bp 329

upstream of the previously identified splice site in the Ave1 5’ UTR intron and is not predicted to 330

alter the protein coding sequence. The accuracy of the new gene prediction at the Ave1 locus 331

(two Ave1 isoforms and one additional gene model) was additionally tested by showing the 332

expression of the Ave1c gene. Two sets of primers (Ave1 and Ave1c fw and rev) amplified bands 333

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 13: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

13

of the expected sizes, confirming the expression of both genes across various V. dahliae growth 334

conditions (Figure 5D, top amplification panel). We also attempted to amplify a specific product 335

from both gDNA and cDNA to confirm the orientation and rule out a transcriptional fusion. 336

Consistent with the annotation, the amplification using a gDNA template was successful, while 337

the cDNA template failed to amplify a product (Figure 5D, bottom amplification panel, primers 338

Ave1 fw + Ave1c fw). Collectively, these results confirm that LoReAn predicts the most accurate 339

gene models at the Ave1 locus, including a splice-variant of Ave1. 340

341

LoReAn produces the most accurate annotation of a second fungal genome using PacBio 342

Iso-seq reads 343

The basidiomycete Plicaturopsis crispa, mostly known for its wood-degrading abilities, has a 344

relatively complex transcriptome with high levels of exons per gene; 5.6 exons per gene 345

compared to V. dahliae’s 2.5 exons per gene (Gordon et al., 2015). A total of nine annotations of 346

the P. crispa genome from five pipelines were generated using publicly available short-read 347

Illumina RNA-seq and single-molecule PacBio Iso-seq data (Kohler et al., 2015). The LoReAn 348

annotations predicted the greatest number of genes, transcripts and exons, while BAP and BAP+ 349

had the greatest number of genes, transcripts and exons exactly matching the reference (Table 2; 350

Table S7). The F1 scores for exact match genes was highest for the BAP outputs, but overall 351

similar between the annotations (Figure 6A). Testing the exact match gene proportions used for 352

sensitivity and specificity indicated significant association between annotation pipelines and the 353

metrics (Figure 6B, Pearson’s chi-square test of independence, 2= 3220.7, p-value < 2.2e-16). 354

LoReAn-sF scored significantly higher than MAKER2 for sensitivity and specificity of exact match 355

gene proportions and higher than GeneMark-ES for exact match sensitivity (Table S8). However, 356

these comparisons depended on the starting reference annotation. 357

To better understand the differences between the outputs, we applied the empirical intron 358

analysis and calculated the annotation quality metrics (Figure 6C). The sensitivity and specificity 359

proportions for exact match introns indicated significant differences across the pipelines (Figure 360

6D, Pearson’s chi-square test of independence, 2= 8583, p-value < 2.2e-16). Using chi-square 361

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 14: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

14

tests of proportions between the individual annotation outputs to that of the reference annotation 362

showed that LoReAn in stranded mode using a masked or non-masked genome produces 363

significantly improved intron location estimates than the current reference (Table 2). LoReAn 364

stranded outperformed all other pipelines for exact intron specificity and sensitivity (Table S9). 365

These results indicate the LoReAn pipeline produces an improved annotation compared to the 366

current reference and produces results as good and, under some metrics, better than other 367

annotation options. 368

369

LoReAn produces high quality annotations for larger plant genomes using PacBio Iso-seq 370

data 371

To further test LoReAn, we re-annotated the 135 megabase (Mb) Arabidopsis (Arabidopsis 372

thaliana) and 375 Mb rice (Oryza sativa) genomes using Pacbio Iso-seq data. These genomes 373

are larger and contain a higher percentage of repetitive elements than the two fungal genomes 374

tested. The Arabidopsis annotations generated here were compared to the reference annotation, 375

TAIR10, which is highly curated and represents one of the most complete plant genome 376

annotations (Lamesch et al., 2012; Berardini et al., 2015). The LoReAn outputs using a non-377

masked genome had the highest number of genes and transcripts exactly matching the 378

reference, while BAP+ had the highest number of exact match exons (Table S10). We conducted 379

similar analyses as used for the fungal genomes for exact match genes compared to the 380

reference annotation (Figure 7A and B). There was a significant difference across the pipelines 381

for sensitivity and specificity of exact match genes (Figure 7B, Pearson’s chi-square test of 382

independence, 2= 22393, p-value < 2.2e-16). Two sample proportion testing showed LoReAn-sF 383

outperformed MAKER2, GenMark-ES and Augustus for predicting genes based on exact match 384

analysis (Table S11). Similar results were seen for the inferred intron analysis (Figure 7C and D). 385

The pipelines were not equal for exact intron sensitivity and specificity (Figure 7D, Pearson’s chi-386

square test of independence, 2= 38926, p-value < 2.2e-16). The results of the pipelines were 387

compared to those from the reference annotation, and LoReAn using a masked genome in 388

standard or stranded mode and MAKER2 outperformed the current reference annotation for 389

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 15: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

15

matching intron location specificity (Table 3). None of the pipelines outperformed the reference 390

annotation for sensitivity, reflecting the high quality of the TAIR10 annotation. 391

The same analysis procedures for the rice genome showed LoReAn performed as well or 392

better than the other tested pipelines for gene and intron predictions (Figure 8, Table S12). There 393

was a significant difference between the pipelines for exact match gene sensitivity and specificity 394

(Pearson’s chi-square test of independence, 2= 50019, p-value < 2.2e-16) with the LoReAn 395

outputs having strong associations to positive annotation metrics (Figure 8B). Indeed, pairwise 396

two sample proportion testing of the other pipelines against LoReAn-s showed that LoReAn-s 397

was significantly better for both gene prediction sensitivity and specificity (Table S13). LoReAn 398

also performed well for predicting exact intron locations (Figure 8C and D), and again, there was 399

a significant difference across the pipelines for gene sensitivity and specificity (Pearson’s chi-400

square test of independence, 2= 984030, p-value < 2.2e-16). All LoReAn pipelines outperformed 401

the reference annotation for intron match specificity, and LoReAn using the full-genome and 402

strand information outperformed the reference annotation for intron match sensitivity (Table 4). 403

These data show that LoReAn provides robust results across genomes of varying features and 404

sizes and, in many instances, outperformed other currently used annotation software. 405

406

DISCUSSION 407

High throughput sequencing continues to have profound impacts on biological systems and the 408

questions researchers are addressing. The technical improvements and associated reduction in 409

cost have resulted in a deluge of high quality model and non-model genomes from across the 410

kingdoms of life. To capture the value of these assembled genomes, equal advances are needed 411

in defining the functional elements of the genome. One such technical advance is the ability to 412

sequence full-length single-molecule cDNAs that directly contain information on transcript 413

structure and alternative forms. This information has previously helped identify alternatively 414

spliced transcripts (Au et al., 2013; Abdel-Ghany et al., 2016), but single-molecule long-reads 415

have not been systematically incorporated into annotation pipelines. The newly developed 416

LoReAn pipeline integrates both short-read RNA-seq and long-read single-molecule cDNA 417

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 16: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

16

sequencing with ab initio gene prediction to generate high accuracy gene predictions. In total, 418

three separate analyses using a reference annotation, head-to-head comparison, and 419

comparison to empirical data indicate that LoReAn produces high quality annotations of the four 420

genomes tested. These results show that LoReAn has improved performance for predicting gene 421

structures and intron locations. 422

Whereas several genome annotation tools use experimental data (i.e. RNA-seq) for gene 423

prediction, none of them fully utilize this information. This is apparent for genes such as Ave1, 424

where there is ample RNA-seq evidence supporting the gene model, but prediction software, 425

including MAKER2, GeneMark-ES and BAP, do not predict the gene. The annotation pipeline 426

CodingQuarry also does not predict the Ave1 transcript when run in the default mode but does 427

predict the transcript when run using the ‘effector’ option. LoReAn correctly predicts the Ave1 428

transcript, plus an additional new transcript at the locus. The ability to correctly annotate genes 429

with unique features or restricted taxonomic distribution, such as effectors, is relevant to many 430

biological questions and will aid comparative genomic studies. LoReAn was designed to 431

incorporate information from both short- and long-read RNA-seq data as we believe with 432

increasing sequencing depth, length and accuracy this significant source of empirical evidence 433

will greatly improve gene prediction. 434

The technical and biological characteristics of a genome will impact the annotation 435

options needed to achieve high-quality gene predictions. Genome masking significantly affected 436

the gene prediction output of the V. dahliae annotation. From a technical aspect, genome 437

masking prior to annotation can have a large impact when annotating highly contiguously 438

assembled genomes. Fragmented genome assemblies often lack repetitive regions and are de 439

facto masked. Masking the telomere-to-telomere complete V. dahliae strain JR2 genome resulted 440

in gene predictions which were fragmented because of coding regions overlapping masked 441

regions. Our results indicate that genome masking of short repetitive DNA decreases the quality 442

of the genome annotation and that using a partially- (only masking repeats > 400 bp) or non-443

masked genome may improve annotation results. From a biological perspective, our results show 444

that strand information had a significant impact on annotation quality for the V. dahliae genome. 445

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 17: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

17

Compact fungal genomes have genes with overlapping UTRs, which make gene prediction 446

difficult. Using strand information, LoReAn can assign transcripts to the correct coding strand and 447

avoid the prediction of fused genes. Additionally, strand information is used to assign single exon 448

genes to the correct strand. These results need to be confirmed on a greater number of genomes 449

with diverse characteristics before being fully generalizable. Collectively, our results suggest that 450

both technical and biological information, such as assembly completeness and coding sequence 451

overlap, can impact genome annotation quality and should be considered early during project 452

design. 453

Our results show that LoReAn can successfully use single-molecule cDNA sequencing 454

data from different platforms to produce high-quality genome annotations, similar to or better than 455

the current community references for four diverse genomes. This suggests robust performance of 456

LoReAn across sequencing platforms and for annotating small fungal genomes of 35 Mb to the 457

rice genome of ~375 Mb. We speculate that the use of annotation software such as LoReAn, 458

which incorporates single-molecule cDNA sequencing into the annotation process, will 459

significantly improve genome annotation and aid in answering biological questions across all 460

domains of life. 461

462

MATERIAL AND METHODS 463

Growth conditions and RNA extraction 464

Verticillium dahliae strain JR2 (Faino et al., 2015) was maintained on potato dextrose agar (PDA) 465

plates grown at approximately 22⁰C and stored in the dark. Conidiospores were collected from 466

two-week-old PDA plates using half-strength potato dextrose broth (PDB), and subsequently 467

1x106 spores were inoculated in glass flasks containing 50 mL of either PDB, half-strength 468

Murashige and Skoog (MS) medium supplemented with 3% sucrose, or xylem sap collected from 469

greenhouse grown tomato (Solanum lycopersicum) plants of the cultivar Moneymaker. For 470

analysis of Ave1 transcription, V. dahliae strain JR2 was additionally grown in 50 mL of Czapek-471

dox media following the manufacturer’s guidelines (Oxoid Microbiology Products, Thermo 472

Scientific). The cultures were grown for four days in the dark at 22⁰C and 160 RPM. The cultures 473

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 18: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

18

were strained through miracloth (22 μm) (EMD Millipore, Darmstadt, Germany), pressed to 474

remove liquid, and flash frozen in liquid nitrogen. Next, the cultures were to ground to powder with 475

a mortar and pestle using liquid nitrogen to ensure samples remained frozen. 476

RNA extraction was carried out using TRIzol (Thermo Fisher Science, Waltham, MA, 477

USA) following manufacturer guidelines. Following RNA re-suspension, contaminating DNA was 478

removed using the TURBO DNA-free kit (Ambion, Thermo Fisher Science, Waltham, MA, USA), 479

and the RNA was checked for integrity by separating 2 μL of each sample on a 2% agarose gel. 480

RNA samples were quantified using a Nanodrop (Thermo Fisher Science, Waltham, MA, USA) 481

and stored at -80⁰C. 482

483

Library preparation and sequencing – Illumina 484

Each RNA sample from V. dahliae strain JR2 grown in PDB, half-strength MS, and xylem sap 485

was used to construct an Illumina sequencing library for RNA-sequencing by the Beijing 486

Genomics Institute (BGI) following manufacturer guidelines (Illumina Inc., San Diego, CA, USA). 487

Briefly, messenger RNA (mRNA) was enriched using oligo(dT) magnetic beads. The RNA was 488

then fragmented and double stranded cDNA synthesized following manufacturer guidelines 489

(Illumina Inc., San Diego, CA, USA). The fragments were then end-repaired and poly-adenylated 490

to allow for the addition of sequencing adapters, followed by fragment enrichment using 491

polymerase chain reaction (PCR) amplification. Library quality was assessed using the Agilent 492

2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Qualified libraries were 493

sequenced on an Illumina HiSeq-2000 (Illumina Inc., San Diego, CA, USA) at the Beijing 494

Genomics Institute. 495

496

cDNA synthesis and normalization, library preparation and sequencing – Oxford Nanopore 497

Technologies 498

For the synthesis of single-stranded cDNA, 1 μg of each RNA sample was reverse-transcribed 499

using the Mint-2 cDNA synthesis kit as described by the manufacturer (Evrogen, Moscow, 500

Russia), using the primers PlugOligo-1 (5’ end) and CDS-1 (3’ end). For each sample, 1 μl of 501

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 19: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

19

cDNA was amplified with PCR for 15 cycles (95ºC for 15 seconds, 66ºC for 20 seconds and 72ºC 502

for 3 minutes) to generate double-stranded cDNA, and purified with 1.8x volume Agencourt 503

AMPure XP magnetic beads (Beckman Coulter Inc., Indianapolis, IN, USA). 504

Three cDNA samples were normalized with the Trimmer-2 cDNA normalization kit 505

following the manufacturer’s guidelines (Evrogen, Moscow, Russia). The cDNA was precipitated, 506

denatured and hybridized for 5 hours. Next, the double stranded cDNA fraction was cleaved and 507

the remaining single stranded cDNA amplified with PCR for 18 cycles (95ºC for 15 seconds, 66ºC 508

for 20 seconds and 72ºC for 3 minutes). 509

Library preparation for the three samples was performed using the Nanopore Sequencing 510

Kit (v. SQK-MAP006) following the manufacturer’s guidelines (Oxford Nanopore Technologies 511

[ONT], Oxford, UK). The cDNA was end-repaired and dA-tailed using the NEBNext End Repair 512

and NEBNext dA-Tailing Modules following the manufacturer’s instructions (New England 513

BioLabs [NEB], Ipswich, MA, USA). The reactions were cleaned using an equal volume of 514

Agencourt AMPure XP magnetic beads (Beckman Coulter Inc., Indianapolis, IN, USA), followed 515

by ONT adapter ligation using Blunt/TA ligation Master Mix (NEB, Ipswich, MA, USA). The 516

adapter-ligated fragments were purified using Dynabeads MyOne Streptavidin C1 (Thermo Fisher 517

Science, Waltham, MA, USA). 518

Sequencing was performed on three different MinION flow cells (v. FLO-MAP103, ONT, 519

Oxford, UK). After priming the flow cells with sequencing buffer, 6 μl of the library preparation was 520

added. Additional library preparation (6 μl) was added to the flow cells at 3, 17 and 24 hours after 521

the run was started. Base-calling was performed using the Metrichor app (v. 2.39.1, ONT, Oxford, 522

UK), and Poretools (v. 0.5.1) was used to generate FASTQ files from the Metrichor produced 523

FAST5 files (Loman and Quinlan, 2014). 524

525

Software in LoReAn pipeline 526

LoReAn is implemented in Python3. Usage and parameters to run LoReAn, including default 527

settings, are detailed at https://github.com/lfaino/LoReAn/blob/master/OPTIONS.md. Mandatory 528

parameters are protein sequences of related organisms, a reference genome sequence and an 529

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 20: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

20

identification name for the species form the Augustus database. Other inputs are: short-reads (i.e. 530

Illumina RNA-seq), which may be single or paired-end; and long-reads from either MinION or 531

SMRT sequencing platforms. LoReAn outputs a GFF3 file with genome annotations. 532

The most convenient way to install and run LoReAn is by using the Docker 533

(https://www.docker.com/) image. Information about the software and how to use it can be found 534

at the https://github.com/lfaino/LoReAn repository. Additional information regarding the settings 535

used for programs in this work can be found in the Supplemental Text. The following programs 536

and versions were used for LoReAn: for read mapping, STAR (version 2.5.3a) (Dobin et al., 537

2013) and GMAP (v. 2017-06-20) (Wu and Watanabe, 2005); to assemble and reconstruct 538

transcripts from short reads, Trinity (v. 2.2.0) (Grabherr et al., 2011) run on “genome-guided 539

mode”, followed by PASA (v. 2.1.0) (Haas et al., 2008); to map protein sequences, AAT (v. 03-05-540

2011) (Huang et al., 1997); for gene prediction, GeneMark-ES (v4.34) (Lomsadze et al., 2014) 541

and Augustus (v3.3) (Stanke et al., 2008) as ab initio software but BRAKER1 (v. 2) (Hoff et al., 542

2016) in substitution of Augustus to generate ab initio gene prediction for organisms not present 543

in the Augustus catalogue when RNA-seq is supplied; GMAP (v. 2017-06-20) (Wu and 544

Watanabe, 2005) for long reads mapping and for assembled ESTs after Trinity assembly; 545

Evidence Modeler (EVM, v. 1.1.1) (Haas et al., 2008) to combine the output from the previous 546

tools to generate a combined annotation model. For EVM, evidence weights were set to 1 and 547

default options were used. Bedtools suite (v. 2.21.0) (Quinlan and Hall, 2010) was used to extract 548

the genomic sequence, merge and cluster the long-reads. iAssembler (v. 1.32) (Zheng et al., 549

2011) was used to call a consensus on the clusters (i.e. the process of transcript reconstruction). 550

GenomeTools (v. 1.5.9) software was used at several stages in the LoReAn pipeline (Gremme et 551

al., 2013). Additional information about the tools used can be found at 552

https://github.com/lfaino/LoReAn/blob/master/README.md. 553

554

Genome Masking 555

To study the effect of genome masking on automated genome annotation with LoReAn, the 556

pipeline was run on stranded mode using three reference genomes with different levels of 557

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 21: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

21

repetition masking: a fully masked genome with all repetitive sequences masked, a partially 558

masked genome where only repetitions larger than 400 bps were masked and a full genome with 559

no repetition masking. Repeats were masked using RepeatMasker software as previously 560

described (Faino et al., 2016). 561

562

LoReAn Stranded Mode 563

To use the software in strand mode efficiently, sequences from the same transcript need to have 564

the same strand. However, sequencing is random and, depending from which fragment 565

sequencing starts, fragments from the same transcript could be sequenced in forward or reverse 566

orientation compared to the transcription direction. Unlike in DNA sequencing, the direction of the 567

cDNA long-read sequencing can be inferred by localizing only one or both directions between the 568

3’ adapter or the 5’ adapter used during the cDNA production. Using the Smith-Waterman 569

alignment, the location of the adapter/s in the sequenced fragments can be identified and the 570

sequencing orientation adjusted based on the adapter alignment onto the fragments. For the 571

MinION data generated, the 5’ PlugOligo-1 AAGCAGTGGTATCAACGCAGAGTACGCGGG and 572

3’-CDS AAGCAGTGGTATCAACGCAGAGTACTGGAG primer sequences associated with the 573

cDNA synthesis and normalization process were used to identify the coding strand for each long 574

read. For the PacBio Arabidopsis (Arabidopsis thaliana) experiment, primers 575

AAGCAGTGGTATCAACGCAGAGTACGCGGG and 576

AAGCAGTGGTATCAACGCAGAGTACTTTTT were used for the correction of the transcript 577

orientation. Rice (Oryza sativa) and Plicaturopsis crispa PacBio transcripts were oriented by 578

using the sequence 579

AAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT. 580

581

Annotation quality definitions 582

The common metrics sensitivity, specificity and accuracy were used to compare the annotation 583

features. These metrics have been previously discussed in the context of annotations (Yandell 584

and Ence, 2012). Briefly, sensitivity is a measure of how well an annotation identifies the known 585

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 22: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

22

features of a reference, also called a true positive rate. Here, sensitivity was calculated as 586

[(Annotation matching reference / total Reference) * 100] for a specific feature of interest and 587

represented the percentage of known reference features captured. Specificity is a measure of 588

how many of the annotated features were in the reference, also called positive predictive value. 589

Here, specificity was calculated as [(Annotation matching reference / total Annotation) * 100] for a 590

specific feature of interest and represented the percentage of all the annotation features that 591

matched the reference. These comparisons can be for any annotation feature, such as genes, 592

transcripts, or individual exons for exact matches or for a specified overlap to a reference. The F1 593

score accounted for both sensitivity and specificity to measure annotation quality in a single 594

number, calculated as the harmonic mean of sensitivity and specificity [(Sensitivity * Specificity) / 595

(Sensitivity + Specificity)] * 2. 596

597

Head to head comparisons between annotations 598

To determine the unique protein coding genes annotated between LoReAn-sF, BAP-F, MAKER2 599

and CodingQuarry, the annotations were compared using orthoMCL (Li et al., 2003). OrthoMCL 600

was downloaded from https://github.com/apetkau/orthomcl-pipeline and run using default 601

settings. 602

603

Intron analysis 604

Introns were extracted from mapped reads using the same methodology from BRAKER1 (Hoff et 605

al., 2016). Introns supported from at least two reads were extracted and used in the intron set. 606

Genome tool software (Gremme et al., 2013) was used to annotate introns in the gff3 file. Custom 607

scripts were used to identify exact match intron coordinates from the annotation files that 608

overlapped with the intron coordinates from the RNA-seq data. Sensitivity, specificity and F1 609

score were calculated as described before. 610

611

Statistical comparisons between annotation outputs 612

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 23: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

23

Statistical comparisons were made using the R software package (Team, 2016). Chi-square tests 613

of independence (chisq.test in R) were computed to test for association between the annotation 614

pipelines and the sensitivity and specificity metrics. The residuals of the test were used to assess 615

the direction and magnitude of the associations across the data, but no formal post-hoc testing 616

was performed on this data. Two-proportions z-tests (prop.test in R) were used to compare 617

individual annotation results against the reference annotation or against a LoReAn output. These 618

tests were conducted for both gene and intron features using the number of matched features 619

and non-matched features or the number of matched and missing features (i.e. specificity and 620

sensitivity). The two-sample tests were conducted as one-tailed to determine the difference 621

compared to the reference, and Bonferroni multiple testing correction was applied to adjust the p-622

value needed to reject the null hypothesis. For singleton analysis, the read coverage and length 623

data were compared using the non-parametric Kruskal-Wallis test (kruskal in R from the agricolae 624

(de Mendiburu, 2016) package) to avoid the assumption of equal distribution and variance of the 625

data. The proportion of highest-quality singletons from each pipeline were compared against the 626

results of LoReAn-sF using the two-proportions z-test. The effect of genome masking, ab initio 627

options and pipeline options were tested using ANOVA, and Tukey’s honestly significant post-hoc 628

test (alpha = 0.05) was used to determine statistical grouping. 629

630

Ave1 isoform analysis 631

Ave1 isoforms were confirmed using cDNA-PCR of infected plant material with V. dahliae strain 632

JR2. Specific primers for the Ave1 gene (F- TTTAACACTTCACTCTGCTCTCG; R-633

CCTTGTGTGCTGCTTTGGTA) and for Ave1c gene (F-CGCCGGCAATACTATCTCAA; R-634

ATCCTGTGGGCAACAATAGC) were used to identify the two Ave1 isoforms. 635

636

AVAILABILITY 637

The LoReAn source code is available at: https://github.com/lfaino/LoReAn/ and provided under 638

an MIT license, available at: https://github.com/lfaino/LoReAn/blob/master/LICENSE. 639

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 24: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

24

Documentation and software are available at https://github.com/lfaino/LoReAn. The software can 640

run on all platforms when deployed via Docker (https://www.docker.com/). 641

All genome annotations, scripts and additional files generated and/or analyzed in the paper can 642

be found at https://github.com/lfaino/files-paper-LoReAn.git. 643

A dataset to test the correct installation of the tool can be found at 644

https://github.com/lfaino/LoReAn-Example.git. This dataset contains all the data to annotate a 645

single chromosome of V. dahliae strain JR2. 646

647

ACCESSION NUMBERS 648

The V. dahliae strain JR2 reference annotation version 5 was used in the analysis. The version 5 649

was generated by comparing the concordance of all gene models of version 4 with the long reads 650

information. Subsequently, the improved version 5 was deposited at ENSEMBL fungi database 651

and can be downloaded at http://fungi.ensembl.org/Verticillium_dahliaejr2/Info/Index. 652

The P. crispa reference genome and annotation were downloaded from JGI 653

(http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Plicr1). The 654

Arabidopsis genome sequence and reference annotation were downloaded from the TAIR 655

database (ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/; 656

https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR1657

0_GFF3_genes.gff). The rice genome sequence and annotation were retrieved from the 658

ENSEMBL plant database (http://plants.ensembl.org/Oryza_sativa/Info/Index). The sequencing 659

data are accessible at the NCBI SRA database. The short-read A. thaliana data set is deposited 660

under SRA accession number SRR5446746 and the PacBio dataset under SRA accession 661

number SRR5445910. The V. dahliae Illumina transcriptome is deposited under accession 662

number SRR5440696 while the Nanopore transcriptome data is deposited as SRR5445874. The 663

P. crispa PacBio reads were downloaded from the publicly accessible NCBI SRA site, runs 664

SRR5077068 to SRR5077144 and Illumina data from run SRR1577770. The O. sativa data were 665

downloaded from the European Nucleotide Archive (ENA) under runs ERR91110 and 666

ERR911111 and the Illumina data from run ERR748773. 667

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 25: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

25

668

ACKNOWLEGMENTS 669

We thank Jordi Coolen for his assistance in writing the LoReAn software and Michael Seidl for 670

reading the manuscript. 671

672

SUPPLEMENTAL DATA 673

Supplemental Figure S1. Annotation quality metrics and summary statistics for predicted 674

genes. 675

Supplemental Figure S2. Annotation quality metrics and summary statistics for predicted exons. 676

Supplemental Figure S3. Ave1 gene model predictions from different software used during 677

annotation. 678

Supplemental Table S1. Number of predicted genes and exact match predictions to the JR2 679

reference. 680

Supplemental Table S2. Number of predicted exons and exact match predictions to the JR2 681

reference. 682

Supplemental Table S3. Gene prediction summaries for annotation options. 683

Supplemental Table S4. Exon prediction summaries for annotation options. 684

Supplemental Table S5. Two-sample test of proportions for high-confidence singleton 685

predictions. 686

Supplemental Table S6. Two-samples test of proportions for exact match intron based on 687

empirical RNA-seq intron identification in V. dahliae. 688

Supplemental Table S7. Predicted features for P. crispa annotation analysis. 689

Supplemental Table S8. P. crispa chi-square test of exact match gene proportions against 690

LoReAn_sF. 691

Supplemental Table S9. P. crispa chi-square test of exact match intron proportions against 692

LoReAn_sF. 693

Supplemental Table S10. Predicted features for A. thaliana annotation analysis. 694

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 26: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

26

Supplemental Table S11. A. thaliana chi-square test of exact match gene proportions against 695

LoReAn_s. 696

Supplemental Table S12. Predicted features for O. sativa annotation analysis. 697

698

Supplemental Table S13. O. sativa chi-square test of exact match gene proportions against 699

LoReAn_s. 700

701

702

TABLES 703

Table 1. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq 704 mapping data in V. dahliae. 705

Pipeline Predictions Matching RNA-seq

Predictions Not Matching

Predictions missing

Specificity less than LoReAn-sF

a

Sensitivity less than LoReAn-sF

b

BAP-F 13,128 5,284 5,446 < 0.0001 < 0.0001

MAKER2 12,509 5,903 6,065 < 0.0001 < 0.0001

CodingQuarry 12,021 3,001 6,553 N.S.c < 0.0001

GeneMark-ES-F 12,323 5,274 6,251 < 0.0001 < 0.0001

BRAKER1 11,491 4,739 7,083 < 0.0001 < 0.0001

Augustus 11,857 6,079 6,717 < 0.0001 < 0.0001

Reference Annotation

13,935

4,145

4,639

N.S.c

N.S.

c

LoReAn-sF 13,780 3,899 4,794 Not applicable Not applicable a Column reports p-values for the chi-square test of proportions for the specificity metric. 706

b Column reports p-values for the chi-square test of proportions for the sensitivity metric.

707 c N.S. Not significant with a p-value greater than 0.05. 708

709 710 711 712 713 714 715 716 717 718 719 720 721 722 723

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 27: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

27

Table 2. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq 724 mapping data in P. crispa. 725

Pipeline Predictions Matching RNA-seq

Predictions Not Matching

Predictions missing

Specificity less than Reference Annotation

a

Sensitivity less than Reference Annotation

b

LoReAn 58,831 8,235 29,119 N.S.c < 0.001

LoReAn-M 58,798 8,219 29,152 N.S. c < 0.001

LoReAn-s 58,556 7,318 29,394 < 0.001 < 0.001

LoReAn-sM 58,622 7,326 29,328 < 0.001 < 0.001

BAP 54,565 8,204 33,385 N.S.c N.S.

c

BAPplus 53,943 7,159 34,007 < 0.001 N.S.c

Augustus 52,525 8,773 35,425 N.S.c N.S.

c

GeneMark-ES-F 51,450 11,085 36,500 N.S.

c N.S.

c

MAKER2 50,228 10,881 37,722 N.S.c N.S.

c

Reference Annotation 57,837 80,54 30,113 Not applicable Not applicable

a Column reports p-values for the chi-square test of proportions for the specificity metric. 726

b Column reports p-values for the chi-square test of proportions for the sensitivity metric.

727 c N.S. Not significant with a p-value greater than 0.05. 728

729 730

731

732

733

734

735

736

737

738

739

740

741

742

743

744

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 28: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

28

Table 3. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq 745 mapping data in A. thaliana. 746

Pipeline Predictions Matching RNA-seq

Predictions Not Matching

Predictions missing

Specificity less than Reference Annotation

a

Sensitivity less than Reference Annotation

b

LoReAn 67064 57969 4376 N.S.c N.S.

c

LoReAn-M 66199 45073 5241 < 0.001 N.S.c

LoReAn-s 67003 57897 4437 N.S.c N.S.

c

LoReAn-sM 66133 44987 5307 < 0.001 N.S.c

BAP 64444 59868 6996 N.S.c N.S.

c

BAPplus 65037 59153 6403 N.S.c N.S.

c

Augustus 63137 70107 8303 N.S.c N.S.

c

GeneMark-ES 61815 80017 9625 N.S.c N.S.

c

MAKER2 60289 26238 11151 < 0.001 N.S.c

Reference Annotation

69657 54910 1783 Not applicable Not applicable

a Column reports p-values for the chi-square test of proportions for the specificity metric. 747

b Column reports p-values for the chi-square test of proportions for the sensitivity metric.

748 c N.S. Not significant with a p-value greater than 0.05. 749

750

751

752

753

754

755

756

757

758

759

760

761

762

763

764

765

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 29: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

29

Table 4 Chi-square test of proportions for predicted exact match introns inferred from RNA-seq 766 mapping data in O. sativa. 767

Pipeline Predictions Matching RNA-seq

Predictions Not Matching

Predictions missing

Specificity less than Reference Annotation

a

Sensitivity less than Reference Annotation

b

LoReAn 84679 33790 37383 < 0.001 N.S.c

LoReAn-M 81707 15929 40355 < 0.001 N.S.c

LoReAn-s 84880 33749 37182 < 0.001 0.007

LoReAn-sM 81870 15892 40192 < 0.001 N.S.c

BAP 77609 85999 44453 N.S.c N.S.

c

BAPplus 75048 39740 47014 N.S.c N.S.

c

Augustus 64279 94836 57783 N.S.c N.S.

c

GeneMark-ES 1958 447395 120104 N.S.c N.S.

c

MAKER2 78819 57887 43243 N.S.c N.S.

c

Reference Annotation

84325 38746 37737 Not applicable Not applicable

a Column reports p-values for the chi-square test of proportions for the specificity metric. 768

b Column reports p-values for the chi-square test of proportions for the sensitivity metric.

769 c N.S. Not significant with a p-value greater than 0.05. 770

771

772

773

774

775

FIGURE LEGENDS 776

Figure 1. Schematic overview of the LoReAn pipeline and clustered transcript 777

reconstruction. 778

(A) Illustration of the computational workflow for the LoReAn pipeline. Grey boxes represent input 779

data, and each white box represents a step in the annotation process with mention of the specific 780

software. The boxes connected by blue arrows integrate the steps from the previously described 781

BAP, described in the text as phase one of LoReAn (Haas et al., 2008). The LoReAn pipeline 782

(boxes connected by red arrows) integrates the BAP workflow, but additionally incorporates long-783

read sequencing data, described in the text as phase two of LoReAn. The blue box, ‘BAP 784

annotation’ represents the annotation results from the BAP pipeline used for comparison in this 785

study, while the orange box ‘LoReAn annotation’ represents the annotation results from the 786

LoReAn pipeline using long-read sequencing data. Dashed arrows represent optional steps for 787

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 30: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

30

the pipeline. (B) Illustration of the clustered transcript reconstruction. Gene models are depicted 788

as exons (boxes) and connecting introns (lines). Blue models represent BAP annotations, while 789

red models represent hypothetical long-reads mapped to the genome. Orange models represent 790

consensus annotations reported in the final LoReAn output. Various scenarios can occur: i: High 791

confidence predictions from the BAP are kept regardless of whether they are supported by long-792

reads. ii & iii: Clusters of mapped long-reads are used to generate a consensus prediction model, 793

unless the model is supported by less than a user-defined minimum depth. iv: Overlapping BAP 794

and mapped long-reads are combined to a consensus model. v: Two annotations are reported if 795

no consensus can be reached for the BAP and clustered long-read data. 796

797

Figure 2. Annotation quality summary for exact match genes to the reference. 798

(A) Gene annotation quality summary, where each horizontal bar represents an annotation 799

output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score 800

(red). The annotations are labelled using the left grid, where the group of horizontal black dots 801

defines the parameters used in the annotation. Possible parameters include using LoReAn, BAP 802

or BAP+ pipeline, stranded mode for LoReAn (Stranded), the fungus option for GeneMark-ES 803

(Fungus), or the BRAKER1 program for Augustus (BRAKER1). Annotation options are grouped 804

by the level of reference masking- Partially Masked (Part.), Non-Masked (None) or Fully Masked 805

(Mask). The results from additionally tested annotation pipelines are shown at the bottom. Three 806

vertical grey lines represent the 1st quantile, median and 3

rd quantile for the F1 score. The two 807

annotations highlighted with a yellow horizontal bar were used for subsequent analysis. (B) The 808

proportion of exact match to non-matching gene predictions (specificity) and exact match to 809

missing gene predictions (sensitivity) were compared using a chi-square test of independence. 810

The residuals from the analysis are shown with the size and color representing the magnitude 811

and direction of the association between rows and columns. GeneMark-ES-F: GeneMark gene 812

prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and 813

the ‘fungus’ option of GeneMark-ES. BAP-F: The Broad Institute Eukaryotic Genome Annotation 814

Pipeline described in the text and using the ‘fungus’ option of GeneMark-ES. 815

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 31: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

31

816

Figure 3. Analysis of predicted singletons across four pipelines. 817

(A) Venn diagram showing the overlap and uniqueness of predicted genes based on genomic 818

location. The Venn diagram shows that 4,584 genes were annotated with the exact same 819

features across all four pipelines. The numbers captured by only a single annotation pipeline are 820

considered singletons- genes whose structure is uniquely annotated by a given pipeline. Note, 821

these singletons do not necessarily represent unique loci. (B) Short-read RNA-seq data were 822

mapped to the genome, and the percent length coverage of each gene annotation was 823

calculated. The data were not completely normally distributed, so a non-parametric Kruskal-Wallis 824

test was used to rank the mean of the coverage. Data are shown as violin plots, with the tails 825

representing the data range and the mean and standard deviation are shown as a black point and 826

black vertical lines, respectively. Letters shown above each violin plot represent post-hoc 827

statistical groupings where plots with the same letter are statistically indistinguishable. Multiple 828

comparisons were made using the non-parametric Kruskal-Wallis rank of means and post hoc 829

differences were determined using Fisher’s least significamt difference, p < 0.05 with Bonferroni 830

correction. (C) Same as in (B) except the mean rank of the singleton length is analyzed. (D) The 831

orthoMCL singletons from each pipeline were grouped into one of four categories shown in the 832

key representing if the singleton contained an intron or not and if the singleton’s length was 833

covered by over 75% with RNA-seq data. The number of singletons within each of the four 834

categories is shown. 835

836

Figure 4. Annotation quality summary for predicted introns exactly matching RNA-seq 837

inferred introns. (A) Predicted intron quality summary, where each horizontal bar represents an 838

annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and 839

F1 score (red) as described in Figure 2A. (B) The proportion of exact match to non-matching 840

intron predictions (specificity) and exact match to missing intron predictions (sensitivity) were 841

compared using a chi-square test of independence as described in Figure 2B. GeneMark-ES-F: 842

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 32: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

32

GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand 843

information and the ‘fungus’ option of GeneMark-ES. 844

845

Figure 5. The LoReAn pipeline most accurately annotates a specific fungal locus encoding 846

a strain-specific gene. (A) Short-read RNA-seq data mapped to the locus are shown as a 847

coverage plot (grey peaks) and as representative individual reads (yellow boxes). Long-reads 848

from single-molecule cDNA data mapped to the locus are shown as a coverage plot (grey peaks) 849

and representative reads (purple boxes). Thick black lines linking mapped reads represent gaps 850

in the mapped reads and are indicative of introns. The long-read data were split by mapping 851

strand and coverage plots for forward (red) and reverse (blue) coverage plots. (B) Gene model 852

predictions from four annotation pipelines are illustrated. Light blue boxes represent untranslated 853

regions (5’ and 3’ UTR), dark blue boxes represent coding sequence boundaries, and thin black 854

lines depict introns. Arrows in the introns indicate the direction of transcription. The MAKER2 and 855

BAP pipelines predict a single transcript coded on the reverse strand at the 3’ end of the known 856

Ave1 transcript. LoReAn-sF predicts two transcripts corresponding to the Ave1 gene along with 857

the similar transcript predicted by MAKER2. The reference Ave1 transcript is shown in grey. (C) 858

To confirm the presence of an alternative splice site in the 5’UTR of the Ave1 transcript, 18 cDNA 859

clones were randomly chosen and sequenced. Isoform 1 sequence is identical to the reference 860

Ave1 sequence and was identified in 15 of the 18 clones. Isoform 2 has a 5 bp insertion in the 861

5’UTR resulting from an alternative exon splice site and was identified in 3 of the 18 sequenced 862

clones. The Ave1 reference sequence is shown from bases 71 through 86. (D) The presence of 863

Ave1 and the additional gene transcribed to the 3’ end of Ave1, termed Ave1close(Ave1c), was 864

confirmed using PCR on gDNA and cDNA. PCR using gene specific primers, termed Ave1 fw + 865

rev (pink arrows) or Ave1c for + rev (yellow arrows), shows that both genes are expressed in 866

either potato dextrose broth (PDB) Czapek-dox (CPD) or half-strength Murashige-Skoog (1/2MS) 867

media. The inverse orientation of the two genes was confirmed using forward primers only, which 868

amplified the entire locus resulting in a band of approximately 1,118 bp, but does not amplify 869

product using cDNA as the template. 870

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 33: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

33

871

Figure 6. Assessment of gene and intron predictions from P. crispa. 872

(A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. 873

LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in 874

non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-875

masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad 876

Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in 877

text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact 878

match to missing gene predictions (sensitivity) compared using a chi-square test of independence 879

as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) 880

and (B) respectively. 881

882

Figure 7. Assessment of gene and intron predictions from A. thaliana. 883

(A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. 884

LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in 885

non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-886

masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad 887

Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in 888

text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact 889

match to missing gene predictions (sensitivity) compared using a chi-square test of independence 890

as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) 891

and (B) respectively. 892

893

Figure 8. Assessment of gene and intron predictions from O. sativa. 894

(A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. 895

LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in 896

non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-897

masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad 898

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 34: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

34

Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in 899

text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact 900

match to missing gene predictions (sensitivity) compared using a chi-square test of independence 901

as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) 902

and (B) respectively. 903

904

905

906

907

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 35: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

35

REFERENCES 908

Abdel-Ghany SE, Hamilton M, Jacobi JL, Ngam P, Devitt N, Schilkey F, Ben-Hur A, Reddy 909 ASN (2016) A survey of the sorghum transcriptome using single-molecule long reads. Nature 910 Communications 7: 11706 911

Amemiya CT, Alföldi J, Lee AP, Fan S, Philippe H, Maccallum I, Braasch I, Manousaki T, 912 Schneider I, Rohner N, et al (2013) The African coelacanth genome provides insights into 913 tetrapod evolution. Nature 496: 311–316 914

Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, van Bakel H, Schadt EE, 915 Reijo-Pera RA, Underwood JG, et al (2013) Characterization of the human ESC 916 transcriptome by hybrid sequencing. Proc Natl Acad Sci USA 110: E4821–30 917

Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E (2015) The 918 Arabidopsis information resource: Making and mining the “gold standard” annotated 919 reference plant genome. Genesis 53: 474–485 920

Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, 921 Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model 922 organism genomes. Genome Res 18: 188–196 923

Chan K-L, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low E-TL (2017) Seqping: gene 924 prediction pipeline for plant genomes using self-training gene models and transcriptomic 925 data. BMC Bioinformatics 18: 1426–7 926

Chen F, Mackey AJ, Stoeckert CJ, Roos DS (2006) OrthoMCL-DB: querying a comprehensive 927 multi-species collection of ortholog groups. Nucleic Acids Res 34: D363–8 928

Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O'Malley 929 R, Figueroa-Balderas R, Morales-Cruz A, et al (2016) Phased diploid genome assembly 930 with single-molecule real-time sequencing. Nat Methods 13: 1050–1054 931

Cook DE, Mesarich CH, Thomma BPHJ (2015) Understanding plant immunity as a surveillance 932 system to detect invasion. Annu Rev Phytopathol 53: 541–563 933

Davey JW, Chouteau M, Barker SL, Maroja L, Baxter SW, Simpson F, Joron M, Mallet J, 934 Dasmahapatra KK, Jiggins CD (2016) Major Improvements to the Heliconius melpomene 935 Genome Assembly Used to Confirm 10 Chromosome Fusion Events in 6 Million Years of 936 Butterfly Evolution. G3 (Bethesda) 6: 695–708 937

de Jonge R, van Esse HP, Maruthachalam K, Bolton MD, Santhanam P, Saber MK, Zhang Z, 938 Usami T, Lievens B, Subbarao KV, et al (2012) Tomato immune receptor Ve1 recognizes 939 effector of multiple fungal pathogens uncovered by genome and RNA sequencing. Proc Natl 940 Acad Sci USA 109: 5110–5115 941

de Mendiburu F (2016) agricolae: Statistical Procedures for Agricultural Research. 942 httpsCRAN.R-project.orgpackageagricolae, https://CRAN.R-project.org/package=agricolae 943

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, 944 Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29: 15–21 945

Faino L, Seidl MF, Datema E, van den Berg GCM, Janssen A, Wittenberg AHJ, Thomma 946 BPHJ (2015) Single-Molecule Real-Time Sequencing Combined with Optical Mapping Yields 947 Completely Finished Fungal Genome. mBio 6: e00936–15 948

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 36: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

36

Faino L, Seidl MF, Shi-Kunne X, Pauper M, van den Berg GCM, Wittenberg AHJ, Thomma 949 BPHJ (2016) Transposons passively and actively contribute to evolution of the two-speed 950 genome of a fungal pathogen. Genome Res 26: 1091–1100 951

Faino L, Thomma BPHJ (2014) Get your high-quality low-cost genome sequence. Trends in 952 Plant Science 19: 288–291 953

Fradin EF, Thomma BPHJ (2006) Physiology and molecular aspects of Verticillium wilt diseases 954 caused by V. dahliae and V. albo-atrum. Mol Plant Pathol 7: 71–86 955

Goodswen SJ, Kennedy PJ, Ellis JT (2012) Evaluating high-throughput ab initio gene finders to 956 discover proteins encoded in eukaryotic pathogen genomes missed by laboratory 957 techniques. PLoS ONE 7: e50609 958

Gordon SP, Tseng E, Salamov A, Zhang J, Meng X, Zhao Z, Kang D, Underwood J, 959 Grigoriev IV, Figueroa M, et al (2015) Widespread Polycistronic Transcripts in Fungi 960 Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10: e0132628 961

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, 962 Raychowdhury R, Zeng Q, et al (2011) Full-length transcriptome assembly from RNA-Seq 963 data without a reference genome. Nature Biotechnology 29: 644–652 964

Gremme G, Steinbiss S, Kurtz S (2013) GenomeTools: a comprehensive software library for 965 efficient processing of structured genome annotations. IEEE/ACM Trans Comput Biol 966 Bioinform 10: 645–656 967

Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR 968 (2008) Automated eukaryotic gene structure annotation using EVidenceModeler and the 969 Program to Assemble Spliced Alignments. Genome Biol 9: R7 970

Haas BJ, Zeng Q, Pearson MD, Cuomo CA, Wortman JR (2011) Approaches to Fungal 971 Genome Annotation. Mycology 2: 118–141 972

Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: Unsupervised 973 RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 974 32: 767–769 975

Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management 976 tool for second-generation genome projects. BMC Bioinformatics 12: 491 977

Huang X, Adams MD, Zhou H, Kerlavage AR (1997) A tool for analyzing and annotating 978 genomic sequences. Genomics 46: 37–45 979

Jiao W-B, Schneeberger K (2017) The impact of third generation genomic technologies on plant 980 genome assembly. Current Opinion in Plant Biology 36: 64–70 981

Keibler E, Brent MR (2003) Eval: a software package for analysis of genome annotations. BMC 982 Bioinformatics 4: 50 983

Klosterman SJ, Atallah ZK, Vallad GE, Subbarao KV (2009) Diversity, pathogenicity, and 984 management of verticillium species. Annu Rev Phytopathol 47: 39–62 985

Kohler A, Kuo A, Nagy LG, Morin E, Barry KW, Buscot F, Canbäck B, Choi C, Cichocki N, 986 Clum A, et al (2015) Convergent losses of decay mechanisms and rapid turnover of 987 symbiosis genes in mycorrhizal mutualists. Nature Genetics 47: 410–415 988

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 37: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

37

Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from 989 long-read sequencing and assembly. Curr Opin Microbiol 23: 110–120 990

Križanovic K, Echchiki A, Roux J, Šikic M (2017) Evaluation of tools for long read RNA-seq 991 splice-aware alignment. Bioinformatics. doi: 10.1093/bioinformatics/btx668 992

Laehnemann D, Borkhardt A, McHardy AC (2016) Denoising DNA deep sequencing data-high-993 throughput sequencing errors and their correction. Brief Bioinformatics 17: 154–179 994

Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, 995 Alexander DL, Garcia-Hernandez M, et al (2012) The Arabidopsis Information Resource 996 (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40: D1202–D1210 997

Lamichhaney S, Fan G, Widemo F, Gunnarsson U, Thalmann DS, Hoeppner MP, Kerje S, 998 Gustafson U, Shi C, Zhang H, et al (2016) Structural genomic changes underlie alternative 999 reproductive strategies in the ruff (Philomachus pugnax). Nature Genetics 48: 84–88 1000

Laver T, Harrison J, O'Neill PA, Moore K, Farbos A, Paszkiewicz K, Studholme DJ (2015) 1001 Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol Detect 1002 Quantif 3: 1–8 1003

Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic 1004 genomes. Genome Res 13: 2178–2189 1005

Linde J, Duggan S, Weber M, Horn F, Sieber P, Hellwig D, Riege K, Marz M, Martin R, 1006 Guthke R, et al (2015) Defining the transcriptomic landscape of Candida glabrata by RNA-1007 Seq. Nucleic Acids Res 43: 1392–1406 1008

Loman NJ, Quick J, Simpson JT (2015) A complete bacterial genome assembled de novo 1009 using only nanopore sequencing data. Nat Methods 12: 733–735 1010

Loman NJ, Quinlan AR (2014) Poretools: a toolkit for analyzing nanopore sequence data. 1011 Bioinformatics 30: 3399–3401 1012

Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into 1013 automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42: e119–e119 1014

Ma L, Chen Z, Huang DW, Kutty G, Ishihara M, Wang H, Abouelleil A, Bishop L, Davey E, 1015 Deng R, et al (2016) Genome analysis of three Pneumocystis species reveals adaptation 1016 mechanisms to life exclusively in mammalian hosts. Nature Communications 7: 10740 1017

Ming R, VanBuren R, Wai CM, Tang H, Schatz MC, Bowers JE, Lyons E, Wang M-L, Chen J, 1018 Biggers E, et al (2015) The pineapple genome and the evolution of CAM photosynthesis. 1019 Nature Genetics 47: 1435–1442 1020

Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M, Sörensen TR, 1021 Weisshaar B, Himmelbauer H (2015) Exploiting single-molecule transcript sequencing for 1022 eukaryotic gene prediction. Genome Biol 16: 184 1023

Muñoz JF, Gauthier GM, Desjardins CA, Gallo JE, Holder J, Sullivan TD, Marty AJ, Carmen 1024 JC, Chen Z, Ding L, et al (2015) The Dynamic Genome and Transcriptome of the Human 1025 Fungal Pathogen Blastomyces and Close Relative Emmonsia. PLoS Genet 11: e1005493 1026

Phillippy AM (2017) New advances in sequence assembly. Genome Res 27: xi–xiii 1027

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 38: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

38

Presti Lo L, Lanver D, Schweizer G, Tanaka S, Liang L, Tollot M, Zuccaro A, Reissmann S, 1028 Kahmann R (2015) Fungal effectors and plant susceptibility. Annu Rev Plant Biol 66: 513–1029 545 1030

Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. 1031 Bioinformatics 26: 841–842 1032

Reid I, O'Toole N, Zabaneh O, Nourzadeh R, Dahdouli M, Abdellateef M, Gordon PMK, Soh 1033 J, Butler G, Sensen CW, et al (2014) SnowyOwl: Accurate prediction of fungal genes by 1034 using RNA-Seq and homology information to select among ab initio models. BMC 1035 Bioinformatics 15: 229 1036

Smith CD, Zimin A, Holt C, Abouheif E, Benton R, Cash E, Croset V, Currie CR, Elhaik E, 1037 Elsik CG, et al (2011) Draft genome of the globally widespread and invasive Argentine ant 1038 (Linepithema humile). Proc Natl Acad Sci USA 108: 5673–5678 1039

Smith JJ, Kuraku S, Holt C, Sauka-Spengler T, Jiang N, Campbell MS, Yandell MD, 1040 Manousaki T, Meyer A, Bloom OE, et al (2013) Sequencing of the sea lamprey 1041 (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature Genetics 1042 45: 415–21– 421e1–2 1043

Sperschneider J, Dodds PN, Gardiner DM, Manners JM, Singh KB, Taylor JM (2015) 1044 Advances and challenges in computational prediction of effectors from plant pathogenic 1045 fungi. PLoS Pathog 11: e1004806 1046

Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped 1047 cDNA alignments to improve de novo gene finding. Bioinformatics 24: 637–644 1048

Team RC (2016) R: A Language and Environment for Statistical Computing. https://www.R-1049 project.org/ 1050

Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in 1051 novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 1052 18: 1979–1990 1053

Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden 1054 Markov model gene prediction in fungal genomes using RNA-seq transcripts. BMC 1055 Genomics 16: 170 1056

Thomma BPHJ, Seidl MF, Shi-Kunne X, Cook DE, Bolton MD, van Kan JAL, Faino L (2016) 1057 Mind the gap; seven reasons to close fragmented genome assemblies. Fungal Genet Biol 1058 90: 24–30 1059

Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D 1060 (2016) Unveiling the complexity of the maize transcriptome by single-molecule long-read 1061 sequencing. Nature Communications 7: 11708 1062

Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature 1063 Reviews Genetics 10: 57–63 1064

Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and 1065 EST sequences. Bioinformatics 21: 1859–1875 1066

Yandell M, Ence D (2012) A beginner's guide to eukaryotic genome annotation. Nature Reviews 1067 Genetics 13: 329–342 1068

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 39: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

39

Zheng Y, Zhao L, Gao J, Fei Z (2011) iAssembler: a package for de novo assembly of Roche-1069 454/Sanger transcriptome sequences. BMC Bioinformatics 12: 453 1070

1071

1072

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 40: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

B

A

Clustered Transcript Reconstruction

Protein alignment

AAT

Transcript reconstruction

Trinity

Short-readRNA-seq data

Transcript mapping

PASA

Combine evidenceEVidence Modeller

ab initioBraker1

AugustusGeneMark-ES

Update transcripts

PASA

Long-readRNA-seq data

Clusteredtranscript

reconstruction Bedtools

iAssembler

Transcript mapping

GMAP

BAP annotation

Extract exon sequences

GFFread

LoReAn annotation

Referencegenome

sequence

Proteinsequenceevidence

Long-Readmapping

GMAP

Extract exon sequences

GFFread

Transcript mapping

GMAP

Transcript update

PASA

Phase-one models

Phase-two models

LoReAn output Excluded

i ii iv viii

Figure 1

Figure 1. Schematic overview of the LoReAn pipeline and clustered transcript reconstruction.(A) Illustration of the computational workflow for the LoReAn pipeline. Grey boxes represent input data and each white box represents a step in the annotation process with mention of the specific software. The boxes connected by blue arrows integrate the steps from the previously described BAP (Haas et al., 2008). The LoReAn pipeline (boxes connected by red arrows) integrates the BAP workflow, but additionally incorporates long-read sequencing data. The orange box, ‘Final BAP annotation’ represents the annotation results from the BAP pipeline used for comparison in this study. Dashed arrows represent optional steps for the pipeline. (B) Illustration of the clustered transcript reconstruction. Gene models are depicted as exons (boxes) and connecting introns (lines). Blue models represent BAP annotations, while red models represent hypothetical long-reads mapped to the genome. Orange models represent consensus annotations reported in the final LoReAn output. Various scenarios can occur: i: High confidence predictions from the BAP are kept regardless of whether they are supported by long-reads. ii & iii: Clusters of mapped long-reads are used to generate a consensus prediction model, unless the model is supported by less than a user-defined minimum depth. iv: Overlapping BAP and mapped long-reads are combined to a consensus model. v: Two annotations are reported if no consensus can be reached for the BAP and clustered long-read data.

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 41: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

25 35 45 55 65 75

Part

.M

ask

Non

ePa

rt.

Mas

kN

one

Part

.M

ask

Non

e

LoReAn

Stranded

FungusBraker

BAPBAP+

MAKER2CodingQuarry_effector

CodingQuarryGeneMark-ES-F

GeneMark-ESBRAKER1Augustus

-15 -9 -6 -3 0 3 6 9 12-12 15

MAKER2

CodingQuarry

GeneMark-ES-F

BRAKER1

Prediction M

atch

Prediction NOT M

atch

Reference Match

Reference NOT Match

LoReAn-sF

Augustus

Pearson residuals for chi-square

BAP-F

A BSensitivity F1 scoreSpecificity

Percent of exact match genes

Figure 2

Figure 2. Annotation quality summary for exact match genes to the reference. (A) Gene annotation quality summary, where each horizontal bar represents an annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score (red). The annotations are labelled using the left grid, where the group of horizontal black dots defines the parameters used in the annotation. Possible parameters include using LoReAn, BAP or BAP+ pipeline, stranded mode for LoReAn (Stranded), the fungus option for GeneMark-ES (Fungus), or the BRAKER1 program for Augustus (BRAKER1). Annotation options are grouped by the level of reference masking- Partially Masked (Part.), Non-Masked (None) or Fully Masked (Mask). The results from additionally tested annotation pipelines are shown at the bottom. Three vertical grey lines represent the 1st quantile, median and 3rd quantile for the F1 score. The two annotations highlighted with a yellow horizontal bar were used for subsequent analysis. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence. The residuals from the analysis are shown with the size and color representing the magnitude and direction of the association between rows and columns. GeneMark-ES-F: GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and the ‘fungus’ option of GeneMark-ES. https://plantphysiol.orgDownloaded on February 7, 2021. - Published by

Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 42: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

2665

1352 1923

3157397

567

137

728

945

3091362

339147

808

4584

140

64153

10420

44

41

75

24169

17

110

CodingQuarryBAP-F LoReAn-sF

MAKER2A

B

D

C

CodingQuarry BAP-F

LoReAn-sF MAKER2

a cbd a aba

With intron,< 75% coverage

No intron,< 75% coverage

With intron,>75% coverage

No intron,> 75% coverage

CodingQuarryBAP-F LoReAn-sF MAKER2

176159

106

Mea

n Ra

nk C

over

age

Mea

n Ra

nk L

engt

h0

500

1000

1500

0

500

1000

1500

Figure 3

Figure 3. Analysis of predicted singletons across four pipelines. (A) Venn diagram showing the overlap and uniqueness of predicted genes based on genomic location. The Venn diagram shows that 4,584 genes were annotated with the exact same features across all four pipelines. The numbers captured by only a single annotation pipeline are considered singletons- genes whose structure is uniquely annotated by a given pipeline. Note, these singletons do not necessarily represent unique loci. (B) Short-read RNA-seq data were mapped to the genome and the percent length coverage of each gene annotation was calculated. The data were not completely normally distributed, so a non-parametric Kruskal-Wallis test was used to rank the mean of the coverage. Data are shown as violin plots, with the tails representing the data range and the mean and standard deviation is shown as a black point and black vertical lines. Letters shown above each violin plot represent post-hoc statistical groupings where plots with the same letter are statistically indistinguishable. (C) Same as in (B) except the mean rank of the singleton length is analyzed. (D) The orthoMCL singletons from each pipeline were grouped into one of four categories shown in the key representing if the singleton contained an intron or not and if the singleton’s length was covered by over 75% with RNA-seq data. The number of singletons within each of the four categories is shown.

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 43: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

LoReAn

Stranded

Fungus

BrakerBAP

BAP+

MAKER2CodingQuarry_effector

CodingQuarryGeneMark-ES-F

GeneMark-ESBRAKER1Augustus

ASensitivity F1 scoreSpecificity

55 65 75 85

Part

.M

ask

Non

ePa

rt.

Mas

kN

one

Part

.M

ask

Non

e

Reference Annotation

MAKER2

CodingQuarry

GeneMark-ES-F

BRAKER1

Prediction M

atch

Prediction NOT M

atch

Reference Match

Reference NOT Match

LoReAn-sF

Augustus

BAP-F

ReferenceAnnotation

B

-25 -15 -10 -5 0 5 10 15 20-20 25Pearson residuals for chi-square

Percent of exact match introns

Figure 4

Figure 4. Annotation quality summary for predicted introns exactly matching RNA-seq inferred introns. (A) Predicted intron quality summary, where each horizontal bar represents an annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score (red) as described in Figure 2A. (B) The proportion of exact match to non-matching intron predictions (specificity) and exact match to missing intron predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. GeneMark-ES-F: GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and the ‘fungus’ option of GeneMark-ES.

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 44: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

MappedShortReads

ForwardReverse

0

10000

−10

50

0

50

MAKER2

MappedLongReads

Iso-2Iso-1

Ave1

LongReadStrand

CodingQuarry with‘effector’ option

BAP

Reference

Reference Ave1

LoReAn-sF

Ave1 fw

700,100Chr 5Ave1 rev

Ave1c rev Ave1c fw70 AAGCAATT - - - - - CGCCAATT 87

70 AAGCAATTAGTAGCGCCAATT 92

70 AAGCAATT - - - - - CGCCAATT 87LoReAn Ave1- isoform 2

Ave1 cDNA sequencing

Isoform 2

Isoform 1

LoReAn Ave1- isoform 1

153

700,750 701,250

Figure 5. The LoReAn pipeline most accurately annotates a specific fungal locus encoding a strain specific gene. (A) Short-read RNA-seq data mapped to the locus are shown as a coverage plot (grey peaks) and as representative individual reads (yellow boxes). Long-reads from single-molecule cDNA data mapped to the locus are shown as a coverage plot (grey peaks) and representative reads (purple boxes). Think black lines linked mapped reads represent gaps in the mapped reads and are indicative of introns. The long-read data was split by mapping strand and coverage plots for forward (red) and reverse (blue) coverage plots. (B) Gene model predictions from four annotation pipelines are illustrated. Light blue boxes represent untranslated regions (5’ and 3’ UTR), dark blue boxes represent coding sequence boundaries, and thin black lines depict introns. Arrows in the introns indicate the direction of transcription. The MAKER2 and BAP pipelines predict a single transcript coded on the reverse strand at the 3’ end of the known Ave1 transcript. LoReAn-sF predicts two transcripts corresponding to the Ave1 gene along with the similar transcript predicted by MAKER2. The reference Ave1 transcript is shown in grey. (C) To confirm the presence of an alternative splice site in the 5’UTR of the Ave1 transcript, 18 cDNA clones were randomly chosen and sequenced. Isoform 1 sequence is identical to the reference Ave1 sequence and was identified in 15 of the 18 clones. Isoform 2 has a 5 bp insertion in the 5’UTR resulting from an alternative exon splice site and was identified in 3 of the 18 sequenced clones. The Ave1 reference sequence is shown from bases 71 through 86. (D) The presence of Ave1 and the additional gene transcribed to the 3’ end of Ave1, termed Ave1close(Ave1c), was confirmed using PCR on gDNA and cDNA. PCR using gene specific primers, termed Ave1 fw + rev (pink arrows) or Ave1c for + rev (yellow arrows), shows that both genes are expressed in either potato dextrose broth (PDB) Czapek-dox (CPD) or half-strength Murashige-Skoog (1/2MS) media. The inverse orientation of the two genes was confirmed using forward primers only, which amplified the entire locus resulting in a band of approximately 1,118 bp, but does not amplify product using cDNA as the template.

Figure 5

A

B

C D

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 45: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

MAKER2

GeneMark-ES

Prediction Match

Prediction NOT Match

Reference Match

Reference NOT Match

Prediction Match

Prediction NOT Match

Reference Match

Reference NOT Match

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BRAKER1

BAP

BAP

BAP+

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BRAKER1

BAP+

-45 -27 -9 0 9 27 45Pearson residuals for chi-square

-35 -21 -7 0 7 21 35Pearson residuals for chi-square

Sensitivity F1 scoreSpecificity

Sensitivity F1 scoreSpecificity

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BRAKER1

BAP

BAP+

15 25 35 45 55

A

C

B

D

55 65 75 85 95

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BRAKER1

BAP

BAP+

ReferenceAnnotation

ReferenceAnnotation

Percent of exact match introns

Percent of exact match genes

Figure 6

Figure 6. Assessment of gene and intron predictions from P. crispa. (A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) compared using a chi-square test of independence as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) and (B) respectively.

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 46: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

MAKER2

GeneMark-ES

Prediction Match

Prediction NOT Match

Reference Match

Reference NOT Match

Prediction Match

Prediction NOT Match

Reference Match

Reference NOT Match

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

Augustus

BAP

BAP

BAP+

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

Augustus

BAP+

Pearson residuals for chi-square

Pearson residuals for chi-square

Sensitivity F1 scoreSpecificity

Sensitivity F1 scoreSpecificity

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

Augustus

BAP

BAP+

25 35 45 55

40 60 80 100

65 75

A

C

B

D

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

Augustus

BAP

BAP+ReferenceAnnotation

-105 -63 -21 0 21 63 105

-95 -57 -19 0 19 57 95

ReferenceAnnotation

Percent of exact match introns

Percent of exact match genes

Figure 7

Figure 7. Assessment of gene and intron predictions from A. thaliana. (A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) compared using a chi-square test of independence as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) and (B) respectively.

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 47: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

MAKER2

GeneMark-ES

Prediction Match

Prediction NOT Match

Reference Match

Reference NOT Match

Prediction Match

Prediction NOT Match

Reference Match

Reference NOT Match

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BRAKER1

BAP

BAP

BAP+

MAKER2

GeneMark-ES

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BRAKER1

BAP+

Pearson residuals for chi-square

Pearson residuals for chi-square

Sensitivity F1 scoreSpecificity

Sensitivity F1 scoreSpecificity

MAKER2

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BAP

BAP+

0 10 20 30

0 30 60 90

40

A

C

B

D

MAKER2

LoReAn

LoReAn-M

LoReAn-s

LoReAn-sM

BAP

BAP+

-105 -63 -21 0 21 63 105

-650-390

-130 0 130 390 650

ReferenceAnnotationReference

Annotation

Percent of exact match introns

Percent of exact match genes

Figure 8

GeneMark-ES

BRAKER1

GeneMark-ES

BRAKER1

Figure 8. Assessment of gene and intron predictions from O. sativa. (A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) compared using a chi-square test of independence as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) and (B) respectively.

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 48: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

Parsed CitationsAbdel-Ghany SE, Hamilton M, Jacobi JL, Ngam P, Devitt N, Schilkey F, Ben-Hur A, Reddy ASN (2016) A survey of the sorghumtranscriptome using single-molecule long reads. Nature Communications 7: 11706

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Amemiya CT, Alföldi J, Lee AP, Fan S, Philippe H, Maccallum I, Braasch I, Manousaki T, Schneider I, Rohner N, et al (2013) The Africancoelacanth genome provides insights into tetrapod evolution. Nature 496: 311–316

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, van Bakel H, Schadt EE, Reijo-Pera RA, Underwood JG, et al (2013)Characterization of the human ESC transcriptome by hybrid sequencing. Proc Natl Acad Sci USA 110: E4821–30

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E (2015) The Arabidopsis information resource: Making and miningthe "gold standard" annotated reference plant genome. Genesis 53: 474–485

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M (2008) MAKER: an easy-to-useannotation pipeline designed for emerging model organism genomes. Genome Res 18: 188–196

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Chan K-L, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low E-TL (2017) Seqping: gene prediction pipeline for plant genomesusing self-training gene models and transcriptomic data. BMC Bioinformatics 18: 1426–7

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Chen F, Mackey AJ, Stoeckert CJ, Roos DS (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of orthologgroups. Nucleic Acids Res 34: D363–8

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O'Malley R, Figueroa-Balderas R, Morales-Cruz A, etal (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13: 1050–1054

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Cook DE, Mesarich CH, Thomma BPHJ (2015) Understanding plant immunity as a surveillance system to detect invasion. Annu RevPhytopathol 53: 541–563

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Davey JW, Chouteau M, Barker SL, Maroja L, Baxter SW, Simpson F, Joron M, Mallet J, Dasmahapatra KK, Jiggins CD (2016) MajorImprovements to the Heliconius melpomene Genome Assembly Used to Confirm 10 Chromosome Fusion Events in 6 Million Years ofButterfly Evolution. G3 (Bethesda) 6: 695–708

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

de Jonge R, van Esse HP, Maruthachalam K, Bolton MD, Santhanam P, Saber MK, Zhang Z, Usami T, Lievens B, Subbarao KV, et al(2012) Tomato immune receptor Ve1 recognizes effector of multiple fungal pathogens uncovered by genome and RNA sequencing.Proc Natl Acad Sci USA 109: 5110–5115

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

de Mendiburu F (2016) agricolae: Statistical Procedures for Agricultural Research. httpsCRAN.R-project.orgpackageagricolae,https://CRAN.R-project.org/package=agricolae

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universalRNA-seq aligner. Bioinformatics 29: 15–21

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Faino L, Seidl MF, Datema E, van den Berg GCM, Janssen A, Wittenberg AHJ, Thomma BPHJ (2015) Single-Molecule Real-TimeSequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome. mBio 6: e00936–15

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 49: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

Faino L, Seidl MF, Shi-Kunne X, Pauper M, van den Berg GCM, Wittenberg AHJ, Thomma BPHJ (2016) Transposons passively andactively contribute to evolution of the two-speed genome of a fungal pathogen. Genome Res 26: 1091–1100

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Faino L, Thomma BPHJ (2014) Get your high-quality low-cost genome sequence. Trends in Plant Science 19: 288–291Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Fradin EF, Thomma BPHJ (2006) Physiology and molecular aspects of Verticillium wilt diseases caused by V. dahliae and V. albo-atrum.Mol Plant Pathol 7: 71–86

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Goodswen SJ, Kennedy PJ, Ellis JT (2012) Evaluating high-throughput ab initio gene finders to discover proteins encoded ineukaryotic pathogen genomes missed by laboratory techniques. PLoS ONE 7: e50609

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Gordon SP, Tseng E, Salamov A, Zhang J, Meng X, Zhao Z, Kang D, Underwood J, Grigoriev IV, Figueroa M, et al (2015) WidespreadPolycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10: e0132628

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al (2011) Full-lengthtranscriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Gremme G, Steinbiss S, Kurtz S (2013) GenomeTools: a comprehensive software library for efficient processing of structured genomeannotations. IEEE/ACM Trans Comput Biol Bioinform 10: 645–656

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR (2008) Automated eukaryotic gene structureannotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9: R7

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Haas BJ, Zeng Q, Pearson MD, Cuomo CA, Wortman JR (2011) Approaches to Fungal Genome Annotation. Mycology 2: 118–141Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation withGeneMark-ET and AUGUSTUS. Bioinformatics 32: 767–769

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genomeprojects. BMC Bioinformatics 12: 491

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Huang X, Adams MD, Zhou H, Kerlavage AR (1997) A tool for analyzing and annotating genomic sequences. Genomics 46: 37–45Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Jiao W-B, Schneeberger K (2017) The impact of third generation genomic technologies on plant genome assembly. Current Opinion inPlant Biology 36: 64–70

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Keibler E, Brent MR (2003) Eval: a software package for analysis of genome annotations. BMC Bioinformatics 4: 50Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Klosterman SJ, Atallah ZK, Vallad GE, Subbarao KV (2009) Diversity, pathogenicity, and management of verticillium species. Annu RevPhytopathol 47: 39–62

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Kohler A, Kuo A, Nagy LG, Morin E, Barry KW, Buscot F, Canbäck B, Choi C, Cichocki N, Clum A, et al (2015) Convergent losses ofdecay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nature Genetics 47: 410–415

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title https://plantphysiol.orgDownloaded on February 7, 2021. - Published by

Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 50: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly.Curr Opin Microbiol 23: 110–120

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Križanovic K, Echchiki A, Roux J, Šikic M (2017) Evaluation of tools for long read RNA-seq splice-aware alignment. Bioinformatics. doi:10.1093/bioinformatics/btx668

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Laehnemann D, Borkhardt A, McHardy AC (2016) Denoising DNA deep sequencing data-high-throughput sequencing errors and theircorrection. Brief Bioinformatics 17: 154–179

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al(2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40: D1202–D1210

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Lamichhaney S, Fan G, Widemo F, Gunnarsson U, Thalmann DS, Hoeppner MP, Kerje S, Gustafson U, Shi C, Zhang H, et al (2016)Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nature Genetics 48: 84–88

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Laver T, Harrison J, O'Neill PA, Moore K, Farbos A, Paszkiewicz K, Studholme DJ (2015) Assessing the performance of the OxfordNanopore Technologies MinION. Biomol Detect Quantif 3: 1–8

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Linde J, Duggan S, Weber M, Horn F, Sieber P, Hellwig D, Riege K, Marz M, Martin R, Guthke R, et al (2015) Defining the transcriptomiclandscape of Candida glabrata by RNA-Seq. Nucleic Acids Res 43: 1392–1406

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Loman NJ, Quick J, Simpson JT (2015) A complete bacterial genome assembled de novo using only nanopore sequencing data. NatMethods 12: 733–735

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Loman NJ, Quinlan AR (2014) Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics 30: 3399–3401Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene findingalgorithm. Nucleic Acids Res 42: e119–e119

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ma L, Chen Z, Huang DW, Kutty G, Ishihara M, Wang H, Abouelleil A, Bishop L, Davey E, Deng R, et al (2016) Genome analysis of threePneumocystis species reveals adaptation mechanisms to life exclusively in mammalian hosts. Nature Communications 7: 10740

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ming R, VanBuren R, Wai CM, Tang H, Schatz MC, Bowers JE, Lyons E, Wang M-L, Chen J, Biggers E, et al (2015) The pineapplegenome and the evolution of CAM photosynthesis. Nature Genetics 47: 1435–1442

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M, Sörensen TR, Weisshaar B, Himmelbauer H (2015)Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol 16: 184

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Muñoz JF, Gauthier GM, Desjardins CA, Gallo JE, Holder J, Sullivan TD, Marty AJ, Carmen JC, Chen Z, Ding L, et al (2015) The DynamicGenome and Transcriptome of the Human Fungal Pathogen Blastomyces and Close Relative Emmonsia. PLoS Genet 11: e1005493

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Phillippy AM (2017) New advances in sequence assembly. Genome Res 27: xi–xiiihttps://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Page 51: Long Read Annotation (LoReAn): automated eukaryotic …Nov 06, 2018  · 1 1 Short title: Genome annotation with LoReAn 2 3 Long Read Annotation (LoReAn): automated eukaryotic genome

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Presti Lo L, Lanver D, Schweizer G, Tanaka S, Liang L, Tollot M, Zuccaro A, Reissmann S, Kahmann R (2015) Fungal effectors and plantsusceptibility. Annu Rev Plant Biol 66: 513–545

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Reid I, O'Toole N, Zabaneh O, Nourzadeh R, Dahdouli M, Abdellateef M, Gordon PMK, Soh J, Butler G, Sensen CW, et al (2014)SnowyOwl: Accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models. BMCBioinformatics 15: 229

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Smith CD, Zimin A, Holt C, Abouheif E, Benton R, Cash E, Croset V, Currie CR, Elhaik E, Elsik CG, et al (2011) Draft genome of theglobally widespread and invasive Argentine ant (Linepithema humile). Proc Natl Acad Sci USA 108: 5673–5678

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Smith JJ, Kuraku S, Holt C, Sauka-Spengler T, Jiang N, Campbell MS, Yandell MD, Manousaki T, Meyer A, Bloom OE, et al (2013)Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature Genetics 45: 415–21– 421e1–2

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Sperschneider J, Dodds PN, Gardiner DM, Manners JM, Singh KB, Taylor JM (2015) Advances and challenges in computationalprediction of effectors from plant pathogenic fungi. PLoS Pathog 11: e1004806

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novogene finding. Bioinformatics 24: 637–644

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Team RC (2016) R: A Language and Environment for Statistical Computing. https://www.R-project.org/Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initioalgorithm with unsupervised training. Genome Res 18: 1979–1990

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden Markov model gene prediction in fungalgenomes using RNA-seq transcripts. BMC Genomics 16: 170

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Thomma BPHJ, Seidl MF, Shi-Kunne X, Cook DE, Bolton MD, van Kan JAL, Faino L (2016) Mind the gap; seven reasons to closefragmented genome assemblies. Fungal Genet Biol 90: 24–30

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D (2016) Unveiling the complexity of the maizetranscriptome by single-molecule long-read sequencing. Nature Communications 7: 11708

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57–63Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859–1875

Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Yandell M, Ence D (2012) A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13: 329–342Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title https://plantphysiol.orgDownloaded on February 7, 2021. - Published by

Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.