Short title: Genome annotation with LoReAn

Long Read Annotation (LoReAn): automated eukaryotic genome
annotation based on long-read cDNA sequencing

David E. Cook1†, Jose Espejo Valle-Inclan1†, Alice Pajoro2, Hanna Rovenich1, Bart PHJ
Thomma1*, and Luigi Faino1*

1Laboratory of Phytopathology, Wageningen University and Research, Droevendaalsesteeg 1,
6708 PB Wageningen, The Netherlands
2Laboratory of Molecular Biology, Wageningen University and Research, Droevendaalsesteeg 1,
6708 PB Wageningen, The Netherlands

†These authors contributed equally to this work.
*These authors contributed equally to this work.

One sentence summary
The Long Read Annotation (LoReAn) pipeline provides automated genome annotation with
superior performance by incorporating single-molecule cDNA sequencing data.

Footnotes:
List of author contributions
LF and BT conceived the project. DEC performed data collection for the Illumina sequencing, and
JVI performed cDNA normalization and sequencing on the MinION with help from DEC. AP
performed the Arabidopsis short- and long-read experiments. HR performed experiments to
confirm the annotation results for the Ave1 locus. LF and JVI wrote the LoReAn Python script. LF
ran the annotations, and LF and DEC performed the analysis. DEC wrote the paper with LF and
BT. Funding, guidance and oversight of the project were provided by BT.
Plant Physiology Preview. Published on November 6, 2018, as DOI:10.1104/pp.18.00848
Copyright 2018 by the American Society of Plant Biologists
https://plantphysiol.org Downloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

Funding Information
Funding was provided to DEC by the Human Frontiers in Science Program (HFSP) Long-term
fellowship (LT000627/2014-L). Work in the laboratory of BPHJT is supported by a Vici grant of
the Research Council for Earth and Life Sciences (ALW) of the Netherlands Organization for
Scientific Research (NWO). The Arabidopsis long read sequencing was performed within the
ZonMw-project number 435003020 entitled “Arabidopsis transcript isoform identification using
PacBio sequencing technology”.

Present Address: David E. Cook, Department of Plant Pathology, Kansas State University,
Manhattan KS 66056, USA; Jose Espejo Valle-Inclan, Department of Genetics, Center for
Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The
Netherlands; Alice Pajoro, Department of Plant Developmental Biology, Max Planck Institute for
Plant Breeding Research, 50829 Köln, Germany; Hanna Rovenich, Botanical Institute, Cluster of
Excellence on Plant Sciences (CEPLAS), University of Cologne, 50674 Cologne, Germany; Luigi
Faino, Department of Environmental Biology, University La Sapienza, P.le Aldo Moro 5, 00185,
Rome, Italy

* To whom correspondence should be addressed. Email: [email protected]
and [email protected]

ABSTRACT
Single-molecule full-length cDNA sequencing can aid genome annotation by revealing transcript
structure and alternative splice forms, yet current annotation pipelines do not incorporate such
information. Here we present Long Read Annotation (LoReAn) software, an automated
annotation pipeline utilizing short- and long-read cDNA sequencing, protein evidence, and ab
initio prediction to generate accurate genome annotations. Based on annotations of two fungal
genomes (Verticillium dahliae and Plicaturopsis crispa) and two plant genomes (Arabidopsis
thaliana and Oryza sativa), we show that LoReAn outperforms popular annotation pipelines by
integrating single-molecule cDNA sequencing data generated from either the Pacific Biosciences
(PacBio) or MinION sequencing platforms, correctly predicting gene structure, and capturing
genes missed by other annotation pipelines.

INTRODUCTION
Genome sequencing has advanced nearly every discipline within the biological sciences, as
steadily decreasing sequencing costs and increasing computational capacity allow many
laboratories to pursue genomics-based answers to biological questions. New sequencing
technologies designed to sequence longer contiguous DNA molecules, such as Pacific
Biosciences’ (PacBio) Single Molecule Real Time sequencing (SMRT) and Oxford Nanopore
Technologies’ (ONT) MinION, have ushered in the most recent genomics revolution (Koren and
Phillippy, 2015). These advances are further enhancing the ability to generate high-quality
genome assemblies of large, complex eukaryotic genomes (Faino et al., 2015; Chin et al., 2016;
Davey et al., 2016; Jiao and Schneeberger, 2017).

A high-quality genome assembly, represented by (near-)chromosome completion, helps
address many biological questions but often requires functional features to be further defined
(Thomma et al., 2016). The process of genome annotation, i.e. the identification of protein-coding
genes and their structural features, such as intron-exon boundaries, is important for capturing
the biological value of a genome assembly (Yandell and Ence, 2012). Genomes can be annotated
using computer algorithms in so-called ab initio gene predictions and using wet-lab generated
data, such as cDNA or protein datasets, for evidence-based predictions, and current annotation
pipelines typically incorporate both types of data (Cantarel et al., 2008; Yandell and Ence, 2012).
Ab initio gene prediction tools are based on statistical models, most often Hidden Markov Models
(HMMs), which are trained using known proteins, and typically perform well at predicting
conserved or core genes (Goodswen et al., 2012; Yandell and Ence, 2012). However, the ab
initio prediction accuracy decreases for organism-specific genes, for genes encoding small
proteins and across organisms with differing intron-exon features (Ter-Hovhannisyan et al., 2008;
Hoff et al., 2016). Furthermore, ab initio annotation of non-model genomes remains challenging
as appropriate training data is not always available and genome characteristics across organisms
can vary (Reid et al., 2014). To improve genome annotations, cDNA sequencing (RNA-seq) data
can be incorporated to train ab initio software (Hoff et al., 2016) and to provide additional
evidence for defining accurate gene models (Wang et al., 2009). Most genome annotations to
date rely on a combination of short-read mapping data and ab initio gene prediction. However,
errors occur during this process because short-read RNA-seq data cannot always be
unequivocally mapped, because a single read does not span a gene’s full length, and because of
differences in evidence weighting leading to gene prediction. LoReAn was developed to also use
information from long-read sequencing data to help address issues of mapping and gene
structure with greater emphasis given to empirical mapping data during the gene prediction
process.

Current annotation pipelines use a combination of ab initio and evidence-based
predictions to generate accurate consensus annotations. MAKER2 is a user-friendly, fully
automated annotation pipeline that incorporates multiple sources of gene prediction information
and has been extensively used to annotate eukaryotic genomes (Cantarel et al., 2008; Holt and
Yandell, 2011; Smith et al., 2011; Amemiya et al., 2013; Smith et al., 2013; Ming et al., 2015;
Lamichhaney et al., 2016). The Broad Institute Eukaryotic Genome Annotation Pipeline (here
referred to as BAP) has mainly been used to annotate fungal genomes (Linde et al., 2015; Muñoz
et al., 2015; Ma et al., 2016) and integrates multiple programs and sources of evidence for genome
annotation (Haas et al., 2008; Haas et al., 2011). BRAKER1 and CodingQuarry are two gene
prediction software packages that utilize RNA-seq data and genome sequence to predict gene
models. BRAKER1 is a pipeline for unsupervised RNA-seq-based genome annotation that
combines the advantages of GeneMark-ET and Augustus. It proceeds in two steps: GeneMark-ET
is combined with intron hints, derived from mapped RNA-seq, to generate a
species-specific database, which Augustus then uses for gene prediction (Hoff et al., 2016). CodingQuarry is a
pipeline for RNA-seq assembly-supported training and gene prediction, which is only
recommended for application to fungi. In this tool, RNA-seq data assembled with Cufflinks are used to build
an HMM, which is used for gene prediction (Testa et al., 2015). A limitation of these
annotation pipelines is that experimental evidence from short-read RNA-seq mapping can be lost
due to evidence weighting, and the pipelines cannot natively exploit gene structure information
from single-molecule cDNA sequencing.

In addition to improving genome assembly (Phillippy, 2017), long-read sequencing data
can be used to improve genome annotation. The use of single-molecule cDNA sequencing can
increase the accuracy of automated genome annotation by improving genome mapping of
sequencing data, correctly identifying intron-exon boundaries, directly identifying alternatively
spliced transcripts, identifying transcription start and end sites, and providing precise strand
orientation to single-exon genes (Minoche et al., 2015; Abdel-Ghany et al., 2016; Wang et al.,
2016). However, several hurdles limit the implementation of long-read sequencing data into
automated genome annotation, such as the higher per-base costs when compared to short-read
data, the relatively high error rates for long-read sequencing technologies, and the lack of
bioinformatics tools to integrate long-read data into current annotation pipelines (Faino and
Thomma, 2014; Laver et al., 2015). The first two limitations are addressed by the continual
reduction in sequencing cost and improved base calling by long-read sequencing providers and
the development of bioinformatics methods to correct for sequencing errors (Loman et al., 2015;
Laehnemann et al., 2016). To address the disconnect between genome annotation pipelines and
the latest sequencing technologies, we developed the Long Read Annotation (LoReAn) pipeline.
LoReAn is an automated annotation pipeline that takes full advantage of MinION or PacBio
SMRT long-read sequencing data in combination with protein evidence and ab initio gene
predictions for full genome annotation. Short-read RNA-seq can be used in LoReAn to train ab
initio software. Based on the re-annotation of two fungal and two plant species, we demonstrate
that LoReAn can provide annotations with increased accuracy by incorporating single-molecule
cDNA sequencing data from different sequencing platforms.

RESULTS
LoReAn design and implementation
The LoReAn pipeline can be conceptualized in two phases. The first phase involves genome
annotation based on ab initio and evidence-based predictions (Figure 1A: blue arrows) and
largely follows the workflow previously described in the BAP (Haas et al., 2008; Haas et al.,
2011). This first phase produces a full-genome annotation and requires the minimum input of a
reference genome, protein sequences of known and, possibly, related species, and a species
name from the Augustus prediction software database (Stanke et al., 2008). We implemented two
changes into the first phase of LoReAn, which we refer to as BAP+. First, LoReAn used RNA-seq
reads as input in combination with the BRAKER1 software (Hoff et al., 2016) to produce a
species-specific database for the Augustus prediction software. Second, RNA-seq data was
assembled into full-length cDNA using Trinity software (Grabherr et al., 2011), and the assembled
transcripts were aligned to the genome using both the Program to Assemble Spliced Alignments
(PASA; Haas et al., 2008) and the Genomic Mapping and Alignment Program (GMAP; Wu and
Watanabe, 2005). The output of PASA software was passed to the Evidence Modeler (EVM)
(Haas et al., 2008) as cDNA evidence while the output of GMAP was given to EVM as ab initio
evidence.

The second phase of LoReAn incorporates single-molecule cDNA sequencing with the
annotation results of the first phase by utilizing an alternative approach to reconstruct full-length
transcripts (Figure 1A: red arrows). Single-molecule long-read sequencing reads are mapped to
the genome using GMAP, which allows the determination of transcript structure (i.e. start, stop
and exon boundaries) from a single cDNA molecule (Križanovic et al., 2017). The underlying
reference sequence is extracted to overcome sequence errors associated with long-read
sequencing, and these sequences are combined with the gene models from the first phase in a
process we refer to as ‘clustered transcript reconstruction’ (Figure 1A and B). Through this
process, consensus gene models are built by combining the first and second phase gene models
that cluster at the same locus. Optionally, model clustering can be conducted in a strand-specific
manner (LoReAn stranded, see Supplemental text for details) where only gene models mapping
on the same DNA coding strand are used to build a consensus model. These high-confidence
models are mapped back to the reference using GMAP to correct open reading frames, and
subsequently, PASA is used to update the gene models by identifying untranslated regions
(UTRs) and alternatively spliced transcripts to generate a final annotation. Sequence-based
support for the final gene models (Figure 1B orange models) can come from the first phase
annotation alone (Figure 1B i), the second phase, given a sufficient level of support (Figure 1B ii,
iii), or through a combination of the two phases (Figure 1B iv, v). If a single consensus annotation
cannot be reached between the two phases, both annotations are kept in the final output (Figure
1B v).
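The locus-clustering idea behind clustered transcript reconstruction can be illustrated with a minimal Python sketch. This is not LoReAn's code: the `Model` record, the `cluster_by_locus` function and the simple interval-overlap rule are hypothetical and assumed only for illustration, and the consensus-building step itself is not shown. Coordinates are assumed to be 1-based and inclusive.

```python
from collections import namedtuple

# Hypothetical record for a gene model or mapped long read (not LoReAn's own type).
Model = namedtuple("Model", "name contig start end strand")


def cluster_by_locus(models, stranded=False):
    """Group models whose genomic intervals overlap on the same contig.

    With stranded=True, models on opposite strands are never placed in
    the same cluster, mimicking a strand-specific clustering mode.
    """
    def bucket(m):
        return (m.contig, m.strand) if stranded else (m.contig,)

    clusters = []
    current, current_bucket, current_end = [], None, None
    for m in sorted(models, key=lambda m: (bucket(m), m.start, m.end)):
        if current and bucket(m) == current_bucket and m.start <= current_end:
            # Overlaps the running cluster: extend it.
            current.append(m)
            current_end = max(current_end, m.end)
        else:
            # Start a new cluster.
            if current:
                clusters.append(current)
            current, current_bucket, current_end = [m], bucket(m), m.end
    if current:
        clusters.append(current)
    return clusters


# Made-up example: two first-phase models and two long reads.
models = [
    Model("evm1", "chr1", 100, 500, "+"),
    Model("read1", "chr1", 120, 480, "+"),
    Model("read2", "chr1", 450, 900, "-"),
    Model("evm2", "chr1", 2000, 2500, "+"),
]
unstranded = cluster_by_locus(models)
stranded = cluster_by_locus(models, stranded=True)
```

In this toy input the antisense read joins the first cluster in unstranded mode but forms its own cluster in stranded mode, which is the kind of separation the strand-specific option is described as providing.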

LoReAn produces high quality gene predictions based on gene, exon and intron metrics compared to the reference annotation and empirical data
To test the performance of LoReAn, we re-annotated the genome sequence of the haploid fungus
Verticillium dahliae, an important pathogen of hundreds of plant species including many crops
(Fradin and Thomma, 2006; Klosterman et al., 2009). The genome of V. dahliae strain JR2 was
used for testing LoReAn because it is assembled into complete chromosomes and has a
manually curated annotation, providing a high-confidence resource for reference (Faino et al.,
2015). A total of 55 annotations were generated and compared, of which 24 were produced using
LoReAn, 12 using BAP and 12 using BAP+ with different genome masking and ab initio options
(description in Supporting text), along with output from the annotation software MAKER2,
BRAKER1, Augustus and two each from CodingQuarry and GeneMark-ES. We determined the
quality of annotation outputs by comparing each to the reference annotation for exact matches to
either genes or exon locations (Figure 2A; Table S1 and S2; reference annotation contains
11,385 gene models and 28,142 exons). These comparisons were used to calculate sensitivity
(how much of the reference is correctly predicted), specificity (how much of the prediction is in the
reference), and F1 score (the harmonic mean of sensitivity and specificity). These metrics were
calculated based on commonly described methods used within the gene prediction community
(see Methods; Keibler and Brent, 2003; Yandell and Ence, 2012; Chan et al.,
2017). Hard masking, where short DNA repeat (>10 bp) sequences are removed from the
genome prior to annotation, significantly affected the quality of predicted gene and exon models,
with partially masked or non-masked genome inputs producing significantly improved annotations
(Figure S1A and S2A; Table S3 and S4). On average, the ‘fungus’ option (-f) of the ab initio
software GeneMark-ES produced the best gene and exon predictions (Figure S1B and S2B;
Table S3 and S4). Gene predictions from LoReAn using coding strand information (LoReAn-s)
produced statistically similar results to LoReAn for exact match genes and similar results to
LoReAn and BAP for exact match exons (Table S2 – S3). However, the F1 scores for exact match
gene and exon predictions were significantly higher for LoReAn stranded compared to the other
three outputs (Figure S1C and S2C), indicating that using strand information improves the overall
quality of the annotation.
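The three metrics defined above can be written down as a short Python sketch. It assumes, for illustration only, that each gene (or exon) is represented as a hashable tuple of its coordinates so that an exact match is simple set membership; the function name `exact_match_metrics` and the toy data are hypothetical and do not come from the LoReAn code base. Note that "specificity" as used here is the fraction of predictions found in the reference.

```python
def exact_match_metrics(predicted, reference):
    """Return (sensitivity, specificity, F1) for exact-match features.

    sensitivity = matched / number of reference features
    specificity = matched / number of predicted features
    F1          = harmonic mean of the two
    """
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    sensitivity = matched / len(reference) if reference else 0.0
    specificity = matched / len(predicted) if predicted else 0.0
    total = sensitivity + specificity
    f1 = 2 * sensitivity * specificity / total if total else 0.0
    return sensitivity, specificity, f1


# Made-up genes as (contig, strand, start, end); three of four reference
# genes are reproduced exactly, and two predictions have no exact match.
reference = {("chr1", "+", 100, 900), ("chr1", "-", 1500, 2300),
             ("chr2", "+", 50, 700), ("chr2", "+", 4000, 5200)}
predicted = {("chr1", "+", 100, 900), ("chr1", "-", 1500, 2300),
             ("chr2", "+", 50, 700), ("chr2", "+", 4010, 5200),
             ("chr3", "-", 10, 400)}

sens, spec, f1 = exact_match_metrics(predicted, reference)
```

With this toy input, sensitivity is 3/4 and specificity is 3/5; a prediction that differs from the reference by a single boundary, such as the gene starting at 4010, counts as unmatched under the exact-match criterion.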
A single output each from LoReAn and BAP was selected for further comparison to the
outputs from the other annotation software (i.e. MAKER2, CodingQuarry, GeneMark, BRAKER1
and Augustus). The LoReAn-stranded run using the ‘fungus’ option of GeneMark-ES (referred to
as LoReAn-sF throughout) and the BAP run using the ‘fungus’ option of GeneMark-ES (referred to
as BAP-F throughout), both using a non-masked genome, were selected because they had the highest
F1 scores and used similar settings to the other pipelines, thereby enabling comparisons (Figure
2A, highlighted by horizontal yellow lines). The seven annotation pipelines were compared by
computing the number of predicted genes that matched the reference and the number of
reference genes that were matched (i.e. numbers for specificity and sensitivity). A chi-square test
of independence indicated a significant, non-equal association between the seven annotation
pipelines and the tested annotation metrics (Pearson’s chi-square test of independence, χ² =
913.61, p-value < 2.2e-16). Plotting the residuals of the chi-square test indicated that LoReAn-sF
had the largest positive association for correctly predicting gene models (columns 1 and 3, Figure
2B) and the largest negative association for predicting wrong gene models or missing gene
models (columns 2 and 4, Figure 2B). These results show that there are statistically significant
differences between the tested pipelines for annotation quality, and the residual analysis indicates
that LoReAn-sF has the best association with desirable annotation metrics.
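For readers unfamiliar with residual analysis, the following Python sketch shows how a Pearson chi-square statistic and its per-cell residuals are obtained from a contingency table of counts. It is a generic textbook calculation, not the authors' analysis script; the function name `chi_square_residuals` and the layout (one row per pipeline, one column per count category) are assumptions made here, and every row and column is assumed to have a non-zero total. The p-value is omitted because it requires the chi-square distribution function, which is not in the Python standard library.

```python
def chi_square_residuals(table):
    """Pearson chi-square statistic, degrees of freedom and residuals.

    table: list of rows of observed counts (rows = pipelines,
    columns = count categories). Residual = (obs - exp) / sqrt(exp).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            r = (observed - expected) / expected ** 0.5
            residual_row.append(r)
            statistic += r * r
        residuals.append(residual_row)

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof, residuals


# Made-up 2 x 2 table, for illustration only.
stat, dof, res = chi_square_residuals([[10, 20], [20, 10]])
```

A positive residual marks a cell with more counts than expected under independence and a negative residual marks fewer, which is how the sign and size of the plotted residuals are read in the text.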
To further characterize gene prediction differences between annotation pipelines, four
outputs were selected for comparison. MAKER2 and CodingQuarry were selected as they
represent popular annotation choices and performed well, along with the output of LoReAn-sF
and BAP-F, as they are the focus of the study. The gene predictions were compared
head-to-head in the absence of a reference annotation by determining the overlap between exact match
genes. There were 4,584 genes with the same predicted structure (i.e. start, stop, intron position)
from the four pipelines, equivalent to approximately 40% of the genes in the reference annotation
(Figure 3A). BAP-F predicted the fewest unique genes (1,352), while MAKER2 predicted the most
(3,157) (Figure 3A). However, the use of exact match gene structure to identify unique coding
sequences is potentially misleading, as two gene predictions can code for the same or a similar
protein without the exact same structure. To generate a more biologically relevant comparison of
unique protein coding differences, we grouped translated protein sequences of each annotation
into homologous groups using orthoMCL (Li et al., 2003; Chen et al., 2006). Using these groups,
we identified protein coding sequences that were unique to a single annotation pipeline, referred
to as singletons. We identified 1,429 singletons across the four annotations, with CodingQuarry
predicting the most (461) and BAP-F predicting the fewest (180) (Figure 3D). To validate the
predicted singletons, the percent coverage of the predicted transcripts was calculated using
short-read RNA-seq data. Singletons from LoReAn-sF were covered on average across 80% of
their length by mapped RNA-seq reads, while the next highest was CodingQuarry with 63%
average coverage. A non-parametric statistical test for the rank of the mean coverage showed that
singletons predicted by LoReAn-sF had significantly higher RNA-seq coverage compared to the
other pipelines (Figure 3B, Kruskal-Wallis test). Analyzing the length of the predicted singletons
indicated that CodingQuarry predicted the shortest singletons, while the other pipelines were not
statistically different from one another (Figure 3C, Kruskal-Wallis test). The singletons were
further grouped by the presence/absence of introns and RNA-seq coverage, as gene models with
introns and RNA-seq support are more likely to be true genes. Singletons that contained at least
one intron and had RNA-seq reads covering at least 75% of their length were considered the
highest-confidence models. Pairwise two-sample tests for equal proportions indicated the outputs
of LoReAn-sF and MAKER2 were statistically similar for the proportion of singletons in this high
confidence group, but both were significantly higher than the other two pipelines (Table S5). The
collective, comprehensive comparison of LoReAn versus other commonly used annotation
software shows that LoReAn outperforms them on many gene prediction quality metrics.
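The percent-coverage measure used to validate singletons can be sketched in Python as follows. This is a simplified illustration and not the procedure from the paper's methods: the function name `covered_fraction` is hypothetical, coordinates are assumed to be 1-based and inclusive, and the transcript is treated as one contiguous interval, so a spliced transcript would need the same calculation applied exon by exon.

```python
def covered_fraction(transcript, reads):
    """Fraction of a transcript interval covered by at least one read.

    transcript: (start, end); reads: iterable of (start, end) alignments.
    Overlapping or adjacent read intervals are merged before counting.
    """
    t_start, t_end = transcript
    # Keep only reads that touch the transcript, clipped to its bounds.
    clipped = sorted(
        (max(s, t_start), min(e, t_end))
        for s, e in reads
        if e >= t_start and s <= t_end
    )

    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end + 1:
            # Gap before this read: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start + 1

    return covered / (t_end - t_start + 1)


# Made-up example: a 100 bp transcript and three mapped reads.
fraction = covered_fraction((101, 200), [(90, 130), (120, 150), (181, 250)])
high_confidence = fraction >= 0.75  # coverage threshold mentioned in the text
```

Here positions 101 to 150 and 181 to 200 are covered, giving a fraction of 0.7, so this toy transcript would fall below the 75% coverage threshold used for the highest-confidence group.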

LoReAn predicts the most accurate intron locations compared to empirical RNA-seq predictions
Annotation comparisons are commonly made against a reference annotation. This carries
inherent problems, however, as many organisms do not have a high-confidence reference
annotation. Additionally, the overlap between the software used to generate a given reference
annotation and that being tested introduces bias in the analysis. To evaluate the annotation
outputs produced here in the absence of a reference annotation, we devised an approach to
quantify annotation quality based on empirical RNA-seq output rather than a reference
annotation. The locations of predicted introns from the annotation outputs were compared to the
locations of the inferred introns from long- and short-read mapped data, using the same
annotation outputs compared earlier for gene and exon locations (Figure 4A). This analysis also
allowed the reference annotation to be compared to the intron locations inferred from the
RNA-seq data. To formally test for statistical differences, the same annotation pipelines compared
earlier were compared using a chi-square test of independence, which indicated a significant,
non-equal association between the annotation pipelines and intron predictions (Pearson’s
chi-square test of independence, χ² = 3220.7, p-value < 2.2e-16). The residual plot of the chi-square
test shows the magnitude and direction of the association between the pipelines and the
predictions (Figure 4B). Multiple two-sample tests of proportions between the annotation
predictions and the reference annotation, testing if the annotation prediction showed improved
metrics when compared with the reference, showed that only CodingQuarry outperformed the
reference annotation for the specificity metric (Table S6). This indicated the reference annotation
was the most similar to the RNA-seq data, which is not surprising given the high-quality reference
annotation. The analysis was re-run to test whether LoReAn-sF outperformed the other pipelines,
and LoReAn-sF outperformed the other software in nearly every instance (Table 1). This further
indicates that the LoReAn software offers improved annotation performance when evaluated
against RNA-seq data in the absence of a reference annotation.
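A one-sided two-sample test of proportions of the kind referred to above can be sketched in Python. The text does not state which implementation or correction was used, so this pooled z-test without continuity correction is a stand-in chosen for illustration; the function name `proportion_ztest_greater` and the counts are hypothetical, and the pooled proportion is assumed to lie strictly between 0 and 1.

```python
import math


def proportion_ztest_greater(x1, n1, x2, n2):
    """One-sided pooled z-test of H1: x1/n1 > x2/n2.

    x1, x2: numbers of successes (e.g. correctly placed introns);
    n1, n2: numbers of trials (e.g. predicted introns).
    Returns (z, p_value), where p_value is the upper-tail probability.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Upper tail of the standard normal: P(Z > z) = erfc(z / sqrt(2)) / 2.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up counts: 80 of 100 versus 60 of 100 matching introns.
z, p = proportion_ztest_greater(80, 100, 60, 100)
```

When many pipelines are compared pairwise in this way, a multiple-testing adjustment of the resulting p-values is usually applied; that step is not shown in this sketch.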

Only the LoReAn pipeline correctly annotates the Ave1 effector locus
Plant-pathogenic fungi encode in planta-secreted proteins, termed effectors, which serve to
facilitate infection (Cook et al., 2015; Presti et al., 2015). Effectors are generally characterized as
lineage-specific, small, secreted, cysteine-rich proteins, usually without characterized protein
domains or homology, characteristics that can make effectors difficult to predict with automated
annotation (Sperschneider et al., 2015). To test how LoReAn and the other annotation pipelines
performed at a specific effector locus, we detailed the annotation results for the V. dahliae Ave1
locus, which encodes a small secreted protein that functions to increase virulence during plant
infection (de Jonge et al., 2012). As previously reported, a considerable number of short RNA-seq
reads uniquely map to the Ave1 locus (de Jonge et al., 2012), along with single-molecule cDNA
reads identified here (Figure 5A). Interestingly, MAKER2, BAP, Augustus, GeneMark-ES and
CodingQuarry (default) each failed to predict the previously characterized Ave1 gene despite the
abundance of uniquely mapped reads (Figure 5B; Figure S3). The MAKER2, BAP and Augustus
pipelines predicted a separate gene on the opposite strand located to the 3’ end of the Ave1 gene
that is absent in the reference annotation. The CodingQuarry software run with the ‘effector’
option predicted the correct coding sequence model for Ave1. The LoReAn-sF, LoReAn-s, PASA,
GMAP and BAP+ software predicted two genes at the locus, one corresponding to the known
Ave1 gene, and an additional gene to the 3’ end of Ave1 (called Ave1c), similar to the gene
model identified by MAKER2 (Figure 5B; Figure S3).

LoReAn-sF additionally predicted two mRNAs corresponding to the previously
characterized Ave1 gene, termed isoform-1 and -2 (Figure 5B). To confirm the presence of two
Ave1 isoforms, cDNAs were amplified and cloned into vectors, and 18 clones were randomly
selected for sequencing. A majority of the sequenced transcripts, 15 of 18, had a sequence
corresponding to isoform-1, the known Ave1 transcript, while the other 3 were the isoform-2
sequence (Figure 5C). The isoform-2 transcript is the result of an alternative splice junction 5 bp
upstream of the previously identified splice site in the Ave1 5’ UTR intron and is not predicted to
alter the protein coding sequence. The accuracy of the new gene prediction at the Ave1 locus
(two Ave1 isoforms and one additional gene model) was additionally tested by showing the
expression of the Ave1c gene. Two sets of primers (Ave1 and Ave1c fw and rev) amplified bands
of the expected sizes, confirming the expression of both genes across various V. dahliae growth
conditions (Figure 5D, top amplification panel). We also attempted to amplify a specific product
from both gDNA and cDNA to confirm the orientation and rule out a transcriptional fusion.
Consistent with the annotation, the amplification using a gDNA template was successful, while
the cDNA template failed to amplify a product (Figure 5D, bottom amplification panel, primers
Ave1 fw + Ave1c fw). Collectively, these results confirm that LoReAn predicts the most accurate
gene models at the Ave1 locus, including a splice variant of Ave1.

LoReAn produces the most accurate annotation of a second fungal genome using PacBio Iso-seq reads
The basidiomycete Plicaturopsis crispa, mostly known for its wood-degrading abilities, has a
relatively complex transcriptome with a high number of exons per gene (5.6 exons per gene
compared to V. dahliae’s 2.5 exons per gene; Gordon et al., 2015). A total of nine annotations of
the P. crispa genome from five pipelines were generated using publicly available short-read
Illumina RNA-seq and single-molecule PacBio Iso-seq data (Kohler et al., 2015). The LoReAn
annotations predicted the greatest number of genes, transcripts and exons, while BAP and BAP+
had the greatest number of genes, transcripts and exons exactly matching the reference (Table 2;
Table S7). The F1 scores for exact match genes were highest for the BAP outputs, but overall
similar between the annotations (Figure 6A). Testing the exact match gene proportions used for
sensitivity and specificity indicated significant association between annotation pipelines and the
metrics (Figure 6B, Pearson’s chi-square test of independence, χ² = 3220.7, p-value < 2.2e-16).
LoReAn-sF scored significantly higher than MAKER2 for sensitivity and specificity of exact match
gene proportions and higher than GeneMark-ES for exact match sensitivity (Table S8). However,
these comparisons depended on the starting reference annotation.

To better understand the differences between the outputs, we applied the empirical intron
analysis and calculated the annotation quality metrics (Figure 6C). The sensitivity and specificity
proportions for exact match introns indicated significant differences across the pipelines (Figure
6D, Pearson’s chi-square test of independence, χ² = 8583, p-value < 2.2e-16). Chi-square
tests of proportions comparing the individual annotation outputs to the reference annotation
showed that LoReAn in stranded mode using a masked or non-masked genome produces
significantly better intron location estimates than the current reference (Table 2). LoReAn
stranded outperformed all other pipelines for exact intron specificity and sensitivity (Table S9).
These results indicate the LoReAn pipeline produces an improved annotation compared to the
current reference and produces results as good as and, under some metrics, better than other
annotation options.

LoReAn produces high quality annotations for larger plant genomes using PacBio Iso-seq data
To further test LoReAn, we re-annotated the 135 megabase (Mb) Arabidopsis (Arabidopsis 372
thaliana) and 375 Mb rice (Oryza sativa) genomes using PacBio Iso-seq data. These genomes
are larger and contain a higher percentage of repetitive elements than the two fungal genomes 374
tested. The Arabidopsis annotations generated here were compared to the reference annotation, 375
TAIR10, which is highly curated and represents one of the most complete plant genome 376
annotations (Lamesch et al., 2012; Berardini et al., 2015). The LoReAn outputs using a non-masked
genome had the highest number of genes and transcripts exactly matching the
reference, while BAP+ had the highest number of exact match exons (Table S10). We conducted 379
similar analyses as used for the fungal genomes for exact match genes compared to the 380
reference annotation (Figure 7A and B). There was a significant difference across the pipelines 381
for sensitivity and specificity of exact match genes (Figure 7B, Pearson’s chi-square test of 382
independence, χ² = 22393, p-value < 2.2e-16). Two-sample proportion testing showed LoReAn-sF
outperformed MAKER2, GeneMark-ES and Augustus for predicting genes based on exact match
analysis (Table S11). Similar results were seen for the inferred intron analysis (Figure 7C and D). 385
The pipelines were not equal for exact intron sensitivity and specificity (Figure 7D, Pearson’s chi-square
test of independence, χ² = 38926, p-value < 2.2e-16). The results of the pipelines were
compared to those from the reference annotation, and LoReAn using a masked genome in 388
standard or stranded mode and MAKER2 outperformed the current reference annotation for 389
matching intron location specificity (Table 3). None of the pipelines outperformed the reference 390
annotation for sensitivity, reflecting the high quality of the TAIR10 annotation. 391
The same analysis procedures for the rice genome showed LoReAn performed as well or 392
better than the other tested pipelines for gene and intron predictions (Figure 8, Table S12). There 393
was a significant difference between the pipelines for exact match gene sensitivity and specificity 394
(Pearson’s chi-square test of independence, χ² = 50019, p-value < 2.2e-16) with the LoReAn
outputs having strong associations to positive annotation metrics (Figure 8B). Indeed, pairwise 396
two sample proportion testing of the other pipelines against LoReAn-s showed that LoReAn-s 397
was significantly better for both gene prediction sensitivity and specificity (Table S13). LoReAn 398
also performed well for predicting exact intron locations (Figure 8C and D), and again, there was 399
a significant difference across the pipelines for intron sensitivity and specificity (Pearson’s chi-square
test of independence, χ² = 984030, p-value < 2.2e-16). All LoReAn pipelines outperformed
the reference annotation for intron match specificity, and LoReAn using the full-genome and 402
strand information outperformed the reference annotation for intron match sensitivity (Table 4). 403
These data show that LoReAn provides robust results across genomes of varying features and 404
sizes and, in many instances, outperformed other currently used annotation software. 405
406
DISCUSSION 407
High throughput sequencing continues to have profound impacts on biological systems and the 408
questions researchers are addressing. The technical improvements and associated reduction in 409
cost have resulted in a deluge of high quality model and non-model genomes from across the 410
kingdoms of life. To capture the value of these assembled genomes, equal advances are needed 411
in defining the functional elements of the genome. One such technical advance is the ability to 412
sequence full-length single-molecule cDNAs that directly contain information on transcript 413
structure and alternative forms. This information has previously helped identify alternatively 414
spliced transcripts (Au et al., 2013; Abdel-Ghany et al., 2016), but single-molecule long-reads 415
have not been systematically incorporated into annotation pipelines. The newly developed 416
LoReAn pipeline integrates both short-read RNA-seq and long-read single-molecule cDNA 417
sequencing with ab initio gene prediction to generate high accuracy gene predictions. In total, 418
three separate analyses using a reference annotation, head-to-head comparison, and 419
comparison to empirical data indicate that LoReAn produces high quality annotations of the four 420
genomes tested. These results show that LoReAn has improved performance for predicting gene 421
structures and intron locations. 422
Whereas several genome annotation tools use experimental data (i.e. RNA-seq) for gene 423
prediction, none of them fully utilize this information. This is apparent for genes such as Ave1, 424
where there is ample RNA-seq evidence supporting the gene model, but prediction software, 425
including MAKER2, GeneMark-ES and BAP, do not predict the gene. The annotation pipeline 426
CodingQuarry also does not predict the Ave1 transcript when run in the default mode but does 427
predict the transcript when run using the ‘effector’ option. LoReAn correctly predicts the Ave1 428
transcript, plus an additional new transcript at the locus. The ability to correctly annotate genes 429
with unique features or restricted taxonomic distribution, such as effectors, is relevant to many 430
biological questions and will aid comparative genomic studies. LoReAn was designed to
incorporate information from both short- and long-read RNA-seq data because we believe that, with
increasing sequencing depth, length and accuracy, this significant source of empirical evidence
will greatly improve gene prediction.
The technical and biological characteristics of a genome will impact the annotation 435
options needed to achieve high-quality gene predictions. Genome masking significantly affected 436
the gene prediction output of the V. dahliae annotation. From a technical aspect, genome 437
masking prior to annotation can have a large impact when annotating highly contiguously 438
assembled genomes. Fragmented genome assemblies often lack repetitive regions and are de 439
facto masked. Masking the telomere-to-telomere complete V. dahliae strain JR2 genome resulted 440
in gene predictions which were fragmented because of coding regions overlapping masked 441
regions. Our results indicate that genome masking of short repetitive DNA decreases the quality 442
of the genome annotation and that using a partially- (only masking repeats > 400 bp) or non-masked
genome may improve annotation results. From a biological perspective, our results show
that strand information had a significant impact on annotation quality for the V. dahliae genome. 445
Compact fungal genomes have genes with overlapping UTRs, which make gene prediction 446
difficult. Using strand information, LoReAn can assign transcripts to the correct coding strand and 447
avoid the prediction of fused genes. Additionally, strand information is used to assign single exon 448
genes to the correct strand. These results need to be confirmed on a greater number of genomes 449
with diverse characteristics before being fully generalizable. Collectively, our results suggest that 450
both technical and biological information, such as assembly completeness and coding sequence 451
overlap, can impact genome annotation quality and should be considered early during project 452
design. 453
Our results show that LoReAn can successfully use single-molecule cDNA sequencing 454
data from different platforms to produce high-quality genome annotations, similar to or better than 455
the current community references for four diverse genomes. This suggests robust performance of 456
LoReAn across sequencing platforms and for annotating genomes ranging from small fungal genomes of 35 Mb to the
rice genome of ~375 Mb. We speculate that the use of annotation software such as LoReAn,
which incorporates single-molecule cDNA sequencing into the annotation process, will 459
significantly improve genome annotation and aid in answering biological questions across all 460
domains of life. 461
462
MATERIALS AND METHODS
Growth conditions and RNA extraction 464
Verticillium dahliae strain JR2 (Faino et al., 2015) was maintained on potato dextrose agar (PDA) 465
plates grown at approximately 22 °C and stored in the dark. Conidiospores were collected from
two-week-old PDA plates using half-strength potato dextrose broth (PDB), and subsequently
1 × 10⁶ spores were inoculated in glass flasks containing 50 mL of either PDB, half-strength
Murashige and Skoog (MS) medium supplemented with 3% sucrose, or xylem sap collected from 469
greenhouse grown tomato (Solanum lycopersicum) plants of the cultivar Moneymaker. For 470
analysis of Ave1 transcription, V. dahliae strain JR2 was additionally grown in 50 mL of Czapek-dox
media following the manufacturer’s guidelines (Oxoid Microbiology Products, Thermo
Scientific). The cultures were grown for four days in the dark at 22 °C and 160 RPM. The cultures
were strained through miracloth (22 μm) (EMD Millipore, Darmstadt, Germany), pressed to 474
remove liquid, and flash frozen in liquid nitrogen. Next, the cultures were ground to powder with
a mortar and pestle using liquid nitrogen to ensure samples remained frozen. 476
RNA extraction was carried out using TRIzol (Thermo Fisher Scientific, Waltham, MA,
USA) following manufacturer guidelines. Following RNA re-suspension, contaminating DNA was
removed using the TURBO DNA-free kit (Ambion, Thermo Fisher Scientific, Waltham, MA, USA),
and the RNA was checked for integrity by separating 2 μL of each sample on a 2% agarose gel.
RNA samples were quantified using a Nanodrop (Thermo Fisher Scientific, Waltham, MA, USA)
and stored at -80 °C.
483
Library preparation and sequencing – Illumina 484
Each RNA sample from V. dahliae strain JR2 grown in PDB, half-strength MS, and xylem sap 485
was used to construct an Illumina sequencing library for RNA-sequencing by the Beijing 486
Genomics Institute (BGI) following manufacturer guidelines (Illumina Inc., San Diego, CA, USA). 487
Briefly, messenger RNA (mRNA) was enriched using oligo(dT) magnetic beads. The RNA was 488
then fragmented and double stranded cDNA synthesized following manufacturer guidelines 489
(Illumina Inc., San Diego, CA, USA). The fragments were then end-repaired and poly-adenylated 490
to allow for the addition of sequencing adapters, followed by fragment enrichment using 491
polymerase chain reaction (PCR) amplification. Library quality was assessed using the Agilent 492
2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Qualified libraries were 493
sequenced on an Illumina HiSeq-2000 (Illumina Inc., San Diego, CA, USA) at the Beijing 494
Genomics Institute. 495
496
cDNA synthesis and normalization, library preparation and sequencing – Oxford Nanopore 497
Technologies 498
For the synthesis of single-stranded cDNA, 1 μg of each RNA sample was reverse-transcribed 499
using the Mint-2 cDNA synthesis kit as described by the manufacturer (Evrogen, Moscow, 500
Russia), using the primers PlugOligo-1 (5’ end) and CDS-1 (3’ end). For each sample, 1 μl of 501
cDNA was amplified with PCR for 15 cycles (95 °C for 15 seconds, 66 °C for 20 seconds and 72 °C
for 3 minutes) to generate double-stranded cDNA, and purified with 1.8x volume Agencourt 503
AMPure XP magnetic beads (Beckman Coulter Inc., Indianapolis, IN, USA). 504
Three cDNA samples were normalized with the Trimmer-2 cDNA normalization kit 505
following the manufacturer’s guidelines (Evrogen, Moscow, Russia). The cDNA was precipitated, 506
denatured and hybridized for 5 hours. Next, the double stranded cDNA fraction was cleaved and 507
the remaining single stranded cDNA amplified with PCR for 18 cycles (95 °C for 15 seconds, 66 °C
for 20 seconds and 72 °C for 3 minutes).
Library preparation for the three samples was performed using the Nanopore Sequencing 510
Kit (v. SQK-MAP006) following the manufacturer’s guidelines (Oxford Nanopore Technologies 511
[ONT], Oxford, UK). The cDNA was end-repaired and dA-tailed using the NEBNext End Repair 512
and NEBNext dA-Tailing Modules following the manufacturer’s instructions (New England 513
BioLabs [NEB], Ipswich, MA, USA). The reactions were cleaned using an equal volume of 514
Agencourt AMPure XP magnetic beads (Beckman Coulter Inc., Indianapolis, IN, USA), followed 515
by ONT adapter ligation using Blunt/TA ligation Master Mix (NEB, Ipswich, MA, USA). The 516
adapter-ligated fragments were purified using Dynabeads MyOne Streptavidin C1 (Thermo Fisher
Scientific, Waltham, MA, USA).
Sequencing was performed on three different MinION flow cells (v. FLO-MAP103, ONT, 519
Oxford, UK). After priming the flow cells with sequencing buffer, 6 μl of the library preparation was 520
added. Additional library preparation (6 μl) was added to the flow cells at 3, 17 and 24 hours after 521
the run was started. Base-calling was performed using the Metrichor app (v. 2.39.1, ONT, Oxford, 522
UK), and Poretools (v. 0.5.1) was used to generate FASTQ files from the Metrichor produced 523
FAST5 files (Loman and Quinlan, 2014). 524
525
Software in LoReAn pipeline 526
LoReAn is implemented in Python3. Usage and parameters to run LoReAn, including default 527
settings, are detailed at https://github.com/lfaino/LoReAn/blob/master/OPTIONS.md. Mandatory 528
parameters are protein sequences of related organisms, a reference genome sequence and an 529
identification name for the species from the Augustus database. Other inputs are: short-reads (i.e.
Illumina RNA-seq), which may be single or paired-end; and long-reads from either MinION or 531
SMRT sequencing platforms. LoReAn outputs a GFF3 file with genome annotations. 532
The most convenient way to install and run LoReAn is by using the Docker 533
(https://www.docker.com/) image. Information about the software and how to use it can be found 534
at the https://github.com/lfaino/LoReAn repository. Additional information regarding the settings 535
used for programs in this work can be found in the Supplemental Text. The following programs 536
and versions were used for LoReAn: for read mapping, STAR (version 2.5.3a) (Dobin et al., 537
2013) and GMAP (v. 2017-06-20) (Wu and Watanabe, 2005); to assemble and reconstruct 538
transcripts from short reads, Trinity (v. 2.2.0) (Grabherr et al., 2011) run in “genome-guided
mode”, followed by PASA (v. 2.1.0) (Haas et al., 2008); to map protein sequences, AAT (v. 03-05-2011)
(Huang et al., 1997); for ab initio gene prediction, GeneMark-ES (v4.34) (Lomsadze et al., 2014)
and Augustus (v3.3) (Stanke et al., 2008), with BRAKER1 (v. 2) (Hoff et al.,
2016) substituted for Augustus when RNA-seq is supplied and the organism is not present
in the Augustus catalogue; GMAP (v. 2017-06-20) (Wu and
Watanabe, 2005) for long-read mapping and for assembled ESTs after Trinity assembly;
Evidence Modeler (EVM, v. 1.1.1) (Haas et al., 2008) to combine the output from the previous 546
tools to generate a combined annotation model. For EVM, evidence weights were set to 1 and 547
default options were used. Bedtools suite (v. 2.21.0) (Quinlan and Hall, 2010) was used to extract 548
the genomic sequence, merge and cluster the long-reads. iAssembler (v. 1.32) (Zheng et al., 549
2011) was used to call a consensus on the clusters (i.e. the process of transcript reconstruction). 550
GenomeTools (v. 1.5.9) software was used at several stages in the LoReAn pipeline (Gremme et 551
al., 2013). Additional information about the tools used can be found at 552
https://github.com/lfaino/LoReAn/blob/master/README.md. 553
554
Genome Masking 555
To study the effect of genome masking on automated genome annotation with LoReAn, the 556
pipeline was run in stranded mode using three reference genomes with different levels of
repeat masking: a fully masked genome with all repetitive sequences masked, a partially
masked genome where only repeats longer than 400 bp were masked, and a full genome with
no repeat masking. Repeats were masked using RepeatMasker software as previously
described (Faino et al., 2016). 561
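
The three masking levels can be illustrated with a minimal Python sketch. It assumes that repeat coordinates have already been parsed into 0-based, half-open (start, end) intervals and that masking replaces bases with N. The function name `mask_repeats` and the toy data are hypothetical and are not part of LoReAn or RepeatMasker.

```python
def mask_repeats(sequence, repeats, min_length=0):
    """Replace repeat intervals longer than min_length with N (hard masking)."""
    masked = list(sequence)
    for start, end in repeats:
        # Only mask repeats strictly longer than the chosen threshold
        if end - start > min_length:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy 1,000 bp sequence with one short (50 bp) and one long (500 bp) repeat
genome = "ACGT" * 250
repeats = [(10, 60), (200, 700)]

fully_masked = mask_repeats(genome, repeats)                      # all repeats masked
partially_masked = mask_repeats(genome, repeats, min_length=400)  # only repeats > 400 bp
non_masked = genome                                               # no masking
```

In this sketch the short repeat survives in the partially masked sequence, which mirrors the case where coding regions overlap short repetitive DNA.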
562
LoReAn Stranded Mode 563
To use the software in stranded mode efficiently, sequences from the same transcript need to have
the same strand. However, sequencing is random and, depending on the fragment from which
sequencing starts, fragments from the same transcript could be sequenced in forward or reverse
orientation compared to the transcription direction. Unlike in DNA sequencing, the direction of
cDNA long-read sequencing can be inferred by locating the 3’ adapter, the 5’ adapter, or both,
used during cDNA production. Using Smith-Waterman
alignment, the location of the adapter(s) in the sequenced fragments can be identified and the
sequencing orientation adjusted based on the adapter alignment onto the fragments. For the
MinION data generated, the 5’ PlugOligo-1 AAGCAGTGGTATCAACGCAGAGTACGCGGG and
3’-CDS AAGCAGTGGTATCAACGCAGAGTACTGGAG primer sequences associated with the
cDNA synthesis and normalization process were used to identify the coding strand for each long
read. For the PacBio Arabidopsis (Arabidopsis thaliana) experiment, primers
AAGCAGTGGTATCAACGCAGAGTACGCGGG and
AAGCAGTGGTATCAACGCAGAGTACTTTTT were used for the correction of the transcript
orientation. Rice (Oryza sativa) and Plicaturopsis crispa PacBio transcripts were oriented by
using the sequence
AAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTT.
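
The orientation step described above can be sketched in Python as follows. This is a simplified illustration, not the LoReAn implementation: it scores only the 5’ primer against the start of the read and of its reverse complement, and the scoring values, the 50 bp search window and the score threshold are assumptions. The primer sequences are those given in the text; the transcript body is made up.

```python
ADAPTER_5P = "AAGCAGTGGTATCAACGCAGAGTACGCGGG"  # 5' PlugOligo-1 (from the text)
ADAPTER_3P = "AAGCAGTGGTATCAACGCAGAGTACTGGAG"  # 3' CDS primer (from the text)

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def smith_waterman_score(query, target, match=2, mismatch=-3, gap=-5):
    """Best local alignment score (linear gap penalty); scoring values are assumed."""
    best = 0
    prev = [0] * (len(target) + 1)
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def orient_read(read, adapter_5p=ADAPTER_5P, window=50, min_score=40):
    """Return the read in transcript orientation, or None if no adapter is found."""
    forward = read
    reverse = reverse_complement(read)
    score_fwd = smith_waterman_score(adapter_5p, forward[:window])
    score_rev = smith_waterman_score(adapter_5p, reverse[:window])
    if max(score_fwd, score_rev) < min_score:
        return None  # adapter not detected; orientation unknown
    return forward if score_fwd >= score_rev else reverse


# Toy read: 5' primer + made-up transcript body + reverse complement of 3' primer
body = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
read = ADAPTER_5P + body + reverse_complement(ADAPTER_3P)
oriented = orient_read(reverse_complement(read))
```

Because the two primers share their first 25 nucleotides, only the last five bases separate the orientations in this sketch. A more careful version would probably score both primers at both read ends.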
581
Annotation quality definitions 582
The common metrics sensitivity, specificity and accuracy were used to compare the annotation 583
features. These metrics have been previously discussed in the context of annotations (Yandell 584
and Ence, 2012). Briefly, sensitivity is a measure of how well an annotation identifies the known 585
features of a reference, also called a true positive rate. Here, sensitivity was calculated as 586
[(Annotation matching reference / total Reference) * 100] for a specific feature of interest and 587
represented the percentage of known reference features captured. Specificity is a measure of 588
how many of the annotated features were in the reference, also called positive predictive value. 589
Here, specificity was calculated as [(Annotation matching reference / total Annotation) * 100] for a 590
specific feature of interest and represented the percentage of all the annotation features that 591
matched the reference. These comparisons can be for any annotation feature, such as genes, 592
transcripts, or individual exons for exact matches or for a specified overlap to a reference. The F1 593
score accounted for both sensitivity and specificity to measure annotation quality in a single 594
number, calculated as the harmonic mean of sensitivity and specificity [(Sensitivity * Specificity) / 595
(Sensitivity + Specificity)] * 2. 596
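
As a worked illustration of these definitions, the following Python sketch computes the three metrics from counts of matching, non-matching and missing features. The function name is hypothetical; the example counts are the LoReAn-sF intron counts reported in Table 1, where the RNA-seq-derived intron set plays the role of the reference.

```python
def annotation_metrics(matching, not_matching, missing):
    """Return sensitivity, specificity and F1 score as percentages."""
    total_reference = matching + missing        # all features in the reference
    total_annotation = matching + not_matching  # all features in the annotation
    sensitivity = matching / total_reference * 100
    specificity = matching / total_annotation * 100
    if sensitivity + specificity == 0:
        return sensitivity, specificity, 0.0
    # Harmonic mean of sensitivity and specificity
    f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
    return sensitivity, specificity, f1


# LoReAn-sF intron counts from Table 1
sens, spec, f1 = annotation_metrics(matching=13780, not_matching=3899, missing=4794)
print(f"sensitivity={sens:.2f}% specificity={spec:.2f}% F1={f1:.2f}")
```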
597
Head to head comparisons between annotations 598
To determine the unique protein coding genes annotated between LoReAn-sF, BAP-F, MAKER2 599
and CodingQuarry, the annotations were compared using orthoMCL (Li et al., 2003). OrthoMCL 600
was downloaded from https://github.com/apetkau/orthomcl-pipeline and run using default 601
settings. 602
603
Intron analysis 604
Introns were extracted from mapped reads using the same methodology as BRAKER1 (Hoff et
al., 2016). Introns supported by at least two reads were extracted and used in the intron set.
GenomeTools software (Gremme et al., 2013) was used to annotate introns in the GFF3 file. Custom
scripts were used to identify exact match intron coordinates from the annotation files that 608
overlapped with the intron coordinates from the RNA-seq data. Sensitivity, specificity and F1 609
score were calculated as described before. 610
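
The custom scripts are not shown here, but the comparison they perform can be sketched in Python under some assumptions: introns are represented as (sequence, start, end, strand) tuples, and an exact match means identical tuples. The function names and toy coordinates below are hypothetical.

```python
from collections import Counter


def supported_introns(read_introns, min_reads=2):
    """Keep introns observed in at least min_reads mapped reads."""
    counts = Counter(read_introns)
    return {intron for intron, n in counts.items() if n >= min_reads}


def exact_match_counts(annotated_introns, empirical_introns):
    """Return (matching, not matching, missing) exact-coordinate intron counts."""
    annotated = set(annotated_introns)
    matching = len(annotated & empirical_introns)
    not_matching = len(annotated - empirical_introns)
    missing = len(empirical_introns - annotated)
    return matching, not_matching, missing


# Toy data: one intron per mapped read (made-up coordinates)
read_introns = [
    ("chr1", 100, 200, "+"), ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),  # single read only, filtered out
    ("chr1", 500, 600, "-"), ("chr1", 500, 600, "-"), ("chr1", 500, 600, "-"),
]
annotation = [("chr1", 100, 200, "+"), ("chr1", 700, 800, "+")]

empirical = supported_introns(read_introns)
matching, not_matching, missing = exact_match_counts(annotation, empirical)
```

The three counts correspond to the "matching", "not matching" and "missing" columns used in Tables 1 to 4.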
611
Statistical comparisons between annotation outputs 612
Statistical comparisons were made using the R software package (Team, 2016). Chi-square tests 613
of independence (chisq.test in R) were computed to test for association between the annotation 614
pipelines and the sensitivity and specificity metrics. The residuals of the test were used to assess 615
the direction and magnitude of the associations across the data, but no formal post-hoc testing 616
was performed on this data. Two-proportions z-tests (prop.test in R) were used to compare 617
individual annotation results against the reference annotation or against a LoReAn output. These 618
tests were conducted for both gene and intron features using the number of matched features 619
and non-matched features or the number of matched and missing features (i.e. specificity and 620
sensitivity). The two-sample tests were conducted as one-tailed to determine the difference 621
compared to the reference, and Bonferroni multiple testing correction was applied to adjust the p-value
needed to reject the null hypothesis. For singleton analysis, the read coverage and length
data were compared using the non-parametric Kruskal-Wallis test (kruskal in R from the agricolae
(de Mendiburu, 2016) package) to avoid the assumption of equal distribution and variance of the
data. The proportion of highest-quality singletons from each pipeline was compared against the
results of LoReAn-sF using the two-proportions z-test. The effects of genome masking, ab initio
options and pipeline options were tested using ANOVA, and Tukey’s honestly significant difference post-hoc
test (alpha = 0.05) was used to determine statistical grouping.
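
The analyses in this work were run in R. For readers who prefer Python, the one-tailed two-proportions comparison with a Bonferroni-adjusted threshold can be approximated as in the sketch below. It is a plain pooled z-test without continuity correction, so its values may differ slightly from those of prop.test, which applies Yates' correction by default. The function name and the number of comparisons are assumptions; the counts are taken from Table 1.

```python
import math


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """One-tailed pooled z-test of H1: proportion A < proportion B."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Lower-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_value


# Intron specificity counts from Table 1: matching, matching + not matching
lorean_sf = (13780, 13780 + 3899)
bap_f = (13128, 13128 + 5284)
coding_quarry = (12021, 12021 + 3001)

z_bap, p_bap = two_proportion_z_test(*bap_f, *lorean_sf)
z_cq, p_cq = two_proportion_z_test(*coding_quarry, *lorean_sf)

n_comparisons = 7  # assumed: the seven other rows of Table 1
alpha = 0.05 / n_comparisons  # Bonferroni-adjusted threshold
print(p_bap < alpha, p_cq < alpha)
```

Under these assumptions the BAP-F specificity is significantly lower than that of LoReAn-sF, while CodingQuarry is not, in line with the corresponding entries in Table 1.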
630
Ave1 isoform analysis 631
Ave1 isoforms were confirmed using cDNA-PCR of plant material infected with V. dahliae strain
JR2. Specific primers for the Ave1 gene (F-TTTAACACTTCACTCTGCTCTCG; R-CCTTGTGTGCTGCTTTGGTA)
and for the Ave1c gene (F-CGCCGGCAATACTATCTCAA; R-ATCCTGTGGGCAACAATAGC)
were used to identify the two Ave1 isoforms.
636
AVAILABILITY 637
The LoReAn source code is available at: https://github.com/lfaino/LoReAn/ and provided under 638
an MIT license, available at: https://github.com/lfaino/LoReAn/blob/master/LICENSE. 639
Documentation and software are available at https://github.com/lfaino/LoReAn. The software can 640
run on all platforms when deployed via Docker (https://www.docker.com/). 641
All genome annotations, scripts and additional files generated and/or analyzed in the paper can 642
be found at https://github.com/lfaino/files-paper-LoReAn.git. 643
A dataset to test the correct installation of the tool can be found at 644
https://github.com/lfaino/LoReAn-Example.git. This dataset contains all the data to annotate a 645
single chromosome of V. dahliae strain JR2. 646
647
ACCESSION NUMBERS 648
The V. dahliae strain JR2 reference annotation version 5 was used in the analysis. The version 5 649
was generated by comparing the concordance of all gene models of version 4 with the long reads 650
information. Subsequently, the improved version 5 was deposited at ENSEMBL fungi database 651
and can be downloaded at http://fungi.ensembl.org/Verticillium_dahliaejr2/Info/Index. 652
The P. crispa reference genome and annotation were downloaded from JGI 653
(http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Plicr1). The 654
Arabidopsis genome sequence and reference annotation were downloaded from the TAIR 655
database (ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/; 656
https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff).
The rice genome sequence and annotation were retrieved from the
ENSEMBL plant database (http://plants.ensembl.org/Oryza_sativa/Info/Index). The sequencing 659
data are accessible at the NCBI SRA database. The short-read A. thaliana data set is deposited 660
under SRA accession number SRR5446746 and the PacBio dataset under SRA accession 661
number SRR5445910. The V. dahliae Illumina transcriptome is deposited under accession 662
number SRR5440696 while the Nanopore transcriptome data is deposited as SRR5445874. The 663
P. crispa PacBio reads were downloaded from the publicly accessible NCBI SRA site, runs 664
SRR5077068 to SRR5077144 and Illumina data from run SRR1577770. The O. sativa data were 665
downloaded from the European Nucleotide Archive (ENA) under runs ERR91110 and 666
ERR911111 and the Illumina data from run ERR748773. 667
668
ACKNOWLEDGMENTS
We thank Jordi Coolen for his assistance in writing the LoReAn software and Michael Seidl for 670
reading the manuscript. 671
672
SUPPLEMENTAL DATA 673
Supplemental Figure S1. Annotation quality metrics and summary statistics for predicted 674
genes. 675
Supplemental Figure S2. Annotation quality metrics and summary statistics for predicted exons. 676
Supplemental Figure S3. Ave1 gene model predictions from different software used during 677
annotation. 678
Supplemental Table S1. Number of predicted genes and exact match predictions to the JR2 679
reference. 680
Supplemental Table S2. Number of predicted exons and exact match predictions to the JR2 681
reference. 682
Supplemental Table S3. Gene prediction summaries for annotation options. 683
Supplemental Table S4. Exon prediction summaries for annotation options. 684
Supplemental Table S5. Two-sample test of proportions for high-confidence singleton 685
predictions. 686
Supplemental Table S6. Two-samples test of proportions for exact match intron based on 687
empirical RNA-seq intron identification in V. dahliae. 688
Supplemental Table S7. Predicted features for P. crispa annotation analysis. 689
Supplemental Table S8. P. crispa chi-square test of exact match gene proportions against 690
LoReAn_sF. 691
Supplemental Table S9. P. crispa chi-square test of exact match intron proportions against 692
LoReAn_sF. 693
Supplemental Table S10. Predicted features for A. thaliana annotation analysis. 694
Supplemental Table S11. A. thaliana chi-square test of exact match gene proportions against 695
LoReAn_s. 696
Supplemental Table S12. Predicted features for O. sativa annotation analysis. 697
698
Supplemental Table S13. O. sativa chi-square test of exact match gene proportions against 699
LoReAn_s. 700
701
702
TABLES 703
Table 1. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq mapping data in V. dahliae.

Pipeline | Predictions matching RNA-seq | Predictions not matching | Predictions missing | Specificity less than LoReAn-sF (a) | Sensitivity less than LoReAn-sF (b)
BAP-F | 13,128 | 5,284 | 5,446 | < 0.0001 | < 0.0001
MAKER2 | 12,509 | 5,903 | 6,065 | < 0.0001 | < 0.0001
CodingQuarry | 12,021 | 3,001 | 6,553 | N.S. (c) | < 0.0001
GeneMark-ES-F | 12,323 | 5,274 | 6,251 | < 0.0001 | < 0.0001
BRAKER1 | 11,491 | 4,739 | 7,083 | < 0.0001 | < 0.0001
Augustus | 11,857 | 6,079 | 6,717 | < 0.0001 | < 0.0001
Reference Annotation | 13,935 | 4,145 | 4,639 | N.S. (c) | N.S. (c)
LoReAn-sF | 13,780 | 3,899 | 4,794 | Not applicable | Not applicable

(a) Column reports p-values for the chi-square test of proportions for the specificity metric.
(b) Column reports p-values for the chi-square test of proportions for the sensitivity metric.
(c) N.S. Not significant with a p-value greater than 0.05.
Table 2. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq mapping data in P. crispa.

Pipeline | Predictions matching RNA-seq | Predictions not matching | Predictions missing | Specificity less than Reference Annotation (a) | Sensitivity less than Reference Annotation (b)
LoReAn | 58,831 | 8,235 | 29,119 | N.S. (c) | < 0.001
LoReAn-M | 58,798 | 8,219 | 29,152 | N.S. (c) | < 0.001
LoReAn-s | 58,556 | 7,318 | 29,394 | < 0.001 | < 0.001
LoReAn-sM | 58,622 | 7,326 | 29,328 | < 0.001 | < 0.001
BAP | 54,565 | 8,204 | 33,385 | N.S. (c) | N.S. (c)
BAPplus | 53,943 | 7,159 | 34,007 | < 0.001 | N.S. (c)
Augustus | 52,525 | 8,773 | 35,425 | N.S. (c) | N.S. (c)
GeneMark-ES-F | 51,450 | 11,085 | 36,500 | N.S. (c) | N.S. (c)
MAKER2 | 50,228 | 10,881 | 37,722 | N.S. (c) | N.S. (c)
Reference Annotation | 57,837 | 8,054 | 30,113 | Not applicable | Not applicable

(a) Column reports p-values for the chi-square test of proportions for the specificity metric.
(b) Column reports p-values for the chi-square test of proportions for the sensitivity metric.
(c) N.S. Not significant with a p-value greater than 0.05.
Table 3. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq mapping data in A. thaliana.

Pipeline | Predictions matching RNA-seq | Predictions not matching | Predictions missing | Specificity less than Reference Annotation (a) | Sensitivity less than Reference Annotation (b)
LoReAn | 67064 | 57969 | 4376 | N.S. (c) | N.S. (c)
LoReAn-M | 66199 | 45073 | 5241 | < 0.001 | N.S. (c)
LoReAn-s | 67003 | 57897 | 4437 | N.S. (c) | N.S. (c)
LoReAn-sM | 66133 | 44987 | 5307 | < 0.001 | N.S. (c)
BAP | 64444 | 59868 | 6996 | N.S. (c) | N.S. (c)
BAPplus | 65037 | 59153 | 6403 | N.S. (c) | N.S. (c)
Augustus | 63137 | 70107 | 8303 | N.S. (c) | N.S. (c)
GeneMark-ES | 61815 | 80017 | 9625 | N.S. (c) | N.S. (c)
MAKER2 | 60289 | 26238 | 11151 | < 0.001 | N.S. (c)
Reference Annotation | 69657 | 54910 | 1783 | Not applicable | Not applicable

(a) Column reports p-values for the chi-square test of proportions for the specificity metric.
(b) Column reports p-values for the chi-square test of proportions for the sensitivity metric.
(c) N.S. Not significant with a p-value greater than 0.05.
Table 4. Chi-square test of proportions for predicted exact match introns inferred from RNA-seq mapping data in O. sativa.
Pipeline              Predictions matching RNA-seq   Predictions not matching   Predictions missing   Specificity less than Reference Annotation (a)   Sensitivity less than Reference Annotation (b)
LoReAn                84679                          33790                      37383                 < 0.001                                          N.S.
LoReAn-M              81707                          15929                      40355                 < 0.001                                          N.S.
LoReAn-s              84880                          33749                      37182                 < 0.001                                          0.007
LoReAn-sM             81870                          15892                      40192                 < 0.001                                          N.S.
BAP                   77609                          85999                      44453                 N.S.                                             N.S.
BAPplus               75048                          39740                      47014                 N.S.                                             N.S.
Augustus              64279                          94836                      57783                 N.S.                                             N.S.
GeneMark-ES           1958                           447395                     120104                N.S.                                             N.S.
MAKER2                78819                          57887                      43243                 N.S.                                             N.S.
Reference Annotation  84325                          38746                      37737                 Not applicable                                   Not applicable
(a) Column reports p-values for the chi-square test of proportions for the specificity metric.
(b) Column reports p-values for the chi-square test of proportions for the sensitivity metric.
N.S.: not significant, with a p-value greater than 0.05.
FIGURE LEGENDS
Figure 1. Schematic overview of the LoReAn pipeline and clustered transcript reconstruction.
(A) Illustration of the computational workflow for the LoReAn pipeline. Grey boxes represent input data, and each white box represents a step in the annotation process with mention of the specific software. The boxes connected by blue arrows integrate the steps from the previously described BAP, described in the text as phase one of LoReAn (Haas et al., 2008). The LoReAn pipeline (boxes connected by red arrows) integrates the BAP workflow, but additionally incorporates long-read sequencing data, described in the text as phase two of LoReAn. The blue box, ‘BAP annotation’, represents the annotation results from the BAP pipeline used for comparison in this study, while the orange box, ‘LoReAn annotation’, represents the annotation results from the LoReAn pipeline using long-read sequencing data. Dashed arrows represent optional steps for the pipeline. (B) Illustration of the clustered transcript reconstruction. Gene models are depicted as exons (boxes) and connecting introns (lines). Blue models represent BAP annotations, while red models represent hypothetical long-reads mapped to the genome. Orange models represent consensus annotations reported in the final LoReAn output. Various scenarios can occur: i: High confidence predictions from the BAP are kept regardless of whether they are supported by long-reads. ii & iii: Clusters of mapped long-reads are used to generate a consensus prediction model, unless the model is supported by less than a user-defined minimum depth. iv: Overlapping BAP and mapped long-reads are combined to a consensus model. v: Two annotations are reported if no consensus can be reached for the BAP and clustered long-read data.
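Scenarios ii and iii in panel B can be illustrated with a small sketch: mapped long-reads are grouped into clusters by genomic overlap, and clusters supported by fewer reads than a minimum depth are dropped. This is a conceptual illustration only, assuming reads on a single chromosome and strand, half-open (start, end) coordinates, and "depth" taken as the number of reads in a cluster. The function name `cluster_reads` and the example coordinates are hypothetical; the pipeline itself relies on the tools named in panel A.

```python
from typing import List, Tuple

Read = Tuple[int, int]  # (start, end), half-open, one chromosome and strand


def cluster_reads(reads: List[Read], min_depth: int) -> List[List[Read]]:
    """Group overlapping reads and keep clusters with at least min_depth reads."""
    clusters: List[List[Read]] = []
    current: List[Read] = []
    current_end = 0
    for start, end in sorted(reads):
        if current and start < current_end:
            # Read overlaps the running cluster: extend it.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            # No overlap: close the previous cluster and start a new one.
            if current:
                clusters.append(current)
            current = [(start, end)]
            current_end = end
    if current:
        clusters.append(current)
    # Clusters below the user-defined minimum depth are excluded.
    return [c for c in clusters if len(c) >= min_depth]


# Hypothetical mapped reads: three overlapping reads and one isolated read.
example_reads = [(100, 500), (120, 480), (110, 510), (900, 1200)]
for cluster in cluster_reads(example_reads, min_depth=2):
    span = (min(s for s, _ in cluster), max(e for _, e in cluster))
    print(f"cluster span {span} supported by {len(cluster)} reads")
```

Sorting by start coordinate lets a single pass find all overlap groups; building the consensus model from each retained cluster is a separate step that is not sketched here.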
Figure 2. Annotation quality summary for exact match genes to the reference.
(A) Gene annotation quality summary, where each horizontal bar represents an annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score (red). The annotations are labelled using the left grid, where the group of horizontal black dots defines the parameters used in the annotation. Possible parameters include using LoReAn, BAP or BAP+ pipeline, stranded mode for LoReAn (Stranded), the fungus option for GeneMark-ES (Fungus), or the BRAKER1 program for Augustus (BRAKER1). Annotation options are grouped by the level of reference masking: Partially Masked (Part.), Non-Masked (None) or Fully Masked (Mask). The results from additionally tested annotation pipelines are shown at the bottom. Three vertical grey lines represent the 1st quartile, median and 3rd quartile for the F1 score. The two annotations highlighted with a yellow horizontal bar were used for subsequent analysis. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence. The residuals from the analysis are shown with the size and color representing the magnitude and direction of the association between rows and columns. GeneMark-ES-F: GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and the ‘fungus’ option of GeneMark-ES. BAP-F: The Broad Institute Eukaryotic Genome Annotation Pipeline described in the text and using the ‘fungus’ option of GeneMark-ES.
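The residuals plotted in panel B are Pearson residuals, (observed − expected) / sqrt(expected), and their squares sum to the chi-square statistic. The sketch below shows that calculation in plain Python, without a continuity correction. The function names are hypothetical, and the example counts are the intron counts for LoReAn and the Reference Annotation in A. thaliana from Table 3, used only to have real numbers to work with. The gene counts behind Figure 2 are not reproduced here.

```python
import math
from typing import List, Tuple


def chi_square_independence(
    table: List[List[float]],
) -> Tuple[float, List[List[float]]]:
    """Return the chi-square statistic and the Pearson residual of each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    residuals = [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    statistic = sum(x * x for row in residuals for x in row)
    return statistic, residuals


def p_value_df1(statistic: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: pipeline; columns: predictions matching, predictions not matching.
counts = [
    [67064, 57969],  # LoReAn (A. thaliana introns, Table 3)
    [69657, 54910],  # Reference Annotation
]
stat, resid = chi_square_independence(counts)
print(f"chi-square = {stat:.2f}, p = {p_value_df1(stat):.3g}")
for label, row in zip(("LoReAn", "Reference"), resid):
    print(label, [round(x, 2) for x in row])
```

A positive residual marks a cell with more observations than independence predicts and a negative residual marks fewer, which is the direction that the dot color encodes in the panel. The closed-form p-value applies only to 2 x 2 tables. Larger tables need the general chi-square distribution.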
Figure 3. Analysis of predicted singletons across four pipelines.
(A) Venn diagram showing the overlap and uniqueness of predicted genes based on genomic location. The Venn diagram shows that 4,584 genes were annotated with the exact same features across all four pipelines. The numbers captured by only a single annotation pipeline are considered singletons, that is, genes whose structure is uniquely annotated by a given pipeline. Note, these singletons do not necessarily represent unique loci. (B) Short-read RNA-seq data were mapped to the genome, and the percent length coverage of each gene annotation was calculated. The data were not completely normally distributed, so a non-parametric Kruskal-Wallis test was used to rank the mean of the coverage. Data are shown as violin plots, with the tails representing the data range; the mean and standard deviation are shown as a black point and black vertical lines, respectively. Letters shown above each violin plot represent post-hoc statistical groupings where plots with the same letter are statistically indistinguishable. Multiple comparisons were made using the non-parametric Kruskal-Wallis rank of means and post hoc differences were determined using Fisher’s least significant difference, p < 0.05 with Bonferroni correction. (C) Same as in (B) except the mean rank of the singleton length is analyzed. (D) The orthoMCL singletons from each pipeline were grouped into one of four categories shown in the key representing if the singleton contained an intron or not and if the singleton’s length was covered by over 75% with RNA-seq data. The number of singletons within each of the four categories is shown.
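The rank-based comparison in panels B and C can be made concrete with a short sketch of the Kruskal-Wallis H statistic: all observations are pooled and ranked (ties receive the average rank), and H measures how far each group's rank sum departs from what equal distributions would give. The code below is a plain-Python illustration with hypothetical function names and made-up coverage values. It does not reproduce the post hoc grouping, which the legend attributes to Fisher's least significant difference with Bonferroni correction.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    tie_sizes: Dict[float, int] = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Hypothetical percent-coverage values for singletons from three pipelines.
coverage = [[12.0, 35.5, 40.0, 8.0], [88.0, 92.5, 75.0, 99.0], [50.0, 61.0, 47.5, 70.0]]
print(f"H = {kruskal_wallis_h(coverage):.3f}")
```

Under the null hypothesis H is approximately chi-square distributed with (number of groups − 1) degrees of freedom. Obtaining the p-value from that distribution is left to a statistics library.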
Figure 4. Annotation quality summary for predicted introns exactly matching RNA-seq inferred introns.
(A) Predicted intron quality summary, where each horizontal bar represents an annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score (red) as described in Figure 2A. (B) The proportion of exact match to non-matching intron predictions (specificity) and exact match to missing intron predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. GeneMark-ES-F: GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and the ‘fungus’ option of GeneMark-ES.
Figure 5. The LoReAn pipeline most accurately annotates a specific fungal locus encoding a strain-specific gene.
(A) Short-read RNA-seq data mapped to the locus are shown as a coverage plot (grey peaks) and as representative individual reads (yellow boxes). Long-reads from single-molecule cDNA data mapped to the locus are shown as a coverage plot (grey peaks) and representative reads (purple boxes). Thick black lines linking mapped reads represent gaps in the mapped reads and are indicative of introns. The long-read data were split by mapping strand, and coverage plots are shown for the forward (red) and reverse (blue) strands. (B) Gene model predictions from four annotation pipelines are illustrated. Light blue boxes represent untranslated regions (5’ and 3’ UTR), dark blue boxes represent coding sequence boundaries, and thin black lines depict introns. Arrows in the introns indicate the direction of transcription. The MAKER2 and BAP pipelines predict a single transcript coded on the reverse strand at the 3’ end of the known Ave1 transcript. LoReAn-sF predicts two transcripts corresponding to the Ave1 gene along with the similar transcript predicted by MAKER2. The reference Ave1 transcript is shown in grey. (C) To confirm the presence of an alternative splice site in the 5’UTR of the Ave1 transcript, 18 cDNA clones were randomly chosen and sequenced. Isoform 1 sequence is identical to the reference Ave1 sequence and was identified in 15 of the 18 clones. Isoform 2 has a 5 bp insertion in the 5’UTR resulting from an alternative exon splice site and was identified in 3 of the 18 sequenced clones. The Ave1 reference sequence is shown from bases 71 through 86. (D) The presence of Ave1 and the additional gene transcribed to the 3’ end of Ave1, termed Ave1close (Ave1c), was confirmed using PCR on gDNA and cDNA. PCR using gene-specific primers, termed Ave1 fw + rev (pink arrows) or Ave1c fw + rev (yellow arrows), shows that both genes are expressed in potato dextrose broth (PDB), Czapek-dox (CPD) or half-strength Murashige-Skoog (1/2MS) media. The inverse orientation of the two genes was confirmed using forward primers only, which amplified the entire locus, resulting in a band of approximately 1,118 bp, but did not amplify a product using cDNA as the template.
Figure 6. Assessment of gene and intron predictions from P. crispa.
(A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) and (B) respectively.
Figure 7. Assessment of gene and intron predictions from A. thaliana.
(A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) and (B) respectively.
Figure 8. Assessment of gene and intron predictions from O. sativa.
(A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. (C) and (D) are for exact match intron analysis, represented as in (A) and (B) respectively.
REFERENCES
Abdel-Ghany SE, Hamilton M, Jacobi JL, Ngam P, Devitt N, Schilkey F, Ben-Hur A, Reddy ASN (2016) A survey of the sorghum transcriptome using single-molecule long reads. Nature Communications 7: 11706
Amemiya CT, Alföldi J, Lee AP, Fan S, Philippe H, Maccallum I, Braasch I, Manousaki T, Schneider I, Rohner N, et al (2013) The African coelacanth genome provides insights into tetrapod evolution. Nature 496: 311–316
Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, van Bakel H, Schadt EE, Reijo-Pera RA, Underwood JG, et al (2013) Characterization of the human ESC transcriptome by hybrid sequencing. Proc Natl Acad Sci USA 110: E4821–30
Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E (2015) The Arabidopsis information resource: Making and mining the “gold standard” annotated reference plant genome. Genesis 53: 474–485
Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18: 188–196
Chan K-L, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low E-TL (2017) Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. BMC Bioinformatics 18: 1426–7
Chen F, Mackey AJ, Stoeckert CJ, Roos DS (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 34: D363–8
Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O'Malley R, Figueroa-Balderas R, Morales-Cruz A, et al (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13: 1050–1054
Cook DE, Mesarich CH, Thomma BPHJ (2015) Understanding plant immunity as a surveillance system to detect invasion. Annu Rev Phytopathol 53: 541–563
Davey JW, Chouteau M, Barker SL, Maroja L, Baxter SW, Simpson F, Joron M, Mallet J, Dasmahapatra KK, Jiggins CD (2016) Major Improvements to the Heliconius melpomene Genome Assembly Used to Confirm 10 Chromosome Fusion Events in 6 Million Years of Butterfly Evolution. G3 (Bethesda) 6: 695–708
de Jonge R, van Esse HP, Maruthachalam K, Bolton MD, Santhanam P, Saber MK, Zhang Z, Usami T, Lievens B, Subbarao KV, et al (2012) Tomato immune receptor Ve1 recognizes effector of multiple fungal pathogens uncovered by genome and RNA sequencing. Proc Natl Acad Sci USA 109: 5110–5115
de Mendiburu F (2016) agricolae: Statistical Procedures for Agricultural Research. https://CRAN.R-project.org/package=agricolae
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29: 15–21
Faino L, Seidl MF, Datema E, van den Berg GCM, Janssen A, Wittenberg AHJ, Thomma BPHJ (2015) Single-Molecule Real-Time Sequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome. mBio 6: e00936–15
Faino L, Seidl MF, Shi-Kunne X, Pauper M, van den Berg GCM, Wittenberg AHJ, Thomma BPHJ (2016) Transposons passively and actively contribute to evolution of the two-speed genome of a fungal pathogen. Genome Res 26: 1091–1100
Faino L, Thomma BPHJ (2014) Get your high-quality low-cost genome sequence. Trends in Plant Science 19: 288–291
Fradin EF, Thomma BPHJ (2006) Physiology and molecular aspects of Verticillium wilt diseases caused by V. dahliae and V. albo-atrum. Mol Plant Pathol 7: 71–86
Goodswen SJ, Kennedy PJ, Ellis JT (2012) Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS ONE 7: e50609
Gordon SP, Tseng E, Salamov A, Zhang J, Meng X, Zhao Z, Kang D, Underwood J, Grigoriev IV, Figueroa M, et al (2015) Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10: e0132628
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652
Gremme G, Steinbiss S, Kurtz S (2013) GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans Comput Biol Bioinform 10: 645–656
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR (2008) Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9: R7
Haas BJ, Zeng Q, Pearson MD, Cuomo CA, Wortman JR (2011) Approaches to Fungal Genome Annotation. Mycology 2: 118–141
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32: 767–769
Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491
Huang X, Adams MD, Zhou H, Kerlavage AR (1997) A tool for analyzing and annotating genomic sequences. Genomics 46: 37–45
Jiao W-B, Schneeberger K (2017) The impact of third generation genomic technologies on plant genome assembly. Current Opinion in Plant Biology 36: 64–70
Keibler E, Brent MR (2003) Eval: a software package for analysis of genome annotations. BMC Bioinformatics 4: 50
Klosterman SJ, Atallah ZK, Vallad GE, Subbarao KV (2009) Diversity, pathogenicity, and management of verticillium species. Annu Rev Phytopathol 47: 39–62
Kohler A, Kuo A, Nagy LG, Morin E, Barry KW, Buscot F, Canbäck B, Choi C, Cichocki N, Clum A, et al (2015) Convergent losses of decay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nature Genetics 47: 410–415
Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 23: 110–120
Križanovic K, Echchiki A, Roux J, Šikic M (2017) Evaluation of tools for long read RNA-seq splice-aware alignment. Bioinformatics. doi: 10.1093/bioinformatics/btx668
Laehnemann D, Borkhardt A, McHardy AC (2016) Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinformatics 17: 154–179
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40: D1202–D1210
Lamichhaney S, Fan G, Widemo F, Gunnarsson U, Thalmann DS, Hoeppner MP, Kerje S, Gustafson U, Shi C, Zhang H, et al (2016) Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nature Genetics 48: 84–88
Laver T, Harrison J, O'Neill PA, Moore K, Farbos A, Paszkiewicz K, Studholme DJ (2015) Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol Detect Quantif 3: 1–8
Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189
Linde J, Duggan S, Weber M, Horn F, Sieber P, Hellwig D, Riege K, Marz M, Martin R, Guthke R, et al (2015) Defining the transcriptomic landscape of Candida glabrata by RNA-Seq. Nucleic Acids Res 43: 1392–1406
Loman NJ, Quick J, Simpson JT (2015) A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods 12: 733–735
Loman NJ, Quinlan AR (2014) Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics 30: 3399–3401
Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42: e119–e119
Ma L, Chen Z, Huang DW, Kutty G, Ishihara M, Wang H, Abouelleil A, Bishop L, Davey E, Deng R, et al (2016) Genome analysis of three Pneumocystis species reveals adaptation mechanisms to life exclusively in mammalian hosts. Nature Communications 7: 10740
Ming R, VanBuren R, Wai CM, Tang H, Schatz MC, Bowers JE, Lyons E, Wang M-L, Chen J, Biggers E, et al (2015) The pineapple genome and the evolution of CAM photosynthesis. Nature Genetics 47: 1435–1442
Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M, Sörensen TR, Weisshaar B, Himmelbauer H (2015) Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol 16: 184
Muñoz JF, Gauthier GM, Desjardins CA, Gallo JE, Holder J, Sullivan TD, Marty AJ, Carmen JC, Chen Z, Ding L, et al (2015) The Dynamic Genome and Transcriptome of the Human Fungal Pathogen Blastomyces and Close Relative Emmonsia. PLoS Genet 11: e1005493
Phillippy AM (2017) New advances in sequence assembly. Genome Res 27: xi–xiii
Lo Presti L, Lanver D, Schweizer G, Tanaka S, Liang L, Tollot M, Zuccaro A, Reissmann S, Kahmann R (2015) Fungal effectors and plant susceptibility. Annu Rev Plant Biol 66: 513–545
Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842
Reid I, O'Toole N, Zabaneh O, Nourzadeh R, Dahdouli M, Abdellateef M, Gordon PMK, Soh J, Butler G, Sensen CW, et al (2014) SnowyOwl: Accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models. BMC Bioinformatics 15: 229
Smith CD, Zimin A, Holt C, Abouheif E, Benton R, Cash E, Croset V, Currie CR, Elhaik E, Elsik CG, et al (2011) Draft genome of the globally widespread and invasive Argentine ant (Linepithema humile). Proc Natl Acad Sci USA 108: 5673–5678
Smith JJ, Kuraku S, Holt C, Sauka-Spengler T, Jiang N, Campbell MS, Yandell MD, Manousaki T, Meyer A, Bloom OE, et al (2013) Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature Genetics 45: 415–421, 421e1–2
Sperschneider J, Dodds PN, Gardiner DM, Manners JM, Singh KB, Taylor JM (2015) Advances and challenges in computational prediction of effectors from plant pathogenic fungi. PLoS Pathog 11: e1004806
Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24: 637–644
R Core Team (2016) R: A Language and Environment for Statistical Computing. https://www.R-project.org/
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 18: 1979–1990
Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts. BMC Genomics 16: 170
Thomma BPHJ, Seidl MF, Shi-Kunne X, Cook DE, Bolton MD, van Kan JAL, Faino L (2016) Mind the gap; seven reasons to close fragmented genome assemblies. Fungal Genet Biol 90: 24–30
Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D (2016) Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nature Communications 7: 11708
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57–63
Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21: 1859–1875
Yandell M, Ence D (2012) A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13: 329–342
Zheng Y, Zhao L, Gao J, Fei Z (2011) iAssembler: a package for de novo assembly of Roche-454/Sanger transcriptome sequences. BMC Bioinformatics 12: 453
[Figure 1 image. Panel A: workflow diagram linking the input data (reference genome sequence, protein sequence evidence, short-read RNA-seq data, long-read RNA-seq data) to the annotation steps: protein alignment (AAT), transcript reconstruction (Trinity), transcript mapping (PASA, GMAP), ab initio prediction (BRAKER1, Augustus, GeneMark-ES), evidence combination (EVidenceModeler), long-read mapping (GMAP), clustered transcript reconstruction (Bedtools, iAssembler), exon sequence extraction (GFFread) and transcript update (PASA), ending in the BAP annotation and the LoReAn annotation. Panel B: Clustered Transcript Reconstruction, with a key for phase-one models, phase-two models, LoReAn output and excluded models.]
Figure 1. Schematic overview of the LoReAn pipeline and clustered transcript reconstruction. (A) Illustration of the computational workflow for the LoReAn pipeline. Grey boxes represent input data and each white box represents a step in the annotation process with mention of the specific software. The boxes connected by blue arrows integrate the steps from the previously described BAP (Haas et al., 2008). The LoReAn pipeline (boxes connected by red arrows) integrates the BAP workflow, but additionally incorporates long-read sequencing data. The orange box, ‘Final BAP annotation’ represents the annotation results from the BAP pipeline used for comparison in this study. Dashed arrows represent optional steps for the pipeline. (B) Illustration of the clustered transcript reconstruction. Gene models are depicted as exons (boxes) and connecting introns (lines). Blue models represent BAP annotations, while red models represent hypothetical long-reads mapped to the genome. Orange models represent consensus annotations reported in the final LoReAn output. Various scenarios can occur: i: High confidence predictions from the BAP are kept regardless of whether they are supported by long-reads. ii & iii: Clusters of mapped long-reads are used to generate a consensus prediction model, unless the model is supported by less than a user-defined minimum depth. iv: Overlapping BAP and mapped long-reads are combined to a consensus model. v: Two annotations are reported if no consensus can be reached for the BAP and clustered long-read data.
[Figure 2 image. Panel A: sensitivity, specificity and F1 score for each annotation; x axis: Percent of exact match genes (25 to 75); rows are grouped by masking level (Part., Mask, None) and labelled by the options LoReAn, Stranded, Fungus, Braker, BAP and BAP+, followed by MAKER2, CodingQuarry_effector, CodingQuarry, GeneMark-ES-F, GeneMark-ES, BRAKER1 and Augustus. Panel B: Pearson residuals for chi-square (−15 to 15) for MAKER2, CodingQuarry, GeneMark-ES-F, BRAKER1, LoReAn-sF, Augustus and BAP-F across the columns Prediction Match, Prediction NOT Match, Reference Match and Reference NOT Match.]
Figure 2. Annotation quality summary for exact match genes to the reference. (A) Gene annotation quality summary, where each horizontal bar represents an annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score (red). The annotations are labelled using the left grid, where the group of horizontal black dots defines the parameters used in the annotation. Possible parameters include using LoReAn, BAP or BAP+ pipeline, stranded mode for LoReAn (Stranded), the fungus option for GeneMark-ES (Fungus), or the BRAKER1 program for Augustus (BRAKER1). Annotation options are grouped by the level of reference masking: Partially Masked (Part.), Non-Masked (None) or Fully Masked (Mask). The results from additionally tested annotation pipelines are shown at the bottom. Three vertical grey lines represent the 1st quantile, median and 3rd quantile for the F1 score. The two annotations highlighted with a yellow horizontal bar were used for subsequent analysis. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence. The residuals from the analysis are shown with the size and color representing the magnitude and direction of the association between rows and columns. GeneMark-ES-F: GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and the ‘fungus’ option of GeneMark-ES.
[Figure 3 image. Panel A: four-way Venn diagram for CodingQuarry, BAP-F, LoReAn-sF and MAKER2, with 4,584 genes shared by all four pipelines. Panels B and C: violin plots per pipeline with post-hoc grouping letters; y axes: Mean Rank Coverage and Mean Rank Length (0 to 1500). Panel D: singleton counts per pipeline in four categories: with intron, < 75% coverage; no intron, < 75% coverage; with intron, > 75% coverage; no intron, > 75% coverage.]
Figure 3. Analysis of predicted singletons across four pipelines. (A) Venn diagram showing the overlap and uniqueness of predicted genes based on genomic location. The Venn diagram shows that 4,584 genes were annotated with the exact same features across all four pipelines. The numbers captured by only a single annotation pipeline are considered singletons, that is, genes whose structure is uniquely annotated by a given pipeline. Note, these singletons do not necessarily represent unique loci. (B) Short-read RNA-seq data were mapped to the genome and the percent length coverage of each gene annotation was calculated. The data were not completely normally distributed, so a non-parametric Kruskal-Wallis test was used to rank the mean of the coverage. Data are shown as violin plots, with the tails representing the data range; the mean and standard deviation are shown as a black point and black vertical lines. Letters shown above each violin plot represent post-hoc statistical groupings where plots with the same letter are statistically indistinguishable. (C) Same as in (B) except the mean rank of the singleton length is analyzed. (D) The orthoMCL singletons from each pipeline were grouped into one of four categories shown in the key representing if the singleton contained an intron or not and if the singleton’s length was covered by over 75% with RNA-seq data. The number of singletons within each of the four categories is shown.
Figure 4 graphic: (A) sensitivity, specificity and F1 score for each annotation; x-axis: Percent of exact match introns. Parameter grid: LoReAn, Stranded, Fungus, Braker, BAP, BAP+; masking groups: Part., None, Mask. Additional pipelines: MAKER2, CodingQuarry_effector, CodingQuarry, GeneMark-ES-F, GeneMark-ES, BRAKER1, Augustus. (B) Pearson residuals for chi-square. Labels: Reference Annotation, MAKER2, CodingQuarry, GeneMark-ES-F, BRAKER1, LoReAn-sF, Augustus, BAP-F; Prediction Match, Prediction NOT Match, Reference Match, Reference NOT Match.
Figure 4. Annotation quality summary for predicted introns exactly matching RNA-seq inferred introns. (A) Predicted intron quality summary, where each horizontal bar represents an annotation output, and each colored dot represents the sensitivity (green), specificity (purple) and F1 score (red) as described in Figure 2A. (B) The proportion of exact match to non-matching intron predictions (specificity) and exact match to missing intron predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. GeneMark-ES-F: GeneMark gene prediction software using the ‘fungus’ option. LoReAn-sF: LoReAn using strand information and the ‘fungus’ option of GeneMark-ES.
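The Pearson residuals shown for the chi-square tests (here and in Figure 2B) follow the standard definition: observed count minus expected count, divided by the square root of the expected count. The sketch below computes them for a contingency table of counts. The function name and the small example table are hypothetical, and the table layout actually used in the paper is not reproduced.

```python
from math import sqrt


def pearson_residuals(table):
    """Return (residuals, chi_square) for a contingency table of counts.

    table -- list of rows, each a list of observed counts
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")

    residuals = []
    chi_square = 0.0
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            r = (observed - expected) / sqrt(expected)
            res_row.append(r)
            chi_square += r * r  # the statistic is the sum of squared residuals
        residuals.append(res_row)
    return residuals, chi_square


# Hypothetical counts: rows are two annotations, columns are match / no match.
res, chi2 = pearson_residuals([[10, 20], [20, 10]])
print(res, round(chi2, 3))
```

A positive residual marks a cell with more counts than independence predicts and a negative residual marks fewer, which corresponds to the direction and magnitude encoded by color and size in the residual plots.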
Figure 5 graphic: (A) tracks: Mapped Short Reads; Mapped Long Reads; Long Read Strand (Forward, Reverse). (B) Gene models: MAKER2, CodingQuarry with ‘effector’ option, BAP, LoReAn-sF (Ave1 Iso-1 and Iso-2), Reference Ave1; Chr 5, coordinates 700,100, 700,750 and 701,250. (C) Ave1 cDNA sequencing, positions as printed: 70 AAGCAATT-----CGCCAATT 87 (without the insertion) and 70 AAGCAATTAGTAGCGCCAATT 92 (with the 5 bp insertion). (D) Primers: Ave1 fw, Ave1 rev, Ave1c fw, Ave1c rev.
Figure 5. The LoReAn pipeline most accurately annotates a specific fungal locus encoding a strain-specific gene. (A) Short-read RNA-seq data mapped to the locus are shown as a coverage plot (grey peaks) and as representative individual reads (yellow boxes). Long reads from single-molecule cDNA data mapped to the locus are shown as a coverage plot (grey peaks) and as representative reads (purple boxes). Thin black lines linking mapped reads represent gaps in the mapped reads and are indicative of introns. The long-read data were split by mapping strand and are shown as coverage plots for the forward (red) and reverse (blue) strands. (B) Gene model predictions from four annotation pipelines are illustrated. Light blue boxes represent untranslated regions (5’ and 3’ UTR), dark blue boxes represent coding sequence boundaries, and thin black lines depict introns. Arrows in the introns indicate the direction of transcription. The MAKER2 and BAP pipelines predict a single transcript coded on the reverse strand at the 3’ end of the known Ave1 transcript. LoReAn-sF predicts two transcripts corresponding to the Ave1 gene along with a transcript similar to the one predicted by MAKER2. The reference Ave1 transcript is shown in grey. (C) To confirm the presence of an alternative splice site in the 5’ UTR of the Ave1 transcript, 18 cDNA clones were randomly chosen and sequenced. The isoform 1 sequence is identical to the reference Ave1 sequence and was identified in 15 of the 18 clones. Isoform 2 has a 5 bp insertion in the 5’ UTR resulting from an alternative exon splice site and was identified in 3 of the 18 sequenced clones. The Ave1 reference sequence is shown from bases 71 through 86. (D) The presence of Ave1 and the additional gene transcribed at the 3’ end of Ave1, termed Ave1close (Ave1c), was confirmed using PCR on gDNA and cDNA. PCR using gene-specific primers, termed Ave1 fw + rev (pink arrows) or Ave1c fw + rev (yellow arrows), shows that both genes are expressed in potato dextrose broth (PDB), Czapek-dox (CPD) and half-strength Murashige-Skoog (1/2MS) media. The inverse orientation of the two genes was confirmed using forward primers only, which amplified the entire locus from gDNA, resulting in a band of approximately 1,118 bp, but did not amplify a product when cDNA was used as the template.
Figure 6 graphic: (A, C) sensitivity, specificity and F1 score; x-axes: Percent of exact match genes (A) and Percent of exact match introns (C). (B, D) Pearson residuals for chi-square; labels: Prediction Match, Prediction NOT Match, Reference Match, Reference NOT Match. Annotations compared: Reference Annotation, MAKER2, GeneMark-ES, LoReAn, LoReAn-M, LoReAn-s, LoReAn-sM, BRAKER1, BAP, BAP+.
Figure 6. Assessment of gene and intron predictions from P. crispa. (A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in the text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. (C) and (D) show the exact match intron analysis, represented as in (A) and (B), respectively.
Figure 7 graphic: (A, C) sensitivity, specificity and F1 score; x-axes: Percent of exact match genes (A) and Percent of exact match introns (C). (B, D) Pearson residuals for chi-square; labels: Prediction Match, Prediction NOT Match, Reference Match, Reference NOT Match. Annotations compared: Reference Annotation, MAKER2, GeneMark-ES, LoReAn, LoReAn-M, LoReAn-s, LoReAn-sM, Augustus, BAP, BAP+.
Figure 7. Assessment of gene and intron predictions from A. thaliana. (A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in the text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. (C) and (D) show the exact match intron analysis, represented as in (A) and (B), respectively.
Figure 8 graphic: (A, C) sensitivity, specificity and F1 score; x-axes: Percent of exact match genes (A) and Percent of exact match introns (C). (B, D) Pearson residuals for chi-square; labels: Prediction Match, Prediction NOT Match, Reference Match, Reference NOT Match. Annotations compared: Reference Annotation, MAKER2, GeneMark-ES, LoReAn, LoReAn-M, LoReAn-s, LoReAn-sM, BRAKER1, BAP, BAP+.
Figure 8. Assessment of gene and intron predictions from O. sativa. (A) Annotation quality metrics are shown for exact match genes as detailed for Figure 2A. LoReAn, LoReAn in non-stranded mode using a non-masked genome; LoReAn-M, LoReAn in non-stranded mode using a masked genome; LoReAn-s, LoReAn in stranded mode using a non-masked genome; LoReAn-sM, LoReAn in stranded mode using a masked genome; BAP, Broad Annotation Pipeline; BAP+, Broad Annotation Pipeline plus additional modifications described in the text. (B) The proportion of exact match to non-matching gene predictions (specificity) and exact match to missing gene predictions (sensitivity) were compared using a chi-square test of independence as described in Figure 2B. (C) and (D) show the exact match intron analysis, represented as in (A) and (B), respectively.
https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.
Parsed CitationsAbdel-Ghany SE, Hamilton M, Jacobi JL, Ngam P, Devitt N, Schilkey F, Ben-Hur A, Reddy ASN (2016) A survey of the sorghumtranscriptome using single-molecule long reads. Nature Communications 7: 11706
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Amemiya CT, Alföldi J, Lee AP, Fan S, Philippe H, Maccallum I, Braasch I, Manousaki T, Schneider I, Rohner N, et al (2013) The Africancoelacanth genome provides insights into tetrapod evolution. Nature 496: 311–316
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, van Bakel H, Schadt EE, Reijo-Pera RA, Underwood JG, et al (2013)Characterization of the human ESC transcriptome by hybrid sequencing. Proc Natl Acad Sci USA 110: E4821–30
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E (2015) The Arabidopsis information resource: Making and miningthe "gold standard" annotated reference plant genome. Genesis 53: 474–485
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M (2008) MAKER: an easy-to-useannotation pipeline designed for emerging model organism genomes. Genome Res 18: 188–196
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Chan K-L, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low E-TL (2017) Seqping: gene prediction pipeline for plant genomesusing self-training gene models and transcriptomic data. BMC Bioinformatics 18: 1426–7
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Chen F, Mackey AJ, Stoeckert CJ, Roos DS (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of orthologgroups. Nucleic Acids Res 34: D363–8
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O'Malley R, Figueroa-Balderas R, Morales-Cruz A, etal (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13: 1050–1054
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Cook DE, Mesarich CH, Thomma BPHJ (2015) Understanding plant immunity as a surveillance system to detect invasion. Annu RevPhytopathol 53: 541–563
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Davey JW, Chouteau M, Barker SL, Maroja L, Baxter SW, Simpson F, Joron M, Mallet J, Dasmahapatra KK, Jiggins CD (2016) MajorImprovements to the Heliconius melpomene Genome Assembly Used to Confirm 10 Chromosome Fusion Events in 6 Million Years ofButterfly Evolution. G3 (Bethesda) 6: 695–708
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
de Jonge R, van Esse HP, Maruthachalam K, Bolton MD, Santhanam P, Saber MK, Zhang Z, Usami T, Lievens B, Subbarao KV, et al(2012) Tomato immune receptor Ve1 recognizes effector of multiple fungal pathogens uncovered by genome and RNA sequencing.Proc Natl Acad Sci USA 109: 5110–5115
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
de Mendiburu F (2016) agricolae: Statistical Procedures for Agricultural Research. httpsCRAN.R-project.orgpackageagricolae,https://CRAN.R-project.org/package=agricolae
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universalRNA-seq aligner. Bioinformatics 29: 15–21
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Faino L, Seidl MF, Datema E, van den Berg GCM, Janssen A, Wittenberg AHJ, Thomma BPHJ (2015) Single-Molecule Real-TimeSequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome. mBio 6: e00936–15
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.
Faino L, Seidl MF, Shi-Kunne X, Pauper M, van den Berg GCM, Wittenberg AHJ, Thomma BPHJ (2016) Transposons passively andactively contribute to evolution of the two-speed genome of a fungal pathogen. Genome Res 26: 1091–1100
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Faino L, Thomma BPHJ (2014) Get your high-quality low-cost genome sequence. Trends in Plant Science 19: 288–291Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Fradin EF, Thomma BPHJ (2006) Physiology and molecular aspects of Verticillium wilt diseases caused by V. dahliae and V. albo-atrum.Mol Plant Pathol 7: 71–86
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Goodswen SJ, Kennedy PJ, Ellis JT (2012) Evaluating high-throughput ab initio gene finders to discover proteins encoded ineukaryotic pathogen genomes missed by laboratory techniques. PLoS ONE 7: e50609
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Gordon SP, Tseng E, Salamov A, Zhang J, Meng X, Zhao Z, Kang D, Underwood J, Grigoriev IV, Figueroa M, et al (2015) WidespreadPolycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10: e0132628
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al (2011) Full-lengthtranscriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Gremme G, Steinbiss S, Kurtz S (2013) GenomeTools: a comprehensive software library for efficient processing of structured genomeannotations. IEEE/ACM Trans Comput Biol Bioinform 10: 645–656
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR (2008) Automated eukaryotic gene structureannotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9: R7
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Haas BJ, Zeng Q, Pearson MD, Cuomo CA, Wortman JR (2011) Approaches to Fungal Genome Annotation. Mycology 2: 118–141Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation withGeneMark-ET and AUGUSTUS. Bioinformatics 32: 767–769
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genomeprojects. BMC Bioinformatics 12: 491
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Huang X, Adams MD, Zhou H, Kerlavage AR (1997) A tool for analyzing and annotating genomic sequences. Genomics 46: 37–45Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Jiao W-B, Schneeberger K (2017) The impact of third generation genomic technologies on plant genome assembly. Current Opinion inPlant Biology 36: 64–70
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Keibler E, Brent MR (2003) Eval: a software package for analysis of genome annotations. BMC Bioinformatics 4: 50Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Klosterman SJ, Atallah ZK, Vallad GE, Subbarao KV (2009) Diversity, pathogenicity, and management of verticillium species. Annu RevPhytopathol 47: 39–62
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Kohler A, Kuo A, Nagy LG, Morin E, Barry KW, Buscot F, Canbäck B, Choi C, Cichocki N, Clum A, et al (2015) Convergent losses ofdecay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nature Genetics 47: 410–415
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title https://plantphysiol.orgDownloaded on February 7, 2021. - Published by
Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.
Koren S, Phillippy AM (2015) One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly.Curr Opin Microbiol 23: 110–120
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Križanovic K, Echchiki A, Roux J, Šikic M (2017) Evaluation of tools for long read RNA-seq splice-aware alignment. Bioinformatics. doi:10.1093/bioinformatics/btx668
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Laehnemann D, Borkhardt A, McHardy AC (2016) Denoising DNA deep sequencing data-high-throughput sequencing errors and theircorrection. Brief Bioinformatics 17: 154–179
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al(2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res 40: D1202–D1210
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Lamichhaney S, Fan G, Widemo F, Gunnarsson U, Thalmann DS, Hoeppner MP, Kerje S, Gustafson U, Shi C, Zhang H, et al (2016)Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nature Genetics 48: 84–88
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Laver T, Harrison J, O'Neill PA, Moore K, Farbos A, Paszkiewicz K, Studholme DJ (2015) Assessing the performance of the OxfordNanopore Technologies MinION. Biomol Detect Quantif 3: 1–8
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Linde J, Duggan S, Weber M, Horn F, Sieber P, Hellwig D, Riege K, Marz M, Martin R, Guthke R, et al (2015) Defining the transcriptomiclandscape of Candida glabrata by RNA-Seq. Nucleic Acids Res 43: 1392–1406
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Loman NJ, Quick J, Simpson JT (2015) A complete bacterial genome assembled de novo using only nanopore sequencing data. NatMethods 12: 733–735
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Loman NJ, Quinlan AR (2014) Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics 30: 3399–3401Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene findingalgorithm. Nucleic Acids Res 42: e119–e119
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Ma L, Chen Z, Huang DW, Kutty G, Ishihara M, Wang H, Abouelleil A, Bishop L, Davey E, Deng R, et al (2016) Genome analysis of threePneumocystis species reveals adaptation mechanisms to life exclusively in mammalian hosts. Nature Communications 7: 10740
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Ming R, VanBuren R, Wai CM, Tang H, Schatz MC, Bowers JE, Lyons E, Wang M-L, Chen J, Biggers E, et al (2015) The pineapplegenome and the evolution of CAM photosynthesis. Nature Genetics 47: 1435–1442
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M, Sörensen TR, Weisshaar B, Himmelbauer H (2015)Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol 16: 184
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Muñoz JF, Gauthier GM, Desjardins CA, Gallo JE, Holder J, Sullivan TD, Marty AJ, Carmen JC, Chen Z, Ding L, et al (2015) The DynamicGenome and Transcriptome of the Human Fungal Pathogen Blastomyces and Close Relative Emmonsia. PLoS Genet 11: e1005493
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Phillippy AM (2017) New advances in sequence assembly. Genome Res 27: xi–xiiihttps://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Presti Lo L, Lanver D, Schweizer G, Tanaka S, Liang L, Tollot M, Zuccaro A, Reissmann S, Kahmann R (2015) Fungal effectors and plantsusceptibility. Annu Rev Plant Biol 66: 513–545
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Reid I, O'Toole N, Zabaneh O, Nourzadeh R, Dahdouli M, Abdellateef M, Gordon PMK, Soh J, Butler G, Sensen CW, et al (2014)SnowyOwl: Accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models. BMCBioinformatics 15: 229
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Smith CD, Zimin A, Holt C, Abouheif E, Benton R, Cash E, Croset V, Currie CR, Elhaik E, Elsik CG, et al (2011) Draft genome of theglobally widespread and invasive Argentine ant (Linepithema humile). Proc Natl Acad Sci USA 108: 5673–5678
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Smith JJ, Kuraku S, Holt C, Sauka-Spengler T, Jiang N, Campbell MS, Yandell MD, Manousaki T, Meyer A, Bloom OE, et al (2013)Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature Genetics 45: 415–21– 421e1–2
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Sperschneider J, Dodds PN, Gardiner DM, Manners JM, Singh KB, Taylor JM (2015) Advances and challenges in computationalprediction of effectors from plant pathogenic fungi. PLoS Pathog 11: e1004806
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novogene finding. Bioinformatics 24: 637–644
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Team RC (2016) R: A Language and Environment for Statistical Computing. https://www.R-project.org/Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initioalgorithm with unsupervised training. Genome Res 18: 1979–1990
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden Markov model gene prediction in fungalgenomes using RNA-seq transcripts. BMC Genomics 16: 170
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Thomma BPHJ, Seidl MF, Shi-Kunne X, Cook DE, Bolton MD, van Kan JAL, Faino L (2016) Mind the gap; seven reasons to closefragmented genome assemblies. Fungal Genet Biol 90: 24–30
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, Lu Z, Olson A, Stein JC, Ware D (2016) Unveiling the complexity of the maizetranscriptome by single-molecule long-read sequencing. Nature Communications 7: 11708
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10: 57–63Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859–1875
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
Yandell M, Ence D (2012) A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13: 329–342Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title https://plantphysiol.orgDownloaded on February 7, 2021. - Published by
Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.
Zheng Y, Zhao L, Gao J, Fei Z (2011) iAssembler: a package for de novo assembly of Roche-454/Sanger transcriptome sequences. BMCBioinformatics 12: 453
Pubmed: Author and TitleGoogle Scholar: Author Only Title Only Author and Title
https://plantphysiol.orgDownloaded on February 7, 2021. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.