B8IM2004
Generating fluent sentences from early-stage
drafts for academic writing assistance
Takumi Ito
January 27, 2020
Graduate School of Information Sciences
Tohoku University
A Master’s Thesis
submitted to System Information Sciences,
Graduate School of Information Sciences,
Tohoku University
in partial fulfillment of the requirements for the degree of
MASTER of INFORMATION SCIENCE
Takumi Ito
Thesis Committee:
Professor Kentaro Inui (Supervisor)
Professor Shinichiro Omachi
Professor Kazuyuki Tanaka
Associate Professor Jun Suzuki (Co-supervisor)
Generating fluent sentences from early-stage
drafts for academic writing assistance∗
Takumi Ito
Abstract
The writing process consists of several stages such as drafting, revising, edit-
ing, and proofreading. Studies on writing assistance, such as grammatical er-
ror correction (GEC), have mainly focused on sentence editing and proofreading,
where surface-level issues such as typographical, spelling, or grammatical errors
should be corrected. We broaden this focus to include the earlier revising stage,
where sentences require adjustment of the information included or major rewriting,
and propose Sentence-level Revision (SentRev) as a new writing assistance
task. Well-performing systems in this task can help inexperienced authors by
producing fluent, complete sentences given their rough, incomplete drafts. We
build a new freely available crowdsourced evaluation dataset consisting of incom-
plete sentences authored by non-native writers paired with their final versions
extracted from published academic papers for developing and evaluating SentRev
models. We also establish baseline performance on SentRev using our newly built
evaluation dataset.
Keywords:
Natural Language Generation, Dataset, Writing Assistance
∗System Information Sciences, Graduate School of Information Sciences, Tohoku University,
B8IM2004, January 27, 2020.
Automatic revision of draft sentences for academic writing assistance∗
Takumi Ito
Summary (translated from Japanese)
The writing process consists of several stages, such as drafting, editing, correcting, and
proofreading. Studies on writing assistance, such as grammatical error correction (GEC),
have mainly addressed surface-level problems such as spelling and grammatical errors.
On the other hand, supporting the drastic rewriting that often takes place in the early
stages of writing is also considered important for assisting authors. In this thesis, we
broaden the scope of grammatical error correction and propose Sentence-level Revision
(SentRev), a task of automatic revision from draft sentences. To develop and evaluate
SentRev models, we construct a dataset consisting of pairs of sentences from academic
papers and draft sentences written by non-native English speakers. Furthermore, we
build baseline models and evaluate them on the constructed evaluation dataset.
Keywords:
Natural Language Generation, Dataset, Writing Assistance
∗Department of System Information Sciences, Graduate School of Information Sciences, Tohoku University, B8IM2004, January 27, 2020.
Contents
1 Introduction
2 The Sentence-level Revision task
3 The Smith dataset
  3.1 Dataset creation
  3.2 Statistics
4 Analysis of the Smith dataset
  4.1 Error type comparison
  4.2 Human error type analysis
  4.3 Human fluency analysis
  4.4 Sentence-level linguistic characteristics
5 Experiments
  5.1 Baseline models
    5.1.1 Heuristic noising and denoising model
    5.1.2 Enc-Dec noising and denoising model
    5.1.3 GEC model
  5.2 Evaluation metrics
6 Results
7 Related work
  7.1 Writing assistance in the academic domain
  7.2 Grammatical error correction
  7.3 Style transfer
  7.4 Text completion
8 Conclusion
Acknowledgements
Appendix
A Lexical tendencies
B Heuristic noising algorithm
C Examples from the Smith dataset and sentences generated by the baseline models
List of Figures
1 Overview of the estimated process of writing the sentence “Our model shows excellent performance in this task.” Writing activity consists of four stages: (i) drafting, (ii) revising, (iii) editing, and (iv) proofreading.
2 Overview of the crowdsourcing protocol for creating an evaluation dataset for the SentRev task.
3 Comparison of the top 10 frequent errors observed in the 3 datasets.
4 Examples of “OTHER” operations predicted by the ERRANT toolkit.
5 Result of the English experts’ analyses of error types in draft sentences on our Smith dataset. The scores show the ratio of sentences where the targeted type of errors occurred.
6 Comparison of the 10 most frequent error types in Smith and synthetic drafts created by the Enc-Dec noising methods.
7 Performance of the ED-ND baseline model on the 10 most frequent error types in Smith.
8 Characteristic words and phrases in draft sentences and reference sentences in the development set of Smith.
List of Tables
1 Examples of sentence-level revisions in our Smith dataset. Our task is to transform the draft sentences into their corresponding reference sentences.
3 Criteria for evaluating workers. L.D. denotes the Levenshtein distance.
4 Comparison with existing datasets. w/<*> and w/change denote the percentage of source sentences with <*> and the percentage where the source and target sentences differ, respectively. L.D. indicates the averaged character-level Levenshtein distance between the pairs of sentences.
5 Comparison of the draft and reference sentences in Smith. FRE and PPL scores were calculated once for each sentence and then averaged over all the sentences in the development set of Smith.
6 Examples of the generated training data.
7 Results of quantitative evaluation. Gramm. denotes the grammaticality score.
8 Examples of the output from the baseline models. Bold text indicates tokens introduced by the model.
9 Further examples of draft, reference, and the baseline models’ output.
[Figure 1: schematic of the four stages applied by a non-native, inexperienced writer to one sentence. Drafting: “Model have good results.” Revising: “Our model show good result in this task.” Editing: “Our model shows a excellent perfomance in this task.” Proofreading (final version): “Our model shows excellent performance in this task.” Existing studies cover editing and proofreading; our focus adds the revising stage.]
Figure 1: Overview of the estimated process of writing the sentence “Our model
shows excellent performance in this task.” Writing activity consists of four stages:
(i) drafting, (ii) revising, (iii) editing, and (iv) proofreading.
1 Introduction
Models developed for academic writing assistance using existing datasets can
serve as a support system during the final stages by editing a nearly finished
version of the draft. For example, Daudaravicius [1] collects scientific papers
before and after professional editing from publishing companies, and Dale and
Kilgarriff [2] extract already published papers that still contain errors and correct
the errors to obtain target fragments of text.
Process-writing pedagogy, however, asserts that writing comprises several pro-
cesses [3, 4, 5] as shown in Figure 1. This study takes on the challenge of au-
tomatic assistance in both the final checking process (proofreading and editing)
and the earlier stages of writing (revising). In the revising stage, authors may
drastically modify the wording and supplement some words, a highly demanding
task for non-native or less experienced writers. Assistance in this stage has been
less explored in NLP.
In this study, we design a new type of academic writing assistance task, Sentence-
level Revision (SentRev), where a system receives an early draft of a sentence,
and generates a revised, error-free, proofread version.
A critical issue in tackling this type of assistance task is that evaluation re-
sources are scarce since early-stage draft sentences are not usually publicly avail-
able. To overcome this limitation, we release an evaluation dataset of pairs of
draft sentences and their final versions, the Set of Modified Incomplete TecHni-
cal paper sentences (Smith), that we created using crowdsourcing techniques.
Additionally, we evaluate the quality of our dataset and extensively analyze the
characteristics of the obtained drafts. Finally, we train unsupervised models and
report the baseline performance for our task on the Smith evaluation dataset.
Our contribution is fourfold:
• We propose a new task—SentRev.
• We create an evaluation dataset, Smith, for SentRev using a new crowd-
sourcing approach and release it.1
• We compare the characteristics of our dataset with major corpora and an-
alyze the obtained draft sentences.
• We establish baseline scores for SentRev.
1https://github.com/taku-ito/INLG2019_SentRev
2 The Sentence-level Revision task
Table 1: Examples of sentence-level revisions in our Smith dataset. Our task is
to transform the draft sentences into their corresponding reference sentences.
Draft However, the F1 score of KBP 2017 coupus <*> decreased by the sub
event base rule.
Reference However, subevent based constraints slightly reduced the F1 scores on
KBP 2017 corpus.
Draft But, there are some important difference to <*> our work unique.
Reference However, there exist several key differences that make our work
unique.
The proposed task, SentRev, is revising and editing incomplete draft sentences
to create final versions. Examples of sentence-level revision are shown in Table 1.
A draft sentence, x, may have several types of problems. Surface-level problems
such as typographical errors, spelling errors, or grammatical errors are a common
occurrence. Wording problems, such as collocation errors or expressions being
stylistically odd or inappropriate for the academic domain, are also typical of
rough sentences written by non-native, inexperienced writers. The third type of
error is information gaps. Information gaps are cases where the author likely
could not find the appropriate wording for the idea he or she wanted to convey,
such as a specific expression common in the academic domain or a technical term.
In addition, a draft sentence may be missing sections without the author being
aware of this. Solving the aforementioned problems in a draft sentence would
elevate the draft sentence x to its final or nearly final version y with greatly
improved correctness and fluency. Ideally, a single error-free and correctly filled-
in final version should be generated while considering the context of the sentence.
However, as a first step, an assistance system may output a set of likely candidates
for the user to choose from or be inspired by, which would be realistic for a real-
world application.
Our proposed task is, therefore, to generate likely final versions y from early-
draft sentences x. For this purpose, we provide an evaluation dataset, Smith,
comprising pairs of drafts and their final versions (X, Y ).
[Figure 2: the four-phase pipeline producing Smith. (i) Extract final-version sentences Y^cand from published papers (e.g., y^cand = “Our model shows excellent performance in this task.”). (ii) Translate each sentence into L′ and show y^cand_L′ to native speakers of L′ (e.g., このタスクでは、モデルが優れた性能を示す。 in Japanese). (iii) The crowd workers write English translations y^cand_L′→en (e.g., “Model have good results.”), which become the drafts x^cand. (iv) Apply quality control to filter the candidate pairs (X^cand, Y^cand) into (X, Y).]
Figure 2: Overview of the crowdsourcing protocol for creating an evaluation
dataset for the SentRev task.
3 The Smith dataset
3.1 Dataset creation
Process overview Although we cannot collect “drafts” X from published pa-
pers, we can easily collect the “final versions” Y . We also have access to non-
native, inexperienced writers through crowdsourcing services. Our test set cre-
ation process combines these two factors (Figure 2). The protocol consists of the
following four phases:
(i) Collecting a large number of sentences written by experts, Y^cand, from published papers.
(ii) Translating them into another language L′, resulting in sentences Y^cand_L′.
(iii) Asking native speakers of L′ to translate Y^cand_L′ back into English, Y^cand_L′→en, through crowdsourcing. Henceforth, we denote Y^cand_L′→en as X^cand.
(iv) Filtering the pairs (X^cand, Y^cand) to ensure the quality of the dataset (X, Y).
This setting is analogous to the situation non-native writers face, as Cohen
and Brooks-Carson [6] report that non-native speakers tend to formulate text in
their native language and mentally translate it into the target second language. We assume
that most crowdworkers have never written an academic paper, and that the
target users of SentRev-based systems also include this type of inexperienced
writer.
To control the quality of the drafts, we first create many candidate pairs of
drafts and reference sentences (X^cand, Y^cand) and then filter them to create the
quality-controlled set (X, Y). The following subsections detail this process.
Collecting final version sentences We collected sentences Y^cand from the
ACL Anthology Sentence Corpus (AASC).2 We extracted the sentences that satisfied
the following conditions from the AASC as Y^cand (a minimal sketch of this filter follows the list):
• accepted to ACL 2018,
• 70 to 120 characters long,
• does not include mathematical symbols, special tokens for citations, URLs,
Greek letters, or other special symbols defined in AASC, and
• free of clear conversion mistakes when automatically extracted from PDFs.
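These conditions translate into a simple per-sentence check. The following is a minimal Python sketch of such a filter; the regular expressions standing in for AASC's special tokens and symbol classes are illustrative assumptions, not the exact rules used.

import re

def is_final_version_candidate(sentence: str) -> bool:
    # 70 to 120 characters long
    if not 70 <= len(sentence) <= 120:
        return False
    # no mathematical symbols or Greek letters
    if re.search(r"[=+±×÷∑∏∫√α-ωΑ-Ω]", sentence):
        return False
    # hypothetical markers for AASC citation/URL special tokens
    if re.search(r"<cit>|<url>|https?://", sentence):
        return False
    return True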
Creating draft sentences We used Japanese as L′. First, we translated Y^cand
into Japanese using Google Translate.3 We denote the Japanese versions of Y^cand
by Y^cand_ja. To guarantee the quality of Y^cand_ja, two authors of this paper, who are
native speakers of Japanese, inspected all the sentences in Y^cand_ja and removed
those that at least one of them judged to be incorrect translations.
Next, we asked each Japanese crowdworker to translate three sentences from
Y^cand_ja into English, Y^cand_ja→en, within 15 minutes. The appropriate time limit and rules
were determined based on several trial tasks.
The workers were allowed to insert the special symbol <*> in places where they
could not think of a good expression for that position in their answer Y^cand_ja→en.
This instruction revealed the information gaps that the authors of the drafts
consciously left empty. (An author may also be unaware that a draft sentence is
missing sections.) A total of 306 workers participated in our crowdsourcing task.
2https://github.com/KMCS-NII/AASC
3https://translate.google.com/
Quality control We designed thorough filtering criteria and applied them to
the workers because Yahoo! Crowdsourcing,4 a Japanese crowdsourcing service,
does not provide filtering based on workers' writing skills or abilities. We therefore
filtered workers based on their writing activity: we scored each worker on the
three answers they produced, using the criteria detailed in Table 3, and accepted
work from workers with a score of 0 or higher as valid (a sketch of this scoring
appears after Table 3). The hyperparameters were determined in trial experiments.
We used spaCy-CLD5 for language detection.
Table 3: Criteria for evaluating workers. L.D. denotes the Levenshtein distance.
Criterion                                                Judgment
Working time is too short (< 2 minutes)                  Reject
All answers are too short (< 4 words)                    Reject
No answer ends with “.” or “?”                           Reject
Some answers are identical                               Reject
Some answers have Japanese words                         Reject
No answer is recognized as English                       Reject
Some answers are too short (< 4 words)                   -2 points
Some answers use fewer than 4 kinds of words             -2 points
Too close to automatic translation (20 ≤ L.D. ≤ 30)      -0.5 points/answer
Too close to automatic translation (10 ≤ L.D. ≤ 20)      -1.5 points/answer
Too close to automatic translation (L.D. ≤ 10)           Reject
All answers end with “.” or “?”                          +1 point
Some answers have <*>                                    +1 point
All answers are recognized as English                    +1 point
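A minimal sketch of this scoring in Python follows; the is_english helper (e.g., via spaCy-CLD) and the lev Levenshtein-distance helper are assumed, and the thresholds follow Table 3.

import re

def score_worker(answers, machine_translations, minutes, is_english, lev):
    # returns None for a rejected worker, otherwise the worker's score
    if minutes < 2:                                        # working time too short
        return None
    if all(len(a.split()) < 4 for a in answers):           # all answers too short
        return None
    if not any(a.endswith((".", "?")) for a in answers):   # no answer ends with . or ?
        return None
    if len(set(answers)) < len(answers):                   # identical answers
        return None
    if any(re.search(r"[\u3040-\u30ff\u4e00-\u9fff]", a) for a in answers):
        return None                                        # Japanese words remain
    if not any(is_english(a) for a in answers):
        return None
    score = 0.0
    if any(len(a.split()) < 4 for a in answers):           # some answers too short
        score -= 2
    if any(len(set(a.split())) < 4 for a in answers):      # fewer than 4 kinds of words
        score -= 2
    for ans, mt in zip(answers, machine_translations):
        d = lev(ans, mt)
        if d <= 10:                                        # too close to automatic translation
            return None
        elif d <= 20:
            score -= 1.5
        elif d <= 30:
            score -= 0.5
    if all(a.endswith((".", "?")) for a in answers):
        score += 1
    if any("<*>" in a for a in answers):
        score += 1
    if all(is_english(a) for a in answers):
        score += 1
    return score                                           # accept work if score >= 0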
In addition, to remove instances with too large a gap, we automatically filtered
out the obtained (x^cand, y^cand) ∈ (X^cand, Y^cand) whose unigram overlap coefficient
was considerably low:

    |U(x^cand_checked) ∩ U(y^cand)| / min{ |U(x^cand_checked)|, |U(y^cand)| } < α,

where U(·) is the set of tokens excluding stop words and the special token (<*>).
4https://crowdsourcing.yahoo.co.jp/
5https://github.com/nickdavidhaynes/spacy-cld
x^cand_checked is the spell-checked version6 of x^cand. α is set to 0.4, which was determined
in trial experiments.
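The overlap-coefficient filter can be implemented directly from this definition. A minimal sketch follows; the tokenization scheme and stop-word list are not specified in the text, so they are left as parameters.

def overlap_coefficient(x_tokens, y_tokens, stopwords):
    # x_tokens should come from the spell-checked draft x^cand_checked
    u_x = {t for t in x_tokens if t not in stopwords and t != "<*>"}
    u_y = {t for t in y_tokens if t not in stopwords and t != "<*>"}
    if not u_x or not u_y:
        return 0.0
    return len(u_x & u_y) / min(len(u_x), len(u_y))

ALPHA = 0.4  # pairs with a coefficient below ALPHA are discarded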
We collected 10,804 pairs of drafts and their final versions, which cost approximately
US$4,200, including the trial rounds of crowdsourcing.
Unfortunately, works produced by unmotivated workers could have evaded the
aforementioned filters and lowered the quality of our dataset. For example, work-
ers could have bypassed the filter by simply repeating popular phrases in academic
writing (“We apply we apply”). To estimate the frequency of such examples, we
sampled 100 (x, y) pairs from (X,Y ) and asked an NLP researcher (not an author
of this paper) fluent in Japanese and English to check for examples where x was
totally irrelevant to xja, which was shown to the crowdworkers when creating x.
The expert observed no completely inappropriate examples, but noted a small
number of clearly subpar translations. Therefore, 95% of sentence pairs were
determined to be appropriate. This result shows that, overall, our method was
suitable to create the dataset and confirms the quality of Smith.
3.2 Statistics
Table 4 shows the statistics of our Smith dataset and a comparison with major
datasets for building a writing assistance system [7, 8, 1]. The size of our dataset
(10k sentence pairs) is six times greater than that of JFLEG, which contains both
grammatical errors and nonfluent wording. In addition, our dataset simulates
significant editing—99% of the pairs have some changes between the draft and
its corresponding reference, and 33% of the draft sentences contain gaps indicated
by the special token <*>. We also measured the amount of change from the drafts
X to the references Y by using the Levenshtein distance between them. A higher
Levenshtein distance between the X and Y sentences in our dataset indicated
more significant differences between them compared with major GEC corpora.
This finding implies that our dataset emulates more drastic rephrasing.
6We corrected spelling errors using https://github.com/barrust/pyspellchecker
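The character-level L.D. statistic in Table 4 can be reproduced with a standard dynamic-programming Levenshtein implementation, sketched below.

def levenshtein(a: str, b: str) -> int:
    # character-level Levenshtein (edit) distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]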
Dataset   size   w/<*>   w/change   L.D.
Lang-8    2.1M   -       42%        3.5
AESW      1.2M   -       39%        4.8
JFLEG     1.5k   -       86%        12.4
Smith     10k    33%     99%        47.0
Table 4: Comparison with existing datasets. w/<*> and w/change denote the
percentage of source sentences with <*> and the percentage where the source and
target sentences differ, respectively. L.D. indicates the averaged character-level
Levenshtein distance between the pairs of sentences.
[Figure 3: grouped bar chart of the error-type frequencies (%) in Smith, JFLEG, and AESW.]
Figure 3: Comparison of the top 10 frequent errors observed in the 3 datasets.
4 Analysis of the Smith dataset
In this section, we run extensive analyses on the sentences written by non-native
workers (draft sentences X), and the original sentences extracted from the set of
accepted papers (reference sentences Y ). We randomly selected a set of 500 pairs
from Smith as the development set for analysis.
4.1 Error type comparison
To obtain the approximate distributions of error types between the source and
target sentences, we used ERRANT [9, 10]. Next, we compared them with three
datasets: Smith, AESW (the same domain as Smith), and JFLEG (has a rel-
atively close Levenshtein distance to Smith). To calculate the error type distri-
[Figure 4 examples (both edits tagged OTHER by ERRANT):
Draft: the best models are very effective on the condition that they are far greater than human.
Reference: The best models are very effective in the local context condition where they significantly outperform humans.
Draft: Results show MARM tend to generate <*> and very short responces.
Reference: The results indicate that MARM tends to generate specific but very short responses.]
Figure 4: Examples of “OTHER” operations predicted by the ERRANT toolkit.
[Figure 5: horizontal bar chart (0-100%) over five categories: grammatical errors, problems in wording, orthographic errors, lack of information, and others.]
Figure 5: Result of the English experts’ analyses of error types in draft sentences
on our Smith dataset. The scores show the ratio of sentences where the targeted
type of errors occurred.
butions on AESW and JFLEG, we randomly sampled 500 pairs of source and
target sentences from each corpus. Figure 3 shows the results of the compari-
son. Although all datasets contained a mix of error types and operations, the
Smith dataset included more “OTHER” operations than the other two datasets.
Manual inspection of some samples of “OTHER” operations revealed that they
tend to inject information missing in the draft sentence (Figure 4). This finding
confirms that our dataset emphasizes a new, challenging “completion-type” task
setting for writing assistance.
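The error-type counting behind Figure 3 can be reproduced with ERRANT's Python API. The following is a minimal sketch assuming the ERRANT v2-style API; the thesis does not specify how the toolkit was invoked.

from collections import Counter
import errant

annotator = errant.load("en")  # spaCy-based English annotator

def error_type_distribution(pairs):
    # count ERRANT edit types over (source, target) sentence pairs
    counts = Counter()
    for src, tgt in pairs:
        orig, cor = annotator.parse(src), annotator.parse(tgt)
        for edit in annotator.annotate(orig, cor):
            counts[edit.type] += 1  # e.g., 'R:OTHER', 'M:DET'
    return counts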
4.2 Human error type analysis
To understand the characteristics of our dataset in detail, an annotator proficient
in English (not an author of this paper) analyzed the types of errors in the draft
sentences (Figure 5). The most frequent errors were fluency problems (e.g., “In
Table 5: Comparison of the draft and reference sentences in Smith. FRE and
PPL scores were calculated once for each sentence and then averaged over all the
sentences in the development set of Smith.
Data FRE passive voice (%) word repetition (%) PPL
Draft X 45.5 34.0 33.0 1373
Reference Y 40.0 29.6 28.6 147
these ways” instead of “In these methods”), characterized by errors in academic
style and wording, which are out of the scope of traditional GEC. Another notable
type of frequent error was lack of information, which further distinguishes this
dataset from other datasets.
4.3 Human fluency analysis
We outsourced the scoring of the fluency of the given draft and reference sentence
pairs to three annotators proficient in English. Nearly every draft x (94.8%) was
marked as being less fluent than its corresponding reference y, confirming that
obtaining high performance with our dataset requires the ability to transform
rough input sentences into more fluent sentences.
4.4 Sentence-level linguistic characteristics
We computed some sentence-level linguistic measures over the dataset sentences:
Flesch Reading Ease (FRE) [11], passive voice7, word repetition, and perplexity
(PPL) (Table 5).
FRE measures the readability of a text, namely, how easy it is to understand
(higher is easier). The draft sentences consistently demonstrated higher FRE
scores than their reference counterparts, which may be attributed to the latter
containing more sophisticated language and technical terms.
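FRE can be computed with an off-the-shelf readability package. The snippet below is a minimal sketch using the textstat package; this choice is our assumption, since the thesis does not name its FRE implementation.

import textstat

draft = "Model have good results ."
reference = "Our model shows excellent performance in this task ."
print(textstat.flesch_reading_ease(draft))      # higher = easier to read
print(textstat.flesch_reading_ease(reference))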
In addition, workers tended to use the passive voice and to repeat words within
a narrow span; both of these phenomena should be avoided in academic writing.
We conducted further analyses of lexical tendencies between the drafts and
7https://github.com/armsp/active_or_passive
references (Appendix A).
Finally, we analyzed the draft and the reference sentences using PPL calculated
by a 5-gram language model trained on ACL Anthology papers.8 The higher PPL
scores in the draft sentences (Table 5) suggest that they have properties unsuitable
for academic writing (e.g., less fluent wording).
8PPL is calculated with the implementation available in KenLM (https://github.com/kpu/kenlm), tuned on AASC (excluding the texts used for building Smith).
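A sketch of this PPL computation with the KenLM Python bindings follows; the model path is a hypothetical placeholder for a 5-gram model trained on AASC.

import kenlm

# "aasc_5gram.binary" is a hypothetical path to a KenLM 5-gram model
# trained on AASC (excluding the texts used for building Smith)
model = kenlm.Model("aasc_5gram.binary")

def sentence_ppl(sentence: str) -> float:
    return model.perplexity(sentence)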
Table 6: Examples of the generated training data.

Heuristic
  original:  Besides , the recognizer successfully rejected only 15 out of 42 negative sentences .
  generated: recognizer Besides successfully , the informativeness rejected of out <*>

Grammatical error generation
  original:  We plan to analyze these direct communications and interaction of sentiments expressed in these sequences of posts .
  generated: We plan to analysis the direct communication interaction of sentiments express in these sequence of posts .

Style removal
  original:  This experiment suggested that there were ambiguities in these pointing gestures and led to a redesign of the system .
  generated: This experiment indicated the ambiguity found in the pointing gestures and caused a renewal of the system .

Entailed sentence generation
  original:  Figure 2 illustrates the effectiveness of different features class.
  generated: There is different feature in figure 2 .
5 Experiments
5.1 Baseline models
We evaluated three baseline models on the SentRev task.
5.1.1 Heuristic noising and denoising model
We have access to a large number of final versions of academic papers, and noising
and denoising approaches have gained attention in the GEC and machine translation
fields [12, 13, 14]. We combined these two factors to train baseline models on noised
final-version sentences.
First, we collected 4,898,146 sentences Y^aasc from the AASC that satisfied the
following conditions: (i) not included in the Smith dataset, (ii) not too long
or too short (between 5 and 35 tokens), and (iii) over 50% of the characters were
alphabetic. Next, we created a training dataset (X^aasc_hrst, Y^aasc) by adding noise to
Y^aasc.
As the simplest approach to noising, we used a set of heuristic rules that randomly
delete, replace, and swap words in the reference sentences. Specifically, these rules
delete words with a probability of 0.1, replace words with a token that appeared
over 10,000 times in Y^aasc with a probability of 0.1, and randomly shuffle the
sentence while keeping originally adjacent words within three words of each other.
We then randomly replaced up to 50% of the words with a <*> token (see Appendix B
for the detailed algorithm). This method generated 4.8M heuristically noised sentences.
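A runnable approximation of these rules is sketched below; the noisy-sort-key shuffle and the span-masking loop are our simplifications of Algorithm 1 in Appendix B, not the exact implementation.

import random

def heuristic_noise(tokens, common_terms, p_del=0.1, p_rep=0.1, window=3):
    # delete words with probability 0.1
    tokens = [t for t in tokens if random.random() >= p_del]
    # replace words with frequent ACL terms with probability 0.1
    tokens = [random.choice(common_terms) if random.random() < p_rep else t
              for t in tokens]
    # shuffle, keeping each token within `window` positions of its origin
    keyed = sorted(enumerate(tokens),
                   key=lambda it: it[0] + random.uniform(0, window))
    tokens = [t for _, t in keyed]
    # mask up to 50% of the tokens, one <*> per masked span
    n_mask = int(len(tokens) * random.uniform(0, 0.5))
    while n_mask > 0 and tokens:
        span = random.randint(1, n_mask)
        start = random.randrange(len(tokens))
        tokens[start:start + span] = ["<*>"]
        n_mask -= span
    return tokens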
Subsequently, we trained a denoising model (a mapping function from X^aasc_hrst to
Y^aasc) using the Transformer [15] implemented in fairseq [16]. We used the Adam
optimizer [17] with α = 0.0005, β1 = 0.9, β2 = 0.98, and ϵ = 10^−8. We limited
the maximum number of tokens per minibatch to 3,000, limited the maximum number
of updates to 500,000, and used a dropout rate of 0.3. The input and output
texts were tokenized and then segmented into character bigrams. We used a
beam width of 5 for decoding. This model is our first baseline model for the
SentRev task (henceforth, H-ND).
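The character-bigram segmentation can be illustrated as follows. This is one plausible reading (non-overlapping character pairs); the exact scheme, including how whitespace is represented, is not specified in the text, so the '_' convention here is an assumption.

def char_bigrams(sentence: str) -> list:
    # '_' marks spaces; this convention is assumed for illustration
    chars = sentence.replace(" ", "_")
    return [chars[i:i + 2] for i in range(0, len(chars), 2)]

print(char_bigrams("our model"))  # ['ou', 'r_', 'mo', 'de', 'l']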
5.1.2 Enc-Dec noising and denoising model
As an extension of the heuristic noising and denoising model, we changed the
noising methods to simulate the characteristics of X in Smith more faithfully than
the heuristic rules in Section 5.1.1 do. As described in Section 4, the drafts tended
to (i) contain grammatical errors, (ii) use stylistically improper wording, and (iii)
lack certain words. We used the following three neural encoder-decoder (Enc-Dec)
models to generate the synthetic draft sentences.
Grammatical error generation Here, we trained a model that introduces
synthetic grammatical errors to “clean” sentences by using a “flipped” dataset
from GEC (clean → erroneous). We used nonidentical (source, target) sentence
pairs from the Lang-8, AESW, and JFLEG datasets.
Style removal To generate stylistically unnatural sentences in the academic
domain, we used paraphrasing, which preserves a sentence’s content while disre-
garding its style. We used the ParaNMT-50M dataset [18], a paraphrase dataset
automatically created using Enc-Dec translation. We extracted parallel sentences
Table 7: Results of quantitative evaluation. Gramm. denotes the grammaticality
score.
Model BLEU ROUGE-L BERT-P BERT-R BERT-F P R F0.5 Gramm. PPL
Draft X 9.8 46.8 75.9 78.2 77.0 - - - 92.9 1454
H-ND 8.2 45.0 77.0 76.1 76.5 5.4 2.9 4.6 94.1 406
ED-ND 15.4 51.1 80.9 80.0 80.4 21.8 12.8 19.2 96.3 236
GEC 11.9 49.0 80.8 79.1 79.9 22.2 6.2 14.6 96.7 414
Reference Y - - - - - - - - 96.5 147
with annotated paraphrase scores between 0.7 and 0.95 from the ParaNMT-50M
dataset and used swapped pairs of source and target sentences in the dataset.
Entailed sentence generation To simulate the missing words in the draft sen-
tences, we trained a model that generated a sentence entailed with the given text.
We extracted entailed sentence pairs from the SNLI [19] and the MultiNLI [20]
datasets.
Random noising beam search As Xie et al. [13] pointed out, a standard
beam search often yields hypotheses that are too conservative. This tendency
leads the noising models to generate synthetic draft sentences similar to their
references. To address this problem, we applied the random noising beam search
[13] on all three noising models. Specifically, during the beam search, we added
rβ to the scores of the hypotheses, where r is a value sampled from a uniform
distribution over the interval [0, 1], and β is a penalty hyperparameter set to 5.
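The score perturbation itself is a few lines; the following is a simplified, framework-independent sketch (in practice this is applied inside the beam search of the noising models).

import random

BETA = 5.0  # noise penalty hyperparameter

def noisy_hypothesis_score(score: float) -> float:
    # add r * beta, r ~ Uniform(0, 1), to a hypothesis score [13]
    return score + random.uniform(0.0, 1.0) * BETA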
We obtained 14.6M sentence pairs (X^aasc_encdec, Y^aasc) by applying these Enc-Dec
noising models to Y^aasc. To train the denoising model, we used both (X^aasc_hrst,
Y^aasc) and (X^aasc_encdec, Y^aasc). The model architecture was the same as that of
the heuristic model. This denoising model is our second baseline model (ED-ND).
To facilitate research on the SentRev task, we release all 19.6M synthetic sentence pairs.9
9https://github.com/taku-ito/INLG2019_SentRev
[Figure 6: bar chart (%) comparing error-type frequencies between drafts in Smith and drafts in the synthetic data.]
Figure 6: Comparison of the 10 most frequent error types in Smith and synthetic
drafts created by the Enc-Dec noising methods.
[Figure 7: F0.5 of the ED-ND model per error type (roughly 0-0.5), over OTHER, DET, NOUN, NOUN:NUM, PREP, VERB, PUNCT, SPELL, ORTH, and ADJ.]
Figure 7: Performance of the ED-ND baseline model on the 10 most frequent
error types in Smith.
Analysis of the synthetic drafts Finally, we analyzed the error type distribution
of the synthetic data used for training the Enc-Dec noising and denoising
model with ERRANT (Figure 6). The error type distribution of the synthetic
dataset showed tendencies similar to those of the development set of Smith
(i.e., real drafts); the Kullback–Leibler divergence between the two distributions
was 0.139. This result supports the validity of our assumption that the SentRev
task is a combination of GEC, style transfer, and a completion-type task.
Table 6 shows examples of the training data generated by the noising models
described in Sections 5.1.1 and 5.1.2. Heuristic noising, the rule-based noising
method, created ungrammatical sentences. The grammatical error generation model added
grammatical errors (e.g., plan to analyze → plan to analysis). The style removal
model generated stylistically unnatural sentences for the academic domain (e.g.,
redesign → renewal). The entailed sentence generation model caused a lack of
information.
5.1.3 GEC model
The GEC task is closely related to SentRev. We examined the performance of
the current state-of-the-art GEC model [21] in our task. We applied spelling
correction before evaluation following [21].
5.2 Evaluation metrics
The SentRev task admits a very diverse space of valid revisions for a given draft,
which makes it challenging to evaluate. As one solution, we evaluated performance
from multiple aspects by using both reference-based and reference-less evaluation
metrics. We used BLEU, ROUGE-L, and the F0.5 score, which are widely used
metrics in related tasks (machine translation, style-transfer, GEC). We used
nlg-eval [22] to compute the BLEU and ROUGE-L scores and calculated F0.5
scores with ERRANT. In addition, to handle the lexical and compositional
diversity of valid revisions, we used BERTScore [23], a contextualized-embedding-based
evaluation metric. Furthermore, we used two reference-less evaluation
metrics: the grammaticality score [24] and PPL. Grammaticality was scored as

    1 − (N_errors in sentence / N_tokens in sentence),

where the number of grammatical errors in a sentence is obtained using
LanguageTool.10 By using a language model tuned to the academic domain, we
expect PPL to evaluate the stylistic validity and fluency of a completed sentence.
We favored n-gram language models over neural language models for reproducibility
and calculated the score in the same manner as described in Section 4.4.
10https://github.com/languagetool-org/languagetool/releases/tag/v3.2
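A sketch of the grammaticality metric using the language_tool_python wrapper and whitespace tokenization; the wrapper and tokenization are assumptions, as the thesis used LanguageTool v3.2 directly and does not specify its tokenizer.

import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def grammaticality(sentence: str) -> float:
    # 1 - (number of LanguageTool errors / number of tokens)
    n_errors = len(tool.check(sentence))
    n_tokens = max(len(sentence.split()), 1)
    return 1.0 - n_errors / n_tokens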
6 Results
Table 7 shows the performance of the baseline models. We observed that the
ED-ND model outperforms the other models in nearly all evaluation metrics.
This finding suggests that the Enc-Dec noising methods induced noise closer to
real-world drafts compared with the heuristic methods.
The current state-of-the-art GEC model showed high precision but low recall
in the F0.5 evaluation. This suggests that the SentRev task requires the model to
make more drastic changes to the drafts than the GEC task does. Furthermore, the GEC
model, trained in the general domain, showed the worst performance in PPL.
This indicates that the general GEC model did not reflect academic writing style
upon revision and that SentRev requires academic domain-aware rewriting.
Table 8 shows examples of the models’ output. In the first example, the ED-ND
model made a drastic change to the draft. The middle example demonstrates that
our models replaced the <*> token with plausible words. The last example is the
case where our model underperformed, making erroneous edits such as changing
“Chart4” to “Figure2” and suggesting odd content (“relation between model and
gold standard and piason”). This may be due to noise inadvertently introduced
when generating the training data. Appendix C shows more examples of
generated sentences. Using ERRANT, we analyzed the performance of the ED-
ND baseline model by error types. The results are shown in Figure 7. Overall,
typical grammatical errors such as noun number errors or orthographic errors are
well corrected, but the model struggles with drastic revisions (“OTHER” type
errors).
7 Related work
7.1 Writing assistance in the academic domain
Several shared tasks for assisting academic writing have been organized. The
Helping Our Own (HOO) 2011 Pilot Shared Task [2] aimed to promote the de-
velopment of tools and techniques to assist authors in writing, with a specific
focus on writing within the NLP community. The Automated Evaluation of Sci-
entific Writing (AESW) Shared Task [1] was organized to promote tools to help
write scientific papers. The HOO dataset was created by finding errors in pub-
lished papers and editing the errors, and the AESW dataset contains a collection
of text extracts from published journal articles before and after proofreading.
Rather than adding finishing touches to almost completed sentences, our task is
to convert unfinished, rough drafts into complete sentences. In addition, these
studies tackled the identification of errors, while SentRev goes further by
rewriting the drafts.
Other corpora of revisions are available in the academic domain [25, 26, 27],
but they were built in smaller or less scalable ways: Zhang et al. [27] recruited
60 students over 2 weeks, and Lee and Webster [25] collected data from a
language learning project in which over 300 tutors reviewed academic essays
written by 4,500 students. We thus provide a notable contribution by exploring
a scalable crowdsourcing approach to creating a dataset of revisions.
7.2 Grammatical error correction
GEC is the task of correcting errors in text such as spelling, punctuation, gram-
mar, and word choice [28, 29]. GEC falls within the editing and proofreading
phases of the writing process, while SentRev subsumes GEC and a broader range
of text generation (e.g., increasing the fluency of the sentence and complement-
ing missing information). Napoles et al. [7] and Sakaguchi et al. [30] explored
fluency edits to correct grammatical errors and to make a text more “native
sounding.” Although this direction is similar to SentRev, our task uses sentences
that require many more corrections.
7.3 Style transfer
Style transfer is the task of rephrasing the text to conform to specific stylistic
properties while preserving the text’s original semantic content [31, 32]. From
the perspective of automatic academic writing assistance, assistance systems
are required to convert nonacademic-style drafts into academic-style text. This
type of transfer can be regarded as a subproblem of the revising stage of the
writing process.
7.4 Text completion
The drafts in the revising stage may contain gaps denoted with <*>. This setting
is similar to text infilling [33], masking-based language modeling [34, 35], or the
sentence completion task [36], where the models are required to replace mask
tokens with plausible words. Notably, SentRev differs from such tasks because
systems for those tasks are expected to keep all the original tokens unchanged
and only fill each <*> token with one or more tokens.
8 Conclusion
We proposed the SentRev task, where an incomplete, rough draft sentence is
transformed into a more fluent, complete sentence in the academic writing do-
main. We created the Smith dataset with crowdsourcing for development and
evaluation of this task and established baseline performance with a synthetic
training dataset. We believe that this task can increase the effectiveness of the
process of academic writing.
Acknowledgements
I am deeply grateful to Dr. Kentaro Inui and Dr. Jun Suzuki for their able guidance
and generous support. I would like to show my greatest appreciation to Dr. Hay-
ato Kobayashi, Dr. Masato Hagiwara, Ana Brassard, and Tatsuki Kuribayashi for
insightful comments and constructive suggestions. Discussions with my academic
colleagues in our laboratory have been quite illuminating.
References
[1] Vidas Daudaravicius. Automated Evaluation of Scientific Writing: AESW
Shared Task Proposal. In Proceedings of the Tenth Workshop on Innovative
Use of NLP for Building Educational Applications (BEA), pages 56–63, 2015.
[2] Robert Dale and Adam Kilgarriff. Helping Our Own: The HOO 2011 Pilot
Shared Task. In Proceedings of ENLG, pages 242–249, 2011.
[3] Bernard Susser. Process approaches in ESL/EFL writing instruction. Jour-
nal of Second Language Writing, 3(1):31–47, 1994.
[4] Anthony Seow. The writing process and process writing. Methodology in
language teaching: An anthology of current practice, pages 315–320, 2002.
[5] Michael Buchman, Roberta Moore, Linda Stern, and Betsy Feist. Power
Writing: Writing with Purpose, volume 4. 2000.
[6] Andrew D Cohen and Amanda Brooks-Carson. Research on direct versus
translated writing: Students’ strategies and their results. The Modern Lan-
guage Journal, 85(2):169–188, 2001.
[7] Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. JFLEG: A Flu-
ency Corpus and Benchmark for Grammatical Error Correction. In Proceed-
ings of EACL, pages 229–234, 2017.
[8] Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Mat-
sumoto. Mining Revision Log of Language Learning SNS for Automated
Japanese Error Correction of Second Language Learners. In Proceedings of
IJCNLP, pages 147–155, 2011.
[9] Christopher Bryant, Mariano Felice, and Ted Briscoe. Automatic Annota-
tion and Evaluation of Error Types for Grammatical Error Correction. In
Proceedings of ACL, pages 793–805, 2017.
[10] Mariano Felice, Christopher Bryant, and Ted Briscoe. Automatic Extraction
of Learner Errors in ESL Sentences Using Linguistically Enhanced Align-
ments. In Proceedings of COLING, pages 825–835, 2016.
[11] Rudolph Flesch. A new readability yardstick. Journal of Applied Psychology,
32(3):221–233, June 1948.
[12] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding
Back-Translation at Scale. In Proceedings of EMNLP, pages 489–500, 2018.
[13] Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Juraf-
sky. Noising and Denoising Natural Language: Diverse Backtranslation for
Grammar Correction. In Proceedings of NAACL-HLT, pages 619–628, 2018.
[14] Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar,
and Simon Tong. Corpora generation for grammatical error correction. In
Proceedings of NAACL-HLT, pages 3291–3301, June 2019.
[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you
Need. In Proceedings of NIPS, pages 5998–6008, 2017.
[16] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan
Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit
for sequence modeling. In Proceedings of NAACL-HLT, pages 48–53, June
2019.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Opti-
mization. In Proceedings of ICLR, 2015.
[18] John Wieting and Kevin Gimpel. ParaNMT-50M: Pushing the Limits of
Paraphrastic Sentence Embeddings with Millions of Machine Translations.
In Proceedings of ACL, pages 451–462, 2018.
[19] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D.
Manning. A Large Annotated Corpus for Learning Natural Language Infer-
ence. In Proceedings of EMNLP, pages 632–642, 2015.
[20] Adina Williams, Nikita Nangia, and Samuel Bowman. A Broad-Coverage
Challenge Corpus for Sentence Understanding through Inference. In Pro-
ceedings of NAACL-HLT, pages 1112–1122, 2018.
[21] Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. Im-
proving grammatical error correction via pre-training a copy-augmented ar-
chitecture with unlabeled data. In Proceedings of the NAACL-HLT, pages
156–165, June 2019.
[22] Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. Rel-
evance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating
Natural Language Generation. arXiv preprint arXiv:1706.09799, 2017.
[23] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav
Artzi. BERTScore: Evaluating Text Generation with BERT. arXiv preprint
arXiv:1904.09675, 2019.
[24] Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. There’s No Com-
parison: Reference-less Evaluation Metrics in Grammatical Error Correction.
In Proceedings of EMNLP, pages 2109–2115, 2016.
[25] John Lee and Jonathan Webster. A corpus of textual revisions in second
language writing. In Proceedings of ACL, pages 248–252, July 2012.
[26] Chenhao Tan and Lillian Lee. A corpus of sentence-level revisions in aca-
demic writing: A step towards understanding statement strength in commu-
nication. In Proceedings of ACL, pages 403–408, June 2014.
[27] Fan Zhang, Homa B. Hashemi, Rebecca Hwa, and Diane Litman. A Corpus
of Annotated Revisions for Studying Argumentative Writing. In Proceedings
of ACL, pages 1568–1578, 2017.
[28] Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Ray-
mond Hendy Susanto, and Christopher Bryant. The CoNLL-2014 Shared
Task on Grammatical Error Correction. In Proceedings of CoNLL: Shared
Task, pages 1–14, 2014.
[29] Zheng Yuan and Ted Briscoe. Grammatical Error Correction using Neural
Machine Translation. In Proceedings of NAACL-HLT, pages 380–386, 2016.
[30] Keisuke Sakaguchi, Courtney Napoles, Matt Post, and Joel Tetreault. Re-
assessing the Goals of Grammatical Error Correction: Fluency instead of
Grammaticality. TACL, 4:169–182, 2016.
[31] Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. Content Preserving
Text Generation with Attribute Controls. In Proceedings of NeurIPS, pages
5103–5113, 2018.
[32] Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W.
Black. Style Transfer Through Back-Translation. In Proceedings of ACL,
pages 866–876, 2018.
[33] Wanrong Zhu, Zhiting Hu, and Eric Xing. Text Infilling. arXiv preprint
arXiv:1901.00158, 2019.
[34] William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better Text
Generation via Filling in the ______. In Proceedings of ICLR, 2018.
[35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language under-
standing. In Proceedings of NAACL-HLT, pages 4171–4186, June 2019.
[36] Geoffrey Zweig, John C. Platt, Christopher Meek, Christopher J. C. Burges,
Ainur Yessenalina, and Qiang Liu. Computational Approaches to Sentence
Completion. In Proceedings of ACL, pages 601–610, 2012.
[37] Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how
Corpora Differ. In Proceedings of ACL System Demonstrations, pages 85–
90, 2017.
Figure 8: Characteristic words and phrases in draft sentences and reference sen-
tences in the development set of Smith.
Appendix
A Lexical tendencies
Certain words and phrases were more frequently observed in the reference sen-
tences than in the draft sentences, and vice-versa. Figure 8 visualizes these biases,
where words more often observed in the draft sentences are plotted in the upper-
left corner, and words more often observed in the references are plotted in the
lower-right corner. Words observed more commonly in the drafts were will, is
not, if, and I; words more common in the references were can be, no, when, and
they. The contrast also includes
Algorithm 1 Heuristic noising
INPUT: x = {w0, w1, ..., wn}
 1: x = delete(x, 0.1)       # 10% of the tokens in x are deleted
 2: x = replace(x, 0.1)      # 10% of the tokens in x are replaced with common terms in ACL
 3: x = permutate(x)         # permute the tokens in x
 4: r ← Uniform(0, 0.5)
 5: m = int(x.length * r)
 6: c = 0
 7: while c < m do
 8:   n ← sample({j ∈ N | 1 ≤ j ≤ m − c})
 9:   (s, e) ← sample({n-grams of x})
10:   x = x[:s−1] + <*> + x[e+1:]
11:   c = c + n
12: end while
# r × 100% of the tokens in x are masked
a common spelling variation (data set vs. dataset) and plurality (method vs.
methods). The plot was generated using the scattertext toolkit [37].
B Heuristic noising algorithm
Algorithm 1 details the noising procedure used in the heuristic noising method.
C Examples from the Smith dataset and sentences generated by the baseline models
Table 9 shows examples from the Smith dataset and the output of the baseline
models. “Reference” is a sentence extracted from papers; “Draft” is written by
a crowdworker and serves as the input to the baseline models.
Table 8: Examples of the output from the baseline models. Bold text indicates
tokens introduced by the model.
Draft The global modeling using the reinforcement learning in all documents is our
work in the future .
H-ND The global modeling of the reinforcement learning using all documents in our
work is the future .
ED-ND In our future work , we plan to explore the use of global modeling for
reinforcement learning in all documents .
GEC Global modelling using reinforcement learning in all documents is our work in the
future .
Reference The global modeling using reinforcement learning for a whole document is our
future work .
Draft Also , the above <*> efficiently calculated by dynamic programming .
H-ND Also , the above results are calculated efficiently by dynamic programming .
ED-ND Also , the above probabilities are calculated efficiently by dynamic program-
ming .
GEC Also , the above is efficiently calculated by dynamic programming .
Reference Again , the above equation can be efficiently computed by dynamic programming
.
Draft Chart4 : relation model and gold % between KL and piason .
H-ND Table 1 : Charx- relation between gold and piason and KL .
ED-ND Figure 2 : CharxDiff relation between model and gold standard and
piason .
GEC Chart4 : relation model and gold % between KL and person .
Reference Table 4 : KL and Pearson correlation between model and gold probability .
Table 9: Further examples of draft, reference, and the baseline models’ output.
Draft By this setting , the persona is acquired from a test set popl about both turker
anad model .
H-ND By this setting , the persona is acquired from a test set both about popl anad
anad model .
ED-ND In this setting , persona is obtained from the test set popl about both Turker
and model .
GEC By this setting , the persona is acquired from a test set pool about both turkey
and models .
Reference In this setting , for both the Turker and the model , the personas come from the
test set pool .
Draft In addition to results of study until now , we add two baseline to vindicate
effectiveness on our flame work .
H-ND In addition to the results of this study , we now add two baseline methods to
vindicate effectiveness on our work .
ED-ND In addition to the results of the study until now , we add two baselines to
visualize the effectiveness of our framework .
GEC In addition to the results of study until now , we added two baseline to vindicate
effectiveness on our flame work .
Reference In addition to results of previous work , we add two baselines to demonstrate
the effectiveness of our framework .
Draft Yhe input and output <*> are one - hot encoding of the center word and the
context word , <*> .
H-ND The input and output are one - hot encoding of the center word and the context
word , respectively .
ED-ND The input and output layers are one - hot encoding of the center word and the
context word , respectively .
GEC Yhe input and output are one - hot encoding of the center word and the context
word , .
Reference The input and output layers are centre word and context word one - hot encod-
ings , respectively .
List of Publications
Awards
Journal Papers
International Conference Papers
1. Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Ana Brassard, Masato
Hagiwara, Jun Suzuki and Kentaro Inui. Diamonds in the Rough: Gener-
ating Fluent Sentences from Early-stage Drafts for Academic Writing As-
sistance. To appear in Proceedings of the 12th International Conference on
Natural Language Generation (INLG 2019), pp. –, October 2019.
2. Masato Hagiwara, Takumi Ito, Tatsuki Kuribayashi, Jun Suzuki and Ken-
taro Inui. TEASPN: Framework and Protocol for Integrated Writing As-
sistance Environments. To appear in Proceedings of 2019 Conference on
Empirical Methods in Natural Language Processing and 9th International
Joint Conference on Natural Language Processing: System Demonstrations
(EMNLP-IJCNLP 2019), pp. –, November 2019.
Other Publications
• Takumi Ito, Tatsuki Kuribayashi, Masato Hagiwara, Jun Suzuki and Kentaro Inui.
An integrated writing assistance environment for writing English papers (in
Japanese). The 14th Symposium of the Young Researchers' Association for NLP
Studies (YANS), August 2019.
• Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Jun Suzuki and Kentaro Inui.
Information-completing generation for writing assistance (in Japanese). The 25th
Annual Meeting of the Association for Natural Language Processing, pp. 970-973,
March 2019.
• Tatsuki Kuribayashi, Takumi Ito, 内山香, Jun Suzuki and Kentaro Inui. Evaluation
of Japanese word order using language models and analysis of canonical word
order (in Japanese). The 25th Annual Meeting of the Association for Natural
Language Processing, pp. 1053-1056, March 2019.
• Takumi Ito, 山口健史, Ran Tian, Koji Matsuda, Naoaki Okazaki and Kentaro Inui.
Comparative mining of municipal FAQs (in Japanese). The 24th Annual Meeting
of the Association for Natural Language Processing, pp. 536-539, March 2018.
• Takumi Ito, Masatoshi Suzuki, Ran Tian, 山口健史, Naoaki Okazaki and Kentaro
Inui. Cross-municipal analysis of FAQs for municipal QA services (in Japanese).
The 12th Symposium of the Young Researchers' Association for NLP Studies
(YANS), September 2017.