B8IM2004
Generating fluent sentences from early-stage
drafts for academic writing assistance
Takumi Ito
January 27, 2020
Graduate School of Information Sciences
Tohoku University
A Master’s Thesis
submitted to System Information Sciences,
Graduate School of Information Sciences,
Tohoku University
in partial fulfillment of the requirements for the degree of
MASTER of INFORMATION SCIENCE
Takumi Ito
Thesis Committee:
Professor Kentaro Inui (Supervisor)
Professor Shinichiro Omachi
Professor Kazuyuki Tanaka
Associate Professor Jun Suzuki (Co-supervisor)
Generating fluent sentences from early-stage
drafts for academic writing assistance∗
Takumi Ito
Abstract
The writing process consists of several stages such as drafting, revising, edit-
ing, and proofreading. Studies on writing assistance, such as grammatical er-
ror correction (GEC), have mainly focused on sentence editing and proofreading,
where surface-level issues such as typographical, spelling, or grammatical errors
should be corrected. We broaden this focus to include the earlier revising stage,
where sentences require adjustment of the information included or major rewriting,
and propose Sentence-level Revision (SentRev) as a new writing assistance
task. Well-performing systems in this task can help inexperienced authors by
producing fluent, complete sentences given their rough, incomplete drafts. We
build a new freely available crowdsourced evaluation dataset consisting of incom-
plete sentences authored by non-native writers paired with their final versions
extracted from published academic papers for developing and evaluating SentRev
models. We also establish baseline performance on SentRev using our newly built
evaluation dataset.
Keywords:
Natural Language Generation, Dataset, Writing Assistance
∗System Information Sciences, Graduate School of Information Sciences, Tohoku University,
B8IM2004, January 27, 2020.
Automatic revision of draft sentences for academic writing assistance∗
Takumi Ito
Summary (translated from Japanese)
The writing process consists of several stages, such as drafting, editing, correcting, and
proofreading. Studies on writing assistance, such as grammatical error correction (GEC),
have mainly addressed surface-level problems such as spelling and grammatical errors.
On the other hand, supporting the drastic rewriting that often takes place in the early
stages of writing is also considered important for assisting authors. In this thesis, we
broaden the scope of grammatical error correction and propose Sentence-level Revision
(SentRev), a task of automatic revision from draft sentences. To develop and evaluate
SentRev models, we construct a dataset consisting of pairs of sentences from academic
papers and draft sentences written by non-native English speakers. Furthermore, we
build baseline models and evaluate them on the constructed evaluation dataset.
Keywords:
Natural Language Generation, Dataset, Writing Assistance
∗Department of System Information Sciences, Graduate School of Information Sciences, Tohoku University, B8IM2004, January 27, 2020.
Contents
1 Introduction
2 The Sentence-level Revision task
3 The Smith dataset
  3.1 Dataset creation
  3.2 Statistics
4 Analysis of the Smith dataset
  4.1 Error type comparison
  4.2 Human error type analysis
  4.3 Human fluency analysis
  4.4 Sentence-level linguistic characteristics
5 Experiments
  5.1 Baseline models
    5.1.1 Heuristic noising and denoising model
    5.1.2 Enc-Dec noising and denoising model
    5.1.3 GEC model
  5.2 Evaluation metrics
6 Results
7 Related work
  7.1 Writing assistance in the academic domain
  7.2 Grammatical error correction
  7.3 Style transfer
  7.4 Text completion
8 Conclusion
Acknowledgements
Appendix
A Lexical tendencies
B Heuristic noising algorithm
C Examples from the Smith dataset and sentences generated by the baseline models
List of Figures
1 Overview of the estimated process of writing the sentence “Our model shows excellent performance in this task.” Writing activity consists of four stages: (i) drafting, (ii) revising, (iii) editing, and (iv) proofreading.
2 Overview of the crowdsourcing protocol for creating an evaluation dataset for the SentRev task.
3 Comparison of the top 10 frequent errors observed in the 3 datasets.
4 Examples of “OTHER” operations predicted by the ERRANT toolkit.
5 Result of the English experts’ analyses of error types in draft sentences on our Smith dataset. The scores show the ratio of sentences where the targeted type of errors occurred.
6 Comparison of the 10 most frequent error types in Smith and synthetic drafts created by the Enc-Dec noising methods.
7 Performance of the ED-ND baseline model on the 10 most frequent error types in Smith.
8 Characteristic words and phrases in draft sentences and reference sentences in the development set of Smith.
List of Tables
1 Examples of sentence-level revisions in our Smith dataset. Our task is to transform the draft sentences into their corresponding reference sentences.
3 Criteria for evaluating workers. L.D. denotes the Levenshtein distance.
4 Comparison with existing datasets. w/<*> and w/change denote the percentage of source sentences with <*> and the percentage where the source and target sentences differ, respectively. L.D. indicates the averaged character-level Levenshtein distance between the pairs of sentences.
5 Comparison of the draft and reference sentences in Smith. FRE and PPL scores were calculated once for each sentence and then averaged over all the sentences in the development set of Smith.
6 Examples of the generated training data.
7 Results of quantitative evaluation. Gramm. denotes the grammaticality score.
8 Examples of the output from the baseline models. Bold text indicates tokens introduced by the model.
9 Further examples of draft, reference, and the baseline models’ output.
[Figure 1: schematic of the four stages applied by a non-native, inexperienced writer to one sentence. Drafting: “Model have good results.” Revising: “Our model show good result in this task.” Editing: “Our model shows a excellent perfomance in this task.” Proofreading (final version): “Our model shows excellent performance in this task.” Existing studies cover editing and proofreading; our focus adds the revising stage.]
Figure 1: Overview of the estimated process of writing the sentence “Our model
shows excellent performance in this task.” Writing activity consists of four stages:
(i) drafting, (ii) revising, (iii) editing, and (iv) proofreading.
1 Introduction
Models developed for academic writing assistance using existing datasets can
serve as a support system during the final stages by editing a nearly finished
version of the draft. For example, Daudaravicius [1] collects scientific papers
before and after professional editing from publishing companies, and Dale and
Kilgarriff [2] extract already published papers that still contain errors and correct
the errors to obtain target fragments of text.
Process-writing pedagogy, however, asserts that writing comprises several pro-
cesses [3, 4, 5] as shown in Figure 1. This study takes on the challenge of au-
tomatic assistance in both the final checking process (proofreading and editing)
and the earlier stages of writing (revising). In the revising stage, authors may
drastically modify the wording and supplement some words, a highly demanding
task for non-native or less experienced writers. Assistance in this stage has been
less explored in NLP.
In this study, we design a new type of academic writing assistance task, Sentence-
level Revision (SentRev), where a system receives an early draft of a sentence,
and generates a revised, error-free, proofread version.
A critical issue in tackling this type of assistance task is that evaluation re-
sources are scarce since early-stage draft sentences are not usually publicly avail-
able. To overcome this limitation, we release an evaluation dataset of pairs of
draft sentences and their final versions, the Set of Modified Incomplete TecHni-
cal paper sentences (Smith), that we created using crowdsourcing techniques.
Additionally, we evaluate the quality of our dataset and extensively analyze the
characteristics of the obtained drafts. Finally, we train unsupervised models and
report the baseline performance for our task on the Smith evaluation dataset.
Our contribution is fourfold:
• We propose a new task—SentRev.
• We create an evaluation dataset, Smith, for SentRev using a new crowd-
sourcing approach and release it.1
• We compare the characteristics of our dataset with major corpora and an-
alyze the obtained draft sentences.
• We establish baseline scores for SentRev.
1https://github.com/taku-ito/INLG2019_SentRev
2 The Sentence-level Revision task
Table 1: Examples of sentence-level revisions in our Smith dataset. Our task is
to transform the draft sentences into their corresponding reference sentences.
Draft However, the F1 score of KBP 2017 coupus <*> decreased by the sub
event base rule.
Reference However, subevent based constraints slightly reduced the F1 scores on
KBP 2017 corpus.
Draft But, there are some important difference to <*> our work unique.
Reference However, there exist several key differences that make our work
unique.
The proposed task, SentRev, is revising and editing incomplete draft sentences
to create final versions. Examples of sentence-level revision are shown in Table 1.
A draft sentence, x, may have several types of problems. Surface-level problems
such as typographical errors, spelling errors, or grammatical errors are a common
occurrence. Wording problems, such as collocation errors or expressions being
stylistically odd or inappropriate for the academic domain, are also typical of
rough sentences written by non-native, inexperienced writers. The third type of
error is information gaps. Information gaps are cases where the author likely
could not find the appropriate wording for the idea he or she wanted to convey,
such as a specific expression common in the academic domain or a technical term.
In addition, a draft sentence may be missing sections without the author being
aware of this. Solving the aforementioned problems in a draft sentence would
elevate the draft sentence x to its final or nearly final version y with greatly
improved correctness and fluency. Ideally, a single error-free and correctly filled-
in final version should be generated while considering the context of the sentence.
However, as a first step, an assistance system may output a set of likely candidates
for the user to choose from or be inspired by, which would be realistic for a real-
world application.
Our proposed task is, therefore, to generate likely final versions y from early-
draft sentences x. For this purpose, we provide an evaluation dataset, Smith,
comprising pairs of drafts and their final versions (X, Y ).
[Figure 2: the four-phase pipeline producing Smith. (i) Extract final-version sentences Y^cand from published papers (e.g., y^cand = “Our model shows excellent performance in this task.”). (ii) Translate each sentence into L′ and show y^cand_L′ to native speakers of L′ (e.g., このタスクでは、モデルが優れた性能を示す。 in Japanese). (iii) The crowd workers write English translations y^cand_L′→en (e.g., “Model have good results.”), which become the drafts x^cand. (iv) Apply quality control to filter the candidate pairs (X^cand, Y^cand) into (X, Y).]
Figure 2: Overview of the crowdsourcing protocol for creating an evaluation
dataset for the SentRev task.
3 The Smith dataset
3.1 Dataset creation
Process overview Although we cannot collect “drafts” X from published pa-
pers, we can easily collect the “final versions” Y . We also have access to non-
native, inexperienced writers through crowdsourcing services. Our test set cre-
ation process combines these two factors (Figure 2). The protocol consists of the
following four phases:
(i) Collecting a large number of sentences written by experts, Y^cand, from published papers.
(ii) Translating them into another language L′, resulting in sentences Y^cand_L′.
(iii) Asking native speakers of L′ to translate Y^cand_L′ back into English, Y^cand_L′→en, through crowdsourcing. Henceforth, we denote Y^cand_L′→en as X^cand.
(iv) Filtering the pairs (X^cand, Y^cand) to ensure the quality of the dataset (X, Y).
This setting is analogous to the situation non-native writers face, as Cohen
and Brooks-Carson [6] report that non-native speakers tend to formulate text in
their native language and mentally translate it into the target second language. We assume
that most crowdworkers have never written an academic paper, and that the
target users of SentRev-based systems also include this type of inexperienced
writer.
To control the quality of the drafts, we first create many candidate pairs of
drafts and reference sentences (X^cand, Y^cand) and then filter them to create the
quality-controlled set (X, Y). The following subsections detail this process.
Collecting final version sentences We collected sentences Y^cand from the
ACL Anthology Sentence Corpus (AASC).2 We extracted the sentences that satisfied
the following conditions from the AASC as Y^cand (a minimal sketch of this filter follows the list):
• accepted to ACL 2018,
• 70 to 120 characters long,
• does not include mathematical symbols, special tokens for citations, URLs,
Greek letters, or other special symbols defined in AASC, and
• free of clear conversion mistakes when automatically extracted from PDFs.
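These conditions translate into a simple per-sentence check. The following is a minimal Python sketch of such a filter; the regular expressions standing in for AASC's special tokens and symbol classes are illustrative assumptions, not the exact rules used.

import re

def is_final_version_candidate(sentence: str) -> bool:
    # 70 to 120 characters long
    if not 70 <= len(sentence) <= 120:
        return False
    # no mathematical symbols or Greek letters
    if re.search(r"[=+±×÷∑∏∫√α-ωΑ-Ω]", sentence):
        return False
    # hypothetical markers for AASC citation/URL special tokens
    if re.search(r"<cit>|<url>|https?://", sentence):
        return False
    return True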
Creating draft sentences We used Japanese as L′. First, we translated Y^cand
into Japanese using Google Translate.3 We denote the Japanese versions of Y^cand
by Y^cand_ja. To guarantee the quality of Y^cand_ja, two authors of this paper, who are
native speakers of Japanese, inspected all the sentences in Y^cand_ja and removed
those that at least one of them judged to be incorrect translations.
Next, we asked each Japanese crowdworker to translate three sentences from
Y^cand_ja into English, Y^cand_ja→en, within 15 minutes. The appropriate time limit and rules
were determined based on several trial tasks.
The workers were allowed to insert the special symbol <*> in places where they
could not think of a good expression for that position in their answer Y^cand_ja→en.
This instruction revealed the information gaps that the authors of the drafts
consciously left empty. (An author may also be unaware that a draft sentence is
missing sections.) A total of 306 workers participated in our crowdsourcing task.
2https://github.com/KMCS-NII/AASC
3https://translate.google.com/
Quality control We designed thorough filtering criteria and applied them to
the workers because Yahoo! Crowdsourcing,4 a Japanese crowdsourcing service,
does not provide filtering based on workers' writing skills or abilities. We therefore
filtered workers based on their writing activity: we scored each worker on the
three answers they produced, using the criteria detailed in Table 3, and accepted
work from workers with a score of 0 or higher as valid (a sketch of this scoring
appears after Table 3). The hyperparameters were determined in trial experiments.
We used spaCy-CLD5 for language detection.
Table 3: Criteria for evaluating workers. L.D. denotes the Levenshtein distance.
Criterion                                                Judgment
Working time is too short (< 2 minutes)                  Reject
All answers are too short (< 4 words)                    Reject
No answer ends with “.” or “?”                           Reject
Some answers are identical                               Reject
Some answers have Japanese words                         Reject
No answer is recognized as English                       Reject
Some answers are too short (< 4 words)                   -2 points
Some answers use fewer than 4 kinds of words             -2 points
Too close to automatic translation (20 ≤ L.D. ≤ 30)      -0.5 points/answer
Too close to automatic translation (10 ≤ L.D. ≤ 20)      -1.5 points/answer
Too close to automatic translation (L.D. ≤ 10)           Reject
All answers end with “.” or “?”                          +1 point
Some answers have <*>                                    +1 point
All answers are recognized as English                    +1 point
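A minimal sketch of this scoring in Python follows; the is_english helper (e.g., via spaCy-CLD) and the lev Levenshtein-distance helper are assumed, and the thresholds follow Table 3.

import re

def score_worker(answers, machine_translations, minutes, is_english, lev):
    # returns None for a rejected worker, otherwise the worker's score
    if minutes < 2:                                        # working time too short
        return None
    if all(len(a.split()) < 4 for a in answers):           # all answers too short
        return None
    if not any(a.endswith((".", "?")) for a in answers):   # no answer ends with . or ?
        return None
    if len(set(answers)) < len(answers):                   # identical answers
        return None
    if any(re.search(r"[\u3040-\u30ff\u4e00-\u9fff]", a) for a in answers):
        return None                                        # Japanese words remain
    if not any(is_english(a) for a in answers):
        return None
    score = 0.0
    if any(len(a.split()) < 4 for a in answers):           # some answers too short
        score -= 2
    if any(len(set(a.split())) < 4 for a in answers):      # fewer than 4 kinds of words
        score -= 2
    for ans, mt in zip(answers, machine_translations):
        d = lev(ans, mt)
        if d <= 10:                                        # too close to automatic translation
            return None
        elif d <= 20:
            score -= 1.5
        elif d <= 30:
            score -= 0.5
    if all(a.endswith((".", "?")) for a in answers):
        score += 1
    if any("<*>" in a for a in answers):
        score += 1
    if all(is_english(a) for a in answers):
        score += 1
    return score                                           # accept work if score >= 0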
In addition, to remove instances with too large a gap, we automatically filtered
out the obtained (x^cand, y^cand) ∈ (X^cand, Y^cand) whose unigram overlap coefficient
was considerably low:

    |U(x^cand_checked) ∩ U(y^cand)| / min{ |U(x^cand_checked)|, |U(y^cand)| } < α,

where U(·) is the set of tokens excluding stop words and the special token (<*>).
4https://crowdsourcing.yahoo.co.jp/
5https://github.com/nickdavidhaynes/spacy-cld
x^cand_checked is the spell-checked version6 of x^cand. α is set to 0.4, which was determined
in trial experiments.
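The overlap-coefficient filter can be implemented directly from this definition. A minimal sketch follows; the tokenization scheme and stop-word list are not specified in the text, so they are left as parameters.

def overlap_coefficient(x_tokens, y_tokens, stopwords):
    # x_tokens should come from the spell-checked draft x^cand_checked
    u_x = {t for t in x_tokens if t not in stopwords and t != "<*>"}
    u_y = {t for t in y_tokens if t not in stopwords and t != "<*>"}
    if not u_x or not u_y:
        return 0.0
    return len(u_x & u_y) / min(len(u_x), len(u_y))

ALPHA = 0.4  # pairs with a coefficient below ALPHA are discarded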
We collected 10,804 pairs of drafts and their final versions, which cost approximately
US$4,200, including the trial rounds of crowdsourcing.
Unfortunately, works produced by unmotivated workers could have evaded the
aforementioned filters and lowered the quality of our dataset. For example, work-
ers could have bypassed the filter by simply repeating popular phrases in academic
writing (“We apply we apply”). To estimate the frequency of such examples, we
sampled 100 (x, y) pairs from (X,Y ) and asked an NLP researcher (not an author
of this paper) fluent in Japanese and English to check for examples where x was
totally irrelevant to xja, which was shown to the crowdworkers when creating x.
The expert observed no completely inappropriate examples, but noted a small
number of clearly subpar translations. Therefore, 95% of sentence pairs were
determined to be appropriate. This result shows that, overall, our method was
suitable to create the dataset and confirms the quality of Smith.
3.2 Statistics
Table 4 shows the statistics of our Smith dataset and a comparison with major
datasets for building a writing assistance system [7, 8, 1]. The size of our dataset
(10k sentence pairs) is six times greater than that of JFLEG, which contains both
grammatical errors and nonfluent wording. In addition, our dataset simulates
significant editing—99% of the pairs have some changes between the draft and
its corresponding reference, and 33% of the draft sentences contain gaps indicated
by the special token <*>. We also measured the amount of change from the drafts
X to the references Y by using the Levenshtein distance between them. A higher
Levenshtein distance between the X and Y sentences in our dataset indicated
more significant differences between them compared with major GEC corpora.
This finding implies that our dataset emulates more drastic rephrasing.
6We corrected spelling errors using https://github.com/barrust/pyspellchecker
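The character-level L.D. statistic in Table 4 can be reproduced with a standard dynamic-programming Levenshtein implementation, sketched below.

def levenshtein(a: str, b: str) -> int:
    # character-level Levenshtein (edit) distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]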
Dataset   size   w/<*>   w/change   L.D.
Lang-8    2.1M   -       42%        3.5
AESW      1.2M   -       39%        4.8
JFLEG     1.5k   -       86%        12.4
Smith     10k    33%     99%        47.0
Table 4: Comparison with existing datasets. w/<*> and w/change denote the
percentage of source sentences with <*> and the percentage where the source and
target sentences differ, respectively. L.D. indicates the averaged character-level
Levenshtein distance between the pairs of sentences.
[Figure 3: grouped bar chart of the error-type frequencies (%) in Smith, JFLEG, and AESW.]
Figure 3: Comparison of the top 10 frequent errors observed in the 3 datasets.
4 Analysis of the Smith dataset
In this section, we run extensive analyses on the sentences written by non-native
workers (draft sentences X), and the original sentences extracted from the set of
accepted papers (reference sentences Y ). We randomly selected a set of 500 pairs
from Smith as the development set for analysis.
4.1 Error type comparison
To obtain the approximate distributions of error types between the source and
target sentences, we used ERRANT [9, 10]. Next, we compared them with three
datasets: Smith, AESW (the same domain as Smith), and JFLEG (has a rel-
atively close Levenshtein distance to Smith). To calculate the error type distri-
[Figure 4 examples (both edits tagged OTHER by ERRANT):
Draft: the best models are very effective on the condition that they are far greater than human.
Reference: The best models are very effective in the local context condition where they significantly outperform humans.
Draft: Results show MARM tend to generate <*> and very short responces.
Reference: The results indicate that MARM tends to generate specific but very short responses.]
Figure 4: Examples of “OTHER” operations predicted by the ERRANT toolkit.
[Figure 5: horizontal bar chart (0-100%) over five categories: grammatical errors, problems in wording, orthographic errors, lack of information, and others.]
Figure 5: Result of the English experts’ analyses of error types in draft sentences
on our Smith dataset. The scores show the ratio of sentences where the targeted
type of errors occurred.
butions on AESW and JFLEG, we randomly sampled 500 pairs of source and
target sentences from each corpus. Figure 3 shows the results of the compari-
son. Although all datasets contained a mix of error types and operations, the
Smith dataset included more “OTHER” operations than the other two datasets.
Manual inspection of some samples of “OTHER” operations revealed that they
tend to inject information missing in the draft sentence (Figure 4). This finding
confirms that our dataset emphasizes a new, challenging “completion-type” task
setting for writing assistance.
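The error-type counting behind Figure 3 can be reproduced with ERRANT's Python API. The following is a minimal sketch assuming the ERRANT v2-style API; the thesis does not specify how the toolkit was invoked.

from collections import Counter
import errant

annotator = errant.load("en")  # spaCy-based English annotator

def error_type_distribution(pairs):
    # count ERRANT edit types over (source, target) sentence pairs
    counts = Counter()
    for src, tgt in pairs:
        orig, cor = annotator.parse(src), annotator.parse(tgt)
        for edit in annotator.annotate(orig, cor):
            counts[edit.type] += 1  # e.g., 'R:OTHER', 'M:DET'
    return counts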
4.2 Human error type analysis
To understand the characteristics of our dataset in detail, an annotator proficient
in English (not an author of this paper) analyzed the types of errors in the draft
sentences (Figure 5). The most frequent errors were fluency problems (e.g., “In
Table 5: Comparison of the draft and reference sentences in Smith. FRE and
PPL scores were calculated once for each sentence and then averaged over all the
sentences in the development set of Smith.
Data FRE passive voice (%) word repetition (%) PPL
Draft X 45.5 34.0 33.0 1373
Reference Y 40.0 29.6 28.6 147
these ways” instead of “In these methods”), characterized by errors in academic
style and wording, which are out of the scope of traditional GEC. Another notable
type of frequent error was lack of information, which further distinguishes this
dataset from other datasets.
4.3 Human fluency analysis
We outsourced the scoring of the fluency of the given draft and reference sentence
pairs to three annotators proficient in English. Nearly every draft x (94.8%) was
marked as being less fluent than its corresponding reference y, confirming that
obtaining high performance with our dataset requires the ability to transform
rough input sentences into more fluent sentences.
4.4 Sentence-level linguistic characteristics
We computed some sentence-level linguistic measures over the dataset sentences:
Flesch Reading Ease (FRE) [11], passive voice7, word repetition, and perplexity
(PPL) (Table 5).
FRE measures the readability of a text, namely, how easy it is to understand
(higher is easier). The draft sentences consistently demonstrated higher FRE
scores than their reference counterparts, which may be attributed to the latter
containing more sophisticated language and technical terms.
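FRE can be computed with an off-the-shelf readability package. The snippet below is a minimal sketch using the textstat package; this choice is our assumption, since the thesis does not name its FRE implementation.

import textstat

draft = "Model have good results ."
reference = "Our model shows excellent performance in this task ."
print(textstat.flesch_reading_ease(draft))      # higher = easier to read
print(textstat.flesch_reading_ease(reference))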
In addition, workers tended to use the passive voice and to repeat words within
a narrow span; both of these phenomena should be avoided in academic writing.
We conducted further analyses of lexical tendencies between the drafts and
7https://github.com/armsp/active_or_passive
references (Appendix A).
Finally, we analyzed the draft and the reference sentences using PPL calculated
by a 5-gram language model trained on ACL Anthology papers.8 The higher PPL
scores in the draft sentences (Table 5) suggest that they have properties unsuitable
for academic writing (e.g., less fluent wording).
8PPL is calculated with the implementation available in KenLM (https://github.com/kpu/kenlm), tuned on AASC (excluding the texts used for building Smith).
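A sketch of this PPL computation with the KenLM Python bindings follows; the model path is a hypothetical placeholder for a 5-gram model trained on AASC.

import kenlm

# "aasc_5gram.binary" is a hypothetical path to a KenLM 5-gram model
# trained on AASC (excluding the texts used for building Smith)
model = kenlm.Model("aasc_5gram.binary")

def sentence_ppl(sentence: str) -> float:
    return model.perplexity(sentence)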
Table 6: Examples of the generated training data.

Heuristic
  original:  Besides , the recognizer successfully rejected only 15 out of 42 negative sentences .
  generated: recognizer Besides successfully , the informativeness rejected of out <*>

Grammatical error generation
  original:  We plan to analyze these direct communications and interaction of sentiments expressed in these sequences of posts .
  generated: We plan to analysis the direct communication interaction of sentiments express in these sequence of posts .

Style removal
  original:  This experiment suggested that there were ambiguities in these pointing gestures and led to a redesign of the system .
  generated: This experiment indicated the ambiguity found in the pointing gestures and caused a renewal of the system .

Entailed sentence generation
  original:  Figure 2 illustrates the effectiveness of different features class.
  generated: There is different feature in figure 2 .
5 Experiments
5.1 Baseline models
We evaluated three baseline models on the SentRev task.
5.1.1 Heuristic noising and denoising model
We have access to a large number of final versions of academic papers, and noising
and denoising approaches have gained attention in the GEC and machine translation
fields [12, 13, 14]. We combined these two factors to train baseline models on noised
final-version sentences.
First, we collected 4,898,146 sentences Y^aasc from the AASC that satisfied the
following conditions: (i) not included in the Smith dataset, (ii) not too long
or too short (between 5 and 35 tokens), and (iii) over 50% of the characters were
alphabetic. Next, we created a training dataset (X^aasc_hrst, Y^aasc) by adding noise to
Y^aasc.
As the simplest approach to noising, we used a set of heuristic rules that randomly
delete, replace, and swap words in the reference sentences. Specifically, these rules
delete words with a probability of 0.1, replace words with a token that appeared
over 10,000 times in Y^aasc with a probability of 0.1, and randomly shuffle the
sentence while keeping originally adjacent words within three words of each other.
We then randomly replaced up to 50% of the words with a <*> token (see Appendix B
for the detailed algorithm). This method generated 4.8M heuristically noised sentences.
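A runnable approximation of these rules is sketched below; the noisy-sort-key shuffle and the span-masking loop are our simplifications of Algorithm 1 in Appendix B, not the exact implementation.

import random

def heuristic_noise(tokens, common_terms, p_del=0.1, p_rep=0.1, window=3):
    # delete words with probability 0.1
    tokens = [t for t in tokens if random.random() >= p_del]
    # replace words with frequent ACL terms with probability 0.1
    tokens = [random.choice(common_terms) if random.random() < p_rep else t
              for t in tokens]
    # shuffle, keeping each token within `window` positions of its origin
    keyed = sorted(enumerate(tokens),
                   key=lambda it: it[0] + random.uniform(0, window))
    tokens = [t for _, t in keyed]
    # mask up to 50% of the tokens, one <*> per masked span
    n_mask = int(len(tokens) * random.uniform(0, 0.5))
    while n_mask > 0 and tokens:
        span = random.randint(1, n_mask)
        start = random.randrange(len(tokens))
        tokens[start:start + span] = ["<*>"]
        n_mask -= span
    return tokens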
Subsequently, we trained a denoising model (a mapping function from X^aasc_hrst to
Y^aasc) using the Transformer [15] implemented in fairseq [16]. We used the Adam
optimizer [17] with α = 0.0005, β1 = 0.9, β2 = 0.98, and ϵ = 10^−8. We limited
the maximum number of tokens per minibatch to 3,000, limited the maximum number
of updates to 500,000, and used a dropout rate of 0.3. The input and output
texts were tokenized and then segmented into character bigrams. We used a
beam width of 5 for decoding. This model is our first baseline model for the
SentRev task (henceforth, H-ND).
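The character-bigram segmentation can be illustrated as follows. This is one plausible reading (non-overlapping character pairs); the exact scheme, including how whitespace is represented, is not specified in the text, so the '_' convention here is an assumption.

def char_bigrams(sentence: str) -> list:
    # '_' marks spaces; this convention is assumed for illustration
    chars = sentence.replace(" ", "_")
    return [chars[i:i + 2] for i in range(0, len(chars), 2)]

print(char_bigrams("our model"))  # ['ou', 'r_', 'mo', 'de', 'l']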
5.1.2 Enc-Dec noising and denoising model
As an extension of the heuristic noising and denoising model, we changed the
noising methods to simulate the characteristics of X in Smith more faithfully than
the heuristic rules in Section 5.1.1 do. As described in Section 4, the drafts tended
to (i) contain grammatical errors, (ii) use stylistically improper wording, and (iii)
lack certain words. We used the following three neural encoder-decoder (Enc-Dec)
models to generate the synthetic draft sentences.
Grammatical error generation Here, we trained a model that introduces
synthetic grammatical errors to “clean” sentences by using a “flipped” dataset
from GEC (clean → erroneous). We used nonidentical (source, target) sentence
pairs from the Lang-8, AESW, and JFLEG datasets.
Style removal To generate stylistically unnatural sentences in the academic
domain, we used paraphrasing, which preserves a sentence’s content while disre-
garding its style. We used the ParaNMT-50M dataset [18], a paraphrase dataset
automatically created using Enc-Dec translation. We extracted parallel sentences
Table 7: Results of quantitative evaluation. Gramm. denotes the grammaticality
score.
Model BLEU ROUGE-L BERT-P BERT-R BERT-F P R F0.5 Gramm. PPL
Draft X 9.8 46.8 75.9 78.2 77.0 - - - 92.9 1454
H-ND 8.2 45.0 77.0 76.1 76.5 5.4 2.9 4.6 94.1 406
ED-ND 15.4 51.1 80.9 80.0 80.4 21.8 12.8 19.2 96.3 236
GEC 11.9 49.0 80.8 79.1 79.9 22.2 6.2 14.6 96.7 414
Reference Y - - - - - - - - 96.5 147
with annotated paraphrase scores between 0.7 and 0.95 from the ParaNMT-50M
dataset and used swapped pairs of source and target sentences in the dataset.
Entailed sentence generation To simulate the missing words in the draft sen-
tences, we trained a model that generated a sentence entailed with the given text.
We extracted entailed sentence pairs from the SNLI [19] and the MultiNLI [20]
datasets.
Random noising beam search As Xie et al. [13] pointed out, a standard
beam search often yields hypotheses that are too conservative. This tendency
leads the noising models to generate synthetic draft sentences similar to their
references. To address this problem, we applied the random noising beam search
[13] on all three noising models. Specifically, during the beam search, we added
rβ to the scores of the hypotheses, where r is a value sampled from a uniform
distribution over the interval [0, 1], and β is a penalty hyperparameter set to 5.
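The score perturbation itself is a few lines; the following is a simplified, framework-independent sketch (in practice this is applied inside the beam search of the noising models).

import random

BETA = 5.0  # noise penalty hyperparameter

def noisy_hypothesis_score(score: float) -> float:
    # add r * beta, r ~ Uniform(0, 1), to a hypothesis score [13]
    return score + random.uniform(0.0, 1.0) * BETA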
We obtained 14.6M sentence pairs (X^aasc_encdec, Y^aasc) by applying these Enc-Dec
noising models to Y^aasc. To train the denoising model, we used both (X^aasc_hrst,
Y^aasc) and (X^aasc_encdec, Y^aasc). The model architecture was the same as that of
the heuristic model. This denoising model is our second baseline model (ED-ND).
To facilitate research on the SentRev task, we release all 19.6M synthetic sentence pairs.9
9https://github.com/taku-ito/INLG2019_SentRev
[Figure 6: bar chart (%) comparing error-type frequencies between drafts in Smith and drafts in the synthetic data.]
Figure 6: Comparison of the 10 most frequent error types in Smith and synthetic
drafts created by the Enc-Dec noising methods.
[Figure 7: F0.5 of the ED-ND model per error type (roughly 0-0.5), over OTHER, DET, NOUN, NOUN:NUM, PREP, VERB, PUNCT, SPELL, ORTH, and ADJ.]
Figure 7: Performance of the ED-ND baseline model on the 10 most frequent
error types in Smith.
Analysis of the synthetic drafts Finally, we analyzed the error type distribution
of the synthetic data used for training the Enc-Dec noising and denoising
model with ERRANT (Figure 6). The error type distribution of the synthetic
dataset showed tendencies similar to those of the development set of Smith
(i.e., real drafts); the Kullback–Leibler divergence between the two distributions
was 0.139. This result supports the validity of our assumption that the SentRev
task is a combination of GEC, style transfer, and a completion-type task.
Table 6 shows examples of the training data generated by the noising models
described in Sections 5.1.1 and 5.1.2. Heuristic noising, the rule-based noising
method, created ungrammatical sentences. The grammatical error generation model added
grammatical errors (e.g., plan to analyze → plan to analysis). The style removal
model generated stylistically unnatural sentences for the academic domain (e.g.,
redesign → renewal). The entailed sentence generation model caused a lack of
information.
5.1.3 GEC model
The GEC task is closely related to SentRev. We examined the performance of
the current state-of-the-art GEC model [21] in our task. We applied spelling
correction before evaluation following [21].
5.2 Evaluation metrics
The SentRev task admits a very diverse space of valid revisions for a given draft,
which makes it challenging to evaluate. As one solution, we evaluated performance
from multiple aspects by using both reference-based and reference-less evaluation
metrics. We used BLEU, ROUGE-L, and the F0.5 score, which are widely used
metrics in related tasks (machine translation, style-transfer, GEC). We used
nlg-eval [22] to compute the BLEU and ROUGE-L scores and calculated F0.5
scores with ERRANT. In addition, to handle the lexical and compositional
diversity of valid revisions, we used BERTScore [23], a contextualized-embedding-based
evaluation metric. Furthermore, we used two reference-less evaluation
metrics: the grammaticality score [24] and PPL. Grammaticality was scored as

    1 − (N_errors in sentence / N_tokens in sentence),

where the number of grammatical errors in a sentence is obtained using
LanguageTool.10 By using a language model tuned to the academic domain, we
expect PPL to evaluate the stylistic validity and fluency of a completed sentence.
We favored n-gram language models over neural language models for reproducibility
and calculated the score in the same manner as described in Section 4.4.
10https://github.com/languagetool-org/languagetool/releases/tag/v3.2
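A sketch of the grammaticality metric using the language_tool_python wrapper and whitespace tokenization; the wrapper and tokenization are assumptions, as the thesis used LanguageTool v3.2 directly and does not specify its tokenizer.

import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def grammaticality(sentence: str) -> float:
    # 1 - (number of LanguageTool errors / number of tokens)
    n_errors = len(tool.check(sentence))
    n_tokens = max(len(sentence.split()), 1)
    return 1.0 - n_errors / n_tokens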
6 Results
Table 7 shows the performance of the baseline models. We observed that the
ED-ND model outperforms the other models in nearly all evaluation metrics.
This finding suggests that the Enc-Dec noising methods induced noise closer to
real-world drafts compared with the heuristic methods.
The current state-of-the-art GEC model showed high precision but low recall
in the F0.5 evaluation. This suggests that the SentRev task requires the model to
make more drastic changes to the drafts than the GEC task does. Furthermore, the GEC
model, trained in the general domain, showed the worst performance in PPL.
This indicates that the general GEC model did not reflect academic writing style
upon revision and that SentRev requires academic domain-aware rewriting.
Table 8 shows examples of the models’ output. In the first example, the ED-ND
model made a drastic change to the draft. The middle example demonstrates that
our models replaced the <*> token with plausible words. The last example is the
case where our model underperformed, making erroneous edits such as changing
“Chart4” to “Figure2” and suggesting odd content (“relation between model and
gold standard and piason”). This may be due to noise inadvertently introduced
when generating the training data. Appendix C shows more examples of
generated sentences. Using ERRANT, we analyzed the performance of the ED-
ND baseline model by error types. The results are shown in Figure 7. Overall,
typical grammatical errors such as noun number errors or orthographic errors are
well corrected, but the model struggles with drastic revisions (“OTHER” type
errors).
7 Related work
7.1 Writing assistance in the academic domain
Several shared tasks for assisting academic writing have been organized. The
Helping Our Own (HOO) 2011 Pilot Shared Task [2] aimed to promote the de-
velopment of tools and techniques to assist authors in writing, with a specific
focus on writing within the NLP community. The Automated Evaluation of Sci-
entific Writing (AESW) Shared Task [1] was organized to promote tools to help
write scientific papers. The HOO dataset was created by finding errors in pub-
lished papers and editing the errors, and the AESW dataset contains a collection
of text extracts from published journal articles before and after proofreading.
Rather than adding finishing touches to almost completed sentences, our task is
to convert unfinished, rough drafts into complete sentences. In addition, these
studies tackled the identification of errors, while SentRev goes further by
rewriting the drafts.
Other corpora of revisions are available in the academic domain [25, 26, 27],
but they were built in smaller or less scalable ways: Zhang et al. [27] recruited
60 students over 2 weeks, and Lee and Webster [25] collected data from a
language learning project in which over 300 tutors reviewed academic essays
written by 4,500 students. We thus provide a notable contribution by exploring
a scalable crowdsourcing approach to creating a dataset of revisions.
7.2 Grammatical error correction
GEC is the task of correcting errors in text such as spelling, punctuation, gram-
mar, and word choice [28, 29]. GEC falls within the editing and proofreading
phases of the writing process, while SentRev subsumes GEC and a broader range
of text generation (e.g., increasing the fluency of the sentence and complement-
ing missing information). Napoles et al. [7] and Sakaguchi et al. [30] explored
fluency edits to correct grammatical errors and to make a text more “native
sounding.” Although this direction is similar to SentRev, our task uses sentences
that require many more corrections.
7.3 Style transfer
Style transfer is the task of rephrasing the text to conform to specific stylistic
properties while preserving the text’s original semantic content [31, 32]. From
the perspective of automatic academic writing assistance, assistance systems
are required to convert nonacademic-style drafts into academic-style text. This
type of transfer can be regarded as a subproblem of the revising stage of the
writing process.
7.4 Text completion
The drafts in the revising stage may contain gaps denoted with <*>. This setting
is similar to text infilling [33], masking-based language modeling [34, 35], or the
sentence completion task [36], where the models are required to replace mask
tokens with plausible words. Notably, SentRev differs from such tasks because
systems for those tasks are expected to keep all the original tokens unchanged
and only fill each <*> token with one or more tokens.
8 Conclusion
We proposed the SentRev task, where an incomplete, rough draft sentence is
transformed into a more fluent, complete sentence in the academic writing do-
main. We created the Smith dataset with crowdsourcing for development and
evaluation of this task and established baseline performance with a synthetic
training dataset. We believe that this task can increase the effectiveness of the
process of academic writing.
Acknowledgements
I am deeply grateful to Dr. Kentaro Inui and Dr. Jun Suzuki for their able guidance
and generous support. I would like to show my greatest appreciation to Dr. Hay-
ato Kobayashi, Dr. Masato Hagiwara, Ana Brassard, and Tatsuki Kuribayashi for
insightful comments and constructive suggestions. Discussions with my academic
colleagues in our laboratory have been quite illuminating.
References
[1] Vidas Daudaravicius. Automated Evaluation of Scientific Writing: AESW
Shared Task Proposal. In Proceedings of the Tenth Workshop on Innovative
Use of NLP for Building Educational Applications (BEA), pages 56–63, 2015.
[2] Robert Dale and Adam Kilgarriff. Helping Our Own: The HOO 2011 Pilot
Shared Task. In Proceedings of ENLG, pages 242–249, 2011.
[3] Bernard Susser. Process approaches in ESL/EFL writing instruction. Jour-
nal of Second Language Writing, 3(1):31–47, 1994.
[4] Anthony Seow. The writing process and process writing. Methodology in
language teaching: An anthology of current practice, pages 315–320, 2002.
[5] Michael Buchman, Roberta Moore, Linda Stern, and Betsy Feist. Power
Writing: Writing with Purpose, volume 4. 2000.
[6] Andrew D Cohen and Amanda Brooks-Carson. Research on direct versus
translated writing: Students’ strategies and their results. The Modern Lan-
guage Journal, 85(2):169–188, 2001.
[7] Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. JFLEG: A Flu-
ency Corpus and Benchmark for Grammatical Error Correction. In Proceed-
ings of EACL, pages 229–234, 2017.
[8] Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Mat-
sumoto. Mining Revision Log of Language Learning SNS for Automated
Japanese Error Correction of Second Language Learners. In Proceedings of
IJCNLP, pages 147–155, 2011.
[9] Christopher Bryant, Mariano Felice, and Ted Briscoe. Automatic Annota-
tion and Evaluation of Error Types for Grammatical Error Correction. In
Proceedings of ACL, pages 793–805, 2017.
[10] Mariano Felice, Christopher Bryant, and Ted Briscoe. Automatic Extraction
of Learner Errors in ESL Sentences Using Linguistically Enhanced Align-
ments. In Proceedings of COLING, pages 825–835, 2016.
[11] Rudolph Flesch. A new readability yardstick. Journal of Applied Psychology,
32(3):221–233, June 1948.
[12] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding
Back-Translation at Scale. In Proceedings of EMNLP, pages 489–500, 2018.
[13] Ziang Xie, Guillaume Genthial, Stanley Xie, Andrew Ng, and Dan Juraf-
sky. Noising and Denoising Natural Language: Diverse Backtranslation for
Grammar Correction. In Proceedings of NAACL-HLT, pages 619–628, 2018.
[14] Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar,
and Simon Tong. Corpora generation for grammatical error correction. In
Proceedings of NAACL-HLT, pages 3291–3301, June 2019.
[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you
Need. In Proceedings of NIPS, pages 5998–6008, 2017.
[16] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan
Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit
for sequence modeling. In Proceedings of NAACL-HLT, pages 48–53, June
2019.
[17] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Opti-
mization. In Proceedings of ICLR, 2015.
[18] John Wieting and Kevin Gimpel. ParaNMT-50M: Pushing the Limits of
Paraphrastic Sentence Embeddings with Millions of Machine Translations.
In Proceedings of ACL, pages 451–462, 2018.
[19] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D.
Manning. A Large Annotated Corpus for Learning Natural Language Infer-
ence. In Proceedings of EMNLP, pages 632–642, 2015.
[20] Adina Williams, Nikita Nangia, and Samuel Bowman. A Broad-Coverage
Challenge Corpus for Sentence Understanding through Inference. In Pro-
ceedings of NAACL-HLT, pages 1112–1122, 2018.
[21] Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. Im-
proving grammatical error correction via pre-training a copy-augmented ar-
chitecture with unlabeled data. In Proceedings of the NAACL-HLT, pages
156–165, June 2019.
[22] Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. Rel-
evance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating
Natural Language Generation. arXiv preprint arXiv:1706.09799, 2017.
[23] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav
Artzi. BERTScore: Evaluating Text Generation with BERT. arXiv preprint
arXiv:1904.09675, 2019.
[24] Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. There’s No Com-
parison: Reference-less Evaluation Metrics in Grammatical Error Correction.
In Proceedings of EMNLP, pages 2109–2115, 2016.
[25] John Lee and Jonathan Webster. A corpus of textual revisions in second
language writing. In Proceedings of ACL, pages 248–252, July 2012.
[26] Chenhao Tan and Lillian Lee. A corpus of sentence-level revisions in aca-
demic writing: A step towards understanding statement strength in commu-
nication. In Proceedings of ACL, pages 403–408, June 2014.
[27] Fan Zhang, Homa B. Hashemi, Rebecca Hwa, and Diane Litman. A Corpus
of Annotated Revisions for Studying Argumentative Writing. In Proceedings
of ACL, pages 1568–1578, 2017.
[28] Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Ray-
mond Hendy Susanto, and Christopher Bryant. The CoNLL-2014 Shared
Task on Grammatical Error Correction. In Proceedings of CoNLL: Shared
Task, pages 1–14, 2014.
[29] Zheng Yuan and Ted Briscoe. Grammatical Error Correction using Neural
Machine Translation. In Proceedings of NAACL-HLT, pages 380–386, 2016.
[30] Keisuke Sakaguchi, Courtney Napoles, Matt Post, and Joel Tetreault. Re-
assessing the Goals of Grammatical Error Correction: Fluency instead of
Grammaticality. TACL, 4:169–182, 2016.
[31] Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. Content Preserving
Text Generation with Attribute Controls. In Proceedings of NeurIPS, pages
5103–5113, 2018.
[32] Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W.
Black. Style Transfer Through Back-Translation. In Proceedings of ACL,
pages 866–876, 2018.
[33] Wanrong Zhu, Zhiting Hu, and Eric Xing. Text Infilling. arXiv preprint
arXiv:1901.00158, 2019.
[34] William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better Text
Generation via Filling in the ______. In Proceedings of ICLR, 2018.
[35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language under-
standing. In Proceedings of NAACL-HLT, pages 4171–4186, June 2019.
[36] Geoffrey Zweig, John C. Platt, Christopher Meek, Christopher J. C. Burges,
Ainur Yessenalina, and Qiang Liu. Computational Approaches to Sentence
Completion. In Proceedings of ACL, pages 601–610, 2012.
[37] Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how
Corpora Differ. In Proceedings of ACL System Demonstrations, pages 85–
90, 2017.
Figure 8: Characteristic words and phrases in draft sentences and reference sen-
tences in the development set of Smith.
Appendix
A Lexical tendencies
Certain words and phrases were more frequently observed in the reference sen-
tences than in the draft sentences, and vice-versa. Figure 8 visualizes these biases,
where words more often observed in the draft sentences are plotted in the upper-
left corner, and words more often observed in the references are plotted in the
lower-right corner. Words observed more commonly in the drafts were will, is
not, if, and I; words more common in the references were can be, no, when, and
they. The contrast also includes
Algorithm 1 Heuristic noising
INPUT: x = {w0, w1, ..., wn}
 1: x = delete(x, 0.1)       # 10% of the tokens in x are deleted
 2: x = replace(x, 0.1)      # 10% of the tokens in x are replaced with common terms in ACL
 3: x = permutate(x)         # permute the tokens in x
 4: r ← Uniform(0, 0.5)
 5: m = int(x.length * r)
 6: c = 0
 7: while c < m do
 8:   n ← sample({j ∈ N | 1 ≤ j ≤ m − c})
 9:   (s, e) ← sample({n-grams of x})
10:   x = x[:s−1] + <*> + x[e+1:]
11:   c = c + n
12: end while
# r × 100% of the tokens in x are masked
a common spelling variation (data set vs. dataset) and plurality (method vs.
methods). The plot was generated using the scattertext toolkit [37].
B Heuristic noising algorithm
Algorithm 1 details the noising procedure used in the heuristic noising method.
C Examples from the Smith dataset and sentences generated by the baseline models
Table 9 shows examples from the Smith dataset and the output of the baseline
models. “Reference” is a sentence extracted from papers; “Draft” is written by
a crowdworker and serves as the input to the baseline models.
Table 8: Examples of the output from the baseline models. Bold text indicates
tokens introduced by the model.
Draft The global modeling using the reinforcement learning in all documents is our
work in the future .
H-ND The global modeling of the reinforcement learning using all documents in our
work is the future .
ED-ND In our future work , we plan to explore the use of global modeling for
reinforcement learning in all documents .
GEC Global modelling using reinforcement learning in all documents is our work in the
future .
Reference The global modeling using reinforcement learning for a whole document is our
future work .
Draft Also , the above <*> efficiently calculated by dynamic programming .
H-ND Also , the above results are calculated efficiently by dynamic programming .
ED-ND Also , the above probabilities are calculated efficiently by dynamic program-
ming .
GEC Also , the above is efficiently calculated by dynamic programming .
Reference Again , the above equation can be efficiently computed by dynamic programming
.
Draft Chart4 : relation model and gold % between KL and piason .
H-ND Table 1 : Charx- relation between gold and piason and KL .
ED-ND Figure 2 : CharxDiff relation between model and gold standard and
piason .
GEC Chart4 : relation model and gold % between KL and person .
Reference Table 4 : KL and Pearson correlation between model and gold probability .
Table 9: Further examples of draft, reference, and the baseline models’ output.
Draft By this setting , the persona is acquired from a test set popl about both turker
anad model .
H-ND By this setting , the persona is acquired from a test set both about popl anad
anad model .
ED-ND In this setting , persona is obtained from the test set popl about both Turker
and model .
GEC By this setting , the persona is acquired from a test set pool about both turkey
and models .
Reference In this setting , for both the Turker and the model , the personas come from the
test set pool .
Draft In addition to results of study until now , we add two baseline to vindicate
effectiveness on our flame work .
H-ND In addition to the results of this study , we now add two baseline methods to
vindicate effectiveness on our work .
ED-ND In addition to the results of the study until now , we add two baselines to
visualize the effectiveness of our framework .
GEC In addition to the results of study until now , we added two baseline to vindicate
effectiveness on our flame work .
Reference In addition to results of previous work , we add two baselines to demonstrate
the effectiveness of our framework .
Draft Yhe input and output <*> are one - hot encoding of the center word and the
context word , <*> .
H-ND The input and output are one - hot encoding of the center word and the context
word , respectively .
ED-ND The input and output layers are one - hot encoding of the center word and the
context word , respectively .
GEC Yhe input and output are one - hot encoding of the center word and the context
word , .
Reference The input and output layers are centre word and context word one - hot encod-
ings , respectively .
List of Publications
Awards
Journal Papers
International Conference Papers
1. Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Ana Brassard, Masato
Hagiwara, Jun Suzuki and Kentaro Inui. Diamonds in the Rough: Gener-
ating Fluent Sentences from Early-stage Drafts for Academic Writing As-
sistance. To appear in Proceedings of the 12th International Conference on
Natural Language Generation (INLG 2019), pp. –, October 2019.
2. Masato Hagiwara, Takumi Ito, Tatsuki Kuribayashi, Jun Suzuki and Ken-
taro Inui. TEASPN: Framework and Protocol for Integrated Writing As-
sistance Environments. To appear in Proceedings of 2019 Conference on
Empirical Methods in Natural Language Processing and 9th International
Joint Conference on Natural Language Processing: System Demonstrations
(EMNLP-IJCNLP 2019), pp. –, November 2019.
Other Publications
• Takumi Ito, Tatsuki Kuribayashi, Masato Hagiwara, Jun Suzuki and Kentaro Inui.
An integrated writing assistance environment for writing English papers (in
Japanese). The 14th Symposium of the Young Researchers' Association for NLP
Studies (YANS), August 2019.
• Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Jun Suzuki and Kentaro Inui.
Information-completing generation for writing assistance (in Japanese). The 25th
Annual Meeting of the Association for Natural Language Processing, pp. 970-973,
March 2019.
• Tatsuki Kuribayashi, Takumi Ito, 内山香, Jun Suzuki and Kentaro Inui. Evaluation
of Japanese word order using language models and analysis of canonical word
order (in Japanese). The 25th Annual Meeting of the Association for Natural
Language Processing, pp. 1053-1056, March 2019.
• Takumi Ito, 山口健史, Ran Tian, Koji Matsuda, Naoaki Okazaki and Kentaro Inui.
Comparative mining of municipal FAQs (in Japanese). The 24th Annual Meeting
of the Association for Natural Language Processing, pp. 536-539, March 2018.
• Takumi Ito, Masatoshi Suzuki, Ran Tian, 山口健史, Naoaki Okazaki and Kentaro
Inui. Cross-municipal analysis of FAQs for municipal QA services (in Japanese).
The 12th Symposium of the Young Researchers' Association for NLP Studies
(YANS), September 2017.