THE UNIVERSITY OF CHICAGO
BROAD CONTEXT LANGUAGE MODELING WITH READING COMPREHENSION
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
BY
ZEWEI CHU
CHICAGO, ILLINOIS
JUNE, 2017
Copyright © 2017 by Zewei Chu
All Rights Reserved
I dedicate my dissertation to my grandparents Chengrong Li and Liuzhen Xu.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
1 INTRODUCTION
2 BACKGROUND
2.1 Machine Reading Comprehension
2.2 Neural Readers
2.3 The LAMBADA dataset
3 METHODS
3.1 LAMBADA as Reading Comprehension
3.2 Training Data Construction
4 EXPERIMENTS
5 RESULTS
5.1 Manual Analysis
5.2 Discussion
6 CONCLUSION
REFERENCES
7 APPENDIX
LIST OF FIGURES
2.1 Attention Sum Reader [7]
2.2 Gated-attention Reader [3]
2.3 Stanford Reader [1]
5.1 Accuracy
7.1 instance 1
7.2 instance 2
7.3 instance 3
7.4 instance 4
7.5 instance 5
7.6 instance 6
7.7 instance 7
7.8 instance 8
7.9 instance 9
7.10 instance 10
LIST OF TABLES
2.1 Example instances from LAMBADA.
5.1 Accuracies on test and control datasets, computed over all instances (“all”) and separately on those in which the answer is in the context (“context”). The first section is from [10]. ∗Estimated from 100 randomly-sampled dev instances. †Estimated from 100 randomly-sampled control instances.
5.2 Labels derived from manual analysis of 100 LAMBADA dev instances. An instance can be tagged with multiple labels, hence the sum of instances across labels exceeds 100.
ACKNOWLEDGMENTS
I want to thank Hai Wang, my thesis advisor Kevin Gimpel, and Prof. David McAllester
for their collaboration on the project of applying reading comprehension models to the
LAMBADA dataset. Hai Wang contributed his code for various neural reader models and
GPUs for our experiments. My advisor Kevin Gimpel provided insightful guidance on the
project and performed the manual analysis of 100 LAMBADA instances. Prof. David
McAllester provided feedback and suggestions for the project. Without their generous
contributions, this project and this thesis could not have been accomplished.
We thank Denis Paperno for answering our questions about the LAMBADA dataset and
we thank NVIDIA Corporation for donating GPUs used in this research.
ABSTRACT
Progress in text understanding has been driven by large datasets that test particular capa-
bilities, like recent datasets for reading comprehension [4]. We focus here on the LAMBADA
dataset [10], a word prediction task requiring broader context than the immediate sentence.
We view LAMBADA as a reading comprehension problem and apply comprehension models
based on neural networks. Though these models are constrained to choose a word from the
context, they improve the state of the art on LAMBADA from 7.3% to 49%. We analyze 100
instances, finding that neural network readers perform well in cases that involve selecting a
name from the context based on dialogue or discourse cues but struggle when coreference
resolution or external knowledge is needed.
CHAPTER 1
INTRODUCTION
The recently created LAMBADA dataset [10] is a challenging word prediction task for all
state-of-the-art language models. The dataset was created to encourage the development of
new language models that capture broader context.
Paperno et al. [10] provide baseline results with popular language models and neural
network architectures; none achieves accuracy above 1%. Their best accuracy, 7.3%, is
obtained by randomly choosing a capitalized word from the passage.
Our approach is based on the observation that in 83% of instances the answer appears
in the context. We exploit this in two ways. First, we automatically construct a large
training set of 1.8 million instances by simply selecting passages where the answer occurs in
the context. Second, we treat the problem as a reading comprehension task similar to the
CNN/Daily Mail datasets introduced by [4], the Children’s Book Test (CBT) of [5], and the
Who-did-What dataset of [9]. We show that standard models for reading comprehension,
trained on our automatically generated training set, improve the state of the art on the
LAMBADA test set from 7.3% to 49.0%. This is in spite of the fact that these models fail
on the 17% of instances in which the answer is not in the context.
We also perform a manual analysis of the LAMBADA task, provide an estimate of human
performance, and categorize the instances in terms of the phenomena they test. We find that
the comprehension models perform best on instances that require selecting a name from the
context based on dialogue or discourse cues, but struggle when required to do coreference
resolution or when external knowledge could help in choosing the answer.
Our contributions are:
• We review the recent development of reading comprehension models, recently created
reading comprehension datasets, and the LAMBADA dataset.
• We created a new training set from the original LAMBADA dataset.
• We provide stronger baseline results on the LAMBADA dataset.
• We show that language modeling tasks can be assisted by reading comprehension models.
• Our manual analysis provides insights on the design of future models to further improve
the results on LAMBADA.
CHAPTER 2
BACKGROUND
This chapter briefly reviews machine reading comprehension tasks, neural network based
reading comprehension models and the LAMBADA dataset.
2.1 Machine Reading Comprehension
Researchers have created various reading comprehension datasets to test machines' ability
to understand text. We introduce some recently created high-quality datasets.
MCTest [12]: each instance of MCTest comes with a story and four multiple-choice
questions about the story. The dataset is of high quality, but its limitation is that it
provides only 640 instances, which is not sufficient for training a good model for solving
reading comprehension tasks.
DeepMind's CNN/Daily Mail dataset [4] is built from news articles from CNN and the
Daily Mail. It is much bigger, consisting of approximately 187k documents and 1,259k
questions. This dataset boosted research on designing attention-based neural reading
comprehension models, as introduced in Section 2.2.
Some other high-quality reading comprehension datasets, such as SQuAD [11], WikiQA [15],
and RACE [8], focus on various aspects of language. We do not discuss them in detail, for
brevity.
2.2 Neural Readers
Various kinds of recurrent neural network models have been invented to solve cloze-style
question answering tasks with state-of-the-art accuracies. DeepMind [4] provided a large-scale
supervised reading comprehension dataset (CNNDM: the CNN/Daily Mail dataset) and
Figure 2.1: Attention Sum Reader [7]
Figure 2.2: Gated-attention Reader [3]
boosted research on designing attention-based neural reading comprehension models, such
as the Attention Sum Reader [7], the Stanford Reader [1], and the Gated-Attention Reader [3].
We refer to these models as “neural readers”. The architectures of the aforementioned
three neural readers are shown in Figures 2.1, 2.2, and 2.3.
These neural readers use attention based on the question and passage to choose an answer
from among the words in the passage. We use d for the context word sequence, q for the
question (with a blank to be filled), A for the candidate answer list, and V for the vocabulary.
We describe neural readers in terms of three components:
Figure 2.3: Stanford Reader [1]
1. Embedding and Encoding: Each word in d and q is mapped into a v-dimensional
vector via the embedding function e(w) ∈ Rv, for all w ∈ d ∪ q.1 The same embedding
function is used for both d and q. The embeddings are learned from random initializa-
tion; no pretrained word embeddings are used. The embedded context is processed by a
bidirectional recurrent neural network (RNN) which computes hidden vectors hi for each
position i in each sequence:
\overrightarrow{h} = f_{RNN}(\overrightarrow{\theta}_d, e(d))
\overleftarrow{h} = b_{RNN}(\overleftarrow{\theta}_d, e(d))
h = \langle \overrightarrow{h}, \overleftarrow{h} \rangle
where \overrightarrow{\theta}_d and \overleftarrow{\theta}_d are RNN parameters, and each of f_{RNN} and b_{RNN} returns a sequence
of hidden vectors, one for each position in the input e(d). The question is encoded into a
1. We overload the e function to operate on sequences and denote the embeddings of d and q as matrices e(d) and e(q).
single vector g which is the concatenation of the final vectors of two RNNs:
\overrightarrow{g} = f_{RNN}(\overrightarrow{\theta}_q, e(q))
\overleftarrow{g} = b_{RNN}(\overleftarrow{\theta}_q, e(q))
g = \langle \overrightarrow{g}_{|q|}, \overleftarrow{g}_0 \rangle
The RNNs use either gated recurrent units [2] or long short-term memory [6].
The Gated-Attention Reader uses a different bidirectional RNN encoder at each layer.
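The encoding step above can be sketched in a few lines of numpy. This is a minimal illustration rather than the thesis implementation: it substitutes a plain tanh RNN for the GRU/LSTM cells, and all names and dimensions are toy assumptions.

```python
import numpy as np

def rnn(params, xs):
    """Minimal tanh RNN standing in for a GRU/LSTM; returns one hidden
    vector per input position."""
    W_x, W_h, b = params
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bi_encode(fwd_params, bwd_params, emb):
    """h_i is the concatenation of forward and backward states at position i."""
    h_fwd = rnn(fwd_params, emb)              # f_RNN over e(d)
    h_bwd = rnn(bwd_params, emb[::-1])[::-1]  # b_RNN, re-aligned to positions
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
v, k, n = 4, 3, 5  # embedding dim, hidden dim, context length (toy sizes)
params = lambda: (rng.normal(size=(k, v)), rng.normal(size=(k, k)), np.zeros(k))
h = bi_encode(params(), params(), rng.normal(size=(n, v)))  # e(d) is random here
print(h.shape)  # one 2k-dimensional vector per context position
```

The question encoding g would reuse the same machinery, concatenating the final forward state with the first backward state.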
2. Attention: The readers then compute attention weights over the positions of h using g. In
general, we define \alpha_i = softmax(att(h_i, g)), where i ranges over positions in h. The att
function is an inner product in the Attention Sum Reader [7] and a bilinear product in
the Stanford Reader [1].
The computed attentions are then passed through a softmax function to form a probability
distribution.
The Gated-Attention Reader uses a richer attention architecture [3]. Each attention layer
has its own question embedding function, and the next layer's document representation is
computed by the GA function below, where superscripts indicate the layer:
d^{(k)} = GA(h^{(k)}, q^{(k)})
The GA function at each layer is defined as follows, with superscripts omitted:
\alpha_i = softmax(e(q)^\top h_i)   (2.1)
\tilde{q}_i = e(q)\,\alpha_i   (2.2)
d_i = h_i \odot \tilde{q}_i   (2.3)
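The attention variants above can be written out concretely. This numpy sketch is illustrative rather than the thesis implementation: `inner_att` mirrors the Attention Sum Reader's inner product, `bilinear_att` the Stanford Reader's bilinear product, and `ga_layer` one Gated-Attention layer in the style of Eqs. (2.1) to (2.3); the toy dimensions assume the question token embeddings already match the size of the context vectors.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def inner_att(h, g):
    """Attention Sum Reader: att(h_i, g) = h_i . g, softmaxed over positions."""
    return softmax(h @ g)

def bilinear_att(h, g, W):
    """Stanford Reader: att(h_i, g) = h_i^T W g."""
    return softmax(h @ (W @ g))

def ga_layer(h, e_q):
    """One Gated-Attention layer: alpha_i = softmax(e(q)^T h_i),
    q_tilde_i = e(q) alpha_i, d_i = h_i elementwise-times q_tilde_i."""
    out = np.empty_like(h)
    for i, h_i in enumerate(h):
        alpha = softmax(e_q.T @ h_i)  # distribution over question tokens
        out[i] = h_i * (e_q @ alpha)  # gate h_i by a token-weighted summary
    return out

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 4))      # context vectors, one per position
g = rng.normal(size=4)           # question vector
e_q = rng.normal(size=(4, 3))    # question token embeddings (toy)
alpha = inner_att(h, g)
print(alpha.sum())               # attention weights form a distribution
print(ga_layer(h, e_q).shape)    # same shape as h, ready for the next layer
```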
3. Output and Prediction: To output a prediction a∗, the Stanford Reader [1] computes
the attention-weighted sum of the context vectors and then an inner product with each
candidate answer:
c = \sum_{i=1}^{|d|} \alpha_i h_i
a^* = argmax_{a \in A} o(a)^\top c
where o(a) is the “output” embedding function. As the Stanford Reader was developed
for the anonymized CNN/Daily Mail tasks, only a few entries in the output embedding
function needed to be well-trained in their experiments. However, for LAMBADA, the
set of possible answers can range over the entirety of V , making the output embedding
function difficult to train. Therefore we also experiment with a modified version of the
Stanford Reader that uses the same embedding function e for both input and output
words:
a^* = argmax_{a \in A} e(a)^\top W c   (2.4)
where W is an additional parameter matrix used to match dimensions and model any
additional needed transformation.
For the Attention Sum and Gated-Attention Readers the answer is computed by:
P(a \mid d, q) = \sum_{i \in I(a,d)} \alpha_i \quad \forall a \in A
a^* = argmax_{a \in A} P(a \mid d, q)
where I(a,d) is the set of positions where a appears in context d.
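The two output functions can be made concrete. The numpy sketch below is illustrative only: `stanford_predict` follows the shared-embedding scoring of Eq. 2.4 and `attention_sum_predict` sums attention mass over each candidate's occurrences; the tiny context, attention weights, and embeddings are all invented for the example.

```python
import numpy as np
from collections import defaultdict

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def stanford_predict(h, alpha, W, emb, candidates):
    """Modified Stanford Reader output (Eq. 2.4): score each candidate a
    by e(a)^T W c, where c is the attention-weighted context vector."""
    c = alpha @ h                            # c = sum_i alpha_i h_i
    return max(candidates, key=lambda a: emb[a] @ (W @ c))

def attention_sum_predict(words, alpha, candidates):
    """AS/GA Reader output: P(a) = sum of alpha_i over positions where a occurs."""
    p = defaultdict(float)
    for w, a_i in zip(words, alpha):
        p[w] += a_i
    return max(candidates, key=lambda a: p[a])

words = ["he", "was", "a", "great", "craftsman", "said", "heather"]
alpha = softmax(np.array([0.2, 0.1, 0.1, 0.5, 2.0, 0.3, 1.5]))  # toy attention
rng = np.random.default_rng(2)
h = rng.normal(size=(len(words), 4))
emb = {w: rng.normal(size=4) for w in set(words)}
cands = ["craftsman", "heather"]
print(attention_sum_predict(words, alpha, cands))  # "craftsman": largest mass
print(stanford_predict(h, alpha, rng.normal(size=(4, 4)), emb, cands) in cands)
```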
2.3 The LAMBADA dataset
The LAMBADA dataset [10] was designed by identifying word prediction tasks that require
broad context. Each instance is drawn from the BookCorpus [16] and consists of a passage
of several sentences where the task is to predict the last word of the last sentence.
The creation procedure of the LAMBADA dataset was designed to guarantee that two
human subjects could correctly predict the target word given the context, while at least
ten human subjects could not correctly guess the target word when given only the target
sentence without the context. This filtering procedure finds cases that are guessable by
humans when given the larger context but not when given only the last sentence.
The expense of this manual filtering has limited the dataset to only about 10,000 instances
which are viewed as development and test data. The training data is taken to be books in
the corpus other than those from which the evaluation passages were extracted.
Table 2.1 shows two LAMBADA instances. Each instance consists of a context document,
a target sentence without the target word, and a target word that needs to be predicted.
None of several state-of-the-art language models reaches an accuracy above 1% on the
LAMBADA dataset, as reported by Paperno et al. [10]. They thus present the LAMBADA
dataset as a challenging test set, meant to encourage models that achieve genuine
understanding of broad context in natural language text.
Context: “Why?” “I would have thought you'd find him rather dry,” she said. “I don't know about that,” said Gabriel. “He was a great craftsman,” said Heather. “That he was,” said Flannery.
Target Sentence: “And Polish, to boot,” said
Target Word: Gabriel
Context: Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree's rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold.
Target Sentence: It almost made up for the lack of
Target Word: coffee
Table 2.1: Example instances from LAMBADA.
CHAPTER 3
METHODS
In this chapter we describe how we model the LAMBADA dataset as a reading comprehension
task, and the procedures we take to create a new training set from the original LAMBADA
training set.
3.1 LAMBADA as Reading Comprehension
As reported in [10], n-gram and LSTM language models achieve accuracies of less than 1%
on the LAMBADA dataset. [10] claims that solving LAMBADA requires broader context
information than traditional language models can provide. This inspired us to model
LAMBADA as a reading comprehension task and to apply neural reading comprehension
models to it.
For each instance, we feed the context sentences to the neural readers as the context
document, and the target sentence, with its final word blanked out, as the question. Some
neural readers require a list of candidate target words to choose from, and the tricky part
is which candidate list to feed them. In our experiments, we list all words in the context
as candidate answers, except for punctuation.1 This choice, though it sacrifices the
instances that do not contain the target word in the context, is reasonable, as over 80% of
LAMBADA development instances have the target word in the context.
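As a concrete illustration of the candidate construction, the sketch below keeps every distinct non-punctuation context word. The pre-tokenized input and the tiny punctuation set are assumptions for the example, not the exact list used in our experiments.

```python
def candidate_answers(context_words, punctuation):
    """All distinct context words except punctuation, in order of appearance."""
    cands = []
    for w in context_words:
        if w not in punctuation and w not in cands:
            cands.append(w)
    return cands

context = '" He was a great craftsman , " said Heather .'.split()
print(candidate_answers(context, {'"', ",", "."}))
# every real word survives once; punctuation tokens are filtered out
```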
3.2 Training Data Construction
Each LAMBADA instance is divided into a context (4.6 sentences on average) and a target
sentence, and the last word of the target sentence is the target word to be predicted. The
1. This list of punctuation symbols is at https://raw.githubusercontent.com/ZeweiChu/lambada-dataset/master/stopwords/shortlist-stopwords.txt
LAMBADA dataset consists of development (dev) and test (test) sets; [10] also provide a
control dataset (control), an unfiltered sample of instances from the BookCorpus.
We construct a new training dataset from the BookCorpus. We restrict it to instances
that contain the target word in the context. This decision is natural given our use of neural
readers that assume the answer is contained in the passage. We also ensure that the context
has at least 50 words and contains 4 or 5 sentences and we require the target sentences to
have more than 10 words.
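The construction criteria above amount to a simple filter over candidate passages. The sketch below is a schematic version: it assumes whitespace tokenization and pre-split sentences, which simplifies the real pipeline.

```python
def keep_instance(context_sents, target_sent):
    """Return True if a (context, target sentence) pair satisfies the
    training-set constraints described above (sketch)."""
    context_words = [w for s in context_sents for w in s.split()]
    target_words = target_sent.split()
    return (target_words[-1] in context_words   # answer occurs in the context
            and len(context_words) >= 50        # context has at least 50 words
            and len(context_sents) in (4, 5)    # context has 4 or 5 sentences
            and len(target_words) > 10)         # target sentence > 10 words

ctx = ["the old sailor walked down to the harbor at dawn ."] * 5
tgt = "after a long while he finally saw it again near the old harbor"
print(keep_instance(ctx, tgt))  # True: all four constraints hold
```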
Our new dataset contains 1,827,123 instances in total. We divide it into two parts, a
training set (train) of 1,618,782 instances and a validation set (val) of 208,341 instances.
These datasets can be found at the authors’ websites.
CHAPTER 4
EXPERIMENTS
We use the Stanford Reader [1], our modified Stanford Reader (Eq. 2.4), the Attention Sum
(AS) Reader [7], and the Gated-Attention (GA) Reader [3]. We also add the simple features
from [14] to the AS and GA Readers. The features are concatenated to the word embeddings
in the context. They include: whether the word appears in the target sentence, the frequency
of the word in the context, the position of the word’s first occurrence in the context as a
percentage of the context length, and whether the text surrounding the word matches the
text surrounding the blank in the target sentence. For the last feature, we only consider
matching the left word since the blank is always the last word in the target sentence.
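The four features can be made concrete with a short sketch. The function below is our reading of the description, with toy tokenized inputs; it is illustrative, not the exact feature extractor of [14].

```python
from collections import Counter

def simple_features(context, target):
    """Per-context-word features; `target` is the tokenized target sentence
    with the final blank removed, so its last token is the blank's left word."""
    freq = Counter(context)
    first = {}
    for i, w in enumerate(context):
        first.setdefault(w, i)          # index of first occurrence
    in_target = set(target)
    left_of_blank = target[-1]          # blank is always last in the sentence
    feats = []
    for i, w in enumerate(context):
        left = context[i - 1] if i > 0 else None
        feats.append({
            "in_target_sentence": w in in_target,
            "frequency": freq[w],
            "first_position_pct": first[w] / len(context),
            "left_context_match": left == left_of_blank,
        })
    return feats

context = "the cat sat on the mat".split()
target = "where is the".split()          # target sentence minus the blank
f = simple_features(context, target)
print(f[4]["frequency"], f[5]["left_context_match"])  # 2 True
```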
All models are trained end to end without any warm start and without using pretrained
embeddings. We train each reader on train for a max of 10 epochs, stopping when accuracy
on dev decreases two epochs in a row. We take the model from the epoch with max dev
accuracy and evaluate it on test and control. val is not used.
We evaluate several other baseline systems inspired by those of [10], but we focus on
versions that restrict the choice of answers to non-stopwords in the context.1 We found this
strategy to consistently improve performance even though it limits the maximum achievable
accuracy.
We consider two n-gram language model baselines. We use the SRILM toolkit [13] to
estimate a 4-gram model with modified Kneser-Ney smoothing on the combination of train
and val. One uses a cache size of 100 and the other does not use a cache. We use each
model to score each non-stopword from the context. We also evaluate an LSTM language
model. We train it on train, where the loss is cross entropy summed over all positions in
each instance. The output vocabulary is the vocabulary of train, approximately 130k word
types. At test time, we again limit the search to non-stopwords in the context.
1. We use the stopword list from [12].
We also test simple baselines that choose particular non-stopwords from the context,
including a random one, the first in the context, the last in the context, and the most
frequent in the context.
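These context-restricted baselines are each a few lines; the sketch below bundles them, with a toy stopword set standing in for the real list.

```python
import random
from collections import Counter

def context_baselines(context, stopwords, seed=0):
    """Choose a non-stopword from the context four different ways."""
    cands = [w for w in context if w not in stopwords]
    return {
        "random": random.Random(seed).choice(cands),
        "first": cands[0],
        "last": cands[-1],
        "most_frequent": Counter(cands).most_common(1)[0][0],
    }

context = "the cat saw the dog and the dog saw the cat again".split()
preds = context_baselines(context, stopwords={"the", "and"})
print(preds["first"], preds["last"])  # cat again
```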
CHAPTER 5
RESULTS
Table 5.1 shows our results. We report accuracies on the entirety of test and control
(“all”), as well as separately on the part of test and control where the target word is in
the context (“context”). The first part of the table shows results from [10]. We then show
our baselines that choose a word from the context. Choosing the most frequent yields a
surprisingly high accuracy of 11.7%, which is better than all results from Paperno et al. [10].
Our language models perform comparably, with the n-gram + cache model doing best.
By forcing language models to select a word from the context, the accuracy on test is
much higher than that of the analogous models from Paperno et al. [10], though accuracy suffers on
control.
We then show results with the neural readers, showing that they give much higher accura-
cies on test than all other methods. The GA Reader with the simple additional features [14]
yields the highest accuracy, reaching 49.0%. We also measured the “top k” accuracy of this
model, where we give the model credit if the correct answer is among the top k ranked
answers. On test, we reach 65.4% top-2 accuracy and 72.8% top-3.
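Top-k accuracy here has the usual definition; a minimal sketch with hypothetical ranked outputs (not our model's actual rankings):

```python
def topk_accuracy(ranked, gold, k):
    """Fraction of instances whose correct answer is in the top-k ranking."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

ranked = [["cat", "dog", "mat"],    # model's answers, best first
          ["red", "blue", "green"],
          ["sun", "moon", "star"]]
gold = ["dog", "green", "sun"]
print(topk_accuracy(ranked, gold, 1))  # only "sun" is ranked first: 1/3
print(topk_accuracy(ranked, gold, 2))  # "dog" now counts too: 2/3
```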
Figure 5.1 shows the accuracies of various models on the entire LAMBADA test set.
The AS and GA Readers work much better than the Stanford Reader. One cause appears
to be that the Stanford Reader learns distinct embeddings for input and answer words,
as discussed above. Our modified Stanford Reader, which uses only a single set of word
embeddings, improves by 10.4% absolute. Since the AS and GA Readers merely score words
in the context, they do not learn separate answer word embeddings and therefore do not
suffer from this effect.
We suspect the remaining accuracy difference between the Stanford and the other readers
is due to the difference in the output function. The Stanford Reader was developed for the
CNN and Daily Mail datasets, in which correct answers are anonymized entity identifiers
Method                         | test            | control
                               | all   | context | all   | context
Baselines [10]
Random in context              | 1.6   | N/A     | 0     | N/A
Random cap. in context         | 7.3   | N/A     | 0     | N/A
n-gram                         | 0.1   | N/A     | 19.1  | N/A
n-gram + cache                 | 0.1   | N/A     | 19.1  | N/A
LSTM                           | 0     | N/A     | 21.9  | N/A
Memory network                 | 0     | N/A     | 8.5   | N/A
Our context-restricted non-stopword baselines
Random                         | 5.6   | 6.7     | 0.3   | 2.2
First                          | 3.8   | 4.6     | 0.1   | 1.1
Last                           | 6.2   | 7.5     | 0.9   | 6.5
Most frequent                  | 11.7  | 14.4    | 0.4   | 8.1
Our context-restricted language model baselines
n-gram                         | 10.7  | 13.1    | 2.2   | 15.6
n-gram + cache                 | 11.8  | 14.5    | 2.2   | 15.6
LSTM                           | 9.2   | 11.1    | 2.4   | 16.9
Our neural reader results
Stanford Reader                | 21.7  | 26.2    | 7.0   | 49.3
Modified Stanford Reader       | 32.1  | 38.8    | 7.4   | 52.3
AS Reader                      | 41.4  | 50.1    | 8.5   | 60.2
AS Reader + features           | 47.4  | 57.4    | 8.6   | 60.6
GA Reader                      | 44.5  | 53.9    | 8.8   | 62.5
GA Reader + features           | 49.0  | 59.4    | 9.3   | 65.6
Human                          | 86.0∗ | -       | 36.0† | -
Table 5.1: Accuracies on test and control datasets, computed over all instances (“all”) and separately on those in which the answer is in the context (“context”). The first section is from [10]. ∗Estimated from 100 randomly-sampled dev instances. †Estimated from 100 randomly-sampled control instances.
which are reused across instances. Since the identifier embeddings are observed so frequently
in the training data, they are frequently updated. In our setting, however, answers are
words from a large vocabulary, so many of the word embeddings of correct answers may be
undertrained. This could potentially be addressed by augmenting the word embeddings with
identifiers to obtain some of the modeling benefits of anonymization [14].
All context-restricted models yield poor accuracies on the entirety of control. This is
because only 14.1% of control instances have the target word in the context, which sets
an upper bound on the accuracy these models can achieve.
Figure 5.1: Accuracy
5.1 Manual Analysis
One annotator, a native English speaker, sampled 100 instances randomly from dev, hid
the final word, and attempted to guess it from the context and target sentence. The an-
notator was correct in 86 cases. For the subset that contained the answer in the context,
the annotator was correct in 79 of 87 cases. Even though two annotators were able to cor-
rectly answer all LAMBADA instances during dataset construction [10], our results give an
estimate of how often a third would agree. The annotator did the same on 100 instances
randomly sampled from control, guessing correctly in 36 cases. These results are reported
in Table 5.1. The annotator was correct on 6 of the 12 control instances in which the
answer was contained in the context.
We analyzed the 100 LAMBADA dev instances, tagging each with labels indicating the
minimal kinds of understanding needed to answer it correctly.1 Each instance can have
1. The annotations are available from the authors’ websites.
label                     | #   | GA Reader + features | human
single name cue           | 9   | 89%                  | 100%
simple speaker tracking   | 19  | 84%                  | 100%
basic reference           | 18  | 56%                  | 72%
discourse inference rule  | 16  | 50%                  | 88%
semantic trigger          | 20  | 40%                  | 80%
coreference               | 21  | 38%                  | 90%
external knowledge        | 24  | 21%                  | 88%
all                       | 100 | 55%                  | 86%
Table 5.2: Labels derived from manual analysis of 100 LAMBADA dev instances. An instance can be tagged with multiple labels, hence the sum of instances across labels exceeds 100.
multiple labels. We briefly describe each label below:
• single name cue: the answer is clearly a name according to contextual cues and only a
single name is mentioned in the context.
• simple speaker tracking: instance can be answered merely by tracking who is speaking
without understanding what they are saying.
• basic reference: answer is a reference to something mentioned in the context; simple
understanding/context matching suffices.
• discourse inference rule: answer can be found by applying a single discourse inference rule,
such as the rule: “X left Y and went in search of Z” → Y ≠ Z.
• semantic trigger: amorphous semantic information is needed to choose the answer, typ-
ically related to event sequences or dialogue turns, e.g., a customer says “Where is the
X?” and a supplier responds “We got plenty of X”.
• coreference: instance requires non-trivial coreference resolution to solve correctly, typically
the resolution of anaphoric pronouns.
• external knowledge: some particular external knowledge is needed to choose the answer.
Table 5.2 shows the breakdown of these labels across instances, as well as the accuracy on
each label of the GA Reader with features.
The GA Reader performs well on instances involving shallower, more surface-level cues.
In 9 cases, the answer is clearly a name based on contextual cues in the target sentence and
there is only one name in the context; the reader answers all but one correctly. When only
simple speaker tracking is needed (19 cases), the reader gets 84% correct.
The hardest instances are those that involve deeper understanding, like semantic links,
coreference resolution, and external knowledge. While external knowledge is difficult to
define, we chose this label when we were able to explicitly write down the knowledge that
one would use when answering the instances, e.g., one instance requires knowing that “when
something explodes, noise emanates from it”. These instances make up nearly a quarter of
those we analyzed, making LAMBADA a good task for work in leveraging external knowledge
for language understanding.
5.2 Discussion
On control, while our readers outperform our other baselines, they are outperformed by
the language modeling baselines from Paperno et al. This suggests that though we have
improved the state of the art on LAMBADA by more than 40% absolute, we have not solved
the general language modeling problem; there is no single model that performs well on both
test and control. Our 36% estimate of human performance on control shows the
difficulty of the general problem, and reveals a gap of 14% between the best language model
and human accuracy.
A natural question to ask is whether applying neural readers is a good direction for
this task, since they fail on the 17% of instances which do not have the target word in the
context. Furthermore, this subset of LAMBADA in which the answer is not in the context
may in fact display the most interesting and challenging phenomena. Some neural readers,
like the Stanford Reader, can be easily used to predict target words that do not appear
in the context, and the other readers can be modified to do so. Doing this will require a
different selection of training data than that used above. However, we do wish to note that,
in addition to the relative rarity of these instances in LAMBADA, we found them to be
challenging for our annotator (who was correct on only 7 of the 13 in this subset).
We note that train has similar characteristics to the part of control that contains the
answer in the context (the final column of Table 5.1). We find that the ranking of systems
according to this column is similar to that in the test column. This suggests that our
simple method of dataset creation could be used to create additional training or evaluation
sets for challenging language modeling problems like LAMBADA, perhaps by combining it
with baseline suppression [9]. Our experiments solving the LAMBADA word prediction
task with neural readers also suggest that language modeling tasks can be aided by reading
comprehension models.
CHAPTER 6
CONCLUSION
We constructed a new training set for LAMBADA and used it to train neural readers to
improve the state of the art from 7.3% to 49%. We also provided results with several other
strong baselines and included a manual evaluation in an attempt to better understand the
phenomena tested by the task. Our hope is that other researchers will seek models and
training regimes that simultaneously perform well on both LAMBADA and control, with
the goal of solving the general problem of language modeling.
REFERENCES
[1] Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proc. of ACL, 2016.
[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, 2014.
[3] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. arXiv preprint, 2016.
[4] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proc. of NIPS, 2015.
[5] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children's books with explicit memory representations. In Proc. of ICLR, 2016.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[7] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. In Proc. of ACL, 2016.
[8] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint, 2017.
[9] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who did What: A large-scale person-centered cloze dataset. In Proc. of EMNLP, 2016.
[10] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. of ACL, 2016.
[11] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, 2016.
[12] Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proc. of EMNLP, 2013.
[13] Andreas Stolcke. SRILM, an extensible language modeling toolkit. In Proc. of Interspeech, 2002.
[14] Hai Wang, Takeshi Onishi, Kevin Gimpel, and David McAllester. Emergent logical structure in vector representations of neural readers. arXiv preprint, 2016.
[15] Yi Yang, Scott Wen-tau Yih, and Chris Meek. WikiQA: A challenge dataset for open-domain question answering. In Proc. of EMNLP, 2015.
[16] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV, 2015.
CHAPTER 7
APPENDIX
We show heat maps of LAMBADA instances solved by the Gated-Attention Reader. Words
in darker red are given higher probability by the model. The first candidate word of each
instance is the correct target word. We can see that in most cases, the Gated-Attention
Reader concentrates its attention on a few candidate words.
Figure 7.1: instance 1
Figure 7.2: instance 2
Figure 7.3: instance 3
Figure 7.4: instance 4
Figure 7.5: instance 5
Figure 7.6: instance 6
Figure 7.7: instance 7
Figure 7.8: instance 8
Figure 7.9: instance 9
Figure 7.10: instance 10