THE UNIVERSITY OF CHICAGO
BROAD CONTEXT LANGUAGE MODELING WITH READING COMPREHENSION
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
BY
ZEWEI CHU
CHICAGO, ILLINOIS
JUNE, 2017
Copyright © 2017 by Zewei Chu
All Rights Reserved
I dedicate my dissertation to my grandparents Chengrong Li and Liuzhen Xu.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
1 INTRODUCTION
2 BACKGROUND
2.1 Machine Reading Comprehension
2.2 Neural Readers
2.3 The LAMBADA dataset
3 METHODS
3.1 LAMBADA as Reading Comprehension
3.2 Training Data Construction
4 EXPERIMENTS
5 RESULTS
5.1 Manual Analysis
5.2 Discussion
6 CONCLUSION
REFERENCES
7 APPENDIX
LIST OF FIGURES
2.1 Attention Sum Reader [7]
2.2 Gated-attention Reader [3]
2.3 Stanford Reader [1]
5.1 Accuracy
7.1 instance 1
7.2 instance 2
7.3 instance 3
7.4 instance 4
7.5 instance 5
7.6 instance 6
7.7 instance 7
7.8 instance 8
7.9 instance 9
7.10 instance 10
LIST OF TABLES
2.1 Example instances from LAMBADA.
5.1 Accuracies on test and control datasets, computed over all instances (“all”) and separately on those in which the answer is in the context (“context”). The first section is from [10]. ∗Estimated from 100 randomly-sampled dev instances. †Estimated from 100 randomly-sampled control instances.
5.2 Labels derived from manual analysis of 100 LAMBADA dev instances. An instance can be tagged with multiple labels, hence the sum of instances across labels exceeds 100.
ACKNOWLEDGMENTS
I want to thank Hai Wang, my thesis advisor Kevin Gimpel, and Prof. David McAllester
for their collaboration on the project of applying reading comprehension models to the
LAMBADA dataset. Hai Wang contributed his code for various neural reader models and
GPUs for our experiments. My advisor Kevin Gimpel provided insightful guidance on the
project and performed the manual analysis of 100 LAMBADA instances. Prof. David
McAllester provided feedback and suggestions for the project. Without their generous
contributions, this project and this thesis could not have been accomplished.
We thank Denis Paperno for answering our questions about the LAMBADA dataset and
we thank NVIDIA Corporation for donating GPUs used in this research.
ABSTRACT
Progress in text understanding has been driven by large datasets that test particular capa-
bilities, like recent datasets for reading comprehension [4]. We focus here on the LAMBADA
dataset [10], a word prediction task requiring broader context than the immediate sentence.
We view LAMBADA as a reading comprehension problem and apply comprehension models
based on neural networks. Though these models are constrained to choose a word from the
context, they improve the state of the art on LAMBADA from 7.3% to 49%. We analyze 100
instances, finding that neural network readers perform well in cases that involve selecting a
name from the context based on dialogue or discourse cues but struggle when coreference
resolution or external knowledge is needed.
CHAPTER 1
INTRODUCTION
The recently created LAMBADA dataset [10] is a challenging word prediction task for all
state-of-the-art language models. The dataset was created to encourage the development of
new language models that capture broader context.
Paperno et al. [10] provide baseline results with popular language models and neural
network architectures; none achieves accuracy above 1%. Their best accuracy, 7.3%, is
obtained by randomly choosing a capitalized word from the passage.
Our approach is based on the observation that in 83% of instances the answer appears
in the context. We exploit this in two ways. First, we automatically construct a large
training set of 1.8 million instances by simply selecting passages where the answer occurs in
the context. Second, we treat the problem as a reading comprehension task similar to the
CNN/Daily Mail datasets introduced by [4], the Children’s Book Test (CBT) of [5], and the
Who-did-What dataset of [9]. We show that standard models for reading comprehension,
trained on our automatically generated training set, improve the state of the art on the
LAMBADA test set from 7.3% to 49.0%. This is in spite of the fact that these models fail
on the 17% of instances in which the answer is not in the context.
We also perform a manual analysis of the LAMBADA task, provide an estimate of human
performance, and categorize the instances in terms of the phenomena they test. We find that
the comprehension models perform best on instances that require selecting a name from the
context based on dialogue or discourse cues, but struggle when required to do coreference
resolution or when external knowledge could help in choosing the answer.
Our contributions are:
• We review the recent development of reading comprehension models, recently created
reading comprehension datasets, and the LAMBADA dataset.
• We created a new training set from the original LAMBADA dataset.
• We provide stronger baseline results on the LAMBADA dataset.
• We show that language modeling tasks can be assisted by reading comprehension models.
• Our manual analysis provides insights on the design of future models to further improve
the results on LAMBADA.
CHAPTER 2
BACKGROUND
This chapter briefly reviews machine reading comprehension tasks, neural network based
reading comprehension models and the LAMBADA dataset.
2.1 Machine Reading Comprehension
Researchers have created various reading comprehension datasets to test machines' ability
to understand text. We introduce some recently created high-quality datasets.
MCTest [12]: each instance of MCTest comes with a story and four multiple-choice
questions about the story. The dataset is of high quality, but its limitation is that it
provides only 640 instances, which is not sufficient for training a good model for solving
reading comprehension tasks.
DeepMind's CNN/Daily Mail dataset [4] is built from news articles from CNN and the
Daily Mail. It is much bigger, consisting of approximately 187k documents and 1,259k
questions. This dataset boosted research on designing attention-based neural reading
comprehension models, as introduced in Section 2.2.
Some other high-quality reading comprehension datasets, such as SQuAD [11], WikiQA [15],
and RACE [8], focus on various aspects of language. We do not discuss them in detail, for
brevity.
2.2 Neural Readers
Various kinds of recurrent neural network models have been invented to solve cloze-style
question answering tasks with state-of-the-art accuracies. DeepMind [4] provided a large-scale
supervised reading comprehension dataset (CNNDM: the CNN/Daily Mail dataset) and
Figure 2.1: Attention Sum Reader [7]
Figure 2.2: Gated-attention Reader [3]
boosted research on designing attention-based neural reading comprehension models, such
as the Attention Sum Reader [7], the Stanford Reader [1], and the Gated-Attention Reader [3].
We refer to these models as “neural readers”. The architectures of the aforementioned
three neural readers are shown in Figures 2.1, 2.2, and 2.3.
These neural readers use attention based on the question and passage to choose an answer
from among the words in the passage. We use d for the context word sequence, q for the
question (with a blank to be filled), A for the candidate answer list, and V for the vocabulary.
We describe neural readers in terms of three components:
Figure 2.3: Stanford Reader [1]
1. Embedding and Encoding: Each word in d and q is mapped into a v-dimensional
vector via the embedding function e(w) ∈ Rv, for all w ∈ d ∪ q.1 The same embedding
function is used for both d and q. The embeddings are learned from random initializa-
tion; no pretrained word embeddings are used. The embedded context is processed by a
bidirectional recurrent neural network (RNN) which computes hidden vectors hi for each
position i in each sequence:
\overrightarrow{h} = f_{RNN}(\overrightarrow{\theta}_d, e(d))
\overleftarrow{h} = b_{RNN}(\overleftarrow{\theta}_d, e(d))
h = \langle \overrightarrow{h}, \overleftarrow{h} \rangle
where \overrightarrow{\theta}_d and \overleftarrow{\theta}_d are RNN parameters, and each of f_{RNN} and b_{RNN} returns a sequence
of hidden vectors, one for each position in the input e(d). The question is encoded into a
1. We overload the e function to operate on sequences and denote the embeddings of d and q as matrices e(d) and e(q).
single vector g which is the concatenation of the final vectors of two RNNs:
\overrightarrow{g} = f_{RNN}(\overrightarrow{\theta}_q, e(q))
\overleftarrow{g} = b_{RNN}(\overleftarrow{\theta}_q, e(q))
g = \langle \overrightarrow{g}_{|q|}, \overleftarrow{g}_0 \rangle
The RNNs use either gated recurrent units [2] or long short-term memory [6].
The Gated-Attention Reader uses a different bidirectional RNN encoder at each layer.
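The encoding step above can be sketched in a few lines of numpy. This is a minimal illustration rather than the thesis implementation: it substitutes a plain tanh RNN for the GRU/LSTM cells, and all names and dimensions are toy assumptions.

```python
import numpy as np

def rnn(params, xs):
    """Minimal tanh RNN standing in for a GRU/LSTM; returns one hidden
    vector per input position."""
    W_x, W_h, b = params
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bi_encode(fwd_params, bwd_params, emb):
    """h_i is the concatenation of forward and backward states at position i."""
    h_fwd = rnn(fwd_params, emb)              # f_RNN over e(d)
    h_bwd = rnn(bwd_params, emb[::-1])[::-1]  # b_RNN, re-aligned to positions
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
v, k, n = 4, 3, 5  # embedding dim, hidden dim, context length (toy sizes)
params = lambda: (rng.normal(size=(k, v)), rng.normal(size=(k, k)), np.zeros(k))
h = bi_encode(params(), params(), rng.normal(size=(n, v)))  # e(d) is random here
print(h.shape)  # one 2k-dimensional vector per context position
```

The question encoding g would reuse the same machinery, concatenating the final forward state with the first backward state.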
2. Attention: The readers then compute attention weights over the positions of h using g. In
general, we define \alpha_i = softmax(att(h_i, g)), where i ranges over positions in h. The att
function is an inner product in the Attention Sum Reader [7] and a bilinear product in
the Stanford Reader [1].
The computed attentions are then passed through a softmax function to form a probability
distribution.
The Gated-Attention Reader uses a richer attention architecture [3]. Each attention layer
has its own question embedding function, and the next layer's document representation is
computed by the GA function below, where superscripts indicate the layer:
d^{(k)} = GA(h^{(k)}, q^{(k)})
The GA function at each layer is defined as follows, with superscripts omitted:
\alpha_i = softmax(e(q)^\top h_i)   (2.1)
\tilde{q}_i = e(q)\,\alpha_i   (2.2)
d_i = h_i \odot \tilde{q}_i   (2.3)
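The attention variants above can be written out concretely. This numpy sketch is illustrative rather than the thesis implementation: `inner_att` mirrors the Attention Sum Reader's inner product, `bilinear_att` the Stanford Reader's bilinear product, and `ga_layer` one Gated-Attention layer in the style of Eqs. (2.1) to (2.3); the toy dimensions assume the question token embeddings already match the size of the context vectors.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def inner_att(h, g):
    """Attention Sum Reader: att(h_i, g) = h_i . g, softmaxed over positions."""
    return softmax(h @ g)

def bilinear_att(h, g, W):
    """Stanford Reader: att(h_i, g) = h_i^T W g."""
    return softmax(h @ (W @ g))

def ga_layer(h, e_q):
    """One Gated-Attention layer: alpha_i = softmax(e(q)^T h_i),
    q_tilde_i = e(q) alpha_i, d_i = h_i elementwise-times q_tilde_i."""
    out = np.empty_like(h)
    for i, h_i in enumerate(h):
        alpha = softmax(e_q.T @ h_i)  # distribution over question tokens
        out[i] = h_i * (e_q @ alpha)  # gate h_i by a token-weighted summary
    return out

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 4))      # context vectors, one per position
g = rng.normal(size=4)           # question vector
e_q = rng.normal(size=(4, 3))    # question token embeddings (toy)
alpha = inner_att(h, g)
print(alpha.sum())               # attention weights form a distribution
print(ga_layer(h, e_q).shape)    # same shape as h, ready for the next layer
```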
3. Output and Prediction: To output a prediction a∗, the Stanford Reader [1] computes
the attention-weighted sum of the context vectors and then an inner product with each
candidate answer:
c = \sum_{i=1}^{|d|} \alpha_i h_i
a^* = argmax_{a \in A} o(a)^\top c
where o(a) is the “output” embedding function. As the Stanford Reader was developed
for the anonymized CNN/Daily Mail tasks, only a few entries in the output embedding
function needed to be well-trained in their experiments. However, for LAMBADA, the
set of possible answers can range over the entirety of V , making the output embedding
function difficult to train. Therefore we also experiment with a modified version of the
Stanford Reader that uses the same embedding function e for both input and output
words:
a^* = argmax_{a \in A} e(a)^\top W c   (2.4)
where W is an additional parameter matrix used to match dimensions and model any
additional needed transformation.
For the Attention Sum and Gated-Attention Readers the answer is computed by:
P(a \mid d, q) = \sum_{i \in I(a,d)} \alpha_i \quad \forall a \in A
a^* = argmax_{a \in A} P(a \mid d, q)
where I(a,d) is the set of positions where a appears in context d.
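The two output functions can be made concrete. The numpy sketch below is illustrative only: `stanford_predict` follows the shared-embedding scoring of Eq. 2.4 and `attention_sum_predict` sums attention mass over each candidate's occurrences; the tiny context, attention weights, and embeddings are all invented for the example.

```python
import numpy as np
from collections import defaultdict

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def stanford_predict(h, alpha, W, emb, candidates):
    """Modified Stanford Reader output (Eq. 2.4): score each candidate a
    by e(a)^T W c, where c is the attention-weighted context vector."""
    c = alpha @ h                            # c = sum_i alpha_i h_i
    return max(candidates, key=lambda a: emb[a] @ (W @ c))

def attention_sum_predict(words, alpha, candidates):
    """AS/GA Reader output: P(a) = sum of alpha_i over positions where a occurs."""
    p = defaultdict(float)
    for w, a_i in zip(words, alpha):
        p[w] += a_i
    return max(candidates, key=lambda a: p[a])

words = ["he", "was", "a", "great", "craftsman", "said", "heather"]
alpha = softmax(np.array([0.2, 0.1, 0.1, 0.5, 2.0, 0.3, 1.5]))  # toy attention
rng = np.random.default_rng(2)
h = rng.normal(size=(len(words), 4))
emb = {w: rng.normal(size=4) for w in set(words)}
cands = ["craftsman", "heather"]
print(attention_sum_predict(words, alpha, cands))  # "craftsman": largest mass
print(stanford_predict(h, alpha, rng.normal(size=(4, 4)), emb, cands) in cands)
```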
2.3 The LAMBADA dataset
The LAMBADA dataset [10] was designed by identifying word prediction tasks that require
broad context. Each instance is drawn from the BookCorpus [16] and consists of a passage
of several sentences where the task is to predict the last word of the last sentence.
The creation procedure of the LAMBADA dataset was designed to guarantee that two
human subjects could correctly predict the target word given the context, while at least
ten human subjects could not correctly guess the target word when given only the target
sentence without the context. This filtering procedure finds cases that are guessable by
humans when given the larger context but not when given only the last sentence.
The expense of this manual filtering has limited the dataset to only about 10,000 instances
which are viewed as development and test data. The training data is taken to be books in
the corpus other than those from which the evaluation passages were extracted.
Table 2.1 shows two LAMBADA instances. Each instance consists of a context document,
a target sentence without the target word, and a target word that needs to be predicted.
None of several state-of-the-art language models reaches an accuracy above 1% on the
LAMBADA dataset, as reported by Paperno et al. [10]. They thus present the LAMBADA
dataset as a challenging test set, meant to encourage models that achieve genuine
understanding of broad context in natural language text.
Context: “Why?” “I would have thought you'd find him rather dry,” she said. “I don't know about that,” said Gabriel. “He was a great craftsman,” said Heather. “That he was,” said Flannery.
Target Sentence: “And Polish, to boot,” said
Target Word: Gabriel
Context: Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree's rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold.
Target Sentence: It almost made up for the lack of
Target Word: coffee
Table 2.1: Example instances from LAMBADA.
CHAPTER 3
METHODS
In this chapter we describe how we model the LAMBADA dataset as a reading comprehension
task, and the procedures we take to create a new training set from the original LAMBADA
training set.
3.1 LAMBADA as Reading Comprehension
As reported in [10], n-gram and LSTM language models achieve accuracies of less than 1%
on the LAMBADA dataset. [10] claims that solving LAMBADA requires broader context
information than traditional language models can provide. This inspired us to model
LAMBADA as a reading comprehension task and to apply neural reading comprehension
models to it.
For each instance, we feed the context sentences to the neural readers as the context
document, and the target sentence, with its final word blanked out, as the question. Some
neural readers require a list of candidate target words to choose from, and the tricky part
is which candidate list to feed them. In our experiments, we list all words in the context
as candidate answers, except for punctuation.1 This choice, though it sacrifices the
instances that do not contain the target word in the context, is reasonable, as over 80% of
LAMBADA development instances have the target word in the context.
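As a concrete illustration of the candidate construction, the sketch below keeps every distinct non-punctuation context word. The pre-tokenized input and the tiny punctuation set are assumptions for the example, not the exact list used in our experiments.

```python
def candidate_answers(context_words, punctuation):
    """All distinct context words except punctuation, in order of appearance."""
    cands = []
    for w in context_words:
        if w not in punctuation and w not in cands:
            cands.append(w)
    return cands

context = '" He was a great craftsman , " said Heather .'.split()
print(candidate_answers(context, {'"', ",", "."}))
# every real word survives once; punctuation tokens are filtered out
```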
3.2 Training Data Construction
Each LAMBADA instance is divided into a context (4.6 sentences on average) and a target
sentence, and the last word of the target sentence is the target word to be predicted. The
1. This list of punctuation symbols is at https://raw.githubusercontent.com/ZeweiChu/lambada-dataset/master/stopwords/shortlist-stopwords.txt
LAMBADA dataset consists of development (dev) and test (test) sets; [10] also provide a
control dataset (control), an unfiltered sample of instances from the BookCorpus.
We construct a new training dataset from the BookCorpus. We restrict it to instances
that contain the target word in the context. This decision is natural given our use of neural
readers that assume the answer is contained in the passage. We also ensure that the context
has at least 50 words and contains 4 or 5 sentences and we require the target sentences to
have more than 10 words.
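The construction criteria above amount to a simple filter over candidate passages. The sketch below is a schematic version: it assumes whitespace tokenization and pre-split sentences, which simplifies the real pipeline.

```python
def keep_instance(context_sents, target_sent):
    """Return True if a (context, target sentence) pair satisfies the
    training-set constraints described above (sketch)."""
    context_words = [w for s in context_sents for w in s.split()]
    target_words = target_sent.split()
    return (target_words[-1] in context_words   # answer occurs in the context
            and len(context_words) >= 50        # context has at least 50 words
            and len(context_sents) in (4, 5)    # context has 4 or 5 sentences
            and len(target_words) > 10)         # target sentence > 10 words

ctx = ["the old sailor walked down to the harbor at dawn ."] * 5
tgt = "after a long while he finally saw it again near the old harbor"
print(keep_instance(ctx, tgt))  # True: all four constraints hold
```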
Our new dataset contains 1,827,123 instances in total. We divide it into two parts, a
training set (train) of 1,618,782 instances and a validation set (val) of 208,341 instances.
These datasets can be found at the authors’ websites.
CHAPTER 4
EXPERIMENTS
We use the Stanford Reader [1], our modified Stanford Reader (Eq. 2.4), the Attention Sum
(AS) Reader [7], and the Gated-Attention (GA) Reader [3]. We also add the simple features
from [14] to the AS and GA Readers. The features are concatenated to the word embeddings
in the context. They include: whether the word appears in the target sentence, the frequency
of the word in the context, the position of the word’s first occurrence in the context as a
percentage of the context length, and whether the text surrounding the word matches the
text surrounding the blank in the target sentence. For the last feature, we only consider
matching the left word since the blank is always the last word in the target sentence.
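The four features can be made concrete with a short sketch. The function below is our reading of the description, with toy tokenized inputs; it is illustrative, not the exact feature extractor of [14].

```python
from collections import Counter

def simple_features(context, target):
    """Per-context-word features; `target` is the tokenized target sentence
    with the final blank removed, so its last token is the blank's left word."""
    freq = Counter(context)
    first = {}
    for i, w in enumerate(context):
        first.setdefault(w, i)          # index of first occurrence
    in_target = set(target)
    left_of_blank = target[-1]          # blank is always last in the sentence
    feats = []
    for i, w in enumerate(context):
        left = context[i - 1] if i > 0 else None
        feats.append({
            "in_target_sentence": w in in_target,
            "frequency": freq[w],
            "first_position_pct": first[w] / len(context),
            "left_context_match": left == left_of_blank,
        })
    return feats

context = "the cat sat on the mat".split()
target = "where is the".split()          # target sentence minus the blank
f = simple_features(context, target)
print(f[4]["frequency"], f[5]["left_context_match"])  # 2 True
```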
All models are trained end to end without any warm start and without using pretrained
embeddings. We train each reader on train for a max of 10 epochs, stopping when accuracy
on dev decreases two epochs in a row. We take the model from the epoch with max dev
accuracy and evaluate it on test and control. val is not used.
We evaluate several other baseline systems inspired by those of [10], but we focus on
versions that restrict the choice of answers to non-stopwords in the context.1 We found this
strategy to consistently improve performance even though it limits the maximum achievable
accuracy.
We consider two n-gram language model baselines. We use the SRILM toolkit [13] to
estimate a 4-gram model with modified Kneser-Ney smoothing on the combination of train
and val. One uses a cache size of 100 and the other does not use a cache. We use each
model to score each non-stopword from the context. We also evaluate an LSTM language
model. We train it on train, where the loss is cross entropy summed over all positions in
each instance. The output vocabulary is the vocabulary of train, approximately 130k word
types. At test time, we again limit the search to non-stopwords in the context.
1. We use the stopword list from [12].
We also test simple baselines that choose particular non-stopwords from the context,
including a random one, the first in the context, the last in the context, and the most
frequent in the context.
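These context-restricted baselines are each a few lines; the sketch below bundles them, with a toy stopword set standing in for the real list.

```python
import random
from collections import Counter

def context_baselines(context, stopwords, seed=0):
    """Choose a non-stopword from the context four different ways."""
    cands = [w for w in context if w not in stopwords]
    return {
        "random": random.Random(seed).choice(cands),
        "first": cands[0],
        "last": cands[-1],
        "most_frequent": Counter(cands).most_common(1)[0][0],
    }

context = "the cat saw the dog and the dog saw the cat again".split()
preds = context_baselines(context, stopwords={"the", "and"})
print(preds["first"], preds["last"])  # cat again
```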
CHAPTER 5
RESULTS
Table 5.1 shows our results. We report accuracies on the entirety of test and control
(“all”), as well as separately on the part of test and control where the target word is in
the context (“context”). The first part of the table shows results from [10]. We then show
our baselines that choose a word from the context. Choosing the most frequent yields a
surprisingly high accuracy of 11.7%, which is better than all results from Paperno et al. [10].
Our language models perform comparably, with the n-gram + cache model doing best.
By forcing language models to select a word from the context, the accuracy on test is
much higher than that of the analogous models from Paperno et al. [10], though accuracy suffers on
control.
We then show results with the neural readers, showing that they give much higher accura-
cies on test than all other methods. The GA Reader with the simple additional features [14]
yields the highest accuracy, reaching 49.0%. We also measured the “top k” accuracy of this
model, where we give the model credit if the correct answer is among the top k ranked
answers. On test, we reach 65.4% top-2 accuracy and 72.8% top-3.
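Top-k accuracy here has the usual definition; a minimal sketch with hypothetical ranked outputs (not our model's actual rankings):

```python
def topk_accuracy(ranked, gold, k):
    """Fraction of instances whose correct answer is in the top-k ranking."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

ranked = [["cat", "dog", "mat"],    # model's answers, best first
          ["red", "blue", "green"],
          ["sun", "moon", "star"]]
gold = ["dog", "green", "sun"]
print(topk_accuracy(ranked, gold, 1))  # only "sun" is ranked first: 1/3
print(topk_accuracy(ranked, gold, 2))  # "dog" now counts too: 2/3
```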
Figure 5.1 shows the accuracies of various models on the entire LAMBADA test set.
The AS and GA Readers work much better than the Stanford Reader. One cause appears
to be that the Stanford Reader learns distinct embeddings for input and answer words,
as discussed above. Our modified Stanford Reader, which uses only a single set of word
embeddings, improves by 10.4% absolute. Since the AS and GA Readers merely score words
in the context, they do not learn separate answer word embeddings and therefore do not
suffer from this effect.
We suspect the remaining accuracy difference between the Stanford and the other readers
is due to the difference in the output function. The Stanford Reader was developed for the
CNN and Daily Mail datasets, in which correct answers are anonymized entity identifiers
Method                         | test            | control
                               | all   | context | all   | context
Baselines [10]
Random in context              | 1.6   | N/A     | 0     | N/A
Random cap. in context         | 7.3   | N/A     | 0     | N/A
n-gram                         | 0.1   | N/A     | 19.1  | N/A
n-gram + cache                 | 0.1   | N/A     | 19.1  | N/A
LSTM                           | 0     | N/A     | 21.9  | N/A
Memory network                 | 0     | N/A     | 8.5   | N/A
Our context-restricted non-stopword baselines
Random                         | 5.6   | 6.7     | 0.3   | 2.2
First                          | 3.8   | 4.6     | 0.1   | 1.1
Last                           | 6.2   | 7.5     | 0.9   | 6.5
Most frequent                  | 11.7  | 14.4    | 0.4   | 8.1
Our context-restricted language model baselines
n-gram                         | 10.7  | 13.1    | 2.2   | 15.6
n-gram + cache                 | 11.8  | 14.5    | 2.2   | 15.6
LSTM                           | 9.2   | 11.1    | 2.4   | 16.9
Our neural reader results
Stanford Reader                | 21.7  | 26.2    | 7.0   | 49.3
Modified Stanford Reader       | 32.1  | 38.8    | 7.4   | 52.3
AS Reader                      | 41.4  | 50.1    | 8.5   | 60.2
AS Reader + features           | 47.4  | 57.4    | 8.6   | 60.6
GA Reader                      | 44.5  | 53.9    | 8.8   | 62.5
GA Reader + features           | 49.0  | 59.4    | 9.3   | 65.6
Human                          | 86.0∗ | -       | 36.0† | -
Table 5.1: Accuracies on test and control datasets, computed over all instances (“all”) and separately on those in which the answer is in the context (“context”). The first section is from [10]. ∗Estimated from 100 randomly-sampled dev instances. †Estimated from 100 randomly-sampled control instances.
which are reused across instances. Since the identifier embeddings are observed so frequently
in the training data, they are frequently updated. In our setting, however, answers are
words from a large vocabulary, so many of the word embeddings of correct answers may be
undertrained. This could potentially be addressed by augmenting the word embeddings with
identifiers to obtain some of the modeling benefits of anonymization [14].
All context-restricted models yield poor accuracies on the entirety of control. This is
because only 14.1% of control instances have the target word in the context, which sets
an upper bound on the accuracy these models can achieve.
Figure 5.1: Accuracy
5.1 Manual Analysis
One annotator, a native English speaker, sampled 100 instances randomly from dev, hid
the final word, and attempted to guess it from the context and target sentence. The an-
notator was correct in 86 cases. For the subset that contained the answer in the context,
the annotator was correct in 79 of 87 cases. Even though two annotators were able to cor-
rectly answer all LAMBADA instances during dataset construction [10], our results give an
estimate of how often a third would agree. The annotator did the same on 100 instances
randomly sampled from control, guessing correctly in 36 cases. These results are reported
in Table 5.1. The annotator was correct on 6 of the 12 control instances in which the
answer was contained in the context.
We analyzed the 100 LAMBADA dev instances, tagging each with labels indicating the
minimal kinds of understanding needed to answer it correctly.1 Each instance can have
1. The annotations are available from the authors’ websites.
label                     | #   | GA Reader + features | human
single name cue           | 9   | 89%                  | 100%
simple speaker tracking   | 19  | 84%                  | 100%
basic reference           | 18  | 56%                  | 72%
discourse inference rule  | 16  | 50%                  | 88%
semantic trigger          | 20  | 40%                  | 80%
coreference               | 21  | 38%                  | 90%
external knowledge        | 24  | 21%                  | 88%
all                       | 100 | 55%                  | 86%
Table 5.2: Labels derived from manual analysis of 100 LAMBADA dev instances. An instance can be tagged with multiple labels, hence the sum of instances across labels exceeds 100.
multiple labels. We briefly describe each label below:
• single name cue: the answer is clearly a name according to contextual cues and only a
single name is mentioned in the context.
• simple speaker tracking: instance can be answered merely by tracking who is speaking
without understanding what they are saying.
• basic reference: answer is a reference to something mentioned in the context; simple
understanding/context matching suffices.
• discourse inference rule: answer can be found by applying a single discourse inference rule,
such as the rule: “X left Y and went in search of Z” → Y ≠ Z.
• semantic trigger: amorphous semantic information is needed to choose the answer, typ-
ically related to event sequences or dialogue turns, e.g., a customer says “Where is the
X?” and a supplier responds “We got plenty of X”.
• coreference: instance requires non-trivial coreference resolution to solve correctly, typically
the resolution of anaphoric pronouns.
• external knowledge: some particular external knowledge is needed to choose the answer.
Table 5.2 shows the breakdown of these labels across instances, as well as the accuracy on
each label of the GA Reader with features.
The GA Reader performs well on instances involving shallower, more surface-level cues.
In 9 cases, the answer is clearly a name based on contextual cues in the target sentence and
there is only one name in the context; the reader answers all but one correctly. When only
simple speaker tracking is needed (19 cases), the reader gets 84% correct.
The hardest instances are those that involve deeper understanding, like semantic links,
coreference resolution, and external knowledge. While external knowledge is difficult to
define, we chose this label when we were able to explicitly write down the knowledge that
one would use when answering the instances, e.g., one instance requires knowing that “when
something explodes, noise emanates from it”. These instances make up nearly a quarter of
those we analyzed, making LAMBADA a good task for work in leveraging external knowledge
for language understanding.
5.2 Discussion
On control, while our readers outperform our other baselines, they are outperformed by
the language modeling baselines from Paperno et al. This suggests that though we have
improved the state of the art on LAMBADA by more than 40% absolute, we have not solved
the general language modeling problem; there is no single model that performs well on both
test and control. Our 36% estimate of human performance on control shows the
difficulty of the general problem, and reveals a gap of 14% between the best language model
and human accuracy.
A natural question to ask is whether applying neural readers is a good direction for
this task, since they fail on the 17% of instances which do not have the target word in the
context. Furthermore, this subset of LAMBADA in which the answer is not in the context
may in fact display the most interesting and challenging phenomena. Some neural readers,
like the Stanford Reader, can be easily used to predict target words that do not appear
in the context, and the other readers can be modified to do so. Doing this will require a
different selection of training data than that used above. However, we do wish to note that,
in addition to the relative rarity of these instances in LAMBADA, we found them to be
challenging for our annotator (who was correct on only 7 of the 13 in this subset).
We note that train has similar characteristics to the part of control that contains the
answer in the context (the final column of Table 5.1). We find that the ranking of systems
according to this column is similar to that in the test column. This suggests that our
simple method of dataset creation could be used to create additional training or evaluation
sets for challenging language modeling problems like LAMBADA, perhaps by combining it
with baseline suppression [9]. Our experiments solving the LAMBADA word prediction
task with neural readers also suggest that language modeling tasks can be aided by reading
comprehension models.
CHAPTER 6
CONCLUSION
We constructed a new training set for LAMBADA and used it to train neural readers to
improve the state of the art from 7.3% to 49%. We also provided results with several other
strong baselines and included a manual evaluation in an attempt to better understand the
phenomena tested by the task. Our hope is that other researchers will seek models and
training regimes that simultaneously perform well on both LAMBADA and control, with
the goal of solving the general problem of language modeling.
REFERENCES
[1] Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proc. of ACL, 2016.
[2] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. of EMNLP, 2014.
[3] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. arXiv preprint, 2016.
[4] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proc. of NIPS, 2015.
[5] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children's books with explicit memory representations. In Proc. of ICLR, 2016.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
[7] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. In Proc. of ACL, 2016.
[8] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint, 2017.
[9] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who did What: A large-scale person-centered cloze dataset. In Proc. of EMNLP, 2016.
[10] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proc. of ACL, 2016.
[11] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. of EMNLP, 2016.
[12] Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proc. of EMNLP, 2013.
[13] Andreas Stolcke. SRILM, an extensible language modeling toolkit. In Proc. of Interspeech, 2002.
[14] Hai Wang, Takeshi Onishi, Kevin Gimpel, and David McAllester. Emergent logical structure in vector representations of neural readers. arXiv preprint, 2016.
[15] Yi Yang, Scott Wen-tau Yih, and Chris Meek. WikiQA: A challenge dataset for open-domain question answering. In Proc. of EMNLP, 2015.
[16] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proc. of ICCV, 2015.
CHAPTER 7
APPENDIX
We show heat maps of LAMBADA instances solved by the Gated-Attention Reader. Words
in darker red are given higher probability by the model. The first candidate word of each
instance is the correct target word. We can see that in most cases, the Gated-Attention
Reader concentrates its attention on a few candidate words.
Figure 7.1: instance 1
Figure 7.2: instance 2
Figure 7.3: instance 3
Figure 7.4: instance 4
Figure 7.5: instance 5
Figure 7.6: instance 6
Figure 7.7: instance 7
Figure 7.8: instance 8
Figure 7.9: instance 9
Figure 7.10: instance 10