
Conformal prediction for text infilling and part-of-speech prediction

Neil Dey1, Jing Ding1, Jack Ferrell2, Carolina Kapper3, Maxwell Lovig4, Emiliano Planchon1, and Jonathan P Williams1

1 North Carolina State University
2 University of Florida
3 High Point University
4 University of Louisiana, Lafayette

Abstract

Modern machine learning algorithms are capable of providing remarkably accurate point predictions; however, questions remain about their statistical reliability. Unlike conventional machine learning methods, conformal prediction algorithms return confidence sets (i.e., set-valued predictions) that correspond to a given significance level. Moreover, these confidence sets are valid in the sense that they guarantee finite sample control over type 1 error probabilities, allowing the practitioner to choose an acceptable error rate. In our paper, we propose inductive conformal prediction (ICP) algorithms for the tasks of text infilling and part-of-speech (POS) prediction for natural language data. We construct new conformal prediction-enhanced bidirectional encoder representations from transformers (BERT) and bidirectional long short-term memory (BiLSTM) algorithms for POS tagging and a new conformal prediction-enhanced BERT algorithm for text infilling. We analyze the performance of the algorithms in simulations using the Brown Corpus, which contains over 57,000 sentences. Our results demonstrate that the ICP algorithms are able to produce valid set-valued predictions that are small enough to be applicable in real-world applications. We also provide a real data example for how our proposed set-valued predictions can improve machine-generated audio transcriptions.

Keywords: BERT; BiLSTM; natural language processing; set-valued prediction; uncertainty quantification

1. INTRODUCTION

In recent years, machine learning algorithms have dominated the realm of natural language processing (NLP). Over time, these algorithms have achieved higher and higher accuracy in various NLP tasks. However, such algorithms are specialized for point prediction, and as such, a significant limitation of many machine learning algorithms is that they do not offer any uncertainty quantification about how often these point predictions are actually correct. To address this limitation, the ideas of conformal prediction have gained traction in recent years in the machine learning literature generally, but less so in application to NLP tasks.

Conformal prediction is an approach introduced in [54] that allows, for example, a point prediction method to be extended to form confidence sets, guaranteeing that the set contains the true unknown predictor value with some nominal coverage probability. It has been shown that deep learning architectures such as multilayer perceptrons (MLP), convolutional neural networks (CNN), and gated recurrent units (GRU) often improve in their robustness when enhanced by a conformal prediction algorithm [33]. Conformal prediction has been applied to text classification NLP tasks. For example, [31] and [30] demonstrate similar results for conformal prediction-enhanced BERT and artificial neural network (ANN)-based sentiment classification and multi-label text classification, respectively. Other experiments in the literature, such as [37] working with deep neural network (DNN)-based multi-label text classifiers and [4] working with tree-based classifiers, replicate these findings for other multi-label classification models. Conformal prediction has also been successful in relation classification, identifying relationships between two entities in a sentence, as demonstrated by [11], and for open-domain question answering and information retrieval for fact verification in [12]. To our knowledge, however, conformal prediction has not been applied to two key tasks in NLP: the text infilling task and POS tagging.

The text infilling task (also known as the Cloze task) is a standard NLP task, asking a model to “fill in the blank” given an otherwise complete sentence. Since its conception, the task has greatly expanded in scope due to the great success of various text infilling algorithms developed. For example, [10] uses generative adversarial networks to great effect in the MaskGAN algorithm to generalize the problem to full text generation. Another generalization of the text infilling task was introduced by [35] in the form of the Story Cloze Test, determining the “right ending” to a story.


The Story Cloze Test has been further explored in the form of neural network solutions [50] and generative pre-training of language models [45], among other methods. Yet another extension to the text infilling task comes in the form of filling in blanks of arbitrary length, as explored in [60] (utilizing self-attention mechanisms) and in [49] (using the blank language model). Although many techniques have been proposed to solve the text infilling task, such as gradient-search-based inference [27] and infilling by language modeling [8], text infilling in practice has been dominated by the BERT algorithm [7], which uses a masked language modeling (MLM) pre-training objective to attain word embeddings. Though trained on the text infilling task, the resulting word embeddings remain competitive in many standard NLP tasks.

The POS tagging task is another standard NLP task in which a model assigns the correct grammatical POS to each word in a sentence. This task is unusual in the NLP realm in that the most naive algorithm of simply assigning each word its most common POS already achieves a very high baseline accuracy of roughly 92% [22, Chapter 8, end of Section 2]. The introduction of some classical models such as hidden Markov models (HMM) [24] and conditional random fields (CRF) [25] improved the accuracy to about 96%; more modern techniques currently used, such as the BiLSTM proposed by [56] and transformer models such as BERT [7], offer further marginal improvements, reaching about 97-98% accuracy. Similar to the text infilling task, it does not appear that the application of conformal prediction to POS tagging is present in the literature. However, a method of set-valued prediction introduced by [34] has been applied to POS tagging of a middle-lower German corpus by [17], demonstrating more robust predictions than standard POS tagging algorithms, but these set-valued predictions do not offer the guaranteed control over type 1 error probabilities that is inherent in conformal prediction sets. As discussed in [17], POS tagging of historical corpora remains one area where linguistics experts do not necessarily know or agree on the POS for particular words because the languages are no longer in use. In these applications, set-valued predictions are most sensible.

Furthermore, in machine learning applications, since the accuracy of POS tagging is typically high, it can be expected that many set-valued POS predictions will be of size 1, and greater than 1 for occasional ambiguous cases. Accordingly, the set-valued POS tagging algorithms that we contribute combine the speed of automated tagging with the accuracy of manual tagging.

In our paper, we apply conformal prediction to the text infilling (more specifically, the MLM) and POS tagging tasks. We construct new conformal prediction-enhanced BERT and BiLSTM algorithms for POS tagging and a new conformal prediction-enhanced BERT algorithm for MLM. Using the Brown Corpus [14], we empirically demonstrate that BERT provides smaller prediction sets for POS tagging than a BiLSTM model, and we show that BERT generates usefully small prediction sets for MLM. Moreover, we show that all conformal prediction sets achieve their nominal coverage for any level of significance. A brief overview of BiLSTM models, transformers and BERT, and conformal prediction is given in Section 2. Section 3 presents our proposed algorithms, followed by a discussion of our empirical studies in Section 4. The utility of the enhanced BERT model for MLM in a realistic setting is illustrated in Section 5 by running the model on missing words from a transcript of a TED Talk generated by automatic speech recognition software, and the paper closes with concluding remarks in Section 6. The code and workflow for reproducing our results, along with documented software for implementing our algorithms on new data sets, are available at https://github.com/jackferrellncsu/drums-nlp-codesnapshot.

2. EXISTING MACHINE LEARNING APPROACHES

Currently, the state-of-the-art methods for MLM tasks are BERT-based [6]. Other models include TagLM [42] and ELMo [43]. TagLM and ELMo both use recurrent neural networks (RNN), and ELMo specifically constructs a two-layer BiLSTM, commonly used as a pre-trained model for the embedding layer of other models. Alternatively, BERT models use transformers instead of an LSTM in the deep embedding layer.

POS tagging takes a sequence of words and assigns each word a particular POS. It is a sequence labeling task because each word can represent a different POS depending on its context. POS tagging is useful in syntactic parsing, reordering in translation, sentiment tasks, text-to-speech tasks, etc. Classic POS labeling algorithms include HMM and linear chain CRF. HMM is a probabilistic sequence model that computes a probability distribution over possible sequences of labels and chooses the label sequence with the highest likelihood. However, as a generative model, HMM does not incorporate arbitrary features for unknown words in a clean way. [3] implemented HMM, handling unknown words using suffix features, and attained an accuracy of 96.46%. CRF is a log-linear model that assigns a probability to an entire output (label) sequence with respect to all the possible sequences, given the entire sequence of input words. [51] proposed using CRF with a method for structure regularization and achieved 97.36% accuracy.

Modern POS labeling algorithms include RNNs and transformer networks. Both approaches manage to deal directly with the sequential nature of language without being restricted to a fixed window size surrounding the target word. RNN architectures contain a cycle within the network connections, where the value of a unit is directly or indirectly dependent on earlier outputs as inputs. The BiLSTM architecture has received wide attention due to its effectiveness for sequence classification. It mitigates the “vanishing gradient” problem by forgetting information that is no longer needed, carrying information that is required for decisions to come, and combining the forward and backward network results. Researchers have applied BiLSTMs and obtained accuracies ranging from 97.22% to 97.76% [26, 44, 58, 2, 57, 29]. As an alternative solution, transformers are made up of blocks including self-attention layers, feedforward networks, and custom connections. Transformer-based models, such as BERT, are pre-trained on large context corpora and are well-suited for POS tagging.

Although it appears promising that the accuracy of POS tagging has reached 97% for English language texts, it should be noted that the baseline accuracy is 92% [22, Chapter 8, end of Section 2] because many words have only a single POS, and those that have multiple POS overwhelmingly occur with their most common class. However, a single bad tag in a sentence can lead to a huge error in downstream tasks such as dependency parsing. It is thus more meaningful to view the accuracy of whole-sentence POS tagging, which is around 55-57% [32]. Researchers have been trying to improve the accuracy of POS tagging via improvements in features, parameters, and learning methods without breakthrough success. Meanwhile, there are concerns regarding the correctness of the treebank and whether POS labels are well-defined enough to allow us to assign each word a single symbolic label [32]. That is to say, it is possible that the error in POS labeling is due to linguistically justified definitions and cannot be further improved without improvement in the field of linguistics.

One way to deal with the current error in POS tagging is to add an associated confidence value for each prediction. All the aforementioned approaches only output a simple point prediction without evaluating how likely it is for each prediction to be correct. The likelihood of each prediction enables us to evaluate how much we can rely on the prediction and generates alternative POS tags. This serves as a filtering mechanism with regard to the corresponding confidence level and can help avoid the problem that a single mistake in a sentence limits the usefulness of a tagger for downstream tasks. Conformal prediction [47] is well-suited to provide such confidence information on top of the traditional algorithms. Moreover, [38] introduced the more computationally feasible application of ICP in neural networks. [31] applied ICP to a binary text classification problem using a BERT model for contextualized word embeddings. The results show that the prediction accuracy of the BERT classifier was maintained, while the prediction sets calculated using the conformal prediction algorithm provided more useful information. [12] expanded the conformal prediction correctness criterion by adding admissible labels to reduce the size of predicted sets, and filtered out implausible labels early on by using conformal prediction cascades to decrease the computational cost. [30] continued the study of conformal prediction applied to “multi-label” text classification using DNNs based on contextualized and non-contextualized word embeddings. They reduced the computational complexity by eliminating label-sets that would surely have p-values below the specified significance level. Their results show that the context-based classifier with conformal prediction has good performance and small prediction sets that are practically useful.

2.1 Long short-term memory neural net

The use of RNNs in NLP tasks is very common due to the sequential nature of language. Unlike feed-forward networks, RNNs are able to take into account all of the preceding words in a variable length sequence with fixed-size input and embedding vectors when making predictions [9]. In language tasks like next word prediction, this is desirable because the more structured the context that a model is learning from, the more accurate the prediction is likely to be.

In machine learning, the goal of a gradient descent algorithm is to minimize the cost function by finding and updating the parameters of the model. With RNNs, using gradient descent with an error criterion for tasks involving long-term dependencies is inadequate and may result in exploding or vanishing gradients [1]. This problem arises when the network updates the weights while back-propagating through time during training [19]. An extremely large gradient will make the model that is being trained unstable, and an extremely small (≈ 0) gradient will make it impossible for the model to learn correlations between events with a high temporal span of dependencies [40]. Moreover, gradient descent becomes less efficient the further apart the inputs are, suggesting that RNNs are not desirable for tasks that require long-term “memory.” There have been many theorized solutions to these issues; however, none are as prevalent as gated neural networks [21].

Figure 1: LSTM memory cell.

A popular type of gated neural network is the LSTM [20]. LSTMs help prevent vanishing and exploding gradients through the use of a memory cell, which is regulated by the forget ($f_t$), input ($i_t$), and output ($o_t$) gates (see Figure 1). Each of these gates contains a sigmoid activation alongside a component-wise multiplication operation. The sigmoid layer outputs values between 0 and 1 which serve as indicators for the proportion of each component that will be “let through” the gate. The standard reference describing the architecture and intuition for memory cells is [36]. For convenience, we summarize the main ideas in the remainder of this section.

The forget gate $f_t$ considers $y_{t-1}$ and $x_t$, where $y_{t-1}$ is the network output layer at time $t-1$ and $x_t$ is the input vector at time $t \in \mathbb{N}$. These quantities are passed through the vectorized sigmoid function
$$f_t = \sigma([x_t^\top, y_{t-1}^\top] \cdot W_f + b_f),$$
where $W_f$ and $b_f$ are a weight matrix and bias vector, respectively. After passing $[x_t^\top, y_{t-1}^\top]$ through the forget gate, the past cell state $C_{t-1}$ is multiplied component-wise with $f_t$. Next, as shown in Figure 1, a tanh activation function, also evaluated at $[x_t^\top, y_{t-1}^\top]$ but with a different weight matrix $W_C$ and bias vector $b_C$, is used to create a vector of values in $[-1, 1]$:
$$C^*_t = \tanh([x_t^\top, y_{t-1}^\top] \cdot W_C + b_C).$$
The input gate is similarly constructed as
$$i_t = \sigma([x_t^\top, y_{t-1}^\top] \cdot W_i + b_i)$$
for weight and bias terms $W_i$ and $b_i$, and the cell is updated as
$$C_t = f_t \odot C_{t-1} + i_t \odot C^*_t,$$
where $\odot$ denotes component-wise multiplication. The $f_t \odot C_{t-1}$ term controls how much of the past cell memory to carry forward, and the $i_t \odot C^*_t$ term controls how much of the updated cell memory to add [16]. Lastly, the cell updates the state $y_t$ as
$$o_t = \sigma([x_t^\top, y_{t-1}^\top] \cdot W_o + b_o), \qquad y_t = o_t \odot \tanh(C_t),$$
using a final set of weight and bias terms, $W_o$ and $b_o$.

The implementation of a memory cell like the one above is quite common; however, there is much variety when it comes to the exact details [36]. Examples include GRUs [5], peephole connections [15], and clockwork RNNs [23], among others. The sophisticated nature of these memory cells has proven to work efficiently on NLP problems [48], which is why we consider it a favorable method to combine with a conformal predictor.
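To make the gate updates concrete, the following minimal NumPy sketch implements one memory-cell step exactly as written above; the toy dimensions, random weights, and function names are ours for illustration and are not taken from the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, C_prev, W, b):
    """One LSTM memory-cell update following the equations above.

    x_t, y_prev, C_prev are 1-D arrays; W and b hold the weight matrices and
    bias vectors for the forget ('f'), input ('i'), candidate ('C'), and
    output ('o') gates.
    """
    z = np.concatenate([x_t, y_prev])           # [x_t^T, y_{t-1}^T]
    f_t = sigmoid(z @ W["f"] + b["f"])          # forget gate
    i_t = sigmoid(z @ W["i"] + b["i"])          # input gate
    C_star = np.tanh(z @ W["C"] + b["C"])       # candidate cell state
    C_t = f_t * C_prev + i_t * C_star           # component-wise cell update
    o_t = sigmoid(z @ W["o"] + b["o"])          # output gate
    y_t = o_t * np.tanh(C_t)                    # new output/hidden state
    return y_t, C_t

# Toy dimensions: 4-dimensional input, 3-dimensional cell/output state.
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = {k: rng.normal(size=(d_in + d_out, d_out)) for k in "fiCo"}
b = {k: np.zeros(d_out) for k in "fiCo"}
y, C = lstm_step(rng.normal(size=d_in), np.zeros(d_out), np.zeros(d_out), W, b)
```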

2.2 Transformers and BERT embeddings

Recurrent models, while useful for encapsulating information about the structure of sentences, are extremely computationally expensive in practice. Namely, the sequential nature of such models makes training them impossible to parallelize. Transformers were introduced to fix this issue with an encoder/decoder structure [52]. To understand the encoder/decoder intuitively, consider the problem of machine translation. If we have a sentence written in Spanish, the encoder will attempt to construct a mathematical representation of the meaning of the sentence. The decoder will take this mathematical representation, as well as information about the English language (for example), and combine the two to create an English sentence. The meaning of the sentence and information about the English language are captured using a technique referred to as “attention” [52]. The following description of attention closely follows the source paper [52], and is provided for convenience.

In attention, an output is computed using a weighted sum of values, but with weights learned from a function that finds the compatibility between a query and the key corresponding to a value, where the query, the key-value pairs, and the output are all represented by vectors [52]. Attention is mathematically described as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$
where $Q$ is the matrix containing the query vectors, $K$ is the matrix containing the key vectors, $V$ is the matrix containing the value vectors, and $d_k$ is the dimension of the key vectors [52]. The scaling by $\sqrt{d_k}$ is introduced as a normalization factor, lest dot products become so large as to be unusable [52]. Different items are used as keys, values, and queries depending on the context. In the most basic case, the query is the word currently being examined, the key vectors are the words being used as context for the query word, and the value vectors are also the words being used as context for the query word. The output of the softmax function in the above equation is used as a weighting matrix for the value vectors comprising $V$.
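As a quick illustration of the attention equation, here is a small NumPy sketch with toy dimensions and random matrices; it is not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # compatibility of queries and keys
    return softmax(scores, axis=-1) @ V         # weighted sum of value vectors

# Toy example: 5 context tokens with key/query/value dimension 8.
rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)                        # shape (5, 8)
```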

It is often desirable for different weights to be learned based on some number, $h$, of different features of the text, so the notion of “multi-head” attention is defined as
$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1, \ldots, \mathrm{head}_h] \cdot W^O,$$
where, for each $j \in \{1, \ldots, h\}$,
$$\mathrm{head}_j := \mathrm{Attention}(Q \cdot W^Q_j,\, K \cdot W^K_j,\, V \cdot W^V_j),$$
with weight matrices $W^O, W^Q_j, W^K_j$, and $W^V_j$ to be learned. Each head is empirically constructed to focus on different aspects of the training text [52]. These multi-head layers are stacked and then fed into a feed-forward neural network to form the encoder and decoder.
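Building on the previous sketch, multi-head attention projects the queries, keys, and values with per-head weight matrices, applies attention to each head, and concatenates the results before the final projection $W^O$; the dimensions below are again illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """MultiHead(Q, K, V) = [head_1, ..., head_h] W^O with h = len(W_Q)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O  # concatenate heads, then project

# Toy setup: 5 tokens, model dimension 16, h = 2 heads of size 8.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 16))                     # queries = keys = values = tokens
W_Q = [rng.normal(size=(16, 8)) for _ in range(2)]
W_K = [rng.normal(size=(16, 8)) for _ in range(2)]
W_V = [rng.normal(size=(16, 8)) for _ in range(2)]
W_O = rng.normal(size=(16, 16))
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)    # shape (5, 16)
```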

Transformers have been used in many state-of-the-art NLP models, such as GPT [45], BERT [7], and ERNIE [59]. In developing our conformal predictors, we choose to incorporate pre-trained word embeddings from BERT in particular. We focus on the use of “BERT-base” rather than “BERT-large” due to the high computation cost associated with the latter. Nonetheless, both deliver state-of-the-art results, so any minor trade-off in accuracy is justified. BERT-base has 12 layers, 768 hidden states, and 12 self-attention heads for a total parameter count of 110 million [7].

The main difference between BERT and the original transformer is its ability to examine context in both directions simultaneously, whereas the original transformer [52] and GPT [45] both gated the decoder layer, only allowing it to look in the direction from which it was supposed to be predicting. This proved effective, giving both versions of the original BERT state-of-the-art results across all General Language Understanding Evaluation (GLUE) [55] tasks when the paper was published in 2019 [7]. BERT was pre-trained using two tasks, next sentence prediction (NSP) and MLM. In NSP, BERT is presented with two sentences and attempts to determine whether or not they are truly sequential. In MLM, BERT is presented with a masked word and asked to predict it given its context. During pre-training, 15% of words were masked so as to not let the model look at the correct answer while predicting. BERT was trained over the entirety of Wikipedia (approximately 2.5 billion words) and the BooksCorpus [61] in an effort to mimic language as closely as possible. A new sub-field, “BERTology”, has surfaced in an attempt to explain why the embeddings are so efficient and generalizable [46]. We hope our application of conformal predictors to the BERT MLM task will contribute to this area of study.

2.3 Conformal predictions

Throughout the remainder of the paper we will use the following notation. Let $D$ denote a corpus of text, where the index $i \in \{1, \ldots, n\}$ denotes the position of the $i$-th word and $n$ denotes the total number of words in $D$. For the POS tasks, let $y^{\mathrm{pos}}_i$ represent the true POS for the $i$-th word in $D$. Similarly, for the MLM tasks, let $y^{\mathrm{mlm}}_j$ represent the true masked word for the $j$-th sentence in $D$, for $j \in \{1, \ldots, k\}$, where $k$ is the total number of sentences in $D$. For training, testing, and calibration, the entire corpus $D$ is randomly split into three pieces $D_{\mathrm{train}}$, $D_{\mathrm{test}}$, and $D_{\mathrm{cal}}$, respectively.

Point predictions have been the standard for NLP tasks, including those from neural networks and transformers. However, uncertainty quantification in the form of confidence intervals/sets provides added utility for point predictions in NLP tasks. Recent work has shown that this can be achieved in a variety of NLP tasks such as sentiment classification [31], multi-label text classification [30], open-domain question answering [12], and information retrieval for fact verification [12]. [31] used conformal prediction for document-level binary sentiment analysis to determine whether IMDB movie reviews had positive or negative connotations. Their data set contained 25,000 positive and 25,000 negative reviews, and they used ICP with a BERT-based text classification model to create confidence sets. At roughly 91% confidence (i.e., ε = 0.09), the average set size had narrowed to one classification, and both the sigmoid and softmax activation functions were found to perform equally well [31]. We construct similar algorithms for multi-label classification for POS tagging and MLM. [13] expanded the use of conformal prediction for information retrieval with a cascading approach, filtering out incorrect options at every step with the hope of keeping at least one “admissible” option after all the layers. This approach was found to improve both computational and predictive efficiency by giving the model fewer items to sort through at each step [13].

Conformal prediction uses knowledge gained from training a model to create confidence sets with guaranteed finite sample control over the probability of a type 1 error [47] and can be built on almost any machine learning tool, including neural networks [53]. Precisely, assuming exchangeable data examples, for any confidence level $1 - \epsilon$ with $\epsilon \in (0, 1)$, a conformal predictor yields a set-valued prediction with the property that it will fail to include the true label with probability at most $\epsilon$ [47]. This property, referred to as “validity”, is mathematically guaranteed to hold for any finite sample size, but it is possible that the conformal prediction set is very large. The values included in the prediction sets are based on the “strangeness” of the test data when compared to the training data, and the efficiency (i.e., the size of the prediction sets) depends on how the strangeness measure, a so-called “nonconformity function”, is defined [53].

The only necessary assumption for the validity of conformal prediction sets is that the data must be exchangeable: a more relaxed assumption than the common assumption of independent and identically distributed data, essentially meaning that for observed data examples $z_1, \ldots, z_n$, each of the $n!$ possible orderings of the values was equally probable to be observed [47]. In that case, the collection of observed examples is best described by a “bag”
$$B := \{\!\{ z_1, \ldots, z_n \}\!\},$$
denoting a collection of values in which the order of the elements is irrelevant [53]. For example, $\{\!\{1, 2\}\!\} = \{\!\{2, 1\}\!\}$.

A nonconformity measure $A$ is a real-valued function that measures how strange or different a value $z$ is from the other examples in the bag $B$. For the example values $z_i \in B$, $i \in \{1, \ldots, n\}$, denote the nonconformity scores by
$$\alpha_i := A(B \setminus \{z_i\}, z_i). \qquad (2.1)$$
The particular form of $A$ is context/application-specific, but common choices include various norms, such as the $\ell_\infty$ norm in [31] or the $\ell_2$ norm [47], of the distance from a ‘center’ of the set $B \setminus \{z_i\}$ to the point $z_i$.
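As a toy illustration of equation (2.1), the sketch below scores each example by its $\ell_2$ distance from the mean of the remaining examples in the bag; this is only one possible choice of nonconformity measure, not the softmax-based measure used later in the paper.

```python
import numpy as np

def l2_nonconformity(bag):
    """alpha_i = || z_i - mean(B \\ {z_i}) ||_2 for each example z_i in the bag."""
    bag = np.asarray(bag, dtype=float)
    scores = []
    for i in range(len(bag)):
        rest = np.delete(bag, i, axis=0)        # the bag with z_i removed
        scores.append(np.linalg.norm(bag[i] - rest.mean(axis=0)))
    return np.array(scores)

# The last point sits far from the others, so it receives the largest score.
alphas = l2_nonconformity([[0.0, 0.1], [0.2, 0.0], [5.0, 5.0]])
```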

Next, to decide whether to include a test value $z$ in the conformal prediction set $\Gamma^\epsilon(z_1, \ldots, z_n)$ at confidence level $1 - \epsilon$, first denote $z_{n+1} := z$ and update
$$B := \{\!\{ z_1, \ldots, z_n, z_{n+1} \}\!\}.$$


Then, noting that $\alpha_{n+1}$ corresponds to the test value, include $z = z_{n+1} \in \Gamma^\epsilon(z_1, \ldots, z_n)$ if
$$p := \frac{|\{i = 1, \ldots, n+1 : \alpha_i \geq \alpha_{n+1}\}|}{n+1} > \epsilon.$$

This procedure is formally described in [53, 47] as a transductive conformal algorithm, and we summarize it here as Algorithm 1.

Algorithm 1: Transductive conformal algorithm

Input: Nonconformity measure $A$, significance level $\epsilon$, observed examples $z_1, \ldots, z_n$, and a new observation or value $z$
Result: Decide whether to include $z$ in the set $\Gamma^\epsilon(z_1, \ldots, z_n)$
Set $z_{n+1} := z$
Set $B := \{\!\{ z_1, \ldots, z_n, z_{n+1} \}\!\}$
for $i \in \{1, \ldots, n+1\}$ do
    Set $\alpha_i := A(B \setminus \{z_i\}, z_i)$
end
Set $p := |\{i = 1, \ldots, n+1 : \alpha_i \geq \alpha_{n+1}\}| \,/\, (n+1)$
Include $z$ in $\Gamma^\epsilon(z_1, \ldots, z_n)$ if $p > \epsilon$
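A direct Python transcription of Algorithm 1 may help fix ideas; the nonconformity measure is passed in as a generic function, and the helper name is ours.

```python
def transductive_conformal(A, eps, examples, z):
    """Decide whether to include z in the conformal set Gamma^eps(examples).

    A(bag_without_zi, zi) is a user-supplied nonconformity measure,
    eps is the significance level, and examples is the observed bag.
    """
    bag = list(examples) + [z]                  # B := {{z_1, ..., z_n, z_{n+1}}}
    alphas = [A(bag[:i] + bag[i + 1:], zi)      # alpha_i for every element of the bag
              for i, zi in enumerate(bag)]
    p = sum(a >= alphas[-1] for a in alphas) / len(bag)   # p-value for the test value
    return p > eps                              # include z iff p > eps
```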

For many machine learning applications, however, transductive conformal prediction would be too computationally expensive since it requires recomputing all of the nonconformity scores for every new test observation/value. Motivated by this issue, ICP [38] is a modification of conformal prediction that greatly reduces computation costs. In ICP, the data is first split into proper training, calibration, and testing sets $D_{\mathrm{train}}$, $D_{\mathrm{cal}}$, and $D_{\mathrm{test}}$, as in our notation. Next, nonconformity scores are computed for the calibration set examples, analogous to equation (2.1), for every $j \in \{i \in \{1, \ldots, n\} : d_i \in D_{\mathrm{cal}}\}$ as
$$\alpha_j := A(D_{\mathrm{train}}, d_j).$$
Without loss of generality, re-index these scores by $j \in \{1, \ldots, |D_{\mathrm{cal}}|\}$. Similarly, the nonconformity score for a test observation $d^* \in D_{\mathrm{test}}$ is defined as
$$\alpha^* := A(D_{\mathrm{train}}, d^*),$$
and $d^* \in \Gamma^\epsilon$ if
$$p := \frac{|\{j = 1, \ldots, |D_{\mathrm{cal}}| : \alpha_j \geq \alpha^*\}| + 1}{|D_{\mathrm{cal}}| + 1} > \epsilon.$$
Thus, the ICP algorithm must only be applied once to the calibration set, and each subsequent test value only requires calculating a single new nonconformity score to compare to the static collection of nonconformity scores in the calibration set. While ICP is slightly less reliable empirically than the transductive approach, the small sacrifice in empirical reliability does not outweigh the added benefit in computational efficiency [38]. From this point forward, any reference to conformal prediction should be interpreted as ICP unless otherwise stated.
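In code, the inductive variant separates a one-time calibration pass from a cheap per-test-value p-value computation; the sketch below mirrors the formulas above, with helper names of our own choosing.

```python
import numpy as np

def icp_calibrate(A, train, cal):
    """Compute the static calibration scores alpha_j = A(train, d_j) once."""
    return np.array([A(train, d) for d in cal])

def icp_p_value(A, train, cal_scores, d_star):
    """ICP p-value for a single test observation d_star."""
    alpha_star = A(train, d_star)               # one new nonconformity score
    return (np.sum(cal_scores >= alpha_star) + 1) / (len(cal_scores) + 1)

# A test value d_star is included in Gamma^eps whenever icp_p_value(...) > eps.
```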

3. METHODOLOGY

In this section we present our methodological contributions, namely an ICP with a BERT-based neural network nonconformity measure for POS tagging in Algorithm 2, an ICP with a BiLSTM-based neural network nonconformity measure for POS tagging in Algorithm 2, and an ICP with a BERT-based neural network nonconformity measure for MLM in Algorithm 3.

3.1 POS prediction

POS prediction involves finding the context of a word and then outputting the corresponding POS. Here we present our ICP Algorithm 2 for POS prediction. Let $S$ represent the set of all $q$ unique POS in $D$, and for the $i$-th word in $D$, let $\hat{y}^{\mathrm{pos}}_i \in \mathbb{R}^q$ represent the softmax vector produced by one of our two POS models, namely the subsequently described BERT POS (BPS) model or the BiLSTM model. In addition, let $\hat{y}^{\mathrm{pos}}_{i,s}$ denote the specific softmax value for any POS $s \in S$.

Algorithm 2: ICP POS prediction

Result: Returns the conformal prediction set $\Gamma^\epsilon$ containing POS labels for a test word $d^* \in D_{\mathrm{test}}$ and significance level $\epsilon$.
Train the model using $D_{\mathrm{train}}$ to produce $\{\hat{y}^{\mathrm{pos}}_i : i \in \{1, \ldots, n\}$ and $d_i \in D_{\mathrm{cal}}\}$;
for $j$ in $\{i \in \{1, \ldots, n\} : d_i \in D_{\mathrm{cal}}\}$ do
    $s = y^{\mathrm{pos}}_j$ ;  # Recall $y^{\mathrm{pos}}_j$ is the true POS
    $\alpha_j = 1 - \hat{y}^{\mathrm{pos}}_{j,s}$ ;
end
Re-index the nonconformity scores by $j \in \{1, \ldots, |D_{\mathrm{cal}}|\}$;
for $s$ in $S$ do
    $\alpha^*_s = 1 - \hat{y}^{\mathrm{pos}}_{*,s}$ ;
    $p_s = \big(|\{j = 1, \ldots, |D_{\mathrm{cal}}| : \alpha_j \geq \alpha^*_s\}| + 1\big) \,/\, \big(|D_{\mathrm{cal}}| + 1\big)$ ;
    if $p_s > \epsilon$ then
        include $s$ in $\Gamma^\epsilon$ ;
    end
end
return $\Gamma^\epsilon$
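In Python, Algorithm 2 amounts to comparing one calibration score per word (one minus the softmax probability of its true POS) against one candidate score per tag for the test word; the sketch below assumes the softmax outputs are already available as NumPy arrays and is not the paper's implementation.

```python
import numpy as np

def pos_calibration_scores(cal_softmax, cal_true_idx):
    """alpha_j = 1 - softmax probability of the true POS for each calibration word."""
    return 1.0 - cal_softmax[np.arange(len(cal_true_idx)), cal_true_idx]

def pos_prediction_set(test_softmax, cal_scores, eps):
    """Return the indices of POS tags s with p_s > eps (Algorithm 2)."""
    alpha_star = 1.0 - test_softmax                       # one score per candidate tag
    counts = (cal_scores[:, None] >= alpha_star[None, :]).sum(axis=0)
    p_values = (counts + 1) / (len(cal_scores) + 1)
    return np.where(p_values > eps)[0]

# Toy example: 3 calibration words over q = 4 tags, one test word.
# With only 3 calibration scores the smallest attainable p-value is 1/4,
# so eps is chosen above that for illustration.
cal_softmax = np.array([[0.7, 0.1, 0.1, 0.1],
                        [0.2, 0.6, 0.1, 0.1],
                        [0.1, 0.1, 0.7, 0.1]])
scores = pos_calibration_scores(cal_softmax, np.array([0, 1, 2]))
print(pos_prediction_set(np.array([0.8, 0.1, 0.05, 0.05]), scores, eps=0.3))
```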

3.1.1 BERT POS prediction

BERT creates custom embeddings for words based on the words themselves and the context around them. These embeddings can be fine-tuned for specific NLP tasks, such as POS prediction. We extend these predictions to form conformal prediction sets that quantify prediction uncertainty. The parameters of BERT that we implement for POS prediction have been pre-trained and are available from [7]. However, we must adjust the BERT parameters in addition to the parameters of a dense feed-forward network that we construct for mapping the BERT-base length-768 output embedding for a word to our $q$-component softmax vector [7].

There is some nuance to how we format the data to beusable with BERT. First, we address the BERT tokenizer.


BERT splits a word root from its tense. For this reason, we define a word by its last token, since this dictates the tense of the word (e.g., the word “wanted” is tokenized as “##ed”). Next, it is necessary for a BERT input to have a fixed length (e.g., 100 words per sentence). If a sentence is longer than this maximum length, we split it into multiple sentences of length 100, and we add [PAD] tokens to sentences of fewer than 100 words until the sentence is of length 100.

On top of BERT, we place a single softmax layer which reduces the length-768 vector into a $q$-length probability vector. Our model is trained by inputting a sentence, and each word has its fine-tuned embedding vector run through the dense layer. We train the parameters for 3 epochs using the binary cross-entropy loss with the RADAM optimizer [28]. A schematic illustration of our BERT architecture is given in Figure 2. The softmax output vector from this neural network is then used in Algorithm 2 to yield the resulting conformal prediction sets. This combined BERT architecture with the conformal prediction algorithm for POS tagging is what we refer to as our BPS model.

Figure 2: Illustration of the BERT POS model. The leftmost layer is the input sentence, which is then transformed into the last token of each word. This second layer is then input into BERT, and an optimized embedding for the POS is made for each word. Each embedding is passed through a single-layer dense neural net with sigmoid and softmax activation to produce the probability of each POS tag for each word in the sentence.
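A simplified PyTorch sketch of an architecture matching this description (BERT-base token embeddings fed through a single dense layer to a $q$-component softmax) is given below; the class name is hypothetical, and training details such as the binary cross-entropy loss, the RADAM optimizer, and the 3-epoch fine-tuning schedule are omitted.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertPOSTagger(nn.Module):
    """BERT-base encoder followed by a single dense softmax layer over q POS tags."""

    def __init__(self, q: int = 190):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(768, q)   # 768-dim token embedding -> q logits

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Per-token probability distribution over the q POS labels.
        return torch.softmax(self.classifier(hidden), dim=-1)
```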

3.1.2 BiLSTM POS prediction

In addition to our BPS model, we also construct a BiLSTM architecture for the task of POS tagging with conformal prediction sets, also using Algorithm 2. For word embeddings, we use Stanford's GloVe embeddings [41]. The GloVe embeddings are desirable because of their ability to balance local and global relationships between words. To make the model more generalizable, we chose to use pre-trained embeddings. Specifically, we use the GloVe embeddings of length 300 trained on 6 billion tokens from Wikipedia and Gigaword [39]. Note that any word in our corpus that does not have a defined, pre-trained GloVe embedding is instead represented by a length-300 zero vector.

To train the BiLSTM model, we first create sentence embeddings to represent all of the sentences in our corpus. We create these sentence embeddings by concatenating the ordered, pre-trained GloVe word embeddings for the words in a given sentence. Accordingly, the sentence embedding for the $j$-th sentence is a matrix of dimension $300 \times n_j$, where $n_j$ is the number of words in the $j$-th sentence. These sentence embedding matrices are then passed through a layer in the BiLSTM model. The BiLSTM layer consists of two sub-layers, a forward LSTM layer and a backward LSTM layer. For any individual sentence indexed by $j$, the forward LSTM layer takes in the matrix of embeddings and returns a matrix of dimension $150 \times n_j$. Similarly, the backward LSTM layer takes in the reversed matrix of embeddings and returns a matrix of dimension $150 \times n_j$. Each column in these returned matrices contains a length-150 embedding suited for predicting the respective POS of each word. The idea is that the forward layer captures the context of a sentence processed from beginning to end, while the backward layer captures the context of a sentence processed from end to beginning. This extra context allows the model to get a better understanding of the sequential patterns of POS in sentences. To combine the information gathered by the forward LSTM layer and the backward LSTM layer, we reverse the order of the columns of the matrix returned by the backward LSTM and concatenate it with the matrix output by the forward LSTM. This results in a $300 \times n_j$ matrix, with each column representing an optimal embedding for predicting POS.

Figure 3: Illustration of the BiLSTM POS model processing a sample sentence embedding matrix. The top row illustrates the functionality of the BiLSTM layer within the model, with the leftmost matrix in the second row symbolizing the output of the BiLSTM layer. This output is simply the concatenation of the output matrix of the forward LSTM layer and the reversed output matrix of the backward LSTM layer. The rest of the second row provides a visualization of the dense layer processes, which eventually result in the POS softmax matrix shown in the bottom right.

After training the BiLSTM matrix of optimal embeddings, we pass the columns of this matrix through a feed-forward neural net. This net reduces the length-300 embedding to a length-250 vector with a ReLU activation, which is further reduced to a $q$-length softmax vector corresponding to the $q$ POS labels. Each softmax output vector represents an estimated probability distribution over the POS labels for a given word. This procedure is repeated for the $n_j$ columns in the input matrix (each column corresponding to a word in the input sentence). The schematic for this BiLSTM architecture is displayed in Figure 3.

For training the parameters, we implement exponential decay in the popular RADAM optimizer [28]. We train for 700 epochs to avoid overfitting, and we use cross entropy as our loss function. Finally, similarly to our BPS model, the softmax output vector from this neural network is then used in Algorithm 2 to yield the resulting conformal prediction sets.
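The BiLSTM architecture described above can be sketched in PyTorch as follows; the module name and toy input are ours, and the GloVe lookup, exponential decay schedule, and RADAM training loop are omitted.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """GloVe (300-d) -> BiLSTM (150 per direction) -> dense 250 ReLU -> q-way softmax."""

    def __init__(self, q: int = 190):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=300, hidden_size=150,
                              bidirectional=True, batch_first=True)
        self.hidden = nn.Linear(300, 250)   # 2 x 150 concatenated directions
        self.out = nn.Linear(250, q)

    def forward(self, sentence_embeddings):
        # sentence_embeddings: (batch, n_j, 300) GloVe vectors for one sentence
        h, _ = self.bilstm(sentence_embeddings)       # (batch, n_j, 300)
        h = torch.relu(self.hidden(h))                # (batch, n_j, 250)
        return torch.softmax(self.out(h), dim=-1)     # per-word POS probabilities

# Example: a toy "sentence" of 4 words with random 300-d embeddings.
probs = BiLSTMTagger()(torch.randn(1, 4, 300))        # shape (1, 4, 190)
```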

3.2 Masked language modeling

The MLM task is similar to POS tagging with two exceptions. First, the word to be predicted is masked or unknown (for training/testing, when a sentence is passed into the model, the target word is assigned the [MASK] token). Second, instead of classifying a word using $q$ POS labels, unknown words are inferred using a massive vocabulary of words. These changes, however, do not affect the basic conformal algorithm much, as presented in Algorithm 3. From here on, “token” and “word” will be used interchangeably.

Algorithm 3: ICP MLM

Result: Returns the conformal prediction set $\Gamma^\epsilon$ containing candidate words for a masked token $d^* \in D_{\mathrm{test}}$ and significance level $\epsilon$.
Train the model using $D_{\mathrm{train}}$ to produce $\{\hat{y}^{\mathrm{mlm}}_i : i \in \{1, \ldots, n\}$, $d_i \in D_{\mathrm{cal}}$, and $d_i$ is masked$\}$;
for $j$ in $\{i \in \{1, \ldots, n\} : d_i \in D_{\mathrm{cal}}$ and $d_i$ is masked$\}$ do
    $u = y^{\mathrm{mlm}}_j$ ;  # Recall $y^{\mathrm{mlm}}_j$ is the true masked token
    $\alpha_j = 1 - \hat{y}^{\mathrm{mlm}}_{j,u}$ ;
end
Re-index the nonconformity scores by $j \in \{1, \ldots, k\}$, where $k$ is the number of sentences in $D_{\mathrm{cal}}$ ;
for $u$ in $U$ do
    $\alpha^*_u = 1 - \hat{y}^{\mathrm{mlm}}_{*,u}$ ;
    $p_u = \big(|\{j = 1, \ldots, k : \alpha_j \geq \alpha^*_u\}| + 1\big) \,/\, (k + 1)$ ;
    if $p_u > \epsilon$ then
        include $u$ in $\Gamma^\epsilon$ ;
    end
end
return $\Gamma^\epsilon$

For MLM, we construct a BERT-based conformal prediction algorithm similar to the BPS model for POS tagging described in the previous section. BERT was designed for the task of predicting a masked word. Our BERT model takes the context and position of a [MASK] token and returns a softmax distribution over the 30,522 candidate tokens, and then Algorithm 3 is implemented to construct the conformal prediction set of candidate tokens for a given masked word of interest. Within Algorithm 3, $U$ denotes the set of all 30,522 unique tokens comprising the set of pre-defined BERT tokens. For the $j$-th masked token in $D$, $\hat{y}^{\mathrm{mlm}}_j \in \mathbb{R}^{30{,}522}$ represents the softmax vector for the MLM model. In addition, $\hat{y}^{\mathrm{mlm}}_{j,u}$ denotes the specific softmax value for any token $u \in U$. A schematic of our BERT MLM model is given in Figure 4.

Figure 4: Illustration of the BERT MLM model. The top layer is the input sentence, which is then tokenized. A single token is then replaced with [MASK]. This tokenized sentence is then passed into BERT, which outputs a softmax probability distribution corresponding to the masked token.
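For readers who want to reproduce the softmax distribution over BERT's vocabulary that feeds Algorithm 3, the sketch below uses the Hugging Face transformers package and the bert-base-uncased checkpoint as assumed stand-ins for the authors' tooling (the paper's own code is in the linked repository).

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Example sentence adapted from Section 5, with the target word masked.
sentence = "But when we met, Barack was a community [MASK]."
inputs = tokenizer(sentence, return_tensors="pt")
mask_idx = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0].item()

with torch.no_grad():
    logits = model(**inputs).logits                  # shape (1, seq_len, 30522)
probs = torch.softmax(logits[0, mask_idx], dim=-1)   # softmax over the BERT vocabulary

# Forced point prediction and a few high-probability candidates.
top = torch.topk(probs, 5)
print(tokenizer.convert_ids_to_tokens(top.indices))
```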

4. EMPIRICAL RESULTS

Using the Brown Corpus, we evaluate the conformal prediction sets produced by our three algorithms. The Brown Corpus contains 500 documents, with each word in these documents having a corresponding POS label. In total, there are just over 57,000 sentences and around 49,800 unique words. We consider each sentence in the corpus as a data instance and randomly allocate 80% of these sentences for training, 10% for calibration, and 10% for testing. To account for sampling variability, the random allocation of the data into training, calibration, and testing sets is repeated 5 times, and all metrics are evaluated on and averaged over the 5 test sets.

For all POS tags (and combinations of POS tags) we remove the hyphenated portion (if any). This includes headline (-HL), title (-TL), and emphasis (-NC) hyphenations, as well as the foreign word prefix (FW-). If a word has a POS listed as a combination of multiple POS, the specific multiple-POS combination is added as a new unique POS to our label set. After these preprocessing steps there remain $q = 190$ unique POS tags in the label set.


4.1 Performance metrics

We consider a variety of metrics that evaluate both the “forced” point-predictions and the conformal prediction sets. The metrics we consider are adopted from the criteria considered in [31]. Let $n_{\mathrm{test}} := |D_{\mathrm{test}}|$, and assume a fixed $\epsilon \in (0, 1)$. For ease of notation, let $\hat{y}_i$ denote a prediction of some label $y_i$, for an example indexed by $i$. The metrics are defined as follows.

Classification accuracy (CA) is taken simply to be the proportion of correct predictions:
$$\mathrm{CA} = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} I[\hat{y}_i = y_i].$$

Average credibility (Cred) is the average minimum significance level required such that our prediction sets are nonempty:
$$\mathrm{Cred} = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \inf\{1 - \epsilon : |\Gamma^\epsilon_i| \geq 1\}.$$
A high value for Cred is an indication that the model has little confidence that any of the considered labels are appropriate for the test examples.

The OP criterion (for observed perceptiveness) is the average of all test p-values for correct classifications:
$$\mathrm{OP} = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} p_{y_i}.$$
Conversely, the OF criterion (for observed fuzziness) is the average of all test p-values for incorrect classifications:
$$\mathrm{OF} = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \sum_{y \neq y_i} p_{y}.$$

Average empirical coverage (Coverage) is the proportion of prediction sets that contain the true value:
$$\mathrm{Coverage} = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} I[y_i \in \Gamma^\epsilon_i].$$

The proportion of indecisive sets (PIS) is the proportion of sets (for a fixed $\epsilon$) that contain more than one label:
$$\mathrm{PIS} = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} I[|\Gamma^\epsilon_i| > 1].$$

The average confidence of decisive sets (ACDS) is the proportion of confidence sets of size 1 that contain the true label:
$$\mathrm{ACDS} = \frac{\sum_{i=1}^{n_{\mathrm{test}}} I[|\Gamma^\epsilon_i| = 1,\; y_i \in \Gamma^\epsilon_i]}{\sum_{i=1}^{n_{\mathrm{test}}} I[|\Gamma^\epsilon_i| = 1]}.$$

Lastly, the $N^\epsilon$ criterion is the mean size of the prediction sets at confidence level $1 - \epsilon$:
$$N^\epsilon = \frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} |\Gamma^\epsilon_i|.$$
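Given the prediction sets and true labels on a test split, the set-valued metrics above reduce to a few lines of array arithmetic; the helper below is our own sketch, not code from the paper's repository.

```python
import numpy as np

def set_metrics(pred_sets, true_labels):
    """Coverage, PIS, ACDS, and N^eps for a list of prediction sets (Python sets)."""
    sizes = np.array([len(s) for s in pred_sets])
    hits = np.array([y in s for s, y in zip(pred_sets, true_labels)])
    singletons = sizes == 1
    return {
        "Coverage": hits.mean(),                                   # true label covered
        "PIS": (sizes > 1).mean(),                                 # indecisive sets
        "ACDS": hits[singletons].mean() if singletons.any() else float("nan"),
        "N_eps": sizes.mean(),                                     # mean set size
    }

# Example with three test words and their conformal POS sets:
print(set_metrics([{"nn"}, {"nn", "vb"}, {"jj"}], ["nn", "vb", "nn"]))
```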

4.2 POS prediction results

Figures 5 and 6 present the results for both POS models. Note that the metrics in Figure 6 require a forced point-prediction, which we take to be the label that maximizes the softmax vector returned by either the BPS model or the BiLSTM model.

                      Proposed Conf.:   99.9%     99%      95%
BiLSTM   Coverage                      0.9989   0.9903   0.9502
         ACDS                          0.9996   0.9939   0.9631
         PIS                           0.5424   0.1566   0.0024
         Nε                            3.4336   1.2732   0.9889
BPS      Coverage                      0.9990   0.9897   0.9499
         ACDS                          0.9992   0.9909   NA
         PIS                           0.3577   0.0334   0.0000
         Nε                            2.6260   1.0378   0.9570

Figure 5: Set-valued prediction criterion results for POS prediction.

It is observed in Figure 5 that for the 99% nominal confidence level, both models produce sets that average around 1-2 POS per set. This illustrates that the conformal prediction algorithm produces efficient sets at high confidence levels, and also suggests that the softmax probability vectors from the underlying neural nets are highly concentrated on 1-2 POS labels. Moreover, the Coverage and ACDS values in Figure 5 demonstrate that these conformal prediction sets achieve their nominal coverage. The very small values for PIS with Nε ≈ 1 at the 95% confidence level indicate a high proportion of conformal prediction sets containing zero or one POS label.

Model     CA       Cred.    OP       OF
BiLSTM    0.9536   0.5055   0.5012   0.0493
BPS       0.9793   0.5020   0.5008   0.0126

Figure 6: Forced-value prediction criterion results for POS prediction.

To offer further insight, Figure 7 displays histograms of the set sizes for both models at the 99% confidence level. The majority of the sets are of size one, which accounts for the height of the leftmost bins. However, the sizes of the sets vary greatly for different levels of nominal confidence, and so the uncertainty quantification afforded by the conformal prediction sets has utility. In particular, the models we constructed are able to provide 99.9% confidence for 3-4 POS labels, on average, for a given word. Such a quantified guarantee about the uncertainty in a prediction is not possible to provide from neural network architectures alone.


Figure 7: Histograms of conformal prediction set sizes for POS prediction at the 99% confidence level for BiLSTM (top) and BPS (bottom).

Figure 8: Coverage of conformal prediction sets for POS prediction for BiLSTM (top) and BPS (bottom); the empirical confidence level is plotted against the nominal confidence level. For reference, the dashed line is a 45 degree line.

Figure 9: Histogram of nonconformity scores for the calibration sets for the BPS model. The histogram on the bottom plot only includes scores less than 0.0002 to better illustrate how these scores are distributed near zero.

Figure 10: Histogram of nonconformity scores for the calibration sets for the BiLSTM POS model. The histogram on the bottom plot only includes scores less than 0.0002 to better illustrate how these scores are distributed near zero.

To demonstrate the validity of the Coverage values at more levels than the 99.9%, 99%, and 95% levels displayed in Figure 5, Figure 8 plots the average empirical coverage of the conformal prediction sets against their nominal levels for confidence levels ranging from 0 to 1.

Next, Figure 6 provides an assessment of the forced point-predictions of the underlying BPS and BiLSTM models. Being the state-of-the-art, the BPS model is found to be marginally more accurate with respect to CA. However, both models perform relatively similarly with regard to the other metrics in Figure 6. The difference in values between OP and OF indicates that the models are able to discriminate the correct POS label from the incorrect labels, on average.

Lastly, for further assessment of the conformal prediction algorithm, we present histograms of the nonconformity scores for the calibration sets in Figures 9 and 10.

4.3 MLM results

For the MLM task, we mask a randomly chosen single word in each sentence in the Brown Corpus. Sentences are tokenized according to the “WordPiece” embeddings used by BERT, then truncated to a length of 128 to feed into the model. Further, we include fewer examples in the calibration set for the MLM task than for the POS task in the previous section due to the larger computational cost entailed by the much larger label set for MLM (i.e., all words in a vocabulary of around 30,000 words). Specifically, the calibration set contains around 1,300 sentences, and the testing set is also reduced to 1,000 sentences. To account for sampling variability in the random allocation of the data into training, calibration, and testing sets, we still repeat the process 5 times and report our results as averages over these 5 Monte Carlo iterations.

Confidence Level   Coverage       Nε
95%                0.948      176.77
90%                0.898       43.62
80%                0.794        6.96
75%                0.739        3.65

Figure 11: Set-valued prediction criterion results for MLM.

Model       CA      Cred    OP      OF
BERT MLM    0.542   0.609   0.491   0.122

Figure 12: Forced-value prediction criterion results for MLM.

Unlike for POS prediction in the previous section, for MLM it is found that higher levels of confidence lead to prediction sets that are too large to be useful (see Figure 11). In particular, to guarantee that the true masked word is not omitted from the prediction set for more than 5% of test sentences (i.e., at the 95% level), the average conformal prediction set size is reported to be approximately 177 candidate tokens. Nonetheless, sacrificing some confidence quickly leads to smaller sets, down to 3-4 words on average at the 75% level. The histogram of the conformal prediction set sizes for all test examples is shown in Figure 13.

Figure 13: Histogram of conformal prediction set sizes for MLM at the 95% confidence level.

Figure 14: Coverage of conformal prediction sets for MLM; the empirical confidence level is plotted against the nominal confidence level. For reference, the dashed line is a 45 degree line.

Figure 15: Histogram of nonconformity scores for the calibration sets for MLM.


Additionally, the conformal prediction sets do achieve their nominal coverage at all levels displayed in Figure 11. To infer the validity of Coverage at all levels from 75% to 95%, Figure 14 plots the average empirical coverage of the conformal prediction sets against their nominal confidence levels in this range.

Lastly, we provide the forced point-prediction metrics in Figure 12, and we present a histogram of the nonconformity scores for the calibration sets in Figure 15. The bimodal nature of the histogram is due to the underlying BERT model making overly discriminative predictions (i.e., the softmax vectors $\hat{y}^{\mathrm{mlm}}_j$ being close to one-hot vectors), even when these predictions are sometimes very wrong, leading to either very high or very low nonconformity scores and not much in between.

5. ILLUSTRATIVE REAL EXAMPLE

An application of our conformal prediction sets for MLM could come in the form of a post-hoc analysis tool for speech recognition software. The following example comes from a voice transcription of a 2009 TED Talk given by Michelle Obama, part of the greater TED-LIUM3 audio transcription corpus [18]. Not all words were able to be detected by the automated speech recognition (ASR) system; such words are instead labeled with the token <UNK> in place of the unknown word. Ideally, our model would be able to fill in these unknown words with set-valued predictions at any desired confidence level. To compare with other voice-to-text systems, we also analyzed the YouTube closed-captioning for this TED Talk video, which appeared to be more accurate than the ASR. Below are 3 example sentences from the talk, with the italicized text representing the YouTube closed-captioning transcriptions and the non-italicized text representing the ASR system transcriptions. The correct words, along with conformal prediction sets at the 75% confidence level (i.e., ε = 0.25), are presented next.

Example 1.

. . . to go with him to a community meeting. But when wemet, Barack was a community organizer.

. . . to go with him to a community <UNK>. But whenwe met, Barack was a community organizer.

Γ^0.25 = [‘college’, ‘center’, ‘event’, ‘conference’, ‘meeting’, ‘dinner’, ‘gathering’]
Correct word: ‘meeting’

Example 2.

And he urged the people in that meeting, in that com-munity, to devote themselves to closing the gap betweenthose two ideas, to work together to try to make the worldas it is and the world as it should be, one and the same.

And he urged the people in that meeting in that community to devote themselves to closing the gap between those two ideas, to work together to try to make the world as it is and the world as it should <UNK> one and the same.

Γ^0.25 = [‘,’, ‘be’, ‘seem’]
Correct word: ‘be’

Example 3.

And they opened many new doors for millions of female doctors and nurses and artists and authors, all of whom have followed them. And by getting a good education you too can control your own destiny.

And they opened many new doors for millions of female doctors and nurses and artists and authors all of whom have <UNK> <UNK>. And by getting a good education you too can control your own destiny.

$\Gamma^{0.25}_1$ = [‘been’, ‘become’, ‘loved’]

$\Gamma^{0.25}_2$ = [‘children’, ‘died’, ‘success’, ‘experience’, ‘careers’]

Correct words: ‘followed’, ‘them’

At the 75% confidence level, the conformal prediction sets included the correct word in the first two examples. However, our MLM was not trained on any sentence with two consecutive masked words; thus, it fails to include the correct words in the third example. That being so, if we pass this sentence through the model twice, each time with only one masked word, we see the more accurate results:

Correct word: ‘followed’
$\Gamma^{0.25}_1$ = [‘joined’, ‘followed’, ‘loved’, ‘taught’, ‘inspired’, ‘influenced’]

Correct word: ‘them’
$\Gamma^{0.25}_2$ = [‘you’, ‘me’, ‘them’, ‘through’, ‘suit’]

This suggests that the BERT model heavily depends on directly adjacent words to predict the token for a masked word in a sentence.
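For readers who want to reproduce this kind of post-hoc analysis, the sketch below fills a single <UNK> token with a conformal prediction set using a pretrained masked language model from the Hugging Face `transformers` library. It is only an illustration under stated assumptions: the checkpoint `bert-base-uncased` stands in for whatever model is used, and `q_hat` is assumed to be a nonconformity threshold already computed from a calibration set (as in the earlier sketch). As the examples show, a sentence with several unknown words would be passed through once per masked position.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def fill_unknown(sentence, q_hat):
    """Conformal set for the single <UNK> token in `sentence`, given a
    calibrated nonconformity threshold `q_hat` (hypothetical input)."""
    text = sentence.replace("<UNK>", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0, mask_pos], dim=-1)
    keep = (1.0 - probs <= q_hat).nonzero().flatten().tolist()
    return tokenizer.convert_ids_to_tokens(keep)

# Example 1 revisited, with q_hat calibrated at the 75% level:
# fill_unknown("... to go with him to a community <UNK>. But when we met, "
#              "Barack was a community organizer.", q_hat)
```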

6. CONCLUDING REMARKS

We found that BERT-based conformal prediction sets were extremely effective in predicting both POS and masked words, which is unsurprising seeing as BERT is the dominant model for many NLP tasks at the moment. The complexity of models like BERT or BiLSTM was necessary, as our previous attempts using simpler nonconformity functions were not able to produce as efficient confidence sets. In the future, we may explore different nonconformity scores to make the BERT MLM prediction sets even smaller. Initial tests show promising results, but these are more computationally intensive than the methods described in our results section.


REFERENCES

[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[2] Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. Morphosyntactic tagging with a meta-BiLSTM model over context sensitive token encodings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2642–2652, 2018.

[3] Thorsten Brants. TnT: a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224–231, 2000.

[4] Maxime Cauchois, Suyash Gupta, and John C. Duchi. Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1–42, 2021.

[5] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[6] Jacob Devlin and Ming-Wei Chang. Open sourcing BERT: state-of-the-art pre-training for natural language processing. Google AI Blog, 2, 2018.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.

[8] Chris Donahue, Mina Lee, and Percy Liang. Enabling language models to fill in the blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2492–2501, 2020.

[9] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[10] William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the ______. In International Conference on Learning Representations, 2018.

[11] Adam Fisch, Tal Schuster, Tommi Jaakkola, and Regina Barzilay. Few-shot conformal prediction with auxiliary tasks. arXiv preprint arXiv:2102.08898, 2021.

[12] Adam Fisch, Tal Schuster, Tommi S. Jaakkola, and Regina Barzilay. Efficient conformal prediction via cascaded inference with expanded admission. In International Conference on Learning Representations, 2020.

[13] Adam Fisch, Tal Schuster, Tommi S. Jaakkola, and Regina Barzilay. Relaxed conformal prediction cascades for efficient inference over many labels. International Conference on Learning Representations, 2021.

[14] W. Nelson Francis and Henry Kucera. Brown Corpus manual. Letters to the Editor, 5(2):7, 1979.

[15] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug):115–143, 2002.

[16] Yoav Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420, 2016.

[17] Stefan Helmut Heid, Marcel Dominik Wever, and Eyke Hüllermeier. Reliable part-of-speech tagging of historical corpora through set-valued prediction. Journal of Data Mining and Digital Humanities, 2020.

[18] François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation. In International Conference on Speech and Computer, pages 198–208. Springer, 2018.

[19] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[21] Yuhuang Hu, Adrian Huber, Jithendar Anumula, and Shih-Chii Liu. Overcoming the vanishing gradient problem in plain recurrent networks. arXiv preprint arXiv:1801.06105, 2018.

[22] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 3rd edition, 2021.

[23] Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. A clockwork RNN. In International Conference on Machine Learning, pages 1863–1871. PMLR, 2014.

[24] Julian Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language, 6(3):225–242, 1992.

[25] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.

[26] Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luís. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530, 2015.

[27] Dayiheng Liu, Jie Fu, Pengfei Liu, and Jiancheng Lv. TIGS: An inference algorithm for text infilling with gradient search. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

[28] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. International Conference on Learning Representations, 2020.

[29] Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Xu, Huan Gui, Jian Peng, and Jiawei Han. Empower sequence labeling with task-aware neural language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[30] Lysimachos Maltoudoglou, Andreas Paisios, Ladislav Lenc, Jiří Martínek, Pavel Král, and Harris Papadopoulos. Well-calibrated confidence measures for multi-label text classification with a large number of labels. Pattern Recognition, 122:108271, 2022.

[31] Lysimachos Maltoudoglou, Andreas Paisios, and Harris Papadopoulos. BERT-based conformal predictor for sentiment analysis. In Conformal and Probabilistic Prediction and Applications, pages 269–284. PMLR, 2020.

[32] Christopher D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In International Conference on Intelligent Text Processing and Computational Linguistics, pages 171–189. Springer, 2011.

[33] Soundouss Messoudi, Sylvain Rousseau, and Sébastien Destercke. Deep conformal prediction for robust models. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 528–540. Springer, 2020.

[34] Thomas Mortier, Marek Wydmuch, Eyke Hüllermeier, Krzysztof Dembczynski, and Willem Waegeman. Efficient algorithms for set-valued prediction in multi-class classification. arXiv preprint arXiv:1906.08129, 2019.

[35] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696, 2016.

[36] Christopher Olah. Understanding LSTM networks, 2015.

[37] Andreas Paisios, Ladislav Lenc, Jiří Martínek, Pavel Král, and Harris Papadopoulos. A deep neural network conformal predictor for multi-label text classification. In Conformal and Probabilistic Prediction and Applications, pages 228–245. PMLR, 2019.


[38] Harris Papadopoulos. Inductive conformal prediction: Theory and application to neural networks. In Tools in Artificial Intelligence. IntechOpen, 2008.

[39] R. Parker, D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword fifth edition. Technical report, Linguistic Data Consortium, Philadelphia, 2011.

[40] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem. arXiv preprint arXiv:1211.5063, 2012.

[41] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[42] Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765, 2017.

[43] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.

[44] Barbara Plank, Anders Søgaard, and Yoav Goldberg. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of ACL 2016. Association for Computational Linguistics (ACL), 2016.

[45] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[46] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020.

[47] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.

[48] Abdullah Aziz Sharfuddin, Md Nafis Tihami, and Md Saiful Islam. A deep recurrent neural network with BiLSTM model for sentiment classification. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pages 1–4. IEEE, 2018.

[49] Tianxiao Shen, Victor Quach, Regina Barzilay, and Tommi Jaakkola. Blank language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5186–5198, 2020.

[50] Siddarth Srinivasan, Richa Arora, and Mark Riedl. A simple and effective approach to the story cloze test. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 92–96, 2018.

[51] Xu Sun. Structure regularization for structured prediction. Advances in Neural Information Processing Systems, 27:2402–2410, 2014.

[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[53] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.

[54] Volodya Vovk, Alexander Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 444–453, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[55] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019.

[56] Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168, 2015.

[57] Yingwei Xin, Ethan Hart, Vibhuti Mahajan, and Jean David Ruvini. Learning better internal structure of words for sequence labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2584–2593, 2018.

[58] Michihiro Yasunaga, Jungo Kasai, and Dragomir Radev. Robust multilingual part-of-speech tagging via adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 976–986, 2018.

[59] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, 2019.

[60] Wanrong Zhu, Zhiting Hu, and Eric Xing. Text infilling. arXiv preprint arXiv:1901.00158, 2019.

[61] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.
[61] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov,Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Align-ing books and movies: Towards story-like visual explanations bywatching movies and reading books. In Proceedings of the IEEEinternational conference on computer vision, pages 19–27, 2015.