
Subsequence Based Deep Active Learning for Named Entity Recognition

Puria Radmard 1,2,3
[email protected]

Yassir Fathullah 2
[email protected]

Aldo Lipani 1,3
[email protected]

1 University College London
2 University of Cambridge
3 Vector AI

Abstract

Active Learning (AL) has been successfully applied to Deep Learning in order to drastically reduce the amount of data required to achieve high performance. Previous works have shown that lightweight architectures for Named Entity Recognition (NER) can achieve optimal performance with only 25% of the original training data. However, these methods do not exploit the sequential nature of language and the heterogeneity of uncertainty within each instance, requiring the labelling of whole sentences. Additionally, this standard method requires that the annotator has access to the full sentence when labelling. In this work, we overcome these limitations by allowing the AL algorithm to query subsequences within sentences, and propagate their labels to other sentences. We achieve highly efficient results on OntoNotes 5.0, only requiring 13% of the original training data, and CoNLL 2003, requiring only 27%. This is an improvement of 39% and 37% compared to querying full sentences.

1 Introduction

The availability of large datasets has been key to the success of deep learning in Natural Language Processing (NLP). This has galvanized the creation of larger datasets in order to train larger deep learning models. However, creating high quality datasets is expensive due to the sparsity of natural language, our inability to label it efficiently compared to other forms of data, and the amount of prior knowledge required to solve certain annotation tasks. Such a problem has motivated the development of new Active Learning (AL) strategies which aim to efficiently train models by automatically identifying the best training examples from large amounts of unlabeled data (Wei et al., 2015; Wang et al., 2017; Tong and Koller, 2002). This tremendously reduces human annotation effort, as far fewer instances need to be labeled manually.

Code is made available at: https://github.com/puria-radmard/RFL-SBDALNER

To minimise the amount of data needed to train a model, AL algorithms iterate between training a model, and querying information rich instances to human annotators from a pool of unlabelled data (Huang et al., 2014). This has been shown to work well when the queries are 'atomic': a single annotation requires a unit of labour, and entirely describes the instance to be annotated. Conversely, each instance of structured data, such as a sequence, requires multiple annotations. Hence, such query selection methods can result in a waste of annotation budget (Settles, 2011).

For example, in Named Entity Recognition (NER), each sentence is usually considered an instance. However, because each token has a separate label, annotation budgeting is typically done on a token basis (Shen et al., 2017). Budget wasting may therefore arise from the heterogeneity of uncertainty across each sentence; a sentence can contain multiple subsequences (of tokens) of which the model is certain of some and uncertain of others. By making the selection at a sentence level, although some budget is spent on annotating uncertain subsequences, the remaining budget may be wasted on annotating subsequences for which an annotation is not needed.

It can therefore be desirable for annotators to label subsequences rather than full sentences. This gives greater flexibility to AL strategies to locate information rich parts of the input with improved efficiency, and reduces the cognitive demands placed on annotators. Annotators may in fact perform better if they are asked to annotate shorter sequences, because longer sentences can cause boredom, fatigue, and inaccuracies (Rzeszotarski et al., 2013).


In this work, we aim to improve upon the efficiency of AL for NER by querying for subsequences within each sentence, and propagating labels to unseen, identical subsequences in the dataset. This strategy simulates a setup in which annotators are presented with these subsequences, and do not have access to the full context, ensuring that their focus is centred on the tokens of interest.

We show that AL algorithms for NER tasks that use subsequences, allowing training on partially labelled sentences, are more efficient in terms of budget than those that only query full sentences. This improvement is furthered by generalising existing acquisition functions (§ 4.1) for use with sequential data. We test our approaches on two NER datasets, OntoNotes 5.0 and CoNLL 2003. On OntoNotes 5.0, Shen et al. (2017) achieve state-of-the-art performance with 25% of the original dataset querying full sentences, while we require only 13% of the dataset querying subsequences. On CoNLL 2003, we show that the AL strategy of Shen et al. (2017) requires 50% of the dataset to achieve the same results as training on the full dataset, while ours requires only 27%.

The contributions of this paper are:

1. Improving the efficiency of AL for NER by allowing the querying of subsequences over full sentences;

2. An entity based analysis demonstrating that subsequence querying AL strategies tend to query more relevant tokens (i.e., tokens belonging to entities);

3. An uncertainty analysis of the queries made by both full sentence and subsequence querying methods, demonstrating that querying full sentences leads to selecting more tokens of which the model is already certain.

2 Related Work

AL algorithms aim to query information rich data points to annotators in order to improve the performance of the model in a data efficient way. Traditionally these algorithms choose data points which lie close to decision boundaries (Pinsler et al., 2019), where uncertainty is high, in order for the model to learn more useful information. These measures of uncertainty, computed through acquisition functions, are therefore vital to AL. Key functions include predictive entropy (MaxEnt) (Gal et al., 2017), mutual information between model posterior and predictions (BALD) (Houlsby et al., 2011; Gal et al., 2017), and the certainty of the model when making label predictions (here called LC) (Mingkun Li and Sethi, 2006). These techniques ensure all instances used for training, painstakingly labelled by experts, have maximum impact on model performance. There has been exploration of uncertainty and deep learning based AL for NER (Chen et al., 2015; Shen et al., 2017; Settles and Craven, 2008; Fang et al., 2017). These approaches, however, treat each sentence as a single query instead of a collection of individually labelled tokens. In these methods, the acquisition functions that score sentences aggregate token-wise scores (through summation or averaging).

Other works forgo this aggregation, querying single tokens at a time (Tomanek and Hahn, 2009; Wanvarie et al., 2011; Marcheggiani and Artieres, 2014). These works show that AL for NER can be improved by taking the single token as a unit query, and use semi-supervision (Reddy et al., 2018; Iscen et al., 2019) to train on partially labelled sentences (Muslea et al., 2002). However, querying single tokens is inapplicable in practice because either: a) annotators have access to the full sentence when queried but can only label one token, which would lead to frustration as they are asked to read the full sentence but annotate only a single token; or b) annotators only have access to the token of interest, which means that they would not have enough information to label tokens differently based on their context, leading to annotators labelling any unique token with the same label. Moreover, if the latter approach were somehow possible, we would be able to reduce the annotation effort to the annotation of only the unique tokens forming the dataset, i.e. its dictionary. Furthermore, all of these past works use Conditional Random Fields (CRFs) (Lafferty et al., 2001), which have since been surpassed as the state of the art for NER (and most NLP tasks) by deep learning models (Devlin et al., 2019).

In this work we follow the approach where annotators only have access to subsequences of multiple tokens. However, instead of making use of single tokens, we query more than one token at a time, providing enough context to the annotators. This allows the propagation of these annotations to identical subsequences in the dataset, further reducing the total annotation effort.


3 Background

3.1 Active Learning Algorithms

Most AL strategies are based on a repeating score, query, and fine-tune cycle. After initially training an NER model with a small pool of labelled examples, the following is repeated: (1) score all unlabelled instances, (2) query the highest scoring instances and add them to the training set, and (3) fine-tune the model using the updated training set (Huang et al., 2014).

To describe this further, notation and the proposed training process are introduced, with details in the following sections. First, the sequence tagging dataset, denoted by $\mathcal{D} = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$, consists of a collection of sentences and ground truth labels. The $i$-th token of the $n$-th sentence, $x^{(n)}_i$, has a label $y^{(n)}_i = c$ with $c$ belonging to $C = \{c_1, \ldots, c_K\}$. We also differentiate between the labelled and unlabelled datasets, $\mathcal{D}_L$ and $\mathcal{D}_U$, which are initially empty and equal to $\mathcal{D}$, respectively. Finally, we fix $A$ as the total number of tokens queried in each iteration.
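For concreteness, a minimal sketch of this bookkeeping in Python follows; the `Sentence` container and the example label are our own illustration, not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sentence:
    tokens: List[str]             # x^(n)
    labels: List[Optional[str]]   # y^(n); None marks a still-unlabelled token

# D_U starts as D with every label hidden; D_L starts empty.
D = [Sentence("Yassir is going to the shop to buy shoes .".split(),
              ["B-PERSON"] + ["O"] * 9)]   # hypothetical gold labels
D_U = [Sentence(s.tokens, [None] * len(s.tokens)) for s in D]
D_L: List[Sentence] = []
A = 10_000  # tokens queried per iteration (the OntoNotes 5.0 setting of § 5.4)
```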

3.2 Acquisition Functions

Instances in the unlabelled pool are queried using an acquisition function. This function aims to quantify the uncertainty of the model when generating predictive probabilities over possible labels for each instance. Instances with the highest predictive uncertainty are deemed the most informative for model training. Previously used acquisition functions such as Least Confidence (LC) and Maximum Normalized Log-Probability (MNLP) (Shen et al., 2017; Chen et al., 2015) are generalised for variable length sequences. Letting $y^{(n)}_{<i}$ be the history of predictions prior to the $i$-th input, the next output probability will be $p^{(n)}_{i,c} = P(y^{(n)}_i = c \mid y^{(n)}_{<i}, \mathbf{x}^{(n)})$. Then, we define the token-wise LC score as:

$$\mathrm{LC}^{(n)}_i = -\max_{c \in C} \log p^{(n)}_{i,c}. \quad (1)$$

The LC acquisition function for sequences is then defined as:

$$\mathrm{LC}\big(x^{(n)}_1, \ldots, x^{(n)}_\ell\big) = \sum_{j=1}^{\ell} \mathrm{LC}^{(n)}_j, \quad (2)$$

and, for MNLP, as:

$$\mathrm{MNLP}\big(x^{(n)}_1, \ldots, x^{(n)}_\ell\big) = \frac{1}{\ell} \sum_{j=1}^{\ell} \mathrm{LC}^{(n)}_j. \quad (3)$$

Note that this is similar to LC except for the normalization factor $1/\ell$. The formulation above can be applied to other commonly used acquisition functions such as Maximum Entropy (MaxEnt) (Gal et al., 2017) by simply defining:

$$\mathrm{ME}^{(n)}_i = -\sum_{c \in C} p^{(n)}_{i,c} \log p^{(n)}_{i,c}, \quad (4)$$

as the token score. Given the task of quantifying uncertainty amongst the unlabelled pool of data, both of these metrics, LC and MaxEnt, provide intuitive interpretations. Eq. (1) scores highly tokens for which the predicted label has the lowest confidence, while eq. (4) scores highly tokens for which the whole probability mass function has higher entropy. Both therefore score more uniform predictive distributions more highly, which indicates underlying uncertainty.
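To make the token-wise scores concrete, here is a minimal sketch (our own, not the authors' code) computing eq. (1) and eq. (4) from a matrix of per-token predictive distributions:

```python
import numpy as np

def token_scores(probs: np.ndarray) -> tuple:
    """Token-wise LC (eq. 1) and MaxEnt (eq. 4) scores.

    probs: array of shape (sequence_length, num_classes), each row a
    predictive distribution p_{i,c} over the K labels.
    """
    eps = 1e-12  # guard against log(0)
    lc = -np.log(probs.max(axis=1) + eps)            # eq. (1)
    me = -(probs * np.log(probs + eps)).sum(axis=1)  # eq. (4)
    return lc, me

# A near-uniform distribution scores higher than a confident one:
probs = np.array([[0.96, 0.02, 0.02],   # confident token
                  [0.40, 0.35, 0.25]])  # uncertain token
lc, me = token_scores(probs)
print(lc)  # LC is larger for the second (uncertain) token
print(me)  # so is the entropy score
```

Both scores peak for near-uniform rows, matching the interpretation above.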

Finally, given the similarity of performance between MNLP and Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011) in NER tasks (Shen et al., 2017), and the computational complexity required to calculate BALD with respect to the other acquisition functions, we will not compare against BALD.

4 Subsequence Acquisition

In this section we describe how we build on past works, and the core contribution of this paper. Our work forms a more flexible AL algorithm that operates on subsequences, as opposed to full sentences (Shen et al., 2017). This is achieved by generalising acquisition functions for subsequences (§ 4.1), scoring and querying subsequences within sentences (§ 4.2), and performing label propagation on unseen sentences to avoid multiple annotations of repeated subsequences (§ 4.3).

4.1 Subsequence Acquisition Functions

Since this work focuses on the querying of subsequences, we generalize the previously defined LC and MNLP to define a family of acquisition functions applicable to both full sentences and subsequences:

$$\mathrm{LC}_\alpha\big(x^{(n)}_{i+1}, \ldots, x^{(n)}_{i+\ell}\big) = \frac{1}{\ell^\alpha} \sum_{j=i+1}^{i+\ell} \mathrm{LC}^{(n)}_j. \quad (5)$$

Special cases are $\alpha = 0$ and $\alpha = 1$, which return the original definitions of LC in eq. (2) and MNLP in eq. (3).


As noted by Shen et al. (2017), LC for sequences biases acquisition towards longer sentences. The tuneable normalisation factor in eq. (5) over the sequence of scores mediates the balance of shorter and longer subsequences selected. This generalisation can be applied to other types of commonly used acquisition functions, such as MaxEnt and BALD, by modifying the token-wise score.
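As a sketch (our own naming), eq. (5) amounts to a one-line normalised sum; $\alpha = 0$ recovers eq. (2) and $\alpha = 1$ recovers eq. (3):

```python
import numpy as np

def lc_alpha(token_lc: np.ndarray, alpha: float) -> float:
    """Eq. (5): token-wise LC scores summed, normalised by length**alpha."""
    return float(token_lc.sum() / len(token_lc) ** alpha)

scores = np.array([0.78, 0.83, 0.60])        # token-wise LC of a length-3 span
print(lc_alpha(scores, alpha=0.0))            # eq. (2), plain sum: ~2.21
print(round(lc_alpha(scores, alpha=1.0), 2))  # eq. (3)/MNLP, mean: 0.74
```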

4.2 Subsequence Selection

Each sentence $\mathbf{x}^{(n)}$ can be broken into a set of subsequences $S^{(n)} = \{(x^{(n)}_i, \ldots, x^{(n)}_j) \mid \forall i < j\}$, where all elements $s \in S^{(n)}$ can be efficiently scored by first computing the token scores, then aggregating as required. Once this has been done for all sentences in $\mathcal{D}_U$, a query set $S_Q \subset \cup_n S^{(n)}$ of non-overlapping (mutually disjoint) subsequences is found. The requirement of non-overlapping subsequences avoids the problem of relabelling tokens, but disallows simply choosing the highest scoring subsequences (since these can overlap). Instead, at each round of querying, we perform a greedy selection, repeatedly choosing the highest scoring subsequence that does not overlap with previously selected subsequences. Adjustments can be made to reflect practical needs, such as restricting the length $\ell$ of the viable subsequences to $[\ell_{min}, \ell_{max}]$. This is because longer subsequences are easier to label, while shorter subsequences are more efficient in querying uncertain tokens, and so the selection is only allowed to operate within these bounds.

Additionally, it is easy to imagine a scenario in which a greedy selection method does not select the maximum total score that can be generated from a sentence. This scenario is illustrated in Table 1, where lengths are restricted to $\ell_{min} = \ell_{max} = 3$ for simplicity. Note that tokens can become unselectable in future rounds because they are not inside a span of unlabelled tokens of at least size $\ell_{min}$. When the algorithm has queried all subsequences of this size range, it starts to query shorter subsequences by relaxing the length constraint. However, in practice, model performance on the validation set converges before all subsequences of valid length have been exhausted. Nonetheless, when choosing subsequences of size $[\ell_{min}, \ell_{max}] = [4, 7]$, these will be exhausted when roughly 90% and 80% of tokens have been labelled for the OntoNotes 5.0 and CoNLL 2003 datasets, respectively.
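A minimal sketch of this greedy, non-overlapping selection over a single sentence, under our own data layout (per-token scores with NaN marking already-labelled tokens); this is an illustration, not the authors' implementation:

```python
import numpy as np

def greedy_select(token_lc: np.ndarray, l_min: int, l_max: int, budget: int):
    """Pick non-overlapping spans by mean token score (LC_alpha with alpha = 1).

    token_lc holds per-token LC scores, with np.nan for labelled tokens.
    Returns (start, end) pairs, end exclusive, highest scoring first.
    """
    n = len(token_lc)
    candidates = []
    for ell in range(l_min, l_max + 1):
        for i in range(n - ell + 1):
            window = token_lc[i:i + ell]
            if not np.isnan(window).any():          # span must be fully unlabelled
                candidates.append((float(window.mean()), i, i + ell))
    candidates.sort(reverse=True)                   # greedy: best score first
    taken = np.zeros(n, dtype=bool)
    queries, spent = [], 0
    for _, i, j in candidates:
        if spent >= budget:
            break
        if not taken[i:j].any():                    # enforce mutual disjointness
            taken[i:j] = True
            queries.append((i, j))
            spent += j - i
    return queries

# The token-wise scores of Table 1 (0-indexed; NaN = already labelled):
lc = np.array([3.22, np.nan, np.nan, np.nan, 0.41, 0.78, 0.83, 0.60, 0.27, 0.50])
print(greedy_select(lc, l_min=3, l_max=3, budget=6))  # [(5, 8)]: "shop to buy"
```

On the Table 1 scores with $\ell_{min} = \ell_{max} = 3$, the first pick is tokens 6-8 ("shop to buy"), after which every remaining length-3 window overlaps it: exactly the 'trapping' behaviour described below.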

4.3 Subsequence Label Propagation

Since a subsequence querying algorithm can result in partially labelled sentences, it raises the question of how unlabelled tokens should be handled. In previous work based on the use of CRFs (Tomanek and Hahn, 2009; Wanvarie et al., 2011; Marcheggiani and Artieres, 2014), this was solved by using semi-supervision on tokens for which the model showed low uncertainty. However, for neural networks, the use of model generated labels could lead to the model becoming over-confident, harming performance and biasing uncertainty scores (Arazo et al., 2020). Hence, we ensure that backpropagation only occurs from labelled tokens.

Our final contribution to the AL algorithm is the use of another semi-supervision strategy, where we propagate uniquely labelled subsequences in order to minimise the number of annotations needed. When queried for a subsequence, the annotator (in this case an oracle) is not given the contextual tokens in the remainder of the sentence. For this reason, given an identical subsequence, a consistent annotator will provide the same labels. Therefore, the proposed algorithm maintains a dictionary that maps previously queried subsequences to their provided labels. Once a queried subsequence and its label are added to the dictionary, all other matching subsequences in the unlabelled pool are given the same, but temporary, labels.

The tokens retain these temporary labels until they are queried themselves. After scoring and ranking members of $S$, the algorithm will disregard sequences that exactly match members of this dictionary, which is updated during the querying round. However, if tokens belonging to these previously seen subsequences are encountered in a different context, meaning as part of a different subsequence, they may also be queried. For example, in Table 1, if the subsequence "shop to buy" (tokens 6-8) had been previously queried elsewhere in the dataset, it will not be considered for querying, as it retains its temporary labels. Instead, the overlapping subsequence "the shop to" (tokens 5-7) could be queried, in which case the temporary labels of tokens 6 and 7 will be overwritten by new, permanent labels.

Therefore, the value of $\ell_{min}$ becomes a trade-off between the improved resolution of the acquisition function, and the erroneous propagation of shorter, more frequent label subsequences to identical ones in different contexts.
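A minimal sketch of the propagation dictionary described above, with our own data layout (token tuples as keys); an illustration rather than the authors' implementation:

```python
from typing import Dict, List, Tuple

# Queried subsequences map to their oracle labels; any identical token span
# elsewhere in the unlabelled pool receives the same labels temporarily.
B: Dict[Tuple[str, ...], Tuple[str, ...]] = {}

def record_query(tokens: Tuple[str, ...], labels: Tuple[str, ...]) -> None:
    B[tokens] = labels  # permanent labels for this exact subsequence

def propagate(sentence: List[str], permanent: List[bool]) -> List:
    """Assign temporary labels to unqueried spans that match a key of B."""
    temp = [None] * len(sentence)
    for key, labels in B.items():
        k = len(key)
        for i in range(len(sentence) - k + 1):
            span_free = not any(permanent[i:i + k])
            if span_free and tuple(sentence[i:i + k]) == key:
                temp[i:i + k] = labels
    return temp

record_query(("shop", "to", "buy"), ("O", "O", "O"))
print(propagate("I went to the shop to buy fruit .".split(), [False] * 9))
# [None, None, None, None, 'O', 'O', 'O', None, None]
```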


j           1       2    3      4    5     6     7     8     9      10
x^(n)_j     Yassir  is   going  to   the   shop  to    buy   shoes  .
y^(n)_j     X       O    O      O    X     X     X     X     X      X
lc^(n)_j    3.22    -    -      -    0.41  0.78  0.83  0.60  0.27   0.50

Subsequence   5-7: "the shop to"   6-8: "shop to buy"   7-9: "to buy shoes"   8-10: "buy shoes ."
LC_1 score    0.67                 0.74                 0.57                  0.46

Table 1: The subsequences from a sentence, using $\ell_{min} = \ell_{max} = 3$, $\alpha = 1$. Besides the token index $j$, the top three rows show the tokens, labels, and token-wise scores. If $y^{(n)}_j = $ X, the corresponding token is unlabelled, hence its score is considered when selecting the next query. Below these, the subsequences constituting $S^{(n)}$ are displayed with their $\mathrm{LC}_1$ scores. In this case "shop to buy" will be chosen, since it maximises $\mathrm{LC}_1$, but 'traps' its surrounding tokens until $\ell_{min}$ is lowered to 2 and "shoes ." may be considered.

4.4 Subsequence Active Learning Algorithm

Finally, we summarise the proposed AL algorithm. Given a set of unlabelled data $\mathcal{D}_U$, we initially randomly select a proportion of sentences from $\mathcal{D}_U$, label them, and add these to $\mathcal{D}_L$. A dictionary $B$ is also initialised. Using these labelled sentences, we train a model. Then, the following proposed training cycle is repeated until $\mathcal{D}_U$ is empty (or an early stopping condition is reached); a sketch of the loop follows the list.

1. Find all consecutive unlabelled subsequences in $\mathcal{D}_U$, and score them using a pre-defined acquisition function.

2. Select the top scoring non-overlapping subsequences $S_Q$ that do not appear in $B$, such that the number of tokens in $S_Q$ is $A$, and query them to the annotators. Update $\mathcal{D}_L$ and $\mathcal{D}_U$. As each sequence is selected, add it to $B$, mapping it to its true labels.

3. Provide all occurrences of the keys of $B$ in $\mathcal{D}_U$ with their corresponding temporary labels. These will not be included in $\mathcal{D}_L$ as they are temporary.

4. Finetune the model on sentences with any label, temporary and permanent.

Repeat this process until convergence.
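One plausible rendering of this cycle, with the per-step work injected as callables; all names here are ours, standing in for the steps above rather than reproducing the authors' code:

```python
from typing import Callable, Dict, List, Tuple

Tokens = Tuple[str, ...]

def al_cycle(score_candidates: Callable[[], List[Tuple[float, Tokens]]],
             select_disjoint: Callable[[List[Tuple[float, Tokens]], int], List[Tokens]],
             oracle: Callable[[Tokens], Tokens],
             propagate: Callable[[Dict[Tokens, Tokens]], None],
             finetune: Callable[[], None],
             budget: int, rounds: int) -> Dict[Tokens, Tokens]:
    """Steps 1-4 above; D_L/D_U bookkeeping lives inside the callables."""
    B: Dict[Tokens, Tokens] = {}
    for _ in range(rounds):
        scored = score_candidates()                     # 1. score unlabelled subsequences
        unseen = [st for st in scored if st[1] not in B]
        for tokens in select_disjoint(unseen, budget):  # 2. top disjoint spans, A tokens total
            B[tokens] = oracle(tokens)                  #    annotate and record in B
        propagate(B)                                    # 3. temporary labels across D_U
        finetune()                                      # 4. train on permanent + temporary labels
    return B
```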

5 Experimental Setup

5.1 Datasets

As in previous works (Shen et al., 2017), we use the two following NER datasets:

OntoNotes 5.0. This dataset is used to compare results with the full sentence querying baseline (Weischedel et al., 2013), and comprises text from news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, and talk shows. It is a BIO formatted dataset with a total of $K = 37$ classes and 99,333 training sentences, with an average sentence length of 17.8 tokens in its training set.

CoNLL 2003. This dataset, also in BIO format, has only 4 entity types (LOC, MISC, PER, ORG), resulting in $K = 9$ labels (Tjong Kim Sang and De Meulder, 2003). It is made from a collection of news wire articles from the Reuters Corpus (Lewis et al., 2004). The average sentence length is 12.6 tokens in its training set.

A full list of class types and entity lengths and frequencies for both datasets can be found in the Appendix.

5.2 NER Model

Following the work of Shen et al. (2017), a CNN-CNN-LSTM model with combined letter- and token-level embeddings was used; see the Appendix for an overview of the model, hyperparameter settings, and validation. Furthermore, the AL algorithm used in Shen et al. (2017) will serve as one of the baselines, following the same procedure. This represents an algorithm equivalent to the one proposed, but which can only query full sentences, and does not use label propagation.

5.3 Model Training and Evaluation

As the evaluation measure we use the F1 score. After the first round of random subsequence selection, the model is trained. After subsequent selections the model is finetuned: training is resumed from the previous round's parameters. In all cases, model training was stopped either after 30 epochs were completed, or if the F1 score for the validation set had monotonically decreased for 2 epochs. This validation set is made up of a randomly selected 1% of the sentences of the original training set. After finetuning, the model reloads its parameters from the round-optimal epoch, and its performance is evaluated on the test set. Furthermore, the AL algorithms were also stopped after all hyperparameter variations using that dataset and acquisition function family had converged to the same best F1, which we denote with $F_1^*$. For the OntoNotes 5.0 dataset, the $F_1^*$ value was achieved after 30% of the training set was labelled, and for the CoNLL 2003 dataset after 40%.

[Figure 1: F1 score on the test set achieved each round using round-optimal model parameters: (a) $\mathrm{LC}_\alpha$ for the OntoNotes 5.0 NER dataset; (b) $\mathrm{LC}_\alpha$ for the CoNLL 2003 NER dataset. Curves compare full sentence (FS) and subsequence (SUB) querying for several values of $\alpha$, random selection, and no AL. All subsequence experiments here use $\ell_{min} = 4$, $\ell_{max} = 7$. Each curve is averaged over 10 runs.]

5.4 Active Learning Setup & Evaluation

We choose $\ell_{min} = 4$ to give a realistic context to the annotator, and to avoid a significant propagation of common subsequences. The upper bound of $\ell_{max} = 7$ was chosen to ensure subsequences were properly utilised, since the average sentence length of both datasets is roughly twice this size. For the OntoNotes 5.0 dataset, $A = 10,000$ tokens are queried every round, whereas for the CoNLL 2003 dataset $A = 2,000$. These represent roughly 0.5% and 1% of the available training set, respectively.

We evaluate the efficacy and efficiency of the tested AL strategies in three ways. First, model performance over the course of the algorithm is evaluated using the end of round F1 score on the test set. We compare the proportion of the dataset's tokens labelled when the model achieves 99% of the $F_1^*$ score (i.e., $0.99 \times F_1^*$). We also quantify the rate of improvement of model performance during training using the normalised Area Under the Curve (AUC) score of each F1 test curve. The normalisation ensures that the resulting AUC score is in the range $[0, 1]$, and is achieved by dividing the AUC score by the size of the dataset. This implies that methods that converge faster to their best performance will have a higher normalised AUC. Second, we consider how quickly the algorithms can locate and query relevant tokens (named entities). Third, we evaluate their ability to extract the most uncertain tokens from the unlabelled pool.
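As a sketch of this metric under our reading (trapezoidal AUC over the fraction-labelled axis, normalised by the x-range so that a curve pinned at $F_1 = 1$ would score exactly 1); the data points here are made up for illustration:

```python
import numpy as np

def normalised_auc(fraction_labelled: np.ndarray, f1: np.ndarray) -> float:
    """Trapezoidal AUC of the F1 curve, normalised into [0, 1]."""
    auc = np.trapz(f1, fraction_labelled)
    return float(auc / (fraction_labelled[-1] - fraction_labelled[0]))

x = np.array([0.01, 0.05, 0.10, 0.20, 0.30])   # proportion of tokens labelled
f1 = np.array([0.45, 0.70, 0.78, 0.82, 0.84])  # hypothetical test-set F1
print(round(normalised_auc(x, f1), 3))         # faster convergence -> higher score
```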

6 Results & Discussion

6.1 Active Learning Performance

Figure 1 shows the $\mathrm{LC}_\alpha$ performance curves for $\alpha = 0$, $\alpha = 1$, and the best performing value for each acquisition class (based on the normalised AUC score, Table 3) for full sentence querying (FS), and only the best performing $\alpha$ values for subsequence querying (SUB). The figure also shows the performance of training on the complete training set (No AL), and when both sentences and subsequences are randomly selected by the acquisition function. The equivalent figures for $\mathrm{MaxEnt}_\alpha$ are available in the Appendix, and follow similar trends. The performance of each curve, quantified in terms of the normalised AUC, is summarised in Table 3.

Table 2 shows further analysis of the best results in Figure 1, with best referring to the acquisition function and optimal $\alpha$. These results first show that subsequence querying methods are more efficient than querying full sentences, achieving their final F1 with substantially less annotated data, and with higher normalised AUC scores. For OntoNotes 5.0, querying subsequences reduces the final proportion required by 38.8%. For CoNLL 2003, this reduction is 36.6%. Altogether, subsequence querying holds improved efficiency over the full sentence querying baseline.


           F1 Score          Final Frac. of the Dataset
           100%     AL       FS       SUB
ON 5.0     0.829    0.843    22%      13%
CoNLL      0.930    0.938    42%      27%

Table 2: Summary of the results of the AL strategies from Figure 1, when the models are trained using 100% of the training set and with active learning (AL), using the best hyperparameter setting of the acquisition function for full sentence and subsequence querying, based on normalised AUC score.

As a point of interest, full sentence querying can be easily improved by optimising $\alpha$ alone. For the OntoNotes 5.0 dataset, using $\mathrm{LC}_1$, 24.2% of tokens are required to achieve $F_1^*$. This, however, can be improved by 9.33%, to only requiring 22.0%, by choosing $\alpha = 0.7$. For CoNLL 2003, using $\mathrm{LC}_1$ for full sentences, 50.0% of the dataset was required, but when using $\mathrm{LC}_{0.7}$, only 40.7% of the tokens.

6.2 Entity Recall

This section and the next aim to understand some of the underlying mechanisms that allow the subsequence querying methods to achieve results substantially better than the full sentence baseline, namely the ability of the different methods to extract the tokens the model is most uncertain about. Given that the majority of tokens in both datasets have the same label, "O", signifying no entity, it is likely that tokens belonging to entities, particularly rarer classes, trigger higher model uncertainty. Querying full sentences at a time, the AL algorithm will spend much of its token budget for that round labelling non-entity tokens while attempting to locate the more informative entities. Subsequence querying methods, not faced with this wasteful behaviour, allow the AL algorithm to query entity tokens more quickly, locating and labelling the majority of entity tokens faster over the course of training.

The proportion of entity tokens that the AL algorithm has queried is plotted against the round number in Figure 2 for OntoNotes 5.0. For both datasets, the random querying methods query a distribution of token classes that reflects the dataset at large, producing roughly linear curves in this figure. Curves for all methods that employ an uncertainty based acquisition function are concave, and the AUC reflects the ranking of model performance for each querying method. This relation suggests that, shortly after initialisation, better performing algorithm variations query entity tokens faster. In later stages of finetuning this rate is reduced, likely because, after labelling a large proportion of them, the remaining entity tokens cause little uncertainty for the model. In a practical setting where querying may have to be stopped before model performance has converged (e.g. due to the accumulated cost of annotations), it is greatly beneficial to ensure that the model is exposed to a high number of relevant tokens, because this increases the likelihood of locating entity tokens belonging to underrepresented classes at an early stage.

[Figure 2: Proportion of tokens that belong to entities labelled, against the round number.]

6.3 Uncertainty Score Analysis

Finally, this section compares the scores of tokens in the queried set $S_Q$ for each querying method. Comparing the distribution and development of these scores provides direct insight into the core assumptions of why full sentence querying is outperformed. Figure 3 shows the difference in score distributions for sentence versus subsequence querying, against querying round number, for rounds preceding model performance convergence. First, it is seen that decreasing the individual query size (full sentence to subsequence) increases the median uncertainty extracted at the earlier rounds. Second, Figure 3 provides evidence for the mechanism suggested earlier: aggregating the token scores across full sentences means querying both the highly uncertain tokens, and the tokens that provide little uncertainty. Querying high scoring sentences like this can cause a distribution with two peaks, as seen in the figure.


Dataset         Acquisition    Full Sentence                       Subsequence
                Function       α = 0    α = 1    Optimal (α)       α = 0    α = 1    Optimal (α)
OntoNotes 5.0   LC_α           0.794    0.802    0.804 (0.7)       0.817    0.812    0.818† (0.1)
                MaxEnt_α       0.791    0.803    0.803 (1.0)       0.815    0.813    0.816† (0.5)
                Random         0.734                               0.769
CoNLL 2003      LC_α           0.857    0.875    0.879 (0.7)       0.885    0.883    0.892† (1.0)
                MaxEnt_α       0.841    0.882    0.882 (1.0)       0.881    0.883    0.891† (0.9)
                Random         0.824                               0.859

Table 3: Normalised AUC scores for model performance (F1 score on the test set) for α = 0, α = 1, and the optimal value of α in each case. Each pair of differences between the optimised acquisition function for full sentences and subsequences (indicated by a †) is significantly different (two-sided unpaired t-test, p-value < 0.05).

[Figure 3: Distributions of the queried token-wise LC scores (FS vs. SUB) for the OntoNotes 5.0 dataset, made on the 1st, 5th, 10th, and 15th scoring rounds. This corresponds to scores after training on 1%, 3.2%, 6.1%, and 9.0% of the utilised training set.]

As the model becomes increasingly certain about its predictions, high scores become localised within smaller subsequences, and the coarse sensitivity of full sentence querying means it forfeits all the higher scoring tokens. These differences were also observed when comparing against subsequence querying methods with sub-optimal $\alpha$.

Figure 3 only analyses behaviour up to the point where 9% of the training set's tokens have been queried. Figure 4 instead shows how the mean of the token-wise scores evolves for the different querying methods on the OntoNotes 5.0 dataset until convergence. This clearly shows that subsequence querying methods converge faster over the full course of the algorithm compared to full sentence querying. This is consistent with Figure 1 in terms of the initial rate and final time of model performance convergence, namely that model performance plateaus alongside the uncertainty score.

[Figure 4: Average value of LC for all tokens in $S_Q$, with confidence intervals, against round number. Score values are averaged over all tested values of $\alpha$.]

Keeping track of query scores like this is also a reasonable idea in industrial applications. When training on a very semantically specific corpus, there may not be enough fully labelled sentences to build a test set. In that case, observing the progress of score convergence can be used as an early stopping method for the AL algorithm (Zhu et al., 2010).
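One simple reading of such a stopping rule is sketched below; the patience and tolerance values are ours, not drawn from Zhu et al. (2010) or from the paper:

```python
def score_converged(mean_round_scores, patience=3, tol=0.01):
    """Stop once the mean queried LC score has not dropped by more than
    `tol` for `patience` consecutive rounds (illustrative thresholds)."""
    if len(mean_round_scores) <= patience:
        return False
    recent = mean_round_scores[-(patience + 1):]
    return all(prev - cur <= tol for prev, cur in zip(recent, recent[1:]))

history = [1.90, 1.20, 0.80, 0.520, 0.518, 0.517, 0.517]  # mean LC per round
print(score_converged(history))  # True: the queried uncertainty has plateaued
```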

7 Conclusion & Future Work

In this study we have employed subsequence querying methods to improve the efficiency of AL for NER tasks. We have seen that these methods outperform full sentence querying in terms of the annotations required for optimal model performance, requiring 38.8% and 36.6% fewer tokens for the OntoNotes 5.0 and CoNLL 2003 datasets. Optimal results for subsequence querying (and full sentence querying) were achieved by generalising previously used AL acquisition functions, defining a larger family of acquisition functions for sequential data.

The analysis of § 6.3 suggests that full sentence querying causes noisy acquisition due to the tokens in the queried sentences that were not highly scored. This added noise reduces budget efficiency, and a subsequence querying method eliminates a large part of this effect. This efficiency also translated into a faster recall of the named entities in the dataset to be queried (§ 6.2).

Limitations and future work: The limitations of this study are largely centred on the use of an oracle to provide tokens with their labels. With human annotators, the cropped context of subsequence queries may make them produce more inaccuracies than when annotating full sentences. We leave to future work the evaluation of these querying methods with human annotators. Such studies will help reveal how context affects label accuracy; how this, in turn, affects the optimal hyperparameters of the subsequence selection process (such as optimal query length); what further accommodations must be made to effectively optimise worker efficiency; and how to deal with unreliable labels. There are also ways to incorporate model generated labelling methods for more robust semi-supervision into our framework, which we leave to future work. Finally, there are examples of other tasks on structured data, such as audio, video, and image segmentation, where part of an instance may be queried. A generalisation of the strategy demonstrated for the NER case may allow for more efficient active learning querying methods for these other types of data.

References

Eric Arazo, Diego Ortego, Paul Albert, Noel E. O'Connor, and Kevin McGuinness. 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Yukun Chen, Thomas A. Lasko, Qiaozhu Mei, Joshua C. Denny, and Hua Xu. 2015. A study of active learning methods for named entity recognition in clinical text. Journal of Biomedical Informatics, 58:11–18.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 595–605, Copenhagen, Denmark. Association for Computational Linguistics.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1183–1192, International Convention Centre, Sydney, Australia. PMLR.

Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Mate Lengyel. 2011. Bayesian active learning for classification and preference learning.

S. Huang, R. Jin, and Z. Zhou. 2014. Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):1936–1949.

Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Yann LeCun and Yoshua Bengio. 1998. Convolutional Networks for Images, Speech, and Time Series, pages 255–258. MIT Press, Cambridge, MA, USA.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Two/too simple adaptations of Word2Vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304, Denver, Colorado. Association for Computational Linguistics.

Diego Marcheggiani and Thierry Artieres. 2014. An experimental comparison of active learning strategies for partially labeled sequences. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 898–906, Doha, Qatar. Association for Computational Linguistics.

Mingkun Li and I. K. Sethi. 2006. Confidence-based active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1251–1261.

Ion Muslea, Steven Minton, and Craig A. Knoblock. 2002. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pages 435–442, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and Jose Miguel Hernandez-Lobato. 2019. Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, volume 32, pages 6359–6370. Curran Associates, Inc.

Y. Reddy, Viswanath Pulabaigari, and Eswara B. 2018. Semi-supervised learning: a brief review. International Journal of Engineering Technology, 7:81.

Jeffrey Rzeszotarski, Ed Chi, Praveen Paritosh, and Peng Dai. 2013. Inserting micro-breaks into crowdsourcing workflows. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 1(1).

Burr Settles. 2011. From theories to queries: Active learning in practice. In Active Learning and Experimental Design Workshop, in conjunction with AISTATS 2010, volume 16 of Proceedings of Machine Learning Research, pages 1–18, Sardinia, Italy. JMLR Workshop and Conference Proceedings.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 1070–1079, USA. Association for Computational Linguistics.

Yanyao Shen, Hyokun Yun, Zachary Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep active learning for named entity recognition. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 252–256, Vancouver, Canada. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Katrin Tomanek and Udo Hahn. 2009. Semi-supervised active learning for sequence labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1039–1047, Suntec, Singapore. Association for Computational Linguistics.

Simon Tong and Daphne Koller. 2002. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66.

K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. 2017. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600.

Dittaya Wanvarie, Hiroya Takamura, and Manabu Okumura. 2011. Active learning with subsequence sampling strategy for sequence labeling tasks. Journal of Natural Language Processing, 18(2):153–173.

Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2015. Submodularity in data subset selection and active learning. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1954–1963, Lille, France. PMLR.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0.

Jingbo Zhu, Huizhen Wang, Eduard Hovy, and Matthew Ma. 2010. Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing, 6(3).


A Model Architecture

The model architecture is built of three sections. A character-level convolutional neural network (CNN) (LeCun and Bengio, 1998) encoder extracts character-level features, $w^{char}_j$, for each token $x^{(i)}_j$ in a sentence. Then, a latent token embedding $w^{emb}_j$ corresponding to that token is generated. The full representation of the token is the concatenation of the two vectors: $w^{full}_j := (w^{char}_j, w^{emb}_j)$. The token-level embeddings, $w^{emb}$, are initialised using word2vec (Ling et al., 2015), and updated during training and finetuning, as per the baseline paper. A second, token-level CNN encoder is used to generate $\{h^{token}_j\}_{j=1}^{\ell_i}$, given the token-level representations $\{w^{full}_j\}_{j=1}^{\ell_i}$. The final token-level encoding is defined by another concatenation: $h^{Enc}_j := (h^{token}_j, w^{full}_j)$.

Finally, a tag decoder is used to generate the token-level pmfs over the $K$ possible token classes: $\{h^{Enc}_j\}_{j=1}^{\ell_i} \xrightarrow{\mathrm{LSTM}} \{y^{(i)}_j\}_{j=1}^{\ell_i}$.
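A minimal PyTorch sketch of this CNN-CNN-LSTM shape follows. It is our own, heavily simplified rendering (single-layer CNNs, no dropout, max-pooling over characters); the paper's exact architecture uses the Table 4 settings and may differ in detail.

```python
import torch
import torch.nn as nn

class CnnCnnLstm(nn.Module):
    """Sketch of the encoder/decoder shape in Appendix A (ours, simplified)."""
    def __init__(self, n_chars, n_words, n_tags,
                 char_dim=50, char_filters=50, word_dim=300, word_filters=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.word_emb = nn.Embedding(n_words, word_dim)   # word2vec-initialised in the paper
        full_dim = char_filters + word_dim                # w_full = (w_char, w_emb)
        self.word_cnn = nn.Conv1d(full_dim, word_filters, kernel_size=3, padding=1)
        enc_dim = word_filters + full_dim                 # h_enc = (h_token, w_full)
        self.decoder = nn.LSTM(enc_dim, enc_dim, batch_first=True)
        self.out = nn.Linear(enc_dim, n_tags)

    def forward(self, chars, words):
        # chars: (batch, seq_len, chars_per_token); words: (batch, seq_len)
        b, s, c = chars.shape
        ch = self.char_emb(chars).view(b * s, c, -1).transpose(1, 2)
        w_char = self.char_cnn(ch).max(dim=2).values.view(b, s, -1)  # per-token char features
        w_full = torch.cat([w_char, self.word_emb(words)], dim=-1)
        h_tok = self.word_cnn(w_full.transpose(1, 2)).transpose(1, 2)
        h_enc = torch.cat([h_tok, w_full], dim=-1)
        out, _ = self.decoder(h_enc)
        return self.out(out).log_softmax(dim=-1)          # token-level log-pmfs

model = CnnCnnLstm(n_chars=100, n_words=5000, n_tags=37)
logp = model(torch.zeros(2, 12, 8, dtype=torch.long), torch.zeros(2, 12, dtype=torch.long))
print(logp.shape)  # torch.Size([2, 12, 37])
```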

B Model & Training Parameters

Table 4 lists the hyperparameter values used to train the NER model. Note that while dropout is used during training, it is turned off when generating the probabilities that contribute to the scoring of the acquisition function. The model was developed using PyTorch, and trained on a Titan RTX.

Hyperparameter                                      Value
Batch size                                          32
Dropout rate for convolutional layers               0.5
Dropout rate for embedding layers                   0.25
Gradient clipping magnitude                         0.35
Character- and token-level CNN kernel size          3
Layers in character- and token-level CNNs           3
Character embedding vector size                     50
Number of filters per character-level CNN layer    50
Number of filters per token-level CNN layer        300
Optimiser type                                      SGD
Optimiser learning rate                             1.0

Table 4: Values of the model and training hyperparameters used throughout the investigation.

C Dataset Analysis

Here, we cluster similar labels in the BIO format, reducing the total $K$ classes to the $K^{(r)} = (K + 1)/2$ class groups $c^{(r)}_1, \ldots, c^{(r)}_{K^{(r)}}$. Therefore, $c^{(r)}_1$ corresponds exactly to $c_1$, the empty label, while $c^{(r)}_k$, $k > 1$, groups the raw labels $c_{2k-2}$ and $c_{2k-1}$.

Figures 5 and 7 show the distribution of these class groups for the OntoNotes 5.0 and CoNLL 2003 datasets respectively. For the former, counts range from 199 tokens for the 'LANGUAGE' class to 46,698 tokens for the 'ORG' class. The full available training set totals 1,766,955 tokens in 99,333 sentences; this is partitioned into a train and validation set during experimentation. A further test set comprises 146,253 tokens in 8,057 sentences. The latter's training set contains 172,210 tokens in 13,689 sentences, and its test set has 42,141 tokens in 3,091 sentences.

[Figure 5: Composition of token classes in the OntoNotes 5.0 English NER training set.]

[Figure 6: Lengths of entities in the OntoNotes 5.0 training set in number of tokens, omitting the empty class 'O'.]


[Figure 7: Composition of token classes in the CoNLL 2003 NER training set.]

[Figure 8: Lengths of entities in the CoNLL 2003 training set in number of tokens, omitting the empty class 'O'.]

D Active Learning Results for Both Datasets

In Figure 9 we show the model performance plotted against the percentage of tokens used as a training set, for all combinations of acquisition functions.

[Figure 9: F1 score on the test set achieved each round using round-optimal model parameters: (a) $\mathrm{LC}_\alpha$ for the OntoNotes 5.0 NER dataset; (b) $\mathrm{MaxEnt}_\alpha$ for the OntoNotes 5.0 NER dataset; (c) $\mathrm{LC}_\alpha$ for the CoNLL 2003 NER dataset; (d) $\mathrm{MaxEnt}_\alpha$ for the CoNLL 2003 NER dataset. All subsequence experiments here use $\ell_{min} = 3$, $\ell_{max} = 6$.]