On active annotation for named entity recognition

Asif Ekbal • Sriparna Saha • Utpal Kumar Sikdar

Received: 15 October 2013 / Accepted: 5 June 2014

Abstract A major constraint of machine learning techniques for solving several information extraction problems is the availability of a sufficient amount of training examples, which involve huge cost and effort to prepare. Active learning techniques select informative instances from the unlabeled data and add them to the training set in such a way that the overall classification performance improves. In the random sampling approach, unlabeled data are selected for annotation at random and thus may not yield the desired results. In contrast, active learning selects the useful data from a huge pool of unlabeled documents. The selection strategies used often assign instances to incorrect classes. The classifier is confused between two classes if the test instance is located near the margin. We propose two methods for active learning, and show that these techniques yield improved performance. The first approach is based on support vector machine (SVM), whereas the second one is based on ensemble learning, which utilizes the classification capabilities of two well-known classifiers, namely SVM and conditional random field (CRF). The motivation for using these classifiers is that they are orthogonal in nature, so a combination of them can produce better results. In order to show the efficacy of the proposed approach we choose a crucial problem, namely named entity recognition (NER), in three languages: Bengali, Hindi and English. The approach is also evaluated for NER in the biomedical domain. Evaluation results reveal that the proposed techniques indeed show considerable performance improvements.

Keywords Named entity recognition (NER) · Active learning · Conditional random field (CRF) · Support vector machine · Classifier ensemble · Biomedical domain

1 Introduction

One of the greatest difficulties in applying machine learning techniques is the availability of a large amount of training data. It is both costly and time consuming to create these labeled examples [1–3]. Active learning (AL) [4] is, nowadays, a popular research area due to its manifold potential benefits. By using active learning techniques we can reduce the amount of manual annotation that is necessary for creating a large training corpus. The strength of active learning lies in the fact that it selects only a subset of tokens which are useful for a given classifier.

Active learning [4–7] optimizes the control of model

growth and it greatly reduces the time and costs involved in

preparing the data as well as the model. Without AL, the

knowledge base models grow with the increase of size of

the already built data set. Active learning selects the most

informative training examples instead of the entire body of

data, thus restricting the amount of learning by the learning

algorithm. In most cases, the predictive accuracies obtained from the resulting models are comparable to those of a standard (exact) learning model. In [5] the authors discuss

the use of active annotation for annotating corpora of

articles about archaeology in the Portale della Ricerca

Umanistica Trentina in the domains of humanities (and in

A. Ekbal • S. Saha • U. K. Sikdar

Department of Computer Science and Engineering, Indian

Institute of Technology Patna, Patna 800 013, India

e-mail: asif@iitp.ac.in; asif.ekbal@gmail.com

S. Saha

e-mail: sriparna@iitp.ac.in; sriparna.saha@gmail.com

U. K. Sikdar

e-mail: utpal.sikdar@iitp.ac.in; utpal.sikdar@gmail.com


other scholarly domains) where there is a great need for

making the best possible use of the annotators, as the

entities mentioned in collections of these scholarly articles

belong to different types from those familiar from news

corpora, hence new resources need to be annotated to

create supervised taggers for tasks such as named entity

extraction. The thesis of Settles [8] focused on active

learning with structured instances and potentially varied

annotation costs. Active learning with support vector

machines (SVMs) and Bayesian networks can be found in

Tong [9]. Some theoretical aspects of active learning for

classification are described in Monteleoni [10]. For named

entity recognition (NER), active learning techniques have

also been used in the past [11]. A good survey of active

learning with its applications to natural language process-

ing (NLP) can be found in [12]. In an interesting study,

Schein and Ungar [13] illustrated that active learning can sometimes require more labeled data than passive learning while utilizing the same model class (here, logistic regression). Baldridge and Palmer [14] found that the performance of active learning depends on the proficiency of the annotator (especially a domain expert). Tomanek and Olsson [15] reported a survey where 91 % of the researchers who

have used active annotations for solving their problems had

their expectations fully or partially met. Dasgupta [16]

determined a variety of theoretical upper and lower bounds

for active learning when a huge collection of unlabeled data is available. Balcan et al. [17] have proved that, asymptotically, some active learning strategies should perform better than supervised learning. Settles and Craven

[18] have developed a large number of active learning

algorithms for sequence labeling tasks using probabilistic

sequence models like conditional random field (CRF).

Reichart et al. [19] developed a two-task active learning

technique for natural language parsing and NER. Two

methods are proposed for actively learning both the tasks.

The first approach is termed as alternating selection where

the parser is used to query sentences in one iteration, and

then the NER system is used to query instances in the next

iteration. In the second strategy named rank combination

both the learners can rank the query candidates in the pool

independently and then the candidates having highest ranks

are selected for active expert annotation.

Some unsupervised learning paradigms are developed in

[20–22]. In [20] a mutual bootstrapping technique is used

to learn from a set of unannotated training texts and a

handful of “seed” words for the semantic category of

interest. At first some extraction patterns are learnt from

the seed words and then these learned patterns are utilized

to detect more words which belong to the same semantic

category. In the second phase authors have devised a sec-

ond level of bootstrapping where the most reliable lexicon

entries generated after application of the first stage are kept

and the process is restarted with enhanced semantic lexi-

con. Results shown by the paper support the fact that this

two-tiered bootstrapping process is less sensitive to noise

than a single level of bootstrapping and generates high-

quality dictionaries. In [21] an unsupervised NER system is

developed using syntactic and semantic contextual evi-

dences. Here a corpus-driven statistical technique was

developed which uses a learning corpus to acquire con-

textual classification clues and then utilizes these results for

classifying unrecognized proper nouns (PN) in an unla-

beled corpus. In order to generate the training examples of

proper nouns they used both rule-based as well as machine

learning based recognizers. But the contextual model of PN

categories can be learnt without using any supervised

information. In [22] authors have developed a system

named KNOWITALL which is used to automate the

complex task of extracting large collections of facts (e.g.,

names of scientists or politicians) from the Web in an

unsupervised, domain-independent, and scalable manner.

After the first execution, KNOWITALL was able to extract

50,000 class instances. But here the challenge is to increase

the recall without sacrificing the precision. In order to

address this challenge three distinct methods are devised.

Pattern learning is used to learn domain-specific extraction rules, which also enables additional extractions. Subclass extraction is used to increase recall by automatically determining subclasses (e.g., chemist and biologist are identified as subclasses of scientist). The third method, named “list extraction”, generates lists of class instances. It then learns

a wrapper for each list, and finally extracts elements of

each list. Authors have evaluated all the methods in con-

nection with KNOWITALL. Application of these three

techniques helps to increase the recall of KNOWITALL.

The active learning for NER that was reported in [23]

focuses on reducing class imbalance. Here the main goal is to generate more balanced data sets using the annotation procedures of AL. Results show that the resultant approaches can

indeed minimize class imbalance and increase the perfor-

mance of classifiers on minority classes while maintaining a

good overall performance in terms of macro F-score. In [24]

authors have developed some active learning techniques to

bootstrap a NER system for a new domain of radio astro-

nomical abstracts. Several committee-based metrics are

evaluated for quantifying the disagreement between the

classifiers built using multiple views. Results show that

appropriate value of the metric can be determined using

simulation experiments with the existing annotated data

collected from the different domains. Final evaluation

reveals that active learning performs much better than a

randomly sampled baseline. In [25] a CRF based active

learning technique is developed which utilizes the concepts

of information density for selecting uncertain samples. This

Int. J. Mach. Learn. & Cyber.


technique is then applied for solving NER in Chinese. Stopping criteria for active learning are studied in [26], where the authors propose three different stopping criteria. Results reveal that, among these three, gradient-based stopping is the best one to stop active learning and achieves near optimal NER performance. In [27] the authors developed a multi-criteria based active learning approach and applied it to NER on two standard corpora. In order to maximize the contribution of the examples chosen by the sample selection technique, multiple criteria are considered: informativeness, representativeness and diversity. Measures are proposed to quantify these values. Two sample selection strategies were developed, which result in lower labeling cost than a single-criterion-based method.

NER is an important task in the field of NLP. It is an

important module in many applications including infor-

mation extraction, information retrieval, machine transla-

tion, question answering and automatic summarization etc.

The main task of NER can be viewed as a combination of

two steps; in the first phase every word/term from the text

has to be identified and in the second phase these are cat-

egorized into groups like person name, location name,

organization name, miscellaneous name (date, time, percentage and monetary expressions etc.) and “none-of-the-above”. The existing works related to Indian language NER are still limited. Some of the reasons for this are: lack of capitalization information, free word order, greater diversity, and the resource-constrained nature of these languages. Among the Indian languages, there are some existing works that cover a few languages like Bengali [28, 29], Hindi [30] and Telugu [31]. As mentioned before, the performance of any supervised system greatly depends on the amount of available annotated data, which is not easy to obtain. Active learning can be an effective way to tackle this issue, as it provides a way to automatically increase the amount of training data. The work

reported in this paper differs from the works reported in

[20–22] in the sense that all these were built based on

unsupervised machine learning. But the current work deals

with an active annotation technique where in each iteration

some tokens are selected using some novel techniques for

which active expert opinion is sought. These tokens along

with the corresponding sentences are added to the training

data, and the system is retrained and evaluated on development/unlabeled data in an iterative fashion.

The explosion of information in the biomedical domain

leads to growing demand for automated biomedical

information extraction techniques [32]. Named entity

(NE) extraction¹ is a fundamental task of biomedical text

mining. Recognizing NEs like mentions of proteins,

DNA, RNA etc. is one of the most important factors in

biomedical knowledge discovery. But the inherently

complex structures of biomedical NEs pose a big challenge for their identification and classification in biomedical information extraction. The literature on biomedical NE extraction is vast, but there is still a wide gap in performance between the systems developed for the traditional news-wire domains and the systems targeting biomedical domains. The major challenges and/or difficulties associated with the identification and classification of biomedical NEs are as follows: (1) building a complete dictionary for all types of biomedical NEs is infeasible due to the generative nature of NEs; (2) NEs are made of very long compounded words (i.e., contain nested entities) or abbreviations and are hence difficult to classify properly; (3) names do not follow any nomenclature; (4) names include different symbols, common words and punctuation symbols, conjunctions, prepositions etc. that make NE boundary identification more difficult and challenging; and (5) the same word or phrase can refer to different NEs based on its context.

In this paper we propose two methods for active

learning. The first one is based on SVM [33]. We eval-

uate our proposed technique for NER in three languages,

namely Bengali, Hindi and English. Bengali and Hindi are two widely spoken languages, ranking fifth and second, respectively, in the world in terms of native speakers. Evaluation results show that our proposed approach in general performs well for the three different datasets. Thereafter this approach is evaluated for NER in the biomedical domain. We identify and implement a variety of

features that are based on orthography, local contextual

information and global contexts. Thereafter we propose

the active learning technique based on the concept of

classifier ensemble, where SVM [33] and CRF [34] are

used as the underlying classification techniques. Based on

the distance from the hyperplane of SVM and the con-

ditional probabilities assigned to each token by CRF, we

select most uncertain samples from the unlabeled data to

be added to the initial training data. The proposed

approach is again evaluated for NER in Bengali, Hindi,

English and biomedical texts. Results show that the ensemble performs better than the individual classifiers.


Some unsupervised models for NER are developed in [20, 35]. Collins and Singer [35] developed two techniques that build a NER system from a small amount of labeled data and a huge collection of unlabeled documents. The first approach describes how to generate rules for NER from a significantly large amount of unlabeled documents. It starts with a seed set of rules that is enlarged while maintaining a high level of agreement between spelling and contextual decision lists. The second approach, named CoBoost, is a generalization of the boosting technique [35] applied to the problem of NER. It utilizes both the labeled and unlabeled data and builds two classifiers in parallel. AdaBoost determines a weighted combination of simple (weak) classifiers, where the weights are calculated by minimizing a function which bounds the classification error on a set of training examples. The second algorithm devised in [35] performs a similar kind of search, but instead of minimizing only the classification error on the training data it also minimizes the disagreement between the classifiers in predicting class labels of unlabeled examples. The algorithm develops two classifiers iteratively, where in each iteration it tries to minimize a continuously differentiable function which bounds the number of examples on which the two classifiers disagree. Thus the CoBoost algorithm relies on certain samples (samples on which the two classifiers agree).

1 Here by extraction we mean both recognition and classification.

In the current paper we have selected tokens for which

both the classifiers have confusions in determining the

class label. For each classifier and for each token we

determine the difference between the confidence values

of the two most probable classes. If this difference is less

than a predefined threshold (if the confidence values are

similar to each other) then that particular instance is a

probable candidate where the classifier is most uncertain.

For each of the classifiers, we determine a list of potential candidates, and then combine the two lists in a unique way. Finally the sentences containing the ten most confusing instances are selected for active expert opinion. Thereafter these are added to the training set. Thus our algorithm is different from the work proposed in [35], which implicitly minimizes an objective function that bounds the number of examples on which two classifiers disagree. That technique is rather a way of building classifiers using a few labeled and a huge number of unlabeled documents. This is in contrast to the concept of active learning, where in each iteration we select informative tokens that are assigned the correct class labels by some domain experts.


The rest of the paper is organized as follows. Section

2 describes very briefly the base classifiers that we have

used for building our active learning systems. In Sect. 3

we present our algorithms for active annotation. Section

4 describes the set of features that we have used for

training and/or testing our machine learning algorithms.

Section 5 elaborately reports on the datasets used,

experimental results, detailed analysis and necessary

comparisons with the existing works. Finally, we con-

clude in Sect. 6.

2 Base classifiers

In our work we use two different classifiers, namely CRF

[34] and SVM [33].

2.1 Conditional random field

CRFs [34] are undirected graphical models, widely used

for sequence learning tasks. A special case of this classi-

fication technique corresponds to the conditionally trained

probabilistic finite state automata. As CRFs are condi-

tionally trained, they can easily incorporate a large number

of arbitrary and non-independent features. At the same time

they can still have the efficient procedures for non-greedy

finite-state inference and training.

Given an observation sequence we have to determine the best state sequence. A feature function $f_k(s_{t-1}, s_t, o, t)$ has a value of 0 in most cases and is set to 1 only when $s_{t-1}$ and $s_t$ are certain states and the observation has certain properties. We use the C++ based CRF++ package², a simple, customizable, and open source implementation of CRF for segmenting or labeling sequential data.
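For reference, the feature functions above enter the standard linear-chain CRF of [34] as follows. This is the textbook formulation rather than anything specific to this paper; the weights $\lambda_k$, the sequence length $T$ and the normalizer $Z_o$ are notation introduced here for illustration.

```latex
P_\Lambda(s \mid o) = \frac{1}{Z_o}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, o, t) \right),
\qquad
Z_o = \sum_{s'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, o, t) \right)
```

Here $s = s_1 \ldots s_T$ is a candidate state sequence for the observation sequence $o$, and the best state sequence is the one that maximizes $P_\Lambda(s \mid o)$.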

2.2 Support vector machine

In the field of NLP, SVMs [33] have been widely applied for text categorization, and are reported to achieve high accuracy without over-fitting even when a large number of words are taken as features [36]. We develop our system using SVM [33, 36], which performs classification by constructing an N-dimensional hyperplane that optimally separates data into two categories. We have used the YamCha³ toolkit, an SVM based tool for detecting classes in documents and formulating the NER task as a sequential labeling problem. Here, the pairwise multi-class decision method and the polynomial kernel function are used. We use the TinySVM-0.07⁴ classifier.

3 Proposed active learning techniques

In this section we describe our proposed active learning

techniques. Our first approach is based on the classification

technique, namely SVM. The second method is based on

an ensemble approach, where two supervised classifiers,

namely SVM and CRF are used. Effective uncertain sam-

ples are selected based on the measurements of distance

from the hyperplane of SVM and conditional probabilities

of CRF.

2 http://crfpp.sourceforge.net

3 http://chasen.org/~taku/software/yamcha/

4 http://cl.aist-nara.ac.jp/taku-ku/software/TinySVM



3.1 Active annotation

Active annotation, the term introduced by [5, 37] to refer to the application of active learning [1–4] to corpus creation, is becoming a popular annotation technique because it can lead to drastic reductions in the amount of annotation needed for constructing a training set from which highly accurate classifiers can be developed. In the traditional, random sampling approach, unlabeled data are selected for annotation at random.


In contrast, in active learning, the most useful data for

the classifier are carefully selected. Generally, a given

classifier is trained using a small sample of the data (usu-

ally selected randomly) which are also termed as the seed

examples. The classifier is subsequently applied to a pool

of unlabeled data with the purpose of selecting additional

examples that the classifier views as informative. The

selected data are manually annotated and the steps are

repeated so that the classifier can determine the optimal

decision boundary between the classes. The key question in

this approach is how to determine the samples that will be

most useful to the classifier.
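The generic loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function `active_learning_loop` and its callbacks `train`, `uncertainty` and `annotate` are hypothetical names, and the toy one-dimensional threshold classifier only stands in for a real classifier and a human annotator.

```python
def active_learning_loop(seed, pool, train, uncertainty, annotate,
                         batch_size=10, max_iterations=10):
    """Pool-based active learning (hypothetical helper, for illustration).

    seed        -- list of (item, label) pairs used to train the first model
    pool        -- iterable of unlabeled items
    train       -- callable: list of (item, label) -> model
    uncertainty -- callable: (model, item) -> score, higher = less certain
    annotate    -- callable: item -> label (stands in for the human expert)
    """
    training = list(seed)
    remaining = list(pool)
    for _ in range(max_iterations):
        if not remaining:
            break
        model = train(training)
        # Rank the pool from most to least uncertain under the current model.
        ranked = sorted(remaining,
                        key=lambda item: uncertainty(model, item),
                        reverse=True)
        for item in ranked[:batch_size]:
            training.append((item, annotate(item)))
            remaining.remove(item)
    return train(training), training


# Toy stand-ins (assumptions): a threshold classifier on integers.
def train_threshold(examples):
    highest_negative = max(x for x, label in examples if not label)
    lowest_positive = min(x for x, label in examples if label)
    return (highest_negative + lowest_positive) / 2


def closeness_to_threshold(threshold, x):
    # Items near the decision boundary are the most uncertain.
    return -abs(x - threshold)


def oracle(x):
    return x >= 5


model, training = active_learning_loop(
    seed=[(0, False), (10, True)], pool=range(1, 10),
    train=train_threshold, uncertainty=closeness_to_threshold,
    annotate=oracle, batch_size=1, max_iterations=3)
```

With only three queries the learned threshold moves from the initial midpoint towards the true boundary, because every query is spent on the item closest to the current decision boundary.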

3.2 Active annotation with SVM

A feature vector consisting of the features described in the

following section is extracted for each word in the NE

tagged corpus. We thus have training data in the form $(W_i, T_i)$, where $W_i$ is the $i$th word together with its feature vector and $T_i$ is its output tag.

The SVM is trained with the available feature set and

evaluated on the gold standard test set. We develop our

system using SVM [33, 36] which performs classification

by constructing an N-dimensional hyperplane that opti-

mally separates data into two categories. Based on some

selection criterion, sentences are chosen from the devel-

opment set and added to the initial training set in such a

way that the performance on the test set improves.

Our selection criterion is based on the confidence values of an SVM model. For each token of the development set, an SVM classifier produces the distances from the different separating hyperplanes. We first normalize these distance values to the range [0, 1]. The normalized value is

treated as the confidence value for a particular class. Our

selection criterion is based on the differences between the

confidence values of the two most probable classes for a

token, the hypothesis being that items for which this dif-

ference is smaller are those of which the classifier is less

certain. A threshold on the confidence interval is defined, and at each iteration of the algorithm we select the effective sentences from the development set and add them to the training set. In each iteration, we add the ten most informative sentences to the training set. We stop iterating when the performance in two consecutive iterations is equal.
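The selection criterion above can be sketched as follows. This is only an illustration under assumptions the text does not settle: distances are min-max normalized over all tokens and classes of the development set, every token is considered (not only NE tokens), and a sentence is ranked by the smallest confidence interval among its tokens. The names `confidence_interval` and `select_sentences` are hypothetical.

```python
def confidence_interval(confidences):
    """Difference between the two highest normalized confidence values."""
    top, second = sorted(confidences, reverse=True)[:2]
    return top - second


def select_sentences(token_scores, threshold=0.2, batch_size=10):
    """Pick the sentences holding the most uncertain tokens.

    token_scores -- list of (sentence_id, [decision value per class]),
                    one entry per token of the development set
    """
    # Assumption: global min-max normalization of the SVM distances to [0, 1].
    values = [v for _, scores in token_scores for v in scores]
    lowest, highest = min(values), max(values)
    span = (highest - lowest) or 1.0  # guard against identical values

    effective = {}  # sentence_id -> smallest CI seen for that sentence
    for sentence_id, scores in token_scores:
        confidences = [(v - lowest) / span for v in scores]
        ci = confidence_interval(confidences)
        if ci < threshold:
            effective[sentence_id] = min(ci, effective.get(sentence_id, ci))

    # Ascending CI: the most uncertain sentences come first.
    ranked = sorted(effective.items(), key=lambda pair: pair[1])
    return [sentence_id for sentence_id, _ in ranked[:batch_size]]


example_scores = [
    ("s1", [2.0, -1.0, -2.0]),   # clear winner, not selected
    ("s2", [0.5, 0.4, -2.0]),    # two classes nearly tied
    ("s3", [0.3, 0.0, -1.5]),    # moderately uncertain
    ("s2", [1.9, -2.0, -2.0]),   # another, confident token of s2
]
selected = select_sentences(example_scores)
```

In this toy input `s2` is ranked before `s3` because one of its tokens has the smaller gap between its two most probable classes.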

The main steps of the active annotation approach we

followed in this work are shown in Fig. 1.

3.3 Ensemble approach for active annotation

The method is based on ensemble learning. As the base classifiers we use CRF and SVM. For each of the base classifiers, a feature vector consisting of the features described in the following section is extracted for each token. We consider the feature vector consisting of all the features of the current token and vary the context within $w_{i-2}^{i+2} = w_{i-2} \ldots w_{i+2}$ (i.e., the preceding two and succeeding two tokens). For CRF we use a bigram feature template that

computes the combinations of all the available features for

the current and previous tokens. For SVM, we include the

dynamic output labels of the previous two tokens. Based on

some selection criterion, sentences are chosen from the

development set and added to the initial training set in such

a way that the performance on the test set improves. Our

technique is based on the combined decisions of both SVM

and CRF.

Step 1: Evaluate the system on the gold standard test data.
Step 2: Test on the development data and calculate the confidence values of each of the output classes.
Step 3: Compute the confidence interval (CI) between the two most probable classes for each token.
Step 4: If CI is below the threshold value (set to 0.1 and 0.2) then
    Step 4.1: Add the NE token along with its sentence identifier and CI to a list of effective sentences, selected for active annotation (named as EA).
Step 5: Sort EA in ascending order of CI.
Step 6: Select the top 10 sentences.
Step 7: Remove the 10 sentences from the development set.
Step 8: Add the sentences to the training set.
Step 9: Retrain the SVM classifier and evaluate with the test set.
Step 10: Repeat steps 2-9 until the performance in two consecutive iterations is the same.

Fig. 1 Main steps of the proposed active learning technique


As the two algorithms produce different kinds of scores, we first normalize all the confidence values to the range [0, 1]. We consider these as the actual confidence scores of the outputs. The selection criterion is again

based on the differences between the confidence values of the

two most probable classes for a token. A threshold on the

confidence interval is defined, and for each base classifier we

generate a set of uncertain samples. These sets contain the

selected sentence identifiers along with the confidence inter-

vals for which they are included into the respective sets.

Thereafter we combine the decisions of SVM and CRF, and generate a new set of uncertain samples by taking the union of these two sets. The union is taken in such a way that a sentence common to both sets is assigned a confidence interval equal to the minimum of the two values assigned to it in the two sets. Finally, we select the ten most uncertain sentences from the development data. Thus, in each iteration of the algorithm, we add the ten most informative sentences to the training set. We run the algorithm for a maximum of ten iterations. In some cases, the performance starts to decrease even at an earlier step of the algorithm. To account for this, we stop iterating at that point and retain the last iteration's training data as the final one.
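The combination step can be illustrated with a short sketch. It assumes each classifier's uncertain set is available as a mapping from sentence identifier to confidence interval; the function names `combine_uncertain_sets` and `most_uncertain` are hypothetical and chosen only for this example.

```python
def combine_uncertain_sets(set_svm, set_crf):
    """Union of two {sentence_id: CI} mappings.

    A sentence present in both sets receives the minimum of its two CI
    values; sentences present in only one set are kept unchanged.
    """
    combined = dict(set_svm)
    for sentence_id, ci in set_crf.items():
        combined[sentence_id] = min(ci, combined.get(sentence_id, ci))
    return combined


def most_uncertain(combined, batch_size=10):
    """Sentence identifiers sorted by ascending CI, truncated to a batch."""
    ranked = sorted(combined.items(), key=lambda pair: pair[1])
    return [sentence_id for sentence_id, _ in ranked[:batch_size]]


# Illustrative values only (not taken from the experiments).
set_svm = {"s1": 0.15, "s2": 0.05}
set_crf = {"s2": 0.10, "s3": 0.01}
ea = combine_uncertain_sets(set_svm, set_crf)
batch = most_uncertain(ea, batch_size=2)
```

Taking the minimum means that a sentence is treated as uncertain as soon as either classifier is unsure about it, which is what makes the union more inclusive than either set alone.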

The main steps of the proposed active annotation are

shown in Fig. 2.

4 Named entity features

Performance of any classification technique greatly

depends on the features used in the model. In our work we

implement the following set of features for our task. These

features are easy to derive and don’t require deep domain

knowledge and/or external resources for their generation.

Thus, these features are general in nature and can be easily

extracted for other domains.

1. Context words: These are the preceding and following

words surrounding the current token. This is based on

the observation that surrounding words carry effective

information for NE identification.

2. Word suffix and prefix: Fixed length (say, n) word

suffixes and prefixes are very effective to identify

NEs and work well particularly for the highly

inflective Indian languages. Actually, these are the

fixed length character sequences stripped either from

the rightmost (for suffix) or from the leftmost (for

prefix) positions of the words.

3. First word: This is a binary valued feature that checks

whether the current token is the first word of the

sentence or not.

4. Length of the word: This binary valued feature checks

whether the number of characters in a token is less

than a predetermined threshold value (here, set to 5).

This feature is defined with the observation that very

short words are most probably not the NEs.

5. Infrequent word: This is a binary valued feature that

checks whether the current word appears in the

training set very frequently or not. We include this

feature as the frequently occurring words are most

likely not the NEs.

6. Last word of sentence: This feature checks whether

the current word is the last word of a sentence or not.

In Indian languages, verbs generally appear in the last

position of the sentence. Indian languages follow

subject–object–verb structure. This feature distin-

guishes NEs from the verbs.

Step 1: Train the base classifiers with the initial training data and evaluate with the gold standard test data.
Step 2: Train the base classifiers with the initial training data and evaluate with the development data.
Step 3: Calculate the confidence value of each token for each output class.
Step 4: Normalize the confidence scores within the range of [0,1].
Step 5: Compute the confidence interval (CI) between the two most probable classes for each token of the development data. This is computed on the outputs of both SVM and CRF.
Step 6: From each of the development outputs, perform the following operations:
    Step 6.1: If CI is below the threshold value (set to 0.2) then add the NE token along with its sentence identifier and CI to a set of effective sentences, selected for active annotation.
    Step 6.2: Create two different sets ($Set_{SVM}$ and $Set_{CRF}$) for the two classifiers.
Step 7: Combine the two sets into one, named EA, in such a way that if the sentence identifiers are the same, then for that sentence $CI_{new} = \min(CI_{SVM}, CI_{CRF})$. All the dissimilar sentences are added as they are.
Step 8: Sort EA in ascending order of $CI_{new}$.
Step 9: Select the top 10 sentences, and remove these from the development data.
Step 10: Add the sentences to the training set. This generates the new training set. Retrain the SVM and CRF classifiers and evaluate with the test set.
Step 11: Repeat steps 3-10 for some iterations (10 in our case).

Fig. 2 Steps of the proposed ensemble technique for active annotation



7. Digit features: Several orthographic features are

defined depending upon the presence and/or the

number of digits and/or symbols in a token. These

features are digitComma (token contains digit and

comma), digitPercentage (token contains digit and

percentage), digitPeriod (token contains digit and

period), digitSlash (token contains digit and slash),

digitHyphen (token contains digit and hyphen) and

digitFour (token consists of four digits only).

8. Dynamic feature: Dynamic feature denotes the output tags $t_{i-3}t_{i-2}t_{i-1}$, $t_{i-2}t_{i-1}$, $t_{i-1}$ of the words $w_{i-3}w_{i-2}w_{i-1}$, $w_{i-2}w_{i-1}$, $w_{i-1}$ preceding $w_i$ in the sequence $w_1^n$. For CRF, we consider the bigram template that considers the combination of the current and previous output labels.

9. Content words in surrounding contexts: We consider all unigrams in contexts $w_{i-3}^{i+3} = w_{i-3} \ldots w_{i+3}$ of $w_i$ (crossing sentence boundaries) for the entire training data. We convert tokens to lower case, remove stopwords, numbers and punctuation symbols. We define a feature vector of length 10 using the 10 most frequent content words. Given a classification instance, the feature corresponding to token $t$ is set to 1 if and only if the context $w_{i-3}^{i+3}$ of $w_i$ contains $t$.

10. Part of speech (PoS) information: PoS information is

a critical feature for NE identification. We use PoS

information of the current and/or the surrounding

token(s) as the features. For Bengali and Hindi we

use our in-house PoS tagger to extract the PoS

information. The PoS information for English was

provided with the datasets. For biomedical texts, we use the GENIA tagger V2.0.2 (see footnote 5) to extract this information.
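The digit features (item 7) and the content-word context features (item 9) can be illustrated with a short Python sketch. It is an illustration only: the function names are hypothetical, the token stream is assumed to be a flat list that already crosses sentence boundaries, and a token without any letter is treated as a number or punctuation symbol.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def digit_features(token: str) -> Dict[str, bool]:
    """Binary digit features of item 7 for a single token."""
    has_digit = any(ch.isdecimal() for ch in token)
    return {
        "digitComma": has_digit and "," in token,
        "digitPercentage": has_digit and "%" in token,
        "digitPeriod": has_digit and "." in token,
        "digitSlash": has_digit and "/" in token,
        "digitHyphen": has_digit and "-" in token,
        "digitFour": len(token) == 4 and token.isdecimal(),
    }


def top_content_words(sentences: Sequence[Sequence[str]],
                      stopwords: Set[str], k: int = 10) -> List[str]:
    """Return the k most frequent lower-cased content words."""
    counts: Counter = Counter()
    for sentence in sentences:
        for tok in sentence:
            word = tok.lower()
            # Skip stopwords and tokens with no letter (numbers, punctuation).
            if word in stopwords or not any(ch.isalpha() for ch in word):
                continue
            counts[word] += 1
    return [word for word, _ in counts.most_common(k)]


def context_word_features(tokens: Sequence[str], i: int,
                          vocabulary: Sequence[str],
                          window: int = 3) -> List[int]:
    """One binary feature per vocabulary word: 1 iff it occurs in w_{i-3}..w_{i+3}."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    context = {tok.lower() for tok in tokens[lo:hi]}
    return [1 if word in context else 0 for word in vocabulary]
```

With k = 10, context_word_features returns the length-10 binary vector described in item 9.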


In addition to the above, we make use of the following additional features for NER, particularly for biomedical texts:


1. Chunk information: We use the GENIA tagger V2.0.2 to get the chunk information. Chunk information provides useful evidence about the boundaries of biomedical NEs. In the current work, we use chunk information of the current and/or the surrounding token(s). This information was provided for the English datasets.

2. Unknown token feature: This is a binary valued feature

that checks whether the current token was seen or not

in the training corpus. In the training phase, this feature

is set randomly.

3. Word normalization: We define the feature for word

normalization. The first type of feature attempts to

reduce a word to its stem or root form. This helps to

handle the words containing plural forms, verb

inflections, hyphen, and alphanumeric letters. The

second type of feature indicates how a target word is

orthographically constructed. Word shape refers to the mapping of each word to its equivalence class.

Here each capitalized character of the word is replaced

by ‘A’, small characters are replaced by ‘a’ and all

consecutive digits are replaced by ‘0’. For example,

‘IL’ is normalized to ‘AA’, ‘IL-2’ is normalized to

‘AA-0’ and ‘IL-88’ is also normalized to ‘AA-0’.

4. Head nouns: The head noun is the major noun or noun phrase of an NE that describes its function or property. For example, transcription factor is the head noun of the NE NF-kappa B transcription factor. Compared to the other words in an NE, head nouns are more important as they play a key role in the correct classification of the NE class.

5. Verb trigger: These are special types of verbs (e.g., binds, participates) that occur before NEs and provide useful information about the NE class. These trigger words are extracted automatically from the training corpus based on their frequencies of occurrence. A feature is then defined that fires if and only if the current word appears in the list of trigger words.

6. Informative words: In general, biomedical NEs are long, and they contain many common words that are not NEs themselves. For example, function words such as ‘of’ and ‘and’, and nominals such as ‘active’ and ‘normal’, appear frequently in the training data but do not help to recognize NEs. In order to select the most effective words, we first list all the words that occur inside multiword NEs. Thereafter digits, numbers and various symbols are removed from this list. Each word ($w_i$) of this list is assigned a weight that measures how well the word helps to identify and/or classify the NEs. This feature is defined in line with the one defined in [38].

7. Orthographic features: We define a number of ortho-

graphic features depending upon the contents of the

wordforms. Several binary features are defined which

use capitalization and digit information. These features

are: initial capital, all capital, capital in inner, initial

capital then mix, only digit, digit with special charac-

ter, initial digit then alphabetic, digit in inner.

The presence of special characters such as ‘,’, ‘-’, ‘.’, ‘)’ and ‘(’ is very helpful for detecting NEs, especially in the biomedical domain. For example, many biomedical NEs have ‘-’ (hyphen) in their construction. Some of these special characters are also important for detecting the boundaries of NEs. We also use features that check the presence of an ATGC sequence and stop words. The complete list of orthographic features is shown in Table 1.

5 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger.
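The word-shape mapping (item 3) and the verb trigger feature (item 5) can be illustrated with a short Python sketch. The sketch rests on assumptions that are not stated in the text: sentences are given as (token, PoS, NE tag) triples, verbs carry PoS tags starting with "VB", an NE begins at a tag starting with "B-", and the frequency cut-off min_count is a hypothetical parameter. All function names are illustrative.

```python
import re
from collections import Counter
from typing import List, Sequence, Set, Tuple

Token = Tuple[str, str, str]  # (word, PoS tag, NE tag); an assumed layout


def word_shape(token: str) -> str:
    """Map capitals to 'A', small letters to 'a' and each digit run to '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]+", "0", shape)


def extract_trigger_verbs(sentences: Sequence[List[Token]],
                          min_count: int = 2) -> Set[str]:
    """Collect verbs that immediately precede an NE at least min_count times."""
    counts: Counter = Counter()
    for sentence in sentences:
        for (word, pos, _), (_, _, next_tag) in zip(sentence, sentence[1:]):
            if pos.startswith("VB") and next_tag.startswith("B-"):
                counts[word.lower()] += 1
    return {verb for verb, c in counts.items() if c >= min_count}


def verb_trigger_feature(word: str, triggers: Set[str]) -> int:
    """Fire (1) iff the current word is in the trigger list."""
    return int(word.lower() in triggers)
```

For the examples given in item 3, word_shape maps ‘IL’ to ‘AA’ and both ‘IL-2’ and ‘IL-88’ to ‘AA-0’.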



5 Datasets, experiments and discussions

In this section we describe the datasets used for the

experiments, report the evaluation results and present the

necessary discussions.

5.1 Datasets and experimental setup

Indian languages are resource-constrained in nature. For

NER, we use a Bengali news corpus [39], developed from the archive of a leading Bengali newspaper available on the web.

A portion of this dataset containing 250 K wordforms was

manually annotated with the NE tagset of four tags namely,

Person name (PER), Location name (LOC), Organization

name (ORG) and Miscellaneous name (MISC). The Miscel-

laneous name includes date, time, number, percentages,

monetary expressions and measurement expressions. The

data is collected mostly from the National, States, Sports

domains and the various sub-domains of District of the

newspaper. This annotation was carried out by one of the

authors and verified by an expert [39]. We also use the

IJCNLP-08 NER on South and South East Asian Languages

(NERSSEAL)6 Shared Task data of around 100 K wordforms

that were originally annotated with a fine-grained tagset of

twelve tags. This data is mostly from the agriculture and

scientific domains. For Hindi, we use approximately 502,913

tokens obtained from the NERSSEAL shared task.

In the present work, we consider only the tags that

denote person names (NEP), location names (NEL), orga-

nization names (NEO), number expressions (NEN), time

expressions (NETI) and measurement expressions (NEM).

The NEN, NETI and NEM tags are mapped to the Mis-

cellaneous name tag that denotes miscellaneous entities.

Other tags of the shared task have been mapped to the

‘other-than-NE’ category denoted by ‘O’.
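The tag mapping described above amounts to a small lookup table. The following Python sketch is illustrative only: the text states that NEN, NETI and NEM are mapped to the Miscellaneous tag and that all remaining shared task tags become ‘O’, while the mapping of NEP, NEL and NEO onto the labels PER, LOC and ORG is our assumption about the label names.

```python
# Fine-grained NERSSEAL tags kept in this work, mapped to the four coarse tags.
COARSE_TAG = {
    "NEP": "PER",    # person names (target label name assumed)
    "NEL": "LOC",    # location names (target label name assumed)
    "NEO": "ORG",    # organization names (target label name assumed)
    "NEN": "MISC",   # number expressions
    "NETI": "MISC",  # time expressions
    "NEM": "MISC",   # measurement expressions
}


def map_tag(fine_tag: str) -> str:
    """Map a shared task tag to the coarse tagset; all other tags become 'O'."""
    return COARSE_TAG.get(fine_tag, "O")
```

Any tag that is not listed, including ‘O’ itself, falls through to the ‘other-than-NE’ category.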

For the development sets, we partition the available

training sets in such a way that 10 % instances of each class

belong to the respective development set. Some statistics of

training, development and test data are presented in Table 2.

For English NER, we use the CoNLL-2003 shared task [40] data. Here the training data set has a total of 203,621 tokens, out of which 23,499 tokens are NEs. The test data has 51,578 tokens, out of which 5,648 tokens are of NE types.

For biomedical texts, we use the JNLPBA 2004 shared

task datasets7. The data sets were extracted from the

GENIA Version 3.02 corpus of the GENIA project. This

was constructed by a controlled search on Medline using

MeSH terms such as human, blood cells and transcription

factors. From this search, 2,000 abstracts of about 500 K

wordforms were selected and manually annotated according

to a small taxonomy of 48 classes based on a chemical

classification. Out of these classes, 36 classes were used to

annotate the GENIA corpus. In the shared task, the data sets

were further simplified to be annotated with only five NE

classes, namely Protein, DNA, RNA, Cell_line and Cell_type [41]. The test set was a relatively new collection of Medline abstracts from the GENIA project. It contains 404 abstracts of around 100 K words. One half of the test data was from the same domain as the training data, and the other half was from the super domain of blood cells and transcription factors.

For the active annotation experiment, we take out 10 % of the examples of each class from the training data to create a development set. Initially the system is trained on the training set and evaluated on the development and test datasets. Our algorithm then selects the most uncertain samples from the development set and adds them to the training set.
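The per-class 10 % split can be sketched as a simple stratified partition. The text does not specify whether the examples are tokens or sentences, nor how they are drawn, so this Python sketch makes its own assumptions: examples are generic items with one class label each, the draw is random with a fixed seed, and the function name split_dev is hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def split_dev(examples: Sequence, labels: Sequence[str],
              fraction: float = 0.10, seed: int = 0) -> Tuple[List, List]:
    """Move `fraction` of the examples of every class to a development set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    dev_idx = set()
    for indices in by_class.values():
        rng.shuffle(indices)
        dev_idx.update(indices[: round(len(indices) * fraction)])

    train = [ex for i, ex in enumerate(examples) if i not in dev_idx]
    dev = [ex for i, ex in enumerate(examples) if i in dev_idx]
    return train, dev
```

Every class therefore contributes roughly the same proportion of its examples to the development set, from which the uncertain samples are later drawn.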

We use the standard metrics of recall, precision and

F-measure to evaluate the performance of our system.

These metrics are defined below:

Recall is the ratio of the number of correctly tagged

entities and the total number of entities.

$$\text{Recall} = \frac{\text{number of correctly tagged entities}}{\text{total number of entities}}$$

Precision is the ratio of the number of correctly tagged

entities and the total number of tagged entities.

Table 2 Statistics of the datasets used for Indian languages

Language   # Tokens in training   # Tokens in dev   # Tokens in test (in %)
Bengali    277,611                35,336            37,053
Hindi      455,961                47,218            32,796

Table 1 Orthographic features

Feature         Example        Feature           Example
InitCap         Src            AllCaps           EBNA, LMP
InCap           mAb            CapMixAlpha       NFkappaB,
DigitOnly       1, 123         DigitSpecial      12-3
DigitAlpha      2×NFkappaB,    AlphaDigitAlpha   IL23R, EIA
Hyphen          –              CapLowAlpha       Src, Ras, Epo
CapsAndDigits   32Dc13         RomanNumeral      I, II
StopWord        At, in         ATGCSeq           CCGCCC,
AlphaDigit      p50, p65       DigitCommaDigit   1, 28
GreekLetter     Alpha, beta    LowMixAlpha       mRNA, mAb

6 http://ltrc.iiit.ac.in/ner-ssea-08.

7 http://research.nii.ac.jp/collier/workshops/JNLPBA04st.htm.



$$\text{Precision} = \frac{\text{number of correctly tagged entities}}{\text{total number of tagged entities}}$$

F-measure is the harmonic mean of recall and precision.

$$\text{F-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$$
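The three metrics can be computed directly from entity counts. The Python sketch below is illustrative (the function names are ours); it also accepts recall and precision given in percent, in which case the F-measure is returned in percent as well.

```python
def recall(correct: int, total_entities: int) -> float:
    """Correctly tagged entities divided by all entities in the gold data."""
    return correct / total_entities if total_entities else 0.0


def precision(correct: int, total_tagged: int) -> float:
    """Correctly tagged entities divided by all entities the system tagged."""
    return correct / total_tagged if total_tagged else 0.0


def f_measure(r: float, p: float) -> float:
    """Harmonic mean of recall r and precision p (same scale as the inputs)."""
    return 2 * r * p / (r + p) if (r + p) else 0.0
```

For example, a recall of 74.77 % and a precision of 68.71 % give an F-measure of about 71.61 %.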

For the biomedical domain, we ran the JNLPBA 2004 shared task evaluation script (see footnote 8). The script outputs three sets of F-measures according to exact, right and left boundary matching.


5.2 Results of SVM based active annotation technique for Indian languages

We trained an SVM model with the feature set mentioned in Sect. 4. We consider various combinations from the set of feature combinations given by $F_1 = \{w_{i-m}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+n}$, feature vector consisting of root word, prefix and suffix, first word, infrequent word, digit, content words, and dynamic NE information$\}$. We observed the best performance with the context of $w_{i-1}, w_i, w_{i+1}$, and thus only report its results.

Results for Bengali: We conducted active learning

experiments with the thresholds of both 0.1 and 0.2. But

we report the results only with 0.2 threshold value in

Table 3 as it yielded better performance. Here, in each

iteration of the algorithm ten most effective sentences are

added to the training set after removing from the devel-

opment set.

The highest performance obtained with this method has recall, precision and F-measure values of 86.80, 87.84 and 87.317 %, respectively. According to Table 3, this F-measure is first reached at the ninth iteration and does not improve in the tenth. This is only a marginal improvement (about 0.06 F-measure points) over the initial model, but it indicates the effectiveness of our proposed approach.

We also develop a baseline model, where in each iter-

ation ten sentences are randomly chosen from the devel-

opment set and added to the training set. Results of this

baseline show the recall, precision and F-measure values of

86.77, 87.80 and 87.28 %, respectively.
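One iteration of the random-selection baseline can be sketched as follows. This is a minimal Python illustration under our own assumptions: sentences are opaque items, the function name random_baseline_step is hypothetical, and a seeded generator is used only to make the sketch reproducible.

```python
import random
from typing import List, Optional, Tuple


def random_baseline_step(train: List, dev: List, batch_size: int = 10,
                         rng: Optional[random.Random] = None) -> Tuple[List, List]:
    """Move `batch_size` randomly chosen development sentences to the training set."""
    rng = rng or random.Random(0)
    chosen = set(rng.sample(range(len(dev)), min(batch_size, len(dev))))
    new_train = train + [s for i, s in enumerate(dev) if i in chosen]
    new_dev = [s for i, s in enumerate(dev) if i not in chosen]
    return new_train, new_dev
```

The baseline and the active learner thus add the same number of sentences per iteration and differ only in how the sentences are chosen.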

Results on Hindi: The proposed technique is evaluated

on the Hindi language, and its results are shown in Table 4.

The Hindi dataset is highly unbalanced, and we sample it

by removing the sentences that don’t contain any NEs. The

system attains the highest recall, precision and F-measure values of 87.12, 88.54 and 87.82 %, respectively, in the seventh and eighth iterations. This is an improvement of 4.21 F-measure points over the first iteration. The baseline model, where in each iteration ten sentences were selected randomly, showed recall, precision and F-measure values of 86.23, 87.77 and 86.99 %, respectively. These results again show the efficacy of the proposed technique.

Results on English: For English, we use the CoNLL-

2003 benchmark datasets [40]. We trained with CoNLL-

2003 training data and with the same set of features as we

used for the Indian languages except the Last word of

sentence feature. But for English, we use two additional

features, first one checks capitalization information, and

the second one denotes the chunk information. We use a context window of the previous four and next four words, i.e. $w_{i-4}^{i+4} = w_{i-4} \ldots w_{i+4}$ of $w_i$, and word suffixes and prefixes of length up to four characters (4 + 4 different features). Experimental results for this dataset are presented in Table 5. It shows overall recall, precision and F-measure values of 87.16, 88.50 and 87.82 %, respectively. This is an improvement of 2.06 F-measure points over the first iteration. The baseline model, where in each iteration ten sentences were selected randomly,

Table 3 Evaluation results of SVM based AL on Bengali with

threshold 0.2 (in terms of percentage)

Iteration Recall Precision F-measure

0 (initial) 86.75 87.77 87.258

1 86.754 87.81 87.279

2 86.75 87.82 87.280

3 86.75 87.82 87.281

4 86.74 87.83 87.283

5 86.76 87.83 87.290

6 86.76 87.84 87.297

7 86.76 87.84 87.298

8 86.76 87.84 87.309

9 86.78 87.84 87.317

10 86.80 87.84 87.317

Table 4 Evaluation results of SVM based AL on Hindi with

threshold 0.2 (in terms of percentage)

Iteration Recall Precision F-measure

0 82.98 84.25 83.61

1 83.10 84.34 83.715

2 83.51 84.71 84.11

3 83.96 85.01 84.48

4 84.01 85.27 84.64

5 85.05 86.28 85.66

6 86.08 87.49 86.78

7 87.12 88.54 87.82

8 87.12 88.54 87.82

8 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html.



showed the recall, precision and F-measure values of

86.13, 87.56 and 86.84 %, respectively. These results again

show the efficacy of the proposed technique.

Results on biomedical texts: We train a SVM model with

the feature set mentioned in Sect. 4. We observe the best

performance with the context of previous three and next

three tokens, and thus only report its results. Results are

reported with the following feature combinations:

(1) contexts within the previous and next three tokens, i.e. $w_{i-3}^{i+3} = w_{i-3} \ldots w_{i+3}$, (2) word suffixes and prefixes of length up to three characters (4 + 4 different features) of the word, (3) PoS information of the current token, (4) chunk information of the current token, (5) dynamic NE information, (6) word normalization, (7) word length, (8) infrequent word, (9) unknown tokens, (10) head nouns (unigram and bigram), (11) verb trigger, (12) word class, (13) informative NE information, (14) orthographic features and (15) all features within the context of $w_{i-1}$ to $w_{i+1}$.

We experiment with a selection criterion that adds only the current sentence to the training set. We report the

results with threshold value of 0.2 in Table 6 as it yields

better results compared to the threshold of 0.1. It shows the

highest recall, precision and F-measure values of 76.2, 70.1

and 73.02 %, respectively. This shows improvements of

1.43, 1.39 and 1.41 % in recall, precision and F-measure,

respectively over the first iteration.

The results of the baseline model as defined previously

show the recall, precision and F-measure values of 75.37,

69.18 and 72.14 %, respectively. This is lower in com-

parison to our proposed approach by 0.83, 0.92 and 0.88 %

of recall, precision and F-measure values, respectively.

Hence the proposed method works effectively even for the

biomedical domain.

5.3 Results on ensemble based active annotation

In this section we present the results of the proposed

ensemble based active learning.

Results on Indian language NER: We train both CRF

and SVM with the feature set mentioned in Sect. 4. At first

we evaluate the active learning technique proposed in [5]

for CRF based classifier on the Bengali data. Its results are

shown in Table 7.

The system attains the highest performance with the

recall, precision and F-measure values of 87.76, 89.34 and

88.55 %, respectively. This is actually an improvement of

around 0.913 % F-measure points over the first iteration.

The baseline model showed the recall, precision and

F-measure values of 87.22, 88.96 and 88.08 %, respec-

tively. We already have the results for SVM as shown in

Table 3.

The learning curve comparing between SVM based

supervised model (i.e. baseline model, where at each

Table 5 Evaluation results of SVM based AL on English with

threshold 0.2

Iteration Recall Precision F-measure

0 85.39 86.12 85.76

1 85.90 86.65 86.27

2 85.98 86.90 86.44

3 86.02 87.01 86.51

4 86.15 87.14 86.64

5 86.32 87.55 86.93

6 86.54 87.81 87.17

7 86.90 88.15 87.52

8 86.99 88.32 87.64

9 87.16 88.50 87.82

10 87.16 88.50 87.82

Table 6 Evaluation results of SVM based AL for biomedical texts

with threshold = 0.2

Iteration number Recall Precision F-measure

1 74.77 68.71 71.61

2 74.90 69.20 71.94

3 75.01 69.50 72.15

4 75.10 69.81 72.36

5 75.5 69.95 72.62

6 75.82 70.02 72.80

7 76.2 70.10 73.02

8 76.2 70.10 73.02

9 76.2 70.10 73.02

10 76.2 70.10 73.02

Table 7 Evaluation results of CRF based AL for Bengali with

threshold value of 0.2

Iteration Recall Precision F-measure


0 87.02 88.26 87.63

1 87.51 89.07 88.28

2 87.56 89.05 88.30

3 87.63 89.10 88.36

4 87.63 89.20 88.41

5 87.67 89.21 88.43

6 87.72 89.26 88.48

7 87.76 89.26 88.51

8 87.76 89.32 88.54

9 87.76 89.34 88.55

10 87.76 89.34 88.55



iteration ten sentences are selected randomly) and SVM

based active learning approach with the same amount of

training data is shown in Fig. 3a. Similarly, the learning curves of the CRF based baseline model and the CRF based active learning model are shown in Fig. 3b. The learning curves show that with the same annotation effort (in each case we select ten sentences to be added to the training set) we gain more performance improvement with active learning.

Thereafter we evaluate the proposed ensemble based

technique which combines the outputs of both CRF and

SVM based classifiers for the Bengali data. Results are

shown in Table 8. The system attains the highest

performance of recall, precision and F-measure values of

87.99, 89.99 and 88.97 %, respectively. This shows an

improvement of around 0.657 % F-measure points. The

baseline model shows the recall, precision and F-measure

values of 87.72, 89.26 and 88.48 %, respectively.

Table 9 compares the results between the (1) ensemble

and SVM based classifier, and (2) ensemble and CRF based

classifier for Bengali. In each of the cases, the system attains its highest performance in the ninth iteration, with F-measure values of 87.32, 88.55 and 88.97 % for SVM, CRF and the ensemble, respectively (Table 9). The ensemble thus improves by 1.65 and 0.42 F-measure points over the active learning techniques based on SVM and CRF classifiers, respectively.

We first evaluate the active learning technique using the CRF based classifier [5] on the Hindi data, with the same set of features as for Bengali. Results on Hindi for CRF are shown in Table 10 (see footnote 9). The system attains

the highest recall, precision and F-measure values of 87.26,

88.55 and 87.89 %, respectively. These were obtained in


Fig. 3 Learning curves for Bengali data comparing a SVM based baseline and active learning approaches, b CRF based baseline and active

learning approaches

Table 8 Evaluation results of ensemble based AL on Bengali with

threshold 0.2

Iteration Recall Precision F-measure

0 87.58 89.11 88.32

1 87.58 89.16 88.36

2 87.58 89.18 88.37

3 87.63 89.20 88.41

4 87.65 89.19 88.41

5 87.72 89.26 88.48

6 87.76 89.26 88.51

7 87.83 89.35 88.58

8 87.83 89.39 88.60

9 87.99 89.99 88.97

Table 9 Comparison results for Bengali using ensemble approach v/s

single classifiers with threshold 0.2

Iteration SVM CRF Ensemble

0 87.26 87.63 88.32

1 87.28 88.28 88.36

2 87.28 88.30 88.37

3 87.28 88.36 88.41

4 87.28 88.41 88.41

5 87.29 88.43 88.48

6 87.29 88.48 88.51

7 87.29 88.54 88.58

8 87.31 88.55 88.60

9 87.32 88.55 88.97

9 We iterate the algorithm for more than 10 iterations as we observed

performance improvement even in the 10th iteration.



the tenth and eleventh iterations. This is actually an

improvement of around 1.91 % F-measure points over the

first iteration. The baseline model showed the recall, pre-

cision and F-measure values of 85.72, 87.06 and 86.39 %,

respectively. Thus the CRF based active learning technique attains an improvement of 1.51 % over the baseline. These results indicate that the CRF based active learning technique is superior to its random-selection baseline.

The learning curves comparing the SVM based baseline model and the active learning model with the same amount of training data are shown in Fig. 4a. Similarly, the learning curves for the CRF based baseline and CRF based active learning are shown in Fig. 4b. These curves show that with the same annotation effort we can gain some more points in the F-measure value.

Thereafter we evaluate the proposed ensemble based

active learning technique for the Hindi data. Results are

shown in Table 11. The system attains the highest recall,

precision and F-measure values of 88.01, 88.99 and

88.50 %, respectively. This is obtained in the tenth iteration and remains unaltered in the next iteration. This is an improvement of around 2.52 F-measure points over the first iteration (from 85.98 to 88.50 % in Table 11). The baseline model, where in each iteration ten sentences were selected randomly, shows recall, precision and F-measure values of 85.46, 86.79 and 86.17 %, respectively. This indicates that the proposed ensemble based active learning technique is superior to the baseline approach.

Table 12 compares the results of ensemble classifier

with the results of SVM based active learner and CRF

based active learner for Hindi. In each of the cases, the

system attains the highest performance of 87.82, 87.89 and

88.50 % in the 7th, 10th and 10th iteration, respectively.

For Hindi, we observe an improvement with the ensemble approach. It shows increments of 0.61 and 0.68 %, respectively, over the CRF and SVM based active learning techniques.


Results on English: Here at first we evaluate the active

learning technique proposed in [5] for CRF based classifier

on the English data. Results are shown in Table 13. The

system attains the highest recall, precision and F-measure

values of 87.58, 88.98 and 88.27 %, respectively. For the baseline model we observed recall, precision and F-measure values of 86.90, 88.01 and 87.45 %, respectively.

Thereafter we evaluate the proposed ensemble based

technique which combines the outputs of both CRF and

SVM based classifiers. Results are shown in Table 14. The

system attains the highest recall, precision and F-measure

values of 88.52, 88.86 and 88.69 %, respectively.


Fig. 4 Learning curves for Hindi data set comparing a SVM based baseline and active learning approaches, b CRF based baseline and active

learning approaches

Table 10 Evaluation results of CRF alone on Hindi with threshold


Iteration Recall Precision F-measure

0 85.32 86.65 85.98

1 85.72 87.06 86.39

2 85.99 87.33 86.65

3 86.16 87.53 86.84

4 86.26 87.66 86.95

5 86.42 87.79 87.10

6 86.66 88.03 87.34

7 86.79 88.12 87.44

8 86.96 88.28 87.61

9 86.99 88.28 87.63

10 87.26 88.55 87.89

11 87.26 88.55 87.89



In Table 15 we present the comparisons of the ensemble

approach with individual CRF and SVM based approaches

for English data. The SVM, CRF and ensemble based systems attain their highest F-measure values of 87.82, 88.27 and 88.69 %, respectively. Table 15 shows that each system achieves this performance in the 9th iteration. As for the other languages, the ensemble obtains a better F-measure than the individual models. Comparisons with other systems for English are shown in Table 16.

For the benchmark English dataset, our proposed system achieves performance comparable to the best performing system [44] of the CoNLL-2003 shared task. The best

system [44] at CoNLL-2003 shared task demonstrated the

recall, precision and F-measure values of 88.54, 88.99 and

88.76 %, respectively. They used an ensemble learner with

many domain dependent resources and/or tools. In contrast,

our proposed algorithm (1) makes use of the features that

Table 11 Evaluation results of AL using ensemble approach on

Hindi data with threshold 0.2 (RPF values; we report percentage


Iteration Recall Precision F-measure

0 85.32 86.65 85.98

1 86.06 87.28 86.66

2 86.09 87.39 86.74

3 86.26 87.59 86.92

4 86.42 87.77 87.09

5 86.59 87.94 87.26

6 86.69 87.98 87.33

7 86.96 88.21 87.59

8 87.24 88.56 87.87

9 87.57 88.91 88.23

10 88.01 88.99 88.50

11 88.01 88.99 88.50

Table 12 Comparisons for Hindi using ensemble approach v/s single

classifier with threshold 0.2

Iteration SVM CRF Ensemble

0 83.61 85.98 85.98

1 83.72 86.39 86.66

2 84.11 86.65 86.74

3 84.48 86.84 86.92

4 84.64 86.95 87.09

5 85.66 87.10 87.26

6 86.78 87.34 87.33

7 87.82 87.44 87.56

8 87.82 87.61 87.87

9 87.82 87.63 88.23

10 87.82 87.89 88.50

Table 13 Evaluation results of CRF based AL on English with

threshold 0.2

Iteration Recall Precision F-measure


0 85.74 86.03 85.89

1 85.91 86.65 86.28

2 85.99 86.91 86.45

3 86.11 87.03 86.57

4 86.41 87.23 86.82

5 86.51 87.41 86.96

6 86.67 87.78 87.22

7 86.91 87.92 87.41

8 87.21 88.14 87.67

9 87.58 88.98 88.27

10 87.58 88.98 88.27

Table 14 Evaluation results of ensemble based AL on English data

with threshold 0.2

Iteration Recall Precision F-measure


0 86.01 86.63 86.32

1 86.41 86.91 86.66

2 86.72 87.05 86.88

3 86.91 87.34 87.12

4 87.04 87.61 87.32

5 87.51 87.80 87.65

6 87.67 87.98 87.82

7 87.88 88.05 87.96

8 88.01 88.25 88.13

9 88.52 88.86 88.69

10 88.52 88.86 88.69

Table 15 Comparison results of AL on English data using ensemble

approach v/s single classifiers with threshold 0.2

Iteration SVM CRF Ensemble

0 85.76 85.89 86.32

1 86.27 86.28 86.66

2 86.44 86.45 86.88

3 86.51 86.57 87.12

4 86.64 86.82 87.32

5 86.93 86.96 87.65

6 87.17 87.22 87.82

7 87.52 87.41 87.96

8 87.64 87.67 88.13

9 87.82 88.27 88.69

10 87.82 88.27 88.69



can be derived for any language with a little effort, (2) does

not make use of any domain dependent resources like the

gazetteers etc., and (3) does not make use of any additional NE taggers, but still achieves state-of-the-art performance, which is only 0.07 F-measure points below the best system of the CoNLL-2003 shared task.

Until now, the best reported results for CoNLL-2003

shared task data are in Lin and Wu [42] that proposed a

semi-supervised approach for NER. They obtained the

F-measure value of 90.90 %, which is 2.21 points higher

than our proposed system. In addition to the above men-

tioned two systems [42, 44], we also present the compar-

isons with some other well-known existing techniques in

Table 16. Suzuki and Isozaki [43] run a baseline discrim-

inative classifier on unlabeled data to generate pseudo

examples, which are then used to train a different type of

classifier for the same problem. Later on, they used the automatically labeled corpus to train hidden Markov models (HMMs). Chieu and Ng [45] showed how the use of global

information, in addition to the local ones, can improve the

model performance. It is to be noted that our system

achieves 6.00 points higher F-measure value in comparison

to the stacked, voted model, proposed by Wu et al. [47] in

the CoNLL-2003 shared task.

Results on biomedical texts: Here at first we execute the

CRF based active learning technique (described in [5]) on

the biomedical data. Results are shown in Table 17. We

trained a CRF model with the feature set mentioned in

Sect. 4.

The system attains the highest recall, precision and

F-measure values of 76.50, 73.00 and 74.80 %, respec-

tively. This was obtained in the ninth iteration, and accu-

racy does not change thereafter. This is actually an

improvement of around 1.57 % F-measure points over the

first iteration. The baseline model, where in each iteration

ten sentences were selected randomly showed the recall,

precision and F-measure values of 74.57, 72.30 and

73.42 %, respectively.

Figure 5 shows the learning curves demonstrating the

comparisons between SVM based baseline vs. SVM based

AL and CRF based baseline vs. CRF based AL. This

illustrates the effectiveness of active learning based tech-

niques where with the same annotation efforts we achieve

better accuracies.

Thereafter we evaluate the proposed ensemble based

technique for the biomedical data. Results are shown in

Table 18. The system attains the highest recall, precision

and F-measure values of 76.80, 74.95 and 75.86 %,

respectively. This is better than the accuracies obtained in the first iteration and by the baseline model. Table 19 compares the results of the ensemble approach with the CRF based and SVM based active learners. The proposed ensemble based active learning technique attains 1.06 and 2.84 % F-measure improvements over the CRF and SVM based active learning techniques, respectively. The results indicate that the ensemble indeed achieves better performance.
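The gains of the ensemble over the individual classifiers follow directly from the final rows of Table 19. The short Python check below recomputes them from the tabulated F-measure values; the variable names are illustrative.

```python
# Final (10th iteration) F-measure values for biomedical data, from Table 19.
final_f = {"SVM": 73.02, "CRF": 74.80, "Ensemble": 75.86}

# Gain of the ensemble over each single classifier, in F-measure points.
gains = {
    name: round(final_f["Ensemble"] - f, 2)
    for name, f in final_f.items()
    if name != "Ensemble"
}
print(gains)
```

The ensemble gains 2.84 points over the SVM based active learner and 1.06 points over the CRF based one.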

Comparison with existing biomedical NER systems: We

compare with the systems reported in the JNLPBA 2004

shared task as well as with those that were developed at the

later stages but made use of the same datasets. We present

the comparative evaluation results in Table 20 not only

with the domain-independent systems but also with the

systems that incorporate domain knowledge and/or exter-

nal resources.

GuoDong and Jian [48] developed the best system in the JNLPBA 2004 shared task. This system provides the highest F-measure value of 72.55 % using deep domain knowledge. Song et al. [49] used both CRF and SVM, and obtained an F-measure of 66.28 % with virtual samples.

The HMM-based system reported by Ponomareva et al.

[50] achieved a F-measure value of 65.7 % with PoS and

phrase-level domain dependent knowledge. A maximum

entropy (ME)-based system was reported in [51] where

recognition of terms and their classification were per-

formed in two steps. They achieved a F-measure value of

66.91 % with several lexical knowledge sources such as

Table 16 Comparisons with some existing systems for English NER

System                     F-measure (in %)
Lin and Wu [42]            90.90
Suzuki and Isozaki [43]
Florian et al. [44]        88.76
Our proposed system        88.69
Chieu and Ng [45]          88.31
Klein et al. [46]          86.31
Wu et al. [47]             82.69

Table 17 Evaluation results of CRF based active learning technique

with threshold = 0.2 for biomedical data

Iteration number Recall Precision F-measure

1 74.5 72.0 73.23

2 74.6 72.1 73.33

3 74.9 72.1 73.47

4 75.1 72.1 73.57

5 75.5 72.1 73.76

6 76.0 72.5 74.21

7 76.2 72.8 74.46

8 76.5 73.0 74.7

9 76.5 73.0 74.7

10 76.5 73.0 74.8



salient words obtained through corpus comparison between

domain-specific and WSJ corpora, morphological patterns

and collocations extracted from the Medline corpus. To the best of our knowledge, a recent work [38] obtained an F-measure value of 67.41 % with PoS and phrase information as the only domain knowledge. This is the highest performance achieved by any system that did not use any deep domain knowledge.

A CRF-based NER system has been reported in [52] that

obtained the F-measure value of 70 % with orthographic

features, semantic knowledge in the form of 17 lexicons

generated from the public databases and Google sets.

Finkel et al. [53] reported a CRF-based system that showed

the F-measure value of 70.06 % with the use of a number

of external resources, including gazetteers, web-querying,

surrounding abstracts, abbreviation handling method, and

frequency counts from the BNC corpus. A two-phase

model based on ME and CRF was proposed by Kim et al.

[54] that achieved a F-measure value of 71.19 % by post-

processing the outputs of machine learning models with a

rule-based component.

Our proposed ensemble-based active learning technique attains average recall, precision and F-measure values of 76.80, 74.95 and 75.86 %, respectively. This is on par with existing state-of-the-art systems.
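As a quick consistency check on the figures above, the following minimal sketch recomputes the F-measure from the reported recall and precision. It assumes the balanced F-measure (harmonic mean of recall and precision, i.e. beta = 1); the function name `f_measure` is ours and is used only for illustration.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of recall and precision.

    Assumption: beta = 1 (equal weight on recall and precision).
    Inputs and output share the same unit (here, percent).
    """
    if recall + precision == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * recall * precision / (recall + precision)


# Average recall and precision reported for the ensemble approach (in %).
print(round(f_measure(76.80, 74.95), 2))  # prints 75.86
```

Under this assumption the reported average recall and precision reproduce the reported F-measure of 75.86 %.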

We also compare the performance of our proposed ensemble-based active learning approach with the state-of-the-art biomedical NER system BANNER [55], which was implemented using CRFs. BANNER exploits a range of orthographic, morphological and shallow syntax features, such as part-of-speech tags, capitalisation, letter/digit combinations, prefixes, suffixes and Greek letters. Comparisons between several existing NER systems are provided in [56]. For BANNER, Kabiljo et al. [56] reported F-measure values of 77.50 and 61.00 % under the sloppy matching and strict matching criteria, respectively, on the JNLPBA shared task datasets.
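The gap between the two BANNER scores comes from how a predicted entity is matched against the gold annotation. The sketch below illustrates the distinction under an assumption of ours: strict matching requires identical boundaries and class, whereas sloppy matching accepts any token overlap with the same class. The exact criteria used in [56] are not reproduced here, and the `Entity` type and function names are hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # index of the first token of the mention
    end: int    # index of the last token of the mention (inclusive)
    label: str  # entity class, e.g. "protein"


def strict_match(pred: Entity, gold: Entity) -> bool:
    """Both boundaries and the class must agree exactly."""
    return pred == gold


def sloppy_match(pred: Entity, gold: Entity) -> bool:
    """Assumed definition: same class and at least one shared token."""
    overlaps = pred.start <= gold.end and gold.start <= pred.end
    return pred.label == gold.label and overlaps


gold = Entity(3, 5, "protein")
pred = Entity(4, 5, "protein")  # misses the first token of the gold mention
print(strict_match(pred, gold), sloppy_match(pred, gold))  # prints False True
```

A prediction with a slightly wrong boundary is therefore an error under strict matching but a hit under sloppy matching, which is why the sloppy score is the higher of the two.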

Fig. 5 Learning curves for biomedical data comparing (a) SVM based baseline and active learning approaches, (b) CRF based baseline and active learning approaches

Table 18 Evaluation results of AL using ensemble approach on biomedical data with threshold 0.2

Iteration   Recall   Precision   F-measure
1           75.3     73.2        74.23
2           75.6     73.51       74.54
3           75.72    73.8        74.75
4           75.81    73.93       74.86
5           75.95    73.99       74.96
6           76.02    74.05       75.02
7           76.5     74.8        75.64
8           76.62    74.91       75.76
9           76.80    74.95       75.86
10          76.80    74.95       75.86

Table 19 Comparison results of AL on biomedical data using ensemble approach vs. single classifier with threshold 0.2 (F-measure)

Iteration   SVM     CRF     Ensemble
1           71.61   73.23   74.23
2           71.94   73.33   74.54
3           72.15   73.47   74.75
4           72.36   73.57   74.86
5           72.62   73.76   74.96
6           72.80   74.21   75.02
7           73.02   74.46   75.64
8           73.02   74.7    75.76
9           73.02   74.7    75.86
10          73.02   74.8    75.86



Note that the proposed active learning technique achieves better results than the other existing systems right after the first iteration. This is because of the use of a diverse set of features. However, it should be noted that in our system we identify and implement features without using any domain knowledge and/or resources. Note also that we initially trained the systems on training data of 450 K word forms.

Comparison with existing active annotation techniques: We compare the results obtained using the proposed active learning techniques with those of some existing active learning based techniques. In [27] the authors proposed a multi-criteria based active learning technique for NER. They evaluated their technique on biomedical and English data sets, achieving an average F-measure of 83.3 % for English data and 63.3 % for biomedical data. In contrast, our proposed approach attains F-measure values of 88.69 % for English data and 75.86 % for biomedical data. Thus the proposed approach improves over the approach of [27] by 5.39 and 12.56 F-measure percentage points for English and biomedical data, respectively. One possible explanation for this considerable improvement is the use of rich features (described in Sect. 4).

We also executed the active annotation algorithm proposed in [25], a CRF based active annotation technique, on the Bengali and Hindi data sets. It attains F-measure values of 87.95 and 86.51 % for Bengali and Hindi, respectively, whereas our proposed ensemble technique attains final F-measure values of 88.97 and 88.50 % for the Bengali and Hindi data sets, respectively.

6 Conclusion and future work

In this paper we have proposed two methods for active annotation that could be helpful for many applications where labeled data is scarce and its creation involves considerable time and expense. One method is based on SVM and the other on ensemble learning. Based on SVM, we devised a method to select the uncertain examples to be added to the initial training set. The uncertain examples were selected based on their distances from the separating hyperplane for the two classes. The ensemble approach combines both SVM and CRF. For CRF, the uncertain samples are selected based on the marginal probabilities. The ensemble utilizes both concepts, viz. distance from the separating hyperplane and marginal probability. The proposed system is evaluated on the problem of NER, an important pipelined module in many NLP application areas. Experiments were conducted on two resource-poor Indian languages, namely Bengali and Hindi. In addition, the systems have also been evaluated on English and biomedical texts. We obtain good accuracies for all the domains. The ensemble method clearly outperforms the SVM-based method.
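The two uncertainty notions summarized above can be illustrated with a minimal sketch. It is not the exact selection procedure of this paper: we assume that an instance is uncertain when the gap between its two best class scores falls below a threshold, that the same threshold applies to SVM decision values and CRF marginal probabilities, and that the ensemble selects an instance when either classifier is uncertain. All names (`top_two_gap`, `select_uncertain`) and the toy scores are hypothetical.

```python
from typing import List, Sequence


def top_two_gap(scores: Sequence[float]) -> float:
    """Gap between the best and second-best class scores (needs >= 2 classes).

    A small gap means the classifier hesitates between two classes.
    """
    best, second = sorted(scores, reverse=True)[:2]
    return best - second


def select_uncertain(
    svm_scores: Sequence[Sequence[float]],     # per-instance distances from the hyperplanes
    crf_marginals: Sequence[Sequence[float]],  # per-instance marginal probabilities
    threshold: float = 0.2,                    # assumed to apply to both score types
) -> List[int]:
    """Return indices of instances to send for annotation.

    Assumed combination rule: select if EITHER classifier is uncertain.
    """
    selected = []
    for i, (svm, crf) in enumerate(zip(svm_scores, crf_marginals)):
        if top_two_gap(svm) < threshold or top_two_gap(crf) < threshold:
            selected.append(i)
    return selected


# Toy scores for three tokens over three classes (illustrative values only).
svm_scores = [[1.4, -0.9, -1.1], [0.15, 0.05, -1.0], [0.8, -0.2, -0.7]]
crf_marginals = [[0.95, 0.03, 0.02], [0.50, 0.45, 0.05], [0.55, 0.40, 0.05]]
print(select_uncertain(svm_scores, crf_marginals))  # prints [1, 2]
```

In this toy run the first token is confidently labeled by both classifiers and is skipped, the second is near the SVM margin, and the third is selected only because the CRF marginals are close.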

This is a highly accurate and scalable technique which can easily be used in other information extraction problems. The high accuracy is due to the maximum-margin nature of SVMs and the capability of CRFs to model correlations between neighboring output tags. The system is scalable for training SVM and CRF classifiers. The system is easy to use because it utilizes existing software in a straightforward way.

Table 20 Comparison with the existing approaches

System                        Used approach             Domain knowledge/resources                          FM
Our proposed system           Ensemble based active     POS, phrase                                         75.86
                              learning (CRF and SVM)
Guo Dong and Jian [48] final  HMM, SVM                  Name alias, cascaded NEs dictionary, POS, phrase
Guo Dong and Jian [48]        HMM, SVM                  POS, phrase                                         64.1
Kim et al. [54]               Two-phase model           POS, phrase, rule-based component                   71.19
                              with ME and CRF
Finkel et al. [53]            CRF                       Gazetteers, web-querying, surrounding abstracts,    70.06
                                                        abbreviation handling, BNC corpus, POS
Settles [52]                  ME                        POS, semantic knowledge sources of 17 lexicons      70.00
Saha et al. [38]              ME                        POS, phrase                                         67.41
Park et al. [51]              ME                        POS, phrase, domain-salient words using WSJ,
                                                        morphological patterns, collocations from Medline
Song et al. [49] final        SVM, CRF                  POS, phrase, virtual sample                         66.28
Song et al. [49] base         SVM                       POS, phrase                                         63.85
Ponomareva et al. [50]        HMM                       POS                                                 65.7



More work can be done in this area using more than two classifiers. Apart from this, genetic algorithm and multi-objective optimization based feature selection techniques will be employed for determining an appropriate set of features. The proposed approach will be applied to other information extraction problems in natural language and



