


Urdu Named Entity Recognition and Classification System

Using Artificial Neural Network

MUHAMMAD KAMRAN MALIK, Punjab University College of Information Technology (PUCIT),

University of the Punjab, Lahore, Pakistan

Named Entity Recognition and Classification (NERC) is a process of identifying words and classifying them into person names, location names, organization names, and so on. In this article, we discuss the development of an Urdu Named Entity (NE) corpus, called the Kamran-PU-NE (KPU-NE) corpus, for three entity types, that is, Person, Organization, and Location, marking the remaining tokens as Others (O). We use two supervised learning algorithms, the Hidden Markov Model (HMM) and the Artificial Neural Network (ANN), for the development of the Urdu NERC system. We annotate the 652,852-token corpus, taken from 15 different genres, with a total of 44,480 NEs. The inter-annotator agreement between the two annotators in terms of the Kappa (k) statistic is 73.41%. With the HMM, the highest recorded precision, recall, and f-measure values are 55.98%, 83.11%, and 66.90%, respectively, and with the ANN, they are 81.05%, 87.54%, and 84.17%, respectively.

CCS Concepts: • Computing methodologies → Information extraction; Neural networks;

Additional Key Words and Phrases: Resource Poor Languages, Deep Learning, NER using Deep Learning, Urdu POS tagged Data, NER Data, Urdu word2vec

ACM Reference format:
Muhammad Kamran Malik. 2017. Urdu Named Entity Recognition and Classification System Using Artificial Neural Network. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 17, 1, Article 2 (September 2017), 13 pages.
https://doi.org/10.1145/3129290

1 INTRODUCTION

The Urdu language is written in the Arabic script, from right to left [1]. It is the national language of Pakistan and one of the state languages of India. It is spoken in more than 22 countries, and it has more than 60.6 million first-language speakers and over 490 million total speakers [2]. Computational work on Urdu started in the early 1980s [3].

Named Entity Recognition and Classification (NERC) is a process of identifying real-world objects and classifying them into person names, location names, and organization names, collectively called Named Entities (NEs). NERC systems are used to improve the results of Information Extraction (IE), Machine Translation (MT), and many other Natural Language Processing (NLP) applications. For the development of an Urdu NERC system, we tag Urdu NE data and compare the results of the Hidden Markov Model (HMM) and the Artificial Neural Network (ANN). The following are the main contributions of our work:

Authors’ addresses: M. K. Malik, Punjab University College of Information Technology (PUCIT), Quaid-e-Azam Campus,

University of the Punjab, Lahore Pakistan.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee

provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and

the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires

prior specific permission and/or a fee. Request permissions from [email protected].

© 2017 ACM 2375-4699/2017/09-ART2 $15.00

https://doi.org/10.1145/3129290

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 17, No. 1, Article 2. Publication date: September 2017.


• The development of the KPU-NE corpus of 0.65 million tokens from 15 different genres
• The generation of word vectors of Urdu from more than 214 million tokens using the Continuous Bag of Words (CBOW) model [34]
• To the best of our knowledge, the first NERC experiments using ANN on Urdu data
• The comparison of NERC results between HMM and ANN

2 LITERATURE REVIEW

Annotated corpora play a vital role in the development of systems like NERC, part-of-speech (POS) taggers, and phrase chunkers. Most South Asian languages, especially Urdu, lack such resources. Some work has been done on Urdu Named Entity Recognition (NER) using the few Urdu NE tagged corpora available for experimentation. Presently, the following Urdu NE corpora are available, along with their limitations:

• Enabling Minority Language Engineering (EMILLE) (only 200,000 tokens) [5]
• Becker-Riaz corpus (only 50,000 tokens) [6]
• International Joint Conference on Natural Language Processing (IJCNLP)1 workshop corpus (only 58,252 tokens)
• Computing Research Laboratory2 (CRL) annotated corpus (only 55,000 tokens)

The issue with the EMILLE corpus is that although it contains a large number of tokens, it does not have many NEs relative to its size [7]. In the Becker-Riaz corpus, the ratio of NEs to tokens is high, but the corpus is not publicly available [7]. The NER workshop at IJCNLP 2008 provided a free corpus containing Urdu training and testing data for 12 NE types. CRL annotated a corpus of 55,000 words for Urdu MT; the counts for Location (LOC), Organization (ORG), and Person (PER) names were 1,262, 1,258, and 1,772, respectively [8]. Many techniques have been proposed to solve the NERC problem, using rule-based approaches

[9] and supervised learning approaches like the HMM-based IdentityFinder [10], the MaxEnt-based recognizer MENE [11], Conditional Random Fields (CRFs) [12], and so on. The following systems used the IJCNLP dataset for the development of Urdu NER. In [8], the author designed an IE system for Urdu, with NER as one of its subparts; the f-measure values for Urdu NER using Maximum Entropy (ME) and CRF were 55.3% and 68.9%, respectively. In [13], NERC systems for South and Southeast Asian languages like Hindi, Urdu, Bengali, Oriya, and Telugu were developed using CRF; the system covered 12 NE classes using different features and achieved an f-measure of 35.52% for Urdu. Reference [14] presents an ME-based NER system for Hindi, Bengali, Telugu, Oriya, and Urdu, with an f-measure of 35.47% for Urdu. [15] describes a CRF-based NERC system for Hindi, Bengali, Telugu, Oriya, and Urdu that uses a hybrid approach and achieves an f-measure of 43.46% for Urdu. [16] describes an NERC system that uses CRF, HMM, and rules for five languages including Urdu; using HMM and hand-crafted rules, the system achieves an f-measure of 44.73% for Urdu. [7] describes a hand-crafted rule-based system for the NERC problem because of the scarcity

of online available resources and the availability of only a limited annotated corpus for training. The system used 2,262 documents taken from the Becker-Riaz corpus. Of these, 200 documents were used to construct a set of rules for NEs like person name, designation name, location

1. http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi
2. https://crl.ucsd.edu/corpora/


Table 1. Summary of Existing Work on the Urdu NER

Ref.  Algorithm        Dataset      f-measure
[8]   CRF              CRL          68.9%
[13]  CRF              IJCNLP       35.53%
[14]  ME               IJCNLP       35.47%
[15]  Hybrid approach  IJCNLP       43.46%
[16]  HMM + Rules      IJCNLP       44.73%
[7]   Rules            Becker-Riaz  91.1%
[17]  Rules            IJCNLP       88.1%
[19]  Bigram           CRL          75.83%

name, date, number, and organization. In the 2,262 documents, there were 206 unique NEs. The rule-based approach extracted 187 NEs, of which 171 were true NEs. The results showed a recall of 90.7% and a precision of 91.5%, giving an f-measure of 91.1%. Without tuning any rule, the f-measure on the IJCNLP 2008 NER workshop data was 72.5%, and after tuning it was 81.6%. The NER system described in [17] also used a rule-based approach to identify different NEs in the

Urdu text. Rules were used for extracting numbers, non-numeral numbers, dates, and times. Suffix matching was used to identify locations, person names, and terms. The system also used different gazetteers for titles, names, and locations. It used news data for testing, because news data usually contain many NEs. The system used two test sets: the first consisted of 12,032 tokens representing news and articles related to politics, and the second consisted of 150,243 tokens representing news data related to science topics and business news. The accuracy was 60.09% for the first test set and 88.1% for the second. [18] describes a two-stage and a four-stage bootstrapped model for the development of the

Urdu NERC system. It used CRF with a 10-fold cross-validation test to evaluate the approach. The f-measure values for the two-stage and four-stage models were 55.3% and 68.9%, respectively. The system described in [19] used n-gram models (unigram and bigram) for Urdu NE. A

gazetteer list was used with the unigram and bigram approaches, complemented with different smoothing techniques for the bigram model. The NE tagged corpus was downloaded from CRL. The NERC system results are 65.21% precision, 88.63% recall, and 75.14% f-measure using the unigram approach. Using the bigram approach with Backoff smoothing, the results were 66.2% precision, 88.18% recall, and 75.83% f-measure. Table 1 shows a summary of existing work on the Urdu NER system.

3 DATA ACQUISITION

The KPU-NE corpus is taken from two sources. The first corpus of 113,686 tokens, covering 15 different domains/genres (book reviews, culture, education, entertainment, health, interviews, letters, novels, press, religion, science, short stories, sports, technology, and translation of foreign literature), was acquired from the Centre for Language Engineering (CLE)3. The CLE took the data from Urdu Digest [20] and assigned POS tags to it [21].

The second corpus of more than 214 million tokens was acquired from “NewsLink”4 throughscraping between 2014 and 2015. These data are categorized into religion, politics/press,

3. http://cle.org.pk/
4. http://newslink.pk/


Table 2. Genrewise Details of Data

Genre                              NE      Tokens
Book Reviews                       194     4,827
Culture                            45      8,530
Education                          6,572   67,335
Entertainment                      6,819   74,922
Health                             7,642   79,675
Interviews                         53      11,804
Letters                            104     7,011
Novels                             397     4,191
Press                              5,654   98,749
Religion                           8,675   119,848
Science                            104     8,350
Short Stories                      184     6,841
Sports                             7,654   150,291
Technology                         187     3,486
Translation of Foreign Literature  196     6,992
Total                              44,480  652,852
Average Length of Sentence                 18.67

education, entertainment, health, and sports. Using the NewsLink corpus, we generated the Urdu word vectors. The KPU-NE corpus of 652,852 tokens, consisting of 113,686 tokens from the CLE corpus and 539,166 tokens from the NewsLink corpus, was used for NE5 tagging. We tokenize our corpus on the basis of spaces and punctuation marks. For example, in the sentence

(Ali, Umar, Shan, aur Hassan dost hain) (Ali, Umar, Shan, and Hassan are friends), our tokenizer returns nine tokens, including punctuation marks, ignoring multiple spaces between tokens. Genrewise details are given in Table 2.
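The space-and-punctuation tokenization just described can be sketched as follows; `tokenize` is a hypothetical helper (the paper's tokenizer is not published), shown on the transliterated example sentence.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens,
    ignoring runs of whitespace (a sketch of the space-and-punctuation
    scheme described above, not the authors' tokenizer)."""
    # \w+ grabs maximal word runs; [^\w\s] emits each punctuation
    # mark as its own token. Multiple spaces are skipped implicitly.
    return re.findall(r"\w+|[^\w\s]", text)

# The transliterated example yields nine tokens, punctuation included.
tokens = tokenize("Ali, Umar, Shan aur Hassan dost hain")
```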

3.1 Urdu Word2vec

We used the CBOW model [34] with negative sampling to develop Urdu word vectors from the NewsLink corpus. We use different parameters to develop 12 variations of Urdu word vectors and assign an ID to each. For example, word2vec_ID 1 denotes the word vectors of vocabulary V with dimension d = 50 and context size 5; similarly, word2vec_ID 2 denotes the word vectors of vocabulary V with dimension d = 50 and context size 10; and so on. Details of the Urdu word2vec variants with different parameters are given in Table 3.

To handle unknown words, all words are tokenized on the basis of spaces and punctuation marks, and we compute the frequency of each word. All words with frequency less than 6 are replaced with the keyword "UNK." When analyzing the resulting Urdu word vectors, we found that some words were not properly tokenized; for example, (kar + dee) (did) and (kar + rahay) (doing) become single tokens due to the tokenization problems of Urdu text.
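The frequency threshold for unknown words can be sketched like this; `replace_rare_with_unk` and its `min_freq` parameter are illustrative names, with `min_freq=6` matching the "frequency less than 6" rule above.

```python
from collections import Counter

def replace_rare_with_unk(tokens, min_freq=6):
    """Map every token whose corpus frequency is below min_freq to the
    keyword "UNK", as done before training the word vectors
    (illustrative sketch, not the authors' preprocessing code)."""
    freq = Counter(tokens)
    return [tok if freq[tok] >= min_freq else "UNK" for tok in tokens]
```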

3.2 Annotation Guidelines

For the development of the KPU-NE tagged corpus, three annotators were used in the manual assignment of NE tags to words. Of the three, two were asked to assign NE tags to each word, and if there was a conflict in tagging, then the third one was asked to resolve the issue and his

5. The Urdu NE dataset can be accessed by emailing the corresponding author.


Table 3. Details of Urdu Word Vectors

word2vec_ID  Dimensions  Context
1            50          5
2            50          10
3            50          15
4            100         5
5            100         10
6            100         15
7            150         5
8            150         10
9            150         15
10           200         5
11           200         10
12           200         15

decision was considered final. We focused on only three NEs, that is, PER, LOC, and ORG; the remaining tokens were marked O. The annotation guidelines covered three things: "what to annotate," "what not to annotate," and the special cases handled by "how much to annotate." We mark the maximal sequence of words as a PER, LOC, or ORG entity. The following guidelines were given to the annotators for the manual annotation of Urdu NEs.

• PER: The PER tag is assigned to names, nicknames, or aliases of living and deceased humans and of fictional characters, for example, (Ali), (Barack Obama), (Ahmed), (Allama Iqbal), (Mani), and (Pomi). Family names are also marked as person names, such as (Malik), (Khokhar), and (Rajpoot). Titles, relation names, pronouns, reflexive pronouns, name prefixes, and God's names like Allah should not be marked as PER. Some examples are (Mr.), (Sadar) (President), (Professor), (Manager), (General), (Ammi) (Mother), (Bhai) (Brother), (Abbu) (Father), (Main) (I), (Tum) (you), (Junior/Jr.), and (Senior/Sr.). Here is an example of how much to annotate in the sentence (Doctor [Syed Mansoor Sarwar NE=PER] [Kamran NE=PER] kay ustad hain) (Doctor Syed Mansoor Sarwar is Kamran's teacher): Syed Mansoor Sarwar is marked as a single NE.

• ORG: All types of local companies, multinational companies, stock exchanges, unions, agencies, corporations, media groups, sports teams, military groups, political parties, and all other organizational structures created by a group of people are marked as ORG, for example, (K.S.E), (Karachi Stock Exchange), (Congress), and (All India Muslim League). Product names, brand names, and so on are not marked as ORG, for example, (iPhone 5) and (Samsung Galaxy Note). An example of how much to annotate is ([Karachi Stock Exchange NE=ORG] [Pakistan NE=LOC] main buland tareen satah per hai) (Karachi Stock Exchange is at the highest level in Pakistan).

• LOC: All human-made structures and politically defined places, like the names of countries, provinces, states, cities, streets, highways, mountains, seas, rivers, airports, railway stations, and such, are marked as LOC, for example, (Lahore), (Punjab), (United Arab Emirates), (Ravi River), (Dead Sea),


and (Allama Iqbal International Airport). Generic locations like (sub say purani amarat) (the oldest building) and (naya airport) (the new airport) should not be marked as LOC. Similarly, nationalities like (American) are not considered LOC. When a location is part of an organization name, we also do not mark that word as LOC; for example, (Pakistan Cricket Board) is not marked as LOC. An example of how much to annotate is ([Lahore NE=LOC] [Islamia Jamhoria Pakistan NE=LOC] ka aik shahir hai) (Lahore is a city of the Islamic Republic of Pakistan).

• Other: All remaining words, including closed-class words like (main) (I), (tum) (you), (woh) (he), and (hum) (we), prepositions, postpositions, adjectives, adverbs, verbs, punctuation marks, and names of animals, books, events, movies, laws, diseases, software, and so on, are marked O.

We use a separate guideline for handling multi-name expressions. In a sentence like (Us nay [Shumali aur Janobi Punjab NE=LOC] ka doura kiya) (he visited North and South Punjab), (North and South Punjab) is marked as a single named entity, whereas in the sentence (Us nay [Shumali Punjab NE=LOC] aur [Janobi Punjab NE=LOC] ka doura kiya) (he visited North Punjab and South Punjab), the names (North Punjab) and (South Punjab) are marked as two separate NEs. Similarly, if there is a possessive construction with two NEs, then we mark each NE separately. For example, in ([Punjab University NE=ORG] ka [PUCIT NE=ORG] department buhat acha hai) (Punjab University's PUCIT department is very good), we mark "Punjab University" and "PUCIT department" as two separate NEs. Annotators perform the following steps while carrying out the tagging process.

• First, verify whether a word represents PER, LOC, or ORG; this decision is based on the context of the relevant word(s). For example, in the sentence (main ganay ka riaz kar raha houn) (I am practicing the song), (Riaz) is marked as O, while in the sentences (main riaz giya) (I went to Riaz) and (Riaz mera acha dost hai) (Riaz is my good friend), (Riaz) is marked as LOC and PER, respectively.
• Annotators are asked to disambiguate such cases and annotate the expression with the category best defined by the context in which it is used. If an identified word is an NE, then the next step is to determine the category of the NE, that is, PER, ORG, or LOC. This decision is based on the context of the word(s). For example, in the above example, (Riaz) could be LOC or PER, depending on its context in the given sentence.
• Complete data are annotated using the SSF6 notation.

Some interesting confusions arise in the following examples:
• (Junaid Jamshed) can be PER or ORG. An appropriate NE tag is selected on the basis of the context in which the two words occur in a sentence.
• In (Quaid-e-Azam kay yomay pedaish kay moqa per Wazer-e-Azam nay mizar Quaid-e-Azam per hazri di) (On Quaid-e-Azam's birthday, the Prime Minister visited Quaid-e-Azam's tomb), the annotator marked both occurrences of (Quaid-e-Azam) as person names, whereas the first occurrence is the name of a PER and the second is the name of a LOC.

6. http://www.iiit.ac.in/techreports/2009_85.pdf


Fig. 1. NE tagging tool for the Urdu language.

• In (main nay Pepsi pi) (I drank Pepsi) and (main pepsi compnay giya) (I went to Pepsi Company), one annotator marked Pepsi as ORG in both cases, whereas the first occurrence is just the name of a drink and the second is the name of an ORG. The third annotator decided to tag the first occurrence of Pepsi as O and the second occurrence as ORG.

• There was some confusion about whether we should assign an NE tag to (Mother of the Nation) and (He (peace be upon him)) or not. We decided that the PER tag should be assigned to these words/terms.

3.3 Inter-Annotator Agreement

Using an application as shown in Figure 1, two annotators assign NE tags to words. Most of the words are assigned one of the three tags, PER, LOC, or ORG; the remaining words are automatically marked as O by our system.

For the inter-annotator agreement and experimentation, we count lexical-level entities; that is, (Junaid Jamshed) is counted as two entities for the PER tag, and (Islamia Jamhoria Pakistan) (Islamic Republic of Pakistan) is counted as three LOC entities. The inter-annotator agreement is calculated using the Kappa coefficient, k, with the following equation [23]:

k = ((Ao − Ae) / (100 − Ae)) * 100.

Now, we need to compute the observed agreement (Ao) and the chance agreement (Ae) using the information in Table 4 [22]. Here,

Ao = (total number of tokens on which both annotators agree, including O) / (total number of tokens), and


Table 4. Contingency Table Between Two Annotators

       PER     LOC     ORG     O        Total
PER    17,902  220     82      1,991    20,195
LOC    341     18,640  618     2,155    21,754
ORG    22      341     6,517   2,572    9,452
O      6,256   7,167   7,468   580,560  601,451
Total  24,521  26,368  14,685  587,278  652,852

Table 5. Final Count of Named Entities

Named Entities  Count
PER             18,150
LOC             18,728
ORG             7,602
Total           44,480

Ae = Σc P(cA | n) · P(cB | n),

where c ranges over PER, LOC, and ORG; n is the total number of tokens; cA is the count of tokens to which NE c is assigned by annotator A; and cB is the count of tokens to which NE c is assigned by annotator B. There are two reasons for not including O in the calculation of Ae. First, we are interested in PER, LOC, and ORG. Second, the count of O is very high, and its inclusion would increase the k value in a way that does not truly reflect the inter-annotator agreement. Thus,

Ao = 95.52%, Ae = 83.15%, and

k = ((95.52 − 83.15) / (100 − 83.15)) * 100 = 73.41%.

The value of k can be interpreted using various scales. The closer the value of k is to 1, the better, because it reflects better agreement between the two annotators. Thus, the value of k (0.7341) indicates that the agreement between the two annotators is good. Table 5 shows the final counts for the NEs with respect to each entity after resolving conflicts between the annotators.
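The Ao/Ae arithmetic above can be sketched as a small function; `kappa_from_counts` is a hypothetical helper that takes the agreement count and each annotator's per-category totals (as in the marginals of Table 4) and returns k as a percentage.

```python
def kappa_from_counts(agree, total, a_counts, b_counts):
    """Cohen's kappa expressed in percent, as in the text:
    k = ((Ao - Ae) / (100 - Ae)) * 100, where Ao is the observed
    agreement and Ae is the chance agreement summed over the
    categories supplied in a_counts/b_counts."""
    ao = agree / total * 100.0
    ae = sum((a_counts[c] / total) * (b_counts[c] / total)
             for c in a_counts) * 100.0
    return (ao - ae) / (100.0 - ae) * 100.0
```

For instance, with 80 of 100 tokens agreed and two equally likely categories, Ao = 80 and Ae = 50, giving k = 60.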

3.4 Findings During Tagging Process

Here are some observations that may help during the development of an Urdu NER system.

• If there is a sequence of words with the Proper Noun (NNP) POS tag and the last word is (Khan), (Malik), (Singh), and so on, then the NNP-tagged words before the last one may form a person NE.

• Words like (Razi Allah Tala Anhu) (Allah is pleased with him), (Rahmatullah) (May God have mercy on him), (Alayhis Salam) (peace be upon him), and so on come after a person name, so we can assign them the PER tag.


• If (Mr.), (Mrs.), (Ameer), (Son of), (Sir), or (Sir) comes before a word with the NNP POS tag, then there is a good chance that the following word is a PER name.

• If (Bazar), (Abad), (Pur), or (Nagar) comes after a word with the NNP POS tag, then there is a chance that the preceding word is a LOC name.

• If (University) or (Foundation) comes at the end of a name and the preceding word's POS tag is NNP, then the preceding word is part of an ORG name.
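The cues in the bullets above can be turned into a rough first-pass tagger. The sketch below is illustrative only: the transliterated word lists stand in for the Urdu-script originals, and `heuristic_tag` is not part of the paper's system.

```python
# Hypothetical, transliterated cue lists drawn from the observations above.
SURNAMES = {"Khan", "Malik", "Singh"}
PERSON_PREFIXES = {"Mr.", "Mrs.", "Ameer", "Sir"}
LOC_SUFFIXES = {"Bazar", "Abad", "Pur", "Nagar"}
ORG_SUFFIXES = {"University", "Foundation"}

def heuristic_tag(tokens, pos_tags):
    """Assign a tentative NE tag per token using suffix/prefix cues;
    unmatched tokens stay 'O'. A sketch, not the authors' tagger."""
    tags = ["O"] * len(tokens)
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        if pos != "NNP":
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        prev = tokens[i - 1] if i > 0 else ""
        if nxt in SURNAMES:
            tags[i] = "PER"      # NNP followed by a family name
        elif prev in PERSON_PREFIXES:
            tags[i] = "PER"      # honorific before an NNP
        elif nxt in LOC_SUFFIXES:
            tags[i] = "LOC"      # e.g. "...Abad", "...Pur"
        elif nxt in ORG_SUFFIXES:
            tags[i] = "ORG"      # e.g. "X University"
    return tags
```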

The following are examples where one word may or may not be an NE, and one word may take different NE tags:
• (June/John) can be June (the name of a month) or John/Joan (a person name); (Jaan/life) can be jaan (life) or John (a person name).
• (Mungal) could be the name of a day (Tuesday) or a PER name, like Mangal Panday.
• (Riaz) and (Shuja/brave) could be PER, LOC, or O.
• (Praim/love), (Sadaat/opportunity), (Hassan/beautiful), (Kamal/super), and (Gulzar/garden) could be PER names or O. Some more examples are (Ghulam/slave), (Azam/big), (Baig/bag), (Azeem/great), (Faisal), (Ahmed), (Sahil/beach), and (Gul/flower).

4 NEURAL NETWORK FOR NAMED ENTITY

[29, 30, 31, 32, 33] have used ANNs for solving different NLP problems, including NER. We have taken inspiration from the work described in these articles to solve the Urdu NER problem and use an ANN for the development of our Urdu NER system. We consider NER a classification problem in which we have to assign a single class to each word; in our case, there are four classes: PER, ORG, LOC, and O. Our ANN model has three layers: an Input Layer (IL), a Hidden Layer (HL), and an Output Layer (OL). Suppose we have the following sequence of words in our training data: (Ali apnay ghar say bazar jaa raha tha) (Ali was going to the bazar from his home). We associate a label y with each word; the possible values of y are 0, 1, 2, or 3, for O, ORG, LOC, and PER, respectively. For our experiments, we used word vectors and a context window approach with window sizes 3 and 5: if the context is 1, the window size is 3, and if the context is 2, the window size is 5. Suppose (Ali) is at the first index and we represent it with the vector x1; similarly, we represent the whole sentence as the sequence of vectors (x1, x2, ..., x8).

To classify the first and last words when the context is C, we append C copies of the special tokens <s> and </s> at the front and back of each sentence, respectively. Assume C = 1; then the window size is 3, and our data look like (<s>, x1, x2, ..., x8, </s>). We convert the data into {(<s>, x1, x2), (x1, x2, x3), (x2, x3, x4), ..., (x6, x7, x8), (x7, x8, </s>)}, and each window is associated with the label y of its center word. For context size 1 (window size 3), the input is represented as x(t) = [Lx(t−1), Lx(t), Lx(t+1)], and for context size 2 (window size 5), the input is represented as x(t) = [Lx(t−2), Lx(t−1), Lx(t), Lx(t+1), Lx(t+2)].

Here, the inputs x(t−2), x(t−1), x(t), x(t+1), x(t+2) are one-hot vectors and L is the word representation matrix generated using the CBOW model, L ∈ R^(d x |V|), with each column Li being the vector for a particular word i = x(t). |V| is the size of the vocabulary and d is the number of dimensions of the word vectors; in our case, d can be 50, 100, 150, or 200. Let d = 100 and the window size be 5. Then the IL size is 500, the HL has dimension 100, and the OL size is the number of classes, that is, 4.
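The padding-and-windowing scheme above can be sketched as follows; `context_windows` is a hypothetical helper that returns one window per word, whose center word carries the label y.

```python
def context_windows(sentence, context=1, pad_start="<s>", pad_end="</s>"):
    """Pad the sentence with `context` copies of the special tokens on
    each side and return one (2*context + 1)-sized window per word,
    mirroring the windowing scheme described above (a sketch)."""
    padded = [pad_start] * context + list(sentence) + [pad_end] * context
    size = 2 * context + 1
    return [padded[i:i + size] for i in range(len(sentence))]
```

For a three-word sentence with context 1 this produces the windows (<s>, x1, x2), (x1, x2, x3), and (x2, x3, </s>), as in the text.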


Table 6. Detail Results of Urdu NERC Using HMM

N-Grams  Parameters (handling of data sparseness)  Precision  Recall  F-measure
Unigram  replace zero frequency with 0.5           50.1       84.2    62.82
Bigram   replace zero frequency with 0.5           52.19      83.11   64.12
Trigram  replace zero frequency with 0.5           55.79      82.91   66.70
Trigram  linear interpolation for smoothing        55.98      83.11   66.90
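The linear-interpolation smoothing in the last row of Table 6 combines unigram, bigram, and trigram estimates. The sketch below uses hypothetical probability tables and lambda weights; in TnT the weights come from deleted interpolation.

```python
def interpolated_trigram_prob(t1, t2, t3, uni, bi, tri, lambdas):
    """P(t3 | t1, t2) = l1*P(t3) + l2*P(t3 | t2) + l3*P(t3 | t1, t2),
    with l1 + l2 + l3 = 1. `uni`, `bi`, and `tri` are hypothetical
    maximum-likelihood lookup tables; unseen events fall back to 0."""
    l1, l2, l3 = lambdas
    return (l1 * uni.get(t3, 0.0)
            + l2 * bi.get((t2, t3), 0.0)
            + l3 * tri.get((t1, t2, t3), 0.0))
```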

We compute predicted_y as

predicted_y = softmax(U (tanh(W x(t) + bias1)) + bias2),

where W is the weight between the IL and HL and U is the weight between the HL and OL. We use a cross-entropy loss function:

J(θ) = − Σ_{k=1..4} y_k log(predicted_y_k),

where θ comprises W, U, bias1, and bias2. To compute the overall cost for the training data, we average this J(θ) as computed on each training example. We also use L2 regularization to improve our results. The dimensions of the four parameters are W ∈ R^(100×500), U ∈ R^(4×100), bias1 ∈ R^100, and bias2 ∈ R^4. Finally, we use the backpropagation algorithm with a learning rate of 0.05 for learning the weights and use forward propagation for prediction.
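The forward pass and loss above can be sketched in a few lines. The weights here are random placeholders with the stated dimensions, not the trained values:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the output layer."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Placeholder parameters with the paper's dimensions for d = 100, window size 5:
# W in R^(100 x 500), U in R^(4 x 100), bias1 in R^100, bias2 in R^4.
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 500)) * 0.01
U = rng.standard_normal((4, 100)) * 0.01
bias1, bias2 = np.zeros(100), np.zeros(4)

def forward(x_t):
    """predicted_y = softmax(U(tanh(W x(t) + bias1)) + bias2)."""
    hidden = np.tanh(W @ x_t + bias1)
    return softmax(U @ hidden + bias2)

def cross_entropy(predicted_y, y):
    """J = -sum_k y_k log(predicted_y_k) for a one-hot label y."""
    return -np.sum(y * np.log(predicted_y))

x_t = rng.standard_normal(500)       # one 500-dimensional window input
y = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot label, e.g. LOC (index 2)
p = forward(x_t)
print(p.shape, p.sum(), cross_entropy(p, y))
```

The gradients of this loss with respect to W, U, bias1, and bias2 are what backpropagation computes before the 0.05-learning-rate update.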

5 RESULTS AND DISCUSSION

In this section, we explain the details of two experiments. For both, we use different parameters to improve the accuracy of our Urdu NERC system, and we divide the KPU-NE corpus into disjoint training (70%), validation (15%), and test (15%) sets. In the first experiment, we use a statistical tagger called Trigrams'n'Tags (TnT) [28] to develop our bigram and trigram HMM models (a supervised learning algorithm) from our training corpus. By default, TnT uses a trigram language model with linear interpolation to handle the data sparseness problem. The weights of the linear interpolation are calculated using deleted interpolation [26], and unknown words are handled using suffix information [27]. The f-measures using unigram, bigram, trigram, and trigram information with linear interpolation are 62.82%, 64.12%, 66.70%, and 66.90%, respectively. The detailed results of our experiments are given in Table 6.

For the second experiment, we use the ANN approach for the development of Urdu NER. Table 7

shows the results of the ANN using window sizes 3 and 5 with a learning rate of 0.05. We perform a total of 24 experiments using the ANN: 12 using window size 3 and 12 using window size 5. In each experiment, we use the learned word vectors. As the results show, the f-measure increases with the dimensions of the word vectors and with the context size. The highest f-measure is achieved when the word vectors are created using 200 dimensions with a context size of 15 and the ANN uses window size 5. The results are 81.05% precision, 87.54% recall, and 84.17% f-measure. From the results of the two experiments, we conclude that word vectors with higher dimensions and more context produce better results in all cases, except between word2vec versions 3 and 4 and between word2vec versions 4 and 5. It can also be observed that window size 5 outperforms window size 3 in all cases, except when the word2vec version is 4. Comparing the results of the ANN and the HMM, we conclude that the ANN outperforms the HMM in all settings.
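The precision, recall, and f-measure values reported in both experiments follow the standard definitions. A minimal sketch with illustrative counts, not taken from the paper's evaluation:

```python
def prf(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and f-measure from token-level counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Made-up counts for illustration only.
p, r, f = prf(true_positives=80, false_positives=20, false_negatives=40)
print(round(p, 2), round(r, 2), round(f, 2))
```

The f-measure is the harmonic mean of precision and recall, which is why the HMM's high recall (83.11%) cannot compensate for its low precision (55.98%) in Table 6.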




Table 7. Detailed Results of Urdu NER Using ANN

Word2Vec_ID | Window Size | Precision | Recall | F-measure
1           | 3           | 64.24     | 71.35  | 67.61
1           | 5           | 66.12     | 71.57  | 68.74
2           | 3           | 66.48     | 72.17  | 69.21
2           | 5           | 67.49     | 73.74  | 70.45
3           | 3           | 68.43     | 72.92  | 70.60
3           | 5           | 69.56     | 72.31  | 70.91
4           | 3           | 70.63     | 71.36  | 70.99
4           | 5           | 71.60     | 68.48  | 70.01
5           | 3           | 69.57     | 71.34  | 70.44
5           | 5           | 70.84     | 71.50  | 71.17
6           | 3           | 72.24     | 72.36  | 72.30
6           | 5           | 71.62     | 73.73  | 72.66
7           | 3           | 74.51     | 72.08  | 73.27
7           | 5           | 72.75     | 74.42  | 73.58
8           | 3           | 77.22     | 74.81  | 76.00
8           | 5           | 76.07     | 77.26  | 76.66
9           | 3           | 79.67     | 78.45  | 79.06
9           | 5           | 74.59     | 86.44  | 80.08
10          | 3           | 78.41     | 88.34  | 83.08
10          | 5           | 79.55     | 87.46  | 83.32
11          | 3           | 80.43     | 87.36  | 83.75
11          | 5           | 79.25     | 89.46  | 84.05
12          | 3           | 80.05     | 87.96  | 83.82
12          | 5           | 81.05     | 87.54  | 84.17

6 CONCLUSION AND FUTURE WORK

We annotated the KPU-NE corpus for the development of an Urdu NERC system and developed Urdu word vectors using CBOW models. To the best of our knowledge, this is the only comprehensive Urdu NE corpus that covers approximately 15 genres. In addition, no one has previously experimented with Urdu data using word2vec. The Urdu Digest part of our corpus, comprising 113686 words, also has POS information associated with it. POS information plays a vital role in the development of a high-accuracy NERC system. In the future, we plan to annotate the complete data with more comprehensive NEs such as number, time, measure, designation, and title.

We performed experiments using the HMM and ANN without exploiting POS information. As stated earlier, due to incorrect tokenization, some words are not properly tokenized. Thus, the word vectors of some words may cause problems in learning the weights of the ANN. In the future, we will work on improving our tokenization process to develop better word representations of the data. We can also develop the Urdu NERC system using machine-learning algorithms such as CRF, ME, Recursive Neural Networks, and Recurrent Neural Networks by including the POS information.

REFERENCES

[1] S. Hussain. 2003. In Proceedings of the 12th AMIC Annual Conference on E-Worlds: Governments, Business and Civil

Society, Asian Media Information Center, Singapore.

[2] BBC Languages. Guide to Urdu—10 Facts, Key Phrases and the Alphabet. Retrieved May 2, 2012 from

http://www.bbc.co.uk/languages/other/urdu/guide.




[3] S. Hussain. 2008. Resources for urdu language processing. In Proceedings of the 6th Workshop on Asian Language

Resources. 99–100.

[4] R. Grishman and B. Sundheim. 1996. Message understanding conference–6: A brief history. In Proceedings of the

International Conference on Computational Linguistics. 466–471.

[5] P. Baker, A. Hardie, T. McEnery, and B. D. Jayaram. 2003. Corpus data for south asian language processing. In Pro-

ceedings of the 10th Annual Workshop for South Asian Language Processing. EACL.

[6] D. Becker and K. Riaz. 2002. A study in urdu corpus construction. In Proceedings of the 3rd Workshop on Asian Language

Resources and International Standardization—Volume 12. Association for Computational Linguistics, 1–5.

[7] K. Riaz. 2010. Rule-based named entity recognition in urdu. In Proceedings of the 2010 Named Entities Workshop.

Association for Computational Linguistics, 126–135.

[8] S. Mukund, R. Srihari, and E. Peterson. 2010. An information-extraction system for urdu—A resource-poor language.

ACM Trans. Asian Lang. Inf. Process. 9, 4, 15.

[9] D. Farmakiotou, V. Karkaletsis, J. Koutsias, G. Sigletos, C. D. Spyropoulos, and P. Stamatopoulos. 2000. Rule-based

named entity recognition for greek financial texts. In Proceedings of the Workshop on Computational lexicography and

Multimedia Dictionaries (COMLEX’00). 75–78.

[10] D. M. Bikel, R. Schwartz, and R. M. Weischedel. 1999. An algorithm that learns what’s in a name. Mach. Learn. 34,

1–3, 211–231.

[11] A. Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. Dissertation, New York

University.

[12] A. McCallum and W. Li. 2003. Early results for named entity recognition with conditional random fields, feature

induction and web-enhanced lexicons. In Proceedings of the 7th Conference on Natural Language Learning at HLT-

NAACL 2003. Association for Computational Linguistics, Volume 4, 188–191.

[13] A. Ekbal, R. Haque, A. Das, V. Poka, and S. Bandyopadhyay. 2008. Language independent named entity recognition in

indian languages. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP’08).

33–40.

[14] S. Saha, S. Sarkar, and P. Mitra. 2008. A hybrid feature set based maximum entropy hindi named entity recognition.

In Proceedings of the 3rd International Joint Conference on NLP (IJCNLP’08). 343–349.

[15] K. Gali, H. Surana, A. Vaidya, P. Shishtla, and D. M. Sharma. 2008. Aggregating machine learning and rule based

heuristics for named entity recognition. In Proceedings of the 3rd International Joint Conference on Natural Language

Processing (IJCNLP’08). 25–32.

[16] P. P. Kumar and V. R. Kiran. 2008. A hybrid named entity recognition system for south asian languages. In Proceedings

of the 3rd International Joint Conference on Natural Language Processing Workshop on NER for South

and South East Asian Languages (IJCNLP'08). 83–88.

[17] U. Singh, V. Goyal, and G. S. Lehal. 2012. Named entity recognition system for urdu. In Proceedings of COLING:

Technical Papers. 2507–2518.

[18] S. Mukund and R. K. Srihari. 2009. NE tagging for urdu based on bootstrap POS learning. In Proceedings of the 3rd

International Workshop on Cross Lingual Information: Addressing the Information Need of Multilingual Societies. Asso-

ciation of Computational Linguistics, 61–69.

[19] F. Jahangir, W. Anwar, U. I. Bajwa, and X. Wang. 2012. N-gram and gazetteer list based named entity recognition for

urdu: A scarce resourced language. In Proceedings of the 24th International Conference on Computational Linguistics.

[20] Retrieved from http://www.cle.org.pk/clestore/urdudigestcorpus100ktagged.htm.

[21] T. Ahmed, S. Urooj, S. Hussain, A. Mustafa, R. Parveen, F. Adeeba, A. Hautli, and M. Butt. 2014. The CLE urdu POS

tagset. In Proceedings of the Language Resources and Evaluation Conference (LREC'14).

[22] R. Fernández. 2011. Assessing the Reliability of an Annotation Scheme for Indefinites, Measuring Inter-annotator Agree-

ment. MoL Project, Institute for Logic, Language & Computation University of Amsterdam.

[23] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.

[24] Ron Artstein and Massimo Poesio. 2008. Survey article: Inter-coder agreement for computational linguistics. Comput.

Ling. 34, 4, 555–596.

[25] L. E. Baum and T. Petrie. 1966. Statistical inference for probabilistic functions of finite state markov chains. Ann.

Math. Stat. 37, 6, 1554–1563.

[26] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural

language. Comput. Ling. 18, 4, 467–479.

[27] C. Samuelsson. 1993. Morphological tagging based entirely on Bayesian inference. In Proceedings of the 9th Nordic

Conference on Computational Linguistics.

[28] T. Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the 6th Conference on Applied Natural

Language Processing. Association for Computational Linguistics, 224–231.




[29] I. Gallo, E. Binaghi, M. Carullo, and N. Lamberti. 2008. Named entity recognition by neural sliding window. In Pro-

ceedings of the 8th IAPR International Workshop on Document Analysis Systems (DAS’08). IEEE, 567–573.

[30] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving word representations via global context and

multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics:

Long Papers-Volume 1. Association for Computational Linguistics, 873–882.

[31] G. Petasis, S. Petridis, G. Paliouras, V. Karkaletsis, S. J. Perantonis, and C. D. Spyropoulos. 2000. Symbolic and neural

learning for named-entity recognition. In Proceedings of the Symposium on Computational Intelligence and Learning.

58–66.

[32] J. Pennington, R. Socher, and C. D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of

the Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1532–1543.

[33] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res.

3, 1137–1155.

[34] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in

vector space. In Proceedings of the International Conference on Learning Representations Workshop (ICLR’13).

Received February 2016; revised June 2017; accepted July 2017
