Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx. DOI: 10.3233/IFS-151923. IOS Press
Automatic keyphrase extraction for Arabic news documents based on KEA system
Rehab Duwairi a,∗ and Mona Hedaya b
a Department of Computer Information Systems, Jordan University of Science and Technology, Irbid, Jordan
b Department of Computer Science and Engineering, College of Engineering, Qatar University, Doha, Qatar
Abstract. A keyphrase is a sequence of words that play an important role in the identification of the topics that are embedded in a given document. Keyphrase extraction is a process which extracts such phrases. This has many important applications such as document indexing, document retrieval, search engines, and document summarization. This paper presents a framework for extracting keyphrases from Arabic news documents which is based on the KEA system. It relies on supervised learning, Naïve Bayes in particular, to extract keyphrases. Two probabilities are computed: the probability of being a keyphrase and the probability of not being a keyphrase. The final set of keyphrases is chosen from the set of phrases that have high probabilities of being keyphrases. The novel contributions of the current work are that it provides insights on keyphrase extraction for news documents written in Arabic. It also presents an annotated dataset that was used in the experimentation. Finally, it uses Naïve Bayes as a medium for extracting keyphrases.
Keywords: Keyphrase extraction, term indexing, document summarization, document classification, Arabic web content
1. Introduction
Keyphrase extraction is the process of assigning phrases that describe the main topic or important aspects of a document [8, 18, 31]. Keyphrase extraction is very important and has many applications in information retrieval, automatic indexing, text classification, text summarization and tagging, to name a few [7–10, 20]. Traditionally, this was done by a human annotator who would assign a set of keyphrases to a document. Manual annotation is tedious and time consuming, and may not be practical these days given the huge volumes of online documents. Automatic or semi-automatic annotation of documents, on the other hand, employs a computer program to extract keyphrases that describe a document. In the latter case, a human may provide
∗Corresponding author. Rehab Duwairi, Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110, Jordan. Tel.: +962 2 7201000; Fax: +962 2 7201077; E-mail: [email protected].
certain guidelines or hints to the system. Keyphrases could be drawn from a fixed vocabulary (controlled indexing or term assignment); in this case the keyphrases of a document may contain phrases that do not appear in the document. Free indexing, on the other hand, means that the annotators or systems are free to choose keyphrases that describe a document.
Keyphrase extraction has been seen as a classification problem for the English language [16, 27, 28, 31–33] and the Arabic language [9]. Other efforts view this problem as a ranking problem for English [14, 18, 34, 35] and for Arabic [7, 8]; such efforts utilize ranking algorithms to extract keyphrases. The classification viewpoint treats keyphrase extraction as a supervised machine learning task where classifiers such as naïve Bayes classifiers [16, 33] or neural networks [32] are used. The classifiers must first be trained on annotated documents (i.e., documents whose keyphrases are known beforehand). The classifiers perform well when new documents have a similar domain to the training documents.
1064-1246/16/$35.00 © 2016 – IOS Press and the authors. All rights reserved
This paper adopts a classification-based approach for extracting keyphrases from news documents. In particular, it extends the KEA system [16, 33], which is based on the Naïve Bayes classifier, to handle keyphrase extraction from Arabic news documents. The adapted system is called Arabic-KEA. KEA is open-source software, which has encouraged many researchers to adapt it to other languages, such as the work reported in [24] which adapted KEA to the Turkish language.
Arabic-KEA was trained and tested using three datasets: the KP-Miner dataset [8], which consists of 100 short articles generated from the Arabic Wikipedia, and two in-house collected datasets. Several tests were carried out using Arabic-KEA to measure its accuracy as several parameters are changed.
The major contributions of this paper include extensions of KEA's stemming algorithms and stopword files, the collection and annotation of an Arabic dataset, and thorough experimentation which helped us understand keyphrase extraction in Arabic documents. The results reveal that the larger the training dataset, the better the performance of the classifier; the reason is that more data allows the classifier to build more sophisticated models. A second finding is that the choice of stemming algorithm affects the quality of the results: good stemmers help in obtaining better results. Also, the accuracy increases as the number of generated keyphrases increases.
This paper is organized as follows: Section 1 has introduced keyphrase extraction in general and the suggested algorithm in particular. Section 2 provides background information to the reader and places this work in its proper place in the literature. Section 3 explains aspects of the Arabic language that one should consider when dealing with keyphrase extraction. Section 4 describes the suggested framework and the modifications to KEA. Section 5 illustrates the experimentation setup and provides insights on the obtained results. Section 6 concludes this paper and highlights future work.
2. Background and related work
Keyphrase extraction from documents is very important and has many applications, and therefore there are numerous publications that suggest algorithms for approaching the problem (for example, [3, 7–10, 13–19, 22, 23, 25, 27, 28, 31–35]). The majority of these published algorithms deal with English text.
2.1. Keyphrase extraction for non-Arabic texts
Turney [31] was the first to suggest using machine learning for keyphrase extraction; he used the C4.5 classifier to extract phrases. Wang et al. [32] combine several neural networks to extract keyphrases from Chinese and English text. They argue that combining several base classifiers yields better results. Sarkar [28] used a naïve Bayes classifier to extract keyphrases from medical documents; his work is domain-specific and utilizes a glossary database.
Ercan and Cicekli [10] utilized lexical chains to extract keyphrases. They focused on single-word phrases. In addition to standard phrase features such as phrase frequency and phrase position, they introduced four additional features that are derived from the lexical chain of the phrase. These lexical chains are built using WordNet [12]. They approached keyphrase extraction as a classification problem and used the C4.5 classifier in particular.
Matsuo and Ishizuka [18] employed word co-occurrence to judge whether a candidate phrase is a keyphrase or not. In particular, the bias of the co-occurrence probability distribution of a candidate phrase with frequent phrases in a document is measured using the χ² statistic. Afterwards, phrases are ranked in decreasing order of their χ² scores. Xie et al. [35] also used word co-occurrence to extract keyphrases, but they introduced a semantic relatedness measure to rank the phrases instead of using χ².
Jiang et al. [14] viewed keyphrase extraction as a ranking problem and used the Linear Ranking SVM algorithm to extract keyphrases. Wu et al. [34] extracted keyphrases from documents by focusing on noun phrases only and by utilizing a glossary database. Their work is domain-specific and uses a POS tagger and a noun-phrase extractor. Weights are assigned to the candidate phrases; these are calculated by utilizing the glossary database, and afterwards the phrases with the highest weights are returned as keyphrases. Their work addresses English text.
CFinder [15] is an unsupervised framework for detecting keyphrases that are subsequently used to generate ontologies. The framework is based on extracting noun phrases as candidate keyphrases. Abbreviations are expanded using a manually built synonym table. Afterwards, the weights of these candidates are calculated. The weight function is a combination of statistical knowledge (frequency), domain-specific knowledge (a manually built glossary related to a specific domain) and structural patterns (occurrence). The authors used the D04MG [6] ontology and 27 documents related to emergency management systems. Even though CFinder is an unsupervised framework, it relies on a manually built synonym table and a glossary. CFinder scored a 0.53 F-measure compared to 0.28 by Test2Onto [4], 0.14 by KP-Miner [7, 9] and 0.43 by Moki [30] when run on the emergency dataset.
The researchers in [3] extracted topical keyphrases from tweets. Their work utilizes a graph-based algorithm for ranking keyphrases that is based on TextRank [21]. Their addition to TextRank is to include node properties when calculating weight or merit. They also leverage hashtags when extracting topical keyphrases. Their results show that accuracy increases when compared to the standard TextRank.
The work reported in [5] extracts keyphrases from short texts such as titles of scientific papers. Their framework is based on clustering the words in the collection of documents using topic modeling. Afterwards, candidate keyphrases are generated using a frequent pattern mining algorithm. The third stage of their work consists of ranking the generated candidate keyphrases based on their respective coverage, purity, phraseness and completeness. Their main addition is that the quality of phrases can be judged on phrases of different lengths simultaneously.
The focal point of the work reported in [23] is that adopting a uniform view of all candidate keyphrases is unfair towards rare terms that a human would consider keyphrases. They assume that every keyphrase has a word which is more important than the other words in the same phrase; this is called the core word. The heart of their algorithm is to find such core words based on their frequency and POS tags. In the second stage, the core words are expanded with correlated words so that keyphrases can be generated.
The work reported in [19] suggests an approach for keyphrase indexing. Keyphrase indexing starts by extracting keyphrases from documents and then mapping them to a fixed vocabulary. This approach overcomes both the problem of ill-formed keyphrases obtained by keyphrase extraction and the large training corpus required by term assignment. Their suggested approach starts by extracting n-grams and then assigning weights to these n-grams. The n-grams are transformed to pseudo-phrases by removing stopwords, stemming, and alphabetically sorting the remaining words. These pseudo-phrases are then mapped to a controlled vocabulary to generate the final list of keyphrases.
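The pseudo-phrase construction described above can be sketched as follows. This is an illustrative sketch, not the implementation of [19]; the stopword list and the toy_stem function are placeholder assumptions.

```python
STOPWORDS = {"the", "of", "a"}  # placeholder stopword list

def toy_stem(word):
    """Trivial stand-in stemmer: lower-case and strip a plural 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def pseudo_phrase(ngram):
    """Map an n-gram to its pseudo-phrase: drop stopwords, stem the
    remaining words, and sort them alphabetically."""
    return tuple(sorted(toy_stem(w) for w in ngram if w.lower() not in STOPWORDS))
```

Because of the sorting step, n-grams that differ only in word order or in stopwords map to the same pseudo-phrase, which is what makes the subsequent match against the controlled vocabulary robust.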
2.2. Keyphrase extraction for Arabic texts
Few works have addressed keyphrase extraction for Arabic text. Sakhr [26], a leading company in the field of Arabic text processing, provides a keyphrase extractor. Users can upload their documents to the Sakhr keyword extractor one by one and the system returns the set of keywords; the algorithm behind the extractor is not published. El-Beltagy and Rafea [7, 8] provide KP-Miner for extracting keyphrases from Arabic as well as English documents. KP-Miner utilizes a set of heuristics to extract keyphrases. KP-Miner approaches the task as a ranking problem and therefore does not require a training corpus. Candidate keyphrases are ranked based on their term frequency (tf) and inverse document frequency (idf).
In the work of El-Shishtawy and Al-Sammak [9], which addressed Arabic text, each candidate keyphrase is represented as a vector of 8 features. Some of these features are statistical and others are linguistic. Examples of statistical features include: normalized phrase words, phrase relative frequency, word relative frequency, normalized sentence location, normalized phrase location, and normalized phrase length. Linguistic features, on the other hand, comprise two Boolean features: the first specifies whether or not a sentence contains a verb, and the second determines whether the sentence is a question. Also, the abstract or verbal noun (Masdar) form is used to represent keyphrases. After building the feature vectors of candidate keyphrases, an analysis of variance test is used to determine the importance of the 8 features. After that, a linear discriminant classifier is used to classify candidate keyphrases as true keyphrases (positive examples) or false keyphrases (negative examples). Still, this work views keyphrase extraction as a classification problem. The dataset size is rather small and it is manually annotated. Their results show better precision and recall values when compared to KP-Miner [7, 8] and Sakhr [26].
2.3. Summary of keyphrase extraction
All keyphrase extraction algorithms, whether they are ranking or classification algorithms, domain-specific or domain-independent, in need of a training corpus or not, applied to full-length scientific papers, news articles, or microblogs, and whether or not they utilize a glossary database (or a thesaurus or an ontology), have to deal with three issues, namely: candidate keyphrase generation, assigning features to the candidate keyphrases (i.e., building feature vectors), and ranking the set of candidates using some ranking function. When generating candidate phrases there is a potential that the number of phrases is huge, and therefore pruning strategies that are based on heuristics are used. For example, keyphrases, in both English and Arabic, are assumed to lie within a single sentence. As another example of pruning, the frequency of a keyphrase is calculated and only phrases that appear in a document a number of times greater than a user-specified threshold are considered in subsequent stages. As a last example of pruning, most researchers assume that keyphrases do not contain stopwords, and therefore stopwords are removed during document preprocessing. Experiments have shown, however, that some author- or reader-assigned keyphrases for English text may contain stopwords such as prepositions.
A feature vector for a candidate keyphrase consists of a number of variables or features that describe that keyphrase and are subsequently used to judge its importance. Examples of keyphrase features that are commonly used by researchers include phrase frequency, the position (in the document) where the phrase first or last appears, or whether the phrase is a noun or a verb.
3. Arabic language peculiarities for keyphrase extraction
Keyphrase extraction requires identifying sentence boundaries. In Arabic this is not an easy task, as Arabic does not support letter capitalization and does not follow strict punctuation rules, especially when dealing with informal text, where punctuation marks are usually absent. In English, however, a sentence begins with a capital letter and ends with a period [11].
Usually the first phase of keyphrase extraction algorithms deals with generating candidate keyphrases. This may mean generating n-grams, or linking to an ontology or a thesaurus. For a morphologically rich language like Arabic, the number of possible candidates may be huge and therefore pruning strategies must be employed. A common pruning strategy is to stem the candidate phrases. In Arabic, there is stemming and light stemming: in stemming the words are reduced to their roots [2], while in light stemming [1] only common prefixes and suffixes are removed.
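As an illustration of the difference, a light stemmer can be sketched as below. The affix lists are illustrative placeholders, not the curated sets used by the light stemmer of [1].

```python
# Minimal light-stemming sketch for Arabic: strip one common prefix
# and one common suffix, keeping at least min_len letters.
# The affix lists are illustrative placeholders only.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ه", "ة"]

def light_stem(word, min_len=3):
    # try the longest matching prefix first
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    # then the longest matching suffix
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word
```

Unlike a root stemmer, this leaves the internal pattern of the word untouched; it only peels off the outermost affixes.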
Arabic also has three varieties: classical Arabic, modern standard Arabic (MSA), and dialectal Arabic. Arabic dialects vary from one Arab country to another. As we are dealing with news documents and not scientific papers, dialectal Arabic is present. This adversely affects the performance of the stemmers and consequently reduces the accuracy of keyphrase extraction.
Moreover, unlike English, which has many resources on the internet that contain formal articles on specific fields together with their keywords and keyphrases, Arabic content on the internet is modest, and keyphrase-annotated Arabic text is almost non-existent. In fact, one major contribution of this work is to provide an annotated dataset suitable for keyphrase extraction.
4. A supervised learning framework for keyphrase extraction

4.1. KEA architecture
KEA is a supervised learning algorithm which consists of two stages, namely a training phase and an extraction phase. In the training stage, KEA creates a model using the training data, which consist of documents with author-assigned keyphrases. During the extraction stage, by comparison, KEA applies the model created in the training phase to the testing data. The accuracy of KEA is calculated by comparing the author-assigned keyphrases with the KEA-assigned keyphrases for the testing data.
During the candidate keyphrase selection phase, KEA first cleans the input documents; secondly, it identifies candidate phrases; lastly, it case-folds and stems the candidate phrases. The following rules, which are used by KEA, were adapted to become suitable for Arabic:
– Punctuation marks, brackets and numbers are replaced with phrase boundaries.
– Apostrophes are removed from the documents.
– Hyphenated words are split into two words; i.e., hyphens are removed.
– Non-letter tokens are removed from the documents.
– Acronyms are handled as a single token.
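The cleaning rules above can be sketched as follows; this is a simplified illustration, not KEA's actual code.

```python
import re

def clean(text):
    """KEA-style cleaning sketch: apostrophes are dropped, hyphens
    split words, and punctuation, brackets and digits become phrase
    boundaries ('|'). Returns the list of word segments between
    boundaries."""
    text = text.replace("'", "")   # drop apostrophes
    text = text.replace("-", " ")  # split hyphenated words
    # punctuation, brackets and digits act as phrase boundaries
    text = re.sub(r"[0-9\.,;:!?()\[\]{}\"]+", " | ", text)
    segments = [s.split() for s in text.split("|")]
    return [s for s in segments if s]
```

Each returned segment is a maximal run of words with no boundary inside it; candidate phrases are then drawn from within segments only.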
After applying the above rules, every document consists of a sequence of words; each word consists of at least one letter.
The following rules are used by KEA during candidate phrase identification and were modified to become suitable for Arabic:
– Candidate phrases cannot begin or end with a stopword.
– Candidate phrases can be proper names.
– Candidate phrases are limited to a maximum of 3 words.
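Under these rules, candidate phrase identification can be sketched as below; the stopword set is a placeholder (Arabic-KEA loads its actual list from a file, as described in Section 4.2.2).

```python
STOPWORDS = {"من", "في", "على", "إلى", "أن"}  # placeholder Arabic stopwords

def candidate_phrases(segment, max_words=3):
    """Enumerate runs of up to max_words consecutive words within one
    phrase-boundary segment, rejecting any phrase that begins or ends
    with a stopword (interior stopwords are allowed)."""
    phrases = []
    for n in range(1, max_words + 1):
        for i in range(len(segment) - n + 1):
            p = segment[i:i + n]
            if p[0] in STOPWORDS or p[-1] in STOPWORDS:
                continue
            phrases.append(tuple(p))
    return phrases
```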
In the case-folding and stemming task, candidate phrases are folded to lower-case letters so that all phrases are case-insensitive. Case-folding is applicable to English but not to Arabic. After that, all phrases are stemmed.
After candidate phrases are generated and preprocessed, KEA assigns weights to these candidates by calculating two values: TF×IDF and the First Occurrence of the phrase. TF×IDF combines the frequency of a given phrase in the current document (TF) with the frequency of the phrase in general use, i.e. in the global corpus of documents (IDF). TF×IDF for phrase P in document D is calculated using the formula shown in Equation (1):

TF × IDF = (freq(P, D) / size(D)) × (−log2(df(P) / N))    (1) [33]
Where:
– freq(P, D) is the number of times P occurs in D.
– size(D) is the number of words in D.
– df(P) is the number of documents containing P in the global corpus.
– N is the size of the global corpus.
The First Occurrence weight is calculated as the number of words that precede the phrase's first appearance divided by the total number of words in that document.
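Assuming the standard KEA definitions [33], the two features can be computed as in the following sketch; note that the IDF component uses -log2(df(P)/N), so rarer phrases receive higher weights.

```python
import math

def tfidf(freq_pd, size_d, df_p, n_docs):
    """Equation (1): within-document frequency of the phrase, weighted
    by how rare the phrase is in the global corpus of n_docs documents."""
    return (freq_pd / size_d) * -math.log2(df_p / n_docs)

def first_occurrence(words_before, size_d):
    """Number of words preceding the phrase's first appearance,
    divided by the document length."""
    return words_before / size_d
```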
KEA uses the Naïve Bayes classifier to build its classification model. This classifier is a probabilistic one that depends on Bayes' theorem with the assumption that features or attributes are independent. After calculating the weights of candidate phrases, KEA determines whether a given candidate phrase is qualified to be a keyphrase (P[yes]) or not (P[no]). These two probabilities are calculated as shown in Equations (2) and (3) respectively:
P[yes] = (Y / (Y + N)) × P_TF×IDF[t | yes] × P_distance[d | yes]    (2) [33]

P[no] = (N / (Y + N)) × P_TF×IDF[t | no] × P_distance[d | no]    (3) [33]
Where:
– t is the TF×IDF value.
– d is the distance or First Occurrence value.
– Y is the number of positive phrases in the training documents.
– N is the number of negative phrases in the training documents.
The rank or importance of a candidate phrase is calculated using Equation (4):

Rank = P[yes] / (P[yes] + P[no])    (4) [33]
Candidate phrases are ranked according to the values calculated using Equation (4). If the ranks of two candidate phrases are equal, their TF×IDF values are compared to break the tie; the candidate phrase with the higher TF×IDF is put first in the list. Finally, KEA prunes the candidate phrases that are subsets of other candidate phrases whose rank is higher.
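Equations (2) to (4) and the tie-breaking rule can be sketched as follows; the per-feature likelihoods are assumed to be given (KEA estimates them from discretized feature values during training).

```python
def rank_score(p_t_yes, p_d_yes, p_t_no, p_d_no, n_pos, n_neg):
    """Equations (2)-(4): class prior times the TFxIDF and distance
    likelihoods, normalised into a rank in [0, 1]."""
    p_yes = n_pos / (n_pos + n_neg) * p_t_yes * p_d_yes
    p_no = n_neg / (n_pos + n_neg) * p_t_no * p_d_no
    return p_yes / (p_yes + p_no)

def order_candidates(cands):
    """Sort candidates by rank (descending), breaking ties by TFxIDF
    (descending). Each candidate is a (phrase, rank, tfidf) tuple."""
    return sorted(cands, key=lambda c: (-c[1], -c[2]))
```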
4.2. Extending KEA to suit the extraction of Arabic keyphrases

The following subsections explain the extensions that we have applied to KEA to make it suitable for Arabic.
4.2.1. Replacing the KEA stemming algorithm
The original stemming algorithm of KEA was removed and replaced with a stemming algorithm suitable for extracting the roots of Arabic words. The stemming algorithm reported in [2] was coded in Java and added to the code of KEA. This algorithm is a statistical stemmer which extracts the roots of words by assigning weights and orders to the letters of words. Table 1 shows the original weights assigned to the letters of the Arabic alphabet and Table 2, by comparison, shows the
Table 1. Arabic letters and their weights (adopted from [2])
Weight    Arabic letters
5, 3.5, 3, 2, 1
0         Remaining letters in the Arabic alphabet
Table 2. Arabic letters with their order (adopted from [2])
Position of letters (from right)   Order of letters (word length is odd)   Order of letters (word length is even)
1                                  N                                        N
2                                  N-1                                      N-1
...                                ...                                      ...
N/2                                N/2                                      N/2+1.0
N/2+1                              N/2+1-1.5                                N/2+1-0.5
N/2+2                              N/2+2-1.5                                N/2+2-0.5
...                                ...                                      ...
N                                  N-1.5                                    N-0.5
Table 3. Example of extracting the root of a word using stemmer 1
Weight    1     5     2     0    3.5   5
Order     5.5   4.5   3.5   4    5     6
Product   5.5   22.5  7     0    25    30
Root
orders assigned to the letters of the Arabic alphabet. N is the number of letters in a word, and 1 . . . N are the positions of the letters in a word; 1 is the first letter and N is the last letter. The idea behind the weights shown in Table 1 is that letters that appear as prefixes or suffixes are assigned weights higher than letters which do not appear as prefixes or suffixes. According to the work reported in [2], the following letters may appear as parts of prefixes or suffixes: .
These letters may also appear in the stem of a word. The original algorithm did not take this into consideration and therefore it has errors in the generated roots. We have modified the weights and differentiated between the weight of a given letter when it appears as a prefix or suffix and when it appears as part of the stem. For example, the weight of the letter is set to 3.5 if it serves as part of the definite article and to 1 if it appears as part of the stem. After determining the orders and weights of the letters, the algorithm then multiplies the orders by the
Table 5. Root extraction using stemmer 2 (adopted from [29])
Letter
Group    A   U   O   U   O   O   U   P   U
Root
weights to produce products that are subsequently used in extracting the root. The letters that correspond to the three smallest products constitute the root (read from right to left). Table 3 shows the algorithm in action by demonstrating how the root of “ ” is extracted.
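The product-based root extraction can be sketched as below. The WEIGHTS table and the positional orders are simplified placeholders for the scheme of Tables 1 and 2, so this sketch only illustrates the mechanism, not the exact values of [2] or our modified ones.

```python
# Sketch of the statistical stemmer (stemmer 1): multiply each
# letter's affix-likelihood weight by its positional order and keep
# the letters with the three smallest products as the root.
# WEIGHTS is a placeholder; non-affix letters default to weight 0.
WEIGHTS = {"ا": 3.5, "ل": 3.5, "و": 3.5, "ي": 3.5, "ت": 2, "م": 2, "ن": 2, "ه": 2}

def letter_orders(n):
    """Placeholder positional orders (Table 2 uses a finer scheme);
    here, earlier letters simply get larger orders."""
    return list(range(n, 0, -1))

def stem(word):
    letters = list(word)
    orders = letter_orders(len(letters))
    products = [WEIGHTS.get(ch, 0) * o for ch, o in zip(letters, orders)]
    # indices of the three smallest products, restored to word order
    root_idx = sorted(sorted(range(len(letters)), key=lambda i: products[i])[:3])
    return "".join(letters[i] for i in root_idx)
```

Affix-prone letters accumulate large products and are discarded, while the low-weight consonants that carry the root survive.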
The stemmer reported in [29] extracts the roots of Arabic words by dividing the Arabic alphabet into six groups. This division is based on whether a letter can be part of the prefixes, the suffixes, the original stem, or any combination of these alternatives. These groups are described in Table 4.
As a few of the Arabic letters can appear as prefixes, suffixes and as part of stems, the algorithm adds position information to the groups to differentiate between the case when a given letter is part of the original stem and when it is part of the suffixes or prefixes. The algorithm extracts the root of a given word by first encoding that word using the groups and the position information. In many cases, the root is extracted directly after encoding; in a few cases, conflict resolution via transformation rules is required to extract the correct root. Table 5 demonstrates how stemmer 2 is applied to extract the root of “ ”.
In Arabic-KEA, we have alternated between the use of stemmer 1 and stemmer 2. Information about the performance of these two stemmers is provided in the experimentation and result analysis section.
4.2.2. Replacing the stopwords file
Stopwords are words that cannot be part of keyphrases. The stopwords list that originally shipped with KEA was removed and a new list of Arabic stopwords was added. This list was compiled from free resources available on the internet.
Table 4. The division of the Arabic alphabet into groups for stemmer 2 [29]
Group  Description
O      Original letters. These letters are surely part of the root.
P      Prefix letters. These letters can be added only in the prefix part.
S      Suffix letters. These letters can be added only in the suffix part (only Haa).
PS     Prefix-Suffix letters. These letters can be added on both sides of the word, i.e. in the suffix part or in the prefix part.
U      Uncertain letters. These letters can be added anywhere in the word.
A      Added letters. These letters are always considered additional letters (only Taa Marbuta).
4.2.3. Adjusting KEA features and their corresponding probabilities
As there are fundamental differences between the Arabic and English languages, we have also altered a few aspects of KEA that are related to keyphrase features and their corresponding weights. The following list describes these alterations:
– For a phrase to be a candidate phrase, it must occur in the document two or more times. This feature is called the number of occurrences. The weight of this feature is proportional to the feature's value: the higher the number of occurrences, the higher the weight of this feature.
– The maximum number of words that may appear in an Arabic keyphrase was set to 3 words, after extensive analysis and consultations with language experts. This is related to the structure of the sentence in Arabic.
– Arabic allows proper nouns to be part of keyphrases, especially when dealing with news articles.
– Case folding is not applicable to Arabic.
5. Experimentation and result analysis

5.1. Datasets
As KEA is a supervised learning algorithm, we had to prepare a dataset which consists of documents and their corresponding author/reader-assigned keyphrases. This is not an easy task, as Arabic content on the internet is modest; furthermore, many documents which exist on the internet do not have author-assigned keyphrases. We contacted the authors of KP-Miner [7, 8] and requested their dataset, and they agreed to provide us with a copy. This dataset is called the KP-Miner dataset.
In addition to the KP-Miner dataset, we decided to generate our own dataset by collecting articles published on the internet and manually assigning keyphrases to them. We focused on two topics, namely: leadership and management; and agriculture, environment and food. In total, we gathered 62 documents: 27 documents fall in the first category and 35 documents fall in the second category. Two raters were used to assign keyphrases to these documents. At the end of this phase, every document had two sets of keyphrases. The final set of keyphrases for a given document consists of the intersection of the keyphrase lists generated by the two raters.
Table 6. Training and testing dataset distributions
Dataset                                        Total docs   Training docs   Testing docs   Avg. # of keyphrases
Dataset 1: Leadership and management           27           18              9              7.8
Dataset 2: Agriculture, environment and food   35           23              12             11.1
Dataset 3: KP-Miner                            100          70              30             7.9
5.2. Experimentation setup
In the current work, we divided the documents into training and testing sets as shown in Table 6. The last column of Table 6 shows the average number of keyphrases assigned to the documents.
5.2.1. Arabic-KEA overall performance when varying the number of extracted keyphrases
Arabic-KEA performance is measured using the average number of matched keyphrases between the KEA-generated phrases and the author-assigned phrases. For example, assume we have 10 documents, each assigned 5 author-assigned keyphrases, and Arabic-KEA extracts five keyphrases for each document. The average number of matches is calculated by summing the number of matched keyphrases over all documents and then dividing by the number of documents. Table 7 shows the average number of matches for the three datasets when varying the number of extracted keyphrases over 5, 7, 10, 15 and 20. As Table 7 demonstrates, the accuracy of Arabic-KEA increases as the number of extracted keyphrases is increased. This is understandable, as the likelihood of obtaining matches between author-assigned keyphrases and Arabic-KEA generated keyphrases increases as the number of keyphrases increases. This behavior is common to the three datasets. In Table 7, the second half of each number represents the standard deviation. The statistical stemmer, i.e. stemmer 1, was used in this experiment.
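The matching measure can be sketched as follows; gold and extracted are assumed to map document identifiers to sets of (stemmed) keyphrases.

```python
def avg_matches(gold, extracted):
    """Average, over documents, of the number of keyphrases shared
    between the author-assigned set and the system-extracted set."""
    total = sum(len(gold[d] & extracted[d]) for d in gold)
    return total / len(gold)
```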
5.2.2. Arabic-KEA overall performance when varying the size of the training data
In this experiment, we aim to assess Arabic-KEA performance when focusing on the size of the training data. The theory here is that we can build more accurate classifiers when the number of training documents is large. Table 8 shows the accuracies of Arabic-KEA for Dataset 1 when varying the number of training documents between 1, 5, 10, and 18. As is clear from
Table 7. Performance of Arabic-KEA as the number of extracted keyphrases increases

Dataset                                5             7             10            15            20
Dataset 1: (18 training, 9 testing)    1.33 ± 1.41   1.56 ± 1.59   1.67 ± 1.73   1.78 ± 1.92   1.89 ± 1.96
Dataset 2: (23 training, 12 testing)   2 ± 1.13      2.58 ± 1.24   2.92 ± 1.24   3.33 ± 1.61   3.5 ± 1.68
Dataset 3: (70 training, 30 testing)   1.1 ± 0.76    1.4 ± 0.86    1.73 ± 1.08   2 ± 1.29      2.3 ± 1.62
Table 8. Accuracy of Arabic-KEA as the number of training documents increases for Dataset 1

Size of training dataset                1             5            10            18
Average number of matching keyphrases   0.89 ± 0.93   1.22 ± 1.2   1.31 ± 1.17   1.56 ± 1.59
Table 9. Accuracy of Arabic-KEA as the number of training documents increases for Dataset 2

Size of training dataset                1             5          10            15            23
Average number of matching keyphrases   1.83 ± 1.19   2.08 ± 1   2.17 ± 1.19   2.33 ± 1.15   2.58 ± 1.24
Table 10. Accuracy of Arabic-KEA as the number of training documents increases for Dataset 3

Size of training dataset                1             5             10            15            30            45          70
Average number of matching keyphrases   1.22 ± 0.43   1.23 ± 0.82   1.32 ± 0.96   1.33 ± 0.92   1.38 ± 0.78   1.4 ± 0.8   1.4 ± 0.86
Table 8, the average number of matches increases as the number of training documents increases.

Table 9 shows the behavior of Arabic-KEA when varying the number of training documents for Dataset 2. Again, as expected, the average number of matches is directly proportional to the number of training documents.

Table 10 describes the changes of the average number of matches when varying the number of training documents for Dataset 3. Table 10 emphasizes that the average number of matched keyphrases increases as the number of training documents increases. We note that the accuracy of Arabic-KEA converges when the number of training documents reaches 45 documents.
5.2.3. Arabic-KEA overall performance when alternating between stemming algorithms

Table 11 shows the average number of matched keyphrases when alternating between two Arabic stemmers. Stemmer 1 is the statistical-based stemmer described in [2]. Stemmer 2, on the other hand, is a rule-based stemmer described in [29]. As Table 11 clearly indicates, Stemmer 1 outperforms Stemmer 2. The results shown in Table 11 indicate that the choice of stemming algorithm does affect the quality of extracted keyphrases.
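The role of the stemmer in matching can be illustrated with the sketch below. Neither the statistical stemmer of [2] nor the rule-based stemmer of [29] is reproduced here; `stem` is a placeholder parameter, and the toy English suffix stripper in the example is only a stand-in for demonstration.

```python
def stem_match_count(gold_phrases, extracted_phrases, stem):
    """Number of gold keyphrases recovered after stemming.

    Each word of each phrase is stemmed before comparison, so surface
    variants of the same phrase count as one match. `stem` stands in
    for whichever stemmer is being evaluated; the real stemmers
    compared in Table 11 are not reproduced here."""
    def norm(phrase):
        return tuple(stem(word) for word in phrase.split())
    return len({norm(p) for p in gold_phrases} & {norm(p) for p in extracted_phrases})

# Toy suffix stripper as a stand-in stemmer (illustration only):
toy_stem = lambda w: w[:-1] if w.endswith("s") else w
print(stem_match_count(["water resources"], ["water resource"], toy_stem))  # → 1
```

With a weaker stemmer (e.g. the identity function), "water resources" and "water resource" would not match, which is one way a stemmer choice can change the scores in Table 11.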
In order to assess where Arabic-KEA stands when compared with other keyphrase extraction systems, we have compared Arabic-KEA with KP-Miner.

Table 11. Arabic-KEA performance with several stemmers

Dataset     Stemmer 1: Statistical-based stemmer   Stemmer 2: Rule-based stemmer
Dataset 1   1.56 ± 1.59                            0.67 ± 1
Dataset 2   2.58 ± 1.24                            1.17 ± 0.94
Dataset 3   1.4 ± 0.86                             0.96 ± 0.87

KP-Miner [7, 8] is an unsupervised keyphrase extractor
and thus does not require any training data. KP-Miner is freely available online at http://www.claes.sci.eg/coe wm/kpminer/. We have used the online interface provided by its authors. This means that we have used the original implementation provided by the KP-Miner authors. To make the comparison fair and meaningful, only documents which were used in testing Arabic-KEA from the three datasets were also used to test KP-Miner. For Dataset 1, documents numbered from 1 to 9 were used for testing both Arabic-KEA and KP-Miner. For Dataset 2, documents numbered from 1 to 12 were used for testing both Arabic-KEA and KP-Miner. Finally, for Dataset 3, documents numbered from 1 to 30 were used for testing purposes. The results of the comparisons are summarized in Table 12. The numbers in Table 12 represent the average number of matched keyphrases obtained by Arabic-KEA and by KP-Miner when the number of extracted keyphrases varies between 5, 7, 10, 15, and 20. As it can be seen from Table 12, Arabic-KEA scored
Table 12. Arabic-KEA versus KP-Miner

             Dataset 1                    Dataset 2                    Dataset 3
No. of KPs   Arabic-KEA    KP-Miner      Arabic-KEA    KP-Miner      Arabic-KEA    KP-Miner
5            1.33 ± 1.41   1.00 ± 0.82   2.00 ± 1.13   1.67 ± 1.03   1.1 ± 0.76    1.23 ± 0.67
7            1.56 ± 1.59   1.11 ± 0.87   2.58 ± 1.24   2.08 ± 1.19   1.4 ± 0.86    1.50 ± 0.67
10           1.67 ± 1.73   1.44 ± 1.07   2.92 ± 1.24   2.75 ± 1.23   1.73 ± 1.08   1.73 ± 0.68
15           1.78 ± 1.92   1.44 ± 1.07   3.33 ± 1.61   3.00 ± 1.41   2.00 ± 1.29   1.90 ± 0.91
20           1.89 ± 1.96   1.89 ± 1.37   3.5 ± 1.96    3.00 ± 1.41   2.3 ± 1.62    2.10 ± 1.27
better averages for Dataset 1 and Dataset 2 regardless of the number of extracted keyphrases. For Dataset 3, we notice that KP-Miner provides better accuracy than Arabic-KEA when the number of extracted keyphrases is 5 or 7. Both Arabic-KEA and KP-Miner give the same number of matched keyphrases when the number of extracted keyphrases is 10 for Dataset 3. Arabic-KEA outperforms KP-Miner when the number of keyphrases is equal to 15 or 20 for Dataset 3. In summary, Arabic-KEA outperforms KP-Miner for Dataset 1 and Dataset 2. For Dataset 3, KP-Miner outperforms Arabic-KEA when the number of extracted keyphrases is 5 or 7, but Arabic-KEA improves its accuracy and surpasses KP-Miner when the number of extracted keyphrases reaches 15 or 20.
6. Conclusions and future work

This paper has reported a framework for keyphrase extraction from Arabic documents which is based on adapting an existing system that was initially developed for the English language. In particular, the KEA system [33] was extended by the current work. The extensions were nontrivial, and major parts of the code had to be modified as there are fundamental differences between Arabic and English. The contributions of this paper are: the modification of KEA to make it suitable for the Arabic language, the collection and annotation of two datasets that are suitable for research that deals with keyphrase extraction, and finally an extensive experimentation and results analysis which gave us a better understanding of keyphrase extraction in Arabic documents.

The results reveal that the accuracy of Arabic-KEA increases as the number of extracted keyphrases increases. They also show that accuracy increases as the size of the training data increases. Finally, the results reveal that stemming is an effective method to reduce the number of candidate keyphrases, and therefore good stemming algorithms play a part in improving the accuracy of keyphrase extraction. In this work we have experimented with a statistical-based stemmer and a rule-based stemmer. The statistical stemmer gave better results when compared with the rule-based stemmer.
References

[1] M. Aljlayl and O. Frieder, On Arabic search: Improving the retrieval effectiveness via a light stemming approach. Proceedings of the ACM 11th Conference on Information and Knowledge Management, New York: ACM Press, 2002, pp. 340–347.
[2] R. Al-Shalabi, G. Kanaan and H. Al-Sarhan, New approach for extracting Arabic roots. Proceedings of the International Arab Conference on Information Technology (ACIT'2003), Alexandria, Egypt, 2003, pp. 42–59.
[3] A. Bellaachia and M. Al-Dhelaan, NE-Rank: A novel graph-based keyphrase extraction on Twitter. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Macau, 2012, pp. 372–379.
[4] P. Cimiano and J. Volker, Text2Onto – A framework for ontology learning and data-driven change discovery. Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB), Springer, Vol. 3513, 2005, pp. 227–238.
[5] M. Danilevsky, C. Wang, X. Desai, J. Guo and J. Han, Automatic construction and ranking of topical keyphrases on collections of short documents. Proceedings of the 2014 SIAM International Conference on Data Mining, 2014.
[6] P. Delir-Haghighi, F. Burstein, A. Zaslavsky and P. Arbon, Development and evaluation of ontology for intelligent decision support in medical emergency management for mass gatherings, Decision Support Systems 54 (2013), 1192–1204.
[7] S. El-Beltagy and R. Rafea, KP-Miner: Participation in SemEval-2. Proceedings of the 5th International Workshop on Semantic Evaluation (ACL 2010), Uppsala, Sweden, 2010, pp. 190–193.
[8] S. El-Beltagy and R. Rafea, KP-Miner: A keyphrase extraction system for English and Arabic documents, Information Systems 34 (2009), 132–144.
[9] T.A. El-Shishtawy and A.K. Al-Sammak, Arabic keyphrase extraction using linguistic knowledge and machine learning techniques. Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2009.
[10] G. Ercan and I. Cicekli, Using lexical chains for keyword extraction, Information Processing and Management 43 (2007), 1705–1714.
[11] A. Farghaly and K. Shaalan, Arabic natural language processing: Challenges and solutions, ACM Transactions on Asian Language Information Processing 8(4) (2009), Article 14.
[12] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database, MIT Press, 1998.
[13] K. Frantzi, S. Ananiadou and H. Mima, Automatic recognition of multi-word terms: The C-value/NC-value method, International Journal on Digital Libraries 3(2) (2000), 115–130.
[14] X. Jiang, Y. Hu and H. Li, A ranking approach to keyphrase extraction. SIGIR'09, Boston, MA, USA, 2009, pp. 756–757.
[15] Y. Kang, P. Delir-Haghighi and F. Burstein, CFinder: An intelligent key concept finder from text to ontology, Expert Systems with Applications 41 (2014), 4494–4505.
[16] KEA: Keyphrase Extraction Algorithm, http://www.nzdl.org/Kea/. Last accessed 7 June 2015.
[17] S.N. Kim, O. Medelyan, M.Y. Kan and T. Baldwin, SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2010, pp. 21–26.
[18] Y. Matsuo and M. Ishizuka, Keyword extraction from a single document using word co-occurrence statistical information, International Journal on Artificial Intelligence Tools 13(1) (2004), 157–169.
[19] O. Medelyan and I. Witten, Domain-independent automatic keyphrase indexing with small training sets, Journal of the American Society for Information Science and Technology 59(7) (2008), 1026–1040.
[20] O. Medelyan, E. Frank and I.H. Witten, Human-competitive tagging using automatic keyphrase extraction. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 3, Association for Computational Linguistics, 2009, pp. 1318–1327.
[21] R. Mihalcea and P. Tarau, TextRank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004, pp. 404–411.
[22] T. Nguyen and M. Kan, Keyphrase extraction in scientific publications, in D.H. Goh et al. (Eds.): ICADL, LNCS Vol. 4822, 2007, pp. 317–326.
[23] Y. Ouyang, W. Li and R. Zhang, Keyphrase extraction based on core word identification and word expansion. Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, Uppsala, Sweden, 2010, pp. 142–145.
[24] N. Pala and I. Cicekli, Turkish keyphrase extraction using KEA. Proceedings of the 22nd International Conference on Computer and Information Sciences, Ankara, Turkey, 2007, pp. 1–5.
[25] N. Pudota, A. Dattolo, A. Baruzzo and C. Tasso, A new domain independent keyphrase extraction system. M. Agosti, F. Esposito and C. Thanos (Eds.), IRCDL, CCIS 91, 2010, pp. 67–78.
[26] Sakhr Keyword Extractor, http://www.sakhr.com/Keyword.aspx. Last accessed 7 June 2015.
[27] K. Sarkar, M. Nasipuri and S. Ghose, Machine learning based keyphrase extraction: Comparing decision trees, naïve Bayes and artificial neural networks, The Journal of Information Processing Systems 8(4) (2012), 693–712.
[28] K. Sarkar, Automatic keyphrase extraction from medical documents. In Chaudhury et al. (Eds.): PReMI, LNCS Vol. 5909, 2009, pp. 273–278.
[29] R. Sonbol, N. Ghneim and M. Desouki, Arabic morphological analysis: A new approach. Proceedings of the Third International Conference on Information and Communication Technologies: From Theory to Applications (ICTTA 2008), Damascus, Syria, 2008, pp. 1–6.
[30] S. Tonelli, M. Rospocher, E. Pianta and L. Serafini, Boosting collaborative ontology building with key-concept extraction. 2011 Fifth IEEE International Conference on Semantic Computing (ICSC), 2011, pp. 316–319.
[31] P.D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval 2(4) (2000), 303–336.
[32] J. Wang, H. Peng, J. Hu and J. Zhang, Ensemble learning for keyphrase extraction from scientific documents. In J. Wang (Eds.): LNCS Vol. 3971, 2006, pp. 1267–1272.
[33] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin and C.G. Nevill-Manning, KEA: Practical automatic keyphrase extraction. In Y.L. Theng and S. Foo, Eds., Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, Information Science Publishing, London, 2005, pp. 129–152.
[34] Y.B. Wu, Q. Li, R.S. Bot and X. Chine, Finding nuggets in documents: A machine learning approach, Journal of the American Society for Information Science and Technology (JASIST) 57(6) (2006), 740–752.
[35] F. Xie, X. Wu and X. Hu, Keyphrase extraction based on semantic relatedness. Proceedings of the 9th IEEE International Conference on Cognitive Informatics (ICCI'10), Beijing, China, 2010.