

    Arabic texts analysis for topic modeling evaluation

Abderrezak Brahmi · Ahmed Ech-Cherif · Abdelkader Benyettou

Received: 12 September 2010 / Accepted: 23 May 2011 / © Springer Science+Business Media, LLC 2011

Abstract Significant progress has been made in information retrieval, covering text semantic indexing and multilingual analysis. However, developments in Arabic information retrieval have not followed the extraordinary growth of Arabic usage on the Web during the last ten years. In tasks involving semantic analysis, it is preferable to deal directly with texts in their original language. Studies on topic models, which provide a good way to automatically handle the semantics embedded in texts, are not complete enough to assess the effectiveness of the approach on Arabic texts. This paper investigates several text stemming methods for Arabic topic modeling. A new lemma-based stemmer is described and applied to newspaper articles. The Latent Dirichlet Allocation model is used to extract latent topics from three Arabic real-world corpora. For supervised classification in the topic space, experiments show an improvement compared to classification in the full word space or with a root-based stemming approach. In addition, topic modeling with lemma-based stemming allows us to discover interesting subjects in the press articles published during the 2007-2009 period.

Keywords Arabic stemming · Topic model · Linguistic analysis · Classification · Test collections

    1 Introduction

Arabic is one of the six official languages of the United Nations, where considerable work has been done to develop the multilingual United Nations Bibliographic

    A. Brahmi (&)

    Department of Computer Science, University of Abdelhamid Ibn Badis,

    BP: 188, Mostaganem, Algeria

    e-mail: [email protected]

A. Ech-Cherif (&) · A. Benyettou
Department of Computer Science, USTO-MB, BP: 1505, Oran, Algeria

    e-mail: [email protected]

    A. Benyettou

    e-mail: [email protected]


    Inf Retrieval

    DOI 10.1007/s10791-011-9171-y


    Information System Thesaurus. This UNBIS Thesaurus is used in subject analysis of

    documents and other materials relevant to the United Nations programs and activities. In

addition, Arabic is one of the top ten languages on the Internet. For a global population of 350 million in the Arab world, Internet World Stats1 reported the highest Internet growth rate for Arabic users: 2,501.2% over the period 2000-2010.

Unfortunately, developments in Arabic information retrieval (IR) did not follow this

extraordinary growth. For most of the studies in the different IR tasks (such as categorization, clustering and search), English was used as the main language. However, when

switching to Arabic, two approaches have been adopted for evaluating IR methods: either by retaining English as a pivot language, using parallel corpora in a cross-language context, or by processing the original Arabic text and analyzing the IR methods in a

    mono-language context. Although the first approach allows an equitable evaluation, it

    depends on the availability and the quality of parallel corpora. In the second approach, the

evaluation of IR methods requires standard corpora and appropriate linguistic preprocessing. This approach becomes more interesting for IR tasks with semantic analysis. It

    avoids the loss of meaning caused by translation from languages with high inflectional

morphology such as Arabic (Oard and Gey 2002; Larkey et al. 2004). Unfortunately, Arabic IR, including its stemming methods, has not received sufficient standard evaluation.

In this context, three major challenges face the development of Arabic IR and the generalization of existing methods to Arabic texts: (1) How to efficiently extract a good

    stem from a morpheme that implies several segmentations and senses? (2) How to apply a

topic model to capture the semantics embedded in Arabic texts? (3) How to make Arabic resources for IR tasks more accessible, so as to benefit from the various developments in a non-commercial context?

This work aims to answer the first two questions raised above. On the one hand, a

    lemma-based stemming approach is proposed and compared with other Arabic stemmers.

    On the other hand, the Latent Dirichlet Allocation (LDA) model is used to extract Arabic

topics from newspaper articles. As regards the third question, our experiments are conducted on three real-world corpora automatically crawled from the Web. All results and

    resources will be made freely available for the research community.

This paper first presents related work on Arabic stemming and topic modeling. Then,

    the lemma-based stemmer is described and evaluated with other approaches for Arabic text

    analysis. Afterwards, the generative process for LDA topic modeling is illustrated. Before

presenting the results of our experiments, three datasets of newspaper articles are described. Finally, we discuss the main results and conclude our study.

    2 Related works

    2.1 Arabic stemming

    Among the successful approaches for Arabic stemming, a root-based stemmer has been

developed by Khoja and Garside (1999). Based on predefined root lists and morphological analysis, the Khoja algorithm attempts to extract the true root. However, more than one root

    can be found in an isolated word without diacritics. Although the Khoja2 stemmer has not

    1 Internet World Stats, Usage and population statistics (Miniwatts Marketing Group). Updated for June 30,

2010, last visited in August 2010. http://www.internetworldstats.com/.
2 Freely available at http://www.zeus.cs.pacificu.edu/shereen/ArabicStemmerCode.zip.


    been maintained since its first publication, it has been widely used and analyzed in later

works. For instance, the Al-Shammari lemma-based stemmer incorporated the Khoja algorithm for verb stemming (Al-Shammari and Lin 2008). The authors successfully combined light stemming, root stemming and dictionary lookup. In addition to its effectiveness in the clustering task, the Al-Shammari algorithm outperformed the Khoja and light stemmers in terms of over-stemming evaluation (Al-Shammari 2010).

    For light stemming, several variants have been developed (Larkey et al. 2002). When

applied to the AFP_ARB corpus, the light stemmer was more effective for cross-language retrieval than a morphological stemmer. The authors deduced that it is

    not essential for a stemmer to yield the correct root. Surprisingly, in a technical report

    (Larkey and Connell 2001), the authors claim that these results, either in mono-lingual or

    cross-language retrieval, were obtained with no prior experience with Arabic. Another

study confirmed the same result, favoring light stemming for Arabic retrieval tasks

    (Moukdad 2006).

On the contrary, Brants et al. reported that, whether stemming or full word forms were used, they obtained the same performance for document topic analysis (Brants et al. 2002).

A recent study on Arabic text categorization has highlighted this contradiction in the literature and attempted to analyze various stemming tools (Said et al. 2009). In (Darwish

    et al. 2005), the authors showed that using context to improve the root extraction process

may enhance the IR process. However, context-based root extraction is computationally expensive compared with light and root stemming. Similar to Khoja but without a root dictionary, a good light stemmer was developed by Taghva et al. (2005). The authors found

    that stem lists are not required in an Arabic stemmer. They deduced that finding the true

grammatical root of a term should not be the goal of a stemmer for document retrieval.

Compared to English and other languages, research on Arabic text stemming is fairly limited (Taghva et al. 2005). The main efforts to build efficient Arabic IR

    systems have been achieved in a commercial framework. The approaches used for these

systems, as well as their accuracy, are not known. As a significant example, the Siraj system3 from Sakhr classifies Arabic text and extracts named entities with human-satisfying responses. However, no technical documentation explains either the method used or the system evaluation.

    2.2 Topic modeling

    The LDA model has been introduced within a general Bayesian framework where the

authors developed a variational method and an EM algorithm for learning the model from collections of discrete data (Blei et al. 2003). The authors applied their model to document

    modeling, text classification and collaborative filtering. For document modeling, they

trained a number of latent variable models, including LDA, on two text corpora to compare the generalization performance, as

    measured by likelihood of held-out test data. Based on different datasets with various

document numbers and vocabulary sizes, experiments show that the LDA model outperforms other models such as the unigram model and pLSI.

In (Blei et al. 2006), the LDA model was tested on CGC Bibliography items. Experiments

    show that LDA had better predictive performance than two standard models (unigram and

mixture of unigrams). For the text classification problem, an SVM was trained on the low-dimensional representations produced by LDA from unlabeled documents. The authors

    3 http://www.siraj.sakhr.com/.


    conducted two binary classification experiments using the Reuters-21578 dataset. They

achieved performance similar to that of SVM classification in the full word space.

    It is worth pointing out that most datasets used for LDA evaluation are freely available

and included a few thousand English documents (sometimes up to 20,000) with some 30,000 unique words. This was considered sufficient for analyzing and assessing model performance, but the same does not hold for topic modeling in other languages such as Arabic.

    Since the original introduction of the LDA model, several contributions have been

proposed. However, few studies on finding latent topics in an Arabic context have been identified. In addition to the works related to Arabic topic detection and tracking (Oard and

Gey 2002; Larkey et al. 2004), a segmentation method using Probabilistic Latent Semantic Analysis (Hofmann 1999) has been applied to the AFP_ARB corpus for monolingual Arabic document topic analysis (Brants et al. 2002). In (Larkey et al. 2004), the

researchers compared different topic tracking methods. They claimed that it is preferable to use separate languages for building specific topic models. Good topic models

    have been obtained when native Arabic stories are available. However, Arabic topic

    tracking has not been improved in texts translated from English stories.

In fact, studies on Arabic IR are insufficient, and the few works carried out on topic modeling and text stemming lack rigorous evaluation. Considering the highly inflectional morphology of Arabic, it seems more appropriate to learn the LDA model in a mono-language context, taking more care with linguistic aspects. However, a large investigation of stemming methods is required to assess Arabic topic modeling on real-world corpora.

    3 Arabic text analysis

Unlike the Indo-European languages, Arabic belongs to the Semitic language family. Written from right to left, it includes 28 letters. Although different Arabic dialects are spoken in the Arab world, there is only one form of the written language found in printed works, known as Modern Standard Arabic, herein referred to as Arabic (Kadri and Nie 2006). In addition to its derivational morphology, the main characteristics of the Arabic language that complicate any automatic text analysis are agglutination and non-vocalization.

    3.1 Arabic language features

    Arabic is a highly inflected language due to its complex morphology. An Arabic word can

    be one of three morpho-syntactic categories: nouns, verbs or particles. Several works have

    used other categories (such as prepositions and adverbs) with no good reason except that

    they are taken from English (Larkey et al. 2002; Tuerlinckx 2004; Moukdad 2006).

The lemma is the dictionary entry, which is fully vocalized and relates to any form found in text. In particular, verbs are reduced to the third-person masculine singular in the past tense. All nouns and verbs are derived from a non-vocalized root according to one of the Arabic patterns. The root is a linguistic unit carrying a semantic area. It is a non-vocalized word (more general than the lemma) and often consists of only 3 consonants (rarely 4 or 5) (Kadri and Nie 2006; Tuerlinckx 2004).


    3.1.1 Morphological complexity

    In Semitic languages, the root is an essential element from which various words may be

    derived according to specific patterns or schemes. The morphological complexity in Arabic

is characterized by inflection and derivation.

Inflection modifies a word to express different grammatical categories for the same meaning, such as gender, number, place or tense. Some irregular inflection schemes do not simply prefix or suffix the root; they also apply infixation and complex affixation processes. The following examples illustrate Arabic inflection with the irregular plural

    (called broken plural):

from the root [Elm]: the plural of [Eilm, science] is [Eulum, sciences];
from the root [ktb]: the plural of [kitAb, book] is [kutub, books]

Derivation, in contrast, is a root affixation process generating a new word with a different meaning but generally in the same semantic area. Examples of verb derivation from the root [Elm] are [aEolam, notify/inform] and [{isotaEolam, inquire].

    3.1.2 Agglutination

In Arabic text, a lexical unit is not easily identifiable from a graphic unit (a word delimited by space characters or punctuation marks). The morphological affixation process becomes more complicated when extra affixes are agglutinated to a lemma. Indeed, a word can be

    extended by attaching four kinds of affixes (antefix, prefix, suffix and postfix). Table 1

shows an example of an agglutinated and inflected word, [wayaEolamuwnahu], where various kinds of affixes are attached to the core form [Elm].

This situation creates high ambiguity when extracting the right core (stem) from an agglutinated form. In non-vocalized texts, morphological analysis becomes even more difficult, as illustrated in Table 1. Other agglutinative languages exist, such as Japanese, Turkish and Finnish, but their non-vocalization problem is not as complicated as in Arabic.

    3.1.3 Vocalization

An Arabic word is vocalized with diacritics (short vowels), but unfortunately full or partial vocalization is found only in didactic documents or in Koranic text. This fact accentuates the ambiguity of words and requires every automatic analyzer to pay close attention to morphology and word context. In Table 2, the non-vocalized word

[bsm] yields more than one segmentation with different meanings. This is mainly due to the missing diacritics in an agglutinated form.

Table 1 Segmentation of an Arabic agglutinated form meaning "and they know it"

             Antefix   Prefix       Lemma      Suffix       Postfix
  Form       [wa]      [ya]         [Eolamu]   [wna]        [hu]
  Meaning    and       they (1/2)   know       they (2/2)   it

  (The antefix and postfix are agglutination; the prefix and suffix are inflection.)


    3.2 Stemming methods

    Stemming is a process for conflating inflected or derived words to a unique base-stem. It is

    an important way to reduce collection vocabulary. In addition, stemming avoids dealing

with the same word as different index entries. Two classes of Arabic stemming methods can be identified: (1) light stemmers, which remove the most common affixes, and (2) morphological analyzers, which extract each core (root or lemma) according to a scheme.

    3.2.1 Light stemmers

Light stemmers truncate a reduced list of affixes from a word without trying to find roots. The effectiveness of this approach depends on the content of the prefix and suffix lists. Whereas in English a stem is found mainly by removing conjugation suffixes, Arabic texts contain ambiguous agglutinated forms that imply several morphological derivations. An analysis of such an approach can be found in (Larkey et al. 2002). The ISRI stemmer is another example of light stemming (Taghva et al. 2005). Without a root dictionary, the ISRI4 algorithm uses affix lists and the most common patterns to extract roots; for words whose stem is not found, it keeps the normalized form.

Stemmers of this kind deal effectively with most practical cases, but in some the right word is lost. For example, in the word [wafiy], one can read two agglutinated particles that mean "and" and "in", but another reader will consider a noun that means "faithful/complete".

    3.2.2 Morphological analyzers

In morphological analysis, we try to extract more complete forms using knowledge of vocalization variation and derivation patterns. Two categories of analyzers can be distinguished according to the nature of the desired output unit: (1) root-based stemmers and (2) lemma-based stemmers. The choice between the two approaches depends on how the stemming results will be used further, in IR tasks or in language modeling.

In the first category, the Khoja stemmer, which attempts to find the root of an Arabic word, was proposed in (Khoja and Garside 1999). A list of roots and patterns is used to determine the right stem. This approach produces abstract roots, which significantly reduces the dimension of the document feature space, but it conflates divergent meanings into a unique non-vocalized stem. For example, stemming the word cited above in Table 1 must yield the root-stem whose possible meanings include the verbs "to know" or "to teach". However, the same root can also mean the noun "flag".

Table 2 Four possible solutions for the word [bsm]

  Solution  Morphology    Vocalization     English meaning
  1         Noun          [basom]          smiling
  2         Verb          [basam]          smile
  3         Prep + noun   [bi] + [somi]    in/by (the) name of
  4         Prep + noun   [bi] + [sam*]    by/with poison

    4 http://www.nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.isri-pysrc.html.


In the second category, a lemma-based stemmer has been developed and compared to the Khoja stemmer (Al-Shammari and Lin 2008). The authors combined light stemming with the Khoja algorithm for antefix removal and verb stemming before processing the remaining words as nouns. They use a stop list exceeding 2,200 words, with verb and noun dictionaries as linguistic resources. In addition to a clustering performance comparison, they used a collection of concept groups for under- and over-stemming evaluation. Unfortunately, neither these resources nor the test collections are available. As shown in Table 2, an Arabic word can induce more than one stem, and thus more than one lemma. The above approaches do not support this aspect.

For this purpose, a set of Arabic lexicons5 has been developed with rules for legal combinations of lemma-stems and affix forms (Buckwalter 2002). P. Brihaye has developed AraMorph,6 a Java package for Arabic lemmatization based on the Buckwalter Arabic morphological analyzer. Several stemming solutions can be proposed for each word. From this analyzer, one can develop, under certain considerations, a lemma-based stemmer. This approach is described hereafter.

    3.3 Lemma-based stemmer

We propose a lemma-based stemming algorithm called the Brahmi-Buckwalter Stemmer, referred to henceforth as BBw. Based on the resources of the Buckwalter morphological analyzer, the BBw stemmer makes two main contributions: (1) normalization preprocessing and (2) stem selection with morphological analysis.

    3.3.1 Normalization

This step normalizes the input text. The resulting list of tokens is then processed by the Buckwalter morphological analyzer.

- Convert to UTF-8 encoding.
- Tokenize the text, respecting standard punctuation.
- Remove diacritics and tatweel.
- Remove non-Arabic letters and stop-words.
- Replace initial alef with hamza by bare alef.
- Replace final waw or yeh with hamza by hamza.
- Replace maddah or alef-waslah by bare alef.
- Replace two bare alefs by alef-maddah.
- Replace final teh marbuta by heh.
- Remove final yeh when the remaining stem is valid.

    3.3.2 Stem selection

When an input token (in-token) is processed by the Buckwalter morphological analyzer, three cases can arise: (1) A unique solution is given according to a specific pattern.

5 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.
6 http://www.nongnu.org/aramorph/english/index.html.


(2) Multiple solutions are found, corresponding to several patterns and lexicon entries. (3) No solution can be attributed to the in-token. The actions that the BBw stemmer must undertake are detailed below.

Unique solution. The BBw stemmer retains only the non-vocalized lemma-stem of the solution (without affixes). A solution without a noun or verb lemma (i.e., containing only particles) is ignored, and the in-token is therefore treated as a stop-word.

Multiple solutions. The BBw stemmer treats all the proposed solutions as a set of separate unique solutions and thus retains all non-vocalized lemma-stems. Note that eliminating diacritics from lemmas may unify some stems and so reduce the multiplicity of solutions. For example, Table 2 gives four vocalized solutions for the token [bsm], but after removing diacritics from the output lemmas, the BBw stemmer identifies only two confused stems {[bsm], [sm]}. It is worth pointing out that most Arabic proper names can be derived regularly from roots. In this case, multiple solutions, including the in-token itself, must be considered.

No solution. The in-token may have no solution for different reasons: (1) The in-token is wrong and does not imply any Arabic lemma. (2) The in-token corresponds to a proper name (person, city, etc.) that has no entry in the dictionary. (3) The in-token is a correct Arabic word but is not yet included in the current release of the Buckwalter morphological analyzer.

In this study, we have opted to improve the normalization preprocessing in the BBw algorithm based on the original Buckwalter lexicon. Three stemmer variants (BBw0, BBw1, BBw2) are developed and evaluated on different Arabic datasets. Table 3 summarizes the stem selection approach of each BBwX stemmer.
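The selection logic of the three variants can be summarized in a small dispatch function; this is a sketch only (the function and parameter names are ours, and the analyzer's output is mocked as a plain list of lemma-stems):

```python
# Illustrative dispatch for the BBwX variants: `solutions` is the list of
# non-vocalized lemma-stems the morphological analyzer returned for `token`
# (empty when there is no solution). Not the authors' code.
def bbw_stems(token, solutions, variant="BBw2"):
    if len(solutions) == 1:                 # unique solution
        return solutions
    if len(solutions) > 1:                  # multiple solutions
        if variant == "BBw1":
            return solutions + [token]      # all lemmas + in-token
        return solutions                    # BBw0 / BBw2: all lemmas
    # no solution
    if variant == "BBw0":
        return []                           # in-token discarded
    return [token]                          # BBw1 / BBw2 keep the in-token
```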

    3.3.3 Confusion degree measure

When the morphological analysis of an in-token implies multiple solutions, each BBwX stemmer produces multiple lemma-stems. For a collection S, let L (L ≠ 0) be the total number of in-tokens after stop-word removal, and let LR be the total number of stems obtained when stemming S with an algorithm R. The confusion degree C(S|R) is then defined as:

    C(S|R) = LR / L    (1)

For example, C(S|Khoja) = 1, since the Khoja stemmer gives at most one stem for each token in any dataset S. This is the ideal situation for a stemming process, but when the BBwX stemmers are applied, the confusion degree C increases. We propose the measure C(S|R) for assessing the lexical ambiguity of Arabic texts. A human Arabic reader solves this problem easily through semantic considerations guided by the context.

For BBwX stemming, note that all possible stems are equally related to their in-token. At this stage, we have no precise knowledge for selecting the right stem. Nevertheless, the

Table 3 BBwX outputs for the different cases of in-token morphological analysis

  Stemmer  Unique solution  Multiple solutions     No solution
  BBw0     1 lemma-stem     all lemmas             (none)
  BBw1     1 lemma-stem     all lemmas + in-token  in-token
  BBw2     1 lemma-stem     all lemmas             in-token


relevant solution can be weighted later by a co-occurrence computation in the local context. We believe this will be possible with LDA topic modeling.
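As a sketch, the confusion degree of Eq. (1) is a short computation over a stemmer's output; the dictionary in the test below is a toy stand-in for a real BBwX stemmer:

```python
# Confusion degree of Eq. (1): total stems produced over total in-tokens.
# `stem` maps a token to its list of candidate stems (illustrative only).
def confusion_degree(tokens, stem):
    L = len(tokens)                           # in-tokens after stop-word removal
    LR = sum(len(stem(t)) for t in tokens)    # total number of stems produced
    return LR / L
```

A one-stem-per-token stemmer (Khoja-like) gives exactly 1.0, matching the C(S|Khoja) = 1 case above.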

    3.4 Stemming evaluation

Although the literature describes various Arabic stemmers, only a few of them have received a standard evaluation (Al-Shammari 2010; Said et al. 2009). One way to assess the effectiveness of stemming algorithms is to evaluate their performance in information retrieval tasks. This requires standard and representative test collections. Nevertheless, it is not certain that good performance in IR tasks results only from stemming quality (Paice 1996; Frakes 2003). Herein, we describe three stemming metrics used in the present work. Such metrics allow some stemming aspects to be assessed independently of IR task performance.

    3.4.1 Index compression

The Index Compression Factor (ICF) represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the stemmer, the greater the index compression (Frakes 2003). It is calculated as:

    ICF = (N - S) / N    (2)

where N is the number of unique words before stemming and S is the number of unique stems after stemming.

The ICF has been introduced as a strength measure to evaluate stemmers and their compression performance. However, vocabulary compression does not imply ideal stemming; a good stemmer is one that stems all words to their correct roots. The following measures address this condition.
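A minimal sketch of the ICF computation of Eq. (2) (the suffix-stripping stemmer used to exercise it in the test is a toy):

```python
# Index Compression Factor of Eq. (2): relative reduction of the unique
# vocabulary achieved by a stemmer. Illustrative helper, not library code.
def icf(words, stemmer):
    n = len(set(words))                      # unique words before stemming
    s = len({stemmer(w) for w in words})     # unique stems after stemming
    return (n - s) / n
```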

    3.4.2 Under- and over-stemming

Under-stemming is the failure to conflate morphologically related words. This occurs when two words that should be stemmed to the same root are not. An example of under-stemming would be the words "adhere" and "adhesion" not being stemmed to the same root.

Over-stemming refers to words that should not be grouped together by stemming but are. For example, merging the words "probe" and "probable" after stemming would constitute an over-stemming error.

Using a sample file of W grouped words, under-stemming errors are counted as described in (Paice 1996). A concept group contains forms that are both semantically and morphologically related to one another. For each group g containing ng words, the number of pairs of different words defines the desired merged total (DMTg):

    DMTg = 0.5 ng (ng - 1)

    Since a perfect stemmer should not merge any member of a group with other group

    words, for every group there is a desired non-merge total (DNTg):


    DNTg = 0.5 ng (W - ng)

Summing these two totals over all groups gives the global desired merged total (GDMT) and the global desired non-merge total (GDNT), respectively. Stemming errors are then calculated as follows:

Conflation Index (CI): proportion of equivalent word pairs that were successfully grouped to the same stem;
Distinctness Index (DI): proportion of non-equivalent word pairs that remained distinct after stemming.

The under-stemming index (UI) and the over-stemming index (OI) are given by:

    UI = 1 - CI    (3)
    OI = 1 - DI    (4)

In (Paice 1996), the author proposed the ratio of these two quantities as a measure of the stemming weight (SW):

    SW = OI / UI    (5)
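The Paice counts can be sketched as follows, using toy concept groups and a prefix-truncation stemmer (both purely illustrative); the stemming weight SW of Eq. (5) is then simply OI/UI:

```python
from itertools import combinations

# Sketch of the Paice error counts of Eqs. (3)-(4) over hand-built
# concept groups. Illustrative only, not Paice's original program.
def paice_indices(groups, stemmer):
    words = [w for g in groups for w in g]
    stems = {w: stemmer(w) for w in words}
    same_group = lambda a, b: any(a in g and b in g for g in groups)
    gdmt = sum(len(g) * (len(g) - 1) // 2 for g in groups)   # desired merges
    amt = sum(1 for g in groups
              for a, b in combinations(g, 2) if stems[a] == stems[b])
    gdnt = sum(1 for a, b in combinations(words, 2) if not same_group(a, b))
    ant = sum(1 for a, b in combinations(words, 2)
              if stems[a] == stems[b] and not same_group(a, b))
    ui = 1 - amt / gdmt              # under-stemming index, Eq. (3)
    oi = 1 - (gdnt - ant) / gdnt     # over-stemming index, Eq. (4)
    return ui, oi
```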

The point of Paice's error-counting approach is that although it is advantageous to have the index of terms compressed, this is only useful up to a point: as conflation becomes heavier, the merging of distinct concepts becomes increasingly frequent, and small increases in recall are then gained at the expense of a major loss of precision (Frakes 2003).

One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. These grouped files were constructed by human judgment during scrutiny of sample word lists (Paice 1996; Frakes 2003). For Arabic stemming evaluation, Al-Shammari selected a sample of 419 words and divided them into 81 conceptual groups (i.e., close to 5 words per group). Compared to the Khoja and light stemmers, Al-Shammari's lemmatizer reduced over-stemming errors. However, no effective improvement was achieved in under-stemming counts (Al-Shammari 2010).

    4 LDA topic model

    Latent Dirichlet allocation (LDA) is a generative topic model for text documents (Blei

et al. 2003). Based on the classical bag-of-words assumption, a topic model considers each document as a mixture of topics, where a topic is defined by a probability distribution

    over words.

The distribution over words within a document d is given by:

    P(wi|d) = Σj=1..T P(wi|zi = j) P(zi = j|d)

where P(w|z) defines the probability distribution over words w given topic z, and P(z|d) refers to the distribution over topics z in a document. More details and interpretations of topic models can be found in (Blei et al. 2003; Steyvers and Griffiths 2007).
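The mixture equation above can be evaluated directly; the two-topic distributions in the test below are toy values, purely for illustration:

```python
# The topic-mixture equation, literally: P(w|d) is the sum over topics of
# P(w|z) * P(z|d). `p_w_given_z[z]` maps words to probabilities for topic z;
# `p_z_given_d[z]` is the topic proportion in the document.
def word_prob(word, p_w_given_z, p_z_given_d):
    return sum(p_w_given_z[z].get(word, 0.0) * p_z_given_d[z]
               for z in range(len(p_z_given_d)))
```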


For a given number of topics T, the LDA model is trained from a collection of documents with the following notation:

N: number of words in the vocabulary.
M: number of documents in the corpus.
T: number of topics, given as an input value.
P(z|d): distribution over topics z in a particular document.
P(w|z): probability distribution over words w given topic z.

    Then, we can define a generative process as follows:

    For each document d= 1 to M (in dataset) do:

    1. Sample mixing probability hd* Dir(a)2. For each word wdi = 1 to N (in vocabulary) do:

    2.a. Choose a topic zdi [ {1,, T} * Multinomial (hd)

    2.b. Choose a word wd [ {1,, N} * Multinomial (bzdi)

    where a is a Dirichletsymmetric parameter and {bi} are multinomial topic parameter. Eachbi assigns a high probability to a specific set of words that are semantically related. Thi s

    distribution over vocabulary is referred to as topic. In the present work, we use LingPipe7

    LDA implementation which is based on Gibbs sampling for parameters estimation.
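The generative process above can be sketched directly; the corpus sizes, document lengths and the way the topic-word multinomials β are drawn here are toy assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): M documents, T topics, N vocabulary words.
M, T, N = 3, 2, 5
alpha = 0.1                    # symmetric Dirichlet parameter
doc_lengths = [10, 8, 12]

# {beta_j}: one multinomial over the N vocabulary words per topic;
# drawn from a Dirichlet here purely for the sketch.
beta = rng.dirichlet(np.full(N, 0.5), size=T)

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(T, alpha))  # 1. theta_d ~ Dir(alpha)
    words = []
    for _ in range(doc_lengths[d]):
        z = rng.choice(T, p=theta_d)            # 2.a. z_di ~ Multinomial(theta_d)
        w = rng.choice(N, p=beta[z])            # 2.b. w_di ~ Multinomial(beta_z)
        words.append(w)
    corpus.append(words)

print(corpus)   # lists of word indices, one list per document
```

Inference (e.g., by Gibbs sampling, as in LingPipe) runs this process in reverse: it recovers θ and β from the observed words alone.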

The choice of the number of topics can affect the interpretability of the results. A model with too few topics will generally result in very broad topics, whereas a model with too many topics will result in an uninterpretable model (Steyvers and Griffiths 2007). Since the number of topics (T) is given as an input parameter for training the LDA model, several methods have been proposed to select a suitable T. An obvious approach is to choose the T that leads to the best performance on subsequent tasks (classification, clustering, etc.).
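That selection strategy can be sketched as a simple grid search: train LDA for several candidate values of T and keep the one giving the best cross-validated classification accuracy. This sketch uses scikit-learn stand-ins for the LingPipe and LIBSVM tools used in the paper, on a synthetic count matrix rather than the Arabic corpora.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in corpus: 200 documents over a 50-word vocabulary,
# with a label-dependent word block so the labels are learnable from topics.
y = rng.integers(0, 2, size=200)
X = rng.poisson(1.0, size=(200, 50)).astype(float)
X[y == 1, :10] += rng.poisson(3.0, size=((y == 1).sum(), 10))

best_T, best_acc = None, -1.0
for T in (5, 10, 20):                       # candidate topic counts
    topics = LatentDirichletAllocation(n_components=T, random_state=0).fit_transform(X)
    acc = cross_val_score(LinearSVC(), topics, y, cv=5).mean()
    if acc > best_acc:
        best_T, best_acc = T, acc

print(best_T, round(best_acc, 3))
```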

    5 Building Arabic datasets

As noted above, developments in Arabic IR often face the problem of the unavailability of standard free resources, so we opted to build our own experimental datasets. To this end, we developed a Web crawler8 to collect newspaper articles from several Arabic websites. In this study, we present three real-world corpora based on Echorouk,9 Reuters10 and Xinhua11 Web articles. Each article is saved with UTF-8 encoding in a separate text file whose first line is reserved for its title. A brief description is given in Table 4.

    5.1 Datasets description

The Echorouk collection contains 11,313 documents from Echorouk newspaper articles relating to the 2008–2009 period. It is labeled according to eight categories. From the full corpus, Ech-11k, we build a subset, Ech-4000, of 4,000 documents for preliminary evaluations.

7 Available at http://www.alias-i.com/lingpipe/index.html.
8 Developed with Amine Roukh and Abdelkadir Sadouki at the Mostaganem University.
9 Algerian newspaper with online edition at http://www.echoroukonline.com/.
10 International news agency with Arabic online edition at http://www.ara.reuters.com/.
11 Chinese news agency with Arabic online edition at http://www.arabic.news.cn/.


The Reuters collection, Rtr-41k, contains 41,251 Arabic documents relating to the 2007–2009 period. It is labeled according to six categories. A subset, Rtr-5251, of 5,251 documents is used for preliminary evaluations.

The Xinhua collection contains 36,696 Arabic documents relating to the 2008–2009 period. It is labeled according to eight categories. A subset, Xnh-4500, of 4,500 documents is used for preliminary evaluations. Table 5 describes the collected datasets with their distributions

    over published categories.

    5.2 Arabic stemming

The three lemma-based stemmers (BBw0, BBw1 and BBw2) were applied to the three datasets described above. For comparison, we use the ISRI algorithm for light stemming and two variants of the Khoja algorithm for root-based morphological analysis. The first one, Khoja0, is the original Khoja algorithm, which outputs only the found roots. The second variant, Khoja1, additionally adds unfound words to the vocabulary. Furthermore, we use the raw text as a baseline characterization in the preliminary experiments. For a fair comparison, the stop-word list used in the BBwX algorithms is removed from the other stemmers' output.
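The comparison setup can be sketched as follows: apply each stemmer to a token stream, drop the shared stop-word list, and measure the resulting vocabulary. The stemmer functions, transliterated tokens and stop words below are placeholders, not the actual BBw, Khoja or ISRI implementations.

```python
# Illustrative transliterated stop words (placeholders).
stop_words = {"fy", "mn", "Ely"}

def identity(tok):
    # Stand-in for the "raw text" baseline: no stemming at all.
    return tok

def crude_prefix(tok):
    # Stand-in for a light stemmer: strip a leading "Al" (definite article).
    return tok[2:] if tok.startswith("Al") else tok

tokens = ["AlktAb", "ktAb", "fy", "mn", "AlqlM", "qlM"]

def vocabulary(stemmer, tokens):
    # Stem every non-stop-word token and collect the distinct results.
    return {stemmer(t) for t in tokens if t not in stop_words}

print(len(vocabulary(identity, tokens)))      # 4
print(len(vocabulary(crude_prefix, tokens)))  # 2
```

Conflating inflected variants shrinks the vocabulary, which is exactly the effect compared across stemmers in Table 6.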

    Table 4 Description of three datasets relating to Echorouk, Reuters and Xinhua Web-articles

    Feature\dataset Ech-11k Rtr-41k Xnh-36k

    # Articles 11,313 41,251 36,696

# Characters 48,247,774 111,478,849 85,696,969
# Tokens 4,388,426 10,093,707 7,532,955

    # Arabic tokens 3,341,465 7,892,348 6,097,652

    Average # tokens/article 387.9 244.7 205.3

    # Categories 8 6 8

    Table 5 Distribution of the three datasets over categories

    Dataset source Echorouk Reuters Xinhua

Category\dataset Ech-11k Ech-4000 Rtr-41k Rtr-5251 Xnh-36k Xnh-4500

1 World 2,274 572 10,000 1,000 9,465 561

    2 Economy 816 572 10,000 1,000 6,862 563

    3 Sport 3,554 572 10,000 1,000 1,132 563

    4 Middle-East 10,000 1,000 9,822 563

    5 Science-health 889 889 1,993 563

    6 Culture-education 566 566 1,508 562

    7 Algeria 2,722 573

    8 Society 808 572

9 Art 315 315

10 Religion 258 258

    11 Entertainment 362 362

    12 China 4,654 563

    13 Tourism-ecology 1,260 562

    Total 11,313 4,000 41,251 5,251 36,696 4,500


For each stemmer, the resulting corpus vocabulary sizes are reported in Table 6. As a main observation over the three datasets, the stemming results show that the ISRI algorithm produces high vocabulary dimensions. Close to raw text, ISRI does not provide a significant reduction of the feature space. On the contrary, morphological analysis with Khoja and BBw decreases the vocabulary size by unifying tokens that share a common root or lemma. Retaining only the correct forms, the Khoja0 and BBw0 stemmers produce the smallest vocabularies.

In morphological analysis, it is clear that our lemma-based stemmer (BBw0) enhances the feature space compared to a root-based stemmer (Khoja0). However, the vocabularies produced when adding unrecognized tokens (Khoja1 and BBw2) reveal a lack in the lexicons used. We expected such cases from our analysis of the BBwX variants in a previous section, where we discussed the causes of "no solution" outcomes. To assess the loss caused by this lack, we compute the ratio of unfound tokens to all words (found and unfound). Figure 1 shows that the BBw lexicon is more complete than Khoja's.

To complete our preliminary analysis and assess the performance of the BBwX stemmers when dealing with ambiguous forms, we compute the confusion degree as defined in (1). According to Table 7, the maximum confusion degree on the real-world corpora does not exceed 1.16. This indicates that even when our stemming approach includes all multiple solutions in the vocabulary, it preserves all word senses without significant lexical ambiguity.
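Since Eq. (1) is not reproduced in this excerpt, the following sketch assumes the confusion degree is the average number of stemming solutions per analyzed token, and Max-mul the largest solution count; both the measure's exact form and the sample analyses are assumptions for illustration.

```python
# Hypothetical analyzer output: each transliterated token maps to its
# candidate lemma solutions (illustrative values only).
analyses = {
    "slAmh": ["slAm", "slAmh"],     # ambiguous: peace / safety
    "ktAb":  ["ktAb"],
    "mAl":   ["mAl"],
    "Elm":   ["Elm", "EAlm", "Elam"],
}

# Assumed definition: average solutions per token, and the maximum.
conf_deg = sum(len(s) for s in analyses.values()) / len(analyses)
max_mul = max(len(s) for s in analyses.values())

print(round(conf_deg, 3), max_mul)  # 1.75 3
```

A confusion degree close to 1, as in Table 7, would mean most tokens receive a single lemma, so keeping all solutions adds little ambiguity.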

    5.3 Stemming evaluation

The ICF and Paice metrics were applied on the three datasets to analyze the stemmers' quality. For fairness of evaluation, the index compression factor was calculated after stop-word removal, as applied in the BBwX stemmers. In lemma-based stemming, a larger stop-list of 575 words is used. Table 8 summarizes the ICF values, as defined in (2).

Table 8 shows that the Khoja0 root-stemmer and the BBw0 lemma-stemmer achieve the best compression over the three datasets. However, light stemming with ISRI does not give a significant index reduction. Because they keep unknown tokens as proper stems, the other variants of Khoja and BBw reduce the index only slightly.

For Paice's evaluation, the main difficulty consists of building representative concept groups. Al-Shammari used a moderate set of 81 groups for comparing her lemmatizer to the Khoja and light Arabic stemmers. Unfortunately, neither the test collections nor the concept groups used in those experiments are available (Al-Shammari 2010).

In this study, we built a large set of real-world concept groups by processing the article titles in the three datasets (Ech-11k, Rtr-41k and Xnh-36k). After stop-word removal and preliminary grouping, we selected the groups that contained at least 10 words. Then, we revised the resulting groups with an Arabic-language expert before retaining a collection of 13,142 words distributed over 689 groups. Because all the selected words are recognized by the Khoja and BBw stemmers, it was not necessary to calculate stemming errors for the other variants (KhojaX, BBwX). Table 9 gives the results of Paice's evaluation.
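Paice's error counting (Paice 1996) can be sketched pairwise, under the simplifying assumption that UI is the fraction of same-group word pairs a stemmer fails to merge and OI the fraction of different-group pairs it wrongly merges; the toy English groups and the crude prefix "stemmer" are illustrative only.

```python
from itertools import combinations

# Hypothetical concept groups: words in one group should share a stem.
groups = {
    "write": ["writes", "writing", "written"],
    "run":   ["runs", "running"],
    "rung":  ["rungs"],
}

def stem(word):
    return word[:3]            # deliberately crude prefix stemmer

words = [(w, g) for g, ws in groups.items() for w in ws]

unachieved = desired = wrong = possible_wrong = 0
for (w1, g1), (w2, g2) in combinations(words, 2):
    if g1 == g2:               # pair that should merge
        desired += 1
        if stem(w1) != stem(w2):
            unachieved += 1    # under-stemming: left unmerged
    else:                      # pair that should stay distinct
        possible_wrong += 1
        if stem(w1) == stem(w2):
            wrong += 1         # over-stemming: wrongly merged

UI = unachieved / desired
OI = wrong / possible_wrong
print(UI, OI)                  # here: no under-stemming, some over-stemming
```

On this toy data the prefix stemmer wrongly merges "runs"/"running" with "rungs", which shows up as a non-zero OI, the kind of error the BBw stemmer keeps low in Table 9.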

    Table 6 The vocabulary sizes of three datasets according to different stemmers

    Vocabulary Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2

    Ech-11k 179,225 144,442 3,172 22,411 17,558 60,148 48,789

Rtr-41k 183,510 150,977 3,153 38,644 17,020 73,382 62,843

Xnh-36k 175,410 140,192 3,089 10,712 11,352 25,835 22,456


The results show that our lemma-based stemmer provides the lowest over-stemming and under-stemming indexes (OI and UI), a significant performance gain compared to the Khoja and ISRI stemmers. Recall that the BBw stemmer can generate multiple stems for an input token. In theory, this fact increases under-stemming errors, but empirically the results show that BBw improves sense distinctness.

    6 Experiments and results

The experiments performed for text categorization and topic modeling are described and analyzed in this section. The support vector machine (SVM) is a kernel-based method introduced by Vapnik (1995) for binary and multi-class classification. In this work, the LIBSVM12 package is applied for multi-class text categorization. As SVM training parameters, a simple linear kernel is used with the cost parameter set to 10. For performance evaluation, fivefold cross-validation is performed for each feature space of the datasets.
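The evaluation protocol can be sketched with scikit-learn's LIBSVM wrapper: a linear kernel, cost C = 10, and fivefold cross-validation. The feature matrix here is synthetic; in the paper it would be the word space or topics space of a dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable toy labels

clf = SVC(kernel="linear", C=10)          # linear kernel, cost = 10
scores = cross_val_score(clf, X, y, cv=5) # fivefold cross-validation
print(scores.mean())                      # mean accuracy over the 5 folds
```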

Fig. 1 The unfound-token rate for the two analyzers, Khoja and BBw, on Ech-11k, Rtr-41k and Xnh-36k

    Table 7 Confusion degree (Conf-deg) relating to the BBwX stemmers

    Dataset Ech-11k Rtr-41k Xnh-36k

    Stemmer Conf-deg Max-mul Conf-deg Max-mul Conf-deg Max-mul

    BBw0 1.118 5 1.129 6 1.121 5

    BBw1 1.146 6 1.155 6 1.141 6

    BBw2 1.106 5 1.116 5 1.105 5

    Max-mul refers to maximum multiple solutions

    Table 8 Comparison of index compression factors for different stemmers

    Dataset\stemmer ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2

    Ech-11k 0.194 0.982 0.875 0.902 0.664 0.728

    Rtr-41k 0.177 0.983 0.789 0.907 0.600 0.658

    Xnh-36k 0.201 0.982 0.939 0.935 0.853 0.872

12 LIBSVM-2.89 is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm+zip.


    It is worth pointing out that: (1) for each dataset, six stemmers (ISRI, Khoja0, Khoja1,BBw0, BBw1 and BBw2) are applied for categorization and topic modeling, (2) raw text isused as baseline, (3) for topic modeling, all datasets are used as unlabeled collections.

    6.1 Classification in full word space

For the basic word-space definition, the TF and TF-IDF measures are applied to the stemmed datasets. TF_t refers to the term frequency of each term (t) in its document. The inverse document frequency (IDF_t) measures the general importance of the term over a set of documents (D). In this work, we use the simple formulation of the TF-IDF measure as follows:

TF-IDF(term t | D) = TF_t · IDF_t = TF_t · log( |D| / |{d : t ∈ d}| )    (6)
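Equation (6) can be sketched directly on a toy tokenized corpus (the documents below are illustrative only):

```python
import math

docs = [
    ["money", "bank", "market"],
    ["money", "peace", "talks"],
    ["peace", "talks", "talks"],
]

def tf_idf(term, doc, corpus):
    # TF_t: raw count of the term in this document.
    tf = doc.count(term)
    # df: number of documents containing the term (assumed > 0 here).
    df = sum(1 for d in corpus if term in d)
    # Eq. (6): TF_t * log(|D| / df).
    return tf * math.log(len(corpus) / df)

print(tf_idf("talks", docs[2], docs))   # frequent in doc 2, rare in corpus
print(tf_idf("money", docs[0], docs))
```

A term that is frequent in a document but rare across the collection gets the highest weight, which is the behavior the classification experiments below rely on.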

The three subsets (Ech-4000, Rtr-5251 and Xnh-4500) are used for text categorization in the full word space. According to the two term models (TF and TF-IDF) and the various stemming approaches, Fig. 2 gives a preliminary evaluation of text classification.

As a main observation, the Khoja stemmers give the weakest performances. With abstract root stemming and an incomplete lexicon, the Khoja0 algorithm degrades text characterization. When unfound tokens are added, the Khoja1 variant slightly improves classification performance. Although the ISRI light stemmer produces huge vocabularies, it seems to give good performance for classification in the full word space. The same point can be made for raw text. However, the BBwX stemmers improve classification accuracy at a reasonable dimension of the word space.

    6.2 Classification in topics space

The three subsets (Ech-4000, Rtr-5251 and Xnh-4500), with their various stemming variants, are trained with the LDA algorithm. For several numbers of topics, we report in Tables 10, 11 and 12 the classification accuracy obtained by SVM cross-validation.

According to Tables 10, 11 and 12, we can deduce a suitable number of topics for each stemmed dataset. The best models are obtained when choosing, for LDA training, a number of topics between 100 and 400.

Considering the stemming methods, the preliminary experiments show that the Khoja stemmers give the lowest classification performance. A result unexpected for linguists is that, with raw texts, topic modeling can produce performance as good as with morphological analysis. Recall that light stemming (ISRI) also generated large vocabularies, in which different entries should have been conflated into a single Arabic stem.

Focusing on the morphological analyzers, the experiments show that the BBw lemma-based stemmers improve classification in the topics space compared to the Khoja root-based stemmers. For the Reuters collection as an example, we highlight in Fig. 3 the difference between the

Table 9 Paice's evaluation for three Arabic stemmers

Stemmer OI UI SW

ISRI 0.019 × 10⁻³ 0.968 0.019 × 10⁻³

Khoja 1.452 × 10⁻³ 0.098 14.87 × 10⁻³

BBw 0.006 × 10⁻³ 0.060 0.096 × 10⁻³


stemmers that retain only the correct Arabic forms. It is clear that lemma-based stemming enhances LDA modeling even with a low number of topics.

    6.3 Finding topics in newspaper articles

In this section, we illustrate some results of LDA modeling on Arabic texts. The BBw2 algorithm was used for stemming the three complete datasets (Echorouk, Reuters and Xinhua). Note that a latent topic can be titled by human assessment according to its most relevant terms. For example, Table 13 gives the distribution of the topics that are most relevant to the word [mAl, money].
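Finding the topics most relevant to a word, as done for [mAl, money], can be sketched by ranking topics by the probability they assign to that word; the small P(w|z) matrix and vocabulary below are illustrative values, not trained parameters.

```python
import numpy as np

vocab = ["mAl", "bnk", "krh", "slAm"]
p_w_given_z = np.array([          # rows = topics, columns = vocabulary words
    [0.30, 0.25, 0.01, 0.02],     # a finance-like topic
    [0.02, 0.01, 0.40, 0.01],     # a sport-like topic
    [0.15, 0.05, 0.02, 0.10],     # a mixed topic
])

w = vocab.index("mAl")
ranked = np.argsort(-p_w_given_z[:, w])   # topics sorted by P(word | topic)
print(ranked)                             # most relevant topic first
```

Listing each retained topic's own top words then lets a human assessor give it a title, as in Table 13.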

In addition, we propose to compute the category distribution over the learned topics. A confusion matrix (category\topic) can be obtained by adding the document distributions of the same category. In Table 14, we give an illustration of the category distribution over the

    Table 10 Classification accuracy in topics space of the Ech-4000 corpus

    # Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2

    32 76.3 77.4 71.7 76.5 79.0 78.9 79.1

    64 79.2 79.4 75.4 78.7 79.4 79.7 80.2

    100 80.2 80.3 77.3 79.2 80.9 80.0 80.8

    200 80.9 81.2 77.2 79.6 80.6 80.6 81.1

    300 81.8 80.4 76.9 78.2 80.3 80.8 81.3

    400 80.9 80.5 76.7 79.2 80.5 80.5 80.7

    500 81.3 80.7 77.3 77.5 80.0 80.9 81.0

    600 80.9 80.6 76.0 77.7 80.3 80.9 80.3

    700 80.6 79.7 75.6 76.4 79.3 80.1 79.6

Six stemmers are applied, with raw text as a baseline, for several numbers of topics

    Bold values indicate the best classification accuracies

Fig. 2 Classification accuracy in the full-word space of the three datasets (Ech-4000, Rtr-5251 and Xnh-4500), for the TF and TF-IDF models and the stemmers Raw, ISRI, Khoja0, Khoja1, BBw0, BBw1 and BBw2


eight main topics13 from the Reuters dataset. By setting a likelihood threshold at 10%, one can identify the relevant topics in each category. For example, one easily discovers that the main subjects in the sport category during 2007–2009 were football and tennis.
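The category × topic matrix can be sketched by summing the per-document topic distributions P(z|d) of each labeled category and normalizing, then applying the 10% threshold; the document distributions and labels below are toy values.

```python
import numpy as np

T = 4
doc_topics = np.array([           # P(z|d) rows, one per document
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
])
categories = ["sport", "sport", "economy"]

labels = sorted(set(categories))
conf = np.zeros((len(labels), T))
for cat, dist in zip(categories, doc_topics):
    conf[labels.index(cat)] += dist       # add distributions per category
conf /= conf.sum(axis=1, keepdims=True)   # normalize each row to proportions

# Relevant topics per category: likelihood above the 10% threshold.
for i, cat in enumerate(labels):
    print(cat, np.where(conf[i] > 0.10)[0])
```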

    Table 11 Classification accuracy in topics space of the Rtr-5251 corpus

    # Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2

    32 80.0 81.1 77.5 82.3 83.8 83.3 84.2

64 84.2 84.4 80.4 83.9 85.6 84.8 84.6

100 84.3 85.5 82.0 84.6 86.0 85.6 85.3

    200 85.7 85.2 82.8 84.9 85.6 85.8 85.8

    300 85.6 84.9 83.5 84.7 86.0 85.7 85.7

    400 84.9 84.9 83.2 85.0 85.5 85.7 85.6

    500 84.2 84.5 82.4 84.4 85.5 85.2 85.5

    600 84.4 84.7 81.8 84.9 85.2 85.1 85.4

    700 84.3 84.5 81.6 83.7 85.2 85.4 85.4

    Bold values indicate the best classification accuracies

    Table 12 Classification accuracy in topics space of the Xnh-4500 corpus

    # Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2

    32 67.5 70.1 68.9 74.6 75.3 75.0 74.2

    64 75.8 77.7 74.3 78.4 80.3 81.0 79.9

    100 78.8 77.5 75.6 79.8 81.9 81.6 80.6

    200 81.3 82.1 78.7 80.8 82.8 83.1 82.2

    300 82.0 81.5 79.5 81.1 82.9 82.5 83.2

400 81.0 81.2 79.8 81.3 82.5 82.3 82.4

500 80.5 81.0 78.9 81.3 82.5 81.8 82.3

    600 81.3 80.9 78.2 80.9 81.8 81.7 81.9

    700 81.2 81.4 78.5 79.3 81.8 82.4 82.4

    Bold values indicate the best classification accuracies

Fig. 3 Classification accuracy in the topics space of the Rtr-5251 corpus: comparison between root-based (Khoja0) and lemma-based (BBw0) stemming over several numbers of topics

    13 More details are available at https://sites.google.com/site/abderrezakbrahmi/.


Furthermore, words can be analyzed by finding their different contexts when training LDA on each collection. Among 100 latent topics, we report in Table 15 the topics relevant to some words. Two senses can be assigned to the first one, [slAm]: either peace or greeting/salutation.

It is clear in Table 15 that the different contexts related to this word imply the first sense, except for the topic Coal mines in Australia. In fact, the fourth topic in the Xnh-36k dataset is related to another word. The input token [slAmh] has two BBw stemming solutions ([slAm] and [slAmh]). The second solution means safety, which is highly correlated with the security and safety requirements in coal mines. However, the vocabularies obtained by Khoja stemming do not contain this entry, because the word [slAm] was indexed by its root [slm]. This non-vocalized root induces various senses such as [sul*im, be conceded], [silom, peace] and [sul*am, stairs]. Furthermore, the Khoja algorithm indexes under the same root other lemmas such as [isolam, Islam] and [saliym, correct].

The second word, [mAl], means either money/capital/funds or lean/bend/incline/sympathize. However, Table 15 shows that several specific topics are related to the first sense, including some ways of financing the Al-Qaida organization.

    7 Conclusion and future work

For information retrieval tasks, several stemming approaches have been proposed and applied to Arabic texts. Light stemmers try to remove the most common affixes from a word, whereas morphological analyzers attempt to extract the correct root or lemma. The preference for a particular method depends on the nature of the subsequent IR task. Unfortunately, the literature does not give clear answers for choosing the appropriate stemming method. Using raw text for categorization or topic analysis can also lead to acceptable performance (Brants et al. 2002; Said et al. 2009).

Table 13 The four topics related to the word ([mAl, money]) in the Echorouk (Ech-11k) dataset

    The word [mAl, money] in bold


Two main contributions were presented in this study. First, we proposed the BBw lemma-based stemming with specific text normalization and multiple-lemma indexing. By applying a confusion measure to three real-world corpora, we showed that our stemming approach can preserve the semantics embedded in Arabic texts without compromising lexical characterization. Paice's evaluation was used to measure under- and over-stemming errors, and the results showed high effectiveness for our approach. The BBw lemma-based stemmer significantly reduces the vocabulary dimension as well as under- and over-stemming errors. In addition, classification performance is slightly improved compared to the classification of raw and light-stemmed texts.

For morphological analysis, the three BBw variants were compared to root-based stemmers (Khoja0 and Khoja1). The two variants tested for the Khoja algorithm revealed a lack in its lexicon (roots and patterns). This limitation was overcome by adding the unfound words as new vocabulary entries. However, it would be judicious to maintain the linguistic resources permanently.

Second, three real-world corpora were tested for Arabic stemming and topic modeling. Tens of thousands of Web articles were automatically crawled from Echorouk, Reuters and Xinhua. The variety of writing styles allowed us to validate the proposed lemma-based stemmer. For topic modeling evaluation, SVM classification was

Table 14 Distribution of Reuters categories over eight latent topics

Topic\category Middle-East (%) Business (%) World (%) Sport (%) Entertainment (%) Science (%)

Military actions (USA, Iraq, Afghanistan) 25.4 1.8 21.1 0.8 2.8 2.0

Football 0.6 0.6 0.9 52.3 4.9 1.2

Economy 3.0 75.2 2.8 1.4 5.4 5.9

Health and science 12.3 5.6 17.4 4.0 50.3 59.6

Iranian nuclear 12.4 11.4 28.9 3.3 6.9 26.0

Iranian politics 21.0 3.8 26.9 3.2 22.2 4.1

Tennis 0.3 0.5 0.7 34.4 1.6 0.5

Middle-East 25.0 1.1 1.4 0.6 6.0 0.8

Bold values indicate relevant topics

Table 15 The topics related to two words ([slAm] and [mAl]) in each dataset

Word Ech-11k Rtr-41k Xnh-36k

    [slAm] Arab league Somali actors UN missions

Palestinian dialogue Nobel Prize Jordan's peace effort

    Middle-East negotiations Middle-East negotiations

    Sudan events Coal mines in Australia

Sudan events

[mAl] Monetary values Arabic gulf investments Financial crisis

    Al-Qaida operations Financial operations Financial operations

    Financial institutions Financial crisis Energy projects

    Government projects Government investments


successfully tested with various stemming methods. In particular, the proposed BBw stemming approach proved its efficiency in both text classification and Arabic topic modeling.

Furthermore, LDA topic modeling was applied to Arabic texts. A large investigation was carried out by varying the number of topics. Classification in the topics space was performed to assess the performance of the LDA model with the different stemming methods. When training LDA on BBw vocabularies, interpreting topics was easier than with those obtained from Khoja roots.

It is worth pointing out that it is difficult to see how one can assess semantic aspects of Arabic texts without sufficient linguistic knowledge (Larkey et al. 2002, 2004). This study shows that effective developments in Arabic IR and topic modeling cannot be achieved without close collaboration between computer scientists and Arabic language experts.

As future work, the BBw stemmer will be improved by handling additional irregular forms and by extending the lexicons with proper names (person, location and organization). Further effort must be directed to integrating the topic model with an end-user retrieval system.

Acknowledgments We gratefully acknowledge Professor Patrick Gallinari, director of the LIP6 laboratory (Paris, France), for his valuable advice in achieving this work. We would like to thank the reviewers of the INRT journal for their insightful and constructive comments.

    References

Al-Shammari, E. (2010). Lemmatizing, stemming, and query expansion method and system. US Patent 20100082333, April 2010.
Al-Shammari, E., & Lin, J. (2008). A novel Arabic lemmatization algorithm. In Proceedings of the workshop on analytics for noisy unstructured data, Singapore, pp. 113–118.

Blei, D. M., Franks, K., Jordan, M. I., & Mian, I. S. (2006). Statistical modeling of biomedical corpora: Mining the Caenorhabditis Genetic Center bibliography for genes related to life span. BMC Bioinformatics, 7, 250. doi:10.1186/1471-2105-7-250.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

    Brants, T., Chen, F., & Farahat, A. (2002). Arabic document topic analysis. LREC-2002 workshop on

    Arabic language resources and evaluation, Las Palmas, Spain.

    Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 1.0. Linguistic data consortium,

    University of Pennsylvania. LDC catalog no. LDC2002L49.

Darwish, K., Hassan, H., & Emam, O. (2005). Examining the effect of improved context sensitive morphology on Arabic information retrieval. In Proceedings of the ACL workshop on computational approaches to Semitic languages, Ann Arbor, Michigan, pp. 25–30.
Frakes, W. B. (2003). Strength and similarity of affix removal stemming algorithms. In SIGIR Forum (Vol. 37, issue 1), pp. 26–30.
Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence, pp. 289–296.
Kadri, Y., & Nie, J. (2006). Effective stemming for Arabic information retrieval. In The challenge of Arabic for NLP/MT, international conference at the British Computer Society (BCS), pp. 68–74. London, UK.

    Khoja, S., & Garside, R. (1999). Stemming Arabic text. Technical report. Computing Department, Lancaster

    University, Lancaster.

Larkey, L. S., Ballesteros, L., & Connell, M. E. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of SIGIR 2002, pp. 275–282. Tampere, Finland.
Larkey, L. S., & Connell, M. E. (2001). Arabic information retrieval at UMass in TREC-10. In TREC 2001, pp. 562–570. Gaithersburg, Maryland, USA.
Larkey, L. S., Feng, F., Connell, M. E., & Lavrenko, V. (2004). Language-specific models in multilingual topic tracking. In Proceedings of SIGIR 2004, pp. 402–409. Sheffield, UK.


Moukdad, H. (2006). Stemming and root-based approaches to the retrieval of Arabic documents on the Web. Webology, 3(1), article 22.
Oard, D. W., & Gey, F. (2002). The TREC-2002 Arabic/English CLIR track. In TREC 2002 notebook, pp. 81–93.
Paice, C. D. (1996). Method for evaluation of stemming algorithms based on error counting. Journal of the American Society for Information Science, 47(8), 632–649.
Said, D., Wanas, N., Darwish, N., & Hegazy, N. (2009). A study of text preprocessing tools for Arabic text classification. In Proceedings of the 2nd international conference on Arabic language resources and tools, pp. 230–236. Cairo, Egypt.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Handbook of Latent Semantic Analysis. Mahwah, NJ: Lawrence Erlbaum.
Taghva, K., Elkoury, R., & Coombs, J. (2005). Arabic stemming without a root dictionary. In Proceedings of the international conference on information technology: Coding and computing, Vol. 01, pp. 152–157.
Tuerlinckx, L. (2004). La lemmatisation de l'arabe non classique. In JADT 2004, 7es Journées internationales d'Analyse statistique des Données Textuelles, pp. 1069–1078.

    Vapnik, V. N. (1995). The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New

    York, Inc.
