
The SALAH Project:

Segmentation and Linguistic Analysis of ḥadīṯ Arabic Texts

Marco Boella1, Francesca Romana Romani2, Anjela Al-Raies2, Cristina Solimando2, Giuliano Lancioni2

1 University “La Sapienza”, Rome, Italy [email protected]

2 Roma Tre University, Rome, Italy [email protected], [email protected]

{csolimando,lancioni}@uniroma3.it,

Abstract. A model for the unsupervised segmentation and linguistic analysis of Arabic texts of Prophetic tradition (ḥadīṯs), SALAH, is proposed. The model automatically segments each text unit into a transmitter chain (isnād) and a text content (matn) and further analyses each segment according to two distinct pipelines: a set of regular expressions chunks transmitter chains into a graph labeled with the relations between transmitters, while a tailored, augmented version of the AraMorph morphological analyzer (RAM) analyzes and annotates the text content lexically and morphologically. A graph with relations among transmitters and a lemmatized text corpus, both in XML format, are the final output of the system, which can further feed the automatic generation of concordances of the texts with variable-sized windows. The model results can be useful for a variety of purposes, including retrieving information from ḥadīṯ texts, verifying the relations between transmitters, finding variant readings, and supplying lexical information to specialized dictionaries.

Keywords. segmentation, Arabic text treatment, information retrieval, morphological analyzer, hadith, regular expressions, graph

1 Introduction

Information retrieval in Arabic has often focused on contemporary texts, for obvious reasons: electronic availability, usefulness of information, analogy with work done in other linguistic domains. However, Classical texts are much more important in contemporary Arabic culture than in most Western countries, as witnessed by the wide diffusion of websites which make medieval books available not only to scholars but also, and most importantly, to laymen interested in such texts.

All authors have contributed equally to this work, but since it refers to a modular project, Boella should be mainly credited for sec. 3, Romani for sec. 2, Al-Raies for sec. 5, Solimando for sec. 4.1, Lancioni for secs. 1, 4.2 and 6.

M. Boella, F.R. Romani, A. Al-Raies, C. Solimando, G. Lancioni 'The SALAH Project: Segmentation and Linguistic Analysis of Hadīth Arabic Texts' in M. V. Salem, K. Shaalan, F. Oroumchian, A. Shakery, H. Khelalfa, Proceedings of the Seventh Asia Information Retrieval Societies Conference, Springer, Heidelberg, 2011 [in press]. The original publication will soon be available at www.springerlink.com


A special role in this context is played by ḥadīṯ texts, the set of narratives on the life and deeds of the Prophet that together constitute the sunna, or Islamic Tradition (see Section 2). These texts are not only of historical importance: they are the cornerstone of Muslim law and a favored reading of most Muslims around the world, and their presence in contemporary written Arabic is widespread.

Notwithstanding their importance, Classical texts have not been, at least to the best of our knowledge, the subject of any scholarly research project as far as information retrieval is concerned: to search the texts, most scholars still refer to older, paper resources such as Wensinck's concordances [1].

Yet ḥadīṯ texts are a privileged field for information retrieval. Their structure, which couples a text with a preceding chain of transmitters that assures the validity of the tradition (the isnād), is already (if informally) organized in such a way that readers are able to detect information with a relatively small amount of ambiguity. Nevertheless, despite the importance of relations among transmitters for ascertaining the legal validity of a tradition, such data are still managed in a rather haphazard way, by resorting to traditional resources such as prosopographical repertories and by evaluating transmission relations in a mostly impressionistic way.

The same is true of the lexical and grammatical content of traditions: in most cases, interpreters analyze each ḥadīṯ on its own merits, with little, if any, recourse to cross-textual regularities and collocations.

Our research project aims to devise methods and algorithms to extract as much information as possible from such texts automatically. The subjects on which we started working are the automatic segmentation of isnād and narrative text (or matn: see Section 3), the reconstruction of chains of transmitters through graphs, the creation of (semi-)automatic lexical concordances and the prospective development of a grammar suitable to (semi-)automatically interpret texts and to build semantic representations which can further be employed in inference (by modeling a classical method used by Islamic law scholars). A morphological analyzer and lemmatizer is presented in Section 4, and its preliminary results are discussed in Section 5.

2 Contents and Structure of the Corpus: the Ḥadīṯs

Ḥadīṯ, lit. ‘narrative, talk’, is the term used to indicate each member of the set of shorter or longer narratives on the life and deeds of the Prophet Muḥammad (571-632). These narratives report what he said or did, or his tacit approval of something said or done, and by themselves define what is considered good, providing details to regulate all aspects of life in this world and to prepare people for the beyond, and clarifying the nuances of the Koran. Ḥadīṯ texts constitute the sunna, lit. ‘way of life’, or Islamic Tradition, which in Muslim culture is considered second in authority only to the Koran: the other sources of Islamic law (uṣūl al-fiqh), ijmā‘ ‘consensus’ and qiyās ‘analogical reasoning’, generally have a lower rank.1

1 Since ḥadīṯs became sources of rules of conduct, as authoritative examples of the Prophet’s behavior, they were in great demand: the Companions of the Prophet, or simply those who had known him, had much to tell about him, and the new converts wished to learn what he said or did in order to imitate him and to conform to his standards of behavior, in the name of taqlīd (the so-called imitatio Muḥammadica or ‘imitation of Muḥammad’).

The structure of a ḥadīṯ is binary: the text of the narrative, or matn, is preceded by a chain of transmitters (isnād, literally ‘support’) who have handed down the narrative and who assure the validity of the tradition, following one another back to the first one who saw or heard Muḥammad.2

2 The isnād is a guarantee of truth for the ḥadīṯ, through the reputation and good faith of the transmitters who handed down the narratives orally [2]: Islamic civilization was built upon the supremacy of the spoken word and of hearing, and written fixation has only a supporting role, as a prescriptive and restrictive measure against the tendency to establish false chains of tradition that would be considered valid.

For the selection of input data, computational and linguistic criteria were privileged over philological ones. Among the canonical collections of ḥadīṯs, we chose an on-line edition of the collection known as Ṣaḥīḥ Al-Buḫārī, compiled by Muḥammad ibn Isma‘īl al-Buḫārī [3]. Its features, namely full digitization and vocalization, allow a wide range of investigations without any need for manual intervention or preparation. The text has been processed as is, and a systematic check of orthographical and philological coherence has been postponed as not relevant at this stage of the implementation of the project's architecture.

3 Extracting Surface Information: from Segmentation to Representation

Automatic segmentation is a process that assigns segment boundaries in order to obtain discrete objects from a non-discrete continuum [4]. This approach aims to avoid, or at least to limit drastically, supervised intervention, which is rather costly in time and human involvement, especially with large amounts of data [4-6].

3.1 Ḥadīṯ Segmentation: Pairing Explicit and Implicit Information

The task of segmenting ḥadīṯ texts is in many respects analogous to other cases of semi-structured texts in (pseudo-)natural languages, such as semi-formal texts in descriptions of mathematics. Wolska and Kruijff-Korbayová [7] approached the analysis and formalization of the symbolism and formulas used in mathematics manuals, and drafted a model that: (i) finds regularities in a text; (ii) employs them as patterns to extract textual and meta-textual information; (iii) conceives a set of rules based on these patterns in order to translate extended verbal expressions automatically into mathematical formulas and vice versa. This study and others [8] point out that segmentation can be used in textual analysis not only to identify discrete strings, but also to try to assign a global structure to the text itself, through an analysis of regularities and recurrences. This structure can be seen as governed by a sort of contextual “grammar of rules”, which also controls the connections between the information content and its textual organization.

As noted above, ḥadīṯ collections fit the definition of “structured texts” well. The organization of a ḥadīṯ is, in fact, almost rigorous and dependent on a set of recurrent “functional expressions” (based on verbs and prepositions in particular) that bound, define and sometimes nest different kinds of content [9-10]. The textual continuum can therefore be read as formed by two levels: the first contains information, while the second assumes, beside its textual value, a meta-textual function which organizes and defines the first level. This seems to show, although in a linear way, a structure somewhat similar to that employed in databases, in which records contain information defined by fields. The parallelism with mathematics and information science is far from perfect, however. In fact, a look at the literature on the structure of ḥadīṯ collections shows that there is no general agreement about the original value, meaning and translation of these “functional expressions” [11-13], and a complete set of them has not yet been fully defined. However, they can certainly be considered as carrying some extra meanings beside the merely linguistic ones: (i) they separate one transmitter from another; (ii) they specify the authority and typology of transmission; (iii) sometimes they show the “direction” of the transmission. An automatic recognition of these elements in the text and of their role seems able to outline a rather solid structure of relationships and meanings.

3.2 Extraction and Organization: the HadExtractor Program

In order to test the above-mentioned models of segmenting and structuring texts, a specific program named HadExtractor (HE) has been designed to deal with ḥadīṯ corpora, aiming to: (i) read the full collection and identify single ḥadīṯs; (ii) for each ḥadīṯ, separate the isnād from the matn; (iii) extract from each isnād all transmitters’ names together with the related supplementary information (position, typology and direction of transmission). HE was written and implemented in Python [14]. At present HE is designed as a rather closed system, since it specifically requires ḥadīṯ texts as input and does not yet handle other Classical Arabic textual structures.

Direct processing of Arabic script in programming languages is possible in theory but hard to manage, especially because of the switching of direction (right to left and left to right) among strings in different character systems. The original Arabic script has therefore been converted using a set of characters based on the Buckwalter transliteration system [15], which we modified in order to fit the constraints of Python and of regular expressions on special reserved characters. This transliteration uses ASCII characters only and replaces the diacritics usually employed in academic Latin transliteration with capital letters and, where needed, with non-alphabetic characters. We implemented the conversion with a specific program that allows back-transliteration to Arabic at every stage and processes both vocalized and unvocalized strings [16].
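As an illustration of why a modified scheme is needed, the following minimal sketch implements a reversible mapping for a small subset of the standard Buckwalter table. It is not the authors' modified scheme, whose mapping is not given here; the function names are hypothetical. It also lists which of the standard symbols collide with regular expression metacharacters, such as `*` and `$`.

```python
import re

# Subset of the STANDARD Buckwalter table (Arabic code point -> ASCII symbol).
# Assumption: this is not the modified scheme used by HE, only an illustration.
BUCKWALTER = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u062A": "t",  # ta
    "\u062B": "v",  # tha
    "\u062D": "H",  # Ha
    "\u062F": "d",  # dal
    "\u0630": "*",  # dhal
    "\u0634": "$",  # shin
    "\u064A": "y",  # ya
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukun
}
REVERSE = {ascii_sym: arabic for arabic, ascii_sym in BUCKWALTER.items()}


def to_ascii(text):
    """Transliterate Arabic script to ASCII, leaving unknown characters untouched."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def to_arabic(text):
    """Back-transliterate ASCII to Arabic script (valid for text produced by to_ascii)."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


def regex_unsafe(symbols):
    """Return the symbols that re.escape would need to escape."""
    return sorted(s for s in symbols if re.escape(s) != s)


word = "\u062D\u064E\u062F\u0650\u064A\u062B"  # vocalized Arabic for 'hadith'
ascii_word = to_ascii(word)
round_trip = to_arabic(ascii_word)
unsafe = regex_unsafe(BUCKWALTER.values())
```

Replacing the symbols reported by `regex_unsafe` with letters or harmless punctuation is one plausible way to obtain a regex-friendly variant while keeping the mapping one-to-one, so that back-transliteration remains possible.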

The core of HE is based on regular expression syntax, conceived in the 1950s by Kleene as a tool of automata theory to describe formal languages and developed afterwards by Thompson for use in programming languages [17]. A regular expression (regex) consists of a formally structured text string describing a complex search pattern, which can be applied to other strings in order to find matches. The range of constants and operators available makes regexes powerful and deeply expressive instruments for retrieving textual segments [18]. HE has been built mainly on three regexes: the first identifies single ḥadīṯs, the second separates the isnād from the matn, and the last captures the transmitters’ names.

After the identification of single ḥadīṯs, HE looks for the textual separation point between isnād and matn, through a regex that models a pattern containing as variables the above-mentioned “functional expressions” and some lookbehind and lookahead operators that check the context in order to establish the actual “functional” value of the expression since, obviously, the same word could recur in other contexts, for example inside the matn, without any particular meta-textual role. Once all isnāds are obtained, another regex, working in a similar way but referring to a larger list of “functional expressions”, extracts all transmitters’ names, pairing them with the corresponding “functional expression”. The extracted matns were paired with a digitized English translation [19-20], which was processed by a tailored version of HE.

Once HE has run all regex routines and related tasks, it produces as output an XML file in which all extracted information is organized, automatically tagged and nested. The following lines show an excerpt from the output XML file, related to a single ḥadīṯ (Arabic script is in modified Buckwalter transliteration):
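The extraction step can be sketched as follows. This is a minimal illustration, not the actual HE regexes, which are not reproduced in the paper: the list of functional expressions is a tiny hypothetical subset taken from the XML excerpt in this section, and the sample isnād string is assembled from the same excerpt. Lookbehind and lookahead assertions ensure that an expression counts only when it stands as a whole token.

```python
import re

# Hypothetical, very partial list of "functional expressions" (transliterated).
FUNCTIONAL = ["Had+aCaniy", "Had+aCanaA", "samic_tu", "Ean+a"]

# Escape each expression ('+' is a regex operator) and try longer ones first.
ALTERNATION = "|".join(
    re.escape(expr) for expr in sorted(FUNCTIONAL, key=len, reverse=True)
)

# A pair is: a functional expression standing as a whole token (lookbehind),
# followed by a name that runs until the next functional expression or the
# end of the string (lookahead).
PAIR = re.compile(
    rf"(?<!\S)(?P<type>{ALTERNATION})\s+(?P<name>.+?)"
    rf"(?=\s+(?:{ALTERNATION})(?!\S)|\s*$)"
)


def extract_transmitters(isnad):
    """Return (functional expression, transmitter name) pairs in chain order."""
    return [(m.group("type"), m.group("name")) for m in PAIR.finditer(isnad)]


isnad = (
    "Had+aCaniy muHam+adu b_nu Eabiy GaAlibI "
    "Had+aCanaA muHam+adu b_nu eis_maAciyla "
    "Had+aCanaA muc_tamirU samic_tu Eabiy"
)
pairs = extract_transmitters(isnad)
```

The position of each transmitter in the chain is simply its index in the returned list. A real system would need a much larger expression list and further context checks than this sketch provides.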

<hadith id_ar="7296" id_cor="">
  <source_info>
    <vol>9</vol>
    <num>7554</num>
  </source_info>
  <isn>
    <trasm type="Had+aCaniy">muHam+adu b_nu Eabiy GaAlibI</trasm>
    <trasm type="Had+aCanaA">muHam+adu b_nu eis_maAciyla</trasm>
    <trasm type="Had+aCanaA">muc_tamirU</trasm>
    <trasm type="samic_tu">Eabiy</trasm>
    […]
    <trasm type="Ean+a">Eabiy raAficI</trasm>
    <trasm type="Had+aCahu Ean+ahu samica">Eabuw huray_raoa</trasm>
  </isn>
  <mat>yaquwlu […] camalNA</mat>
</hadith>
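A file in this format can be read back with the Python standard library. The sketch below assumes only the element and attribute names visible in the excerpt; the sample string is an abbreviated copy of it, and `read_hadith` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the excerpt above (first three transmitters only).
SAMPLE = """<hadith id_ar="7296" id_cor="">
  <source_info><vol>9</vol><num>7554</num></source_info>
  <isn>
    <trasm type="Had+aCaniy">muHam+adu b_nu Eabiy GaAlibI</trasm>
    <trasm type="Had+aCanaA">muHam+adu b_nu eis_maAciyla</trasm>
    <trasm type="Had+aCanaA">muc_tamirU</trasm>
  </isn>
  <mat>yaquwlu […] camalNA</mat>
</hadith>"""


def read_hadith(xml_text):
    """Parse one <hadith> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    # Each transmitter keeps its position in the chain, the functional
    # expression that introduces it, and its name.
    chain = [
        (position, trasm.get("type"), trasm.text)
        for position, trasm in enumerate(root.iter("trasm"), start=1)
    ]
    return {
        "id": root.get("id_ar"),
        "volume": root.findtext("source_info/vol"),
        "number": root.findtext("source_info/num"),
        "chain": chain,
        "matn": root.findtext("mat"),
    }


record = read_hadith(SAMPLE)
```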

3.3 Representation: Transmitters’ Chains and Graphs

Once HE had been applied to the ḥadīṯ corpus, a large amount of automatically extracted information was available for further investigation, most of it dealing with the extracted matn (see Section 4). Focusing instead on the isnād, an example of representation of information about transmitters is given below, with the aim of focusing on the relationships between objects rather than on the objects themselves.

We accordingly structured the extracted isnād information in the following categories: (i) the name of the transmitter; (ii) its position in the transmission chain of the single ḥadīṯ, which usually starts from the collector and arrives at the Prophet Muḥammad; (iii) the typology of transmission (see Section 3.1). These categories were read as pertaining to a simple model in which “objects” have different kinds of “relationships” among them. This model is clearly similar to those of graph theory, in which a graph is defined as a structure of nodes and edges that models relations among objects from a given collection [21]. These nodes and edges are drawn on a two-dimensional grid through specific algorithms, in order to visualize the above-mentioned relationships graphically [22]. It was therefore clear that graph theory could usefully be applied to our data in order to attempt a graphical representation of transmitters’ chains. This kind of representation aims to offer a sophisticated, quantitatively based instrument in a field of research traditionally characterized by analogical and human-based approaches [23] [11].

On the basis of the fundamental literature on graph drawing [24-25] we have conceived and implemented in Python another specific program, named ChainViewer. This application, using existing Python libraries for graph drawing: (i) reads all the information about transmitters stored in the XML file containing the data previously extracted through HE; (ii) for each ḥadīṯ, assigns the transmitters' names to nodes, the relationships between each transmitter and the previous/next one to edges (i.e. the chains), and the typology of transmission to edge types; (iii) generates graphs for single chains or joins multiple chains together in the same graph (in this case, a transmitter whose name appears twice or more is shown once but with multiple edges).
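The merging of several chains into one graph can be sketched without any drawing library. This is an illustration under stated assumptions, not ChainViewer itself: each chain is a list of (transmission type, name) pairs, an edge links each transmitter to the next one in the chain, and the edge type is taken from the expression attached to the later transmitter. The name `rAwI_X` is a made-up placeholder.

```python
from collections import defaultdict


def build_graph(chains):
    """Merge transmitter chains into a node set and a typed edge mapping.

    A name occurring in several chains becomes a single node; every distinct
    (receiver, source) pair becomes one edge carrying a set of types.
    """
    nodes = set()
    edges = defaultdict(set)
    for chain in chains:
        nodes.update(name for _, name in chain)
        # Pair each transmitter with the next one in the same chain.
        for (_, receiver), (how, source) in zip(chain, chain[1:]):
            edges[(receiver, source)].add(how)
    return nodes, dict(edges)


# First chain: taken from the XML excerpt. Second chain: hypothetical.
CHAIN_A = [
    ("Had+aCaniy", "muHam+adu b_nu Eabiy GaAlibI"),
    ("Had+aCanaA", "muHam+adu b_nu eis_maAciyla"),
    ("Had+aCanaA", "muc_tamirU"),
]
CHAIN_B = [
    ("Had+aCanaA", "rAwI_X"),
    ("Had+aCanaA", "muc_tamirU"),
]

nodes, edges = build_graph([CHAIN_A, CHAIN_B])
incoming = [pair for pair in edges if pair[1] == "muc_tamirU"]
```

Here the shared transmitter appears once as a node and is reached by two edges, which is the behaviour described in point (iii). Names are compared as raw strings, so homonyms and inflected variants are not resolved.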

At the current stage of development, ChainViewer works well with a limited number of chains only (see fig. 1), but could in principle be applied to all chains at the same time, in order to gather automatically in a single graph all the transmitters of a ḥadīṯ collection together with the complete set of their transmission relationships. This task obviously presents new problems to deal with, namely: (i) the need for a semi-automatic instrument able to disambiguate homonyms, unify various inflected forms of the same name, and identify nicknames and aliases; (ii) a specifically designed drawing algorithm that could deal with thousands of nodes and edges, and represent them dynamically with expansion/compression tools.

Fig. 1. A portion of an undirected graph of transmission chains obtained by processing 11 ḥadīṯs together (Arabic script is in modified Buckwalter transliteration)


4 Analyzing the Corpus: the Revised AraMorph analyzer

4.1 The Original AraMorph Implementation

As a starting point for the implementation of the text analysis module of the SALAH project, the morphological analyzer and lemmatizer AraMorph (AM) by Tim Buckwalter [26] was chosen. The main reasons for this choice are the simplicity of design of the model, its high performance level even in an unsupervised environment, and the ease of maintaining and extending it.3

In contrast to a long-standing linguistic and computer science tradition which emphasized the need for complex, multi-layered, “deep” morphological components in order to analyze Arabic texts properly and to account for the apparent lack of linearity of many Arabic morphemes (see examples and discussion in [26][28-30]), AM chooses to treat Arabic words (in the rather naive, but computationally efficient, sense of “any sequence of characters separated by spaces”) as elements linearly decomposable into three sub-elements: a prefix, a stem, and a suffix, the stem being the only necessary sub-element (in fact, zero prefixes and suffixes are postulated, sometimes adding some grammatical information to the stem according to the time-honoured morphological concept of “zero morpheme”).

This simple account is implemented in the simplest possible way, by feeding the system with three lookup lists of, respectively, (a) prefixes, (b) stems and (c) suffixes, together with three compatibility tables between, respectively, (d) prefixes and stems, (e) prefixes and suffixes and (f) stems and suffixes. Entries in the lookup lists are made up of four fields: (i) unvocalized and (ii) vocalized forms of the morpheme, (iii) grammatical category and (iv) English gloss; compatibility tables just list pairs of compatible morphemes, all other combinations being incompatible. Supplementary pieces of information, not employed in the analysis proper but potentially useful for glossing the texts (root and lemma for a group of morphemes), are provided in the stem lookup list in the form of pseudo-comments.

3 Buckwalter’s system has been used in many different projects, mostly in its Java implementation; it is, for instance, included as a morphological analysis tool in the Arabic WordNet Project [27].

The analysis, both in the original Buckwalter model (implemented in Perl) and in the Java implementation by the AM project, is performed through a brute-force search of every possible decomposition of words into prefixes, stems and suffixes, by looking up prefixes from 0 to 4 letters long, stems from 1 letter upwards, and suffixes from 0 to 6 letters long. Only the unvocalized form of words is taken into account (short vowels and other diacritics are stripped before lookup): candidate prefix-stem-suffix decompositions are first matched against the first fields of the respective lookup lists and discarded if any of the elements is missing; then the grammatical categories of the surviving combinations are matched against the compatibility tables and discarded if any of the combinations is not present. As a result, each word of the input text can be labeled as (i) unrecognized, if no possible analysis passes the test, or as (ii) unambiguous or (iii) ambiguous if, respectively, one or more than one analysis is licensed.
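The lookup lists, compatibility tables and brute-force search described above can be sketched in a few lines. The entries below are toy data invented for illustration (standard Buckwalter transliteration, made-up category labels) and are not taken from the actual AraMorph lexicon; only the length limits and the two-step filtering follow the description in the text.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    vocalized: str
    category: str
    gloss: str


# Toy lookup lists, keyed by unvocalized form (hypothetical entries).
PREFIXES = {"": [Entry("", "Pref-0", "")], "w": [Entry("wa", "Pref-Wa", "and")]}
STEMS = {
    "ktb": [Entry("katab", "PV", "write")],
    "ktAb": [Entry("kitAb", "N", "book")],
}
SUFFIXES = {
    "": [Entry("", "Suff-0", "")],
    "t": [Entry("tu", "PVSuff-t", "I"), Entry("ta", "PVSuff-t", "you")],
}

# Toy compatibility tables: pairs of categories allowed to combine.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "PV"), ("Pref-0", "N"), ("Pref-Wa", "N")}
PREFIX_SUFFIX = {
    ("Pref-0", "Suff-0"), ("Pref-Wa", "Suff-0"),
    ("Pref-0", "PVSuff-t"), ("Pref-Wa", "PVSuff-t"),
}
STEM_SUFFIX = {("PV", "PVSuff-t"), ("N", "Suff-0")}


def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


def analyze(word):
    """Return the vocalized forms of all analyses licensed by the tables."""
    found = []
    for pre, stem, suf in segmentations(word):
        for p in PREFIXES.get(pre, []):
            for s in STEMS.get(stem, []):
                for x in SUFFIXES.get(suf, []):
                    if ((p.category, s.category) in PREFIX_STEM
                            and (p.category, x.category) in PREFIX_SUFFIX
                            and (s.category, x.category) in STEM_SUFFIX):
                        found.append(p.vocalized + s.vocalized + x.vocalized)
    return found


def label(word):
    """Classify a word by the number of licensed analyses."""
    count = len(analyze(word))
    if count == 0:
        return "unrecognized"
    return "unambiguous" if count == 1 else "ambiguous"
```

With this toy data, `wktbt` receives two analyses that differ only in the final vowel, which illustrates the kind of ambiguity that unvocalized lookup leaves open.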

This model, whose beauty lies in its very simplicity, is a good starting point for a successful morphological analysis, but does not fit our needs, for several reasons. First, while the emphasis on the unvocalized form of the word is relatively justified for the ideal text genre targeted by Buckwalter (newspaper texts and other non-literary Modern Standard Arabic (MSA) texts, which largely make up the LDC Arabic corpus), it is far from ideal for other types of texts, first of all fully vocalized ones like ḥadīṯ corpora, but also sparsely vocalized texts: each diacritic added to the consonantal skeleton of a text reduces ambiguity, and thus a system that, like the original AM model, deliberately chooses to ignore this information accepts a higher degree of (morphological) ambiguity and automatically lets through a number of wrong analyses that would instead be ruled out by taking into account the diacritics present in the text.

The second weak point in the original model lies in the fact that the lookup lists and the related compatibility tables were built from a sample of the text corpora Buckwalter worked on: again, only morphemes attested in a subset of MSA texts, and their combinations, are included in the lookup lists and the compatibility tables, which unavoidably leads the system to reject or wrongly analyze many words attested in other textual types.

A third weakness in the original AM implementation, which is linked to the previous one, lies in the lack of any stylistic or chronological information in the lookup lists. As a result, many morphemes that are virtually exclusive to MSA texts (for instance, a non-negligible number of transliterated foreign named entities which cannot be found in Classical texts and which are relatively rare in modern literary texts as well) are included in the lists, more or less properly vocalized in order to respect the field structure of the lists themselves (although foreign proper names are never vocalized in real-world Arabic texts), and are likely to give rise to a number of false positives in the analysis of some textual genres.

4.2 Modifications to the Algorithm

In order to overcome the weaknesses listed in the previous section, a number of modifications to the original AM algorithm were devised, each tackling one of the problems detected above; the new algorithm has been dubbed “Revised AraMorph” (RAM). The first modification concerns the token identification mechanism: instead of discarding vocalization, our revised lemmatizer uses it to reduce the number of false positives by taking into account all the vowels present in the text. The comparison phase is less trivial than it might seem, since it must proceed in three stages: (i) the token is segmented into consonants and diacritics (where everything between two characters marked as consonants is a diacritic); (ii) consonants must match exactly, although some qualifications are in order to take current practice into account, e.g. an alif with hamza above or below matches a simple alif (which the original AM accounts for pragmatically, but rather inefficiently, by multiplying entries), and some more are required to reflect idiosyncrasies in the ḥadīṯ orthography; (iii) diacritics present in the token must not contradict the fully vocalized form in the lexicon (that is, e.g., missing vowels are acceptable, but a vowel cannot match a different one).4
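The three-stage comparison can be sketched as follows. This is a simplified reading of the description, not the RAM code: it uses standard Buckwalter symbols for diacritics, normalizes only alif with hamza above or below, and treats "does not contradict" as "the diacritics on each token consonant are a subset of those on the corresponding lexicon consonant". The function names are hypothetical.

```python
# Standard Buckwalter diacritic symbols: short vowels, sukun, shadda, tanwin.
DIACRITICS = set("aiuo~FNK")

# Alif with hamza above (>) or below (<) is treated as a plain alif (A).
ALIF_VARIANTS = {">": "A", "<": "A"}


def split_token(token):
    """Stage (i): segment a token into (consonant, diacritics) units."""
    units = []
    for ch in token:
        if ch in DIACRITICS:
            if units:  # a diacritic attaches to the preceding consonant
                units[-1][1] += ch
        else:
            units.append([ALIF_VARIANTS.get(ch, ch), ""])
    return [(consonant, marks) for consonant, marks in units]


def compatible(token, lexicon_form):
    """Stages (ii) and (iii): exact consonant match, non-contradicting diacritics."""
    tok = split_token(token)
    lex = split_token(lexicon_form)
    # Stage (ii): the consonantal skeletons must be identical.
    if [c for c, _ in tok] != [c for c, _ in lex]:
        return False
    # Stage (iii): missing diacritics are fine, different ones are not.
    return all(set(t_marks) <= set(l_marks)
               for (_, t_marks), (_, l_marks) in zip(tok, lex))


candidates = ["katabtu", "katabta"]
partly_vocalized = [form for form in candidates if compatible("ktbtu", form)]
unvocalized = [form for form in candidates if compatible("ktbt", form)]
```

A single final vowel in the token is enough to discard one of the two candidates, whereas the fully unvocalized token keeps both. Under this reading a token diacritic that the lexicon form omits also counts as a contradiction, which may be stricter than intended.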

To tackle the second weak point, namely the partial and unbalanced coverage of the Arabic lexicon in the original AM implementation, a file with additional stems automatically extracted from Anthony Salmoné's Arabic-English dictionary [31] (a work from the end of the 19th century encoded in TEI-compliant XML format within the Perseus project) was added to the system. Moreover, an analysis of the most frequent types of unrecognized tokens allowed us to add a limited, but important, number of additional lists of prefixes and suffixes together with the relevant combination tables. The single most important addition was the set of prefixes, suffixes and combinatory rules for verb imperatives, a category entirely missing from the original AM implementation (perhaps on purpose, since Arabic imperatives are morphologically complex and quite rare in newspaper texts) and relatively frequent in ḥadīṯ texts, given the abundance of prescriptions and performative contexts in the latter.

To mitigate the third problem detected, namely the lack of genre and style distinctions in the AM lexicon, we experimented with automatically removing items that are likely to correspond to contemporary foreign named entities, especially proper names and place-names.5 In order to do so, we first extracted a list of potential named entities by exploiting a suggestion by Tim Buckwalter himself, namely that a gloss starting with an uppercase letter is a named entity in 99% of cases; we then performed a full-text search for each word in Salmoné’s dictionary and retained only the words found there. This way, we are likely to exclude most contemporary foreign named entities while retaining Arabic proper names and place-names which can be found in Classical texts and which are often (but unfortunately not always) included by Salmoné.
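The two-step filter just described can be sketched as follows. This is only an illustration under stated assumptions: the lexicon is modelled as a list of (stem, gloss) pairs, the full-text search is approximated by membership in the set of whitespace-separated words of the reference dictionary, and both the function name `filter_contemporary_names` and the toy entries are hypothetical, not actual AM or Salmoné data.

```python
def filter_contemporary_names(lexicon, reference_text):
    """Drop likely contemporary named entities from a (stem, gloss) lexicon.

    An entry is a named-entity candidate when its gloss starts with an
    uppercase letter; a candidate is kept only if its stem occurs in the
    reference dictionary text. Non-candidates are always kept.
    """
    # Assumption: a word-level lookup stands in for the full-text search.
    reference_words = set(reference_text.split())
    kept = []
    for stem, gloss in lexicon:
        is_candidate = gloss[:1].isupper()
        if is_candidate and stem not in reference_words:
            continue  # candidate not attested in the reference dictionary
        kept.append((stem, gloss))
    return kept


# Made-up toy data in a rough Latin transliteration.
toy_lexicon = [
    ("makkap", "Mecca"),  # capitalized gloss, attested in the reference text
    ("sygmA", "Sigma"),   # capitalized gloss, not attested: removed
    ("kitAb", "book"),    # ordinary noun, never a candidate
]
toy_reference = "makkap kitAb qalam"

print(filter_contemporary_names(toy_lexicon, toy_reference))
```

As footnote 5 points out, such a filter cannot separate a foreign name from a homographic Arabic word, so a name whose transcription coincides with an attested word is retained.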

5 Results and Evaluation

Results of both HE and RAM have been evaluated according to standard practice [32] by dividing the corpus into a training (95%) and a testing (5%) section; the testing section has been kept relatively small in consideration of the homogeneity of the corpus and the necessity to manually annotate the test sentences. At the present stage of development, the results obtained through both modules are very good, though varied. The total number of processed ḥadīṯs was 7305, and the segmentation produced outputs for 7135 of them, an effectiveness rate of 97.7%. A manual screening of the testing ḥadīṯ sample showed an error rate of 7.7%, of which 6.8% are false negatives and 0.9% false positives.

4 Some parameters were introduced in the experiments to test the impact of full vocalization in ḥadīṯ texts: since the texts were fully vocalized, it is meaningful not to allow additional vowels nor a tašdīd (reduplication) symbol if not present in the text. Both the original algorithm and our implementation were rewritten in Python 2.6 in order to profit from other existing tools and to be able to process Arabic texts directly in Unicode format (even if this is not expedient in some cases, see also Section 3.2).

5 The original AraMorph implementation, true to its newspaper-based spirit (in this case, the source was the AFP corpus), included decidedly contemporary items such as the Arabized versions of the names of the Belgian tennis player Sabine Appelmans or the Czech soccer team Sigma Olomouc. In some cases, confusion with Arabic words is in fact possible, especially if we take into account the fact that nothing like capitalization is available in the Arabic script: for instance, the transcription of the English first name ‘John’ (juwn) is indistinguishable from jūn ‘inlet, bay’.

Table 1. Summary of HE results

segmentation
  effective     97.7%
  wrong          2.3%
testing of segmented data
  error rate     7.7%
  precision     99.2%
  recall        93.1%
  F measure     96.1%

The accuracy is going to be improved mainly by refining the above-mentioned operators, and secondly by increasing human control over outputs where automatic recognition is still impeded.

As to RAM, the system was applied only to the matn text successfully segmented by HE. We obtained a corpus that gathers only the matn sections and consists of 382,700 words. We then applied the original AM analyzer to get a preliminary system output; the results were then compared to the output of the RAM analyzer with different parameter settings.

Table 2. Summary of RAM results

                             original AM   RAM with       RAM with        RAM with contemporary
                                           vocalization   added entries   NEs removed
recognition
  unanalyzed                 10.36%        12.55%         7.23%           8.12%
  univocal                   29.45%        58.98%         62.52%          67.79%
  ambiguous                  60.19%        28.47%         30.25%          24.09%
testing of segmented data
  error rate                 60.54%        32.77%         27.65%          24.58%
  precision                  64.90%        74.57%         81.47%          83.37%
  recall                     74.56%        92.66%         90.88%          92.05%
  F measure                  69.40%        82.64%         85.92%          87.50%

The RAM system with vocalization fares far better than the original AM in univocal token recognition, even if the rate of unanalyzed tokens is slightly higher. In fact, this difference is equivocal, since AM gets a lower rate of unanalyzed tokens at the price of a higher number of false positives (which RAM discriminates through vocalization).
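The summary figures in Tables 1 and 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the reported F measure is the balanced harmonic mean of precision and recall, which the text does not state explicitly; the function name `f_measure` and the dictionary layout are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F measure), all in percent, from Tables 1 and 2.
reported = {
    "HE": (99.2, 93.1, 96.1),
    "original AM": (64.90, 74.56, 69.40),
    "RAM with vocalization": (74.57, 92.66, 82.64),
    "RAM with added entries": (81.47, 90.88, 85.92),
    "RAM with contemporary NEs removed": (83.37, 92.05, 87.50),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {f:.2f}")

# Segmentation effectiveness: share of processed texts with an output.
effectiveness = 100 * 7135 / 7305

# Recognition rates (unanalyzed, univocal, ambiguous) per system in Table 2.
recognition = {
    "original AM": (10.36, 29.45, 60.19),
    "RAM with vocalization": (12.55, 58.98, 28.47),
    "RAM with added entries": (7.23, 62.52, 30.25),
    "RAM with contemporary NEs removed": (8.12, 67.79, 24.09),
}
```

Under this assumption the recomputed values agree with the reported ones up to rounding, and the three recognition categories of each system add up to 100%.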


6 Further Research

Both HE and RAM can be seen as starting points for future research. HE can be extended and generalized to other domains within Classical Arabic culture where texts are arranged according to semi-formal criteria: genealogical repertories, specialized dictionaries, definition lexica (as opposed to lexical encyclopedias). RAM can be further extended to cope with a larger domain of textual genres, especially if coupled with some reasonably well-performing system of Arabic named entity recognition. As shown by the flowchart in fig. 2 and alluded to in the Introduction, this might well feed other, higher-level systems of text analysis and information retrieval.

Fig. 2. The SALAH process flowchart

References

1. Wensinck, A.J., et al.: Concordance et indices de la tradition musulmane: Les six Livres, le Musnad d'al-Dārimī, le Muwaṭṭa de Malik, le Musnad de Aḥmad ibn Ḥanbal. Brill, Leiden (1933-)
2. Brown, J.: The Canonization of al-Bukhārī and Muslim: the Formation and Function of the Sunnī Ḥadīṯ Canon. Brill, Leiden (2007)
3. al-Bukhārī, M. b. I.: Ṣaḥīḥ al-Bukhārī. Dār Ṭūq al-Najāh, Riyaḍ (1990)
4. Gibbon, D., Moore, R., Winski, R. (eds.): Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin (1997)
5. Jackson, P., Moulinier, I.: Natural Language Processing for Online Applications: Text Retrieval, Extraction & Categorization. John Benjamins, Amsterdam (2002)
6. Abu El-Khair, I.: Arabic Information Retrieval. Annual Review of Information Science and Technology, 41(1), 505-533 (2007)
7. Wolska, M., Kruijff-Korbayová, I.: Analysis of mixed natural and symbolic language input in mathematical dialogs. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Stroudsburg, Association for Computational Linguistics (2004)
8. Bird, S., Klein, E.: Regular Expressions for Natural Language Processing. University of Pennsylvania (2006)
9. Robson, J.: Standard applied by Muslim traditionists. Bulletin of the John Rylands Library, 63 (1961)
10. Juynboll, G. H. A.: Encyclopedia of Canonical Hadith. Brill, Leiden (2007)
11. Sezgin, F.: Geschichte des Arabischen Schrifttums. 1, Brill, Leiden (1967)
12. Robson, J.: Ḥadīth. In: Encyclopaedia of Islam. Vol. 3, pp. 23–28. Brill, Leiden (1978)
13. Günther, S.: Assessing the Sources of Classical Arabic Compilations: The Issue of Categories and Methodologies. British Journal of Middle Eastern Studies, 32:1, 75–98 (2005)
14. van Rossum, G.: An Introduction to Python for UNIX/C Programmers. Proceedings of the NLUUG najaarsconferentie (1993)
15. Buckwalter, T.: Buckwalter Arabic transliteration (undated) [available at http://qamus.org/transliteration.htm]
16. Lancioni, G.: An Adaptation of Buckwalter Transcription Model to XML and Regular Expression Syntax. Technical report, Roma Tre University, r3a (2011)
17. Aho, A. V.: Algorithms for finding patterns in strings. In: van Leeuwen, J. (ed.): Handbook of Theoretical Computer Science, volume A: Algorithms and Complexity. The MIT Press, 255–300 (1990)
18. Goyvaerts, J., Levithan, S.: Regular Expressions Cookbook. O'Reilly, Sebastopol (2009)
19. Khan, M.M.: The English Translation of Sahih Al Bukhari. Alexandria, Al-Saadawi Publications (1984)
20. Al-‘Asqalānī, A.: Fatḥ al-bārī bi-sharḥ Ṣaḥīḥ al-Bukhārī. Bayrūt, Dār al-Ma‘rifah (1959)
21. Berge, C.: Théorie des graphes et ses applications. Collection Universitaire de Mathématiques, vol. II. Dunod, Paris (1958)
22. Bondy, J.A., Murty, U.S.R.: Graph Theory. Springer, Heidelberg (2008)
23. Fück, J.: Beiträge zur Überlieferungsgeschichte von Buḫāris Traditionssammlung. Zeitschrift der Deutschen Morgenländischen Gesellschaft, 60–87 (1938)
24. Di Battista, G., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, Upper Saddle River (1999)
25. Kaufmann, M., Wagner, D.: Drawing Graphs: Methods and Models. Springer, Heidelberg (2001)
26. Buckwalter, T.: Buckwalter Arabic Morphological Analyzer Version 1.0. Philadelphia, Linguistic Data Consortium (2002)
27. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.: Introducing the Arabic WordNet Project. In: Sojka, Choi, Fellbaum, Vossen (eds.): Proceedings of the Third International WordNet Conference (2006)
28. Al-Sughaiyer, I.A., Al-Kharashi, I.A.: Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3), 189–213 (2004)
29. Bebah, M., Belahbib, R., Boudlal, A., Lakhouaja, A., Mazroui, A., Meziane, A.: A Markovian Approach for Arabic Root Extraction. The International Arab Journal of Information Technology, Vol. 8, No. 1 (2011)
30. Habash, N., Rambow, O.: Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. Proceedings of the 43rd Annual Meeting of the ACL, pp. 573–580, Ann Arbor (2005)
31. Salmoné, H.A.: An Advanced Learner's Arabic-English Dictionary. Librairie du Liban, Beirut (1889)
32. van Rijsbergen, C. J.: Information Retrieval. Butterworths, London (1979)
