
Information Technology thesis

2020-03-20

DESIGN AND DEVELOP SENTENCE

PARSER FOR AFAN OROMO

LANGUAGE USING TOP-DOWN

CHART PARSING ALGORITHM

Beshada, Hailu

http://hdl.handle.net/123456789/10765


BAHIR DAR UNIVERSITY

BAHIR DAR INSTITUTE OF TECHNOLOGY

SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES

FACULTY OF COMPUTING

DESIGN AND DEVELOP SENTENCE PARSER FOR AFAN

OROMO LANGUAGE USING TOP-DOWN CHART PARSING

ALGORITHM

Hailu Beshada Balcha

Bahir Dar, Ethiopia

May 2017

DESIGN AND DEVELOP SENTENCE PARSER FOR AFAN OROMO

LANGUAGE USING TOP-DOWN CHART PARSING ALGORITHM

A Thesis submitted to the School of Research and Graduate Studies of Bahir Dar

Institute of Technology, Bahir Dar University in partial fulfillment of the

requirements for the degree of Master of Science in the Information Technology in

the Faculty of Computing

Supervised by: Tesfa Tegegne (PhD)

Bahir Dar, Ethiopia

May 2017

DECLARATION

I, the undersigned, declare that this thesis comprises my own work. In compliance with internationally accepted practices, I have acknowledged and referenced all materials used in this work. I understand that non-adherence to the principles of academic honesty and integrity, or misrepresentation/fabrication of any idea/data/fact/source, will constitute sufficient ground for disciplinary action by the University and can also evoke penal action from the sources which have not been properly cited or acknowledged.

Name of the student: Hailu Beshada Balcha Signature _____________

Date of submission: May 24, 2017

Place: Bahir Dar, Ethiopia

This thesis has been submitted for examination with my approval as a university

advisor.

Advisor Name: Tesfa Tegegne (PhD)

Advisor’s Signature: __________________

Bahir Dar University

Bahir Dar Institute of Technology

School of Research and Graduate Studies

Faculty of Computing

THESIS APPROVAL SHEET

Student:

_____________________________________________________________________

Name Signature Date

The following graduate faculty members certify that this student has successfully

presented the necessary written final thesis and oral presentation for partial fulfillment

of the thesis requirements for the Degree of Master of Science in Information

Technology

Approved by:

Advisor: _____________________________________________________________________

Name Signature Date

External Examiner:

___________________________________________________________________

Name Signature Date

Internal Examiner:

_______________________________________________

Name Signature Date

Chair Holder:

_____________________________________________________________________

Name Signature Date

Faculty Dean:

_____________________________________________________________________

Name Signature Date

DEDICATION

This thesis is dedicated to my mother, Afrasa Robi, and my two sisters, Tadelech Beshada and Nigatuwa Beshada, who made me the person I am today without having attended school themselves, and to my lovely wife, Birtukan Sahile, whom I married at the end of my master’s classes and the start of my thesis work.

ACKNOWLEDGEMENTS

Above all, I would like to thank the almighty and omnipresent God for giving me strength from the beginning to the end of this research work. “Yaa Uumaa koo galanin siif haa ta’u! Amiin!”. Next, it is a pleasure to thank the many people who made this thesis possible. I gratefully acknowledge the supervision of my advisor, Tesfa Tegegne (PhD), and his abundant help and constructive comments. My great thanks also go to Gebeyehu Beyene (PhD) for his constructive comments on the draft report of this work. I also thank Mr. Jabesa Daba and Mr. Kasahun Abdisa from Wollega University, who provided me with important reading materials and constructive suggestions that helped me greatly in my thesis work. It is an honor to express my special appreciation to my colleagues for their collaboration in giving me ideas, directions, and comments, and for their encouragement. It is also a pleasure to thank all those who helped me in different ways while I was working on this study.

Last, but absolutely not least, I want to thank my lovely wife and family, whose love and guidance are with me in whatever I pursue.

ABSTRACT

Many sentence parsers have previously been developed for foreign languages such as English and Arabic and, among the local languages of Ethiopia, for Amharic. Parsing Afan Oromo sentences is likewise a necessary mechanism for other natural language processing applications such as machine translation, question answering, knowledge extraction, and information retrieval. We have therefore developed a rule-based parser that uses a top-down chart parsing algorithm for Afan Oromo sentences, covering both simple and complex sentences. Context Free Grammar (CFG) was used to represent the grammar of the language. A sample corpus of 500 sentences was prepared, and the CFG was extracted manually from the tagged sample corpus. We also developed a simple lexicon generator algorithm to generate the lexical rules automatically. The Python programming language and NLTK were used as implementation tools for this study. The parser was then evaluated experimentally. It was trained on a training dataset of 400 sentences, on which it achieved an accuracy of 98.25%, and tested on a test dataset of 100 sentences, on which it achieved an accuracy of 91%. These experimental results are encouraging, since this is the first work covering both simple and complex sentences of the Afan Oromo language. Finally, the conclusion and recommendations for future work are reported in the last chapter.

Keywords: NLP, Parser, context free grammar, top-down chart parser, lexicon

generator, lexicon.

TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF TABLES
ABBREVIATIONS AND ACRONYMS

CHAPTER ONE
1 INTRODUCTION
1.1 Background
1.2 Statement of the Problem
1.3 Objectives
1.3.1 General objective
1.3.2 Specific objectives
1.4 Methodology
1.4.1 Literature Review
1.4.2 Data Collection
1.4.3 Tools and Techniques
1.4.4 Evaluation
1.5 Scope and Limitation
1.6 Significance of the study
1.7 Organization of the thesis

CHAPTER TWO
2 LITERATURE REVIEW
2.1 Introduction
2.2 Works so far on sentence parser
2.2.1 Local works on sentence parser
2.2.2 Global works on sentence parser
2.3 Grammar Formalism
2.3.1 Context Free Grammar
2.3.2 Context Sensitive Grammar
2.3.3 Transition Network Grammar
2.3.4 Unification Based Grammar
2.3.5 Probabilistic Context Free Grammar
2.4 Sentence Parsing Approaches
2.4.1 Stochastic Approaches
2.4.2 Rule-based Approaches
2.5 Afan Oromo Grammar
2.5.1 Word order
2.5.2 Word Categories
2.5.3 Phrases Categories
2.6 Afan Oromo Sentences
2.6.1 Simple Sentences
2.6.2 Complex Sentences
2.7 Summary

CHAPTER THREE
3 DESIGN OF AFAN OROMO SENTENCE PARSER
3.1 Components of Afan Oromo Sentence Parser (AOSP)
3.2 Context Free Grammar (CFG)
3.3 Sentence Tokenizer
3.4 Lexicon Generator
3.5 AOSP Chart Parser
3.6 Summary

CHAPTER FOUR
4 IMPLEMENTATION RESULTS AND DISCUSSION
4.1 Development Environment
4.2 Corpus Preparation
4.3 Grammar Rules Extraction
4.4 Generating Lexical Rules
4.5 Implementation of Chart Parser
4.6 Evaluations
4.6.1 Evaluation of Lexical Generator
4.6.2 Evaluation of AOSP Chart Parser
4.7 Discussion

CHAPTER FIVE
5 CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations

REFERENCES

APPENDICES
Appendix 1: Part of Speech Tags by Abraham (used in this study)
Appendix 2: Sample Context Free Grammar Extracted from corpus
Appendix 3: Sample Lexical Rules Generated by the Lexicon Generator
Appendix 4: Sample parsed sentences by the parser

LIST OF FIGURES

Figure 3.1: Architecture of Sentence Parser for Afan Oromo Language
Figure 4.1: Screenshot of Lexical Rules generated by the Lexicon Generator
Figure 4.2: Screenshot of parsed imperative sentence
Figure 4.3: Screenshot of parsed exclamatory sentence
Figure 4.4: Screenshot of parsed declarative sentence
Figure 4.5: Screenshot of parsed interrogative sentence
Figure 4.6: Screenshot of parsed complex sentence

LIST OF ALGORITHMS

Algorithm 4.1: Lexical Generator Algorithm
Algorithm 4.2: Top Down Chart Parsing Algorithm for Afan Oromo Sentences

LIST OF TABLES

Table 4.1: Tag Name of Afan Oromo Phrases
Table 4.2: Parsing result on training set before making a number of error corrections
Table 4.3: Parsing result on training set after making most of the error corrections
Table 4.4: Number of correctly parsed Imperative Sentences
Table 4.5: Number of correctly parsed Exclamatory Sentences
Table 4.6: Number of correctly parsed Declarative Sentences
Table 4.7: Number of correctly parsed Interrogative Sentences
Table 4.8: Number of correctly parsed Complex Sentences

ABBREVIATIONS AND ACRONYMS

ADP Adverbial Phrase

AOSP Afan Oromo Sentence Parser

APCP Adpositional Phrase

ATB Arabic Tree Bank

ATN Augmented Transition Network

CFG Context Free Grammar

CKY Cocke Kasami Younger

CNF Chomsky Normal Form

CSG Context Sensitive Grammar

CTB Chinese Tree Bank

FSM Finite State Machine

HMM Hidden Markov Model

IE Information Extraction

IR Information Retrieval

JJP Adjectival Phrase

LHS Left Hand Side

NLP Natural Language Processing

NP Noun Phrase

PCFG Probabilistic Context Free Grammar

POS Part of Speech

QA Question Answering

RHS Right Hand Side

SOV Subject-Object-Verb

TNG Transition Network Grammar

UBG Unification Based Grammar

VP Verb Phrase

CHAPTER ONE

1 INTRODUCTION

1.1 Background

Language is one of the fundamental aspects of human behavior, and it constitutes a crucial component of our lives. A natural language is a language that is spoken by people. According to Abdi [1], natural language processing (NLP) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. NLP can also be defined as the automatic or semi-automatic processing of human language [2]. It involves several levels of processing, namely tokenization, lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis. Among these, our focus is on syntactic analysis (parsing), which provides the order and structure of each sentence in a text. Natural language processing systems take strings of words (sentences) as their input and produce structured representations capturing the meaning of those strings as their output. The nature of this output depends heavily on the task at hand. “In the context of natural language processing, the process of assigning structural descriptions to sequences of words is called parsing” [3].

Parsing is the process of analyzing a sentence by taking each word and determining the linguistic structure of the sentence from its constituent parts. The parsing process makes use of two components: a parser and a grammar. The parser is the procedural component, a computer program, whereas the grammar is the declarative component. Both the grammar and the parser depend on the grammar formalism. The term parser is used in cases where the sentences are made up of information units of any kind, and a parser therefore also deals with a number of subproblems, such as identifying constituents that can fit together and testing the compatibility of a number [4]. Sentence parsing is one of the steps in designing a functional NLP application; it can work in cooperation with, and as input to, many other NLP applications such as grammar checkers and machine translation. It is also called syntax analysis [5]: the process of identifying how words can be put together to form a correct sentence, and of determining what structural role each word plays in the sentence, which phrases are subparts of which other phrases, which words modify which other words, and which words are central to the sentence as a whole. Parsing thus plays an important role in the semantic processing of sentence constituents. If there is no syntactic parsing step, the semantic system must decide on its own constituents [4]. If parsing is done, however, it constrains the number of constituents that semantics can consider. The focus of this study is to develop a sentence parser for the Afan Oromo language using a top-down chart parsing algorithm, with the Context Free Grammar (CFG) formalism used to represent Afan Oromo grammar rules.

A chart parser combines the advantages of the top-down and bottom-up approaches; the main objective of chart parsing is to improve the efficiency of the parser by doing so. According to Jason [6], in chart parsing the process of parsing an n-word sentence consists of forming a chart with n + 1 vertices and adding edges to the chart one at a time. There is no backtracking: everything that is put in the chart stays there, and the chart contains all the information needed to create a parse tree. A chart parser is driven by an agenda of completed constituents and by arc extension, which combines active arcs with constituents when they are added to the chart [7]. The technique of extending arcs with constituents can be applied in both the bottom-up and the top-down approach; the difference lies in how new arcs are generated from the grammar. In the bottom-up approach, new active arcs are generated whenever a completed constituent is added that could be the first constituent of the right-hand side of a rule, whereas in the top-down approach, new active arcs are generated whenever a new active arc is added to the chart [4][7]. For this reason, Abdurheman [4] states that the number of constituents generated by a top-down chart parser is less than the number generated by a bottom-up chart parser. Therefore, the top-down chart parser is considerably more efficient for any reasonable grammar.
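To make the chart described above concrete, the following is a minimal sketch of a top-down (Earley-style) chart recognizer in plain Python. It is not the parser implemented in this thesis, which is built with NLTK. The toy grammar, the tags N and V, the example sentence, and the names Edge and top_down_chart_recognize are all assumptions introduced only for illustration.

```python
from collections import namedtuple

# A dotted rule that starts at chart vertex `start`; it is stored in the
# list of the vertex where it currently ends (an "edge" or "arc").
Edge = namedtuple("Edge", "lhs rhs dot start")

# Toy SOV grammar and lexicon (illustrative assumptions, not thesis data).
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("NP", "V"), ("V",)],
    "NP": [("N",)],
}
LEXICON = {"Tolaan": "N", "kitaaba": "N", "bite": "V"}


def top_down_chart_recognize(words, start="S"):
    """Return (accepted, chart); chart[i] holds the edges ending at vertex i."""
    n = len(words)
    chart = [[] for _ in range(n + 1)]  # n words give n + 1 vertices

    def add(edge, end):
        # Edges are only ever added, never removed (no backtracking).
        if edge not in chart[end]:
            chart[end].append(edge)

    for rhs in GRAMMAR[start]:
        add(Edge(start, rhs, 0, 0), 0)

    for end in range(n + 1):
        for edge in chart[end]:  # the list may grow while we iterate
            if edge.dot < len(edge.rhs):
                needed = edge.rhs[edge.dot]
                if needed in GRAMMAR:
                    # Top-down step: a new active edge triggers new
                    # active edges for the category it is waiting for.
                    for rhs in GRAMMAR[needed]:
                        add(Edge(needed, rhs, 0, end), end)
                elif end < n and LEXICON.get(words[end]) == needed:
                    # The next word has the needed lexical category.
                    add(edge._replace(dot=edge.dot + 1), end + 1)
            else:
                # Arc extension: a completed constituent advances every
                # active edge that was waiting for it.
                for waiting in chart[edge.start]:
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == edge.lhs):
                        add(waiting._replace(dot=waiting.dot + 1), end)

    accepted = any(
        e.lhs == start and e.dot == len(e.rhs) and e.start == 0
        for e in chart[n]
    )
    return accepted, chart
```

Calling top_down_chart_recognize(["Tolaan", "kitaaba", "bite"]) builds a chart with four vertices for the three-word input. The sketch only recognizes sentences; recovering parse trees would additionally require each edge to record the completed constituents it was built from.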

In today’s world, the amount of accessible electronic information has exploded. Owing to the rapid expansion of the Internet and its use for communication and dissemination of information throughout the world, electronic information sources are now available in an ever-increasing number of languages. As Jabesa [8] mentions in his work, users of such globally distributed networks (including digital libraries and the World Wide Web) need to be able to access and retrieve any relevant information in whatever language and form it may have been recorded and stored. However, according to Abebe [9], most developing countries have no systematic programs for the collection, analysis, and dissemination of available information to potential users. One of the barriers is the absence of a full-fledged online machine translation system that can translate texts from a foreign language to a local one, for example from English to Afan Oromo. Machine translation systems, which require a parser as a component, are therefore of paramount importance for the delivery of electronic resources, and the need for NLP systems such as a sentence parser for Afan Oromo is unquestionable. Afan Oromo is the official language of Oromia National Regional State. It is used in offices, schools, colleges, universities, and the media. The availability of a huge amount of electronic and non-electronic data thus motivated us to develop an NLP application. “For computational linguists, parsing corresponds to produce some sort of a structure that fits and confirms a particular theory of syntax or language in general” [10]. We view parsers as standard NLP tools that are not a final goal in themselves but that should help to improve other applications and serve many tasks. We are therefore motivated to develop an Afan Oromo sentence parser using the top-down chart parsing approach.

1.2 Statement of the Problem

Afan Oromo is one of the major languages widely spoken in Ethiopia. Currently, it is the official language of the regional state of Oromia (the largest regional state in Ethiopia), where it is used as a working language in offices and as the medium of instruction in primary and junior-secondary schools, and is taught as a subject in secondary schools (grades 9-12). As Mandafro reports in his work [11], at the country level, 8 of Ethiopia’s public universities offer degree programs majoring in Afan Oromo, and Addis Ababa University offers Afan Oromo at Master’s degree level. Amharic, another major language and a working language of Ethiopia, belongs to the Semitic family of languages, whereas Afan Oromo is part of the Lowland East Cushitic group within the Cushitic family of the Afro-Asiatic phylum. According to Abebe [9], Afan Oromo is spoken not only in Ethiopia but also in Somalia, Kenya, Uganda, Tanzania, and Djibouti. Although Afan Oromo is today spoken by such a large number of people, few advances have been made in computational linguistics or natural language processing for the language. “Computational approaches to linguistic analysis of Afan Oromo so far have been hindered due to non-availability of well-studied linguistic resources” [12]. Because Afan Oromo is the official language of Oromia National Regional State, as mentioned above, and is used in offices, schools, colleges, universities, and the media, various written materials are now being published both electronically and non-electronically. This has created interest in NLP research on the language. For instance, morphological synthesizer [9], spell checker [13], grammar checker [14], part of speech tagging [15][16][12], named entity recognition [1], news text summarization [17], machine translation [8], word sense disambiguation [18], question answering [19], text retrieval [20], and search engines [21] are among the NLP applications that require a sentence parser for successful and full-fledged implementation. In addition, a sentence parser is a useful NLP application in the teaching and learning process, for identifying phrases and for understanding word relations in Afan Oromo sentences. It is also an important NLP tool that serves as an intermediate component for different higher-level applications such as machine translation.

On the other hand, as mentioned in the section above, the Internet is one of the main sources of information. The enormous amount of information on the Internet could be used to enhance development by making it accessible to the public. To fully localize and utilize the resources available on the Internet, translation of documents from one language to another may be necessary. For example, many documents on the Internet are written in English; because of this, English to Afan Oromo translation, and vice versa, may be required in syntax-based machine translation [22]. Moreover, according to [23], parsers have become efficient and accurate enough to be useful in many natural language processing systems, most notably in machine translation. Machine translation that takes Afan Oromo sentences as input and uses a sentence parser as a component therefore plays a great role in solving the translation problem. For this reason, we proposed to develop a sentence parser for the Afan Oromo language.

To this end, the researcher went through the literature to find out whether there is any sentence parser that can parse both simple and complex sentences in Afan Oromo. To the best of the researcher’s knowledge, there is no Afan Oromo sentence parser for both simple and complex sentences. However, there is one attempt by [5] at an automatic sentence parser for the Afan Oromo language, which uses a supervised learning technique for simple declarative Afan Oromo sentences. That study used the chart algorithm. In addition, an unsupervised learning algorithm was designed to guide the parser in predicting unknown and ambiguous words in a sentence. It also adopted an intelligent (rule-based learning module) approach to develop a prototype. The result obtained was 95% on the training dataset and 88.5% on the test dataset. The parser was developed purely on the basis of an intelligent system approach (a hybrid of rule-based and supervised learning), and it did not include a tagger, which could have been used as a preprocessor to the parser. It was developed only for simple declarative sentences of the Afan Oromo language. For this reason, the researcher is motivated to develop a parser for both simple and complex Afan Oromo sentences. The focus of this study is therefore on designing and developing a sentence parser for Afan Oromo text that includes both simple and complex sentences. The parser will be of major significance to users of the language. Moreover, because the nature and structure of sentence parsing (syntactic parsing) in Afan Oromo differ from those of English, Amharic, and other languages, sentence parsers developed for those languages would not be functional for Afan Oromo. This is because each language has a different syntactic and morphological nature, with its own grammar and word formation techniques. As a result, an independent sentence parser is needed for Afan Oromo. We therefore decided to develop a sentence parser for Afan Oromo simple and complex sentences using the top-down chart parsing algorithm.

Based on the above justification, this study attempts to answer the following questions:

- What are the properties and word orders of the Afan Oromo language?

- Is it possible to use sentence parsers developed for other languages for the Afan Oromo language?

- Does adopting parsing algorithms from other languages work for the Afan Oromo language?

1.3 Objectives

1.3.1 General objective

The general objective of this research is to design a sentence parser for the Afan Oromo language using a top-down chart parsing algorithm.

1.3.2 Specific objectives

In order to achieve the general objective of this research, the following specific objectives are formulated:

- To identify the properties of Afan Oromo sentences that are useful for sentence parsing, based on the knowledge base of the language: the basic word order, word categories, morphological properties, phrase structure, and sentence types.

- To select sample sentences that can serve for the experiment.

- To extract appropriate grammar rules to represent the structure of Afan Oromo sentences.

- To design a general architecture for the Afan Oromo parser.

- To develop a simple algorithm for a lexical generator that automatically generates lexical rules from the sample corpus.

- To select and customize an appropriate parsing algorithm for the Afan Oromo sentence parser.

- To evaluate the performance of the parser.

1.4 Methodology

Developing a sentence parser for the Afan Oromo language requires exploring the characteristics of the language and the different approaches that can be used for development. The following methods were used to achieve the general and specific objectives of this thesis work.

1.4.1 Literature Review

Sentence parsers previously developed for Afan Oromo and other languages were reviewed to understand the techniques by which a sentence parser works. Related literature, including research papers, books, previous related research work, and electronic materials on the web, was reviewed to gain better knowledge of the phrase structure of the Afan Oromo language and to become aware of the strategies and techniques for parsing a sentence and for passing a sentence to a parser. This study employs the rule-based approach to develop an Afan Oromo sentence parser for both simple and complex sentences. This approach was selected because some argue that parsers developed using rule-based approaches require less storage and are ten times faster than those developed using stochastic approaches [5]. The details of the approaches are presented in Chapter 2.

1.4.2 Data Collection

500 Afan Oromo sentences of both simple and complex types were collected from Afan Oromo text sources such as the Afan Oromo grammar book (Seer-luga Afan Oromo), previous thesis papers, and other materials written in the language. Of the sample corpus, around 40 sentences are from [5], 300 sentences are from Seer-luga Afan Oromo, and the remaining 160 sentences are from other written materials. The corpus was then given to a linguist in order to get feedback on the correctness of the manual parses and of the manually extracted context free grammar rules.

1.4.3 Tools and Techniques

The general architecture of the Afan Oromo sentence parser was designed first. A parsing algorithm was then selected and customized to parse Afan Oromo sentences using the grammar rules of the language and lexical rules generated automatically from the collected sample corpus. The Python programming language and NLTK were used as implementation tools for this study.
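As a rough illustration of how lexical rules might be generated automatically from a tagged sample corpus, the following Python sketch collects one rule per word category. The `word/TAG` input format, the tag names (`NN`, `PART`, `VB`) and the function name `generate_lexical_rules` are assumptions made for this example; they are not taken from the thesis.

```python
from collections import defaultdict

def generate_lexical_rules(tagged_sentences):
    """Build one lexical rule per tag, e.g. NN -> 'w1' | 'w2'."""
    lexicon = defaultdict(set)
    for sentence in tagged_sentences:
        for token in sentence.split():
            # Split on the last slash so the word itself may contain '/'.
            word, _, tag = token.rpartition("/")
            lexicon[tag].add(word)
    rules = []
    for tag in sorted(lexicon):
        words = " | ".join("'{}'".format(w) for w in sorted(lexicon[tag]))
        rules.append("{} -> {}".format(tag, words))
    return rules

# Hypothetical tagged sample (tags are illustrative only).
sample = ["Tolosaan/NN ni/PART faarfata/VB"]
rules = generate_lexical_rules(sample)
for rule in rules:
    print(rule)
```

Rules written in this textual form could then be appended to the hand-written phrase structure rules before the grammar is loaded by the parser.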

1.4.4 Evaluation

The parser was evaluated in two phases: first on the training dataset and then on the test dataset. The outputs were cross-checked against manually parsed sentences to measure how closely they agree. The sample data in this study is still small, although it is larger than that of the previous work. Of the sample corpus, 400 sentences (80%) were used as the training dataset and the remaining 100 sentences (20%) as the test dataset. Finally, the observed results were summarized statistically and reported in a table showing the accuracy attained.
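The evaluation procedure described above can be sketched in a few lines of Python. The helper names `split_corpus` and `accuracy`, the placeholder corpus, and the choice to compare whole parses for exact equality are assumptions for illustration; the thesis does not state how partial matches were scored.

```python
def split_corpus(sentences, train_percent=80):
    """Split a corpus into training and test parts by percentage."""
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]

def accuracy(parser_output, manual_parses):
    """Percentage of sentences whose parse equals the manual parse."""
    correct = sum(1 for p, g in zip(parser_output, manual_parses) if p == g)
    return 100.0 * correct / len(manual_parses)

# Placeholder corpus of 500 items standing in for the collected sentences.
corpus = ["sentence {}".format(i) for i in range(500)]
train, test = split_corpus(corpus)
print(len(train), len(test))
```

In practice the corpus would usually be shuffled before splitting so that both parts contain a mix of simple and complex sentences.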

1.5 Scope and Limitation

The scope of the thesis is limited to demonstrating the potential of the rule-based approach for designing and developing a sentence parser for the Afan Oromo language using top-down chart parsing. Simple and complex sentences are considered. In this study, a complex sentence is composed of one independent clause and one or more dependent clauses. Compound-complex sentences, which combine two or more independent clauses with one or more dependent clauses, are outside the scope of this research because the literature lacks clearly stated grammar rules for them and no well-prepared corpus is publicly available. Grammatical features such as gender, case, and tense are not parsed. An automatic morphological analyzer and a part-of-speech tagger are also not included in this work.

Preparing detailed grammar rules and tagging the sentences with their correct word categories was difficult and time-consuming. Because no automatic morphological analyzer or part-of-speech tagger is integrated with the parser, the entire sample corpus was annotated manually by the researcher and its correctness was verified by linguists.

1.6 Significance of the study

As discussed in the preceding sections, a parser plays a vital role in many NLP applications. The beneficiaries of this study therefore include researchers who are, or want to be, involved in improving computer processing of the Afan Oromo language. The sentence parser can serve as a component in the development of higher-level NLP applications, so researchers working on phrase recognition, conceptual parsing, machine translation, question answering, grammar checking, text summarization, and similar areas are among the main beneficiaries. In addition, teachers and students of Afan Oromo linguistics can use the parser to analyze sentences in the language. Finally, the study may contribute to the wider use of technology for the Afan Oromo language.

1.7 Organization of the thesis

The remainder of the thesis is organized as follows. Chapter 2 covers the literature review, presenting the concepts and approaches important for this work together with related work on Afan Oromo and other languages. It also discusses the grammar of the Afan Oromo language in detail, including word order, word classes, phrase structures, and sentence types. Chapter 3 deals with the design of the proposed system: it presents the general architecture, its basic components, and how the components interact. Chapter 4 focuses on the implementation, discussing the algorithms used to realize each component, and also presents the evaluation of the system and the results obtained. Chapter 5 presents the conclusions of the work and recommendations for interested researchers on how the system can be improved.

CHAPTER TWO

2 LITERATURE REVIEW

2.1 Introduction

According to Abdi [1], natural language processing (NLP) is an interdisciplinary area, drawing on many fields of study, that is concerned with designing and building software that can analyze, understand, and generate natural language. Information retrieval (IR) and information extraction (IE) are among the NLP tasks that offer a means of accessing the information contained in large amounts of text. Information retrieval systems typically allow a user to retrieve documents from a large database. NLP is a computational method that automates the translation between computer and human languages; it lets a computer read a line of text meaningfully without being given additional clues or calculations. The goal is to enable natural languages such as English, Amharic, and Afan Oromo to serve as the medium through which users interact with computer systems.

NLP researchers aim to gather knowledge on how human beings understand and use language so that appropriate tools and techniques can be developed to make computer systems understand and manipulate natural languages to perform desired tasks. This work rests on both a set of theories and a set of technologies [2]. When examining how the syntactic structure of a sentence can be computed, the two main things to consider are the grammar and the parsing technique. A grammar is a formal specification of the structures allowable in the language, whereas a parsing technique is a method of analyzing a sentence to determine its structure according to the grammar. Several grammar formalisms and parsing approaches used to parse sentences are briefly discussed in later sections of this chapter.

In addition, much work has been done in NLP in recent years on different languages, and the sentence parser is one of the most important NLP tools to emerge from it. To the best of the researcher's knowledge there has been only one attempt at sentence parsing for Afan Oromo, but much work has been done for other languages on different aspects of parsing using various approaches. The previous Afan Oromo work and the works on other languages most closely related to this study are reviewed below.

2.2 Works so far on sentence parser

2.2.1 Local works on sentence parser

A few sentence parsers have been developed for local languages in response to the increasing demand for precise and exact information. It has been recognized that earlier information retrieval mechanisms alone are not enough to satisfy users' needs. The sentence parsers developed for local languages are presented below, grouped by language.

Parser for Afan Oromo Language

The first related work reviewed is the sentence parser for the Afan Oromo language developed by Diriba [5]. It was the first automatic sentence parser for Afan Oromo and was aimed at parsing simple declarative sentences. The study used the chart algorithm with the Head-driven Phrase Structure Grammar (HPSG) formalism compiled into a left-to-right table. This representation makes it possible to minimize the number of syntactic rules and to provide a rich, well-structured lexical representation. The system also used a supervised learning algorithm to enable the parser to predict unknown and ambiguous words. The sample corpus consisted of 352 sentences from the handout ‘Seer-luga Afan Oromo’, divided into a training dataset of 300 sentences and a testing dataset of the remaining 52 sentences. The reported accuracy was 95% on the training set and 85.5% on the testing set, measured against manually parsed sentences. However, the corpus was small and contained a single sentence type, and no part-of-speech tagger was included to preprocess the text and improve the performance of the parser. In our study, a larger set of sample sentences was collected from Afan Oromo grammar books and from the previous research. We also consider both simple and complex sentences, tagged manually. The parsing algorithm and the grammar formalism adopted in this thesis, top-down chart parsing and CFG rules respectively, are similar to those of the top-down chart parser for Amharic sentences [4] and the top-down chart parser for analyzing Arabic sentences [7].

Parser for Amharic Language

Some work has also been done on Amharic sentence parsing, although it is little compared with the work on languages such as Arabic and English. To our knowledge, the majority of natural language processing work on local languages in Ethiopia is on Amharic. The first attempt at an Amharic sentence parser was by Atelach [24], who developed a simple automatic parser for Amharic sentences to address the problem of building systems that can automatically process Amharic text. Probabilistic Context Free Grammar (PCFG) was used as the grammar formalism to represent the phrase structure rules of Amharic, and the Inside-Outside algorithm with bottom-up chart parsing was used as the parsing strategy. The study tried to combine a probabilistic formalism with rule-based reasoning to develop an automatic sentence parser. The sample corpus consisted of only 100 Amharic sentences, all simple declarative sentences, which had been tagged automatically by previous researchers. After the corpus had passed through the POS tagger, the researcher also parsed it by hand as a further pre-processing phase. The results on the first set of sample sentences were very high: 100% on the training set and approximately 96% on the test set. As the researcher states in her work, this high accuracy is partly due to the small number of words considered in the experiment. Another reason is that all the sentences have identical constructions, so the highest-probability parses were almost always the correct ones.

The second attempt was by Daniel [25]. The work integrated the ideas and outputs of Atelach's earlier attempt to develop an automatic sentence parser specifically for complex Amharic sentences. The parsing algorithm and the grammar formalism were adopted from that automatic sentence parser for Amharic: the Inside-Outside algorithm with a bottom-up strategy and PCFG rules, respectively. The sample corpus consisted of 350 complex Amharic sentences, and experiments were conducted on a training set and a test set. The first experiment examined the performance of the part-of-speech tagger when a morphological analyzer is embedded in it; the tagger attained accuracies of 98.7% on the training set and 94% on the test set. The experiments on complex sentence parsing showed 89.6% accuracy on the training set and 81.6% on the test set.

The third work on Amharic sentence parsing was done by Abeba [26], who proposed a hybrid approach to Amharic base phrase chunking and parsing. Its main objective was to extract different types of Amharic phrases by grouping syntactically correlated words found at different levels of the parse, using a Hidden Markov Model (HMM), and then to transform the chunker into a parser. A bottom-up approach with a transformation algorithm was used for this transformation. The data sets were analyzed and tagged manually and used as a corpus for chunking; the training data were also chunk-tagged manually. The training and testing datasets were prepared using 10-fold cross-validation. The experiments on Amharic sentence chunking showed an average accuracy of 85.31% on the test set before applying the correction rules and 93.75% after applying them. The experiment on Amharic sentence parsing also showed an average accuracy of 93.75%.

Another important work on Amharic sentence parsing, which uses an approach and parsing strategy similar to those proposed in our study, was done by Abdurheman [4]. The researcher developed a top-down chart parser designed to parse all types of Amharic sentences, using Context Free Grammar to represent the Amharic grammar. A lexicon generator that builds the lexicon automatically was also developed, and a morphological analyzer was integrated into the construction of the lexicon. The corpus consisted of 480 sample sentences of different types. To test the effectiveness of the parser, 100 sentences of four to nine words were selected randomly from all sentence types, on average 20 sentences from each type. The correctness of the parser was evaluated by inspecting its output manually. The output was checked for the right categorization of words into their word classes, the right identification of sub-phrases and main phrases, the right order of sub-phrases in building main phrases, and whether all words and phrases were involved in the construction of the sentence.

2.2.2 Global works on sentence parser

Many sentence parsers have also been developed globally using different approaches. Some of the works reviewed in this thesis are as follows.

Parser for English Language

For English, the researchers in [27] developed a formal grammatical system called link grammar, which has expressive power equivalent to that of CFG, together with a parser for it. A link grammar consists of a set of words, each of which has a linking requirement contained in the dictionary. The researchers wrote a link grammar of seven hundred definitions that captures many phenomena of English grammar. They also developed an algorithm based on dynamic programming that tries to build a linkage using a top-down strategy. The system was tested on articles taken from newspapers, and the results indicated that its performance is good. However, a number of English phenomena are not handled by the system. For example, the system accepts sentences and clauses that end with a preposition. There are also problems with the placement of adverbs and prepositional phrases that modify verbs.

A statistical parser for English was developed by Charniak [28]. The parsing system is based on a language model that assigns probabilities to the possible parses of a sentence: the parser assigns a probability to the sentence under each of its possible parses and then chooses the parse with the highest probability. The rules of the context-free grammar specify how each phrase constituent is expanded. The researcher evaluated the system by training the parser on about one million words of the Penn Wall Street Journal Treebank and testing it on 50,000 words, and claimed that its performance was superior to that of previous parsers in the area. However, creating the corpus, or treebank, is a difficult task that requires great effort.

Parser for Arabic Language

A top-down chart parser for simple Arabic sentences was developed by the researchers in [7]. The parser covers nominal and verbal sentences within a domain-specific Arabic grammar. The researchers used the CFG formalism to represent the grammar of simple Arabic sentences and developed grammar rules that give a precise description of the grammatical sentences. They then implemented the parser, which assigns a grammatical structure to each input sentence, and tested it on sentences extracted from real documents.

Another Arabic sentence parser, based on supervised machine learning, was developed by [29]. The work used the support vector machine algorithm for learning and the Penn Arabic Treebank (ATB) as the learning corpus. The parser has two phases: a learning phase and an analysis phase. The learning phase uses a training corpus to extract a set of features and rules, which are used to train the support vector machine. The extracted features specify the morphological category (POS) of the word being processed and the POS of up to four words to its left. The extracted rules are used to train the system to group the sequences of labels that may belong to the same syntactic group and thus to define their borders. The system was evaluated by cross-validation using the Weka tool, dividing the corpus into two parts: 80% of the ATB for learning and 20% for testing. When the system was evaluated on 100 sentences, it achieved a precision of 89.01, a recall of 80.24, and an F-score of 84.37.

The parser in [30] was developed with the aim of analyzing Arabic sentences and extracting the attributes of Arabic words. It was written using a top-down parsing technique with a recursive transition network, and the development was a two-step process. In the first step, the set of rules for the Arabic parser was generated from existing Arabic texts taught at K-12 grade levels. The second step was the implementation of the parser, which analyzes an Arabic sentence and determines whether it follows a valid grammatical structure. Gender and number agreement are checked to ensure that the syntactic structure of the sentences is correct. The evaluation found that some sentences were not parsed at all and others were parsed incorrectly. Sentences were not parsed for three reasons: first, the parser did not find a word in the lexicon; second, the input sentence was incorrect; and third, the parser could not produce a rule for the input sentence because its syntactic form is not included in the grammar. The parser was evaluated on a sample of 90 sentences. The results show that 85.6% of the sentences were parsed successfully, 2.2% were parsed unsuccessfully, and 14.4% were not parsed for various reasons: 4.4% because of lexical problems, 2.2% because the sentences were incorrect, and 5.6% because the sentences were not recognizable by linguists according to Arabic grammar rules.

Parser for Indian Language

A sentence parser for Hindi, one of the official languages of India, was developed by [31] using the CKY algorithm. The parser can recognize languages defined by a context-free grammar in Chomsky normal form; it parses the whole sentence and generates a matrix. The researchers developed a grammar with 14 non-terminals and 13 terminals to represent sentences of the language. As they describe it, the system has three components: an interface, a database of Hindi words, and the parser. The interface allows the user to enter sentences, tokenizes the input sentence, and assigns a tag to each token. The database stores the tags of Hindi words. The parser then takes a string of tags as input and states whether or not the input string is correct. The paper does not report the performance of the parser, nor does it indicate the number and types of sentences used for evaluation or the number of words in the database. Hence it is difficult to say how the parser performs compared with other parsers. However, the researchers state that a large database would slow down parsing and introduce word sense ambiguity when assigning tags to the words of an input sentence.

Parser for Myanmar Language

A top-down parser for both simple and complex Myanmar sentences was developed by [32] using the CFG formalism. The researchers collected sentences of 5 to 50 words: nearly 3000 training sentences and 530 testing sentences. The corpus was pre-processed before being passed to the parsing process. At the sentence level, the researchers annotated the corpus with part-of-speech, chunk, and function tags describing the relationships between the words in the sentences. The sentences were tested and the output parse trees were checked manually; the accuracy of the parse trees was 90.6%. However, a plain top-down parser is less efficient than a top-down chart parser.

Parser for Chinese Language

A parser for Chinese sentences was developed by [33] by applying a Maximum-Entropy-Inspired parser to the Penn Chinese Treebank (CTB). The model assigns a probability to a parse using a top-down strategy. A parse tree is generated by starting from the tree root and using the context-free grammar for branching. Each expansion is assigned a probability, and the probability of a tree is the product of the probabilities of all expansions that generate the given sentence. The parse with the maximum probability P (Phi, s) for a given sentence s is then selected. For the evaluation, the treebank was first transformed: since words in Chinese are not delimited by white space, the original treebank was converted into trees in which each terminal is a single character rather than a word. In addition, a MaxEnt re-ranker, which assigns a new probability to each parse of a sentence, was used to improve the performance of the parser. The system was tested on two versions of the Chinese treebank, CTB1.0 and CTB4.0, with 3485 and 12334 sentences respectively. The paper concluded that the performance of the parser is better than previously obtained results.

2.3 Grammar Formalism

A grammar is a set of constraints on the possible sequences of symbols, expressed as rules or principles. Syntax is the basic ingredient of grammar. Grammar tells us the difference between sets of sentences. It can also be viewed as a formal specification of the rules and syntax that a parser uses when it analyzes a sentence and determines its structure [4]. There are five fundamental units of grammatical structure: morpheme, word, phrase, clause, and sentence. The morpheme is the lowest unit, and morphemes are joined to form a word. Phrases and clauses are groups of words; while a phrase does not have a subject and predicate, a clause does. For instance, in the sentence Tolosaan ni faarfata, which means Tolosa sings, ‘Tolosaa’ is the subject and ‘ni faarfata’ is the predicate. A sentence is also a group of words that conveys some meaning. The analysis in the above example follows traditional grammar. Subject and predicate are called grammatical functions. Parts of speech such as verb, noun, adjective, adverb, conjunction, and preposition are called grammatical categories.

A grammar specifies two things: the set of grammatically correct sentences and the structure to be assigned to each grammatical sentence in the language. Grammar formalisms are used to specify these two things. Grammar formalisms are, first and foremost, languages whose intended usage is to describe languages themselves: to describe the set of sentences the language encompasses (the string set), the structural properties of such sentences (their syntax), and the meanings of such sentences (their semantics) [34]. There are different types of grammar formalisms; Context Free Grammar (CFG), Context Sensitive Grammar (CSG), Transition Network Grammar (TNG), Unification Based Grammar (UBG), and Probabilistic Context Free Grammar (PCFG) are the most common and most widely used.

2.3.1 Context Free Grammar

A context-free grammar (CFG) is a set of production rules that describe all possible strings in a given formal language. It can be defined as a finite set of grammar rules, each of which has exactly one non-terminal symbol on the left-hand side and any string of symbols on the right-hand side. CFGs are a class of formal grammars that have found numerous applications in modeling computer languages [35]. Grammar rules are defined over two kinds of symbols: terminals, which are the symbols of the alphabet underlying the language under consideration, and non-terminals, which behave like variables ranging over strings of terminals [36]. A rule is of the form A → α, where A is a single non-terminal and the right-hand side α is a string of terminal and/or non-terminal symbols. A context-free grammar is a four-tuple (Σ, V, S, P), where:

Σ is a finite, non-empty set of terminals, the alphabet;

V is a finite, non-empty set of grammar variables (categories, or non-terminal symbols), such that Σ ∩ V = ∅;

S ∈ V is the start symbol;

P is a finite set of production rules, each of the form A → α, where A ∈ V and α ∈ (V ∪ Σ)*.

For a rule A → α, A is the rule's head and α is its body. CFGs are a very important class of grammars for two reasons [4]: first, the formalism is powerful enough to describe most of the structure in natural languages; second, it is restricted enough that efficient parsers can be built to analyze sentences.
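To make the four-tuple concrete, the sketch below encodes a toy CFG in Python and checks a sentence against it with a small top-down chart recognizer in the Earley style (predict, scan, complete). The grammar rules, the tag names and the lexicon are assumptions invented for this example, and the recognizer is a standalone illustration rather than the NLTK-based parser used in this thesis. It assumes the grammar has no empty productions.

```python
# Toy grammar (P) and lexicon: illustrative assumptions only.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("NN",)],
    "VP": [("PART", "VB")],
}
LEXICON = {"Tolosaan": {"NN"}, "ni": {"PART"}, "faarfata": {"VB"}}

def recognize(words, grammar, lexicon, start="S"):
    """Return True if `words` can be derived from `start`."""
    # chart[i] holds edges (lhs, rhs, dot, origin) that end at position i.
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].update((start, rhs, 0, 0) for rhs in grammar[start])
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:
                    # Predict: expand the expected non-terminal top-down.
                    new, target = [(nxt, r, 0, i) for r in grammar[nxt]], i
                elif i < len(words) and nxt in lexicon.get(words[i], ()):
                    # Scan: the next word has the expected category.
                    new, target = [(lhs, rhs, dot + 1, origin)], i + 1
                else:
                    continue
            else:
                # Complete: advance every edge that was waiting for `lhs`.
                new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == lhs]
                target = i
            for edge in new:
                if edge not in chart[target]:
                    chart[target].add(edge)
                    if target == i:
                        agenda.append(edge)
    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in chart[len(words)])

print(recognize("Tolosaan ni faarfata".split(), GRAMMAR, LEXICON))
```

Because every edge is stored in the chart once and reused, a constituent is never re-derived for the same span, which is the main advantage of chart parsing over plain top-down parsing.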

2.3.2 Context Sensitive Grammar

Context-sensitive rules are used in natural language to describe, for example, subject-verb agreement with respect to number (singular or plural), as reflected in sentences such as 'the student come' and 'the student comes' [4]. Like a context-free grammar, a Context-Sensitive Grammar is a four-tuple G = (N, Σ, P, S), where:

N is a set of non-terminal symbols,

Σ is a set of terminal symbols,

S is the start symbol and

P is a finite set of production rules of the form α1Aα2 → α1βα2, where A ∈ N is a single non-terminal, α1, α2 ∈ (N ∪ Σ)* and β ∈ (N ∪ Σ)+.

The production rules of a context-sensitive grammar satisfy the following constraints:

- For a rule A → B, where A and B are strings of alphabet symbols, the length of A should be less than or equal to the length of B.

- For a rule A → y / x_z, A is a non-terminal symbol, y is a sequence of one or more terminal and non-terminal symbols, and x and z are sequences of zero or more terminal and non-terminal symbols.

The second form means that A can be rewritten as y if it appears within the context ‘x_z’, i.e., immediately preceded by the symbols x and immediately followed by the symbols z.
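The contextual rewriting rule A → y / x_z can be sketched as a small Python function that rewrites a symbol only when its left and right neighbours match the required context. The function name `apply_context_rule` and the symbols `NP_sg`, `NP_pl` and `V` are hypothetical and serve only to illustrate number agreement.

```python
def apply_context_rule(symbols, target, replacement, left, right):
    """Rewrite `target` as `replacement` where it occurs in left _ right.

    `left` and `right` are lists of symbols; an empty list means that
    side of the context is unconstrained.
    """
    out = []
    for i, symbol in enumerate(symbols):
        before = symbols[max(0, i - len(left)):i]
        after = symbols[i + 1:i + 1 + len(right)]
        if symbol == target and before == left and after == right:
            out.extend(replacement)
        else:
            out.append(symbol)
    return out

# V -> 'comes' / NP_sg _   (rewrite V only after a singular noun phrase)
print(apply_context_rule(["NP_sg", "V"], "V", ["comes"], ["NP_sg"], []))
print(apply_context_rule(["NP_pl", "V"], "V", ["comes"], ["NP_sg"], []))
```

The second call leaves the sequence unchanged because the left context is a plural noun phrase, which is how such a rule blocks a singular verb form after a plural subject.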

2.3.3 Transition Network Grammar

The Transition Network Grammar (TNG) formalism describes rules using nodes and labeled arcs in a transition network [4]. One of the nodes is specified as the initial state, or start state. Starting at the initial state, an arc can be traversed if the current word in the sentence belongs to the category on the arc. If the arc is followed, the current word is updated to the next word. Simple transition networks are often called Finite State Machines (FSMs) and have the same expressive power as regular grammars. However, they are not powerful enough to describe all languages that can be described by CFGs. For a transition network grammar to have the descriptive power of CFGs, arc labels must be allowed to refer to other networks as well as to word categories. The grammar formalism based on this notion is called a recursive transition network (RTN). Another commonly used TNG formalism for writing natural language grammars is the Augmented Transition Network (ATN), introduced by Woods [37]. This type of formalism assumes that a sentence is in the language defined by the network if there is a path from the start state to some final state such that the labels of the arcs on the path match the words of the sentence.
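The recursive transition network idea described above, in which an arc label may name either a word category or another network, can be sketched as follows. The three networks, their state numbers and the tag names are assumptions made for this example, and the input is a sequence of word categories rather than words to keep the sketch short. The sketch also assumes that no network calls itself before consuming input (no left recursion).

```python
# Each network: start state, final states, and arcs per state.
# An arc label is either a word category or the name of another network.
NETWORKS = {
    "S": {"start": 0, "final": {2}, "arcs": {0: [("NP", 1)], 1: [("VP", 2)]}},
    "NP": {"start": 0, "final": {1}, "arcs": {0: [("NN", 1)], 1: [("ADJ", 1)]}},
    "VP": {"start": 0, "final": {1}, "arcs": {0: [("VB", 1)]}},
}

def traverse(name, pos, tags, networks):
    """Return all input positions reachable after traversing network `name`."""
    net = networks[name]
    results, seen = set(), set()
    stack = [(net["start"], pos)]
    while stack:
        state, p = stack.pop()
        if (state, p) in seen:
            continue
        seen.add((state, p))
        if state in net["final"]:
            results.add(p)
        for label, nxt in net["arcs"].get(state, []):
            if label in networks:
                # The arc refers to another network: traverse it recursively.
                for q in traverse(label, p, tags, networks):
                    stack.append((nxt, q))
            elif p < len(tags) and tags[p] == label:
                # The arc is labelled with a category: consume one word.
                stack.append((nxt, p + 1))
    return results

def accepts(tags, networks, top="S"):
    """A sentence is accepted if the top network can consume every tag."""
    return len(tags) in traverse(top, 0, tags, networks)

print(accepts(["NN", "ADJ", "VB"], NETWORKS))
```

Removing the arcs that name other networks would reduce each network to a finite state machine, which is why plain transition networks are limited to regular languages.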

2.3.4 Unification Based Grammar

Unification-based formalisms use as their informational domain a system based on features and their values [34]. A feature structure consists of features and associated values, which can be atomic or complex, i.e., feature structures themselves. In other words, the values can be drawn from a structured set.
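A minimal sketch of unification over feature structures, assuming they are represented as nested Python dictionaries, is given below. The feature names (`cat`, `agr`, `num`, `per`) are illustrative, and the sketch omits structure sharing (reentrancy), which full unification-based formalisms support.

```python
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on a clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            # Feature only present in fs2: simply add it.
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            # Both values are complex: unify them recursively.
            sub = unify(result[feature], value)
            if sub is None:
                return None
            result[feature] = sub
        elif result[feature] != value:
            # Conflicting atomic values: unification fails.
            return None
    return result

noun_phrase = {"cat": "NP", "agr": {"num": "sg"}}
constraint = {"agr": {"num": "sg", "per": 3}}
print(unify(noun_phrase, constraint))
print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))
```

The first call merges the compatible information from both structures, while the second returns None because the two values of `num` conflict.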

2.3.5 Probabilistic Context Free Grammar

The most commonly used type of grammar in natural language modeling is a probabilistic version of the CFG, called probabilistic (or stochastic) context-free grammar (PCFG) [35]. The key idea in PCFGs is to extend the CFG definition to give a probability distribution over possible derivations [38]. The probability of each rule is estimated by counting the number of times the rule is used in a corpus of parsed sentences. A PCFG is a five-tuple [39], PCFG = (N, Σ, P, S, D), where:

N is a set of non-terminal symbols,

Σ is a set of terminal symbols,

P is a set of production rules in CNF,

S is the start symbol and

D is a function that assigns a probability to each rule in P.
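The counting procedure mentioned above can be sketched as follows: each rule's probability is estimated as its count divided by the total count of rules with the same left-hand side, and the probability of a parse tree is the product of the probabilities of the rules it uses. The two toy trees, the tag names and the helper names are assumptions for illustration and are not attested analyses; unlike the definition above, the sketch does not require the rules to be in CNF.

```python
from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) for every rule used in a tree of nested tuples."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_of(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: count(A -> rhs) / count(A)."""
    rule_counts = Counter(r for tree in treebank for r in rules_of(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def tree_probability(tree, probs):
    """Multiply the probabilities of all rules used in the tree."""
    p = 1.0
    for rule in rules_of(tree):
        p *= probs[rule]
    return p

# Hypothetical two-tree treebank.
t1 = ("S", ("NP", ("NN", "Tolosaan")), ("VP", ("VB", "faarfata")))
t2 = ("S", ("NP", ("NN", "Tolosaan")),
      ("VP", ("PART", "ni"), ("VB", "faarfata")))
probs = estimate_pcfg([t1, t2])
print(probs[("VP", ("VB",))], tree_probability(t2, probs))
```

With only two trees the estimates are very coarse, which illustrates why probabilistic parsers need a large parsed corpus.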

PCFGs were introduced as an extension of CFGs to aid sentence disambiguation, but they have a number of problems, and in practice most current probabilistic parsers use some augmented form of PCFG. The main drawback of PCFGs is that they do not model dependencies. Although it is not stated explicitly, the formulation of PCFGs assumes that the derivation from each non-terminal node to a set of input words is independent not only of the nodes outside the sub-tree but also of the words on both sides of the subsequence of the input string that the sub-tree covers. The first assumption is structural independence and the second is lexical independence. Natural languages are not that simple and exhibit both kinds of dependencies [39][40].

All extensions of PCFG try, in one way or another, to include the dependencies between words and parse trees. One drawback of extended PCFGs is that they need an extremely large corpus for estimating the probabilities. To avoid this, the various extensions make some simplifying independence assumptions. A commonly used way to incorporate dependencies into PCFGs is the probabilistic lexicalized CFG, which is based on the concept of head-driven grammars. Every phrase is associated with a “head” word, which constrains the overall structure of the sentence. Instead of computing the probability of a parse just by multiplying the PCFG rule probabilities, each rule probability is now conditioned on its head [35][39][40].

2.4 Sentence Parsing Approaches

Parsing is the process of assigning syntactic structures to input strings according to a grammar [26]. It is the step in which a flat input sentence is converted into a hierarchical structure that corresponds to the units of meaning in the sentence. Parsing techniques fall into two groups: stochastic and rule-based. These are briefly discussed as follows.

2.4.1 Stochastic Approaches

The stochastic approach, also called the corpus-based approach, is based on the use of text corpora. It draws on Bayes' theorem, independence assumptions and the Markov assumption to determine the most likely sequence of lexical categories for the words in a given sentence [29]. Many parsers use formal grammars to analyze language input. Stochastic parsing differs in that the rules in the grammar are assigned probabilities [41]. Based on the type of text corpora used, the corpus-based (stochastic) approach can be further categorized into supervised and unsupervised approaches.

Supervised Approach

In the supervised approach, we are given a data set and already know what the correct output should look like, on the assumption that there is a relationship between the input and the output. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process [42]. The approach uses annotated text corpora, and systems developed with it are called supervised parsers. They use probability or statistics in analyzing the syntactic structure. The main sources of information for a supervised parser are the lexicon (which lists each word with all of its possible lexical categories) and the list of contextual probabilities for each lexical category. The contextual probabilities indicate which lexical category is appropriate in a particular context. However, this approach has two main drawbacks: manually or automatically parsed text (corpora) is scarce, and manual parsing is required each time the parser is to be applied to a new text [5].

Unsupervised Approach

The unsupervised approach, on the other hand, allows us to approach problems with little or no idea of what the results should look like. We can derive structure from data where we do not necessarily know the effect of the variables. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher [42]. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unlike the supervised approach, the unsupervised approach uses a natural corpus such as those found in newspapers and books. For this reason, it does not require any pre-tagged text in the training process. Probabilistic information generated from the corpus is used to develop the syntactic analysis system. These parsers also work on the assumption of a Markov model, that is, a set of states (lexical categories in this case) connected by directed edges, each labeled with a transition probability that indicates the probability of moving to the state at the end of the edge.

2.4.2 Rule-based Approaches

The rule-based approach has been used successfully in developing many natural language processing systems. Systems that use rule-based transformations are built on a core of solid linguistic knowledge. The linguistic knowledge acquired for one natural language processing system may be reused to build the knowledge required for a similar task in another system [39]. The rule-based approach has an advantage over the corpus-based approach for less-resourced languages, for which large corpora, possibly parallel or bilingual, with representative structures and entities are neither available nor easily affordable, and for morphologically rich languages, which suffer from data sparseness even when corpora are available [43]. The rules may contain a large amount of morphological, lexical or syntactic information [9]. These factors have motivated many researchers to follow the rule-based approach in developing natural language processing tools and systems.

As [24] states in her work, in parsing a sentence, rule-based approaches attempt to find a way in which that sentence could have been generated from the start symbol of the grammar. They attempt to parse a sentence based on the information in the knowledge base (grammar rules) of the language. Systems based on such rules learn a set of rules automatically from a given list of strings and then parse sentences by following the rules. This approach can be applied in three ways: top-down, bottom-up, and chart-based. These approaches are briefly discussed below.

Top-down Parsing approach

The top-down approach starts with the largest unit and breaks it down into smaller segments. According to [44], top-down parsing has the advantage that only those rules are applied that can be useful in proving that the sentence is grammatical; its disadvantage is that the rules are tried "blindly," without any regard to the lexical material present in the sentence. Top-down parsing builds the parse tree from the start symbol S. It never wastes time exploring trees that cannot result in an S, since it begins by generating just those trees [39]. This means it also never explores sub-trees that cannot find a place in some S-rooted tree. Thus, it is goal oriented: the goal is to parse the sentence according to the grammar productions. To build a parse, the following steps are repeated until the parse tree matches the input string [7].

At the start node S, select a production with S on its left-hand side and, for each symbol on its right-hand side, construct the appropriate child. When a terminal that does not match the input string is added to the tree being constructed, backtrack. Then find the next node to be expanded. If no parse tree matches the input string, the input string is not a sentence of the grammar. Top-down methods have the advantage of being highly predictive [4]: they predict the end string from the given grammar. However, the parser must backtrack to the point of the wrong decision each time it makes a wrong choice.
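The top-down steps described above can be sketched as a small backtracking recognizer. This is a minimal illustration and not the parser developed in this thesis: the three-rule grammar, the lexicon, the category assigned to each word and all function names are assumptions made for the example.

```python
# Toy grammar and lexicon (assumed for illustration only).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["NP", "V"], ["V"]],
}
LEXICON = {"Dagaagaan": "N", "nyaata": "N", "nyaate": "V"}

def expand(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover words[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Lexical category: it must match the next input word, otherwise
        # nothing is yielded and the caller backtracks to another production.
        if pos < len(words) and LEXICON.get(words[pos]) == symbol:
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each production in turn
        yield from expand_sequence(symbol, rhs, words, pos, [])

def expand_sequence(lhs, rhs, words, pos, children):
    """Build one child per right-hand-side symbol, left to right."""
    if not rhs:
        yield (lhs, *children), pos
        return
    for child, next_pos in expand(rhs[0], words, pos):
        yield from expand_sequence(lhs, rhs[1:], words, next_pos, children + [child])

def parse(sentence, start="S"):
    """Return every tree rooted in `start` that spans the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in expand(start, words, 0) if end == len(words)]

trees = parse("Dagaagaan nyaata nyaate")
```

Backtracking is implicit in the generators: when a production fails to match, the loop simply moves on to the next one. With a left-recursive rule such as NP → NP N, this sketch would recurse without end, which is the weakness of plain top-down parsing.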

Bottom up parsing approach

Bottom-up parsing is data directed. The initial goal list of a bottom-up parser is the

string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this

sequence may be replaced by the LHS of the rule. Parsing is finished when the goal list

contains just the start category. If the RHS of several rules match the goal list, then

there is a choice of which rule to apply. The standard presentation is as shift-reduce

parsing.
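The shift and reduce moves described above can be sketched as follows. The grammar and lexicon are toy assumptions, and the function names are hypothetical. To keep the example reliable, this sketch backtracks over the choice between reducing and shifting, which goes beyond the simplest shift-reduce parser that commits to a single choice.

```python
# Toy rules and lexicon (assumed for illustration only).
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)),
    ("VP", ("NP", "V")),
    ("VP", ("V",)),
]
LEXICON = {"Dagaagaan": "N", "nyaata": "N", "nyaate": "V"}

def shift_reduce(stack, buffer, start="S"):
    """Return True if the stack plus the remaining buffer reduce to `start`."""
    if not buffer and stack == (start,):
        return True  # the goal list contains just the start category
    # Reduce: if the top of the stack matches the RHS of a rule,
    # replace that sequence with the LHS of the rule.
    for lhs, rhs in RULES:
        n = len(rhs)
        if stack[-n:] == rhs and shift_reduce(stack[:-n] + (lhs,), buffer, start):
            return True
    # Shift: move the next input category onto the stack.
    if buffer:
        return shift_reduce(stack + (buffer[0],), buffer[1:], start)
    return False

def recognize(sentence):
    """Look up each word's category, then run the shift-reduce search."""
    categories = tuple(LEXICON[word] for word in sentence.split())
    return shift_reduce((), categories)

accepted = recognize("Dagaagaan nyaata nyaate")
```

The search is driven entirely by the categories of the input words, which is what makes the method data directed. The toy grammar has no empty productions and no cyclic unit rules, so the recursion always terminates.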

The task of the parser is to group words into their respective categories in a manner permitted by the grammar. Unlike the top-down parser, the bottom-up parser checks the input sentence only once and builds each constituent exactly once [40]. This is because a bottom-up parser works from left to right, i.e., it does everything it can with the first item before exploring what it can do with the next items. However, a bottom-up parser can also get stuck in a loop if the grammar has empty productions.

Bottom-up parsing has the advantage that the choice of grammar rules to apply depends on the words present in the sentence and on the analyses of sub-strings of the sentence. Its disadvantage is that analyses are built for sub-strings that do not contribute to the overall analysis of the sentence [44]. Although both bottom-up and top-down parsers have advantages, they are inefficient and have a worst-case exponential run-time, because the parser tends to try the same matches repeatedly, duplicating much of its work unnecessarily. Therefore, a more efficient approach, called chart parsing, is discussed next.

Chart Parsing approach

The approaches discussed above have significant limitations. The bottom-up (shift-reduce) parser can find only one parse, and it often fails to find a parse even if one exists. As just pointed out, the top-down (recursive-descent) parser can be very inefficient, and if the grammar contains left-recursive rules, it can enter an infinite loop. To address these problems of completeness and efficiency, we explore the chart parsing approach, which stores intermediate results and re-uses them when appropriate. A chart parser combines some of the advantages of the top-down and bottom-up approaches: it combines the selective behavior of the top-down algorithm, which builds partial parses based on left context, with the bottom-up behavior of building each partial parse only once [24]. The main objective of chart parsing is to improve parsing efficiency. It rests on three principles: first, do not do twice what can be done once; second, do not do once what can be avoided altogether; and third, do not represent distinctions that are not the concern of the study [4].

To parse a sentence, a chart parser first creates an empty chart spanning the sentence. It then finds edges that are licensed by its knowledge of the sentence and adds them to the chart one at a time until one or more parse edges are found. It has three main components: a chart, a key list and a set of edges. A chart is a set of chart entries, each of which consists of the name of a terminal or non-terminal symbol, the starting point of the entry and the entry length. The key list is a push-down stack of chart entries that are waiting to be added to the chart. The edges are rules that can be applied to chart entries to build them up into larger entries [7][4]. The chart maintains a record of all the constituents derived from the sentence so far in the parse. It also maintains a record of rules that have matched partially but are not yet complete. Recording intermediate results is a form of dynamic programming that avoids duplicate work [45].

A chart parser is driven by an agenda of completed constituents and by arc extension, which combines active arcs with constituents when they are added to the chart. The technique of extending arcs with constituents can be applied in both the bottom-up and the top-down approach. The difference lies in how new arcs are generated from the grammar. In the bottom-up approach, new active arcs are generated whenever a completed constituent is added that could be the first constituent of the right-hand side of a rule. In the top-down approach, new active arcs are generated whenever a new active arc is added to the chart [7]. For this reason, a top-down chart parser generates fewer constituents than a bottom-up chart parser. Therefore, the top-down chart parser is considerably more efficient for any reasonable grammar.
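A top-down chart recognizer in the spirit of the description above can be sketched as follows. It is an Earley-style illustration under stated assumptions, not the implementation developed in this thesis: the grammar, the lexicon, the `Edge` record, the auxiliary `GOAL` symbol and the function names are all hypothetical, and the grammar is assumed to have no empty productions.

```python
from collections import namedtuple

# An edge (arc) records a rule, how many right-hand-side symbols have been
# recognized so far (dot), and the span of words it covers (start..end).
Edge = namedtuple("Edge", "lhs rhs dot start end")

# Toy grammar and lexicon (assumed for illustration only).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("N",), ("N", "ADJ")],
    "VP": [("NP", "V"), ("V",)],
}
LEXICON = {"Dagaagaan": "N", "nyaata": "N", "nyaate": "V",
           "Caalaan": "N", "dhufe": "V"}

def next_symbol(edge):
    """Symbol an active arc is waiting for; None if the edge is complete."""
    return edge.rhs[edge.dot] if edge.dot < len(edge.rhs) else None

def chart_parse(words, start="S"):
    chart = set()
    agenda = [Edge("GOAL", (start,), 0, 0, 0)]  # stack of edges waiting to be added
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue  # never do twice what can be done once
        chart.add(edge)
        symbol = next_symbol(edge)
        if symbol is None:
            # Completed constituent: extend every active arc that ends where
            # this constituent starts and is waiting for its category.
            for arc in list(chart):
                if arc.end == edge.start and next_symbol(arc) == edge.lhs:
                    agenda.append(arc._replace(dot=arc.dot + 1, end=edge.end))
        elif symbol in GRAMMAR:
            # Top-down step: a new active arc triggers new active arcs
            # for every rule that expands the category it needs.
            for rhs in GRAMMAR[symbol]:
                agenda.append(Edge(symbol, rhs, 0, edge.end, edge.end))
            # Re-use constituents of that category that are already in the chart.
            for done in list(chart):
                if (done.start == edge.end and done.lhs == symbol
                        and next_symbol(done) is None):
                    agenda.append(edge._replace(dot=edge.dot + 1, end=done.end))
        elif edge.end < len(words) and LEXICON.get(words[edge.end]) == symbol:
            # Lexical category: consume the next word if its category matches.
            agenda.append(edge._replace(dot=edge.dot + 1, end=edge.end + 1))
    return chart

def recognizes(sentence, start="S"):
    words = sentence.split()
    goal = Edge("GOAL", (start,), 1, 0, len(words))
    return goal in chart_parse(words, start)

ok = recognizes("Dagaagaan nyaata nyaate")
```

Because every edge is stored once and looked up before it is added again, a left-recursive rule only produces an edge that is already in the chart, so the loop terminates where plain recursive descent would not.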

2.5 Afan Oromo Grammar

Ethiopia is a multilingual country. It is home to more than 80 ethnic groups [3] with diverse linguistic backgrounds. Most of its languages belong to the Afro-Asiatic super family (Cushitic, Semitic and Omotic), alongside Nilotic languages of the Nilo-Saharan family [46]. Afan Oromo belongs to the East Cushitic language family of the Afro-Asiatic super family. It is the most widely spoken language in Ethiopia. According to Abdi, Afan Oromo has around 40 million native speakers, about 50% of the country's total population, making it the language with the most speakers in Ethiopia. The writing system of Afan Oromo is nearly phonetic, since it is written the way it is spoken, i.e. one letter corresponds to one sound. The language uses a Latin-based alphabet, “Qubee”, which was formally adopted in 1991 G.C [47], and it has its own consonant and vowel sounds. Afan Oromo has thirty-three consonants, seven of which are combined consonant letters: ch, dh, ny, ph, sh, ts and zh. The combined consonant letters are known as ‘qubee dachaa’. Afan Oromo has five short and five long vowels. Like the English alphabet, the Afan Oromo alphabet has capital and small letters. In Afan Oromo, as in English, vowels are sound makers and can stand by themselves.

“Parsing refers to the activity of analyzing a sentence into its component categories and functions” [39]. It is a skill that you can learn to do rather than something you simply know about. Hence, sentence parsing is about discovering the structure of an input sentence based on external information known about the elements of the sentence and their order. As [13] describes in his work, Afan Oromo is a morphologically rich language; each root word can combine with multiple morphemes to generate a huge number of word forms. The grammatical system of Afan Oromo is quite complex and exhibits many features common to other Cushitic languages: it is an inflected language that uses postpositions more than prepositions [47]. Hence, to support such inflectionally rich languages, the structure of each word has to be identified. We therefore present the grammar of the Afan Oromo language in the following sections, covering its word order, word classes, phrase types and sentence types, because of their importance for our study.

2.5.1 Word order

Words combine in different orders to form phrases and sentences, and they also have an internal structure [48]. One of the primary ways in which languages differ from one another is in the order of constituents, or word order. For instance, Afan Oromo and English differ in their syntactic structure. In Afan Oromo, the sentence structure is subject-object-verb (SOV): the subject comes first, then the object, and the verb follows the object. For example, in the Afan Oromo sentence “Dagaagaan nyaata nyaate”, “Dagaagaan” is the subject, “nyaata” is the object and “nyaate” is the verb. In English, the sentence structure is subject-verb-object (SVO). Translated into English, the above sentence becomes “Degaga ate food”, where “Degaga” is the subject, “ate” is the verb and “food” is the object. Because nouns change form depending on their role within the sentence, word order in Afan Oromo can be flexible, though verbs always come after their subjects and objects. Typically, indirect objects follow direct objects.

2.5.2 Word Categories

In Afan Oromo, word classes are categorized into five major groups based on their context and formation in sentences. These are noun, adjective, verb, adverb and adposition (pre- and post-position). This study adopts the convention that conjunctions and adpositions appear in the same category, the adposition category. Nouns are words used to name or identify things, people, animals, places or abstract ideas. The noun category in Afan Oromo can include nouns, adjectives and pronouns. For example, the word farda ‘horse’ occupies a noun position in the sentence Fardi garbuu nyaata ‘The horse eats barley’. Singular nouns are marked by a zero morpheme, whereas plural nouns are marked by various forms. Sometimes nouns are used as adjectives in Afan Oromo.

Example:

Tolosaan mana barumsaa deeme. Tolosa went to school

According to [5], Afan Oromo nouns are pluralized by adding various suffixes, such as {-een, -wan, -(o)ota, -yyii and -lee}. Some nouns can take more than one plural marker. For instance, mana 'house' can be pluralized as either manneen or manoota. Most nouns, however, prefer one plural marker to the others. The word sagalee 'sound', for example, can be pluralized by suffixing {-(o)ota} or {-lee}, but it prefers the former to the latter.

Adjective

An adjective modifies a noun or a pronoun by describing, identifying, or quantifying it. In Afan Oromo, an adjective usually follows the noun or pronoun that it modifies. Like nouns, adjectives fall into different categories; Afan Oromo adjectives can be either primitive or derived. For example, in “saree gurraacha”, the adjective gurraacha “black” comes after the noun it modifies. However, not every word that follows a noun is an adjective. For example, in mana citaa “thatched house,” the word citaa “thatched” is not an adjective but a noun used as an adjective. Moreover, nouns, but not adjectives, occur in subject and/or object positions. On the other hand, adjectives are similar to nouns in various ways, and the two categories share similar characteristics in gender inflection. However, it should be noted that adjectives cannot substitute for nouns in sentence construction. Only nouns and pronouns seem to substitute for each other, since they can occur in the same position in an Afan Oromo sentence. For example, consider the following sentences, of which the first two are grammatical and the third is not.

Abbabaan barsiisaadha. [Abebe is a teacher]

Inni barsiisaadha. [He is a teacher]

* Gurraacha barsiisaadha. [Black is the teacher], which is ungrammatical.

The verb is the most important part of a sentence: it says something about the subject of the sentence and expresses an action, event or state of being. It is not the case that verbs constitute a distinct, open word class in all languages. In Afan Oromo, verbs are forms that occur in clause-final position and belong to a distinct category [1][11]. For example, in the following sentences:

Inni qalama bite. [He bought a pen]

Caaltuun deemte. [Chaltu has gone]

Isheen gabaabdu dha. [She is short]

The clause-final forms in these sentences are verbs. [11] divides verbs into a number of sub-categories based on the type of constituents they are associated with. These are intransitive, transitive, ditransitive, modal and auxiliary verbs. Intransitive verbs are those that do not take any phrase as their complement. For example, in the sentence ‘Abbabaan furdate’ (Abebe got fat), ‘furdate’ [got fat] is an intransitive verb, which has no complement. There are also strictly transitive verbs [5], which take one complement in Afan Oromo. For example,

inni teechuma cabse (he broke the chair)

Caalaan mana bite (Chala bought a house)

In these two examples, teechuma and mana are the complements of the verbs ‘cabse’ (broke) and ‘bite’ (bought) respectively. Finally, the third category of verbs in Afan Oromo is the ditransitive verbs. These verbs take two complements, usually a noun phrase and an adpositional phrase. For example,

Tulluun abbaaf konkoolataa bite.

In the above sentence, abbaaf and konkoolataa are the two complements of the verb bite ‘bought’; they are an adpositional phrase and a noun phrase respectively.

Adverb

Adverbs are words, which are used to modify a verb, an adjective, another adverb, or a

clause. Adverbs usually precede the verbs they modify or describe in Afan Oromo

sentences. An adverb indicates time, manner, place, cause, or degree and answers

questions such as, how? when? where? and how much? In the following example, each

of the bold words is an adverb. Example:

Oboleessi koo boru deema. (My brother will leave tomorrow.) Boru (tomorrow) is an

adverb.

However, it should be noted that not every word that comes before a verb is an adverb. For instance, in kitaaba bite “bought book”, the word kitaaba “book” precedes the verb bite “bought”. In this case, the word kitaaba is a noun and is in turn modified by the verb bite. Hence, the verb functionally shares the feature of an adjective (modifier). There are different types of adverbs: adverbs of time, place, manner, frequency, degree, etc. In general, adverbs are treated as a subclass of verbs [11].

Adpositions

Adpositions are traditionally defined as words that link to other words, phrases, and clauses and express spatial or temporal relations. They are an almost universal part of speech. “Adposition” is a cover term for prepositions and postpositions. Afan Oromo has both prepositions and postpositions, though postpositions are more common. Examples:

boqonnaarra [boqonnaa irra] – “on vacation”

mana nyaataa kanatti [kana itti] – “at this restaurant”

Keeniyaan Itoophiyaarraa (gara) kibbatti argamti – “Kenya is located (to the) south of Ethiopia”

From the examples above, we notice that the postpositions (itti, irra, and irraa) most often occur as the suffixes -tti, -rra, and -rraa on the nouns they relate to. With place names, no preposition or postposition is needed to mean “in”. Therefore, one can say “Finfinnee jiratta” for “you live in Finfinnee [Addis Ababa]”, or “hospitaalan ture” for “I was in the hospital,” using no preposition.
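Since the postpositions usually surface as suffixes, a parser has to separate them from the noun before it assigns categories. The following naive sketch covers only the three suffix forms listed above. The function name and the suffix table are assumptions made for illustration, and plain suffix stripping would wrongly split any word that merely happens to end in one of these strings, so a real system would need a lexicon or a morphological analyzer.

```python
# Suffix forms listed in the text, mapped to the full postposition.
# The longer suffix -rraa is listed before -rra so that it is tried first.
POSTPOSITION_SUFFIXES = [
    ("rraa", "irraa"),
    ("rra", "irra"),
    ("tti", "itti"),
]

def split_postposition(word):
    """Return (stem, postposition), or (word, None) if no listed suffix matches."""
    for suffix, postposition in POSTPOSITION_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], postposition
    return word, None

segments = [split_postposition(w) for w in ("boqonnaarra", "kanatti", "Itoophiyaarraa")]
```

Applied to the three examples from the text, the sketch recovers the stems boqonnaa, kana and Itoophiyaa together with irra, itti and irraa.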

Conjunction

A conjunction is a word that is used to connect words, phrases, clauses or sentences. Conjunctions in Afan Oromo are either coordinating or subordinating. In this study, conjunctions and adpositions are treated as the same category. According to [5], placing adpositions and conjunctions in different categories raises the problem of distinguishing one from the other, mainly because the same words are mostly used as both adpositions and conjunctions. However, where it is possible to separate adpositions from conjunctions, they are parsed separately; that is, when the parser is able to distinguish between the two sub-categories, a distinct category is given to each of them. Some Afan Oromo conjunctions are: [fi] “and,” [immoo, garuu] “but,” [yookin (for declaratives), moo (for questions)] “or,” [haa ta’u malee] “however,” [ta’us] “though,” [kanaaf] “so,” [kanaafuu] “therefore,” [sababiin isaa] “because,” [akka] “in order to, so that,” etc. Examples:

Nyaatan barbaada sababiinsa nan beela'e. – “I want food because I am hungry.”

Ani kochee nyaadhe kanaafuu garaa kaasan qaba. – “I ate kitfo so I got

diarrhea.”

Daadhii moo biiraa dhuguu barbaadda? – “Do you want tej or beer?”

Numeral

Numerals are words representing numbers, and they can be cardinal or ordinal

numbers[1]. Afan Oromo cardinal numbers refer to the counting numbers, because they

show quantity. Ordinal numbers, on the other hand, tell the order of things and their

rank. In Afan Oromo, the ordinal numbers are formed from the cardinal numbers by

suffixing the suffix {–affaa}[5][1].

The examples below use numbers in different ways and places to demonstrate how they

behave in a sentence.

“Isheen afaan torba dubbatti”. ‘She speaks seven languages.’

“kun barnota ko lammaffaa dha”. ‘This is my second lesson.’

As in English, the parts of compound Afan Oromo numerals are written separately. Example:

dhibba lama “two hundred” and dhibba lam-affaa “two hundredth”.

There are numerals that indicate distribution. These numerals are called distributive

numerals. Example:

tokko tokko “one one”, sadi sadi “three three”

There are also special numerals in Afan Oromo that correspond to English fraction words such as “half” and “quarter”. Example:

walakkaa “half”, sisoo “one third”.

2.5.3 Phrases Categories

A phrase can be defined as a syntactic combination of a word with one or more other words. A phrase is constrained by two things [5]: its constituents and their lexical categories, such as noun, verb, etc. Thus, we can determine the number of phrase types from the number of word categories. Whether a structure is a phrase can be checked using the following four guiding principles [4]. A structure is a phrase:

1. If the constituents of the phrase can be moved together to another place without separation.

2. If the phrase can be replaced by a pronoun (for a noun phrase).

3. If the meaning of the phrase is corrupted when one of its constituents is missing.

4. If inserting other words inside the phrase affects its meaning.

Based on the lexical categories of Afan Oromo, there are five phrase types in the language [49]. They are briefly presented as follows.

Noun Phrase

A noun phrase is made up of one noun and one or more other lexical categories, which may include another noun. For example, in the phrase ‘mana citaa’ [thatched house], two nouns make up the noun phrase: mana [house] and citaa [thatched] [5].

A noun phrase, like phrases in general, must meet the above criteria to be called a phrase. In the sentence ‘Alamuun mana Magarsaa deeme’ (Alemu has gone to Megersa’s house), ‘mana Magarsaa’ [Megersa’s house] is a noun phrase. To check whether it is really a phrase, we can apply the above criteria. According to [5], of the following rearrangements only the first is possible.

- Deeme Alamuun mana Magarsaa. (legal movement)

- * Magarsaa Alamuun mana deeme. (illegal movement because of rules 1 and 4 above)

- * Mana Alamuun Magarsaa deeme. (illegal because of rules 1 and 4)

The sentences marked with an asterisk (*) have illegal phrase constructions because they violate the above four rules, which are specified by Baye [49]. Thus, we have checked that “mana Magarsaa” is a phrasal structure.

As indicated above, nouns can appear in a number of positions, such as the positions of the three nouns in “Abbabaan kitaaba Caalaaf bite” [Abebe bought Chala a book]. These same positions allow sequences of a noun followed by an article, as in “Abbabaan kitaabicha Caalaaf kenne” [Abebe gave Chala the book]. Since the position of the article can also be filled by demonstratives (kun, sun, etc.), possessives (koo, kee, keessan, etc.), or quantifiers (e.g. xiqqoo), the more general term “determiner” is used for this position.

Verbal Phrase

A verb phrase (VP) is composed of a verb as head and other constituents such as complements, modifiers and specifiers. Afan Oromo verb phrases can be captured by dividing verbs into three categories: intransitive verbs, strictly transitive verbs and ditransitive verbs [5]. Examples:

Inni dhufe “He came”

Tolosaan buna dhuge “Tolosa drank coffee”

Caalaan qalama naaf bite “Chala bought me a pen.”

Adverbial Phrase

An adverbial phrase is made up of one adverb as head word and one or more other lexical categories, which may include other adverbs, as modifiers and specifiers. For example, in Afan Oromo it is possible to have two adverbs in an adverbial phrase, as in the phrase ‘kaleessa galgala’ [yesterday night]. As indicated above, adverbs and their phrases are used to modify verbs. Hence, they precede verbs in a sentence.

Adjectival Phrase

An adjectival phrase is a group of words that describes a noun or pronoun in a sentence. Within adjectival phrases, nouns can act as adjectives, as in ‘Mana Magarsaa’ ‘Megersa’s house’, and verbs can act as adjectives, as in ‘farda bite’ ‘bought horse’.

Adpositional Phrase

Adpositional phrases are combinations of nouns and adpositions. They usually specify a verb phrase. This phrasal category is sometimes called adpositional objects [5][49].

Inni kara mana deeme [He went to the house]

Lammaan qalama Caaltuu-f bite. [Lemma bought a pen for Chaltu]

The adposition in an adpositional phrase can either stand independently, as in the first example, or be affixed to the adpositional object, as in the second example above.

2.6 Afan Oromo Sentences

An Afan Oromo sentence is made up of words or phrases that form one or more clauses. That is, sentences in Afan Oromo are formed by combining zero or more noun phrases with one or more verb phrases. A sentence is a group of words or phrases that is complete in itself, conveys a statement, question, exclamation, or command, and typically contains a subject and a predicate. A sentence can also be considered a special kind of phrase that consists of a noun phrase (NP) and a verb phrase (VP). This is a general representation of sentences of all types in Afan Oromo. A sentence can be either simple or complex, based on the number of verbs it contains.

2.6.1 Simple Sentences

Simple sentences in Afan Oromo are sentences that contain only one verb and have a full meaning. A simple sentence can be constructed from an NP followed by a VP that contains only a single verb. Example:

Caalaan dhufe – “Chala came”.

Here the sentence contains only one verb, ‘dhufe’. Simple sentences can be declarative, interrogative, imperative or exclamatory. All these types of sentences are discussed below.

Declarative sentences (Hima Addeessaa yookin Himaamsa)

If a sentence is a statement, in contrast to a command, question or exclamation, it is a declarative sentence. It always ends with a period (.), as in English; the equivalent mark in Amharic is (::). Declarative sentences are used to convey information. They can be positive or negative. Negative sentences simply negate a declarative statement made about something. Example:

“Caaltuun mana jirti”, ‘Chaltu is at home’.

Here the sentence is declarative because it states where Chaltu is. Another example is

“Tolosaan mana barumsaa hindeemne”, ‘Tolosa did not go to school’

In this example, the sentence is a negative declarative sentence. The verb hindeemne ‘did not go’ is negated by the prefix hin- ‘not’.

Interrogative sentences (Hima Gaaffii)

In Afan Oromo, interrogative sentences are sentences that form a question. The question can ask about something known, for confirmation, or about something unknown. These sentences always end with a question mark (‘?’). Example: Guyyaan har’aa maali? ‘What is the day today?’ Interrogative sentences contain interrogative pronouns, such as eenyu? ‘who’, yoom? ‘when’, maal? ‘what’, meeqa? ‘how many’, eessa? ‘where’, etc.

Imperative sentences (Hima Ajajaa)

Imperative sentences are used when someone wants to pass on instructions or commands. Most of the time, the subjects of imperative sentences are second person pronouns. However, when the command is directed at a third person, the subject of the sentence can be a third person pronoun or a noun. Example:

Hojii manaa hojjadhu. ‘Do homework.’

Here the subject is ‘you’, second person, which covers both feminine and masculine, singular and plural.

Exclamatory sentences (Hima Raajeffannoo)

In Afan Oromo, an exclamatory sentence is a type of simple sentence that expresses strong feelings (excitement or emotion) by making an exclamation, in contrast to sentences that make a statement, express a command, or ask a question. Exclamatory sentences rarely appear in academic writing, unless they are part of quoted material. Example:

ajaa’iba kuni! ‘This is a surprise’.

2.6.2 Complex Sentences

A complex sentence in Afan Oromo grammar is formed from a complex noun phrase, a complex verb phrase, or both. In other words, a complex sentence can have a complex NP and a simple VP, a simple NP and a complex VP, or both a complex NP and a complex VP. Complex NPs contain at least one embedded sentence, which can be a complement or another type of phrase. Complex VPs, on the other hand, contain at least one sentence or more than one verb. On this basis, a complex sentence is constructed from one independent clause and one or more dependent clauses. A sentence that has two or more independent clauses and one or more dependent clauses is called a compound-complex sentence. The focus of this work, however, is the complex sentence type constructed from one independent clause and one or more dependent clauses.

Example:

‘yoo dhufuuf taate, ganamaan koottu.’

‘Qilleensarras kaattu, lafarras arreeddu walgeettin teenya Finfinnee dha.’

2.7 Summary

To develop a sentence parser for any natural language, a grammar formalism and a parsing method are needed. Several grammar formalisms and parsing algorithms have been proposed by different scholars. The formalisms discussed in this research are Context Free Grammar, Transition Network Grammar, Probabilistic Context Free Grammar, Context Sensitive Grammar, and Unification Based Grammar. Among the parsing strategies, stochastic and rule-based approaches were presented. This chapter also discussed the grammar of the Afan Oromo language in detail. We started from the word order of the sentence in Afan Oromo grammar, and then discussed the word classes and the types of phrases as classified by language scholars.

We have also presented some works that are closely related to our thesis. Parsing natural language text (sentences) is challenging because of problems such as ambiguity and inefficiency. Parsing is considered an important intermediate stage for semantic analysis in natural language processing applications such as information retrieval (IR), information extraction (IE) and question answering (QA) [32]. The literature shows that sentence parsers have been developed with different approaches for a number of languages around the world. Among the parsers we reviewed is the first attempt at an automatic sentence parser for Afan Oromo, which covers simple declarative sentences.

We also reviewed sentence parsers for other languages. Some of them use treebanks, whereas others use rule-based approaches with different parsing strategies. A treebank allows the parser to derive its grammar and the probabilities of the possible parses of a sentence. Parsers built from a treebank are often the best parsers because they can be trained with machine learning. However, creating the requisite training corpus, or treebank, is a difficult task. Because no such treebank exists for Afan Oromo, it is difficult to apply this approach to develop an Afan Oromo sentence parser. The rule-based approach also has limitations. Top-down parsing suffers from duplicated work, from backtracking to the point of the wrong decision each time it chooses a wrong path, and from getting stuck when grammar rules are left recursive. In contrast, the bottom-up parser only checks the input sentence once, and only builds each constituent exactly once. Although both bottom-up and top-down parsers have advantages, they are inefficient and have a worst-case exponential run time, because the parser tends to try the same matches repeatedly and thus duplicates much of its work unnecessarily. To avoid this problem, a data structure called a chart is introduced, which allows the parser to store the partial results of the matching it has done so far so that the work need not be repeated [39].

Finally, chart parsing avoids the problems of the plain top-down and bottom-up approaches, and it is an efficient parsing algorithm. Hence, the top-down chart parsing algorithm is selected to parse simple and complex sentences of the Afan Oromo language in our research work.

CHAPTER THREE

3 DESIGN OF AFAN OROMO SENTENCE PARSER

This chapter discusses the main components of the Afan Oromo sentence parser and the interaction between them, and presents the architecture of the parser.

3.1 Components of Afan Oromo Sentence Parser (AOSP)

A rule-based sentence parser has three basic components: grammar rules, a lexicon and a parsing algorithm. The grammar component is responsible for storing grammar rules written in one of the grammatical formalisms. The parser is supplied with this set of grammar rules and then parses sentences based on them. The lexicon component serves as a dictionary for the parser by storing lexical rules, which are kept separate from the grammar rules. Lexical rules specify the possible categories of each word, which improves the efficiency of the parser.

According to Abdurehman [4], the structure of the grammar rules, the lexicon and the parsing algorithm of a rule-based parser differ from system to system and from one language to another. Hence, in addition to the basic components, our system has a Sentence Tokenizer and a Lexicon Generator. The Sentence Tokenizer splits the input sentence into individual words, whereas the Lexicon Generator avoids manual preparation of the lexical rules by generating them automatically from a sample corpus in the form tag -> ‘word’. The corpus we used is manually annotated (POS tagged); a POS tagger is not included in our system.

The grammar rule component of the Afan Oromo sentence parser stores the Afan Oromo Context Free Grammar rules. Because we use the Context Free Grammar formalism, the grammar rules are written so that they represent the structure of Afan Oromo sentences in terms of phrases and word categories. The lexicon component stores a list of lexical rules, which specify the possible categories of each word. The parser engine accepts an input sentence, consults the grammar rules of Afan Oromo, retrieves the POS tag of each word in the sentence from the lexicon, and finally returns the parsed sentence as output.

[Figure: block diagram with the components Corpus Data, Lexical Generator, Lexicon, CFG for Oromo, Sentence Tokenizer (input sentences), Parser, and Output]

Figure 3.1: Architecture of Sentence Parser for Afan Oromo Language

3.2 Context Free Grammar (CFG)

A grammar of a human language is an understandable specification of the syntax of the language. According to Abdurehman, grammar is not concerned with semantics. In other words, a grammar is a collection of rules that describes the well-formed sentences of a language. An efficient parser can be constructed automatically from a properly designed grammar. The idea of a context-free grammar is familiar from formal language theory. For natural languages, a Context Free Grammar is a formal system that describes a language by specifying how any legal text can be derived from a distinguished symbol called the syntactic symbol [50]. The CFG rules for this thesis are extracted from sentences collected from Afan Oromo grammar books and previous research papers, which were already tagged manually by researchers. The rules are extracted to represent the grammatical structure of valid sentences as fully as possible. As discussed earlier, there are two types of sentences in the Afan Oromo language: simple and complex. Hence, the context free grammar rules in this thesis cover both sentence types.

3.3 Sentence Tokenizer

Tokenization is an early processing step that divides the input text into units called tokens, where each token is either a word or something else, such as a number or a punctuation mark [40]. It is also described in [48] as one of the most basic operations that can be applied to a text: breaking up a stream of characters into words, punctuation marks, numbers and other discrete items. For this reason, we need a tokenizer that splits the sentence into words. The tokenizer scans the sentence from beginning to end, and whenever it encounters a space it treats the text before the space as one word. This process is implemented in Python code that splits the input sentence into its words.
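The whitespace-based splitting described above can be sketched in a few lines of Python. This is a minimal illustration and not the thesis code: the function name tokenize_sentence is hypothetical, and separating trailing punctuation into its own token is an added assumption, since the text only describes splitting on spaces.

```python
def tokenize_sentence(sentence):
    """Split a sentence into tokens on whitespace.

    Trailing punctuation (. ? ! ,) becomes a separate token. That step is an
    assumption of this sketch; apostrophes inside words are left untouched.
    """
    tokens = []
    for chunk in sentence.split():
        trailing = []
        # Peel punctuation marks off the end of the chunk, last mark first.
        while chunk and chunk[-1] in ".?!,":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        # Restore the original left-to-right order of the peeled marks.
        tokens.extend(reversed(trailing))
    return tokens


print(tokenize_sentence("Inni gara manaa deeme"))
# ['Inni', 'gara', 'manaa', 'deeme']
```

If punctuation should stay attached to the word, `sentence.split()` alone reproduces the behaviour described in the text.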

3.4 Lexicon Generator

A lexicon is the knowledge that a native speaker has about a language. This includes information about the form and meaning of words and phrases, lexical categorization, the appropriate usage of words and phrases, and relationships between words and phrases. A lexicon is essentially a catalogue of a language’s words, whereas a grammar is a system of rules that allow those words to be combined into meaningful sentences. A lexicon is the vocabulary of a person, language, or branch of knowledge. It is also thought to include bound morphemes, which cannot stand alone as words, such as most affixes [51]. Because words tend to follow regular morphological patterns, many forms of words are not explicitly included in the lexicon. For example, the verb root ‘deem-’ (a dependent root form) has different forms such as ‘deemte’, ‘deemuu’ and ‘deeme’ [5].

From the sample corpus we collected, we prepared the lexicon, which is a list of words with their POS tag names. Preparing the lexicon manually by typing each word and its word category is error prone and time consuming, especially for a large corpus. It is therefore better to have an automatic lexicon generator that produces the result correctly and quickly. We prepared a simple algorithm and then wrote a small Python program that implements the lexicon generator, which outputs lexical rules automatically from tagged sentences for later use in parsing Afan Oromo sentences. The lexicon generator reads the POS tagged sentences and generates each result in the form tag name -> ‘word’.

3.5 AOSP Chart Parser

This component is the main part of the proposed system. This section describes how the chart parsing algorithm is applied in combination with the other components. According to Zhu [52], general search methods (top-down or bottom-up) are not the best choice for syntactic parsing, because the same syntactic constituent may be re-derived many times as part of larger constituents due to the local ambiguities of the grammar. Hence, we have adopted chart parsing. A chart is a form of well-formed substring table. Chart parsing is a common context free parsing algorithm which uses dynamic programming techniques to avoid duplication of effort by ignoring differences in derivation where they have no effect [53]. There is no backtracking, and everything that is put in the chart stays there. In addition, chart parsing does not throw away any information: it keeps a record (a chart) of all the structure found so far [7].

Chart parsing has two forms: passive and active. In passive chart parsing, the chart is simply a record of all constituents that have been recognized. Active chart parsing, in addition to keeping a record of the complete constituents found, records what we are actually looking for and how much of it we have found so far. Such information is recorded in active edges or active arcs, e.g. S → NP . VP. The arc label S → NP . VP means: “we are trying to build an S consisting of an NP followed by a VP. So far, we have found an NP arc, and we are still looking for the VP”. The insignificant-looking dot “.” in the middle of the rule is therefore very important: it marks the boundary between what we have found so far and what we are still looking for. The constituents before the dot correspond to passive edges, and active edges can be combined with passive edges to create new edges. According to [4], the fundamental rule for combining an active edge with a passive edge works as follows. Take an active edge S → A . c B, and suppose that there is a passive edge c → Z . that starts where the active edge ends and has category c on its left-hand side. The dot in the active edge is then moved one category forward, giving S → A c . B.
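The dotted-edge notation and the fundamental rule described above can be illustrated with a small Python sketch. The Edge class, its field names and the function fundamental_rule are hypothetical names chosen for this illustration; the span positions in the example are also assumed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Edge:
    """A chart edge [lhs -> rhs with a dot, (start, end)]."""
    lhs: str
    rhs: Tuple[str, ...]
    dot: int      # number of right-hand-side symbols already found
    start: int    # position in the sentence where the edge begins
    end: int      # position in the sentence where the edge ends

    @property
    def is_passive(self):
        # A passive (complete) edge has the dot at the far right.
        return self.dot == len(self.rhs)

    @property
    def next_symbol(self):
        # The category the edge is still looking for, if any.
        return None if self.is_passive else self.rhs[self.dot]

    def __str__(self):
        symbols = list(self.rhs)
        symbols.insert(self.dot, ".")
        return f"[{self.lhs} -> {' '.join(symbols)}, ({self.start}, {self.end})]"


def fundamental_rule(active, passive):
    """Combine an active edge with an adjacent passive edge, or return None."""
    if active.is_passive or not passive.is_passive:
        return None
    if active.end != passive.start or active.next_symbol != passive.lhs:
        return None
    # Move the dot one category forward and extend the span.
    return Edge(active.lhs, active.rhs, active.dot + 1, active.start, passive.end)


active = Edge("S", ("NP", "VP"), 1, 0, 2)   # S -> NP . VP over (0, 2)
passive = Edge("VP", ("VB",), 1, 2, 3)      # VP -> VB . over (2, 3)
print(active)
print(fundamental_rule(active, passive))
```

The two original edges are left unchanged, which matches the point that nothing placed in the chart is ever removed.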

We work with an active chart parser that makes use of an agenda. An agenda is a data structure that keeps track of the things that still have to be done, and it is used to prioritize the constituents to be processed. When new edges are created, we have to remember to look at them to see whether they can be combined with other edges in any way [4]. So that the newly created edges are not forgotten, we store them in the agenda. We then take one edge at a time from the agenda, add it to the chart and use it to build new edges.

Chart parsing can apply the parsing algorithm in two main ways: top-down and bottom-up. Bottom-up chart parsing checks the input sentence and builds each constituent exactly once, and so avoids duplication of effort. However, bottom-up chart parsing may build constituents that cannot be used legally, whereas in top-down chart parsing only grammar rules that can be legally applied are put on the chart [52]. In addition, the bottom-up algorithm reads the rules right to left and starts with the information in passive edges, whereas the top-down algorithm reads the rules left to right and starts with the information in active edges. Top-down chart parsing uses the grammar rules to create active edges. The agenda holds at least one active edge to start the parsing process. This active edge is for the sentence symbol S and starts at position zero of the input sentence. The chart remains empty until an active edge is added to it. Thus, in our work, we propose Context Free Grammar as the grammar formalism and top-down chart parsing as the parsing algorithm. We selected the CFG formalism because it is easy to maintain, easy to understand and easy to extend with new language features. It also imparts structure to the language and allows an efficient parser to be built automatically [50]. We chose top-down chart parsing because it does well when there is useful grammar-driven control [4], and it has the advantages of both top-down and bottom-up parsing.

Generally, as discussed in Chapter 2, a chart parser is driven by an agenda of completed constituents and by arc extension, which combines active arcs with constituents when they are added to the chart. The technique of extending arcs with constituents can be applied in both the bottom-up and the top-down approach. The difference is in how new arcs are generated from the grammar. In the bottom-up approach, new active arcs are generated whenever a completed constituent is added that could be the first constituent of the right-hand side of a rule. In the top-down approach, new active arcs are generated whenever a new active arc is added to the chart. For this reason, the number of constituents generated by a top-down chart parser is smaller than the number generated by a bottom-up chart parser [4][7]. Therefore, the top-down chart parser is considerably more efficient than the bottom-up chart parser and the traditional (non-chart) top-down and bottom-up parsers. For these reasons, we employ top-down chart parsing in this thesis; its implementation is discussed in Chapter 4.

3.6 Summary

To sum up, this chapter discussed in detail the components of the sentence parser proposed in this work and presented the architecture of the Afan Oromo sentence parser. The proposed rule-based parser has three basic components: grammar rules, a lexicon and a parsing algorithm. The grammar component stores grammar rules written as context free grammar (CFG) rules, the grammatical formalism selected in this work to represent the grammar of the Afan Oromo language. The lexicon component serves as a dictionary for the parser by storing lexical rules, which are kept separate from the grammar rules. The parsing algorithm is the method for determining the exact structure of a sentence. The parser also has two additional components: the sentence tokenizer, an early processing step that splits the input sentence into units called tokens, and the lexical generator, which generates lexical rules automatically from the sample tagged corpus.

CHAPTER FOUR

4 IMPLEMENTATION RESULTS AND DISCUSSION

This chapter presents the implementation of the parser in detail. It discusses the development environment, corpus collection and preparation, extraction of the context free grammar rules and preparation of the lexicon. Because the main objective of this study is to develop a sentence parser for the Afan Oromo language using the top-down chart parsing algorithm, the top-down chart parser is discussed in detail. The evaluation of the lexicon generator and the chart parser is also presented, followed by a discussion in the final section.

4.1 Development Environment

We have developed a sentence parser that takes an input sentence from the user, parses it according to the CFG and the lexicon using the parsing method, and finally delivers the output to the user. The parser also has an automatic lexicon generator as one of its components. The lexicon generator takes manually tagged sentences and produces lexical rules automatically. We used the Python programming language and NLTK for the implementation.

4.2 Corpus Preparation

For our study, we collected 500 simple and complex sentences from different sources. About 70% of the sentences are simple and the remaining 30% are complex. Simple sentences dominate because there are several simple sentence types in Afan Oromo: declarative, exclamatory, interrogative and imperative. Some of the sentences are taken from previous research work in the area by Diriba [5], some are taken from the book Seer-luga Afan Oromo (Afan Oromo grammar), and the rest are taken from different Afan Oromo written documents. The sentences are tagged manually using the tag set developed by Abraham [12] based on the rules of the Afan Oromo language and verified by linguists of the Afan Oromo language. The POS tag set used in our work is given in Appendix 1.

4.3 Grammar Rules Extraction

In Afan Oromo sentence parsing, both the lexicon and the Context Free Grammar rules are necessary. The CFG rules supply the parser with a set of grammar rules and enable it to parse sentences based on those rules. To extract the grammar rules from the manually tagged sentences, we reviewed and identified the morphological properties and the word order of the language. The Context Free Grammar rules were then extracted manually from the collected corpus after studying the grammar of the Afan Oromo language, and were verified with the help of linguists.

CFG rules describe which sentence structures can be built from which sequences of words. In other words, according to [36], the grammar rules specify how we can determine whether a given sentence is valid or not. We therefore used the five types of Afan Oromo phrases identified by Baye [49] (noun phrase, verb phrase, adpositional phrase, adjectival phrase, and adverbial phrase) and the possible combinations of words from which these phrases can be formed. Our sample corpus contains manually POS tagged sentences, so the word category of each word in a sentence is known. When constructing the CFG, we examined the sentences and identified the part of speech tags that make up each phrase (noun phrase and verb phrase). A number of sentences have the same phrase structure, and others share common sub-phrases, so a single CFG rule can represent many sentences. Once we had the structure of a sentence, we transformed it into the proper CFG format: a non-terminal followed by an arrow and then the terminals or other non-terminals that can replace the non-terminal before the arrow, for example S -> NP VP, NP -> NN JJ and VP -> NN VB. Sample CFG rules are given in Appendix 2.
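The rule format described above (a non-terminal, an arrow, and the symbols that replace it) maps directly onto a simple data structure. The sketch below reads rules written in that format into a Python dictionary. The function name parse_rules and the use of "|" to separate alternative right-hand sides are assumptions of this illustration, and only the three example rules quoted in the text are used.

```python
def parse_rules(text):
    """Read 'LHS -> X Y | Z' lines into {LHS: [(X, Y), (Z,)]}."""
    grammar = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        lhs, arrow, rhs = line.partition("->")
        if not arrow:
            raise ValueError(f"missing '->' in rule: {line!r}")
        # Each alternative becomes one tuple of right-hand-side symbols.
        for alternative in rhs.split("|"):
            grammar.setdefault(lhs.strip(), []).append(tuple(alternative.split()))
    return grammar


RULES = """
S -> NP VP
NP -> NN JJ
VP -> NN VB
"""

print(parse_rules(RULES))
# {'S': [('NP', 'VP')], 'NP': [('NN', 'JJ')], 'VP': [('NN', 'VB')]}
```

Keeping the rules in a plain text form like this makes it easy to add or correct rules after review by linguists without touching the parser code.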

The CFG begins with the non-terminal S, which represents a sentence, followed by the phrases that can form S, most often a noun phrase (NP) and a verb phrase (VP). Each phrase on the right-hand side of S is expanded by other non-terminals (such as NN, JJ, VB, AV, etc.) in subsequent rules. As discussed in detail in Chapter 2, Afan Oromo has five phrase categories. The tag name of each Afan Oromo phrase type used in this work is shown in Table 4.1.

Table 4.1: Tag Name of Afan Oromo Phrases

Name of Phrase        Tag Name
Noun Phrase           NP
Verb Phrase           VP
Adverb Phrase         ADP
Adjective Phrase      JJP
Adposition Phrase     APCP

4.4 Generating Lexical Rules

We used the same corpus for generating the lexical rules as for extracting the CFG. The construction of the lexicon is a one-time process carried out at the very beginning of the parser implementation, but its result is needed every time a sentence is parsed. During lexicon construction, the lexicon generator reads the corpus from the local disk and goes through each sentence. While scanning each sentence, the generator identifies each POS tag name and the word associated with it. The output of the lexicon generator is a set of formal lexical rules, stored in the lexicon in the form tag -> “word”. The simple algorithm developed for the automatic lexicon generator is shown in Algorithm 4.1. Sample lexical rules generated by the lexicon generator from the sample corpus are given in Appendix 3.
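The lexicon generation step described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the word/TAG input format, the function name generate_lexical_rules and the tags in the example sentences are assumed. NLTK provides nltk.tag.str2tuple for splitting word/TAG tokens; a small local equivalent is used here so that the sketch runs without NLTK installed.

```python
def str2tuple(token, sep="/"):
    """Split 'word/TAG' into (word, TAG), like nltk.tag.str2tuple."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return token, None   # no separator: the token is untagged
    return word, tag.upper()


def generate_lexical_rules(tagged_sentences):
    """Turn tagged sentences into lexical rules of the form TAG -> 'word'."""
    rules = []
    seen = set()
    for sentence in tagged_sentences:
        for token in sentence.split():
            word, tag = str2tuple(token)
            if tag is None:
                continue  # skip tokens that carry no tag
            # Reverse (word, tag) into the rule format tag -> 'word'.
            rule = f"{tag} -> '{word}'"
            if rule not in seen:  # emit each lexical rule only once
                seen.add(rule)
                rules.append(rule)
    return rules


# Hypothetical tagging of two sentences from the sample corpus.
corpus = [
    "Tolosaan/NN mana/NN guddaa/JJ qaba/VB",
    "Abdiisaan/NN mana/NN citaa/NN ijaare/VB",
]
for rule in generate_lexical_rules(corpus):
    print(rule)
```

Removing duplicates is a design choice of this sketch; a word that occurs with two different tags still yields two rules, which preserves lexical ambiguity for the parser.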

Algorithm 4.1: Lexical Generator Algorithm

Input: Afan Oromo tagged sentences
Read the tagged sample corpus from local disk
Scan each sentence of the sample corpus
Identify the part of speech (POS) tag name and the word associated with it
For each word in each sentence of the sample corpus
    Call the str2tuple() built-in function from a Python library
    If the word and its tag are split in the form (word, tag)
        Reverse the form into (tag, word)
        Return the reversed value
        Generate the result in the proper format
        Return the result
    End If
End For
Output: lexical rules

4.5 Implementation of Chart Parser

Chart parsers use a set of rules to heuristically decide when an edge should be added to a chart. This set of rules, together with a specification of when they should be applied, forms a parsing strategy [39]. One of these rules, the fundamental rule, is used by every chart parser: it combines an active edge with an adjacent passive edge to form a new edge. In the new edge, the dot has moved one place to the right, and the span of the new edge is the combined span of the original edges. When we add this new edge, we do not remove the other two because they might be used again. The chart parsing algorithm selected for our work, top-down chart parsing, works in a similar way to a recursive descent parser. It starts with the top-level goal of finding an S, which is broken down into the sub-goals of finding constituents such as NP and VP predicted by the grammar rules of the Afan Oromo language. Hence, to apply the top-down chart parsing algorithm, we use the fundamental rule and three other rules:

The Top-Down Initialization Rule

The Top-Down Expand Rule

The Top-Down Match Rule

The Top-Down Initialization Rule: This rule captures the fact that the root of any parse must be the start symbol S. For each production S → α, add the self-loop edge [S → . α, (0, 0)]. For the production S → NP VP, this edge predicts that an NP and a VP will be found starting at position 0. In order to find an NP, we need to invoke a production that has NP on its left-hand side. This step is done by the next rule, the Top-Down Expand Rule.

The Top-Down Expand Rule: If the chart contains an incomplete edge whose dot is followed by a non-terminal, then the parser should add every self-loop edge licensed by the grammar whose left-hand side is that non-terminal.

The Top-Down Match Rule: This rule allows the predictions of the grammar to be matched against the input string. If the chart contains an incomplete edge whose dot is followed by a terminal, then the parser should add an edge if the terminal corresponds to the current input symbol.
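The three top-down rules, combined with the fundamental rule, can be sketched as a small agenda-driven recognizer in Python. This is a minimal sketch under stated assumptions and not the thesis implementation: it only reports whether a sentence is accepted and does not build a parse tree, it treats POS tags as pre-terminals so that looking up a word in the lexicon and matching it against the input happen in one step, and the toy grammar and lexicon are hypothetical.

```python
from collections import deque, namedtuple

# An edge [lhs -> rhs with the dot at index `dot`, (start, end)].
Edge = namedtuple("Edge", "lhs rhs dot start end")


def next_symbol(edge):
    """Symbol right after the dot, or None for a passive (complete) edge."""
    return edge.rhs[edge.dot] if edge.dot < len(edge.rhs) else None


def recognize(tokens, grammar, lexicon, start="S"):
    """Return True if `tokens` can be derived from `start`."""
    chart = set()
    # Initialization rule: add [S -> . alpha, (0, 0)] for every S production.
    agenda = deque(Edge(start, rhs, 0, 0, 0) for rhs in grammar.get(start, []))
    while agenda:
        edge = agenda.popleft()
        if edge in chart:
            continue  # never add the same edge twice
        chart.add(edge)
        symbol = next_symbol(edge)
        if symbol is None:
            # Fundamental rule: extend active edges that end where this
            # passive edge starts and are looking for its category.
            for other in list(chart):
                if other.end == edge.start and next_symbol(other) == edge.lhs:
                    agenda.append(other._replace(dot=other.dot + 1, end=edge.end))
        else:
            # Fundamental rule: extend this active edge with passive edges
            # that are already on the chart.
            for other in list(chart):
                if (next_symbol(other) is None and other.lhs == symbol
                        and other.start == edge.end):
                    agenda.append(edge._replace(dot=edge.dot + 1, end=other.end))
            # Expand rule: predict every production of the wanted non-terminal.
            for rhs in grammar.get(symbol, []):
                agenda.append(Edge(symbol, rhs, 0, edge.end, edge.end))
            # Match rule: the wanted POS tag covers the current input word.
            if edge.end < len(tokens) and tokens[edge.end] in lexicon.get(symbol, ()):
                word = tokens[edge.end]
                agenda.append(Edge(symbol, (word,), 1, edge.end, edge.end + 1))
    # Success: a passive S edge spans the whole sentence.
    return any(e.lhs == start and next_symbol(e) is None
               and e.start == 0 and e.end == len(tokens) for e in chart)


# Hypothetical toy grammar and lexicon, for illustration only.
GRAMMAR = {"S": [("NP", "VP")], "NP": [("NN",), ("NN", "JJ")], "VP": [("NP", "VB")]}
LEXICON = {"NN": {"Tolosaan", "mana"}, "JJ": {"guddaa"}, "VB": {"qaba"}}

print(recognize("Tolosaan mana guddaa qaba".split(), GRAMMAR, LEXICON))  # True
```

Because every edge is checked against the chart before it is processed, left-recursive rules do not cause the recognizer to loop, which is one of the problems of plain top-down parsing that the chart is meant to avoid.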

The main focus of this work is to parse the user's Afan Oromo sentence using the top-down chart parser. The parser interacts with the other components in order to do so. In our study, the parser uses two basic references at parsing time: the Context Free Grammar rules and the lexical rules. Initially, the parser accepts the input sentence produced by the tokenizer. Before the sentence goes through the parsing process, the parser consults the grammar rules and the lexical rules. The parser then scans through the sentence and requests the POS tag of each word from the lexicon.

Next, the chart is initialized with an active edge, which is a grammar rule that has the symbol S on its left-hand side. Active chart parsing keeps a record of the complete constituents found so far and also records what we are actually looking for and how much of it we have found. Such information is recorded in active edges or active arcs, e.g. S → NP . VP, which means: “we are trying to build an S consisting of an NP followed by a VP. So far, we have found an NP arc, and we are still looking for the VP”. The insignificant-looking dot “.” in the middle of the rule is therefore very important: it marks the boundary between what we have found so far and what we are still looking for. As an example of the fundamental rule, take the active edge S → NP . VP. Suppose that the VP is expanded by the rule VP → VB AV, giving the active edge VP → . VB AV, and that there is a passive edge of category VB starting where this active edge ends. The dot in the active edge is then moved one category forward, giving VP → VB . AV; in terms of S, this corresponds to S → NP VB . AV.

Our intention is to work with an active chart parser that makes use of an agenda. The agenda is also initialized with a grammar rule that has a non-terminal on its left-hand side; this non-terminal is the same as the non-terminal immediately after the arrow in the grammar rule in the chart. The grammar rule in the agenda moves to the chart to replace the first non-terminal on the right-hand side of S if it can replace it; otherwise, further grammar rules are added to the agenda until a rule that can replace the non-terminal is found. If there is a terminal that can replace the non-terminal, the parser replaces it and continues to the next non-terminal. If the non-terminal in the chart cannot be replaced by a terminal, the parser looks for other non-terminals that can replace it and that can themselves be replaced by terminals later on. This process continues until all non-terminals in S are replaced by terminals and the grammatical structure of the sentence S is recognized. This means that when new edges are created, the parser has to look at them to see whether they can be combined with other edges to create further new edges. It stores them in the agenda so that they are not forgotten. The agenda contains edges (grammar rules that can replace non-terminals in the chart and create new active edges). The parser then takes one edge at a time from the agenda, adds it to the chart and uses it to build new edges. The top-down chart parsing algorithm, which we adopted from [4][7] with a few modifications, is shown in Algorithm 4.2.

Algorithm 4.2: Top Down Chart Parsing Algorithm for Afan Oromo Sentences

Input: Afan Oromo sentence from the user
Scan and tokenize the input sentence
Check whether each word of the sentence is in the lexicon
If the words of the user sentence are in the lexicon
    Take the sentence
    Initialize the chart and the agenda
    Repeat the following until the agenda becomes empty:
        a. Take the first arc (grammar rule) from the agenda
        b. Add the arc to the chart (if the edge is not already on the chart)
        c. Combine this arc with arcs from the chart and add the obtained edges to the agenda
        d. Make a hypothesis about new constituents based on the arc and the rules of the grammar. Add these new arcs to the agenda.
    End repeat
    See if the chart contains a passive edge labeled S from the first node to the last node.
    If the chart contains a passive edge that covers all nodes of the sentence
        the parsing process succeeds;
    if not, the input sentence has a syntax error with respect to the production rules in the CFG.
    Return the parsed sentence (parse tree).
Output: Parse tree or "The sentence is not parsed"

4.6 Evaluations

This section discusses the evaluation results of the lexicon generator and the chart parser developed to parse Afan Oromo sentences, checking whether they produce the correct and expected results. The output of the parser was compared against sentences parsed manually by the researchers of this study. The comparison between the output of the system and the manually parsed sentences was also made manually by the researchers; hence, most of the evaluation was performed manually. The accuracy of the parser is estimated by counting the number of correctly parsed sentences and dividing it by the total number of parsed sentences.

4.6.1 Evaluation of Lexical Generator

We developed a simple lexicon generator algorithm that constructs lexical rules automatically from sample tagged sentences of the Afan Oromo language. The algorithm was implemented in the Python programming language, and the lexical rules are later used by the parser. The correctness of the lexical rules was inspected manually by checking, against the manually tagged sentences, whether the words are assigned to their proper word classes. The lexicon generator produced the expected lexical rules from the manually tagged sentences without any error. Figure 4.1 shows the output of the lexicon generator for five Afan Oromo sentences from the sample corpus. The selected sample sentences are “Tulluun nama gurraacha dha” ‘Tullu is a black man’, “Inni gara manaa deeme” ‘He went home’, “Inni kaleessa galgala dhufe” ‘He came yesterday evening’, “Tolosaan mana guddaa qaba” ‘Tolosa has a big house’ and “Abdiisaan mana citaa ijaare” ‘Abdisa built the thatched house.’

Figure 4. 1: Screenshot of Lexical Rules generated by the Lexicon Generator

4.6.2 Evaluation of AOSP Chart Parser

The chart parser uses only the extracted grammar rules and the lexicon produced by the lexicon generator from the manually tagged sample sentences to parse the input sentences. The input sentence enters the parsing process after it is tokenized word by word, and each word of the input sentence is checked for its presence in the lexicon generated from the corpus. The sentence tokenizer did not produce any errors, so it split every input sentence into words correctly. The system was trained on the training dataset repeatedly, and after the man-made errors in the manually tagged sentences and in the CFG rules manually extracted from the sample corpus were corrected, an accuracy of 98.25% was obtained.

To test the effectiveness of the parser, we used 100 other sentences selected from the corpus as a test set, with on average 20 sentences of each sentence type in the corpus. The correctness of the parser is examined by inspecting its output manually. The output is checked with respect to the correct categorization of words into their word classes, the correct identification of sub-phrases and main phrases, the correct order of sub-phrases in building main phrases, and whether all words and phrases are involved in the construction of the sentence S. Before testing, we therefore parsed the sentences manually on paper based on linguists' suggestions and comments, and then compared the output of the chart parser for the same sentences with the parses on paper. Any output that fails one of these criteria, or for which no parse tree is displayed at all, is counted as wrong. The result obtained when the parser was trained and run on the same data is shown in Table 4.2.

Table 4.2: Parsing results on the training set before error correction

Dataset         No. of sentences    No. of correctly parsed sentences    Accuracy (%)

Training set    400                 350                                  87.5

As one would expect, the accuracy should be high when a parser is trained and tested
on the same data. However, because of man-made errors in the manual tagging of the
sentences and in the context free grammar rules manually extracted from the sample
sentences, the accuracy was not as high as expected. Hence, to improve the accuracy of
the parser, we sent the corpus to linguists to check the correctness of the tagged
sentences as well as the grammar rules extracted by the researcher. The final accuracy
obtained on the training set after the errors were identified and corrected is displayed in
Table 4.3.

Table 4.3: Parsing results on the training set after correcting most of the errors

Dataset         No. of sentences    No. of correctly parsed sentences    Accuracy (%)

Training set    400                 393                                  98.25
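The accuracy figures in Tables 4.2 and 4.3 are the share of sentences whose parse agrees with the manual parse. A minimal sketch of this scoring is given below. In this study the comparison was made by manual inspection, so the exact-match test on nested tuples and the function names `is_correct`, `accuracy` and `score` are assumptions used only to make the arithmetic explicit.

```python
def is_correct(system_parse, gold_parse):
    """A sentence is correct only if a parse exists and equals the manual parse."""
    return system_parse is not None and system_parse == gold_parse


def accuracy(correct, total):
    """Percentage of correctly parsed sentences."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def score(system_parses, gold_parses):
    """Exact-match accuracy of system parses against manual (gold) parses."""
    if len(system_parses) != len(gold_parses):
        raise ValueError("both lists must cover the same sentences")
    correct = sum(is_correct(s, g) for s, g in zip(system_parses, gold_parses))
    return accuracy(correct, len(gold_parses))


# Figures reported for the training set, before and after error correction.
before_correction = accuracy(350, 400)
after_correction = accuracy(393, 400)
print(before_correction, after_correction)

# Toy illustration: one matching parse and one sentence with no parse at all.
tree = ("S", ("NP", ("NN", "Tolosaan")), ("VP", ("VV", "qaba")))
print(score([tree, None], [tree, tree]))
```

A missing parse is represented here by `None`, so it counts as a wrong output in the same way as a parse that differs from the manual one.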

On the other hand, testing on the unseen dataset (the testing set) of the corpus gave the
following results for each sentence type. The result of the parser for the testing dataset
on imperative sentences is shown in Table 4.4.

Figure 4.2: Screenshot of parsed imperative sentence

Table 4.4: Number of correctly parsed imperative sentences

Dataset        No. of imperative sentences    No. of correctly parsed sentences    Accuracy (%)

Testing set    20                             20                                   100

The accuracy obtained when testing the parser on imperative simple sentences is 100%.
The result of the parser for the testing dataset on exclamatory sentences is shown in
Table 4.5.

Figure 4.3: Screenshot of parsed exclamatory sentence

Table 4.5: Number of correctly parsed exclamatory sentences

Dataset    No. of exclamatory sentences    No. of correctly parsed sentences    Accuracy (%)

The result obtained when testing the parser on exclamatory simple sentence type is […].
The result of the parser for the testing dataset on declarative sentences is shown in
Table 4.6.

Figure 4.4: Screenshot of parsed declarative sentence

Table 4.6: Number of correctly parsed declarative sentences

Dataset    No. of declarative sentences    No. of correctly parsed sentences    Accuracy (%)

The accuracy obtained when testing the parser on declarative simple sentences is
approximately 95%. Table 4.7 presents the result of the parser for the testing dataset on
interrogative sentences.

Figure 4.5: Screenshot of parsed interrogative sentence

Table 4.7: Number of correctly parsed interrogative sentences

Dataset    No. of interrogative sentences    No. of correctly parsed sentences    Accuracy (%)

The accuracy obtained when testing the parser on interrogative simple sentences is
100%. The result of the parser for the testing dataset on complex sentences is shown in
Table 4.8.

Table 4.8: Number of correctly parsed complex sentences

Dataset    No. of complex sentences    No. of correctly parsed sentences    Accuracy (%)

The accuracy obtained when testing the parser on complex sentences is approximately
70%. Man-made errors in the manual parsing of complex sentences were identified as
one cause of the wrong parse assignments; incorrectly extracted context free grammar
rules of the language were another.

Figure 4.6: Screenshot of parsed complex sentence

Finally, the tests on the imperative and interrogative simple sentences gave 100%
accuracy, because all sentences of these types were parsed correctly. This can be
attributed to the fact that the sentences were uniform (they have the same kind of
constructs) and could therefore generate the correct parse structure [24]. Overall, the
chart parser developed to parse Afan Oromo sentences obtained an accuracy of 98.25%
on the training set, after correction of the errors encountered during the first training
run, and 91% on the test set, which is a promising result.
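The overall test figure can be related to the per-type figures with a short calculation. The sketch below assumes exactly 20 test sentences per sentence type (the text says 20 on average) and reads the approximate percentages as whole sentence counts. Under these assumptions, the count for the sentence type whose accuracy is not stated above follows from the overall 91%; this is an inference from the reported numbers, not a reported result, and the variable names are illustrative.

```python
PER_TYPE = 20          # assumed number of test sentences per sentence type
TOTAL_TEST = 100       # size of the test set
OVERALL_CORRECT = 91   # 91% of 100 test sentences

# Accuracies stated in the text for four of the five sentence types (%).
stated_accuracy = {
    "imperative": 100,
    "declarative": 95,
    "interrogative": 100,
    "complex": 70,
}

# Convert each percentage into a number of correctly parsed sentences.
correct = {
    kind: round(pct * PER_TYPE / 100) for kind, pct in stated_accuracy.items()
}

# Whatever remains of the 91 correct sentences belongs to the fifth type.
implied_exclamatory_correct = OVERALL_CORRECT - sum(correct.values())
implied_exclamatory_accuracy = 100.0 * implied_exclamatory_correct / PER_TYPE

print(correct)
print(implied_exclamatory_correct, implied_exclamatory_accuracy)
```

If the test set was not split evenly across the five types, the implied figure changes accordingly.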

4.7 Discussion

The sample corpus discussed in the section above was used for experimentation. It was
difficult for us to find experts on the Afan Oromo language while extracting the context
free grammar rules from the corpus. After a long search, we were able to reach linguists
of the Afan Oromo language via email, and we sent them the corpus prepared for this
study so that they could check its correctness. Each sentence in the corpus was then
tagged and hand parsed by the researchers based on the comments and suggestions of
the linguists. The sentences in the selected corpus are divided into a training dataset and
a testing dataset.

Man-made errors in the manual tagging and parsing process, which surfaced during the
training of the system, were one cause of wrongly parsed sentences. In addition, some
of the context free grammar rules of the language were incorrectly extracted, which
affected the performance of the parser. Extracting CFG rules for Afan Oromo sentences
was a challenging task, because the extraction was manual and because there is no
standardized, well-prepared Afan Oromo corpus of the kind required to conduct
conclusive experiments on the proposed parser. Nevertheless, we evaluated the parser
using the extracted context free grammar (CFG) rules of the Afan Oromo language,
comparing its output with the expected output, that is, the sentences manually parsed
by the researchers based on the linguists’ suggestions. There were also challenges
during the performance evaluation of the parser, because there is no standard set of
criteria for parsing Afan Oromo sentences that states which conditions must be satisfied
for a parse to be correct.

Several steps were taken during the experiment to deal with incorrectly parsed
sentences before we obtained the final 98.25% accuracy of the parser on the training
dataset. First, we reviewed the manually parsed sentences and corrected the errors in
them. Second, we reviewed the context free grammar rules that were extracted
manually from the sample sentences and corrected the errors in them. Third, where
more than one sentence had similar grammar rules, we reduced the number of CFG
rules. This is because, as the number of grammar rules increases, the efficiency and
accuracy of the parser decrease in terms of time and speed, as stated in the study of a
top-down chart parser for analyzing Arabic sentences [7]. A comparison of the
top-down chart parser developed in this study with the top-down and bottom-up
approaches is not presented due to time constraints.
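The reduction of repeated grammar rules mentioned above can be illustrated with a small sketch. In this study the rules were reviewed and reduced manually, so the function `merge_rules` and the example rules below are hypothetical; the sketch only shows that rules shared by several sentences need to be kept once.

```python
def merge_rules(rules_per_sentence):
    """Merge CFG rules extracted sentence by sentence, dropping exact repeats."""
    seen = set()
    merged = []
    for rules in rules_per_sentence:
        for lhs, rhs in rules:
            rule = (lhs, tuple(rhs))
            if rule not in seen:  # keep only the first occurrence of a rule
                seen.add(rule)
                merged.append(rule)
    return merged


# Hypothetical rules extracted from two hand-parsed sentences.
extracted = [
    [("S", ["NP", "VP"]), ("NP", ["NN"]), ("VP", ["NP", "VV"]), ("NP", ["NN", "JJ"])],
    [("S", ["NP", "VP"]), ("NP", ["NN"]), ("VP", ["NP", "VV"]), ("NP", ["NN", "NN"])],
]
grammar = merge_rules(extracted)
for lhs, rhs in grammar:
    print(lhs, "->", " ".join(rhs))
```

Removing exact repeats does not change the set of sentences the grammar accepts, but it shortens the list of rules the parser has to try for each non-terminal.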

To sum up, the developed parser has shown encouraging results in terms of covering
both simple and complex sentences and of constructing the lexical rules automatically
from the given sample corpus. This study used top-down chart parsing, an efficient
parsing approach, rather than relying only on traditional parsing methods such as the
plain top-down and bottom-up approaches.

CHAPTER FIVE

5 CONCLUSIONS AND RECOMMENDATIONS

This chapter summarizes the study as a whole, presents the conclusions drawn from the
findings of the experiment, and gives the researchers’ recommendations for future
work.

5.1 Conclusion

The common objective of natural language processing is to understand and extract
meaning from natural language input [24]. This process involves transforming the
natural language into a form in which the meaning is explicit and easily usable by an
application program. Sentence parsing is a process in which a flat input sentence is
converted into a hierarchical (tree) structure that corresponds to the units of meaning in
the sentence. The important concepts related to sentence parsing were briefly discussed
in this study. Rule-based and statistical approaches, the major approaches to sentence
parsing, were discussed, and the rule-based approach was employed for this study. The
different grammatical formalisms used to represent phrase structure rules in a language
were briefly reviewed. Literature on Afan Oromo grammatical categories was also
reviewed, because knowledge of the grammar of the language is the core component in
designing a rule-based sentence parser. Features of grammatical categories such as
gender and tense were not considered in parsing. The top-down chart parsing algorithm
and the components the parser requires to access the knowledge base and parse input
sentences with appropriate lexical categories were presented. Because the developed
parser is rule-based, it needs components that enable the system to learn how to parse
from grammar rules. This part is composed of the lexicon generator, the context free
grammar rules and the lexicon. The lexicon generator is the component used for the
automatic construction of the lexical rules. It uses the same POS-tagged corpus that is
used for the extraction of the context free grammar rules.

The preparation of the corpus used in this study and the challenges that the researchers
faced during its preparation were presented. The sample corpus is used to extract
grammar rules and generate lexical rules of the Afan Oromo language. Due to lack of
time, the sample corpus was tagged manually by the researchers based on the linguists’
suggestions and comments. The corpus is small because large corpora annotated with
POS tags are lacking. Experiments were also conducted: the parser was trained on 400
sentences and tested on 100 sentences from the sample corpus. As discussed in the
evaluation section, accuracies of 98.25% and 91% (the average result over the selected
sentence types) were obtained on the training dataset and the testing dataset,
respectively.

In general, this study designed the general architecture of a top-down chart parser for
the Afan Oromo language. We developed a method for constructing lexical rules
automatically from a tagged corpus. Another contribution of this study is the rule-based
top-down chart parser for Afan Oromo simple and complex sentences, which does not
require a treebank from which the parser learns how to parse through iterative training.
Since this study is the first work on parsing both simple and complex sentences for the
Afan Oromo language, it encourages Ethiopian students and researchers to take part in
sentence parsing, which may lead to the development of a full-fledged parser for the
Afan Oromo language.

5.2 Recommendations

The sample corpus and the grammar rules used in this study cannot be taken as
representative of the language; therefore, a larger corpus needs to be built. Sentence
parsing is not an easy task: it requires more time and more features to become
full-fledged. However, our study has shown that sentence parsing can be done
automatically using a top-down chart parsing algorithm for Afan Oromo simple and
complex sentences. There are many shortcomings in this work and in the area,
particularly for Afan Oromo, which should be addressed by interested researchers.
Their efforts may lead to an efficient, full-fledged sentence parser for the Afan Oromo
language. Hence, further improvements and modifications are required. We list below
additional features that can be added to increase the performance of the system,
together with future research directions.

• Preparing a processed (annotated) Afan Oromo corpus for experimentation is
recommended, in particular one appropriate for sentence parsing.

• Developing the Afan Oromo sentence parser further by adding an automatic
part-of-speech tagger and an automatic morphological analyzer, and by covering all
types of sentences with all attributes such as case, number, gender, person, tense and
definiteness, to increase the coverage of the current parser; a large dataset should be
used to make the parser full-fledged.

• Experimenting with how the sentence parser performs using stochastic approaches
and a larger sample dataset is also recommended. Although we did not experiment with
stochastic approaches due to time constraints, better results might be obtained.

• Replicating the work for other Ethiopian languages such as Tigrigna, Silte, etc.

REFERENCES

[1] A. S. Genemo, “Afaan Oromo Named Entity Recognition Using Hybrid

Approach,” MSc.Thesis, Department of Computer Science, School of Graduate

Studies, Addis Ababa University, Addis Ababa, 2015.

[2] A. Copestake, “Natural Language Processing,” in Natural Language Processing,

2004, pp. 2003–2004.

[3] N. Chomsky, Syntactic structures, Second Edi. New York, 2002.

[4] A. D. Mohammed, “A Top-Down Chart Parser for Amharic Sentences,”

MSc.Thesis, Department of Computer Science, School of Graduate Studies,

Addis Ababa University, Addis Ababa, 2015.

[5] D. Megersa, “An Automatic Sentence Parser for Oromo Language Using

Supervised Learning Technique,” MSc.Thesis, Department of Information

Science, School of Graduate Studies, Addis Ababa University, Addis Ababa,

[6] Jason, “Parsing.” [Online]. Available:

https://www.cs.cornell.edu/courses/cs4740/2012sp/lectures/parsing-intro-

4pp.pdf. [Accessed: 03-Feb-2017].

[7] A. Al-Taani, M. Msallam, and S. Wedian, “A Top-Down Chart Parser for

Analyzing Arabic Sentences,” Int. Arab J. Inf. Technol., vol. 9, no. 3, 2012.

[8] J. Daba and Y. Assabie, “A Hybrid Approach to the Development of

Bidirectional English-Oromiffa,” vol. 8686, pp. 228–235, 2014.

[9] A. Abeshu, “Analysis of Rule Based Approach for Afan Oromo Automatic

Morphological Synthesizer,” An Off. Int. J. Wollega Univ. Sci. Technol. Arts Res.

J., vol. 2, no. 4, pp. 2226–7522.

[10] M. Jakubíček, “Rule-Based Parsing of Morphologically Rich Languages,”

PhD.Dissertation, Faculty of Informatics, Masaryk University, 2012.

[11] M. L. Kejela, “Named Entity Recognition for Afan Oromo,” MSc.Thesis,

Department of Computer Science, School of Graduate Studies, Addis Ababa

University, Addis Ababa, 2010.

[12] A. T. Nedjo, D. Huang, and X. Liu, “Automatic Part-of-speech Tagging for

Oromo Language Using Maximum Entropy Markov Model (MEMM),” vol.

10, pp. 3319–3334, 2014.

[13] G. O. Ganfure and D. Midekso, “Design And Implementation Of Morphology

Based Spell Checker,” vol. 3, no. 12, pp. 118–125, 2014.

[14] D. Tesfaye, “A rule-based Afan Oromo Grammar Checker,” IJACSA - Int. J.

Adv. Comput. Sci. Appl., vol. 2, no. 8, pp. 126–130, 2011.

[15] G. Mamo, “Part of Speech Tagging for Afaan Oromo Language,” MSc.Thesis,

Department of Information Science, School of Graduate Studies, Addis Ababa

[16] A. Mohammed-hussen, “Part of Speech Tagger for Afaan Oromo Language

Using Transformational Error Driven Learning (TEL) Approach,” MSc.Thesis,

Department of Computer Science, School of Graduate Studies, Addis Ababa

[17] G. D. Dinegde and M. Y. Tachbelie, “Afan Oromo News Text Summarizer,” Int.

J. Comput. Appl., vol. 103, no. 4, pp. 975–8887, 2014.

[18] T. K. Hundesa, “Word Sense Disambiguation for Afaan Oromo Language,”

MSc.Thesis, Department of Computer Science, School of Graduate Studies,

Addis Ababa University, Addis Ababa, 2013.

[19] K. Abdisa, “Factoid Question Answering For Afaan Oromo,” MSc.Thesis,

Department of Information Science, School of Graduate Studies, Addis Ababa

[20] G. G. Eggi, “Afaan Oromo Text Retrieval System,” MSc.Thesis, Department of

Information Science, School of Graduate Studies, Addis Ababa University,

Addis Ababa, 2012.

[21] T. G. Debela, “Afaan Oromo Search Engine,” MSc.Thesis, Department of

Computer Science, School of Graduate Studies, Addis Ababa University, Addis

Ababa, 2010.

[22] M. Post and D. Gildea, “Parsers as language models for statistical machine

translation,” … Assoc. Mach. Transl. …, 2008.

[23] J. Katz-Brown et al., “Training a Parser for Machine Translation Reordering,”

Proc. 2011 Conf. Empir. Methods Nat. Lang. Process. (EMNLP 2011), pp. 183–192, 2011.

[24] A. Alemu, “Automatic Sentence Parsing for Amharic Text an Experiment Using

Probabilistic Context Free Grammars,” MSc.Thesis, Department of Information

Science, School of Graduate Studies, Addis Ababa University, Addis Ababa,

[25] D. G. Agonafer, “An Integrated Approach to Automatic Complex Sentence

Parsing for Amharic Text,” MSc.Thesis, Department of Information Science,

School of Graduate Studies, Addis Ababa University, Addis Ababa, 2003.

[26] A. Ibrahim, “A Hybrid Approach to Amharic Base Phrase Chunking and

Parsing,” MSc.Thesis, Department of Computer Science, School of Graduate

Studies, Addis Ababa University, Addis Ababa, 2013.

[27] D. D. K. Sleator, “Parsing English with a Link Grammar,” National Science

Foundation under grant CCR-8658139, Olin Corporation, and R. R. Donnelley

and Sons, New York, 1991.

[28] E. Charniak, “Statistical Parsing with a Context-free Grammar and Word

Statistics,” pp. 1–6, 1997.

[29] N. Khoufi, C. Aloulou, L. Hadrich, and B. Anlp, “ARSYPAR : A tool for parsing

the Arabic language,” Int. Arab Conf. Inf. Technol., 2013.

[30] B. M. Bataineh and E. A. Bataineh, “An Efficient Recursive Transition Network

Parser for Arabic Language,” World Congr. Eng. 2009, Vols I Ii, vol. II, pp.

1307–1311, 2009.

[31] N. Hambir, “Hindi Parser-based on CKY algorithm,” vol. 3, no. 2, pp. 851–853.

[32] W. W. Thant, T. M. Htwe, and N. L. Thein, “Context Free Grammar Based Top-

Down Parsing of Myanmar Sentences,” Int. Conf. Comput. Sci. Inf. Technol.

Pattaya Dec. 2011, pp. 71–75, 2011.

[33] H. Lian, “Chinese Language Parsing with Maximum-Entropy-Inspired Parser,”

M.S. Thesis, pp. 1–6, 2005.

[34] S. Shieber and Microtome, “An Introduction to Unification-Based Approaches

to Grammar,” Microtome Publishing, 2003. [Online]. Available:

https://nrs.harvard.edu/urn-3:HUL-InstRepos:11576719. [Accessed: 14-Feb-

2017].

[35] S. Sundararajan, “Probabilistic Context-Free Grammars in Natural Language

Processing,” pp. 1–6.

[36] B. S. R. L, S. Ishwar, and S. K. Ravindranath, “Context Free Grammar for

Natural Language Constructs- An implementation for Venpa class of Tamil

Poetry,” Indian Inst. Inf. Technol., pp. 128–136.

[37] W. A. Woods, “Transition network grammars for natural language analysis,”

Commun. ACM, vol. 13, pp. 591–606, 1970.

[38] M. Collins, “Probabilistic Context-Free Grammars (PCFGs),” Lect. Notes, pp.

1–18, 2011.

[39] D. Jurafsky and J. Martin, “Speech and Language Processing”, 2nd Edition.

[40] G. Weikum, “Foundations of statistical natural language processing,” ACM

SIGMOD Rec., vol. 31, no. 3, p. 37, 2002.

[41] M. Ailomaa, “Two Approaches to Robust Stochastic Parsing,” MSc.Thesis,

Computational Linguistics, Goteborg University, Lausanne, 2004.

[42] J. Brownlee, “Supervised and Unsupervised Machine Learning Algorithms,”

Mach. Learn. Mastery, pp. 1–9, 2016.

[43] R. Ouersighni, “Robust Rule-based Approach in Arabic Processing,” Int. J.

Comput. Appl., vol. 93, no. 12, pp. 31–37, 2014.

[44] O. Herzog and R. Rollinger, “A Flexible Parser for a Linguistic Development

Environment Gregor Erbach Choices in Parser Design Types of Grammars,”

Text Underst. LILOG, Springer, Berlin, no. Kay 1980, 1991.

[45] E. Othman, K. Shaalan, and A. Rafea, “A chart parser for analyzing modern

standard Arabic sentence,” MT Summit IX Work. Mach. Transl. Semit. Lang.

Issues Approaches, pp. 37–44, 2003.

[46] B. I. Thompson, “Afro Asiatic Language Family,” 2017. [Online]. Available:

http://aboutworldlanguages.com/afro-asiatic-language-family. [Accessed: 04-

Feb-2017].

[47] T. Gamta, “The Oromo language and the Latin alphabet,” J. Oromo Stud., pp.

10–13, 1992. [Online]. Available:

http://www.africa.upenn.edu/Hornet/Afaan_Oromo_19777.html. [Accessed: 06-Feb-2017].

[48] R. Kibble, “Introduction to natural language processing Undergraduate study in

Computing and related programmes,” 2013.

[49] B. Yimam, “The Phrase Structures of Ethiopian Oromo,”

PhD.Dissertation, Addis Ababa University, 1986.

[50] S. Alqrainy, S. Jordan, and M. S. Alkoffash, “Context-Free Grammar Analysis

for Arabic Sentences,” Int. J. Comput. Appl., vol. 53, no. 3, pp. 7–11, 2012.

[51] C. February, “Contents 1,” 2011. [Online]. Available:

https://en.wikipedia.org/wiki/Lexicon. [Accessed: 12-Nov-2016].

[52] S. C. Zhu, “Ch 4 Classic Parsing Algorithms Chart Parsing in NLP,” pp. 1–51.

[53] H. J. Fox, “Lexicalized, Edge-Based, Best-First Chart Parsing,” MSc.Thesis,

Dep’t of Computer Science, Massachusetts Institute of Technology, Brown

University, 1999.

APPENDICES

Appendix 1: Part of Speech Tags by Abraham (used in this study)

Tags    Descriptions                                     Examples

AD      A tag for all types of adverb in the language    Kaleessa, edana, yoomiyyuu, as, achi, yoom, yammuu, yeroo . . .

APC Adpositions and conjunctions

C A tag for all

prepositions/postpositions as well as predeterminers/postdeterminers and
conjunctions that are separated from other
