DSpace Institution
DSpace Repository http://dspace.org
Information Technology thesis
2020-03-20
DESIGN AND DEVELOP SENTENCE
PARSER FOR AFAN OROMO
LANGUAGE USING TOP-DOWN
CHART PARSING ALGORITHM
Beshada, Hailu
http://hdl.handle.net/123456789/10765
Downloaded from DSpace Repository, DSpace Institution's institutional repository
BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POSTGRADUATE STUDIES
FACULTY OF COMPUTING
DESIGN AND DEVELOP SENTENCE PARSER FOR AFAN
OROMO LANGUAGE USING TOP-DOWN CHART PARSING
ALGORITHM
Hailu Beshada Balcha
Bahir Dar, Ethiopia
May 2017
DESIGN AND DEVELOP SENTENCE PARSER FOR AFAN OROMO
LANGUAGE USING TOP-DOWN CHART PARSING ALGORITHM
Hailu Beshada Balcha
A Thesis submitted to the School of Research and Graduate Studies of Bahir Dar
Institute of Technology, Bahir Dar University in partial fulfillment of the
requirements for the degree of Master of Science in the Information Technology in
the Faculty of Computing
Supervised by: Tesfa Tegegne (PhD)
Bahir Dar, Ethiopia
May 2017
DECLARATION
I, the undersigned, declare that the thesis comprises my own work. In compliance
with internationally accepted practices, I have acknowledged and referenced all
materials used in this work. I understand that non-adherence to the principles of
academic honesty and integrity, misrepresentation/ fabrication of any
idea/data/fact/source will constitute sufficient ground for disciplinary action by the
University and can also evoke penal action from the sources which have not been
properly cited or acknowledged.
Name of the student: Hailu Beshada Balcha Signature _____________
Date of submission: May 24, 2017
Place: Bahir Dar, Ethiopia
This thesis has been submitted for examination with my approval as a university
advisor.
Advisor Name: Tesfa Tegegne (PhD)
Advisor’s Signature: __________________
Bahir Dar University
Bahir Dar Institute of Technology
School of Research and Graduate Studies
Faculty of Computing
THESIS APPROVAL SHEET
Student:
_____________________________________________________________________
Name Signature Date
The following graduate faculty members certify that this student has successfully
presented the necessary written final thesis and oral presentation for partial fulfillment
of the thesis requirements for the Degree of Master of Science in Information
Technology
Approved by:
Advisor: _____________________________________________________________________
Name Signature Date
External Examiner:
___________________________________________________________________
Name Signature Date
Internal Examiner:
_______________________________________________
Name Signature Date
Chair Holder:
_____________________________________________________________________
Name Signature Date
Faculty Dean:
_____________________________________________________________________
Name Signature Date
DEDICATION
This thesis is dedicated to my mother, Afrasa Robi, and my two sisters, Tadelech
Beshada and Nigatuwa Beshada, who made me the person I am today without having
attended school themselves, and to my lovely wife, Birtukan Sahile, whom I married
at the end of my master's coursework and the start of my thesis work.
ACKNOWLEDGEMENTS
Above all, I would like to thank the almighty and omnipresent God for giving me
strength from the beginning to the end of this research work. “Yaa Uumaa koo galanin
siif haa ta’u! Amiin!”. Next, it is a pleasure to thank the many people who made this
thesis possible. I gratefully acknowledge the supervision of my advisor,
Tesfa Tegegne (PhD), and his abundant help and constructive comments.
My great thanks also go to Gebeyehu Beyene (PhD) for his constructive comments on
the draft report of this work. I also thank Mr. Jabesa Daba and Mr.
Kasahun Abdisa of Wollega University, who provided me with important reading
materials and constructive suggestions that helped me greatly in my thesis work. It is
an honor to express my special appreciation to my colleagues for their
ideas, directions, comments, and encouragement. It is also a pleasure to thank all
those who helped me in various ways while I was working on this study.
Last, but by no means least, I want to thank my lovely wife and family, whose
love and guidance are with me in whatever I pursue.
ABSTRACT
Many sentence parsers have been developed for foreign languages such as English
and Arabic and, among the local languages of Ethiopia, for Amharic. Parsing
Afan Oromo sentences is likewise a necessary mechanism for other natural
language processing applications such as machine translation, question answering,
knowledge extraction, and information retrieval. We therefore developed a
rule-based parser using a top-down chart parsing algorithm for Afan Oromo sentences,
covering both simple and complex sentences. Context Free Grammar (CFG) was
used to represent the grammar of the language. A sample corpus of 500 sentences was
prepared, and the CFG was extracted manually from the tagged sample corpus. We also
developed a simple lexicon generator algorithm to generate the lexical
rules automatically. The Python programming language and NLTK were used as
implementation tools for this study. The parser was then evaluated experimentally. It
was trained on a training dataset of 400 sentences, with an accuracy of 98.25%, and
tested on a testing dataset of 100 sentences, with an accuracy of 91%. These
experimental results are encouraging, since this is the first work covering both simple
and complex sentences of the Afan Oromo language. Finally, conclusions and
recommendations for future work are presented in the last chapter.
Keywords: NLP, Parser, context free grammar, top-down chart parser, lexicon
generator, lexicon.
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF TABLES
ABBREVIATIONS AND ACRONYMS
CHAPTER ONE
1 INTRODUCTION
1.1 Background
1.2 Statement of the Problem
1.3 Objectives
1.3.1 General objective
1.3.2 Specific objectives
1.4 Methodology
1.4.1 Literature Review
1.4.2 Data Collection
1.4.3 Tools and Techniques
1.4.4 Evaluation
1.5 Scope and Limitation
1.6 Significance of the study
1.7 Organization of the thesis
CHAPTER TWO
2 LITERATURE REVIEW
2.1 Introduction
2.2 Works so far on sentence parser
2.2.1 Local works on sentence parser
2.2.2 Global works on sentence parser
2.3 Grammar Formalism
2.3.1 Context Free Grammar
2.3.2 Context Sensitive Grammar
2.3.3 Transition Network Grammar
2.3.4 Unification Based Grammar
2.3.5 Probabilistic Context Free Grammar
2.4 Sentence Parsing Approaches
2.4.1 Stochastic Approaches
2.4.2 Rule-based Approaches
2.5 Afan Oromo Grammar
2.5.1 Word order
2.5.2 Word Categories
2.5.3 Phrases Categories
2.6 Afan Oromo Sentences
2.6.1 Simple Sentences
2.6.2 Complex Sentences
2.7 Summary
CHAPTER THREE
3 DESIGN OF AFAN OROMO SENTENCE PARSER
3.1 Components of Afan Oromo Sentence Parser (AOSP)
3.2 Context Free Grammar (CFG)
3.3 Sentence Tokenizer
3.4 Lexicon Generator
3.5 AOSP Chart Parser
3.6 Summary
CHAPTER FOUR
4 IMPLEMENTATION RESULTS AND DISCUSSION
4.1 Development Environment
4.2 Corpus Preparation
4.3 Grammar Rules Extraction
4.4 Generating Lexical Rules
4.5 Implementation of Chart Parser
4.6 Evaluations
4.6.1 Evaluation of Lexical Generator
4.6.2 Evaluation of AOSP Chart Parser
4.7 Discussion
CHAPTER FIVE
5 CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations
REFERENCES
APPENDICES
Appendix 1: Part of Speech Tags by Abraham (used in this study)
Appendix 2: Sample Context Free Grammar Extracted from corpus
Appendix 3: Sample Lexical Rules Generated by the Lexicon Generator
Appendix 4: Sample parsed sentences by the parser
LIST OF FIGURES
Figure 3.1: Architecture of Sentence Parser for Afan Oromo Language
Figure 4.1: Screenshot of Lexical Rules generated by the Lexicon Generator
Figure 4.2: Screenshot of parsed imperative sentence
Figure 4.3: Screenshot of parsed exclamatory sentence
Figure 4.4: Screenshot of parsed declarative sentence
Figure 4.5: Screenshot of parsed interrogative sentence
Figure 4.6: Screenshot of parsed complex sentence
LIST OF ALGORITHMS
Algorithm 4.1: Lexical Generator Algorithm
Algorithm 4.2: Top Down Chart Parsing Algorithm for Afan Oromo Sentences
LIST OF TABLES
Table 4.1: Tag Name of Afan Oromo Phrases
Table 4.2: Parsing a result on training set before making number of error correction
Table 4.3: Parsing a result on training set after making most of error correction
Table 4.4: Number of correctly parsed imperative sentences
Table 4.5: Number of correctly parsed Exclamatory Sentences
Table 4.6: Number of correctly parsed Declarative Sentences
Table 4.7: Number of correctly parsed Interrogative Sentences
Table 4.8: Number of correctly parsed Complex Sentences
ABBREVIATIONS AND ACRONYMS
ADP Adverbial Phrase
AOSP Afan Oromo Sentence Parser
APCP Adpositional Phrase
ATB Arabic Tree Bank
ATN Augmented Transition Network
CFG Context Free Grammar
CKY Cocke Kasami Younger
CNF Chomsky Normal Form
CSG Context Sensitive Grammar
CTB Chinese Tree Bank
FSM Finite State Machine
HMM Hidden Markov Model
IE Information Extraction
IR Information Retrieval
JJP Adjectival Phrase
LHS Left Hand Side
NLP Natural Language Processing
NP Noun Phrase
PCFG Probabilistic Context Free Grammar
POS Part of Speech
QA Question Answering
RHS Right Hand Side
SOV Subject-Object-Verb
TNG Transition Network Grammar
UBG Unification Based Grammar
VP Verb Phrase
CHAPTER ONE
1 INTRODUCTION
1.1 Background
Language is one of the fundamental aspects of human behavior and a
crucial component of our lives. A natural language is a language that is spoken by
people. According to Abdi [1], natural language processing (NLP) is a theoretically
motivated range of computational techniques for analyzing and representing naturally
occurring texts at one or more levels of linguistic analysis for the purpose of achieving
human-like language processing for a range of tasks or applications. NLP can also be
defined as the automatic or semi-automatic processing of human language [2]. It involves
different levels of analysis, namely tokenization, lexical analysis, syntactic analysis,
semantic analysis, and pragmatic analysis. Among these, our focus is on syntactic
analysis (parsing), which provides the order and structure of each sentence in the text.
Natural language processing systems take strings of words (sentences) as their input
and produce structured representations capturing the meaning of those strings as their
output. The nature of this output depends heavily on the task at hand. “In the context of
natural language processing, the process of assigning structural descriptions to
sequences of words is called parsing” [3].
Parsing is the process of analyzing a sentence by taking each word and determining the
linguistic structure of the sentence from its constituent parts. The parsing process makes
use of two components: a parser and a grammar. The parser is a procedural component, a
computer program, whereas the grammar is a declarative component. Both the grammar
and the parser depend on the grammar formalism. The term parser is used in cases where
the sentences are made up of information units of any kind, and therefore it also deals
with a number of subproblems such as identifying constituents that can fit together and
testing the compatibility of a number [4]. Sentence parsing is one of the steps in designing
a functional NLP application, and it can work in cooperation with, and as input to, many
other NLP applications such as grammar checkers and machine translation. It is also
called syntax analysis [5]: the process of identifying how words can be put
together to form a correct sentence, and of determining what structural role each word
plays in the sentence, which phrases are subparts of which other phrases, which words
modify which other words, and which words are central to the sentence as a whole.
Thus, parsing has an important role in semantic processing, which operates on sentence
constituents. If there is no syntactic parsing step, then the semantic system must decide
on its own constituents [4]. If parsing is done, however, it constrains the number of
constituents that semantics can consider. The focus of this study is to develop a sentence
parser for the Afan Oromo language using a top-down chart parsing algorithm and the
Context Free Grammar (CFG) formalism to represent Afan Oromo grammar rules.
A chart parser combines the advantages of the top-down and bottom-up approaches, and
its main objective is to improve the efficiency of the parser by doing so. According to
Jason [6], in chart parsing, parsing an n-word sentence consists of forming a chart with
n + 1 vertices and adding edges to the chart one at a time. There is no backtracking:
everything that is put in the chart stays there, and the chart contains all the information
needed to create a parse tree. A chart parser is driven by an agenda of completed
constituents and by arc extension, which combines active arcs with constituents when
they are added to the chart [7]. The technique of extending arcs with constituents can be
applied in both the bottom-up and the top-down approach. The difference is in how new
arcs are generated from the grammar. In the bottom-up approach, new active arcs are
generated whenever a completed constituent is added that could be the first constituent
of the right-hand side of a rule, whereas in the top-down approach, new active arcs are
generated whenever a new active arc is added to the chart [4][7]. For this reason,
Abdurheman [4] states that the number of constituents generated by a top-down
chart parser is smaller than the number generated by a bottom-up chart parser.
Therefore, the top-down chart parser is considerably more efficient for
any reasonable grammar.
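The chart mechanism described above can be illustrated with a minimal sketch in plain Python. It is not the AOSP implementation and does not use NLTK. The names `Edge`, `top_down_chart` and `recognizes`, the three grammar rules, and the tag names are all hypothetical, and the tokens are English stand-ins arranged in subject-object-verb order rather than Afan Oromo data. The sketch assumes that every input word is listed in the lexicon.

```python
from collections import namedtuple

# An edge covers vertices start..end; `dot` counts the RHS symbols found so far.
Edge = namedtuple("Edge", "lhs rhs dot start end")

# Hypothetical toy grammar in SOV order. The tokens are English stand-ins,
# not Afan Oromo data and not rules taken from this thesis.
GRAMMAR = {"S": [("NP", "VP")], "VP": [("NP", "V")], "NP": [("N",)]}
LEXICON = {"child": "N", "book": "N", "bought": "V"}


def is_complete(edge):
    return edge.dot == len(edge.rhs)


def top_down_chart(tokens, grammar=GRAMMAR, lexicon=LEXICON, start="S"):
    """Build the chart for `tokens`; vertices are numbered 0..len(tokens)."""
    chart, agenda = set(), []

    def add(edge):
        # Edges are only ever added, never removed: no backtracking.
        if edge not in chart:
            chart.add(edge)
            agenda.append(edge)

    for i, word in enumerate(tokens):            # one lexical edge per word
        add(Edge(lexicon[word], (word,), 1, i, i + 1))
    for rhs in grammar[start]:                   # initial top-down prediction
        add(Edge(start, rhs, 0, 0, 0))

    while agenda:
        edge = agenda.pop()
        if is_complete(edge):
            # Arc extension: advance active arcs waiting for this constituent.
            for arc in list(chart):
                if (not is_complete(arc) and arc.rhs[arc.dot] == edge.lhs
                        and arc.end == edge.start):
                    add(arc._replace(dot=arc.dot + 1, end=edge.end))
        else:
            needed = edge.rhs[edge.dot]
            # Top-down step: a new active arc creates arcs for what it needs.
            for rhs in grammar.get(needed, []):
                add(Edge(needed, rhs, 0, edge.end, edge.end))
            # Arc extension with constituents already in the chart.
            for done in list(chart):
                if (is_complete(done) and done.lhs == needed
                        and done.start == edge.end):
                    add(edge._replace(dot=edge.dot + 1, end=done.end))
    return chart


def recognizes(tokens, start="S"):
    """True if a complete `start` edge spans the whole input."""
    chart = top_down_chart(tokens, start=start)
    return any(is_complete(e) and e.lhs == start and e.start == 0
               and e.end == len(tokens) for e in chart)


print(recognizes(["child", "book", "bought"]))   # SOV order
print(recognizes(["bought", "child", "book"]))   # verb-initial order
```

With this toy grammar the first call prints True and the second prints False, because no noun phrase can start at vertex 0 of the verb-initial input. The sketch only recognizes sentences; building parse trees would additionally require recording which edges were combined.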
In the current world, the amount of accessible electronic information has exploded. Due
to the rapid expansion of the Internet and its use for communication and dissemination of
information throughout the world, electronic information sources are now available in
an ever-increasing number of languages. As Jabesa [8] mentioned in his work, users of
such globally distributed networks (including digital libraries and the World Wide Web)
need to be able to access and retrieve any relevant information in whatever language
and form it may have been recorded and stored. However, according to Abebe [9],
most developing countries have no systematic programs for the collection, analysis, and
dissemination of available information to potential users. One of the barriers to this
is the absence of a full-fledged online machine translation system that can translate texts
from a foreign language to a local one, for example from English to Afan Oromo. Thus,
machine translation systems, which require a parser as a component, are of paramount
importance for the delivery of electronic resources. Therefore, the need
for NLP systems such as a sentence parser is unquestionable for Afan Oromo. Afan
Oromo is the official language of Oromia National Regional State. It is used
in offices, schools, colleges, universities, and the media. Thus, the availability of a huge
amount of electronic and non-electronic data motivated us to develop an NLP application.
“For computational linguists, parsing corresponds to produce some sort of a structure
that fits and confirms a particular theory of syntax or language in general” [10]. We
view parsers as standard NLP tools that are not a final goal in themselves but
contribute to improving other applications and serve many tasks. Thus, we are
motivated to develop an Afan Oromo Sentence Parser using the
top-down chart parsing approach.
1.2 Statement of the Problem
Afan Oromo is one of the major languages widely spoken in Ethiopia.
Currently, it is the official language of the regional state of Oromia (the largest regional
state in Ethiopia), where it is used as a working language in offices and as the medium of
instruction in primary and junior-secondary schools, and is taught as a subject in
secondary schools (grades 9-12). As Mandafro reports in his work [11], at the country
level, 8 of Ethiopia's public universities offer degree programs
majoring in Afan Oromo, and Addis Ababa University offers Afan Oromo
at Master’s degree level. Whereas Amharic, another major language and the working
language of Ethiopia, belongs to the Semitic family, Afan Oromo is part of the
lowland east Cushitic group within the Cushitic family of the Afro-Asiatic phylum.
According to Abebe [9], Afan Oromo is spoken not only in Ethiopia but
also in Somalia, Kenya, Uganda, Tanzania, and Djibouti. Although Afan Oromo
is today spoken by such a large number of people, few advances have been made in
computational linguistics or natural language processing for the language.
“Computational approaches to linguistic analysis of Afan Oromo so far have been
hindered due to non-availability of well-studied linguistic resources” [12]. Since Afan
Oromo is the official language of Oromia National Regional State, as
mentioned above, and is used in offices, schools, colleges, universities, and the media,
various written materials are nowadays being published both electronically and
non-electronically. This has created interest in NLP research on the language. For instance,
morphological synthesizer [9], spell checker [13], grammar checker [14], part of speech
tagging [15][16][12], named entity recognition [1], news text summarization [17],
machine translation [8], word sense disambiguation [18], question answering [19], text
retrieval [20], and search engines [21] are among the NLP
applications that require a sentence parser for successful and full-fledged
implementation. Besides, a sentence parser is a useful NLP application in the teaching and
learning process, for identifying phrases and understanding word relations in sentences of
the Afan Oromo language. It is also an important tool in NLP that serves as an
intermediate component for different higher-level applications such as machine translation
[4].
On the other hand, as mentioned in the section above, the Internet is one of the
main sources of information. The enormous amount of information on the Internet
could be used to enhance development by making it accessible to the public. To fully
localize and utilize the resources available on the Internet, translation of
documents from one language to another may be necessary. For example, many
documents on the Internet are written in English. Because of this, English to Afan
Oromo translation, and vice versa, may be required in syntax-based machine translation
[22]. Besides, according to [23], parsers have become efficient and accurate enough to
be useful in many natural language processing systems, most notably in machine
translation. Therefore, machine translation, which takes Afan Oromo
sentences as input and uses a sentence parser as a component, plays a great role in solving
the translation problem. Thus, we proposed to develop a sentence parser for the Afan
Oromo language.
To this end, the researcher reviewed the literature to find out whether there is any
sentence parser that can parse both simple and complex sentences in Afan Oromo.
To the best of the researcher’s knowledge, there is no Afan Oromo
sentence parser for both simple and complex sentences. However, there is one attempt
by [5] at an automatic sentence parser for Afan Oromo using a supervised learning
technique for simple declarative Afan Oromo sentences. In that study, the chart algorithm
was used. In addition, an unsupervised learning algorithm was designed to guide
the parser in predicting unknown and ambiguous words in a sentence. It also adopted an
intelligent (rule-based learning module) approach to develop a prototype. The result
obtained was 95% on the training dataset and 88.5% on the test dataset. The parser was
developed purely on an intelligent (hybrid of rule-based and supervised
learning) system approach, and a tagger, which could have been used
as a preprocessor to the parser, was not included. It was developed only for simple
declarative sentences of Afan Oromo. For this reason, the researcher is motivated to
develop a parser for both simple and complex Afan Oromo sentences. The focus of this
study is therefore on designing and developing a sentence parser for Afan Oromo text
that covers both simple and complex sentences. The parser will be of
major significance to users of the language. Moreover, as the nature and structure of
sentence parsing (syntactic parsing) in Afan Oromo differs from that in English, Amharic,
or other languages, sentence parsers developed for those languages cannot be
functional for Afan Oromo. This is because the language has a
different syntactic and morphological nature and its own grammatical
and word formation rules. As a result, an independent sentence parser is needed
for Afan Oromo. We therefore
decided to develop a sentence parser for Afan Oromo simple and complex sentences
using the top-down chart parsing algorithm.
Based on the above justification, this study attempts to answer the following questions:
- What are the properties and word orders of the Afan Oromo language?
- Is it possible to use sentence parsers of other languages for the Afan Oromo language?
- Does the adoption of parsing algorithms from other languages work for the Afan Oromo
language?
1.3 Objectives
1.3.1 General objective
The general objective of this research is to design a sentence parser for the Afan Oromo
language using a top-down chart parsing algorithm.
1.3.2 Specific objectives
In order to achieve the general objective of this research, the following specific
objectives are formulated:
- To identify the properties of Afan Oromo sentences based on the knowledge
base of the language, namely the basic word order, word categories,
morphological properties, phrase structure, and sentences in the language that
are useful for sentence parsing.
- To select sample sentences that would potentially serve for the experiment.
- To extract appropriate grammar rules to represent the structure of Afan Oromo
sentences.
- To design a general architecture of the Afan Oromo parser.
- To develop a simple algorithm for a lexical generator in order to automatically
generate lexical rules from the sample corpus.
- To select and customize an appropriate parsing algorithm for the Afan Oromo
sentence parser.
- To evaluate the performance of the parser.
1.4 Methodology
Developing a sentence parser for the Afan Oromo language requires exploring the
characteristics of the language and the different approaches that can be used for the
development. The following methods were followed
to achieve the general and specific objectives of this thesis work.
1.4.1 Literature Review
Sentence parsers previously developed for Afan Oromo and other languages were
reviewed to understand the techniques by which a sentence parser works.
Related literature such as research papers, books, previous related
research work, and electronic materials on the web was reviewed to gain
better knowledge of the phrase structure of the Afan Oromo language and
to become aware of the strategies and techniques for parsing a sentence and for
passing the sentence to a parser. This study employs the rule-based approach to develop
an Afan Oromo sentence parser for both simple and complex sentences. The
rule-based approach was selected because some argue that parsers developed using
it require less storage and are ten times faster than those developed using
stochastic approaches [5]. The approaches are presented in detail in Chapter 2.
1.4.2 Data Collection
500 Afan Oromo sentences of both simple and complex types were collected from Afan
Oromo text sources such as an Afan Oromo grammar book (Seer-luga Afan Oromo), previous
thesis papers, and other materials written in the language. Of the sample corpus,
around 40 sentences are from [5], 300 sentences are from Seer-luga Afan Oromo, and the
remaining 160 sentences are from other written materials. The corpus was then given to
linguists in order to get feedback on the correctness of the manual parses and the manual
extraction of the context free grammar rules.
1.4.3 Tools and Techniques
We designed the general architecture of the Afan Oromo sentence parser.
A parsing algorithm was selected and customized to develop a sentence parser for Afan
Oromo sentences based on the grammar rules of the language and lexical rules that were
automatically generated from the collected sample corpus. The Python programming
language and NLTK were used as implementation tools for this study.
1.4.4 Evaluation
The experiment was conducted in two phases in order to evaluate the parser: first on the
training dataset and then on the test dataset. The outputs have been crosschecked against
manually parsed sentences to determine how similar they are. Our sample data is still
small, although it is larger than the sample data of the previous work. Of the sample
corpus, 400 sentences (80%) were used as the training dataset while the remaining 100
sentences (20%) were used as the test dataset. Finally, the figures obtained from the
observed results have been statistically summarized and analyzed in tables to report the
attained accuracy level.
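The 80/20 division and the accuracy figure described above can be sketched as follows.
This is only an illustration under stated assumptions: the thesis does not say how the
400 training sentences were chosen, so the random shuffle, the fixed seed and the
function names split_corpus and accuracy are hypothetical, and the corpus is a
placeholder list.

```python
import random


def split_corpus(sentences, train_ratio=0.8, seed=0):
    """Shuffle a copy of the corpus and cut it into training and test parts.

    Assumption: a random split; the thesis does not state the selection method.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


def accuracy(parser_output, manual_parses):
    """Share of sentences whose automatic parse equals the manual parse."""
    matches = sum(1 for auto, gold in zip(parser_output, manual_parses) if auto == gold)
    return matches / len(manual_parses)


# Placeholder corpus standing in for the 500 collected sentences.
corpus = [f"sentence_{i}" for i in range(500)]
train, test = split_corpus(corpus)
print(len(train), len(test))
```

With 500 sentences and a ratio of 0.8, the two parts contain 400 and 100 sentences.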
1.5 Scope and Limitation
The scope of the thesis is limited to demonstrating the potential of the rule-based
approach to design and develop a sentence parser for the Afan Oromo language using
top-down chart parsing. In this research, simple sentences and complex sentences are
considered. In this study a complex sentence is composed of one independent clause and
one or more dependent clauses. However, a sentence which has one or more independent
clauses and two or more dependent clauses (a compound complex sentence) is out of the
scope of this research, due to the absence of clearly stated grammar rules in the
literature and the lack of a well-prepared, publicly available corpus for research
purposes. Parsing of grammatical categories, which indicate features such as gender,
case and tense, is also not considered. An automatic morphological analyzer and a part
of speech tagger are also not included in this work.
On the other hand, preparing detailed grammar rules and tagging the sentences with
their correct word categories were very difficult, because the entire sample corpus was
manually annotated by the researcher and its correctness was verified by linguists. This
is because an automatic morphological analyzer and an automatic part of speech tagger
are not integrated with our parser. As a result, the work took much time and effort.
1.6 Significance of the study
As discussed in the sections above, the parser has a vital role in different areas of NLP
applications. Thus, the beneficiaries of this study include researchers who are, or who
want to be, involved in increasing the capability of computer processing of the Afan
Oromo language. That is, the sentence parser can be used as a component in the
development of high level NLP applications. Researchers in the areas of phrase
recognition, conceptual parsing, machine translation, question answering, grammar
checking, text summarization, etc. are therefore among the main beneficiaries. In
addition, linguistics teachers and students in the field of the Afan Oromo language will
also benefit from this study by using it to parse sentences in the language. Finally, this
study may also contribute to the advancement of the Afan Oromo language in its use of
technology.
1.7 Organization of the thesis
The remaining part of the thesis is organized as follows. Chapter 2 covers the literature
review, in which different concepts and approaches important for our work are
presented. In addition, works related to our study, which were done for Afan Oromo and
other languages, are also presented. Moreover, the grammar of the Afan Oromo
language, such as word order, word classes, phrase structures, and different types of
sentences, is discussed in detail. Chapter 3 deals with the design of the proposed
system. It presents the general architecture of the system with its basic components and
discusses the components and their interaction in the system. Chapter 4 focuses on the
detailed implementation and discussion of the system. It discusses the algorithms we
used to achieve the goals of the components in the proposed system. The evaluation of
the system and the results obtained are also presented in this chapter. Chapter 5
presents the conclusions of our work and recommendations to interested researchers for
the improvement of our system.
CHAPTER TWO
2 LITERATURE REVIEW
2.1 Introduction
According to Abdi [1], natural language processing (NLP) is an interdisciplinary area
based on many fields of study, which is used for designing and building software that
can analyze, understand, and generate natural language. Some of the tasks of NLP that
provide a potential means of gaining access to the information inherent in large
amounts of text are information retrieval (IR) and information extraction (IE).
Information retrieval systems typically allow a user to retrieve documents from a large
database. NLP is a computational method that automates the translation process
between computer and human languages. It is a method of getting a computer to read a
line of text with understanding, without the computer being fed some sort of clue or
calculation. The goal is to enable natural languages, such as English, Amharic, Afan
Oromo and others, to serve as the medium through which users interact with computer
systems.
The aim of NLP researchers is to gather knowledge on how human beings understand and
use language so that appropriate tools and techniques can be developed to make
computer systems understand and manipulate natural languages to perform the desired
tasks. This is based on both a set of theories and a set of technologies [2]. In examining
how the syntactic structure of a sentence can be computed, the main things to consider
are the grammar and the parsing technique. Grammar is a formal specification of the
structures allowable in the language, whereas the parsing technique is the method of
analyzing a sentence to determine its structure according to the grammar. Several types
of grammatical formalism and parsing approaches which are used to parse sentences
are briefly discussed in the following sections of this chapter.
In addition, much work has been done in NLP in recent years on different languages.
Among those works, a sentence parser is one of the most important NLP tools. Even
though, to the researcher's knowledge, there has been only one attempt at sentence
parsing for the Afan Oromo language, much work has been done in different languages
on different aspects of parsing based on various approaches. Thus, we review previous
works on Afan Oromo and other languages that are most related to our study as follows.
2.2 Works so far on sentence parser
2.2.1 Local works on sentence parser
A few sentence parsers have been developed for local languages, as a result of the
increasing demand for precise and exact information. It has been realized that the
previous information retrieval mechanisms alone would not be enough to satisfy users'
needs. Below we present the sentence parsers for local languages by their respective
researchers.
Parser for Afan Oromo Language
The first related work we review is the sentence parser for the Afan Oromo language by
Diriba [5]. He developed the first automatic Afan Oromo sentence parser, which was
aimed at parsing simple declarative sentences. The study was conducted using the chart
algorithm with the Head-driven Phrase Structure Grammar formalism compiled into a
left-to-right table. It is a representation that allows the number of syntactic rules to be
minimized and provides a rich and well-structured lexical representation. The system
also used a supervised learning algorithm to enable the parser to predict unknown and
ambiguous words. In his work, the sample corpus consisted of 352 sentences from the
handout 'Seer-luga Afan Oromo'. The sample data was divided into two sets: the
training dataset, which contains 300 sentences, and the testing dataset with the
remaining 52 sentences. However, in addition to the small number and similar type of
the sentences, a part of speech tagger that preprocesses the text to improve the
performance of the parser was not included in this work, although the result obtained
was 95% on the training set and 85.5% on the testing set using manually parsed
sentences. In our study, a larger number of sample sentences than in the previous work
was collected from Afan Oromo grammar books and from the previous research. In
addition, we consider both simple and complex sentence types, which are manually
tagged. The parsing algorithm and the grammar formalism adopted in this thesis are
similar to those of the top-down chart parser for Amharic sentences [4] and a Top-Down
Chart Parser for Analyzing Arabic Sentences [7], namely chart parsing with a top-down
strategy (the top-down chart parsing algorithm) and CFG rules, respectively.
Parser for Amharic Language
Some work has also been done on Amharic sentence parsing. However, it is very little
compared to the number of works dealing with foreign natural languages such as Arabic
and English. To our knowledge, the majority of works in natural language processing on
local languages in Ethiopia are on the Amharic language. For Amharic sentence parsing,
efforts have been made by different researchers. The first attempt was by Atelach [24],
who developed a simple automatic parser for Amharic sentences to address the problem
of developing systems that can automatically process Amharic text. Probabilistic
Context Free Grammar (PCFG) was used as the grammatical formalism to represent the
phrase structure rules of the Amharic language, and the Inside-Outside algorithm with
bottom-up chart parsing was used as the parsing strategy. The study tried to combine
probabilistic formalism and rule based reasoning for developing an automatic sentence
parser. The sample corpus consisted of only 100 Amharic sentences, all of them simple
declarative sentences. The sentences had been automatically tagged by previous
researchers. Manual hand parsing was the other pre-processing phase done by the
researcher after the corpus had passed through the POS tagger. The results achieved on
the first set of sample sentences were very high: 100% on the training set and
approximately 96% on the test set. As the researcher states in her work, this high
accuracy was obtained partly due to the small number of words considered for the
experiment. Another reason is that all the sentences have identical constructions, and
the highest probability parses were almost always the correct ones.
The second attempt was by Daniel [25]. The work integrated the ideas and outputs of the
previous attempt by Atelach to develop an automatic sentence parser, particularly for
complex Amharic sentences. The parsing algorithm and the grammar formalism were
adopted from Atelach's automatic sentence parser for Amharic sentences, namely the
Inside-Outside algorithm with a bottom-up strategy and PCFG rules, respectively. The
collected sample corpus consisted of 350 complex Amharic sentences. Experiments were
conducted in this study using the training set and the test set. The first experiment was
conducted on the part-of-speech tagger to see the state of its performance when
morphological analysis is embedded in it. The result of this experiment showed that the
tagger attained accuracies of 98.7% and 94% on the training set and the test set,
respectively. The experiments on complex sentence parsing showed an accuracy of
89.6% on the training set and 81.6% on the test set.
The third work on Amharic sentence parsing was done by Abeba [26], and is a hybrid
approach to Amharic base phrase chunking and parsing. Its main objective was to
extract different types of Amharic phrases by grouping syntactically correlated words,
which are found at different levels of the parser, using a Hidden Markov Model (HMM),
and to transform the chunker into a parser. A bottom-up approach with a transformation
algorithm is used to transform the chunker into the parser. The data sets were analyzed
and tagged manually and used as a corpus for chunking. However, the entire data sets
were chunk tagged manually for the training data set. The training and testing datasets
were prepared using 10-fold cross validation. The experiments on Amharic sentence
chunking showed an average accuracy of 85.31% on the test set before applying the
rules for correction and an average accuracy of 93.75% on the test set after applying
the rules. The experiment on Amharic sentence parsing also showed an average accuracy
of 93.75%.
Another important work on Amharic sentence parsing, which uses an approach and
parsing strategy similar to those proposed in our study, was done by Abdurheman [4].
The researcher developed a top-down chart parser for Amharic sentences. The parser
was designed to parse all types of Amharic sentences using a top-down chart parsing
algorithm, with Context Free Grammar used to represent the Amharic grammar. A lexicon
generator, which is used to automatically generate the lexicon, was also developed. In
addition, a morphological analyzer was integrated into the construction of the lexicon.
In this research, the corpus consisted of 480 sample sentences of different types. In
order to test the effectiveness of the parser, 100 sentences selected randomly from all
sentence types were used, on average 20 sentences of four to nine words in length from
each sentence type. The correctness of the parser was evaluated by inspecting its
results manually. The output was checked with respect to the right categorization of
words in their proper word class, the right identification of sub phrases and main
phrases, the right order of sub phrases in building main phrases, and whether all words
and phrases are involved in the construction of the sentence.
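To make the idea of top-down chart parsing with CFG rules more concrete, the following
is a minimal Earley-style recognizer in plain Python. It is a sketch under stated
assumptions and not the implementation of [4] or of this thesis: the grammar, the tag
names (N, PART, V) and the function name are hypothetical, the grammar has no empty
rules, and the recognizer only answers whether a sentence is accepted instead of
building a parse tree.

```python
from collections import namedtuple

# A chart edge: rule lhs -> rhs with a dot position, covering words[start:end].
Edge = namedtuple("Edge", "lhs rhs dot start end")


def top_down_chart_recognize(words, grammar, lexicon, start="S"):
    """Return True if `words` can be derived from `start`.

    grammar: non-terminal -> list of right-hand sides (tuples of symbols).
    lexicon: word -> set of word categories (hypothetical tags).
    """
    n = len(words)
    chart = [[] for _ in range(n + 1)]  # chart[k] holds the edges ending at k

    def add(edge):
        if edge not in chart[edge.end]:
            chart[edge.end].append(edge)

    # Top-down initialisation: predict every expansion of the start symbol.
    for rhs in grammar[start]:
        add(Edge(start, rhs, 0, 0, 0))

    for k in range(n + 1):
        i = 0
        while i < len(chart[k]):  # chart[k] may grow while it is processed
            edge = chart[k][i]
            i += 1
            if edge.dot < len(edge.rhs):
                nxt = edge.rhs[edge.dot]
                if nxt in grammar:
                    # Predict: expand the expected non-terminal at position k.
                    for rhs in grammar[nxt]:
                        add(Edge(nxt, rhs, 0, k, k))
                elif k < n and nxt in lexicon.get(words[k], ()):
                    # Scan: the next word has the expected category.
                    add(Edge(edge.lhs, edge.rhs, edge.dot + 1, edge.start, k + 1))
            else:
                # Complete: advance every edge that was waiting for this phrase.
                for waiting in list(chart[edge.start]):
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == edge.lhs):
                        add(Edge(waiting.lhs, waiting.rhs, waiting.dot + 1,
                                 waiting.start, k))

    return any(e.lhs == start and e.dot == len(e.rhs) and e.start == 0
               for e in chart[n])


# Toy grammar and lexicon; the categories are illustrative assumptions only.
grammar = {
    "S": [("NP", "VP")],
    "NP": [("N",)],
    "VP": [("PART", "V"), ("V",)],
}
lexicon = {"Tolosaan": {"N"}, "ni": {"PART"}, "faarfata": {"V"}}

print(top_down_chart_recognize(["Tolosaan", "ni", "faarfata"], grammar, lexicon))
```

The chart stores every partial analysis once, so a phrase that several rules ask for is
not re-derived each time; this is the property that separates a chart parser from a
plain backtracking top-down parser.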
2.2.2 Global works on sentence parser
Many sentence parser systems have also been developed globally with different
approaches. Some of the works by different scholars that are reviewed in our thesis are
as follows.
Parser for English Language
For the English language, the researchers in [27] developed a parser based on a formal
grammatical system called link grammar, which has expressive power equivalent to that
of CFG. A link grammar consists of a set of words, each of which has a linking
requirement that is contained in the dictionary. The researchers have written a link
grammar of seven hundred definitions that captures many phenomena of English
grammar. Moreover, the researchers developed an algorithm based on dynamic
programming, which tries to build a linkage with a top-down strategy. The system was
tested by applying it to articles taken from newspapers, and the result indicated that the
performance of the system is good. However, there are a number of English phenomena
that are not handled by the system. For example, the system accepts sentences and
clauses that end with a preposition. There are also problems with the placement of
adverbs and prepositional phrases modifying verbs.
Another, statistical parser for the English language was developed by Charniak [28].
The parsing system is based on a language model which in turn is based upon assigning
probabilities to the possible parses of a sentence. The parser operates by assigning
probabilities to the sentence under all its possible parses and then choosing the parse
for which the probability is highest. In line with this, the rules of the context free
grammar specify how each phrase constituent will be expanded. The researcher
evaluated the performance of the system by training the parser on about one million
words of the Penn Wall Street Journal Treebank and testing on 50,000 words, and
claimed that its performance was superior to previous parsers in the area. However,
creating the corpus or treebank was a difficult task that requires great effort.
Parser for Arabic Language
A top-down chart parser for simple sentences of the Arabic language was developed by
the researchers in [7]. In this work, the parser covers nominal and verbal sentences
within a specific domain of Arabic grammar. To represent the grammar of simple Arabic
sentences, the researchers used the CFG grammar formalism. The grammar rules, which
give a precise description of grammatical sentences, were developed by the researchers.
Then, the parser, which assigns grammatical structure to the input sentences, was
implemented. The parser was tested on sentences extracted from real documents.
Another Arabic sentence parser was developed based on supervised machine learning
by [29]. The support vector machine algorithm for the learning phase and the Penn
Arabic Treebank (ATB) as a learning corpus were used in this work. The cross validation
method was also used for evaluation. The parser in this study has two phases: the
learning phase and the analysis phase. The learning phase involves the use of a training
corpus in order to extract a set of features and rules, which are used to train the support
vector machine. The extracted features are used to specify the morphological category
(POS) of the word being processed and the POS of the words in the left vicinity of the
word being analyzed, with a maximum depth equal to four. On the other hand, the
extracted rules are used to train the system in grouping the sequence of labels that may
belong to the same syntactic grouping and thus define their border. The evaluation of
the system was made by the cross-validation method using the Weka tool by dividing
the corpus into two parts: one containing 80% of the ATB for learning and the other
containing 20% for testing. When the system was evaluated on 100 sentences, the result
was a precision of 89.01, a recall of 80.24 and an F-score of 84.37.
The parser in [30] was also developed with the aim of analyzing and extracting the
attributes of Arabic words. The parser was written using a top-down parsing technique
with a recursive transition network, and the development was a two-step process. In the
first step, the set of rules used in the study for the Arabic parser was generated from
existing Arabic texts taught at K-12 grade levels. The second step was the
implementation of the parser, which analyses an Arabic sentence and determines
whether the sentence follows a valid grammatical structure. The sentences are made to
have gender and number agreement to ensure the correctness of the syntactic structure
of the Arabic sentences. After the evaluation of the parser, it was found that some
sentences were not parsed at all, and some other sentences were parsed incorrectly.
Sentences are not parsed for the following reasons: first, when the parser does not find
a word in the lexicon; second, because of an incorrect input sentence; and third, when
the parser is unable to produce a rule for the input sentence because the syntactic form
of the sentence is not included in the grammar. The efficiency of the developed parser
was evaluated on a sample of 90 sentences. The result shows that 85.6% of the sentences
were parsed successfully and the remaining 14.4% were not: 2.2% were parsed
incorrectly, and the others were not parsed for various reasons (4.4% lexical problems,
2.2% incorrect sentences, and 5.6% not recognizable by linguists according to Arabic
grammar rules).
Parser for Indian Language
Using the CKY algorithm, a sentence parser for the Hindi language, which is one of the
official languages of India, was developed by [31]. This parser can recognize languages
defined by a context free grammar in Chomsky normal form; it parses the whole sentence
and generates a matrix. In this study, the researchers developed a grammar that has 14
non-terminals and 13 terminals to represent sentences of the language. As the
researchers describe in their work, the system incorporates three components: an
interface, a database of Hindi words and the parser. The interface allows the user to
enter sentences, tokenizes the input sentence and assigns a tag to each token. The
database, on the other hand, stores the tags of Hindi words. The parser then takes a
string of tags as input and states whether or not the input string is correct. The paper
does not mention anything about the performance of the parser. Moreover, the number
of sentences used, the types of sentences, and the number of words the database
contains for evaluation are not indicated. Hence it is difficult to say anything about how
the parser performs compared with other parsers. However, the researchers state that a
large database would slow down parsing and also introduce word sense ambiguity in
assigning tags to the words of the input sentence.
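The matrix-filling behaviour of CKY on a string of tags can be sketched as below. The
code is an illustration only: the rules, the tag names and the function name are
hypothetical and are not the 14 non-terminal grammar of [31]; the only assumption taken
from the description above is that the grammar is in Chomsky normal form and that the
input is a sequence of tags.

```python
def cky_recognize(tags, lexical_rules, binary_rules, start="S"):
    """Return True if the tag sequence is derivable from `start`.

    lexical_rules: list of (A, tag) pairs for rules A -> tag.
    binary_rules:  list of (A, (B, C)) pairs for rules A -> B C.
    """
    n = len(tags)
    if n == 0:
        return False
    # table[i][j] is the set of non-terminals that derive tags[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tag in enumerate(tags):
        table[i][i] = {lhs for lhs, term in lexical_rules if term == tag}
    for span in range(2, n + 1):            # length of the substring
        for i in range(n - span + 1):       # start of the substring
            j = i + span - 1                # end of the substring
            for k in range(i, j):           # split point between the two parts
                for lhs, (left, right) in binary_rules:
                    if left in table[i][k] and right in table[k + 1][j]:
                        table[i][j].add(lhs)
    return start in table[0][n - 1]


# Hypothetical verb-final toy grammar in Chomsky normal form.
lexical_rules = [("NP", "noun"), ("N", "noun"), ("Det", "det"),
                 ("V", "verb"), ("VP", "verb")]
binary_rules = [("S", ("NP", "VP")), ("VP", ("NP", "V")), ("NP", ("Det", "N"))]

print(cky_recognize(["noun", "noun", "verb"], lexical_rules, binary_rules))
```

Each cell is filled from shorter spans only, so the table is completed bottom-up, in
contrast to the top-down strategy adopted in this thesis.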
Parser for Myanmar Language
A top-down parser was developed for both simple and complex Myanmar sentences by [32]
using the CFG grammar formalism. The researchers collected sentences consisting of 5
to 50 words: nearly 3000 training sentences and 530 testing sentences. The corpus was
pre-processed before it was passed to the actual parsing process. At the sentence level,
the researchers annotated the corpus for part of speech, chunk, and function tag
relationships between the words in the sentences. The sentences were tested and the
output parse trees were manually checked. The accuracy of the parse trees was 90.6%.
However, a plain top-down parser is not as efficient as a top-down chart parser.
Parser for Chinese Language
For Chinese sentences, [33] applied a Maximum-Entropy-Inspired parser to the Penn
Chinese Treebank (CTB). The model assigns a probability to a parse by a top-down
strategy. A parse tree is generated by starting from the tree root and using the
context-free grammar for branching. Each expansion is assigned a probability, and the
probability of a tree is the product of the probabilities of all expansions that generate
the given sentence. Then the parse that has the maximum probability P(Phi, s) for a
given sentence s is selected. The evaluation of the system was conducted by first
transforming the treebank. Since words in Chinese are not delimited by white-space, the
original treebank was converted into trees in which each terminal consists of a single
character instead of a word. Moreover, a MaxEnt re-ranker, which assigns a new
probability to each of the parses of a sentence, was used to improve the performance of
the parser. The system was tested on two versions of the Chinese treebank, CTB1.0 and
CTB4.0, with 3485 and 12334 sentences respectively. The paper concluded that the
performance of the parser is better than previously obtained results.
2.3 Grammar Formalism
Grammar is a set of constraints on the possible sequences of symbols, expressed as rules
or principles. Syntax is the basic ingredient of grammar. Grammar tells us the difference
between sets of sentences. It can also be a formal specification that describes the rules
and the syntax with which the parser attempts to analyze and determine the structure of
a sentence in a language [4]. There are five fundamental units of grammatical structure:
morpheme, word, phrase, clause, and sentence. The morpheme is the lowest unit.
Morphemes are joined to form a word. A phrase and a clause are each a group of words.
While a phrase does not have a subject and predicate, a clause does have its own subject
and predicate. For instance, in the sentence Tolosaan ni faarfata, which means Tolosa
sings, ‘Tolosaa’ is the subject and ‘ni faarfata’ is the predicate. A sentence is also a
group of words that conveys some meaning. The above analysis follows traditional
grammar. Subject and predicate are called grammatical functions. Parts of speech such
as verb, noun, adjective, adverb, conjunction and preposition are called grammatical
categories.
On the other hand, a grammar specifies two things: the set of grammatically correct
sentences and the structure to be assigned to each grammatical sentence in the
language. In order to specify these two things, grammar formalisms are used. Grammar
formalisms are, first and foremost, languages whose intended usage is to describe
languages themselves: to describe the set of sentences the language encompasses (the
string set), the structural properties of such sentences (their syntax), and the meanings
of such sentences (their semantics) [34]. There are different types of grammatical
formalisms; Context Free Grammar (CFG), Context Sensitive Grammar (CSG),
Transition Network Grammar (TNG), Unification Based Grammar (UBG) and
Probabilistic Context Free Grammar (PCFG) are the most common and most widely
used.
2.3.1 Context Free Grammar
A context-free grammar is a set of production rules that describe all possible strings in
a given formal language. A CFG can be defined as a finite set of grammar rules, each of
which has exactly one non-terminal symbol on the left-hand side and any string of
symbols on the right-hand side. Context-free grammars (CFGs) are a class of formal
grammars that have found numerous applications in modeling computer languages [35].
In order to define the grammar rules, there are two kinds of symbols: the terminals,
which are the symbols of the alphabet underlying the language under consideration, and
the non-terminals, which behave like variables ranging over strings of terminals [36].
A rule is of the form A → α, where A is a single non-terminal, and the right-hand side α
is a string of terminal and/or non-terminal symbols. A context-free grammar (CFG) is a
four-tuple (Σ, V, S, P) where:
Σ is a finite, non-empty set of terminals, the alphabet;
V is a finite, non-empty set of grammar variables (categories, or non-terminal
symbols), such that Σ ∩ V = ∅;
S ∈ V is the start symbol;
P is a finite set of production rules, each of the form A → α, where A ∈ V and
α ∈ (V ∪ Σ)*.
For a rule A → α, A is the rule’s head and α is its body. CFGs are a very important class
of grammars for two reasons [4]: first, the formalism is powerful enough to describe
most of the structure in a natural language, and second, it is restricted enough so
that efficient parsers can be built to analyze sentences.
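The four-tuple definition above can be written down directly as a small data structure.
The following sketch checks the conditions of the definition and applies rules one at a
time to derive the example sentence Tolosaan ni faarfata. The class name, the function
name and the category names NP and VP are illustrative assumptions, not the grammar
used in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    terminals: frozenset      # Σ
    variables: frozenset      # V
    start: str                # S
    productions: tuple        # P, as (head, body) pairs; body is a tuple of symbols

    def is_well_formed(self):
        """Check the conditions of the four-tuple definition."""
        symbols = self.terminals | self.variables
        return (bool(self.terminals) and bool(self.variables)
                and not (self.terminals & self.variables)       # Σ ∩ V = ∅
                and self.start in self.variables                 # S ∈ V
                and all(head in self.variables
                        and all(sym in symbols for sym in body)  # α ∈ (V ∪ Σ)*
                        for head, body in self.productions))


def apply_rule(grammar, form, head, body):
    """Rewrite the first occurrence of `head` in the sentential form with `body`."""
    if (head, body) not in grammar.productions:
        raise ValueError("not a rule of the grammar")
    i = form.index(head)
    return form[:i] + body + form[i + 1:]


# Toy grammar: the subject/predicate split of the example sentence.
toy = CFG(
    terminals=frozenset({"Tolosaan", "ni", "faarfata"}),
    variables=frozenset({"S", "NP", "VP"}),
    start="S",
    productions=(("S", ("NP", "VP")),
                 ("NP", ("Tolosaan",)),
                 ("VP", ("ni", "faarfata"))),
)

form = (toy.start,)
for head, body in toy.productions:     # S => NP VP => Tolosaan VP => Tolosaan ni faarfata
    form = apply_rule(toy, form, head, body)
print(toy.is_well_formed(), " ".join(form))
```

Because every rule has a single non-terminal as its head, a rule can be applied wherever
that non-terminal occurs, regardless of the symbols around it; this is what "context
free" refers to.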
2.3.2 Context Sensitive Grammar
Context-sensitive rules are used in natural language to describe, for example,
subject-verb agreement with respect to number (singular or plural), which distinguishes
sentences such as 'the student come' and 'the student comes' [4]. A Context-Sensitive
Grammar is a four-tuple, like that of a context free grammar, G = (N, Σ, P, S) where:
N is a set of non-terminal symbols,
Σ is a set of terminal symbols,
S is the start symbol of the production and
P is a finite set of production rules of the form α1Aα2 → α1βα2 (where A ∈ N is a
single non-terminal, α1, α2 ∈ (N ∪ Σ)* and β ∈ (N ∪ Σ)+).
The production rules of a context sensitive grammar satisfy the following constraints:
- For A → B, where A and B are strings of alphabet symbols, the length of A
should be less than or equal to the length of B.
- For A → y / x_z, A is a non-terminal symbol, y is a sequence of one or more
terminal and non-terminal symbols, and x and z are sequences of zero or more
terminal and non-terminal symbols.
The meaning of the second production rule is that A can be rewritten as y if it appears
within the context ‘x_z’, i.e., immediately preceded by the symbols x and immediately
followed by the symbols z.
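A rule of the form A → y / x_z can be sketched as a function that rewrites A only when
its left and right neighbours match the context. The symbol names (N_sg, N_pl, V, V_sg)
and the function name are hypothetical and serve only to illustrate number agreement.

```python
def rewrite_in_context(form, a, y, x=(), z=()):
    """Apply A -> y / x_z once to the first A that occurs between x and z.

    form, y, x and z are sequences of symbols; the form is returned unchanged
    if no occurrence of `a` appears in the required context.
    """
    form, y, x, z = tuple(form), tuple(y), tuple(x), tuple(z)
    for i, sym in enumerate(form):
        if sym != a:
            continue
        left_ok = i >= len(x) and form[i - len(x):i] == x
        right_ok = form[i + 1:i + 1 + len(z)] == z
        if left_ok and right_ok:
            return form[:i] + y + form[i + 1:]
    return form


# V becomes the singular verb form only directly after a singular noun.
print(rewrite_in_context(("Det", "N_sg", "V"), "V", ("V_sg",), x=("N_sg",)))
print(rewrite_in_context(("Det", "N_pl", "V"), "V", ("V_sg",), x=("N_sg",)))
```

Since y contains at least one symbol, applying the rule never shortens the sentential
form, which agrees with the length constraint stated above.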
2.3.3 Transition Network Grammar
The Transition Network Grammar (TNG) formalism describes the rules by using nodes
and labeled arcs in a transition network [4]. One of the nodes is specified as the initial
state, or start state. Starting at the initial state, an arc can be traversed if the current
word in the sentence is in the category on the arc. If the arc is followed, the current
word is updated to the next word. Simple transition networks are often called Finite
State Machines (FSMs) and have expressive power equal to that of regular grammars.
However, they are not powerful enough to describe all languages that can be described
by CFGs. In order for the transition network grammar to get the descriptive power of
CFGs, it should allow arc labels to refer to other networks as well as to word
categories. The grammatical formalism based on such a notion is called a recursive
transition network [30].
The other commonly used type of TNG formalism for writing natural language
grammars is the Augmented Transition Network (ATN), introduced by Woods [37]. This
type of formalism represents the grammar on the assumption that a sentence is in the
language defined by the network if there is a path from the start state to some final
state such that the labels of the arcs on the path match the words within the sentence.
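The acceptance condition just described (a path from the start state to a final state
whose arc labels match the words) can be sketched for a simple, non-recursive
transition network as follows. The network, its state names and the category labels are
hypothetical, and the input is assumed to be the sequence of word categories of the
sentence.

```python
def network_accepts(network, start, finals, categories):
    """Follow arcs labelled with word categories; accept if a final state is reached.

    network: state -> list of (category_label, next_state) arcs.
    """
    states = {start}
    for cat in categories:
        # Traverse every arc whose label matches the category of the current word.
        states = {dst for s in states
                  for label, dst in network.get(s, ()) if label == cat}
        if not states:
            return False
    return bool(states & set(finals))


# Hypothetical noun-phrase network: det, any number of adj, then noun.
np_network = {
    "q0": [("det", "q1")],
    "q1": [("adj", "q1"), ("noun", "q2")],
}

print(network_accepts(np_network, "q0", {"q2"}, ["det", "adj", "adj", "noun"]))
```

A recursive transition network would additionally allow an arc label to name another
network, which this simple sketch does not attempt.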
2.3.4 Unification Based Grammar
Unification-based formalisms use as their informational domain a system based on
features and their values [34]. The feature structures consist of features and associated
values, which can be atomic or complex, i.e., feature structures themselves. In other
words, the values can be from a structured set.
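Feature structures with atomic or complex values can be represented as nested
dictionaries, and unification as a recursive merge that fails on conflicting atomic values.
The sketch below is a simplification under stated assumptions: it does not handle shared
(re-entrant) values, and the feature names cat, agr, num and per are hypothetical.

```python
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None if they clash."""
    result = dict(fs1)
    for feature, value2 in fs2.items():
        if feature not in result:
            result[feature] = value2          # information only present in fs2
            continue
        value1 = result[feature]
        if isinstance(value1, dict) and isinstance(value2, dict):
            sub = unify(value1, value2)       # complex values unify recursively
            if sub is None:
                return None
            result[feature] = sub
        elif value1 != value2:
            return None                       # conflicting atomic values
    return result


subject = {"cat": "NP", "agr": {"num": "sg"}}
verb_requirement = {"agr": {"num": "sg", "per": 3}}
print(unify(subject, verb_requirement))
print(unify(subject, {"agr": {"num": "pl"}}))
```

The first call merges the compatible agreement information, while the second fails
because the singular and plural values of num cannot be combined.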
2.3.5 Probabilistic Context Free Grammar
The most commonly used type of grammar in natural language modeling is a
probabilistic version of the CFG, called probabilistic (or stochastic) context-free
grammar (PCFG) [35]. The key idea in probabilistic context-free grammars is to extend
the definition of a CFG to give a probability distribution over possible derivations
[38]. The probability of a rule is estimated by counting the number of times the rule is
used in a corpus of parsed sentences. A PCFG is a five-tuple [39]:
PCFG = (N, Σ, P, S, D), where:
N is a set of non-terminals,
Σ is a set of terminal symbols,
P is a set of production rules in CNF,
S is the start symbol, and
D is a function that assigns a probability to each rule in P.
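How the function D can be obtained by counting rules in parsed sentences, and how the
probability of a parse follows from it, can be sketched as below. The sketch assumes the
usual relative-frequency estimate (the count of a rule divided by the count of all rules
with the same left-hand side); the two-tree treebank, the tag names and the function
names are hypothetical.

```python
from collections import Counter
from math import prod  # available from Python 3.8


def rules_of(tree):
    """Yield (head, body) for every rule used in a tree given as (label, child, ...)."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        yield label, tuple(children)              # lexical rule, e.g. V -> faarfata
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from rules_of(child)


def estimate(treebank):
    """Relative-frequency estimate of D: count(A -> body) / count(A)."""
    rule_counts = Counter(rule for tree in treebank for rule in rules_of(tree))
    head_counts = Counter()
    for (head, _), count in rule_counts.items():
        head_counts[head] += count
    return {rule: count / head_counts[rule[0]] for rule, count in rule_counts.items()}


def tree_probability(tree, probs):
    """Probability of a parse: the product of the probabilities of its rules."""
    return prod(probs[rule] for rule in rules_of(tree))


# Hypothetical two-sentence treebank with rules in Chomsky normal form.
t1 = ("S", ("NP", "Tolosaan"), ("VP", ("PART", "ni"), ("V", "faarfata")))
t2 = ("S", ("NP", "Tolosaan"), ("VP", "faarfata"))
D = estimate([t1, t2])
print(D[("VP", ("PART", "V"))], tree_probability(t1, D))
```

In this toy treebank VP is expanded in two different ways, once each, so both VP rules
receive probability 0.5 and every other rule receives probability 1.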
PCFGs were introduced as an extension to CFGs to aid in sentence disambiguation, but
they have a number of problems. Due to this, in practice, most current probabilistic
parsers use some augmented form of PCFGs. The main drawback of PCFGs is that they
do not model dependencies. Although it is not stated explicitly, the formulation of
PCFGs assumes that the derivation from each non-terminal node to a set of input words
is not only independent of the nodes outside the sub-tree but also independent of the
words on both sides of the subsequence of the input string that the sub-tree covers. The
first refers to structural independence while the second implies lexical independence.
Natural languages are not that simple and have both kinds of dependencies [39][40].
All extensions of PCFG try to include the dependencies between words and parse trees
in one way or another. One drawback of extended PCFGs is that they need an extremely
large corpus for estimating the probabilities. To avoid this, the various extensions make
some simplifying assumptions of independence. A commonly used solution for
incorporating dependencies into PCFGs is the probabilistic lexicalized CFG. This is
based on the concept of head driven grammars. Every phrase is associated with a
“head” word, which constrains the overall structure of the sentence. Instead of
computing the probability of the parse just by multiplying the PCFG rule probabilities,
each rule probability is now conditioned on its head [35][39][40].
2.4 Sentence Parsing Approaches
Parsing is the process of assigning syntactic structures to input strings according to a
grammar [26]. It is the step in which a flat input sentence is converted into a hierarchical
structure that corresponds to the units of meaning in the sentence. There are two kinds
of techniques for parsing a sentence efficiently: stochastic and rule-based. These are
briefly discussed as follows.
2.4.1 Stochastic Approaches
The stochastic approach is also called the corpus-based approach, as it is based on the
use of text corpora. The approach uses Bayes' theorem together with independence and
Markov assumptions to determine the most likely sequence of lexical categories for the
words in a given sentence [29]. Many parsers use formal grammars to analyze language
input. Stochastic parsing differs in that the rules in the grammar are assigned
probabilities [41]. Based on the type of text corpora used, the corpus-based (stochastic)
approach can be further categorized into supervised and unsupervised approaches.
Supervised Approach
In the supervised approach, we are given a data set and already know what the correct
output should look like, on the assumption that there is a relationship between the
input and the output. It is called supervised learning because the process of an
algorithm learning from the training dataset can be thought of as a teacher supervising
the learning process [42]. The approach uses annotated text corpora, and systems
developed with it are called supervised parsers. They use probability or statistics in
analyzing the syntactic structure. The main sources of information for a supervised
parser are the lexicon (which lists each word with all of its possible lexical
categories) and the list of contextual probabilities for each lexical category. The
contextual probabilities indicate which lexical category is appropriate in a particular
context. However, this approach has two main drawbacks: the lack of manually or
automatically parsed text (corpora), and the need for manual parsing each time the
parser is to be applied to a new text [5].
Unsupervised Approach
The unsupervised approach, on the other hand, allows us to approach problems with little
or no idea of what the results should look like. We can derive structure from data where
we do not necessarily know the effect of the variables. This is called unsupervised
learning because, unlike supervised learning above, there are no correct answers and
there is no teacher [42]. Algorithms are left to their own devices to discover and
present the interesting structure in the data.
Unlike the supervised approach, the unsupervised approach uses a natural corpus such as
those found in newspapers and books. For this reason, it does not require any pre-tagged
text in the training process. Some probabilistic information generated from the corpus
is used to develop the syntactic analysis system. These parsers also rely on the Markov
model assumption: a set of states (lexical categories in this case) is connected by
directed edges labeled with transition probabilities, each of which indicates the
probability of moving to the state at the end of the directed edge.
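A minimal sketch of such a Markov model over lexical categories is given below. The
category names, the start and end markers and all transition probabilities are
hypothetical values chosen for illustration; in practice they would be estimated from
a corpus.

```python
# Hypothetical transition probabilities between lexical categories.
# "<s>" and "</s>" are assumed start-of-sentence and end-of-sentence markers.
TRANSITIONS = {
    "<s>": {"N": 0.8, "ADV": 0.2},
    "N": {"N": 0.3, "V": 0.5, "ADJ": 0.2},
    "ADJ": {"V": 0.6, "N": 0.4},
    "ADV": {"V": 1.0},
    "V": {"</s>": 1.0},
}


def sequence_probability(tags):
    """Probability of a category sequence under the first-order Markov assumption:
    each category depends only on the category immediately before it."""
    path = ["<s>", *tags, "</s>"]
    probability = 1.0
    for previous, current in zip(path, path[1:]):
        # A missing edge means the transition was never observed: probability 0.
        probability *= TRANSITIONS.get(previous, {}).get(current, 0.0)
    return probability


p_sov = sequence_probability(["N", "N", "V"])   # subject, object, verb
p_bad = sequence_probability(["V", "N"])        # verb-initial order has no edge
```

With these made-up values a noun-noun-verb sequence receives a non-zero score, while a
sequence that starts with a verb receives zero because the graph has no such edge.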
2.4.2 Rule-based Approaches
The rule-based approach has successfully been used in developing many natural
language processing systems. Systems that use rule-based transformations are based on
a core of solid linguistic knowledge. The linguistic knowledge acquired for one natural
language processing system may be reused to build knowledge required for a similar
task in another system [39]. The rule-based approach has an advantage over the
corpus-based approach for less-resourced languages, for which large corpora, possibly
parallel or bilingual, with representative structures and entities are neither
available nor easily affordable, and for morphologically rich languages, which suffer
from data sparseness even when corpora are available [43]. The rules may contain a
large amount of morphological, lexical or syntactic information [9]. These factors have
motivated many researchers to follow the rule-based approach in developing natural
language processing tools and systems.
According to [24], in parsing sentences, rule-based approaches attempt to find a way in
which the sentence could have been generated from the start symbol of the grammar. They
attempt to parse a sentence based on the information in the knowledge base (grammar
rules) of the language. Systems which are based on such rules learn a set of rules
automatically from a given list of strings and then parse the sentences by following
the rules. There are three ways in which this approach can be applied: the top-down,
bottom-up, and chart-based approaches. These approaches are briefly discussed below.
Top-down Parsing approach
The top-down approach starts with the largest unit and breaks it down into smaller
segments. According to [44], top-down parsing has the advantage that only those rules
are applied which can be useful in proving that the sentence is grammatical; its
disadvantage is that the rules are tried "blindly," without any regard to the lexical
material present in the sentence. Top-down parsing is the strategy that builds the
parse tree from the start symbol S. It never wastes time exploring trees that cannot
result in an S, since it begins by generating just those trees [39]. This means it also
never explores sub-trees that cannot find a place in some S-rooted tree. Thus, it is
goal oriented: the goal is to parse the sentence according to the grammar productions.
The following steps are repeated until the parse tree matches the input string [7].
At the start node S, select a production with S on its left-hand side and, for each
symbol on its right-hand side, construct the appropriate child. When a terminal is added
to the tree being constructed that does not match the input string, backtrack. Then
find the next node to be expanded. If no parse tree matches the input string, the input
string is not a sentence of the grammar. Top-down methods have the advantage of being
highly predictive [4]: they predict the end string from the given grammar. The parser
also has to backtrack to the point where it made the wrong decision each time it chooses
the wrong path.
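The steps above can be sketched as a small recursive-descent parser with backtracking.
The toy grammar, the lexicon and the function names are assumptions made for this
illustration (the words are taken from examples in this chapter); the grammar is
deliberately free of left recursion, which this simple strategy cannot handle.

```python
# Hypothetical toy grammar (no left recursion) and lexicon for illustration.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["NP", "V"], ["V"]],
}
LEXICON = {"Dagaagaan": "N", "nyaata": "N", "nyaate": "V",
           "Caalaan": "N", "dhufe": "V"}


def expand(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover words[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Lexical category: it must match the category of the next input word.
        if pos < len(words) and LEXICON.get(words[pos]) == symbol:
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Trying the next production after a failure is the backtracking step.
        yield from expand_sequence(symbol, rhs, words, pos, [])


def expand_sequence(lhs, rhs, words, pos, children):
    """Build one child per right-hand-side symbol, from left to right."""
    if not rhs:
        yield (lhs, *children), pos
        return
    for child, next_pos in expand(rhs[0], words, pos):
        yield from expand_sequence(lhs, rhs[1:], words, next_pos, children + [child])


def parse(sentence):
    """Return the first S-rooted tree that covers the whole sentence, or None."""
    words = sentence.split()
    for tree, end in expand("S", words, 0):
        if end == len(words):
            return tree
    return None
```

For "Caalaan dhufe" the parser first tries the production VP -> NP V, fails because
"dhufe" is not a noun, and then backtracks to VP -> V. This is the blind
trial-and-error behavior described above.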
Bottom up parsing approach
Bottom-up parsing is data directed. The initial goal list of a bottom-up parser is the
string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this
sequence may be replaced by the LHS of the rule. Parsing is finished when the goal list
contains just the start category. If the RHS of several rules match the goal list, then
there is a choice of which rule to apply. The standard presentation is as shift-reduce
parsing.
The task of the parser is to group words together into their respective categories in a
manner permitted by the grammar. Unlike the top-down parser, the bottom-up parser only
checks the input sentence once, and builds each constituent exactly once [40]. This is
because a bottom-up parser works from left to right, i.e., it does everything it can
with the first item before exploring what it can do with the next items. However, a
bottom-up parser can also get stuck in a loop if the grammar has empty productions.
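The shift and reduce steps can be sketched as follows. This is an illustrative
recognizer, not the parser developed in this thesis: the toy rules and lexicon are
assumptions, empty productions are excluded, and, unlike a plain shift-reduce parser,
the sketch backtracks when several rules match so that the choice of rule does not make
it miss a parse.

```python
# Hypothetical toy rules (LHS, RHS); no empty productions are allowed here.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)),
    ("VP", ("NP", "V")),
    ("VP", ("V",)),
]
LEXICON = {"Dagaagaan": "N", "nyaata": "N", "nyaate": "V",
           "Caalaan": "N", "dhufe": "V"}


def shift_reduce(stack, buffer):
    """Return True if the stack plus the remaining input can be reduced to S."""
    if not buffer and stack == ["S"]:
        return True
    # Reduce: replace a matching RHS on top of the stack by the rule's LHS.
    for lhs, rhs in RULES:
        n = len(rhs)
        if tuple(stack[-n:]) == rhs:
            if shift_reduce(stack[:-n] + [lhs], buffer):
                return True
    # Shift: move the next input category onto the stack.
    if buffer:
        return shift_reduce(stack + [buffer[0]], buffer[1:])
    return False


def recognize(sentence):
    """Map words to categories, then run the shift-reduce search."""
    categories = [LEXICON.get(word, "UNKNOWN") for word in sentence.split()]
    return shift_reduce([], categories)
```

For "Caalaan dhufe" both VP rules match once the verb has been shifted. Reducing with
VP -> NP V leads to a dead end, so the sketch falls back to VP -> V, which shows the
choice point mentioned above.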
The bottom-up approach has the advantage that the choice of the grammar rules that are
applied depends on the words present in the sentence and on analyses of sub-strings of
the sentence. However, its disadvantage is that analyses are built for sub-strings
which do not contribute to the overall analysis of the sentence [44]. Although both
bottom-up and top-down parsers have advantages, they are inefficient and have a
worst-case exponential run-time, as the parser tends to try the same matches
repeatedly, thus duplicating much of its work unnecessarily. Therefore, a more
efficient approach, called the chart parsing approach, is discussed as follows.
Chart Parsing approach
The approaches that we discussed above have significant limitations. The bottom-up
(shift-reduce) parser can only find one parse, and it often fails to find a parse even
if one exists. As just pointed out, the top-down (recursive-descent) parser can be very
inefficient, and if the grammar contains left-recursive rules, it can enter an infinite
loop. In order to address these problems of completeness and efficiency, we explore the
chart parsing approach, which stores intermediate results and re-uses them when
appropriate. A chart parser combines some of the advantages of the top-down and
bottom-up approaches: it combines the selective behavior of the top-down algorithm,
which builds partial parses based on the left context, with the behavior of the
bottom-up algorithm, which builds each partial parse only once [24]. The main objective
of chart parsing is to improve parsing efficiency. It therefore follows three
principles: first, it does not do twice what can be done once; second, it does not do
once what can be avoided altogether; and third, it does not represent distinctions that
are not the concern of the study [4].
To parse a sentence, a chart parser first creates an empty chart spanning the sentence.
It then finds edges that are licensed by its knowledge of the sentence, and adds them to
the chart one at a time until one or more parse edges are found. It has three main
components: a chart, a key list and a set of edges. A chart is a set of chart entries,
each of which consists of the name of a terminal or non-terminal symbol, the starting
point of the entry and the entry length. The key list is a push-down stack of chart
entries that are waiting to be added to the chart. The edges are rules that can be
applied to chart entries to build them up into larger entries [7][4]. The chart
maintains the record of all the constituents derived from the sentence so far in the
parse. It also maintains the record of rules that have matched partially but are not
complete. Recording intermediate results is a form of dynamic programming that avoids
duplicate work [45].
A chart parser is driven by an agenda of completed constituents and by arc extension,
which combines active arcs with constituents when they are added to the chart. The
technique of extending arcs with constituents can be applied in both the bottom-up and
the top-down approach. The difference is in how new arcs are generated from the
grammar. In the bottom-up approach, new active arcs are generated whenever a completed
constituent is added that could be the first constituent of the right-hand side of a
rule. In the top-down approach, new active arcs are generated whenever a new active arc
is added to the chart [7]. For this reason, the number of constituents generated by a
top-down chart parser is smaller than the number generated by a bottom-up chart parser.
Therefore, the top-down chart parser is considerably more efficient for any reasonable
grammar.
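A compact way to illustrate top-down chart parsing is an Earley-style recognizer, in
which each chart entry is an active arc (a rule with a dot marking how much of its
right-hand side has been matched) together with the position where it started. The
sketch below is an illustration under assumed names, a toy grammar and a toy lexicon,
and it is not the implementation developed in this thesis. The grammar deliberately
contains the left-recursive rule NP -> NP ADJ to show that the chart prevents the
infinite loop that affects plain recursive descent.

```python
# Hypothetical toy grammar with a left-recursive rule (NP -> NP ADJ) and lexicon.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("N",), ("NP", "ADJ")],
    "VP": [("NP", "V"), ("V",)],
}
LEXICON = {"saree": "N", "gurraacha": "ADJ", "dhufe": "V",
           "Dagaagaan": "N", "nyaata": "N", "nyaate": "V"}


def chart_recognize(sentence, start="S"):
    """Earley-style top-down chart recognizer for grammars without empty rules."""
    words = sentence.split()
    n = len(words)
    # chart[i] holds arcs (lhs, rhs, dot, origin) that end at input position i.
    chart = [[] for _ in range(n + 1)]

    def add(arc, position):
        if arc not in chart[position]:      # never store the same arc twice
            chart[position].append(arc)

    for rhs in GRAMMAR[start]:
        add((start, rhs, 0, 0), 0)

    for i in range(n + 1):
        for lhs, rhs, dot, origin in chart[i]:   # chart[i] may grow while iterating
            if dot < len(rhs):
                needed = rhs[dot]
                if needed in GRAMMAR:
                    # Predict: a new active arc triggers arcs for the needed symbol.
                    for production in GRAMMAR[needed]:
                        add((needed, production, 0, i), i)
                elif i < n and LEXICON.get(words[i]) == needed:
                    # Scan: the next word has the needed lexical category.
                    add((lhs, rhs, dot + 1, origin), i + 1)
            else:
                # Complete: extend every arc that was waiting for this constituent.
                for p_lhs, p_rhs, p_dot, p_origin in chart[origin]:
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        add((p_lhs, p_rhs, p_dot + 1, p_origin), i)

    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])
```

Because `add` refuses to store an arc twice, predicting NP while expanding NP -> NP ADJ
adds nothing new, and the recognizer terminates on the left-recursive grammar.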
2.5 Afan Oromo Grammar
Ethiopia is a multilingual country. It comprises more than 80 ethnic groups
[3] with diverse linguistic backgrounds. The languages of the country belong to the
Afro-Asiatic super family (Cushitic, Semitic and Omotic) and to the Nilo-Saharan family
(Nilotic) [46]. Afan Oromo belongs to the East Cushitic branch of the Afro-Asiatic
language super family. It is the most widely spoken language in Ethiopia. As Abdi
states, Afan Oromo has around 40 million native speakers, about 50% of the total
population of the country, which makes it the language with the most speakers in
Ethiopia. The writing system of Afan Oromo is nearly phonetic since it is written the
way it is spoken, i.e. one letter corresponds to one sound. The language uses the
Latin-based alphabet “Qubee”, which was formally adopted in 1991 G.C [47], and it has
its own consonant and vowel sounds. Afan Oromo has thirty-three consonants, seven of
which are combined consonant letters: ch, dh, ny, ph, sh, ts and zh. The combined
consonant letters are known as ‘qubee dachaa’. Afan Oromo has five short and five long
vowels. Like the English alphabet, the Afan Oromo alphabet has capital and small
letters. In Afan Oromo, as in English, vowels are sound makers and can stand by
themselves.
“Parsing refers to the activity of analyzing a sentence into its component categories and
functions” [39]. It is a skill, something that you can learn to do rather than
something you simply know about. Hence, sentence parsing is all about discovering the
structure of an input sentence based on external information known about the elements
of the sentence and their order. According to [13], Afan Oromo is a morphologically
rich language; each root word can combine with multiple morphemes to generate a huge
number of word forms. The grammatical system of Afan Oromo is quite complex and
exhibits many features common to other Cushitic languages; that is, it is an inflected
language that uses postpositions more than prepositions [47]. Hence, to support such
inflectionally rich languages, the structure of each word has to be identified. Thus,
in the following sections we present the grammar of the Afan Oromo language, covering
its word order, word classes, phrase types and sentence types, because of their
importance for our study.
2.5.1 Word order
Words combine in different orders to form sentences and phrases. They also have an
internal structure [48]. One of the primary ways in which languages differ from one
another is in the order of constituents, or word order. For instance, Afan Oromo and
English differ in their syntactic structure. In Afan Oromo, the sentence structure is
subject-object-verb (SOV): the subject comes first, then the object, and the verb
follows the object. For example, in the Afan Oromo sentence “Dagaagaan nyaata nyaate”,
“Dagaagaan” is the subject, “nyaata” is the object and “nyaate” is the verb. In
English, the sentence structure is subject-verb-object (SVO). For example, if the above
Afan Oromo sentence is translated into English it becomes “Degaga ate food”, where
“Degaga” is the subject, “ate” is the verb and “food” is the object. Because nouns
change form depending on their role within the sentence, word order in Afan Oromo can
be flexible, though verbs always come after their subjects and objects. Typically,
indirect objects follow direct objects.
2.5.2 Word Categories
In the Afan Oromo language, based on their context and formation in sentences, words
are categorized into five major classes. These are noun, adjective, verb, adverb
and adposition (pre- and post-position). This study adopts the convention that
conjunctions and adpositions appear in the same category, which is the adposition
category.
Noun
Nouns are words that are used to name or identify things, people, animals, places or
abstract ideas. The Afan Oromo noun category includes nouns, adjectives and pronouns.
For example, the word ‘farda’ (horse) occupies the noun position in the sentence
‘Fardi garbuu nyaata’ (the horse eats barley). Singular nouns are marked by a zero
morpheme, whereas plural nouns are marked by various forms. Sometimes nouns are used as
adjectives in the Afan Oromo language.
Example:
Tolosaan mana barumsaa deeme. [Tolosa went to school]
According to [5], Afan Oromo nouns are pluralized by adding various suffixes, such as
{-een, -wan, -(o)ota, -yyii and -lee}. Some nouns can take more than one plural marker.
For instance, mana 'house' can be pluralized as either manneen or manoota. Most nouns,
however, prefer one plural marker to the other. The word sagalee 'sound', for example,
can be pluralized by suffixing {-(o)ota} or {-lee}, but it prefers the former to the
latter.
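As a rough illustration of how a parser front end might look for these plural markers,
the sketch below checks whether a word ends in one of the suffixes listed above. The
function name and the longest-match strategy are assumptions made for this example;
surface matching of this kind is not a morphological analysis, and it over-generates.

```python
# Plural suffixes as listed above; "-(o)ota" is expanded to "oota" and "ota".
PLURAL_SUFFIXES = ["een", "wan", "oota", "ota", "yyii", "lee"]


def plural_suffix(word):
    """Return the longest listed suffix the word ends with, or None (naive check)."""
    for suffix in sorted(PLURAL_SUFFIXES, key=len, reverse=True):
        # Require some stem material in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix
    return None
```

The check finds "een" in manneen and "oota" in manoota and returns None for mana, but
it also reports "lee" for the singular noun sagalee. This is why suffix lists alone are
not enough and a lexicon or a proper morphological analyzer is needed.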
Adjective
An adjective modifies a noun or a pronoun by describing, identifying, or quantifying
it. In Afan Oromo, an adjective usually follows the noun or the pronoun which it
modifies. Like nouns, adjectives fall into different categories; Afan Oromo adjectives
can be either primitive or derived. For example, in “saree gurraacha”, the adjective
gurraacha “black” comes after the noun it modifies. However, not every word that
follows a noun is an adjective. For example, in mana citaa “thatched house,” the word
citaa “thatched” is not an adjective but a noun used as an adjective. Moreover, nouns,
but not adjectives, occur in subject and/or object positions. On the other hand,
adjectives are similar to nouns in various ways. The two sub-categories also share
similar characteristics for the inflection of gender. However, it should be noted that
adjectives cannot substitute for nouns in a sentence construction. Only nouns and
pronouns seem to substitute for each other, since they can occur in the same position
in an Afan Oromo sentence. For example, consider the following sentences, of which the
first two are grammatical and the third is not.
Abbabaan barsiisaadha. [Abebe is a teacher]
Inni barsiisaadha [He is a teacher].
Gurraacha barsiisaadha. [Black is the teacher], which is ungrammatical.
Verb
A verb is the most important part of a sentence: it says something about the subject of
the sentence and expresses an action, an event or a state of being. In Afan Oromo, the
verb occurs in the final position of a sentence. It is not the case that verbs
constitute a distinct, open word class in all languages. In Afan Oromo, verbs are forms
which occur in clause-final position and belong to a distinct category [1][11]. For
example, in the following sentences:
Inni qalama bite. [He bought a pen]
Caaltuun deemte. [Chaltu has gone]
Isheen gabaabdu dha. [She is short]
The italicized parts are all verbs. [11] divides verbs into a number of sub-categories
based on the type of constituents they are associated with. These are intransitive,
transitive, ditransitive, modal and auxiliary verbs. Intransitive verbs are those
which do not take any phrase as their complement. For example, in the sentence
‘Abbabaan furdate’ (Abebe got fat), ‘furdate’ [got fat] is an intransitive verb which
has no complement. There are also strictly transitive verbs [5], which take one
complement in Afan Oromo. For example,
inni teechuma cabse (he broke the chair)
Caalaan mana bite (Chala bought a house)
In these two examples, teechuma and mana are the complements of the verbs ‘cabse’
(broke) and ‘bite’ (bought) respectively. Finally, the third category of verbs in Afan
Oromo is the ditransitive verbs. These verbs take two complements, usually a noun
phrase and an adpositional phrase. For example,
Tulluun abbaaf konkoolataa bite.
In the above sentence, abbaaf and konkoolataa are the two complements of the verb bite
‘bought’; they are an adpositional phrase and a noun phrase respectively.
Adverb
Adverbs are words which are used to modify a verb, an adjective, another adverb, or a
clause. In Afan Oromo sentences, adverbs usually precede the verbs they modify or
describe. An adverb indicates time, manner, place, cause, or degree and answers
questions such as how? when? where? and how much? Example:
Oboleessi koo boru deema. (My brother will leave tomorrow.) Boru (tomorrow) is an
adverb.
However, it should be noted that a word that comes before a verb is not necessarily
an adverb. For instance, in kitaaba bite “bought book”, the word kitaaba “book”
precedes the verb bite “bought”. In this case, the word kitaaba is a noun and is in
turn modified by the verb bite. Hence, the verb functionally shares the feature of an
adjective (modifier). There are different types of adverbs: adverbs of time, place,
manner, frequency, degree, etc. In general, adverbs are treated as a subclass of
verbs [11].
Adpositions
Adpositions are traditionally defined as words that link to other words, phrases, and
clauses and express spatial or temporal relations. They are an almost universal part
of speech. “Adposition” is a cover term for prepositions and postpositions. Afan Oromo
has both prepositions and postpositions, though postpositions are more common. Examples:
boqonnaarra [boqonnaa irra] – “on vacation”
mana nyaataa kanatti [kana itti] – “at this restaurant”
Keeniyaan Itoophiyaarraa (gara) kibbatti argamti – “Kenya is located (to the)
south of Ethiopia”
From the examples above, we notice that the postpositions (itti, irra, and irraa) most
often occur as suffixes, -tti, -rra, and -rraa, on the nouns they relate to. With place
names, no preposition or postposition is used to mean “in”. Therefore, one can say
“Finfinnee jiratta” for “you live in Finfinnee [Addis Ababa]”, or “hospitaalan
ture” for “I was in the hospital,” using no preposition.
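Since these postpositions usually surface as suffixes, a parser needs some way to
separate them from the noun before assigning lexical categories. The sketch below is
only an assumed, naive illustration of that step: the function name is hypothetical,
and the suffix table contains just the three forms mentioned above.

```python
# Suffix forms and the free-standing postpositions they correspond to (from the text).
POSTPOSITION_SUFFIXES = {"tti": "itti", "rra": "irra", "rraa": "irraa"}


def split_postposition(token):
    """Return (stem, postposition), or (token, None) if no listed suffix matches."""
    # Try longer suffixes first so that "-rraa" is not mistaken for "-rra".
    for suffix in sorted(POSTPOSITION_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)], POSTPOSITION_SUFFIXES[suffix]
    return token, None
```

Applied to the examples above, the sketch splits boqonnaarra into boqonnaa and irra,
and kanatti into kana and itti. Because it looks only at the end of the token, it would
also split any unrelated word that happens to end in one of these strings.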
Conjunction
A conjunction is a word that is used to connect words, phrases, clauses or sentences.
Conjunctions in Afan Oromo are coordinating or subordinating. In this study,
conjunctions and adpositions are treated as the same category. According to [5], one
problem that arises from placing adpositions and conjunctions in different categories
is that of distinguishing conjunctions from adpositions. The problem mainly arises from
the fact that the same words are mostly used as both adpositions and conjunctions.
However, in cases where it is possible to separate adpositions from conjunctions, they
are parsed separately. That is, when the parser can distinguish between the two
sub-categories, a distinct category is given to each of them. Some Afan Oromo
conjunctions are: [fi] “and,” [immoo, garuu] “but,” [yookin (for declaratives), moo
(for questions)] “or,” [haa ta’u malee] “however,” [ta’us] “though,” [kanaaf] “so,”
[kanaafuu] “therefore,” [sababiin isaa] “because,” [akka] “in order to, so that,” etc.
Examples:
Nyaatan barbaada sababiinsa nan beela'e. – “I want food because I am hungry.”
Ani kochee nyaadhe kanaafuu garaa kaasan qaba. – “I ate kitfo so I got
diarrhea.”
Daadhii moo biiraa dhuguu barbaadda? – “Do you want tej or beer?”
Numeral
Numerals are words representing numbers, and they can be cardinal or ordinal
numbers [1]. Afan Oromo cardinal numbers are the counting numbers; they show
quantity. Ordinal numbers, on the other hand, tell the order of things and their
rank. In Afan Oromo, the ordinal numbers are formed from the cardinal numbers by
adding the suffix {–affaa} [5][1].
The examples below use numbers in different ways and places to demonstrate how they
behave in a sentence.
“Isheen afaan torba dubbatti”. [She speaks seven languages]
“kun barnota ko lammaffaa dha”. [This is my second lesson]
As in English, compound Afan Oromo numerals are written as separate words. Example:
dhibba lama “two hundred” and dhibba lam-affaa “two hundredth”.
There are numerals that indicate distribution. These numerals are called distributive
numerals. Example:
tokko tokko “one one”, sadi sadi “three three”
There are also special numerals in Afan Oromo that correspond to English fractions such
as “half” and “quarter”. Example:
walakkaa “half”, sisoo “one third”.
2.5.3 Phrases Categories
A phrase can be defined as a syntactic combination of a word with one or more other
words. A phrase is constrained by two things [5]: its constituents and the lexical
categories, such as nouns, verbs, etc. Thus, we can determine
the number of phrases by the number of words. Whether a structure is a phrase can be
checked using the following four guiding principles [4]. A structure is a phrase:
1. If the constituents of the phrase can be moved together to another place
without separation.
2. If the phrase can be replaced by a pronoun (for a noun phrase).
3. If the meaning of the phrase is corrupted when one of its constituents is
missing.
4. If inserting other words inside the phrase affects the meaning.
Based on the type of lexical categories in Afan Oromo, there are five phrase types in
the language [49]. They are briefly presented as follows.
Noun Phrase
A noun phrase is made up of one noun and one or more other lexical categories, which
may include another noun. For example, in the phrase ‘mana citaa’ [thatched house],
there are two nouns which make up the noun phrase: mana [house] and citaa [thatched] [5].
Noun phrases, and phrases in general, must meet the above criteria to be called
phrases. In the sentence ‘Alamuun mana Magarsaa deeme’ (Alemu has gone to
Megersa’s house), ‘mana Magarsaa’ [Megersa’s house] is a noun phrase. To check whether
it is really a phrase, we can apply the above criteria. According to [5], of the
following arrangements only the first is possible, for the above reasons.
- Deeme Alamuun mana Magarsaa. (legal movement)
- * Magarsaa Alamuun mana deeme. (illegal movement because of rules 1 and 4 above)
- * Mana Alamuun Magarsaa deeme. (illegal movement because of rules 1 and 4 above)
The sentences marked with asterisks (*) have illegal phrase constructions because they
violate the four rules specified by Baye [49]. Thus, we have verified that “mana
Magarsaa” is a phrasal structure.
As indicated above, nouns can appear in a number of positions, such as the positions
of the three nouns in “Abbabaan kitaaba Caalaaf bite” [Abebe bought Chala a book].
These same positions allow sequences of a noun followed by an article, as in
“Abbabaan kitaabicha Caalaaf kenne” [Abebe gave Chala the book]. Since the
position of the article can also be filled by demonstratives (kun, sun, etc.),
possessives (koo, kee, keessan, etc.), or quantifiers (e.g. xiqqoo), the more general
term “Determiner” is used.
Verbal Phrase
A verb phrase (VP) is composed of a verb as head and other constituents such as
complements, modifiers and specifiers. Afan Oromo verb phrases can be described by
dividing verbs into three categories: intransitive verbs, strictly transitive verbs
and ditransitive verbs [5]. Examples:
Inni dhufe “He came”
Tolosaan buna dhuge “Tolosa drank coffee”
Caalaan qalama naaf bite “Chala bought me a pen.”
Adverbial Phrase
An adverbial phrase is made up of one adverb as head word and one or more other lexical
categories, possibly including other adverbs, as modifiers and specifiers. For
example, in Afan Oromo it is possible to have two adverbs in an adverb phrase, as in
the phrase ‘kaleessa galgala’ [yesterday night]. As indicated above, adverbs and their
phrases are used to modify verbs. Hence, they precede verbs in a sentence.
Adjectival Phrase
An adjective phrase is a group of words that describes a noun or pronoun in a sentence.
In adjective phrases, nouns can act as adjectives, as in ‘mana Magarsaa’ ‘Megersa’s
house’, and verbs can act as adjectives, as in ‘farda bite’ ‘bought horse’.
Adpositional Phrase
Adpositional phrases are combinations of nouns and adpositions. They usually
specify a verb phrase. This phrasal category is sometimes called adpositional objects
[5][49].
Inni kara mana deeme [He went to the house]
Lammaan qalama Caaltuu-f bite. [Lemma bought a pen for Chaltu].
An adposition in an adpositional phrase can either stand independently, as in the first
example, or be affixed to the adpositional object, as in the second example above.
2.6 Afan Oromo Sentences
An Afan Oromo sentence is made up of a word, a phrase, one clause or more than one
clause. That is, sentences in Afan Oromo are formed by combining zero or
more noun phrases with one or more verb phrases. A sentence is a group of words or
phrases that is complete in itself, conveying a statement, question, exclamation, or
command and typically containing a subject and predicate. A sentence can also be
considered a special kind of phrase which consists of a noun phrase (NP) and a verb
phrase (VP). This is a general representation of sentences of all types in the Afan
Oromo language. A sentence can be either simple or complex, depending on the number of
verbs it contains.
2.6.1 Simple Sentences
Simple sentences in Afan Oromo are sentences which contain only one verb and which
have a full meaning. A simple sentence can be constructed from an NP followed by a VP
which contains only a single verb. Example:
Caalaan dhufe – “Chala came”.
Here the sentence contains only one verb, ‘dhufe’. Simple sentences can be declarative,
interrogative, imperative or exclamatory sentences. All these types of sentences
are discussed below.
Declarative sentences (Hima Addeessaa yookin Himaamsa)
A sentence that makes a statement, in contrast to a command, question or exclamation,
is a declarative sentence. It always ends with a period (.), as in English; the
equivalent mark in Amharic is (::). Declarative sentences are used to convey
information. They can be positive or negative. Negative sentences simply
negate a declarative statement made about something. Example:
“Caaltuun mana jirti”, ‘Chaltu is at home’.
Here the sentence is declarative because it describes where Chaltu is. Another example is
“Tolosaan mana barumsaa hindeemne”, ‘Tolosa did not go to school’.
In this example, the sentence is a negative declarative sentence. The verb hindeemne
‘did not go’ is negated by the prefix hin- ‘not’.
Interrogative sentences (Hima Gaaffii)
In Afan Oromo, interrogative sentences are sentences that form a question. The
question can ask about something known, in order to be sure, or about something
unknown. These types of sentences always end with the question mark
(‘?’). Example: Guyyaan har’aa maali? ‘What is the day today?’.
Interrogative sentences contain interrogative pronouns such as eenyu? ‘who’,
yoom? ‘when’, maal? ‘what’, meeqa? ‘how many’, eessa? ‘where’, etc.
Imperative sentences (Hima Ajajaa)
When someone wants to give instructions or commands, imperative sentences are
used. Most of the time, the subjects of imperative sentences are second person
pronouns. However, when the command is addressed to a third person, the subject of
the sentence can be a third person pronoun or a noun. Example:
Hojii manaa hojjadhu. ‘Do homework.’
Here the subject is ‘you’, the second person, for both feminine and masculine, singular
and plural.
Exclamatory sentences (Hima Raajeffannoo)
In Afan Oromo, an exclamatory sentence is a type of simple sentence that expresses
strong feelings (excitement or emotions) by making an exclamation, in contrast to
sentences that make a statement, express a command, or ask a question. Exclamatory
sentences rarely appear in academic writing, unless they are part of quoted
material. Example:
ajaa’iba kuni! ‘This is a surprise’.
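The surface cues described for these sentence types (final punctuation and
interrogative pronouns) can be sketched as a simple pre-classification step. The
function name and the decision order are assumptions made for this illustration.
Punctuation alone cannot separate declarative from imperative sentences, so the sketch
leaves that distinction open.

```python
# Interrogative pronouns listed above.
INTERROGATIVE_PRONOUNS = {"eenyu", "yoom", "maal", "meeqa", "eessa"}


def sentence_type(sentence):
    """Guess the simple-sentence type from surface cues only (naive heuristic)."""
    text = sentence.strip()
    # Lower-cased words with surrounding punctuation removed.
    words = {word.strip("?.!,").lower() for word in text.split()}
    if text.endswith("?") or words & INTERROGATIVE_PRONOUNS:
        return "interrogative"
    if text.endswith("!"):
        return "exclamatory"
    # Telling these two apart would need verb morphology, which is not modeled here.
    return "declarative or imperative"
```

Such a step could route a sentence to the relevant grammar rules before parsing, but
it is only as reliable as the punctuation of the input text.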
2.6.2 Complex Sentences
A complex sentence in Afan Oromo grammar is formed from a complex noun phrase, a
complex verb phrase, or both. In other words, a complex sentence can have a complex
NP and a simple VP, a simple NP and a complex VP, or both a complex NP and a complex
VP. Complex NPs contain at least one embedded sentence, which can be a complement
or another type of phrase. Complex VPs, on the other hand, contain at least one
sentence or more than one verb. On this basis, a complex sentence is constructed
from one independent clause and one or more dependent clauses. A sentence which has
one or more independent clauses and two or more dependent clauses is called a
compound-complex sentence. The focus of this work, however, is the complex sentence
type which is constructed from one independent and one or more dependent clauses.
Example:
‘yoo dhufuuf taate, ganamaan koottu.’
‘Qilleensarras kaattu, lafarras arreeddu walgeettin teenya Finfinnee dha.’
2.7 Summary
In order to develop a sentence parser for any natural language, a grammar formalism and
parsing methods are needed. Hence, several grammar formalisms and parsing
algorithms have been proposed by different scholars. Some of them, namely Context Free
Grammar, Transition Network Grammar, Probabilistic Context Free Grammar, Context
Sensitive Grammar, and Unification Based Grammar, were discussed in this chapter.
Among the parsing strategies, stochastic and rule-based approaches were presented.
Another point discussed in this chapter is the grammar of the Afan Oromo
language in detail. We started our discussion from the word order of the sentence in
Afan Oromo grammar. We also discussed the word classes and the types of phrases
as classified by language scholars.
We have also presented some works that are closely related to our thesis.
Parsing natural language text (sentences) is challenging because of problems such as
ambiguity and inefficiency. Parsing is considered an important intermediate stage for
semantic analysis in natural language processing applications such as information
retrieval (IR), information extraction (IE) and question answering (QA) [32]. The
literature shows that sentence parsers have been developed with different approaches
for a number of languages around the world. Among the sentence parsers we reviewed
is the first attempt for Afan Oromo, an automatic sentence parser for simple
declarative sentences.
We also reviewed sentence parsers for other languages. Some of them use
treebanks to develop a sentence parser, whereas others use rule-based approaches
with different parsing strategies. A treebank allows the parser to derive its
grammar and the probabilities of the possible parses of a sentence. Parsers built from a
treebank are often the best parsers because they can be trained directly with
machine learning. However, creating the requisite training corpus, or treebank, is a
difficult task. Because no such treebank exists for Afan Oromo, it is
difficult to apply this approach to an Afan Oromo sentence parser. The
rule-based approach also has limitations. In top-down
parsing, there are problems such as duplicated work, backtracking to the point of the wrong
decision each time the parser chooses a wrong path, and getting stuck when the
grammar contains left-recursive rules. In contrast, the bottom-up parser checks
the input sentence only once and builds each constituent exactly once. Even though both
bottom-up and top-down parsers have advantages, they are inefficient and have
worst-case exponential run time, because the parser tends to try the same matches repeatedly
and thus duplicates much of its work unnecessarily. To avoid this problem, a data structure
called a chart is introduced, which allows the parser to store the partial results of the
matching it has done so far so that the work need not be repeated [39].
Finally, chart parsing can avoid the problems of the plain top-down and bottom-up
approaches, and it is an efficient parsing algorithm. Hence, the top-down chart parsing
algorithm is selected to parse simple and complex sentences of the Afan Oromo language
in our research work.
CHAPTER THREE
3 DESIGN OF AFAN OROMO SENTENCE PARSER
In this chapter, we discuss the main components of the Afan Oromo sentence parser and
the interaction between the components, and we present the architecture of the
parser.
3.1 Components of Afan Oromo Sentence Parser (AOSP)
A rule-based sentence parser has three basic components: the grammar
rules, the lexicon and the parsing algorithm. The grammar component is responsible for
storing grammar rules written in one of the grammatical formalisms. The parser must be
supplied with these grammar rules, and it then parses sentences
according to them. The lexicon component
is used as a dictionary for the parser; it stores lexical rules, which are kept separate from
the grammar rules. Lexical rules specify the possible categories of each word, which
improves the efficiency of the parser.
According to Abdurehman [4], the structure of the grammar rules, the lexicon and the
parsing algorithm that make up a rule-based parser differs from system to
system and from one language to another. Hence, our system has two components in
addition to the basic ones: a Sentence Tokenizer and a Lexicon
Generator. The Sentence Tokenizer splits the input sentence into
individual words, whereas the Lexicon Generator avoids the manual preparation
of the lexical rules by generating them automatically from a sample corpus in
the form tag -> ‘word’. We used a corpus that is manually annotated with POS
tags; a POS tagger is not included in our system.
The grammar rule component of the Afan Oromo sentence parser stores the Afan
Oromo Context Free Grammar rules. The grammar rules are written so that
they represent the structure of Afan Oromo sentences in terms of phrases and
word categories, since we use the Context Free Grammar formalism. The lexicon
component stores a list of lexical rules, which specify the possible categories of each
word. The parser engine accepts an input sentence, consults the grammar rules of
Afan Oromo sentences, retrieves the POS tag of each word in the sentence from the
lexicon, and finally returns the parsed sentence as output.
[Figure: architecture diagram with the components Corpus Data, Lexical Generator,
Lexicon, CFG for Afan Oromo, Input sentences, Sentence Tokenizer, Chart Parser and
Output]
Figure 3.1: Architecture of Sentence Parser for Afan Oromo Language
3.2 Context Free Grammar (CFG)
A grammar of a human language is an understandable specification of the language's
syntax. According to Abdurheman, grammar is not concerned with semantics. In other
words, a grammar is a collection of rules that describes the well-formed sentences of
a language. An efficient parser can be constructed automatically from a properly
designed grammar. The idea of a context-free grammar should be familiar from formal
language theory. In natural languages, a Context Free Grammar is
a formal system that describes a language by specifying how any legal text can be
derived from a distinguished symbol called the start symbol [50]. The CFG rules for
this thesis are extracted from sentences collected from Afan Oromo grammar books and
previous research papers, which were already tagged manually by researchers. The
rules are extracted to represent the grammatical structure of valid
sentences as fully as possible. As discussed earlier, there are two types of
sentences in the Afan Oromo language, simple and complex. Hence, the
context free grammar in this thesis incorporates rules for both sentence types to
represent their grammatical structures.
3.3 Sentence Tokenizer
Tokenization is an early step of processing that divides the input text into units called
tokens, where each is either a word or something else, such as a number or a punctuation
mark [40]. It is also described in [48] as one of the more basic operations that can be
applied to a text: breaking up a stream of
characters into words, punctuation marks, numbers and other discrete items. For this
reason, we need a tokenizer that is responsible for splitting the sentence into words.
The tokenizer scans the sentence from beginning to end, and whenever
it finds a space it treats the text before the space as one word. This step is
implemented in Python code that splits the input sentence into its words.
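The whitespace-based splitting described above can be sketched as follows. This is a minimal illustration and not the thesis code: the function name `tokenize` is hypothetical, and the handling of sentence-final punctuation is an added assumption that the text does not describe.

```python
def tokenize(sentence):
    """Split a sentence into word tokens on whitespace.

    Assumption (not from the thesis): punctuation attached to the end of
    a word is split off as a separate token.
    """
    tokens = []
    for chunk in sentence.split():
        trailing = []
        # Peel punctuation marks off the end of the chunk, one at a time.
        while chunk and chunk[-1] in ".,!?;:":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        # Restore the original left-to-right order of the punctuation.
        tokens.extend(reversed(trailing))
    return tokens


print(tokenize("Inni gara manaa deeme"))
```

Word-internal apostrophes, as in ajaa’iba, are left untouched because only trailing punctuation is examined.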
3.4 Lexicon Generator
A lexicon is the knowledge that a native speaker has about a language. This includes
information about the form and meanings of words and phrases, lexical categorization,
the appropriate usage of words and phrases, and the relationships between words and
phrases. A lexicon is essentially a catalogue of a language’s words, whereas a grammar
is a system of rules that allows those words to be combined into meaningful sentences. A
lexicon is the vocabulary of a person, language, or branch of knowledge. It is also
thought to include bound morphemes, which cannot stand alone as words, such as most
affixes [51]. Because words tend to follow regular morphological patterns, many
forms of words are not explicitly included in the lexicon. For example, the verb
‘deem-’ (dependent root form) has different forms such as ‘deemte’, ‘deemuu’ and
‘deeme’ [5].
From the sample corpus we collected, we prepared the lexicon, which is a list of
words with their POS tag names. Preparing the lexicon manually by typing each word and
its word category is error-prone and time-consuming, especially for a large corpus.
Therefore, it is better to have an automatic lexicon generator that produces
the result correctly and in a short time. We prepared a simple algorithm and then
wrote a small Python program for a lexicon generator that outputs lexical rules from
tagged sentences automatically for later use in parsing Afan Oromo sentences. The
lexicon generator reads the POS-tagged sentences and generates results of the form (tag name
-> ‘word’).
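As a rough illustration of this step, the sketch below turns tagged sentences into lexical rules. Several details are assumptions: the word/TAG input format, the helper names `split_tagged` and `generate_lexical_rules`, and the tags in the example. `split_tagged` is a pure-Python stand-in for the kind of split that NLTK's `str2tuple` performs, so the sketch runs without NLTK.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' into (word, tag); stand-in for NLTK's str2tuple."""
    word, _, tag = token.rpartition(sep)
    return word, tag.upper()


def generate_lexical_rules(tagged_sentences):
    """Return sorted, de-duplicated rules of the form: TAG -> 'word'."""
    rules = set()
    for sentence in tagged_sentences:
        for token in sentence.split():
            word, tag = split_tagged(token)
            if word and tag:  # skip tokens that carry no tag
                rules.add(f"{tag} -> '{word}'")
    return sorted(rules)


# Hypothetical tagged input; the tags are illustrative only.
corpus = ["Tolosaan/NN mana/NN guddaa/JJ qaba/VB"]
for rule in generate_lexical_rules(corpus):
    print(rule)
```

Collecting the rules in a set means a word that occurs many times in the corpus yields a single rule per tag.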
3.5 AOSP Chart Parser
This component is the main part of the proposed system. Here we describe how the
chart parsing algorithm is applied in combination with the other
components. According to Zhu [52], general search methods (top-down or bottom-up)
are not the best choice for syntactic parsing, because the same syntactic constituent may
be re-derived many times as part of larger constituents due to the local ambiguities of
the grammar. Hence, we have adopted chart parsing. A chart is a form of
well-formed substring table. Chart parsing is a common context-free parsing algorithm
that uses dynamic programming techniques to avoid duplication of effort by ignoring
differences in derivation where they have no effect [53]. There is no backtracking, and
everything that is put in the chart stays there. In other words, chart parsing does not throw
away any information: it keeps a record (a chart) of all the structure
found so far [7]. Chart parsing has two forms, passive and active. In
passive chart parsing, the chart is simply a record of all constituents that have been
recognized. Active chart parsing, in addition to the complete
constituents that have been found, records what we are actually looking for and how
much of it we have found so far. Such information is recorded in active edges or active
arcs, e.g. S → NP . VP. The arc label S → NP . VP means:
“we are trying to build an S consisting of an NP followed by a VP. So far, we have
found an NP arc, and we are still looking for the VP”. So, the insignificant-looking “.”
(dot symbol) in the middle of the rule is very important: it marks the boundary between
what we have found so far and what we are still looking for, i.e. the boundary between
the passive and the active part of the arc. Constituents before the dot correspond to
passive edges, and active edges can be combined with passive edges to create new edges.
According to [4], the fundamental rule for combining a passive edge with an active
edge works as follows. Take the active edge S → A . c B, and suppose that there is a
passive edge that starts where the active edge ends and has category c on its left-hand
side, c → Z . (with the dot at the end). The dot in the
active edge is then moved one category forward (i.e., S → A c . B).
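The dotted-edge notation and the fundamental rule can be illustrated with a small sketch. The `Edge` class, its field names and the `fundamental_rule` function are hypothetical names chosen for this illustration; they are not taken from the thesis implementation or from NLTK.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edge:
    lhs: str               # category being built, e.g. "S"
    rhs: Tuple[str, ...]   # right-hand side of the rule
    dot: int               # number of rhs symbols already found
    start: int             # position where the edge starts
    end: int               # position where the edge ends

    def is_passive(self) -> bool:
        # A passive (complete) edge has the dot at the end of the rule.
        return self.dot == len(self.rhs)

    def next_symbol(self) -> Optional[str]:
        return None if self.is_passive() else self.rhs[self.dot]


def fundamental_rule(active: Edge, passive: Edge) -> Optional[Edge]:
    """Combine an active edge with an adjacent passive edge, if possible."""
    if (not active.is_passive() and passive.is_passive()
            and active.end == passive.start
            and active.next_symbol() == passive.lhs):
        # Move the dot one category forward and extend the span.
        return Edge(active.lhs, active.rhs, active.dot + 1,
                    active.start, passive.end)
    return None


# S -> NP . VP over (0, 1) combined with VP -> VB AV . over (1, 3)
active = Edge("S", ("NP", "VP"), 1, 0, 1)
passive = Edge("VP", ("VB", "AV"), 2, 1, 3)
print(fundamental_rule(active, passive))
```

Neither input edge is modified or removed; the rule only produces a new edge, which matches the observation that everything put in the chart stays there.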
Our active chart parser makes use of an agenda. An agenda is a
data structure that keeps track of the things that we still have to do, and it is used to
prioritize the constituents to be processed. When new edges are created, we have to
remember to look at them to see whether they can be combined with other
edges in any way [4]. So that newly created edges are not forgotten, we store them in the
agenda. We then take one edge at a time from the agenda, add it to the chart and use it to
build new edges.
Chart parsing can apply the parsing algorithm in two main ways:
top-down and bottom-up. Bottom-up chart parsing checks the input sentence
and builds each constituent exactly once, so it also avoids duplication of effort.
However, bottom-up chart parsing may build constituents that cannot be used legally,
whereas in top-down chart parsing only grammar rules that can be legally applied are
put on the chart [52]. In addition, the bottom-up algorithm reads the rules right to
left and starts with the information in passive edges, whereas top-down chart parsing reads
the rules left to right and starts with the information in active edges.
Top-down search uses the grammar rules to create
active edges. The agenda must contain at least one active edge to start the parsing process.
This active edge starts at position zero of the sentence S, where the input
sentence begins. The chart remains empty until an active edge
is added to it. Thus, in our work we propose Context Free Grammar for the grammar
rules and top-down chart parsing for the parsing algorithm. CFG
is selected as the grammar formalism because it is easy to maintain, easy to
understand and easy to extend with new language features. It also imparts structure to the
language and allows an efficient parser to be built automatically [50]. We have
chosen top-down chart parsing because it does well if there is useful grammar-driven
control [4], and it has the advantages of both top-down and bottom-up parsing.
In general, as discussed in Chapter 2, a chart parser is driven by an agenda of
completed constituents and by arc extension, which combines active arcs with
constituents when they are added to the chart. The technique of extending arcs with
constituents can be applied in both the bottom-up and the top-down approach. The
difference lies in how new arcs are generated from the grammar. In the bottom-up
approach, new active arcs are generated whenever a completed constituent is added that
could be the first constituent of the right-hand side of a rule. In the top-down
approach, new active arcs are generated whenever a new active arc is added to the chart.
For this reason, the number of constituents generated by a top-down chart parser is
smaller than the number generated by a bottom-up chart
parser [4][7]. Therefore, the top-down chart parser is considerably more efficient than
the bottom-up chart parser and the traditional (top-down and bottom-up) parsing approaches.
For the aforementioned reasons, we employ top-down chart parsing in this
thesis; its implementation is discussed in Chapter 4.
3.6 Summary
To sum up, in this chapter we have discussed in detail the important components of the
sentence parser proposed in this work. In addition, we presented the
architecture of the Afan Oromo sentence parser. The proposed rule-based parser has
three basic components: the grammar rules, the lexicon and the parsing algorithm. The
grammar component is responsible for storing grammar rules written as context free
grammar (CFG) rules, CFG being the grammatical formalism selected in this work to
represent the grammar of the Afan Oromo language. The lexicon component is used
as a dictionary for the parser; it stores lexical rules, which are kept separate from the
grammar rules. The parsing algorithm is the method that determines the exact structure
of the sentence. The parser also has two additional components: the sentence
tokenizer, an early processing step that splits the input sentences into
units called tokens, and the lexical generator, which generates lexical rules
automatically from the sample tagged corpus.
CHAPTER FOUR
4 IMPLEMENTATION RESULTS AND DISCUSSION
In this chapter, we present the detailed implementation of the parser. The development
environment, corpus collection and preparation, extraction of the context free grammar
rules and preparation of the lexicon are discussed. Because the main objective of this
study is to develop a sentence parser for the Afan Oromo language using the top-down
chart parsing algorithm, the top-down chart parser is discussed in detail. The
evaluation of the lexicon generator and the chart parser is also presented. The final
section presents the discussion.
4.1 Development Environment
We have developed a sentence parser that takes an input sentence from the user,
parses the sentence according to the CFG and the lexicon using the parsing method, and
finally delivers the output to the user. The parser also has an automatic lexicon generator
as one of its components. The lexicon generator takes manually tagged sentences and produces
lexical rules automatically. We used the Python programming language and NLTK for
the implementation.
4.2 Corpus Preparation
For our study, we collected 500 simple and complex sentences from different
sources. Most of the sentences, around 70%, are simple sentences, and the remaining
30% are complex sentences. This is because there are several simple sentence types
in Afan Oromo: declarative, exclamatory, interrogative and imperative.
Some of the sentences are taken from previous research work
in the area by Diriba [5], some are taken from the Seer-luga Afan Oromo (Afan
Oromo grammar) book, and the rest are taken from different Afan Oromo written
documents. The sentences were tagged manually using the tag set developed by
Abraham [12] based on the rules of the Afan Oromo language and verified by linguists of
the Afan Oromo language. The POS tag set used in our work is given in Appendix 1.
4.3 Grammar Rules Extraction
In Afan Oromo sentence parsing, the lexicon and the Context Free Grammar rules are
both necessary. The CFG rules supply the parser with a set of grammar
rules and enable it to parse sentences based on those rules. In order to extract the grammar
rules from the manually tagged sentences, we reviewed and identified the
morphological properties of the language and its word order. The Context Free Grammar
rules were then extracted manually from the collected corpus after studying the grammar of
the Afan Oromo language, and they were verified with the help of linguists.
CFG rules describe which sentence structures can be built from which sequences of
words. In other words, according to [36], the grammar rules specify how we can
determine whether a given sentence is valid or not. We used the five types of
Afan Oromo phrases identified by Baye [49], namely noun phrase, verb phrase,
adpositional phrase, adjectival phrase and adverbial phrase, together with the possible
combinations of words from which these phrases can be formed. Our
sample corpus contains manually POS-tagged sentences, so that the word
category of each word in a sentence is known. To construct the CFG, we examined the
sentences and identified the part-of-speech tags that make up each phrase (Noun
Phrase and Verb Phrase). A number of sentences have a similar phrase structure, and
others share common sub-phrases, so a single CFG rule can represent many
sentences. Once we had the structure of a sentence, we transformed it into the proper
CFG format, which is a non-terminal followed by an arrow and then the terminals or
other non-terminals that can replace the non-terminal before the arrow, for example
S -> NP VP, NP -> NN JJ and VP -> NN VB. Sample CFG rules are given in
Appendix 2.
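The rule format above can be loaded into a simple data structure, as in the following sketch. The three rules are the examples quoted in the text; the function name `load_grammar` and the support for `|` alternatives are assumptions made for illustration (NLTK provides its own grammar reader, which is not used here so that the sketch stays self-contained).

```python
from collections import defaultdict

# Example rules quoted in the text, one per line.
RULES = """
S -> NP VP
NP -> NN JJ
VP -> NN VB
"""


def load_grammar(text):
    """Map each left-hand-side symbol to its list of right-hand sides."""
    grammar = defaultdict(list)
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank or malformed lines
        lhs, _, rhs = line.partition("->")
        # Assumption: "|" separates alternative right-hand sides.
        for alternative in rhs.split("|"):
            grammar[lhs.strip()].append(tuple(alternative.split()))
    return dict(grammar)


grammar = load_grammar(RULES)
print(grammar["S"])   # [('NP', 'VP')]
```

Storing each right-hand side as a tuple keeps the rules hashable, which is convenient when they later become part of chart edges.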
The CFG begins with the non-terminal S, which represents a sentence, followed by the
phrases that can form S, most often a noun phrase (NP) and a verb phrase (VP). Each phrase
on the right-hand side of S is then expanded by other non-terminals
(such as NN, JJ, VB, AV, etc.) in subsequent rules. As discussed in detail in
Chapter 2, the Afan Oromo language has five
phrase categories. The tag name of each Afan Oromo phrase type in this work is shown in
Table 4.1.
Table 4.1: Tag Name of Afan Oromo Phrases
Phrase Name          Tag Name
Noun Phrase          NP
Verb Phrase          VP
Adverb Phrase        ADP
Adjective Phrase     JJP
Adposition Phrase    APCP
4.4 Generating Lexical Rules
To generate the lexical rules, we used the same corpus that was used for CFG
extraction. The construction of the lexicon is a one-time process that is done at the very
beginning of the parser implementation, but its result is needed whenever
a sentence is parsed. During lexicon construction, the lexicon generator reads the corpus
from the local disk and goes through each sentence. While scanning each sentence, the
generator identifies the POS tag name and the word that will be associated with it in
the lexical rules. The output of the lexicon generator is a set of formal lexical rules
that are stored in the lexicon in the form tag -> “word”. The simple algorithm
developed for the automatic lexicon generator is shown in Algorithm 4.1. Sample
lexical rules generated by the lexicon generator from the sample corpus are given in
Appendix 3.
Algorithm 4.1: Lexical Generator Algorithm
Input: Afan Oromo tagged sentences
Read the tagged sample corpus from local disk
Scan each sentence of the sample corpus
Identify the part of speech (POS) tag name and the word that will be associated with it
For each word in each sentence of the sample corpus
    Call the str2tuple() built-in function from a Python library
    If the word and its tag are split in the form (word, tag)
        Reverse the form into (tag, word)
        Generate the result in the proper format
        Return the result
    End If
End For
Output: lexical rules
4.5 Implementation of Chart Parser
Chart parsers use a set of rules to heuristically decide when an edge should be added to
a chart. This set of rules, together with a specification of when they should be applied, forms
a parsing strategy [39]. The rule that is used by every chart parser is called the
fundamental rule: it combines an active edge with an adjacent passive edge into a new
edge. In the new edge, the dot has moved one place to the right, and the span of
the new edge is the combined span of the original edges. When we add this new edge,
we do not remove the other two, because they might be used again. The
chart parsing algorithm selected for our work, top-down chart parsing, works
in a similar way to a recursive descent parser. It starts with the top-level
goal of finding an S, which is broken down into the sub-goals of finding constituents
such as NP and VP as predicted by the grammar rules of the Afan Oromo language. Hence, in
order to apply the top-down chart parsing algorithm, we use the fundamental rule and
three other rules:
The Top-Down Initialization Rule
The Top-Down Expand Rule
The Top-Down Match Rule
The Top-Down Initialization Rule: It captures the fact that the root of any parse must
be the start symbol S. For each production S → α, add the self-loop edge [S → . α, (0, 0)].
For the production S → NP VP, this edge predicts that an NP and a VP are to be found
starting at position 0. In order to find an NP, we need to
invoke a production that has NP on its left-hand side. This step is done by the next rule,
the Top-Down Expand Rule.
The Top-Down Expand Rule: This rule tells us that if the chart contains an incomplete
edge whose dot is followed by a non-terminal, then the parser should add every self-loop
edge licensed by the grammar whose left-hand side is that non-terminal.
The Top-Down Match Rule: This rule allows the predictions of the
grammar to be matched against the input string. If the chart contains an incomplete edge
whose dot is followed by a terminal, then the parser should add an edge if the terminal
corresponds to the current input symbol.
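The three rules, together with the fundamental rule and an agenda, can be put together in a compact recognizer. The sketch below is an illustration under stated assumptions, not the thesis implementation: the names `Edge` and `recognize`, the toy grammar, the toy lexicon and its POS tags are all hypothetical, POS tags are treated as the terminals matched against the input words, and the function only reports whether a sentence is accepted instead of building a parse tree.

```python
from collections import deque, namedtuple

Edge = namedtuple("Edge", "lhs rhs dot start end")


def recognize(tokens, grammar, lexicon, start="S"):
    """Return True if the tokens form a sentence licensed by the grammar."""
    chart, agenda = set(), deque()
    # Initialization rule: one self-loop edge per production of S.
    for rhs in grammar[start]:
        agenda.append(Edge(start, rhs, 0, 0, 0))
    while agenda:
        edge = agenda.popleft()
        if edge in chart:
            continue  # never process the same edge twice
        chart.add(edge)
        if edge.dot < len(edge.rhs):  # active edge
            nxt = edge.rhs[edge.dot]
            if nxt in grammar:
                # Expand rule: predict every production of the non-terminal.
                for rhs in grammar[nxt]:
                    agenda.append(Edge(nxt, rhs, 0, edge.end, edge.end))
                # Fundamental rule with passive edges already in the chart.
                for other in list(chart):
                    if (other.lhs == nxt and other.dot == len(other.rhs)
                            and other.start == edge.end):
                        agenda.append(edge._replace(dot=edge.dot + 1,
                                                    end=other.end))
            elif (edge.end < len(tokens)
                  and nxt in lexicon.get(tokens[edge.end], ())):
                # Match rule: the predicted tag fits the current word.
                agenda.append(edge._replace(dot=edge.dot + 1,
                                            end=edge.end + 1))
        else:  # passive edge
            # Fundamental rule with active edges waiting for this category.
            for other in list(chart):
                if (other.dot < len(other.rhs)
                        and other.rhs[other.dot] == edge.lhs
                        and other.end == edge.start):
                    agenda.append(other._replace(dot=other.dot + 1,
                                                 end=edge.end))
    return any(e.lhs == start and e.dot == len(e.rhs)
               and e.start == 0 and e.end == len(tokens) for e in chart)


# Hypothetical toy grammar and lexicon, for illustration only.
grammar = {"S": [("NP", "VP")],
           "NP": [("NN",), ("NN", "JJ")],
           "VP": [("NP", "VB")]}
lexicon = {"Tolosaan": {"NN"}, "mana": {"NN"},
           "guddaa": {"JJ"}, "qaba": {"VB"}}
print(recognize("Tolosaan mana guddaa qaba".split(), grammar, lexicon))  # True
```

Because an edge is skipped when it is already on the chart, repeated predictions of the same rule at the same position are processed only once, so left-recursive rules do not cause an endless loop in this sketch.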
The main task of this component is to parse the user's Afan Oromo sentence with the
top-down chart parser. The parser interacts with the other components in order to parse the
sentence. In our study, the parser uses two basic references at parsing time:
the Context Free Grammar rules and the lexical rules. Initially, the parser accepts the input
sentence produced by the tokenizer. Before the sentence goes through the parsing
process, the parser loads the grammar rules and the lexical rules. Then the
parser scans through the sentence and requests the POS tag of each word from the
lexicon.
After that, the chart is initialized with an active edge, which is a grammar rule that
has the symbol S on its left-hand side. Active chart parsing keeps a record not only
of the complete constituents that we have found, but also of what we are actually looking
for and how much of it we have found so far. Such information is recorded in active
edges or active arcs, e.g. S → NP . VP, which means: “we are trying to build an
S consisting of an NP followed by a VP. So far, we have found an NP arc, and we are
still looking for the VP”. So, the insignificant-looking “.” (dot symbol) in the middle
of the rule is very important: it marks the boundary between what we have found so far
and what we are still looking for, i.e. the boundary between the passive and the active
part of the arc. The
fundamental rule for combining a passive edge with an active edge can be
illustrated with the edge S → NP . VP. Suppose that the VP is expanded by the rule
VP → VB AV and that there is a passive edge of category VB that starts where
the active edge ends. The dot in
the active edge is then moved one category forward (i.e., VP → VB . AV, which means
S → NP VB . AV). Our active chart parser makes use of an
agenda. The agenda is also initialized with a grammar rule that has a non-terminal on
its left-hand side. This left-hand side non-terminal is the same as the non-terminal
that immediately follows the arrow in the grammar rule in the chart. The grammar rule
in the agenda moves to the chart to replace the first non-terminal on the right-hand
side of S if it can replace it; otherwise, further grammar rules are added to the agenda until a
grammar rule that can replace the non-terminal is found. If there is a terminal that can
replace the non-terminal, the parser replaces it and continues to the next
non-terminal. However, if the non-terminal in the chart cannot be replaced by a terminal, the
parser looks for other non-terminals that can replace it and that can be replaced by
terminals later on. This process continues until all non-terminals in S are replaced
by terminals and the grammatical structure of the sentence S is recognized. This means
that when new edges are created, the parser has to look at them to see whether they can be
combined with other edges to create further new edges. It stores them in the
agenda so that they are not forgotten. The agenda contains edges (grammar rules that can replace
non-terminals in the chart and create a new active edge or new grammar rule). The parser
then takes one edge at a time from the agenda, adds it to the chart and uses it to
build new edges. The top-down chart parsing algorithm, which we adopted from [4][7]
with a few modifications, is shown in Algorithm 4.2.
Algorithm 4.2: Top-Down Chart Parsing Algorithm for Afan Oromo Sentences
Input: Afan Oromo sentence from the user
Scan and tokenize the input sentence
Check whether each word of the sentence is in the lexicon
If the words of the user sentence are in the lexicon
    Take the sentence
    Initialize the chart and the agenda
    Repeat the following until the agenda becomes empty:
        a. Take the first arc (grammar rule) from the agenda
        b. Add the arc to the chart (if the arc is not already on the chart)
        c. Combine this arc with arcs from the chart and add the obtained arcs to the
           agenda
        d. Make a hypothesis about new constituents based on the arc and the rules of the
           grammar. Add these new arcs to the agenda.
    End repeat
    Check whether the chart contains a passive edge labeled S that spans from the first
    node to the last node.
    If the chart contains a passive edge that covers all nodes of the sentence
    then
        the parsing process succeeds,
    if not, the input sentence has a syntax error with respect to the grammatical
    production rules in the CFG.
    Return the parsed sentence (parse tree).
Output: Parse tree or "The sentence is not parsed"
4.6 Evaluations
In this section, we discuss the evaluation results of the lexicon generator and the chart
parser developed to parse Afan Oromo sentences, checking whether they produce
the correct and expected results. The results of the parser were compared against
sentences parsed manually by the researchers of this study. The comparison between the
output of the system and the manually parsed sentences was also made manually by the
researchers. Hence, most of the evaluation was performed manually. The
evaluation technique used to estimate the accuracy of the parser in this study is simply
to count the number of correctly parsed sentences and divide it by the total number of
parsed sentences.
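The accuracy measure described here is a simple ratio. As a small worked illustration (the function name `parsing_accuracy` is hypothetical), the sketch below reproduces the training-set percentages reported in Tables 4.2 and 4.3 from their sentence counts.

```python
def parsing_accuracy(correct, total):
    """Percentage of correctly parsed sentences among all parsed sentences."""
    if total <= 0:
        raise ValueError("total number of sentences must be positive")
    return 100.0 * correct / total


print(parsing_accuracy(350, 400))  # 87.5  (training set before error correction)
print(parsing_accuracy(393, 400))  # 98.25 (training set after error correction)
```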
4.6.1 Evaluation of Lexical Generator
We have developed a simple lexicon generator algorithm that constructs lexical rules
automatically from sample tagged sentences of the Afan Oromo language. This simple
algorithm was implemented in the Python programming language, and the lexical rules
are later used by the parser. The correctness of the lexical rules was inspected manually
by checking, against the manually tagged sentences, whether the words are categorized
in their proper word classes. Our lexicon generator generated
correct lexical rules from the manually tagged sentences, as expected, without any
error. Figure 4.1 shows the
result of the lexicon generator for five Afan Oromo sentences from the sample corpus.
The selected sample sentences are “Tulluun nama gurraacha dha” ‘Tullu is a black
man’, “Inni gara manaa deeme” ‘He went home’, “Inni kaleessa galgala dhufe” ‘He
came yesterday evening’, “Tolosaan mana guddaa qaba” ‘Tolosa has a big house’
and “Abdiisaan mana citaa ijaare” ‘Abdisa built a thatched house.’
Figure 4.1: Screenshot of Lexical Rules generated by the Lexicon Generator
4.6.2 Evaluation of AOSP Chart Parser
The developed chart parser uses only the extracted grammar rules and the lexicon
produced by the lexicon generator from the manually tagged sample sentences to parse
the input sentences. During parsing, the input sentence is first tokenized
word by word, and each word of the input sentence is checked against the
lexicon generated from the corpus. The sentence tokenizer did not encounter any error
in the parser, so it split the input sentences into words correctly in all cases. The
system was then run repeatedly on the training dataset. After correcting the
man-made errors in the manually tagged sentences, and using the CFG rules that were
manually extracted from the sample corpus, an accuracy of 98.25% was obtained.
In order to test the effectiveness of the parser, we used 100
other sentences selected from the corpus as a test set. On average, 20 sentences come from
each sentence type in the corpus. The correctness of the parser was examined by
inspecting its results manually. The output was checked with respect to the right
categorization of words in their proper word class, the right identification of sub-phrases
and main phrases, the right order of sub-phrases in building main phrases, and whether
all words and phrases are involved in the construction of the sentence S. Therefore,
before testing, we parsed the sentences manually on paper based on the linguists’
suggestions and comments, and then we compared the results of the chart parser for the
same sentences with what we had on paper. A result is considered wrong if it does not
satisfy one of the criteria we have set, or if the
parse tree is not displayed at all. The result obtained when the parser was trained
and run on the same data is shown in Table 4.2.
Table 4.2: Parsing result on the training set before error correction
Dataset         No. of sentences    No. of correctly parsed sentences    Accuracy in %
Training set    400                 350                                  87.5
As one would expect, the accuracy achieved should be high when a parser is trained and
tested on the same data. However, due to man-made errors in the manual tagging of the
sentences and in the context free grammar rules manually extracted from the sample
sentences, the accuracy was not as high as expected. Hence, to improve the accuracy of the
parser, we sent the corpus to a linguist in order to check the correctness of the
tagged sentences as well as of the grammar rules extracted by the researcher. The final
accuracy obtained on the training set after the errors were identified and corrected is
displayed in Table 4.3.
Table 4.3: Parsing results on the training set after correcting most of the errors
Dataset | No. of sentences | No. of correctly parsed sentences | Accuracy in %
Training set | 400 | 393 | 98.25%
On the other hand, the test on the unseen dataset (testing set) of the corpus gave the
following results for each sentence type. The result of the parser on the imperative
sentences of the testing dataset is shown in Table 4.4.
Figure 4.2: Screenshot of parsed imperative sentence
Table 4.4: Number of correctly parsed imperative sentences
Dataset | No. of imperative sentences | Correctly parsed sentences | Accuracy in %
Testing set | 20 | 20 | 100%
The accuracy obtained when testing the parser on imperative simple sentences is 100%.
The result of the parser on the exclamatory sentences of the testing dataset is shown in
Table 4.5.
Figure 4.3: Screenshot of parsed exclamatory sentence
Table 4.5: Number of correctly parsed exclamatory sentences
Dataset | No. of exclamatory sentences | Correctly parsed sentences | Accuracy in %
Testing set | 20 | 18 | 90%
The accuracy obtained when testing the parser on exclamatory simple sentences is 90%.
The result of the parser on the declarative sentences of the testing dataset is shown in
Table 4.6.
Figure 4.4: Screenshot of parsed declarative sentence
Table 4.6: Number of correctly parsed declarative sentences
Dataset | No. of declarative sentences | Correctly parsed sentences | Accuracy in %
Testing set | 20 | 19 | 95%
The accuracy obtained when testing the parser on declarative simple sentences is 95%.
Table 4.7 presents the result of the parser on the interrogative sentences of the testing
dataset.
Figure 4.5: Screenshot of parsed interrogative sentence
Table 4.7: Number of correctly parsed interrogative sentences
Dataset | No. of interrogative sentences | Correctly parsed sentences | Accuracy in %
Testing set | 20 | 20 | 100%
The accuracy obtained when testing the parser on interrogative simple sentences is 100%.
The result of the parser on the complex sentences of the testing dataset is shown in
Table 4.8.
Table 4.8: Number of correctly parsed complex sentences
Dataset | No. of complex sentences | Correctly parsed sentences | Accuracy in %
Testing set | 20 | 14 | 70%
The accuracy obtained when testing the parser on complex sentences is 70%. Man-made
errors during the manual parsing of complex sentences were identified as one cause of
the wrong parse assignments; incorrectly extracted context free grammar rules of the
language were another.
Figure 4.6: Screenshot of parsed complex sentence
Finally, the tests on the imperative and interrogative simple sentences gave 100%
accuracy because all of these sentences were parsed correctly. This could be attributed to
the fact that the sentences were uniform (they have the same kind of constructs), which
could generate the correct parse structure [24]. Thus, the results show that the chart
parser developed to parse Afan Oromo sentences obtained an accuracy of 98.25% on the
training set, after correction of the errors faced during the first training, and 91% on the
test set (91 of the 100 test sentences were parsed correctly), which is a promising result.
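The reported figures can be re-derived from the counts in Tables 4.2 to 4.8. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative and are not taken from the implementation of this study.

```python
# Correct/total counts copied from Tables 4.4-4.8 (test set).
test_results = {
    "imperative": (20, 20),
    "exclamatory": (18, 20),
    "declarative": (19, 20),
    "interrogative": (20, 20),
    "complex": (14, 20),
}


def accuracy(correct, total):
    """Return the accuracy as a percentage."""
    return 100.0 * correct / total


# Accuracy for each sentence type and for the whole test set.
per_type = {name: accuracy(c, t) for name, (c, t) in test_results.items()}
total_correct = sum(c for c, _ in test_results.values())
total_sentences = sum(t for _, t in test_results.values())
overall_test = accuracy(total_correct, total_sentences)

# Training-set counts copied from Tables 4.2 and 4.3.
training_before = accuracy(350, 400)
training_after = accuracy(393, 400)

for name, value in per_type.items():
    print(f"{name:13s} {value:.1f}%")
print(f"overall test  {overall_test:.2f}%")
print(f"training      {training_before:.2f}% -> {training_after:.2f}%")
```

Because every sentence type contributes the same number of test sentences, the overall test accuracy equals the plain average of the five per-type accuracies.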
4.7 Discussion
The sample corpus discussed in the above section was used for experimentation. It was
difficult for us to find experts on the Afan Oromo language while extracting the context
free grammar rules from the corpus. After a long search, we were able to reach linguists
of the Afan Oromo language, whom we approached via email. We then sent them the
corpus prepared for this study so that they could check its correctness. After that, each
sentence in the corpus was tagged and hand parsed by the researchers, based on
comments and suggestions from the linguists. The sentences in the selected corpus are
classified into a training dataset and a testing dataset.
However, man-made errors occurred during the training of the system: the manual
tagging and parsing process was one cause of wrongly parsed sentences. Some of the
context free grammar rules of the language were incorrectly extracted, which affected
the performance of the parser. Extracting CFG rules for Afan Oromo sentences was a
challenging task because the extraction was manual and because there is no standardized
and well-prepared Afan Oromo corpus, which is required to conduct conclusive
experimentation for the proposed parser. Nevertheless, we evaluated the parser based on
the extracted context free grammar (CFG) rules of the Afan Oromo language, comparing
the output of the parser with the sentences manually parsed by the researchers based on
the linguists' suggestions. There were also challenges during the performance evaluation
of the parser because of the absence of standard criteria for parsing Afan Oromo
sentences, that is, criteria stating that if certain conditions are satisfied the parse is
correct, and otherwise it is wrong.
Several steps were taken during the experiment to deal with incorrectly parsed sentences
before we obtained the final 98.25% accuracy of the parser on the training dataset. The
first was a review of the manually parsed sentences and correction of the errors found.
The second was a review of the context free grammar rules extracted manually from the
sample sentences. The third was correcting the errors in these rules and, where more than
one sentence yielded similar grammar rules, reducing the number of CFG rules. This is
because, as the number of grammar rules increases, the efficiency (in terms of time and
speed) and the accuracy of the parser decrease, as stated in the study of a top-down chart
parser for analyzing Arabic sentences [7]. A comparison of the top-down chart parser
developed in this study with the plain top-down and bottom-up approaches is not shown
due to time constraints.
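The reduction of the rule set was done by hand in this study. It can be illustrated with a minimal sketch, assuming that the rules read off each hand-parsed sentence are represented as (left-hand side, right-hand side) pairs; this representation and the name merge_rules are hypothetical. The example rules are taken from Appendix 2.

```python
def merge_rules(rules_per_sentence):
    """Merge per-sentence rule lists into one list without duplicates,
    keeping the order in which each rule was first seen."""
    seen = set()
    merged = []
    for rules in rules_per_sentence:
        for lhs, rhs in rules:
            key = (lhs, tuple(rhs))
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical rule lists for two hand-parsed sentences (rules as in Appendix 2).
extracted = [
    [("S", ["NP", "VP"]), ("NP", ["PNS"]), ("VP", ["ADP", "VB"]), ("ADP", ["AD", "AD"])],
    [("S", ["NP", "VP"]), ("NP", ["PNS"]), ("VP", ["NN", "VB"])],
]

grammar = merge_rules(extracted)
for lhs, rhs in grammar:
    print(f"{lhs} -> {' '.join(rhs)}")
```

The seven extracted rules collapse to five distinct rules, so the parser has fewer rules to try at each prediction step.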
To sum up, the developed parser has shown encouraging results in terms of covering both
simple and complex sentences and of the automatic construction of lexical rules from the
given sample corpus. An efficient parsing approach, top-down chart parsing, is used in
this study rather than only traditional parsing methods such as the plain top-down and
bottom-up approaches.
CHAPTER FIVE
5 CONCLUSIONS AND RECOMMENDATIONS
This chapter summarizes the study as a whole. It presents the conclusion, based on the
findings of the experiment, and the recommendations that the researchers suggest as
future work.
5.1 Conclusion
The common objective of natural language processing is understanding and extracting
meaning from natural language input [24]. This process involves transforming the
natural language into a form in which the meaning is explicit and is easily usable by the
application program. Sentence parsing is a process in which a flat input sentence is
converted into a hierarchical structure, or tree structure, that corresponds to the units of
meaning in the sentence. The important concepts related to sentence parsing are briefly
discussed in this study. Rule-based and statistical approaches, which are the major
approaches to sentence parsing, were discussed, and the rule-based approach was
employed for this study. The different grammatical formalisms used to represent phrase
structure rules in a language were briefly reviewed. Literature on Afan Oromo
grammatical categories was also reviewed, because knowledge of the grammar of the
language is the core component in designing a rule-based sentence parser. Features
indicated by grammatical categories, such as gender and tense, are not considered in
parsing. The algorithm, top-down chart parsing, and the components required by the
parser to access the knowledge base and parse input sentences with appropriate lexical
categories were presented. Because the developed parser is rule-based, it needs
components that enable the system to learn how to parse from grammar rules. This part
is composed of the lexicon generator, the context free grammar rules and the lexicon.
The lexicon generator is the component used for the automatic construction of the lexical
rules. It uses the same POS tagged corpus that is used for the extraction of the context
free grammar rules.
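The idea behind the lexicon generator can be illustrated with a minimal sketch. It assumes that each tagged sentence is stored as whitespace-separated word/TAG tokens; this storage format and the name generate_lexicon are assumptions made for illustration and are not the implementation of this study. The output follows the rule format listed in Appendix 3, except that the sketch drops repeated rules.

```python
def generate_lexicon(tagged_sentences):
    """Build lexical rules of the form TAG -> 'word' from word/TAG tokens."""
    rules = []
    seen = set()
    for sentence in tagged_sentences:
        for token in sentence.split():
            # Split on the last slash so the tag is always the final part.
            word, _, tag = token.rpartition("/")
            if not word or not tag:
                continue  # skip tokens that are not in word/TAG form
            if (tag, word) not in seen:
                seen.add((tag, word))
                rules.append(f"{tag} -> '{word}'")
    return rules


# Illustrative tagged sentences built from word and tag pairs in Appendices 3 and 4.
tagged = [
    "Tolosaan/PNS kaleessa/AD galgala/AD dhufe/VB",
    "Inni/PPN kaleessa/AD dhufe/VB",
]

lexical_rules = generate_lexicon(tagged)
print("\n".join(lexical_rules))
```

Words that already appeared with the same tag, such as kaleessa and dhufe in the second sentence, do not produce a second rule.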
The preparation of the corpus used in this study and the challenges that the researchers
faced during its preparation are presented. The sample corpus is used to extract grammar
rules and generate lexical rules of the Afan Oromo language. Due to lack of time, the
sample corpus was tagged manually by the researchers based on the linguists'
suggestions and comments. The corpus is small because of the lack of large corpora
annotated with POS tags. Experiments were also conducted: the parser was trained on
400 sentences and tested on 100 sentences from the sample corpus. As discussed in the
evaluation section, accuracies of 98.25% and 91% (the average result over the selected
sentence types) were obtained on the training dataset and the testing dataset respectively.
In general, the study designed the general architecture of a top-down chart parser for the
Afan Oromo language. We developed a method for constructing lexical rules
automatically from a tagged corpus. Another contribution of this study is the developed
rule-based top-down chart parser for Afan Oromo simple and complex sentences, which
does not require a tree bank from which the parser learns how to parse through iterative
training. Since this study is the first work on parsing both simple and complex sentences
for the Afan Oromo language, it encourages Ethiopian students and researchers to take
part in parsing sentences, which may lead to the development of a full-fledged parser for
the Afan Oromo language.
5.2 Recommendations
The sample corpus and the grammar rules used in this study cannot be taken as
representative of the language; therefore, experiments on a larger corpus are needed.
Sentence parsing is not an easy task: it requires more time and more features to make it
full-fledged. However, our study has shown that sentence parsing can be done
automatically using a top-down chart parsing algorithm for Afan Oromo simple and
complex sentences. There are many shortcomings in this work and in the area,
particularly for Afan Oromo. These should be addressed by interested individuals in the
area. The efforts of those researchers might lead to an efficient, full-fledged sentence
parser for the Afan Oromo language. Thus, we list below additional features that can be
added to increase the performance of the system, together with future research directions.
• Preparing a processed (annotated) Afan Oromo corpus for experimentation is
  recommended, particularly one appropriate for sentence parsing.
• Developing the Afan Oromo sentence parser further by adding an automatic part of
  speech tagger, an automatic morphological analyzer, and all types of sentences with
  all attributes such as case, number, gender, person, tense and definiteness, to
  increase the coverage of the current parser, and using a large dataset to make it a
  full-fledged parser.
• Experimenting with how the sentence parser performs using stochastic approaches
  on a larger sample dataset is also recommended. Although we did not experiment
  with stochastic approaches due to time constraints, better results might be obtained.
• Replicating the work for other Ethiopian local languages such as Tigrigna, Silte, etc.
REFERENCES
[1] A. S. Genemo, “Afaan Oromo Named Entity Recognition Using Hybrid
Approach,” MSc.Thesis, Department of Computer Science, School of Graduate
Studies, Addis Ababa University, Addis Ababa, 2015.
[2] A. Copestack, “Natural Language Processing,” in Natural Language Processing,
2004, pp. 2003–2004.
[3] N. Chomsky, Syntactic structures, Second Edi. New York, 2002.
[4] A. D. Mohammed, “A Top-Down Chart Parser for Amharic Sentences,”
MSc.Thesis, Department of Computer Science, School of Graduate Studies,
Addis Ababa University, Addis Ababa, 2015.
[5] D. Megersa, “An Automatic Sentence Parser for Oromo Language Using
Supervised Learning Technique,” MSc.Thesis, Department of Information
Science, School of Graduate Studies, Addis Ababa University, Addis Ababa,
2002.
[6] Jason, “Parsing.” [Online]. Available:
https://www.cs.cornell.edu/courses/cs4740/2012sp/lectures/parsing-intro-
4pp.pdf. [Accessed: 03-Feb-2017].
[7] A. Al-Taani, M. Msallam, and S. Wedian, “A Top-Down Chart Parser for
Analyzing Arabic Sentences,” Int. Arab J. Inf. Technol., vol. 9, no. 3, 2012.
[8] J. Daba and Y. Assabie, “A Hybrid Approach to the Development of
Bidirectional English-Oromiffa,” vol. 8686, pp. 228–235, 2014.
[9] A. Abeshu, “Analysis of Rule Based Approach for Afan Oromo Automatic
Morphological Synthesizer,” An Off. Int. J. Wollega Univ. Sci. Technol. Arts Res.
J., vol. 2, no. 4, pp. 2226–7522.
[10] M. Jakubíček, “Rule-Based Parsing of Morphologically Rich Languages,”
PhD.Dissertation, Faculty of Informatics, Masaryk University, 2012.
[11] M. L. Kejela, “Named Entity Recognition for Afan Oromo,” MSc.Thesis,
Department of Computer Science, School of Graduate Studies, Addis Ababa
University, Addis Ababa, 2010.
[12] A. T. Nedjo, D. Huang, and X. Liu, “Automatic Part-of-speech Tagging for
Oromo Language Using Maximum Entropy Markov Model (MEMM),” vol.
10, pp. 3319–3334, 2014.
[13] G. O. Ganfure and D. Midekso, “Design And Implementation Of Morphology
Based Spell Checker,” vol. 3, no. 12, pp. 118–125, 2014.
[14] D. Tesfaye, “A rule-based Afan Oromo Grammar Checker,” IJACSA - Int. J.
Adv. Comput. Sci. Appl., vol. 2, no. 8, pp. 126–130, 2011.
[15] G. Mamo, “Part of Speech Tagging for Afaan Oromo Language,” MSc.Thesis,
Department of Information Science, School of Graduate Studies, Addis Ababa
University, Addis Ababa, 2009.
[16] A. Mohammed-hussen, “Part of Speech Tagger for Afaan Oromo Language
Using Transformational Error Driven Learning (TEL) Approach,” MSc.Thesis,
Department of Computer Science, School of Graduate Studies, Addis Ababa
University, Addis Ababa, 2010.
[17] G. D. Dinegde and M. Y. Tachbelie, “Afan Oromo News Text Summarizer,” Int.
J. Comput. Appl., vol. 103, no. 4, pp. 975–8887, 2014.
[18] T. K. Hundesa, “Word Sense Disambiguation for Afaan Oromo Language,”
MSc.Thesis, Department of Computer Science, School of Graduate Studies,
Addis Ababa University, Addis Ababa, 2013.
[19] K. Abdisa, “Factoid Question Answering For Afaan Oromo,” MSc.Thesis,
Department of Information Science, School of Graduate Studies, Addis Ababa
University, Addis Ababa, 2014.
[20] G. G. Eggi, “Afaan Oromo Text Retrieval System,” MSc.Thesis, Department of
Information Science, School of Graduate Studies, Addis Ababa University,
Addis Ababa, 2012.
[21] T. G. Debela, “Afaan Oromo Search Engine,” MSc.Thesis, Department of
Computer Science, School of Graduate Studies, Addis Ababa University, Addis
Ababa, 2010.
[22] M. Post and D. Gildea, “Parsers as language models for statistical machine
translation,” … Assoc. Mach. Transl. …, 2008.
[23] J. Katz-Brown et al., “Training a Parser for Machine Translation Reordering,”
Proc. 2011 Conf. Empir. Methods Nat. Lang. Process. (EMNLP 2011), pp.
183–192, 2011.
[24] A. Alemu, “Automatic Sentence Parsing for Amharic Text an Experiment Using
Probabilistic Context Free Grammars,” MSc.Thesis, Department of Information
Science, School of Graduate Studies, Addis Ababa University, Addis Ababa,
2002.
[25] D. G. Agonafer, “An Integrated Approach to Automatic Complex Sentence
Parsing for Amharic Text,” MSc.Thesis, Department of Information Science,
School of Graduate Studies, Addis Ababa University, Addis Ababa, 2003.
[26] A. Ibrahim, “A Hybrid Approach to Amharic Base Phrase Chunking and
Parsing,” MSc.Thesis, Department of Computer Science, School of Graduate
Studies, Addis Ababa University, Addis Ababa, 2013.
[27] D. D. K. Sleator, “Parsing English with a Link Grammar,” National Science
Foundation under grant CCR-8658139, Olin Corporation, and R. R. Donnelley
and Sons, New York, 1991.
[28] E. Charniak, “Statistical Parsing with a Context-free Grammar and Word
Statistics,” pp. 1–6, 1997.
[29] N. Khoufi, C. Aloulou, L. Hadrich, and B. Anlp, “ARSYPAR : A tool for parsing
the Arabic language,” Int. Arab Conf. Inf. Technol., 2013.
[30] B. M. Bataineh and E. A. Bataineh, “An Efficient Recursive Transition Network
Parser for Arabic Language,” World Congr. Eng. 2009, Vols I Ii, vol. II, pp.
1307–1311, 2009.
[31] N. Hambir, “Hindi Parser-based on CKY algorithm,” vol. 3, no. 2, pp. 851–853.
[32] W. W. Thant, T. M. Htwe, and N. L. Thein, “Context Free Grammar Based Top-
Down Parsing of Myanmar Sentences,” Int. Conf. Comput. Sci. Inf. Technol.
Pattaya Dec. 2011, pp. 71–75, 2011.
[33] H. Lian, “Chinese Language Parsing with Maximum-Entropy-Inspired Parser,”
M.S. Thesis, pp. 1–6, 2005.
[34] S. Shieber and Microtome, “An Introduction to Unification-Based Approaches
to Grammar,” Microtome Publishing, 2003. [Online]. Available:
https://nrs.harvard.edu/urn-3:HUL-InstRepos:11576719. [Accessed: 14-Feb-
2017].
[35] S. Sundararajan, “Probabilistic Context-Free Grammars in Natural Language
Processing,” pp. 1–6.
[36] B. S. R. L, S. Ishwar, and S. K. Ravindranath, “Context Free Grammar for
Natural Language Constructs- An implementation for Venpa class of Tamil
Poetry,” Indian Inst. Inf. Technol., pp. 128–136.
[37] W. A. Woods, “Transition network grammars for natural language analysis,”
Commun. ACM, vol. 13, pp. 591–606, 1970.
[38] M. Collins, “Probabilistic Context-Free Grammars (PCFGs),” Lect. Notes, pp.
1–18, 2011.
[39] D. Jurafsky and J. Martin, “Speech and Language Processing”, 2nd Edition.
2008.
[40] G. Weikum, “Foundations of statistical natural language processing,” ACM
SIGMOD Rec., vol. 31, no. 3, p. 37, 2002.
[41] M. Ailomaa, “Two Approaches to Robust Stochastic Parsing,” MSc.Thesis,
Computational Linguistics, Goteborg University, Lausanne, 2004.
[42] J. Brownlee, “Supervised and Unsupervised Machine Learning Algorithms,”
Mach. Learn. Mastery, pp. 1–9, 2016.
[43] R. Ouersighni, “Robust Rule-based Approach in Arabic Processing,” Int. J.
Comput. Appl., vol. 93, no. 12, pp. 31–37, 2014.
[44] O. Herzog and R. Rollinger, “A Flexible Parser for a Linguistic Development
Environment Gregor Erbach Choices in Parser Design Types of Grammars,”
Text Underst. LILOG, Springer, Berlin, no. Kay 1980, 1991.
[45] E. Othman, K. Shaalan, and A. Rafea, “A chart parser for analyzing modern
standard Arabic sentence,” MT Summit IX Work. Mach. Transl. Semit. Lang.
Issues Approaches, pp. 37–44, 2003.
[46] B. I. Thompson, “Afro Asiatic Language Family,” 2017. [Online]. Available:
http://aboutworldlanguages.com/afro-asiatic-language-family. [Accessed: 04-
Feb-2017].
[47] T. Gamta, “The Oromo language and the latin alphabet,” J. Oromo Stud., pp.
10–13, 1992. [Online]. Available:
http://www.africa.upenn.edu/Hornet/Afaan_Oromo_19777.html. [Accessed: 06-
Feb-2017].
[48] R. Kibble, “Introduction to natural language processing Undergraduate study in
Computing and related programmes,” 2013.
[49] B. Yimam, “The Phrase Structures of Ethiopian Oromo,” PhD.Dissertation,
Addis Ababa University, 1986.
[50] S. Alqrainy, S. Jordan, and M. S. Alkoffash, “Context-Free Grammar Analysis
for Arabic Sentences,” Int. J. Comput. Appl., vol. 53, no. 3, pp. 7–11, 2012.
[51] C. February, “Contents 1,” 2011. [Online]. Available:
https://en.wikipedia.org/wiki/Lexicon. [Accessed: 12-Nov-2016].
[52] S. C. Zhu, “Ch 4 Classic Parsing Algorithms Chart Parsing in NLP,” pp. 1–51.
[53] H. J. Fox, “Lexicalized, Edge-Based, Best-First Chart Parsing,” MSc.Thesis,
Dep’t of Computer Science, Massachusetts Institute of Technology, Brown
University, 1999.
APPENDICES
Appendix 1: Part of Speech Tags by Abraham (used in this study)
Tags | Descriptions | Examples
AD | A tag for all types of adverb in the language | Kaleessa, edana, yoomiyyuu, as, achi, yoom, yammuu, yeroo . . .
APC | Adpositions and conjunctions: a tag for all prepositions/postpositions as well as predeterminers/postdeterminers and conjunctions that are separated from other categories | ni/in, irra, irraa, itti, bira, ol, jala, gadi, keessa, ala, akka, gara, gar, kan, wal, waliin, yaa, haa, Hunda, qofa, faa, malee, mee, fi, yookaan (ykn), otoo, yoo, garuu, akkasumas, waa’ee, akkaataa, duukaa . . .
AV | A tag for all auxiliary verbs | Jira, jirti, jiru, ture, qaba, qabda, qabna, qabdu, tahe, ta’e, ta’uu, danda’a, barbaada
CN | A tag for cardinal numbers | Tokko, Lama, kudhan, 2012, 1977, 2, 19
CP | A tag for all copula in the language | Dha, ti, miti, ree . . .
DT | A tag for all types of determiners in the language | Kun, kana, kanneen, sana . . .
IN | A tag for all interjections in the language | Ishoo, i’hii, a’haa, wayyoo, anaan, anibadee, tole, nagaatti
JJ | A tag for all adjectives that are separated from other categories, including nominal adjectives like “warra”, “jara”, “isa”, “inni”, . . . | Bareedduu, diimaa, magaala, cimaa, garraamii, . . .
NN | A tag for all types of (common) singular or mass nouns that are not joined with other categories in sentences. It also includes nouns derived from other word classes like verbs and adjectives | Nama, mana, Maatii, Muka, bishaa, biyya, biliisummaa, walabummaa, Bareedina, bareedinaan, ciminaan, jabaatu, jabinaan, ofittummaa, garraamummaa, gaarummaa, qulqullaa‘uu, qulqulleessuu, qulqulleessuun, mirkanaa‘uu, mirkaneessuu, barbaaduun, kadhachuun, . . .
NNP | A tag for all types of plural nouns that are not joined with other categories in sentences | Namoota, mukeetii, manneen, Maatilee, biyyoota
NNPS | A tag for all types of plural nouns that are not separated from postpositions | Namootaaf, mukeetiitti, . . .
NNS | A tag for all types of (common) singular or mass nouns that are not separated from postpositions | Namaaf, mukatti, bishaaniin, . . .
ON | A tag for ordinal numerals | 1ffaa, 2ffaa, sadaffaa, afurffaa, . . .
PN | A tag for all proper nouns that are not joined with other categories in sentences. In Oromo, people sometimes use plural forms of proper nouns like Oromoota, Ingilizoota, Habashoota, but these plural forms are not treated separately | Dammee, Caaltuu, Oromiyaa, OIB, IBM, Qeerroo, Oromoo, Oromoota, Ingilizoota, Habashoota . . .
PNS | A tag for all proper nouns that are not separated from conjunctions/postpositions | Dammeef, Oromiyaatti, Qeerrootu, Oromootaaf, Habashootatu, . . .
POP | A tag for all possessive pronouns that are not joined with other categories | Koo, kee, kiyya, isaa, isaanii, keessan, . . .
POPS | A tag for all possessive pronouns that are not separated from conjunctions/postpositions | Keessaniin, kootiin, keetti, . . .
PPN | A tag for all personal pronouns that are not joined with other categories | Ana/ani, nuyi/nuti, sii/ati, isin, isa, ishee, isaan, Of
PPNS | A tag for all personal pronouns that are not separated from conjunctions/postpositions | Isiniif, aniyyuu, isaantu, . . .
QU | A tag for all types of questioning forms in the language | Akkam?, maal?, eenyu?, maaliif?, eessa?, meeqa?, kam?, kamiin?, yoom?
VB | A tag for all main verbs in sentences | Kottu, Deemi, dubbisi, jaallatte, jaallate
Appendix 2: Sample Context Free Grammar Extracted from corpus
S -> NP VP
S -> VP
S -> IN
NP -> NN JJ
NP -> NN NN
NP -> NN POP
NP -> PPN NN POP
NP -> APC PNS
NP -> NN
NP -> PNS
NP -> PPN NN
NP -> PN NN
NP -> NN JJ DT
NP -> PPNS
NP -> PPN DT
NP -> NN DT
NP -> NNS POP
NP -> DT PN
NP -> NNS AD
NP -> PPN
NP -> PNS NP
NP -> NN APC JJ
NP -> APC VB NN
NP -> APC PPN VB
NP -> NP PPN
NP -> NN PPN VB
VP -> NP CP
VP -> NP VB
VP -> NN VP
VP -> NN VB
VP -> AD VB
VP -> ADP VB
VP -> JJP VB
VP -> IN JJP
VP -> NNS VB
VP -> JJP AV
VP -> ADP AD AV
VP -> APCP AV
VP -> AD APCP VB
VP -> APCP APC
VP -> VB
VP -> JJP CP
VP -> QU
VP -> QU AV
VP -> QU VB
VP -> DT QU
VP -> QU PPN VB
VP -> POP VB
VP -> ADP APCP VP
VP -> VB AV
VP -> NN VB CP
VP -> APC VB
ADP -> AD
ADP -> PNS AD VB
ADP -> AD AD
ADP -> AD VB
ADP -> APC AD VB
ADP -> PN AD
JJP -> JJ JJ
JJP -> NN JJ
APCP -> APC
APCP -> APC NN
APCP -> APC NNS
APCP -> APC VB
APCP -> APC PN
APCP -> APC PPNS
Appendix 3: Sample Lexical Rules Generated by the Lexicon Generator
JJ -> 'gurraacha'
CP -> 'dha'
PPN -> 'Inni'
APC -> 'gara'
NN -> 'manaa'
VB -> 'deeme'
PPN -> 'Inni'
AD -> 'kaleessa'
AD -> 'galgala'
VB -> 'dhufe'
PNS -> 'Tolosaan'
NN -> 'mana'
JJ -> 'guddaa'
AV -> 'qaba'
PNS -> 'Abdiisaan'
NN -> 'mana'
NN -> 'citaa'
VB -> 'ijaare'
PNS -> 'Tolaan'
NN -> 'aannan'
VB -> 'dhuge'
PNS -> 'Alamuun'
AD -> 'saffisaan'
APC -> 'gara'
PN -> 'finfinnee'
VB -> 'deeme'
PNS -> 'Alamuun'
NN -> 'konkolaataa'
VB -> 'bite'
PNS -> 'namichi'
NN -> 'saree'
VB -> 'ajjeese'
PNS -> 'Tulluun'
PN -> 'Asteer'
APC -> 'waliin'
VB -> 'deeme'
PNS -> 'Baacaan'
AD -> 'boru'
VB -> 'dhufa'
PPNS -> 'isheen'
AD -> 'boru'
VB -> 'dhufti'
PPN -> 'Inni'
AD -> 'kaleessa'
VB -> 'dhufe'
PNS -> 'Angaatuun'
APC -> 'sirritti'
VB -> 'dubbifti'
NNS -> 'ibsaan'
VB -> 'ife'
PNS -> 'Caalaan'
NN -> 'barataa'
JJ -> 'cimaa'
CP -> 'dha'
PNS -> 'Biiftuun'
NN -> 'mana'
NN -> 'barumsaa'
VB -> 'deemte'
PPN -> 'Ati'
JJ -> 'gahee'
POP -> 'kee'
VB -> 'xumurtee'
AV -> 'jirta'
NNS -> 'bokkaan'
VB -> 'roobaa'
AV -> 'jira'
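How grammar rules and lexical rules of this form drive a top-down chart parser can be illustrated with a minimal sketch. It is an Earley-style chart parser with top-down prediction, written for illustration only; it is not the implementation used in this study. It uses a small subset of the rules in Appendix 2 and four lexical rules from above, and the names Edge, GRAMMAR, LEXICON and parse are hypothetical. The sketch assumes a grammar without empty rules.

```python
from collections import namedtuple

# An edge records a rule, how much of it has been matched (dot),
# where it started, and the subtrees found so far.
Edge = namedtuple("Edge", "lhs rhs dot start children")

# Subset of the context free grammar rules in Appendix 2.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("PNS",)],
    "VP": [("ADP", "VB")],
    "ADP": [("AD", "AD"), ("AD",)],
}
# Lexical rules from Appendix 3, stored as word -> set of tags.
LEXICON = {"Tolosaan": {"PNS"}, "kaleessa": {"AD"}, "galgala": {"AD"}, "dhufe": {"VB"}}


def parse(words, start_symbol="S"):
    """Return the bracketed trees of all complete parses of `words`."""
    chart = [[] for _ in range(len(words) + 1)]

    def add(pos, edge):
        if edge not in chart[pos]:
            chart[pos].append(edge)

    for rhs in GRAMMAR[start_symbol]:
        add(0, Edge(start_symbol, rhs, 0, 0, ()))

    for pos in range(len(words) + 1):
        for edge in chart[pos]:  # the list may grow while we iterate
            if edge.dot < len(edge.rhs):
                nxt = edge.rhs[edge.dot]
                if nxt in GRAMMAR:  # predict: expand the expected phrase top-down
                    for rhs in GRAMMAR[nxt]:
                        add(pos, Edge(nxt, rhs, 0, pos, ()))
                elif pos < len(words) and nxt in LEXICON.get(words[pos], ()):
                    # scan: the next word carries the expected POS tag
                    leaf = f"({nxt} {words[pos]})"
                    add(pos + 1, edge._replace(dot=edge.dot + 1,
                                               children=edge.children + (leaf,)))
            else:  # complete: advance every edge that was waiting for this phrase
                tree = f"({edge.lhs} {' '.join(edge.children)})"
                for waiting in chart[edge.start]:
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == edge.lhs):
                        add(pos, waiting._replace(dot=waiting.dot + 1,
                                                  children=waiting.children + (tree,)))

    return [f"({e.lhs} {' '.join(e.children)})" for e in chart[len(words)]
            if e.lhs == start_symbol and e.start == 0 and e.dot == len(e.rhs)]


print(parse("Tolosaan kaleessa galgala dhufe".split()))
```

For this sentence the sketch returns a single tree, (S (NP (PNS Tolosaan)) (VP (ADP (AD kaleessa) (AD galgala)) (VB dhufe))). A word order that the rules do not cover, or a word missing from the lexicon, yields an empty list.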
Appendix 4: Sample parsed sentences by the parser
(S
(NP (PNS Tolosaan))
(VP (ADP (AD kaleessa) (AD galgala)) (VB dhufe)))
(S
(NP (PNS Tolaan))
(VP
(ADP (AD saffisaan))
(APCP (APC gara) (PN finfinnee))
(VP (VB deeme))))
(S
(NP (PNS Tolaan))
(VP (AD saffisaan) (APCP (APC gara) (PN finfinnee)) (VB deeme)))
(S
(NP (NN barumsi) (APC waan) (JJ cimuuf))
(VP
(ADP (AD sirritti))
(APCP (APC itti))
(VP (VB qophaa'uu) (AV qabda))))
(S (NP (NNS maqaan) (POP kee)) (VP (QU eenyu)))
(S (NP (NNS umriin) (POP kee)) (VP (QU meeqa)))
(S (NP (PNS Alamuun) (NP (NN kitaaba))) (VP (POP naaf) (VB bite)))
(S (NP (PNS Alamuun) (NP (NN kitaaba) (POP naaf))) (VP (VB bite)))
(S (NP (PNS Alamuun)) (VP (NN kitaaba) (VP (POP naaf) (VB bite))))
(S (NP (PNS Alamuun)) (VP (NP (NN kitaaba) (POP naaf)) (VB bite)))
(S (VP (NP (PNS Alamuun) (NP (NN kitaaba) (POP naaf))) (VB bite)))
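Bracketed output of this kind can be inspected programmatically, for example to confirm that alternative parses cover the same words and to list the rules each parse uses. The following minimal sketch assumes the bracket notation shown above; the function names read_tree, leaves and productions are hypothetical and are not part of the parser developed in this study.

```python
def read_tree(text):
    """Parse a bracketed string such as '(S (NP (PNS x)) ...)' into
    nested (label, children) tuples; words are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word of the sentence
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Return the words of the sentence in order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def productions(tree):
    """Return the rules used in the tree, parent before children."""
    label, children = tree
    rhs = [c[0] if isinstance(c, tuple) else repr(c) for c in children]
    rules = [f"{label} -> {' '.join(rhs)}"]
    for child in children:
        if isinstance(child, tuple):
            rules.extend(productions(child))
    return rules


# Two of the alternative parses listed above for the same sentence.
a = "(S (NP (PNS Alamuun) (NP (NN kitaaba))) (VP (POP naaf) (VB bite)))"
b = "(S (NP (PNS Alamuun)) (VP (NP (NN kitaaba) (POP naaf)) (VB bite)))"

print(leaves(read_tree(a)) == leaves(read_tree(b)))
print(productions(read_tree(a)))
print(productions(read_tree(b)))
```

Both trees yield the same four words but different rule sets, which shows that the last five outputs above are alternative analyses of one ambiguous sentence.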