A
PROJECT REPORT
ON
LEMMATIZATION USING TRIE DATA STRUCTURE
A Project Work (MS-405)
Submitted to Assam University, Silchar in partial fulfillment of the requirements for
the award of degree of Master of Science in Computer Science
UNDER THE GUIDANCE OF
Mrs. SUNITA SARKAR,
Department of Computer Science
Assam University, Silchar
SUBMITTED BY
KARISHMA TAPARIA
Semester: 4th, M.Sc. (2 Yrs)
Sub Code: Msc 405(C)
Exam Roll: 101714, No.: 22220415
ABSTRACT:
Lemmatization is used to normalize the inflectional forms of a word to its root word, and can therefore serve as a preprocessing step in any natural language processing application. It is a very important technique for information retrieval: it reduces the different inflectional as well as derivational forms of a word to its root or head word, called its 'lemma'. A 'lemma' is simply the "dictionary form" of a word. Through lemmatization, the different grammatical forms of a word can be analyzed as a single word.
However, the problem is that for many languages (mainly Indian), lemmatizers do not exist, and this problem is not easy to solve, since rule-based lemmatizers take time to build and require highly skilled linguists, while statistical stemmers do not return legitimate lemmas. Our goal is to implement a language-independent lemmatizer that uses the trie data structure to store the root words and tries to find a potential lemma of a surface word by efficiently searching the trie. A trie allows retrieving the possible lemmas of a given inflected or derivational form.
DECLARATION
I, Karishma Taparia, student of 4th semester (M.Sc. 2 years), Department of
Computer Science, do hereby solemnly declare that I have duly worked on my
project entitled "Lemmatization using Trie Data Structure" under the
supervision of Mrs. Sunita Sarkar, Assistant Professor, Department of
Computer Science, Assam University, Silchar. This project work is submitted
in partial fulfillment of the requirements for the award of the degree of Master
of Science in Computer Science. The results embodied in this thesis have not
been submitted to any other university or institution for the award of any
degree or diploma.
Date: (Karishma Taparia)
Place: Assam University, Silchar Semester: 4th sem (M.Sc. 2 years)
Roll No.: 22220415
Assam University, Silchar
CERTIFICATE
This is to certify that Miss Karishma Taparia, student of 4th semester (M.Sc. 2
years), Department of Computer Science, Assam University, Silchar, bearing
Roll No. 22220415, has carried out her project work
entitled "Lemmatization using Trie Data Structure" under my guidance in
partial fulfillment of the requirements for the award of the degree of Master of
Science in Computer Science during the period January 2016 to May 2016.
The project is the result of her own investigation, and neither the report as a
whole nor any part of it has been submitted to any other university or
institution for any degree or diploma.
Date: Mrs. SUNITA SARKAR
Place: Assistant Professor,
Department of Computer Science,
Assam University, Silchar.
CERTIFICATE
This is to certify that Miss Karishma Taparia, student of 4th semester (M.Sc. 2
years), Department of Computer Science, Assam University, Silchar, bearing
Roll No. 22220415, has carried out her project work
entitled "Lemmatization using Trie Data Structure" under the guidance of
Mrs. Sunita Sarkar, Assistant Professor, Department of Computer Science,
Assam University, Silchar, in partial fulfillment of the requirements for the
award of the degree of Master of Science in Computer Science during the period
January 2016 to May 2016.
The project is the result of her own investigation, and neither the report as a
whole nor any part of it has been submitted to any other university or
institution for any degree or diploma.
Date: Dr. BIPUL SHYAM PURKAYASTHA
Place: Head of Department,
Department of Computer Science,
Assam University, Silchar.
ACKNOWLEDGEMENT:
At the very outset, I take the privilege to convey my gratitude to those
persons whose co-operation, suggestions and heartfelt support helped me to
accomplish the project successfully.
I take immense pleasure in expressing my sincere thanks and profound
gratitude to my respected guide, Mrs. SUNITA SARKAR, for her continuous
support, patience, motivation, enthusiasm and immense knowledge, and for
providing timely support and suitable suggestions.
I want to thank my parents for their affection and their help in managing
my life in busy times. Without them, it would have been very difficult to
focus on my project.
My special thanks go to all those who directly or indirectly extended their
helping hands in making the project a grand success.
Date: 27th April 2016 KARISHMA TAPARIA
Department of Computer Science,
Assam University, Silchar.
TABLE OF CONTENTS
1. Introduction
1.1. Stemming vs. Lemmatization
1.2. Motivation
1.3. Objective
1.4. Problem Statement
1.5. Aim of Lemmatization
2. Related Work and Background
2.1. Introduction
2.2. Review on Lemmatization
2.3. Review on Stemming
3. Approaches For Lemmatization
3.1 Introduction
3.2 Levenshtein distance based approach
3.3 Morphological Analyzer
3.4 AFFIX based approach
3.5 Fixed Length Truncation
4 TRIE based approach
4.1 Introduction
4.2 Pre-processing Steps
4.2.1 Tokenization
4.2.2 Stop-words Removal
4.3 Applications of Trie
4.4 Basic Operations on Trie
4.4.1 Searching A Trie
4.4.2 Insertion In Trie
4.5 Algorithm For Lemmatization
4.6 Flowchart
4.7 Drawback of algorithm
5 Implementation and Result Analysis
5.1 Implementation
5.2 Result Analysis
6 Conclusion
7 Future Work
8 References
CHAPTER 1
1. INTRODUCTION:
Language is an important tool for communication, and natural
language processing (NLP) is concerned with the interaction between
human languages and computers. NLP involves enabling computers
to derive meaning from human or natural language input. It is a very
active research topic nowadays, as it is used in most linguistic
applications.
In lexical knowledge bases like dictionaries, WordNet, etc., the entries
are usually root words with their morphological and semantic
descriptions. Therefore, when a surface word is encountered in raw
text, its meaning cannot be obtained until its appropriate root
word is determined through lemmatization. Thus, lemmatization is a
basic need for any kind of semantic processing of languages.
"Lemmatization" refers to normalizing the different inflectional as
well as derivational forms of a word to its head word.
This task can be used as a pre-processing step for many natural
language processing applications (e.g. morphological analyzers, electronic
dictionaries, spell-checkers, stemmers, etc.). It may also be useful as a
generic keyword generator for search engines and other data
mining, clustering and classification tools.
[Figure: the lemmatization process maps the morphological variants of a word (preparing, prepared, preparation, prepare) to the common root word (prepare).]
1.1 STEMMING VS LEMMATIZATION:
Normalization is a very important task in any natural language processing
application. Stemming and lemmatization are normalization
techniques used to reduce the different grammatical forms of a word to
its head word by applying a set of rules.
Stemming is the process of reducing the different inflectional forms of a
word to its stem by applying a set of rules. The aim of stemming is simply
to reduce a word to its stem without considering its part of speech (POS).
It is used in most text mining applications, where the aim is only to
reduce the form of a word without worrying about its occurrence in the
given context. The result of stemming is called a stem, and it is not
always a dictionary word.
In linguistics, a lemma (from the Greek noun "lemma", headword)
is the "dictionary" or "canonical" form of a set of words. More
specifically, a lemma is the canonical form of a lexeme, where the
lexeme is the set of all the forms that have the same meaning, and the
lemma is the particular form chosen as the base form to
represent the lexeme. Lemmatization is the most frequently used
normalization technique in information retrieval applications like
indexing and searching.
For example: the words 'produce', 'produced', 'producing' and 'production'
are all stemmed to 'produc', whereas a lemmatizer will return the word
'produce'.
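The contrast can be sketched in a few lines of code. This is an illustrative toy only: the suffix list and the lemma dictionary are our own assumptions, not the rules of any real stemmer or lemmatizer.

```python
# Illustrative only: a naive stemmer that blindly strips suffixes,
# versus a lemmatizer that checks candidates against a lemma dictionary.

SUFFIXES = ("ation", "tion", "ing", "ed", "e")  # assumed rule list

def naive_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word, lemmas):
    """Return a dictionary word: strip a suffix, optionally restoring a final 'e'."""
    if word in lemmas:
        return word
    for suffix in ("d", "ed", "ing", "tion", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in lemmas:
                return stem
            if stem + "e" in lemmas:  # producing -> produc -> produce
                return stem + "e"
    return word
```

With this sketch, naive_stem maps 'produced', 'producing' and 'production' all to the non-word 'produc', while naive_lemmatize with the lemma set {'produce'} maps them to the dictionary form 'produce'.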
1.2 MOTIVATION
Natural language processing is a very active research topic nowadays, as
it is used in most linguistic applications.
Stemming and lemmatization are normalization techniques used to
reduce the different grammatical forms of a word to its head word, and
lemmatization can be used as a pre-processing step in information
retrieval applications.
However, the problem is that for many languages lemmatizers do not
exist, and this problem is not easy to solve, since rule-based
lemmatizers take time to build and require highly skilled linguists. Statistical
stemmers, on the other hand, do not return legitimate lemmas.
1.3 OBJECTIVE
To develop a lemmatizer using the trie data structure.
1.4 PROBLEM STATEMENT
The key idea is that a trie is created from a file containing the list of
lemmas (root words) of the language concerned.
The lemmatization process consists in navigating the trie, trying to
find a match between the input word and an entry in the trie.
An algorithm is then applied to find the appropriate lemma for the input
word.
1.5 AIM OF LEMMATIZATION:
Lemmatization aims to remove inflectional endings only and to return the
dictionary form of a word, and it may use a vocabulary and/or
morphological analysis of words. Lemmatizers therefore require much
more knowledge about the language than stemmers, relying on this
knowledge rather than on the language-specific stripping rules that
stemmers use.
Lemmatization is closely related to stemming; however, stemming
operates only on a single word at a time, whereas lemmatization may
operate on the full text and can therefore discriminate between words
that have different meanings depending on their part of speech. On the
other hand, stemmers are typically easier to implement and run
faster. Lemmatizers play a significant role in IR, and the ability to
lemmatize words efficiently and effectively is thus important.
CHAPTER 2
RELATED WORK AND BACKGROUND
INTRODUCTION:
Lovins described the first stemmer (Lovins, J.B., 1968), which was
developed specifically for IR/NLP applications. Her approach consisted
of a manually developed list of 294 suffixes, each linked to 29
conditions, plus 35 transformation rules. For an input word, the suffix
with an appropriate condition is checked and removed. Porter developed
the Porter stemming algorithm (Porter, 1980), which became the most
widely used stemming algorithm for the English language. These stemmers
were later described in a very high-level language known as Snowball.
A number of statistical approaches have been developed for stemming.
Notable works include Goldsmith's unsupervised algorithm for learning the
morphology of a language based on the Minimum Description Length
(MDL) framework (Goldsmith, 2001, 2006), and Creutz's probabilistic
maximum a posteriori (MAP) formulation for unsupervised morpheme
segmentation (Creutz, 2005, 2007).
A few approaches are based on the application of Hidden Markov Models
(Massimo et al., 2003). In this technique, each word is considered to be
composed of two parts, a prefix and a suffix. The HMM states are
divided into two disjoint sets: prefix states, which generate the first part of
the word, and suffix states, which generate the last part of the word, if the
word has a suffix. Once a complete and trained HMM is available for a
language, stemming can be performed directly.
Plisson proposed the most accepted rule-based approach for
lemmatization (Plisson et al., 2008). It is based on word endings,
where suffixes are removed or added to get the normalized word form. In
another work, a method to automatically develop lemmatization rules to
generate the lemma from the full form of a word was discussed (Jongejan
et al., 2009). The lemmatizer was trained on Danish, Dutch, English,
German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full
form-lemma pairs.
KIMMO (Karttunen et al., 1983) is a two-level morphological analyzer
containing a large set of morphophonemic rules. The work started in 1980,
and the first implementation in LISP was available three years later.
Tarek El-Shishtawy proposed the first non-statistical Arabic lemmatizer
algorithm (Tarek et al., 2012). It makes use of different Arabic language
knowledge resources to generate an accurate lemma form and its relevant
features that support IR purposes; a maximum accuracy of 94.8% is
reported. OMA is a Turkish morphological analyzer which gives all
possible analyses for a given word with the help of finite-state
technology; two-level morphology is used to build the lexicon for the
language (Okan et al., 2012).
Grzegorz Chrupala (Chrupala et al., 2006) presented a simple data-driven,
context-sensitive approach to lemmatizing word forms. The Shortest Edit
Script (SES) between the reversed input and output strings is computed to
achieve this task; an SES describes the transformations that have to be
applied to the input string (word form) in order to convert it to the output
string (lemma).
As for lemmatizers for Indian languages, the earliest work, by
Ramanathan and Rao (2003), used a manually sorted suffix list and
performed longest-match stripping for building a Hindi stemmer.
Majumder et al. (2007) developed YASS (Yet Another Suffix Stripper),
where conflation was viewed as a clustering problem with an a priori
unknown number of clusters; they suggested several distance measures
rewarding long matching prefixes and penalizing early mismatches.
In a recent work on affix-stacking languages like Marathi, a Finite State
Machine (FSM) is used to develop a Marathi morphological analyzer
(Dabre et al., 2012). In another approach, a Hindi lemmatizer is
proposed, where suffixes are stripped according to various rules and the
necessary character(s) are added to get a proper root form (Paul
et al., 2013). GRALE is a graph-based lemmatizer for Bengali comprising
two steps (Loponen et al., 2013): in the first step it extracts the set of
frequent suffixes, and in the second step a human manually identifies the
case suffixes. Words are considered as nodes, and an edge from node u
to node v exists only if v can be generated from u by the addition of a suffix.
Unlike the above-mentioned rule-based and statistical approaches, our
lemmatizer uses the properties of a trie data structure, which allows
retrieving the possible lemmas of a given inflected word.
2.1 REVIEW ON LEMMATIZATION:

Author | Title | Language | Technique Used | Accuracy
Joël Plisson [1] | A Rule based Approach to Word Lemmatization | Slovene | Ripple Down Rules (RDR) approach | 77%
António Branco and João Silva [2] | Very high accuracy rule-based nominal lemmatization with a minimal lexicon | Portuguese | Shallow processing, rule-based algorithm | 94%
Vaishali Gupta, Nisheeth Joshi and Iti Mathur [3] | Rule based Lemmatization | Urdu | Rules | 86.5%
Snigdha Paul, Mini Tandon, Nisheeth Joshi and Iti Mathur [4] | Design of a Rule-based Hindi Lemmatizer | Hindi | Automated lemmatizer using rules | 89.08%
Aduriz I., Alegria I., Arriola J.M., Artola X. [5] | Different issues in the design of a lemmatizer/tagger | Basque | Morphological disambiguation with structured four-level tagset | Under development
Grzegorz Chrupała [6] | Simple Data-Driven Context-Sensitive Lemmatization | Spanish, Dutch, French, etc. | Classification based on Shortest Edit Script (SES) | 60-88%
Eugenio Picchi [7] | Statistical Tools for Corpus Analysis | Italian | Statistical tools (PE-system) | 95%
Wolfgang Lezius [8] | A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer | German | Morphology module and tagger | 91%
Ezeiza N., Alegria I., Arriola J.M., Urizar R. [9] | Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages | Basque | Stochastic and rule-based disambiguation methods | 96%
Tarek El-Shishtawy and Fatma El-Ghannam [10] | An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes | Arabic | Rule-based lexicon and supervised learning | 89.15%
DETAILS OF THE APPROACHES MENTIONED IN THE
LEMMATIZATION TABLE:
[1]. A Rule based Approach to Word Lemmatization
The approach presented by them focuses on word endings: which word
suffix should be removed and/or added to get the normalized form. They
compared the results of two word lemmatization algorithms, one based on
if-then rules and the other based on ripple-down rules induction
algorithms. The work presents the problem of lemmatizing words from
Slovene free text and explains why the Ripple Down Rules (RDR)
approach is very well suited for the task. When learning from a corpus of
lemmatized Slovene words, the RDR approach results in easy-to-understand
rules of improved classification accuracy compared to the
results of rule learning achieved in previous work.
Five datasets were used for evaluation; they were obtained as
random samples of different sizes taken from a large hand-constructed
lexicon, MULTEXT-East. The whole lexicon contains about 20,000
different normalized words with their different forms listed for each of them,
resulting in about 500,000 different entries (potential learning examples).
[2]. Very high accuracy rule-based nominal lemmatization with a
minimal lexicon:
They described a shallow-processing, rule-based algorithm for nominal
lemmatization in Portuguese with minimal word lists.
They build upon morphological regularities found in word inflection and
use a set of transformation rules that "undo" the form changes due to
inflection. Thus, the basic rationale of the rule-based lemmatizer is to
gather a set of transformation rules that, depending on the termination of
a word, replace that termination by another, and to complement this set of
rules with a list of exceptions. To implement the algorithm, a list of 126
transformation rules was necessary; the list of exceptions to these rules
amounts to 9,614 entries, and prefix removal is done by resorting to a list
of 130 prefixes. The lemmatizer was evaluated over the 50,637 adjectives
and common nouns present in a 260,000-token corpus.
[3]. Rule Based Lemmatizer in Urdu
They proposed a rule-based system. They generated an affix list of
words, and based on this list, lemmas are produced. If the input word does
not match the affix list, the system returns the same word.
They checked their system on 2000 words; among these 2000 words,
1730 gave the correct lemma and 270 gave an incorrect lemma.
[4]. DESIGN OF A RULE BASED HINDI LEMMATIZER:
The lemmatizer they discussed mainly focuses on the time
complexity problem. It was built using a rule-based approach together
with a paradigm approach. In the rule-based approach, along with the
rules, a knowledge base is created for storing the grammatical features;
a knowledge base is also created for storing the exceptional root words.
Although knowledge-base creation requires a large amount of
memory, in terms of time it gives the best, most accurate and fastest
result, since very little time is needed to look up the input word in the
knowledge base. The system was evaluated on 2500 words; among
these 2500 words, 2227 were evaluated correctly and 273 were incorrect.
[5]. Different issues in the design of a Lemmatizer/tagger
They focus on the development of a general-purpose lemmatizer/tagger
for Basque. The basic tools used are:
• The Lexical Database for Basque (LDBB). At present it contains 60,000
entries, each with its associated linguistic features (category, subcategory,
case, number, etc.).
• A general morphological analyzer/generator.
The mechanism has two main components for treating unknown
words: 1) generic lemmas corresponding to each possible open category
or subcategory, and 2) two additional rules expressing the relationship
between the generic lemmas at the lexical level and any acceptable
lemma of Basque.
2.2 REVIEW ON STEMMING:

Author | Proposed Method | Language | Technique Used | Accuracy
Dinesh Kumar, Prince Rana [11] | Brute Force Technique | Punjabi | Suffix stripping | 80.73%
Suprabhat Das, Pabitra Mitra [12] | Method proposed by Porter | Bengali | Suffix stripping | 96.27%
Juhi Ameta, Nisheeth Joshi, Iti Mathur [13] | Longest Match | Gujarati | Rule based | 91.5%
Shahid Husain [14] | n-gram stripping model | Urdu | 1) Length based 2) Frequency based | 1) 84.27% 2) 79.63%
Braja Gopal Patra, Dipankar Das [15] | Rule Based Stemmer | Kokborok | 1) Minimum suffix stripping 2) Maximum suffix stripping | 1) 80.02% 2) 85.13%
Upendra Mishra, Chandra Prakash [16] | Brute Force Technique | Hindi | Suffix stripping | 91.6%
Shahid Husain [17] | n-gram stripping model | Marathi | 1) Length based 2) Frequency based | 1) 63.5% 2) 82.68%
Thangarsu, Manavalan [18] | Light Stemmer | Tamil | Light stemming | 98%
Vishal Gupta [19] | Suffix Stripping | Hindi | Rule based suffix stripping | 83.65%
Elaheh Rahimtoroghi, Hesham Faili, Azadeh Shakery [20] | Rule Based Stemmer | Persian | Structural approach and morphological rules | Precision increased by 4.83%
Ayu Purwarianti [21] | Nondeterministic Finite Automata | Indonesian | Suffix stripping | 81%
Mohamad Ababneh, Riyad Al-Shalabi [22] | Rule Based light stemmer | Arabic | Root extraction stemmer and light stemmer | 71%
Osama A. Ghanem, Wesam M. Ashour [23] | K-means Algorithm | Arabic | Clustering | Not mentioned
Sidikka Parlak, Murat Saraclar [24] | Length based | Turkish | Rule based | 80%
CHAPTER 3
APPROACHES FOR LEMMATIZATION
In this chapter we discuss five approaches used for lemmatization; these approaches are either rule-based or statistical in nature. The first is a string-matching, dictionary-based approach. The second is based on finite state automata. The third is an affix removal approach, and the fourth is a fixed-length truncation approach, mostly used for languages where the average word length is more than 7 characters, so that removing a fixed-size suffix can produce good results.
The fifth approach is the trie-based approach, which we have chosen for building our lemmatizer; it is also known as the tree approach and retrieves all possible lemmas of a given inflected word. This approach is discussed in detail in the next chapter.
The approaches are:
• Edit distance on dictionary: a combination of string matching and a model of the most frequent inflectional suffixes.
• Morphological analyzer: based on finite state automata.
• Affix lemmatizer: a combination of a rule-based and a supervised training approach.
• Fixed-length truncation.
• Trie data structure: allows retrieving the possible lemmas of a given inflected or derivational form.
3.1 LEVENSHTEIN DISTANCE DICTIONARY BASED
APPROACH:
The Levenshtein distance is a string metric for measuring the
difference between two sequences. Informally, the Levenshtein distance
between two words is the minimum number of single-character edits (i.e.
insertions, deletions or substitutions) required to change one word into
the other.
Searching for similar sequences of data is of great importance to many
applications, such as gene similarity determination, speech
recognition, database and/or Internet search engines,
handwriting recognition, spell-checkers and other biology, genomics
and text processing applications.
Therefore, algorithms that can efficiently manipulate sequences of
data (in terms of time and/or space) are highly desirable, even with
modest approximation guarantees.
The Levenshtein distance of two strings A and B is the minimum
number of character transformations required to convert string A into
string B.
Formally, the Levenshtein distance between two strings a and b is given by
lev_{a,b}(|a|, |b|), where:

    lev_{a,b}(i, j) = max(i, j)                          if min(i, j) = 0,
    lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                           lev_{a,b}(i, j-1) + 1,
                           lev_{a,b}(i-1, j-1) + 1_(a_i != b_j) )   otherwise,

where 1_(a_i != b_j) is the indicator function, equal to 0 when a_i = b_j and
equal to 1 otherwise. The first element in the minimum corresponds to
deletion (from a to b), the second to insertion, and the third to match or
mismatch, depending on whether the respective symbols are the same.
The edit distance algorithm is performed using three primitive
edit operations: the substitution of one character for another, the deletion
of a character, and the insertion of a character. Some approaches focus
on suffix phenomena only, but this approach deals with both suffixes
and prefixes, so it handles the full affixation phenomenon. Sometimes
suffixes are added to words based on grammatical rules: for the word
"going", this approach returns the headword "go", whereas the word
"went" has a discrete lemma entry in the dictionary. The idea is to find
all possible lemmas for the user's input word.
For each one of the target words, the similarity distance between
the source and the target word is calculated and stored. When this
process is completed, the algorithm returns the set of target words having
the minimum edit distance from the source word; that is, the algorithm
compares the user input against all the stored lemmas and retrieves the
words at minimum distance from it.
The algorithm also provides the option to select an approximation value
that the system considers an acceptable similarity distance: if the user
enters zero as the desired approximation, only the target words with the
minimum edit distance are returned, whereas if he/she enters e.g. 2, the
returned set contains all the target words within distance 2 of the minimum.
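The distance computation and the approximate dictionary lookup described above can be sketched as follows (a minimal sketch; the function names and the `approximation` parameter are our own):

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between strings a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]

def closest_lemmas(word, lemmas, approximation=0):
    """All lemmas within `approximation` of the minimum distance from `word`."""
    dists = {lem: levenshtein(word, lem) for lem in lemmas}
    best = min(dists.values())
    return [lem for lem, d in dists.items() if d <= best + approximation]
```

For instance, levenshtein("kitten", "sitting") is 3 (two substitutions and one insertion), and raising `approximation` above zero widens the returned set of candidate lemmas.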
3.2 MORPHOLOGICAL ANALYZER BASED
APPROACH
A morphological analyzer gives all possible analyses for a given word.
It is based on finite-state technology and produces the
morphological analysis of the word form as its output. This approach uses
finite state automata and two-level morphology to build a lexicon for
a language with an infinite vocabulary. Two-level rules are declarative
constraints that describe morphological alternations, such as the y -> ie
alternation in the plural of some English nouns (spy -> spies). The aim of this
approach is to convert the two-level rules into deterministic, minimized
finite-state transducers; the rule compiler defines the format of two-level
grammars, the rule formalism and the user interface, and it can assist the
user in the development of a two-level grammar.
A finite state transducer (FST) is a finite state
machine with two tapes: an input tape and an output tape. This contrasts
with an ordinary finite state automaton (or finite state acceptor),
which has a single tape. A transducer translates a word from one
form to another, reading from the input tape and writing to the output tape.
Formally, a finite state transducer is a 6-tuple (Q, Σ, Γ, I, F, δ) such that:
• Q is a finite set, the set of states;
• Σ is a finite set, called the input alphabet;
• Γ is a finite set, called the output alphabet;
• I is a subset of Q, the set of initial states;
• F is a subset of Q, the set of final states; and
• δ, a subset of Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q (where ε is the empty string), is the transition relation.
At each step the machine reads a symbol from the input tape, writes a
symbol to the output tape, and changes state; which transition is taken
depends on the current state and the input symbol.
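As an illustrative sketch (not the two-level rule compiler itself), the y -> ie alternation can be expressed as a tiny hand-written transducer whose output at each step depends on the current state and input symbol. The state names and the '#' end-of-word marker are our own assumptions, and the consonant condition of the real English rule is ignored:

```python
def pluralize(word):
    """Toy transducer: reads `word` plus an end marker '#', writes the plural,
    applying the word-final y -> ie alternation (spy -> spies)."""
    state, out = "q0", []
    for ch in word + "#":
        if state == "q0":
            if ch == "y":
                state = "saw_y"       # hold the y; its output depends on what follows
            elif ch == "#":
                out.append("s")       # plain plural: just append s
            else:
                out.append(ch)        # copy input symbol to output
        else:                         # state == "saw_y"
            if ch == "#":
                out.append("ies")     # the held y was word-final: y -> ie, plus s
            elif ch == "y":
                out.append("y")       # the held y was not final; hold this new one
            else:
                out.append("y" + ch)  # emit the held y, then the current symbol
                state = "q0"
    return "".join(out)
```

Here the delayed output for 'y' plays the role of an ε-output transition: the machine defers writing until the next input symbol resolves which rule applies.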
3.3 AFFIX LEMMATIZER
The most common approach to word normalization is to remove affixes
from a given word: a suffix or prefix is removed according to rules defined
from grammatical knowledge of the language. Simply removing a suffix or
prefix from a word cannot, by itself, give an accurate head word or root
word, and a purely rule-based approach cannot give accurate results; by
combining the rule-based approach with a statistical approach such as
supervised training, more accurate results can be obtained.
The supervised training algorithm generates a data structure consisting of
rules that the lemmatizer must traverse to arrive at the rule that is elected to
fire. After training, the data structure of rules is made permanent and can
be consulted by the lemmatizer. The lemmatizer must elect and fire rules in
the same way as the training algorithm, so that all words from the training
set are lemmatized correctly. It may, however, fail to produce the correct
lemmas for words that were not in the training set (the out-of-vocabulary,
or OOV, words). During training this approach uses prime and derived
rules: the prime rule is the least specific rule needed to lemmatize a
training word, while derived rules are more specific rules that can be
created by adding or removing characters.
For example, a rule can map "watcha" to "what are you", or "yer" to
"you are" rather than "your". This approach is more general than a
suffix-removal-only approach. The bulk of 'normal' training words must be
bigger for the affix-based lemmatizer than for a suffix lemmatizer,
because the algorithm generates immense numbers of candidate rules with
only marginal differences in accuracy, requiring many examples to find the
best candidate.
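A minimal sketch of rule election and firing (the rule list below is invented for illustration, not the trained rules from the cited work): rules are ordered from more specific to less specific, and the first rule whose suffix matches is the one elected to fire.

```python
# Hypothetical rewrite rules, ordered from more specific to less specific;
# each rule is (suffix to remove, replacement to add).
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def apply_first_matching_rule(word):
    """The first rule whose suffix matches is 'elected to fire'."""
    for remove, add in RULES:
        if word.endswith(remove):
            return word[: -len(remove)] + add
    return word  # no rule fired: the word is returned unchanged
```

The ordering matters: placing ("ies", "y") before ("s", "") makes 'flies' become 'fly' rather than 'flie', which is exactly the kind of specificity the prime/derived rule hierarchy encodes.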
3.4 FIXED LENGTH TRUNCATION
In this approach, we simply truncate the words, using the first n (e.g. 5
or 7) characters of each word as its lemma. Words with fewer than n
characters are used as lemmas with no truncation. This approach is most
appropriate for languages like Turkish, where the average word length is
7.07 letters.
This approach is used when time is the highest-priority issue. It is the
simplest approach and does not depend on any language or grammar, so it
can be applied to any language.
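The whole approach fits in one line (n = 7 here, following the Turkish average word length cited above):

```python
def truncation_lemma(word, n=7):
    """First n characters of the word; shorter words are returned unchanged."""
    return word[:n]
```

Python slicing already leaves words shorter than n untouched, so no length check is needed.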
CHAPTER 4
TRIE BASED APPROACH
4.1 INTRODUCTION
The trie data structure is one of the most important data storage mechanisms in programming. A tree is a natural way to represent essential utilities on a computer, like the directory structure in a file system, and many other objects can be stored in a tree data structure, resulting in space and/or time efficiency. For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory, we can use a tree structure to store the words efficiently instead of using a plain array or vector type that simply stores each word individually; the space needed to store the words in an array or vector is simply the number of words times the average length of the words.
A trie, also called a prefix tree, is a tree structure that stores words with a common prefix under the same sequence of edges, eliminating the need to store the same prefix once for each word. More formally, a trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike in a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.
EXAMPLE 1: Trie for a language consisting of the words a, abase, abate and bat:
In computer science, a radix tree (also PATRICIA trie, radix trie or compact prefix tree) is a space-optimized trie in which each node with only one child is merged with its parent. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes. A trie is a data structure which allows retrieving all possible lemmas: each node holds a single character, nodes are connected by edges, and a word is retrieved character by character. The approach also involves backtracking to get the appropriate result.
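The trie of Example 1 can be built with a minimal implementation like the following sketch. The `longest_prefix_lemma` helper is our own simplification of the lemma lookup, without the backtracking mentioned above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.is_word = False  # True if a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

    def longest_prefix_lemma(self, word):
        """Longest stored word that is a prefix of `word` (None if there is none)."""
        node, best = self.root, None
        for i, ch in enumerate(word):
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.is_word:
                best = word[: i + 1]
        return best
```

After inserting a, abase, abate and bat, contains("abate") is True while contains("aba") is False (no word ends there), and if "go" is also inserted, longest_prefix_lemma("going") returns "go".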
EXAMPLE 2:
Trie for a Hindi language consisting of words:
कमल, कमरा, कमर, कमरे, लड़, लड़का, लड़की:
Each word is stored starting from the root, character by character in Unicode order. To look up an input word, the search starts from the first node and traverses the tree up to the last character of the word. The traversal may need to backtrack by some levels.
4.2 APPLICATIONS OF TRIE:
Prefix trees are a somewhat overlooked data structure with many interesting possibilities. The trie is used mainly for manipulating words in a language and has a wide variety of applications:
• Spell checking and word completion
• Data compression
• Computational biology
• Routing tables for IP addresses
• Storing/querying XML documents, etc.
As a dictionary: Looking up whether a word is in a trie takes O(n) operations, where n is the length of the word; thus, for array implementations, the lookup speed does not change with increasing trie size. Tries have been used to store large dictionaries of (say) English words in spell-checking programs and in natural-language "understanding" programs. Simple spell checkers operate on individual words by comparing each of them against the contents of a dictionary, possibly performing stemming on the word first. If the word is not found it is considered an error, and an attempt may be made to suggest the word that was likely intended.
Word completion: Word completion is straightforward to implement using a trie: simply find the node corresponding to the first few letters, and then collapse the subtree below it into a list of possible endings. This can be used to auto-complete user input in text editors.
Tries and web search engines: The index of a search engine (the collection of all searchable words) is stored in a compressed trie. Each leaf of the trie is associated with a word and has a list of pages (URLs) containing that word, called its occurrence list. The trie is kept in internal memory, while the occurrence lists are kept in external memory and are ranked by relevance. Boolean queries for sets of words (e.g. "Java and coffee") correspond to set operations (e.g. intersection) on the occurrence lists.
Additional information retrieval techniques are used, such as:
- stop-word elimination (e.g. ignore "the", "a", "is");
- stemming (e.g. identify "add", "adding", "added" as the same word);
- link analysis (recognize authoritative pages).
Tries and internet routers: Computers on the internet (hosts) are identified by a unique 32-bit IP (Internet Protocol) address, usually written in dotted-quad-decimal notation; e.g. www.google.com is 62.233.189.104. An organization uses a subset of IP addresses with the same prefix; e.g. IIDT uses 10.*.*.*. Data is sent to a host by fragmenting it into packets, and each packet carries the IP address of its destination. A router forwards a packet to its neighbors using IP prefix-matching rules, and routers use tries to do this prefix matching.
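The word-completion application described above can be sketched with a minimal trie. This is an illustrative sketch, not the project's code; the class name Trie and its method names are assumptions:

```java
import java.util.*;

// Minimal trie supporting word completion: find the node reached by a
// prefix, then collect every stored word in that node's subtree.
public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isEnd; // true if a stored word ends at this node

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isEnd = true;
    }

    // Return all stored words that start with the given prefix.
    public List<String> complete(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList(); // no word has this prefix
        }
        List<String> results = new ArrayList<>();
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    private void collect(Trie node, StringBuilder path, List<String> results) {
        if (node.isEnd) results.add(path.toString());
        for (Map.Entry<Character, Trie> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, results);
            path.setLength(path.length() - 1); // backtrack after the recursive call
        }
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        for (String w : new String[]{"a", "abase", "abate", "bat"}) t.insert(w);
        System.out.println(t.complete("ab")); // abase and abate, in some order
    }
}
```

Collecting the subtree below the prefix node is exactly the "collapse the subtree into a list of possible endings" step described above.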
4.3 PRE-PROCESSING STEPS
Before applying the lemmatization algorithm we need to normalize the contents of the input file. This includes the following 2 steps:
1) Tokenization:
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: [Friends] [Romans] [Countrymen] [lend] [me] [your] [ears]
These tokens are often loosely referred to as terms or words.
2) Stop-words Removal:
In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.
Common stop words for the English language:
a, about, above, after, again, against, is, I, am, all, an, but, be, being, and, are, can't
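A minimal Java sketch of these two pre-processing steps. The class name PreProcess is illustrative, and the stop-word list here is a small subset of the one the project actually uses:

```java
import java.util.*;

// Pre-processing sketch: split a line into tokens (dropping punctuation),
// then filter out common stop words.
public class PreProcess {
    // Small illustrative stop-word list; the project uses a larger one.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "is", "me", "your"));

    // Tokenization: chop the input into pieces at any run of characters
    // that is neither a letter nor a combining mark (so Devanagari matras
    // stay attached to their words).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{M}]+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Stop-word removal: keep only tokens that are not stop words.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;");
        System.out.println(tokens);                  // the seven tokens
        System.out.println(removeStopWords(tokens)); // "me" and "your" removed
    }
}
```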
4.4 BASIC OPERATIONS IN TRIE:
4.4.1 SEARCHING IN A TRIE
To search for a key K in a trie T, we begin at the root, which is a branch node. Suppose the key K is made up of characters k1 k2 k3 ... kn. The first character k1 is extracted and the pChildren field corresponding to k1 in the root branch node is inspected. If T->pChildren[k1-'a'] is equal to NULL, the search is unsuccessful, since no such key exists. If T->pChildren[k1-'a'] is not NULL, then that field points either to an information node or to a branch node. If the information node holds K, the search is done: the key K has been successfully retrieved. Otherwise, it implies the presence of key(s) with a similar prefix. We extract the next character k2 of key K and move down the link field corresponding to k2 in the branch node encountered at level 2, and so on, until the key is found in an information node or the search is unsuccessful. The deeper the search goes, the more keys there are with similar but longer prefixes.
Algorithm for searching for a word in the Trie:
1. Set current node to root node. Set the current letter to the first letter in
the word.
2. If the current node is null then the word does not exist in the Trie.
3. If the current node has a reference to a valid node containing the current
letter, then set the current node to that referenced node and advance the
current letter to the next letter in the word.
4. Repeat steps 2 and 3 until all letters in the word have been processed.
5. Two possibilities indicate that the word is not in the trie:
a) the current letter is the last letter and there is no valid node containing
this letter, or
b) there is a valid node containing the last letter, but the node does not
indicate that it completes a full word (i.e. the Boolean field isEnd = false).
6. If the conditions in step 5 are not met, then we have a match for
the word in the Trie (i.e. when isEnd = true).
4.4.2 INSERTION IN A TRIE:
To insert a key K into a trie, we begin as we would for a search, following the appropriate pChildren fields of the branch nodes corresponding to the characters of the key. At the point where the pChildren field of a branch node leads to NULL, the remaining characters of the key K are inserted. Algorithm for inserting a word into a Trie:
1. Set the current node to the root node (whose value is null).
2. Set the current letter to the first letter in the word.
3. If the current node already has an existing reference to the current letter
then set current node to that referenced node; else create a new node, set
the letter to current letter, and set current node to this new node, set the
value of isEnd to false.
4. Repeat step 3 until all letters in the current word have been processed.
Set isEnd = true in the final node when the process ends.
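The search and insertion steps above can be sketched together in Java, using a hash map of child nodes and a Boolean isEnd flag. This is an illustrative sketch rather than the project's actual code; the class name TrieNode is an assumption:

```java
import java.util.*;

// Trie with the two basic operations: insert a word, then search for it.
// Each node keeps a map of child nodes and a flag marking word endings.
public class TrieNode {
    private final Map<Character, TrieNode> children = new HashMap<>();
    private boolean isEnd = false; // true if a complete word ends here

    // Insertion: follow existing references, creating nodes where missing,
    // and mark the last node as the end of a word.
    public void insert(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isEnd = true;
    }

    // Search: follow the references letter by letter; the word exists only
    // if every letter is found AND the final node has isEnd == true.
    public boolean search(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false; // path breaks: word absent
        }
        return node.isEnd; // reached last letter: present only if marked
    }

    public static void main(String[] args) {
        TrieNode root = new TrieNode();
        root.insert("abase");
        root.insert("bat");
        System.out.println(root.search("abase")); // true
        System.out.println(root.search("aba"));   // false: prefix, not a word
        System.out.println(root.search("cat"));   // false: no path
    }
}
```

Note how the two failure possibilities of step 5 in the search algorithm appear here as the null check and the final isEnd test.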
4.5 ALGORITHM FOR LEMMATIZATION:
The algorithm requires a file containing the list of the root words (lemmas) of the language concerned. At first, we create a trie structure from these dictionary root words. Each node in the trie corresponds to a Unicode character of the language concerned, and the nodes that end with the final character of any root word are marked as final nodes; the remaining nodes are non-final. To find the lemma of a surface word, the trie is navigated starting from the initial node, and navigation ends either when the word is completely found in the trie or when, after some portion of the word, there is no path left in the trie to navigate.
The key idea is that a trie is created out of the vocabulary (root words) of the language.
The lemmatizing process consists in navigating the trie, trying to find a match between the input word and an entry in the trie.
An algorithm is then applied to find the appropriate lemma for the input word.
EXPLANATION:
The algorithm requires a list of root words of the language concerned. We
are storing the root words in a file.
Step 1: At first, we create a trie structure using the list of root words.
A trie node consists of the fields:
1) value, which corresponds to a Unicode character of the language concerned;
2) isEnd, a Boolean field which is set to true if the node holds the final
character of any root word, and false otherwise;
3) children, a hash map which maps the current trie node to its child nodes.
The Insert algorithm explained above is used to insert the root words in a
trie.
Step2: Navigating through the trie to find the matching prefix
The Search algorithm explained above is applied here with some
modifications to find the lemma of the surface word.
To find the lemma of a surface word, the trie is navigated starting from
the initial node in the trie and navigation ends when either the word is
completely found in the trie or after some portion of the word there is no
path present in the trie to navigate.
While navigating, some situations may occur, depending on which we
take decision to determine the lemma. Those situations are described
below.
CASE 1:
The surface word is a root word. In that case, the surface word itself is
the lemma.
Example:
Stored word: abbreviate
Input: abbreviate
Matched string: abbreviate
Output: abbreviate
CASE 2:
The surface word is not a root word. In that case, the trie is navigated up to the node where the surface word completely ends or where there is no path left to navigate in the trie. We call this node the end node.
Again two different cases may occur.
CASE 2.1:
If one or more root words are found in the path from the initial node to
the end node, i.e. if one or more final nodes are present in the path,
then pick the final node which is closest to the end node.
Example:
Stored words: a, an, and
Input: ands
Matched prefixes: a, an, and
Output: and
The word represented by the path from the initial node to the picked final node is considered the lemma.
CASE 2.2
If no root word is found in the path from the initial node to the end node,
then find the final node in the trie which is closest to the end node.
Example:
Stored word: abbreviate
Input: abbreviating
Matched string: abbreviat
Output: abbreviate
If more than one final node is found at the closest distance, then pick all of them. Now generate the root word(s) represented by the path(s) from the initial node to those picked final node(s).
Output: The list of matched lemmas is returned.
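The case analysis above can be sketched as a single trie walk that remembers the deepest final node on the path (cases 1 and 2.1), and otherwise searches below the stopping point for the nearest final node(s) (case 2.2). This is a minimal sketch under those assumptions; the class and method names are illustrative, not the project's:

```java
import java.util.*;

// Lemmatizer sketch: walk the trie along the surface word; if a root word
// ends on the path, return the deepest one (cases 1 and 2.1); otherwise do
// a breadth-first search below the end node for the nearest root-word
// ending(s) (case 2.2).
public class Lemmatizer {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isEnd = false;
    }

    private final Node root = new Node();

    public void addRootWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isEnd = true;
    }

    public List<String> lemmatize(String surface) {
        Node node = root;
        String matchedRoot = null; // deepest root word seen on the path
        int depth = 0;
        for (char c : surface.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) break; // no path left to navigate: stop here
            node = next;
            depth++;
            if (node.isEnd) matchedRoot = surface.substring(0, depth);
        }
        if (matchedRoot != null) return Collections.singletonList(matchedRoot);
        // Case 2.2: level-by-level search below the end node; all final
        // nodes at the closest distance are returned.
        List<String> lemmas = new ArrayList<>();
        Queue<Map.Entry<Node, String>> queue = new ArrayDeque<>();
        queue.add(new AbstractMap.SimpleEntry<>(node, surface.substring(0, depth)));
        while (!queue.isEmpty() && lemmas.isEmpty()) {
            int levelSize = queue.size();
            for (int i = 0; i < levelSize; i++) {
                Map.Entry<Node, String> cur = queue.poll();
                if (cur.getKey().isEnd) lemmas.add(cur.getValue());
                for (Map.Entry<Character, Node> e : cur.getKey().children.entrySet()) {
                    queue.add(new AbstractMap.SimpleEntry<>(e.getValue(), cur.getValue() + e.getKey()));
                }
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        Lemmatizer lem = new Lemmatizer();
        for (String w : new String[]{"a", "an", "and"}) lem.addRootWord(w);
        System.out.println(lem.lemmatize("ands"));          // [and] (case 2.1)
        Lemmatizer lem2 = new Lemmatizer();
        lem2.addRootWord("abbreviate");
        System.out.println(lem2.lemmatize("abbreviate"));   // [abbreviate] (case 1)
        System.out.println(lem2.lemmatize("abbreviating")); // [abbreviate] (case 2.2)
    }
}
```

Because nodes hold arbitrary Unicode characters in the hash map, the same sketch works for Hindi root words as well as English ones.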
Hindi Language: (Tokenization and stop-words removal)
4.6 FLOWCHART OF ALGORITHM:
The above discussed algorithm can be depicted with the help of the
following flowchart, which explains how the trie can be used to find
common prefixes:
4.7 DRAWBACKS OF ALGORITHM:
The following are some of the drawbacks of the lemmatizing algorithm used:
1) Compound words and out-of-vocabulary words are not handled by our algorithm.
2) Root words are taken from a dictionary, so if the coverage of the dictionary is not good, accuracy will degrade.
CHAPTER 5
IMPLEMENTATION & RESULT ANALYSIS
Appendix A: Snapshot
ENGLISH LANGUAGE
a) Input: abases; Output: abase
b) Input: EnglishInput.txt (file); Output: myFile.txt
c) Input: EnglishTry.txt; Tokenized file: Tokenized.txt; Output: MyFile.txt
HINDI LANGUAGE
a) Input: लड़कियां; Output: लड़की
b) Input: HindiInput.txt; Output: myFile.txt
RESULT ANALYSIS:
For evaluation of results we have performed the following tests:
English Language:
Test1:
A file containing 14,730 lemmas was used to build the trie data structure, and another file (the input file) containing 25,803 inflected words was used to perform the testing.
It is found that out of these 25,803 words, our lemmatizer gives correct results for 24,513 words. We have used the following formula to calculate the accuracy of the lemmatizer:
Accuracy = (No. of words correctly lemmatized / Total no. of words) x 100
Thus for this test, Accuracy = (24,513 / 25,803) x 100 = 95% (approximately).
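The accuracy figures can be checked directly from the formula (the class name Accuracy is illustrative; the counts are the ones reported in these tests):

```java
// Accuracy = (correctly lemmatized / total) x 100, as defined above.
public class Accuracy {
    static double accuracy(int correct, int total) {
        return correct * 100.0 / total;
    }

    public static void main(String[] args) {
        System.out.printf("English: %.0f%%%n", accuracy(24513, 25803)); // 95%
        System.out.printf("Hindi:   %.0f%%%n", accuracy(207, 220));     // 94%
    }
}
```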
Test2:
A file (the input file) containing 7 sentences of approximately 4 words each. For this we have taken a file containing the related root words to build the trie data structure. It is found that our lemmatizer correctly tokenizes and removes the stop words from the input file, and then also correctly lemmatizes all the words (Accuracy = 100%). (Refer to the snapshots for details.)
Hindi Language:
Test1:
A file containing 1000 Hindi lemmas to build the Trie data structure and another file (input file) containing 220 inflected words to perform the testing.
It is found that out of these 220 words our lemmatizer is able to give correct results for 207 words.
Thus for this test, Accuracy= (207/220)*100=94%
Test2:
A file (the input file) containing 10 sentences of approximately 5 words each. For this we have taken a file containing the related root words to build the trie data structure. It is found that our lemmatizer correctly tokenizes and removes the stop words from the input file, and then lemmatizes almost all the words correctly (Accuracy = 97%).
Table:
Sr. No. | Language | No. of words taken | Correctly lemmatized words | Accuracy
1       | ENGLISH  | 25,803             | 24,513                     | 95%
2       | HINDI    | 220                | 207                        | 94%
Appendix B: Development Platform
Software Requirements for implementing the system:
Operating System: Windows 7
Platform Used: Java NetBeans IDE 7.3.1
Hardware requirements for developing and implementing the system:
A Pentium-based laptop with a minimum of:
I. 1GB RAM
II. 320GB Hard Disk Space
III. Intel Pentium inside Processor
CHAPTER 7
CONCLUSION
In this project work, we investigated many existing techniques and selected a trie-based approach for building our lemmatizer.
We tested our lemmatizer for the English and Hindi languages and found that it gives good results, but in many cases it fails to lemmatize correctly because of out-of-vocabulary words, compound words, and kinds of inflection that are specific to particular languages.
Finally, we can conclude that our lemmatizer is language independent and can thus be used for any language, but it requires a correct list of all the root words of that language to build the trie.
CHAPTER 8
FUTURE WORK
With the present approach, one can further work on the following future aspects:
1) Other data structures, such as a compressed trie, can be used to improve the results.
2) Higher accuracy can be achieved by providing more user interaction.
3) Handling of compound words and out-of-vocabulary words can be added to the algorithm.
4) If the root word is not in the dictionary, there should be some way to still provide a result.
5) Backtracking can be implemented in the algorithm for better search results.
REFERENCES
[1]. https://www.google.com/search?sclient=psy-ab&btnG=Search&q=lemmatization+articles#q=A+Rule+based+Approach+to++Word+Lemmatization+by+joel+pilson
[2]. http://www.apl.org.pt/docs/22-textos-seleccionados/12-Branco_Silva.pdf
[3]. http://www.arxiv.org/pdf/1310.0581
[4]. http://www.airccj.org/CSCP/vol3/csit3408.pdf
[5]. http://arxiv.org/abs/cmp-lg/9503020
[6]. http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149
[7]. http://www.euralex.org/elx_proceedings/Euralex1994/56_Euralex_Eugenio%20Picchi%20%20Statistical%20Tools%20for%20Corpus%20Analysis%20%20A%20Tagger%20and%20Lemmatizer%20for%20I.pdf
[8]. http://dl.acm.org/citation.cfm?id=980692
[9]. http://dl.acm.org/citation.cfm?id=980910
[10]. http://arxiv.org/abs/1203.3584
[11]. Dalwadi Bijal, Suthar Sanket, "Overview of Stemming Algorithms for Indian and Non-Indian Languages", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 5 (2), PP. 1144-1146, 2014.
[12]. Vishal Gupta, "Hindi Rule Based Stemmer for Nouns", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 1, January 2014.
[13]. M. Thangarasu, R. Manavalan, "Design and Development of Stemmer for Tamil Language: Cluster Analysis", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 7, July 2013.
[14]. Siddika Parlak, Murat Saraclar, "Performance Analysis and
Improvement of Turkish Broadcast News Retrieval", IEEE
Transactions on Audio, Speech, and Language Processing, Vol. 20,
No. 3, PP. 731-740, March 2012.
[15] Upendra Mishra, Chandra Prakash, “MAULIK: An Effective
Stemmer for Hindi Language”, International Journal on Computer
Science and Engineering (IJCSE) Vol. 4 No. 5, PP.711-717, May
2012.
[16] Ms. Anjali Ganesh Jivani, “A Comparative Study of Stemming
Algorithms”, International Journal of Computer Technology and
Applications, Vol.2 (6), PP 1930-1938, NOV-DEC 2011.
[17] Mohamad Ababneh, Riyad Al-Shalabi, Ghassan Kanaan, Alaa
Al-Nobani, “Building an Effective Rule-Based Light Stemmer for
Arabic Language to Improve Search Effectiveness”, The International
Arab Journal of Information Technology, Vol. 9, No. 4, PP.368-372, July
2012.
[18] Suprabhat Das, Pabitra Mitra, “A Rule-based Approach of
Stemming for Inflectional and Derivational Words in Bengali”,
Proceeding of the IEEE Students' Technology Symposium, PP.14-16,
January, 2011.
[19] Mohd. Shahid Husain, “AN UNSUPERVISED APPROACH TO
DEVELOP STEMMER”, International Journal on Natural Language
Computing (IJNLC) Vol. 1, No.2, August 2012.
[20] M. Thangarasu, R. Manavalan, "A Literature Review: Stemming
Algorithms for Indian Languages”, International Journal of Computer
Trends and Technology (IJCTT), volume 4 Issue 8, August 2013.
[21] Vimala Balakrishnan, Ethel Lloyd-Yemoh, “Stemming and
Lemmatization: A Comparison of Retrieval Performances”, Lecture
Notes on Software Engineering, Vol. 2, No. 3, August 2014.
[22] M. Nithya, “Clustering Technique with Potter stemmer and
Hyper graph Algorithms for Multi-featured Query Processing”,
International Journal of Modern Engineering Research (IJMER), Vol.2,
Issue.3, pp-960-965, May-June 2012.
[23] Dhamodharan Rajalingam, “A Rule Based Iterative Affix Stripping
Stemming Algorithm for Tamil”, vol 132, PP-583-590, 2012
[24] www.ijrat.org/downloads/icatest2015/ICATEST
[25]https://www.cse.iitb.ac.in/~pb/papers/gwc14-multilingual-stemmer
[26].https://hbfs.wordpress.com/2012/07/10/stemming
[27].https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=624693
[28].String Processing and Information Retrieval: Volume 9
[29].Ljiljana Dolamic and Jacques Savoy. 2010. Comparative Study of Indexing and Search Strategies for the Hindi, Marathi and Bengali Languages.