A
PROJECT REPORT
ON
LEMMATIZATION USING TRIE DATA STRUCTURE
A Project Work (MS-405)
Submitted to Assam University, Silchar in partial fulfillment of the requirements for
the award of degree of Master of Science in Computer Science
UNDER THE GUIDANCE OF
Mrs. SUNITA SARKAR,
Department of Computer Science
Assam University, Silchar
SUBMITTED BY
KARISHMA TAPARIA
Semester: 4th, M.Sc. (2 Yrs)
Sub Code: Msc 405(C)
Exam Roll: 101714, No.: 22220415
ABSTRACT:
Lemmatization is used to normalize the inflectional forms of a word to its root word, and can therefore serve as a preprocessing step in any natural language processing application. It is a very important technique for information retrieval: it reduces the different inflectional as well as derivational forms of a word to its root or head word, called its 'lemma'. A 'lemma' is simply the "dictionary form" of a word. Through lemmatization, the different grammatical forms of a word can be analyzed as a single word.
However, the problem is that for many languages (mainly Indian), lemmatizers do not exist, and this problem is not easy to solve, since rule-based lemmatizers take time to build and require highly skilled linguists, while statistical stemmers do not return legitimate lemmas. Our goal is to implement a language-independent lemmatizer that uses the trie data structure to store the root words and tries to find a potential lemma of a surface word by efficiently searching the trie. A trie allows retrieving the possible lemmas of a given inflected or derivational form.
DECLARATION
I, Karishma Taparia, student of 4th semester (M.Sc. 2 years), Department of
Computer Science, do hereby solemnly declare that I have duly worked on my
project entitled "Lemmatization using Trie Data Structure" under the
supervision of Mrs. Sunita Sarkar, Assistant Professor, Department of
Computer Science, Assam University, Silchar. This project work is submitted
in partial fulfillment of the requirements for the award of the degree of Master
of Science in Computer Science. The results embodied in this thesis have not
been submitted to any other university or institution for the award of any
degree or diploma.
Date: (Karishma Taparia)
Place: Assam University, Silchar Semester: 4th sem (M.Sc. 2 years)
Roll No.: 22220415
Assam University, Silchar
CERTIFICATE
This is to certify that Miss Karishma Taparia, student of 4th semester (M.Sc. 2
years), Department of Computer Science, Assam University, Silchar, bearing
Roll No. 22220415, has carried out her project work
entitled "Lemmatization using Trie Data Structure" under my guidance in
partial fulfillment of the requirements for the award of the degree of Master of
Science in Computer Science during the period January 2016 to May 2016.
The project is the result of her own investigation, and neither the report as a
whole nor any part of it has been submitted to any other university or
institution for any degree or diploma.
Date: Mrs. SUNITA SARKAR
Place: Assistant Professor,
Department of Computer Science,
Assam University, Silchar.
CERTIFICATE
This is to certify that Miss Karishma Taparia, student of 4th semester (M.Sc. 2
years), Department of Computer Science, Assam University, Silchar, bearing
Roll No. 22220415, has carried out her project work
entitled "Lemmatization using Trie Data Structure" under the guidance of
Mrs. Sunita Sarkar, Assistant Professor, Department of Computer Science,
Assam University, Silchar, in partial fulfillment of the requirements for the
award of the degree of Master of Science in Computer Science during the period
January 2016 to May 2016.
The project is the result of her own investigation, and neither the report as a
whole nor any part of it has been submitted to any other university or
institution for any degree or diploma.
Date: Dr. BIPUL SHYAM PURKAYASTHA
Place: Head of Department,
Department of Computer Science,
Assam University, Silchar.
ACKNOWLEDGEMENT:
At the very outset, I take the privilege to convey my gratitude to those
persons whose co-operation, suggestions and heartfelt support helped me to
accomplish the project successfully.
I take immense pleasure in expressing my sincere thanks and profound
gratitude to my respected guide, Mrs. SUNITA SARKAR, for her continuous
support, patience, motivation, enthusiasm and immense knowledge, and for
providing timely support and suitable suggestions.
I want to thank my parents for their affection and their help in managing
my life in busy times. Without them, it would have been very difficult to
focus on my project.
My special thanks go to all those who directly or indirectly extended their
helping hands in making the project a grand success.
Date: 27th April 2016 KARISHMA TAPARIA
Department of Computer Science,
Assam University, Silchar.
TABLE OF CONTENTS
1. Introduction
1.1. Stemming vs. Lemmatization
1.2. Motivation
1.3. Objective
1.4. Problem Statement
1.5. Aim of Lemmatization
2. Related Work and Background
2.1. Introduction
2.2. Review on Lemmatization
2.3. Review on Stemming
3. Approaches For Lemmatization
3.1 Introduction
3.2 Levenshtein distance based approach
3.3 Morphological Analyzer
3.4 AFFIX based approach
3.5 Fixed Length Truncation
4 TRIE based approach
4.1 Introduction
4.2 Pre-processing Steps
4.2.1 Tokenization
4.2.2 Stop-words Removal
4.3 Applications of Trie
4.4 Basic Operations on Trie
4.4.1 Searching A Trie
4.4.2 Insertion In Trie
4.5 Algorithm For Lemmatization
4.6 Flowchart
4.7 Drawback of algorithm
5 Implementation and Result Analysis
5.1 Implementation
5.2 Result Analysis
6 Conclusion
7 Future Work
8 References
CHAPTER 1
1. INTRODUCTION:
Language is an important tool for communication, and natural
language processing (NLP) is concerned with the interaction between
human languages and computers. NLP involves enabling computers
to derive meaning from human or natural language input. It is a very
active research topic nowadays, as it is used in most linguistic
applications.
In lexical knowledge bases like dictionaries, WordNet, etc., the entries
are usually root words with their morphological and semantic
descriptions. Therefore, when a surface word is encountered in raw
text, its meaning cannot be obtained until its appropriate root
word is determined through lemmatization. Thus, lemmatization is a
basic need for any kind of semantic processing of languages.
"Lemmatization" refers to normalizing the different inflectional as
well as derivational forms of a word to its head word.
This task can be used as a pre-processing step for many natural
language processing applications (e.g. morphological analyzers, electronic
dictionaries, spell-checkers, stemmers, etc.). It may also be useful as a
generic keyword generator for search engines and other data
mining, clustering and classification tools.
[Figure: the lemmatization process maps the morphological variants of a word (preparing, prepared, preparation, prepare) to the common root word (prepare).]
1.1 STEMMING VS LEMMATIZATION:
Normalization is a very important task in any natural language processing
application. Stemming and lemmatization are normalization
techniques used to reduce the different grammatical forms of a word to
its head word by applying a set of rules.
Stemming is the process of reducing the different inflectional forms of a
word to its stem by applying a set of rules. The aim of stemming is simply
to reduce a word to its stem without considering its part of speech (POS).
It is used in most text mining applications, where the aim is only to
reduce the form of a word without worrying about its occurrence in the
given context. The result of stemming is called a stem, and it is not
always a dictionary word.
In linguistics, a lemma (from the Greek noun "lemma", headword)
is the "dictionary" or "canonical" form of a set of words. More
specifically, a lemma is the canonical form of a lexeme, where the
lexeme is the set of all the forms that have the same meaning, and the
lemma is the particular form chosen as the base form to
represent the lexeme. Lemmatization is the most frequently used
normalization technique in information retrieval applications like
indexing and searching.
For example: the words 'produce', 'produced', 'producing' and 'production'
are all stemmed to 'produc', whereas a lemmatizer will return the word
'produce'.
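The contrast can be sketched in a few lines of code. This is an illustrative toy only: the suffix list and the lemma dictionary are our own assumptions, not the rules of any real stemmer or lemmatizer.

```python
# Illustrative only: a naive stemmer that blindly strips suffixes,
# versus a lemmatizer that checks candidates against a lemma dictionary.

SUFFIXES = ("ation", "tion", "ing", "ed", "e")  # assumed rule list

def naive_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word, lemmas):
    """Return a dictionary word: strip a suffix, optionally restoring a final 'e'."""
    if word in lemmas:
        return word
    for suffix in ("d", "ed", "ing", "tion", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in lemmas:
                return stem
            if stem + "e" in lemmas:  # producing -> produc -> produce
                return stem + "e"
    return word
```

With this sketch, naive_stem maps 'produced', 'producing' and 'production' all to the non-word 'produc', while naive_lemmatize with the lemma set {'produce'} maps them to the dictionary form 'produce'.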
1.2 MOTIVATION
Natural language processing is a very active research topic nowadays, as
it is used in most linguistic applications.
Stemming and lemmatization are normalization techniques used to
reduce the different grammatical forms of a word to its head word, and
lemmatization can be used as a pre-processing step in information
retrieval applications.
However, the problem is that for many languages lemmatizers do not
exist, and this problem is not easy to solve, since rule-based
lemmatizers take time to build and require highly skilled linguists. Statistical
stemmers, on the other hand, do not return legitimate lemmas.
1.3 OBJECTIVE
To develop a lemmatizer using the trie data structure.
1.4 PROBLEM STATEMENT
The key idea is that a trie is created from a file containing the list of
lemmas (root words) of the language concerned.
The lemmatization process consists in navigating the trie, trying to
find a match between the input word and an entry in the trie.
An algorithm is then applied to find the appropriate lemma for the input
word.
1.5 AIM OF LEMMATIZATION:
Lemmatization aims to remove inflectional endings only and to return the
dictionary form of a word, and it may use a vocabulary and/or
morphological analysis of words. Lemmatizers therefore require much
more knowledge about the language than stemmers, relying on this
knowledge rather than on the language-specific stripping rules that
stemmers use.
Lemmatization is closely related to stemming; however, stemming
operates only on a single word at a time, whereas lemmatization may
operate on the full text and can therefore discriminate between words
that have different meanings depending on their part of speech. On the
other hand, stemmers are typically easier to implement and run
faster. Lemmatizers play a significant role in IR, and the ability to
lemmatize words efficiently and effectively is thus important.
CHAPTER 2
RELATED WORK AND BACKGROUND
INTRODUCTION:
Lovins described the first stemmer (Lovins, J.B., 1968), which was
developed specifically for IR/NLP applications. Her approach consisted
of a manually developed list of 294 suffixes, each linked to 29
conditions, plus 35 transformation rules. For an input word, the suffix
with an appropriate condition is checked and removed. Porter developed
the Porter stemming algorithm (Porter, 1980), which became the most
widely used stemming algorithm for the English language. These stemmers
were later described in a very high-level language known as Snowball.
A number of statistical approaches have been developed for stemming.
Notable works include Goldsmith's unsupervised algorithm for learning the
morphology of a language based on the Minimum Description Length
(MDL) framework (Goldsmith, 2001, 2006), and Creutz's probabilistic
maximum a posteriori (MAP) formulation for unsupervised morpheme
segmentation (Creutz, 2005, 2007).
A few approaches are based on the application of Hidden Markov Models
(Massimo et al., 2003). In this technique, each word is considered to be
composed of two parts, a prefix and a suffix. The HMM states are
divided into two disjoint sets: prefix states, which generate the first part of
the word, and suffix states, which generate the last part of the word, if the
word has a suffix. Once a complete and trained HMM is available for a
language, stemming can be performed directly.
Plisson proposed the most accepted rule-based approach for
lemmatization (Plisson et al., 2008). It is based on word endings,
where suffixes are removed or added to get the normalized word form. In
another work, a method to automatically develop lemmatization rules to
generate the lemma from the full form of a word was discussed (Jongejan
et al., 2009). The lemmatizer was trained on Danish, Dutch, English,
German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full
form-lemma pairs.
KIMMO (Karttunen et al., 1983) is a two-level morphological analyzer
containing a large set of morphophonemic rules. The work started in 1980,
and the first implementation in LISP was available three years later.
Tarek El-Shishtawy proposed the first non-statistical Arabic lemmatizer
algorithm (Tarek et al., 2012). It makes use of different Arabic language
knowledge resources to generate an accurate lemma form and its relevant
features that support IR purposes; a maximum accuracy of 94.8% is
reported. OMA is a Turkish morphological analyzer which gives all
possible analyses for a given word with the help of finite-state
technology; two-level morphology is used to build the lexicon for the
language (Okan et al., 2012).
Grzegorz Chrupala (Chrupala et al., 2006) presented a simple data-driven,
context-sensitive approach to lemmatizing word forms. The Shortest Edit
Script (SES) between the reversed input and output strings is computed to
achieve this task; an SES describes the transformations that have to be
applied to the input string (word form) in order to convert it to the output
string (lemma).
As for lemmatizers for Indian languages, the earliest work, by
Ramanathan and Rao (2003), used a manually sorted suffix list and
performed longest-match stripping for building a Hindi stemmer.
Majumder et al. (2007) developed YASS (Yet Another Suffix Stripper),
where conflation was viewed as a clustering problem with an a priori
unknown number of clusters; they suggested several distance measures
rewarding long matching prefixes and penalizing early mismatches.
In a recent work on affix-stacking languages like Marathi, a Finite State
Machine (FSM) is used to develop a Marathi morphological analyzer
(Dabre et al., 2012). In another approach, a Hindi lemmatizer is
proposed, where suffixes are stripped according to various rules and the
necessary character(s) are added to get a proper root form (Paul
et al., 2013). GRALE is a graph-based lemmatizer for Bengali comprising
two steps (Loponen et al., 2013): in the first step it extracts the set of
frequent suffixes, and in the second step a human manually identifies the
case suffixes. Words are considered as nodes, and an edge from node u
to node v exists only if v can be generated from u by the addition of a suffix.
Unlike the above-mentioned rule-based and statistical approaches, our
lemmatizer uses the properties of a trie data structure, which allows
retrieving the possible lemmas of a given inflected word.
2.1 REVIEW ON LEMMATIZATION:

Author | Title | Language | Technique Used | Accuracy
Joël Plisson [1] | A Rule based Approach to Word Lemmatization | Slovene | Ripple Down Rules (RDR) approach | 77%
António Branco and João Silva [2] | Very high accuracy rule-based nominal lemmatization with a minimal lexicon | Portuguese | Shallow processing, rule-based algorithm | 94%
Vaishali Gupta, Nisheeth Joshi and Iti Mathur [3] | Rule based Lemmatization | Urdu | Rules | 86.5%
Snigdha Paul, Mini Tandon, Nisheeth Joshi and Iti Mathur [4] | Design of a Rule-based Hindi Lemmatizer | Hindi | Automated lemmatizer using rules | 89.08%
Aduriz I., Alegria I., Arriola J.M., Artola X. [5] | Different issues in the design of a lemmatizer/tagger | Basque | Morphological disambiguation with structured four-level tagset | Under development
Grzegorz Chrupała [6] | Simple Data-Driven Context-Sensitive Lemmatization | Spanish, Dutch, French, etc. | Classification based on Shortest Edit Script (SES) | 60-88%
Eugenio Picchi [7] | Statistical Tools for Corpus Analysis | Italian | Statistical tools (PE-system) | 95%
Wolfgang Lezius [8] | A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer | German | Morphology module and tagger | 91%
Ezeiza N., Alegria I., Arriola J.M., Urizar R. [9] | Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages | Basque | Stochastic and rule-based disambiguation methods | 96%
Tarek El-Shishtawy and Fatma El-Ghannam [10] | An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes | Arabic | Rule-based lexicon and supervised learning | 89.15%
DETAILS OF THE APPROACHES MENTIONED IN THE
LEMMATIZATION TABLE:
[1]. A Rule based Approach to Word Lemmatization
The approach presented by them focuses on word endings: which word
suffix should be removed and/or added to get the normalized form. They
compared the results of two word lemmatization algorithms, one based on
if-then rules and the other based on ripple-down rules induction
algorithms. The work presents the problem of lemmatizing words from
Slovene free text and explains why the Ripple Down Rules (RDR)
approach is very well suited for the task. When learning from a corpus of
lemmatized Slovene words, the RDR approach results in easy-to-understand
rules of improved classification accuracy compared to the
results of rule learning achieved in previous work.
Five datasets were used for evaluation; they were obtained as
random samples of different sizes taken from a large hand-constructed
lexicon, MULTEXT-East. The whole lexicon contains about 20,000
different normalized words with their different forms listed for each of them,
resulting in about 500,000 different entries (potential learning examples).
[2]. Very high accuracy rule-based nominal lemmatization with a
minimal lexicon:
They described a shallow-processing, rule-based algorithm for nominal
lemmatization in Portuguese with minimal word lists.
They build upon morphological regularities found in word inflection and
use a set of transformation rules that "undo" the form changes due to
inflection. Thus, the basic rationale of the rule-based lemmatizer is to
gather a set of transformation rules that, depending on the termination of
a word, replace that termination by another, and to complement this set of
rules with a list of exceptions. To implement the algorithm, a list of 126
transformation rules was necessary; the list of exceptions to these rules
amounts to 9,614 entries, and prefix removal is done by resorting to a list
of 130 prefixes. The lemmatizer was evaluated over the 50,637 adjectives
and common nouns present in a 260,000-token corpus.
[3]. Rule Based Lemmatizer in Urdu
They proposed a rule-based system. They generated an affix list of
words, and based on this list, lemmas are produced. If the input word does
not match the affix list, the system returns the same word.
They checked their system on 2000 words; among these 2000 words,
1730 gave the correct lemma and 270 gave an incorrect lemma.
[4]. DESIGN OF A RULE BASED HINDI LEMMATIZER:
The lemmatizer they discussed mainly focuses on the time
complexity problem. It was built using a rule-based approach together
with a paradigm approach. In the rule-based approach, along with the
rules, a knowledge base is created for storing the grammatical features;
a knowledge base is also created for storing the exceptional root words.
Although knowledge-base creation requires a large amount of
memory, in terms of time it gives the best, most accurate and fastest
result, since very little time is needed to look up the input word in the
knowledge base. The system was evaluated on 2500 words; among
these 2500 words, 2227 were evaluated correctly and 273 were incorrect.
[5]. Different issues in the design of a Lemmatizer/tagger
They focus on the development of a general-purpose lemmatizer/tagger
for Basque. The basic tools used are:
• The Lexical Database for Basque (LDBB). At present it contains 60,000
entries, each with its associated linguistic features (category, subcategory,
case, number, etc.).
• A general morphological analyzer/generator.
The mechanism has two main components for treating unknown
words: 1) generic lemmas corresponding to each possible open category
or subcategory, and 2) two additional rules expressing the relationship
between the generic lemmas at the lexical level and any acceptable
lemma of Basque.
2.2 REVIEW ON STEMMING:

Author | Proposed Method | Language | Technique Used | Accuracy
Dinesh Kumar, Prince Rana [11] | Brute Force Technique | Punjabi | Suffix stripping | 80.73%
Suprabhat Das, Pabitra Mitra [12] | Method proposed by Porter | Bengali | Suffix stripping | 96.27%
Juhi Ameta, Nisheeth Joshi, Iti Mathur [13] | Longest Match | Gujarati | Rule based | 91.5%
Shahid Husain [14] | n-gram stripping model | Urdu | 1) Length based 2) Frequency based | 1) 84.27% 2) 79.63%
Braja Gopal Patra, Dipankar Das [15] | Rule Based Stemmer | Kokborok | 1) Minimum suffix stripping 2) Maximum suffix stripping | 1) 80.02% 2) 85.13%
Upendra Mishra, Chandra Prakash [16] | Brute Force Technique | Hindi | Suffix stripping | 91.6%
Shahid Husain [17] | n-gram stripping model | Marathi | 1) Length based 2) Frequency based | 1) 63.5% 2) 82.68%
Thangarsu, Manavalan [18] | Light Stemmer | Tamil | Light stemming | 98%
Vishal Gupta [19] | Suffix Stripping | Hindi | Rule based suffix stripping | 83.65%
Elaheh Rahimtoroghi, Hesham Faili, Azadeh Shakery [20] | Rule Based Stemmer | Persian | Structural approach and morphological rules | Precision increased by 4.83%
Ayu Purwarianti [21] | Nondeterministic Finite Automata | Indonesian | Suffix stripping | 81%
Mohamad Ababneh, Riyad Al-Shalabi [22] | Rule Based light stemmer | Arabic | Root extraction stemmer and light stemmer | 71%
Osama A. Ghanem, Wesam M. Ashour [23] | K-means Algorithm | Arabic | Clustering | Not mentioned
Sidikka Parlak, Murat Saraclar [24] | Length based | Turkish | Rule based | 80%
CHAPTER 3
APPROACHES FOR LEMMATIZATION
In this chapter we discuss five approaches used for lemmatization; these approaches are either rule-based or statistical in nature. The first is a string-matching, dictionary-based approach. The second is based on finite state automata. The third is an affix removal approach, and the fourth is a fixed-length truncation approach, mostly used for languages where the average word length is more than 7 characters, so that removing a fixed-size suffix can produce good results.
The fifth approach is the trie-based approach, which we have chosen for building our lemmatizer; it is also known as the tree approach and retrieves all possible lemmas of a given inflected word. This approach is discussed in detail in the next chapter.
The approaches are:
• Edit distance on dictionary: a combination of string matching and a model of the most frequent inflectional suffixes.
• Morphological analyzer: based on finite state automata.
• Affix lemmatizer: a combination of a rule-based and a supervised training approach.
• Fixed-length truncation.
• Trie data structure: allows retrieving the possible lemmas of a given inflected or derivational form.
3.1 LEVENSHTEIN DISTANCE DICTIONARY BASED
APPROACH:
The Levenshtein distance is a string metric for measuring the
difference between two sequences. Informally, the Levenshtein distance
between two words is the minimum number of single-character edits (i.e.
insertions, deletions or substitutions) required to change one word into
the other.
Searching for similar sequences of data is of great importance to many
applications, such as gene similarity determination, speech
recognition, database and/or Internet search engines,
handwriting recognition, spell-checkers and other biology, genomics
and text processing applications.
Therefore, algorithms that can efficiently manipulate sequences of
data (in terms of time and/or space) are highly desirable, even with
modest approximation guarantees.
The Levenshtein distance of two strings A and B is the minimum
number of character transformations required to convert string A into
string B.
Formally, the Levenshtein distance between two strings a and b is given by
lev_{a,b}(|a|, |b|), where:

    lev_{a,b}(i, j) = max(i, j)                          if min(i, j) = 0,
    lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                           lev_{a,b}(i, j-1) + 1,
                           lev_{a,b}(i-1, j-1) + 1_(a_i != b_j) )   otherwise,

where 1_(a_i != b_j) is the indicator function, equal to 0 when a_i = b_j and
equal to 1 otherwise. The first element in the minimum corresponds to
deletion (from a to b), the second to insertion, and the third to match or
mismatch, depending on whether the respective symbols are the same.
The edit distance algorithm is performed using three primitive
edit operations: the substitution of one character for another, the deletion
of a character, and the insertion of a character. Some approaches focus
on suffix phenomena only, but this approach deals with both suffixes
and prefixes, so it handles the full affixation phenomenon. Sometimes
suffixes are added to words based on grammatical rules: for the word
"going", this approach returns the headword "go", whereas the word
"went" has a discrete lemma entry in the dictionary. The idea is to find
all possible lemmas for the user's input word.
For each one of the target words, the similarity distance between
the source and the target word is calculated and stored. When this
process is completed, the algorithm returns the set of target words having
the minimum edit distance from the source word; that is, the algorithm
compares the user input against all the stored lemmas and retrieves the
words at minimum distance from it.
The algorithm also provides the option to select an approximation value
that the system considers an acceptable similarity distance: if the user
enters zero as the desired approximation, only the target words with the
minimum edit distance are returned, whereas if he/she enters e.g. 2, the
returned set contains all the target words within distance 2 of the minimum.
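The distance computation and the approximate dictionary lookup described above can be sketched as follows (a minimal sketch; the function names and the `approximation` parameter are our own):

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between strings a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]

def closest_lemmas(word, lemmas, approximation=0):
    """All lemmas within `approximation` of the minimum distance from `word`."""
    dists = {lem: levenshtein(word, lem) for lem in lemmas}
    best = min(dists.values())
    return [lem for lem, d in dists.items() if d <= best + approximation]
```

For instance, levenshtein("kitten", "sitting") is 3 (two substitutions and one insertion), and raising `approximation` above zero widens the returned set of candidate lemmas.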
3.2 MORPHOLOGICAL ANALYZER BASED
APPROACH
A morphological analyzer gives all possible analyses for a given word.
It is based on finite-state technology and produces the
morphological analysis of the word form as its output. This approach uses
finite state automata and two-level morphology to build a lexicon for
a language with an infinite vocabulary. Two-level rules are declarative
constraints that describe morphological alternations, such as the y -> ie
alternation in the plural of some English nouns (spy -> spies). The aim of this
approach is to convert the two-level rules into deterministic, minimized
finite-state transducers; the rule compiler defines the format of two-level
grammars, the rule formalism and the user interface, and it can assist the
user in the development of a two-level grammar.
A finite state transducer (FST) is a finite state
machine with two tapes: an input tape and an output tape. This contrasts
with an ordinary finite state automaton (or finite state acceptor),
which has a single tape. A transducer translates a word from one
form to another, reading from the input tape and writing to the output tape.
Formally, a finite state transducer is a 6-tuple (Q, Σ, Γ, I, F, δ) such that:
• Q is a finite set, the set of states;
• Σ is a finite set, called the input alphabet;
• Γ is a finite set, called the output alphabet;
• I is a subset of Q, the set of initial states;
• F is a subset of Q, the set of final states; and
• δ, a subset of Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q (where ε is the empty string), is the transition relation.
At each step the machine reads a symbol from the input tape, writes a
symbol to the output tape, and changes state; which transition is taken
depends on the current state and the input symbol.
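As an illustrative sketch (not the two-level rule compiler itself), the y -> ie alternation can be expressed as a tiny hand-written transducer whose output at each step depends on the current state and input symbol. The state names and the '#' end-of-word marker are our own assumptions, and the consonant condition of the real English rule is ignored:

```python
def pluralize(word):
    """Toy transducer: reads `word` plus an end marker '#', writes the plural,
    applying the word-final y -> ie alternation (spy -> spies)."""
    state, out = "q0", []
    for ch in word + "#":
        if state == "q0":
            if ch == "y":
                state = "saw_y"       # hold the y; its output depends on what follows
            elif ch == "#":
                out.append("s")       # plain plural: just append s
            else:
                out.append(ch)        # copy input symbol to output
        else:                         # state == "saw_y"
            if ch == "#":
                out.append("ies")     # the held y was word-final: y -> ie, plus s
            elif ch == "y":
                out.append("y")       # the held y was not final; hold this new one
            else:
                out.append("y" + ch)  # emit the held y, then the current symbol
                state = "q0"
    return "".join(out)
```

Here the delayed output for 'y' plays the role of an ε-output transition: the machine defers writing until the next input symbol resolves which rule applies.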
3.3 AFFIX LEMMATIZER
The most common approach to word normalization is to remove affixes
from a given word: a suffix or prefix is removed according to rules defined
from grammatical knowledge of the language. Simply removing a suffix or
prefix from a word cannot, by itself, give an accurate head word or root
word, and a purely rule-based approach cannot give accurate results; by
combining the rule-based approach with a statistical approach such as
supervised training, more accurate results can be obtained.
The supervised training algorithm generates a data structure consisting of
rules that the lemmatizer must traverse to arrive at the rule that is elected to
fire. After training, the data structure of rules is made permanent and can
be consulted by the lemmatizer. The lemmatizer must elect and fire rules in
the same way as the training algorithm, so that all words from the training
set are lemmatized correctly. It may, however, fail to produce the correct
lemmas for words that were not in the training set (the out-of-vocabulary,
or OOV, words). During training this approach uses prime and derived
rules: the prime rule is the least specific rule needed to lemmatize a
training word, while derived rules are more specific rules that can be
created by adding or removing characters.
For example, a rule can map "watcha" to "what are you", or "yer" to
"you are" rather than "your". This approach is more general than a
suffix-removal-only approach. The bulk of 'normal' training words must be
bigger for the affix-based lemmatizer than for a suffix lemmatizer,
because the algorithm generates immense numbers of candidate rules with
only marginal differences in accuracy, requiring many examples to find the
best candidate.
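A minimal sketch of rule election and firing (the rule list below is invented for illustration, not the trained rules from the cited work): rules are ordered from more specific to less specific, and the first rule whose suffix matches is the one elected to fire.

```python
# Hypothetical rewrite rules, ordered from more specific to less specific;
# each rule is (suffix to remove, replacement to add).
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def apply_first_matching_rule(word):
    """The first rule whose suffix matches is 'elected to fire'."""
    for remove, add in RULES:
        if word.endswith(remove):
            return word[: -len(remove)] + add
    return word  # no rule fired: the word is returned unchanged
```

The ordering matters: placing ("ies", "y") before ("s", "") makes 'flies' become 'fly' rather than 'flie', which is exactly the kind of specificity the prime/derived rule hierarchy encodes.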
3.4 FIXED LENGTH TRUNCATION
In this approach, we simply truncate the words, using the first n (e.g. 5
or 7) characters of each word as its lemma. Words with fewer than n
characters are used as lemmas with no truncation. This approach is most
appropriate for languages like Turkish, where the average word length is
7.07 letters.
This approach is used when time is the highest-priority issue. It is the
simplest approach and does not depend on any language or grammar, so it
can be applied to any language.
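The whole approach fits in one line (n = 7 here, following the Turkish average word length cited above):

```python
def truncation_lemma(word, n=7):
    """First n characters of the word; shorter words are returned unchanged."""
    return word[:n]
```

Python slicing already leaves words shorter than n untouched, so no length check is needed.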
CHAPTER 4
TRIE BASED APPROACH
4.1 INTRODUCTION
The trie data structure is one of the most important data storage mechanisms in programming. A tree is a natural way to represent essential utilities on a computer, like the directory structure in a file system, and many other objects can be stored in a tree data structure, resulting in space and/or time efficiency. For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory, we can use a tree structure to store the words efficiently instead of using a plain array or vector type that simply stores each word individually; the space needed to store the words in an array or vector is simply the number of words times the average length of the words.
A trie, also called a prefix tree, is a tree structure that stores words with a common prefix under the same sequence of edges, eliminating the need to store the same prefix once for each word. More formally, a trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike in a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.
EXAMPLE 1: Trie for a language consisting of the words a, abase, abate and bat:
In computer science, a radix tree (also PATRICIA trie, radix trie or compact prefix tree) is a space-optimized trie in which each node with only one child is merged with its parent. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes. A trie is a data structure which allows retrieving all possible lemmas: each node holds a single character, nodes are connected by edges, and a word is retrieved character by character. The approach also involves backtracking to get the appropriate result.
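The trie of Example 1 can be built with a minimal implementation like the following sketch. The `longest_prefix_lemma` helper is our own simplification of the lemma lookup, without the backtracking mentioned above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.is_word = False  # True if a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

    def longest_prefix_lemma(self, word):
        """Longest stored word that is a prefix of `word` (None if there is none)."""
        node, best = self.root, None
        for i, ch in enumerate(word):
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.is_word:
                best = word[: i + 1]
        return best
```

After inserting a, abase, abate and bat, contains("abate") is True while contains("aba") is False (no word ends there), and if "go" is also inserted, longest_prefix_lemma("going") returns "go".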
EXAMPLE 2:
Trie for a Hindi language consisting of words:
कमल, कमरा, कमर, कमरे, लड़, लड़का, लड़की:
Each word is stored starting from the root, character by character in Unicode order. To look up an input word, the search starts from the first node and traverses the tree up to the last character of the word. The traversal may need to backtrack by some levels.
4.2 APPLICATIONS OF TRIE:
Prefix trees are a somewhat overlooked data structure with many interesting possibilities. The trie is used mainly for manipulating words in a language and has a wide variety of applications:
• Spell checking and word completion
• Data compression
• Computational biology
• Routing tables for IP addresses
• Storing/querying XML documents, etc.
As a dictionary: Looking up whether a word is in a trie takes O(n) operations, where n is the length of the word; thus, for array implementations, the lookup speed does not change with increasing trie size. Tries have been used to store large dictionaries of (say) English words in spell-checking programs and in natural-language "understanding" programs. Simple spell checkers operate on individual words by comparing each of them against the contents of a dictionary, possibly performing stemming on the word first. If the word is not found it is considered an error, and an attempt may be made to suggest the word that was likely intended.
Word completion: Word completion is straightforward to implement using a trie: simply find the node corresponding to the first few letters, and then collapse the subtree below it into a list of possible endings. This can be used to auto-complete user input in text editors.
Tries and web search engines: The index of a search engine (the collection of all searchable words) is stored in a compressed trie. Each leaf of the trie is associated with a word and has a list of pages (URLs) containing that word, called its occurrence list. The trie is kept in internal memory, while the occurrence lists are kept in external memory and are ranked by relevance. Boolean queries for sets of words (e.g. "Java and coffee") correspond to set operations (e.g. intersection) on the occurrence lists.
Additional information retrieval techniques are used, such as:
- stop-word elimination (e.g. ignore "the", "a", "is");
- stemming (e.g. identify "add", "adding", "added" as the same word);
- link analysis (recognize authoritative pages).
Tries and internet routers: Computers on the internet (hosts) are identified by a unique 32-bit IP (Internet Protocol) address, usually written in dotted-quad-decimal notation; e.g. www.google.com is 62.233.189.104. An organization uses a subset of IP addresses with the same prefix; e.g. IIDT uses 10.*.*.*. Data is sent to a host by fragmenting it into packets, and each packet carries the IP address of its destination. A router forwards a packet to its neighbors using IP prefix-matching rules, and routers use tries to do this prefix matching.
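The word-completion application described above can be sketched with a minimal trie. This is an illustrative sketch, not the project's code; the class name Trie and its method names are assumptions:

```java
import java.util.*;

// Minimal trie supporting word completion: find the node reached by a
// prefix, then collect every stored word in that node's subtree.
public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isEnd; // true if a stored word ends at this node

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isEnd = true;
    }

    // Return all stored words that start with the given prefix.
    public List<String> complete(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList(); // no word has this prefix
        }
        List<String> results = new ArrayList<>();
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    private void collect(Trie node, StringBuilder path, List<String> results) {
        if (node.isEnd) results.add(path.toString());
        for (Map.Entry<Character, Trie> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, results);
            path.setLength(path.length() - 1); // backtrack after the recursive call
        }
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        for (String w : new String[]{"a", "abase", "abate", "bat"}) t.insert(w);
        System.out.println(t.complete("ab")); // abase and abate, in some order
    }
}
```

Collecting the subtree below the prefix node is exactly the "collapse the subtree into a list of possible endings" step described above.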
4.3 PRE-PROCESSING STEPS
Before applying the lemmatization algorithm we need to normalize the contents of the input file. This includes the following 2 steps:
1) Tokenization:
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: [Friends] [Romans] [Countrymen] [lend] [me] [your] [ears]
These tokens are often loosely referred to as terms or words.
2) Stop-words Removal:
In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.
Common stop words for the English language:
a, about, above, after, again, against, is, I, am, all, an, but, be, being, and, are, can't
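A minimal Java sketch of these two pre-processing steps. The class name PreProcess is illustrative, and the stop-word list here is a small subset of the one the project actually uses:

```java
import java.util.*;

// Pre-processing sketch: split a line into tokens (dropping punctuation),
// then filter out common stop words.
public class PreProcess {
    // Small illustrative stop-word list; the project uses a larger one.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "is", "me", "your"));

    // Tokenization: chop the input into pieces at any run of characters
    // that is neither a letter nor a combining mark (so Devanagari matras
    // stay attached to their words).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{M}]+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Stop-word removal: keep only tokens that are not stop words.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;");
        System.out.println(tokens);                  // the seven tokens
        System.out.println(removeStopWords(tokens)); // "me" and "your" removed
    }
}
```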
4.4 BASIC OPERATIONS IN TRIE:
4.4.1 SEARCHING IN A TRIE
To search for a key K in a trie T, we begin at the root, which is a branch node. Suppose the key K is made up of characters k1 k2 k3 ... kn. The first character k1 is extracted and the pChildren field corresponding to k1 in the root branch node is inspected. If T->pChildren[k1-'a'] is equal to NULL, the search is unsuccessful, since no such key exists. If T->pChildren[k1-'a'] is not NULL, then that field points either to an information node or to a branch node. If the information node holds K, the search is done: the key K has been successfully retrieved. Otherwise, it implies the presence of key(s) with a similar prefix. We extract the next character k2 of key K and move down the link field corresponding to k2 in the branch node encountered at level 2, and so on, until the key is found in an information node or the search is unsuccessful. The deeper the search goes, the more keys there are with similar but longer prefixes.
Algorithm for searching for a word in the Trie:
1. Set current node to root node. Set the current letter to the first letter in
the word.
2. If the current node is null then the word does not exist in the Trie.
3. If the current node has a reference to a valid node containing the current
letter, then set the current node to that referenced node and advance the
current letter to the next letter in the word.
4. Repeat steps 2 and 3 until all letters in the word have been processed.
5. Two possibilities indicate that the word is not in the trie:
a) the current letter is the last letter and there is no valid node containing
this letter, or
b) there is a valid node containing the last letter, but the node does not
indicate that it completes a full word (i.e. the Boolean field isEnd = false).
6. If the conditions in step 5 are not met, then we have a match for
the word in the Trie (i.e. when isEnd = true).
4.4.2 INSERTION IN A TRIE:
To insert a key K into a trie, we begin as we would for a search, following the appropriate pChildren fields of the branch nodes corresponding to the characters of the key. At the point where the pChildren field of a branch node leads to NULL, the remaining characters of the key K are inserted. Algorithm for inserting a word into a Trie:
1. Set the current node to the root node (whose value is null).
2. Set the current letter to the first letter in the word.
3. If the current node already has an existing reference to the current letter
then set current node to that referenced node; else create a new node, set
the letter to current letter, and set current node to this new node, set the
value of isEnd to false.
4. Repeat step 3 until all letters in the current word have been processed.
Set isEnd = true in the final node when the process ends.
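The search and insertion steps above can be sketched together in Java, using a hash map of child nodes and a Boolean isEnd flag. This is an illustrative sketch rather than the project's actual code; the class name TrieNode is an assumption:

```java
import java.util.*;

// Trie with the two basic operations: insert a word, then search for it.
// Each node keeps a map of child nodes and a flag marking word endings.
public class TrieNode {
    private final Map<Character, TrieNode> children = new HashMap<>();
    private boolean isEnd = false; // true if a complete word ends here

    // Insertion: follow existing references, creating nodes where missing,
    // and mark the last node as the end of a word.
    public void insert(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isEnd = true;
    }

    // Search: follow the references letter by letter; the word exists only
    // if every letter is found AND the final node has isEnd == true.
    public boolean search(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false; // path breaks: word absent
        }
        return node.isEnd; // reached last letter: present only if marked
    }

    public static void main(String[] args) {
        TrieNode root = new TrieNode();
        root.insert("abase");
        root.insert("bat");
        System.out.println(root.search("abase")); // true
        System.out.println(root.search("aba"));   // false: prefix, not a word
        System.out.println(root.search("cat"));   // false: no path
    }
}
```

Note how the two failure possibilities of step 5 in the search algorithm appear here as the null check and the final isEnd test.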
4.5 ALGORITHM FOR LEMMATIZATION:
The algorithm requires a file containing the list of the root words (lemmas) of the language concerned. At first, we create a trie structure from these dictionary root words. Each node in the trie corresponds to a Unicode character of the language concerned, and the nodes that end with the final character of any root word are marked as final nodes; the remaining nodes are non-final. To find the lemma of a surface word, the trie is navigated starting from the initial node, and navigation ends either when the word is completely found in the trie or when, after some portion of the word, there is no path left in the trie to navigate.
The key idea is that a trie is created out of the vocabulary (root words) of the language.
The lemmatizing process consists in navigating the trie, trying to find a match between the input word and an entry in the trie.
An algorithm is then applied to find the appropriate lemma for the input word.
EXPLANATION:
The algorithm requires a list of root words of the language concerned. We
are storing the root words in a file.
Step 1: At first, we create a trie structure using the list of root words.
A trie node consists of the fields:
1) value, which corresponds to a Unicode character of the language concerned;
2) isEnd, a Boolean field which is set to true if the node holds the final
character of any root word, and false otherwise;
3) children, a hash map which maps the current trie node to its child nodes.
The Insert algorithm explained above is used to insert the root words in a
trie.
Step2: Navigating through the trie to find the matching prefix
The Search algorithm explained above is applied here with some
modifications to find the lemma of the surface word.
To find the lemma of a surface word, the trie is navigated starting from
the initial node in the trie and navigation ends when either the word is
completely found in the trie or after some portion of the word there is no
path present in the trie to navigate.
While navigating, some situations may occur, depending on which we
take decision to determine the lemma. Those situations are described
below.
CASE 1:
The surface word is a root word. In that case, the surface word itself is
the lemma.
Example:
Stored word: abbreviate
Input: abbreviate
Matched string: abbreviate
Output: abbreviate
CASE 2:
The surface word is not a root word. In that case, the trie is navigated up to the node where the surface word completely ends or where there is no path left to navigate in the trie. We call this node the end node.
Again two different cases may occur.
CASE 2.1:
If one or more root words are found in the path from the initial node to
the end node, i.e. if one or more final nodes are present in the path,
then pick the final node which is closest to the end node.
Example:
Stored words: a, an, and
Input: ands
Matched prefixes: a, an, and
Output: and
The word represented by the path from the initial node to the picked final node is considered the lemma.
CASE 2.2
If no root word is found in the path from the initial node to the end node,
then find the final node in the trie which is closest to the end node.
Example:
Stored word: abbreviate
Input: abbreviating
Matched string: abbreviat
Output: abbreviate
If more than one final node is found at the closest distance, then pick all of them. Now generate the root word(s) represented by the path(s) from the initial node to those picked final node(s).
Output: The list of matched lemmas is returned.
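The case analysis above can be sketched as a single trie walk that remembers the deepest final node on the path (cases 1 and 2.1), and otherwise searches below the stopping point for the nearest final node(s) (case 2.2). This is a minimal sketch under those assumptions; the class and method names are illustrative, not the project's:

```java
import java.util.*;

// Lemmatizer sketch: walk the trie along the surface word; if a root word
// ends on the path, return the deepest one (cases 1 and 2.1); otherwise do
// a breadth-first search below the end node for the nearest root-word
// ending(s) (case 2.2).
public class Lemmatizer {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isEnd = false;
    }

    private final Node root = new Node();

    public void addRootWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isEnd = true;
    }

    public List<String> lemmatize(String surface) {
        Node node = root;
        String matchedRoot = null; // deepest root word seen on the path
        int depth = 0;
        for (char c : surface.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) break; // no path left to navigate: stop here
            node = next;
            depth++;
            if (node.isEnd) matchedRoot = surface.substring(0, depth);
        }
        if (matchedRoot != null) return Collections.singletonList(matchedRoot);
        // Case 2.2: level-by-level search below the end node; all final
        // nodes at the closest distance are returned.
        List<String> lemmas = new ArrayList<>();
        Queue<Map.Entry<Node, String>> queue = new ArrayDeque<>();
        queue.add(new AbstractMap.SimpleEntry<>(node, surface.substring(0, depth)));
        while (!queue.isEmpty() && lemmas.isEmpty()) {
            int levelSize = queue.size();
            for (int i = 0; i < levelSize; i++) {
                Map.Entry<Node, String> cur = queue.poll();
                if (cur.getKey().isEnd) lemmas.add(cur.getValue());
                for (Map.Entry<Character, Node> e : cur.getKey().children.entrySet()) {
                    queue.add(new AbstractMap.SimpleEntry<>(e.getValue(), cur.getValue() + e.getKey()));
                }
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        Lemmatizer lem = new Lemmatizer();
        for (String w : new String[]{"a", "an", "and"}) lem.addRootWord(w);
        System.out.println(lem.lemmatize("ands"));          // [and] (case 2.1)
        Lemmatizer lem2 = new Lemmatizer();
        lem2.addRootWord("abbreviate");
        System.out.println(lem2.lemmatize("abbreviate"));   // [abbreviate] (case 1)
        System.out.println(lem2.lemmatize("abbreviating")); // [abbreviate] (case 2.2)
    }
}
```

Because nodes hold arbitrary Unicode characters in the hash map, the same sketch works for Hindi root words as well as English ones.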
Hindi Language: (Tokenization and stop-words removal)
4.6 FLOWCHART OF ALGORITHM:
The above discussed algorithm can be depicted with the help of the
following flowchart, which explains how the trie can be used to find
common prefixes:
4.7 DRAWBACKS OF ALGORITHM:
The following are some of the drawbacks of the lemmatizing algorithm used:
1) Compound words and out-of-vocabulary words are not handled by our algorithm.
2) Root words are taken from a dictionary, so if the coverage of the dictionary is not good, accuracy will degrade.
CHAPTER 5
IMPLEMENTATION & RESULT ANALYSIS
Appendix A: Snapshot
ENGLISH LANGUAGE
a) Input: abases; Output: abase
b) Input: EnglishInput.txt (file); Output: myFile.txt
c) Input: EnglishTry.txt; Tokenized file: Tokenized.txt; Output: MyFile.txt
HINDI LANGUAGE
a) Input: लड़कियां; Output: लड़की
b) Input: HindiInput.txt; Output: myFile.txt
RESULT ANALYSIS:
For evaluation of results we have performed the following tests:
English Language:
Test1:
A file containing 14,730 lemmas was used to build the trie data structure, and another file (the input file) containing 25,803 inflected words was used to perform the testing.
It is found that out of these 25,803 words, our lemmatizer gives correct results for 24,513 words. We have used the following formula to calculate the accuracy of the lemmatizer:
Accuracy = (No. of words correctly lemmatized / Total no. of words) x 100
Thus for this test, Accuracy = (24,513 / 25,803) x 100 = 95% (approximately).
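The accuracy figures can be checked directly from the formula (the class name Accuracy is illustrative; the counts are the ones reported in these tests):

```java
// Accuracy = (correctly lemmatized / total) x 100, as defined above.
public class Accuracy {
    static double accuracy(int correct, int total) {
        return correct * 100.0 / total;
    }

    public static void main(String[] args) {
        System.out.printf("English: %.0f%%%n", accuracy(24513, 25803)); // 95%
        System.out.printf("Hindi:   %.0f%%%n", accuracy(207, 220));     // 94%
    }
}
```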
Test2:
A file (the input file) containing 7 sentences of approximately 4 words each. For this we have taken a file containing the related root words to build the trie data structure. It is found that our lemmatizer correctly tokenizes and removes the stop words from the input file, and then also correctly lemmatizes all the words (Accuracy = 100%). (Refer to the snapshots for details.)
Hindi Language:
Test1:
A file containing 1000 Hindi lemmas to build the Trie data structure and another file (input file) containing 220 inflected words to perform the testing.
It is found that out of these 220 words our lemmatizer is able to give correct results for 207 words.
Thus for this test, Accuracy= (207/220)*100=94%
Test2:
A file (the input file) containing 10 sentences of approximately 5 words each. For this we have taken a file containing the related root words to build the trie data structure. It is found that our lemmatizer correctly tokenizes and removes the stop words from the input file, and then lemmatizes almost all the words correctly (Accuracy = 97%).
Table:
Sr. No. | Language | No. of words taken | Correctly lemmatized words | Accuracy
1       | ENGLISH  | 25,803             | 24,513                     | 95%
2       | HINDI    | 220                | 207                        | 94%
Appendix B: Development Platform
Software Requirements for implementing the system:
Operating System: Windows 7
Platform Used: Java NetBeans IDE 7.3.1
Hardware requirements for developing and implementing the system:
A Pentium-based laptop with a minimum of:
I. 1GB RAM
II. 320GB Hard Disk Space
III. Intel Pentium inside Processor
CHAPTER 7
CONCLUSION
In this project work, we investigated many existing techniques and selected a trie-based approach for building our lemmatizer.
We tested our lemmatizer for the English and Hindi languages and found that it gives good results, but in many cases it fails to lemmatize correctly because of out-of-vocabulary words, compound words, and kinds of inflection that are specific to particular languages.
Finally, we can conclude that our lemmatizer is language independent and can thus be used for any language, but it requires a correct list of all the root words of that language to build the trie.
CHAPTER 8
FUTURE WORK
With the present approach, one can further work on the following future aspects:
1) Other data structures, such as a compressed trie, can be used to improve the results.
2) Higher accuracy can be achieved by providing more user interaction.
3) Handling of compound words and out-of-vocabulary words can be added to the algorithm.
4) If the root word is not in the dictionary, there should be some way to still provide a result.
5) Backtracking can be implemented in the algorithm for better search results.
REFERENCES
[1]. https://www.google.com/search?sclient=psy-ab&btnG=Search&q=lemmatization+articles#q=A+Rule+based+Approach+to++Word+Lemmatization+by+joel+pilson
[2]. http://www.apl.org.pt/docs/22-textos-seleccionados/12-Branco_Silva.pdf
[3]. http://www.arxiv.org/pdf/1310.0581
[4]. http://www.airccj.org/CSCP/vol3/csit3408.pdf
[5]. http://arxiv.org/abs/cmp-lg/9503020
[6]. http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149
[7]. http://www.euralex.org/elx_proceedings/Euralex1994/56_Euralex_Eugenio%20Picchi%20%20Statistical%20Tools%20for%20Corpus%20Analysis%20%20A%20Tagger%20and%20Lemmatizer%20for%20I.pdf
[8]. http://dl.acm.org/citation.cfm?id=980692
[9]. http://dl.acm.org/citation.cfm?id=980910
[10]. http://arxiv.org/abs/1203.3584
[11]. Dalwadi Bijal, Suthar Sanket, "Overview of Stemming Algorithms for Indian and Non-Indian Languages", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 5 (2), PP. 1144-1146, 2014.
[12]. Vishal Gupta, "Hindi Rule Based Stemmer for Nouns", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 1, January 2014.
[13]. M. Thangarasu, R. Manavalan, "Design and Development of Stemmer for Tamil Language: Cluster Analysis", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 7, July 2013.
[14]. Siddika Parlak, Murat Saraclar, "Performance Analysis and
Improvement of Turkish Broadcast News Retrieval", IEEE
Transactions on Audio, Speech, and Language Processing, Vol. 20,
No. 3, PP. 731-740, March 2012.
[15] Upendra Mishra, Chandra Prakash, “MAULIK: An Effective
Stemmer for Hindi Language”, International Journal on Computer
Science and Engineering (IJCSE) Vol. 4 No. 5, PP.711-717, May
2012.
[16] Ms. Anjali Ganesh Jivani, “A Comparative Study of Stemming
Algorithms”, International Journal of Computer Technology and
Applications, Vol.2 (6), PP 1930-1938, NOV-DEC 2011.
[17] Mohamad Ababneh, Riyad Al-Shalabi, Ghassan Kanaan, Alaa
Al-Nobani, “Building an Effective Rule-Based Light Stemmer for
Arabic Language to Improve Search Effectiveness”, The International
Arab Journal of Information Technology, Vol. 9, No. 4, PP.368-372, July
2012.
[18] Suprabhat Das, Pabitra Mitra, “A Rule-based Approach of
Stemming for Inflectional and Derivational Words in Bengali”,
Proceeding of the IEEE Students' Technology Symposium, PP.14-16,
January, 2011.
[19] Mohd. Shahid Husain, “AN UNSUPERVISED APPROACH TO
DEVELOP STEMMER”, International Journal on Natural Language
Computing (IJNLC) Vol. 1, No.2, August 2012.
[20] M. Thangarasu, R. Manavalan, "A Literature Review: Stemming
Algorithms for Indian Languages”, International Journal of Computer
Trends and Technology (IJCTT), volume 4 Issue 8, August 2013.
[21] Vimala Balakrishnan, Ethel Lloyd-Yemoh, “Stemming and
Lemmatization: A Comparison of Retrieval Performances”, Lecture
Notes on Software Engineering, Vol. 2, No. 3, August 2014.
[22] M. Nithya, “Clustering Technique with Potter stemmer and
Hyper graph Algorithms for Multi-featured Query Processing”,
International Journal of Modern Engineering Research (IJMER), Vol.2,
Issue.3, pp-960-965, May-June 2012.
[23] Dhamodharan Rajalingam, “A Rule Based Iterative Affix Stripping
Stemming Algorithm for Tamil”, vol 132, PP-583-590, 2012
[24] www.ijrat.org/downloads/icatest2015/ICATEST
[25]https://www.cse.iitb.ac.in/~pb/papers/gwc14-multilingual-stemmer
[26].https://hbfs.wordpress.com/2012/07/10/stemming
[27].https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=624693
[28].String Processing and Information Retrieval: Volume 9
[29].Ljiljana Dolamic and Jacques Savoy. 2010. Comparative Study of Indexing and Search Strategies for the Hindi, Marathi and Bengali Languages.