
A
PROJECT REPORT
ON
LEMMATIZATION USING TRIE DATA STRUCTURE

A Project Work (MS-405)

Submitted to Assam University, Silchar in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science

UNDER THE GUIDANCE OF
Mrs. SUNITA SARKAR
Department of Computer Science
Assam University, Silchar

SUBMITTED BY
KARISHMA TAPARIA
Semester: 4th, M.Sc. (2 Yrs)
Sub Code: MSc 405(C)
Exam Roll: 101714, No.: 22220415

ABSTRACT:

Lemmatization is used to normalize the inflectional forms of a word to its root word, so it can be used as a preprocessing step in any natural language processing application. Lemmatization is a very important approach for the information retrieval process. It is used to reduce the different inflectional forms, as well as derivational forms, of a word to its root or head word, which is called its 'lemma'. A 'lemma' is simply the "dictionary form" of a word. In lemmatization, different grammatical forms of a word can be analyzed as a single word.

However, the problem is that for many languages (mainly Indian), lemmatizers do not exist, and this problem is not easy to solve, since rule-based lemmatizers take time and require highly skilled linguists. Statistical stemmers, on the other hand, do not return legitimate lemmas. Our goal is to implement a language-independent lemmatizer which uses a trie data structure to store the root words and tries to find a potential lemma of a surface word by efficiently searching in the trie. A trie allows retrieving the possible lemma of a given inflected or derivational form.

DECLARATION

I, Karishma Taparia, student of 4th semester (M.Sc. 2 years), Department of Computer Science, do hereby solemnly declare that I have duly worked on my project entitled "Lemmatization using Trie Data Structure" under the supervision of Mrs. Sunita Sarkar, Assistant Professor, Department of Computer Science, Assam University, Silchar. This project work is submitted in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science. The results embodied in this thesis have not been submitted to any other university or institution for the award of any degree or diploma.

Date: (Karishma Taparia)
Place: Assam University, Silchar
Semester: 4th sem (M.Sc. 2 years)
Roll No.: 22220415
Assam University, Silchar

CERTIFICATE

This is to certify that Miss Karishma Taparia, student of 4th semester (M.Sc. 2 years), Department of Computer Science, Assam University, Silchar, bearing the Roll No.: 22220415, has carried out her project work entitled "Lemmatization using Trie Data Structure" under my guidance in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science during the period of January 2016 to May 2016. The project is the result of her investigation, and neither the report as a whole, nor any part of it, has been submitted to any other university or institution for any degree or diploma.

Date: Mrs. SUNITA SARKAR
Place: Assistant Professor,
Department of Computer Science,
Assam University, Silchar.

CERTIFICATE

This is to certify that Miss Karishma Taparia, student of 4th semester (M.Sc. 2 years), Department of Computer Science, Assam University, Silchar, bearing the Roll No.: 22220415, has carried out her project work entitled "Lemmatization using Trie Data Structure" under the guidance of Mrs. Sunita Sarkar, Assistant Professor, Department of Computer Science, Assam University, Silchar, in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science during the period of January 2016 to May 2016. The project is the result of her investigation, and neither the report as a whole, nor any part of it, has been submitted to any other university or institution for any degree or diploma.

Date: Dr. BIPUL SHYAM PURKAYASTHA
Place: Head of Department,
Department of Computer Science,
Assam University, Silchar.

ACKNOWLEDGEMENT:

At the very outset, I take the privilege to convey my gratitude to those persons whose co-operation, suggestions and heartfelt support helped me to accomplish the project successfully.

I take immense pleasure in expressing my sincere thanks and profound gratitude to my respected guide, Mrs. SUNITA SARKAR, for her continuous support, patience, motivation, enthusiasm, immense knowledge, and for providing timely support and suitable suggestions.

I want to thank my parents for their affection and their help in managing my life in busy times. Without them, it would have been very difficult to focus on my project.

My special thanks go to those who directly or indirectly extended their helping hands in making the project a grand success.

Date: 27th April 2016 KARISHMA TAPARIA
Department of Computer Science,
Assam University, Silchar.

TABLE OF CONTENTS

1. Introduction
1.1. Stemming vs Lemmatization
1.2. Motivation
1.3. Objective
1.4. Problem Statement
1.5. Aim of Lemmatization
2. Related Work and Background
2.1. Review on Lemmatization
2.2. Review on Stemming
3. Approaches for Lemmatization
3.1. Levenshtein Distance Dictionary Based Approach
3.2. Morphological Analyzer Based Approach
3.3. Affix Lemmatizer
3.4. Fixed Length Truncation
4. Trie Based Approach
4.1. Introduction
4.2. Applications of Trie
4.3. Pre-processing Steps
4.3.1. Tokenization
4.3.2. Stop-words Removal
4.4. Basic Operations in Trie
4.4.1. Searching in a Trie
4.4.2. Insertion in a Trie
4.5. Algorithm for Lemmatization
4.6. Flowchart
4.7. Drawback of Algorithm
5. Implementation and Result Analysis
5.1. Implementation
5.2. Result Analysis
6. Conclusion
7. Future Work
8. References

CHAPTER 1

1. INTRODUCTION:

Language is an important tool for communication, and natural language processing (NLP) is concerned with the interaction between human languages and computers. NLP involves enabling computers to derive meaning from human or natural language input. Natural language processing is a very active research topic nowadays, as it is used in most linguistic activities.

In lexical knowledge bases like dictionaries, WordNet, etc., the entries are usually root words with their morphological and semantic descriptions. Therefore, when a surface word is encountered in a raw text, its meaning cannot be obtained unless and until its appropriate root word is determined through lemmatization. Thus, lemmatization is a basic need for any kind of semantic processing of languages.

"Lemmatization" refers to normalizing different inflectional forms, as well as derivational forms, of a word to its head word. This task can be used as a pre-processing step for many natural language processing applications (e.g. morphological analyzers, electronic dictionaries, spell-checkers, stemmers, etc.). It may also be useful as a generic keyword generator for search engines and for other data mining, clustering and classification tools.

[Figure: The lemmatization process maps morphological variants of a word (preparing, prepared, preparation, prepare) to the common root word (prepare).]

1.1 STEMMING VS LEMMATIZATION:

Normalization is a very important task in any natural language processing application. Stemming or lemmatization is used as a normalization technique to reduce different grammatical forms of a word to its head word by applying a set of rules.

Stemming is the process of reducing different inflectional forms to a stem by applying different sets of rules. The aim of stemming is just to reduce a word to its stem without bothering about its part of speech (POS). It is used in most text mining applications where the aim is just to reduce the form of a word without worrying about its occurrence in the given context. So it is used to convert the different inflectional forms of a word to its stem. The result of stemming is called a stem; it is not always a dictionary word.

In linguistics, a lemma (from the Greek noun "lemma", "headword") is the "dictionary" or "canonical" form of a set of words. More specifically, a lemma is the canonical form of a lexeme, where lexeme refers to the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen as the base form to represent the lexeme. Lemmatization is the most frequently used normalization technique in information retrieval applications like indexing and searching.

For example: the collection of words 'produce', 'produced', 'producing' and 'production' are stemmed to 'produc', whereas the lemmatizer will return the word 'produce'.

1.2 MOTIVATION

Natural language processing is a very active research topic nowadays, as it is used in most linguistic activities. Stemming or lemmatization is used as a normalization technique to reduce different grammatical forms of a word to its head word. Lemmatization can be used as a pre-processing step in information retrieval applications.

However, the problem is that for many languages, lemmatizers do not exist, and this problem is not easy to solve, since rule-based lemmatizers take time and require highly skilled linguists. Statistical stemmers, on the other hand, do not return legitimate lemmas.

1.3 OBJECTIVE

To develop a lemmatizer using the trie data structure.

1.4 PROBLEM STATEMENT

The key idea is that a trie is created using a file containing the list of lemmas (root words) of the language concerned. The lemmatizing process consists in navigating the trie, trying to find a match between the input word and an entry in the trie. An algorithm is applied to find the appropriate lemma for the input word.

1.5 AIM OF LEMMATIZATION:

Lemmatization aims to remove inflectional endings only and to return the dictionary form of a word, and it may make use of a vocabulary and/or morphological analysis of words. Therefore, lemmatizers require much more knowledge about the language than stemmers, and unlike stemmers they do not rely on language-specific stripping rules.

Lemmatization is closely related to stemming; however, stemming operates only on a single word at a time. Lemmatization, instead, may operate on the full text and can therefore discriminate between words that have different meanings depending on part of speech. On the other hand, stemmers are typically easier to implement and run faster. Lemmatizers play a significant role in IR, and the ability to lemmatize words efficiently and effectively is thus important.

CHAPTER 2

RELATED WORK AND BACKGROUND

INTRODUCTION:

Lovins described the first stemmer (Lovins, J.B., 1968), which was developed specifically for IR/NLP applications. Her approach consisted of the use of a manually developed list of 294 suffixes, each linked to 29 conditions, plus 35 transformation rules. For an input word, the suffix with an appropriate condition is checked and removed. Porter developed the Porter stemming algorithm (Porter, 1980), which became the most widely used stemming algorithm for the English language. These stemmers were later described in a very high-level language known as Snowball.

A number of statistical approaches have been developed for stemming. Notable works include Goldsmith's unsupervised algorithm for learning the morphology of a language based on the Minimum Description Length (MDL) framework (Goldsmith, 2001, 2006). Creutz uses a probabilistic maximum a posteriori (MAP) formulation for unsupervised morpheme segmentation (Creutz, 2005, 2007).

A few approaches are based on the application of Hidden Markov Models (Massimo et al., 2003). In this technique, each word is considered to be composed of two parts, "prefix" and "suffix". Here, HMM states are divided into two disjoint sets: prefix states, which generate the first part of the word, and suffix states, which generate the last part of the word, if the word has a suffix. After a complete and trained HMM is available for a language, stemming can be performed directly.

Plisson proposed the most accepted rule-based approach for lemmatization (Plisson et al., 2008). It is based on word endings, where suffixes are removed or added to get the normalized word form. In another work, a method to automatically develop lemmatization rules to generate the lemma from the full form of a word was discussed (Jongejan et al., 2009). The lemmatizer was trained on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs respectively.

Kimmo (Karttunen et al., 1983) is a two-level morphological analyzer containing a large set of morphophonemic rules. The work started in 1980, and the first implementation, in LISP, was available 3 years later.

Tarek El-Shishtawy proposed the first non-statistical Arabic lemmatizer algorithm (Tarek et al., 2012). He makes use of different Arabic language knowledge resources to generate the accurate lemma form and its relevant features that support IR purposes, and a maximum accuracy of 94.8% is reported. OMA is a Turkish morphological analyzer which gives all possible analyses for a given word with the help of finite state technology. Two-level morphology is used to build the lexicon for a language (Okan et al., 2012).

Grzegorz Chrupala (Chrupala et al., 2006) presented a simple data-driven, context-sensitive approach to lemmatizing word forms. A Shortest Edit Script (SES) between the reversed input and output strings is computed to achieve this task. An SES describes the transformations that have to be applied to the input string (word form) in order to convert it to the output string (lemma).

As for lemmatizers for Indian languages, the earliest work, by Ramanathan and Rao (2003), used a manually sorted suffix list and performed longest-match stripping for building a Hindi stemmer. Majumdar et al. (2007) developed YASS: Yet Another Suffix Stripper. Here conflation was viewed as a clustering problem with an a priori unknown number of clusters. They suggested several distance measures rewarding long matching prefixes and penalizing early mismatches.

In a recent work related to affix-stacking languages like Marathi (Dabre et al., 2012), a Finite State Machine (FSM) is used to develop a Marathi morphological analyzer. In another approach, a Hindi lemmatizer is proposed, where suffixes are stripped according to various rules and the necessary addition of character(s) is done to get a proper root form (Paul et al., 2013). GRALE is a graph-based lemmatizer for Bengali comprising two steps (Loponen et al., 2013). In the first step, it extracts the set of frequent suffixes, and in the second step, a human manually identifies the case suffixes. Words are considered as nodes, and an edge from node u to node v exists only if v can be generated from u by the addition of a suffix.

Unlike the above-mentioned rule-based and statistical approaches, our lemmatizer uses the properties of a "trie" data structure, which allows retrieving the possible lemma of a given inflected word.

2.1 REVIEW ON LEMMATIZATION:

Author | Title | Language | Technique Used | Accuracy
Joël Plisson [1] | A Rule based Approach to Word Lemmatization | Slovene | Ripple Down Rules (RDR) approach | 77%
António Branco and João Silva [2] | Very high accuracy rule-based nominal lemmatization with a minimal lexicon | Portuguese | Shallow processing, rule-based algorithm | 94%
Vaishali Gupta, Nisheeth Joshi and Iti Mathur [3] | Rule based Lemmatization | Urdu | Rules | 86.5%
Snigdha Paul, Mini Tandon, Nisheeth Joshi and Iti Mathur [4] | Design of a Rule-based Hindi Lemmatizer | Hindi | Automated lemmatizer using rules | 89.08%
Aduriz I., Alegria I., Arriola J.M., Artola X. [5] | Different issues in the design of a lemmatizer/tagger | Basque | Morphological disambiguation with a structured four-level tagset | Under development
Grzegorz Chrupała [6] | Simple Data-Driven Context-Sensitive Lemmatization | Spanish, Dutch, French, etc. | Classification based on Shortest Edit Script (SES) | 60-88%
Eugenio Picchi [7] | Statistical Tools for Corpus Analysis | Italian | Statistical tools (PE-system) | 95%
Wolfgang Lezius [8] | A Freely Available Morphological Analyzer, Disambiguator and Context Sensitive Lemmatizer | German | Morphology module and tagger | 91%
Ezeiza N., Alegria I., Arriola J.M., Urizar R. [9] | Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages | Basque | Stochastic and rule-based disambiguation methods | 96%
Tarek El-Shishtawy and Fatma El-Ghannam [10] | An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes | Arabic | Rule based lexicon and supervised learning | 89.15%

DETAILS OF THE APPROACHES MENTIONED IN THE LEMMATIZATION TABLE:

[1]. A Rule based Approach to Word Lemmatization

The approach presented by them focuses on word endings: what word suffix should be removed and/or added to get the normalized form. They compared the results of two word lemmatization algorithms, one based on if-then rules and the other based on ripple down rules induction algorithms. The work presents the problem of lemmatization of words from Slovene free text and explains why the Ripple Down Rules (RDR) approach is very well suited for the task. When learning from a corpus of lemmatized Slovene words, the RDR approach results in easy-to-understand rules of improved classification accuracy compared to the results of rule learning achieved in previous work.

Five datasets were used for evaluation purposes; they were obtained as random samples of different sizes taken from a large hand-constructed lexicon, MULTEXT-East. The whole lexicon contains about 20,000 different normalized words, with the different forms listed for each of them, resulting in about 500,000 different entries (potential learning examples).

[2]. Very high accuracy rule-based nominal lemmatization with a minimal lexicon:

They described a shallow processing, rule-based algorithm for nominal lemmatization in Portuguese with minimal word lists. They build upon morphological regularities found in word inflection and use a set of transformation rules that "undo" the form changes due to inflection. Thus, the basic rationale for the rule-based lemmatizer described is to gather a set of transformation rules that, depending on the termination of a word, replace that termination by another, and to complement this set of rules with a list of exceptions. In order to implement the algorithm, a list of 126 transformation rules was necessary. The list of exceptions to these rules amounts to 9,614 entries. Prefix removal is done resorting to a list of 130 prefixes. The lemmatizer was evaluated over the 50,637 adjectives and common nouns present in a 260,000-token corpus.

[3]. Rule Based Lemmatizer in Urdu

They proposed a rule-based system. They generated an affix list of the words, and based on this list, lemmas are produced. If the input word does not match the affix list, then the system displays the same word. They checked their system on 2000 words. Among these 2000 words, 1730 gave the correct lemma and 270 gave an incorrect lemma.

[4]. Design of a Rule Based Hindi Lemmatizer:

The lemmatizer that they discussed mainly focuses on the time complexity problem. The lemmatizer was built using a rule-based approach and a paradigm approach. In the rule-based approach, along with the rules, a knowledge base is created for storing the grammatical features. A knowledge base is also created for storing the exceptional root words. Although the knowledge base creation requires a large amount of memory, in respect of time it gives the best, most accurate and fastest result. The reason behind this fast retrieval is that a very short time is taken to search for the input word in the knowledge base. The system was evaluated on 2500 words for analysis. Among these 2500 words, 2227 words were evaluated correctly and 273 words were incorrect.

[5]. Different issues in the design of a lemmatizer/tagger

They focus on the development of a general-purpose lemmatizer/tagger for Basque. The basic tools used are:

• The Lexical Database for Basque (LDBB). At present it contains 60,000 entries, each with its associated linguistic features (category, subcategory, case, number, etc.).
• A general morphological analyzer/generator.

The mechanism has two main components in order to be capable of treating unknown words: 1) generic lemmas corresponding to each possible open category or subcategory, and 2) two additional rules in order to express the relationship between the generic lemmas at the lexical level and any acceptable lemma of Basque.

2.2 REVIEW ON STEMMING:

Author | Proposed Method | Language | Technique Used | Accuracy
Dinesh Kumar, Prince Rana [11] | Brute Force Technique | Punjabi | Suffix stripping | 80.73%
Suprabhat Das, Pabitra Mitra [12] | Method proposed by Porter | Bengali | Suffix stripping | 96.27%
Juhi Ameta, Nisheeth Joshi, Iti Mathur [13] | Longest Matched | Gujarati | Rule based | 91.5%
Shahid Husain [14] | n-gram stripping model | Urdu | 1) Length based 2) Frequency based | 1) 84.27% 2) 79.63%
Braja Gopal Patra, Dipankar Das [15] | Rule Based Stemmer | Kokborok | 1) Minimum suffix stripping 2) Maximum suffix stripping | 1) 80.02% 2) 85.13%
Upendra Mishra, Chandra Prakash [16] | Brute Force Technique | Hindi | Suffix stripping | 91.6%
Shahid Husain [17] | n-gram stripping model | Marathi | 1) Length based 2) Frequency based | 1) 63.5% 2) 82.68%
Thangarasu, Manavalan [18] | Light Stemmer | Tamil | Light stemming | 98%
Vishal Gupta [19] | Suffix Stripping | Hindi | Rule based suffix stripping | 83.65%
Elaheh Rahimtoroghi, Hesham Faili, Azadeh Shakery [20] | Rule Based Stemmer | Persian | Structural approach and morphological rules | Precision increased by 4.83%
Ayu Purwarianti [21] | Nondeterministic Finite Automata | Indonesian | Suffix stripping | 81%
Mohamad Ababneh, Riyad Al-Shalabi [22] | Rule Based Light Stemmer | Arabic | Root extraction stemmer and light stemmer | 71%
Osama A. Ghanem, Wesam M. Ashour [23] | K-means Algorithm | Arabic | Clustering | Not mentioned
Sidikka Parlak, Murat Saraclar [24] | Length based | Turkish | Rule based | 80%

CHAPTER 3

APPROACHES FOR LEMMATIZATION

In this chapter we discuss the five approaches used for lemmatization. These approaches are either rule-based or statistical in nature. The first approach is a string-matching, dictionary-based approach. The second is based on finite state automata. The third is the affix removal approach, and the fourth is the fixed length truncation approach, which is mostly used for languages where the average word length is more than 7 characters; by removing a fixed-size suffix it can produce good results.

The fifth approach is the trie-based approach, also known as the tree approach, which we have chosen for building our lemmatizer. It retrieves all possible lemmas of a given inflected word and will be discussed in detail in the next chapter.

In summary, the approaches for lemmatization are:

• The edit distance on dictionary algorithm, which is a combination of string matching and a most-frequent-inflectional-suffixes model.
• The morphological analyzer, which is based on finite state automata.
• The affix lemmatizer, which is a combination of rule-based and supervised training approaches, and the fixed length truncation approach.
• The trie data structure, which allows retrieving the possible lemma of a given inflected or derivational form.

3.1 LEVENSHTEIN DISTANCE DICTIONARY BASED APPROACH:

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

Searching for similar sequences of data is of great importance to many applications, such as gene similarity determination, speech recognition, database and/or internet search engines, handwriting recognition, spell-checkers, and other biology, genomics and text processing applications. Therefore, algorithms that can efficiently manipulate sequences of data (in terms of time and/or space) are highly desirable, even with modest approximation guarantees.

The Levenshtein distance of two strings A and B is the minimum number of character transformations required to convert string A to string B.

The Levenshtein distance between two strings $a$ and $b$ is given by $\operatorname{lev}_{a,b}(|a|,|b|)$, where

$$
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1\\
\operatorname{lev}_{a,b}(i,j-1) + 1\\
\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases}
& \text{otherwise,}
\end{cases}
$$

where $1_{(a_i \neq b_j)}$ is the indicator function, equal to 0 when $a_i = b_j$ and equal to 1 otherwise. Note that the first element in the minimum corresponds to deletion (from $a$ to $b$), the second to insertion, and the third to match or mismatch, depending on whether the respective symbols are the same.
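To make the recurrence concrete, here is a minimal dynamic-programming sketch in Java; this is a standard textbook formulation shown for illustration, not code from the project itself:

```java
// Minimal dynamic-programming sketch of the Levenshtein recurrence above.
public class Levenshtein {

    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1; // indicator
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,         // deletion
                        d[i][j - 1] + 1),        // insertion
                        d[i - 1][j - 1] + cost); // match or substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("going", "go")); // prints 3
    }
}
```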


The edit distance algorithm is performed using three "primitive edit operations". By primitive edit operation we refer to the substitution of one character for another, the deletion of a character, and the insertion of a character. So this algorithm can be performed with the three basic operations of insertion, deletion and substitution. Some approaches focus on suffix phenomena only, but this approach deals with both suffixes and prefixes, so it handles the affixation phenomenon in general. Sometimes suffixes are added to words based on grammatical rules: for the word "going", this approach returns the headword "go". But for a word like "went", the dictionary contains a discrete entry for the lemma. The idea is to find all possible lemmas for the user's input word.

For each one of the target words, the similarity distance between the source and the target word is calculated and stored. When this process is completed, the algorithm returns the set of target words having the minimum edit distance from the source word. So the algorithm compares the user's input to all available stored lemmas and retrieves the words at minimum distance from the target set.

The algorithm provides the option to select the value of the approximation that the system considers as the desired similarity distance (e.g. if the user enters zero as the desired approximation, then only the target words with the minimum edit distance will be returned, whereas if he/she enters e.g. 2 as the desired approximation, then the returned set will contain all the target words having a distance within 2 of the minimum).
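The lookup just described can be sketched as follows; this hypothetical method reuses distance() from the earlier sketch and returns every stored lemma whose edit distance from the input word lies within the chosen approximation of the minimum:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical dictionary lookup sketch: return every stored lemma whose
// edit distance from the input word is at most (minimum distance + approx);
// with approx = 0, only the closest lemmas are returned.
public class EditDistanceLookup {

    static List<String> candidateLemmas(String word, List<String> lemmas, int approx) {
        int best = Integer.MAX_VALUE;
        for (String lemma : lemmas) {
            best = Math.min(best, Levenshtein.distance(word, lemma));
        }
        List<String> result = new ArrayList<>();
        for (String lemma : lemmas) {
            if (Levenshtein.distance(word, lemma) <= best + approx) result.add(lemma);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lemmas = Arrays.asList("go", "gone", "goat");
        System.out.println(candidateLemmas("going", lemmas, 0)); // prints [gone]
    }
}
```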

3.2 MORPHOLOGICAL ANALYZER BASED APPROACH

A morphological analyzer gives all possible analyses for a given word. It is based on finite state technology and produces the morphological analysis of the word form as its output. This approach uses finite state automata and two-level morphology to build a lexicon for a language with an infinite vocabulary. Two-level rules are declarative constraints that describe morphological alternations, such as the y -> ie alternation in the plural of some English nouns (spy -> spies). The aim of this approach is to convert two-level rules into deterministic, minimized finite-state transducers. It describes the format of two-level grammars, the rule formalism, and the user interface to the compiler. It also explains how the compiler can assist the user in the development of a two-level grammar.

A finite state transducer (FST) is a finite state machine with two tapes: an input tape and an output tape. This contrasts with an ordinary finite state automaton (or finite state acceptor), which has a single tape. A transducer translates a word from one form to another; it has two tapes, one being the input tape and the other the output tape.

A finite state transducer is a 6-tuple (Q, Σ, Γ, I, F, δ) such that:

• Q is a finite set, the set of states;
• Σ is a finite set, called the input alphabet;
• Γ is a finite set, called the output alphabet;
• I is a subset of Q, the set of initial states;
• F is a subset of Q, the set of final states; and
• δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q (where ε is the empty string) is the transition relation.

The FSM consumes input actions, and its output depends on the current state; characters move from the input tape to the output tape as these transitions are performed.
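To make the y -> ie alternation concrete, here is a toy sketch written as a plain Java function rather than a compiled two-level transducer; the state machinery of a real FST is elided, and only the single spy -> spies alternation mentioned above is encoded:

```java
// Toy sketch of the y:ie two-level alternation described above, written as
// a plain function; a real two-level system would compile such rules into
// a finite-state transducer rather than hand-code them like this.
public class PluralAlternation {

    static String pluralize(String noun) {
        // y:ie correspondence at the end of the word, then append 's'.
        if (noun.endsWith("y")) {
            return noun.substring(0, noun.length() - 1) + "ies";
        }
        return noun + "s"; // identity correspondences elsewhere
    }

    public static void main(String[] args) {
        System.out.println(pluralize("spy")); // spies
        System.out.println(pluralize("bat")); // bats
    }
}
```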

3.3 AFFIX LEMMATIZER

The most common approach for word normalization is to remove affixes from a given word. A suffix or prefix is removed as per rules defined based on grammatical knowledge of the language. Simply removing a suffix or prefix from a word cannot give an accurate head word or root word, and a purely rule-based approach cannot give accurate results, so combining the rule-based approach with a statistical approach like supervised training can give more accurate results.

The supervised training algorithm generates a data structure consisting of rules that a lemmatizer must traverse to arrive at a rule that is elected to fire. After training, the data structure of rules is made permanent and can be consulted by a lemmatizer. The lemmatizer must elect and fire rules in the same way as the training algorithm, so that all words from the training set are lemmatized correctly. It may, however, fail to produce the correct lemmas for words that were not in the training set, the out-of-vocabulary (OOV) words. For training words this approach uses prime and derived rules. The prime rule is the least specific rule needed to lemmatize, whereas derived rules are more specific rules that can be created by adding or removing characters. For example, a rule can handle "watcha" (derived from "what are you") or "yer" (derived from "you are" rather than "your").

This approach is more generalized than a suffix-removal-only approach. The bulk of 'normal' training words must be bigger for the new affix-based lemmatizer than for the suffix lemmatizer. This is because the new algorithm generates immense numbers of candidate rules with only marginal differences in accuracy, requiring many examples to find the best candidate.

3.4 FIXED LENGTH TRUNCATION

In this approach, we simply truncate the words and use the first n characters (for example, the first 5 or 7 characters) of each word as its lemma. Words with fewer than n characters are used as lemmas with no truncation. This approach is most appropriate for languages like Turkish, where the average word length is 7.07 letters.

This approach is used when time is the highest-priority issue. It is the simplest approach and does not depend on any language or grammar, so it can be applied to any language.
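A minimal sketch, with the truncation length n as the only parameter:

```java
// Fixed length truncation sketch: the first n characters of a word serve
// as its lemma; words shorter than n are kept unchanged, as described above.
public class Truncation {

    static String truncate(String word, int n) {
        return word.length() <= n ? word : word.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(truncate("preparation", 7)); // prepara
        System.out.println(truncate("go", 7));          // go (no truncation)
    }
}
```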


    CHAPTER 4

    TRIE BASED APPROACH

    4.1 INTRODUCTION

The trie data structure is one of the most important data storage mechanisms in programming. It is a natural way to represent essential utilities on a computer, like the directory structure in a file system. Many other objects can be stored in a tree data structure, resulting in space and/or time efficiency.

For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory, we can use a tree structure to store the words efficiently instead of using a plain Array or Vector type that simply stores each word individually in memory. The space needed to store the words in an Array or Vector is simply the number of words times the average length of the words we need to store.

A trie, also called a prefix tree, is a tree structure that stores words with a common prefix under the same sequence of edges in the tree, eliminating the need for storing the same prefix each time for each word. More formally, a trie is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.


EXAMPLE 1: Trie for a language consisting of the words a, abase, abate and bat:

In computer science, a radix tree (also Patricia trie, radix trie or compact prefix tree) is a space-optimized trie data structure where each node with only one child is merged with its parent. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes. A trie is a data structure which allows retrieving all possible lemmas: each node holds a single character, nodes are connected by edges, and a word is retrieved character by character. This approach may also involve backtracking to get the appropriate result.

EXAMPLE 2: Trie for the Hindi language consisting of the words कमल, कमरा, कमर, कमरे, लड़, लड़का, लड़की:

Words are stored from the root, character by character, in Unicode order. The user's input word is searched starting from the first node, and the traversal proceeds down the trie up to the last character of the word. It is possible that the traversal needs to backtrack for some levels.

4.2 APPLICATIONS OF TRIE:

Prefix trees are a bit of an overlooked data structure with lots of interesting possibilities. The trie is an interesting data structure used mainly for manipulating words in a language. The trie has a wide variety of applications in:

• spell checking and word completion
• data compression
• computational biology
• routing tables for IP addresses
• storing/querying XML documents, etc.

As a dictionary: Looking up whether a word is in a trie takes O(n) operations, where n is the length of the word. Thus, for array implementations, the lookup speed does not change with increasing trie size. Tries have been used to store large dictionaries of English (say) words in spelling-checking programs and in natural-language "understanding" programs. Simple spell checkers operate on individual words by comparing each of them against the contents of a dictionary, possibly performing stemming on the word. If the word is not found, it is considered to be an error, and an attempt may be made to suggest a word that was likely to have been intended.

Word completion is straightforward to implement using a trie: simply find the node corresponding to the first few letters, and then collapse the subtree into a list of possible endings. This can be used for auto-completing user input in text editors (a sketch is given at the end of this section).

Tries and web search engines: The index of a search engine (the collection of all searchable words) is stored in a compressed trie. Each leaf of the trie is associated with a word and has a list of pages (URLs) containing that word, called an occurrence list. The trie is kept in internal memory, while the occurrence lists are kept in external memory and are ranked by relevance. Boolean queries for sets of words (e.g. Java and coffee) correspond to set operations (e.g. intersection) on the occurrence lists.

Additional information retrieval techniques are used, such as: stop word elimination (e.g. ignore "the", "a", "is"); stemming (e.g. identify "add", "adding", "added"); and link analysis (recognizing authoritative pages).

Tries in internet routers: Computers on the internet (hosts) are identified by a unique 32-bit IP (Internet Protocol) address, usually written in "dotted-quad-decimal" notation; e.g. www.google.com is 62.233.189.104. An organization uses a subset of IP addresses with the same prefix, e.g. IIDT uses 10.*.*.*. Data is sent to a host by fragmenting it into packets, and each packet carries the IP address of its destination. A router forwards packets to its neighbors using IP prefix matching rules. Routers use tries to do this prefix matching.
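As a sketch of the word-completion idea mentioned above, the following hypothetical method (using the TrieNode/Trie sketch given later in Section 4.4) descends to the node reached by the prefix and collects every stored word in that node's subtree:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical word-completion sketch: descend to the node for the prefix,
// then collect every complete word stored below that node.
class Completion {

    static List<String> complete(Trie trie, String prefix) {
        TrieNode current = trie.root;
        for (char c : prefix.toCharArray()) {
            current = current.children.get(c);
            if (current == null) return new ArrayList<>(); // no such prefix stored
        }
        List<String> results = new ArrayList<>();
        collect(current, prefix, results);
        return results;
    }

    // Depth-first collection of all complete words below this node.
    static void collect(TrieNode node, String soFar, List<String> results) {
        if (node.isEnd) results.add(soFar);
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            collect(e.getValue(), soFar + e.getKey(), results);
        }
    }
}
```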

4.3 PRE-PROCESSING STEPS

Before applying the lemmatization algorithm, we need to normalize the contents of the input file. This includes the following 2 steps:

4.3.1 Tokenization:

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears

These tokens are often loosely referred to as terms or words.

4.3.2 Stop-words Removal:

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.

Common stop words for the English language: a, about, above, after, again, against, is, I, am, all, an, but, be, being, and, are, can't.
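As an illustration, here is a minimal sketch of these two pre-processing steps, assuming a simple punctuation-splitting tokenizer and the small stop-word list above (not the project's actual implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal pre-processing sketch: tokenization followed by stop-word removal.
public class PreProcess {

    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "about", "above", "after", "again", "against", "is", "i",
            "am", "all", "an", "but", "be", "being", "and", "are", "can't"));

    // Chop the character sequence into tokens, throwing away punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Filter out tokens that appear in the stop-word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;");
        System.out.println(tokens); // [Friends, Romans, Countrymen, lend, me, your, ears]
        System.out.println(removeStopWords(tokens)); // none of these tokens are in the list above
    }
}
```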

4.4 BASIC OPERATIONS IN TRIE:

4.4.1 SEARCHING IN A TRIE

To search for a key K in a trie T, we begin at the root, which is a branch node. Suppose the key K is made up of characters k1 k2 k3 ... kn. The first character of the key K, viz. k1, is extracted, and the pChildren field corresponding to the letter k1 in the root branch node is spotted. If T->pChildren[k1 - 'a'] is equal to NULL, then the search is unsuccessful, since no such key is found. If T->pChildren[k1 - 'a'] is not equal to NULL, then the pChildren field may point either to an information node or to a branch node. If the information node holds K, then the search is done: the key K has been successfully retrieved. Otherwise, it implies the presence of key(s) with a similar prefix. We extract the next character, k2, of key K and move down the link field corresponding to k2 in the branch node encountered at level 2, and so on, until the key is found in an information node or the search is unsuccessful. The deeper the search, the more keys there are with similar but longer prefixes.

Algorithm for searching for a word in the trie:

1. Set the current node to the root node. Set the current letter to the first letter in the word.

2. If the current node is null, then the word does not exist in the trie.

3. If the current node references a valid node containing the current letter, then set the current node to that referenced node and set the current letter to the next letter in the word.

4. Repeat steps 2 and 3 until all letters in the word have been processed.

5. Now there are two possibilities that may indicate the word is not there in the trie:

a) the current letter is the last letter and there is no valid node containing this letter, and

b) there is a valid node containing the last letter, but the node does not indicate that it contains a full word (i.e. the Boolean field isEnd = false).

6. If the conditions in step 5 are not met, then we have a match for the word in the trie (i.e. when isEnd = true).

4.4.2 INSERTION IN A TRIE:

To insert a key K into a trie, we begin as we would for a search, following the appropriate pChildren fields of the branch nodes corresponding to the characters of the key. At the point where the pChildren field of a branch node leads to NULL, the key K is inserted.

Algorithm for inserting a word into a trie:

1. Set the current node to the root node (i.e. value = null).

2. Set the current letter to the first letter in the word.

3. If the current node already has an existing reference to the current letter, then set the current node to that referenced node; else create a new node, set its letter to the current letter, set the current node to this new node, and set the value of isEnd to false.

4. Repeat step 3 until all letters in the current word have been processed. Set isEnd = true when the process ends.

A minimal Java sketch of both operations is given below.
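This sketch uses a hash-map-based children field and the isEnd flag described above; it is a simplified illustration rather than the project's exact code:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a trie with the insert and search operations above.
// Each node holds a character value, an isEnd flag marking the final
// character of a stored word, and a hash map of child nodes.
class TrieNode {
    char value;
    boolean isEnd;
    Map<Character, TrieNode> children = new HashMap<>();

    TrieNode(char value) { this.value = value; }
}

public class Trie {
    final TrieNode root = new TrieNode('\0'); // root: empty string, no character

    // Insertion: follow existing references, create nodes where absent,
    // and mark the node of the last character with isEnd = true.
    void insert(String word) {
        TrieNode current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, TrieNode::new);
        }
        current.isEnd = true;
    }

    // Searching: follow the children references letter by letter; the word
    // is present only if every letter is found and the last node is final.
    boolean search(String word) {
        TrieNode current = root;
        for (char c : word.toCharArray()) {
            current = current.children.get(c);
            if (current == null) return false; // no path: word absent (case 5a)
        }
        return current.isEnd; // full word only if isEnd = true (case 5b)
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        for (String w : new String[]{"a", "an", "and", "abase", "abate", "bat"}) {
            t.insert(w);
        }
        System.out.println(t.search("and")); // true
        System.out.println(t.search("ab"));  // false: prefix only, isEnd not set
    }
}
```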

4.5 ALGORITHM FOR LEMMATIZATION:

The algorithm requires a file containing the list of the root words (lemmas) of the language concerned. At first, we create a trie structure using the dictionary root words. Each node in the trie corresponds to a Unicode character of the language concerned, and the nodes that end with the final character of any root word are marked as final nodes. The rest of the nodes are marked as non-final nodes. To find the lemma of a surface word, the trie is navigated starting from the initial node, and navigation ends either when the word is completely found in the trie or when, after some portion of the word, there is no path present in the trie to navigate.

The key idea is that a trie is created out of the vocabulary (root words) of the language. The lemmatizing process consists in navigating the trie, trying to find a match between the input word and an entry in the trie. An algorithm is applied to find the appropriate lemma for the input word.

EXPLANATION:

The algorithm requires a list of root words of the language concerned. We store the root words in a file.

Step 1: At first, we create a trie structure using the list of root words.

A trie node consists of the fields:

1) value, which corresponds to a Unicode character of the language concerned;

2) isEnd, a Boolean field which is set to true if it is the final character of any root word; otherwise the value is set to false;

3) children, a hash map which maps the current trie node to its child nodes.

The Insert algorithm explained above is used to insert the root words into the trie.

Step 2: Navigating through the trie to find the matching prefix.

The Search algorithm explained above is applied here, with some modifications, to find the lemma of the surface word. To find the lemma of a surface word, the trie is navigated starting from the initial node, and navigation ends either when the word is completely found in the trie or when, after some portion of the word, there is no path present in the trie to navigate. While navigating, some situations may occur, depending on which we decide how to determine the lemma. Those situations are described below.

CASE 1:

The surface word is a root word. In that case, the surface word itself is the lemma.

Example: Stored word: abbreviate; Input: abbreviate; Matched string: abbreviate; Output: abbreviate

CASE 2:

The surface word is not a root word. In that case, the trie is navigated up to the node where the surface word completely ends or where there is no path left to navigate in the trie. We call this node the end node. Again, two different cases may occur.

CASE 2.1:

In the path from the initial node to the end node, one or more root words are found, i.e. one or more final nodes are present in the path. Then pick the final node which is closest to the end node.

Example: Stored words: a, an, and; Input: ands; Matched prefixes: a, an, and; Output: and

The word represented by the path from the initial node to the picked final node is considered as the lemma.

CASE 2.2:

No root word is found in the path from the initial node to the end node. Then find the final node in the trie which is closest to the end node.

Example: Stored word: abbreviate; Input: abbreviating; Matched string: abbreviat; Output: abbreviate

If more than one final node is found at the closest distance, then pick all of them. Now generate the root word(s) which is/are represented by the path from the initial node to those picked final node(s).

Output: The list of matched lemmas will be returned.
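A hypothetical sketch of the lemma lookup for Case 1 and Case 2.1, written as a method added to the Trie sketch from Section 4.4; Case 2.2, which searches below the end node for the nearest final node, would need an additional breadth-first traversal and is omitted for brevity:

```java
// Hypothetical lemma lookup covering Case 1 and Case 2.1: walk the trie
// along the surface word, remembering the last final node seen; the prefix
// ending at that node is returned as the lemma. Returns null when no root
// word lies on the path (Case 2.2, handled by a further search in the trie).
String lemmatize(String surface) {
    TrieNode current = root;
    int lastFinal = -1; // index just past the last matched root word
    for (int i = 0; i < surface.length(); i++) {
        current = current.children.get(surface.charAt(i));
        if (current == null) break;       // end node reached: no further path
        if (current.isEnd) lastFinal = i + 1;
    }
    // Case 1: the whole word matched a root word; Case 2.1: the longest
    // root word on the path is the lemma (e.g. "ands" -> "and").
    return (lastFinal >= 0) ? surface.substring(0, lastFinal) : null;
}
```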

[Figure: Hindi language example of tokenization and stop-words removal]

4.6 FLOWCHART OF ALGORITHM:

The above discussed algorithm can be depicted with the help of the following flowchart, which explains how a trie can be used to find common prefixes:

[Figure: flowchart of the lemmatization algorithm]

4.7 DRAWBACK OF ALGORITHM:

The following are some of the drawbacks of the lemmatizing algorithm used:

• Compound words and out-of-vocabulary words are not considered in our algorithm.
• Root words are taken from a dictionary, but if the coverage of the dictionary is not good, then accuracy will degrade.

CHAPTER 5

IMPLEMENTATION & RESULT ANALYSIS

5.1 IMPLEMENTATION

Appendix A: Snapshots

ENGLISH LANGUAGE

a) Input: abases; Output: abase

b) Input: EnglishInput.txt (file); Output: myFile.txt

c) Input: EnglishTry.txt; Tokenized file: Tokenized.txt; Output: MyFile.txt

HINDI LANGUAGE

a) Input: लड़कियां; Output: लड़की

b) Input: HindiInput.txt; Output: myFile.txt

5.2 RESULT ANALYSIS:

For evaluation of results we have performed the following tests:

English Language:

Test 1:

A file containing 14,730 lemmas was used to build the trie data structure, and another file (input file) containing 25,803 inflected words was used to perform the testing. It is found that out of these 25,803 words our lemmatizer is able to give correct results for 24,513 words. We have used the following formula to calculate the accuracy of the lemmatizer:

Accuracy = (No. of words correctly lemmatized / Total no. of words) × 100

Thus for this test, Accuracy = (24,513 / 25,803) × 100 = 95%.

Test 2:

A file (input file) containing 7 sentences of approximately 4 words each was used. For this we have taken a file which contains the related root words of the input file to build up the trie data structure. It is found that our lemmatizer is able to correctly tokenize and remove the stop words from the input file. It is then also able to correctly lemmatize all the words (Accuracy = 100%). (Refer to the snapshots for details.)

Hindi Language:

Test 1:

A file containing 1000 Hindi lemmas was used to build the trie data structure, and another file (input file) containing 220 inflected words was used to perform the testing. It is found that out of these 220 words our lemmatizer is able to give correct results for 207 words. Thus for this test, Accuracy = (207 / 220) × 100 = 94%.

Test 2:

A file (input file) containing 10 sentences of approximately 5 words each was used. For this we have taken a file which contains the related root words of the input file to build up the trie data structure. It is found that our lemmatizer is able to correctly tokenize and remove the stop words from the input file. It is then also able to lemmatize approximately all the words (Accuracy = 97%).

Table:

Sr. No. | Language | No. of words taken | Correctly lemmatized words | Accuracy
1 | ENGLISH | 25,803 | 24,513 | 95%
2 | HINDI | 220 | 207 | 94%

Appendix B: Development Platform

Software requirements for implementing the system:

Operating System | Windows 7
Platform Used | Java (NetBeans IDE 7.3.1)

Hardware requirements for developing and implementing the system: a Pentium-based laptop with a minimum of

I. 1 GB RAM
II. 320 GB hard disk space
III. Intel Pentium processor

CHAPTER 6

CONCLUSION

In this project work we investigated many existing techniques and selected the trie-based approach for building our lemmatizer.

We tested our lemmatizer for the English and Hindi languages, and it is found that it gives good results, but in many cases it fails to correctly lemmatize because of out-of-vocabulary words, compound words, and also because of different kinds of inflectional words which are specific to particular languages.

Finally, we can conclude that our lemmatizer is language independent, and we can thus use our lemmatizer for any language, but we will need a correct list of all the root words of that language to build the trie.

CHAPTER 7

FUTURE WORK

With the present approach one can further work on the following future aspects:

1) One can also use other data structures, like a compressed trie, to improve the results.

2) Higher accuracy can be achieved by providing more user interaction.

3) Handling of compound words and out-of-vocabulary words can be added to our algorithm.

4) If the root word is not in the dictionary, then there should be some way to provide a result.

5) Backtracking can be implemented in the algorithm for better search results.

CHAPTER 8

REFERENCES

[1]. https://www.google.com/search?sclient=psy-ab&btnG=Search&q=lemmatization+articles#q=A+Rule+based+Approach+to++Word+Lemmatization+by+joel+pilson
[2]. http://www.apl.org.pt/docs/22-textos-seleccionados/12-Branco_Silva.pdf
[3]. http://www.arxiv.org/pdf/1310.0581
[4]. http://www.airccj.org/CSCP/vol3/csit3408.pdf
[5]. http://arxiv.org/abs/cmp-lg/9503020
[6]. http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149
[7]. http://www.euralex.org/elx_proceedings/Euralex1994/56_Euralex_Eugenio%20Picchi%20%20Statistical%20Tools%20for%20Corpus%20Analysis%20%20A%20Tagger%20and%20Lemmatizer%20for%20I.pdf
[8]. http://dl.acm.org/citation.cfm?id=980692
[9]. http://dl.acm.org/citation.cfm?id=980910
[10]. http://arxiv.org/abs/1203.3584
[11]. Dalwadi Bijal, Suthar Sanket, "Overview of Stemming Algorithms for Indian and Non-Indian Languages", International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 5 (2), pp. 1144-1146, 2014.
[12]. Vishal Gupta, "Hindi Rule Based Stemmer for Nouns", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 1, January 2014.
[13]. M. Thangarasu, R. Manavalan, "Design and Development of Stemmer for Tamil Language: Cluster Analysis", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 7, July 2013.

[14]. Siddika Parlak, Murat Saraclar, "Performance Analysis and Improvement of Turkish Broadcast News Retrieval", IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, No. 3, pp. 731-740, March 2012.
[15]. Upendra Mishra, Chandra Prakash, "MAULIK: An Effective Stemmer for Hindi Language", International Journal on Computer Science and Engineering (IJCSE), Vol. 4, No. 5, pp. 711-717, May 2012.
[16]. Ms. Anjali Ganesh Jivani, "A Comparative Study of Stemming Algorithms", International Journal of Computer Technology and Applications, Vol. 2 (6), pp. 1930-1938, Nov-Dec 2011.
[17]. Mohamad Ababneh, Riyad Al-Shalabi, Ghassan Kanaan, Alaa Al-Nobani, "Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness", The International Arab Journal of Information Technology, Vol. 9, No. 4, pp. 368-372, July 2012.
[18]. Suprabhat Das, Pabitra Mitra, "A Rule-based Approach of Stemming for Inflectional and Derivational Words in Bengali", Proceedings of the IEEE Students' Technology Symposium, pp. 14-16, January 2011.
[19]. Mohd. Shahid Husain, "An Unsupervised Approach to Develop Stemmer", International Journal on Natural Language Computing (IJNLC), Vol. 1, No. 2, August 2012.
[20]. M. Thangarasu, R. Manavalan, "A Literature Review: Stemming Algorithms for Indian Languages", International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 8, August 2013.
[21]. Vimala Balakrishnan, Ethel Lloyd-Yemoh, "Stemming and Lemmatization: A Comparison of Retrieval Performances", Lecture Notes on Software Engineering, Vol. 2, No. 3, August 2014.

[22]. M. Nithya, "Clustering Technique with Potter Stemmer and Hypergraph Algorithms for Multi-featured Query Processing", International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 3, pp. 960-965, May-June 2012.
[23]. Dhamodharan Rajalingam, "A Rule Based Iterative Affix Stripping Stemming Algorithm for Tamil", Vol. 132, pp. 583-590, 2012.
[24]. www.ijrat.org/downloads/icatest2015/ICATEST
[25]. https://www.cse.iitb.ac.in/~pb/papers/gwc14-multilingual-stemmer
[26]. https://hbfs.wordpress.com/2012/07/10/stemming
[27]. https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=624693
[28]. String Processing and Information Retrieval: Volume 9.
[29]. Ljiljana Dolamic and Jacques Savoy, "Comparative Study of Indexing and Search Strategies for the Hindi, Marathi and Bengali Languages", 2010.
