
8/7/2019 Final 20 Dec Reprt..(2)

http://slidepdf.com/reader/full/final-20-dec-reprt2 1/53

 JAVA API FOR NLP TOOLS

Bachelor of Technology

Computer Science and Engineering

Submitted By:-

RAJSHREE GUPTA (0709110075)

RICHA PANDEY (0709110079)

SURABHI VERMA (0709110104)

SWETA SINGH (0709110105)

Department of Computer Science and Engineering

JSS Academy Of Technical Education,

Noida


 DECLARATION 

We hereby declare that this submission is our own work and that, to the best of our 

knowledge and belief, it contains no material previously published or written by

another person nor material which to a substantial extent has been accepted for 

the award of any other degree or diploma of the university or other institute of 

higher learning, except where due acknowledgment has been made in the text.

Signature: Signature:

Name: Rajshree Gupta Name: Richa Pandey

Roll No.: 0709110075 Roll No.: 0709110079

Date: Date:

Signature: Signature:

Name: Surabhi Verma Name: Sweta Singh

Roll No.: 0709110104 Roll No.: 0709110105

Date: Date:

 


CERTIFICATE 

 

This is to certify that the Project Report entitled “Java API for NLP tools”, which is submitted by Rajshree Gupta, Richa Pandey, Surabhi Verma and Sweta Singh in partial fulfillment of the requirement for the award of the degree of B.Tech. in the Department of Computer Science & Engineering of U. P. Technical University, is a record of the candidates’ own work carried out by them under my supervision. The matter embodied in this thesis is original and has not been submitted for the award of any other degree.

 

Supervisor

Mrs. Seema Shukla

Asst. Professor

Department of Computer Science & Engineering.

Date


 ACKNOWLEDGEMENT 

It gives us a great sense of pleasure to present the report of the B.Tech. project undertaken during the final year of the B.Tech. programme. We owe a special debt of gratitude to Professor (Mrs.) Seema Shukla, Department of Computer Science & Engineering, JSS Academy of Technical Education, Noida, for her constant support and guidance throughout the course of our work. Her sincerity, thoroughness and perseverance have been a constant source of inspiration for us. It is only her cognizant efforts that have brought our endeavours to the light of day.

We would also like to take this opportunity to acknowledge the contribution of all the faculty members of the department for their kind assistance and cooperation during the development of our project. Last but not least, we acknowledge our friends for their contribution to the completion of the project.

Signature: Signature:

Name: Rajshree Gupta Name: Richa Pandey

Roll No.: 0709110075 Roll No.: 0709110079

Date: Date:

Signature: Signature:

Name: Surabhi Verma Name: Sweta Singh


Roll No.: 0709110104 Roll No.: 0709110105

Date: Date:


 ABSTRACT 

Many application areas require text preprocessing. Java API for NLP Tools is an API that analyzes, understands, and helps in processing the languages that humans use naturally. The proposed API provides programmers with a platform for preprocessing natural text.

The proposed API supports a variety of NLP tools: the Tokenizer splits a stream of text into tokens; the Stop Word Remover eliminates stop words from the text; the Word Frequency Counter counts the frequency of words in the input file; the Stemmer reduces inflectional forms and derivationally related forms of a word to a common base form; the N-Gram Identifier identifies subsequences of n words from a given sequence; the Multiword Extractor identifies multiword sets from a corpus; and the Word Sense Disambiguator identifies the correct meaning of an ambiguous word.

Java API for NLP Tools tries to eliminate the major deficiencies in the available tools, although with some constraints, as mentioned in the subsequent chapters.


TABLE OF CONTENTS

  Page

DECLARATION ii

CERTIFICATE iii

ACKNOWLEDGEMENTS iv

ABSTRACT v

LIST OF FIGURES ix

LIST OF TABLES x

CHAPTER 1 INTRODUCTION

1.1 Problem Introduction 11

1.1.1 Motivation 11

1.1.2 Applications of JAVA API for NLP Tools 12

1.1.3 Objective 12

1.1.4 Scope of the Project 12

1.2 Related Previous Work 16

1.2.1 NLP Tools 16

1.3 Organization of the report 17

CHAPTER 2 LITERATURE SURVEY 18

2.1 Natural Language Processing 18

2.2 Tokenizer 19

2.2.1 Need of Tokenizer 19

2.2.2 Existing approaches with recurring problems 20

2.3 Stop Words 21

2.3.1 Significance of Stopword list 22

2.3.2 Problems encountered in Search Engines 22


2.4 Word Frequency Counter 23

CHAPTER 3 TOKENIZER TOOL 24

3.1 Problem Identification and Elimination 24

3.2 Algorithm and Flowchart 25

3.3 Class Description 26

3.3.1 Attributes 27

3.3.2 Constructors 27

3.3.3 Methods 29

3.4 Assumptions and Dependencies 32

3.5 Constraints 32

3.6 Result 32

CHAPTER 4 STOP WORD REMOVER 37

4.1 Algorithm and Flowchart 37

4.2 Class Description 37

4.2.1 Attributes 38

4.2.2 Constructors 39

4.2.3 Methods 40

4.3 Assumptions and Dependencies 40

4.4 Constraints 41

4.5 Result 41

CHAPTER 5 WORD FREQUENCY COUNTER 42


5.1 Algorithm and Flowchart 42

5.2 Class Description 44

5.2.1 Attributes 44

5.2.2 Constructors 44

5.2.3 Methods 45

CHAPTER 6 CONCLUSION 46

6.1 Agenda for next semester 46

APPENDIX A: LIST OF STOPWORDS 47

REFERENCES 49


LIST OF FIGURES

Fig. 3.1 Flowchart for Tokenizer Page: 26

Fig. 3.2 Class Diagram for class Tokenizer Page: 31

Fig. 3.3 Snapshot of User Interface Page: 33

Fig. 3.4 Snapshot of Tokenizer tool Interface Page: 33

Fig. 3.5 Test Case 1 snapshot Page: 34

Fig. 3.6 Test Case 2 snapshot- browsing an input file Page: 35

Fig. 3.7 Test Case 2 snapshot- selecting an output file Page: 36

Fig. 4.1 Flowchart for Stop Word Removal Page: 38

Fig. 4.2 Class Diagram for class StopWordRemover Page: 40

Fig. 5.1 Flowchart for Word Frequency Counter Page: 43

Fig. 5.2 Class Diagram for class WordFrequencyCounter Page: 44


LIST OF TABLES

Table 1.1 Table representing key application areas Page: 16

 


CHAPTER 1

INTRODUCTION

NLP tools are useful in many areas of computational linguistics and information-retrieval work: in automated morphological analysis, in stylistic or mathematical analysis of a body of language, in automatic text summarization, automatic text categorization, etc., all of which require texts to be pre-processed. But certain linguistic problems exist in almost every tool, no matter what its ultimate use.

Besides this, existing APIs are too complex to use and do not contain all the proposed tools.

1.1  Problem Introduction

For the purpose of preprocessing the language, the task of identifying classes, their attributes, methods and other object-oriented features will be an umbrella activity carried out for all the NLP tools supported by the proposed Java API.

1.1.1 Motivation

Natural languages are very complex. Natural-language processing is a very attractive method of human-computer interaction. The goal is to design and build APIs that will analyze, understand, and help in processing the languages that humans use naturally, so that eventually one will be able to address one's computer as though addressing another person [1, 3].

There are many applications, such as automatic text summarization and automatic text categorization, which require texts to be pre-processed. They also require tasks such as stop word removal and stemming. Although many APIs do exist for NLP tools, most of them are complex to use and do not contain all of the proposed tools. The proposed work is focused on developing an efficient Java API for NLP tools that processes texts and makes their information accessible to computer applications.


1.1.2 Applications of JAVA API for NLP tools

Different NLP tools are used for different purposes. Some of the applications are:

• Automatic summarization

• Machine Translation

• Morphological Segmentation

•  Natural Language Generation

• Information Retrieval

• Information Extraction

• Question Answering System

1.1.3 Objective

The primary objective of the project is to provide programmers of NLP based applications

with an easy to use API for NLP tools such as the following:

• Tokenizer 

• Stop word remover 

• Frequency counter of words

•  N-gram identification

• Stemmer 

• Multiword Extractor 

• Disambiguator 

1.1.4 Scope of the Project

The following tasks will be carried out to achieve the objectives stated above:


Designing of classes

Identifying classes, their attributes, methods and other object-oriented features will be an umbrella activity carried out for all the NLP tools supported by the proposed Java API.

Design and development of the following tools

• Tokenizer

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters [1]. The Tokenizer splits text into simple tokens based on delimiters that may be specified either at the time of creation or on a per-token basis. An effort will be made to resolve as many problems as possible, such as tokenizing an abbreviation like A.K.Sharma as a single token.
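The abbreviation case above can be illustrated with a small regex-based sketch. The class name and pattern here are illustrative assumptions for this report, not the proposed API's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: keep dotted abbreviations such as A.K.Sharma or
// I.U.P.A.C as single tokens while splitting ordinary words apart.
public class AbbreviationTokenizer {

    // Matches either a dotted abbreviation (a capital letter followed by
    // one or more ".letters" groups, with an optional trailing period)
    // or an ordinary run of letters.
    private static final Pattern TOKEN =
        Pattern.compile("\\p{Lu}(?:\\.\\p{L}+)+\\.?|\\p{L}+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Prof. A.K.Sharma teaches I.U.P.A.C rules."));
        // [Prof, A.K.Sharma, teaches, I.U.P.A.C, rules]
    }
}
```

A production tokenizer would also handle numbers, hyphenation and punctuation tokens, which this sketch deliberately omits.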

• Stop Word Remover

Stop words are words which are filtered out prior to, or after, processing of natural language data (text) [2]. Some examples of stop words are to, for, of, etc. To remove stop words, a database will be maintained which can be manipulated by the user.

• Word frequency counter

The aim is to develop an efficient algorithm to count the occurrences of unique words in a corpus and store them in a database or a text file.


• N-Gram identification[1]

An n-gram is a subsequence of n words from a given sequence. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram". The effort will be to design an appropriate algorithm to identify n-grams with a user-defined value of n. Even if that is not achieved, at least trigrams will be identified.
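A minimal sketch of n-gram collection over whitespace-split words could look as follows; the class and method names are illustrative, not the API's final design:

```java
import java.util.ArrayList;
import java.util.List;

// Collects all subsequences of n consecutive words from the input text.
public class NGramIdentifier {

    public static List<String> ngrams(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        // slide a window of n words over the sequence
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder gram = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                gram.append(' ').append(words[i + j]);
            }
            result.add(gram.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 words yield 3 bigrams
        System.out.println(ngrams("the quick brown fox", 2));
        // [the quick, quick brown, brown fox]
    }
}
```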

• Stemmer

The goal of a stemmer is to reduce inflectional forms and derivationally related forms of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words and often includes the removal of derivational affixes [2].

The subtasks to be carried out for this tool are:

Study of existing stemming algorithms and their drawbacks.

Implementation of a stemming algorithm which is a crude heuristic process.

Development of a statistical stemmer.
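A deliberately crude suffix-stripping sketch in the spirit described above (chopping word endings) is shown below; real stemmers such as Porter's apply ordered rule sets with measure conditions, and the suffix list here is illustrative:

```java
// Crude heuristic stemmer: strip the first matching suffix,
// provided at least a 3-letter base remains (so "red" survives "ed").
public class SimpleStemmer {

    private static final String[] SUFFIXES = { "ingly", "edly", "ing", "ed", "ly", "es", "s" };

    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("jumping")); // jump
        System.out.println(stem("cats"));    // cat
    }
}
```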

The following tools will also be added to the API if time permits

• Multiword Extractor

The Multiword Extractor will identify multiword sets on the basis of an existing corpus, and store them in a database or text file that can be used for future reference to extract multiwords from another text or corpus.

The subtasks to be carried out for this tool are:

Corpora collection.

Developing an algorithm for statistical extraction of multiwords from corpora.


• Word Sense Disambiguator[4]

This tool is used to identify the correct meaning of an ambiguous word.

The subtasks to be carried out for this tool are:

Study of WordNet, a lexical database of English words.

Exploring methods for word sense disambiguation without explicit creation of lexicon databases.

Study and exploration of the tools to be used for POS tagging.

• Design and Development of a GUI

A GUI will be developed which consists of a work area from where a user can browse the entire system and select any file, or input any text, that they wish to classify or process. The selected file or text may then be processed after the user clicks on the desired button provided for each of the NLP tools supported by the developed Java API.

• Testing & Evaluation of results achieved

The project will be tested against a variety of test data of varying length. After the

evaluation of the output thus produced, the accuracy of the project will be stated.


1.2 Related Previous Work 

The following Table 1.1 shows some of the commonly used tools in the main key application areas [3]:

Table 1.1 Table representing key application areas

KEY APPLICATION AREAS                         | NLP TOOLS COMMONLY USED
Machine Learning and Data Mining              | Weka (implements algorithms like Naive Bayes and Support Vector Machines), Apache Lucene Mahout
Information Extraction                        | Mallet, MinorThird, GATE
Text Classification                           | NLTK, LingPipe
Named entity analysis & co-reference analysis | OpenNLP (uses the maxent machine learning package)
Approximate String Matching                   | SecondString, Simmetrics, LingPipe
WordNet interfaces                            | Java WordNet Library (JWNL), MIT Java Wordnet Interface (JWI)
Question Answering                            | OpenEphyra
Speech recognition & OCR error correction     | OpenFST
Parallel Language Parsing                     | Dan Bikel's Multilingual Parser (English, Arabic, Chinese and soon Korean)

1.2.1 NLP tools

LingPipe - It is a suite of Java tools for linguistic processing of text, including entity extraction, part-of-speech (POS) tagging, clustering, classification, etc. It is known for its speed, stability, and scalability. One of its best features is the extensive collection of well-written tutorials to help you get started. LingPipe [1][6] is released under a royalty-free commercial


license that includes the source code, but it is not technically open source.

Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It has a full GNU GPL license.

OpenFST - A package for manipulating weighted finite-state automata. These are often used to represent probabilistic models. They are used to model text for speech recognition, OCR error correction, machine translation and a variety of other tasks.

NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, of the University of Melbourne.

Dan Bikel's Multilingual Statistical Parser - A parallel statistical parsing engine for 

English, Arabic, Chinese, and soon Korean.

1.3 Organization of the Report

This report consists of six chapters dealing with the process of designing Java API for NLP

tools. The first two chapters deal with the introduction of the project and the background

research carried out in order to provide a better understanding of the topic. Then in the

subsequent chapters, further details about the methods and a description of each tool along

with the test cases are defined. The approaches used to design each class of the API are

also specified. The various algorithms used are stated along with the flow diagrams and

class descriptions. The last chapter concludes the report by summarizing the work that has

 been done till now.


CHAPTER 2

LITERATURE SURVEY

‘Naturally occurring texts’ can be of any language, mode, genre, etc. The texts can be oral or written. The only requirement is that they be in a language used by humans to communicate with one another. Also, the text being analyzed should not be specifically constructed for the purpose of the analysis; rather, the text should be gathered from actual usage.

‘Human-like language processing’[1] reveals that NLP is considered a discipline within

Artificial Intelligence (AI). And while the full lineage of NLP does depend on a number of 

other disciplines, since NLP strives for human-like performance, it is appropriate to

consider it an AI discipline.

2.1 Natural Language Processing

 Natural Language Processing (NLP) is the computerized approach to analyzing text that is

 based on both a set of theories and a set of technologies. The definition we offer is: Natural

Language Processing is a theoretically motivated range of computational techniques for 

analyzing and representing naturally occurring texts at one or more levels of linguistic

analysis for the purpose of achieving human-like language processing for a range of tasks

or applications.


2.2 Tokenizer

Tokenization, that is, the identification of each “atomic” unit, represents the very first operation to be performed in document processing; nevertheless, it is often overlooked because of its supposedly basic nature. Despite the apparent simplicity of the issue at stake, no readily available solution or standard exists for character stream tokenization.

In NLP, tokenization can be defined as the task of splitting a stream of characters into words. However, very often it is associated with lower- or upper-level processes [1]. Even if there exists a tendency to gather both tasks under the vague label “pre-processing”, tokenization nevertheless differs from preliminary “cleaning procedures”. Conversely, additional preprocessing is often associated with the segmentation into word units: acronym and abbreviation recognition, hyphenation checking, number standardization, etc. Some tokenizers even include the delimitation of textual units such as sentences, paragraphs, notes, and so on.

2.2.1 Need of tokenization

The major question of the tokenization phase is: what are the correct tokens to use? At first sight it looks fairly trivial: chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, consider the various uses of the apostrophe for possession and contraction in the following sentence: Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.

For O'Neill, which of the following is the desired tokenization?

neill
oneill
o'neill
o' neill
o neill ?

And for aren't, is it:


aren’t

arent

are n’t

aren t ?

A simple strategy is to just split on all non-alphanumeric characters, but while o and neill look okay, aren and t look intuitively bad. These issues of tokenization are language-specific; it thus requires the language of the document to be known. Language identification [2] based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns.
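The naive strategy just described can be demonstrated in a few lines of Java; the class name is illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Splits on every run of non-alphanumeric characters, reproducing the
// "o" / "neill" and "aren" / "t" behaviour discussed above.
public class NaiveSplit {

    public static List<String> split(String text) {
        return Arrays.stream(text.split("[^\\p{Alnum}]+"))
                     .filter(s -> !s.isEmpty()) // drop empty fragments
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("Mr. O'Neill thinks the stories aren't amusing."));
        // [Mr, O, Neill, thinks, the, stories, aren, t, amusing]
    }
}
```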

For most languages and particular domains within them there are unusual specific tokens

that we wish to recognize as terms, such as the programming languages C++ and C#,

aircraft names like B-52, or a T.V. show name such as M*A*S*H. Computer technology

has introduced new types of character sequences that a tokenizer should probably tokenize

as a single token, including email addresses ([email protected]), web URLs

(http://stuff.big.com/new/specials.html), numeric IP addresses (142.32.48.231), package

tracking numbers (1Z9999W99845399981), and more. In English, hyphenation is used for various purposes ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard) to a copyediting device to show word grouping (the hold-him-back-and-drag-him-away maneuver). It is easy to feel that the first example should be regarded as one token, the last should be separated into words, and that the middle case is unclear.
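Recognizing such special sequences as single tokens is typically done with patterns. The patterns below are illustrative assumptions, not definitions from the proposed API:

```java
import java.util.regex.Pattern;

// Illustrative patterns for character sequences that should survive
// tokenization as single tokens: e-mail addresses, web URLs, and
// numeric IP addresses.
public class SpecialTokens {

    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    static final Pattern URL   = Pattern.compile("https?://\\S+");
    static final Pattern IP    = Pattern.compile("(\\d{1,3}\\.){3}\\d{1,3}");

    public static boolean isSingleToken(String s) {
        return EMAIL.matcher(s).matches()
            || URL.matcher(s).matches()
            || IP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("142.32.48.231"));                          // true
        System.out.println(isSingleToken("http://stuff.big.com/new/specials.html")); // true
    }
}
```

In practice such patterns would be tried before the general splitting rules, so that a matching span is emitted whole.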

2.2.2 Existing approaches with recurring problems[7]

Handling hyphens automatically can thus be complex: it can either be done as a

classification problem, or more commonly by some heuristic rules, such as allowing short

hyphenated prefixes on words, but not longer hyphenated forms.

Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait) and compounds that are sometimes written as a single word and sometimes space-separated (such as white space vs. whitespace).


Other cases with internal spaces that we might wish to regard as a single token include

 phone numbers ((800) 234-2333) and dates (Mar 11, 1983). Splitting tokens on spaces can

cause bad retrieval results, for example, if a search for York University mainly returns

documents containing New York University.

The problems of hyphens and non-separating whitespace can even interact. Advertisements for air fares frequently contain items like San Francisco-Los Angeles, where simply doing whitespace splitting would give unfortunate results. One effective strategy in practice, which is used by some Boolean retrieval systems such as Westlaw and Lexis-Nexis, is to encourage users to enter hyphens wherever they may be possible, and whenever there is a hyphenated form, the system will generalize the query to cover all three of the one-word, hyphenated, and two-word forms, so that a query for over-eager will search for over-eager OR "over eager" OR overeager.

However, this strategy depends on user training, since if you query using any of the other available forms, you get no generalization. Since there are multiple possible segmentations of character sequences, all such methods make mistakes sometimes, and so you are never guaranteed a consistent unique tokenization. The other approach is to abandon word-based indexing and to do all indexing via just short subsequences of characters (character n-grams), regardless of whether particular sequences cross word boundaries or not.
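The hyphenated-query generalization described above can be sketched as follows; the class name and query syntax are illustrative, not those of any particular retrieval system:

```java
// Expands a hyphenated query term to the hyphenated, space-separated
// (phrase) and fused one-word forms, OR-ed together.
public class HyphenQueryExpander {

    public static String expand(String term) {
        if (!term.contains("-")) {
            return term; // nothing to generalize
        }
        String spaced = term.replace("-", " ");
        String fused = term.replace("-", "");
        return term + " OR \"" + spaced + "\" OR " + fused;
    }

    public static void main(String[] args) {
        System.out.println(expand("over-eager"));
        // over-eager OR "over eager" OR overeager
    }
}
```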

2.3 Stop Words

Stop words are words which are filtered out prior to, or after, processing of natural language data (text) [5]. Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept in his design. Stop word filtering is controlled by human input and is not automated. Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely; these words are called stop words.

2.3.1 Significance of Stop Word list


The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing. Using a stop list significantly reduces the number of postings that a system has to store.

The general trend in IR systems [2] over time has been from standard use of quite large stop lists (200-300 terms), to very small stop lists (7-12 terms), to no stop list whatsoever. Web search engines generally do not use stop lists. Some of the design of modern IR systems has focused precisely on how we can exploit the statistics of language so as to be able to cope with common words in better ways. For most modern IR systems, the additional cost of including stop words is not that large, in terms of either index size or query processing time.
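The frequency-based strategy above can be sketched in a few lines; the hand-filtering step that follows in practice is omitted, and the class name is illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sorts terms by collection frequency and returns the k most frequent
// as a candidate stop list.
public class StopListBuilder {

    public static List<String> topTerms(List<String> tokens, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t.toLowerCase(), 1, Integer::sum);
        }
        List<String> terms = new ArrayList<>(freq.keySet());
        terms.sort((a, b) -> freq.get(b) - freq.get(a)); // most frequent first
        return terms.subList(0, Math.min(k, terms.size()));
    }

    public static void main(String[] args) {
        List<String> toks = List.of("the", "cat", "the", "dog", "the", "cat");
        System.out.println(topTerms(toks, 2)); // [the, cat]
    }
}
```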

2.3.2 Problems encountered in Search Engines

Stop words can cause problems when using a search engine to search for phrases that include them, particularly in names such as 'The Who', 'The The', or 'Take That'. There is no definite list of stop words which all natural language processing (NLP) tools incorporate. Not all NLP tools use a stop list; some tools specifically avoid using one to support phrase search. The phrase query "President of the United States", which contains two stop words, is more precise than President AND "United States". The meaning of "flights to London" is likely to be lost if the word 'to' is stopped out. A search for Vannevar Bush's article "As we may think" will be difficult if the first three words are stopped out and the system searches simply for documents containing the word think. Some special query types are disproportionately affected: some song titles and well-known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don't want to be, ...).

In more precise terms, some constraints on stop word removal may be as follows:

• All of the words in a query are stop words. If all the query terms are removed during stop word processing, then the result set is empty. To ensure that search results are returned, stop word removal is disabled when all of the query terms are stop words. For example, if the word car is a stop word and you search for car, then the search results contain documents that match the word car. If you search for car xylo, the search results contain only documents that match the word xylo.

• The word in a query is preceded by the plus sign (+).

• The word is part of an exact match.

• The word is inside a phrase, for example, "I love my car".
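The first constraint above, disabling stop word removal when every query term is a stop word, can be sketched as follows; the stop list contents and class name are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Removes stop words from a query, but leaves the query untouched when
// every term is a stop word, so the result set is not empty.
public class QueryStopWordFilter {

    private static final Set<String> STOP_WORDS = Set.of("to", "for", "of", "the", "car");

    public static List<String> filter(List<String> queryTerms) {
        List<String> kept = new ArrayList<>();
        for (String term : queryTerms) {
            if (!STOP_WORDS.contains(term.toLowerCase())) {
                kept.add(term);
            }
        }
        // all terms were stop words: disable removal entirely
        return kept.isEmpty() ? queryTerms : kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("car")));         // [car]
        System.out.println(filter(List.of("car", "xylo"))); // [xylo]
    }
}
```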

2.4 Word Frequency Counter

The Word Frequency Counter counts the frequency of words from a single file, multiple files or the clipboard. Its many options make it a very useful word counting tool for language analysis and learning [1]. Search engines and directories now use artificial intelligence to analyze the web as they produce sorted search results. Knowing the frequency of the words used in a web page gives the user more of an idea of how these tools work when processing natural languages.

The Word Frequency Counter enables the user to:

Define words. A word is made up of characters from an alphabet, but there are some characters that you might or might not want to include in a word definition, such as & or -.

Define word separators. Word separators are used to divide language into individual words (text segmentation). The space character and punctuation (in the English language) are the most important word separators, but you also need to decide whether you want to use characters such as & or - as separators.

Count words from the clipboard, directories and sub-directories. The tool can count word frequencies from a single file, the clipboard, or all files in a directory and its subdirectories.
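The core counting step can be sketched over a single string as follows; file, directory and clipboard handling and user-defined separators, all described above, are omitted, and the class name is illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

// Counts occurrences of each word in a text, splitting on runs of
// characters that are neither letters nor digits.
public class WordFrequency {

    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("To be, or not to be"));
        // {be=2, not=1, or=1, to=2}
    }
}
```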


  CHAPTER 3

TOKENIZER TOOL

This chapter is devoted to presenting the various phases, namely design, methodology, implementation and testing, of the Tokenizer tool. It aims at splitting text into simple tokens based upon a set of delimiters provided by the user, or else a default list is used. Although the issues concerned are language-specific, this tool tries to minimize the impact of the distinct signature patterns that most languages have by giving the user freedom over textual delimitation units [6].

3.1 Problem Identification and Elimination

This tool provides solutions to the following problems, which were not addressed by the existing tokenizer classes:

• Identification of abbreviation as a single token

Certain applications require an abbreviation to be identified as a single atomic unit. This tool returns an abbreviation as a single token even while considering the period as a delimiter. For example, it will return I.U.P.A.C as a single token, IUPAC.

• Distinguishing token and non token delimiters 

There are certain characters and punctuation marks which, though acting as delimiters for the tokenizing process, need to be returned as tokens. For example, @ and . in [email protected]. This tool distinguishes between token and non-token delimiters by enabling the user to explicitly define the two different kinds of delimiters.

• Ability to return empty tokens 


In the case when two delimiters are encountered consecutively, an empty token needs to be returned. For example, in a database entry statement, the presence of a null token specifies that the corresponding null entry has to be made in the database. This tool facilitates the returning of empty tokens by returning a space character whenever required.

3.2 Algorithm and Flowchart

The following steps are used for tokenization:

Step 1. Input the text, non-token delimiters, token delimiters and empty-returned parameters.

Step 2. Set them to their respective class variables. If there is no specification for delimiters, then take a default list for non-token delimiters and set token delimiters to null.

Step 3. Set the initial position = 0.

Step 4. Repeat steps 5-7 for the entire text.

Step 5. Set working position = position.

Step 6. Use the advance-position function to find the position of the next delimiter.

Step 7. If the position has changed, i.e. position != working position, then return the substring from working position to position as a token. If the token is null and empty-returned is true, then return a null token.

Advance-position function

Step 1. For i = position + 1 to the entire text length, repeat steps 2 and 3.

Step 2. If the character at index i belongs to the token delimiters or non-token delimiters, then return i as the new position.

Step 3. Otherwise, skip the position.

The steps are explained with the help of the flowchart in Fig. 3.1.


Fig.3.1 Flowchart for Tokenizer
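The stepwise algorithm above can be sketched in Java as follows. This is a simplified illustration that handles only non-token delimiters; the actual class additionally supports token delimiters and empty-token returns:

```java
import java.util.ArrayList;
import java.util.List;

// Scan the text, advance the position to the next delimiter, and emit
// the intervening substring as a token (steps 3-7 above).
public class TokenizerSketch {

    public static List<String> tokenize(String text, String nonTokenDelims) {
        List<String> tokens = new ArrayList<>();
        int position = 0;
        while (position < text.length()) {
            int workingPosition = position;
            // advance position to the next delimiter (or end of text)
            while (position < text.length()
                   && nonTokenDelims.indexOf(text.charAt(position)) < 0) {
                position++;
            }
            if (position != workingPosition) {
                tokens.add(text.substring(workingPosition, position));
            }
            position++; // skip the delimiter itself
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("red,green;;blue", ",;"));
        // [red, green, blue]
    }
}
```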

3.3 Class Description

The specification of the attributes, constructors and methods of the class Tokenizer, represented by Fig. 3.2, is as follows:


3.3.1 Attributes

text (datatype String)
For storing the string to be tokenized.

filein (datatype BufferedReader)
For reading input from a file.

fileout (datatype BufferedWriter)
For storing the result into a file.

strlength (datatype Integer)
For storing the length of the string.

nonTokenDelims (datatype String)
For storing the set of non-token delimiters.

tokenDelims (datatype String)
For storing the set of token delimiters.

position (datatype Integer)
For representing the position at which we should start looking for the next token. Value = the position of the character immediately following the end of the last token, or value = -1 if the entire string has been examined.

emptyReturned (datatype Boolean)
Value = true if an empty token should be returned or if the last token returned was an empty token.

returnEmptyTokens (datatype Boolean)
For determining whether empty tokens should be returned.

delimsChangedPosition (datatype Integer)
For indicating the position at which the delimiters were last changed.

tokenCount (datatype Integer)
Value = -1 if tokens have not been counted, else >= 0.

3.3.2 Constructors

Parameters used in constructor are as follows:

• text - a string to be parsed.

• nonTokenDelims - the non-token delimiters, i.e. the delimiters that only separate

tokens and are not returned as separate tokens.


• tokenDelims - the token delimiters, i.e. delimiters that both separate tokens, and

are themselves returned as tokens.

• returnEmptyTokens - true if empty tokens may be returned; false otherwise.

• filein – is the file from which text to be tokenized is read.

• fileout – is the file into which tokenized result is written.

Following are the constructors defined in the class:

public StringTokenizer(String text, String nontokenDelims, String 

tokenDelims, boolean returnEmptyTokens)

It is the primary constructor, which constructs a string tokenizer for the specified string. Both token and non-token delimiters are specified, as is whether or not empty tokens are returned. Empty tokens are the tokens that lie between consecutive delimiters. The current position is set at the beginning of the string.

public StringTokenizer(String text, String nontokenDelims, String tokenDelims)
It is equivalent to StringTokenizer(text, nontokenDelims, tokenDelims, false).

public StringTokenizer(String text, String nontokenDelims)
It is equivalent to StringTokenizer(text, nontokenDelims, null, false).

public StringTokenizer(String text)

It is equivalent to StringTokenizer(text, " \t\n\r\f", null).

public StringTokenizer(BufferedReader filein, BufferedWriter fileout, String

nontokenDelims, String tokenDelims, boolean returnEmptyTokens)

It is the primary constructor which constructs a string tokenizer for the specified input

and output file.


public StringTokenizer(BufferedReader filein, BufferedWriter fileout, String

nontokenDelims, String tokenDelims)

It is equivalent to StringTokenizer(filein, fileout, nontokenDelims, tokenDelims, false).

public StringTokenizer(BufferedReader filein, BufferedWriter fileout, String

nontokenDelims)

It is equivalent to StringTokenizer(filein, fileout, nontokenDelims, null, false).

public StringTokenizer(BufferedReader filein, BufferedWriter fileout)

It is equivalent to StringTokenizer(filein, fileout, " \t\n\r\f", null).
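For comparison, the standard java.util.StringTokenizer offers only a single returnDelims flag, so every delimiter is either returned as a token or dropped; the constructors above generalize this into two separate delimiter lists. A quick look at the standard behaviour:

```java
import java.util.StringTokenizer;

public class StdTokenizerDemo {
    public static void main(String[] args) {
        // returnDelims = true: delimiters come back as one-character
        // tokens, like this project's "token delimiters".
        StringTokenizer st = new StringTokenizer("a+b+c", "+", true);
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
        // prints a, +, b, +, c on separate lines
    }
}
```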

3.3.3 Methods

public void setText(String text)

It sets the text to be tokenized in this StringTokenizer.

• Parameter 

Text- string to be tokenized.

private void setDelims(String nontokenDelims, String tokenDelims)

It sets the delimiters for the StringTokenizer.

• Parameters

nontokenDelims: list of delimiters that are not returned as tokens.

tokenDelims: list of delimiters that should be returned as tokens.

public boolean hasMoreTokens()
It returns true if there is at least one token in the string after the current position, and false otherwise. If this method returns true, then a subsequent call to nextToken with no argument will successfully return a token.


public String nextToken()
It returns the next token from the string tokenizer. The current position is set after the token returned. It throws NoSuchElementException if there are no more tokens in this tokenizer's string.

public boolean skipDelimiters()
It returns true if there are more tokens, false otherwise. It advances the current position so that it is before the next token. This method skips non-token delimiters but does not skip token delimiters.

public int countTokens()

It returns the number of tokens remaining in the string using the current delimiter set.

private boolean advancePosition()

It returns true if a token has to be returned, false otherwise.

It advances the state of the tokenizer to the next token or delimiter. This method only modifies the class variables position and emptyReturned. If there are no more tokens, the state of these variables does not change at all.

private int indexOfNextDelimiter(int start)
It returns the index of the next delimiter in the text, at or after the given start position.
• Parameter
start: index in text at which to begin the search.

public boolean hasMoreElements()

It returns the same value as the hasMoreTokens() method (true if more tokens are available). It exists so that this class can implement the Enumeration interface.


public Object nextElement()
It returns the same value as the nextToken() method (the next token in the string), except that its declared return type is Object rather than String. It exists so that this class can implement the Enumeration interface.

public boolean hasNext()

It returns the same value as the hasMoreTokens() method (true if there are more tokens; false otherwise). It exists so that this class can implement the Iterator interface.

public Object next()
It returns the same value as the nextToken() method (the next token in the string), except that its declared return type is Object rather than String. It exists so that this class can implement the Iterator interface.

public void setReturnEmptyTokens(boolean returnEmptyTokens)

It sets whether empty tokens should be returned in the tokenizing process from this point onwards.

• Parameter 

returnEmptyTokens - true if empty tokens should be returned.

public String[] toArray()

It returns a string array of remaining tokens.

public void tokenize()

It writes an output file containing tokens of input file.
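The Enumeration and Iterator bridge methods above follow the same contract as the standard java.util.StringTokenizer, which also implements Enumeration. A sketch of the typical consumption patterns, shown here with the standard class:

```java
import java.util.Enumeration;
import java.util.StringTokenizer;

public class TokenLoopDemo {
    public static void main(String[] args) {
        StringTokenizer st = new StringTokenizer("the quick fox");
        System.out.println(st.countTokens()); // 3, without consuming tokens

        // hasMoreTokens()/nextToken() pattern described above
        StringBuilder sb = new StringBuilder();
        while (st.hasMoreTokens()) {
            sb.append(st.nextToken()).append('|');
        }
        System.out.println(sb); // the|quick|fox|

        // Enumeration view: hasMoreElements()/nextElement()
        Enumeration<Object> e = new StringTokenizer("a b");
        while (e.hasMoreElements()) {
            System.out.println(e.nextElement());
        }
    }
}
```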

 


Fig.3.2 Class diagram for the class tokenizer


3.4 Assumptions and Dependencies

The assumptions and dependencies are listed as follows:

1. It has been assumed that the input text is not null; otherwise an exception is thrown.
2. A default set of non-token delimiters is taken if neither token nor non-token delimiters are specified by the user.

3.5 Constraints

This tokenizer is only applicable to English-language text. It is not designed to follow any grammatical rules and does not consider the semantics of a language.

3.6 Result

This section gives an idea of the user interfaces and the various test cases used to demonstrate the working of the project.

Fig. 3.3 shows the main page of the user interface that is used for each module. It contains five buttons, each representing an NLP tool designed. On the click of a button, a new user interface appears corresponding to the tool.


Fig.3.3 Snapshot of user interface

When the user clicks on the tokenizer tool button, the user interface for the tokenizer opens up. It has four text areas, a submit button and a refresh button. Three text areas are for input (the string to be tokenized, the token delimiters and the non-token delimiters) and one text area is for the output tokens. Fig. 3.4 represents the tokenizer tool interface.

 


Fig 3.4 Snapshot of Tokenizer Tool Interface

Test Case 1

On click of the submit button, the text entered by the user is tokenized on the basis of the token and non-token delimiters specified by the user, as represented in Fig 3.5.


Fig 3.5 Test Case 1 Snapshot

Test Case 2

On click of the browse button, a window pops up to browse to the location of the input file to be tokenized, as represented in Fig 3.6.


Fig 3.6 Test case 2 snapshot -browsing an input file

On click of the submit button, the input file is tokenized into the selected output file, as represented in Fig 3.7.


Fig 3.7 Test case 2 snapshot –selecting an output file

This tool tokenizes character streams of input data into a set of single atomic units called tokens. Existing problems in the tokenization process have been identified and eliminated. Provisions for providing separate lists of token and non-token delimiters have been made to distinguish between the two types of delimiters. Features like returning empty tokens and identifying an abbreviation as a single token are included. It provides the user with the freedom of choosing the delimitation units without textual dependence.

 

CHAPTER 4


STOP WORD REMOVER 

Stop words are words used as fillers in order to construct a proper grammatical sentence from the text. The stop word remover aims at removing all stop words by maintaining a list of stop words in a database, which can be manipulated by the user.

4.1 Algorithm and Flowchart

Following are the steps for stop word removal:

Step 1. Input text/file and a list of stop words.

Step 2. Apply tokenization on the input.

Step 3. Check whether user has loaded a list of stop words. If not, load a default list.

Step 4. Each word in the given text is matched against the list of stop words.

Step 5. If the word matches, then it is eliminated from the text.

Step 6. Output is now a text without any stop words.
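A minimal, self-contained Java sketch of steps 2-5 above, with a hard-coded stop list standing in for the database or file described in the text:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class StopWordSketch {
    /** Tokenizes on spaces, drops every token found in the stop list,
     *  and rebuilds the remaining text. Matching is case-sensitive,
     *  as noted in the constraints section. */
    public static String removeStopWords(String text, Set<String> stopWords) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : text.split(" ")) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "is", "a"));
        System.out.println(removeStopWords("the parser is a tool", stops));
        // parser tool
    }
}
```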

Fig.4.1 shows the flowchart of stopword removal as follows.


Fig.4.1 Flowchart for Stop word removal

4.2 Class Description

The specification of the attributes, constructors and methods of the class StopWordRemover, represented by Fig. 4.2, is as follows:

4.2.1 Attributes


filein(datatype BufferedReader)

For storing reference of user input text file.

filestop(datatype BufferedReader)

For storing reference of user stop word list text file.

fileout(datatype BufferedWriter)

For storing reference of user output text file.

defaultStopWordsList(datatype String)

For storing the list of default stop words.

 

4.2.2 Constructors

Parameters used in constructor are as follows:

• filein – the file from which the input text is read.

• filestop – the file in which the stop words to be removed are specified.

• fileout – the file into which the result is written.

Following are the constructors used in StopWordRemover class-

public StopWordRemover(BufferedReader filein, BufferedReader filestop,

BufferedWriter fileout)

It defines a StopWordRemover which stores the reference of an input text file in the

class attribute filein, the list of stop words in the class attribute filestop and reference of 

output file in the class attribute fileout.


public StopWordRemover(BufferedReader filein, BufferedWriter fileout)

It defines a StopWordRemover which stores the reference of an input text file in the

class attribute filein, reference of output file in the class attribute fileout and the list of 

stop words used is default stop word list.

4.2.3 Methods

public void removeStopWord()

It removes stop words from the given input text file and writes the final result in the

output file specified by the user.

Fig.4.2 Class diagram for class StopWordRemover

4.4 Assumptions and Dependencies


The assumptions and dependencies are listed as follows:

1. It has been assumed that the input text is not null; otherwise an exception is thrown.
2. Only the space character is used as a non-token delimiter to tokenize the input text file.
3. The input stop word list text file should have only one stop word per line. If the user does not specify a stop word list, a default stop word list is taken.

4.5 Constraints

This StopWordRemover is only applicable to English-language text. It is not designed to follow any grammatical rules and does not consider the semantics of a language. This tool is dependent on the case sensitivity of the stop word list and the input text.

4.6 Result

The class for the stop word remover has been designed with the specification of all basic constructors, methods and related attributes. Testing of the module will be done in the next semester, along with any subsequent improvement of the class specification if needed. The default list of stop words being used is specified in Appendix A.


CHAPTER 5

WORD FREQUENCY COUNTER 

The aim of this tool is to count the frequency of words in the given text. The input text is first tokenized by the tokenizer, and then the corresponding frequencies of the tokens are generated using the frequency counter.

5.1 Algorithm and Flowchart 

The following steps are used for counting the frequency of words:

Step 1. Input the text.
Step 2. Tokenize the input text and save the tokens in an ArrayList, InArrayList.
Step 3. Create an empty ArrayList, OutArrayList, to store the output.
Step 4. Repeat steps 5-8 for each element e of InArrayList.
Step 5. Initialize a variable fc = 0.
Step 6. Repeat step 7 for each element n of OutArrayList.
Step 7. If e = n, then increment the frequency of element n by 1 and set fc = 1.
Step 8. If fc = 0, then add e to OutArrayList with frequency = 1.
Step 9. The final output is OutArrayList, holding the tokens with their corresponding frequencies.
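An equivalent map-based sketch of the steps above; here a LinkedHashMap plays the role of OutArrayList and preserves first-seen order. This is an illustration only, not the project's ArrayList implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FrequencySketch {
    /** For each token, bump its count if already seen,
     *  otherwise add it with frequency 1. */
    public static Map<String, Integer> countFrequencies(String text) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : text.split(" ")) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(countFrequencies("to be or not to be"));
        // {to=2, be=2, or=1, not=1}
    }
}
```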


The steps are explained with the help of the following flowchart in Fig. 5.1.

Fig.5.1 Flowchart for Word Frequency Counter

5.2 Class Description

The specification of the attributes, constructors and methods of the class FrequencyCounter, represented by Fig. 5.2, is as follows:


Fig.5.2 Class diagram for WordFrequencyCounter

5.2.1 Attributes

filein(datatype BufferedReader)

For storing reference of user input text file.

fileout(datatype BufferedWriter)

For storing reference of user output text file.

5.2.2 Constructors

Parameters used in constructor are as follows:

• filein – the file from which the input text is read.

• fileout – the file into which the result is written.

The constructors used are as described below:

public FrequencyCounter(BufferedReader filein, BufferedWriter fileout)

It defines a FrequencyCounter which stores the reference of an input text file in the

class attribute filein and reference of output file in the class attribute fileout.


5.2.3 Methods

public void countWordFrequency()

It counts the frequency of words in the given input text file and writes the final result in

the output file specified by the user.

The class for the word frequency counter has been designed with the specification of all basic constructors, methods and related attributes. Testing of the module will be done in the next semester, along with any subsequent improvement of the class specification if needed.


CHAPTER 6

CONCLUSION

So far, the tokenizer tool has been implemented and tested. Existing problems have been identified and eliminated. The basic approaches for the stop word remover and the word frequency counter have been designed and are being worked upon. Simultaneously, existing algorithms for the next modules are being studied for the problem identification and elimination phase.

6.1 Agenda for the next semester

In the next semester, the first target will be testing of the stop word remover and the word frequency counter. After completion of these modules, the focus will shift to the design and implementation of the stemmer and the n-gram identifier; if time permits, the WSD (word sense disambiguation) and multigram modules will also be developed.


APPENDIX A

LIST OF STOP WORDS

a

about

all

am

an

and

any

are

as

at

be

because

between

both

but

by

can

could

do

for 

from

had

has

have

he

her 

here

him

his

how

if 

in

is

it

me

my

no

not

of 

on

or 

she

so

than

that

the

this

to

until

up

was

we


what

when

where

which

who

whom

why

with

you

your 


References

1. Christopher D. Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", p. xxxi, MIT Press (1999), ISBN 978-0-262-13360-9.

2. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, "Introduction to Information Retrieval", chapter on Stemming and Lemmatization, Cambridge University Press (2008). http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

3. J. Dalton, "Java Open Source NLP and Text Mining tools", 16 March 2008. http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html

4. Jaiswal, Singh, Gupta and Srivastava, "Word Sense Disambiguation", B.Tech Report, U.P. Technical University, 2009.

5. "Natural Language Processing", MediaWiki version 1.16wmf4 (r71783), 27 August 2010. http://en.wikipedia.org/wiki/Natural_language_processing

6. B. Padhi, "Improve tokenization of information-rich text", 15 June 2001. http://www.javaworld.com/javaworld/javatips/jw-javatip112.html

7. "Tokenizer", MediaWiki version 1.16wmf4 (r71783), 18 August 2010. http://en.wikipedia.org/wiki/Lexical_analysis

 

