AN ENHANCED TEXT DOCUMENT
CLASSIFICATION BASED ON TERMS AND
SYNONYMS RELATION
A THESIS REPORT
Submitted by
PRANEETHA K.
Under the guidance of
Dr. ANGELINA GEETHA
in partial fulfillment for the award of the degree of
MASTER OF PHILOSOPHY in
COMPUTER SCIENCE
B.S.ABDUR RAHMAN UNIVERSITY (B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956) www.bsauniv.ac.in
December 2012
BONAFIDE CERTIFICATE
Certified that this thesis report AN ENHANCED TEXT DOCUMENT
CLASSIFICATION BASED ON TERMS AND SYNONYMS RELATION is the
bonafide work of PRANEETHA K. (RRN: 1145213) who carried out the thesis work
under my supervision. Certified further, that to the best of my knowledge the work
reported herein does not form part of any other thesis report or dissertation on the
basis of which a degree or award was conferred on an earlier occasion on this or any
other candidate.
SIGNATURE
Dr. ANGELINA GEETHA
SUPERVISOR
Professor & Head
Dept. of Computer Science & Engg.
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048

SIGNATURE
Dr. P. SHEIK ABDUL KHADER
HEAD OF THE DEPARTMENT
Professor & Head
Dept. of Computer Applications
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048
ACKNOWLEDGEMENT
I would like to start by thanking Dr. V. M. Periasamy, Registrar, B. S.
Abdur Rahman University for providing an excellent infrastructure and
facilities to carry out my course successfully.
I owe my deepest gratitude to Dr. Angelina Geetha, Professor & Head,
Department of Computer Science and Engineering, B. S. Abdur Rahman
University for her valuable advice and guidance in carrying out this research
work. I will always remain indebted to her for the moral support,
encouragement and enthusiastic motivation that she instilled in me.
With great pleasure and acknowledgement, I extend my profound
thanks to Dr. P. Sheik Abdul Khader, Head, Department of Computer
Applications, B. S. Abdur Rahman University who has been a constant
source of inspiration throughout this work.
I extend my sincere thanks to my class advisor Dr. A. Jaya,
Professor, Department of Computer Applications, B. S. Abdur Rahman
University for her constant support and encouragement.
I am also thankful to all the staff members of the department for their
full cooperation and help.
ABSTRACT
Data mining is the discovery of knowledge and useful information from
large amounts of data stored in databases. Since a large portion of the
available data is stored in text databases, the field of text mining is gaining
importance. The text databases are rapidly growing due to the increasing
amount of data available in electronic form such as digital libraries, World
Wide Web, electronic repositories etc. Due to this vast amount of digitized
text, classification systems are used increasingly often in text mining to
analyse texts and to extract the knowledge they contain. Text classification (also called
text categorization) is a process that assigns a text document to one of a set
of predefined classes.
Most of the existing classification systems use the Bag-of-Words model,
which classifies a text document based on the number of occurrences of its
component words and ignores the fact that different words may be used to
express a similar concept. Hence this model suffers from the problem of
synonymy, which arises when different words have similar meanings. The
proposed approach classifies text documents by enriching the Bag-of-Words
data representation with synonyms. This approach uses WordNet, a lexical
database of English, to extract the synonyms of all the key terms in a text
document, and then combines them with the key terms to form a new
representative vector. As a result, the system counts the occurrences of both
a key term and its corresponding synonyms in the document during
classification, thereby reducing the synonymy problem.
The performance of the proposed system is evaluated on the
20Newsgroups data corpus in comparison with two classification
approaches, i.e., the synonym frequency approach and the term frequency
approach. The experimental results showed that classifying with the sum of
each term's frequency and its synonyms' frequencies improves the
performance of the classification system compared with the other two
approaches.
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.
ABSTRACT iv
LIST OF TABLES viii
LIST OF FIGURES ix
LIST OF ABBREVIATIONS x
1. INTRODUCTION 1
1.1 TERMS AND TERMINOLOGIES 2
1.1.1 Data 2
1.1.2 Information 2
1.1.3 Knowledge 3
1.1.4 Data Mining 3
1.1.5 Text Mining 4
1.1.6 Information Extraction 5
1.1.7 Information Retrieval 6
1.1.8 Text Retrieval 7
1.1.9 Text Classification 8
1.1.10 Vector Space Model 9
1.1.11 Natural language Processing 10
1.1.11.1 Text Tokenization 10
1.1.11.2 Removal of Stop Words 11
1.1.11.3 Stemming 11
1.2 THESIS ORGANIZATION 11
2. LITERATURE REVIEW 12
3. DEVELOPMENT PROCESS 20
3.1 PROBLEM DEFINITION 20
3.2 SYSTEM DESIGN 21
3.2.1 Learning Phase 21
3.2.2 Classification Phase 21
3.2.3 Data Pre-processing 23
3.2.3.1 Tokenization 23
3.2.3.2 Noise Removal 24
3.2.3.3 Stop Word Removal 24
3.2.3.4 Stemming 24
3.2.4 Bag-of-Words 24
3.2.5 WordNet 25
3.2.5.1 WordNet & Synonyms 25
3.3 DETAILED DESIGN 26
3.3.1 Data Flow Diagram for Level 0 26
3.3.2 Data Flow Diagram for Level 1 26
3.3.3 Data Flow Diagram for Level 2 27
3.4 METHODOLOGY 29
3.4.1 Calculation of Sum of Frequencies of Term and its Synonyms 29
3.4.2 Vector Generation by Weighting the Key Terms 31
3.4.3 Calculation of Similarity between Document and Categories Profiles 33
3.5 IMPLEMENTATION 34
3.5.1 Java Programming Language 34
3.5.2 NetBeans IDE 36
3.5.3 Implementation Details 37
4. EXPERIMENTS AND EVALUATION 39
4.1 EVALUATION METRICS 39
4.1.1 Precision and Recall 39
4.1.2 F-measure 41
4.1.2.1 Macro Averaged F-measure 42
4.1.2.2 Micro Averaged F-measure 42
4.2 EXPERIMENTAL DATASET 44
4.3 PERFORMANCE ANALYSIS 45
4.4 INFERENCE FROM THE RESULT 58
5. CONCLUSION AND FUTURE ENHANCEMENT 59
REFERENCES 60
APPENDIX 1: SAMPLE SCREEN SHOTS 67
TECHNICAL BIOGRAPHY 72
LIST OF TABLES
Table No. Table Name Page No.
4.1 Possible Predictions of a Classifier 40
4.2 Details of the 20Newsgroups Categories used for Evaluation 45
4.3 Macro Averaged F-measure Results for 20Newsgroups Categories 47
4.4 Micro Averaged F-measure Results for 20Newsgroups Categories 50
4.5 Macro Averaged F-measure Results based on Size of Categories Profile 53
4.6 Micro Averaged F-measure Results based on Size of Categories Profile 56
LIST OF FIGURES
Figure No. Figure Name Page No.
1.1 Relation between Data, Information and Knowledge 3
1.2 Text Mining Process 4
1.3 Architecture of an Information Extraction System 5
1.4 Architecture of an Information Retrieval System 6
1.5 Text Classification Process 9
3.1 System Architecture Diagram 22
3.2 Data Flow Diagram - Level 0 26
3.3 Data Flow Diagram - Level 1 27
3.4 Data Flow Diagram - Level 2 28
4.1 Macro Averaged F1-score for 20Newsgroups Data Corpus 48
4.2 Micro Averaged F1-score for 20Newsgroups Data Corpus 51
4.3 Macro Averaged F-measure Results for Varying Size of Categories Profile 54
4.4 Micro Averaged F-measure Results for Varying Size of Categories Profile 57
LIST OF ABBREVIATIONS
S. No. ACRONYM EXPANSION
1. KDD Knowledge Discovery from Data
2. tf Term Frequency
3. sf Synonym Frequency
4. idf Inverse Document Frequency
5. tfidf Term Frequency Inverse Document Frequency
6. IE Information Extraction
7. IR Information Retrieval
8. IRS Information Retrieval System
9. NLP Natural Language Processing
10. rMFoM Regularized Maximum Figure-of-Merit
11. MFoM Maximum Figure-of-Merit
12. NL Negative Label
13. NLP Negative Label Propagation
14. SVM Support Vector Machine
15. TP True Positive
16. TN True Negative
17. FP False Positive
18. FN False Negative
19. CI Class Information
20. HMM Hidden Markov Model
21. SOM Self Organizing Map
22. FFCA Fuzzy Formal Concept Analysis
23. HONB Higher Order Naive Bayes
24. IID Independent and Identically Distributed
25. FRAM Frequency Ratio Accumulation Method
26. KNN K-Nearest Neighbors
27. DIFS Distributional Information based Feature Selection
28. BoW Bag of Words
29. DFD Data Flow Diagram
30. API Application Programming Interface
31. WWW World Wide Web
32. JVM Java Virtual Machine
33. CPU Central Processing Unit
34. OS Operating System
35. IDE Integrated Development Environment
36. Java SE Java Platform, Standard Edition
37. Java ME Java Platform, Micro Edition
38. EJB Enterprise JavaBeans
39. GUI Graphical User Interface
40. XML Extensible Markup Language
41. HTML Hyper Text Markup Language
42. CSS Cascading Style Sheets
43. JSP Java Server Pages
44. SQL Structured Query Language
1. INTRODUCTION
Computerization and automated data gathering have resulted in
extremely large data repositories. Hence data mining has gained a great deal
of attention in recent years, due to the availability of huge amounts of data
and the need for converting such data into useful information and knowledge.
Data mining refers to the process of mining or extracting knowledge from
large amounts of data. It is an indispensable step in the Knowledge
Discovery from Data (KDD) process. Knowledge discovery is an iterative
process of data cleaning, data integration, data selection, data
transformation, data mining, pattern evaluation and knowledge
representation.
Text mining is the extraction of useful information from textual
resources using data mining techniques. The data sources used for text
mining may be semi structured or unstructured documents. Basic text mining
tasks include text classification (text categorization), information retrieval, text
clustering, information extraction etc. Nowadays, a large portion of the data is
stored in text databases, which consists of large collections of documents
from various sources such as World Wide Web, digital libraries, news
articles, electronic repositories etc. Since text databases are rapidly growing
due to the increasing amount of information available in electronic form, text
classification systems are used more often in text mining.
Text classification (text categorization) is the task of assigning a text
document to a relevant category. To be classified, each document must first
be turned into a machine-understandable format. The limitation of the
traditional Bag of Words document representation is the problem of
synonymy. This model counts keyword occurrences and ignores the fact
that different words might be used to express a similar concept within the
same document. This representation has to be enhanced so that both the
keywords and their corresponding synonyms in the document are
considered in the classification process.
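The enrichment outlined above can be sketched in a few lines; the small synonym map below is a hypothetical stand-in for the WordNet lookup used in this work, and the function name is illustrative only:

```python
from collections import Counter

# Hypothetical synonym map: a stand-in for the WordNet lookup.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "buy": {"purchase"},
}

def enriched_bag_of_words(tokens):
    """Count each key term together with the occurrences of its synonyms."""
    counts = Counter(tokens)
    enriched = {}
    for term in counts:
        # Sum the term's own frequency with its synonyms' frequencies.
        enriched[term] = counts[term] + sum(
            counts[s] for s in SYNONYMS.get(term, ()))
    return enriched

print(enriched_bag_of_words(["car", "automobile", "buy", "car"]))
```

Here "car" is credited with the occurrence of its synonym "automobile", so its vector entry becomes 3 rather than 2, which is exactly the effect the plain Bag of Words model misses.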
1.1 TERMS AND TERMINOLOGIES
Several basic terms are used in the data mining area. Some of them are
as follows:
1.1.1 Data
Data is the raw material for data processing; it refers to unprocessed
information. Data on its own carries no meaning. Data consists of facts and
statistics used for reference or analysis. It may be numbers, characters,
symbols, images etc., which can be processed by a computer. Data must be
interpreted by a human or machine to derive meaning. In computational
terms, data refers to the computerized representations of models and
attributes of entities.
For example, researchers who conduct a market survey might ask the
public to complete questionnaires about a product or a service. These
completed questionnaires are called data.
1.1.2 Information
Information is the result of interpreting data. When data is
processed, manipulated, organized and presented in such a way as to be
meaningful to the person who receives it, it is called information. In
computational terms, information refers to data that represents the results of
a computational process to which meaning has been assigned by human
beings or computers.
For example, the data collected by the market survey in the form of
questionnaires is processed and analysed in order to prepare a report on the
survey. This resulting report is information.
1.1.3 Knowledge
Knowledge is the application of information. Knowledge is obtained when
information is given meaning by interpretation in some context. In
computational terms, data that represents the result of a computer-based
process such as learning or association is called knowledge.
For example, the application of information gained about the product
based on the report on the market survey to increase its sale is considered to
be knowledge. The relation between data, information and knowledge is
depicted in the Figure 1.1.
Figure 1.1: Relation between Data, Information and Knowledge.
1.1.4 Data Mining
Data mining is the process of extracting previously unknown and
potentially useful information from data stored in repositories. It is an
essential step in the process of knowledge discovery [51].
Knowledge discovery consists of an iterative sequence of the following steps:
Data cleaning – eliminating noisy and inconsistent data.
Data integration – combining multiple data sources.
Data selection – retrieving the data relevant to the analysis task.
Data transformation – transforming the data into a form appropriate for mining.
Data mining – extracting interesting data patterns.
Pattern evaluation – identifying the patterns representing knowledge based on interestingness measures.
Knowledge presentation – using visualization and knowledge representation techniques to present the mined knowledge to the user.
Data mining can be performed on different kinds of data repositories such as
relational databases, data warehouses, text databases, multimedia
databases, data streams etc.
1.1.5 Text Mining
Text mining is the process of extracting useful information from large
amounts of textual resources. The data stored in these resources may be a
combination of structured and unstructured data such as natural language
texts. The text mining process is depicted in Figure 1.2.
Figure 1.2: Text Mining Process
The first step in the text mining process is to pre-process the text to remove
noise and inconsistent data. Thereafter, the text is transformed into a form
suitable for mining. Then attributes are selected and appropriate patterns are
discovered. Finally, the discovered patterns are evaluated based on
interestingness measures. Basic text mining tasks include information
extraction, information retrieval, summarization, visualization etc.
1.1.6 Information Extraction
Information Extraction (IE) is the process of analysing text to
extract facts about pre-specified entities. These facts are then stored in a
database and may be used for further analysis. Figure 1.3 shows the
architecture of a simple information extraction system [51].
Figure 1.3: Architecture of an Information Extraction System.
The system takes the raw text of documents as its input and generates a
list of tuples as its output. It processes each document through a series of
steps. First, the raw text of the document is split into sentences using a
sentence segmenter, and each sentence is then divided into words using a
tokenizer. Thereafter, the part of speech of each word is obtained. In the
next step (named entity detection), each sentence is searched for entities of
interest. Finally, the relation detection step searches for relations between
the different entities in the text. As a result of these steps, the list of related
tuples is extracted.
IE mainly deals with identifying key words or key terms within a textual
document. It produces structured data ready for post processing. While
Information Retrieval (IR) retrieves relevant documents from collections, IE
retrieves relevant information from documents.
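The pipeline described above can be sketched as follows; the regex-based sentence splitter, the toy gazetteer and the co-occurrence relation rule are simplifying assumptions standing in for the trained components of a real IE system:

```python
import re

# Toy gazetteer: a stand-in for trained named entity detection.
ENTITIES = {"Chennai": "LOCATION", "WordNet": "RESOURCE"}

def extract_tuples(raw_text):
    """Sentence segmentation -> tokenization -> entity detection ->
    relation detection, as in the stages described above."""
    tuples = []
    # Sentence segmentation on terminal punctuation followed by a space.
    for sentence in re.split(r"(?<=[.!?])\s+", raw_text.strip()):
        tokens = re.findall(r"\w+", sentence)         # tokenization
        found = [t for t in tokens if t in ENTITIES]  # entity detection
        # Relation detection: relate entities co-occurring in a sentence.
        for i in range(len(found)):
            for j in range(i + 1, len(found)):
                tuples.append((found[i], "co-occurs-with", found[j]))
    return tuples

text = "WordNet was presented at a workshop in Chennai. Nothing else here."
print(extract_tuples(text))
```

The output is the list of tuples the architecture diagram ends with; a real system would replace the gazetteer with a statistical entity detector and the co-occurrence rule with learned relation patterns.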
1.1.7 Information Retrieval
Information Retrieval (IR) is a process that deals with representing,
storing, organizing and accessing the information. It is the technique used to
retrieve information from large collections of documents. There are multiple
ways to represent an IR system (IRS). In their paper, Isabel Volpe et al. [26]
have described a model of the architecture of the IR system depicted in
Figure 1.4.
Figure 1.4: Architecture of an Information Retrieval System.
The goal of Information Retrieval is to select the documents that are most
relevant to a query. IR allows the user to retrieve the relevant documents that
best match a query, but does not specify exactly where the required
information lies.
IR views the text in a document merely as a bag of unordered words.
It deals with different types of information such as text, images, audio, video
etc. The IRS pre-processes the data in the documents using tokenization,
stop word removal and stemming before indexing them. When the user
enters a query, the same pre-processing is applied to the query text and the
query is then matched against the documents. Matching is done by a
similarity measure which assigns each document a similarity score in
response to the query. These scores are used to generate a ranked list of
documents, which is returned to the user as the result of IR.
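The matching and ranking step can be sketched as follows; the overlap-based similarity score is an illustrative stand-in for the similarity measures used in practice:

```python
def score(query_terms, doc_terms):
    """Fraction of query terms present in the document: a simple
    stand-in for a real similarity measure."""
    query = set(query_terms)
    return len(query & set(doc_terms)) / len(query)

def retrieve(query, documents):
    """Rank documents by their similarity score and return their names."""
    terms = query.lower().split()
    ranked = sorted(documents.items(),
                    key=lambda item: score(terms, item[1].lower().split()),
                    reverse=True)
    return [name for name, _ in ranked]

documents = {
    "d1": "text mining extracts knowledge from text collections",
    "d2": "image and audio retrieval systems",
}
print(retrieve("text mining", documents))  # d1 ranks above d2
```

In a full IRS the same tokenization, stop word removal and stemming would be applied to both the query and the documents before scoring.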
1.1.8 Text Retrieval
Text Retrieval (also called document retrieval) is a branch of
information retrieval where the information stored in the form of text is
retrieved. Text retrieval process retrieves the relevant documents by
matching the user query against a set of text documents. A user query can
be a single or multiple sentence description of information.
The purpose of a text retrieval system is to search the text database for
relevant documents. If the database is very large, its response time will be
slow. To overcome this latency, the text database is pre-processed and
stored in a form which supports fast searching. This pre-processed form is
called the text representation and is the form of text provided as input to the
IR system.
Major applications of text document retrieval systems are internet
search engines. To retrieve documents according to the interest of the user,
text classification systems are used. For example, computer science
documents might be classified by subject area, such as Operating Systems,
Data Structures and Artificial Intelligence. People interested in a specific
topic of computer science would find such a classification useful.
1.1.9 Text Classification
Text Classification is also called Text Categorization. It is the task
of classifying a document under a predefined category or class. Text
classification tasks can be divided into the following types:
Supervised document classification – where some external
mechanism provides information on the correct classification for the
documents.
Unsupervised document classification – where classification must be
done entirely without reference to external information.
Semi-supervised document classification – where parts of the
documents are labeled by the external mechanism.
The proposed approach deals with the supervised document classification
technique as it makes use of prior information on the membership of training
documents in predefined categories.
Supervised classification is a two-phase process. The first phase is called
the training or learning phase; the second phase is the testing or
classification phase. For developing any categorization model, an input data
set is used. This data set is subdivided into a training data set and a test
data set.
The training data set refers to the collection of records whose class labels
are already known; it is used to build the categorization model. Using these
documents, a set of predetermined classes is described. The classifier built
in the training phase is then applied to the test data set.
The test data set refers to the collection of records whose class labels are
known; when these records are given as input, the built categorization model
should return their correct class labels. It is used to determine the accuracy
of the model based on the count of correct and incorrect predictions on the
test records.
The general approach to text classification is given in figure 1.5.
Figure 1.5: Text Classification Process.
Figure 1.5 shows the steps followed in the text classification process.
The first step is to transform the documents into a representation suitable for
the classification task. Then word stemming is performed, followed by the
deletion of stop words. Thereafter, the document is represented as a vector
of the content words obtained in the previous step. Next, feature selection is
done to avoid unnecessarily large feature vectors. Finally, a learning
algorithm is applied to predict the class labels of previously unseen
documents.
To represent the document as a vector of its key terms, the Vector Space
Model is used.
1.1.10 Vector Space Model
The Vector Space Model or Term Vector Model is a model for
representing text documents as vectors of key terms. Generally this model
represents both a document and a query as vectors in a high-dimensional
space corresponding to all the keywords, and uses an appropriate similarity
measure to compute the similarity between the query vector and the
document vector.
The Vector Space Model is divided into three stages. The first stage is
document indexing, where the keywords representing the content are
extracted from the document text. The second stage is the weighting of the
extracted key terms using an appropriate weighting scheme. The final stage
is to calculate the similarity between the document vector and the query
vector using an appropriate similarity measure.
When used in the text classification process, the Vector Space Model
calculates the similarity between a category vector and the vector of the
document to be classified. To represent the document as the vector of its key
terms, the text in the document first has to be pre-processed using Natural
Language Processing techniques to turn it into machine-readable form.
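The three stages can be sketched with the common tf-idf weighting and cosine similarity; this is an illustrative outline, not the exact weighting scheme evaluated in this thesis:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    """Stages 1 and 2: index the document's terms and weight each one
    by tf * idf, where idf = log(N / df)."""
    tf = Counter(doc_tokens)
    vector = {}
    for term, freq in tf.items():
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        vector[term] = freq * math.log(len(corpus) / df) if df else 0.0
    return vector

def cosine(u, v):
    """Stage 3: cosine similarity between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [["data", "mining"], ["text", "mining"], ["text", "classification"]]
query = tfidf_vector(["text", "mining"], corpus)
print(cosine(query, tfidf_vector(["text", "mining"], corpus)))  # identical doc
print(cosine(query, tfidf_vector(["data", "mining"], corpus)))  # partial match
```

In classification the same machinery applies, with category profile vectors taking the place of the query vector.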
1.1.11 Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field dealing with the
interaction between computers and human (natural) languages. It is a
process in which the computer extracts meaningful information from natural
language input, processes it and produces output in natural language.
Tokenization, stop word removal and stemming are the commonly
used NLP tasks for pre-processing the document in classification process.
1.1.11.1 Text Tokenization
Tokenization is the process of breaking up the input text into
individual units called tokens. Tokens may be strings of alphabetic
characters or numbers, and are separated by spaces, line breaks or
punctuation characters. These separators may or may not be included in the
list of resulting tokens.
For example,
Input: Tokenization splits given text into tokens, based on the specified
delimiters.
If the punctuation characters full stop and comma are used as the delimiters
for tokenization then,
Output: Tokenization splits given text into tokens
based on the specified delimiters
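The example above can be reproduced with a short routine; the choice of full stop and comma as delimiters follows the example:

```python
import re

def tokenize(text, delimiters=".,"):
    """Split text at the given delimiter characters, discarding the
    delimiters themselves, as in the example above."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

sentence = ("Tokenization splits given text into tokens, "
            "based on the specified delimiters.")
print(tokenize(sentence))
```

Passing a different delimiter string (for example a space) would instead yield the individual words as tokens.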
1.1.11.2 Removal of Stop Words
Stop words are words that convey no meaning on their own, such
as auxiliary verbs, conjunctions and articles (e.g. “the”, “a”, “and” etc.).
Stop word removal makes the document easier to process by removing all
such words from the document.
1.1.11.3 Stemming
The document may contain different grammatical forms of a word
such as realise, realises, realising or it may contain a group of derivationally
related words with similar meaning such as democracy, democratic,
democratization etc. In such situations, the technique of stemming would be
useful. The goal of stemming is to reduce inflected or derived words to their
stem or root form. A stemming algorithm processes words with the same
stem and retains only the stem as the key word.
For example, the words “stemming”, “stemmed”, “stemmer” are reduced to
their root word which is “stem”.
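Stop word removal and stemming can be sketched together as follows; the tiny stop list and suffix rules are illustrative simplifications of a real stemmer such as Porter's algorithm:

```python
# Illustrative stop list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "and", "is", "are", "of", "to"}

def stem(word):
    """Strip a common suffix, then collapse a doubled trailing consonant
    (e.g. "stemming" -> "stemm" -> "stem"). Not a full Porter stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def preprocess(tokens):
    """Remove stop words, then reduce the remaining words to stems."""
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "stemmer", "is", "stemming", "words"]))
```

With these rules "stemmer", "stemming" and "stemmed" all reduce to the stem "stem", matching the example in the text.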
1.2 THESIS ORGANIZATION
The outline of the thesis can be summarized as follows:
Chapter 2: The literature review gives a brief survey of the research in the
area of text classification and the methods used for it.
Chapter 3: Defines the problem statement and describes the development
process of the proposed approach to solve the problem.
Chapter 4: All experimental results obtained using the proposed
classification approach are tabulated, compared and evaluated.
Chapter 5: Concludes the work presented and proposes ideas for further
research.
2. LITERATURE REVIEW
Due to the increased availability of documents in digital form, the
automated categorization or classification of texts into predefined categories
has become an active research topic over the last few years. Traditionally, a
domain expert does text categorization manually. Documents are read by the
expert and then assigned to the appropriate category. To eliminate the large
amount of manual effort required, automatic text categorization is used. The
dominant approach to the text classification problem is based on Machine
Learning techniques such as supervised text categorization in which a
classifier is built by learning, from a set of preclassified documents. In the late
80’s the most popular approach to text categorization was knowledge
engineering where a set of rules where defined manually to encode the
expert knowledge on how to classify the documents under the given
categories. But in 90’s this approach is slowly being replaced by the machine
learning paradigm. Fabrizio Sebastiani [9], in his survey discusses the main
approaches to text categorization that fall within the machine learning
paradigm.
The commonly used text document representation is the bag-of-
words, which simply uses a set of words and number of occurrences of the
words in a document to represent the document [3, 12]. Many efforts have
been taken to improve this simple and limited document representation. For
example, phrases or word sequences are used to replace single words.
Besides the single word, syntactic phrases have been explored by many
researchers [20], [11], [5], [19]. A syntactic phrase is extracted according to
language grammars. In general, experiments showed that syntactic phrases
were not able to improve the performance of standard “bag-of-word”
indexing. Statistical phrases have also attracted much attention from
researchers [15], [10], [4]. A statistical phrase is composed of a sequence of
words that occur contiguously in text in a statistically interesting way [15],
which is usually called n-gram. Here, n is the number of words in the
sequence. Researchers also indicated that short statistical phrases were
more helpful than long ones [4]. In addition to phrases, other linguistic
features such as POS tags were tried by researchers [1], [19].
Word cluster i.e., a word’s distribution on different categories was used
to characterize a word [13], [16], [5]. The clustering methods used by the
researchers included the agglomerative approach [13] and the Information
Bottleneck [15]. Experiments showed that the word cluster based
representation outperformed the single word based representation
sometimes. Several variants of term frequency (tf) such as the logarithmic
frequency and the inverse frequency were used by researchers [14], [7]. The
logarithmic frequency reflected the intuition that the importance of a word
should increase logarithmically rather than linearly with its
frequency. The inverse frequency was derived in order to distribute term
frequencies evenly on the interval from 0 to 1 [7]. Ko et al. [22] used the
importance of each sentence to weight the term frequency. On the other
hand, categorization at the document classification stage is a traditional
machine learning problem, and more sophisticated machine learning
methods and classification algorithms such as Neural Networks [8], Decision
Trees [2], [17], the Naive Bayes Method [6], K-Nearest Neighbor [23] and
Boosting Algorithms [18], as well as Support Vector Machines (SVM) [21],
are applied to induce representations for documents based on the
categories.
The following section gives a brief survey of ongoing research in the area of
text classification and the methods used for it.
Atika Mustafa et al. [24] have discussed the implementation of
Information Extraction (IE) and categorization in text mining applications.
A modified version of Porter's algorithm for inflectional stemming is used to
extract the key terms from a document. A domain dictionary which defines a
collective set of all key terms of a certain field (in this case Computer
Science) is used for calculating term frequencies for categorization.
Aurangzeb Khan et al. [25] presented a study on e-document
classification. The growing volume of textual data requires text mining,
machine learning and natural language processing techniques and
methodologies to organize documents and to extract patterns and
knowledge from them. This study focused on and explored the main
techniques and methods for automatic document classification.
Chengyuan Ma et al. [26] proposed a new approach called
Regularized Maximum Figure-of-Merit (rMFoM) learning for supervised and
semi-supervised learning. A regularized extension to supervised maximum
figure-of-merit (MFoM) learning is proposed to improve its generalization
capability and successfully extend it to semi-supervised learning. The MFoM
learning criterion is reformulated in the Tikhonov regularization framework to
improve the generalization capability of any classifier based on discriminant
functions.
Chenping Hou et al. [27] described a novel type of semi-supervised
learning using negative labels. A new type of supervision information called
the Negative Label (NL) and an approach called Negative Label Propagation
(NLP) are proposed to guide the process of semi-supervised learning. The
NLP algorithm first constructs an initialization matrix and a parameter matrix
which specifies the type of an NL point. Then the data labels are propagated
under the guidance of the NL information and the structure given by both
labelled and unlabelled points.
Christian Wartena et al. [28] presented a study on keyword extraction
using word co-occurrence. The relevance measure of a word for the text is
computed by defining co-occurrence distributions for words and comparing
these distributions with the document and the corpus distribution. The
semantic similarity of two terms is calculated by computing similarity measure
for their co-occurrence distributions. The co-occurrence distribution of a word
can also be compared with the word distribution of a text. This gives a
measure to determine how typical a word is for a text.
Christoph Goller et al. [29] presented an evaluation of various
automatic document classification methods. Different feature construction
and selection methods and various classifiers are evaluated. By this
evaluation, it is shown that feature selection or dimensionality reduction is
necessary not only to reduce learning and classification time, but also to
avoid overfitting. It is also shown that the Support Vector Machine (SVM)
performs significantly better than all the other classification methods.
Deng Cai et al. [30] proposed a novel learning algorithm called
Manifold Adaptive Experimental Design for text categorization. Unlike most
previous learning approaches which explore either Euclidean or data-
independent nonlinear structure of the data space, the proposed approach
explicitly takes into account the intrinsic manifold structure. The local
geometry of the data is captured by a nearest neighbour graph. The graph
Laplacian is incorporated into the manifold adaptive kernel space in which
active learning is then performed.
Der Chiang Li et al. [31] described a new attribute construction
approach to solve the small dataset classification problem. The proposed
method computes classification-oriented membership degrees to
construct new attributes, called class-possibility attributes, and also
develops an attribute construction procedure to construct another set of new
attributes, called synthetic attributes, to increase the amount of
information for small dataset analysis.
Fang Lu et al. [32] proposed a refined weighted K-Nearest Neighbors
algorithm for text categorization. The traditional KNN algorithm assumes that
the weight of each feature item is identical across categories. This is not
reasonable, since each feature item may have different importance and
distribution in different categories. This disadvantage of the traditional
KNN algorithm is significantly overcome by the refined KNN algorithm.
Jung-Yi Jiang et al. [33] presented a Fuzzy Self-Constructing
Feature Clustering algorithm for text classification. The paper proposes a
fuzzy similarity-based self-constructing algorithm for feature clustering.
The words in the feature vector of a document are grouped into clusters
based on a similarity test. In this algorithm, the membership functions
properly describe the distribution of the training data.
Lifei Chen et al. [34] proposed a classifier for text categorization using
a class-dependent projection based approach. By projecting onto a set of
individual subspaces, the samples belonging to different document classes
are separated so that they can be easily classified. This is achieved by
developing a supervised feature weighting algorithm to learn the optimized
subspace for each document class. The method learns class-specific
weighting values for each term from the training data in the training phase,
and classifies new documents based on a weighted distance measurement in
the testing phase.
Ma Zhanguo et al. [35] presented an improved term weighting method
for text classification. The weighting used by traditional text classification
methods does not involve the class information of the terms. The proposed
tf-idf-CI model uses class information (CI) for weighting. The class
information contains intra class information and inner class information.
Using this method, it is shown that the role of important and representative
terms is raised and the effect of the unimportant feature terms on
classification is decreased.
Makoto Suzuki et al. [36] presented a new mathematical model of
automatic text categorization and a classification method based on the
Vector Space Model (VSM).
They proposed a new classification technique called the Frequency Ratio
Accumulation Method (FRAM). This is a simple technique that adds up the
ratios of term frequencies among categories, and it is able to use index terms
without limit. Then, the Character N-gram is used to form index terms,
thereby improving FRAM.
Man Lan et al. [37] described several supervised and unsupervised
term weighting methods for automatic text categorization. Also a new
supervised term weighting method to improve the term’s discriminating power
for text categorization task is proposed. In the comparative study of
supervised and unsupervised term weighting methods, it is found that not all
supervised term weighting methods are superior to unsupervised methods
and the performance of the term weighting methods has close relationships
with the learning algorithms and data corpus.
Murat Can Ganiz et al. [38] introduced a novel Bayesian framework for
classification called Higher Order Naïve Bayes (HONB). Traditional
machine learning algorithms assume that instances are Independent and
Identically Distributed (IID). These critical independence assumptions
prevent them from going beyond instance boundaries to exploit latent
relations between features. Unlike
approaches that assume data instances are independent, HONB leverages
higher order relations between features across different instances.
Nianyun Shi et al. [39] presented a feature selection method named
Distributional Information based Feature Selection (DIFS). In DIFS a new
estimation mechanism is proposed to measure the relevance between
feature's distribution characteristics and contribution to categorization. The
authors discussed three kinds of distribution informative characteristics of
features and their direct or indirect relevance to feature's contribution to text
categorization. In addition, two kinds of algorithms are designed to implement
DIFS.
Nikos Tsimboukakis et al. [40] presented a new approach called Word-
Map Systems for the classification of documents in terms of their content.
This approach consists of two stages. The first stage uses a word map to
create a feature representation of the documents, while the second stage
comprises a supervised classifier that classifies the documents into
predefined categories. Two approaches to create word maps are presented
and compared based on Hidden Markov Models (HMM) and the self-
organizing map (SOM).
Ning Zhong et al. [41] proposed an innovative and effective pattern
discovery technique. To overcome the problems of low frequency and
misinterpretation of derived patterns, the proposed technique uses two
processes, pattern deploying and pattern evolving, to refine the
discovered patterns in text documents. The proposed approach improves the
accuracy of evaluating term weights because discovered patterns are more
specific than whole documents.
Sheng-Tun Li et al. [42] proposed a novel classification approach
based on Fuzzy Formal Concept Analysis (FFCA) to control the impact from
noise. Most of existing document classification algorithms are easily affected
by noise data. This research uses Fuzzy Formal Concept Analysis to
generalize documents to concepts in order to decrease the impact from noise
terms. Every formal concept is used to recommend the category for new
documents.
Tomoharu Iwata et al. [43] proposed a framework for improving
classifier performance by effectively using auxiliary samples. The auxiliary
samples are labelled not in terms of the target taxonomy but according to
classification schemes or taxonomies that are different from the target
taxonomy. This method finds a classifier by minimizing a weighted error over
the target and auxiliary samples. The weights are defined so that the
weighted error approximates the expected error when samples are classified
into the target taxonomy.
Weibin Deng [44] developed a hybrid algorithm for text classification
based on rough set for the problem of high dimensions of text feature words.
In the first stage, most documents are classified into certain classes with
high accuracy by rough set; in addition, based on the attributes' importance
degree theory in the informational view of rough set, the documents of the
doubt set are classified further. In the second stage, weighted Naive
Bayes relieves the conditional dependence of Naive Bayes.
Xiao-Bing Xue et al. [45] explored the effect of a novel value assigned
to a word called distributional features, which express the distribution of a
word in the document. The widely used Bag-of-Word (BOW) may not fully
express the information contained in the document. The proposed
distributional features include the compactness of the appearances of the
word and the position of the first appearance of the word. The analysis shows
that the distributional features are useful for text categorization, especially
when they are combined with term frequency or combined together.
Xiaojun Quan et al. [46] investigated the suitability of the existing term-
weighting methods for question categorization. The popular unsupervised and
supervised term-weighting methods for question categorization are compared
and three new supervised term-weighting methods are proposed. The
evaluation of the newly proposed supervised term-weighting schemes exhibits
stable and consistent improvement over most of the previous term-weighting
methods.
Yaxin Bi et al. [47] introduced an approach to combining the decisions
of text classifiers. Each classifier output is modelled as a list of prioritized
decisions and then divided into the subsets of 2 and 3 decisions which are
subsequently represented by the evidential structures in terms of triplet and
quartet. Also a general formula is developed based on the Dempster-Shafer
theory of evidence for combining such decisions.
Ying Liu et al. [48] reported an approach of concepts handling in
document representation and its effect on the performance of text
categorization. A Frequent word Sequence algorithm that generates concept-
centred phrases to render domain knowledge concepts is introduced. It is
also observed that a universally suitable support threshold does not exist and
the removal of concept irrelevant sequences can possibly further improve the
performance at a lower support level.
3. DEVELOPMENT PROCESS
3.1 PROBLEM DEFINITION
Text classification is one of the basic text mining tasks which classifies
the documents with respect to one or more pre-existing categories. In order
to be classified, each document should be represented in a machine
understandable format. Most of the existing classification systems use the
traditional Bag-of-Words model, the common way of representing a text
as the bag of its component words. The limitation of the Bag-of-Words
document representation
is the problem of synonymy which arises due to the different words with
similar meanings. This model counts the number of occurrences of the key
words and omits the fact that different words might have been used to
express a similar concept within the same document. This representation has
to be enhanced so that the key words and their corresponding synonyms in
the document should be considered for the classification process.
The proposed system uses an approach to automatically classify the
documents by enriching the Bag-of-Words text representation with
synonyms. This approach uses WordNet, a lexical database of English to
help the process of document representation and classification. After pre-
processing the data in the text document, the document is represented as a
bag of key words. Then the system uses the WordNet to extract the
synonyms for all the key words in the text document and combines them with
the key words to form a new representative vector of the document.
The proposed approach helps in improving the performance of the
classification system by providing a solution to the problem of synonymy. If
the document contains different words with the same meaning, then the system
counts the number of occurrences of all the synonymic words and adds them
together. As a result, both the key words and their synonyms in the text
document are considered and their occurrences counted together for the
classification process.
3.2 SYSTEM DESIGN
Figure 3.1 displays the system architecture design. It consists of two
phases, the Learning phase and the Classification phase.
3.2.1 Learning Phase
The learning phase, also called the training phase, deals with the
preparation of the training data set. In this phase, a set of training
documents is given as input to the classifier. The class labels of these
training documents are previously known. The data in the documents is
pre-processed to represent the text in the form of a vector of key terms.
Based on these training documents, a set of pre-defined classes is described
and a training data set called the Categorical Profile is prepared.
3.2.2 Classification Phase
The document to be classified is given as input to the classifier in the
classification phase. This phase is also called the testing phase, since the
classifier is tested using the new document. A data set called the Document
Profile is prepared from the input document. In the next step, the
similarity between the profile of the document to be classified and the
category profiles is calculated using a similarity measure, and the input
document is assigned the category whose profile is most similar to the
profile of the document. The category to which the document is assigned is
provided as the output of the system, i.e., the class of the input document.
Figure. 3.1: System Architecture Diagram.
[The architecture comprises two phases. Learning Phase: the training
documents pass through data pre-processing (tokenization, noise removal,
stop word removal, stemming) to form the Bag of Words; WordNet is used in
the calculation of the sum of term and synonym frequencies; and the
categorical profile is generated. Classification Phase: the input document
to be classified follows the same pipeline to generate the document
profile; the similarity between the categorical and document profiles is
calculated; and the classified document is produced as output.]
Both the learning phase and classification phase use the following
modules to perform the classification process.
3.2.3 Data Pre-processing
Data Pre-processing deals with the representation of data in a
document in terms of the vector of its key terms. The data in the text
document is pre-processed to reduce the complexity of the document and to
make them easier to handle. It is an important step in documents
classification and gives the compact form of content of the document. The
purpose of text representation is to reduce possible language dependent
factors.
The functions carried out in the pre-processing process are as follows:
3.2.3.1 Tokenization
It is the process of breaking a stream of text into individual words or
phrases called tokens. The tokens may be strings of alphabetic characters
or numbers. The delimiters used to separate the tokens may be a space, a
line break or any of the punctuation characters, and may or may not be
included in the list of resulting tokens.
For instance,
Input: Tokenization splits sentences into individual tokens. This is an
example of tokenization.
If the punctuation character full stop is used as the delimiter for the
tokenization process then,
Output: Tokenization splits sentences into individual tokens
This is an example of tokenization
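A minimal sketch of such a tokenizer, using Java's StringTokenizer class as in the implementation; the class name `Tokenizer` and the delimiter set shown are illustrative assumptions, not taken from the thesis code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Tokenizer {
    // Split a stream of text into tokens using the given delimiter characters.
    // Delimiters themselves are not included in the resulting token list.
    public static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}
```

For instance, `tokenize(input, " \t\n")` splits on whitespace, while passing `"."` as the delimiter set reproduces the full-stop example above.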
3.2.3.2 Noise Removal
All the irrelevant data such as non-alphabetic characters like full
stops, commas, brackets, numerals and special characters are removed.
3.2.3.3 Stop Word Removal
Stop words such as auxiliary verbs, conjunctions and articles (e.g.
"the", "a", "and", etc.), which do not convey any meaning, are removed by
this process.
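The noise removal and stop word removal steps can be sketched together as follows; the stop word list here is a small illustrative subset, and the class name is an assumption rather than the thesis code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // Illustrative subset only; the actual system compares tokens
    // against full arrays of noisy characters and stop words.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("the", "a", "an", "and", "is", "of", "this"));

    // Noise removal: strip non-alphabetic characters from a token.
    public static String removeNoise(String token) {
        return token.replaceAll("[^a-zA-Z]", "");
    }

    // Stop word removal: drop tokens that carry no meaning.
    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
            .map(StopWordFilter::removeNoise)
            .map(String::toLowerCase)
            .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
            .collect(Collectors.toList());
    }
}
```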
3.2.3.4 Stemming
The stemming algorithm that converts different words into their root
forms is applied. Stemming refers to the process of reducing inflection and
derivational variants of words to their stem or the root of a certain word.
There are basically two types of stemming techniques, inflectional
and derivational. Derivational stemming derives a new word from an existing
word, by changing its grammatical category (for example, changing a noun to
a verb). When the singular is changed to plural or the past tense to the
present, it is referred to as inflectional stemming.
A stemmer (an algorithm which performs stemming) reduces words
with the same stem to that stem and keeps only the stem as the key word.
For example, the words "train", "training", "trainer" and "trains" can all
be replaced with "train". To
minimize the effects of inflection and morphological variations of words,
Porter stemming algorithm is used.
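Since the full Porter algorithm comprises five ordered phases of context-sensitive rules and runs to several hundred lines, the sketch below is a heavily simplified suffix stripper for illustration only: it handles the "train" examples above but is not the Porter algorithm itself.

```java
public class SimpleStemmer {
    // A few example inflectional suffix rules (NOT the full Porter rules).
    // Length guards prevent over-stripping very short words.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("er") && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }
}
```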
The screen shots for the data pre-processing process carried out in
this research work such as tokenization, noise removal, stop word removal
and stemming are depicted in Appendix 1.
3.2.4 Bag-of-Words
The Bag-of-Words is the collection of the component words of an input text
document. These component words, also called feature words or key words,
are obtained after the data in the document is pre-processed. It is the
simplest representation of texts in the vector space model. It transforms
texts into vectors where each component represents a word. The purpose of
this representation is to make the text understandable to the machine.
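Building the Bag-of-Words as a term-frequency map can be sketched as follows (the class and method names are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BagOfWords {
    // Count the occurrences of each pre-processed key word, giving the
    // term-frequency component of the document vector.
    public static Map<String, Integer> build(List<String> keyWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : keyWords) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```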
3.2.5 WordNet
To move from counting word occurrences to counting synonyms, a
thesaurus is required. The proposed approach uses WordNet to help the
process of document representation and classification.
WordNet is a large lexical database of English developed at
Princeton University [50]. It is a combination of a dictionary and a
thesaurus. The basic building block of WordNet is the synset, consisting of
all the words that represent a given concept. Nouns, verbs, adjectives and
adverbs are grouped into sets of synonyms, each expressing a distinct
concept. Each synset represents the underlying lexical concept expressed by
all the synonymic words and is identified by a unique synset number. In
addition, each synset contains pointers to other semantically related
synsets. A word may belong to more than one synset. Each synset is
associated with a sense, i.e. a word meaning; for example, the words "car"
and "automobile" are grouped in the synset {car, automobile}. A word form in
WordNet can be a single word or two or more words connected by
underscores. WordNet is capable of referring a word form to a synset.
3.2.5.1 WordNet & Synonyms
The relation of synonymy is the basis of the structure of WordNet.
Synonymy is the relation binding two equivalent or close concepts. It is a
symmetrical relation. A synonym is a word which can be substituted for
another without a major change of meaning. The lexemes are gathered in sets
of synonyms called synsets. Thus a synset consists of all the terms used
to indicate a concept.
The definition of synonymy used in WordNet is as follows: "Two
expressions are synonymous in a linguistic context C if the substitution of
one for the other in C does not modify the truth value of the sentence in
which the substitution is made". An example of a synset is {feature,
characteristic, lineament, have, sport, boast}.
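Since querying the actual WordNet database requires an external access library, the sketch below stands in for the synonym lookup with a tiny hand-coded synset table; the entries and the class name are illustrative only, not the thesis implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SynonymLookup {
    // A tiny stand-in for WordNet: each key word maps to its synonyms.
    private static final Map<String, Set<String>> SYNSETS = new HashMap<>();
    static {
        SYNSETS.put("car", new HashSet<>(Arrays.asList("automobile", "auto")));
        SYNSETS.put("automobile", new HashSet<>(Arrays.asList("car", "auto")));
    }

    // Return the synonym set of a key word, or an empty set if unknown.
    public static Set<String> synonymsOf(String keyWord) {
        return SYNSETS.getOrDefault(keyWord, Collections.emptySet());
    }
}
```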
3.3 DETAILED DESIGN
3.3.1 Data Flow Diagram for Level 0
An overview of the overall process is provided in the Level 0 DFD in
figure 3.2.
Figure. 3.2: Data Flow Diagram - Level 0.
In the overall view of the process given in the above diagram, the
input text document is processed by the classifier and assigned to a class
whose profile is most similar to the profile of the document.
3.3.2 Data Flow Diagram for Level 1
A more detailed view of the overall process is provided in the Level 1 DFD
in figure 3.3.
The document given as input to the system is pre-processed to
reduce the complexity of the document and to get the compact form of the
input document. The bag of component words of input document is obtained
as the result of data pre-processing process.
By using this bag of words of the text document, the categorical and
document profiles are prepared. The similarity between the document profile
and all the category profiles is calculated using the similarity measure.
The input document is assigned to the category or class whose profile is
most similar to the document profile.
Figure. 3.3: Data Flow Diagram - Level 1.
3.3.3 Data Flow Diagram for Level 2
The detailed flow of data within each module in the classification
system can be obtained by the Level 2 DFD in figure 3.4.
The input document is passed to the pre-processing module where it
gets tokenized, noise and stop words are removed and finally the key words
in the document are stemmed.
Figure. 3.4: Data Flow Diagram - Level 2.
Tokenization splits the stream of text in the document into individual
tokens based on the delimiters provided. Then the words that do not convey
any meaning, such as special characters and articles, are removed by the
noise removal and stop word removal operations. At the end of the
pre-processing operation, the document undergoes stemming, where all the
key words obtained as the result of the previous operations are brought to
their root form.
The bag of key words of the input text document is obtained as the
result of the tokenization, noise removal, stop word removal and stemming
operations. Each of these key words is passed through WordNet, a lexical
database of English, and their synonyms are extracted. These synonyms are
checked against the key words, and if any of the synonyms of a key word
appears in the document, then the frequency of that synonym is added to
the key term (key word) frequency.
After calculating the sum of the frequencies of each key term and its
synonyms, the categorical and document profiles are prepared by assigning
weights to the key terms. Then, the similarity between the document profile
and all the category profiles is calculated using the similarity measure.
Finally, the input document is assigned to the category or class whose
profile is most similar to its profile.
3.4 METHODOLOGY
The two phases of the classifier, the learning phase and the classification
phase, are used to perform the classification process. The learning or
training phase deals with the construction of the classification model, and
the classifier is used in the testing or classification phase. The way in
which the classification system works is stated as follows.
3.4.1 Calculation of Sum of Frequencies of Term and its Synonyms
Term Frequency (tf) is the number of occurrences of a key word in the
bag of component words of the text document generated as a result of the
data pre-processing process. Synonym Frequency (sf) is the number of
occurrences of all the synonyms of a key word in the bag of component
words of the text document.
Most of the existing classification systems use the term frequency to
classify the document, whereas some classification systems classify the
documents based on the synonym frequency.
To enhance the performance of these classification systems, this
research proposes an approach to classify the text documents based on both
term and its synonym frequency. It classifies the documents by finding the
sum of term and its corresponding synonym frequencies in the document.
The calculation of the sum of the term and its synonym frequencies is the
process where the sum of the number of occurrences of each key word and
the number of occurrences of all of its synonyms in the document is
obtained. This process interacts with WordNet, the lexical database of
English, to obtain the synonyms of the key words. Each of the key words in
the bag of words is passed through WordNet and its synonyms are extracted.
These synonyms are then checked against the key words in the text document
obtained by the pre-processing process. If, for a key term, any of its
synonyms appears within the document, then the frequency of that synonym
is added to the term (key word) frequency. This strategy extends each term
vector with entries for the WordNet synonyms S appearing in the text
document.
As a result, the frequency of each of the key terms in the document is given
by,

    f(t, d) = tf(t, d) + sf(t, d)        (3.1)

where,
    tf(t, d) is the number of occurrences of the key word t in document d.
    sf(t, d) is the total number of occurrences of all the synonyms of the
    key word t appearing in the document d.

Thus each term vector will be replaced by the concatenation of the term
vector and the synonym vector,

    d' = (tf(t_1, d), ..., tf(t_n, d)) ⊕ (sf(s_1, d), ..., sf(s_|S|, d))        (3.2)

where sf(s, d) denotes the frequency with which a synonym s ∈ S of the key
terms appears in the document d.
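The summation of term frequency (tf) and synonym frequency (sf) described above can be sketched as follows, assuming the term-frequency map and the synonym map have been built by the earlier stages; the class and parameter names are illustrative.

```java
import java.util.Map;
import java.util.Set;

public class TermSynonymFrequency {
    // For a key term, add to its own frequency the frequencies of its
    // synonyms that also appear in the document. `termFreqs` is the
    // Bag-of-Words count map; `synonyms` maps each key word to its
    // WordNet synonym set.
    public static int tfPlusSf(String term,
                               Map<String, Integer> termFreqs,
                               Map<String, Set<String>> synonyms) {
        int tf = termFreqs.getOrDefault(term, 0);
        int sf = 0;
        for (String syn : synonyms.getOrDefault(term, Set.of())) {
            sf += termFreqs.getOrDefault(syn, 0); // only synonyms present in the document contribute
        }
        return tf + sf;
    }
}
```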
3.4.2 Vector Generation by Weighting the Key Terms
Given the key word frequencies, the input test document and all the
categories can be represented as weighted vectors of their key terms.
Weighting the key terms is the process where a weight is assigned to
each of the key words as an indication of the importance of the word. There are
various methods to calculate the weight of a key word. The proposed
approach uses the standard tf-idf measure (the product of Term Frequency
and Inverse Document Frequency), defined as,

    w(t, d) = tf(t, d) × idf(t)        (3.3)

where,
    tf(t, d) is the number of occurrences of the term t in document d.
    idf(t) represents the importance of the term t, a measure of whether
    the term is common or rare across all documents.

Given the key term weights in all categories, the weighted vector for
each category c_j is given by,

    C_j = (w(t_1, c_j), w(t_2, c_j), ..., w(t_n, c_j))        (3.4)

where,
    w(t_1, c_j), ..., w(t_n, c_j) are the weights of the key terms
    t_1, ..., t_n in the category c_j.
The key words in the category are weighted by the proposed approach
as follows:

    w(t_i, c_j) = tfs(t_i, c_j) × log( M / |{c : t_i ∈ c}| )        (3.5)

where,
    tfs(t_i, c_j) is the sum of the frequencies of the term t_i and its
    synonyms in category c_j.
    M is the total number of categories.
    |{c : t_i ∈ c}| is the number of categories that contain the term t_i.

Equation 3.5 measures the degree of association between a key term and a
category. Its application is based on the assumption that a term whose
frequency strongly depends on the category in which it occurs will be
useful for distinguishing among the categories.
Also, the weighted vector for the test document d is given by,

    D = (w(t_1, d), w(t_2, d), ..., w(t_n, d))        (3.6)

where,
    w(t_1, d), ..., w(t_n, d) are the weights of the key terms
    t_1, ..., t_n in the document d.

In this approach, the key terms in the test document are weighted as,

    w(t_i, d) = tfs(t_i, d) × log( N / |{d' : t_i ∈ d'}| )        (3.7)

where,
    tfs(t_i, d) is the sum of the frequencies of the term t_i and its
    synonyms in document d.
    N is the total number of training documents.
    |{d' : t_i ∈ d'}| is the number of training documents containing the
    term t_i.

Equation 3.7 computes the document dependent weights for the key terms so
as to generate a vectorial representation for each document in which each
term is weighted by its contribution to the discriminative semantics of
the document.
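The weightings of equations 3.5 and 3.7 share one form: a summed term-and-synonym frequency multiplied by a logarithmic inverse frequency factor. This can be sketched as a hypothetical helper (not the thesis code):

```java
public class TermWeighting {
    // tfs:        sum of term and synonym frequencies in the category or document.
    // total:      total number of categories (eq. 3.5) or training documents (eq. 3.7).
    // containing: number of categories/documents that contain the term.
    public static double weight(int tfs, int total, int containing) {
        return tfs * Math.log((double) total / containing);
    }
}
```

A term occurring everywhere gets `log(total/total) = 0`, so terms that appear in every category or document carry no discriminative weight.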
3.4.3 Calculation of Similarity between Document and Categories
Profiles
The similarity measure is used to determine the degree of
resemblance between two vectors. The similarity between documents is
estimated by calculating the distance between the vectors of these
documents. The similarity measure value will be larger for documents that
belong to the same class and smaller otherwise. Many measures have been
proposed for measuring document similarity based on term occurrences or
document vectors.
The measure used to calculate the similarity in the proposed approach
is the cosine similarity measure. It evaluates the cosine of the angle
between two document vectors and is given by [29],

    sim(D_a, D_b) = cos(D_a, D_b)        (3.8)

Equation 3.8 can be expanded as,

    sim(D_a, D_b) = ( Σ_t w(t, a) · w(t, b) ) / ( √(Σ_t w(t, a)²) · √(Σ_t w(t, b)²) )        (3.9)

where,
    t is a key word; D_a and D_b are the two vectors (profiles) to be
    compared.
    w(t, a) is the weight of the term t in D_a.
    w(t, b) is the weight of the term t in D_b.
In the proposed approach, this similarity measure is used to calculate
the distance between each category vector and the vector of the document to
be categorized. If there are more common key terms and these key terms have
strong weightings, the similarity will be closer to 1, and vice versa. As a
result, the document is assigned to the category whose vector is closest to
the document vector.
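The cosine similarity of equation 3.8, over two weighted term vectors represented here as term-to-weight maps, can be sketched as follows (the class name and map representation are illustrative assumptions):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {
    // Cosine of the angle between two weighted term vectors:
    // dot product divided by the product of the vector norms.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        for (String t : terms) {
            double wa = a.getOrDefault(t, 0.0);
            double wb = b.getOrDefault(t, 0.0);
            dot += wa * wb;
            normA += wa * wa;
            normB += wb * wb;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The classifier would evaluate this against every category profile and pick the category with the highest value.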
3.5 IMPLEMENTATION
To develop the classifier, this research takes advantage of the built-in
functions of the Java programming language and NetBeans, an Integrated
Development Environment (IDE) intended for Java development.
3.5.1. Java Programming Language
Java is an object-oriented programming language developed by Sun
Microsystems [52], [53]. It has a built-in Application Programming Interface
(API) that can handle graphics and user interfaces and can be used to create
applications and applets. Applications are programs that perform the same
functions as those written in other programming languages. Applets are
programs that can be embedded in a webpage and accessed over the
internet. When a program written in Java is compiled, a byte code is
produced that can be read and executed by any platform that can run Java.
Because of its rich set of API’s and its platform independence, Java can also
be thought of as a platform in itself.
Java has become a language of choice for providing worldwide
Internet solutions because of its security and portability features. The key
that allows Java to provide these features is bytecode: a highly optimized
set of instructions designed to be executed on the Java run-time system
called the Java Virtual Machine (JVM), which is an interpreter for
bytecode. Translating a Java program into bytecode makes it much easier to
run in a wide variety of environments since only the JVM needs to be
implemented for each platform. Although the details of the JVM will differ
from platform to platform, all interpret the same Java bytecode.
Java provides a number of advantages to developers which are as
follow:
Simple
Java is designed to be easy for the programmer to learn and use
effectively. Java builds on and improves the ideas of C++. Since it
inherits most of the syntax and object-oriented features of C++, people
who already have some programming experience require very little effort
to learn and use Java.
Object-Oriented
Java is object oriented because programming in Java is mainly
focused on creating objects, manipulating objects and making objects
work together. This allows the developer to create modular programs
and reusable code.
Portable
Java program can be compiled on one platform and run on
another, even if the Central Processing Units (CPU) and Operating
Systems (OS) of the two platforms are completely different. The ability
to execute the same program on different systems is important for the
World Wide Web, and Java succeeds in this by being platform
independent.
Secured
Java considers security as part of its design. Using a Java-
compatible web browser, the user can safely download Java applets
without fear of viral infection. Java achieves this protection by limiting
a Java program to the Java execution environment and not allowing it
access to other parts of the computer.
Robust
Since Java is a strictly typed language, it checks the code at
compile time and also at run time. Java emphasizes early checking for
possible errors, as Java compilers are able to detect many problems that
would show up only at execution time in other programming languages.
Multithreaded
Java supports multithreaded programming, which allows the
programmer to write programs that do many things simultaneously.
But in other programming languages, operating system specific
functions have to be called in order to enable multithreading.
3.5.2 NetBeans IDE
NetBeans IDE is an open source integrated development environment
for developing Java applications. It is written in Java and runs on all major
platforms. NetBeans IDE supports development of all Java application types
including Java SE, Java ME, EJB and mobile applications. The functions of
the NetBeans IDE needed for Java development are provided by modules.
Each module provides a function such as support for the Java language,
editing etc. These modules also allow NetBeans to be extended: new
features can be added by installing additional modules. Modules such as
the NetBeans profiler, the GUI design tool and the NetBeans JavaScript
editor are part of the NetBeans IDE [53].
The NetBeans IDE has many features and tools for each of the Java
platforms. Some of these features are listed below:
Syntax highlighting for Java, JavaScript, XML, HTML, CSS, JSP etc.
Customizable fonts, colours, and keyboard shortcuts.
Live parsing and error marking.
Pop-up Javadoc for quick access to documentation.
Advanced code completion.
Automatic indentation, which is customizable.
Word matching with the same initial prefixes.
Navigation of current class and commonly used features.
Macros and abbreviations.
Matching brace highlighting.
JumpList allows you to return the cursor to the previous modification.
Zoom view ability.
Database schema browsing to see the tables, views, and stored
procedures defined in a database.
Database schema editing using wizards.
Data view to see data stored in tables.
Works with databases, such as MySQL, Oracle, IBM DB2, Microsoft
SQL Server, Sybase, Informix, Cloudscape, Derby etc.
3.5.3 Implementation Details
The text documents are loaded from the 20Newsgroups data corpus,
which is stored inside a directory. In the preprocessing stage, each
document is first tokenized using Java’s StringTokenizer class. String
functions are used to compare the resulting tokens first with the array of
noisy characters and then with the array of stop words for their removal.
As the last stage of preprocessing, Porter’s stemming algorithm is applied
to the key words obtained from the previous steps to reduce them to their
root form.
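The preprocessing stage described above can be sketched roughly as follows. The stop word list here is a tiny illustrative subset, and stripSuffix is a crude stand-in for the Porter stemmer actually used; both are assumptions made only to keep the example runnable.

```java
// Sketch of the preprocessing stage: tokenize, strip noisy characters,
// drop stop words, reduce words toward a root form. The stop-word list
// is a small illustrative subset, and stripSuffix is a crude stand-in
// for Porter's stemming algorithm used in the thesis.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class Preprocessor {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "is", "a", "an", "of", "and", "to"));

    public static List<String> preprocess(String text) {
        List<String> keywords = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            // Normalise case and remove noisy (non-letter) characters.
            String token = tokenizer.nextToken().toLowerCase().replaceAll("[^a-z]", "");
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                keywords.add(stripSuffix(token));
            }
        }
        return keywords;
    }

    // Crude suffix stripping; the real system applies Porter's algorithm.
    private static String stripSuffix(String word) {
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("s") && word.length() > 3) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```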
Thereafter, the document is searched for the synonyms of each key
word generated as the result of the preprocessing process. If any of the
synonyms appears in the document, their frequencies are added to the key
term frequency to obtain the sum of the key term and synonym frequencies.
This module makes use of WordNet to extract the synonyms of the key
words. The MIT Java WordNet Interface (JWI) is used to interface with the
WordNet dictionary. First, a URL is constructed that points to the WordNet
dictionary directory. Then using the instance of the default Dictionary object
of JWI, the dictionary is searched for the senses or synonyms of the input
tokens.
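The frequency-merging step can be sketched as below. Because the real system queries the WordNet dictionary through JWI, a hard-coded synonym map stands in for the dictionary lookup here; its entries are illustrative assumptions, not WordNet output.

```java
// Sketch of merging a key term's frequency with the frequencies of its
// synonyms. The hard-coded map stands in for the WordNet/JWI dictionary
// lookup used in the thesis; its entries are illustrative only.
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynonymMerger {
    // Stand-in for synonym lookups via the JWI dictionary.
    private static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("car", Arrays.asList("auto", "automobile"));
        SYNONYMS.put("space", Arrays.asList("cosmos"));
    }

    /** Frequency of the key term plus the frequencies of its synonyms. */
    public static int mergedFrequency(String term, Map<String, Integer> termFreq) {
        int total = termFreq.getOrDefault(term, 0);
        for (String syn : SYNONYMS.getOrDefault(term, Collections.emptyList())) {
            total += termFreq.getOrDefault(syn, 0);
        }
        return total;
    }
}
```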
Categorical and document profiles, or vectors, are generated in the
next module. In this process, the sums of key term and synonym
frequencies obtained in the previous module are passed to the function
that calculates the weights for these key terms. Then, using these
weighted key terms, the vectors are generated for the categories and the
test document.
The last module of the system calculates the similarity between the
categorical and test document profiles using the cosine similarity function.
The similarity between the test document profile and each of the category
profiles is calculated by providing the category vector and the document
vector as input to the similarity function.
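A minimal sketch of the cosine similarity computation used by this module follows; the parallel-array vector layout over a shared term index is an assumption, not a detail confirmed by the thesis.

```java
// Cosine similarity between a category profile and a document profile,
// both represented as weight arrays over a shared term index.
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];     // dot product of the two vectors
            normA += a[i] * a[i];   // squared magnitude of vector a
            normB += b[i] * b[i];   // squared magnitude of vector b
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero profile shares nothing with anything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The document is then assigned to the category whose profile yields the highest cosine value.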
The features of GUI design tools of NetBeans IDE are used to present
the details and results at different stages.
4. EXPERIMENTS AND EVALUATION
In this research, the 20Newsgroups collection is used to evaluate the
proposed approach. All documents for training and testing undergo a
preprocessing step, which includes tokenization, noise and stop word
removal, and stemming. Then common measures such as Precision, Recall
and F-measure are applied for performance evaluation.
4.1 EVALUATION METRICS
The performance of a text categorization system is evaluated using
performance measures from information retrieval. Common metrics for text
categorization evaluation include precision, recall and F1.
4.1.1 Precision and Recall
Precision is a measure of the ability of a system to present only
relevant items to the user.
Recall is a measure of the ability of a system to present all relevant
items to the user.
For classification tasks, the terms true positives, true negatives, false
positives, and false negatives compare the classifier's results in the test
phase with the previously known results. The terms positive and negative
refer to the classifier's prediction, and the terms true and false refer to
whether that prediction corresponds to the previously known result.
The terms true positive (TP), true negative (TN), false positive (FP)
and false negative (FN) are defined as follows.
TP_i – number of documents assigned correctly to class i.
TN_i – number of documents assigned correctly to the classes other than i.
FP_i – number of documents that do not belong to class i but are assigned to
class i incorrectly by the classifier.
FN_i – number of documents that are not assigned to class i by the classifier
but which actually belong to class i.
The definitions given above can be illustrated by the table 4.1 below.
Table 4.1: Possible Predictions of a Classifier.

                                    Actual category
    Category i                   TRUE           FALSE
    Classifier      TRUE         TP_i           FP_i
    Prediction      FALSE        FN_i           TN_i
In a classification task, the precision for a class is the number of true
positives (i.e. the number of documents correctly labeled as belonging to the
positive class) divided by the total number of documents labeled as
belonging to the positive class (i.e. the sum of true positives and false
positives, which are documents incorrectly labeled as belonging to the
class).

Precision_i = TP_i / (TP_i + FP_i)

where,
TP_i is the number of documents assigned correctly to class i.
FP_i is the number of documents that do not belong to class i but are
assigned to class i incorrectly by the classifier.
Recall is defined as the number of true positives divided by the total
number of documents that actually belong to the positive class (i.e. the sum
of true positives and false negatives, which are documents that were not
labeled as belonging to the positive class but actually belong to it).

Recall_i = TP_i / (TP_i + FN_i)

where,
TP_i is the number of documents assigned correctly to class i.
FN_i is the number of documents that are not assigned to class i by the
classifier but which actually belong to class i.
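The two measures just defined translate directly into code; a minimal sketch from the per-class counts (class and method names are illustrative):

```java
// Per-class precision and recall from true positive, false positive
// and false negative counts, following the definitions above.
public class Metrics {
    /** Precision_i = TP_i / (TP_i + FP_i). */
    public static double precision(int tp, int fp) {
        return (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
    }

    /** Recall_i = TP_i / (TP_i + FN_i). */
    public static double recall(int tp, int fn) {
        return (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
    }
}
```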
4.1.2 F-measure
The F-measure or F1 score is used to evaluate the performance of the
classification system based on the precision and recall values. It is the
harmonic mean of precision and recall. In some cases there is a need to
trade off precision for recall or vice versa; hence the F-score is used, since it
takes both precision and recall into account. The F-measure value lies in the
interval [0, 1], and the larger the F-measure value, the higher the
classification quality. The F-measure is defined as,

F1 = (2 × Precision × Recall) / (Precision + Recall)
The overall F-measure score of the entire classification system can be
computed by two different types of average: micro-average and macro-
average. Both the macro averaged and micro averaged F-measures are
used to evaluate the performance of the proposed classification system.
4.1.2.1 Macro Averaged F-Measure
The macro averaged F-measure is the average of F1 scores of all
the categories. In macro averaging, F-measure is computed locally over each
category first. Then the average over F1 scores of all categories is taken.
Macro averaged F-measure gives equal weight to each category, without
taking into account its frequency. It is generally dominated by the classifier’s
performance on rare categories.
Given a training dataset with n categories, and the F1 value for the ith
category denoted F1_i, the macro averaged F1 is defined as,

Macro-F1 = (∑ F1_i) / n        (4.6)

The equation 4.6 can be derived as,

Macro-F1 = (∑ (2 × P_i × R_i) / (P_i + R_i)) / n

where,
n is the number of categories.
F1_i is the F-measure value of the ith category.
P_i is the precision value of the ith category.
R_i is the recall value of the ith category.
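The macro averaging just defined can be sketched as follows (class and method names are illustrative):

```java
// Macro averaged F1: compute F1 per category from its precision and
// recall, then take the plain mean over all categories, so every
// category carries equal weight regardless of its size.
public class MacroF1 {
    /** F1_i = 2 * P_i * R_i / (P_i + R_i). */
    public static double f1(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }

    /** Mean of the per-category F1 scores. */
    public static double macroF1(double[] precision, double[] recall) {
        double sum = 0.0;
        for (int i = 0; i < precision.length; i++) {
            sum += f1(precision[i], recall[i]);
        }
        return sum / precision.length;
    }
}
```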
4.1.2.2 Micro Averaged F-measure
In micro averaging, first the micro averaged precision and recall are
computed globally by adding the individual true positive, false positive and
false negative decisions of the system. Then the micro averaged F-measure
is calculated by taking the harmonic mean of the micro averaged precision
and micro averaged recall. Micro averaged F-measure assigns equal weight
to every document. It is generally dominated by the classifier’s performance
over common categories.
Given a training dataset with n categories, the micro averaged F1 is
defined by,

Micro-F1 = (2 × P_micro × R_micro) / (P_micro + R_micro)        (4.8)

In equation 4.8, P_micro is the micro averaged precision for the system,
which is given by,

P_micro = (∑ TP_i) / (∑ (TP_i + FP_i))

and R_micro is the micro averaged recall value for the system, given by,

R_micro = (∑ TP_i) / (∑ (TP_i + FN_i))

where,
n is the number of categories.
TP_i is the number of documents assigned correctly to class i.
FP_i is the number of documents that do not belong to class i but are
assigned to class i incorrectly by the classifier.
FN_i is the number of documents that are not assigned to class i by the
classifier but which actually belong to class i.
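The micro averaging procedure can be sketched as follows (class and method names are illustrative):

```java
// Micro averaged F1: pool the TP/FP/FN counts over all categories
// first, then take the harmonic mean of the pooled precision and
// recall, so every document decision carries equal weight.
public class MicroF1 {
    public static double microF1(int[] tp, int[] fp, int[] fn) {
        int sumTp = 0, sumFp = 0, sumFn = 0;
        for (int i = 0; i < tp.length; i++) {
            sumTp += tp[i];
            sumFp += fp[i];
            sumFn += fn[i];
        }
        double p = (sumTp + sumFp == 0) ? 0.0 : (double) sumTp / (sumTp + sumFp);
        double r = (sumTp + sumFn == 0) ? 0.0 : (double) sumTp / (sumTp + sumFn);
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```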
4.2 EXPERIMENTAL DATASET
The 20 Newsgroups data set is a collection of approximately 20,000
newsgroup documents taken from the Usenet newsgroups collection and
partitioned (nearly) evenly across 20 different newsgroups [54]. The 20
newsgroups collection has become a popular data set for experiments in text
applications of machine learning techniques, such as text classification and
text clustering. The data is organized into 20 different newsgroups, each
corresponding to a different topic. Some of the newsgroups are very closely
related to each other (e.g. comp.sys.ibm.pc.hardware /
comp.sys.mac.hardware), while others are highly unrelated (e.g.
misc.forsale / soc.religion.christian). Each category contains 1,000
articles and 4% of the articles are
cross-posted. The categories in 20Newsgroups are as follows:
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
4.3 PERFORMANCE ANALYSIS
To evaluate the performance of the proposed classification approach,
experiments are conducted over 20Newsgroups corpus. The 20Newsgroups
corpus is a collection of approximately 20,000 newsgroup documents nearly
uniformly distributed among 20 groups. In this corpus some newsgroups are
very closely related to each other and some are highly unrelated. The
20Newsgroups corpus has become a popular dataset for experiments in text
classification systems. Compared with the asymmetrical category
distributions in other corpora, the 20 categories in the 20Newsgroups
corpus are nearly uniformly distributed. The macro averaged F-measure
performance is almost similar to that of the micro averaged F-measure in
this nearly uniform category distribution corpus [27], [30], [33], [37].
Table 4.2 shows the details of the 20Newsgroups categories used for
evaluation.
Table 4.2: Details of the 20Newsgroups Categories used for Evaluation.

Category              Number of Training   Number of Test   Total Number of
                      Documents            Documents        Documents
comp.graphics         189                  37               226
misc.forsale          204                  42               246
rec.autos             198                  40               238
sci.space             193                  39               232
talk.religion.misc    216                  42               258
Total                 1000                 200              1200
Table 4.3 summarizes the macro averaged F-measure results of the
proposed approach compared with term frequency alone strategy and
synonym frequency alone strategy for 20Newsgroups categories. The results
obtained in the experiment suggest that the integration of the term frequency
with its synonym frequency improved the text classification performance
significantly compared with classification using either synonym frequency
strategy or term frequency strategy.
Labels in table 4.3 are defined as follows:
Category stands for the name of the 20Newsgroups categories used
for evaluation.
Proposed Approach (tf+sf) is the classification strategy of calculating
sum of term and its synonym frequencies.
Synonym Frequency (sf) is the classification approach using only the
frequency of synonyms.
Term Frequency (tf) is the strategy used for the classification using
only term frequency.
F1 score is the F-measure value for individual categories under each
strategy.
Macro averaged F-measure is the category pivoted performance
measure of the system.
Table 4.3: Macro Averaged F-measure Results for 20Newsgroups
Categories.

                                          F1 score
Category                   Proposed Approach   Synonym Frequency   Term Frequency
                           (tf + sf)           (sf)                (tf)
comp.graphics              0.718               0.667               0.641
misc.forsale               0.720               0.665               0.648
rec.autos                  0.719               0.668               0.650
sci.space                  0.719               0.666               0.646
talk.religion.misc         0.721               0.668               0.644
Macro Averaged F-measure   0.719               0.667               0.646
To calculate the macro averaged F1-score, the F1-score is computed
for each of the categories first, and then the average of all F1 scores is
calculated. The values in table 4.3 show that the best overall macro
averaged F-measure value is achieved by the proposed classification
approach, i.e. the sum of term and its synonym frequencies, at 0.719, which
is higher than that of classification using the synonym frequency strategy
(0.667) or only the term frequency strategy (0.646).
The figure 4.1 represents the performance of the proposed system in
comparison with the synonym frequency and traditional term frequency
approaches for 20Newsgroups dataset.
Figure 4.1: Macro Averaged F1-score for 20Newsgroups Data Corpus.
In figure 4.1 the y-axis represents the macro averaged F1-score (in
percentage) and the x-axis represents the different classification strategies
including the proposed approach. The results in the figure show that the
macro averaged F1 value for the proposed system (tf + sf) reached 71.9%,
an improvement of 5.2% over the synonym frequency (sf) strategy and
7.3% over the term frequency (tf) strategy.
Table 4.4 summarizes the micro averaged F-measure results of the
proposed approach compared with the term frequency strategy and the
synonym frequency strategy for the 20Newsgroups categories. The results
obtained in the experiment show that adding the term frequency to its
synonym frequency improves the text classification performance
compared with classification using either the synonym frequency approach
or the term frequency approach.
Labels in table 4.4 are defined as follows:
Category refers to the name of the 20Newsgroups categories used for
evaluation.
Proposed Approach (tf+sf) is the classification strategy of calculating
sum of term and its synonym frequencies.
Synonym Frequency (sf) is the strategy used for the classification
using only the synonym frequency.
Term Frequency (tf) is the classification approach using only the
frequency of terms in the document.
Precision and Recall are the evaluation measures used to compare
the system’s result with the previously known results.
Micro averaged F-measure is the document pivoted performance
measure of the classification system.
Table 4.4: Micro Averaged F-measure Results for 20Newsgroups
Categories.

                     Proposed Approach     Synonym Frequency     Term Frequency
                     (tf + sf)             (sf)                  (tf)
Category             Precision   Recall    Precision   Recall    Precision   Recall
comp.graphics        0.720       0.726     0.670       0.674     0.651       0.656
misc.forsale
rec.autos
sci.space
talk.religion.misc
Micro averaged
F-measure            0.723                 0.672                 0.653
The micro averaged F1-score is calculated by computing the F1-score
globally, regardless of categories. The individual decisions of the system
are added together and then applied to get the performance measure. The
values in table 4.4 show that the proposed approach of classification using
the sum of term and its synonym frequencies achieves the best micro
averaged performance value of 0.723, compared to 0.672 for the synonym
frequency approach and 0.653 for the term frequency approach.
The figure 4.2 represents the performance of the proposed system for
20Newsgroups data set plotted in comparison with the synonym frequency
and term frequency classification approaches.
Figure 4.2: Micro Averaged F1-score for 20Newsgroups Data Corpus.
In figure 4.2 the y-axis represents the micro averaged F1-score (in
percentage) and the x-axis represents the different classification strategies
including the proposed approach. The results in the figure show that the
proposed system’s (tf + sf) micro averaged performance score reaches
72.3%, an increase of 5.1% over the synonym frequency (sf) approach and
7% over the traditional term frequency (tf) approach.
Table 4.5 displays the macro averaged F-measure results for the
proposed classification approach of counting the sum of term and
synonym frequencies, compared to classification using only synonym
frequency or only term frequency, for numbers of keywords (sizes of the
categories profile) ranging from 50 to 300.
Labels in table 4.5 are defined as follows:
Size of Categories Profile stands for the size of test data set or
number of keywords in the categorical profile.
Proposed Approach (tf+sf) is the classification strategy of calculating
sum of term and its synonym frequency.
Synonym Frequency (sf) is the strategy used for the classification
using only the synonym frequency.
Term Frequency (tf) is the classification approach using only the
frequency of terms in the document.
Macro averaged F1-score is the category pivoted performance
measure of the system.
Table 4.5: Macro Averaged F-measure Results based on Size of Categories
Profile.

                              Macro averaged F1 score
Size of Categories   Proposed Approach   Synonym Frequency   Term Frequency
Profile              (tf + sf)           (sf)                (tf)
50                   0.714               0.664               0.637
100                  0.717               0.663               0.646
150                  0.716               0.663               0.649
200                  0.717               0.666               0.643
250                  0.719               0.667               0.646
300                  0.719               0.667               0.646
Figure 4.3 shows the macro averaged performance of the three
different classification approaches i.e. proposed approach using sum of term
and its synonym frequency, using synonym frequency only and using term
frequency only, plotted against the size of the categories profile or number of
key terms ranging from 50 to 300. The macro averaged F1-score points show
a tendency to increase as the number of keywords grows. It remains
constant when the number of key terms exceeds 250.
Figure 4.3: Macro Averaged F- measure Results for Varying Size of
Categories Profile.
The macro averaged F-measure performance of the three
approaches for classification i.e. using only term frequency (tf), using only
synonym frequency (sf) and using sum of term and its synonym frequencies
(tf+sf) for 20Newsgroups data corpus plotted against the categorical profile
size ranging from 50 to 300 is given in figure 4.3. The x-axis represents the
number of key words; y-axis represents the macro averaged F-measure
value for the three classification approaches. From the figure it is concluded
that among the three approaches the best F1-score points are achieved by
the proposed approach of classification using the sum of term and its
synonym frequencies.
Table 4.6 displays the micro averaged F-measure results for the
proposed approach of calculating the sum of term and synonym
frequencies for classification, compared to classification using the
frequency of synonyms or the frequency of terms, for numbers of keywords
(sizes of the categories profile) ranging from 50 to 300.
Labels in table 4.6 are defined as follows:
Size of Categories Profile refers to the size of test data set or number
of keywords in categorical profile.
Proposed Approach (tf+sf) is the classification strategy of calculating
sum of term and its synonym frequencies.
Synonym Frequency (sf) is the classification approach using only the
frequency of synonyms.
Term Frequency (tf) is the strategy used for the classification using
only term frequency.
Micro averaged F-measure is the document pivoted performance
measure of the classification system.
Table 4.6: Micro Averaged F-measure Results based on Size of Categories
Profile.

                              Micro averaged F1 score
Size of Categories   Proposed Approach   Synonym Frequency   Term Frequency
Profile              (tf + sf)           (sf)                (tf)
50                   0.717               0.667               0.648
100                  0.720               0.669               0.649
150                  0.720               0.668               0.652
200                  0.721               0.671               0.651
250                  0.723               0.672               0.653
300                  0.723               0.672               0.653
Figure 4.4 displays the micro averaged performance of the three
different classification approaches i.e. proposed approach of using sum of
term and its synonym frequency, using synonym frequency alone and using
term frequency alone plotted against the size of the categories profile or
number of key terms ranging from 50 to 300. The micro averaged F1-score
points show the tendency to increase as the number of keywords grows. The
micro averaged F-measure results remain constant when the number of key
terms exceeds 250.
Figure 4.4: Micro Averaged F- measure Results for Varying Size of
Categories Profile.
The figure 4.4 shows the micro averaged F-measure performance
results of classification on the 20Newsgroups corpus using term frequency,
using synonym frequency and using the proposed approach that is sum of
term and its synonym frequencies for the different sizes of the key terms
ranging from 50 to 300. In the figure, size of categorical profile is given by the
x-axis and the micro averaged F1-scores for the classification approaches
are given by the y-axis. The results obtained in the experiment show that
the proposed classification strategy of calculating the sum of term and its
synonym frequencies achieves the best F1 value compared to the other two
classification approaches.
4.4 INFERENCE FROM THE RESULT
The experimental results show that the proposed approach of text
document classification using the sum of term and its synonym frequencies
is effective in increasing the performance of the classification system. The
results obtained in the experiments on the 20Newsgroups data corpus
show that the proposed approach achieves a higher F-measure value than
the other two approaches. For the proposed approach, the macro averaged
F1 value reaches 71.9%, a rise of 5.2% over the classification strategy of
using only synonym frequency (sf) and a rise of 7.3% over the term
frequency (tf) alone approach. The proposed system also achieves a micro
averaged F-measure of 72.3%, an increase of 5.1% over the synonym
frequency approach and of 7% over the approach of classification using
only term frequency.
The proposed approach also outperforms the other two approaches
when the experiment is carried out with varying sizes of categories profiles.
When the experiments are carried out with different sizes of categorical
profile, all three approaches reach their peak performance when the
number of key terms exceeds 200. Among them, the proposed approach
achieves the best macro averaged and micro averaged F1 scores, showing
consistent performance compared to the other two approaches.
Generally, macro averaging gives equal weight to each class, and
micro averaging gives equal weight to each per-document classification
decision. Because the F1 measure ignores true negatives and its magnitude
is mostly determined by the number of true positives, large categories
dominate small categories in micro averaging, whereas the macro averaged
F-measure is dominated by the system’s performance on small categories.
Since the categories in the 20Newsgroups data corpus are nearly uniformly
distributed, compared to the asymmetrical category distributions in other
corpora, the macro averaged F-measure performance (0.719) is almost
similar to that of the micro averaged F-measure (0.723).
5. CONCLUSION AND FUTURE ENHANCEMENT
This research proposes an approach to classify the text documents
based on the integration of key term and its synonym frequencies in the
document to solve the synonymy problem in text classification process. To
classify a document, the system extracts the synonyms for all the key terms
in the text document using WordNet which arranges words into groups of
synonyms and combines them with the key terms to form a new document
representative vector. Hence the system counts the occurrences of both the
key terms and their corresponding synonyms for the classification, reducing
the synonymy problem. The experimental results on the 20Newsgroups
dataset show that incorporating the frequency of synonyms with the
frequency of key terms in the document increases the performance of the
classification system compared to the classification approaches using only
synonym frequency or only term frequency.
The proposed system uses only the first-sense synonyms of the key
term for the classification process, since WordNet returns an ordered list
of synsets for a term such that more commonly used senses are listed
before less commonly used ones. Also, a word usually has multiple
synonyms with somewhat different meanings, and it is difficult to choose
the correct synonym to use. Future work includes providing an option for
the user to specify the number of senses or synsets to be used for the
classification process, and using a disambiguation strategy to identify the
proper synonyms for a key term.
REFERENCES

[1] A. Moschitti and R. Basili, “Complex Linguistic Features for Text
Classification: A Comprehensive Study,” 26th European Conference
on IR Research (ECIR), pp. 181-196, 2004.
[2] C. Apte, F. Damerau and S.M. Weiss, "Automated Learning of
Decision Rules for Text Categorization," ACM Transactions of
Information Systems, Vol.12, No.3, pp.223-251, 1994.
[3] D. Koller and M. Sahami, “Hierarchically Classifying Documents using
Very Few Words,” 14th International Conference on Machine Learning
(ICML), 1998.
[4] D. Mladenic and M. Globelnik, “Word Sequences as Features in Text
Learning,” 17th Electrotechnical and Computer Science Conference
(ERK), pp. 145-148, 1998.
[5] D.D. Lewis, “An Evaluation of Phrasal and Clustered Representations
on a Text Categorization Task,” ACM SIGIR ’92, pp. 37-50, 1992.
[6] D.D. Lewis and M. Ringuette, "A Comparison of Two Learning
Algorithms for Text Categorization," 3rd Annual Symposia on
Document Analysis and Information Retrieval (SDAIR), pp.81-93,
1994.
[7] E. Leopold and J. Kindermann, “Text Categorization with Support
Vector Machines: How to Represent Text in Input Space?” Machine
Learning, Vol. 46, No. 1-3, pp. 423-444, 2002.
[8] E.D. Wiener, J.O. Pedersen and A.S. Weigend: "A Neural Network
Approach to Topic Spotting," 4th Symposia on Document Analysis
and Information Retrieval (SDAIR), pp.317-332, 1995.
[9] Fabrizio Sebastiani, “Machine Learning in Automated Text
Categorization”, ACM Computing Surveys, 1999.
[10] J. Fürnkranz, “A Study using n-Gram Features for Text
Categorization,” Technical Report OEFAI-TR-98-30, Austrian Institute
for Artificial Intelligence, Vienna, Austria, 1998.
[11] J. Fürnkranz, T. Mitchell and E. Riloff, “A Case Study in Using
Linguistic Phrases for Text Categorization on the WWW,” First
AAAI Workshop Learning for Text Categorization, pp. 5-12, 1998.
[12] K. Lang, “Newsweeder: Learning to Filter News,” 12th International
Conference on Machine Learning, pp.331-339, 1995.
[13] L.D. Baker and A. K. McCallum, “Distributional Clustering of Words for
Text Classification,” ACM SIGIR, pp. 96-103, 1998.
[14] M. Lan, S.Y. Sung, H.B. Low and C.L. Tan, “A Comparative Study on
Term Weighting Schemes for Text Categorization,” International
Conference on Neural Networks (IJCNN ’05), pp. 546-551, 2005.
[15] M.F. Caropreso, S. Matwin and F. Sebastiani, “A Learner-
Independent Evaluation of the Usefulness of Statistical Phrases for
Automated Text Categorization,” Text Databases and Document
Management: Theory and Practice, A.G. Chin, pp. 78-102, Idea
Group Publishing, 2001.
[16] R. Bekkerman, R. El-Yaniv, N. Tishby and Y. Winter, “Distributional
Word Clusters versus Words for Text Categorization,” J. Machine
Learning Research, Vol. 3, pp. 1182-1208, 2003.
[17] R. Rastogi and K. Shim, "A Decision Tree Classifier that Integrates
Building and Pruning," 24th International Conference on Very
Large Data Bases, pp.404-415, 1998.
[18] R.E. Schapire and Y. Singer, "BoosTexter - A Boosting-based System
for Text Categorization," Machine Learning, Vol. 39, No.2-3, pp.135-
168, 2000.
[19] S. Scott and S. Matwin, “Feature Engineering for Text Classification,”
16th International Conference on Machine Learning (ICML), pp. 379-
388, 1999.
[20] S.T. Dumais, J.C. Platt, D. Heckerman and M. Sahami, “Inductive
Learning Algorithms and Representations for Text Categorization,”
Seventh International Conference on Information and Knowledge
Management (CIKM), pp. 148-155, 1998.
[21] T. Joachims, "Text Categorization with Support Vector Machines:
Learning with many Relevant Features," 10th European
Conference on Machine Learning, No.1398, pp.137-142, 1998.
[22] Y. Ko, J. Park and J. Seo, “Improving Text Categorization Using the
Importance of Sentences,” Information Processing and Management,
Vol. 40, No. 1, pp. 65-79, 2004.
[23] Y. Yang, "An Evaluation of Statistical Approaches to Text
Categorization," Information Retrieval, Vol.1, No.1, pp. 67-
88, 1999.
[24] Atika Mustafa, Ali Akbar and Ahmer Sulthan, “Knowledge Discovery
Using Text Mining: A Programmable Implementation on Information
Extraction and Categorization”, International Journal of Multimedia and
Ubiquitous Engineering, Vol. 4, No. 2, pp. 183-188, April 2009.
[25] Aurangzeb Khan, Baharum B. Bahurdin and Khairullah Khan, “An
Overview of E-documents Classification”, International Journal of
Machine Learning and Computing IPCSIT Vol.3, pp. 544-552, IACSIT
Press, Singapore, 2011.
[26] Chengyuan Ma and Chin-Hui Lee, “A Regularized Maximum Figure-of-
Merit (rMFoM) Approach to Supervised and Semi-Supervised
Learning”, IEEE Transactions on Audio, Speech and Language
Processing, Vol. 19, No. 5, pp. 1316-1327, July 2011.
[27] Chenping Hou, Feiping Nie, Fei Wang, Changshui Zhang and Yi Wu,
“Semisupervised Learning Using Negative Labels”, IEEE Transactions
on Neural Networks, Vol. 22, No. 3, pp. 420-432, March 2011.
[28] Christian Wartena, Rogier Brussee and Wout Slakhorst, “Keyword
Extraction using Word Co-occurrence”, IEEE International Conference
on Database and Expert Systems Applications (DEXA), pp. 54-58,
August 2010.
[29] Christoph Goller, Joachim Loning, Thilo Will and Werner Wolff,
“Automatic Document Classification: A Thorough Evaluation of
Various Methods” International Journal of Computer Applications, Vol.
35, No.6, pp. 0975 –8887, December 2011.
[30] Deng Cai and Xiaofei He, “Manifold Adaptive Experimental Design for
Text Categorization”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 24, No. 4, pp. 707-719, April 2012.
[31] Der Chiang Li and Chiao Wen Liu, “Extending Attribute Information for
Small Data Set Classification”, IEEE Transactions on Knowledge and
Data Engineering, Vol. 24, No. 3, pp. 452-464, March 2012.
[32] Fang Lu and Qingyuan Bai, “A Refined Weighted K-Nearest
Neighbors Algorithm for Text Categorization”, IEEE International
Conference on Intelligent Systems and Knowledge Engineering
(ISKE), pp. 326-330, November 2010.
[33] Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee, “A Fuzzy Self-
Constructing Feature Clustering Algorithm for Text Classification”,
IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No.
3, pp. 335-349, March 2011.
[34] Lifei Chen and Gongde Guo, “Using Class-dependent Projection
for Text Categorization”, IEEE International Conference on Machine
Learning and Cybernetics (ICMLC), pp. 1305-1310, July 2010.
[35] Ma Zhanguo, Feng Jing, Chen Liang, Hu Xiangyi and Shi Yanqin,
“An Improved Approach to Terms Weighting in Text
Classification”, IEEE International Conference on Computer and
Management, pp. 1-4, 2011.
[36] Makoto Suzuki, Naohide Yamagishi, Takashi Ishida, Masayuki Goto
and Shigeichi Hirasawa, “On a New Model for Automatic Text
Categorization Based on Vector Space Model”, IEEE International
Conference on Systems, Man and Cybernetics (SMC), pp. 3152-3159,
2010.
[37] Man Lan, Chew Lim Tan, Jian Su and Yue Lu, “Supervised and
Traditional Term Weighting Methods for Automatic Text
Categorization”, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 31, No. 4, pp. 721-735, April 2009.
[38] Murat Can Ganiz, Cibin George and William M. Pottenger, “Higher
Order Naive Bayes: A Novel Non-IID Approach to Text Classification”,
IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No.
7, pp. 1022-1034, July 2011.
[39] Nianyun Shi and Lingling Liu, “A New Feature Selection Method
Based on Distributional Information for Text Classification”, IEEE
International Conference on Progress in Informatics and Computing
(PIC), pp. 190-194, December 2010.
[40] Nikos Tsimboukakis and George Tambouratzis, “Word-Map Systems
for Content-Based Document Classification”, IEEE Transactions on
Systems, Man and Cybernetics-Part C: Applications and Reviews, Vol.
41, No. 5, pp. 662-673, September 2011.
[41] Ning Zhong, Yuefeng Li and Sheng-Tang Wu, “Effective Pattern
Discovery for Text Mining”, IEEE Transactions on Knowledge and
Data Engineering, Vol. 24, No. 1, pp. 30-44, January 2012.
[42] Sheng-Tun Li and Fu-Ching Tsai, “Noise Control in Document
Classification Based on Fuzzy Formal Concept Analysis”, IEEE
International Conference on Fuzzy Systems, pp. 2583-2588, June
2011.
[43] Tomoharu Iwata, Toshiyuki Tanaka, Takeshi Yamada and Naonori
Ueda, “Improving Classifier Performance Using Data with Different
Taxonomies”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 23, No. 11, pp. 1668-1677, November 2011.
[44] Weibin Deng, “A Hybrid Algorithm for Text Classification based on
Rough Set”, IEEE International Conference on Computer Research
and Development (ICCRD), pp. 406-410, March 2011.
[45] Xiao-Bing Xue and Zhi-Hua Zhou, “Distributional Features for Text
Categorization”, IEEE Transactions on Knowledge and Data
Engineering, Vol. 21, No. 3, pp. 428-442, March 2009.
[46] Xiaojun Quan, Wenyin Liu and Bite Qiu, “Term Weighting Schemes for
Question Categorization”, IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 33, No. 5, pp. 1009-1021, May 2011.
[47] Yaxin Bi, Shengli Wu, Hui Wang and Gongde Guo, “Combination of
Evidence-based Classifiers for Text Categorization”, IEEE
International Conference on Tools with Artificial Intelligence (ICTAI),
pp. 422-429, November 2011.
[48] Ying Liu and Han Tong Loh, “Domain Concept Handling in Automated
Text Categorization”, IEEE International Conference on Industrial
Electronics and Applications (ICIEA), pp. 1543-1549, June 2010.
[49] Isabel Volpe, Viviane Moreira and Christian Huyck, “Cell Assemblies
for Query Expansion in Information Retrieval”, IEEE International
Conference on Neural Networks, pp. 551-558, August 2011.
[50] WordNet: A Lexical Database for English, Princeton University,
http://wordnet.princeton.edu/
[51] J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Second Edition, Morgan Kaufmann Publishers, 2006.
[52] Herbert Schildt, Java 2: The Complete Reference, Fifth Edition, Tata
McGraw-Hill Publishers.
[53] http://java.sun.com/
[54] http://www.ai.mit.edu/~jrennie/20Newsgroups/
APPENDIX 1: SAMPLE SCREEN SHOTS
Loading the text document from 20Newsgroups data corpus
Displaying the content of the selected file
Tokenizing the text
Process of noise removal
Removing the stop words
Performing word stemming
Displaying sum of key term and its synonym frequencies
Displaying the weights of the key terms
Sample output of the system
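The screenshots above trace the system's processing pipeline: loading a 20Newsgroups document, tokenization, noise removal, stop-word removal, stemming, summing each key term's frequency with those of its synonyms, and computing key-term weights. As an illustration only, and not the thesis implementation, this pipeline can be sketched in Python; the stop-word list, the suffix stemmer, and the `SYNONYMS` table are simplified stand-ins for the WordNet-based components [50] described in the thesis.

```python
import re
from collections import Counter

# Toy stop-word list; the real system uses a much larger one
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def tokenize(text):
    # Noise removal and tokenization: keep alphabetic tokens only, lowercased
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Naive suffix stripper standing in for a real stemming algorithm
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical synonym table; the thesis obtains synonyms from WordNet
SYNONYMS = {"document": {"file", "text"}}

def term_weights(text):
    # Preprocess: tokenize, drop stop words, stem
    terms = [stem(w) for w in tokenize(text) if w not in STOP_WORDS]
    freq = Counter(terms)
    # Fold each synonym's frequency into its key term's count
    combined = {}
    for term, count in freq.items():
        if term in SYNONYMS:
            count += sum(freq.get(s, 0) for s in SYNONYMS[term])
        combined[term] = count
    # Weight = combined frequency normalised by the total term count
    total = sum(freq.values())
    return {t: c / total for t, c in combined.items()}
```

For example, `term_weights("The document and the file describe documents")` folds the count of the synonym "file" into the key term "document", giving "document" a weight of 0.75 under this toy normalisation.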
TECHNICAL BIOGRAPHY
Mrs. Praneetha K. (RRN: 1145213) was born on 31st May 1982 in Karkala,
Karnataka. She did her schooling at Christ King School. She received her
B.Sc. degree in Computer Science from Sri Bhuvanendra College, Mangalore
University, in the year 2003, and her M.C.A. degree from N. M. A. M. Institute
of Technology, Visvesvaraya Technological University, in the year 2006.
She is currently pursuing her M.Phil. degree in Computer Science in the
Department of Computer Applications of B.S. Abdur Rahman University,
Chennai. She has participated in an international workshop on “Advances in
Data Mining and Web Mining”. Her areas of interest include Information
Retrieval, Web Mining and Natural Language Processing. Her e-mail ID is
[email protected] and her contact number is +91 8939955709.
Publications:
Praneetha K., “Classification of Text Documents using WordNet”,
International Conference on Recent Trends in Computer Science and
Engineering (ICRTCSE), May 2012.
Praneetha K., “Text Representation using WordNet for the Reduction
of Synonymy”, International Conference on Computational Intelligence
and Communication (ICCIC), pp. 144-148, July 2012.
Praneetha K. and Dr. Angelina Geetha, “An Enhanced Text Document
Classification based on Terms and Synonyms Relations”, IFRSA
International Journal of Data Warehousing and Mining (IIJDWM), Vol.
2, No. 3, pp. 175-181, August 2012.