Clustering of Text Documents


1 INTRODUCTION

1.1 Introduction:

Clustering is a key process in engineering and in various fields of scientific research; it tries to group a set of points into clusters such that points in the same cluster are more homogeneous to each other than points in different clusters. Document clustering groups documents based on the similarity among them in an unsupervised manner. It is used in quick topic extraction, filtering and information retrieval. We are facing an ever increasing volume of text documents: texts flow into vast collections of documents in repositories, digital libraries and digitized personal information such as articles and emails. These have brought challenges for the effective and efficient organization of text documents.

There is no single optimization method known to solve all optimization problems. Many optimization methods have been developed in recent years for solving different types of optimization problems. The modern optimization methods (sometimes called nontraditional optimization methods) are powerful and popular methods for solving complex engineering problems. These methods include the particle swarm optimization algorithm, neural networks, genetic algorithms, ant colony optimization, artificial immune systems, and fuzzy optimization.

The Particle Swarm Optimization algorithm (abbreviated as PSO) is a novel population-based stochastic search algorithm and an alternative approach to complex non-linear optimization problems. The PSO algorithm was first introduced by Kennedy and Eberhart in 1995, and its basic idea was originally inspired by simulation of the social behavior of animals such as bird flocking and fish schooling. It is based on the natural process of group communication to share individual knowledge when a group of birds or insects searches for food or migrates in a search space, even though none of them knows where the best position is. From the nature of this social behavior, if any member finds a desirable path, the rest of the members will follow quickly.

The PSO algorithm thus learns from animal activity or behavior to solve optimization problems. In PSO, each member of the population is called a particle and the population is called a swarm. The search starts with a randomly initialized population, with each particle moving in a randomly chosen direction.

This thesis also considers a meta-heuristic called Tabu Search and discusses the features of the tabu search algorithm. It is one of the most efficient heuristics for finding quality solutions in a relatively short running time. The principal characteristic of tabu search is a mechanism inspired by human memory: the information stored in memory is used to guide and restrict the future search so as to obtain quality solutions and to overcome local optimality. This thesis provides insight into the working of the tabu search algorithm on document clustering problems and its merging with other optimization techniques.

A particle swarm optimization (PSO) method has also been applied to the economic dispatch (ED) problem in power systems. Many nonlinear characteristics of the generator, such as ramp rate limits, prohibited operating zones, and nonsmooth cost functions, are considered by this method in practical generator operation. The feasibility of the method was demonstrated on three different systems, and it was compared with the GA method in terms of solution quality and computational efficiency. The experimental results show that the PSO method was indeed capable of obtaining higher quality solutions efficiently in ED problems.

Applications of Tabu Search (TS), a heuristic method originally proposed by Glover in 1986, to various combinatorial problems have appeared in the operations research literature. In several cases, the methods described provide solutions very close to optimality and are among the most effective, if not the best, at tackling the difficult problems at hand. These successes have made TS extremely popular among those interested in finding good solutions to the large combinatorial problems encountered in many practical settings. Several papers, book chapters, special issues and books have surveyed the rich TS literature (a list of some of the most important references is provided in a later section). In spite of this abundant literature, there


still seem to be many researchers who, while they are eager to apply TS to new problem settings, find it difficult to properly grasp the fundamental concepts of the method, its strengths and its limitations, and to come up with effective implementations. The purpose of this paper is to address this situation by providing an introduction in the form of a tutorial focusing on the fundamental concepts of TS. Throughout the paper, two relatively straightforward, yet challenging and relevant, problems will be used to illustrate these concepts: the Classical Vehicle Routing Problem (CVRP) and the Capacitated Plant Location Problem (CPLP). These will be introduced in the following section. The remainder of the paper is organized as follows. The basic concepts of TS (search space, neighborhoods, and short-term tabu lists) are described and illustrated in Section 2. Intermediate, yet critical, concepts, such as intensification and diversification, are described in Section 3. This is followed in Section 4 by a brief discussion of advanced topics and recent trends in TS, and in Section 5 by a short list of key references on TS and its applications. Section 6 provides practical tips for newcomers struggling with unforeseen problems as they first try to apply TS to their favorite problem. Section 7 concludes the paper with some general advice on the application of TS to combinatorial problems.

Tabu search (TS) has its antecedents in methods designed to cross boundaries of feasibility or local optimality treated as barriers in classical procedures, and to systematically impose and release constraints to permit exploration of otherwise forbidden regions (Glover, 1977). The tabu search name and terminology come from Glover (1986). A distinguishing feature of the approach is its use of adaptive memory and special associated problem-solving strategies. (TS provides the origin of the memory-based and strategy-intensive focus in the metaheuristic literature, as opposed to methods that are memory-less or use only a weak inheritance-based memory. It is also responsible for emphasizing the use of structured designs to exploit historical patterns of search, as opposed to processes that rely almost exclusively on randomization.) The fundamental principles of tabu search were elaborated in a series of papers in the late 1980s and early 1990s, and have been assembled in the book Tabu Search (Glover and Laguna, 1997). The remarkable successes of tabu search in solving hard optimization problems (especially those arising in real-world applications) have caused an explosion of new TS applications in the last several years.


The tabu search philosophy is to derive and exploit a collection of intelligent problem-solving strategies, based on implicit and explicit learning procedures. The adaptive memory framework of TS not only involves the exploitation of the history of the problem-solving process, but also entails the creation of structures to make such exploitation possible. Problem-solving history extends to experience gained from solving multiple instances of a problem class by joining TS with an associated learning approach called Target Analysis (see, e.g., chapter 9 of Glover and Laguna, 1997). TS is an iterative procedure designed for the solution of optimization problems. TS starts with a random solution and evaluates the fitness function for that solution. Then all possible neighbors of the current solution are generated and evaluated. A neighbor is a solution which can be reached from the current solution by a simple, basic transformation. If the best of these neighbors is not in the tabu list, it is picked to be the new current solution. The tabu list keeps track of previously explored solutions and prohibits TS from revisiting them. Thus, if the best neighbor solution is worse than the current design, TS will go uphill; in this way, local minima can be overcome. Any reversal of these solutions or moves is then a forbidden move and is classified as tabu. Aspiration criteria which allow overriding of tabu status can be introduced if a tabu move is still found to lead to a better fitness than the fitness of the current optimum. If no more neighbors are present (all are tabu), or when no improvements are found during a predetermined number of iterations, the algorithm stops; otherwise, the algorithm continues the TS procedure.

Engineering and technology have continuously provided examples of difficult optimization problems. Here we shall present the tabu search technique which, with its various ingredients, may be viewed as an engineer-designed approach: no clean proof of convergence is known, but the technique has shown remarkable efficiency on many problems. The roots of tabu search go back to the 1970's; it was first presented in its present form by Glover [Glover, 1986]; the basic ideas have also been sketched by Hansen [Hansen, 1986]. Additional efforts of formalization are reported in [Glover, 1989], [de Werra & Hertz, 1989], [Glover, 1990]. Many computational experiments have shown that tabu search has now become an established optimization technique which can compete with almost all known techniques and which, by its flexibility, can beat many classical procedures. Up to now, there is no formal explanation of this good behavior. Recently, theoretical aspects of tabu search have been investigated [Faigle & Kern, 1992], [Glover, 1992], [Fox, 1993]. A didactic presentation of tabu search and a series of applications have been collected in a recent book [Glover, Taillard, Laguna & de Werra, 1992]. Its interest lies in the fact that success with tabu search often implies that a serious effort of modeling be done from the beginning. The applications in [Glover, Taillard, Laguna & de Werra, 1992] provide many such examples together with a collection of references. A huge collection of optimization techniques has been suggested by researchers from different fields; countless refinements have made these techniques work on specific types of applications. All these procedures are based on some common ideas and are furthermore characterized by a few additional specific features. Among optimization procedures, iterative techniques play an important role: for most optimization problems no procedure is known in general to obtain an "optimal" solution directly. The general step of an iterative procedure consists in constructing from a current solution i a next solution j and in checking whether one should stop there or perform another step. Neighbourhood search methods are iterative procedures in which a neighbourhood N(i) is defined for each feasible solution i, and the next solution j is searched among the solutions in N(i).

Non-linear optimization problems are defined by non-linear constraints and/or a non-linear objective. These problems arise in several domains, including chemical engineering, energy analysis, environmental planning, biotechnology and thermal processes, among others. Different techniques and methods are employed to model and solve these problems. A literature survey shows that the most used techniques are evolutionary algorithms [1, 2], swarm optimization [6] and non-linear mathematical programming [15]. Leyffer and Mahajan (2010) present a survey of non-linearly constrained software and methods, focusing on the contrasting strategies of local optimization and global optimization [15]. Some of those approaches, such as genetic algorithms, are reported to require many parameters and to entail considerable implementation effort. In the thermal engineering field, many complex optimization problems arise in practice. Recently, non-linear optimization problems have increasingly been subjected to analysis by non-traditional optimization techniques. Patel and Rao [17, 18] recommend the use of particle swarm optimization (PSO) based on case studies showing that PSO is simple in concept, requires few parameters, is easy to implement and performs well compared to traditional techniques like genetic algorithms [17, 18]. The PSO method has produced good outcomes for a variety of optimization problems, but many authors have pointed out a limitation in its ability to diversify the population (see [8, 24]). To deal with this problem, research efforts are underway on several fronts to

hybridize the PSO method with other meta-heuristics. The most commonly used methods to create PSO hybrids are genetic algorithms and differential evolution algorithms [24]. For global optimization, a PSO-TS hybrid algorithm which joins PSO with tabu search (TS) has been proposed in [11]. More recently, Shelokar et al. hybridized PSO with an ant colony algorithm for continuous optimization [21]. In this work, we focus on a thermal optimization problem known as the T-junction problem, which consists in designing the main channel responsible for evacuating generated heat in electrical machines. The objective is to determine the ideal channel features that optimize the temperature in the system. This problem, identified through a collaborative industrial project, can be formulated as a constrained non-linear optimization problem (CNOP). The fitness function used to evaluate solutions of this problem takes extensive computation time, and the use of meta-heuristics like genetic algorithms in this case has proved to be very time consuming. We apply the PSO meta-heuristic to solve the problem due to its simple implementation and the limited number of parameters to adjust, as well as for the ability to control its fitness function effectively. To avoid premature convergence of the method, a tabu search procedure is embedded within the PSO.

High-density DNA microarrays are one of the most powerful tools for functional genomic studies, and the development of microarray technology allows for measuring the expression levels of thousands of genes simultaneously (Schena et al., 1995). Recent studies have shown that one of the most important applications of microarrays is tumor classification (Cho et al., 2003; Li et al., 2004). Gene selection is an important component of gene expression-based tumor classification systems. Microarray experiments generate large datasets with expression values for thousands or even tens of thousands of genes but only a few tissue samples. Most of the genes monitored in a microarray may be irrelevant to the analysis, and the use of all the genes may potentially inhibit the prediction performance of the classification rule by masking the contribution of the relevant genes (Li, 2006; Li and Yang, 2002; Stephanopoulos et al., 2002; Nguyen and Rocke, 2002; Biceiato et al., 2003; Tan et al., 2004). An efficient way to solve this problem is gene selection, and the selection of discriminatory genes is critical to improving accuracy and decreasing computational complexity and cost.

By selecting relevant genes, conventional classification techniques can be applied to the microarray data. Gene selection may highlight the relevant genes, and it could enable biologists to gain significant insight into the genetic nature of the disease and the mechanisms responsible for it (Guyon et al., 2002; Wang et al., 2005). Several gene selection techniques have been employed in classification problems, such as the t-test filtering approach, as well as artificial intelligence techniques such as genetic algorithms (GAs) and evolutionary algorithms (EAs) (Golub et al., 1999; Furey et al., 2000; Xiong et al., 2001; Peng et al., 2003; Li et al., 2005; Tibshirani et al., 2002; Sima and Dougherty, 2006), simulated annealing, tabu search and particle swarm optimization. The particle swarm optimization (PSO) algorithm (Kennedy and Eberhart, 1995; Shi and Eberhart, 1998; Clerc and Kennedy, 2002), proposed by James Kennedy and R.C. Eberhart in 1995, is motivated by the social behavior of organisms such as bird flocking and fish schooling. Particle swarm optimization comprises a very simple concept, and can be implemented in a few lines of computer code. It requires only a few parameters to adjust, and is computationally inexpensive in terms of both memory requirements and speed. A modified discrete PSO algorithm has been proposed in our previous study (Shen et al., 2004a,b, in press) to reduce dimensionality and has shown satisfactory performance. Although PSO has proved to be a potent search technique for solving optimization problems, there are still many complex situations where PSO tends to converge to local optima and does not perform particularly well. Tabu search (TS) is a powerful optimization procedure that has been successfully applied to a number of combinatorial optimization problems (Glover, 1986). It has the ability to avoid convergence to local minima by employing a flexible memory system. But the convergence speed of TS depends on the initial solution, and the parallelism of the PSO population helps TS find the promising regions of the search space very quickly. In this paper, a hybrid PSO and TS (HPSOTS) approach to gene selection for tumor classification is developed. The incorporation of TS as a local improvement procedure enables the algorithm HPSOTS to overleap local optima and show satisfactory performance. The formulation and corresponding programming flow chart are presented in detail in the paper. To evaluate the performance of HPSOTS, the proposed approach is applied to three publicly available microarray datasets. Moreover, the performance of HPSOTS on these datasets is compared to that of stepwise selection and the pure TS and PSO algorithms. It has been demonstrated that HPSOTS is a useful tool for gene selection and mining high-dimensional data.

1.2 Motivation:

PSO performs excellently in global search but not so well in local search, while TS performs excellently in local search but not so well in global search. Therefore, in this thesis the two algorithms are combined so that the new hybrid algorithm conducts both a global search and a local search in every iteration, and the probability of finding the optimal points increases significantly. However, to the best of the author's knowledge, TSPSO has not been used to cluster text documents. In this study a document clustering algorithm based on TSPSO is proposed.

1.3 Thesis Overview:

This thesis involves clustering documents into categories using optimization algorithms. We start with the data matrix obtained from the text documents after preprocessing. This data matrix is represented with each row as a document vector and each column as the weight of a significant term. The data matrix is provided as input to the AMOC algorithm to find the value of k, and the resulting k value is given to the PSO, TS and TSPSO algorithms to cluster the documents. The results obtained from this process are compared using the obtained VRC values as well as their time complexities.

1.4 Clustering

A general definition of clustering is stated by Brian Everitt et al. [6]: given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to

be determined. The clustering problem can be formalized as an optimization problem, i.e. the minimization or maximization of a function subject to a set of constraints. The goal of clustering can be defined as follows:

Given

I. a dataset X = {x1, x2, ..., xn}

II. the desired number of clusters k

III. a function f that evaluates the quality of a clustering

we want to compute a mapping

γ : {1, 2, ..., n} → {1, 2, ..., k}

that minimizes the function f subject to some constraints. The function f that evaluates the clustering quality is often defined in terms of similarity between objects; it is also called a distortion function or divergence. The similarity measure is the key input to a clustering algorithm.
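To make this formal definition concrete, here is a minimal sketch (illustrative only, not part of the thesis) that evaluates a candidate mapping γ using a within-cluster sum-of-squares objective as the function f; the names f, X and gamma are chosen just for this example.

import numpy as np

def f(X, gamma, k):
    # Within-cluster sum of squared distances for the mapping gamma:
    # X is an (n, d) data matrix, gamma[i] is the cluster index of object i.
    total = 0.0
    for j in range(k):
        members = X[gamma == j]
        if len(members):
            centroid = members.mean(axis=0)
            total += ((members - centroid) ** 2).sum()
    return total

# Example: six points, k = 2; a lower f means a better clustering.
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]], dtype=float)
good = np.array([0, 0, 0, 1, 1, 1])
bad = np.array([0, 1, 0, 1, 0, 1])
print(f(X, good, 2) < f(X, bad, 2))  # True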

1.5 Document clustering

Clustering of documents is used to group documents into relevant topics. The major difficulty in document clustering is its high dimensionality, which requires efficient algorithms that can handle high-dimensional clustering. Document clustering is a major topic in the information retrieval area; examples include search engines. The basic steps used in the document clustering process are shown in Figure 2.

The goal of a document clustering scheme is to minimize intra-cluster distances between documents, while maximizing inter-cluster distances (using an appropriate distance measure between documents). A distance measure (or, dually, a similarity measure) thus lies at the heart of document clustering. The large variety of documents makes it almost impossible to create a general algorithm which works best for all kinds of datasets.


Figure 2. Flow diagram of the basic steps in text clustering

Preprocessing

Text document preprocessing basically consists of stripping all formatting from the article, including capitalization, punctuation, and extraneous markup (like the dateline and tags). Then the stop words are removed. Stop words (i.e., pronouns, prepositions, conjunctions, etc.) are words that do not carry semantic meaning; they can be eliminated using a list of stop words. Eliminating stop words using such a list greatly reduces the amount of noise in the text collection, as well as making the computation easier. Removing stop words leaves us with a condensed version of the documents containing content words only. The next step is stemming. Stemming is the process of reducing derived words to their root form. For English documents, a popular algorithm called the Porter stemmer [7] is used. The performance of text clustering can be improved by using the Porter stemmer.
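As a rough illustration of this preprocessing pipeline, the sketch below strips formatting, removes stop words from a stop-word list, and applies the Porter stemmer; it assumes the NLTK library (with its stopwords corpus downloaded) is available, and the sample sentence is invented.

import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

def preprocess(text):
    # Strip capitalization, punctuation and extraneous markup (e.g. tags).
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    # Eliminate stop words using a stop-word list.
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]
    # Reduce derived words to their root form with the Porter stemmer.
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The clusters were clustering <b>Documents</b> quickly."))
# ['cluster', 'cluster', 'document', 'quickli']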

Document Representation

Preprocessing is done to represent the data in a form that can be used for clustering. There are many ways of representing documents, such as the vector space model, graphical models, etc. [11].

Vector Space Model

The Vector Space Model (VSM) is perhaps the simplest level of document representation [18]. Given a document collection, every word present in the collection is counted as a dimension. If there are d distinct words in total, each document is treated as a d-dimensional vector, whose coordinate values are

the frequencies of appearance of the words in that document. Consequently, this vector is very high-dimensional but extremely sparse, because a collection normally contains so many documents that only a tiny portion of the words actually belongs to an individual document.

This representation model treats words as independent entities, completely ignoring the structural information inside documents, such as syntax and the meaningful relationships between words or between sentences. Recently, many efforts have been made to find a better way of representing text documents. As mentioned, sparsity is a problem of VSM: a document vector has so many unrelated dimensions that they may hide its actual meaning. Researchers have tried to make use of the semantic relatedness of words, or to find some sort of concepts, instead of words, to represent documents. Nevertheless, the simplicity of VSM facilitates fast computation while providing sufficient numerical and statistical information. Hence, it is the common model used in most clustering algorithms nowadays.

The weights assigned to each term can be either the term frequency (tf) or the term frequency-inverse document frequency (tf-idf). In the first case, the frequency of occurrence of a term in a document is included in the vector d_tf = (tf_1, tf_2, ..., tf_m), where tf_i is the frequency of the ith term in the document. Usually, very common words are removed and the terms are stemmed. A refinement of this weighting scheme is the so-called tf-idf weighting scheme. In this approach, a term that appears in many documents should not be regarded as more important than one that appears in few documents, and for this reason it needs to be deemphasized.


Figure 2.3: Vector space model

Figure 2.3 illustrates the vector space model. After preprocessing the document dataset, we have the list of words occurring in the documents. This word list is used as the set of dimensions for representing the documents as vectors. Documents share many words in the dataset, so the dimensionality is high. Figure 2.3 shows three terms (TERM1, TERM2, TERM3) common to three documents (DOC1, DOC2, DOC3); the three terms are considered as the dimensions of the space, and the documents are then drawn in this space as vectors.

Let N be the total number of documents in the collection; let df_i (document frequency) be the number of documents in which the term k_i appears, and freq_ij the raw frequency of the term k_i in the document d_j. The inverse document frequency (idf_i) for k_i is defined as:

idf_i = log(N / df_i) (2.1)

The tf-idf weight of term i in document j is computed by:

w_ij = freq_ij × log(N / df_i) (2.2)

To account for documents of different lengths, each vector is normalized so that it is of unit length.
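Equations (2.1) and (2.2), together with the unit-length normalization, can be transcribed as a short sketch (assuming the documents are already tokenized by the preprocessing step; the function name tfidf_vectors is our own):

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, normalized to unit length as described above.
    N = len(docs)
    # df_i: the number of documents in which term k_i appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        freq = Counter(doc)                 # freq_ij: raw term frequency
        w = {t: f * math.log(N / df[t]) for t, f in freq.items()}  # eq. (2.2)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})        # unit length
    return vectors

docs = [["cluster", "document", "cluster"], ["document", "search"], ["search", "engine"]]
print(tfidf_vectors(docs)[0])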

The main advantages of the Vector Space Model (VSM) are:

♦ The documents can be sorted by decreasing similarity with the query q.
♦ The terms are weighted by importance.
♦ It allows for partial matching: the documents need not have exactly the same terms as the query.


One disadvantage of VSM is that the terms are assumed to be independent.

Moreover, weighting is intuitive and not very formal.

Dimension reduction techniques

Dimension reduction can be divided into feature selection and feature extraction. Feature selection is the process of selecting smaller subsets (features) from a larger set of inputs, while feature extraction transforms the high-dimensional data space into a space of low dimension. The goal of dimension reduction methods is to allow fewer dimensions for broader comparisons of the concepts contained in a text collection.

Similarity Measurement

Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either the pairwise similarity or distance. Before clustering, a similarity/distance measure must be determined. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. In many cases, these characteristics are dependent on the data or the problem context at hand, and there is no measure that is universally best for all kinds of clustering problems.

Moreover, choosing an appropriate similarity measure is crucial for cluster analysis, especially for particular types of clustering algorithms. For example, density-based clustering algorithms, such as DBSCAN, rely heavily on the similarity computation. Density-based clustering finds clusters as dense areas in the data set, and the density of a given point is in turn estimated as the closeness of the corresponding data object to its neighboring objects. Recalling that closeness is quantified as the distance/similarity value, we can see that a large number of distance/similarity computations are required for finding dense areas and estimating the cluster assignment of new data objects. Therefore, understanding the effectiveness of different measures is of great importance in helping to choose the best one.


In general, similarity/distance measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value, which depends on two factors: the properties of the two objects and the measure itself. Four such measures [23] are discussed below.

Euclidean Distance

Euclidean distance is a popular similarity measure used in data clustering. The distance between two documents d_i and d_j is calculated as

D(d_i, d_j) = ( Σ_t (w_t,i − w_t,j)^2 )^(1/2) (2.3)

where w_t,i denotes the weight of term t in document d_i. Euclidean distance is used in the traditional k-means algorithm [2]. The objective of k-means is to minimize the Euclidean distance between the objects of a cluster and that cluster's centroid:

min Σ_{j=1..k} Σ_{d_i ∈ c_j} D(d_i, μ_j)^2 (2.4)

where μ_j is the centroid of cluster c_j.

Cosine Similarity

When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between the vectors, the so-called cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [11] and in the clustering toolkit of [13]. An important property of the cosine similarity is its independence of document length. The similarity of two document vectors d_i and d_j, Sim(d_i, d_j), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

Sim(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|) (2.5)

The cosine measure is used in a variant of k-means called spherical k-means [4]. While k-means aims to minimize the Euclidean distance, spherical k-means intends to maximize the cosine similarity between the documents in a cluster and that cluster's centroid:

max Σ_{j=1..k} Σ_{d_i ∈ c_j} (d_i · μ_j) / (|d_i| |μ_j|) (2.6)


Jaccard Coefficient

The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures similarity as the intersection divided by the union of the objects. For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared. Given non-unit document vectors u_i, u_j, their Jaccard coefficient is:

Sim(u_i, u_j) = (u_i · u_j) / (|u_i|^2 + |u_j|^2 − u_i · u_j) (2.7)

Pearson Correlation Coefficient

Correlation clustering, introduced by Bansal, Blum and Chawla [9], provides a method for clustering a set of objects into the best possible number of clusters without specifying that number in advance. Correlation clustering does not require a bound on the number of clusters into which the data is partitioned; rather, correlation clustering [10] divides the data into the optimal number of clusters based on the similarity between the data points. In their paper [9], Bansal et al. discuss two objectives of correlation clustering: minimizing disagreements and maximizing agreements between clusters.

The normalized Pearson correlation is defined as:

Sim(x, y) = ( Σ_t (x_t − x̄)(y_t − ȳ) ) / ( |x − x̄| · |y − ȳ| ) (2.8)

where x̄ denotes the average feature value of x over all dimensions.

In [20], Strehl et al. compared four measures: Euclidean, cosine, Pearson correlation and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones for web documents.
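For reference, here is a compact sketch of the four measures on dense document vectors, roughly following equations (2.3)-(2.8) (illustrative code, not from the thesis):

import numpy as np

def euclidean(a, b):                       # eq. (2.3)
    return np.sqrt(((a - b) ** 2).sum())

def cosine(a, b):                          # eq. (2.5)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard(a, b):                         # eq. (2.7), Tanimoto form
    dot = a @ b
    return dot / (a @ a + b @ b - dot)

def pearson(a, b):                         # eq. (2.8), mean-centered cosine
    ac, bc = a - a.mean(), b - b.mean()
    return ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc))

a = np.array([1.0, 2.0, 0.0, 1.0])
b = np.array([2.0, 1.0, 1.0, 0.0])
for m in (euclidean, cosine, jaccard, pearson):
    print(m.__name__, round(float(m(a, b)), 3))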

1.6 Clustering Applications


Clustering is the most common form of unsupervised learning and is a major tool in a number of applications in many fields of business and science. Below, we summarize the basic directions in which clustering is used.

• Finding Similar Documents. This feature is often used when the user has spotted one "good" document in a search result and wants more like it. The interesting property here is that clustering is able to discover documents that are conceptually alike, in contrast to search-based approaches that are only able to discover whether the documents share many of the same words.

• Organizing Large Document Collections. Document retrieval focuses on finding documents relevant to a particular query, but it fails to solve the problem of making sense of a large number of uncategorized documents. The challenge here is to organize these documents in a taxonomy identical to the one humans would create given enough time, and to use it as a browsing interface to the original collection of documents.

• Duplicate Content Detection. In many applications there is a need to find duplicates or near-duplicates in a large number of documents. Clustering is employed for plagiarism detection, grouping of related news stories and reordering search result rankings (to ensure higher diversity among the topmost documents). Note that in such applications the description of clusters is rarely needed.

• Recommendation Systems. In this application a user is recommended articles based on the articles the user has already read. Clustering of the articles makes this possible in real time and greatly improves the quality.

• Search Optimization. Clustering helps greatly in improving the quality and efficiency of search engines, as the user query can first be compared to the clusters instead of being compared directly to the documents, and the search results can also be arranged easily.

1.7 Challenges in Document Clustering

Document clustering has been studied for many decades but is still far from being a trivial, solved problem. The challenges are:


1. Selecting appropriate features of the documents that should be used for clustering.
2. Selecting an appropriate similarity measure between documents.
3. Selecting an appropriate clustering method utilizing the above similarity measure.
4. Implementing the clustering algorithm in an efficient way that makes it feasible in terms of required memory and CPU resources.
5. Finding ways of assessing the quality of the performed clustering.

Furthermore, with medium to large document collections (10,000+ documents), the number of term-document relations is fairly high (millions+), and the computational complexity of the algorithm applied is thus a central factor in whether it is feasible for real-life applications. If a dense matrix is constructed to represent term-document relations, this matrix could easily become too large to keep in memory, e.g. 100,000 documents × 100,000 terms = 10^10 entries ≈ 40 GB using 32-bit floating point values. If the vector model is applied, the dimensionality of the resulting vector space will likewise be quite high (10,000+). This means that simple operations, like finding the Euclidean distance between two documents in the vector space, become time-consuming tasks.
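The memory estimate above can be verified with a short back-of-the-envelope calculation:

docs, terms, bytes_per_float = 100_000, 100_000, 4        # 32-bit floats
entries = docs * terms                                    # 10**10 entries
print(f"{entries:.1e} entries = {entries * bytes_per_float / 1e9:.0f} GB")  # 40 GB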

PARTITIONAL CLUSTERING

Partitional clustering algorithms seek maximum similarity within each cluster and maximum dissimilarity between clusters. The most popular partition-based clustering algorithm is the k-means algorithm, because of its simplicity and easy implementation. Its main drawback is that the value of K is difficult to predict. To overcome this drawback we use Automatic Merging of Optimal Clusters (AMOC). The aim of AMOC is to automatically generate optimal clusters for the given datasets. AMOC extends k-means with a two-phase iterative procedure combining merging and validation techniques in order to find optimal clusters by automatically combining clusters.


Let X = {X1, X2, …, Xm} be a set of m objects, where every individual object Xi is represented as [x_i1, x_i2, …, x_in] and n is the number of attributes. The algorithm takes kmax as the upper bound on the number of clusters. It iteratively merges the cluster having the lowest probability with its nearest cluster and validates the merging result using the Rand index.

Steps:

1. Initialize kmax to the square root of the total number of objects.
2. Randomly assign objects to the kmax cluster centroids.
3. Find the clusters using k-means.
4. Calculate the intra-cluster distance.
5. Find the cluster that has minimal probability and merge it with its closest cluster. Recalculate the centroids and decrease the number of clusters by one.
6. If step 5 has been executed for every cluster, go to step 7; otherwise go to step 5.
7. If there is no change in the number of clusters, stop; otherwise go to step 2.
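The following condensed sketch conveys the flavor of this loop under simplifying assumptions: it uses scikit-learn's KMeans, stands in the VRC (scikit-learn's calinski_harabasz_score) for AMOC's Rand-index validation, and refits k-means with one cluster fewer instead of explicitly merging the lowest-probability cluster.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def amoc_like(X):
    # Step 1: kmax is the square root of the number of objects (at least 2).
    k = max(2, int(np.sqrt(len(X))))
    # Steps 2-3: initial k-means clustering with kmax clusters.
    best = KMeans(n_clusters=k, n_init=10).fit(X)
    best_score = calinski_harabasz_score(X, best.labels_)
    # Steps 5-7: keep reducing k while the validation score improves.
    while k > 2:
        trial = KMeans(n_clusters=k - 1, n_init=10).fit(X)
        score = calinski_harabasz_score(X, trial.labels_)
        if score <= best_score:          # the reduction no longer helps: stop
            break
        best, best_score, k = trial, score, k - 1
    return best                          # fitted model with the chosen k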

Criterion Function

A frequently used partitional clustering criterion is the Variance Ratio Criterion (VRC). It is defined as:

VRC = (B / W) × (n − k) / (k − 1) (1)

Here B and W denote the between-cluster and within-cluster variations, respectively. They are defined as:

W = Σ_{j=1..k} Σ_{i=1..n_j} (o_i^j − ō_j)^T (o_i^j − ō_j) (2)

B = Σ_{j=1..k} n_j (ō_j − ō)^T (ō_j − ō) (3)

where n_j denotes the cardinality of cluster c_j, o_i^j denotes the ith object assigned to cluster c_j, ō denotes the n-dimensional vector of overall sample means (the data centroid), and ō_j denotes the n-dimensional vector of sample means within the jth cluster (the cluster centroid). The between-cluster variation has k − 1 degrees of freedom and the within-cluster variation has n − k degrees of freedom.

As a consequence, compact and well-separated clusters are expected to have small values of W and large values of B. Hence, the better the data partition, the greater the value of VRC. The normalization term (n − k)/(k − 1) prevents the ratio from increasing monotonically with the number of clusters, making VRC an optimization (maximization) criterion.
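Equations (1)-(3) transcribe directly into a short function (a minimal sketch; X is assumed to be a NumPy data matrix and labels an integer cluster assignment):

import numpy as np

def vrc(X, labels):
    # Variance Ratio Criterion, equations (1)-(3).
    n, k = len(X), len(np.unique(labels))
    o_bar = X.mean(axis=0)                   # overall data centroid
    W = B = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        oj_bar = members.mean(axis=0)        # cluster centroid
        W += ((members - oj_bar) ** 2).sum()               # eq. (2)
        B += len(members) * ((oj_bar - o_bar) ** 2).sum()  # eq. (3)
    return (B / W) * (n - k) / (k - 1)       # eq. (1)

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
print(vrc(X, np.array([0, 0, 1, 1])))        # 100.0: compact, well separated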

PSO

Particle swarm optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. PSO maintains a population of candidate solutions (particles) and moves these particles around in the search space according to simple mathematical formulae over each particle's position and velocity:

v_id = w × v_id + c1 × rand1 × (p_id − x_id) + c2 × rand2 × (p_gd − x_id) (4)

x_id = x_id + v_id (5)

where w is the inertia weight factor; p_id is the position at which the particle attained its local best value; p_gd is the position at which any particle attained the overall best value; c1 and c2 are constants called acceleration coefficients; d is the dimension of the search domain; and rand1, rand2 are random values uniformly distributed in the interval [0, 1].

Each particle's movement is influenced by its local best position and is also guided toward the best known positions in the search space, which are updated as better positions are found by other particles. This moves the swarm toward the best positions. The PSO clustering algorithm is given step by step below:


Step 1. Initialize the population randomly.
Step 2. Perform the following for each particle:
(a) Update the particle's velocity and position using equations (4) and (5) to generate the next solution.
(b) Compute the fitness value using fitness function (1).
Step 3. Repeat step 2 until one of the conditions below is fulfilled:
(a) The number of iterations performed reaches the maximum.
(b) The average change in fitness values is negligible.
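Here is a minimal sketch of PSO-based clustering along these lines, where each particle encodes k flattened centroids and the fitness is the VRC of the induced partition; it reuses the vrc function sketched earlier, and the parameter values are common defaults from the PSO literature rather than the thesis's settings.

import numpy as np

def pso_cluster(X, k, n_particles=20, iters=50, w=0.72, c1=1.49, c2=1.49):
    rng = np.random.default_rng(0)
    dim = k * X.shape[1]                 # a particle = k flattened centroids
    lo, hi = X.min(axis=0), X.max(axis=0)
    pos = rng.uniform(np.tile(lo, k), np.tile(hi, k), (n_particles, dim))
    vel = np.zeros_like(pos)

    def fitness(p):
        centers = p.reshape(k, -1)
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        if len(np.unique(labels)) < 2:
            return -np.inf               # degenerate partition
        return vrc(X, labels)            # fitness function (1)

    fit = np.array([fitness(p) for p in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()
    gbest = pos[fit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # eq. (4)
        pos = pos + vel                                                    # eq. (5)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        if pbest_fit.max() > fitness(gbest):
            gbest = pbest[pbest_fit.argmax()].copy()
    return gbest.reshape(k, -1)          # the best set of k centroids found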

Tabu search

Fred Glover proposed an approach in 1986, called Tabu Search, that allows local search (LS) methods to overcome local optima. The main concept of TS is to pursue LS whenever it reaches a local optimum by allowing non-improving moves. What distinguishes tabu search from other meta-heuristic approaches is the notion of the tabu list: a record of previously visited solutions, including disallowed moves. Since we use short-term memory, the list stores only a few attributes of solutions instead of whole solutions, and revisiting the recorded solutions is not permitted.

Steps:

Step 1. Create an initial solution x.
Step 2. Initialize the tabu list.
Step 3. While the set X' of candidate solutions is not complete:
Step 3.1. Compute a candidate solution x' from the present solution x.
Step 3.2. Add x' to X' if x' is not tabu or if at least one aspiration criterion is satisfied.
Step 4. Select the best candidate solution x* in X'.
Step 5. If fitness(x) < fitness(x*) then x = x*.
Step 6. Update the tabu list.
Step 7. If the termination criterion is reached, finish; otherwise go to step 3.
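A generic sketch of this loop, under the assumptions that solutions are hashable, that fitness is to be maximized, and that neighbors(x) enumerates the simple transformations of x (both helpers are hypothetical placeholders):

from collections import deque

def tabu_search(x0, neighbors, fitness, max_iters=100, tabu_size=20):
    # Steps 1-2: initial solution and tabu list (short-term memory).
    x = best = x0
    tabu = deque([x0], maxlen=tabu_size)
    for _ in range(max_iters):            # step 7: iteration budget
        # Steps 3-3.2: candidate neighbors that are not tabu, unless the
        # aspiration criterion (better than the best so far) overrides.
        candidates = [c for c in neighbors(x)
                      if c not in tabu or fitness(c) > fitness(best)]
        if not candidates:                # all moves are tabu: stop
            break
        x = max(candidates, key=fitness)  # step 4 (may move "uphill")
        if fitness(x) > fitness(best):    # step 5
            best = x
        tabu.append(x)                    # step 6
    return best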

TSPSO:

In this section we introduce the TSPSO algorithm, which combines the PSO technique with TS. Particle swarm optimization (PSO) is a computational method used to optimize results by iteratively trying to improve a candidate solution with regard to a given measure of quality. As a meta-heuristic method, it makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. It does not use the gradient of the problem being optimized, which means PSO does not require the optimization problem to be differentiable, as is required by classic optimization methods such as quasi-Newton methods and gradient descent. PSO can also be used on optimization problems that are noisy, asymmetric, changing over time, and so on. However, PSO suffers from two drawbacks: (I) it is easily confined to local minima; (II) it costs too much time to converge, especially in a complex high-dimensional space. When an optimal solution is found, all the particles are situated at the same local minimum, and it then becomes nearly impossible for the particles to move and search further because of the velocity update equation. To overcome the aforementioned problems, we propose a hybrid approach which combines PSO and Tabu Search (TS), considering that TS belongs to the class of local search techniques. We combine the TS and PSO algorithms to use the exploration abilities of both and to avoid the flaws of each.

The flow chart of TSPSO is shown in Fig. 2.

The TSPSO steps are listed below:

Step 1. Initialize the population randomly.
Step 2. Compute fitness function (1) for each particle.
Step 3. Randomly divide the population into two halves:
(a) One half of the population is updated by PSO, i.e., the position and velocity of each particle are updated.
(b) The other half of the population is updated by TS, which searches for the local best solution for each particle.
Step 4. Merge the two half-populations, and update the "pbest" and "gbest" particles and the tabu list (TL).
Step 5. Iterate steps 2-4 until the termination condition is reached.
Step 6. Output the result.
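Here is a skeleton of this split-and-merge scheme (illustrative only; pso_update and ts_local_step are hypothetical helpers standing in for equations (4)-(5) and a tabu-restricted local move, respectively):

import numpy as np

def tspso(positions, fitness, pso_update, ts_local_step, iters=100):
    # Step 1 is assumed done: positions is a randomly initialized swarm.
    rng = np.random.default_rng(0)
    tabu_list = []
    fit = np.array([fitness(p) for p in positions])        # step 2
    pbest = positions.copy()
    gbest = positions[fit.argmax()].copy()
    for _ in range(iters):                                 # step 5
        idx = rng.permutation(len(positions))              # step 3
        pso_half, ts_half = idx[: len(idx) // 2], idx[len(idx) // 2:]
        for i in pso_half:                                 # step 3a: PSO update
            positions[i] = pso_update(positions[i], pbest[i], gbest)
        for i in ts_half:                                  # step 3b: TS local step
            positions[i] = ts_local_step(positions[i], tabu_list)
        # Step 4: merge the halves and update pbest/gbest (the tabu list is
        # assumed to be updated inside ts_local_step in this sketch).
        fit = np.array([fitness(p) for p in positions])
        old = np.array([fitness(p) for p in pbest])
        pbest[fit > old] = positions[fit > old]
        gbest = pbest[np.array([fitness(p) for p in pbest]).argmax()].copy()
    return gbest                                           # step 6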


2 LITERATURE REVIEW

Tabu-KM: A Hybrid Clustering Algorithm Based on Tabu Search Approach

Abstract

The clustering problem under the criterion of minimum sum of squares is a non-convex and non-linear program which possesses many locally optimal values, with the result that its solution often falls into these traps and therefore cannot converge to the globally optimal solution. In this paper, an efficient hybrid optimization algorithm called Tabu-KM is developed for solving this problem. It brings together the optimization property of tabu search and the local search capability of the k-means algorithm. The contribution of the proposed algorithm is to produce a tabu space for escaping from the trap of local optima and finding better solutions effectively. The Tabu-KM algorithm is tested on several simulated and standard datasets and its performance is compared with the k-means, simulated annealing, tabu search, genetic algorithm, and ant colony optimization algorithms. The experimental results on simulated and standard test problems demonstrate the robustness and efficiency of the algorithm and confirm that the proposed method is a suitable choice for solving data clustering problems.

Introduction

Clustering is an important process in engineering and other fields of scientific research. It is the process of grouping patterns into a number of clusters, each of which contains the patterns that are similar to each other according to a specified similarity measure. Clustering is a sequential process which takes data as raw material and produces clusters as a result without any predetermined goal [16]. To analyze the clusters, the objects are represented by points in N-dimensional space, where the components of the vector are the values of the attributes of the object, and the objective is to classify these points into K clusters such that a certain similarity measure is optimized. We consider the clustering problem stated as follows: given N objects in R^n, allocate each object to one of K clusters such that the sum of squared Euclidean distances between each object and the center of its cluster is minimized.


The clustering problem can be mathematically described as follows:

Min F(W, C) = Σ_{i=1..N} Σ_{j=1..K} w_ij ||x_i − c_j||^2 (1)

where Σ_{j=1..K} w_ij = 1 for i = 1, …, N; w_ij is equal to 1 if object x_i is allocated to cluster C_j, and 0 otherwise. In equation (1), N denotes the number of objects, K denotes the number of clusters, X = {x1, x2, …, xN} denotes the set of N objects, C = {c1, …, cK} denotes the set of K clusters, and W denotes the 0-1 membership matrix. The cluster center c_j is calculated as follows:

c_j = (1/n_j) Σ_{x_i ∈ C_j} x_i, j = 1, …, K (2)

where n_j denotes the number of objects belonging to cluster c_j. It is known that this clustering problem is non-convex and non-linear and possesses many locally optimal values, with the result that its solution often falls into these traps [24]. The k-means algorithm is one of the popular center-based algorithms [18]; the criterion it uses minimizes the total mean squared distance from each point to that point's closest center, and it converges only to a local minimum under certain conditions. There are two main problems with the k-means method [21], [19]: first, the algorithm depends on the initial states and the value of K; second, it easily converges to local optima which may be much worse than the desired globally optimal solution. In this paper, a new efficient algorithm is designed and implemented based on the tabu search approach for escaping from local optima. The key idea of the proposed algorithm is to produce a tabu space and to select the new cluster center from the objects not in the tabu space; the k-means algorithm is then run as a local search. The paper is organized as follows: the tabu search approach for clustering and related works are reviewed in section 2; in section 3 the Tabu-KM algorithm is proposed and described in detail; section 4 presents experimental results on simulated and standard datasets showing that the method outperforms some other methods; finally, conclusions are reported in section 5.

Conclusion

An effective hybrid clustering algorithm based on the tabu search approach, called Tabu-KM, is developed by integrating the tabu space and a move generator for restricting the objects that may be selected as cluster centers. The Tabu-KM


algorithm is used to escape from the trap of local optima and to find better solutions for the clustering problem under the criterion of minimum sum of squares. To produce the tabu space, two strategies are investigated: a spherical space around the center of the cluster with fixed or dynamic radius. In addition, three different strategies are discussed for selecting objects as the center of a new cluster and generating a feasible solution: (1) move to the object closest to the center of the initial k-means cluster, (2) move to the object closest to the center of the current cluster, (3) move to the object closest to the center of the best-so-far cluster. All the above-mentioned strategies were investigated. According to the results, the dynamic

A Survey on K-mean Clustering and Particle Swarm Optimization

Abstract

In data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique which divides data into meaningful groups. K-means is one of the popular clustering algorithms; it is widely used to minimize the squared distance between the feature values of points that reside in the same cluster. Particle swarm optimization is an evolutionary computation technique which finds optimum solutions in many applications. Using PSO to optimize the clustering of the components can yield more precise clustering efficiency. In this paper, we present a comparison of k-means clustering and particle swarm optimization.

Introduction

Clustering is a technique which divides data objects into groups based on the information found in the data that describes the objects and the relationships among them, i.e., their feature values; it can be used in many applications, such as knowledge discovery, vector quantization, pattern recognition, data mining and data dredging [1]. There are mainly two techniques for clustering: hierarchical clustering and partitional clustering. In hierarchical clustering, data are not partitioned into a particular cluster in a single step; instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Each cluster can have sub-clusters, so the result can be viewed as a tree: a node in the tree is a cluster, the


root of the tree is the cluster containing all the objects, and each node, except the leaf nodes, is the union of its children. In partitional clustering, the algorithms typically determine all clusters at once: they divide the set of data objects into non-overlapping clusters, and each data object is in exactly one cluster. Particle swarm optimization (PSO) has gained much attention and has been applied in many fields [2]. PSO is a useful population-based stochastic optimization algorithm. The birds in a flock are represented as particles, and particles are considered as simple agents flying through a problem space; a particle's location in the multi-dimensional problem space can represent a solution for the problem. However, PSO may lack global search ability at the end of a run, due to the utilization of a linearly decreasing inertia weight, and PSO may fail to find the required optima when the problem to be solved is too complicated and complex. K-means is the most widely used and studied clustering algorithm. Given a set of n data points in real d-dimensional space (R^d) and an integer k, the clustering problem is to determine a set of k points in R^d, called cluster centers, such that the n data points are divided into k groups based on the distance between them and the cluster centers. The k-means algorithm is flexible and simple, but it has some limitations: the clustering result depends mainly on the selection of the initial cluster centroids, and it may converge to local optima [3]. However, the same initial cluster centers in a data space will always generate the same cluster results; if a good cluster center can always be obtained, k-means will work well.

Conclusion

From the study of k-means clustering and particle swarm optimization, we can say that k-means depends on the initial conditions, which may cause the algorithm to converge to suboptimal solutions. Particle swarm optimization, on the other hand, is less sensitive to the initial conditions due to its population-based nature, and is therefore more likely to find a near-optimal solution.

Cluster Analysis by Variance Ratio Criterion and Firefly Algorithm

Abstract

In order to solve the cluster analysis problem more efficiently, we presented a new approach based on the firefly algorithm (FA). First, we created the optimization model using the variance ratio criterion (VRC)


as the fitness function. Second, FA was introduced to find the maximal point of the VRC. The experimental dataset contains 400 data of 4 groups with three different levels of overlapping degree: non-overlapping, partially overlapping, and severely overlapping. We compared the FA with the genetic algorithm (GA) and combinatorial particle swarm optimization (CPSO). Each algorithm was run 20 times. The results show that FA found the largest VRC values among all three algorithms, while costing the least time. Therefore, FA is effective and rapid for the cluster analysis problem.

Introduction

Cluster analysis is the assignment of a set of observations into subsets without any a priori knowledge, so that observations in the same cluster are more similar to each other than to those in other clusters [1]. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields [2], including machine learning [3], data mining [4], pattern recognition [5], image analysis [6], medical image classification [7], and bioinformatics [8]. Cluster analysis can be achieved by various algorithms that differ significantly. Those methods can be basically classified into four categories: I. Hierarchical Methods. They find successive clusters using previously established clusters. They can be further divided into agglomerative methods and divisive methods [9]. Agglomerative algorithms start with one-point clusters and recursively merge the two or more most appropriate clusters [10]. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters [11]. II. Partition Methods. They generate a single partition of the data with a specified or estimated number of non-overlapping clusters, in an attempt to recover the natural groups present in the data [12]. III. Density-based Methods. They are devised to discover arbitrarily-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold; DBSCAN [13] is the typical algorithm of this kind. IV. Subspace Methods. They look for clusters that can only be seen in a particular projection (subspace, manifold) of the data; these methods can thus ignore irrelevant attributes [14]. In this study, we focus our attention on partition clustering methods. The k-means clustering [15] and the fuzzy c-means clustering (FCM) [16] are two typical algorithms of this type. They are iterative algorithms, and the solution obtained depends on the selection of the initial partition; they may converge to a local minimum of the criterion function value if the initial partition is not properly chosen [17].


The branch and bound algorithm was proposed to find the globally optimal clustering; however, it takes too much computation time [18]. In the last decade, evolutionary algorithms have been applied to the clustering problem, since they are not sensitive to initial values and are able to jump out of local minima. For example, Lin et al. [19] pointed out that k-anonymity has been widely adopted as a model for protecting publicly released microdata from individual identification. Their work proposed a novel genetic algorithm-based clustering approach for k-anonymization, which adopted various heuristics to select genes for crossover operations. Experimental results showed that their approach can further reduce the information loss caused by traditional clustering-based k-anonymization techniques. Chang et al. [20] proposed a new clustering algorithm based on a genetic algorithm (GA) with gene rearrangement (GAGR), which in application may effectively remove degeneracy for the purpose of a more efficient search. They used a new crossover operator that exploited a measure of similarity between chromosomes in a population, and employed adaptive probabilities of crossover and mutation to prevent the convergence of GAGR to a local optimum. Using real-world data sets, they compared the performance of the GAGR clustering algorithm with the K-means algorithm and other GA methods; their experimental results demonstrated that the GAGR clustering algorithm had high performance, effectiveness and flexibility. Agard et al. [21] pointed out that defining an efficient bill of materials for a family of complex products is a real challenge for companies, largely because of the diversity they offer to consumers. Their solution was to define a set of components (called modules), each of which contains a set of primary functions; an individual product is then built by combining selected modules. This industrial problem leads, in turn, to a complex optimization problem, which they solved via a simulated annealing method based on a clustering approach. Jarboui et al. [12] presented a new clustering approach based on the combinatorial particle swarm optimization (CPSO) algorithm. Each particle was represented as a string of length n (where n is the number of data points), and the ith element of the string denoted the group number assigned to object i, so that an integer vector corresponded to a candidate solution to the clustering problem. A swarm of particles was initiated and flew through the solution space


targeting the optimal solution. To verify the efficiency of the proposed CPSO algorithm, comparisons with a genetic algorithm were performed; computational results showed that the proposed CPSO algorithm was very competitive and outperformed the genetic algorithm. Niknam et al. [22] observed that the k-means algorithm depends highly on the initial state and converges to local optimum solutions. They therefore presented a new hybrid evolutionary algorithm to solve the nonlinear partitional clustering problem. Their proposed hybrid evolutionary algorithm was the combination of FAPSO (fuzzy adaptive particle swarm optimization), ACO (ant colony optimization) and k-means, called FAPSO-ACO-K, which can find better cluster partitions. The performance of the proposed algorithm was evaluated on several benchmark data sets; the simulation results showed that it performed better than other algorithms, such as PSO, ACO, simulated annealing (SA), the combination of PSO and SA (PSO-SA), the combination of ACO and SA (ACO-SA), the combination of PSO and ACO (PSO-ACO), the genetic algorithm (GA), Tabu search (TS), honey bee mating optimization (HBMO) and k-means, for the partitional clustering problem. However, the aforementioned algorithms do not always perform ideally: they sometimes converge too slowly, or even converge to local minima, which leads to wrong solutions. Recently, the firefly algorithm (FA) has emerged as a nature-inspired technique for solving nonlinear multimodal optimization problems in dynamic environments [23]. The algorithm is based on the behavior of fireflies: in social insect colonies, each firefly seems to have its own plans, and yet the group as a whole appears to be highly organized. Scholars have published an extensive literature reporting that its performance, effectiveness, and robustness are superior to GA, PSO, and other global algorithms in a wide range of fields [23, 24]. The rest of this paper is organized as follows: Section 2 defines the partitional problem and gives the encoding strategy and clustering criterion; Section 3 introduces the firefly algorithm; the experiments in Section 4 cover three types of artificial data with different overlapping degrees; Section 5 is devoted to conclusions and future work.

Conclusion

We first investigated the optimization model, including both the encoding strategy and the VRC criterion function. Afterwards, an FA algorithm was introduced for solving the model. Experiments on three types of


artificial data with different overlapping degrees all demonstrate that the FA is more robust and costs less time than either GA or CPSO. Future work contains the following points: 1) develop a method that can determine the number of clusters automatically; 2) use more benchmark data to test the FA; 3) apply the FA to practical clustering problems, including mathematics [30], face estimation [31], image segmentation [32], image registration [33], image classification [34], UCAV path planning [35], and prediction [36].

Document Clustering: The Next Frontier

Introduction

The proliferation of documents, on both the Web and in private systems, makes knowledge discovery in document collections arduous. Clustering has long been recognized as a useful tool for the task: it groups like items together, maximizing intra-cluster similarity and inter-cluster distance. Clustering can provide insight into the make-up of a document collection and is often used as the initial step in data analysis. While most document clustering research to date has focused on moderate-length, single-topic documents, real-life collections are often made up of very short or very long documents. Short documents do not contain enough text to accurately compute similarities; long documents often span multiple topics that general document similarity measures do not take into account. In this paper we first give an overview of general-purpose document clustering, and then focus on recent advancements in the next frontier in document clustering: long and short documents.

Conclusion

This chapter primarily focused on reviewing some recently developed text clustering methods that are specifically suited for long and for short document collections. These types of document collections introduce new sets of challenges. Long documents are by their nature multi-topic, and as such the underlying document clustering methods must explicitly focus on modeling and/or accounting for these topics. On the other hand, short documents often contain domain-specific vocabulary, are very noisy, and their proper modeling/understanding often requires the incorporation of external information. We strongly believe research in clustering long and short documents is in its early stages and many new


methods will be developed in the years to come. Moreover, many real datasets are composed not only of standard, long, or short documents, but of documents of mixed length. Current scholarship lacks studies on these types of data; since different methods are often used for clustering standard, long, or short documents, new methods or frameworks should be investigated that address mixed collections. Traditional document clustering also faces new challenges. Today's very large, high-dimensional document collections often lead to multiple valid clustering solutions. Subspace/projective clustering approaches [67], [82] have been used to cope with high dimensionality when performing the clustering task. Ensemble clustering [40] and multiview/alternative clustering approaches [58], [91], which aim to summarize or detect different clustering solutions, have been used to manage the availability of multiple, possibly alternative clusterings for a given dataset. Relatively little work has been done so far in document clustering research to take advantage of lessons learned from these methods. Integrating subspace/ensemble/multi-view clustering with topic models or segmentation may lead to the next-generation clustering methods specialized for the document domain. Some topics that we have only briefly touched on in this article are further detailed in other chapters of this book. Other topics related to clustering documents, such as semi-supervised clustering, stream document clustering, parallel clustering algorithms, and kernel methods for dimensionality reduction or clustering, were left for further study. Interested readers may consult the document clustering surveys by Aggarwal and Zhai [3], Andrews and Fox [9], and Steinbach et al.

Discrete PSO with GA Operators for Document Clustering

Abstract

This paper presents a discrete PSO algorithm for document clustering problems; the algorithm is a hybrid of PSO with GA operators. The proposed system is based on a population-based heuristic search technique that can be used to solve combinatorial optimization problems, modeled on the concepts of cultural and social rules derived from the analysis of swarm intelligence (PSO), with GA operators such as crossover and mutation. In standard PSO, the non-oscillatory route can quickly cause a particle to stagnate, and the algorithm may also prematurely converge on suboptimal solutions that are not even guaranteed to be locally optimal. In this paper a modification


strategy is proposed for the particle swarm optimization (PSO) algorithm and applied to a document corpus. The strategy adds reproduction, using crossover and mutation operators, when stagnation in the movement of a particle is detected. Reproduction has the capability to achieve faster convergence and better solutions. The experimental results, obtained on the document corpus, demonstrate that the proposed DPSO algorithm statistically outperforms the simple PSO.

Introduction

Document clustering is the automatic grouping of text documents into clusters, so that documents within a cluster have high similarity in comparison to one another but are dissimilar to documents in other clusters. Unlike document classification [22], no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Document clustering is widely applicable in areas such as search engines, web mining, information retrieval and topological analysis, and has become an increasingly important task in analyzing the huge numbers of documents distributed among various sites. The challenging aspect is to analyze this enormous number of extremely high-dimensional distributed documents and to organize them in such a way that results in better search and knowledge extraction without introducing much extra cost and complexity. Clustering, in data mining, is useful for discovering distribution patterns in the underlying data. The K-means algorithm and its variants [14][15] represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The K-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. A common document clustering method [1][19] first calculates the similarities between all pairs of documents and then clusters documents together if their similarity values are above a given threshold. The common clustering techniques are partitioning and hierarchical [11], and most document clustering algorithms can be classified into these two groups. In this study, a document clustering algorithm based on DPSO is proposed. The remainder of this paper is organized as follows: Section II provides the related work on document clustering using PSO; Section III gives an overview of PSO; the DPSO with GA operators clustering algorithm is described in Section IV; Section V presents the detailed experimental


setup and results comparing the performance of the proposed algorithm with the standard PSO (SPSO) and K-means approaches.

Conclusion

The proposed system uses the vector space model for document representation. The total number of documents is 1460 in CISI, 1400 in Cranfield and 82 in ADI; each particle in the swarm is represented by 2942 dimensions. The advantages of PSO are that there are very few parameters to deal with and that the large number of processing elements (the dimensions) enables it to fly around the solution space effectively. On the other hand, it converges to a solution very quickly, which must be dealt with carefully when it is used for combinatorial optimization problems. In this study, the proposed DPSO with GA operators algorithm, developed for the much more complex, NP-hard document clustering problem, is verified on the document corpus. It is shown to increase the performance of the clustering, and the best results are obtained by the proposed technique; consequently, the proposed technique markedly increased the success of document clustering. The main objective of the paper is to improve the fitness value of the problem. The fitness value achieved by standard PSO is low, since it suffers from stagnation, which causes premature convergence. This can be handled by the DPSO with the crossover and mutation operators of the genetic algorithm, which tries to avoid the stagnation behavior of the particles. The proposed system does not always avoid the stagnation behavior of the particles, but when it does, that is the source of the improvement in the particles' positions.


3. System Design

3.1 Hardware and software specifications

H/W System Configuration:-

Processor        : Pentium i5
Speed            : 2.3 GHz
RAM              : 4 GB
Hard Disk        : 500 GB
Keyboard         : Standard laptop keyboard
Mouse            : USB mouse

S/W System Configuration:-

Operating System : Windows 10

Development tool : NetBeans 7.0.1

Language         : Java

Language Version : JDK 1.7

Technologies     : AWT, Swing


3.2 UML Diagrams

Use case diagram

There is only one actor, the user, who can access the following functionality: reading the vector and feature files, applying AMOC to generate clusters, applying Tabu Search and PSO to produce the baseline values against which TSPSO is tested, and using TSPSO to solve the document cluster analysis problem more efficiently and quickly.

[Use case diagram: the User actor is connected to the use cases Read Vector and Feature File, Apply AMOC, Apply Tabu Search, Apply PSO, Apply TSPSO, and View Results.]


Class Diagram:

Here MiningExecuter is the main class, and it uses the methods of OptionSelection. When the user invokes the action function, ResultForm is invoked, and the inputs of ResultForm are passed to DocClusteringModel. Finally, the Output class executes with the inputs of DocClustering; the StartUp class is responsible for creating DocClustering.


Sequence diagram

Here User is the main actor. Whenever he wants to view the datasets, he requests them using the vector and feature files. He can optimize the number of clusters present in the datasets using AMOC, apply PSO to generate one of the test cases for TSPSO, apply TSPSO to generate efficient clusters, and finally view all the results when required.

[Sequence diagram: the User interacts with the Read_dataset, Apply AMOC, Apply PSO, Apply TSPSO and View Results lifelines through the messages: 1 Browse Vector and Feature File; 2 Vector and Feature File successfully read; 3 Apply AMOC; 4 Optimized number of clusters arrived; 5 Apply PSO; 6 PSO applied; 7 Apply TSPSO; 8 TSPSO applied; 9 View Results; 10 Results viewed.]


Activity diagram

The behavior of the system in terms of activities is described below. As the user initiates the process, he browses for the feature and vector files. He then applies AMOC to generate clusters. Next, he can apply PSO to generate test sample 1 for TSPSO and TS to generate test sample 2. Finally, these are used for the TSPSO test, and TSPSO is applied to generate efficient clusters.

[Activity diagram: User application → Browse Files → Apply AMOC → Apply PSO → Apply TS → Apply TSPSO → View Results.]


State chart diagram

This state chart describes the sequence of user interactions. As the user initiates the process, he browses for the feature and vector files. He then applies AMOC to generate clusters. Next, he can apply PSO to generate test sample 1 for TSPSO and TS to generate test sample 2. Finally, these are used for the TSPSO test, and TSPSO is applied to generate efficient clusters.

[State chart: Browse Files → Apply AMOC → Apply PSO → Apply TS → Apply TSPSO → View Results.]


Component Diagram:

The figure shows the various interactions of the user with the different components. The user interacts with the Read Datasets component to browse the vector and feature files and, on success, read them; with the AMOC component to apply it and obtain optimized clusters; with the PSO component to apply it and generate samples; and, similarly, with the TSPSO component to generate optimized clusters.


ALGORITHMS

1. PSO

Let S be the number of particles in the swarm, each having a position xi ∈ Rn in the search space and a velocity vi ∈ Rn. Let pi be the best known position of particle i and let g be the best known position of the entire swarm. A basic PSO algorithm is then:

For each particle i = 1, ..., S do:
    Initialize the particle's position with a uniformly distributed random vector: xi ~ U(blo, bup), where blo and bup are the lower and upper boundaries of the search space.
    Initialize the particle's best known position to its initial position: pi ← xi
    If f(pi) < f(g), update the swarm's best known position: g ← pi
    Initialize the particle's velocity: vi ~ U(-|bup - blo|, |bup - blo|)

Until a termination criterion is met (e.g. a number of iterations performed, or a solution with an adequate objective function value found), repeat:
    For each particle i = 1, ..., S do:
        Pick random numbers: rp, rg ~ U(0, 1)
        For each dimension d = 1, ..., n do:
            Update the particle's velocity: vi,d ← ω vi,d + φp rp (pi,d - xi,d) + φg rg (gd - xi,d)
        Update the particle's position: xi ← xi + vi
        If f(xi) < f(pi) then:
            Update the particle's best known position: pi ← xi
            If f(pi) < f(g), update the swarm's best known position: g ← pi

Now g holds the best found solution.
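The loop above can be exercised end to end in a few lines. The following self-contained Java sketch minimizes the sphere function f(x) = x1² + ... + xn², using ω = 0.72 and φp = φg = 1.42, the same coefficients that appear in the project code in Section 4.2; all other names and values are illustrative:

import java.util.Random;

// Minimal PSO sketch on f(x) = sum of squares (illustrative only).
public class PsoSketch {
    static double f(double[] x) { double s = 0; for (double xd : x) s += xd * xd; return s; }

    public static void main(String[] args) {
        final int S = 20, n = 3, iterations = 200;
        final double w = 0.72, phiP = 1.42, phiG = 1.42, blo = -10, bup = 10;
        Random rng = new Random(1);
        double[][] x = new double[S][n], v = new double[S][n], p = new double[S][n];
        double[] g = null;
        for (int i = 0; i < S; i++) {                              // initialization
            for (int d = 0; d < n; d++) {
                x[i][d] = blo + rng.nextDouble() * (bup - blo);    // xi ~ U(blo, bup)
                v[i][d] = (rng.nextDouble() * 2 - 1) * (bup - blo); // vi ~ U(-|bup-blo|, |bup-blo|)
            }
            p[i] = x[i].clone();                                   // pi <- xi
            if (g == null || f(p[i]) < f(g)) g = p[i].clone();
        }
        for (int t = 0; t < iterations; t++) {                     // main loop
            for (int i = 0; i < S; i++) {
                double rp = rng.nextDouble(), rg = rng.nextDouble();
                for (int d = 0; d < n; d++)                        // velocity update
                    v[i][d] = w * v[i][d] + phiP * rp * (p[i][d] - x[i][d])
                                          + phiG * rg * (g[d] - x[i][d]);
                for (int d = 0; d < n; d++) x[i][d] += v[i][d];    // position update
                if (f(x[i]) < f(p[i])) {
                    p[i] = x[i].clone();                           // pi <- xi
                    if (f(p[i]) < f(g)) g = p[i].clone();          // g <- pi
                }
            }
        }
        System.out.println("best value f(g) = " + f(g));
    }
}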

2. Tabu search:

Steps involved:

Step 1. Create an initial solution x.
Step 2. Initialize the tabu list.
Step 3. While the set X' of candidate solutions is not complete:
    Step 3.1. Compute a candidate solution x' from the present solution x.
    Step 3.2. Add x' to X' if x' is not tabu, or if at least one aspiration criterion is satisfied.
Step 4. Select the best candidate solution x* in X'.
Step 5. If fitness(x) < fitness(x*) then x = x*.
Step 6. Update the tabu list.
Step 7. If the termination criterion is reached then stop; otherwise go to Step 3.
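As an illustration of these steps, the following self-contained sketch applies them to a toy one-dimensional problem, minimizing f(x) = (x - 7)² over the integers; the neighbourhood, tabu tenure, and aspiration rule are illustrative assumptions, not this project's implementation:

import java.util.*;

// Minimal tabu-search sketch following the steps above (illustrative only).
public class TabuSketch {
    static int f(int x) { return (x - 7) * (x - 7); }

    public static void main(String[] args) {
        int x = 0;                                     // Step 1: initial solution
        int best = x;
        Deque<Integer> tabuList = new ArrayDeque<>();  // Step 2
        for (int iter = 0; iter < 50; iter++) {        // Step 7: fixed iteration budget
            int bestCand = Integer.MIN_VALUE;
            for (int cand : new int[]{x - 1, x + 1}) {            // Step 3.1: neighbours of x
                boolean aspiration = f(cand) < f(best);           // beats the best so far
                if (!tabuList.contains(cand) || aspiration)       // Step 3.2
                    if (bestCand == Integer.MIN_VALUE || f(cand) < f(bestCand))
                        bestCand = cand;                          // Step 4: best candidate
            }
            if (bestCand == Integer.MIN_VALUE) break;  // no admissible candidate
            x = bestCand;                              // Step 5 (minimization form)
            if (f(x) < f(best)) best = x;
            tabuList.addLast(x);                       // Step 6: update the tabu list
            if (tabuList.size() > 5) tabuList.removeFirst();
        }
        System.out.println("best x = " + best + ", f = " + f(best));
    }
}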


Criterion Function

The Variance Ratio Criterion (VRC) is the most widely used partitional clustering criterion. For k clusters over N objects it is defined as

    VRC = (B / (k - 1)) / (W / (N - k))

    B = Σj nj ||oj - o||²       (between-cluster variation)
    W = Σj Σi ||oij - oj||²     (within-cluster variation)

where nj denotes the cardinality of the cluster cj, oij denotes the ith object assigned to the cluster cj, o denotes the n-dimensional vector of overall sample means (data centroid), and oj denotes the n-dimensional vector of sample means within the jth cluster (cluster centroid). The between-cluster variation has k - 1 degrees of freedom and the within-cluster variation has N - k degrees of freedom; larger VRC values indicate compact, well-separated clusters.
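Under the definitions above (the standard Calinski-Harabasz form of the criterion), the VRC of a hard clustering can be computed directly. The following Java method is a minimal sketch, not part of the project code; the names are ours and every cluster is assumed non-empty:

// Illustrative VRC computation: data[i] is the ith object, label[i] its
// cluster in 0..k-1; assumes k > 1, N > k, and no empty clusters.
static double vrc(double[][] data, int[] label, int k) {
    int N = data.length, d = data[0].length;
    double[] overall = new double[d];
    double[][] centroid = new double[k][d];
    int[] size = new int[k];
    for (int i = 0; i < N; i++) {                     // accumulate sums
        size[label[i]]++;
        for (int j = 0; j < d; j++) {
            overall[j] += data[i][j];
            centroid[label[i]][j] += data[i][j];
        }
    }
    for (int j = 0; j < d; j++) overall[j] /= N;      // data centroid o
    for (int c = 0; c < k; c++)
        for (int j = 0; j < d; j++) centroid[c][j] /= size[c]; // cluster centroids oj
    double b = 0, w = 0;
    for (int c = 0; c < k; c++)                       // between-cluster variation B
        for (int j = 0; j < d; j++)
            b += size[c] * (centroid[c][j] - overall[j]) * (centroid[c][j] - overall[j]);
    for (int i = 0; i < N; i++)                       // within-cluster variation W
        for (int j = 0; j < d; j++)
            w += (data[i][j] - centroid[label[i]][j]) * (data[i][j] - centroid[label[i]][j]);
    return (b / (k - 1)) / (w / (N - k));
}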

3. TSPSO

Steps involved:

Step 1. Initialize the population randomly.
Step 2. Compute the fitness function (the VRC defined above) for each particle.
Step 3. Randomly divide the population into two halves:
    a) one half of the population is updated by PSO, i.e. the position and velocity of each particle are updated;
    b) the other half of the population is updated by TS, which searches for the local best solution of each particle.
Step 4. Merge the two halves of the population, and update the "pbest" and "gbest" particles and the tabu list (TL).
Step 5. Repeat Step 2 to Step 4 until the termination condition is reached.
Step 6. Output the result.
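A self-contained sketch of this hybrid loop on a toy one-dimensional objective is given below; the half-and-half split, the tabu tenure, and the neighbourhood move are illustrative assumptions rather than the exact project implementation:

import java.util.*;

// Minimal TSPSO sketch on f(x) = x^2 (illustrative only).
public class TspsoSketch {
    static double f(double x) { return x * x; }
    static final Random RNG = new Random(42);

    public static void main(String[] args) {
        int size = 10;
        double[] x = new double[size], v = new double[size], pbest = new double[size];
        for (int i = 0; i < size; i++) {                       // Step 1
            x[i] = RNG.nextDouble() * 20 - 10;
            pbest[i] = x[i];
        }
        double gbest = pbest[0];
        for (int i = 1; i < size; i++) if (f(pbest[i]) < f(gbest)) gbest = pbest[i];
        Deque<Double> tabu = new ArrayDeque<>();               // tabu list (TL)

        for (int iter = 0; iter < 100; iter++) {               // Steps 2-5
            Integer[] order = new Integer[size];
            for (int i = 0; i < size; i++) order[i] = i;
            Collections.shuffle(Arrays.asList(order), RNG);    // Step 3: random split
            for (int idx = 0; idx < size; idx++) {
                int i = order[idx];
                if (idx < size / 2) {                          // Step 3a: PSO move
                    v[i] = 0.72 * v[i] + 1.42 * RNG.nextDouble() * (pbest[i] - x[i])
                                       + 1.42 * RNG.nextDouble() * (gbest - x[i]);
                    x[i] += v[i];
                } else {                                       // Step 3b: TS move
                    double best = x[i];
                    for (double step : new double[]{-0.5, 0.5}) {
                        double cand = x[i] + step;
                        double key = Math.floor(cand * 10) / 10;   // coarse tabu key
                        if (!tabu.contains(key) && f(cand) < f(best)) best = cand;
                    }
                    tabu.addLast(Math.floor(x[i] * 10) / 10);
                    if (tabu.size() > 20) tabu.removeFirst();
                    x[i] = best;
                }
                if (f(x[i]) < f(pbest[i])) pbest[i] = x[i];    // Step 4: update pbest
                if (f(pbest[i]) < f(gbest)) gbest = pbest[i];  // ... and gbest
            }
        }
        System.out.println("best x = " + gbest + ", f = " + f(gbest)); // Step 6
    }
}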

4. Implementation

4.1 Introduction to technologies

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out, to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

The three key considerations involved in the feasibility analysis are:

ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY

ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

TECHNICAL FEASIBILITY

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system has modest requirements, as only minimal or no changes are required for implementing it.

SOCIAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the user. It includes the process of training the user to use the system efficiently. The user must not feel threatened by the system but must accept it as a necessity. The level of acceptance by the users depends solely on the methods employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make constructive criticism, which is welcomed, as he is the final user of the system.

4.2 Sample Code

package miner.psoAlgo;

//package miner;

import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
import miner.*;

public class pso {

    float tfIdf[][];            // term-weight matrix: documents x terms
    float particles[][][];      // particle positions: particle x cluster x term
    float fitness[];            // best fitness found so far per particle
    float partiVelocity[][][];  // particle velocities
    float pBest[][][];          // personal best position of each particle
    public float gBest[][];     // global best cluster centroids
    float newFitness[];         // fitness values of the current iteration
    float gBestFitness;         // fitness of the global best

    int clusterSize[];
    float distance[];
    float intraclustDistance[];
    boolean clusterPoints[][];  // document-to-cluster assignment of gBest

    small little = new small();


    // Read the TF-IDF matrix, written column by column, from disk.
    public void extractData() throws IOException {
        Scanner s = null;
        try {
            s = new Scanner(new BufferedReader(new FileReader("c:\\dc\\tfIdfMatrix.txt")));
            String a;
            int col = -1;
            while (s.hasNext()) {
                a = s.next();
                if (a.indexOf("column") != -1) {
                    col++;
                    for (int j = 0; j < tfIdf.length; j++) {
                        a = s.next();
                        tfIdf[j][col] = Float.parseFloat(a);
                    }
                }
            }
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, e.toString(), "pso-extractData()", JOptionPane.ERROR_MESSAGE);
        } finally {
            if (s != null)
                s.close();
        }
    }

    public pso() {
    }

    public pso(int Rows, int Columns, int noOfClusters, int noOfParticles) {
        System.out.println("parameterised Constructor Executed");
        tfIdf = new float[Rows][Columns];
        System.out.println("The size of the matrix is:" + tfIdf.length + "\t" + tfIdf[0].length);
        particles = new float[noOfParticles][noOfClusters][Columns];
        fitness = new float[particles.length];
        partiVelocity = new float[noOfParticles][noOfClusters][Columns];
        pBest = new float[noOfParticles][noOfClusters][Columns];
        gBest = new float[noOfClusters][Columns];
        newFitness = new float[particles.length];
        clusterSize = new int[particles[0].length];
        distance = new float[particles[0].length];
        intraclustDistance = new float[particles[0].length];

        Arrays.fill(fitness, 0);
        for (int i = 0; i < gBest.length; i++)
            Arrays.fill(gBest[i], 0);
        for (int i = 0; i < pBest.length; i++)
            for (int j = 0; j < pBest[i].length; j++) {
                Arrays.fill(partiVelocity[i][j], 0);
                Arrays.fill(pBest[i][j], 0);
            }
    }

    // Seed each particle's cluster centroids with randomly chosen document vectors.
    public void assignParticles() throws IOException {
        int Particles[];
        try {
            numberGenerator n = new numberGenerator();
            Particles = n.extractNumbers(particles.length * particles[0].length);
            System.out.println(Particles.length);
            int l = 0;
            for (int i = 0; i < particles.length; i++) {
                for (int j = 0; j < particles[0].length; j++) {
                    for (int k = 0; k < particles[0][0].length; k++) {
                        particles[i][j][k] = tfIdf[Particles[l] - 1][k];
                    }
                    l++;
                    System.out.println(l);
                }
            }
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, e.toString(), "pso-assignParticles()", JOptionPane.ERROR_MESSAGE);
        }
    }

    // Euclidean distance between two term-weight vectors.
    public float eucliDistance(float a[], float b[]) {
        float distance = 0, temp;
        for (int i = 0; i < a.length; i++) {
            temp = a[i] - b[i];
            distance += temp * temp;
        }
        return (float) Math.sqrt(distance);
    }

    // Return the smallest non-zero entry of distance[] and its index.
    public small Small(float distance[]) {
        small a = new small();
        a.distance = distance[0];
        a.pos = 0;
        for (int i = 1; i < distance.length; i++) {
            if (a.distance == 0 && i == 1) {
                // skip over leading zero distances to find the first non-zero value
                int j = i;
                while (j < distance.length) {
                    if (distance[j] != 0) {
                        a.distance = distance[j];
                        break;
                    }
                    j++;
                }
            }
            if (a.distance > distance[i] && distance[i] != 0) {
                a.distance = distance[i];
                a.pos = i;
            }
        }
        return a;
    }

    // Fitness of each particle = average intra-cluster distance over its centroids.
    public void calFitness() {
        for (int i = 0; i < particles.length; i++) {
            System.gc();
            newFitness[i] = 0;
            for (int l = 0; l < particles[0].length; l++) {
                clusterSize[l] = 0;
                distance[l] = 0;
                intraclustDistance[l] = 0;
            }
            // assign every document to the nearest centroid of particle i
            for (int j = 0; j < tfIdf.length; j++) {
                for (int k = 0; k < particles[i].length; k++) {
                    distance[k] = eucliDistance(tfIdf[j], particles[i][k]);
                }
                little = Small(distance);
                intraclustDistance[little.pos] += little.distance;
                clusterSize[little.pos]++;
            }
            for (int k = 0; k < particles[0].length; k++) {
                intraclustDistance[k] = intraclustDistance[k] / clusterSize[k];
                if (Float.isNaN(intraclustDistance[k]))
                    intraclustDistance[k] = (float) 3.3406782; // penalty for an empty cluster
                System.out.println("The intracluster distance in cluster:" + k + " is " + intraclustDistance[k]);
            }
            System.out.println();
            for (int k = 0; k < particles[0].length; k++)
                newFitness[i] += intraclustDistance[k];
            newFitness[i] = newFitness[i] / particles[0].length;
            if (Float.isNaN(newFitness[i]))
                newFitness[i] = fitness[i];
            System.out.println("The Fitness of particle " + i + " is: " + newFitness[i]);
            System.out.println();
        }
    }

    // Standard PSO update: v = w*v + c1*r1*(pBest - x) + c2*r2*(gBest - x); x = x + v.
    public void changePartiVelocityLocation() {
        float rand1 = (float) Math.random();
        float rand2 = (float) Math.random();
        while (rand1 == rand2) // ensure two distinct random numbers
            rand2 = (float) Math.random();
        System.gc();
        for (int i = 0; i < particles.length; i++) {
            for (int j = 0; j < particles[i].length; j++) {
                for (int k = 0; k < particles[i][j].length; k++) {
                    partiVelocity[i][j][k] = (float) (0.72 * partiVelocity[i][j][k]
                            + 1.42 * rand1 * (pBest[i][j][k] - particles[i][j][k])
                            + 1.42 * rand2 * (gBest[j][k] - particles[i][j][k]));
                    if (Float.isNaN(partiVelocity[i][j][k]))
                        partiVelocity[i][j][k] = 0;
                    particles[i][j][k] += partiVelocity[i][j][k];
                }
            }
        }
    }

    // Update a particle's personal best when its new fitness improves (smaller is better).
    public void findpBest() {
        for (int i = 0; i < fitness.length; i++) {
            if (fitness[i] > newFitness[i]) {
                fitness[i] = newFitness[i];
                for (int j = 0; j < particles[0].length; j++) {
                    System.arraycopy(particles[i][j], 0, pBest[i][j], 0, particles[i][j].length);
                }
            }
        }
    }

    // Update the swarm's global best from the current fitness values.
    public void findgBest(int i) {
        int flag = 0;
        small a = Small(fitness);
        if (i == 0) {
            gBestFitness = a.distance;
            flag = 1;
        } else {
            if (i == 1 && gBestFitness == 0) {
                gBestFitness = a.distance;
            } else if (a.distance != 0 && gBestFitness > a.distance) {
                gBestFitness = a.distance;
                flag = 1;
            }
        }
        System.out.println("The gBest Fitness is:" + gBestFitness);
        if (flag == 1)
            for (int j = 0; j < particles[0].length; j++) {
                System.arraycopy(particles[a.pos][j], 0, gBest[j], 0, particles[0][0].length);
            }
    }

    // True when every particle has converged to the same fitness value.
    public boolean checkFitness() {
        float a = newFitness[0];
        byte count = 0;
        for (int i = 1; i < newFitness.length; i++)
            if (Math.abs(a - newFitness[i]) == 0)
                count++;
        if (count == newFitness.length - 1) {
            System.out.println("After checking Fitness:");
            for (int l = 0; l < newFitness.length; l++)
                System.out.println(newFitness[l]);
            return true;
        }
        return false;
    }

    // Main PSO loop: at most n iterations, or until the swarm converges.
    public void psoalg(int n) {
        for (int i = 0; i < n; i++) {
            System.gc();
            System.out.println();
            System.out.println("iteration: " + i);
            System.out.println();
            calFitness();
            if (i == 0) {
                // first iteration: current positions become the personal bests
                System.arraycopy(newFitness, 0, fitness, 0, newFitness.length);
                System.out.println("Fitness:" + fitness[0]);
                System.out.println();
                for (int j = 0; j < fitness.length; j++) {
                    for (int k = 0; k < particles[0].length; k++) {
                        System.arraycopy(particles[j][k], 0, pBest[j][k], 0, particles[j][k].length);
                    }
                }
                findgBest(i);
                changePartiVelocityLocation();
            } else {
                findpBest();
                findgBest(i);
                changePartiVelocityLocation();
            }
            if (checkFitness()) {
                System.out.println("Yes");
                break;
            }
        }
    }

    // Dump every particle's centroids to a text file.
    public void show() throws IOException {
        PrintWriter out = null;
        try {
            out = new PrintWriter(new FileWriter("c:\\dc\\psoparticles.txt"));
            for (int i = 0; i < particles.length; i++) {
                out.println("particle:" + (i + 1));
                for (int j = 0; j < particles[0].length; j++) {
                    out.println("Cluster:" + (j + 1));
                    for (int k = 0; k < particles[0][0].length; k++) {
                        out.print(particles[i][j][k] + "\t");
                    }
                    out.println();
                }
            }
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, e.toString(), "pso-show()", JOptionPane.ERROR_MESSAGE);
        } finally {
            if (out != null)
                out.close();
        }
    }

    // Average pairwise distance between the gBest centroids (inter-cluster separation).
    public float centToCentDistance() {
        float result = 0;
        for (int i = 0; i < gBest.length; i++) {
            for (int j = i + 1; j < gBest.length; j++) {
                float temp = eucliDistance(gBest[i], gBest[j]);
                System.out.println("The distance from centroid" + (i + 1) + " to centroid" + (j + 1) + " is : " + temp);
                result += temp;
            }
        }
        int n = gBest.length;
        n = (n * (n - 1)) / 2;
        result = result / n;
        System.out.println("The average distance is:" + result);
        return result;
    }

    // Fitness of the gBest solution: average intra-cluster distance; also
    // records the document-to-cluster assignment in clusterPoints.
    public float intDist() {
        float distancei[] = new float[gBest.length];
        float intraclustDistancei[] = new float[gBest.length];
        int clusterSizei[] = new int[gBest.length];
        float fitnessi = 0;
        small littlei;

        clusterPoints = new boolean[tfIdf.length][gBest.length];

        for (int j = 0; j < tfIdf.length; j++) {
            for (int k = 0; k < gBest.length; k++) {
                distancei[k] = eucliDistance(tfIdf[j], gBest[k]);
            }
            littlei = Small(distancei);
            clusterPoints[j][littlei.pos] = true;
            intraclustDistancei[littlei.pos] += littlei.distance;
            clusterSizei[littlei.pos]++;
        }
        for (int k = 0; k < gBest.length; k++)
            intraclustDistancei[k] = intraclustDistancei[k] / clusterSizei[k];
        for (int k = 0; k < gBest.length; k++) {
            System.out.println("Cluster" + k + ":" + intraclustDistancei[k]);
            fitnessi += intraclustDistancei[k];
        }
        fitnessi = fitnessi / gBest.length;
        System.out.println("The gBest fitness is:" + fitnessi);
        return fitnessi;
    }

    // Build a printable report of which documents fall in which cluster.
    public String finddocclust() {
        String clust = "";
        int flag;
        for (int i = 0; i < clusterPoints[0].length; i++) {
            clust += "The documents under cluster: " + i + " are:" + "\n";
            flag = 0;
            for (int j = 0; j < clusterPoints.length; j++) {
                if (clusterPoints[j][i]) {
                    flag++;
                    clust += Integer.toString(j);
                    if (flag % 5 == 0)
                        clust += "\n";
                    else
                        clust += "\t";
                    if (flag == 5)
                        flag = 0;
                }
            }
            clust += "\n" + "**************************************" + "\n";
        }
        System.out.println("The cluster result is:");
        System.out.println(clust);
        return clust;
    }
}
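Assuming the file layout and constructor shown above, a driver for this class might look like the following sketch; the corpus dimensions, cluster count, swarm size, and iteration budget are illustrative values only:

// Hypothetical driver for the pso class above (illustrative values).
public class RunPso {
    public static void main(String[] args) throws java.io.IOException {
        pso p = new pso(1460, 2942, 4, 10);   // documents x terms, 4 clusters, 10 particles
        p.extractData();                      // load c:\dc\tfIdfMatrix.txt
        p.assignParticles();                  // seed centroids with random documents
        p.psoalg(50);                         // run at most 50 PSO iterations
        p.show();                             // dump the particle centroids to a file
        p.intDist();                          // gBest fitness + document assignments
        System.out.println(p.finddocclust()); // print the documents in each cluster
    }
}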


5. Testing

5.1 Unit Testing

Tests for Input

Test case: What happens when "OK" is pressed with all fields left empty?

Expected output: When the user clicks OK without any input for the fields, an error message should be prompted in a dialog box saying "Select appropriate fields properly".

Observed output: When OK is pressed, the error is prompted in the dialog box; the error shown is the same as expected. No errors are displayed when all fields are entered correctly.


Tests for empty features field

Test case: What happens when "OK" is pressed with the features field left empty?

Expected output: When the user clicks OK without any input for the features field, an error message should be prompted in a dialog box saying "Select features fields properly".

Observed output: When OK is pressed, the error is prompted in the dialog box; the error shown is the same as expected. No errors are displayed when the features field is entered correctly.


Tests for empty vectors field

Test case: What happens when "OK" is pressed with the vectors field left empty?

Expected output: When the user clicks OK without any input for the vectors field, an error message should be prompted in a dialog box saying "Select vectors fields properly".

Observed output: When OK is pressed, the error is prompted in the dialog box; the error shown is the same as expected. No errors are displayed when the vectors field is entered correctly.


Tests for empty algorithms field

Test case: What happens when "OK" is pressed with no algorithm selected?

Expected output: When the user clicks OK without selecting an algorithm, an error message should be prompted in a dialog box saying "Select Algorithms fields properly".

Observed output: When OK is pressed, the error is prompted in the dialog box; the error shown is the same as expected. No errors are displayed when an algorithm is selected correctly.


5.2 Performance Evaluation

Table 5.1 contains the VRC values of the different algorithms on the datasets listed below. We compare the VRC values of the PSO, TS and TSPSO clustering algorithms; the graph is plotted from the VRC values of these algorithms.

Table 5.1: VRC values of the three algorithms

Data        PSO      TS       TSPSO
Dataset1    0.489    0.458    0.38
Dataset2    0.502    0.491    0.305
Dataset3    0.561    0.482    0.4

Using the VRC criterion defined earlier, we calculate the VRC value of each algorithm in every iteration on the different document datasets. The table above gives the VRC values of the PSO, TS and TSPSO clustering algorithms on the corresponding datasets.

[Plot residue: X-axis — datasets tr12, tr11, fbis, re0, re1; Y-axis — F-Score values (0 to 0.7); series — Bisecting Incremental K-Means, Incremental K-Means, K-Means, Spherical K-Means.]

Figure 5.2: Performance evaluation of the algorithms


Figure 5.2 plots the values of the algorithms: the datasets are taken on the X-axis and the values on the Y-axis. For each dataset, the values of the clustering algorithms are plotted, including the bisecting incremental K-means values; the cyan plot represents incremental K-way clustering using MVS, the yellow plot represents the K-means F-Score values, and the brown plot represents the spherical K-means values. From the figure, we conclude that the bisecting F-Score value for each dataset is high compared to the other algorithms.


6. Results

This is the home page, where the user must select Next in order to carry out the tasks that need to be performed.


The above figure demonstrates the user-selectable fields:

1. Enter vector file: the user enters the location of the vector file in order to provide the ids as input.

2. Enter features file: the user enters the location of the features file in order to provide the data related to each id.

3. AMOC: used for generating the clusters from the given input files.

4. TSPSO: used for finding the required clusters.

5. Particle Swarm Optimization and Tabu Search clustering: used for testing the values generated by TSPSO.


The above figure demonstrates the selection of the vector file as input.

The above figure demonstrates the input of the features file.


The above figure demonstrates the ids that are selected.

The above figure demonstrates the selection of AMOC for cluster generation.


The above figure demonstrates the successful completion of mining.

The above figure shows the result generated for the given input files and clusters.


The above figure shows the fitness function results generated using the input values.


The above figure shows the final result generated after successful execution.


7. CONCLUSION

In this thesis a new hybrid algorithm that combines Tabu Search with basic PSO is proposed to solve the problem of document clustering. PSO has been proven an effective optimization technique for combinatorial optimization problems, while Tabu Search, an efficient local search procedure, helps to explore solutions in different regions of the solution space. The hybrid algorithm blends the features of basic PSO and TS, and the quality of the solutions it obtains strongly substantiates its effectiveness for document clustering in an IR system.

We also compared TSPSO with particle swarm optimization (PSO) and Tabu search (TS). These algorithms were applied to different datasets, and the results of the proposed TSPSO algorithm were compared with those of the existing algorithms. The results show that TSPSO attains the largest VRC values among all the algorithms, from which we conclude that TSPSO gives the most accurate clusters and is effective for the document cluster analysis problem. Future work includes using more standard data sets to test the performance of TSPSO.


References

1. P. Jaganathan, S. Jaiganesh: "An improved K-means algorithm combined with Particle Swarm Optimization approach for efficient web document clustering". International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), IEEE (2013).

2. M. Yaghini, N. Ghazanfari: "Tabu-KM: A Hybrid Clustering Algorithm Based on Tabu Search Approach". International Journal of Industrial Engineering & Production Research, Volume 21, September (2010).

3. Pritesh Vora, Bhavesh Oza: "A Survey on K-mean Clustering and Particle Swarm Optimization". International Journal of Science and Modern Engineering (IJISME), ISSN: 2319-6386, Volume 1, Issue 3, February (2013).

4. Yudong Zhang, Dayong Li: "Cluster Analysis by Variance Ratio Criterion and Firefly Algorithm". International Journal of Digital Content Technology and its Applications (JDCTA), Volume 7, Number 3, February (2013).

5. Karypis, G.: CLUTO — a Clustering Toolkit. Technical report, Dept. of Computer Science, Univ. of Minnesota (2013). http://glaros.dtc.umn.edu/~gkhome/views/cluto

6. K. Premalatha, A. M. Natarajan: "Discrete PSO with GA Operators for Document Clustering". International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May (2009).

Sites Referred:

http://java.sun.com
http://www.sourcefordgde.com
http://www.networkcomputing.com/
http://www.roseindia.com/
http://www.java2s.com/
http://www.javadb.com/
