1. INTRODUCTION
Data mining is the discovery of unknown patterns from both heterogeneous and
homogeneous databases. Secure data mining helps to discover association rules that are
shared by homogeneous databases (the same schema, but the data resides with different entities). The
algorithm not only finds the union and intersection of association rules, with the support and
confidence that hold in the total database, but also ensures that the data held by each player is
authenticated. It is estimated that the volume of data in the digital world increased from 161
exabytes in 2007 to 998 exabytes in 2011, about 18 times the amount of information present in all
the books ever written, and it continues to grow exponentially. This large amount of data has a
direct impact on Computer Data Inspection, which can be broadly defined as the discipline that
combines several elements of data and computer science to collect and analyze data from
computer systems in a way that is admissible, since the collected data fields should be consistent
with one another. An inspection may involve examining hundreds of thousands of files per computer, an activity that
exceeds the expert's capacity for analysis and interpretation of data. Therefore, methods for
automated data analysis, like those widely used for machine learning and data mining, are of
paramount importance. In particular, algorithms for recognizing patterns in the information
present in text documents are promising.
Clustering algorithms are typically used for exploratory data analysis, where there is little or
no prior knowledge about the data. This is precisely the case in several applications of Computer
Data Inspection, including the one addressed in our work. From a more technical viewpoint, our
datasets consist of unlabeled objects: the classes or categories of documents that can be found are
a priori unknown. Moreover, even assuming that labeled datasets were available from
previous analyses, there is almost no hope that the same classes (possibly learned earlier by a
classifier in a supervised learning setting) would still be valid for the upcoming data, obtained
from other computers and associated with different investigation processes. More precisely, it is
likely that the new data sample would come from a different population. In this context, the use
of clustering algorithms, which are capable of finding latent patterns in text documents found
in seized computers, can enhance the analysis performed by the expert examiner. Clustering
algorithms have been studied for decades, and the literature on the subject is huge. Therefore, we
decided to choose a set of representative algorithms in order to show the potential of the
proposed approach, namely: the partitional K-means and K-medoids, the hierarchical
Single/Complete/Average Link, and the cluster ensemble algorithm known as CSPA, together
with the cosine similarity function. These algorithms were run with different combinations of their
parameters, resulting in various algorithmic instantiations. Thus, as a contribution of
our work, we compare their relative performances on the studied application domain, using
different sample text datasets containing information on topics such as sports, food habits, culture, and
animals.
1.1 Background and motivation
The main scope of this project is computer data analysis, in which hundreds of thousands of
files are usually examined. Much of the data in those files consists of unstructured text, which
is difficult for computer examiners to analyze. In this context, automated methods
of analysis are of great interest. In particular, algorithms for clustering documents can facilitate
the discovery of new and useful knowledge from the documents under analysis.
It is well known that the number of clusters is a critical parameter of many algorithms, and it is
usually a priori unknown. As far as we know, however, the automatic estimation of the number
of clusters has not been investigated in the computer data analysis literature. Actually, we
could not locate even one work that is reasonably close in its application domain and that reports
the use of algorithms capable of estimating the number of clusters. Perhaps even more surprising
is the lack of studies on hierarchical clustering algorithms, which date back to the sixties.
1.2 Problem Statement
The problem addressed is how to identify documents that are stored in
remote locations inside a computer during a computer inspection. Computer inspections are
carried out regularly in organizations in order to identify particular kinds of data, and at that
time it is very difficult to identify the data with existing algorithms. We have therefore proposed
a new system that identifies the documents easily and clusters them according to the matched
attributes present in the system.
2. LITERATURE SURVEY
The literature survey is an important step in the software development process. Before developing
the tool it is necessary to consider the time factor, the economy, and company strength. Once these
things are satisfied, the next step is to determine which operating system and language can be used
for developing the tool. Once the programmers start building the tool, they need a lot
of external support. This support can be obtained from senior programmers, from books, or from
websites. Before building the system, the above considerations are taken into account in developing
the proposed system.
2.1 Cluster ensembles: A knowledge reuse framework for combining multiple
partitions
This project introduces the problem of combining multiple partitionings of a set of objects
into a single consolidated clustering without accessing the features or algorithms that determined
these partitionings. We first identify several application scenarios for the resultant 'knowledge
reuse' framework that we call cluster ensembles. The cluster ensemble problem is then
formalized as a combinatorial optimization problem in terms of shared mutual information. In
addition to a direct maximization approach, we propose three effective and efficient techniques
for obtaining high-quality combiners (consensus functions). The first combiner induces a
similarity measure from the partitionings and then reclusters the objects. The second combiner is
based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters
which then compete for each object to determine the combined clustering. Due to the low
computational costs of our techniques, it is quite feasible to use a supra-consensus function that
evaluates all three approaches against the objective function and picks the best solution for a
given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different
application scenarios: (i) where the original clusters were formed based on non-identical sets of
features, (ii) where the original clustering algorithms worked on non-identical sets of objects,
and (iii) where a common data-set is used and the main purpose of combining multiple
clusterings is to improve the quality and robustness of the solution. Promising results are
obtained in all three situations for synthetic as well as real data-sets.
2.2 Evolving clusters in gene-expression data
Clustering is a useful exploratory tool for gene-expression data. Although successful
applications of clustering techniques have been reported in the literature, there is no method of
choice in the gene-expression analysis community. Moreover, there are only a few works that
deal with the problem of automatically estimating the number of clusters in bioinformatics
datasets. Most clustering methods require the number k of clusters to be either specified in
advance or selected a posteriori from a set of clustering solutions over a range of k. In both cases,
the user has to select the number of clusters. This project proposes improvements to a clustering
genetic algorithm that is capable of automatically discovering an optimal number of clusters and
its corresponding optimal partition based upon numeric criteria. The proposed improvements are
mainly designed to enhance the efficiency of the original clustering genetic algorithm, resulting
in two new clustering genetic algorithms and an evolutionary algorithm for clustering (EAC).
The original clustering genetic algorithm and its modified versions are evaluated in several runs
using six gene-expression datasets in which the right clusters are known a priori. The results
illustrate that all the proposed algorithms perform well in gene-expression data, although
statistical comparisons in terms of the computational efficiency of each algorithm point out that
EAC outperforms the others. Statistical evidence also shows that EAC is able to outperform a
traditional method based on multiple runs of k-means over a range of k.
2.3 Exploring data with self-organizing maps
This project discusses the application of a self-organizing map (SOM), an unsupervised
learning neural network model, to support decision making by computer investigators and assist
them in conducting data analysis in a more efficient manner. A SOM is used to search for
patterns in data sets and produce visual displays of the similarities in the data. The project
explores how a SOM can be used as a basis for further analysis. Also, it demonstrates how SOM
visualization can provide investigators with greater abilities to interpret and explore data
generated by computer tools.
2.4 Digital text string searching: Improving information retrieval effectiveness
by thematically clustering search results
Current digital text string search tools use match and/or indexing algorithms to search
digital evidence at the physical level to locate specific text strings. They are designed to achieve
100% query recall (i.e. find all instances of the text strings). Given the nature of the data set, this
leads to an extremely high incidence of hits that are not relevant to investigative objectives.
Although Internet search engines suffer similarly, they employ ranking algorithms to present the
search results in a more effective and efficient manner from the user's perspective. Current
digital forensic text string search tools fail to group and/or order search hits in a manner that
appreciably improves the investigator's ability to get to the relevant hits first (or at least more
quickly). This project proposes and empirically tests the feasibility and utility of post-retrieval
clustering of digital text string search results.
This project is presented as a work in progress. A working tool has been developed and
experimentation has begun. Findings regarding the feasibility and utility of the proposed
approach will be presented, as well as suggestions for follow-on research.
2.5 Towards an integrated e-mail analysis framework
Due to its simple and inherently vulnerable nature, e-mail communication is abused for
numerous illegitimate purposes. E-mail spamming, phishing, drug trafficking, cyber bullying,
racial vilification, child pornography, and sexual harassment are some common e-mail mediated
cyber crimes. Presently, there is no adequate proactive mechanism for securing e-mail systems.
In this context, this analysis plays a major role by examining suspected e-mail accounts to gather
evidence to prosecute criminals in a court of law. To accomplish this task, a forensic investigator
needs efficient automated tools and techniques to perform a multi-staged analysis of e-mail
ensembles with a high degree of accuracy, and in a timely fashion. In this article, we present our
e-mail forensic analysis software tool, developed by integrating existing state-of-the-art
statistical and machine-learning techniques complemented with social networking techniques. In
this framework we incorporate our two proposed authorship attribution approaches.
3. SYSTEM REQUIREMENTS
3.1 Requirement Analysis Document
Requirement analysis is the first phase in the software development process. The main objective of this phase is to identify the problem and the system to be developed. The later phases depend strictly on this phase, and hence the requirements gathered by the system analyst must be clear and precise. Any inconsistency in this phase will lead to many problems in the phases that follow. Hence there will be several reviews before the final copy of the analysis of the system to be developed is made. After the analysis is completed, the system analyst submits the details of the system to be developed in a document called the requirement specification.
The requirement analysis task is a process of discovery, refinement, modeling, and specification. The software scope, initially established by a system engineer and refined during software project planning, is refined in detail. Models of the required data, information and control flow, and operational behavior are created. Alternative solutions are analyzed and allocated to various software elements.
Both the developer and the customer take an active role in requirement analysis and specification. The customer attempts to reformulate a sometimes nebulous concept of software function and performance into concrete detail. The developer acts as interrogator, consultant, and problem solver. The communication content is very high, chances of misinterpretation and misinformation abound, and ambiguity is probable.
Requirement analysis is a software engineering task that bridges the gap between system-level software allocation and software design. It enables the system engineer to specify software function and performance, indicate the software's interfaces with other system elements, and establish constraints that the software must meet. It allows the software engineer, often called the analyst in this role, to refine the software allocation and build models of the data, functional, and behavioral domains that will be treated by the software.
Requirement analysis provides the software designer with models that can be translated into data, architectural, interface, and procedural designs. Finally, the requirement specification provides the developer and the customer with the means to assess quality once the software is built.
3.1.1 Functional Requirements
The functional requirements of the system define the functions of a software system or its
components. A function is described as a set of inputs, the behavior of the system, and outputs.
The functional requirements comprise three parts:
1) Input
2) Output
3) Data Storage
1) Input
The following are the inputs to the current application:
a. User selects a text file as the input data set.
b. User clicks the Stop Words button in order to remove unwanted words (i.e., words other
than nouns, verbs, and adverbs).
c. User clicks the Stemming button to remove duplicate attributes.
d. User clicks the Calculation button in order to get the result.
e. User clicks K-means to generate clusters with IDs.
f. User clicks the Distance Calculation button in order to calculate the distance between attributes.
g. User clicks the Incremental button to generate clusters.
h. User clicks the Purity button to get the purity values of K-means and incremental clustering.
2) Output
The following are the outputs the user receives:
a. User gets the message "Data Selected".
b. User gets the message "Stopwords removal completed".
c. User gets the message "File Selected Successfully" after choosing a valid input file.
d. User gets the failure message "not a valid type" after choosing an invalid input file type.
e. User gets the filtered words with no duplication after clicking the Stemming button.
f. User gets the cluster IDs and cluster values after clicking K-means.
g. User gets the distance matrix values after clicking Generate Distance Matrix.
h. User gets the processed cosine similarity values after choosing incremental clustering.
i. User gets the graph comparing the purity values of K-means and incremental clustering.
3) Data Storage
Here we use a MySQL database to store all the registration details. In this project we use
MySQL as the back end because it has the following advantages:
It is GUI in nature.
It is cross-platform (i.e., it can run and reside on any operating system).
It has a feature called auto-commit.
It takes very little space to install on any system (hardly less than 30 MB).
3.1.2 Non-Functional Requirements
The non-functional requirements are as follows:
1) Reusability: As we developed the application in Java, it can be reused by anyone
without any restrictions on its usage. Hence it is reusable.
2) Portability: As the application is written in Java, which runs on any operating system,
the application is portable to any operating system.
3) Extensibility: The application can be extended at any level if the user wishes to extend it
in future. This is possible because Java is an open-source platform that has no time
limits for expiry or renewal.
3.2 Requirements
Hardware Requirements:
System: Pentium IV, 2.4 GHz
Hard Disk: 40 GB
Floppy Drive: 1.44 MB
Monitor: 15" VGA colour
Mouse: Logitech
RAM: 512 MB
Table 3.1: Hardware Requirements
Software Requirements:
Operating System: Windows XP
Coding Language: Java (Swing)
Database: MySQL
Table 3.2: Software Requirements
4. DESIGNING
4.1 Design Considerations
Software design is a process of problem solving and planning for a software
solution. After the purpose and specifications of the software are determined, software developers
will design or employ designers to develop a plan for a solution. It includes low-level component
and algorithm implementation issues as well as the architectural view.
4.1.1 Assumptions and Dependencies
It is assumed that the system will be deployed on Windows 7 or a later operating
system. A working installation of Visual Studio 2010 or above is necessary.
4.1.2 General Constraints
This project is a desktop-based application developed in Java technology. A major
constraint is to provide security for the information. In our project we use a symmetric
cryptography algorithm, and the ciphertext and key follow different paths.
4.1.3 Development Methods
A system development methodology refers to the framework that is used to structure,
plan, and control the process of developing an information system. The following diagram
explains the stages.
Figure 4.1: Water-Fall Model
Requirement Analysis and Definition
All possible requirements of the system to be developed are captured in this phase.
Requirements are a set of functions and constraints that the end user (who will be using the
system) expects from the system. The requirements are gathered from the end user at the start of
the software development phase. These requirements are analyzed for their validity and the
possibility of incorporating the requirements in the system to be developed is also studied.
Finally, a requirement specification document is created, which serves as a guideline
for the next phase of the model.
System and Software Design
Before starting the actual coding phase, it is highly important to understand the requirements of the
end user and also to have an idea of what the end product should look like. The requirement
specifications from the first phase are studied in this phase and a system design is prepared.
System design helps in specifying hardware and system requirements and also helps in defining
the overall system architecture. The system design specifications serve as an input for the next
phase of the model.
Implementation and Unit Testing
On receiving the system design documents, the work is divided into modules/units and actual
coding is started. The system is first developed in small programs called units, which are
integrated in the next phase. Each unit is developed and tested for its functionality; this is
referred to as unit testing. Unit testing mainly verifies that the modules/units meet their
specifications.
Integration and System Testing
As specified above, the system is first divided into units which are developed and tested
for their functions. These units are integrated into a complete system during integration phase
and tested to check if all modules/units coordinate with each other and the system as a whole
behaves as per the specifications. After successfully testing the software, it is delivered to the
customer.
4.2 System Design
The DFD is also called a bubble chart. It is a simple graphical formalism that can be used
to represent a system in terms of the input data to the system, the various processing carried out on this
data, and the output data generated by the system.
Figure 4.2: Data Flow Diagram (Documents → Preprocessing → Term Frequency → Similarity Calculation → Cluster Formation → Query Results)
1. The data flow diagram (DFD) is one of the most important modeling tools. It is used to
model the system components. These components are the system process, the data used
by the process, an external entity that interacts with the system and the information flows
in the system.
2. DFD shows how the information moves through the system and how it is modified by a
series of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
3. A DFD may be used to represent a system at any level of abstraction, and it may be
partitioned into levels that represent increasing information flow and functional detail.
4.2.1 Proposed Architecture
Figure 4.3: Proposed Architecture
The architecture contains four modules. These are listed below
1. Pre-Processing Module
2. Calculating the number of clusters
3. Clustering techniques
4. Removing Outliers
4.2.1.1 Preprocessing Module
Before running clustering algorithms on the text datasets, we performed some preprocessing
steps. In particular, stop words (prepositions, pronouns, articles, and irrelevant document
metadata) have been removed. Also, the Snowball stemming algorithm for Portuguese words
has been used. Then, we adopted a traditional statistical approach to text mining, in which
documents are represented in a vector space model. In this model, each document is represented
by a vector containing the frequencies of occurrence of its words, which are defined as delimited
alphabetic strings whose number of characters is between 4 and 25. We also used a
dimensionality reduction technique known as Term Variance (TV), which can increase both the
effectiveness and the efficiency of clustering algorithms. TV selects the attributes (in our
case, 100 words) that have the greatest variances over the documents. In order to compute
distances between documents, two measures have been used, namely cosine-based distance and
Levenshtein-based distance. The latter has been used to calculate distances between file
(document) names only.
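The two distance measures can be sketched in Java as follows. This is an illustrative helper, not the project's actual code: the class and method names are ours, and documents are assumed to be already represented as term-frequency vectors.

```java
// Illustrative sketch (class and method names are ours, not the project's code).
public class DistanceMeasures {

    // Cosine-based distance between two term-frequency vectors: 1 - cos(a, b).
    public static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 1.0; // empty vector: maximally distant
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Levenshtein edit distance, used to compare file (document) names only.
    public static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[s.length()][t.length()];
    }
}
```

Note that the cosine measure compares term-frequency vectors while Levenshtein compares character strings, which is why the former suits document contents and the latter file names.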
4.2.1.2 Calculating the Number of Clusters
In order to estimate the number of clusters, a widely used approach consists of obtaining a
set of data partitions with different numbers of clusters and then selecting the particular partition
that provides the best result according to a specific quality criterion (e.g., a relative validity
index). Such a set of partitions may result directly from a hierarchical clustering dendrogram or,
alternatively, from multiple runs of a partitional algorithm (e.g., K-means) starting from different
numbers and initial positions of the cluster prototypes.
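As a sketch of such a relative validity index, the following Java method computes the mean silhouette width of a given partition from a precomputed distance matrix; the partition with the highest value over a range of k would then be selected. The names are illustrative, not the project's actual code.

```java
// Illustrative sketch of a relative validity index (not the project's code).
public class SilhouetteIndex {

    // Mean silhouette width of a partition, given a full distance matrix.
    // labels[i] is the cluster of object i; k is the number of clusters.
    public static double silhouette(double[][] dist, int[] labels, int k) {
        int n = labels.length;
        double total = 0;
        for (int i = 0; i < n; i++) {
            double[] meanDist = new double[k];
            int[] count = new int[k];
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                meanDist[labels[j]] += dist[i][j];
                count[labels[j]]++;
            }
            int ci = labels[i];
            if (count[ci] == 0) continue;        // singleton: s(i) = 0 by convention
            double a = meanDist[ci] / count[ci]; // a(i): mean distance to own cluster
            double b = Double.MAX_VALUE;         // b(i): nearest neighbouring cluster
            for (int c = 0; c < k; c++)
                if (c != ci && count[c] > 0) b = Math.min(b, meanDist[c] / count[c]);
            if (Math.max(a, b) == 0) continue;   // degenerate case: identical objects
            total += (b - a) / Math.max(a, b);
        }
        return total / n; // close to 1 means compact, well-separated clusters
    }
}
```

Selecting the number of clusters then amounts to running the clustering algorithm for each candidate k and keeping the partition whose silhouette value is largest.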
4.2.1.3 Clustering Techniques
The clustering algorithms adopted in our study (the partitional K-means and K-medoids, the
hierarchical Single/Complete/Average Link, and the cluster-ensemble-based algorithm known as
CSPA) are popular in the machine learning and data mining fields, and therefore they have been
used in our study. Nevertheless, some of our choices regarding their use deserve further
comment. For instance, K-medoids is similar to K-means; however, instead of computing
centroids, it uses medoids, which are representative objects of the clusters. This property
makes it particularly interesting for applications in which (i) centroids cannot be computed, and
(ii) distances between pairs of objects are available, as when computing dissimilarities between
the names of documents with the Levenshtein distance.
4.2.1.4 Removing Outliers
We assess a simple approach to removing outliers that makes recursive use of the
silhouette. Fundamentally, if the best partition chosen by the silhouette contains singletons (i.e.,
clusters formed by a single object only), these are removed. Then the clustering process is
repeated, over and over again, until a partition without singletons is found. At the end of the
process, all singletons are incorporated into the resulting data partition (for evaluation purposes)
as single clusters.
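The recursive singleton-removal loop described above can be sketched as follows. The `bestPartition` function stands in for the full "cluster the data and pick the best partition by the silhouette" step, and all names are illustrative rather than taken from the project code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of recursive singleton (outlier) removal.
public class OutlierRemoval {

    // `bestPartition` maps a list of object ids to one cluster label per object
    // (standing in for clustering + silhouette-based partition selection).
    public static List<Integer> removeSingletons(List<Integer> objects,
            Function<List<Integer>, int[]> bestPartition) {
        List<Integer> outliers = new ArrayList<>();
        List<Integer> remaining = new ArrayList<>(objects);
        while (!remaining.isEmpty()) {
            int[] labels = bestPartition.apply(remaining);
            // Count cluster sizes to find singletons.
            Map<Integer, Integer> sizes = new HashMap<>();
            for (int l : labels) sizes.merge(l, 1, Integer::sum);
            List<Integer> next = new ArrayList<>();
            boolean foundSingleton = false;
            for (int i = 0; i < remaining.size(); i++) {
                if (sizes.get(labels[i]) == 1) {       // singleton => outlier
                    outliers.add(remaining.get(i));
                    foundSingleton = true;
                } else {
                    next.add(remaining.get(i));
                }
            }
            if (!foundSingleton) break; // partition without singletons found
            remaining = next;           // repeat the clustering process
        }
        // For evaluation, the outliers are later re-added as single clusters.
        return outliers;
    }
}
```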
Input Design
Input design is the link between the information system and the user. It comprises the
specifications and procedures for data preparation, and the steps necessary to put
transaction data into a usable form for processing, whether by having the computer
read data from a written or printed document or by having people key the data
directly into the system. The design of input focuses on controlling the amount of input required,
controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The
input is designed in such a way that it provides security and ease of use while retaining
privacy. Input design considered the following things:
• What data should be given as input?
• How should the data be arranged or coded?
• The dialog to guide the operating personnel in providing input.
• Methods for preparing input validations, and the steps to follow when errors occur.
Objectives:
1. Input design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important for avoiding errors in the data input process and
for showing the management the correct direction for getting correct information from the
computerized system.
2. It is achieved by creating user-friendly screens for data entry that can handle large volumes of
data. The goal of designing input is to make data entry easier and free from errors. The data
entry screen is designed in such a way that all the data manipulations can be performed. It also
provides record-viewing facilities.
3. When data is entered, it is checked for validity. Data can be entered with the help of
screens, and appropriate messages are provided when needed, so that the user is never left
in a maze. Thus the objective of input design is to create an input layout that
is easy to follow.
Output Design
A quality output is one that meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users and to
other systems through outputs. In output design it is determined how the information is to be
displayed for immediate need, and how the hard-copy output is produced. The output is the most
important and most direct source of information for the user. Efficient and intelligent output
design improves the system's relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well-thought-out manner; the right
output must be developed while ensuring that each output element is designed so that people will
find the system easy and effective to use. When analysts design computer output, they
should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create document, report, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following
objectives:
• Convey information about past activities, current status, or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.
4.3 Unified Modeling Language
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented
computer software. In its current form, UML comprises two major
components: a meta-model and a notation. In the future, some form of method or process
may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing,
constructing, and documenting the artifacts of a software system, as well as for business
modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the
software development process.
The UML uses mostly graphical notations to express the design of software projects.
4.3.1 Scenarios
A scenario is "a narrative description of what people do and experience as
they try to make use of computer systems and applications". A scenario is a concrete,
focused, informal description of a single feature of the system from the viewpoint of a
single actor. Scenarios cannot replace use cases, as they focus on specific instances
and concrete events. However, scenarios enhance requirements elicitation by providing a
tool that is understandable to users and clients.
Scenario 1:
Use case Name: User Selects a Text Document
Participating Actors: User
Flow of Events: 1) User browses for a text file as the input data set. 2) User clicks the Browse button to select the dataset.
Entry Condition: User has to browse for an input dataset.
Exit Condition: The selected dataset is saved into the output panel; the user clicks the EXIT button to close the application.
Table 4.1: Scenario 1 table
Figure 4.4: User selects a text document (user → BrowseTextFile → checkOnBrowseButton for selection → selected data sets are saved & press EXIT button)
Scenario 2:
Use case Name: Preprocessing
Participating Actors: User
Flow of Events: 1) User browses for a text file as the input data set. 2) User clicks on Stop Words to remove unwanted words and phrases. 3) User then clicks the Stemming button in order to remove duplicates.
Entry Condition: User has to browse for an input dataset.
Exit Condition: The preprocessed data is saved onto the output panel; the user clicks the EXIT button to close the application.
Table 4.2: Scenario 2 table
Figure 4.5: Preprocessing
Scenario 3:
Use case Name: Term Frequency Calculation
Participating Actors: User
Flow of Events: 1) After preprocessing the input text file, the user clicks the Calculation button for clusters. 2) The term frequencies between all attributes and documents are calculated. 3) The user gets the frequency values for all the documents against the attributes.
Entry Condition: User has to browse for an input dataset.
Exit Condition: The term frequency data is saved onto the output panel; the user clicks the EXIT button to close the application.
Table 4.3: Scenario 3 table
Figure 4.6: Term Frequency Calculation (user: browse for input data set → preprocessing → click on calculation button for clusters → term frequency calculation → gets the frequency values → term frequency data is saved & click on EXIT button)
Scenario 4:
Use case Name: Similarity Calculation
Participating Actors: User
Flow of Events: 1) After the term frequency calculation, the user clicks the Next button. 2) The user clicks the Similarity button to calculate the cosine similarity values. 3) The sum of all document similarity values gives the purity values.
Entry Condition: User has to browse for an input dataset.
Exit Condition: The similarity calculation between all documents is saved onto the output panel; the user clicks the EXIT button to close the application.
Table 4.4: Scenario 4 table
Figure 4.7: Similarity Calculation (user: browse for input dataset → term frequency is calculated → click on next button → click on similarity button → cosine similarity values are calculated → purity values calculated → similarity calculation is saved & click on EXIT button)
Scenario 5

Table 4.5: Scenario 5 table

Use case Name: Cluster Formation and Query Result
Participating Actors: User
Flow of Events:
1) After processing the similarity values, the user clicks the Next button.
2) The user processes the cluster values, matching documents with cluster IDs.
3) The user finally obtains the query result, which lists the values that were not matched.
Entry Condition: The user has to browse for an input dataset.
Exit Condition: The cluster values for all documents are saved to the output panel, and the user clicks the EXIT button to close the application.

[Figure 4.8 shows the flow: browse for input dataset, preprocessing, similarity calculation, click the Next button, process the cluster values, get query results from the unmatched values, save the cluster values and click EXIT.]

Figure 4.8: Cluster Formation and Query Results

4.3.2 Use case Diagram: A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented as
use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors in
the system can be depicted.
System: A system/system boundary groups use cases together to accomplish a purpose. Each
use case diagram can only have one system.
Actor: An actor represents a coherent set of roles that users of the system play when interacting with the use cases of the system. An actor participates in use cases to accomplish an overall purpose. An actor can represent the role of a human, a device or any other system.
Use case: A use case describes a sequence of actions that provide something of measurable value
to an actor and is drawn as a horizontal ellipse.
Use case relationships: Four relationships among use cases are used often in practice.
Include: In one form of interaction, a given use case may include another. Include is a directed relationship between two use cases, implying that the behavior of the included use case is inserted into the behavior of the including use case. The first use case often depends on the outcome of the included use case. This is useful for extracting truly common behaviors from multiple use cases into a single description. The notation is a dashed arrow from the including to the included use case, with the label "«include»".
Extend: This relationship indicates that the behavior of the extension use case may be inserted in
the extended use case under some conditions. The notation is a dashed arrow from the extension to
the extended use case, with the label "«extend»". The notes or constraints may be associated with
this relationship to illustrate the conditions under which this behavior will be executed. Modelers
use the «extend» relationship to indicate use cases that are "optional" to the base use case.
Generalization: A given use case may have common behaviors, requirements, constraints, and assumptions with a more general use case. In this case, describe them once in the general use case and deal with the specialized cases in the same way, describing any differences. The notation is a solid line ending in a hollow triangle drawn from the specialized to the more general use case.
Association: Associations between actors and use cases are indicated in use case diagrams by solid lines. An
association exists whenever an actor is involved with an interaction described by a use case.
Associations are modeled as lines connecting use cases and actors to one another, with an optional
arrowhead on one end of the line. The arrowhead is often used to indicate the direction of the
initial invocation of the relationship or to indicate the primary actor within the use case.
[Figure 4.9 shows the actor User/Computer Examiner associated with the use cases ChooseAnInputDataset, Preprocessing, TermFrequency, SimilarityCalculation, ClusterFormation and EvaluatingQueryResults.]

Figure 4.9: Use Case Diagram
4.3.3 Class Diagram
A class diagram describes the static structure of the system. It is a graphic presentation of the
static view that shows a collection of declarative (static) model elements, such as classes, types,
and their contents and relationships. Classes are abstractions that specify the common structure
and behavior of a set of objects. Objects are the instances of the classes that are created, modified
and destroyed during the execution of the system. A Class diagram describes the system in terms
of objects, classes, attributes, operations and their associations.
Class: A rectangle is the icon that represents a class. It is divided into three areas: the uppermost contains the name, the middle area holds the attributes, and the bottom area holds the operations.
Package: A package is a mechanism for organizing elements into groups. It is used in the Use
Case, Class, and Component diagrams. Packages may be nested within other packages. A
package may contain both subordinate packages and ordinary model elements. The entire system
description can be thought of as a single high-level subsystem package with everything else in it.
Subsystem: A subsystem groups diagram elements together.
Generalization: Generalization is a relationship between a general element and a more specific
kind of that element. It means that the more specific element can be used whenever the general
element appears.
Usage: Usage is a dependency situation in which one element (the Client) requires the presence
of another element (the supplier) for its correct functioning or implementation.
Realization: Realization is the relationship between a specification and its implementation. It is an indication of the inheritance of behavior without the inheritance of structure: one classifier specifies a contract that another classifier guarantees to carry out. Realization is used in two places: between interfaces and the classes that realize them, and between use cases and the collaborations that realize them.
Association: An association is represented by drawing a line between classes and can be named to facilitate model understanding. If two classes are associated, you can navigate from an object of one class to an object of the other class.
Aggregation: Aggregation is a special kind of association in which one class, the whole, consists of smaller classes, the parts. It has the meaning of a "has-a" relationship.
Composition: Composition is a strong form of aggregation, with strong ownership and coincident lifetime of the parts by the whole. A part may belong to only one composite. Parts with non-fixed multiplicity may be created after the composite itself, but once created, they live and die with it (that is, they share lifetimes). Such parts can also be explicitly removed before the death of the composite.
N-ary Association: N-ary associations are associations that connect more than two classes.
Dependency: The dependency link is a semantic relationship between two elements. It indicates that whenever a change occurs in one element, a change may be necessary in the other element.
[Figure 4.10 shows four classes. Preprocess (attributes docclust: JLabel, cal: JButton, folder: file, str: string, word: string; operations preprocess(), calcActionPerformed(), strmmingAction()); process (attributes JButton1, JButton2, JButton3: JButton, JTable1: JTable; operations process(), JButtonAction(), JButton()); Stemmer (attributes Step1, Step2, step3, j, k: int; operations stemmer(), const(), cvc()); and Graph (attributes hsk, Kmeans: double; operations draw(), main()). The classes are connected by one-to-one and one-to-many associations.]

Figure 4.10: Class Diagram
4.3.4 Sequence Diagram:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.
Object: Object can be viewed as an entity at a particular point in time with a specific value and
as a holder of identity that has different values over time. Associations among objects are not
shown. When you place an object tag in the design area, a lifeline is automatically drawn and
attached to that object tag.
Actor: An actor represents a coherent set of roles that users of a system play when interacting
with the use cases of the system. An actor participates in use cases to accomplish an overall
purpose. An actor can represent the role of a human, a device, or any other systems.
Message: A message is a sending of a signal from one sender object to other receiver object(s).
It can also be the call of an operation on receiver object by caller object. The arrow can be
labeled with the name of the message (operation or signal) and its argument values. A sequence
number that shows the sequence of the message in the overall interaction as well as a guard
condition can also be labeled at the arrow.
Lifeline: It represents the duration of an object's existence in the interaction; the completion of an action or message causes a transition from one state to another. The lifeline of an object is represented with a dotted line.
Self Message: A message that an object sends to itself, indicating an action performed within that object at a particular state.
Create Message: A message that indicates the creation of a new object in the interaction.
Figure 4.11: Sequence Diagram
4.3.5 Collaboration diagram:
The communication diagram was called the collaboration diagram in UML 1. It is similar to a sequence diagram, but the focus is on the messages passed between objects; the same information can be represented using a sequence diagram and different objects.
Class roles:
Class roles describe how objects behave. Use the UML object symbol to illustrate class roles, but
don't list object attributes.
Association roles:
Association roles describe how an association will behave given a particular situation. You can
draw association roles using simple lines labeled with stereotypes.
Messages: Unlike sequence diagrams, collaboration diagrams do not have an explicit way to
denote time and instead number messages in order of execution. Sequence numbering can
become nested using the Dewey decimal system. The condition for a message is usually placed
in square brackets immediately following the sequence number. Use a * after the sequence
number to indicate a loop.
Figure 4.12: Collaboration Diagram
4.3.6 Activity Diagram:
Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows
of components. An Activity diagram consists of the following behavioral elements:
Action State: It describes the execution of an atomic action.
Sub-Activity: It is an activity performed within another activity.
Initial State: A pseudo-state that establishes the start of the workflow.
Final State: It signifies where a transition ends.
Horizontal Synchronization: A horizontal synchronization bar splits a single transition into parallel transitions or merges concurrent transitions into a single target.
Vertical Synchronization: A vertical synchronization bar likewise splits a single transition into parallel transitions or merges concurrent transitions into a single target, drawn vertically.
Decision Point: A decision point is used to model the conditional flow of control. Each output transition of a decision is labeled with a different guard condition.
Swim Lane: A swim lane is a partition on an interaction diagram for organizing responsibilities for activities. Each lane represents the responsibilities of a particular class. To use swim lanes, activity diagrams are arranged into vertical zones.
[Figure 4.13 shows the activity flow: document, preprocessing, term frequency, similarity computation, cluster formation and query results, with a validity decision (yes/no) routing invalid documents to an "unconsidered" state.]

Figure 4.13: Activity Diagram
5. IMPLEMENTATION
5.1 Preparing the data sets
The input to the document clustering algorithm can be any set of documents which have to be divided into clusters based on their similarity. The individual terms of each document have to be extracted in order to identify similar items. The data set thus undergoes three preprocessing steps:
Tokenization
Stop word Removal
Stemming
5.1.1 Tokenization
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis.
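The tokenization step can be sketched in Java as follows; the class name is illustrative and the splitting rule (lower-case, split on runs of non-letters) is a deliberately simple assumption, not the project's actual tokenizer:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal tokenizer: lower-cases the text and splits on any run of
// non-letter characters, discarding empty tokens.
public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Data mining, in practice!"));
        // [data, mining, in, practice]
    }
}
```

The resulting token list is what the stop-word removal and stemming steps below operate on.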
5.1.2 Stop word Removal
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text). There is no single definitive list of stop words that all tools use, and such a filter is not always applied; some tools specifically avoid removing stop words in order to support phrase search. Some common stop words are: a, be, been, and, as, out, ever, own, he, she, shall, etc.
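Stop-word removal is a simple set-membership filter over the token list. A sketch, using only the sample stop words listed above (real systems use larger, application-specific lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Filters a token list against a small sample stop-word list.
public class StopWordFilter {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "be", "been", "and", "as", "out", "ever", "own", "he", "she", "shall"));

    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(Arrays.asList("he", "owns", "a", "computer")));
        // [owns, computer]
    }
}
```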
5.1.3 Stemming
Stemming is the process of reducing derived words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, a kind of query expansion called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
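To illustrate the idea that related words map to the same stem, here is a deliberately crude suffix-stripping stemmer. It is not the Porter algorithm (which the project's Stemmer class, with its cvc() operation, appears to implement); it only shows the map-to-common-stem behavior:

```java
// A crude suffix-stripping stemmer for illustration only; production
// systems typically use the Porter algorithm, which applies ordered
// rule sets with measure conditions rather than blind suffix removal.
public class NaiveStemmer {
    public static String stem(String word) {
        String w = word.toLowerCase();
        // Longer suffixes are tried first so the longest match wins.
        String[] suffixes = {"edly", "ing", "ed", "es", "ly", "s"};
        for (String suf : suffixes) {
            // Only strip when a reasonable stem (>= 3 letters) remains.
            if (w.endsWith(suf) && w.length() - suf.length() >= 3) {
                return w.substring(0, w.length() - suf.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("clustering")); // cluster
        System.out.println(stem("clusters"));   // cluster
    }
}
```

Note how "clustering" and "clusters" conflate to the same stem, which is exactly what the preprocessing step relies on.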
5.2 Cluster Analysis
Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Data clustering is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of classifying a data set into k clusters is often referred to as k-clustering. Clustering is also called data segmentation in some applications, because it partitions large datasets into groups according to their similarity. Clustering can also be used for outlier detection.

Cluster analysis aims to organize a collection of patterns into clusters based on similarity. Clustering has its roots in many fields, such as mathematics, computer science, statistics, biology, and economics. In different application domains, a variety of clustering techniques have been developed, depending on the methods used to represent the data, the measure of similarity between data objects, and the technique for grouping data objects into clusters. In data mining, hierarchical clustering is a method of cluster analysis which creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). In this project we focus on document clustering using hierarchical clustering.
Types of clustering
There are different clustering methodologies. Data clustering algorithms can be hierarchical: hierarchical algorithms find successive clusters using previously established clusters, and can be agglomerative ("bottom-up") or divisive ("top-down"). Partitioning algorithms, in contrast, typically determine all clusters at once. There are also other clustering methods, such as density-based, grid-based, model-based and constraint-based clustering.
Figure: clustering
The clustering algorithms used are:
K-means Algorithm
K-medoids Algorithm
Hierarchical Algorithm
These algorithms result in minimal latency for all the clients.
Partitioning Clustering:
Given a database of n objects, a partitioning method constructs k partitions of the data (k <= n), where each partition represents a cluster; that is, it classifies the data into k groups. Given k, the number of partitions to construct, the method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
K-means Algorithm: Demonstration of the standard algorithm
1) k initial "means" (in this case k=3) are randomly generated within the data domain.
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
Figure 5.1:Clustering through K-means
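The four steps above can be sketched in Java for one-dimensional data. To keep the example deterministic and testable, the initial means are passed in rather than randomly generated (step 1), which is the only assumption made here:

```java
import java.util.Arrays;

// One-dimensional k-means: assign points to the nearest mean,
// recompute each mean as its cluster centroid, repeat until the
// assignment stops changing.
public class KMeans1D {
    public static int[] cluster(double[] points, double[] means) {
        int[] assign = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step 2: associate every observation with the nearest mean.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int k = 1; k < means.length; k++) {
                    if (Math.abs(points[i] - means[k]) < Math.abs(points[i] - means[best])) {
                        best = k;
                    }
                }
                if (assign[i] != best) {
                    assign[i] = best;
                    changed = true;
                }
            }
            // Step 3: the centroid of each cluster becomes the new mean.
            double[] sum = new double[means.length];
            int[] count = new int[means.length];
            for (int i = 0; i < points.length; i++) {
                sum[assign[i]] += points[i];
                count[assign[i]]++;
            }
            for (int k = 0; k < means.length; k++) {
                if (count[k] > 0) {
                    means[k] = sum[k] / count[k];
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 10.0};
        System.out.println(Arrays.toString(cluster(pts, new double[]{0.0, 5.0})));
        // [0, 0, 0, 1, 1, 1]
    }
}
```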
K-medoids Algorithm
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly the Euclidean, Manhattan or Minkowski distance).
3. For each medoid m:
4. For each non-medoid data point o:
5. Swap m and o and compute the total cost of the configuration.
6. Select the configuration with the lowest cost.
7. Repeat steps 2 to 6 until there is no change in the medoids.
Figure: Clustering through k-medoids
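A minimal sketch of the swap loop above, for one-dimensional points with absolute-difference distance; the fixed initial medoids passed in replace the random selection of step 1:

```java
// One-dimensional k-medoids (PAM-style): repeatedly try swapping each
// medoid with each data point, keeping any swap that lowers the total
// cost, until no swap improves the configuration.
public class KMedoids1D {
    // Total cost: sum of distances from every point to its closest medoid.
    public static double cost(double[] pts, double[] medoids) {
        double total = 0;
        for (double p : pts) {
            double best = Double.MAX_VALUE;
            for (double m : medoids) {
                best = Math.min(best, Math.abs(p - m));
            }
            total += best;
        }
        return total;
    }

    public static double[] cluster(double[] pts, double[] medoids) {
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i < medoids.length; i++) {
                for (double p : pts) {
                    double old = medoids[i];
                    double before = cost(pts, medoids);
                    medoids[i] = p;             // try the swap
                    if (cost(pts, medoids) < before) {
                        improved = true;        // keep the cheaper configuration
                    } else {
                        medoids[i] = old;       // revert
                    }
                }
            }
        }
        return medoids;
    }
}
```

Because the cost strictly decreases with every accepted swap, the loop always terminates.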
Hierarchical clustering: Hierarchical clustering works by grouping data objects into a tree of clusters. There are two types of hierarchical clustering:
Agglomerative hierarchical clustering: This is a bottom-up strategy: it starts by placing each object in its own cluster and then merges clusters into larger clusters until all the objects are in a single cluster.
Divisive hierarchical clustering: This is a top-down strategy: the clusters are subdivided into smaller pieces until each object forms a cluster on its own or until certain termination conditions are satisfied.
Agglomerative vs. Divisive approach:
Agglomerative approach:
• We start out with all sample units in n clusters of size 1.
• Then, at each step of the algorithm, the pair of clusters with the shortest distance between them is combined into a single cluster.
• The algorithm stops when all sample units are combined into a single cluster of size n.
Divisive approach:
• We start out with all sample units in a single cluster of size n.
• Then, at each step of the algorithm, a cluster is partitioned into a pair of daughter clusters, selected to maximize the distance between them.
• The algorithm stops when the sample units are partitioned into n clusters of size 1.
Table: comparison of the agglomerative and divisive approaches
Hierarchical Algorithm:
The choice of which clusters to merge or split is determined by a linkage criterion, such as:
• The maximum distance between elements of each cluster (also called complete-linkage clustering): max { d(x, y) : x ∈ A, y ∈ B }.
• The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(x, y) : x ∈ A, y ∈ B }.
• The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA).
• The sum of all intra-cluster variance.
• The increase in variance for the cluster being merged (Ward's method [6]).
• The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
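The complete-linkage and single-linkage criteria above can be computed directly. This sketch uses one-dimensional points with d(x, y) = |x - y| purely for illustration:

```java
// Complete-linkage and single-linkage distances between two clusters
// A and B: the maximum and minimum of d(x, y) over x in A, y in B.
public class Linkage {
    static double d(double x, double y) {
        return Math.abs(x - y);
    }

    public static double completeLinkage(double[] a, double[] b) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : a) {
            for (double y : b) {
                max = Math.max(max, d(x, y));
            }
        }
        return max;
    }

    public static double singleLinkage(double[] a, double[] b) {
        double min = Double.POSITIVE_INFINITY;
        for (double x : a) {
            for (double y : b) {
                min = Math.min(min, d(x, y));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0};
        double[] b = {5.0, 6.0};
        System.out.println(singleLinkage(a, b));   // 3.0
        System.out.println(completeLinkage(a, b)); // 5.0
    }
}
```

An agglomerative pass would call one of these for every pair of clusters and merge the pair with the smallest linkage distance.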
Hierarchical vs. partitioning algorithms:
Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining (splitting) two clusters from the next lower (next higher) level. Partitional techniques create a one-level (unnested) partitioning of the data points. If k is the desired number of clusters, partitional approaches typically find all k clusters at once. Contrast this with traditional hierarchical schemes, which bisect a cluster to get two clusters or merge two clusters to get one.
Distance Measure
An important step in any clustering is selecting a distance measure, which determines how the similarity of two elements is calculated. This influences the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. The various distance measures are:
Euclidean Distance
This is probably the most commonly chosen type of distance. It simply gives the geometric distance in the multidimensional space. It is computed as:
d(x, y) = sqrt( Σi (xi − yi)^2 )
The Euclidean (and squared Euclidean) distances are usually computed on raw data and not from standardized data.
City Block Distance (Manhattan Distance):
This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. The city-block distance is computed as:
d(x, y) = Σi |xi − yi|
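Both distance measures can be sketched over vectors of equal length:

```java
// Euclidean distance: square root of the sum of squared differences.
// Manhattan (city-block) distance: sum of absolute differences.
public class DistanceMeasures {
    public static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static double manhattan(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.abs(x[i] - y[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {0, 0};
        double[] b = {3, 4};
        System.out.println(euclidean(a, b)); // 5.0
        System.out.println(manhattan(a, b)); // 7.0
    }
}
```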
Cosine Similarity:
Cosine similarity is one of the most popular similarity measures applied to text documents, for example in various information retrieval applications and in clustering. An important property of the cosine similarity is its independence of document length. For two documents A and B, the similarity between them can be calculated as:
cos(A, B) = (A · B) / (|A| |B|)
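A sketch of the cosine similarity between two term-frequency vectors A and B:

```java
public class CosineSimilarity {
    // cos(A, B) = (A . B) / (|A| * |B|); independent of document length
    // because both vectors are normalized by their magnitudes.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Identical term distributions give a similarity of 1, and documents sharing no terms give 0, regardless of how long each document is.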
5.3 Software Environment and Technologies
Java Technology
Java technology is both a programming language and a platform.
The Java Programming Language
The Java programming language is a high-level language that can be characterized by all of the following buzzwords:
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program so that you can
run it on your computer. The Java programming language is unusual in that a program is both
compiled and interpreted. With the compiler, first you translate a program into an intermediate
language called Java byte codes —the platform-independent codes interpreted by the interpreter
on the Java platform. The interpreter parses and runs each Java byte code instruction on the
computer. Compilation happens just once; interpretation occurs each time the program is
executed. The following figure illustrates how this works.
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine
(Java VM). Every Java interpreter, whether it’s a development tool or a Web browser that can
run applets, is an implementation of the Java VM. Java byte codes help make “write once, run
anywhere” possible. You can compile your program into byte codes on any platform that has a
Java compiler. The byte codes can then be run on any implementation of the Java VM. That
means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
The Java Platform
A platform is the hardware or software environment in which a program runs. We've already mentioned some of the most popular platforms, like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it's a software-only
hardware. The Java platform differs from most other platforms in that it’s a software-only
platform that runs on top of other hardware-based platforms.
The Java platform has two components:
The Java Virtual Machine (JVM)
The Java Application Programming Interface (Java API)
The Java API is a large collection of ready-made software components that provide many
useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into
libraries of related classes and interfaces; these libraries are known as packages. Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code.
However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can
bring performance close to that of native code without threatening portability.
What Can Java Technology Do?
The most common types of programs written in the Java programming language are applets and
applications. If you’ve surfed the Web, you’re probably already familiar with applets. An applet
is a program that adheres to certain conventions that allow it to run within a Java-enabled
browser. However, the Java programming language is not just for writing cute, entertaining
applets for the Web. The general-purpose, high-level Java programming language is also a
powerful software platform. Using the generous API, you can write many types of programs.
An application is a standalone program that runs directly on the Java platform. A special kind of
application known as a server serves and supports clients on a network. Examples of servers are
Web servers, proxy servers, mail servers, and print servers. Another specialized program is a
servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets
are a popular choice for building interactive web applications, replacing the use of CGI scripts.
Servlets are similar to applets in that they are runtime extensions of applications. Instead of
working in browsers, though, servlets run within Java Web servers, configuring or tailoring the
server.
Java makes our programs better and requires less effort than other languages. Java technology
will help you do the following:
Get started quickly:
Although the Java programming language is a powerful object-oriented
language, it’s easy to learn, especially for programmers already familiar with C or
C++.
Write less code:
Comparisons of program metrics (class counts, method counts, and so on) suggest
that a program written in the Java programming language can be four times smaller
than the same program in C++.
Write better code:
The Java programming language encourages good coding practices, and its
garbage collection helps you avoid memory leaks. Its object orientation, its
JavaBeans component architecture, and its wide-ranging, easily extendible API let
you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly:
Our development time may be as much as twice as fast as writing the same program in C++. Why? We write fewer lines of code, and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java:
We can keep our program portable by avoiding the use of libraries written in
other languages. The 100% Pure JavaTM Product Certification Program has a
repository of historical process manuals, white papers, brochures, and similar
materials online.
Write once, run anywhere:
Because 100% Pure Java programs are compiled into machine-independent
byte codes, they run consistently on any Java platform.
Distribute software more easily:
We can upgrade applets easily from a central server. Applets take advantage
of the feature of allowing new classes to be loaded “on the fly,” without recompiling
the entire program.
5.3.1 ODBC
Microsoft Open Database Connectivity (ODBC) is a standard programming interface for
application developers and database systems providers. Before ODBC became a de facto
standard for Windows programs to interface with database systems, programmers had to use
proprietary languages for each database they wanted to connect to. Now, ODBC has made the
choice of the database system almost irrelevant from a coding perspective, which is as it should
be. Application developers have much more important things to worry about than the syntax that
is needed to port their program from one database to another when business needs suddenly
change. Through the ODBC Administrator in Control Panel, we can specify the particular
database that is associated with a data source that an ODBC application program is written to
use. Think of an ODBC data source as a door with a name on it. Each door will lead us to a
particular database. For example, the data source named Sales Figures might be a SQL Server
database, whereas the Accounts Payable data source could refer to an Access database. The
physical database referred to by a data source can reside anywhere on the LAN. The ODBC
system files are not installed on your system by Windows 95. Rather, they are installed when you
setup a separate database application, such as SQL Server Client or Visual Basic 4.0.
The advantages of this scheme are so numerous that you are probably thinking there must
be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking directly to
the native database interface. ODBC has had many detractors make the charge that it is too slow.
Microsoft has always claimed that the critical factor in performance is the quality of the driver
software that is used. And anyway, the criticism about performance is somewhat analogous to
those who said that compilers would never match the speed of pure assembly language. Maybe
not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which
means you finish sooner. Meanwhile, computers get faster every year.
5.3.2 JDBC
In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access
mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface
is achieved through the use of “plug-in” database connectivity modules, or drivers. If a database
vendor wishes to have JDBC support, he or she must provide the driver for each platform that the
database and Java run on. To gain a wider acceptance of JDBC, Sun based JDBC’s framework
on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety
of platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much
faster than developing a completely new connectivity solution. JDBC was announced in March
of 1996. It was released for a 90 day public review that ended June 8, 1996. Because of user
input, the final JDBC v1.0 specification was released soon after.
5.3.2.1 JDBC Goals
Few software packages are designed without goals in mind, and JDBC is no exception: its many goals drove the development of the API. The goals that were set for JDBC are important; they give some insight as to why certain classes and functionalities behave the way they do.
The seven design goals for JDBC are as follows:
1. SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although
not the lowest database interface level possible, it is at a low enough level for higher-level
tools and APIs to be created. Conversely, it is at a high enough level for application
programmers to use it confidently. Attaining this goal allows for future tool vendors to
“generate” JDBC code and to hide many of JDBC’s complexities from the end user.
2. SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to
support a wide variety of vendors, JDBC will allow any query statement to be passed through
it to the underlying database driver. This allows the connectivity module to handle non-standard
functionality in a manner that is suitable for its users.
3. JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal
allows JDBC to use existing ODBC level drivers by the use of a software interface. This
interface would translate JDBC calls to ODBC and vice versa.
4. Provide a Java interface that is consistent with the rest of the Java system
Because of Java's acceptance in the user community thus far, the designers felt that they
should not stray from the current design of the core Java system.
5. Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception.
Sun felt that the design of JDBC should be very simple, allowing for only one method of
completing a task per mechanism. Allowing duplicate functionality only serves to confuse
the users of the API.
6. Use strong, static typing wherever possible
Strong typing allows for more error checking to be done at compile time; consequently,
fewer errors appear at runtime.
7. Keep the common cases simple
Because more often than not, the usual SQL calls used by the programmer are simple
SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform
with JDBC. However, more complex SQL statements should also be possible. In this project,
an MS Access database is used for dynamically updating the cache table.
Java consists of two things: a programming language and a platform.
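To make the goals above concrete, the following sketch shows the shape of a typical JDBC interaction. The table and column names are hypothetical, and a real run also needs a JDBC driver and database on the classpath, so main here only prints the statement that would be executed:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcSketch {
    // Hypothetical query against a cache table such as the one this project keeps.
    static final String QUERY = "SELECT word, freq FROM cache WHERE doc_id = ?";

    // Typical JDBC usage: prepare a statement, bind a parameter, iterate rows.
    // The Connection would come from DriverManager.getConnection(...) in a real
    // application; it is passed in here so the sketch stays self-contained.
    static void printWordCounts(Connection con, int docId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(QUERY)) {
            ps.setInt(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + " = " + rs.getInt("freq"));
                }
            }
        }
    }

    public static void main(String[] args) {
        // No database is available in this sketch; show the statement that would run.
        System.out.println(QUERY);
    }
}
```

Because the query text is passed through to the driver unchanged, the same code works against any vendor's database, which is exactly the SQL-conformance goal described above.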
(Figure: Java Program → Compiler → Byte Codes → Interpreter → My Program)
Java is also unusual in that each Java program is both compiled and interpreted. The compiler translates a Java program into an intermediate language called Java byte codes; this platform-independent code is then passed to the interpreter and run on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The figure illustrates how this works.
6.2.2.2 JFree Chart
JFreeChart is a free, 100% Java chart library that makes it easy for developers to display
professional-quality charts in their applications. JFreeChart's extensive feature set includes:
A consistent and well-documented API, supporting a wide range of chart types; A flexible design
that is easy to extend, and targets both server-side and client-side applications;
Support for many output types, including Swing components, image files (including PNG and
JPEG), and vector graphics file formats (including PDF, EPS and SVG). JFreeChart is "open
source" or, more specifically, free software. It is distributed under the terms of the GNU Lesser
General Public Licence (LGPL), which permits use in proprietary applications.
1. Map Visualizations
Charts showing values that relate to geographical areas. Some examples include:
(a) population density in each state of the United States, (b) income per capita for each
country in Europe, (c) life expectancy in each country of the world. The tasks in this
project include: Sourcing freely redistributable vector outlines for the countries of the
world, states/provinces in particular countries (USA in particular, but also other areas);
Creating an appropriate dataset interface (plus default implementation), a renderer, and
integrating this with the existing XYPlot class in JFreeChart.
2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts: display
a separate control that shows a small version of ALL the time series data, with a
sliding "view" rectangle that allows you to select the subset of the time series data to
display in the main chart.
3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible
dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies,
thermometers, bars, and lines/time series) that can be delivered easily via both Java Web
Start and an applet.
4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the
properties that can be set for charts. Extend (or reimplement) this mechanism to provide
greater end-user control over the appearance of the charts.
6. TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of test. Each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application. It is done after the completion of an individual unit before integration. This is
structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.
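As an illustration, a unit test for the project's stop-word removal step could be sketched in plain Java as follows. The removeStopWords helper and the stop-word list are hypothetical stand-ins for the actual component, and plain main-method assertions are used instead of a test framework:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveStopWordsTest {
    // Hypothetical stand-in for the project's stop-word removal unit:
    // keep every word that is not in the stop-word list.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (!stopWords.contains(w.toLowerCase())) kept.add(w);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "is", "of"));
        // Valid input: stop words are dropped, content words kept in order.
        List<String> out = removeStopWords(
                Arrays.asList("The", "analysis", "of", "documents"), stop);
        if (!out.equals(Arrays.asList("analysis", "documents")))
            throw new AssertionError("stop words not removed: " + out);
        // Boundary input: an empty document yields an empty result.
        if (!removeStopWords(Collections.emptyList(), stop).isEmpty())
            throw new AssertionError("empty input should stay empty");
        System.out.println("all unit tests passed");
    }
}
```

Each assertion corresponds to one documented input/expected-result pair, as the definition above requires.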
Integration testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or
special test cases. In addition, systematic coverage pertaining to identified business process
flows, data fields, predefined processes, and successive processes must be considered for
testing. Before functional
testing is complete, additional tests are identified and the effective value of current tests is
determined.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions
and flows, emphasizing pre-driven process links and integration points.
White Box Testing
White Box Testing is a testing in which the software tester has knowledge of the
inner workings, structure and language of the software, or at least its purpose. It is
used to test areas that cannot be reached from a black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings,
structure or language of the module being tested. Black box tests, as most other kinds of tests,
must be written from a definitive source document, such as a specification or requirements
document. It is a testing in which the software under test is treated as a black box: you cannot
"see" into it. The test provides inputs and
responds to outputs without considering how the software works.
6.1 Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
6.1.1 Test strategy and approach
Field testing will be performed manually and functional tests will be written in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
6.1.2 Test Cases
S No. of test case: 1
Name of test: User browses for a file (Success)
Sample Input: User selects a file to be clustered.
Expected output: Displays the message "File updated successfully".
Actual output: Same as expected.
Remarks: This case confirms that the file is uploaded successfully.
Table 6.1: Unit Test Case 1

S No. of test case: 2
Name of test: User browses for a file (Fails)
Sample Input: User selects a file to be clustered of a different file format.
Expected output: Displays the message "File not updated".
Actual output: Same as expected.
Remarks: This case confirms that the file is not uploaded.
Table 6.2: Unit Test Case 2

S No. of test case: 3
Name of test: User clicks on the Remove button
Sample Input: After uploading a file, the user clicks on the Remove button.
Expected output: Displays the message "Stop words removed successfully".
Actual output: Same as expected.
Remarks: This case confirms that the stop words were removed, so the text can be forwarded for stemming.
Table 6.3: Unit Test Case 3

S No. of test case: 4
Name of test: User clicks on the Stemming button without performing the remove action
Sample Input: The user skips the Remove button and goes directly to the Stemming button.
Expected output: Displays the message "Please perform the remove action and then click on stemming".
Actual output: Same as expected.
Remarks: This case confirms that an error message appears if the Remove button is not used before the Stemming button.
Table 6.4: Unit Test Case 4

S No. of test case: 5
Name of test: User clicks on the Stemming button
Sample Input: After clicking the Remove button, the user performs the stemming action.
Expected output: Displays the message "Stemming is successful" and displays the distinct words.
Actual output: Same as expected.
Remarks: This case confirms that stemming was performed on the text already filtered of stop words.
Table 6.5: Unit Test Case 5

S No. of test case: 6
Name of test: User clicks on the Calculation button without performing the stemming action
Sample Input: The user skips the Stemming button and goes directly to the Calculation button.
Expected output: Displays the message "Please perform the stemming action and then click on calculation".
Actual output: Same as expected.
Remarks: This case confirms that an error message appears if the Stemming button is not used before the Calculation button.
Table 6.6: Unit Test Case 6

S No. of test case: 7
Name of test: User clicks on the Calculation button
Sample Input: After clicking the Stemming button, the user performs the calculation action to find the clusters.
Expected output: Displays the message "Clustered the input data set successfully".
Actual output: Same as expected.
Remarks: This case confirms that stemming was performed and the data is ready for the clustering algorithms.
Table 6.7: Unit Test Case 7
6.2 Integration Testing
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects. The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
6.3 Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
7. CONCLUSION
We presented an approach that applies document clustering methods to the analysis of
documents seized in computer inspections. Also, we reported and discussed several practical results that can
be very useful for researchers and practitioners of document computing. More specifically, in our
experiments the hierarchical algorithms known as Average Link and Complete Link presented
the best results. Despite their usually high computational costs, we have shown that they are
particularly suitable for the studied application domain because the dendrograms that they
provide offer summarized views of the documents being inspected, thus being helpful tools for
document examiners that analyze textual documents from seized computers. As already observed
in other application domains, dendrograms provide very informative descriptions and
visualization capabilities of data clustering structures.
The partitional K-means and K-medoids algorithms also achieved good results when
properly initialized. Considering the approaches for estimating the number of clusters, the
relative validity criterion known as the silhouette has shown to be useful, even in its simplified version. In addition, some
of our results suggest that using the file names along with the document content information may
be useful for cluster ensemble algorithms. Most importantly, we observed that clustering
algorithms indeed tend to induce clusters formed by either relevant or irrelevant documents, thus
contributing to enhance the expert examiner's job. Furthermore, our evaluation of the proposed
approach in five real-world applications shows that it has the potential to speed up the computer
inspection process. Aimed at further leveraging the use of data clustering algorithms in similar
applications, a promising venue for future work involves investigating automatic approaches for
cluster labeling. The assignment of labels to clusters may enable the expert examiner to identify
the semantic content of each cluster more quickly—eventually even before examining their
contents. Finally, the study of algorithms that induce overlapping partitions (e.g., Fuzzy C-
Means and Expectation-Maximization for Gaussian Mixture Models) is worthy of investigation.
REFERENCES
1. Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka, "Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection".
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.
3. Bruegge and Dutoit, Object-Oriented Software Engineering.
4. Herbert Schildt, Java: The Complete Reference.
APPENDIX
A-Input/Output Screens:
a. Selecting Dataset:
Figure 9.1: Selecting Dataset
Description:
The above page shows dataset selection. In this step, the data files are selected randomly from the computer to form the input dataset.
b. Removing Stop Words:
Figure 9.2: Removing Stop Words
Description:
The above page shows the removal of stop words. In this step, unnecessary words are removed from the dataset.
d. Stemming:
Figure 9.3: Stemming Page
Description:
The above page shows the stemming step. The required data is selected and, during stemming, the words are reduced to their root forms.
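The project uses a full Porter stemmer (the ptstemmer library); as a much-simplified illustration of the idea, the sketch below strips a few common suffixes. The suffix list and the minimum-stem-length rule are simplifications for illustration only:

```java
public class StemSketch {
    // Simplified suffix stripping: remove the first matching suffix, provided
    // at least three characters of stem remain. A real Porter stemmer applies
    // many more rules, in several ordered phases.
    static String stem(String word) {
        String w = word.toLowerCase();
        String[] suffixes = {"ing", "edly", "ed", "ly", "es", "s"};
        for (String suf : suffixes) {
            if (w.endsWith(suf) && w.length() - suf.length() >= 3)
                return w.substring(0, w.length() - suf.length());
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("Clustering")); // -> cluster
        System.out.println(stem("removed"));    // -> remov
        System.out.println(stem("documents"));  // -> document
    }
}
```

Note that a stem such as "remov" need not be a dictionary word; it only has to be the same for all inflections of the word, so that they count as one term.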
e. Clustering Process:
Figure 9.4: Clustering Process
Description:
The above page shows the clustering process, in which the clusters are formed.
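As a sketch of the kind of step performed here, a minimal k-means run is shown below on scalar features. The data values and initial centroids are illustrative; the project itself clusters term-frequency document vectors:

```java
import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        // Six "documents" reduced to one feature each, forming two groups.
        double[] docs = {0.1, 0.2, 0.15, 0.9, 1.0, 0.95};
        double[] centroids = {0.0, 1.0};   // simple initialisation, k = 2
        int[] assign = new int[docs.length];
        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each document goes to its nearest centroid.
            for (int i = 0; i < docs.length; i++)
                assign[i] = Math.abs(docs[i] - centroids[0])
                         <= Math.abs(docs[i] - centroids[1]) ? 0 : 1;
            // Update step: each centroid becomes the mean of its cluster.
            for (int k = 0; k < 2; k++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < docs.length; i++)
                    if (assign[i] == k) { sum += docs[i]; n++; }
                if (n > 0) centroids[k] = sum / n;
            }
        }
        System.out.println(Arrays.toString(assign));
    }
}
```

As noted in the conclusion, the quality of the result depends strongly on how the centroids are initialized.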
f. K-means:
g. Computing Term Frequency:
Figure 9.5: K-means Page
Description:
This page describes word repetition. It calculates how many times each word is repeated.
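The counting step can be sketched as follows; the sample document text is hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

public class TermFrequency {
    public static void main(String[] args) {
        // Count how many times each word occurs in a document.
        String doc = "data mining finds patterns data mining finds useful patterns";
        Map<String, Integer> tf = new TreeMap<>();   // sorted by word
        for (String word : doc.split("\\s+"))
            tf.merge(word, 1, Integer::sum);
        System.out.println(tf);
    }
}
```

These raw counts are the term frequencies from which the document vectors are built.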
h. Cluster Preprocessing:
Figure 9.6: Cluster Preprocessing
Description:
The above page shows cluster preprocessing; the Snowball stemming technique is used.
i. Distance Calculation:
Figure 9.7: Distance Calculation
Description:
The above page shows the distance calculation. The Euclidean distance method is used.
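The Euclidean distance between two term-frequency vectors can be computed as below; the two sample vectors are illustrative:

```java
public class EuclideanDistance {
    // Euclidean distance: square root of the sum of squared differences.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] d1 = {1, 2, 0};   // hypothetical tf vectors
        double[] d2 = {1, 0, 2};
        System.out.printf("%.4f\n", distance(d1, d2));
    }
}
```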
j. Incremental or Hierarchical Clustering:
Figure 9.8: Incremental or Hierarchical Clustering
Description:
The above page shows Incremental or Hierarchical Clustering. Here, the similarity of the data points is found initially.
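A much-simplified, leader-style sketch of incremental clustering is shown below: each point joins the first existing cluster whose representative is within a threshold, otherwise it starts a new cluster. The data values and threshold are illustrative, and the project's actual step works on document similarities rather than scalars:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalClustering {
    public static void main(String[] args) {
        double[] points = {0.10, 0.12, 0.80, 0.15, 0.85};
        double threshold = 0.2;
        List<Double> leaders = new ArrayList<>();  // one representative per cluster
        int[] label = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            int found = -1;
            // Join the first cluster whose leader is close enough.
            for (int k = 0; k < leaders.size(); k++)
                if (Math.abs(points[i] - leaders.get(k)) <= threshold) { found = k; break; }
            // Otherwise the point becomes the leader of a new cluster.
            if (found < 0) { leaders.add(points[i]); found = leaders.size() - 1; }
            label[i] = found;
        }
        System.out.println(Arrays.toString(label));
    }
}
```

Unlike k-means, this scheme processes the documents one at a time and does not fix the number of clusters in advance.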
k. Similarity Measurement
Figure 9.9: Similarity Calculation
Description:
The above page shows the similarity calculation. At this level, the similarity is calculated using cosine similarity, and the maximum dissimilar value is also provided.
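Cosine similarity measures the angle between two vectors rather than their distance; the sample vectors below are illustrative:

```java
public class CosineSimilarity {
    // Cosine similarity: dot product divided by the product of the norms.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] d1 = {1, 2, 0};
        double[] d2 = {2, 4, 0};   // same direction -> similarity 1
        double[] d3 = {0, 0, 3};   // orthogonal     -> similarity 0
        System.out.printf("%.2f %.2f\n", cosine(d1, d2), cosine(d1, d3));
    }
}
```

Because it ignores vector length, cosine similarity is insensitive to document length, which is why it is preferred over raw distance for text.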
l. Purity Checking
Figure 9.10: Purity Checking
Description:
The above page shows purity checking. At this level, the purity levels of K-means and Incremental Clustering are obtained.
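Purity sums, over all clusters, the count of the majority true class in each cluster, divided by the total number of documents. A sketch with hypothetical labels:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class PurityCheck {
    // purity = (1/N) * sum over clusters of the majority-class count.
    static double purity(int[] cluster, int[] trueClass) {
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (int i = 0; i < cluster.length; i++)
            counts.computeIfAbsent(cluster[i], k -> new HashMap<>())
                  .merge(trueClass[i], 1, Integer::sum);
        int majority = 0;
        for (Map<Integer, Integer> byClass : counts.values())
            majority += Collections.max(byClass.values());
        return (double) majority / cluster.length;
    }

    public static void main(String[] args) {
        int[] cluster   = {0, 0, 0, 1, 1, 1};
        int[] trueClass = {0, 0, 1, 1, 1, 1};  // one document mis-clustered
        System.out.printf("%.3f\n", purity(cluster, trueClass));
    }
}
```

A purity of 1.0 means every cluster contains documents of a single class.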
m. Clustering Accuracy
Figure 9.11: Clustering Accuracy
Description:
The above page shows clustering accuracy. The accuracy of the clustering techniques is represented in the form of a graph.
B-Source Code
Preprocess.java :
package ncluster;
import com.mysql.jdbc.Connection;
import java.io.*;
import java.sql.*;
import java.util.*;
import javax.swing.JFileChooser;
import ptstemmer.implementations.PorterStemmer;
public class preprocess extends javax.swing.JFrame {
String cont="", line="", path="", filename="", word="", str="", count="", nooffile="";
public static int numofdoc,count1,coun,i, noofterm;
File folder, files[];
PorterStemmer stemmer = new PorterStemmer();
float[] tf=new float[1500];
double[] idf=new double[1500];
double[] result=new double[1500];
int i1=0,j1=0,k1=0;
public preprocess() {
initComponents();
}
@SuppressWarnings("unchecked")
// <editor-fold defaultstate="collapsed" desc="Generated Code">
private void initComponents() {
selfiles = new javax.swing.JLabel();
select = new javax.swing.JButton();
jScrollPane1 = new javax.swing.JScrollPane();
text = new javax.swing.JTextArea();
textbox1 = new javax.swing.JTextField();
removestopword = new javax.swing.JButton();
stemming = new javax.swing.JButton();
title = new javax.swing.JLabel();
pathoffile = new javax.swing.JLabel();
calc = new javax.swing.JButton();
jPanel1 = new javax.swing.JPanel();
DocClust = new javax.swing.JLabel();
jLabel1 = new javax.swing.JLabel();
jLabel2 = new javax.swing.JLabel();
setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);
setTitle("Selecting_Documents");
setMinimumSize(new java.awt.Dimension(599, 601));
getContentPane().setLayout(null);
selfiles.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N
selfiles.setForeground(new java.awt.Color(51, 51, 51));
selfiles.setText("Select Files ");
getContentPane().add(selfiles);
selfiles.setBounds(10, 110, 100, 30);
select.setBackground(java.awt.SystemColor.inactiveCaption);
select.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N
select.setForeground(new java.awt.Color(0, 0, 102));
select.setText("SELECT");
select.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
selectActionPerformed(evt);
}
});
getContentPane().add(select);
select.setBounds(120, 110, 100, 30);
text.setBackground(java.awt.SystemColor.inactiveCaption);
text.setColumns(20);
text.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N
text.setForeground(new java.awt.Color(51, 51, 51));
text.setRows(5);
jScrollPane1.setViewportView(text);
getContentPane().add(jScrollPane1);
jScrollPane1.setBounds(70, 240, 440, 320);
textbox1.setFont(new java.awt.Font("Tahoma", 0, 12)); // NOI18N
textbox1.setForeground(new java.awt.Color(0, 0, 102));
getContentPane().add(textbox1);
textbox1.setBounds(170, 170, 360, 30);
removestopword.setBackground(java.awt.SystemColor.inactiveCaption);
removestopword.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N
removestopword.setForeground(new java.awt.Color(0, 0, 102));
removestopword.setText("REMOVE");
removestopword.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
removestopwordActionPerformed(evt);
}
});
getContentPane().add(removestopword);
removestopword.setBounds(240, 110, 100, 30);
stemming.setBackground(java.awt.SystemColor.inactiveCaption);
stemming.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N
stemming.setForeground(new java.awt.Color(0, 0, 102));
stemming.setText("STEMMING");
stemming.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
stemmingActionPerformed(evt);
}
});
getContentPane().add(stemming);
stemming.setBounds(350, 110, 100, 30);
title.setBackground(new java.awt.Color(255, 0, 0));
title.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N
getContentPane().add(title);
title.setBounds(80, 210, 368, 21);
pathoffile.setFont(new java.awt.Font("Times New Roman", 1, 15)); // NOI18N
pathoffile.setText("Path of the File");
getContentPane().add(pathoffile);
pathoffile.setBounds(40, 170, 110, 18);
calc.setBackground(java.awt.SystemColor.inactiveCaption);
calc.setFont(new java.awt.Font("Times New Roman", 1, 11)); // NOI18N
calc.setForeground(new java.awt.Color(0, 0, 102));
calc.setText("CALCULATION");
calc.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
calcActionPerformed(evt);
}
});
getContentPane().add(calc);
calc.setBounds(460, 110, 130, 30);
jPanel1.setBackground(new java.awt.Color(204, 204, 204));
jPanel1.setLayout(null);
DocClust.setFont(new java.awt.Font("Times New Roman", 1, 20)); // NOI18N
DocClust.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/cooltext1297916834.png"))); // NOI18N
jPanel1.add(DocClust);
DocClust.setBounds(30, 60, 563, 40);
jLabel1.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/cooltext1297931724.png"))); // NOI18N
jPanel1.add(jLabel1);
jLabel1.setBounds(30, 10, 570, 40);
jLabel2.setIcon(new javax.swing.ImageIcon(getClass().getResource("/image/deep-blue-sky-background.jpg"))); // NOI18N
jPanel1.add(jLabel2);
jLabel2.setBounds(-50, -20, 680, 660);
getContentPane().add(jPanel1);
jPanel1.setBounds(-10, 0, 610, 630);
java.awt.Dimension screenSize = java.awt.Toolkit.getDefaultToolkit().getScreenSize();
setBounds((screenSize.width-607)/2, (screenSize.height-659)/2, 607, 659);
}// </editor-fold>
private void selectActionPerformed(java.awt.event.ActionEvent evt) {
try{
JFileChooser chooser=new JFileChooser();
int returnVal = chooser.showOpenDialog(this);
if(returnVal == JFileChooser.APPROVE_OPTION) {
folder = chooser.getCurrentDirectory();
path = folder.getPath();
textbox1.setText(path);
files = folder.listFiles();
}
title.setText("Content of the File");
if(files.length>1){
for(i = 0;i<files.length; i++){
if (files[i].isFile())
{
int index = files[i].getName().lastIndexOf('.');
if (index>0&& index <= files[i].getName().length() - 2 ) {
filename = files[i].getName().substring(0, index);
String fname = filename.toUpperCase();
text.append("\n"+fname+"\n\n");
}
}
FileReader fr = new FileReader(files[i]);
BufferedReader br = new BufferedReader(fr);
while((line = br.readLine())!=null){
text.append(line+" ");
}
text.append("\n");
}
}
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private void removestopwordActionPerformed(java.awt.event.ActionEvent evt) {
// The start of this method was omitted from the printed listing; the
// declarations below are inferred from how the variables are used.
try {
for (File file : files) {
String newfname = file.getPath() + "_clean.txt"; // output name (assumed; not shown in listing)
File f = new File(newfname);
FileReader in = new FileReader(file);
char[] w = new char[501]; // buffer for the current word
int ch = in.read();
while (ch >= 0) {
if (Character.isLetter((char) ch)) {
int j = 0;
Stemmer s = new Stemmer(); // Porter's reference stemmer (add/stem/toString API used below)
while (true) {
ch = Character.toLowerCase((char) ch);
w[j] = (char) ch;
if (j < 500) j++;
ch = in.read();
if (!Character.isLetter((char) ch)) {
for (int c = 0; c < j; c++) s.add(w[c]);
s.stem();
String u = s.toString();
f.createNewFile();
FileWriter writer = new FileWriter(newfname, true);
writer.write(u + " ");
writer.close();
text.append(u + "\n");
break;
}
}
}
if (ch < 0) break;
ch = in.read();
}
in.close();
text.append("\n");
}
}
catch (Exception ex) {
System.out.println(ex.getMessage());
}
}
private void calcActionPerformed(java.awt.event.ActionEvent evt) {
frame1 form =new frame1();
form.setVisible(true);
}
public static void main(String args[]) {
java.awt.EventQueue.invokeLater(new Runnable() {
public void run() {
new preprocess().setVisible(true);
}
});
}
// Variables declaration - do not modify
private javax.swing.JLabel DocClust;
private javax.swing.JButton calc;
private javax.swing.JLabel jLabel1;
private javax.swing.JLabel jLabel2;
private javax.swing.JPanel jPanel1;
private javax.swing.JScrollPane jScrollPane1;
private javax.swing.JLabel pathoffile;
private javax.swing.JButton removestopword;
private javax.swing.JButton select;
private javax.swing.JLabel selfiles;
private javax.swing.JButton stemming;
private javax.swing.JTextArea text;
private javax.swing.JTextField textbox1;
private javax.swing.JLabel title;
// End of variables declaration
}
Graph.java :
package ncluster;
import java.awt.*;
import org.jfree.chart.*;
import org.jfree.chart.axis.*;
import org.jfree.chart.plot.*;
import org.jfree.chart.renderer.category.BarRenderer;
import org.jfree.data.category.DefaultCategoryDataset;
public class Graph {
public static double kmeans = Purity.res1;
public static double hsk = Purity.res;
public static void main(String arg[]) {
DefaultCategoryDataset dataset = new DefaultCategoryDataset();
dataset.setValue(kmeans, "Accuracy", "K-MEANS");
dataset.setValue(hsk, "Accuracy", "Incremental Mining");
JFreeChart chart = ChartFactory.createBarChart("", "Text Mining", "Accuracy", dataset, PlotOrientation.VERTICAL, false, true, false);
chart.setBackgroundPaint(Color.white);
final CategoryPlot plot = chart.getCategoryPlot();
plot.setBackgroundPaint(Color.lightGray);
plot.setDomainGridlinePaint(Color.white);
plot.setRangeGridlinePaint(Color.white);
final NumberAxis rangeAxis = (NumberAxis) plot.getRangeAxis();
rangeAxis.setStandardTickUnits(NumberAxis.createIntegerTickUnits());
final BarRenderer renderer = (BarRenderer) plot.getRenderer();
renderer.setDrawBarOutline(false);
final GradientPaint gp0 = new GradientPaint(
0.0f, 0.0f, Color.blue,
0.0f, 0.0f, Color.lightGray);
final GradientPaint gp1 = new GradientPaint(
0.0f, 0.0f, Color.green,
0.0f, 0.0f, Color.lightGray);
final GradientPaint gp2 = new GradientPaint(
0.0f, 0.0f, Color.red,
0.0f, 0.0f, Color.lightGray);
renderer.setSeriesPaint(0, gp0);
renderer.setSeriesPaint(1, gp1);
renderer.setSeriesPaint(2, gp2);
final CategoryAxis domainAxis = plot.getDomainAxis();
domainAxis.setCategoryLabelPositions(
CategoryLabelPositions.createUpRotationLabelPositions(Math.PI / 6.0));
ChartFrame frame1 = new ChartFrame("Clustering Accuracy", chart);
frame1.setVisible(true);
frame1.setSize(500, 500);
}
}