a two stage feature selection method for text categorization

Upload: parag-tamhane

Post on 08-Jun-2015

Page 1: A two stage feature selection method for text categorization

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm

Presented by

Project scope

The application implements a two-stage feature selection method for text categorization that uses information gain, principal component analysis and a genetic algorithm. Because of the increasing number of documents in digital form, automated text categorization has become more promising in recent decades. Two-stage feature selection and feature extraction are used to reduce the high dimensionality of a feature space composed of a large number of terms and to remove redundant and irrelevant features from it, thereby improving the performance of text categorization.
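As an illustration of the first technique named above, the following is a minimal sketch of how information gain could be computed for a single term. The slides do not give a formula, so the sketch assumes the common definition for a binary "term present / term absent" feature: the entropy of the class distribution minus the weighted entropy after splitting the documents on the term. The class and method names (InformationGainSketch, entropy, informationGain) and the document counts are hypothetical and are not taken from the project.

```java
class InformationGainSketch {

    // Shannon entropy (in bits) of a distribution given as raw counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // withTerm[i]    = number of documents of class i that contain the term
    // withoutTerm[i] = number of documents of class i that do not contain it
    // (assumed inputs; both arrays must have one entry per class)
    static double informationGain(int[] withTerm, int[] withoutTerm) {
        int[] all = new int[withTerm.length];
        int nWith = 0;
        int nWithout = 0;
        for (int i = 0; i < withTerm.length; i++) {
            all[i] = withTerm[i] + withoutTerm[i];
            nWith += withTerm[i];
            nWithout += withoutTerm[i];
        }
        int total = nWith + nWithout;
        if (total == 0) {
            return 0.0;
        }
        // Weighted class entropy after splitting the documents on the term.
        double conditional = ((double) nWith / total) * entropy(withTerm)
                + ((double) nWithout / total) * entropy(withoutTerm);
        return entropy(all) - conditional;
    }

    public static void main(String[] args) {
        // Hypothetical counts for two classes.
        // A term that appears only in the first class separates the classes well.
        System.out.println(informationGain(new int[] {4, 0}, new int[] {0, 4}));
        // A term spread evenly over both classes carries no class information.
        System.out.println(informationGain(new int[] {2, 2}, new int[] {2, 2}));
    }
}
```

In a first selection stage, terms could be ranked by this score and only the highest-ranked ones kept before the second stage is applied. How many terms to keep is a tuning choice that the slides do not specify.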

User classes & characteristics

There are two user modules, namely the decision tree and the KNN classifier.

Decision tree: The first phase is tree growing, where a tree is built by greedily splitting each tree node. Because the tree can overfit the training data, the overfitted branches of the tree are removed in the second phase.

KNN classifier: The KNN classifier ranks the document's neighbors among the training documents and uses the class labels of the k most similar neighbors to predict the class of the document. The similarity between two documents may be measured by the Euclidean distance, the cosine measure, etc.
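The KNN description above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes documents are already represented as term-weight vectors of equal length, uses the cosine measure as the similarity, and decides by simple majority vote among the k most similar training documents. The names KnnSketch, cosine and classify, and the sample vectors and labels, are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class KnnSketch {

    // Cosine similarity between two term-weight vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the majority label among the k training documents most similar to the query.
    static String classify(double[][] train, String[] labels, double[] query, int k) {
        int n = train.length;
        double[] sim = new double[n];
        for (int i = 0; i < n; i++) {
            sim[i] = cosine(train[i], query);
        }
        boolean[] used = new boolean[n];
        Map<String, Integer> votes = new HashMap<String, Integer>();
        String best = null;
        int bestVotes = 0;
        for (int r = 0; r < Math.min(k, n); r++) {
            // Pick the most similar training document that has not been used yet.
            int top = -1;
            for (int i = 0; i < n; i++) {
                if (!used[i] && (top < 0 || sim[i] > sim[top])) {
                    top = i;
                }
            }
            used[top] = true;
            Integer old = votes.get(labels[top]);
            int v = (old == null) ? 1 : old + 1;
            votes.put(labels[top], v);
            if (v > bestVotes) {
                bestVotes = v;
                best = labels[top];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical three-term vocabulary and four training documents.
        double[][] train = { {3, 0, 1}, {2, 0, 0}, {0, 4, 1}, {0, 3, 0} };
        String[] labels = { "sports", "sports", "politics", "politics" };
        System.out.println(classify(train, labels, new double[] {1, 0, 0}, 3));
    }
}
```

The sketch avoids lambdas and the diamond operator so that it stays within the Java 6 language level listed under the software interfaces. Replacing cosine with a Euclidean distance would only require selecting the smallest value instead of the largest.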

Operating environment

This application is developed on the Java platform and will be hosted on a system using the Java JDK and a Tomcat server. The system will primarily be developed and tested on Windows operating systems, but our goal is to make it a platform-independent solution. The target platforms are:

Linux, Microsoft Windows and Solaris.

Design and Implementation Constraints

All design and coding will be done on the Java platform. However, the application could also be implemented in C#.NET.

Assumptions and Dependencies

Since the application is based on the Java platform, we assume that the user's system has a JVM installed to run this application.

SYSTEM FEATURE

Functional requirements

Hard disk: 80 GB
RAM: 1 GB
Processor: Intel Pentium IV
Technology: Java
Tools: NetBeans
Operating system: Windows

EXTERNAL INTERFACE REQUIREMENTS

User Interfaces: The application is accessible through a web browser and interacts with its users through a web component interface. There are two types of user for this system, the retail manager or analyst and the customer; each can interact with the system through the following UIs.

Main screen: This interface shows options according to the user type.

For analysts, the options relate to the type of analysis they want to perform:

Method-wise analysis
Decision tree analysis
KNN classifier analysis

For each of the above analyses there is a separate screen showing advanced options for that analysis, such as the following:

There are buttons for choosing the format in which the output should be displayed: graphical formats such as pie charts and bar graphs, or tabular format.

Output screen: On this screen the output is produced in graphical format with a proper description and with options to save the result for further use, compare it with old results, or discard it if it is of no use.

Software Interfaces

Name: Java
Version Number: Version 6.0

Name: MySQL
Version Number: Version 7.0.1
The system must use MySQL server as its database.

Name: NetBeans
Version Number: Version 6 onward

Communications Interfaces

The system will communicate through the internet/network via an Apache Tomcat server.

NON-FUNCTIONAL REQUIREMENTS

Performance Requirements

• The system can produce results faster with 4 GB of RAM.

• It may take more time under peak loads at the main node.

• The system will be available 100% of the time.

When a fatal error occurs, the system will provide understandable feedback to the user.

Safety and Security Requirements

• All data will be backed up automatically every day, and the system administrator can also back up the data manually.

• The system is designed in modules so that errors can be detected and fixed easily. This also makes it easier to install and update new functionality if required.

Software Quality Attributes

Usability: The application is user friendly since the GUI is interactive.

Maintainability: The application can be maintained over a long period of time since it is implemented on the Java platform.

Reusability: The application can be reused by extending it with new modules.

Performance: The application performs faster with 4 GB of RAM. However, the basic requirement to run the application is 1 GB.

Security: Since the application is developed in Java, it is expected to be more secure than in other environments.

Data flow diagram

UML Activity diagram

UML State transition diagram

UML Sequence diagram

TECHNICAL SPECIFICATION

ADVANTAGES

The application is platform independent since it is developed in Java.

The application is user friendly since the GUI is compatible with all operating environments.

DISADVANTAGE

Since the application performs several tasks at the same time, it may take a long time to generate output.

Applications

Spam filtering, a process which tries to distinguish spam emails from legitimate emails.

Email routing, forwarding an email sent to a general address to a specific address or mailbox depending on its topic.

Language identification, automatically determining the language of a text.

Genre classification, automatically determining the genre of a text.

Readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system.