Text Mining

7/30/2019

Presented By:

Iqra Javed

BSE 2005-2009


Presentation Rundown

Introduction.

Proposed Approach.

Techniques and methodology.

Application Architecture.

System Design.

Conclusion and Results.


AUTOMATIC AUTHORSHIP ATTRIBUTION

Automatic authorship attribution is the task of deciding which author, from a given list of candidate authors, wrote a given unattributed document.

A set of documents with known authorship serves as the training set.

Attributed:

Unattributed:

Stylometry


Proposed Approach

The research proposed in this thesis uses a machine learning model (based on a nearest neighbor classifier) with entropy weighting to identify the class of a test document.

The model is implemented using different architectures to gain a better understanding of the system as well as more reliable results.


Data Mining:

Data mining refers to the extraction of desired knowledge from a large amount of data.

Text Mining:

The process of deriving high-quality information from text.

High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning.


[Figure: text mining process. Labels: Sample Documents; Text document; Transformed Representation; Learning; Domain-specific templates/models; Visualizations]


Text Mining Methods:

Information Retrieval (IR)

Information Extraction (IE)

Natural Language Processing (NLP)


Nearest Neighbor Classification (NN).

The k-NN method was first introduced in the early 1950s. Because the method is computationally demanding, it remained unpopular until the 1960s, when greater computing power became available.

Nearest neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with the training tuples that are similar to it.
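
The idea of learning by analogy can be illustrated with a minimal Python sketch. It assumes documents have already been turned into numeric vectors, uses Euclidean distance, and takes a majority vote among the k closest training tuples. The function names, the toy vectors, and the labels `author_a` / `author_b` are illustrative assumptions, not part of the presented system.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k=3):
    """Label a query vector by majority vote among its k nearest training tuples.

    train is a list of (vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical feature vectors).
train = [
    ((1.0, 1.0), "author_a"),
    ((1.2, 0.8), "author_a"),
    ((5.0, 5.0), "author_b"),
    ((5.5, 4.5), "author_b"),
]

print(knn_classify(train, (1.1, 0.9), k=3))
```

No model is built in advance: all the work happens at query time, which is why the method is costly on large training sets.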


Terminology Used.

Entropy.

Entropy indicates how large the uncertainty in the information content of a clustering result is with respect to the given classification.

Cosine Similarity.

Cosine similarity is a measure of similarity between two vectors of n dimensions, found by computing the cosine of the angle between them.
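
Both measures can be sketched in a few lines of Python. The entropy function below is one common reading of the definition above: the Shannon entropy (base 2) of the true class labels found inside a single cluster. That interpretation, and the function names, are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster.

    0 means the cluster is pure; higher values mean more mixed classes.
    """
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


print(entropy(["a", "b"]))
print(cosine_similarity([1, 0], [0, 1]))
```

Cosine similarity depends only on the direction of the vectors, so a long and a short document with the same word proportions score as identical.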


Object Oriented Architecture.

Object-oriented architecture focuses on the relationships between classes that are combined into one large binary executable.

In traditional object-oriented development, once the classes are compiled, the result is monolithic binary code. All the classes share the same physical deployment unit (typically an EXE), process, address space, security privileges, and so on.

If multiple developers work on the same code base, they must share the source files.

A change to one class requires redeployment of all the other classes, which makes the application a burden to manage.


Component Based Architecture.

A component-oriented application comprises a collection of interacting binary application modules, that is, its components and the calls that bind them.

The motivation for breaking down a monolithic application into multiple binary components is analogous to that for placing the code for different classes into different files.

A component-oriented application is easier to extend.


TECHNIQUES AND METHODOLOGY

The Automatic Authorship Attribution System accepts plain-text documents. The system works in two phases, training and testing. Both phases work in a similar way.


Application Architecture of System


Text Pre-Processing.

Text pre-processing is the process of transforming the unstructured document into a structured format.

Tokenization.

Stop Word Filtering.

Lemmatization.

Stemming.

Bag of Words.

Domain Dictionary Creation.



Tokenization:

The process of splitting text into its constituent tokens is called tokenization.

Stop Word Filtering:

Filtering is used to remove stop words from the dictionary as well as from the documents.

Examples: the, are, but, for, in, etc.
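
These two steps can be sketched in Python as follows. The regular expression, the function names, and the contents of `STOP_WORDS` are illustrative assumptions; the actual stop word list of the system is not given in the slides.

```python
import re

# Illustrative stop word list; a real system would use a much longer one.
STOP_WORDS = {"the", "are", "but", "for", "in", "a", "on"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("There are two red apples on a red table.")
filtered = remove_stop_words(tokens)
print(filtered)
```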



Lemmatization.

Lemmatization methods try to map verb forms to the infinitive and nouns to the singular form.

Stemming.

Word stemming is an important feature supported by present-day indexing and search systems.

Stemming broadens results to include both word roots and word derivations.


Porter's Stemming Algorithm.

The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980.

Porter stemming is a process for removing the commoner morphological and inflexional endings from words in English.

It is based on 5 steps, with a stem length of 2 (in the proposed system).


Porter's Algorithm Steps.

Step #1: deals with plurals and past participles.

Ex. plastered -> plaster, motoring -> motor

Step #2: deals with pattern matching on some common suffixes.

Ex. happy -> happi, relational -> relate

Step #3: deals with special word endings.

Ex. triplicate -> triplic, hopeful -> hope

Step #4: checks the stripped word against more suffixes in case the word is compounded.

Ex. revival -> reviv, allowance -> allow, etc.

Step #5: checks if the stripped word ends in a vowel and fixes it appropriately.

Ex. probate -> probat, cease -> ceas, controll -> control
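
As a small illustration of how such suffix rules work, the sketch below implements only the plural-handling part of the first step (known as step 1a in Porter's published algorithm): SSES -> SS, IES -> I, SS -> SS, S -> (nothing). It is not the full stemmer, and the function name is an assumption.

```python
def porter_step_1a(word):
    """Apply the plural rules of Porter's step 1a to a lowercase word.

    The rules are tried in order and only the first match is applied.
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The later steps follow the same pattern but add conditions on the remaining stem before a suffix may be removed.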


BOW and Dictionary Creation

Bag-of-Words (BOW):

The BOW contains all the words in the document, irrespective of word order, since order makes no difference to the representation.

Example:

There are two red apples on a red table.

BOW = {there, two, red, apples, red, table}

Domain Dictionary Creation:

It is based on the collection of unique words from all documents.
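
A minimal Python sketch of both structures, assuming the documents have already been tokenized and filtered. The second document and the function names are made up for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokens)


def build_dictionary(documents):
    """Collect the unique words of all documents, sorted for a stable order."""
    return sorted({token for doc in documents for token in doc})


doc1 = ["there", "two", "red", "apples", "red", "table"]
doc2 = ["two", "green", "apples"]  # hypothetical second document

bow = bag_of_words(doc1)
dictionary = build_dictionary([doc1, doc2])

print(bow["red"])
print(dictionary)
```

Sorting the dictionary is a design choice made here so that every document vector built from it uses the same term positions.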


Vector Space Model (VSM).

It represents documents as vectors in n dimensions so that vector operations can be performed on them.

One of the main tasks of vector space representation is to find an appropriate encoding of the feature vector, where each element of the vector represents a word of the document collection.

Multi-term encoding is used in the system to capture word presence as well as word frequency for each document.
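
The two kinds of information mentioned above, presence and frequency, can be sketched as two encodings over a shared dictionary. The slides do not define the exact encoding used by the system, so the functions below are only one plausible reading, with illustrative names and data.

```python
def presence_vector(tokens, dictionary):
    """1 if the dictionary term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in dictionary]


def frequency_vector(tokens, dictionary):
    """Number of occurrences of each dictionary term in the document."""
    return [tokens.count(term) for term in dictionary]


dictionary = ["apples", "green", "red", "table", "there", "two"]
doc = ["there", "two", "red", "apples", "red", "table"]

print(presence_vector(doc, dictionary))
print(frequency_vector(doc, dictionary))
```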


Centroid-Based Classifier.

1. Input: new document d = (w1, w2, ..., wn) for the training of the system.

2. Predefined categories: C = {c1, c2, ..., cl}, based on training.

3. Compute the centroid vector on the basis of the vector magnitudes of the documents.

4. Similarity model: cosine function.

5. Compute similarity on the basis of the nearest matching class cosine value in order to declare the class of the testing document.

6. Output: assign to document d the category cmax.

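$$\mathrm{Simil}(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert_2 \, \lVert d_j \rVert_2} = \frac{\sum_{l} w_{il}\, w_{jl}}{\sqrt{\sum_{l} w_{il}^2}\, \sqrt{\sum_{l} w_{jl}^2}}$$

$$\mathrm{Simil}(c_i, d) = \cos(c_i, d)$$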
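
The steps of the centroid-based classifier can be sketched in Python. This sketch assumes that each class centroid is the arithmetic mean of that class's training vectors, which is the usual choice but is not spelled out on the slide. The function names and toy vectors are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors (assumed centroid definition)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_centroids(training):
    """training maps each category to its list of document vectors."""
    return {label: centroid(vecs) for label, vecs in training.items()}


def classify(d, centroids):
    """Return the category whose centroid has the highest cosine with d."""
    return max(centroids, key=lambda c: cosine(centroids[c], d))


training = {
    "author_a": [[2, 0, 1], [3, 0, 0]],
    "author_b": [[0, 2, 1], [0, 3, 0]],
}
centroids = train_centroids(training)
print(classify([1, 0, 0], centroids))
```

Compared with k-NN, only one comparison per category is needed at test time, instead of one per training document.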


Terms Used:

Euclidean Distance:

The Euclidean distance of each point from the term-space origin is measured in order to compute the vector magnitude, ignoring the zero terms.

$$|D_i| = \sqrt{x_i^2 + y_i^2}$$

Dot-Product:

Calculate the dot product for all the documents while ignoring the zero values, to lessen the computational burden. The dot product is calculated by

Q . Di = | Di | * | idf |
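
Ignoring zero terms is straightforward when each vector is stored sparsely. The sketch below keeps only nonzero weights in a dictionary keyed by term, and uses the standard sum-of-products definition of the dot product. Reading the slide's shorthand formula this way is an assumption, as are the function names and sample weights.

```python
import math


def magnitude(vec):
    """Euclidean length of a sparse vector stored as {term: nonzero weight}."""
    return math.sqrt(sum(w * w for w in vec.values()))


def dot_product(q, d):
    """Sum of products over the terms present in both sparse vectors.

    Terms missing from either vector contribute zero and are never visited.
    """
    if len(q) > len(d):
        q, d = d, q  # iterate over the shorter vector
    return sum(w * d[term] for term, w in q.items() if term in d)


query = {"red": 2, "table": 1}
document = {"red": 1, "apples": 3}

print(magnitude({"x": 3.0, "y": 4.0}))
print(dot_product(query, document))
```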


SYSTEM DESIGN

System Requirement:

Hardware Requirement:

A Pentium 4 computer with 1 GB of RAM is required.

Software Requirement:

Microsoft Windows 2000/XP or above, with Microsoft Office Access 2007 and .Net Framework 2008.


Workflow of Automatic Authorship System:


Results and Discussions


User Interface:


Application User Interface


Graphic Analysis of Application


Help And User Manual


Application Using Object Oriented Architecture.

This version creates the classes within the application itself and calls their functions from the user interface event handlers in order to perform the functionality of the authorship attribution system.

Accommodating changing requirements is not an easy task with this architectural approach.


Application Using Component Based Architecture.

This version creates two base classes, one for data preprocessing and the other for classification.

By creating objects of the base classes in a child class and calling their respective functions, the system performs its functionality.

Changing requirements can be handled easily with this architectural approach.
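
The structure described here can be sketched in Python, with ordinary classes standing in for separately deployable binary components. All class names are hypothetical, and the word-overlap scoring in `Classifier` is only a placeholder for the real classification logic.

```python
class Preprocessor:
    """Component 1: turns raw text into a list of filtered tokens."""

    STOP_WORDS = {"the", "are", "but", "for", "in"}

    def process(self, text):
        words = (w.strip(".,;:!?") for w in text.lower().split())
        return [w for w in words if w and w not in self.STOP_WORDS]


class Classifier:
    """Component 2: placeholder classifier based on vocabulary overlap."""

    def __init__(self):
        self.vocab = {}

    def train(self, author, tokens):
        self.vocab.setdefault(author, set()).update(tokens)

    def classify(self, tokens):
        seen = set(tokens)
        return max(self.vocab, key=lambda a: len(self.vocab[a] & seen))


class AuthorshipApp:
    """Holds one object of each component and only delegates to them."""

    def __init__(self):
        self.preprocessor = Preprocessor()
        self.classifier = Classifier()

    def train(self, author, text):
        self.classifier.train(author, self.preprocessor.process(text))

    def attribute(self, text):
        return self.classifier.classify(self.preprocessor.process(text))


app = AuthorshipApp()
app.train("author_a", "The sea and the storm")
app.train("author_b", "Markets and interest rates")
print(app.attribute("A storm at sea"))
```

Because `AuthorshipApp` depends only on the `process`, `train`, and `classify` calls, either component could be replaced without touching the other.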


Conclusion

The execution time of the component-based architecture is somewhat less than that of the object-oriented architecture, which makes the component-based architecture preferable in terms of execution time.

The architectural approach has some effect on the application and its results: based on the experimental results of the application, the maintenance, execution, and test results of the component-based approach are comparatively better than those of the object-oriented approach.