Text Mining

Upload: iqra-javed

Post on 04-Apr-2018

  • 7/30/2019 Text Mining

    1/32

    Presented By:

    Iqra Javed

    BSE 2005-2009

    Presentation Rundown

    Introduction.

    Proposed Approach.

    Techniques and methodology.

    Application Architecture.

    System Design.

    Conclusion and Results.

    AUTOMATIC AUTHORSHIP ATTRIBUTION

    Automatic authorship attribution is the task of deciding which author, from a given list of candidate authors, wrote a given unattributed document.

    A set of documents of known authorship serves as the training set.

    Attributed:

    Unattributed:

    Stylometry

    Proposed Approach

    The research proposed in this thesis uses a machine learning model (a nearest-neighbor classifier with entropy weighting) to identify the class of a test document. The model is implemented under different architectures to gain a better understanding of the system as well as more reliable results.

    Data Mining:

    Data mining refers to the extraction of desired knowledge from a large amount of data.

    Text Mining:

    The process of deriving high-quality information from text.

    High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning.

    [Figure: text mining process. Labels: Sample documents; Transformed representation; Learning; Models; Domain-specific templates/models; Text document; Visualizations.]

    Text Mining Methods:

    Information Retrieval (IR)

    Information Extraction (IE)

    Natural Language Processing (NLP)

    Nearest Neighbor Classification (NN).

    The k-NN method was first introduced in the early 1950s. Because the method is computationally intensive, it remained unpopular until the 1960s, when greater computing power became available.

    Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with the training tuples that are similar to it.
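    The idea of learning by analogy can be sketched in a few lines. This is a minimal illustration, not the system's implementation: it assumes numeric feature vectors, Euclidean distance, and a majority vote among the k closest training tuples. The names `knn_classify`, `train`, and the author labels are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """Return the majority label among the k training tuples closest to query.

    train is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training tuples: (feature vector, author label).
train = [
    ([1.0, 0.0], "author_a"),
    ([0.9, 0.1], "author_a"),
    ([0.0, 1.0], "author_b"),
    ([0.1, 0.9], "author_b"),
]
predicted = knn_classify(train, [0.8, 0.2], k=3)
```

    All the work happens at classification time, since every query is compared against every stored tuple. That is why the method was costly on early hardware.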

    Used Terminologies.

    Entropy.

    Entropy indicates how large the uncertainty in the information content of a clustering result is with respect to the given classification.

    Cosine Similarity.

    Cosine similarity is a measure of similarity between two n-dimensional vectors, obtained by finding the cosine of the angle between them.
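    Both terms can be made concrete with a short sketch. It assumes Shannon entropy (in bits) over the class labels that fall into one cluster, and plain Python lists as vectors. The function names are illustrative and do not come from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels found in one cluster."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)
```

    A cluster containing only one class has entropy 0, while an even split between two classes has entropy 1 bit. Vectors pointing in the same direction have cosine similarity 1, and orthogonal vectors have 0.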

    Object Oriented Architecture.

    Object-oriented programming architecture focuses on the relationships between classes that are combined into one large binary executable.

    In traditional object-oriented development, once the classes are compiled, the result is monolithic binary code. All the classes share the same physical deployment unit (typically an EXE), process, address space, security privileges, and so on.

    If multiple developers work on the same code base, they have to share source files.

    A change to one class requires redeployment of all the other classes, which makes the application a burden to manage.

    Component Based Architecture.

    A component-oriented application comprises a collection of interacting binary application modules, that is, its components and the calls that bind them.

    The motivation for breaking down a monolithic application into multiple binary components is analogous to that for placing the code for different classes into different files.

    A component-oriented application is easier to extend.

    TECHNIQUES AND METHODOLOGY

    The Automatic Authorship Attribution System accepts plain-text documents. The system works in two phases, training and testing, and the two phases work in a similar way.

    Application Architecture of System

    Text Pre-Processing.

    Text pre-processing is the process of transforming the unstructured document into a structured format.

    Tokenization.

    Stop Word Filtering.

    Lemmatization.

    Stemming.

    Bag of Words.

    Domain Dictionary Creation.

    Text Pre-Processing.

    Tokenization:

    The process of splitting text into its constituent tokens is called tokenization.

    Stop Word Filtering:

    Filtering is used to remove stop words from the dictionary as well as from the documents.

    Examples: the, are, but, for, in, etc.
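    These two steps can be sketched as follows. The sketch assumes lowercase alphabetic tokens and uses only the five stop words listed on the slide. The names `tokenize`, `remove_stop_words`, and `STOP_WORDS` are illustrative, and a real system would use a much fuller stop-word list.

```python
import re

# Only the stop words named on the slide; a real list would be longer.
STOP_WORDS = {"the", "are", "but", "for", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The results are good, but not for everyone in the class.")
filtered = remove_stop_words(tokens)
```

    Here `filtered` keeps the content-bearing words (`results`, `good`, `not`, `everyone`, `class`) and drops the rest.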

    Text Pre-Processing.

    Lemmatization.

    Lemmatization methods try to map verb forms to the infinitive and nouns to the singular form.

    Stemming.

    Word stemming is an important feature supported by present-day indexing and search systems.

    Stemming broadens results to include both word roots and word derivations.

    Porter's Stemming Algorithm.

    The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980.

    Porter stemming is a process for removing the commoner morphological and inflexional endings from words in English.

    It consists of five steps; the proposed system uses a stem length of 2.

    Porter's Algorithm Steps.

    Step #1: deals with plurals and past participles.

    Ex. plastered -> plaster, motoring -> motor.

    Step #2: deals with pattern matching on some common suffixes.

    Ex. happy -> happi, relational -> relate.

    Step #3: deals with special word endings.

    Ex. triplicate -> triplic, hopeful -> hope.

    Step #4: checks the stripped word against more suffixes in case the word is compounded.

    Ex. revival -> reviv, allowance -> allow, etc.

    Step #5: checks if the stripped word ends in a vowel and fixes it appropriately.

    Ex. probate -> probat, cease -> ceas, controll -> control.
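    To give a feel for how one step works, the sketch below covers only the plural-handling rules (step 1a in Porter's published description). It is deliberately partial: the past-participle rules and steps 2 to 5 are not shown, and the function name `porter_step_1a` is illustrative.

```python
def porter_step_1a(word):
    """Apply the plural rules of Porter's step 1a to a lowercase word.

    The rules are tried in order and only the first match is applied.
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word
```

    The ordering matters: testing for `ss` before `s` is what stops `caress` from being cut down to `cares`.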

    BOW and Dictionary Creation

    Bag-of-Words (BOW):

    The BOW contains all the words in the document, irrespective of word order, since order makes no difference to the representation.

    Example:

    There are two red apples on a red table.

    BOW = {there, two, red, apples, red, table}

    Domain Dictionary Creation:

    It is based on the collection of unique words from all documents.
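    A minimal sketch of both structures, assuming the documents have already been tokenized and filtered. The first token list is the slide's example; the second document and the function names are made up for illustration.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count each token; word order is discarded."""
    return Counter(tokens)

def domain_dictionary(token_lists):
    """Sorted list of the unique words across all documents."""
    return sorted(set(word for tokens in token_lists for word in tokens))

doc1 = ["there", "two", "red", "apples", "red", "table"]  # slide example
doc2 = ["two", "green", "apples"]                          # hypothetical
bow1 = bag_of_words(doc1)
dictionary = domain_dictionary([doc1, doc2])
```

    The bag keeps repeated words (here `red` is counted twice), while the dictionary keeps each word only once across the whole collection.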

    Vector Space Model (VSM).

    It represents documents as vectors in n dimensions so that vector operations can be performed on them.

    One of the main tasks of vector space representation is to find an appropriate encoding of the feature vector, where each element of the vector represents a word of the document collection.

    Multi-term encoding is used in the system to capture word presence as well as word frequency for each document.
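    One way to read this encoding is sketched below. It assumes the vector has one position per dictionary word, holding either a 0/1 presence flag or a raw frequency count. The function name `encode` and its `binary` parameter are hypothetical, and the slides do not specify the exact weighting the system applies.

```python
def encode(tokens, dictionary, binary=False):
    """Map a token list to a vector with one element per dictionary word."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    if binary:
        # Presence encoding: 1 if the word occurs at all, else 0.
        return [1 if word in counts else 0 for word in dictionary]
    # Frequency encoding: how many times the word occurs.
    return [counts.get(word, 0) for word in dictionary]

dictionary = ["apples", "green", "red", "table", "there", "two"]
doc = ["there", "two", "red", "apples", "red", "table"]
frequency_vector = encode(doc, dictionary)
presence_vector = encode(doc, dictionary, binary=True)
```

    The frequency vector already implies presence (any non-zero element), which is why a single frequency-based encoding can serve both purposes.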

    Centroid-Based Classifier.

    1. Input a new document d = (w1, w2, ..., wn) for the training of the system.

    2. Predefined categories: C = {c1, c2, ..., cl}, based on training.

    3. Compute the centroid vector on the basis of the vector magnitudes of the documents.

    4. Similarity model: cosine function.

    5. Compute similarity on the basis of the nearest matching class cosine value in order to declare the class of the testing document.

    6. Output: assign to document d the category c_max.

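    $$\mathrm{Simil}(d_i, d_j) = \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|_2 \, \|d_j\|_2} = \frac{\sum_{l} w_{il}\, w_{jl}}{\sqrt{\sum_{l} w_{il}^2}\, \sqrt{\sum_{l} w_{jl}^2}}$$

    $$\mathrm{Simil}(c_i, d) = \cos(c_i, d)$$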
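    The classifier steps can be sketched as below. Two assumptions are made that the slides do not confirm: each class centroid is taken as the element-wise mean of that class's training vectors, and the document vectors are plain term-frequency lists. All function names and the sample data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors (an assumption)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_centroids(labeled_vectors):
    """Group training vectors by class and compute one centroid per class."""
    by_class = {}
    for vector, label in labeled_vectors:
        by_class.setdefault(label, []).append(vector)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def classify(centroids, d):
    """Assign d to the class whose centroid has the highest cosine with d."""
    return max(centroids, key=lambda c: cosine(centroids[c], d))

# Hypothetical term-frequency vectors for two authors.
training = [
    ([2, 0, 1], "author_a"),
    ([3, 1, 0], "author_a"),
    ([0, 2, 3], "author_b"),
    ([1, 3, 2], "author_b"),
]
centroids = train_centroids(training)
label = classify(centroids, [2, 1, 0])
```

    Compared with a nearest-neighbor search over every training document, a test document is compared against only one vector per class, which keeps classification cheap.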

    Used Terms:

    Euclidean Distance:

    The Euclidean distance of each point from the origin of the term space is measured in order to compute the vector magnitude, ignoring the zero terms. For a two-term example: $|D_i| = \sqrt{x_i^2 + y_i^2}$

    Dot-Product:

    The dot product is calculated for all the documents, ignoring zero values to reduce the computational burden. With $w_{Q,l}$ and $w_{i,l}$ denoting the weights of term $l$ in the query $Q$ and in document $D_i$, the dot product is

    $Q \cdot D_i = \sum_{l} w_{Q,l}\, w_{i,l}$
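    Skipping zero terms falls out naturally if each vector stores only its non-zero weights. The sketch below assumes a sparse `{term: weight}` dictionary representation and computes the dot product in its standard form, as the sum of products of matching term weights. The names and sample weights are illustrative.

```python
import math

def magnitude(vec):
    """Euclidean length of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in vec.values()))

def dot(u, v):
    """Dot product of two sparse vectors.

    Only terms stored in both vectors contribute; zero terms are never
    visited because they are simply absent from the dictionaries.
    """
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)

q = {"red": 1.0, "apples": 2.0}   # hypothetical query weights
d = {"red": 3.0, "table": 4.0}    # hypothetical document weights
```

    With these values `magnitude(d)` is 5.0 and `dot(q, d)` is 3.0, since `red` is the only term the two vectors share.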

    SYSTEM DESIGN

    System Requirement:

    Hardware Requirement:

    A Pentium 4 computer with 1 GB of RAM is required.

    Software Requirement:

    Microsoft Windows XP/2000 or above, with Microsoft Office Access 2007 and .NET Framework 2008.

    Workflow of Automatic Authorship System:

    Results and Discussions

    User Interface:

    Application User Interface

    Graphic Analysis of Application

    Help And User Manual

    Application Using Object Oriented Architecture.

    Consists of creating a class within the application and calling its functions from the user-interface event handlers in order to perform the functionality of the authorship attribution system.

    Accommodating changing requirements is not an easy task with this architectural approach.

    Application Using Component Based Architecture.

    Consists of creating two base classes, one for data preprocessing and the other for classification.

    By creating objects of the base classes in a child class and calling their respective functions, the system performs its functionality.

    Changing requirements can be handled easily with this architectural approach.
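    The structure described above might look like the following sketch. It is written in Python for illustration only (the slides indicate the actual application targets .NET). Every class and method name is hypothetical, and the word-overlap classifier is a stand-in for the real model.

```python
class Preprocessor:
    """Component 1: turns raw text into a list of filtered tokens."""
    STOP_WORDS = {"the", "a", "and", "on", "was"}

    def process(self, text):
        return [t for t in text.lower().split() if t not in self.STOP_WORDS]

class Classifier:
    """Component 2: placeholder classifier based on shared-word counts."""
    def __init__(self):
        self.examples = []  # list of (set of tokens, author)

    def train(self, tokens, author):
        self.examples.append((set(tokens), author))

    def predict(self, tokens):
        query = set(tokens)
        best = max(self.examples, key=lambda ex: len(ex[0] & query))
        return best[1]

class AuthorshipSystem:
    """Holds one object of each component and calls their functions."""
    def __init__(self, preprocessor, classifier):
        self.preprocessor = preprocessor
        self.classifier = classifier

    def train(self, text, author):
        self.classifier.train(self.preprocessor.process(text), author)

    def attribute(self, text):
        return self.classifier.predict(self.preprocessor.process(text))

system = AuthorshipSystem(Preprocessor(), Classifier())
system.train("the sea was calm and the ship sailed", "author_a")
system.train("the market fell and the bank closed", "author_b")
result = system.attribute("a ship on the sea")
```

    Because `AuthorshipSystem` depends only on the `process`, `train`, and `predict` calls, either component can be replaced (for example, a different classifier) without touching the other. That is the maintenance benefit the slide refers to.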

    Conclusion

    The execution time of the component-based architecture is somewhat lower than that of the object-oriented architecture, which makes the component-based architecture preferable with respect to execution time.

    The architectural approach has some effect on the application and its results: based on the experimental results of the application, maintenance, execution, and test results are comparatively better with the component-based approach than with the object-oriented one.