Intelligent Database Systems Lab
Presenter: CHANG, SHIH-JIE
Authors: ADNAN YAHYA and ALI SALHI
2014. ACM TALIP.
Arabic Text Categorization Based on Arabic Wikipedia
Outlines
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
Arabic text categorization is challenging because certain subcategories are correlated and the main categories overlap.
Example: [figure not preserved in the transcript]
Objectives
• To solve this, we use an algorithm and further adopt two approaches.
CATEGORIZATION CORPORA - Training Data
Related Tags Approach
Testing Data
10 categories with 40 documents in each category
Methodology - PREPROCESSING TECHNIQUES
• Root Extraction (RE)
• Light Stemming (LS)
• Special Expressions Extraction
Methodology - CATEGORIZATION PROCESS
Categorize the input text in two phases:
Phase one: We categorize the text into one of the main categories.
Phase two: We further categorize the input text based on subcategories.
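The two phases can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy English vocabularies, and the word-overlap `score` are all assumptions standing in for whatever scoring the chosen categorization algorithm provides.

```python
def score(text_words, category_words):
    """Placeholder score (assumption): number of input tokens found in the category vocabulary."""
    vocab = set(category_words)
    return sum(1 for w in text_words if w in vocab)


def categorize_two_phase(text_words, main_categories, subcategories):
    """Return (main_category, subcategory) for a list of preprocessed tokens.

    main_categories: dict mapping main category name -> list of words
    subcategories:   dict mapping main category name -> {subcategory name -> list of words}
    """
    # Phase one: choose the best-scoring main category.
    best_main = max(main_categories, key=lambda c: score(text_words, main_categories[c]))

    # Phase two: choose among the subcategories of that main category only.
    subs = subcategories.get(best_main, {})
    if not subs:
        return best_main, None
    best_sub = max(subs, key=lambda s: score(text_words, subs[s]))
    return best_main, best_sub


# Toy data (hypothetical, English for readability).
main_categories = {
    "sports": ["football", "goal", "team", "player", "racket"],
    "science": ["physics", "atom", "energy", "cell"],
}
subcategories = {
    "sports": {"football": ["football", "goal"], "tennis": ["racket", "serve"]},
    "science": {"physics": ["physics", "atom"], "biology": ["cell"]},
}
result = categorize_two_phase(
    ["team", "football", "goal", "player"], main_categories, subcategories
)
```

Restricting phase two to the subcategories of the phase-one winner keeps the search small, but it also means a phase-one mistake cannot be corrected later, which is the issue the grouping enhancements in the later slides address.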
Methodology - Basic Categorization Algorithm (BCA)
Methodology - Percentage and Difference Categorization (PDC) Algorithm
has frequency 7 in the 300-word
Methodology - Percentage and Difference Categorization (PDC) Algorithm
The category with the highest sum of flag values is considered to be the best match for the input text.
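The selection rule above (highest sum of flag values wins) can be sketched as below. The transcript does not preserve how each flag is computed, so the `flag` rule here is an assumption: a word gets flag 1 when its percentage in the input text is within a tolerance of its percentage in the category, and 0 otherwise. All names and the toy data are hypothetical.

```python
from collections import Counter


def percentages(words):
    """Percentage of the text taken up by each word, e.g. frequency 7 in 300 words -> 2.33."""
    counts = Counter(words)
    total = len(words)
    return {w: 100.0 * c / total for w, c in counts.items()}


def flag(input_pct, category_pct, tolerance=1.0):
    # Assumed rule, not taken from the slides: flag 1 when the two percentages are close.
    return 1.0 if abs(input_pct - category_pct) <= tolerance else 0.0


def pdc_best_category(text_words, category_pcts, tolerance=1.0):
    """Sum the flags per category and return (best_category, flag_sums).

    category_pcts: dict mapping category name -> {word -> percentage in that category}
    """
    text_pct = percentages(text_words)
    sums = {}
    for name, cat_pct in category_pcts.items():
        sums[name] = sum(
            flag(pct, cat_pct[word], tolerance)
            for word, pct in text_pct.items()
            if word in cat_pct  # words unknown to the category contribute nothing
        )
    best = max(sums, key=sums.get)
    return best, sums


# Toy data (hypothetical).
category_pcts = {
    "sports": {"goal": 50.0, "team": 25.0},
    "science": {"atom": 25.0, "goal": 5.0},
}
best, sums = pdc_best_category(["goal", "goal", "team", "atom"], category_pcts)
```

In this toy run both "goal" and "team" match the sports percentages, while only "atom" matches science, so sports has the larger flag sum.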
Methodology – PDC Algorithm vs. BCA Algorithm
Methodology – Enhancing Main/Subcategories Grouping
(1) Overlapping Main Categories for Phase Two
Problem: The possible high correlation between subcategories of different main categories.
Methodology – Enhancing Main/Subcategories Grouping
(2) Replacing Main Categories by Groups of Related Categories
Methodology – Enhancing Main/Subcategories Grouping
Methodology - Word Filtration Techniques within Categories
Methodology - The result of applying the three techniques
Modified PDC with N Scales
Define a scaling of
1, 0.5, 0
1, 0.75, 0.5, 0.25, 0
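The scale values listed above read as N evenly spaced flag levels between 1 and 0: three levels (1, 0.5, 0) and, with 0.25 and 0.75 added, five levels. A small sketch that generates such levels is shown below. The `scaled_flag` helper is an assumption about how a percentage difference might be mapped onto those levels; the transcript does not preserve the actual rule, and `max_diff` is a hypothetical parameter.

```python
def scale_values(n):
    """Return n evenly spaced flag levels from 1 down to 0."""
    if n < 2:
        raise ValueError("need at least two levels (1 and 0)")
    return [1 - i / (n - 1) for i in range(n)]


def scaled_flag(diff, max_diff, n):
    """Assumed mapping: snap the closeness of two percentages to the nearest of n levels.

    diff:     absolute difference between the input and category percentages
    max_diff: difference at or beyond which the flag is 0 (hypothetical parameter)
    """
    closeness = 1 - min(diff / max_diff, 1)  # 1 = identical, 0 = far apart
    return round(closeness * (n - 1)) / (n - 1)


three_levels = scale_values(3)
five_levels = scale_values(5)
```

With more levels, a word whose percentage is only roughly similar to the category's still contributes a partial flag instead of being rounded to all or nothing.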
Further Testing on the PDC Algorithm
• Tool: Root Extraction
• Tool: Light Stemming & Light10
• Tool: Double Words
• Tool: Expressions Extraction
Using Testing Data from the Reference Categories
Training Data Characteristics
COMPARISON WITH RELATED WORK
Using Testing Data from the Reference Categories
Conclusions
– The first method uses training and testing data from the same source, splitting the corpus into test and training components. This consistently gives better results.
– However, we believe that the second method (different sources) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.
Comments
• Advantages
– To.
• Applications
– Arabic Text Categorization.