Automated Text Categorization: The Two-Dimensional Probability Model
DESCRIPTION
Automated Text Categorization: The Two-Dimensional Probability Model. Abdulaziz Alsharikh. Agenda: Introduction; Background on ATC; The Two-Dimensional Model (Probabilities, Peculiarity); Document Coordinates; Experiments; Results Analyses; Critiques.

TRANSCRIPT
AUTOMATED TEXT CATEGORIZATION:
THE TWO-DIMENSIONAL PROBABILITY MODEL

Abdulaziz Alsharikh
AGENDA
• Introduction
• Background on ATC
• The Two-Dimensional Model
  – Probabilities
  – Peculiarity
• Document Coordinates
• Experiments
• Results Analyses
• Critiques
INTRODUCTION
• What is ATC?
  – Build a classifier by observing the properties of a set of pre-classified documents (Naïve Bayes model)
    • Simple to implement
    • Gives remarkable accuracy
    • Gives directions for SVMs (parameter tuning)
  – The 2DPM starts from hypotheses different from NB's:
    • Terms are seen as disjoint events
    • Documents are seen as unions of these events
    • A visualization tool for understanding the relationships between categories
    • Helps users visually audit the classifier and identify suspicious training data
BACKGROUND ON ATC
• Set of categories, set of documents
  – Classification function: D × C → {T, F}
• Multi- vs. single-label categorization
  – CSVi : D → [0, 1]
  – The degree of membership of a document in a category
• Binary categorization
  – Each document in D is assigned either to a category c or to its complement
• Documents are split into Dtr to train the classifier and Dte to measure its performance
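The CSV-and-threshold framework above can be sketched as follows; the scoring function here is an illustrative stand-in, not the paper's actual CSV:

```python
# A categorization status value (CSV) function maps each document to
# [0, 1]; hard binary categorization then compares it to a threshold.
def csv_score(document_tokens, category_terms):
    """Toy CSV: fraction of the document's tokens that belong to the
    category's characteristic vocabulary (an illustrative stand-in)."""
    if not document_tokens:
        return 0.0
    hits = sum(tok in category_terms for tok in document_tokens)
    return hits / len(document_tokens)

def binary_decision(document_tokens, category_terms, threshold=0.5):
    # D × C → {T, F}: assign to c if the CSV clears the threshold,
    # otherwise to its complement.
    return csv_score(document_tokens, category_terms) >= threshold
```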
THE TWO-DIMENSIONAL MODEL
• Presents each document as a point on a 2-D Cartesian plane
• Based on two parameters
  – Presence and expressiveness, measuring the frequency of the term in the documents of all the categories
• Advantages
  – No explicit need for feature selection to reduce dimensionality
  – Limited space required to store objects compared to NB
  – Lower computational cost for classifier training
BASIC PROBABILITIES
• Given the set of categories C = {c1, c2, …}
• Given the vocabulary V = {t1, t2, …}
• The elementary events are the pairs Ω = {(tk, ci)}
BASIC PROBABILITIES
• Probability of a term given a category or a set of categories
• For a set of categories
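The formulas on this slide were images and did not survive extraction; a standard relative-frequency estimate consistent with the slide text would be (this reconstruction is an assumption, not the paper's exact definition):

```latex
P(t_k \mid c_i) = \frac{N(t_k, c_i)}{\sum_{t \in V} N(t, c_i)}
\qquad
P(t_k \mid C') = \frac{\sum_{c_i \in C'} N(t_k, c_i)}{\sum_{c_i \in C'} \sum_{t \in V} N(t, c_i)}
```

where N(t, c) counts the occurrences of term t in the training documents of category c, and C' is a set of categories.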
BASIC PROBABILITIES
• Probability of a set of terms given a category or a set of categories
• And ..
BASIC PROBABILITIES
• "Peculiarity" of a term, given a chosen set of categories
  – It is the probability of finding the term in a document of the set and of not finding the same term in the set's complement
  – Presence: how frequently the term occurs in the categories
  – Expressiveness: how distinctive the term is for that set of categories
  – It is useful when computing the probability of complements of sets
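A minimal sketch of how presence and expressiveness might be computed from raw term counts; the toy corpus, the exact estimators, and the product used to combine them into a peculiarity score are all assumptions here, not the paper's precise definitions:

```python
from collections import Counter

# Hypothetical toy corpus: each document is a (category, tokens) pair.
corpus = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["team", "win", "goal"]),
    ("finance", ["stock", "market", "win"]),
]

def presence(term, cats, corpus):
    """Relative frequency of `term` among all term occurrences in
    documents belonging to the chosen set of categories `cats`."""
    in_set = Counter()
    for cat, tokens in corpus:
        if cat in cats:
            in_set.update(tokens)
    total = sum(in_set.values())
    return in_set[term] / total if total else 0.0

def expressiveness(term, cats, corpus):
    """Share of the term's occurrences that fall inside `cats`
    rather than in the complement -- how distinctive the term is."""
    inside = sum(t == term for cat, toks in corpus if cat in cats for t in toks)
    everywhere = sum(t == term for _, toks in corpus for t in toks)
    return inside / everywhere if everywhere else 0.0

def peculiarity(term, cats, corpus):
    # Combining the two by a product is one simple choice
    # (an assumption here, not the paper's formula).
    return presence(term, cats, corpus) * expressiveness(term, cats, corpus)
```

For instance, "goal" occurs only in the sports documents, so its expressiveness for {"sports"} is 1.0, while "win" occurs once in each category and scores 0.5.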
BASIC PROBABILITIES
• Probability of a term given complementary sets of categories
BASIC PROBABILITIES
• Probability of a term, having chosen a set of categories
  – Corresponds to finding the term in the set of documents and not finding the same term in the complement of that set
2-D REPRESENTATION AND CATEGORIZATION
• To find the probability of a set of terms in a set of categories
• Breaking down the expression to plot it
• A natural way to assign d to c
• With a threshold (q) to improve the separation
• With an angular coefficient (m)
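The decision rule above (assign d to c according to a line with angular coefficient m and threshold q in the 2-D plane) can be sketched as follows; the coordinate values and the default parameters are placeholders, with m and q to be tuned on training data:

```python
def classify(x, y, m=1.0, q=0.0):
    """Assign the document point (x, y) to category c when it lies on
    or above the decision line y = m*x + q, otherwise to the complement.
    x and y are the document's 2-D coordinates; m is the angular
    coefficient and q the threshold of the separating line."""
    return "c" if y >= m * x + q else "not-c"

# Rotating the line (changing m) or shifting it (changing q) moves the
# boundary between the two classes without re-deriving any probabilities.
```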
EXPERIMENTS
• Datasets
  – Reuters-21578 (135 potential categories)
    • First experiment with the 10 most frequent categories
    • Second experiment with the 90 most frequent categories
  – Reuters Corpus Volume 1
    • 810,000 newswire stories (Aug. 1996 to Aug. 1997)
    • 30,000 training documents
    • 36,000 test documents
EXPERIMENTS
• Pre-processing
  – Remove all punctuation and convert to lower case
  – Remove the most frequent terms of the English language
  – K-fold cross-validation (k = 5) for the FAR algorithm, to find the best separation
  – Recall and precision are computed for each category and combined into a single measure (F1)
  – Macro-averaging and micro-averaging are computed for each of the previous measurements
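The F1 combination and the two averaging schemes mentioned above can be sketched as follows; the per-category contingency counts are made-up numbers for illustration:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and their harmonic mean F1 from
    true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Per-category (TP, FP, FN) counts -- hypothetical numbers.
counts = {"earn": (90, 10, 5), "acq": (40, 20, 10)}

# Macro-averaging: compute F1 per category, then average the F1 values,
# so every category weighs equally regardless of its size.
macro_f1 = sum(prf1(*c)[2] for c in counts.values()) / len(counts)

# Micro-averaging: pool the counts first, then compute F1 once,
# so frequent categories dominate the result.
TP = sum(c[0] for c in counts.values())
FP = sum(c[1] for c in counts.values())
FN = sum(c[2] for c in counts.values())
micro_f1 = prf1(TP, FP, FN)[2]
```

With these numbers the micro-average exceeds the macro-average because the larger category ("earn") performs better, which is exactly the gap the macro/micro comparison in the results is probing.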
ANALYSES AND RESULTS
• Comparing the 2DPM with multinomial NB
• Case 1: NB performs better than the 2DPM
• Case 2: almost the same, but the macro-average is halved
• Case 3: same as case 2, but the macro-average is increased
• Case 4: NB performs better than the 2DPM in micro-average but worse in macro-average
ANALYSES AND RESULTS
CONCLUSION
• Plotting makes the classifier's decisions easy to understand
• Rotating the decision line results in better separation of the two classes (Focused Angular Region)
• 2DPM performance is at least equal to the NB model's
• The 2DPM is better on macro-averaged F1
CRITIQUES
• The paper was not clear in directing the reader to the final results
• It derives the probabilities for many cases but does not show how to use them
• The paper focuses on the theoretical side, while the results and analysis part is barely 10% of it
• The main algorithm the paper depends on (the focused angular region) is only mentioned, without sufficient explanation
Thank you