
Page 1: pmuthoju_presentation.ppt

1

Automatic Document Categorization using Support

Vector Machines

Prashanth Kumar [email protected]

Advisor: Dr. Zubair

Page 2: pmuthoju_presentation.ppt

2

Overview

Introduction
Problem
Proposed Solution
Improvements
Results
Future Work
Conclusion
References

Page 3: pmuthoju_presentation.ppt

3

Introduction

What is Categorization?

Sorting a set of documents into categories from a predefined set.

Assigning a document to a category based on its contents.

Page 4: pmuthoju_presentation.ppt

4

Introduction (cont'd)

Types of Categorization:

Manual
Automatic (Machine Learning):
  Probabilistic (e.g., Naïve Bayes)
  Decision Structures (e.g., Decision Trees)
  Support Vector Machines (SVM)

Page 5: pmuthoju_presentation.ppt

5

Introduction (cont'd)

Why 'Automation'?

Manual categorization:
needs a large number of human resources
is expensive
is time consuming

Page 6: pmuthoju_presentation.ppt

6

Introduction (cont'd)

Applications of Automatic Categorization:
Indexing of scientific articles
Spam filtering of e-mails
Authorship attribution

Page 7: pmuthoju_presentation.ppt

7

Problem

The DTIC document base has to be categorized into 25 broad fields and 251 narrow groups. The fields/groups are listed at:

http://www.dtic.mil/trail/fieldgrp.html

Page 8: pmuthoju_presentation.ppt

8

Towards the solution ..

Strategy: Exploit an existing collection of categorized documents
One portion is used as the training set
The other portion is used as the testing set
Allow tuning of the classifier to yield maximum effectiveness

Page 9: pmuthoju_presentation.ppt

9

Towards the solution ..

What is a Support Vector Machine?

A binary classifier
Finds the hyperplane with the largest margin separating the two classes
Subsequently classifies items based on which side of that boundary they fall
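
For reference, this is the standard linear SVM formulation (not specific to these slides): a document vector x is labeled by the side of the separating hyperplane it falls on, and training picks the hyperplane with the largest margin.

```latex
f(x) = \operatorname{sign}(w \cdot x + b),
\qquad
\min_{w,\,b} \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \forall i
```

Maximizing the margin 2/||w|| is equivalent to minimizing ||w||^2, which is what the constrained problem above does.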

Page 10: pmuthoju_presentation.ppt

10

Towards the solution ..

Why is SVM chosen for Automatic Categorization?
Prior studies have suggested good results with SVMs
Relatively immune to overfitting (fitting to coincidental relations encountered during training)

Page 11: pmuthoju_presentation.ppt

11

Towards the solution ..

SVM library: LibSVM 2.85
Implementation language: Java

Page 12: pmuthoju_presentation.ppt

12

Solution

Before we can train the SVM for a Field/Group using LibSVM, we have to prepare a dataset for that Field/Group.

Each file is represented as: <label> <feature1>:<value1> <feature2>:<value2> ...
(sparse vector representation)

<label> is 1 for a positive file and 0 for a negative file
Each <feature>:<value> pair is <word>:<tfidf>
(Common words are eliminated before preparing the dataset.)
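
A minimal sketch of producing one such line, assuming stop words have already been removed and the tf-idf weights and word-to-feature-index mapping have already been computed; the class and method names are illustrative, not taken from the slides:

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: turn one document's tf-idf weights into a LibSVM-format line.
// Inputs (tfidf weights, vocabulary index) are assumed to be precomputed.
public class SparseLine {
    public static String toLibSvmLine(int label, Map<String, Double> tfidf,
                                      Map<String, Integer> vocabularyIndex) {
        // LibSVM expects feature indices (assumed to start at 1) in ascending order.
        TreeMap<Integer, Double> sorted = new TreeMap<>();
        for (Map.Entry<String, Double> e : tfidf.entrySet()) {
            Integer idx = vocabularyIndex.get(e.getKey());
            if (idx != null) {
                sorted.put(idx, e.getValue());
            }
        }
        StringBuilder sb = new StringBuilder();
        sb.append(label);   // 1 = positive file, 0 = negative file
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```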

Page 13: pmuthoju_presentation.ppt

13

Solution

For each Field/Group K, the following procedure is repeated (training phase):

[Training pipeline] Collection model (by Dr. Zeil) → Download documents (PDF) → Convert PDF to text → Model documents using TF and IDF → Positive training set and negative training set for Field/Group K → SVM for Field/Group K
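
A minimal training sketch using LibSVM's Java API, assuming the positive and negative documents for Field/Group K have already been turned into sparse svm_node vectors with labels 1 and 0; the parameter values (linear kernel, C = 1) are illustrative choices, since the slides do not state them:

```java
import libsvm.*;

// Minimal sketch: train the binary SVM for one Field/Group K.
public class TrainFieldGroup {
    public static svm_model train(svm_node[][] vectors, double[] labels) {
        svm_problem prob = new svm_problem();
        prob.l = vectors.length;   // number of training documents
        prob.x = vectors;          // one sparse tf-idf vector per document
        prob.y = labels;           // 1 = positive file, 0 = negative file

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;       // binary classification
        param.kernel_type = svm_parameter.LINEAR;   // common for text; kernel not stated in the slides
        param.C = 1;
        param.eps = 1e-3;
        param.cache_size = 100;
        param.probability = 1;     // enables the 0..1 estimates used in the testing phase

        String check = svm.svm_check_parameter(prob, param);
        if (check != null) {
            throw new IllegalArgumentException(check);
        }
        return svm.svm_train(prob, param);          // the SVM for Field/Group K
    }
}
```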

Page 14: pmuthoju_presentation.ppt

14

Solution

(Testing phase)

[Testing pipeline] Input test document (PDF) → Convert PDF to text → Model document using TF and IDF → Trained SVM for each Field/Group 1 … K … N → each SVM produces an estimate in the range 0 to 1 indicating how likely Field/Group K maps to the test document
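
A minimal testing sketch, again using LibSVM's Java API, assuming each per-Field/Group model was trained with probability estimates enabled; the helper name is illustrative:

```java
import libsvm.*;

// Minimal sketch: score one test document (already a sparse tf-idf vector)
// against the trained model of one Field/Group.
public class ScoreDocument {
    public static double positiveProbability(svm_model model, svm_node[] docVector) {
        int[] classLabels = new int[svm.svm_get_nr_class(model)];
        svm.svm_get_labels(model, classLabels);    // order of the probability estimates
        double[] probs = new double[classLabels.length];
        svm.svm_predict_probability(model, docVector, probs);
        // Return the probability assigned to the positive class (label 1).
        for (int i = 0; i < classLabels.length; i++) {
            if (classLabels[i] == 1) {
                return probs[i];                   // the 0..1 estimate for this Field/Group
            }
        }
        return 0.0;
    }
}
```

One way to use these scores, consistent with the single-label results reported later, is to assign each test document to the Field/Group whose SVM returns the highest estimate; the slides do not spell out the decision rule.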

Page 15: pmuthoju_presentation.ppt

15

Improving the results

Scaling the vectors in the datasets, so that the <value>s in the <feature>:<value> pairs fall between 0 and 1
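
A minimal sketch of one way to do this scaling: since tf-idf values are non-negative, dividing by the per-feature maximum maps them into [0, 1]. LibSVM also ships an svm-scale tool for this purpose; the Java code below just shows the idea, and the class name is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

import libsvm.svm_node;

// Minimal sketch of per-feature scaling to [0, 1]. The maxima computed on the
// training vectors are reused to scale the test vectors consistently.
public class Scaler {
    public static Map<Integer, Double> featureMaxima(svm_node[][] vectors) {
        Map<Integer, Double> max = new HashMap<>();
        for (svm_node[] doc : vectors) {
            for (svm_node n : doc) {
                max.merge(n.index, n.value, Math::max);
            }
        }
        return max;
    }

    public static void scale(svm_node[][] vectors, Map<Integer, Double> max) {
        for (svm_node[] doc : vectors) {
            for (svm_node n : doc) {
                double m = max.getOrDefault(n.index, 0.0);
                if (m > 0) {
                    n.value /= m;   // non-negative tf-idf values end up in [0, 1]
                }
            }
        }
    }
}
```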

Page 16: pmuthoju_presentation.ppt

16

Experiment

Randomly selected 5 Field/Groups: 140200, 120200, 201300, 220200, 250400.

For each field/group, 70 PDF files were downloaded:

50 files were used as positive files for training
20 files were used for testing

An additional 50 files were taken randomly from all other field/groups as negative files for training.

Page 17: pmuthoju_presentation.ppt

17

Experiment

Metrics:

Recall = #Correct Answers / #Total Possible Answers
Precision = #Correct Answers / #Answers Produced
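
Equivalently, in terms of true positives (TP), false positives (FP), and false negatives (FN) for a single Field/Group (standard definitions, stated here for reference):

```latex
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},
\qquad
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
```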

Page 18: pmuthoju_presentation.ppt

18

Results

Confusion matrix (rows: actual Field/Group of the 20 test files, columns: predicted Field/Group):

           140200  120200  201300  220200  250400
140200         13       2       1       2       2
120200          1      16       0       3       0
201300          0       5      13       2       0
220200          1       0       2      17       0
250400          0       0       1       0      19
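
As a worked check for Field/Group 140200: 13 of its 20 test files were classified correctly, and 15 files in total were assigned to 140200 (13 + 1 + 0 + 1 + 0), so

```latex
\mathrm{Recall}_{140200} = \tfrac{13}{20} = 0.65,
\qquad
\mathrm{Precision}_{140200} = \tfrac{13}{15} \approx 0.87
```

which matches the values on the next slide.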

Page 19: pmuthoju_presentation.ppt

19

Results (cont'd)

Category  Precision  Recall

140200 0.87 0.65

120200 0.70 0.80

201300 0.76 0.65

220200 0.71 0.85

250400 0.90 0.95

Page 20: pmuthoju_presentation.ppt

20

Future Work

Hierarchical Model

Example hierarchy:

150000
  150300
    150301
    150302
  150600
    150601
    150602

In the flat model, each field/group is treated as independent.

In the hierarchical model, all files under a branch are treated as positive files when training the classifier for that branch, as sketched below.
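
A minimal sketch of how the hierarchical positive set could be collected, assuming documents carry their numeric field/group code and that codes under a branch share the parent code's leading digits (as in the example above); this prefix rule and all names here are illustrative, not taken from the slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch: gather the positive training files for one branch of the
// hierarchy by matching the parent code's leading digits (assumption: e.g.
// 150301 and 150302 fall under 150300, which falls under 150000).
public class HierarchicalPositives {
    public static List<String> positivesForBranch(String branchCode,
                                                  Map<String, String> docToCode) {
        String prefix = branchCode.replaceAll("0+$", "");   // 150300 -> 1503, 150000 -> 15
        List<String> positives = new ArrayList<>();
        for (Map.Entry<String, String> e : docToCode.entrySet()) {
            if (e.getValue().startsWith(prefix)) {
                positives.add(e.getKey());   // this document becomes a positive training file
            }
        }
        return positives;
    }
}
```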

Page 21: pmuthoju_presentation.ppt

21

Future Work

Multi-label classification: in practice, each document may belong to multiple field/groups.

Page 22: pmuthoju_presentation.ppt

22

Conclusion

The classification results of DTIC documents based on Field/Groups were impressive.

Ways to improve the results have been identified, and a couple of suggestions were given for future work in this area.

Page 23: pmuthoju_presentation.ppt

References

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1-47.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML 1998). (http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf)

Kwok, J. T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing (ICONIP), Kitakyushu, Japan, pp. 347-351.
