
Page 1: pmuthoju_presentation.ppt

1

Automatic Document Categorization using Support

Vector Machines

Prashanth Kumar [email protected]

Advisor: Dr. Zubair

Page 2: pmuthoju_presentation.ppt

2

Overview

Introduction
Problem
Proposed Solution
Improvements
Results
Future Work
Conclusion
References

Page 3: pmuthoju_presentation.ppt

3

Introduction

What is Categorization?

Sorting a set of documents into categories from a predefined set.

Assigning a document to a category based on its contents.

Page 4: pmuthoju_presentation.ppt

4

Introduction (cont'd)

Types of Categorization:

Manual
Automatic (Machine Learning):
  Probabilistic (e.g., Naïve Bayes)
  Decision Structures (e.g., Decision Trees)
  Support Vector Machines (SVM)

Page 5: pmuthoju_presentation.ppt

5

Introduction (cont'd)

Why 'Automation'?

Manual categorization:
needs a large number of human resources
is expensive
is time consuming

Page 6: pmuthoju_presentation.ppt

6

Introduction (cont'd)

Applications of Automatic Categorization:
Indexing of scientific articles
Spam filtering of e-mails
Authorship attribution

Page 7: pmuthoju_presentation.ppt

7

Problem

The DTIC document base has to be categorized into 25 broad fields and 251 narrow groups. The fields/groups are listed at:

http://www.dtic.mil/trail/fieldgrp.html

Page 8: pmuthoju_presentation.ppt

8

Towards the solution ..

Strategy: Exploit an existing collection of categorized documents
One portion is used as the training set
The other portion is used as the testing set
Allow tuning of the classifier to yield maximum effectiveness

Page 9: pmuthoju_presentation.ppt

9

Towards the solution ..

What is a Support Vector Machine?

A binary classifier
Finds the hyperplane with the largest margin separating the two classes
Subsequently classifies items based on which side of that boundary they fall
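
For reference, this is the standard linear SVM formulation (not specific to these slides): a document vector x is labeled by the side of the separating hyperplane it falls on, and training picks the hyperplane with the largest margin.

```latex
f(x) = \operatorname{sign}(w \cdot x + b),
\qquad
\min_{w,\,b} \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \;\; \forall i
```

Maximizing the margin 2/||w|| is equivalent to minimizing ||w||^2, which is what the constrained problem above does.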

Page 10: pmuthoju_presentation.ppt

10

Towards the solution ..

Why is SVM chosen for Automatic Categorization?
Prior studies have suggested good results with SVMs
Relatively immune to overfitting (fitting to coincidental relations encountered during training)

Page 11: pmuthoju_presentation.ppt

11

Towards the solution ..

SVM library: LibSVM 2.85
Implementation language: Java

Page 12: pmuthoju_presentation.ppt

12

Solution

Before we can train the SVM for a Field/Group using LibSVM, we have to prepare a dataset for that Field/Group.

Each file is represented as: <label> <feature1>:<value1> <feature2>:<value2> ...
(sparse vector representation)

<label> is 1 for a positive file and 0 for a negative file
Each <feature>:<value> pair is <word>:<tfidf>
(Common words are eliminated before preparing the dataset.)
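
A minimal sketch of producing one such line, assuming stop words have already been removed and the tf-idf weights and word-to-feature-index mapping have already been computed; the class and method names are illustrative, not taken from the slides:

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: turn one document's tf-idf weights into a LibSVM-format line.
// Inputs (tfidf weights, vocabulary index) are assumed to be precomputed.
public class SparseLine {
    public static String toLibSvmLine(int label, Map<String, Double> tfidf,
                                      Map<String, Integer> vocabularyIndex) {
        // LibSVM expects feature indices (assumed to start at 1) in ascending order.
        TreeMap<Integer, Double> sorted = new TreeMap<>();
        for (Map.Entry<String, Double> e : tfidf.entrySet()) {
            Integer idx = vocabularyIndex.get(e.getKey());
            if (idx != null) {
                sorted.put(idx, e.getValue());
            }
        }
        StringBuilder sb = new StringBuilder();
        sb.append(label);   // 1 = positive file, 0 = negative file
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```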

Page 13: pmuthoju_presentation.ppt

13

Solution

For each Field/Group K, the following procedure is repeated (training phase):

[Training pipeline] Collection model (by Dr. Zeil) → Download documents (PDF) → Convert PDF to text → Model documents using TF and IDF → Positive training set and negative training set for Field/Group K → SVM for Field/Group K
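
A minimal training sketch using LibSVM's Java API, assuming the positive and negative documents for Field/Group K have already been turned into sparse svm_node vectors with labels 1 and 0; the parameter values (linear kernel, C = 1) are illustrative choices, since the slides do not state them:

```java
import libsvm.*;

// Minimal sketch: train the binary SVM for one Field/Group K.
public class TrainFieldGroup {
    public static svm_model train(svm_node[][] vectors, double[] labels) {
        svm_problem prob = new svm_problem();
        prob.l = vectors.length;   // number of training documents
        prob.x = vectors;          // one sparse tf-idf vector per document
        prob.y = labels;           // 1 = positive file, 0 = negative file

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;       // binary classification
        param.kernel_type = svm_parameter.LINEAR;   // common for text; kernel not stated in the slides
        param.C = 1;
        param.eps = 1e-3;
        param.cache_size = 100;
        param.probability = 1;     // enables the 0..1 estimates used in the testing phase

        String check = svm.svm_check_parameter(prob, param);
        if (check != null) {
            throw new IllegalArgumentException(check);
        }
        return svm.svm_train(prob, param);          // the SVM for Field/Group K
    }
}
```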

Page 14: pmuthoju_presentation.ppt

14

Solution

(Testing phase)

[Testing pipeline] Input test document (PDF) → Convert PDF to text → Model document using TF and IDF → Trained SVM for each Field/Group 1 … K … N → each SVM produces an estimate in the range 0 to 1 indicating how likely Field/Group K maps to the test document
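
A minimal testing sketch, again using LibSVM's Java API, assuming each per-Field/Group model was trained with probability estimates enabled; the helper name is illustrative:

```java
import libsvm.*;

// Minimal sketch: score one test document (already a sparse tf-idf vector)
// against the trained model of one Field/Group.
public class ScoreDocument {
    public static double positiveProbability(svm_model model, svm_node[] docVector) {
        int[] classLabels = new int[svm.svm_get_nr_class(model)];
        svm.svm_get_labels(model, classLabels);    // order of the probability estimates
        double[] probs = new double[classLabels.length];
        svm.svm_predict_probability(model, docVector, probs);
        // Return the probability assigned to the positive class (label 1).
        for (int i = 0; i < classLabels.length; i++) {
            if (classLabels[i] == 1) {
                return probs[i];                   // the 0..1 estimate for this Field/Group
            }
        }
        return 0.0;
    }
}
```

One way to use these scores, consistent with the single-label results reported later, is to assign each test document to the Field/Group whose SVM returns the highest estimate; the slides do not spell out the decision rule.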

Page 15: pmuthoju_presentation.ppt

15

Improving the results

Scaling the vectors in the datasets, so that the <value>s in the <feature>:<value> pairs fall between 0 and 1
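
A minimal sketch of one way to do this scaling: since tf-idf values are non-negative, dividing by the per-feature maximum maps them into [0, 1]. LibSVM also ships an svm-scale tool for this purpose; the Java code below just shows the idea, and the class name is illustrative.

```java
import java.util.HashMap;
import java.util.Map;

import libsvm.svm_node;

// Minimal sketch of per-feature scaling to [0, 1]. The maxima computed on the
// training vectors are reused to scale the test vectors consistently.
public class Scaler {
    public static Map<Integer, Double> featureMaxima(svm_node[][] vectors) {
        Map<Integer, Double> max = new HashMap<>();
        for (svm_node[] doc : vectors) {
            for (svm_node n : doc) {
                max.merge(n.index, n.value, Math::max);
            }
        }
        return max;
    }

    public static void scale(svm_node[][] vectors, Map<Integer, Double> max) {
        for (svm_node[] doc : vectors) {
            for (svm_node n : doc) {
                double m = max.getOrDefault(n.index, 0.0);
                if (m > 0) {
                    n.value /= m;   // non-negative tf-idf values end up in [0, 1]
                }
            }
        }
    }
}
```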

Page 16: pmuthoju_presentation.ppt

16

Experiment

Randomly selected 5 Field/Groups: 140200, 120200, 201300, 220200, 250400.

For each field/group, 70 PDF files were downloaded:

50 files were used as positive files for training
20 files were used for testing

An additional 50 files were taken randomly from all other field/groups as negative files for training.

Page 17: pmuthoju_presentation.ppt

17

Experiment

Metrics:

Recall = #Correct Answers / #Total Possible Answers
Precision = #Correct Answers / #Answers Produced
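
Equivalently, in terms of true positives (TP), false positives (FP), and false negatives (FN) for a single Field/Group (standard definitions, stated here for reference):

```latex
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},
\qquad
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
```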

Page 18: pmuthoju_presentation.ppt

18

Results

Confusion matrix (rows: actual Field/Group of the 20 test files, columns: predicted Field/Group):

           140200  120200  201300  220200  250400
140200         13       2       1       2       2
120200          1      16       0       3       0
201300          0       5      13       2       0
220200          1       0       2      17       0
250400          0       0       1       0      19
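
As a worked check for Field/Group 140200: 13 of its 20 test files were classified correctly, and 15 files in total were assigned to 140200 (13 + 1 + 0 + 1 + 0), so

```latex
\mathrm{Recall}_{140200} = \tfrac{13}{20} = 0.65,
\qquad
\mathrm{Precision}_{140200} = \tfrac{13}{15} \approx 0.87
```

which matches the values on the next slide.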

Page 19: pmuthoju_presentation.ppt

19

Results (cont'd)

Category  Precision  Recall

140200 0.87 0.65

120200 0.70 0.80

201300 0.76 0.65

220200 0.71 0.85

250400 0.90 0.95

Page 20: pmuthoju_presentation.ppt

20

Future Work

Hierarchical Model

Example hierarchy:

150000
  150300
    150301
    150302
  150600
    150601
    150602

In the flat model, each field/group is treated as independent.

In the hierarchical model, all files under a branch are treated as positive files when training the classifier for that branch, as sketched below.
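
A minimal sketch of how the hierarchical positive set could be collected, assuming documents carry their numeric field/group code and that codes under a branch share the parent code's leading digits (as in the example above); this prefix rule and all names here are illustrative, not taken from the slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch: gather the positive training files for one branch of the
// hierarchy by matching the parent code's leading digits (assumption: e.g.
// 150301 and 150302 fall under 150300, which falls under 150000).
public class HierarchicalPositives {
    public static List<String> positivesForBranch(String branchCode,
                                                  Map<String, String> docToCode) {
        String prefix = branchCode.replaceAll("0+$", "");   // 150300 -> 1503, 150000 -> 15
        List<String> positives = new ArrayList<>();
        for (Map.Entry<String, String> e : docToCode.entrySet()) {
            if (e.getValue().startsWith(prefix)) {
                positives.add(e.getKey());   // this document becomes a positive training file
            }
        }
        return positives;
    }
}
```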

Page 21: pmuthoju_presentation.ppt

21

Future Work

Multi-label classification: in practice, each document may belong to multiple field/groups.

Page 22: pmuthoju_presentation.ppt

22

Conclusion

The classification results of DTIC documents based on Field/Groups were impressive.

Ways to improve the results have been identified, and a couple of suggestions were given for future work in this area.

Page 23: pmuthoju_presentation.ppt

References

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), pp. 1-47.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML 1998). (http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf)

Kwok, J. T. (1998). Automated text categorization using support vector machine. In Proceedings of the International Conference on Neural Information Processing (ICONIP), Kitakyushu, Japan, pp. 347-351.
