
Page 1

Neural Text Categorizer for Exclusive Text Categorization

Journal of Information Processing Systems, Vol.4, No.2, June 2008

Taeho Jo*

Presenter: 林昱志

Page 2

Outline

Introduction

Related Work

Method

Experiment

Conclusion

Page 3

Introduction

Two types of approaches to text categorization:

① Rule-based: classification rules are defined manually in if-then-else form (see the sketch at the end of this slide)

Advantage

1) High precision

Disadvantages

1) Poor recall

2) Poor flexibility
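A minimal sketch of what such hand-written rules might look like (the keywords and category names here are hypothetical, not from the paper):

```python
def classify_by_rules(document: str) -> str:
    """Toy rule-based categorizer built from manual if-then-else rules.

    Keywords and categories are illustrative only; a real system would
    encode many carefully engineered rules per category.
    """
    text = document.lower()
    if "goal" in text or "tournament" in text:
        return "sports"
    elif "election" in text or "parliament" in text:
        return "politics"
    else:
        return "unknown"  # documents no rule covers go unclassified: poor recall
```

Rules that fire are usually right (high precision), but most documents match no rule at all, which is where the poor recall comes from.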

Page 4

Introduction

② Machine learning-based: a classifier is trained on labeled sample documents

Advantage

1) Much higher recall

Disadvantages

1) Slightly lower precision than the rule-based approach

2) Poor flexibility

Page 5

Introduction

Focuses on the machine learning-based approach, discarding the rule-based one

All raw data must be encoded into numerical vectors

Encoding documents leads to two main problems (illustrated in the sketch below):

1) Huge dimensionality

2) Sparse distribution
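A toy bag-of-words encoding makes both problems concrete; the corpus here is hypothetical, and a realistic vocabulary is orders of magnitude larger:

```python
from collections import Counter

# Three toy documents; a real corpus yields a vocabulary of 10,000+ words.
corpus = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply"]
vocabulary = sorted({w for doc in corpus for w in doc.split()})

def to_vector(doc: str) -> list:
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]  # one dimension per vocabulary word

vec = to_vector("the cat sat")
print(len(vec), sum(1 for v in vec if v == 0))  # dimensionality vs. zero entries
```

Every word in the corpus adds a dimension (huge dimensionality), yet any single document uses only a handful of them, so almost all entries are zero (sparse distribution).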

Page 6

Introduction

Proposes two remedies:

1) String vectors: provide more transparency in classification

2) NTC (Neural Text Categorizer): classifies documents with sufficient robustness and solves the huge-dimensionality problem

Page 7

Related Work

Machine learning algorithms applied to text categorization

1) KNN (K Nearest Neighbor)

2) NB (Naïve Bayes)

3) SVM (Support Vector Machine)

4) BP (Back Propagation)

Page 8

Related Work

KNN was evaluated by Sebastiani (2002) as a simple algorithm competitive with the Support Vector Machine

Disadvantage

1) Classifying objects is very time-consuming (see the sketch below)
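The cost follows from KNN being a lazy learner: there is no training phase, so every classification scans the entire training set. A minimal sketch over sparse word-count dictionaries (illustrative, not the paper's implementation):

```python
import math
from collections import Counter

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse word-count dictionaries."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc: dict, training: list, k: int = 3) -> str:
    # training: list of (word-count dict, label) pairs.
    # This full scan costs O(|training|) similarity computations per
    # test document, which is exactly why KNN classification is slow.
    neighbors = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```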

Page 9

Related Work

Mladenic and Grobelnik (1999) evaluated feature selection methods within the application of NB

Androutsopoulos (2000) used NB to implement a spam mail filtering system, a real application based on text categorization

NB also requires encoding documents into numerical vectors

Page 10

Related Work

SVM has become more popular than KNN and NB

It defines a hyperplane as the boundary between classes

In its basic form it is applicable only to linearly separable distributions of training examples

It optimizes the weights of the inner products between training examples and the input vector, called Lagrange multipliers (see the decision function below)
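In its standard form, that decision function weights the inner products between the training examples $\mathbf{x}_i$ (with labels $y_i \in \{-1, +1\}$) and the input $\mathbf{x}$ by Lagrange multipliers $\alpha_i$:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i\, y_i\, \langle \mathbf{x}_i, \mathbf{x} \rangle + b\right)$$

Only the support vectors receive nonzero $\alpha_i$, so the boundary depends on a small subset of the training examples.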

Page 11

Related Work

Defines two hyperplanes as the boundary between two classes with a maximal margin (Figure 1)

Figure 1.

Page 12

Related Work

Advantage

1) Tolerant to huge dimensionality of numerical vectors

Disadvantages

1) Applicable only to binary classification, so multi-class tasks must be decomposed (see the sketch below)

2) Fragile in representing documents as numerical vectors
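One common workaround for the binary-only limitation, and presumably what the decomposition in the experiment section refers to, is the one-vs-rest scheme: train one binary classifier per category and pick the largest margin. A hedged sketch with a hypothetical scorer interface:

```python
def one_vs_rest_predict(doc, classifiers: dict) -> str:
    """classifiers maps each category name to a binary scorer that returns
    a real-valued margin for 'this category vs. the rest' (hypothetical
    interface; one such binary classifier is trained per category)."""
    return max(classifiers, key=lambda c: classifiers[c](doc))
```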

Page 13

Related Work

Ruiz and Srinivasan (2002) used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP

They observed that the HME combination of BPs performs better

Disadvantages

1) Training costs much time and proceeds slowly

2) Not practical

Page 14

Study Aim

Two problems

1) Huge dimensionality

2) Sparse distribution

Two successful methods

1) String vectors

2) A new neural network

Page 15

Method

Numerical Vectors

Figure 2.

Page 16

Method

$tf_k$ : frequency of the word $w_k$ in the document

$N$ : total number of documents in the corpus

$df_k$ : number of documents in the corpus that include the word $w_k$

Figure 3.
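Using standard TF-IDF notation for these quantities, as above, they combine into the familiar weight of word $w_k$ in a document, which is presumably what Figure 3 defines:

$$\mathrm{weight}(w_k) = tf_k \cdot \log \frac{N}{df_k}$$

The log factor discounts words that appear in most documents, since they carry little categorical information.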

Page 17

Method

Encoding a document into its string vector

Figure 4.
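A plausible reading of this encoding, sketched below: a string vector keeps the $d$ most important words themselves instead of numeric weights (selecting by raw frequency here is an assumption; the paper may rank words differently):

```python
from collections import Counter

def to_string_vector(doc: str, d: int = 5) -> list:
    # Keep the d highest-ranked words themselves as the representation.
    # The vector holds words (strings), not numeric weights, so its
    # dimension stays fixed at d regardless of vocabulary size.
    counts = Counter(doc.lower().split())
    return [word for word, _ in counts.most_common(d)]

print(to_string_vector("the cat sat on the mat the cat sat"))
```

This is how string vectors avoid huge dimensionality: $d$ is small and fixed, no matter how large the corpus vocabulary grows.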

Page 18

Method

Text Categorization Systems

Proposed neural network (NTC)

Consists of three layers:

1) Input layer

2) Output layer

3) Learning layer

Page 19

Method

Input layer: each node corresponds to a word in the string vector

Learning layer: nodes correspond to the predefined categories

Output layer: nodes correspond to the predefined categories and generate the categorical scores

Figure 5.

Page 20

Method

A string vector is denoted by $\mathbf{x} = [t_1, t_2, \ldots, t_d]$, where each $t_i$, $1 \le i \le d$, is a word

The set of predefined categories is denoted by $C = \{c_1, c_2, \ldots, c_{|C|}\}$, $1 \le j \le |C|$

$w_{ji}$ denotes the weight connecting input node $i$ to the learning node of category $c_j$, given by the formula in Figure 6

Figure 6.

Page 21

Method

$O_j$ : output node corresponding to the category $c_j$

It expresses the membership of the given input vector $\mathbf{x}$ in the category $c_j$

Figure 7.
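A hedged sketch of how such a categorical score could be computed from the layers described above (the exact formula is the one in Figure 7; here each output node simply sums the weights its category assigns to the words of the string vector):

```python
def output_score(x: list, weights: dict) -> float:
    # weights: this category's learned weight for each word;
    # words the category has never seen contribute zero.
    return sum(weights.get(word, 0.0) for word in x)

def classify(x: list, category_weights: dict) -> str:
    # category_weights: {category name -> {word -> weight}}.
    # The winning output node determines the classified category.
    return max(category_weights, key=lambda c: output_score(x, category_weights[c]))
```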

Page 22

Method

Each string vector in the training set has its own target label, $c_j$

If its classified category, $c_k$, is identical to the target category, $c_j$, the weights are left unchanged; otherwise they are corrected as in Figure 8

Figure 8.

Page 23

Method

Inhibit weights for its misclassified category

Minimize the classification error

Figure 9.
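A hedged sketch of this error-driven update, reusing `classify` from the previous sketch (the learning rate and the exact update form are assumptions; the paper's rules are the ones in Figures 8 and 9):

```python
LEARNING_RATE = 0.1  # assumed value; the slides do not give one

def train_step(x: list, target: str, category_weights: dict) -> None:
    predicted = classify(x, category_weights)
    if predicted == target:
        return  # correctly classified: weights stay unchanged
    for word in x:
        # Reinforce the target category's weights for the words of x...
        category_weights[target][word] = \
            category_weights[target].get(word, 0.0) + LEARNING_RATE
        # ...and inhibit the weights of the misclassified category.
        category_weights[predicted][word] = \
            category_weights[predicted].get(word, 0.0) - LEARNING_RATE
```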

Page 24

Experiment

Evaluates the five approaches on a test bed called '20NewsGroups'

Each category contains an identical number of test documents

The test bed consists of 20 categories and 20,000 documents

Results are reported with both micro-averaged and macro-averaged measures
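In their standard definitions (the slides do not state the base measure, so per-category F1 over true/false positives is an assumption), micro-averaging pools the counts across categories before computing the score, while macro-averaging averages the per-category scores:

$$P^{\mu} = \frac{\sum_{j} TP_j}{\sum_{j} (TP_j + FP_j)}, \qquad R^{\mu} = \frac{\sum_{j} TP_j}{\sum_{j} (TP_j + FN_j)}, \qquad F_1^{M} = \frac{1}{|C|} \sum_{j=1}^{|C|} F_{1,j}$$

Micro-averaging favors performance on large categories, while macro-averaging weights every category equally; since each category here contains the same number of test documents, the two averages stay close.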

Page 25

Experiment

Back propagation is the best approach

NB is the worst approach when the task is decomposed

Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition

Page 26

Experiment

Each classifier answers each test document by providing one of the 20 categories directly

Two groups emerge:

1) Better group - BP and NTC

2) Worse group – NB and KNN

Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition

Page 27

Conclusion

Uses a full inverted index, instead of a restricted-size similarity matrix, as the basis for the operations on string vectors

Notes a trade-off between the two bases for operations on string vectors

NB and BP are considered for modification into versions adaptable to string vectors, but this may be insufficient for modifying other machine learning algorithms

Future research: modifying other machine learning algorithms to operate on string vectors