
Page 1

Neural Text Categorizer for Exclusive Text Categorization

Journal of Information Processing Systems, Vol.4, No.2, June 2008

Taeho Jo*

Presenter: 林昱志

Page 2

Outline

Introduction

Related Work

Method

Experiment

Conclusion

Page 3

Introduction

Two types of approaches to text categorization:

① Rule-based: classification rules are defined manually in if-then-else form (see the sketch at the end of this slide)

Advantage

1) High precision

Disadvantages

1) Poor recall

2) Poor flexibility
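A minimal sketch of what such hand-written rules might look like (the keywords and category names here are hypothetical, not from the paper):

```python
def classify_by_rules(document: str) -> str:
    """Toy rule-based categorizer built from manual if-then-else rules.

    Keywords and categories are illustrative only; a real system would
    encode many carefully engineered rules per category.
    """
    text = document.lower()
    if "goal" in text or "tournament" in text:
        return "sports"
    elif "election" in text or "parliament" in text:
        return "politics"
    else:
        return "unknown"  # documents no rule covers go unclassified: poor recall
```

Rules that fire are usually right (high precision), but most documents match no rule at all, which is where the poor recall comes from.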

Page 4

Introduction

② Machine learning-based: a classifier is trained on labeled sample documents

Advantage

1) Much higher recall

Disadvantages

1) Slightly lower precision than the rule-based approach

2) Poor flexibility

Page 5

Introduction

Focuses on the machine learning-based approach, discarding the rule-based one

All raw data must be encoded into numerical vectors

Encoding documents leads to two main problems (illustrated in the sketch below):

1) Huge dimensionality

2) Sparse distribution
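A toy bag-of-words encoding makes both problems concrete; the corpus here is hypothetical, and a realistic vocabulary is orders of magnitude larger:

```python
from collections import Counter

# Three toy documents; a real corpus yields a vocabulary of 10,000+ words.
corpus = ["the cat sat on the mat", "dogs chase cats", "stocks fell sharply"]
vocabulary = sorted({w for doc in corpus for w in doc.split()})

def to_vector(doc: str) -> list:
    counts = Counter(doc.split())
    return [counts[w] for w in vocabulary]  # one dimension per vocabulary word

vec = to_vector("the cat sat")
print(len(vec), sum(1 for v in vec if v == 0))  # dimensionality vs. zero entries
```

Every word in the corpus adds a dimension (huge dimensionality), yet any single document uses only a handful of them, so almost all entries are zero (sparse distribution).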

Page 6

Introduction

Proposes two remedies:

1) String vectors: provide more transparency in classification

2) NTC (Neural Text Categorizer): classifies documents with sufficient robustness and solves the huge-dimensionality problem

Page 7

Related Work

Machine learning algorithms applied to text categorization

1) KNN (K Nearest Neighbor)

2) NB (Naïve Bayes)

3) SVM (Support Vector Machine)

4) BP (Back Propagation)

Page 8

Related Work

KNN was evaluated by Sebastiani (2002) as a simple algorithm competitive with the Support Vector Machine

Disadvantage

1) Classifying objects is very time-consuming (see the sketch below)
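The cost follows from KNN being a lazy learner: there is no training phase, so every classification scans the entire training set. A minimal sketch over sparse word-count dictionaries (illustrative, not the paper's implementation):

```python
import math
from collections import Counter

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse word-count dictionaries."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc: dict, training: list, k: int = 3) -> str:
    # training: list of (word-count dict, label) pairs.
    # This full scan costs O(|training|) similarity computations per
    # test document, which is exactly why KNN classification is slow.
    neighbors = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```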

Page 9

Related Work

Mladenic and Grobelnik (1999) evaluated feature selection methods within the application of NB

Androutsopoulos (2000) used NB to implement a spam mail filtering system, a real application based on text categorization

NB also requires encoding documents into numerical vectors

Page 10

Related Work

SVM has become more popular than KNN and NB

It defines a hyperplane as the boundary between classes

In its basic form it is applicable only to linearly separable distributions of training examples

It optimizes the weights of the inner products between training examples and the input vector, called Lagrange multipliers (see the decision function below)
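In its standard form, that decision function weights the inner products between the training examples $\mathbf{x}_i$ (with labels $y_i \in \{-1, +1\}$) and the input $\mathbf{x}$ by Lagrange multipliers $\alpha_i$:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i\, y_i\, \langle \mathbf{x}_i, \mathbf{x} \rangle + b\right)$$

Only the support vectors receive nonzero $\alpha_i$, so the boundary depends on a small subset of the training examples.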

Page 11

Related Work

Defines two hyperplanes as the boundary between two classes with a maximal margin (Figure 1)

Figure 1.

Page 12

Related Work

Advantage

1) Tolerant to huge dimensionality of numerical vectors

Disadvantages

1) Applicable only to binary classification, so multi-class tasks must be decomposed (see the sketch below)

2) Fragile in representing documents as numerical vectors
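One common workaround for the binary-only limitation, and presumably what the decomposition in the experiment section refers to, is the one-vs-rest scheme: train one binary classifier per category and pick the largest margin. A hedged sketch with a hypothetical scorer interface:

```python
def one_vs_rest_predict(doc, classifiers: dict) -> str:
    """classifiers maps each category name to a binary scorer that returns
    a real-valued margin for 'this category vs. the rest' (hypothetical
    interface; one such binary classifier is trained per category)."""
    return max(classifiers, key=lambda c: classifiers[c](doc))
```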

Page 13

Related Work

Ruiz and Srinivasan (2002) used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP

They observed that the HME combination of BPs performs better

Disadvantages

1) Training costs much time and proceeds slowly

2) Not practical

Page 14

Study Aim

Two problems

1) Huge dimensionality

2) Sparse distribution

Two successful methods

1) String vectors

2) A new neural network

Page 15

Method

Numerical Vectors

Figure 2.

Page 16

Method

$tf_k$ : frequency of the word $w_k$ in the document

$N$ : total number of documents in the corpus

$df_k$ : number of documents in the corpus that include the word $w_k$

Figure 3.
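Using standard TF-IDF notation for these quantities, as above, they combine into the familiar weight of word $w_k$ in a document, which is presumably what Figure 3 defines:

$$\mathrm{weight}(w_k) = tf_k \cdot \log \frac{N}{df_k}$$

The log factor discounts words that appear in most documents, since they carry little categorical information.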

Page 17

Method

Encoding a document into its string vector

Figure 4.
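A plausible reading of this encoding, sketched below: a string vector keeps the $d$ most important words themselves instead of numeric weights (selecting by raw frequency here is an assumption; the paper may rank words differently):

```python
from collections import Counter

def to_string_vector(doc: str, d: int = 5) -> list:
    # Keep the d highest-ranked words themselves as the representation.
    # The vector holds words (strings), not numeric weights, so its
    # dimension stays fixed at d regardless of vocabulary size.
    counts = Counter(doc.lower().split())
    return [word for word, _ in counts.most_common(d)]

print(to_string_vector("the cat sat on the mat the cat sat"))
```

This is how string vectors avoid huge dimensionality: $d$ is small and fixed, no matter how large the corpus vocabulary grows.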

Page 18

Method

Text Categorization Systems

Proposed neural network (NTC)

Consists of three layers:

1) Input layer

2) Output layer

3) Learning layer

Page 19

Method

Input layer: each node corresponds to a word in the string vector

Learning layer: nodes correspond to the predefined categories

Output layer: nodes correspond to the predefined categories and generate the categorical scores

Figure 5.

Page 20

Method

A string vector is denoted by $\mathbf{x} = [t_1, t_2, \ldots, t_d]$, where each $t_i$, $1 \le i \le d$, is a word

The set of predefined categories is denoted by $C = \{c_1, c_2, \ldots, c_{|C|}\}$, $1 \le j \le |C|$

$w_{ji}$ denotes the weight connecting input node $i$ to the learning node of category $c_j$, given by the formula in Figure 6

Figure 6.

Page 21

Method

$O_j$ : output node corresponding to the category $c_j$

It expresses the membership of the given input vector $\mathbf{x}$ in the category $c_j$

Figure 7.
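A hedged sketch of how such a categorical score could be computed from the layers described above (the exact formula is the one in Figure 7; here each output node simply sums the weights its category assigns to the words of the string vector):

```python
def output_score(x: list, weights: dict) -> float:
    # weights: this category's learned weight for each word;
    # words the category has never seen contribute zero.
    return sum(weights.get(word, 0.0) for word in x)

def classify(x: list, category_weights: dict) -> str:
    # category_weights: {category name -> {word -> weight}}.
    # The winning output node determines the classified category.
    return max(category_weights, key=lambda c: output_score(x, category_weights[c]))
```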

Page 22

Method

Each string vector in the training set has its own target label, $c_j$

If its classified category, $c_k$, is identical to the target category, $c_j$, the weights are left unchanged; otherwise they are corrected as in Figure 8

Figure 8.

Page 23

Method

Inhibit weights for its misclassified category

Minimize the classification error

Figure 9.
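A hedged sketch of this error-driven update, reusing `classify` from the previous sketch (the learning rate and the exact update form are assumptions; the paper's rules are the ones in Figures 8 and 9):

```python
LEARNING_RATE = 0.1  # assumed value; the slides do not give one

def train_step(x: list, target: str, category_weights: dict) -> None:
    predicted = classify(x, category_weights)
    if predicted == target:
        return  # correctly classified: weights stay unchanged
    for word in x:
        # Reinforce the target category's weights for the words of x...
        category_weights[target][word] = \
            category_weights[target].get(word, 0.0) + LEARNING_RATE
        # ...and inhibit the weights of the misclassified category.
        category_weights[predicted][word] = \
            category_weights[predicted].get(word, 0.0) - LEARNING_RATE
```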

Page 24

Experiment

Evaluates the five approaches on a test bed called '20NewsGroups'

Each category contains an identical number of test documents

The test bed consists of 20 categories and 20,000 documents

Results are reported with both micro-averaged and macro-averaged measures
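In their standard definitions (the slides do not state the base measure, so per-category F1 over true/false positives is an assumption), micro-averaging pools the counts across categories before computing the score, while macro-averaging averages the per-category scores:

$$P^{\mu} = \frac{\sum_{j} TP_j}{\sum_{j} (TP_j + FP_j)}, \qquad R^{\mu} = \frac{\sum_{j} TP_j}{\sum_{j} (TP_j + FN_j)}, \qquad F_1^{M} = \frac{1}{|C|} \sum_{j=1}^{|C|} F_{1,j}$$

Micro-averaging favors performance on large categories, while macro-averaging weights every category equally; since each category here contains the same number of test documents, the two averages stay close.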

Page 25

Experiment

Back propagation is the best approach

NB is the worst approach when the task is decomposed

Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition

Page 26

Experiment

Each classifier answers each test document by providing one of the 20 categories directly

Two groups emerge:

1) Better group - BP and NTC

2) Worse group – NB and KNN

Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition

Page 27

Conclusion

Uses a full inverted index, instead of a restricted-size similarity matrix, as the basis for the operations on string vectors

Notes a trade-off between the two bases for operations on string vectors

NB and BP are considered for modification into versions adaptable to string vectors, but this may be insufficient for modifying other machine learning algorithms

Future research: modifying other machine learning algorithms to operate on string vectors