TEXT CLASSIFICATION -- SVM-based Approach
Jianping Fan, Dept of Computer Science, UNC-Charlotte


Page 1:

TEXT CLASSIFICATION -- SVM-based Approach

Jianping Fan, Dept of Computer Science, UNC-Charlotte

Page 2:

Text CATEGORIZATION / CLASSIFICATION

Given:

◦ A description of an instance, x ∈ X, where X is the instance language or instance space, e.g., how to represent text documents.

◦ A fixed set of categories C = {c1, c2,…, cn}

Determine:

◦ The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

Page 3:

Text Classification

Pre-given categories and labeled document examples (categories may form a hierarchy)

Classify new documents: a standard classification (supervised learning) problem

[Diagram: documents are fed into a Categorization System and routed to categories such as Sports, Business, Education, Science, ...]

Page 4:

A GRAPHICAL VIEW OF TEXT CLASSIFICATION

[Figure: documents plotted in a feature space, grouped into regions labeled NLP, Graphics, AI, Theory, and Arch.]

Page 5:

Text Classification Applications:

◦ Web pages: recommending; Yahoo-like classification

◦ Newsgroup messages: recommending; spam filtering

◦ News articles: personalized newspaper

◦ Email messages: routing; prioritizing; folderizing; spam filtering

Page 6:

Text Classification Applications

Web pages organized into category hierarchies

Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)

Responses to Census Bureau occupations

Patents archived using the International Patent Classification

Patient records coded using international insurance categories

E-mail message filtering

News events tracked and filtered by topic

Page 7:

Cost of Manual Text Categorization

◦ Yahoo!

200 (?) people for manual labeling of Web pages using a hierarchy of 500,000 categories

◦ MEDLINE (National Library of Medicine): $2 million/year for manual indexing of journal articles using Medical Subject Headings (18,000 categories)

◦ Mayo Clinic: $1.4 million annually for coding patient-record events using the International Classification of Diseases (ICD) for billing insurance companies

◦ US Census Bureau decennial census (1990: 22 million responses); 232 industry categories and 504 occupation categories; $15 million if fully done by hand

Page 8:

What is so special about text?

No obvious relation between features

High dimensionality (the vocabulary V is often larger than the number of training examples!)

Importance of speed

Page 9:

What do we need?

Term extraction tools

Document representation

Dimensionality reduction

Classifier learning methods

Topic models & semantic representation

...

Page 10:

Latent Semantic Analysis

Document/term count matrix: rows are terms (LOVE, SOUL, RESEARCH, SCIENCE), columns are documents (Doc1, Doc2, Doc3, ...), and each entry is the count of a term in a document

SVD projects the matrix into a lower-dimensional semantic space (still high dimensional, but not as high as |V|)

EACH WORD IS A SINGLE POINT IN A SEMANTIC SPACE
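A minimal sketch of this pipeline with scikit-learn (an assumed tool choice; the toy corpus below is hypothetical):

```python
# Minimal LSA sketch: build a document/term count matrix and reduce it with SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["love and soul", "research in science", "science research love"]   # toy corpus
counts = CountVectorizer().fit_transform(docs)        # documents x terms count matrix

# Truncated SVD projects terms/documents into a k-dimensional semantic space (k << |V|)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(counts)               # each document as a point in the space
term_vectors = svd.components_.T                      # each word as a single point in the space

print(doc_vectors.shape, term_vectors.shape)
```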

Page 11:

EXAMPLES OF TEXT CLASSIFICATION

LABELS = BINARY
◦ "spam" / "not spam"

LABELS = TOPICS
◦ "finance" / "sports" / "asia"

LABELS = OPINION
◦ "like" / "hate" / "neutral"

LABELS = AUTHOR
◦ "Shakespeare" / "Marlowe" / "Ben Jonson"
◦ The Federalist Papers

Page 12:

Text Classification: Problem Definition

Need to assign a boolean value {0,1} to each entry of the decision matrix

C = {c1, ..., cm} is a set of pre-defined categories; D = {d1, ..., dn} is a set of documents to be categorized

aij = 1: dj belongs to ci; aij = 0: dj does not belong to ci

A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa (Italy)
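A toy illustration of this decision matrix (the categories, documents, and labels below are hypothetical):

```python
# Toy illustration of the boolean decision matrix a_ij.
import numpy as np

categories = ["sports", "business", "science"]     # C = {c1, c2, c3}
documents = ["d1", "d2", "d3", "d4"]               # D = {d1, ..., d4}

# a[i, j] = 1 if document d_j belongs to category c_i, else 0
a = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 1]])

print(a[2, 3])   # 1: d4 belongs to c3 ("science"); note Python indices start at 0
```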

Page 13:

Methods (1)

Manual classification
◦ Used by Yahoo!, Looksmart, about.com, ODP, MEDLINE
◦ Very accurate when the job is done by experts
◦ Consistent when the problem size and team are small
◦ Difficult and expensive to scale

Automatic document classification
◦ Hand-coded rule-based systems: Reuters, CIA, Verity, ...
◦ Commercial systems have complex query languages (everything in IR query languages + accumulators)

Page 14:

Methods (2)

Supervised learning of document-label assignment function: Autonomy, Kana, MSN, Verity, …

Naive Bayes (simple, common method)

k-Nearest Neighbors (simple, powerful)

Support vector machines (new, more powerful)

... plus many other methods

No free lunch: requires hand-classified training data, but that data can be built (and refined) by amateurs

Page 15:

Support Vector Machine

SVM: A Large-Margin Classifier
◦ Linear SVM
◦ Kernel Trick
◦ Fast implementation: SMO

SVM for Text Classification
◦ Multi-class Classification
◦ Multi-label Classification
◦ Hierarchical Classification
◦ Tools

Page 16:

What is a Good Decision Boundary?

Consider a two-class, linearly separable classification problem

Many decision boundaries!
◦ The Perceptron algorithm can be used to find such a boundary

Are all decision boundaries equally good?

[Figure: Class 1 and Class 2 point clouds with several candidate decision boundaries.]

Page 17:

Examples of Bad Decision Boundaries

[Figure: two plots of Class 1 and Class 2 with poorly placed decision boundaries.]

Page 18:

Large-margin Decision Boundary

The decision boundary should be as far away from the data of both classes as possible
◦ We should maximize the margin, m

[Figure: Class 1 and Class 2 separated by a decision boundary with margin m.]

Page 19:

Finding the Decision Boundary

Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi

The decision boundary should classify all points correctly

The decision boundary can be found by solving the following constrained optimization problem

The Lagrangian of this optimization problem is
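The slide's equations are not reproduced above; the standard hard-margin formulation it refers to, and its Lagrangian with multipliers αi ≥ 0, are:

```latex
% Hard-margin linear SVM: maximize the margin subject to correct classification
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n

% Lagrangian with multipliers \alpha_i \ge 0
L(\mathbf{w}, b, \boldsymbol{\alpha}) \;=\; \tfrac{1}{2}\|\mathbf{w}\|^2
\;-\; \sum_{i=1}^{n} \alpha_i \bigl( y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \bigr)
```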

Page 20:

The Dual Problem

By setting the derivatives of the Lagrangian to zero, the optimization problem can be rewritten in terms of the αi (the dual problem)

This is a quadratic programming (QP) problem
◦ A global maximum over the αi can always be found

w can be recovered from the αi, as shown below

If the number of training examples is large, SVM training will be very slow because the number of dual parameters αi is very large.
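In the standard form the slide refers to, the dual and the recovery of w are:

```latex
% Dual problem: a QP in the multipliers \alpha_i only
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j
\quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0

% Recovering the primal weight vector
\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i
```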

Page 21:

KKT Condition

The QP problem is solved when the following KKT (Karush-Kuhn-Tucker) conditions hold for all i:
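The conditions themselves (the standard KKT conditions for the hard-margin problem) are:

```latex
\alpha_i \ge 0, \qquad
y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \ge 0, \qquad
\alpha_i \bigl( y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \bigr) = 0
```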

Page 22:

Characteristics of the Solution

The KKT conditions indicate that many of the αi are zero
◦ w is a linear combination of a small number of data points

The xi with non-zero αi are called support vectors (SVs)
◦ The decision boundary is determined only by the SVs
◦ Let tj (j = 1, ..., s) be the indices of the s support vectors; we can write w in terms of them

For testing with a new data point z
◦ Compute the decision value below and classify z as class 1 if it is positive, and class 2 otherwise
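In standard notation, w in terms of the support vectors and the test-time decision value are:

```latex
\mathbf{w} = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} \mathbf{x}_{t_j}
\qquad\qquad
f(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} + b
             = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, \mathbf{x}_{t_j}^\top \mathbf{z} + b
```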

Page 23:

A Geometrical Interpretation

[Figure: Class 1 and Class 2 with the separating hyperplane; most points have αi = 0, while the support vectors on the margin carry non-zero multipliers, e.g., α1 = 0.8, α6 = 1.4, α8 = 0.6.]

Page 24:

Non-linearly Separable Problems

We allow "errors" ξi in classification

[Figure: Class 1 and Class 2 points that are not linearly separable; misclassified points incur slack ξi.]

Page 25:

Soft Margin Hyperplane

By minimizing Σi ξi, the ξi can be obtained from the constraints below

The ξi are "slack variables" in the optimization; ξi = 0 if there is no error for xi, and Σi ξi is an upper bound on the number of errors

We want to minimize the objective below, where C is a tradeoff parameter between error and margin; the optimization problem becomes:
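The soft-margin problem in its standard form is:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```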

Page 26:

The Optimization Problem

The dual of this problem is given below; w is recovered as before

This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on each αi

Once again, a QP solver can be used to find the αi
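The standard soft-margin dual differs from the hard-margin dual only in the box constraint on αi:

```latex
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0,
\qquad \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i
```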

Page 27:

Extension to Non-linear Decision Boundary

So far we have only considered large-margin classifiers with a linear decision boundary; how do we generalize to a non-linear boundary?

Key idea: transform xi to a higher-dimensional space to "make life easier"
◦ Input space: the space where the points xi are located
◦ Feature space: the space of φ(xi) after transformation

Why transform?
◦ A linear operation in the feature space is equivalent to a non-linear operation in the input space
◦ Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable

Page 28:

Transforming the Data

Computation in the feature space can be costly because it is high dimensional
◦ The feature space is typically infinite-dimensional!

The kernel trick comes to the rescue

[Figure: the mapping φ(.) sends points from the input space to the feature space.]

Page 29:

The Kernel Trick

Recall the SVM optimization problem

The data points only appear as inner products

As long as we can calculate the inner product in the feature space, we do not need the mapping φ explicitly

Many common geometric operations (angles, distances) can be expressed by inner products

Define the kernel function K by
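In standard notation, the kernel is defined as the feature-space inner product:

```latex
K(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x})^\top \phi(\mathbf{y})
```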

Page 30:

An Example for φ(.) and K(.,.)

Suppose φ(.) is given as follows

An inner product in the feature space is

So, if we define the kernel function as follows, there is no need to carry out φ(.) explicitly

This use of the kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
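The slide's specific φ(.) is not reproduced above; a standard example with the same structure is the degree-2 polynomial map for x = (x1, x2):

```latex
\phi(\mathbf{x}) = \bigl(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\bigr)

% The feature-space inner product collapses to a simple function of x^T y,
% so K can be evaluated without ever computing phi:
\phi(\mathbf{x})^\top \phi(\mathbf{y}) = \bigl(1 + \mathbf{x}^\top \mathbf{y}\bigr)^2 = K(\mathbf{x}, \mathbf{y})
```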

Page 31:

Kernel Functions

In practical use of SVM, only the kernel function (and not φ(.)) is specified

The kernel function can be thought of as a similarity measure between the input objects

Not every similarity measure can be used as a kernel function, however; Mercer's condition states that any positive semi-definite kernel K(x, y) can be expressed as a dot product in a high-dimensional space

Page 32:

Examples of Kernel Functions

Polynomial kernel with degree d

Radial basis function (RBF) kernel with width σ
◦ Closely related to radial basis function neural networks

Sigmoid kernel with parameters κ and θ
◦ It does not satisfy the Mercer condition for all κ and θ
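The standard forms of these kernels (the parameter names σ, κ, θ are the usual ones, assumed here since the slide's formulas are not reproduced) are:

```latex
K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y} + 1)^d
\qquad % polynomial kernel with degree d

K(\mathbf{x}, \mathbf{y}) = \exp\!\Bigl(-\tfrac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\Bigr)
\qquad % RBF kernel with width \sigma

K(\mathbf{x}, \mathbf{y}) = \tanh\bigl(\kappa\, \mathbf{x}^\top \mathbf{y} + \theta\bigr)
\qquad % sigmoid kernel with parameters \kappa, \theta
```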

Page 33:

Modification Due to Kernel Function

Change all inner products to kernel functions

For training (original vs. with kernel function):
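With kernels substituted for the inner products, the training (dual) problem reads:

```latex
\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
```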

Page 34:

Modification Due to Kernel Function

For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0

(original vs. with kernel function):
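Correspondingly, the test-time decision value becomes:

```latex
f(\mathbf{z}) = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j}\, K(\mathbf{x}_{t_j}, \mathbf{z}) + b
```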

Page 35:

Why SVM Works

The feature space is often very high dimensional. Why don't we suffer from the curse of dimensionality?
◦ A classifier in a high-dimensional space has many parameters and is hard to estimate

Vapnik argues that the fundamental problem is not the number of parameters to be estimated; rather, the problem is the flexibility of the classifier

Typically, a classifier with many parameters is very flexible, but there are also exceptions
◦ Let xi = 10^i, where i ranges from 1 to n. The classifier shown can classify all xi correctly for every possible combination of class labels on the xi
◦ This 1-parameter classifier is very flexible

Page 36:

Why SVM Works

Vapnik argues that the flexibility of a classifier should not be characterized by its number of parameters, but by its capacity
◦ This is formalized by the "VC-dimension" of a classifier

The addition of ½||w||² has the effect of restricting the VC-dimension of the classifier in the feature space

The SVM objective can also be justified by structural risk minimization: the empirical risk (training error), plus a term related to the generalization ability of the classifier, is minimized

Another view: the SVM loss function is analogous to ridge regression. The term ½||w||² "shrinks" the parameters towards zero to avoid overfitting

Page 37:

Choosing the Kernel Function

Probably the trickiest part of using SVM. The kernel function is important because it creates the kernel matrix, which summarizes all the data

Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...)

There is even research on estimating the kernel matrix from available information

In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial choice for most applications

For text classification, the linear kernel is often said to be the best choice, because the feature dimension is already high enough

Page 38:

Strengths and Weaknesses of SVM

Strengths
◦ Training is relatively easy: no local optima, unlike in neural networks
◦ It scales relatively well to high-dimensional data
◦ The tradeoff between classifier complexity and error can be controlled explicitly
◦ Non-traditional data such as strings and trees can be used as input to SVM, instead of feature vectors
◦ Performing logistic regression (sigmoid fitting) on the SVM outputs for a set of data can map SVM outputs to probabilities

Weaknesses
◦ Need to choose a "good" kernel function

Page 39:

Summary: Steps for Classification

Prepare the pattern matrix

Select the kernel function to use

Select the parameters of the kernel function and the value of C
◦ You can use the values suggested by the SVM software, or you can set apart a validation set to determine the parameter values

Execute the training algorithm to obtain the αi

Unseen data can be classified using the αi and the support vectors
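A minimal sketch of these steps using scikit-learn (the library choice and the toy corpus are assumptions, not from the slides):

```python
# Minimal end-to-end sketch: pattern matrix -> kernel/C choice -> training -> prediction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = ["stocks rallied on earnings news",            # hypothetical toy corpus
        "the team won the championship game",
        "quarterly profits beat expectations",
        "the striker scored twice in the final"]
labels = ["business", "sports", "business", "sports"]

# Step 1: prepare the pattern matrix (TF-IDF document vectors)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Steps 2-3: select the kernel, its parameters, and the value of C
clf = SVC(kernel="linear", C=1.0)

# Step 4: execute the training algorithm (alphas and support vectors are found internally)
clf.fit(X, labels)

# Step 5: classify unseen data using the learned support vectors
X_new = vectorizer.transform(["goals and assists in the match"])
print(clf.predict(X_new))   # expected: ['sports']
```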

Page 40:

Fast SVM Implementations

SMO: Sequential Minimal Optimization

SVM-Light

LibSVM

BSVM

......

Page 41:

SMO: Sequential Minimal Optimization

Key idea
◦ Divide the large QP problem of SVM into a series of smallest-possible QP problems, which can be solved analytically, thus avoiding a time-consuming numerical QP solver in the loop (a kind of SQP method)
◦ Space complexity: O(n)
◦ Since the QP is greatly simplified, the most time-consuming part of SMO is evaluating the decision function; it is therefore very fast for linear SVMs and sparse data

Page 42:

SMO

At each step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values.

Three components
◦ An analytic method to solve for the two Lagrange multipliers
◦ A heuristic for choosing which multipliers to optimize
◦ A method for computing b at each step, so that the KKT conditions are fulfilled for both of the two examples

Page 43:

Choosing Which Multipliers to Optimize

First multiplier
◦ Iterate over the entire training set and find an example that violates the KKT conditions.

Second multiplier
◦ Maximize the size of the step taken during joint optimization.
◦ The step size is approximated by |E1 - E2|, where Ei is the error on the i-th example.

Page 44:

Text Categorization

Typical features
◦ Term frequency
◦ Inverse document frequency

TC is a typical multi-class, multi-label classification problem.
◦ SVM, with some additional heuristics, has been regarded as one of the best classification schemes for text data, based on many benchmark evaluations.

TC is a high-dimensional, sparse problem
◦ SMO is a very good choice in this case.
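One common form of the TF-IDF weight combining these two features (the slide does not pin down the exact variant) is:

```latex
w_{t,d} = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents.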

Page 45:

Multi-Class SVM Classification

1-vs-rest

1-vs-1
◦ MaxWin
◦ DB2
◦ Error Correcting Output Coding

K-class SVM

Page 46:

1-vs-rest

For any class C, train a binary classifier to distinguish C from its complement.

For an unseen sample, use the binary classifier with the highest confidence score for the final decision.

Page 47:

1-vs-1

Train C(N,2) = N(N-1)/2 classifiers, each distinguishing one class from another.

◦ Pairwise:
  MaxWin (C(N,2) tests)
  Error-correcting output coding

◦ DAG: Pachinko machine (N tests)
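A minimal sketch contrasting the two strategies, using scikit-learn and synthetic data (both are assumptions, not from the slides):

```python
# 1-vs-rest vs. 1-vs-1 multi-class SVMs with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=4, random_state=0)

# 1-vs-rest: N binary classifiers, each separating one class from the rest
ovr = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)

# 1-vs-1: C(N,2) = N(N-1)/2 pairwise classifiers, combined by voting (MaxWin)
ovo = OneVsOneClassifier(LinearSVC(C=1.0)).fit(X, y)

print(len(ovr.estimators_))   # 4 classifiers
print(len(ovo.estimators_))   # 6 classifiers
```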

Page 48:

Error Correcting Output Coding

Code matrix M (N x K): N classes, K classifiers

Hamming distance: the class Ci with minimum error (distance) wins

Pairwise (1-vs-1) code matrix:

M   1:2  1:3  1:4  2:3  2:4  3:4
1    1    1    1    0    0    0
2   -1    0    0    1    1    0
3    0   -1    0   -1    0    1
4    0    0   -1    0   -1   -1

Code matrix with classifiers trained on class pairs (e.g., column 1,2 separates classes {1,2} from {3,4}):

M   1,2  1,3  1,4  2,3  2,4  3,4
1    1    1    1   -1   -1   -1
2    1   -1   -1    1    1   -1
3   -1    1   -1    1   -1    1
4   -1   -1    1   -1    1    1
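A small sketch of Hamming-distance decoding against the dense code matrix above (the classifier outputs are made up for illustration):

```python
# ECOC decoding by Hamming distance.
import numpy as np

# Dense code matrix from the slide: rows = classes 1..4, columns = 6 binary classifiers
M = np.array([[ 1,  1,  1, -1, -1, -1],
              [ 1, -1, -1,  1,  1, -1],
              [-1,  1, -1,  1, -1,  1],
              [-1, -1,  1, -1,  1,  1]])

# Hypothetical signs of the 6 classifier outputs for one test document
outputs = np.array([1, -1, -1, 1, 1, -1])

# Hamming distance between the output word and each class codeword; minimum distance wins
distances = np.sum(M != outputs, axis=1)
print("predicted class:", np.argmin(distances) + 1)   # classes are numbered from 1
```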

Page 49:

Intransitivity of DAG

For C1, C2, C3: if C3 ≺ C2 and C2 ≺ C1 imply C3 ≺ C1, we say the relation ≺ is transitive. The pairwise relation used by a DAG need not be transitive.

[Figure: three decision DAGs over classes C1, C2, C3, each applying the pairwise tests 1-vs-2, 1-vs-3, and 2-vs-3 in a different order.]

Page 50:

Divided-by-2 (DB2)

Hierarchically divide the data into two subsets until every subset consists of only one class.

Page 51:

Divided-by-2 (DB2)

Data partitioning criterion:
◦ Group the classes such that the resulting subsets have the largest margin.

Trade-off: use clustering methods
◦ k-means: use the mean of each class
◦ Balanced subsets: minimal difference in sample number

Page 52:

K-class SVM

Change the loss function and constraints to handle all K classes in a single optimization problem

Page 53:

Multi-label SVM Classification

Where do multi-label problems come from?
◦ Whole-vs-part
◦ Shared concepts

Page 54:

Whole-vs-part

Common for parent-child relationships
◦ Add an "Other" category, and do binary classification to distinguish the child from the "Other" category.
◦ Since the classification boundary is non-linear, kernel methods may be more effective.

Page 55:

Shared concepts: Training

Mode-S
◦ Label multi-label data with the single class to which the data most likely belongs, by some (perhaps subjective) criterion.

Mode-N
◦ Treat the multi-label data as a new class.

Mode-X
◦ Use the multi-label data more than once, using each example as a positive example of each of the classes to which it belongs.

Page 56:

Shared concepts: Test

P-cut
◦ Label each test document with all of the classes whose SVM scores are positive. If no scores are positive, assign the class with the top score.

S-cut
◦ Learn a threshold for each class by cross-validation, and label each test document with all of the classes whose scores exceed their thresholds.

R-cut
◦ For any given test instance, always assign it the r labels with the highest (descending) confidence scores.
◦ r can be learned from the training data.
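A small sketch of the three thresholding rules (the scores and thresholds below are made up for illustration):

```python
# P-cut / S-cut / R-cut label-assignment rules on per-class SVM scores.
import numpy as np

classes = np.array(["sports", "business", "science", "politics"])
scores = np.array([1.3, -0.2, 0.4, -1.1])       # hypothetical per-class SVM scores
thresholds = np.array([0.5, 0.0, 0.8, 0.3])     # hypothetical per-class thresholds (S-cut)

# P-cut: all classes with positive scores; fall back to the top-scoring class
p_cut = classes[scores > 0] if (scores > 0).any() else classes[[np.argmax(scores)]]

# S-cut: all classes whose scores exceed their learned thresholds
s_cut = classes[scores > thresholds]

# R-cut: always take the r highest-scoring classes
r = 2
r_cut = classes[np.argsort(scores)[::-1][:r]]

print(p_cut, s_cut, r_cut)
```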

Page 57:

Evaluation Criteria

Micro-F1:
◦ Measures overall classification accuracy (more consistent with the practical application scenario)

Macro-F1:
◦ Measures classification accuracy at the category level; reflects the classifier's ability to deal with rare categories
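The standard definitions behind these two measures (not spelled out on the slide), with per-class true positives TPc, false positives FPc, and false negatives FNc over C categories, are:

```latex
% Micro-F1: pool the counts over all categories, then compute F1
P_{\text{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad
R_{\text{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}, \qquad
F1_{\text{micro}} = \frac{2\, P_{\text{micro}} R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}

% Macro-F1: compute F1 per category, then average (rare categories weigh equally)
F1_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} F1_c
```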

Page 58:

Hierarchical Document Classification