CS 6501 Natural Language Processing
Text Classification (I): Logistic Regression

Yangfeng Ji
Department of Computer Science
University of Virginia



Page 2: Overview

1. Problem Definition
2. Bag-of-Words Representation
3. Case Study: Sentiment Analysis
4. Logistic Regression
5. ℓ2 Regularization
6. Demo Code

Page 3: Problem Definition

Page 4: Case I: Sentiment Analysis

[Pang et al., 2002]

Page 5: Case II: Topic Classification

Example topics:
- Business
- Arts
- Technology
- Sports
- ...

Page 6: Classification

- Input: a text x
  - Example: a product review on Amazon
- Output: y ∈ Y, where Y is the predefined category set (sample space)
  - Example: Y = {Positive, Negative}

The pipeline of text classification:¹

Text → Numeric Vector x → Classifier → Category y

¹ In this course, we use x for both a text and its numeric representation, with no distinction.


Page 8: Probabilistic Formulation

With the conditional probability P(Y | X), the prediction of Y for a given text X = x is

ŷ = argmax_{y ∈ Y} P(Y = y | X = x)    (1)

or, for simplicity,

ŷ = argmax_{y ∈ Y} P(y | x)    (2)


Page 10: Key Questions

Recall:
- The formulation defined on the previous slide:
  ŷ = argmax_{y ∈ Y} P(Y = y | X = x)    (3)
- The pipeline of text classification:
  Text → Numeric Vector x → Classifier → Category y

Building a text classifier is about answering the following two questions:
1. How to represent a text as x?
   - Bag-of-words representation
2. How to estimate P(y | x)?
   - Logistic regression models
   - Neural network classifiers (Lecture 3)


Page 15: Bag-of-Words Representation

Page 16: Bag-of-Words Representation

Example texts
- Text 1: I love coffee.
- Text 2: I don't like tea.

Step I: convert a text into a collection of tokens (e.g., tokenization)

Tokenized texts
- Tokenized text 1: I love coffee
- Tokenized text 2: I don t like tea

Step II: build a dictionary/vocabulary

Vocabulary: {I, love, coffee, don, t, like, tea}


Page 19: Bag-of-Words Representations

Step III: based on the vocab, convert each text into a numeric representation:

         I  love  coffee  don  t  like  tea
x(1) = [ 1   1     1       0   0   0    0 ]^T
x(2) = [ 1   0     0       1   1   1    1 ]^T

The pipeline of text classification:

Text →[Bag-of-Words Representation]→ Numeric Vector x → Classifier → Category y
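
To make Steps I-III concrete, here is a minimal Python sketch of the same toy example (an illustration, not the course's demo code); the crude punctuation handling and the variable names are assumptions.

```python
# Minimal bag-of-words sketch for the toy example above (not the official demo code).
texts = ["I love coffee.", "I don't like tea."]

# Step I: tokenization (a crude whitespace/punctuation split, just for illustration)
tokenized = [t.replace(".", "").replace("'", " ").split() for t in texts]
# -> [['I', 'love', 'coffee'], ['I', 'don', 't', 'like', 'tea']]

# Step II: build a dictionary/vocabulary, preserving first-occurrence order
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
# -> ['I', 'love', 'coffee', 'don', 't', 'like', 'tea']

# Step III: convert each text into a binary vector over the vocabulary
def bag_of_words(tokens, vocab):
    return [1 if word in tokens else 0 for word in vocab]

x1 = bag_of_words(tokenized[0], vocab)  # [1, 1, 1, 0, 0, 0, 0]
x2 = bag_of_words(tokenized[1], vocab)  # [1, 0, 0, 1, 1, 1, 1]
print(vocab, x1, x2)
```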


Page 21: Preprocessing for Building Vocab

1. Convert all characters to lowercase
   - UVa, UVA → uva
   - Shall we always convert all words to lowercase? (Apple vs. apple)

2. Map low-frequency words to a special token ⟨unk⟩
   - Zipf's law: freq(w_i) ∝ 1 / r_i, where freq(w_i) is the frequency of word w_i and r_i is the rank of this word
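
A minimal sketch of these two preprocessing steps, assuming a small made-up corpus and a frequency cutoff of 2; the actual demo may choose the cutoff differently.

```python
from collections import Counter

# Sketch of the two preprocessing steps above. The corpus and the cutoff
# (min_freq = 2) are made-up values for illustration.
corpus = [["UVa", "is", "in", "Charlottesville"],
          ["UVA", "is", "a", "university"],
          ["I", "love", "coffee"]]

# Step 1: lowercase every token
corpus = [[tok.lower() for tok in doc] for doc in corpus]

# Step 2: map low-frequency words to the special token <unk>
counts = Counter(tok for doc in corpus for tok in doc)
min_freq = 2
vocab = {tok for tok, c in counts.items() if c >= min_freq} | {"<unk>"}
corpus = [[tok if tok in vocab else "<unk>" for tok in doc] for doc in corpus]
# vocab is now {'uva', 'is', '<unk>'}; rare words such as 'coffee' become '<unk>'
print(vocab)
print(corpus)
```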


Page 23: Information Embedded in BoW Representations

It is critical to keep in mind what information is preserved in bag-of-words representations:

- Keep:
  - which words appear in the text
- Lose:
  - word order (e.g., "I love coffee don t like tea")
  - sentence boundaries
  - sentence order
  - ...


Page 26: Case Study: Sentiment Analysis

Page 27: A Dummy Predictor

Consider the following toy example (adding one more example to make it more interesting):

Tokenized texts
  Text x                                Label y
  Tokenized text 1: I love coffee       Positive
  Tokenized text 2: I don t like tea    Negative
  Tokenized text 3: I like coffee       Positive

What is the simplest classifier that we can construct based on this "dataset"?
- Predict every text as Positive
- 66.7% prediction accuracy on this dataset²
- It has a name: the majority baseline³ (a code sketch follows below)

² The evaluation of classifiers will be discussed in a future lecture.
³ It can be a competitive baseline in practice, particularly for unbalanced datasets.
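
A minimal sketch of the majority baseline on this toy dataset; the function and variable names are illustrative, not taken from the demo code.

```python
from collections import Counter

# Majority-baseline sketch: always predict the most frequent training label.
# The three labels below are the toy "dataset" on this slide.
train_labels = ["Positive", "Negative", "Positive"]

majority_label = Counter(train_labels).most_common(1)[0][0]  # 'Positive'

def majority_baseline(text):
    # Ignore the input entirely and return the majority class.
    return majority_label

accuracy = sum(majority_baseline(t) == y
               for t, y in zip(["I love coffee", "I don t like tea", "I like coffee"],
                               train_labels)) / len(train_labels)
print(majority_label, round(accuracy, 3))  # Positive 0.667
```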


Page 32: A Simple Predictor

Consider the following toy example, again:

Tokenized texts
- Tokenized text 1: I love coffee
- Tokenized text 2: I don t like tea
- Tokenized text 3: I like coffee

          I  love  coffee  don  t  like  tea
x(1)  = [ 1   1     1       0   0   0    0 ]^T
w_Pos = [ 0   1     0       0   0   1    0 ]^T
w_Neg = [ 0   0     0       1   0   0    0 ]^T

The prediction of sentiment polarity can be formulated as follows:

w_Pos^T x = 1 > w_Neg^T x = 0    (4)

Essentially, it is equivalent to counting the positive and negative words.
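
A minimal sketch of Eq. (4) as code: the prediction reduces to comparing two dot products, i.e., counting positive and negative words. The weight vectors mirror w_Pos and w_Neg above.

```python
# Sketch of the simple predictor in Eq. (4). The weight vectors mirror w_Pos and
# w_Neg above (vocabulary order: I love coffee don t like tea).
x1    = [1, 1, 1, 0, 0, 0, 0]   # "I love coffee"
w_pos = [0, 1, 0, 0, 0, 1, 0]   # positive words: love, like
w_neg = [0, 0, 0, 1, 0, 0, 0]   # negative words: don (from "don't")

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Predict Positive when the positive-word count exceeds the negative-word count.
label = "Positive" if dot(w_pos, x1) > dot(w_neg, x1) else "Negative"
print(dot(w_pos, x1), dot(w_neg, x1), label)  # 1 0 Positive
```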


Page 36: Another Example

The limitation of word counting:

          I  love  coffee  don  t  like  tea
x(2)  = [ 1   0     0       1   1   1    1 ]^T
w_Pos = [ 0   1     0       0   0   1    0 ]^T
w_Neg = [ 0   0     0       1   0   0    0 ]^T

- Different words should contribute differently, e.g., not vs. dislike
- Sentiment word lists are definitely incomplete

Example II: Positive
"Din Tai Fung, every time I go eat at anyone of the locations around the King County area, I keep being reminded on why I have to keep coming back to this restaurant. ..."


Page 39: Logistic Regression

Page 40: Linear Models

Directly model a linear classifier as

h_y(x) = w_y^T x + b_y    (5)

with
- x ∈ ℕ^d: vector, the bag-of-words representation
- w_y ∈ ℝ^d: vector, the classification weights associated with label y
- b_y ∈ ℝ: scalar, the bias of label y in the training set

About label bias: consider a case with highly imbalanced examples, where we have 90 positive examples and 10 negative examples in the training set. With

b_Pos > b_Neg,

a classifier can get 90% of its predictions correct without even resorting to the texts.


Page 42: Logistic Regression

Rewrite the linear decision function in the log-probabilistic form

log P(y | x) ∝ w_y^T x + b_y = h_y(x)    (6)

or, in the probabilistic form,

P(y | x) ∝ exp(w_y^T x + b_y)    (7)

To make sure P(y | x) is a valid probability, we need Σ_y P(y | x) = 1, which gives

P(y | x) = exp(w_y^T x + b_y) / Σ_{y' ∈ Y} exp(w_{y'}^T x + b_{y'})    (8)

Problem: Show that the classifier based on Eq. 8 is still a linear classifier.
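
A minimal sketch of Eq. (8): compute a per-label linear score and normalize with the softmax. The weight and bias values are made up for illustration; subtracting the maximum score is a standard numerical-stability trick, not something required by the slide.

```python
import math

# Sketch of Eq. (8): P(y | x) as a softmax over per-label linear scores.
# The weight and bias values below are made up for illustration.
def lr_probabilities(x, weights, biases):
    scores = {y: sum(w_j * x_j for w_j, x_j in zip(w, x)) + biases[y]
              for y, w in weights.items()}
    m = max(scores.values())                      # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

x = [1, 1, 1, 0, 0, 0, 0]                                  # "I love coffee"
weights = {"Positive": [0.0, 1.2, 0.1, 0.0, 0.0, 0.9, -0.3],
           "Negative": [0.0, -0.8, 0.0, 1.1, 0.2, -0.4, 0.6]}
biases = {"Positive": 0.0, "Negative": 0.0}

probs = lr_probabilities(x, weights, biases)
print(probs)                          # the two probabilities sum to 1
print(max(probs, key=probs.get))      # argmax prediction, cf. Eq. (2)
```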


Page 46: Alternative Form

Rewriting x and w_y as
- x^T = [x_1, x_2, ..., x_d, 1]
- w_y^T = [w_1, w_2, ..., w_d, b_y]

allows us to have a more concise form:

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)    (9)

Comments:
- exp(a) / Σ_{a'} exp(a') is the softmax function
- This form works with any size of Y: it does not have to be a binary classification problem.
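
A small sketch of the trick above: appending a constant 1 to x and folding b_y into w_y leaves the score unchanged. The numbers are made up.

```python
# Sketch of the alternative form: append a constant 1 to x and fold the bias b_y
# into w_y, so the score becomes a single dot product. The numbers are made up.
x = [1, 1, 1, 0, 0, 0, 0]
w_pos, b_pos = [0.0, 1.2, 0.1, 0.0, 0.0, 0.9, -0.3], 0.5

x_aug = x + [1.0]            # [x_1, ..., x_d, 1]
w_aug = w_pos + [b_pos]      # [w_1, ..., w_d, b_y]

score_with_bias = sum(w * xi for w, xi in zip(w_pos, x)) + b_pos
score_augmented = sum(w * xi for w, xi in zip(w_aug, x_aug))
assert abs(score_with_bias - score_augmented) < 1e-12
print(score_with_bias, score_augmented)   # both equal 1.8
```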


Page 48: Binary Classifier

Assume Y = {Neg, Pos}; then the corresponding logistic regression classifier with Y = Pos is

P(Y = Pos | x) = 1 / (1 + exp(-w^T x))    (10)

where w is the only parameter.

- P(Y = Neg | x) = 1 - P(Y = Pos | x)
- 1 / (1 + exp(-z)) is the sigmoid function

Problem: Show that Eq. 10 is a special form of Eq. 9.
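
A minimal sketch of Eq. (10); with two labels, the softmax of Eq. (9) reduces to a sigmoid of a single score with w = w_Pos - w_Neg (biases folded in as above). The weight values are made up.

```python
import math

# Sketch of Eq. (10). With two labels, the softmax in Eq. (9) reduces to a sigmoid
# of a single score, using w = w_Pos - w_Neg (with biases folded in as above).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0, 2.0, 0.1, -1.1, -0.2, 1.3, -0.9]   # made-up weight vector
x = [1, 1, 1, 0, 0, 0, 0]                    # "I love coffee"

p_pos = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
p_neg = 1.0 - p_pos                          # P(Y = Neg | x) = 1 - P(Y = Pos | x)
print(round(p_pos, 3), round(p_neg, 3))
```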


Page 52: Two Questions on Building LR Models

Two questions arise when building a logistic regression classifier

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)    (11)

- How to learn the parameters θ = {w_y}_{y ∈ Y}?
- Can x be better than the bag-of-words representation?
  - This will be discussed in Lecture 04.


Page 54: Review: (Log-)likelihood Function

With a collection of training examples {(x^(i), y^(i))}_{i=1}^{m}, the likelihood function of {w_y}_{y ∈ Y} is

L(θ) = ∏_{i=1}^{m} P(y^(i) | x^(i))    (12)

and the log-likelihood function is

ℓ({w_y}) = Σ_{i=1}^{m} log P(y^(i) | x^(i))    (13)

Page 55: Log-likelihood Function of an LR Model

With the definition of an LR model

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)    (14)

the log-likelihood function is

ℓ(θ) = Σ_{i=1}^{m} log P(y^(i) | x^(i))    (15)
     = Σ_{i=1}^{m} { w_{y^(i)}^T x^(i) - log Σ_{y' ∈ Y} exp(w_{y'}^T x^(i)) }    (16)

Given the training examples {(x^(i), y^(i))}_{i=1}^{m}, ℓ(θ) is a function of θ = {w_y}.
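
A minimal sketch of Eqs. (15)-(16), evaluating the (negative) log-likelihood on the toy sentiment dataset from earlier; the all-zero initialization is an assumption.

```python
import math

# Sketch of Eqs. (15)-(16): the log-likelihood of an LR model on a tiny dataset.
# The data are the toy sentiment examples; the all-zero weights are an assumption.
def log_prob(y, x, weights):
    scores = {label: sum(w_j * x_j for w_j, x_j in zip(w, x))
              for label, w in weights.items()}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[y] - log_z      # w_y^T x - log sum_{y'} exp(w_{y'}^T x)

data = [([1, 1, 1, 0, 0, 0, 0], "Positive"),   # I love coffee
        ([1, 0, 0, 1, 1, 1, 1], "Negative"),   # I don t like tea
        ([1, 0, 1, 0, 0, 1, 0], "Positive")]   # I like coffee
weights = {"Positive": [0.0] * 7, "Negative": [0.0] * 7}   # initial parameters

log_likelihood = sum(log_prob(y, x, weights) for x, y in data)
nll = -log_likelihood
print(nll)   # 3 * log(2) ≈ 2.079 at the all-zero initialization
```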

Page 56: Optimization with Gradient

MLE is equivalent to minimizing the negative log-likelihood (NLL)

NLL(θ) = -ℓ(θ) = Σ_{i=1}^{m} { -w_{y^(i)}^T x^(i) + log Σ_{y' ∈ Y} exp(w_{y'}^T x^(i)) }

Then the parameter w_y associated with label y can be updated as

w_y ← w_y - η · ∂NLL({w_y}) / ∂w_y,    ∀ y ∈ Y    (17)

where η is called the learning rate.

Page 57: Optimization with Gradient (II)

Two questions are answered by the update equation

w_y ← w_y - η · ∂NLL({w_y}) / ∂w_y    (18)

(1) Which direction should we go? The gradient ∂NLL({w_y}) / ∂w_y gives the direction.
(2) How far should we go? The learning rate η controls the step size.

[Jurafsky and Martin, 2019]


Page 60: Training Procedure

Steps for parameter estimation, given the current parameters {w_y}:

1. Compute the derivative ∂NLL({w_y}) / ∂w_y for every y ∈ Y
2. Update the parameters with w_y ← w_y - η · ∂NLL({w_y}) / ∂w_y for every y ∈ Y
3. If not done, return to step 1
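
A minimal sketch of this loop for the softmax model, using the standard NLL gradient ∂NLL/∂w_y = Σ_i (P(y | x^(i)) - 1[y^(i) = y]) x^(i); the learning rate and the number of steps are assumptions.

```python
import math

# Sketch of the three-step procedure above for the softmax model, using the
# standard NLL gradient dNLL/dw_y = sum_i (P(y | x_i) - 1[y_i = y]) * x_i.
# The learning rate and the number of steps are assumptions.
def probabilities(x, weights):
    scores = {y: sum(w_j * x_j for w_j, x_j in zip(w, x)) for y, w in weights.items()}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, labels, dim, eta=0.5, steps=100):
    weights = {y: [0.0] * dim for y in labels}
    for _ in range(steps):
        # Step 1: compute the gradient of the NLL with respect to each w_y
        grads = {y: [0.0] * dim for y in labels}
        for x, y_true in data:
            p = probabilities(x, weights)
            for y in labels:
                coeff = p[y] - (1.0 if y == y_true else 0.0)
                for j in range(dim):
                    grads[y][j] += coeff * x[j]
        # Step 2: gradient-descent update; Step 3: repeat until done
        for y in labels:
            weights[y] = [w - eta * g for w, g in zip(weights[y], grads[y])]
    return weights

data = [([1, 1, 1, 0, 0, 0, 0], "Positive"),
        ([1, 0, 0, 1, 1, 1, 1], "Negative"),
        ([1, 0, 1, 0, 0, 1, 0], "Positive")]
weights = train(data, labels=["Positive", "Negative"], dim=7)
print({y: [round(v, 2) for v in w] for y, w in weights.items()})
```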

Page 61: Procedure of Building a Classifier

Review: the pipeline of text classification:

Text →[Bag-of-Words Representation]→ Numeric Vector x → Classifier (Logistic Regression) → Category y

Page 62: ℓ2 Regularization

Page 63: ℓ2 Regularization

The most commonly used regularization trick is ℓ2 regularization. For that, we redefine the objective function of LR by adding an additional term:

Loss(θ) = Σ_{i=1}^{m} { -w_{y^(i)}^T x^(i) + log Σ_{y' ∈ Y} exp(w_{y'}^T x^(i)) } + (λ/2) · Σ_{y ∈ Y} ||w_y||_2^2    (19)

where the first term is the NLL and the second term is the ℓ2 regularizer.

- λ is the regularization parameter


Page 65: ℓ2 Regularization in Gradient Descent

- The gradient of the loss function:

  ∂Loss(θ) / ∂w_y = ∂NLL(θ) / ∂w_y + λ w_y    (20)

- To minimize the loss, we update the parameters as

  w_y ← w_y - η · ( ∂NLL(θ) / ∂w_y + λ w_y )    (21)
      = (1 - ηλ) · w_y - η · ∂NLL(θ) / ∂w_y

- Depending on the strength (value) of λ, the regularization term keeps the parameter values close to 0, which to some extent can help avoid overfitting.
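
A one-step sketch of Eqs. (20)-(21): the ℓ2 term turns the update into plain gradient descent plus weight decay. The values of η, λ, the weights, and the gradient are made up.

```python
# Sketch of Eqs. (20)-(21): the l2 term turns the update into ordinary gradient
# descent plus weight decay, (1 - eta*lam) * w_y. All numbers here are made up.
eta, lam = 0.1, 1.0

def regularized_update(w_y, nll_grad_y):
    # w_y <- (1 - eta * lam) * w_y - eta * dNLL/dw_y
    return [(1.0 - eta * lam) * w - eta * g for w, g in zip(w_y, nll_grad_y)]

w_pos = [0.2, 1.5, -0.3, 0.0, 0.1, 0.8, -0.4]    # current parameters (made up)
grad  = [0.05, -0.2, 0.1, 0.0, 0.0, -0.1, 0.3]   # dNLL/dw_Pos (made up)
print(regularized_update(w_pos, grad))
```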


Page 68: Learning without Regularization

In the demo code, we chose λ = 0.001 to approximate the case without regularization.

- Training accuracy: 99.89%
- Validation accuracy: 52.21%

Page 69: Classification Weights without Regularization

Here are some word features and their classification weights from the previous model, trained without regularization. Positive weights indicate that the word feature contributes to the positive sentiment class, and negative weights indicate the opposite.

               interesting   pleasure   boring   zoe     write   workings
Without Reg    0.011         -5.63      1.80     -5.68   -8.20   14.16

- negative: woody allen can write and deliver a one liner as well as anybody .
- positive: soderbergh , like kubrick before him , may not touch the planet 's skin , but understands the workings of its spirit .


Page 72: Learning with Regularization

We chose λ = 10².

- Training accuracy: 62.54%
- Validation accuracy: 63.17%

Page 73: Classification Weights with Regularization

With regularization, the classification weights make more sense to us:

               interesting   pleasure   boring   zoe      write    workings
Without Reg    0.011         -5.63      1.80     -5.68    -8.20    14.16
With Reg       0.16          0.36       -0.21    -0.057   -0.066   0.040

Regularization for avoiding overfitting: it reduces the correlation between the class label and some noisy features.


Page 75: Demo Code

Page 76: Demo

What we are going to review in this demo code:

- NLP
  - Bag-of-words representations
  - Text classifiers
- Machine learning
  - Overfitting
  - ℓ2 regularization
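
A sketch in the spirit of the demo (not the actual course notebook), assuming a scikit-learn setup: bag-of-words features with CountVectorizer and an ℓ2-regularized LogisticRegression, where C is the inverse regularization strength.

```python
# A minimal sketch in the spirit of the demo (not the actual course notebook),
# assuming a scikit-learn setup: bag-of-words features + l2-regularized LR.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["I love coffee", "I don t like tea", "I like coffee"]   # toy data
train_labels = ["Positive", "Negative", "Positive"]

vectorizer = CountVectorizer()                   # bag-of-words representation
X_train = vectorizer.fit_transform(train_texts)

# In scikit-learn, C is the inverse regularization strength (roughly C = 1/lambda):
# a large C approximates "no regularization", a small C regularizes heavily.
clf = LogisticRegression(penalty="l2", C=1000.0)
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["I love tea"])))   # predicted label
print(clf.score(X_train, train_labels))                    # training accuracy
```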

Page 77: References

Jurafsky, D. and Martin, J. (2019). Speech and Language Processing.

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 79–86. Association for Computational Linguistics.