CS 6501 Natural Language Processing
Text Classification (I): Logistic Regression

Yangfeng Ji
Department of Computer Science
University of Virginia



Page 2: Overview

1. Problem Definition
2. Bag-of-Words Representation
3. Case Study: Sentiment Analysis
4. Logistic Regression
5. ℓ2 Regularization
6. Demo Code

Page 3: Problem Definition

Page 4: Case I: Sentiment Analysis

[Pang et al., 2002]

Page 5: Case II: Topic Classification

Example topics:
- Business
- Arts
- Technology
- Sports
- ...

Page 6: Classification

- Input: a text x
  - Example: a product review on Amazon
- Output: y ∈ Y, where Y is the predefined category set (sample space)
  - Example: Y = {Positive, Negative}

The pipeline of text classification:¹

Text → Numeric Vector x → Classifier → Category y

¹ In this course, we use x for both a text and its numeric representation, with no distinction.


Page 8: Probabilistic Formulation

With the conditional probability P(Y | X), the prediction of Y for a given text X = x is

ŷ = argmax_{y ∈ Y} P(Y = y | X = x)    (1)

or, for simplicity,

ŷ = argmax_{y ∈ Y} P(y | x)    (2)


Page 10: Key Questions

Recall:
- The formulation defined on the previous slide:
  ŷ = argmax_{y ∈ Y} P(Y = y | X = x)    (3)
- The pipeline of text classification:
  Text → Numeric Vector x → Classifier → Category y

Building a text classifier is about answering the following two questions:
1. How to represent a text as x?
   - Bag-of-words representation
2. How to estimate P(y | x)?
   - Logistic regression models
   - Neural network classifiers (Lecture 3)


Page 15: Bag-of-Words Representation

Page 16: Bag-of-Words Representation

Example texts
- Text 1: I love coffee.
- Text 2: I don't like tea.

Step I: convert a text into a collection of tokens (e.g., tokenization)

Tokenized texts
- Tokenized text 1: I love coffee
- Tokenized text 2: I don t like tea

Step II: build a dictionary/vocabulary

Vocabulary: {I, love, coffee, don, t, like, tea}


Page 19: Bag-of-Words Representations

Step III: based on the vocab, convert each text into a numeric representation:

         I  love  coffee  don  t  like  tea
x(1) = [ 1   1     1       0   0   0    0 ]^T
x(2) = [ 1   0     0       1   1   1    1 ]^T

The pipeline of text classification:

Text →[Bag-of-Words Representation]→ Numeric Vector x → Classifier → Category y
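
To make Steps I-III concrete, here is a minimal Python sketch of the same toy example (an illustration, not the course's demo code); the crude punctuation handling and the variable names are assumptions.

```python
# Minimal bag-of-words sketch for the toy example above (not the official demo code).
texts = ["I love coffee.", "I don't like tea."]

# Step I: tokenization (a crude whitespace/punctuation split, just for illustration)
tokenized = [t.replace(".", "").replace("'", " ").split() for t in texts]
# -> [['I', 'love', 'coffee'], ['I', 'don', 't', 'like', 'tea']]

# Step II: build a dictionary/vocabulary, preserving first-occurrence order
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
# -> ['I', 'love', 'coffee', 'don', 't', 'like', 'tea']

# Step III: convert each text into a binary vector over the vocabulary
def bag_of_words(tokens, vocab):
    return [1 if word in tokens else 0 for word in vocab]

x1 = bag_of_words(tokenized[0], vocab)  # [1, 1, 1, 0, 0, 0, 0]
x2 = bag_of_words(tokenized[1], vocab)  # [1, 0, 0, 1, 1, 1, 1]
print(vocab, x1, x2)
```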


Page 21: Preprocessing for Building Vocab

1. Convert all characters to lowercase
   - UVa, UVA → uva
   - Shall we always convert all words to lowercase? (Apple vs. apple)

2. Map low-frequency words to a special token ⟨unk⟩
   - Zipf's law: freq(w_i) ∝ 1 / r_i, where freq(w_i) is the frequency of word w_i and r_i is the rank of this word
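
A minimal sketch of these two preprocessing steps, assuming a small made-up corpus and a frequency cutoff of 2; the actual demo may choose the cutoff differently.

```python
from collections import Counter

# Sketch of the two preprocessing steps above. The corpus and the cutoff
# (min_freq = 2) are made-up values for illustration.
corpus = [["UVa", "is", "in", "Charlottesville"],
          ["UVA", "is", "a", "university"],
          ["I", "love", "coffee"]]

# Step 1: lowercase every token
corpus = [[tok.lower() for tok in doc] for doc in corpus]

# Step 2: map low-frequency words to the special token <unk>
counts = Counter(tok for doc in corpus for tok in doc)
min_freq = 2
vocab = {tok for tok, c in counts.items() if c >= min_freq} | {"<unk>"}
corpus = [[tok if tok in vocab else "<unk>" for tok in doc] for doc in corpus]
# vocab is now {'uva', 'is', '<unk>'}; rare words such as 'coffee' become '<unk>'
print(vocab)
print(corpus)
```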


Page 23: Information Embedded in BoW Representations

It is critical to keep in mind what information is preserved in bag-of-words representations:

- Keep:
  - which words appear in the text
- Lose:
  - word order (e.g., "I love coffee don t like tea")
  - sentence boundaries
  - sentence order
  - ...


Page 26: Case Study: Sentiment Analysis

Page 27: A Dummy Predictor

Consider the following toy example (adding one more example to make it more interesting):

Tokenized texts
  Text x                                Label y
  Tokenized text 1: I love coffee       Positive
  Tokenized text 2: I don t like tea    Negative
  Tokenized text 3: I like coffee       Positive

What is the simplest classifier that we can construct based on this "dataset"?
- Predict every text as Positive
- 66.7% prediction accuracy on this dataset²
- It has a name: the majority baseline³ (a code sketch follows below)

² The evaluation of classifiers will be discussed in a future lecture.
³ It can be a competitive baseline in practice, particularly for unbalanced datasets.
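
A minimal sketch of the majority baseline on this toy dataset; the function and variable names are illustrative, not taken from the demo code.

```python
from collections import Counter

# Majority-baseline sketch: always predict the most frequent training label.
# The three labels below are the toy "dataset" on this slide.
train_labels = ["Positive", "Negative", "Positive"]

majority_label = Counter(train_labels).most_common(1)[0][0]  # 'Positive'

def majority_baseline(text):
    # Ignore the input entirely and return the majority class.
    return majority_label

accuracy = sum(majority_baseline(t) == y
               for t, y in zip(["I love coffee", "I don t like tea", "I like coffee"],
                               train_labels)) / len(train_labels)
print(majority_label, round(accuracy, 3))  # Positive 0.667
```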


Page 32: A Simple Predictor

Consider the following toy example, again:

Tokenized texts
- Tokenized text 1: I love coffee
- Tokenized text 2: I don t like tea
- Tokenized text 3: I like coffee

          I  love  coffee  don  t  like  tea
x(1)  = [ 1   1     1       0   0   0    0 ]^T
w_Pos = [ 0   1     0       0   0   1    0 ]^T
w_Neg = [ 0   0     0       1   0   0    0 ]^T

The prediction of sentiment polarity can be formulated as follows:

w_Pos^T x = 1 > w_Neg^T x = 0    (4)

Essentially, it is equivalent to counting the positive and negative words.
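
A minimal sketch of Eq. (4) as code: the prediction reduces to comparing two dot products, i.e., counting positive and negative words. The weight vectors mirror w_Pos and w_Neg above.

```python
# Sketch of the simple predictor in Eq. (4). The weight vectors mirror w_Pos and
# w_Neg above (vocabulary order: I love coffee don t like tea).
x1    = [1, 1, 1, 0, 0, 0, 0]   # "I love coffee"
w_pos = [0, 1, 0, 0, 0, 1, 0]   # positive words: love, like
w_neg = [0, 0, 0, 1, 0, 0, 0]   # negative words: don (from "don't")

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Predict Positive when the positive-word count exceeds the negative-word count.
label = "Positive" if dot(w_pos, x1) > dot(w_neg, x1) else "Negative"
print(dot(w_pos, x1), dot(w_neg, x1), label)  # 1 0 Positive
```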


Page 36: Another Example

The limitation of word counting:

          I  love  coffee  don  t  like  tea
x(2)  = [ 1   0     0       1   1   1    1 ]^T
w_Pos = [ 0   1     0       0   0   1    0 ]^T
w_Neg = [ 0   0     0       1   0   0    0 ]^T

- Different words should contribute differently, e.g., not vs. dislike
- Sentiment word lists are definitely incomplete

Example II: Positive
"Din Tai Fung, every time I go eat at anyone of the locations around the King County area, I keep being reminded on why I have to keep coming back to this restaurant. ..."


Page 39: Logistic Regression

Page 40: Linear Models

Directly model a linear classifier as

h_y(x) = w_y^T x + b_y    (5)

with
- x ∈ ℕ^d: vector, the bag-of-words representation
- w_y ∈ ℝ^d: vector, the classification weights associated with label y
- b_y ∈ ℝ: scalar, the bias of label y in the training set

About label bias: consider a case with highly imbalanced examples, where we have 90 positive examples and 10 negative examples in the training set. With

b_Pos > b_Neg,

a classifier can get 90% of its predictions correct without even resorting to the texts.


Page 42: Logistic Regression

Rewrite the linear decision function in the log-probabilistic form

log P(y | x) ∝ w_y^T x + b_y = h_y(x)    (6)

or, in the probabilistic form,

P(y | x) ∝ exp(w_y^T x + b_y)    (7)

To make sure P(y | x) is a valid probability, we need Σ_y P(y | x) = 1, which gives

P(y | x) = exp(w_y^T x + b_y) / Σ_{y' ∈ Y} exp(w_{y'}^T x + b_{y'})    (8)

Problem: Show that the classifier based on Eq. 8 is still a linear classifier.
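
A minimal sketch of Eq. (8): compute a per-label linear score and normalize with the softmax. The weight and bias values are made up for illustration; subtracting the maximum score is a standard numerical-stability trick, not something required by the slide.

```python
import math

# Sketch of Eq. (8): P(y | x) as a softmax over per-label linear scores.
# The weight and bias values below are made up for illustration.
def lr_probabilities(x, weights, biases):
    scores = {y: sum(w_j * x_j for w_j, x_j in zip(w, x)) + biases[y]
              for y, w in weights.items()}
    m = max(scores.values())                      # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

x = [1, 1, 1, 0, 0, 0, 0]                                  # "I love coffee"
weights = {"Positive": [0.0, 1.2, 0.1, 0.0, 0.0, 0.9, -0.3],
           "Negative": [0.0, -0.8, 0.0, 1.1, 0.2, -0.4, 0.6]}
biases = {"Positive": 0.0, "Negative": 0.0}

probs = lr_probabilities(x, weights, biases)
print(probs)                          # the two probabilities sum to 1
print(max(probs, key=probs.get))      # argmax prediction, cf. Eq. (2)
```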


Page 46: Alternative Form

Rewriting x and w_y as
- x^T = [x_1, x_2, ..., x_d, 1]
- w_y^T = [w_1, w_2, ..., w_d, b_y]

allows us to have a more concise form:

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)    (9)

Comments:
- exp(a) / Σ_{a'} exp(a') is the softmax function
- This form works with any size of Y: it does not have to be a binary classification problem.
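
A small sketch of the trick above: appending a constant 1 to x and folding b_y into w_y leaves the score unchanged. The numbers are made up.

```python
# Sketch of the alternative form: append a constant 1 to x and fold the bias b_y
# into w_y, so the score becomes a single dot product. The numbers are made up.
x = [1, 1, 1, 0, 0, 0, 0]
w_pos, b_pos = [0.0, 1.2, 0.1, 0.0, 0.0, 0.9, -0.3], 0.5

x_aug = x + [1.0]            # [x_1, ..., x_d, 1]
w_aug = w_pos + [b_pos]      # [w_1, ..., w_d, b_y]

score_with_bias = sum(w * xi for w, xi in zip(w_pos, x)) + b_pos
score_augmented = sum(w * xi for w, xi in zip(w_aug, x_aug))
assert abs(score_with_bias - score_augmented) < 1e-12
print(score_with_bias, score_augmented)   # both equal 1.8
```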


Page 48: Binary Classifier

Assume Y = {Neg, Pos}; then the corresponding logistic regression classifier with Y = Pos is

P(Y = Pos | x) = 1 / (1 + exp(-w^T x))    (10)

where w is the only parameter.

- P(Y = Neg | x) = 1 - P(Y = Pos | x)
- 1 / (1 + exp(-z)) is the sigmoid function

Problem: Show that Eq. 10 is a special form of Eq. 9.
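
A minimal sketch of Eq. (10); with two labels, the softmax of Eq. (9) reduces to a sigmoid of a single score with w = w_Pos - w_Neg (biases folded in as above). The weight values are made up.

```python
import math

# Sketch of Eq. (10). With two labels, the softmax in Eq. (9) reduces to a sigmoid
# of a single score, using w = w_Pos - w_Neg (with biases folded in as above).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0, 2.0, 0.1, -1.1, -0.2, 1.3, -0.9]   # made-up weight vector
x = [1, 1, 1, 0, 0, 0, 0]                    # "I love coffee"

p_pos = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
p_neg = 1.0 - p_pos                          # P(Y = Neg | x) = 1 - P(Y = Pos | x)
print(round(p_pos, 3), round(p_neg, 3))
```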


Page 52: Two Questions on Building LR Models

Two questions arise when building a logistic regression classifier

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)    (11)

- How to learn the parameters θ = {w_y}_{y ∈ Y}?
- Can x be better than the bag-of-words representation?
  - This will be discussed in Lecture 04.


Page 54: Review: (Log-)likelihood Function

With a collection of training examples {(x^(i), y^(i))}_{i=1}^{m}, the likelihood function of {w_y}_{y ∈ Y} is

L(θ) = ∏_{i=1}^{m} P(y^(i) | x^(i))    (12)

and the log-likelihood function is

ℓ({w_y}) = Σ_{i=1}^{m} log P(y^(i) | x^(i))    (13)

Page 55: Log-likelihood Function of an LR Model

With the definition of an LR model

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)    (14)

the log-likelihood function is

ℓ(θ) = Σ_{i=1}^{m} log P(y^(i) | x^(i))    (15)
     = Σ_{i=1}^{m} { w_{y^(i)}^T x^(i) - log Σ_{y' ∈ Y} exp(w_{y'}^T x^(i)) }    (16)

Given the training examples {(x^(i), y^(i))}_{i=1}^{m}, ℓ(θ) is a function of θ = {w_y}.
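
A minimal sketch of Eqs. (15)-(16), evaluating the (negative) log-likelihood on the toy sentiment dataset from earlier; the all-zero initialization is an assumption.

```python
import math

# Sketch of Eqs. (15)-(16): the log-likelihood of an LR model on a tiny dataset.
# The data are the toy sentiment examples; the all-zero weights are an assumption.
def log_prob(y, x, weights):
    scores = {label: sum(w_j * x_j for w_j, x_j in zip(w, x))
              for label, w in weights.items()}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[y] - log_z      # w_y^T x - log sum_{y'} exp(w_{y'}^T x)

data = [([1, 1, 1, 0, 0, 0, 0], "Positive"),   # I love coffee
        ([1, 0, 0, 1, 1, 1, 1], "Negative"),   # I don t like tea
        ([1, 0, 1, 0, 0, 1, 0], "Positive")]   # I like coffee
weights = {"Positive": [0.0] * 7, "Negative": [0.0] * 7}   # initial parameters

log_likelihood = sum(log_prob(y, x, weights) for x, y in data)
nll = -log_likelihood
print(nll)   # 3 * log(2) ≈ 2.079 at the all-zero initialization
```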

Page 56: Optimization with Gradient

MLE is equivalent to minimizing the negative log-likelihood (NLL)

NLL(θ) = -ℓ(θ) = Σ_{i=1}^{m} { -w_{y^(i)}^T x^(i) + log Σ_{y' ∈ Y} exp(w_{y'}^T x^(i)) }

Then the parameter w_y associated with label y can be updated as

w_y ← w_y - η · ∂NLL({w_y}) / ∂w_y,    ∀ y ∈ Y    (17)

where η is called the learning rate.

Page 57: Optimization with Gradient (II)

Two questions are answered by the update equation

w_y ← w_y - η · ∂NLL({w_y}) / ∂w_y    (18)

(1) Which direction should we go? The gradient ∂NLL({w_y}) / ∂w_y gives the direction.
(2) How far should we go? The learning rate η controls the step size.

[Jurafsky and Martin, 2019]


Page 60: Training Procedure

Steps for parameter estimation, given the current parameters {w_y}:

1. Compute the derivative ∂NLL({w_y}) / ∂w_y for every y ∈ Y
2. Update the parameters with w_y ← w_y - η · ∂NLL({w_y}) / ∂w_y for every y ∈ Y
3. If not done, return to step 1
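
A minimal sketch of this loop for the softmax model, using the standard NLL gradient ∂NLL/∂w_y = Σ_i (P(y | x^(i)) - 1[y^(i) = y]) x^(i); the learning rate and the number of steps are assumptions.

```python
import math

# Sketch of the three-step procedure above for the softmax model, using the
# standard NLL gradient dNLL/dw_y = sum_i (P(y | x_i) - 1[y_i = y]) * x_i.
# The learning rate and the number of steps are assumptions.
def probabilities(x, weights):
    scores = {y: sum(w_j * x_j for w_j, x_j in zip(w, x)) for y, w in weights.items()}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, labels, dim, eta=0.5, steps=100):
    weights = {y: [0.0] * dim for y in labels}
    for _ in range(steps):
        # Step 1: compute the gradient of the NLL with respect to each w_y
        grads = {y: [0.0] * dim for y in labels}
        for x, y_true in data:
            p = probabilities(x, weights)
            for y in labels:
                coeff = p[y] - (1.0 if y == y_true else 0.0)
                for j in range(dim):
                    grads[y][j] += coeff * x[j]
        # Step 2: gradient-descent update; Step 3: repeat until done
        for y in labels:
            weights[y] = [w - eta * g for w, g in zip(weights[y], grads[y])]
    return weights

data = [([1, 1, 1, 0, 0, 0, 0], "Positive"),
        ([1, 0, 0, 1, 1, 1, 1], "Negative"),
        ([1, 0, 1, 0, 0, 1, 0], "Positive")]
weights = train(data, labels=["Positive", "Negative"], dim=7)
print({y: [round(v, 2) for v in w] for y, w in weights.items()})
```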

Page 61: Procedure of Building a Classifier

Review: the pipeline of text classification:

Text →[Bag-of-Words Representation]→ Numeric Vector x → Classifier (Logistic Regression) → Category y

Page 62: ℓ2 Regularization

Page 63: ℓ2 Regularization

The most commonly used regularization trick is ℓ2 regularization. For that, we redefine the objective function of LR by adding an additional term:

Loss(θ) = Σ_{i=1}^{m} { -w_{y^(i)}^T x^(i) + log Σ_{y' ∈ Y} exp(w_{y'}^T x^(i)) } + (λ/2) · Σ_{y ∈ Y} ||w_y||_2^2    (19)

where the first term is the NLL and the second term is the ℓ2 regularizer.

- λ is the regularization parameter


Page 65: ℓ2 Regularization in Gradient Descent

- The gradient of the loss function:

  ∂Loss(θ) / ∂w_y = ∂NLL(θ) / ∂w_y + λ w_y    (20)

- To minimize the loss, we update the parameters as

  w_y ← w_y - η · ( ∂NLL(θ) / ∂w_y + λ w_y )    (21)
      = (1 - ηλ) · w_y - η · ∂NLL(θ) / ∂w_y

- Depending on the strength (value) of λ, the regularization term keeps the parameter values close to 0, which to some extent can help avoid overfitting.
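
A one-step sketch of Eqs. (20)-(21): the ℓ2 term turns the update into plain gradient descent plus weight decay. The values of η, λ, the weights, and the gradient are made up.

```python
# Sketch of Eqs. (20)-(21): the l2 term turns the update into ordinary gradient
# descent plus weight decay, (1 - eta*lam) * w_y. All numbers here are made up.
eta, lam = 0.1, 1.0

def regularized_update(w_y, nll_grad_y):
    # w_y <- (1 - eta * lam) * w_y - eta * dNLL/dw_y
    return [(1.0 - eta * lam) * w - eta * g for w, g in zip(w_y, nll_grad_y)]

w_pos = [0.2, 1.5, -0.3, 0.0, 0.1, 0.8, -0.4]    # current parameters (made up)
grad  = [0.05, -0.2, 0.1, 0.0, 0.0, -0.1, 0.3]   # dNLL/dw_Pos (made up)
print(regularized_update(w_pos, grad))
```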


Page 68: Learning without Regularization

In the demo code, we chose λ = 0.001 to approximate the case without regularization.

- Training accuracy: 99.89%
- Validation accuracy: 52.21%

Page 69: Classification Weights without Regularization

Here are some word features and their classification weights from the previous model, trained without regularization. Positive weights indicate that the word feature contributes to the positive sentiment class, and negative weights indicate the opposite.

               interesting   pleasure   boring   zoe     write   workings
Without Reg    0.011         -5.63      1.80     -5.68   -8.20   14.16

- negative: woody allen can write and deliver a one liner as well as anybody .
- positive: soderbergh , like kubrick before him , may not touch the planet 's skin , but understands the workings of its spirit .


Page 72: Learning with Regularization

We chose λ = 10².

- Training accuracy: 62.54%
- Validation accuracy: 63.17%

Page 73: Classification Weights with Regularization

With regularization, the classification weights make more sense to us:

               interesting   pleasure   boring   zoe      write    workings
Without Reg    0.011         -5.63      1.80     -5.68    -8.20    14.16
With Reg       0.16          0.36       -0.21    -0.057   -0.066   0.040

Regularization for avoiding overfitting: it reduces the correlation between the class label and some noisy features.


Page 75: Demo Code

Page 76: Demo

What we are going to review in this demo code:

- NLP
  - Bag-of-words representations
  - Text classifiers
- Machine learning
  - Overfitting
  - ℓ2 regularization
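
A sketch in the spirit of the demo (not the actual course notebook), assuming a scikit-learn setup: bag-of-words features with CountVectorizer and an ℓ2-regularized LogisticRegression, where C is the inverse regularization strength.

```python
# A minimal sketch in the spirit of the demo (not the actual course notebook),
# assuming a scikit-learn setup: bag-of-words features + l2-regularized LR.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["I love coffee", "I don t like tea", "I like coffee"]   # toy data
train_labels = ["Positive", "Negative", "Positive"]

vectorizer = CountVectorizer()                   # bag-of-words representation
X_train = vectorizer.fit_transform(train_texts)

# In scikit-learn, C is the inverse regularization strength (roughly C = 1/lambda):
# a large C approximates "no regularization", a small C regularizes heavily.
clf = LogisticRegression(penalty="l2", C=1000.0)
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["I love tea"])))   # predicted label
print(clf.score(X_train, train_labels))                    # training accuracy
```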

Page 77: References

Jurafsky, D. and Martin, J. (2019). Speech and Language Processing.

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 79–86. Association for Computational Linguistics.