TRANSCRIPT
Harvard IACS | CS109B | Pavlos Protopapas, Mark Glickman, and Chris Tanner
NLP Lectures: Part 1 of 4
Lecture 22: Language Models
FOREWORD

The goals of the next four NLP lectures are to:
• convey the ubiquity and importance of text data/NLP
• build a foundation of the most important concepts
• illustrate how some state-of-the-art (SOTA) models work
• provide experience with these SOTA models (e.g., BERT, GPT-2)
• instill when to use which models, based on your data
• provide an overview and platform from which to dive deeper

(I'll teach an NLP course next year!)
Outline
Recap where we are
NLP Introduction
Language Models
Unigrams
Bigrams
Perplexity
Our digital world is inundated with text.
How can we leverage it for useful tasks?
62B pages · 500M tweets/day · 360M user pages · 13M articles
Common NLP Tasks (aka problems)

Syntax:
• Morphology
• Word Segmentation
• Part-of-Speech Tagging
• Parsing (Constituency, Dependency)

Semantics:
• Sentiment Analysis
• Topic Modelling
• Named Entity Recognition (NER)
• Relation Extraction
• Word Sense Disambiguation
• Natural Language Understanding (NLU)
• Natural Language Generation (NLG)
• Machine Translation
• Entailment
• Question Answering
• Language Modelling

Discourse:
• Summarization
• Coreference Resolution
Example (Sentiment Analysis): "Overall, Pfizer's COVID-19 vaccine is very safe and one of the most effective vaccines ever produced"
Example (Topic Modelling): sports · news · politics · fashion
Example (Named Entity Recognition): "Alexa, play Drivers License by Olivia Rodrigo" → labels: INTENT, SONG, ARTIST
Example (Machine Translation): SPANISH "El perro marrón" → ENGLISH "The brown dog"
Example (Language Modelling): can help with every other task!
Language Modelling
A Language Model represents the language used by a given entity (e.g., a particular person, genre, or other well-defined class of text)
FORMAL DEFINITION: A Language Model estimates the probability of any sequence of words.

Let X = "Anqi was late for class" (w_1 w_2 w_3 w_4 w_5)
P(X) = P("Anqi was late for class")
Language Modelling: Generate Text

"Drug kingpin El Chapo testified that he gave MILLIONS to Pelosi, Schiff & Killary. The Feds then closed the courtroom doors."
A Language Model is useful for:
Generating Text:
• Auto-complete
• Speech-to-text
• Question-answering / chatbots
• Machine translation

Classifying Text:
• Authorship attribution
• Detecting spam vs not spam
And much more!
Scenario: assume we have a finite vocabulary V.
V* represents the infinite set of strings/sentences that we could construct,
e.g., V* = {a, a dog, a frog, dog a, dog dog, frog dog, frog a dog, …}

Data: we have a training set of sentences x ∈ V*

Problem: estimate a probability distribution:

∑_{x ∈ V*} p(x) = 1

e.g., p(the) = 10^-2
p(waterfall, the, icecream) = 2 × 10^-18
p(the, sun, okay) = 2 × 10^-13

(Slide adapted from Luke Zettlemoyer @ UW 2018)
Motivation
“Wreck a nice beach” vs “Recognize speech”
“I ate a cherry” vs “Eye eight uh Jerry!”
“What is the weather today?”
“What is the whether two day?”
“What is the whether too day?”
“What is the Wrether today?”
How can we build a language model?
Important Terminology
a word token is a specific occurrence of a word in a text
a word type refers to the general form of the word, defined by its lexical representation
If our corpus were just "I ran and ran and ran", you'd say we have:
- 6 word tokens: [I, ran, and, ran, and, ran]
- 3 word types: {I, ran, and}
Naive Approach: unigram model

Assumes each word is independent of all others:

P(w_1, …, w_n) = ∏_{t=1}^{n} p(w_t)

P(w_1, w_2, w_3, w_4, w_5) = P(w_1) P(w_2) P(w_3) P(w_4) P(w_5)
Unigram Model

Let X = "Anqi was late for class" (w_1 w_2 w_3 w_4 w_5)

Let's say our corpus d has 100,000 words (W = n_w*(d) = 100,000):

word     # occurrences
Anqi     15
was      1,000
late     400
for      3,000
class    350

P(w_i) = n_{w_i}(d) / n_w*(d)

n_{w_i}(d) = # of times word w_i appears in d
n_w*(d) = # of times any word w appears in d

P(Anqi) = 15 / 100,000 = 0.00015
P(was) = 1,000 / 100,000 = 0.01

P(Anqi, was, late, for, class) = P(Anqi) P(was) P(late) P(for) P(class)
= 0.00015 × 0.01 × 0.004 × 0.03 × 0.0035
= 6.3 × 10^-13

This iterative approach is much more efficient than dividing by all possible sequences of length 5.
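To make this concrete, here is a minimal sketch (not from the lecture) of a count-based unigram model; the tiny whitespace-split corpus and the helper names unigram_prob / sentence_prob are our own stand-ins for the 100,000-word corpus d above.

```python
from collections import Counter

# Toy stand-in for the training corpus d; real text would be tokenized properly.
corpus = "Anqi was late for class and Anqi ran for the bus".split()

counts = Counter(corpus)      # n_w(d) for every word type w
total = sum(counts.values())  # n_w*(d): total number of word tokens

def unigram_prob(word):
    """P(w) = n_w(d) / n_w*(d), the maximum-likelihood estimate."""
    return counts[word] / total

def sentence_prob(words):
    """Unigram LM: each word is independent, so multiply the unigram probabilities."""
    p = 1.0
    for w in words:
        p *= unigram_prob(w)
    return p

print(sentence_prob(["Anqi", "was", "late", "for", "class"]))
```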
P(Anqi, was, late, for, class) > P(Anqi, was, late, for, asd\jkl;)
P(Anqi, was, late, for, the) = ?
UNIGRAM ISSUES?
1. Probabilities become too small
2. Out-of-vocabulary words <UNK>
3. Context doesn't play a role at all:
   P("Anqi was late for class") = P("class for was late Anqi")
4. Sequence generation: what's the most likely next word?
   Anqi was late for class _____
   Anqi was late for class the
   Anqi was late for class the the
Problem 1: Probabilities become too small

P(w_1, …, w_n) = ∏_{t=1}^{n} p(w_t)
Solution: work with log-probabilities.

log ∏_{t=1}^{n} p(w_t) = ∑_{t=1}^{n} log(p(w_t))

Even log(10^-100) = -230.26 is manageable.
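A quick sketch of the fix, using the per-word probabilities from the worked example above:

```python
import math

# Summing log-probabilities avoids the underflow that long raw products suffer.
word_probs = [0.00015, 0.01, 0.004, 0.03, 0.0035]  # P(Anqi), P(was), ...

log_prob = sum(math.log(p) for p in word_probs)
print(log_prob)          # about -28.1, instead of a vanishing 6.3e-13 product
print(math.log(1e-100))  # even 10^-100 has a perfectly manageable log: -230.26
```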
Problem 2: Out-of-vocabulary words <UNK>

p(COVID19) = 0
Solution: Smoothing (give every word's count some inflation)

P(w) = (n_w(d) + α) / (n_w*(d) + α|V|)

P(Anqi) = (15 + α) / (100,000 + α|V|)
P(COVID19) = (0 + α) / (100,000 + α|V|)

|V| = the # of unique word types in the vocabulary (including an extra 1 for <UNK>)

Two important notes:
1. Generally, α values are small (e.g., 0.5 – 2)
2. When a word w isn't found within the training corpus d, you should replace it with <UNK> (or *U*)
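A minimal sketch of add-α smoothing under these definitions; the five-word toy corpus stands in for the real d, and the function name smoothed_prob is ours:

```python
from collections import Counter

corpus = "Anqi was late for class".split()  # toy stand-in for d
counts = Counter(corpus)                    # n_w(d)
total = sum(counts.values())                # n_w*(d)
vocab = set(counts) | {"<UNK>"}             # |V| includes one extra slot for <UNK>
ALPHA = 1.0                                 # alpha is typically small (0.5 to 2)

def smoothed_prob(word):
    """P(w) = (n_w(d) + alpha) / (n_w*(d) + alpha * |V|)."""
    if word not in counts:
        word = "<UNK>"                      # unseen words are mapped to <UNK>
    return (counts[word] + ALPHA) / (total + ALPHA * len(vocab))

print(smoothed_prob("Anqi"))     # seen word: (1 + 1) / (5 + 6)
print(smoothed_prob("COVID19"))  # unseen word: (0 + 1) / (5 + 6), no longer zero
```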
Problems 3 and 4: Context doesn’t play a role at all
𝑃(“Anqi was late for class”) = 𝑃(“class for was late Anqi”)
Question: How can we factor in context?
Easiest Approach:
Instead of words being completely independent, condition each word on its immediate predecessor
Bigram LM

Alternative Approach: bigram model. Look at pairs of consecutive words.

Let X = "Anqi was late for class" (w_1 w_2 w_3 w_4 w_5)

P(X) = P(was|Anqi) P(late|was) P(for|late) P(class|for)

You calculate each of these probabilities by simply counting the occurrences:

P(class|for) = count(for class) / count(for)
P(w'|w) = n_{w,w'}(d) / n_{w,*}(d)

n_{w,w'}(d) = # of times words w and w' appear together as a bigram in d
n_{w,*}(d) = # of times word w is the first token of a bigram in d

Using the earlier corpus counts ("for" appears 3,000 times in our 100,000-word corpus d), and with count(for class) = 12:

P(class|for) = 12 / 3,000 = 0.004
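As a sketch, the bigram counts and the conditional estimate can be computed directly from a token list (toy corpus; the names bigram_counts / first_counts / bigram_prob are ours):

```python
from collections import Counter

tokens = "Anqi was late for class".split()        # toy stand-in for d

bigram_counts = Counter(zip(tokens, tokens[1:]))  # n_{w,w'}(d)
first_counts = Counter(tokens[:-1])               # n_{w,*}(d): w as first token of a bigram

def bigram_prob(w, w_next):
    """P(w'|w) = n_{w,w'}(d) / n_{w,*}(d)."""
    return bigram_counts[(w, w_next)] / first_counts[w]

print(bigram_prob("for", "class"))  # count(for class) / count(for) = 1/1 here
```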
BIGRAM ISSUES?
1. Out-of-vocabulary bigrams are 0 → kills the overall probability
2. Could always benefit from more context, but sparsity is an issue (e.g., rarely seen 5-grams)
3. Storage becomes a problem as we increase the window size
4. No semantic information conveyed by counts (e.g., vehicle vs car)
Problem 1: Out-of-vocabulary bigrams

Our current bigram probabilities: P(w'|w) = n_{w,w'}(d) / n_{w,*}(d)

How we smoothed unigrams: P(w) = (n_w(d) + α) / (n_w*(d) + α|V|)

|V| = the # of unique word types in the vocabulary (including an extra 1 for <UNK>)

Q: What should we do?
Imagine our current string 𝑥 includes “COVID19 harms ribofliptonik …”
In our training corpus 𝑑, we’ve never seen:
“COVID19 harms” or “harms ribofliptonik”
But we’ve seen the unigram “harms”, which provides useful information:
Solution: unigram-backoff for smoothing

P(w'|w) = (n_{w,w'}(d) + β·P(w')) / (n_{w,*}(d) + β)

P(w') = (n_{w'}(d) + α) / (n_w*(d) + α|V|)

|V| = the # of unique word types in the vocabulary (including an extra 1 for <UNK>)

Our model is now properly parameterized with α and β. So, instead of calculating the probability of text, we are actually interested in fixing the parameters at particular values and determining the likelihood of the data.

For a fixed α and β:

θ(w'|w) = (n_{w,w'}(d) + β·θ(w')) / (n_{w,*}(d) + β)

θ(w') = (n_{w'}(d) + α) / (n_w*(d) + α|V|)
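A sketch of the backoff estimate for fixed α and β, again on a toy corpus (the names theta_unigram / theta_bigram are ours):

```python
from collections import Counter

tokens = "Anqi was late for class".split()        # toy stand-in for d
counts = Counter(tokens)                          # n_w(d)
total = len(tokens)                               # n_w*(d)
vocab = set(counts) | {"<UNK>"}                   # |V| includes <UNK>
bigram_counts = Counter(zip(tokens, tokens[1:]))  # n_{w,w'}(d)
first_counts = Counter(tokens[:-1])               # n_{w,*}(d)
ALPHA, BETA = 1.0, 0.5                            # fixed smoothing parameters

def theta_unigram(w):
    """theta(w') = (n_{w'}(d) + alpha) / (n_w*(d) + alpha * |V|)."""
    if w not in counts:
        w = "<UNK>"
    return (counts[w] + ALPHA) / (total + ALPHA * len(vocab))

def theta_bigram(w, w_next):
    """theta(w'|w) = (n_{w,w'}(d) + beta * theta(w')) / (n_{w,*}(d) + beta)."""
    return (bigram_counts[(w, w_next)] + BETA * theta_unigram(w_next)) / (
        first_counts[w] + BETA)

# An unseen bigram now backs off to the smoothed unigram instead of zero:
print(theta_bigram("COVID19", "harms"))
```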
IMPORTANT: It is common to pad sentences with <S> tokens on each side, which serve as boundary markers. This helps LMs learn the transitions between sentences.

Let X = "I ate. Did you?" (w_1 w_2 w_3 w_4)
→ X = "<S> I ate <S> Did you? <S>" (w_1 w_2 w_3 w_4 w_5 w_6 w_7)
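A two-line sketch of the padding step:

```python
# Join sentences with <S> boundary tokens so bigram counts capture transitions.
sentences = [["I", "ate"], ["Did", "you?"]]

padded = ["<S>"]
for sentence in sentences:
    padded += sentence + ["<S>"]

print(padded)  # ['<S>', 'I', 'ate', '<S>', 'Did', 'you?', '<S>']
```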
Generation

• We can also use these LMs to generate text
• Generate the very first token manually by making it be <S>
• Then, generate the next token by sampling from the probability distribution of possible next tokens (the set of possible next tokens sums to 1)
• When you generate <S> again, that represents the end of the current sentence
Example of Bigram Generation

• Force a <S> as the first token
• Of the bigrams that start with <S>, probabilistically pick one based on their likelihoods
• Let's say the chosen bigram was <S>_The
• Repeat the process, but now condition on "The". So, perhaps the next selected bigram is The_dog
• The sentence is complete when you generate a bigram whose second half is <S> (a runnable sketch of this loop follows below)
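Here is a minimal sketch of that loop; the toy padded corpus and the names next_counts / generate_sentence are ours, not the lecture's:

```python
import random
from collections import Counter, defaultdict

# Toy padded corpus: sentences joined by <S> boundary tokens.
tokens = ["<S>", "The", "dog", "ran", "<S>", "The", "dog", "ate", "<S>"]

# For each first token, count the possible next tokens; normalizing each
# row of counts gives the distribution we sample from.
next_counts = defaultdict(Counter)
for w, w_next in zip(tokens, tokens[1:]):
    next_counts[w][w_next] += 1

def generate_sentence():
    sentence, current = [], "<S>"              # force <S> as the first token
    while True:
        options = next_counts[current]
        current = random.choices(list(options), weights=list(options.values()))[0]
        if current == "<S>":                   # generating <S> ends the sentence
            return " ".join(sentence)
        sentence.append(current)

print(generate_sentence())  # e.g., "The dog ran"
```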
Imagine more context
Better Approach: n-gram model

Let's factor in context (in practice, a window of size n):

P(x_1, …, x_n) = ∏_{t=1}^{n} p(x_t | x_{t-1}, …, x_1)
The likelihood of any event occurring hinges upon all prior events occurring. This compounds for all subsequent events, too.
Perplexity
N-gram models seem useful, but how can we measure how good they are?
Can we just use the likelihood values?
Almost!
The likelihood values aren’t adjusted for the length of sequences, so we would need to normalize by the sequence lengths.
The best language model is one that best predicts an unseen test set.

Perplexity, denoted as PP, is the inverse probability of the test set, normalized by the number of words:

PP(w_1, …, w_N) = p(w_1, w_2, …, w_N)^(-1/N) = (1 / p(w_1, w_2, …, w_N))^(1/N)
Perplexity is also equivalent to the exponentiated negative normalized log-likelihood:

PP(w_1, …, w_N) = p(w_1, w_2, …, w_N)^(-1/N) = 2^(-l), where l = (1/N) ∑_{i=1}^{N} log2(p(w_i))
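A sketch of this computation, with the per-word model probabilities passed in directly (the function name perplexity is ours):

```python
import math

def perplexity(word_probs):
    """PP = 2^(-l), where l = (1/N) * sum_i log2(p(w_i))."""
    N = len(word_probs)
    l = sum(math.log2(p) for p in word_probs) / N
    return 2 ** -l

# A uniform model over 3 word types has perplexity 3, whatever the length N:
print(perplexity([1 / 3] * 10))  # 3.0 (up to floating point)
```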
Closely related to entropy, perplexity measures the uncertainty of the model for a particular dataset. Very high perplexity scores correspond to having tons of uncertainty (which is bad).

Perplexity can be viewed as the branching factor at each step, and its log as the average number of bits needed to represent each word. That is, the more branches at each step, the more uncertainty there is.
Good models tend to have perplexity scores around 40-100 on large, popular corpora.

If our model assumed a uniform distribution of words, then our perplexity score would be |V|, the # of unique word types.
Example: let our corpus X have only 3 unique words {the, dog, ran} but have a length of N:

PP(X) = (1 / (1/3)^N)^(1/N) = (3^N)^(1/N) = 3
More generally, if we have M unique, equally likely words for a sequence of length N:

PP(X) = (1 / (1/M)^N)^(1/N) = (M^N)^(1/N) = M
Example perplexity scores, when trained on a corpus of 38 million words and tested on 1.5 million words:

model     perplexity
unigram   962
bigram    170
trigram   109
![Page 86: Lecture 22: Language Models - GitHub Pages](https://reader031.vdocument.in/reader031/viewer/2022020707/61ff5afbdefab22b6e69a0bb/html5/thumbnails/86.jpg)
SUMMARY• Language models estimate the probability of sequences and can
predict the most likely next word
• We can probabilistically generate sequences of words
• We can measure performance of any language model
• Unigrams provide no context and are not good
• Bi-grams and Tri-grams are better but still have serious weaknesses