Word Classes and Part of Speech Tagging (Chapter 5)
TRANSCRIPT
Word classes and part of speech tagging
Chapter 5
Slide 2
Outline
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
In Part 2: finish stochastic tagging, then continue on to evaluation.
Slide 3
An Example
WORD:   the   girl   kissed  the   boy   on     the   cheek
LEMMA:  the   girl   kiss    the   boy   on     the   cheek
TAG:    +DET  +NOUN  +VPAST  +DET  +NOUN  +PREP  +DET  +NOUN
Slide 4
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives, and purpose
Slide 5
Word Classes: Tag set example
PRP (personal pronoun), PRP$ (possessive pronoun)
Slide 6
Example of Penn Treebank Tagging of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
See http://www.infogistics.com/posdemo.htm
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo
Slide 7
The Problem
Words often have more than one word class, e.g., "this":
This is a nice day = PRP
This day is nice = DT
You can go this far = RB
Slide 8
Word Class Ambiguity (in the Brown Corpus)
Unambiguous (1 tag): 35,340
Ambiguous (2-7 tags): 4,100
2 tags 3,760
3 tags 264
4 tags 61
5 tags 12
6 tags 2
7 tags 1
(Derose, 1988)
Slide 9
Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill) (we won't cover this)
Slide 10
Rule-Based Tagging
• Basic Idea:
– Assign all possible tags to each word
– Remove tags according to a set of rules of the type: if word+1 is an adjective, adverb, or quantifier, and the following word is a sentence boundary, and word-1 is not a verb like "consider", then eliminate non-adverb tags, else eliminate the adverb tag.
– Typically more than 1000 hand-written rules
Slide 11
Sample ENGTWOL Lexicon
Demo: http://www2.lingsoft.fi/cgi-bin/engtwol
Slide 12
Stage 1 of ENGTWOL Tagging
First Stage: Run words through a morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
Slide 13
Stage 2 of ENGTWOL Tagging
Second Stage: Apply constraints.
Constraints are used in a negative way: they eliminate candidate tags.
Example: Adverbial "that" rule
Given input: "that"
If
(+1 A/ADV/QUANT)    ; the next word is an adjective, adverb, or quantifier
(+2 SENT-LIM)       ; and the word after that is a sentence boundary
(NOT -1 SVOC/A)     ; and the previous word is not a verb like "consider"
Then eliminate non-ADV tags
Else eliminate the ADV tag
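The constraint above can be sketched as a simple elimination step. This is a hypothetical re-implementation for illustration, not ENGTWOL's actual rule engine; the verb list and tag names are illustrative assumptions.

```python
def apply_that_rule(tokens, tags, i):
    """A toy version of the adverbial "that" constraint: prune the
    candidate-tag set for tokens[i] based on its neighbours.
    tokens is a list of words; tags is a parallel list of tag sets."""
    # Hypothetical subset of SVOC/A verbs ("consider that odd")
    svoc_a_verbs = {"consider", "believe", "find"}
    next_is_a_adv_quant = i + 1 < len(tokens) and bool(tags[i + 1] & {"A", "ADV", "QUANT"})
    then_sent_lim = i + 2 >= len(tokens) or tokens[i + 2] in {".", "!", "?"}
    prev_not_svoc_a = i == 0 or tokens[i - 1].lower() not in svoc_a_verbs
    if next_is_a_adv_quant and then_sent_lim and prev_not_svoc_a:
        tags[i] = {"ADV"}                       # eliminate non-ADV tags
    else:
        tags[i] = tags[i] - {"ADV"} or tags[i]  # eliminate the ADV tag
    return tags[i]
```

For "isn't that odd .", the rule keeps only ADV; for "that day", it removes the ADV reading and leaves the rest.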
Slide 14
Stochastic Tagging
• Based on the probability of a certain tag occurring, given various possibilities
• Requires a training corpus
• No probabilities for words not in the corpus
Slide 15
Stochastic Tagging (cont.)
• Simple Method: Choose the most frequent tag in the training text for each word!
– Result: 90% accuracy
– This is the baseline
– Others will do better
– HMM is an example
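The most-frequent-tag baseline is a few lines of code; a minimal sketch (function names are my own, not from the slides):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: for each word, remember the tag it
    received most often in the training corpus of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_frequent, default="NN"):
    # Words never seen in training get a default tag, since no
    # probabilities exist for words outside the corpus.
    return [most_frequent.get(w, default) for w in words]
```

Despite its simplicity, this baseline is what the slide's 90% figure refers to; any serious tagger must beat it.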
Slide 16
HMM Tagger
• Intuition: Pick the most likely tag for this word.
• Let T = t1, t2, …, tn and W = w1, w2, …, wn
• Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.
Slide 17
Toward a Bigram-HMM Tagger
argmaxT P(T|W)
= argmaxT P(T) P(W|T)
= argmaxT P(t1…tn) P(w1…wn | t1…tn)
≈ argmaxT [P(t1) P(t2|t1) … P(tn|tn-1)] [P(w1|t1) P(w2|t2) … P(wn|tn)]

To tag a single word: ti = argmaxj P(tj|ti-1) P(wi|tj)
How do we compute P(ti|ti-1)?  c(ti-1, ti) / c(ti-1)
How do we compute P(wi|ti)?  c(wi, ti) / c(ti)
How do we compute the most probable tag sequence?  Viterbi
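The two count ratios above translate directly into code; a minimal sketch (the `<s>` start pseudo-tag and function names are my own conventions):

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood estimates for a bigram HMM tagger:
    transition P(t_i | t_{i-1}) = c(t_{i-1}, t_i) / c(t_{i-1})
    emission   P(w_i | t_i)     = c(w_i, t_i)   / c(t_i)
    A pseudo-tag <s> marks the start of each sentence."""
    tag_c, bigram_c, emit_c = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        tag_c[prev] += 1
        for word, tag in sent:
            bigram_c[(prev, tag)] += 1
            emit_c[(word, tag)] += 1
            tag_c[tag] += 1
            prev = tag
    trans = lambda t, prev: bigram_c[(prev, t)] / tag_c[prev]
    emit = lambda w, t: emit_c[(w, t)] / tag_c[t]
    return trans, emit
```

In practice these raw ratios would be smoothed, since unseen bigrams and word/tag pairs get probability zero.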
Slide 18 (04/19/23, Speech and Language Processing - Jurafsky and Martin)
Disambiguating “race”
Slide 19
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012

P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032

So we (correctly) choose the verb reading.
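The arithmetic of the two readings is easy to check directly (the numbers are the slide's estimates):

```python
# Probability estimates from the slide's "race" example
p = {
    ("NN", "TO"): 0.00047, ("VB", "TO"): 0.83,
    ("race", "NN"): 0.00057, ("race", "VB"): 0.00012,
    ("NR", "VB"): 0.0027, ("NR", "NN"): 0.0012,
}
# Product of transition and emission terms for each reading
verb = p[("VB", "TO")] * p[("NR", "VB")] * p[("race", "VB")]
noun = p[("NN", "TO")] * p[("NR", "NN")] * p[("race", "NN")]
assert verb > noun  # the verb reading wins by about three orders of magnitude
```

Although "race" is far more often a noun overall, the transition probability P(VB|TO) dominates: after "to", a verb is vastly more likely.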
Slide 20
Hidden Markov Models
What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
Slide 21
Definitions
A weighted finite-state automaton adds probabilities to the arcs.
The probabilities on arcs leaving a node must sum to one.
A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through.
Markov chains can't represent ambiguous problems.
They are useful for assigning probabilities to unambiguous sequences.
Slide 22
Markov Chain for Weather
Slide 23
Markov Chain for Words
Slide 24
Markov Chain: “First-order observable Markov Model”
A set of states Q = q1, q2, …, qN; the state at time t is qt.
Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann.
Each aij represents the probability of transitioning from state i to state j.
The set of these is the transition probability matrix A.
The current state depends only on the previous state.
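The definitions above can be made concrete with a toy two-state weather chain; the probabilities here are hypothetical, chosen only so each row of A sums to one:

```python
# Toy first-order Markov chain: rows are the current state,
# columns the next state; each row of A sums to one.
A = {"HOT": {"HOT": 0.7, "COLD": 0.3},
     "COLD": {"HOT": 0.4, "COLD": 0.6}}

def sequence_probability(states, start="HOT"):
    """Probability of a state sequence: the product of the transition
    arcs, using the Markov assumption that each state depends only on
    its predecessor (and a fixed, known start state)."""
    prob, prev = 1.0, start
    for s in states:
        prob *= A[prev][s]
        prev = s
    return prob
```

Because the state sequence here is the observation itself, this chain is unambiguous, which is exactly why a plain Markov chain cannot model tagging, where the states are hidden.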
Slide 25
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming.
You can't find any records of the weather in Baltimore, MD for the summer of 2007.
But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer.
Our job: figure out how hot it was.
Slide 26
Hidden Markov Model
For Markov chains, the symbols are the same as the states: see hot weather, we're in state hot.
But in part-of-speech tagging:
the output symbols are words,
but the hidden states are part-of-speech tags.
A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states.
This means we don't know which state we are in.
Slide 27
Hidden Markov Models
States Q = q1, q2, …, qN
Observations O = o1, o2, …, oN; each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
Transition probabilities: transition probability matrix A = {aij}
Observation likelihoods: output probability matrix B = {bi(k)}
Special initial probability vector π
Slide 28
Eisner Task
Given: ice cream observation sequence: 1,2,3,2,2,2,3…
Produce: weather sequence: H,C,H,H,H,C…
Slide 29
HMM for Ice Cream
Slide 30
Transition Probabilities
Slide 31
Observation Likelihoods
Slide 32
Decoding
OK, now we have a complete model that can give us what we need.
We could just enumerate all paths given the input and use the model to assign probabilities to each.
Not a good idea.
Luckily, dynamic programming (also seen in Ch. 3 with minimum edit distance, but we didn't cover it) helps us here.
Slide 33
Viterbi Algorithm
• The Viterbi algorithm is used to compute the most likely tag sequence in O(W × T²) time, where T is the number of possible part-of-speech tags and W is the number of words in the sentence.
• The algorithm sweeps through all the tag possibilities for each word, computing the best sequence leading to each possibility. The key that makes this algorithm efficient is that we only need to know the best sequences leading to the previous word, because of the Markov assumption.
Slide 34
Computing the Probability of a Sentence and Tags
We want to find the sequence of tags that maximizes the formula P(T1…Tn | w1…wn), which can be estimated as:

P(T1…Tn | w1…wn) ≈ ∏ i=1..n P(Ti | Ti-1) * P(wi | Ti)

P(Ti | Ti-1) is computed by multiplying the arc values in the HMM.
P(wi | Ti) is computed by multiplying the lexical generation probabilities associated with each word.
Slide 35
The Viterbi Algorithm
Let T = # of part-of-speech tags
    W = # of words in the sentence

/* Initialization Step */
for t = 1 to T
    Score(t, 1) = Pr(Word1 | Tagt) * Pr(Tagt | φ)
    BackPtr(t, 1) = 0

/* Iteration Step */
for w = 2 to W
    for t = 1 to T
        Score(t, w) = Pr(Wordw | Tagt) * MAXj=1,T (Score(j, w-1) * Pr(Tagt | Tagj))
        BackPtr(t, w) = index of j that gave the max above

/* Sequence Identification */
Seq(W) = t that maximizes Score(t, W)
for w = W-1 to 1
    Seq(w) = BackPtr(Seq(w+1), w+1)
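The pseudocode above translates directly into Python. This is a minimal sketch, assuming the transition and emission probabilities are supplied as dictionaries (the toy numbers in the usage below are hypothetical):

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Viterbi decoding: score[w][t] is the probability of the best tag
    sequence ending in tag t at word index w; backptr records the argmax
    so the best sequence can be read off backwards."""
    n = len(words)
    score = [{} for _ in range(n)]
    backptr = [{} for _ in range(n)]
    for t in tags:  # initialization step
        score[0][t] = emit.get((words[0], t), 0.0) * trans.get((start, t), 0.0)
        backptr[0][t] = None
    for w in range(1, n):  # iteration step
        for t in tags:
            best_prev = max(tags, key=lambda j: score[w - 1][j] * trans.get((j, t), 0.0))
            score[w][t] = (emit.get((words[w], t), 0.0)
                           * score[w - 1][best_prev] * trans.get((best_prev, t), 0.0))
            backptr[w][t] = best_prev
    seq = [max(tags, key=lambda t: score[n - 1][t])]  # sequence identification
    for w in range(n - 1, 0, -1):
        seq.append(backptr[w][seq[-1]])
    return list(reversed(seq))
```

The nested loops mirror the O(W × T²) cost: for each word, each tag considers all T predecessors.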
Slide 36
Viterbi
Example in lecture
Slide 37
Evaluation
Once you have your POS tagger running, how do you evaluate it?
Overall error rate with respect to a gold-standard test set
Error rates on particular tags
Error rates on particular words
Tag confusions...
Slide 38
Error Analysis
Look at a confusion matrix.
See what errors are causing problems:
Noun (NN) vs Proper Noun (NNP) vs Adjective (JJ)
Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
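A confusion matrix is just a count over (gold tag, predicted tag) pairs; a minimal sketch:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count (gold tag, predicted tag) pairs over aligned sequences.
    Diagonal cells are correct decisions; off-diagonal cells reveal
    which confusions (e.g. NN vs JJ, VBD vs VBN) dominate the errors."""
    return Counter(zip(gold, predicted))
```

Sorting the off-diagonal cells by count gives exactly the error-analysis view the slide describes.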
Slide 39
Evaluation
The result is compared with a manually coded "Gold Standard".
Typically accuracy reaches 96-97%.
This may be compared with the result for a baseline tagger (one that uses no context).
Important: 100% is impossible even for human annotators.
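Accuracy against the gold standard is the fraction of tokens tagged correctly; a minimal sketch:

```python
def accuracy(predicted, gold):
    """Tagging accuracy: fraction of tokens whose predicted tag matches
    the manually coded gold standard."""
    assert len(predicted) == len(gold), "sequences must be aligned"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

The same function scores both the real tagger and the no-context baseline, making the comparison the slide mentions a one-liner.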
Slide 40
Summary
Parts of speech
Tagsets
Part of speech tagging
HMM tagging
Markov chains
Hidden Markov Models