Computational Lexicology, Morphology and Syntax
Course 5
Diana Trandabăț
Academic year: 2014-2015
Word Classes
• Words that somehow 'behave' alike:
  – Appear in similar contexts
  – Perform similar functions in sentences
  – Undergo similar transformations
• ~9 traditional word classes of parts of speech (POS):
  – Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction
Some Examples
• N    noun         chair, bandwidth, pacing
• V    verb         study, debate, munch
• ADJ  adjective    purple, tall, ridiculous
• ADV  adverb       unfortunately, slowly
• P    preposition  of, by, to
• PRO  pronoun      I, me, mine
• DET  determiner   the, a, that, those
Defining POS Tagging
• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
  WORDS: the  koala  put  the  keys  on  the  table
  TAGS:  DET  N      V    DET  N     P   DET  N
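At its simplest, this assignment is a dictionary lookup. A minimal sketch (the toy lexicon and its tags are illustrative assumptions, not a real tagset resource):

```python
# Minimal sketch of POS tagging as dictionary lookup.
# The toy lexicon below is invented for illustration.
LEXICON = {
    "the": "DET", "koala": "N", "put": "V",
    "keys": "N", "on": "P", "table": "N",
}

def tag(sentence):
    """Assign each word its (single) lexicon tag, or 'UNK' if unseen."""
    return [(w, LEXICON.get(w, "UNK")) for w in sentence.split()]

print(tag("the koala put the keys on the table"))
```

Of course, real words carry more than one possible tag; resolving that ambiguity is the actual tagging problem discussed below.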
Applications for POS Tagging
• Speech synthesis: pronunciation
  – lead (N) vs. lead (V)
  – INsult vs. inSULT
  – OBject vs. obJECT
  – OVERflow vs. overFLOW
  – DIScount vs. disCOUNT
  – CONtent vs. conTENT
• Word sense disambiguation: e.g. Time flies like an arrow
  – Is flies an N or a V?
• Word prediction in speech recognition
  – Possessive pronouns (my, your, her) are likely to be followed by nouns
  – Personal pronouns (I, you, he) are likely to be followed by verbs
• Machine translation
Closed vs. Open Class of Words
• Closed class categories are composed of a small, fixed set of grammatical words for a given language.
  – Prepositions: of, in, by, …
  – Auxiliary verbs: may, can, will, had, been, …
  – Pronouns: I, you, she, mine, his, them, …
  – Usually function words (short common words which play a role in grammar)
• Open class categories have a large number of words, and new ones are easily invented.
  – English has 4 categories: nouns, verbs, adjectives, adverbs
  – Nouns: Googler
  – Verbs: Google
  – Adjectives: geeky
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition:
  – I like/VBP candy.
  – Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or adverb:
  – I bought it at the shop around/IN the corner.
  – I never got around/RP to getting a car.
  – A new Prius costs around/RB $25K.
Open Class Words
• Nouns
  – Proper nouns
    • Columbia University, New York City, Arthi Ramachandran, Metropolitan Transit Center
    • English (and Romanian, and many other languages) capitalizes these
    • Many have abbreviations
  – Common nouns
    • All the rest
  – Count nouns vs. mass nouns
    • Count: have plurals, countable: goat/goats, one goat, two goats
    • Mass: not countable (fish, salt, communism) (?two fishes)
• Adjectives: identify properties or qualities of nouns
  – Color, size, age, …
  – Adjective ordering restrictions in English:
    • Old blue book, not *Blue old book
  – In Korean, adjectives are realized as verbs
• Adverbs: also modify things (verbs, adjectives, adverbs)
  – The very happy man walked home extremely slowly yesterday.
  – Directional/locative adverbs (here, home, downhill)
  – Degree adverbs (extremely, very, somewhat)
  – Manner adverbs (slowly, silkily, delicately)
  – Temporal adverbs (Monday, tomorrow)
• Verbs:
  – In most languages take morphological affixes (eat/eats/eaten)
  – Represent actions (walk, ate), processes (provide, see), and states (be, seem)
  – Many subclasses, e.g.
    • eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN, eating/VBG, …
    • Reflect morphological form & syntactic function
How Do We Assign Words to Open or Closed?
• Nouns denote people, places and things, and can be preceded by articles. But…
  – My typing is very bad.
  – *The Mary loves John.
• Verbs are used to refer to actions, processes, states
  – But some are closed class and some are open: I will have emailed everyone by noon.
• Adverbs modify actions
  – Is Monday a temporal adverbial or a noun?
Closed Class Words
• Closed class words (Prep, Det, Pron, Conj, Aux, Part, Num) are generally easy to process, since we can enumerate them… but
  – Is it a particle or a preposition?
    • George eats up his dinner. / George eats his dinner up.
    • George eats up the street. / *George eats the street up.
  – Articles come in 2 flavors: definite (the) and indefinite (a, an)
    • What is this in 'this guy…'?
Choosing a POS Tagset
• To do POS tagging, we first need to choose a set of tags
• Could pick a very coarse (small) tagset: N, V, Adj, Adv
• More informative but more difficult to tag: Brown Corpus (Francis & Kucera '82), 1M words, 87 tags
• Most commonly used: Penn Treebank, a hand-annotated corpus of the Wall Street Journal, 1M words, 45 tags
English Parts of Speech
• Noun (person, place or thing)
  – Singular (NN): dog, fork
  – Plural (NNS): dogs, forks
  – Proper (NNP, NNPS): John, Springfields
• Pronouns
  – Personal pronoun (PRP): I, you, he, she, it
  – Wh-pronoun (WP): who, what
• Verb (actions and processes)
  – Base, infinitive (VB): eat
  – Past tense (VBD): ate
  – Gerund (VBG): eating
  – Past participle (VBN): eaten
  – Non-3rd person singular present tense (VBP): eat
  – 3rd person singular present tense (VBZ): eats
  – Modal (MD): should, can
  – To (TO): to (to eat)
English Parts of Speech (cont.)
• Adjective (modify nouns)
  – Basic (JJ): red, tall
  – Comparative (JJR): redder, taller
  – Superlative (JJS): reddest, tallest
• Adverb (modify verbs)
  – Basic (RB): quickly
  – Comparative (RBR): quicker
  – Superlative (RBS): quickest
• Preposition (IN): on, in, by, to, with
• Determiner
  – Basic (DT): a, an, the
  – WH-determiner (WDT): which, that
• Coordinating conjunction (CC): and, but, or
• Particle (RP): off (took off), up (put up)
Penn Treebank Tagset
Part of Speech (POS) Tagging
• Lowest level of syntactic analysis.

  John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN
POS Tagging Process
• Usually assume a separate initial tokenization process that separates and/or disambiguates punctuation, including detecting sentence boundaries.
• Degree of ambiguity in English (based on the Brown corpus):
  – 11.5% of word types are ambiguous.
  – 40% of word tokens are ambiguous.
• Average POS-tagging disagreement among expert human judges for the Penn Treebank was 3.5%
  – Based on correcting the output of an initial automated tagger, which was deemed to be more accurate than tagging from scratch.
• Baseline: picking the most frequent tag for each specific word type gives about 90% accuracy
  – 93.7% if we use a model for unknown words, for the Penn Treebank tagset.
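The most-frequent-tag baseline is easy to reproduce on a toy scale. A sketch (the miniature tagged corpus is invented for illustration):

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs.
train = [("the", "DT"), ("dog", "NN"), ("saw", "VBD"),
         ("the", "DT"), ("saw", "NN"), ("dog", "NN"),
         ("saw", "VBD"), ("a", "DT"), ("dog", "NN")]

# Count tags per word type, then keep the most frequent one.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
baseline = {w: c.most_common(1)[0][0] for w, c in counts.items()}

# "saw" appeared as VBD twice and NN once, so the baseline always says VBD.
print(baseline["saw"])   # VBD
print(baseline["dog"])   # NN
```

On real corpora this single-heuristic tagger already reaches roughly the 90% figure quoted above, which is why it is the standard baseline.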
POS Tagging Approaches
• Rule-based: human-crafted rules based on lexical and other linguistic knowledge.
• Learning-based: trained on human-annotated corpora like the Penn Treebank.
  – Statistical models: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)
  – Rule learning: Transformation-Based Learning (TBL)
• Generally, learning-based approaches have been found to be more effective overall, taking into account the total amount of human expertise and effort involved.
Simple taggers
• A default tagger has one tag per word, and assigns it on the basis of dictionary lookup
  – Tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb
• Words may be assigned different tags with associated probabilities
  – The tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct
• Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences: negative rules)
Rule-based taggers
• Earliest type of tagging: two stages
• Stage 1: look up each word in a lexicon to give a list of potential POSs
• Stage 2: apply rules which certify or disallow tag sequences
• Rules were originally handwritten; more recently, machine learning methods can be used
Start with a POS Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English
Assign All Possible POS to Each Word

  She   promised   to   back   the   bill
  PRP   VBN        TO   VB     DT    NN
        VBD             JJ           VB
                        RB
                        NN
Apply Rules Eliminating Some POS
E.g., eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

  She   promised   to   back   the   bill
  PRP   VBN        TO   VB     DT    NN
        VBD             JJ           VB
                        RB
                        NN
Apply Rules Eliminating Some POS
E.g., eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

  She   promised   to   back   the   bill
  PRP   VBD        TO   VB     DT    NN
                        JJ           VB
                        RB
                        NN
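The two-stage process (lexicon lookup, then rule-based elimination) can be sketched in a few lines. The lexicon and rule follow the slide's example; the code itself is an illustrative assumption:

```python
# Stage 1: lexicon lookup gives every possible tag for each word.
LEXICON = {"she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
           "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"]}

def tag_candidates(words):
    return [list(LEXICON[w.lower()]) for w in words]

# Stage 2: a constraint prunes impossible readings.
def apply_rule(cands):
    """Eliminate VBN if VBD is an option when VBN|VBD follows <start> PRP."""
    if len(cands) > 1 and cands[0] == ["PRP"]:
        if "VBN" in cands[1] and "VBD" in cands[1]:
            cands[1].remove("VBN")
    return cands

words = "She promised to back the bill".split()
cands = apply_rule(tag_candidates(words))
print(cands[1])  # ['VBD'] -- VBN eliminated
```

Note that "back" keeps all four candidates; this one rule only resolves the VBN/VBD ambiguity, and a real system needs many such rules.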
EngCG (ENGTWOL) Tagger
• A richer dictionary includes morphological and syntactic features (e.g. subcategorization frames) as well as possible POS
• Uses two-level morphological analysis on the input and returns all possible POS
• Applies negative constraints (as many as 3,744) to rule out incorrect POS
Sample ENGTWOL Dictionary
ENGTWOL Tagging: Stage 1
• First stage: run words through an FST morphological analyzer to get POS info from morphology
• E.g.: Pavlov had shown that salivation …

  Pavlov      PAVLOV N NOM SG PROPER
  had         HAVE V PAST VFIN SVO
              HAVE PCP2 SVO
  shown       SHOW PCP2 SVOO SVO SV
  that        ADV
              PRON DEM SG
              DET CENTRAL DEM SG
              CS
  salivation  N NOM SG
ENGTWOL Tagging: Stage 2
• Second stage: apply NEGATIVE constraints
• E.g., the adverbial that rule
  – Eliminate all readings of that except the one in It isn't that odd.

  Given input: that
  If
    (+1 A/ADV/QUANT)   ; if the next word is an adjective/adverb/quantifier
    (+2 SENT-LIM)      ; followed by end-of-sentence
    (NOT -1 SVOC/A)    ; and the previous word is not a verb like consider,
                       ;   which allows adjective complements (e.g. I consider that odd)
  Then eliminate non-ADV tags
  Else eliminate ADV
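The constraint's logic can be mimicked in a few lines. This is only a sketch: real ENGTWOL tests morphological features, and the word lists below are stand-in assumptions:

```python
# Illustrative sketch of the "adverbial that" negative constraint.
ADJ_ADV_QUANT = {"odd", "strange", "many"}   # stands in for +1 A/ADV/QUANT
SVOC_A_VERBS = {"consider", "deem"}          # stands in for -1 SVOC/A

def readings_of_that(prev_word, next_word, next_next):
    """Return the surviving readings of 'that' under the constraint."""
    all_readings = {"ADV", "PRON", "DET", "CS"}
    if (next_word in ADJ_ADV_QUANT          # next word is adj/adv/quantifier
            and next_next == "<EOS>"        # followed by end-of-sentence
            and prev_word not in SVOC_A_VERBS):
        return {"ADV"}                      # eliminate non-ADV tags
    return all_readings - {"ADV"}           # else eliminate ADV

# "It isn't that odd." -> only the ADV reading survives
print(readings_of_that("isn't", "odd", "<EOS>"))   # {'ADV'}
```

A production system chains thousands of such constraints, each pruning readings until (ideally) one tag per word remains.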
How do they work?
• The tagger must be “trained”
• Many different techniques, but typically…
• A small “training corpus” is hand-tagged
• Tagging rules are learned automatically
• Rules define the most likely sequence of tags
• Rules are based on:
  – Internal evidence (morphology)
  – External evidence (context)
  – Probabilities
What probabilities do we have to learn?
(a) Individual word probabilities: the probability that a given tag t is appropriate for a given word w
  – Easy (in principle): learn from the training corpus:

      P(t|w) = f(t,w) / f(w)

    E.g., run occurs 4800 times in the training corpus, 3600 times as a verb and 1200 times as a noun, so P(verb|run) = 3600/4800 = 0.75
  – Problem of “sparse data”:
    • Add a small amount to each count, so we get no zeros
(b) Tag sequence probability: the probability that a given tag sequence t1, t2, …, tn is appropriate for a given word sequence w1, w2, …, wn
  – P(t1, t2, …, tn | w1, w2, …, wn) = ???
  – Too hard to calculate for the entire sequence:
      P(t1, t2, t3, t4, …) = P(t2|t1) · P(t3|t1, t2) · P(t4|t1, t2, t3) · …
  – A subsequence is more tractable
  – A sequence of 2 or 3 should be enough:
      Bigram model:  P(t1, t2) = P(t2|t1)
      Trigram model: P(t1, t2, t3) = P(t2|t1) · P(t3|t2)
      N-gram model:  P(t1, …, tn) = ∏ᵢ₌₁..ₙ P(tᵢ|tᵢ₋₁)
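Estimating the bigram transition probabilities P(tᵢ|tᵢ₋₁) from tagged data can be sketched as follows (the toy tag sequences are invented for illustration):

```python
from collections import Counter

# Invented tag sequences, as if read from a tagged corpus.
sequences = [["DT", "NN", "VBD"],
             ["DT", "NN", "NN", "VBD"],
             ["PRP", "VBD", "DT", "NN"]]

bigrams = Counter()    # counts of (t1, t2) pairs
unigrams = Counter()   # counts of t1 in first position of a pair
for seq in sequences:
    for t1, t2 in zip(seq, seq[1:]):
        bigrams[(t1, t2)] += 1
        unigrams[t1] += 1

def p_next(t2, t1):
    """Maximum-likelihood estimate of P(t2 | t1)."""
    return bigrams[(t1, t2)] / unigrams[t1] if unigrams[t1] else 0.0

print(p_next("NN", "DT"))   # 1.0: every DT here is followed by NN
```

These transition estimates, combined with the word probabilities from (a), are exactly the parameters an HMM tagger needs.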
More complex taggers
• Bigram taggers assign tags on the basis of sequences of two words (usually assigning a tag to wordₙ on the basis of wordₙ₋₁)
• An nth-order tagger assigns tags on the basis of sequences of n words
• As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations.
Stochastic taggers
• Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier… some primitive algorithms were already published in the '60s and '70s)
• The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)
(Hidden) Markov Models
• The probability calculations imply Markov models: we assume that P(t|w) is dependent only on the previous word (or a sequence of previous words)
• (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking into account too much of the past
• Markov chains can be modelled by finite state automata: the next state in a Markov chain is always dependent on some finite history of previous states
• The model is “hidden” if it is actually a succession of Markov models, whose intermediate states are of no interest
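A minimal HMM tagger with Viterbi decoding might look like this. It is only a sketch: all probabilities are invented toy values, not estimates from a real corpus:

```python
# Tiny HMM POS tagger with Viterbi decoding over a three-tag tagset.
TAGS = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.2, "VB": 0.2}          # P(t1)
trans = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},  # P(t_i | t_{i-1})
         "NN": {"DT": 0.2, "NN": 0.3, "VB": 0.5},
         "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
emit = {"DT": {"the": 0.9},                        # P(w | t)
        "NN": {"dog": 0.5, "saw": 0.2},
        "VB": {"saw": 0.6}}

def viterbi(words):
    """Return the most probable tag sequence for the words."""
    e = lambda t, w: emit[t].get(w, 1e-6)   # tiny floor for unseen pairs
    # Each cell holds (best probability so far, best path so far).
    v = [{t: (start[t] * e(t, words[0]), [t]) for t in TAGS}]
    for w in words[1:]:
        col = {}
        for t in TAGS:
            p, path = max((v[-1][s][0] * trans[s][t], v[-1][s][1])
                          for s in TAGS)
            col[t] = (p * e(t, w), path + [t])
        v.append(col)
    return max(v[-1].values())[1]

print(viterbi(["the", "dog", "saw"]))  # ['DT', 'NN', 'VB']
```

The "hidden" part is visible here: we observe only the words, and Viterbi recovers the most probable sequence of unobserved tag states.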
Supervised vs. unsupervised training
• Learning tagging rules from a marked-up corpus (supervised learning) gives very good results (98% accuracy)
  – Though assigning the most probable tag, and “proper noun” to unknowns, will give 90%
• But it depends on having a corpus already marked up to a high quality
• If this is not available, we have to try something else:
  – The “forward-backward” algorithm
  – A kind of “bootstrapping” approach
Forward-backward (Baum-Welch) algorithm
• Start with initial probabilities
  – If nothing is known, assume all Ps equal
• Adjust the individual probabilities so as to increase the overall probability
• Re-estimate the probabilities on the basis of the last iteration
• Continue until convergence
  – i.e. there is no improvement, or the improvement is below a threshold
• All this can be done automatically
Transformation-based tagging
• Eric Brill (1993)
• Start from an initial tagging, and apply a series of transformations
• Transformations are learned as well, from the training data
• Captures the tagging data in much fewer parameters than stochastic models
• The transformations learned (often) have linguistic “reality”
Transformation-based tagging
• Three stages:
  – Lexical look-up
  – Lexical rule application for unknown words
  – Contextual rule application to correct mis-tags
Transformation-based learning
• Change tag a to b when:
  – Internal evidence (morphology)
  – Contextual evidence:
    • One or more of the preceding/following words has a specific tag
    • One or more of the preceding/following words is a specific word
    • One or more of the preceding/following words has a certain form
• The order of rules is important
  – Rules can change a correct tag into an incorrect tag, so another rule might correct that “mistake”
Transformation-based tagging: examples
• If a word is currently tagged NN, and has a suffix of length 1 which consists of the letter 's', change its tag to NNS
• If a word has a suffix of length 2 consisting of the letter sequence 'ly', change its tag to RB (regardless of the initial tag)
• Change VBN to VBD if the previous word is tagged as NN
• Change VBD to VBN if the previous word is 'by'
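The lexical (suffix) rules are easy to express directly. A sketch of the rule format, not Brill's actual implementation:

```python
# Sketch of Brill-style lexical rules for unknown words,
# applied in order, each possibly overwriting the current tag.
def apply_lexical_rules(word, tag):
    if tag == "NN" and word.endswith("s"):
        tag = "NNS"            # NN + suffix 's' -> NNS
    if word.endswith("ly"):
        tag = "RB"             # suffix 'ly' -> RB, regardless of tag
    return tag

print(apply_lexical_rules("dogs", "NN"))     # NNS
print(apply_lexical_rules("quickly", "NN"))  # RB
```

Because the rules apply in sequence, ordering matters: a later rule can overwrite what an earlier rule assigned, exactly as described above.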
Transformation-based tagging: example

  Booth/NP killed/VBN Abraham/NP Lincoln/NP
  Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP
  He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP

Example after lexical lookup

  Booth/NP killed/VBD Abraham/NP Lincoln/NP
  Abraham/NP Lincoln/NP was/BEDZ shot/VBN by/BY Booth/NP
  He/PPS witnessed/VBD Lincoln/NP killed/VBN by/BY Booth/NP

Example after application of contextual rule 'vbd vbn NEXTWORD by'

  Booth/NP killed/VBD Abraham/NP Lincoln/NP
  Abraham/NP Lincoln/NP was/BEDZ shot/VBD by/BY Booth/NP
  He/PPS witnessed/VBD Lincoln/NP killed/VBD by/BY Booth/NP

Example after application of contextual rule 'vbn vbd PREVTAG np'
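A NEXTWORD contextual rule of the kind used above can be applied mechanically. A sketch, simplified to lists of (word, tag) pairs; the helper name is my own:

```python
# Sketch of a Brill NEXTWORD contextual rule over (word, tag) pairs.
def nextword_rule(tagged, frm, to, word):
    """Change tag `frm` to `to` when the next word is `word`."""
    out = list(tagged)
    for i in range(len(out) - 1):
        w, t = out[i]
        if t == frm and out[i + 1][0] == word:
            out[i] = (w, to)
    return out

sent = [("Lincoln", "NP"), ("was", "BEDZ"), ("shot", "VBD"),
        ("by", "BY"), ("Booth", "NP")]
# 'vbd vbn NEXTWORD by': shot/VBD -> shot/VBN before "by"
print(nextword_rule(sent, "VBD", "VBN", "by")[2])  # ('shot', 'VBN')
```

A PREVTAG rule is the mirror image: it inspects the tag at position i−1 instead of the word at position i+1.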
Templates for TBL
Sample TBL Rule Application
• Label every word with its most likely tag
  – E.g. race occurrences in the Brown corpus:
    • P(NN|race) = .98
    • P(VB|race) = .02
    • is/VBZ expected/VBN to/TO race/NN tomorrow/NN
• Then TBL applies the following rule
  – “Change NN to VB when the previous tag is TO”

  … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
  becomes
  … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
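The rule from the slide, applied in code (a sketch; the rule matches the slide, the helper name is mine):

```python
# Sketch: the contextual rule "Change NN to VB when previous tag is TO".
def prevtag_rule(tagged, frm, to, prev):
    out = list(tagged)
    for i in range(1, len(out)):
        w, t = out[i]
        if t == frm and out[i - 1][1] == prev:
            out[i] = (w, to)
    return out

sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
        ("race", "NN"), ("tomorrow", "NN")]
print(prevtag_rule(sent, "NN", "VB", "TO")[3])  # ('race', 'VB')
```

Note that tomorrow/NN is untouched, since its left neighbor is tagged NN rather than TO: the rule fires only where its context matches.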
TBL Tagging Algorithm
• Step 1: label every word with its most likely tag (from the dictionary)
• Step 2: check every possible transformation & select the one which most improves tag accuracy (cf. the gold standard)
• Step 3: re-tag the corpus applying this rule, and add the rule to the end of the rule set
• Repeat steps 2-3 until some stopping criterion is reached, e.g., X% correct with respect to the training corpus
• RESULT: an ordered set of transformation rules to use on new data tagged only with the most likely POS tags
TBL Issues
• Problem: could keep applying (new) transformations ad infinitum
• Problem: rules are learned in an ordered sequence
• Problem: rules may interact
• But: rules are compact and can be inspected by humans
Evaluating Tagging Approaches
• For any NLP problem, we need to know how to evaluate our solutions
• Possible gold standards (the ceiling):
  – An annotated naturally occurring corpus
  – Human task performance (96-97%)
    • How well do humans agree?
    • Kappa statistic: average pairwise agreement corrected for chance agreement
  – Can be hard to obtain for some tasks: sometimes humans don't agree
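Cohen's kappa for two annotators can be computed as follows (a sketch; the toy annotations are invented):

```python
from collections import Counter

def kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both would pick the same tag
    # if each chose according to their own tag distribution.
    p_chance = sum(ca[t] * cb[t] for t in ca) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

# Two invented annotators tagging the same six tokens.
ann1 = ["NN", "VB", "NN", "DT", "NN", "VB"]
ann2 = ["NN", "VB", "NN", "DT", "VB", "VB"]
print(round(kappa(ann1, ann2), 2))   # 0.74
```

Here raw agreement is 5/6 ≈ 0.83, but because both annotators favor NN and VB, chance alone would produce about 0.36 agreement, so kappa is lower than raw agreement.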
• Baseline: how well does a simple method do?
  – For tagging: the most common tag for each word (90%)
  – How much improvement do we get over the baseline?
Methodology: Error Analysis
• Confusion matrix:
  – E.g. which tags did we most often confuse with which other tags?
  – How much of the overall error does each confusion account for?

        VB   TO   NN
  VB
  TO
  NN
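Building such a matrix from gold and predicted tag sequences is a one-liner over paired counts (a sketch; the toy data is invented):

```python
from collections import Counter

def confusion_matrix(gold, pred):
    """Count (gold_tag, predicted_tag) pairs."""
    return Counter(zip(gold, pred))

gold = ["VB", "TO", "NN", "NN", "VB"]
pred = ["VB", "TO", "VB", "NN", "VB"]
cm = confusion_matrix(gold, pred)
print(cm[("NN", "VB")])   # 1: one NN was mistagged as VB
print(cm[("VB", "VB")])   # 2: VB was tagged correctly twice
```

The off-diagonal cells are the errors; sorting them by count immediately shows which confusions dominate.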
More Complex Issues
• Tag indeterminacy: when 'truth' isn't clear
  – Caribbean cooking, child seat
• Tagging multipart words
  – wouldn't → would/MD n't/RB
• How to handle unknown words
  – Assume all tags equally likely
  – Assume the same tag distribution as all other singletons in the corpus
  – Use morphology, word length, …
Classification Learning
• Typical machine learning addresses the problem of classifying a feature-vector description into a fixed number of classes.
• There are many standard learning methods for this task:
  – Decision trees and rule learning
  – Naïve Bayes and Bayesian networks
  – Logistic regression / maximum entropy (MaxEnt)
  – Perceptron and neural networks
  – Support vector machines (SVMs)
  – Nearest-neighbor / instance-based
Beyond Classification Learning
• Standard classification problem assumes individual cases are disconnected and independent (i.i.d.: independently and identically distributed).
• Many NLP problems do not satisfy this assumption and involve making many connected decisions, each resolving a different ambiguity, but which are mutually dependent.
• More sophisticated learning and inference techniques are needed to handle such situations in general.
Sequence Labeling Problem
• Many NLP problems can be viewed as sequence labeling.
• Each token in a sequence is assigned a label.
• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors (not i.i.d.).

  foo bar blam zonk zonk bar blam
Information Extraction
• Identify phrases in language that refer to specific types of entities and relations in text.
• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  – Michael Dell [person] is the CEO of Dell Computer Corporation [organization] and lives in Austin, Texas [place].
• Extract pieces of information relevant to a specific application, e.g. used-car ads (make, model, year, mileage, price):
  – For sale, 2002 Toyota Prius, 20,000 mi, $15K or best offer. Available starting July 30, 2006.
Semantic Role Labeling
• For each clause, determine the semantic role played by each noun phrase that is an argument to the verb (agent, patient, source, destination, instrument):
  – John [agent] drove Mary [patient] from Austin [source] to Dallas [destination] in his Toyota Prius [instrument].
  – The hammer [instrument] broke the window [patient].
• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”
Bioinformatics
• Sequence labeling is also valuable for labeling genetic sequences in genome analysis (e.g. exon vs. intron regions):
  – AGCTAACGTTCGATACGGATTACAGCCT
Sequence Labeling as Classification
• Classify each token independently, but use as input features information about the surrounding tokens (sliding window).

  John saw the saw and decided to take it to the table.
  classifier → NNP   (window centered on “John”)
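Extracting sliding-window features for such an independent classifier can be sketched as follows (the feature names and window size of ±1 are my own choices):

```python
# Sketch: turn each token into a feature dict describing a +/-1 window.
def window_features(words, i):
    return {
        "word": words[i],
        "prev": words[i - 1] if i > 0 else "<S>",            # left neighbor
        "next": words[i + 1] if i < len(words) - 1 else "</S>",  # right neighbor
    }

words = "John saw the saw".split()
print(window_features(words, 1))
# {'word': 'saw', 'prev': 'John', 'next': 'the'}
```

Any of the standard classifiers listed earlier can then be trained on these feature vectors, one training instance per token.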
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
58
John saw the saw and decided to take it to the table.
classifier
VBD
![Page 59: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/59.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
59
John saw the saw and decided to take it to the table.
classifier
DT
![Page 60: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/60.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
60
John saw the saw and decided to take it to the table.
classifier
NN
![Page 61: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/61.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
61
John saw the saw and decided to take it to the table.
classifier
CC
![Page 62: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/62.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
62
John saw the saw and decided to take it to the table.
classifier
VBD
![Page 63: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/63.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
63
John saw the saw and decided to take it to the table.
classifier
TO
![Page 64: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/64.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
64
John saw the saw and decided to take it to the table.
classifier
VB
![Page 65: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/65.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
65
John saw the saw and decided to take it to the table.
classifier
PRP
![Page 66: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/66.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
66
John saw the saw and decided to take it to the table.
classifier
IN
![Page 67: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/67.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
67
John saw the saw and decided to take it to the table.
classifier
DT
![Page 68: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/68.jpg)
Sequence Labeling as Classification
• Classify each token independently but use as input features, information about the surrounding tokens (sliding window).
68
John saw the saw and decided to take it to the table.
classifier
NN
![Page 69: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/69.jpg)
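The sliding-window idea can be sketched in a few lines of Python. This is illustrative only: `window_features` and the feature names are this sketch's own, not from the slides.

```python
# Sliding-window feature extraction: to classify token i independently,
# build features from the token itself plus its neighbors on both sides,
# so ambiguous words like "saw" get disambiguating context.

def window_features(tokens, i, width=2):
    """Features for token i: the token plus `width` neighbors on each side."""
    feats = {"word": tokens[i].lower()}
    for off in range(1, width + 1):
        feats[f"word-{off}"] = tokens[i - off].lower() if i - off >= 0 else "<s>"
        feats[f"word+{off}"] = tokens[i + off].lower() if i + off < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw and decided to take it to the table .".split()
feats = window_features(tokens, 3)  # the second "saw" (the noun)
print(feats["word"], feats["word-1"], feats["word+1"])  # saw the and
```

A trained classifier (e.g. logistic regression) would consume one such feature dict per token.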
Sequence Labeling as Classification: Using Outputs as Inputs
• Better input features are usually the categories of the surrounding tokens, but these are not available yet.
• Can use the categories of either the preceding or the succeeding tokens by tagging forward or backward and feeding each previous output back in.
Forward Classification
• Tagging left to right, each output becomes context for the next decision:

John saw the saw and decided to take it to the table.
NNP VBD DT NN CC VBD TO VB PRP IN DT NN
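A minimal sketch of forward classification, with a hand-written toy "classifier" standing in for a trained model (the lexicon and rules below are invented for illustration):

```python
# Forward classification: tag left to right, feeding each prediction back
# in as context for the next token.

TOY_LEXICON = {"john": "NNP", "the": "DT", "and": "CC", "to": "TO",
               "it": "PRP", "table": "NN", "decided": "VBD", "take": "VB"}

def toy_classify(word, prev_tag):
    # "saw" is ambiguous: after a determiner it is a noun, otherwise a verb.
    if word == "saw":
        return "NN" if prev_tag == "DT" else "VBD"
    return TOY_LEXICON.get(word, "NN")

def forward_tag(tokens):
    tags, prev = [], "<s>"
    for w in tokens:
        prev = toy_classify(w.lower(), prev)
        tags.append(prev)
    return tags

print(forward_tag("John saw the saw".split()))  # ['NNP', 'VBD', 'DT', 'NN']
```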
Backward Classification
• Disambiguating “to” in this case would be even easier backward.
• Tagging right to left, each output becomes context for the token to its left:

John saw the saw and decided to take it to the table.
NNP VBD DT VBD CC VBD TO VB PRP IN DT NN

• Note that the second “saw” is now mis-tagged VBD: tagging backward, the classifier reaches it before its left-context determiner “the” can be used.
Problems with Sequence Labeling as Classification
• Not easy to integrate information from the categories of tokens on both sides.
• Difficult to propagate uncertainty between decisions and “collectively” determine the most likely joint assignment of categories to all of the tokens in a sequence.
Probabilistic Sequence Models
• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment.
• Two standard models:
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)
Markov Model / Markov Chain
• A finite state machine with probabilistic state transitions.
• Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history.
Sample Markov Model for POS
[State-transition diagram: states Det, Noun, PropNoun, and Verb, plus start and stop states, connected by probabilistic arcs. Among the arc probabilities recoverable from the computation below: start→PropNoun 0.4, PropNoun→Verb 0.8, Verb→Det 0.25, Det→Noun 0.95, Noun→stop 0.1.]

P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
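The probability computed above is just the product of transition probabilities along the path through the chain; a sketch (only the arcs used in the example are encoded, so other sequences get probability 0 here):

```python
# Probability of a tag sequence under a Markov chain: multiply the
# transition probabilities from start, through each tag, to stop.

A = {("start", "PropNoun"): 0.4, ("PropNoun", "Verb"): 0.8,
     ("Verb", "Det"): 0.25, ("Det", "Noun"): 0.95, ("Noun", "stop"): 0.1}

def sequence_prob(tags):
    p = 1.0
    for prev, cur in zip(["start"] + tags, tags + ["stop"]):
        p *= A.get((prev, cur), 0.0)  # missing arc => probability 0
    return p

print(sequence_prob(["PropNoun", "Verb", "Det", "Noun"]))  # ≈ 0.0076
```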
Hidden Markov Model
• Probabilistic generative model for sequences.
• Assume an underlying set of hidden (unobserved) states in which the model can be (e.g. parts of speech).
• Assume probabilistic transitions between states over time (e.g. transition from one POS to another as the sequence is generated).
• Assume a probabilistic generation of tokens from states (e.g. words generated for each POS).
Sample HMM for POS
[Same state-transition diagram as before, with each state now emitting words: PropNoun emits John, Mary, Alice, Jerry, Tom; Noun emits cat, dog, car, pen, bed, apple; Det emits a, the, that; Verb emits bit, ate, saw, played, hit, gave.]
Sample HMM Generation
[Animation over the same HMM diagram: starting in the start state, the model transitions start → PropNoun → Verb → Det → Noun → stop, emitting one word per state to generate the sentence “John bit the apple”.]
Formal Definition of an HMM
• A set of N+2 states S = {s0, s1, s2, …, sN, sF}
– Distinguished start state: s0
– Distinguished final state: sF
• A set of M possible observations V = {v1, v2, …, vM}
• A state transition probability distribution A = {aij}:
    aij = P(qt+1 = sj | qt = si)    0 ≤ i ≤ N,  1 ≤ j ≤ N or j = F
    Σj=1..N aij + aiF = 1    0 ≤ i ≤ N
• An observation probability distribution for each state j, B = {bj(k)}:
    bj(k) = P(vk at t | qt = sj)    1 ≤ j ≤ N,  1 ≤ k ≤ M
• Total parameter set λ = {A, B}
HMM Generation Procedure
• To generate a sequence of T observations: O = o1 o2 … oT

Set initial state q0 = s0
For t = 1 to T:
    Transit to the next state qt = sj based on the transition distribution aij for state qt−1
    Pick an observation ot = vk based on being in state qt, using distribution bqt(k)
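A sketch of this generation procedure over the toy POS model from the earlier slides. The transition probabilities below are illustrative guesses (not the slides' exact numbers), and emissions are drawn uniformly for simplicity:

```python
import random

# Walk the transition distribution A from the start state, emitting one
# word per state visited, until the stop state is reached.

A = {"start": {"PropNoun": 0.4, "Det": 0.5, "Noun": 0.1},
     "PropNoun": {"Verb": 0.8, "stop": 0.2},
     "Verb": {"Det": 0.25, "PropNoun": 0.25, "Noun": 0.5},
     "Det": {"Noun": 0.95, "PropNoun": 0.05},
     "Noun": {"Verb": 0.5, "stop": 0.5}}
B = {"PropNoun": ["John", "Mary", "Alice", "Jerry", "Tom"],
     "Det": ["a", "the", "that"],
     "Verb": ["bit", "ate", "saw", "played", "hit", "gave"],
     "Noun": ["cat", "dog", "car", "pen", "bed", "apple"]}

def generate():
    words, state = [], "start"
    while True:
        state = random.choices(list(A[state]), weights=list(A[state].values()))[0]
        if state == "stop":
            return words
        words.append(random.choice(B[state]))  # uniform emission for simplicity

random.seed(0)
print(" ".join(generate()))
```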
Three Useful HMM Tasks
• Observation Likelihood: To classify and order sequences.
• Most likely state sequence (Decoding): To tag each token in a sequence with a label.
• Maximum likelihood training (Learning): To train models to fit empirical training data.
HMM: Observation Likelihood
• Given a sequence of observations, O, and a model with a set of parameters, λ, what is the probability that this observation sequence was generated by this model: P(O | λ)?
• Allows an HMM to be used as a language model: a formal probabilistic model of a language that assigns a probability to each string, saying how likely that string was to have been generated by the language.
• Useful for two tasks:
– Sequence Classification
– Most Likely Sequence
Sequence Classification
• Assume an HMM is available for each category (i.e. language).
• What is the most likely category for a given observation sequence, i.e. which category’s HMM is most likely to have generated it?
• Used in speech recognition to find the most likely word model to have generated a given sound or phoneme sequence.

O = ah s t e n:   P(O | Austin) > P(O | Boston)?
Most Likely Sequence
• Of two or more possible sequences, which one was most likely generated by a given model?
• Used to score alternative word-sequence interpretations in speech recognition.

O1 = dice precedent core,  O2 = vice president Gore:   P(O2 | OrdinaryEnglish) > P(O1 | OrdinaryEnglish)?
HMM: Observation Likelihood
Naïve Solution
• Consider all possible state sequences, Q, of length T that the model could have traversed in generating the given observation sequence.
• Compute the probability of each state sequence from A, and multiply it by the probabilities of generating each of the given observations in each of the corresponding states in this sequence, to get P(O, Q | λ) = P(O | Q, λ) P(Q | λ).
• Sum this over all possible state sequences to get P(O | λ).
• Computationally complex: O(T·N^T).
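The naive enumeration can be written directly. The toy two-state model below (names `a0` for start transitions, `af` for final transitions, emission table `b`) is invented purely to exercise the code:

```python
from itertools import product

# Brute-force P(O | lambda): enumerate every length-T state sequence Q,
# compute P(O, Q | lambda) = P(O | Q, lambda) * P(Q | lambda), and sum.
# Cost is O(T * N^T) -- feasible only for tiny models.

def naive_likelihood(obs, states, a0, a, af, b):
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = a0[seq[0]] * b[seq[0]][obs[0]]   # build P(Q) * P(O | Q) jointly
        for t in range(1, len(obs)):
            p *= a[seq[t - 1]][seq[t]] * b[seq[t]][obs[t]]
        total += p * af[seq[-1]]             # transition into the final state
    return total

states = ["Hot", "Cold"]
a0 = {"Hot": 0.6, "Cold": 0.4}
a = {"Hot": {"Hot": 0.7, "Cold": 0.2}, "Cold": {"Hot": 0.4, "Cold": 0.5}}
af = {"Hot": 0.1, "Cold": 0.1}
b = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4}, "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}
print(naive_likelihood(["3", "1"], states, a0, a, af, b))  # ≈ 0.00708
```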
HMM: Observation Likelihood
Efficient Solution
• Due to the Markov assumption, the probability of being in any state at any given time t relies only on the probability of being in each of the possible states at time t−1.
• Forward Algorithm: Uses dynamic programming to exploit this fact to efficiently compute observation likelihood in O(T·N²) time.
– Computes a forward trellis that compactly and implicitly encodes information about all possible state paths.
Forward Probabilities
• Let αt(j) be the probability of being in state j after seeing the first t observations (summing over all initial paths leading to j):

αt(j) = P(o1, o2, …, ot, qt = sj | λ)
Forward Step
• Consider all possible ways of getting to sj at time t by coming from each possible state si, and determine the probability of each.
• Sum these to get the total probability of being in state sj at time t while accounting for the first t−1 observations.
• Then multiply by the probability of actually observing ot in sj.
[Diagram: states s1, s2, …, sN at time t−1, each with forward probability αt−1(i), connected by arcs a1j, a2j, …, aNj into state sj at time t.]
Forward Trellis
• Continue forward in time until reaching the final time point, and sum the probability of ending in the final state.
[Trellis diagram: states s1 … sN unrolled over time steps t1, t2, t3, …, tT−1, tT, with start state s0 on the left and final state sF on the right.]
Computing the Forward Probabilities
• Initialization:
    α1(j) = a0j bj(o1)    1 ≤ j ≤ N
• Recursion:
    αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot)    1 ≤ j ≤ N,  1 < t ≤ T
• Termination:
    P(O | λ) = αT+1(sF) = Σi=1..N αT(i) aiF
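The three steps above translate directly into code. The toy two-state model here (names `a0` for start transitions, `af` for final transitions are this sketch's own) is only for exercising the recurrences:

```python
# Forward algorithm: alpha[t][j] is the probability of the first t
# observations ending in state j. O(T * N^2) instead of O(T * N^T).

def forward(obs, states, a0, a, af, b):
    # Initialization: alpha_1(j) = a_0j * b_j(o_1)
    alpha = [{j: a0[j] * b[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * a[i][j] for i in states) * b[j][o]
                      for j in states})
    # Termination: P(O | lambda) = sum_i alpha_T(i) * a_iF
    return sum(alpha[-1][i] * af[i] for i in states)

states = ["Hot", "Cold"]
a0 = {"Hot": 0.6, "Cold": 0.4}
a = {"Hot": {"Hot": 0.7, "Cold": 0.2}, "Cold": {"Hot": 0.4, "Cold": 0.5}}
af = {"Hot": 0.1, "Cold": 0.1}
b = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4}, "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}
print(forward(["3", "1"], states, a0, a, af, b))  # ≈ 0.00708
```

The result matches summing P(O, Q | λ) over all state sequences by hand, but in quadratic rather than exponential time.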
Forward Computational Complexity
• Requires only O(T·N²) time to compute the probability of an observed sequence given a model.
• Exploits the fact that all state sequences must merge into one of the N possible states at any point in time, and the Markov assumption that only the last state affects the next one.
Most Likely State Sequence (Decoding)• Given an observation sequence, O, and a model, λ,
what is the most likely state sequence,Q=q1,q2,…qT, that generated this sequence from this model?
• Used for sequence labeling, assuming each state corresponds to a tag, it determines the globally best assignment of tags to all tokens in a sequence using a principled approach grounded in probability theory.
124
John gave the dog an apple.
![Page 125: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/125.jpg)
Most Likely State Sequence• Given an observation sequence, O, and a model, λ,
what is the most likely state sequence,Q=q1,q2,…qT, that generated this sequence from this model?
• Used for sequence labeling, assuming each state corresponds to a tag, it determines the globally best assignment of tags to all tokens in a sequence using a principled approach grounded in probability theory.
125
John gave the dog an apple.
Det Noun PropNoun Verb
![Page 131: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/131.jpg)
HMM: Most Likely State Sequence
Efficient Solution
• Obviously, one could use a naïve algorithm that examines every possible state sequence of length T.
• Dynamic programming can instead be used to exploit the Markov assumption and efficiently determine the most likely state sequence for a given observation sequence and model.
• The standard procedure is called the Viterbi algorithm (Viterbi, 1967) and, like the Forward algorithm, has O(N²T) time complexity.
131
![Page 132: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/132.jpg)
Viterbi Scores
• Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state s_j:

132

$v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$

• Also record "backpointers" that subsequently allow backtracing the most probable state sequence: $bt_t(j)$ stores the state at time t-1 that maximizes the probability that the system was in state s_j at time t (given the observed sequence).
![Page 133: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/133.jpg)
Computing the Viterbi Scores

• Initialization: $v_1(j) = a_{0j}\, b_j(o_1) \qquad 1 \le j \le N$

• Recursion: $v_t(j) = \max_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$

• Termination: $P^* = v_{T+1}(s_F) = \max_{1 \le i \le N} v_T(i)\, a_{iF}$

133

Analogous to the Forward algorithm, except taking the max instead of the sum.
![Page 134: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/134.jpg)
Computing the Viterbi Backpointers

• Initialization: $bt_1(j) = s_0 \qquad 1 \le j \le N$

• Recursion: $bt_t(j) = \operatorname{argmax}_{1 \le i \le N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad 1 \le j \le N,\ 1 < t \le T$

• Termination: $q_T^* = bt_{T+1}(s_F) = \operatorname{argmax}_{1 \le i \le N} v_T(i)\, a_{iF}$

134

This gives the final state in the most probable state sequence. Follow the backpointers back to the initial state to construct the full sequence.
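As a concrete sketch, the score recursion and backtrace above fit in a few lines of Python. The three-tag toy model (PropNoun folded into Noun for brevity) and every transition/emission probability below are invented assumptions, not values from the slides.

```python
# Minimal Viterbi sketch for a toy tagging HMM with explicit start/final states.
# All probabilities are illustrative assumptions.

states = ["Det", "Noun", "Verb"]                 # hidden states (tags)
a0 = {"Det": 0.4, "Noun": 0.5, "Verb": 0.1}      # start transitions a_{0j}
aF = {"Det": 0.1, "Noun": 0.5, "Verb": 0.4}      # transitions into final state a_{iF}
a = {  # a[i][j]: P(next tag | current tag)
    "Det":  {"Det": 0.05, "Noun": 0.9,  "Verb": 0.05},
    "Noun": {"Det": 0.2,  "Noun": 0.3,  "Verb": 0.5},
    "Verb": {"Det": 0.5,  "Noun": 0.4,  "Verb": 0.1},
}
b = {  # b_j(o): P(word | tag)
    "Det":  {"the": 0.7, "a": 0.3},
    "Noun": {"dog": 0.4, "apple": 0.3, "John": 0.3},
    "Verb": {"gave": 0.5, "bit": 0.5},
}

def viterbi(obs):
    # v[t][j]: probability of the best path over o_1..o_t ending in state j
    v = [{j: a0[j] * b[j].get(obs[0], 0.0) for j in states}]
    bt = [{j: None for j in states}]                       # backpointers
    for t in range(1, len(obs)):
        v.append({}); bt.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t-1][i] * a[i][j])
            v[t][j] = v[t-1][best_i] * a[best_i][j] * b[j].get(obs[t], 0.0)
            bt[t][j] = best_i
    last = max(states, key=lambda i: v[-1][i] * aF[i])     # termination
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                   # follow backpointers
        path.append(bt[t][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "bit", "John"]))
# → ['Det', 'Noun', 'Verb', 'Noun']
```

Because the max and argmax are taken over only the N predecessor states at each of the T positions, this is the O(N²T) procedure the slides describe.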
![Page 135: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/135.jpg)
Viterbi Backpointers
135
[Trellis diagram: states s_1 … s_N between start state s_0 and final state s_F, unrolled over times t_1 … t_T, with a backpointer stored at each node.]
![Page 136: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/136.jpg)
Viterbi Backtrace
136
[Trellis diagram as above, with the backtrace path highlighted from s_F back to s_0.]

Most likely sequence: s_0 s_N s_1 s_2 … s_2 s_F
![Page 137: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/137.jpg)
HMM Learning
• Supervised Learning: All training sequences are completely labeled (tagged).
• Unsupervised Learning: All training sequences are unlabeled (though the number of tags, i.e. states, is generally known).
• Semisupervised Learning: Some training sequences are labeled, most are unlabeled.
137
![Page 138: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/138.jpg)
Supervised HMM Training
• If training sequences are labeled (tagged) with the underlying state sequences that generated them, then the parameters λ = {A, B} can all be estimated directly.
138
[Diagram: training sequences fed into Supervised HMM Training.]

John ate the apple
A dog bit Mary
Mary hit the dog
John gave Mary the cat.
…

Training Sequences → Supervised HMM Training → λ = {A, B}

Det Noun PropNoun Verb
![Page 139: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/139.jpg)
Supervised Parameter Estimation
• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data:

$\hat{a}_{ij} = \frac{C(q_t = s_i,\ q_{t+1} = s_j)}{C(q_t = s_i)}$

• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data:

$\hat{b}_j(k) = \frac{C(q_i = s_j,\ o_i = v_k)}{C(q_i = s_j)}$

• Use appropriate smoothing if the training data is sparse.

139
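These two relative-frequency estimates amount to simple counting over a tagged corpus. A minimal sketch (the two tiny training sentences echo the slide examples; the resulting probabilities are otherwise assumptions):

```python
from collections import Counter

# Count-based MLE for transition (A) and observation (B) parameters
# from a toy tagged corpus.
tagged = [
    [("John", "PropNoun"), ("gave", "Verb"), ("the", "Det"), ("dog", "Noun"),
     ("an", "Det"), ("apple", "Noun")],
    [("Mary", "PropNoun"), ("hit", "Verb"), ("the", "Det"), ("dog", "Noun")],
]

trans, trans_tot = Counter(), Counter()   # C(q_t=s_i, q_{t+1}=s_j), C(q_t=s_i)
emit, tag_tot = Counter(), Counter()      # C(s_j, v_k), C(s_j)

for sent in tagged:
    tags = [t for _, t in sent]
    for w, t in sent:
        emit[(t, w)] += 1
        tag_tot[t] += 1
    for t1, t2 in zip(tags, tags[1:]):
        trans[(t1, t2)] += 1
        trans_tot[t1] += 1

def a_hat(i, j):       # transition estimate
    return trans[(i, j)] / trans_tot[i]

def b_hat(j, w):       # observation estimate
    return emit[(j, w)] / tag_tot[j]

print(a_hat("Det", "Noun"))   # 3/3 = 1.0 (every Det here precedes a Noun)
print(b_hat("Det", "the"))    # 2/3
```

With so little data many counts are zero, which is exactly why the slide's point about smoothing matters in practice.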
![Page 140: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/140.jpg)
Learning and Using HMM Taggers
• Use a corpus of labeled sequence data to easily construct an HMM using supervised training.
• Given a novel unlabeled test sequence to tag, use the Viterbi algorithm to predict the most likely (globally optimal) tag sequence.
140
![Page 141: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/141.jpg)
Evaluating Taggers
• Train on a training set of labeled sequences.
• Possibly tune parameters based on performance on a development set.
• Measure accuracy on a disjoint test set.
• Generally measure tagging accuracy, i.e. the percentage of tokens tagged correctly.
• Accuracy of most modern POS taggers, including HMMs, is 96–97% (for the Penn tagset, trained on about 800K words).
– Generally matching the human agreement level.
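Token-level tagging accuracy is just the fraction of tokens whose predicted tag matches the gold tag; a minimal sketch with made-up gold and predicted sequences:

```python
# Token-level tagging accuracy on one (invented) sentence.
gold = ["Det", "Noun", "Verb", "Det", "Noun"]
pred = ["Det", "Noun", "Verb", "Det", "Verb"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)   # 4 of 5 tokens correct → 0.8
```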
141
![Page 142: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/142.jpg)
Unsupervised Maximum Likelihood Training
142
ah s t e n
a s t i n
oh s t u n
eh z t en
…

Training Sequences → HMM Training → "Austin"
![Page 143: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/143.jpg)
Maximum Likelihood Training
• Given an observation sequence, O, what set of parameters, λ, for a given model maximizes the probability that this data was generated from this model (P(O| λ))?
• Used to train an HMM model and properly induce its parameters from a set of training data.
• Only needs an unannotated observation sequence (or set of sequences) generated from the model; it does not need to know the correct state sequence(s) for the observation sequence(s). In this sense, it is unsupervised.
143
![Page 144: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/144.jpg)
Bayes Theorem

$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$

Simple proof from the definition of conditional probability:

$P(H \mid E) = \frac{P(H \wedge E)}{P(E)}$  (def. cond. prob.)

$P(E \mid H) = \frac{P(H \wedge E)}{P(H)}$  (def. cond. prob.)

$P(H \wedge E) = P(E \mid H)\, P(H)$

QED: $P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$
![Page 145: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/145.jpg)
Maximum Likelihood vs. Maximum A Posteriori (MAP)
• The MAP parameter estimate is the most likely parameterization given the observed data, O:

$\lambda_{MAP} = \operatorname*{argmax}_{\lambda} P(\lambda \mid O) = \operatorname*{argmax}_{\lambda} \frac{P(O \mid \lambda)\, P(\lambda)}{P(O)} = \operatorname*{argmax}_{\lambda} P(O \mid \lambda)\, P(\lambda)$
• If all parameterizations are assumed to be equally likely a priori, then MLE and MAP are the same.
• If the parameters are given priors (e.g. Gaussian or Laplacian with zero mean), then MAP is a principled way to perform smoothing or regularization.
![Page 146: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/146.jpg)
HMM: Maximum Likelihood Training
Efficient Solution
• There is no known efficient algorithm for finding the parameters λ that truly maximize P(O|λ).
• However, using iterative re-estimation, the Baum-Welch algorithm (a.k.a. forward-backward), a version of a standard statistical procedure called Expectation Maximization (EM), is able to locally maximize P(O|λ).
• In practice, EM is often able to find a set of parameters that fit the training data well.
146
![Page 147: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/147.jpg)
Sketch of Baum-Welch (EM) Algorithm for Training HMMs
147
Assume an HMM with N states.
Randomly set its parameters λ = (A, B) (making sure they represent legal distributions).
Until convergence (i.e. λ no longer changes) do:
  E Step: Use the forward/backward procedure to determine the probability of various possible state sequences for generating the training data.
  M Step: Use these probability estimates to re-estimate values for all of the parameters λ.
![Page 148: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/148.jpg)
Backward Probabilities
• Let $\beta_t(i)$ be the probability of observing the final set of observations from time t+1 to T, given that one is in state i at time t:

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = s_i,\ \lambda)$

148
![Page 149: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/149.jpg)
Computing the Backward Probabilities

• Initialization: $\beta_T(i) = a_{iF} \qquad 1 \le i \le N$

• Recursion: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \qquad 1 \le i \le N,\ 1 \le t < T$

• Termination: $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \beta_0(s_0) = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$

149
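A sketch of the backward pass on a tiny 2-state HMM, cross-checked against the forward pass (both terminations must yield the same P(O|λ)). All numbers below are assumptions:

```python
# Backward probabilities on a toy 2-state HMM with start state 0 and
# final state F; parameters are illustrative assumptions.
N = 2
a0 = [0.6, 0.4]                                     # a_{0j}
aF = [0.5, 0.5]                                     # a_{iF}
a = [[0.7, 0.3], [0.4, 0.6]]                        # a[i][j]
b = [{"x": 0.8, "y": 0.2}, {"x": 0.1, "y": 0.9}]    # b_j(o)

def backward(obs):
    T = len(obs)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                  # initialization: beta_T(i) = a_{iF}
        beta[T-1][i] = aF[i]
    for t in range(T - 2, -1, -1):      # recursion
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    # termination: P(O|λ) = sum_j a_{0j} b_j(o_1) beta_1(j)
    return sum(a0[j] * b[j][obs[0]] * beta[0][j] for j in range(N))

def forward_prob(obs):                  # cross-check via the forward pass
    alpha = [a0[j] * b[j][obs[0]] for j in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(N)) * b[j][o]
                 for j in range(N)]
    return sum(alpha[i] * aF[i] for i in range(N))

obs = ["x", "y", "x"]
print(abs(backward(obs) - forward_prob(obs)) < 1e-12)   # → True
```

Agreement of the two passes is a handy sanity check before using α and β together in Baum-Welch.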
![Page 150: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/150.jpg)
Estimating Probability of State Transitions

• Let $\xi_t(i,j)$ be the probability of being in state i at time t and state j at time t+1:

$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$

$\xi_t(i,j) = \frac{P(q_t = s_i,\ q_{t+1} = s_j,\ O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$

[Trellis diagram: $\alpha_t(i)$ covers the path into state $s_i$ at time t, the arc $a_{ij}\, b_j(o_{t+1})$ links $s_i$ to state $s_j$ at time t+1, and $\beta_{t+1}(j)$ covers the remaining observations.]
![Page 151: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/151.jpg)
Re-estimating A

$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$

$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{j=1}^{N} \sum_{t=1}^{T-1} \xi_t(i,j)}$
![Page 152: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/152.jpg)
Estimating Observation Probabilities
• Let $\gamma_t(j)$ be the probability of being in state j at time t, given the observations and the model:

$\gamma_t(j) = P(q_t = s_j \mid O, \lambda) = \frac{P(q_t = s_j,\ O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$
![Page 153: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/153.jpg)
Re-estimating B

$\hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ observing } v_k}{\text{expected number of times in state } j}$

$\hat{b}_j(v_k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$
![Page 154: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/154.jpg)
154
Pseudocode for Baum-Welch (EM) Algorithm for Training HMMs
Assume an HMM with N states.
Randomly set its parameters λ = (A, B) (making sure they represent legal distributions).
Until convergence (i.e. λ no longer changes) do:
  E Step: Compute values for $\gamma_t(j)$ and $\xi_t(i,j)$ using the current values of the parameters A and B.
  M Step: Re-estimate the parameters:
    $a_{ij} = \hat{a}_{ij} \qquad b_j(v_k) = \hat{b}_j(v_k)$
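One full E/M iteration of this pseudocode can be written out directly for a single training sequence. The 2-state model (plus an explicit start state 0 and final state F) and all starting parameter values below are arbitrary assumptions:

```python
# One Baum-Welch (EM) iteration for a single training sequence.
# The 2-state model and its starting parameters are illustrative assumptions.
N = 2
a0 = [0.6, 0.4]                                     # a_{0j}
aF = [0.5, 0.5]                                     # a_{iF}
a = [[0.7, 0.3], [0.4, 0.6]]                        # a[i][j]
b = [{"x": 0.8, "y": 0.2}, {"x": 0.1, "y": 0.9}]    # b_j(o)
obs = ["x", "x", "y"]
T = len(obs)

# Forward and backward lattices
alpha = [[0.0] * N for _ in range(T)]
beta = [[0.0] * N for _ in range(T)]
for j in range(N):
    alpha[0][j] = a0[j] * b[j][obs[0]]
for t in range(1, T):
    for j in range(N):
        alpha[t][j] = sum(alpha[t-1][i] * a[i][j] for i in range(N)) * b[j][obs[t]]
for i in range(N):
    beta[T-1][i] = aF[i]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j] for j in range(N))
PO = sum(alpha[T-1][i] * aF[i] for i in range(N))    # P(O|λ)

# E step: state-occupancy (gamma) and transition (xi) posteriors
gamma = [[alpha[t][j] * beta[t][j] / PO for j in range(N)] for t in range(T)]
xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t+1]] * beta[t+1][j] / PO
        for j in range(N)] for i in range(N)] for t in range(T - 1)]

# M step: re-estimate A and B from the expected counts
a_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(xi[t][i][k] for t in range(T - 1) for k in range(N))
          for j in range(N)] for i in range(N)]
b_new = [{v: sum(gamma[t][j] for t in range(T) if obs[t] == v) /
             sum(gamma[t][j] for t in range(T))
          for v in ("x", "y")} for j in range(N)]
print(a_new, b_new)   # each transition row and emission dict still sums to 1
```

Iterating these two steps until λ stops changing is exactly the loop in the pseudocode; each iteration is guaranteed not to decrease P(O|λ).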
![Page 155: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/155.jpg)
EM Properties
• Each iteration changes the parameters in a way that is guaranteed to increase the likelihood of the data, P(O|λ).
• Anytime algorithm: can stop at any time prior to convergence to get an approximate solution.
• Converges to a local maximum.
![Page 156: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/156.jpg)
Semi-Supervised Learning
• EM algorithms can be trained with a mix of labeled and unlabeled data.
• EM basically predicts a probabilistic (soft) labeling of the instances and then iteratively retrains using supervised learning on these predicted labels ("self-training").
• EM can also exploit supervised data:
– 1) Use supervised learning on the labeled data to initialize the parameters (instead of initializing them randomly).
– 2) Use the known labels for the supervised data instead of predicting soft labels for these examples during the retraining iterations.
![Page 157: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/157.jpg)
Semi-Supervised Results
• Use of additional unlabeled data improves on supervised learning when the amount of labeled data is very small and the amount of unlabeled data is large.
• Can degrade performance when there is sufficient labeled data to learn a decent model and unsupervised learning tends to create labels that are incompatible with the desired ones.
– There are negative results for semi-supervised POS tagging, since unsupervised learning tends to learn semantic labels (e.g. eating verbs, animate nouns) that are better at predicting the data than purely syntactic labels (e.g. verb, noun).
![Page 158: 1 Computational Lexicology, Morphology and Syntax Course 5 Diana Trandab ă ț Academic year: 2014-2015](https://reader033.vdocument.in/reader033/viewer/2022051619/56649e015503460f94aebf0d/html5/thumbnails/158.jpg)
Conclusions
• POS tagging is the lowest level of syntactic analysis.
• It is an instance of sequence labeling, a collective classification task that also has applications in information extraction, phrase chunking, semantic role labeling, and bioinformatics.
• HMMs are a standard generative probabilistic model for sequence labeling that allows for efficiently computing the globally most probable sequence of labels and supports supervised, unsupervised, and semi-supervised learning.