-
CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 8 – POS tagset)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
17th Jan, 2012
-
HMM: Three Problems
[Figure: NLP Trinity. Problem axis: Morph Analysis, Part of Speech Tagging, Parsing, Semantics. Language axis: Hindi, Marathi, English, French. Algorithm axis: CRF, HMM, MEMM. Label: POS tagging.]
• Problem 1: Likelihood of a sequence
  • Forward Procedure
  • Backward Procedure
• Problem 2: Best state sequence
  • Viterbi Algorithm
• Problem 3: Re-estimation
  • Baum-Welch (Forward-Backward Algorithm)
-
Tagged Corpora
• ^_^ “_“ The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ”_” said_VBD Mr._NNP Benton_NNP ._. $_$
-
For Hindi
• Rama achhaa gaata hai. (hai is VAUX: Auxiliary verb); Ram sings well
• Rama achhaa ladakaa hai. (hai is VCOP: Copula verb); Ram is a good boy
-
Example of difficulty in POS tagging
-
Tags
• Content Word: Noun, Adjective, Verb
• Function Word: Pronoun, Preposition, Conjunction, Interjection
• Noun Tags
  • Proper Noun: NNP (for NER)
  • Common Noun: NN, NNS
• Verb Tags: VBP, VBD, VBG, VBN
-
Difficulty in POS Tagging
• Consider the following sentences:
राम अच्छा गाता है_VAUX (auxiliary verb)
Ram good sing is : Ram sings well
GNPTAM for ‘गाता’ only: Male, Singular, ??, ??, ??, -
GNPTAM for ‘गाता है’: Male, Singular, 2nd or 3rd, Present, Default, Declarative
राम अच्छा लड़का है_VCOP (copular verb)
Ram good boy is : Ram is a good boy
In general, VAUX, VM (main verb) and VCOP cannot be separated easily.
-
Difficulty in POS Tagging
To POS tag based on rules, one simple rule for है could be:
• Preceded by a nominal: VCOP (facilitates co-reference, सामानाधिकरण)
• Preceded by a verb: VAUX
This is a ‘High Precision, Low Recall’ rule, i.e. when it says Yes it is indeed Yes, but a No may not actually be No.
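The rule on this slide can be sketched as a small function. This is only an illustrative sketch: it assumes the mapping suggested by the two Hindi examples (verb before है gives VAUX, nominal before है gives VCOP), and the names `tag_hai`, `NOMINAL_TAGS` and `VERB_TAGS`, as well as the coarse tag sets themselves, are hypothetical and not part of the lecture.

```python
# Hypothetical coarse tag sets (assumption); a real system would take the
# tag of the preceding word from a lexicon or an upstream tagger.
NOMINAL_TAGS = {"N", "NNP", "PRP"}
VERB_TAGS = {"VM"}


def tag_hai(prev_tag):
    """Tag 'hai' from the coarse tag of the word that precedes it.

    Returns "VCOP", "VAUX", or None when the rule does not fire.
    """
    if prev_tag in NOMINAL_TAGS:
        return "VCOP"   # e.g. "ladakaa hai"
    if prev_tag in VERB_TAGS:
        return "VAUX"   # e.g. "gaata hai"
    return None         # rule is silent: this is where recall is lost


print(tag_hai("VM"))  # preceding word is a main verb
print(tag_hai("N"))   # preceding word is a noun
print(tag_hai("JJ"))  # preceding word is an adjective: no decision
```

Returning `None` instead of guessing is what keeps such a rule high in precision; every case it declines to decide counts against recall.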
-
Exceptions to the previous rule
• False Negative for VAUX
• Particle Injection (Particles: भी-Bhi, तो-To, ही-Hi, नहीं-Nahi)
राम गाता तो अच्छा है, पर ...
• Consider the following sentences:
राम अच्छा है_VCOP
राम तो गाता अच्छा है_VAUX
POS tags of है vary here despite the preceding word being an adjective.
-
Evaluation of POS Tag Accuracy
• Precision, Recall and F-Score
  G: Given (what our system returns)
  I: Ideal (actual tags)
  Agreement: G ∩ I
  False Positive: in G but not in I
  False Negative: in I but not in G
• Precision P = |G ∩ I| / |G|, Recall R = |G ∩ I| / |I|
• F-Score = 2PR/(P+R)
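As a quick check of these measures, the sketch below computes them for two small sets, using the standard definitions P = |G ∩ I| / |G| and R = |G ∩ I| / |I|. The function name `precision_recall_f` and the toy data (pairs of token index and tag) are assumptions made for illustration only.

```python
def precision_recall_f(given, ideal):
    """Return (precision, recall, F-score) for two sets of (index, tag) pairs."""
    agreement = len(given & ideal)
    precision = agreement / len(given) if given else 0.0
    recall = agreement / len(ideal) if ideal else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: the system leaves token 2 untagged and
# gets token 3 wrong.
ideal = {(0, "N"), (1, "VM"), (2, "JJ"), (3, "VAUX")}
given = {(0, "N"), (1, "VM"), (3, "VCOP")}

p, r, f = precision_recall_f(given, ideal)
print(p, r, f)
```

Here two of the three returned tags agree with the ideal set, so precision is 2/3, while only two of the four ideal tags are recovered, so recall is 1/2.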
-
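POS tag computation (1/2)
Best tag sequence:
$$T^* = \arg\max_T P(T \mid W) = \arg\max_T P(T)\,P(W \mid T)$$
(by Bayes' Theorem; P(W) is the same for every T and can be dropped)
$$P(T) = P(t_0 = \text{\^{}},\ t_1 t_2 \ldots t_n,\ t_{n+1} = .)$$
$$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0)$$
$$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\,P(t_{n+1} \mid t_n)$$
$$= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1})$$
(Bigram Assumption, with P(t_0 | t_{-1}) read as P(t_0))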
-
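POS tag computation (2/2)
$$P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0, t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0, t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1}, t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n, t_0 \ldots t_{n+1})$$
Assumption: A word is determined completely by its tag. This is inspired by speech recognition.
$$= P(w_0 \mid t_0)\,P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1})$$
$$= \prod_{i=0}^{n+1} P(w_i \mid t_i)$$
$$= \prod_{i=1}^{n+1} P(w_i \mid t_i)$$
(Lexical Probability Assumption; the i = 0 factor P(w_0 = ^ | t_0 = ^) is 1)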
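Putting the two assumptions together gives the quantity that is actually maximised over tag sequences. The block below simply substitutes the bigram product for P(T) and the lexical product for P(W|T) in the definition of T*; it uses the slides' own symbols, with P(t_0 | t_{-1}) read as P(t_0).

```latex
\begin{aligned}
T^* &= \arg\max_T \; P(T)\,P(W \mid T) \\
    &= \arg\max_T \; \prod_{i=0}^{n+1} P(t_i \mid t_{i-1}) \;\prod_{i=0}^{n+1} P(w_i \mid t_i) \\
    &= \arg\max_T \; \prod_{i=0}^{n+1} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
\end{aligned}
```

Each factor pairs one transition probability with one lexical probability, which is the form an HMM tagger scores position by position.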
-
Example
“People jump high”.
People : Noun/Verb
jump : Noun/Verb
high : Noun/Adjective
We can start with probabilities.
-
Trellis diagram
[Figure: trellis with one column per token. ^ : ^ ; People : N, VM ; Jump : N, VM ; High : N, JJ ; $ : $]
8 POS tag sequences are possible, given these valid tags for each word taken from the dictionary.
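The count of 8 can be checked by enumerating every path through the trellis. This is a minimal sketch; the dictionary `candidate_tags` just restates the tags listed for each word in the example, and the variable names are chosen for illustration.

```python
from itertools import product

# Valid tags per word, as listed in the example (N = noun, VM = main verb,
# JJ = adjective).
candidate_tags = {
    "People": ["N", "VM"],
    "jump": ["N", "VM"],
    "high": ["N", "JJ"],
}
words = ["People", "jump", "high"]

# Every path through the trellis, bracketed by the start (^) and end ($) tags.
sequences = [
    ("^",) + tags + ("$",)
    for tags in product(*(candidate_tags[w] for w in words))
]

print(len(sequences))
for seq in sequences:
    print(" ".join(seq))
```

The number of paths is the product of the number of candidate tags per word (2 × 2 × 2), so brute-force enumeration grows exponentially with sentence length.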
-
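Bigram Assumption
Best tag sequence:
$$T^* = \arg\max_T P(T \mid W) = \arg\max_T P(T)\,P(W \mid T)$$
(by Bayes' Theorem)
$$P(T) = P(t_0 = \text{\^{}},\ t_1 t_2 \ldots t_n,\ t_{n+1} = .)$$
$$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0)$$
$$= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\,P(t_{n+1} \mid t_n)$$
$$= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1})$$
(Bigram Assumption)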
-
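Lexical Probability Assumption
$$P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0, t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0, t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1}, t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n, t_0 \ldots t_{n+1})$$
Assumption: A word is determined completely by its tag. This is inspired by speech recognition.
$$= P(w_0 \mid t_0)\,P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1})$$
$$= \prod_{i=0}^{n+1} P(w_i \mid t_i)$$
$$= \prod_{i=1}^{n+1} P(w_i \mid t_i)$$
(Lexical Probability Assumption)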
-
Calculation from actual data
• Corpus
  • ^ Ram got many NLP books. ^ He found them all very interesting.
• POS tagged
  • ^ N V A N N . ^ N V N A R A .
-
Recording numbers (rows: previous tag, columns: next tag)
     ^    N    V    A    R    .
^    0    2    0    0    0    0
N    0    1    2    1    0    1
V    0    1    0    1    0    0
A    0    1    0    0    1    1
R    0    0    0    1    0    0
.    1    0    0    0    0    0
-
Probabilities (rows: previous tag, columns: next tag)
     ^    N    V    A    R    .
^    0    1    0    0    0    0
N    0    1/5  2/5  1/5  0    1/5
V    0    1/2  0    1/2  0    0
A    0    1/3  0    0    1/3  1/3
R    0    0    0    1    0    0
.    1    0    0    0    0    0
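The two tables can be reproduced by counting adjacent tag pairs and dividing each count by its row total. This is a minimal sketch, assuming (as the counts ^→N = 2 and .→^ = 1 imply) that the start tag ^ precedes every sentence; the variable names are chosen for illustration.

```python
from collections import Counter
from fractions import Fraction

# Tag sequence of the two-sentence corpus, with ^ opening each sentence
# (assumption implied by the count table).
tagged = "^ N V A N N . ^ N V N A R A .".split()

# Count every adjacent (previous tag, next tag) pair.
counts = Counter(zip(tagged, tagged[1:]))

# Row totals: how often each tag appears as the previous tag.
row_totals = Counter()
for (prev, _next), c in counts.items():
    row_totals[prev] += c

# Maximum-likelihood transition probabilities P(next | prev).
probs = {
    (prev, nxt): Fraction(c, row_totals[prev])
    for (prev, nxt), c in counts.items()
}

tags = ["^", "N", "V", "A", "R", "."]
for prev in tags:
    row = [str(probs.get((prev, nxt), 0)) for nxt in tags]
    print(prev, " ".join(row))
```

Using `Fraction` keeps the output in the same exact form as the slide (for example 2/5 rather than 0.4).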
-
Penn tagset (1/2)
-
Penn tagset (2/2)
-
Indian Language Tagset: Noun
-
Indian Language Tagset: Pronoun
-
Indian Language Tagset: Quantifier
-
Indian Language Tagset: Demonstrative
Sl. No.  Category       Label  Annotation convention  Examples
3        Demonstrative  DM     DM                     Vaha, jo, yaha
3.1      Deictic        DMD    DM__DMD                Vaha, yaha
3.2      Relative       DMR    DM__DMR                jo, jis
3.3      Wh-word        DMQ    DM__DMQ                kis, kaun
         Indefinite     DMI    DM__DMI                KoI, kis
-
Indian Language Tagset: Verb, Adjective, Adverb
-
Indian Language Tagset: Postposition, conjunction
-
Indian Language Tagset: Particle
-
Indian Language Tagset: Residuals