Statistical Natural Language Processing and Applications
TRANSCRIPT
Textbooks you can refer to
Jacob Eisenstein: Natural Language Processing (2018, draft)
Jurafsky, D. and J. H. Martin: Speech and Language Processing. Prentice-Hall. 2009. 2nd edition (3rd edition, 2019 draft: http://web.stanford.edu/~jurafsky/slp3/)
Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing (pdf)
Manning, C. D., Schütze, H.: Foundations of Statistical Natural Language Processing. The
MIT Press. 1999. ISBN 0-262-13360-1.
2
Goals of HLT (Human Language Technology)
Computers would be a lot more useful if they could handle our email, do our library research, talk to us …
But they are fazed by natural human language.
How can we give computers the ability to handle human language? (Or help them learn it as kids do?)
3
A few applications of HLT
Spelling correction, grammar checking … (language learning and evaluation, e.g. TOEFL essay scoring)
Better search engines
Information extraction, gisting
Psychotherapy; Harlequin romances; etc.
New interfaces:
Speech recognition (and text-to-speech)
Dialogue systems (USS Enterprise onboard computer)
Machine translation; speech translation (the Babel tower??)
Trans-lingual summarization, detection, extraction …
4
Question Answering: IBM’s Watson
Won Jeopardy on February 16, 2011!
5
WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL
Bram Stoker
Information Extraction
Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky
Hi Dan, we’ve now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris
6
Create new Calendar entry
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
Information Extraction & Sentiment Analysis
nice and compact to carry!
since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
the camera feels flimsy, is plastic and very light in weight you have to be very delicate in the handling of this camera
7
Size and weight
Attributes: zoom, affordability, size and weight, flash, ease of use
Machine Translation
Fully automatic
8
• Helping human translators
Enter Source Text:
这 不过 是 一 个 时间 的 问题 .
Translation from Stanford’s Phrasal:
This is only a matter of time.
Language Technology
mostly solved:
Spam detection (“Let’s go to Agra!” vs. “Buy V1AGRA …”)
Part-of-speech (POS) tagging (Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV)
Named entity recognition (NER) (Einstein [PERSON] met with UN [ORG] officials in Princeton [LOC])
making good progress:
Sentiment analysis (“Best roast chicken in San Francisco!” / “The waiter ignored us for 20 minutes.”)
Coreference resolution (Carter told Mubarak he shouldn’t run again.)
Word sense disambiguation (WSD) (I need new batteries for my mouse.)
Parsing (I can see Alcatraz from the window!)
Machine translation (MT) (第13届上海国际电影节开幕… → The 13th Shanghai International Film Festival…)
Information extraction (IE) (“You’re invited to our dinner party, Friday May 27 at 8:30” → calendar entry: Party, May 27, add)
still really hard:
Question answering (QA) (Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?)
Paraphrase (XYZ acquired ABC yesterday ≈ ABC has been taken over by XYZ)
Summarization (The Dow Jones is up / Housing prices rose / The S&P500 jumped → Economy is good)
Dialog (Where is Citizen Kane playing in SF? → Castro Theatre at 7:30. Do you want a ticket?)
Ambiguity makes NLP hard: “Crash blossoms”
Violinist Linked to JAL Crash Blossoms
Teacher Strikes Idle Kids
Red Tape Holds Up New Bridges
Hospitals Are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
non-standard English
Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥
segmentation issues
the New York-New Haven Railroad (which words group together?)
idioms
dark horse, get cold feet, lose face, throw in the towel
neologisms
unfriend, Retweet, bromance
tricky entity names
Where is A Bug’s Life playing …
Let It Be was recorded …
… a mutation on the for gene …
world knowledge
Mary and Sue are sisters.
Mary and Sue are mothers.
But that’s what makes it fun!
Why else is natural language understanding difficult?
Levels of Language
Phonetics/phonology/morphology: what words (or subwords) are we dealing with?
Syntax: What phrases are we dealing with? Which words modify one another?
Semantics: What’s the literal meaning?
Pragmatics: What should you conclude from the fact that I said something? How should you react?
12
What’s hard – ambiguities, ambiguities, all different levels of ambiguities
John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner]
- donut: To get a donut (doughnut; spare tire) for his car?
- Donut store: store where donuts shop? or is run by donuts? or looks like a big donut? or made of donut?
- From work: Well, actually, he stopped there from hunger and exhaustion, not just from work.
- Every few hours: That’s how often he thought it? Or that’s for coffee?
- it: the particular coffee that was good every few hours? the donut store? the situation?
- Too expensive: too expensive for what? what are we supposed to conclude about what John did?
13
Statistical NLP
Imagine:
Each sentence W = w1, w2, ..., wn gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
For every possible context X, sort all the imaginable sentences W according to P(W|X):
Ideal situation:
the best sentence is the one most probable in context X; “ungrammatical” sentences end up at the bottom of the P(W) ranking
NB: same for interpretation
14
Real World Situation
Unable to specify set of grammatical sentences today using fixed “categorical” rules (maybe never)
Use statistical “model” based on REAL WORLD DATA and care about the best sentence only (disregarding the “grammaticality” issue):
Wbest = argmaxW P(W)
15
The Noisy Channel
Prototypical case:
Input → The channel (adds noise) → Output (noisy)
0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
Model: probability of error (noise):
Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
The Task:
known: the noisy output; want to know: the input (decoding)
17
Noisy Channel Applications
OCR
straightforward: text → print (adds noise), scan → image
Handwriting recognition
text → neurons, muscles (“noise”), scan/digitize → image
Speech recognition (dictation, commands, etc.)
text → conversion to acoustic signal (“noise”) → acoustic waves
Machine Translation
text in target language → translation (“noise”) → source language
Also: Part of Speech Tagging
sequence of tags → selection of word forms → text
18
Noisy Channel: The Golden Rule of ...
OCR, ASR, HR, MT, ...
Recall:
p(A|B) = p(B|A) p(A) / p(B) (Bayes formula)
Abest = argmaxA p(B|A) p(A) (The Golden Rule)
p(B|A): the acoustic/image/translation/lexical model (application-specific name; will explore later)
p(A): the language model
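The Golden Rule can be read directly as a decoding procedure. A minimal sketch in Python; the candidate list and the channel/language-model probabilities are made-up toy numbers for illustration, not from the lecture:

```python
# Noisy-channel decoding sketch: A_best = argmax_A p(B|A) * p(A).
# In a real system p(B|A) is the channel (acoustic/image/translation) model
# and p(A) is the language model; the numbers below are invented.

def decode(observed, candidates, channel_prob, lm_prob):
    """Return the source string A maximizing p(observed | A) * p(A)."""
    return max(candidates, key=lambda a: channel_prob(observed, a) * lm_prob(a))

# Hypothetical spelling-correction flavour: which source did the writer intend?
candidates = ["fifteen minutes", "fifteen minuets"]
lm = {"fifteen minutes": 1e-4, "fifteen minuets": 1e-8}            # p(A)
channel = {("fifteen minuets", "fifteen minutes"): 0.01,           # p(B | A)
           ("fifteen minuets", "fifteen minuets"): 0.95}

best = decode("fifteen minuets", candidates,
              channel_prob=lambda b, a: channel.get((b, a), 0.0),
              lm_prob=lambda a: lm[a])
print(best)  # -> fifteen minutes: the language model outweighs the channel model
```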
19
Probabilistic Language Models
• Today’s goal: assign a probability to a sentence
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Spell Correction
• The office is about fifteen minuets from my house
P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Why?
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
Better name: “the grammar”. But “language model” or LM is standard.
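For example, using the speech-recognition sentence above, the chain rule decomposes the joint probability into conditionals:
P(I saw a van) = P(I) × P(saw | I) × P(a | I, saw) × P(van | I, saw, a)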
n-gram Language Models
(n-1)th order Markov approximation → n-gram LM:
p(W) =df Πi=1..d p(wi | wi-n+1, wi-n+2, ..., wi-1), where d = |W| (the sentence length)
In particular (assume vocabulary |V| = 60k):
0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
1-gram LM: unigram model, p(w), 6 × 10^4 parameters
2-gram LM: bigram model, p(wi|wi-1), 3.6 × 10^9 parameters
3-gram LM: trigram model, p(wi|wi-2,wi-1), 2.16 × 10^14 parameters
22
Maximum Likelihood Estimate
MLE: Relative Frequency...
...best predicts the data at hand (the “training data”)
Trigrams from Training Data T:
count sequences of three words in T: c3(wi-2,wi-1,wi) [NB: notation: just saying that the three words follow each other]
count sequences of two words in T: c2(wi-1,wi):
either use c2(y,z) = Σw c3(y,z,w)
or count differently at the beginning (& end) of data!
p(wi|wi-2,wi-1) =est. c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)
23
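A minimal sketch of this relative-frequency estimate in code; here c2 is counted directly from bigrams (i.e. the “count differently at the end of data” option), and the toy corpus is the one used on the next slide:

```python
from collections import Counter

def trigram_mle(tokens):
    """Relative-frequency trigram model: p(w | u, v) = c3(u, v, w) / c2(u, v)."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))   # three-word sequences
    c2 = Counter(zip(tokens, tokens[1:]))                # two-word histories
    return lambda w, u, v: c3[(u, v, w)] / c2[(u, v)] if c2[(u, v)] else 0.0

# Training data of the example on the next slide:
tokens = "<s> <s> He can buy the can of soda .".split()
p3 = trigram_mle(tokens)
print(p3("buy", "He", "can"))   # 1.0
print(p3("of", "the", "can"))   # 1.0
```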
LM: an Example
Training data:
<s> <s> He can buy the can of soda.
Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125
p1(can) = .25
Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5,
p2(of|can) = .5, p2(the|buy) = 1,...
Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1,
p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
(normalized for all n-grams)
Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
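As a check on the first figure: the unigram distribution has six words with probability .125 and one word (“can”) with .25, so H(p1) = −(6 × .125 × log2 .125 + .25 × log2 .25) = 2.25 + .5 = 2.75 bits. The bigram model is deterministic except after “can” (contributing .25 × 1 bit = .25), and the trigram model is fully deterministic, hence H(p3) = 0.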
24
LM: an Example (The Problem)
Cross-entropy:
S = <s> <s> It was the greatest buy of all. (test data)
Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:
all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.
all bigram probabilities are 0.
all trigram probabilities are 0.
We want to make all probabilities non-zero: data sparseness handling.
25
Why do we need Nonzero Probs?
To avoid infinite Cross Entropy:
happens when an event is found in test data which has not been seen in training data
H(p) = ∞: prevents comparing data with ≥ 0 “errors”
To make the system more robust
low count estimates:
they typically happen for “detailed” but relatively rare appearances
high count estimates: reliable but less “detailed”
26
Eliminating the Zero Probabilities: Smoothing
Get new p’(w) (same Ω): almost p(w) but no zeros
Discount w for (some) p(w) > 0: new p’(w) < p(w)
Σw∈discounted (p(w) - p’(w)) = D
Distribute D to all w with p(w) = 0: new p’(w) > p(w)
possibly also to other w with low p(w)
For some w (possibly): p’(w) = p(w)
Make sure Σw∈Ω p’(w) = 1
There are many ways of smoothing
27
Smoothing by Adding 1 (Laplace)
Simplest but not really usable:
Predicting words w from a vocabulary V, training data T:
p’(w|h) = (c(h,w) + 1) / (c(h) + |V|)
for non-conditional distributions: p’(w) = (c(w) + 1) / (|T| + |V|)
Problem if |V| > c(h) (as is often the case; even >> c(h)!)
Example:
Training data: <s> what is it what is small ?   |T| = 8
V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
Unsmoothed: p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≅ .001
p(it is flying.) = .125 × .25 × 0² = 0
Add-1: p’(it) = .1, p’(what) = .15, p’(.) = .05
p’(what is it?) = .15² × .1² ≅ .0002
p’(it is flying.) = .1 × .15 × .05² ≅ .00004
(assume word independence!)
28
Adding less than 1
Equally simple:
Predicting words w from a vocabulary V, training data T:
p’(w|h) = (c(h,w) + λ) / (c(h) + λ|V|), λ < 1
for non-conditional distributions: p’(w) = (c(w) + λ) / (|T| + λ|V|)
Example:
Training data: <s> what is it what is small ?   |T| = 8
V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
Unsmoothed: p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≅ .001
p(it is flying.) = .125 × .25 × 0² = 0
Use λ = .1:
p’(it) ≅ .12, p’(what) ≅ .23, p’(.) ≅ .01
p’(what is it?) = .23² × .12² ≅ .0007
p’(it is flying.) = .12 × .23 × .01² ≅ .000003
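A minimal sketch of the non-conditional add-λ estimate p’(w) = (c(w) + λ) / (|T| + λ|V|), reproducing the numbers above (λ = 1 gives the add-1/Laplace figures of the previous slide, λ = .1 the figures here):

```python
from collections import Counter

def add_lambda_unigram(tokens, vocab, lam):
    """p'(w) = (c(w) + lambda) / (|T| + lambda * |V|)."""
    counts, T, V = Counter(tokens), len(tokens), len(vocab)
    return lambda w: (counts[w] + lam) / (T + lam * V)

tokens = "<s> what is it what is small ?".split()                 # |T| = 8
vocab = ["what", "is", "it", "small", "?", "<s>",
         "flying", "birds", "are", "a", "bird", "."]              # |V| = 12

p1 = add_lambda_unigram(tokens, vocab, lam=1.0)    # add-1 (Laplace)
print(p1("it"), p1("what"), p1("."))               # 0.1 0.15 0.05

p01 = add_lambda_unigram(tokens, vocab, lam=0.1)   # add-lambda, lambda = .1
print(round(p01("it"), 2), round(p01("what"), 2), round(p01("."), 2))  # 0.12 0.23 0.01
```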
29
Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(-1/N)
By the chain rule:
PP(W) = ( Πi=1..N 1 / P(wi | w1 … wi-1) )^(1/N)
For bigrams:
PP(W) = ( Πi=1..N 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
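A small sketch of the bigram formula; the “model” plugged in here is a hypothetical one that gives every continuation probability 1/4, so its perplexity is exactly 4 (four equally likely choices per word):

```python
import math

def perplexity(tokens, bigram_prob):
    """PP(W) = P(w1 ... wN)^(-1/N), computed in log space for stability."""
    log_p = sum(math.log(bigram_prob(w, prev))
                for prev, w in zip(["<s>"] + tokens, tokens))
    return math.exp(-log_p / len(tokens))

quarter = lambda w, prev: 0.25        # hypothetical, deliberately trivial model
print(perplexity("I saw a van".split(), quarter))   # 4.0 (up to floating point)
```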
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Dan Jurafsky
Text Classification: definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Bag of words for document classification
Training classes and their characteristic words:
Planning: planning, temporal, reasoning, plan, language, ...
Machine Learning: learning, training, algorithm, shrinkage, network, ...
NLP: parser, tag, training, translation, language, ...
Garbage Collection: garbage, collection, memory, optimization, region, ...
GUI: ...
Test document: parser, language, label, translation, … → which class?
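The bag-of-words representation sketched in the figure is just a word-count vector; a minimal version (the example document echoes the test document above):

```python
from collections import Counter

def bag_of_words(text):
    """A document as an unordered multiset of words (word -> count)."""
    return Counter(text.lower().split())

print(bag_of_words("parser language label translation parser"))
# Counter({'parser': 2, 'language': 1, 'label': 1, 'translation': 1})
```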
Naïve Bayes Classifier (I)
cMAP = argmaxc∈C P(c | d)   (MAP is “maximum a posteriori” = most likely class)
     = argmaxc∈C P(d | c) P(c) / P(d)   (Bayes Rule)
     = argmaxc∈C P(d | c) P(c)   (dropping the denominator, which is the same for every class)
Generative Model for Multinomial Naïve Bayes
38
Example: the class c = China generates the document words X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds.
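A minimal multinomial Naïve Bayes sketch of the MAP rule above, scoring each class by P(c) Πi P(xi | c). The training documents, labels, and the add-1 smoothing of the word likelihoods are illustrative assumptions, not from the lecture:

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes with add-1 smoothing on word likelihoods."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, c in zip(docs, labels):
            for w in doc.split():
                self.word_counts[c][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, doc):
        def log_posterior(c):
            total = sum(self.word_counts[c].values())
            score = math.log(self.prior[c])
            for w in doc.split():
                # add-1 smoothed P(w | c): unseen words still get some mass
                score += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.classes, key=log_posterior)

# Toy example in the spirit of the slide (documents and labels are made up):
docs = ["Shanghai and Shenzhen issue bonds", "Chinese stocks rise",
        "Tokyo trade deal", "Japan exports grow"]
labels = ["China", "China", "Japan", "Japan"]
model = MultinomialNB().fit(docs, labels)
print(model.predict("Shenzhen bonds issue"))   # China
```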