LING/C SC 581: Advanced Computational Linguistics
Lecture Notes Jan 22nd
Today's Topics
• Minimum Edit Distance Homework
• Corpora: frequency information
• tregex
Minimum Edit Distance Homework
• Background:
  – … about 20% of the time “Britney Spears” is misspelled when people search for it on Google
• Software for generating misspellings
  – If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
– http://www.geneffects.com/typopositive/
Minimum Edit Distance Homework
• http://www.google.com/jobs/archive/britney.html
Top six misspellings
• Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible):
  – e.g. ED(brittany) < ED(britany)
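• For a starting point, here is a minimal Python sketch of a weighted minimum edit distance (the cost values, the helper name edit_distance, and the candidate list are illustrative assumptions, not the intended solution):

# Weighted Levenshtein distance: a possible starting point for ranking misspellings.
# The default costs below are placeholders; the homework is about choosing better ones.

def edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=2.0):
    """Dynamic-programming minimum edit distance with adjustable operation costs."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: cost of source[:i] -> target[:j]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,       # deletion
                          d[i][j - 1] + ins_cost,       # insertion
                          d[i - 1][j - 1] + sub)        # substitution / match
    return d[n][m]

# Candidates: the two misspellings from the slide; add the rest from the Google page.
candidates = ["brittany", "britany"]
for w in sorted(candidates, key=lambda w: edit_distance("britney", w)):
    print(w, edit_distance("britney", w))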
Minimum Edit Distance Homework
• Submit your homework in PDF
  – how many you got right
  – explain your criteria, e.g. the weights chosen
• you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well
• due by email to me before next Thursday's class…
  – put your name and 581 at the top of your submission
Part 2
• Corpora: frequency information
• Unlabeled corpus: just words (easy to find)
• Labeled corpus: various kinds … (progressively harder to create or obtain)
  – POS information
  – Information about phrases
  – Word sense or semantic role labeling
Language Models and N-grams
• given a word sequence
  – w1 w2 w3 ... wn
• chain rule
  – how to compute the probability of a sequence of words
  – p(w1 w2) = p(w1) p(w2|w1)
  – p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2)
  – ...
  – p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-2 wn-1)
• note
  – it’s not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2...w1) for all possible word sequences
Language Models and N-grams
• Given a word sequence
  – w1 w2 w3 ... wn
• Bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov Assumption: finite length history
  – 1st order Markov Model
  – p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-2 wn-1)
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
  – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1...wn-2 wn-1)
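• As an illustration, the bigram factorization written directly in Python (the probability values are made-up placeholders, not estimates from a corpus):

# Bigram approximation: p(w1 ... wn) ≈ p(w1) * product of p(wi | wi-1)
# Probabilities here are invented for illustration only.

p_unigram = {"I": 0.05}
p_bigram = {("I", "want"): 0.33, ("want", "to"): 0.66, ("to", "eat"): 0.28}

def bigram_prob(words):
    prob = p_unigram.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram.get((prev, cur), 0.0)   # one unseen bigram zeroes the product
    return prob

print(bigram_prob(["I", "want", "to", "eat"]))   # 0.05 * 0.33 * 0.66 * 0.28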
Language Models and N-grams
• Trigram approximation
  – 2nd order Markov Model
  – just look at the preceding two words only
  – p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1...wn-2 wn-1)
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2 wn-1)
• note
  – p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1), but harder than p(wn|wn-1)
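• The same kind of sketch for the trigram case (two-word histories; again the probabilities are invented placeholders):

# Trigram approximation: p(w1 ... wn) ≈ p(w1) p(w2|w1) * product of p(wi | wi-2 wi-1)
# Probabilities here are invented for illustration only.

p_uni = {"I": 0.05}
p_bi = {("I", "want"): 0.33}
p_tri = {("I", "want", "to"): 0.70, ("want", "to", "eat"): 0.30}

def trigram_prob(words):
    prob = p_uni.get(words[0], 0.0) * p_bi.get((words[0], words[1]), 0.0)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        prob *= p_tri.get((w1, w2, w3), 0.0)
    return prob

print(trigram_prob(["I", "want", "to", "eat"]))  # 0.05 * 0.33 * 0.70 * 0.30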
Language Models and N-grams
• estimating from corpora
  – how to compute bigram probabilities
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1 w)          where w is any word
  – since f(wn-1 w) = f(wn-1), the unigram frequency of wn-1:
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1)            relative frequency
• Note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
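• A minimal Python sketch of the relative-frequency (MLE) estimate over a toy corpus (the corpus and the function name mle_bigram are assumptions for illustration):

from collections import Counter

# Toy corpus; in practice this would be a large unlabeled text.
corpus = "I want to eat Chinese food . I want to eat lunch .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram(prev, word):
    """Relative frequency estimate: p(word | prev) = f(prev word) / f(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("want", "to"))    # 2/2 = 1.0 in this toy corpus
print(mle_bigram("eat", "lunch"))  # 1/2 = 0.5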
Motivation for smoothing
• Smoothing: avoid zero probability estimates
• Consider: what happens when any individual probability component is zero?
  – Arithmetic multiplication law: 0 × X = 0
  – very brittle!
• even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
  – particularly so for larger n-grams
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Language Models and N-grams
• Example:
  – unigram frequencies
  – bigram frequencies f(wn-1 wn)    (table rows: wn-1, columns: wn)
  – bigram probabilities
  – sparse matrix: zeros render probabilities unusable
  – (we’ll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams
• sparse dataset means zeros are a problem
  – zero probabilities are a problem
    • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
    • one zero and the whole product is zero
  – zero frequencies are a problem
    • p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
    • bigram f(wn-1 wn) doesn’t exist in the dataset
• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value
Smoothing and N-grams
• Add-One Smoothing (4.5.1 Laplace Smoothing)
  – add 1 to all frequency counts
  – simple, and no more zeros (but there are better methods)
• unigram
  – p(w) = f(w)/N                           (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)                   (with Add-One)
  – f*(w) = (f(w)+1)·N/(N+V)                (with Add-One)
    • V = number of distinct words in the corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)                       (before Add-One)
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)               (after Add-One)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)      (after Add-One)
  – must rescale so that total probability mass stays at 1
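• A minimal Python sketch of Add-One smoothing for bigrams, following the formulas above (the toy corpus and its counts are assumptions, not the counts in the tables on the next slides):

from collections import Counter

corpus = "I want to eat Chinese food . I want to eat lunch .".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)            # number of distinct words (8 in this toy corpus)

def addone_prob(prev, word):
    """p(word | prev) = (f(prev word) + 1) / (f(prev) + V)"""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def addone_count(prev, word):
    """Reconstituted count f* = (f(prev word) + 1) * f(prev) / (f(prev) + V)"""
    return (bigram_counts[(prev, word)] + 1) * unigram_counts[prev] / (unigram_counts[prev] + V)

print(addone_prob("eat", "food"))   # unseen bigram: (0+1)/(2+8) = 0.1, no longer zero
print(addone_count("want", "to"))   # (2+1)*2/(2+8) = 0.6, discounted from the raw count 2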
Smoothing and N-‐grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)
• frequencies
  Remarks: perturbation problem
  – Add-One causes large changes in some frequencies due to the relative size of V (1616)
  – e.g. f(want to): 786 ⇒ 338

  Bigram frequencies f(wn-1 wn) (= figure 6.4):
            I      want    to      eat     Chinese  food   lunch
  I         8      1087    0       13      0        0      0
  want      3      0       786     0       6        8      6
  to        3      0       10      860     3        0      12
  eat       0      0       2       0       19       2      52
  Chinese   2      0       0       0       0        120    1
  food      19     0       17      0       0        0      0
  lunch     4      0       0       0       0        1      0

  Add-One smoothed frequencies f*(wn-1 wn) (= figure 6.8):
            I      want    to      eat     Chinese  food   lunch
  I         6.12   740.05  0.68    9.52    0.68     0.68   0.68
  want      1.72   0.43    337.76  0.43    3.00     3.86   3.00
  to        2.67   0.67    7.35    575.41  2.67     0.67   8.69
  eat       0.37   0.37    1.10    0.37    7.35     1.10   19.47
  Chinese   0.35   0.12    0.12    0.12    0.12     14.09  0.23
  food      9.65   0.48    8.68    0.48    0.48     0.48   0.48
  lunch     1.11   0.22    0.22    0.22    0.22     0.44   0.22
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)
• probabilities
  Remarks: perturbation problem
  – similar changes in the probabilities

  Unsmoothed (MLE) bigram probabilities p(wn|wn-1) (= figure 6.5):
            I        want     to       eat      Chinese  food     lunch
  I         0.00233  0.31626  0.00000  0.00378  0.00000  0.00000  0.00000
  want      0.00247  0.00000  0.64691  0.00000  0.00494  0.00658  0.00494
  to        0.00092  0.00000  0.00307  0.26413  0.00092  0.00000  0.00369
  eat       0.00000  0.00000  0.00213  0.00000  0.02026  0.00213  0.05544
  Chinese   0.00939  0.00000  0.00000  0.00000  0.00000  0.56338  0.00469
  food      0.01262  0.00000  0.01129  0.00000  0.00000  0.00000  0.00000
  lunch     0.00871  0.00000  0.00000  0.00000  0.00000  0.00218  0.00000

  Add-One smoothed bigram probabilities (= figure 6.7):
            I        want     to       eat      Chinese  food     lunch
  I         0.00178  0.21532  0.00020  0.00277  0.00020  0.00020  0.00020
  want      0.00141  0.00035  0.27799  0.00035  0.00247  0.00318  0.00247
  to        0.00082  0.00021  0.00226  0.17672  0.00082  0.00021  0.00267
  eat       0.00039  0.00039  0.00117  0.00039  0.00783  0.00117  0.02075
  Chinese   0.00164  0.00055  0.00055  0.00055  0.00055  0.06616  0.00109
  food      0.00641  0.00032  0.00577  0.00032  0.00032  0.00032  0.00032
  lunch     0.00241  0.00048  0.00048  0.00048  0.00048  0.00096  0.00048
Smoothing and N-grams
• let’s illustrate the problem
  – take the bigram case: wn-1 wn
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)
  – suppose there are cases wn-1 wzero1, ..., wn-1 wzerom that don’t occur in the corpus
  – probability mass diagram: all of f(wn-1) is taken up by the seen bigrams f(wn-1 wn);
    the unseen bigrams f(wn-1 wzero1) = ... = f(wn-1 wzerom) = 0 receive no mass
Smoothing and N-‐grams
• add-one: “give everyone 1”
  – probability mass diagram: each count is incremented, so seen bigrams become f(wn-1 wn)+1
    and the unseen bigrams f(wn-1 w01), ..., f(wn-1 w0m) each get a count of 1
Smoothing and N-grams
• add-one: “give everyone 1”
  – as in the diagram above, but the denominator grows: V = |{wi}| extra counts are added to f(wn-1)
• redistribution of probability mass
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
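• A quick check, under the same toy-corpus assumption as the earlier sketches, that the redistributed Add-One probabilities still sum to 1 for a fixed history word:

from collections import Counter

corpus = "I want to eat Chinese food . I want to eat lunch .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = list(unigrams)
V = len(vocab)

prev = "eat"
smoothed = [(bigrams[(prev, w)] + 1) / (unigrams[prev] + V) for w in vocab]
print(sum(smoothed))   # 1.0 (up to floating point): seen bigrams give up mass to the zeros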
Smoothing and N-grams
• Good-Turing Discounting (4.5.2)
  – Nc = number of things (= n-grams) that occur c times in the corpus
  – N = total number of things seen
  – Formula: smoothed count c* for Nc given by c* = (c+1)Nc+1/Nc
  – Idea: use the frequency of things seen once to estimate the frequency of things we haven’t seen yet
    • estimate N0 in terms of N1, and so on
    • but if Nc = 0, smooth that first using something like log(Nc) = a + b·log(c)
  – Formula: P*(things with zero frequency) = N1/N
  – smaller impact than Add-One
• Textbook example:
  – fishing in a lake with 8 species
    • bass, carp, catfish, eel, perch, salmon, trout, whitefish
  – sample data (6 out of 8 species seen):
    • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
  – P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 ≈ 0.17
  – P(next fish = trout) = 1/18
    • (but we have reassigned probability mass, so this needs to be recalculated from the smoothing formula…)
  – revised count for trout: c*(trout) = 2·N2/N1 = 2·(1/3) ≈ 0.67 (discounted from 1)
  – revised P(next fish = trout) = 0.67/18 ≈ 0.037
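• A minimal Python sketch of the Good-Turing numbers for this fishing example (only the quantities used on the slide):

from collections import Counter

# Observed sample: 18 fish, 6 of the 8 species seen.
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())            # 18 things seen in total
Nc = Counter(counts.values())       # Nc[c] = number of species seen exactly c times

# Probability mass reserved for unseen species (catfish, bass): N1 / N
print(Nc[1] / N)                    # 3/18 ≈ 0.17

# Good-Turing revised count: c* = (c + 1) * N(c+1) / Nc
def revised_count(c):
    return (c + 1) * Nc[c + 1] / Nc[c]

print(revised_count(1))             # trout: 2 * N2/N1 = 2 * (1/3) ≈ 0.67
print(revised_count(1) / N)         # revised P(next fish = trout) ≈ 0.037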
Language Models and N-grams
• N-gram models + smoothing
  – one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
  – N-gram models can also incorporate word classes, e.g. POS labels, when available
Language Models and N-grams
• N-gram models
  – data is easy to obtain
    • any unlabeled corpus will do
  – they’re technically easy to compute
    • count frequencies and apply the smoothing formula
– but just how good are these n-‐gram language models?
– and what can they show us about language?
Language Models and N-grams
• Approximating Shakespeare
  – generate random sentences using n-grams
  – Corpus: Complete Works of Shakespeare
• Unigram (pick random, unconnected words)
• Bigram
Language Models and N-grams
• Approximating Shakespeare
  – generate random sentences using n-grams
  – Corpus: Complete Works of Shakespeare
• Trigram
• Quadrigram
Remarks: dataset size problem
  – the training set is small: 884,647 words, 29,066 different words
  – 29,066² = 844,832,356 possible bigrams
  – for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
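• A minimal Python sketch of the random sentence generator idea over bigrams (the text variable is a tiny placeholder; with the Complete Works of Shakespeare it would be read from a file):

import random
from collections import defaultdict

# Tiny placeholder training text; substitute the Shakespeare corpus here.
text = "to be or not to be that is the question".split()

# Table of observed continuations for each word (a bigram model in count form).
continuations = defaultdict(list)
for prev, nxt in zip(text, text[1:]):
    continuations[prev].append(nxt)

def generate(start, length=8):
    """Random walk over observed bigram continuations."""
    word, output = start, [start]
    for _ in range(length - 1):
        if not continuations[word]:
            break                          # no observed continuation: stop early
        word = random.choice(continuations[word])
        output.append(word)
    return " ".join(output)

print(generate("to"))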
Language Models and N-grams
• A limitation:
  – produces ungrammatical sequences
• Treebank:
  – potential to be a better language model
  – Structural information:
    • contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English …
Language Models and N-grams
• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
Part 3
tregex
• I assume everyone has:
  1. Installed Penn Treebank v3
  2. Downloaded and installed tregex
Trees in the Penn Treebank
Notation: LISP S-expression
Directory: TREEBANK_3/parsed/mrg/
tregex
• Search Example: << dominates, < immediately dominates
tregex Help
tregex
• Help: tregex expression syntax is non-standard wrt bracketing
  – S < VP S < NP
tregex
• Help: tregex boolean syntax is also non-standard
tregex
• Help
tregex • Help
tregex
• Pattern:
  – (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
• Key:
  – <,   first child
  – $+   immediate left sister
  – <-   last child
  – =comma: both occurrences refer to the same node
tregex
• Help
tregex
tregex
• Different results from:
  – @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
tregex
Example: WHADVP also possible (not just WHNP)
Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide
Treebank Guides
• Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages):
  – tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides
• Parsing guide 1, prsguid1.pdf (318 pages):
prsguid2.pdf: addendum for the Switchboard corpus