LING/C SC 581: Advanced Computational Linguistics, Lecture Notes Jan 22nd


Page 1:

LING/C SC 581: Advanced Computational Linguistics

Lecture Notes Jan 22nd

Page 2:

Today's Topics

• Minimum Edit Distance Homework
• Corpora: frequency information
• tregex

Page 3:

Minimum Edit Distance Homework

• Background:
  – ... about 20% of the time "Britney Spears" is misspelled when people search for it on Google

• Software for generating misspellings
  – If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
  – http://www.geneffects.com/typopositive/

Page 4:

Minimum Edit Distance Homework

• http://www.google.com/jobs/archive/britney.html

Top six misspellings

• Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible):
  – e.g. ED(brittany) < ED(britany)
  – (a starting-point sketch follows below)
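A minimal weighted edit distance sketch in Python (Wagner-Fischer dynamic programming) as a starting point. The cost parameters are illustrative placeholders to tune, and the listed misspellings are common ones from the Google page, given here only as test input:

def edit_distance(source, target, ins=1.0, dele=1.0, sub=1.0):
    # d[i][j] = cheapest cost of transforming source[:i] into target[:j]
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if source[i - 1] == target[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # delete a source char
                          d[i][j - 1] + ins,        # insert a target char
                          d[i - 1][j - 1] + cost)   # substitute (or match)
    return d[m][n]

for w in ["brittany", "brittney", "britany", "britny", "briteny", "britteny"]:
    print(w, edit_distance("britney", w))

With uniform costs, several misspellings tie; the homework is to choose weights (or add operations, e.g. transposition) so the ranking matches the observed frequencies.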

Page 5:

Minimum Edit Distance Homework

• Submit your homework in PDF
  – how many you got right
  – explain your criteria, e.g. weights chosen

• you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well

• due by email to me before next Thursday's class...
  – put your name and 581 at the top of your submission

Page 6:

Part 2

• Corpora: frequency information

• Unlabeled corpus: just words (easy to find)
• Labeled corpus: various kinds (progressively harder to create or obtain) ...
  – POS information
  – Information about phrases
  – Word sense or semantic role labeling

Page 7:

Language Models and N-grams

• given a word sequence
  – w_1 w_2 w_3 ... w_n

• chain rule
  – how to compute the probability of a sequence of words
  – p(w_1 w_2) = p(w_1) p(w_2|w_1)
  – p(w_1 w_2 w_3) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2)
  – ...
  – p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-2} w_{n-1})

• note
  – it's not easy to collect (meaningful) statistics on p(w_n|w_{n-1} w_{n-2} ... w_1) for all possible word sequences

Page 8:

Language Models and N-grams

• Given a word sequence
  – w_1 w_2 w_3 ... w_n

• Bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov Assumption: finite length history
  – 1st order Markov Model
  – p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-3} w_{n-2} w_{n-1})
  – p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})

• note
  – p(w_n|w_{n-1}) is a lot easier to collect data for (and thus estimate well) than p(w_n|w_1 ... w_{n-2} w_{n-1})

Page 9:

Language Models and N-grams

• Trigram approximation
  – 2nd order Markov Model
  – just look at the preceding two words only
  – p(w_1 w_2 w_3 w_4 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) p(w_4|w_1 w_2 w_3) ... p(w_n|w_1 ... w_{n-3} w_{n-2} w_{n-1})
  – p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) p(w_4|w_2 w_3) ... p(w_n|w_{n-2} w_{n-1})

• note
  – p(w_n|w_{n-2} w_{n-1}) is a lot easier to estimate well than p(w_n|w_1 ... w_{n-2} w_{n-1}) but harder than p(w_n|w_{n-1})

Page 10:

Language Models and N-grams

• estimating from corpora
  – how to compute bigram probabilities
  – p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1} w), where w is any word, i.e. the denominator is Σ_w f(w_{n-1} w)
  – since Σ_w f(w_{n-1} w) = f(w_{n-1}), the unigram frequency for w_{n-1}
  – p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})   (relative frequency)

• Note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE); a toy-corpus sketch follows
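A minimal MLE sketch over a toy corpus (the corpus below is made up for illustration; a trigram version would simply count zip(corpus, corpus[1:], corpus[2:])):

from collections import Counter

corpus = "I want to eat Chinese food . I want lunch .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, w_prev):
    # relative frequency estimate: f(w_prev w) / f(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_mle("want", "I"))  # f(I want)/f(I) = 2/2 = 1.0
print(p_mle("food", "I"))  # 0.0: an unseen bigram -- the zero problem below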

Page 11:

Motivation for smoothing

• Smoothing: avoid zero probability estimates
• Consider the bigram model:
  p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})
• what happens when any individual probability component is zero?
  – Arithmetic multiplication law: 0 × X = 0
  – very brittle!

• even in a very large corpus, many possible n-grams over vocabulary space will have zero frequency
  – particularly so for larger n-grams

Page 12:

Language Models and N-grams

• Example:
  – unigram frequencies f(w_{n-1})
  – bigram frequencies f(w_{n-1} w_n): a sparse matrix with rows w_{n-1} and columns w_n
  – bigram probabilities
  – zeros render probabilities unusable
  – (we'll need to add fudge factors, i.e. do smoothing)

Page 13:

Smoothing and N-grams

• sparse dataset means zeros are a problem
  – Zero probabilities are a problem
    • p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})   (bigram model)
    • one zero and the whole product is zero
  – Zero frequencies are a problem
    • p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})   (relative frequency)
    • bigram f(w_{n-1} w_n) doesn't exist in the dataset

• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value

Page 14:

Smoothing and N-grams

• Add-One Smoothing (4.5.1 Laplace Smoothing)
  – add 1 to all frequency counts
  – simple, and no more zeros (but there are better methods)

• unigram
  – p(w) = f(w)/N   (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)   (with Add-One)
  – f*(w) = (f(w)+1) N/(N+V)   (with Add-One)
    • V = number of distinct words in corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One

• bigram
  – p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})   (before Add-One)
  – p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V)   (after Add-One)
  – f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) f(w_{n-1}) / (f(w_{n-1})+V)   (after Add-One)

must rescale so that total probability mass stays at 1 (a sketch of these formulas follows)
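A minimal sketch of the add-one formulas, reusing the made-up toy corpus from the MLE sketch above:

from collections import Counter

corpus = "I want to eat Chinese food . I want lunch .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # number of distinct words

def p_addone(w, w_prev):
    # add-one smoothed bigram probability: (f(w_prev w)+1) / (f(w_prev)+V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def f_star(w, w_prev):
    # adjusted count: f* = (f(w_prev w)+1) * f(w_prev) / (f(w_prev)+V)
    return (bigrams[(w_prev, w)] + 1) * unigrams[w_prev] / (unigrams[w_prev] + V)

print(p_addone("food", "I"))  # no longer zero
print(f_star("want", "I"))    # the observed count 2 is discounted to 0.6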

Page 15:

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts

• bigram
  – p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V)
  – f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) f(w_{n-1}) / (f(w_{n-1})+V)

• frequencies

Remarks: perturbation problem
  – add-one causes large changes in some frequencies due to the relative size of V (1616)
  – want to: 786 ⇒ 338

bigram frequencies (= figure 6.4):

          I      want    to      eat     Chinese  food    lunch
I         8      1087    0       13      0        0       0
want      3      0       786     0       6        8       6
to        3      0       10      860     3        0       12
eat       0      0       2       0       19       2       52
Chinese   2      0       0       0       0        120     1
food      19     0       17      0       0        0       0
lunch     4      0       0       0       0        1       0

add-one adjusted frequencies (= figure 6.8):

          I      want    to      eat     Chinese  food    lunch
I         6.12   740.05  0.68    9.52    0.68     0.68    0.68
want      1.72   0.43    337.76  0.43    3.00     3.86    3.00
to        2.67   0.67    7.35    575.41  2.67     0.67    8.69
eat       0.37   0.37    1.10    0.37    7.35     1.10    19.47
Chinese   0.35   0.12    0.12    0.12    0.12     14.09   0.23
food      9.65   0.48    8.68    0.48    0.48     0.48    0.48
lunch     1.11   0.22    0.22    0.22    0.22     0.44    0.22

Page 16:

Smoothing and N-grams

• Add-One Smoothing
  – add 1 to all frequency counts

• bigram
  – p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V)
  – f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) f(w_{n-1}) / (f(w_{n-1})+V)

• Probabilities

Remarks: perturbation problem
  – similar changes in the probabilities

bigram probabilities (= figure 6.5):

          I        want     to       eat      Chinese  food     lunch
I         0.00233  0.31626  0.00000  0.00378  0.00000  0.00000  0.00000
want      0.00247  0.00000  0.64691  0.00000  0.00494  0.00658  0.00494
to        0.00092  0.00000  0.00307  0.26413  0.00092  0.00000  0.00369
eat       0.00000  0.00000  0.00213  0.00000  0.02026  0.00213  0.05544
Chinese   0.00939  0.00000  0.00000  0.00000  0.00000  0.56338  0.00469
food      0.01262  0.00000  0.01129  0.00000  0.00000  0.00000  0.00000
lunch     0.00871  0.00000  0.00000  0.00000  0.00000  0.00218  0.00000

add-one smoothed probabilities (= figure 6.7):

          I        want     to       eat      Chinese  food     lunch
I         0.00178  0.21532  0.00020  0.00277  0.00020  0.00020  0.00020
want      0.00141  0.00035  0.27799  0.00035  0.00247  0.00318  0.00247
to        0.00082  0.00021  0.00226  0.17672  0.00082  0.00021  0.00267
eat       0.00039  0.00039  0.00117  0.00039  0.00783  0.00117  0.02075
Chinese   0.00164  0.00055  0.00055  0.00055  0.00055  0.06616  0.00109
food      0.00641  0.00032  0.00577  0.00032  0.00032  0.00032  0.00032
lunch     0.00241  0.00048  0.00048  0.00048  0.00048  0.00096  0.00048

Page 17:

Smoothing and N-grams

• let's illustrate the problem
  – take the bigram case: w_{n-1} w_n
  – p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})
  – suppose there are cases w_{n-1} w^0_1, ..., w_{n-1} w^0_m that don't occur in the corpus

[Diagram: the probability mass f(w_{n-1}) is divided among the observed bigrams f(w_{n-1} w_n); the unseen bigrams f(w_{n-1} w^0_1) = 0 ... f(w_{n-1} w^0_m) = 0 get none of it]

Page 18:

Smoothing and N-grams

• add-one
  – "give everyone 1"

[Diagram: the same probability mass f(w_{n-1}), with every count incremented: observed bigrams become f(w_{n-1} w_n)+1 and the unseen bigrams become f(w_{n-1} w^0_1) = 1 ... f(w_{n-1} w^0_m) = 1]

Page 19:

Smoothing and N-grams

• add-one
  – "give everyone 1"

[Diagram as on the previous slide, with V = |{w_i}|, the vocabulary size]

• redistribution of probability mass
  – p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V)
  – (a quick numeric check follows)
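A quick check, again on the made-up toy corpus from the earlier sketches, that add-one redistribution keeps the total probability mass at 1:

from collections import Counter

corpus = "I want to eat Chinese food . I want lunch .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

# summing (f(w_prev w)+1) / (f(w_prev)+V) over the whole vocabulary gives 1:
# seen bigrams give up mass, unseen ones receive it
total = sum((bigrams[("I", w)] + 1) / (unigrams["I"] + V) for w in unigrams)
print(total)  # 1.0 (up to floating-point error)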

Page 20:

Smoothing and N-grams

• Good-Turing Discounting (4.5.2)
  – N_c = number of things (= n-grams) that occur c times in the corpus
  – N = total number of things seen
  – Formula: smoothed count c* for things with frequency c given by c* = (c+1) N_{c+1} / N_c
  – Idea: use frequency of things seen once to estimate frequency of things we haven't seen yet
  – estimate N_0 in terms of N_1 ... and so on; but if N_c = 0, smooth that first using something like log(N_c) = a + b log(c)
  – Formula: P*(things with zero frequency) = N_1/N
  – smaller impact than Add-One

• Textbook Example (a sketch follows):
  – Fishing in a lake with 8 species
    • bass, carp, catfish, eel, perch, salmon, trout, whitefish
  – Sample data (6 out of 8 species):
    • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
  – P(unseen new fish, i.e. bass or catfish) = N_1/N = 3/18 = 0.17
  – P(next fish = trout) = 1/18
    • (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula...)
  – revised count for trout: c*(trout) = 2 N_2/N_1 = 2 (1/3) = 0.67 (discounted from 1)
  – revised P(next fish = trout) = 0.67/18 = 0.037
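A minimal sketch of the Good-Turing quantities on the textbook fish data:

from collections import Counter

# 18 fish observed; 6 of the 8 species seen (bass and catfish unseen)
sample = ["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2 + \
         ["trout", "salmon", "eel"]

counts = Counter(sample)       # c for each seen species
N = sum(counts.values())       # 18 fish in total
Nc = Counter(counts.values())  # Nc[c] = number of species seen c times

# probability mass reserved for unseen species: N1/N
print(Nc[1] / N)               # 3/18 = 0.17

def c_star(c):
    # Good-Turing discounted count c* = (c+1) * N_{c+1} / N_c
    # (real implementations smooth the Nc curve first, since N_{c+1} can be 0)
    return (c + 1) * Nc[c + 1] / Nc[c]

print(c_star(1))               # trout: 2 * N2/N1 = 0.67, discounted from 1
print(c_star(1) / N)           # revised P(next fish = trout) = 0.037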

Page 21:

Language Models and N-grams

• N-gram models + smoothing
  – one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
  – N-gram models can also incorporate word classes, e.g. POS labels when available

Page 22:

Language Models and N-grams

• N-gram models
  – data is easy to obtain
    • any unlabeled corpus will do
  – they're technically easy to compute
    • count frequencies and apply the smoothing formula
  – but just how good are these n-gram language models?
  – and what can they show us about language?

Page 23:

Language Models and N-grams

• Approximating Shakespeare
  – generate random sentences using n-grams
  – Corpus: Complete Works of Shakespeare

• Unigram (pick random, unconnected words)

• Bigram

Page 24:

Language Models and N-grams

• Approximating Shakespeare
  – generate random sentences using n-grams (a generator sketch follows)
  – Corpus: Complete Works of Shakespeare

• Trigram

• Quadrigram

Remarks: dataset size problem
  – the training set is small: 884,647 words, 29,066 different words
  – 29,066² = 844,832,356 possible bigrams
  – for the random sentence generator, this means very limited choices for possible continuations, which means the program can't be very innovative for higher n
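A minimal bigram sentence-generator sketch; the token list here is a tiny stand-in (with the real corpus you would load the ~884,647 Shakespeare tokens instead):

import random
from collections import Counter, defaultdict

tokens = "to be or not to be that is the question .".split()

# conditional distribution: counts of each next word given the previous word
nexts = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    nexts[w1][w2] += 1

def generate(start, length=10):
    # sample a word sequence from the bigram model p(w_n | w_{n-1})
    out = [start]
    for _ in range(length):
        dist = nexts[out[-1]]
        if not dist:  # dead end: no observed continuation
            break
        words, counts = zip(*dist.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("to"))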

Page 25:

Language Models and N-grams

• A limitation:
  – produces ungrammatical sequences

• Treebank:
  – potential to be a better language model
  – Structural information:
    • contains frequency information about syntactic rules
  – we should be able to generate sequences that are closer to English ...

Page 26:

Language Models and N-grams

• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm

Page 27:

Part 3

tregex
• I assume everyone has:
  1. Installed Penn Treebank v3
  2. Downloaded and installed tregex

Page 28:

Trees in the Penn Treebank

Notation: LISP S-expression

Directory: TREEBANK_3/parsed/mrg/

Page 29:

tregex
• Search Example: << dominates, < immediately dominates (illustrated in the sketch below)
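tregex itself is a Java tool; purely to illustrate what these two operators mean, here is a Python sketch using nltk.Tree over a made-up bracketing (assumes NLTK is installed):

from nltk import Tree

t = Tree.fromstring(
    "(S (NP (DT the) (NN cat))"
    "   (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")

def immediately_dominates(tree, parent, child):
    # tregex 'parent < child': a node labeled parent has a direct
    # daughter labeled child
    for st in tree.subtrees(lambda s: s.label() == parent):
        if any(isinstance(d, Tree) and d.label() == child for d in st):
            return True
    return False

def dominates(tree, ancestor, descendant):
    # tregex 'ancestor << descendant': a node labeled descendant appears
    # anywhere below a node labeled ancestor
    for st in tree.subtrees(lambda s: s.label() == ancestor):
        if any(d.label() == descendant for d in st.subtrees() if d is not st):
            return True
    return False

print(immediately_dominates(t, "VP", "NP"))  # False: the NP is inside the PP
print(dominates(t, "VP", "NP"))              # True: VP << NP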

Page 30:

tregex Help

Page 31:

tregex Help

Page 32:

tregex
• Help: tregex expression syntax is non-standard wrt bracketing

S < VP
S < NP

Page 33:

tregex

• Help: tregex boolean syntax is also non-standard

Page 34:

tregex

• Help

Page 35:

tregex
• Help

Page 36:

tregex
• Pattern:
  – (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)

Key:
  <,  first child
  $+  immediate left sister
  <-  last child
  =comma  names the node, so both occurrences of =comma refer to the same node

(an example match follows)
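As an illustration (my example, not from the slides), this is a classic appositive pattern: it matches an NP whose first child is an NP immediately followed by a comma, then another NP, then a comma, where that closing comma is also the outer NP's last child, e.g.:

(NP (NP the senator) (, ,) (NP a Democrat) (, ,))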

Page 37:

tregex

• Help

Page 38:

tregex

Page 39:

tregex
• Different results from:
  – @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))

Page 40:

tregex

Example: WHADVP is also possible (not just WHNP)

Page 41:

Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide

Page 42:

Treebank Guides

• Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages)

tagguid2.pdf: addendum, see POS tag 'TO'

Page 43:

Treebank Guides

• Parsing guide 1, prsguid1.pdf (318 pages)

prsguid2.pdf: addendum for the Switchboard corpus