ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickhardt+] "A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing"


TRANSCRIPT

Page 1:

[Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
[Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing

2014/7/12 ACL Reading @ PFI

Nakatani Shuyo, Cybozu Labs Inc.

Page 2:

Kneser-Ney Smoothing [Kneser+ 1995]

• Discounting & Interpolation

$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\left(c(w_{i-n+1}^{i}) - D,\ 0\right)}{c(w_{i-n+1}^{i-1})} + \frac{D \, N_{1+}(w_{i-n+1}^{i-1}\,\cdot)}{c(w_{i-n+1}^{i-1})} \, P(w_i \mid w_{i-n+2}^{i-1})$$

• where $w_m^n = w_m \cdots w_n$ and $N_{1+}(w_m^n\,\cdot) = \left|\{\, w_i : c(w_m^n w_i) > 0 \,\}\right|$, the number of distinct words following $w_m^n$ (i.e. the number of discounted n-gram types).
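To make the interpolation concrete, below is a minimal sketch for the bigram case (n = 2), assuming plain Python dictionaries of counts; the lower-order model is the Kneser-Ney continuation probability $N_{1+}(\cdot\, w) / N_{1+}(\cdot\, \cdot)$. All names are illustrative, not from the papers.

```python
from collections import defaultdict

def train_bigram_kn(sentences, D=0.75):
    """Collect the statistics needed by interpolated Kneser-Ney (bigram case)."""
    c_bigram = defaultdict(int)    # c(w_{i-1} w_i)
    c_context = defaultdict(int)   # c(w_{i-1}) as a bigram context
    followers = defaultdict(set)   # {w_i : c(w_{i-1} w_i) > 0}  ->  N_1+(w_{i-1} .)
    histories = defaultdict(set)   # {w_{i-1} : c(w_{i-1} w_i) > 0}  ->  N_1+(. w_i)
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            c_bigram[(prev, cur)] += 1
            c_context[prev] += 1
            followers[prev].add(cur)
            histories[cur].add(prev)
    n_bigram_types = sum(len(s) for s in followers.values())  # N_1+(. .)

    def prob(cur, prev):
        # lower-order (continuation) probability: N_1+(. w) / N_1+(. .)
        p_cont = len(histories[cur]) / n_bigram_types
        if c_context[prev] == 0:
            return p_cont  # unseen context: fall back entirely to the lower order
        discounted = max(c_bigram[(prev, cur)] - D, 0) / c_context[prev]
        backoff_weight = D * len(followers[prev]) / c_context[prev]
        return discounted + backoff_weight * p_cont

    return prob

p = train_bigram_kn([["a", "b", "c"], ["a", "b", "d"]])
print(p("b", "a"))   # high: "a b" was seen twice
print(p("d", "a"))   # lower but nonzero, via the continuation probability
```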

Page 3:

Modified KN-Smoothing [Chen+ 1999]

๐‘ƒ ๐‘ค๐‘– ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

=๐‘ ๐‘ค๐‘–โˆ’๐‘›+1

๐‘– โˆ’ ๐ท ๐‘ค๐‘–โˆ’๐‘›+1๐‘–

๐‘ ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

+ ๐›พ ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1 ๐‘ƒ ๐‘ค๐‘– ๐‘ค๐‘–โˆ’๐‘›+2

๐‘–โˆ’1

โ€ข where ๐ท ๐‘ = 0 if ๐‘ = 0, ๐ท1 if ๐‘ = 1, ๐ท2 if ๐‘ = 2, _ ๐ท3+ if ๐‘ โ‰ฅ 3

๐›พ ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1 =

[amount of discounting]

๐‘ ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

Weighted Discounting (D_n are estimated by leave-1-out CV)
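Chen & Goodman's closed-form leave-one-out estimates express these discounts through the count-of-counts $n_r$ (the number of n-grams seen exactly $r$ times). A minimal sketch, assuming every $n_r$ that appears in a denominator is nonzero:

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Estimate D1, D2, D3+ from count-of-counts n_r (Chen & Goodman 1999)."""
    n = Counter(ngram_counts.values())   # n[r] = number of n-grams with count exactly r
    Y = n[1] / (n[1] + 2 * n[2])         # leave-one-out estimate of the basic discount
    D1 = 1 - 2 * Y * n[2] / n[1]
    D2 = 2 - 3 * Y * n[3] / n[2]
    D3plus = 3 - 4 * Y * n[4] / n[3]
    return D1, D2, D3plus
```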

Page 4:

[Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count

• When each sentence has a fractional weight
  – domain adaptation
  – EM algorithm on word alignment
• Proposes KN smoothing using expected fractional counts

I'm interested in it!

Page 5:

Model

โ€ข ๐’– means ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1 , and ๐’–โ€ฒ means ๐‘ค๐‘–โˆ’๐‘›+2

๐‘–โˆ’1

โ€ข A sequence ๐’–๐‘ค occurs ๐‘˜ times and each

occurring has probability ๐‘๐‘– (๐‘– = 1,โ‹ฏ , ๐‘˜) as weight,

โ€ข then count ๐‘(๐’–๐‘ค) is distributed according to Poisson Binomial Distribution.

โ€ข ๐‘ ๐‘ ๐‘ข๐‘ค = ๐‘Ÿ = ๐‘  ๐‘˜, ๐‘Ÿ , where

๐‘  ๐‘˜, ๐‘Ÿ =

๐‘  ๐‘˜ โˆ’ 1, ๐‘Ÿ 1 โˆ’ ๐‘๐‘˜

+ ๐‘  ๐‘˜ โˆ’ 1, ๐‘Ÿ โˆ’ 1 ๐‘๐‘˜

if 0 โ‰ค ๐‘Ÿ โ‰ค ๐‘˜1 if ๐‘˜ = ๐‘Ÿ = 00 otherwise
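The recurrence is easy to evaluate by dynamic programming over the $k$ weighted occurrences; a minimal sketch (the function name is mine):

```python
def poisson_binomial_pmf(ps):
    """Return [p(c = 0), ..., p(c = k)] for a count c that is the sum of
    independent Bernoulli(p_i) occurrences, via
    s(k, r) = s(k-1, r) * (1 - p_k) + s(k-1, r-1) * p_k."""
    s = [1.0]                        # s(0, 0) = 1
    for p in ps:
        nxt = [0.0] * (len(s) + 1)
        for r, v in enumerate(s):
            nxt[r] += v * (1 - p)    # this occurrence does not happen
            nxt[r + 1] += v * p      # this occurrence happens
        s = nxt
    return s

# e.g. "uw" occurs three times with weights 0.9, 0.5, 0.2
print(poisson_binomial_pmf([0.9, 0.5, 0.2]))
```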

Page 6:

MLE on this model

• Expectations
  – $\mathbb{E}[c(\boldsymbol{u}w)] = \sum_r r \cdot p\!\left(c(\boldsymbol{u}w) = r\right)$
  – $\mathbb{E}[N_r(\boldsymbol{u}\,\cdot)] = \sum_w p\!\left(c(\boldsymbol{u}w) = r\right)$
  – $\mathbb{E}[N_{r+}(\boldsymbol{u}\,\cdot)] = \sum_w p\!\left(c(\boldsymbol{u}w) \geq r\right)$
• Maximize the (expected) likelihood
  – $\mathbb{E}[L] = \mathbb{E}\!\left[\sum_{\boldsymbol{u}w} c(\boldsymbol{u}w) \log p(w \mid \boldsymbol{u})\right] = \sum_{\boldsymbol{u}w} \mathbb{E}[c(\boldsymbol{u}w)] \log p(w \mid \boldsymbol{u})$
  – to obtain $p_{\mathrm{MLE}}(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[c(\boldsymbol{u}w)]}{\mathbb{E}[c(\boldsymbol{u}\,\cdot)]}$

Page 7:

Expected Kneser-Ney

โ€ข ๐‘ ๐’–๐‘ค =

max 0, ๐‘ ๐’–๐‘ค โˆ’ ๐ท + ๐‘1+ ๐’– โ‹… ๐ท๐‘โ€ฒ(๐‘ค|๐’–โ€ฒ)

โ€ข So, ๐”ผ ๐‘ ๐’–๐‘ค = ๐”ผ ๐‘ ๐’–๐‘ค โˆ’ ๐‘ ๐‘ ๐’–๐‘ค > 0 ๐ท +

๐”ผ ๐‘1+ ๐’– โ‹… ๐ท๐‘โ€ฒ(๐‘ค|๐’–โ€ฒ)

โ€“ where ๐‘โ€ฒ ๐‘ค ๐’–โ€ฒ = ๐”ผ ๐‘1+ โ‹…๐’–โ€ฒ๐‘ค

๐”ผ ๐‘1+ โ‹…๐’–โ€ฒโ‹…

โ€ข then ๐‘ ๐‘ค ๐’– =๐”ผ ๐‘ ๐’–๐‘ค

๐”ผ ๐‘ ๐’–โ‹…

Page 8:

Language model adaptation

• Our corpus consists of
  – a large amount of general-domain data and
  – a small amount of domain-specific data
• Sentence $\boldsymbol{w}$'s weight:

$$p(\boldsymbol{w} \text{ is in-domain}) = \frac{1}{1 + \exp\!\left(-H(\boldsymbol{w})\right)}, \qquad H(\boldsymbol{w}) = \frac{\log p_{\mathrm{in}}(\boldsymbol{w}) - \log p_{\mathrm{out}}(\boldsymbol{w})}{|\boldsymbol{w}|}$$

  – $p_{\mathrm{in}}$: language model of the in-domain data; $p_{\mathrm{out}}$: that of the general-domain (out-of-domain) data

Page 9:

• Figure 1: On the language model adaptation task, expected KN outperforms all other methods across all sizes of selected subsets. Integral KN is applied to unweighted instances, while fractional WB, fractional KN and expected KN are applied to weighted instances. (via [Zhang+ ACL2014])

Figure annotations: the selected subsets are drawn from the general-domain data; in-domain data: 54k for training, 3k for testing.


Why isn't Modified KN included as a baseline?

Page 10:

[Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing

• Higher-order n-grams are very sparse
  – especially noticeable on small data (e.g. domain-specific data!)
• Improves performance on small data by combining skipped n-grams with modified KN smoothing
  – perplexity is reduced by 25.7% for a very small training set of only 736 KB of text

Page 11:

"Generalized Language Models"

โ€ข ๐œ•3๐‘ค1๐‘ค2๐‘ค3๐‘ค4 = ๐‘ค1๐‘ค2_๐‘ค4

โ€“ โ€œ_โ€ means a word placeholder

๐‘ƒGLM ๐‘ค๐‘– ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1 =

๐‘ ๐‘ค๐‘–โˆ’๐‘›+1๐‘– โˆ’ ๐ท ๐‘ ๐‘ค๐‘–โˆ’๐‘›+1

๐‘–

๐‘ ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

+๐›พhigh ๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

1

๐‘› โˆ’ 1๐‘ƒ GLM

๐‘›โˆ’1

๐‘—=1

๐‘ค๐‘– ๐œ•๐‘—๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

๐‘ƒ GLM ๐‘ค๐‘– ๐œ•๐‘—๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1 =

๐‘1+ ๐œ•๐‘—๐‘ค๐‘–โˆ’๐‘›๐‘– โˆ’ ๐ท ๐‘ ๐œ•๐‘—๐‘ค๐‘–โˆ’๐‘›+1

๐‘–

๐‘1+ ๐œ•๐‘—๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1 โˆ—

+๐›พmid ๐œ•๐‘—๐‘ค๐‘–โˆ’๐‘›+1๐‘–โˆ’1

1

๐‘› โˆ’ 2๐‘ƒ GLM ๐‘ค๐‘– ๐œ•๐‘—๐œ•๐‘˜๐‘ค๐‘–โˆ’๐‘›+1

๐‘–โˆ’1

๐‘›โˆ’1

๐‘˜=1,๐‘˜โ‰ ๐‘—

Page 12:

• The bold arrows correspond to interpolation of models in traditional modified Kneser-Ney smoothing. The lighter arrows illustrate the additional interpolations introduced by our generalized language models. (via [Pickhardt+ ACL2014])

Page 13:

• Shrunk training data sets of the English Wikipedia
  – i.e. small domain-specific data

Page 14:

Space Complexity

model size = 9.5 GB, # of entries = 427M

model size = 15 GB, # of entries = 742M

Page 15:

References

• [Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count

• [Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing

• [Kneser+ 1995] Improved Backing-off for M-gram Language Modeling

• [Chen+ 1999] An Empirical Study of Smoothing Techniques for Language Modeling