
A Bit of Progress in Language Modeling Extended Version

Presented by Louis-Tsai

Speech Lab, CSIE, NTNU

[email protected]

Introduction: Overview

• LM is the art of determining the probability of a sequence of words
– Speech recognition, optical character recognition, handwriting recognition, machine translation, spelling correction

• Improvements
– Higher-order n-grams

– Skipping models

– Clustering

– Caching

– Sentence-mixture models

Introduction: Technique Introductions

• The goal of an LM is to determine the probability of a word sequence w_1 ... w_n, P(w_1 ... w_n)

• Trigram assumption

P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_i | w_1 ... w_{i-1}) ... P(w_n | w_1 ... w_{n-1})

Trigram assumption: P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-2} w_{i-1})

Introduction: Technique Introductions

• C(w_{i-2} w_{i-1} w_i) represents the number of occurrences of w_{i-2} w_{i-1} w_i in the training corpus, and similarly for C(w_{i-2} w_{i-1})

• There are many three-word sequences that never occur. Consider the sequence "party on Tuesday": what is P(Tuesday | party on)?

P(w_i | w_{i-2} w_{i-1}) ≈ C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})
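As a concrete illustration of this count-based estimate, here is a minimal Python sketch (not part of the original slides; the toy corpus and function name are invented for illustration):

```python
from collections import Counter

def mle_trigram_prob(tokens, w2, w1, w):
    """Maximum-likelihood estimate P(w | w2 w1) = C(w2 w1 w) / C(w2 w1)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w2, w1)] == 0:
        return 0.0  # context never seen; this is where smoothing becomes necessary
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

tokens = "party on stan chen 's birthday party on friday".split()
print(mle_trigram_prob(tokens, "party", "on", "stan"))     # 0.5
print(mle_trigram_prob(tokens, "party", "on", "tuesday"))  # 0.0, motivating smoothing
```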

Introduction: Smoothing

• The training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on"; then P(Tuesday | party on) = 0

• Smoothing techniques take some probability away from some occurrences

• Imagine we have "party on Stan Chen's birthday" in the training data, and it occurs only once

P(Stan | party on) = C(party on Stan) / C(party on) = 1 / 20

Introduction: Smoothing

• By taking some probability away from some words, such as “Stan” and redistributing it to other words, such as “Tuesday”, zero probabilities can be avoided

• Katz smoothing
• Jelinek-Mercer smoothing (deleted interpolation)
• Kneser-Ney smoothing

Introduction: Higher-order n-grams

• The most obvious extension to trigram models is to simply move to higher-order n-grams, such as four-grams and five-grams

• There is a significant interaction between smoothing and n-gram order: higher-order n-grams work better with Kneser-Ney smoothing than with some other methods, especially Katz smoothing

Introduction: Skipping

• We condition on a different context than the previous two words

• Instead of computing P(w_i | w_{i-2} w_{i-1}), we compute, for example, P(w_i | w_{i-3} w_{i-2})

Introduction: Clustering

• Clustering (classing) models attempt to make use of the similarities between words

• If we have seen occurrences of phrases like “party on Monday” and “party on Wednesday” then we might imagine that the word “Tuesday” is also likely to follow the phrase “party on”

Introduction: Caching

• Caching models make use of the observation that if you use a word, you are likely to use it again

Introduction: Sentence Mixture

• Sentence Mixture models make use of the observation that there are many different sentence types, and that making models for each type of sentence may be better than using one global model

Introduction: Evaluation

• An LM that assigned equal probability to 100 words would have perplexity 100

Entropy = -(1/N) Σ_{i=1}^{N} log_2 P(w_i | w_1 ... w_{i-1})

perplexity = 2^{Entropy}

For 100 equally likely words:

Entropy = -Σ_{i=1}^{100} p(w_i) log_2 p(w_i) = Σ_{i=1}^{100} (1/100) log_2 100 = log_2 100

So perplexity = 2^{log_2 100} = 100
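A minimal sketch of this computation (our own illustration, not from the slides); it reproduces the perplexity-100 example from a list of per-word conditional probabilities:

```python
import math

def entropy_and_perplexity(word_probs):
    """word_probs[i] = P(w_i | w_1 ... w_{i-1}) for each word of the test data."""
    n = len(word_probs)
    entropy = -sum(math.log2(p) for p in word_probs) / n   # bits per word
    return entropy, 2 ** entropy

# 100 test words, each predicted with probability 1/100:
ent, ppl = entropy_and_perplexity([1 / 100] * 100)
print(ent, ppl)   # about 6.64 bits, perplexity 100.0
```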

Introduction: Evaluation

• In general, the perplexity of a LM is equal to the geometric average of the inverse probability of the words measured on test data:

perplexity = [ Π_{i=1}^{N} P(w_i | w_1 ... w_{i-1}) ]^{-1/N}
           = 2^{ log_2 [ Π_{i=1}^{N} P(w_i | w_1 ... w_{i-1}) ]^{-1/N} }
           = 2^{ -(1/N) Σ_{i=1}^{N} log_2 P(w_i | w_1 ... w_{i-1}) }
           = 2^{Entropy}

(using the identities (x^a)^b = x^{ab}, x^{-a} = 1/x^a, and x^a x^b x^c = x^{a+b+c})

Introduction: Evaluation

• The "true" model for any data source will have the lowest possible perplexity

• The lower the perplexity of our model, the closer it is, in some sense, to the true model

• Entropy, which is simply log2 of perplexity

• Entropy is the average number of bits per word that would be necessary to encode the test data using an optimal coder

Introduction: Evaluation

• Reducing entropy from 5 bits to 4 bits reduces perplexity from 32 to 16, a 50% reduction
• Reducing entropy from 5 bits to 4.5 bits reduces perplexity from 32 to 32 · 2^{-0.5} ≈ 22.6, a 29.3% reduction

entropy reduction (bits):   .01    .1    .16   .2    .3    .4    .5    .75   1
perplexity reduction:       0.69%  6.7%  10%   13%   19%   24%   29%   41%   50%

Introduction: Evaluation

• Experiment corpus: 1996 NAB
• Experiments performed at 4 different training data sizes: 100K words, 1M words, 10M words, 284M words
• Heldout and test data taken from the 1994 WSJ
– Heldout data: 20K words
– Test data: 20K words
• Vocabulary: 58,546 words

Smoothing: Simple Interpolation

• Simple interpolation combines the trigram, bigram, and unigram estimates:

P_interpolate(w_i | w_{i-2} w_{i-1}) = λ P_trigram(w_i | w_{i-2} w_{i-1}) + (1 - λ)[μ P_bigram(w_i | w_{i-1}) + (1 - μ) P_unigram(w_i)]

where 0 ≤ λ, μ ≤ 1

• In practice, the uniform distribution P_uniform(w) = 1 / (size of vocabulary) is also interpolated; this ensures that no word is assigned probability 0
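A minimal sketch of simple interpolation, assuming the component probabilities have already been estimated; the extra weight on the uniform distribution (nu) and all numeric values are illustrative assumptions, not values from the slides:

```python
def interpolated_trigram(p_tri, p_bi, p_uni, p_uniform, lam, mu, nu=0.99):
    """Interpolate trigram, bigram, unigram and uniform estimates.
    lam, mu, nu lie in [0, 1]; in practice they are tuned on heldout data."""
    lower = mu * p_bi + (1 - mu) * (nu * p_uni + (1 - nu) * p_uniform)
    return lam * p_tri + (1 - lam) * lower

vocab_size = 58546   # vocabulary size used in the experiments
p = interpolated_trigram(p_tri=0.0, p_bi=0.01, p_uni=0.001,
                         p_uniform=1 / vocab_size, lam=0.6, mu=0.7)
print(p)   # nonzero even though the trigram estimate itself is 0
```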

Smoothing: Katz Smoothing

• Katz smoothing is based on the Good-Turing formula

• Let n_r represent the number of n-grams that occur exactly r times

• Discount: disc(r) = (r + 1) n_{r+1} / n_r

P_Katz(w_i | w_{i-n+1} ... w_{i-1}) =
  disc(C(w_{i-n+1} ... w_i)) / C(w_{i-n+1} ... w_{i-1})              if C(w_{i-n+1} ... w_i) > 0
  α(w_{i-n+1} ... w_{i-1}) · P_Katz(w_i | w_{i-n+2} ... w_{i-1})      otherwise

where

α(w_{i-n+1} ... w_{i-1}) = [1 - Σ_{w_i : C(w_{i-n+1} ... w_i) > 0} P_Katz(w_i | w_{i-n+1} ... w_{i-1})] / [1 - Σ_{w_i : C(w_{i-n+1} ... w_i) > 0} P_Katz(w_i | w_{i-n+2} ... w_{i-1})]

Smoothing: Katz Smoothing

• Let N represent the total size of the training set; this left-over probability will be equal to n_1 / N

r     n_r     disc(r) = (r+1) n_{r+1} / n_r     mass removed: n_r · (r - disc(r))
1     n_1     2 n_2 / n_1                        1·n_1 - 2·n_2
2     n_2     3 n_3 / n_2                        2·n_2 - 3·n_3
3     n_3     4 n_4 / n_3                        3·n_3 - 4·n_4
...   ...     ...                                ...
r     n_r     (r+1) n_{r+1} / n_r                r·n_r - (r+1)·n_{r+1}

The removed mass telescopes; with (r+1) n_{r+1} = 0 for the largest observed count r, the sum is n_1
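The telescoping sum can be checked numerically; this small sketch (with made-up counts-of-counts) computes the Good-Turing discounts disc(r) = (r+1) n_{r+1} / n_r and verifies that the total mass removed equals n_1:

```python
def good_turing_discounts(n):
    """n[r] = number of n-grams occurring exactly r times (0 beyond the largest count).
    Returns disc(r) = (r + 1) * n[r+1] / n[r] for every r with n[r] > 0."""
    max_r = max(r for r, c in n.items() if c > 0)
    return {r: (r + 1) * n.get(r + 1, 0) / n[r] for r in range(1, max_r + 1)}

counts_of_counts = {1: 100, 2: 40, 3: 20, 4: 8, 5: 0}   # toy values
disc = good_turing_discounts(counts_of_counts)
removed = sum(counts_of_counts[r] * (r - disc[r]) for r in disc)
print(disc)
print(removed, "==", counts_of_counts[1])   # total removed mass equals n_1
```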

Smoothing: Katz Smoothing

• Consider a bigram model of a phrase such as Pkatz(Francisco | on).

Since the phrase San Francisco is fairly common, the unigram probability of Francisco will also be fairly high.

• This means that, using Katz smoothing, P_Katz(Francisco | on) will also be fairly high. But the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low

P_Katz(Francisco | on) =
  disc(C(on Francisco)) / C(on)      if C(on Francisco) > 0
  α(on) · P_Katz(Francisco)          otherwise

Smoothing: Kneser-Ney Smoothing

• KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability PKN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, PKN(Tuesday | on) would be relatively high, even if the phrase on Tuesday did not occur in the training data

Smoothing: Kneser-Ney Smoothing

• Backoff Kneser-Ney smoothing

where |{v | C(v w_i) > 0}| is the number of distinct words v that can precede w_i (the number of contexts in which w_i occurs), D is the discount, and α is a normalization constant such that the probabilities sum to 1

P_BKN(w_i | w_{i-1}) =
  [C(w_{i-1} w_i) - D] / C(w_{i-1})                                  if C(w_{i-1} w_i) > 0
  α(w_{i-1}) · |{v | C(v w_i) > 0}| / Σ_w |{v | C(v w) > 0}|          otherwise

where

α(w_{i-1}) = [1 - Σ_{w_i : C(w_{i-1} w_i) > 0} (C(w_{i-1} w_i) - D) / C(w_{i-1})] / [1 - Σ_{w_i : C(w_{i-1} w_i) > 0} |{v | C(v w_i) > 0}| / Σ_w |{v | C(v w) > 0}|]

Smoothing: Kneser-Ney Smoothing

• Toy example (the original slide shows a figure of the bigrams of a small training text over the vocabulary V = {a, b, c, d}): the word a is preceded by 2 distinct words, b by 3, c by 4, and d by 1, so with

P_KN(w) = |{v | C(v w) > 0}| / Σ_w |{v | C(v w) > 0}|

we get P_KN(a) = 2/10, P_KN(b) = 3/10, P_KN(c) = 4/10, P_KN(d) = 1/10

• Since C(d a) = 0, the bigram estimate backs off:

P_BKN(a | d) = α(d) · P_KN(a) = α(d) · 2/10,   where α(d) = [1 - (C(d c) - D) / C(d)] / [1 - 4/10] (c being the only word observed after d in this example)

Smoothing: Kneser-Ney Smoothing

• Interpolated models always combine both the higher-order and the lower-order distribution

• Interpolated Kneser-Ney smoothing

where λ(w_{i-1}) is a normalization constant such that the probabilities sum to 1

P_IKN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) - D, 0) / C(w_{i-1}) + λ(w_{i-1}) · |{v | C(v w_i) > 0}| / Σ_w |{v | C(v w) > 0}|

where

λ(w_{i-1}) = 1 - Σ_{w_i : C(w_{i-1} w_i) > 0} (C(w_{i-1} w_i) - D) / C(w_{i-1})
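Below is a minimal bigram sketch of interpolated Kneser-Ney (our own simplification: a single discount D, a toy corpus, no handling of words absent from training); it uses the standard identity that the normalization constant equals D times the number of distinct words following the context, divided by the context count:

```python
from collections import Counter, defaultdict

def interpolated_kneser_ney(tokens, D=0.75):
    """Build an interpolated Kneser-Ney bigram model P_IKN(w | w_prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter(tokens[:-1])                 # C(w_prev)
    followers = defaultdict(set)                         # distinct words seen after each context
    preceders = defaultdict(set)                         # distinct contexts seen before each word
    for a, b in bigrams:
        followers[a].add(b)
        preceders[b].add(a)
    total_types = sum(len(v) for v in preceders.values())   # sum_w |{v : C(v w) > 0}|

    def prob(w, w_prev):
        continuation = len(preceders[w]) / total_types       # lower-order KN distribution
        if context_count[w_prev] == 0:
            return continuation                               # unseen context: continuation only
        discounted = max(bigrams[(w_prev, w)] - D, 0) / context_count[w_prev]
        lam = D * len(followers[w_prev]) / context_count[w_prev]
        return discounted + lam * continuation

    return prob

tokens = "san francisco san francisco san francisco party on monday lunch on monday meet monday".split()
p = interpolated_kneser_ney(tokens)
# "francisco" and "monday" each occur 3 times, but "monday" appears after more distinct
# words, so it gets the higher backed-off probability in a new context:
print(p("francisco", "party"), p("monday", "party"))
```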

Smoothing: Kneser-Ney Smoothing

• Multiple discounts: one for one-counts, another for two-counts, and another for three or more counts. But this has too many parameters

• Modified Kneser-Ney smoothing

P_IKN(w_i | w_{i-2} w_{i-1}) = [C(w_{i-2} w_{i-1} w_i) - D_3] / C(w_{i-2} w_{i-1}) + λ(w_{i-2} w_{i-1}) · P_ikn-mod-bigram(w_i | w_{i-1})

P_ikn-mod-bigram(w_i | w_{i-1}) = [|{v | C(v w_{i-1} w_i) > 0}| - D_2] / Σ_w |{v | C(v w_{i-1} w) > 0}| + λ(w_{i-1}) · P_ikn-mod-unigram(w_i)

P_ikn-mod-unigram(w_i) = [|{v | C(v w_i) > 0}| - D_1] / Σ_w |{v | C(v w) > 0}| + λ · 1 / |V|

Smoothing: Jelinek-Mercer Smoothing

• Combines different N-gram orders by linearly interpolating all three models whenever computing trigram

P̂(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n)

Smoothing: Absolute Discounting

• Absolute discounting subtracts a fixed discount D ≤ 1 from each nonzero count

P_abs(w_i | w_{i-n+1} ... w_{i-1}) = max(C(w_{i-n+1} ... w_i) - D, 0) / Σ_{w_i} C(w_{i-n+1} ... w_i) + (1 - λ_{w_{i-n+1} ... w_{i-1}}) · P_abs(w_i | w_{i-n+2} ... w_{i-1})

Witten-Bell Discounting

• Key Concept—Things Seen Once: Use the count of things you’ve seen once to help estimate the count of things you’ve never seen

• So we estimate the total probability mass of all the zero N-grams with the number of types divided by the number of tokens plus observed types:

Σ_{i : c_i = 0} p_i* = T / (N + T)

N : the number of tokens
T : observed types

Witten-Bell Discounting

• T/(N+T) gives the total “probability of unseen N-grams”, we need to divide this up among all the zero N-grams

• We could just choose to divide it equally

Z = Σ_{i : c_i = 0} 1

p_i* = T / (Z (N + T))

Z is the total number of N-grams with count zero

Witten-Bell Discounting

p_i* = c_i / (N + T)    if c_i > 0

Alternatively, we can represent the smoothed counts directly as:

c_i* = (T / Z) · N / (N + T)    if c_i = 0
c_i* = c_i · N / (N + T)        if c_i > 0

Witten-Bell Discounting

Σ_i c_i* = Σ_{i : c_i = 0} (T / Z) · N / (N + T) + Σ_{i : c_i > 0} c_i · N / (N + T)
         = Z · (T / Z) · N / (N + T) + N · N / (N + T)
         = (T N + N^2) / (N + T)
         = N

Witten-Bell Discounting

• For bigrams

T(w_x) : the number of bigram types beginning with w_x; N(w_x) : the number of bigram tokens beginning with w_x

p*(w_i | w_x) = c(w_x w_i) / (N(w_x) + T(w_x))            if c(w_x w_i) > 0

p*(w_i | w_x) = T(w_x) / [Z(w_x) (N(w_x) + T(w_x))]        if c(w_x w_i) = 0

where Z(w_x) = Σ_{i : c(w_x w_i) = 0} 1
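A minimal sketch of the bigram Witten-Bell estimate above (the toy corpus and vocabulary are invented); note that the resulting distribution sums to 1 over the vocabulary for each context:

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens, vocab):
    """Witten-Bell smoothed bigram P(w | w_prev) over a fixed vocabulary."""
    followers = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        followers[a][b] += 1

    def prob(w, w_prev):
        T = len(followers[w_prev])              # T(w_prev): observed types after w_prev
        N = sum(followers[w_prev].values())     # N(w_prev): tokens after w_prev
        Z = len(vocab) - T                      # Z(w_prev): vocabulary words never seen after w_prev
        c = followers[w_prev][w]
        if c > 0:
            return c / (N + T)
        return T / (Z * (N + T))                # unseen mass T/(N+T) split equally over Z words

    return prob

vocab = ["party", "on", "monday", "wednesday", "tuesday"]
p = witten_bell_bigram("party on monday party on wednesday".split(), vocab)
print(p("monday", "on"), p("tuesday", "on"))        # 0.25, ~0.167
print(sum(p(w, "on") for w in vocab))               # 1.0
```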


Higher-order n-grams

• Trigram: P(w_i | w_{i-2} w_{i-1}); five-gram: P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1})

• In many cases, no sequence of the form w_{i-4} w_{i-3} w_{i-2} w_{i-1} will have been seen in the training data, so we back off to or interpolate with four-grams, trigrams, bigrams, or even unigrams

• But in those cases where such a long sequence has been seen, it may be a good predictor of wi


Higher-order n-grams

• As we can see, the behavior for Katz smoothing is very different from the behavior for KN smoothing; the main cause of this difference is the use of backoff smoothing techniques, such as Katz smoothing, or even the backoff version of KN smoothing

• Backoff smoothing techniques work poorly on low counts, especially one-counts, and as the n-gram order increases, the number of one-counts increases

Higher-order n-grams

• Katz smoothing has its best performance around the trigram level, and actually gets worse as this level is exceeded

• KN smoothing is essentially monotonic even through 20-grams

• The plateau point for KN smoothing depends on the amount of training data available: small (100,000 words) plateaus at the trigram level; full (284 million words) at 5- to 7-grams (the 6-gram is .02 bits better than the 5-gram, and the 7-gram .01 bits better than the 6-gram)

Skipping

• When considering a 5-gram context, there are many subsets of the 5-gram we could consider, such as P(wi|wi-4wi-3wi-1) or P(wi|wi-4wi-2wi-1)

• If we have never seen "Show John a good time" but we have seen "Show Stan a good time", a normal 5-gram predicting P(time | show John a good) would back off to P(time | John a good) and from there to P(time | a good), which would have a relatively low probability

• A skipping model of the form P(w_i | w_{i-4} w_{i-2} w_{i-1}) would assign high probability to P(time | show ____ a good)

Skipping

• These skipping 5-grams are then interpolated with a normal 5-gram, forming models such as

λ P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + μ P(w_i | w_{i-4} w_{i-3} w_{i-1}) + (1 - λ - μ) P(w_i | w_{i-4} w_{i-2} w_{i-1})

where 0 ≤ λ ≤ 1, 0 ≤ μ ≤ 1, and 0 ≤ (1 - λ - μ) ≤ 1

• Another (and more traditional) use for skipping is as a sort of poor man's higher-order n-gram. One can, for instance, create a model of the form

λ P(w_i | w_{i-2} w_{i-1}) + μ P(w_i | w_{i-3} w_{i-1}) + (1 - λ - μ) P(w_i | w_{i-3} w_{i-2})

no component probability depends on more than two previous words, but the overall probability is 4-gram-like, since it depends on w_{i-3}, w_{i-2}, and w_{i-1} (see the sketch below)
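A minimal sketch of the second ("poor man's 4-gram") form, assuming the three component probabilities have already been computed; the weights are illustrative:

```python
def skipping_trigram(p_xy, p_wy, p_wx, lam=0.5, mu=0.3):
    """Interpolate three two-word contexts drawn from w_{i-3} w_{i-2} w_{i-1}:
    p_xy = P(w_i | w_{i-2} w_{i-1}), p_wy = P(w_i | w_{i-3} w_{i-1}), p_wx = P(w_i | w_{i-3} w_{i-2})."""
    assert 0 <= lam <= 1 and 0 <= mu <= 1 and lam + mu <= 1
    return lam * p_xy + mu * p_wy + (1 - lam - mu) * p_wx

# Each component conditions on only two words, but together they use all of w_{i-3..i-1}:
print(skipping_trigram(p_xy=0.02, p_wy=0.05, p_wx=0.01))
```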

Skipping

• For the 5-gram skipping experiments, all contexts depended on at most the previous four words, w_{i-4}, w_{i-3}, w_{i-2}, and w_{i-1}, but used the four words in a variety of ways

• For readability and conciseness, we define v = wi-4, w = wi-3, x = wi-2, y = wi-1

Skipping

• The first model interpolated dependencies on vw_y and v_xy; it does not work well on the smallest training data size, but is competitive for larger ones

• In the second model, we add vwx_ to the first model, gaining roughly .02 to .04 bits over the first model

• Next, adding back in the dependencies on the missing words: xvwy, wvxy, and yvwx; that is, all models depended on the same variables, but with the interpolation order modified
– e.g., by xvwy, we refer to a model of the form P(z|vwxy) interpolated with P(z|vw_y) interpolated with P(z|w_y) interpolated with P(z|y) interpolated with P(z)

Skipping

• Interpolating together vwyx, vxyw, wxyv (based on vwxy): this model puts each of the four preceding words in the last position for one component. This model does not work as well as the previous two, leading us to conclude that the y word is by far the most important

Skipping

• Interpolating together vwyx, vywx, yvwx, which puts the y word in each possible position in the backoff model: this was overall the worst model, reconfirming the intuition that the y word is critical

• Finally, we interpolated together vwyx, vxyw, wxyv, vywx, yvwx, xvwy, and wvxy; the result is a marginal gain, less than 0.01 bits, over the best previous model

Skipping

• 1-back word (y): xy, wy, vy, uy and ty
• 4-gram level: xy, wy and wx
• The improvement over 4-gram pairs was still marginal

Clustering

• Consider a probability such as P(Tuesday | party on)
• Perhaps the training data contains no instances of the phrase "party on Tuesday", although other phrases such as "party on Wednesday" and "party on Friday" do appear

• We can put words into classes, such as the word “Tuesday” into the class WEEKDAY

• P(Tuesday | party on WEEKDAY)

Clustering

• When each word belongs to only one class, which is called hard clustering, this decomposition is a strict equality, a fact that can be trivially proven. Let W_i represent the cluster of word w_i

P(Tuesday | party on) = P(WEEKDAY | party on) · P(Tuesday | party on WEEKDAY)

P(W_i | w_{i-2} w_{i-1}) · P(w_i | w_{i-2} w_{i-1} W_i)
  = [P(w_{i-2} w_{i-1} W_i) / P(w_{i-2} w_{i-1})] · [P(w_{i-2} w_{i-1} W_i w_i) / P(w_{i-2} w_{i-1} W_i)]
  = P(w_{i-2} w_{i-1} W_i w_i) / P(w_{i-2} w_{i-1})      (1)

Clustering

• Since each word belongs to a single cluster, P(Wi|wi) = 1

P(w_{i-2} w_{i-1} W_i w_i) = P(w_{i-2} w_{i-1} w_i) · P(W_i | w_{i-2} w_{i-1} w_i)
                           = P(w_{i-2} w_{i-1} w_i) · P(W_i | w_i)
                           = P(w_{i-2} w_{i-1} w_i)      (2)

Substituting (2) into (1):

P(W_i | w_{i-2} w_{i-1}) · P(w_i | w_{i-2} w_{i-1} W_i) = P(w_{i-2} w_{i-1} w_i) / P(w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})      (3) predictive clustering

Clustering

• Another type of clustering we can do is to cluster the words in the contexts. For instance, if "party" is in the class EVENT and "on" is in the class PREPOSITION, then we could write

P(Tuesday | party on) ≈ P(Tuesday | EVENT PREPOSITION)

or more generally

P(w_i | w_{i-2} w_{i-1}) ≈ P(w_i | W_{i-2} W_{i-1})      (4)

Combining (4) with (3) we get

P(w_i | w_{i-2} w_{i-1}) ≈ P(W_i | W_{i-2} W_{i-1}) · P(w_i | W_{i-2} W_{i-1} W_i)      (5) fullibm clustering

Clustering

• Use the approximation P(w | W_{i-2} W_{i-1} W) = P(w | W) to get

Because fullibm clustering uses more information than ibm clustering, we assumed that it would lead to improvements (goodibm)

P(w_i | w_{i-2} w_{i-1}) ≈ P(W_i | W_{i-2} W_{i-1}) · P(w_i | W_i)      (6) ibm clustering
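A minimal sketch of the ibm-clustered estimate in Eq. (6), using maximum-likelihood estimates for both factors (a real system would smooth each one); the toy word-to-class map and corpus are invented:

```python
from collections import Counter

def ibm_cluster_trigram(tokens, word2class):
    """IBM clustering (Eq. 6): P(w | w-2 w-1) ~ P(W | W-2 W-1) * P(w | W), with hard classes."""
    classes = [word2class[w] for w in tokens]
    class_tri = Counter(zip(classes, classes[1:], classes[2:]))
    class_bi = Counter(zip(classes, classes[1:]))
    class_count = Counter(classes)
    word_count = Counter(tokens)

    def prob(w, w2, w1):
        C2, C1, C = word2class[w2], word2class[w1], word2class[w]
        if class_bi[(C2, C1)] == 0 or class_count[C] == 0:
            return 0.0   # a real system would smooth both factors
        p_class = class_tri[(C2, C1, C)] / class_bi[(C2, C1)]
        p_word = word_count[w] / class_count[C]
        return p_class * p_word

    return prob

word2class = {"party": "EVENT", "dinner": "EVENT", "lunch": "EVENT",
              "on": "PREP", "by": "PREP",
              "monday": "WEEKDAY", "tuesday": "WEEKDAY", "wednesday": "WEEKDAY"}
p = ibm_cluster_trigram("party on monday dinner on wednesday lunch by tuesday".split(), word2class)
# Nonzero although neither "party on tuesday" nor even "on tuesday" occurs in the training text:
print(p("tuesday", "party", "on"))
```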

Clustering

• Backoff / interpolation goes from P(Tuesday | party EVENT on PREPOSITION) to P(Tuesday | EVENT on PREPOSITION) to P(Tuesday | on PREPOSITION) to P(Tuesday | PREPOSITION) to P(Tuesday); since each word belongs to a single cluster, some of these variables are redundant

P_index(Tuesday | party on) ≡ P(Tuesday | party EVENT on PREPOSITION)      (7) index clustering

Clustering

• C(party EVENT on PREPOSITION) = C(party on), C(EVENT on PREPOSITION) = C(EVENT on)

• We generally write an index clustered model as

P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1})

P_fullibmpredict(w_i | w_{i-2} w_{i-1}) = [λ P(W_i | w_{i-2} w_{i-1}) + (1 - λ) P(W_i | W_{i-2} W_{i-1})] · [λ P(w_i | w_{i-2} w_{i-1} W_i) + (1 - λ) P(w_i | W_{i-2} W_{i-1} W_i)]      fullibmpredict clustering

Clustering

• indexpredict, combining index and predictive

• combinepredict, interpolating a normal trigram with a predictive clustered trigram

P_indexpredict(w_i | w_{i-2} w_{i-1}) = P(W_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1}) · P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1} W_i)

P_combinepredict(w_i | w_{i-2} w_{i-1}) = λ P(w_i | w_{i-2} w_{i-1}) + (1 - λ) P(W_i | w_{i-2} w_{i-1}) · P(w_i | w_{i-2} w_{i-1} W_i)

Clustering

• allcombinenotop, which is an interpolation of a normal trigram, a fullibm-like model, an index model, a predictive model, a true fullibm model, and an indexpredict model

P_allcombinenotop(w_i | w_{i-2} w_{i-1}) =
    λ P(w_i | w_{i-2} w_{i-1})                                                                               (normal trigram)
  + μ P(w_i | W_{i-2} W_{i-1})                                                                               (fullibm-like)
  + ν P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1})                                                               (index model)
  + α P(W_i | w_{i-2} w_{i-1}) · P(w_i | w_{i-2} w_{i-1} W_i)                                                (predictive)
  + β P(W_i | W_{i-2} W_{i-1}) · P(w_i | W_{i-2} W_{i-1} W_i)                                                (true fullibm)
  + (1 - λ - μ - ν - α - β) P(W_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1}) · P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1} W_i)      (indexpredict)

Clustering

• allcombine, interpolates the predict-type models first at the cluster level, before interpolating with the word level model

P_allcombine(w_i | w_{i-2} w_{i-1}) =
    λ P(w_i | w_{i-2} w_{i-1})                                                                               (normal trigram)
  + μ P(w_i | W_{i-2} W_{i-1})                                                                               (fullibm-like)
  + ν P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1})                                                               (index model)
  + (1 - λ - μ - ν) · [α P(W_i | w_{i-2} w_{i-1}) + β P(W_i | W_{i-2} W_{i-1}) + (1 - α - β) P(W_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1})]
                    · [γ P(w_i | w_{i-2} w_{i-1} W_i) + δ P(w_i | W_{i-2} W_{i-1} W_i) + (1 - γ - δ) P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1} W_i)]
    (predictive, true fullibm, and indexpredict components, interpolated at the cluster level)


Clustering

• The value of clustering decreases as training data increases, since clustering is a technique for dealing with data sparseness

• ibm clustering consistently works very well

Clustering

• In Fig. 6 we show a comparison of several techniques using Katz smoothing and the same techniques with KN smoothing. The results are similar, with some interesting exceptions:

• Indexpredict works well for the KN smoothing model, but very poorly for the Katz smoothed model.

• This shows that smoothing can have a significant effect on other techniques, such as clustering

Other ways to perform Clustering

• Cluster groups of words instead of individual words: instead of computing P(w_i | wordcluster(w_{i-2}) wordcluster(w_{i-1})), one could compute P(w_i | contextcluster(w_{i-2} w_{i-1}))

• For instance, in a trigram model, one could cluster contexts like "New York" and "Los Angeles" as "CITY", and "on Wednesday" and "late tomorrow" as "TIME"

Finding Clusters

• There is no need for the clusters used for different positions to be the same

• ibm clustering: P(w_i | W_i) · P(W_i | W_{i-2} W_{i-1}); W_i is the predictive cluster, W_{i-1} and W_{i-2} are the conditional clusters

• The predictive and conditional clusters can be different, consider words a and an, in general, a and an can follow the same words, and so, for predictive clustering, belong in the same cluster. But, there are very few words that can follow both a and an – so for conditional clustering, they belong in different clusters

Finding Clusters

• The clusters are found automatically using a tool that attempts to minimize perplexity

• For the conditional clusters, we try to minimize the perplexity of training data for a bigram of the form P(wi|Wi-1), which is equivalent to maximizing

Π_{i=1}^{N} P(w_i | W_{i-1})
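A minimal sketch of evaluating this objective for a candidate conditional clustering (maximum-likelihood estimates on a toy corpus; the actual tool searches over clusterings to maximize this quantity):

```python
from collections import Counter
import math

def conditional_cluster_objective(tokens, word2class):
    """log2 of prod_i P(w_i | W_{i-1}) under ML estimates, for a candidate hard clustering."""
    pairs = [(word2class[a], b) for a, b in zip(tokens, tokens[1:])]
    pair_count = Counter(pairs)
    class_count = Counter(c for c, _ in pairs)
    return sum(math.log2(pair_count[(c, w)] / class_count[c]) for c, w in pairs)

tokens = "party on monday dinner on tuesday".split()
clustering = {"party": "EVENT", "dinner": "EVENT", "on": "PREP",
              "monday": "WEEKDAY", "tuesday": "WEEKDAY"}
print(conditional_cluster_objective(tokens, clustering))
```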

Finding Clusters

• For the predictive clusters, we try to minimize the perplexity of training data of P(Wi|wi-1)*P(wi|Wi)

Π_{i=1}^{N} P(W_i | w_{i-1}) · P(w_i | W_i)
  = Π_{i=1}^{N} P(W_i | w_{i-1}) · P(W_i w_i) / P(W_i)
  = Π_{i=1}^{N} P(W_i | w_{i-1}) · P(w_i) / P(W_i)
  = Π_{i=1}^{N} [P(w_{i-1} W_i) / P(w_{i-1})] · P(w_i) / P(W_i)
  = Π_{i=1}^{N} P(w_{i-1} | W_i) · P(w_i) / P(w_{i-1})

using P(W_i w_i) = P(W_i | w_i) P(w_i) = P(w_i) (since P(W_i | w_i) = 1) and P(w_{i-1} W_i) = P(w_{i-1} | W_i) P(W_i); the factors P(w_i) / P(w_{i-1}) do not depend on the clustering, so this is equivalent to maximizing Π_{i=1}^{N} P(w_{i-1} | W_i)

Caching

• If a speaker uses a word, it is likely that he will use the same word again in the near future

• We could form a smoothed bigram or trigram from the previous words, and interpolate this with the standard trigram

where Ptricache(w|w1…wi-1) is a simple interpolated trigram model, using counts from the preceding words in the same document

P_trigramcache(w | w_1 ... w_{i-1}) = λ P_Smooth(w | w_{i-2} w_{i-1}) + (1 - λ) P_tricache(w | w_1 ... w_{i-1})
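A minimal sketch of the caching idea, simplified to a unigram cache over the document history interpolated with a fixed baseline probability (the weights and numbers are illustrative, not values from the slides):

```python
from collections import Counter

def cached_prob(w, history, base_prob, lam=0.9):
    """Interpolate a baseline LM probability with a unigram cache built from the
    words already seen in the current document."""
    cache = Counter(history)
    p_cache = cache[w] / len(history) if history else 0.0
    return lam * base_prob + (1 - lam) * p_cache

history = "the court ruled that the merger could proceed the court".split()
print(cached_prob("court", history, base_prob=0.001))   # boosted: "court" is already in the cache
print(cached_prob("zebra", history, base_prob=0.001))   # only the weighted baseline remains
```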

Caching

• When interpolating three probabilities P_1(w), P_2(w), and P_3(w), rather than use

λ P_1(w) + μ P_2(w) + (1 - λ - μ) P_3(w)

we actually use

λ P_1(w) + μ P_2(w) + ν P_3(w)

This allows us to simplify the constraints of the search

Caching

• Conditional caching : weight the trigram cache differently depending on whether or not we have previously seen the context

P_conditionaltrigram(w | w_1 ... w_{i-1}) =
  λ P_Smooth(w | w_{i-2} w_{i-1}) + μ P_unicache(w | w_1 ... w_{i-1}) + ν P_tricache(w | w_1 ... w_{i-1})    if w_{i-1} is in the cache
  λ P_Smooth(w | w_{i-2} w_{i-1}) + μ P_unicache(w | w_1 ... w_{i-1})                                        otherwise

Caching

• Assume that the more data we have in the cache, the more useful the cache is. Thus we make λ, μ, and ν linear functions of the amount of data in the cache

• Always set maxwordsweight to at or near 1,000,000 while assigning multiplier to a small value (100 or less)

weight(wordsincache) = startweight + multiplier · min(wordsincache, maxwordsweight) / maxwordsweight

Caching

• Finally, we can try conditionally combining unigram, bigram, and trigram caches

P_conditionaltrigram(w | w_1 ... w_{i-1}) =
  λ P_Smooth(w | w_{i-2} w_{i-1}) + μ P_unicache(w | w_1 ... w_{i-1}) + ν P_bicache(w | w_1 ... w_{i-1}) + κ P_tricache(w | w_1 ... w_{i-1})    if w_{i-2} w_{i-1} is in the cache
  λ P_Smooth(w | w_{i-2} w_{i-1}) + μ P_unicache(w | w_1 ... w_{i-1}) + ν P_bicache(w | w_1 ... w_{i-1})                                        if w_{i-1} is in the cache
  λ P_Smooth(w | w_{i-2} w_{i-1}) + μ P_unicache(w | w_1 ... w_{i-1})                                                                           otherwise

Caching

• As can be seen, caching is potentially one of the most powerful techniques we can apply, leading to performance improvements of up to 0.6 bits on small data. Even on large data, the improvement is still substantial, up to 0.23 bits

• On all data sizes, the n-gram caches perform substantially better than the unigram cache, but which version of the n-gram cache is used appears to make only a small difference

Caching

• It should be noted that all of these results assume that the previous words are known exactly

• In a speech recognition system, it is possible for a cache to "lock in" an error

• If "recognition speech" is misrecognized as "wreck a nice beach", then later "speech recognition" may be misrecognized as "beach wreck ignition", since the probability of "beach" will have been significantly raised

Sentence Mixture Models

• There may be several different sentence types within a corpus; these types could be grouped by topic, or style, or some other criterion

• In WSJ data, we might assume that there are three types: financial market sentences (with a great deal of numbers and stock names), business sentences (promotions, demotions, mergers), and general news stories

• Of course, in general, we do not know the sentence type until we have heard the sentence. Therefore, instead, we treat the sentence type as a hidden variable

Sentence Mixture Models

• Let sj denote the condition that the sentence under consideration is a sentence of type j. Then the probability of the sentence, given that it is of type j can be written as

• Let s0 be a special context that is always true

• Let there be S different sentence types (4 ≤ S ≤ 8); let σ_0 ... σ_S be sentence interpolation parameters subject to the constraint Σ_{j=0}^{S} σ_j = 1

Π_{i=1}^{N} P(w_i | w_{i-2} w_{i-1} s_j)      (probability of the sentence, given that it is of type j)

P(w_i | w_{i-2} w_{i-1} s_0) = P(w_i | w_{i-2} w_{i-1})

Sentence Mixture Models

• The overall probability of a sentence w1…wn is

• Eq. (8) can be read as saying that there is a hidden variable, the sentence type; the prior probability for each sentence type is σ_j

• The probability P(wi|wi-2wi-1sj) may suffer from data sparsity, so they are linearly interpolated with the global model P(wi|wi-2wi-1)

P(w_1 ... w_n) = Σ_{j=0}^{S} σ_j Π_{i=1}^{N} P(w_i | w_{i-2} w_{i-1} s_j)      (8)

Σ_{j=0}^{S} σ_j Π_{i=1}^{N} [λ_j P(w_i | w_{i-2} w_{i-1} s_j) + (1 - λ_j) P(w_i | w_{i-2} w_{i-1})]
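A minimal sketch of Eq. (8) with interpolated component models; the toy component models (a "financial" type and a "general" type), their probabilities, and all weights are invented for illustration:

```python
import math

def sentence_mixture_logprob(sentence, mixture_weights, type_models, global_model, lam=0.7):
    """Sum over hidden sentence types j of
    sigma_j * prod_i [lam * P_j(w_i | context) + (1 - lam) * P_global(w_i | context)]."""
    total = 0.0
    for sigma_j, p_j in zip(mixture_weights, type_models):
        prob = sigma_j
        for i, w in enumerate(sentence):
            context = tuple(sentence[max(0, i - 2):i])
            prob *= lam * p_j(w, context) + (1 - lam) * global_model(w, context)
        total += prob
    return math.log2(total)

# Hypothetical toy component models: the "financial" type slightly prefers numbers.
global_model = lambda w, ctx: 0.01
financial = lambda w, ctx: 0.03 if w.isdigit() else 0.008
general = lambda w, ctx: 0.012
print(sentence_mixture_logprob("shares rose 3 percent".split(),
                               [0.4, 0.6], [financial, general], global_model))
```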

Sentence Mixture Models

• Sentence types for the training data were found by using the same clustering program used for clustering words; in this case, we tried to minimize the sentence-cluster unigram perplexities

• Let s(i) represent the sentence type assigned to the sentence that word i is part of. (All words in a given sentence are assigned to the same type)

• We tried to put sentences into clusters in such a way that Π_{i=1}^{N} P(w_i | s(i)) was maximized

Relationship between training data size, n-gram order, and number of types


Sentence Mixture Models

• Note that we don’t trust results for 128 mixtures. With 128 sentence types, there are 773 parameters, and the system may not have had enough heldout data to accurately estimate the parameters

• Ideally, we would run this experiment with a larger heldout set, but it already required 5.5 days with 20,000 words, so this is impractical

Sentence Mixture Models

• We suspected that sentence mixture models would be more useful on larger training data sizes; with 100,000 words the gain is only .1 bits, while with 284,000,000 words it is nearly .3 bits

• This bodes well for the future of sentence mixture models : as computers get faster and larger, training data sizes should also increase

Sentence Mixture Models

• Since both 5-grams and sentence mixture models attempt to model long-distance dependencies, the improvement from their combination would be expected to be less than the sum of the individual improvements

• In Fig. 8, for 100,000 and 1,000,000 words, the difference between trigram and 5-gram is very small, so the question is not very important

• For 10,000,000 words and all training data, there is some negative interaction:

number of sentence types:   4      32
trigram                     0.12   0.27
5-gram                      0.08   0.18

So, approximately one third of the improvement seems to be correlated

Combining techniques

• We start from fullibmpredict clustering

P_fullibmpredict(w_i | w_{i-2} w_{i-1}) = [λ P(W_i | w_{i-2} w_{i-1}) + (1 - λ) P(W_i | W_{i-2} W_{i-1})] · [λ P(w_i | w_{i-2} w_{i-1} W_i) + (1 - λ) P(w_i | W_{i-2} W_{i-1} W_i)]

and interpolate this clustered trigram with a normal 5-gram; the clustered form is also extended to the 5-gram level:

[λ P(W_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + (1 - λ) P(W_i | W_{i-4} W_{i-3} W_{i-2} W_{i-1})] · [λ P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1} W_i) + (1 - λ) P(w_i | W_{i-4} W_{i-3} W_{i-2} W_{i-1} W_i)]

Combining techniques

• Interpolate the sentence-specific 5-gram model with the global 5-gram model, the three skipping models, and the two cache models

sencluster_j(W_i, w_{i-4} ... w_{i-1}) =
    λ_{j,1} P(W_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1} s_j) + λ_{j,2} P(W_i | W_{i-4} W_{i-3} W_{i-2} W_{i-1} s_j)
  + λ_{j,3} P(W_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + λ_{j,4} P(W_i | W_{i-4} W_{i-3} W_{i-2} W_{i-1})
  + λ_{j,5} P(W_i | w_{i-4} w_{i-3} w_{i-1}) + λ_{j,6} P(W_i | W_{i-4} W_{i-3} W_{i-1})
  + λ_{j,7} P(W_i | w_{i-4} w_{i-2} w_{i-1}) + λ_{j,8} P(W_i | W_{i-4} W_{i-2} W_{i-1})
  + λ_{j,9} P(W_i | w_{i-4} w_{i-3} w_{i-2}) + λ_{j,10} P(W_i | W_{i-4} W_{i-3} W_{i-2})
  + λ_{j,11} P_unicache(W_i) + λ_{j,12} P_tricache(W_i | w_{i-2} w_{i-1})

Combining techniques

• Next, we define the analogous function for predicting words given clusters:

senword_j(w_i, w_{i-4} ... w_{i-1}, W_i) =
    λ_{j,1} P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1} W_i s_j) + λ_{j,2} P(w_i | W_{i-4} W_{i-3} W_{i-2} W_{i-1} W_i s_j)
  + λ_{j,3} P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1} W_i) + λ_{j,4} P(w_i | W_{i-4} W_{i-3} W_{i-2} W_{i-1} W_i)
  + λ_{j,5} P(w_i | w_{i-4} w_{i-3} w_{i-1} W_i) + λ_{j,6} P(w_i | W_{i-4} W_{i-3} W_{i-1} W_i)
  + λ_{j,7} P(w_i | w_{i-4} w_{i-2} w_{i-1} W_i) + λ_{j,8} P(w_i | W_{i-4} W_{i-2} W_{i-1} W_i)
  + λ_{j,9} P(w_i | w_{i-4} w_{i-3} w_{i-2} W_i) + λ_{j,10} P(w_i | W_{i-4} W_{i-3} W_{i-2} W_i)
  + λ_{j,11} P_unicache(w_i | W_i) + λ_{j,12} P_tricache(w_i | w_{i-2} w_{i-1} W_i)

Combining techniques

• Now, we can write out our probability model :

P_everything(w_1 ... w_N) = Π_{i=1}^{N} Σ_{j=0}^{S} σ_j · sencluster_j(W_i, w_{i-4} ... w_{i-1}) · senword_j(w_i, w_{i-4} ... w_{i-1}, W_i)      (9)

Experiment

• In fact, without KN smoothing, 5-grams actually hurt at small and medium data sizes. This is a wonderful example of synergy

• Caching gives the largest gain at small and medium data sizes

• Combined with KN-smoothing, 5-grams are the largest gain at large data sizes