
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. 37, NO. 7, JULY 1989

A Tree-Based Statistical Language Model for Natural Language Speech Recognition

LALIT R. BAHL, MEMBER, IEEE, PETER F. BROWN, PETER V. DE SOUZA, AND ROBERT L. MERCER, MEMBER, IEEE

Abstract: This paper is concerned with the problem of predicting the next word a speaker will say, given the words already spoken; specifically, the problem is to estimate the probability that a given word will be the next word uttered. Algorithms are presented for automatically constructing a binary decision tree designed to estimate these probabilities. At each node of the tree there is a yes/no question relating to the words already spoken, and at each leaf there is a probability distribution over the allowable vocabulary. Ideally, these nodal questions can take the form of arbitrarily complex Boolean expressions, but computationally cheaper alternatives are also discussed. The paper includes some results obtained on a 5000-word vocabulary with a tree designed to predict the next word spoken from the preceding 20 words. The tree is compared to an equivalent trigram model and shown to be superior.

Manuscript received November 24, 1987; revised September 23, 1988. The authors are with the Speech Recognition Group, IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598. IEEE Log Number 8928123.

I. INTRODUCTION

Given some acoustic evidence A derived from a spoken word sequence W, the problem in automatic speech recognition is to determine W. In one approach [1], [5], an estimate $\hat{W}$ of W is obtained as

$$\hat{W} = \arg\max_{W} \Pr\{W \mid A\}. \tag{1}$$

By Bayes' rule, this can be rewritten as

$$\hat{W} = \arg\max_{W} \frac{\Pr\{A \mid W\} \cdot \Pr\{W\}}{\Pr\{A\}}. \tag{2}$$

The denominator in (2) is independent of W and can be ignored in the search for $\hat{W}$. The first term in the numerator is the probability of observing the acoustic evidence A when the speaker says W; it is the job of an acoustic model to estimate this probability. The second term in the numerator is the probability that the speaker will say W. This is called the prior probability of W. The purpose of a language model is to estimate the prior probabilities. The prior probability of a word sequence W = w_1, w_2, ..., w_n can be written as

$$\Pr\{W\} = \prod_{i=1}^{n} \Pr\{w_i \mid w_1, w_2, \ldots, w_{i-1}\}, \tag{3}$$

so the task of language modeling can be reduced to the problem of estimating terms like Pr{w_i | w_1, w_2, ..., w_{i-1}}: the probability that w_i will be said after w_1, w_2, ..., w_{i-1}. The number of parameters of a language model is the number of variables that must be known in advance in order to compute Pr{w_i | w_1, w_2, ..., w_{i-1}} for any word sequence w_1, w_2, ..., w_n.

In any practical natural-language system with even a moderate vocabulary size, it is clear that the language model probabilities Pr{w_i | w_1, w_2, ..., w_{i-1}} cannot be stored for each possible sequence w_1, w_2, ..., w_i. Even if the sequences were limited to one or two sentences in length, the number of distinct sequences would be so large that a complete set of probabilities could not be computed, never mind stored or retrieved. To be practicable, then, a language model must have many fewer parameters than the total number of possible sequences w_1, w_2, ..., w_n. An obvious way to limit the number of parameters is to partition the various possible word histories w_1, w_2, ..., w_{i-1} into a manageable number of equivalence classes.

A simple-minded, but surprisingly effective, definition of equivalence classes can be found in the N-gram language model [5], [10]. In this model, word sequences are treated as equivalent if and only if they end with the same N-1 words. Typically N = 3, in which case the model is referred to as a 3-gram or trigram model. The trigram model is based upon the approximation

$$\Pr\{w_i \mid w_1, w_2, \ldots, w_{i-1}\} = \Pr\{w_i \mid w_{i-2}, w_{i-1}\}, \tag{4}$$

which is clearly inexact, but apparently quite useful. Maximum-likelihood estimates of N-gram probabilities can be obtained from their relative frequencies in a large body of training text. But since many legitimate N-grams are likely to be missing from the training text, it is necessary to smooth the maximum-likelihood estimates so as to avoid probabilities of zero. The trigram model can be smoothed in a natural way using the bigram and unigram relative frequencies as described in [6].

The trigram language model has the following advantages: the equivalence classes are easy to determine, the relative frequencies can all be precomputed with very little computation, and the probabilities can be smoothed on the fly quickly and efficiently. The main disadvantage of the trigram model lies in its naive definition of equivalence classes.
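For concreteness, the following minimal Python sketch (ours, not the paper's; the smoothing scheme of [6] is more refined than this simple linear interpolation, and the weights and example text below are invented) shows how trigram relative frequencies can be mixed with bigram and unigram frequencies to avoid zero probabilities.

```python
from collections import Counter

def train_counts(words):
    """Collect unigram, bigram, and trigram counts from a token list."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def interpolated_trigram_prob(w, u, v, uni, bi, tri, total, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u, v) as a weighted mix of trigram, bigram, and unigram
    relative frequencies.  The weights here are placeholders; in practice
    they would be chosen on held-out data."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

# Toy demonstration on a tiny invented corpus.
words = "we seek a decision tree which minimizes the average entropy".split()
uni, bi, tri = train_counts(words)
print(interpolated_trigram_prob("tree", "a", "decision", uni, bi, tri, len(words)))
```

In practice the interpolation weights would themselves be estimated on held-out data, much as the lambda weights of Section II are.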



All words prior to the most recent two are ignored, and useful information is lost. Additionally, word sequences ending in different pairs of words should not necessarily be considered distinct; they may be functionally equivalent from a language-model point of view. Separating equivalent histories into different classes, as the trigram model does, fragments the training data unnecessarily and reduces the accuracy of the resulting probability estimates.

In Section II we describe a tree-based language model which avoids both of the above weaknesses in the trigram model. Although the tree-based model is as convenient to apply as the trigram model, it requires a massive increase in computation to construct. Some results are presented in Section III.

II. CONSTRUCTING A TREE-BASED LANGUAGE MODEL

Let us now consider the construction of binary decision trees and their application to the language modeling problem. An example of a binary decision tree is shown in Fig. 1.

At each nonterminal node of the tree, there is a question requiring a yes/no answer, and corresponding to each possible answer there is a branch leading to the next question. Associated with each terminal node, i.e., leaf, is some advice or information which takes into account all the questions and answers which lead to that leaf. The application of binary decision trees is much like playing the venerable TV game "What's My Line?", where contestants try to deduce the occupation of a guest by asking a series of yes/no questions. In the context of language modeling, the questions relate to the words already spoken; for example: "Is the preceding word a verb?". And the information at each leaf takes the form of a probability distribution indicating which words are likely to be spoken next. The leaves of the tree represent language model equivalence classes. The classes are defined implicitly by the questions and answers leading to the leaves.

Notice that an N-gram language model is just a special case of a binary-tree model. With enough questions of the form "Is the last word w_i?" or "Is the second to last word w_j?", the N-gram language model can be represented in the form of a tree. Thus, an optimally constructed tree language model is guaranteed to be at least as good as an optimally constructed N-gram model, and is likely to be much better, given the weaknesses in N-gram models already discussed.

The object of a decision tree is to reduce the uncertainty of the event being decided upon, be it an occupation in a game show, a diagnosis, or the next word a speaker will utter. In language modeling, this uncertainty is measured by the entropy of the probability distributions at the leaves. We seek a decision tree which minimizes the average entropy of the leaf distributions.

Fig. 1. Example of a binary decision tree.

Let {l_1, l_2, ..., l_L} denote the leaves of a decision tree, and let Y denote the event being decided upon: the next word spoken. The average entropy of the leaf distributions is

$$\bar{H}(Y) = \sum_{i=1}^{L} H_i(Y)\,\Pr\{l_i\}, \tag{5}$$

where $H_i(Y)$ is the entropy of the distribution associated with leaf $l_i$, and $\Pr\{l_i\}$ is the prior probability of visiting leaf $l_i$. The entropy at leaf $l_i$, measured in bits, is

$$H_i(Y) = -\sum_{j=1}^{V} \Pr\{w_j \mid l_i\} \log_2 \Pr\{w_j \mid l_i\}, \tag{6}$$

where V is the size of the vocabulary, and Pr{w_j | l_i} is the probability that word w_j will be the next word spoken given that the immediate word history leads to leaf l_i.

Notice that the average entropy is defined in terms of the true leaf distributions, and not the relative frequencies as obtained from a sample of data. The sample entropy, i.e., the entropy on training data, can always be made arbitrarily small by increasing the number of leaves so as to make the leaf sample sizes arbitrarily small. The entropy on test data, however, cannot usually be made arbitrarily small. Typically, trees which perform extremely well on training data achieve their success by modeling idiosyncrasies of the training data themselves rather than by capturing generalizations about the process which gave rise to the data. In the algorithm described below, in order to avoid making such deceptively good trees, we divide the data into two independent halves. Improvements suggested by one half are verified against the other half. If they do not perform well there, they are rejected.

Although we would like to construct the tree with minimum entropy for test data, there is no way of doing so. The best hope is for a "good" tree, having low entropy on test data. Searching for a tree which is good is essentially a problem in heuristics, and many procedures are possible. These procedures include manual tree construction by linguistic experts, as well as automatic tree-growing algorithms. Automatic methods have some important advantages over manual construction: they are not generally language-dependent and may therefore be applied to any language, any domain, and any vocabulary of interest. In many cases, expert linguistic knowledge may not even exist.
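Equations (5) and (6) translate directly into code. The sketch below (illustrative only; the leaf counts are made up) estimates the per-leaf entropies from relative frequencies and weights them by the empirical probability of visiting each leaf.

```python
import math

def leaf_entropy(word_counts):
    """H_i(Y) in bits, eq. (6), estimated from the next-word counts at one leaf."""
    total = sum(word_counts.values())
    h = 0.0
    for c in word_counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

def average_entropy(leaves):
    """H-bar(Y), eq. (5): leaf entropies weighted by the estimated Pr{l_i}."""
    grand_total = sum(sum(wc.values()) for wc in leaves)
    return sum(leaf_entropy(wc) * (sum(wc.values()) / grand_total) for wc in leaves)

# Two toy leaves with invented next-word counts.
leaves = [{"the": 7, "a": 3}, {"said": 5, "asked": 5}]
print(average_entropy(leaves))
```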


In this paper we shall describe only one automatic method: a greedy algorithm, with restrictions on the form of the questions. The algorithm is greedy in the sense that at any node in the tree the question selected is the one giving the greatest reduction in entropy at that node, without regard to subsequent nodes. Thus, the algorithm aims to construct a tree which is locally optimal, but very likely not globally optimal; the hope being that a locally optimal tree will be globally good. This tree-construction paradigm has been advocated before [4], and has been used successfully in other applications [3], [8]. A dynamic programming algorithm for the determination of a truly optimal tree is described in [9], but is only suitable in restricted applications with relatively few variables; it is inappropriate for the present application. A treatise on the art and science of tree growing can be found in [2].

Because the search space of possible questions is enormous, we will place restrictions on the form of the questions. Initially, we shall only allow elementary questions of the form "X ∈ S?", where X denotes a discrete random variable with a finite number of possible values, and S is a subset of the values taken by X. For example, if X denotes the preceding word, and S is the set of all verbs in the permitted vocabulary, then "X ∈ S?" represents the question "Is the preceding word a verb?". Similarly, if X denotes the head of the most recent noun phrase, and S is the set of all plural nouns in the vocabulary, then "X ∈ S?" represents the question "Does the head of the most recent noun phrase have the attribute plural?". Later we shall discuss the possibility of relaxing these restrictions by allowing composite questions comprising several elementary questions.

Let $X_1, X_2, \ldots, X_m$ be the discrete random variables whose values may be questioned. We shall call these the predictor variables. Clearly the power of the decision tree depends on the number and choice of predictor variables. The trigram model makes use of only two predictors: the preceding word, $X_1$, and the word before that, $X_2$. But, obviously, the list of predictor variables need not be limited to those two; it can easily be extended to include the last N words spoken, N > 2, without encountering the difficulty of combinatorial explosion that besets the general N-gram model. Furthermore, if a parser that can handle partial sentences is available, then the list of predictors can also include information from the parser, like the head of the most recent noun phrase. As far as tree construction is concerned, it makes no difference how the predictors are defined; the construction algorithm only uses their values and the value of the next word, Y. As far as the number of equivalence classes is concerned, it makes no difference how many predictor variables there are; increasing the number of predictors does not, in itself, increase the number of equivalence classes. However, the amount of computation involved in tree construction will increase as the number of predictors increases, and therefore, the choice of predictors should be limited to those variables which provide significant information about the word being predicted.

The tree-growing algorithm can be summarized as follows.

1) Let c be the current node of the tree. Initially c is the root.

2) For each predictor variable $X_i$ ($i = 1, 2, \ldots, m$), find the set $S_i^c$ which minimizes the average conditional entropy at node c:

$$\bar{H}_c(Y \mid X_i \in S_i^c?) = -\Pr\{X_i \in S_i^c \mid c\} \sum_{j=1}^{V} \Pr\{w_j \mid c, X_i \in S_i^c\} \log_2 \Pr\{w_j \mid c, X_i \in S_i^c\} - \Pr\{X_i \notin S_i^c \mid c\} \sum_{j=1}^{V} \Pr\{w_j \mid c, X_i \notin S_i^c\} \log_2 \Pr\{w_j \mid c, X_i \notin S_i^c\}. \tag{7}$$

3) Determine which of the m questions derived in Step 2 leads to the lowest entropy. Let this be question k, i.e.,

$$k = \arg\min_i \bar{H}_c(Y \mid X_i \in S_i^c?). \tag{8}$$

4) The reduction in entropy at node c due to question k is

$$R_c(k) = H_c(Y) - \bar{H}_c(Y \mid X_k \in S_k^c?), \tag{9}$$

where

$$H_c(Y) = -\sum_{j=1}^{V} \Pr\{w_j \mid c\} \log_2 \Pr\{w_j \mid c\}.$$

If this reduction is significant, store question k, create two descendant nodes, $c_1$ and $c_2$, corresponding to the conditions $X_k \in S_k^c$ and $X_k \notin S_k^c$, and repeat Steps 2-4 for each of the new nodes separately.

The reduction in entropy in Step 4 is the mutual information between Y and question k at node c. Thus, seeking questions which minimize entropy is just another way of saying that questions are sought which are maximally informative about the event being predicted: an eminently reasonable criterion.

The true probabilities in Step 2 are generally unknown. In practice, they can be replaced by estimates obtained from relative frequencies in a sample of training text. This means, of course, that the algorithm must be conducted using estimates $\hat{H}_c$, $\hat{\bar{H}}_c$, and $\hat{R}_c$ instead of the true values $H_c$, $\bar{H}_c$, and $R_c$, respectively.
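As an illustration of Steps 2-4 with estimated rather than true probabilities, the following sketch scores candidate questions of the form "X_i ∈ S?" at a node by the average conditional entropy of eq. (7) and reports the entropy reduction of eq. (9). The event representation and the fixed list of candidate sets are simplifications introduced here for the example.

```python
import math
from collections import Counter

def entropy(counter):
    """Entropy in bits of a distribution given by counts."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values() if c)

def conditional_entropy(events, i, s):
    """Estimate of eq. (7): average next-word entropy after asking 'X_i in S?'.
    Each event is (predictor_tuple, next_word)."""
    yes = Counter(w for x, w in events if x[i] in s)
    no = Counter(w for x, w in events if x[i] not in s)
    n = len(events)
    h = 0.0
    for branch in (yes, no):
        if branch:
            h += (sum(branch.values()) / n) * entropy(branch)
    return h

def best_question(events, candidate_sets):
    """Step 3: pick the (predictor index, set) pair with lowest conditional
    entropy; Step 4: report the entropy reduction at this node."""
    node_entropy = entropy(Counter(w for _, w in events))
    best = min(candidate_sets, key=lambda q: conditional_entropy(events, *q))
    reduction = node_entropy - conditional_entropy(events, *best)
    return best, reduction

# Toy data: predictors are the two preceding words, Y is the next word.
events = [(("the", "big"), "dog"), (("a", "small"), "cat"),
          (("the", "old"), "man"), (("my", "old"), "man")]
candidates = [(0, {"the"}), (1, {"old"})]
print(best_question(events, candidates))
```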

If the distribution of $\hat{R}$ were known, its statistical significance could be determined in Step 4 when assessing the utility of the selected question. In our case, however, we were unable to determine the distribution of $\hat{R}$ and resorted to an empirical test of significance instead. Using independent (held-out) training data, we computed the reduction in entropy due to the selected question, and retained the question if the reduction exceeded a threshold.


The threshold allows the arborist to state what is considered to be practically significant, as opposed to statistically significant. The definition of "significant" in Step 4 determines whether or not a node is subdivided into two subnodes, and is responsible, therefore, for the ultimate size of the tree, and the amount of computation required to construct it.

The only remaining issue to be addressed in the above tree-growing algorithm is the determination of the set $S_i^c$ in Step 2. This amounts to partitioning the values taken by $X_i$ into two groups: those in $S_i^c$ and those not in $S_i^c$. Again, there is no known practical way of achieving a certifiably optimal partition, especially in applications like language modeling where $X_i$ can take a large number of different values. As before, the best realistic hope is to find a good set $S_i^c$ via some kind of heuristic search. Possible search strategies range from relatively simple greedy algorithms to the computationally expensive techniques of simulated annealing [7].

Let $\mathcal{X}$ denote the set of values taken by the variable X. In our case, $\mathcal{X}$ is the entire vocabulary. The following algorithm determines a set S in a greedy fashion; a code sketch of the resulting insertion-deletion loop appears at the end of this passage.

1) Let S be empty.
2) Insert into S the x ∈ $\mathcal{X}$ which leads to the greatest reduction in the average conditional entropy (7). If no x ∈ $\mathcal{X}$ leads to a reduction, make no insertion.
3) Delete from S any member x, if so doing leads to a reduction in the average conditional entropy.
4) If any insertions or deletions were made to S, return to Step 2.

Using this set-construction algorithm and the earlier tree-growing algorithm, a language-model decision tree could certainly be constructed, but because of the restrictive form of the questions it would be somewhat inefficient. For example, suppose that the best question to place at some node of the tree is actually a composite question of the form: "Is ($X_i \in S_i$) OR ($X_j \in S_j$)?". This is still a binary question, but with the existing restrictions on the form of the questions, it can only be implemented as two separate questions in such a way that the data are split into three subgroups rather than two. The data for which the answer is yes are unavoidably fragmented across two nodes. This is inefficient in the sense that these two nodes may be equivalent from a language model point of view. Splitting data unnecessarily across two nodes leads to duplicated branches, unnecessary computation, reduced sample sizes, and less accurate probability estimates.
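The insertion-deletion loop of the set-construction algorithm above can be sketched as follows. Here cond_entropy is assumed to be an estimator of the average conditional entropy (7), such as the conditional_entropy function sketched earlier, and the stopping rule is exactly the "no change" test of Step 4.

```python
def grow_set(values, events, i, cond_entropy):
    """Greedy set construction: repeatedly insert the single value whose
    addition gives the greatest entropy reduction, then delete any member
    whose removal also reduces the entropy, until the set stabilizes.
    cond_entropy(events, i, s) is assumed to estimate eq. (7) for the
    question 'X_i in S?'."""
    s = set()
    best_h = cond_entropy(events, i, s)
    while True:
        changed = False
        # Step 2: insert the single value giving the greatest entropy reduction.
        candidates = [(cond_entropy(events, i, s | {x}), x)
                      for x in values if x not in s]
        if candidates:
            h, x = min(candidates, key=lambda c: c[0])
            if h < best_h:
                s.add(x)
                best_h = h
                changed = True
        # Step 3: delete any member whose removal lowers the entropy further.
        for x in list(s):
            h = cond_entropy(events, i, s - {x})
            if h < best_h:
                s.remove(x)
                best_h = h
                changed = True
        # Step 4: repeat until no insertion or deletion was made.
        if not changed:
            return s, best_h
```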

Fig. 2. Example of the topology of an unrestricted composite binary question.

To avoid this inefficiency, the elementary questions at each node of the tree must be replaced by composite binary questions. We would like the structure of the composite questions to permit Boolean expressions of arbitrary complexity, containing any number of elementary conditions of the form X ∈ S and any number of AND, OR, and NOT operators, in any combination. This can be achieved if the questions are themselves represented by binary trees, but with the leaves tied in such a way that there are only two outcomes. Fig. 2 shows an example of such a structure.

The tree in Fig. 2 has eight leaves which are tied as shown by the dotted lines. The four leaves attached to no-branches are tied; they lead to the same final N-state. And the four leaves attached to yes-branches are tied similarly. Thus, the structure is still that of a binary question; there are only two possible final states. There are many different routes to each of the final states, each route representing a different series of conditions. Thus, the composite condition leading to either one of the final states can only be described with multiple OR operators: one for each distinct route. Similarly, the description of any particular route requires the use of multiple AND operators: one for each node on the route. Since there need be no limits on the number of leaves or the depth of a binary tree, it can be seen that any composite binary question, however complex, can be represented by a binary tree with tied leaves similar to that shown in Fig. 2. Note that NOT operators are superfluous here: the condition NOT (X ∈ T) can be rewritten in elementary form X ∈ S by defining S to be the complement of T.
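To make the tied-leaf representation concrete, the following sketch (our illustration, not code from the paper) evaluates a composite question built from elementary tests; tying is modeled simply by letting several branches point at the same final yes or no outcome, and the word classes used in the example are invented.

```python
class QNode:
    """One node of a composite question: an elementary test 'X_i in S?'
    plus the node (or final answer) reached on each outcome.  Leaves are
    'tied' by letting several branches point at the same final True/False."""
    def __init__(self, index, members, yes, no):
        self.index, self.members = index, members
        self.yes, self.no = yes, no      # each is a QNode or a bool

    def ask(self, predictors):
        branch = self.yes if predictors[self.index] in self.members else self.no
        return branch if isinstance(branch, bool) else branch.ask(predictors)

# "Is X1 a verb, OR (X1 an article AND X2 a noun)?"; index 0 is X1, 1 is X2.
verbs, articles, nouns = {"said", "asked"}, {"the", "a"}, {"man", "door"}
q = QNode(0, verbs, True, QNode(0, articles, QNode(1, nouns, True, False), False))
print(q.ask(("the", "door")))   # True
print(q.ask(("said", "he")))    # True
print(q.ask(("he", "said")))    # False
```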

Since composite questions are essentially just binary trees with elementary questions at each node, they can be constructed using the greedy tree-growing and set-construction algorithms already described. The entropies, however, must be computed from relative frequencies obtained after pooling all tied data.

Constructing a tree-based language model with composite questions involves a lot of computation. There are trees within trees: at each node of the global binary tree there is a local binary tree. If the computation appears to be excessive, it may be necessary to compromise on some form of question that is more general than an elementary question, but less general than a fully fledged binary tree. One compromise is provided by the pylon shown in Fig. 3.


Fig. 3. Example of a pylon: a restricted form of composite binary question.

The pylon has two states, N and Y. Starting at the top (level 1), all the data are placed in the N-state. An elementary question is then sought which splits the data in the N-state into two subgroups; the data answering yes being assigned to the Y-state, and the rest remaining in the N-state. At level 2, a refinement is applied to the data in the Y-state; an elementary question is sought which splits the data into two subgroups: the data answering no being returned to the N-state to rejoin the data already there, and the rest remaining in the Y-state. The procedure continues in this fashion, level by level, swapping data from one state to the other, until no further swaps can be found which lower the entropy.

The pylon is equivalent to a restricted binary tree which is constructed by determining a single question to apply simultaneously to all terminal nodes attached to no-branches, followed by a single question to apply simultaneously to all terminal nodes attached to yes-branches, and so on. Thus, the questions attached to several different no-branches are constrained to be identical, as are the questions attached to several different yes-branches. The pylon has no memory: there are many different routes to most pylonic nodes, but no distinction is made between those different routes when the data are processed. The unconstrained binary tree has memory: different routes lead to different nodes, and since different nodes are subjected to different questions, a distinction is made between the different routes when the data are processed. Hence, the questions in the tree can be tailored to the questions already asked; in the pylon they cannot.
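The level-by-level swapping can be sketched as follows (an interpretation of the pylon description above, not the authors' implementation; the event format and the fixed candidate-question list are simplifications of the set-construction search).

```python
import math
from collections import Counter

def split_entropy(n_events, y_events):
    """Weighted next-word entropy of the two pylon states."""
    def h(events):
        counts = Counter(w for _, w in events)
        t = sum(counts.values())
        return -sum((c / t) * math.log2(c / t) for c in counts.values()) if t else 0.0
    total = len(n_events) + len(y_events)
    return (len(n_events) * h(n_events) + len(y_events) * h(y_events)) / total

def grow_pylon(events, candidate_questions):
    """Sketch of pylon construction (Fig. 3): all data start in the N-state;
    at each level one elementary question (i, S) moves part of the current
    state's data across, and growth stops when no candidate question lowers
    the weighted entropy of the N/Y split.  Each event is
    (predictor_tuple, next_word)."""
    n_state, y_state = list(events), []
    best = split_entropy(n_state, y_state)
    levels, from_n = [], True            # level 1 questions the N-state
    while True:
        source = n_state if from_n else y_state
        improvement = None
        for i, s in candidate_questions:
            # 'yes' data leave the N-state; 'no' data return from the Y-state.
            moves = (lambda e: e[0][i] in s) if from_n else (lambda e: e[0][i] not in s)
            movers = [e for e in source if moves(e)]
            stay = [e for e in source if not moves(e)]
            trial = (split_entropy(stay, y_state + movers) if from_n
                     else split_entropy(n_state + movers, stay))
            if trial < best:
                best, improvement = trial, (i, s, movers, stay)
        if improvement is None:
            return levels, n_state, y_state
        i, s, movers, stay = improvement
        if from_n:
            n_state, y_state = stay, y_state + movers
        else:
            n_state, y_state = n_state + movers, stay
        levels.append((i, s))
        from_n = not from_n

# Toy events: (two preceding words, next word).
events = [(("the", "old"), "man"), (("the", "old"), "man"),
          (("a", "big"), "dog"), (("my", "big"), "dog")]
print(grow_pylon(events, [(0, {"the"}), (1, {"old"})]))
```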

Although most composite questions cannot be represented in pylonic form, many useful composite questions can. For example, "Is the preceding word an adjective AND the one before that an article?". Or, "Is there any computer jargon in the last fifteen words?". The latter question requires a pylon 30 levels deep. As can be seen from these examples, the pylon can express certain types of semantic questions as well as grammatical questions. Both are important in determining what word will be spoken next.

Having constructed a tree, with whatever question topology can be afforded, there remains one important issue: the estimation of the probability distributions at the leaves of the decision tree. With a relatively small tree of 10 000 leaves, and a modest vocabulary of 10 000 words, there are one hundred million probabilities to estimate. Even with a reasonably generous one billion words of training data, there would still be insufficient data to estimate the probabilities accurately. Thus, the probability distributions cannot simply be estimated from the relative frequencies in the training data, which are biased in any case. Just as the trigram probabilities are smoothed with lower order bigram and unigram distributions to ameliorate problems due to sparse data, so too can each leaf distribution be smoothed with the lower order nodal distributions between the root and the leaf.

Let {n_1, n_2, ..., n_r} denote the set of nodes between the root, n_1, and a given leaf, n_r. Let q_i(w) denote the relative frequency of word w at node n_i as obtained from the tree-growing training text, and let q_i denote the distribution of relative frequencies at node n_i. Further, let q_0 denote the uniform distribution over the vocabulary. A smoothed leaf probability distribution may be obtained as

$$\Pr\{w \mid n_r\} = \sum_{i=0}^{r} \lambda_i\, q_i(w),$$

where the $\lambda$'s are chosen to maximize the probability of some additional independent (held-out) training data, subject to the constraints that

$$\sum_{i=0}^{r} \lambda_i = 1$$

and $\lambda_i \ge 0$ ($0 \le i \le r$). The $\lambda$ values may be determined from independent training data using the forward-backward parameter estimation algorithm, as described in [1].

It should be clear from the above prescription that tree growing involves a great deal of computation, the vast majority being devoted to set construction. Certainly it involves a lot more work than the creation of a trigram model. It is also clear, however, that at any given node of the decision tree, the best question to ask is completely independent of the data and questions at any other node. This means that the nodes can be processed independently in any convenient order. Given enough parallel processors, tree growing need not take an excessive amount of time.
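Returning to the leaf-smoothing step, the $\lambda$'s can be fitted to held-out data with a simple EM re-estimation loop, used here as a stand-in for the forward-backward procedure of [1]; the sketch below is illustrative, and the node distributions and held-out words are invented.

```python
def smooth_leaf(node_dists, held_out, iterations=50):
    """Sketch of the leaf smoothing step: the leaf distribution is a convex
    combination sum_i lambda_i * q_i(w) of the node distributions on the
    root-to-leaf path (q_0 uniform), with the lambdas estimated from
    held-out words by EM.  node_dists[i] maps words to q_i(w)."""
    r = len(node_dists)
    lams = [1.0 / r] * r
    for _ in range(iterations):
        counts = [0.0] * r
        for w in held_out:
            mix = sum(l * q.get(w, 0.0) for l, q in zip(lams, node_dists))
            if mix > 0.0:
                for i, (l, q) in enumerate(zip(lams, node_dists)):
                    counts[i] += l * q.get(w, 0.0) / mix   # posterior weight of component i
        total = sum(counts)
        lams = [c / total for c in counts]
    return lams

# Illustrative path: uniform root smoother, a node distribution, a leaf one.
vocab = ["the", "man", "dog", "said"]
q0 = {w: 1 / len(vocab) for w in vocab}
q1 = {"the": 0.4, "man": 0.3, "dog": 0.2, "said": 0.1}
q2 = {"man": 0.7, "dog": 0.3}
print(smooth_leaf([q0, q1, q2], ["man", "man", "dog", "the"]))
```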

III. RESULTS

We tested the foregoing ideas on tree-based language models in a pilot experiment involving a 5000-word vocabulary. The vocabulary consisted of the 5000 most frequent words in a database of IBM office correspondence. The training and test data were drawn from 550 books and magazines which ranged from intellectually stimulating romantic novels to silly issues of Datamation. The books and magazines were divided randomly into four parts as follows:

1) training data for tree construction: approximately 10 million words;
2) data for testing the significance of a reduction in entropy: approximately 10 million words;
3) data for computing the smoothed leaf probability distributions: approximately 9 million words; and
4) test data: approximately 1 million words.


No book or magazine was split between more than one of the above categories. Words not in the vocabulary were treated as a single generic "unknown" word.

The purpose of the experiment was to predict the 21st word of a 21-gram, given the first 20 words. This is equivalent to predicting the next word a speaker will say given the last 20 words spoken. Thus, Y was a discrete random variable taking 5000 values. There were 20 predictor variables $X_1, X_2, \ldots, X_{20}$, each of which represented one of the preceding 20 words, and was, therefore, a discrete random variable taking 5000 values. Note that some of the preceding 20 words may be from earlier sentences or even earlier paragraphs or chapters.

We constructed a tree with pylonic questions using the tree-growing and set-construction algorithms of Section II, with two minor changes. First, instead of adding one word at a time to the set S in Step 2 of set construction, we added one group of words at a time. The groups were predefined to represent grammatical classes such as days of the week, months, nouns, verbs, etc. The classes were of differing coarseness and overlapped; many words belonged to several classes. For example, "July" belonged to the class of months, the class of nouns, as well as its own class, the class of "July", which contained only that one word. Adding groups of words simultaneously allows set construction to proceed more rapidly. Since individual words may be discarded from the set, little harm is done by inserting words several at a time. Second, since this was only a pilot experiment, we limited the tree to 10 000 leaves. To ensure even growth, the nodes were processed in order of size.

The adequacy of a language model may be assessed by its ability to predict unseen test data. This ability is measured by the perplexity of the data given the model [1]. Put simply, a script with a perplexity of p with respect to some model has the same entropy as a language having p equally likely choices in all contexts. Clearly, the lower the perplexity, the better the model.

Ignoring those 21-grams in the test data which ended in the generic "unknown" word, the perplexity of the test data with respect to the tree was 90.7. This compares to a perplexity of 94.9 with respect to an equivalent trigram language model. Although the tree was by no means fully grown, it still outperformed the trigram model. The difference in perplexity, however, is not great.

A bigger difference is evident in the numbers of very bad predictions. Speech recognition errors are more likely to occur in words given a very low probability by the language model. Using the trigram model, 3.87 percent of the words in the test data were given a probability of less than $2^{-15}$. (It is convenient to work in powers of 2 when performing perplexity and entropy calculations.) In the case of the tree, only 2.81 percent of the test words had such a low probability. Thus, the number of words with a probability of less than $2^{-15}$ was reduced by 27 percent, which could have a significant impact on the error rate of a speech recognizer.
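For reference, perplexity and the fraction of "very bad" predictions can be computed directly from the per-word probabilities a model assigns to the test text; the probabilities in the sketch below are invented for illustration.

```python
import math

def perplexity(probabilities):
    """Perplexity of a test script given the per-word probabilities assigned
    by a model: 2 raised to the average negative log2 probability."""
    avg_neg_log2 = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** avg_neg_log2

def fraction_very_bad(probabilities, threshold=2.0 ** -15):
    """Share of test words given a probability below 2^-15, the yardstick
    used above for very bad predictions."""
    return sum(p < threshold for p in probabilities) / len(probabilities)

# Hypothetical per-word probabilities on a tiny test set.
probs = [0.05, 0.2, 0.001, 0.3, 0.00002]
print(perplexity(probs), fraction_very_bad(probs))
```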

The improvement of the tree over the trigram model was obtained despite the fact that the tree was incomplete and comprised substantially fewer distinct probability distributions than the trigram model. The tree has one distribution per leaf: 10 015 in all. The trigram model, on the other hand, has at least one distribution per unique bigram in the training text, and there were 796 000 of them. Apparently, the vastly greater resolution of the trigram model is insufficient to compensate for its other weaknesses vis-à-vis the tree-based model. Although the tree has 10 015 distributions as compared to 796 000 for the trigram model, the storage necessary for them is about the same: many of the trigram distributions have very few nonzero entries, which is not true for the tree distributions.

We also created a language model which combined the tree and the trigram. The combined probabilities were obtained as

$$\Pr\{w_{21} \mid w_1, \ldots, w_{20}\} = \lambda_r\, \tilde{q}_r(w_{21}) + (1 - \lambda_r) \Pr\{w_{21} \mid w_{19}, w_{20}\},$$

with $\lambda_r$ determined from independent training data as discussed in [1]. Here r denotes the leaf corresponding to the sequence $w_1, w_2, \ldots, w_{20}$; $\tilde{q}_r(w_{21})$ and $\Pr\{w_{21} \mid w_{19}, w_{20}\}$ denote the tree and trigram probability estimates, respectively. The weight $\lambda_r$ was a function of the leaf only, and lay between 0 and 1 as always.

The perplexity of the test data with respect to the combined model was 82.5: 13 percent lower than the trigram perplexity and 9 percent lower than the tree perplexity. Additionally, 2.73 percent of the correct words had a probability of less than $2^{-15}$. Thus, the number of words with a probability of less than $2^{-15}$ was only slightly less than with the tree alone, but was almost 30 percent less than obtained with the trigram model on its own.
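The combination is straightforward to apply at run time: the 20-word history selects a leaf, the leaf selects a weight, and the tree and trigram estimates are mixed. The sketch below is ours; every helper passed in (leaf lookup, leaf distribution, per-leaf weight, trigram probability) is a stand-in.

```python
def combined_prob(w, history, leaf_of, tree_dist, trigram_prob, leaf_weight):
    """Sketch of the combined model: a per-leaf weight lambda_r mixes the
    tree's smoothed leaf distribution with the trigram probability of the
    same word.  All of the callables here are assumptions for illustration."""
    leaf = leaf_of(history)             # leaf reached by the word history
    lam = leaf_weight(leaf)             # trained on held-out data, 0 <= lam <= 1
    p_tree = tree_dist(leaf).get(w, 0.0)
    p_tri = trigram_prob(w, history[-2], history[-1])
    return lam * p_tree + (1.0 - lam) * p_tri

# Toy demo with stub components (only two history words shown).
demo = combined_prob(
    "man", ("the", "old"),
    leaf_of=lambda h: 0,
    tree_dist=lambda leaf: {"man": 0.6, "dog": 0.4},
    trigram_prob=lambda w, u, v: 0.25,
    leaf_weight=lambda leaf: 0.7)
print(demo)   # 0.7*0.6 + 0.3*0.25
```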

For interest, we tabulated the depths of the 10 014 pylons in the tree. The results are shown in Fig. 4. It can be seen from Fig. 4 that roughly one half of the pylons represented a single elementary question, while the other half represented composite questions of varying complexity. This supports the argument for composite questions at the nodes.

We also tabulated how often each predictor was the subject of a question. The 10 014 pylons contained 22 160 questions in total; their subjects are tabulated in Fig. 5. As would be expected, most of the questions are directed toward the more recent words, these being more useful for determining the likely part of speech of the next word to be uttered. Earlier words have not been ignored, although they have been interrogated somewhat less; they are useful for determining the likely semantic category of the forthcoming word. Certainly, it is not the case that all questions are directed toward the two most recent words $w_{19}$ and $w_{20}$; this underlines the limitations of the trigram model, and the potentially useful information it discards.


Fig. 4. Histogram of the depths of the 10 014 pylons (depth versus number of pylons).

Fig. 5. Number of questions directed at each predictor variable (subject versus number of questions).

In summary, the tree-based language model provided a lower perplexity and fewer very bad probabilities than an equivalent trigram model. However, the combined trigram-tree model had appreciably lower perplexity than either model on its own. Thus, the most effective use of a tree-based model may be as an adjunct to a trigram model, rather than a replacement for it.

IV. EPILOGUE

A good language model ought to have the property that the probability assigned to real text is high, and the probability of nonsensical text is low. With this in mind, we probed the properties of the trigram and tree-based models by generating some text using the models. Words following quotation marks or a period were selected at random in accordance with their probabilities. In all other contexts, the most probable word was selected.


The generic "unknown" word was excluded from the allowable vocabulary. The initial word history was selected at random from real text.

The excerpts below are included mainly for the reader's amusement, and are not claimed to be typical. The generated text reflects the language models' heritage in romantic novels. The punctuation has been left exactly as it was generated. Each punctuation symbol is considered to be a word and has a probability of being the next word spoken like any other word.

The following paragraph was obtained with the trigram model.

If you don't have to be a good deal of the world. I said. She was a good deal of the world. But the fact that the only one of the world. When the first time in the world.

Less monotonous are the following paragraphs obtained with the tree-based model.

What do you mean? I don't know. You know, said the man. Is it? he asked. You know, said the man. They are not not to be a good idea. The first time I was a good idea. She was a good idea. Certainly, said. What's the matter? May I be able to get the money. Well, said the man. Scott was a good idea. Mrs. King, Nick said. I don't know what I mean. Take a look at the door. He was a good idea. I don't know what I mean. Didn't you know, he said.

It is the collective unbiased opinion of the authors that these paragraphs are at least as stimulating as the average romantic novel.

The trigram sentences consist of very reasonable 3-grams, but the limited memory of the trigram model results in stilted 4-grams, 5-grams, etc., and hence leads to meaningless sentences. Quotation marks are unmatched because there is no mechanism to remember the number of unmatched quotes.

The tree-generated sentences exemplify the effects of the longer 20-word memory. Punctuation is vastly improved, and repetition is much reduced. Some of the component 3-grams, however, are not as reasonable as before. "Are not not" and "not not to" are pretty unusual constructions, but trigrams like those are not not to be found in scientific journals.

ACKNOWLEDGMENT

We would like to thank the other members of the Speech Recognition Group at the IBM Research Center for their help and encouragement.


We are also indebted to the American Printing House for the Blind, who provided the books and magazines used in the experiments.

REFERENCES

[1] L. R. Bahl, F. Jelinek, and R. L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, pp. 179-190, Mar. 1983.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth, 1984.
[3] R. G. Casey and G. Nagy, "Decision tree design using a probabilistic model," IEEE Trans. Inform. Theory, vol. IT-30, pp. 93-99, Jan. 1984.
[4] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, and C. L. Gerberich, "Application of information theory to the construction of efficient decision trees," IEEE Trans. Inform. Theory, vol. IT-28, pp. 565-577, July 1982.
[5] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, vol. 64, pp. 532-556, Apr. 1976.
[6] F. Jelinek, "The development of an experimental discrete dictation recognizer," Proc. IEEE, vol. 73, pp. 1616-1624, Nov. 1985.
[7] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671-680, May 1983.
[8] J. M. Lucassen and R. L. Mercer, "An information theoretic approach to the automatic determination of phonemic baseforms," in Proc. 1984 IEEE Int. Conf. Acoust., Speech, Signal Processing, San Diego, CA, Mar. 1984, pp. 42.5.1-42.5.4.
[9] H. J. Payne and W. S. Meisel, "An algorithm for constructing optimal binary decision trees," IEEE Trans. Comput., vol. C-26, pp. 905-916, Sept. 1977.
[10] C. E. Shannon, "Prediction and entropy of printed English," Bell Syst. Tech. J., vol. 30, pp. 50-64, Jan. 1951.

Lalit R. Bahl (S'66-M'68) is an ex-coding theorist who works for IBM on acoustic and language models for speech recognition.

Peter F. Brown is an ex-speech recognition researcher who works for IBM on automatic language translation.

Peter V. de Souza is an ex-biostatistician who works for IBM on acoustic and language models for speech recognition.

Robert L. Mercer (M'83) is an ex-physicist who works for IBM on acoustic and language models for speech recognition.