
Page 1

Using CTW as a language modeler in Dasher

Phil Cowans, Martijn van Veen

25-04-2007

Inference Group, Department of Physics

University of Cambridge

Page 2


Language Modelling

• Goal is to produce a generative model over strings

• Typically sequential predictions:

• Finite context models:
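
The two formulas here are presumably the standard chain-rule factorization and the order-$k$ Markov (finite context) assumption:

$$P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})$$

$$P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-k}, \ldots, x_{i-1})$$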

Page 3


Dasher: Language Model

• Conditional probability for each alphabet symbol, given the previous symbols

• Similar to compression methods

• Requirements:
  – Sequential
  – Fast
  – Adaptive

• Model is trained on example text

• Better compression -> faster text input
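
A hedged sketch of why this holds in Dasher: the user steers at a roughly constant rate of $R$ bits per second, so a model that assigns the entered text $H$ bits per symbol on average permits roughly

$$\text{writing speed} \approx \frac{R}{H} \ \text{symbols per second},$$

and any reduction in $H$ (better compression) is a proportional gain in input speed.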

Page 4


Basic Language Model

• Independent distributions for each context

• Use Dirichlet prior

• Makes poor use of data – intuitively we expect similarities between similar contexts
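
A minimal sketch of the resulting estimator, with assumed notation ($n_{x|c}$: count of symbol $x$ in context $c$; $n_c$: total count in $c$; $\mathcal{A}$: the alphabet; symmetric Dirichlet parameter $\alpha$):

$$P(x \mid c) = \frac{n_{x|c} + \alpha}{n_c + \alpha\,|\mathcal{A}|}$$

Because every context is estimated independently, observations in one context tell the model nothing about similar contexts.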

Page 5


Basic Language Model

Page 6


Prediction By Partial Match

• Associate a generative distribution with each leaf in the context tree

• Share information between nodes using a hierarchical Dirichlet (or Pitman-Yor) prior

• In practice, use a fast but generally good approximation
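
One common form of this approximation, consistent with the $\alpha$ and $\beta$ values on the parameters slide (the exact variant used is an assumption here), interpolates discounted counts with the prediction of the next-shorter context $c'$:

$$P(x \mid c) = \frac{\max(n_{x|c} - \beta,\, 0)}{n_c + \alpha} + \frac{\alpha + \beta\, u_c}{n_c + \alpha}\, P(x \mid c')$$

where $u_c$ is the number of distinct symbols observed in context $c$.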

Page 7


Hierarchical Dirichlet Model

Page 8


Context Tree Weighting

• Combine nodes in the context tree

• Tree structure treated as a random variable

• Contexts associated with each leaf have the same generative distribution

• Contexts associated with different leaves are independent

• Dirichlet prior on generative distributions

Page 9


CTW: Tree model

• Source structure is captured in the tree model; the parameters are memoryless

Page 10


Tree Partitions

Page 11


Recursive Definition

• Children share one distribution

• Children distributed independently
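
These two cases are the standard CTW weighting recursion (Willems, Shtarkov & Tjalkens). At a node $s$ of the context tree, with $P_e(s)$ the zero-order estimate from the counts at $s$, children $0s$ and $1s$, and maximum depth $D$:

$$P_w^s = \begin{cases} \tfrac{1}{2} P_e(s) + \tfrac{1}{2} P_w^{0s} P_w^{1s} & \text{if } s \text{ is internal} \\ P_e(s) & \text{if } s \text{ has depth } D \end{cases}$$

The $P_e(s)$ term is the hypothesis that all children share one distribution; the product term is the hypothesis that they are distributed independently.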

Page 12


Experimental Results [alphabet size 256]

Page 13


Experimental Results [alphabet size 128]

Page 14


Experimental Results [alphabet size 27]

Page 15


Observations So Far

• No clear overall winner without modification.

• PPM does better with small alphabets?

• PPM initially learns faster?

• CTW is more forgiving with redundant symbols?

Page 16


CTW for text

Properties of text-generating sources:

• Large alphabet, but in any given context only a small subset is used
  – Waste of code space: many probabilities that should be zero
  – Solutions:
    • Adjust the zero-order estimator to decrease the probability of unlikely events
    • Binary decomposition

• Only locally stationary
  – Limit the counts to increase adaptivity

• Bell, Cleary, Witten 1989
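
A hedged sketch of the estimator adjustment for a binary node with counts $a$ (zeros) and $b$ (ones): the Krichevsky-Trofimov estimator is

$$P(X = 1 \mid a, b) = \frac{b + \alpha}{a + b + 2\alpha} \quad \text{with } \alpha = \tfrac{1}{2},$$

and choosing a much smaller $\alpha$ (the parameters slide lists $\alpha = 1/128$) moves the probability of never-seen symbols much closer to zero, reclaiming the wasted code space.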

Page 17


Binary Decomposition

• Decomposition tree
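
A minimal sketch of prediction through a decomposition tree; the balanced fixed-depth tree and the `bit_predictor` interface are illustration-only assumptions, not the deck's actual decomposition:

```python
def symbol_probability(symbol_index, depth, bit_predictor):
    """P(symbol) as a product of binary predictions along the
    symbol's path through a balanced decomposition tree."""
    prob = 1.0
    path = []
    for level in reversed(range(depth)):
        bit = (symbol_index >> level) & 1      # which branch to take
        p_one = bit_predictor(tuple(path))     # e.g. a binary CTW model per node
        prob *= p_one if bit == 1 else 1.0 - p_one
        path.append(bit)
    return prob

# With a uniform bit predictor and a 7-bit (128-symbol) alphabet,
# every symbol gets probability 1/128.
print(symbol_probability(ord('a'), 7, lambda path: 0.5))
```

Each internal node of the tree holds its own binary model, so the full predictor only ever makes 0/1 predictions.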

Page 18


Binary Decomposition

• Results found by Aberg and Shtarkov – all tests with the full ASCII alphabet (results in bits per character):

  Input file                     Paper 1   Paper 2   Book 1   Book 2   News
  PPM-D (byte predictions)       2.351     2.322     2.291    1.969    2.379
  CTW-D (byte predictions)       2.904     2.719     2.490    2.265    2.877
  CTW-KT (bit predictions)       2.322     2.249     2.184    1.910    2.379
  CTW/PPM-D (byte predictions)   2.287     2.235     2.192    1.896    2.322

Page 19


Count halving

• If one count reaches a maximum, divide both counts by 2 (sketched below)
  – Forget older input data, increase adaptivity

• In Dasher: predict user input with a model based on training text
  – Adaptivity even more important
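
A minimal sketch of the halving rule for one binary node; `MAX_COUNT` is a hypothetical cap, not a value given in the deck:

```python
MAX_COUNT = 256  # hypothetical cap: smaller values forget faster

def update_counts(counts, bit):
    """Record one observed bit; halve both counts when the cap is hit.

    Halving discounts old observations geometrically, so the
    estimator tracks a locally stationary source rather than
    averaging over its entire history."""
    counts[bit] += 1
    if counts[bit] >= MAX_COUNT:
        counts[0] = (counts[0] + 1) // 2  # round up: seen symbols keep some mass
        counts[1] = (counts[1] + 1) // 2
    return counts
```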

Page 20


Count halving: Results

Page 21


Count halving: Results

Page 22


Results: Enron

Page 23


Combining PPM and CTW

• Select the locally best model, or weight models together

• More alpha parameters for PPM, learned from data

• PPM-like sharing, with a prior over context trees, as with CTW
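
For the weighting option, one natural scheme (a sketch; the slide does not specify one) is a Bayesian mixture of the two models, with weights updated online from each model's predictive performance:

$$P(x_i \mid x_{<i}) = \sum_m w_{m,i}\, P_m(x_i \mid x_{<i}), \qquad w_{m,i+1} \propto w_{m,i}\, P_m(x_i \mid x_{<i})$$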

Page 24


Conclusions

• PPM and CTW have different strengths, so it makes sense to try combining them

• Decomposition and count scaling may give clues for improving PPM

• Look at performance on out-of-domain text in more detail

Page 25


Experimental Parameters

• Context depth: 5

• Smoothing: 5%

• PPM – alpha: 0.49, beta: 0.77

• CTW – w: 0.05, alpha: 1/128

Page 26


Comparing language models

• PPM
  – Quickly learns repeating strings

• CTW
  – Works on a set of all possible tree models
  – Not sensitive to parameter D, the maximum model depth
  – Easy to increase adaptivity
  – The weight factor (escape probability) is strictly defined