
Page 1

Using CTW as a language modeler in Dasher

Phil Cowans, Martijn van Veen

25-04-2007

Inference Group, Department of Physics

University of Cambridge

Page 2


Language Modelling

• Goal is to produce a generative model over strings

• Typically sequential predictions:

• Finite context models:
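
The two formulas here are presumably the standard chain-rule factorization and the order-$k$ Markov (finite context) assumption:

$$P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})$$

$$P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-k}, \ldots, x_{i-1})$$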

Page 3


Dasher: Language Model

• Conditional probability for each alphabet symbol, given the previous symbols

• Similar to compression methods

• Requirements:
  – Sequential
  – Fast
  – Adaptive

• Model is trained on example text

• Better compression -> faster text input
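
A hedged sketch of why this holds in Dasher: the user steers at a roughly constant rate of $R$ bits per second, so a model that assigns the entered text $H$ bits per symbol on average permits roughly

$$\text{writing speed} \approx \frac{R}{H} \ \text{symbols per second},$$

and any reduction in $H$ (better compression) is a proportional gain in input speed.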

Page 4


Basic Language Model

• Independent distributions for each context

• Use Dirichlet prior

• Makes poor use of data – intuitively we expect similarities between similar contexts
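
A minimal sketch of the resulting estimator, with assumed notation ($n_{x|c}$: count of symbol $x$ in context $c$; $n_c$: total count in $c$; $\mathcal{A}$: the alphabet; symmetric Dirichlet parameter $\alpha$):

$$P(x \mid c) = \frac{n_{x|c} + \alpha}{n_c + \alpha\,|\mathcal{A}|}$$

Because every context is estimated independently, observations in one context tell the model nothing about similar contexts.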

Page 5


Basic Language Model

Page 6


Prediction By Partial Match

• Associate a generative distribution with each leaf in the context tree

• Share information between nodes using a hierarchical Dirichlet (or Pitman-Yor) prior

• In practice, use a fast but generally good approximation
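
One common form of this approximation, consistent with the $\alpha$ and $\beta$ values on the parameters slide (the exact variant used is an assumption here), interpolates discounted counts with the prediction of the next-shorter context $c'$:

$$P(x \mid c) = \frac{\max(n_{x|c} - \beta,\, 0)}{n_c + \alpha} + \frac{\alpha + \beta\, u_c}{n_c + \alpha}\, P(x \mid c')$$

where $u_c$ is the number of distinct symbols observed in context $c$.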

Page 7


Hierarchical Dirichlet Model

Page 8


Context Tree Weighting

• Combine nodes in the context tree

• Tree structure treated as a random variable

• Contexts associated with each leaf have the same generative distribution

• Contexts associated with different leaves are independent

• Dirichlet prior on generative distributions

Page 9


CTW: Tree model

• Source structure is captured in the tree model; the parameters are memoryless

Page 10


Tree Partitions

Page 11


Recursive Definition

• Children share one distribution

• Children distributed independently
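
These two cases are the standard CTW weighting recursion (Willems, Shtarkov & Tjalkens). At a node $s$ of the context tree, with $P_e(s)$ the zero-order estimate from the counts at $s$, children $0s$ and $1s$, and maximum depth $D$:

$$P_w^s = \begin{cases} \tfrac{1}{2} P_e(s) + \tfrac{1}{2} P_w^{0s} P_w^{1s} & \text{if } s \text{ is internal} \\ P_e(s) & \text{if } s \text{ has depth } D \end{cases}$$

The $P_e(s)$ term is the hypothesis that all children share one distribution; the product term is the hypothesis that they are distributed independently.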

Page 12


Experimental Results [alphabet size 256]

Page 13


Experimental Results [alphabet size 128]

Page 14


Experimental Results [alphabet size 27]

Page 15


Observations So Far

• No clear overall winner without modification.

• PPM does better with small alphabets?

• PPM initially learns faster?

• CTW is more forgiving with redundant symbols?

Page 16


CTW for text

Properties of text-generating sources:

• Large alphabet, but in any given context only a small subset is used
  – Waste of code space: many probabilities that should be zero
  – Solutions:
    • Adjust the zero-order estimator to decrease the probability of unlikely events
    • Binary decomposition

• Only locally stationary
  – Limit the counts to increase adaptivity

• Bell, Cleary, Witten 1989
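
A hedged sketch of the estimator adjustment for a binary node with counts $a$ (zeros) and $b$ (ones): the Krichevsky-Trofimov estimator is

$$P(X = 1 \mid a, b) = \frac{b + \alpha}{a + b + 2\alpha} \quad \text{with } \alpha = \tfrac{1}{2},$$

and choosing a much smaller $\alpha$ (the parameters slide lists $\alpha = 1/128$) moves the probability of never-seen symbols much closer to zero, reclaiming the wasted code space.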

Page 17


Binary Decomposition

• Decomposition tree
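
A minimal sketch of prediction through a decomposition tree; the balanced fixed-depth tree and the `bit_predictor` interface are illustration-only assumptions, not the deck's actual decomposition:

```python
def symbol_probability(symbol_index, depth, bit_predictor):
    """P(symbol) as a product of binary predictions along the
    symbol's path through a balanced decomposition tree."""
    prob = 1.0
    path = []
    for level in reversed(range(depth)):
        bit = (symbol_index >> level) & 1      # which branch to take
        p_one = bit_predictor(tuple(path))     # e.g. a binary CTW model per node
        prob *= p_one if bit == 1 else 1.0 - p_one
        path.append(bit)
    return prob

# With a uniform bit predictor and a 7-bit (128-symbol) alphabet,
# every symbol gets probability 1/128.
print(symbol_probability(ord('a'), 7, lambda path: 0.5))
```

Each internal node of the tree holds its own binary model, so the full predictor only ever makes 0/1 predictions.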

Page 18


Binary Decomposition

• Results found by Aberg and Shtarkov – all tests with the full ASCII alphabet (results in bits per character):

  Input file                     Paper 1   Paper 2   Book 1   Book 2   News
  PPM-D (byte predictions)       2.351     2.322     2.291    1.969    2.379
  CTW-D (byte predictions)       2.904     2.719     2.490    2.265    2.877
  CTW-KT (bit predictions)       2.322     2.249     2.184    1.910    2.379
  CTW/PPM-D (byte predictions)   2.287     2.235     2.192    1.896    2.322

Page 19


Count halving

• If one count reaches a maximum, divide both counts by 2 (sketched below)
  – Forget older input data, increase adaptivity

• In Dasher: predict user input with a model based on training text
  – Adaptivity even more important
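
A minimal sketch of the halving rule for one binary node; `MAX_COUNT` is a hypothetical cap, not a value given in the deck:

```python
MAX_COUNT = 256  # hypothetical cap: smaller values forget faster

def update_counts(counts, bit):
    """Record one observed bit; halve both counts when the cap is hit.

    Halving discounts old observations geometrically, so the
    estimator tracks a locally stationary source rather than
    averaging over its entire history."""
    counts[bit] += 1
    if counts[bit] >= MAX_COUNT:
        counts[0] = (counts[0] + 1) // 2  # round up: seen symbols keep some mass
        counts[1] = (counts[1] + 1) // 2
    return counts
```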

Page 20


Count halving: Results

Page 21


Count halving: Results

Page 22


Results: Enron

Page 23


Combining PPM and CTW

• Select the locally best model, or weight models together

• More alpha parameters for PPM, learned from data

• PPM-like sharing, with a prior over context trees, as with CTW
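
For the weighting option, one natural scheme (a sketch; the slide does not specify one) is a Bayesian mixture of the two models, with weights updated online from each model's predictive performance:

$$P(x_i \mid x_{<i}) = \sum_m w_{m,i}\, P_m(x_i \mid x_{<i}), \qquad w_{m,i+1} \propto w_{m,i}\, P_m(x_i \mid x_{<i})$$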

Page 24


Conclusions

• PPM and CTW have different strengths, so it makes sense to try combining them

• Decomposition and count scaling may give clues for improving PPM

• Look at performance on out-of-domain text in more detail

Page 25


Experimental Parameters

• Context depth: 5

• Smoothing: 5%

• PPM – alpha: 0.49, beta: 0.77

• CTW – w: 0.05, alpha: 1/128

Page 26


Comparing language models

• PPM
  – Quickly learns repeating strings

• CTW
  – Works on a set of all possible tree models
  – Not sensitive to parameter D, the maximum model depth
  – Easy to increase adaptivity
  – The weight factor (escape probability) is strictly defined