SRILM Language Model
Student: Chia-Ho Ling
Instructor: Dr. Veton Z. Këpuska
Objective
Use the SRI Language Modeling toolkit to build four different language models:
Good-Turing Smoothing
Absolute Discounting
Witten-Bell Smoothing
Modified Kneser-Ney Smoothing
Decide which of these four 3-gram language models is the best.
Linux or Linux-like Environment
Choose the Linux-like environment “Cygwin”.
Download Cygwin for free from the following link: http://www.cygwin.com/
Cygwin Installation
Download the Cygwin installation file.
Execute setup.exe.
Choose “Install from Internet”.
Select the root install directory “C:\cygwin”.
Choose a download site from the mirrors.
Cygwin Installation
The following packages should be selected to install SRILM:
gcc version 3.4.3 or higher
GNU make
John Ousterhout’s TCL toolkit, version 7.3 or higher
Tcsh
gzip: to read/write compressed files
GNU awk (gawk): to interpret many of the utility scripts
SRILM Installation
Download the SRILM toolkit, srilm.tgz, from the following link: http://www.speech.sri.com/projects/srilm
Run cygwin.bat.
Unzip srilm.tgz with the following commands:
$ cd /cygdrive/c/cygwin/srilm
$ tar zxvf srilm.tgz
SRILM Installation
After unpacking SRILM, edit the Makefile in the SRILM folder (/cygdrive/c/cygwin/srilm). Add the following lines to set the install directory and machine type:
SRILM=/cygdrive/c/cygwin/srilm
MACHINE_TYPE=cygwin
Run Cygwin and type the following command to build SRILM:
$ make World
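Once the build finishes, SRILM places its compiled tools under a machine-specific bin directory. A minimal sketch of making them callable from anywhere, assuming the install root above and MACHINE_TYPE=cygwin:
$ export SRILM=/cygdrive/c/cygwin/srilm
$ export PATH=$PATH:$SRILM/bin:$SRILM/bin/cygwin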
Main Functions of SRILM
Generate N-gram counts from a corpus.
Train a language model from the N-gram count file.
Use the trained language model to calculate the perplexity of test data.
Flow Chart
[Flow chart: the training corpus and lexicon are fed to ngram-count to produce a count file; the count file and lexicon are fed to ngram-count to produce a language model; the language model and test data are fed to ngram to compute the perplexity (ppl).]
Training Corpus
Download the manually audited CallHome English conversation transcripts, “CallHome_English_trans970711”, as our training corpus.
Lexicon
Use “wordtokenization.pl” to generate our lexicon.
Because the training corpus consists of conversation transcripts, we have to remove times, speaker information, all kinds of brackets, and interjections from our lexicon. Therefore, we need to add some code to wordtokenization.pl.
Lexicon
Adding the following Perl code removes the time and speaker information (a quick check follows the listing):

# remove the time and speaker information (everything up to the first colon);
# the limit of 2 keeps colons inside the utterance from truncating it
($time_and_speaker, $sentence) = split (/:/, $_, 2);
$_ = $sentence;
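As a quick check, this one-liner runs the same split on a hypothetical CallHome-style transcript line (timestamps, then speaker label, then the utterance):
$ echo "885.32 890.01 A: yeah I know" | perl -ne '($t, $s) = split(/:/, $_, 2); print $s;'
 yeah I know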
Lexicon
Adding the following Perl code removes any kind of brackets (a worked example follows the listing):

# expand clitics: strip leftover angle brackets
$word =~ s/\>$//;
$word =~ s/^\<//;
$word =~ s/\>.$//;
# skip tokens that are purely numeric
if (($word =~ /[0-9]+/) && ($word !~ /[a-zA-Z]+/)) {
    next;
}
# a token starting with any of these characters opens a non-word region
if (($word =~ /^\{/) || ($word =~ /^\[/) || ($word =~ /^\*/) ||
    ($word =~ /^\#/) || ($word =~ /^\&/) || ($word =~ /^\-/) ||
    ($word =~ /^\%/) || ($word =~ /^\!/) || ($word =~ /^\</) ||
    ($word =~ /^\>/) || ($word =~ /^\+/) || ($word =~ /^\./) ||
    ($word =~ /^\,/) || ($word =~ /^\//) || ($word =~ /^\?/) ||
    ($word =~ /^\'/) || ($word =~ /^\)/) || ($word =~ /^\(/)) {
    $not_word_flag = 1;
    #print "Beginning: ", $word, "\n";
}
# print only tokens outside a non-word region
if (not $not_word_flag) {
    print $word, "\n";
}
# a token ending with any of these characters closes the non-word region
if ($not_word_flag) {
    if (($word =~ /}$/) || ($word =~ /\]$/) || ($word =~ /\*$/) ||
        ($word =~ /\ $/) || ($word =~ /\+$/)) {
        $not_word_flag = 0;
    }
}
}  # closes the loop over the words of the current line
print "\n";
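For example, given the token sequence “{laugh} okay”, the first token sets $not_word_flag (and immediately clears it again, since the token also ends with “}”), so only “okay” is printed into the lexicon stream.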
Lexicon
Generate our lexicon with the following commands:
$ cat train/en_*.txt > corpus.txt
$ perl wordtokenization2.pl < corpus.txt | sort | uniq > lexicon.txt
Lexicon
[Screenshot: excerpt of the generated lexicon.txt]
Count File
Generate the 3-gram count file with the following command:
$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
Count File
[Screenshot: excerpt of the generated count.txt]
Count File
ngram-count: counts N-grams and estimates language models.
-vocab file: Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-text textfile: Generate N-gram counts from textfile, which should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-order n: Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-write file: Write total counts to file.
-unk: Build an “open vocabulary” LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
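For reference, here is a hypothetical excerpt of count.txt (the words and counts are invented; the format, one N-gram followed by its count per line, is the one -read expects, as described on a later slide):
<s> i 245
i 312
i was 57
was 198
i was there 12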
Good-Turing Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3
Good-Turing Language Model
-read countsfile: Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
-lm lmfile: Estimate a backoff N-gram model from the total counts, and write it to lmfile.
-gtnmin count: where n is 1, 2, ..., 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: this option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
-gtnmax count: where n is 1, 2, ..., 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.
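For background (standard theory, not from the slides): Good-Turing discounting replaces each raw count r with an adjusted count r*, where n_r is the number of N-grams occurring exactly r times:

r^* = (r + 1) \frac{n_{r+1}}{n_r}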
Absolute Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
Absolute Discounting Language Model
-cdiscountn discount: where n is 1, 2, ..., 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
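For background (standard theory, not from the slides): absolute discounting subtracts the constant D (here 0.5) from every observed count and gives the freed probability mass to the back-off distribution over the shortened history h':

P(w \mid h) = \begin{cases} \dfrac{c(h,w) - D}{c(h)} & \text{if } c(h,w) > 0 \\ \lambda(h)\, P(w \mid h') & \text{otherwise} \end{cases}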
Witten-Bell Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm wblm.txt -wbdiscount1 -wbdiscount2 -wbdiscount3
Witten-Bell Discounting Language Model
-wbdiscountn: where n is 1, 2, ..., 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the “unseen” event.)
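For background (standard theory, not from the slides): in its interpolated form, Witten-Bell reserves probability mass in proportion to T(h), the number of distinct word types observed after the history h:

P(w \mid h) = \frac{c(h,w)}{c(h) + T(h)} + \frac{T(h)}{c(h) + T(h)}\, P(w \mid h')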
Modified Kneser-Ney Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm knlm.txt -kndiscount1 -kndiscount2 -kndiscount3
Modified Kneser-Ney Discounting Language Model
-kndiscountn: where n is 1, 2, ..., 9. Use Chen and Goodman’s modified Kneser-Ney discounting for N-grams of order n.
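For background (standard theory, not from the slides): Kneser-Ney replaces raw frequencies in the lower-order distribution with continuation counts (how many distinct histories each word follows); the modified variant uses separate discounts D_1, D_2, and D_{3+} for N-grams seen once, twice, or three or more times:

P_{\text{cont}}(w) = \frac{|\{h : c(h,w) > 0\}|}{\sum_{w'} |\{h : c(h,w') > 0\}|}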
Four Language Models
[Screenshot: excerpts of the four generated language model files]
Test Data Perplexity
Randomly choose three news articles from the Internet as test data.
Commands for the four different 3-gram language models:
$ ./ngram -ppl project/test1.txt -order 3 -lm project/gtlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/adlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/wblm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/knlm.txt
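Each command prints a summary in this shape (a hypothetical rendering: the sentence, word, and OOV counts are invented, while the ppl value shown is the Good-Turing result for test1.txt reported in the conclusion):
file project/test1.txt: 25 sentences, 652 words, 31 OOVs
0 zeroprobs, logprob= -1796.06 ppl= 602.936 ppl1= 780.1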
Result of test1.txt
[Screenshot: ngram -ppl output for test1.txt]
Result of test2.txt
[Screenshot: ngram -ppl output for test2.txt]
Result of test3.txt
[Screenshot: ngram -ppl output for test3.txt]
Test Data Perplexity
-ppl textfile: Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line.
-lm file: Read the (main) N-gram model from file. This option is always required, unless -null was chosen.
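For background (standard definition, not from the slides): the perplexity of a model P on a test text w_1 ... w_N is the inverse probability normalized by length, so a lower value means the model predicts the text better:

\text{PPL} = P(w_1, \dots, w_N)^{-1/N}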
Conclusion
Perplexity of the four 3-gram language models on each test set:

         Good-Turing   Absolute Discounting   Witten-Bell   Kneser-Ney
test1    602.936       635.381                573.032       504.988
test2    470.316       478.307                425.725       353.042
test3    268.165       271.759                251.203       252.803
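Modified Kneser-Ney gives the lowest perplexity on test1 and test2 and is essentially tied with Witten-Bell on test3, so it is the best of the four 3-gram language models on this data.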
References
SRI International, “The SRI Language Modeling Toolkit”, http://www.speech.sri.com/projects/srilm/, Dec. 2007.
Cygwin Information and Installation, “Installing and Updating Cygwin”, http://www.cygwin.com/, Dec. 2007.
Daniel Jurafsky and James H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Fourth Indian Reprint, 2005.
Manually audited CallHome English conversation transcripts, “CallHome_English_trans970711”, http://my.fit.edu/~vkepuska/ece5527/CallHome/, Dec. 2007.
Dr. Veton Z. Këpuska, “wordtokenization.pl”, http://my.fit.edu/~vkepuska/ece5527/Example%20Code/wordtokenization.pl.