SRILM Language Model
Student: Chia-Ho Ling
Instructor: Dr. Veton Z. Këpuska
Objective
Use the SRI Language Modeling toolkit to build four different language models:
Good-Turing Smoothing
Absolute Discounting
Witten-Bell Smoothing
Modified Kneser-Ney Smoothing
Decide which of these four 3-gram language models is the best.
Linux or Linux-like Environment
Choose the Linux-like environment “Cygwin”.
Download Cygwin for free from the following link: http://www.cygwin.com/
Cygwin Installation
Download the Cygwin installation file.
Execute setup.exe.
Choose “Install from Internet”.
Select the root install directory “C:\cygwin”.
Choose a download site from the mirrors.
Cygwin Installation
The following packages should be selected to install SRILM:
gcc version 3.4.3 or higher
GNU make
John Ousterhout’s TCL toolkit, version 7.3 or higher
Tcsh
gzip: to read/write compressed files
GNU awk (gawk): to interpret many of the utility scripts
SRILM Installation
Download the SRILM toolkit, srilm.tgz, from the following link: http://www.speech.sri.com/projects/srilm
Run cygwin.bat.
Unzip srilm.tgz with the following commands:
$ cd /cygdrive/c/cygwin/srilm
$ tar zxvf srilm.tgz
SRILM Installation
After unpacking SRILM, edit the Makefile in the SRILM folder (/cygdrive/c/cygwin/srilm). Add the following lines to set the install directory and machine type:
SRILM=/cygdrive/c/cygwin/srilm
MACHINE_TYPE=cygwin
Run Cygwin and type the following command to build SRILM:
$ make World
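Once the build finishes, SRILM places its compiled tools under a machine-specific bin directory. A minimal sketch of making them callable from anywhere, assuming the install root above and MACHINE_TYPE=cygwin:
$ export SRILM=/cygdrive/c/cygwin/srilm
$ export PATH=$PATH:$SRILM/bin:$SRILM/bin/cygwin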
Main Functions of SRILM
Generate N-gram counts from a corpus.
Train a language model from the N-gram count file.
Use the trained language model to calculate the perplexity of test data.
Flow Chart
[Flow chart: the training corpus and lexicon are fed to ngram-count to produce a count file; the count file and lexicon are fed to ngram-count to produce a language model; the language model and test data are fed to ngram to compute the perplexity (ppl).]
Training Corpus
Download the manually audited CallHome English conversation transcripts, “CallHome_English_trans970711”, as our training corpus.
Lexicon
Use “wordtokenization.pl” to generate our lexicon.
Because the training corpus consists of conversation transcripts, we have to remove times, speaker information, all kinds of brackets, and interjections from our lexicon. Therefore, we need to add some code to wordtokenization.pl.
Lexicon
Adding the following Perl code removes the time and speaker information (a quick check follows the listing):

# remove the time and speaker information (everything up to the first colon);
# the limit of 2 keeps colons inside the utterance from truncating it
($time_and_speaker, $sentence) = split (/:/, $_, 2);
$_ = $sentence;
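As a quick check, this one-liner runs the same split on a hypothetical CallHome-style transcript line (timestamps, then speaker label, then the utterance):
$ echo "885.32 890.01 A: yeah I know" | perl -ne '($t, $s) = split(/:/, $_, 2); print $s;'
 yeah I know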
Lexicon
Adding the following Perl code removes any kind of brackets (a worked example follows the listing):

# expand clitics: strip leftover angle brackets
$word =~ s/\>$//;
$word =~ s/^\<//;
$word =~ s/\>.$//;
# skip tokens that are purely numeric
if (($word =~ /[0-9]+/) && ($word !~ /[a-zA-Z]+/)) {
    next;
}
# a token starting with any of these characters opens a non-word region
if (($word =~ /^\{/) || ($word =~ /^\[/) || ($word =~ /^\*/) ||
    ($word =~ /^\#/) || ($word =~ /^\&/) || ($word =~ /^\-/) ||
    ($word =~ /^\%/) || ($word =~ /^\!/) || ($word =~ /^\</) ||
    ($word =~ /^\>/) || ($word =~ /^\+/) || ($word =~ /^\./) ||
    ($word =~ /^\,/) || ($word =~ /^\//) || ($word =~ /^\?/) ||
    ($word =~ /^\'/) || ($word =~ /^\)/) || ($word =~ /^\(/)) {
    $not_word_flag = 1;
    #print "Beginning: ", $word, "\n";
}
# print only tokens outside a non-word region
if (not $not_word_flag) {
    print $word, "\n";
}
# a token ending with any of these characters closes the non-word region
if ($not_word_flag) {
    if (($word =~ /}$/) || ($word =~ /\]$/) || ($word =~ /\*$/) ||
        ($word =~ /\ $/) || ($word =~ /\+$/)) {
        $not_word_flag = 0;
    }
}
}  # closes the loop over the words of the current line
print "\n";
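For example, given the token sequence “{laugh} okay”, the first token sets $not_word_flag (and immediately clears it again, since the token also ends with “}”), so only “okay” is printed into the lexicon stream.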
Lexicon
Generate our lexicon with the following commands:
$ cat train/en_*.txt > corpus.txt
$ perl wordtokenization2.pl < corpus.txt | sort | uniq > lexicon.txt
Lexicon
[Screenshot: excerpt of the generated lexicon.txt]
Count File
Generate the 3-gram count file with the following command:
$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
Count File
[Screenshot: excerpt of the generated count.txt]
Count File
ngram-count: counts N-grams and estimates language models.
-vocab file: Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-text textfile: Generate N-gram counts from textfile, which should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-order n: Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-write file: Write total counts to file.
-unk: Build an “open vocabulary” LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
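For reference, here is a hypothetical excerpt of count.txt (the words and counts are invented; the format, one N-gram followed by its count per line, is the one -read expects, as described on a later slide):
<s> i 245
i 312
i was 57
was 198
i was there 12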
Good-Turing Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3
Good-Turing Language Model
-read countsfile: Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
-lm lmfile: Estimate a backoff N-gram model from the total counts, and write it to lmfile.
-gtnmin count: where n is 1, 2, ..., 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: this option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
-gtnmax count: where n is 1, 2, ..., 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.
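For background (standard theory, not from the slides): Good-Turing discounting replaces each raw count r with an adjusted count r*, where n_r is the number of N-grams occurring exactly r times:

r^* = (r + 1) \frac{n_{r+1}}{n_r}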
Absolute Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
Absolute Discounting Language Model
-cdiscountn discount: where n is 1, 2, ..., 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
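For background (standard theory, not from the slides): absolute discounting subtracts the constant D (here 0.5) from every observed count and gives the freed probability mass to the back-off distribution over the shortened history h':

P(w \mid h) = \begin{cases} \dfrac{c(h,w) - D}{c(h)} & \text{if } c(h,w) > 0 \\ \lambda(h)\, P(w \mid h') & \text{otherwise} \end{cases}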
Witten-Bell Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm wblm.txt -wbdiscount1 -wbdiscount2 -wbdiscount3
Witten-Bell Discounting Language Model
-wbdiscountn: where n is 1, 2, ..., 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the “unseen” event.)
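For background (standard theory, not from the slides): in its interpolated form, Witten-Bell reserves probability mass in proportion to T(h), the number of distinct word types observed after the history h:

P(w \mid h) = \frac{c(h,w)}{c(h) + T(h)} + \frac{T(h)}{c(h) + T(h)}\, P(w \mid h')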
Modified Kneser-Ney Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm knlm.txt -kndiscount1 -kndiscount2 -kndiscount3
Modified Kneser-Ney Discounting Language Model
-kndiscountn: where n is 1, 2, ..., 9. Use Chen and Goodman’s modified Kneser-Ney discounting for N-grams of order n.
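For background (standard theory, not from the slides): Kneser-Ney replaces raw frequencies in the lower-order distribution with continuation counts (how many distinct histories each word follows); the modified variant uses separate discounts D_1, D_2, and D_{3+} for N-grams seen once, twice, or three or more times:

P_{\text{cont}}(w) = \frac{|\{h : c(h,w) > 0\}|}{\sum_{w'} |\{h : c(h,w') > 0\}|}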
Four Language Models
[Screenshot: excerpts of the four generated language model files]
Test Data Perplexity
Randomly choose three news articles from the Internet as test data.
Commands for the four different 3-gram language models:
$ ./ngram -ppl project/test1.txt -order 3 -lm project/gtlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/adlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/wblm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/knlm.txt
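Each command prints a summary in this shape (a hypothetical rendering: the sentence, word, and OOV counts are invented, while the ppl value shown is the Good-Turing result for test1.txt reported in the conclusion):
file project/test1.txt: 25 sentences, 652 words, 31 OOVs
0 zeroprobs, logprob= -1796.06 ppl= 602.936 ppl1= 780.1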
Result of test1.txt
[Screenshot: ngram -ppl output for test1.txt]
Result of test2.txt
[Screenshot: ngram -ppl output for test2.txt]
Result of test3.txt
[Screenshot: ngram -ppl output for test3.txt]
Test Data Perplexity
-ppl textfile: Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line.
-lm file: Read the (main) N-gram model from file. This option is always required, unless -null was chosen.
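For background (standard definition, not from the slides): the perplexity of a model P on a test text w_1 ... w_N is the inverse probability normalized by length, so a lower value means the model predicts the text better:

\text{PPL} = P(w_1, \dots, w_N)^{-1/N}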
Conclusion
Perplexity of the four 3-gram language models on each test set:

         Good-Turing   Absolute Discounting   Witten-Bell   Kneser-Ney
test1    602.936       635.381                573.032       504.988
test2    470.316       478.307                425.725       353.042
test3    268.165       271.759                251.203       252.803
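Modified Kneser-Ney gives the lowest perplexity on test1 and test2 and is essentially tied with Witten-Bell on test3, so it is the best of the four 3-gram language models on this data.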
References
SRI International, “The SRI Language Modeling Toolkit”, http://www.speech.sri.com/projects/srilm/, Dec. 2007.
Cygwin Information and Installation, “Installing and Updating Cygwin”, http://www.cygwin.com/, Dec. 2007.
Daniel Jurafsky and James H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Fourth Indian Reprint, 2005.
Manually audited CallHome English conversation transcripts, “CallHome_English_trans970711”, http://my.fit.edu/~vkepuska/ece5527/CallHome/, Dec. 2007.
Dr. Veton Z. Këpuska, “wordtokenization.pl”, http://my.fit.edu/~vkepuska/ece5527/Example%20Code/wordtokenization.pl.