TRANSCRIPT
SRILM Based Language Model
Name: Venkata Subramanyan Sundaresan
Instructor: Dr. Veton Kepuska
N-GRAM Concept
The idea of word prediction is formalized with a probabilistic model called the N-gram.
Statistical models of word sequences are also called language models, or LMs.
The idea of the N-gram model is to approximate the history by just the last few words.
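The bigram case of this approximation can be sketched in a few lines of Python. This is a minimal illustration, not part of SRILM; the toy corpus is invented, and the probabilities are plain maximum-likelihood estimates with no smoothing.

```python
from collections import Counter

# Toy corpus (invented for illustration).
corpus = "i am sam sam i am i do not like green eggs".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Bigram approximation: P(w | full history) ~ P(w | previous word),
# estimated as count(prev, w) / count(prev).
def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "am"))  # 2 of the 3 occurrences of "i" are followed by "am"
```

Here the whole history before "am" is collapsed to the single previous word "i", which is exactly the N-gram idea for N = 2.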
CORPUS
Counting things in natural language is based on a corpus.
What is a corpus? It is an online collection of text or speech.
There are two popular corpora:
Brown (1 million word collection)
Switchboard (collection of 2,430 telephone conversations)
Perplexity
Perplexity is interpreted as the weighted average branching factor of a language.
The branching factor of a language is the number of possible next words that can follow any word.
Perplexity is the most common evaluation metric for N-gram language models.
An improvement in perplexity does not guarantee an improvement in speech recognition performance.
It is commonly used as a quick check of an algorithm.
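The branching-factor interpretation can be made concrete with a short sketch. This is not SRILM's implementation; it just evaluates the standard formula, perplexity = 2^(-(1/N) * sum of log2 of the per-word probabilities), on an invented list of probabilities.

```python
import math

# Perplexity of a test sequence is the inverse geometric mean of
# the per-word probabilities the model assigned:
#   PP = 2 ** ( -(1/N) * sum(log2 p_i) )
def perplexity(word_probs):
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)

# If every word had probability 1/10, perplexity is ~10: the model
# is as "surprised" as if it were choosing among 10 equally likely words.
print(perplexity([0.1] * 5))
```

A uniform 1/10 model giving perplexity 10 is what "weighted average branching factor" means: on average the model is choosing among 10 next words.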
SMOOTHING
It is the process of flattening the probability distribution implied by a language model, so that all reasonable word sequences can occur with some probability.
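The flattening idea can be illustrated with add-one (Laplace) smoothing, the simplest scheme. Note this is only to show what smoothing does; the models built later in this project use Good-Turing and absolute discounting, not add-one. The counts and vocabulary below are invented.

```python
from collections import Counter

# Add-one (Laplace) smoothing: add 1 to every count so unseen
# words get a small nonzero probability instead of zero.
counts = Counter({"the": 3, "cat": 1})
vocab = ["the", "cat", "dog"]          # "dog" is unseen in the counts
total = sum(counts.values())

def smoothed_prob(word):
    return (counts[word] + 1) / (total + len(vocab))

print(smoothed_prob("dog"))  # 1/7: unseen, but no longer impossible
```

Mass is taken from the seen words ("the" drops from 3/4 to 4/7) and redistributed to the unseen ones, which is exactly the "flattening" described above.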
Aspiration
To use the SRILM (SRI Language Modeling) toolkit to build different language models.
The following language models are built:
Good-Turing smoothing
Absolute discounting
Linux Environment in Windows
To implement a Linux environment on the Windows operating system we have to install "Cygwin".
This is open source software and can be downloaded from: www.cygwin.com
Another main reason for installing Cygwin is that SRILM can be built on the Cygwin platform.
Installation Procedure: Cygwin
Go to the webpage above.
Download the setup file.
Select "Install from Internet".
Give the destination folder where Cygwin should be installed.
There will be a list of mirror sites to download from.
Select one site and install all the packages.
SRILM
Download the SRILM toolkit, srilm.tgz, from the following source: http://www.speech.sri.com/projects/srilm/
Run the Cygwin terminal window.
SRILM is downloaded as a compressed archive.
Unpack the srilm archive inside the Cygwin environment with the following command: tar zxvf srilm.tgz
SRILM Installation
Once the download is completed, we have to edit the Makefile in the Cygwin folder.
Once the editing is done, we run the following in Cygwin to build SRILM:
$ make World
Function of SRILM
Generate N-gram counts from the corpus.
Train a language model based on the N-gram count file.
Use the trained language model to calculate test data perplexity.
Lexicon
A lexicon is a container of words belonging to the same language. (Reference: Wikipedia)
Lexicon Generation
Use the "wordtokenization.pl" file to generate the lexicon for our requirement, with the following commands:
cat train/en_*.txt > corpus.txt
perl wordtokenization.pl < corpus.txt | sort | uniq > lexicon.txt
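The pipeline above tokenizes the corpus, sorts the tokens, and keeps unique ones. A rough Python equivalent is sketched below; the exact rules inside wordtokenization.pl are not reproduced here, so a simple lowercase word split stands in for them.

```python
import re

# Rough equivalent of: perl wordtokenization.pl < corpus | sort | uniq
# (assumed tokenization: lowercase alphabetic words, apostrophes kept).
def build_lexicon(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(set(tokens))         # sort | uniq

print(build_lexicon("The cat saw the dog."))  # ['cat', 'dog', 'saw', 'the']
```

The duplicate "the" collapses to one lexicon entry, which is the point of the `sort | uniq` stage.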
Count File
Generate a 3-gram count file using the following command:
$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
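What ngram-count writes can be sketched as follows: a count for every 3-gram in the text. This toy version (with an invented token sequence) omits the lower-order counts, sentence-boundary markers, and `-unk` mapping that SRILM also handles.

```python
from collections import Counter

# Sketch of the core of ngram-count: tally every 3-gram in the corpus.
tokens = "a b a b c a b".split()
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

# count.txt format: the n-gram, a tab, then its count.
for gram, count in sorted(trigrams.items()):
    print(" ".join(gram), count)
```

Each line corresponds to one row of the count file that the language-model training step reads back in with `-read`.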
Good-Turing Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3
This command has to be typed in the terminal window.
-lm lmfile: estimate a backoff N-gram model from the total counts, and write it to lmfile.
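The discounting behind this model can be sketched from the Good-Turing formula: a raw count c is replaced by c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of n-gram types seen exactly c times. SRILM applies this only for counts between the -gtNmin and -gtNmax bounds; the sketch below just evaluates the raw formula on invented counts.

```python
from collections import Counter

# Invented n-gram counts: three types seen once, two seen twice, one seen 3 times.
ngram_counts = {"a b": 1, "b c": 1, "c d": 1, "a c": 2, "b d": 2, "c a": 3}
freq_of_freq = Counter(ngram_counts.values())   # N_c: types seen exactly c times

# Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c
def good_turing(c):
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(good_turing(1))  # singletons are discounted from 1 to 2 * (2/3)
```

Counts seen once shrink below 1, and the mass freed this way is what the backoff model hands to unseen n-grams.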
Absolute Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
Here the order N can be anything between 1 and 9.
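Absolute discounting subtracts a fixed amount D (0.5 here, matching -cdiscount 0.5) from every observed count and redistributes the freed mass to a lower-order distribution. The sketch below shows a simplified interpolated form on invented counts; SRILM's exact backoff bookkeeping differs in detail.

```python
from collections import Counter

D = 0.5                                          # matches -cdiscount1 0.5
bigram_counts = Counter({("a", "b"): 3, ("a", "c"): 1})
context_total = sum(bigram_counts.values())      # count("a") = 4
unigram_prob = {"b": 0.5, "c": 0.5}              # invented lower-order model

# Interpolated absolute discounting for context "a":
#   P(w|a) = max(c(a,w) - D, 0)/c(a)  +  lambda(a) * P_lower(w)
# where lambda(a) = D * (distinct continuations of "a") / c(a).
def p_abs(word, context="a"):
    n_types = len(bigram_counts)                 # distinct continuations of "a"
    lam = D * n_types / context_total            # mass freed by discounting
    return (max(bigram_counts[(context, word)] - D, 0) / context_total
            + lam * unigram_prob[word])

print(p_abs("b") + p_abs("c"))  # the two probabilities sum to 1
```

Subtracting the same D = 0.5 from both counts frees 1.0/4 = 0.25 of the probability mass, which lambda routes through the unigram distribution.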