
Page 1: SRILM Based Language Model

Name: Venkata Subramanyan Sundaresan
Instructor: Dr. Veton Kepuska

Page 2: N-GRAM Concept

The idea of word prediction is formalized with a probabilistic model called the N-gram.
Statistical models of word sequences are also called language models, or LMs.
The idea of the N-gram model is to approximate the history by just the last few words.
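In standard notation (not spelled out on the slide), the N-gram approximation replaces the full history with the last N-1 words; for a bigram (N = 2) it reduces to the single preceding word:

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$  (bigram case)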

Page 3: Corpus

Counting things in natural language is based on a corpus.
What is a corpus? It is an online collection of text or speech.
There are two popular corpora:
Brown (a 1-million-word collection)
Switchboard (a collection of 2,430 telephone conversations)

Page 4: Perplexity

Perplexity is interpreted as the weighted average branching factor of a language.
The branching factor of a language is the number of possible next words that can follow any word.
Perplexity is the most common evaluation metric for N-gram language models.
An improvement in perplexity does not guarantee an improvement in speech recognition performance.
It is commonly used as a quick check of an algorithm.
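The standard definition (not written out on the slide): for a test set $W = w_1 w_2 \ldots w_N$, perplexity is the inverse probability of the test set, normalized by the number of words:

$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$

so a lower perplexity means the model assigns higher probability to the test data.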

Page 5: Smoothing

Smoothing is the process of flattening the probability distribution implied by a language model, so that all reasonable word sequences can occur with some probability.
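The simplest illustration of the idea is add-one (Laplace) smoothing, used here only as an example; the models built later in this project use Good-Turing and absolute discounting instead:

$P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}$

where $c_i$ is the count of word $w_i$, $N$ is the total number of tokens, and $V$ is the vocabulary size. Every word, even an unseen one with $c_i = 0$, now receives nonzero probability.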

Page 6: Name:Venkata subramanyan sundaresan Instructor:Dr.Veton Kepuska

AspirationTo use SRI-LM (LM-Language modeling)

toolkit to build different language models.The following are the language models :

Good –turning SmoothingAbsolute Discounting

Page 7: Linux Environment in Windows

To get a Linux environment on the Windows operating system, we have to install Cygwin.
Cygwin is open-source software and can be downloaded from www.cygwin.com.
Another main reason for installing Cygwin is that SRILM can be built and run on the Cygwin platform.

Page 8: Cygwin Installation Procedure

Go to the webpage above.
Download the setup file.
Select "Install from Internet".
Give the destination folder where Cygwin should be installed.
There will be a lot of download sites to choose from; select one site and install all the packages.

Page 9: SRILM

Download the SRILM toolkit, srilm.tgz, from the following source: http://www.speech.sri.com/projects/srilm/
Run the Cygwin terminal window.
SRILM is downloaded as a compressed tar archive; unpack it inside the Cygwin environment.
Unpacking can be done with the following command:

$ tar zxvf srilm.tgz
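One practical caveat (an observation about the toolkit, not from the slides): srilm.tgz unpacks into the current directory rather than creating its own top-level folder, so it is safer to make a dedicated directory first:

# Unpack SRILM into its own directory to avoid scattering files
$ mkdir srilm
$ cd srilm
$ tar zxvf ../srilm.tgz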

Page 10: SRILM Installation

Once the download and unpacking are complete, we have to edit the Makefile in the SRILM folder.
Once the editing is done, we run the following in Cygwin to build and install SRILM:

$ make World
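The Makefile edit in question is typically pointing the SRILM variable at the toolkit's top-level directory; the path below is an assumed example for illustration, so substitute your own:

# Top-level Makefile: point SRILM at the unpacked toolkit directory
# (the path shown is an assumption; use your actual install path)
SRILM = /home/user/srilm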

Page 11: Functions of SRILM

Generate N-gram counts from the corpus.
Train a language model based on the N-gram count file.
Use the trained language model to calculate test-data perplexity.
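A minimal sketch of that three-step pipeline with SRILM's two main tools, ngram-count and ngram (file names are placeholders; test.txt is assumed to be held-out text; the exact commands used in this project appear on the later slides):

# 1. Count N-grams in the training corpus
$ ngram-count -text corpus.txt -order 3 -write count.txt

# 2. Train a language model from the counts
$ ngram-count -read count.txt -order 3 -lm lm.txt

# 3. Compute test-set perplexity with the trained model
$ ngram -lm lm.txt -order 3 -ppl test.txt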

Page 12: Lexicon

A lexicon is a container of words belonging to the same language. (Reference: Wikipedia)

Page 13: Lexicon Generation

Use the wordtokenization.pl script to generate the lexicon for our requirement, with the following commands:

$ cat train/en_*.txt > corpus.txt
$ perl wordtokenization.pl < corpus.txt | sort | uniq > lexicon.txt

Page 14: Count File

Generate the 3-gram count file using the following command:

$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
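For reference, the resulting count file is plain text with one N-gram per line followed by a tab and its count; the words and numbers below are invented purely for illustration:

the	1500
the cat	25
the cat sat	7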

Page 15: Good-Turing Language Model

$ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3

This command is typed at the Cygwin terminal prompt.
-lm lmfile: estimate a backoff N-gram model from the total counts and write it to lmfile.
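The Good-Turing idea behind the -gtNmin/-gtNmax options (the standard re-estimation formula, not shown on the slide): a raw count $c$ is replaced by a discounted count $c^*$ computed from $N_c$, the number of N-grams occurring exactly $c$ times:

$c^* = (c + 1) \frac{N_{c+1}}{N_c}$

The -gtNmax options above cap the counts to which this discounting is applied (here, counts up to 3); higher counts are treated as reliable.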

Page 16: Absolute Discounting Language Model

$ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5

Here the order N can be anything between 1 and 9.
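Absolute discounting (standard form; the constant $D = 0.5$ matches the -cdiscount values above) subtracts a fixed amount from every nonzero count and redistributes the freed mass through the backoff distribution:

$P(w \mid h) = \frac{\max(c(h, w) - D, 0)}{c(h)} + \lambda(h) \, P_{backoff}(w \mid h')$

where $h$ is the N-gram history, $h'$ is the shortened history, and $\lambda(h)$ is chosen so the probabilities sum to one.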