SRILM - The SRI Language Modeling Toolkit
2008. 10. 22.
Presented by Yeon JongHeum
Intelligent Database Systems Laboratory, SNU
Copyright 2008 by CEBT
Contents
Environment
Download
Compile
Making Corpus
Execution
Result
Environment
Hardware
IBM ThinkPad T41
Intel(R) Pentium(R) M processor 1600MHz
1GiB DDR RAM
OS
Ubuntu Linux 8.04
Download
http://www.speech.sri.com/projects/srilm/download.html
http://clab.snu.ac.kr/class/cl-nlp0802/lecture/euc_txt.zip
Compile
The commands below assume Ubuntu, so they may differ slightly between Linux distributions.
Install the required packages, such as csh, tcl, gcc, g++, and gawk:
sudo aptitude install csh tcl tcl-dev build-essential gawk
Compile (cont’d)
Extract the downloaded SRILM archive:
tar xvfz srilm.tgz
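Note: the SRILM archive typically unpacks into the current directory rather than into a subdirectory of its own, so it is safer to create one first (the directory name is a placeholder):

mkdir srilm && cd srilm
tar xvfz ../srilm.tgz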
Compile (cont’d)
Add write permission to the extracted files (see the sketch below).
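The deck shows this step only as a screenshot; a minimal sketch, assuming the sources were unpacked into ~/srilm:

chmod -R u+w ~/srilm   # grant the owner write permission on the whole source tree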
Compile (cont’d)
Edit the SRILM variable in the top-level Makefile, e.g. as sketched below.
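The slide shows the edit as a screenshot; a minimal sketch, with /home/user/srilm as a placeholder for wherever the sources were unpacked:

# Makefile: point SRILM at the top of the source tree
SRILM = /home/user/srilm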
Compile (cont’d)
Edit CC, CXX, TCL_INCLUDE, etc. in common/Makefile.machine.ARCH.
ARCH is the platform SRILM runs on; find it by running sbin/machine-type, as shown below.
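On the 32-bit Pentium M machine above, the reported type would typically be i686, so the file to edit would be common/Makefile.machine.i686:

$ sbin/machine-type
i686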
Compile (cont’d)
Compile with the make World command:
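For example, from the top of the source tree (placeholder path from the Makefile step):

cd /home/user/srilm
make World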
Corpus
Convert the encoding of the morphologically analyzed files from euc-kr to utf-8 (see the sketch after this list).
Collect the morphemes from the converted files into one large file.
Split that file into a training set and a test set.
Each line holds one sentence, with the morphemes separated by spaces.
See http://ids.snu.ac.kr/wiki/SRILM for the scripts.
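The scripts themselves live on the wiki; a minimal sketch of just the encoding step, using iconv (directory and file names are placeholders):

mkdir -p utf8
for f in mor/*.txt; do
    iconv -f EUC-KR -t UTF-8 "$f" > utf8/"$(basename "$f")"
done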
Corpus - Example
Execution
The deck shows the workflow as a diagram: the training set is fed to ngram-count, which builds a language model; ngram then evaluates the test set against that model and reports its perplexity.

Training Set -> ngram-count -> Language Model -> ngram (+ Test Set) -> Perplexity
ngram-count
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_default.txt
Default
Trigram, Good-Turing discounting, Katz backoff
-text : corpus to read
-lm : output file of language model
Good-Turing Discounting Parameters
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_gt_3_7.txt \
            -order 3 \
            -gt1min 3 -gt1max 7 \
            -gt2min 3 -gt2max 7 \
            -gt3min 3 -gt3max 7
Parameter
-gtNmin count : minimum count (lower end of the discounting range)
-gtNmax count : maximum count (upper end of the discounting range)
Format of Language Model
e.g., lm_default.txt
\data\
ngram 1=200989
ngram 2=2331224
ngram 3=1547582

\1-grams:
-6.522542 무조 -0.3102094
-4.676433 무조건 -0.2724784
-7.300586 무조소 -0.187773

\2-grams:
-0.3667601 군종 교구 -0.08042386
-1.530162 군종 교구장
-1.530162 군종 사목

The first column is the log probability (base 10) of the n-gram; the last column, when present, is the log (base 10) of its backoff weight.
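For reference (standard ARPA/Katz backoff semantics, not specific to this deck): when a higher-order n-gram is not listed in the model, its probability is computed from the backoff weight of its context, e.g. for a trigram:

P(w3 | w1 w2) = bow(w1 w2) * P(w3 | w2)   when w1 w2 w3 is not in the model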
Ney’s absolute discounting
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_absolute0.5_3gram.txt \
            -order 3 \
            -cdiscount1 0.5 \
            -cdiscount2 0.5 \
            -cdiscount3 0.5
Parameter
-order n : generate n-grams up to order n; if omitted, trigrams are generated.
-cdiscountN value : value is the constant to subtract from N-gram counts.
Witten-Bell discounting
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_witten_3gram.txt \
            -order 3 \
            -wbdiscount1 \
            -wbdiscount2 \
            -wbdiscount3
Ristad's natural discounting
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_nd_3gram.txt \
            -order 3 \
            -ndiscount1 \
            -ndiscount2 \
            -ndiscount3
Chen and Goodman's modified Kneser-Ney discounting
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_knd_5gram.txt \
            -order 3 \
            -kndiscount1 \
            -kndiscount2 \
            -kndiscount3
Original Kneser-Ney discounting
Command
ngram-count -text train_morCorpus.txt \
            -lm lm_uknd_5gram.txt \
            -order 3 \
            -ukndiscount1 \
            -ukndiscount2 \
            -ukndiscount3
Discounting with Interpolation
Original Kneser-Ney discounting + interpolation:
ngram-count -text train_morCorpus.txt \
            -lm lm_uknd_inter_5gram.txt \
            -order 3 \
            -ukndiscount1 -ukndiscount2 -ukndiscount3 \
            -interpolate1 -interpolate2 -interpolate3
Parameter
-interpolateN
– Only Witten-Bell, absolute discounting, and (original or modified) Kneser-Ney smoothing currently support interpolation
Compute Perplexity
Command
ngram -lm lm_default.txt \
      -ppl testCorpus.txt
Parameter
-lm : Language Model
-ppl : Compute sentence scores (log probabilities) and perplexities from the sentences in textfile
Result
file testCorpus.txt: 171154 sentences, 4829620 words, 26626 OOVs
0 zeroprobs, logprob= -9.34268e+06 ppl= 75.5524 ppl1= 88.1413
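As a sanity check on these numbers: SRILM normalizes the total log probability over words minus OOVs plus sentence-end tokens for ppl, and over words minus OOVs only for ppl1 (zero-probability words are also excluded, but there are none here):

ppl  = 10^(9342680 / (4829620 - 26626 + 171154)) = 10^(9342680 / 4974148) ≈ 75.55
ppl1 = 10^(9342680 / (4829620 - 26626))          = 10^(9342680 / 4802994) ≈ 88.14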
Result

ppl (rows: smoothing method; columns: n-gram order; "–" marks a value missing from the source):

Method                              1         2        3        4        5
Absolute Discounting                757.845   109.749  75.6761  72.1618  71.8617
Witten-Bell                         760.974   111.031  75.714   71.5292  71.0122
Ristad's Natural                    757.846   113.783  78.4666  74.6189  74.2249
modified Kneser-Ney                 1799.97   161.585  88.7392  75.8507  70.247
original Kneser-Ney                 1796.29   157.802  85.7081  70.4343  67.4874
original Kneser-Ney + Interpolate   1796.29   153.811  82.3877  –        66.8003
Good-Turing                         757.846   109.742  75.5524  72.0892  72.076
No Smoothing                        757.845   144.187  99.1683  95.4482  95.5346
Smoothing +1                        760.976   810.01   3061.12  6010.21  8257

ppl1:

Method                              1         2        3        4        5
Absolute Discounting                959.828   129.751  88.2908  84.0481  83.6861
Witten-Bell                         963.933   131.32   88.3366  83.2852  82.6618
Ristad's Natural                    959.83    134.693  91.6647  87.0136  86.538
modified Kneser-Ney                 2351.07   193.686  104.121  88.5017  81.7396
original Kneser-Ney                 2346.1    188.991  100.44   81.9653  78.4164
original Kneser-Ney + Interpolate   2346.1    184.044  96.4127  –        77.5897
Good-Turing                         959.83    129.743  88.1413  83.9605  83.9447
No Smoothing                        959.828   173.133  117.447  112.883  112.989
Smoothing +1                        963.936   1028.33  4074.74  8195.02  11386.7