SRILM - The SRI Language Modeling Toolkit

2008. 10. 22.

Presented by Yeon JongHeum

Intelligent Database Systems Laboratory, SNU

Copyright 2008 by CEBT

Contents

Environment

Download

Compile

Making Corpus

Execution

Result

Environment

Hardware

IBM ThinkPad T41

Intel(R) Pentium(R) M processor 1600MHz

1GiB DDR RAM

OS

Ubuntu Linux 8.04

Download

http://www.speech.sri.com/projects/srilm/download.html

http://clab.snu.ac.kr/class/cl-nlp0802/lecture/euc_txt.zip

Compile

These instructions assume Ubuntu; the commands may differ somewhat on other Linux distributions.

Install the required packages, such as csh, tcl, gcc, g++, and gawk.

sudo aptitude install csh tcl tcl-dev build-essential gawk

Compile (cont’d)

Extract the downloaded SRILM archive.

tar xvfz srilm.tgz

Compile (cont’d)

Add write permission.

Compile (cont’d)

Edit the SRILM variable in the Makefile so that it points to the top-level SRILM directory.

Compile (cont’d)

Edit CC, CXX, TCL_INCLUDE, and related settings in common/Makefile.machine.ARCH.

ARCH is the platform SRILM runs on; run sbin/machine-type to find it.

Compile (cont’d)

Compile with the make World command.

Corpus

Convert the encoding of the morphologically analyzed files from EUC-KR to UTF-8.

Extract the morphemes from the converted files and write them into one large file.

Split the file into a Training Set and a Test Set.

Each line holds one sentence, and the morphemes are separated by spaces.

For the scripts, see http://ids.snu.ac.kr/wiki/SRILM
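The original scripts are on the wiki page above and are not reproduced here. As a rough illustration only, the encoding conversion and the split could look like the Python sketch below. The function names, the 10% test ratio, and the random shuffle are assumptions made for this example, not details taken from the slides.

```python
import random
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one EUC-KR text file as UTF-8."""
    text = src.read_text(encoding="euc-kr")
    dst.write_text(text, encoding="utf-8")


def split_corpus(sentences, test_ratio=0.1, seed=0):
    """Shuffle sentences and split them into (train, test) lists.

    test_ratio and seed are illustrative defaults (assumptions).
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Each element of `sentences` is expected to be one line of the merged file, that is, one sentence with space-separated morphemes.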

Corpus - Example

Execution

SRILM workflow:

Training Set -> ngram-count -> Language Model

Language Model + Test Set -> ngram -> Perplexity

ngram-count

Command

ngram-count -text train_morCorpus.txt -lm lm_default.txt

Default

Trigram, Good-Turing discounting, Katz backoff

-text : corpus to read

-lm : output file of language model
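To make the counting step concrete, here is a minimal Python sketch of collecting n-gram counts from a corpus with one sentence per line. It illustrates the idea only and is not SRILM's implementation. The function name is hypothetical, and wrapping each sentence in `<s>` and `</s>` boundary tokens is an assumption of this sketch.

```python
from collections import Counter


def count_ngrams(sentences, order=3):
    """Count all n-grams of length 1..order, one Counter per length."""
    counts = {n: Counter() for n in range(1, order + 1)}
    for sentence in sentences:
        # Assumed convention: mark sentence start and end explicitly.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts
```

The discounting methods on the following slides differ only in how such counts are turned into probabilities.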

Good-Turing Discounting Parameters

Command

ngram-count -text train_morCorpus.txt -lm lm_gt_3_7.txt -order 3 \
    -gt1min 3 -gt1max 7 -gt2min 3 -gt2max 7 -gt3min 3 -gt3max 7

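Parameter

-gtNmin count : Min Count; N-grams of order N that occur fewer than count times are left out of the model

-gtNmax count : Max Count; N-grams of order N that occur more than count times are not discounted and keep their maximum-likelihood estimates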
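For orientation, the textbook Katz-style Good-Turing discount for an n-gram seen r times (with r no larger than the maximum count k) is d_r = (r*/r - A) / (1 - A), where r* = (r + 1) n_{r+1} / n_r, A = (k + 1) n_{k+1} / n_1, and n_r is the number of distinct n-grams seen exactly r times. The Python sketch below implements that formula under these assumptions. It is not taken from the SRILM source, and the function name is hypothetical.

```python
def katz_gt_discounts(count_of_counts, gtmax):
    """Return {r: discount ratio} for counts 1..gtmax (Katz-style Good-Turing).

    count_of_counts[r] is the number of distinct n-grams seen exactly r times;
    it must contain entries for 1..gtmax + 1.
    """
    n = count_of_counts
    common = (gtmax + 1) * n[gtmax + 1] / n[1]  # the term A above
    discounts = {}
    for r in range(1, gtmax + 1):
        adjusted_ratio = (r + 1) * n[r + 1] / (r * n[r])  # r* / r
        discounts[r] = (adjusted_ratio - common) / (1 - common)
    return discounts
```

Counts above `gtmax` are treated as reliable and are not discounted at all, which is the role the maximum count plays in the command above.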

Format of Language Model

e.g., lm_default.txt

\data\
ngram 1=200989
ngram 2=2331224
ngram 3=1547582

\1-grams:
-6.522542 무조 -0.3102094
-4.676433 무조건 -0.2724784
-7.300586 무조소 -0.187773

\2-grams:
-0.3667601 군종 교구 -0.08042386
-1.530162 군종 교구장
-1.530162 군종 사목

First column: Log probability (Base 10)

Last column: Log of Backoff Weight (Base 10)
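As a small illustration of how one entry of this format can be read, the Python sketch below splits an n-gram line into its words, its probability, and its optional backoff weight, converting both values from base-10 logs. The function name and the returned tuple layout are choices made for this example.

```python
def parse_arpa_line(line, order):
    """Parse one n-gram line of the given order.

    Returns (words, probability, backoff_weight); backoff_weight is None
    when the line carries no backoff column.
    """
    fields = line.split()
    log_prob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # Anything left after the words is the base-10 log of the backoff weight.
    has_backoff = len(fields) > 1 + order
    backoff = 10 ** float(fields[1 + order]) if has_backoff else None
    return words, 10 ** log_prob, backoff
```

The n-gram order is passed in explicitly because a word may itself look like a number, so the number of fields alone does not tell a word from a backoff weight.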

Ney’s absolute discounting

Command

ngram-count -text train_morCorpus.txt -lm lm_absoulte0.5_3gram.txt -order 3 \
    -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5

Parameter

-order n : generate N-grams up to order n. If omitted, N-grams up to trigrams are generated.

-cdiscountN value : value is the constant subtracted from the counts of N-grams of order N
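The idea of subtracting a constant can be sketched in a few lines of Python: each observed count after a given context is reduced by the discount, and the probability mass freed this way is what remains for backing off to lower-order estimates. This is the textbook form of absolute discounting, shown under that assumption rather than as SRILM's exact code. The function name is hypothetical.

```python
def absolute_discount_probs(counts, discount=0.5):
    """Discounted probabilities for the words seen after one context.

    counts maps each word to its count after the context. Returns the
    discounted probabilities and the probability mass left for backoff.
    """
    total = sum(counts.values())
    probs = {w: (c - discount) / total for w, c in counts.items()}
    left_over = discount * len(counts) / total
    return probs, left_over
```

The discounted probabilities and the left-over mass always sum to one, so nothing is lost, only redistributed.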

Witten-Bell discounting

Command

ngram-count -text train_morCorpus.txt -lm lm_witten_3gram.txt -order 3 \
    -wbdiscount1 -wbdiscount2 -wbdiscount3

Ristad's natural discounting

Command

ngram-count -text train_morCorpus.txt -lm lm_nd_3gram.txt -order 3 \
    -ndiscount1 -ndiscount2 -ndiscount3

Chen and Goodman's modified Kneser-Ney discounting

Command

ngram-count -text train_morCorpus.txt -lm lm_knd_5gram.txt -order 3 \
    -kndiscount1 -kndiscount2 -kndiscount3

Original Kneser-Ney discounting

Command

ngram-count -text train_morCorpus.txt -lm lm_uknd_5gram.txt -order 3 \
    -ukndiscount1 -ukndiscount2 -ukndiscount3

Discounting with Interpolate

Original Kneser-Ney discounting + Interpolate

ngram-count -text train_morCorpus.txt -lm lm_uknd_inter_5gram.txt -order 3 \
    -ukndiscount1 -ukndiscount2 -ukndiscount3 \
    -interpolate1 -interpolate2 -interpolate3

Parameter

-interpolateN : interpolate the discounted order-N estimates with lower-order estimates. Only Witten-Bell, absolute discounting, and (original or modified) Kneser-Ney smoothing currently support interpolation.

Compute Perplexity

Command

ngram -lm lm_default.txt -ppl testCorpus.txt

Parameter

-lm : Language Model

-ppl : Compute sentence scores (log probabilities) and perplexities from the sentences in textfile

Result

file testCorpus.txt: 171154 sentences, 4829620 words, 26626 OOVs

0 zeroprobs, logprob= -9.34268e+06 ppl= 75.5524 ppl1= 88.1413
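The two perplexity figures can be recomputed from the other numbers on the result lines. As documented for SRILM, ppl normalizes the total log probability by the number of scored words plus the end-of-sentence tokens, while ppl1 leaves the end-of-sentence tokens out. OOVs and zeroprobs are excluded from both. The Python sketch below applies those formulas. The function name is hypothetical.

```python
def srilm_perplexities(logprob, sentences, words, oovs, zeroprobs=0):
    """Return (ppl, ppl1) from the statistics printed by ngram -ppl.

    logprob is the total base-10 log probability of the test set.
    """
    scored = words - oovs - zeroprobs  # tokens that contribute to logprob
    ppl = 10 ** (-logprob / (scored + sentences))  # counts one </s> per sentence
    ppl1 = 10 ** (-logprob / scored)               # ignores </s> tokens
    return ppl, ppl1
```

Calling `srilm_perplexities(-9.34268e+06, 171154, 4829620, 26626)` gives values that agree with the reported ppl= 75.5524 and ppl1= 88.1413, up to the rounding of logprob in the output.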

Result

Test-set ppl and ppl1 for N-gram orders 1 to 5 (rows) under each smoothing method (columns).

ppl   Absolute     Witten-Bell  Ristad's  modified    original    original Kneser-Ney  Good-Turing  No         Smoothing
      Discounting               Natural   Kneser-Ney  Kneser-Ney  + Interpolate                     Smoothing  +1
1     757.845      760.974      757.846   1799.97     1796.29     1796.29              757.846      757.845    760.976
2     109.749      111.031      113.783   161.585     157.802     153.811              109.742      144.187    810.01
3     75.6761      75.714       78.4666   88.7392     85.7081     82.3877              75.5524      99.1683    3061.12
4     72.1618      71.5292      74.6189   75.8507     -           70.4343              72.0892      95.4482    6010.21
5     71.8617      71.0122      74.2249   70.247      67.4874     66.8003              72.076       95.5346    8257

ppl1  Absolute     Witten-Bell  Ristad's  modified    original    original Kneser-Ney  Good-Turing  No         Smoothing
      Discounting               Natural   Kneser-Ney  Kneser-Ney  + Interpolate                     Smoothing  +1
1     959.828      963.933      959.83    2351.07     2346.1      2346.1               959.83       959.828    963.936
2     129.751      131.32       134.693   193.686     188.991     184.044              129.743      173.133    1028.33
3     88.2908      88.3366      91.6647   104.121     100.44      96.4127              88.1413      117.447    4074.74
4     84.0481      83.2852      87.0136   88.5017     -           81.9653              83.9605      112.883    8195.02
5     83.6861      82.6618      86.538    81.7396     78.4164     77.5897              83.9447      112.989    11386.7