STATISTICAL LANGUAGE MODELS FOR CROATIAN
WEATHER-DOMAIN CORPUS
Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić
Department of Informatics, University of Rijeka
{lnacinovic, smarti, ivoi}@inf.uniri.hr
Introduction

• Statistical language modelling estimates the regularities in natural languages: the probabilities of word sequences, usually derived from large collections of text material
• Employed in:
  – speech recognition
  – optical character recognition
  – handwriting recognition
  – machine translation
  – spelling correction
  – ...
N-gram language models

• The most widely used LMs
• Based on the probability of a word w_n given the preceding sequence of words w_1 ... w_{n-1}
• Bigram models (2-grams)
  – determine the probability of a word given the previous word
• Trigram models (3-grams)
  – determine the probability of a word given the previous two words (a counting sketch follows below)
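As a minimal illustration (our own sketch, not from the slides; the toy corpus and all function names are invented for this example), such probabilities can be read directly off bigram and history counts:

    from collections import Counter

    def bigram_ml(tokens):
        # Maximum likelihood bigram estimates: p(w2 | w1) = c(w1 w2) / c(w1 as a history)
        history_counts = Counter(tokens[:-1])
        bigram_counts = Counter(zip(tokens, tokens[1:]))
        return {(w1, w2): c / history_counts[w1]
                for (w1, w2), c in bigram_counts.items()}

    tokens = "sunny then cloudy then sunny then rainy".split()
    p = bigram_ml(tokens)
    print(p[("then", "sunny")])  # "then sunny" occurs once, "then" is a history 3 times: 1/3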
Language model perplexity

• The most common metric for evaluating a language model is the probability that the model assigns to test data, or one of the measures derived from it:
  – cross-entropy
  – perplexity
Cross-entropy

• The cross-entropy of a model p on a text T:

    $H_p(T) = -\frac{1}{W_T} \log_2 p(T)$

• W_T is the length of the text T measured in words
Perplexity

• The reciprocal value of the average probability assigned by the model to each word in the test set T
• The perplexity PP_p(T) of a model is related to cross-entropy by the equation

    $PP_p(T) = 2^{H_p(T)}$

• Lower cross-entropies and perplexities are better
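A small Python sketch of both measures (our own illustration; the probabilities are hypothetical per-word values a model might assign to a 4-word test text):

    import math

    def cross_entropy(word_probs):
        # H_p(T) = -(1 / W_T) * sum of log2 p(w) over the words of the test text
        return -sum(math.log2(p) for p in word_probs) / len(word_probs)

    def perplexity(word_probs):
        # PP_p(T) = 2 ** H_p(T)
        return 2 ** cross_entropy(word_probs)

    probs = [0.25, 0.1, 0.5, 0.05]
    print(cross_entropy(probs))  # about 2.66 bits per word
    print(perplexity(probs))     # about 6.3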
Smoothing

• Data sparsity problem
  – N-gram models are trained from a finite corpus
  – some perfectly acceptable N-grams are missing: probability = 0
• Solution: smoothing techniques
  – adjust the maximum likelihood estimates to produce more accurate probabilities
  – adjust low probabilities such as zero probabilities upward, and high probabilities downward
Smoothing techniques used in our research

• Additive smoothing
• Absolute discounting
• Witten-Bell technique
• Kneser-Ney technique
Additive smoothing

• One of the simplest types of smoothing
• We add a factor δ (0 < δ ≤ 1) to every count
• Formula for additive smoothing:

    $p_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\delta + c(w_{i-n+1}^{i})}{\delta |V| + \sum_{w_i} c(w_{i-n+1}^{i})}$

• V is the vocabulary (the set of all words considered)
• c is the number of occurrences
• Values of the δ parameter used in our research: 0.1, 0.5 and 1
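A minimal add-δ sketch for the bigram case (our own illustration, reusing the invented toy corpus from above):

    from collections import Counter

    def p_add(w1, w2, bigram_counts, history_counts, vocab_size, delta=0.5):
        # (delta + c(w1 w2)) / (delta * |V| + total count of bigrams starting with w1)
        return ((delta + bigram_counts[(w1, w2)])
                / (delta * vocab_size + history_counts[w1]))

    tokens = "sunny then cloudy then sunny then rainy".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    V = len(set(tokens))
    print(p_add("then", "rainy", bigram_counts, history_counts, V))  # seen bigram
    print(p_add("then", "foggy", bigram_counts, history_counts, V))  # unseen, yet nonzero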
Absolute discounting

• When there is little data for directly estimating an n-gram probability, useful information can be provided by the corresponding (n-1)-gram
• In absolute discounting, the higher-order distribution is created by subtracting a fixed discount D from each non-zero count:

    $p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\ 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$

• Values of D used in our research: 0.3, 0.5, 1
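A bigram-level sketch (our own illustration; for simplicity the lower-order model here is the maximum likelihood unigram rather than a recursively smoothed one):

    from collections import Counter

    def p_abs(w1, w2, bigram_counts, history_counts, unigram_counts, total, D=0.5):
        # Discounted higher-order estimate: max(c(w1 w2) - D, 0) / c(w1 as a history)
        higher = max(bigram_counts[(w1, w2)] - D, 0) / history_counts[w1]
        # Mass freed by discounting, i.e. the (1 - lambda) weight in the formula above
        n_types = sum(1 for (a, _b) in bigram_counts if a == w1)
        backoff_weight = D * n_types / history_counts[w1]
        # Redistribute the freed mass over the lower-order (unigram ML) distribution
        return higher + backoff_weight * unigram_counts[w2] / total

    tokens = "sunny then cloudy then sunny then rainy".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    print(p_abs("then", "rainy", bigram_counts, Counter(tokens[:-1]),
                Counter(tokens), len(tokens)))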
Witten-Bell technique

• The number of different words in the corpus is used to help determine the probability of words that never occur in the corpus
• Example for a bigram: the total probability mass reserved for words never seen after w_x is

    $\sum_{w_i : c(w_x w_i) = 0} p(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}$

• T(w_x) is the number of distinct word types observed after w_x; N(w_x) is the number of tokens observed after w_x
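A sketch of the reserved mass for one history (our own illustration, on the same toy corpus as above):

    from collections import Counter

    def unseen_mass(w_x, bigram_counts):
        # T(w_x) / (N(w_x) + T(w_x)): probability reserved for unseen followers of w_x
        followers = [c for (a, _b), c in bigram_counts.items() if a == w_x]
        T = len(followers)   # distinct word types seen after w_x
        N = sum(followers)   # tokens seen after w_x
        return T / (N + T)

    tokens = "sunny then cloudy then sunny then rainy".split()
    print(unseen_mass("then", Counter(zip(tokens, tokens[1:]))))  # 3 / (3 + 3) = 0.5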
Kneser-Ney technique

• An extension of absolute discounting
• The lower-order distribution that one combines with a higher-order distribution is built in a novel manner:
  – it is taken into consideration only when few or no counts are present in the higher-order distribution
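The novelty is that the lower-order estimate counts how many distinct contexts a word follows, rather than how often it occurs. A minimal sketch of that continuation probability (our own illustration; a full Kneser-Ney model also applies the absolute discount D to the higher-order counts):

    from collections import Counter

    def p_continuation(w, bigram_counts):
        # Number of distinct histories that w follows, normalized by the
        # total number of distinct bigram types in the corpus
        histories = sum(1 for (_a, b) in bigram_counts if b == w)
        return histories / len(bigram_counts)

    tokens = "sunny then cloudy then sunny then rainy".split()
    bc = Counter(zip(tokens, tokens[1:]))
    print(p_continuation("sunny", bc))  # frequent word, but follows only "then": 1/5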
Smoothing implementation

• 2-gram, 3-gram and 4-gram language models were built
• Corpus: 290,480 words
  – 2,398 1-grams
  – 18,694 2-grams
  – 23,021 3-grams
  – 29,736 4-grams
• Four different smoothing techniques were applied to each of these models
Corpus

• The major part was developed from 2002 until 2005, with some parts added later
• Includes vocabulary related to weather, bio and maritime forecasts, river water levels and weather reports
• Divided into 10 parts
  – 9/10 used for building the language models
  – 1/10 used for evaluating those models in terms of their estimated perplexities
Results given by the perplexities of the LMs

          Without     Additive smoothing (δ)     Absolute discounting (D)   Witten-   Kneser-
          smoothing   0.1      0.5      1        0.3      0.5      1        Bell      Ney
2-gram    19.87       28.8     51.6     73.5     19.61    19.64    21.6     19.75     18.96
3-gram    8.45        30.04    86.9     144.2    8.17     8.22     9.30     8.25      7.63
4-gram    6.04        42.9     142.6    239.87   5.64     5.71     6.76     5.76      5.24
Conclusion

• In this paper we described the process of building language models from the Croatian weather-domain corpus
• We built models of different orders:
  – 2-grams
  – 3-grams
  – 4-grams
• We applied four different smoothing techniques:
  – additive smoothing
  – absolute discounting
  – Witten-Bell technique
  – Kneser-Ney technique
• We estimated and compared the perplexities of those models
• The Kneser-Ney smoothing technique gives the best results
Further work

• Prepare a more balanced corpus of Croatian text and thus build a more complete language model
• Other LMs
  – class-based
• Other smoothing techniques
References

• Chen, Stanley F.; Goodman, Joshua. An Empirical Study of Smoothing Techniques for Language Modeling. Cambridge, MA: Computer Science Group, Harvard University, 1998.
• Chou, Wu; Juang, Biing-Hwang. Pattern Recognition in Speech and Language Processing. CRC Press, 2003.
• Jelinek, Frederick. Statistical Methods for Speech Recognition. Cambridge, MA: The MIT Press, 1998.
• Jurafsky, Daniel; Martin, James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall, 2000.
• Manning, Christopher D.; Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999.
• Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza hrvatskoga govora kontekstno ovisnim skrivenim Markovljevim modelima (Recognition and Synthesis of Croatian Speech Using Context-Dependent Hidden Markov Models), doctoral dissertation. Zagreb: FER, 2007.
• Milharčič, Grega; Žibert, Janez; Mihelič, France. Statistical Language Modeling of SiBN Broadcast News Text Corpus. // Proceedings of the 5th Slovenian and 1st International Language Technologies Conference / Erjavec, T.; Žganec Gros, J. (eds.). Ljubljana: Jožef Stefan Institute, 2006.
• Stolcke, Andreas. SRILM - An Extensible Language Modeling Toolkit. // Proceedings of the International Conference on Spoken Language Processing. Denver, 2002, vol. 2, pp. 901-904.
SRILM toolkit

• The models were built and evaluated with the SRILM toolkit
• http://www.speech.sri.com/projects/srilm/
• ngram-count -text TRAINDATA -lm LM
• ngram -lm LM -ppl TESTDATA
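For reference (our addition, based on SRILM's documented discounting options; verify against the SRILM version you use), the four smoothing techniques studied here roughly correspond to the following ngram-count flags:

• ngram-count -text TRAINDATA -lm LM -addsmooth 0.5 (additive smoothing, δ = 0.5)
• ngram-count -text TRAINDATA -lm LM -cdiscount 0.5 (absolute discounting, D = 0.5)
• ngram-count -text TRAINDATA -lm LM -wbdiscount (Witten-Bell)
• ngram-count -text TRAINDATA -lm LM -kndiscount (Kneser-Ney)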
Language model

• Speech recognition: converting an acoustic signal into a sequence of words
• Through language modelling, the word sequences of the language are statistically modelled
• A language model estimates the probability Pr(W) for all possible word strings W = (w_1, w_2, ..., w_i)
[Figure: system diagram of a generic speech recognizer based on statistical models]
• Bigram language models (2-grams)
  – Central goal: to determine the probability of a word given the previous word
• Trigram language models (3-grams)
  – Central goal: to determine the probability of a word given the previous two words
• The simplest way to approximate this probability, here for the trigram case, is to compute:

    $p_{ML}(w_i \mid w_{i-2} w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}$

• This value is called the maximum likelihood (ML) estimate
• Linear interpolation is a simple method for combining the information from lower-order n-gram models when estimating higher-order probabilities

• A general class of interpolated models is described by Jelinek and Mercer:

    $p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{interp}(w_i \mid w_{i-n+2}^{i-1})$

• The nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood model and the (n-1)th-order smoothed model
• Given a fixed p_ML, it is possible to search efficiently for the interpolation weights λ that maximize the probability of some data using the Baum-Welch algorithm
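A bigram-level sketch of the interpolation step (our own illustration; λ is fixed by hand here rather than trained with Baum-Welch, and the recursion bottoms out at a hand-built unigram):

    def p_interp(w1, w2, p_ml_bigram, p_unigram, lam=0.7):
        # lambda * p_ML(w2 | w1) + (1 - lambda) * lower-order estimate of w2
        return (lam * p_ml_bigram.get((w1, w2), 0.0)
                + (1 - lam) * p_unigram.get(w2, 0.0))

    # Hand-built toy distributions for the history "then"
    p_ml_bigram = {("then", "sunny"): 1/3, ("then", "cloudy"): 1/3, ("then", "rainy"): 1/3}
    p_unigram = {"sunny": 2/7, "then": 3/7, "cloudy": 1/7, "rainy": 1/7}
    print(p_interp("then", "sunny", p_ml_bigram, p_unigram))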
• In absolute discounting smoothing, instead of multiplying the higher-order maximum likelihood distribution by a factor λ_{w_{i-n+1}^{i-1}}, the higher-order distribution is created by subtracting a fixed discount D from each non-zero count:

    $p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\ 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$

• Values of D used in research: 0.3, 0.5, 1