Download - java_lect_35.pdf
-
25 August, 2005 "NLP-AI" 1
Language Modeling for
Information Retrieval
Ananthakrishnan R(anand@cse)
-
2Outline of the Seminar What is Language Modeling?
Language Modeling in NLP Speech Recognition, Statistical MT, Classification, IR
The Language Modeling approach to IR the conceptual model smoothing semantic smoothing
IR as Statistical Machine Translation
-
3Language Modeling A language model is a probabilistic
mechanism for generating text
Language models estimate the probability distribution of various natural language phenomena sentences, utterances, queries
-
4Language Modeling Techniques N-grams Class-based N-grams Probabilistic CFGs Decision Trees
Claude Shannon was probably the first language modelerHow well can we predict or compress natural language text through simple n-gram models?
-
5Language Modeling for Other NLP Tasks
-
6Applications
Speech Recognition Statistical Machine Translation Classification Spell Checking
Play the role of either the prior or the likelihood
-
7Language Models inSpeech Recognition
Noisy ChannelSource WordNoisyWord Decoder
)()|(maxarg)|(maxarg
wpwOp
Owpw
Vocabularyw
Vocabularyw
w
)(wp (source model)
Ow
-
8Language Models in Statistical Machine Translation
Noisy ChannelOriginal sentence e Noisy sentence f
Decoder
)e()e|f(maxarg)f|e(maxarge
e
e
pp
p
E
E
)e(p )e|f(p (translation model)(source model)
Source e
-
9Language Models inText Classification
)c()c|d(maxarg)d|c(maxargc
c
c
pp
p
C
C
a language model for each class
Bayesian Classifier
-
10
The Language Modeling Approach to Information Retrieval
-
11
The Conceptual Model of IR The user has an information need ,
From this need he generates an ideal document fragment d,
He selects a set of key terms from d,, and generates a query q from this set
-
12
Analysis of Australias defeatto England in the Ashes. Reason
for defeat: overconfidence?
Eng bt. Aus: AnalysisAustralias defeat to England in the Ashes islargely due to their overconfidence. They dontseem to have figured out that they are facing a
far superior team
The Ashes ReviewBrash batting by Australia has lost them
a test match. Their opposition is not Zimbabwe, and not many of the Aussies
seem to remember that!
Information need ,
Ideal document d,
A candidate document
australia ashes defeat analysisquery
d
q
-
13
Document generation model
Document-queryTranslation model
Retrieval Engineq
q
{d}
d,,
retrieved documentsranked by:
informationneed
Ideal documentfragment
query
query
),q|d( 8p
8
-
14
)|q()|d(),d|q(),q|d( 8
888p
ppp
Using Bayes law,
For the purpose of ranking,
)|d(),d|q()d(q 88 ppp
)d()d|q()d(q ppp (user independent)
The Language Modeling Approach
-
15
)d()d|q()d(q ppp
query-likelihood: comes from the language model
document prior(query-independent)
E.g., PageRank
Language Modeling for IR:Using document language models to assign likelihood scores to queries(Ponte and Croft, 1998)
-
16
Analysis of Australias defeatto England in the Ashes. Reason
for defeat: overconfidence?
Eng bt. Aus: AnalysisAustralias defeat to England in the Ashes islargely due to their overconfidence. They dontseem to have figured out that they are facing a
far superior team
The Ashes ReviewBrash batting by Australia has lost them
a test match. Their opposition is not Zimbabwe, and not many of the Aussies
seem to remember that!
Information need ,
Ideal document d,
A candidate document
australia ashes defeat analysisquery
?),d|q( 8pq
d
-
17
The Language Model For each document d
in the collection C, for each term w, estimate:
To find the query likelihood, assume query terms to be independent:
)d|w(p
mi iqpp 1 )d|()d|q(
-
18
The Need for Smoothing To eliminate zero probabilities
Correct the error in the MLE due to the problem of data sparsity
australia ashes defeat analysis
The Ashes ReviewBrash batting by Australia has lost them
a test match. Their opposition is not Zimbabwe, and not many of the Aussies
seem to remember that!
p(defeat|d) = 0p(q|d) = 0
d
q
-
19
Smoothing Jelinek-Mercer Method: Linear interpolation of
the maximum likelihood model with the collection model
(Zhai & Lafferty, 2003) compares Jelinek-Mercer, Dirichlet and the absolute discounting methods for Language Models applied to IR.
)|w()d|w()1()d|w( &ppp OOO
-
20
Analysis of Australias defeatto England in the Ashes. Reason
for defeat: overconfidence?
Eng bt. Aus: AnalysisAustralias defeat to England in the Ashes islargely due to their overconfidence. They dontseem to have figured out that they are facing a
far superior team
The Ashes ReviewBrash batting by Australia has lost them
a test match. Their opposition is not Zimbabwe, and not many of the Aussies
seem to remember that!
Information need , Ideal document d,
A candidate document
australia ashes defeat overconfidencequery
q
d
The Need for Semantic Smoothing
-
21
Eng bt. Aus: AnalysisAustralias defeat to England in the Ashes islargely due to their overconfidence. They dontseem to have figured out that they are facing a
far superior team
australia ashes defeat overconfidence
Document-queryTranslation model
Ideal document d,
query
Document-Query Translation Model
?)d|q( pMust be sophisticated
enough to handle polysemy and synonymy
-
22
Semantic Smoothing
m
1i wi )d|w()w|()d|q( pqtp
)d|(analysisp
Estimate translation probabilities t(wq|wd) formapping a document term wd to a query term wq
)d|(review)|( reviewpanalysisp
(Berger and Lafferty, 1999)(Lafferty and Zhai, 2001)
-
23
Compare withThe Statistical MT model
Noisy ChannelOriginal sentence e Noisy sentence f
Decoder
)e()e|f(maxarg)f|e(maxarge
e
e
pp
p
E
E
)e(p )e|f(p (translation model)(source model)
Source e
Language model Semantic Smoothing
-
24
Language Modeling vis--visthe traditional probabilistic model
?)d,q|( rp
),q|d(),q|d(
rprp
log
The Probability Ranking Principle(Robertson, 1977)
The Robertson-Sparck Jones Model(Sparck Jones et al., 2000)
Language Modeling approach: Wheres the relevance? Is the ranking optimal?
(Lafferty & Zhai, 2002)(Laverenko and Croft, 2001)
-
25
Discussion: Advantages of the Language Modeling Approach
Conditioning on d provides a larger foothold for estimation
p(d): an explicit notion of the importance of a document
Document normalization is not an issueWhen designing a statistical model the most natural route is to apply a generative model which builds up the output step-by-step. The source channel perspective suggests a different approach: turn the search problem around to predict the input. In speech recognition, NLP, and MT, researchers have time and again found that predicting what is known from competing hypothesis can be easier than directly predicting all of the hypothesis
-
26
Discussion: Disadvantage of the Language Modeling Approach
The Robertson-Sparck Jones model can directly use relevance judgements to improve the estimation of p(Ai |q,r)
This is not possible in the LanguageModeling approach
-
27
References (Robertson, 1977) The probability ranking
principle in IR (Sparck Jones et al., 2000) A probabilistic model
of information retrieval: development and comparative experiments.
(Ponte and Croft, 1998) A language modeling approach to information retrieval
(Zhai and Lafferty, 2001) A study of smoothing methods for language models applied to ad hoc information retrieval
-
28
References (Berger and Lafferty, 1999) Information retrieval
as statistical translation (Lafferty and Zhai, 2001) Risk minimization and
language modeling in information retrieval (Lavrenko and Croft, 2001) Relevance-based
language models (Lafferty and Zhai, 2001) Probabilistic IR models
based on document and query generation