TRANSCRIPT
April 7, 2006 Natural Language Processing/Language Technology for the Web
Cross-Language Information Retrieval (CLIR)
Ananthakrishnan R Computer Science & Engg., IIT Bombay
(anand@cse)
Cross-Language Information Retrieval (CLIR): “A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.”
E.g., Using Hindi queries to retrieve English documents
Also called multi-lingual, cross-lingual, or trans-lingual IR.
Why CLIR?
For example, on the web we have:
• Documents in different languages
• Multilingual documents
• Images with captions in different languages

A single query should retrieve all such resources.
Approaches to CLIR

                                Knowledge-based               Corpus-based
Query Translation               Dictionary/Thesaurus-based    Pseudo-Relevance Feedback (PRF)
  (most efficient; commonly used)
Document Translation            MT (rule-based)               MT (EBMT/StatMT)
  (infeasible for large collections)
Intermediate Representation     UNL (AgroExplorer)            Latent Semantic Indexing

The most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.
Dictionary-based Query Translation

Example: the Hindi query आयरलैंड शांति वार्ता is looked up in Hindi-English dictionaries to produce "Ireland peace talks", which is then used to search the (English) collection.

Issues at this stage:
• phrase identification
• words to be transliterated
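To make the lookup step concrete, here is a minimal Python sketch with a toy inline dictionary; the dictionary entries and the function name are illustrative stand-ins, not an actual lexical resource.

```python
# Toy Hindi-English dictionary (illustrative entries only).
HINDI_ENGLISH = {
    "आयरलैंड": ["Ireland"],
    "शांति": ["peace", "calm", "tranquility"],
    "वार्ता": ["talk", "negotiation", "conversation"],
}

def translate_query(source_terms):
    """Replace each source term by all of its dictionary translations
    (no disambiguation yet: the 'all-translations' query)."""
    target_terms = []
    for term in source_terms:
        # Terms missing from the dictionary (e.g. names) would be
        # transliterated in practice; here we just pass them through.
        target_terms.extend(HINDI_ENGLISH.get(term, [term]))
    return target_terms

print(translate_query(["आयरलैंड", "शांति", "वार्ता"]))
# ['Ireland', 'peace', 'calm', 'tranquility', 'talk', 'negotiation', 'conversation']
```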
The problem with dictionary-based CLIR: ambiguity

अंतरिक्षीय घटना
  अंतरिक्षीय → cosmic, outer-space
  घटना → incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

जाली धन
  जाली → lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
  धन → money, riches, wealth, appositive, property

आयरलैंड शांति वार्ता
  आयरलैंड → Ireland
  शांति → peace, calm, tranquility, silence, quietude
  वार्ता → conversation, talk, negotiation, tale
Disambiguation using co-occurrence statistics
Hypothesis: correct translations of query terms will co-occur and incorrect translations will tend not to co-occur
Problem with counting co-occurrences: data sparsity
freq(Marathi Shallow Parsing CRFs)
freq(Marathi Shallow Structuring CRFs)
freq(Marathi Shallow Analyzing CRFs)
… are all zero.
How do we choose between parsing, structuring, and analyzing?
Pair-wise co-occurrence
Consider again the candidate translations of अंतरिक्षीय घटना (cosmic, outer-space; incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce). Pair frequencies are far less sparse than full-query frequencies:

freq(cosmic incident)        70800
freq(cosmic event)          269000
freq(cosmic lessen)           7130
freq(cosmic subside)          3120
freq(outer-space incident)   26100
freq(outer-space event)     104000
freq(outer-space lessen)      2600
freq(outer-space subside)      980
Shallow Parsing, Structuring, or Analyzing?

shallow parsing        166000    shallow structuring      180000    shallow analyzing      1230000
CRFs parsing              540    CRFs structuring            125    CRFs analyzing             765
Marathi parsing         17100    Marathi structuring         511    Marathi analyzing        12200
"shallow parsing"       40700    "shallow structuring"        11    "shallow analyzing"          2

Collocation? The exact-phrase counts point to "shallow parsing". But raw counts are skewed by how frequent each word is on its own:

analyzing    74100000
parsing      40400000
structuring  17400000
shallow      33300000
Ranking senses using co-occurrence statistics

Use co-occurrence scores to calculate the similarity between two words, $\mathrm{sim}(x, y)$. Options include point-wise mutual information (PMI), the Dice coefficient, and PMI-IR:

$$\mathrm{PMI\text{-}IR}(x, y) = \log \frac{\mathit{hits}(x \ \mathrm{AND}\ y)}{\mathit{hits}(x) \cdot \mathit{hits}(y)}$$
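As a quick illustration, here is PMI-IR in Python, applied to the hit counts from the shallow-parsing example above (the function name and interface are ours):

```python
import math

def pmi_ir(hits_xy, hits_x, hits_y):
    """PMI-IR(x, y) = log( hits(x AND y) / (hits(x) * hits(y)) )."""
    if hits_xy == 0:
        return float("-inf")  # the pair never co-occurs: lowest score
    return math.log(hits_xy / (hits_x * hits_y))

# Exact-phrase and unigram hit counts from the example above:
print(pmi_ir(40700, 33300000, 40400000))  # "shallow parsing"     ~ -24.2
print(pmi_ir(11,    33300000, 17400000))  # "shallow structuring" ~ -31.6
print(pmi_ir(2,     33300000, 74100000))  # "shallow analyzing"   ~ -34.8
# Dividing by the unigram counts corrects for "analyzing" being by far
# the most frequent word on its own.
```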
Disambiguation algorithm

User's query: $q^s = \{q_1^s, q_2^s, \ldots, q_m^s\}$. For each source term $q_i^s$, let $S_i = \{w_{i,j}^t\}$ be the set of its candidate translations.

1. $\mathrm{sim}(w_{i,j}^t, S_{i'}) = \sum_{w_{i',l}^t \in S_{i'}} \mathrm{sim}(w_{i,j}^t, w_{i',l}^t)$

2. $\mathrm{score}(w_{i,j}^t) = \sum_{i' \neq i} \mathrm{sim}(w_{i,j}^t, S_{i'})$

3. $q_i^t = \arg\max_{w_{i,j}^t} \mathrm{score}(w_{i,j}^t)$

Translated query: $q^t = \{q_1^t, q_2^t, \ldots, q_m^t\}$
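The selection step translates directly into a short Python sketch; sim(x, y) is assumed to be supplied (for instance, the pmi_ir function above driven by hit counts), and all names here are illustrative:

```python
def disambiguate(translation_sets, sim):
    """Pick one translation per source term (steps 1-3 above).
    translation_sets: one list of candidate translations per query term.
    sim(x, y): word-similarity score, e.g. PMI-IR over co-occurrence counts."""
    translated_query = []
    for i, candidates in enumerate(translation_sets):
        # score(w) = sum of sim(w, w') over every candidate w' of the OTHER terms
        def score(w):
            return sum(
                sim(w, other)
                for i2, others in enumerate(translation_sets) if i2 != i
                for other in others
            )
        translated_query.append(max(candidates, key=score))
    return translated_query

# For the example below, this should select ["cosmic", "event"]:
# disambiguate([["cosmic", "outer-space"],
#               ["incident", "event", "lessen", "subside"]], sim)
```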
Example
For the query अंतरिक्षीय घटना, with candidate translations cosmic, outer-space for the first term and incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce for the second:

score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …
Disambiguation algorithm: sample outputs
आयरलैंड शांति वार्ता → Ireland peace talks
अंतरिक्षीय घटना → cosmic events
जाली धन → net money (?)
Results on TREC8 (disks 4 and 5)

English topics (401-450) were manually translated to Hindi. Assumption: relevance judgments for the English topics hold for the translated queries. Results (all runs use TF-IDF):

Technique                    MAP
Monolingual                  23
All-translations             16
PMI-based disambiguation     20.5
Manual filtering             21.5
(User) Relevance Feedback (mono-lingual)

1. Retrieve documents using the user's query
2. The user marks relevant documents
3. Choose the top N terms from these documents (IDF is one option for scoring terms)
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
Pseudo-Relevance Feedback (PRF) (mono-lingual)

1. Retrieve documents using the user's query
2. Assume that the top M documents retrieved are relevant
3. Choose the top N terms from these M documents
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
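A minimal Python sketch of these five steps; `search` is an assumed retrieval function returning ranked documents as term lists, and raw term frequency stands in for the IDF-based scoring mentioned earlier. Replacing the top-M assumption in step 2 with the user's judgments gives user relevance feedback.

```python
from collections import Counter

def pseudo_relevance_feedback(query, search, M=10, N=5):
    """Monolingual PRF, steps 1-5.
    query: list of terms. search(terms) -> ranked docs, each a term list."""
    top_docs = search(query)[:M]        # steps 1-2: assume top M are relevant
    counts = Counter(t for doc in top_docs for t in doc)
    expansion = [t for t, _ in counts.most_common(N)]  # step 3: top N terms
    return search(query + expansion)    # steps 4-5: retrieve with expanded query
```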
PRF for CLIR: Corpus-based Query Translation

Uses a parallel corpus of documents, where each document Hi in the Hindi collection H is aligned to its translation Ei in the English collection E:

H1 ↔ E1, H2 ↔ E2, …, Hm ↔ Em
PRF for CLIR
1. Retrieve documents in H using the user’s query
2. Assume that the top M documents retrieved are relevant
3. Select the M documents in E that are aligned to the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection (which is in the same language as E)
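Under the same assumptions, a sketch of the cross-lingual variant; `search_hindi`, `aligned_english`, and `search_target` are hypothetical interfaces onto the Hindi collection H, its aligned English side E, and the target collection:

```python
from collections import Counter

def clir_prf(hindi_query, search_hindi, aligned_english, search_target, M=10, N=5):
    """PRF-based query translation, steps 1-6.
    search_hindi(terms) -> ranked ids of Hindi documents in H.
    aligned_english[doc_id] -> term list of the aligned English document in E.
    search_target(terms) -> results from the (English) target collection."""
    top_ids = search_hindi(hindi_query)[:M]                # steps 1-2
    counts = Counter(t for d in top_ids                    # step 3: aligned E docs
                       for t in aligned_english[d])
    english_query = [t for t, _ in counts.most_common(N)]  # steps 4-5
    return search_target(english_query)                    # step 6
```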
Ranking with Relevance Models

The relevance model (or query model) is a distribution $R$ that encodes the information need:
• $P(w \mid R)$: probability of word occurrence in a relevant document
• $P(w \mid D)$: probability of word occurrence in the candidate document $D$

Ranking function (relative entropy, or KL divergence):

$$KL(R \parallel D) = \sum_{w} P(w \mid R) \log \frac{P(w \mid R)}{P(w \mid D)}$$

Documents with lower divergence from the relevance model rank higher.
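The ranking function is a direct transcription into Python; the distributions are plain dicts, and P(w|D) is assumed already smoothed so it is nonzero wherever P(w|R) > 0:

```python
import math

def kl_divergence(p_rel, p_doc):
    """KL(R || D) = sum over w of P(w|R) * log(P(w|R) / P(w|D)).
    p_rel, p_doc: word -> probability dicts; p_doc assumed smoothed."""
    return sum(p * math.log(p / p_doc[w]) for w, p in p_rel.items() if p > 0)

def rank_documents(p_rel, doc_models):
    """Order candidate document models by increasing divergence from the
    relevance model: a lower KL(R || D) means a better match."""
    return sorted(doc_models, key=lambda d: kl_divergence(p_rel, d))
```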
Estimating Mono-Lingual Relevance Models

Given a query $h_1 h_2 \ldots h_m$:

$$P_R(w) = P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)}$$

The joint probability is estimated by summing over the document models $M$ of the collection:

$$P(w, h_1 h_2 \ldots h_m) = \sum_{M} P(M)\, P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M)$$
Estimating Cross-Lingual Relevance Models
},{ 121 )|()|(}),({)...,(
EH MM
m
iHiEEHm MhPMwPMMPhhhwP
)()1()|(,
, wPfreq
freqMwP
v Xv
XwX
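Combining the two formulas, here is a sketch of the estimation in Python. The uniform prior over aligned pairs, the mixing weight `lam`, and the `p_background` interface are assumptions of this sketch, not details given above.

```python
def estimate_cross_lingual_rm(hindi_query, pairs, p_background, lam=0.6):
    """Estimate P(w|R) over English words from a Hindi query h1..hm.
    pairs: list of (M_H, M_E) aligned documents, each a term-frequency dict.
    p_background(w): collection-wide P(w) used for smoothing."""
    prior = 1.0 / len(pairs)  # assumed uniform P(M_H, M_E)

    def p_smoothed(w, freqs):
        # P(w|M_X) = lam * freq(w,X) / sum_v freq(v,X) + (1 - lam) * P(w)
        total = sum(freqs.values())
        rel_freq = freqs.get(w, 0) / total if total else 0.0
        return lam * rel_freq + (1 - lam) * p_background(w)

    def p_joint(w):
        # P(w, h1..hm) = sum over (M_H, M_E) of
        #   P(M_H, M_E) * P(w|M_E) * prod_i P(h_i|M_H)
        total = 0.0
        for m_h, m_e in pairs:
            term = prior * p_smoothed(w, m_e)
            for h in hindi_query:
                term *= p_smoothed(h, m_h)
            total += term
        return total

    vocab = {w for _, m_e in pairs for w in m_e}
    z = sum(p_joint(w) for w in vocab)  # approximates P(h1..hm)
    return {w: p_joint(w) / z for w in vocab}
```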
CLIR Evaluation – TREC (Text REtrieval Conference)

TREC ran a CLIR track in 2001 and 2002:
• Retrieval of Arabic-language newswire documents from topics in English
• 383,872 Arabic documents (896 MB) with SGML markup
• 50 topics
• Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability
http://trec.nist.gov/
CLIR Evaluation – CLEF (Cross Language Evaluation Forum)

A major CLIR evaluation forum. Tracks include:
• Multilingual retrieval on news collections (topics will be provided in many languages, including Hindi)
• Multiple-language Question Answering
• ImageCLEF
• Cross-Language Speech Retrieval
• WebCLEF
http://www.clef-campaign.org/
Summary

• CLIR techniques: query translation-based, document translation-based, and intermediate representation-based
• Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR
• PRF uses a parallel corpus for query translation
• Parallel corpora can also be used to estimate cross-lingual relevance models
• CLEF and TREC are important CLIR evaluation conferences
References (1)
1. Phrasal Translation and Query Expansion Techniques for Cross-language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1995.
2. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1998.
3. A Maximum Coherence Model for Dictionary-Based Cross-Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y. Chai, ACM SIGIR, 2005.
4. A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr, Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-3897, University of Maryland, 1998.
References (2)
5. Translingual Information Retrieval: A Comparative Evaluation, Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee, International Joint Conference on Artificial Intelligence, 1997.
6. A Multistage Search Strategy for Cross Lingual Information Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak Bhattacharyya, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February, 2005.
7. Relevance-Based Language Models, Victor Lavrenko, and W. Bruce Croft, Research and Development in Information Retrieval, 2001.
8. Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette, and W. Croft, ACM SIGIR, 2002.