IIIT Hyderabad’s CLIR experiments for FIRE-2008
Sethuramalingam S & Vasudeva Varma
IIIT Hyderabad, India
Outline
• Introduction
• Related Work in Indian Language IR
• Our CLIR experiments
• Evaluation & Analysis
• Future Work
IIIT-H @ FIRE-2008 2
Introduction
• Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia)
• Information: text, audio, video, speech, geographical information, etc.
CLIR – Indian languages (IL) scenario
தமிழ் (Tamil)
हिन्दी (Hindi)
తెలుగు (Telugu)
বাংলা (Bengali)
मराठी (Marathi)
Modified from source: D. Oard’s Cross-Language IR presentation
Goal: retrieve documents written in any IL when the user queries in one language
Why CLIR for IL?
• Internet user growth in India between 2000 and 2008: 1,100% (source: www.internetworldstats.com)
• Growth in Indian-language content on the web between 2000 and 2007: 700%
So, CLIR for IL becomes mandatory!
RELATED WORK IN INDIAN LANGUAGE IR
Related Work in ILIR
• ACM TALIP, 2003
– The surprise language exercises
– Task was to build CLIR systems for English to Hindi and Cebuano
“The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003
Related Work in ILIR
• CLEF 2006
– Ad-hoc bilingual track including two Indian languages, Hindi and Telugu
– Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task
“Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF 2006.
Related Work in ILIR
• CLEF 2007
– Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
– Five teams including ours participated
– Hindi and Telugu to English CLIR
“IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF 2007.
Related Work in ILIR
• Google’s CLIR system for 34 languages, including Hindi
OUR CLIR EXPERIMENTS
Our CLIR experiments
• Ad-hoc cross-lingual Hindi to English, and English to Hindi.
• Ad-hoc monolingual runs in Hindi and English
• 12 runs in total were submitted for the above 4 tasks
Problem statement
• The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language
<top lang="hi">
<num>28</num>
<title>ईरान का परमाणु कार्यक्रम</title>
<desc>ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc>
<narr>ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए का निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी</narr>
</top>
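As a rough illustration of the problem statement, the sketch below reads topics in the `<top>` shape shown on the slide and cuts a scored result list to the top 1000 documents. The function names `parse_topics` and `top_k`, the wrapping `<topics>` root element, and the placeholder field text are assumptions made for this example; the slides do not describe how the topic file was actually read.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as the topic on the slide.
SAMPLE = """
<top lang="hi">
<num>28</num>
<title>ईरान का परमाणु कार्यक्रम</title>
<desc>(description text)</desc>
<narr>(narrative text)</narr>
</top>
"""

def parse_topics(text):
    """Parse a sequence of <top> elements into a list of dicts."""
    # Assumption: the topics have no single root element, so wrap them in one.
    root = ET.fromstring("<topics>" + text + "</topics>")
    topics = []
    for top in root.iter("top"):
        topics.append({
            "lang": top.get("lang"),
            "num": top.findtext("num", default="").strip(),
            "title": top.findtext("title", default="").strip(),
            "desc": top.findtext("desc", default="").strip(),
            "narr": top.findtext("narr", default="").strip(),
        })
    return topics

def top_k(scored_docs, k=1000):
    """Keep the k highest-scoring (doc_id, score) pairs, best first."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k]

topics = parse_topics(SAMPLE)
ranked = top_k([("doc-a", 0.2), ("doc-b", 0.9), ("doc-c", 0.5)], k=2)
```

Here `ranked` keeps `doc-b` and `doc-c`, in that order; with the default `k=1000` the same call would return up to 1000 documents per topic.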
CLIR System architecture
• Query Processing module
– Named Entities identification
– Query translation using lexicons
– Transliteration
– Query Scoring
• Indexing module
– Stop-word remover
– A typical Indexer using Lucene
Named entities Identification
• Used for identifying the named entities present in the queries for transliteration
• We used
– Our CRF-based NER system (as a binary classifier) for Hindi queries
– Stanford English NER system for English queries
• Identifies Person, Organization and Location names
“Experiments in Telugu NER: A Conditional Random Field Approach”, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.
Query translation
• Using bilingual lexicons
– “Shabdanjali”, an English-Hindi dictionary containing 26,633 entries
– IIT Bombay Hindi Wordnet
– Manually collected Hindi-English dictionary with 6,685 entries
Shabdanjali - http://www.shabdkosh.com/shabdanjali
Hindi Wordnet - http://www.cfilt.iitb.ac.in/wordnet/webhwn/
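A minimal sketch of lexicon-based query translation, assuming a lexicon is a plain mapping from a source word to a list of target words. The toy entries, the `translate_query` name, and the choice to drop words found in neither the named-entity list nor the lexicon are all assumptions for illustration; the slides do not say how out-of-vocabulary words were handled.

```python
def translate_query(tokens, lexicon, named_entities, transliterate):
    """Translate query tokens word by word.

    Named entities go to the transliteration step; other words are
    replaced by every translation the lexicon offers.
    """
    translated = []
    for token in tokens:
        if token in named_entities:
            translated.append(transliterate(token))
        elif token in lexicon:
            translated.extend(lexicon[token])
        # Assumption: words with no lexicon entry are dropped.
    return translated

# Toy Hindi-English lexicon (hypothetical entries, not taken from Shabdanjali).
toy_lexicon = {
    "परमाणु": ["nuclear", "atomic"],
    "कार्यक्रम": ["programme"],
}
toy_entities = {"ईरान": "iran"}

result = translate_query(
    ["ईरान", "का", "परमाणु", "कार्यक्रम"],
    toy_lexicon,
    set(toy_entities),
    lambda name: toy_entities[name],
)
```

Keeping every translation of an ambiguous word, as done here, trades precision for recall; a scoring step downstream can then weight the alternatives.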
Transliteration
• Mapping-based approach
• For a given named entity in the source language
– Derive the Compressed Word Format (CWF)
E.g. academia – cdm
E.g. abullah – bll
– Generate the list of named entities and their CWFs on the target language side
– Map the CWF of the source-language NE to the CWF of the correct target-language equivalent, i.e. the one within the minimum modified edit distance
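The mapping step can be sketched roughly as follows. The rule used in `cwf` (drop vowels and `h`) is an assumption chosen only because it reproduces the two examples on the slide; the actual heuristic, rewrite and remove rules are in the cited paper. Plain Levenshtein distance stands in for the "modified" edit distance, whose details the slides do not give, and the `max_distance` threshold is hypothetical. The source side is shown in Latin script for simplicity.

```python
def cwf(word):
    """Toy Compressed Word Format: drop vowels and 'h' (assumed rule set)."""
    dropped = set("aeiouh")
    return "".join(ch for ch in word.lower() if ch.isalpha() and ch not in dropped)

def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for the modified edit distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def best_match(source_cwf, target_entities, max_distance=1):
    """Return the target NE whose CWF is closest to source_cwf, if close enough."""
    best = None
    for name in target_entities:
        distance = edit_distance(source_cwf, cwf(name))
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (name, distance)
    return best[0] if best else None

match = best_match("cdm", ["academia", "abdullah", "india"])
```

Because the target-side CWFs depend only on the corpus, they can be computed once and stored alongside the NE list, which is what the implementation slide describes.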
Transliteration
• Implementation
– Named entities present in the Hindi and English corpora are identified and listed.
– Their CWFs are generated using a set of heuristic, rewrite and remove rules
– CWFs are added to the list of NEs
“Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM-2008.
Query Scoring
• We generate a Boolean OR query with scored query words
• Query scoring is based on
– Position of occurrence of the word in the topic
– Number of occurrences of the word
– Numbers, Years are given greater weights
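One way to picture a scored Boolean OR query is the sketch below. The weights are invented for illustration (title words count double, numeric tokens get a further 1.5x), and using the topic field as a proxy for "position of occurrence" is an assumption; the slides give no actual values. The output uses the `term^boost` form accepted by Lucene's classic query parser.

```python
from collections import Counter

def score_terms(title, desc, title_boost=2.0, number_boost=1.5):
    """Accumulate a weight per term from its field, frequency and type."""
    weights = Counter()
    for field_boost, text in ((title_boost, title), (1.0, desc)):
        for token in text.lower().split():
            weight = field_boost
            if token.isdigit():  # numbers and years get a greater weight
                weight *= number_boost
            weights[token] += weight  # repeated words add up
    return dict(weights)

def to_or_query(weights):
    """Render weighted terms as a Boolean OR query with boosts."""
    ordered = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return " OR ".join(f"{term}^{weight:.1f}" for term, weight in ordered)

weights = score_terms("iran nuclear programme 2008", "nuclear policy")
query = to_or_query(weights)
```

An OR query keeps recall high when some translated terms are wrong, while the boosts let the more trustworthy terms dominate the ranking.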
Indexing module
• For the English corpus, stop words are removed and the remaining words are stemmed using Lucene
• For the Hindi corpus, a list of 246 stop words is generated from the given corpus based on frequency
• Documents are indexed using the Lucene indexer and ranked using the BM25 algorithm in Lucene
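A frequency-based stop-word list of the kind described for Hindi could be built along these lines. Taking the most frequent whitespace-separated tokens is an assumption; the slides say only that the 246 words were chosen by frequency, not how tokens were defined or whether the list was reviewed by hand.

```python
from collections import Counter

def frequency_stopwords(documents, size=246):
    """Return the `size` most frequent tokens across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(document.split())
    return [word for word, _ in counts.most_common(size)]

# Tiny hypothetical corpus; a real run would stream the whole collection.
docs = ["का के का", "के का में"]
stopwords = frequency_stopwords(docs, size=2)
```

The resulting list would then be applied both at indexing time and to the translated queries so that the two sides stay consistent.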
EVALUATION & ANALYSIS
Evaluation
• English-Hindi cross-lingual run
| Run | MAP | GMAP | R-Prec | Bpref |
|---|---|---|---|---|
| Title + Desc | 0.1538 | 0.0093 | 0.1687 | 0.1905 |
| Title + Narr | 0.1516 | 0.0229 | 0.1871 | 0.1918 |
| Title + Desc + Narr | 0.1432 | 0.0215 | 0.1793 | 0.1886 |
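For readers unfamiliar with the first two columns, the sketch below shows how average precision, MAP and GMAP are commonly computed. It is a generic illustration, not the evaluation script used at FIRE; the `floor` value in `gmap` is an assumption that keeps a zero-scoring topic from collapsing the geometric mean.

```python
import math

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant docs appear."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(ap_values):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(ap_values) / len(ap_values)

def gmap(ap_values, floor=1e-5):
    """Geometric mean of per-topic average precision (GMAP)."""
    logs = [math.log(max(ap, floor)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

GMAP penalizes topics with very low average precision much more than MAP does, so a large gap between the two columns suggests that a few topics performed very poorly.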
Evaluation
• Hindi-English cross-lingual run
| Run | MAP | GMAP | R-Prec | Bpref |
|---|---|---|---|---|
| Title + Desc | 0.0907 | 0.0197 | 0.1291 | 0.1408 |
| Title + Narr | 0.1204 | 0.0366 | 0.1718 | 0.1734 |
| Title + Desc + Narr | 0.1112 | 0.0287 | 0.1541 | 0.1723 |
Evaluation
• Hindi-Hindi monolingual run
| Run | MAP | GMAP | R-Prec | Bpref |
|---|---|---|---|---|
| Title + Desc | 0.2579 | 0.0427 | 0.2797 | 0.2964 |
| Title + Narr | 0.2652 | 0.0534 | 0.2845 | 0.3023 |
| Title + Desc + Narr | 0.2472 | 0.0525 | 0.2558 | 0.2773 |
Evaluation
• English-English monolingual run
| Run | MAP | GMAP | R-Prec | Bpref |
|---|---|---|---|---|
| Title + Desc | 0.4416 | 0.3437 | 0.4579 | 0.4889 |
| Title + Narr | 0.4863 | 0.3989 | 0.4894 | 0.5218 |
| Title + Desc + Narr | 0.4690 | 0.3841 | 0.4707 | 0.5167 |
English-Hindi Vs Hindi-Hindi
Hindi-English Vs English-English
Evaluation
• Summary
– Our English-Hindi CLIR performance was 58% of the monolingual run
– Our Hindi-English CLIR performance was 25% of the monolingual run
– Our Hindi-Hindi monolingual run retrieved 52% of total relevant documents
– Our English-English monolingual run retrieved 91% of total relevant documents
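The cross-lingual percentages can be recomputed from the evaluation tables if one assumes they are ratios of MAP scores against the monolingual run in the target language; the slides do not state the measure or which run was used. Under that assumption, the Title + Desc + Narr pair gives about 58% for English-Hindi and the Title + Narr pair gives about 25% for Hindi-English.

```python
# MAP values copied from the four evaluation tables.
MAP = {
    "English-Hindi": {"T+D": 0.1538, "T+N": 0.1516, "T+D+N": 0.1432},
    "Hindi-English": {"T+D": 0.0907, "T+N": 0.1204, "T+D+N": 0.1112},
    "Hindi-Hindi": {"T+D": 0.2579, "T+N": 0.2652, "T+D+N": 0.2472},
    "English-English": {"T+D": 0.4416, "T+N": 0.4863, "T+D+N": 0.4690},
}

def relative_performance(cross, mono):
    """Cross-lingual MAP as a percentage of monolingual MAP, per run type."""
    return {run: 100 * MAP[cross][run] / MAP[mono][run] for run in MAP[cross]}

english_hindi = relative_performance("English-Hindi", "Hindi-Hindi")
hindi_english = relative_performance("Hindi-English", "English-English")

for run in english_hindi:
    print(f"{run}: E-H {english_hindi[run]:.1f}%  H-E {hindi_english[run]:.1f}%")
```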
Analysis
• Our English-Hindi CLIR performance can be attributed to factors like
– Exact matching of English named entities
– Good coverage of English words in our lexicons
• The relatively lower performance on Hindi-English CLIR is due to
– Low dictionary coverage
– Query formulation that was not complex enough
FUTURE WORK
Future Work
• Error analysis on a per-topic basis
• Work on more complex query formulations
• Work on other possible query translation techniques like
– Building dictionaries from parallel corpora
– Using the web
– Using Wikipedia