
IIIT Hyderabad’s CLIR experiments for FIRE-2008

Sethuramalingam S & Vasudeva Varma
IIIT Hyderabad, India


Outline

• Introduction
• Related Work in Indian Language IR
• Our CLIR experiments
• Evaluation & Analysis
• Future Work

IIIT-H @ FIRE-2008

Introduction

• Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia)

• Information – text, audio, video, speech, geographical information, etc.


CLIR – Indian languages (IL) scenario


[Figure: language labels Tamil (தமிழ்), Hindi (हिन्दी), Telugu (తెలుగు), Bengali (বাংলা), Marathi (मराठी)]

Modified from source: D. Oard’s Cross-Language IR presentation

To retrieve documents written in any IL when the user queries in one language

Why CLIR for IL?




• Internet user growth in India between 2000 and 2008: 1,100.0% (source: www.internetworldstats.com)

• Growth in Indian language content on the web between 2000 and 2007: 700%

So, CLIR for IL becomes mandatory!

RELATED WORK IN INDIAN LANGUAGE IR


Related Work in ILIR

• ACM TALIP, 2003: The surprise language exercises
  – Task was to build CLIR systems for English to Hindi and Cebuano

“The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003.



• CLEF 2006: Ad-hoc bilingual track including two Indian languages, Hindi and Telugu
  – Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task

“Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF 2006.



• CLEF 2007: Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
  – Five teams including ours participated
  – Our task: Hindi and Telugu to English CLIR

“IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF 2007.


Related Work in ILIR

• Google’s CLIR system for 34 languages including Hindi


OUR CLIR EXPERIMENTS


Our CLIR experiments

• Ad-hoc cross-lingual runs: Hindi to English, and English to Hindi
• Ad-hoc monolingual runs in Hindi and English
• 12 runs in total were submitted for the above 4 tasks


Problem statement

• The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language


<top lang="hi">
<num>28</num>
<title> ईरान का परमाणु कार्यक्रम</title>
<desc> ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय।</desc>
<narr> ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए के निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी</narr>
</top>
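A topic in this layout can be read with a standard XML parser before any query processing. The sketch below is illustrative only: the topic text is a placeholder rather than an official FIRE topic, and the helper names `parse_topic` and `build_query_text` are assumptions, not part of the system described in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic in the same layout as the example above;
# the English text is a placeholder, not an official FIRE topic.
TOPIC_XML = """
<top lang="en">
<num>28</num>
<title>Example title</title>
<desc>Example description.</desc>
<narr>Example narrative.</narr>
</top>
"""

def parse_topic(xml_text):
    """Return the topic fields as a plain dictionary."""
    top = ET.fromstring(xml_text.strip())
    return {
        "lang": top.get("lang"),
        "num": int(top.findtext("num")),
        "title": top.findtext("title").strip(),
        "desc": top.findtext("desc").strip(),
        "narr": top.findtext("narr").strip(),
    }

def build_query_text(topic, fields=("title", "desc")):
    """Concatenate the chosen fields, e.g. for a Title + Desc run."""
    return " ".join(topic[name] for name in fields)

topic = parse_topic(TOPIC_XML)
print(build_query_text(topic))
```

Choosing `fields=("title", "narr")` or all three fields would correspond to the other run types reported in the evaluation.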

CLIR System architecture

• Query Processing module
  – Named Entities identification
  – Query translation using lexicons
  – Transliteration
  – Query Scoring

• Indexing module
  – Stop-word remover
  – A typical indexer using Lucene






Named entities Identification

• Used for identifying the named entities present in the queries for transliteration

• We used
  – Our CRF-based NER system (as a binary classifier) for Hindi queries
  – Stanford English NER system for English queries

• Identifies Person, Organization and Location names


“Experiments in Telugu NER: A Conditional Random Field Approach”, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.





Query translation

• Using bilingual lexicons
  – “Shabdanjali”, an English-Hindi dictionary containing 26,633 entries
  – IIT Bombay Hindi Wordnet
  – Manually collected Hindi-English dictionary with 6,685 entries
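Lexicon-based query translation can be pictured as a word-by-word dictionary lookup. The sketch below is a minimal illustration under stated assumptions: the three-entry `LEXICON` is a toy stand-in for the dictionaries listed above, keeping every candidate translation is an assumption, and the slides do not say how out-of-vocabulary words were handled, so they are simply collected separately here.

```python
# Toy Hindi-English lexicon; the entries are illustrative only.
LEXICON = {
    "पानी": ["water"],
    "नीति": ["policy"],
    "विश्व": ["world", "universe"],
}

def translate_query(words, lexicon, named_entities=frozenset()):
    """Split query words into translations, named entities and unknowns.

    Named entities are set aside for transliteration; every candidate
    translation of a dictionary word is kept (an assumption).
    """
    translated, to_transliterate, untranslated = [], [], []
    for word in words:
        if word in named_entities:
            to_transliterate.append(word)
        elif word in lexicon:
            translated.extend(lexicon[word])
        else:
            untranslated.append(word)
    return translated, to_transliterate, untranslated

print(translate_query(["विश्व", "ईरान", "xyz"], LEXICON, {"ईरान"}))
```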


Shabdanjali - http://www.shabdkosh.com/shabdanjali

Hindi Wordnet - http://www.cfilt.iitb.ac.in/wordnet/webhwn/

Transliteration

• Mapping-based approach
• For a given named entity in the source language
  – Derive the Compressed Word Format (CWF)
    E.g. academia – cdm
    E.g. abullah – bll
  – Generate the list of named entities and their CWFs on the target language side
  – Search and map the CWF of the source language NE to the CWF of the right target language equivalent within the minimum modified edit distance
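The mapping steps above can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm: the CWF here only lowercases and drops vowels (the real system uses a set of heuristic, rewrite and remove rules), plain Levenshtein distance stands in for the modified edit distance, the source CWF is assumed to be already in Latin letters, and `max_distance` is a hypothetical threshold.

```python
VOWELS = set("aeiou")

def compressed_word_format(word):
    """Simplified CWF: lowercase and drop vowels (an assumption)."""
    return "".join(
        ch for ch in word.lower() if ch.isalpha() and ch not in VOWELS
    )

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def best_match(source_cwf, target_entities, max_distance=1):
    """Return the target NE whose CWF is closest to source_cwf, or None."""
    best, best_dist = None, max_distance + 1
    for entity in target_entities:
        dist = edit_distance(source_cwf, compressed_word_format(entity))
        if dist < best_dist:
            best, best_dist = entity, dist
    return best

print(compressed_word_format("academia"))
print(best_match("cdm", ["economy", "academia", "India"]))
```

Comparing compressed forms rather than full spellings makes the match tolerant of vowel variation between scripts, which is the motivation for the CWF step.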


Transliteration

• Implementation
  – Named entities present in the Hindi and English corpora are identified and listed
  – Their CWFs are generated using a set of heuristic, rewrite and remove rules
  – CWFs are added to the list of NEs


“Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM-2008.

Query Scoring

• We generate a Boolean OR query with scored query words

• Query scoring is based on
  – Position of occurrence of the word in the topic
  – Number of occurrences of the word
  – Numbers and years are given greater weights
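A scored Boolean OR query of this kind can be written in the `term^boost` form of Lucene's classic query syntax. The sketch below is one possible reading, not the submitted system: the field weights (title 2.0, description 1.0) and the doubling for numeric tokens are hypothetical values, and tokenisation is plain whitespace splitting with no punctuation handling.

```python
from collections import Counter

def score_terms(fields):
    """Score query words from (text, field_weight) pairs.

    Position is modelled by the field weight (an assumption), repeated
    words accumulate score, and numeric tokens get a doubled weight.
    """
    scores = Counter()
    for text, field_weight in fields:
        for token in text.lower().split():
            weight = field_weight
            if token.isdigit():
                weight *= 2.0  # numbers and years weigh more
            scores[token] += weight
    return scores

def to_boolean_or_query(scores):
    """Render the scores as 'term^boost OR term^boost ...'."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return " OR ".join(f"{term}^{score:.1f}" for term, score in ordered)

fields = [("Iran nuclear programme", 2.0), ("Iran nuclear policy 2008", 1.0)]
print(to_boolean_or_query(score_terms(fields)))
# prints: iran^3.0 OR nuclear^3.0 OR 2008^2.0 OR programme^2.0 OR policy^1.0
```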






Indexing module

• For the English corpus, stop words are removed and the remaining words are stemmed using Lucene

• For the Hindi corpus, a stop-word list of 246 words is generated from the given corpus based on frequency

• Documents are indexed using the Lucene indexer and ranked using the BM25 algorithm in Lucene
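Two of the steps above lend themselves to a short sketch: building a stop-word list from corpus frequencies and scoring a term with BM25. Both functions are illustrative assumptions rather than the Lucene code that was used. The stop-word list simply takes the most frequent whitespace-separated tokens, and the BM25 function uses the textbook formula with the commonly used parameters k1 = 1.2 and b = 0.75.

```python
import math
from collections import Counter

def frequency_stop_words(documents, size):
    """Return the `size` most frequent tokens across the documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return [word for word, _ in counts.most_common(size)]

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

docs = ["a b a c", "a b d"]
stops = frequency_stop_words(docs, 2)
print(stops, remove_stop_words("a b c d".split(), stops))
```

For the Hindi corpus described above, `size` would be 246; a document's full BM25 score is the sum of `bm25_term_score` over the query terms.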


EVALUATION & ANALYSIS


Evaluation

• English-Hindi cross-lingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.1538  0.0093  0.1687  0.1905
Title + Narr         0.1516  0.0229  0.1871  0.1918
Title + Desc + Narr  0.1432  0.0215  0.1793  0.1886

Evaluation

• Hindi-English cross-lingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.0907  0.0197  0.1291  0.1408
Title + Narr         0.1204  0.0366  0.1718  0.1734
Title + Desc + Narr  0.1112  0.0287  0.1541  0.1723

Evaluation

• Hindi-Hindi monolingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.2579  0.0427  0.2797  0.2964
Title + Narr         0.2652  0.0534  0.2845  0.3023
Title + Desc + Narr  0.2472  0.0525  0.2558  0.2773

Evaluation

• English-English monolingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.4416  0.3437  0.4579  0.4889
Title + Narr         0.4863  0.3989  0.4894  0.5218
Title + Desc + Narr  0.4690  0.3841  0.4707  0.5167
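For readers unfamiliar with the column headings, the sketch below shows how MAP, GMAP and R-Prec are usually defined. It is a minimal illustration, not the evaluation tool used for these runs: the document IDs are made up, and the 0.00001 floor in the geometric mean is a common convention assumed here so that a topic with zero average precision does not force the result to zero.

```python
import math

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision after R documents, where R is the number of relevant docs."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r

def mean_ap(aps):
    """MAP: arithmetic mean of per-topic average precision."""
    return sum(aps) / len(aps)

def geometric_mean_ap(aps, floor=0.00001):
    """GMAP: geometric mean of per-topic AP, with a small floor."""
    return math.exp(sum(math.log(max(ap, floor)) for ap in aps) / len(aps))

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(average_precision(ranked, relevant), r_precision(ranked, relevant))
```

Because GMAP multiplies per-topic scores, a few badly failing topics pull it down much more than they pull down MAP, which is one way to read the gap between the two columns in the cross-lingual tables.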

[Chart: English-Hindi vs Hindi-Hindi]

[Chart: Hindi-English vs English-English]


Evaluation

• Summary
  – Our English-Hindi CLIR performance was 58% of the monolingual run
  – Our Hindi-English CLIR performance was 25% of the monolingual run
  – Our Hindi-Hindi monolingual run retrieved 52% of total relevant documents
  – Our English-English monolingual run retrieved 91% of total relevant documents
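The slides do not state which runs the 58% and 25% figures compare. One reading that reproduces both numbers is the ratio of the best MAP of each cross-lingual task to the best MAP of the corresponding monolingual task, taken from the MAP columns of the evaluation tables. The sketch below checks that arithmetic; the pairing of runs is an assumption.

```python
# Best MAP per task, read from the evaluation tables.
BEST_MAP = {
    "English-Hindi": 0.1538,    # Title + Desc
    "Hindi-Hindi": 0.2652,      # Title + Narr
    "Hindi-English": 0.1204,    # Title + Narr
    "English-English": 0.4863,  # Title + Narr
}

def percent_of_monolingual(cross, mono):
    """Cross-lingual MAP as a rounded percentage of monolingual MAP."""
    return round(100 * BEST_MAP[cross] / BEST_MAP[mono])

print(percent_of_monolingual("English-Hindi", "Hindi-Hindi"))
print(percent_of_monolingual("Hindi-English", "English-English"))
```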


Analysis

• Our English-Hindi CLIR performance can be attributed to factors like
  – Exact matching of English named entities
  – Good coverage of English words in our lexicons

• The relatively lower performance on Hindi-English CLIR is due to
  – Low dictionary coverage
  – Query formulation that was not complex enough


FUTURE WORK


Future Work

• Error analysis on a per-topic basis
• Work on more complex query formulations
• Work on other possible query translation techniques like
  – Building dictionaries from parallel corpora
  – Using the web
  – Using Wikipedia


THANK YOU!!!

