web search queries: an emerging new language

26
Forum for Information Retrieval Evaluation 2013 (FIRE '13) New Delhi, India Rishiraj Saha Roy Monojit Choudhury IIT Kharagpur Microsoft Research India Prasenjit Majumder Komal Agarwal DAIICT Gandhinagar

Upload: others

Post on 12-Mar-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Forum for Information Retrieval Evaluation 2013 (FIRE '13)

New Delhi, India

Rishiraj Saha Roy Monojit Choudhury

IIT Kharagpur Microsoft Research India

Prasenjit Majumder Komal Agarwal

DAIICT Gandhinagar

Song Lyrics

Facebook and Twitter

And lot more

04 December 2013 FIRE 2013 Track on Transliterated Search 5

(Pilot) Track in first year

Focused on basics required for search in transliterated space

Subtask 1

Query word labeling

Subtask 2

Multi-script ad hoc retrieval

04 December 2013 FIRE 2013 Track on Transliterated Search 6

Label words of a query as English or L

Subtask presented for three language pairs

English-Hindi

English-Bangla

English-Gujarati

If labeled as L, generate transliteration in native script

Process of back transliteration

Evaluation excludes OOV named entities

04 December 2013 FIRE 2013 Track on Transliterated Search 7

Input

door ke dhol song lyrics

electric tar best company ki

shu tame mane prem karo

Output

door\H=เคฆเคฐเฅ‚ ke\H=เค•เฅ‡ dhol\H=เคขเฅ‹เคฑ song\E lyrics\E

electric\E tar\B=เฆคเฆพเฆฐ best\E company\E ki\B=เฆ•เฆฟ

shu\G=เชถ เซเช‚ tame\G=เชคเชฎเซ‡ mane\G=เชฎเชจเซ‡ prem\G=เชชเซเชฐเซ‡เชฎ karo\G=เช•เชฐเซ‹

04 December 2013 FIRE 2013 Track on Transliterated Search 8

Retrieve top ten relevant documents for a query

Query in Roman script

Bollywood song text

Large corpus of mixed script Documents

Roman/Devanagari/Both

Documents contain song lyrics

04 December 2013 FIRE 2013 Track on Transliterated Search 9

Query: geeto ki rut aur rangon ki barkha

Document

เค•เฅ‹เคˆ เคœเฅ‹ เคฎเคฟเคฑเคพ เคคเฅ‹ เคฟเฅเค เฅ‡เคเคธเคพ เคฑเค—เคคเคพ เคฅเคพ เคœเคธเฅ‡เฅˆ เคฟเฅ‡เคฐเฅ€ เคธเคพเคฐเฅ€ เคฆเคจเฅเคฟเคฏเคพ เคฟเฅ‡เค‚ เค—เฅ€เคคเฅ‹เค‚ เค•เฅ€ เคฐเฅเคค เค”เคฐ เคฐเค‚เค—เฅ‹เค‚ เค•เฅ€ เคฌเคฐเค–เคพ เคนเฅˆ Khushboo ki andhee hai

Mehki huee si ab saree fizayein hain

04 December 2013 FIRE 2013 Track on Transliterated Search 10

General purpose

Specific to Subtask 1

Specific to Subtask 2

Info on datasets at

http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html

04 December 2013 FIRE 2013 Track on Transliterated Search 11

Word frequency lists: English, Hindi, Gujarati

Word transliteration pairs

Hindi: Alignment of song lyrics [Gupta et al., 2012]

Bangla: Annotations collected from chat, dictation setups

[Sowmya et al. 2010]

Gujarati: Toy set, processed from FIRE 2013 data

Large language corpora (Leipzig)

ITRANS to UTF-8 converter

04 December 2013 FIRE 2013 Track on Transliterated Search 12

Hindi

1000 queries โ€“ 500 development set, 500 test set

Bangla

200 queries โ€“ 100 development set, 100 test set

Gujarati

300 queries โ€“ 150 development set, 150 test set

~1000, ~300, ~500 translit pairs in dev sets

Not all entries technically search โ€œqueriesโ€

04 December 2013 FIRE 2013 Track on Transliterated Search 13

Carefully crafted with instances of language words with valid

English dictionary entries

door, tan, man (Hindi), tar, pore, ache (Bangla); tame, mane, mate

(Guajrati)

Created and annotated by respective native speakers

Future plans

Enrich and expand with more quality control

Looking for partners for more languages!!

04 December 2013 FIRE 2013 Track on Transliterated Search 14

50 hand crafted queries in Roman script โ€“ 25 dev, 25 test

About 63,000 documents in pure/mixed scripts

Documents collected by crawling ~15 popular Bollywood lyrics

domains like dhingana, musicmaza and hindilyrix

XML documents parsed and cleaned to contain only lyrics text

Around 28 relevance judgments per query (6-point scale) after

pooling using several baselines

04 December 2013 FIRE 2013 Track on Transliterated Search 15

Initial show of interest from 17 teams

5 teams participated, 25 runs submitted

India: ISM Dhanbad, Gujarat University (GU), Microsoft

Research India (MSRI)

Abroad: TU Valencia (TU-V), NTNU Norway

MSRI participating but non-competing

04 December 2013 FIRE 2013 Track on Transliterated Search 16

Subtask 1: ISM, GU, MSRI, TU-V, NTNU (17 runs)

Hindi: 10 runs (all 5 teams)

Bangla: 4 runs (NTNU, MSRI)

Gujarati: 3 runs (MSRI)

Subtask 2: NTNU, TU-V, GU (8 runs)

๐ธ๐‘ฅ๐‘Ž๐‘๐‘ก ๐‘„๐‘ข๐‘’๐‘Ÿ๐‘ฆ ๐‘€๐‘Ž๐‘ก๐‘โ„Ž ๐น๐‘Ÿ๐‘Ž๐‘๐‘ก๐‘–๐‘œ๐‘› =

#(๐‘„๐‘ข๐‘’๐‘Ÿ๐‘–๐‘’๐‘  ๐‘“๐‘œ๐‘Ÿ ๐‘ค๐‘•๐‘–๐‘๐‘• ๐‘™๐‘Ž๐‘›๐‘” ๐‘™๐‘Ž๐‘๐‘’๐‘™๐‘  ๐‘Ž๐‘›๐‘‘ ๐‘ก๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘  ๐‘š๐‘Ž๐‘ก๐‘๐‘• ๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก๐‘™๐‘ฆ)

#(๐ด๐‘™๐‘™ ๐‘ž๐‘ข๐‘’๐‘Ÿ๐‘–๐‘’๐‘ )

๐ธ๐‘ฅ๐‘Ž๐‘๐‘ก ๐‘‡๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘› ๐‘ƒ๐‘Ž๐‘–๐‘Ÿ๐‘  ๐‘€๐‘Ž๐‘ก๐‘โ„Ž =

#(๐‘ƒ๐‘Ž๐‘–๐‘Ÿ๐‘  ๐‘“๐‘œ๐‘Ÿ ๐‘ค๐‘•๐‘–๐‘๐‘• ๐‘ก๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘›๐‘  ๐‘š๐‘Ž๐‘ก๐‘๐‘• ๐‘’๐‘ฅ๐‘Ž๐‘๐‘ก๐‘™๐‘ฆ)

#(๐‘ƒ๐‘Ž๐‘–๐‘Ÿ๐‘  ๐‘“๐‘œ๐‘Ÿ ๐‘ค๐‘•๐‘–๐‘๐‘• ๐‘๐‘œ๐‘ก๐‘• ๐‘œ/๐‘ ๐‘Ž๐‘›๐‘‘ ๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’ ๐‘™๐‘Ž๐‘๐‘’๐‘™๐‘  ๐‘Ž๐‘Ÿ๐‘’ ๐ฟ)

Motivation: Exactly one correct answer for back transliteration

Some cases of normalization have been handled

Thanks to Spandana from MSRI!!

04 December 2013 FIRE 2013 Track on Transliterated Search 17

04 December 2013 FIRE 2013 Track on Transliterated Search 18

๐‘‡๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘› ๐‘ƒ๐‘Ÿ๐‘’๐‘๐‘–๐‘ ๐‘–๐‘œ๐‘› ๐‘‡๐‘ƒ = #(๐ถ๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก ๐‘ก๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘›๐‘ )

#(๐บ๐‘’๐‘›๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘’๐‘‘ ๐‘ก๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘›๐‘ )

๐‘‡๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘› ๐‘…๐‘’๐‘๐‘Ž๐‘™๐‘™ ๐‘‡๐‘… = #(๐ถ๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก ๐‘ก๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘›๐‘ )

#(๐‘…๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’ ๐‘ก๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘›๐‘ )

๐‘‡๐‘Ÿ๐‘Ž๐‘›๐‘ ๐‘™๐‘–๐‘ก๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘–๐‘œ๐‘› ๐นโ€“ ๐‘†๐‘๐‘œ๐‘Ÿ๐‘’ = 2 โˆ—๐‘‡๐‘ƒ โˆ—๐‘‡๐‘…

๐‘‡๐‘ƒ+๐‘‡๐‘…

04 December 2013 FIRE 2013 Track on Transliterated Search 19

๐ฟ๐‘Ž๐‘๐‘’๐‘™๐‘–๐‘›๐‘” ๐‘Ž๐‘๐‘๐‘ข๐‘Ÿ๐‘Ž๐‘๐‘ฆ =

#(๐ถ๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก ๐‘™๐‘Ž๐‘๐‘’๐‘™ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘ )

# ๐ถ๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก ๐‘™๐‘Ž๐‘๐‘’๐‘™ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘  + #(๐ผ๐‘›๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก ๐‘™๐‘Ž๐‘๐‘’๐‘™ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘ )

๐ธ๐‘›๐‘”๐‘™๐‘–๐‘ โ„Ž ๐‘ƒ๐‘Ÿ๐‘’๐‘๐‘–๐‘ ๐‘–๐‘œ๐‘› ๐ธ๐‘ƒ = #(๐ธโˆ’๐ธ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘ )

# ๐ธโˆ’๐ธ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘  +#(๐ธโˆ’๐ฟ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘ )

๐ธ๐‘›๐‘”๐‘™๐‘–๐‘ โ„Ž ๐‘…๐‘’๐‘๐‘Ž๐‘™๐‘™ ๐ธ๐‘… = #(๐ธโˆ’๐ธ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘ )

# ๐ธโˆ’๐ธ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘  +#(๐ฟโˆ’๐ธ ๐‘๐‘Ž๐‘–๐‘Ÿ๐‘ )

๐ธ๐‘›๐‘”๐‘™๐‘–๐‘ โ„Ž ๐นโ€“ ๐‘†๐‘๐‘œ๐‘Ÿ๐‘’ = 2 โˆ—๐ธ๐‘ƒ โˆ—๐ธ๐‘…

๐ธ๐‘ƒ+๐ธ๐‘…

Similarly LP, LR, and LF are computed

04 December 2013 FIRE 2013 Track on Transliterated Search 20

nDCG@5, nDCG@10

๐ท๐ถ๐บ@๐‘ = ๐‘Ÿ๐‘’๐‘™1 + ๐‘Ÿ๐‘’๐‘™๐‘–

log2 ๐‘–

๐‘๐‘–=2 ; ๐‘›๐ท๐ถ๐บ@๐‘ =

๐ท๐ถ๐บ@๐‘

๐ผ๐ท๐ถ๐บ@๐‘

MAP

๐ด๐‘ฃ๐‘’ ๐‘ƒ = (๐‘ƒ ๐‘˜ ร— ๐‘Ÿ๐‘’๐‘™(๐‘˜))๐‘›๐‘˜=1

#๐‘…๐‘’๐‘™.๐‘‘๐‘œ๐‘๐‘ ; ๐‘€๐ด๐‘ƒ =

๐ด๐‘ฃ๐‘’(๐‘ƒ)๐‘„๐‘ž=1

๐‘„

๐‘€๐‘…๐‘… = 1

|๐‘„|

1

๐‘Ÿ๐‘Ž๐‘›๐‘˜๐‘–

|๐‘„|๐‘–=1

04 December 2013 FIRE 2013 Track on Transliterated Search 21

Detailed metric values and approaches coming up soon in

participant talks

Subtask 1:

Transliteration F-score (Hindi): 0.8130

Transliteration F-score (Bangla): 0.5137

Transliteration F-score (Gujarati): 0.4803

Subtask 2:

nDCG@10: 0.8002

04 December 2013 FIRE 2013 Track on Transliterated Search 22

Winners (several very close results!!)

Subtask 1 (Hindi): TU-Valencia [Best on 5/12 metrics]

Subtask 1 (Bangla): NTNU-Norway [Best on 12/12 metrics]

Subtask 1 (Gujarati): None

Subtask 2: TU-Valencia [Best on 4/4 metrics]

MSRI topped Subtask 1 but was non-competing

Congratulations to all!!

04 December 2013 FIRE 2013 Track on Transliterated Search 23

Encouraging response to task in first year โ€“ why the dropouts?

Metric values reflect room for improvement (grain of salt)

Extend to at least one non-Indian language (Arabic?)

Extend to at least Dravidian language (Kannada?)

Want to enrich datasets in a shared environment โ€“ in process

Plans to create awareness on importance of transliteration for

IR like organizing workshops โ€“ please visit http://bit.ly/1k7pG55

04 December 2013 FIRE 2013 Track on Transliterated Search 24

CMU

Rohan Ramanath

IIT Kharagpur

M. Dastagiri Reddy

Ranita Biswas

Swadhin Pradhan

Yogarshi Vyas

Entire FIRE team for making this track possible!

04 December 2013 FIRE 2013 Track on Transliterated Search 26

Looking forward to increased participation at FIRE 2014!!

Primary contact: [email protected]