Shared Task Proposal, FIRE 2012, Monojit Choudhury, Microsoft Research Lab India
![Page 1: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/1.jpg)
Search in Transliterated Space
Shared Task Proposal, FIRE 2012
Monojit Choudhury
Microsoft Research Lab India
![Page 2: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/2.jpg)
A Transliterated World Wide Web
Song Lyrics
![Page 3: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/3.jpg)
A Transliterated World Wide Web
Reviews and Forums
![Page 4: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/4.jpg)
A Transliterated World Wide Web
Facebook and Twitter
![Page 5: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/5.jpg)
A Transliterated World Wide Web
And a lot more
![Page 6: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/6.jpg)
Beyond Indic languages
Many languages that use non-Roman scripts:
Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
Persian
Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)
![Page 7: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/7.jpg)
Aspects of Transliterated Text
Code Mixing
Transliteration
Errors, Contraction
![Page 8: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/8.jpg)
IR Scenario - I
Mono-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni suhanee
Results: Only Roman transliterated documents
Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
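One way to picture the spelling-variation challenge is a normalization key that maps variant spellings of a word to the same string before matching. The sketch below is only an illustration: the function names `variant_key` and `normalize_query` and the handful of rewrite rules are assumptions chosen to cover the example on this slide, not the method proposed for the task.

```python
import re

def variant_key(word):
    """Map a Roman-transliterated word to a rough spelling-insensitive key.

    The rules are hypothetical and tuned only to the slide's example.
    """
    w = word.lower()
    w = w.replace("w", "v")                    # hawa / hava
    w = re.sub(r"(?<=[bcdgjkpt])h", "", w)     # drop 'h' right after a stop: thandee / tandee
    w = re.sub(r"ee|y$", "i", w)               # 'ee' anywhere, or a word-final 'y', becomes 'i'
    w = w.replace("oo", "u")                   # soohaany / suhanee
    w = re.sub(r"(.)\1+", r"\1", w)            # collapse repeated letters: 'aa' -> 'a'
    w = re.sub(r"(?<=[aeiou])h$", "", w)       # drop a word-final 'h' after a vowel: yeh / ye
    return w

def normalize_query(text):
    """Return the list of variant keys for a whitespace-separated query."""
    return [variant_key(tok) for tok in text.split()]

query = normalize_query("thandee hava yeh chandni suhanee")
variant = normalize_query("tandee hawa ye chandny soohaany")
print(query == variant)
```

Hand-written rules like these will over-merge or miss variants on real data; they are meant only to show why exact string matching fails in this scenario.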
![Page 9: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/9.jpg)
IR Scenario - II
Cross-script and Multi-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
Results: Documents in Roman transliteration or in native script
Challenge: Transliteration
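A simple way to think about the cross-script case is query expansion: issue the query in Roman and, where a transliteration is known, also in native script. The sketch below assumes a small hypothetical lookup table `PAIRS` of Roman-to-Devanagari word pairs and a hypothetical helper `expand_query`; a real system would need a learned transliteration model rather than a fixed table.

```python
# Hypothetical Roman -> Devanagari word pairs, for illustration only.
PAIRS = {
    "thandee": "ठंडी",
    "hava": "हवा",
    "yeh": "ये",
    "chandni": "चाँदनी",
}

def expand_query(query, pairs=PAIRS):
    """Return the Roman query plus a native-script variant when one exists.

    Words missing from the table are left in Roman inside the variant.
    """
    tokens = query.lower().split()
    roman = " ".join(tokens)
    native = " ".join(pairs.get(tok, tok) for tok in tokens)
    # If no token could be mapped, there is nothing to add.
    return [roman] if native == roman else [roman, native]

for variant in expand_query("thandee hava yeh chandni"):
    print(variant)
```

Both variants could then be sent to the index, so that Roman and native-script documents are retrieved for the same information need.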
![Page 10: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/10.jpg)
IR Scenario - III
Cross-script and Cross-lingual IR
Query: death of mareech and subahoo
Documents: Hindi (transliterated and Devanagari) and English documents
![Page 11: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/11.jpg)
Shared Task on Retrieval
| Task | Query | Documents |
| --- | --- | --- |
| Mono-script Monolingual IR | Transliterated query in Roman | Transliterated documents in Roman |
| Cross-script Monolingual IR | Transliterated query in Roman | Transliterated documents in native script |
| Multi-script Monolingual IR | Query in Roman or native script | Documents in Roman and native scripts |
![Page 12: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/12.jpg)
Shared Sub-Tasks
Language identification of transliterated queries, documents, code-mixed text
kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML
Transliteration
Forward: കഴിക്കാന് → kazhikkan
Backward: kazhikkan → കഴിക്കാന്
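The word-level language identification sub-task can be sketched as a tagger that labels each token of a code-mixed sentence. The example below is a minimal baseline under stated assumptions: `ENGLISH_VOCAB` is a tiny hypothetical word list and `tag_tokens` is an illustrative helper that falls back to the `ML` (Malayalam) tag for anything not in that list.

```python
# Hypothetical English word list; a real system would load a full lexicon.
ENGLISH_VOCAB = {"split", "pea", "soup", "the", "and"}

def tag_tokens(text, english_vocab=ENGLISH_VOCAB, other_tag="ML"):
    """Label each whitespace-separated token EN if it is in the word list, else other_tag."""
    return [
        (tok, "EN" if tok.lower() in english_vocab else other_tag)
        for tok in text.split()
    ]

tagged = tag_tokens("kooda kazhikkan oru urgan split pea soup undaki")
print(" ".join(f"{tok}/{tag}" for tok, tag in tagged))
```

A lexicon lookup like this breaks down on words that are valid in both languages and on misspelled English, which is part of what makes the sub-task non-trivial.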
![Page 13: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/13.jpg)
Available Data
20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics
More data is under preparation from Facebook, covering a mixture of various languages.
Looking for partners to extend!
![Page 14: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/14.jpg)
Available Data
Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics
Looking for partners to extend it to other (Indian) languages
Other domains?
![Page 15: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/15.jpg)
Thank you!
![Page 16: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/16.jpg)
Other resources
Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers
Anything else?
![Page 17: Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India](https://reader036.vdocument.in/reader036/viewer/2022062321/56649dc65503460f94ab9a43/html5/thumbnails/17.jpg)
Concluding Remarks
We have built a Multi-script Bollywood Song Search and are working on transliteration and code-mixing
These are just some initial ideas that came out of our experience
If you are interested, please let me know