Persian@CLEF: Current and Future Research Directions
University of Tehran, Database Research Group
Abolfazl AleAhmad, Ehsan Darrudi, Hadi Amiri, Azadeh Shakery, Farhad Oroumchian
1 October 2009
Outline
• Why Persian IR
• Language Resources for Persian
• Hamshahri at CLEF 2009
• Persian@CLEF 2009 participants
• Persian@CLEF 2009 results
• Persian@CLEF 2009 pool analysis
• Future works
Persian in the Middle East
[Figure: User Population Growth on the Web (2000-2009)]
Source: Internet World Stats, http://internetworldstats.com/
Why Persian IR
Updated in June 2009 from Internet World Stats
The Persian Language
• A branch of the Indo-European languages
• Official language of Iran, Afghanistan and Tajikistan
• Its morphological analysis is comparatively difficult
The word “خبر” has two plural forms:
• Persian rules: “خبرها”
• Arabic rules: “اخبار”
Writing style issues:
• e.g. “می شود” and “میشود” are the same
• e.g. “کتابها” and “کتاب ها” are the same
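The writing style issue above (the same word written with or without a separator before "ها" or after "می") can be handled by a normalization pass before indexing. The following is a minimal sketch, not the method used by any CLEF participant. The function name normalize_spacing and the choice to join the affixes (rather than insert a zero-width non-joiner) are assumptions made for illustration.

```python
import re

ZWNJ = "\u200c"          # zero-width non-joiner, often typed instead of a space
MI = "\u0645\u06cc"      # the verbal prefix "می"
HA = "\u0647\u0627"      # the plural suffix "ها"

def normalize_spacing(text: str) -> str:
    """Join a detached "می" prefix and "ها" suffix to the adjacent word.

    Hypothetical helper: it maps the spaced, ZWNJ and joined spellings
    of a word to one joined form so that they index identically.
    """
    # "می" at the start of a word, followed by space/ZWNJ and another letter
    text = re.sub(rf"(?<![\u0600-\u06FF]){MI}[ {ZWNJ}]+(?=[\u0600-\u06FF])", MI, text)
    # "ها" at the end of a word, preceded by a letter and space/ZWNJ
    text = re.sub(rf"(?<=[\u0600-\u06FF])[ {ZWNJ}]+{HA}(?![\u0600-\u06FF])", HA, text)
    return text
```

This is a heuristic: a standalone word spelled "می", or a standalone "ها", would also be joined to its neighbor, so a production normalizer would likely need a lexicon check.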
Persian Test Collections
Text IR Domain
• Ghavanin (domain specific)
• Hamshahri (news): http://ece.ut.ac.ir/dbrg/hamshahri
• Hamshahri 2 (recently developed, 50 topics)
Web IR Domain
• FWT1m (.ir Web), nearly 1 million docs
NLP Domain
• Bijankhan (2.7 million words): http://ece.ut.ac.ir/dbrg/bijankhan
Hamshahri at CLEF 2008 & 2009
• News articles of the Hamshahri newspaper from 1996 to 2002
• 100 bilingual topics
• 166,000+ documents
Hamshahri 2
• News articles of the Hamshahri newspaper from 1996 to 2008
• 50 bilingual topics
• 320,000 documents (2 times larger, ~1.5 GB)
• Richer document tags
Persian@CLEF2009 - Participants
1. JHU-APL
• N-gram tokenization (skip n-grams for n=5)
2. Unine
• Developed “light” and “plural” stemmers and blind query expansion
3. Open Text
• Savoy’s stemmer and 4-grams
• Pool analysis (with top 10,000 retrieved docs)
4. Quazvin IAU
• Perstem for monolingual runs (Prec +91%, Rec +43%)
• “Query Wikification” algorithm for bilingual runs
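Several participants indexed character n-grams instead of (or alongside) stemmed words. As a rough illustration of plain 4-gram tokenization, the sketch below splits each whitespace-separated token into overlapping character n-grams. The function name char_ngrams is hypothetical, and restricting n-grams to within-word spans is an assumption; the participants' actual tokenizers (including the skip n-gram variant) may differ and are not reproduced here.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return overlapping character n-grams for each whitespace-separated token.

    Tokens no longer than n characters are kept whole (an assumption,
    so that short words still produce an index term).
    """
    grams = []
    for token in text.split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams
```

For example, the two plural forms from the earlier slide produce different 4-grams: char_ngrams("خبرها") yields two grams, the first of which still covers the stem "خبر", while "اخبار" shares no 4-gram with it. This is one reason n-grams help with suffixing morphology but not with Arabic broken plurals.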
Persian@CLEF2009 - Final Results
Persian@CLEF2008 - Final Results
Pool of CLEF 2008
[Figure: per-query pool composition. Y-axis: Query Number (551-600); X-axis: Number of Documents (0-800); series: Relevant, Not Relevant]
Pool of CLEF 2009
[Figure: per-query pool composition. Y-axis: Query Number (601-650); X-axis: Number of Documents (0-700); series: Relevant, Not Relevant]
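Per-query counts like those plotted in the pool figures can be derived from a relevance judgment file. The sketch below assumes the judgments are in the common TREC-style four-column format (topic, iteration, document id, relevance); that format, the function name pool_counts, and the document ids in the usage note are assumptions for illustration, not details taken from the slides.

```python
from collections import Counter

def pool_counts(qrels_lines):
    """Count relevant and non-relevant pooled documents per topic.

    Assumes each line reads: topic iteration docno relevance
    Lines that do not have exactly four fields are skipped.
    Returns {topic: (relevant, not_relevant)} with topics in sorted order.
    """
    relevant, not_relevant = Counter(), Counter()
    for line in qrels_lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        topic, _iteration, _docno, rel = parts
        if int(rel) > 0:
            relevant[topic] += 1
        else:
            not_relevant[topic] += 1
    topics = sorted(set(relevant) | set(not_relevant))
    return {t: (relevant[t], not_relevant[t]) for t in topics}
```

Calling pool_counts(["601 0 DOC-A 1", "601 0 DOC-B 0", "602 0 DOC-C 1"]) with these made-up document ids gives one relevant and one non-relevant document for topic 601 and one relevant document for topic 602.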
Persian@CLEF- Pool Comparison
Quoted from: Stephen Tomlinson. German, French, English and Persian Retrieval Experiments at CLEF 2008 & 2009. Working Notes for the CLEF 2008 & 2009 Workshops.
[Figure: pool comparison, panels labeled 2009 and 2008]
Future Works
• Using Hamshahri 2 for CLEF 2010 (50 training topics)
• A campaign on the Persian Web IR collection
• Creation of an English-Persian parallel corpus
• Creation of a comparable corpus
• A stemmer for the Persian language
http://ece.ut.ac.ir/dbrg