Flexible and Efficient Toolbox for Information Retrieval (MIRACLE group)
![Page 1: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/1.jpg)
Flexible and Efficient Toolbox for Information Retrieval
MIRACLE group
José Miguel Goñi-Menoyo (UPM)
José Carlos González-Cristóbal (UPM-Daedalus)
Julio Villena-Román (UC3M-Daedalus)
![Page 2: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/2.jpg)
Our approach
- New Year’s Resolution: work with all languages in CLEF (adhoc, image, web, geo, iclef, qa…)
- Wish list:
  - Language-dependent stuff
  - Language-independent stuff
  - Versatile combination
  - Fast
  - Simple for non computer scientists
- Not to reinvent the wheel again every year!
- Approach: Toolbox for information retrieval
![Page 3: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/3.jpg)
Agenda
- Toolbox
- 2005 Experiments
- 2005 Results
- 2006 Homework
![Page 4: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/4.jpg)
Toolbox Basics
- Toolbox made of small one-function tools
- Processing as a pipeline (borrowed from Unix): each tool combination leads to a different run approach
- Shallow I/O interfaces:
  - tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…),
  - with different design approaches, and
  - from different sources (own development, downloading, …)
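The pipeline idea can be illustrated with a minimal Python sketch. It is not the MIRACLE code: the tool names (`tokenize`, `lowercase`, `drop_stopwords`), the toy stop-word list, and the `pipeline` helper are all hypothetical, chosen only to show how small one-function tools with a shared, shallow interface (a list of tokens in, a list of tokens out) can be recombined into different runs.

```python
from functools import reduce
from typing import Callable, List

# Shallow interface (assumption): every tool maps a token list to a token list.
Tool = Callable[[List[str]], List[str]]

def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: split on whitespace only."""
    return text.split()

def lowercase(tokens: List[str]) -> List[str]:
    """Transforming tool: lowercase every token."""
    return [t.lower() for t in tokens]

def drop_stopwords(tokens: List[str]) -> List[str]:
    """Filtering tool with a toy stop-word list (illustration only)."""
    stopwords = {"the", "of", "and", "a"}
    return [t for t in tokens if t not in stopwords]

def pipeline(*tools: Tool) -> Tool:
    """Chain tools left to right, like `a | b | c` in a Unix shell."""
    return lambda tokens: reduce(lambda acc, tool: tool(acc), tools, tokens)

# Two different tool combinations give two different run approaches.
run_a = pipeline(lowercase, drop_stopwords)
run_b = pipeline(lowercase)

tokens = tokenize("The History of the European Union")
print(run_a(tokens))
print(run_b(tokens))
```

Because each tool only sees and returns a token list, a tool written elsewhere (or in another language behind a thin wrapper) could be dropped into the chain without touching the others.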
![Page 5: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/5.jpg)
MIRACLE Tools
- Tokenizer:
  - pattern matching
  - isolate punctuation
  - split sentences, paragraphs, passages
  - identifies some entities: compounds, numbers, initials, abbreviations, dates
  - extracts indexing terms
  - own-development (written in Perl) or “outsourced”
- Proper noun extraction
  - Naive algorithm: Uppercase words unless stop-word, stop-clef or verb/adverb
- Stemming: generally “outsourced”
- Transforming tools: lowercase, accents and diacritical characters are normalized, transliteration
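Two of the tools above are simple enough to sketch in Python. This is an illustration under stated assumptions, not the group's Perl code: `strip_diacritics` uses Unicode NFD decomposition as one possible way to normalize accents, and `proper_nouns` takes a single hypothetical `excluded` set standing in for the stop-word, stop-clef and verb/adverb lists.

```python
import unicodedata
from typing import List, Set

def strip_diacritics(word: str) -> str:
    """Lowercase a word and remove combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def proper_nouns(tokens: List[str], excluded: Set[str]) -> List[str]:
    """Naive rule: keep capitalized tokens unless their lowercase form is
    in `excluded` (stop-words, stop-clefs, verbs/adverbs merged into one
    hypothetical set for this sketch)."""
    return [t for t in tokens if t[:1].isupper() and t.lower() not in excluded]

topic = ["Find", "documents", "about", "Madrid", "and", "the", "European", "Union"]
excluded = {"find", "documents", "relevant"}  # toy list, illustration only

print(proper_nouns(topic, excluded))
print(strip_diacritics("Neuchâtel"))
```

The sentence-initial "Find" is capitalized but filtered out because it appears in the exclusion set, which is the case the "unless" clause of the naive rule is there to handle.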
![Page 6: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/6.jpg)
More MIRACLE Tools
- Filtering tools:
  - stop-words and stop-clefs
  - phrase pattern filter (for topics)
- Automatic translation issues: “outsourced” to available on-line resources or desktop applications
  - Bultra (En→Bu)
  - Webtrance (En→Bu)
  - AutTrans (Es→Fr, Es→Pt)
  - MoBiCAT (En→Hu)
  - Systran
  - BabelFish Altavista
  - Babylon
  - FreeTranslation
  - Google Language Tools
  - InterTrans
  - WordLingo
  - Reverso
- Semantic expansion
  - EuroWordNet
  - own resources for Spanish
- The philosopher's stone: indexing and retrieval system
![Page 7: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/7.jpg)
Indexing and Retrieval System
- Implements boolean, vectorial and probabilistic BM25 retrieval models
  - Only BM25 was used in CLEF 2005
  - Only OR operator was used for terms
- Native support for UTF-8 (and other) encodings
  - No transliteration scheme is needed
  - Good results for Bulgarian
- More efficiency achieved than with previous engines
  - Several orders of magnitude in indexing time
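For readers unfamiliar with BM25, the following Python sketch scores documents with a common textbook form of the Okapi BM25 formula, using OR semantics (a document is scored if it contains at least one query term). It is not the MIRACLE engine: the function name, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative idf variant are assumptions made for the example, and the slides do not state which values were used.

```python
import math
from collections import Counter
from typing import Dict, List

def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document that contains at least one query term (OR)."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # OR semantics: a missing term simply adds nothing
            # Non-negative idf variant (assumption for this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score
    return scores

docs = {
    "d1": ["money", "monk"],
    "d2": ["monk", "month", "monk"],
    "d3": ["calm", "cast"],
}
scores = bm25_scores(["monk", "coating"], docs)
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 3))
```

Here "coating" matches nothing, yet `d1` and `d2` are still returned because "monk" matches; `d3` contains neither term and is left out.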
![Page 8: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/8.jpg)
Trie-based index
calm, cast, coating, coat, money, monk, month
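The figure for this slide is not preserved in the transcript, so here is a minimal Python sketch of a trie built from the same example words. It uses one dictionary per node, which is a generic textbook layout and not necessarily the one used in the MIRACLE engine; the class and method names are hypothetical. In a real index the terminal node would presumably also carry the term's posting list, which is omitted here.

```python
from typing import Dict, List, Optional

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True when a stored word ends at this node

class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text: str) -> Optional[TrieNode]:
        """Follow `text` from the root; return None if the path breaks."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix: str) -> List[str]:
        """All stored words starting with `prefix`, in sorted order."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)

trie = Trie()
for word in ["calm", "cast", "coating", "coat", "money", "monk", "month"]:
    trie.insert(word)

print(trie.contains("coat"), trie.contains("coa"))
print(trie.with_prefix("mon"))
```

Words that share a prefix ("money", "monk", "month") share the nodes for "mon", which is what keeps both storage and lookup proportional to word length rather than vocabulary size.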
![Page 9: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/9.jpg)
1st course implementation: linked arrays
calm, cast, coating, coat, money, monk, month
![Page 10: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/10.jpg)
Efficient tries: avoiding empty cells
abacus, abet, ace, baby, be, beach, bee
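To see why empty cells matter, the Python sketch below builds a trie in which every node is a fixed-size array with one cell per letter, then counts how many cells are actually used for the example words. This only illustrates the problem: the layout, the 26-letter alphabet, and the function name are assumptions, and the transcript does not preserve the technique the authors used to avoid the waste. Word-end markers are left out for brevity.

```python
from typing import List, Optional

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase ASCII only

def build_array_trie(words: List[str]) -> List[List[Optional[int]]]:
    """Each node is a list with one cell per letter; a cell holds the index
    of the child node, or None when there is no child."""
    nodes: List[List[Optional[int]]] = [[None] * len(ALPHABET)]  # node 0 = root
    for word in words:
        current = 0
        for ch in word:
            slot = ALPHABET.index(ch)
            if nodes[current][slot] is None:
                nodes.append([None] * len(ALPHABET))
                nodes[current][slot] = len(nodes) - 1
            current = nodes[current][slot]
    return nodes

words = ["abacus", "abet", "ace", "baby", "be", "beach", "bee"]
nodes = build_array_trie(words)
total_cells = len(nodes) * len(ALPHABET)
used_cells = sum(cell is not None for node in nodes for cell in node)

print(len(nodes), "nodes,", total_cells, "cells,", used_cells, "used")
```

With one full array per node, almost every cell stays empty for a vocabulary like this one, which is the overhead that a more compact node layout is meant to remove.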
![Page 11: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/11.jpg)
Basic Experiments
- S: Standard sequence (tokenization, filtering, stemming, transformation)
- N: Non stemming
- R: Use of narrative field in topics
- T: Ignore narrative field
- r1: Pseudo-relevance feedback (with 1st retrieved document)
- P: Proper noun extraction (in topics)
- Runs: SR, ST, r1SR, NR, NT, NP
![Page 12: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/12.jpg)
Paragraph indexing
- H: Paragraph indexing
  - docpars (document paragraphs) are indexed instead of docs
  - term → doc1#1, doc69#5 …
  - combination of docpars relevance:

    $$\mathrm{rel}_N = \mathrm{rel}_{mN} + \frac{\alpha}{n} \sum_{j \neq m} \mathrm{rel}_{jN}$$

    - n = paragraphs retrieved for doc N
    - rel_{jN} = relevance of paragraph j of doc N
    - m = paragraph with maximum relevance
    - α = 0.75 (experimental)
- Runs: HR, HT
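The relevance combination for paragraphs can be written as a short Python function. This is a sketch of the formula on this slide (best paragraph score plus α/n times the sum of the remaining paragraph scores), not the group's code; the function name and the input layout (a list of retrieved paragraph scores per document id) are assumptions.

```python
from typing import Dict, List

def combine_docpars(par_scores: Dict[str, List[float]],
                    alpha: float = 0.75) -> Dict[str, float]:
    """Document relevance = best paragraph score
    + (alpha / n) * sum of the other retrieved paragraph scores."""
    combined: Dict[str, float] = {}
    for doc_id, scores in par_scores.items():
        n = len(scores)             # paragraphs retrieved for this document
        best = max(scores)          # paragraph m, the one with maximum relevance
        rest = sum(scores) - best   # sum over the paragraphs j != m
        combined[doc_id] = best + alpha / n * rest
    return combined

# Hypothetical scores: doc1 has two retrieved paragraphs, doc69 only one.
combined = combine_docpars({"doc1": [4.0, 2.0], "doc69": [4.5]})
print(combined)
```

In this toy input `doc1` ends up ahead of `doc69` even though its best paragraph scores lower, because its second paragraph contributes 0.75 / 2 × 2.0 = 0.75.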
![Page 13: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/13.jpg)
Combined experiments
- “Democratic system”: documents with good score in many experiments are likely to be relevant
- a: Average: merging of several experiments, adding relevance
- x: WDX - asymmetric combination of two experiments:
  - First (more relevant) non-weighted D documents from run A
  - Rest of documents from run A, with W weight
  - All documents from run B, with X weight
  - Relevance re-sorting
  - Mostly used for combining base runs with proper-noun runs
- Runs: aHRSR, aHTST, xNP01HR1, xNP01r1SR1
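Both combination schemes can be sketched in a few lines of Python. The slide leaves several details open, so the following choices are assumptions and are marked as such in the code: `average` divides the summed relevance by the number of runs (which does not change the ranking), and `wdx` adds the two contributions when a document appears in both runs. The function names and the run representation (a mapping from document id to score) are hypothetical.

```python
from typing import Dict, List, Tuple

Run = Dict[str, float]  # doc id -> relevance score

def average(runs: List[Run]) -> Run:
    """'a' combination: add the relevance a document gets in every run.
    Dividing by the number of runs is an assumption; it keeps the ranking."""
    totals: Run = {}
    for run in runs:
        for doc, score in run.items():
            totals[doc] = totals.get(doc, 0.0) + score
    return {doc: total / len(runs) for doc, total in totals.items()}

def wdx(run_a: Run, run_b: Run, w: float, d: int, x: float) -> List[Tuple[str, float]]:
    """'x' combination: the top d docs of run A keep their score, the rest
    of run A is weighted by w, all of run B is weighted by x, and the
    result is re-sorted by relevance."""
    ranked_a = sorted(run_a.items(), key=lambda kv: kv[1], reverse=True)
    merged: Run = {}
    for rank, (doc, score) in enumerate(ranked_a):
        merged[doc] = score if rank < d else w * score
    for doc, score in run_b.items():
        # Assumption: a document found in both runs adds both contributions.
        merged[doc] = merged.get(doc, 0.0) + x * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

run_a = {"d1": 8.0, "d2": 6.0, "d3": 4.0}
run_b = {"d3": 2.0, "d4": 10.0}

print(average([run_a, run_b]))
print(wdx(run_a, run_b, w=0.5, d=1, x=1.0))
```

The asymmetry is visible in the second call: `d1` is protected as the top document of run A, while `d2` and `d3` from the same run are down-weighted before run B's documents are merged in.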
![Page 14: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/14.jpg)
Multilingual merging
- Standard approaches for merging:
  - No normalization and relevance re-sorting
  - Standard normalization and relevance re-sorting
  - Min-max normalization and relevance re-sorting
- Miracle approach for merging:
  - The number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N=1, 10, 50, 125, 250, 1000). Then one of the standard approaches is used
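The two ingredients of this merging step can be sketched in Python: min-max normalization of one ranked list, and a per-collection quota proportional to the average relevance of its first N documents. The function names, the `total` parameter (size of the merged list), and the use of `round` to turn proportions into whole documents are assumptions; the slide does not say how fractional quotas were handled, and rounded quotas are not guaranteed to add up to `total`.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]  # (doc id, score), best first

def min_max(run: Ranked) -> Ranked:
    """Rescale scores to [0, 1]; a run with identical scores maps to 0."""
    scores = [score for _, score in run]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0  # avoid dividing by zero
    return [(doc, (score - low) / span) for doc, score in run]

def quotas(runs: Dict[str, Ranked], n: int, total: int) -> Dict[str, int]:
    """Number of docs to take from each collection, proportional to the
    mean score of its first n retrieved docs."""
    means = {}
    for lang, run in runs.items():
        top = run[:n]
        means[lang] = sum(score for _, score in top) / len(top)
    norm = sum(means.values())
    return {lang: round(total * mean / norm) for lang, mean in means.items()}

# Hypothetical per-language result lists.
runs = {
    "en": [("en1", 9.0), ("en2", 6.0), ("en3", 3.0)],
    "fr": [("fr1", 3.0), ("fr2", 2.0), ("fr3", 1.0)],
}

print(quotas(runs, n=2, total=4))
print(min_max(runs["en"]))
```

After the quotas are applied, the selected documents from all collections would be pooled, normalized with one of the standard approaches, and re-sorted by relevance.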
![Page 15: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/15.jpg)
Results
We performed… countless experiments!
(just for the adhoc task)
![Page 16: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/16.jpg)
Monolingual Bulgarian
- Stemmer (UTF-8): Neuchâtel
- Rank: 4th
![Page 17: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/17.jpg)
Bilingual English→Bulgarian
- (83% monolingual)
- En→Bu: Bultra, Webtrance
- Rank: 1st
![Page 18: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/18.jpg)
Monolingual Hungarian
- Stemmer: Neuchâtel
- Rank: 3rd
![Page 19: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/19.jpg)
Bilingual English→Hungarian
- (87% monolingual)
- En→Hu: MoBiCAT
- Rank: 1st
![Page 20: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/20.jpg)
Monolingual French
- Stemmer: Snowball
- Rank: >5th
![Page 21: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/21.jpg)
Bilingual English→French
- (79% monolingual)
- En→Fr: Systran
- Rank: 5th
![Page 22: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/22.jpg)
Bilingual Spanish→French
- (81% monolingual)
- Es→Fr: ATrans, Systran
- (Rank: 5th)
![Page 23: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/23.jpg)
Monolingual Portuguese
- Stemmer: Snowball
- Rank: >5th (4th)
![Page 24: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/24.jpg)
Bilingual English→Portuguese
- (55% monolingual)
- En→Pt: Systran
- Rank: 3rd
![Page 25: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/25.jpg)
Bilingual Spanish→Portuguese
- (88% monolingual)
- Es→Pt: ATrans
- (Rank: 2nd)
![Page 26: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/26.jpg)
Multilingual-8 (En, Es, Fr)
- Rank: 2nd [Fr, En], 3rd [Es]
![Page 27: Flexible and Efficient Toolbox for Information Retrieval MIRACLE group](https://reader034.vdocument.in/reader034/viewer/2022051516/56813412550346895d9b027d/html5/thumbnails/27.jpg)
Conclusions and homework
- Toolbox = “imagination is the limit”
  - Focus on interesting linguistic things instead of boring text manipulation
  - Reusability (half of the work is done for next year!)
- Keys for good results:
  - Fast IR engine is essential
  - Native character encoding support
  - Topic narrative
  - Good translation engines make the difference
- Homework:
  - further development on system modules, fine tuning
  - Spanish, French, Portuguese…