![Page 1: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/1.jpg)
Overview of an Information Retrieval System: Terrier
Ashish
![Page 2: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/2.jpg)
Overview
• Structural view
  – Indexing
  – Retrieval
• Extend
• Setup
• Run
![Page 3: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/3.jpg)
IR Systems
• Terrier
  – Academic / research
  – Open source
• Lucene-Nutch
  – Commercial / research
  – Open source
![Page 4: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/4.jpg)
Terrier
• Developed at the University of Glasgow
• Open source
• OS independent: written in Java
• Easy to learn
• Easy to extend
  – Modular
![Page 5: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/5.jpg)
Subfolders (1)
• etc/
  – Configuration files
• bin/
  – Scripts to compile and run Terrier
• lib/
  – Java libraries: jar files containing the Terrier system
![Page 6: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/6.jpg)
Subfolders (2)
• src/
  – Java source files and user-written plugins
• doc/
  – Javadocs for Terrier and for extended components
• var/
  – index/: index files
  – results/: results and evaluation
• share/
  – Shared resources such as stopwords, lexicon, etc.
![Page 7: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/7.jpg)
Indexing
![Page 8: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/8.jpg)
Tokenization
• Identifying words
  – Based on spaces
  – Handling special characters such as -, $, digits, etc.
  – Sometimes a space is not the word separator
    • German, Chinese
  – Agglutinative languages
    • Marathi
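A minimal sketch of rule-based tokenization for English-like text. The regular expression, and the choice to keep hyphenated words and decimal numbers as single tokens, are illustrative assumptions and not Terrier's own tokeniser:

```python
import re

# Assumption: a token is either a (possibly hyphenated) alphabetic word
# or a number with an optional decimal part. Symbols such as "$" are skipped.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+(?:\.\d+)?")


def tokenize(text):
    """Return the lowercased tokens found in text, in order."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


tokens = tokenize("State-of-the-art IR costs $25.50, not 30!")
print(tokens)
```

A pattern like this would not work for languages where spaces do not delimit words; those need a language-specific segmenter.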
![Page 9: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/9.jpg)
Term Pipelining
• Stemming / finding the root
  – ate -> eat
• Stopword removal
  – is, was, I, in, etc.
• Abbreviations
  – Dr -> Doctor
• Normalisation
  – color vs. colour
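The pipeline idea can be sketched as a chain of small functions, each of which either transforms a term or drops it by returning None. The word lists, the stage order and the suffix-stripping stemmer below are hypothetical stand-ins for illustration; a real stemmer is considerably more careful, and irregular forms such as ate -> eat need a lookup table rather than suffix rules:

```python
STOPWORDS = {"is", "was", "i", "in"}      # illustrative list
ABBREVIATIONS = {"dr": "doctor"}          # illustrative mapping
VARIANTS = {"colour": "color"}            # illustrative spelling variants


def drop_stopwords(term):
    return None if term in STOPWORDS else term


def expand_abbreviation(term):
    return ABBREVIATIONS.get(term, term)


def naive_stem(term):
    # Toy suffix stripping; keeps at least three characters of the stem.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def normalise(term):
    return VARIANTS.get(term, term)


PIPELINE = [drop_stopwords, expand_abbreviation, naive_stem, normalise]


def run_pipeline(tokens):
    """Pass every token through each stage; a stage returning None drops it."""
    output = []
    for term in tokens:
        for stage in PIPELINE:
            term = stage(term)
            if term is None:
                break  # a stage dropped the term
        else:
            output.append(term)
    return output


terms = run_pipeline(["dr", "was", "painting", "colours", "in", "rooms"])
print(terms)
```

The order of stages matters: here stemming runs before normalisation so that "colours" is reduced to "colour" before the spelling variant is mapped.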
![Page 10: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/10.jpg)
Index – data structures
• Direct Index – stores the identifiers of the terms that appear in each document and the corresponding frequencies.
• Document Index – stores information about each document, for example the document length and identifier.
• Inverted Index – stores the posting lists, i.e. for each term, the identifiers of the documents that contain it and the corresponding term frequencies.
• Lexicon – stores the collection vocabulary and the corresponding document and term frequencies.
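As a rough in-memory illustration of the four structures, the sketch below builds them from a toy collection. It uses plain dictionaries and term strings instead of term identifiers, and the function and variable names are assumptions; Terrier's on-disk formats are different and compressed:

```python
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps a document number (docno) to its list of terms."""
    document_index = {}                 # docid -> (docno, document length)
    direct_index = {}                   # docid -> {term: tf in that document}
    inverted_index = defaultdict(dict)  # term  -> {docid: tf}  (posting list)

    for docid, (docno, terms) in enumerate(docs.items()):
        counts = Counter(terms)
        document_index[docid] = (docno, len(terms))
        direct_index[docid] = dict(counts)
        for term, tf in counts.items():
            inverted_index[term][docid] = tf

    # Lexicon: term -> (document frequency, collection term frequency)
    lexicon = {
        term: (len(postings), sum(postings.values()))
        for term, postings in inverted_index.items()
    }
    return document_index, direct_index, dict(inverted_index), lexicon


docs = {"d1": ["terrier", "index", "index"], "d2": ["terrier", "query"]}
document_index, direct_index, inverted_index, lexicon = build_index(docs)
print(inverted_index["terrier"], lexicon["index"])
```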
![Page 11: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/11.jpg)
Extending the indexing process
• Tokenisation
  – uk.ac.gla.terrier.indexing.*Document
• Term pipelines
  – uk.ac.gla.terrier.terms.*
![Page 12: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/12.jpg)
Retrieval
[Figure: retrieval diagram; labels: query, index]
![Page 13: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/13.jpg)
Scoring and Ranking
• Score: S(di, qj), the score of document di for query qj
• Documents are ranked (sorted) according to the score
• Presented to the user in decreasing order of S(di, qj)
• Scoring model
  – e.g. TF-IDF
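A small sketch of scoring and ranking with one common TF-IDF variant, S(d, q) = sum over query terms of tf * log(N / df). This is a textbook formulation chosen for illustration, not necessarily the TF-IDF model that Terrier implements, and the toy posting lists are made up:

```python
import math


def tf_idf_rank(query_terms, inverted_index, num_docs):
    """Score every document containing a query term, then sort by score."""
    scores = {}
    for term in query_terms:
        postings = inverted_index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(num_docs / len(postings))
        for docid, tf in postings.items():
            scores[docid] = scores.get(docid, 0.0) + tf * idf
    # Decreasing score; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy posting lists: term -> {docid: term frequency}
inverted_index = {
    "terrier": {0: 1, 1: 1, 2: 1},
    "index": {0: 2, 2: 1},
    "query": {1: 1},
}
ranking = tf_idf_rank(["index", "terrier"], inverted_index, num_docs=3)
print(ranking)
```

Note that "terrier" occurs in every document, so its IDF is zero and it contributes nothing to the ranking.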
![Page 14: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/14.jpg)
Matching Process
• Input
  – Query and weighting model
• Output
  – Ranked result set
• Weighting model
  – e.g. Hiemstra LM
• Uses
  – Term Score Modifiers
    • uk.ac.gla.terrier.matching.tsms
  – Document Score Modifiers
    • uk.ac.gla.terrier.matching.dsms
• Extend
  – uk.ac.gla.terrier.matching.models
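The plug-in structure of matching can be sketched as a function that takes the weighting model and any document score modifiers as parameters. Everything below is a hypothetical Python outline, not Terrier's Java API: the function names are invented, the weighting function is a rank-equivalent form of Hiemstra's language model with an assumed smoothing value of 0.15, and the modifier is a made-up example:

```python
import math


def hiemstra_lm(tf, doc_len, coll_freq, coll_tokens, lam=0.15):
    """Per-term score: log(1 + lam*P(t|d) / ((1-lam)*P(t|collection)))."""
    return math.log(1 + (lam * tf * coll_tokens) / ((1 - lam) * coll_freq * doc_len))


def match(query_terms, inverted_index, doc_lengths, weighting_model,
          doc_score_modifiers=()):
    """Score documents with the given model, apply modifiers, return a ranking."""
    coll_tokens = sum(doc_lengths.values())
    scores = {}
    for term in query_terms:
        postings = inverted_index.get(term, {})
        coll_freq = sum(postings.values())
        for docid, tf in postings.items():
            weight = weighting_model(tf, doc_lengths[docid], coll_freq, coll_tokens)
            scores[docid] = scores.get(docid, 0.0) + weight
    for modifier in doc_score_modifiers:
        scores = modifier(scores)  # each modifier returns adjusted scores
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def boost_document_zero(scores):
    # Hypothetical document score modifier: triple the score of document 0.
    return {d: s * (3.0 if d == 0 else 1.0) for d, s in scores.items()}


inverted_index = {"terrier": {0: 1, 1: 2}}
doc_lengths = {0: 4, 1: 4}
plain = match(["terrier"], inverted_index, doc_lengths, hiemstra_lm)
boosted = match(["terrier"], inverted_index, doc_lengths, hiemstra_lm,
                doc_score_modifiers=[boost_document_zero])
print(plain, boosted)
```

Passing the model and modifiers as arguments keeps the matching loop unchanged when a new model is added, which is the point of the extension packages listed above.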
![Page 15: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/15.jpg)
Input
• Corpus
  – Very large set of documents
• Topics
  – Queries representing the user's information need
• Relevance results
  – A set of judgments, one per (query, document) pair
![Page 16: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/16.jpg)
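Document format
<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>
<text> राज्यपालांनी घेतली राष्ट्रपती, उपराष्ट्रपतींची भेट
मुंबई, ता. २१ - राज्यपाल एस. एम. कृष्णा यांनी आज राष्ट्रपती प्रतिभा पाटील आणि उपराष्ट्रपती डॉ. हमीद अन्सारी यांची दिल्ली येथे भेट घेतली. राष्ट्रपती, उपराष्ट्रपतिपदी निवड झाल्याच्या पार्श्वभूमीवर राज्यपालांनी भेट घेऊन त्यांचे स्वागत केले. आज दुपारी राष्ट्रपती भवन येथे श्रीमती प्रतिभा पाटील यांची भेट घेतल्यानंतर त्यांनी हरियाना भवन येथे जाऊन उपराष्ट्रपतींची भेट घेतली.
</text>
</doc>
(English translation of the sample text: "Governor meets the President and Vice-President. Mumbai, 21st: Governor S. M. Krishna today met President Pratibha Patil and Vice-President Dr. Hamid Ansari in Delhi. Following their election as President and Vice-President, the Governor called on them and congratulated them. This afternoon, after meeting Smt. Pratibha Patil at Rashtrapati Bhavan, he went to Haryana Bhavan and met the Vice-President.")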
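A document in this layout can be read with a few lines of code. The sketch below is a simplified parser that assumes every document has exactly one docno and one text element, as in the sample above; the English body text in the example is placeholder data:

```python
import re

# Assumption: documents look exactly like <doc><docno>..</docno><text>..</text></doc>.
DOC_RE = re.compile(
    r"<doc>\s*<docno>(.*?)</docno>\s*<text>(.*?)</text>\s*</doc>",
    re.DOTALL | re.IGNORECASE,
)


def parse_trec_docs(data):
    """Return a list of (docno, text) pairs found in the input string."""
    return [(docno.strip(), text.strip()) for docno, text in DOC_RE.findall(data)]


sample = """<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>
<text> Placeholder headline
Placeholder body text.
</text>
</doc>"""

parsed = parse_trec_docs(sample)
print(parsed)
```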
![Page 17: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/17.jpg)
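Topic format
<top>
<num>5
<title>भारतीय राष्ट्रपती निवडणूक २००७
<desc>भारताच्या राष्ट्रपती निवडणूकीशी संबंधित मुद्दे व घटना.
<narr>राष्ट्रपतींची निवडणूक, उमेदवारांविरूध्द केलेली / गलिच्छ राजकीय चिखलफेक आणि आपल्या निकटच्या उमेदवाराचा पराभव करून प्रतिभा पाटील ह्यांचे भारताच्या सर्वप्रथम महिला राष्ट्रपती (अध्यक्ष) म्हणून निवडून येणे ह्या-विषयीची माहिती संबंधित कागदपत्रात असावयास हवी.
</top>
(English translation of the sample topic: title "Indian presidential election 2007"; description "Issues and events related to India's presidential election."; narrative "Relevant documents should contain information about the presidential election, the dirty political mud-slinging against the candidates, and the election of Pratibha Patil as India's first woman President after defeating her nearest rival.")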
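Topic fields in this layout have opening tags but no closing tags, so each field runs until the next tag. A possible way to split such a topic into fields is sketched below; the field names come from the sample, while the function name and the English example text are assumptions:

```python
import re

# Each field runs from its opening tag up to the next known tag or </top>.
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr|/top)>)",
    re.DOTALL,
)


def parse_topic(block):
    """Return a dict mapping field name to its stripped text."""
    return {name: value.strip() for name, value in FIELD_RE.findall(block)}


topic_text = (
    "<top><num>5<title>Indian presidential election 2007"
    "<desc>Issues and events related to the election."
    "<narr>Relevant documents discuss the election.\n</top>"
)
topic = parse_topic(topic_text)
print(topic)
```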
![Page 18: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/18.jpg)
Relevance Judgement
.
.
.
13 Q0 1100019.cms.txt 0
13 Q0 1102914.cms.txt 0
13 Q0 1104294.cms.txt 0
13 Q0 1104312.cms.txt 1
13 Q0 1110418.cms.txt 0
13 Q0 1123377.cms.txt 0
13 Q0 1124813.cms.txt 1
13 Q0 1126006.cms.txt 1
.
.
.
Columns, left to right: query id, Q0 (iteration field), document id, relevance judgement (0 or 1)
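Reading such a file amounts to splitting each line into four whitespace-separated fields. A minimal sketch, with a hypothetical function name, using the judgments shown on the slide as input:

```python
def parse_qrels(lines):
    """Return {query_id: {docno: relevance}} from qrels-style lines."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, docno, label = parts
        qrels.setdefault(query_id, {})[docno] = int(label)
    return qrels


sample = """13 Q0 1100019.cms.txt 0
13 Q0 1102914.cms.txt 0
13 Q0 1104294.cms.txt 0
13 Q0 1104312.cms.txt 1
13 Q0 1110418.cms.txt 0
13 Q0 1123377.cms.txt 0
13 Q0 1124813.cms.txt 1
13 Q0 1126006.cms.txt 1"""

qrels = parse_qrels(sample.splitlines())
relevant = sorted(d for d, rel in qrels["13"].items() if rel > 0)
print(relevant)
```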
![Page 19: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/19.jpg)
Configuration files
• etc/terrier.properties
  – UTF-8 settings, stemmer, index name, etc.
• etc/trec.topic.list
  – Sets the topics/queries file
• etc/trec.models
  – Sets the matching/retrieval model
• etc/trec.qrels
  – Sets the relevance judgement file path
![Page 20: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/20.jpg)
Running Terrier
• Already compiled
• To recompile
  – bin/compile.sh
• Set up the corpus
  – bin/trec_setup.sh "<corpus folder path>"
• Index
  – bin/trec_terrier.sh -i
• Retrieval
  – bin/trec_terrier.sh -r
• Evaluate
  – bin/trec_terrier.sh -e "<result file>"
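To give a feel for what an evaluation step computes, the sketch below calculates average precision for one query from a result list in the standard TREC run layout (query id, Q0, document id, rank, score, run tag). It is an independent illustration, not Terrier's evaluation code; the scores and the run tag "demo" are invented, and the relevant set is taken from the relevance judgement slide:

```python
def parse_run(lines):
    """Group run lines by query id and return docnos ordered by rank."""
    runs = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        query_id, _q0, docno, rank = parts[0], parts[1], parts[2], int(parts[3])
        runs.setdefault(query_id, []).append((rank, docno))
    return {q: [docno for _, docno in sorted(entries)] for q, entries in runs.items()}


def average_precision(ranked_docnos, relevant):
    """Mean of precision values at the rank of each relevant document retrieved."""
    hits = 0
    total = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


run_lines = [
    "13 Q0 1104312.cms.txt 0 9.1 demo",   # invented scores and tag
    "13 Q0 1100019.cms.txt 1 8.4 demo",
    "13 Q0 1124813.cms.txt 2 7.9 demo",
]
relevant = {"1104312.cms.txt", "1124813.cms.txt", "1126006.cms.txt"}

ranking = parse_run(run_lines)
ap = average_precision(ranking["13"], relevant)
print(round(ap, 4))
```

Averaging this value over all queries gives mean average precision (MAP), a measure commonly reported for TREC-style experiments.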
![Page 21: Overview of a Information Retrieval System: Terrier Ashish](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e005503460f94ae8e46/html5/thumbnails/21.jpg)
Reference
• http://ir.dcs.gla.ac.uk/terrier/doc/
• http://ir.dcs.gla.ac.uk/wiki/Terrier