IR Homework #1
By J. H. Wang
Mar. 16, 2015
Programming Exercise #1: Vector Space Retrieval - Indexing
• Goal: to build an inverted index for a text collection
• Input: a set of text documents
• Output: inverted index
• Tools: either utilizing open source tools (libraries, APIs) or writing your own code in any programming language
The Major Task
• Indexing
  – Given a set of text documents, build an inverted index
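As a rough illustration of the task, here is a minimal sketch of a document-level inverted index. The function name `build_inverted_index`, the toy documents, and the use of lowercasing plus whitespace splitting as tokenization are all assumptions made for this example; they are not part of the assignment.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it.

    `docs` is a dict {doc_id: text}. Lowercasing and whitespace splitting
    stand in for real tokenization (an assumption for this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort terms (the dictionary) and doc IDs (the postings) so the
    # output is deterministic.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Toy collection, made up for illustration.
docs = {1: "to be or not to be", 2: "to index is to invert"}
index = build_inverted_index(docs)
print(index["to"])
```

A real indexer for a large collection would likely need to process documents in batches and merge partial indexes on disk rather than hold everything in memory.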
IR, Spring 2014, NTUT CSIE
Steps in Vector Space Retrieval
Some Open Source Tools
• Apache Lucene/Solr (in Java)
• The Lemur Project: Indri (in C++), Galago (in Java) – by CMU/UMass
• Terrier – by U. Glasgow (in Java)
• …
Input 1: the Test Collection
• ClueWeb09 dataset
  – http://lemurproject.org/clueweb09.php/
  – 1,040,809,705 Web pages in 10 languages, in Jan.-Feb. 2009
  – 5TB, compressed (25TB, uncompressed)
  – File format: WARC (Web ARChive file format)
    • http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
• Sample Files: http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Sample+Files
  – Each file contains about 40,000 Web pages, in 1GB
• Each team will be randomly allocated different files!
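For teams writing their own reader, the following is a simplified sketch of iterating over WARC records. It assumes that every record starts with a `WARC/` version line, that its header block ends with a blank line, and that `Content-Length` is accurate; real collection files may deviate from this. The function name `read_warc_records` and the sample record (including its header values) are made up for illustration.

```python
import io

def read_warc_records(stream):
    """Yield (headers, body) for each record in a binary WARC stream.

    Simplified: trusts the Content-Length header and does not interpret
    the HTTP response stored inside each record body.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)

# Hypothetical in-memory record standing in for a real file.
sample = (
    b"WARC/0.18\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-TREC-ID: example-doc-0001\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)
records = list(read_warc_records(io.BytesIO(sample)))
```

For a gzip-compressed file, the stream could come from the standard library's `gzip.open(path, "rb")`. The record body would still need HTTP header stripping and HTML-to-text extraction before tokenization.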
Web Test Collections
• The ClueWeb12 dataset
  – a successor to the ClueWeb09 dataset
  – http://lemurproject.org/clueweb12.php/
  – 733,019,372 English Web pages, in Feb.-May 2012
  – 5.5TB, compressed (27.3TB, uncompressed)
• TREC datasets: WT2g, WT10g, .GOV, .GOV2, Blogs06, Blogs08
  – http://ir.dcs.gla.ac.uk/test_collections/
Previous Test Collections
• Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
  – LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
  – Ex: The Time Collection: 423 documents (1.5MB)
• Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
  – 21,578 news articles in 1987 (28.0MB uncompressed)
• Reuters-RCV1: (in the textbook) http://trec.nist.gov/data/reuters/reuters.html
  – About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
  – Requires signing agreements
Output: Inverted Index
• E.g.: Using the standard positional index as the format (Chap. 1 & 2):
  – Dictionary file: a sorted list of vocabulary terms (one per line)
  – Postings list: for each term, a list of occurrences in the original text
    • term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4, p.38)
      – df_i: document frequency of term_i
      – tf_ij: term frequency of term_i in doc_j
    • to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; … >
    • …
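One possible way to produce postings lines in the format above is sketched here. It assumes 1-based token positions and whitespace tokenization; the function names `build_positional_index` and `format_postings` and the toy documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Return {term: {doc_id: [positions]}} with 1-based token positions."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()  # placeholder tokenization
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def format_postings(term, postings):
    """Render one line as: term, df: <doc, tf: <pos, ...>; ...>"""
    entries = "; ".join(
        f"{doc_id}, {len(positions)}: <{', '.join(map(str, positions))}>"
        for doc_id, positions in sorted(postings.items())
    )
    # df is the number of documents containing the term;
    # tf is the number of positions recorded for that document.
    return f"{term}, {len(postings)}: <{entries}>"

# Toy collection, made up for illustration.
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)
lines = [format_postings(term, index[term]) for term in sorted(index)]
print("\n".join(lines))
```

Iterating over `sorted(index)` also yields the terms in the order needed for the dictionary file.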
Design Issues
• pos means the token positions in the body of documents
  – This facilitates easier implementation in the following steps, e.g., proximity search
• You can design different index formats, as long as
  – The necessary information can be accessed for ranking
    • Dictionary: terms t_i and the corresponding document frequency df_i
    • Postings: (DocID, term frequency tf_ij, Loc) for each term
• Preprocessing should be handled with care
  – Different formats for different collections
  – Digits, hyphens, punctuation marks, …
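To show why stored token positions help with proximity search, here is a minimal sketch that checks whether two terms occur within k tokens of each other in one document. It assumes each position list is sorted in ascending order; the function name `within_k` is hypothetical.

```python
def within_k(positions_a, positions_b, k):
    """Return True if some position in a is within k tokens of one in b.

    Both lists must be sorted ascending, as in a positional postings entry.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position; the larger one
        # may still be close to a later position in the other list.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

# Position lists for two terms in the same document (made-up values).
near = within_k([7, 18, 33], [1, 17, 74], k=1)
far = within_k([7, 18, 33], [1, 25, 74], k=2)
```

Because both lists are traversed once, the check is linear in the combined number of positions.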
Optional Functionality
• Efficiency issues
  – A separate data structure (e.g. trie) can be used to store the vocabulary and postings in your indexer
  – Skip pointers (to be used in query processing)
• Tokenization
  – Case folding
  – Stopword removal
  – Stemming
  – Each can be turned on/off by a parameter
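The switchable tokenization options could be wired up roughly as follows. This is only a sketch: the stopword list is a tiny illustrative sample, `naive_stem` is a crude placeholder rather than a real stemmer such as Porter's, the regular expression only matches ASCII letters and digits, and all names and parameters are assumptions.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list

def naive_stem(token):
    """Strip a few common suffixes; a placeholder, not the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text, case_folding=True, remove_stopwords=False, stemming=False):
    """Split text into tokens; each processing step is toggled by a flag."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # ASCII-only, an assumption
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

default_tokens = tokenize("The Indexing of Documents")
all_on = tokenize("The Indexing of Documents",
                  remove_stopwords=True, stemming=True)
```

Note that removing stopwords before assigning positions changes the token positions stored in a positional index, so the order of these steps is a design decision worth documenting.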
Submission
• Your submission *should* include
  – The source code (and your configurations of extra libraries)
    • If you use open source tools, please also submit your source code that calls the APIs or libraries
  – A one-page documentation including
    • Major features, e.g., high efficiency, low storage, multiple input formats, huge corpus, …
    • Major difficulties encountered
    • Instructions for compilation/execution environments (e.g., Java Runtime Environment, special compilers, …)
    • Team members list: the names and responsibilities of each member should be clearly identified
• Due: three weeks (extended to Apr. 6, 2015)
Submission Instructions
• Programs and related electronic files in your homework must be submitted directly on the submission site:
  – Submission site: https://140.124.183.13/
    • Username: your student ID
    • Password: (please change it at your first login)
  – Preparing your submission file: as one single compressed file
    • Name your file according to your ID such as <ID>_HW1.zip
    • Remember to specify the names and student IDs of your team members in the files and documentation
• If you cannot successfully submit your work, please contact the TA (TBD, @ R1424, Technology Building)
Evaluation
• Minimum requirement: correctness for sample documents
  – Using the (partial) ClueWeb09 Test Collection as the input, the inverted index generated by your program will be checked
  – Optional features will be considered as bonus
• You might be required to give a demo if the TA is unable to compile/run the submitted program
Any Questions or Comments?