IR Homework #1 By J. H. Wang Mar. 16, 2015




Programming Exercise #1: Vector Space Retrieval - Indexing

• Goal: to build an inverted index for a text collection
• Input: a set of text documents
• Output: inverted index
• Tools: either utilizing open source tools (libraries, APIs) or writing your own code in any programming language


The Major Task

• Indexing
  – Given a set of text documents, build an inverted index


IR, Spring 2014 NTUT CSIE 4

Steps in Vector Space Retrieval



Some Open Source Tools

• Apache Lucene/Solr (in Java)
• The Lemur Project: Indri (in C++), Galago (in Java) – by CMU/UMass
• Terrier – by U. Glasgow (in Java)
• …


Input 1: the Test Collection

• ClueWeb09 dataset
  – http://lemurproject.org/clueweb09.php/
  – 1,040,809,705 Web pages in 10 languages, in Jan.-Feb. 2009
  – 5TB, compressed (25TB, uncompressed)
  – File format: WARC (Web ARChive file format)
    • http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
• Sample Files: http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Sample+Files
• Each file contains about 40,000 Web pages, in 1GB
• Each team will be randomly allocated different files!
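To make the WARC input concrete, here is a minimal Python sketch of a record reader. It assumes well-formed records (a `WARC/<version>` line, `Name: value` header lines, a blank line, then exactly `Content-Length` payload bytes). The function name `read_warc_records` and the in-memory sample records are made up for illustration; the real ClueWeb09 files are compressed and may contain irregularities this sketch does not handle.

```python
import io

def read_warc_records(stream):
    """Yield (headers, payload) for each record in a binary WARC stream.

    Assumption: every record is well formed and carries a correct
    Content-Length header; no error recovery is attempted.
    """
    while True:
        line = stream.readline()
        if not line:                        # end of stream
            return
        if not line.startswith(b"WARC/"):   # skip blank separator lines
            continue
        headers = {"version": line.strip().decode("ascii")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                       # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)       # read exactly the declared payload
        yield headers, payload

# Hypothetical two-record sample, built in memory for illustration only.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:test:1>\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:test:2>\r\n"
    b"Content-Length: 3\r\n"
    b"\r\n"
    b"abc"
    b"\r\n\r\n"
)
records = list(read_warc_records(io.BytesIO(sample)))
```

For a gzip-compressed file, the same function could be fed a stream opened with `gzip.open(path, "rb")`, where `path` is whichever sample file your team was allocated.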


Web Test Collections

• The ClueWeb12 dataset
  – a successor to the ClueWeb09 dataset
  – http://lemurproject.org/clueweb12.php/
  – 733,019,372 English Web pages, in Feb.-May 2012
  – 5.5TB, compressed (27.3TB, uncompressed)
• TREC datasets: WT2g, WT10g, .GOV, .GOV2, Blogs06, Blogs08
  – http://ir.dcs.gla.ac.uk/test_collections/


Previous Test Collections

• Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
  – LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
  – Ex: The Time Collection: 423 documents (1.5MB)
• Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
  – 21,578 news articles in 1987 (28.0MB uncompressed)
• Reuters-RCV1: (in the textbook) http://trec.nist.gov/data/reuters/reuters.html
  – About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
  – Requires signing agreements


Output: Inverted Index

• E.g.: Using the standard positional index as the format (Chap. 1 & 2):
  – Dictionary file: a sorted list of vocabulary terms (one per line)
  – Postings list: for each term, a list of occurrences in the original text
    • term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4, p.38)
  – df_i: document frequency of term_i
  – tf_ij: term frequency of term_i in doc_j
• to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; … >

• …
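The positional index format above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are already tokenized, token positions are 1-based, and the names `build_positional_index` and `format_postings` as well as the two toy documents are hypothetical, not part of the assignment.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict mapping doc ID -> list of tokens.

    Returns {term: {doc_id: [positions]}} with 1-based token positions
    (the position base is an assumption of this sketch).
    """
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id], start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def format_postings(term, postings):
    """Render one entry as 'term, df: <doc, tf: <pos, ...>; ...>'."""
    parts = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_text = ", ".join(str(p) for p in positions)
        # tf is the number of positions recorded for this document
        parts.append(f"{doc_id}, {len(positions)}: <{pos_text}>")
    # df is the number of documents that contain the term
    return f"{term}, {len(postings)}: <{'; '.join(parts)}>"

# Toy collection for illustration only.
docs = {
    1: "to be or not to be".split(),
    2: "to do is to be".split(),
}
index = build_positional_index(docs)
dictionary = sorted(index)                       # the sorted dictionary file
lines = [format_postings(t, index[t]) for t in dictionary]
```

Here `format_postings("to", index["to"])` yields `to, 2: <1, 2: <1, 5>; 2, 2: <1, 4>>`. For a collection that does not fit in memory, the index would have to be built in blocks and merged, which this sketch does not attempt.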


Design Issues

• pos means the token positions in the body of documents
  – This facilitates easier implementation in the following steps, e.g., proximity search
• You can design different index formats, as long as
  – The necessary information can be accessed for ranking
    • Dictionary: terms t_i and the corresponding document frequency df_i
    • Postings: (DocID, term frequency tf_ij, Loc) for each term
• Preprocessing should be handled with care
  – Different formats for different collections
  – Digits, hyphens, punctuation marks, …
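To illustrate why stored token positions help with proximity search, the following Python sketch checks whether two terms occur within k tokens of each other in the same document. It assumes postings of the form {doc_id: sorted list of positions}; the function names and the sample postings are hypothetical.

```python
def within_k(positions_a, positions_b, k):
    """True if some position in a is within k tokens of some position in b.

    Both lists must be sorted in ascending order; a two-pointer scan
    always advances the pointer at the smaller position.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

def proximity_docs(postings_a, postings_b, k):
    """Return sorted doc IDs in which the two terms occur within k tokens."""
    common = sorted(set(postings_a) & set(postings_b))
    return [d for d in common if within_k(postings_a[d], postings_b[d], k)]

# Made-up postings for two terms, for illustration only.
postings_a = {1: [7, 18, 33], 2: [1, 17, 74]}
postings_b = {1: [20, 90], 2: [40], 3: [5]}
near = proximity_docs(postings_a, postings_b, 2)
far = proximity_docs(postings_a, postings_b, 25)
```

With these sample postings, only document 1 qualifies for k = 2 (positions 18 and 20), while both documents 1 and 2 qualify for k = 25.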


Optional Functionality

• Efficiency issues
  – A separate data structure (e.g. trie) can be used to store the vocabularies and postings in your indexer
  – Skip pointers (to be used in query processing)
• Tokenization
  – Case folding
  – Stopword removal
  – Stemming
  – Able to be turned on/off by a parameter trigger
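One possible way to make the tokenization options switchable by parameters is sketched below in Python. Everything here is an assumption for illustration: the tiny stopword list, the ASCII-only token pattern (which keeps hyphenated words and digit strings as single tokens), and `naive_stem`, a toy suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})

def naive_stem(token):
    """Toy suffix stripper used as a placeholder; NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text, case_folding=True, remove_stopwords=False, stemming=False):
    """Split text into tokens; each optional step is toggled by a flag."""
    # ASCII letters and digits, with internal hyphens kept inside a token.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    if case_folding:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

text = "The indexing of state-of-the-art Web pages in 2009."
plain = tokenize(text, case_folding=False)
full = tokenize(text, remove_stopwords=True, stemming=True)
```

With all options on, the sample sentence becomes `index`, `state-of-the-art`, `web`, `page`, `2009`. How digits, hyphens, and non-English text should really be treated depends on the collection, so the pattern above is only a starting point.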


Submission

• Your submission *should* include
  – The source code (and your configurations of extra libraries)
    • If you use open source tools, please also submit your source code that calls the APIs or libraries
  – A one-page documentation including
    • Major features: ex: high efficiency, low storage, multiple input formats, huge corpus, …
    • Major difficulties encountered
    • Instructions for compilation/execution environments (ex: Java Runtime Environment, special compilers, …)
    • Team members list: The names and the responsible parts of each individual member should be clearly identified

• Due: three weeks (extended to Apr. 6, 2015)


Submission Instructions

• Programs and related electronic files in your homework must be submitted directly on the submission site:
  – Submission site: https://140.124.183.13/
    • Username: your student ID
    • Password: (please change it at your first login)
  – Preparing your submission file: as one single compressed file
    • Name your file according to your ID such as <ID>_HW1.zip
    • Remember to specify the names and student IDs of your team members in the files and documentation
• If you cannot successfully submit your work, please contact the TA (TBD, @ R1424, Technology Building)


Evaluation

• Minimum requirement: correctness for sample documents
  – The (partial) ClueWeb09 Test Collection will be used as the input, and the inverted index generated by your program will be checked
  – Optional features will be considered as bonus
• You might be required to give a demo if the TA is unable to compile or run the submitted program


Any Questions or Comments?