IRTools Software Overview Gregory B. Newby UNC Chapel Hill gbnewby@ils.unc.edu
Download & Participate
IRTools is a work in progress. Check back in the spring for more software and test cases. Currently, only some parts work.
Want to help? We use CVS for distributed development.
Our project page: http://sourceforge.net/projects/irtools
Design Principles
For IR Researchers
A programming toolkit, not an IR system
Implements major approaches to IR (Boolean, VSM, Probabilistic & LSI)
Scalable to billions of documents
High performance algorithms and structures
Expandable
Documented: http://ils.unc.edu/tera
Major Components
Spider: gathers documents on the live Web
Indexer: builds internal representations of documents
Retrieval Engine: processes queries and generates results
Implementation
Mostly in C++, using the GNU compiler
Uses the Standard Template Library
Tested on Solaris & Linux (Alpha & 386)
Designed for modularity, so IR researchers can add their own components
Why Might You Use IRTools?
If you have your own IR software, there’s probably no need
If you are looking for experimental IR software, this might be a good alternative (goal: to be suitable for general use in mid-2002)
IRTools should be useful for classroom use and demonstration
For production use, consider ht://dig
Design Snippet: Word List
Berkeley DB is used to store the term-to-termID lookup table
A single file, accessed by hash in a B+ tree
struct term_termID {
    char *term;
    irt_int termID;
};
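The lookup behaviour described above can be sketched without Berkeley DB. This is only an in-memory stand-in: `WordList`, `lookup_or_add`, and `lookup` are hypothetical names, `irt_int` is assumed to be a 32-bit integer (the slides do not define it), and assigning IDs in order of first appearance is an assumption as well.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

// In-memory stand-in for the on-disk term -> termID table (not Berkeley DB).
class WordList {
public:
    // Return the termID for a term, assigning the next free ID on first sight.
    irt_int lookup_or_add(const std::string& term) {
        assert(!term.empty());  // an empty string is never a valid term
        auto it = ids_.find(term);
        if (it != ids_.end()) return it->second;
        const irt_int id = static_cast<irt_int>(ids_.size());
        ids_.emplace(term, id);
        return id;
    }

    // Return the termID, or -1 if the term has never been seen.
    irt_int lookup(const std::string& term) const {
        auto it = ids_.find(term);
        return it == ids_.end() ? -1 : it->second;
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::unordered_map<std::string, irt_int> ids_;
};
```

The indexer would call `lookup_or_add` while building the table, and the retrieval engine would call `lookup` to translate query terms.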
Design Snippet: 1st Inverted Index File
Binary file with fixed-length records
Accessed by termID * sizeof(struct) offset
Gives basic info needed for weighting
Points to more files for inverted entries (the actual documents for this term)
Some duplication (e.g., meantf) to prevent additional I/O
Design Snippet: 1st Inverted Index File
struct inv_file1 {
    irt_int termID;
    irt_int term_doccount;    // Frequency
    irt_int meantf;           // For weighting
    irt_int nt;               // # terms in this doc
    irt_int file2_location;   // File for entries
    irt_int starting_offset;  // File 2 loc
    irt_int entry_count;      // # occurrences of this term in file 2
};
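Because every record has the same size, the record for a term can be reached with a single seek to termID * sizeof(struct). The sketch below shows that access pattern on any seekable stream. `record_offset`, `write_record`, and `read_record` are hypothetical helper names, `irt_int` is assumed to be 32 bits, and records are read and written as raw bytes, so the file is only portable between machines with the same byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>  // std::stringstream can serve as an in-memory file for testing

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

struct inv_file1 {
    irt_int termID;
    irt_int term_doccount;
    irt_int meantf;
    irt_int nt;
    irt_int file2_location;
    irt_int starting_offset;
    irt_int entry_count;
};

// Byte offset of a term's record: termID * sizeof(struct).
std::streamoff record_offset(irt_int termID) {
    assert(termID >= 0);
    return static_cast<std::streamoff>(termID) *
           static_cast<std::streamoff>(sizeof(inv_file1));
}

// Write a record into the slot that belongs to rec.termID.
bool write_record(std::ostream& out, const inv_file1& rec) {
    out.seekp(record_offset(rec.termID));
    out.write(reinterpret_cast<const char*>(&rec), sizeof rec);
    return static_cast<bool>(out);
}

// Read the record for termID. Returns false on a failed seek, a short read,
// or when the stored termID does not match the requested one.
bool read_record(std::istream& in, irt_int termID, inv_file1& rec) {
    in.seekg(record_offset(termID));
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return in && rec.termID == termID;
}
```

With a `std::fstream` opened in binary mode, one `read_record` call costs one seek and one read, regardless of how many terms the collection has.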
Design Snippet: 2nd Inverted Index File
Info about documents with this term
Using Page Rank, best docs can be listed earliest (avoiding subsequent disk I/O)
Multiple 2nd files for larger collections
struct inv_file2 {
    irt_int termID;           // Sanity check
    irt_int file_location;    // Next file
    irt_int starting_offset,
            num_entries;      // As for file1
};
Design Snippet: 2nd Inverted Index File
For each document with this term:
struct inv_docentry {
    irt_int term_in_doc_count;
    // For weighting:
    irt_int doc_unique_terms;
    irt_int doc_total_terms;
    // 3rd file offset
    irt_int file3_location;
};
Design Snippet: 3rd Inverted Index File
This lists a term’s locations in documents
irt_int termID // Sanity check
Followed by term_in_doc_count irt_ints indicating the positions of this term in this document
Usable for a NEAR operator
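NEAR is listed elsewhere in these slides as planned, so the following is only a sketch of how position lists could support it. It assumes the positions for each term are stored in ascending order (the slides do not say so), and `near_match` and `window` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

// True if some position in `a` lies within `window` tokens of some position
// in `b`. Both lists must be sorted ascending; each is walked at most once.
bool near_match(const std::vector<irt_int>& a,
                const std::vector<irt_int>& b,
                irt_int window) {
    assert(window >= 0);
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        irt_int diff = a[i] - b[j];
        if (diff < 0) diff = -diff;
        if (diff <= window) return true;
        // Advance whichever list is behind; the other position cannot get closer.
        if (a[i] < b[j]) ++i; else ++j;
    }
    return false;
}
```

For positions {3, 40, 90} and {10, 44}, a window of 5 matches (40 and 44), while a window of 3 does not.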
Planned & Current Components
Current:
Various stemmers and stoplists
Various weighting schemes
Sparse matrix formats for LSI etc.
Boolean AND & OR
TREC output
Visual interfaces
Designed & Planned:
Page Rank
Integrated spider
Boolean NEAR
Update & delete entries
Concurrent retrieval engine clients
Concurrent indexers
Global Collection Variables
maxn: highest # of terms in any doc
maxUn: highest # of unique terms in any doc
Nterms: total known terms
Ndocs: total known documents
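As an illustration, the four values could be computed in one pass over documents that have already been reduced to termID lists. `collection_stats` and `compute_stats` are hypothetical names, `irt_int` is assumed to be 32 bits, and reading Nterms as the number of distinct terms in the collection is an interpretation, not something the slide states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

struct collection_stats {
    irt_int maxn = 0;    // highest # of terms in any doc
    irt_int maxUn = 0;   // highest # of unique terms in any doc
    irt_int Nterms = 0;  // total known terms (assumed: distinct terms)
    irt_int Ndocs = 0;   // total known documents
};

// Each inner vector is one document, given as the termIDs it contains.
collection_stats compute_stats(const std::vector<std::vector<irt_int>>& docs) {
    collection_stats s;
    std::set<irt_int> vocabulary;
    for (const auto& doc : docs) {
        const std::set<irt_int> unique(doc.begin(), doc.end());
        s.maxn = std::max(s.maxn, static_cast<irt_int>(doc.size()));
        s.maxUn = std::max(s.maxUn, static_cast<irt_int>(unique.size()));
        vocabulary.insert(unique.begin(), unique.end());
        ++s.Ndocs;
    }
    s.Nterms = static_cast<irt_int>(vocabulary.size());
    assert(s.maxUn <= s.maxn);  // unique terms can never exceed total terms
    return s;
}
```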
Design Snippet: Boolean Candidate Merging
Works for OR or AND
Min. disk I/O (needed for inverted index only)
Doesn’t require inverted index to be sorted in docID order
The STL map can be problematic for more than about 20K candidates; using documents that are Page Rank’ed can help shrink the candidate set (and speed up everything)
Start with terms with the lowest frequency; we only continue until we have enough hits
Design Snippet: Boolean Candidate Merging
irt_int NFULL = 0;                          // stop with enough hits
vector<irt_int> full;                       // docIDs w. all q terms
map<irt_int, candidate_info> candidates;    // Candidates, keyed by docID
struct candidate_info {                     // For each doc
    irt_int docID;                          // this doc’s ID
    irt_int nt;                             // # terms in this doc, for weighting
    irt_int meantf;                         // mean tf in this doc, for weighting
    float tf[NQUERYTERMS];                  // For weighting
    irt_short qtcount;                      // # query terms in doc
};
The map eliminates sorting!
We must allocate memory for every candidate
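A minimal sketch of the merging idea, under stated assumptions: posting lists arrive as in-memory vectors rather than from the index files, each document appears at most once per list, and `irt_int` and `irt_short` are 32-bit and 16-bit integers. `posting`, `bool_op`, `merge_candidates`, and `max_hits` are hypothetical names, and the candidate record is trimmed to the fields the merge itself needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using irt_int = std::int32_t;    // assumption: not defined in the slides
using irt_short = std::int16_t;  // assumption: not defined in the slides

struct posting { irt_int docID; float tf; };

struct candidate_info {
    std::vector<float> tf;  // tf per query term, in processing order
    irt_short qtcount = 0;  // # query terms seen in this doc so far
};

enum class bool_op { AND, OR };

// Merge posting lists (one per query term, in any docID order) and return
// up to max_hits matching docIDs in the order they were completed.
std::vector<irt_int> merge_candidates(std::vector<std::vector<posting>> lists,
                                      bool_op op, std::size_t max_hits) {
    std::vector<irt_int> full;
    if (lists.empty() || max_hits == 0) return full;
    // Start with the lowest-frequency term so the candidate set stays small.
    std::sort(lists.begin(), lists.end(),
              [](const std::vector<posting>& a, const std::vector<posting>& b) {
                  return a.size() < b.size();
              });
    const std::size_t nterms = lists.size();
    const std::size_t needed = (op == bool_op::AND) ? nterms : 1;
    std::map<irt_int, candidate_info> candidates;
    for (std::size_t q = 0; q < nterms; ++q) {
        for (const posting& p : lists[q]) {
            auto it = candidates.find(p.docID);
            if (it == candidates.end()) {
                // For AND, a doc missing from the rarest term can never match.
                if (op == bool_op::AND && q > 0) continue;
                candidate_info fresh;
                fresh.tf.assign(nterms, 0.0f);
                it = candidates.emplace(p.docID, fresh).first;
            }
            it->second.tf[q] = p.tf;
            if (static_cast<std::size_t>(++it->second.qtcount) == needed) {
                full.push_back(p.docID);
                if (full.size() == max_hits) return full;  // enough hits
            }
        }
    }
    return full;
}
```

The map lookup is what lets the posting lists arrive in any docID order, at the cost of one allocation per candidate.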
Design Snippet: LSI & Information Space
We use a modified Harwell-Boeing sparse matrix format on disk (modified = binary files)
Berry’s SVDPACKC has been integrated
We’re doing scaling experiments now. Scaling is a major challenge for LSI
One solution: do smaller eigensystem problems on a candidate subset on the fly, rather than pre-computing the entire collection’s semantic space. But this eliminates possibly interesting documents!
Hyperlink Map
The hyperlink map is a sparse asymmetric matrix of size D x D
We use a modified Harwell-Boeing format to store the matrix
A similar index file structure to the inverted index gives us rapid access to any document’s link list
We must store both sides of the matrix
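Harwell-Boeing stores a sparse matrix in compressed-column form: one array of start offsets and one array of indices. The sketch below builds that kind of layout in memory for a link list, once keyed by source document (out-links) and once keyed by target (in-links), which is one reading of "both sides". It is not the IRTools on-disk format; `hyperlink`, `compressed_links`, `compress`, and `neighbours` are hypothetical names, and `irt_int` is assumed to be 32 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

struct hyperlink { irt_int from; irt_int to; };

struct compressed_links {
    std::vector<irt_int> start;  // D+1 offsets; doc d owns [start[d], start[d+1])
    std::vector<irt_int> other;  // docID at the other end of each link
};

// Group links by source (by_source = true) or by target (by_source = false).
compressed_links compress(irt_int D, const std::vector<hyperlink>& links,
                          bool by_source) {
    assert(D >= 0);
    compressed_links c;
    c.start.assign(static_cast<std::size_t>(D) + 1, 0);
    for (const hyperlink& l : links) {          // count links per document
        const irt_int key = by_source ? l.from : l.to;
        assert(key >= 0 && key < D);
        ++c.start[key + 1];
    }
    for (irt_int d = 0; d < D; ++d)             // prefix sums give offsets
        c.start[d + 1] += c.start[d];
    c.other.resize(links.size());
    std::vector<irt_int> next(c.start.begin(), c.start.end() - 1);
    for (const hyperlink& l : links) {          // place each link in its slot
        const irt_int key = by_source ? l.from : l.to;
        c.other[next[key]++] = by_source ? l.to : l.from;
    }
    return c;
}

// Link list for one document: a single contiguous slice.
std::vector<irt_int> neighbours(const compressed_links& c, irt_int d) {
    return std::vector<irt_int>(c.other.begin() + c.start[d],
                                c.other.begin() + c.start[d + 1]);
}
```

Keeping both groupings doubles the storage but makes either direction a contiguous read, which matters for link-based ranking.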
Web Document Metadata
Items stored during spidering. These are kept in a Berkeley DB B+ hash file, with the document URL (or name) as key
Docname // key
docID
HTTP last update as reported
Our last visit/update
HTTP-reported size
Checksum (simple)
# links out
Design Snippet: Tokenizer
The tokenizer reads files (via spider or local disk)
Goal: Few passes through the file
Goal: Any character set
Process:
Keep a static array of word boundaries
Keep a static array of tag delimiters (<)
Fold everything to lower case
termID lookup can happen now or later
Simple transformations (like ditching extra white space) can happen now
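The process above can be sketched as a single pass over the text driven by a static 256-entry boundary table. Several choices here are assumptions rather than IRTools behaviour: every ASCII byte that is not a letter or digit counts as a word boundary, bytes above 127 are kept as word characters so multi-byte encodings stay intact, everything between `<` and `>` is skipped, and case folding covers ASCII only. `make_boundary_table` and `tokenize` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Built once: true for every byte that ends a word (assumed: ASCII non-alphanumerics).
static std::array<bool, 256> make_boundary_table() {
    std::array<bool, 256> t{};
    for (int c = 0; c < 256; ++c)
        t[static_cast<std::size_t>(c)] = c < 128 && !std::isalnum(c);
    return t;
}

// One pass: skip markup, split on boundaries, fold to lower case.
std::vector<std::string> tokenize(const std::string& text) {
    static const std::array<bool, 256> boundary = make_boundary_table();
    std::vector<std::string> tokens;
    std::string current;
    bool in_tag = false;
    for (unsigned char c : text) {
        if (c == '<') in_tag = true;            // tag delimiter opens markup
        if (in_tag || boundary[c]) {
            if (!current.empty()) {             // a boundary closes the word
                tokens.push_back(current);
                current.clear();
            }
            if (c == '>') in_tag = false;       // markup ends here
            continue;                           // extra white space is dropped
        }
        current.push_back(static_cast<char>(std::tolower(c)));
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Each token could be passed straight to the termID lookup, or collected and looked up later, as the slide allows.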