Terrier: TERabyte RetRIevER
An Introduction By: Kavita Ganesan (Last Updated April 21st 2009)
About Terrier
Information Retrieval Toolkit
Developed by Information Retrieval Group at the University of Glasgow - since 2001
The team: 3 researchers, 5 PhD students, 5 programmers
About Terrier
Provides a platform for development of large-scale IR applications
Uses Hadoop to distribute indexing
Splits indexing tasks across different nodes on a cluster
Java based
Weighting model in Terrier is based on the Divergence From Randomness (DFR) framework
Also includes other IR models
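As a rough orientation, the general shape of a DFR term weight can be written as below. The notation is an assumption made here, not taken from the slides: tf is the (usually length-normalised) frequency of term t in document d, Prob_1 comes from the chosen randomness model, and Prob_2 from the chosen information-gain normalisation.

```latex
w(t, d) \;=\; \mathrm{Inf}_1(tf) \cdot \mathrm{Inf}_2(tf)
        \;=\; \bigl(-\log_2 \mathrm{Prob}_1(tf)\bigr) \cdot \bigl(1 - \mathrm{Prob}_2(tf)\bigr)
```

Different choices for Prob_1, Prob_2, and the term-frequency normalisation give the different DFR models.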
State-of-the-art functionalities
hyperlink structure analysis to rank pages
automatic query expansion/re-formulation techniques
pre-retrieval query performance predictors
compression techniques
Other notable features
Selects the optimal weighting model based on the statistical features of the query
Toolkit Comparison: Lemur, Lucene, Terrier
Indexing
– Lemur: claims it can index up to terabytes of data; supports incremental indexing
– Lucene: can index over 20 MB/minute on a home machine; small RAM requirements (only 1 MB heap); index size about 20%-30% of the size of the text indexed (400 GB of text gives roughly 80 GB of index); Nutch supports distributed indexing; supports incremental indexing
– Terrier: some numbers: 400 GB of files to index; 17 GB of resulting index files (about 4% of the actual text); 3 days to build (2 processors); 4 sec/query to retrieve (8 processors); supports distributed indexing; does not support incremental indexing
Retrieval Models
– Lemur: KL-divergence, vector space, Okapi BM25, language model, TF-IDF
– Lucene: VSM, Boolean retrieval model
– Terrier: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF
Programming Language
– Lemur: C++
– Lucene: Java
– Terrier: Java
Out of the box capabilities
Index and evaluate on TREC test collections
Index standard file formats: HTML, PDF, Word, Excel, PowerPoint files
GUI based desktop search application
Other out of the box capabilities
Indexing support using Hadoop
Highly compressed index data structures
Options for various stemming techniques
Many document weighting model options
  – 126 Divergence From Randomness (DFR) models
  – Okapi BM25
  – Language modeling
  – TF-IDF
Modifiable code: open source code base (Mozilla Public Licence)
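To make one of the weighting models listed above concrete, here is a minimal sketch of Okapi BM25 scoring for a single term in a single document. It uses one common textbook variant of the formula (with a smoothed, non-negative IDF) and is not taken from Terrier's source; the class name, method names, and parameter values are all hypothetical.

```java
public final class Bm25Sketch {

    // Smoothed inverse document frequency; the "1.0 +" keeps it non-negative.
    // This is one common variant, not necessarily the one Terrier implements.
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 contribution of one term to one document's score.
    // k1 controls term-frequency saturation, b controls length normalisation.
    static double score(double tf, double docLen, double avgDocLen,
                        long numDocs, long docFreq, double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf(numDocs, docFreq) * (tf * (k1 + 1.0)) / (tf + norm);
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 1000 documents, the term occurs in 10 of them.
        double s1 = score(1, 100, 100, 1000, 10, 1.2, 0.75);
        double s3 = score(3, 100, 100, 1000, 10, 1.2, 0.75);
        System.out.println("tf=1: " + s1);
        System.out.println("tf=3: " + s3);
    }
}
```

The score grows with term frequency but saturates, and documents longer than average are penalised in proportion to b.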
Nice to have…but not there
Ability to easily build a search engine
Incremental indexing
  – Re-create the index every time
  – Write your own code for incremental indexing
Flexible indexer
  – Implement your own indexer for non-standard data formats
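As an illustration of what "write your own code" might involve, the sketch below keeps a tiny in-memory inverted index that accepts new documents without rebuilding, and reads records from a made-up tab-separated format. It does not use Terrier's API; the class, its methods, and the `docId<TAB>text` record format are all hypothetical.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public final class IncrementalIndexSketch {

    // term -> (document id -> term frequency in that document)
    private final Map<String, Map<String, Integer>> postings = new TreeMap<>();

    // Adds one document to the existing postings; nothing is rebuilt.
    public void addDocument(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with punctuation
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Parses one record of the hypothetical "docId<TAB>text" format.
    public void addRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected docId<TAB>text");
        }
        addDocument(line.substring(0, tab), line.substring(tab + 1));
    }

    // Returns the postings for a term, or an empty map if the term is unseen.
    public Map<String, Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        IncrementalIndexSketch index = new IncrementalIndexSketch();
        index.addRecord("d1\tTerrier indexes large collections");
        index.addRecord("d2\tLucene supports incremental indexing");
        System.out.println(index.lookup("indexing"));
        // A later addition extends the postings in place.
        index.addRecord("d3\tIncremental indexing avoids rebuilding");
        System.out.println(index.lookup("indexing"));
    }
}
```

A real incremental indexer also has to handle updates and deletions, on-disk merging, and collection statistics; this sketch ignores all of that, and re-adding the same document id would double its counts.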
Benefits of using Terrier
Terrier is an active, ongoing project
  – Benefit from new models
  – Performance enhancements
  – New features
Can index large amounts of data
  – Scalable in the long run
Good support from the team
  – Wiki
  – Discussion forums
Benefits of using Terrier (continued)
Easy to set up and use
Very modular
Source files are fully modifiable and well documented
How To Get Started?
1. Download the binary
   You get the full source code with this download
2. Unzip the file to a directory
3. Modify the configuration files
   Models to use, stemmer, etc.
4. You are now ready to index and evaluate
   Use the pre-existing scripts to index and evaluate
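To give a feel for step 3, the sketch below loads a small properties fragment and reads back the term pipeline and the model. The property names (`termpipelines`, `stopwords.filename`, `trec.model`) are as recalled from Terrier's documentation and may differ between versions, so treat them and the values as assumptions to check against the files in etc/ and doc/. The `ConfigSketch` class itself is hypothetical and only uses `java.util.Properties`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public final class ConfigSketch {

    // Parses properties-file text into a Properties object.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits the comma-separated term pipeline (e.g. stopword removal, then stemming).
    static List<String> termPipeline(Properties props) {
        String value = props.getProperty("termpipelines", "").trim();
        if (value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        // Assumed property names and values; verify against your Terrier version.
        String fragment = String.join("\n",
                "termpipelines=Stopwords,PorterStemmer",
                "stopwords.filename=stopword-list.txt",
                "trec.model=PL2");
        Properties props = load(fragment);
        System.out.println("pipeline: " + termPipeline(props));
        System.out.println("model: " + props.getProperty("trec.model"));
    }
}
```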
Terrier’s Directory Structure
The directories of Terrier are
– bin/ : contains useful scripts for running Terrier
– etc/ : contains the configuration files
– doc/ : contains the documentation of Terrier
– lib/ : contains the compiled Terrier classes and the external libraries used by Terrier
– licenses/ : contains the license information of the components included with Terrier
– share/ : contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
– src/ : contains the source code of Terrier
– var/index : contains the data structures
– var/results : contains the retrieval results
Configuration choices: which models, stopword list, stemmer, etc.
Source files needed to start modifications