Terrier: TERabyte RetRIevER

An Introduction By: Kavita Ganesan (Last Updated April 21st 2009)

About Terrier

Information Retrieval Toolkit

Developed by the Information Retrieval Group at the University of Glasgow since 2001

The team:

– 3 researchers
– 5 PhD students
– 5 programmers

About Terrier

Provides a platform for the development of large-scale IR applications

Uses Hadoop to distribute indexing: indexing tasks are split across different nodes on a cluster
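
The idea of splitting indexing tasks across nodes can be illustrated with a small single-machine sketch. This is not Terrier's Hadoop code: the function names (`index_shard`, `merge_indexes`) and the toy documents are invented for illustration. Each "map" step builds a partial inverted index for one shard of documents, and a "reduce" step merges the partial posting lists.

```python
from collections import defaultdict

def index_shard(docs):
    """Build a partial inverted index {term: {doc_id: tf}} for one shard."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def merge_indexes(partials):
    """Merge partial indexes; doc ids are assumed unique across shards."""
    merged = defaultdict(dict)
    for partial in partials:
        for term, postings in partial.items():
            merged[term].update(postings)
    # Sort each posting list by doc id so the result is deterministic.
    return {term: dict(sorted(p.items())) for term, p in merged.items()}

# Hypothetical corpus split into two shards, as two cluster nodes might see it.
shard_a = {1: "terrier indexes large collections", 2: "hadoop splits work"}
shard_b = {3: "terrier uses hadoop", 4: "large scale retrieval"}

index = merge_indexes([index_shard(shard_a), index_shard(shard_b)])
print(index["terrier"])  # {1: 1, 3: 1}
print(index["hadoop"])   # {2: 1, 3: 1}
```

In a real cluster the shards would live on different machines and the merge would be driven by the framework; the sketch only shows why the work partitions cleanly by document.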

Java-based

The weighting models in Terrier are based on the Divergence From Randomness (DFR) framework

Also includes other IR models
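
To give a feel for what a DFR weighting model computes, here is a minimal sketch of one commonly cited member of the family, InL2 (inverse document frequency randomness model, Laplace after-effect, normalisation 2). The formula follows the general DFR literature as far as I know it; the function name, parameter names, and the default `c=1.0` are assumptions for illustration, and Terrier's own implementation may differ in detail.

```python
import math

def inl2_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, c=1.0, qtf=1.0):
    """Score one term in one document with the InL2 DFR model (sketch)."""
    if tf == 0:
        return 0.0
    # Normalisation 2: rescale tf by document length relative to the average.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Randomness model In: information content from inverse document frequency.
    inf1 = tfn * math.log2((num_docs + 1.0) / (doc_freq + 0.5))
    # Laplace after-effect (L): this factor decreases as tfn grows.
    gain = 1.0 / (tfn + 1.0)
    return qtf * gain * inf1

# A term found in few documents should outscore one found in many.
rare = inl2_score(tf=2, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=5)
common = inl2_score(tf=2, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=500)
print(rare > common)  # True
```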

State-of-the-art functionalities

hyperlink structure analysis to rank pages

automatic query expansion/re-formulation techniques

pre-retrieval query performance predictors

compression techniques
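
Automatic query expansion is usually built on pseudo-relevance feedback: run the query once, assume the top-ranked documents are relevant, and add informative terms from them to the query. The sketch below is a deliberately simplified version that picks terms by raw frequency; it is not Terrier's expansion method, and `expand_query` and its parameters are hypothetical names.

```python
from collections import Counter

def expand_query(query_terms, top_docs, num_new_terms=2, stopwords=frozenset()):
    """Add the most frequent unseen terms from the top-ranked documents."""
    counts = Counter()
    for text in top_docs:
        for term in text.lower().split():
            if term not in query_terms and term not in stopwords:
                counts[term] += 1
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:num_new_terms]]

# Hypothetical top-ranked documents from a first retrieval pass.
top_docs = [
    "terrier retrieval platform from glasgow",
    "glasgow retrieval group builds terrier",
    "terrier supports retrieval experiments",
]
print(expand_query(["terrier"], top_docs, stopwords={"from"}))
# ['terrier', 'retrieval', 'glasgow']
```

A production system would weight candidate terms with a proper term-scoring model rather than raw counts, and would also reweight the original query terms.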

Other notable features

Selects the optimal weighting model based on the statistical features of the query

Toolkit Comparison: Lemur, Lucene, Terrier

Indexing

– Lemur: claims it can index up to terabytes of data; supports incremental indexing
– Lucene: can index over 20 MB/minute on a home machine; small RAM requirements (only 1 MB heap); index size is about 20%-30% of the size of the text indexed (400 GB to 80 GB); Nutch supports distributed indexing; supports incremental indexing
– Terrier: some numbers: 400 GB of files to index; 17 GB of resulting index files (about 4% of the actual text); 3 days to build (2 processors); 4 sec/query to retrieve (8 processors); supports distributed indexing; does not support incremental indexing

Retrieval Models

– Lemur: KL-divergence, vector space, Okapi BM25, language model, TF-IDF
– Lucene: VSM, Boolean retrieval model
– Terrier: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF

Prog. Lang

– Lemur: C++
– Lucene: Java
– Terrier: Java
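
Okapi BM25 appears in the retrieval model lists above for both Lemur and Terrier. As a reminder of what it computes, here is a minimal sketch of a common textbook variant. The idf form used here (with `1 +` inside the logarithm so it stays non-negative) is one of several in circulation, and the defaults `k1=1.2`, `b=0.75` are conventional choices, not values taken from any of the three toolkits.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Sum BM25 term scores for one document (common textbook variant)."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        df = doc_freqs.get(term, 0)
        if tf == 0 or df == 0:
            continue
        # Non-negative idf variant; other formulations omit the "1 +".
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates via k1; b controls length normalisation.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score

# Hypothetical statistics: 4 documents, "terrier" occurs in 2 of them.
doc = ["terrier", "uses", "hadoop"]
print(bm25_score(["terrier"], doc, {"terrier": 2}, num_docs=4, avg_doc_len=3.0))
```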

Out of the box capabilities

Index and evaluate on TREC test collections

Index standard file formats: HTML, PDF, Word, Excel, and PowerPoint files

GUI-based desktop search application

Other out-of-the-box capabilities

Indexing support using Hadoop

Highly compressed index data structures

Options for various stemming techniques

Many document weighting model options:

– 126 Divergence From Randomness (DFR) models
– Okapi BM25
– Language modeling
– TF-IDF

Modifiable code: open source code base (Mozilla Public Licence)
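
Compressed index structures generally rest on two ideas: store the gaps between sorted document ids instead of the ids themselves, and encode small integers with few bits. The sketch below uses the Elias gamma code for the gaps. It illustrates the general technique only; it assumes non-empty, strictly increasing doc ids starting at 1 or higher, uses bit strings for readability, and is not a description of Terrier's actual on-disk format.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    binary = bin(n)[2:]                      # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary  # zeros announce the bit length

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def compress_postings(doc_ids):
    """Encode sorted doc ids as gamma-coded gaps (first gap is the id itself)."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(gamma_encode(g) for g in gaps)

def decompress_postings(bits):
    """Rebuild the doc ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in gamma_decode(bits):
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 8, 20]
bits = compress_postings(postings)
print(bits)                       # 0110010010001100
print(decompress_postings(bits))  # [3, 7, 8, 20]
```

Four doc ids fit in 16 bits here, against 128 bits as plain 32-bit integers; real indexes pack the bits into bytes rather than strings.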

Nice to have…but not there

Ability to easily build a search engine

Incremental indexing: the index has to be re-created every time, unless you write your own code for incremental indexing

Flexible indexer: you have to implement your own indexer for non-standard data formats
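
One common way to write your own incremental indexing is to keep a small "delta" index for new documents next to the large main index, query both, and merge them occasionally. The sketch below shows that pattern on a toy in-memory index. The class `IncrementalIndex` and its methods are hypothetical and do not correspond to any Terrier API.

```python
from collections import defaultdict

class IncrementalIndex:
    """Toy main-plus-delta inverted index (term -> set of doc ids)."""

    def __init__(self):
        self.main = defaultdict(set)    # large index, rebuilt rarely
        self.delta = defaultdict(set)   # small index for new documents

    def add_document(self, doc_id, text):
        # New documents only touch the delta index, so no full rebuild.
        for term in text.lower().split():
            self.delta[term].add(doc_id)

    def search(self, term):
        # A query consults both indexes and unions the results.
        return sorted(self.main.get(term, set()) | self.delta.get(term, set()))

    def merge(self):
        # Periodically fold the delta into the main index and reset it.
        for term, doc_ids in self.delta.items():
            self.main[term] |= doc_ids
        self.delta = defaultdict(set)

idx = IncrementalIndex()
idx.add_document(1, "terrier retrieval")
idx.merge()
idx.add_document(2, "terrier indexing")
print(idx.search("terrier"))  # [1, 2]
```

The trade-off is that every query pays for two lookups until the next merge, in exchange for avoiding a full re-index on every update.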

Benefits of using Terrier

Terrier is an active, ongoing project, so users benefit from new models, performance enhancements, and new features

Can index large amounts of data, so it is scalable in the long run

Good support from the team: wiki and discussion forums

Benefits of using Terrier (continued)

Easy to set up and use

Very modular

Source files are fully modifiable and well documented

How To Get Started?

1. Download the binary. The full source code is included in this download.

2. Unzip the file to a directory.

3. Modify the configuration files: models to use, stemmer, etc.

4. You are now ready to index and evaluate, using the pre-existing scripts.


Terrier's Directory Structure

The directories of Terrier are:

– bin/ : contains useful scripts for running Terrier
– etc/ : contains the configuration files (which models to use, stop word list, stemmer, etc.)
– doc/ : contains the documentation of Terrier
– lib/ : contains the compiled Terrier classes and the external libraries used by Terrier
– licenses/ : contains the license information of the components included with Terrier
– share/ : contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
– src/ : contains the source code of Terrier (the source files needed to start modifications)
– var/index : contains the data structures
– var/results : contains the retrieval results
