
IR Indexing

Thanks to

B. Arms

SIMS

Baldi, Frasconi, Smyth

Manning, Raghavan, Schutze

What we have covered
• What is IR
• Evaluation
• Tokenization and properties of text
• Web crawling
• Vector methods
• Measures of similarity
• This presentation
  – Indexing
  – Inverted files

Summary: What’s the point of using vector spaces?

• A well-formed algebraic space for retrieval
• Key: A user’s query can be viewed as a (very) short document.
• Query becomes a vector in the same space as the docs.
• Can measure each doc’s proximity to it.
• Natural measure of scores/ranking – no longer Boolean.
  – Queries are expressed as bags of words
• Other similarity measures: see http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey







A Typical Web Search Engine

[Figure: block diagram with the components Users, Interface, Query Engine, Index, Indexer, Crawler, and Web. A second copy of the diagram adds the software labels Solr/Lucene, YouSeer, and Nutch.]

Indexing

Why indexing?
• For efficient searching for documents with unstructured text (not databases)
  – Online sequential text search (grep)
    • Small collection
    • Text volatile
  – Data structures - indexes
    • Large, semi-stable document collection
    • Efficient search

Unstructured vs structured data

• What’s available?
  – Web
  – Organizations
  – You
• Unstructured usually means “text”
• Structured usually means “databases”
• Semistructured somewhere in between

Unstructured (text) vs. structured (database) data companies in 1996

[Figure: bar chart comparing Data volume and Market Cap for Unstructured vs. Structured; vertical axis from 0 to 160.]

Unstructured (text) vs. structured (database) data companies in 2006

[Figure: bar chart comparing Data volume and Market Cap for Unstructured vs. Structured; vertical axis from 0 to 160.]

http://www.yahoo.com/

http://search.live.com/results.aspx?q=housing&mkt=en-us&FORM=LVSP&go.x=0&go.y=0&go=Search

IR vs. databases: Structured vs unstructured data

• Structured data tends to refer to information in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured data

• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated “concept” queries e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents

Semi-structured data

• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates “semi-structured” search such as
  – Title contains data AND Bullets contain search

… to say nothing of linguistic structure

More sophisticated semi-structured search

• Title is about Object Oriented Programming AND Author something like stro*rup

• where * is the wild-card operator

• Issues:
  – how do you process “about”?
  – how do you rank results?

• The focus of XML search.

Clustering and classification

• Given a set of docs, group them into clusters based on their contents.

• Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

• Not discussed in this course

The web and its challenges

• Unusual and diverse documents
• Unusual and diverse users, queries, information needs
• Beyond terms, exploit ideas from social networks
  – link analysis, clickstreams ...

• How do search engines work? And how can we make them better?

More sophisticated information retrieval not covered here

• Cross-language information retrieval

• Question answering

• Summarization

• Text mining

• …

Search Subsystem

[Figure: search pipeline. query -> parse query -> query tokens -> stop list* -> non-stoplist tokens -> stemming* -> stemmed terms -> Boolean operations* and ranking* against the Index database. Document-set labels in the figure: retrieved document set, relevant document set, ranked document set.]

*Indicates optional operation.

Major Indexing Methods
• Inverted index
  – effective for very large collections of documents
  – associates lexical items to their occurrences in the collection
• Positional index
• Non-positional indexes
  – Block-sort
• Suffix trees and arrays
  – Faster for phrase searches; harder to build and maintain
• Signature files
  – Word oriented index structures based on hashing (usually not used for large texts)

Sparse Vectors
• Tokens as vectors
• Vocabulary and therefore dimensionality of vectors can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e. most entries are 0).
• Need efficient methods for storing and computing with sparse vectors.

Sparse Vectors as Lists

• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
  – Space proportional to number of unique tokens (n) in document.
  – Requires linear search of the list to find (or change) the weight of a specific token.
  – Requires quadratic time in worst case to compute vector for a document:

    \sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2)

Sparse Vectors as Hash Tables
• Hashing:
  – well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array
• Store tokens in hash table, with token string as key and weight as value.
  – Storage overhead for hash table ~1.5n
  – Table must fit in main memory.
  – Constant time to find or update weight of a specific token (ignoring collisions).
  – O(n) time to construct vector (ignoring collisions).
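A minimal sketch of the hash-table representation, assuming Python's built-in dict as the hash table and raw term counts as weights. The function names sparse_vector and dot are illustrative, not from the slides.

```python
from collections import Counter

def sparse_vector(tokens):
    """Build a sparse vector as a hash table: token string -> weight (count)."""
    vector = Counter()
    for token in tokens:
        vector[token] += 1  # expected constant-time update per token
    return dict(vector)

def dot(u, v):
    """Dot product of two sparse vectors; only shared tokens contribute."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller table
    return sum(weight * v[token] for token, weight in u.items() if token in v)

doc = sparse_vector(
    "now is the time for all good men to come to the aid of their country".split()
)
query = sparse_vector("the country".split())
print(dot(doc, query))
```

Building the vector touches each of the n tokens once, which matches the O(n) construction cost stated above. Zero entries are never stored.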

Implementation Based on Inverted Files

• In practice, document vectors are not stored directly; an inverted organization provides much better efficiency.

• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree).

• Critical issue is logarithmic or constant-time access to token information.

Efficiency Criteria

Storage

Inverted files are big, typically 10% to 100% the size of the collection of documents.

Update performance

It must be possible, with a reasonable amount of computation, to:

(a) Add a large batch of documents

(b) Add a single document

Retrieval performance

Retrieval must be fast enough to satisfy users and not use excessive resources.

Document File

The documents file stores the documents that are being indexed. The documents may be:

• primary documents, e.g., electronic journal articles

• surrogates, e.g., catalog records or abstracts

Postings File

Merging inverted lists is the most computationally intensive task in many information retrieval systems.

Since inverted lists may be long, it is important to match postings efficiently.

Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially.

For efficient matching, the inverted lists should all be sorted in the same sequence.

Inverted lists are commonly cached to minimize disk accesses.
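A minimal sketch of sequential matching, assuming each inverted list is a Python list of document IDs sorted in ascending order. The function name intersect is illustrative.

```python
def intersect(postings_a, postings_b):
    """Merge two doc-ID lists sorted in the same order; return the common IDs."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 8], [2, 3, 8, 9]))
```

Each list is read once from front to back, so the lists can be streamed from disk page by page. This only works because both lists share one sort order.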

Document File

The storage of the document file may be:

Central (monolithic) - all documents stored together on a single server (e.g., library catalog)

Distributed database - all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog)

Highly distributed - documents are stored on independently managed servers (e.g., Web)

Each requires: a document ID, which is a unique identifier that can be used by the inverted file system to refer to the document, and a location counter, which can be used to specify location within a document.

Documents File for Web Search System

For web search systems:

• A document is a web page.

• The documents file is the web.

• The document ID is the URL of the document.

Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.

(In addition to the usual index file and postings file the indexing system stores contextual information, which will be discussed in a later lecture.)

Term Frequency (Postings File)

The postings file stores the elements of a sparse matrix, the term assignment matrix.

It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file.

Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.

Each list consists of one or many individual postings.

Length of Postings File

For a common term there may be a very large number of postings.

Example:

1,000,000,000 documents
1,000,000 distinct words
average length 1,000 words per document

10^12 postings

By Zipf's law, the 10th ranking word occurs, approximately:

(10^12/10)/10 times = 10^10 times
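The arithmetic can be checked with a short sketch. It assumes the common Zipf approximation that the word of rank r accounts for about 1/(10 r) of all occurrences. The slide's first division by 10 is read here as that constant, which is an interpretation rather than something the slide states.

```python
documents = 1_000_000_000
words_per_document = 1_000
total_occurrences = documents * words_per_document  # 10**12

ZIPF_DIVISOR = 10  # assumed constant: the rank-1 word covers about 1/10 of all text

def zipf_occurrences(rank, total=total_occurrences):
    """Approximate number of occurrences of the word at the given rank."""
    return total // (ZIPF_DIVISOR * rank)

print(zipf_occurrences(10))
```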

Inverted Index

Primary data structure for text indexes
• Basically two elements:
  – (Vocabulary, Occurrences)
• Main Idea:
  – Invert documents into a big index
• Basic steps:
  – Make a “dictionary” of all the terms/tokens in the collection
  – For each term/token, list all the docs it occurs in.
    • Possibly location in document (Lucene stores the positions)
  – Compress to reduce redundancy in the data structure
    • Also reduces I/O and storage required

Inverted Indexes

We have seen “Vector files”. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1

Terms   D1   D2   D3   D4   D5   D6   D7   …
t1      1    1    0    1    1    1    0
t2      0    0    1    0    1    1    1
t3      1    0    1    0    1    0    0
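The inversion of the table above can be sketched as follows. Storing each term's row as a list of document IDs, rather than a full 0/1 row, is an assumption of this sketch. The name invert is illustrative.

```python
TERMS = ["t1", "t2", "t3"]
DOC_TERM = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}

def invert(doc_term, terms):
    """Turn one row per document into one list of document IDs per term."""
    inverted = {term: [] for term in terms}
    for doc_id, row in doc_term.items():
        for term, present in zip(terms, row):
            if present:
                inverted[term].append(doc_id)  # keep only the non-zero entries
    return inverted

inverted = invert(DOC_TERM, TERMS)
print(inverted["t1"])
```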

How Are Inverted Files Created
• Documents are parsed one document at a time to extract tokens/terms. These are saved with the Document ID (DID).

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

<term, DID> pairs (Term, Doc #), in parse order:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1,
to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1,
it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2,
the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

How Inverted Files are Created

• After all documents have been parsed, the inverted file is sorted alphabetically and in document order.

<term, DID> pairs (Term, Doc #), after sorting:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2,
for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2,
night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2,
the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

How Inverted Files are Created

• Multiple term/token entries for a single document are merged.
• Within-document term frequency information is compiled.
• Result <term,DID,tf>, e.g., <the,1,2>

Term       Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2

How Inverted Files are Created

• Then the inverted file can be split into
  – A Dictionary file
    • File of unique terms
    • Fit in memory if possible
  and
  – A Postings file
    • File of what document the term/token is in and how often.
    • Sometimes where the term is in the document.
    • Store on disk
• Worst case O(n), where n is the number of tokens.

Dictionary and Posting Files

Dictionary (Term, N docs, Tot Freq) with the postings (Doc #, Freq) each term points to:

Term       N docs   Tot Freq   Postings (Doc #, Freq)
a          1        1          (2,1)
aid        1        1          (1,1)
all        1        1          (1,1)
and        1        1          (2,1)
come       1        1          (1,1)
country    2        2          (1,1) (2,1)
dark       1        1          (2,1)
for        1        1          (1,1)
good       1        1          (1,1)
in         1        1          (2,1)
is         1        1          (1,1)
it         1        1          (2,1)
manor      1        1          (2,1)
men        1        1          (1,1)
midnight   1        1          (2,1)
night      1        1          (2,1)
now        1        1          (1,1)
of         1        1          (1,1)
past       1        1          (2,1)
stormy     1        1          (2,1)
the        2        4          (1,2) (2,2)
their      1        1          (1,1)
time       2        2          (1,1) (2,1)
to         1        2          (1,2)
was        1        2          (2,2)
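The four steps (parse, sort, merge, split) can be sketched on the two example documents. The tokenizer, which lowercases and keeps runs of letters, and the in-memory dict layout are simplifying assumptions, not how any particular engine stores its files.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # Step 1: parse each document into <term, DID> pairs.
    pairs = [(term, did) for did, text in docs.items() for term in tokenize(text)]
    # Step 2: sort alphabetically by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated pairs into <term, DID, tf>.
    merged = Counter(pairs)
    # Step 4: split into a postings file and a dictionary file.
    postings = defaultdict(list)
    for (term, did), tf in sorted(merged.items()):
        postings[term].append((did, tf))
    dictionary = {
        term: (len(plist), sum(tf for _, tf in plist))  # (N docs, Tot Freq)
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)

dictionary, postings = build_index(DOCS)
print(dictionary["the"], postings["the"])
```

With these structures a Boolean AND reduces to intersecting document IDs: {d for d, _ in postings["time"]} & {d for d, _ in postings["dark"]} leaves only document 2.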

Inverted indexes
• Permit fast search for individual terms
• For each term/token, you get a list consisting of:
  – document ID (DID)
  – frequency of term in doc (optional, implied in Lucene)
  – position of term in doc (optional, Lucene)
  – <term,DID,tf,position>
  – <term,(DID_i,tf,position_ij),…>
  – Lucene:
    • <position_ij,…> (term and DID are implied from other files)
• These lists can be used to solve Boolean queries:
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2

How Inverted Files are Used

[Figure: the dictionary (Term, N docs, Tot Freq) and postings (Doc #, Freq) files shown under “Dictionary and Posting Files”.]

Query on “time” AND “dark”

• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file

Therefore, only doc 2 satisfied the query.

Inverted index

• Associates a posting list with each term
  – POSTING LIST example
    • a (d2, 1)
    • …
    • the (d1,2) (d2,2)
• Replace term frequency (tf) with tfidf
  – Lucene only uses tf, idf added at query time
• Compress index and put hash links
• Match query to index and rank

Position in inverted file posting

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

– POSTING LIST example
  • now (1)
  • …
  • time (4, 13)
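A minimal sketch of a positional index for the two documents. It assumes 1-based token positions counted separately within each document, which reads the slide's "time (4, 13)" as position 4 in Doc 1 and position 13 in Doc 2. The tokenizer and the name positional_index are illustrative.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def positional_index(docs):
    """Map each term to {doc ID: [1-based token positions in that document]}."""
    index = defaultdict(dict)
    for did, text in docs.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for position, term in enumerate(tokens, start=1):
            index[term].setdefault(did, []).append(position)
    return index

index = positional_index(DOCS)
print(index["time"])
```

Keeping positions makes phrase and proximity queries possible, at the cost of a larger postings file.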

Change weight

• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term / Doc # / Freq table as under “How Inverted Files are Created”: Freq is 2 for <the,1,2>, <the,2,2>, <to,1,2>, and <was,2,2>, and 1 for every other entry.


Example: WestLaw http://www.westlaw.com/

• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• About 7 terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• Example query:
  – What is the statute of limitations in cases involving the federal tort claims act?
  – LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• Long, precise queries; proximity operators; incrementally developed; not like web search



Time Complexity of Indexing

• Complexity of creating vector and indexing a document of n tokens is O(n).
• So building an index of m such documents is O(m n).
• Computing vector lengths is also O(m n), which is also the complexity of reading in the corpus.

Retrieval with an Inverted Index
• Tokens that are not in both the query and the document do not affect cosine similarity.
  – Product of token weights is zero and does not contribute to the dot product.

• Usually the query is fairly short, and therefore its vector is extremely sparse.

• Use inverted index to find the limited set of documents that contain at least one of the query words.

• Retrieval time O(log M) due to hashing where M is the size of the document collection.

Inverted Query Retrieval Efficiency

• Assume that, on average, a query word appears in B documents:

  Q = q1 q2 … qn
  q1 -> D11 … D1B,  q2 -> D21 … D2B,  …,  qn -> Dn1 … DnB

• Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all M documents, O(|V| M), because |Q| << |V| and B << M.

Processing the Query
• Incrementally compute cosine similarity of each indexed document as query words are processed one by one.

• To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where DocumentReference is the key and the partial accumulated score is the value.
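A minimal sketch of the accumulator approach, assuming postings of the form (doc ID, weight) and document vector lengths computed ahead of time. The function name rank and the toy data are illustrative. The weights are the raw term frequencies of the two example documents.

```python
import math
from collections import defaultdict

def rank(query_weights, index, doc_lengths):
    """Score documents one query term at a time using an inverted index.

    query_weights: term -> weight in the query
    index:         term -> list of (doc_id, weight) postings
    doc_lengths:   doc_id -> Euclidean length of the document vector
    """
    scores = defaultdict(float)  # hash table: doc_id -> partial dot product
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    q_length = math.sqrt(sum(w * w for w in query_weights.values()))
    ranked = [
        (doc_id, partial / (q_length * doc_lengths[doc_id]))
        for doc_id, partial in scores.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

index = {"time": [(1, 1), (2, 1)], "dark": [(2, 1)]}
doc_lengths = {1: math.sqrt(20), 2: math.sqrt(20)}  # from the tf values of each doc
ranked = rank({"time": 1, "dark": 1}, index, doc_lengths)
print(ranked)
```

Only documents that appear in some query term's postings ever get an entry in the table, so the work is proportional to the postings read, not to the size of the collection.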

Index Files

On disk

If an index is held on disk, search time is dominated by the number of disk accesses.

In memory

Suppose that an index has 1,000,000 distinct terms.

Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters.

Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.

Index File Structures: Linear Index

Advantages

Can be searched quickly, e.g., by binary search, O(log M)

Good for sequential processing, e.g., comp*

Convenient for batch updating

Economical use of storage

Disadvantages

Index must be rebuilt if an extra term is added
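A minimal sketch of a linear index, assuming it is held as a sorted Python list of terms and searched with the standard bisect module. The term list and function names are illustrative.

```python
from bisect import bisect_left

def lookup(sorted_terms, term):
    """Binary search for an exact term; return its position or -1."""
    i = bisect_left(sorted_terms, term)
    return i if i < len(sorted_terms) and sorted_terms[i] == term else -1

def prefix_range(sorted_terms, prefix):
    """Return every term that starts with prefix (e.g. comp*) as one slice."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1  # matching terms are adjacent, so scan sequentially
    return sorted_terms[start:end]

terms = sorted(["come", "comparison", "compile", "computer", "country", "dark"])
print(lookup(terms, "country"), prefix_range(terms, "comp"))
```

Because matching terms sit next to each other, a wildcard such as comp* costs one binary search plus a short sequential scan. Inserting a new term into the middle of the sorted list is the expensive part.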

Index on disk vs. memory

• Most retrieval systems keep the dictionary in memory and the postings on disk

• Web search engines frequently keep both in memory
  – massive memory requirement
  – feasible for large web service installations
  – less so for commercial usage where query loads are lighter

Indexing in the real world

• Typically, don’t have all documents sitting on a local filesystem
  – Documents need to be spidered
  – Could be dispersed over a WAN with varying connectivity
  – Must schedule distributed spiders/indexers
  – Could be (secure content) in
    • Databases
    • Content management applications
    • Email applications

Content residing in applications

• Mail systems/groupware, content management contain the most “valuable” documents

• http often not the most efficient way of fetching these documents - native API fetching
  – Specialized, repository-specific connectors
  – These connectors also facilitate document viewing when a search result is selected for viewing

Secure documents

• Each document is accessible to a subset of users
  – Usually implemented through some form of Access Control Lists (ACLs)
• Search users are authenticated
• Query should retrieve a document only if user can access it
  – So if there are docs matching your search but you’re not privy to them, “Sorry no results found”
  – E.g., as a lowly employee in the company, I get “No results” for the query “salary roster”

Users in groups, docs from groups

• Index the ACLs and filter results by them

• Often, user membership in an ACL group verified at query time – slowdown

[Figure: Users × Documents matrix; each entry is 0 if user can’t read doc, 1 otherwise.]
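A minimal sketch of filtering results by ACL, assuming the access matrix is stored sparsely as a mapping from document ID to the set of users allowed to read it. The layout, names, and sample data are hypothetical.

```python
def filter_by_acl(results, user, acl):
    """Keep only the documents whose ACL lists the user.

    acl maps doc_id -> set of users allowed to read it (hypothetical layout).
    """
    return [doc_id for doc_id in results if user in acl.get(doc_id, set())]

acl = {
    "salary_roster": {"hr_manager"},
    "handbook": {"hr_manager", "employee"},
}
matches = ["salary_roster", "handbook"]
print(filter_by_acl(matches, "employee", acl))
```

A document with no ACL entry is treated as unreadable here, which is the conservative choice.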

Compound documents

• What if a doc consisted of components
  – Each component has its own ACL.
• Your search should get a doc only if your query meets one of its components that you have access to.
• More generally: doc assembled from computations on components
  – e.g., in Lotus databases or in content management systems

• How do you index such docs?

No good answers …

“Rich” documents

• (How) Do we index images?

• Researchers have devised Query By Image Content (QBIC) systems
  – “show me a picture similar to this orange circle”
  – Then use vector space retrieval
• In practice, image search usually based on meta-data such as file name e.g., monalisa.jpg
• New approaches exploit social tagging
  – E.g., flickr.com

Passage/sentence retrieval

• Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence - say, in a very long document

• Can index passages/sentences as mini-documents – what should the index units be?

• This is the subject of XML search

Indexing Subsystem

[Figure: indexing pipeline. Steps marked * are optional.]

documents -> assign document IDs -> text, document numbers and *field numbers
-> break into tokens -> tokens
-> stop list* -> non-stoplist tokens
-> stemming* -> stemmed terms
-> term weighting* -> terms with weights
-> Index database


What we covered

• Indexing
• Inverted files
  – Storage and access advantages

• How Lucene does all this.
