Today’s Topics Boolean IR Signature files Inverted files PAT trees Suffix arrays

Upload: emory-dickerson

Post on 17-Jan-2016

TRANSCRIPT

Page 1: Today’s Topics Boolean IR Signature files Inverted files PAT trees Suffix arrays

Today’s Topics

• Boolean IR
• Signature files
• Inverted files
• PAT trees
• Suffix arrays

Page 2

Boolean IR

• Documents composed of TERMS (words, stems)
• Express result in set-theoretic terms

[Figure: two Venn diagrams over the sets of docs containing term A, term B, and term C; one shades A AND B, the other shades (A AND B) OR C]

- Pre-1970s
- Dominant industrial model through 1994 (Lexis-Nexis, DIALOG)

Page 3

Boolean Operators

A AND B
A OR B
(A AND B) OR C
A AND (NOT B)

[Figure: Venn diagram of docs containing term A]

Proximity Operators (Extended ANDs)

• Adjacent AND: “A B”, e.g. “Johns Hopkins”, “The Who”
• Proximity window: A w/10 B means A and B within +/- 10 words (in general, within +/- K words)
• A w/sent B: A + B in same sentence
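The Boolean operators above map directly onto set operations over document IDs. The following is a minimal Python sketch; the term names and document IDs are hypothetical toy data, not taken from the slides.

```python
# Hypothetical toy data: term -> set of IDs of the documents containing it.
docs_with = {
    "A": {1, 2, 3, 5},
    "B": {2, 3, 7},
    "C": {4, 7},
}
all_docs = {1, 2, 3, 4, 5, 6, 7}

# A AND B: set intersection
a_and_b = docs_with["A"] & docs_with["B"]

# (A AND B) OR C: intersection followed by union
a_and_b_or_c = a_and_b | docs_with["C"]

# A AND (NOT B): NOT is the complement with respect to the whole collection
a_and_not_b = docs_with["A"] & (all_docs - docs_with["B"])

print(sorted(a_and_b), sorted(a_and_b_or_c), sorted(a_and_not_b))
```

Note that NOT needs the full document set to take a complement, which is one reason it is usually combined with AND rather than evaluated on its own.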

Page 4

Boolean IR (implementation)

• Bit vectors (one bit per Term_i)

  V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
  V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0

  Impractical: very sparse (wastefully big), costly to compare

• Inverted files (a.k.a. index)

• PAT tree (more powerful index)

Page 5

Problems with Boolean IR

• Does not effectively support relevance ranking of returned documents
• Base model: expression satisfaction is Boolean. A document matches the expression or it doesn’t
• Extension to permit ordering: (A AND B) OR C
  – Supermatches (5 terms/doc > 3 terms/doc)
  – Partial matches (expression incompletely satisfied; give partial credit)
  – Importance weighting (10A OR 5B), where the coefficients are the weight/importance of each term
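One way to read the partial-match and importance-weighting extensions is as a score: each query term carries a weight, and a document earns the weights of the terms it contains. The sketch below assumes that reading; `weighted_score` and `rank` are hypothetical helper names and the documents are toy data.

```python
def weighted_score(doc_terms, weights):
    """Sum the weights of the query terms that appear in the document."""
    return sum(w for term, w in weights.items() if term in doc_terms)


def rank(docs, weights):
    """Return IDs of documents with a non-zero score, best score first."""
    scored = [(weighted_score(terms, weights), doc_id)
              for doc_id, terms in docs.items()]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps ties in input order
    return [doc_id for score, doc_id in scored if score > 0]


# Query "10A OR 5B" expressed as term weights.
weights = {"A": 10, "B": 5}
docs = {"d1": {"A", "B"}, "d2": {"A"}, "d3": {"B", "C"}, "d4": {"C"}}
print(rank(docs, weights))
```

A document matching both terms outranks one matching only A, which outranks one matching only B; a pure Boolean OR would return all three unordered.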

Page 6

Boolean IR

• Advantages: Can directly control search
  Good for precise queries in structured data (e.g. database search or legal index)

• Disadvantages: Must directly control search
  – Users should be familiar with domain and term space (know what to ask for and exclude)
  – Poor at relevance ranking
  – Poor at weighted query expansion, user modelling etc.

Page 7

Signature Files

Superimposed coding: a mapping function f( ) (some mapping/hash function) maps the document bit vector to a signature with fewer bits.

Document bit vector: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
Signature:           0 0 1 0 0 1 0 0 0 0 0

Problem: several different document bit vectors (i.e. different words) get mapped to the same signature. (Use a stoplist to help keep common words from overwhelming signatures.)

Page 8

False Drop Problem

• On retrieval, all documents/bit vectors mapped to the same signature are retrieved (returned)
• Only a portion are relevant
• Need to do secondary validation step to make sure target words actually match

Prob(False Drop) = Prob(Signature qualifies & Text does not)
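The signature idea and the secondary validation step can be sketched together. This is only an illustration under stated assumptions: the slides map a document bit vector through f( ), whereas this sketch hashes each word straight to a few signature bits; the 64-bit width, the three bits per word, and the use of SHA-256 are arbitrary choices, and all function names are hypothetical.

```python
import hashlib

SIG_BITS = 64       # assumed signature width
BITS_PER_WORD = 3   # assumed number of bit positions chosen per word


def word_mask(word):
    """Hash a word to a mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % SIG_BITS)
    return mask


def signature(words):
    """Superimposed coding: OR together the masks of all words."""
    sig = 0
    for w in words:
        sig |= word_mask(w)
    return sig


def search(query_words, docs):
    """docs: doc_id -> list of words. Returns (candidates, confirmed)."""
    q = signature(query_words)
    # Signature test: every query bit must be set in the document signature.
    candidates = [d for d, ws in docs.items() if (signature(ws) & q) == q]
    # Secondary validation: check the real text to discard false drops.
    confirmed = [d for d in candidates
                 if all(w in docs[d] for w in query_words)]
    return candidates, confirmed


docs = {
    1: ["johns", "hopkins", "university"],
    2: ["the", "who"],
    3: ["johns", "creek"],
}
candidates, confirmed = search(["johns", "hopkins"], docs)
print(candidates, confirmed)
```

A document that really contains the query words always passes the signature test, so there are no false negatives; any extra entries in `candidates` that are missing from `confirmed` are false drops.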

Page 9

Efficiency Problem

Testing for signature match may require linear scan through all document signatures

Page 10

Vertical Partitioning

• Improves sig1, sig2 comparison speed, but still requires O(N) linear search of all signatures
• Options:
  - Bit-sliced onto different devices for parallel comparison
  - AND together matches on each segment

[Figure: signatures sig1 and sig2 split into segments; each pair of segments is compared (comp) and the per-segment results are ANDed into the final result]
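One possible reading of the bit-sliced option is a column-wise layout: store one bit column per signature position, and answer a query by ANDing only the columns where the query signature has a 1. The sketch below assumes that layout; the function names and the 4-bit signatures are hypothetical.

```python
def build_slices(signatures, sig_bits):
    """slices[b] is an int whose bit d is set when document d has bit b set."""
    slices = [0] * sig_bits
    for d, sig in enumerate(signatures):
        for b in range(sig_bits):
            if (sig >> b) & 1:
                slices[b] |= 1 << d
    return slices


def match(slices, query_sig, num_docs):
    """AND together the slices selected by the 1-bits of the query signature."""
    result = (1 << num_docs) - 1          # start with every document a candidate
    for b, column in enumerate(slices):
        if (query_sig >> b) & 1:
            result &= column              # each slice could live on its own device
    return [d for d in range(num_docs) if (result >> d) & 1]


signatures = [0b1010, 0b0110, 0b1110]     # toy signatures for documents 0, 1, 2
slices = build_slices(signatures, sig_bits=4)
print(match(slices, 0b1010, num_docs=3))
```

Every document is still touched once per selected slice, so the search stays linear in the number of signatures, but only the slices for the query's 1-bits are read.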

Page 11

Horizontal Partitioning

• Goal: avoid sequential scanning of the signature file

[Figure: the input signature goes through a hash function or index into the signature database, yielding specific candidates to try]

Page 12

Inverted Files

• Like an index to a book

[Figure: an index whose terms (Baum, Bayes, Viterbi) each point to the list of numbered documents containing that term]

Page 13

Inverted Files

• Very efficient for single-word queries
  Just enumerate documents pointed to by index: O(|A|) = O(S_A)

• Efficient for ORs
  Just enumerate both lists and remove duplicates: O(S_A + S_B)
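A small Python sketch of both points: building a document-level inverted index and answering an OR by merging two sorted postings lists in O(S_A + S_B). The whitespace tokenizer, the function names, and the three toy documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """docs: doc_id -> text. Returns term -> sorted list of doc IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def union(a, b):
    """OR of two sorted postings lists in one pass, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


docs = {14: "Baum Bayes", 15: "Bayes Viterbi", 39: "Baum Viterbi"}
index = build_index(docs)
print(index["bayes"])                          # single-word query
print(union(index["baum"], index["bayes"]))    # baum OR bayes
```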

Page 14

AND’s using Inverted Files

[Figure: postings list A_i (index for Bayes) with pointer i, and postings list B_j (index for Viterbi) with pointer j]

Method 1:

• Begin with two pointers (i, j), one on each list in the index (A, B)
• If A[i] = B[j], write A[i] to output
• If A[i] < B[j], i++ else j++

Cost: O(S_A + S_B), same as OR, but smaller output (meet search)
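Method 1 can be sketched as a two-pointer merge in Python. The function name and the sample postings lists are hypothetical; unlike the slide's rule, this version advances both pointers on a match, which gives the same output.

```python
def intersect_merge(a, b):
    """AND of two sorted postings lists using two pointers, O(S_A + S_B)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # document appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # a[i] cannot match anything later in b
        else:
            j += 1             # b[j] cannot match anything later in a
    return out


bayes = [14, 39, 156, 227, 319]          # hypothetical postings lists
viterbi = [39, 45, 58, 96, 156, 208]
print(intersect_merge(bayes, viterbi))
```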

Page 15

AND’s using Inverted Files

Method 2: Useful if one index is smaller than the other (S_A << S_B)

[Figure: short postings list A_i (Johns) and long postings list B_j (Hopkins)]

For all members of A (the smaller index):
    bsearch(A[i], B)    (do binary search into the larger index)

A AND B AND C: order by smaller list pairwise

Cost: S_A * log2(S_B)
Can achieve S_A * log log(S_B)
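A minimal sketch of Method 2 using the standard-library `bisect` module for the binary search. The function names are hypothetical, and `intersect_many` assumes it is given at least one list.

```python
from bisect import bisect_left


def intersect_bsearch(small, large):
    """AND by binary-searching each member of the smaller sorted list
    in the larger sorted list: about S_A * log2(S_B) comparisons."""
    out = []
    for x in small:
        pos = bisect_left(large, x)
        if pos < len(large) and large[pos] == x:
            out.append(x)
    return out


def intersect_many(lists):
    """A AND B AND C ...: start from the shortest list and combine pairwise,
    so the running result is always the smaller side of the search."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for nxt in ordered[1:]:
        result = intersect_bsearch(result, nxt)
    return result


johns = [39, 227]                                   # hypothetical short list
hopkins = [15, 25, 28, 39, 45, 58, 96, 156]         # hypothetical long list
print(intersect_bsearch(johns, hopkins))
```

Starting from the shortest list keeps every intermediate result small, so each later binary search runs over as few keys as possible.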

Page 16

Proximity Search

[Figure: index entries for Anthony, Johns, and Hopkins (labelled A, H, AH, JH) pointing into Doc 1, Doc 2, Doc 3, ..., Doc i]

Document-level indexes are not adequate

Option 1: index into the corpus by position offset
(Size of index = size of corpus)

Before: match if ptrA = ptrB

Now:
“A B” = match if ptrA = ptrB - 1
A w/10 B = match if |ptrA - ptrB| ≤ 10
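With position offsets in the index, the adjacency and proximity-window tests reduce to comparisons between word positions. The sketch below works on a single text under that assumption; `positions`, `adjacent`, and `within` are hypothetical names and the sentence is toy data.

```python
def positions(text):
    """Positional index for one text: word -> sorted list of word offsets."""
    index = {}
    for offset, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(offset)
    return index


def adjacent(pos_a, pos_b):
    """Phrase "A B": some occurrence of A is immediately followed by B."""
    later = set(pos_b)
    return any(p + 1 in later for p in pos_a)


def within(pos_a, pos_b, k):
    """A w/k B: some occurrences of A and B lie within k words of each other.
    Both position lists must be sorted; a two-pointer scan finds the closest pair."""
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        if abs(pos_a[i] - pos_b[j]) <= k:
            return True
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return False


idx = positions("Anthony Johns Hopkins visited the Johns Hopkins University")
print(adjacent(idx["johns"], idx["hopkins"]))        # "Johns Hopkins"
print(within(idx["anthony"], idx["university"], 10)) # Anthony w/10 University
```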

Page 17

Variations 1: Don’t index function words

[Figure: a word list (The, Johns, Hopkins) beside the index; “The” is crossed out and is not indexed]

• Do linear match search in corpus
• Savings of 50% on index size
• Potential speed improvement given data access costs

Page 18

Variations 2: Multilevel Indexes

[Figure: a doc-level index for the terms Anthony, Johns, and Hopkins sits above a position-level index]

• Supports parallel search
• May have paging cost advantage
• Cost: large index, N + dV (figure annotation on dV: avg. doc/vocab size)

Page 19

Interpolation Search

Useful when data are numeric and uniformly distributed

cell (B_i)   value
17           174
18           195
19           211  (target)
20           226
21           230
22           231
23           246
...
48           483
49           496
50           521
51           526
...
100          995

# of cells in index: 100
Values range from 0 … 1000

Goal: looking for the value 211

Binary search: begin looking at cell 50
Interpolation search: better guess for 1st cell to examine?
    K * (# bins / max size), e.g. 211 * 100 / 1000 ≈ 21

Page 20

Binary Search

Bsearch(low, high, key)
    if (low > high)
        return not found
    mid = (high + low) / 2
    if (key = A[mid])
        return mid
    else if (key < A[mid])
        return Bsearch(low, mid-1, key)
    else
        return Bsearch(mid+1, high, key)

Interpolation Search

Isearch(low, high, key)
    mid = best estimate of pos
    mid = low + (high - low) * (key - A[low]) / (A[high] - A[low])

    (the fraction is the expected % of the way through the range)

Page 21

Comparison

Typical sequence of cells tested:

Binary search: 50, 25, 12, 18, 22, 21, 19
Interpolation search: 21, 19 (go directly to expected region)

Interpolation search needs about log log(N) probes
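The two searches can be compared by counting probes. The sketch below is an iterative Python version written for illustration; the probe counters and the uniformly spaced test array are assumptions, not the slide's data.

```python
def binary_search(a, key):
    """Return (index, probes) for key in sorted list a, or (-1, probes)."""
    low, high, probes = 0, len(a) - 1, 0
    while low <= high:
        probes += 1
        mid = (low + high) // 2
        if a[mid] == key:
            return mid, probes
        if a[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1, probes


def interpolation_search(a, key):
    """Same contract, but probe where the key is expected to sit,
    assuming the values are roughly uniformly distributed."""
    low, high, probes = 0, len(a) - 1, 0
    while low <= high and a[low] <= key <= a[high]:
        probes += 1
        if a[high] == a[low]:
            mid = low  # all remaining values are equal; avoid dividing by zero
        else:
            mid = low + (high - low) * (key - a[low]) // (a[high] - a[low])
        if a[mid] == key:
            return mid, probes
        if a[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1, probes


cells = list(range(0, 1000, 10))   # 100 cells with perfectly uniform values
print(binary_search(cells, 210))
print(interpolation_search(cells, 210))
```

On this perfectly uniform array the interpolation estimate lands on the target at the first probe, while binary search needs seven probes. Real data is rarely this regular, so the gap is usually smaller.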

Page 22

Cost of Computing Inverted Index

1. Simple: word-position pairs and sort
   Cost: N log N (N = corpus size)

2. If N >> memory size
   1) Tokenize (words → integers)
   2) Create histogram
   3) Allocate space in index
   4) Do multipass (K-pass) through corpus, only adding tokens in K bins

Page 23

K-pass Indexing

[Figure: index divided into blocks holding the entries for words W1, W2, W3, W4; Block 1 is filled on pass K = 1, the next block on pass K = 2]

Time = KN + 1
But big win over N log N on paging
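The four steps (tokenize, histogram, allocate, K passes) can be sketched as follows. This is a toy in-memory version under stated assumptions: blocks are split by token ID range rather than by actual memory footprint, `k` must be at least 1, and the function name and return layout are hypothetical.

```python
def k_pass_index(corpus, k):
    """corpus: list of words. Returns (vocab, starts, index), where
    index[starts[t]:starts[t + 1]] holds the corpus positions of token t."""
    # 1) Tokenize: words -> integers
    vocab = {}
    tokens = [vocab.setdefault(word, len(vocab)) for word in corpus]

    # 2) Histogram: how many positions each token needs
    counts = [0] * len(vocab)
    for t in tokens:
        counts[t] += 1

    # 3) Allocate: prefix sums give each token its slice of the index
    starts = [0] * (len(vocab) + 1)
    for t, c in enumerate(counts):
        starts[t + 1] = starts[t] + c
    index = [None] * len(tokens)
    fill = starts[:-1]               # next free slot per token (a fresh list)

    # 4) K passes: each pass fills only the tokens of one block, so only
    #    that block of the index has to be resident while it is written
    block = -(-len(vocab) // k)      # ceiling division
    for p in range(k):
        lo, hi = p * block, (p + 1) * block
        for pos, t in enumerate(tokens):
            if lo <= t < hi:
                index[fill[t]] = pos
                fill[t] += 1
    return vocab, starts, index


vocab, starts, index = k_pass_index("a b a c b a".split(), k=2)
print(index[starts[vocab["a"]]:starts[vocab["a"] + 1]])  # positions of "a"
```

Each pass reads the whole corpus sequentially but writes into one contiguous block of the index, which is the paging advantage over sorting word-position pairs.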

Page 24

Vector Models for IR

• Gerard Salton, Cornell
  (Salton + Lesk, 68)
  (Salton, 71)
  (Salton + McGill, 83)

• SMART System: Salton’s magical automatic retrieval tool(?)
  Chris Buckley, Cornell: current keeper of the flame

Page 25

Vector Models for IR

Boolean Model

Doc V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
Doc V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0

SMART Vector Model

Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0

Term_i: word, stem, or special compound

SMART vectors are composed of real-valued term weights, NOT simply Boolean term present or not

Page 26

Example

         Comput*  C++  Compiler  Sparc  genome  Biolog*  protein  DNA
Doc V1   3        5    4         1      0       1        0        0
Doc V2   1        0    0         0      5       3        1        4
Doc V3   2        8    0         1      0       1        0        0

Issues
• How are weights determined? (simple option: raw freq. weighted by region, titles, keywords)
• Which terms to include? Stoplists
• Stem or not?

Page 27

QUERIES and documents share the same vector representation

[Figure: document vectors D1, D2, D3 and query vector Q plotted in the same space]

Given query D_Q, map it to vector V_Q and find the document D_i for which sim(V_i, V_Q) is greatest

Page 28

Similarity Functions

• Many other options available (Dice, Jaccard)
• Cosine similarity is self-normalizing

[Figure: document vectors D2, D3 and query vector Q]

V1: 100 200 300 50
V2: 1 2 3 0.5
V3: 10 20 30 5

Can use arbitrary integer values (don’t need to be probabilities)
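A short Python sketch of cosine similarity shows the self-normalizing property on V1, V2, and V3 above: they differ only in scale, so each pair has cosine similarity 1. The second part ranks three document vectors against a query; the document weights and the query vector there are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


# Same direction, different lengths: the scale cancels out.
v1 = [100, 200, 300, 50]
v2 = [1, 2, 3, 0.5]
v3 = [10, 20, 30, 5]
print(cosine(v1, v2), cosine(v2, v3))

# Hypothetical 8-term document vectors and a query weighting terms 5 and 8.
docs = {
    "D1": [3, 5, 4, 1, 0, 1, 0, 0],
    "D2": [1, 0, 0, 0, 5, 3, 1, 4],
    "D3": [2, 8, 0, 1, 0, 1, 0, 0],
}
query = [0, 0, 0, 0, 1, 0, 0, 1]
best = max(docs, key=lambda d: cosine(docs[d], query))
print(best)
```

Because the vector lengths divide out, raw counts, scaled counts, and probabilities all give the same ranking as long as the relative weights are unchanged.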