
Post on 18-Jan-2018


INCREMENTAL MAINTENANCE OF LENGTH NORMALIZED INDEXES FOR APPROXIMATE STRING MATCHING

- Ashwin Joshi

PROBLEM
Consider a real system
- Tens of millions of strings
- Updated on an hourly basis
- Practical scenario
  1. Updates are buffered
  2. Index is rebuilt weekly
- Re-computation time = a few hours
- Limitations of online systems

INTRODUCTION
Inverse Document Frequency

Partial Score Contribution


LENGTH NORMALIZATION

Types: L0, L1 and L2. Why is L2 preferred?

Similarity: S(q,s) = (sum of idf(t)^2 over the tokens t shared by q and s) / (L(q) * L(s))

e.g. Query q = {t1, t2, t3}, String s1 = {t1}, String s2 = {t1, t2, t3}

and idf(t1) = 10, idf(t2) = 8, idf(t3) = 2.

For L0, S0(q,s1) = 100/3 > S0(q,s2) = 168/9

For L1, S1(q,s1) = 100/200 > S1(q,s2) = 168/400

For L2, S2(q,s1) = 100/(10 * sqrt(168)) ≈ 0.77 < S2(q,s2) = 168/168 = 1
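The arithmetic in this example can be checked with a short script. It is a minimal sketch that assumes the definitions implied by the numbers on the slide: L0 counts tokens, L1 sums the idf values, L2 is the square root of the sum of squared idf values, and the similarity divides the squared idf of the shared tokens by the product of the two lengths. The function names `length` and `similarity` are illustrative, not taken from the slides.

```python
import math

# idf values and token sets from the slide's example
idf = {"t1": 10.0, "t2": 8.0, "t3": 2.0}
q = {"t1", "t2", "t3"}
s1 = {"t1"}
s2 = {"t1", "t2", "t3"}


def length(tokens, norm):
    """Normalized length of a token set under L0, L1 or L2 (assumed definitions)."""
    if norm == 0:
        return float(len(tokens))                           # number of tokens
    if norm == 1:
        return sum(idf[t] for t in tokens)                  # sum of idf values
    if norm == 2:
        return math.sqrt(sum(idf[t] ** 2 for t in tokens))  # Euclidean length
    raise ValueError("norm must be 0, 1 or 2")


def similarity(query, string, norm):
    """Squared idf of the shared tokens over the product of both lengths."""
    shared = query & string
    numerator = sum(idf[t] ** 2 for t in shared)
    return numerator / (length(query, norm) * length(string, norm))


for norm in (0, 1, 2):
    print(f"L{norm}: S(q,s1) = {similarity(q, s1, norm):.4f}, "
          f"S(q,s2) = {similarity(q, s2, norm):.4f}")
```

Only under L2 does the exact match s2 outrank the one-token string s1, which is the point the slide makes about why L2 is preferred.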

APPROXIMATE STRING MATCHING
Theorem: Length Boundedness

Determines the strings that are either too short or too long to match the query
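The theorem statement itself did not survive in the slide text, so the following is a derivation under an assumption rather than the slide's own wording. Assume the L2 similarity is the sum of idf(t)^2 over shared tokens divided by L2(q) * L2(s). The numerator can never exceed the smaller of L2(q)^2 and L2(s)^2, so a string s can reach a threshold tau only if tau * L2(q) <= L2(s) <= L2(q) / tau. The helper names below are hypothetical.

```python
import math


def l2_length(tokens, idf):
    """Euclidean length of a token set under the given idf table."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))


def length_bounds(query_len, tau):
    """Range of L2 lengths a string must fall in to possibly reach threshold tau."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    return tau * query_len, query_len / tau


def candidates(strings, query, idf, tau):
    """Ids of strings that survive the length filter (no similarity computed yet)."""
    low, high = length_bounds(l2_length(query, idf), tau)
    return [sid for sid, toks in strings.items()
            if low <= l2_length(toks, idf) <= high]


# Reusing the idf values of the length-normalization example
idf = {"t1": 10.0, "t2": 8.0, "t3": 2.0}
strings = {"s1": ["t1"], "s2": ["t1", "t2", "t3"]}
query = ["t1", "t2", "t3"]

print(candidates(strings, query, idf, tau=0.8))
print(candidates(strings, query, idf, tau=0.7))
```

With tau = 0.8 the one-token string is pruned as too short without computing its score, consistent with its similarity of about 0.77 to the query. With tau = 0.7 it stays a candidate.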

MAINTENANCE OPERATIONS
Propagating Updates
1. Insert
2. Delete
3. Modify: effectively a 'Delete' followed by an 'Insert'
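The three operations can be sketched on a toy inverted index. This is an illustrative structure only: the class `TokenIndex` and its methods are hypothetical names, and the slides do not describe the actual index layout.

```python
from collections import defaultdict


class TokenIndex:
    """Toy inverted index: token -> ids of the strings that contain it."""

    def __init__(self):
        self.strings = {}                 # string id -> set of tokens
        self.postings = defaultdict(set)  # token -> set of string ids

    def insert(self, sid, tokens):
        if sid in self.strings:
            raise KeyError(f"{sid!r} is already indexed")
        token_set = set(tokens)
        self.strings[sid] = token_set
        for t in token_set:
            self.postings[t].add(sid)

    def delete(self, sid):
        for t in self.strings.pop(sid):
            self.postings[t].discard(sid)
            if not self.postings[t]:      # drop lists that became empty
                del self.postings[t]

    def modify(self, sid, tokens):
        # A modification is handled as a delete followed by an insert.
        self.delete(sid)
        self.insert(sid, tokens)

    def df(self, token):
        """Document frequency: number of strings containing the token."""
        return len(self.postings.get(token, ()))


index = TokenIndex()
index.insert(1, ["main", "st"])
index.insert(2, ["main", "ave"])
index.modify(1, ["park", "st"])
print(index.df("main"), index.df("park"), len(index.strings))
```

Every operation changes document frequencies, and inserts and deletes also change the number of strings N, which is why updates have to be propagated to idf values and lengths.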

INSERT
Insert S7
- Generate new tokens
- Add new strings
- N changes -> idf changes -> L changes
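The chain "N changes -> idf changes -> L changes" can be made concrete with a small experiment. The slides do not preserve the idf formula, so this sketch assumes the common form idf(t) = log2(1 + N / df(t)). The sample strings and the helper names are made up for illustration.

```python
import math
from collections import Counter


def idf_table(strings):
    """idf(t) = log2(1 + N / df(t)), an assumed (commonly used) definition."""
    n = len(strings)
    df = Counter(t for toks in strings.values() for t in set(toks))
    return {t: math.log2(1 + n / d) for t, d in df.items()}


def l2_lengths(strings, idf):
    """L2 length of every indexed string under the given idf table."""
    return {sid: math.sqrt(sum(idf[t] ** 2 for t in set(toks)))
            for sid, toks in strings.items()}


strings = {
    "s1": ["main", "st"],
    "s2": ["main", "ave"],
    "s3": ["park", "ave"],
}
before = l2_lengths(strings, idf_table(strings))

# Insert one new string: N grows, df("st") grows, and "oak" is a new token.
strings["s4"] = ["oak", "st"]
after = l2_lengths(strings, idf_table(strings))

changed = [sid for sid in before
           if not math.isclose(before[sid], after[sid])]
print(changed)
```

Because N appears in every idf value, a single insertion changes the length of every existing string, including s3, which shares no token with the new one. Exact propagation therefore touches the whole index.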

RELAXED PROPAGATION
Relaxation of N
- What is Nb?
- Divergence between N & Nb

Relaxation of df
- Definition of dfp(ti)
- Range of dfp(ti)

Relaxed similarity S2~
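The definitions of Nb and dfp(ti) are not preserved in the slide text. As a rough illustration only, assume dfp(ti) is the document frequency that was last used to compute the stored idf, and that a change is propagated only once the true df drifts outside a multiplicative band [dfp / rho, dfp * rho]. The class `RelaxedStats`, its method, and the band itself are assumptions made for this sketch, not the slides' definitions.

```python
class RelaxedStats:
    """Tracks true df per token next to the df last propagated (dfp)."""

    def __init__(self, rho):
        if rho < 1.0:
            raise ValueError("rho must be >= 1")
        self.rho = rho
        self.df = {}    # current, exact document frequency
        self.dfp = {}   # document frequency at the last propagation

    def update_df(self, token, delta):
        """Apply a df change; return True if it must be propagated to the index."""
        new_df = self.df.get(token, 0) + delta
        if new_df < 0:
            raise ValueError("document frequency cannot be negative")
        self.df[token] = new_df

        base = self.dfp.get(token)
        within_band = (
            base is not None
            and base > 0
            and base / self.rho <= new_df <= base * self.rho
        )
        if within_band:
            return False            # stored idf is still close enough
        self.dfp[token] = new_df    # propagate and reset the baseline
        return True


stats = RelaxedStats(rho=1.1)
print(stats.update_df("main", 100))  # first sighting, must be propagated
print(stats.update_df("main", 5))    # 105 stays inside the band around 100
print(stats.update_df("main", 6))    # 111 leaves the band
```

Under this assumed rule most updates touch only the cheap `df` counter, and the expensive rewrite of idf values and lengths happens only when the drift exceeds the tolerance. The same idea would apply to N and its baseline Nb.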

LOSS IN PRECISION
Assume total possible divergence in idf

Relaxed Similarity,

For ρ=1.1 & query threshold,

Equation1 : ,

Equation2 : ,

UPDATE PROPAGATION ALGORITHM


EXPERIMENT (DBLP)
- Period = 30 days
- 2460433 author/id pairs
- 5712041 total words
- 269281 distinct words
- 33461 total updates
- 32121 insertions, 1340 deletions

EXPERIMENT (BUSINESS LISTING)


QUERY ACCURACY


THANK YOU.
