
Tunable Compression of Word-level Index for Versioned Corpora

Klaus Berberich, Srikanta Bedathur, Gerhard Weikum

Max-Planck Institute for Informatics

Saarbruecken, Germany

EIIR 2008, Glasgow

Introduction

• Most document collections are not static
– Intranet documents, mail folders, blogs, source code, and contents of the World Wide Web
– Contents are being archived, possibly time-stamped and/or versioned
  • Wikis
  • Document repositories (SVN, CVS, …)
  • Desktop
  • Web archives!

• Search over evolving collections
– Ability to query the collection “as of” a given time
  • Time-travel Search [BBNW’07]


Outline

• Time-travel Search
• Our Time-machine: FluxCapacitor/TTIX
– Phrase Queries in TTIX
• FUSION and Controlled FUSION
• Experimental Evaluation


Historical Information Needs

1. News articles discussing the cola-drinks cancer controversy during 2005-2006
2. Contemporary articles about “Harry Potter and the Philosopher’s Stone”
3. Angela Merkel’s interview during 2002


Time-Travel Search

Example: Angela Merkel Interview @ 2002
– Keyword query: Angela Merkel Interview
– Time-context for evaluation & ranking: 2002

Keyword search extended with a time-context for evaluation:

Q = q @ t_s

Evaluate q using the collection that existed at time t_s

Key Challenges

• Dealing with the MASSIVE size

• Adapting the scoring models (typically defined for static collections)

• Efficient query processing

Opportunities

• Redundancy in content

• Sufficiency of good approximations

• Append-only data growth




FluxCapacitor/TTIX

Adapt Inverted Index structure to include validity time-interval of each document-version

Documents D1, D2, D3 are observed to have changed at different times

[Figure: version history of documents D1, D2, D3 on a timeline from t0 to now (including a “deletion” of D3), and the resulting time-stamped inverted index. Each vocabulary term points to a list of postings holding a document id, a numeric payload, and a validity interval, e.g. D3 2.2 [t0,t3), D1 2.0 [t0,t2), D3 1.9 [t3,t7), D2 1.87 [t0,t1), D1 1.6 [t2,t4), …]

[Berberich, Bedathur, Neumann, Weikum : SIGIR 2007, VLDB 2007]

• Index compaction via approximate temporal coalescing
• A sublist materialization framework for trading off space against performance
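To make the time-stamped inverted index concrete, here is a minimal Python sketch of an "as of" lookup over one posting list. It is an illustration only, not the FluxCapacitor/TTIX implementation: the `Posting` class, the `postings_as_of` helper, the integer encoding of t0, t1, … as 0, 1, …, and the reading of the numeric payload as a score are all assumptions. The posting values mirror the example postings in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One posting of a time-stamped inverted list (hypothetical layout)."""
    doc_id: str
    score: float   # assumption: the numeric payload in the figure is a score
    t_begin: int   # start of validity interval, inclusive
    t_end: int     # end of validity interval, exclusive


def postings_as_of(posting_list, ts):
    """Keep only postings whose validity interval [t_begin, t_end) contains ts."""
    return [p for p in posting_list if p.t_begin <= ts < p.t_end]


# Example postings from the figure, with t0..t7 mapped to the integers 0..7.
term_list = [
    Posting("D3", 2.2, 0, 3),
    Posting("D1", 2.0, 0, 2),
    Posting("D3", 1.9, 3, 7),
    Posting("D2", 1.87, 0, 1),
    Posting("D1", 1.6, 2, 4),
]

hits = postings_as_of(term_list, ts=3)
print([(p.doc_id, p.score) for p in hits])  # [('D3', 1.9), ('D1', 1.6)]
```

A linear scan like this reads the whole list for every query, which is exactly the cost that compaction and sublist materialization aim to reduce.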


Phrase Queries

• Significantly improve effectiveness
• Essential for quickly locating
– entities, e.g., “Coca Cola”, “Where Eagles Dare”, …
– concepts, e.g., “Water filtering”
– …

• Indexing for phrase queries
– For each word, need to store positional information for every occurrence
– Index-size blowup
– Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]
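The size reduction from gap encoding followed by a space-efficient code can be sketched as follows. This is a minimal illustration that uses the Elias-gamma code as one possible choice of code; the function names are hypothetical, and positions are assumed to be 1-based and strictly increasing so that every gap is at least 1.

```python
def to_gaps(positions):
    """Replace each position by its distance to the previous one (first gap is the position itself)."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def elias_gamma(n):
    """Elias-gamma code of an integer n >= 1, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias-gamma is defined for integers >= 1")
    binary = bin(n)[2:]                        # binary digits without the '0b' prefix
    return "0" * (len(binary) - 1) + binary    # unary length prefix, then the binary digits


def encode_positions(positions):
    """Gap-encode the positions and concatenate the Elias-gamma codes of the gaps."""
    return "".join(elias_gamma(g) for g in to_gaps(positions))


print(to_gaps([3, 6, 7]))           # [3, 3, 1]
print(encode_positions([3, 6, 7]))  # 0110111
```

Small gaps get short codes, which is why positions that lie close together compress well.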


Phrase Queries in FluxCapacitor

• Baseline: for each document version $d^{t_b}$, a posting of the following structure

$[t_b, t_e) \mid \mathit{did} \mid \mathit{positions}$

– Validity time-interval $[t_b, t_e)$ (= 64 bits)
– Document identifier $\mathit{did}$ (= 64 bits)
– List of word-positions

• Word-positions compressed using standard techniques
– (Gap + Elias-/Golomb-) encodings
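As a rough illustration of what the baseline costs, the sketch below estimates the size of one baseline posting: 64 bits for the validity interval and 64 bits for the document identifier, as stated above, plus the gap-encoded positions. The `BaselinePosting` class, the use of Elias-gamma for the positions, and the toy position lists are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List

HEADER_BITS = 64 + 64  # validity time-interval + document identifier


def gamma_bits(n):
    """Length in bits of the Elias-gamma code of an integer n >= 1."""
    return 2 * (n.bit_length() - 1) + 1


@dataclass
class BaselinePosting:
    """One posting per document version (hypothetical in-memory form)."""
    t_begin: int
    t_end: int
    doc_id: int
    positions: List[int]  # 1-based, strictly increasing

    def size_bits(self):
        """Header plus Elias-gamma coded gaps between consecutive positions."""
        bits, prev = HEADER_BITS, 0
        for p in self.positions:
            bits += gamma_bits(p - prev)
            prev = p
        return bits


# Three versions of one document in which the word barely moves (toy data).
versions = [
    BaselinePosting(1, 2, 7, [2, 4]),
    BaselinePosting(2, 3, 7, [2, 4]),
    BaselinePosting(3, 10**9, 7, [2, 6]),  # 10**9 stands in for "now"
]
print(sum(v.size_bits() for v in versions))  # 404
```

In this toy case the fixed 128-bit header dominates, and it is repeated for every version even when the positions do not change.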

Can this be Improved?






Word-Positions across Versions

• High level of redundancy between versions
– Append-only changes leave most parts unchanged
– word b between $d^{t_1}$ and $d^{t_2}$

• Numerical closeness of positions
– Small shifts in positions
– word c between $d^{t_2}$ and $d^{t_3}$

b: $[t_1, t_2) \mid d \mid 2,4$ ; $[t_2, t_3) \mid d \mid 2,4$ ; $[t_3, now) \mid d \mid 2,6$

c: $[t_1, t_2) \mid d \mid 3$ ; $[t_2, t_3) \mid d \mid 3,6,7$ ; $[t_3, now) \mid d \mid 3,5,8,9$


FUSION

• Idea:
– Merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility

$[t_0, now) \mid \mathit{did} \mid \mathit{positions} \mid \mathit{timestamps} \mid \mathit{signatures}$

• Positions: all word-positions in any of the versions
• Timestamps: all intermediate version timestamps
• Signatures: for each version, a bit-signature of positions

b: $[t_1, now) \mid d \mid 2,4,6 \mid t_2,t_3 \mid 110,110,101$

c: $[t_1, now) \mid d \mid 3,5,6,7,8,9 \mid t_2,t_3 \mid 100000,101100,110011$
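The fused posting can be built mechanically from the per-version postings. The sketch below is one straightforward reading of the three components (positions, timestamps, signatures), checked against the example words b and c; the `fuse` function and its tuple layout are hypothetical, and a non-empty, time-ordered list of versions is assumed.

```python
def fuse(versions):
    """Fuse consecutive versions of one document for one word.

    versions: non-empty list of (t_begin, positions) pairs in time order.
    Returns (t_first, positions, timestamps, signatures) where
      positions  = sorted union of all word-positions,
      timestamps = begin times of all versions after the first,
      signatures = one bit string per version over the fused positions.
    """
    union = sorted(set().union(*(pos for _, pos in versions)))
    timestamps = [t for t, _ in versions[1:]]
    signatures = []
    for _, pos in versions:
        present = set(pos)
        signatures.append("".join("1" if p in present else "0" for p in union))
    return versions[0][0], union, timestamps, signatures


b = fuse([("t1", [2, 4]), ("t2", [2, 4]), ("t3", [2, 6])])
c = fuse([("t1", [3]), ("t2", [3, 6, 7]), ("t3", [3, 5, 8, 9])])
print(b)  # ('t1', [2, 4, 6], ['t2', 't3'], ['110', '110', '101'])
print(c)  # ('t1', [3, 5, 6, 7, 8, 9], ['t2', 't3'], ['100000', '101100', '110011'])
```

Each position is stored once no matter how many versions contain it; each version adds only one timestamp and one bit per fused position.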


Query Processing – win some, lose some

• Save on overall space
– Naïve organization + processing => reads the whole list, computes ranking
– FUSION maintains smaller list, so faster (naïve) query processing

• Who is naïve!?
– Skip pointers to jump ahead during query processing
– In the worst case, FUSION ends up reading and processing all the versions, instead of just one version!

Baseline: good performance, bad storage
FUSION: bad (worst-case) performance, good storage
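The worst-case cost of FUSION shows up when a single version has to be recovered from a fused posting. The following sketch assumes the fused layout (first timestamp, positions, intermediate timestamps, signatures) with numeric timestamps; `positions_at` and the read-overhead ratio are hypothetical and serve only as a proxy for the query-processing overhead discussed above.

```python
from bisect import bisect_right


def positions_at(fused, ts):
    """Recover the word-positions of the version valid at time ts.

    fused = (t_first, positions, timestamps, signatures).
    Returns None if ts lies before the first version.
    """
    t_first, positions, timestamps, signatures = fused
    if ts < t_first:
        return None
    # Number of intermediate timestamps <= ts gives the index of the version
    # whose half-open interval contains ts.
    version = bisect_right(timestamps, ts)
    signature = signatures[version]
    return [p for p, bit in zip(positions, signature) if bit == "1"]


# Fused posting of word c, with t1, t2, t3 mapped to 1, 2, 3.
fused_c = (1, [3, 5, 6, 7, 8, 9], [2, 3], ["100000", "101100", "110011"])

for ts in (1, 2, 3):
    wanted = positions_at(fused_c, ts)
    overhead = len(fused_c[1]) / len(wanted)  # positions read per position needed
    print(ts, wanted, overhead)
```

For the version at t1, six fused positions are read to recover a single one, whereas the baseline posting for that version would hold only that one position.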


Controlled FUSION

• Compute a set of fusions over contiguous versions s.t.
– It takes minimal storage for word positions
– For any version, the maximum worst-case query processing overhead is within η

• Can be set up as an optimization problem
• Optimal solution computable in $O(n^3)$ time and $O(n)$ space
– Assumption: storage cost is monotone:
  $[t_b, t_e) \subseteq [t_b', t_e') \Rightarrow cost([t_b, t_e)) \le cost([t_b', t_e'))$
– In practice, we found it close to $O(n^2)$
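One way to picture the optimization problem is a dynamic program over contiguous groups of versions. The sketch below is not the algorithm from the talk; it is a simplified stand-in under stated assumptions: the storage cost of a group is the number of fused positions, and the overhead of a version is the number of fused positions divided by the number of positions that version actually has. The name `controlled_fusion` is hypothetical.

```python
def controlled_fusion(versions, eta):
    """Partition versions into contiguous groups of minimal total cost.

    versions: list of position lists, one per version, in time order.
    eta:      bound (>= 1) on the assumed per-version overhead
              |fused positions| / |positions of that version|.
    Returns (total_cost, groups) with groups as half-open index ranges (i, j).
    """
    n = len(versions)
    inf = float("inf")
    best = [0] + [inf] * n   # best[j]: minimal cost of covering versions[0:j]
    cut = [0] * (n + 1)      # cut[j]: start of the last group in that cover
    for j in range(1, n + 1):
        union, smallest = set(), inf
        for i in range(j - 1, -1, -1):      # grow the group [i, j) to the left
            union |= set(versions[i])
            smallest = min(smallest, len(versions[i]))
            if len(union) > eta * smallest:
                break                       # larger groups only get worse
            if best[i] + len(union) < best[j]:
                best[j] = best[i] + len(union)
                cut[j] = i
    groups, j = [], n
    while j > 0:                            # walk the cuts back to the start
        groups.append((cut[j], j))
        j = cut[j]
    return best[n], groups[::-1]


word_b = [[2, 4], [2, 4], [2, 6]]
word_c = [[3], [3, 6, 7], [3, 5, 8, 9]]
print(controlled_fusion(word_b, 1.5))  # (3, [(0, 3)])
print(controlled_fusion(word_c, 1.5))  # (8, [(0, 1), (1, 2), (2, 3)])
print(controlled_fusion(word_c, 2.0))  # (7, [(0, 1), (1, 3)])
```

The early `break` relies on the same kind of monotonicity as the assumption above: extending a group can only enlarge the fused position set. With a tight η the stable word b is fused completely while the volatile word c stays unfused, and a looser η lets the later versions of c merge.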






Experimental Evaluation

• English Wikipedia
– Revision history (2004 – 2005)
– 10% sample (~35,000 docs, ~900,000 ver.)

• Baseline:
– Elias- code: 97.51 GBytes
– Elias- code: 97.77 GBytes

• FUSION:
– η between 1.1 – 10
– Elias- & Elias- for compressing word-positions in each fused posting


Experimental Results

• η = 1.5: 35% of the baseline
• η = 1.5: 44% of the baseline


Conclusions

• Time-travel Search
– Key to archive search & analysis
– An interesting and important problem!

• Our Time-machine: FluxCapacitor/TTIX
– Builds on inverted index framework
– Tunable index-size reduction

• FUSION
– Adds phrase-querying to FluxCapacitor/TTIX
– More than 50% space reduction over baseline
  • With 50% worst-case overhead in query processing

Thank You!

Questions?
