Efficient Index Updates over the Cloud

Panagiotis Antonopoulos, Microsoft Corp ([email protected])
Ioannis Konstantinou, National Technical University of Athens ([email protected])
Dimitrios Tsoumakos, Ionian University ([email protected])
Nectarios Koziris, National Technical University of Athens
Requirements in the Web
• Huge volume of data: > 1.8 zettabytes, growing by 80% each year
• Huge number of users: > 2 billion users searching and updating web content
• Explosion of User-Generated Content
  – Facebook: 90 updates/user/month, 30 billion/day
  – Wikipedia: 30 updates/article/month, 8K new articles/day
• Users demand fresh results
Our contribution
A distributed system that enables fast and frequent updates on web-scale Inverted Indexes.
• Incremental processing of updates
• Distributed processing - MapReduce
• Distributed index storage and serving – NoSQL
Goals
• Update time independent of existing index size
  – Fast and frequent updates on large indexes
• Index consistency after an update
  – System stability and performance unaffected by updates
• Scalability
  – Exploit large commodity clusters
Inverted Index
• Maps each term included in a collection of documents to the documents that contain the term: (term, list(doc_ref))
• Popular for fast content search; used by search engines
• Index Record: (term, doc_ref)
• Example:

Term         List of documents
distributed  Doc2, Doc3, Doc7, Doc10
update       Doc2, Doc5, Doc12
Hadoop       Doc1, Doc2, Doc8
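As an illustrative sketch (in Python, not the system's actual Hadoop code), the (term, list(doc_ref)) mapping can be built from a document collection in a few lines; the function name and data layout here are assumptions for the example:

```python
# Illustrative sketch: build an inverted index mapping
# term -> sorted list of document references.
def build_inverted_index(docs):
    """docs: dict mapping doc_ref -> list of terms."""
    index = {}
    for doc_ref, terms in docs.items():
        for term in set(terms):          # one record per (term, doc_ref)
            index.setdefault(term, set()).add(doc_ref)
    return {term: sorted(refs) for term, refs in index.items()}

docs = {
    "Doc1": ["Hadoop"],
    "Doc2": ["distributed", "update", "Hadoop"],
    "Doc5": ["update"],
}
index = build_inverted_index(docs)
# index["update"] == ["Doc2", "Doc5"]
```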
Related Work
• Google: distributed index creation
  – Google Caffeine: fast and continuous index updates
• Apache Solr, distributed search through index replication
• Katta, distributed index creation and serving
• CSLAB, distributed index creation and serving
• LucidWorks, distributed index creation and updates on top of Solr (not open-source)
Basic Update Procedure
• Input: collection of new/modified documents
• For each new document:
  – Simply add each term to the corresponding list
• For each modified document:
  – Delete all index records that refer to the old version
  – Add each term of the new version to the corresponding list
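The basic procedure above can be sketched as follows (an illustrative in-memory model with assumed names, not the paper's distributed implementation); a new document is simply the case where the old term set is empty:

```python
# Sketch of the basic update procedure on an in-memory index.
# index: term -> set of doc IDs; old_terms/new_terms: term sets per document.
def apply_update(index, doc_id, old_terms, new_terms):
    # Modified document: first delete every record of the old version...
    for term in old_terms:
        index.get(term, set()).discard(doc_id)
    # ...then add each term of the new version
    # (for a new document, old_terms is just the empty set).
    for term in new_terms:
        index.setdefault(term, set()).add(doc_id)

index = {"update": {"Doc2", "Doc5"}, "cloud": {"Doc2"}}
apply_update(index, "Doc2", {"update", "cloud"}, {"update", "index"})
# "cloud" no longer lists Doc2; "index" now does.
```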
Basic Update Procedure
For modified documents we need to:
• Obtain the indexed terms of the old version
• Locate and delete the corresponding index records
  – Complexity depends on the schema of the index
Update time critically depends on these operations!
How can we do it efficiently?
Proposed Schema
• HBase:
  – Stores and indexes millions of columns per row
  – Stores a varying number of columns for each row
• Proposed Schema:
  – One row for every indexed term
  – One column for each document contained in the corresponding term's list
  – Use the document ID as the column name
Proposed Schema
Each cell (row, column) corresponds to an index record (term, docID)
• Advantages
  – Fast record discovery and deletion, almost independent of the list size
• Disadvantages
  – Required storage space (overhead per column)
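A minimal in-memory model of this schema (purely illustrative; the real system stores it in HBase, and the helper names are assumptions): rows are keyed by term, columns are named by document ID, so deleting one record is a single-cell operation whose cost does not grow with the posting-list length:

```python
# Toy model of the proposed HBase schema:
# one row per term, one column per document, the doc ID as the column name.
table = {}   # row key (term) -> {column name (doc ID) -> cell value}

def put(term, doc_id):
    table.setdefault(term, {})[doc_id] = ""   # cell presence = index record

def delete(term, doc_id):
    # Point deletion of a single cell: independent of how many
    # other documents the term's row already contains.
    table.get(term, {}).pop(doc_id, None)

put("hadoop", "Doc1"); put("hadoop", "Doc2"); put("hadoop", "Doc8")
delete("hadoop", "Doc2")
# table["hadoop"] now holds columns Doc1 and Doc8 only.
```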
Forward Index
• Forward Index: list of the terms of each document
• Example:

Document ID  Words
Doc1         data, management, in, the, cloud
Doc2         inverted, index, updates

• Advantages:
  – Immediate access to the terms of the old version
  – Retrieving the Forward Index is faster (smaller size)
• Disadvantages:
  – Required storage space
  – Small overhead to the indexing process
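A Forward Index like the example above can be sketched as follows (an assumed whitespace tokenizer stands in for real text processing; this is not the system's actual indexer):

```python
# Sketch of the Forward Index: each document mapped to its term list,
# so the old version's terms can be fetched without re-parsing the document.
def build_forward_index(docs):
    """docs: dict doc_id -> raw text; returns doc_id -> sorted unique terms."""
    return {doc_id: sorted(set(text.lower().split()))
            for doc_id, text in docs.items()}

fwd = build_forward_index({
    "Doc1": "data management in the cloud",
    "Doc2": "inverted index updates",
})
# fwd["Doc2"] == ["index", "inverted", "updates"]
```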
Minimizing Index Changes
General Idea:
• Modifications in the documents' content are limited
• Update the index based only on the content modifications

Procedure:
• Compare the two versions of each document
• Delete the terms contained in the old version but not in the new
• Add the terms contained in the new version but not in the old
Minimizing Index Changes
No changes required for the common terms
Advantages:
• Minimize the changes required to the index
  – Minimize costly insertions and deletions in HBase
  – Minimize the volume of intermediate K/V pairs (distributed)

Disadvantages:
• Increased complexity of the indexing process
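The minimal-change computation reduces to two set differences per document; a sketch of the idea (illustrative Python, not the system's code):

```python
# Compute the minimal index changes for one modified document:
# terms only in the old version are deletions, terms only in the
# new version are additions; common terms need no index change.
def diff_terms(old_terms, new_terms):
    old_s, new_s = set(old_terms), set(new_terms)
    deletions = old_s - new_s   # present only in the old version
    additions = new_s - old_s   # present only in the new version
    return additions, deletions

adds, dels = diff_terms(["data", "cloud", "index"],
                        ["data", "cloud", "updates"])
# adds == {"updates"}, dels == {"index"}; "data"/"cloud" untouched.
```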
Distributed Index Updates
• Better but still centralized!
• Perfectly suited to the MapReduce logic:
  – Each document can be processed independently
  – The updates have to be merged before they are applied to the index
• Utilizing the MR model:
  – Easily distribute the processing
  – Exploit the resources of large commodity clusters
Distributed Index Updates
Mappers:
• Scan each modified document
• Retrieve the old Forward Index
• Compare the two versions
• Emit K/V pairs for additions: (term, docID)
• Emit K/V pairs for deletions: (term, docID)
• Emit K/V pairs for the Forward Index and the Content

Combiners:
• Merge the K/V pairs into a list of values per key (only for additions and deletions)
• Emit a K/V pair for additions: (term, list(docID))
• Emit a K/V pair for deletions: (term, list(docID))

Reducers:
• For additions: create an index record for each (term, docID) pair and write the records to HFiles
• For deletions: delete the corresponding cells using the HBase Client API
• Bulk load the output HFiles to HBase

Tables:
• Content Table: the raw documents
• Forward Index Table: the Forward Index
• Inverted Index Table: the Inverted Index, using the schema described in the previous slides
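The map/combine/reduce flow above can be imitated in a single process as a toy sketch (all names are assumptions; the real system emits these pairs through Hadoop, and the reducer writes HFiles or issues HBase deletes rather than returning dicts):

```python
from collections import defaultdict

# Mapper: diff the two versions of a document and emit keyed pairs.
def map_doc(doc_id, old_terms, new_terms):
    for t in new_terms - old_terms:
        yield ("ADD", t), doc_id     # addition record
    for t in old_terms - new_terms:
        yield ("DEL", t), doc_id     # deletion record

# Combiner + reducer: merge values per key, then split by operation.
def combine_and_reduce(pairs):
    grouped = defaultdict(list)      # (op, term) -> list(docID)
    for key, doc_id in pairs:
        grouped[key].append(doc_id)
    adds = {t: sorted(v) for (op, t), v in grouped.items() if op == "ADD"}
    dels = {t: sorted(v) for (op, t), v in grouped.items() if op == "DEL"}
    return adds, dels

pairs = list(map_doc("Doc2", {"old", "data"}, {"new", "data"}))
adds, dels = combine_and_reduce(pairs)
# adds == {"new": ["Doc2"]}, dels == {"old": ["Doc2"]}
```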
Even Load Distribution
Two different types of keys:
• Document ID:
  – One K/V pair for the Content and one for the Forward Index of each document
  – Divide the keys into equally sized partitions using a hash function
• Term:
  – Skewed (Zipfian) distribution in natural languages
  – The number of values per key-term varies significantly
Even Load Distribution
Solution: sample the input

Mappers:
• Process a sample using the same algorithm
• Emit a K/V pair (term, 1) for each addition or deletion

Reducers (1 for additions, 1 for deletions):
• Count the occurrences to determine the splitting points

Indexer:
• Loads the splitting points and chooses the reducer for each key
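One way to derive splitting points from the sampled counts is sketched below (an illustrative greedy scheme with assumed names; the paper does not prescribe this exact procedure): accumulate counts in key order and cut whenever a partition reaches its fair share of the load.

```python
from collections import Counter

# Pick term boundaries so each reducer gets roughly equal load,
# based on (term, 1) counts gathered from a sample of the input.
def splitting_points(sample_terms, num_reducers):
    counts = Counter(sample_terms)
    target = sum(counts.values()) / num_reducers   # ideal load per reducer
    points, acc = [], 0
    for term in sorted(counts):                    # walk terms in key order
        acc += counts[term]
        if acc >= target and len(points) < num_reducers - 1:
            points.append(term)                    # boundary for next partition
            acc = 0
    return points

# Zipf-like sample: one term is far more frequent than the rest.
sample = ["apple"] * 5 + ["banana", "cherry", "date", "fig"]
pts = splitting_points(sample, 2)
# pts == ["apple"]: terms <= "apple" (load 5) vs. the rest (load 4).
```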
Experimental Setup
Cluster:
• 2-12 worker nodes (default: 8)
• 8 cores @ 2 GHz, 8 GB RAM per node
• Hadoop v0.20.2-CDH3 (Cloudera)
• HBase v0.90.3-CDH3 (Cloudera)
• 6 mappers and 6 reducers per node

Datasets:
• Wikipedia snapshots of April 5, 2011 and May 26, 2011
• Default initial dataset: 64.2 GB, 23.7 million documents
• Default update dataset: 15.4 GB, 2.2 million documents
Experimental Results
Evaluating our design choices
• Comparison: depends on the number of indexed terms
• Forward Index: important in both cases
• Bulk Loading: depends on the number of indexed terms
• Sampling: not important; small number of intermediate K/V pairs
[Chart: Index Update Completion time (min) for Full-Text and Title-only indexes, comparing the No Comparison, No Forward Index, No Bulk, No Sampling, and Best configurations]
Experimental Results
Update time vs. Update dataset size
Update time is linear in the update dataset size

For a fixed initial dataset size: 64.2 GB (≈24 million documents)
[Chart: Index Update Completion time (min) vs. Update Size (GB), for Full-Text and Title-only indexes]
Experimental Results
A 4X larger initial dataset increases update time by less than 6%

Update time is roughly independent of the initial index size

For a fixed new/modified documents dataset: 5.1 GB (≈400 thousand docs)
Update time vs. Initial Dataset Size
[Chart: Index Update Completion time (min) vs. Initial Indexed Document Size (GB), for Full-Text and Title-only indexes]
Experimental Results
• 5X faster indexing from 2 to 12 nodes
• Bulk loading to HBase does NOT scale as expected
• 3.3X better performance in total

Update time vs. available resources (# of Mappers/Reducers)

For fixed initial/update dataset sizes: 64.2 GB / 15.4 GB
[Charts: Index Update Completion time (min) vs. total # of Mappers/Reducers used, one panel for Full-Text and one for Title-only; each panel shows Total Time and Indexing time]
Conclusion
Incremental Processing:
• Process updates, minimizing the required changes
• Update time:
  – Almost independent of the initial index size
  – Linear in the update dataset size

Distributed Processing:
• Reduced update time
• Scalability
Conclusion
Fast and frequent updates on web-scale indexes
• Wikipedia: >6X faster than an index rebuild

Disadvantages:
• Slower index creation (done only once)
• Increased storage requirements (low cost)
The End
Thank you!
Questions…