Efficient Index Updates over the Cloud

Panagiotis Antonopoulos, Microsoft Corp ([email protected])
Ioannis Konstantinou, National Technical University of Athens ([email protected])
Dimitrios Tsoumakos, Ionian University ([email protected])
Nectarios Koziris, National Technical University of Athens
Requirements in the Web
• Huge volume of data: > 1.8 zettabytes, growing by 80% each year
• Huge number of users: > 2 billion users searching and updating web content
• Explosion of User-Generated Content
  – Facebook: 90 updates/user/month, 30 billion/day
  – Wikipedia: 30 updates/article/month, 8K new articles/day
• Users demand fresh results
Our contribution
A distributed system that enables fast and frequent updates on web-scale Inverted Indexes.
• Incremental processing of updates
• Distributed processing - MapReduce
• Distributed index storage and serving – NoSQL
Goals
• Update time independent of existing index size
  – Fast and frequent updates on large indexes
• Index consistency after an update
  – System stability and performance unaffected by updates
• Scalability
  – Exploit large commodity clusters
Inverted Index
• Maps each term included in a collection of documents to the documents that contain the term: (term, list(doc_ref))
• Popular for fast content search; used by search engines
• Index Record: (term, doc_ref)
• Example:

Term         List of documents
distributed  Doc2, Doc3, Doc7, Doc10
update       Doc2, Doc5, Doc12
Hadoop       Doc1, Doc2, Doc8
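As an illustrative sketch (in Python, not the system's actual Hadoop code), the (term, list(doc_ref)) mapping can be built from a document collection in a few lines; the function name and data layout here are assumptions for the example:

```python
# Illustrative sketch: build an inverted index mapping
# term -> sorted list of document references.
def build_inverted_index(docs):
    """docs: dict mapping doc_ref -> list of terms."""
    index = {}
    for doc_ref, terms in docs.items():
        for term in set(terms):          # one record per (term, doc_ref)
            index.setdefault(term, set()).add(doc_ref)
    return {term: sorted(refs) for term, refs in index.items()}

docs = {
    "Doc1": ["Hadoop"],
    "Doc2": ["distributed", "update", "Hadoop"],
    "Doc5": ["update"],
}
index = build_inverted_index(docs)
# index["update"] == ["Doc2", "Doc5"]
```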
Related Work
• Google: distributed index creation
  – Google Caffeine: fast and continuous index updates
• Apache Solr, distributed search through index replication
• Katta, distributed index creation and serving
• CSLAB, distributed index creation and serving
• LucidWorks, distributed index creation and updates on top of Solr (not open-source)
Basic Update Procedure
• Input: collection of new/modified documents
• For each new document:
  – Simply add each term to the corresponding list
• For each modified document:
  – Delete all index records that refer to the old version
  – Add each term of the new version to the corresponding list
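The basic procedure above can be sketched as follows (an illustrative in-memory model with assumed names, not the paper's distributed implementation); a new document is simply the case where the old term set is empty:

```python
# Sketch of the basic update procedure on an in-memory index.
# index: term -> set of doc IDs; old_terms/new_terms: term sets per document.
def apply_update(index, doc_id, old_terms, new_terms):
    # Modified document: first delete every record of the old version...
    for term in old_terms:
        index.get(term, set()).discard(doc_id)
    # ...then add each term of the new version
    # (for a new document, old_terms is just the empty set).
    for term in new_terms:
        index.setdefault(term, set()).add(doc_id)

index = {"update": {"Doc2", "Doc5"}, "cloud": {"Doc2"}}
apply_update(index, "Doc2", {"update", "cloud"}, {"update", "index"})
# "cloud" no longer lists Doc2; "index" now does.
```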
Basic Update Procedure
For modified documents we need to:
• Obtain the indexed terms of the old version
• Locate and delete the corresponding index records
  – Complexity depends on the schema of the index
Update time critically depends on these operations!
How can we do it efficiently?
Proposed Schema
• HBase:
  – Stores and indexes millions of columns per row
  – Stores a varying number of columns for each row
• Proposed Schema:
  – One row for every indexed term
  – One column for each document contained in the corresponding term's list
  – Use the document ID as the column name
Proposed Schema
Each cell (row, column) corresponds to an index record (term, docID)
• Advantages
  – Fast record discovery and deletion, almost independent of the list size
• Disadvantages
  – Required storage space (overhead per column)
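A minimal in-memory model of this schema (purely illustrative; the real system stores it in HBase, and the helper names are assumptions): rows are keyed by term, columns are named by document ID, so deleting one record is a single-cell operation whose cost does not grow with the posting-list length:

```python
# Toy model of the proposed HBase schema:
# one row per term, one column per document, the doc ID as the column name.
table = {}   # row key (term) -> {column name (doc ID) -> cell value}

def put(term, doc_id):
    table.setdefault(term, {})[doc_id] = ""   # cell presence = index record

def delete(term, doc_id):
    # Point deletion of a single cell: independent of how many
    # other documents the term's row already contains.
    table.get(term, {}).pop(doc_id, None)

put("hadoop", "Doc1"); put("hadoop", "Doc2"); put("hadoop", "Doc8")
delete("hadoop", "Doc2")
# table["hadoop"] now holds columns Doc1 and Doc8 only.
```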
Forward Index
• Forward Index: list of the terms of each document
• Example:

Document ID  Words
Doc1         data, management, in, the, cloud
Doc2         inverted, index, updates

• Advantages:
  – Immediate access to the terms of the old version
  – Retrieving the Forward Index is faster (smaller size)
• Disadvantages:
  – Required storage space
  – Small overhead to the indexing process
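A Forward Index like the example above can be sketched as follows (an assumed whitespace tokenizer stands in for real text processing; this is not the system's actual indexer):

```python
# Sketch of the Forward Index: each document mapped to its term list,
# so the old version's terms can be fetched without re-parsing the document.
def build_forward_index(docs):
    """docs: dict doc_id -> raw text; returns doc_id -> sorted unique terms."""
    return {doc_id: sorted(set(text.lower().split()))
            for doc_id, text in docs.items()}

fwd = build_forward_index({
    "Doc1": "data management in the cloud",
    "Doc2": "inverted index updates",
})
# fwd["Doc2"] == ["index", "inverted", "updates"]
```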
Minimizing Index Changes
General Idea:
• Modifications in the documents' content are limited
• Update the index based only on the content modifications

Procedure:
• Compare the two versions of each document
• Delete the terms contained in the old version but not in the new
• Add the terms contained in the new version but not in the old
Minimizing Index Changes
No changes required for the common terms
Advantages:
• Minimize the changes required to the index
  – Minimize costly insertions and deletions in HBase
  – Minimize the volume of intermediate K/V pairs (distributed)

Disadvantages:
• Increased complexity of the indexing process
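The minimal-change computation reduces to two set differences per document; a sketch of the idea (illustrative Python, not the system's code):

```python
# Compute the minimal index changes for one modified document:
# terms only in the old version are deletions, terms only in the
# new version are additions; common terms need no index change.
def diff_terms(old_terms, new_terms):
    old_s, new_s = set(old_terms), set(new_terms)
    deletions = old_s - new_s   # present only in the old version
    additions = new_s - old_s   # present only in the new version
    return additions, deletions

adds, dels = diff_terms(["data", "cloud", "index"],
                        ["data", "cloud", "updates"])
# adds == {"updates"}, dels == {"index"}; "data"/"cloud" untouched.
```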
Distributed Index Updates
• Better but still centralized!
• Perfectly suited to the MapReduce logic:
  – Each document can be processed independently
  – The updates have to be merged before they are applied to the index
• Utilizing the MR model:
  – Easily distribute the processing
  – Exploit the resources of large commodity clusters
Distributed Index Updates
Mappers:
• Scan each modified document
• Retrieve the old Forward Index
• Compare the two versions
• Emit K/V pairs for additions: (term, docID)
• Emit K/V pairs for deletions: (term, docID)
• Emit K/V pairs for the Forward Index and the Content

Combiners:
• Merge the K/V pairs into a list of values per key (only for additions and deletions)
• Emit a K/V pair for additions: (term, list(docID))
• Emit a K/V pair for deletions: (term, list(docID))

Reducers:
• For additions: create an index record for each (term, docID) pair and write the records to HFiles
• For deletions: delete the corresponding cells using the HBase Client API
• Bulk load the output HFiles to HBase

Tables:
• Content Table: the raw documents
• Forward Index Table: the Forward Index
• Inverted Index Table: the Inverted Index, using the schema described in the previous slides
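The map/combine/reduce flow above can be imitated in a single process as a toy sketch (all names are assumptions; the real system emits these pairs through Hadoop, and the reducer writes HFiles or issues HBase deletes rather than returning dicts):

```python
from collections import defaultdict

# Mapper: diff the two versions of a document and emit keyed pairs.
def map_doc(doc_id, old_terms, new_terms):
    for t in new_terms - old_terms:
        yield ("ADD", t), doc_id     # addition record
    for t in old_terms - new_terms:
        yield ("DEL", t), doc_id     # deletion record

# Combiner + reducer: merge values per key, then split by operation.
def combine_and_reduce(pairs):
    grouped = defaultdict(list)      # (op, term) -> list(docID)
    for key, doc_id in pairs:
        grouped[key].append(doc_id)
    adds = {t: sorted(v) for (op, t), v in grouped.items() if op == "ADD"}
    dels = {t: sorted(v) for (op, t), v in grouped.items() if op == "DEL"}
    return adds, dels

pairs = list(map_doc("Doc2", {"old", "data"}, {"new", "data"}))
adds, dels = combine_and_reduce(pairs)
# adds == {"new": ["Doc2"]}, dels == {"old": ["Doc2"]}
```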
Even Load Distribution
Two different types of keys:
• Document ID:
  – One K/V pair for the Content and one for the Forward Index of each document
  – Divide the keys into equally sized partitions using a hash function
• Term:
  – Skewed (Zipfian) distribution in natural languages
  – The number of values per key-term varies significantly
Even Load Distribution
Solution: sample the input

Mappers:
• Process a sample using the same algorithm
• Emit a K/V pair (term, 1) for each addition or deletion

Reducers (1 for additions, 1 for deletions):
• Count the occurrences to determine the splitting points

Indexer:
• Loads the splitting points and chooses the reducer for each key
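One way to derive splitting points from the sampled counts is sketched below (an illustrative greedy scheme with assumed names; the paper does not prescribe this exact procedure): accumulate counts in key order and cut whenever a partition reaches its fair share of the load.

```python
from collections import Counter

# Pick term boundaries so each reducer gets roughly equal load,
# based on (term, 1) counts gathered from a sample of the input.
def splitting_points(sample_terms, num_reducers):
    counts = Counter(sample_terms)
    target = sum(counts.values()) / num_reducers   # ideal load per reducer
    points, acc = [], 0
    for term in sorted(counts):                    # walk terms in key order
        acc += counts[term]
        if acc >= target and len(points) < num_reducers - 1:
            points.append(term)                    # boundary for next partition
            acc = 0
    return points

# Zipf-like sample: one term is far more frequent than the rest.
sample = ["apple"] * 5 + ["banana", "cherry", "date", "fig"]
pts = splitting_points(sample, 2)
# pts == ["apple"]: terms <= "apple" (load 5) vs. the rest (load 4).
```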
Experimental Setup
Cluster:
• 2-12 worker nodes (default: 8)
• 8 cores @ 2 GHz, 8 GB RAM per node
• Hadoop v0.20.2-CDH3 (Cloudera)
• HBase v0.90.3-CDH3 (Cloudera)
• 6 mappers and 6 reducers per node

Datasets:
• Wikipedia snapshots of April 5, 2011 and May 26, 2011
• Default initial dataset: 64.2 GB, 23.7 million documents
• Default update dataset: 15.4 GB, 2.2 million documents
Experimental Results
Evaluating our design choices
• Comparison: depends on the number of indexed terms
• Forward Index: important in both cases
• Bulk Loading: depends on the number of indexed terms
• Sampling: not important; small number of intermediate K/V pairs
[Chart: Index Update Completion time (min) for Full-Text and Title-only indexes, comparing the No Comparison, No Forward Index, No Bulk, No Sampling, and Best configurations]
Experimental Results
Update time vs. Update dataset size
Update time is linear in the update dataset size

For a fixed initial dataset size: 64.2 GB (≈24 million documents)
[Chart: Index Update Completion time (min) vs. Update Size (GB), for Full-Text and Title-only indexes]
Experimental Results
A 4X larger initial dataset increases update time by less than 6%

Update time is roughly independent of the initial index size

For a fixed new/modified documents dataset: 5.1 GB (≈400 thousand docs)
Update time vs. Initial Dataset Size
[Chart: Index Update Completion time (min) vs. Initial Indexed Document Size (GB), for Full-Text and Title-only indexes]
Experimental Results
• 5X faster indexing from 2 to 12 nodes
• Bulk loading to HBase does NOT scale as expected
• 3.3X better performance in total

Update time vs. available resources (# of Mappers/Reducers)

For fixed initial/update dataset sizes: 64.2 GB / 15.4 GB
[Charts: Index Update Completion time (min) vs. total # of Mappers/Reducers used, one panel for Full-Text and one for Title-only; each panel shows Total Time and Indexing time]
Conclusion
Incremental Processing:
• Process updates, minimizing the required changes
• Update time:
  – Almost independent of the initial index size
  – Linear in the update dataset size

Distributed Processing:
• Reduced update time
• Scalability
Conclusion
Fast and frequent updates on web-scale indexes
• Wikipedia: >6X faster than an index rebuild

Disadvantages:
• Slower index creation (done only once)
• Increased storage requirements (low cost)
The End
Thank you!
Questions…