Efficient and Scalable Archive Search, by Avishek Anand

Post on 23-Mar-2016

IS : Idealized Sharding

CA : Cost Aware Sharding

[Figure: Doc 1 to Doc 7 laid out along a time axis, then regrouped as (Doc 1, Doc 2, Doc 7) and (Doc 3, Doc 4, Doc 5, Doc 6).]

Searching Archives

Web archives span a long time.

Challenge: Support search with temporal constraints.

Scaling Archive Search

Web archives continuously grow over time.

Challenge: Scale search to growing archives.

Index Sharding

[1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. Index Sharding for Space-Time Efficiency in Archive Search. In SIGIR, 2011.

[2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. Index Maintenance for Time-Travel Text Search. In SIGIR, 2012.

[3] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum. A Time Machine for Text Search. In SIGIR, July 2007.

[Figure: the index-list for Doc 1 to Doc 7 on a time axis from 2007 to 2013, partitioned into Shard 1 and Shard 2.]

Goal: design index structures that process time-travel queries efficiently and are easy to maintain.

obama @ [6/2009 – 6/2011]
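The query above asks for versions matching "obama" that were valid at some point between June 2009 and June 2011. As a minimal sketch of that temporal constraint, the snippet below tests interval overlap at month granularity. The `Version` type, the closed-interval convention, and the sample versions are assumptions made for illustration and do not come from the poster.

```python
from typing import NamedTuple, Tuple

Month = Tuple[int, int]  # (year, month); tuples compare chronologically


class Version(NamedTuple):
    doc: str
    begin: Month  # first month in which this version was valid
    end: Month    # last month in which this version was valid


def overlaps(v: Version, q_begin: Month, q_end: Month) -> bool:
    """True if the version's validity interval intersects the query interval."""
    return v.begin <= q_end and v.end >= q_begin


# Hypothetical versions that all contain the query term.
versions = [
    Version("Doc A", (2007, 1), (2009, 5)),   # ends before the query interval
    Version("Doc B", (2009, 3), (2010, 2)),   # overlaps
    Version("Doc C", (2011, 6), (2012, 1)),   # touches the last query month
    Version("Doc D", (2011, 7), (2013, 1)),   # starts after the query interval
]

# obama @ [6/2009 – 6/2011]
matches = [v.doc for v in versions if overlaps(v, (2009, 6), (2011, 6))]
print(matches)
```

A plain scan like this reads every posting of the term, including those that fall outside the query interval, which is the cost that sharding aims to avoid.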

Idealized Sharding: Eliminates access to postings with no intersection with query-time interval.
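One way to get this behaviour, sketched here under assumptions the poster does not spell out, is to keep each shard ordered so that both begin times and end times are non-decreasing. A query can then jump to the first posting whose end time reaches the query interval and read sequentially until a begin time passes it, so every posting read overlaps the query. The `Posting` type, the first-fit placement in `build_shards`, and the integer time points are illustrative choices, not necessarily the method of [1].

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Posting(NamedTuple):
    doc: str
    begin: int  # first time point at which the version is valid
    end: int    # last time point at which the version is valid


def build_shards(postings: List[Posting]) -> List[List[Posting]]:
    """Partition postings so begin and end are both non-decreasing per shard."""
    shards: List[List[Posting]] = []
    for p in sorted(postings, key=lambda p: (p.begin, p.end)):
        for shard in shards:
            if shard[-1].end <= p.end:  # keeps end times non-decreasing
                shard.append(p)
                break
        else:
            shards.append([p])  # no existing shard fits: open a new one
    return shards


def query_shard(shard: List[Posting], tb: int, te: int) -> List[Posting]:
    """Return postings overlapping [tb, te] without reading any other posting."""
    # A real index would keep this lookup structure precomputed.
    ends = [p.end for p in shard]
    start = bisect_left(ends, tb)  # first posting with end >= tb
    hits = []
    for p in shard[start:]:
        if p.begin > te:  # later postings begin even later: stop reading
            break
        hits.append(p)
    return hits


postings = [
    Posting("A", 1, 10), Posting("B", 2, 3), Posting("C", 4, 6),
    Posting("D", 5, 12), Posting("E", 7, 8),
]
shards = build_shards(postings)
hits = [p.doc for shard in shards for p in query_shard(shard, 4, 5)]
print(len(shards), sorted(hits))
```

Each posting lands in exactly one shard, so the shards together are no larger than the original index-list.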

Cost Aware Shard Merging: Merge idealized shards by reconciling random and sequential access costs.
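The trade-off behind merging can be illustrated with a toy cost model, which is an assumption made here and not the model used in [1]: opening a shard costs one random access, and every posting read costs one sequential access. Merging two shards saves a random access but may force the query to read postings that do not overlap the query interval. The constants `c_random` and `c_seq` and the sample shards are hypothetical.

```python
from typing import List, Tuple

Posting = Tuple[int, int]  # (begin, end), closed interval


def postings_read(shard: List[Posting], tb: int, te: int) -> int:
    """Postings scanned in a begin-ordered shard for the query [tb, te].

    The scan starts at the first posting whose end reaches tb and stops
    before the first posting whose begin exceeds te.
    """
    first = next((i for i, (_, end) in enumerate(shard) if end >= tb), len(shard))
    stop = sum(1 for begin, _ in shard if begin <= te)
    return max(0, stop - first)


def query_cost(shards: List[List[Posting]], tb: int, te: int,
               c_random: float, c_seq: float) -> float:
    """One random access per shard plus one sequential access per posting read."""
    return sum(c_random + c_seq * postings_read(s, tb, te) for s in shards)


shard_1 = [(1, 10), (5, 12)]
shard_2 = [(2, 3), (4, 6), (7, 8)]
merged = sorted(shard_1 + shard_2)  # one shard, ordered by begin time

# Expensive random access: merging pays off despite one wasted read.
print(query_cost([shard_1, shard_2], 4, 5, 10, 1), query_cost([merged], 4, 5, 10, 1))
# Cheap random access: keeping the shards separate is better.
print(query_cost([shard_1, shard_2], 4, 5, 0.5, 1), query_cost([merged], 4, 5, 0.5, 1))
```

In the merged shard the posting (2, 3) is read although it ends before the query interval starts, which is the wasted sequential access that the merge trades against the saved random access.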

Index Sharding:

• Partitions each index-list disjointly.

• No index blow-up.

Index Maintenance

References

Experiments

[Figure: system architecture. Components: Crawls, Active Index, Archive Index, In-memory Archive Index, External-memory Archive Index, Shard buffers, Archive Index Shards. Example versions on a timeline ending at "now": Doc 1: version 1, Doc 2: version 9, Doc 3: version 2, Doc 4: version 2, Doc 4: version 3. Annotations: "In the live index", "Sent to Archive Indexing System", "Inserted incoming version", "Appended popped posting".]

System Architecture : Separate indexes for active and retired versions.
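The split between active and retired versions can be sketched as follows. This is a simplified illustration at document level, assuming that a newly crawled version replaces the live one and that the replaced version is closed with the crawl time as its end time. The class `TwoTierIndex` and its fields are hypothetical names, not part of the described system.

```python
from typing import Dict, List, NamedTuple, Tuple


class RetiredVersion(NamedTuple):
    doc: str
    version: int
    begin: int  # time at which the version was crawled
    end: int    # time at which a newer version replaced it


class TwoTierIndex:
    """Keeps live versions in an active index and retired ones in an archive."""

    def __init__(self) -> None:
        self.active: Dict[str, Tuple[int, int]] = {}  # doc -> (version, begin)
        self.archive: List[RetiredVersion] = []       # append-only

    def ingest(self, doc: str, version: int, now: int) -> None:
        old = self.active.get(doc)
        if old is not None:
            # The replaced version is finalized now and sent to the archive.
            old_version, old_begin = old
            self.archive.append(RetiredVersion(doc, old_version, old_begin, now))
        # The incoming version becomes the live one.
        self.active[doc] = (version, now)


index = TwoTierIndex()
index.ingest("Doc 4", 2, now=10)
index.ingest("Doc 3", 2, now=12)
index.ingest("Doc 4", 3, now=15)  # retires Doc 4: version 2
print(index.active, index.archive)
```

Because a version is retired only when its successor is crawled, retired versions reach the archive in non-decreasing end-time order as long as crawl times advance.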

Incremental Sharding:

• Online algorithm with approximation guarantee.

• Append-only operation on shards.

• Retains query performance.

End-time arrival order: Versions finalized in their end-time-order.
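End-time arrival order is what makes append-only maintenance plausible: if postings arrive with non-decreasing end times, a shard stays ordered on both begin and end time as long as each appended posting begins no earlier than the shard's last one. The sketch below is a simplified online placement rule under that assumption. It does not reproduce the algorithm or the approximation guarantee of [2], and `IncrementalSharder` is a hypothetical name.

```python
from typing import List, NamedTuple, Optional


class Posting(NamedTuple):
    doc: str
    begin: int
    end: int


class IncrementalSharder:
    """Places postings that arrive in end-time order, using appends only."""

    def __init__(self) -> None:
        self.shards: List[List[Posting]] = []
        self.last_end: Optional[int] = None

    def append(self, p: Posting) -> None:
        if self.last_end is not None and p.end < self.last_end:
            raise ValueError("postings must arrive in end-time order")
        self.last_end = p.end
        # Pick the shard whose last begin time is closest below p.begin.
        best: Optional[List[Posting]] = None
        for shard in self.shards:
            if shard[-1].begin <= p.begin and (
                best is None or shard[-1].begin > best[-1].begin
            ):
                best = shard
        if best is None:  # no shard can take p without breaking begin order
            best = []
            self.shards.append(best)
        best.append(p)  # existing postings are never moved or rewritten


sharder = IncrementalSharder()
for posting in [
    Posting("B", 2, 3), Posting("C", 4, 6), Posting("E", 7, 8),
    Posting("A", 1, 10), Posting("D", 5, 12),
]:
    sharder.append(posting)
print([[p.doc for p in shard] for shard in sharder.shards])
```

No shard is ever recomputed: a new posting either extends an existing shard or starts a new one.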

query time-interval

SB : Vertical Partitioning with trade-off between performance and index size [3]

Approach (Index Sharding): Avoid accessing postings that do not overlap with the query time-interval.

Approach (Index Maintenance): Avoid re-computation of the index by creating shards incrementally.

[Figures: four experiment plots. Panel titles: Wallclock-times comparison with SB; Index-size comparison; Index maintenance efficiency; Performance of incremental sharding.]

INC : Incremental Sharding