andrzej bialecki lr-2013-dublin

Upload: lucenerevolution

Post on 11-May-2015

DESCRIPTION

Presented by Andrzej Bialecki, LucidWorks. This session presents a set of Solr components for easy management of "sidecar indexes" - indexes that extend the main index with additional stored and/or indexed fields. Conceptually this can be viewed as an extension of the ExternalFileField or as a static join between documents from two collections. This functionality is useful in applications that require very different update regimes for the two parts of the index (e.g. main catalogue items combined with clickthroughs).

TRANSCRIPT

Page 1: Andrzej bialecki lr-2013-dublin
Page 2: Andrzej bialecki lr-2013-dublin

SOLR SIDE-CAR INDEX

Andrzej Bialecki, LucidWorks

Page 3: Andrzej bialecki lr-2013-dublin

About the speaker

• Started using Lucene in 2003 (1.2-dev…)
• Created Luke – the Lucene Index Toolbox
• Apache Nutch, Hadoop, Solr committer, Lucene PMC member
• LucidWorks engineer

Page 4: Andrzej bialecki lr-2013-dublin

Agenda

• Challenge: incremental document updates
• Existing solutions and workarounds
• Sidecar index strategy and components
• Scalability and performance
• Q&A

Page 5: Andrzej bialecki lr-2013-dublin

Challenge: incremental document updates

• Incremental update (field-level update): modification of a part of a document
• Sounds like fundamentally useful functionality!
• But Lucene / Solr doesn’t offer true field-level updates (yet!)
  – “Update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document”
  – “Atomic update” functionality in Solr is (useful) syntactic sugar
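The "update is really retrieve, modify, add, delete" sequence can be sketched with a toy append-only index. This is only an illustration under stated assumptions: ToyIndex and its methods are hypothetical names made up for this sketch and are not part of the Lucene or Solr API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only "index": a hypothetical stand-in, not the Lucene API. */
class ToyIndex {
    // Each slot is one internal docId; null marks a deleted document.
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Appends a document and returns its new internal docId. */
    int add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
        return docs.size() - 1;
    }

    Map<String, String> get(int docId) {
        return docs.get(docId);
    }

    boolean isDeleted(int docId) {
        return docs.get(docId) == null;
    }

    /** "Update" = retrieve old doc, change a field, add the copy, delete the old doc. */
    int update(int docId, String field, String value) {
        Map<String, String> copy = new HashMap<>(docs.get(docId)); // retrieve old document
        copy.put(field, value);                                     // update fields
        int newId = add(copy);                                      // add updated document
        docs.set(docId, null);                                      // delete old document
        return newId;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new HashMap<>();
        doc.put("id", "A");
        doc.put("popularity", "1");
        int oldId = index.add(doc);
        int newId = index.update(oldId, "popularity", "2");
        System.out.println("old docId " + oldId + " deleted: " + index.isDeleted(oldId));
        System.out.println("new docId " + newId + ": " + index.get(newId));
    }
}
```

Note that even a one-field change copies every field of the document and moves it to a new internal docId, which is the cost the rest of the talk tries to avoid.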

Page 6: Andrzej bialecki lr-2013-dublin

Common use cases for field updates

• Documents composed logically of two parts with different update schedules
  – E.g. mostly static documents with some quickly changing fields
• Two different classes of data in changing fields
  – Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns
  – Text fields: e.g. reviews, tags, click-through feedback, user profiles
• Challenge: how to integrate these modifications with the main index content?
  – Re-indexing whole documents isn’t always an option

Page 7: Andrzej bialecki lr-2013-dublin

True full-text (inverted fields) incremental updates

• Very complex issue, broad impact on many Lucene internals
  – Inverted index structure is not optimized for partial document updates
  – At least another 6-12 months away?
• LUCENE-4258 – work in progress

Page 8: Andrzej bialecki lr-2013-dublin

Handling updates via full re-index

• If the corpus is small, or incremental updates infrequent… just re-index everything!
• Pros:
  – Relatively easy to implement – update source documents and re-index
  – Allows adding all types of data, including e.g. labels as searchable text
• Cons:
  – Infeasible for larger corpora or frequent updates, time-wise and cost-wise
  – Requires keeping around the source documents
    • Sometimes inconvenient, when documents are assembled in a complex pipeline

Page 9: Andrzej bialecki lr-2013-dublin

Handling updates via Solr’s ExternalFileField

• Pros:
  – Simple to implement
  – Updates are easy – just file edits, no need to re-index
• Cons:
  – Only docId => field : number
  – Not suitable for full-text searchable field updates
    • E.g. can’t support user-generated labels attached to a doc
  – Still useful if a simple “popularity”-type metric is sufficient
• Internally implemented as an in-memory ValueSource usable by function queries

Example file contents:
  doc0=1.5
  doc1=2.5
  doc2=0.5
  …
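The key=value file format above maps naturally onto an in-memory lookup table. The following is a minimal sketch of that idea only; ExternalValues, parse and valueFor are hypothetical names for this illustration, and the real ExternalFileField implementation in Solr is not reproduced here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that loads "key=number" lines into memory. */
class ExternalValues {

    /** Parses lines such as "doc0=1.5"; lines without a key and a value are skipped. */
    static Map<String, Float> parse(List<String> lines) {
        Map<String, Float> values = new HashMap<>();
        for (String line : lines) {
            int eq = line.lastIndexOf('=');
            if (eq <= 0 || eq == line.length() - 1) {
                continue; // no key or no value on this line
            }
            String key = line.substring(0, eq).trim();
            float value = Float.parseFloat(line.substring(eq + 1).trim());
            values.put(key, value);
        }
        return values;
    }

    /** Returns the external value for a document key, or a default when absent. */
    static float valueFor(Map<String, Float> values, String key, float defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, Float> values = parse(Arrays.asList("doc0=1.5", "doc1=2.5", "doc2=0.5"));
        System.out.println(valueFor(values, "doc1", 0.0f));
        System.out.println(valueFor(values, "doc9", 0.0f));
    }
}
```

The sketch makes the limitation on the slide visible: each document key maps to exactly one number, so there is nothing here that could be tokenized and searched as text.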

Page 10: Andrzej bialecki lr-2013-dublin

Numeric DocValues updates

• Since Lucene/Solr 4.6 … to be released Really Soon
• Details can be found in LUCENE-5189
• As simple as:

  indexWriter.updateNumericDocValue(term, field, value)

• Neatly solves the problem of numeric updates: popularity, in-stock, etc.
• Some limitations:
  – Massive updates still somewhat costly until the next merge (like deletes)
  – Can only update existing fields
• Obviously doesn’t address the full-text inverted field updates

Page 11: Andrzej bialecki lr-2013-dublin

Lucene ParallelReader overview

• Pretends that two or more IndexReader-s are slices of the same index
  – Slices contain data for different fields
  – Both stored and inverted parts are supported
  – Data for matching docs is joined on the fly
• Structure of all indexes MUST match 1:1 !!!
  – The same number of segments
  – The same count of docs per segment
  – Internal document ID-s must match 1:1
  – List of deletes is taken from the first index
• Sounds cool … but in practice it’s rarely used:
  – It’s very difficult to meet these requirements
  – This is even more difficult in the presence of index updates and merges

[Diagram: ParallelReader over a main IR (fields f1, f2, …) and a parallel IR (fields f3, f4, …), each with three segments of 4, 2 and 1 docs (global docIds 0-6); the joined document 0 exposes f1, f2, f3, f4, …]
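The 1:1 structure requirement and the on-the-fly join can be illustrated with plain collections. This is a sketch under stated assumptions: an index is modelled as a list of segments, a segment as a list of field maps, and ParallelJoinSketch with its methods is a hypothetical name, not the Lucene ParallelReader class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the ParallelReader idea; hypothetical names, not the Lucene classes. */
class ParallelJoinSketch {

    /** Builds a one-field document. */
    static Map<String, String> doc(String field, String value) {
        Map<String, String> d = new HashMap<>();
        d.put(field, value);
        return d;
    }

    /** True if both indexes have the same number of segments and docs per segment. */
    static boolean sameShape(List<List<Map<String, String>>> main,
                             List<List<Map<String, String>>> parallel) {
        if (main.size() != parallel.size()) {
            return false;
        }
        for (int s = 0; s < main.size(); s++) {
            if (main.get(s).size() != parallel.get(s).size()) {
                return false;
            }
        }
        return true;
    }

    /** Joins the fields of the doc at (segment, docId) from both indexes on the fly. */
    static Map<String, String> document(List<List<Map<String, String>>> main,
                                        List<List<Map<String, String>>> parallel,
                                        int segment, int docId) {
        if (!sameShape(main, parallel)) {
            throw new IllegalStateException("index structures do not match 1:1");
        }
        Map<String, String> joined = new HashMap<>(main.get(segment).get(docId));
        joined.putAll(parallel.get(segment).get(docId));
        return joined;
    }

    public static void main(String[] args) {
        List<List<Map<String, String>>> mainIndex = Arrays.asList(
                Arrays.asList(doc("f1", "a"), doc("f1", "b")),
                Arrays.asList(doc("f1", "c")));
        List<List<Map<String, String>>> parallelIndex = Arrays.asList(
                Arrays.asList(doc("f3", "x"), doc("f3", "y")),
                Arrays.asList(doc("f3", "z")));
        System.out.println(document(mainIndex, parallelIndex, 0, 1));
    }
}
```

The join itself is trivial; all of the difficulty sits in keeping sameShape true while the main index keeps flushing and merging segments.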

Page 12: Andrzej bialecki lr-2013-dublin

Handling updates via ParallelReader

• Pros:
  – All types of data (e.g. searchable full-text labels) can be added
• Cons:
  – Must ensure that the other index always matches the structure of the main index
  – Complicated and fragile (rebuild on every update?)
  – No tools to manage this parallel index in Solr

[Diagram: ParallelReader over a main IR (fields f1, f2, …; segments of 4, 2 and 1 docs) and a parallel IR (fields f3, f4, …).]

Page 13: Andrzej bialecki lr-2013-dublin

Sidecar Index Components for Solr

• Uses the ParallelReader strategy for field updates
  – “Main” and “sidecar” data comes from two different Solr collections
  – “Sidecar” collection is updated independently from the main collection
  – “Sidecar” collection is used as a source of document fields for building and updating a parallel index
• Integrates the management of ParallelReader (“sidecar index”) into Solr
  – Initial creation of ParallelReader, including synchronization of internal ID-s
  – Tracking of updates and IndexReader.reopen(…) events
• Partly based on a version of Click Framework in LucidWorks Search
• Available under Apache License here: http://github.com/LucidWorks/sidecar_index

Page 14: Andrzej bialecki lr-2013-dublin
Page 15: Andrzej bialecki lr-2013-dublin
Page 16: Andrzej bialecki lr-2013-dublin

“Main”, “sidecar” collections and parallel index

• “Main” collection contains only the parts of documents with “main” fields
• “Sidecar” collection is a source of documents with “sidecar” fields
• SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index)
• “Main” collection uses SidecarIndexReader that acts as ParallelReader
• Main index is updated as usual, via the “main” collection’s IndexWriter

[Diagram: Solr hosting Main_collection (SidecarIndexReader over the main index and the sidecar index) and Sidecar_collection.]

Page 17: Andrzej bialecki lr-2013-dublin

Implementation details

• SidecarIndexReaderFactory extends Solr’s IndexReaderFactory
  – newReader(Directory, SolrCore) – initial open
  – newReader(IndexWriter, SolrCore) – NRT open
• SidecarIndexReader acts like a ParallelReader
  – Solr wants DirectoryReader, but ParallelReader is not a DirectoryReader
  – Basically had to re-implement the logic from ParallelReader
• ParallelReader challenges:
  – How to synchronize internal ID-s?
  – How to create segments that are of the same size as those of the main index?
  – How to handle deleted documents?
  – How to handle updates to the main index?
  – How to handle updates to the sidecar data?

Page 18: Andrzej bialecki lr-2013-dublin

ParallelReader challenges and solutions

• How to synchronize internal ID-s?
  – “Main” collection is traversed sequentially by internal docId
  – Primary key is retrieved for each document
  – Matching document is found in the “sidecar” collection
  – Matching document is added to the “sidecar” index
• Very costly phase!
  – Random seek and retrieval from “sidecar” collection
  – Primary key lookup is fast
  – … but stored field retrieval and indexing isn’t

[Diagram: the main collection holds documents D, B, A, F / C, G / E in three segments (main IR); the sidecar collection holds the same keys in a different order (G, B, C, E, A, F, D) with fields f3, f4, …; each main document is looked up by primary key (e.g. q=id:D) and its fields are appended to the sidecar IR in main-index order.]
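The traversal described above can be sketched as a loop over primary keys in main-index docId order. This is an illustration only: SidecarSyncSketch and buildSidecar are hypothetical names, the sidecar collection is modelled as a map keyed by primary key, and filling a missing match with an empty document is an assumption of this sketch rather than something the slide states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy version of the internal-ID synchronization step; names are hypothetical. */
class SidecarSyncSketch {

    /**
     * @param mainKeysInDocIdOrder primary keys as stored in the main index, by internal docId
     * @param sidecarByKey         sidecar collection documents, addressable by primary key
     * @return sidecar documents re-ordered so that position i matches main docId i
     */
    static List<Map<String, String>> buildSidecar(List<String> mainKeysInDocIdOrder,
                                                  Map<String, Map<String, String>> sidecarByKey) {
        List<Map<String, String>> sidecarIndex = new ArrayList<>();
        for (String key : mainKeysInDocIdOrder) {            // sequential traversal by docId
            Map<String, String> match = sidecarByKey.get(key); // primary key lookup
            if (match == null) {
                // Assumption of this sketch: keep positions aligned with an empty document.
                sidecarIndex.add(Collections.<String, String>emptyMap());
            } else {
                sidecarIndex.add(match);                      // append in main-index order
            }
        }
        return sidecarIndex;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> sidecar = new HashMap<>();
        for (String key : Arrays.asList("G", "B", "C", "E", "A", "F", "D")) {
            Map<String, String> fields = new HashMap<>();
            fields.put("f3", "f3-" + key);
            sidecar.put(key, fields);
        }
        List<String> mainOrder = Arrays.asList("D", "B", "A", "F", "C", "G", "E");
        System.out.println(buildSidecar(mainOrder, sidecar));
    }
}
```

In this toy model the lookup is a hash probe; in the real components each iteration means a random seek, a stored-field retrieval and re-indexing, which is why the slide calls the phase very costly.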

Page 19: Andrzej bialecki lr-2013-dublin

ParallelReader challenges and solutions

• Optimization 1: don’t rebuild data for unmodified segments
• Optimization 2 (cheating): ignore NRT segments
• How to handle deleted docs?
  – Insert dummy (empty) documents so that the number and the order of documents still match

[Diagram: ParallelReader over main IR (fields f1, f2, …) and sidecar IR (fields f3, f4, …); the NRT segment of the main index has no sidecar counterpart, and a document marked deleted (X) in a main segment is matched by a dummy document in the corresponding sidecar segment.]
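The dummy-document trick for deletes can be sketched per segment. This is a hedged illustration: DummyDocSketch and sidecarSegment are hypothetical names, a boolean array stands in for the segment's live-docs information, and an empty map stands in for the dummy document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of padding a sidecar segment with dummy docs; names are hypothetical. */
class DummyDocSketch {

    /**
     * Builds the sidecar segment for one main segment.
     * keys[i] is the primary key of main doc i, live[i] is false if that doc is deleted.
     */
    static List<Map<String, String>> sidecarSegment(String[] keys, boolean[] live,
                                                    Map<String, Map<String, String>> sidecarByKey) {
        List<Map<String, String>> segment = new ArrayList<>();
        for (int docId = 0; docId < keys.length; docId++) {
            Map<String, String> match = live[docId] ? sidecarByKey.get(keys[docId]) : null;
            if (match == null) {
                // Deleted (or unmatched) doc: add an empty dummy so later docIds do not shift.
                segment.add(Collections.<String, String>emptyMap());
            } else {
                segment.add(match);
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> sidecar = new HashMap<>();
        for (String key : new String[] {"C", "G"}) {
            Map<String, String> fields = new HashMap<>();
            fields.put("f3", "f3-" + key);
            sidecar.put(key, fields);
        }
        // Doc 0 ("C") is deleted in the main segment, doc 1 ("G") is live.
        System.out.println(sidecarSegment(new String[] {"C", "G"}, new boolean[] {false, true}, sidecar));
    }
}
```

The point of the padding is that the sidecar segment always has exactly as many documents as the main segment, so the 1:1 docId mapping survives deletes.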

Page 20: Andrzej bialecki lr-2013-dublin

Implementation: SidecarMergePolicy

• How to create segments that are of the same size as the “main” index?
• Carefully manage the “sidecar” index creation:
  – IndexWriter uses SerialMergeScheduler to prevent out-of-order merges
  – Force flush when reaching the next target count of documents
  – Merges are enforced using SidecarMergePolicy that tracks the sizes of the “main” index segments

SidecarMergePolicy target sizes:
  Seg0 – 4 docs
  Seg1 – 2 docs
  Seg2 – 1 doc

[Diagram: ParallelReader over main IR (fields f1, f2, …) and sidecar IR (fields f3, f4, …), both with segments of 4, 2 and 1 docs.]
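The "flush when reaching the next target count" idea can be sketched as cutting an ordered document stream at the main index's segment sizes. This models only the segment boundaries; TargetSizeSketch and split are hypothetical names, and nothing here reproduces the actual SidecarMergePolicy or IndexWriter behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of cutting sidecar segments at the main index's segment sizes. */
class TargetSizeSketch {

    /** Splits docs, in order, into segments whose sizes follow targetSizes exactly. */
    static <T> List<List<T>> split(List<T> docs, int[] targetSizes) {
        int total = 0;
        for (int size : targetSizes) {
            total += size;
        }
        if (total != docs.size()) {
            throw new IllegalArgumentException(
                    "target sizes cover " + total + " docs but " + docs.size() + " were given");
        }
        List<List<T>> segments = new ArrayList<>();
        int next = 0;
        for (int size : targetSizes) {
            // "Flush": close the current segment once it holds the target number of docs.
            segments.add(new ArrayList<>(docs.subList(next, next + size)));
            next += size;
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> sidecarDocs = Arrays.asList("D", "B", "A", "F", "C", "G", "E");
        // Target sizes taken from the main index: Seg0 = 4 docs, Seg1 = 2 docs, Seg2 = 1 doc.
        System.out.println(split(sidecarDocs, new int[] {4, 2, 1}));
    }
}
```

Because the documents arrive already ordered by main-index docId, cutting at the target counts is enough to line the segments up; the serial merge scheduler on the slide exists to keep background merges from disturbing that layout.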

Page 21: Andrzej bialecki lr-2013-dublin

Implementation: SidecarIndexReader

• Re-implements the logic of ParallelReader
  – ParallelReader != DirectoryReader
• Exposes Directory of the “main” index for replication
  – Replicas need the “sidecar” collection replica to rebuild the sidecar index locally
  – If document routing and shard placement are the same then we don’t have to use distributed search – all data will be local
• Reopen(…) avoids rebuilding unmodified segments
• Reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary
  – When there’s a major merge in the “main” index
  – When “sidecar” data is updated
• Ref-counting of IndexReaders at different levels is very tricky!

Page 22: Andrzej bialecki lr-2013-dublin

Example configuration in solrconfig.xml

<indexReaderFactory name="IndexReaderFactory" class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
  <str name="docIdField">id</str>
  <str name="sourceCollection">source</str>
  <bool name="enabled">true</bool>
</indexReaderFactory>

Page 23: Andrzej bialecki lr-2013-dublin

Example use case: integration of click-through data

• Raw click-through data:
  – Query, query_time, docId, click_time [, user]
• Aggregated click-through data:
  – User-generated popularity score: F(number and timing of clicks per docId)
    • Numeric updates
  – User-generated labels: F(top-N queries that led to clicks on docId)
    • Full-text searchable updates
  – User profiles: F(top-N queries per user, top-N docId-s clicked, etc)
  – …
• Queries can now be expanded to score based on TF/IDF in user-generated labels
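The aggregation from raw click events to per-document sidecar fields can be sketched as follows. This is an assumption-laden illustration: ClickAggregationSketch, Click, popularity and labels are hypothetical names, the popularity score is a plain click count (the slide's F also uses click timing, which is ignored here), and ties between queries are broken alphabetically purely to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy aggregation of click-through events into sidecar fields; names are hypothetical. */
class ClickAggregationSketch {

    /** One raw click event, reduced to the two fields this sketch needs. */
    static final class Click {
        final String query;
        final String docId;

        Click(String query, String docId) {
            this.query = query;
            this.docId = docId;
        }
    }

    /** Numeric update: number of clicks per docId. */
    static Map<String, Integer> popularity(List<Click> clicks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Click c : clicks) {
            counts.merge(c.docId, 1, Integer::sum);
        }
        return counts;
    }

    /** Full-text update: the top-N queries that led to clicks on each docId. */
    static Map<String, List<String>> labels(List<Click> clicks, int topN) {
        Map<String, Map<String, Integer>> perDoc = new TreeMap<>();
        for (Click c : clicks) {
            perDoc.computeIfAbsent(c.docId, k -> new TreeMap<>()).merge(c.query, 1, Integer::sum);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : perDoc.entrySet()) {
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(e.getValue().entrySet());
            // Most frequent query first; ties broken alphabetically.
            ranked.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> top = new ArrayList<>();
            for (int i = 0; i < ranked.size() && i < topN; i++) {
                top.add(ranked.get(i).getKey());
            }
            result.put(e.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Click> clicks = Arrays.asList(
                new Click("solr join", "D"), new Click("solr join", "D"),
                new Click("sidecar", "D"), new Click("luke", "B"));
        System.out.println(popularity(clicks));
        System.out.println(labels(clicks, 1));
    }
}
```

In the setup the talk describes, outputs like these would be written to the “sidecar” collection keyed by docId, with the labels indexed as searchable text.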

Page 24: Andrzej bialecki lr-2013-dublin

Scalability and performance

Page 25: Andrzej bialecki lr-2013-dublin
Page 26: Andrzej bialecki lr-2013-dublin

Scalability and performance

• Initial full rebuild is very costly
  – ~0.6 ms / document
  – 1 million docs = 600 sec = 10 min
  – Not even close to “real time” …
• Cost related to new segments in “main” index depends on the size of segments
• Major merge events will trigger full rebuild
• BUT: search-time cost is negligible

Page 27: Andrzej bialecki lr-2013-dublin

Caveats

• Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track
  – The sidecar code is still unstable and occasionally explodes
• Performance of full rebuild quickly becomes the bottleneck on frequent updates
  – So the main use case is massive but infrequent updates of “sidecar” data
• Code: http://github.com/LucidWorks/sidecar_index
• Fixes and contributions are welcome – the code is Apache licensed

Page 28: Andrzej bialecki lr-2013-dublin

Agenda

• Challenge: incremental document updates
• Existing solutions and workarounds
• Sidecar index strategy and components
• Scalability and performance
• Q&A

Page 29: Andrzej bialecki lr-2013-dublin

Q&A

Andrzej Bialecki