andrzej bialecki lr-2013-dublin

SOLR SIDE-CAR INDEX
Andrzej Bialecki, LucidWorks


• Started using Lucene in 2003 (1.2-dev…)
• Created Luke – the Lucene Index Toolbox
• Apache Nutch, Hadoop, Solr committer, Lucene PMC member
• LucidWorks engineer

About the speaker

• Challenge: incremental document updates
• Existing solutions and workarounds
• Sidecar index strategy and components
• Scalability and performance
• QA

Agenda

• Incremental update (field-level update): modification of part of a document
• Sounds like fundamentally useful functionality!
• But Lucene / Solr don’t offer true field-level updates (yet!)
  – “Update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document”
  – “Atomic update” functionality in Solr is (useful) syntactic sugar

Challenge: incremental document updates
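The “retrieve, update, add, delete” sequence can be illustrated with a toy append-only index. This is only a sketch of the idea, not Lucene or Solr code: `ToyIndex` and all of its methods are hypothetical names invented for the illustration, and documents are plain string maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only index (hypothetical): documents are never modified in place. */
final class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    /** Appends a document and returns its internal docId. */
    int add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
        deleted.add(false);
        return docs.size() - 1;
    }

    /** Returns the internal docId of the live document with this "id" value, or -1. */
    int find(String id) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && id.equals(docs.get(i).get("id"))) return i;
        }
        return -1;
    }

    /** "Updates" one field by re-adding the whole document under a new docId. */
    int update(String id, String field, String value) {
        int old = find(id);
        if (old < 0) throw new IllegalArgumentException("no such document: " + id);
        Map<String, String> copy = new HashMap<>(docs.get(old)); // retrieve old document
        copy.put(field, value);                                  // update fields
        int fresh = add(copy);                                   // add updated document
        deleted.set(old, true);                                  // delete old document
        return fresh;
    }

    String get(int docId, String field) { return docs.get(docId).get(field); }
    boolean isDeleted(int docId) { return deleted.get(docId); }
    int maxDoc() { return docs.size(); }
}
```

Note that changing a single field costs a full copy of the document and leaves a deleted slot behind, which is why this kind of update becomes expensive for large documents or frequent changes.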

• Documents composed logically of two parts with different update schedules
  – E.g. mostly static documents with some quickly changing fields
• Two different classes of data in changing fields
  – Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns
  – Text fields: e.g. reviews, tags, click-through feedback, user profiles
• Challenge: how to integrate these modifications with the main index content?
  – Re-indexing whole documents isn’t always an option

Common use cases for field updates

• Very complex issue, broad impact on many Lucene internals
  – Inverted index structure is not optimized for partial document updates
  – At least another 6-12 months away?
• LUCENE-4258 – work in progress

True full-text (inverted fields) incremental updates

• If the corpus is small, or incremental updates infrequent… just re-index everything!
• Pros:
  – Relatively easy to implement – update source documents and re-index
  – Allows adding all types of data, including e.g. labels as searchable text
• Cons:
  – Infeasible for larger corpora or frequent updates, time-wise and cost-wise
  – Requires keeping around the source documents
    • Sometimes inconvenient, when documents are assembled in a complex pipeline

Handling updates via full re-index

• Pros:
  – Simple to implement
  – Updates are easy – just file edits, no need to re-index
• Cons:
  – Only docId => field : number
  – Not suitable for full-text searchable field updates
    • E.g. can’t support user-generated labels attached to a doc
  – Still useful if a simple “popularity”-type metric is sufficient
• Internally implemented as an in-memory ValueSource usable by function queries

Handling updates via Solr’s ExternalFileField

doc0=1.5
doc1=2.5
doc2=0.5
…
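The `key=value` file shown above maps naturally onto an in-memory lookup table. The sketch below is not Solr’s ExternalFileField implementation; `ExternalValues`, `parse` and `valueFor` are hypothetical names, and the fallback default for keys missing from the file is an assumption made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: loads "key=value" lines into an in-memory table. */
final class ExternalValues {
    /** Parses lines of the form key=floatValue; blank lines are skipped. */
    static Map<String, Float> parse(List<String> lines) {
        Map<String, Float> values = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            int eq = trimmed.lastIndexOf('=');
            if (eq <= 0) throw new IllegalArgumentException("expected key=value: " + line);
            values.put(trimmed.substring(0, eq),
                       Float.parseFloat(trimmed.substring(eq + 1)));
        }
        return values;
    }

    /** Returns the value for a key, or defaultValue when the file has no entry (assumption). */
    static float valueFor(Map<String, Float> values, String key, float defaultValue) {
        Float v = values.get(key);
        return v == null ? defaultValue : v;
    }
}
```

Because the table holds only one number per document key, it can feed a ranking function but cannot make new text searchable, which matches the limitation listed in the cons.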

• Available since Lucene/Solr 4.6 … to be released Really Soon
• Details can be found in LUCENE-5189
• As simple as:

    indexWriter.updateNumericDocValue(term, field, value)

• Neatly solves the problem of numeric updates: popularity, in-stock, etc.
• Some limitations:
  – Massive updates still somewhat costly until the next merge (like deletes)
  – Can only update existing fields
• Obviously doesn’t address the full-text inverted field updates

Numeric DocValues updates

• Pretends that two or more IndexReader-s are slices of the same index
  – Slices contain data for different fields
  – Both stored and inverted parts are supported
  – Data for matching docs is joined on the fly
• Structure of all indexes MUST match 1:1 !!!
  – The same number of segments
  – The same count of docs per segment
  – Internal document ID-s must match 1:1
  – List of deletes is taken from the first index
• Sounds cool … but in practice it’s rarely used:
  – It’s very difficult to meet these requirements
  – This is even more difficult in the presence of index updates and merges

Lucene ParallelReader overview

[Diagram: the main IR has three segments (docs 0–3, 0–1, 0) holding fields f1, f2, …; the parallel IR has three segments of identical sizes holding f3, f4, …; ParallelReader presents them as documents 0–6, each with fields f1, f2, f3, f4, …]
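The 1:1 structural requirement can be made concrete with a small model in which an index is a list of segments and a segment is a list of documents. This is a sketch of the joining idea only, not the ParallelReader implementation; `ParallelJoin` and `join` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: joins two index "slices" position by position. */
final class ParallelJoin {
    /** Each index is a list of segments; each segment is a list of documents (field -> value). */
    static List<Map<String, String>> join(List<List<Map<String, String>>> main,
                                          List<List<Map<String, String>>> parallel) {
        if (main.size() != parallel.size()) {
            throw new IllegalArgumentException("segment counts differ");
        }
        List<Map<String, String>> joined = new ArrayList<>();
        for (int s = 0; s < main.size(); s++) {
            List<Map<String, String>> a = main.get(s);
            List<Map<String, String>> b = parallel.get(s);
            if (a.size() != b.size()) {
                throw new IllegalArgumentException("doc counts differ in segment " + s);
            }
            for (int d = 0; d < a.size(); d++) {
                Map<String, String> doc = new HashMap<>(a.get(d));
                doc.putAll(b.get(d)); // same position is treated as the same internal docId
                joined.add(doc);
            }
        }
        return joined;
    }
}
```

The join has no key to fall back on: documents are paired purely by position, so any difference in segment count or segment size makes the result either impossible or silently wrong.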

• Pros:
  – All types of data (e.g. searchable full-text labels) can be added
• Cons:
  – Must ensure that the other index always matches the structure of the main index
  – Complicated and fragile (rebuild on every update?)
  – No tools to manage this parallel index in Solr

Handling updates via ParallelReader

[Diagram: ParallelReader over the main IR segments (f1, f2, …) and the parallel IR segments (f3, f4, …)]

• Uses the ParallelReader strategy for field updates
  – “Main” and “sidecar” data comes from two different Solr collections
  – “Sidecar” collection is updated independently from the main collection
  – “Sidecar” collection is used as a source of document fields for building and updating a parallel index
• Integrates the management of ParallelReader (“sidecar index”) into Solr
  – Initial creation of ParallelReader, including synchronization of internal ID-s
  – Tracking of updates and IndexReader.reopen(…) events
• Partly based on a version of Click Framework in LucidWorks Search
• Available under Apache License here: http://github.com/LucidWorks/sidecar_index

Sidecar Index Components for Solr


• “Main” collection contains only the parts of documents with “main” fields
• “Sidecar” collection is a source of documents with “sidecar” fields
• SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index)
• “Main” collection uses SidecarIndexReader that acts as ParallelReader
• Main index is updated as usual, via the “main” collection’s IndexWriter

“Main”, “sidecar” collections and parallel index

[Diagram: inside Solr, Main_collection uses a SidecarIndexReader over the main index and the sidecar index; Sidecar_collection is the source of the sidecar index]

• SidecarIndexReaderFactory extends Solr’s IndexReaderFactory
  – newReader(Directory, SolrCore) – initial open
  – newReader(IndexWriter, SolrCore) – NRT open
• SidecarIndexReader acts like a ParallelReader
  – Solr wants DirectoryReader, but ParallelReader is not a DirectoryReader
  – Basically had to re-implement the logic from ParallelReader
• ParallelReader challenges:
  – How to synchronize internal ID-s?
  – How to create segments that are of the same size as those of the main index?
  – How to handle deleted documents?
  – How to handle updates to the main index?
  – How to handle updates to the sidecar data?

Implementation details

• How to synchronize internal ID-s?
  – “Main” collection is traversed sequentially by internal docId
  – Primary key is retrieved for each document
  – Matching document is found in the “sidecar” collection
  – Matching document is added to the “sidecar” index
• Very costly phase!
  – Random seek and retrieval from “sidecar” collection
  – Primary key lookup is fast
  – … but stored field retrieval and indexing isn’t

ParallelReader challenges and solutions

[Diagram: the main IR lists documents by primary key in internal docId order (D, B, A, F | C, G | E); the sidecar collection stores the same keys in a different order (G, B, C, E, A, F, D); each key is looked up in the sidecar collection (e.g. q=id:D) and its f3, f4, … fields are appended to the sidecar IR in main-index order]
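The ID synchronization pass can be sketched as a single loop over the main index in internal docId order. This is an illustration under assumptions, not the sidecar_index code: `SidecarSync` and `align` are hypothetical names, the sidecar collection is modeled as a map keyed by primary key, and filling a missing match with an empty placeholder document is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of re-ordering sidecar documents to match the main index. */
final class SidecarSync {
    /**
     * mainKeys: primary keys of the main index, in internal docId order.
     * sidecarByKey: the sidecar collection, addressable by primary key.
     * Returns sidecar documents in the same order as the main index.
     */
    static List<Map<String, String>> align(List<String> mainKeys,
                                           Map<String, Map<String, String>> sidecarByKey) {
        List<Map<String, String>> aligned = new ArrayList<>(mainKeys.size());
        for (String key : mainKeys) {                          // sequential traversal by docId
            Map<String, String> match = sidecarByKey.get(key); // primary-key lookup
            // Assumption: a key with no sidecar data gets an empty placeholder,
            // so that positions still line up with the main index.
            aligned.add(match != null ? match : Collections.<String, String>emptyMap());
        }
        return aligned;
    }
}
```

In the real setting each lookup is a query against a Solr collection followed by stored-field retrieval and re-indexing, which is where the cost of this phase comes from.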

• Optimization 1: don’t rebuild data for unmodified segments
• Optimization 2 (cheating): ignore NRT segments
• How to handle deleted docs?
  – Insert dummy (empty) documents so that the number and the order of documents still match

ParallelReader challenges and solutions

[Diagram: ParallelReader over the main IR and the sidecar IR; the main IR gains a new NRT segment and contains a deleted document (X); the sidecar IR holds a dummy document in the deleted document’s position]

• How to create segments that are of the same size as the “main” index?
• Carefully manage the “sidecar” index creation:
  – IndexWriter uses SerialMergeScheduler to prevent out-of-order merges
  – Force flush when reaching the next target count of documents
  – Merges are enforced using SidecarMergePolicy that tracks the sizes of the “main” index segments

Implementation: SidecarMergePolicy

[Diagram: ParallelReader over the main IR and the sidecar IR, both with segments of 4, 2 and 1 documents]

SidecarMergePolicy target sizes: Seg0 – 4 docs, Seg1 – 2 docs, Seg2 – 1 doc
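The “flush at the next target count” idea can be sketched without any Lucene classes: documents arrive in main-index order and a new segment is closed whenever the current target size is reached. `TargetSizedSegments` and `build` are hypothetical names; the real SidecarMergePolicy works through IndexWriter flushes and merges rather than by slicing a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: cut a document stream into segments of prescribed sizes. */
final class TargetSizedSegments {
    /** Splits docs into consecutive segments whose sizes follow targetSizes exactly. */
    static <T> List<List<T>> build(List<T> docs, int[] targetSizes) {
        int total = 0;
        for (int size : targetSizes) {
            if (size <= 0) throw new IllegalArgumentException("segment sizes must be positive");
            total += size;
        }
        if (total != docs.size()) {
            throw new IllegalArgumentException("expected " + total + " docs, got " + docs.size());
        }
        List<List<T>> segments = new ArrayList<>();
        List<T> buffer = new ArrayList<>();
        int target = 0;
        for (T doc : docs) {
            buffer.add(doc);
            if (buffer.size() == targetSizes[target]) { // "force flush" at the target count
                segments.add(buffer);
                buffer = new ArrayList<>();
                target++;
            }
        }
        return segments;
    }
}
```

With the target sizes from the slide, `build(docs, new int[] {4, 2, 1})` on seven documents yields three segments of 4, 2 and 1 documents, mirroring the main index layout.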

• Re-implements the logic of ParallelReader
  – ParallelReader != DirectoryReader
• Exposes Directory of the “main” index for replication
  – Replicas need the “sidecar” collection replica to rebuild the sidecar index locally
  – If document routing and shard placement are the same then we don’t have to use distributed search – all data will be local
• Reopen(…) avoids rebuilding unmodified segments
• Reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary
  – When there’s a major merge in the “main” index
  – When “sidecar” data is updated
• Ref-counting of IndexReaders at different levels is very tricky!

Implementation: SidecarIndexReader

<indexReaderFactory name="IndexReaderFactory"
                    class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
  <str name="docIdField">id</str>
  <str name="sourceCollection">source</str>
  <bool name="enabled">true</bool>
</indexReaderFactory>

Example configuration in solrconfig.xml

• Raw click-through data:
  – Query, query_time, docId, click_time [, user]
• Aggregated click-through data:
  – User-generated popularity score: F(number and timing of clicks per docId)
    • Numeric updates
  – User-generated labels: F(top-N queries that led to clicks on docId)
    • Full-text searchable updates
  – User profiles: F(top-N queries per user, top-N docId-s clicked, etc)
  – …
• Queries can now be expanded to score based on TF/IDF in user-generated labels

Example use case: integration of click-through data
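One possible reading of “F(top-N queries that led to clicks on docId)” is a plain frequency count per document. The sketch below assumes that reading; `ClickLabels`, `Click` and `topQueries` are hypothetical names, click timestamps and users are ignored, and the alphabetical tie-break is an arbitrary choice made for determinism.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: derive per-document labels from raw click events. */
final class ClickLabels {
    /** One raw click event; timestamps and user are omitted in this sketch. */
    static final class Click {
        final String query;
        final String docId;
        Click(String query, String docId) { this.query = query; this.docId = docId; }
    }

    /** Returns, per docId, the n most frequent queries that led to a click on it. */
    static Map<String, List<String>> topQueries(List<Click> clicks, int n) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Click c : clicks) {
            counts.computeIfAbsent(c.docId, k -> new HashMap<>())
                  .merge(c.query, 1, Integer::sum);
        }
        Map<String, List<String>> labels = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> perQuery = e.getValue();
            List<String> queries = new ArrayList<>(perQuery.keySet());
            // Most frequent first; ties broken alphabetically for a stable result
            queries.sort((a, b) -> {
                int byCount = perQuery.get(b).compareTo(perQuery.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            });
            labels.put(e.getKey(),
                       new ArrayList<>(queries.subList(0, Math.min(n, queries.size()))));
        }
        return labels;
    }
}
```

The resulting label lists are text, so they would go into the “sidecar” collection as a full-text field, whereas a popularity score derived from the same events is a single number per document.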

Scalability and performance


• Initial full rebuild is very costly
  – ~0.6 ms / document
  – 1 million docs = 600 sec = 10 min
  – Not even close to “real time” …
• Cost related to new segments in “main” index depends on the size of segments
• Major merge events will trigger full rebuild
• BUT: search-time cost is negligible

• Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track
  – The sidecar code is still unstable and occasionally explodes
• Performance of full rebuild quickly becomes the bottleneck on frequent updates
  – So the main use case is massive but infrequent updates of “sidecar” data
• Code: http://github.com/LucidWorks/sidecar_index
• Fixes and contributions are welcome – the code is Apache licensed

Caveats


Andrzej Bialecki


QA
