Netflix: Autocomplete Multi-Language Search Using Solr Ivan Provalov Sr Software Engineer

Post on 27-May-2020

TRANSCRIPT

Page 1: Multi-Language Search Using Solr Netflix: AutocompleteSolr - open source search platform Lucene - open source search engine Indexing - parsing and storing the data Token - the smallest

Netflix: Autocomplete Multi-Language Search Using Solr

Ivan Provalov Sr Software Engineer

Page 2

Overview

● Use Case
● Configuration, scoring
● Language challenges
● Character mapper
● Query testing framework

Page 3

Terminology

● Solr - open source search platform
● Lucene - open source search engine
● Indexing - parsing and storing the data
● Token - the smallest piece of parsed data, a word
● Document frequency - scoring component, number of documents containing the term
● Term frequency - scoring component, number of times the term occurs in the specific document
● Document - record in the database, consisting of fields

Page 4

Terminology

● Precision: the fraction of retrieved documents that are relevant
● Recall: the fraction of relevant documents that are retrieved
● Example: 100 documents in the collection are relevant to query X. 200 documents are retrieved in total, of which only 50 are relevant to query X.
  ○ Precision: 50/200 = 0.25
  ○ Recall: 50/100 = 0.50
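
The arithmetic in the example can be checked with a few lines of Python. This is only a sketch; the function and argument names are made up for illustration and do not come from the talk.

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Return (precision, recall) for a single query."""
    precision = retrieved_relevant / retrieved_total  # share of results that are relevant
    recall = retrieved_relevant / relevant_total      # share of relevant docs that were found
    return precision, recall


# Numbers from the slide: 50 relevant hits among 200 results, 100 relevant docs overall.
precision, recall = precision_recall(50, 200, 100)
print(precision, recall)
```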

Page 5

Terminology

● EDismax: extended disjunction max query
  ○ Query: ‘ab’
  ○ Document {“field1”:“ab”, “field2”:“abcd”}
  ○ Scores: field1=0.5, field2=0.25
  ○ EDismax lets you specify whether to use only the max field score, or the max field score plus a weighted sum of the other field scores
● N-gram: contiguous sequence of n characters
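
As a rough illustration of the disjunction-max idea, the sketch below combines the two field scores from the slide. It assumes the weighting is controlled by a single tie-breaker value in the range 0 to 1 (Solr exposes this as the `tie` parameter); the function name is hypothetical and this is not Solr's actual scoring code.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field plus `tie` times the remaining fields."""
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


scores = {"field1": 0.5, "field2": 0.25}  # values from the slide

print(dismax_score(scores, tie=0.0))  # only the best field counts
print(dismax_score(scores, tie=1.0))  # plain sum of all field scores
print(dismax_score(scores, tie=0.1))  # best field plus a small bonus for the other match
```

With `tie=0.0` a document cannot gain score just by matching the same text in many fields, which is usually what short-field autocomplete wants.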

Page 6

Going Global at Netflix

● Netflix launched globally in January 2016
● 190 countries
● Currently support 23 languages
  ○ Devices
  ○ Messaging
  ○ Content
  ○ Search

Page 7

Page 8

Use Case

● Video titles, person names, genre names
● Shorter documents should be ranked higher
● Autocomplete (instant search)
● Ranking function - lexical, popularity and click signals
● Lexical score - recall over precision

Page 9

Prefix Character N-gram

“Breaking bad”

b - 0
br - 0
bre - 0
brea - 0
break - 0
breaki - 0
breakin - 0
breaking - 0
b - 1
ba - 1
bad - 1
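
The list above can be reproduced with a short sketch. It assumes the number after each n-gram is the position of the token it was generated from, and it uses whitespace splitting and lowercasing in place of a real Lucene analyzer; the function name is illustrative.

```python
def prefix_ngrams(text):
    """Return (prefix n-gram, token position) pairs for every token in the text."""
    pairs = []
    for position, token in enumerate(text.lower().split()):
        for n in range(1, len(token) + 1):
            pairs.append((token[:n], position))
    return pairs


for gram, position in prefix_ngrams("Breaking bad"):
    print(f"{gram} - {position}")
```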

Page 10

Configuration

● Solr
● Edismax: simple syntax, max field score
● Phrase: prevents matching across fields
● N-gram: character n-gram search
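
One way these choices could appear in a request is sketched below. `defType`, `q`, `qf` and `tie` are standard Solr query parameters, but the field names `title_ngram` and `person_ngram` and the helper function are assumptions for illustration; the talk does not show the actual Netflix configuration.

```python
from urllib.parse import urlencode


def autocomplete_params(user_input, fields, tie=0.0):
    """Build Solr query parameters for a phrase query over n-gram fields."""
    # Escape backslashes and quotes so the input stays inside one quoted phrase.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "defType": "edismax",    # extended dismax query parser
        "q": f'"{escaped}"',     # phrase query: all terms must match within one field
        "qf": " ".join(fields),  # fields to search (hypothetical names)
        "tie": str(tie),         # 0.0 keeps only the max field score
    }


params = autocomplete_params("breaking b", ["title_ngram", "person_ngram"])
query_string = urlencode(params)
print(query_string)
```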

Page 11

Lexical Scoring

● Skewed data distribution
  ○ e.g. one field sparsely populated
● Doc length normalization
● Unigram language model

Page 12

Scoring With Unigram LM

● Indexed document:
  ○ ‘breaking bad’
  ○ 11 n-gram ‘terms’
● Query
  ○ ‘b’
● Score:
  ○ Term Frequency / N-grams in Doc
  ○ 2 / 11
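
The 2 / 11 figure can be reproduced with a small sketch. It assumes the document is indexed as prefix n-grams of its whitespace-separated tokens and that the score is simply term frequency divided by the number of n-gram terms, as the slide states; the helper names are illustrative.

```python
from fractions import Fraction


def edge_ngrams(text):
    """All prefix n-grams of every whitespace-separated token."""
    return [token[:n] for token in text.lower().split() for n in range(1, len(token) + 1)]


def unigram_lm_score(query_term, document):
    """Term frequency of the query term divided by the number of n-gram terms in the doc."""
    terms = edge_ngrams(document)
    return Fraction(terms.count(query_term), len(terms))


print(unigram_lm_score("b", "breaking bad"))  # 2/11
```

Because the denominator grows with the document, a shorter title that matches the same query term scores higher, which fits the use case of ranking shorter documents first.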

Page 13

Language Challenges

● Multiple Scripts
  ○ Japanese: Kanji, Hiragana, Katakana, Romaji
● No token delimiters: Japanese, Chinese
● Korean character composition
● Stopwords and autocomplete
● Stemming

Page 14

Japanese: Multiple Scripts

● ‘南極物語’ (‘Antarctic Story’)
● Tokenizer: 南極 物語
● Reading form: ナンキョク モノガタリ
● Query in Katakana: ナンキョク
● Query in Hiragana: なんきょく
● Transliteration required
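
The Hiragana-to-Katakana part of this transliteration can be sketched with a fixed code point offset, since the two Unicode blocks are laid out in parallel. This is a minimal illustration, not the analyzer used in the talk; deriving the reading form from Kanji needs a morphological tokenizer and is not covered here.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # small a .. small ke
KATAKANA_OFFSET = 0x60                         # distance between the two Unicode blocks


def hiragana_to_katakana(text):
    """Map each Hiragana character to its Katakana counterpart; leave the rest unchanged."""
    return "".join(
        chr(ord(ch) + KATAKANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


reading_form = "ナンキョク"  # reading form of the first token, from the slide
print(hiragana_to_katakana("なんきょく") == reading_form)  # True
```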

Page 15

Korean: Character Composition

● input jamo ㄱ ㅗㅏ ㅇ
● decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ
● fully composed hangul 광
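
The relationship between the composed syllable and its decomposed jamo can be seen with Unicode normalization from the Python standard library. This sketch only covers the composed and decomposed forms; mapping keyboard (compatibility) jamo to these forms is a separate step that is not shown. The prefix check at the end is one plausible reason decomposition helps autocomplete, stated here as an assumption rather than something the slide spells out.

```python
import unicodedata

composed = "\uad11"  # the fully composed hangul syllable from the slide
decomposed = unicodedata.normalize("NFD", composed)
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(ch):04X}" for ch in decomposed])  # three conjoining jamo
print(recomposed == composed)

# A partially typed syllable (initial consonant + vowel, no final consonant yet)
partial = unicodedata.normalize("NFD", "\uacfc")
print(decomposed.startswith(partial))  # prefix match works on the decomposed form
```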

Page 16

Tokenization Pipelines

● Char Filter: pre-processes input characters
● Tokenizer: breaks data into tokens
● Filters: transform, remove, create new tokens

Page 17

Simple Pipeline Example: index

● CharFilters: PatternReplaceCharFilterFactory
  ○ pattern: ([a-z]+)ing
● Tokenizer: StandardTokenizerFactory
● Filters: LowerCaseFilterFactory, EdgeNGramFilterFactory

Page 18

Simple Pipeline Example: query

● CharFilters: PatternReplaceCharFilterFactory
  ○ pattern: ([a-z]+)ing
● Tokenizer: StandardTokenizerFactory
● Filters: LowerCaseFilterFactory

Page 19

Simple Pipeline Example
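
The index and query pipelines can be imitated in plain Python to see how they fit together. This is a simulation, not Solr: the slides give the pattern but not its replacement, so the sketch assumes the matched text is replaced with the first capture group (dropping "ing"), and a simple `\w+` regex stands in for StandardTokenizerFactory.

```python
import re

ING_PATTERN = re.compile(r"([a-z]+)ing")  # pattern from the slides


def char_filter(text):
    """Char filter step; replacing with group 1 is an assumption."""
    return ING_PATTERN.sub(r"\1", text)


def tokenize(text):
    """Simplified tokenizer standing in for StandardTokenizerFactory."""
    return re.findall(r"\w+", text)


def analyze_query(text):
    """Query pipeline: char filter, tokenizer, lowercase."""
    return [token.lower() for token in tokenize(char_filter(text))]


def analyze_index(text):
    """Index pipeline: same as the query pipeline, plus edge n-grams."""
    return [token[:n] for token in analyze_query(text) for n in range(1, len(token) + 1)]


indexed_terms = set(analyze_index("Breaking Bad"))
for query in ["b", "brea", "breaking", "bad"]:
    matched = all(term in indexed_terms for term in analyze_query(query))
    print(query, matched)
```

Edge n-grams are produced only at index time, so a typed prefix is looked up as a single term rather than being expanded again at query time.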

Page 20

Character Mapping Filter Cases

● Prefix Removal
  ○ Arabic ال (alef lam)
● Suffix folding
  ○ Japanese ァ (katakana small a) => ア (a)
● Character decomposition
  ○ Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e)
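
The three cases can be sketched as a tiny token-level mapper. This is an illustrative approximation written for this transcript, not the LUCENE-7321 implementation; the table names, the single entries in each table, and the rule that a prefix is only removed when something remains are all assumptions.

```python
ARABIC_PREFIX = "\u0627\u0644"             # alef lam
SUFFIX_FOLDS = {"\u30a1": "\u30a2"}        # katakana small a => a
DECOMPOSITIONS = {"\u1170": "\u315c\u3154"}  # jungseong we => u and e


def map_token(token):
    """Apply prefix removal, suffix folding and character decomposition to one token."""
    # Prefix removal: drop the article only if something is left afterwards.
    if token.startswith(ARABIC_PREFIX) and len(token) > len(ARABIC_PREFIX):
        token = token[len(ARABIC_PREFIX):]
    # Suffix folding: only the final character is folded.
    if token and token[-1] in SUFFIX_FOLDS:
        token = token[:-1] + SUFFIX_FOLDS[token[-1]]
    # Character decomposition: applies anywhere in the token.
    return "".join(DECOMPOSITIONS.get(ch, ch) for ch in token)


print(map_token("\u0627\u0644\u0643\u062a\u0627\u0628"))  # Arabic word with the article removed
print(map_token("\u30d5\u30a1"))                          # trailing small a folded to a
```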

Page 21

Character Mapping Filter Cases

● Stemmer implementation, or extension
  ○ Character mapper reference implementation of the Russian stemmer
● Patch to Lucene, can be used as stand-alone
  ○ LUCENE-7321

Page 22

Query Testing Framework

● Open source project
● Solr or Elasticsearch
● Google Spreadsheets based UI
● Unit tests for language queries
● Regression testing after changes, upgrades
● 20K queries
● 7K titles
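
The idea of query unit tests can be sketched as follows: each test case pairs a query with a title that is expected in the top results, and the report records which cases pass. This is not the API of the open source framework; the in-memory title list, the `prefix_search` function and the report format are all hypothetical stand-ins for a real Solr or Elasticsearch backend and spreadsheet input.

```python
TITLES = ["Breaking Bad", "Bad Boys", "Better Call Saul"]  # toy catalog


def prefix_search(query):
    """Toy search: titles with a word starting with the query, shortest first."""
    q = query.lower()
    hits = [t for t in TITLES if any(word.startswith(q) for word in t.lower().split())]
    return sorted(hits, key=len)


def run_query_tests(search, test_cases, top_n=5):
    """Run each (query, expected title) pair and record whether it passed."""
    report = []
    for query, expected in test_cases:
        results = search(query)[:top_n]
        report.append({"query": query, "expected": expected, "passed": expected in results})
    return report


cases = [("brea", "Breaking Bad"), ("ba", "Bad Boys"), ("sau", "Better Call Saul"),
         ("xyz", "Breaking Bad")]
report = run_query_tests(prefix_search, cases)
passed = sum(1 for row in report if row["passed"])
print(f"{passed} of {len(report)} queries passed")
```

Running the same cases before and after a configuration change and comparing the two reports gives the kind of regression diff the later slides refer to.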

Page 23

Google Spreadsheets as Input

Page 24

Google Spreadsheets as Detail Report

Page 25

Google Spreadsheets as Detail Report

Diff

Page 26

Google Spreadsheets as Summary Report

Page 27

Google Spreadsheets as Summary Report

Diff

Page 28

Summary

● Use case: short fields, autocomplete, P/R
● Configuration, scoring
● Language challenges
● Character Mapper patch (LUCENE-7321)
● Query testing framework

https://github.com/Netflix/q