Netflix: Autocomplete Multi-Language Search Using Solr
Ivan Provalov, Sr. Software Engineer
Overview
● Use Case
● Configuration, scoring
● Language challenges
● Character mapper
● Query testing framework
Terminology
● Solr - open source search platform
● Lucene - open source search engine
● Indexing - parsing and storing the data
● Token - the smallest piece of parsed data, a word
● Document frequency - scoring component, number of documents containing the term
● Term frequency - scoring component, number of times the term occurs in the specific document
● Document - record in the database, consisting of fields
Terminology
● Precision: the fraction of retrieved documents that are relevant
● Recall: the fraction of relevant documents that are retrieved
● Example: 100 documents in the collection are relevant to query X. 200 documents are retrieved, of which only 50 are relevant to query X.
  ○ Precision: 50/200 = 0.25
  ○ Recall: 50/100 = 0.50
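The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch; the function name `precision_recall` and its parameter names are chosen here for illustration.

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Return (precision, recall) for a single query."""
    # Precision: share of the retrieved documents that are relevant.
    precision = retrieved_relevant / retrieved_total if retrieved_total else 0.0
    # Recall: share of the relevant documents that were retrieved.
    recall = retrieved_relevant / relevant_total if relevant_total else 0.0
    return precision, recall


# Numbers from the slide: 100 relevant in the collection, 200 retrieved, 50 of those relevant.
precision, recall = precision_recall(50, 200, 100)
print(precision, recall)  # 0.25 0.5
```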
Terminology
● EDismax: extended disjunction max query
  ○ Query: ‘ab’
  ○ Document {“field1”:“ab”, “field2”:“abcd”}
  ○ Scores: field1=0.5, field2=0.25
  ○ EDismax lets you choose between taking only the max field score and adding the other field scores, scaled by a tie-breaker weight, to the max field score
● N-gram: contiguous sequence of n characters
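The disjunction-max combination can be sketched as follows, using the field scores from the example. The function name `dismax_score` is made up for this illustration; the formula (max score plus a tie-breaker weight times the sum of the remaining field scores) is how Lucene's disjunction-max scoring is generally described, and in Solr the weight corresponds to the `tie` parameter.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: the best field plus `tie` times the other fields."""
    scores = list(field_scores.values())
    if not scores:
        return 0.0
    best = max(scores)
    # tie=0.0 keeps only the max field score; tie=1.0 is a plain sum of all fields.
    return best + tie * (sum(scores) - best)


scores = {"field1": 0.5, "field2": 0.25}
print(dismax_score(scores, tie=0.0))  # 0.5
print(dismax_score(scores, tie=0.5))  # 0.625
print(dismax_score(scores, tie=1.0))  # 0.75
```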
Going Global at Netflix
● Netflix launched globally in January 2016
● 190 countries
● Currently support 23 languages
  ○ Devices
  ○ Messaging
  ○ Content
  ○ Search
Use Case
● Video titles, person names, genre names
● Shorter documents should be ranked higher
● Autocomplete (instant search)
● Ranking function - lexical, popularity and click signals
● Lexical score - recall over precision
Prefix Character N-gram
“Breaking bad”
b - 0
br - 0
bre - 0
brea - 0
break - 0
breaki - 0
breakin - 0
breaking - 0
b - 1
ba - 1
bad - 1
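The listing above pairs each prefix n-gram with the position of the token it came from. A minimal sketch that reproduces it, assuming whitespace tokenization and lowercasing (the helper name `prefix_ngrams` is made up here):

```python
def prefix_ngrams(text):
    """Return (prefix, token_position) pairs for each whitespace-separated token."""
    grams = []
    for position, token in enumerate(text.lower().split()):
        # Emit every leading substring of the token: b, br, bre, ...
        for end in range(1, len(token) + 1):
            grams.append((token[:end], position))
    return grams


for gram, position in prefix_ngrams("Breaking bad"):
    print(f"{gram} - {position}")
```

For “Breaking bad” this yields 11 n-grams: 8 from the first token and 3 from the second.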
Configuration
● Solr
● Edismax: simple syntax, max field score
● Phrase: prevents cross-field search
● N-gram: character n-gram search
Lexical Scoring
● Skewed data distribution
  ○ e.g. one field sparsely populated
● Doc length normalization
● Unigram language model
Scoring With Unigram LM
● Indexed document:
  ○ ‘breaking bad’
  ○ 11 n-gram ‘terms’
● Query
  ○ ‘b’
● Score:
  ○ Term Frequency / N-grams in Doc
  ○ 2 / 11
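The 2 / 11 score can be reproduced with a short sketch. It assumes the unsmoothed form shown on the slide (term frequency divided by the number of n-grams in the document), a single n-gram query, and whitespace tokenization; the function names are made up for illustration, and the second title is invented purely for comparison.

```python
from collections import Counter


def prefix_ngrams(text):
    """All prefix n-grams of every whitespace-separated token, lowercased."""
    return [token[:end]
            for token in text.lower().split()
            for end in range(1, len(token) + 1)]


def unigram_lm_score(query_gram, document):
    """Term frequency of the query n-gram divided by the document's n-gram count."""
    grams = prefix_ngrams(document)
    if not grams:
        return 0.0
    return Counter(grams)[query_gram] / len(grams)


short_doc = unigram_lm_score("b", "breaking bad")         # 2 / 11
long_doc = unigram_lm_score("b", "breaking bad habits")   # 2 / 17, hypothetical longer title
print(short_doc, long_doc)
```

Because the denominator grows with document length, the shorter title scores higher for the same query, which is the length normalization the use case asks for.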
Language Challenges
● Multiple Scripts
  ○ Japanese: Kanji, Hiragana, Katakana, Romaji
● No token delimiters: Japanese, Chinese
● Korean character composition
● Stopwords and autocomplete
● Stemming
Japanese: Multiple Scripts
● ‘南極物語’ (‘Antarctic Story’)
● Tokenizer: 南極 物語
● Reading form: ナンキョク モノガタリ
● Query in Katakana: ナンキョク
● Query in Hiragana: なんきょく
● Transliteration required
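One piece of the transliteration can be sketched directly: Hiragana and Katakana occupy parallel Unicode blocks offset by 0x60, so a Hiragana query can be shifted into Katakana to match the reading form. This sketch covers only that step; deriving the Katakana reading from Kanji needs a dictionary-based analyzer and is not shown. The function name is made up for illustration.

```python
def hiragana_to_katakana(text):
    """Shift Hiragana letters (U+3041..U+3096) to their Katakana counterparts."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


print(hiragana_to_katakana("なんきょく"))  # ナンキョク
```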
Korean: Character Composition
● input jamo ㄱ ㅗㅏ ㅇ
● decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ
● fully composed hangul 광
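The relationship between the decomposed jamo and the composed syllable can be seen with Python's standard `unicodedata` module. This is a sketch of Unicode normalization only: turning keyboard (compatibility) jamo such as ㄱ or ㅇ into the right leading or trailing conjoining jamo takes extra logic that is not attempted here.

```python
import unicodedata

syllable = "광"

# NFD splits the precomposed syllable into leading consonant, vowel, trailing consonant.
decomposed = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+1100', 'U+116A', 'U+11BC']

# NFC puts the conjoining jamo back together into one code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == syllable)  # True
```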
Tokenization Pipelines
● Char Filter: pre-processes input characters
● Tokenizer: breaks data into tokens
● Filters: transform, remove, or create new tokens
Simple Pipeline Example: index
● CharFilters: PatternReplaceCharFilterFactory
  ○ pattern: ([a-z]+)ing
● Tokenizer: StandardTokenizerFactory
● Filters: LowerCaseFilterFactory, EdgeNGramFilterFactory
Simple Pipeline Example: query
● CharFilters: PatternReplaceCharFilterFactory
  ○ pattern: ([a-z]+)ing
● Tokenizer: StandardTokenizerFactory
● Filters: LowerCaseFilterFactory
Simple Pipeline Example
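The two analysis chains can be imitated in plain Python to see what ends up in the index and what a query turns into. This is a rough simulation, not Solr code, and it rests on several assumptions: the replacement for the pattern is taken to be the captured group (the slides show only the pattern), a `\w+` regex stands in for the standard tokenizer, and the n-gram size limits are arbitrary.

```python
import re

ING_PATTERN = re.compile(r"([a-z]+)ing")


def char_filter(text):
    # Assumed replacement: keep the captured stem, drop the "ing" suffix.
    return ING_PATTERN.sub(r"\1", text)


def tokenize(text):
    # Simplified stand-in for a standard tokenizer.
    return re.findall(r"\w+", text)


def edge_ngrams(token, min_gram=1, max_gram=20):
    # Leading substrings of the token, from min_gram up to max_gram characters.
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def analyze_index(text):
    tokens = [t.lower() for t in tokenize(char_filter(text))]
    return [gram for t in tokens for gram in edge_ngrams(t)]


def analyze_query(text):
    return [t.lower() for t in tokenize(char_filter(text))]


index_terms = set(analyze_index("Breaking Bad"))
for query in ("brea", "breaki", "breaking"):
    matched = all(term in index_terms for term in analyze_query(query))
    print(query, analyze_query(query), matched)
```

In this simulation `brea` and `breaking` match, because the full word is reduced to `break` on both sides, while the partially typed `breaki` does not: the suffix is already gone from the index but not yet removable from the query. That kind of asymmetry is worth checking whenever a char filter is applied to autocomplete input.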
Character Mapping Filter Cases
● Prefix Removal
  ○ Arabic ال (alef lam)
● Suffix folding
  ○ Japanese ァ (katakana small a) => ア (a)
● Character decomposition
  ○ Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e)
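The three cases above can be illustrated with a small mapping table. This is a sketch of the idea only and does not reflect the API of the Lucene patch; the names `CHAR_MAP`, `ARABIC_ARTICLE` and `map_characters` are made up, the Korean target uses conjoining jamo code points (an assumption), and the Arabic rule naively strips alef lam at the start of any word.

```python
import re

# One-to-one and one-to-many character replacements.
CHAR_MAP = str.maketrans({
    "\u30a1": "\u30a2",        # katakana small a -> katakana a
    "\u1170": "\u116e\u1166",  # jungseong we -> jungseong u + jungseong e
})

# Alef lam (U+0627 U+0644) at the start of a word.
ARABIC_ARTICLE = re.compile("\\b\u0627\u0644")


def map_characters(text):
    """Strip the Arabic article prefix, then apply the character table."""
    return ARABIC_ARTICLE.sub("", text).translate(CHAR_MAP)


print(map_characters("\u30d5\u30a1"))  # katakana fu + small a becomes fu + a
```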
Character Mapping Filter Cases
● Stemmer implementation, or extension
  ○ Character mapper reference implementation of the Russian stemmer
● Patch to Lucene, can be used as stand-alone
  ○ LUCENE-7321
Query Testing Framework
● Open source project
● Solr or Elasticsearch
● Google Spreadsheets based UI
● Unit tests for language-specific queries
● Regression testing after changes, upgrades
● 20K queries
● 7K titles
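The idea of unit-testing queries against expected titles can be sketched in a few lines. This is not the framework's code or API: `toy_search`, `failed_queries`, `TITLES` and the test cases are all hypothetical stand-ins, with an in-memory prefix match playing the role of the search engine.

```python
TITLES = ["Breaking Bad", "南極物語"]


def toy_search(query):
    """Hypothetical stand-in for a search backend: prefix match on any word."""
    q = query.lower()
    return [t for t in TITLES if any(w.startswith(q) for w in t.lower().split())]


def failed_queries(test_cases, search, top_n=10):
    """Return the (query, expected_title) pairs whose title is missing from the top results."""
    failures = []
    for query, expected_title in test_cases:
        if expected_title not in search(query)[:top_n]:
            failures.append((query, expected_title))
    return failures


cases = [
    ("bre", "Breaking Bad"),
    ("ba", "Breaking Bad"),
    ("ナンキョク", "南極物語"),  # Katakana reading of a Kanji title
]
print(failed_queries(cases, toy_search))
```

Here the Katakana query fails because the toy backend has no reading-form support, which is the kind of language-specific gap such tests are meant to surface. Running the same cases before and after a configuration change and comparing the two failure lists gives a simple regression check.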
Google Spreadsheets as Input
Google Spreadsheets as Detail Report
Google Spreadsheets as Detail Report Diff
Google Spreadsheets as Summary Report
Google Spreadsheets as Summary Report Diff
Summary
● Use case: short fields, autocomplete, P/R
● Configuration, scoring
● Language challenges
● Character Mapper patch (LUCENE-7321)
● Query testing framework
https://github.com/Netflix/q
References
● Query testing framework
● Chris Manning IR Book, LM Chapter
● Trey Grainger’s presentation on Semantic & Multilingual Strategies in Lucene/Solr
● Character Mapping Patch and Documentation
● Java Internationalization, March 25, 2001, by David Czarnecki, Andy Deitsch