
Netflix: Autocomplete Multi-Language Search Using Solr

Ivan Provalov, Sr. Software Engineer

Overview

● Use case
● Configuration, scoring
● Language challenges
● Character mapper
● Query testing framework

Terminology

● Solr - open source search platform
● Lucene - open source search engine
● Indexing - parsing and storing the data
● Token - the smallest piece of parsed data, e.g. a word
● Document frequency - scoring component, the number of documents containing the term
● Term frequency - scoring component, the number of times the term occurs in a specific document
● Document - a record in the database, consisting of fields

Terminology

● Precision: the fraction of retrieved documents that are relevant
● Recall: the fraction of relevant documents that are retrieved
● Example: 100 documents in the collection are relevant to query X. 200 documents are retrieved in total, and only 50 of them are relevant to query X.
○ Precision: 50/200 = 0.25
○ Recall: 50/100 = 0.50
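The precision and recall example above can be checked with a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative and do not come from the talk.

```python
def precision_recall(retrieved: int, relevant_retrieved: int, relevant_total: int) -> tuple[float, float]:
    """Return (precision, recall) from raw document counts."""
    precision = relevant_retrieved / retrieved      # share of retrieved docs that are relevant
    recall = relevant_retrieved / relevant_total    # share of relevant docs that were retrieved
    return precision, recall


# Numbers from the slide: 100 relevant documents, 200 retrieved, 50 of those relevant.
precision, recall = precision_recall(retrieved=200, relevant_retrieved=50, relevant_total=100)
print(precision, recall)  # 0.25 0.5
```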

Terminology

● EDismax: extended disjunction max query
○ Query: ‘ab’
○ Document: {“field1”:“ab”, “field2”:“abcd”}
○ Scores: field1=0.5, field2=0.25
○ EDismax lets you choose between taking only the max field score and adding a weighted sum of the other field scores to the max field score
● N-gram: contiguous sequence of n characters
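As a rough illustration of how a disjunction max query combines per-field scores, the sketch below uses the field scores from the EDismax example. It assumes the usual DisMax formula, max score plus `tie` times the sum of the remaining field scores; the function name is hypothetical and this is not Solr code.

```python
def dismax_score(field_scores: dict[str, float], tie: float = 0.0) -> float:
    """Combine per-field scores: best field plus a tie-weighted sum of the others."""
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


scores = {"field1": 0.5, "field2": 0.25}  # scores from the slide
print(dismax_score(scores, tie=0.0))  # 0.5   (pure max)
print(dismax_score(scores, tie=1.0))  # 0.75  (plain sum of all fields)
print(dismax_score(scores, tie=0.5))  # 0.625 (in between)
```

With `tie=0.0` only the best-matching field counts, which is the "max field score" setting listed later under Configuration.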

Going Global at Netflix

● Netflix launched globally in January 2016
● 190 countries
● Currently supports 23 languages
○ Devices
○ Messaging
○ Content
○ Search

Use Case

● Video titles, person names, genre names
● Shorter documents should be ranked higher
● Autocomplete (instant search)
● Ranking function - lexical, popularity and click signals
● Lexical score - recall over precision

Prefix Character N-gram

“Breaking bad” (n-gram - token position)

b - 0
br - 0
bre - 0
brea - 0
break - 0
breaki - 0
breakin - 0
breaking - 0
b - 1
ba - 1
bad - 1
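The prefix n-gram listing for “Breaking bad” can be reproduced with a short sketch. It assumes whitespace tokenization and lowercasing, and the function name is illustrative; in Solr this job is done by an edge n-gram filter, not by this code.

```python
def prefix_ngrams(text: str) -> list[tuple[str, int]]:
    """Return (prefix n-gram, token position) pairs for every token in the text."""
    result = []
    for position, token in enumerate(text.lower().split()):
        for end in range(1, len(token) + 1):
            result.append((token[:end], position))
    return result


grams = prefix_ngrams("Breaking bad")
for gram, position in grams:
    print(f"{gram} - {position}")
print(len(grams))  # 11
```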

Configuration

● Solr
● Edismax: simple syntax, max field score
● Phrase: prevents cross-field matches
● N-gram: character n-gram search

Lexical Scoring

● Skewed data distribution
○ e.g. one field sparsely populated
● Doc length normalization
● Unigram language model

Scoring With Unigram LM

● Indexed document:
○ ‘breaking bad’
○ 11 n-gram ‘terms’
● Query:
○ ‘b’
● Score:
○ Term Frequency / N-grams in Doc
○ 2 / 11
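The 2 / 11 score can be reproduced with a small sketch of the unigram language model estimate, term frequency divided by the number of n-gram terms in the document. The helper names are illustrative, and smoothing, which a production language model would normally add, is left out.

```python
from collections import Counter
from fractions import Fraction


def prefix_ngrams(text: str) -> list[str]:
    """Prefix character n-grams of every whitespace-separated token."""
    return [token[:end] for token in text.lower().split() for end in range(1, len(token) + 1)]


def unigram_lm_score(query_term: str, document: str) -> Fraction:
    """Term frequency of the query term divided by the number of n-gram terms in the document."""
    terms = prefix_ngrams(document)
    return Fraction(Counter(terms)[query_term], len(terms))


score = unigram_lm_score("b", "breaking bad")
print(score)  # 2/11
```

Because the denominator grows with document length, a shorter title containing the same prefix scores higher, which matches the requirement that shorter documents rank higher.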

Language Challenges

● Multiple scripts
○ Japanese: Kanji, Hiragana, Katakana, Romaji
● No token delimiters: Japanese, Chinese
● Korean character composition
● Stopwords and autocomplete
● Stemming

Japanese: Multiple Scripts

● ‘南極物語’ (‘Antarctic Story’)
● Tokenizer: 南極物語
● Reading form: ナンキョクモノガタリ
● Query in Katakana: ナンキョク
● Query in Hiragana: なんきょく
● Transliteration required
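One piece of the transliteration step can be sketched in Python: folding a Hiragana query into Katakana so that it can be compared with the Katakana reading form. The sketch relies on the fixed 0x60 code point offset between the two Unicode blocks. Deriving the reading form from the Kanji title needs a morphological analyzer and is not shown; the reading below is copied from the slide.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60                             # distance from a Hiragana letter to its Katakana twin


def hiragana_to_katakana(text: str) -> str:
    """Map Hiragana letters to Katakana and leave every other character unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


reading = "ナンキョクモノガタリ"  # reading form of 南極物語, taken from the slide
for query in ("ナンキョク", "なんきょく"):
    normalized = hiragana_to_katakana(query)
    print(query, reading.startswith(normalized))  # both queries print True
```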

Korean: Character Composition

● input jamo ㄱ ㅗㅏ ㅇ
● decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ
● fully composed hangul 광
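Python's standard `unicodedata` module can illustrate the composed and decomposed forms. This sketch only covers Unicode normalization (NFD, NFC, NFKD); merging separately typed vowels such as ㅗ and ㅏ into ᅪ is not part of normalization and is not shown.

```python
import unicodedata

syllable = "\uad11"  # 광, fully composed hangul

# NFD splits the syllable into leading consonant, vowel and trailing consonant jamo.
decomposed = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+1100', 'U+116A', 'U+11BC']

# NFC composes the jamo sequence back into a single syllable.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == syllable)  # True

# The keyboard (compatibility) jamo ㄱ maps to the leading consonant under NFKD,
# so a one-keystroke query is a prefix of the decomposed title.
typed = unicodedata.normalize("NFKD", "\u3131")
print(decomposed.startswith(typed))  # True
```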

Tokenization Pipelines

● Char Filter: pre-processes input characters
● Tokenizer: breaks data into tokens
● Filters: transform, remove, or create new tokens

Simple Pipeline Example: index

● CharFilters: PatternReplaceCharFilterFactory
○ pattern: ([a-z]+)ing
● Tokenizer: StandardTokenizerFactory
● Filters: LowerCaseFilterFactory, EdgeNGramFilterFactory

Simple Pipeline Example: query

● CharFilters: PatternReplaceCharFilterFactory
○ pattern: ([a-z]+)ing
● Tokenizer: StandardTokenizerFactory
● Filters: LowerCaseFilterFactory

Simple Pipeline Example
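The index and query pipelines can be imitated in plain Python to see how they fit together. This is only a sketch: the slides give the pattern but not the replacement, so replacing with the first capture group is an assumption, and a simple word regex stands in for the standard tokenizer.

```python
import re

ING_PATTERN = re.compile(r"([a-z]+)ing")  # pattern from the slide


def char_filter(text: str) -> str:
    # Assumption: the replacement keeps only the captured stem ("$1" in Solr syntax).
    return ING_PATTERN.sub(r"\1", text)


def tokenize(text: str) -> list[str]:
    # Stand-in for the standard tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


def lowercase(tokens: list[str]) -> list[str]:
    return [token.lower() for token in tokens]


def edge_ngrams(tokens: list[str]) -> list[str]:
    return [token[:end] for token in tokens for end in range(1, len(token) + 1)]


def index_analyzer(text: str) -> list[str]:
    return edge_ngrams(lowercase(tokenize(char_filter(text))))


def query_analyzer(text: str) -> list[str]:
    return lowercase(tokenize(char_filter(text)))


indexed = set(index_analyzer("Breaking Bad"))
print(sorted(indexed))             # ['b', 'ba', 'bad', 'br', 'bre', 'brea', 'break']
print(query_analyzer("breaking"))  # ['break']
print(all(term in indexed for term in query_analyzer("breaking")))  # True
```

Edge n-grams are produced only at index time. The query side keeps whole tokens, so each typed prefix is looked up directly among the indexed n-grams.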

Character Mapping Filter Cases

● Prefix removal
○ Arabic ال (alef lam)
● Suffix folding
○ Japanese ァ (katakana small a) => ア (a)
● Character decomposition
○ Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e)
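The three mapping cases can be sketched as plain string operations. This is a hypothetical illustration of the idea, not the character mapper from the Lucene patch; the table and function names are invented for the example.

```python
ARABIC_ARTICLE = "\u0627\u0644"               # ال (alef lam)
SMALL_KATAKANA = {"\u30a1": "\u30a2"}         # ァ (small a) -> ア (a)
KOREAN_VOWELS = {"\u1170": "\u315c\u3154"}    # jungseong we -> ㅜ (u) + ㅔ (e)


def map_token(token: str) -> str:
    """Apply prefix removal, suffix folding and character decomposition to one token."""
    # Prefix removal: drop a leading Arabic definite article.
    if token.startswith(ARABIC_ARTICLE) and len(token) > len(ARABIC_ARTICLE):
        token = token[len(ARABIC_ARTICLE):]
    # Suffix folding: replace a trailing small kana with its full-size form.
    if token and token[-1] in SMALL_KATAKANA:
        token = token[:-1] + SMALL_KATAKANA[token[-1]]
    # Character decomposition: expand compound vowels wherever they occur.
    return "".join(KOREAN_VOWELS.get(ch, ch) for ch in token)


print(map_token("\u0627\u0644\u0643\u062a\u0627\u0628") == "\u0643\u062a\u0627\u0628")  # True
print(map_token("ファ"))  # フア
print(map_token("\u1170") == "\u315c\u3154")  # True
```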

Character Mapping Filter Cases

● Stemmer implementation or extension
○ Character mapper reference implementation of the Russian stemmer
● Patch to Lucene, can be used stand-alone
○ LUCENE-7321

Query Testing Framework

● Open source project
● Solr or Elasticsearch
● Google Spreadsheets based UI
● Unit tests for language queries
● Regression testing after changes, upgrades
● 20K queries
● 7K titles

Google Spreadsheets as Input

Google Spreadsheets as Detail Report

Google Spreadsheets as Detail Report: Diff

Google Spreadsheets as Summary Report

Google Spreadsheets as Summary Report: Diff
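The idea behind the detail and diff reports can be sketched generically: compute precision and recall per test query, then compare two runs to spot regressions. Everything below is hypothetical (function names, query strings, title IDs) and does not reflect the framework's actual API or data.

```python
def precision_recall(expected: set[str], returned: list[str]) -> tuple[float, float]:
    """Precision and recall of one query's results against its expected titles."""
    hits = len(expected & set(returned))
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


def diff_runs(before: dict, after: dict) -> dict:
    """Queries whose (precision, recall) changed between two runs."""
    return {q: (before[q], after[q]) for q in before if q in after and before[q] != after[q]}


# Made-up test data: expected titles per query and the results of two search configurations.
expected = {"query-a": {"title-1"}, "query-b": {"title-2", "title-3"}}
run_before = {"query-a": ["title-1"], "query-b": ["title-2", "title-3"]}
run_after = {"query-a": ["title-1"], "query-b": ["title-2", "title-9"]}

before = {q: precision_recall(expected[q], run_before[q]) for q in expected}
after = {q: precision_recall(expected[q], run_after[q]) for q in expected}
print(diff_runs(before, after))  # {'query-b': ((1.0, 1.0), (0.5, 0.5))}
```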

Summary

● Use case: short fields, autocomplete, P/R
● Configuration, scoring
● Language challenges
● Character Mapper patch (LUCENE-7321)
● Query testing framework: https://github.com/Netflix/q

References

● Query testing framework: http://techblog.netflix.com/2016/07/global-languages-support-at-netflix.html
● Chris Manning IR Book, LM Chapter: http://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
● Trey Grainger’s presentation on Semantic & Multilingual Strategies in Lucene/Solr: http://www.slideshare.net/treygrainger/semantic-multilingual-strategies-in-lucenesolr
● Character Mapping Patch and Documentation: https://issues.apache.org/jira/browse/LUCENE-7321
● Java Internationalization, March 25, 2001, by David Czarnecki, Andy Deitsch: http://www.amazon.com/Java-Internationalization-Series-David-Czarnecki/dp/0596000197
