Optimizing Multilingual Search: Presented by David Troiano, Basis Technology


Upload: lucidworks

Post on 07-Jul-2015


DESCRIPTION

Presented at Lucene/Solr Revolution 2014

TRANSCRIPT

Page 1: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology
Page 2: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Optimizing Multilingual Search David Troiano Principal Software Engineer, Basis Technology [email protected]

Page 3: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Talk Overview •  The problem we’re trying to solve •  Natural language processing (NLP) •  Approaches to multilingual search in Solr

Page 4: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

A Multilingual Search Example

Page 5: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

The Goal Build a search engine where: •  Document corpus spans multiple languages

•  Potentially mixed language documents

•  Queries within a language, or potentially spanning multiple languages

Page 6: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

NLP Meets Search (Querying)

query: “clinton speaking” → NLP pipeline → Terms: clinton, speak → Inverted Index

Inverted Index
term        document IDs
...         ...
clinton     …, 123, ...
...         ...
speak       …, 123, ...

Page 7: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

NLP Meets Search (Indexing)

Document 123: “Bill Clinton spoke about ...” → NLP pipeline → Terms: bill, clinton, speak, about → Inverted Index

Inverted Index
term        document IDs
...         ...
clinton     …, 123, ...
...         ...
speak       …, 123, ...

Page 8: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

NLP Meets Search

Indexing: Document 123: “Bill Clinton spoke about ...” → NLP pipeline → Terms: bill, clinton, speak, about → Inverted Index

Querying: query: “clinton speaking” → NLP pipeline → Terms: clinton, speak → Inverted Index

Inverted Index
term        document IDs
...         ...
clinton     …, 123, ...
...         ...
speak       …, 123, ...
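The indexing and querying flow in this diagram can be sketched in a few lines of Python. This is a toy illustration only: the `LEMMAS` table and the `analyze`, `build_index`, and `search` names are hypothetical stand-ins for a real NLP pipeline and for Solr's inverted index, and the document text is truncated exactly as on the slide.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real morphological analyzer.
LEMMAS = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}


def analyze(text):
    """Toy NLP pipeline: lowercase, split on whitespace, map words to lemmas."""
    return [LEMMAS.get(token, token) for token in text.lower().split()]


def build_index(docs):
    """Map each normalized term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every normalized query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {123: "Bill Clinton spoke about"}
index = build_index(docs)
result = search(index, "clinton speaking")
print(result)
```

The key point is that the same `analyze` function runs at index time and at query time, so “spoke” in the document and “speaking” in the query both reduce to “speak” and meet in the index.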

Page 9: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

The NLP Pipeline

•  Language Detection •  Tokenization •  Decompounding •  Word Form Normalization

Page 10: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Language Detection •  Often required when indexing

•  Typically not used at query time •  Lower accuracy on short strings •  Sometimes unsolvable even to humans, e.g., named entities •  End user applications often know query language upstream of search engine •  No readily available plugin pattern in Solr

 

Page 11: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Tokenization •  Breaking text into words •  Particularly difficult with CJK languages

•  Find  the  words:  帰国後ハーバード大学に入学を認められていたもの  

Page 12: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Decompounding •  Breaking compound words into subcomponents •  Common in German, Dutch, Korean

•  Samstagmorgen → Samstag, morgen
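A minimal sketch of dictionary-based decompounding, assuming a tiny hypothetical lexicon (`LEXICON`) that contains just the two parts from the slide's example. Real decompounders rely on large dictionaries and also handle linking elements between parts, which this sketch ignores.

```python
# Hypothetical mini-lexicon; a real decompounder would use a full dictionary.
LEXICON = {"samstag", "morgen"}


def decompound(word, lexicon=LEXICON):
    """Split a word into lexicon entries, or return it whole if no split exists."""

    def split(rest):
        if not rest:
            return []
        # Try the longest candidate prefix first, backtracking on failure.
        for end in range(len(rest), 0, -1):
            prefix = rest[:end]
            if prefix in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None

    lowered = word.lower()
    parts = split(lowered)
    return parts if parts else [lowered]


print(decompound("Samstagmorgen"))
print(decompound("Haus"))
```

Indexing the subcomponents alongside (or instead of) the compound is what lets a query for one part match documents that only contain the compound.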

Page 13: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Word Form Normalization •  Reduce word form variations to a canonical representation •  Critical for recall •  Two approaches

•  Stemming •  Lemmatization

Page 14: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Normalization: Stemming •  Simple rules-based approach •  “Chop off the end”

•  arsenal, arsenic → arsen
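The “chop off the end” idea can be illustrated with a deliberately naive suffix stripper. The two-suffix rule set below is hypothetical, chosen only to reproduce the slide's example; it is not the algorithm used by any particular Solr stemmer.

```python
# Hypothetical suffix rules, picked only to mirror the slide's example.
SUFFIXES = ("al", "ic")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three leading characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


print(naive_stem("arsenal"))
print(naive_stem("arsenic"))
```

Both unrelated words collapse to the same stem, so a search for one would also match the other. That conflation is the precision cost of rules that look only at word endings.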

Page 15: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Normalization: Lemmatization •  Map words to their dictionary form via morphological analysis •  spoke, speaks, speaking → speak •  Higher precision and recall compared to stemming
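As a rough illustration, lemmatization can be pictured as a lookup from surface forms to dictionary forms. The `LEMMAS` table below is a hypothetical stand-in for a real morphological analyzer, which would cover a whole language rather than three entries.

```python
# Hypothetical lemma table standing in for a morphological analyzer.
LEMMAS = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}


def lemmatize(word):
    """Return the dictionary form of a word, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


for form in ("spoke", "speaks", "speaking", "arsenal", "arsenic"):
    print(form, "->", lemmatize(form))
```

The irregular form “spoke” maps to “speak”, which suffix stripping alone cannot do (a recall gain), while “arsenal” and “arsenic” remain distinct terms (a precision gain).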

Page 16: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

NLP Meets Search

Solr

Indexing: Document 123: “Bill Clinton spoke about ...” → NLP pipeline → Terms: bill, clinton, speak, about → Inverted Index

Querying: query: “clinton speaking” → NLP pipeline → Terms: clinton, speak → Inverted Index

Inverted Index
term        document IDs
...         ...
clinton     …, 123, ...
...         ...
speak       …, 123, ...

Page 17: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

NLP Within Solr •  Maximal precision / recall requires NLP pipeline per language •  NLP pipeline (mostly) specified within Solr field type •  Index / query strategies in Solr

•  Field per language •  Core per language •  A new approach: Single multilingual field

Page 18: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Field Per Language

schema.xml

<field name="content_cjk" type="text_cjk" indexed="true" stored="true" />
<field name="content_eng" type="text_eng" indexed="true" stored="true" />
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

query

http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng

Page 19: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Field Per Language http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng q=serie%20a

Page 20: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Field Per Language http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng defType=edismax

Page 21: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Field Per Language http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng qf=content_cjk%20content_eng
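The three query parameters highlighted on these slides (`q`, `defType`, `qf`) can be assembled programmatically. The sketch below assumes the `content_<lang>` field naming from the schema.xml slide; the host `localhost:8983` and the function name `field_per_language_url` are hypothetical, since the slides leave the Solr URL unspecified.

```python
from urllib.parse import quote, urlencode


def field_per_language_url(base_url, query, languages):
    """Build an edismax query that searches one content field per language."""
    # Field naming (content_<lang>) follows the schema.xml slide.
    qf = " ".join(f"content_{lang}" for lang in languages)
    params = {"q": query, "defType": "edismax", "qf": qf}
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{base_url}?{urlencode(params, quote_via=quote)}"


# Hypothetical host; the slides do not name one.
url = field_per_language_url(
    "http://localhost:8983/solr/articles/select", "serie a", ["cjk", "eng"]
)
print(url)
```

Restricting a query to a subset of languages then amounts to passing a shorter `languages` list, which shrinks `qf` accordingly.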

Page 22: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Core Per Language

CJK core’s schema.xml

<field name="content" type="text_cjk" indexed="true" stored="true" multiValued="true"/>
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

query

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng

Page 23: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Core Per Language

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng q=content:serie%20a

Page 24: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Core Per Language

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng shards=<url>/articles_cjk,<url>/articles_eng
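For the core-per-language layout, the `shards` parameter lists the per-language cores to fan the query out to. A minimal sketch of building such a request follows; the shard addresses and the function name `core_per_language_url` are hypothetical, because the slides elide the actual URLs.

```python
from urllib.parse import quote, urlencode


def core_per_language_url(base_url, query, shard_addresses):
    """Build a distributed query over one core per language."""
    params = {
        # Every core uses the same field name, "content", per the slide.
        "q": f"content:{query}",
        "shards": ",".join(shard_addresses),
    }
    # Keep "/", ":" and "," unescaped so the URL resembles the slide's form.
    return f"{base_url}?{urlencode(params, quote_via=quote, safe='/:,')}"


# Hypothetical shard addresses; the slides show only "<url>".
shards = [
    "localhost:8983/solr/articles_cjk",
    "localhost:8983/solr/articles_eng",
]
url = core_per_language_url(
    "http://localhost:8983/solr/articles_eng/select", "serie a", shards
)
print(url)
```

Querying fewer languages means listing fewer shards, so the cost of a request scales with the number of cores it actually touches.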

Page 25: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Approach Comparison

Field  Per  Language   Core  Per  Language  

Simplicity    

Speed    

✔  

✔  

Page 26: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Approach Comparison: Query Latency Experimental Setup •  Corpus: Wikipedia across 9 languages (9 million articles) •  Queries: 1000 most frequently used terms for each language, randomized •  JMeter running 1 hour for each of 6 test runs

[Chart: average latency (ms; axis from 0 to 160) by number of languages queried (1, 4, 9), for field per lang and core per lang]

Page 27: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

An Alternative Approach All languages in a single field •  Requires a custom meta field type that applies per-language concrete field type(s) •  Patch submitted to Solr

cf. Solr In Action / Trey Grainger https://github.com/treygrainger/solr-in-action

Page 28: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

An Alternative Approach

query: “[en, es]clinton speaking” → Inspect [en, es], apply English and Spanish field types to “clinton speaking”, merge results → Terms: clinton, speak → Inverted Index

Inverted Index
term        document IDs
...         ...
clinton     …, 123, ...
...         ...
speak       …, 123, ...
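The query-side behavior of the single multilingual field can be sketched as: read the language prefix, run the remaining text through each listed language's analyzer, and merge the resulting terms. Everything below is a hypothetical illustration of that idea, not the code of the submitted Solr patch. The two analyzers are toy placeholders, and the Spanish one performs no normalization at all.

```python
import re

# Hypothetical English lemma table; a real field type would do far more.
EN_LEMMAS = {"speaking": "speak", "spoke": "speak", "speaks": "speak"}


def analyze_en(text):
    """Toy English analyzer: lowercase, split, lemmatize via lookup."""
    return [EN_LEMMAS.get(token, token) for token in text.lower().split()]


def analyze_es(text):
    """Toy Spanish analyzer: lowercase and split only (placeholder)."""
    return text.lower().split()


ANALYZERS = {"en": analyze_en, "es": analyze_es}
PREFIX = re.compile(r"^\[([^\]]*)\]\s*(.*)$", re.DOTALL)


def analyze_multilingual(query):
    """Parse a "[lang, ...]text" query and merge terms from each analyzer."""
    match = PREFIX.match(query)
    if not match:
        raise ValueError("expected a [lang, ...] prefix")
    langs = [code.strip() for code in match.group(1).split(",") if code.strip()]
    terms = []
    for lang in langs:
        for term in ANALYZERS[lang](match.group(2)):
            if term not in terms:  # merge while preserving first-seen order
                terms.append(term)
    return langs, terms


langs, terms = analyze_multilingual("[en, es]clinton speaking")
print(langs, terms)
```

In this toy version the merged terms include the unnormalized “speaking” contributed by the placeholder Spanish analyzer in addition to “clinton” and “speak”; how a real implementation merges or deduplicates per-language output is a design decision this sketch does not settle.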

Page 29: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

An Alternative Approach •  Results scoring potentially worse than other approaches •  IDF thrown off with single field

•  e.g., soy common in Spanish, relatively rare in English •  Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
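The IDF distortion can be made concrete with a small calculation. The document counts below are invented purely for illustration, and the formula is the textbook ln(N / df) rather than the exact variant any particular Lucene similarity uses.

```python
import math


def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency: ln(N / df)."""
    return math.log(total_docs / docs_with_term)


# Hypothetical counts, invented for illustration only.
english = {"docs": 1000, "soy": 10}   # "soy" is rare in English recipes
spanish = {"docs": 1000, "soy": 400}  # "soy" is common in Spanish text

idf_english_field = idf(english["docs"], english["soy"])
idf_single_field = idf(
    english["docs"] + spanish["docs"], english["soy"] + spanish["soy"]
)

print(f"English-only field:        {idf_english_field:.2f}")
print(f"Single multilingual field: {idf_single_field:.2f}")
```

Under these assumed counts, “soy” looks far less distinctive in the combined field than in an English-only field, so it contributes less to the score of English soy recipes than it would under the per-language approaches.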

Page 30: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Enhancing NLP Pipeline Limitations of NLP in Solr out of the box •  Poor precision / performance of CJK tokenization •  Poor precision / recall of stemmers (no lemmatizers) •  Poor recall due to lack of decompounding

Rosette to the rescue!

Page 31: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

CJK Tokenization ケネディはマサチューセッツ •  Rosette: ケネディ, は, マサチューセッツ •  Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ, セッ, ッツ

•  How does this impact precision, recall, index size, speed?
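The bigram list on this slide can be reproduced with plain overlapping character pairs. This is a naive sketch of the idea only, under the assumption that every character is bigrammed with its neighbor; it is not a reimplementation of Solr's `CJKBigramFilterFactory`, whose behavior depends on the tokenizer in front of it and on its configuration.

```python
def char_bigrams(text):
    """Return every overlapping pair of adjacent characters in the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


phrase = "ケネディはマサチューセッツ"
bigrams = char_bigrams(phrase)
print(len(phrase), "characters ->", len(bigrams), "bigrams")
print(", ".join(bigrams))
```

A string of n characters always yields n - 1 bigrams, compared with three tokens from word-level segmentation in the slide's example. That difference is one way to start answering the slide's question about index size, and most bigrams do not correspond to real words, which bears on precision.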

Page 32: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Rosette In Solr

<fieldType name="text_zho" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" rootDirectory="<rootDir>" language="zho" />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" rootDirectory="<rootDir>" language="zho" />
  </analyzer>
</fieldType>

cf. http://www.basistech.com/search-essentials/

Page 33: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Wrapping Up •  Multilingual search is everywhere •  Solr as your multilingual search platform •  Search quality hinges on quality of NLP tools

Page 34: Optimizing Multilingual Search: Presented by David Troiano, Basis Technology

Optimizing Multilingual Search David Troiano Principal Software Engineer, Basis Technology [email protected]