multilingual search and text analytics with solr - open source search conference

Basis Technology – Open Source Search Conference 2012 1 Multilingual Search and Text Analytics with Solr Steve Kearns Director of Product Management Basis Technology


Post on 21-Jun-2015


DESCRIPTION

This talk will explore the challenges of multilingual search, including language-specific issues (N-gram segmentation vs. morphological analysis, stemming vs. lemmatization, and language identification) and the various approaches to configuring your Solr schema. We will also discuss integration strategies for common text analytics capabilities and the impact of multilingual content on application design. Solr is a powerful search engine that has rapidly gained acceptance as an alternative to commercial search solutions for many applications. Organizations need many features to serve their diverse communities; among these is the ability to deliver excellent search in foreign languages. Delivering quality multilingual search involves careful design of schemas and selection of the best linguistic approach for each supported language.

TRANSCRIPT

Page 1: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Basis Technology – Open Source Search Conference 2012 1

Multilingual Search and Text Analytics with Solr Steve Kearns

Director of Product Management

Basis Technology

Page 2: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Agenda

• Why is Language Important?
• Approaches for language-aware search
• Solr Configuration Options

Page 3: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language is Important

Page 4: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Why is language important?

• Content is produced and consumed in the native language

• Document collections often contain more than one language

• Each language is unique, and presents different challenges to the search engine

Page 5: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language is Complex

• Tokenization
  • Some languages do not use spaces
  • Compound words combine two or more words
  • Conjunctions

• Inflection
  • In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.

http://en.wikipedia.org/wiki/Inflection

Page 6: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language is Complex

http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png

Page 7: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language is Complex!

• The Spanish word “pasaportar” has more than 50 inflected forms:

pasaportando pasaportes pasaportada pasaportaba pasaportarían pasaportarais pasaportasen pasaportaren pasaportado pasaportaremos pasaportábamos pasaportases pasaportaríais pasaportaran pasaportarías pasaportaras pasaportarás

pasaportareis pasaportaron pasaportase pasaportemos pasaportaría pasaportara pasaportasteis pasaportáramos pasaportaban pasaportásemos pasaportamos pasaporten pasaportaréis pasaportabas pasaportaríamos pasaportáremos pasaporto

pasaportarán pasaporte pasaportan pasaporta pasaportaste pasaportad pasaportéis pasaportadas pasaporté pasaportados pasaportaré pasaportare pasaportará pasaportó pasaportabais pasaportaseis …

http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar

Page 8: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language Examples

• English:
  spoke (Noun – wheel part) → spoke
  spoke (Verb, past tense) → speak

• French:
  été (summer) → été (summer)
  été (was) → être (to be)

• German:
  Robbe (seal) → Robbe (seal)
  robbe (I crawl) → robben (to crawl)
  Samstagmorgen (Saturday Morning) → Samstag, Morgen (compound)

• Japanese:
  • 首脳会談後、オバマ大統領は記者団の質問に答える予定
    – Where are the words??

Page 9: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language-Aware Search Technology

• Rosette Linguistic Platform
  • Language Identification
  • Tokenization
    » Morphological
  • Token processing
    » Lemmatization
  • Higher level analytics
    » Entity Extraction
    » Relationship Extraction
  • Entity Translation and Entity Search

Page 10: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language Identification

• Find a single dominant language in a document
• Find multiple languages in a single document
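The two tasks on this slide can be illustrated with a deliberately small sketch. It assumes a stopword-counting heuristic, which is only one of several possible approaches and not necessarily what any production identifier uses. The stopword lists and all function names here are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Tiny hand-picked stopword lists; illustrative only, not a real language model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "es", "de", "en", "los"},
    "de": {"der", "die", "und", "ist", "das", "nach", "ich"},
}

def score_languages(text):
    """Count stopword hits per language for the given text."""
    words = re.findall(r"\w+", text.lower())
    scores = Counter()
    for lang, stops in STOPWORDS.items():
        scores[lang] = sum(1 for w in words if w in stops)
    return scores

def dominant_language(text):
    """Return the best-scoring language, or None if nothing matched."""
    lang, hits = score_languages(text).most_common(1)[0]
    return lang if hits > 0 else None

def languages_by_sentence(text):
    """Identify a language per sentence to surface multilingual documents."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [dominant_language(s) for s in sentences]
```

Calling `dominant_language` on a whole document addresses the first bullet; `languages_by_sentence` is a rough way to approach the second. Short texts and closely related languages would need a much stronger model than this.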

Page 11: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Tokenization

• Morphological Analysis vs. N-gram
• Search Term: 東京 ルパン上映時間

• N-gram:

• Morphological Analysis:
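The token output for both approaches did not survive in this transcript. As a rough illustration of the N-gram side only, the following sketch assumes overlapping character bigrams computed per whitespace-separated chunk. The function name `char_ngrams` is hypothetical, and the output is not claimed to match the original slide.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk."""
    grams = []
    for chunk in text.split():
        if len(chunk) <= n:
            # Chunks no longer than n are kept whole.
            grams.append(chunk)
        else:
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams

query = "東京 ルパン上映時間"
print(char_ngrams(query))
```

Bigrams are produced without any dictionary, so some of them (for example the one spanning the end of ルパン and the start of 上映) do not correspond to words. A morphological analyzer aims for word-level tokens instead, at the cost of needing language-specific resources.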

Page 12: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Token Processing

• Stemming vs. Lemmatization
• English: “I have spoken at several conferences”
• Stemming:

• Lemmatization:

Page 13: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Stemming vs. Lemmatization

• Two words with the same spelling, but different meanings create the same stem.

Stemming
prensa (media) → prens
prensa (he/she presses) → prens
INCORRECT

Lemmatization
prensa (media) → prensa (media)
prensa (he/she presses) → prensar (to press)
CORRECT

Page 14: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Stemming vs. Lemmatization

• Two different words create the same stem.

Stemming
publicaciones (publications) → public
publico (public) → public
INCORRECT

Lemmatization
publicaciones (publications) → publicación
publico (public) → public (public)
CORRECT
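The contrast in these two slides can be reproduced with a toy sketch. It assumes a three-entry suffix list and a tiny hand-written lemma dictionary keyed by part of speech. Neither reflects a real stemmer (such as Snowball) or a real morphological lexicon, and all names are hypothetical.

```python
# Toy suffix list, longest first; a real stemmer has far more rules.
SUFFIXES = ["aciones", "a", "o"]

def toy_stem(word):
    """Strip the first matching suffix, ignoring part of speech and meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("prensa", "NOUN"): "prensa",
    ("prensa", "VERB"): "prensar",
    ("publicaciones", "NOUN"): "publicación",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())

# The stemmer cannot tell the noun from the verb; the lemmatizer can.
print(toy_stem("prensa"), toy_lemmatize("prensa", "NOUN"), toy_lemmatize("prensa", "VERB"))
# Two different words collapse to one stem.
print(toy_stem("publicaciones"), toy_stem("publico"))
```

The key design difference is the input: the stemmer sees only the string, while the lemmatizer also needs the part of speech, which in practice comes from analyzing the word in context.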

Page 15: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Token Processing

German: “Am Samstagmorgen fliege ich zurueck nach Boston.”

• Stemming:

• Lemmatization (and decompounding!):
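The decompounding step mentioned here can be sketched as a dictionary-driven split. This is a minimal illustration that assumes a two-word lexicon and a simple left-to-right search. The lexicon and the function name `decompound` are hypothetical, and a real decompounder would also handle linking elements and ambiguity.

```python
# Hypothetical mini-lexicon; a real decompounder relies on a full dictionary.
LEXICON = {"samstag": "Samstag", "morgen": "Morgen"}

def decompound(word):
    """Split a compound into known lexicon words, or return it unchanged."""
    lower = word.lower()
    if lower in LEXICON:
        return [LEXICON[lower]]
    # Try every split point that leaves at least three characters on each side.
    for i in range(3, len(lower) - 2):
        head, tail = lower[:i], lower[i:]
        if head in LEXICON:
            rest = decompound(tail)
            if all(part.lower() in LEXICON for part in rest):
                return [LEXICON[head]] + rest
    return [word]

sentence = "Am Samstagmorgen fliege ich zurueck nach Boston"
print([decompound(token) for token in sentence.split()])
```

With decompounding, a query for "Samstag" or "Morgen" can match a document that only contains the compound, which a stemmer alone would not achieve.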

Page 16: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

How to Configure Solr

• Challenges
  • Multiple languages in the data set

• Goals:
  1. Language Identification
  2. Language-aware Search:
     • Tokenization
     • Token Processing

Page 17: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

How to Configure Solr

• What tools does Solr have to work with?
  • UpdateRequestProcessor
  • Analyzer/CharFilter/Tokenizer/TokenFilter
  • Solr Cores

• Pre-process data before Solr?

Page 18: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Solr UpdateRequestProcessor

• Runs Before Analyzers
• Full Access to Document

• Two options:
  • Run the analysis directly in Solr
    • Good for Lightweight Analysis
  • Call out to external analysis services
    • Web Services/UIMA. Increases Complexity

• Limitations:
  • Think through your indexing strategy

Page 19: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Solr Analyzer/Tokenizer

• Good for:
  • Segmentation of Asian Languages
  • Linguistics - Lemmatization

• Limitations:
  • No access to document object

• Schema.xml
  • FieldType
  • Analyzer
    – CharFilter
    – Tokenizer
    – TokenFilter
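The CharFilter, Tokenizer, TokenFilter ordering can be modeled conceptually as a three-stage pipeline. The sketch below is plain Python, not Solr or Lucene API; every function name is hypothetical, and the stages are only loosely analogous to components such as an HTML-stripping char filter, a whitespace tokenizer, and lowercase and stopword filters.

```python
import re

def strip_markup(text):
    """CharFilter stage: rewrite raw characters before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenize(text):
    """Tokenizer stage: turn the character stream into tokens."""
    return text.split()

def lowercase(tokens):
    """TokenFilter stage: transform each token."""
    return [t.lower() for t in tokens]

def drop_stopwords(tokens, stopwords=frozenset({"the", "a", "an"})):
    """TokenFilter stage: remove unwanted tokens."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then exactly one tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>The</b> Quick Fox", [strip_markup],
              whitespace_tokenize, [lowercase, drop_stopwords]))
```

Note that each stage only ever sees the text of one field, which mirrors the limitation on the slide: there is no access to the rest of the document, so information such as a detected language has to reach the analyzer some other way.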

Page 20: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Goal 1: Language ID

• UpdateRequestProcessor
  • Runs before field-level analysis takes place
  • Basic Language Identifier URP to be included in Solr

• Outside Solr

What do you do with the language information??
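One possible answer to that question, sketched for the "outside Solr" option: record the detected language on the document and copy the text into a language-specific field before sending it to be indexed. The field names (`language`, `text_<lang>`), the default language, and the stand-in detector below are all assumptions made for illustration.

```python
def annotate_language(doc, detect, text_field="text", default_lang="en"):
    """Return a copy of doc with a language tag and a per-language text field."""
    text = doc.get(text_field, "")
    lang = detect(text) or default_lang
    enriched = dict(doc)  # leave the caller's document untouched
    enriched["language"] = lang
    enriched[f"{text_field}_{lang}"] = text
    return enriched

def fake_detect(text):
    """Stand-in detector for the example; swap in a real language identifier."""
    return "de" if "ich" in text.lower().split() else None

doc = {"id": "1", "text": "Am Samstagmorgen fliege ich zurueck nach Boston."}
print(annotate_language(doc, fake_detect))
```

The same logic could run inside an UpdateRequestProcessor instead, since that hook also sees the whole document before field-level analysis.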

Page 21: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Goal 2: Multi-Lingual Support in Solr

• Three main approaches:

1. One Solr field for each language

2. One Solr Core per language

3. All Languages in a Single Field

Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf

Page 22: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Multiple Languages: Method 1

• One field for each language
• Pro:
  • Simple approach and implementation
  • Guarantees that queries are processed the same way as index

• Con:
  • Increased query-time complexity (mitigate with Dismax)
  • Decreased query speed as additional fields are queried
  • May require storing multiple copies of text
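The Dismax mitigation can be sketched on the client side: a single user query is spread across all per-language fields through the `qf` parameter, so the caller does not have to write one clause per field. The field naming scheme (`text_en`, `text_es`, ...) and the helper name are assumptions; `defType` and `qf` are standard Solr request parameters.

```python
from urllib.parse import urlencode

def dismax_params(user_query, languages, field_prefix="text"):
    """Build request parameters that search every per-language field at once."""
    fields = " ".join(f"{field_prefix}_{lang}" for lang in languages)
    return {"q": user_query, "defType": "dismax", "qf": fields}

params = dismax_params("prensa", ["en", "es", "de"])
query_string = urlencode(params)
print(query_string)
```

Each field listed in `qf` is analyzed with its own field type, which is why this method keeps query-time and index-time processing aligned. The cost is that every additional language adds another field to search.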

Page 23: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Multiple Languages: Method 2

• One Solr core per language
  Each Core has the same field, with a language-specific Analyzer/Tokenizer
• Pros:
  • No query-time performance overhead
  • Guarantees that queries are processed the same way as index

• Cons:
  • Significant complexity in managing multiple cores
  • Must implement custom sharding
  • Does not support multilingual documents
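The "custom sharding" cost can be made concrete with a small routing sketch: the client has to decide, per document, which core receives it. The core naming scheme (`docs_<lang>`), the fallback language, and the base URL are hypothetical, and the code only builds URLs and batches without contacting a server.

```python
def core_url(base_url, lang, handler, known_cores, fallback="en"):
    """Pick the core for a language, falling back when no core exists for it."""
    core = f"docs_{lang}" if lang in known_cores else f"docs_{fallback}"
    return f"{base_url.rstrip('/')}/{core}/{handler}"

def group_by_core(docs, base_url, known_cores):
    """Batch documents by the update URL of the core that should index them."""
    batches = {}
    for doc in docs:
        url = core_url(base_url, doc.get("language"), "update", known_cores)
        batches.setdefault(url, []).append(doc)
    return batches

docs = [
    {"id": "1", "language": "de"},
    {"id": "2", "language": "en"},
    {"id": "3", "language": "ja"},  # no core for "ja" in this example
]
print(group_by_core(docs, "http://localhost:8983/solr/", {"en", "de"}))
```

Because each document lands in exactly one core, a document that mixes languages is analyzed with a single language's analyzer, which is the limitation listed in the last bullet.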

Page 24: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Multiple Languages: Method 3

• All Languages in one field
• Pros:
  • Single field makes queries and indexing easy
  • Same schema/core as more languages added

• Cons:
  • Requires complex custom Tokenizer/Analyzer
  • Must pass in language information for queries and indexing
  • Does not guarantee queries are processed the same as the index
  • Potential TF/IDF confusion
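Why language information must accompany both queries and indexing can be shown with a toy dispatching analyzer. This is a conceptual sketch in plain Python, not a Solr Tokenizer; the two per-language analyzers and all names are hypothetical.

```python
def bigram_analyzer(text):
    """Character bigrams, used here as a stand-in for Japanese analysis."""
    chars = "".join(text.split())
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

def whitespace_analyzer(text):
    """Lowercased whitespace tokens, used as the default analysis."""
    return text.lower().split()

ANALYZERS = {"ja": bigram_analyzer, "en": whitespace_analyzer}

def analyze_multilingual(text, lang):
    """Dispatch to a per-language analyzer inside one shared field."""
    return ANALYZERS.get(lang, whitespace_analyzer)(text)

indexed = set(analyze_multilingual("上映時間", "ja"))
query_with_lang = set(analyze_multilingual("上映時間", "ja"))
query_without_lang = set(analyze_multilingual("上映時間", None))

print(indexed & query_with_lang)     # tokens overlap
print(indexed & query_without_lang)  # no overlap: the query was analyzed differently
```

When the language tag is missing or wrong at query time, the query tokens no longer line up with the indexed tokens. Terms from different languages also share one set of field statistics, which is one way the TF/IDF confusion noted above can arise.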

Page 25: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

Language is Important

• Use language information at index and query time
• Increase recall, maintain precision

• Better search results for your users

Page 26: Multilingual Search and Text Analytics with Solr - Open Source Search Conference

My Contact Info

• Steve Kearns
• [email protected]
• http://www.basistech.com