Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker for Classified Ads


Upload: lucenerevolution

Post on 11-May-2015


DESCRIPTION

Presented by Xavier Sanchez Loro, Ph.D., Trovit Search SL. This session explains the implementation of, and the use case for, spellchecking in the Trovit search engine. Trovit is a classified ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several million indexed ads. Those indexes are segmented into different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using Solr and Lucene to help our users find the ads they want and avoid the dreaded 0 results as much as possible. As such, our goal is not pure orthographic correction but also the suggestion of correct searches for a given site.

TRANSCRIPT

Page 1: Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker for Classified Ads
Page 2: Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecker for Classified Ads

Xavier Sanchez Loro, R&D Engineer

[email protected]

Page 3: Outline

•  Introduction
•  Our approach: Contextual Spellchecking
•  Nature and characteristics of our document corpus
•  Spellcheckers in Solr
•  White-listing and purging: controlling dictionary data
•  Spellchecker configuration
•  Customizing Solr’s SpellcheckComponent
•  Conclusions and Future Work

Page 4: Supporting text for this talk

Trovit Engineering Blog post on spellchecking:

http://tech.trovit.com/index.php/spellchecking-in-trovit/

Page 5: INTRODUCTION

Page 6: Introduction

Trovit: a search engine for classified ads

Page 7: Introduction

Page 8: Introduction: spellchecking in Trovit

•  Multi-language spellchecking system using Solr and Lucene
•  Objectives:
    –  help our users find the ads they are looking for
    –  avoid the dreaded 0 results as much as possible
    –  our goal is not pure orthographic correction but also to suggest correct searches for a given site

Page 9: OUR APPROACH: CONTEXTUAL SPELLCHECKING

Page 10: Contextual Spellchecking: approach

•  The key element in the spellchecking process is choosing the right dictionary
    –  one with a relevant vocabulary
        •  according to the type of information included in each site
•  Approach
    –  Specializing the dictionaries based on the user’s search context
•  Search contexts are composed of:
    –  country (with a default language)
    –  vertical (determining the type of ads and vocabulary)
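The idea of one dictionary per search context can be sketched as follows. This is only an illustration: the `SearchContext` type and the `spell_<country>_<vertical>` naming scheme are assumptions made for the example, not names used by Trovit.

```python
from typing import NamedTuple


class SearchContext(NamedTuple):
    """A search context: one site per country and vertical."""
    country: str   # e.g. "es"; implies a default language
    vertical: str  # e.g. "homes", "cars", "jobs"


def dictionary_name(context: SearchContext) -> str:
    """Return the (hypothetical) name of the site-specific spelling dictionary."""
    return f"spell_{context.country.lower()}_{context.vertical.lower()}"


# Two sites that share a language still resolve to different dictionaries.
homes_es = dictionary_name(SearchContext("es", "homes"))
cars_es = dictionary_name(SearchContext("ES", "cars"))
```

Keying the dictionary on the whole context, rather than on the language alone, is what lets the Spanish homes and Spanish cars sites carry different vocabularies.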

Page 11: Contextual Spellchecking: vocabularies

•  Each site’s document corpus has a limited vocabulary
    –  restricted to the type of information, language and terms found in that site’s ads
•  A more generalized approach is not suitable for our needs
    –  One vocabulary per language is less precise than a specialized vocabulary for each site
    –  Drastic differences between verticals
        •  type of terms
        •  semantics of each vertical
    –  Terms that are relevant in one context are meaningless in another
•  Different vocabularies for each site, even when they support the same language
    –  Vocabulary is tailored to the context of the searches

Page 12: NATURE AND CHARACTERISTICS OF OUR DOCUMENT CORPUS

Page 13: Challenges: Inconsistencies in our corpus

•  The document corpus is fed by different third-party sources
    –  which provide the ads for the different sites
•  We can detect incorrect documents and reconcile certain inconsistencies
    –  but we cannot control or modify the content of the ads themselves
•  Inconsistencies
    –  hinder any language detection process
    –  pose challenges to the development of the spellchecking system

Page 14: Inconsistencies example

•  Spanish homes vertical
    –  not fully written in Spanish
    –  Ads in several languages
        •  native languages: Spanish, Catalan, Basque and Galician
        •  foreign languages: English, German, French, Italian, Russian… even oriental languages like Chinese!
        •  Multi-language ads
    –  badly written and misspelled words
        •  Spanish words badly translated from regional languages
        •  overtly misspelled words
            –  e.g. “picina” yields 1,197 docs vs. 1,048,434 for “piscina” (about 0.11%)
    –  “noisy” content
        •  numbers, postal codes, references, etc.

Page 15: Characteristics of our ads

•  Summarizing
    –  Corpus segmented into different indexes, one per country plus vertical (site)
    –  Third-party generated
    –  Ads in the national language plus other languages (regional and foreign)
    –  Multi-language content within ads
    –  Noisy content (numbers, references, postal codes, etc.)
    –  Short texts (around 3000 characters long)
    –  Misspellings and incorrect words

The corpus is unreliable as the knowledge base for building any spellchecking dictionary.

Page 16: What/Where search segmentation

•  Geolocation data is not mixed with vertical data
    –  Only vertical data (no geodata)
    –  Narrower dictionary, fewer collisions, more controllable
•  Geolocation data interleaved with vertical data
    –  Cover all geodata
    –  Wider dictionary, more collisions, less controllable

Page 17: SPELLCHECKERS IN SOLR

Page 18: IndexBasedSpellChecker

•  It creates a parallel index for the spelling dictionary, based on an existing Lucene index.
    –  Depends on the correctness of the index data (misspellings)
    –  Creates an additional index from the current index (small, MB)
    –  Supports term frequency parameters
    –  Must be (re)built
•  Even though this component behaves as expected
    –  it was of no use for Trovit’s use case

Page 19: IndexBasedSpellChecker

•  It depends on index data
    –  which is not an accurate and reliable source for the spellchecking dictionary
•  Continuous builds
    –  needed to keep index data and spelling index data in sync
    –  If they are not in sync
        •  frequency information and hit counts are neither reliable nor accurate
        •  false positives/negatives
        •  suggestions of words with a different number of hits, even 0
•  We cannot risk this situation

Page 20: FileBasedSpellChecker

•  It uses a flat file to generate a spelling dictionary in the form of a Lucene spellchecking index.
    –  Requires a dictionary file
    –  Creates an additional index from the dictionary file (small, MB)
    –  Does not depend on index data (controlled data)
    –  Build once
        •  rebuild only if the dictionary is updated
    –  No frequency information used when calculating spelling suggestions

Page 21: FileBasedSpellChecker

•  Also requires rebuilds
    –  albeit less frequently
•  No frequency-related data
    –  Pure orthographic correction is not our main goal
    –  We cannot risk suggesting corrections that return no results
•  But
    –  it gave us insight into how to approach the final solution we are implementing
    –  it allows the highest degree of control over dictionary contents
        •  an essential feature for spelling dictionaries

Page 22: DirectSpellChecker

•  Experimental spellchecker that uses the main Solr index directly
    –  Build/rebuild is not required
    –  Depends on the correctness of the index data (misspellings)
    –  Uses the existing index
        •  field: source of the spelling dictionary
    –  Supports term frequency parameters
    –  No (re)build
•  Several promising features
    –  No build, and continuously in sync with index data
    –  Provides accurate frequency information

Page 23: DirectSpellChecker

•  The real drawback
    –  lack of control over the index data that sources the spelling dictionary
•  If we can overcome it, this type would make an ideal candidate for our use case.

Page 24: WordBreakSpellChecker

•  Generates suggestions by combining adjacent words and/or breaking words into multiples.
    –  This spellchecker can be configured alongside a traditional checker (e.g. DirectSolrSpellChecker)
    –  The results are combined, and collations can contain a mix of corrections from both spellcheckers
    –  Uses the existing index. No build.

Page 25: WordBreakSpellChecker

•  Good complement to the other spellcheckers
•  It works really well with well-written concatenated words
    –  it is able to break them up with great accuracy
•  Combining split words is not as accurate
•  Drawback: it is based on index data
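To make the word-breaking idea concrete, here is a minimal sketch of splitting a concatenated query into known words. It is not Solr's implementation, which works against index terms and their frequencies; the `break_words` function and the tiny vocabulary are assumptions made for illustration.

```python
from typing import List, Optional, Set


def break_words(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split `text` into vocabulary words, or return None if impossible.

    Dynamic programming over prefixes: best[i] holds one segmentation of
    text[:i] using the fewest words, or None when text[:i] cannot be split.
    """
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocabulary:
                candidate = prefix + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


vocabulary = {"piscina", "garaje", "jardin"}
print(break_words("piscinagarajejardin", vocabulary))  # ['piscina', 'garaje', 'jardin']
print(break_words("picinagaraje", vocabulary))         # None: "picina" is not a known word
```

The second call shows why breaking works best on well-written concatenations: once a fragment is misspelled, no split into known words exists, and a conventional spellchecker has to step in.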

Page 26: WHITE-LISTING AND PURGING: CONTROLLING DICTIONARY DATA

Page 27: White-listing

•  Any spelling system can only be as good as its knowledge base or dictionary is accurate.
•  We need to control the data indexed as the dictionary.
•  White-listing approach
    –  we only index spelling data contained in a controlled dictionary list
    –  processes to build a base dictionary specialized for a given site
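A minimal sketch of the white-listing idea, under stated assumptions: the whitelist is taken to be the set of index terms that also appear in a curated base dictionary, and only white-listed tokens are kept as spelling data. The function names and the `min_docs` parameter are hypothetical. The "piscina" and "picina" counts are the ones quoted in this talk; the other counts are invented for the example.

```python
from typing import Dict, Iterable, List, Set


def build_whitelist(base_dictionary: Set[str],
                    index_term_counts: Dict[str, int],
                    min_docs: int = 1) -> Set[str]:
    """Keep index terms that also appear in the curated base dictionary."""
    return {
        term for term, docs in index_term_counts.items()
        if term in base_dictionary and docs >= min_docs
    }


def purge(tokens: Iterable[str], whitelist: Set[str]) -> List[str]:
    """Drop every token that is not white-listed before it is indexed as spelling data."""
    return [token for token in tokens if token in whitelist]


base_dictionary = {"piscina", "garaje", "jardin"}
# Term -> number of documents containing it (partly invented, see above).
index_term_counts = {"piscina": 1048434, "picina": 1197, "garaje": 500, "08001": 42}

whitelist = build_whitelist(base_dictionary, index_term_counts)
spell_tokens = purge("piso con picina y garaje 08001".split(), whitelist)
```

Here the misspelling "picina" and the postal code "08001" never reach the spelling data, even though both occur in the index.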

Page 28: White-list building process

Page 29: SPELLCHECKER CONFIGURATION

Page 30: Initial spellchecker configuration

•  DirectSpellChecker using a purged spell field
    –  Spell field filled with purged content
        •  Purging according to the whitelist
        •  Whitelist generated by matching the dictionary with index words, after the purge process
•  Benefits:
    –  Build is no longer required
    –  Spell field is automatically updated via the pipeline
    –  We can work with term frequencies
    –  No additional index, just an additional field
    –  Better relevance and suggestions

Page 31: Initial spellchecker configuration

•  Cons:
    –  Whitelist maintenance, and whitelist creation for new sites
•  Features:
    –  Accurate detection of misspelled words
    –  Good detection of concatenated words
        •  piscinagarajejardin to piscina garaje jardin
        •  picina garajejardin to piscina (garaje jardin)
    –  Able to detect several misspelled words
    –  Evolution based on fine-tuning the whitelist

Page 32: Initial spellchecker configuration

•  Issues:
    –  False negatives: suggestion of corrections when words are correctly spelled
    –  Suggestions for all the words in the query, not just the misspelled ones
    –  Misleading “correctlySpelled” parameter
        •  The parameter depends on frequency information, making it unreliable for our purposes
        •  It returns true/false according to thresholds,
            –  not really depending on word distance but on
            –  results found and the “alternativeTermCount” and “maxResultsForSuggest” thresholds
    –  Minor discrepancies if we only index boosted terms (i.e. qf)
        •  # hits spell < # docs index
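For reference, the thresholds named above are request parameters of Solr's spellcheck component. The sketch below only builds a request URL. The host, the core name `homes_es` and all parameter values are assumptions chosen for illustration, and it presumes a request handler that has the spellcheck component enabled.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url: str, query: str) -> str:
    """Build a Solr select URL that asks the spellcheck component for suggestions."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.q": query,
        "spellcheck.count": 5,
        "spellcheck.collate": "true",
        # Thresholds that influence "correctlySpelled" (values are arbitrary here).
        "spellcheck.alternativeTermCount": 3,
        "spellcheck.maxResultsForSuggest": 10,
    }
    return f"{base_url}/select?{urlencode(params)}"


url = spellcheck_url("http://localhost:8983/solr/homes_es", "picina garaje")
```

Because the response's "correctlySpelled" flag is derived from hit counts and these thresholds rather than from the words themselves, changing either value can flip the flag for the same query.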

Page 33: CUSTOMIZING SOLR SPELLCHECKCOMPONENT

Page 34: Hacking SpellcheckComponent

•  Lack of reliability of the “correctlySpelled” parameter
    –  Difficult to know when to give a suggestion or not
    –  First policy based on document hits
        •  sliding window
            –  based on the number of queried terms
        •  the longer the tail, the smaller the threshold
        •  inaccurate and prone to collisions
    –  Difficult to tune thresholds to a good level of accuracy

We needed a more reliable way.

Page 35: Hacking SpellcheckComponent: correctlySpelled parameter behaviour

•  Binary approach to deciding whether a word is correctly spelled or not.
•  Simpler approach
    –  any term that appears in our spelling field is a correctly spelled word
        •  regardless of the value of its frequency info or the configured thresholds
    –  this way the parameter can be used to control when to start querying the spellchecking index
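The binary rule can be sketched in a few lines. This is an illustration of the decision logic only, not the modified Solr component; the function names are hypothetical and the spelling field is modelled as a plain set of terms.

```python
from typing import List, Set


def correctly_spelled(term: str, spelling_terms: Set[str]) -> bool:
    """Binary rule: a term is correct if and only if the spelling field contains it."""
    return term in spelling_terms


def misspelled_terms(query: str, spelling_terms: Set[str]) -> List[str]:
    """Return only the query terms that should be sent to the spellchecker."""
    return [t for t in query.lower().split() if not correctly_spelled(t, spelling_terms)]


spelling_terms = {"piscina", "garaje", "jardin"}
to_check = misspelled_terms("piscina garage", spelling_terms)
```

With this rule, frequency information and thresholds play no part in the decision: "piscina" is accepted because it is in the field, and only "garage" is passed on for suggestions.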

Page 36: Hacking SpellcheckComponent

•  Other changes to the SpellcheckComponent:
    –  No suggestions when words are correctly spelled
    –  Only makes suggestions for misspelled words, not for all words
        •  e.g. piscina garage -> piscina garaje
•  Spanish-friendly ASCIIFoldingFilter
    –  modified so that it does not fold the “ñ” (for Spanish) and “ç” (for Catalan names) characters
        •  Avoids collisions between similar words with “n” and “c”
            –  e.g. “pena” and “peña”
    –  Still folds accented vowels
        •  usually omitted by users
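The folding behaviour described above can be approximated as follows. This is a simplified sketch, not the modified Lucene filter: it only strips combining diacritics, whereas ASCIIFoldingFilter handles many more characters. The `fold` function and the `PRESERVED` set are names chosen for this example.

```python
import unicodedata

# Characters kept as-is, following the talk: "ñ" (Spanish) and "ç" (Catalan).
PRESERVED = {"ñ", "Ñ", "ç", "Ç"}


def fold(text: str) -> str:
    """Strip diacritics from `text`, except on the preserved characters."""
    folded = []
    # Normalize to composed form first so "n" + combining tilde becomes "ñ".
    for char in unicodedata.normalize("NFC", text):
        if char in PRESERVED:
            folded.append(char)
            continue
        decomposed = unicodedata.normalize("NFD", char)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)
```

Under this sketch "jardín" folds to "jardin", matching what users usually type, while "peña" stays distinct from "pena".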

Page 37: CONCLUSIONS AND FUTURE WORK

Page 38: Conclusion & Future Work

•  Base code
    –  expand the spellchecking process to other sites
    –  design the final policy for deciding when to give suggestions
•  Geodata in homes verticals
    –  find ways to avoid collisions in large dictionary sets
•  Scoring system for the spelling dictionary
    –  Control suggestions based on user input
        •  Feedback on the relevance or quality of our spellchecking suggestions
        •  Make the system more accurate and reliable
•  Expand whitelists to cover large amounts of geodata
    –  with acceptable levels of precision

Page 39: Conclusion & Future Work

•  Plural suggester
    –  suggest alternative searches and corrections using plural or singular variants of the terms in the query
    –  Use frequency and scoring information to choose the most suitable suggestions

Page 40: THANKS FOR YOUR ATTENTION! ANY QUESTIONS?

Page 41: References

[1] Lucene/Solr Revolution EU 2013. Dublin, 6-7 November 2013. http://www.lucenerevolution.org/
[2] Trovit – A search engine for classified ads of real estate, jobs, cars and vacation rentals. http://www.trovit.com
[3] Apache Software Foundation. “Apache Solr” https://lucene.apache.org/solr/
[4] Apache Software Foundation. “Apache Lucene” https://lucene.apache.org
[5] Apache Software Foundation. “Spell Checking – Apache Solr Reference Guide – Apache Software Foundation” https://cwiki.apache.org/confluence/display/solr/Spell+Checking