PostgreSQL search demystified

Upload: javier-ramirez

Post on 28-Nov-2014

Category: Software

DESCRIPTION

How does a full-text search engine work? How is the index built and searched? Can I use PostgreSQL as a full-text search engine, or should I go for a more specialised solution? How does one configure and use PostgreSQL search? This presentation covers all those aspects, based on the work we did to index teowaki.com. It was presented at PgConf EU 2014 in Madrid.

TRANSCRIPT

Page 1: Postgresql search demystified

* in *

PgConf EU 2014 presents

Javier Ramirez

PostgreSQL Full-text search demystified

@supercoco9

https://teowaki.com

Page 2: Postgresql search demystified

The problem

Page 3: Postgresql search demystified

our architecture

Page 4: Postgresql search demystified
Page 5: Postgresql search demystified

One does not simply

SELECT * from stuff where

content ilike '%postgresql%'
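
A minimal sketch of why a substring match falls short, assuming the built-in `english` configuration (the sample sentences are made up for illustration). A pattern with a leading wildcard also generally cannot use an ordinary B-tree index.

```sql
-- Substring matching knows nothing about words or stems.
SELECT 'We were running late'  ILIKE '%run%' AS ilike_hit,        -- t
       'I like to prune trees' ILIKE '%run%' AS ilike_false_hit;  -- t: "prune" contains "run"

-- Full-text matching compares normalized lexemes instead.
SELECT to_tsvector('english', 'We were running late')
         @@ to_tsquery('english', 'run') AS fts_hit,              -- t: "running" stems to "run"
       to_tsvector('english', 'I like to prune trees')
         @@ to_tsquery('english', 'run') AS fts_false_hit;        -- f
```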

Page 6: Postgresql search demystified
Page 7: Postgresql search demystified
Page 8: Postgresql search demystified

Basic search features

* stemmers (run, runner, running)

* unaccented (josé, jose)

* results highlighting

* rank results by relevance

Page 9: Postgresql search demystified

Nice to have features

* partial searches

* search operators (OR, AND...)

* synonyms (postgres, postgresql, pgsql)

* thesaurus (OS=Operating System)

* fast, and space-efficient

* debugging

Page 10: Postgresql search demystified

Good News:

PostgreSQL supports all

the requested features

Page 11: Postgresql search demystified

Bad News:

unless you already know about search

engines, the official docs are not obvious

Page 12: Postgresql search demystified

How a search engine works

* An indexing phase

* A search phase

Page 13: Postgresql search demystified

The indexing phase

Convert the input text to tokens
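
In PostgreSQL terms this step can be tried directly from psql. The example below is the one used in the PostgreSQL documentation, with the built-in `english` configuration; each lexeme is listed with the position(s) where it occurred.

```sql
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```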

Page 14: Postgresql search demystified

The search phase

Match the search terms to

the indexed tokens
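
A small sketch of the matching step, again with the built-in `english` configuration. The `@@` operator returns true when the tokenized document satisfies the tokenized query.

```sql
SELECT to_tsvector('english', 'fat cats ate fat rats')
         @@ to_tsquery('english', 'fat & rat');   -- t

SELECT to_tsvector('english', 'fat cats ate fat rats')
         @@ to_tsquery('english', 'fat & cow');   -- f
```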

Page 15: Postgresql search demystified

indexing in depth

* choose an index format

* tokenize the words

* apply token analysis/filters

* discard unwanted tokens

Page 16: Postgresql search demystified

the index format

* r-tree (GIST in PostgreSQL)

* inverted indexes (GIN in PostgreSQL)

* dynamic/distributed indexes
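
To make the idea of an inverted index concrete, here is a rough sketch that builds one by hand: a mapping from each lexeme to the ids of the documents containing it. It is only an illustration of the data structure, not how GIN is implemented. The two sample documents are made up, and `unnest(tsvector)` is assumed to be available (PostgreSQL 9.6 or later).

```sql
WITH docs(id, body) AS (
    VALUES (1, 'PostgreSQL full text search'::text),
           (2, 'Search engines index text'::text)
)
SELECT t.lexeme,
       array_agg(d.id ORDER BY d.id) AS doc_ids   -- the "posting list" for this lexeme
FROM docs AS d
CROSS JOIN LATERAL unnest(to_tsvector('english', d.body)) AS t
GROUP BY t.lexeme
ORDER BY t.lexeme;
```

Lexemes shared by both documents, such as `search` and `text`, come back with both ids, which is what lets a search jump from a term straight to the matching documents.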

Page 17: Postgresql search demystified

dynamic indexes: segmentation

* sometimes the token index is segmented to allow faster updates

* consolidate segments to speed up search and account for deletions

Page 18: Postgresql search demystified

tokenizing

* parse/strip/convert format

* normalize terms (unaccent, ascii, charsets, case folding, number precision...)
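
Two of these normalizations can be sketched in plain SQL. This assumes the `unaccent` contrib extension is available on the server; it is not installed by default.

```sql
-- unaccent ships as a contrib extension and must be installed first.
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT lower('PostgreSQL');   -- postgresql (case folding)
SELECT unaccent('Hôtel');     -- Hotel
SELECT unaccent('josé');      -- jose
```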

Page 19: Postgresql search demystified

token analysis/filters

* find synonyms

* expand thesaurus

* stem (maybe in different languages)
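
A quick way to see stemming at work, assuming the built-in `english` and `simple` configurations:

```sql
-- The english configuration reduces the three forms to one lexeme.
SELECT to_tsvector('english', 'run runs running');
-- 'run':1,2,3

-- The simple configuration only lowercases, so the three forms stay distinct.
SELECT to_tsvector('simple', 'run runs running');
```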

Page 20: Postgresql search demystified

more token analysis/filters

* eliminate stopwords

* store word distance/frequency

* store the full contents of some fields

* store some fields as attributes/facets
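
Stopword removal and word positions can be seen together in this example from the PostgreSQL documentation. The stopwords are dropped, but the remaining lexemes keep their original positions, so word distance is still available for ranking.

```sql
SELECT to_tsvector('english', 'in the list of stop words');
-- 'list':3 'stop':5 'word':6
```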

Page 21: Postgresql search demystified

“the index file” is really

* a token file, probably segmented/distributed

* some dictionary files: synonyms, thesaurus, stopwords, stems/lexemes (in different languages)

* word distance/frequency info

* attributes/original field files

* optional geospatial index

* auxiliary files: word/sentence boundaries, meta-info, parser definitions, datasource definitions...

Page 22: Postgresql search demystified

the hardest

part is now

over

Page 23: Postgresql search demystified

searching in depth

* tokenize/analyse

* prepare operators

* retrieve information

* rank the results

* highlight the matched parts

Page 24: Postgresql search demystified

searching in depth: tokenize

normalize, tokenize, and analyse

the original search term

the result would be a tokenized, stemmed,

“synonymised” term, without stopwords
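
In PostgreSQL this is what `to_tsquery` and `plainto_tsquery` do. Both examples come from the PostgreSQL documentation and use the built-in `english` configuration: the stopword "The" disappears and "Rats" is folded and stemmed.

```sql
-- to_tsquery expects operators between the terms.
SELECT to_tsquery('english', 'The & Fat & Rats');    -- 'fat' & 'rat'

-- plainto_tsquery takes free text and joins the terms with AND.
SELECT plainto_tsquery('english', 'The Fat Rats');   -- 'fat' & 'rat'
```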

Page 25: Postgresql search demystified

searching in depth: operators

* partial search

* logical/geospatial/range operators

* in-sentence/in-paragraph/word distance

* faceting/grouping

Page 26: Postgresql search demystified

searching in depth: retrieval

Go through the token index files, use the

attributes and geospatial files if necessary

for operators and/or grouping

You might need to do this in a distributed way

Page 27: Postgresql search demystified

searching in depth: ranking

algorithm to sort the most relevant results:

* field weights

* word frequency/density

* geospatial or timestamp ranking

* ad-hoc ranking strategies

Page 28: Postgresql search demystified

searching in depth: highlighting

Mark the matching parts of the results

It can be tricky/slow if you are not storing the full contents in your indexes

Page 29: Postgresql search demystified

PostgreSQL as a

full-text

search engine

Page 30: Postgresql search demystified

search features

* index format configuration

* partial search

* word boundaries parser (not configurable)

* stemmers/synonyms/thesaurus/stopwords

* full-text logical operators

* attributes/geo/timestamp/range (using SQL)

* ranking strategies

* highlighting

* debugging/testing commands

Page 31: Postgresql search demystified

indexing in postgresql

you don't actually need an index to use full-text search in PostgreSQL

but unless your db is very small, you want to have one

Choose GiST or GIN (GIN gives faster search, at the cost of slower indexing and a larger index)

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));

Page 32: Postgresql search demystified

Two new things

CREATE INDEX ... USING gin(to_tsvector (config_name, body));

* to_tsvector: postgresql way of saying “tokenize”

* config_name: tokenizing/analysis rule set
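
A concrete sketch with `config_name` filled in. In an expression index the configuration has to be spelled out with the two-argument form of `to_tsvector`, and a query can only use the index if it repeats the same expression. The table definition is hypothetical, modelled on the `pgweb` example in the PostgreSQL documentation, and `english` stands in for whatever configuration is actually used.

```sql
-- Hypothetical table for illustration.
CREATE TABLE IF NOT EXISTS pgweb (
    id    serial PRIMARY KEY,
    title text,
    body  text
);

-- The configuration is a literal here, so the index contents
-- do not depend on default_text_search_config.
CREATE INDEX pgweb_body_idx ON pgweb
    USING gin (to_tsvector('english', body));

-- The WHERE clause repeats the indexed expression so the planner can use the index.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```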

Page 33: Postgresql search demystified

Configuration

CREATE TEXT SEARCH CONFIGURATION

public.teowaki ( COPY = pg_catalog.english );

Page 34: Postgresql search demystified

Configuration

CREATE TEXT SEARCH DICTIONARY english_ispell (

TEMPLATE = ispell,

DictFile = en_us,

AffFile = en_us,

StopWords = spanglish

);

CREATE TEXT SEARCH DICTIONARY spanish_ispell (

TEMPLATE = ispell,

DictFile = es_any,

AffFile = es_any,

StopWords = spanish

);

Page 35: Postgresql search demystified

Configuration

CREATE TEXT SEARCH DICTIONARY english_stem (

TEMPLATE = snowball,

Language = english,

StopWords = english

);

CREATE TEXT SEARCH DICTIONARY spanish_stem (

TEMPLATE = snowball,

Language = spanish,

StopWords = spanish

);
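
Synonyms and thesaurus entries are configured the same way, as dictionaries backed by files. The sketch below follows the examples in the PostgreSQL documentation. The dictionary names and the files `my_synonyms.syn` and `mythesaurus.ths` are assumptions for illustration, not part of the teowaki setup shown in the slides.

```sql
-- Assumes my_synonyms.syn exists in $SHAREDIR/tsearch_data/
-- with one "word synonym" pair per line, for example:
--   postgres    pgsql
--   postgresql  pgsql
CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

-- A thesaurus maps phrases and needs a subdictionary to normalize its input.
-- Assumes mythesaurus.ths exists in the same directory.
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);
```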

Page 36: Postgresql search demystified

Configuration

Parser.

Word boundaries
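
The built-in parser is called `default`. Its token types, and the tokens it produces for a given string, can be inspected directly. The `ts_parse` example and its output come from the PostgreSQL documentation.

```sql
-- List the token types the parser can emit (asciiword, word, email, url, ...).
SELECT alias, description FROM ts_token_type('default');

-- Split a string into raw tokens, before any dictionary is applied.
SELECT * FROM ts_parse('default', '123 - a number');
--  tokid | token
-- -------+--------
--     22 | 123
--     12 |
--     12 | -
--      1 | a
--     12 |
--      1 | number
```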

Page 37: Postgresql search demystified

Configuration

Assign dictionaries (in specific to generic order)

ALTER TEXT SEARCH CONFIGURATION teowaki

ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part

WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;

ALTER TEXT SEARCH CONFIGURATION teowaki

DROP MAPPING FOR email, url, url_path, sfloat, float;
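
Order matters in the WITH list: each token is handed to the listed dictionaries in turn until one of them recognizes it. According to the PostgreSQL documentation, a filtering dictionary such as `unaccent` passes its output on to the next dictionary, so it is normally listed first, while a Snowball stemmer recognizes every word and so normally goes last. The sketch below is adapted from the unaccent documentation. The `fr` configuration is just an example and not part of the teowaki setup. It may be worth checking with `ts_debug` which dictionary ends up handling each token in your own mapping.

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- unaccent first (filtering), Snowball stemmer last (catch-all).
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');
-- 'hotel':1 'mer':4
```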

Page 38: Postgresql search demystified

debugging

select * from ts_debug('teowaki', 'I am searching unas búsquedas con postgresql database');

also ts_lexize and ts_parse
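
`ts_lexize` tests a single dictionary in isolation, which helps when a mapping does not behave as expected. A small sketch against the built-in `english_stem` dictionary and `english` configuration (the dictionaries defined earlier in the slides could be tested the same way once created):

```sql
SELECT ts_lexize('english_stem', 'stars');   -- {star}
SELECT ts_lexize('english_stem', 'a');       -- {} (recognized as a stopword)

-- ts_debug shows, per token, which dictionary handled it and the resulting lexemes.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'a fat cat');
```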

Page 39: Postgresql search demystified

tokenizing

tokens + position (stopwords are removed, tokens are folded)

Page 40: Postgresql search demystified

searching

SELECT guid, description from wakis where

to_tsvector('teowaki',description)

@@ to_tsquery('teowaki','postgres');

Page 41: Postgresql search demystified

searching

SELECT guid, description from wakis where

to_tsvector('teowaki',description)

@@ to_tsquery('teowaki','postgres:*');
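
One thing to keep in mind with `:*` is that the prefix itself is stemmed before matching. The example below is from the PostgreSQL documentation, shown here with the built-in `english` configuration made explicit.

```sql
SELECT to_tsquery('english', 'postgres:*');
-- 'postgr':*

-- "postgres" is stemmed to "postgr", which is also a prefix of "postgraduate".
SELECT to_tsvector('english', 'postgraduate')
         @@ to_tsquery('english', 'postgres:*');
-- t
```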

Page 42: Postgresql search demystified

operators

SELECT guid, description from wakis where

to_tsvector('teowaki',description)

@@ to_tsquery('teowaki','postgres | mysql');
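
Besides `|` (OR), a tsquery accepts `&` (AND) and `!` (NOT). A short sketch with the built-in `english` configuration. The last two queries use the phrase operator `<->`, which was added in PostgreSQL 9.6, after this talk was given, so they assume a newer server.

```sql
SELECT to_tsvector('english', 'postgres and mysql')
         @@ to_tsquery('english', 'postgres & mysql');    -- t

SELECT to_tsvector('english', 'postgres and mysql')
         @@ to_tsquery('english', 'postgres & !mysql');   -- f

-- Phrase search (PostgreSQL 9.6+): the terms must be adjacent and in order.
SELECT to_tsvector('english', 'fatal error')
         @@ to_tsquery('english', 'fatal <-> error');     -- t

SELECT to_tsvector('english', 'error is not fatal')
         @@ to_tsquery('english', 'fatal <-> error');     -- f
```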

Page 43: Postgresql search demystified

ranking weights

SELECT setweight(to_tsvector(coalesce(name,'')),'A') ||

setweight(to_tsvector(coalesce(description,'')),'B')

from wakis limit 1;

Page 44: Postgresql search demystified

search by weight
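
A sketch of what searching by weight can look like. A query term can be restricted to lexemes carrying a given weight label by appending `:A`, `:B`, and so on. The two sample strings stand in for a name (weight A) and a description (weight B), and the `english` configuration is assumed.

```sql
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'PostgreSQL search'), 'A') ||
           setweight(to_tsvector('english', 'A talk about indexes'), 'B') AS tsv
)
SELECT tsv @@ to_tsquery('english', 'search:A')  AS search_in_name,         -- t
       tsv @@ to_tsquery('english', 'indexes:A') AS indexes_in_name,        -- f: only present with weight B
       tsv @@ to_tsquery('english', 'indexes:B') AS indexes_in_description  -- t
FROM doc;
```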

Page 45: Postgresql search demystified

ranking

SELECT name, ts_rank(to_tsvector(name), query) rank

from wakis, to_tsquery('postgres | indexes') query

where to_tsvector(name) @@ query order by rank DESC;

also ts_rank_cd
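
A possible variant using `ts_rank_cd` together with the weighted vector from the "ranking weights" slide. The weights array is in `{D, C, B, A}` order (the values shown are the documented defaults), and the normalization flag `32` scales the rank to `rank / (rank + 1)`. The `wakis` table and `teowaki` configuration are the ones from the slides. The `LATERAL` subquery and the `LIMIT` are choices made for this sketch.

```sql
SELECT name,
       ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', doc, query, 32) AS rank
FROM wakis,
     to_tsquery('teowaki', 'postgres | indexes') AS query,
     LATERAL (
         -- name counts more (A) than description (B)
         SELECT setweight(to_tsvector('teowaki', coalesce(name, '')), 'A') ||
                setweight(to_tsvector('teowaki', coalesce(description, '')), 'B')
     ) AS d(doc)
WHERE doc @@ query
ORDER BY rank DESC
LIMIT 10;
```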

Page 46: Postgresql search demystified

highlighting

SELECT ts_headline(name, query) from wakis,

to_tsquery('teowaki', 'game|play') query

where to_tsvector('teowaki', name) @@ query;
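
`ts_headline` also takes an options string to control the markers and the size of the excerpt. A self-contained sketch based on the sample text from the PostgreSQL documentation, with the built-in `english` configuration. The `<mark>` tags and word limits are arbitrary choices for illustration.

```sql
SELECT ts_headline(
    'english',
    'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
    to_tsquery('english', 'query & similarity'),
    'StartSel = <mark>, StopSel = </mark>, MaxWords = 20, MinWords = 5'
);
```

Note that `ts_headline` works on the original text, not on the tsvector, so it re-parses every row it is applied to. It is usually best applied only to the rows that are actually returned.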

Page 47: Postgresql search demystified

USE POSTGRESQL

FOR EVERYTHING

Page 48: Postgresql search demystified

When PostgreSQL is not good

* You need to index files (PDF, Odx...)

* Your index is very big (slow reindex)

* You need a distributed index

* You need complex tokenizers

* You need advanced rankers

Page 49: Postgresql search demystified

When PostgreSQL is not good

* You want a REST API

* You want sentence/proximity/range/more complex operators

* You want search auto completion

* You want advanced features (alerts...)

Page 50: Postgresql search demystified

But it has been

perfect for us so far.

Our users don't care

which search engine

we use, as long as

it works.
