PostgreSQL full-text search demystified
DESCRIPTION
How does a full-text search engine work? How is the index built and searched? Can I use PostgreSQL as a full-text search engine, or should I go for a more specialised solution? How does one configure and use PostgreSQL search? This presentation covers all of those aspects, based on the work we did to index teowaki.com. It was presented at PgConf EU 2014 in Madrid.
TRANSCRIPT
PgConf EU 2014 presents
Javier Ramirez
PostgreSQL
Full-text search
demystified
@supercoco9
https://teowaki.com
The problem
our architecture
One does not simply
SELECT * FROM stuff WHERE
content ILIKE '%postgresql%'
Basic search features
* stemmers (run, runner, running)
* unaccented (josé, jose)
* results highlighting
* rank results by relevance
Nice to have features
* partial searches
* search operators (OR, AND...)
* synonyms (postgres, postgresql, pgsql)
* thesaurus (OS=Operating System)
* fast, and space-efficient
* debugging
Good News:
PostgreSQL supports all
the requested features
Bad News:
unless you already know about search
engines, the official docs are not obvious
How a search engine works
* An indexing phase
* A search phase
The indexing phase
Convert the input text to tokens
The search phase
Match the search terms to
the indexed tokens
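The two phases meet in one line of SQL: to_tsvector tokenizes a document, to_tsquery tokenizes the search term, and @@ matches them. A minimal sketch using PostgreSQL's built-in english configuration (the sample text is invented):

```sql
-- Both sides are reduced to the same stemmed tokens, so this returns true:
-- 'indexes' in the document and 'index' in the query both become 'index'.
SELECT to_tsvector('english', 'PostgreSQL indexes are fast')
       @@ to_tsquery('english', 'index');
```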
indexing in depth
* choose an index format
* tokenize the words
* apply token analysis/filters
* discard unwanted tokens
the index format
* tree/signature indexes (GiST in PostgreSQL)
* inverted indexes (GIN in PostgreSQL)
* dynamic/distributed indexes
dynamic indexes: segmentation
* sometimes the token index is
segmented to allow faster updates
* consolidate segments to speed-up
search and account for deletions
tokenizing
* parse/strip/convert format
* normalize terms (unaccent, ascii,
charsets, case folding, number precision..)
token analysis/filters
* find synonyms
* expand thesaurus
* stem (maybe in different languages)
more token analysis/filters
* eliminate stopwords
* store word distance/frequency
* store the full contents of some fields
* store some fields as attributes/facets
“the index file” is really
* a token file, probably segmented/distributed
* some dictionary files: synonyms, thesaurus,
stopwords, stems/lexemes (in different languages)
* word distance/frequency info
* attributes/original field files
* optional geospatial index
* auxiliary files: word/sentence boundaries, meta-info,
parser definitions, datasource definitions...
the hardest
part is now
over
searching in depth
* tokenize/analyse
* prepare operators
* retrieve information
* rank the results
* highlight the matched parts
searching in depth: tokenize
normalize, tokenize, and analyse
the original search term
the result would be a tokenized, stemmed,
“synonymised” term, without stopwords
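In PostgreSQL, plainto_tsquery does this normalisation of the search term for you. A sketch using the built-in english configuration:

```sql
-- 'The' is dropped as a stopword; the other words are case-folded
-- and stemmed, giving the tsquery: 'run' & 'databas'
SELECT plainto_tsquery('english', 'The Running Databases');
```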
searching in depth: operators
* partial search
* logical/geospatial/range operators
* in-sentence/in-paragraph/word distance
* faceting/grouping
searching in depth: retrieval
Go through the token index files, use the
attributes and geospatial files if necessary
for operators and/or grouping
You might need to do this in a distributed way
searching in depth: ranking
algorithm to sort the most relevant results:
* field weights
* word frequency/density
* geospatial or timestamp ranking
* ad-hoc ranking strategies
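In PostgreSQL terms, field weights feed straight into ranking: ts_rank accepts an optional weight array {D, C, B, A}. A hedged sketch (the sample texts are invented):

```sql
-- With this weight array, lexemes labelled A (the title) count ten
-- times more than lexemes labelled D (the body) when ranking.
SELECT ts_rank('{0.1, 0.1, 0.1, 1.0}',
               setweight(to_tsvector('english', 'PostgreSQL search'), 'A') ||
               setweight(to_tsvector('english', 'a long body text'), 'D'),
               to_tsquery('english', 'postgresql'));
```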
searching in depth: highlighting
Mark the matching parts of the results
It can be tricky/slow if you are not storing the full contents
in your indexes
PostgreSQL as a
full-text
search engine
search features
* index format configuration
* partial search
* word boundaries parser (not configurable)
* stemmers/synonyms/thesaurus/stopwords
* full-text logical operators
* attributes/geo/timestamp/range (using SQL)
* ranking strategies
* highlighting
* debugging/testing commands
indexing in postgresql
you don't actually need an index to use full-text search in PostgreSQL
but unless your db is very small, you want to have one
Choose GiST or GIN (GIN: faster search, but slower
indexing and a larger index)
CREATE INDEX pgweb_idx ON pgweb USING
gin(to_tsvector(config_name, body));
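An alternative to the expression index above is to materialise the tsvector in its own column and keep it current with PostgreSQL's built-in trigger; a sketch (the body_tsv column name is made up):

```sql
-- Store the tsvector once, keep it updated on writes, index the column.
ALTER TABLE pgweb ADD COLUMN body_tsv tsvector;
UPDATE pgweb SET body_tsv = to_tsvector('english', coalesce(body, ''));
CREATE TRIGGER pgweb_tsv_update BEFORE INSERT OR UPDATE ON pgweb
  FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);
CREATE INDEX pgweb_tsv_idx ON pgweb USING gin(body_tsv);
```

This trades disk space for not re-running to_tsvector at query time.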
Two new things
CREATE INDEX ... USING gin(to_tsvector (config_name, body));
* to_tsvector: postgresql way of saying “tokenize”
* config_name: tokenizing/analysis rule set
Configuration
CREATE TEXT SEARCH CONFIGURATION
public.teowaki ( COPY = pg_catalog.english );
Configuration
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
StopWords = spanglish
);
CREATE TEXT SEARCH DICTIONARY spanish_ispell (
TEMPLATE = ispell,
DictFile = es_any,
AffFile = es_any,
StopWords = spanish
);
Configuration
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball,
Language = english,
StopWords = english
);
CREATE TEXT SEARCH DICTIONARY spanish_stem (
TEMPLATE = snowball,
Language = spanish,
StopWords = spanish
);
Configuration
Parser: defines the
word boundaries
Configuration
Assign dictionaries (from most specific to most generic)
ALTER TEXT SEARCH CONFIGURATION teowaki
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword,
hword_part
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION teowaki
DROP MAPPING FOR email, url, url_path, sfloat, float;
debugging
select * from ts_debug('teowaki', 'I am searching unas
búsquedas con postgresql database');
also ts_lexize and ts_parse
tokenizing
tokens + position (stopwords are removed, tokens are folded)
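For instance, a sketch with the built-in english configuration:

```sql
-- 'the' is removed as a stopword; the rest is lower-cased, stemmed,
-- and stored with its position: 'brown':3 'fox':4 'quick':2
SELECT to_tsvector('english', 'The Quick Brown Foxes');
```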
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres');
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres:*');
operators
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres | mysql');
ranking weights
SELECT setweight(to_tsvector(coalesce(name,'')),'A') ||
setweight(to_tsvector(coalesce(description,'')),'B')
from wakis limit 1;
search by weight
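A weight label in the tsquery restricts matches to lexemes stored with that weight. A hedged sketch reusing the weighted vector from the previous slide (wakis, name, description as above):

```sql
-- 'postgres:A' only matches lexemes labelled A, i.e. hits in name
SELECT guid, description FROM wakis
WHERE setweight(to_tsvector('teowaki', coalesce(name,'')), 'A') ||
      setweight(to_tsvector('teowaki', coalesce(description,'')), 'B')
      @@ to_tsquery('teowaki', 'postgres:A');
```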
ranking
SELECT name, ts_rank(to_tsvector(name), query) rank
from wakis, to_tsquery('postgres | indexes') query
where to_tsvector(name) @@ query order by rank DESC;
also ts_rank_cd
highlighting
SELECT ts_headline(name, query) from wakis,
to_tsquery('teowaki', 'game|play') query
where to_tsvector('teowaki', name) @@ query;
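ts_headline also takes an options string to control the markers and the excerpt length; a sketch (the option values are illustrative):

```sql
-- Wrap matches in <b>...</b> and cap the excerpt at ten words
SELECT ts_headline('teowaki', name, query,
                   'StartSel=<b>, StopSel=</b>, MaxWords=10')
FROM wakis, to_tsquery('teowaki', 'game|play') query
WHERE to_tsvector('teowaki', name) @@ query;
```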
USE POSTGRESQL
FOR EVERYTHING
When PostgreSQL is not good
* You need to index files (PDF, Odx...)
* Your index is very big (slow reindex)
* You need a distributed index
* You need complex tokenizers
* You need advanced rankers
When PostgreSQL is not good
* You want a REST API
* You want sentence/proximity/range/
more complex operators
* You want search auto completion
* You want advanced features (alerts...)
But it has been
perfect for us so far.
Our users don't care
which search engine
we use, as long as
it works.