PostgreSQL full-text search demystified
DESCRIPTION
How does a full-text search engine work? How is the index built and searched? Can I use PostgreSQL as a full-text search engine, or should I go for a more specialised solution? How does one configure and use PostgreSQL search? This presentation covers all of those aspects, based on the work we did to index teowaki.com. It was presented at PgConf EU 2014 in Madrid.
TRANSCRIPT
PgConf EU 2014 presents
Javier Ramirez
PostgreSQL
Full-text search
demystified
@supercoco9
https://teowaki.com
The problem
our architecture
One does not simply
SELECT * FROM stuff WHERE
content ILIKE '%postgresql%'
Basic search features
* stemmers (run, runner, running)
* unaccented (josé, jose)
* results highlighting
* rank results by relevance
Nice to have features
* partial searches
* search operators (OR, AND...)
* synonyms (postgres, postgresql, pgsql)
* thesaurus (OS=Operating System)
* fast, and space-efficient
* debugging
Good News:
PostgreSQL supports all
the requested features
Bad News:
unless you already know about search
engines, the official docs are not obvious
How a search engine works
* An indexing phase
* A search phase
The indexing phase
Convert the input text to tokens
The search phase
Match the search terms to
the indexed tokens
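The two phases meet in one line of SQL: to_tsvector tokenizes a document, to_tsquery tokenizes the search term, and @@ matches them. A minimal sketch using PostgreSQL's built-in english configuration (the sample text is invented):

```sql
-- Both sides are reduced to the same stemmed tokens, so this returns true:
-- 'indexes' in the document and 'index' in the query both become 'index'.
SELECT to_tsvector('english', 'PostgreSQL indexes are fast')
       @@ to_tsquery('english', 'index');
```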
indexing in depth
* choose an index format
* tokenize the words
* apply token analysis/filters
* discard unwanted tokens
the index format
* tree/signature indexes (GiST in PostgreSQL)
* inverted indexes (GIN in PostgreSQL)
* dynamic/distributed indexes
dynamic indexes: segmentation
* sometimes the token index is
segmented to allow faster updates
* consolidate segments to speed-up
search and account for deletions
tokenizing
* parse/strip/convert format
* normalize terms (unaccent, ascii,
charsets, case folding, number precision..)
token analysis/filters
* find synonyms
* expand thesaurus
* stem (maybe in different languages)
more token analysis/filters
* eliminate stopwords
* store word distance/frequency
* store the full contents of some fields
* store some fields as attributes/facets
“the index file” is really
* a token file, probably segmented/distributed
* some dictionary files: synonyms, thesaurus,
stopwords, stems/lexemes (in different languages)
* word distance/frequency info
* attributes/original field files
* optional geospatial index
* auxiliary files: word/sentence boundaries, meta-info,
parser definitions, datasource definitions...
the hardest
part is now
over
searching in depth
* tokenize/analyse
* prepare operators
* retrieve information
* rank the results
* highlight the matched parts
searching in depth: tokenize
normalize, tokenize, and analyse
the original search term
the result would be a tokenized, stemmed,
“synonymised” term, without stopwords
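In PostgreSQL, plainto_tsquery does this normalisation of the search term for you. A sketch using the built-in english configuration:

```sql
-- 'The' is dropped as a stopword; the other words are case-folded
-- and stemmed, giving the tsquery: 'run' & 'databas'
SELECT plainto_tsquery('english', 'The Running Databases');
```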
searching in depth: operators
* partial search
* logical/geospatial/range operators
* in-sentence/in-paragraph/word distance
* faceting/grouping
searching in depth: retrieval
Go through the token index files, use the
attributes and geospatial files if necessary
for operators and/or grouping
You might need to do this in a distributed way
searching in depth: ranking
algorithm to sort the most relevant results:
* field weights
* word frequency/density
* geospatial or timestamp ranking
* ad-hoc ranking strategies
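In PostgreSQL terms, field weights feed straight into ranking: ts_rank accepts an optional weight array {D, C, B, A}. A hedged sketch (the sample texts are invented):

```sql
-- With this weight array, lexemes labelled A (the title) count ten
-- times more than lexemes labelled D (the body) when ranking.
SELECT ts_rank('{0.1, 0.1, 0.1, 1.0}',
               setweight(to_tsvector('english', 'PostgreSQL search'), 'A') ||
               setweight(to_tsvector('english', 'a long body text'), 'D'),
               to_tsquery('english', 'postgresql'));
```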
searching in depth: highlighting
Mark the matching parts of the results
It can be tricky/slow if you are not storing the full contents
in your indexes
PostgreSQL as a
full-text
search engine
search features
* index format configuration
* partial search
* word boundaries parser (not configurable)
* stemmers/synonyms/thesaurus/stopwords
* full-text logical operators
* attributes/geo/timestamp/range (using SQL)
* ranking strategies
* highlighting
* debugging/testing commands
indexing in postgresql
you don't actually need an index to use full-text search in PostgreSQL
but unless your db is very small, you want to have one
Choose GiST or GIN (GIN: faster search, but slower
indexing and a larger index)
CREATE INDEX pgweb_idx ON pgweb USING
gin(to_tsvector(config_name, body));
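An alternative to the expression index above is to materialise the tsvector in its own column and keep it current with PostgreSQL's built-in trigger; a sketch (the body_tsv column name is made up):

```sql
-- Store the tsvector once, keep it updated on writes, index the column.
ALTER TABLE pgweb ADD COLUMN body_tsv tsvector;
UPDATE pgweb SET body_tsv = to_tsvector('english', coalesce(body, ''));
CREATE TRIGGER pgweb_tsv_update BEFORE INSERT OR UPDATE ON pgweb
  FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);
CREATE INDEX pgweb_tsv_idx ON pgweb USING gin(body_tsv);
```

This trades disk space for not re-running to_tsvector at query time.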
Two new things
CREATE INDEX ... USING gin(to_tsvector (config_name, body));
* to_tsvector: postgresql way of saying “tokenize”
* config_name: tokenizing/analysis rule set
Configuration
CREATE TEXT SEARCH CONFIGURATION
public.teowaki ( COPY = pg_catalog.english );
Configuration
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
StopWords = spanglish
);
CREATE TEXT SEARCH DICTIONARY spanish_ispell (
TEMPLATE = ispell,
DictFile = es_any,
AffFile = es_any,
StopWords = spanish
);
Configuration
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball,
Language = english,
StopWords = english
);
CREATE TEXT SEARCH DICTIONARY spanish_stem (
TEMPLATE = snowball,
Language = spanish,
StopWords = spanish
);
Configuration
Parser: defines the
word boundaries
Configuration
Assign dictionaries (from most specific to most generic)
ALTER TEXT SEARCH CONFIGURATION teowaki
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword,
hword_part
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION teowaki
DROP MAPPING FOR email, url, url_path, sfloat, float;
debugging
select * from ts_debug('teowaki', 'I am searching unas
búsquedas con postgresql database');
also ts_lexize and ts_parse
tokenizing
tokens + position (stopwords are removed, tokens are folded)
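For instance, a sketch with the built-in english configuration:

```sql
-- 'the' is removed as a stopword; the rest is lower-cased, stemmed,
-- and stored with its position: 'brown':3 'fox':4 'quick':2
SELECT to_tsvector('english', 'The Quick Brown Foxes');
```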
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres');
searching
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres:*');
operators
SELECT guid, description from wakis where
to_tsvector('teowaki',description)
@@ to_tsquery('teowaki','postgres | mysql');
ranking weights
SELECT setweight(to_tsvector(coalesce(name,'')),'A') ||
setweight(to_tsvector(coalesce(description,'')),'B')
from wakis limit 1;
search by weight
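A weight label in the tsquery restricts matches to lexemes stored with that weight. A hedged sketch reusing the weighted vector from the previous slide (wakis, name, description as above):

```sql
-- 'postgres:A' only matches lexemes labelled A, i.e. hits in name
SELECT guid, description FROM wakis
WHERE setweight(to_tsvector('teowaki', coalesce(name,'')), 'A') ||
      setweight(to_tsvector('teowaki', coalesce(description,'')), 'B')
      @@ to_tsquery('teowaki', 'postgres:A');
```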
ranking
SELECT name, ts_rank(to_tsvector(name), query) rank
from wakis, to_tsquery('postgres | indexes') query
where to_tsvector(name) @@ query order by rank DESC;
also ts_rank_cd
highlighting
SELECT ts_headline(name, query) from wakis,
to_tsquery('teowaki', 'game|play') query
where to_tsvector('teowaki', name) @@ query;
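ts_headline also takes an options string to control the markers and the excerpt length; a sketch (the option values are illustrative):

```sql
-- Wrap matches in <b>...</b> and cap the excerpt at ten words
SELECT ts_headline('teowaki', name, query,
                   'StartSel=<b>, StopSel=</b>, MaxWords=10')
FROM wakis, to_tsquery('teowaki', 'game|play') query
WHERE to_tsvector('teowaki', name) @@ query;
```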
USE POSTGRESQL
FOR EVERYTHING
When PostgreSQL is not good
* You need to index files (PDF, Odx...)
* Your index is very big (slow reindex)
* You need a distributed index
* You need complex tokenizers
* You need advanced rankers
When PostgreSQL is not good
* You want a REST API
* You want sentence/proximity/range/
more complex operators
* You want search auto completion
* You want advanced features (alerts...)
But it has been
perfect for us so far.
Our users don't care
which search engine
we use, as long as
it works.