Steam Learn: Full Text Search with PostgreSQL
7th of August 2014
Full Text Search With PostgreSQL
by Vincent Desmares
Summary
1) What is a Full Text Search
2) A basic PostgreSQL example
3) FTS advanced features
4) FTS advanced configuration
What is a Full Text Search?
● Searching for documents
● Use the whole document
● Be able to set the precision
Why use a Full Text Search?
● Basic methods are = and ILIKE
○ No linguistic support in textual search operators
■ “countries” should match “country”
○ Can’t rank matches by relevance
○ Basic search is too slow for complex queries
● Why PostgreSQL?
○ Native and sufficiently performant
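The "countries" versus "country" point can be checked on literal strings, without any table. This is a minimal sketch that assumes the built-in 'english' text search configuration.

```sql
-- Pattern matching: 'country' is not a substring of 'countries', so this returns false.
SELECT 'countries' ILIKE '%country%' AS ilike_match;

-- Full text search: both words are reduced to the same lexeme ('countri'),
-- so this returns true.
SELECT to_tsvector('english', 'countries') @@ to_tsquery('english', 'country') AS fts_match;
```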
FTS basic usage
Search for:
● A document
● A business object
● A row
[Diagram: a library / database split into relevant documents and other documents]
What is a search?
● A query to run
● On a parsed text
SELECT *
FROM document
WHERE to_tsvector(title || ' ' || content) @@ to_tsquery('Car')
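To try the query above, a small table is enough. The schema and rows below are hypothetical (only the table and column names come from the slides), and the 'english' configuration is assumed.

```sql
-- Hypothetical minimal schema reusing the names from the slides.
CREATE TABLE document (
    id      serial PRIMARY KEY,
    title   text NOT NULL,
    content text NOT NULL
);

-- Illustrative rows, not taken from the talk's data set.
INSERT INTO document (title, content) VALUES
    ('Introduction', 'Hello my name is vincent. I am very happy to be vincent.'),
    ('Garage',       'My car is parked outside.');

-- The explicit ' ' keeps the last word of the title from fusing
-- with the first word of the content.
SELECT id, title
FROM document
WHERE to_tsvector('english', title || ' ' || content) @@ to_tsquery('english', 'Car');
```

With these two rows, only the 'Garage' document contains the lexeme 'car', so it is the only row expected to match.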
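How does to_tsvector work?
● Text is separated into tokens, stop words are dropped, and the remaining words are reduced to lexemes stored with their positions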
# select * from to_tsvector('Hello my name is vincent. I am very happy to be vincent.')
"'happi':9 'hello':1 'name':3 'vincent':5,12"
How does to_tsquery work?
● Parses a formatted query
# select to_tsquery('Vincent is Happy')
ERROR: syntax error in tsquery: "Vincent is Happy"
# select to_tsquery('Vincent & is & Happy')
"'vincent' & 'happi'"
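Since to_tsquery rejects free text, user input usually needs another entry point. A possible approach, sketched here with the 'english' configuration as an assumption, is plainto_tsquery, which accepts plain text.

```sql
-- plainto_tsquery parses plain text, drops stop words and joins the
-- remaining lexemes with &. This yields 'vincent' & 'happi'.
SELECT plainto_tsquery('english', 'Vincent is Happy');

-- to_tsquery remains useful when the operators are needed:
-- & (AND), | (OR), ! (NOT).
SELECT to_tsquery('english', 'Vincent & !sad');
```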
Result
The @@ operator is the FTS counterpart of =: it tests whether a tsvector matches a tsquery
# select content from document where to_tsquery('Vincent & is & Happy') @@ to_tsvector(content) limit 1
'Hello my name is vincent. I am very happy to be vincent.'
And it’s faaaaaaaaaaast
# select count(*) FROM document;
count | 11909475
# select count(*) FROM document where content_vector @@ to_tsquery('countries');
count | 424813
Time: 454.709 ms
# select count(*) FROM document where content ILIKE '%countries%';
count | 116734
Time: 11672.649 ms
Why is it faaaaaaaaaaaast?
● Indexed
● GIN (Generalized Inverted Index)
○ Slower to build, faster to search
● GiST (Generalized Search Tree)
○ Quicker to update, slower to search
CREATE INDEX document_tsvector_idx ON document USING gin (to_tsvector('english', title || ' ' || content));
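The timing query earlier filters on a content_vector column rather than on an expression. One way such a column could be maintained is sketched below; the trigger and index names are hypothetical, and the 'english' configuration is assumed.

```sql
-- Store the parsed document once instead of re-parsing it on every query.
ALTER TABLE document ADD COLUMN content_vector tsvector;

-- Backfill existing rows; coalesce guards against NULL columns.
UPDATE document
SET content_vector = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));

-- Keep the column in sync using the built-in trigger function.
CREATE TRIGGER document_vector_update
BEFORE INSERT OR UPDATE ON document
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(content_vector, 'pg_catalog.english', title, content);

-- Index the stored column directly.
CREATE INDEX document_content_vector_idx ON document USING gin (content_vector);
```

An expression index avoids the extra column, but queries must then repeat the exact indexed expression, including the configuration name, for the index to be used.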
Advanced Features
Ranked results
# select content, ts_rank_cd(to_tsvector(content), to_tsquery('Happy'), 1|8) as rank
from document
where to_tsquery('Happy') @@ to_tsvector(content)
ORDER BY rank DESC
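In the query above, the third argument of ts_rank_cd is a normalization bitmask: 1 divides the rank by 1 + the logarithm of the document length, and 8 divides it by the number of unique words. Ranking can also favor some fields over others. The sketch below is an illustration, not the speaker's query: it assumes the 'english' configuration and gives title matches weight A and content matches weight D.

```sql
SELECT title,
       ts_rank(
           -- Lexemes from the title carry the highest weight (A),
           -- lexemes from the content the lowest (D).
           setweight(to_tsvector('english', title), 'A') ||
           setweight(to_tsvector('english', content), 'D'),
           to_tsquery('english', 'Happy')
       ) AS rank
FROM document
WHERE to_tsvector('english', title || ' ' || content) @@ to_tsquery('english', 'Happy')
ORDER BY rank DESC
LIMIT 10;
```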
Google style results
# SELECT id, ts_headline(content, to_tsquery('Happy'),
'StartSel=<b>, StopSel=</b>, MaxWords=5, MinWords=4, ShortWord=3, HighlightAll=FALSE, MaxFragments=0, FragmentDelimiter=" ... "'
) FROM document WHERE to_tsquery('Happy') @@ to_tsvector(content)
"<b>Vincent</b> is <b>happy</b> because … very <b>happy</b> to be with ..."
Advanced Configuration
Simplest workflow
Original content: Vincent is very very Happy
ts_vector: 'happy':5 'is':2 'very':3,4 'vincent':1
Useless words? Stop Words!
● Just a file with a list of words
● Must be in the postgres tsearch directory, with a .stop extension
CREATE TEXT SEARCH DICTIONARY documentIspell (
TEMPLATE = ispell,
stopwords = 'my_file'
);
/usr/share/postgresql/9.3/tsearch_data/my_file.stop
With stop words
Original content: Vincent is very very Happy
Step 1: remove useless words (with only “is” in the .stop file)
ts_vector: 'happy':5 'very':3,4 'vincent':1
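The effect of a stop word list can be observed in isolation with ts_lexize. The sketch below swaps in the simpler built-in simple template and the english.stop file shipped with PostgreSQL, because a custom file would first have to be copied into the tsearch directory. The dictionary name is hypothetical.

```sql
-- A dictionary that only lowercases words and filters stop words.
CREATE TEXT SEARCH DICTIONARY document_simple (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english   -- resolves to tsearch_data/english.stop
);

-- A stop word yields an empty array: {}
SELECT ts_lexize('document_simple', 'is');

-- Any other word is lowercased and kept: {vincent}
SELECT ts_lexize('document_simple', 'Vincent');
```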
Custom dictfile
● Just a file with a list of words
● Contains suffix/affix metadata (can be custom)
CREATE TEXT SEARCH DICTIONARY documentIspell (
[...]
dictfile = 'my'
);
# cat my.dict | grep fry
fry/NGDS
# cat en_us.affix | grep S
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]
With linguistic dictionaries
Original content: Vincent is very very Happy
Step 1: remove useless words
Step 2: reduce words to their roots (with a custom .affix and .dict)
ts_vector: 'happi':5 'very':3,4 'vincent':1
The thesaurus
● Link business terms
# cat /var/postgresql/9.3/tsearch_data/inovia_learn.ths
McDo : *McDonalds
Mc do : *McDonalds
very happy : *blessed
The final chain
Original content: Vincent is very very Happy
Step 1: remove useless words
Step 2: reduce words to their roots
Step 3: replace words with their synonyms (with a custom .ths)
ts_vector: 'very':3 'blessed':4 'vincent':1
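A thesaurus file only takes effect once it is wrapped in a dictionary and mapped into a text search configuration. The sketch below shows one possible wiring, assuming inovia_learn.ths is already in the tsearch directory. The dictionary and configuration names are hypothetical, and the simple dictionary is used as subdictionary because the sample phrase "very happy" contains a word that the English stemmer treats as a stop word.

```sql
-- Wrap the thesaurus file in a dictionary.
CREATE TEXT SEARCH DICTIONARY inovia_thesaurus (
    TEMPLATE = thesaurus,
    DictFile = inovia_learn,          -- resolves to tsearch_data/inovia_learn.ths
    Dictionary = pg_catalog.simple    -- subdictionary used to normalize the entries
);

-- Start from the built-in English configuration.
CREATE TEXT SEARCH CONFIGURATION inovia (COPY = pg_catalog.english);

-- Try the thesaurus first, then fall back to the English stemmer.
ALTER TEXT SEARCH CONFIGURATION inovia
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH inovia_thesaurus, english_stem;

-- Parse a sentence with the new chain.
SELECT to_tsvector('inovia', 'Vincent is very very Happy');
```

The order in the WITH clause matters: a token is handed to each dictionary in turn until one of them recognizes it.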
How to debug?
# SELECT * FROM ts_debug('Vincent is very very happy')
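ts_debug returns one row per token, which makes it easy to see which dictionary handled each word. A narrower projection, assuming the 'english' configuration, is often easier to read:

```sql
-- alias:      token type assigned by the parser (asciiword, blank, ...)
-- dictionary: the dictionary that recognized the token, NULL if none did
-- lexemes:    the output lexemes; an empty array means a stop word
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'Vincent is very very happy');
```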
The drawbacks (yes, last slide)
● Transformed words (lexemes) are indexed
○ Only full or prefix match available
Solution: autocomplete
● Business terms have custom meanings
Ex: fry (third-person singular simple present fries)
Solution: custom dictionary
● Indexes take a long time to build
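The matching limitation can be seen directly: a tsquery can match the beginning of a lexeme with the :* operator, but not an arbitrary fragment of it. A minimal sketch, assuming the 'english' configuration:

```sql
-- Matching on the beginning of a word: 'vinc':* matches the lexeme 'vincent',
-- so this returns true.
SELECT to_tsvector('english', 'Hello my name is vincent')
       @@ to_tsquery('english', 'vinc:*') AS prefix_match;

-- A fragment from the middle or end of the word is a different lexeme,
-- so this returns false.
SELECT to_tsvector('english', 'Hello my name is vincent')
       @@ to_tsquery('english', 'cent') AS fragment_match;
```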
Thank you!
Sources:
http://www.postgresql.org/docs/9.3/static/textsearch.html
http://en.wikipedia.org/wiki/Full_text_search
http://en.wikipedia.org/wiki/Precision_and_recall
For online questions, please leave a comment on the article.
Questions ?
Join the community! (in Paris)
Social networks:
● Follow us on Twitter: https://twitter.com/steamlearn
● Like us on Facebook: https://www.facebook.com/steamlearn
SteamLearn is an Inovia initiative: inovia.fr
Want to be in the audience? Contact us at [email protected]