steam learn: full text search with postgresql

7th of August 2014 Full Text Search With PostgreSQL by Vincent Desmares

Upload: inovia

Post on 17-Jul-2015


Page 1: Steam Learn: Full text search with PostgreSQL

7th of August 2014

Full Text Search With PostgreSQL

by Vincent Desmares

Page 2: Steam Learn: Full text search with PostgreSQL

Summary

1) What is a Full Text Search

2) A basic PostgreSQL example

3) FTS advanced features

4) FTS advanced configuration

Page 3: Steam Learn: Full text search with PostgreSQL

What is a Full Text Search?

● Searching for documents

● Use the whole document

● Be able to set the precision

Page 4: Steam Learn: Full text search with PostgreSQL

Why use a Full Text Search?

● Basic methods are = and ILIKE

○ No linguistic support in the basic text search operators

■ “countries” should match “country”

○ Can’t rank matches by relevance

○ Basic search is too slow for complex queries

● Why PostgreSQL?

○ Native and sufficiently performant
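
The “countries” versus “country” point can be sketched with two literal strings, no table needed. The 'english' configuration is an assumption here; the slides rely on the server default.

```sql
-- ILIKE compares raw characters, so the plural form does not match.
-- Full text search compares stemmed lexemes: both forms reduce to 'countri'.
SELECT
    'I love my country' ILIKE '%countries%'     AS ilike_match,
    to_tsvector('english', 'I love my country')
        @@ to_tsquery('english', 'countries')   AS fts_match;
```

With the English stemmer, ilike_match is false and fts_match is true.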

Page 5: Steam Learn: Full text search with PostgreSQL

FTS basic usage

Search for:

● A document

● A business object

● A row

[Diagram: a library / database, split into relevant documents and other documents]

Page 6: Steam Learn: Full text search with PostgreSQL

What is a search?

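● A query to run

● On a parsed text

-- The ' ' separator keeps the last word of title from fusing with the first word of content
SELECT *
FROM document
WHERE to_tsvector(title || ' ' || content) @@ to_tsquery('Car');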
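
A minimal, self-contained sketch of the same search. The table layout, the sample rows and the explicit 'english' configuration are assumptions made for illustration; the slides only show the query.

```sql
-- Hypothetical table, only for illustration.
CREATE TEMP TABLE document (
    id      serial PRIMARY KEY,
    title   text NOT NULL,
    content text NOT NULL
);

INSERT INTO document (title, content) VALUES
    ('Cars',    'A car is a wheeled motor vehicle.'),
    ('Gardens', 'A garden is a planned outdoor space.');

-- to_tsvector produces the parsed text, to_tsquery the query to run on it.
SELECT id, title
FROM document
WHERE to_tsvector('english', title || ' ' || content)
      @@ to_tsquery('english', 'Car');
```

Only the first row should match, since “Cars” and “car” both reduce to the lexeme 'car'.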

Page 7: Steam Learn: Full text search with PostgreSQL

How does to_tsvector work?

● Text is separated into tokens, each token is reduced to a lexeme, and stop words are dropped

# select * from to_tsvector('Hello my name is vincent. I am very happy to be vincent.')

"'happi':9 'hello':1 'name':3 'vincent':5,12"

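The two stages behind to_tsvector can be looked at separately. This is a sketch that assumes the built-in 'default' parser and the built-in english_stem dictionary.

```sql
-- Stage 1: the parser splits the text into typed tokens.
SELECT tokid, token
FROM ts_parse('default', 'I am very happy');

-- Stage 2: a dictionary turns a token into a lexeme.
SELECT ts_lexize('english_stem', 'happy');  -- {happi}
SELECT ts_lexize('english_stem', 'very');   -- {} because 'very' is an English stop word
```

An empty array means the dictionary recognized the word and discarded it, which is why 'very' is absent from the tsvector above.
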
Page 8: Steam Learn: Full text search with PostgreSQL

How does to_tsquery work?

● Parses a formatted query

# select to_tsquery('Vincent is Happy')

ERROR: syntax error in tsquery: "Vincent is Happy"

# select to_tsquery('Vincent & is & Happy')

"'vincent' & 'happi'"

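Since to_tsquery rejects free text, raw user input is usually passed to plainto_tsquery instead. A short sketch, assuming the 'english' configuration:

```sql
-- to_tsquery expects explicit operators: & (and), | (or), ! (not).
SELECT to_tsquery('english', 'Vincent & (Happy | Sad) & !Angry');

-- plainto_tsquery accepts plain text, drops stop words
-- and joins the remaining words with &.
SELECT plainto_tsquery('english', 'Vincent is Happy');  -- 'vincent' & 'happi'
```
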
Page 9: Steam Learn: Full text search with PostgreSQL

Result

The @@ operator is the FTS counterpart of =: it returns true when the tsvector matches the tsquery

# select content from document where to_tsquery('Vincent & is & Happy') @@ to_tsvector(content) limit 1

'Hello my name is vincent. I am very happy to be vincent.'

Page 10: Steam Learn: Full text search with PostgreSQL

And it’s faaaaaaaaaaast

# select count(*) FROM document;
count | 11909475

# select count(*) FROM document where content_vector @@ to_tsquery('countries');
count | 424813
Time: 454.709 ms

# select count(*) FROM document where content ILIKE '%countries%';
count | 116734
Time: 11672.649 ms
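
The content_vector column used above is a precomputed tsvector. The slides do not show how it is filled, so the following is only one possible setup: it assumes the 'english' configuration and uses the built-in tsvector_update_trigger function; the table definition and trigger name are hypothetical.

```sql
-- Hypothetical table, only for illustration.
CREATE TEMP TABLE document (id serial PRIMARY KEY, content text);

-- Store the parsed text next to the raw text.
ALTER TABLE document ADD COLUMN content_vector tsvector;

-- Fill the column for rows that already exist.
UPDATE document
SET content_vector = to_tsvector('english', coalesce(content, ''));

-- Keep the column in sync on every insert or update.
CREATE TRIGGER document_content_vector_update
    BEFORE INSERT OR UPDATE ON document
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(content_vector, 'pg_catalog.english', content);
```

Storing the vector avoids re-parsing every row at query time, at the cost of extra disk space.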

Page 11: Steam Learn: Full text search with PostgreSQL

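Why is it faaaaaaaaaaaast?

● Indexed

● GIN (Generalized Inverted Index)

○ Slower to build, faster to search

● GiST (Generalized Search Tree)

○ Quicker to update, slower to search

-- An expression index needs parentheses around the expression and the two-argument
-- form of to_tsvector; 'english' is an assumed configuration
CREATE INDEX document_tsvector_idx ON document USING gin (to_tsvector('english', title || ' ' || content));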
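
When the tsvector is stored in its own column, the index can be built directly on that column. A sketch under that assumption, with a hypothetical table and index names:

```sql
-- Hypothetical table with a precomputed tsvector column.
CREATE TEMP TABLE document (
    id             serial PRIMARY KEY,
    content        text,
    content_vector tsvector
);

-- GIN: slower to build and update, faster to search.
CREATE INDEX document_content_vector_gin
    ON document USING gin (content_vector);

-- GiST alternative: quicker to update, slower to search.
-- CREATE INDEX document_content_vector_gist
--     ON document USING gist (content_vector);

-- Check whether the planner uses the index for a search.
EXPLAIN
SELECT count(*)
FROM document
WHERE content_vector @@ to_tsquery('english', 'countries');
```

On a nearly empty table the planner may still prefer a sequential scan; the index pays off as the table grows.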

Page 12: Steam Learn: Full text search with PostgreSQL

Advanced Features

Page 13: Steam Learn: Full text search with PostgreSQL

Ranked results

# select content, ts_rank_cd(to_tsvector(content), to_tsquery('Happy'), 1|8) as rank
from document
where to_tsquery('Happy') @@ to_tsvector(content)
ORDER BY rank DESC
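
The third argument of ts_rank_cd is a normalization bit mask: 1 divides the rank by 1 + the logarithm of the document length, and 8 divides it by the number of unique words. The sketch below also weights the title above the content with setweight, which the slides do not do; the table, the rows and the 'english' configuration are assumptions.

```sql
-- Hypothetical table, only for illustration.
CREATE TEMP TABLE document (id serial PRIMARY KEY, title text, content text);

INSERT INTO document (title, content) VALUES
    ('Happy days', 'A short note about the weather.'),
    ('Weather',    'Vincent is happy because the sun is out.');

SELECT id, title,
       ts_rank_cd(
           setweight(to_tsvector('english', title),   'A') ||  -- title counts more
           setweight(to_tsvector('english', content), 'B'),
           q,
           1 | 8                                               -- normalization flags
       ) AS rank
FROM document, to_tsquery('english', 'Happy') AS q
WHERE to_tsvector('english', title || ' ' || content) @@ q
ORDER BY rank DESC;
```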

Page 14: Steam Learn: Full text search with PostgreSQL

Google style results

# SELECT id, ts_headline(content, to_tsquery('Happy'),
'StartSel=<b>, StopSel=</b>, MaxWords=5, MinWords=4, ShortWord=3, HighlightAll=FALSE, MaxFragments=0, FragmentDelimiter=" ... "'
) FROM document WHERE to_tsquery('Happy') @@ to_tsvector(content)

“<b>Vincent</b> is <b>happy</b> because … very <b>happy</b> to be with ...”
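
ts_headline can be tried on a literal string before running it on a table. A sketch assuming the 'english' configuration and a subset of the options shown above:

```sql
-- Highlight the query terms inside a single document.
SELECT ts_headline(
    'english',
    'Hello my name is vincent. I am very happy to be vincent.',
    to_tsquery('english', 'Happy'),
    'StartSel=<b>, StopSel=</b>, MaxWords=5, MinWords=4, ShortWord=3'
);
```

ts_headline works on the original text rather than on the tsvector, so it is typically applied to the few rows that are actually displayed.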

Page 15: Steam Learn: Full text search with PostgreSQL

Advanced Configuration

Page 16: Steam Learn: Full text search with PostgreSQL

Simplest workflow

Original content: Vincent is very very Happy

ts_vector: 'happy':5 'is':2 'very':3,4 'vincent':1
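
This simplest workflow corresponds to the built-in 'simple' configuration, which lowercases words without removing stop words or stemming. For comparison, the second query uses 'english'.

```sql
-- No stop words, no stemming: every word is kept as-is, lowercased.
SELECT to_tsvector('simple', 'Vincent is very very Happy');
-- 'happy':5 'is':2 'very':3,4 'vincent':1

-- English stop words and stemming applied.
SELECT to_tsvector('english', 'Vincent is very very Happy');
-- 'happi':5 'vincent':1
```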

Page 17: Steam Learn: Full text search with PostgreSQL

Useless words? Stop Words!

● Just a file with a list of words

● Must be in the postgres tsearch directory

CREATE TEXT SEARCH DICTIONARY documentIspell (
  TEMPLATE = ispell,
  StopWords = 'my_file'  -- loads my_file.stop; the ispell template also requires DictFile and AffFile
);

/usr/share/postgresql/9.3/tsearch_data/my_file.stop

Page 18: Steam Learn: Full text search with PostgreSQL

With stop words

Original content: Vincent is very very Happy

ts_vector: 'happy':5 'very':3,4 'vincent':1

Step applied: remove useless words (with only “is” in the .stop file)
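
A stop-word-only chain can be sketched with the simple template, which needs no dictionary or affix file. The names document_stop and document_stop_config are hypothetical, and the statements assume that my_file.stop already exists in the server's tsearch_data directory and contains only “is”.

```sql
-- Dictionary that only lowercases words and removes stop words.
CREATE TEXT SEARCH DICTIONARY document_stop (
    TEMPLATE  = pg_catalog.simple,
    StopWords = my_file              -- loads my_file.stop
);

-- Configuration that routes plain ASCII words through that dictionary.
CREATE TEXT SEARCH CONFIGURATION document_stop_config (COPY = pg_catalog.simple);

ALTER TEXT SEARCH CONFIGURATION document_stop_config
    ALTER MAPPING FOR asciiword WITH document_stop;

SELECT to_tsvector('document_stop_config', 'Vincent is very very Happy');
```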

Page 19: Steam Learn: Full text search with PostgreSQL

Custom dictfile

● Just a file with a list of words

● Contains suffix/affix metadata (can be custom)

CREATE TEXT SEARCH DICTIONARY documentIspell (
  [...]
  DictFile = 'my'  -- loads my.dict from the tsearch_data directory; the extension is not part of the name
);

# cat my.dict | grep fry
fry/NGDS

# cat en_us.affix | grep S
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]

Page 20: Steam Learn: Full text search with PostgreSQL

With linguistic dictionaries

Original content: Vincent is very very Happy

ts_vector: 'happi':5 'very':3,4 'vincent':1

Steps applied: remove useless words, then reduce words to their roots (with a custom .affix and .dict)
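
Put together, the Ispell dictionary and its configuration could look like the sketch below. The names document_ispell and document_ispell_config are hypothetical, and the statements assume that my.dict, en_us.affix and my_file.stop already exist in the server's tsearch_data directory. Falling back to english_stem for words unknown to Ispell is a choice of this sketch, not something the slides show.

```sql
-- The ispell template needs a dictionary file and an affix file.
CREATE TEXT SEARCH DICTIONARY document_ispell (
    TEMPLATE  = ispell,
    DictFile  = my,        -- my.dict
    AffFile   = en_us,     -- en_us.affix
    StopWords = my_file    -- my_file.stop
);

-- Test the dictionary on a single word.
SELECT ts_lexize('document_ispell', 'fries');

-- Use it first, then fall back to the Snowball stemmer.
CREATE TEXT SEARCH CONFIGURATION document_ispell_config (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION document_ispell_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH document_ispell, english_stem;
```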

Page 21: Steam Learn: Full text search with PostgreSQL

The thesaurus

● Link business terms

# cat /var/postgresql/9.3/tsearch_data/inovia_learn.ths
McDo : *McDonalds
Mc do : *McDonalds
very happy : *blessed

Page 22: Steam Learn: Full text search with PostgreSQL

The final chain

Original content: Vincent is very very Happy

ts_vector: 'very':3 'blessed':4 'vincent':1

Steps applied: remove useless words, reduce words to their roots, replace words with their synonyms (with a custom .ths)
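
A thesaurus dictionary is declared like the other dictionaries, plus a sub-dictionary used to normalize the phrases in the .ths file. In this sketch the names inovia_thesaurus and document_ths_config are hypothetical, the file inovia_learn.ths is assumed to exist in tsearch_data, and pg_catalog.simple is picked as the sub-dictionary because the sample phrase contains “very”; a sub-dictionary that treats “very” as a stop word would need the ? placeholder described in the PostgreSQL documentation.

```sql
CREATE TEXT SEARCH DICTIONARY inovia_thesaurus (
    TEMPLATE   = thesaurus,
    DictFile   = inovia_learn,       -- inovia_learn.ths
    Dictionary = pg_catalog.simple   -- normalizes the sample phrases
);

CREATE TEXT SEARCH CONFIGURATION document_ths_config (COPY = pg_catalog.english);

-- The thesaurus goes first so that it sees phrases before the stemmer does.
ALTER TEXT SEARCH CONFIGURATION document_ths_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH inovia_thesaurus, english_stem;

SELECT to_tsvector('document_ths_config', 'Vincent is very very Happy');
```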

Page 23: Steam Learn: Full text search with PostgreSQL

How to debug?

# Select * FROM ts_debug('Vincent is very very happy')
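
ts_debug returns one row per token. Selecting a few columns keeps the output readable; the 'english' configuration is an assumption.

```sql
-- For each token: its type, the dictionaries consulted,
-- the dictionary that recognized it and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'Vincent is very very happy');
```

A row with empty lexemes ({}) is a stop word; a NULL dictionary means no dictionary recognized the token.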

Page 24: Steam Learn: Full text search with PostgreSQL

The drawbacks (yes, last slide)

● Transformed words (lexemes) are indexed

○ Only full or prefix match available

Solution: autocomplete

● Business terms have custom meanings

Ex: fry (third-person singular simple present: fries)

Solution: custom dictionary

● Indexes take a long time to build
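
In PostgreSQL, the partial match available on lexemes is a prefix match, written with :* in the tsquery. A small sketch using the 'simple' configuration so that stemming does not interfere:

```sql
SELECT
    -- 'vinc' is the beginning of the lexeme 'vincent': match.
    to_tsvector('simple', 'Vincent is very happy')
        @@ to_tsquery('simple', 'vinc:*') AS prefix_match,
    -- 'cent' is only the end of a lexeme: no match.
    to_tsvector('simple', 'Vincent is very happy')
        @@ to_tsquery('simple', 'cent')   AS suffix_match;
```

prefix_match is true and suffix_match is false, which is why searching inside words needs another mechanism.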

Page 25: Steam Learn: Full text search with PostgreSQL

Thank you!

Sources:

http://www.postgresql.org/docs/9.3/static/textsearch.html

http://en.wikipedia.org/wiki/Full_text_search

http://en.wikipedia.org/wiki/Precision_and_recall

For online questions, please leave a comment on the article.

Questions?

Page 27: Steam Learn: Full text search with PostgreSQL

Join the community! (in Paris)

Social networks:

● Follow us on Twitter: https://twitter.com/steamlearn

● Like us on Facebook: https://www.facebook.com/steamlearn

SteamLearn is an Inovia initiative: inovia.fr

Do you wish to be in the audience? Contact us at [email protected]