Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)


Upload: jamesphanson

Post on 20-Jul-2015


TRANSCRIPT

Page 1: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Rank Your Results: Using Full Text Search with Natural Language Queries in PostgreSQL to get Ranked Results

Jamey Hanson
Freedom Consulting Group
http://www.freedomconsultinggroup.com

PGConf US, NYC March 26, 2015

Page 2: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

What is PostgreSQL Full Text Search?

"Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query."

• PostgreSQL 9.4 documentation, section 12.1

Page 3: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

What makes Full Text Search so useful?

• Focus on semantics, rather than syntax
• Keep documents within PostgreSQL
  • Apache Solr, Lucene, Sphinx etc. require their own copy of the data
• Simple to keep indexes up to date
• ~20 times faster than SQL search (LIKE, ILIKE, etc.) with default configuration
• Fast enough for nearly all applications

Page 4: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

What makes Full Text Search so useful?

• Traditional search reveals document existence, ranked results indicate relevance
• Customers expect every search to return ranked results
• FTS can be combined with the full suite of PostgreSQL query tools, such as SQL, regex, LIKE/ILIKE, wildcards and function-based indexes
• The FTS parser, stop word list, synonyms, thesaurus and language are all customizable at the statement level
• FTS is extensible with 3rd party dictionaries and more

Page 5: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Agenda

• Infrastructure and data wrangling
• Creating FTS tables with maintenance triggers
• Compare FTS with traditional SQL searches and run FTSs on documents from early American History
• Rank search results on documents from Data Science
• Generate HTML-tagged fragments with matching terms
• Customize the stop-word dictionary
• Suggest spelling options for query terms
• Re-write queries at run-time

I will move between slides, demonstrations and SQL scripts. We will not review every slide in the file.

Page 6: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

About the author

Jamey Hanson

• I manage a team for Freedom Consulting Group migrating applications from Oracle to Postgres Plus Advanced Server and PostgreSQL in the government space. We are subcontracting to EnterpriseDB.
• Overly certified: PMP, CISSP, CSEP, OCP in 5 versions of Oracle, Cloudera developer & admin. Used to be NetApp admin and MCSE.
• I teach PMP and CISSP at the Univ. MD training center.
• Alumnus of multiple schools and was C-130 aircrew.

Page 7: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Presentation infrastructure …

• PostgreSQL 9.4.1 EnterpriseDB free package
• CentOS 7.0 VM with 2GB RAM and 2 CPU cores
• 2 sets of documents to search …
  • Primary documents in American History: The American Revolution and the New Nation (Library of Congress) http://www.loc.gov/rr/program/bib/ourdocs/NewNation.html
  • Text from Data Science Girl’s August 15, 2014 blog post “38 Seminal Articles Every Data Scientist Should Read” http://www.datasciencecentral.com/profiles/blogs/30-seminal-articles-every-data-scientist-should-read

Page 8: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

… Presentation infrastructure

• Used pgAdmin3 for SQL and administration
• A few Linux shell commands to manage the files
• American history documents were cut & pasted from the Web into MS Notepad
• Data Science .pdf files were downloaded, converted to text with pdftotext and manually divided into abstract and body files

Page 9: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

How does FTS work?

• FTS is built on lexemes, which are (essentially) word roots without tense, possessive, plurality or other endings.
  “It is a basic unit of meaning, and the headwords of a dictionary are all lexemes” (The Cambridge Encyclopedia of The English Language)
• For example ...
  • The lexemes of jump, jumps, jumped and jumping are all ‘jump’
  • The lexemes of excite, excites, exciting and excited are all ‘excit’
• Lexemes are stored in lower case (i.e. matching is case insensitive)

Page 10: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

How does FTS work?

• Lexemes are organized into TSVECTORs, which are sorted arrays of lexemes with associated position and (optionally) weight. Documents are stored as TSVECTORs
• Query against TSVECTORs using TSQUERYs, which are arrays of lexemes with BOOLEAN operators but without position or weight
• Match a TSQUERY to a TSVECTOR with the @@ operator
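As a minimal sketch of the @@ match described above (the sample sentence is made up for illustration and is not from the talk's scripts), a tsvector can be tested against a tsquery directly in a SELECT:

```sql
-- A document matches when its tsvector satisfies the Boolean tsquery.
-- 'foxes' and 'jumped' are reduced to the lexemes 'fox' and 'jump'.
SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs')
           @@ to_tsquery('english', 'fox & jump') AS both_terms,       -- true
       to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs')
           @@ to_tsquery('english', 'fox & !dog') AS fox_without_dog;  -- false
```

Passing the configuration name ('english') explicitly keeps the result independent of the session's default_text_search_config.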

Page 11: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

How does TO_TSVECTOR work?

1. Parses the text document into tokens using white space, non-printing characters and punctuation
2. Assigns a class (i.e. type) to each token. The 23 classes include word, email, number, URL, etc.
3. ‘Word’ tokens are normalized into lexemes using dictionaries
4. Lexemes are processed to …
   a. Remove stop words (common words such as ‘and’, ‘or’, ‘the’)
   b. Add synonyms
   c. Add phrase matching
5. Lexemes are assembled into TSVECTORs by noting the position, recording weight and removing duplicates

This process is controlled by a TEXT SEARCH CONFIGURATION, which ties a parser to a set of TEXT SEARCH DICTIONARYs
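The steps above can be observed with two built-in functions, ts_token_type and ts_debug. This is only a sketch for exploration; the sample sentence is borrowed from the tsvector example in this talk, and the output depends on the installed configuration.

```sql
-- List the token classes produced by the default parser
SELECT alias, description
FROM ts_token_type('default');

-- Show, token by token, which class was assigned, which dictionaries were
-- consulted and which lexemes (if any) came out the other end.
-- Stop words such as 'to' and 'be' end up with an empty lexeme list.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'We hold these truths to be self evident');
```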

Page 12: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

How does FTS match documents?

• TSVECTORs are compared to TSQUERYs with the @@ operator
• TSQUERYs are built with the TO_TSQUERY or PLAINTO_TSQUERY functions …

Never mind … let’s jump to some examples, which are much easier to understand.

Page 13: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

GOTO …

• Explore TSVECTORs and TSQUERYs: 00_FTS_explore_tsvector_tsquery_v10.sql

Page 14: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

TSVECTOR and TSQUERY

-- What do lexemes look like?
SELECT TO_TSVECTOR('enumerate') AS enumerate,
       TO_TSVECTOR('enumerated') AS enumerated,
       TO_TSVECTOR('enumerates') AS enumerates,
       TO_TSVECTOR('enumerating') AS enumerating,
       TO_TSVECTOR('enumeration') AS enumeration;
-- all forms of the word have the same lexeme, 'enumer'

-- Example tsvector
SELECT TO_TSVECTOR('We hold these truths to be self evident');
-- 'evid':8 'hold':2 'self':7 'truth':4
-- tsvectors are sorted arrays of lexemes with position and (optionally) weight
-- notice that common words, a.k.a. stop words, like 'to' and 'be' are not included

Page 15: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

TSVECTOR and TSQUERY

-- tsquery_s are compared with tsvector_s to find matching documents
-- they are composed of lexemes and logical operators
SELECT TO_TSQUERY('with & liberty & and & justice & for & all');
-- 'liberti' & 'justic'
-- Notice that stop words are not included in tsquery_s either

-- can also use PLAINTO_TSQUERY with plain(ish) text
SELECT PLAINTO_TSQUERY('With liberty and justice for all');
-- 'liberti' & 'justic'

Page 16: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

RETURN from …

• Explore TSVECTORs and TSQUERYs: 00_FTS_explore_tsvector_tsquery_v10.sql

Page 17: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

How do we create TSVECTORS?

• Created at run-time with TO_TSVECTOR
  + simple, does not require storage
  - slower queries than pre-computing
• Created ahead of time with TO_TSVECTOR
  + fast queries, flexible, does not slow ingestion, less CPU work
  - can leave TEXT and TSVECTOR out of sync, may not get done
• Created ahead of time with a trigger
  + fast queries, TSVECTOR always up to date
  - slows ingestion, UPDATE trigger fires on small changes
• Two trigger functions are included: tsvector_update_trigger & …_column
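A minimal sketch of the trigger approach, using the built-in tsvector_update_trigger function. The table fts_demo and its columns are hypothetical names chosen for illustration; they are not the tables from the talk's DDL script.

```sql
-- Hypothetical demo table: the text columns plus a tsvector to keep in sync
CREATE TABLE fts_demo (
    id           serial PRIMARY KEY,
    title        text,
    body         text,
    tsv_document tsvector
);

-- BEFORE INSERT OR UPDATE trigger: rebuilds tsv_document from title and body
-- with the 'pg_catalog.english' configuration on every row change
CREATE TRIGGER fts_demo_tsv_biu
BEFORE INSERT OR UPDATE ON fts_demo
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv_document, 'pg_catalog.english', title, body);

-- The tsvector is filled in without the INSERT mentioning it
INSERT INTO fts_demo (title, body)
VALUES ('Declaration', 'We hold these truths to be self evident');

SELECT title, tsv_document FROM fts_demo;
```

The built-in function treats all listed columns alike, so weighted tsvectors need a custom trigger function.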

Page 18: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

How to make FTS wickedly fast?

• GIN (Generalized Inverted iNdex)
• GiST (Generalized Search Tree)

                GIN               GiST
Speed           3 times faster    Slower
Size            3 times bigger    Smaller
Weighted TSV    Unsupported       Supported
Build speed     Slower            3 times faster
Best practice   Static data       Updated data

See Tomas Vondra's 17-Mar-15 Planet PostgreSQL post on FTS performance for details
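Either index type is created on the tsvector column with ordinary CREATE INDEX syntax. The sketch below uses a hypothetical table, fts_index_demo, so that it can be run on its own; in practice one would normally pick one of the two index types rather than build both.

```sql
-- Hypothetical table with a pre-computed tsvector column
CREATE TABLE fts_index_demo (
    id           serial PRIMARY KEY,
    document     text,
    tsv_document tsvector
);

-- GIN: faster lookups, larger and slower to build
CREATE INDEX idx_fts_index_demo_gin
    ON fts_index_demo USING GIN (tsv_document);

-- GiST: smaller and faster to build, slower lookups
CREATE INDEX idx_fts_index_demo_gist
    ON fts_index_demo USING GiST (tsv_document);

-- Check the plan the planner chooses for an @@ query
-- (on an empty or tiny table it may still prefer a sequential scan)
EXPLAIN
SELECT id
FROM fts_index_demo
WHERE tsv_document @@ plainto_tsquery('english', 'liberty justice');
```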

Page 19: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Let’s build our FTS tables

GOTO …

• Build our FTS tables using 20_FTS_DDL_v10.sql

Page 20: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

RETURN from …

• Build our FTS tables using 20_FTS_DDL_v10.sql

Page 21: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Loading documents for FTS

• Load text documents from the database host using pg_read_file
  • pg_read_binary_file for .pdf files
• Files must be in $PGDATA, but symbolic links work
  • Syntax is: (SELECT * FROM pg_read_file('path/from/$PGDATA/file.txt'))
• Weighted searches require that the document is divided into sections. Details forthcoming
• Can dynamically generate SQL load scripts using pg_ls_dir or run a script from psql

Page 22: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

GOTO …

• Load our FTS tables using 30_FTS_Load_v10.sql
• Update title, author and URL fields with 32_FTS_Update_Titles_v10.sql

Page 23: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Load text and .pdf documents

• Load .txt and .pdf files from within $PGDATA
• We divided the Data Science documents into abstract and body so that they can be weighted for weighted rank queries
• TSVECTORs are created by the BEFORE INSERT OR UPDATE (BIU) trigger
• Manually updated fts_data_sci.tsv_document just to show how it is done
• The update script populates title, author and URL fields.

Page 24: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Dynamic SQL to load files

-- Create dynamic SQL to load fts_amer_hist
WITH list_of_files AS (
  SELECT pg_ls_dir('Dropbox/FTS/AmHistory/') AS file_name
)
SELECT 'INSERT INTO fts.fts_amer_hist (document, filename) VALUES ( (SELECT * FROM pg_read_file(''Dropbox/FTS/AmHistory/' || file_name || ''')), ''' || file_name || '''); '
FROM list_of_files
ORDER BY file_name;

-- generates statements of this form (the example shown is for fts_data_sci)
-- INSERT INTO fts.fts_data_sci (abstract, body, document, pdf_file, pdf_filename)
-- VALUES (
--   (SELECT * FROM pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo_AB.txt')),
--   (SELECT * FROM pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo_BD.txt')),
--   (SELECT * FROM pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo.txt')),
--   (SELECT * FROM pg_read_binary_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo.pdf')),
--   'WhatMapReduceCanDo.pdf');

Page 25: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

RETURN from …

• Load our FTS tables using 30_FTS_Load_v10.sql
• Update details with 32_FTS_Update_Titles_v10.sql

Page 26: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Enough with the setup … show me FTS!

GOTO …

• See FTS in action with 40_FTS_explore_fts_amer_hist_v10.sql

Page 27: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Explore fts_amer_hist

• Compare SQL ILIKE searches with FTS*
• See how ILIKE misses documents with different word forms, such as 'enumerate' vs. 'enumeration'
• See how FTS is ~20 times faster than ILIKE
• Demonstrate that FTS excludes stop words such as 'the', 'and', & 'or'
• Demonstrate that FTS includes BOOLEAN logic with simple syntax

*ILIKE is “case insensitive LIKE”
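The comparisons listed above might look roughly like the queries below. This is a sketch, not the contents of the talk's script: it assumes the fts_amer_hist table with the filename, document and tsv_document columns used elsewhere in these slides, and the search terms are chosen for illustration.

```sql
-- ILIKE: plain substring match with no stemming, so a document that only
-- says 'enumeration' is missed when searching for 'enumerate'
SELECT filename
FROM fts_amer_hist
WHERE document ILIKE '%enumerate%';

-- FTS: the query term and the document words are both reduced to the
-- lexeme 'enumer', so every word form matches
SELECT filename
FROM fts_amer_hist
WHERE tsv_document @@ to_tsquery('english', 'enumerate');

-- Boolean logic: & (AND), | (OR), ! (NOT) and parentheses
SELECT filename
FROM fts_amer_hist
WHERE tsv_document @@ to_tsquery('english', 'liberty & (justice | happiness) & !tax');

-- Stop words are dropped from the query before matching
SELECT plainto_tsquery('english', 'the liberty and the justice');
```

Prefixing each query with EXPLAIN ANALYZE is one way to compare their run times.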

Page 28: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

RETURN from …

• See FTS in action with 40_FTS_explore_amer_hist_v10.sql

Page 29: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Ranking results

• Result rank is 0 (not found) to 1 (perfect match), calculated at run time based on a search of all documents.
  • That means a top-5-by-rank query is slower than a plain LIMIT 5, because every matching document must be ranked before the sort
• Two rank functions are available, TS_RANK and TS_RANK_CD
  • Both consider how often search terms appear
  • Both have an optional normalization parameter; one option scales the rank by the log of the document length
• TS_RANK_CD also considers the proximity of search terms to each other
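A sketch of how the two rank functions and the normalization parameter can be compared side by side. It assumes the fts_data_sci table with the title and tsv_document columns used elsewhere in these slides; the search terms are illustrative.

```sql
SELECT title,
       ts_rank(tsv_document, q)       AS rank_raw,         -- no length normalization
       ts_rank(tsv_document, q, 1)    AS rank_log_length,  -- 1 = divide by 1 + log(document length)
       ts_rank_cd(tsv_document, q)    AS rank_cover_density,
       ts_rank_cd(tsv_document, q, 1) AS rank_cd_log_length
FROM fts_data_sci,
     plainto_tsquery('english', 'big data') AS q
WHERE tsv_document @@ q      -- only rank the documents that match
ORDER BY rank_raw DESC
LIMIT 5;
```

The WHERE clause lets an index narrow the candidate set before ranking, but every matching row is still ranked before LIMIT is applied.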

Page 30: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Building weighted tsvectors

• Lexemes in tsvectors can be assigned a weight of A (high) to D (low), with default rank weights of {1.0, 0.4, 0.2, 0.1}
• Weighting does not affect which records are returned, only their rank
• Weighted tsvectors are typically built by document section
  • title=A, abstract=B, body=D in our example trigger

new.tsv_weight_document :=
    SETWEIGHT(TO_TSVECTOR('pg_catalog.english', COALESCE(new.title, '')), 'A') ||
    SETWEIGHT(TO_TSVECTOR('pg_catalog.english', COALESCE(new.abstract, '')), 'B') ||
    TO_TSVECTOR('pg_catalog.english', new.body);
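The rank functions also accept an explicit weights array as an optional first argument, given in the order {D, C, B, A}. The sketch below builds a small weighted tsvector from made-up text (not the talk's data) to show how changing the array changes the rank without changing the match.

```sql
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'Ranking search results'), 'A') ||
           setweight(to_tsvector('english', 'An abstract about search'), 'B') ||
           to_tsvector('english', 'Body text that mentions ranking once') AS tsv
)
SELECT ts_rank(tsv, q)                         AS default_weights,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', tsv, q) AS explicit_defaults,  -- same as the default
       ts_rank('{0.1, 0.2, 1.0, 1.0}', tsv, q) AS abstract_as_heavy_as_title
FROM doc,
     to_tsquery('english', 'search') AS q;
```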

Page 31: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Building weighted tsvectors at query-time

• The example tsvectors were weighted at build-time by the trigger
• Can also build weighted tsvectors at query-time
  • More flexible because different queries can use different weights
  • Requires more code because weighting is done for every query
  • Slightly slower because the source tsvectors must be concatenated

SETWEIGHT(TO_TSVECTOR(title), 'A') ||
SETWEIGHT(TO_TSVECTOR(abstract), 'B') ||
TO_TSVECTOR(body); -- default weight 'D'

Page 32: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

What does all this get us?

• Search for document relevance, not just existence
• Customers now expect (and demand) ranked results
• The data and the business logic are inside PostgreSQL, available to any application

Page 33: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

GOTO …

• Generate weighted, ranked document searches with 50_FTS_weighted_ranked_results_v10.sql

Page 34: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Weighted, ranked document searches

• Top-5 results syntax

SELECT title, ts_rank(tsv_document, q) AS rank -- value between 0 and 1
FROM fts_data_sci, PLAINTO_TSQUERY('correlation big data') AS q
ORDER BY rank DESC
LIMIT 5;

• Syntax for ts_rank_cd (ts rank with cover density) is the same

Page 35: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Weighted, ranked document searches

• Top-5 weighted results syntax

SELECT title, ts_rank(tsv_weight_document, q) AS rank -- weighted column
FROM fts_data_sci, PLAINTO_TSQUERY('correlation big data') AS q
ORDER BY rank DESC
LIMIT 5;

• The only difference is using the weighted tsvector
• Could also have built a weighted tsvector at query time.

Page 36: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

RETURN from …

• Generate weighted, ranked document searches with 50_FTS_weighted_ranked_results_v10.sql

Page 37: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Pause … with all default configuration

• We have used the English language with the default parser (tokenizer), stop-word list and dictionary.
• The combination is a TEXT SEARCH CONFIGURATION
• The default is pg_catalog.english
• SHOW default_text_search_config; to see
• We created tsvectors (weighted and unweighted) using default and custom triggers, plus manually

Page 38: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

We can highlight matches with ts_headline


Page 39: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Display fragments with matching terms

• TS_HEADLINE returns text fragment(s) surrounding matching terms with HTML tags
• Default is a single snippet with <b>matching_term</b>
• Search for PLAINTO_TSQUERY('liberty justice happy')

Page 40: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Configure TS_HEADLINE to improve display

• How many fragments? MaxFragments
• What comes between fragments? FragmentDelimiter
• How many surrounding words? MinWords / MaxWords
• Which HTML tags highlight terms? StartSel / StopSel

SELECT TS_HEADLINE(document, q,
  'StartSel="<font color=red><b>", StopSel="</b></font>", MaxFragments=10, MinWords=5, MaxWords=10, FragmentDelimiter=" ...<br>..."')

Page 41: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Well formatted ts_headline results

Q: Which American history documents contain 'liberty', 'justice' and 'happy'?

SELECT 'Document title: <i>' || title || '</i><br><br>' ||
  TS_HEADLINE(document, q,
    'StartSel="<font color=red><b>", StopSel="</b></font>", MaxFragments=10, MinWords=5, MaxWords=10, FragmentDelimiter=" ...<br>..."')
FROM fts_amer_hist, PLAINTO_TSQUERY('liberty justice happy') AS q
WHERE tsv_document @@ q
ORDER BY TS_RANK(tsv_document, q) DESC;

Page 42: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Very nice …

GOTO … check out the 4 matching documents

Page 43: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Q: What has FTS gotten us right out of the box?

A: Directly loaded documents that are automatically indexed, weighted and maintained in a form that supports fast natural language(ish) queries with ranked results, plus well-formatted document fragments with highlighted matches. Which is to say, a lot!

Page 44: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Customizing FTS

• Create a custom TEXT SEARCH DICTIONARY
• Customize the stop word list based on frequency counts
• Modify queries at run-time to remove terms and/or use synonyms with TS_REWRITE
• Create a tool to suggest spelling corrections for query terms

Page 45: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Custom TEXT SEARCH DICTIONARY

• Defines the language, stopwords, dictfile and other options for TO_TSVECTOR and TSQUERY related functions
  • Custom dictionaries are based on a template
  • pg_catalog.english is the default text search configuration
  • SHOW default_text_search_config;
• Uses files in $PGSHAREDIR/tsearch_data
  • $PGSHAREDIR=$PG_HOME/share/postgresql
• Option STOPWORDS=english references $PGSHAREDIR/tsearch_data/english.stop
• NOTE: Must 'touch' a TS DICT after each file change with ALTER TEXT SEARCH DICTIONARY

Page 46: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

FTS helpful utility functions

• TS_STAT(sqlquery), where the query returns a tsvector column, returns
  • ndoc: the number of documents a lexeme appears in
  • nentry: the number of times a lexeme appears
• This is useful to identify candidate stop words that appear too frequently to be effective discriminators
• TS_LEXIZE('dictionary', 'word')
  • Useful to test whether the custom dictionary is working as planned
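As a quick illustration of testing a dictionary with TS_LEXIZE, the sketch below uses the built-in english_stem dictionary, which ships with PostgreSQL. A custom dictionary name could be substituted once that dictionary has been created.

```sql
-- A normal word comes back as an array holding its lexeme
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- A stop word comes back as an empty array
SELECT ts_lexize('english_stem', 'a');       -- {}
```

A NULL result means the dictionary does not recognize the word at all, which is different from the empty array returned for a stop word.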

Page 47: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Change the TSQUERY w/TS_REWRITE

• TEXT SEARCH DICTIONARYs change the tsvector, TS_REWRITE changes the tsquery at SQL run-time
• Look up tsquery substitution values in a table of:
  term TSQUERY
  alias TSQUERY
• Used for aliases or stop words, by substituting ''
• Ex. 'include' as an alias for 'contain', 'data' as a stop word

INSERT INTO fts_alias VALUES
  ('contain'::TSQUERY, 'contain | include'::TSQUERY),
  ('data'::TSQUERY, ''::TSQUERY);

Page 48: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

GOTO …

• Create custom dictionary and stop words plus query rewrite: 60_FTS_stop_words_custom_dictionary_and_query_v10.sql

Page 49: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Identify candidate stop words

• Find words that appear frequently and that appear in multiple documents

SELECT word, nentry AS appears_N_times, ndoc AS appears_in_N_docs
FROM TS_STAT('SELECT tsv_weight_document FROM fts_data_sci') -- weighted and unweighted tsvectors are equivalent here
ORDER BY nentry DESC, word;

Page 50: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Create TEXT SEARCH DICTIONARY

• Create custom TEXT SEARCH DICTIONARY

-- DROP TEXT SEARCH DICTIONARY IF EXISTS public.stopper_dict;
CREATE TEXT SEARCH DICTIONARY public.stopper_dict (
  TEMPLATE = pg_catalog.simple,
  STOPWORDS = english
);

• Add stop words to $SHAREDIR/tsearch_data/english.stop
• 'Touch' the dictionary after each change for it to take effect

ALTER TEXT SEARCH DICTIONARY public.stopper_dict ( STOPWORDS = english );

Page 51: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Create table for TS_REWRITE

CREATE TABLE fts_alias (
  term TSQUERY PRIMARY KEY,
  alias TSQUERY
);

• Add term aliases and stop words

INSERT INTO fts_alias VALUES
  ('contain'::TSQUERY, 'contain | include'::TSQUERY),
  ('data'::TSQUERY, ''::TSQUERY);

• Use the alias table with TS_REWRITE

SELECT TS_REWRITE(('data & information & contain')::TSQUERY, 'SELECT * FROM fts_alias');
-- 'information' & ( 'include' | 'contain' )

Page 52: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

RETURN from …

• Create custom dictionary and stop words plus query rewrite: 60_FTS_stop_words_custom_dictionary_and_query_v10.sql

Page 53: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Query-term spelling suggestions with TriGrams

• Use TriGrams, the pg_trgm extension, with a list of words in the documents to identify words that are close to input queries as suggestions for misspelled terms
• Step 1: create a table of words

CREATE TABLE fts_word AS
  (SELECT word FROM TS_STAT('SELECT tsv_document FROM fts_amer_hist'))
  UNION
  (SELECT word FROM TS_STAT('SELECT tsv_document FROM fts_data_sci'));

Page 54: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Query-term spelling suggestions with TriGrams

• Step 2: create an index

CREATE INDEX idx_fts_word ON fts_word USING GIN(word gin_trgm_ops);

• Step 3: query for 'close' terms that exist in the corpus

SELECT word, sml
FROM fts_word, SIMILARITY(word, 'asymetric') AS sml
WHERE sml > 0.333; -- arbitrary value to filter results

Check it out in action …

Page 55: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Suggested spelling with TriGrams

SELECT word, sml
FROM fts_word, SIMILARITY(word, 'asymetric') AS sml -- 'asymmetric' is the correct spelling
WHERE sml > 0.333 -- arbitrary value to filter results
ORDER BY sml DESC, word;

word        sml
==========  ========
asymetr     0.636364
asymmetri   0.538462
asymmetr    0.461538
Metric      0.416667
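The similarity scores above come from comparing sets of three-character sequences. As a small sketch, and assuming the pg_trgm contrib module is available on the server, the trigrams and the score for a pair of words can be inspected directly:

```sql
-- pg_trgm is a contrib extension and must be installed in the database first
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- The trigrams extracted from the misspelled query term
SELECT show_trgm('asymetric');

-- Similarity is the share of trigrams the two strings have in common,
-- from 0 (nothing shared) to 1 (identical trigram sets)
SELECT similarity('asymetric', 'asymmetric') AS sml;
```

The suggestions in the table above are stemmed lexemes because fts_word was built from TS_STAT output; a word list built from the raw text would return whole words instead.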

Page 56: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Summary and review

• Infrastructure and data wrangling
• Creating FTS tables with maintenance triggers and load our data
• Compare FTS with traditional SQL searches and run FTSs on documents from early American History
• Rank search results on documents from Data Science
• Generate HTML-tagged fragments with matching terms
• Customize the stop-word dictionary
• Suggest spelling options for query terms
• Re-write queries at run-time

All slides, data and scripts are on the PGConf web site

Page 57: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

LIFE. LIBERTY. TECHNOLOGY.

Freedom Consulting Group is a talented, hard-working, and committed partner, providing hardware, software and database development and integration services to a diverse set of clients.

Page 58: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Enabling commercial adoption of Postgres

POSTGRES innovation
ENTERPRISE reliability

• 24/7 support
• Services & training
• Enterprise-class features, tools & compatibility
• Indemnification
• Product road-map
• Control
• Thousands of developers
• Fast development cycles
• Low cost
• No vendor lock-in
• Advanced features

Page 59: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Are there any Questions or follow up?

freedomconsultinggroup.com/jhanson

Page 60: Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Freedom Consulting Group
www.freedomconsultinggroup.com

Jamey Hanson