full text search in postgresql

51
Practical full-text search in PostgreSQL Bill Karwin PostgreSQL Conference West 09 • 2009/10/17

Upload: karwin-software-solutions-llc

Post on 15-Jan-2015

57.963 views

Category:

Technology


4 download

DESCRIPTION

A comparison of different solutions for full-text search in web applications using PostgreSQL and other technology. Presented at the PostgreSQL Conference West, in Seattle, October 2009.

TRANSCRIPT

Page 1: Full Text Search In PostgreSQL

Practical full-text search in PostgreSQLBill KarwinPostgreSQL Conference West 09 • 2009/10/17

Page 2: Full Text Search In PostgreSQL

Me

• 20+ years experience

• Application/SDK developer• Support, Training, Proj Mgmt• C, Java, Perl, PHP

• SQL maven

• MySQL, PostgreSQL, InterBase• Zend Framework• Oracle, SQL Server, IBM DB2, SQLite

• Community contributor

Page 3: Full Text Search In PostgreSQL

Full Text Search

Page 4: Full Text Search In PostgreSQL

Text search

• Web applications demand speed

• Let’s compare 5 solutions for text search

Page 5: Full Text Search In PostgreSQL

Sample data

• StackOverflow.com Posts

• Data dump exported September 2009

• 1.2 million tuples

• ~850 MB

Page 6: Full Text Search In PostgreSQL

StackOverflow ER diagram

Page 7: Full Text Search In PostgreSQL

Naive SearchingSome people, when confronted with a problem,

think “I know, I’ll use regular expressions.” Now they have two problems.

— Jamie Zawinsky

Page 8: Full Text Search In PostgreSQL

Performance issue

• LIKE with wildcards:

SELECT * FROM PostsWHERE body LIKE ‘%postgresql%’

• POSIX regular expressions:

SELECT * FROM PostsWHERE body ~ ‘postgresql’

time: 91 sec

time: 105 sec

Page 9: Full Text Search In PostgreSQL

Why so slow?

CREATE TABLE telephone_book ( full_name VARCHAR(50));

CREATE INDEX name_idx ON telephone_book (full_name);

INSERT INTO telephone_book VALUES (‘Riddle, Thomas’), (‘Thomas, Dean’);

Page 10: Full Text Search In PostgreSQL

Why so slow?

• Search for all with last name “Thomas”

SELECT * FROM telephone_bookWHERE full_name LIKE ‘Thomas%’

• Search for all with first name “Thomas”

SELECT * FROM telephone_bookWHERE full_name LIKE ‘%Thomas’

uses index

doesn’t use index

Page 11: Full Text Search In PostgreSQL

Indexes don’t help searching for substrings

Page 12: Full Text Search In PostgreSQL

Accuracy issue

• Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.:

body LIKE ‘%one%’

• Regular expressions in PostgreSQLsupport escapes for word boundaries:

body ~ ‘\yone\y’

Page 13: Full Text Search In PostgreSQL

Solutions

• Full-Text Indexing in the RDBMS

• Sphinx Search

• Apache Lucene

• Inverted Index

• Search Engine Service

Page 14: Full Text Search In PostgreSQL

PostgreSQL Text-Search

Page 15: Full Text Search In PostgreSQL

PostgreSQL Text-Search

• Since PostgreSQL 8.3

• TSVECTOR to represent text data

• TSQUERY to represent search predicates

• Special indexes

Page 16: Full Text Search In PostgreSQL

PostgreSQL Text-Search:

Basic Querying

SELECT * FROM Posts WHERE to_tsvector(title || ‘ ’ || body || ‘ ’ || tags) @@ to_tsquery(‘postgresql & performance’);

text-search matching operator

Page 17: Full Text Search In PostgreSQL

PostgreSQL Text-Search:

Basic Querying

SELECT * FROM Posts WHERE title || ‘ ’ || body || ‘ ’ || tags @@ ‘postgresql & performance’;

time with no index: 8 min 2 sec

Page 18: Full Text Search In PostgreSQL

PostgreSQL Text-Search:

Add TSVECTOR column

ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;

UPDATE Posts SET PostText = to_tsvector(‘english’, title || ‘ ’ || body || ‘ ’ || tags);

Page 19: Full Text Search In PostgreSQL

Special index types

• GIN (generalized inverted index)

• GiST (generalized search tree)

Page 20: Full Text Search In PostgreSQL

PostgreSQL Text-Search:

Indexing

CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);

time: 39 min 36 sec

Page 21: Full Text Search In PostgreSQL

PostgreSQL Text-Search:

Querying

SELECT * FROM Posts WHERE PostText @@ ‘postgresql & performance’;

time with index: 20 milliseconds

Page 22: Full Text Search In PostgreSQL

PostgreSQL Text-Search:

Keep TSVECTOR in sync

CREATE TRIGGER TS_PostText BEFORE INSERT OR UPDATE ON PostsFOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( PostText, ‘english’, title, body, tags);

Page 23: Full Text Search In PostgreSQL

Lucene

Page 24: Full Text Search In PostgreSQL

Lucene

• Full-text indexing and search engine

• Apache Project since 2001

• Apache License

• Java implementation

• Ports exist for C, Perl, Ruby, Python, PHP, etc.

Page 25: Full Text Search In PostgreSQL

Lucene:

How to use

1. Add documents to index

2. Parse query

3. Execute query

Page 26: Full Text Search In PostgreSQL

Lucene:

Creating an index

• Programmatic solution in Java...

time: 8 minutes 55 seconds

Page 27: Full Text Search In PostgreSQL

Lucene:

Indexing

String url = "jdbc:postgresql:stackoverflow"; Properties props = new Properties(); props.setProperty("user", "postgres"); Class.forName("org.postgresql.Driver"); Connection con = DriverManager.getConnection(url, props);

Statement stmt = con.createStatement(); String sql = "SELECT PostId, Title, Body, Tags FROM Posts"; ResultSet rs = stmt.executeQuery(sql);

Date start = new Date();

IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

run any SQL query

open Lucene index writer

Page 28: Full Text Search In PostgreSQL

Lucene:

Indexing

while (rs.next()) { Document doc = new Document();

doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO)); doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));

writer.addDocument(doc);}

writer.optimize();writer.close();

loop over SQL result

each row is a Document

with four Fields

finish and close index

Page 29: Full Text Search In PostgreSQL

Lucene:

Querying

• Parse a Lucene queryString[] fields = new String[3];fields[0] = “title”; fields[1] = “body”; fields[2] = “tags”;

Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse(‘performance’);

• Execute the querySearcher s = new IndexSearcher(indexName);

Hits h = s.search(q);

time: 80 milliseconds

parse search query

define fields

Page 30: Full Text Search In PostgreSQL

Sphinx Search

Page 31: Full Text Search In PostgreSQL

Sphinx Search

• Embedded full-text search engine

• Started in 2001

• GPLv2 license

• Good database integration

Page 32: Full Text Search In PostgreSQL

Sphinx Search:

How to use

1. Edit configuration file

2. Index the data

3. Query the index

4. Issues

Page 33: Full Text Search In PostgreSQL

Sphinx Search:

sphinx.conf

source stackoverflowsrc{ type = pgsql sql_host = localhost sql_user = postgres sql_pass = xxxx sql_db = stackoverflow sql_query = SELECT PostId, Title, Body, Tags FROM Posts sql_query_info = SELECT * FROM Posts WHERE PostId=$id}

Page 34: Full Text Search In PostgreSQL

Sphinx Search:

sphinx.conf

index stackoverflow{ source = stackoverflowsrc path = /opt/local/var/db/sphinx/stackoverflow}

Page 35: Full Text Search In PostgreSQL

Sphinx Search:

Building index

indexer -c sphinx.conf stackoverflow

collected 1242365 docs, 720.5 MBsorted 88.3 Mhits, 100.0% donetotal 1242365 docs, 720452944 bytestotal 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec

time: 5 min 57 sec

Page 36: Full Text Search In PostgreSQL

Sphinx Search:

Querying index

search -c sphinx.conf -i stackoverflow -b “sql & performance”

time: 8 milliseconds

Page 37: Full Text Search In PostgreSQL

Sphinx Search:

Issues

• Index updates are as expensive as rebuilding the index from scratch

• Maintain “main” index plus “delta” index for recent changes

• Merge indexes periodically

• Not all data fits into this model

Page 38: Full Text Search In PostgreSQL

Inverted Index

Page 39: Full Text Search In PostgreSQL

Inverted index

TagTypesPosts Tags

intersection of words / Posts

searchable words

Page 40: Full Text Search In PostgreSQL

Inverted index:

Updated ER Diagram

Page 41: Full Text Search In PostgreSQL

Inverted index:

Data definitionCREATE TABLE TagTypes ( TagId SERIAL PRIMARY KEY, Tag VARCHAR(50) NOT NULL);

CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);

CREATE TABLE Tags ( PostId INT NOT NULL, TagId INT NOT NULL, PRIMARY KEY (PostId, TagId), FOREIGN KEY (PostId) REFERENCES Posts (PostId), FOREIGN KEY (TagId) REFERENCES TagTypes (TagId));

CREATE INDEX Tags_PostId_index ON Tags(PostId);CREATE INDEX Tags_TagId_index ON Tags(TagId);

Page 42: Full Text Search In PostgreSQL

Inverted index:

Indexing

INSERT INTO Tags (PostId, TagId) SELECT p.PostId, t.TagId FROM Posts p JOIN TagTypes t ON (p.Tags LIKE ‘%<’ || t.Tag || ‘>%’);

90 seconds per tag!!

Page 43: Full Text Search In PostgreSQL

Inverted index:

Querying

SELECT p.* FROM Posts pJOIN Tags t USING (PostId)JOIN TagTypes tt USING (TagId)WHERE tt.Tag = ‘performance’;

40 milliseconds

Page 44: Full Text Search In PostgreSQL

Search Engine Services

Page 45: Full Text Search In PostgreSQL

Search engine services:

Google Custom Search Engine

• http://www.google.com/cse/

• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html

even big web sites use this solution

Page 46: Full Text Search In PostgreSQL

Search engine services:

Is it right for you?

• Your site is public and allows external index

• Search is a non-critical feature for you

• Search results are satisfactory

• You need to offload search processing

Page 47: Full Text Search In PostgreSQL

Comparison: Time to Build Index

LIKE predicate none

PostgreSQL / GIN 40 min

Sphinx Search 6 min

Apache Lucene 9 min

Inverted index high

Google / Yahoo! offline

Page 48: Full Text Search In PostgreSQL

Comparison: Index Storage

LIKE predicate none

PostgreSQL / GIN 532 MB

Sphinx Search 533 MB

Apache Lucene 1071 MB

Inverted index 101 MB

Google / Yahoo! offline

Page 49: Full Text Search In PostgreSQL

Comparison: Query Speed

LIKE predicate 90+ sec

PostgreSQL / GIN 20 ms

Sphinx Search 8 ms

Apache Lucene 80 ms

Inverted index 40 ms

Google / Yahoo! *

Page 50: Full Text Search In PostgreSQL

Comparison: Bottom-Line

LIKE predicate none none 11,250x SQL

PostgreSQL / GIN 7x 5.3x 2.5x RDBMS

Sphinx Search 1x * 5.3x 1x 3rd party

Apache Lucene 1.5x 10x 10x 3rd party

Inverted index high 1x 5x SQL

Google / Yahoo! offline offline * Service

indexing storage query solution

Page 51: Full Text Search In PostgreSQL

Copyright 2009 Bill Karwin

www.slideshare.net/billkarwin

Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/

You are free to share - to copy, distribute and transmit this work, under the following conditions:

Attribution. You must attribute this work to Bill Karwin.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not alter, transform, or build

upon this work.