Getting Started with MySQL Full Text Search

Upload: matt-lord

Post on 06-Aug-2015

Category:

Software


TRANSCRIPT

Page 1: Getting Started with MySQL Full Text Search
Page 2: Getting Started with MySQL Full Text Search

Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |

MySQL Full-Text Search

Matt Lord
MySQL Product Manager
@mattalord

Page 3: Getting Started with MySQL Full Text Search

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Page 4: Getting Started with MySQL Full Text Search

MySQL Full-Text Search : Agenda

1. An Introduction to Full-Text Search
2. Common Terms and Concepts
3. What’s New in MySQL 5.6 and 5.7
4. A Real World Example
5. Integration with Lucene, Solr, and Elasticsearch
6. What’s Next for MySQL Full-Text Search

Page 5: Getting Started with MySQL Full Text Search

An Introduction to Full-Text Search

Page 6: Getting Started with MySQL Full Text Search

What is it?

• Search entire documents
  – Character based fields
    • CHAR, VARCHAR, TEXT
• For a search string
  – Combinations of words
  – Phrases: "specific string to match"
  – Wildcards: *
  – Requirements: +, -, ~
  – Expressions: (…)
  – Relevancy weight characters: <, >
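The requirement operators are easiest to see on a toy example. The sketch below is not MySQL's parser; it is a hypothetical Python model (the names `parse_query` and `matches` are made up for illustration) of how `+` and `-` behave in a boolean-style search: every `+` term must be present, no `-` term may be present, and unprefixed terms are optional. In MySQL itself these operators apply to searches run `IN BOOLEAN MODE`.

```python
def parse_query(query):
    """Split a boolean-style query into required, excluded, and optional terms."""
    required, excluded, optional = set(), set(), set()
    for raw in query.lower().split():
        if raw.startswith("+") and len(raw) > 1:
            required.add(raw[1:])
        elif raw.startswith("-") and len(raw) > 1:
            excluded.add(raw[1:])
        else:
            optional.add(raw)
    return required, excluded, optional


def matches(document, query):
    """Return True if the document satisfies the +/- requirements of the query."""
    words = set(document.lower().replace(".", " ").replace(",", " ").split())
    required, excluded, optional = parse_query(query)
    if not required <= words:      # a required term is missing
        return False
    if excluded & words:           # an excluded term is present
        return False
    # With no required terms, at least one optional term has to match.
    return bool(required) or bool(optional & words)


print(matches("I like to play video games", "+video -toys"))
print(matches("I play with toys and video games", "+video -toys"))
```

The first call prints `True` and the second prints `False`, because the second document contains the excluded word.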

Page 7: Getting Started with MySQL Full Text Search

Searching Without an Index

Page 8: Getting Started with MySQL Full Text Search

Searching With an Index

Page 9: Getting Started with MySQL Full Text Search

What Would I Use it For?

• Content management
  – What metadata should be used to describe the information?
  – This helps to make your searches far more useful
• Search services
  – What documents or metadata contain certain terms or tokens?
  – What documents are most relevant to the current view?
  – What data do you think this user would be most interested in?

Page 10: Getting Started with MySQL Full Text Search

How Would I Use It?

Collect → Store → Index → Search

• Collect search data
  – Existing documents describing the content
  – Generated metadata from the incoming content
• Store the data
  – Within MySQL tables
• Index the data
  – Add Full-Text indexes on the content columns
• Allow for efficient searches
  – Provide users with an efficient way to search the content

Page 11: Getting Started with MySQL Full Text Search

Common Terms and Concepts

Page 12: Getting Started with MySQL Full Text Search

Common Terms

• Token
  – Word or a series of characters
• Dictionary
  – What words are related, mean the same thing, are abbreviations for, etc.
• Stop Words
  – Words that should not be indexed
• Relevancy and Weight
  – How should we weight search terms and calculate document relevancy?

Page 13: Getting Started with MySQL Full Text Search

Tokens

• Tokens
  – Words, or a series of characters that together form common meaning
• Related server options
  – innodb_ft_min_token_size: Don’t bother to index words shorter than this
    • These would typically be words that are invalid, or are extremely common
      – So they increase the size of the index and decrease search efficiency w/o real benefit
  – innodb_ft_max_token_size: Don’t bother to index words longer than this
    • These would typically be words that are invalid
      – So again, they increase the size of the index and decrease search efficiency w/o real benefit

Page 14: Getting Started with MySQL Full Text Search

Stop Words

• Server options
  – innodb_ft_enable_stopword: Should stop words be used at all for new indexes?
  – innodb_ft_server_stopword_table: Use this global table for the list of stop words
  – innodb_ft_user_stopword_table: Use this table for my own stop word list
    • All of the above only affect indexes created while they are set
      – CREATE INDEX, ALTER TABLE, OPTIMIZE TABLE, ANALYZE TABLE
• Default stop word list
  – SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

Page 15: Getting Started with MySQL Full Text Search

Relevancy and Weight

• Term Frequency (TF)
  – Measure of how often a token/word appears in an individual document
• Inverse Document Frequency (IDF)
  – Measure of how common a token/word is across all documents
• Coordinate Level Matching
  – Number of query terms that are found within an individual document
    • How close together are the matching terms?
• User Modifications
  – '>' and '<' characters can be used to give terms higher or lower weight
  – '+' and '-' characters can be used to require that terms be present or absent

Page 16: Getting Started with MySQL Full Text Search

A Full Text Index

• It’s an inverted index of relationships between tokens and documents

Document 1: This movie is about a boy going to war.
Document 2: This movie is about a girl starting an auto-shop.
Document 3: This movie is about flowers.

Documents → Tokenizer → Token Filters (Stop Words, Token Size) → Indexer → Full Text / Inverted Index

Stop Words: a about an are as at be by com de en for from how i in is it la of on or that the this to was what when where who will with und the www

Token Size: Min Token Size, Max Token Size

Full Text / Inverted Index

ID  TOKEN      DOCUMENT
1   movie      1,2,3
2   boy        1
3   girl       2
4   going      1
5   starting   2
6   war        1
7   auto-shop  2
8   flowers    3
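The tokenize, filter, and index steps on this slide can be sketched in a few lines of Python. This is an illustrative model only, not InnoDB's implementation: the regular expression, the function names, and the choice to keep "auto-shop" as a single token (to match the slide's table) are assumptions made here. The stop word list is the one shown on the slide, and the 3 and 84 limits are the default token sizes quoted later in the deck.

```python
import re
from collections import defaultdict

# Stop word list as shown on the slide.
STOP_WORDS = set(
    "a about an are as at be by com de en for from how i in is it la of "
    "on or that the this to was what when where who will with und www".split()
)
MIN_TOKEN_SIZE = 3   # default innodb_ft_min_token_size quoted in the deck
MAX_TOKEN_SIZE = 84  # default innodb_ft_max_token_size quoted in the deck


def tokenize(text):
    """Lowercase the text and split it into words; hyphenated words stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def keep(token):
    """Apply the stop word and token size filters."""
    return (token not in STOP_WORDS
            and MIN_TOKEN_SIZE <= len(token) <= MAX_TOKEN_SIZE)


def build_index(documents):
    """Map each surviving token to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            if keep(token) and doc_id not in index[token]:
                index[token].append(doc_id)
    return dict(index)


documents = {
    1: "This movie is about a boy going to war.",
    2: "This movie is about a girl starting an auto-shop.",
    3: "This movie is about flowers.",
}
index = build_index(documents)
for token, doc_ids in index.items():
    print(token, doc_ids)
```

The resulting dictionary holds the same eight tokens and document lists as the table above.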

Page 17: Getting Started with MySQL Full Text Search

Document Searches

• Search for “movie about girl”
• Term Frequency (TF)
  – “movie” occurs 1 time in Docs 1,2,3
  – “girl” occurs 1 time in Doc 2
    • No Doc has more than 1 occurrence of either word
• Inverse Document Frequency (IDF)
  – “movie” occurs in Docs 1,2,3
  – “girl” occurs only in Doc 2
    • “girl” is more meaningful or “weighted”
• Docs 1,2,3 match our search, but Doc 2 is most relevant

Full Text / Inverted Index

ID  TOKEN      DOCUMENT
1   movie      1,2,3
2   boy        1
3   girl       2
4   going      1
5   starting   2
6   war        1
7   auto-shop  2
8   flowers    3
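A search against the inverted index amounts to looking up each query term's document list and adding up a weight per document. The sketch below uses a textbook weighting, log(N / document count), with a term frequency of 1 throughout; it is meant to show why Doc 2 comes out on top, not to reproduce MySQL's exact scores. The `INDEX` literal copies the table above, and `search` is a hypothetical helper name.

```python
import math

# Inverted index copied from the slide: token -> documents containing it.
INDEX = {
    "movie": [1, 2, 3], "boy": [1], "girl": [2], "going": [1],
    "starting": [2], "war": [1], "auto-shop": [2], "flowers": [3],
}
TOTAL_DOCS = 3


def search(query):
    """Return (doc_id, score) pairs, best match first."""
    scores = {}
    for term in query.lower().split():
        postings = INDEX.get(term, [])  # stop words such as "about" are absent
        if not postings:
            continue
        idf = math.log(TOTAL_DOCS / len(postings))  # rarer term, larger weight
        for doc_id in postings:
            # Every term occurs once per document here, so TF is 1.
            scores[doc_id] = scores.get(doc_id, 0.0) + idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search("movie about girl")
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

All three documents are returned because they all contain "movie", but that term appears everywhere and so adds nothing to the score; only "girl" separates Doc 2 from the rest.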

Page 18: Getting Started with MySQL Full Text Search

Additional Options & Variables

• innodb_ft_aux_table – View index details for this table
  – Via the INNODB_FT_INDEX_TABLE, INNODB_FT_INDEX_CACHE, INNODB_FT_CONFIG, INNODB_FT_DELETED, and INNODB_FT_BEING_DELETED Information_Schema tables
• innodb_ft_cache_size – In memory cache size for each index
• innodb_ft_total_cache_size – Total in memory cache size limit per server
• innodb_ft_num_word_optimize – Number of words processed by each OPTIMIZE TABLE run on a Full-Text index
• innodb_ft_result_cache_limit – In memory cache size limit for individual searches
• innodb_ft_sort_pll_degree – Number of parallel threads to use during index builds

Page 19: Getting Started with MySQL Full Text Search

Example Walkthrough

• Now let’s quickly demonstrate all of these terms & concepts in action
• We’ll use a very simple made up series of silly short stories

Page 20: Getting Started with MySQL Full Text Search

Example Walkthrough: Table and Data

mysql> create table short_stories (author varchar(100), story text);
Query OK, 0 rows affected (0.23 sec)

mysql> insert into short_stories values ("Matt Lord", "I've worked at MySQL and Oracle for about 12 years now. I'm currently the Product Manager for MySQL.");
Query OK, 1 row affected (0.03 sec)

mysql> insert into short_stories values ("Sid Lord", "I'm 10 years old. I like to eat and play video games. That's pretty much it.");
Query OK, 1 row affected (0.12 sec)

mysql> insert into short_stories values ("Lily Lord", "I'm almost 7 years old. I like to make art, play with toys, and play video games. And also, dress up. Yay!");
Query OK, 1 row affected (0.03 sec)

• This is the table, column, and data that we’ll add a Full Text index on

Page 21: Getting Started with MySQL Full Text Search

Example Walkthrough: Custom Stop Words

mysql> create table example.ss_words select * from information_schema.INNODB_FT_DEFAULT_STOPWORD;
Query OK, 36 rows affected (0.40 sec)

mysql> insert into ss_words values ("oracle"), ("and"), ("like");
Query OK, 3 rows affected (0.04 sec)

mysql> select group_concat(value) as stop_words from ss_words\G
*************************** 1. row ***************************
stop_words: a,about,an,are,as,at,be,by,com,de,en,for,from,how,i,in,is,it,la,of,on,or,that,the,this,to,was,what,when,where,who,will,with,und,the,www,oracle,and,like
1 row in set (0.00 sec)

mysql> set global innodb_ft_server_stopword_table="example/ss_words";
Query OK, 0 rows affected (0.00 sec)

• This is how we define words that will NOT be included in the Full Text index

Page 22: Getting Started with MySQL Full Text Search

Example Walkthrough: Token Sizes

• We can define the min and max token/word sizes
  – Words that fall outside of this min/max range will NOT be included in the index
    • And thus NOT used for searches
• We set constraints on the min and max length of words/tokens that we want to include in the index
  – Very short or very long words are typically invalid or so common as to be worthless
    • E.g.: a, an, de, ta, someverylongsentencethataccidentallygotstucktogethersomehowwhoops
• We’ll go with the defaults
  – innodb_ft_min_token_size=3 and innodb_ft_max_token_size=84
  – Words/Tokens outside of the 3-84 character range are ignored for the index

Page 23: Getting Started with MySQL Full Text Search

Example Walkthrough: Adding the Index

mysql> alter table short_stories add fulltext index (story);
Query OK, 0 rows affected, 1 warning (2.07 sec)

# Here we’re setting up the information_schema views so that we can see the index
# record details (on the next slide)
mysql> set global innodb_ft_aux_table="example/short_stories";
Query OK, 0 rows affected (0.00 sec)

Page 24: Getting Started with MySQL Full Text Search

Example Walkthrough: The Final Index

mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| almost    |            4 |           4 |         1 |      4 |        4 |
| also      |            4 |           4 |         1 |      4 |       86 |
| art       |            4 |           4 |         1 |      4 |       39 |
| currently |            2 |           2 |         1 |      2 |       60 |
| dress     |            4 |           4 |         1 |      4 |       92 |
| eat       |            3 |           3 |         1 |      3 |       28 |
| games     |            3 |           4 |         2 |      3 |       47 |
| games     |            3 |           4 |         2 |      4 |       75 |
…
| video     |            3 |           4 |         2 |      3 |       41 |
| video     |            3 |           4 |         2 |      4 |       69 |
| worked    |            2 |           2 |         1 |      2 |        5 |
| yay       |            4 |           4 |         1 |      4 |      102 |
| years     |            2 |           4 |         3 |      2 |       45 |
| years     |            2 |           4 |         3 |      3 |        7 |
| years     |            2 |           4 |         3 |      4 |       13 |
+-----------+--------------+-------------+-----------+--------+----------+
29 rows in set (0.00 sec)

Page 25: Getting Started with MySQL Full Text Search

Example Walkthrough: Our Final Sample Query

mysql> SELECT author, story, MATCH(story) AGAINST("toys and games") AS relevancy
    -> FROM short_stories WHERE MATCH(story) AGAINST("toys and games")
    -> ORDER BY relevancy DESC\G
*************************** 1. row ***************************
   author: Lily Lord
    story: I'm almost 7 years old. I like to make art, play with toys, and play video games. And also, dress up. Yay!
relevancy: 0.25865283608436584
*************************** 2. row ***************************
   author: Sid Lord
    story: I'm 10 years old. I like to eat and play video games. That's pretty much it.
relevancy: 0.031008131802082062
2 rows in set (0.00 sec)
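The two relevancy values can be checked by hand. The sketch below assumes the ranking described in the MySQL reference manual for InnoDB natural-language searches: per term, rank = TF × IDF × IDF with IDF = log10(total rows / rows containing the term), summed over the query terms. Treat that formula as an assumption here rather than something the slides state. The document counts come from the sample data: "games" appears in two of the three stories, "toys" in one, and "and" is dropped because it is on the custom stop word list.

```python
import math

TOTAL_DOCS = 3                         # three short stories in the table
DOC_COUNT = {"toys": 1, "games": 2}    # stories containing each query term


def innodb_style_rank(term_counts):
    """Sum TF * IDF * IDF over the query terms found in one document."""
    rank = 0.0
    for term, tf in term_counts.items():
        idf = math.log10(TOTAL_DOCS / DOC_COUNT[term])
        rank += tf * idf * idf
    return rank


# Term frequencies of the query words in each matching story.
lily = innodb_style_rank({"toys": 1, "games": 1})
sid = innodb_style_rank({"games": 1})
print(round(lily, 4), round(sid, 4))
```

Under these assumptions the script gives roughly 0.2587 for Lily's story and 0.0310 for Sid's, in line with the query output above (MySQL reports the values as single-precision floats, so the trailing digits differ).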

Page 26: Getting Started with MySQL Full Text Search

What’s New in MySQL 5.6 and 5.7

Page 27: Getting Started with MySQL Full Text Search

What’s New?

• MySQL 5.6
  – InnoDB Full-Text Index support
    • Fully ACID compliant, MVCC search
    • With performance improvements over MyISAM
    • Easily customizable stop-word lists
• MySQL 5.7
  – Pluggable Full-Text Parser support
  – CJK Support
    • N-gram parser for Chinese, Japanese, and Korean
    • MeCab parser for Japanese
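Languages without spaces between words cannot be tokenized by splitting on whitespace, which is the motivation for an n-gram parser: it indexes every run of n consecutive characters instead. The Python sketch below illustrates the idea with n = 2; the function name is hypothetical, and the choice to treat whitespace as a break between chunks is an assumption of this sketch, not a reading of the parser plugin's code.

```python
def ngrams(text, n=2):
    """Return every run of n consecutive characters within each chunk of text."""
    tokens = []
    for chunk in text.split():  # whitespace separates chunks
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


print(ngrams("abcd"))
print(ngrams("全文検索"))
```

With bigrams, "abcd" becomes `ab`, `bc`, `cd`, and a four-character Japanese string becomes three overlapping two-character tokens, so any two-character search term can be matched without knowing where the word boundaries are.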

Page 28: Getting Started with MySQL Full Text Search

A Real World Example

Page 29: Getting Started with MySQL Full Text Search

An Internal Content Management System

• I have tons of valuable business related content
  – But it’s spread across various locations and formats
    • Wiki pages, PPTs, Word Docs, Txt docs, …
  – How can I ingest, aggregate, and correlate this data
  – How can I provide a useful search tool
• Let’s build something to vastly increase the value of our intranet content
  – Something similar to Google Desktop search or Apple’s Spotlight
    • But for the vast amounts of data strewn across our company intranet
  – We can then incorporate the search into a MySQL based intranet tool

Page 30: Getting Started with MySQL Full Text Search

Gathering The Contents of Our Existing Data

• Use any existing metadata that you already have
• Pull metadata from existing files
  – Specialized tools to extract metadata
    • Exiftool to gather metadata on image files & Exif2maps to pull location data from image files
    • Taglib to pull metadata from sound files
    • `libreoffice --headless --convert-to …` to extract plain text from Office formats
    • GNU Libextractor to pull metadata and location data from all file types
• Extract text content from binary format files (.ppt, .doc, .pdf, etc.)
  – Apache Tika (originally part of Lucene)
    • Auto-detects file format and uses appropriate parsing library
    • Extracts metadata and structured text content from all popular/common document and file formats

Page 31: Getting Started with MySQL Full Text Search

Apache Tika and MySQL

[Diagram: Extract → Plain Text → Load → Text Docs → Full Text Index]

Page 32: Getting Started with MySQL Full Text Search

Apache Tika Example

• Downloads, docs, etc. can be found at https://tika.apache.org

shell> java -jar tika-app-1.7.jar -z -t /tmp/MySQL_FTS.pptx
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |1MySQL Full-Text SearchMatt LordMySQL Product Manager

2Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |23Safe Harbor StatementThe following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment …

Page 33: Getting Started with MySQL Full Text Search

Apache Tika Example Cont.

shell> ls /tmp/*.p*
/tmp/MySQL_5.7_GIS.pptx /tmp/MySQL_5.7_GIS_reborn.pptx /tmp/MySQL_FTS.pptx /tmp/MySQLGroupReplication.pdf

shell> for file in `ls /tmp/*.p*`; do java -jar tika-app-1.7.jar -z -t $file > $file.txt && echo -n "#DOC_END" >> $file.txt; done

shell> ls /tmp/*.txt
/tmp/MySQL_5.7_GIS.pptx.txt /tmp/MySQL_5.7_GIS_reborn.pptx.txt /tmp/MySQL_FTS.pptx.txt /tmp/MySQLGroupReplication.pdf.txt

shell> sed -n '55,62'p /tmp/MySQLGroupReplication.pdf.txt
Program Agenda

MySQL Group Replication Background

Zoom in: Major Building Blocks

Zoom in: The Complete Stack

Page 34: Getting Started with MySQL Full Text Search

Our MySQL Table

mysql> show create table intranet_doc\G
*************************** 1. row ***************************
       Table: intranet_doc
Create Table: CREATE TABLE `intranet_doc` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `type` varchar(50) DEFAULT NULL,
  `fs_path` varchar(200) DEFAULT NULL,
  `doc_host` varchar(60) DEFAULT NULL,
  `txt_content` longtext,
  PRIMARY KEY (`id`),
  KEY `type` (`type`),
  FULLTEXT KEY `txt_content` (`txt_content`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)

Page 35: Getting Started with MySQL Full Text Search

Loading in the Text Content

shell> for file in `ls /tmp/*.txt`; do mysql -D intranet_search -e \
"load data infile '$file' into table intranet_doc \
lines terminated by '#DOC_END' (txt_content) SET fs_path='$file', \
doc_host='`uname -n`', \
type=substring_index(substring_index('$file', '.', -2), '.', 1) "; done

mysql> select fs_path, type, doc_host from intranet_doc;
+------------------------------------+------+-------------------+
| fs_path                            | type | doc_host          |
+------------------------------------+------+-------------------+
| /tmp/MySQL_5.7_GIS.pptx.txt        | pptx | mylab.localdomain |
| /tmp/MySQL_5.7_GIS_reborn.pptx.txt | pptx | mylab.localdomain |
| /tmp/MySQL_FTS.pptx.txt            | pptx | mylab.localdomain |
| /tmp/MySQLGroupReplication.pdf.txt | pdf  | mylab.localdomain |
+------------------------------------+------+-------------------+
4 rows in set (0.00 sec)
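The nested `substring_index` call is what turns a path such as `/tmp/MySQL_FTS.pptx.txt` into the `pptx` value in the type column: the inner call keeps the last two dot-separated pieces, and the outer call keeps the first of those. The Python sketch below mimics that logic; the `substring_index` function here is a stand-in written for illustration (covering positive, negative, and zero counts), and `doc_type` is a hypothetical helper name.

```python
def substring_index(text, delim, count):
    """Mimic SUBSTRING_INDEX: keep `count` delimited pieces from the left
    (positive count) or from the right (negative count)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""


def doc_type(path):
    """Recover the original extension from a '<name>.<ext>.txt' path."""
    last_two = substring_index(path, ".", -2)   # e.g. 'pptx.txt'
    return substring_index(last_two, ".", 1)    # e.g. 'pptx'


for path in ("/tmp/MySQL_5.7_GIS.pptx.txt", "/tmp/MySQLGroupReplication.pdf.txt"):
    print(path, "->", doc_type(path))
```

Counting from the right is what makes this work for file names that contain extra dots, such as `MySQL_5.7_GIS`.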

Page 36: Getting Started with MySQL Full Text Search

Our Final Search Query

• Search for PowerPoint docs that mention Apache Tika

mysql> SELECT fs_path, doc_host, type
    -> FROM intranet_doc
    -> WHERE type LIKE "ppt%"
    -> AND MATCH(txt_content) AGAINST ("+Tika");
+-------------------------+-------------------+------+
| fs_path                 | doc_host          | type |
+-------------------------+-------------------+------+
| /tmp/MySQL_FTS.pptx.txt | mylab.localdomain | pptx |
+-------------------------+-------------------+------+
1 row in set (0.00 sec)

Page 37: Getting Started with MySQL Full Text Search

Integration with Lucene/Solr/Elasticsearch

Page 38: Getting Started with MySQL Full Text Search

Apache Lucene

• Lucene is the core Full-text search library
  – Written in Java
• Originally created by Doug Cutting (creator of Hadoop)
• Open source project (since 2003)
• Mature
• Easy to learn API
• Stores its indexes as files on disk
• Solr and Elasticsearch provide web services built on top of Lucene

Page 39: Getting Started with MySQL Full Text Search

MySQL Native Full Text VS Lucene

MySQL Native Full Text
• Eliminates complexity
• Single canonical source
• No need for synchronization
• Single query language (SQL)
• No additional maintenance
• Use
  – MySQL based app with basic full-text search
    • e.g. E-commerce app with a product description search

Lucene
• Supports very complex searches
• Supports stemming & fuzzy searches
• Very scalable
• Rich document handling (PDF, PPT, …)
• Easy to use RESTful web services
  – Solr, Elasticsearch, …
• Use
  – Full blown advanced search focused app
    • e.g. IMDB

Page 40: Getting Started with MySQL Full Text Search

Solr and MySQL

• Create simple custom DataImportHandler
  – http://wiki.apache.org/solr/DataImportHandler
• Full and incremental indexing
• Scheduled re-indexing to keep the two in sync

Page 41: Getting Started with MySQL Full Text Search

Solr and MySQL

[Diagram: Custom DataImportHandler XML, MySQL Connector/J]

• Easy integration
  – Index sample sakila database
    • http://localhost:8983/solr/sakila/collection1/dataimport?command=full-import

Page 42: Getting Started with MySQL Full Text Search

Elasticsearch and MySQL

• Easy integration
  – Index sample sakila.country table

curl -XPUT 'localhost:9200/_river/sakila_country/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/sakila",
    "user" : "root",
    "password" : "mypass",
    "sql" : "select * from country"
  }
}'

[Diagram: JDBC River Plugin, MySQL Connector/J]

Page 43: Getting Started with MySQL Full Text Search

What’s Next for MySQL Full-Text Search

Page 44: Getting Started with MySQL Full Text Search

Additional Features

• Improved performance
• More efficient disk space usage
• Support for stemming and facets
• Support for fuzzy string searches
• Support for aliases, synonyms, abbreviations, etc.
• Proximity search and use in relevancy scores
• Automatic ordering by relevancy
• What else would you like to see?
  – Let us know!

Page 45: Getting Started with MySQL Full Text Search

Appendix: Additional Resources

• Manual
  – https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
• Community forum
  – http://forums.mysql.com/list.php?107
• Apache Tika
  – https://tika.apache.org
• Report Full-Text bugs and submit feature requests
  – http://bugs.mysql.com/

Page 46: Getting Started with MySQL Full Text Search

Safe Harbor Statement

The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Page 47: Getting Started with MySQL Full Text Search