Getting Started with MySQL Full-Text Search

Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

MySQL Full-Text Search

Matt Lord
MySQL Product Manager
@mattalord


Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


MySQL Full-Text Search: Agenda

1. An Introduction to Full-Text Search
2. Common Terms and Concepts
3. What’s New in MySQL 5.6 and 5.7
4. A Real World Example
5. Integration with Lucene, Solr, and Elasticsearch
6. What’s Next for MySQL Full-Text Search


An Introduction to Full-Text Search


What is it?
• Search entire documents
  – Character based fields
    • CHAR, VARCHAR, TEXT
• For a search string
  – Combinations of words
  – Phrases: "specific string to match"
  – Wildcards: *
  – Requirements: +, -, ~
  – Expressions: (…)
  – Relevancy weight characters: <, >

Searching Without an Index

Searching With an Index


What Would I Use it For?
• Content management
  – What metadata should be used to describe the information?
  – This helps to make your searches far more useful
• Search services
  – What documents or metadata contain certain terms or tokens?
  – What documents are most relevant to the current view?
  – What data do you think this user would be most interested in?


How Would I Use It?

Collect → Store → Index → Search

• Collect search data
  – Existing documents describing the content
  – Generated metadata from the incoming content
• Store the data
  – Within MySQL tables
• Index the data
  – Add Full-Text indexes on the content columns
• Allow for efficient searches
  – Provide users with an efficient way to search the content


Common Terms and Concepts


Common Terms
• Token
  – Word or a series of characters
• Dictionary
  – What words are related, mean the same thing, are abbreviations for, etc.
• Stop Words
  – Words that should not be indexed
• Relevancy and Weight
  – How should we weight search terms and calculate document relevancy?


Tokens
• Tokens
  – Words, or a series of characters that together form common meaning
• Related Server options
  – innodb_ft_min_token_size – Don’t bother to index words shorter than this
    • These would typically be words that are invalid, or are extremely common
      – So they increase the size of the index and decrease search efficiency without real benefit
  – innodb_ft_max_token_size – Don’t bother to index words longer than this
    • These would typically be words that are invalid
      – So again, they increase the size of the index and decrease search efficiency without real benefit

http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html
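To make the effect of the two token size options concrete, here is a minimal Python sketch. It is not InnoDB's tokenizer: splitting on runs of word characters and the helper name `index_tokens` are assumptions made for illustration. Only the option names and the limits of 3 and 84 (the defaults given on the Token Sizes walkthrough slide) come from the deck.

```python
import re

# Default limits as given on the "Token Sizes" walkthrough slide.
MIN_TOKEN_SIZE = 3   # innodb_ft_min_token_size
MAX_TOKEN_SIZE = 84  # innodb_ft_max_token_size


def index_tokens(text, min_size=MIN_TOKEN_SIZE, max_size=MAX_TOKEN_SIZE):
    """Split text into lowercase tokens and keep those within the size limits.

    Splitting on runs of word characters is a simplification (an assumption),
    not a reimplementation of the InnoDB parser.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if min_size <= len(t) <= max_size]


if __name__ == "__main__":
    # "a", "is" and "to" are shorter than 3 characters and are dropped.
    print(index_tokens("A boy is going to war"))
```

Very short tokens never reach the index, so they cannot be matched by a later search either.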



Stop Words
• Server options
  – innodb_ft_enable_stopword – Should stop words be used at all for new indexes?
  – innodb_ft_server_stopword_table – Use this global table for the list of stop words
  – innodb_ft_user_stopword_table – Use this table for my own stop word list
    • All of the above only affect indexes created while they are set
      – CREATE INDEX, ALTER TABLE, OPTIMIZE TABLE, ANALYZE TABLE
• Default stop word list
  – SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;





Relevancy and Weight
• Term Frequency (TF)
  – Measure of how often a token/word appears in an individual document
• Inverse Document Frequency (IDF)
  – Measure of how common a token/word is across all documents; rarer tokens get more weight
• Coordinate Level Matching
  – Number of query terms that are found within an individual document
    • How close together are the matching terms?
• User Modifications
  – '>' and '<' characters can be used to grant terms higher or lower weight
  – '+' and '-' characters can be used to require terms be present or absent


A Full Text Index
• It’s an inverted index of relationships between tokens and documents

Documents
  Document 1: This movie is about a boy going to war.
  Document 2: This movie is about a girl starting an auto-shop.
  Document 3: This movie is about flowers.

Pipeline: Documents → Tokenizer → Token Filters (Stop Words, Min Token Size, Max Token Size) → Indexer → Full Text / Inverted Index

Stop Words
  a about an are as at be by com de en for from how i in is it la of on or that the this to was what when where who will with und the www

Full Text / Inverted Index

ID  TOKEN      DOCUMENT
1   movie      1,2,3
2   boy        1
3   girl       2
4   going      1
5   starting   2
6   war        1
7   auto-shop  2
8   flowers    3
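The tokenizer, filter, and indexer stages can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MySQL's implementation: the function names are hypothetical, and the regular expression deliberately keeps "auto-shop" as one token so the result mirrors the slide. The documents, stop words, and size limits are the ones shown in the deck.

```python
import re
from collections import defaultdict

# Default stop words as listed on the slide.
STOP_WORDS = set(
    "a about an are as at be by com de en for from how i in is it la of "
    "on or that the this to was what when where who will with und www".split()
)

DOCUMENTS = {
    1: "This movie is about a boy going to war.",
    2: "This movie is about a girl starting an auto-shop.",
    3: "This movie is about flowers.",
}


def tokenize(text):
    # Assumption: hyphenated words stay together, as on the slide.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_inverted_index(documents, min_size=3, max_size=84):
    """Map each surviving token to the list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for token in tokenize(documents[doc_id]):
            # Token filters: stop words and min/max token size.
            if token in STOP_WORDS or not (min_size <= len(token) <= max_size):
                continue
            if doc_id not in index[token]:
                index[token].append(doc_id)
    return dict(index)


if __name__ == "__main__":
    for token, doc_ids in build_inverted_index(DOCUMENTS).items():
        print(token, doc_ids)
```

A search for a token then becomes a dictionary lookup instead of a scan over every document, which is the point of the inverted index.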


Document Searches
• Search for "movie about girl"
• Term Frequency (TF)
  – "movie" occurs 1 time in each of Docs 1, 2, 3
  – "girl" occurs 1 time in Doc 2
    • No Doc has more than 1 occurrence of either word
• Inverse Document Frequency (IDF)
  – "movie" occurs in Docs 1, 2, 3
  – "girl" occurs only in Doc 2
    • "girl" is more meaningful or "weighted"
• Docs 1, 2, 3 match our search, but Doc 2 is most relevant



Additional Options & Variables
• innodb_ft_aux_table – View index details for this table
  – Via the INNODB_FT_INDEX_TABLE, INNODB_FT_INDEX_CACHE, INNODB_FT_CONFIG, INNODB_FT_DELETED, and INNODB_FT_BEING_DELETED Information_Schema tables
• innodb_ft_cache_size – In memory cache size for each index
• innodb_ft_total_cache_size – Total in memory cache size limit per server
• innodb_ft_num_word_optimize – Number of words processed during each OPTIMIZE TABLE run on a Full-Text index
• innodb_ft_result_cache_limit – In memory cache size limit for individual searches
• innodb_ft_sort_pll_degree – Number of parallel threads to use during index builds








Example Walkthrough
• Now let’s quickly demonstrate all of these terms & concepts in action
• We’ll use a very simple made up series of silly short stories


Example Walkthrough: Table and Data

mysql> create table short_stories (author varchar(100), story text);
Query OK, 0 rows affected (0.23 sec)

mysql> insert into short_stories values ("Matt Lord", "I've worked at MySQL and Oracle for about 12 years now. I'm currently the Product Manager for MySQL.");
Query OK, 1 row affected (0.03 sec)

mysql> insert into short_stories values ("Sid Lord", "I'm 10 years old. I like to eat and play video games. That's pretty much it.");
Query OK, 1 row affected (0.12 sec)

mysql> insert into short_stories values ("Lily Lord", "I'm almost 7 years old. I like to make art, play with toys, and play video games. And also, dress up. Yay!");
Query OK, 1 row affected (0.03 sec)

• This is the table, column, and data that we’ll add a Full Text index on


Example Walkthrough: Custom Stop Words

mysql> create table example.ss_words select * from information_schema.INNODB_FT_DEFAULT_STOPWORD;
Query OK, 36 rows affected (0.40 sec)

mysql> insert into ss_words values ("oracle"), ("and"), ("like");
Query OK, 3 rows affected (0.04 sec)

mysql> select group_concat(value) as stop_words from ss_words\G
*************************** 1. row ***************************
stop_words: a,about,an,are,as,at,be,by,com,de,en,for,from,how,i,in,is,it,la,of,on,or,that,the,this,to,was,what,when,where,who,will,with,und,the,www,oracle,and,like
1 row in set (0.00 sec)

mysql> set global innodb_ft_server_stopword_table="example/ss_words";
Query OK, 0 rows affected (0.00 sec)

• This is how we define words that will NOT be included in the Full Text index


Example Walkthrough: Token Sizes
• We can define the min and max token/word sizes
  – Words that fall outside of this min/max range will NOT be included in the index
    • And thus NOT used for searches
• We set constraints on the min and max length of words/tokens that we want to include in the index
  – Very short or very long words are typically invalid or so common as to be worthless
    • E.g.: a, an, de, ta, someverylongsentencethataccidentallygotstucktogethersomehowwhoops
• We’ll go with the defaults
  – innodb_ft_min_token_size=3 and innodb_ft_max_token_size=84
  – Words/Tokens outside of the 3-84 character range are ignored for the index




Example Walkthrough: Adding the Index

mysql> alter table short_stories add fulltext index (story);
Query OK, 0 rows affected, 1 warning (2.07 sec)

# Here we’re setting up the information_schema views so that we can see the index
# record details (on the next slide)
mysql> set global innodb_ft_aux_table="example/short_stories";
Query OK, 0 rows affected (0.00 sec)


Example Walkthrough: The Final Index

mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| almost    |            4 |           4 |         1 |      4 |        4 |
| also      |            4 |           4 |         1 |      4 |       86 |
| art       |            4 |           4 |         1 |      4 |       39 |
| currently |            2 |           2 |         1 |      2 |       60 |
| dress     |            4 |           4 |         1 |      4 |       92 |
| eat       |            3 |           3 |         1 |      3 |       28 |
| games     |            3 |           4 |         2 |      3 |       47 |
| games     |            3 |           4 |         2 |      4 |       75 |
…
| video     |            3 |           4 |         2 |      3 |       41 |
| video     |            3 |           4 |         2 |      4 |       69 |
| worked    |            2 |           2 |         1 |      2 |        5 |
| yay       |            4 |           4 |         1 |      4 |      102 |
| years     |            2 |           4 |         3 |      2 |       45 |
| years     |            2 |           4 |         3 |      3 |        7 |
| years     |            2 |           4 |         3 |      4 |       13 |
+-----------+--------------+-------------+-----------+--------+----------+
29 rows in set (0.00 sec)


Example Walkthrough: Our Final Sample Query

mysql> SELECT author, story, MATCH(story) AGAINST("toys and games") AS relevancy
    -> FROM short_stories WHERE MATCH(story) AGAINST("toys and games")
    -> ORDER BY relevancy DESC\G
*************************** 1. row ***************************
   author: Lily Lord
    story: I'm almost 7 years old. I like to make art, play with toys, and play video games. And also, dress up. Yay!
relevancy: 0.25865283608436584
*************************** 2. row ***************************
   author: Sid Lord
    story: I'm 10 years old. I like to eat and play video games. That's pretty much it.
relevancy: 0.031008131802082062
2 rows in set (0.00 sec)
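The relevancy values in this result can be approximated by hand. The Python sketch below applies the ranking formula that the MySQL Reference Manual gives for InnoDB full-text search: rank = TF * IDF * IDF, summed over the query terms, with IDF = log10(total documents / matching documents). The tokenizer and the function names are simplifying assumptions, and the server computes in lower precision, so only the leading digits should be expected to agree.

```python
import math
import re

STORIES = {
    "Matt Lord": "I've worked at MySQL and Oracle for about 12 years now. "
                 "I'm currently the Product Manager for MySQL.",
    "Sid Lord": "I'm 10 years old. I like to eat and play video games. "
                "That's pretty much it.",
    "Lily Lord": "I'm almost 7 years old. I like to make art, play with toys, "
                 "and play video games. And also, dress up. Yay!",
}


def words(text):
    # Simplified tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def relevancy(query_terms, stories):
    """Score each story as the sum over query terms of TF * IDF * IDF."""
    tokenized = {author: words(story) for author, story in stories.items()}
    total = len(tokenized)
    scores = {author: 0.0 for author in tokenized}
    for term in query_terms:
        matching = sum(1 for toks in tokenized.values() if term in toks)
        if matching == 0:
            continue
        idf = math.log10(total / matching)
        for author, toks in tokenized.items():
            scores[author] += toks.count(term) * idf * idf
    return scores


if __name__ == "__main__":
    # "and" is in the custom stop word list, so only "toys" and "games" count.
    ranked = sorted(relevancy(["toys", "games"], STORIES).items(),
                    key=lambda item: item[1], reverse=True)
    for author, score in ranked:
        print(f"{author}: {score:.6f}")
```

"games" appears in two of the three stories and "toys" in only one, so "toys" contributes far more to Lily's score. Matt's story scores zero, which is why the WHERE clause leaves it out of the result.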


What’s New in MySQL 5.6 and 5.7


What’s New?
• MySQL 5.6
  – InnoDB Full-Text Index support
    • Fully ACID compliant, MVCC search
    • With performance improvements over MyISAM
    • Easily customizable stop-word lists
• MySQL 5.7
  – Pluggable Full-Text Parser support
  – CJK Support
    • N-gram parser for Chinese, Japanese, and Korean
    • MeCab parser for Japanese


A Real World Example


An Internal Content Management System
• I have tons of valuable business related content
  – But it’s spread across various locations and formats
    • Wiki pages, PPTs, Word Docs, Txt docs, …
  – How can I ingest, aggregate, and correlate this data?
  – How can I provide a useful search tool?
• Let’s build something to vastly increase the value of our intranet content
  – Something similar to Google Desktop search or Apple’s Spotlight
    • But for the vast amounts of data strewn across our company intranet
  – We can then incorporate the search into a MySQL based intranet tool


Gathering The Contents of Our Existing Data
• Use any existing metadata that you already have
• Pull metadata from existing files
  – Specialized tools to extract metadata
    • Exiftool to gather metadata on image files & Exif2maps to pull location data from image files
    • Taglib to pull metadata from sound files
    • `libreoffice --headless --convert-to …` to extract plain text from Office formats
    • GNU Libextractor to pull metadata and location data from all file types
• Extract text content from binary format files (.ppt, .doc, .pdf, etc.)
  – Apache Tika (originally part of Lucene)
    • Auto-detects file format and uses appropriate parsing library
    • Extracts metadata and structured text content from all popular/common document and file formats


Apache Tika and MySQL

[Diagram: Extract Plain Text → Load Text Docs → Full Text Index]


Apache Tika Example
• Downloads, docs, etc. can be found at https://tika.apache.org

shell> java -jar tika-app-1.7.jar -z -t /tmp/MySQL_FTS.pptx
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
1
MySQL Full-Text Search
Matt Lord
MySQL Product Manager

2
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
2
3
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment …


Apache Tika Example Cont.

shell> ls /tmp/*.p*
/tmp/MySQL_5.7_GIS.pptx /tmp/MySQL_5.7_GIS_reborn.pptx /tmp/MySQL_FTS.pptx /tmp/MySQLGroupReplication.pdf

shell> for file in /tmp/*.p*; do java -jar tika-app-1.7.jar -z -t "$file" > "$file.txt" && echo -n "#DOC_END" >> "$file.txt"; done

shell> ls /tmp/*.txt
/tmp/MySQL_5.7_GIS.pptx.txt /tmp/MySQL_5.7_GIS_reborn.pptx.txt /tmp/MySQL_FTS.pptx.txt /tmp/MySQLGroupReplication.pdf.txt

shell> sed -n '55,62'p /tmp/MySQLGroupReplication.pdf.txt
Program Agenda

MySQL Group Replication Background

Zoom in: Major Building Blocks

Zoom in: The Complete Stack


Our MySQL Table

mysql> show create table intranet_doc\G
*************************** 1. row ***************************
       Table: intranet_doc
Create Table: CREATE TABLE `intranet_doc` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `type` varchar(50) DEFAULT NULL,
  `fs_path` varchar(200) DEFAULT NULL,
  `doc_host` varchar(60) DEFAULT NULL,
  `txt_content` longtext,
  PRIMARY KEY (`id`),
  KEY `type` (`type`),
  FULLTEXT KEY `txt_content` (`txt_content`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)


Loading in the Text Content

shell> for file in `ls /tmp/*.txt`; do mysql -D intranet_search -e \
"load data infile '$file' into table intranet_doc \
lines terminated by '#DOC_END' (txt_content) SET fs_path='$file', \
doc_host='`uname -n`', \
type=substring_index(substring_index('$file', '.', -2), '.', 1) "; done

mysql> select fs_path, type, doc_host from intranet_doc;
+------------------------------------+------+-------------------+
| fs_path                            | type | doc_host          |
+------------------------------------+------+-------------------+
| /tmp/MySQL_5.7_GIS.pptx.txt        | pptx | mylab.localdomain |
| /tmp/MySQL_5.7_GIS_reborn.pptx.txt | pptx | mylab.localdomain |
| /tmp/MySQL_FTS.pptx.txt            | pptx | mylab.localdomain |
| /tmp/MySQLGroupReplication.pdf.txt | pdf  | mylab.localdomain |
+------------------------------------+------+-------------------+
4 rows in set (0.00 sec)
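The nested SUBSTRING_INDEX call that fills the type column is easy to misread, so here is a small Python emulation of it. The helper names are hypothetical, and the emulation only covers a non-empty delimiter. The inner call keeps the last two dot-separated pieces of the path ("pptx.txt"), and the outer call keeps the first of those ("pptx").

```python
def substring_index(text, delim, count):
    """Emulate MySQL's SUBSTRING_INDEX for a non-empty delimiter."""
    parts = text.split(delim)
    if count > 0:
        # Everything before the count-th delimiter, counting from the left.
        return delim.join(parts[:count])
    if count < 0:
        # Everything after the count-th delimiter, counting from the right.
        return delim.join(parts[count:])
    return ""


def doc_type(fs_path):
    # Same expression as in the LOAD DATA statement above.
    return substring_index(substring_index(fs_path, ".", -2), ".", 1)


if __name__ == "__main__":
    for path in ("/tmp/MySQL_5.7_GIS.pptx.txt",
                 "/tmp/MySQLGroupReplication.pdf.txt"):
        print(path, "->", doc_type(path))
```

Counting from the right is what makes this work for a name like MySQL_5.7_GIS.pptx.txt, which contains an extra dot.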


Our Final Search Query
• Search for PowerPoint docs that mention Apache Tika

mysql> SELECT fs_path, doc_host, type
    -> FROM intranet_doc
    -> WHERE type LIKE "ppt%"
    -> AND MATCH(txt_content) AGAINST ("+Tika");
+-------------------------+-------------------+------+
| fs_path                 | doc_host          | type |
+-------------------------+-------------------+------+
| /tmp/MySQL_FTS.pptx.txt | mylab.localdomain | pptx |
+-------------------------+-------------------+------+
1 row in set (0.00 sec)


Integration with Lucene/Solr/Elasticsearch


Apache Lucene
• Lucene is the core Full-text search library
  – Written in Java
• Originally created by Doug Cutting (creator of Hadoop)
• Open source project (since 2003)
• Mature
• Easy to learn API
• Stores its indexes as files on disk
• Solr and Elasticsearch provide web services built on top of Lucene


MySQL Native Full Text VS Lucene

MySQL Native Full Text
• Eliminates complexity
• Single canonical source
• No need for synchronization
• Single query language (SQL)
• No additional maintenance
• Use
  – MySQL based app with basic full-text search
    • e.g. E-commerce app with a product description search

Lucene
• Supports very complex searches
• Supports stemming & fuzzy searches
• Very scalable
• Rich document handling (PDF, PPT, …)
• Easy to use RESTful web services
  – Solr, Elasticsearch, …
• Use
  – Full blown advanced search focused app
    • e.g. IMDB


Solr and MySQL
• Create simple custom DataImportHandler
  – http://wiki.apache.org/solr/DataImportHandler
• Full and incremental indexing
• Scheduled re-indexing to keep the two in sync


Solr and MySQL

[Diagram labels: Custom DataImportHandler XML, MySQL Connector/J]

• Easy integration
  – Index sample sakila database
    • http://localhost:8983/solr/sakila/collection1/dataimport?command=full-import


Elasticsearch and MySQL

[Diagram labels: JDBC River Plugin, MySQL Connector/J]

• Easy integration
  – Index sample sakila.country table

curl -XPUT 'localhost:9200/_river/sakila_country/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/sakila",
    "user" : "root",
    "password" : "mypass",
    "sql" : "select * from country"
  }
}'


What’s Next for MySQL Full-Text Search


Additional Features
• Improved performance
• More efficient disk space usage
• Support for stemming and facets
• Support for fuzzy string searches
• Support for aliases, synonyms, abbreviations, etc.
• Proximity search and use in relevancy scores
• Automatic ordering by relevancy
• What else would you like to see?
  – Let us know!


Appendix: Additional Resources
• Manual
  – https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
• Community forum
  – http://forums.mysql.com/list.php?107
• Apache Tika
  – https://tika.apache.org
• Report Full-Text bugs and submit feature requests
  – http://bugs.mysql.com/



Safe Harbor Statement
The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
