Getting Started with MySQL Full-Text Search
TRANSCRIPT
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
MySQL Full-Text Search
Matt Lord
MySQL Product Manager
@mattalord
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
MySQL Full-Text Search: Agenda
1 An Introduction to Full-Text Search
2 Common Terms and Concepts
3 What’s New in MySQL 5.6 and 5.7
4 A Real World Example
5 Integration with Lucene, Solr, and Elasticsearch
6 What’s Next for MySQL Full-Text Search
An Introduction to Full-Text Search
What is it?
• Search entire documents
  – Character based fields
    • CHAR, VARCHAR, TEXT
• For a search string
  – Combinations of words
  – Phrases: "specific string to match"
  – Wildcards: *
  – Requirements: +, -, ~
  – Expressions: (…)
  – Relevancy weight characters: <, >
Searching Without an Index
Searching With an Index
What Would I Use It For?
• Content management
  – What metadata should be used to describe the information
  – This helps to make your searches far more useful
• Search services
  – What documents or metadata contain certain terms or tokens
  – What documents are most relevant to the current view
  – What data do you think this user would be most interested in
How Would I Use It?
Collect → Store → Index → Search
• Collect search data
  – Existing documents describing the content
  – Generated metadata from the incoming content
• Store the data
  – Within MySQL tables
• Index the data
  – Add Full-Text indexes on the content columns
• Allow for efficient searches
  – Provide users with an efficient way to search the content
Common Terms and Concepts
Common Terms
• Token
  – Word or a series of characters
• Dictionary
  – What words are related, mean the same thing, are abbreviations for one another, etc.
• Stop Words
  – Words that should not be indexed
• Relevancy and Weight
  – How should we weight search terms and calculate document relevancy?
Tokens
• Tokens
  – Words, or a series of characters that together form common meaning
• Related Server options
  – innodb_ft_min_token_size – Don’t bother to index words shorter than this
    • These would typically be words that are invalid, or are extremely common
      – So they increase the size of the index and decrease search efficiency w/o real benefit
  – innodb_ft_max_token_size – Don’t bother to index words longer than this
    • These would typically be words that are invalid
      – So again, they increase the size of the index and decrease search efficiency w/o real benefit
Stop Words
• Server options
  – innodb_ft_enable_stopword – Should stop words be used at all for new indexes?
  – innodb_ft_server_stopword_table – Use this global table for the list of stop words
  – innodb_ft_user_stopword_table – Use this table for my own stop word list
    • All of the above only affect indexes created while they are set
      – CREATE INDEX, ALTER TABLE, OPTIMIZE TABLE, ANALYZE TABLE
• Default stop word list
  – SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
Relevancy and Weight
• Term Frequency (TF)
  – Measure of how often a token/word appears in an individual document
• Inverse Document Frequency (IDF)
  – Measure of how common a token/word is across all documents
• Coordinate Level Matching
  – Number of query terms that are found within an individual document
    • How close together are the matching terms?
• User Modifications
  – ‘>’ and ‘<’ characters can be used to grant terms higher or lower weight
  – ‘+’ and ‘-’ characters can be used to require terms be present or absent
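To make the user modifiers concrete, here is a minimal sketch of how a search string with `+`, `-`, `>`, `<`, `~`, and a trailing `*` could be broken into per-term requirements. It is an illustration only, not MySQL's parser: the `Term` class, the `parse_boolean_query` function, and the weight factors (2.0, 0.5, -1.0) are all made up for this example, and phrases in quotes and parenthesized expressions are not handled.

```python
from dataclasses import dataclass


@dataclass
class Term:
    word: str
    required: bool = False   # '+' prefix: term must be present
    excluded: bool = False   # '-' prefix: term must be absent
    weight: float = 1.0      # '>' raises, '<' lowers, '~' negates (illustrative factors)
    prefix: bool = False     # trailing '*': wildcard / prefix match


def parse_boolean_query(query):
    """Split a search string on whitespace and read the operator characters."""
    terms = []
    for raw in query.split():
        term = Term(word="")
        # Consume any leading operator characters.
        while raw and raw[0] in "+-<>~":
            op, raw = raw[0], raw[1:]
            if op == "+":
                term.required = True
            elif op == "-":
                term.excluded = True
            elif op == ">":
                term.weight *= 2.0    # hypothetical boost factor
            elif op == "<":
                term.weight *= 0.5    # hypothetical penalty factor
            elif op == "~":
                term.weight *= -1.0   # hypothetical: contributes negatively
        if raw.endswith("*"):
            term.prefix = True
            raw = raw[:-1]
        if raw:
            term.word = raw.lower()
            terms.append(term)
    return terms


TERMS = parse_boolean_query("+mysql -oracle >search <index full*")
for t in TERMS:
    print(t)
```

The point of the sketch is only that each operator attaches a property to a single term; how the server combines those properties into a final relevancy score is not shown here.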
A Full Text Index
• It’s an inverted index of relationships between tokens and documents

Documents
Document 1: This movie is about a boy going to war.
Document 2: This movie is about a girl starting an auto-shop.
Document 3: This movie is about flowers.

Token Filters
Stop Words: a about an are as at be by com de en for from how i in is it la of on or that the this to was what when where who will with und the www
Token Size: Min Token Size, Max Token Size

Documents → Tokenizer → Token Filters → Indexer → Full Text / Inverted Index

Full Text / Inverted Index
ID  TOKEN      DOCUMENT
1   movie      1,2,3
2   boy        1
3   girl       2
4   going      1
5   starting   2
6   war        1
7   auto-shop  2
8   flowers    3
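The slide's pipeline (tokenize, filter, index) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how InnoDB stores its index: the `tokenize` and `build_inverted_index` names are invented here, the stop word list is the one shown on the slide, the size limits of 3 and 84 are the defaults quoted later in the deck, and the tokenizer keeps hyphenated words together purely so that "auto-shop" comes out as it does on the slide.

```python
import re
from collections import defaultdict

# Stop word list as shown on the slide.
STOP_WORDS = set(
    "a about an are as at be by com de en for from how i in is it la of "
    "on or that the this to was what when where who will with und the www".split()
)


def tokenize(text, min_len=3, max_len=84):
    """Lowercase, split into words, then drop stop words and out-of-range tokens."""
    # Assumption: hyphenated words stay together, to mirror the slide's "auto-shop".
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [w for w in words if min_len <= len(w) <= max_len and w not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each surviving token to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for token in tokenize(text):
            if doc_id not in index[token]:
                index[token].append(doc_id)
    return dict(index)


DOCS = {
    1: "This movie is about a boy going to war.",
    2: "This movie is about a girl starting an auto-shop.",
    3: "This movie is about flowers.",
}
INDEX = build_inverted_index(DOCS)
for token, doc_ids in INDEX.items():
    print(token, doc_ids)
```

Looking up a token is then a single dictionary access, which is the reason an inverted index beats scanning every document for every search.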
Document Searches
• Search for “movie about girl”
• Term Frequency (TF)
  – “movie” occurs 1 time in Docs 1,2,3
  – “girl” occurs 1 time in Doc 2
    • No Doc has more than 1 occurrence of either word
• Inverse Document Frequency (IDF)
  – “movie” occurs in Docs 1,2,3
  – “girl” occurs only in Doc 2
    • “girl” is more meaningful or “weighted”
• Docs 1,2,3 match our search, but Doc 2 is most relevant
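The ranking argument on this slide can be worked through numerically. The sketch below is a simplification under assumptions: it scores each document as the sum over query terms of TF × IDF × IDF with IDF = log10(total documents / documents containing the term), which is loosely modeled on the ranking description in the MySQL manual and should not be read as the server's exact arithmetic. The `score` function and the pre-tokenized `DOC_TOKENS` are invented for the example; "about" is left out of the query terms because it is a stop word.

```python
import math
from collections import Counter

# Tokens per document after stop word and token size filtering.
DOC_TOKENS = {
    1: ["movie", "boy", "going", "war"],
    2: ["movie", "girl", "starting", "auto-shop"],
    3: ["movie", "flowers"],
}


def score(query_terms, doc_tokens):
    """Return {doc_id: rank} using a simplified TF * IDF * IDF sum."""
    total_docs = len(doc_tokens)
    scores = {}
    for doc_id, tokens in doc_tokens.items():
        tf = Counter(tokens)
        rank = 0.0
        for term in query_terms:
            matching_docs = sum(1 for t in doc_tokens.values() if term in t)
            if matching_docs == 0:
                continue  # term appears nowhere, contributes nothing
            idf = math.log10(total_docs / matching_docs)
            rank += tf[term] * idf * idf
        scores[doc_id] = rank
    return scores


# "movie about girl" minus the stop word "about".
SCORES = score(["movie", "girl"], DOC_TOKENS)
for doc_id, rank in sorted(SCORES.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(rank, 4))
```

Because "movie" appears in every document its IDF is log10(3/3) = 0, so it adds nothing to any score; "girl" appears in one of three documents, so only Doc 2 ends up with a non-zero rank.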
Additional Options & Variables
• innodb_ft_aux_table – View index details for this table
  – Via the INNODB_FT_INDEX_TABLE, INNODB_FT_INDEX_CACHE, INNODB_FT_CONFIG, INNODB_FT_DELETED, and INNODB_FT_BEING_DELETED Information_Schema tables
• innodb_ft_cache_size – In memory cache size for each table’s Full-Text index data
• innodb_ft_total_cache_size – Total in memory cache size limit per server
• innodb_ft_num_word_optimize – Number of words processed by each OPTIMIZE TABLE run on a Full-Text index
• innodb_ft_result_cache_limit – In memory cache size limit for individual searches
• innodb_ft_sort_pll_degree – Number of parallel threads to use during index builds
Example Walkthrough
• Now let’s quickly demonstrate all of these terms & concepts in action
• We’ll use a very simple made up series of silly short stories
Example Walkthrough: Table and Data
mysql> create table short_stories (author varchar(100), story text);
Query OK, 0 rows affected (0.23 sec)

mysql> insert into short_stories values ("Matt Lord", "I've worked at MySQL and Oracle for about 12 years now. I'm currently the Product Manager for MySQL.");
Query OK, 1 row affected (0.03 sec)

mysql> insert into short_stories values ("Sid Lord", "I'm 10 years old. I like to eat and play video games. That's pretty much it.");
Query OK, 1 row affected (0.12 sec)

mysql> insert into short_stories values ("Lily Lord", "I'm almost 7 years old. I like to make art, play with toys, and play video games. And also, dress up. Yay!");
Query OK, 1 row affected (0.03 sec)

• This is the table, column, and data that we’ll add a Full Text index on
Example Walkthrough: Custom Stop Words
mysql> create table example.ss_words select * from information_schema.INNODB_FT_DEFAULT_STOPWORD;
Query OK, 36 rows affected (0.40 sec)

mysql> insert into ss_words values ("oracle"), ("and"), ("like");
Query OK, 3 rows affected (0.04 sec)

mysql> select group_concat(value) as stop_words from ss_words\G
*************************** 1. row ***************************
stop_words: a,about,an,are,as,at,be,by,com,de,en,for,from,how,i,in,is,it,la,of,on,or,that,the,this,to,was,what,when,where,who,will,with,und,the,www,oracle,and,like
1 row in set (0.00 sec)

mysql> set global innodb_ft_server_stopword_table="example/ss_words";
Query OK, 0 rows affected (0.00 sec)

• This is how we define words that will NOT be included in the Full Text index
Example Walkthrough: Token Sizes
• We can define the min and max token/word sizes
  – Words that fall outside of this min/max range will NOT be included in the index
    • And thus NOT used for searches
• We set constraints on the min and max length of words/tokens that we want to include in the index
  – Very short or very long words are typically invalid or so common as to be worthless
    • E.g.: a, an, de, ta, someverylongsentencethataccidentallygotstucktogethersomehowwhoops
• We’ll go with the defaults
  – innodb_ft_min_token_size=3 and innodb_ft_max_token_size=84
  – Words/Tokens outside of the 3-84 character range are ignored for the index
Example Walkthrough: Adding the Index
mysql> alter table short_stories add fulltext index (story);
Query OK, 0 rows affected, 1 warning (2.07 sec)

# Here we’re setting up the information_schema views so that we can see the index
# record details (on the next slide)
mysql> set global innodb_ft_aux_table="example/short_stories";
Query OK, 0 rows affected (0.00 sec)
Example Walkthrough: The Final Index
mysql> select * from information_schema.INNODB_FT_INDEX_TABLE;
+-----------+--------------+-------------+-----------+--------+----------+
| WORD      | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+-----------+--------------+-------------+-----------+--------+----------+
| almost    |            4 |           4 |         1 |      4 |        4 |
| also      |            4 |           4 |         1 |      4 |       86 |
| art       |            4 |           4 |         1 |      4 |       39 |
| currently |            2 |           2 |         1 |      2 |       60 |
| dress     |            4 |           4 |         1 |      4 |       92 |
| eat       |            3 |           3 |         1 |      3 |       28 |
| games     |            3 |           4 |         2 |      3 |       47 |
| games     |            3 |           4 |         2 |      4 |       75 |
…
| video     |            3 |           4 |         2 |      3 |       41 |
| video     |            3 |           4 |         2 |      4 |       69 |
| worked    |            2 |           2 |         1 |      2 |        5 |
| yay       |            4 |           4 |         1 |      4 |      102 |
| years     |            2 |           4 |         3 |      2 |       45 |
| years     |            2 |           4 |         3 |      3 |        7 |
| years     |            2 |           4 |         3 |      4 |       13 |
+-----------+--------------+-------------+-----------+--------+----------+
29 rows in set (0.00 sec)
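The POSITION column above lines up with the offset of each word inside its story. The following sketch reproduces that for Lily's story, assuming a simple word-character tokenizer, the custom stop word list built earlier in the walkthrough, and the default 3-84 token size range. The `index_rows` function is invented for illustration, and Python character offsets stand in for byte offsets, which only coincide here because the text is plain ASCII.

```python
import re

# Default stop words plus the three custom ones added in the walkthrough.
STOP_WORDS = set(
    "a about an are as at be by com de en for from how i in is it la of "
    "on or that the this to was what when where who will with und the www "
    "oracle and like".split()
)


def index_rows(doc_id, text, min_len=3, max_len=84):
    """Return sorted (word, doc_id, position) rows for the tokens that survive filtering."""
    rows = []
    for match in re.finditer(r"[A-Za-z0-9_]+", text):
        word = match.group().lower()
        if min_len <= len(word) <= max_len and word not in STOP_WORDS:
            rows.append((word, doc_id, match.start()))
    return sorted(rows)


LILY = ("I'm almost 7 years old. I like to make art, play with toys, "
        "and play video games. And also, dress up. Yay!")
ROWS = index_rows(4, LILY)  # Lily's story has DOC_ID 4 in the output above
for row in ROWS:
    print(row)
```

Words such as "like", "and", and "with" never show up in the rows because they are stop words, and "I", "m", "7", and "up" are dropped for being shorter than three characters.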
Example Walkthrough: Our Final Sample Query
mysql> SELECT author, story, MATCH(story) AGAINST("toys and games") AS relevancy
    -> FROM short_stories WHERE MATCH(story) AGAINST("toys and games")
    -> ORDER BY relevancy DESC\G
*************************** 1. row ***************************
   author: Lily Lord
    story: I'm almost 7 years old. I like to make art, play with toys, and play video games. And also, dress up. Yay!
relevancy: 0.25865283608436584
*************************** 2. row ***************************
   author: Sid Lord
    story: I'm 10 years old. I like to eat and play video games. That's pretty much it.
relevancy: 0.031008131802082062
2 rows in set (0.00 sec)
What’s New in MySQL 5.6 and 5.7
What’s New?
• MySQL 5.6
  – InnoDB Full-Text Index support
    • Fully ACID compliant, MVCC search
    • With performance improvements over MyISAM
    • Easily customizable stop-word lists
• MySQL 5.7
  – Pluggable Full-Text Parser support
  – CJK Support
    • N-gram parser for Chinese, Japanese, and Korean
    • MeCab parser for Japanese
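An n-gram parser exists because Chinese, Japanese, and Korean text has no spaces to split words on, so the text is cut into overlapping runs of n characters instead. The sketch below shows the idea only. The `ngram_tokens` function is invented for this example, the default of n=2 is an assumption chosen to match the usual bigram setup, and treating each whitespace-separated run independently is likewise an assumption about how spaces are handled.

```python
def ngram_tokens(text, n=2):
    """Split each whitespace-separated run of characters into overlapping n-grams."""
    tokens = []
    for run in text.split():
        # A run shorter than n produces no tokens at all.
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


print(ngram_tokens("abcd"))        # overlapping bigrams of a Latin string
print(ngram_tokens("全文検索"))     # the same idea applied to CJK text
print(ngram_tokens("abcd", n=3))   # trigrams
```

A search string is cut up the same way, so a query matches wherever its n-grams appear in the index, with no dictionary needed to find word boundaries.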
A Real World Example
An Internal Content Management System
• I have tons of valuable business related content
  – But it’s spread across various locations and formats
    • Wiki pages, PPTs, Word Docs, Txt docs, …
  – How can I ingest, aggregate, and correlate this data
  – How can I provide a useful search tool
• Let’s build something to vastly increase the value of our intranet content
  – Something similar to Google Desktop search or Apple’s Spotlight
    • But for the vast amounts of data strewn across our company intranet
  – We can then incorporate the search into a MySQL based intranet tool
Gathering The Contents of Our Existing Data
• Use any existing metadata that you already have
• Pull metadata from existing files
  – Specialized tools to extract metadata
    • Exiftool to gather metadata on image files & Exif2maps to pull location data from image files
    • Taglib to pull metadata from sound files
    • `libreoffice --headless --convert-to …` to extract plain text from Office formats
    • GNU Libextractor to pull metadata and location data from all file types
• Extract text content from binary format files (.ppt, .doc, .pdf, etc.)
  – Apache Tika (originally part of Lucene)
    • Auto-detects file format and uses appropriate parsing library
    • Extracts metadata and structured text content from all popular/common document and file formats
Apache Tika and MySQL
[Diagram: Extract Plain Text → Load Text Docs → Full Text Index]
Apache Tika Example
• Downloads, docs, etc. can be found at https://tika.apache.org

shell> java -jar tika-app-1.7.jar -z -t /tmp/MySQL_FTS.pptx
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
1
MySQL Full-Text Search
Matt Lord
MySQL Product Manager
2Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |23Safe Harbor StatementThe following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment …
Apache Tika Example Cont.
shell> ls /tmp/*.p*
/tmp/MySQL_5.7_GIS.pptx /tmp/MySQL_5.7_GIS_reborn.pptx /tmp/MySQL_FTS.pptx /tmp/MySQLGroupReplication.pdf

shell> for file in /tmp/*.p*; do java -jar tika-app-1.7.jar -z -t "$file" > "$file.txt" && echo -n "#DOC_END" >> "$file.txt"; done

shell> ls /tmp/*.txt
/tmp/MySQL_5.7_GIS.pptx.txt /tmp/MySQL_5.7_GIS_reborn.pptx.txt /tmp/MySQL_FTS.pptx.txt /tmp/MySQLGroupReplication.pdf.txt

shell> sed -n '55,62'p /tmp/MySQLGroupReplication.pdf.txt
Program Agenda
MySQL Group Replication Background
Zoom in: Major Building Blocks
Zoom in: The Complete Stack
Our MySQL Table
mysql> show create table intranet_doc\G
*************************** 1. row ***************************
       Table: intranet_doc
Create Table: CREATE TABLE `intranet_doc` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `type` varchar(50) DEFAULT NULL,
  `fs_path` varchar(200) DEFAULT NULL,
  `doc_host` varchar(60) DEFAULT NULL,
  `txt_content` longtext,
  PRIMARY KEY (`id`),
  KEY `type` (`type`),
  FULLTEXT KEY `txt_content` (`txt_content`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)
Loading in the Text Content
shell> for file in /tmp/*.txt; do mysql -D intranet_search -e \
"load data infile '$file' into table intranet_doc \
lines terminated by '#DOC_END' (txt_content) SET fs_path='$file', \
doc_host='`uname -n`', \
type=substring_index(substring_index('$file', '.', -2), '.', 1) "; done

mysql> select fs_path, type, doc_host from intranet_doc;
+------------------------------------+------+-------------------+
| fs_path                            | type | doc_host          |
+------------------------------------+------+-------------------+
| /tmp/MySQL_5.7_GIS.pptx.txt        | pptx | mylab.localdomain |
| /tmp/MySQL_5.7_GIS_reborn.pptx.txt | pptx | mylab.localdomain |
| /tmp/MySQL_FTS.pptx.txt            | pptx | mylab.localdomain |
| /tmp/MySQLGroupReplication.pdf.txt | pdf  | mylab.localdomain |
+------------------------------------+------+-------------------+
4 rows in set (0.00 sec)
Our Final Search Query
• Search for PowerPoint docs that mention Apache Tika

mysql> SELECT fs_path, doc_host, type
    -> FROM intranet_doc
    -> WHERE type LIKE "ppt%"
    -> AND MATCH(txt_content) AGAINST ("+Tika");
+-------------------------+-------------------+------+
| fs_path                 | doc_host          | type |
+-------------------------+-------------------+------+
| /tmp/MySQL_FTS.pptx.txt | mylab.localdomain | pptx |
+-------------------------+-------------------+------+
1 row in set (0.00 sec)
Integration with Lucene/Solr/Elasticsearch
Apache Lucene
• Lucene is the core Full-text search library
  – Written in Java
• Originally created by Doug Cutting (creator of Hadoop)
• Open source project (since 2003)
• Mature
• Easy to learn API
• Stores its indexes as files on disk
• Solr and Elasticsearch provide web services built on top of Lucene
MySQL Native Full Text vs. Lucene

MySQL Native Full Text
• Eliminates complexity
• Single canonical source
• No need for synchronization
• Single query language (SQL)
• No additional maintenance
• Use
  – MySQL based app with basic full-text search
    • e.g. E-commerce app with a product description search

Lucene
• Supports very complex searches
• Supports stemming & fuzzy searches
• Very scalable
• Rich document handling (PDF, PPT, …)
• Easy to use RESTful web services
  – Solr, Elasticsearch, …
• Use
  – Full blown advanced search focused app
    • e.g. IMDB
Solr and MySQL
• Create simple custom DataImportHandler
  – http://wiki.apache.org/solr/DataImportHandler
• Full and incremental indexing
• Scheduled re-indexing to keep the two in sync
Solr and MySQL
[Diagram labels: Custom DataImportHandler XML, MySQL Connector/J]
• Easy integration
  – Index sample sakila database
    • http://localhost:8983/solr/sakila/collection1/dataimport?command=full-import
Elasticsearch and MySQL
• Easy integration
  – Index sample sakila.country table
    • curl -XPUT 'localhost:9200/_river/sakila_country/_meta' -d '{
        "type" : "jdbc",
        "jdbc" : {
          "url" : "jdbc:mysql://localhost:3306/sakila",
          "user" : "root",
          "password" : "mypass",
          "sql" : "select * from country"
        }
      }'
[Diagram labels: JDBC River Plugin, MySQL Connector/J]
What’s Next for MySQL Full-Text Search
Additional Features
• Improved performance
• More efficient disk space usage
• Support for stemming and facets
• Support for fuzzy string searches
• Support for aliases, synonyms, abbreviations, etc.
• Proximity search and use in relevancy scores
• Automatic ordering by relevancy
• What else would you like to see?
  – Let us know!
Appendix: Additional Resources
• Manual
  – https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
• Community forum
  – http://forums.mysql.com/list.php?107
• Apache Tika
  – https://tika.apache.org
• Report Full-Text bugs and submit feature requests
  – http://bugs.mysql.com/
Safe Harbor Statement
The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.