Post on 24-Feb-2016


Page 1: Integrated Full-Text Search (iFTS) in Microsoft SQL Server® 2008

Fernando Azpeitia Lopez
SQL Server Engine - Program Manager
Microsoft® Corporation

Page 2: Session Objectives And Takeaways

"What exactly is full-text search in a DB? What are its main scenarios, and how can it provide value to my solution?"

• Cover the main concepts and capabilities of our full-text search solution inside SQL Server.
• Realize how iFTS (SQL Server 2008 integrated FTS) can extract significant value out of unstructured and semi-structured data stored inside the DB.
• Demonstrate the usage of iFTS and how it solves your daily DB search needs.
• Cover iFTS best practices and workarounds.
• Evangelize SQL Server FTS and prove it ready for high-level production environments.
• Understand the future of search in DBs: where are we going, and why?

Page 3: Session Details

• Searching at Microsoft: Main Players

• SQL Server Full-Text Search (FTS): Overview of Basic Scenarios, Concepts, Features & Architecture (Demo)
  - The Need
  - The Scenario
  - The Solution
  - How do I deploy it?
  - How do I use it?

• SQL Server 2008 iFTS (Integrated FTS)
  - Why did we need a new Search infrastructure?
  - SQL FTS Roadmap
  - The new Architecture
  - Main Improvements (Demo)
  - Upgrading to iFTS
  - Best Practices and Workarounds
  - What is not in iFTS?
  - Upcoming Future

• Q&A

Page 4: Searching at Microsoft: Main Players

Bing
• Searches the WWW
• Highly scalable
• Does not use FTS as provided by SQL Server

MS Search
• Searches information on disk
• Integrated with the OS (e.g., Vista Desktop Search)
• Used by former SharePoint versions, SQL Server 2000 and 2005, Desktop Search, …

FAST
• Searches information across your organization (Enterprise Search)
• SharePoint Server search solution

iFTS in SQL Server 2008
• Provides full-text search over data in the RDBMS
• Completely redesigned search solution that leverages DB robustness and scalability

Page 5: Full-Text Search (FTS) in SQL Server: Overview

"The Need"

"How can I extract value out of vast amounts of non-relational data stored in the DB, by leveraging fast filtering mechanisms to get to the information I need?"

Page 6: Full-Text Search (FTS) in SQL Server: Overview

"The Scenario"

1. Heterogeneous data (different types and languages) is stored and managed in the DB: mainly unstructured and semi-structured data (e.g., binary documents, emails, XML, HTML, etc.).

2. Besides leveraging DB capabilities, the need to search efficiently over this data arises.

3. Special (non-relational) indexes must be created to enable search over these corpuses.

4. Users need to manage the data and its indexes together and consistently.

5. Main markets: Compliance/E-Discovery, Government/Law Enforcement, eCommerce, Customer Support, Vertical Search (Medical, Media, etc.).

Page 7: Full-Text Search (FTS) in SQL Server: Overview

"The Solution": SQL Server Full-Text Search (FTS)

• FTS is a feature integrated into SQL Server that allows fast and flexible querying of significant words and phrases in:
  - Character-based database columns
  - Documents stored in a binary-typed column (varbinary() and binary()), as well as XML data types
• FTS searches for language-aware tokens, keywords or phrases inside the database, providing scoring and relevance for the result set.

Page 8: Full-Text Search (FTS) in SQL Server: Overview

"How does it work?"

Tokenization
• The data is parsed by language-aware components (iFilters, stemmers, wordbreakers, etc.).

Indexing
• FTS uses a special indexing mechanism: custom FTIndexes stored in FTCatalogs.
• Each token (word) is indexed together with metadata related to it.

Querying
• Predicates: CONTAINS, FREETEXT
• Table-valued functions returning a relevance-based score: CONTAINSTABLE, FREETEXTTABLE
• Matching semantics: exact or fuzzy
• Various query options available, e.g., weight, NEAR, thesaurus, inflectional forms, top_n_by_rank, prefix, etc.
• Wide language support at indexing and query time.
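The difference between the predicates and the table-valued functions listed above can be sketched as follows. This is a minimal illustration, assuming the AdventureWorks sample table HumanResources.JobCandidate (used later in this deck) already has a full-text index on its Resume column; the search terms are made up for the example.

```sql
-- Assumption: HumanResources.JobCandidate.Resume is full-text indexed
-- and JobCandidateID is the full-text key column.

-- Predicate form: CONTAINS filters rows on an exact word or phrase.
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'"machinist"');

-- Predicate form: FREETEXT matches on meaning rather than exact wording.
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE FREETEXT(Resume, N'experienced machinist');

-- Table-valued form: CONTAINSTABLE returns one row per match with the
-- full-text key ([KEY]) and a relevance score ([RANK]).
SELECT jc.JobCandidateID, ft.[RANK]
FROM HumanResources.JobCandidate AS jc
INNER JOIN CONTAINSTABLE(HumanResources.JobCandidate, Resume, N'"machinist"') AS ft
    ON jc.JobCandidateID = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

The predicates only answer "does this row match?", so the table-valued functions are the ones to use when results must be ordered by relevance.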

Page 9: Full-Text Search (FTS) in SQL Server: Overview

Demo: "How do I deploy it?"

1. Locate a table with textual data (character-based or raw binary data).

2. Create a full-text catalog in that database.
   CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

3. Create a full-text index (associated with the full-text catalog) on the column(s) of the table.
   CREATE FULLTEXT INDEX ON HumanResources.JobCandidate(Resume) KEY INDEX PK_JobCandidate_JobCandidateID;

4. Populate the index, which scans and processes the textual data.

5. Run T-SQL queries that use full-text predicates or table-valued constructs.

6. A full set of T-SQL DDL is available to perform FTS manageability operations.
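Steps 2 to 4 above can be combined into one script that makes the population step explicit. This is a sketch under assumptions: it reuses the catalog, table and key index names from the slide, and it turns change tracking off only so that the population can be started by hand for demonstration.

```sql
-- Step 2: catalog for the current database.
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;

-- Step 3: index on the Resume column, created without populating it yet.
CREATE FULLTEXT INDEX ON HumanResources.JobCandidate(Resume)
    KEY INDEX PK_JobCandidate_JobCandidateID
    ON ftCatalog
    WITH (CHANGE_TRACKING = OFF, NO POPULATION);

-- Step 4: start a full population (crawl) of the indexed column.
ALTER FULLTEXT INDEX ON HumanResources.JobCandidate START FULL POPULATION;

-- Check progress: 0 means the population has finished (idle),
-- a non-zero value means a population is still in progress.
SELECT OBJECTPROPERTYEX(
           OBJECT_ID(N'HumanResources.JobCandidate'),
           'TableFulltextPopulateStatus') AS PopulateStatus;
```

Without the WITH clause, the index is created with automatic change tracking and a full population starts immediately, which is what the shorter statement on the slide relies on.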

Page 10: Full-Text Search (FTS) in SQL Server: Overview

Demo: "How do I use it?"

1. Populating the FT index. Keeping up with changes.

2. Querying my data: understanding CONTAINS and FREETEXT.

3. Extending the query to leverage useful features:
   1. Thesaurus
   2. Prefix search
   3. Inflectional forms
   4. Weighted terms
   5. TOP_N
   Etc.
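The query features in step 3 map onto the search-condition syntax roughly as shown below. The table is the JobCandidate example from the deployment demo; the search terms and weights are invented for illustration.

```sql
-- 1. Thesaurus: expand or replace the term using the language's thesaurus.
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'FORMSOF(THESAURUS, "engineer")');

-- 2. Prefix search: every word starting with "mach".
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'"mach*"');

-- 3. Inflectional forms: manage, manages, managed, managing, ...
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'FORMSOF(INFLECTIONAL, "manage")');

-- 4 and 5. Weighted terms, limited to the 10 best-ranked matches
-- (the last argument is top_n_by_rank).
SELECT ft.[KEY] AS JobCandidateID, ft.[RANK]
FROM CONTAINSTABLE(
         HumanResources.JobCandidate,
         Resume,
         N'ISABOUT("machinist" WEIGHT(0.8), "engineer" WEIGHT(0.2))',
         10) AS ft
ORDER BY ft.[RANK] DESC;
```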

Page 11
Page 12: Why a new Search Infrastructure?

Possible challenges you might have encountered when using SQL Server 2005 Full-Text Search (FTS):

• Indexes stored outside SQL Server might lead to manageability challenges.
  E.g., backing up/restoring your DB with FTCatalogs might require specific actions.

• Mixed query performance suffers from having to pull over the complete full-text result set.
  E.g., poor performance when the relational side is very selective.

• Possible scaling issues on big boxes.

• Many sustained engineering challenges due to lack of technology ownership.
  E.g., inability to efficiently support our customers; inability to deliver highly requested top features.

• Inability to shape our own roadmap targeting our direct customers' needs.

Page 13: Why a new Search Infrastructure?

Provide a foundation to address future needs:

• Extend the FTS feature set:
  - customizable proximity operator
  - property-level search
  - snippets with hit-highlighting
  - field-weighted relevance
  - customizable tokenizing
  - etc.

• Offer a compelling search platform for others to build on.

• Eventually, offer search within all data inside the database.

Page 14: Search: SQL Server Full-Text Search Roadmap

• FTIndexes into the database: seamless administration, manageability and high-availability experience for search indexes.

• Integration with the Query Processor: efficient and scalable query processing over the search-to-query continuum.

• Enabling search in semi-structured and unstructured DB storage solutions (e.g., Filestream, Sparse Columns, XML, etc.).

• Feature set comparable to what users see in the web and desktop search space: snippets, field-scoped search, weighted relevance, customizable NEAR, etc.

• Massive scale-up and performance enhancements. Become the fastest DB search engine on the market.

• Facilitate scale-out and manageability enhancements, e.g., remote indexing.

• Enabling the finding of ad-hoc information inside the DB without prior knowledge of its schema.

Page 15: FTS 2008 New Architecture

[Architecture diagram, indexing path. Labels: SQL Server process; SQL Query Processor (SQL query compilation, SQL query execution, FTS query compilation, FTS query execution); FTS Engine (MSFTESQL, FT search engine) with crawl gatherer, indexer, STOPLIST and full-text index; Filter Daemon Host (FDHost) with protocol handler, iFilters and wordbreakers; shared memory between the two processes. Data flows: DB data, data to be indexed, filtered text, keyword and occurrence information. Doc_table (id, doc_type, document).]

Page 16: FTS 2008 New Architecture

[Architecture diagram, query path. Labels: SQL Server process; SQL Query Processor (SQL query compilation, SQL query execution, FTS query compilation, FTS query execution); FTS Engine (MSFTESQL, FT search engine) with STOPLIST, thesaurus and full-text index; Filter Daemon Host (FDHost) with wordbreakers, used for wordbreaking of query terms; incoming query.]

Page 17: Main Improvements

• Full-text indexes are stored and maintained inside SQL Server.

• Full integration with other unstructured and semi-structured solutions.

• Manageability features like Backup/Restore, Attach/Detach, Mirroring and Log Shipping work for FT indexes just as for any other DB objects or regular indexes.

• Indexing and specific cases of query performance have been improved dramatically.

• (Demo) Access to the raw data of full-text indexes. Provides access to the search corpus and statistical information. Useful for troubleshooting scenarios.
  sys.dm_fts_index_keywords()
  sys.dm_fts_index_keywords_by_document()

  Output columns shown: Keyword, Display_term, Col_id, Doc_id, Occ_count
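A possible way to call the two functions above, assuming the current database holds the full-text indexed HumanResources.JobCandidate table from the deployment demo. The slide abbreviates the column names; the queries below use the full names exposed by the functions.

```sql
-- Most widespread keywords in the index, with the number of
-- documents (rows) each keyword appears in.
SELECT TOP (20) display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(
         DB_ID(),
         OBJECT_ID(N'HumanResources.JobCandidate'))
ORDER BY document_count DESC;

-- Per-document detail: which document contains which keyword,
-- and how many times it occurs there.
SELECT TOP (20) display_term, column_id, document_id, occurrence_count
FROM sys.dm_fts_index_keywords_by_document(
         DB_ID(),
         OBJECT_ID(N'HumanResources.JobCandidate'))
ORDER BY occurrence_count DESC;
```

The document_id values correspond to the full-text key of the base table, so the second result can be joined back to JobCandidateID when troubleshooting why a given row does or does not match.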

Page 18: Main Improvements

(Demo) Query input parser: sys.dm_fts_parser. Better supportability: understanding the behavior of a given WB.

SELECT * FROM sys.dm_fts_parser('"This is test" AND "This also"', 1033, 0, 0)

Group_id  Keyword             Occurrence  Special_term  Display_term  Expansion_type  Source_term
1         0x0074006800690073  1           Noise Word    This          0               This is test
1         0x00690073          2           Noise Word    is            0               This is test
1         0x0074006500730074  3           Exact Match   test          0               This is test
2         0x0074006800690073  1           Noise Word    this          0               This also
2         0x0061006C0073006F  2           Noise Word    also          0               This also
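The same function can also show how a query term is expanded, not only how it is broken into words. The sketch below keeps the arguments of the demo call: 1033 is the English (US) LCID, the first 0 selects the system stoplist, and the second 0 requests accent-insensitive parsing. The term "run" is only an example.

```sql
-- Show the inflectional forms the English stemmer generates for "run".
SELECT keyword, display_term, special_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "run")', 1033, 0, 0);
```

Rows produced by stemming are flagged in expansion_type, which makes it easy to tell them apart from the term as typed.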

Page 19: Main Improvements

• Full-text query is integrated with the SQL Query Processor.

• Mixed query performance has been improved for significantly different cardinality scenarios.

• Overall query performance scales better in most scenarios.

• Resource utilization is managed (mostly) as part of SQL Server.

Page 20: Main Improvements

Thesaurus improvements

• Stored in internal tables (in tempdb) in XML form instead of being parsed from external files.

• Instance-level thesaurus.

• sys.sp_fulltext_load_thesaurus_file (lcid)
  Loads all the data specified in the thesaurus XML corresponding to the language with the specified lcid.
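Reloading a thesaurus and checking its effect could look like the sketch below. It assumes the English (LCID 1033) thesaurus file has been edited to contain an entry for the example term "IE"; that entry is hypothetical and not part of the default file.

```sql
-- Reload the English thesaurus after editing its XML file.
EXEC sys.sp_fulltext_load_thesaurus_file 1033;

-- Inspect how the term is expanded; rows that come from the
-- thesaurus are flagged in the expansion_type column.
SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(THESAURUS, "IE")', 1033, 0, 0);
```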

Page 21: Main Improvements

(Demo) New STOPLIST support: simplified noise-word utilization and manageability. A stoplist is a DB object associated with the FT index.

CREATE FULLTEXT STOPLIST stoplist_name
    [ FROM { [ database_name. ] source_stoplist_name } | SYSTEM STOPLIST ]
    [ AUTHORIZATION owner_name ]

ALTER FULLTEXT STOPLIST stoplist_name
    {
        ADD <keyword> LANGUAGE language_term
      | DROP
        {
            <keyword> LANGUAGE language_term
          | ALL LANGUAGE language_term
          | ALL
        }
    }
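A concrete use of the syntax above might look like this. The stoplist name resumeStoplist and the added stopword are hypothetical; the table is the JobCandidate example from the deployment demo, and "also" is dropped because the parser demo showed it as an English noise word.

```sql
-- Start from a copy of the system stoplist.
CREATE FULLTEXT STOPLIST resumeStoplist FROM SYSTEM STOPLIST;

-- Treat "curriculum" as noise for English (LCID 1033) ...
ALTER FULLTEXT STOPLIST resumeStoplist ADD N'curriculum' LANGUAGE 1033;

-- ... and make "also" searchable again.
ALTER FULLTEXT STOPLIST resumeStoplist DROP N'also' LANGUAGE 1033;

-- Associate the stoplist with the full-text index.
ALTER FULLTEXT INDEX ON HumanResources.JobCandidate
    SET STOPLIST resumeStoplist;

-- Review the English entries of the new stoplist.
SELECT sw.stopword, sw.language
FROM sys.fulltext_stopwords AS sw
INNER JOIN sys.fulltext_stoplists AS sl
    ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'resumeStoplist'
  AND sw.language_id = 1033;
```

Changing the stoplist of an index normally triggers a repopulation, so on large tables this is best done in a maintenance window.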

Page 22: Main Improvements

New family of Word-Breakers (WB):

• WBs are the components responsible for parsing the textual data in a given language and passing the tokenized result to the full-text index.

• 51 languages/WBs out of the box.

• Improved quality and supportability in most word-breakers.
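The languages registered for full-text indexing on an instance can be listed from a catalog view, and a query can request a specific word-breaker by LCID. The query below reuses the JobCandidate example table; the search term is illustrative.

```sql
-- Languages (and their LCIDs) available to full-text search on this instance.
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;

-- Ask for the English (1033) word-breaker and stemmer at query time.
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'"machinist"', LANGUAGE 1033);
```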

Page 23: Main Improvements

WBs available in SQL Server 2008:

Arabic, Bengali, Brazilian, Bulgarian, Canadian, Catalan, Chinese (Simplified), Chinese (Traditional), Chinese (Hong Kong), Chinese (Macau), Chinese (Singapore), Croatian, Cyrillic, Danish, Dutch, English, English UK, French, German, Gujarati, Hebrew, Hindi, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Malayalam, Marathi, Neutral, Norwegian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian Latin, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese

Legend (color-coded on the original slide):
• Languages present but disabled by default
• New languages supported in SQL Server 2008
• Existing in SQL Server 2005, and being replaced by new WBs in SQL Server 2008
• Unchanged language/WB from SQL Server 2005

Page 24: Main Improvements

The indexing performance has improved in most scenarios.

Data set                     2005 Crawl  2005 Total  iFTS Crawl  iFTS Total
20M rows, 1k text data       02:06       02:25       01:22       01:28
5M rows, 8k text data        02:10       02:41       02:22       02:32
20M rows, 1k nvarchar data   01:37       01:55       01:20       01:26

Measured on a 4-processor AMD64 2793 MHz machine with 8 GB RAM. Numbers are in HH:MM format. Total time combines the time to crawl and the time to merge into the index.

For some hardware configurations and data types, specific best practices are recommended to improve indexing performance (e.g., capping SQL Server's memory).

Page 25: Upgrading to iFTS

Due to the new full-text index architecture, former full-text indexes are no longer compatible with SQL Server 2008.

Solution: the Full-Text Catalog Upgrade Option

• Import (default): faster method, although performance and semantic implications are possible.

• Rebuild: slower method, but the ideal final state of the new FTCatalogs is guaranteed.

• Reset: faster upgrade method, but your search application will not have the FTCatalogs available afterwards. You need to rebuild them when possible.

Possible upgrade methods:

• In-place upgrade: the user is prompted for which upgrade option to use for existing FTCatalogs.

• Restore/Attach: the instance-level setting is applied to former full-text catalogs brought in with the former DB.
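For the Restore/Attach path, the instance-level setting mentioned above can be changed with sp_fulltext_service. The sketch below is hedged: the mapping of numeric values to options in the comment reflects the SQL Server 2008 documentation as best recalled and should be checked against Books Online before use. The catalog name is the one from the deployment demo.

```sql
-- Instance-level full-text upgrade option used when a pre-2008 database
-- is restored or attached.
-- Assumed mapping (verify in Books Online): 0 = Rebuild, 1 = Reset, 2 = Import.
EXEC sp_fulltext_service @action = 'upgrade_option', @value = 0;

-- After an upgrade that used Reset (or Import), a catalog can be
-- rebuilt later, when the system has spare capacity.
ALTER FULLTEXT CATALOG ftCatalog REBUILD;
```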

Page 26: Best Practices and Workarounds

• Full-text key type: use an integer. With an integer key, no internal mapping table is required, so an extra internal JOIN is avoided at query time.

• To limit results and increase performance, use the top_n_by_rank option with FREETEXTTABLE and CONTAINSTABLE.

• Use CONTAINSTABLE or FREETEXTTABLE when you only require full-text key or rank information.

• Keep your FTIndexes defragmented when possible: reorganize the full-text catalog by using ALTER FULLTEXT CATALOG REORGANIZE.
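Two of the practices above in T-SQL form, as a minimal sketch against the JobCandidate example table and the ftCatalog catalog from the deployment demo; the search text and the limit of 25 are arbitrary.

```sql
-- Only key and rank are needed, so query the table-valued function
-- directly (no join to the base table) and cap the result with
-- top_n_by_rank (the last argument).
SELECT ft.[KEY] AS JobCandidateID, ft.[RANK]
FROM FREETEXTTABLE(
         HumanResources.JobCandidate,
         Resume,
         N'experienced machinist',
         25) AS ft
ORDER BY ft.[RANK] DESC;

-- Merge the index fragments created by incremental updates.
ALTER FULLTEXT CATALOG ftCatalog REORGANIZE;
```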

Page 27: Best Practices and Workarounds

Issues with prefix queries?

The issue
• Possible query performance degradation with insufficiently selective prefix queries (e.g., 'a*').

Recommended best practices
- Try to narrow down your search when possible.
- If the query plan has a nested loop with the full-text STVF, try hinting a merge join.
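One way to hint a merge join, sketched on the JobCandidate example table. The prefix term is illustrative, and whether the hint actually helps depends on the plan the optimizer would otherwise choose, so the two plans should be compared before keeping it.

```sql
-- Prefix query joined back to the base table. The query-level hint
-- asks the optimizer to use merge joins instead of nested loops.
SELECT jc.JobCandidateID, ft.[RANK]
FROM HumanResources.JobCandidate AS jc
INNER JOIN CONTAINSTABLE(HumanResources.JobCandidate, Resume, N'"mach*"') AS ft
    ON jc.JobCandidateID = ft.[KEY]
OPTION (MERGE JOIN);
```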

Page 28: Best Practices and Workarounds

Issues with complex queries?

A. Several ANDs + ORs within a single CONTAINS()

The issue
• QO plan and compilation complexity might cause blocking.
• Multiple FT logical operators add extra cost when the relational predicate is pushed into the FTIndex.
• When kept as relational, the plan chosen by the QO might not be ideal, causing costly executions.

Recommended best practice
• Implement an application-level maximum number of terms.
• Leverage thesaurus capabilities when possible.

B. Several CONTAINS predicates linked by OR conditions

The issue
• QO plan and compilation complexity might cause blocking.

Recommended best practice
• Combine multiple CONTAINS predicates into one CONTAINS predicate.
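The recommendation for case B amounts to moving the OR from the relational level into the full-text search condition. A sketch on the JobCandidate example table, with invented search terms:

```sql
-- Before: three full-text predicates combined at the relational level.
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'"machinist"')
   OR CONTAINS(Resume, N'"welder"')
   OR CONTAINS(Resume, N'"engineer"');

-- After: one predicate, with the OR evaluated inside the full-text engine.
SELECT JobCandidateID
FROM HumanResources.JobCandidate
WHERE CONTAINS(Resume, N'"machinist" OR "welder" OR "engineer"');
```

Both forms return the same rows when all predicates target the same column; the second gives the query optimizer a single full-text operator to plan around.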

Page 29: Best Practices and Workarounds

Blocking issues due to high DML + FT query workloads

The issue
• A high DML workload (more than 100 DMLs/sec),
• plus AUTO specified as the change-tracking mechanism for the full-text index,
• might cause the query load to suffer blocking.

Recommended best practices
• A trace flag (7646) exists to address this issue.
• There are documented best practices in the SQL Server 2008 white paper to mitigate this issue (e.g., manual change tracking, REORGANIZE, etc.).
• Install the latest PCU for the latest fixes in this area.
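The mitigations named above could be applied roughly as follows. This is a sketch on the JobCandidate table and ftCatalog catalog from the deployment demo; how often to run the update population is a scheduling decision left to the application, and the trace flag line is commented out because it should only be enabled after reading the white paper's guidance.

```sql
-- Stop propagating every change automatically; changes are still
-- tracked but applied to the index only on request.
ALTER FULLTEXT INDEX ON HumanResources.JobCandidate
    SET CHANGE_TRACKING MANUAL;

-- Run periodically (for example from a scheduled job) to apply
-- the tracked changes in one batch.
ALTER FULLTEXT INDEX ON HumanResources.JobCandidate
    START UPDATE POPULATION;

-- Merge the fragments created by the update populations.
ALTER FULLTEXT CATALOG ftCatalog REORGANIZE;

-- Trace flag mentioned on the slide, enabled instance-wide:
-- DBCC TRACEON (7646, -1);
```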

Page 30: Best Practices and Workarounds

Best resources for additional information:

• SQL Server 2008 BOL

• SQL Server 2008 iFTS Internals and Enhancements (White Paper): http://msdn.microsoft.com/en-us/library/cc721269.aspx

• Fernando Azpeitia Lopez: [email protected] Program Manager, SQL Server Full-Text Search.

Page 31: What is not yet in iFTS?

• No document property-level search (e.g., search 'foo' on doc.'title').

• Some of the customer wish-list items: snippets, column weights, language detection, customizable wordbreakers and proximity operators, etc.

• No support for remote FTIndexing. Currently, iFTS can only index data stored directly in the DB or in the file system using Filestream integration.

• No partitioned full-text indexes. No support for SWITCH partition on tables that are FT indexed.

Page 32: Upcoming Future

What are we working on for our next major release(s)?

• Dramatically improve our overall query performance by reengineering key components in our architecture. Early estimates indicate we will become the fastest DB engine on the market.

• Address as many developer features as time permits (e.g., customizable NEAR, property search, etc.).

• Improve our overall scale story to support large corpuses with impressive query performance.

Page 33: Summary

iFTS adds significant value
• Implementation -> straightforward
• Management -> straightforward
• Improved overall performance, integration and robustness

iFTS 2008 is the beginning of an ambitious plan to become the fastest and easiest-to-use DB search solution in the world.

Page 34: Thank you