Tuning Oracle Text for Rapid Document Retrieval - A Case Study

Sandi Wilson, Senior Oracle DBA

Summit Information Systems, a Fiserv Company

Post on 12-Sep-2021

Speaker Background

Corvallis, Oregon, US

BS in Computer Science, UCF

Oracle DBA 10+ years (7.3.4+)

Current Employment: Fiserv

– Banking applications

– Financial document archiving and retrieval

Agenda

What is Oracle Text?

Vocabulary

Scope

Structures

Parameters

Packages

Impacting Performance

Q&A

What is Oracle Text?

Earlier versions

– SQL*TextRetrieval

– TextServer

– ConText Option/Cartridge

– InterMedia Text

Robust text search and document classification

Supports a wide variety of source document types

What is Oracle Text? (cont.)

Query methods include: keywords, contexts, themes, word stems, pattern matching, fuzzy matching, HTML/XML sections

Output formats include: original document format, unformatted text, HTML with keyword highlighting, information visualization

Vocabulary

OT

– Oracle Text

Token/Search token

– a text string

Docid

– an identifier assigned by and used within the OT index to reference a source document

Preferences

– Parameter classes

Scope

CONTEXT index type

Keyword & XML section searches

10gR2 Standard edition features

Non-static source: 50-500GB

Index Creation

Use:

CREATE INDEX mytext_idx
  ON Documents(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore mydatastore storage mystorage section group mysectiongroup lexer mylexer memory 400M');

Creates a set of internal index tables using defined or default index preferences

Index Creation (cont.)

Parses and tokenizes text according to lexer parameters

Filters by removing tokens identified in stopword list

Source documents are processed in batches and appended to the index
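Once index creation finishes, one way to confirm the outcome is to check the index status and look for documents that failed to index. A minimal sketch, assuming the index name mytext_idx from the earlier slide and the standard CTX_USER_INDEXES and CTX_USER_INDEX_ERRORS views:

```sql
-- Status of the Text index as recorded in the Oracle Text data dictionary
SELECT idx_name, idx_table, idx_status
  FROM ctx_user_indexes
 WHERE idx_name = 'MYTEXT_IDX';

-- Documents that raised errors during indexing, newest first;
-- err_textkey identifies the source row
SELECT err_timestamp, err_textkey, err_text
  FROM ctx_user_index_errors
 WHERE err_index_name = 'MYTEXT_IDX'
 ORDER BY err_timestamp DESC;
```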

Internal Index Structure

DR$<index_name>$I – “Token” table

– List of tokens and where they occur

– BLOB: token_info

DR$<index_name>$X – B-tree index on $I table

DR$<index_name>$N – “Deleted” docids table

– List of invalid docids

Internal Index Structure (cont.)

DR$<index_name>$K

– Index-organized “Lookup” table

– Maps external (source data) rowids to internal index docids

DR$<index_name>$R

– Index-organized “Reverse Lookup” table

– Maps internal index docids to external rowids
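The internal objects can be listed from the regular data dictionary. A minimal sketch, assuming the index is named mytext_idx and is owned by the current user:

```sql
-- Internal tables of the index ($I, $K, $N, $R);
-- iot_type shows which of them are index-organized
SELECT table_name, iot_type
  FROM user_tables
 WHERE table_name LIKE 'DR$MYTEXT_IDX$%';

-- Indexes built on the internal tables, including $X on the $I table
SELECT index_name, table_name
  FROM user_indexes
 WHERE table_name LIKE 'DR$MYTEXT_IDX$%';
```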

Index Parameters

Storage

– Define storage clauses for DR$ objects

Datastore

– Identifies where source text is located

Lexer

– Rules for converting text to tokens

Stop words

– High frequency words, excluded from indexing

Section Groups

– HTML, XML, AUTO
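The preferences named in a CREATE INDEX parameter string have to exist before the index is built. A minimal sketch of how three of them might be defined with CTX_DDL; the preference names follow the Index Creation slide, while the attribute value and the section name "summary" are illustrative assumptions. A storage preference would be created the same way with the BASIC_STORAGE type.

```sql
BEGIN
  -- Datastore: the text is stored directly in the indexed column
  ctx_ddl.create_preference('mydatastore', 'DIRECT_DATASTORE');

  -- Lexer: whitespace-delimited languages; NO means tokens are not case-sensitive
  ctx_ddl.create_preference('mylexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mylexer', 'mixed_case', 'NO');

  -- Section group: XML documents, with one hypothetical zone section
  -- mapped to the <summary> tag
  ctx_ddl.create_section_group('mysectiongroup', 'XML_SECTION_GROUP');
  ctx_ddl.add_zone_section('mysectiongroup', 'summary', 'summary');
END;
/
```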

Index Packages

CTX_ADM

CTX_DDL

CTX_OUTPUT

CTX_REPORT

CTX_ADM

Set_Parameter

– Use:

CTX_ADM.SET_PARAMETER('max_index_memory','400M');

– default_index_memory

– log_directory

– Default_<preference> (e.g. datastore, lexer, stoplist, etc.)
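A minimal sketch of setting and reviewing these system parameters. It assumes the session is connected as CTXSYS; the 200M default is an illustrative value, not a figure from the case study.

```sql
-- Raise the ceiling and the default for indexing memory
BEGIN
  ctx_adm.set_parameter('max_index_memory', '400M');
  ctx_adm.set_parameter('default_index_memory', '200M');  -- assumed value
END;
/

-- Review the current Oracle Text system parameters
SELECT par_name, par_value
  FROM ctx_parameters
 ORDER BY par_name;
```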

CTX_DDL

SYNC_INDEX

– Use:

CTX_DDL.SYNC_INDEX('mytext_idx','100M');

– Processes newly added or modified documents, which are listed in the CTX_USER_PENDING view

– Source documents are processed in batches and appended to the index
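A minimal sketch of checking the pending queue before a sync, using the index name and memory setting from the slide:

```sql
-- Rows waiting to be synchronized, per index
SELECT pnd_index_name,
       COUNT(*)           AS pending_rows,
       MIN(pnd_timestamp) AS oldest_pending
  FROM ctx_user_pending
 GROUP BY pnd_index_name;

-- Synchronize the index using 100M of indexing memory
BEGIN
  ctx_ddl.sync_index(idx_name => 'mytext_idx', memory => '100M');
END;
/
```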

CTX_DDL (cont.)

Optimize_Index

– Use:

CTX_DDL.OPTIMIZE_INDEX('mytext_idx','fast');

– Defragments token info

– Removes references to deleted documents

– Modes: Fast (no garbage collection), Full, Token, Token Type, Rebuild
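The other modes are requested through the same procedure. A minimal sketch, assuming the index mytext_idx; the time limit and the token are illustrative values:

```sql
BEGIN
  -- FULL: defragments and removes references to deleted documents;
  -- maxtime (in minutes) caps the run, and the next FULL run continues from there
  ctx_ddl.optimize_index('mytext_idx', 'FULL', maxtime => 60);

  -- TOKEN: full optimization restricted to a single token (assumed token)
  ctx_ddl.optimize_index('mytext_idx', 'TOKEN', token => 'ACCOUNT');

  -- REBUILD: rebuilds the $I table
  ctx_ddl.optimize_index('mytext_idx', 'REBUILD');
END;
/
```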

CTX_DDL (cont.)

Create_Preference

Add_Stopword

Add_Stop_Section

Add_MDATA_Section
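A minimal sketch of these calls. The names mystoplist, myautogroup, doctype and footer are hypothetical, and the section group mysectiongroup is assumed to exist already as a type that supports MDATA (for example an XML section group). A stoplist takes effect when it is named in the index parameter string as 'stoplist mystoplist'.

```sql
BEGIN
  -- Stoplist holding application-specific high-frequency words
  ctx_ddl.create_stoplist('mystoplist', 'BASIC_STOPLIST');
  ctx_ddl.add_stopword('mystoplist', 'account');
  ctx_ddl.add_stopword('mystoplist', 'credit');

  -- MDATA section: values inside the <doctype> tag are indexed as metadata
  ctx_ddl.add_mdata_section('mysectiongroup', 'doctype', 'doctype');

  -- Stop sections apply to automatic section groups: skip the <footer> tag
  ctx_ddl.create_section_group('myautogroup', 'AUTO_SECTION_GROUP');
  ctx_ddl.add_stop_section('myautogroup', 'footer');
END;
/
```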

CTX_OUTPUT

Start_Log/End_Log

Add_Event/Remove_Event

– EVENT_INDEX_PRINT_ROWID

– EVENT_OPT_PRINT_TOKEN

– EVENT_INDEX_PRINT_TOKEN

Start_Query_Log/End_Query_Log

Add_Trace/Remove_Trace
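A minimal sketch of logging a sync with rowid printing enabled. The log file name is hypothetical, and the file is written under the directory set by the log_directory parameter. The procedure that closes the log is CTX_OUTPUT.END_LOG.

```sql
BEGIN
  ctx_output.start_log('mytext_idx_sync.log');   -- assumed file name

  -- Print the rowid of each document as it is indexed
  ctx_output.add_event(ctx_output.event_index_print_rowid);

  ctx_ddl.sync_index('mytext_idx', '100M');

  ctx_output.remove_event(ctx_output.event_index_print_rowid);
  ctx_output.end_log;
END;
/
```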

CTX_REPORT

Use to generate reports on index information and query activity

Name                  Description
DESCRIBE_INDEX        Creates a report describing the index.
DESCRIBE_POLICY       Creates a report describing a policy.
CREATE_INDEX_SCRIPT   Creates a SQL*Plus script to duplicate the named index.
CREATE_POLICY_SCRIPT  Creates a SQL*Plus script to duplicate the named policy.

CTX_REPORT (cont.)

Name               Description
INDEX_SIZE         Creates a report to show the internal objects of an index, their tablespaces and used sizes.
INDEX_STATS        Creates a report to show the various statistics of an index.
QUERY_LOG_SUMMARY  Creates a report showing query statistics.
TOKEN_INFO         Creates a report showing the information for a token, decoded.
TOKEN_TYPE         Translates a name and returns a numeric token type.
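A minimal sketch of running two of these reports from SQL*Plus. The table text_report_output is hypothetical and only holds the report for later reading. INDEX_STATS scans the whole index, so it can run for a long time on a large one.

```sql
SET LONG 1000000
SET PAGESIZE 0

-- Function form: returns the size report as a CLOB
SELECT ctx_report.index_size('mytext_idx') FROM dual;

-- Hypothetical table to keep the statistics report
CREATE TABLE text_report_output (created DATE, report CLOB);

-- Procedure form: the report is returned through a CLOB variable
DECLARE
  rpt CLOB := NULL;
BEGIN
  ctx_report.index_stats('mytext_idx', rpt);
  INSERT INTO text_report_output VALUES (SYSDATE, rpt);
  COMMIT;
  dbms_lob.freetemporary(rpt);
END;
/
```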

Impacting Performance

Parameters

Creation

Synchronization

Optimization

Query tuning

Other

Index Parameters and Performance

Define large max_index_memory and default_index_memory

Choose appropriate lexer preferences

– Avoid splitting useful composite strings into separate tokens. Ex: e-mail addresses

Define comprehensive stopword lists

– Avoid indexing unhelpful tokens Ex: “account”, “credit”, “authorize”

– Use CTX_REPORT.INDEX_STATS to identify new stopword candidates
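A minimal sketch of both ideas. The lexer name email_lexer and the printjoins character set are assumptions for illustration. The ALTER INDEX form shown is the 10g syntax for adding a stopword to an existing index; occurrences that are already indexed are assumed to remain until the affected documents are reindexed.

```sql
BEGIN
  -- Treat @ . _ - as part of a token when they sit between alphanumerics,
  -- so a string such as an e-mail address is indexed as a single token
  ctx_ddl.create_preference('email_lexer', 'BASIC_LEXER');   -- hypothetical name
  ctx_ddl.set_attribute('email_lexer', 'printjoins', '@._-');
END;
/

-- Add a newly identified stopword to an existing index
ALTER INDEX mytext_idx REBUILD PARAMETERS ('add stopword account');
```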

Index Creation and Performance

Maximize memory allocation for index creation

– If possible, reduce SGA in order to increase max_index_memory

Define MDATA sections to avoid costly “mixed” queries

Use NOLOGGING storage parameter during creation, alter to LOGGING after index is created
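A minimal sketch of the NOLOGGING approach for the $I table and its $X index. The tablespace name text_ts is an assumption. Since NOLOGGING operations cannot be recovered from the redo stream, a backup after the build is advisable.

```sql
BEGIN
  ctx_ddl.create_preference('mystorage', 'BASIC_STORAGE');
  -- Clauses are appended to the CREATE statements of the internal objects
  ctx_ddl.set_attribute('mystorage', 'I_TABLE_CLAUSE',
                        'tablespace text_ts nologging');
  ctx_ddl.set_attribute('mystorage', 'I_INDEX_CLAUSE',
                        'tablespace text_ts compress 2 nologging');
END;
/

-- After CREATE INDEX ... PARAMETERS ('... storage mystorage ...') completes,
-- switch the internal objects back to LOGGING
ALTER TABLE dr$mytext_idx$i LOGGING;
ALTER INDEX dr$mytext_idx$x LOGGING;
```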

Synchronization and Performance

Sync infrequently

Maximize memory allocation for sync

Optimization and Performance

Optimize after index creation

Optimize frequently

Rebuild mode achieves best $I defragmentation and space consolidation

Use lightweight 'token' and 'token type' modes for targeted improvements

Implement an aggressive optimization schedule
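One way to put a sync and optimization schedule in place is a DBMS_SCHEDULER job. A minimal sketch, assuming a nightly window at 02:00, the index mytext_idx, and a job owner who has been granted EXECUTE on CTX_DDL directly; the job name, memory size and time limit are illustrative.

```sql
BEGIN
  dbms_scheduler.create_job(
    job_name        => 'MYTEXT_IDX_NIGHTLY',   -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN
                          ctx_ddl.sync_index(''mytext_idx'', ''100M'');
                          ctx_ddl.optimize_index(''mytext_idx'', ''FULL'', maxtime => 120);
                        END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=2; BYMINUTE=0; BYSECOND=0',
    enabled         => TRUE,
    comments        => 'Nightly sync, then full optimize of mytext_idx');
END;
/
```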

Query Tuning

Use a single CONTAINS clause

Do not nest CONTAINS clauses in inner loop

When possible, use MDATA sections instead of mixed queries
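A minimal sketch of these rewrites against the Documents table from the Index Creation slide. The columns id and doc_type, the search terms, and the MDATA section doctype are assumptions for illustration.

```sql
-- Two CONTAINS clauses: each one is evaluated against the Text index
SELECT id
  FROM documents
 WHERE CONTAINS(text, 'statement') > 0
   AND CONTAINS(text, 'overdraft') > 0;

-- Single CONTAINS clause: the AND is resolved inside the Text index
SELECT id
  FROM documents
 WHERE CONTAINS(text, 'statement AND overdraft') > 0;

-- Mixed query: a text predicate combined with a relational predicate
SELECT id
  FROM documents
 WHERE CONTAINS(text, 'statement') > 0
   AND doc_type = 'LOAN';

-- MDATA alternative: the metadata filter moves into the text query
SELECT id
  FROM documents
 WHERE CONTAINS(text, 'statement AND MDATA(doctype, LOAN)') > 0;
```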

Other Performance Considerations

Gather statistics on the index, but not the internal tables

Use KEEP pool for internal tables
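A minimal sketch of assigning the token table and its index to the KEEP buffer pool. It assumes the index mytext_idx and that a KEEP pool has been sized through db_keep_cache_size.

```sql
-- Keep the most heavily read internal objects in the KEEP buffer pool
ALTER TABLE dr$mytext_idx$i STORAGE (BUFFER_POOL KEEP);
ALTER INDEX dr$mytext_idx$x STORAGE (BUFFER_POOL KEEP);
```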

Summary

Determine appropriate parameters for your OT application

Allocate maximum memory for OT processing

Synchronize infrequently

Optimize frequently

Tune the text queries

Q&A