www.semantec.de

Advanced searching with Oracle Text

Indexing and searching in text and documents

Author: Krasen Paskalev

Certified Oracle DBA

Semantec GmbH.

D-71083 Herrenberg

Agenda

• Motivation
– Problems when searching in documents

• Oracle Text features
– Oracle Text searching capabilities
– Document sources, formats and languages
– How indexing works
– Index types

• A business case

The Need

Find document X by keyword Y

SELECT doc_name FROM documents
WHERE UPPER(text) LIKE '%CAT%'

• We don't need APPLICATION, VACATION

• Too slow – often results in full table scans

• No information about relevance

• No search in files – Word, PDF, Excel

• No advanced searching

Finding information

• Major tasks of information systems:
– Store and
– Retrieve

• We are experts in storing both structured and non-structured data

• How do we find...
– Fast
– Precise
– Effective

• ...what we need?

Agenda

• Motivation

• Oracle Text features
– Oracle Text searching capabilities
– Document sources, formats and languages
– How indexing works
– Index types

• A business case

What is Oracle Text?

• Formerly known as ConText (8.0) and interMedia Text (8i)

• Uses standard SQL to index, search and analyze text and documents stored in the Oracle database, in files and on the Web

• Allows advanced searching including keyword search, pattern matching, boolean expressions, etc.

• Supports multiple languages

Example of Oracle Text search

Normal search:

SELECT doc_name FROM documents
WHERE UPPER(text) LIKE '%SPACE%'

Oracle Text index:

SELECT doc_name FROM documents
WHERE CONTAINS(text, 'space', 1) > 0
ORDER BY score(1) DESC
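The third argument of CONTAINS is a label that links the predicate to a SCORE call, so the relevance can also be returned to the caller. A minimal sketch, assuming the documents table from the slide and an existing CONTEXT index on its text column (the column alias relevance is made up for illustration):

```sql
-- Assumes a CONTEXT index already exists on documents.text.
-- Label 1 ties SCORE(1) to the CONTAINS predicate carrying the same label.
SELECT doc_name,
       SCORE(1) AS relevance
FROM   documents
WHERE  CONTAINS(text, 'space', 1) > 0
ORDER  BY SCORE(1) DESC;
```

For CONTEXT queries the score is a number between 0 and 100, and rows with a score of 0 do not match.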

Boolean expressions

• AND (&) – 'mouse & wireless'

• OR (|) – 'mouse | wireless'

• NOT (~) – 'mouse ~ wireless'

• ACCUMulate (,) – 'mouse, monitor, cd'

SELECT doc_name FROM documents
WHERE CONTAINS(text, 'mouse | wireless', 1) > 0
ORDER BY score(1) DESC
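The operators can be combined in one query string. The following sketch assumes the documents table from the slides with a CONTEXT index on text. The search term bluetooth is invented for the example, and the SET DEFINE OFF line only applies when the statements are run from SQL*Plus, where & would otherwise be read as a substitution variable.

```sql
-- SQL*Plus only: stop '&' from being treated as a substitution variable.
SET DEFINE OFF

-- Documents that mention both mouse and wireless, but not bluetooth.
-- Parentheses make the grouping explicit instead of relying on precedence.
SELECT doc_name, SCORE(1) AS relevance
FROM   documents
WHERE  CONTAINS(text, '(mouse & wireless) ~ bluetooth', 1) > 0
ORDER  BY SCORE(1) DESC;

-- ACCUMulate: any of the terms matches; documents with more of the
-- terms (and more occurrences) receive a higher score.
SELECT doc_name, SCORE(1) AS relevance
FROM   documents
WHERE  CONTAINS(text, 'mouse, monitor, cd', 1) > 0
ORDER  BY SCORE(1) DESC;
```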

Proximity

• NEAR – 'mouse' is within 5 words of 'wireless'

SELECT doc_name FROM documents
WHERE CONTAINS(text, 'NEAR((mouse,wireless),5)', 1) > 0
ORDER BY score(1) DESC
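NEAR also accepts an optional third argument that requires the terms to appear in the given order. A short sketch on the same assumed documents table:

```sql
-- TRUE as the third argument: 'wireless' must come before 'mouse',
-- and both must fall within a span of 5 words.
SELECT doc_name, SCORE(1) AS relevance
FROM   documents
WHERE  CONTAINS(text, 'NEAR((wireless, mouse), 5, TRUE)', 1) > 0
ORDER  BY SCORE(1) DESC;

-- Shorthand form with ';', which uses the default maximum span.
SELECT doc_name
FROM   documents
WHERE  CONTAINS(text, 'wireless ; mouse') > 0;
```

Terms that are closer together score higher, so ordering by score puts the tightest matches first.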

Expansion operators

• Expand the list of words searched for

• Wildcard (%, _) – '_ing', 'monito%'

• Soundex (!) – words that sound similar
– '!sing' -> sing sink

• Fuzzy – words that are spelled similarly
– 'fuzzy(sing,70,10,weight)' -> sing king sink

• Stem ($) – words having the same linguistic root
– '$sing' -> sing sang sung
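Each expansion operator is written inside the CONTAINS query string. The statements below are a sketch against the documents table used on the earlier slides, assuming a CONTEXT index on text. Which words an expansion actually returns depends on the indexed data and on the wordlist and language settings of the index.

```sql
-- Wildcard: any word starting with 'monito' (monitor, monitors, ...).
SELECT doc_name FROM documents WHERE CONTAINS(text, 'monito%') > 0;

-- Soundex: words that sound like 'sing'.
SELECT doc_name FROM documents WHERE CONTAINS(text, '!sing') > 0;

-- Fuzzy: similarity score of at least 70, at most 10 expansions,
-- results weighted by how similar the expanded word is.
SELECT doc_name FROM documents
WHERE  CONTAINS(text, 'fuzzy(sing, 70, 10, weight)') > 0;

-- Stem: words sharing the linguistic root of 'sing'.
SELECT doc_name FROM documents WHERE CONTAINS(text, '$sing') > 0;
```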

Thesauri

• Relationships between words are stored in a thesaurus:
– Synonym rings
– Hierarchical – broader and narrower terms
– Associative (related terms)
– Translation

Thesauri examples

• Theme search – 'ABOUT(economics)'

• Broader term – 'BT(cat)' -> animal

• Narrower term – 'NT(animal)' -> cat dog

• Associative relation – 'RT(cat)' -> kitten

• Translated term – 'TR(cat)' -> cat gato

• Synonym – 'SYN(cat)' -> cat tiger
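The thesaurus operators only return something once a thesaurus with the relevant relations has been loaded. The following sketch builds a tiny one with the CTX_THES package and queries it. The thesaurus name pets_thes is hypothetical, and it is assumed that the user may execute CTX_THES and that documents.text has a CONTEXT index.

```sql
BEGIN
  -- Hypothetical, case-insensitive thesaurus for the slide's examples.
  CTX_THES.CREATE_THESAURUS('pets_thes', FALSE);

  -- 'animal' becomes the broader term of 'cat' and of 'dog'.
  CTX_THES.CREATE_RELATION('pets_thes', 'cat', 'BT', 'animal');
  CTX_THES.CREATE_RELATION('pets_thes', 'dog', 'BT', 'animal');

  -- 'kitten' becomes a related term of 'cat'.
  CTX_THES.CREATE_RELATION('pets_thes', 'cat', 'RT', 'kitten');
END;
/

-- Narrower terms of 'animal', one level deep, from the named thesaurus.
SELECT doc_name
FROM   documents
WHERE  CONTAINS(text, 'NT(animal, 1, pets_thes)') > 0;
```

When no thesaurus name is given in the operator, Oracle Text looks the term up in the thesaurus named DEFAULT.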

Document sections

For documents having internal structure, like XML and HTML, sections can be defined and indexed

<book author="J.K. Rowling">Harry Potter</book>

'harry WITHIN book'

'rowling WITHIN book@author'

<A><B>I like my cat.</B></A>

XPath functions:

'cat INPATH(A/B)'

'HASPATH(A/B)'
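Section queries need an index that was built with a section group. A minimal sketch, assuming the XML documents sit in the text column of the documents table. The names doc_path_group and documents_xml_idx are invented for illustration. PATH_SECTION_GROUP is used here because INPATH and HASPATH require it, and it also supports WITHIN.

```sql
BEGIN
  -- Hypothetical section group; the PATH type indexes the XML structure.
  CTX_DDL.CREATE_SECTION_GROUP('doc_path_group', 'PATH_SECTION_GROUP');
END;
/

CREATE INDEX documents_xml_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP doc_path_group');

-- 'cat' must occur inside element B that is a child of element A.
SELECT doc_name
FROM   documents
WHERE  CONTAINS(text, 'cat INPATH(A/B)') > 0;

-- 'rowling' must occur in the author attribute of a book element.
SELECT doc_name
FROM   documents
WHERE  CONTAINS(text, 'rowling WITHIN book@author') > 0;
```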

Location of documents

• Direct – Text is stored directly in a text column

• Multi-column – Text is in multiple columns

• Detail – Text is in multiple rows of a detail table

• Nested – Text is stored in a nested table

• File – Documents are stored externally as files

• URL – Documents are stored externally as files on the Internet

• User – Documents are synthesized at index time by a stored procedure

Direct and Multi-column

Direct: documents (doc_name, author, text)

Multi-column: documents (doc_name, author, text), indexed as one combined document:

<doc_name> ... <author> ... <text> ...

Allowed datatypes:

• CHAR

• VARCHAR

• VARCHAR2

• BLOB

• CLOB

• BFILE

• XMLType
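A multi-column index is set up through a datastore preference that lists the columns to combine. The sketch below assumes the documents table from the slide. The preference and index names are hypothetical.

```sql
BEGIN
  -- Hypothetical preference that concatenates three columns per row.
  CTX_DDL.CREATE_PREFERENCE('docs_multi_ds', 'MULTI_COLUMN_DATASTORE');
  CTX_DDL.SET_ATTRIBUTE('docs_multi_ds', 'COLUMNS',
                        'doc_name, author, text');
END;
/

-- The index is created on one column, but its content comes from all
-- columns named in the preference.
CREATE INDEX documents_multi_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_multi_ds');
```

One design point worth knowing: the index is only marked for re-synchronization when the column it is created on changes, so an update that touches only author would also need to touch text (or a dedicated dummy column) to be picked up.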

Detail and Nested

Detail: master table documents (doc_name, author), detail table doc_details (doc_name, seq_no, text)

Nested: table documents (doc_name, author, doc_nst), nested table doc_nst (seq_no, text)
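For the detail layout, a datastore preference tells Oracle Text how to assemble one document from the detail rows. A sketch using the table and column names from the slide. The preference name, the index name and the extra master column dummy are assumptions, since the index must be created on some column of the master table.

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('docs_detail_ds', 'DETAIL_DATASTORE');
  -- FALSE: detail rows are text lines, a newline is added after each.
  CTX_DDL.SET_ATTRIBUTE('docs_detail_ds', 'BINARY', 'FALSE');
  CTX_DDL.SET_ATTRIBUTE('docs_detail_ds', 'DETAIL_TABLE', 'doc_details');
  -- Foreign key column in the detail table pointing to the master row.
  CTX_DDL.SET_ATTRIBUTE('docs_detail_ds', 'DETAIL_KEY', 'doc_name');
  -- Column that orders the detail rows within one document.
  CTX_DDL.SET_ATTRIBUTE('docs_detail_ds', 'DETAIL_LINENO', 'seq_no');
  CTX_DDL.SET_ATTRIBUTE('docs_detail_ds', 'DETAIL_TEXT', 'text');
END;
/

-- Assumption: documents has a column named dummy to carry the index.
CREATE INDEX documents_detail_idx ON documents(dummy)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_detail_ds');
```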

File and URL

File: documents (doc_name, author, text)

The column stores the document's location in the file system

File1: /location1/file1.doc
File2: /location1/file2.doc

URL: documents (doc_name, author, text)

The column stores the document's location on the Web

URL1: http://www.mysite.com/file1.doc
URL2: http://www.mysite.com/file2.doc
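A file index is again a matter of choosing the right datastore preference. The sketch below assumes that documents.text holds full path names as on the slide. The preference and index names are hypothetical, and the filter name depends on the release (CTXSYS.AUTO_FILTER in newer versions, CTXSYS.INSO_FILTER in older ones).

```sql
BEGIN
  -- Hypothetical preference: the indexed column holds file names and
  -- the files themselves are read from the database server's file system.
  CTX_DDL.CREATE_PREFERENCE('docs_file_ds', 'FILE_DATASTORE');
END;
/

CREATE INDEX documents_file_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE docs_file_ds FILTER CTXSYS.AUTO_FILTER');
```

For the URL case the same pattern applies with the URL_DATASTORE type instead of FILE_DATASTORE.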

Document formats

• Over 150 document formats are supported including:

• Microsoft Word, Excel, PowerPoint, Project

• HTML

• XML

• PDF

Languages

• Oracle Text supports indexing of text in different languages including:

• English, German, other western European

• Japanese, Chinese, Korean, ...

German language features

• Composite word indexing
– VERTRAGSANLAGE

• Alternate spelling
– ÖFFNEN <-> OEFFNEN
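Both German features are switched on through attributes of the lexer preference. A sketch, assuming a table docs with a text column. The preference and index names are made up for illustration.

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('german_lexer', 'BASIC_LEXER');
  -- Index the parts of compound words such as VERTRAGSANLAGE.
  CTX_DDL.SET_ATTRIBUTE('german_lexer', 'COMPOSITE', 'GERMAN');
  -- Treat OEFFNEN and ÖFFNEN as the same word.
  CTX_DDL.SET_ATTRIBUTE('german_lexer', 'ALTERNATE_SPELLING', 'GERMAN');
END;
/

CREATE INDEX docs_german_idx ON docs(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER german_lexer');
```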

How does indexing work?

Index types

• Oracle Text supports 3 types of indexes:
– CONTEXT
– CTXCAT
– CTXRULE

CONTEXT

• Use this index when your text consists of large coherent documents

• It is not transactional and needs periodic synchronization

• Supports all Oracle Text features
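Because a CONTEXT index is not updated as part of the transaction, changed rows only become searchable after a synchronization. A minimal sketch, assuming an index named myindex. In practice these calls would typically be scheduled as a periodic job.

```sql
BEGIN
  -- Process pending inserts, updates and deletes for the index.
  CTX_DDL.SYNC_INDEX('myindex');

  -- Frequent syncs fragment the index; a full optimization compacts it
  -- and removes entries of deleted rows.
  CTX_DDL.OPTIMIZE_INDEX('myindex', 'FULL');
END;
/
```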

CTXCAT

• Use this index for better query performance for mixed queries. Best for indexing small text fragments

• This index is automatically maintained when data is changed.

• Does not support all features
– No sections
– Only single column document location
– ...
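A mixed query combines a text condition with a structured one, and CTXCAT serves both from the same index. The sketch below uses a hypothetical auction table with the columns title and price. None of these names come from the slides.

```sql
BEGIN
  -- An index set names the structured columns to include in the index.
  CTX_DDL.CREATE_INDEX_SET('auction_iset');
  CTX_DDL.ADD_INDEX('auction_iset', 'price');
END;
/

CREATE INDEX auction_title_idx ON auction(title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('INDEX SET auction_iset');

-- CTXCAT indexes are queried with CATSEARCH, not CONTAINS.
-- Second argument: text query. Third argument: structured condition.
SELECT title, price
FROM   auction
WHERE  CATSEARCH(title, 'camera', 'price < 200 ORDER BY price') > 0;
```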

CTXRULE

• Used to build document classification or routing applications

• A table of queries and corresponding categories identifying the classification or routing criteria is defined

• Each incoming document can be classified to a category using the corresponding queries
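With CTXRULE the roles are reversed: the queries are indexed, and a document is passed in at search time. A sketch with a hypothetical rule table. The table, category names and queries are invented for the example.

```sql
-- Each row pairs a category with the query that defines it.
CREATE TABLE doc_rules (
  category VARCHAR2(30),
  query    VARCHAR2(2000)
);

INSERT INTO doc_rules VALUES ('hardware', 'mouse | monitor | cd');
INSERT INTO doc_rules VALUES ('pets',     'cat | dog');

-- The index is built on the query column.
CREATE INDEX doc_rules_idx ON doc_rules(query)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify an incoming document: returns every category whose query
-- matches the given text.
SELECT category
FROM   doc_rules
WHERE  MATCHES(query, 'I like my cat.') > 0;
```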

Index creation

CREATE INDEX myindex ON docs(text)
INDEXTYPE IS CTXSYS.CONTEXT;

• A number of preferences can be specified:

• Datastore – How are your documents stored?

• Filter – How can the documents be converted to plain text?

• Lexer – What language is being indexed?

• Wordlist – How should stem and fuzzy queries be expanded?

• Storage – How should the index data be stored?

• Stop list – Which words or themes should not be indexed?

• Section group – How are document sections defined?
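Preferences are passed to CREATE INDEX in the PARAMETERS string, one keyword per preference class. The sketch below spells out a few of the system-supplied preferences from the CTXSYS schema. In a real application these would usually be replaced by preferences created with CTX_DDL.CREATE_PREFERENCE.

```sql
-- Each keyword (DATASTORE, FILTER, LEXER, STOPLIST, ...) is followed by
-- the name of a preference. The ones below ship with Oracle Text.
CREATE INDEX myindex ON docs(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE CTXSYS.DEFAULT_DATASTORE FILTER CTXSYS.NULL_FILTER LEXER CTXSYS.DEFAULT_LEXER STOPLIST CTXSYS.DEFAULT_STOPLIST');
```

WORDLIST, STORAGE and SECTION GROUP are specified the same way. Any class left out falls back to the system default.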

Present search results

The cat is jumping on the floor.

• Filter – converts documents from their format to plaintext or HTML

• Highlight – generates offsets (location in document) of the text matching your query
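The presentation services live in the CTX_DOC package. As one hedged illustration, newer releases offer CTX_DOC.SNIPPET, which returns short fragments of the document with the matching words marked. The sketch assumes a table docs with a doc_name column and a CONTEXT index named myindex on its text column.

```sql
BEGIN
  -- Identify documents by ROWID in the CTX_DOC calls of this session.
  CTX_DOC.SET_KEY_TYPE('ROWID');
END;
/

-- For every hit, return a fragment around the matching word.
-- By default the matches are wrapped in <b> and </b> tags.
SELECT doc_name,
       CTX_DOC.SNIPPET('myindex', ROWID, 'cat') AS fragment
FROM   docs
WHERE  CONTAINS(text, 'cat') > 0;
```

CTX_DOC.FILTER and CTX_DOC.HIGHLIGHT cover the two services named on the slide: the first returns the plain text or HTML version of a document, the second the offsets and lengths of the matches.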

Agenda

• Motivation

• Oracle Text features

• A business case

A business case

• At Semantec we have a mission-critical collaboration platform – Service Manager

• Our customers communicate with us using Service Manager

• It allows planning, tracking, controlling and reporting on all objectives, projects and activities

The application

• The application has a number of text fields

The application

• It also has attachments

The searching needs

• We have developed a complex search using LIKE, but...

• No search in attachments

• No score

• No boolean operators

• No chance to peek at fragments of the text found

The solution

• We have created 2 Oracle Text indexes:
– A multi-column table index on the columns Name, Description and Notes
– A file index on the attachments

The solution

• We searched in both indexes

• After finding the results, we highlighted the portions of the text that contain the search terms

-- Hits in the text fields (multi-column index)
SELECT score(1) AS relevance, s.service_id
FROM sm_services s
WHERE CONTAINS(s.dummy_ctxindx, :srch, 1) > 0
UNION ALL
-- Hits in the attachments (file index)
SELECT score(2) AS relevance, s.service_id
FROM sm_services s, sm_upload a
WHERE s.id = a.service_id
AND CONTAINS(a.id_context, :srch, 2) > 0
ORDER BY relevance DESC, service_id

The result

Searched text – stem

Link to open the file

The score

Link to open the application at the item containing the attachment

Highlighted portions of the text

Indexing performance

• 522 – number of documents

• 178 MB – total size of the documents

• 15 min – indexing time

• 86162 – number of different words

• 13 MB – size of the index

Searching performance

50 times faster!

SELECT id
FROM sm_services
WHERE UPPER(name) LIKE '%MANAGER%'
OR UPPER(customer_descr) LIKE '%MANAGER%'
OR UPPER(supplier_note) LIKE '%MANAGER%'

Standard search -> Time: 10.56 sec

SELECT id
FROM sm_services
WHERE CONTAINS(dummy_ctxindx, 'manager', 1) > 0

Oracle Text search -> Time: 0.20 sec

Summary

• Fully integrated with the database

• Indexes everything...

• ... Located everywhere

• Powerful text search capabilities

• Oracle Text talks German

• "Google-izes" your application

Want to know more?

Company: Semantec GmbH.

Name: Krasen Paskalev, Armin Singer, Peter Kopecki

Address: Benzstr. 32, D-71083 Herrenberg, Germany

Telephone: +49(7032)9130-0

Telephone: +49(7032)9130-12

Fax: +49(7032)9130-22

E-Mail: [email protected], [email protected]

Internet: www.semantec.de

Meet us here -> booth 2C on the ground floor