www.semantec.de
Advanced searching with Oracle Text
Indexing and searching in text and documents
Author: Krasen Paskalev
Certified Oracle DBA
Semantec GmbH.
D-71083 Herrenberg
www.semantec.de
Agenda
• Motivation
  – Problems when searching in documents
• Oracle Text features
  – Oracle Text searching capabilities
  – Document sources, formats and languages
  – How indexing works
  – Index types
• A business case
The Need
Find document X by keyword Y
SELECT doc_name FROM documents
WHERE UPPER(text) LIKE '%CAT%'
• We don't need APPLICATION, VACATION
• Too slow – often results in full table scans
• No information about relevance
• No search in files – Word, PDF, Excel
• No advanced searching
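The APPLICATION/VACATION problem can be illustrated outside the database. The following is a minimal Python sketch, not Oracle code: `like_match` imitates the substring behaviour of LIKE '%CAT%', while `word_match` compares whole tokens, which is roughly what a text index does. The function names and sample documents are assumptions made for this illustration.

```python
import re

def like_match(text, keyword):
    """Imitates UPPER(text) LIKE '%KEYWORD%': a plain substring test."""
    return keyword.upper() in text.upper()

def word_match(text, keyword):
    """Matches only when the keyword occurs as a whole word (token)."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.upper())
    return keyword.upper() in tokens

# Hypothetical sample documents
docs = {
    "pets.txt": "My cat sleeps all day",
    "hr.txt": "Vacation application form",
}

like_hits = [name for name, text in docs.items() if like_match(text, "cat")]
word_hits = [name for name, text in docs.items() if word_match(text, "cat")]

print(like_hits)  # substring search also returns hr.txt
print(word_hits)  # token search returns only pets.txt
```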
Finding information
• Major tasks of information systems:
  – Store and
  – Retrieve
• We are experts in storing both structured and non-structured data
• How to find...
  – Fast
  – Precise
  – Effective
• ...what we need?
Agenda
• Motivation
• Oracle Text features
  – Oracle Text searching capabilities
  – Document sources, formats and languages
  – How indexing works
  – Index types
• A business case
What is Oracle Text?
• Formerly known as ConText (8.0) and interMedia Text (8i)
• Uses standard SQL to index, search and analyze text and documents stored in the Oracle database, in files and on the Web
• Allows advanced searching including keyword search, pattern matching, boolean expressions, etc.
• Supports multiple languages
Example of Oracle Text search
Normal search:
SELECT doc_name FROM documents
WHERE UPPER(text) LIKE '%SPACE%'
Oracle Text index:
SELECT doc_name FROM documents
WHERE CONTAINS(text, 'space', 1) > 0
ORDER BY score(1) DESC
Boolean expressions
• AND (&) – 'mouse & wireless'
• OR (|) – 'mouse | wireless'
• NOT (~) – 'mouse ~ wireless'
• ACCUMulate (,) – 'mouse, monitor, cd'
SELECT doc_name FROM documents
WHERE CONTAINS(text, 'mouse | wireless', 1) > 0
ORDER BY score(1) DESC
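One way to picture how AND, OR and NOT can be answered from an index instead of a table scan is a toy inverted index, sketched here in Python. This is an illustration under assumptions, not Oracle Text's implementation: `build_index` and the sample documents are hypothetical, and ACCUM is left out because it depends on scoring.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Maps each lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(name)
    return index

# Hypothetical sample documents
docs = {
    "a": "wireless mouse with USB receiver",
    "b": "optical mouse with cable",
    "c": "wireless keyboard",
}
index = build_index(docs)

mouse, wireless = index["mouse"], index["wireless"]
and_hits = mouse & wireless   # 'mouse & wireless'
or_hits = mouse | wireless    # 'mouse | wireless'
not_hits = mouse - wireless   # 'mouse ~ wireless'

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

Each boolean operator becomes a set operation on the posting sets, so the documents themselves are never re-read at query time.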
Proximity
• NEAR – 'mouse' is within 5 words of 'wireless'
SELECT doc_name FROM documents
WHERE CONTAINS(text, 'NEAR((mouse,wireless),5)', 1) > 0
ORDER BY score(1) DESC
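The idea behind a proximity test can be sketched in a few lines of Python. This is a simplified illustration that measures the distance between word positions; it is an assumption of this sketch that "within 5 words" means a position difference of at most 5, and Oracle's own span and scoring rules for NEAR may differ. The function name `near` is hypothetical.

```python
import re

def near(text, word_a, word_b, max_distance):
    """True if some occurrence of word_a lies within max_distance
    word positions of some occurrence of word_b."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

close = near("A wireless optical mouse", "mouse", "wireless", 5)
far = near("wireless " + "filler " * 10 + "mouse", "mouse", "wireless", 5)
print(close, far)
```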
Expansion operators
• Allow expanding the list of words searched for
• Wildcard (%, _) – '_ing', 'monito%'
• Soundex (!) – words that sound similar
  – '!sing' -> sing sink
• Fuzzy – words that are spelled similarly
  – 'fuzzy(sing,70,10,weight)' -> sing king sink
• Stem ($) – words having the same linguistic root
  – '$sing' -> sing sang sung
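To make "expanding the word list" concrete, the wildcard case can be sketched in Python: the pattern is turned into a regular expression and matched against the known words. This only illustrates the idea; the vocabulary and the helper names `wildcard_to_regex` and `expand` are assumptions, and soundex, fuzzy and stem expansion need language-specific logic that is not shown.

```python
import re

def wildcard_to_regex(pattern):
    """Translates SQL-style wildcards: % = any run of characters, _ = one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def expand(pattern, vocabulary):
    """Returns all vocabulary words that match the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return sorted(word for word in vocabulary if regex.fullmatch(word))

# Hypothetical list of indexed words
vocabulary = ["monitor", "monitoring", "monitors", "mouse",
              "sing", "king", "ring", "singer"]

print(expand("monito%", vocabulary))
print(expand("_ing", vocabulary))
```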
Thesauri
• Relationships between the words in Oracle Text are stored in a thesaurus:
  – Synonym rings
  – Hierarchical – Broader, Narrower term
  – Associative relation term
  – Translation
Thesauri examples
• Theme search – 'ABOUT(economics)'
• Broader term – 'BT(cat)' -> animal
• Narrower term – 'NT(animal)' -> cat dog
• Associative relation – 'RT(cat)' -> kitten
• Translated term – 'TR(cat)' -> cat gato
• Synonym – 'SYN(cat)' -> cat tiger
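A thesaurus lookup can be pictured as a dictionary of relations per term. The Python sketch below uses a toy thesaurus that contains only the entries from the examples above; the data structure and the function `expand_term` are assumptions for illustration and do not reflect how Oracle Text stores or loads a thesaurus. The outputs simply mirror the expansions listed above.

```python
# Toy thesaurus holding only the relations used in the examples
THESAURUS = {
    "cat": {"BT": ["animal"], "RT": ["kitten"], "SYN": ["tiger"], "TR": ["gato"]},
    "animal": {"NT": ["cat", "dog"]},
}

def expand_term(operator, term):
    """Returns the words a thesaurus operator expands a term to."""
    related = THESAURUS.get(term, {}).get(operator, [])
    if operator in ("SYN", "TR"):
        # As in the examples, these expansions include the term itself
        return [term] + related
    return list(related)

print(expand_term("BT", "cat"))
print(expand_term("NT", "animal"))
print(expand_term("SYN", "cat"))
```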
Document sections
For documents having internal structure, like XML and HTML, sections can be defined and indexed
<book author="J.K. Rowling">Harry Potter</book>
'harry WITHIN book'
'rowling WITHIN book@author'
<A><B>I like my cat.</B></A>
XPath functions:
'cat INPATH(A/B)'
'HASPATH(A/B)'
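What the section operators test can be imitated with Python's standard XML parser. This is a rough sketch of the semantics only: `within`, `inpath` and `haspath` are hypothetical helpers, paths are assumed to start at the document root, and Oracle Text answers such queries from its index rather than by parsing each document.

```python
import re
import xml.etree.ElementTree as ET

def words(text):
    """Lower-cased word tokens of a string (empty list for None)."""
    return re.findall(r"\w+", (text or "").lower())

def within(xml_doc, word, tag, attribute=None):
    """word WITHIN tag, or word WITHIN tag@attribute when attribute is given."""
    root = ET.fromstring(xml_doc)
    for element in root.iter(tag):
        source = element.get(attribute) if attribute else element.text
        if word.lower() in words(source):
            return True
    return False

def _elements_at(xml_doc, path):
    """Elements reached by a root-anchored path such as 'A/B'."""
    root = ET.fromstring(xml_doc)
    head, _, rest = path.partition("/")
    if root.tag != head:
        return []
    return root.findall(rest) if rest else [root]

def inpath(xml_doc, word, path):
    return any(word.lower() in words(e.text) for e in _elements_at(xml_doc, path))

def haspath(xml_doc, path):
    return bool(_elements_at(xml_doc, path))

book = '<book author="J.K. Rowling">Harry Potter</book>'
nested = "<A><B>I like my cat.</B></A>"
print(within(book, "harry", "book"), within(book, "rowling", "book", "author"))
print(inpath(nested, "cat", "A/B"), haspath(nested, "A/B"))
```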
Location of documents
• Direct – Text is stored directly in a text column
• Multi-column – Text is in multiple columns
• Detail – Text is in multiple rows of a detail table
• Nested – Text is stored in a nested table
• File – Documents are stored externally as files
• URL – Documents are stored externally as files on the Internet
• User – Documents are synthesized at index time by a stored procedure
Direct and Multi-column
Direct: documents (doc_name, author, text)
Multi-column: documents (doc_name, author, text), combined as
<doc_name>...<author>...<text>...
Allowed datatypes:
• CHAR
• VARCHAR
• VARCHAR2
• BLOB
• CLOB
• BFILE
• XMLType
Detail and Nested
Detail: master table documents (doc_name, author), detail table doc_details (doc_name, seq_no, text)
Nested: table documents (doc_name, author, doc_nst), nested table doc_nst (seq_no, text)
File and URL
File: documents (doc_name, author, text)
The column stores the document's location in the file system
File1: /location1/file1.doc
File2: /location1/file2.doc
URL: documents (doc_name, author, text)
The column stores the document's location on the Web
URL1: http://www.mysite.com/file1.doc
URL2: http://www.mysite.com/file2.doc
Document formats
• Over 150 document formats are supported including:
• Microsoft Word, Excel, PowerPoint, Project
• HTML
• XML
Languages
• Oracle Text supports indexing of text in different languages including:
• English, German, other western European
• Japanese, Chinese, Korean, ...
German language features
• Composite word indexing
  – VERTRAGSANLAGE
• Alternate spelling
  – ÖFFNEN <-> OEFFNEN
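The alternate spelling feature can be thought of as normalizing both the indexed word and the query word to one form. A minimal Python sketch of that idea follows; the mapping table and the function `normalize` are assumptions for illustration, not Oracle's lexer, and composite word splitting is not covered.

```python
# Maps German umlauts and sharp s to their two-letter alternate spellings
ALTERNATE = str.maketrans({
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
})

def normalize(word):
    """Brings both spellings of a word to the same upper-case form."""
    return word.translate(ALTERNATE).upper()

print(normalize("ÖFFNEN"), normalize("OEFFNEN"))
```

If the index and the query both store the normalized form, a search for either spelling finds documents written with the other.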
CONTEXT
• Use this index when your text consists of large coherent documents
• It is not transactional and needs periodic synchronization
• Supports all Oracle Text features
CTXCAT
• Use this index for better query performance for mixed queries. Best for indexing small text fragments
• This index is automatically maintained when data is changed.
• Does not support all features
  – No sections
  – Only single column document location
  – ...
CTXRULE
• Used to build document classification or routing applications
• A table of queries and corresponding categories identifying the classification or routing criteria is defined
• Each incoming document can be classified to a category using the corresponding queries
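The classification flow described above (a table of queries with categories, and each incoming document matched against those queries) can be sketched in Python. This is a simplified model under assumptions: each rule is reduced to a set of words that must all be present, and the rule table and the function `classify` are hypothetical. It does not show the CTXRULE index itself.

```python
import re

# Hypothetical rule table: (category, words that must all appear)
RULES = [
    ("hardware", {"mouse", "wireless"}),
    ("pets", {"cat"}),
]

def classify(document, rules):
    """Returns the categories whose query words all occur in the document."""
    tokens = set(re.findall(r"\w+", document.lower()))
    return [category for category, terms in rules if terms <= tokens]

print(classify("My cat chased the wireless mouse", RULES))
print(classify("A wired mouse", RULES))
```

Note the reversal compared with a normal search: the queries are stored and the document is the input.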
Index creation
CREATE INDEX myindex ON docs(text)
INDEXTYPE IS CTXSYS.CONTEXT;
• A number of preferences can be specified:
• Datastore – How are your documents stored?
• Filter – How can the documents be converted to plain text?
• Lexer – What language is being indexed?
• Wordlist – How should stem and fuzzy queries be expanded?
• Storage – How should the index data be stored?
• Stop list – Which words or themes should not be indexed?
• Section group – How are document sections defined?
Present search results
Example text: The cat is jumping on the floor.
• Filter – converts documents from their format to plaintext or HTML
• Highlight – generates offsets (location in document) of the text matching your query
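The notion of highlight offsets can be shown with a short Python sketch: for each match it records where the word starts and how long it is, and those offsets are enough to wrap the match in markup. The functions `highlight_offsets` and `mark` and the `<b>` tags are assumptions for illustration; they are not the Oracle Text highlighting API.

```python
import re

def highlight_offsets(text, word):
    """Returns (offset, length) for every whole-word match of word in text."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(text)]

def mark(text, word, start_tag="<b>", end_tag="</b>"):
    """Wraps every whole-word match in start_tag/end_tag."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: start_tag + m.group(0) + end_tag, text)

sentence = "The cat is jumping on the floor."
print(highlight_offsets(sentence, "cat"))
print(mark(sentence, "cat"))
```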
A business case
• At Semantec we have a mission-critical collaboration platform: Service Manager
• Our customers communicate with us using Service Manager
• It allows us to plan, track, control and report on all objectives, projects and activities
The searching needs
• We have developed a complex search using LIKE, but...
• No search in attachments
• No score
• No boolean operators
• No chance to peek at fragments of the text found
The solution
• We have created 2 Oracle Text indexes:
  – A multi-column table index on the columns Name, Description and Notes
  – A file index on the attachments
The solution
• We searched in both indexes
• After finding the results we highlighted portions of the text containing them
-- Hits from the multi-column index on the service itself
SELECT score(1) AS hit_score, s.service_id
FROM sm_services s
WHERE CONTAINS(s.dummy_ctxindx, :srch, 1) > 0
UNION ALL
-- Hits from the file index on the attachments
SELECT score(2) AS hit_score, s.service_id
FROM sm_services s, sm_upload a
WHERE s.id = a.service_id
AND CONTAINS(a.id_context, :srch, 2) > 0
ORDER BY hit_score DESC, service_id
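The way the two result sets are combined can be mirrored in a few lines of Python: concatenate the hits from both indexes (as UNION ALL does) and sort by score descending, then by service id. The function `merge_hits` and the sample scores are assumptions for illustration; in the application the database does this work.

```python
def merge_hits(service_hits, attachment_hits):
    """Combines (score, service_id) rows from both indexes and orders them
    by score descending, then by service_id ascending."""
    rows = list(service_hits) + list(attachment_hits)  # like UNION ALL: duplicates kept
    return sorted(rows, key=lambda row: (-row[0], row[1]))

# Hypothetical hits: (score, service_id)
service_hits = [(40, 7), (90, 3)]
attachment_hits = [(90, 1), (10, 7)]

merged = merge_hits(service_hits, attachment_hits)
print(merged)
```

A service can appear twice in the result, once for a hit in its own text and once for a hit in an attachment, each with its own score.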
The result
(Screenshot of the search results page, annotated with:)
• Searched text – stem
• Link to open the file
• The score
• Link to open the application at the item containing the attachment
• Highlighted portions of the text
Indexing performance
• 522 – number of documents
• 178 MB – total size of the documents
• 15 min – indexing time
• 86162 – number of different words
• 13 MB – size of the index
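A few derived figures help put these numbers in proportion. The short Python calculation below uses only the values listed above; the variable names are chosen for this sketch.

```python
documents = 522          # number of documents
total_size_mb = 178      # total size of the documents
indexing_minutes = 15    # indexing time
index_size_mb = 13       # size of the index

index_ratio = index_size_mb / total_size_mb          # index size relative to the data
throughput_mb_per_min = total_size_mb / indexing_minutes
docs_per_min = documents / indexing_minutes

print(f"Index size: {index_ratio:.1%} of the document volume")
print(f"Throughput: {throughput_mb_per_min:.1f} MB/min, {docs_per_min:.1f} documents/min")
```

The index takes roughly 7% of the space of the documents it covers, and indexing ran at about 12 MB per minute.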
Searching performance
50 times faster!
Standard search -> Time: 10.56 sec
SELECT id
FROM sm_services
WHERE UPPER(name) LIKE '%MANAGER%'
OR UPPER(customer_descr) LIKE '%MANAGER%'
OR UPPER(supplier_note) LIKE '%MANAGER%'
Oracle Text search -> Time: 0.20 sec
SELECT id
FROM sm_services
WHERE CONTAINS(dummy_ctxindx, 'manager', 1) > 0
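The "50 times faster" claim follows directly from the two timings. As a quick check in Python, using only the measured values from this comparison:

```python
standard_seconds = 10.56     # search with LIKE
oracle_text_seconds = 0.20   # search with CONTAINS

speedup = standard_seconds / oracle_text_seconds
print(f"Speedup: about {speedup:.1f}x")
```

The ratio is about 52.8, which the headline rounds to 50.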
Summary
• Fully integrated with the database
• Indexes everything...
• ... Located everywhere
• Powerful text search capabilities
• Oracle Text talks German
• "Google-izes" your application
Want to know more?
Company: Semantec GmbH.
Name: Krasen Paskalev, Armin Singer, Peter Kopecki
Address: Benzstr. 32, D-71083 Herrenberg, Germany
Telephone: +49(7032)9130-0
Telephone: +49(7032)9130-12
Fax: +49(7032)9130-22
E-Mail:
Internet: www.semantec.de
Meet us here -> booth 2C on the ground floor