text indexing and retrieval

Multimedia Database Management System - Chapter 4

Text Document Indexing and Retrieval

Rachmat Wahid Saleh Insani, S.Kom


Objectives

• Main differences between IR systems and DBMSs.

• General automatic indexing process and Boolean retrieval model.

• Vector space, probabilistic, and cluster-based retrieval models.

• Nontraditional IR methods.

• Performance measurement of IR.

• Performance comparison of different retrieval techniques.

• WWW.


Differences between IR system and DBMS

• IR system.

- No structured records and no fixed attributes.

- Retrieval depends on the degree of coincidence between query and document.

- A retrieved item may not be relevant.

• DBMS.

- Each record has a set of attributes.

- Retrieval is based on exact match.

- A retrieved item is definitely relevant.


Basic Document Retrieval Process


Basic Boolean Retrieval Model

• Documents indexed by a set of keywords.

• Queries are represented by a set of keywords and logical operators.
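A minimal sketch of Boolean retrieval, assuming each document is described by a plain set of keywords. The document ids (d1 to d3) and the keywords are hypothetical; the logical operators AND, OR, and NOT are mapped onto Python set operations.

```python
# Hypothetical keyword index: document id -> set of index keywords.
doc_keywords = {
    "d1": {"information", "retrieval", "index"},
    "d2": {"database", "retrieval"},
    "d3": {"information", "database"},
}

def docs_with(term):
    """Return the set of document ids indexed by `term`."""
    return {doc_id for doc_id, kws in doc_keywords.items() if term in kws}

all_docs = set(doc_keywords)

# information AND retrieval
and_result = docs_with("information") & docs_with("retrieval")
# information OR database
or_result = docs_with("information") | docs_with("database")
# retrieval AND NOT database
not_result = docs_with("retrieval") & (all_docs - docs_with("database"))

print(sorted(and_result))  # ['d1']
print(sorted(or_result))   # ['d1', 'd2', 'd3']
print(sorted(not_result))  # ['d1']
```

A document either satisfies the Boolean expression or it does not, so the result is an unranked set.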


File Structure

A document that is retrieved is called a record. A record may consist of many sentences and terms, and a term may occur in many records.


File Structure

• File structures in IR systems include:

• Flat file. One or more documents are stored in a file.

• Inverted file. Each term has a separate index that stores the identifiers of all records containing that term.

• Signature file. Contains bit patterns that represent documents.


Inverted Files

In an inverted file, a separate index is constructed for each term, storing the record identifiers of all records containing that term.
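The idea can be sketched as follows, assuming records are short whitespace-separated strings. The record ids and texts are hypothetical, and `build_inverted_file` is an illustrative helper name rather than part of any particular system.

```python
from collections import defaultdict

# Hypothetical records, used only for illustration.
records = {
    "R1": "information retrieval systems index text",
    "R2": "database systems store structured records",
    "R3": "text retrieval uses an inverted file",
}

def build_inverted_file(records):
    """Map each term to the sorted list of record ids containing it."""
    inverted = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            inverted[term].add(record_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

inverted_file = build_inverted_file(records)
print(inverted_file["retrieval"])  # ['R1', 'R3']
print(inverted_file["systems"])    # ['R1', 'R2']
```

Looking up a term is then a single dictionary access instead of a scan over every record.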


Extension of Inverted File Operation

• So far, two important factors have been ignored: term positions and term weights.

• The relationship between two or more terms can be strengthened by adding nearness parameters: within sentence and adjacency.

• For example, "term1 within sentence term2" means that term1 and term2 occur in a common sentence of a retrieved record.

• For example, "term1 adjacency term2" means that term1 and term2 occur next to each other in the retrieved documents.


General Structure of Extended Inverted File Operation

For example, if an inverted file has the following entries:

• information: R99, 10, 8, 3; R155, 15, 3, 6; R166, 2, 3, 1

• retrieval: R77, 9, 7, 2; R99, 10, 8, 4; R166, 10, 2, 5

Which record will be retrieved if the query is “information within sentence retrieval”?
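The question can be worked through in code under one explicit assumption: the four fields of each entry are read as (record id, paragraph number, sentence number, word number). The slide text does not state this layout, so treat it as a hypothesis. The function names `within_sentence` and `adjacent` are illustrative.

```python
# Postings copied from the example; the field order
# (record, paragraph, sentence, word) is an assumption.
postings = {
    "information": [("R99", 10, 8, 3), ("R155", 15, 3, 6), ("R166", 2, 3, 1)],
    "retrieval": [("R77", 9, 7, 2), ("R99", 10, 8, 4), ("R166", 10, 2, 5)],
}

def within_sentence(term1, term2):
    """Records where both terms share a record, paragraph, and sentence."""
    hits = set()
    for rec1, par1, sen1, _ in postings[term1]:
        for rec2, par2, sen2, _ in postings[term2]:
            if (rec1, par1, sen1) == (rec2, par2, sen2):
                hits.add(rec1)
    return sorted(hits)

def adjacent(term1, term2):
    """Records where the terms share a sentence and are one word apart."""
    hits = set()
    for rec1, par1, sen1, w1 in postings[term1]:
        for rec2, par2, sen2, w2 in postings[term2]:
            if (rec1, par1, sen1) == (rec2, par2, sen2) and abs(w1 - w2) == 1:
                hits.add(rec1)
    return sorted(hits)

print(within_sentence("information", "retrieval"))  # ['R99']
print(adjacent("information", "retrieval"))         # ['R99']
```

Under that reading, only R99 holds both terms in the same sentence (paragraph 10, sentence 8). R166 contains both terms, but in different paragraphs.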


Term Operation and Automatic Indexing

• A document contains many words, but not every word is useful for indexing, e.g., common function words such as “of”, “the”, and “a”.

• Terms are processed with several operations, e.g., stemming, thesaurus lookup, and weighting.

• Stemming. Conflates related word forms into a common stem.

• Thesaurus. A list of synonymous terms and, sometimes, the relationships among them.

• Weighting. Assigns each term a weight that reflects its importance in a document.


Weighting Formula

• $W_{ij}$, weight of term j in document i,

• $tf_{ij}$, frequency of term j in document i,

• $N$, total number of documents,

• $df_j$, number of documents containing term j.

$$W_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_j}\right)$$
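A small worked sketch, reading the formula as the term frequency multiplied by log(N / df_j). The counts are hypothetical, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math

def term_weight(tf_ij, n_docs, df_j):
    """Weight of term j in document i: tf_ij * log(N / df_j)."""
    return tf_ij * math.log(n_docs / df_j)

# Hypothetical counts: 1000 documents in the collection.
rare = term_weight(tf_ij=3, n_docs=1000, df_j=10)      # term in few documents
common = term_weight(tf_ij=3, n_docs=1000, df_j=1000)  # term in every document

print(rare > common)  # True
print(common)         # 0.0
```

A term that appears in every document gets weight zero, because it cannot help distinguish one document from another.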


Automatic Document Indexing

• The automatic indexing process consists of the following steps:

• Identify words in title, abstract, or document.

• Eliminate stop words.

• Identify synonyms.

• Stem words using certain algorithms.

• Count stem frequencies in each document.

• Calculate term weights.

• Create the inverted file based on the above terms and weights.
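The steps above can be sketched end to end. Everything concrete here is an assumption made for illustration: the stop list, the one-entry thesaurus, the crude suffix-stripping `stem` function (a stand-in for a real stemming algorithm), and the tf · log(N / df) weighting with a natural logarithm.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"of", "the", "a", "and", "in", "is", "for"}  # toy stop list
SYNONYMS = {"automobile": "car"}                           # toy thesaurus

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    words = re.findall(r"[a-z]+", text.lower())        # identify words
    words = [w for w in words if w not in STOP_WORDS]  # eliminate stop words
    words = [SYNONYMS.get(w, w) for w in words]        # map synonyms
    return Counter(stem(w) for w in words)             # stem and count

def build_weighted_inverted_file(docs):
    counts = {doc_id: index_terms(text) for doc_id, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    inverted = defaultdict(dict)
    for doc_id, c in counts.items():
        for term, tf in c.items():
            # Term weight: frequency times log(N / document frequency).
            inverted[term][doc_id] = tf * math.log(n_docs / df[term])
    return dict(inverted)

docs = {
    "D1": "Indexing of cars and the automobile",
    "D2": "Retrieval of text indexes",
    "D3": "A car",
}
inverted = build_weighted_inverted_file(docs)
print(sorted(inverted["car"]))    # ['D1', 'D3']
print(sorted(inverted["index"]))  # ['D1', 'D2']
```

In D1, "cars" and "automobile" both end up as the stem "car", so that term has frequency 2 there.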


Vector Space Retrieval Model

There is a fixed set of index terms to represent documents and queries.

$$D_i = [T_{i1}, T_{i2}, T_{i3}, \ldots, T_{ik}, \ldots, T_{iN}]$$

$$Q_j = [Q_{j1}, Q_{j2}, Q_{j3}, \ldots, Q_{jk}, \ldots, Q_{jN}]$$

• $T_{ik}$, weight of term k in document i

• $Q_{jk}$, weight of term k in query j

• $N$, total number of index terms used in documents and queries


Vector Space Retrieval Model

• To compensate for differences in document and query sizes, the similarity between document $D_i$ and query $Q_j$ is calculated as follows:

$$S(D_i, Q_j) = \frac{\sum_{k=1}^{N} T_{ik} \cdot Q_{jk}}{\sqrt{\sum_{k=1}^{N} T_{ik}^2 \cdot \sum_{k=1}^{N} Q_{jk}^2}}$$
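A minimal sketch of this similarity, assuming it is the usual cosine measure: the dot product of the two weight vectors divided by the product of their lengths. The vectors below are hypothetical weights over the same three index terms.

```python
import math

def similarity(doc_vec, query_vec):
    """Cosine similarity between a document and a query weight vector."""
    dot = sum(t * q for t, q in zip(doc_vec, query_vec))
    norm = math.sqrt(sum(t * t for t in doc_vec) * sum(q * q for q in query_vec))
    return dot / norm if norm else 0.0

# Hypothetical weights over the same three index terms.
d_short = [1.0, 1.0, 0.0]
d_long = [2.0, 2.0, 0.0]   # same term proportions, double the size
query = [1.0, 0.0, 0.0]

print(similarity(d_short, query))
print(similarity(d_long, query))
```

Both calls give about 0.707, which shows the normalization at work: doubling every weight in a document does not change its similarity to the query.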


Relevance Feedback Techniques

• Relevance feedback takes users’ judgements about the relevance of documents and uses them to modify the query or the document indexes. Query modification follows these rules:

- Terms that occur in documents judged relevant are added to the original query, or their weights are increased.

- Terms that occur in documents judged irrelevant are deleted from the query, or their weights are reduced.
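A sketch of query modification under these rules, assuming a query is a mapping from term to weight. The adjustment amount `step` and the function name `modify_query` are hypothetical; the slides do not say how large the change should be.

```python
def modify_query(query, relevant_terms, irrelevant_terms, step=0.5):
    """Return a new query dict (term -> weight) adjusted by user feedback.

    `step` is a hypothetical adjustment amount chosen for illustration.
    """
    updated = dict(query)
    for term in relevant_terms:
        # Add the term if missing, otherwise raise its weight.
        updated[term] = updated.get(term, 0.0) + step
    for term in irrelevant_terms:
        if term in updated:
            updated[term] -= step
            # Delete the term once its weight is no longer positive.
            if updated[term] <= 0:
                del updated[term]
    return updated

query = {"retrieval": 1.0, "java": 0.5}
new_query = modify_query(
    query,
    relevant_terms=["retrieval", "indexing"],  # seen in relevant documents
    irrelevant_terms=["java"],                 # seen in irrelevant documents
)
print(new_query)  # {'retrieval': 1.5, 'indexing': 0.5}
```

The original query is left untouched, so the effect of one round of feedback can be compared with the starting point.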


Relevance Feedback Techniques

• Document index terms are modified using query terms, so the changes made also affect other users. Document modification uses the following rules, based on relevance feedback:

- Terms that are in the query but not in the user-judged relevant documents are added to the document index list with an initial weight.

- Weights of index terms that are in both the query and the relevant documents are increased by a certain amount.

- Weights of index terms that are in the relevant documents but not in the query are decreased by a certain amount.
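The three rules can be sketched for a single document that the user judged relevant. The values of `initial` and `delta` stand in for the "initial weight" and "certain amount" and are purely hypothetical, as is the function name.

```python
def modify_document_index(doc_index, query_terms, initial=0.25, delta=0.125):
    """Update a relevant document's index (term -> weight) from a query.

    `initial` and `delta` are hypothetical values chosen for illustration.
    """
    updated = {}
    for term, weight in doc_index.items():
        if term in query_terms:
            updated[term] = weight + delta            # in query and document
        else:
            updated[term] = max(weight - delta, 0.0)  # in document only
    for term in query_terms:
        if term not in doc_index:
            updated[term] = initial                   # in query only
    return updated

doc_index = {"retrieval": 0.5, "boolean": 0.5}
new_index = modify_document_index(doc_index, {"retrieval", "indexing"})
print(new_index)  # {'retrieval': 0.625, 'boolean': 0.375, 'indexing': 0.25}
```

Because the stored index changes, later queries from any user see the adjusted weights.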


Traditional IR Method Issues

• Individual words do not contain all the information encoded in language.

• One word may have multiple meanings.

• A number of words may have a similar meaning.

• Phrases have meanings beyond the sum of individual words.


Ways to Improve IR Performance

• Natural language processing.

• Knowledge-based IR models.


Performance Measurement

• Information retrieval performance is measured using three parameters:

- Retrieval speed.

- Recall.

- Precision.
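A short sketch of the last two measures, using their standard definitions: recall is the fraction of all relevant documents that were retrieved, and precision is the fraction of retrieved documents that are relevant. The document ids are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
recall, precision = recall_precision(retrieved, relevant)
print(recall, precision)  # 0.4 0.5
```

Here two of the five relevant documents were found (recall 0.4), and two of the four retrieved documents are relevant (precision 0.5).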


Performance Comparison Among Different IR Techniques

• Automatic indexing is as good as manual indexing.

• The retrieval performance of partial match techniques is better than that of exact match techniques.

• The use of relevance feedback will improve the retrieval performance.

• Significant user input produces higher retrieval performance than no or limited user input.

• The use of domain knowledge and user profiles significantly improves retrieval performance.


World Wide Web

WWW is a collection of interlinked documents distributed around the world.


Introduction to WWW

• Hypertext. An information management approach in which data is stored in a network of nodes connected by computer-supported links. A hypertext document is made up of a number of nodes and links.

• Hypermedia. An extension of hypertext in which anchors and nodes can be any type of media, e.g., graphics, audio, and video.

• WWW is the integration of hypermedia and the Internet.


Architecture of WWW


Resource Discovery

• Resource discovery is the process of finding and retrieving information on the Internet.

• Locations of documents on the WWW and the Internet are specified using a Uniform Resource Locator (URL). The general format is protocol://server-name[:port]/document-name.

• Two ways to find and retrieve documents on the Internet:

• Organizing-Browsing

• Searching
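The URL format protocol://server-name[:port]/document-name can be taken apart with Python's standard `urllib.parse` module. The URL below is a hypothetical example.

```python
from urllib.parse import urlsplit

# Hypothetical URL following protocol://server-name[:port]/document-name.
url = "http://www.example.com:8080/docs/chapter4.html"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /docs/chapter4.html
```

When the optional port is left out of the URL, `parts.port` is `None`.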



Major Differences Between IR Systems and WWW Search Engines

• WWW documents are distributed around the Internet, whereas IR system documents are centrally located.

• The number of WWW documents is much greater than the number of IR system documents.

• WWW documents are more dynamic and heterogeneous.

• WWW documents are structured with HTML, whereas IR system documents are normally plain text.

• WWW search engines are used by more users and more frequently than IR systems.


General Structure of WWW Search Engine


Spider

The spider visits a Web page, reads it, and then follows links to other pages within the site. The spider may return to the site on a regular basis, such as every month or two, to look for changes.


Index

A collection containing a copy of every Web page that the spider finds. If a Web page changes, the index is updated with the new information.


Search Engine

A program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in order of estimated relevance.
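The three components can be illustrated with a toy model that needs no network access. The in-memory `WEB` dictionary, its URLs, and the ranking by simple term counts are all assumptions made for illustration; real search engines use far more elaborate crawling and ranking.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
WEB = {
    "/home": ("text retrieval home page", ["/ir", "/db"]),
    "/ir": ("retrieval models for text retrieval", ["/home"]),
    "/db": ("database records", []),
}

def spider(start):
    """Visit pages breadth-first by following links; return a copy of each."""
    index, queue = {}, deque([start])
    while queue:
        url = queue.popleft()
        if url in index or url not in WEB:
            continue  # already visited, or a dead link
        text, links = WEB[url]
        index[url] = text  # the index keeps a copy of the page
        queue.extend(links)
    return index

def search(index, query):
    """Rank indexed pages by how often they contain the query terms."""
    terms = query.lower().split()
    scores = {url: sum(text.split().count(t) for t in terms)
              for url, text in index.items()}
    return sorted((u for u, s in scores.items() if s > 0),
                  key=lambda u: (-scores[u], u))

index = spider("/home")
print(sorted(index))               # ['/db', '/home', '/ir']
print(search(index, "retrieval"))  # ['/ir', '/home']
```

The page "/ir" ranks first because it mentions the query term twice, while "/db" is left out because it does not mention it at all.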


Search Engine Example


Google
