text indexing and retrieval

Multimedia Database Management System - Chapter 4 Text Document Indexing and Retrieval Rachmat Wahid Saleh Insani, S.Kom

Post on 15-Jul-2015

Page 1: Text Indexing and Retrieval

Multimedia Database Management System - Chapter 4

Text Document Indexing and Retrieval

Rachmat Wahid Saleh Insani, S.Kom

Page 2: Text Indexing and Retrieval

Objectives

• Main differences between IR systems and DBMSs.

• General automatic indexing process and Boolean retrieval model.

• Vector space, probabilistic, and cluster-based retrieval models, respectively.

• Nontraditional IR methods.

• Performance measurement of IR.

• Performance comparison of different retrieval techniques.

• WWW.

Page 3: Text Indexing and Retrieval

Differences between IR Systems and DBMSs

• Indexing and retrieval (IR) system:

- No structured records and no fixed attributes.

- Retrieval depends on the degree of coincidence between the query and the document.

- A retrieved item may not be relevant.

• DBMS:

- Each record has a set of attributes.

- Retrieval is based on exact match.

- A retrieved item is definitely relevant.

Page 4: Text Indexing and Retrieval

Basic Document Retrieval Process

Page 5: Text Indexing and Retrieval

Basic Boolean Retrieval Model

• Documents indexed by a set of keywords.

• Queries are represented by a set of keywords and logical operators.
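The Boolean model can be sketched with plain set operations. This is a minimal illustration, not the slides' own implementation: the documents, their keywords, and the helper name `docs_with` are all hypothetical.

```python
# Hypothetical documents, each indexed by a set of keywords.
doc_keywords = {
    "D1": {"information", "retrieval", "index"},
    "D2": {"database", "index"},
    "D3": {"information", "database"},
}

def docs_with(term):
    """Return the set of document ids indexed by the given keyword."""
    return {doc_id for doc_id, keywords in doc_keywords.items() if term in keywords}

all_docs = set(doc_keywords)

# Query: information AND index
and_result = docs_with("information") & docs_with("index")

# Query: information OR database
or_result = docs_with("information") | docs_with("database")

# Query: index AND NOT database
not_result = docs_with("index") & (all_docs - docs_with("database"))

print(sorted(and_result))  # ['D1']
print(sorted(or_result))   # ['D1', 'D2', 'D3']
print(sorted(not_result))  # ['D1']
```

Each logical operator maps to one set operation, so a document either satisfies the query or it does not; there is no ranking in this model.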

Page 6: Text Indexing and Retrieval

File Structure

A document that is retrieved is called a record. A record may consist of many sentences and terms, and a term can occur in many records.

Page 7: Text Indexing and Retrieval

File Structure

• File structures used in IR systems include:

- Flat file. One or more documents are stored in a file.

- Inverted file. Each term has a separate index that stores the identifiers of all records containing that term.

- Signature file. Contains bit patterns (signatures) that represent documents.

Page 8: Text Indexing and Retrieval

Inverted Files

In an inverted file, a separate index is constructed for each term, storing the record identifiers of all records containing that term.
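A minimal sketch of building such a file, assuming hypothetical records and a plain lowercase whitespace split as the tokeniser (real systems would also remove stop words and stem):

```python
from collections import defaultdict

# Hypothetical records; the ids and texts are illustrative only.
records = {
    "R1": "information retrieval systems index text",
    "R2": "database systems store structured records",
    "R3": "text retrieval uses an inverted file",
}

def build_inverted_file(records):
    """Map each term to the sorted list of record identifiers containing it."""
    postings = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            postings[term].add(record_id)
    return {term: sorted(ids) for term, ids in postings.items()}

inverted_file = build_inverted_file(records)
print(inverted_file["retrieval"])  # ['R1', 'R3']
print(inverted_file["systems"])    # ['R1', 'R2']
```

Answering a single-term query is then one dictionary lookup instead of a scan over every record.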

Page 9: Text Indexing and Retrieval

Extension of Inverted File Operation

• We have ignored 2 important factors: term positions and term weights.

• Relationship between 2 or more terms can be strengthened by adding nearness parameters: within sentence and adjacency.

• For example, "term1 within sentence term2" means that term1 and term2 occur in a common sentence of a retrieved record.

• For example, "term1 adjacency term2" means that term1 and term2 occur next to each other in the retrieved documents.

Page 10: Text Indexing and Retrieval

General Structure of Extended Inverted File Operation

For example, if an inverted file has the following entries:

• information: R99, 10, 8, 3; R155, 15, 3, 6; R166, 2, 3, 1

• retrieval: R77, 9, 7, 2; R99, 10, 8, 4; R166, 10, 2, 5

Which record will be retrieved if the query is “information within sentence retrieval”?
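The question can be worked through in code. The slide does not state what the four fields of each entry mean; the sketch below assumes they are record identifier, paragraph number, sentence number, and word number, in that order. The function names are hypothetical.

```python
# Postings copied from the slide.
# ASSUMPTION: each entry is (record id, paragraph no., sentence no., word no.).
postings = {
    "information": [("R99", 10, 8, 3), ("R155", 15, 3, 6), ("R166", 2, 3, 1)],
    "retrieval": [("R77", 9, 7, 2), ("R99", 10, 8, 4), ("R166", 10, 2, 5)],
}

def within_sentence(term1, term2):
    """Records in which both terms occur in the same sentence of the same paragraph."""
    hits = set()
    for rec1, par1, sen1, _ in postings[term1]:
        for rec2, par2, sen2, _ in postings[term2]:
            if (rec1, par1, sen1) == (rec2, par2, sen2):
                hits.add(rec1)
    return sorted(hits)

def adjacency(term1, term2):
    """Records in which the terms sit at neighbouring word positions of one sentence."""
    hits = set()
    for rec1, par1, sen1, word1 in postings[term1]:
        for rec2, par2, sen2, word2 in postings[term2]:
            same_sentence = (rec1, par1, sen1) == (rec2, par2, sen2)
            if same_sentence and abs(word1 - word2) == 1:
                hits.add(rec1)
    return sorted(hits)

print(within_sentence("information", "retrieval"))  # ['R99']
print(adjacency("information", "retrieval"))        # ['R99']
```

Under that assumption only R99 qualifies: both terms fall in paragraph 10, sentence 8. R166 contains both terms, but in different paragraphs and sentences.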

Page 11: Text Indexing and Retrieval

Term Operation and Automatic Indexing

• A document contains many words, but not every word is useful for indexing, e.g., stop words such as “of”, “the”, and “a”.

• Terms are processed with several operations, e.g., stemming, thesaurus lookup, and weighting.

• Stemming. Conflates related word forms into a common stem.

• Thesaurus. A list of synonymous terms and, sometimes, the relationships among them.

• Weighting. Assigns each term a weight in each document.

Page 12: Text Indexing and Retrieval

Weighting Formula

• $W_{ij}$, weight of term j in document i,

• $tf_{ij}$, frequency of term j in document i,

• N, total number of documents,

• $df_j$, number of documents containing term j.

$$W_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_j}\right)$$
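A small numeric sketch of the weighting formula. The slide does not give the base of the logarithm, so the natural log used here is an assumption, and the counts are made up for illustration.

```python
import math

def term_weight(tf_ij, n_docs, df_j):
    """W_ij = tf_ij * log(N / df_j). The natural log is an assumption."""
    return tf_ij * math.log(n_docs / df_j)

# Hypothetical counts: the term occurs 3 times in the document, and
# 10 of the 1000 documents in the collection contain it.
w_rare = term_weight(3, 1000, 10)

# The same term frequency, but the term occurs in every document.
w_common = term_weight(3, 1000, 1000)

print(round(w_rare, 4))  # 13.8155
print(w_common)          # 0.0
```

A term that appears in every document gets weight zero no matter how often it occurs, which is why the formula favours terms that discriminate between documents.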

Page 13: Text Indexing and Retrieval

Automatic Document Indexing

• The automatic indexing process consists of the following steps:

• Identify words in title, abstract, or document.

• Eliminate stop words.

• Identify synonyms.

• Stem words using certain algorithms.

• Count stem frequencies in each document.

• Calculate term weights.

• Create the inverted file based on the above terms and weights.
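The first steps of the process (word identification, stop-word removal, stemming, and stem counting) can be sketched as follows. The stop-word list and the suffix-stripping `toy_stem` are deliberately tiny, hypothetical stand-ins for a real stop list and a real stemming algorithm; synonym identification, weighting, and inverted-file creation are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "is", "are", "to"}

def toy_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_frequencies(text):
    """Identify words, drop stop words, stem the rest, and count each stem."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(word) for word in words if word not in STOP_WORDS]
    return Counter(stems)

counts = stem_frequencies("Indexing of the documents: the index is indexed.")
print(counts["index"])     # 3
print(counts["document"])  # 1
```

"Indexing", "index", and "indexed" collapse to one stem, so their occurrences are counted together before any weights are calculated.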

Page 14: Text Indexing and Retrieval

Vector Space Retrieval Model

There is a fixed set of index terms to represent documents and queries. Document $D_i$ and query $Q_j$ are represented as:

$$D_i = [T_{i1}, T_{i2}, T_{i3}, \ldots, T_{ik}, \ldots, T_{iN}]$$

$$Q_j = [Q_{j1}, Q_{j2}, Q_{j3}, \ldots, Q_{jk}, \ldots, Q_{jN}]$$

• $T_{ik}$, weight of term k in document i

• $Q_{jk}$, weight of term k in query j

• N, total number of terms in documents and queries

Page 15: Text Indexing and Retrieval

Vector Space Retrieval Model

• To compensate for differences in document and query sizes, the similarity between document $D_i$ and query $Q_j$ is calculated as follows:

$$S(D_i, Q_j) = \frac{\sum_{k=1}^{N} T_{ik} \cdot Q_{jk}}{\sqrt{\sum_{k=1}^{N} T_{ik}^{2} \cdot \sum_{k=1}^{N} Q_{jk}^{2}}}$$
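This normalised similarity is the cosine of the angle between the document and query vectors. A minimal sketch, assuming both vectors are equal-length lists of weights over the same fixed term set (the example vectors are hypothetical):

```python
import math

def similarity(doc, query):
    """Cosine coefficient between two equal-length term-weight vectors."""
    dot = sum(t * q for t, q in zip(doc, query))
    norm = math.sqrt(sum(t * t for t in doc) * sum(q * q for q in query))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0

query = [1, 0, 0]
short_doc = [1, 0, 0]
long_doc = [2, 0, 0]   # same term, twice the weight
mixed_doc = [1, 0, 1]  # shares one term with the query, adds another

print(similarity(short_doc, query))  # 1.0
print(similarity(long_doc, query))   # 1.0
print(round(similarity(mixed_doc, query), 4))  # 0.7071
```

The first two results are equal because the denominator cancels the effect of vector length, which is the size compensation the formula is meant to provide.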

Page 16: Text Indexing and Retrieval

Relevance Feedback Techniques

• Relevance feedback takes users’ judgements about the relevance of documents and uses them to modify the query or the document indexes. Query modification follows these rules:

- Terms that occur in relevant documents are added to the original query, or their weights are increased.

- Terms that occur in irrelevant documents are deleted from the query, or their weights are reduced.
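One way to read the two query-modification rules is as a simple additive update. The sketch below is an illustration under that assumption, not the method the slides prescribe; the function name and the `step` constant are hypothetical.

```python
def modify_query(query, relevant_docs, irrelevant_docs, step=0.5):
    """Return a new query (term -> weight) adjusted by user feedback.

    step is a hypothetical tuning constant, not a value from the slides.
    """
    new_query = dict(query)
    for doc_terms in relevant_docs:
        for term in doc_terms:
            # Add the term, or raise its weight if it is already present.
            new_query[term] = new_query.get(term, 0.0) + step
    for doc_terms in irrelevant_docs:
        for term in doc_terms:
            if term in new_query:
                new_query[term] -= step
                if new_query[term] <= 0:
                    del new_query[term]  # weight used up: drop the term
    return new_query

query = {"retrieval": 1.0, "java": 0.5}
updated = modify_query(
    query,
    relevant_docs=[{"retrieval", "indexing"}],
    irrelevant_docs=[{"java", "coffee"}],
)
print(updated)
```

Here "indexing" enters the query, "retrieval" gains weight, and "java" is removed because its reduced weight reaches zero. "coffee" is ignored since it was never a query term.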

Page 17: Text Indexing and Retrieval

Relevance Feedback Techniques

• Document index terms are modified using query terms, so the changes made affect other users. Document modification uses the following rules, based on relevance feedback:

- Terms in the query but not in the user-judged relevant document are added to the document index list with an initial weight.

- Weights of index terms that are in both the query and the relevant document are increased by a certain amount.

- Weights of index terms that are in the relevant document but not in the query are decreased by a certain amount.
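The three document-modification rules map onto three branches. This is a sketch under assumed constants: the slides only say "an initial weight" and "a certain amount", so `initial` and `delta` are hypothetical values.

```python
def modify_document_index(doc_index, query_terms, initial=0.1, delta=0.05):
    """Update the index (term -> weight) of a document the user judged relevant.

    initial and delta are hypothetical constants, not values from the slides.
    """
    updated = {}
    for term, weight in doc_index.items():
        if term in query_terms:
            # Term is in both the query and the document: reinforce it.
            updated[term] = weight + delta
        else:
            # Term is in the document only: weaken it, never below zero.
            updated[term] = max(weight - delta, 0.0)
    for term in query_terms:
        if term not in doc_index:
            # Term is in the query only: add it with the initial weight.
            updated[term] = initial
    return updated

doc_index = {"retrieval": 0.5, "music": 0.3}
new_index = modify_document_index(doc_index, {"retrieval", "indexing"})
print(sorted(new_index))  # ['indexing', 'music', 'retrieval']
```

Because the stored index itself changes, later queries from other users see the adjusted weights, unlike query modification, which affects only the current search.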

Page 18: Text Indexing and Retrieval

Traditional IR Method Issues

• Individual words do not contain all the information encoded in language.

• One word may have multiple meanings.

• A number of words may have a similar meaning.

• Phrases have meanings beyond the sum of individual words.

Page 19: Text Indexing and Retrieval

Ways to Improve IR Performance

• Natural Language Processing

• Knowledge-based IR Model.

Page 20: Text Indexing and Retrieval

Performance Measurement

• Information retrieval performance is measured using three parameters:

- Retrieval speed.

- Recall.

- Precision.
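Recall and precision follow their standard definitions: recall is the fraction of all relevant items that were retrieved, and precision is the fraction of retrieved items that are relevant. A minimal sketch with hypothetical document ids:

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical outcome: 4 items retrieved, 5 relevant items exist, 2 overlap.
recall, precision = recall_precision(
    retrieved=["D1", "D2", "D3", "D4"],
    relevant=["D1", "D3", "D5", "D6", "D7"],
)
print(recall)     # 0.4
print(precision)  # 0.5
```

The two measures pull against each other: retrieving more items tends to raise recall and lower precision, so they are usually reported together.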

Page 21: Text Indexing and Retrieval

Performance Comparison Among Different IR Techniques

• Automatic indexing is as good as manual indexing.

• Retrieval performance of partial match techniques is better than that of exact match techniques.

• The use of relevance feedback improves retrieval performance.

• Significant user input produces higher retrieval performance than no or limited user input.

• The use of domain knowledge and user profiles significantly improves retrieval performance.

Page 22: Text Indexing and Retrieval

World Wide Web

The WWW is a collection of interlinked documents distributed around the world.

Page 23: Text Indexing and Retrieval

Introduction to WWW

• Hypertext. An approach to information management in which data is stored in a network of nodes connected by computer-supported links.

• Hypermedia. An extension of hypertext in which anchors and nodes can be of any media type, e.g., graphics, audio, and video.

• WWW is the integration of hypermedia and the Internet.

Page 24: Text Indexing and Retrieval

Architecture of WWW

Page 25: Text Indexing and Retrieval

Resource Discovery

• Resource discovery is the process of finding and retrieving information on the Internet.

• Locations of documents on the WWW and the Internet are specified using a Uniform Resource Locator (URL). The general format is protocol://server-name[:port]/document-name.

• There are two ways to find and retrieve documents on the Internet:

- Organizing and browsing.

- Searching.
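The URL format given above can be taken apart with Python's standard `urllib.parse` module. The URL itself is a made-up example.

```python
from urllib.parse import urlsplit

# Hypothetical URL following protocol://server-name[:port]/document-name
url = "http://www.example.com:8080/docs/chapter4.html"
parts = urlsplit(url)

print(parts.scheme)    # protocol      -> http
print(parts.hostname)  # server-name   -> www.example.com
print(parts.port)      # port          -> 8080
print(parts.path)      # document-name -> /docs/chapter4.html

# The port is optional; when it is omitted, urlsplit reports None.
print(urlsplit("http://www.example.com/index.html").port)  # None
```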

Page 26: Text Indexing and Retrieval

Major Differences Between IR Systems and WWW Search Engines

• WWW documents are distributed around the Internet. IR system documents are centrally located.

• The number of WWW documents is much greater than the number of IR system documents.

• WWW documents are more dynamic and heterogeneous.

• WWW documents are structured with HTML; IR system documents are normally plain text.

• WWW search engines are used by more users and more frequently than IR systems.

Page 27: Text Indexing and Retrieval

General Structure of WWW Search Engine

Page 28: Text Indexing and Retrieval

Spider

A spider visits a Web page, reads it, and then follows links to other pages within the site. The spider may return to the site on a regular basis, such as every month or two, to look for changes.
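The visit, read, and follow-links loop can be sketched without any network access by crawling an in-memory site. Everything below is hypothetical: the `SITE` dictionary stands in for real pages, and a real spider would fetch pages over HTTP and parse links out of the HTML.

```python
from collections import deque

# Hypothetical site: page URL -> (page text, links found on that page).
SITE = {
    "/index.html": ("Welcome", ["/a.html", "/b.html"]),
    "/a.html": ("Page A", ["/b.html", "/index.html"]),
    "/b.html": ("Page B", ["/c.html"]),
    "/c.html": ("Page C", []),
    "/orphan.html": ("Never linked", []),
}

def crawl(start):
    """Visit pages breadth-first from start; return {url: text} for each page reached."""
    seen = {start}
    queue = deque([start])
    copies = {}
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        copies[url] = text  # the copy a spider would pass on to the index
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return copies

pages = crawl("/index.html")
print(list(pages))  # ['/index.html', '/a.html', '/b.html', '/c.html']
```

The `seen` set keeps the spider from looping on pages that link back to each other, and `/orphan.html` is never found because no visited page links to it.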

Page 29: Text Indexing and Retrieval

Index

A collection containing a copy of every Web page that the spider finds. If a Web page changes, the index is updated with the new information.

Page 30: Text Indexing and Retrieval

Search Engine

A program that sifts through the millions of pages recorded in the index to find matches to a search and ranks them in order of estimated relevance.

Page 31: Text Indexing and Retrieval

Search Engine Example

Page 32: Text Indexing and Retrieval

Google