text operations j. h. wang feb. 21, 2008. the retrieval process user interface text operations query...

Text Operations

J. H. WangFeb. 21, 2008

The Retrieval ProcessUserInterface

Text Operations

Query Operations

Indexing

Searching

Ranking

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Outline

• Document Preprocessing (7.1-7.2)• Text Compression (7.4-7.5): skipped• Automatic Indexing (Chap. 9, Salton)

– Term Selection

Document Preprocessing

• Lexical analysis– Letters, digits, punctuation marks, …

• Stopword removal– “the”, “of”, …

• Stemming– Prefix, suffix

• Index term selection– Noun

• Construction of term categorization structure– Thesaurus

• Logical view of the documents

structure

Accents,spacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text Index terms

Lexical Analysis

• Converting a stream of characters into a stream of words– Recognition of words– Digits: usually not good index terms

• Ex.: The number of deaths due to car accidents between 1910 and 1989, “510B.C.”, credit card numbers, …

– Hyphens• Ex: state-of-the-art, gilt-edge, B-49, …

– Punctuation marks: normally removed entirely• Ex: 510B.C., program codes: x.id vs. xid, …

– The case of letters: usually not important• Ex: Bank vs. bank, Unix-like operating systems, …

Elimination of Stopwords• Stopwords: words which are too freque

nt among the documents in the collection are not good discriminators– Articles, prepositions, conjunctions, …– Some verbs, adverbs, and adjectives

• To reduce the size of the indexing structure

• Stopword removal might reduce recall– Ex: “to be or not to be”

Stemming

• The substitution of the words by their respective stems– Ex: plurals, gerund forms, past tense suffixes, …

• A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)– Ex: connect, connected, connecting,

connection, connections

• Controversy about the benefits– Useful for improving retrieval performance– Reducing the size of the indexing structure

Stemming

• Four types of stemming strategies– Affix removal, table lookup, successor

variety, and n-grams (or term clustering)

• Suffix removal– Port’s algorithm (available in the

Appendix)• Simplicity and elegance

Index Term Selection

• Manually or automatically• Identification of noun groups

– Most of the semantics is carried by the noun words

– Systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns

– A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

Thesauri

• Thesaurus: a reference to a treasury of words– A precompiled list of important words in

a given domain of knowledge– For each word in this list, a set of related

words• Ex: synonyms, …

– It also involves normalization of vocabulary, and a structure

Example Entry in Peter Roget’s Thesaurus

• Cowardly adjective• Ignobly lacking in courage: cowardly tur

ncoats.• Syns: chicken (slang), chicken-hearted,

craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

Main Purposes of a Thesaurus

• To provide a standard vocabulary for indexing and searching

• To assist users with locating terms for proper query formulation

• To provide classified hierarchies that allow the broadening and narrowing of the current query request according to the user needs

Motivation for Building a Thesaurus

• Using a controlled vocabulary for the indexing and searching– Normalization of indexing concepts– Reduction of noise– Identification of indexing terms with a

clear semantic meaning– Retrieval based on concepts rather than

on words• Ex: term classification hierarchy in

Yahoo!

Main Components of a Thesaurus

• Index terms: individual words, group of words, phrases– Concept

• Ex: “missiles, ballistic”

– Definition or explanation • Ex: seal (marine animals), seal (documents)

• Relationships among the terms– BT (broader), NT (narrower)– RT (related): much difficult

• A layout design for these term relationships– A list or bi-dimensional display

Automatic Indexing(Term Selection)

Automatic Indexing

• Indexing– assign identifiers (index terms) to text

documents

• Identifiers– single-term vs. term phrase– controlled vs. uncontrolled vocabularies

instruction manuals, terminological schedules, …

– objective vs. nonobjective text identifiers cataloging rules control, e.g., author names, publisher names, dates of publications, …

Two Issues

• Issue 1: indexing exhaustivity– exhaustive: assign a large number of terms– Nonexhaustive: only main aspects of subject conte

nt• Issue 2: term specificity

– broad terms (generic)cannot distinguish relevant from nonrelevant documents

– narrow terms (specific)retrieve relatively fewer documents, but most of them are relevant

All docs

Recall vs. Precision

• Recall (R) = Number of relevant documents retrieved / total number of relevant documents in collection– The proportion of relevant items retrieved

• Precision (P) = Number of relevant documents retrieved / total number of documents retrieved– The proportion of items retrieved that are

relevant

• Example: for a query, e.g. TaipeiRetrieveddocs

Relevantdocs

text operations j. h. wang feb. 21, 2008. the retrieval process user interface text operations query...

Documents

text document representation & indexing ---- vector space...

latent topic-semantic indexing based automatic text...

improving text classification using local latent semantic...

full text indexing - theoretical computer science text...

vakhitov alexander approximate text indexing

query-driven indexing for scalable p2p text retrieval

interactive handwritten text recognition and indexing of...

taxonomies for text analytics and auto-indexing

basic text processing and indexing

more on indexing and text...

semi-automatic indexing of full text biomedical articles

jason - texture, text and context vs indexing

g.skobeltsyn | query-driven indexing for p2p text retrieval...

query operations j. h. wang mar. 26, 2008. the retrieval...

detection and extraction of artificial text for semantic...

concept-based indexing in text information retrieval

multidim indexing operations

text document classification: an approach based on indexing

performing indexing and full-text searching lesson 21

more on indexing and text operations - sharif