
Page 1: Chapter 2 Modeling

Hsin-Hsi Chen 1

Chapter 2 Modeling

Hsin-Hsi Chen

Department of Computer Science and Information Engineering

National Taiwan University

Page 2: Chapter 2 Modeling

Hsin-Hsi Chen 2

Indexing

Page 3: Chapter 2 Modeling

Hsin-Hsi Chen 3

Indexing

• indexing: assign identifiers to text items
• assign: manual vs. automatic indexing
• identifiers:

– objective vs. nonobjective text identifiers: objective identifiers are defined by cataloging rules, e.g., author names, publisher names, dates of publication, …

– controlled vs. uncontrolled vocabularies: e.g., instruction manuals, terminological schedules, …

– single terms vs. term phrases

Page 4: Chapter 2 Modeling

Hsin-Hsi Chen 4

Two Issues

• Issue 1: indexing exhaustivity
– exhaustive: assign a large number of terms
– non-exhaustive: assign only a few terms

• Issue 2: term specificity
– broad terms (generic): cannot distinguish relevant from non-relevant items
– narrow terms (specific): retrieve relatively fewer items, but most of them are relevant

Page 5: Chapter 2 Modeling

Hsin-Hsi Chen 5

Parameters of retrieval effectiveness

• Recall

R = (number of relevant items retrieved) / (total number of relevant items in collection)

• Precision

P = (number of relevant items retrieved) / (total number of items retrieved)

• Goal: high recall and high precision

Page 6: Chapter 2 Modeling

Hsin-Hsi Chen 6

                    Relevant items    Non-relevant items
Retrieved part           a                   b
Not retrieved            d                   c

Precision = a / (a + b)        Recall = a / (a + d)

Page 7: Chapter 2 Modeling

Hsin-Hsi Chen 7

A Joint Measure

• F-score

F = ((β² + 1) · P · R) / (β² · P + R)

β is a parameter that encodes the relative importance of recall and precision.

β = 1: equal weight; β < 1: precision is more important; β > 1: recall is more important (a small computational sketch follows)
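A minimal sketch in Python (not part of the original slides): precision, recall, and the F-score computed from a set of retrieved document ids and a set of relevant ones; the helper name and sample ids are illustrative.

def precision_recall_f(retrieved, relevant, beta=1.0):
    a = len(retrieved & relevant)                 # relevant items retrieved
    p = a / len(retrieved) if retrieved else 0.0  # precision
    r = a / len(relevant) if relevant else 0.0    # recall
    # F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 weighs P and R equally
    f = (beta**2 + 1) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    return p, r, f

# 3 of 5 retrieved items are relevant; the collection holds 6 relevant items
print(precision_recall_f({1, 2, 3, 4, 5}, {1, 2, 3, 7, 8, 9}))  # (0.6, 0.5, 0.545...)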

Page 8: Chapter 2 Modeling

Hsin-Hsi Chen 8

Choices of Recall and Precision

• Both recall and precision vary from 0 to 1.

• In principle, the average user wants to achieve both high recall and high precision.

• In practice, a compromise must be reached because simultaneously optimizing recall and precision is not normally achievable.

Page 9: Chapter 2 Modeling

Hsin-Hsi Chen 9

Choices of Recall and Precision (Continued)

• Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall.

• In many circumstances, recall and precision both falling between 0.5 and 0.6 is more satisfactory for the average user.

Page 10: Chapter 2 Modeling

Hsin-Hsi Chen 10

Term-Frequency Consideration

• Function words
– for example, "and", "or", "of", "but", …
– the frequencies of these words are high in all texts

• Content words
– words that actually relate to document content
– varying frequencies in the different texts of a collection
– indicate term importance for content

Page 11: Chapter 2 Modeling

Hsin-Hsi Chen 11

A Frequency-Based Indexing Method

• Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high frequency function words.

• Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.

• Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T (sketched below).
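A sketch of this three-step method, assuming whitespace tokenization; the stop list and the threshold T = 1 below are illustrative choices, not values prescribed by the slides.

from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "for", "are"}

def index_documents(texts, T=1):
    """Assign to each document the terms T_j whose tf_ij exceeds T."""
    index = []
    for text in texts:
        # step 1: drop function words; step 2: count term frequencies
        tf = Counter(w for w in text.lower().split() if w not in STOP_LIST)
        # step 3: keep only terms above the threshold frequency
        index.append({term for term, freq in tf.items() if freq > T})
    return index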

Page 12: Chapter 2 Modeling

Hsin-Hsi Chen 12

Discussions

• high-frequency terms: favor recall

• high precision: the ability to distinguish individual documents from each other

• a high-frequency term is also good for precision when its frequency is not equally high in all documents

Page 13: Chapter 2 Modeling

Hsin-Hsi Chen 13

Inverse Document Frequency

• Inverse Document Frequency (IDF) for term Tj:

idfj = log(N / dfj)

where dfj (the document frequency of term Tj) is the number of documents in which Tj occurs, and N is the collection size.

– Terms that fulfil both recall and precision occur frequently in individual documents but rarely in the remainder of the collection.

Page 14: Chapter 2 Modeling

Hsin-Hsi Chen 14

New Term Importance Indicator

• weight wij of a term Tj in a document ti

• Eliminating common function words

• Computing the value of wij for each term Tj in each document Di

• Assigning to the documents of a collection all terms with sufficiently high (tf x idf) factors

w tfN

dfij ij

j

log
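A sketch of the (tf × idf) indicator under the same definitions; documents are assumed to be pre-tokenized lists of terms.

import math
from collections import Counter

def tf_idf_weights(docs):
    """Return, per document, {term: tf_ij * log(N / df_j)}."""
    N = len(docs)
    df = Counter()                      # df_j: number of documents with T_j
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights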

Page 15: Chapter 2 Modeling

Hsin-Hsi Chen 15

Term-discrimination Value

• Useful index terms: distinguish the documents of a collection from each other

• Document Space
– two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together
– when a high-frequency term without discrimination is assigned, it will increase the document space density

Page 16: Chapter 2 Modeling

Hsin-Hsi Chen 16

A Virtual Document Space

[Figure: the original state of the document space; after assignment of a good discriminator, the documents move apart; after assignment of a poor discriminator, they move closer together.]

Page 17: Chapter 2 Modeling

Hsin-Hsi Chen 17

Good Term Assignment

• When a term is assigned to the documents of a collection, the few items (i.e., documents) to which the term is assigned will be distinguished from the rest of the collection.

• This should increase the average distance between the items in the collection and hence produce a document space less dense than before.

Page 18: Chapter 2 Modeling

Hsin-Hsi Chen 18

Poor Term Assignment

• A high frequency term is assigned that does not discriminate between the items (i.e., documents) of a collection.

• Its assignment will render the documents more similar.

• This is reflected in an increase in document space density.

Page 19: Chapter 2 Modeling

Hsin-Hsi Chen 19

Term Discrimination Value

• definition: dvj = Q − Qj

where Q and Qj are the space densities before and after the assignment of term Tj:

Q = (1 / (N(N − 1))) Σ i=1..N Σ k=1..N, k≠i sim(Di, Dk)

• dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term (sketched below).
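A direct sketch of the definition; the pairwise similarity function sim is assumed given (e.g., a cosine measure over {term: weight} vectors), and the "before assignment" state is simulated by deleting Tj from every vector.

def space_density(docs, sim):
    """Q: average pairwise similarity over all ordered pairs (i != k)."""
    N = len(docs)
    total = sum(sim(docs[i], docs[k])
                for i in range(N) for k in range(N) if i != k)
    return total / (N * (N - 1))

def discrimination_value(docs, term, sim):
    before = [{t: w for t, w in d.items() if t != term} for d in docs]
    Q  = space_density(before, sim)   # density before assigning the term
    Qj = space_density(docs, sim)     # density after (vectors as given)
    return Q - Qj                     # > 0: good term; < 0: poor term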

Page 20: Chapter 2 Modeling

Hsin-Hsi Chen 20

Variations of Term-Discrimination Value with Document Frequency

[Figure: as document frequency grows from low toward N. Low frequency: dvj = 0 (thesaurus transformation applies); medium frequency: dvj > 0; high frequency: dvj < 0 (phrase transformation applies).]

Page 21: Chapter 2 Modeling

Hsin-Hsi Chen 21

Another Term Weighting

• wij = tfij × dvj

• compared with wij = tfij × log(N / dfj):

– log(N / dfj) decreases steadily with increasing document frequency

– dvj increases from zero to positive values as the document frequency of the term increases, then decreases sharply (becoming negative) as the document frequency grows still larger.

Page 22: Chapter 2 Modeling

Hsin-Hsi Chen 22

Term Relationships in Indexing

• Single-term indexing
– Single terms are often ambiguous.
– Many single terms are either too specific or too broad to be useful.

• Complex text identifiers
– subject experts and trained indexers
– linguistic analysis algorithms, e.g., NP chunker
– term-grouping or term clustering methods

Page 23: Chapter 2 Modeling

Hsin-Hsi Chen 23

Term Classification (Clustering)

The term-document matrix: entry dij records the assignment of term Tj to document Di.

        T1    T2    T3   …   Tt
D1     d11   d12   d13  …   d1t
D2     d21   d22   d23  …   d2t
…
Dn     dn1   dn2   dn3  …   dnt

Page 24: Chapter 2 Modeling

Hsin-Hsi Chen 24

Term Classification (Clustering)

• Column part: group terms whose corresponding column representations reveal similar assignments to the documents of the collection.

• Row part: group documents that exhibit sufficiently similar term assignments.

Page 25: Chapter 2 Modeling

Hsin-Hsi Chen 25

Linguistic Methodologies

• Indexing phrases: nominal constructions including adjectives and nouns
– Assign syntactic class indicators (i.e., parts of speech) to the words occurring in document texts.
– Construct word phrases from sequences of words exhibiting certain allowed syntactic markers (noun-noun and adjective-noun sequences).

Page 26: Chapter 2 Modeling

Hsin-Hsi Chen 26

Term-Phrase Formation

• Term Phrase: a sequence of related text words carries a more specific meaning than the single terms, e.g., "computer science" vs. "computer"

[Figure: term-discrimination value vs. document frequency, as before; phrase transformation applies to high-frequency terms with dvj < 0.]

Page 27: Chapter 2 Modeling

Hsin-Hsi Chen 27

Simple Phrase-Formation Process

• the principal phrase component (phrase head): a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value

• the other components of the phrase: medium- or low-frequency terms with stated co-occurrence relationships with the phrase head

• common function words: not used in the phrase-formation process (a sketch follows below)
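A rough sketch of the process under simplifying assumptions of this illustration: adjacency of content words stands in for the "stated co-occurrence relationship", and a plain document-frequency threshold stands in for "high document frequency or a negative discriminator value".

def form_phrases(tokens, df, stop_words, head_df_threshold):
    """Pair adjacent content words when at least one qualifies as a head."""
    content = [t for t in tokens if t not in stop_words]
    return [(w1, w2) for w1, w2 in zip(content, content[1:])
            if df.get(w1, 0) >= head_df_threshold
            or df.get(w2, 0) >= head_df_threshold]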

Page 28: Chapter 2 Modeling

Hsin-Hsi Chen 28

An Example

• Effective retrieval systems are essential for people in need of information.
– "are", "for", "in" and "of": common function words
– "systems", "people", and "information": phrase heads

Page 29: Chapter 2 Modeling

Hsin-Hsi Chen 29

The Formatted Term-Phrases

Content words: effective retrieval systems essential people need information

Phrase heads and components must be adjacent:
1. retrieval systems*
2. systems essential
3. essential people
4. people need
5. need information*

Phrase heads and components co-occur in the sentence:
6. effective systems
7. systems need
8. effective people
9. retrieval people
10. effective information*
11. retrieval information*
12. essential information*

*: phrases assumed to be useful for content identification: 2/5 of the adjacent phrases, and 5/12 of all twelve sentence co-occurrence phrases.

Page 30: Chapter 2 Modeling

Hsin-Hsi Chen 30

The Problems

• A phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases.

• Additional syntactic criteria for phrase heads and phrase components may provide further control in phrase formation.

Page 31: Chapter 2 Modeling

Hsin-Hsi Chen 31

Additional Term-Phrase Formation Steps

• Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun, noun-noun, adverb-adjective, and adverb-noun sequences.

• The phrase elements are all chosen from within the same syntactic unit, such as the subject phrase, object phrase, or verb phrase.

Page 32: Chapter 2 Modeling

Hsin-Hsi Chen 32

Consider Syntactic Unit

• effective retrieval systems are essential for people in need of information

• subject phrase
– effective retrieval systems

• verb phrase
– are essential

• object phrase
– people in need of information

Page 33: Chapter 2 Modeling

Hsin-Hsi Chen 33

Phrases within Syntactic Components

• Adjacent phrase heads and components within syntactic components
– retrieval systems*
– people need
– need information*

• Phrase heads and components co-occur within syntactic components
– effective systems

[subj effective retrieval systems] [vp are essential] for [obj people need information]

2/3 of the adjacent phrases are useful for content identification.

Page 34: Chapter 2 Modeling

Hsin-Hsi Chen 34

Problems

• More stringent phrase formation criteria produce fewer phrases, both good and bad, than less stringent methodologies.

• Prepositional phrase attachment, e.g., "The man saw the girl with the telescope."

• Anaphora resolution, e.g., "He dropped the plate on his foot and broke it."

Page 35: Chapter 2 Modeling

Hsin-Hsi Chen 35

Problems (Continued)

• Any phrase matching system must be able to deal with the problems of
– synonym recognition
– differing word orders
– intervening extraneous words

• Example
– retrieval of information vs. information retrieval

Page 36: Chapter 2 Modeling

Hsin-Hsi Chen 36

Equivalent Phrase Formulation

• Base form: text analysis system

• Variants:
– system analyzes the text
– text is analyzed by the system
– system carries out text analysis
– text is subjected to system analysis

• Related term substitution
– text: documents, information items
– analysis: processing, transformation, manipulation
– system: program, process

Page 37: Chapter 2 Modeling

Hsin-Hsi Chen 37

Thesaurus-Group Generation

• Thesaurus transformation
– broadens index terms whose scope is too narrow to be useful in retrieval
– a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators

[Figure: term-discrimination value vs. document frequency, as before; thesaurus transformation applies to low-frequency terms with dvj = 0.]

Page 38: Chapter 2 Modeling

Hsin-Hsi Chen 38

Sample Classes of Roget’s Thesaurus

Class indicator    Entries
760                permission, leave, sanction, allowance, tolerance, authorization
761                prohibition, veto, disallowance, injunction, ban, taboo
762                consent, acquiescence, compliance, agreement, acceptance
763                offer, presentation, tender, overture, advance, submission, proposal, proposition, invitation
764                refusal, declining, noncompliance, rejection, denial

Page 39: Chapter 2 Modeling

Hsin-Hsi Chen 39

同義詞詞林 (Tongyici Cilin, a Chinese thesaurus)

• 12 large categories

• 94 middle categories

• 1,428 small categories

• 3,925 word clusters

Page 40: Chapter 2 Modeling

Hsin-Hsi Chen 40

A People
  Aa collective name: 01 human being, the people, everybody; 02 I, we; 03 you; 04 he/she, they; 05 myself, others, someone; 06 who
  Ab people of all ages and both sexes: 01 a man, a woman, men and women; 02 an old person, an adult, the old and the young; 03 a teenager; 04 an infant, a child
  Ac posture: 01 a tall person, a dwarf; 02 a fat person, a thin person; 03 a beautiful woman, a handsome man

Page 41: Chapter 2 Modeling

Hsin-Hsi Chen 41

A. PERSON (人): Aa. general name (泛稱), Ab. people of all ages and both sexes (男女老少), Ac. posture (體態), Ad. nationality/citizenship (籍屬), Ae. occupation (職業), Af. identity (身分), Ag. situation (狀況), Ah. relative/family dependents (親人/眷屬), Ai. rank in the family (輩次), Aj. relationship (關係), Ak. morality (品行), Al. ability and insight (才識), Am. religion (信仰), An. comic/clown type (丑類)

B. THING (物): Ba. general name (統稱), Bb. figurative thing (擬狀物), Bc. part of an object (物體的部分), Bd. celestial body (天體), Be. terrain features (地貌), Bf. meteorological phenomena (氣象), Bg. natural substance (自然物), Bh. plant (植物), Bi. animal (動物), Bj. micro-organism (微生物), Bk. the whole body (全身), Bl. secretions/excretions (排泄物/分泌物), Bm. material (材料), Bn. building (建築物), Bo. machines and tools (機具), Bp. appliances (用品), Bq. clothing (衣物), Br. edibles/medicines/drugs (食品/藥物/毒品)

C. TIME AND SPACE (時間與空間): Ca. time (時間), Cb. space (空間)

D. ABSTRACT THINGS (抽象事物): Da. event/circumstances (事情/情況), Db. reason/logic (事理), Dc. looks (外貌), Dd. functions/properties (性能), De. character/ability (性格/才能), Df. consciousness (意識), Dg. analogical thing (比喻物), Dh. imaginary things (臆想物), Di. society/politics (社會/政法), Dj. economy (經濟), Dk. culture and education (文教), Dl. disease (疾病), Dm. organization (機構), Dn. quantity/unit (數量/單位)

E. CHARACTERISTICS (特徵): Ea. external form (外形), Eb. surface looks/seeming (表象), Ec. color/taste (顏色/味道), Ed. property (性質), Ee. virtue and ability (德才), Ef. circumstances (境況)

F. MOTION (動作): Fa. motion of upper limbs (hands) (上肢動作), Fb. motion of lower limbs (legs) (下肢動作), Fc. motion of head (頭部動作), Fd. motion of the whole body (全身動作)

G. PSYCHOLOGICAL ACTIVITY (心理活動): Ga. state of mind (心理狀態), Gb. activity of mind (心理活動), Gc. capability and willingness (能/願)

H. ACTIVITY (活動): Ha. political activity (政治活動), Hb. military activity (軍事活動), Hc. administrative management (行政管理), Hd. production (生產), He. economic activity (經濟活動), Hf. communications and transportation (交通運輸), Hg. education, hygiene, and scientific research (教衛科研), Hh. recreational and sports activities (文體活動), Hi. social contact (社交), Hj. life (生活), Hk. religious activity (宗教活動), Hl. superstitious activity (迷信活動), Hm. public security and judicature (公安/司法), Hn. wicked behavior (惡行)

I. PHENOMENON AND CONDITION (現象與狀態): Ia. natural phenomena (自然現象), Ib. physiological phenomena (生理現象), Ic. facial expression (表情), Id. object status (物體狀態), Ie. situation (事態), If. circumstances (mostly unlucky) (境遇), Ig. the beginning and the end (始末), Ih. change (變化)

J. TO BE RELATED (關聯): Ja. association (聯繫), Jb. similarities and dissimilarities (異同), Jc. to operate in coordination (配合), Jd. existence (存在), Je. influence (影響)

K. AUXILIARY PHRASE (助語): Ka. quantitative modifier (疏狀), Kb. preposition (中介), Kc. conjunction (聯接), Kd. auxiliary (輔助), Ke. interjection (呼嘆), Kf. onomatopoeia (擬聲)

L. GREETINGS (敬語)

Page 42: Chapter 2 Modeling

Hsin-Hsi Chen 42

The Indexing Prescription (1)

• Identify the individual words in the document collection.

• Use a stop list to delete the function words from the texts.

• Use a suffix-stripping routine to reduce each remaining word to word-stem form.

• For each remaining word stem Tj in document Di, compute wij.

• Represent each document Di by Di = (T1, wi1; T2, wi2; …; Tt, wit)

Page 43: Chapter 2 Modeling

Hsin-Hsi Chen 43

Word Stemming

• effectiveness --> effective --> effect

• picnicking --> picnic

• king -/-> k (the ending of "king" is not a suffix and must not be stripped)

Page 44: Chapter 2 Modeling

Hsin-Hsi Chen 44

Some Morphological Rules

• Restore a silent e after suffix removal from certain words to produce “hope” from “hoping” rather than “hop”

• Delete certain doubled consonants after suffix removal, so as to generate “hop” from “hopping” rather than “hopp”.

• Use a final y for an i in forms such as “easier”, so as to generate “easy” instead of “easi”.
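A sketch of these three repairs wrapped around a naive suffix stripper; the silent-e word list is a stand-in for the exception dictionary a real stemmer would carry.

SILENT_E_STEMS = {"hop", "mak", "writ"}          # illustrative exceptions

def strip_ing(word):
    if not word.endswith("ing") or len(word) <= 4:
        return word                              # "king": "-ing" is not a suffix
    stem = word[:-3]
    if stem.endswith(("pp", "tt", "nn")):        # hopping -> hop, not hopp
        return stem[:-1]
    if stem in SILENT_E_STEMS:                   # hoping -> hope, not hop
        return stem + "e"
    return stem

def strip_er(word):
    return word[:-3] + "y" if word.endswith("ier") else word  # easier -> easy

print(strip_ing("hoping"), strip_ing("hopping"), strip_ing("king"), strip_er("easier"))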

Page 45: Chapter 2 Modeling

Hsin-Hsi Chen 45

The Indexing Prescription (2)

• Identify individual text words.
• Use a stop list to delete common function words.
• Use automatic suffix stripping to produce word stems.
• Compute the term-discrimination value for all word stems.
• Use thesaurus class replacement for all low-frequency terms with discrimination values near zero.
• Use the phrase-formation process for all high-frequency terms with negative discrimination values.
• Compute weighting factors for complex indexing units.
• Assign to each document single terms, term phrases, and thesaurus classes with weights.

Page 46: Chapter 2 Modeling

Hsin-Hsi Chen 46

Query vs. Document

• Differences
– Query texts are short.
– Fewer terms are assigned to queries.
– The frequency of a term in a query rarely exceeds 1.

Q = (wq1, wq2, …, wqt), where wqj: inverse document frequency
Di = (di1, di2, …, dit), where dij: term frequency × inverse document frequency

sim(Q, Di) = Σ j=1..t wqj · dij

Page 47: Chapter 2 Modeling

Hsin-Hsi Chen 47

Query vs. Document

• When non-normalized documents are used, longer documents with more assigned terms have a greater chance of matching particular query terms than do the shorter document vectors. Normalized variants (sketched below):

sim(Q, Di) = (Σ j=1..t wqj · dij) / sqrt(Σ j=1..t dij²)

or

sim(Q, Di) = (Σ j=1..t wqj · dij) / (sqrt(Σ j=1..t dij²) · sqrt(Σ j=1..t wqj²))
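A sketch of both variants for sparse vectors stored as {term: weight} dictionaries; normalize_query=False gives the first (document-only) normalization.

import math

def sim(q, d, normalize_query=True):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values())) if normalize_query else 1.0
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0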

Page 48: Chapter 2 Modeling

Hsin-Hsi Chen 48

Relevance Feedback

• Terms present in previously retrieved documents that have been identified as relevant to the user’s query are added to the original formulations.

• The weights of the original query terms are altered by replacing the inverse document frequency portion of the weights with term-relevance weights obtained by using the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.

Page 49: Chapter 2 Modeling

Hsin-Hsi Chen 49

Relevance Feedback

• Q = (wq1, wq2, ..., wqt)
• Di = (di1, di2, ..., dit)
• The new query may take the form

Q' = (wq1, wq2, ..., wqt, w'q,t+1, w'q,t+2, ..., w'q,t+m)

• The weights of the newly added terms Tt+1 to Tt+m may consist of a combined term-frequency and term-relevance weight.

Page 50: Chapter 2 Modeling

Hsin-Hsi Chen 50

Final Indexing

• Identify individual text words.
• Use a stop list to delete common words.
• Use suffix stripping to produce word stems.
• Replace low-frequency terms with thesaurus classes.
• Replace high-frequency terms with phrases.
• Compute term weights for all single terms, phrases, and thesaurus classes.
• Compare query statements with document vectors.
• Identify some retrieved documents as relevant and some as nonrelevant to the query.

Page 51: Chapter 2 Modeling

Hsin-Hsi Chen 51

Final Indexing

• Compute term-relevance factors based on available relevance assessments.

• Construct new queries with added terms from relevant documents and term weights based on combined frequency and term-relevance weight.

• Return to step (7): compare query statements with document vectors, and repeat.

Page 52: Chapter 2 Modeling

Hsin-Hsi Chen 52

Summary of expected effectiveness of automatic indexing (Salton, 1989)

• Basic single-term automatic indexing: baseline
• Use of a thesaurus to group related terms in the given topic area: +10% to +20%
• Use of automatically derived term associations obtained from joint term assignments found in sample document collections: 0% to −10%
• Use of automatically derived term phrases obtained by using co-occurring terms found in the texts of sample collections: +5% to +10%
• Use of one iteration of relevance feedback to add new query terms extracted from previously retrieved relevant documents: +30% to +60%

Page 53: Chapter 2 Modeling

Hsin-Hsi Chen 53

Models

Page 54: Chapter 2 Modeling

Hsin-Hsi Chen 54

Ranking

• central problem of IR
– Predict which documents are relevant and which are not

• Ranking
– Establish an ordering of the documents retrieved

• IR models
– Different models provide distinct sets of premises to deal with document relevance

Page 55: Chapter 2 Modeling

Hsin-Hsi Chen 55

Information Retrieval Models

• Classic Models

– Boolean model
• set theoretic
• documents and queries are represented as sets of index terms
• compare Boolean query statements with the term sets used to identify document content

– Vector model
• algebraic model
• documents and queries are represented as vectors in a t-dimensional space
• compute global similarities between queries and documents

– Probabilistic model
• probabilistic
• documents and queries are represented on the basis of probability theory
• compute the relevance probabilities for the documents of a collection

Page 56: Chapter 2 Modeling

Hsin-Hsi Chen 56

Information Retrieval Models (Continued)

• Structured Models
– reference to the structure present in written text
– non-overlapping list model
– proximal nodes model

• Browsing
– flat
– structure guided
– hypertext

Page 57: Chapter 2 Modeling

Hsin-Hsi Chen 57

Taxonomy of Information Retrieval Models

USER TASK

• Retrieval: ad hoc, filtering
– Classic models: Boolean, vector, probabilistic
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural network
• Probabilistic: inference network, belief network
– Structured models: non-overlapping lists, proximal nodes

• Browsing: flat, structure guided, hypertext

Page 58: Chapter 2 Modeling

Hsin-Hsi Chen 58

Issues of a retrieval system

• Models– boolean– vector– probabilistic

• Logical views of documents– full text– set of index terms

• User task– retrieval– browsing

Page 59: Chapter 2 Modeling

Hsin-Hsi Chen 59

Combinations of these issues

USER TASK × LOGICAL VIEW OF DOCUMENTS

                 Index Terms                 Full Text                   Full Text + Structure
Retrieval        classic, set theoretic,     classic, set theoretic,     structured
                 algebraic, probabilistic    algebraic, probabilistic
Browsing         flat                        flat, hypertext             structure guided, hypertext

Page 60: Chapter 2 Modeling

Hsin-Hsi Chen 60

Retrieval: Ad hoc and Filtering

• Ad hoc retrieval
– Documents remain relatively static while new queries are submitted

• Filtering
– Queries remain relatively static while new documents come into the system
• e.g., news-wire services for the stock market
– A user profile describes the user's preferences
• The filtering task indicates to the user which documents might be of interest
• Which ones are really relevant is fully reserved to the user
– Routing: a variation of filtering
• Rank the filtered documents and show this ranking to the user

Page 61: Chapter 2 Modeling

Hsin-Hsi Chen 61

User profile

• Simplistic approach
– The profile is described through a set of keywords
– The user provides the necessary keywords

• Elaborate approach
– Collect information from the user
– initial profile + relevance feedback (relevant information and nonrelevant information)

Page 62: Chapter 2 Modeling

Hsin-Hsi Chen 62

Formal Definition of IR Models

• An IR model is a quadruple [D, Q, F, R(qi, dj)]
– D: a set composed of logical views (or representations) for the documents in the collection
– Q: a set composed of logical views (or representations) for the user information needs (queries)
– F: a framework for modeling document representations, queries, and their relationships
– R(qi, dj): a ranking function which associates a real number with each pair qi ∈ Q and dj ∈ D

Page 63: Chapter 2 Modeling

Hsin-Hsi Chen 63

Formal Definition of IR Models (continued)

• classic Boolean model
– set of documents
– standard operations on sets

• classic vector model
– t-dimensional vector space
– standard linear algebra operations on vectors

• classic probabilistic model
– sets
– standard probabilistic operations and Bayes' theorem

Page 64: Chapter 2 Modeling

Hsin-Hsi Chen 64

Basic Concepts of Classic IR

• index terms (usually nouns): index and summarize
• weight of index terms
• Definition
– K = {k1, …, kt}: the set of all index terms
– wi,j: the weight of index term ki in document dj
– dj = (w1,j, w2,j, …, wt,j): the index term vector for document dj
– gi(dj) = wi,j

• assumption: index term weights are mutually independent

wi,j associated with (ki, dj) tells us nothing about wi+1,j associated with (ki+1, dj); a counterexample: the terms "computer" and "network" in the area of computer networks

Page 65: Chapter 2 Modeling

Hsin-Hsi Chen 65

Boolean Model

• The index term weight variables are all binary, i.e., wi,j ∈ {0,1}

• A query q is a Boolean expression (and, or, not)

• qdnf: the disjunctive normal form of q; qcc: any of the conjunctive components of qdnf

• sim(dj,q): similarity of dj to q
– 1: if ∃ qcc ∈ qdnf such that for all ki, gi(dj) = gi(qcc)
– 0: otherwise

• sim(dj,q) = 1 means dj is relevant to q

Page 66: Chapter 2 Modeling

Hsin-Hsi Chen 66

Boolean Model (Continued)

• Example
– q = ka ∧ (kb ∨ ¬kc)

– qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), written as weight triples over (ka, kb, kc)

ka ∧ (kb ∨ ¬kc) = (ka ∧ kb) ∨ (ka ∧ ¬kc)
= (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)

[Figure: Venn diagram over ka, kb, kc marking the regions (1,1,1), (1,1,0), (1,0,0).]

A matching sketch follows.
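A sketch of the matching rule: a document is scored 1 when its binary weight vector coincides with some conjunctive component of qdnf.

def boolean_sim(doc, q_dnf):
    """doc and each component are equal-length 0/1 tuples over (ka, kb, kc)."""
    return int(any(doc == comp for comp in q_dnf))

q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]   # ka and (kb or not kc)
print(boolean_sim((1, 1, 0), q_dnf))        # 1: matches component (1,1,0)
print(boolean_sim((0, 1, 0), q_dnf))        # 0: no component matches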

Page 67: Chapter 2 Modeling

Hsin-Hsi Chen 67

Boolean Model (Continued)

• advantage: simple

• disadvantages
– binary decision (relevant or non-relevant) without a grading scale
– exact match (no partial match)
• e.g., dj = (0,1,0) is non-relevant to q = ka ∧ (kb ∨ ¬kc)
– retrieves too few or too many documents

Page 68: Chapter 2 Modeling

Hsin-Hsi Chen 68

Basic Vector Space Model

• Term vector representation of documents Di = (ai1, ai2, …, ait) and queries Qj = (qj1, qj2, …, qjt)

• t distinct terms are used to characterize content.

• Each term is identified with a term vector Ti.

• The t term vectors are linearly independent.

• Any vector (i.e., document vectors and query vectors) is represented as a linear combination of the t term vectors.

• The rth document Dr can be represented as a document vector, written as

Dr = Σ i=1..t ari Ti

Page 69: Chapter 2 Modeling

Hsin-Hsi Chen 69

Document representation in vector space

[Figure: a document vector in a two-dimensional vector space.]

Page 70: Chapter 2 Modeling

Hsin-Hsi Chen 70

Similarity Measure

• measured by the inner product of two vectors: x · y = |x| |y| cos θ

• document vector: Dr = Σ i=1..t ari Ti; query vector: Qs = Σ j=1..t qsj Tj

• document-query similarity:

Dr · Qs = Σ i,j=1..t ari qsj (Ti · Tj)

(the same inner product also gives document-document similarity)

• how to determine the vector components (i.e., ari, qsj) and the term correlations (i.e., Ti · Tj)?

Page 71: Chapter 2 Modeling

Hsin-Hsi Chen 71

Similarity Measure (Continued)

• vector components: matrix A, where entry aij is the component of document Di along term vector Tj

        T1    T2    T3   …   Tt
D1     a11   a12   a13  …   a1t
D2     a21   a22   a23  …   a2t
…
Dn     an1   an2   an3  …   ant

Page 72: Chapter 2 Modeling

Hsin-Hsi Chen 72

Similarity Measure (Continued)

• If the term correlations Ti · Tj are not available, assume the term vectors are orthonormal:

Ti · Tj = 0 (i ≠ j), Ti · Tj = 1 (i = j), i.e., terms are uncorrelated.

• Similarity measurement between query and document:

sim(Dr, Qs) = Σ j=1..t arj · qsj

• Similarity measurement between documents:

sim(Dr, Ds) = Σ j=1..t arj · asj

Page 73: Chapter 2 Modeling

Hsin-Hsi Chen 73

Sample query-document similarity computation

• D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + 1T3, Q = 0T1 + 0T2 + 2T3

• similarity computations for uncorrelated terms:
sim(D1,Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2,Q) = 3·0 + 7·0 + 1·2 = 2

• D1 is preferred

Page 74: Chapter 2 Modeling

Hsin-Hsi Chen 74

Sample query-document similarity computation (Continued)

• term correlations:

      T1    T2    T3
T1    1     0.5   0
T2    0.5   1    -0.2
T3    0    -0.2   1

• similarity computations for correlated terms (sketched below):

sim(D1,Q) = (2T1 + 3T2 + 5T3) · (0T1 + 0T2 + 2T3)
          = 4(T1·T3) + 6(T2·T3) + 10(T3·T3) = 6·(−0.2) + 10·1 = 8.8

sim(D2,Q) = (3T1 + 7T2 + 1T3) · (0T1 + 0T2 + 2T3)
          = 6(T1·T3) + 14(T2·T3) + 2(T3·T3) = 14·(−0.2) + 2·1 = −0.8

• D1 is preferred
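The same computation in numpy, where the term-correlation matrix generalizes the plain dot product (an illustrative sketch, not part of the slides).

import numpy as np

T = np.array([[1.0, 0.5,  0.0],    # T_i . T_j for the three term vectors
              [0.5, 1.0, -0.2],
              [0.0, -0.2, 1.0]])
D1 = np.array([2.0, 3.0, 5.0])
D2 = np.array([3.0, 7.0, 1.0])
Q  = np.array([0.0, 0.0, 2.0])

sim = lambda d, q: d @ T @ q       # reduces to d . q when T is the identity
print(sim(D1, Q), sim(D2, Q))      # 8.8 and -0.8, as computed above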

Page 75: Chapter 2 Modeling

Hsin-Hsi Chen 75

Vector Model

• wi,j: a positive, non-binary weight for (ki,dj)

• wi,q: a positive, non-binary weight for (ki,q)

• q=(w1,q, w2,q, …, wt,q): a query vector, where t is the total number of index terms in the system

• dj= (w1,j, w2,j, …, wt,j): a document vector

Page 76: Chapter 2 Modeling

Hsin-Hsi Chen 76

Similarity of document dj w.r.t. query q

• The correlation between the vectors dj and q, measured by the cosine of the angle between them:

sim(dj, q) = (dj · q) / (|dj| |q|) = (Σ i=1..t wi,j wi,q) / (sqrt(Σ i=1..t wi,j²) · sqrt(Σ i=1..t wi,q²))

• |q| does not affect the ranking

• |dj| provides a normalization

Page 77: Chapter 2 Modeling

Hsin-Hsi Chen 77

document ranking

• Similarity (i.e., sim(q, dj)) varies from 0 to 1.

• Retrieve the documents with a degree of similarity above a predefined threshold (this allows partial matching).

Page 78: Chapter 2 Modeling

Hsin-Hsi Chen 78

term weighting techniques

• The IR problem: one of clustering
– user query: a specification of a set A of objects
– clustering problem: determine which documents are in the set A (relevant), and which ones are not (non-relevant)

– intra-cluster similarity
• the features that better describe the objects in the set A
• tf factor in the vector model: the raw frequency of a term ki inside a document dj

– inter-cluster dissimilarity
• the features that better distinguish the objects in the set A from the remaining objects in the collection C
• idf factor (inverse document frequency) in the vector model: the inverse of the frequency of a term ki among the documents in the collection

Page 79: Chapter 2 Modeling

Hsin-Hsi Chen 79

Definition of tf

• N: total number of documents in the system

• ni: the number of documents in which the index term ki appears

• freqi,j: the raw frequency of term ki in document dj

• fi,j: the normalized frequency (0~1) of term ki in document dj:

fi,j = freqi,j / maxl freql,j

where the maximum is taken over all terms tl occurring in document dj.

Page 80: Chapter 2 Modeling

Hsin-Hsi Chen 80

Definition of idf and tf-idf scheme

• idfi: inverse document frequency for ki

idfi = log(N / ni)

• wi,j: term weighting by the tf-idf scheme

wi,j = fi,j × log(N / ni)

• query term weight (Salton and Buckley), sketched below:

wi,q = (0.5 + 0.5 · freqi,q / maxl freql,q) × log(N / ni)

where freqi,q is the raw frequency of the term ki in q (a very short document). For example, with freqi,q = 1 and max freq = 2, the document tf formula gives 0.5 while the query formula gives 0.75.
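A sketch of the two weighting formulas; df maps each term to ni, N is the collection size, and inputs are pre-tokenized term lists.

import math
from collections import Counter

def doc_weights(doc_terms, df, N):
    tf = Counter(doc_terms)
    max_f = max(tf.values())
    return {t: (f / max_f) * math.log(N / df[t])        # f_ij * idf_i
            for t, f in tf.items() if t in df}

def query_weights(query_terms, df, N):
    tf = Counter(query_terms)
    max_f = max(tf.values())
    return {t: (0.5 + 0.5 * f / max_f) * math.log(N / df[t])
            for t, f in tf.items() if t in df}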

Page 81: Chapter 2 Modeling

Hsin-Hsi Chen 81

Analysis of vector model

• advantages
– its term-weighting scheme improves retrieval performance
– its partial matching strategy allows retrieval of documents that approximate the query conditions
– its cosine ranking formula sorts the documents according to their degree of similarity to the query

• disadvantage
– index terms are assumed to be mutually independent

Page 82: Chapter 2 Modeling

Hsin-Hsi Chen 82

Probabilistic Model

• Given a query, there is an ideal answer set
– a set of documents which contains exactly the relevant documents and no others

• query process
– a process of specifying the properties of an ideal answer set

• problem: what are the properties?

Page 83: Chapter 2 Modeling

Hsin-Hsi Chen 83

Probabilistic Model (Continued)

• Generate a preliminary probabilistic description of the ideal answer set

• Initiate an interaction with the user
– The user looks at the retrieved documents and decides which ones are relevant and which ones are not
– The system uses this information to refine the description of the ideal answer set
– Repeat the process many times

Page 84: Chapter 2 Modeling

Hsin-Hsi Chen 84

Probabilistic Principle

• Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find dj relevant

• assumptions
– The probability of relevance depends on the query and document representations only
– There is a subset of all documents which the user prefers as the answer set for the query q

• Given a query, the probabilistic model assigns to each document dj a measure of its similarity to the query:

sim(dj, q) = P(dj relevant to q) / P(dj non-relevant to q)

Page 85: Chapter 2 Modeling

Hsin-Hsi Chen 85

Probabilistic Principle

• wi,j ∈ {0,1}, wi,q ∈ {0,1}: the index term weight variables are all binary

• q: a query, which is a subset of index terms

• R: the set of documents known to be relevant

• R̄ (complement of R): the set of non-relevant documents

• P(R|dj): the probability that document dj is relevant to the query q

• P(R̄|dj): the probability that dj is non-relevant to q

Page 86: Chapter 2 Modeling

Hsin-Hsi Chen 86

similarity

• sim(dj,q): the similarity of document dj to the query q

sim(dj,q) = P(R|dj) / P(R̄|dj)    (by definition)

By Bayes' rule, P(X|Y) = P(X) P(Y|X) / P(Y), so

sim(dj,q) = (P(dj|R) P(R)) / (P(dj|R̄) P(R̄))

Since P(R) and P(R̄) are the same for all documents,

sim(dj,q) ~ P(dj|R) / P(dj|R̄)

where P(dj|R) is the probability of randomly selecting document dj from the set R of relevant documents, and P(R) is the probability that a document randomly selected from the entire collection is relevant.

Page 87: Chapter 2 Modeling

Hsin-Hsi Chen 87

Assuming independence of index terms, P(dj|R) factors over the individual terms:

sim(dj,q) ~ P(dj|R) / P(dj|R̄)

= [ Π gi(dj)=1 P(ki|R) · Π gi(dj)=0 P(k̄i|R) ] / [ Π gi(dj)=1 P(ki|R̄) · Π gi(dj)=0 P(k̄i|R̄) ]

where P(ki|R) is the probability that the index term ki is present in a document randomly selected from the set R, and P(k̄i|R) = 1 − P(ki|R) is the probability that it is not (likewise for R̄).

Taking logarithms and discarding factors that are constant for all documents under the same query yields

sim(dj,q) ~ Σ i=1..t gi(q) gi(dj) · log [ (P(ki|R) (1 − P(ki|R̄))) / ((1 − P(ki|R)) P(ki|R̄)) ]

Page 88: Chapter 2 Modeling

Hsin-Hsi Chen 88

Splitting the logarithm gives the final ranking formula:

sim(dj,q) ~ Σ i=1..t gi(q) gi(dj) · [ log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) ]

(the factor dropped along the way is independent of the document, so it does not affect the ranking)

Problem: where is the set R?

Page 89: Chapter 2 Modeling

Hsin-Hsi Chen 89

Initial guess

• P(ki|R) is constant for all index terms ki:

P(ki|R) = 0.5

• The distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection:

P(ki|R̄) = ni / N

(assuming N ≫ |R|, so that N − |R| ≈ N)

Page 90: Chapter 2 Modeling

Hsin-Hsi Chen 90

Initial ranking

• V: a subset of the documents initially retrieved and ranked by the probabilistic model (the top r documents)

• Vi: the subset of V composed of documents which contain the index term ki

• Approximate P(ki|R) by the distribution of the index term ki among the documents retrieved so far:

P(ki|R) = Vi / V

• Approximate P(ki|R̄) by considering that all the non-retrieved documents are non-relevant:

P(ki|R̄) = (ni − Vi) / (N − V)

Page 91: Chapter 2 Modeling

Hsin-Hsi Chen 91

Small values of V and Vi

The estimates P(ki|R) = Vi/V and P(ki|R̄) = (ni − Vi)/(N − V) break down for small V and Vi (e.g., a problem arises when V = 1 and Vi = 0). Two smoothed alternatives (a computational sketch follows):

• alternative 1: add a constant 0.5

P(ki|R) = (Vi + 0.5) / (V + 1),  P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)

• alternative 2: add the collection ratio ni/N

P(ki|R) = (Vi + ni/N) / (V + 1),  P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
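A sketch of one retrieve-and-re-estimate round: the initial guesses, the log ranking formula derived earlier, then smoothing alternative 1 over the top-r documents; representing documents as term sets and the choice r = 10 are assumptions of this illustration.

import math

def prob_rank(docs, query_terms, n, r=10):
    """docs: list of term sets; n: {term: n_i}; returns ranked doc indexes."""
    N = len(docs)
    pR  = {t: 0.5 for t in query_terms}            # initial P(k_i | R)
    pNR = {t: n[t] / N for t in query_terms}       # initial P(k_i | not-R)
    score = lambda d: sum(math.log(pR[t] / (1 - pR[t])) +
                          math.log((1 - pNR[t]) / pNR[t])
                          for t in query_terms if t in d)
    ranked = sorted(range(N), key=lambda j: score(docs[j]), reverse=True)
    V = ranked[:r]                                 # initially retrieved set
    for t in query_terms:                          # re-estimate (alternative 1)
        Vi = sum(1 for j in V if t in docs[j])
        pR[t]  = (Vi + 0.5) / (len(V) + 1)
        pNR[t] = (n[t] - Vi + 0.5) / (N - len(V) + 1)
    return sorted(range(N), key=lambda j: score(docs[j]), reverse=True)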

Page 92: Chapter 2 Modeling

Hsin-Hsi Chen 92

Analysis of Probabilistic Model

• advantage
– documents are ranked in decreasing order of their probability of being relevant

• disadvantages
– the need to guess the initial separation of documents into relevant and non-relevant sets
– does not consider the frequency with which an index term occurs inside a document
– the independence assumption for index terms

Page 93: Chapter 2 Modeling

Hsin-Hsi Chen 93

Comparison of classic models

• Boolean model: the weakest classic model

• Vector model is expected to outperform the probabilistic model with general collections (Salton and Buckley)

Page 94: Chapter 2 Modeling

Hsin-Hsi Chen 94

Alternative Set Theoretic Models: Fuzzy Set Model

• Model
– a query term: a fuzzy set
– a document: has a degree of membership in this set
– membership function: associates a membership value with each element (document) of the class
• 0: no membership in the set
• 1: full membership
• 0~1: marginal elements of the set

Page 95: Chapter 2 Modeling

Hsin-Hsi Chen 95

Fuzzy Set Theory

• A fuzzy subset A of a universe of discourse U is characterized by a membership function µA: U → [0,1] which associates with each element u of U a number µA(u) in the interval [0,1]:

– complement: µĀ(u) = 1 − µA(u)
– union: µA∪B(u) = max(µA(u), µB(u))
– intersection: µA∩B(u) = min(µA(u), µB(u))

(In IR terms: A is a class, u a document, and U the document collection for a query term.)

Page 96: Chapter 2 Modeling

Hsin-Hsi Chen 96

Examples

• Assume U = {d1, d2, d3, d4, d5, d6}
• Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively.
• Assume A = {d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0} and B = {d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0}
• Ā = {d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1}
• A∪B = {d1:0.8, d2:0.7, d3:0.8, d4:0.9, d5:0, d6:0}
• A∩B = {d1:0, d2:0.6, d3:0.6, d4:0, d5:0, d6:0}

Page 97: Chapter 2 Modeling

Hsin-Hsi Chen 97

Fuzzy Information Retrieval

• basic idea
– Expand the set of index terms in the query with related terms (from a thesaurus) such that additional relevant documents can be retrieved
– A thesaurus can be constructed by defining a term-term correlation matrix c (a keyword connection matrix) whose rows and columns are associated with the index terms in the document collection

Page 98: Chapter 2 Modeling

Hsin-Hsi Chen 98

Fuzzy Information Retrieval (Continued)

• normalized correlation factor ci,l (0~1) between two terms ki and kl:

ci,l = ni,l / (ni + nl − ni,l)

where ni is the number of documents containing term ki, nl the number containing kl, and ni,l the number containing both.

• In the fuzzy set associated with each index term ki, a document dj has a degree of membership µi,j (sketched below):

µi,j = 1 − Π kl∈dj (1 − ci,l)
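A sketch computing the keyword-connection matrix and the memberships from documents given as term sets (math.prod requires Python 3.8+); document terms are assumed to be drawn from the given term list.

import math
from itertools import product

def fuzzy_memberships(docs, terms):
    n = {t: sum(t in d for d in docs) for t in terms}            # n_i
    n2 = {(i, l): sum(i in d and l in d for d in docs)           # n_{i,l}
          for i, l in product(terms, repeat=2)}
    c = {(i, l): n2[i, l] / (n[i] + n[l] - n2[i, l]) if n2[i, l] else 0.0
         for i, l in product(terms, repeat=2)}
    mu = [{i: 1.0 - math.prod(1.0 - c[i, l] for l in d) for i in terms}
          for d in docs]
    return c, mu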

Page 99: Chapter 2 Modeling

Hsin-Hsi Chen 99

Fuzzy Information Retrieval (Continued)

• physical meaning
– A document dj belongs to the fuzzy set associated with the term ki if its own terms are related to ki, i.e., µi,j = 1.
– If there is at least one index term kl of dj which is strongly related to the index ki, then µi,j ≈ 1 and ki is a good fuzzy index for dj.
– When all index terms of dj are only loosely related to ki, µi,j ≈ 0 and ki is not a good fuzzy index for dj.

Page 100: Chapter 2 Modeling

Hsin-Hsi Chen 100

Example

• q = ka ∧ (kb ∨ ¬kc)
= (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)
= cc1 + cc2 + cc3

[Figure: Venn diagram of the fuzzy document sets Da, Db, Dc with the three conjunctive components cc1, cc2, cc3.]

Da: the fuzzy set of documents associated with the index term ka; dj ∈ Da has a degree of membership µa,j greater than a predefined threshold K. D̄a: the fuzzy set of documents associated with the negation ¬ka.

Page 101: Chapter 2 Modeling

Hsin-Hsi Chen 101

Example

Query q = ka ∧ (kb ∨ ¬kc), with disjunctive normal form qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0):

µq,j = µcc1+cc2+cc3,j = 1 − Π i=1..3 (1 − µcci,j)
= 1 − (1 − µa,j µb,j µc,j) · (1 − µa,j µb,j (1 − µc,j)) · (1 − µa,j (1 − µb,j)(1 − µc,j))

(1) the degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (instead of the max function), which behaves more smoothly
(2) the degree of membership in a conjunctive fuzzy set is computed using an algebraic product (instead of the min function)

Recall µĀ(u) = 1 − µA(u).

Page 102: Chapter 2 Modeling

Hsin-Hsi Chen 102

Fuzzy Set Model

– Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

– IDF (select keywords)
• a = in = of = log 3/3 = 0
• arrived = gold = shipment = truck = log 3/2 = 0.176
• damaged = delivery = fire = silver = log 3/1 = 0.477

– 8 keywords (dimensions) are selected
• arrived(1), damaged(2), delivery(3), fire(4), gold(5), silver(6), shipment(7), truck(8)

Page 103: Chapter 2 Modeling

Hsin-Hsi Chen 103

Fuzzy Set Model

Example membership computations for D1 = {shipment, gold, damaged, fire}:

µgold,d1 = 1 − (1 − cgold,shipment)(1 − cgold,gold)(1 − cgold,damaged)(1 − cgold,fire)
= 1 − (1 − 2/(2+2−2))(1 − 2/(2+2−2))(1 − 1/(1+2−1))(1 − 1/(1+2−1))
= 1 − 0 · 0 · (1/2) · (1/2) = 1

µsilver,d1 = 1 − (1 − 0)(1 − 0)(1 − 0)(1 − 0) = 0    (silver co-occurs with no term of D1)

µtruck,d1 = 1 − (1 − ctruck,shipment)(1 − ctruck,gold)(1 − ctruck,damaged)(1 − ctruck,fire)
= 1 − (1 − 1/3)(1 − 1/3)(1 − 0)(1 − 0) = 1 − (2/3)(2/3) = 5/9

Page 104: Chapter 2 Modeling

Hsin-Hsi Chen 104

Fuzzy Set Model

Page 105: Chapter 2 Modeling

Hsin-Hsi Chen 105

Fuzzy Set Model• Sim(q,d): Alternative 1

Sim(q,d3) > Sim(q,d2) > Sim(q,d1)

• Sim(q,d): Alternative 2

Sim(q,d3) > Sim(q,d2) > Sim(q,d1)

Page 106: Chapter 2 Modeling

Hsin-Hsi Chen 106

Alternative Algebraic Model: Generalized Vector Space Model

• independence of index terms
– ki: a vector associated with the index term ki
– the set of vectors {k1, k2, …, kt} is linearly independent

• orthogonality: ki · kj = 0 for i ≠ j

• In the generalized vector space model, the index term vectors are assumed linearly independent but are not pairwise orthogonal.

• The index term vectors, which are not seen as the basis of the space, are composed of smaller components derived from the particular collection.

Page 107: Chapter 2 Modeling

Hsin-Hsi Chen 107

Review

• Two vectors u and v are linearly independent if αu + βv = 0 implies α = β = 0.

• Two vectors u and v are orthogonal (i.e., θ = 90°) if u · v = 0 (i.e., uᵀv = 0).

• If two nonzero vectors u and v are orthogonal, then they are linearly independent:
– assume αu + βv = 0, with u ≠ 0 and v ≠ 0
– uᵀ(αu + βv) = 0 --> α uᵀu + β uᵀv = 0 --> α uᵀu = 0 --> α = 0; by symmetry, β = 0

Page 108: Chapter 2 Modeling

Hsin-Hsi Chen 108

Generalized Vector Space Model

• {k1, k2, …, kt}: index terms in a collection
• wi,j: binary weights associated with the term-document pair (ki, dj)
• The patterns of term co-occurrence (inside documents) can be represented by a set of 2^t minterms:

m1 = (0, 0, …, 0): points to documents containing none of the index terms
m2 = (1, 0, …, 0): points to documents containing the index term k1 only
m3 = (0, 1, …, 0): points to documents containing the index term k2 only
m4 = (1, 1, …, 0): points to documents containing the index terms k1 and k2
…
m2^t = (1, 1, …, 1): points to documents containing all the index terms

• gi(mj): returns the weight {0,1} of the index term ki in the minterm mj (1 ≤ i ≤ t)

Page 109: Chapter 2 Modeling

Hsin-Hsi Chen 109

Generalized Vector Space Model(Continued)

• A vector mi (a 2^t-tuple) is associated with each minterm mi (a t-tuple):

m1 = (1, 0, …, 0, 0)
m2 = (0, 1, …, 0, 0)
…
m2^t = (0, 0, …, 0, 1)

with mi · mj = 0 for i ≠ j (the set of vectors mi is pairwise orthogonal).

• e.g., m4 is associated with the minterm m4, which points to documents containing k1 and k2 and no others.

• co-occurrence of index terms inside documents induces dependencies among index terms.

Page 110: Chapter 2 Modeling

Hsin-Hsi Chen 110

Example (t = 3). Document term assignments:

d1 (k1)        d11 (k1 k2)
d2 (k3)        d12 (k1 k3)
d3 (k3)        d13 (k1 k2)
d4 (k1)        d14 (k1 k2)
d5 (k2)        d15 (k1 k2 k3)
d6 (k2)        d16 (k1 k2)
d7 (k2 k3)     d17 (k1 k2)
d8 (k2 k3)     d18 (k1 k2)
d9 (k2)        d19 (k1 k2 k3)
d10 (k2 k3)    d20 (k1 k2)

minterm mr          mr vector
m1 = (0,0,0)        m1 = (1,0,0,0,0,0,0,0)
m2 = (0,0,1)        m2 = (0,1,0,0,0,0,0,0)
m3 = (0,1,0)        m3 = (0,0,1,0,0,0,0,0)
m4 = (0,1,1)        m4 = (0,0,0,1,0,0,0,0)
m5 = (1,0,0)        m5 = (0,0,0,0,1,0,0,0)
m6 = (1,0,1)        m6 = (0,0,0,0,0,1,0,0)
m7 = (1,1,0)        m7 = (0,0,0,0,0,0,1,0)
m8 = (1,1,1)        m8 = (0,0,0,0,0,0,0,1)

k1 = (c1,5 m5 + c1,6 m6 + c1,7 m7 + c1,8 m8) / sqrt(c1,5² + c1,6² + c1,7² + c1,8²)

c1,5 = w1,1 + w1,4
c1,6 = w1,12
c1,7 = w1,11 + w1,13 + w1,14 + w1,16 + w1,17 + w1,18 + w1,20
c1,8 = w1,15 + w1,19

Page 111: Chapter 2 Modeling

Hsin-Hsi Chen 111

k2 = (c2,3 m3 + c2,4 m4 + c2,7 m7 + c2,8 m8) / sqrt(c2,3² + c2,4² + c2,7² + c2,8²)

c2,3 = w2,5 + w2,6 + w2,9
c2,4 = w2,7 + w2,8 + w2,10
c2,7 = w2,11 + w2,13 + w2,14 + w2,16 + w2,17 + w2,18 + w2,20
c2,8 = w2,15 + w2,19

(document assignments and minterm table as on the previous slide)

Page 112: Chapter 2 Modeling

Hsin-Hsi Chen 112

k3 = (c3,2 m2 + c3,4 m4 + c3,6 m6 + c3,8 m8) / sqrt(c3,2² + c3,4² + c3,6² + c3,8²)

c3,2 = w3,2 + w3,3
c3,4 = w3,7 + w3,8 + w3,10
c3,6 = w3,12
c3,8 = w3,15 + w3,19

(document assignments and minterm table as on the previous slides)

Page 113: Chapter 2 Modeling

Hsin-Hsi Chen 113

Generalized Vector Space Model(Continued)

• Determine the index vector ki associated with the index term ki (sketched below):

ki = ( Σ r | gi(mr)=1 ci,r mr ) / sqrt( Σ r | gi(mr)=1 ci,r² )

ci,r = Σ dj | gl(dj)=gl(mr) for all l  wi,j

– Collect all the vectors mr in which the index term ki is in state 1.
– ci,r sums the weights wi,j over the documents dj whose term-occurrence pattern coincides with minterm mr.

Page 114: Chapter 2 Modeling

Hsin-Hsi Chen 114

Generalized Vector Space Model(Continued)

• ki · kj quantifies a degree of correlation between ki and kj:

ki · kj = Σ r | gi(mr)=1 and gj(mr)=1  ci,r cj,r

• Documents and queries are expanded in the ki basis:

dj = Σ i wi,j ki,  q = Σ i wi,q ki

• The standard cosine similarity is adopted.

Page 115: Chapter 2 Modeling

Hsin-Hsi Chen 115

For the running example:

k1 = (c1,5 m5 + c1,6 m6 + c1,7 m7 + c1,8 m8) / sqrt(c1,5² + c1,6² + c1,7² + c1,8²)
k2 = (c2,3 m3 + c2,4 m4 + c2,7 m7 + c2,8 m8) / sqrt(c2,3² + c2,4² + c2,7² + c2,8²)
k3 = (c3,2 m2 + c3,4 m4 + c3,6 m6 + c3,8 m8) / sqrt(c3,2² + c3,4² + c3,6² + c3,8²)

so that (up to the normalizing factors)

k1 · k2 = c1,7 c2,7 + c1,8 c2,8
k1 · k3 = c1,6 c3,6 + c1,8 c3,8
k2 · k3 = c2,4 c3,4 + c2,8 c3,8

Page 116: Chapter 2 Modeling

Hsin-Hsi Chen 116

Comparison with Standard Vector Space Model

d1 (k1): (w1,1,0,0) d11 (k1 k2): (w1,11,w2,11,0)

d2 (k3): (0,0,w3,2) d12 (k1 k3): (w1,12,0,w3,12)

d3 (k3): (0,0,w3,3) d13 (k1 k2): (w1,13,w2,13,0)

d4 (k1): (w1,4,0,0) d14 (k1 k2): (w1,14,w2,14,0)

d5 (k2): (0,w2,5,0) d15 (k1 k2 k3): (w1,15,w2,15, w3,15)

d6 (k2): (0,w2,6,0) d16 (k1 k2): (w1,16,w2,16,0)

d7 (k2 k3): (0,w2,7,w3,7) d17 (k1 k2): (w1,17,w2,17,0)

d8 (k2 k3): (0,w2,8,w3,8) d18 (k1 k2): (w1,18,w2,18,0)

d9 (k2): (0,w2,9,0) d19 (k1 k2 k3): (w1,19,w2,19, w3,19)

d10 (k2 k3): (0,w2,10,w3,10) d20 (k1 k2): (w1,20,w2,20,0)

Page 117: Chapter 2 Modeling

Hsin-Hsi Chen 117

Generalized Vector Space Model

Page 118: Chapter 2 Modeling

Hsin-Hsi Chen 118

Generalized Vector Space Model

Page 119: Chapter 2 Modeling

Hsin-Hsi Chen 119

Generalized Vector Space Model

Page 120: Chapter 2 Modeling

Hsin-Hsi Chen 120

Vector Space Model

– Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

– 8 dimensions (arrived, damaged, delivery, fire, gold, silver, shipment, truck)

• Weight = TF × IDF

• Q  = (0, 0, 0, 0, .176, .477, 0, .176)
D1 = (0, .477, 0, .477, .176, 0, .176, 0)
D2 = (.176, 0, .477, 0, 0, .954, 0, .176)
D3 = (.176, 0, 0, 0, .176, 0, .176, .176)

Page 121: Chapter 2 Modeling

Hsin-Hsi Chen 121

Construction of Matrix T

[Figure: construction of matrix T from the document vectors d1, d2, d3.]

Page 122: Chapter 2 Modeling

Hsin-Hsi Chen 122

Normalize Matrix K Normalized Direction

Page 123: Chapter 2 Modeling

Hsin-Hsi Chen 123

Construction of Matrix T

Calculate by Yourself

Page 124: Chapter 2 Modeling

Hsin-Hsi Chen 124

Latent Semantic Indexing (LSI) Model

• representation of documents and queries by index terms
– problem 1: many unrelated documents might be included in the answer set
– problem 2: relevant documents which are not indexed by any of the query keywords are not retrieved

• possible solution: concept matching instead of index term matching
– applied in cross-language information retrieval (CLIR)

Page 125: Chapter 2 Modeling

Hsin-Hsi Chen 125

basic idea

• Map each document and query vector into a lower dimensional space which is associated with concepts

• Retrieval in the reduced space may be superior to retrieval in the space of index terms

Page 126: Chapter 2 Modeling

Hsin-Hsi Chen 126

Definition

• t: the number of index terms in the collection

• N: the total number of documents

• M=(Mij): a term-document association matrix with t rows (i.e., term) and N columns (i.e., document)

• Mij: a weight wi,j associated with the term-document pair [ki, dj] (e.g., using tf-idf)

Page 127: Chapter 2 Modeling

Hsin-Hsi Chen 127

Singular Value Decomposition

(1) For a symmetric matrix A ∈ R^{n×n} (Aᵀ = A), there exists an orthogonal matrix Q ∈ R^{n×n} (QᵀQ = QQᵀ = I) such that

A = Q D Qᵀ

where D = diag(λ1, λ2, …, λn) is a diagonal matrix with λ1 ≥ λ2 ≥ … ≥ λn ≥ 0.

Check: Aᵀ = (Q D Qᵀ)ᵀ = Q Dᵀ Qᵀ = Q D Qᵀ = A.

Page 128: Chapter 2 Modeling

Hsin-Hsi Chen 128

(2) For a general matrix A ∈ R^{n×t}, there exist U and V with UᵀU = I and VᵀV = I such that

A = U D Vᵀ  (singular value decomposition)

where D = diag(σ1, σ2, …, σn) is a diagonal matrix with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0. Then, using (AB)ᵀ = BᵀAᵀ:

AAᵀ = (U D Vᵀ)(U D Vᵀ)ᵀ = U D Vᵀ V Dᵀ Uᵀ = U D² Uᵀ

Page 129: Chapter 2 Modeling

Hsin-Hsi Chen 129

Writing A = Q D Qᵀ gives AQ = QD. With Q = [q1, q2, …, qn], where each qi is a column vector:

A [q1, q2, …, qn] = [λ1 q1, λ2 q2, …, λn qn]

i.e., A qk = λk qk for each k: λ1, λ2, …, λn are the eigenvalues of A, and qk is the eigenvector of A corresponding to λk.

Page 130: Chapter 2 Modeling

Hsin-Hsi Chen 130

Singular Value Decomposition

According to the singular value decomposition, the term-document matrix M ∈ R^{t×N} (t rows = terms, N columns = documents) factors as

M = K S Dᵀ

where
• MMᵀ is a t×t term-to-term matrix and MᵀM is an N×N document-to-document matrix;
• K: the matrix of eigenvectors derived from MMᵀ, with KᵀK = I;
• D: the matrix of eigenvectors derived from MᵀM, with DᵀD = I.

Page 131: Chapter 2 Modeling

Hsin-Hsi Chen 131

document-to-document matrix:

MᵀM = (K S Dᵀ)ᵀ (K S Dᵀ) = D S Kᵀ K S Dᵀ = D S² Dᵀ

term-to-term matrix:

M Mᵀ = (K S Dᵀ)(K S Dᵀ)ᵀ = K S Dᵀ D S Kᵀ = K S² Kᵀ

Comparing with A = Q D Qᵀ (Q the matrix of eigenvectors, D the diagonal matrix of eigenvalues), we obtain: K is the matrix of eigenvectors of MMᵀ, and D is the matrix of eigenvectors of MᵀM.

S: an r×r diagonal matrix of singular values, where r = min(t, N); choosing s < r reduces the concept space.

Page 132: Chapter 2 Modeling

Hsin-Hsi Chen 132

Consider only the s largest singular values of S (and the corresponding columns of K and D):

Ms = Ks Ss Dsᵀ   (s ≪ t, s ≪ N)

The resultant matrix Ms is the matrix of rank s which is closest to the original matrix M in the least-squares sense.

s must be large enough to cover all the relevant documents, yet not so coarse that non-relevant ones are drawn in. In terms of concept clustering: too fine - each index term represents a distinct concept; too coarse - all index terms collapse into a single concept.

Page 133: Chapter 2 Modeling

Hsin-Hsi Chen 133

Ranking in LSI

• query: a pseudo-document in the original term-document matrix M
– the query is modeled as the document with number 0
– MsᵀMs gives the ranks of all documents with respect to this query (sketched below)

MsᵀMs = (Ks Ss Dsᵀ)ᵀ (Ks Ss Dsᵀ) = Ds Ss Ksᵀ Ks Ss Dsᵀ = Ds Ss² Dsᵀ = (Ds Ss)(Ds Ss)ᵀ

Element (i,j) quantifies the relationship between documents di and dj; with i = 0 it denotes the similarity between q and the documents.

Page 134: Chapter 2 Modeling

Hsin-Hsi Chen 134

Structured Text Retrieval Models

• Definition
– Combine information on text content with information on the document structure
– e.g., same-page(near('atomic holocaust', Figure(label('earth'))))

• Expressive power vs. evaluation efficiency
– a model based on non-overlapping lists
– a model based on proximal nodes

• Terminology
– match point: position in the text of a sequence of words that matches the user query
– region: a contiguous portion of the text
– node: a structural component of the document (chapter, section, …)

Page 135: Chapter 2 Modeling

Hsin-Hsi Chen 135

Non-Overlapping Lists

• Divide the whole text of each document into non-overlapping text regions, collected in indexing lists:

L0 Chapters: a list of all chapters in the document
L1 Sections: a list of all sections in the document
L2 Subsections: a list of all subsections in the document
L3 Subsubsections: a list of all subsubsections in the document

• Example (character positions): Chapter 1 covers 1-5000; sections 1.1 (1-3000) and 1.2 (3001-5000); subsections 1.1.1 (1-1000), 1.1.2 (1001-3000), 1.2.1 (3001-5000).

• Regions in the same list do not overlap, but text regions from distinct lists might overlap.

Page 136: Chapter 2 Modeling

Hsin-Hsi Chen 136

Non-Overlapping Lists (Continued)

• Data structure
– a single inverted file
– each structural component (e.g., chapter, section, …) stands as an entry
– for each entry, there is a list of text regions as a list of occurrences
– recall that there is another inverted file for the words in the text

• Operations
– Select a region which contains a given word
– Select a region A which does not contain any other region B (where B belongs to a list distinct from the list for A)
– Select a region not contained within any other region
– …

Page 137: Chapter 2 Modeling

Hsin-Hsi Chen 137

Inverted Files

• File is represented as an array of indexed records.

           Term 1   Term 2   Term 3   Term 4
Record 1     1        1        0        1
Record 2     0        1        1        1
Record 3     1        0        1        1
Record 4     0        0        1        1

Page 138: Chapter 2 Modeling

Hsin-Hsi Chen 138

Inverted-file process

• The record-term array is inverted (transposed).

           Record 1   Record 2   Record 3   Record 4
Term 1       1          0          1          0
Term 2       1          1          0          0
Term 3       0          1          1          1
Term 4       1          1          1          1

Page 139: Chapter 2 Modeling

Hsin-Hsi Chen 139

Inverted-file process (Continued)

• Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers (sketched below).

Query (term2 AND term3):
term2: 1 1 0 0
term3: 0 1 1 1
AND:   0 1 0 0  --> R2
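A sketch of the combination step on such bit rows.

def and_rows(row_a, row_b):
    return [a & b for a, b in zip(row_a, row_b)]

term2 = [1, 1, 0, 0]
term3 = [0, 1, 1, 1]
hits = and_rows(term2, term3)                          # [0, 1, 0, 0]
print([f"R{i + 1}" for i, v in enumerate(hits) if v])  # ['R2']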

Page 140: Chapter 2 Modeling

Hsin-Hsi Chen 140

Extensions of Inverted Index Operations (Distance Constraints)

• Distance Constraints– (A within sentence B)

terms A and B must co-occur in a common sentence

– (A adjacent B)terms A and B must occur adjacently in the text

Page 141: Chapter 2 Modeling

Hsin-Hsi Chen 141

Extensions of Inverted Index Operations (Distance Constraints)

• Implementation– include term-location in the inverted indexes

information: {R345, R348, R350, …}retrieval: {R123, R128, R345, …}

– include sentence-location in the indexes information:

{R345, 25; R345, 37; R348, 10; R350, 8; …}retrieval:

{R123, 5; R128, 25; R345, 37; R345, 40; …}

Page 142: Chapter 2 Modeling

Hsin-Hsi Chen 142

Extensions of Inverted Index Operations (Distance Constraints)

– include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences:
information: {R345, 2, 3, 5; …}
retrieval: {R345, 2, 3, 6; …}

– query examples (a sketch follows):
(information adjacent retrieval)
(information within five words retrieval)

– cost: the size of the indexes
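A sketch of an adjacency/within-k check over a positional index shaped like the examples above; the record ids and positions are illustrative.

def within_k(postings_a, postings_b, k=1):
    """Records where some occurrence of B follows A within k words."""
    return {rec for rec, pos_a in postings_a.items()
            if any(0 < b - a <= k
                   for a in pos_a for b in postings_b.get(rec, []))}

info = {"R345": [25, 37], "R348": [10], "R350": [8]}
retr = {"R345": [38], "R123": [5]}
print(within_k(info, retr, k=1))   # {'R345'}: "information retrieval" adjacent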

Page 143: Chapter 2 Modeling

Hsin-Hsi Chen 143

Model Based on Proximal Nodes

• hierarchical vs. flat indexing structures

[Figure: a hierarchical index over structural components (chapter, sections, subsections, subsubsections, and paragraphs, pages, lines), whose nodes are positions in the text; alongside it, a flat inverted list for the words, e.g., holocaust: 10, 256, 48324, …, whose entries are positions in the text.]

Page 144: Chapter 2 Modeling

Hsin-Hsi Chen 144

Model Based on Proximal Nodes (Continued)

• query language– Specification of regular expressions– Reference to structural components by name– Combination– Example

• Search for sections, subsections, or subsubsections which contain the word ‘holocaust’

• [(*section) with (‘holocaust’)]

Page 145: Chapter 2 Modeling

Hsin-Hsi Chen 145

Model Based on Proximal Nodes (Continued)

• Basic algorithm
– Traverse the inverted list for the term 'holocaust'
– For each entry in the list (i.e., an occurrence), search the hierarchical index looking for sections, subsections, and subsubsections

• Revised algorithm (exploits nearby nodes)
– For the first entry, search as before
– Let the last matching structural component be the innermost matching component
– Verify whether the innermost matching component also matches the second entry
• If it does, the larger structural components above it also do

Page 146: Chapter 2 Modeling

Hsin-Hsi Chen 146

Models for Browsing

• Browsing vs. searching
– The goal of a searching task is clearer in the mind of the user than the goal of a browsing task

• Models
– Flat browsing
– Structure guided browsing
– The hypertext model

Page 147: Chapter 2 Modeling

Hsin-Hsi Chen 147

Models for Browsing

• Flat organization
– Documents are represented as dots in a 2-D plane
– Documents are represented as elements in a 1-D list, e.g., the results of a search engine

• Structure guided browsing
– Documents are organized in a directory, which groups documents covering related topics

• Hypertext model
– Navigating the hypertext: a traversal of a directed graph

Page 148: Chapter 2 Modeling

Hsin-Hsi Chen 148

Trends and Research Issues

• Library systems
– Cognitive and behavioral issues, oriented particularly at a better understanding of which criteria users adopt to judge relevance

• Specialized retrieval systems
– e.g., legal and business documents
– how to retrieve all relevant documents without retrieving a large number of unrelated documents

• The Web
– The user often does not know what he wants, or has great difficulty in formulating his request
– How the paradigm adopted for the user interface affects the ranking
– The indexes maintained by the various Web search engines are almost disjoint