conventional text-retrieval systems

1

Conventional Text-Retrieval Systems

Hsin-Hsi Chen

2

Database Management

• A specified set of attributes is used to characterize each item.

EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)

• Exact match between the attributes used in query formulations and those attached to the record.

SELECT BDATE, ADDR
FROM EMPLOYEE
WHERE NAME = 'John Smith'

3

Text-Retrieval Systems

• Content identifiers (keywords, index terms, descriptors) characterize the stored texts.

• Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries and documents.

[Figure labels: content analysis; query formulation]

4

Possible Representation

• document representation
– unweighted index terms (term vectors)
– weighted index terms
– …

• query
– unweighted or weighted index terms
– Boolean combinations (or, and, not)
– …

• search operation must be effective

5

File Structures

• Main requirements
– fast access for various kinds of searches
– large number of indices

• Alternatives
– Inverted Files
– Signature Files
– PAT trees

6

Inverted Files

• File is represented as an array of indexed records.

          Term 1  Term 2  Term 3  Term 4
Record 1    1       1       0       1
Record 2    0       1       1       1
Record 3    1       0       1       1
Record 4    0       0       1       1

7

Inverted-file process

• The record-term array is inverted (transposed).

        Record 1  Record 2  Record 3  Record 4
Term 1     1         0         1         0
Term 2     1         1         0         0
Term 3     0         1         1         1
Term 4     1         1         1         1

8

Inverted-file process (Continued)

• Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers.

Query (term2 and term3)

Term 2:  1  1  0  0
Term 3:  0  1  1  1
---------------------------------
AND:     0  1  0  0   <-- R2
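The inversion and the row combination above can be sketched in a few lines of Python. This is only an illustration built on the 4x4 example array from the slides; the helper names `invert`, `and_rows`, and `record_ids` are hypothetical and not part of the original material.

```python
# Record-term array from the slides: one row per record, one column per term.
record_term = [
    [1, 1, 0, 1],  # Record 1
    [0, 1, 1, 1],  # Record 2
    [1, 0, 1, 1],  # Record 3
    [0, 0, 1, 1],  # Record 4
]

def invert(rows):
    """Transpose the record-term array into a term-record array."""
    return [list(col) for col in zip(*rows)]

def and_rows(row_a, row_b):
    """Combine two term rows position by position with a logical AND."""
    return [a & b for a, b in zip(row_a, row_b)]

def record_ids(row):
    """List the record identifiers whose bit is set in a combined row."""
    return [f"R{i + 1}" for i, bit in enumerate(row) if bit]

term_record = invert(record_term)
term2, term3 = term_record[1], term_record[2]
print(record_ids(and_rows(term2, term3)))  # ['R2']
```

A bit-matrix is convenient for a tiny example; the list-based postings used in the following slides scale better when most entries are zero.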

9

List-merging for two ordered lists

• The inverted-index operations used to obtain answers are based on a list-merging process.

• Example
T1: {R1, R3}
T2: {R1, R2}
Merged(T1, T2): {R1, R1, R2, R3}

10

List-Merging Algorithm

• Given two input lists of record identifiers in increasing record-number order:

if both lists are empty
    then stop;
else if one of the input lists is empty
    then transfer onto the output list all items from the other list in order and stop;
else take the next item Ri from list 1 and the next item Rj from list 2

11

if i < j
    then transfer Ri onto the merged output list and read the next item from list 1 before repeating the process;
else transfer Rj onto the merged output list and read the next item from list 2 before repeating the process
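The merging steps above might be written in Python roughly as follows. This is a minimal sketch that assumes record identifiers are plain integers; the function name `merge_ordered` is hypothetical. As in the pseudocode, an identifier present in both lists appears twice in the output.

```python
def merge_ordered(list1, list2):
    """Merge two lists of record numbers that are already in increasing order."""
    merged = []
    i = j = 0
    # Both lists still have items: move the smaller head to the output.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one list still has items: transfer them in order.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged

print(merge_ordered([1, 3], [1, 2]))  # [1, 1, 2, 3]
```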

12

        Record 1  Record 2  Record 3  Record 4
Term 1     1         0         1         0
Term 2     1         1         0         0
Term 3     0         1         1         1
Term 4     1         1         1         1

((T1 or T2) and not T3)

T1: {R1, R3}
T2: {R1, R2}
T3: {R2, R3, R4}

Merged(T1, T2): {R1, R1, R2, R3}
Output for (T1 or T2): {R1, R2, R3}
Merged(T1 or T2, T3): {R1, R2, R2, R3, R3, R4}
Output for ((T1 or T2) and T3): {R2, R3}
Merged((T1 or T2), ((T1 or T2) and T3)): {R1, R2, R2, R3, R3}
Output for ((T1 or T2) and not T3): {R1}
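The worked example suggests a simple rule for each operator: after merging, OR keeps every identifier once, AND keeps the identifiers that occur twice, and AND NOT merges a list with the AND result and keeps the identifiers that occur only once. The sketch below follows that reading. It assumes integer record numbers and input lists without internal duplicates; the function names are hypothetical.

```python
from heapq import merge
from itertools import groupby

def runs(list_a, list_b):
    """Merge two ordered lists and count how often each record occurs."""
    return [(rec, len(list(group))) for rec, group in groupby(merge(list_a, list_b))]

def op_or(list_a, list_b):
    """Keep every record from the merged list once."""
    return [rec for rec, _ in runs(list_a, list_b)]

def op_and(list_a, list_b):
    """Keep records that occur twice in the merged list."""
    return [rec for rec, count in runs(list_a, list_b) if count == 2]

def op_and_not(list_a, list_b):
    """Merge A with (A and B); records occurring once are in A but not in B."""
    return [rec for rec, count in runs(list_a, op_and(list_a, list_b)) if count == 1]

t1, t2, t3 = [1, 3], [1, 2], [2, 3, 4]
t1_or_t2 = op_or(t1, t2)
print(t1_or_t2)                  # [1, 2, 3]
print(op_and(t1_or_t2, t3))      # [2, 3]
print(op_and_not(t1_or_t2, t3))  # [1]
```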

13

Extensions of Inverted Index Operations (Distance Constraints)

• Distance Constraints
– (A within sentence B)
  terms A and B must co-occur in a common sentence
– (A adjacent B)
  terms A and B must occur adjacently in the text

14


• Implementation
– include term-location in the inverted indexes
  information: {R345, R348, R350, …}
  retrieval: {R123, R128, R345, …}

– include sentence-location in the indexes
  information: {R345, 25; R345, 37; R348, 10; R350, 8; …}
  retrieval: {R123, 5; R128, 25; R345, 37; R345, 40; …}

15


– include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
  information: {R345, 2, 3, 5; …}
  retrieval: {R345, 2, 3, 6; …}

– query examples
  (information adjacent retrieval)
  (information within five words retrieval)

– cost: the size of indexes
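A positional index of this kind could be queried as in the following sketch. Each posting is assumed to be a tuple (record, paragraph, sentence, word), matching the last index layout above. The R345 postings come from the slides; the R400 postings, the function names, and the decision to accept either word order for "adjacent" are assumptions made for illustration.

```python
# term -> list of (record, paragraph, sentence, word) postings
index = {
    "information": [("R345", 2, 3, 5), ("R400", 1, 1, 2)],  # R400 is hypothetical
    "retrieval": [("R345", 2, 3, 6), ("R400", 1, 1, 9)],    # R400 is hypothetical
}

def within_words(index, term_a, term_b, max_gap):
    """Records where both terms share a sentence at most max_gap words apart."""
    hits = set()
    for rec_a, par_a, sen_a, word_a in index.get(term_a, []):
        for rec_b, par_b, sen_b, word_b in index.get(term_b, []):
            same_sentence = (rec_a, par_a, sen_a) == (rec_b, par_b, sen_b)
            if same_sentence and 0 < abs(word_a - word_b) <= max_gap:
                hits.add(rec_a)
    return sorted(hits)

def adjacent(index, term_a, term_b):
    """Adjacency treated as a gap of exactly one word, in either order."""
    return within_words(index, term_a, term_b, 1)

print(adjacent(index, "information", "retrieval"))         # ['R345']
print(within_words(index, "information", "retrieval", 5))  # ['R345']
```

The nested loop keeps the sketch short; with postings sorted by record and position, the same check can be done in a single merge pass.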

16

Extensions of Inverted Index Operations (Term Weights)

• Term Weights
Ri = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}

• Issues
– how to generate the term weights
– how to apply the term weights

• Sum the weights of all document terms that match the given query.

• Rank the output documents in descending order of their summed term weights.
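As a small illustration of the two steps above, the sketch below sums the weights of the matching terms and sorts the records by that sum. The first record reuses the weights from the slide; the second record, the query, and the function names are hypothetical.

```python
records = {
    "R1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},  # weights from the slide example
    "R2": {"T1": 0.9, "T4": 0.3},             # hypothetical record
}

def score(record, query_terms):
    """Sum the weights of the record's terms that also appear in the query."""
    return sum(weight for term, weight in record.items() if term in query_terms)

def rank(records, query_terms):
    """Order record names by descending score."""
    return sorted(records, key=lambda name: score(records[name], query_terms), reverse=True)

query = {"T1", "T2"}
print(rank(records, query))  # ['R2', 'R1']
```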

17

Boolean Query

• Transform a Boolean expression into disjunctive normal form.

T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)

• For each conjunct, the conjunct weight is the minimum weight among the document terms in that conjunct.

• The document weight is the maximum of all the conjunct weights.

18

Boolean Query

• Example: (T1 and T2) or T3

Document Vectors             Conjunct Weights       Query Weight
                             (T1 and T2)   (T3)     (T1 and T2) or T3
D1=(T1,0.2;T2,0.5;T3,0.6)    0.2           0.6      0.6
D2=(T1,0.7;T2,0.2;T3,0.1)    0.2           0.1      0.2

D1 is preferred.
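The min/max rule can be checked against the example with a short Python sketch. The query is assumed to be already in disjunctive normal form and is written as a list of conjuncts; a term missing from a document is treated as weight 0.0, which is an assumption not stated in the slides. The name `dnf_weight` is hypothetical.

```python
def dnf_weight(document, conjuncts):
    """Maximum over conjuncts of the minimum term weight within each conjunct."""
    return max(
        min(document.get(term, 0.0) for term in conjunct)
        for conjunct in conjuncts
    )

# (T1 and T2) or T3, written as a list of conjuncts
query = [("T1", "T2"), ("T3",)]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(dnf_weight(d1, query))  # 0.6
print(dnf_weight(d2, query))  # 0.2
```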

19

Synonym Specification

• Original Query
(T1 and T2) or T3

Assume S1 is a synonym of T1.
Assume S3 is a synonym of T3.

• Broader Query
((T1 or S1) and T2) or (T3 or S3)

• The number of relevant items retrieved may be larger.

20

Term Truncation

• Term Truncation
– Remove suffixes and/or prefixes from context terms.
– Example
  PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …

21

Term Truncation

• Implementation
– Only suffix truncation
  Conventional inverted-index methodology can be maintained unchanged.

– Only prefix truncation
  The term entries in the inverted index are alphabetized on their reversed spelling.
  antisymmetry --> yrtemmysitna
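Both cases reduce to a prefix search over a sorted term list, which the sketch below does with binary search. The small vocabulary is drawn from examples in the slides plus the word "symmetry"; the function names are hypothetical, and lowercase terms are assumed.

```python
from bisect import bisect_left

terms = ["antisymmetry", "psychiatry", "psychology", "symmetry"]
forward = sorted(terms)                      # conventional alphabetical index
backward = sorted(t[::-1] for t in terms)    # index on reversed spellings

def prefix_matches(sorted_terms, prefix):
    """Collect the consecutive entries that begin with the given prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

def suffix_truncation(stem):
    """Query of the form STEM*: search the forward index."""
    return prefix_matches(forward, stem)

def prefix_truncation(ending):
    """Query of the form *ENDING: search the reversed index, then restore spelling."""
    return [t[::-1] for t in prefix_matches(backward, ending[::-1])]

print(suffix_truncation("psych"))     # ['psychiatry', 'psychology']
print(prefix_truncation("symmetry"))  # ['symmetry', 'antisymmetry']
```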

22

Term Truncation

– Both prefix and suffix truncation
  *SYMM*: antisymmetric, asymmetry
  inverted-index entries that are alphabetized both forward and backward

– Infix truncation
  wom*n: woman, women
  inverted index with entries for all possible “rotated” word forms

23

Term Truncation

• Each term entry X = x1, x2, …, xn with individual characters xi is augmented by adding a special terminal character /.

ABC  --> ABC/
BABC --> BABC/
BCAB --> BCAB/

• Each augmented term x1, x2, …, xn/ is rotated cyclically by wrapping the term around itself n+1 times.

ABC/ --> /ABC, C/AB, BC/A, ABC/

24

Term Truncation

• Each resulting word form is then augmented by appending a blank character ^.

• The resulting file of word forms is sorted alphabetically.

Sort order, low to high: ^, /, a, b, c, …, z

25

Term   Augmented   Rotated forms   Sorted file
ABC    ABC/        /ABC^           /ABC^
                   C/AB^           /BABC^
                   BC/A^           /BCAB^
                   ABC/^           AB/BC^
BABC   BABC/       /BABC^          ABC/^
                   C/BAB^          ABC/B^
                   BC/AB^          B/BCA^
                   ABC/B^          BABC/^
                   BABC/^          BC/A^
BCAB   BCAB/       /BCAB^          BC/AB^
                   B/BCA^          BCAB/^
                   AB/BC^          C/AB^
                   CAB/B^          C/BAB^
                   BCAB/^          CAB/B^

26

Retrieval Strategies

• Query term X
Look for index entries /X^ or X/^.

• Query term X*
Look for /X*.

• Query term *X
Look for X/^, X/Y1, …, X/Yn.
Original patterns: X, Y1X, …, YnX

• Query term *X*
Look for XY1/Z1, …, XYn/Zn.
Original patterns: Z1XY1, …, ZnXYn

27

Query *B*: look for sorted entries that begin with B.

Term   Augmented   Rotated forms   Sorted file   Match for *B*
ABC    ABC/        /ABC^           /ABC^
                   C/AB^           /BABC^
                   BC/A^           /BCAB^
                   ABC/^           AB/BC^
BABC   BABC/       /BABC^          ABC/^
                   C/BAB^          ABC/B^
                   BC/AB^          B/BCA^        BCAB
                   ABC/B^          BABC/^        BABC
                   BABC/^          BC/A^         ABC
BCAB   BCAB/       /BCAB^          BC/AB^        BABC
                   B/BCA^          BCAB/^        BCAB
                   AB/BC^          C/AB^
                   CAB/B^          C/BAB^
                   BCAB/^          CAB/B^

28

Retrieval Strategies

• Query term X*Y
Look for Y/XZ1, …, Y/XZm.
Original patterns: XZ1Y, …, XZmY

• Cost
The number of index entries increases.
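The rotated-form index and the lookup rules from the preceding slides can be approximated as in the sketch below. It is a simplified reading, not the exact scheme in the slides: the trailing blank ^ is omitted, Python's default string ordering replaces the ^, /, a, …, z order, and every query is reduced to one prefix search. All function names are hypothetical, and patterns with more than one * are only handled in the *X* form.

```python
from bisect import bisect_left

END = "/"  # terminal character appended to every term, as in the slides

def rotations(term):
    """All n+1 cyclic rotations of the augmented term."""
    augmented = term + END
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_index(terms):
    """Sorted list of (rotated form, original term) pairs."""
    return sorted((rot, term) for term in terms for rot in rotations(term))

def prefix_lookup(index, prefix):
    """Original terms whose rotated form begins with the prefix."""
    found = set()
    for rot, term in index[bisect_left(index, (prefix,)):]:
        if not rot.startswith(prefix):
            break
        found.add(term)
    return found

def search(index, pattern):
    """Map X, X*, *X, *X*, and X*Y onto a single prefix lookup."""
    if "*" not in pattern:
        return {t for t in prefix_lookup(index, END + pattern) if t == pattern}
    if len(pattern) > 1 and pattern.startswith("*") and pattern.endswith("*"):
        return prefix_lookup(index, pattern[1:-1])  # *X*: entries starting with X
    head, tail = pattern.split("*")                 # X*Y, with X or Y possibly empty
    return prefix_lookup(index, tail + END + head)  # look for Y/X...

index = build_index(["ABC", "BABC", "BCAB"])
print(sorted(search(index, "*B*")))  # ['ABC', 'BABC', 'BCAB']
print(sorted(search(index, "B*C")))  # ['BABC']
```

The index holds one entry per character of each term plus one, which is the size cost noted above.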
