1
Conventional Text-Retrieval Systems
Hsin-Hsi Chen
2
Database Management
• A specified set of attributes is used to characterize each item.
EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
• Retrieval requires an exact match between the attributes used in query formulations and those attached to the records.
SELECT BDATE, ADDR
FROM EMPLOYEE
WHERE NAME = 'John Smith'
3
Text-Retrieval Systems
• Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
• Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries and documents.
[Figure labels: content analysis, query formulation]
4
Possible Representation
• document representation
– unweighted index terms (term vectors)
– weighted index terms
– …
• query
– unweighted or weighted index terms
– Boolean combinations (or, and, not)
– …
• search operation must be effective
5
File Structures
• Main requirements
– fast access for various kinds of searches
– large number of indices
• Alternatives
– Inverted Files
– Signature Files
– PAT trees
6
Inverted Files
• File is represented as an array of indexed records.
          Term 1  Term 2  Term 3  Term 4
Record 1    1       1       0       1
Record 2    0       1       1       1
Record 3    1       0       1       1
Record 4    0       0       1       1
7
Inverted-file process
• The record-term array is inverted (transposed).
          Record 1  Record 2  Record 3  Record 4
Term 1       1         0         1         0
Term 2       1         1         0         0
Term 3       0         1         1         1
Term 4       1         1         1         1
8
Inverted-file process (Continued)
• Take two or more rows of an inverted term-record array, and produce a single combined list of record identifiers.
Query (term2 and term3)
Term 2:  1 1 0 0
Term 3:  0 1 1 1
----------------
and:     0 1 0 0  <-- R2
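The inversion and the row combination above can be sketched in a few lines of Python. This is only an illustration using the 4×4 array from the slides; the function names `invert` and `and_rows` are made up for this sketch and do not come from the original deck.

```python
# Record-term array from the slides: rows are records, columns are terms.
record_term = [
    [1, 1, 0, 1],  # Record 1
    [0, 1, 1, 1],  # Record 2
    [1, 0, 1, 1],  # Record 3
    [0, 0, 1, 1],  # Record 4
]

def invert(matrix):
    """Transpose a record-term array into a term-record array."""
    return [list(row) for row in zip(*matrix)]

def and_rows(term_record, term_indices):
    """Return 1-based record numbers whose bit is set in every selected term row."""
    rows = [term_record[t] for t in term_indices]
    return [r + 1 for r, bits in enumerate(zip(*rows)) if all(bits)]

term_record = invert(record_term)
print(term_record[1])                 # row for Term 2 (index 1)
print(and_rows(term_record, [1, 2]))  # records matching (term2 and term3)
```

With the slide data, the only record returned for (term2 and term3) is Record 2.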
9
List-merging for two ordered lists
• The inverted-index operations to obtain answers are based on a list-merging process.
• Example
T1: {R1, R3}
T2: {R1, R2}
Merged(T1, T2): {R1, R1, R2, R3}
10
List-merging Algorithm
• Given two input lists of record identifiers in increasing record-number order:
if both lists are empty then stop;
else if one of the input lists is empty
    then transfer onto the output list all items from the other list in order and stop;
else take the next item Ri from list 1 and the next item Rj from list 2
11
    if i < j
        then transfer Ri onto the merged output list and read the next item from list 1 before repeating the process;
        else transfer Rj onto the merged output list and read the next item from list 2 before repeating the process
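A minimal Python sketch of the merging procedure, assuming record identifiers are plain integers (R1 becomes 1, and so on). The function name `merge_lists` is an illustrative choice. Duplicates are kept on purpose, because the later Boolean operations rely on counting them.

```python
def merge_lists(list1, list2):
    """Merge two lists of record numbers given in increasing order, keeping duplicates."""
    merged = []
    i = j = 0
    # Both lists still have items: compare the next item of each.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one list still has items: transfer them in order.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged

print(merge_lists([1, 3], [1, 2]))  # T1 = {R1, R3}, T2 = {R1, R2}
```

For the slide example this yields the merged list R1, R1, R2, R3.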
12
          Record 1  Record 2  Record 3  Record 4
Term 1       1         0         1         0
Term 2       1         1         0         0
Term 3       0         1         1         1
Term 4       1         1         1         1
Query: ((T1 or T2) and not T3)
T1: {R1, R3}
T2: {R1, R2}
T3: {R2, R3, R4}
Merged(T1, T2): {R1, R1, R2, R3}
Output for (T1 or T2): {R1, R2, R3}
Merged(T1 or T2, T3): {R1, R2, R2, R3, R3, R4}
Output for ((T1 or T2) and T3): {R2, R3}
Merged((T1 or T2), ((T1 or T2) and T3)): {R1, R2, R2, R3, R3}
Output for ((T1 or T2) and not T3): {R1}
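The worked example suggests how each Boolean operator can be read off a merged list: "or" keeps one copy of every identifier, "and" keeps identifiers that occur twice, and "and not" keeps identifiers that occur only once after merging A with (A and B). The sketch below follows that reading. It assumes integer record numbers and input lists without internal duplicates, uses the standard-library `heapq.merge` in place of a hand-written merge, and the names `op_or`, `op_and`, and `op_and_not` are illustrative.

```python
from heapq import merge
from itertools import groupby

def op_or(a, b):
    """A or B: one copy of every identifier in the merged list."""
    return [r for r, _ in groupby(merge(a, b))]

def op_and(a, b):
    """A and B: identifiers that occur twice in the merged list."""
    return [r for r, g in groupby(merge(a, b)) if len(list(g)) == 2]

def op_and_not(a, b):
    """A and not B: merge A with (A and B), keep identifiers that occur once."""
    return [r for r, g in groupby(merge(a, op_and(a, b))) if len(list(g)) == 1]

T1, T2, T3 = [1, 3], [1, 2], [2, 3, 4]
t1_or_t2 = op_or(T1, T2)
print(t1_or_t2)                  # (T1 or T2)
print(op_and(t1_or_t2, T3))      # ((T1 or T2) and T3)
print(op_and_not(t1_or_t2, T3))  # ((T1 or T2) and not T3)
```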
13
Extensions of Inverted Index Operations (Distance Constraints)
• Distance Constraints
– (A within sentence B): terms A and B must co-occur in a common sentence
– (A adjacent B): terms A and B must occur adjacently in the text
14
Extensions of Inverted Index Operations (Distance Constraints)
• Implementation
– include term-location in the inverted indexes
information: {R345, R348, R350, …}
retrieval: {R123, R128, R345, …}
– include sentence-location in the indexes
information: {R345, 25; R345, 37; R348, 10; R350, 8; …}
retrieval: {R123, 5; R128, 25; R345, 37; R345, 40; …}
15
Extensions of Inverted Index Operations (Distance Constraints)
– include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
information: {R345, 2, 3, 5; …}
retrieval: {R345, 2, 3, 6; …}
– query examples
(information adjacent retrieval)
(information within five words retrieval)
– cost: the size of indexes
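A small sketch of how positional postings could answer the two query examples. It assumes each posting is a (record, paragraph, sentence, word) tuple, that "adjacent" means the second term immediately follows the first, and that "within k words" is only checked inside a single sentence (word numbers are relative to sentences). The function names are hypothetical, and only the two postings shown on the slide are used.

```python
# Postings: term -> list of (record, paragraph, sentence, word) tuples.
index = {
    "information": [(345, 2, 3, 5)],
    "retrieval": [(345, 2, 3, 6)],
}

def same_sentence_pairs(index, a, b):
    """Yield (record, word_a, word_b) for occurrences of a and b in one sentence."""
    for rec_a, par_a, sen_a, word_a in index.get(a, []):
        for rec_b, par_b, sen_b, word_b in index.get(b, []):
            if (rec_a, par_a, sen_a) == (rec_b, par_b, sen_b):
                yield rec_a, word_a, word_b

def adjacent(index, a, b):
    """Records where b immediately follows a (assumed meaning of 'adjacent')."""
    return sorted({r for r, wa, wb in same_sentence_pairs(index, a, b) if wb - wa == 1})

def within_words(index, a, b, k):
    """Records where a and b are at most k word positions apart in one sentence."""
    return sorted({r for r, wa, wb in same_sentence_pairs(index, a, b) if abs(wa - wb) <= k})

print(adjacent(index, "information", "retrieval"))
print(within_words(index, "information", "retrieval", 5))
```

Both calls find R345, since "information" is word 5 and "retrieval" is word 6 of the same sentence.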
16
Extensions of Inverted Index Operations (Term Weights)
• Term Weights
Ri = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}
• Issues
– how to generate the term weights
– how to apply the term weights
• Sum the weights of all document terms that match the given query.
• Rank the output documents in descending order of their summed term weights.
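As a rough illustration of the sum-and-rank step, the sketch below scores each document by adding the weights of its terms that appear in the query. The document collection, the weights of R2, and the names `score` and `rank` are all invented for the example; only the weights 0.2, 0.5, 0.6 echo the slide.

```python
def score(doc_weights, query_terms):
    """Sum the weights of the document terms that match the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)

def rank(docs, query_terms):
    """Return (document, score) pairs in descending order of score."""
    scored = [(name, score(weights, query_terms)) for name, weights in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical collection.
docs = {
    "R1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "R2": {"T1": 0.4, "T4": 0.9},
}

for name, s in rank(docs, {"T1", "T3"}):
    print(name, round(s, 2))
```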
17
Boolean Query
• Transform a Boolean expression into disjunctive normal form.
T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)
• For each conjunct, compute the minimum term weight of any document term in that conjunct.
• The document weight is the maximum of all the conjunct weights.
18
Boolean Query
• Example: (T1 and T2) or T3
Document vectors             Conjunct weights       Query weight
                             (T1 and T2)   (T3)     (T1 and T2) or T3
D1=(T1,0.2;T2,0.5;T3,0.6)    0.2           0.6      0.6
D2=(T1,0.7;T2,0.2;T3,0.1)    0.2           0.1      0.2
D1 is preferred.
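The min/max rule can be checked against the example with a short sketch. The query is assumed to be already in disjunctive normal form, written as a list of conjuncts, each a tuple of terms. Treating a term that is missing from a document as weight 0.0 is an assumption of this sketch, and the function names are illustrative.

```python
def conjunct_weight(doc, conjunct):
    """Minimum weight over the terms of one conjunct (missing term counts as 0.0)."""
    return min(doc.get(term, 0.0) for term in conjunct)

def query_weight(doc, dnf):
    """Maximum of the conjunct weights for a query in disjunctive normal form."""
    return max(conjunct_weight(doc, conjunct) for conjunct in dnf)

D1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
D2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}
dnf = [("T1", "T2"), ("T3",)]  # (T1 and T2) or T3

print(query_weight(D1, dnf))  # 0.6
print(query_weight(D2, dnf))  # 0.2
```

D1 receives the higher query weight, which matches the table.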
19
Synonym Specification
• Original Query
(T1 and T2) or T3
Assume S1 is a synonym of T1.
Assume S3 is a synonym of T3.
• Broader Query
((T1 or S1) and T2) or (T3 or S3)
• The number of relevant items retrieved may be larger.
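One way to picture the broadening step is to expand each conjunct of a query in disjunctive normal form with the synonyms of its terms. The sketch below assumes the query is given as a list of conjuncts and the synonyms as a plain dictionary; the function name `expand` and this representation are illustrative choices, not something specified in the slides.

```python
from itertools import product

def expand(dnf, synonyms):
    """Replace every term by (term or its synonyms) and return the result in DNF."""
    expanded = []
    for conjunct in dnf:
        # For each term, the list of alternatives: the term itself plus its synonyms.
        alternatives = [[term] + synonyms.get(term, []) for term in conjunct]
        expanded.extend(product(*alternatives))
    return expanded

original = [("T1", "T2"), ("T3",)]       # (T1 and T2) or T3
synonyms = {"T1": ["S1"], "T3": ["S3"]}

print(expand(original, synonyms))
```

The result corresponds to (T1 and T2) or (S1 and T2) or T3 or S3, which is the broader query written out in disjunctive normal form.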
20
Term Truncation
• Term Truncation
– Remove suffixes and/or prefixes from context terms.
– Example
PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …
21
Term Truncation
• Implementation
– Only suffix truncation
Conventional inverted-index methodology can be maintained unchanged.
– Only prefix truncation
The term entries in the inverted index are inversely alphabetized (sorted on their reversed spelling).
antisymmetry --> yrtemmysitna
22
Term Truncation
– Both prefix and suffix truncation
*SYMM*: antisymmetric, asymmetry
inverted-index entries that are alphabetized both forward and backward
– Infix truncation
wom*n: woman, women
inverted index with entries for all possible “rotated” word forms
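The suffix-only and prefix-only cases can be sketched with two sorted term lists, one in normal spelling and one in reversed spelling, searched with the standard-library `bisect` module. The small vocabulary and the function names below are invented for illustration.

```python
from bisect import bisect_left

def prefix_range(sorted_keys, prefix):
    """All keys in a sorted list that start with the given prefix."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches

# Hypothetical vocabulary.
terms = ["antisymmetry", "asymmetry", "psychiatry", "psychology", "symmetry"]
forward = sorted(terms)                    # conventional alphabetical order
backward = sorted(t[::-1] for t in terms)  # inversely alphabetized entries

def suffix_truncation(stem):
    """Query STEM*: a prefix search on the conventional index."""
    return prefix_range(forward, stem)

def prefix_truncation(ending):
    """Query *ENDING: a prefix search on the reversed index."""
    return sorted(k[::-1] for k in prefix_range(backward, ending[::-1]))

print(suffix_truncation("psych"))
print(prefix_truncation("symmetry"))
```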
23
Term Truncation
• Each term entry X = x1, x2, …, xn with individual characters xi is augmented by adding a special terminal character /.
ABC --> ABC/
BABC --> BABC/
BCAB --> BCAB/
• Each augmented term x1, x2, …, xn/ is rotated cyclically by wrapping the term around itself n+1 times.
ABC/ --> /ABC, C/AB, BC/A, ABC/
24
Term Truncation
• Each resulting word form is then augmented by appending a blank character ^.
• The resulting file of word forms is sorted alphabetically.
Sort order, from low to high: ^, /, a, b, c, …, Z
25
Term   Augmented   Rotated forms   Sorted file
ABC    ABC/        /ABC^           /ABC^
                   C/AB^           /BABC^
                   BC/A^           /BCAB^
                   ABC/^           AB/BC^
BABC   BABC/       /BABC^          ABC/^
                   C/BAB^          ABC/B^
                   BC/BA^          B/BCA^
                   ABC/B^          BABC/^
                   BABC/^          BC/A^
BCAB   BCAB/       /BCAB^          BC/BA^
                   B/BCA^          BCAB/^
                   AB/BC^          C/AB^
                   CAB/B^          C/BAB^
                   BCAB/^          CAB/B^
26
Retrieval Strategies
• Query term X
Look for index entries /X^ or X/^.
• Query term X*
Look for /X*.
• Query term *X
Look for X/^, X/Y1, …, X/Yn.
Original patterns: X, Y1X, …, YnX
• Query term *X*
Look for XY1/Z1, …, XYn/Zn.
Original patterns: Z1XY1, …, ZnXYn
27
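Query *B*: look for sorted-file entries that begin with B.
Term   Augmented   Rotated forms   Sorted file   Matched term
ABC    ABC/        /ABC^           /ABC^
                   C/AB^           /BABC^
                   BC/A^           /BCAB^
                   ABC/^           AB/BC^
BABC   BABC/       /BABC^          ABC/^
                   C/BAB^          ABC/B^
                   BC/BA^          B/BCA^        BCAB
                   ABC/B^          BABC/^        BABC
                   BABC/^          BC/A^         ABC
BCAB   BCAB/       /BCAB^          BC/BA^        BABC
                   B/BCA^          BCAB/^        BCAB
                   AB/BC^          C/AB^
                   CAB/B^          C/BAB^
                   BCAB/^          CAB/B^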
28
Retrieval Strategies
• Query term X*Y
Look for Y/XZ1, …, Y/XZm.
Original patterns: XZ1Y, …, XZmY
• Cost
The number of index entries increases.
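The rotated-form index and the retrieval strategies can be sketched together as follows. This is an illustration under a few assumptions: the trailing blank marker ^ is omitted, because Python's string ordering already places a shorter string before any longer string it prefixes; the ASCII order of / before letters stands in for the sort order of the slides; and the function names are hypothetical. Every supported pattern reduces to a single prefix search on the sorted rotations.

```python
from bisect import bisect_left

def rotations(term):
    """All cyclic rotations of the term augmented with the terminal '/'."""
    s = term + "/"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_index(terms):
    """Sorted list of (rotated form, original term) pairs."""
    return sorted((rot, t) for t in terms for rot in rotations(t))

def lookup(index, prefix):
    """Original terms having a rotated form that starts with prefix."""
    start = bisect_left(index, (prefix,))
    found = set()
    for rot, term in index[start:]:
        if not rot.startswith(prefix):
            break
        found.add(term)
    return sorted(found)

def search(index, pattern):
    """Handle the patterns X, X*, *X, X*Y, and *X*."""
    stars = pattern.count("*")
    if stars == 0:
        # Exact term: the rotation /X must match completely.
        return [t for t in lookup(index, "/" + pattern) if t == pattern]
    if stars == 1:
        # X*Y -> Y/X; covers X* (Y empty) and *X (X empty) as well.
        x, y = pattern.split("*")
        return lookup(index, y + "/" + x)
    if stars == 2 and pattern.startswith("*") and pattern.endswith("*"):
        # *X* -> any rotation beginning with X.
        return lookup(index, pattern[1:-1])
    raise ValueError("unsupported pattern: " + pattern)

index = build_index(["ABC", "BABC", "BCAB"])
print(search(index, "*B*"))
print(search(index, "B*C"))
print(search(index, "*AB"))
```

The index stores n+1 entries for a term of n characters, which is the cost noted above.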