Dynamic Element Retrieval in a Structured Environment
Crouch, Carolyn J.
University of Minnesota Duluth, MN
October 1, 2006
Key Problems
Retrieval of elements at desired level of granularity
Assigning a rank order to each element that reflects its perceived relevance to the query
INEX: Initiative for the Evaluation of XML Retrieval
INEX provides an environment for experiments in structured retrieval
Traditionally contains two types of topics: CO (content-only) and CAS (content-and-structure)
Both INEX 2004 and 2005 use an evaluation measure known as inex-eval
Recall (the proportion of relevant information retrieved) and Precision (the proportion of retrieved items that are relevant)
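The two definitions above can be illustrated with a minimal sketch. It computes plain set-based precision and recall; it is not the inex-eval implementation, and the function names and item ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved ids and a set of relevant ids."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    # Precision: share of retrieved items that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant items that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_rank(ranked, relevant, k):
    """Precision computed over the top-k items of a ranked list."""
    return precision_recall(ranked[:k], relevant)[0]


ranked = ["a", "b", "c", "d"]   # hypothetical ranked result list
relevant = {"a", "c", "e"}      # hypothetical relevance judgments
p, r = precision_recall(ranked, relevant)
```

Precision at rank, as reported in the tables of this talk, corresponds to cutting the ranked list at a fixed depth before computing precision.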
Flexible Retrieval System
The system processes XML documents
Smart format (Salton’s Magic Automatic Retriever of Text)
Lnu-ltu term weighting
A Method for Flexible Retrieval
Input to Flexible Retrieval
Construction of the Document Tree
Ranking of Elements
Output of Flexible Retrieval
Input to Flexible Retrieval
Preorder traversal
Ranked terminal leaf nodes (paragraphs)
Generate document tree (schema and paragraphs)
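A preorder walk over a document tree might look like the sketch below. It uses Python's standard `xml.etree.ElementTree`; the sample document, its tag names, and the helper `preorder_paths` are assumptions for illustration, not the system's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; tag names are illustrative only.
SAMPLE = (
    "<article><sec><p>alpha</p><ss1><p>beta</p></ss1></sec>"
    "<sec><p>gamma</p></sec></article>"
)


def preorder_paths(root):
    """Yield (xpath-like path, element) pairs in preorder (parent before children)."""
    def walk(elem, path):
        yield path, elem
        counts = {}
        for child in elem:
            # Number siblings per tag so every path is unique.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
    yield from walk(root, f"/{root.tag}[1]")


root = ET.fromstring(SAMPLE)
paths = [path for path, _ in preorder_paths(root)]
# Terminal (leaf) nodes are the elements with no child elements.
leaves = [path for path, elem in preorder_paths(root) if len(elem) == 0]
```

Here the paths of all nodes play the role of the document schema, and the leaf paths identify the paragraphs.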
Ranking of Elements
Address ranking issues with Lnu-ltu term weighting
Length and normalization issues
Pivot and slope
Lnu (weight of element vector formula)
$$\mathrm{Lnu} = \frac{\dfrac{1 + \log(\text{term frequency})}{1 + \log(\text{average term frequency})}}{(1 - \text{slope}) + \text{slope} \times \dfrac{\text{number unique terms}}{\text{pivot}}}$$
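The Lnu formula can be written as a small function. This is a sketch under stated assumptions: the slide does not give the log base (natural log is used here), and the parameter names and sample values are illustrative.

```python
import math


def lnu_weight(tf, avg_tf, n_unique, pivot, slope):
    """Lnu weight of one term in an element vector, following the slide's formula.

    tf       -- frequency of the term in the element
    avg_tf   -- average term frequency in the element
    n_unique -- number of unique terms in the element
    pivot, slope -- normalization parameters (values are tuning choices)
    """
    if tf <= 0:
        return 0.0  # a term that does not occur gets no weight
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    denominator = (1 - slope) + slope * (n_unique / pivot)
    return numerator / denominator


# With n_unique equal to the pivot, the denominator is 1 and no length
# normalization is applied.
w_at_pivot = lnu_weight(tf=1, avg_tf=1, n_unique=10, pivot=10, slope=0.25)
```

Elements with more unique terms than the pivot are penalized, and shorter ones are boosted, with the slope controlling how strongly.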
Ltu (weighting of query terms formula)
$$\mathrm{Ltu} = \frac{(1 + \log(\text{term frequency})) \times \log(N \div n_k)}{(1 - \text{slope}) + \text{slope} \times \dfrac{\text{number unique terms}}{\text{pivot}}}$$
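The Ltu formula can be sketched the same way. The slide does not define N or the log base; the sketch assumes N is the number of elements in the collection at the level being scored, nk is the number of those elements containing the term, and natural log is used. Parameter names are illustrative.

```python
import math


def ltu_weight(tf, n_total, n_k, n_unique, pivot, slope):
    """Ltu weight of one query term, following the slide's formula.

    tf       -- frequency of the term in the query
    n_total  -- N: number of elements in the collection (assumed meaning)
    n_k      -- nk: number of elements containing the term (assumed meaning)
    n_unique -- number of unique terms in the query
    """
    if tf <= 0 or n_k <= 0:
        return 0.0  # absent or unindexed terms contribute nothing
    numerator = (1 + math.log(tf)) * math.log(n_total / n_k)
    denominator = (1 - slope) + slope * (n_unique / pivot)
    return numerator / denominator


# A term that occurs in every element carries no discriminating power.
w_common = ltu_weight(tf=1, n_total=100, n_k=100, n_unique=5, pivot=5, slope=0.2)
w_rare = ltu_weight(tf=1, n_total=100, n_k=10, n_unique=5, pivot=5, slope=0.2)
```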
Overview of flexible retrieval
1. Parse to extract leaf nodes from the original XML documents
2. Index leaf nodes and queries using Smart
3. Perform Smart retrieval to get highly correlated leaf nodes
Overview of flexible retrieval (cont.)
4. For each document containing a retrieved leaf node
   a. Get its document schema
   b. Generate vector representations for inner nodes (elements)
5. For each term in the query
   a. Get its inverted file entry and corresponding XPaths
   b. Find nk at all levels
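Steps 4 and 5 can be sketched as building inner-node vectors bottom-up from the leaf nodes. This sketch assumes leaf vectors are raw term-frequency counts keyed by path, and that an inner node's vector is the sum of the leaf vectors beneath it. It counts nk directly from the vectors rather than from an inverted file. All names and sample data are hypothetical.

```python
from collections import Counter


def build_element_vectors(leaf_vectors):
    """Propagate leaf term frequencies up to every ancestor path."""
    vectors = {path: Counter(tf) for path, tf in leaf_vectors.items()}
    for path, tf in leaf_vectors.items():
        parts = path.strip("/").split("/")
        # Every proper prefix of a leaf path is an inner node (element).
        for depth in range(1, len(parts)):
            ancestor = "/" + "/".join(parts[:depth])
            vectors.setdefault(ancestor, Counter()).update(tf)
    return vectors


def element_frequency(vectors, term, tag):
    """nk at one level: number of elements with this tag that contain the term."""
    return sum(
        1
        for path, tf in vectors.items()
        if path.rsplit("/", 1)[-1].split("[")[0] == tag and tf[term] > 0
    )


leaf_vectors = {
    "/article[1]/sec[1]/p[1]": Counter({"xml": 2, "retrieval": 1}),
    "/article[1]/sec[1]/p[2]": Counter({"xml": 1}),
    "/article[1]/sec[2]/p[1]": Counter({"index": 3}),
}
vectors = build_element_vectors(leaf_vectors)
```

Because only the leaf nodes are stored, the section and article vectors exist only for documents that contain a retrieved leaf node.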
Experiments and Results
Attendant file sizes (dictionary, inverted index, element vectors) reduced by 60%, 50%, and 50% respectively
30%-40% less storage than the all-element index
Is dynamic element retrieval cost effective?
Researchers
Grabs and Schek (similar work to flexible retrieval)
Govert et al. (term weights are multiplied by a collection-dependent augmentation factor as they are propagated up the document tree)
Mass et al. (maintain separate indices for elements at different levels of granularity; this solves the issue of distorted statistics)
Overview of flexible retrieval (cont.)
6. Correlate element vectors at each level with query
7. Return ranked list of elements
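Steps 6 and 7 can be sketched as scoring each element against the query and sorting. The sketch assumes the correlation is an inner product of already-weighted vectors (Lnu for elements, ltu for the query). The function name and the weights shown are made up for illustration.

```python
def rank_elements(query_vec, element_vecs):
    """Return (path, score) pairs sorted by descending inner-product score."""
    scores = {
        path: sum(weight * vec.get(term, 0.0) for term, weight in query_vec.items())
        for path, vec in element_vecs.items()
    }
    # Ties are broken by path so the ordering is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


query_vec = {"xml": 1.0, "retrieval": 0.5}            # hypothetical ltu weights
element_vecs = {                                      # hypothetical Lnu weights
    "/article[1]/sec[1]": {"xml": 0.4, "retrieval": 0.2},
    "/article[1]/sec[1]/p[2]": {"xml": 0.1},
    "/article[1]/sec[2]": {"index": 0.9},
}
ranking = rank_elements(query_vec, element_vecs)
```

Elements at all levels of granularity are placed in one ranked list, so a section can outrank its own paragraphs when it matches more of the query.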
Table I
              INEX 2004                INEX 2005
article       12,107                   16,440
sections      69,577                   94,421
subsections   77,397                   104,746
paragraphs    1,029,747                1,378,202
elements      1,188,828                1,593,809
CO Topics     40 Topics (34 assessed)  40 Topics (29 assessed)
Table II. Comparison of All-Element and Flexible Retrieval under Inex-Eval (Generalized)
Precision at Rank
2004 2005
Rank All Element Flexible All Element Flexible
1 0.3897 0.3971 0.4224 0.4224
5 0.3088 0.2882 0.3241 0.3413
10 0.2735 0.2669 0.2991 0.2991
20 0.2529 0.2390 0.2841 0.2939
25 0.2456 0.2379 0.2669 0.2800
50 0.2000 0.1972 0.2364 0.2366
100 0.1523 0.1501 0.1921 0.1920
500 0.0697 0.0697 0.0943 0.0949
1500 0.0353 0.0362 0.0472 0.0483
Table II.(cont)
Precision at Various Points of Recall
2004 2005
Recall All Element Flexible All Element Flexible
0.01 0.3395 0.3348 0.3562 0.3693
0.25 0.0971 0.0951 0.1131 0.1165
0.50 0.0257 0.0283 0.0385 0.0404
0.75 0.0017 0.0017 0.0097 0.0095
1.00 0.0013 0.0013 0.0015 0.0015
avg prec 0.0625 0.0620 0.0739 0.0750
Table III. Comparison of All-Element and Flexible Retrieval under Inex-Eval (Strict)
Precision at Rank
2004 2005
Rank All Element Flexible All Element Flexible
1 0.2000 0.2000 0.1481 0.1481
5 0.1440 0.1200 0.0667 0.0741
10 0.1240 0.1200 0.0852 0.0778
20 0.1120 0.1020 0.0815 0.0815
25 0.1024 0.0992 0.0800 0.0830
50 0.0898 0.0832 0.0689 0.0681
100 0.0628 0.0608 0.0511 0.0500
500 0.0268 0.0259 0.0219 0.0217
1500 0.0141 0.0143 0.0096 0.0097
Table III.(cont)
Precision at Various Points of Recall
2004 2005
Recall All Element Flexible All Element Flexible
0.01 0.2134 0.2115 0.1521 0.1535
0.25 0.1006 0.1070 0.0540 0.0515
0.50 0.0411 0.0394 0.0156 0.0191
0.75 0.0166 0.0159 0.0103 0.0104
1.00 0.0042 0.0044 0.0046 0.0048
avg prec 0.0586 0.0577 0.0318 0.0335