
Dynamic Element Retrieval in a Structured Environment

Crouch, Carolyn J.

University of Minnesota Duluth, MN

October 1, 2006

Key Problems

Retrieval of elements at the desired level of granularity

Assigning a rank order to each element that reflects its perceived relevance to the query

Retrieval Environment

Vector Space Model

INEX Environment

Flexible Retrieval

Vector Space Model

Document Indexing

Term Weighting

Similarity Coefficients
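A minimal sketch of the similarity-coefficient step (the cosine measure is used here as one example; the term weights are purely illustrative and could come from any weighting scheme, such as the Lnu-ltu scheme introduced later):

    # Minimal sketch of a vector space similarity coefficient (cosine measure).
    # The vectors are hypothetical term -> weight dictionaries.
    import math

    def cosine_similarity(query_vec, doc_vec):
        # Inner product over the terms common to both vectors.
        dot = sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)
        q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
        d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
        if q_norm == 0 or d_norm == 0:
            return 0.0
        return dot / (q_norm * d_norm)

    query = {"xml": 0.7, "retrieval": 0.5}
    document = {"xml": 0.4, "retrieval": 0.6, "index": 0.3}
    print(cosine_similarity(query, document))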

INEX - Initiative for the Evaluation of XML Retrieval

INEX provides an environment for experiments in structured retrieval

Traditionally contains two types of topics: CO and CAS

Both INEX 2004 and 2005 utilize an evaluation measure known as inex-eval, based on recall (the proportion of relevant information retrieved) and precision (the proportion of retrieved items that are relevant)

Flexible Retrieval System

System processes XML documents in Smart format (Salton's Magic Automatic Retriever of Text)

Lnu-ltu term weighting

A Method for Flexible Retrieval

Input to Flexible Retrieval

Construction of the Document Tree

Ranking of Elements

Output of Flexible Retrieval

Input to Flexible Retrieval

Preorder traversal

Ranked terminal leaf nodes (paragraphs)

Generate document tree (schema and paragraphs)

Document Tree

Construction of the Document Tree

Schema determines the document tree

Calculate Lnu-ltu term weights
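A minimal sketch of this step, assuming a hypothetical Node class built from the document schema: paragraph (leaf) term frequencies are aggregated upward so that every inner element receives a vector covering the text it contains, with Lnu-ltu weights applied afterwards:

    # Minimal sketch of building element vectors from a document schema and its
    # paragraph (leaf) term frequencies. Node and field names are hypothetical.
    from collections import Counter

    class Node:
        def __init__(self, tag, children=None, term_freqs=None):
            self.tag = tag                               # e.g. "article", "sec", "p"
            self.children = children or []               # inner structure from the schema
            self.term_freqs = Counter(term_freqs or {})  # populated for leaves

    def build_element_vectors(node):
        """Aggregate leaf term frequencies upward so every inner node
        (element) gets a vector covering the text it contains."""
        for child in node.children:
            build_element_vectors(child)
            node.term_freqs.update(child.term_freqs)
        return node.term_freqs

    # A small document: article -> section -> two paragraphs.
    doc = Node("article", [
        Node("sec", [
            Node("p", term_freqs={"xml": 2, "retrieval": 1}),
            Node("p", term_freqs={"element": 1, "retrieval": 2}),
        ]),
    ])
    build_element_vectors(doc)
    print(doc.term_freqs)  # Counter({'retrieval': 3, 'xml': 2, 'element': 1})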

Ranking of Elements

Address ranking issue’s with Lnu-ltu term weighting

Length and normalization issue’s Pivot and slope

Simple structured document

Lnu (weight of element vector terms):

Lnu = [(1 + log(term frequency)) ÷ (1 + log(average term frequency))]
      ÷ [(1 − slope) + slope × ((number of unique terms) ÷ pivot)]

Ltu (weight of query terms):

Ltu = [(1 + log(term frequency)) × log(N ÷ nk)]
      ÷ [(1 − slope) + slope × ((number of unique terms) ÷ pivot)]

where N is the number of elements in the collection and nk is the number of elements containing term k
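A minimal sketch of the two weighting formulas above; the slope and pivot values shown are placeholders, not the settings used in the experiments reported later:

    # Minimal sketch of the Lnu and Ltu weights as given on the preceding slides.
    # Slope, pivot, and the example counts are illustrative only.
    import math

    def lnu_weight(tf, avg_tf, n_unique_terms, slope=0.2, pivot=80.0):
        """Lnu weight for a term in an element vector."""
        numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf))
        normalization = (1 - slope) + slope * (n_unique_terms / pivot)
        return numerator / normalization

    def ltu_weight(tf, N, n_k, n_unique_terms, slope=0.2, pivot=80.0):
        """Ltu weight for a query term; N = collection size, n_k = number of
        elements containing term k."""
        numerator = (1 + math.log(tf)) * math.log(N / n_k)
        normalization = (1 - slope) + slope * (n_unique_terms / pivot)
        return numerator / normalization

    # Example with illustrative counts (N taken from the 2004 element count in Table I).
    print(lnu_weight(tf=2, avg_tf=1.5, n_unique_terms=50))
    print(ltu_weight(tf=1, N=1_188_828, n_k=5_000, n_unique_terms=3))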

Overview of flexible retrieval

1. Parse to extract leaf nodes from the original XML documents

2. Index leaf nodes and queries using Smart

3. Perform Smart retrieval to get highly correlated leaf nodes

Overview of flexible retrieval (cont.)

4. For each document containing a retrieved leaf node

   a. Get its document schema

   b. Generate vector representations for inner nodes (elements)

5. For each term in the query

   a. Get its inverted file entry and corresponding xpaths

   b. Find nk at all levels
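A minimal sketch of step 5, assuming a simplified inverted-file entry that lists the xpaths of the leaf nodes containing a term; nk at each level is then the number of distinct enclosing elements with that tag:

    # Minimal sketch of computing n_k (the number of elements containing term k)
    # at each structural level. The posting format is a hypothetical simplification.
    from collections import defaultdict

    def nk_per_level(postings):
        """postings: xpaths of leaf nodes containing the term, e.g.
        '/article[1]/sec[2]/p[3]'. Returns {level_tag: number of distinct
        elements at that level containing the term}."""
        elements_at_level = defaultdict(set)
        for xpath in postings:
            steps = xpath.strip("/").split("/")
            for depth in range(1, len(steps) + 1):
                prefix = "/" + "/".join(steps[:depth])  # the enclosing element
                tag = steps[depth - 1].split("[")[0]    # e.g. 'sec'
                elements_at_level[tag].add(prefix)
        return {tag: len(paths) for tag, paths in elements_at_level.items()}

    postings = ["/article[1]/sec[1]/p[1]", "/article[1]/sec[2]/p[1]",
                "/article[2]/sec[1]/p[2]"]
    print(nk_per_level(postings))  # {'article': 2, 'sec': 3, 'p': 3}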

Output of Flexible Retrieval

Equivalent to all-element index

Experiments in flexible retrieval

Factors of interest

Experiments and results

Factors of interest

Slope and pivot used in Lnu-ltu term weighting

The value of n (number of paragraphs)

Experiments and Results

Attendant file sizes (dictionary, inverted index, and element vectors) reduced by 60%, 50%, and 50%, respectively

30%-40% less storage than the all-element index

Is dynamic element retrieval cost effective?

Conclusion

Similar work (Grabs and Schek)

Exhaustivity dependent

Progress in specificity

Researchers

Grabs and Schek (similar work to flexible retrieval)

Govert et al. (term weights are multiplied by a collection-dependent augmentation factor as they are propagated up the document tree; see the sketch below)

Mass et al. (maintain separate indices for elements at different levels of granularity, which solves the issue of distorted statistics)
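A minimal sketch of the augmentation-factor idea attributed to Govert et al.; the factor value, the function name, and the per-level application shown here are assumptions, not details from their work:

    # Minimal sketch, assuming the augmentation factor is applied once per level
    # as leaf term weights are propagated up the document tree. The 0.6 value
    # is illustrative only.
    def propagate_weights(leaf_weights, levels_up, augmentation=0.6):
        """Return the leaf's term weights as seen by an ancestor element
        levels_up steps above the leaf."""
        factor = augmentation ** levels_up
        return {term: w * factor for term, w in leaf_weights.items()}

    leaf = {"xml": 0.8, "retrieval": 0.5}
    print(propagate_weights(leaf, levels_up=2))  # each weight scaled by 0.36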

Overview of flexible retrieval (cont.)

6. Correlate element vectors at each level with the query

7. Return ranked list of elements
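A minimal sketch of steps 6 and 7, assuming element and query vectors are simple term-to-weight dictionaries (e.g., Lnu-weighted elements and an ltu-weighted query) and using the inner product as the correlation:

    # Minimal sketch of steps 6-7: correlate each generated element vector with
    # the query and return a ranked list. Vector contents are hypothetical.
    def rank_elements(query_vec, element_vecs):
        """element_vecs maps an element's xpath to its term->weight vector.
        Returns (xpath, score) pairs sorted by decreasing inner product."""
        scores = {
            xpath: sum(w * vec.get(term, 0.0) for term, w in query_vec.items())
            for xpath, vec in element_vecs.items()
        }
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    query = {"xml": 1.2, "retrieval": 0.8}
    elements = {
        "/article[1]": {"xml": 0.3, "retrieval": 0.4, "index": 0.2},
        "/article[1]/sec[2]": {"xml": 0.9, "retrieval": 0.1},
    }
    print(rank_elements(query, elements))  # the section outranks the full article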

Table I

                 INEX 2004           INEX 2005
articles            12,107              16,440
sections            69,577              94,421
subsections         77,397             104,746
paragraphs       1,029,747           1,378,202
elements         1,188,828           1,593,809
CO Topics        40 (34 assessed)    40 (29 assessed)

Table II. Comparison of All-Element and Flexible Retrieval under Inex-Eval (Generalized)

Precision at Rank

               2004                      2005
Rank    All-Element   Flexible    All-Element   Flexible
   1       0.3897      0.3971        0.4224      0.4224
   5       0.3088      0.2882        0.3241      0.3413
  10       0.2735      0.2669        0.2991      0.2991
  20       0.2529      0.2390        0.2841      0.2939
  25       0.2456      0.2379        0.2669      0.2800
  50       0.2000      0.1972        0.2364      0.2366
 100       0.1523      0.1501        0.1921      0.1920
 500       0.0697      0.0697        0.0943      0.0949
1500       0.0353      0.0362        0.0472      0.0483

Table II. (cont.)

Precision at Various Points of Recall

                2004                      2005
Recall   All-Element   Flexible    All-Element   Flexible
0.01        0.3395      0.3348        0.3562      0.3693
0.25        0.0971      0.0951        0.1131      0.1165
0.50        0.0257      0.0283        0.0385      0.0404
0.75        0.0017      0.0017        0.0097      0.0095
1.00        0.0013      0.0013        0.0015      0.0015
avg prec    0.0625      0.0620        0.0739      0.0750

Table III. Comparison of All-Element and Flexible Retrieval under Inex-Eval (Strict)

Precision at Rank

               2004                      2005
Rank    All-Element   Flexible    All-Element   Flexible
   1       0.2000      0.2000        0.1481      0.1481
   5       0.1440      0.1200        0.0667      0.0741
  10       0.1240      0.1200        0.0852      0.0778
  20       0.1120      0.1020        0.0815      0.0815
  25       0.1024      0.0992        0.0800      0.0830
  50       0.0898      0.0832        0.0689      0.0681
 100       0.0628      0.0608        0.0511      0.0500
 500       0.0268      0.0259        0.0219      0.0217
1500       0.0141      0.0143        0.0096      0.0097

Table III. (cont.)

Precision at Various Points of Recall

                2004                      2005
Recall   All-Element   Flexible    All-Element   Flexible
0.01        0.2134      0.2115        0.1521      0.1535
0.25        0.1006      0.1070        0.0540      0.0515
0.50        0.0411      0.0394        0.0156      0.0191
0.75        0.0166      0.0159        0.0103      0.0104
1.00        0.0042      0.0044        0.0046      0.0048
avg prec    0.0586      0.0577        0.0318      0.0335