Hybrid XML Retrieval Revisited

Jovan Pehcevski
PhD Candidate
School of CS and IT, RMIT University
email: jovanp@cs.rmit.edu.au

J. Pehcevski, December 2004 2

Outline

• Analysis of INEX 2004 CO and VCAS Relevance Assessments

– Original, General and Specific assessments

– Broad and Narrow topics

• Ad-hoc runs using hybrid XML retrieval approach

– CO sub-track: runs explore two retrieval heuristics implementing different approaches to element overlap

– VCAS sub-track: runs explore two retrieval heuristics implementing different combinations of structural conditions and granularity of target elements

• Experiments and Results

• Conclusions

Analysis of INEX 2004 Relevance Assessments

A General element is the least specific highly relevant element containing other highly relevant elements (shown in ellipse)

A Specific element is the most specific highly relevant element contained by other highly relevant elements (shown in triangles)
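
The two definitions above can be read as a simple filter over the set of highly relevant elements of one article. The following is a minimal sketch under that reading, not the procedure actually used for the analysis: elements are assumed to be identified by absolute XPath-like strings, and the function names are hypothetical. Under this reading, a highly relevant element with no highly relevant ancestor or descendant falls into neither group.

```python
def is_ancestor(a, b):
    """True if element path a is a proper ancestor of element path b."""
    return b.startswith(a + "/")


def general_and_specific(highly_relevant):
    """Split highly relevant element paths into General and Specific elements.

    General:  contains other highly relevant elements and has no highly
              relevant ancestor (the least specific one).
    Specific: is contained by other highly relevant elements and has no
              highly relevant descendant (the most specific one).
    """
    paths = set(highly_relevant)
    general, specific = [], []
    for p in paths:
        has_descendant = any(is_ancestor(p, q) for q in paths)
        has_ancestor = any(is_ancestor(q, p) for q in paths)
        if has_descendant and not has_ancestor:
            general.append(p)
        if has_ancestor and not has_descendant:
            specific.append(p)
    return sorted(general), sorted(specific)


# Illustrative (made-up) assessment data for a single article.
assessed = [
    "/article[1]",
    "/article[1]/bdy[1]",
    "/article[1]/bdy[1]/sec[2]",
    "/article[1]/bdy[1]/sec[2]/p[1]",
    "/article[1]/bdy[1]/sec[3]",
]
general, specific = general_and_specific(assessed)
print("General: ", general)
print("Specific:", specific)
```

In this made-up example the whole article is the only General element, while the paragraph and the third section are the Specific elements.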

INEX 2004 CO Analysis…

CO Topic Categories (General assessments)

16 Narrow CO topics (full circles):

– favour more specific elements

9 Broad CO topics (full triangles):

– favour less specific elements

– numbers: 164, 168, 175, 178, 183, 190, 192, 197 and 198

INEX 2004 VCAS Analysis…

VCAS Topic Categories (General assessments)

16 Narrow VCAS topics (full circles):

– favour more specific elements

6 Broad VCAS topics (full triangles):

– favour less specific elements

– numbers: 130, 131, 134, 137, 139 and 150

Hybrid XML Retrieval Approach

• Utilises the best features of Zettair (a compact and fast full-text search engine) and eXist (a native XML database)

• A “fetch and browse” approach, where full articles are first retrieved by Zettair (the fetch phase), and the most specific elements from those articles are then extracted by eXist (the browse phase)

• Additionally uses a retrieval module that identifies and ranks Coherent Retrieval Elements (CREs). A CRE is the most specific ancestor of at least two elements extracted by eXist
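
The CRE definition above can be illustrated with a small sketch. It assumes that the elements extracted by eXist are available as absolute XPath-like strings and that "most specific ancestor of at least two elements" means the lowest common proper ancestor of some pair of extracted elements; the function names are hypothetical and this is not the retrieval module itself.

```python
from itertools import combinations


def steps(path):
    """Split an absolute path such as '/article[1]/bdy[1]' into its steps."""
    return path.strip("/").split("/")


def lowest_common_ancestor(a, b):
    """Most specific element that is a proper ancestor of both a and b."""
    sa, sb = steps(a), steps(b)
    limit = min(len(sa), len(sb)) - 1  # stay strictly above both elements
    common = []
    for x, y in zip(sa[:limit], sb[:limit]):
        if x != y:
            break
        common.append(x)
    return "/" + "/".join(common) if common else None


def coherent_retrieval_elements(extracted):
    """Collect the lowest common ancestors of every pair of extracted elements."""
    cres = set()
    for a, b in combinations(sorted(set(extracted)), 2):
        lca = lowest_common_ancestor(a, b)
        if lca is not None:
            cres.add(lca)
    return sorted(cres)


# Illustrative (made-up) eXist answer list.
answers = [
    "/article[1]/bdy[1]/sec[1]/p[1]",
    "/article[1]/bdy[1]/sec[1]/p[2]",
    "/article[1]/bdy[1]/sec[2]/p[1]",
]
cres = coherent_retrieval_elements(answers)
print(cres)
```

Here the first section is a CRE because it holds two extracted paragraphs, and the body is a CRE because it is the most specific element covering paragraphs from both sections. The pairwise loop is quadratic in the number of extracted elements, which is acceptable for a sketch but not a statement about how the original module is implemented.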

Hybrid XML Retrieval Approach…

• To determine the final ranks of CREs, the retrieval module uses a combination of the following heuristics:

– The number of times a CRE appears in the absolute path of each extracted element in the eXist answer list - more matches (M) or fewer matches (m)

– The length of the absolute path of the CRE, taken from the root element - longer path (P) or shorter path (p)

– The ordering of the XPath sequence in the absolute path of the CRE - nearer to beginning (B) or nearer to end (E)

• For the INEX 2003 test set, MpE yields the best performance, although PME is more suitable for some metrics
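
One way to picture a heuristic combination such as MpE or PME is as a multi-key sort. The sketch below rests on two assumptions that the slides do not state: the order of the letters gives the sort priority, and the B/E heuristic can be modelled by a document-order index supplied by the caller. The `Cre` record and the `rank` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cre:
    path: str        # absolute path of the CRE
    matches: int     # extracted elements whose absolute path contains the CRE
    doc_order: int   # assumed position of the CRE (smaller = nearer to beginning)

    @property
    def path_length(self):
        return len(self.path.strip("/").split("/"))


# Each heuristic letter maps to a sort key; smaller key values rank first.
KEYS = {
    "M": lambda c: -c.matches,      # more matches first
    "m": lambda c: c.matches,       # fewer matches first
    "P": lambda c: -c.path_length,  # longer path first
    "p": lambda c: c.path_length,   # shorter path first
    "B": lambda c: c.doc_order,     # nearer to beginning first
    "E": lambda c: -c.doc_order,    # nearer to end first
}


def rank(cres, combination):
    """Sort CREs by the heuristics named in `combination`, e.g. 'MpE'."""
    return sorted(cres, key=lambda c: tuple(KEYS[h](c) for h in combination))


# Illustrative (made-up) CREs.
candidates = [
    Cre("/article[1]/bdy[1]", matches=3, doc_order=0),
    Cre("/article[1]/bdy[1]/sec[1]", matches=2, doc_order=1),
    Cre("/article[1]/bdy[1]/sec[2]", matches=2, doc_order=2),
]
mpe = [c.path for c in rank(candidates, "MpE")]
pme = [c.path for c in rank(candidates, "PME")]
print("MpE:", mpe)
print("PME:", pme)
```

With these made-up values MpE ranks the body first because it has the most matches, whereas PME ranks the sections first because their paths are longer.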

INEX 2004 CO Runs Description

• Five CO runs

– Zettair, the baseline run

– Hybrid_MpE, the hybrid system using MpE heuristic combination

– Hybrid_MpE_NO, the hybrid system using MpE heuristic and no overlap among the result CREs

– Hybrid_PME, the hybrid system using PME heuristic combination

– Hybrid_PME_NO, the hybrid system using PME heuristic and no overlap among the result CREs
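
The slides say only that the _NO runs contain no overlap among the result CREs, not how the overlap is removed. A minimal sketch of one plausible approach, assumed here purely for illustration, is a greedy pass over the ranked list that drops any element that is an ancestor or descendant of an element already kept; the function names are hypothetical.

```python
def overlaps(a, b):
    """True if the two element paths are equal or one contains the other."""
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def remove_overlap(ranked_paths):
    """Keep each element, in rank order, only if it overlaps nothing kept so far."""
    kept = []
    for path in ranked_paths:
        if not any(overlaps(path, k) for k in kept):
            kept.append(path)
    return kept


# Illustrative (made-up) ranked answer list.
ranked = [
    "/article[1]/bdy[1]",
    "/article[1]/bdy[1]/sec[2]",   # inside an element already kept
    "/article[2]/bdy[1]/sec[1]",
    "/article[1]",                 # contains an element already kept
]
non_overlapping = remove_overlap(ranked)
print(non_overlapping)
```

A greedy pass favours whichever of two overlapping elements is ranked higher, so the outcome depends on the heuristic combination used to produce the ranking.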

INEX 2004 VCAS Runs Description

• Six VCAS runs

– Zettair, the baseline run

– Hybrid_CO_MpE, the hybrid system using MpE heuristic, with unconstrained (CO) queries

– Hybrid_CO_PME, the hybrid system using PME heuristic, with unconstrained (CO) queries

– Hybrid_VCAS_MpE, the hybrid system using MpE, with structural conditions strictly matched and unconstrained target element

– Hybrid_VCAS_PME, the hybrid system using PME, with structural conditions strictly matched and unconstrained target element

– Hybrid_CAS, the initial hybrid system (without the CRE module), with both structural conditions and target element strictly matched

Experiments and Results (CO runs)

1. The non-overlap hybrid runs perform much worse than the corresponding overlap hybrid runs (result of a varying Original recall base)

2. For the overlap hybrid CO runs, the MpE heuristic yields better performance than the PME heuristic, except for s3_e321 and s3_e32 with P@10

3. All the hybrid runs perform better than the baseline run, except for e3_s321 and e3_s32 with P@10, where the baseline run outperforms the two Hybrid_PME runs

Experiments and Results (CO runs)…

General CO retrieval scenario

All and Broad topics:

– Zettair performs best with both MAP and P@10, although with the latter measure the non-overlap hybrid run (MpE_NO) performs the same as Zettair

– The non-overlap hybrid run (MpE_NO) substantially outperforms the overlap hybrid run (MpE)

Narrow topics:

– The overlap hybrid run (MpE) performs best, and the performances of the other two runs are the same

Experiments and Results (VCAS runs)

1. The strict hybrid run (CAS) performs worse than the other hybrid runs. With strict quantisation, both Hybrid_VCAS runs perform better than the Hybrid_CO runs

2. The hybrid runs using MpE heuristic perform better than the same runs using PME heuristic

3. All the hybrid runs perform better than the baseline run, except when using strict quantisation with MAP, where the baseline run outperforms the two PME runs and the CAS run
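
For readers unfamiliar with the measures named in these results, the sketch below shows textbook P@10 and MAP over a binary recall base obtained with strict quantisation, which (as commonly defined for INEX) counts an element as relevant only when both exhaustivity and specificity are 3. The official INEX evaluation tools differ in detail, and the other quantisation functions are not reproduced here; the function names and the sample data are illustrative assumptions.

```python
def strict_quantisation(exhaustivity, specificity):
    """Strict quantisation: only (3, 3) assessments count as relevant."""
    return 1 if exhaustivity == 3 and specificity == 3 else 0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k returned elements that are relevant."""
    return sum(1 for e in ranked[:k] if e in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant element."""
    hits, total = 0, 0.0
    for rank, element in enumerate(ranked, start=1):
        if element in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(topics):
    """MAP over a list of (ranked, relevant) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Illustrative (made-up) assessments: element -> (exhaustivity, specificity).
assessments = {"a": (3, 3), "b": (2, 3), "c": (3, 3), "d": (1, 1)}
relevant = {e for e, (ex, sp) in assessments.items() if strict_quantisation(ex, sp)}
ranked = ["a", "b", "c", "d"]

p_at_2 = precision_at_k(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
print(p_at_2, round(ap, 4))
```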

Experiments and Results (VCAS runs)…

General VCAS retrieval scenario

All, Broad and Narrow topics: – Zettair by far outperforms all the hybrid runs!

– With P@10, the strict hybrid run (CAS) outperforms the other two hybrid runs

– Of the latter two, the hybrid VCAS run overall performs better than the hybrid CO run

Conclusions

• Different runs for each of the CO and VCAS sub-tracks were designed to investigate different aspects of the XML retrieval task.

• The two different cases of INEX 2004 relevance assessments (General and Specific) model different user behaviours; preferred retrieval aspects vary depending on the model used.

• For the CO sub-track and General assessments, the plain full-text search engine and the hybrid system (with non-overlapping result answers) perform best. However, in this case a system should distinguish between different categories of CO topics.

• For the VCAS sub-track and General assessments, using the plain full-text search engine is again a very effective choice. Distinguishing between different VCAS topic categories does not appear to make any difference to performance.

Questions ???

The church of St. Jovan the Divine at Kaneo, Ohrid

Additional slide (efficiency considerations)

• INEX XML document collection

– 12,107 XML articles from the IEEE Computer Society’s publications (12 magazines and 6 transactions), covering the period 1995-2002

– 494 MB in size

• Zettair

– The index takes roughly 26% of the total collection size

– Indexing the entire INEX collection takes around 70 seconds on a system with a Pentium 4 2.66GHz processor and 512MB of RAM, running Mandrake Linux 10.0

• eXist

– The index is roughly twice the total collection size

– Indexing the entire INEX collection takes around 2050 seconds on a system with a Pentium 4 2.6GHz processor and 512MB of RAM, running Mandrake Linux 10.0
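
The figures above can be turned into rough absolute numbers with a few lines of arithmetic. This is only a back-of-the-envelope sketch derived from the rounded values on the slide ("roughly 26%", "roughly twice", "around 70 seconds"), so the results are approximate; the variable names are illustrative.

```python
collection_mb = 494                        # size of the INEX collection

zettair_index_mb = 0.26 * collection_mb    # "roughly 26%" of the collection
exist_index_mb = 2 * collection_mb         # "roughly twice" the collection

zettair_seconds = 70                       # approximate indexing times
exist_seconds = 2050

slowdown = exist_seconds / zettair_seconds
zettair_mb_per_s = collection_mb / zettair_seconds
exist_mb_per_s = collection_mb / exist_seconds

print(f"Zettair index: about {zettair_index_mb:.0f} MB")
print(f"eXist index:   about {exist_index_mb} MB")
print(f"eXist indexing takes about {slowdown:.0f} times longer than Zettair")
print(f"Throughput: {zettair_mb_per_s:.1f} MB/s (Zettair), {exist_mb_per_s:.2f} MB/s (eXist)")
```

On these numbers the Zettair index is about 128 MB against about 988 MB for eXist, and eXist indexing is roughly 29 times slower.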