Survey
Jaehui Park
2008. 07. 17.
Copyright 2008 by CEBT
Introduction
Members
– Jung-Yeon Yang, Jaehui Park, Sungchan Park, Jongheum Yeon
We are interested in
– Issues in information retrieval: crawling, indexing, searching, and ranking methods
– How to process multi-term queries in information retrieval environments
Ex) "Today", "US Today", "Today Weather", "Paris Today Weather"
-> Multi-term queries express a more complex information need than single-term queries.
Main Topic
Long Queries in Keyword Search
Keywords:
– Compound query, Evidence Combination, Phrasal Query, Multi-term Query, Multiple Keyword Search, Multiword Unit, and so on.
Issues
– proximity or distance
– syntactic structure (order)
– semantic
– NLP
– remedies
– …
Proximity
An intuitive concept for processing multiple-term queries
Readings
– Term Proximity Scoring for Keyword-Based Retrieval Systems [ECIR 2003], Yves Rasolofo and Jacques Savoy
– Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval [TREC 2005], Stefan Buttcher and Charles L. A. Clarke
– Efficient Text Proximity Search [SPIRE 2007], Ralf Schenkel, et al.
– Why Bigger Windows Are Better Than Smaller Ones [TR-UM 1997], Ron Papka and James Allan
– …
Term Proximity Scoring for Keyword-Based Retrieval Systems
Yves Rasolofo and Jacques Savoy
European Colloquium on IR Research (ECIR) 2003, LNCS 2633
2008. 07. 17.
Presented by Jaehui Park
Introduction
Phrase, term proximity, or term distance in IR
Focus on adding a word-pair scoring module
– Okapi probabilistic model + proximity measurement
Previous work
Salton & McGill [1983]
– Generating statistical phrases based on word co-occurrence
Fagan [1987]
– Considering syntactic relations or syntactic structures
Mitra et al. [1997]
– “Once a good basic ranking scheme is used, the use of phrases do not have a major effect on precision at high ranks”
Arampatzis et al. [2000]
– The lack of success when using NLP techniques in IR
Hawking & Thistlewaite [1996]
– The use of proximity scoring within the PADRE system (Z-mode method)
Okapi
Okapi [Robertson & Sparck Jones 1976]
– A function that ranks documents by their relevance to a given search query, based on the probabilistic retrieval model
– Considering
  – Term frequency
  – Document length
The weight for a given term ti in document d
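The formula shown on this slide is not preserved in the transcript. As a rough illustration only, the sketch below uses the widely cited Okapi BM25 form of the document term weight, which combines term frequency with document length normalization. The function name `okapi_term_weight` and the constants `k1=1.2` and `b=0.75` are assumptions chosen for illustration, not values taken from the slides.

```python
def okapi_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Weight of a term that occurs `tf` times in a document of `doc_len` tokens.

    Assumed BM25-style form: (k1 + 1) * tf / (K + tf), where K grows with
    the document length relative to the collection average.
    """
    # K is the length-normalized saturation constant (assumption: standard BM25).
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (K + tf)


# A document of average length, then a document twice as long.
print(okapi_term_weight(tf=1, doc_len=100, avg_doc_len=100))
print(okapi_term_weight(tf=1, doc_len=200, avg_doc_len=100))
```

With these assumed constants, the same term frequency earns a lower weight in the longer document, and the weight saturates as `tf` grows.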
Okapi
Okapi [Robertson & Sparck Jones 1976] (continued)
The weight for the term ti within a query
The retrieval status value (for a document according to a query)
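The two formulas on this slide are also missing from the transcript. The sketch below shows one common way to compute them, assuming the standard BM25 query term weight (a saturated query term frequency multiplied by a Robertson/Sparck Jones style inverse document frequency) and a retrieval status value that sums the products of document and query weights over the query terms. All function names, the 0.5 smoothing, and the constants `k1`, `b`, and `k3` are assumptions.

```python
import math


def query_term_weight(qtf, df, n_docs, k3=1000.0):
    """Weight of a query term occurring `qtf` times in the query.

    `df` is the number of documents containing the term, `n_docs` the
    collection size. The 0.5 smoothing is the usual BM25 convention.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return (k3 + 1.0) * qtf / (k3 + qtf) * idf


def doc_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Assumed BM25-style document term weight."""
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (K + tf)


def rsv(query_tf, doc_tf, doc_len, avg_doc_len, df, n_docs):
    """Retrieval status value: sum over query terms of doc weight * query weight."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue  # terms absent from the document contribute nothing
        score += (doc_term_weight(tf, doc_len, avg_doc_len)
                  * query_term_weight(qtf, df[term], n_docs))
    return score


score = rsv(
    query_tf={"today": 1, "weather": 1},
    doc_tf={"today": 2, "weather": 1, "paris": 1},
    doc_len=100, avg_doc_len=100,
    df={"today": 500, "weather": 50}, n_docs=10000,
)
print(score)
```

In this form the rarer term ("weather" in the toy statistics) carries the larger query weight, so matching it moves the score more than matching a common term.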
Term Proximity Weighting
Improving retrieval performance by using term proximity scoring
Assumption
– If a document contains sentences that include at least two query terms, the probability that this document is relevant must be greater.
– The closer the query terms are to each other, the higher the relevance probability.
Objective
– Assigning more importance to keywords that have a short distance between their occurrences.
Term Proximity Weighting
1. Expand the request (query) using keyword pairs extracted from the query’s wording
2. Compute a term pair instance weight
– “information retrieval”: 1.0
– “the retrieval of medical information”: 0.11 (1/9)
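The two example values are consistent with a pair instance weight of 1 / d², where d is the distance in word positions between the two terms (d = 1 gives 1.0, d = 3 gives 1/9). The sketch below illustrates steps 1 and 2 under that reading. The function names `query_pairs` and `tpi` are hypothetical, and treating pairs as unordered combinations of distinct query terms is an assumption.

```python
from itertools import combinations


def query_pairs(query_terms):
    """Step 1: expand the query into unordered pairs of distinct keywords."""
    unique_terms = list(dict.fromkeys(query_terms))  # drop repeats, keep order
    return list(combinations(unique_terms, 2))


def tpi(pos_i, pos_j):
    """Step 2: weight of one pair instance, assumed to be 1 / distance^2."""
    distance = abs(pos_i - pos_j)
    if distance == 0:
        raise ValueError("two different terms cannot share a position")
    return 1.0 / (distance * distance)


tokens = "the retrieval of medical information".split()
weight = tpi(tokens.index("retrieval"), tokens.index("information"))
print(query_pairs(["information", "retrieval"]))
print(round(weight, 2))
```

Here "retrieval" and "information" sit three positions apart, so the instance weight is 1/9, matching the 0.11 on the slide; adjacent terms, as in "information retrieval", get the maximum weight of 1.0.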
Term Proximity Weighting
3. Sum all the corresponding term pair instance weights
4. Compute the contribution of all occurring term pairs in the document
5. Compute the final retrieval status value
Experiments
Test Collections
– TREC-8 documents (528,155 docs): Financial Times, Federal Register, Foreign Broadcast Information Service, LA Times
– TREC-9, TREC-10 (1,692,096 docs)
Experiments
Evaluation
[Evaluation results on three slides; tables and figures not preserved in the transcript]
Conclusion
The impact of a new term proximity algorithm on retrieval effectiveness for keyword-based systems was examined.
– It improves the ranking of documents in which query term pairs occur within a given distance constraint.
The term proximity scoring approach
– It improves precision after retrieving a few documents.