
Learning to Estimate Query Difficulty
Including Applications to Missing Content Detection and Distributed Information Retrieval

Elad Yom-Tov, Shai Fine, David Carmel, Adam Darlow
IBM Haifa Research Labs
SIGIR 2005



Page 2: Abstract

- Novel learning methods are used for estimating the quality of results returned by a search engine in response to a query.
- Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries.
- Quality estimates are useful for several applications, including improvement of retrieval, detection of queries for which no relevant content exists in the document collection, and distributed information retrieval.

Page 3: Introduction (1/2)

- Many IR systems suffer from radical variance in performance across queries.
- Estimating query difficulty is an attempt to quantify the quality of the results returned by a given system for a query.
- Reasons for query difficulty estimation:
  - Feedback to the user: the user can rephrase "difficult" queries.
  - Feedback to the search engine: to invoke alternative strategies for different queries.
  - Feedback to the system administrator: to identify queries related to a specific subject and expand the document collection accordingly.
  - For distributed information retrieval.

Page 4: Introduction (2/2)

- The observation and motivation: queries that are answered well are those whose query terms agree on most of the returned documents.
  - Agreement is measured by the overlap between the top results.
- Difficult queries are those where:
  A. the query terms cannot agree on the top results, or
  B. most of the terms agree except for a few outliers.
- A TREC query as an example: "What impact has the chunnel (underwater tunnel) had on the British economy and/or the life style of the British?"

Page 5: Related Work (1/2)

- In the Robust track of TREC 2004, systems were asked to rank the topics by predicted difficulty.
  - The goal is eventually to use such predictions for topic-specific processing.
- Prediction methods suggested by the participants:
  - Measuring clarity based on the system's scores of the top results
  - Analyzing the ambiguity of the query terms
  - Learning a predictor using old TREC topics as training data
- (Ounis, 2004) showed that an IDF-based predictor is positively correlated with query precision.
- (Diaz, 2004) used the temporal distribution together with the content of the documents to improve the prediction of average precision (AP) for a query.

Page 6: Related Work (2/2)

- The Reliable Information Access (RIA) workshop investigated the reasons for system performance variance across queries.
  - 10 failure categories were identified.
  - 4 of them are due to emphasizing only partial aspects of the query.
- One of the conclusions of this workshop: "...comparing a full topic ranking against ranking based on only one aspect of the topic will give a measure of the importance of that aspect to the retrieved set."

Page 7: Estimating Query Difficulty

- Query terms are defined as the keywords and the lexical affinities; each such term serves as a sub-query.
- Features used for learning (see the sketch below):
  - The overlap between the top results of each sub-query and those of the full query, measured by the κ-statistic
  - The rounded logarithm of the document frequency, log(DF), of each sub-query
- Two challenges for learning:
  - The number of sub-queries is not constant, so a canonical representation is needed.
  - The sub-queries are not ordered.
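
The slides do not spell out the computation, so the following is a minimal sketch, not the authors' code, of the two per-sub-query features; the chance-agreement term of the κ-statistic and the names (kappa_overlap, collection_size) are assumptions:

```python
import math

# Hedged sketch: a chance-corrected (kappa-style) overlap between the
# top-N results of the full query and of one sub-query, plus the rounded
# log document frequency. The expected-agreement term below is one
# plausible reading, not taken from the paper.
def kappa_overlap(full_top, sub_top, collection_size):
    n = len(full_top)
    observed = len(set(full_top) & set(sub_top)) / n
    expected = n / collection_size  # agreement two random top-N lists would show
    return (observed - expected) / (1.0 - expected)

def log_df(document_frequency):
    # Rounded logarithm of the sub-query's document frequency.
    return round(math.log10(max(document_frequency, 1)))
```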

Page 8: Query Estimator Using a Histogram (1/2)

- The basic procedure:
  1) Find the top N results for the full query and for each sub-query.
  2) Build a histogram of the overlaps, h(i, j), to form a feature vector.
     - Values of log(DF) are split into 3 discrete values: {0-1, 2-3, 4+}.
     - h(i, j) counts the sub-queries with log(DF) = i and overlap = j.
     - The rows of h(i, j) are concatenated into one feature vector.
  3) Compute the linear weight vector c for prediction.
- An example: suppose a query has 4 sub-queries with
  log(DF(n)) = [0 1 1 2] and overlap = [2 0 0 1]
  → h(i) = [0 0 1 2 0 0 0 1 0]
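
A minimal sketch of step 2 that reproduces the slide's example; the function name and the 3×3 bin counts are illustrative assumptions (here the log(DF) values serve directly as row indices):

```python
import numpy as np

def histogram_features(log_dfs, overlaps, n_df_bins=3, n_overlap_bins=3):
    # h[i, j] counts the sub-queries with log(DF) = i and overlap = j.
    h = np.zeros((n_df_bins, n_overlap_bins), dtype=int)
    for df, ov in zip(log_dfs, overlaps):
        h[df, ov] += 1
    return h.ravel()  # concatenate the rows into one feature vector

# The slide's example: 4 sub-queries.
print(histogram_features([0, 1, 1, 2], [2, 0, 0, 1]))
# -> [0 0 1 2 0 0 0 1 0]
```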

Page 9: Query Estimator Using a Histogram (2/2)

- Two additional features:
  1) The score of the top-ranked document
  2) The number of words in the query
- Estimate the linear weight vector c (Moore-Penrose pseudo-inverse):

  c = (H · H^T)^(-1) · H · t^T

  - H = the matrix whose columns are the feature vectors of the training queries
  - t = a vector of the target measure (P@10 or MAP) of the training queries
  - (H and t can be modified according to the objective)
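
In code, the least-squares solution above is a one-liner; a minimal sketch assuming H is a NumPy array with one column per training query and t a vector of targets:

```python
import numpy as np

def train_linear_estimator(H, t):
    # c = (H H^T)^(-1) H t^T; pinv also covers the rank-deficient case.
    return np.linalg.pinv(H @ H.T) @ H @ t

def predict_difficulty(c, feature_vector):
    # The estimate is the dot product of the weights with the query's features.
    return float(c @ feature_vector)
```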

Page 10: Query Estimator Using a Modified Decision Tree (1/2)

- Useful under sparseness, i.e. when queries are too short.
- A binary decision tree.
- The pairs of overlap and log(DF) of the sub-queries form the features.
- Each node consists of a weight vector, a threshold, and a score.
- An example: (the slide's tree figure is not preserved in this transcript; see the sketch below)
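
A minimal sketch, an assumption about the mechanics rather than the authors' code, of how such a tree might score a query: each sub-query's (overlap, log(DF)) pair descends the tree by thresholded projections onto the node's weight vector, and the query's estimate averages the scores it reaches:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    weights: tuple[float, float]   # weight vector over (overlap, log_df)
    threshold: float
    score: float                   # returned when the descent stops here
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def node_score(node, overlap, log_df):
    projection = node.weights[0] * overlap + node.weights[1] * log_df
    child = node.right if projection > node.threshold else node.left
    return node.score if child is None else node_score(child, overlap, log_df)

def estimate_difficulty(root, pairs):
    # Average the leaf scores over all (overlap, log_df) sub-query pairs.
    return sum(node_score(root, ov, df) for ov, df in pairs) / len(pairs)
```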

Page 11: Query Estimator Using a Modified Decision Tree (2/2)

- The concept of a Random Forest: better decision trees can be obtained by training a multitude of trees, each in a slightly different manner or using different data.
- The AdaBoost algorithm is applied to resample the training data (see the sketch below).
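
A minimal sketch of the ensemble idea; plain bootstrap resampling stands in here for AdaBoost's weighted resampling, and train_tree() is a hypothetical stand-in for the modified decision-tree learner of the previous slide:

```python
import random

def train_forest(train_tree, queries, targets, n_trees=50, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Each tree sees a different resampled view of the training data.
        idx = [rng.randrange(len(queries)) for _ in queries]
        forest.append(train_tree([queries[i] for i in idx],
                                 [targets[i] for i in idx]))
    return forest

def forest_predict(forest, query_pairs):
    # Average the per-tree difficulty estimates (estimate_difficulty above).
    return sum(estimate_difficulty(t, query_pairs) for t in forest) / len(forest)
```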

Page 12: Experiment and Evaluation (1/2)

- The IR system is Juru.
- Two document collections:
  - TREC-8: 528,155 documents, 200 topics
  - WT10G: 1,692,096 documents, 100 topics
- Four-fold cross-validation
- Prediction quality is measured by Kendall's τ coefficient (see the sketch below).

Page 13: Experiment and Evaluation (2/2)

- Compared with several other algorithms:
  - Estimation based on the score of the top result
  - Estimation based on the average score of the top ten results
  - Estimation based on the standard deviation of the IDF values of the query terms
  - Estimation based on learning an SVM for regression

Page 14: Application 1: Improving IR Using Query Estimation (1/2)

- Selective automatic query expansion:
  1. Terms are added to the query based on frequently appearing terms in the top retrieved documents.
  2. This only works for easy queries.
  3. The same features are used to train an SVM classifier that decides when to expand.
- Deciding which part of the topic should be used:
  1. TREC topics contain two parts: a short title and a longer description.
  2. Some topics that are not answered well by the description part are better answered by the title part.
  3. Difficult topics use the title part; easy topics use the description (see the sketch below).
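
A minimal sketch of the two selective policies; classify_easy() stands in for the SVM classifier, frequent_terms() is a hypothetical helper, and the precision threshold is an illustrative assumption:

```python
def choose_topic_part(topic, predicted_precision, threshold=0.5):
    # The estimator predicts expected precision: low predictions mean a
    # difficult topic, which is served better by the short title.
    return topic.title if predicted_precision < threshold else topic.description

def maybe_expand(query, top_documents, classify_easy, frequent_terms):
    # Expansion helps only the queries the classifier labels as easy.
    if classify_easy(query):
        query.terms.extend(frequent_terms(top_documents))
    return query
```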

Page 15: Application 1: Improving IR Using Query Estimation (2/2)

(Results figure; not preserved in this transcript.)

Page 16: Application 2: Detecting Missing Content (1/2)

- Missing content queries (MCQs) are those that have no relevant document in the collection.
- Experiment method:
  - 166 MCQs were created artificially from 400 TREC queries (the 200 TREC topics each contribute a title query and a description query).
  - Ten-fold cross-validation
  - A tree-based classifier is trained to separate MCQs from non-MCQs.
  - A query difficulty estimator may or may not be used as a pre-filter that removes easy queries before the MCQ classifier (see the sketch below).
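
A minimal sketch of the pipeline with the optional pre-filter; the threshold and both models are illustrative assumptions:

```python
def detect_mcq(query, difficulty_estimator, mcq_classifier,
               use_prefilter=True, precision_threshold=0.5):
    if use_prefilter:
        # Queries predicted to be answered well are unlikely to be MCQs,
        # so only the difficult ones reach the MCQ classifier.
        if difficulty_estimator(query) >= precision_threshold:
            return False
    return mcq_classifier(query)  # tree-based MCQ vs. non-MCQ decision
```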

Page 17: Application 2: Detecting Missing Content (2/2)

(Results figure; not preserved in this transcript.)

Page 18: Application 3: Merging the Results of Distributed Retrieval (1/2)

- It is difficult to rerank documents coming from different datasets, since the scores are local to each specific dataset.
- CORI (W. Croft, 1995) is one of the state-of-the-art algorithms for distributed retrieval; it uses an inference network for collection ranking.
- Applying the estimator to this problem (see the sketch below):
  - A query estimator is trained for each dataset.
  - The estimated difficulty is used for weighting the scores.
  - The weighted scores are merged to build the final ranking.
  - Ten-fold cross-validation
  - Only minimal information is supplied by the search engine.
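
A minimal sketch of the weighted merge; the data shapes are illustrative assumptions:

```python
def merge_results(per_dataset_results, per_dataset_estimates):
    """per_dataset_results: {name: [(doc_id, local_score), ...]}
    per_dataset_estimates: {name: predicted precision for this query}."""
    merged = []
    for name, results in per_dataset_results.items():
        weight = per_dataset_estimates[name]  # trust better-performing datasets more
        merged.extend((doc_id, weight * score) for doc_id, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```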

Page 19: Application 3: Merging the Results of Distributed Retrieval (2/2)

- Selective weighting (see the sketch below):
  - All queries are clustered (2-means) based on their difficulty estimates for each of the datasets.
  - In one cluster the variance of the estimates is small; unweighted scores are better for queries in that cluster.
  - The difficulty estimates become noise when there is little variance among them.
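
A minimal sketch of the selective-weighting rule; the hand-rolled 2-means keeps it self-contained, and treating the higher-variance cluster as the one to weight is an assumption consistent with the slide:

```python
import numpy as np

def two_means(points, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)].copy()
    for _ in range(iterations):
        dists = np.stack([np.linalg.norm(points - c, axis=1) for c in centers])
        labels = dists.argmin(axis=0)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def weighted_query_mask(estimates):
    """estimates: (n_queries, n_datasets) array of difficulty estimates.
    Returns True for queries whose estimates vary enough to be trusted."""
    labels = two_means(estimates)
    spread = [estimates[labels == k].var(axis=1).mean() for k in (0, 1)]
    return labels == int(np.argmax(spread))
```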

Page 20: Conclusions and Future Work

- Two methods for learning an estimator of query difficulty were described.
- The learned estimator predicts the expected precision of a query by analyzing the overlap between the results of the full query and the results of its sub-queries.
- We show that such an estimator can be used for several applications.
- Our results show that the quality of query prediction strongly depends on the query length.
- One direction for future work is to look for additional features that do not depend on the query length.
- Whether more training data can be accumulated in an automatic or semi-automatic manner is left for future research.