Thoughts (and Research) on Query Intent
Bruce Croft
Center for Intelligent Information Retrieval
UMass Amherst
Overview
• Query Representation and Understanding Workshop at SIGIR 2010
• Research projects in the CIIR
Observations
• “Query intent” has become a popular phrase at conferences and at companies
• Research with query logs = acceptance of paper
• Few standards in these papers about test collections, metrics, even tasks
• Query processing has been part of IR for a long time
  – e.g., stemming, expansion, relevance feedback
• Most retrieval models say little about queries
• So, what’s going on and what’s interesting?
Terminology
• Query intent (or search intent) is the same thing as information need
  – The notion of an information need or problem underlying a query has been discussed in the IR literature for many years, and it was generally agreed that query intent is another way of referring to the same idea
• Query representation involves modeling the intent or need
  – Query understanding refers to the process of identifying the underlying intent or need based on a particular representation
• Intent classes, intent dimensions, and query classes
  – terms used to talk about the many different types of information needs and problems
Terminology
• Query rewriting, query transformation, query refinement, query alteration, and query reformulation
  – names given to the process of changing the original query to better represent the underlying intent (and consequently improve ranking)
• Query expansion, substitution, reduction, segmentation
  – some of the techniques or steps used in the query transformation process
• Query
  – most research assumes the query is the string entered by the user. Transformation can produce many different representations of the query. The difference between the explicit and implicit query is important
Research Questions
• How to develop a unified and general framework for query understanding?
• How to formally define a query representation?
• How to develop new system architectures for query understanding?
• How to combine query understanding with other components in information retrieval systems?
• How to conduct evaluations of query understanding?
• How to make effective use of both human knowledge and machine learning in query understanding?
Possible Research Tasks
• Long query relevance
• Query reduction
• Similar query finding
• Query classification
• Named entity recognition in queries
• Context-aware search
  – Intent-aware search
Methodology
• Must agree on tasks, evaluation metrics, and text collections
• TREC-style vs. “black-box” evaluations
• Crowdsourcing for annotations
• Resources such as query collections, document collections, query logs, etc. differ widely in their availability in academic and industry settings
Resources
• Document collections – TREC ClueWeb collection preferred
• Query collections – need collections of different query types (e.g. long, location, product…) validated by industry
• Query logs – critical resource for some approaches, not available in academia. Alternatives include MSN/AOL logs, KDD queries, anchor text logs, logs from other applications (Wikipedia), logs from some restricted environment (e.g. academic library)
• N-grams, etc. – corpus and query language statistics from web collections
CIIR Projects
• Modeling structure in queries
• Modeling distributions of queries
• Modeling diversity in queries
• Transforming long queries
• Generating queries from documents
• Generating query logs from anchor text
• Finding similar queries
The Challenge of Query Representation
• User inputs a string of characters
• Query structure is never explicitly observed and is difficult to infer
  – Short and ambiguous search queries
  – Idiosyncratic grammar
  – No capitalization and punctuation
• Examples:
  – “talking to heaven movie”
  – “new york times square”
  – “do grover cleveland have kids”
Structural Query Representation
• A query Q has a hierarchical representation
  – A query is a set of structures: Q = {S_1, …, S_n}
  – Each structure is a set of concepts: S_i = {c_1, c_2, …}
• The hierarchical representation allows one to
  – Model arbitrary term dependencies as concepts
  – Group concepts by structures
  – Assign weights to concepts/structures
Example: structures and concepts for the query “members rock group nirvana”
  – Terms: [members] [rock] [group] [nirvana]
  – Bigrams: [members rock] [rock group] [group nirvana]
  – Chunks: [members] [rock group] [nirvana]
  – Key Concepts: [members] [nirvana]
  – Dependence: [members nirvana] [rock group]
(each line is a structure; each bracketed unit is a concept)
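To make the representation concrete, here is a minimal Python sketch that builds the term and bigram structures for this query; the chunk, key-concept, and dependence structures would come from additional NLP components and are omitted. All names and weights are illustrative, not from the papers.

```python
# A minimal sketch of the hierarchical query representation: a query is a set
# of structures, each structure a set of concepts (tuples of terms).

def terms(query):
    # each single term is a concept
    return [(t,) for t in query.split()]

def bigrams(query):
    # each pair of adjacent terms is a concept
    toks = query.split()
    return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]

query = "members rock group nirvana"
structures = {
    "terms": terms(query),      # [('members',), ('rock',), ('group',), ('nirvana',)]
    "bigrams": bigrams(query),  # [('members', 'rock'), ('rock', 'group'), ...]
}

# weights can then be assigned per structure (and, likewise, per concept)
weights = {name: 1.0 / len(structures) for name in structures}
print(structures, weights)
```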
Encoding Query Structure in a Hypergraph
[Figure: a hypergraph linking the document to structures 1 through n, each structure containing its concepts]
Weighted Sequential Dependence Model (WSD)
• Allow the parameters of the sequential dependence model to depend on the concept
• Assume the parameters take a simple parametric form, λ(c) = Σ_j w_j g_j(c)
  – maintains reasonable model complexity
  – w – free parameters
  – g – concept importance features
[Bendersky, Metzler, and Croft, 2009]
Defining Concept Importance in WSD
• Features g define the concept importance
• Depend on the concept (term/bigram)
• Independent of a specific document/document corpus
• Combine several sources for more accurate weighting
  – Endogenous Features – collection dependent features
  – Exogenous Features – collection independent features
WSD Ranking Function
• Score document D by:
  score(Q, D) = Σ_{c ∈ Q} λ(c) · f(c, D), with λ(c) = Σ_j w_j g_j(c)
  (a toy scoring sketch follows the table)

Example: query “civil war battle reenactments”
(GF … DF are concept importance features)

Concept               GF     …    DF     Weight
civil                 16.9   …    14.1   0.0619
war                   17.9   …    12.8   0.1947
battle                16.6   …    12.6   0.0913
reenactments          10.8   …     9.7   0.3487
civil war             14.5   …    10.8   0.1959
war battle             9.5   …     7.4   0.2458
battle reenactments    7.6   …     4.7   0.0540

• Concept weights may vary even if concept DF is similar
• Good segments do not necessarily predict important concepts
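A minimal Python sketch of this ranking function, assuming toy feature values and made-up weights; the hypothetical f(c, D) here is a Dirichlet-smoothed log-probability standing in for the paper’s matching function, and GF/DF stand in for the actual feature set.

```python
# A sketch of WSD scoring: lambda(c) is a linear combination of concept
# importance features, and the document score sums lambda(c) * f(c, D).
import math

def concept_weight(g, w):
    # lambda(c) = sum_j w_j * g_j(c)
    return sum(w[name] * value for name, value in g.items())

def f(concept, document, mu=2500, p_collection=1e-6):
    # concept-document match: Dirichlet-smoothed log-probability (a stand-in)
    tf = document.count(concept)        # crude substring count, fine for a sketch
    doc_len = len(document.split())
    return math.log((tf + mu * p_collection) / (doc_len + mu))

def wsd_score(concepts, document, w):
    # score(Q, D) = sum over concepts c of lambda(c) * f(c, D)
    return sum(concept_weight(g, w) * f(c, document) for c, g in concepts.items())

# toy slice of the table above, with GF and DF as the only features
concepts = {"civil": {"gf": 16.9, "df": 14.1},
            "civil war": {"gf": 14.5, "df": 10.8}}
w = {"gf": 0.02, "df": -0.01}  # made-up free parameters
print(wsd_score(concepts, "civil war battle reenactments in virginia", w))
```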
TREC Description (Long) Queries
[Figure: average precision (AP) of QL, SD, and WSD on TREC description (long) queries over ROBUST04, WT10G, and GOV2 (y-axis: AP, 0.1 to 0.3); relative improvements of +6.3%, +24.1%, and +1.6% respectively]
Query Representation
• Distribution of Terms (DOT): words + phrases, original or new
  – Relevance Model [Lavrenko and Croft, SIGIR01]
  – Sequential Dependence Model [Metzler and Croft, SIGIR05]
  – Latent Concept Expansion [Metzler and Croft, SIGIR07]
  – Uncertainty in PRF [Collins-Thompson and Callan, SIGIR07]
• Single Reformulated Query (SRQ): a single reformulation operation
  – Query Segmentation [Bergsma and Wang, EMNLP-CoNLL07] [Tan and Peng, WWW08]
  – Query Substitution [Jones et al, WWW06] [Wang and Zhai, CIKM08]
• DOT does not consider how these terms are fitted into actual queries, thus missing the dependencies between them
• SRQ does not consider combining with other operations, thus missing information about alternative reformulations
• Distribution of Queries (DOQ): each query is the output of applying single or multiple reformulation operations
Example
Original TREC Query: oil industry history
• Distribution of Terms (DOT)
  – Relevance Model: { 0.44 “industry”, 0.28 “oil”, 0.08 “petroleum”, 0.08 “gas”, 0.08 “county”, 0.04 “history”, … }
  – Sequential Dependence Model [Metzler, SIGIR05]: { 0.28 “oil”, 0.28 “industry”, 0.28 “history”, 0.08 “oil industry”, 0.08 “industry history”, … }
• Single Reformulated Query (SRQ)
  – Query Segmentation: “(oil industry)(history)”
  – Query Substitution: “petroleum industry history”
• Distribution of Queries (DOQ): { 0.28 “(oil industry)(history)”, 0.24 “(petroleum industry)(history)”, 0.20 “(oil and gas industry)(history)”, 0.18 “(oil)(industrialized)(history)”, … } (a scoring sketch follows)
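One natural way to use a DOQ at ranking time, sketched below under simple assumptions: score a document by the expectation of a base retrieval score over the reformulated queries. The toy scorer here is a stand-in for a query-likelihood style model.

```python
# A minimal sketch of ranking with a Distribution of Queries (DOQ):
# document score = sum over reformulations q of P(q) * score(q, D).

def base_score(query, document):
    # toy stand-in scorer: fraction of query terms appearing in the document
    terms = query.replace("(", " ").replace(")", " ").split()
    return sum(t in document for t in terms) / len(terms)

def doq_score(doq, document):
    # expected base score under the query distribution
    return sum(p * base_score(q, document) for q, p in doq.items())

doq = {
    "(oil industry)(history)": 0.28,
    "(petroleum industry)(history)": 0.24,
    "(oil and gas industry)(history)": 0.20,
    "(oil)(industrialized)(history)": 0.18,
}
doc = "a history of the petroleum industry in texas"
print(doq_score(doq, doc))
```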
Application I
• Reducing Long Queries [Xue, Huston, and Croft, CIKM2010]
  – A novel CRF-based model learns a distribution of subset queries, which directly optimizes retrieval performance
  – (1): retrieve using the top 1 subset query; (K): retrieve using the top K subset queries (a subset-query sketch follows)
[Table: query reduction results; q, d indicate significantly better than QL and DM]
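The CRF model itself is beyond a short sketch, but the surrounding machinery is simple: enumerate subset queries of the long query, score them with some learned model (stubbed here with an illustrative heuristic), and retrieve with the top 1 or top K.

```python
# A minimal sketch of subset-query generation and top-K selection for query
# reduction. `score` is a stand-in for a learned P(subset query | query).
from itertools import combinations

def subset_queries(query, min_len=1):
    # all order-preserving subsets of the query terms
    terms = query.split()
    for k in range(min_len, len(terms) + 1):
        for combo in combinations(terms, k):
            yield " ".join(combo)

def top_k_subsets(query, score, k=5):
    return sorted(subset_queries(query), key=score, reverse=True)[:k]

long_query = "identify any efforts proposed or undertaken by world governments"
stub_score = lambda q: -abs(len(q.split()) - 3)  # toy: prefer ~3-term subsets
print(top_k_subsets(long_query, stub_score, k=3))
```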
Application II: Substitution
• A context of a word is the unigram preceding it
• Context distribution:
  P(c_i | w) = count(c_i) / Σ_{c_j ∈ C(w)} count(c_j)
  – the probability that the term c_i appears in w’s context
• The translation model:
  t(w → s) = e^{-D(P(·|w) || P(·|s))} / Z
  – D is the KL divergence between the context distributions of w and s
• The substitution model, for Q = q_1, …, q_{i-2}, q_{i-1}, q_i, q_{i+1}, q_{i+2}, …, q_n and candidate s (with w = q_i):
  P(w → s) = t(w → s) · P(q_{i-2} q_{i-1} _ q_{i+1} q_{i+2} | s)
  – how well the new term fits the context of the current query (a code sketch follows)
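A minimal Python sketch of the context-distribution and translation pieces, with a toy token stream standing in for query-log or anchor-log data; the full substitution model would further multiply by how well the candidate fits the surrounding query context.

```python
# Context distribution P(c | w) from preceding-word counts, and a translation
# score t(w -> s) proportional to exp(-KL(P(.|w) || P(.|s))).
import math
from collections import Counter, defaultdict

def context_distributions(tokens):
    # P(c | w): distribution of the unigram immediately preceding w
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[word][prev] += 1
    return {w: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for w, ctr in counts.items()}

def kl(p, q, eps=1e-9):
    # KL divergence over the union of contexts, with simple smoothing
    support = set(p) | set(q)
    return sum(p.get(c, eps) * math.log(p.get(c, eps) / q.get(c, eps))
               for c in support)

def translation(w, s, dists):
    # Z is omitted: we only compare candidates s for the same word w
    return math.exp(-kl(dists[w], dists[s]))

corpus = "cheap airfare cheap flight book cheap airfare find cheap flight".split()
dists = context_distributions(corpus)
print(translation("airfare", "flight", dists))
```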
Query Expansion and Stemming
• Probabilities are estimated from a corpus or query log
  – using text passages is nearly the same as pseudo relevance feedback
• Query expansion is similar to substitution: we add the new term and keep the original term
  – substitution: “cheap airfare” → “cheap flight”
  – expansion: “cheap airfare” → “cheap airfare flight”
• Stemming: new terms are restricted to Porter-stemmed root terms (sketched below)
  – “drive direction” → “drive driving direction”
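A minimal sketch of the stemming case, assuming NLTK’s PorterStemmer is available: vocabulary terms that share a Porter stem with a query term are added next to the original term, expansion-style. The vocabulary is an illustrative stand-in for corpus or query-log statistics.

```python
# Stemming-based expansion: keep the original query term and add variants
# that reduce to the same Porter stem.
from nltk.stem import PorterStemmer

def stem_expand(query, vocabulary):
    stemmer = PorterStemmer()
    terms = query.split()
    stems = {stemmer.stem(t): t for t in terms}  # stem -> original query term
    expanded = list(terms)
    for v in vocabulary:
        root = stemmer.stem(v)
        if root in stems and v != stems[root]:
            # insert the variant right after the original term
            expanded.insert(expanded.index(stems[root]) + 1, v)
    return " ".join(expanded)

vocabulary = ["driving", "drives", "directions", "airfare"]
print(stem_expand("drive direction", vocabulary))
# -> "drive drives driving direction directions"
```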
The Anchor Log
• Extract <anchor, url> pairs from the Gov-2 collection to create the anchor log [Dang and Croft, 2009]
• The anchor log is very noisy
  – “click here”, “print version”, … don’t represent the linked page
• Anchor text gives comparable performance to the MSN log for substitution, expansion, and stemming (an extraction sketch follows the table)

                    MSN Log      Anchor Log
# Total Queries     14 million   526 million
# Unique Queries    6 million    20 million
Avg. Query Length   2.68         2.62
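A minimal sketch of anchor-log extraction using only the Python standard library: pull <anchor text, url> pairs from HTML and filter the obvious navigational noise mentioned above. The noise list is illustrative, not the filter used in the paper.

```python
# Extract <anchor text, url> pairs from HTML to build an anchor log.
from html.parser import HTMLParser

NOISE = {"click here", "print version", "home", "more"}  # illustrative filter

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        text = data.strip().lower()
        if self.href and text and text not in NOISE:
            self.pairs.append((text, self.href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None

parser = AnchorExtractor()
parser.feed('<a href="http://nps.gov">civil war battlefields</a> '
            '<a href="/x">click here</a>')
print(parser.pairs)  # [('civil war battlefields', 'http://nps.gov')]
```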
Learning to Rank Reformulations
[Dang, Bendersky, and Croft, 2010]
Using Query Distributions
• Reformulating Short Queries [Xue et al, CIKM2010]
  – Passage information is used to generate candidate queries and estimate probabilities
[Table: Gov2 results; o, w, m, a represent different methods to generate candidate queries; q, d, r indicate significantly better than QL, SDM, and RM]

Example Query Reformulations using Passages
Conclusions
• Studying query intent is not new, but more data is leading to many new insights
• Not just a web search issue, but more obvious in web search
• Lots of interesting research to do, but the field needs more coherence in terms of research goals and testbeds