Thoughts (and Research) on Query Intent
Bruce Croft
Center for Intelligent Information Retrieval
UMass Amherst
Overview
• Query Representation and Understanding Workshop at SIGIR 2010
• Research projects in the CIIR
Observations
• “Query intent” has become a popular phrase at conferences and at companies
• Research with query logs = acceptance of paper
• Few standards in these papers about test collections, metrics, even tasks
• Query processing has been part of IR for a long time
  – e.g., stemming, expansion, relevance feedback
• Most retrieval models say little about queries
• So, what’s going on and what’s interesting?
Terminology
• Query intent (or search intent) is the same thing as information need
  – The notion of an information need or problem underlying a query has been discussed in the IR literature for many years, and it was generally agreed that query intent is another way of referring to the same idea
• Query representation involves modeling the intent or need
  – Query understanding refers to the process of identifying the underlying intent or need based on a particular representation
• Intent classes, intent dimensions, and query classes
  – terms used to talk about the many different types of information needs and problems
Terminology
• Query rewriting, query transformation, query refinement, query alteration, and query reformulation
  – names given to the process of changing the original query to better represent the underlying intent (and consequently improve ranking)
• Query expansion, substitution, reduction, segmentation
  – some of the techniques or steps used in the query transformation process
• Query
  – most research assumes the query is the string entered by the user. Transformation can produce many different representations of the query. The difference between the explicit and implicit query is important
Research Questions
• How to develop a unified and general framework for query understanding?
• How to formally define a query representation?
• How to develop new system architectures for query understanding?
• How to combine query understanding with other components in information retrieval systems?
• How to conduct evaluations of query understanding?
• How to make effective use of both human knowledge and machine learning in query understanding?
Possible Research Tasks
• Long query relevance
• Query reduction
• Similar query finding
• Query classification
• Named entity recognition in queries
• Context-aware search
  – Intent-aware search
Methodology
• Must agree on tasks, evaluation metrics, and text collections
• TREC-style vs. “black-box” evaluations
• Crowdsourcing for annotations
• Resources such as query collections, document collections, query logs, etc. differ widely in their availability in academic and industry settings
Resources
• Document collections – TREC ClueWeb collection preferred
• Query collections – need collections of different query types (e.g. long, location, product…) validated by industry
• Query logs – critical resource for some approaches, not available in academia. Alternatives include MSN/AOL logs, KDD queries, anchor text logs, logs from other applications (Wikipedia), logs from some restricted environment (e.g. academic library)
• N-grams, etc. – corpus and query language statistics from web collections
CIIR Projects
• Modeling structure in queries
• Modeling distributions of queries
• Modeling diversity in queries
• Transforming long queries
• Generating queries from documents
• Generating query logs from anchor text
• Finding similar queries
The Challenge of Query Representation
• User inputs a string of characters
• Query structure is never explicitly observed and is difficult to infer
  – Short and ambiguous search queries
  – Idiosyncratic grammar
  – No capitalization and punctuation
• Examples:
  – “talking to heaven movie”
  – “new york times square”
  – “do grover cleveland have kids”
Structural Query Representation
• A query Q has a hierarchical representation
  – A query is a set of structures: Q = {S_1, …, S_n}
  – Each structure is a set of concepts: S_i = {c_1, c_2, …}
• The hierarchical representation allows one to
  – Model arbitrary term dependencies as concepts
  – Group concepts by structures
  – Assign weights to concepts/structures
Example: structures and concepts for the query “members rock group nirvana”
  – Terms: [members] [rock] [group] [nirvana]
  – Bigrams: [members rock] [rock group] [group nirvana]
  – Chunks: [members] [rock group] [nirvana]
  – Key Concepts: [members] [nirvana]
  – Dependence: [members nirvana] [rock group]
(each line is a structure; each bracketed unit is a concept)
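To make the representation concrete, here is a minimal Python sketch that builds the term and bigram structures for this query; the chunk, key-concept, and dependence structures would come from additional NLP components and are omitted. All names and weights are illustrative, not from the papers.

```python
# A minimal sketch of the hierarchical query representation: a query is a set
# of structures, each structure a set of concepts (tuples of terms).

def terms(query):
    # each single term is a concept
    return [(t,) for t in query.split()]

def bigrams(query):
    # each pair of adjacent terms is a concept
    toks = query.split()
    return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]

query = "members rock group nirvana"
structures = {
    "terms": terms(query),      # [('members',), ('rock',), ('group',), ('nirvana',)]
    "bigrams": bigrams(query),  # [('members', 'rock'), ('rock', 'group'), ...]
}

# weights can then be assigned per structure (and, likewise, per concept)
weights = {name: 1.0 / len(structures) for name in structures}
print(structures, weights)
```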
Encoding Query Structure in a Hypergraph
[Figure: a hypergraph linking the document to structures 1 through n, each structure containing its concepts]
Weighted Sequential Dependence Model (WSD)
• Allow the parameters of the sequential dependence model to depend on the concept
• Assume the parameters take a simple parametric form, λ(c) = Σ_j w_j g_j(c)
  – maintains reasonable model complexity
  – w – free parameters
  – g – concept importance features
[Bendersky, Metzler, and Croft, 2009]
Defining Concept Importance in WSD
• Features g define the concept importance
• Depend on the concept (term/bigram)
• Independent of a specific document/document corpus
• Combine several sources for more accurate weighting
  – Endogenous Features – collection dependent features
  – Exogenous Features – collection independent features
WSD Ranking Function
• Score document D by:
  score(Q, D) = Σ_{c ∈ Q} λ(c) · f(c, D), with λ(c) = Σ_j w_j g_j(c)
  (a toy scoring sketch follows the table)

Example: query “civil war battle reenactments”
(GF … DF are concept importance features)

Concept               GF     …    DF     Weight
civil                 16.9   …    14.1   0.0619
war                   17.9   …    12.8   0.1947
battle                16.6   …    12.6   0.0913
reenactments          10.8   …     9.7   0.3487
civil war             14.5   …    10.8   0.1959
war battle             9.5   …     7.4   0.2458
battle reenactments    7.6   …     4.7   0.0540

• Concept weights may vary even if concept DF is similar
• Good segments do not necessarily predict important concepts
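A minimal Python sketch of this ranking function, assuming toy feature values and made-up weights; the hypothetical f(c, D) here is a Dirichlet-smoothed log-probability standing in for the paper’s matching function, and GF/DF stand in for the actual feature set.

```python
# A sketch of WSD scoring: lambda(c) is a linear combination of concept
# importance features, and the document score sums lambda(c) * f(c, D).
import math

def concept_weight(g, w):
    # lambda(c) = sum_j w_j * g_j(c)
    return sum(w[name] * value for name, value in g.items())

def f(concept, document, mu=2500, p_collection=1e-6):
    # concept-document match: Dirichlet-smoothed log-probability (a stand-in)
    tf = document.count(concept)        # crude substring count, fine for a sketch
    doc_len = len(document.split())
    return math.log((tf + mu * p_collection) / (doc_len + mu))

def wsd_score(concepts, document, w):
    # score(Q, D) = sum over concepts c of lambda(c) * f(c, D)
    return sum(concept_weight(g, w) * f(c, document) for c, g in concepts.items())

# toy slice of the table above, with GF and DF as the only features
concepts = {"civil": {"gf": 16.9, "df": 14.1},
            "civil war": {"gf": 14.5, "df": 10.8}}
w = {"gf": 0.02, "df": -0.01}  # made-up free parameters
print(wsd_score(concepts, "civil war battle reenactments in virginia", w))
```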
TREC Description (Long) Queries
[Figure: average precision (AP) of QL, SD, and WSD on TREC description (long) queries over ROBUST04, WT10G, and GOV2 (y-axis: AP, 0.1 to 0.3); relative improvements of +6.3%, +24.1%, and +1.6% respectively]
Query Representation
• Distribution of Terms (DOT): words + phrases, original or new
  – Relevance Model [Lavrenko and Croft, SIGIR01]
  – Sequential Dependence Model [Metzler and Croft, SIGIR05]
  – Latent Concept Expansion [Metzler and Croft, SIGIR07]
  – Uncertainty in PRF [Collins-Thompson and Callan, SIGIR07]
• Single Reformulated Query (SRQ): a single reformulation operation
  – Query Segmentation [Bergsma and Wang, EMNLP-CoNLL07] [Tan and Peng, WWW08]
  – Query Substitution [Jones et al, WWW06] [Wang and Zhai, CIKM08]
• DOT does not consider how these terms are fitted into actual queries, thus missing the dependencies between them
• SRQ does not consider combining with other operations, thus missing information about alternative reformulations
• Distribution of Queries (DOQ): each query is the output of applying single or multiple reformulation operations
Example
Original TREC Query: oil industry history
• Distribution of Terms (DOT)
  – Relevance Model: { 0.44 “industry”, 0.28 “oil”, 0.08 “petroleum”, 0.08 “gas”, 0.08 “county”, 0.04 “history”, … }
  – Sequential Dependence Model [Metzler, SIGIR05]: { 0.28 “oil”, 0.28 “industry”, 0.28 “history”, 0.08 “oil industry”, 0.08 “industry history”, … }
• Single Reformulated Query (SRQ)
  – Query Segmentation: “(oil industry)(history)”
  – Query Substitution: “petroleum industry history”
• Distribution of Queries (DOQ): { 0.28 “(oil industry)(history)”, 0.24 “(petroleum industry)(history)”, 0.20 “(oil and gas industry)(history)”, 0.18 “(oil)(industrialized)(history)”, … } (a scoring sketch follows)
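One natural way to use a DOQ at ranking time, sketched below under simple assumptions: score a document by the expectation of a base retrieval score over the reformulated queries. The toy scorer here is a stand-in for a query-likelihood style model.

```python
# A minimal sketch of ranking with a Distribution of Queries (DOQ):
# document score = sum over reformulations q of P(q) * score(q, D).

def base_score(query, document):
    # toy stand-in scorer: fraction of query terms appearing in the document
    terms = query.replace("(", " ").replace(")", " ").split()
    return sum(t in document for t in terms) / len(terms)

def doq_score(doq, document):
    # expected base score under the query distribution
    return sum(p * base_score(q, document) for q, p in doq.items())

doq = {
    "(oil industry)(history)": 0.28,
    "(petroleum industry)(history)": 0.24,
    "(oil and gas industry)(history)": 0.20,
    "(oil)(industrialized)(history)": 0.18,
}
doc = "a history of the petroleum industry in texas"
print(doq_score(doq, doc))
```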
Application I
• Reducing Long Queries [Xue, Huston, and Croft, CIKM2010]
  – A novel CRF-based model learns a distribution of subset queries, which directly optimizes retrieval performance
  – (1): retrieve using the top 1 subset query; (K): retrieve using the top K subset queries (a subset-query sketch follows)
[Table: query reduction results; q, d indicate significantly better than QL and DM]
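The CRF model itself is beyond a short sketch, but the surrounding machinery is simple: enumerate subset queries of the long query, score them with some learned model (stubbed here with an illustrative heuristic), and retrieve with the top 1 or top K.

```python
# A minimal sketch of subset-query generation and top-K selection for query
# reduction. `score` is a stand-in for a learned P(subset query | query).
from itertools import combinations

def subset_queries(query, min_len=1):
    # all order-preserving subsets of the query terms
    terms = query.split()
    for k in range(min_len, len(terms) + 1):
        for combo in combinations(terms, k):
            yield " ".join(combo)

def top_k_subsets(query, score, k=5):
    return sorted(subset_queries(query), key=score, reverse=True)[:k]

long_query = "identify any efforts proposed or undertaken by world governments"
stub_score = lambda q: -abs(len(q.split()) - 3)  # toy: prefer ~3-term subsets
print(top_k_subsets(long_query, stub_score, k=3))
```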
Application II: Substitution
• A context of a word is the unigram preceding it
• Context distribution:
  P(c_i | w) = count(c_i) / Σ_{c_j ∈ C(w)} count(c_j)
  – the probability that the term c_i appears in w’s context
• The translation model:
  t(w → s) = e^{-D(P(·|w) || P(·|s))} / Z
  – D is the KL divergence between the context distributions of w and s
• The substitution model, for Q = q_1, …, q_{i-2}, q_{i-1}, q_i, q_{i+1}, q_{i+2}, …, q_n and candidate s (with w = q_i):
  P(w → s) = t(w → s) · P(q_{i-2} q_{i-1} _ q_{i+1} q_{i+2} | s)
  – how well the new term fits the context of the current query (a code sketch follows)
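A minimal Python sketch of the context-distribution and translation pieces, with a toy token stream standing in for query-log or anchor-log data; the full substitution model would further multiply by how well the candidate fits the surrounding query context.

```python
# Context distribution P(c | w) from preceding-word counts, and a translation
# score t(w -> s) proportional to exp(-KL(P(.|w) || P(.|s))).
import math
from collections import Counter, defaultdict

def context_distributions(tokens):
    # P(c | w): distribution of the unigram immediately preceding w
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[word][prev] += 1
    return {w: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for w, ctr in counts.items()}

def kl(p, q, eps=1e-9):
    # KL divergence over the union of contexts, with simple smoothing
    support = set(p) | set(q)
    return sum(p.get(c, eps) * math.log(p.get(c, eps) / q.get(c, eps))
               for c in support)

def translation(w, s, dists):
    # Z is omitted: we only compare candidates s for the same word w
    return math.exp(-kl(dists[w], dists[s]))

corpus = "cheap airfare cheap flight book cheap airfare find cheap flight".split()
dists = context_distributions(corpus)
print(translation("airfare", "flight", dists))
```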
Query Expansion and Stemming
• Probabilities are estimated from a corpus or query log
  – using text passages is nearly the same as pseudo relevance feedback
• Query expansion is similar to substitution: we add the new term and keep the original term
  – substitution: “cheap airfare” → “cheap flight”
  – expansion: “cheap airfare” → “cheap airfare flight”
• Stemming: new terms are restricted to Porter-stemmed root terms (sketched below)
  – “drive direction” → “drive driving direction”
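A minimal sketch of the stemming case, assuming NLTK’s PorterStemmer is available: vocabulary terms that share a Porter stem with a query term are added next to the original term, expansion-style. The vocabulary is an illustrative stand-in for corpus or query-log statistics.

```python
# Stemming-based expansion: keep the original query term and add variants
# that reduce to the same Porter stem.
from nltk.stem import PorterStemmer

def stem_expand(query, vocabulary):
    stemmer = PorterStemmer()
    terms = query.split()
    stems = {stemmer.stem(t): t for t in terms}  # stem -> original query term
    expanded = list(terms)
    for v in vocabulary:
        root = stemmer.stem(v)
        if root in stems and v != stems[root]:
            # insert the variant right after the original term
            expanded.insert(expanded.index(stems[root]) + 1, v)
    return " ".join(expanded)

vocabulary = ["driving", "drives", "directions", "airfare"]
print(stem_expand("drive direction", vocabulary))
# -> "drive drives driving direction directions"
```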
The Anchor Log
• Extract <anchor, url> pairs from the Gov-2 collection to create the anchor log [Dang and Croft, 2009]
• The anchor log is very noisy
  – “click here”, “print version”, … don’t represent the linked page
• Anchor text gives comparable performance to the MSN log for substitution, expansion, and stemming (an extraction sketch follows the table)

                    MSN Log      Anchor Log
# Total Queries     14 million   526 million
# Unique Queries    6 million    20 million
Avg. Query Length   2.68         2.62
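A minimal sketch of anchor-log extraction using only the Python standard library: pull <anchor text, url> pairs from HTML and filter the obvious navigational noise mentioned above. The noise list is illustrative, not the filter used in the paper.

```python
# Extract <anchor text, url> pairs from HTML to build an anchor log.
from html.parser import HTMLParser

NOISE = {"click here", "print version", "home", "more"}  # illustrative filter

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        text = data.strip().lower()
        if self.href and text and text not in NOISE:
            self.pairs.append((text, self.href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None

parser = AnchorExtractor()
parser.feed('<a href="http://nps.gov">civil war battlefields</a> '
            '<a href="/x">click here</a>')
print(parser.pairs)  # [('civil war battlefields', 'http://nps.gov')]
```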
Learning to Rank Reformulations
[Dang, Bendersky, and Croft, 2010]
Using Query Distributions
• Reformulating Short Queries [Xue et al, CIKM2010]
  – Passage information is used to generate candidate queries and estimate probabilities
[Table: Gov2 results; o, w, m, a represent different methods to generate candidate queries; q, d, r indicate significantly better than QL, SDM, and RM]

Example Query Reformulations using Passages
Conclusions
• Studying query intent is not new, but more data is leading to many new insights
• Not just a web search issue, but more obvious in web search
• Lots of interesting research to do, but the field needs more coherence in terms of research goals and testbeds