Unsupervised Query Segmentation Using Clickthrough for Information Retrieval
Yanen Li¹, Bo-June (Paul) Hsu², ChengXiang Zhai¹ and Kuansan Wang²
¹ Department of Computer Science, University of Illinois at Urbana-Champaign
² Microsoft Research, One Microsoft Way, Redmond, WA
07/25/2011, SIGIR 2011, Beijing China
Outline
• Motivation and Related Work
• Unsupervised Query Segmentation Model with Clickthrough
• Query Segmentation Evaluation
• Integrated Language Model with Query Segmentation (QSLM)
• Evaluation of QSLM
• Conclusion and Future Work
Motivation
Query segmentation: breaking a query into semantically meaningful segments
bank of america online banking -> [bank of america] [online banking]
Query segmentation is useful for: (1) noun phrase discovery; (2) query reformulation; (3) phrase-based retrieval models; (4) user intent analysis
This Work:
• Task 1: probabilistic query segmentation
bank of america online banking ->
{[bank of america] [online banking], 0.502}
{[bank of america online banking], 0.428}
{[bank of] [america] [online banking], 0.001}
• Task 2: retrieval model with query segmentation
Q -> {A(Q)} -> D
Related Work on Query Segmentation
• Mutual information based models [Risvik WWW 03, Jones WWW 06]
• Supervised query segmentation models
– MRF [Yu KEYS 09]
– Limitation: need labeled training examples
• Simple N-gram probability models [Hagen SIGIR 10]
• Unsupervised models
– [Tan WWW 2008]
– Minimum description length
Limitation: no relevance information. Example: “of the” in the query “president of the united states” (president | of the | united states?)
We model query segmentation with clickthrough data, which was previously unexplored
Unsupervised Query Segmentation Model using Clickthrough
Intuitions
• Segments appear in both the query and the doc
• Relevance information
• How to model?
Our Segmentation Model
• A generative model
• Generating a query:
1. Pick a query length n under a length distribution; e.g., n = 4
2. Select a segmentation partition B ∈ B_n according to a segmentation partition model P(B | n, ψ); e.g., [X X] [X X]
3. Generate query segments S_m consistent with B, according to a segment unigram model P(S_m | θ); e.g., [food network] [coupon codes]
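The three generative steps can be sketched in a few lines of Python. This is only an illustration: the toy distributions `LENGTH_DIST`, `PARTITION_MODEL` and `SEGMENT_UNIGRAM`, and the function name `sample_query`, are assumptions made up for this example and are not taken from the slides.

```python
import random

# Toy, hand-made distributions (assumptions for illustration only).
LENGTH_DIST = {4: 1.0}                         # P(n)
PARTITION_MODEL = {4: {(2, 2): 1.0}}           # P(B | n, psi); B = segment lengths
SEGMENT_UNIGRAM = {"food network": 0.5, "coupon codes": 0.5}  # P(S_m | theta)


def sample_query(length_dist, partition_model, segment_unigram, rng):
    """Sample one segmented query by following the three generative steps."""
    # Step 1: pick a query length n.
    lengths, weights = zip(*length_dist.items())
    n = rng.choices(lengths, weights=weights)[0]

    # Step 2: pick a partition B of the n word positions.
    partitions, p_weights = zip(*partition_model[n].items())
    partition = rng.choices(partitions, weights=p_weights)[0]

    # Step 3: for each slot in B, draw a segment with the required word count.
    segments = []
    for seg_len in partition:
        candidates = [(s, p) for s, p in segment_unigram.items()
                      if len(s.split()) == seg_len]
        segs, s_weights = zip(*candidates)
        segments.append(rng.choices(segs, weights=s_weights)[0])
    return segments


print(sample_query(LENGTH_DIST, PARTITION_MODEL, SEGMENT_UNIGRAM, random.Random(0)))
```

With these toy inputs every sample has the shape [X X] [X X], matching the example partition in step 2.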
• Under this model:
Annotations on the slide's equation: prob of seeing Q given B; infinitely strong prior that penalizes longer segments
B: segmentation partition
θ: segment unigram distribution. Vocabulary space: 1, 2, …, K
e.g., P([the cuban swimmer paper] | θ) vs. P(the | θ) P(cuban | θ) P(swimmer | θ) P(paper | θ)
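The equation itself was an image on the slide and is not in this transcript. One form that is consistent with the three generative steps (length, partition, segments) is written below. It is a reconstruction under that assumption, not necessarily the authors' exact formula; M denotes the number of segments in B.

```latex
P(Q \mid \theta, \psi)
  = P(n) \sum_{B \in \mathcal{B}_n} P(B \mid n, \psi)
    \underbrace{\prod_{m=1}^{M} P(S_m \mid \theta)}_{\text{prob of seeing } Q \text{ given } B}
```

Here each S_m is the m-th segment of Q under partition B, so a single-segment reading such as [the cuban swimmer paper] contributes one factor while the fully split reading contributes four.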
• Extending to <query, doc> pairs
An interpolated model with a global component and a document-specific component
Query: [President] [of the] [united states]
Clicked docs:
1. the White House and President Barack Obama, the 44th President of the United States
2. the united states President Barack Obama …
3. President Obama remained unable to break a stalemate over the debt… Few investors believe the United States …
Prob is not high for this segmentation
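A minimal sketch of how a global and a document-specific component might be interpolated. The slide's formula is not in the transcript, so the linear mix, the weight `lam`, the toy global probabilities and the n-gram relative-frequency estimate are all assumptions for illustration.

```python
from collections import Counter


def ngram_model(text, max_len=3):
    """Relative frequency of every word n-gram (up to max_len) in a text."""
    words = text.lower().split()
    counts = Counter(
        " ".join(words[i:i + k])
        for k in range(1, max_len + 1)
        for i in range(len(words) - k + 1)
    )
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def interpolated_prob(segment, global_model, doc_model, lam=0.5):
    """Assumed linear interpolation: lam * global + (1 - lam) * doc-specific."""
    return lam * global_model.get(segment, 0.0) + (1 - lam) * doc_model.get(segment, 0.0)


# Toy global probabilities (made up); the doc is clicked doc 2 from the slide.
GLOBAL = {"of the": 0.01, "united states": 0.005}
DOC = ngram_model("the united states president barack obama")

p_of_the = interpolated_prob("of the", GLOBAL, DOC)
p_united_states = interpolated_prob("united states", GLOBAL, DOC)
print(p_of_the, p_united_states)
```

Because "of the" never occurs in this clicked document, its document-specific component is zero and it ends up with a lower probability than "united states", even though its toy global probability is higher.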
• Parameter estimation
θ: segment unigram distribution, estimated by maximizing the likelihood over all query-doc pairs
An EM algorithm, e.g., for the query "oxford real estate advisors":
E step: given θ(k-1), for each Q compute the posterior probability of each valid segmentation given Q
e.g., P([X] [X X] [X] | oxford real estate advisors, θD, ψ)
M step: update θ(k):
P(real estate | θ(k)) ∝ P([X] [X X] [X] | oxford real estate advisors, θD, ψ)
+ P([X X] [X] | real estate california, θD, ψ)
+ P([X] [X] [X X] [X] | find a real estate agent, θD, ψ)
+ …
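The E and M steps can be sketched as a toy EM loop over queries only. This is a simplification under stated assumptions: it ignores the clicked documents (θD) and the partition model ψ, treats all partitions as equally likely a priori, and caps segments at `max_len` words. The names `segmentations` and `em_segment_unigram` are hypothetical.

```python
from collections import defaultdict


def segmentations(words, max_len=2):
    """Yield every split of `words` into contiguous segments of <= max_len words."""
    if not words:
        yield []
        return
    for k in range(1, min(max_len, len(words)) + 1):
        head = " ".join(words[:k])
        for rest in segmentations(words[k:], max_len):
            yield [head] + rest


def em_segment_unigram(queries, iterations=5, max_len=2):
    """Toy EM for a segment unigram distribution theta (no docs, no prior)."""
    candidates = {s for q in queries
                  for seg in segmentations(q.split(), max_len) for s in seg}
    theta = {s: 1.0 / len(candidates) for s in candidates}
    for _ in range(iterations):
        expected = defaultdict(float)
        for q in queries:
            segs = list(segmentations(q.split(), max_len))
            # E step: posterior of each valid segmentation under the current theta.
            joint = []
            for seg in segs:
                p = 1.0
                for s in seg:
                    p *= theta[s]
                joint.append(p)
            z = sum(joint)
            for seg, p in zip(segs, joint):
                for s in seg:
                    expected[s] += p / z
        # M step: theta(s) is proportional to the summed posteriors containing s.
        total = sum(expected.values())
        theta = {s: expected[s] / total for s in candidates}
    return theta


QUERIES = ["oxford real estate advisors", "real estate california",
           "find a real estate agent"]
theta = em_segment_unigram(QUERIES)
print(round(theta["real estate"], 4))
```

Without a prior over partitions, the product of segment probabilities has fewer factors for longer segments, which is one reason a penalty on segment length is useful in practice.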
Query Segmentation Evaluation
• Datasets
– Training set from Bing query log
– Test set 1: 500 queries from [Bergsma EMNLP-CoNLL 2007], 3 annotators
– Test set 2: 1000 queries from Bing query log, 3 annotators
• Metrics
– query accuracy
– classify accuracy
– segment precision
– segment recall
– segment F
– Reported on set A, set B, set C, Intersection & Conjunction
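The five metrics can be computed as sketched below. The slides do not define them, so the definitions used here are assumptions based on common usage: query accuracy is the fraction of exactly matching segmentations, classify accuracy is the fraction of correct break/no-break decisions between adjacent words, and segment precision/recall compare segments by position. The helper names are hypothetical.

```python
def spans(segmentation):
    """Convert a list of segments into a set of (start, end) word offsets."""
    out, start = set(), 0
    for seg in segmentation:
        end = start + len(seg.split())
        out.add((start, end))
        start = end
    return out


def boundaries(segmentation):
    """Word offsets of interior segment breaks."""
    n_words = sum(len(s.split()) for s in segmentation)
    return {end for _, end in spans(segmentation)} - {n_words}


def evaluate(predicted, reference):
    exact = correct = decisions = hit = n_pred = n_ref = 0
    for pred, ref in zip(predicted, reference):
        n_words = sum(len(s.split()) for s in ref)
        exact += pred == ref
        pb, rb = boundaries(pred), boundaries(ref)
        # One break/no-break decision per gap between adjacent words.
        correct += sum((g in pb) == (g in rb) for g in range(1, n_words))
        decisions += n_words - 1
        ps, rs = spans(pred), spans(ref)
        hit += len(ps & rs)
        n_pred += len(ps)
        n_ref += len(rs)
    precision, recall = hit / n_pred, hit / n_ref
    f = 2 * precision * recall / (precision + recall) if hit else 0.0
    return {"query accuracy": exact / len(reference),
            "classify accuracy": correct / decisions,
            "segment precision": precision,
            "segment recall": recall,
            "segment F": f}


PRED = [["bank of america", "online banking"], ["history", "of armenia"]]
REF = [["bank of america", "online banking"], ["history", "of", "armenia"]]
metrics = evaluate(PRED, REF)
print(metrics)
```

In this toy example one of two queries matches exactly, 5 of 6 gap decisions agree, and 3 segments are shared out of 4 predicted and 5 reference segments.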
Result Snapshot
Test Set 1
30 [elizabeth nj] [factory outlets]
31 [rush university] [medical center]
32 [pitch card game] [program]
33 [hillsborough] [river] [state park]
34 [trane] [vs] [american standard] [a c]
35 [jefferson county al] [school system]
36 [oxford] [real estate] [advisors]
37 [johnson county] [community college]
38 [new york] [insight meditation]
39 [aurora ohio] [movie theater]
40 [trigun] [maximum] [graphic novels]
41 [animals] [redwood] [national park]
42 [prime time] [male] [exotic] [dances]
43 [pacific grove] [adult] [school]
44 [ralph] [m] [brown] [act]
45 [chicago] [gay pride parade]
46 [livermore] [mobile home parks]
47 [vintage] [harley davidson] [soft] [tail] [standard]
48 [aerotemp] [heat pump] [pools]
49 [american indian] [salt] [deficiency]
50 [cheap] [crossword puzzle] [books]
Test Set 2
2030822 [beauty and the beast]
2025251 [history] [of] [armenia]
2030690 [american saddlery country flex saddle]
2024252 [funny] [award] [certificates]
2023090 [champion] [mobile homes]
2027667 [pictures] [of] [best friend] [woman] [hugging]
2022846 [budget driving school] [san diego]
2027746 [publishing] [web site] [internet]
2030341 [you tube] [american idol] [results] [april 2 2008]
… …
Evaluation on Test Set 1

| Subset | Metric | Baseline (MI) | Tan's Models (EM + corpus) | Our Models (EM + Clicked Doc) |
|---|---|---|---|---|
| Annotation A | query accuracy | 0.274 | 0.414 | 0.440 |
| Annotation A | classify accuracy | 0.693 | 0.762 | 0.776 |
| Annotation A | segment precision | 0.469 | 0.562 | 0.598 |
| Annotation A | segment recall | 0.534 | 0.555 | 0.639 |
| Annotation A | segment F | 0.499 | 0.558 | 0.618 |
| Annotation B | query accuracy | 0.244 | 0.44 | 0.410 |
| Annotation B | classify accuracy | 0.634 | 0.774 | 0.750 |
| Annotation B | segment precision | 0.408 | 0.568 | 0.521 |
| Annotation B | segment recall | 0.472 | 0.578 | 0.631 |
| Annotation B | segment F | 0.438 | 0.573 | 0.571 |
| Annotation C | query accuracy | 0.264 | 0.416 | 0.402 |
| Annotation C | classify accuracy | 0.666 | 0.759 | 0.756 |
| Annotation C | segment precision | 0.451 | 0.558 | 0.548 |
| Annotation C | segment recall | 0.519 | 0.561 | 0.619 |
| Annotation C | segment F | 0.483 | 0.559 | 0.582 |
| Intersection | query accuracy | 0.343 | 0.528 | 0.586 |
| Intersection | classify accuracy | 0.728 | 0.815 | 0.842 |
| Intersection | segment precision | 0.510 | 0.640 | 0.681 |
| Intersection | segment recall | 0.550 | 0.650 | 0.747 |
| Intersection | segment F | 0.530 | 0.645 | 0.713 |

-- Clearly outperforms the MI baseline.
-- Outperforms the [Tan WWW 2008] model according to A, C and Intersection
-- Our Model + MS Web n-gram beats other models with additional resources
Segmentation Performance with Respect to Penalty Factor
1. The penalty factor strongly affects the results
2. Good results are achieved at f = 2
Integrated Language Model with Query Segmentation (QSLM)
• Traditional IR models
– TF-IDF, BM25, Unigram LM …
– Terms are scored independently
• Proximity heuristics [Tao SIGIR 07]
• Higher order LMs (biterm LM [Srikanth SIGIR 02])
• Capturing linkage [Gao SIGIR 04]
Simple Oracle Ranker
Oracle Ranker Procedure
Result:

| qID | Unigram | Bigram | Oracle |
|---|---|---|---|
| 2024077 | 0.33 | 0.25 | 0.33 |
| 2024272 | 0.3 | 0.34 | 0.34 |
| 2024291 | 0.29 | 0.36 | 0.36 |
| … | | | |

Remarks:
1. Oracle ranker performs very well
2. Simulate similar behavior with query seg
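The oracle procedure diagram is not in the transcript. Judging from the result table, the oracle appears to take, for each query, the better of the unigram and bigram scores. A minimal sketch under that assumption (the dictionary layout and the function name `oracle_scores` are hypothetical):

```python
# Per-query scores copied from the result table on the slide.
SCORES = {
    "2024077": {"unigram": 0.33, "bigram": 0.25},
    "2024272": {"unigram": 0.30, "bigram": 0.34},
    "2024291": {"unigram": 0.29, "bigram": 0.36},
}


def oracle_scores(scores):
    """For each query, keep the best score among the available rankers."""
    return {qid: max(per_ranker.values()) for qid, per_ranker in scores.items()}


print(oracle_scores(SCORES))
```

This reproduces the Oracle column of the table, which is an upper bound on what a per-query choice between the two rankers can achieve.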
QSLM Model
Query seg prob
LM
1. doc LM model
2. background LM model
How to score docs under QSLM
Query: bank of america online
Doc 1: AOL Inc. (NYSE: AOL, stylized as "Aol.", and previously known as America Online) is an American global Internet services and media company
Doc 2: Online Banking from Bank of America lets you manage your accounts, pay your bills, view credit card activity and more.

| Document | Query Segmentation | Prob | a/(a+b) | Ranking score |
|---|---|---|---|---|
| Doc 1 | [bank of america] [online] | 0.94 | 0.6 | 0.564 |
| Doc 1 | [bank] [of] [america online] | 0.02 | 0.8 | 0.016 |
| Doc 1 | Total | | | 0.58 |
| Doc 2 | [bank of america] [online] | 0.94 | 0.9 | 0.846 |
| Doc 2 | [bank] [of] [america online] | 0.02 | 0.4 | 0.008 |
| Doc 2 | Total | | | 0.854 |
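The worked example amounts to a weighted sum: each segmentation's ranking score is its probability times the a/(a+b) value, and a document's score is the sum over segmentations. The sketch below reproduces that arithmetic using the slide's numbers; the function name `qslm_score` and the tuple layout are assumptions.

```python
def qslm_score(rows):
    """Sum over segmentations of (segmentation prob) * (a/(a+b) value)."""
    return sum(seg_prob * match for seg_prob, match in rows)


# (segmentation prob, a/(a+b)) pairs from the slide's table.
DOC1 = [(0.94, 0.6), (0.02, 0.8)]   # AOL document
DOC2 = [(0.94, 0.9), (0.02, 0.4)]   # Bank of America online banking document

score1, score2 = qslm_score(DOC1), qslm_score(DOC2)
print(round(score1, 3), round(score2, 3))
```

Doc 2 scores higher because the dominant segmentation [bank of america] [online] matches it well, while the unlikely segmentation [bank] [of] [america online] contributes little to either document.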
Evaluation of QSLM on Search Ranking
Dataset from Bing: 12,064 queries
Baselines: BM25, Unigram LM, Bigram LM
Results on Web Search
1. Better performance than BM25 and the Unigram and Bigram LMs
2. Results more significant on longer queries
How many segmentations are needed?
1. More segmentations, better search ranking
2. A small number of segmentations is enough
Conclusions and Future Work
• An unsupervised model using clickthrough is effective for query segmentation
• LM with query segmentation can improve search ranking
• But QSLM still underperforms the Oracle Ranker
• A better model to incorporate query segmentation is desirable
Acknowledgement
We thank SIGIR for the Travel Grant support!
Thank You!