![Page 1: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/1.jpg)
Improving Similarity Measures for Short Segments of Text
Scott Wen-tau Yih & Chris Meek
Microsoft Research
![Page 2: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/2.jpg)
Query Suggestion
How similar are they?
mariners vs. seattle mariners
mariners vs. 1st mariner bank

Query: mariners
![Page 3: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/3.jpg)
Keyword Expansion for Online Ads
Chocolate Cigarettes
Chocolate candy
Chocolate cigars
Nostalgic candy
Novelty candy
Candy cigarettes
Old fashioned candy
How similar are they?
chocolate cigarettes vs. cigarettes
chocolate cigarettes vs. chocolate cigars
chocolate cigarettes vs. old fashioned candy
![Page 4: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/4.jpg)
Measuring Similarity
Goal: create a similarity function to rank suggestions
fsim: (String1, String2) → R
Fix String1 as q; vary String2 as s1, s2, …, sk
Whether the function is symmetric is not important
For query suggestion: fsim(q,s)
fsim(“mariners”, “seattle mariners”) = 0.9
fsim(“mariners”, “1st mariner bank”) = 0.6
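The ranking setup above can be sketched as follows. This is a minimal illustration: `rank_suggestions` and `f_sim` are hypothetical names, and the scores are the example values from the slide, not output of a real similarity measure.

```python
def rank_suggestions(query, suggestions, f_sim):
    """Sort candidate suggestions by f_sim(query, s), highest first."""
    return sorted(suggestions, key=lambda s: f_sim(query, s), reverse=True)

# Hypothetical lookup standing in for a real similarity function;
# the values are the illustrative scores from the slide.
example_scores = {
    ("mariners", "seattle mariners"): 0.9,
    ("mariners", "1st mariner bank"): 0.6,
}

def f_sim(q, s):
    return example_scores[(q, s)]

ranked = rank_suggestions("mariners", ["1st mariner bank", "seattle mariners"], f_sim)
# "seattle mariners" ranks first
```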
![Page 5: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/5.jpg)
Enabling Useful Applications
Web search
Ranking query suggestions
Segmenting web sessions using query logs
Online advertising
Suggesting alternative keywords to advertisers
Matching similar keywords to show ads
Document writing
Providing alternative phrasing
Correcting spelling errors
![Page 6: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/6.jpg)
Challenges
Short text segments may not overlap
“Microsoft Research” vs. “MSR” → 0 cosine score
Ambiguous terms
“Bill Gates” vs. “Utility Bill” → 0.5 cosine score
“taxi runway” vs. “taxi” → 0.7 cosine score
Text segments may rarely co-occur in corpus
“Hyatt Vancouver” vs. “Haytt Vancover” → 1 page
Longer query → fewer pages
![Page 7: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/7.jpg)
Our Contributions
Web-relevance similarity measure
Represent the input text segments as real-valued term vectors using Web documents
Improve the term-weighting scheme based on relevant keyword extraction
Learning similarity measure
Fit user preference for the application better
Compare learning a similarity function vs. learning a ranking function
![Page 8: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/8.jpg)
Outline
Introduction: problem, applications, challenges
Our methods
Web-relevance similarity function
Combining similarity measures using learning
Learning a similarity function
Learning a ranking function
Experiments on query suggestion
![Page 9: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/9.jpg)
Web-relevance Similarity Measure
Query expansion of x using a search engine
Let Dn(x) be the set of top-n documents
Build a term vector vi for each document di ∈ Dn(x)
Elements are scores representing the relevancy of the words in document di
C(x) = (1/n) Σi vi / ||vi|| (centroid of the L2-normalized vectors)
QE(x) = C(x) / ||C(x)|| (L2-normalized)
Similarity score is simply the inner product:
fsim(q,s) = QE(q) · QE(s)
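The construction above can be sketched with sparse term vectors (dicts mapping words to relevancy scores). In the actual measure, the vectors vi come from the top-n search results for x; here the function names are illustrative and the inputs would be supplied by that retrieval step.

```python
import math

def l2_normalize(v):
    """L2-normalize a sparse term vector (dict of term -> score)."""
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {t: x / norm for t, x in v.items()} if norm else dict(v)

def query_expansion(term_vectors):
    """QE(x): L2-normalized centroid of the normalized document vectors vi."""
    n = len(term_vectors)
    centroid = {}
    for v in term_vectors:
        for t, x in l2_normalize(v).items():
            centroid[t] = centroid.get(t, 0.0) + x / n
    return l2_normalize(centroid)

def f_sim(qe_q, qe_s):
    """fsim(q,s) = QE(q) . QE(s): inner product of the expansion vectors."""
    return sum(x * qe_s.get(t, 0.0) for t, x in qe_q.items())
```

Because both expansion vectors are unit length, the inner product is a cosine score in [0, 1] when relevancy scores are non-negative: identical expansions score 1.0, disjoint ones score 0.0.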
![Page 10: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/10.jpg)
Web-kernel Similarity
Relevancy = TF-IDF [Sahami & Heilman ’06]
Why TF-IDF?
High TF: important or relevant to the document
High DF: stopwords or words in template blocks
A crude estimate of the importance of a word
Can we do better than TF-IDF?
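A minimal sketch of the standard TF-IDF weighting the slide refers to; the exact variant used by Sahami & Heilman may differ, and `tfidf_weights` with its arguments is an illustrative name of my choosing.

```python
import math

def tfidf_weights(doc_terms, df, num_docs):
    """TF-IDF relevancy for the terms of one document.
    df maps each term to the number of corpus documents containing it,
    so frequent (high-DF) terms such as stopwords get low weights."""
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1  # raw term frequency in this document
    return {t: c * math.log(num_docs / df[t]) for t, c in tf.items()}
```

A term appearing in every document gets IDF = log(1) = 0, which is exactly why TF-IDF suppresses stopwords and template-block words.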
![Page 11: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/11.jpg)
Web-relevance Similarity
Relevancy = Prob(relevance | wj, di)
Keyword extraction can judge the importance of words more accurately! [Yih et al. WWW-06]
Assigns relevancy scores (probabilities) to words/phrases
Machine-learning model trained by logistic regression
Uses more than 10 categories of features
Query-log frequency: high-DF words may be popular queries
The position of the word in the document
The format, hyperlinks, etc.
![Page 12: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/12.jpg)
Learning Similarity
Similarity measures should depend on the application
q = “Seattle Mariners”, s1 = “Seattle”, s2 = “Seattle Mariners Ticket”
Let human subjects decide what’s similar
Parametric similarity function fsim(q,s|w)
Learn the parameters (weights) from data
Use machine learning to combine multiple base similarity measures
![Page 13: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/13.jpg)
Base Similarity Measures
Surface matching methods
Suppose Q and S are the sets of words in a given pair of query q and suggestion s
Matching: |Q∩S|
Dice: 2|Q∩S| / (|Q|+|S|)
Jaccard: |Q∩S| / |Q∪S|
Overlap: |Q∩S| / min(|Q|,|S|)
Cosine: |Q∩S| / sqrt(|Q|·|S|)
Corpus-based methods
Web-relevance, Web-kernel, KL-divergence
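The five surface matching measures can be written directly from their set definitions; `surface_measures` is an illustrative name, and whitespace tokenization stands in for whatever tokenizer the experiments actually used.

```python
import math

def surface_measures(q, s):
    """The five surface matching measures, computed on word sets Q and S."""
    Q, S = set(q.split()), set(s.split())
    inter = len(Q & S)  # |Q intersect S|
    return {
        "matching": inter,
        "dice": 2 * inter / (len(Q) + len(S)),
        "jaccard": inter / len(Q | S),
        "overlap": inter / min(len(Q), len(S)),
        "cosine": inter / math.sqrt(len(Q) * len(S)),
    }

# e.g. surface_measures("seattle mariners", "mariners")
# -> matching 1, dice 2/3, jaccard 0.5, overlap 1.0, cosine 1/sqrt(2)
```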
![Page 14: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/14.jpg)
Learning a Similarity Function
Data: pairs of query and suggestion (qi, sj)
Label: relevance judgment (rel = 1 or rel = 0)
Features: scores on (qi, sj) provided by multiple base similarity measures
We combine them using logistic regression:
z = w1·Cosine(q,s) + w2·Dice(q,s) + w3·Matching(q,s) + w4·Web-relevance(q,s) + w5·KL-divergence(q,s) + …
fsim(q,s|w) = Prob(rel|q,s;w) = exp(z)/(1+exp(z))
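The logistic combination above reduces to a few lines; `f_sim_combined` is an illustrative name, and in practice the weights would be fit to the labeled (query, suggestion) pairs rather than supplied by hand.

```python
import math

def f_sim_combined(features, weights):
    """fsim(q,s|w): logistic combination of base similarity scores.
    `features` holds the base measures' scores on (q,s); `weights`
    are the learned parameters w1, w2, ..."""
    z = sum(w * f for w, f in zip(weights, features))
    return math.exp(z) / (1 + math.exp(z))  # = Prob(rel | q,s; w)
```

With z = 0 the model is maximally uncertain (probability 0.5); positive weighted evidence pushes the score toward 1.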
![Page 15: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/15.jpg)
Learning a Ranking Function
We compare suggestions sj, sk for the same query q
Data: tuples of a query q and suggestions sj, sk
Label: [sim(q,sj) > sim(q,sk)] or [sim(q,sj) < sim(q,sk)]
Features: scores on pairs (q,sj) and (q,sk) provided by multiple base similarity measures
Learn a probabilistic model using logistic regression
Prob([sim(q,sj) > sim(q,sk)] | q,sj,sk;w)
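One common way to set up this pairwise model is a logistic regression over the difference of the two feature vectors; this is a sketch of that standard parameterization, and the paper's exact feature construction may differ.

```python
import math

def prob_j_better(feat_j, feat_k, weights):
    """P(sim(q,sj) > sim(q,sk) | q,sj,sk; w): logistic model over the
    base-measure scores of the two (query, suggestion) pairs.
    Scoring the feature difference is one standard pairwise setup."""
    z = sum(w * (fj - fk) for w, fj, fk in zip(weights, feat_j, feat_k))
    return 1 / (1 + math.exp(-z))
```

When the two suggestions have identical feature scores, the model outputs 0.5; with positive weights, the suggestion that dominates on the base measures gets the higher probability.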
![Page 16: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/16.jpg)
Experiments
Data: query suggestion dataset [Metzler et al. ’07]
|Q| = 122, |(Q,S)| = 4852; labels {Excellent, Good} vs. {Fair, Bad}
Results: 10-fold cross-validation
Evaluation metrics: AUC and Precision@k

| Query | Suggestion | Label |
| --- | --- | --- |
| shell oil credit card | shell gas cards | Excellent |
| shell oil credit card | texaco credit card | Fair |
| tarrant county college | fresno city college | Bad |
| tarrant county college | dallas county schools | Good |
![Page 17: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/17.jpg)
AUC Scores
(Bar chart: AUC for the 10 compared methods, ranging from 0.606 to 0.739; the top two methods score 0.739 and 0.735.)
![Page 18: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/18.jpg)
Precision@3
(Bar chart: Precision@3 for the 10 compared methods, ranging from 0.389 to 0.569; the top two methods score 0.569 and 0.556.)
![Page 19: Improving Similarity Measures for Short Segments of Text](https://reader035.vdocument.in/reader035/viewer/2022062520/5681624c550346895dd29556/html5/thumbnails/19.jpg)
Conclusions
Web-relevance
New term-weighting scheme derived from keyword extraction
Outperforms existing methods on query suggestion
Learning similarity
Fits the application: better suggestion ranking
Learning a similarity function vs. learning a ranking function
Future work
Experiment with alternative combination methods
Explore other probabilistic models for similarity
Apply our similarity measures to different tasks