Spelling Correction for Advertising: How “Noise” Can Help
Silviu Cucerzan, Microsoft Research, Text Mining Search and Navigation
NISS Workshop on Computational Advertising, November 2009
Buying Cheap(er) on eBay
Cannon 30d
Canon 30d
Not good for the sellers.
Not good for most buyers.
Not good for the middle man.
Good Ads for Bad Queries
epresso machines
espesso machines
espreso machines
espressomachines
esspreso machines
esspresso machines
expresso machines
exspresso machines
→ espresso machines
singular wireless
cingulair wireless
cigular wireless
cingulare wireless
cingullar wireless
cinguilar wireles
cingluarwireless
circular wireless
→ cingular wireless
Is a Trusted Dictionary Enough?
• Search:
    max payne chats and codes → cheats
    new humwee pics
• Music:
    selin dion color of my love → celine, colour
    cristina aquillara → christina aguilera
• Shopping:
    pansonic dvd reorders → panasonic, recorders
    brita water filer → filter
• Help and Support:
    printer divers for window vista → drivers, windows
    insert flash flies into power point → files, powerpoint
Web Query Logs as Corpora
• Web Search: over to 1 billion queries per day!
• 10-15% of the queries contain spelling errors
• highly dynamic domain: many new names and concepts become popular every day
extremely difficult to maintain a high-coverage lexicon
• difficult to define what a valid web query is
e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx
The problem
The solution
Problems To Be Handled
• Concatenate and split
    cheese cake factory → cheesecake factory
    chat inspanich → chat in spanish
• Recognize out-of-lexicon valid words
    amd processors → amd processors
• Change in-lexicon words to out-of-lexicon words
    gun dam fighter → gundam fighter
• Context-sensitive correction of out-of-lexicon words
    power crd → power cord
    video crd → video card
• Context-sensitive correction of in-lexicon words
    chicken sop → chicken soup
    sop opera → soap opera
An HMM Architecture for Spelling Correction
input query: brita water filer
states: all alternative spellings from the query log
    brita: brita, brit, brit., brits, briat, rita
    water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
    filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
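A minimal sketch of how a Viterbi search over such states could pick one spelling per query word. The slides do not give the model's probabilities, so everything numeric here is an assumption: the counts in `UNI` and `BI` are invented, `bigram` uses add-one smoothing, and `emit` is a hypothetical stand-in for an error model that simply favors keeping the word as typed.

```python
import math

# Invented toy counts, standing in for query-log statistics (assumption).
UNI = {"brita": 60, "brit": 40, "water": 500, "waiter": 80,
       "filer": 5, "file": 900, "filter": 300}
BI = {("brita", "water"): 50, ("water", "filter"): 120,
      ("water", "filer"): 2, ("brit", "waiter"): 1}
TOTAL = sum(UNI.values())

def unigram(c):
    return UNI[c] / TOTAL

def bigram(p, c):
    # Add-one smoothing so unseen word pairs keep a small non-zero probability.
    return (BI.get((p, c), 0) + 1) / (UNI[p] + len(UNI))

def emit(typed, c):
    # Hypothetical error model: keeping the typed form is the most likely case.
    return 0.8 if typed == c else 0.1

def viterbi_correct(words, candidates):
    """Return the highest-scoring sequence of spellings, one state per word."""
    # best[c] = (log score, path) of the best path that ends in candidate c
    best = {c: (math.log(unigram(c)) + math.log(emit(words[0], c)), [c])
            for c in candidates[words[0]]}
    for w in words[1:]:
        nxt = {}
        for c in candidates[w]:
            score, path = max(
                ((s + math.log(bigram(p, c)), pth) for p, (s, pth) in best.items()),
                key=lambda t: t[0],
            )
            nxt[c] = (score + math.log(emit(w, c)), path + [c])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

CANDIDATES = {"brita": ["brita", "brit"],
              "water": ["water", "waiter"],
              "filer": ["filer", "file", "filter"]}
result = viterbi_correct(["brita", "water", "filer"], CANDIDATES)
```

With these toy numbers the strong "water filter" bigram outweighs the penalty for changing "filer", so `result` is `["brita", "water", "filter"]`.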
What about terrible misspellings?
• input: arnol shwartzeggar
• desired output: arnold schwarzenegger
unweighted edit distance: 5
An Iterative Approach
Misspelled query: arnol shwartzeggar
First iteration: arnold schwartzneggar
Second iteration: arnold schwartzenegger
Third iteration: arnold schwaxrzenegger
Fourth iteration: arnold schwarzenegger (no more changes)
Speller output: arnold schwarzenegger
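The iteration can be sketched as a loop that keeps moving each word to a more frequent nearby form from the query log until nothing changes. This is a simplified illustration, not the system described in the talk: it corrects words independently (no context), uses plain unweighted edit distance, and the counts in `TOY_COUNTS` are invented.

```python
def edit_distance(a, b):
    """Unweighted Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def correct_once(word, counts, max_dist=2):
    """Move to the most frequent logged form within max_dist, if more frequent."""
    best = word
    for form, n in counts.items():
        if n > counts.get(best, 0) and edit_distance(word, form) <= max_dist:
            best = form
    return best

def correct_iteratively(query, counts, max_dist=2, max_iters=10):
    """Apply correct_once to every word until the query stops changing."""
    trace = [query]
    for _ in range(max_iters):
        nxt = " ".join(correct_once(w, counts, max_dist) for w in trace[-1].split())
        if nxt == trace[-1]:
            break
        trace.append(nxt)
    return trace

# Invented counts (assumption); only their ordering matters for the example.
TOY_COUNTS = {"arnold": 5000, "arnol": 10, "schwartzneggar": 3,
              "schwartzenegger": 40, "schwarzenegger": 900}
trace = correct_iteratively("arnol shwartzeggar", TOY_COUNTS)
```

Each step only looks at forms within a small distance, so a badly misspelled word reaches the frequent form through intermediate misspellings that other users also typed.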
Search Query Log Statistics: Some Intuition
hunny moon → (iterative spelling correction process) → honeymoon
honemoon 8
honemoons 3
honeybeemon 3
honeymonn 14
honeymoon 19019
honeymoon's 12
honeymooner 3
honeymooner's 6
honeymooners 771
honeymooning 29
honeymoonitis 6
honeymoons 5259
honneymoon 6
honneymoons 9
honnymoon 4
honoeymoon 3
honymoon 19
huneymoon 10
honey moon 333
honey moon's 5
honey mooners 34
honey moons 136
honney moon 6
hony moon 4
Basic Assumptions about the “Noise”
• query logs contain a lot of different misspellings for most words
• the better spelled a word form, the more frequent it is
• the correct forms are much more frequent than their misspellings
Another Example
albert einstein 4834
albert einstien 525
albert einstine 149
albert einsten 27
albert einsteins 25
albert einstain 11
albert einstin 10
albert eintein 9
albeart einstein 6
aolbert einstein 6
alber einstein 4
albert einseint 3
albert einsteirn 3
albert einsterin 3
albert eintien 3
alberto einstein 3
albrecht einstein 3
alvert einstein 3
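To make the frequency assumption concrete, the short sketch below takes the eight most frequent forms from the list above and checks how dominant the top form is. The helper name `dominant_form` is hypothetical and used only for illustration.

```python
# Counts copied from the slide (the eight most frequent forms only).
COUNTS = {
    "albert einstein": 4834, "albert einstien": 525, "albert einstine": 149,
    "albert einsten": 27, "albert einsteins": 25, "albert einstain": 11,
    "albert einstin": 10, "albert eintein": 9,
}

def dominant_form(counts):
    """Return the most frequent form and its share of all listed occurrences."""
    best = max(counts, key=counts.get)
    share = counts[best] / sum(counts.values())
    return best, share

best, share = dominant_form(COUNTS)
```

Here the correctly spelled form accounts for well over half of the listed occurrences, which is what lets raw frequency act as a signal for correctness.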
Concatenation and Splitting
s0: britenetspear inconcert (l0 = 2 words)
s1: britneyspears in concert (l1 = 3 words)
s2: britney spears in concert (l2 = 4 words)
s3: britney spears in concert
Store word unigrams and bigrams in the same searchable trie structure.
Find alternative spellings for the input words in this common structure.
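A rough sketch of why one shared structure helps: if bigrams are stored with their space as an ordinary character, a fuzzy lookup for a run-together word can return a two-word entry, and the reverse works for wrongly split words. The `Trie` class, its method names, and the counts are assumptions made for illustration; the slides do not describe the actual data structure in more detail.

```python
class Trie:
    """Character trie holding both unigrams and bigrams (space is a normal char)."""

    def __init__(self):
        self.children = {}
        self.count = 0  # > 0 marks the end of a stored unigram or bigram

    def insert(self, key, count):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.count = count

    def search(self, word, max_dist):
        """Return {stored key: edit distance} for keys within max_dist of word."""
        results = {}
        first_row = list(range(len(word) + 1))
        for ch, child in self.children.items():
            child._walk(ch, ch, word, first_row, max_dist, results)
        return results

    def _walk(self, ch, prefix, word, prev_row, max_dist, results):
        # Extend the Levenshtein table by one row for the trie character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(word) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (word[j - 1] != ch)))
        if self.count and row[-1] <= max_dist:
            results[prefix] = row[-1]
        # Prune: descend only if some prefix alignment is still within budget.
        if min(row) <= max_dist:
            for nxt, child in self.children.items():
                child._walk(nxt, prefix + nxt, word, row, max_dist, results)

trie = Trie()
for key, n in [("britney", 900), ("spears", 800), ("britney spears", 700),
               ("in", 5000), ("concert", 400), ("in concert", 50)]:
    trie.insert(key, n)  # invented counts (assumption)

joined = trie.search("britneyspears", 1)
split_needed = trie.search("inconcert", 1)
```

Both lookups return the stored bigram at distance 1, so splitting is handled as just another single-character edit.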
Avoid Changing the User’s Intent
input query: brita water filer
alternative spellings:
    brita: brita, brit, brit., brits, briat, rita
    water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
    filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
alternatives that would change the intent: brit, waiter, file
Modified Viterbi Search – Fringes
[Figure: Viterbi trellis for a seven-word query w1 ... w7; each word wi has alternative spellings ai1 ... aik; words are labeled as in-lexicon words, stop words, or unknown words]
e.g.: water filer → waiter file
in-lexicon words: k1·k2 → k1+k2 paths
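One reading of the fringe restriction, which the slide does not spell out, is that two adjacent in-lexicon words may not both change at once. Under that assumption the transitions between them shrink from a product to roughly a sum, and a rewrite such as "water filer" to "waiter file" is never considered. The function name, the candidate lists, and `LEXICON` below are hypothetical.

```python
from itertools import product

def allowed_transitions(orig1, alts1, orig2, alts2, lexicon):
    """List candidate pairs for two adjacent words under the assumed restriction."""
    pairs = list(product([orig1] + alts1, [orig2] + alts2))
    if orig1 in lexicon and orig2 in lexicon:
        # Keep only pairs in which at least one word stays as the user typed it.
        pairs = [(a, b) for a, b in pairs if a == orig1 or b == orig2]
    return pairs

LEXICON = {"water", "waiter", "wafer", "filer", "file", "filter", "filler"}

unrestricted = list(product(["water", "waiter", "wafer"],
                            ["filer", "file", "filter", "filler"]))
restricted = allowed_transitions("water", ["waiter", "wafer"],
                                 "filer", ["file", "filter", "filler"], LEXICON)
```

With 3 and 4 candidates (originals included), the unrestricted product has 12 pairs while the restricted set has 3 + 4 - 1 = 6, and it still contains the wanted pair ("water", "filter").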
Modified Viterbi Search – Stop words
[Figure: two trellis diagrams over the words w1 ... w7 and their alternative spellings ai1 ... aik]
e.g.: lord of teh rigs → lord of the rings
Evaluation
                     All queries   Valid   Misspelled
Nr. queries          1044          864     180
Full system          81.8          84.8    67.2
No lexicon           70.3          72.2    61.1
No query log         77.0          82.1    52.8
All edits equal      80.4          83.3    66.1
Unigrams only        54.7          57.4    41.7
1 iteration only     80.9          88.0    47.2
2 iterations only    81.3          84.4    66.7
No fringes           80.6          83.3    67.2
A Closer Look at the Results
• 81.8% overall agreement with the annotators
• Errors:
    – alternative queries for valid queries
      many false positives are reasonable suggestions
      e.g. cowboy robes → cowboy ropes
    – alternative queries for misspelled queries
      some suggestions could be valid (user’s intent not known)
      e.g. massanger → massager / messenger
annotator inter-agreement rate: 91.3%
Evaluation – When we “know” user’s intent
368 queries
(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms
Full system          73.1
No lexicon           59.2
No query log         44.9
All edits equal      69.9
Unigrams only        43.0
1 iteration only     45.5
2 iterations only    68.2
No fringes           71.0
Learning Curve
                      1 month   2 months   3 months   4 months
All queries           80.7      81.8       81.6       81.2
Misspelled queries    66.1      67.2       68.9       69.4
Silviu Cucerzan and Eric Brill – “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004