Spelling Correction for Advertising: How “Noise” Can Help Silviu Cucerzan Microsoft Research Text Mining Search and Navigation NISS Workshop on Computational Advertising, November 2009

Page 1: Spelling  Correction for Advertising: How “Noise” Can Help

Text Mining Search and Navigation

Spelling Correction for Advertising: How “Noise” Can Help

Silviu Cucerzan
Microsoft Research
Text Mining Search and Navigation

NISS Workshop on Computational Advertising, November 2009

Page 2: Spelling  Correction for Advertising: How “Noise” Can Help


Buying Cheap(er) on eBay

Cannon 30d

Canon 30d

Not good for the sellers.

Not good for most buyers.

Not good for the middle man.

Page 3: Spelling  Correction for Advertising: How “Noise” Can Help


Good Ads for Bad Queries

epresso machines
espesso machines
espreso machines
espressomachines
esspreso machines
esspresso machines
expresso machines
exspresso machines
→ espresso machines

singular wireless
cingulair wireless
cigular wireless
cingulare wireless
cingullar wireless
cinguilar wireles
cingluarwireless
circular wireless
→ cingular wireless

Page 4: Spelling  Correction for Advertising: How “Noise” Can Help


Is a Trusted Dictionary Enough?

• Search:
  max payne chats and codes (chats → cheats)
  new humwee pics

• Music:
  selin dion color of my love (selin → celine, color → colour)
  cristina aquillara (→ christina aguilera)

• Shopping:
  pansonic dvd reorders (pansonic → panasonic, reorders → recorders)
  brita water filer (filer → filter)

• Help and Support:
  printer divers for window vista (divers → drivers, window → windows)
  insert flash flies into power point (flies → files, power point → powerpoint)

Page 5: Spelling  Correction for Advertising: How “Noise” Can Help


Web Query Logs as Corpora

• Web Search: over to 1 billion queries per day!

• 10-15% of the queries contain spelling errors

• highly dynamic domain: many new names and concepts become popular every day
  extremely difficult to maintain a high-coverage lexicon

• difficult to define what a valid web query is
  e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx

The problem

The solution

Page 6: Spelling  Correction for Advertising: How “Noise” Can Help


Problems To Be Handled

Concatenate and split
cheese cake factory → cheesecake factory
chat inspanich → chat in spanish

Recognize out-of-lexicon valid words
amd processors → amd processors

Change in-lexicon words to out-of-lexicon words
gun dam fighter → gundam fighter

Context-sensitive correction of out-of-lexicon words
power crd → power cord
video crd → video card

Context-sensitive correction of in-lexicon words
chicken sop → chicken soup
sop opera → soap opera

Page 7: Spelling  Correction for Advertising: How “Noise” Can Help


An HMM Architecture for Spelling Correction

input query: brita water filer

states: all alternative spellings from the query log

word 1: brita, brit, brit., brits, bria, trita
word 2: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
word 3: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
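
A lattice like this can be decoded with the standard Viterbi algorithm. The following is a minimal sketch under stated assumptions: the candidate lists are shortened from the slide, while the bigram scores, the `channel` penalty, and all function and variable names are invented for illustration and are not values or names from the talk.

```python
import math

# Candidate states per input word (shortened from the slide).
CANDIDATES = [
    ["brita", "brit", "brits"],
    ["water", "waiter", "later"],
    ["filer", "filter", "file"],
]

# Hypothetical bigram log-probabilities; unseen pairs fall back to DEFAULT.
BIGRAM = {
    ("brita", "water"): math.log(0.2),
    ("water", "filter"): math.log(0.3),
    ("water", "filer"): math.log(0.001),
    ("waiter", "file"): math.log(0.002),
}
DEFAULT = math.log(1e-6)


def channel(observed, candidate):
    """Toy emission score: fixed penalty when the candidate differs from the input."""
    return 0.0 if observed == candidate else math.log(0.1)


def viterbi(query_words, candidates):
    # best[i][c] = (score of the best path ending in candidate c, previous state)
    best = [{c: (channel(query_words[0], c), None) for c in candidates[0]}]
    for i in range(1, len(query_words)):
        column = {}
        for c in candidates[i]:
            prev, score = max(
                ((p, s + BIGRAM.get((p, c), DEFAULT)) for p, (s, _) in best[-1].items()),
                key=lambda item: item[1],
            )
            column[c] = (score + channel(query_words[i], c), prev)
        best.append(column)
    # Trace back from the highest-scoring final state.
    state = max(best[-1], key=lambda c: best[-1][c][0])
    path = [state]
    for i in range(len(best) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]


result = viterbi(["brita", "water", "filer"], CANDIDATES)
print(" ".join(result))
```

With these made-up scores the decoder keeps "brita" and "water" and changes only "filer" to "filter"; different scores would of course give different paths.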

Page 8: Spelling  Correction for Advertising: How “Noise” Can Help


What about terrible misspellings?

• input: arnol shwartzeggar

• desired output: arnold schwarzenegger

unweighted edit distance: 5
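
For reference, unweighted edit distance charges one unit for each insertion, deletion, or substitution. The sketch below is the textbook dynamic-programming computation; the function name is illustrative, and transpositions are not treated as single edits, which may differ from the measure used in the talk.

```python
def edit_distance(source, target):
    """Unweighted Levenshtein distance: insert, delete, and substitute all cost 1."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("filer", "filter"))
```

A speller that only searches within a small distance of the input cannot reach a target that is many edits away in a single pass.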

Page 9: Spelling  Correction for Advertising: How “Noise” Can Help


An Iterative Approach

Misspelled query: arnol shwartzeggar

Speller output:
First iteration: arnold schwartzneggar
Second iteration: arnold schwartzenegger
Third iteration: arnold schwaxrzenegger
Fourth iteration: arnold schwarzenegger
(no more changes)
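
The iteration itself is a fixed-point loop around a single-pass speller. In the sketch below, `correct_once` stands in for one speller pass, and the `TOY_STEPS` table (including its intermediate form) is a hypothetical stand-in, not the real speller or data from the talk.

```python
def correct_iteratively(query, correct_once, max_iterations=10):
    """Apply a one-pass corrector repeatedly until its output stops changing."""
    history = [query]
    for _ in range(max_iterations):
        suggestion = correct_once(history[-1])
        if suggestion == history[-1]:
            break  # no more changes
        history.append(suggestion)
    return history


# Hypothetical one-pass corrections, for illustration only.
TOY_STEPS = {
    "hunny moon": "honey moon",
    "honey moon": "honeymoon",
}

history = correct_iteratively("hunny moon", lambda q: TOY_STEPS.get(q, q))
print(" -> ".join(history))
```

The iteration cap guards against a corrector that oscillates between two forms.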

Page 10: Spelling  Correction for Advertising: How “Noise” Can Help


Search Query Log Statistics: Some Intuition

hunny moon → (iterative spelling correction process) → honeymoon

Forms in the query log, with counts:
honemoon 8
honemoons 3
honeybeemon 3
honeymonn 14
honeymoon 19019
honeymoon's 12
honeymooner 3
honeymooner's 6
honeymooners 771
honeymooning 29
honeymoonitis 6
honeymoons 5259
honneymoon 6
honneymoons 9
honnymoon 4
honoeymoon 3
honymoon 19
huneymoon 10
honey moon 333
honey moon's 5
honey mooners 34
honey moons 136
honney moon 6
hony moon 4

Page 11: Spelling  Correction for Advertising: How “Noise” Can Help


Basic Assumptions about the “Noise”

• query logs contain a lot of different misspellings for most words

• the better spelled a word form, the more frequent it is

• the correct forms are much more frequent than their misspellings
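
Under these assumptions, a single correction step can simply move a word toward a more frequent nearby form. The sketch below uses a subset of the "honeymoon" counts shown earlier in the deck; the one-edit neighbourhood and the names `one_edit_away` and `better_form` are simplifying assumptions for illustration, not the speller described in the talk.

```python
import string

# Subset of the query-log counts from the "hunny moon" slide.
COUNTS = {
    "honeymoon": 19019, "honeymoons": 5259, "honeymooners": 771,
    "honymoon": 19, "honeymonn": 14, "huneymoon": 10, "honemoon": 8,
}


def one_edit_away(word):
    """All strings reachable by one deletion, insertion, or substitution."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    inserts = {left + ch + right for left, right in splits for ch in letters}
    replaces = {left + ch + right[1:] for left, right in splits if right for ch in letters}
    return deletes | inserts | replaces


def better_form(word, counts):
    """Return the most frequent logged form within one edit, or the word itself."""
    candidates = {w for w in one_edit_away(word) if w in counts}
    candidates.add(word)
    return max(candidates, key=lambda w: counts.get(w, 0))


print(better_form("huneymoon", COUNTS))
```

Because the correct form dominates the counts, each of the logged misspellings in this subset is pulled to "honeymoon", while "honeymoon" itself is left unchanged.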

Page 12: Spelling  Correction for Advertising: How “Noise” Can Help


Another Example

albert einstein 4834
albert einstien 525
albert einstine 149
albert einsten 27
albert einsteins 25
albert einstain 11
albert einstin 10
albert eintein 9
albeart einstein 6
aolbert einstein 6
alber einstein 4
albert einseint 3
albert einsteirn 3
albert einsterin 3
albert eintien 3
alberto einstein 3
albrecht einstein 3
alvert einstein 3

Page 13: Spelling  Correction for Advertising: How “Noise” Can Help


Concatenation and Splitting

s0: britenetspear inconcert
s1: britneyspears in concert
s2: britney spears in concert
s3: britney spears in concert

Store word unigrams and bigrams in the same searchable trie structure.

Find alternative spellings for the input words in this common structure.
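
One way to picture this: if bigrams are stored with their space as an ordinary character, a fuzzy lookup of a run-together token can land on a two-word entry at the cost of one inserted space. The sketch below is an assumption-laden illustration (a plain character trie searched with a bounded edit distance); the class and function names are hypothetical and the entries are toy data.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_entry = False


def insert(root, entry):
    node = root
    for ch in entry:
        node = node.children.setdefault(ch, TrieNode())
    node.is_entry = True


def fuzzy_search(root, query, max_distance):
    """Return {entry: distance} for stored entries within max_distance edits of query."""
    results = {}

    def walk(node, prefix, previous_row):
        if node.is_entry and previous_row[-1] <= max_distance:
            results[prefix] = previous_row[-1]
        for ch, child in node.children.items():
            # Extend the edit-distance table by one row for this trie character.
            row = [previous_row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1, previous_row[j] + 1, previous_row[j - 1] + cost))
            if min(row) <= max_distance:  # prune branches that cannot get close again
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(query) + 1)))
    return results


root = TrieNode()
# Unigrams and bigrams share one structure; the space is just another character.
for entry in ["britney", "spears", "britney spears", "in", "concert", "in concert"]:
    insert(root, entry)

split_hits = fuzzy_search(root, "britneyspears", 1)
joined_hits = fuzzy_search(root, "inconcert", 1)
print(split_hits, joined_hits)
```

Concatenation works the same way in reverse: a query bigram such as "cheese cake" is one deleted space away from a stored unigram "cheesecake".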

Page 14: Spelling  Correction for Advertising: How “Noise” Can Help


Avoid Changing the User’s Intent

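input query: brita water filer

alternative spellings:
word 1: brita, brit, brit., brits, bria, trita
word 2: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
word 3: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier

brita water filer → brit waiter file (a change to avoid)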

Page 15: Spelling  Correction for Advertising: How “Noise” Can Help


Modified Viterbi Search – Fringes

[Figure: Viterbi trellis over query words w1 to w7, each with candidate spellings a1, a2, ..., ak; columns are labeled as stop word, unknown word, or in-lexicon words]

e.g.: water filer vs. waiter file

k1·k2 → k1+k2 paths (between adjacent in-lexicon words)
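
One reading of the path count on this slide, offered as an interpretation rather than something the transcript states: for two adjacent in-lexicon words, a transition is kept only if at most one of the two words changes, so the k1·k2 candidate pairs shrink to roughly k1+k2. The sketch below enumerates the surviving pairs; the function name and the shortened candidate lists are illustrative.

```python
from itertools import product


def allowed_transitions(original_a, candidates_a, original_b, candidates_b):
    """Candidate pairs for two adjacent words in which at most one word is changed."""
    return [
        (a, b)
        for a, b in product(candidates_a, candidates_b)
        if a == original_a or b == original_b
    ]


# Shortened candidate lists; each includes the original word.
water = ["water", "waiter", "wader", "wafer"]
filer = ["filer", "file", "filter", "filler"]

pairs = allowed_transitions("water", water, "filer", filer)
print(len(pairs), "of", len(water) * len(filer), "pairs kept")
```

When each list contains the original word, exactly k1+k2-1 pairs survive: "water filter" stays reachable, while "waiter file", which changes both words at once, does not.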

Page 16: Spelling  Correction for Advertising: How “Noise” Can Help


Modified Viterbi Search – Stop words

[Figure: two Viterbi trellises over query words w1 to w7, each with candidate spellings a1, a2, ..., ak]

e.g.: lord of teh rigs → lord of the rings

Page 17: Spelling  Correction for Advertising: How “Noise” Can Help


Evaluation

                     All queries   Valid   Misspelled
Nr. queries          1044          864     180
Full system          81.8          84.8    67.2
No lexicon           70.3          72.2    61.1
No query log         77.0          82.1    52.8
All edits equal      80.4          83.3    66.1
Unigrams only        54.7          57.4    41.7
1 iteration only     80.9          88.0    47.2
2 iterations only    81.3          84.4    66.7
No fringes           80.6          83.3    67.2
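
The "All queries" column appears to be the size-weighted average of the "Valid" and "Misspelled" columns. The short check below recomputes it from the rounded figures in the table; the variable and function names are illustrative, and small deviations are expected because the inputs are already rounded to one decimal.

```python
N_VALID, N_MISSPELLED = 864, 180

# (all queries, valid, misspelled), copied from the evaluation table.
ROWS = {
    "Full system": (81.8, 84.8, 67.2),
    "No lexicon": (70.3, 72.2, 61.1),
    "No query log": (77.0, 82.1, 52.8),
    "All edits equal": (80.4, 83.3, 66.1),
    "Unigrams only": (54.7, 57.4, 41.7),
    "1 iteration only": (80.9, 88.0, 47.2),
    "2 iterations only": (81.3, 84.4, 66.7),
    "No fringes": (80.6, 83.3, 67.2),
}


def pooled(valid_score, misspelled_score):
    """Weighted average of the two subsets by their query counts."""
    total = N_VALID + N_MISSPELLED
    return (N_VALID * valid_score + N_MISSPELLED * misspelled_score) / total


deviations = {
    name: abs(pooled(valid, misspelled) - all_queries)
    for name, (all_queries, valid, misspelled) in ROWS.items()
}
for name, gap in deviations.items():
    print(f"{name}: recomputed value differs by {gap:.2f}")
```

Every row agrees with the reported overall figure to within 0.1, which is consistent with rounding.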

Page 18: Spelling  Correction for Advertising: How “Noise” Can Help


A Closer Look at the Results

• 81.8% overall agreement with the annotators
  (annotator inter-agreement rate: 91.3%)

• Errors:
  – alternative queries for valid queries
    many false positives are reasonable suggestions
    e.g. cowboy robes → cowboy ropes
  – alternative queries for misspelled queries
    some suggestions could be valid (user’s intent not known)
    e.g. massanger → massager / messenger

Page 19: Spelling  Correction for Advertising: How “Noise” Can Help


Evaluation – When we “know” user’s intent

368 queries

Full system          73.1
No lexicon           59.2
No query log         44.9
All edits equal      69.9
Unigrams only        43.0
1 iteration only     45.5
2 iterations only    68.2
No fringes           71.0

(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms

Page 20: Spelling  Correction for Advertising: How “Noise” Can Help


Learning Curve

[Chart: y-axis from 65 to 85; x-axis from 1 month to 4 months]

                     1 month   2 months   3 months   4 months
All queries          80.7      81.8       81.6       81.2
Misspelled queries   66.1      67.2       68.9       69.4

Silviu Cucerzan and Eric Brill – “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004