a spelling correction program based on a noisy channel model


  • 7/30/2019 A Spelling Correction Program Based on a Noisy Channel Model


A Spelling Correction Program Based on a Noisy Channel Model

Mark D. Kernighan, Kenneth W. Church, William A. Gale
AT&T Bell Laboratories
600 Mountain Ave., Murray Hill, N.J., USA

Abstract

This paper describes a new program, correct, which takes words rejected by the Unix spell program, proposes a list of candidate corrections, and sorts them by probability. The probability scores are the novel contribution of this work. Probabilities are based on a noisy channel model. It is assumed that the typist knows what words he or she wants to type but some noise is added on the way to the keyboard (in the form of typos and spelling errors). Using a classic Bayesian argument of the kind that is popular in the speech recognition literature (Jelinek, 1985), one can often recover the intended correction, c, from a typo, t, by finding the correction c that maximizes Pr(c) Pr(t|c). The first factor, Pr(c), is a prior model of word probabilities; the second factor, Pr(t|c), is a model of the noisy channel that accounts for spelling transformations on letter sequences (e.g., insertions, deletions, substitutions and reversals). Both sets of probabilities were trained on data collected from the Associated Press (AP) newswire. This text is ideally suited for this purpose since it contains a large number of typos (about two thousand per month).

1. Introduction

The correct program reads a list of misspelled words from the input stream (stdin), and prints a set of candidate corrections for each word on the output stream (stdout). Correct also produces a probability along with each correction (unless there is only one candidate correction). Here is some sample output produced by the Unix command, "spell < paper | correct," where paper is a text file containing the misspelled words in column 1:

Typo             Corrections
detered          deterred (100%)  metered (0%)  petered (0%)
laywer           lawyer (100%)  layer (0%)  lawer (0%)
negotations      negotiations
notcampaigning   ??? [1]
progession       progression (94%)  procession (4%)  profession (2%)
ususally         usually

2. Proposing Candidate Corrections

The first stage of correct finds words on a fixed list that differ from the typo t by a single insertion, deletion, substitution or reversal. The list was collected from many sources, including spell, the AP newswire, and several machine readable dictionaries. For example, given the input typo, acress, the first stage generates candidate corrections in the table below. Thus, the correct word actress could be transformed by the noisy channel into the typo acress by replacing the t with nothing, @, at position 2. [2] This unusually difficult example was selected to illustrate the four transformations; most typos have just a few possible corrections, and there is rarely more than one plausible correction.

Typo     Correction   Transformation
acress   actress      @   t   2  deletion
acress   cress        a   #   0  insertion
acress   caress       ac  ca  0  reversal
acress   access       r   c   2  substitution
acress   across       e   o   3  substitution
acress   acres        s   #   4  insertion
acress   acres        s   #   5  insertion

1. ??? indicates that no correction was found.
2. The symbols @ and # represent nulls in the typo and correction, respectively. The transformations are named from the point of view of the correction, not the typo.
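The candidate-proposal stage described in Section 2 can be sketched in Python. This is a minimal illustration and not the authors' implementation: the function name `candidates`, the tiny `WORDS` list, and the brute-force search over a 26-letter alphabet are assumptions made for the example. Transformations are named from the point of view of the correction, as in footnote 2.

```python
from typing import Iterable, List, Tuple

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def candidates(typo: str, word_list: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (correction, transformation, position) triples that are a
    single insertion, deletion, substitution or reversal away from typo."""
    words = set(word_list)
    found: List[Tuple[str, str, int]] = []
    n = len(typo)

    # Deletion: the typist dropped the letter at position p of the correction.
    for p in range(n + 1):
        for ch in ALPHABET:
            c = typo[:p] + ch + typo[p:]
            if c in words:
                found.append((c, "deletion", p))

    # Insertion: the typist added an extra letter at position p of the typo.
    # A doubled letter yields the same correction at two positions.
    for p in range(n):
        c = typo[:p] + typo[p + 1:]
        if c in words:
            found.append((c, "insertion", p))

    # Substitution: one letter of the correction was typed as another.
    for p in range(n):
        for ch in ALPHABET:
            if ch != typo[p]:
                c = typo[:p] + ch + typo[p + 1:]
                if c in words:
                    found.append((c, "substitution", p))

    # Reversal: two adjacent letters of the correction were swapped.
    for p in range(n - 1):
        if typo[p] != typo[p + 1]:
            c = typo[:p] + typo[p + 1] + typo[p] + typo[p + 2:]
            if c in words:
                found.append((c, "reversal", p))

    return found


# Hypothetical word list containing only the corrections from the acress example.
WORDS = ["actress", "cress", "caress", "access", "across", "acres"]
result = candidates("acress", WORDS)
for correction, kind, position in result:
    print(correction, kind, position)
```

With this word list the sketch proposes the same seven candidates, transformation types and positions as the acress table, including acres twice (insertions at positions 4 and 5).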


3. Scoring

Each candidate correction, c, is scored by Pr(c) Pr(t|c), and then normalized by the sum of the scores for all proposed candidates. The prior, Pr(c), is estimated by (freq(c) + 0.5)/N, where freq(c) is the number of times that the word c appears in the 1988 AP corpus (N = 44 million words). [3] The conditional probabilities, Pr(t|c), are computed from four confusion matrices (see appendix): (1) del[x,y], the number of times that the characters xy (in the correct word) were typed as x in the training set, (2) add[x,y], the number of times that x was typed as xy, (3) sub[x,y], the number of times that y was typed as x, and (4) rev[x,y], the number of times that xy was typed as yx. Probabilities are estimated from these matrices by dividing by chars[x,y] or chars[x], the number of times that xy and x appeared in the training set, respectively. [4]

\Pr(t \mid c) =
\begin{cases}
\dfrac{\mathrm{del}[c_{p-1}, c_p]}{\mathrm{chars}[c_{p-1}, c_p]}, & \text{if deletion} \\[2ex]
\dfrac{\mathrm{add}[c_{p-1}, t_p]}{\mathrm{chars}[c_{p-1}]}, & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[t_p, c_p]}{\mathrm{chars}[c_p]}, & \text{if substitution} \\[2ex]
\dfrac{\mathrm{rev}[c_p, c_{p+1}]}{\mathrm{chars}[c_p, c_{p+1}]}, & \text{if reversal}
\end{cases}

where c_p is the p-th character of c, and likewise t_p is the p-th character of t. The five matrices are computed with a bootstrapping procedure. Initially assume a uniform distribution over the possible confusions. Then run the program over the training set (1988 AP corpus) to find corrections for the words that spell rejects. Use these corrections to update the confusion matrices, and iterate. The matrices are smoothed using the Good-Turing method (Good, 1953).

3. Following Box and Tiao (1973), we can assume an uninformative prior and reach a posterior distribution for p. The expectation of this distribution amounts to using r+.5 instead of r. We call this the expected likelihood estimate. See Gale and Church (1990) for a discussion of the shortcomings of this method.
4. The chars matrices can be easily replicated, and are therefore omitted from the appendix.
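The prior and the four-way channel formula can be sketched as two small Python functions. This is an illustration under stated assumptions, not the authors' code: the names `prior` and `channel_prob`, the dictionary layout, and the use of "@" as the preceding character when the edit is at position 0 are assumptions. The counts are the handful quoted in the paper's acress example; smoothing and the bootstrapping procedure are not shown.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]

# Only the counts quoted in the acress example are filled in here.
DEL: Dict[Pair, float] = {("c", "t"): 55}
ADD: Dict[Pair, float] = {("e", "s"): 417, ("s", "s"): 205, ("@", "a"): 46}
SUB: Dict[Pair, float] = {("e", "o"): 93}
REV: Dict[Pair, float] = {}
# chars[x] and chars[x,y]: how often x (or xy) appeared in the training set.
CHARS: Dict[str, float] = {
    "ct": 470_000, "e": 13_000_000, "s": 6_000_000,
    "@": 32_000_000, "o": 10_000_000,
}
N_WORDS = 44_000_000  # size of the 1988 AP corpus, from the text


def prior(freq_c: int, n_words: int = N_WORDS) -> float:
    """Expected likelihood estimate of Pr(c): (freq(c) + 0.5) / N."""
    return (freq_c + 0.5) / n_words


def channel_prob(kind: str, typo: str, corr: str, p: int) -> float:
    """Pr(t|c) for a single edit of the given kind at position p."""
    # Assumption: "@" stands for the null before the first letter.
    prev = corr[p - 1] if p > 0 else "@"
    if kind == "deletion":
        return DEL[prev, corr[p]] / CHARS[prev + corr[p]]
    if kind == "insertion":
        return ADD[prev, typo[p]] / CHARS[prev]
    if kind == "substitution":
        return SUB[typo[p], corr[p]] / CHARS[corr[p]]
    if kind == "reversal":
        return REV[corr[p], corr[p + 1]] / CHARS[corr[p] + corr[p + 1]]
    raise ValueError(f"unknown transformation: {kind}")


# actress typed as acress: the t at position 2 was deleted.
print(channel_prob("deletion", "acress", "actress", 2))
# acres typed as acress: an s was inserted after e (position 4).
print(channel_prob("insertion", "acress", "acres", 4))
print(prior(1343))
```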

Returning to the acress example, the seven proposed transformations are scored by multiplying the prior probability (which is proportional to 0.5 + column 4 in the table below) and the channel probability (column 5) to form a raw score (column 3), which is normalized to produce probabilities (column 2). The final results are: acres (45%), actress (37%), across (18%), access (0%), caress (0%), cress (0%). This example is very hard; in fact, the second choice is probably right, as can be seen from the context: ... was called a "stellar and versatile acress whose combination of sass and glamour has defined her... The program would need a much better prior model in order to handle this case. In the future, a program might be able to take advantage of the fact that actress is considerably more plausible than acres as an antecedent for whose.

c         %     Raw    freq(c)   Pr(t|c)
actress   37%   .157   1343      55./470,000
cress      0%   .000      0      46./32,000,000
caress     0%   .000      4      .95/580,000
access     0%   .000   2280      .98/4,700,000
across    18%   .077   8436      93./10,000,000
acres     21%   .092   2879      417./13,000,000
acres     23%   .098   2879      205./6,000,000
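The arithmetic behind the acress table can be reproduced with a short Python sketch. The variable names are illustrative, and the sketch assumes that the raw score is (freq(c) + 0.5) times the channel fraction shown in the table; the 1/N factor of the prior is dropped because it cancels when the scores are normalized. Because the table's denominators are rounded, individual rows may differ from the printed values in the last digit.

```python
# (correction, freq(c), channel numerator, channel denominator), from the table.
rows = [
    ("actress", 1343, 55, 470_000),
    ("cress", 0, 46, 32_000_000),
    ("caress", 4, 0.95, 580_000),
    ("access", 2280, 0.98, 4_700_000),
    ("across", 8436, 93, 10_000_000),
    ("acres", 2879, 417, 13_000_000),
    ("acres", 2879, 205, 6_000_000),
]

# Raw score: prior (up to the constant 1/N) times channel probability.
raw = [(word, (freq + 0.5) * num / den) for word, freq, num, den in rows]
total = sum(score for _, score in raw)

# Normalize, adding together the two ways of reaching "acres".
percent = {}
for word, score in raw:
    percent[word] = percent.get(word, 0.0) + 100 * score / total

for word, pct in sorted(percent.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {pct:5.1f}%")
```

Summing the two acres rows gives the 45% reported in the text, ahead of actress at 37% and across at 18%.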

4. Evaluation

Many typos such as absorbant have just one candidate correction, but others such as adusted have multiple corrections. The table below shows examples of typos with fewer than ten candidate corrections, the corrections ordered by likelihood.

#   Typo               Corrections
0   admininistration
1   absorbant          absorbent
2   adusted            adjusted dusted
3   ambitios           ambitious ambitions ambition
4   compatability      compatibility compactability comparability computability
5   afte               after fate aft ate ante
6   dialy              daily diary dials dial dimly dilly
7   poice              police price voice poise pice ponce poire
8   piots              pilots pivots riots plots pits pots pints pious
9   spash              splash smash slash spasm stash swash sash pash spas

Most typos have relatively few candidate corrections. The table below shows the number of typos [5] broken out by the number of corrections in seven month-long samples of the AP newswire. In March, for example, there were 720 typos with 0 corrections, 1120 typos with 1 correction, 269 with 2 corrections, etc. The final column shows that there is a general trend for fewer choices, though the 0-choice case is special. (The system was trained on the AP wire from 2/88 to 2/89; the results below were computed from AP wire during 3/89-9/89).

#       March   April   May    June   July   Aug    Sept    Total
0         720     604    542    606    492    465    508     3937
1        1120     997   1037   1007    958    944    930     6993
2         269     224    209    223    199    224    214     1562
3         109      92     89    101     79     87     82      639
4          58      57     62     45     43     59     43      367
5          54      41     20     26     28     24     28      221
6          35      22     19     19     22     17     23      157
7          20      11     13      7     11     15     17       94
8          19      14     14      5      7      7     16       82
9          15      11      6     11     10      8     16       77
10+       154      97     79     75     53     77     78      613
Total    2573    2170*  2090   2125   1902   1927   1955*   14742*

* Partly illegible in the scanned source; the value shown is the sum of the legible entries in that column.

We decided to look at the 2-candidate case in more detail in order to test how often the top scoring candidate agreed with a panel of three judges. The judges were given 564 triples and a few concordance lines:

    absurb absorb absurd
    financial community. *E* *S* "It is absurb and probably obscene for any person so engaged to und

The first word of the triple was a spell reject; the other two were the candidates (in alphabetical order). The judges were given a 5-way forced choice. They could circle any one of the three words, if they thought that was what the author had intended. Alternatively, if they thought that the author had intended something else, they could write down "other". Finally, if they weren't sure, they could write "?". The distribution of responses is shown in the following table.

           Judge 1   Judge 2   Judge 3
choice 0        99       124        93
choice 1       188       176       167
choice 2       175       159       151
other           28        26        30
?               74        79       123
total          564       564       564

The results show that spell is rejecting too many words, since choice 0 (spell error) is selected about 20% of the time. In these cases, correct was given a non-problem to correct:

    acquirees acquirers acquires
    be acquirers, as they have been, than acquirees. *E* *S* If the industrials had attracted bids tit

Since we were mostly concerned with evaluating the scoring function, we didn't want to be distracted with errors in spell and other problems that are beyond the scope of this paper. Therefore, we decided to consider only those cases where at least two judges circled one of the two candidates, and they agreed with each other. This left 329 triples.

The following table shows that correct agrees with the majority of the judges in 87% of the 329 cases of interest. In order to help calibrate this result, three inferior methods are also evaluated. The no-prior method ignores the prior probability. The no-channel method ignores the channel probability. Finally, the neither method ignores both probabilities and selects the first candidate in all cases. As the following table shows, correct is significantly better than the three inferior alternatives. Both the channel and the prior probabilities provide a significant contribution, and the combination is significantly better than either in isolation. The second half of the table evaluates the judges against one another and shows that they significantly out-perform correct, indicating that there is plenty of room for further improvement. [6] All three judges found the task more difficult and time consuming than they had expected.

5. For the purposes of this experiment, a typo is a lowercase word rejected by the Unix spell program.
6. Judges were only scored on triples for which they selected "1" or "2," and for which the other two judges agreed on "1" or "2." A triple was scored "correct" for one judge if that judge agreed with the other two and "incorrect" if that judge disagreed with the other two.


Each judge spent about half a day grading the 564 triples.

Method       Discrimination   %
correct      286/329          87 ± 1.9
no-prior     263/329          80 ± 2.2
no-channel   247/329          75 ± 2.4
neither      172/329          52 ± 2.8
Judge 1      271/273          99 ± 0.5
Judge 2      271/275          99 ± 0.7
Judge 3      271/281          96 ± 1.1
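The percentages and error margins in the discrimination table can be checked with a few lines of Python. The paper does not say how the ± figures were obtained; the sketch below assumes they are binomial standard errors, sqrt(p(1-p)/n), which reproduces every row of the table. The name `pct_and_se` is illustrative.

```python
import math

# (agreements, cases scored) for each method, taken from the table.
results = {
    "correct": (286, 329),
    "no-prior": (263, 329),
    "no-channel": (247, 329),
    "neither": (172, 329),
    "Judge 1": (271, 273),
    "Judge 2": (271, 275),
    "Judge 3": (271, 281),
}


def pct_and_se(k: int, n: int):
    """Percentage correct and its binomial standard error, both in percent."""
    p = k / n
    return 100 * p, 100 * math.sqrt(p * (1 - p) / n)


summary = {}
for method, (k, n) in results.items():
    pct, se = pct_and_se(k, n)
    summary[method] = (round(pct), round(se, 1))
    print(f"{method:11s} {k}/{n}  {pct:.0f} +/- {se:.1f}")
```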

We were also interested in testing whether the score predicted accuracy. The figure at the end of this paper shows that this is indeed so. The horizontal axis shows the score from one of the three predictors (as the lines are labeled) averaged over a group of 20 typos. The vertical axis shows the fraction of this group that were right. The diagonal line indicates perfection. For example, consider a group of typos whose average score was .8. Perfect accuracy would be achieved if exactly 80 percent of this group agreed with the majority opinion of the judges. The curved lines above and below the perfection line show one standard deviation limits for estimating probabilities from samples of 20. The observations on correct are outside of the one standard deviation limits about as much as would be called for by chance, while each of the other two methods has more points outside than would result just by chance. We conclude that the scores from correct predict accuracy fairly well; scores from the other two methods are more problematic.

5. Conclusions

There have been a number of spelling correction programs in the past such as Kucera (1988) that generated a list of candidates by looking for insertions, deletions, substitutions and reversals, much as we have been doing here. Our contribution is the emphasis on scoring. McIlroy, the author of the Unix spell program (1982), intentionally focused on the spelling detection problem, and argued (private communication) that spelling correction was a bad idea so long as the corrector couldn't separate the plausible candidates from the implausible ones. He felt that it was probably more distracting than helpful to bury the user under a long list of mostly implausible candidates. In this work, we have attempted to show that it is possible to sort the candidates by a likelihood function that agrees well enough with human judges to be helpful.

In future work, we would hope to extend the prior model to take advantage of context. We noticed that the human judges were extremely reluctant to cast a vote given only the information available to the program, and that they were much more comfortable when they could see a concordance line or two. Perhaps our program could take advantage of these contextual cues by adopting very simple language modeling techniques such as trigrams, that have proven effective for speech recognition applications (Jelinek, 1985). Hopefully more interesting language models would improve performance even more.

References

Box, G. E. P., and G. C. Tiao (1973), Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Massachusetts.
Gale, W., and K. Church (1990, submitted), "What's Wrong with Adding One?"
Good, I. J. (1953), "The population frequencies of species and the estimation of population parameters," Biometrika, v. 40, pp. 237-264.
Jelinek, F. (1985), "Self-organized Language Modeling for Speech Recognition," IBM Report.
Kucera, H. (1988), "Automated Word Substitution Using Numerical Rankings of Structural Disparity Between Misspelled Words & Candidate Substitution Words," Patent Number: 4,783,758.
McIlroy, M. (1982), "Development of a Spelling List," IEEE Transactions on Communications, Vol. COM-30, No. 1.

[Figure: Accuracy of Probabilities. Horizontal axis: predicted probability (0.5 to 1.0).]
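The calibration check behind this figure (groups of 20 typos, with one standard deviation limits around the diagonal) can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the paper does not say how the groups were formed, so sorting by score before grouping is an assumption, as are the function name `calibration_points` and the synthetic input.

```python
import math
from typing import List, Tuple


def calibration_points(
    scored: List[Tuple[float, bool]], group_size: int = 20
) -> List[Tuple[float, float, bool]]:
    """For each group, return (mean score, fraction right, within one s.d.).

    scored holds (predicted probability, agreed with the judges) pairs.
    """
    ordered = sorted(scored, key=lambda pair: pair[0])  # assumption: group by score
    points = []
    for start in range(0, len(ordered) - group_size + 1, group_size):
        group = ordered[start:start + group_size]
        mean_score = sum(score for score, _ in group) / group_size
        frac_right = sum(1 for _, right in group if right) / group_size
        # One standard deviation for a proportion estimated from group_size samples.
        sd = math.sqrt(mean_score * (1 - mean_score) / group_size)
        points.append((mean_score, frac_right, abs(frac_right - mean_score) <= sd))
    return points


# Synthetic group: 20 typos scored .8, of which 16 agree with the judges.
example = [(0.8, True)] * 16 + [(0.8, False)] * 4
points = calibration_points(example)
print(points)
```

A perfectly calibrated group such as this one lies on the diagonal; for an average score of .8 the one standard deviation band is roughly ±0.09.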


6. Appendix: Confusion Matrices

del[X, Y] = Deletion of Y after X
Rows: X. Columns: Y (Deleted Letter).

X\Y a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 7 58 21 3 5 18 8 61 0 4 43 5 53 0 9 0 98 28 53 62 1 0 0 2 0
b 2 2 1 0 22 0 0 0 183 0 0 26 0 0 2 0 0 6 17 0 6 1 0 0 0 0
c 37 0 70 0 63 0 0 24 320 0 9 17 0 0 33 0 0 46 6 54 17 0 0 0 1 0
d 12 0 7 25 45 0 10 0 62 1 1 8 4 3 3 0 0 11 1 0 3 2 0 0 6 0
e 80 1 50 74 89 3 1 1 6 0 0 32 9 76 19 9 1 237 223 34 8 2 1 7 1 0
f 4 0 0 0 13 46 0 0 79 0 0 12 0 0 4 0 0 11 0 8 1 0 0 0 1 0
g 25 0 0 2 83 1 37 25 39 0 0 3 0 29 4 0 0 52 7 1 22 0 0 0 1 0
h 15 12 1 3 20 0 0 25 24 0 0 7 1 9 22 0 0 15 1 26 0 0 1 0 1 0
i 26 1 60 26 23 1 9 0 1 0 0 38 14 82 41 7 0 16 71 64 1 1 0 0 1 7
j 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0
k 4 0 0 1 15 1 8 1 5 0 1 3 0 17 0 0 0 1 5 0 0 0 1 0 0 0
l 24 0 1 6 48 0 0 0 217 0 0 211 2 0 29 0 0 2 12 7 3 2 0 0 11 0
m 15 10 0 0 33 0 0 1 42 0 0 0 180 7 7 31 0 0 9 0 4 0 0 0 0 0
n 21 0 42 71 68 1 160 0 191 0 0 0 17 144 21 0 0 0 127 87 43 1 1 0 2 0
o 11 4 3 6 8 0 5 0 4 1 0 13 9 70 26 20 0 98 20 13 47 2 5 0 1 0
p 25 0 0 0 22 0 0 12 15 0 0 28 1 0 30 93 0 58 1 18 2 0 0 0 0 0
q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0
r 63 4 12 19 188 0 11 5 132 0 3 33 7 157 21 2 0 277 103 68 0 10 1 0 27 0
s 16 0 27 0 74 1 0 18 231 0 0 2 1 0 30 30 0 4 265 124 21 0 0 0 1 0
t 24 1 2 0 76 1 7 49 427 0 0 31 3 3 11 1 0 203 5 137 14 0 4 0 2 0
u 26 6 9 10 15 0 1 0 28 0 0 39 2 111 1 0 0 129 31 66 0 0 0 0 1 0
v 9 0 0 0 58 0 0 0 31 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 1 0
w 40 0 0 1 11 1 0 11 15 0 0 1 0 2 2 0 0 2 24 0 0 0 0 0 0 0
x 1 0 17 0 3 0 0 1 0 0 0 0 0 0 0 6 0 0 0 5 0 0 0 0 1 0
y 2 1 34 0 2 0 1 0 1 0 0 1 2 1 1 1 0 0 17 1 0 0 1 0 0 0
z 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
@ 20 14 41 31 20 20 7 6 20 3 6 22 16 5 5 17 0 28 26 6 2 1 24 0 0 2

add[X, Y] = Insertion of Y after X
Rows: X. Columns: Y (Inserted Letter).

X\Y a b c d e f g h i j k l m n o p q r s t u v w x y z
a 15 1 14 7 10 0 1 1 33 1 4 31 2 39 12 4 3 28 134 7 28 0 1 1 4 1
b 3 11 0 0 7 0 1 0 50 0 0 15 0 1 1 0 0 5 16 0 0 3 0 0 0 0
c 19 0 54 1 13 0 0 18 50 0 3 1 1 1 7 1 0 7 25 7 8 4 0 1 0 0
d 18 0 3 17 14 2 0 0 9 0 0 6 1 9 13 0 0 6 119 0 0 0 0 0 5 0
e 39 2 8 76 147 2 0 1 4 0 3 4 6 27 5 1 0 83 417 6 4 1 10 2 8 0
f 1 0 0 0 2 27 1 0 12 0 0 10 0 0 0 0 0 5 23 0 1 0 0 0 1 0
g 8 0 0 0 5 1 5 12 8 0 0 2 0 1 1 0 1 5 69 2 3 0 1 0 0 0
h 4 1 0 1 24 0 10 18 17 2 0 1 0 1 4 0 0 16 24 22 1 0 5 0 3 0
i 10 3 13 13 25 0 1 1 69 2 1 17 11 33 27 1 0 9 30 29 11 0 0 1 0 1
j 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
k 2 4 0 1 9 0 0 1 1 0 1 1 0 0 2 1 0 0 95 0 1 0 0 0 4 0
l 3 1 0 1 38 0 0 0 79 0 2 128 1 0 7 0 0 0 97 7 3 1 0 0 2 0
m 11 1 1 0 17 0 0 1 6 0 1 0 102 44 7 2 0 0 47 1 2 0 1 0 0 0
n 15 5 7 13 52 4 17 0 34 0 1 1 26 99 12 0 0 2 156 53 1 1 0 0 1 0
o 14 1 1 3 7 2 1 0 28 1 0 6 3 13 64 30 0 16 59 4 19 1 0 0 1 1
p 23 0 1 1 10 0 0 20 3 0 0 2 0 0 26 70 0 29 52 9 1 1 1 0 0 0
q 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
r 15 2 1 0 89 1 1 2 64 0 0 5 9 7 10 0 0 132 273 29 7 0 1 0 10 0
s 13 1 7 20 41 0 1 50 101 0 2 2 10 7 3 1 0 1 205 49 7 0 1 0 7 0
t 39 0 0 3 65 1 10 24 59 1 0 6 3 1 23 1 0 54 264 183 11 0 5 0 6 0
u 15 0 3 0 9 0 0 1 24 1 1 3 3 9 1 3 0 49 19 27 26 0 0 2 3 0
v 0 2 0 0 36 0 0 0 10 0 0 1 0 1 0 1 0 0 0 0 1 5 1 0 0 0
w 0 0 0 1 10 0 0 1 1 0 1 1 0 2 0 0 1 1 8 0 2 0 4 0 0 0
x 0 0 18 0 1 0 0 6 1 0 0 0 1 0 3 0 0 0 2 0 0 0 0 1 0 0
y 5 1 2 0 3 0 0 0 2 0 0 1 1 6 0 0 0 1 33 1 13 0 1 0 2 0
z 2 0 0 0 5 1 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4
@ 46 8 9 8 26 11 14 3 5 1 17 5 6 2 2 10 0 6 23 2 11 1 2 1 1 2


sub[X, Y] = Substitution of X (incorrect) for Y (correct)
Rows: X. Columns: Y (correct).

X\Y a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 7 1 342 0 0 2 118 0 1 0 0 3 76 0 0 1 35 9 9 0 1 0 5 0
b 0 0 9 9 2 2 3 1 0 0 0 5 11 5 0 10 0 0 2 1 0 0 8 0 0 0
c 6 5 0 16 0 9 5 0 0 0 1 0 7 9 1 10 2 5 39 40 1 3 7 1 1 0
d 1 10 13 0 12 0 5 5 0 0 2 3 7 3 0 1 0 43 30 22 0 0 4 0 2 0
e 388 0 3 11 0 2 2 0 89 0 0 3 0 5 93 0 0 14 12 6 15 0 1 0 18 0
f 0 15 0 3 1 0 5 2 0 0 0 3 4 1 0 0 0 6 4 12 0 0 2 0 0 0
g 4 1 11 11 9 2 0 0 0 1 1 3 0 0 2 1 3 5 13 21 0 0 1 0 3 0
h 1 8 0 3 0 0 0 0 0 0 2 0 12 14 2 3 0 3 1 11 0 0 2 0 0 0
i 103 0 0 0 146 0 1 0 0 0 0 6 0 0 49 0 0 0 2 1 47 0 2 1 15 0
j 0 1 1 9 0 0 1 0 0 0 0 2 1 0 0 0 0 0 5 0 0 0 0 0 0 0
k 1 2 8 4 1 1 2 5 0 0 0 0 5 0 2 0 0 0 6 0 0 0 4 0 0 3
l 2 10 1 4 0 4 5 6 13 0 1 0 0 14 2 5 0 11 10 2 0 0 0 0 0 0
m 1 3 7 8 0 2 0 6 0 0 4 4 0 180 0 6 0 0 9 15 13 3 2 2 3 0
n 2 7 6 5 3 0 1 19 1 0 4 35 78 0 0 7 0 28 5 7 0 0 1 2 0 2
o 91 1 1 3 116 0 0 0 25 0 2 0 0 0 0 14 0 2 4 14 39 0 0 0 18 0
p 0 11 1 2 0 6 5 0 2 9 0 2 7 6 15 0 0 1 3 6 0 4 1 0 0 0
q 0 0 1 0 0 0 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r 0 14 0 30 12 2 2 8 2 0 5 8 4 20 1 14 0 0 12 22 4 0 0 1 0 0
s 11 8 27 33 35 4 0 1 0 1 0 27 0 6 1 7 0 14 0 15 0 0 5 3 20 1
t 3 4 9 42 7 5 19 5 0 1 0 14 9 5 5 6 0 11 37 0 0 2 19 0 7 6
u 20 0 0 0 44 0 0 0 64 0 0 0 0 2 43 0 0 4 0 0 0 0 2 0 8 0
v 0 0 7 0 0 3 0 0 0 0 0 1 0 0 1 0 0 0 8 3 0 0 0 0 0 0
w 2 2 1 0 1 0 0 2 0 0 1 0 0 0 0 7 0 6 3 3 1 0 0 0 0 0
x 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0
y 0 0 2 0 15 0 1 7 15 0 0 0 2 0 6 1 0 7 36 8 5 0 0 1 0 0
z 0 0 0 7 0 0 0 0 0 0 0 7 5 0 0 0 0 2 21 3 0 0 0 0 3 0

rev[X, Y] = Reversal of XY
Rows: X. Columns: Y.

X\Y a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 2 1 1 0 0 0 19 0 1 14 4 25 10 3 0 27 3 5 31 0 0 0 0 0
b 0 0 0 0 2 0 0 0 0 0 0 1 1 0 2 0 0 0 2 0 0 0 0 0 0 0
c 0 0 0 0 1 0 0 1 85 0 0 15 0 0 13 0 0 0 3 0 7 0 0 0 0 0
d 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0
e 1 0 4 5 0 0 0 0 60 0 0 21 6 16 11 2 0 29 5 0 85 0 0 0 2 0
f 0 0 0 0 0 0 0 0 12 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 4 0 0 0 2 0 0 0 0 0 0 1 0 15 0 0 0 3 0 0 3 0 0 0 0 0
h 12 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0
i 15 8 31 3 66 1 3 0 0 0 0 9 0 5 11 0 1 13 42 35 0 6 0 0 0 3
j 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
k 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
l 11 0 0 12 20 0 1 0 4 0 0 0 0 0 1 3 0 0 1 1 3 9 0 0 7 0
m 9 0 0 0 20 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 4 0 0 0 0 0
n 15 0 6 2 12 0 8 0 1 0 0 0 3 0 0 0 0 0 6 4 0 0 0 0 0 0
o 5 0 2 0 4 0 0 0 5 0 0 1 0 5 0 1 0 11 1 1 0 0 7 1 0 0
p 17 0 0 0 4 0 0 1 0 0 0 0 0 0 1 0 0 5 3 6 0 0 0 0 0 0
q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r 12 0 0 0 24 0 3 0 14 0 2 2 0 7 30 1 0 0 0 2 10 0 0 0 2 0
s 4 0 0 0 9 0 0 5 15 0 0 5 2 0 1 22 0 0 0 1 3 0 0 0 16 0
t 4 0 3 0 4 0 0 21 49 0 0 4 0 0 3 0 0 5 0 0 11 0 2 0 0 0
u 22 0 5 1 1 0 2 0 2 0 0 2 1 0 20 2 0 11 11 2 0 0 0 0 0 0
v 0 0 0 0 1 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 0 0 0 0 0 0 0 4 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 8 0
x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
y 0 1 2 0 0 0 1 0 0 0 0 3 0 0 0 2 0 1 10 0 0 0 0 0 0 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
