normalization of sms text gurpreet singh khanuja sachin yadav
TRANSCRIPT
![Page 1: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/1.jpg)
Normalization of SMS Text
Gurpreet Singh KhanujaSachin Yadav
![Page 2: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/2.jpg)
Problem Introduction
• Real-time, short, massive in volume• Noisy• Similar to chat text as in facebook,twitter
This is so hard 2 handle! Tell ppl u luv them cuz 2morrow is truly not promised!!!
This is so hard to handle! Tell people you love them because tomorrow is truly not promised!!!"
![Page 3: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/3.jpg)
e.g., earthquak, earthquick, ...could be normalised to“earthquake”.
![Page 4: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/4.jpg)
Objective• To restore ill-formed English words to their
canonical forms.• Typos (e.g. luv )• ad hoc abbreviations (e.g. ppl, u)• Phonetic substitutions (e.g. 2morrow,10q),
etc.
![Page 5: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/5.jpg)
• The normalization focuses on Out-Of-Vocabulary (OOV) words, which are firstly identified.
• This is so hard 2 handle! Tell ppl u luv them cuz 2morrow is truly not promised!!!
• This is so hard to handle! Tell people you love them because tomorrow is truly not promised!!!
![Page 6: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/6.jpg)
• Noisy channel modelGiven ill-formed text T and standard form S, find argmax P(S|T) via argmax P(T|S)P(S), where P(S) = language model and P(T|S) =error model [Toutanova and Moore, 2002, Choudhury et al., 2007,Cook and Stevenson, 2009]• Phrasal statistical machine translation SMT: original text = source language; normalised form = target language[Aw et al., 2006, Kaufmann and Kalita, 2010]• MiscellaneousAutomatic Speech Recognition [Kobus et al., 2008], Hybrid models[Beaufort et al., 2010]
![Page 7: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/7.jpg)
Proposed Approach
• Confusion set generation (find correction candidates) ... crush da redberry b4 da water ...
before four
be bore
Confusion set generation
![Page 8: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/8.jpg)
• Detection of ill-formed word ( is the candidate an ill-formed word?)
before crush da redberry ? da water
fourbore
Ill-formed word detector
Yes or No
![Page 9: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/9.jpg)
Normalization of ill-formed word (select canonical lexical form)
![Page 10: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav](https://reader036.vdocument.in/reader036/viewer/2022082516/56649cea5503460f949b5413/html5/thumbnails/10.jpg)
References
• Bo Han and Timothy BaldwinLexical Normalization of Short Text Messages: Make Sense a #twitter
• Deana L. Pennell and Yang LiuA Character-Level Machine Translation Approach for Normalization of SMS Abbreviations
• Toutanova and Moore, 2002, Choudhury et al., 2007,Cook and Stevenson, 2009
• Kaufmann and Kalita, 2010• https://twitter.com