normalization of sms text gurpreet singh khanuja sachin yadav

10
Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

Upload: ashlee-butler

Post on 17-Dec-2015

221 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

Normalization of SMS Text

Gurpreet Singh KhanujaSachin Yadav

Page 2: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

Problem Introduction

• Real-time, short, massive in volume• Noisy• Similar to chat text as in facebook,twitter

This is so hard 2 handle! Tell ppl u luv them cuz 2morrow is truly not promised!!!

This is so hard to handle! Tell people you love them because tomorrow is truly not promised!!!"

Page 3: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

e.g., earthquak, earthquick, ...could be normalised to“earthquake”.

Page 4: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

Objective• To restore ill-formed English words to their

canonical forms.• Typos (e.g. luv )• ad hoc abbreviations (e.g. ppl, u)• Phonetic substitutions (e.g. 2morrow,10q),

etc.

Page 5: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

• The normalization focuses on Out-Of-Vocabulary (OOV) words, which are firstly identified.

• This is so hard 2 handle! Tell ppl u luv them cuz 2morrow is truly not promised!!!

• This is so hard to handle! Tell people you love them because tomorrow is truly not promised!!!

Page 6: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

• Noisy channel modelGiven ill-formed text T and standard form S, find argmax P(S|T) via argmax P(T|S)P(S), where P(S) = language model and P(T|S) =error model [Toutanova and Moore, 2002, Choudhury et al., 2007,Cook and Stevenson, 2009]• Phrasal statistical machine translation SMT: original text = source language; normalised form = target language[Aw et al., 2006, Kaufmann and Kalita, 2010]• MiscellaneousAutomatic Speech Recognition [Kobus et al., 2008], Hybrid models[Beaufort et al., 2010]

Page 7: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

Proposed Approach

• Confusion set generation (find correction candidates) ... crush da redberry b4 da water ...

before four

be bore

Confusion set generation

Page 8: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

• Detection of ill-formed word ( is the candidate an ill-formed word?)

before crush da redberry ? da water

fourbore

Ill-formed word detector

Yes or No

Page 9: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

Normalization of ill-formed word (select canonical lexical form)

Page 10: Normalization of SMS Text Gurpreet Singh Khanuja Sachin Yadav

References

• Bo Han and Timothy BaldwinLexical Normalization of Short Text Messages: Make Sense a #twitter

• Deana L. Pennell and Yang LiuA Character-Level Machine Translation Approach for Normalization of SMS Abbreviations

• Toutanova and Moore, 2002, Choudhury et al., 2007,Cook and Stevenson, 2009

• Kaufmann and Kalita, 2010• https://twitter.com