Word Segmentation in Urdu
Nadir Durrani
Institute of Natural Language Processing
University of Stuttgart
Sarmad Hussain
Center for Research in Urdu Language Processing
National University of Computer and Emerging Sciences
Road Map
• Urdu Word Segmentation
– Space Omission Problem
• Non-Joiners and Urdu Orthography
• Joiners and definition of Word
– Space Insertion Problem
• Affixation
• Compounding
• Proper Nouns
• Foreign Words
• Abbreviations
Contd..
• Model
• Algorithm
• Results
Why do we need words ?
• Tokenization is a foremost task in all NLP applications.
– Syntactic and semantic analysis in Machine Translation is based on words and their neighbors
– A spell checker requires word boundary information for an erroneous word in order to suggest a list of possible corrections
– A POS tagger should know word boundaries to tag words properly
Segmentingwordsshouldn’tbehard
• For Latin-script languages such as English, French and Dutch, space and punctuation marks are a good approximation of word boundaries
• In some Asian languages white space is never used to
determine word boundaries. Text is written in a
continuum.
Word Segmentation Problem in Asian Languages
• Chinese
• Thai
• Khmer
• Burmese
• Dzongkha
• Lao
Word segmentation in these languages is a “Space Omission Problem”
Word Segmentation Problem in Urdu
• Urdu, a unique case
• Like some Asian languages, multiple words can be written in continuum without inserting any space
• Unlike these Asian languages, space is a frequently used character. However, its presence does not necessarily imply a word boundary
• So Urdu has a “Space Insertion Problem” along with a “Space Omission Problem”
Urdu Orthography
• Urdu is cursive in nature
• Characters acquire different shapes as they join with neighboring characters
• Urdu has two types of characters
– Joiners can acquire 4 different shapes, namely initial, medial, final and isolated
• Arabic Letter Meem can take initial: مـ , medial: ـمـ , final: ـم and isolated: م
– Non-joiners can only acquire final and isolated shapes
• Arabic Letter Dal can only take final: ـد and isolated: د
Orthographic Rules for Urdu
Word position | Joiners | Example | Non-Joiners | Example
Start | Initial shape | سجدم | Isolated | جالد
Somewhere in between | Medial after J, initial after NJ | رهمنبامد | Final after J, isolated after NJ | ردبنردنا
End | Final after J, isolated after NJ | مجع , مجعمکا | Final after J, isolated after NJ | دبن , دبندر
J = Joiners, NJ = Non-Joiners
Red = Shape in Consideration, Blue = Context
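The shape rules in the table above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the NON_JOINERS set is an assumed, simplified list of Urdu non-joining letters, and the function name contextual_shapes is hypothetical.

```python
# Assumption: a small illustrative set of Urdu non-joiners, i.e. letters that
# never connect to the letter that follows them. Not taken from the slides.
NON_JOINERS = set("اآدڈذرڑزژوے")


def contextual_shapes(word):
    """Return (letter, shape) pairs following the joiner / non-joiner rules."""
    shapes = []
    for i, ch in enumerate(word):
        # A letter connects backwards only if the previous letter is a joiner.
        joins_prev = i > 0 and word[i - 1] not in NON_JOINERS
        # A letter connects forwards only if it is a joiner and not word-final.
        joins_next = i < len(word) - 1 and ch not in NON_JOINERS
        if joins_prev and joins_next:
            shape = "medial"
        elif joins_prev:
            shape = "final"
        elif joins_next:
            shape = "initial"
        else:
            shape = "isolated"
        shapes.append((ch, shape))
    return shapes


if __name__ == "__main__":
    for letter, shape in contextual_shapes("مسجد"):
        print(letter, shape)
```

Under these assumptions a non-joiner such as Dal can only ever come out as final or isolated, because joins_next is always false for it.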
Notion of Space in Urdu
• The notion of space is completely alien to Urdu handwriting
• Children are never taught to leave space when starting a new word
• The following sample clearly shows that space is not used in handwritten Urdu
How space became part of Urdu?
• Space has become part of Urdu text because computers cannot handle it without space
• If a word ends with a joiner character, the next word must be started by putting a space character; otherwise the two words would join and the text would look visually inappropriate
– بادشاہی مسجد (Badshahi Mosque) vs. بادشاہیمسجد
• However, space does not always mean word boundary
Space Omission Errors
Non-Joiner Word Ending
• Putting space is no longer an obligation if a word ends with a non-joiner
– Example (gloss): Troop leader Ahmed Sher Doger said
• Each word ends with a non-joiner, so it is up to the user whether or not to put the space
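A minimal sketch of why the space becomes optional: whether two adjacent words fuse visually depends only on the last letter of the first word. The NON_JOINERS set and both function names are assumptions made for illustration.

```python
# Assumption: illustrative non-joiner set, not taken from the slides.
NON_JOINERS = set("اآدڈذرڑزژوے")


def space_required_after(word):
    """True if dropping the space after `word` would fuse it with the next word."""
    return bool(word) and word[-1] not in NON_JOINERS


def optional_space_positions(words):
    """Indices i where the space between words[i] and words[i + 1] is optional."""
    return [i for i in range(len(words) - 1)
            if not space_required_after(words[i])]


if __name__ == "__main__":
    print(space_required_after("بادشاہی"))   # word ends with a joiner
    print(space_required_after("احمد"))      # word ends with a non-joiner
    print(optional_space_positions(["آپ", "کا", "گھر"]))
```

Every position reported by optional_space_positions is a place where a writer may legitimately omit the space, which is where space omission errors of this kind arise.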
Space Omission Problem
Joiner Word Ending
• Two words written without space even when first
word is ending with a joiner
• Triggered by disagreement on definition of word
Category | Examples
Oblique pronouns followed by case marker | آپکا vs. آپ کا (yours)
Some abstract nouns preceded by singular demonstratives | اسوقت vs. اس وقت (at that time)
Some postpositions combine with their genitive case markers | کيطرف vs. کی طرف (towards)
Sometimes helping verbs are written with root verbs without any space | کریگی vs. کرے گی (will do)
Sometimes there is no other reason but something introduced as stylistic variation and now lexicalized | کيليے vs. کے ليے (for)
Space Insertion Problem
• Space is often used as a word boundary but not always
• In some cases space occurs in between a word that has multiple morphemes
• The problem is to delete the spaces that do not mean word boundaries
Cases for Space Insertion
Cases | Description | Examples
Affixation | Derivational affixes are written with space when first morpheme ends with a joiner | غير شادی شده (un + married)
-do- | Sometimes joined and separated versions have different spellings | مزیدار vs. مزے دار (delicious)
Compounds | Compounds are written with space when first morpheme ends with a joiner | نشاط ثانيہ (Renaissance)
Both | Often compounding is used in combination with affixation to create more problems | بےیارومددگار (helpless): (بے (یار) (و) ((مدد)گار)) = (Without (Friend) (and) ((Help)er))
Contd..
Cases | Description | Examples
Reduplication | Reduplication is used in Urdu to put emphasis or express multiplicity or variety | کبھی کبھی (sometimes)
-do- | Sometimes the first morpheme repeats itself, or occurs in a format X-(some character)-X, or sometimes with a vowel of X changed to /a/ | خود بخود (automatically), ٹھيک ٹھاک (alright)
Proper Names | Some proper nouns are written with space in between | اسلام آباد (Islamabad)
Foreign Words | Lexicalized foreign words with multiple morphemes are written with space between | ٹيلی فون (telephone)
Abbreviations | Abbreviations when transliterated in Urdu are written with spaces in between | پی ایچ ڈی (PhD)
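The three reduplication patterns in the table (exact repetition, one extra character, one changed vowel) all leave the two parts within a single edit of each other, which fits the "single edit distance" note given for the Reduplication Handler later in the slides. The sketch below is an assumed illustration of that idea; the function names are hypothetical and the real handler may differ.

```python
def within_one_edit(a, b):
    """True if a and b are equal or differ by one insertion, deletion or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]          # exactly one extra character in `b`


def merge_reduplication(tokens):
    """Join adjacent tokens that look like a reduplicated pair into one unit."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and within_one_edit(tokens[i], tokens[i + 1]):
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    print(merge_reduplication(["وہ", "کبھی", "کبھی", "آیا"]))
```

A check this simple over-generates: any two unrelated neighbouring words that happen to differ by one letter would also be merged, so a practical handler would need an additional filter such as a lexicon.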
Lack of definition of word in Urdu
Case | Single Word | Confusion (not sure) | Two Words
Reduplication | برابر (equal) | دھڑا دھڑ (one after another) | آہستہ آہستہ (slowly)
Compounds | نظم و ضبط (discipline) | تباه و برباد (destroy) | اسلم و عمران (Aslam and Imran)

Category | Level
Single Morpheme Words | 1
Affixation | 2
Abbreviations, Compounds, Reduplication | 3
Model
Algorithm
• Remove the diacritics and tokenize into
orthographic words OW
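A minimal sketch of this preprocessing step, assuming that "diacritics" means Unicode nonspacing marks and that orthographic words are simply whitespace-delimited chunks. The function name orthographic_words is hypothetical.

```python
import unicodedata


def orthographic_words(text):
    """Strip diacritics and split the text into whitespace-delimited chunks (OWs)."""
    # NFC first, so that letters such as Alef with Madda stay single code points
    # and are not damaged when the combining marks are removed.
    text = unicodedata.normalize("NFC", text)
    # Category "Mn" (nonspacing mark) covers zabar, zer, pesh, tashdid and similar.
    stripped = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return stripped.split()


if __name__ == "__main__":
    print(orthographic_words("اُردُو زَبان"))
```

Punctuation is left attached to the neighbouring chunk here; a fuller tokenizer would split it off as well.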
Space Omission Problem
• For each OW we do a lexical lookup for spelling variations and break it into words
– کيليے will get fixed into کے ليے
• Maximum matching algorithm (dynamic programming algorithm)
• 10 best segmentations for that OW based on the minimum word heuristic
• These segmentations are merged with segmentations from other OWs
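The following sketch shows one way a dynamic-programming segmenter could produce the k best segmentations of a single OW under the minimum word heuristic. It is an assumed illustration, not the authors' implementation: the lexicon is a plain set of strings and k_best_segmentations is a hypothetical name.

```python
import heapq


def k_best_segmentations(text, lexicon, k=10):
    """Return up to k segmentations of `text` into lexicon words, fewest words first."""
    max_len = max((len(w) for w in lexicon), default=0)
    # best[i] holds up to k candidate segmentations (word lists) of text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [[]]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in lexicon:
                candidates.extend(prefix + [piece] for prefix in best[start])
        # Minimum word heuristic: keep the k candidates with the fewest words.
        best[end] = heapq.nsmallest(k, candidates, key=len)
    return best[len(text)]


if __name__ == "__main__":
    lexicon = {"اس", "وقت"}
    print(k_best_segmentations("اسوقت", lexicon))
```

If no sequence of lexicon words covers the OW, the function returns an empty list, so a caller would have to fall back to keeping the OW unsegmented.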
Contd..
• After running space omission module we have
segmented our sentence to morphemes
Space Insertion Module
• All segmentations are then passed to
– Affixation Handler (morpheme statistics + POS)
– Reduplication Handler (single edit distance)
– Abbreviation Handler (finite state automata)
– Compound Handler (simple lexical look-ups)
• Best segmentation is selected based on three different heuristics
– Min word heuristic
– Unigram Statistics
– Bigram Statistics
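The slides name the three selection heuristics but not how they are computed or combined, so the sketch below makes its own assumptions: add-one smoothed log-probabilities, a "<s>" start symbol, and hypothetical function names. It scores each candidate under every heuristic separately.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Collect unigram and bigram counts from a list of segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        padded = ["<s>"] + sentence
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def unigram_logprob(seg, unigrams):
    total = sum(unigrams.values())
    vocab = len(unigrams) + 1  # one extra slot for unseen words (add-one smoothing)
    return sum(math.log((unigrams[w] + 1) / (total + vocab)) for w in seg)


def bigram_logprob(seg, unigrams, bigrams):
    vocab = len(unigrams) + 1
    padded = ["<s>"] + seg
    return sum(math.log((bigrams[(p, w)] + 1) / (unigrams[p] + vocab))
               for p, w in zip(padded, padded[1:]))


def pick_best(candidates, unigrams, bigrams):
    """Return the preferred candidate under each of the three heuristics."""
    return {
        "min_word": min(candidates, key=len),
        "unigram": max(candidates, key=lambda s: unigram_logprob(s, unigrams)),
        "bigram": max(candidates, key=lambda s: bigram_logprob(s, unigrams, bigrams)),
    }


if __name__ == "__main__":
    corpus = [["the", "table"], ["the", "chair"], ["a", "table"]]
    uni, bi = train_counts(corpus)
    print(pick_best([["the", "table"], ["t", "he", "table"]], uni, bi))
```

Latin-script toy data is used only to keep the example readable; the same functions work on lists of Urdu morphemes.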
Annotation
Train & Test
• Training was done on a morpheme-segmented corpus of 70k words
• Testing was done on a very small corpus of 2367 words
– 404 Segmentation Errors
• 221 Space Omission Errors
• 183 Space Insertion Errors
– Affixation (66), Compounding (63), Abbreviation (32), Reduplication (22)
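As a quick consistency check on the counts above, the arithmetic can be written out as below. The ratio of errors to test words is computed only for orientation; the slides do not state how an error rate or accuracy figure was defined, so treating it as a per-word rate is an assumption.

```python
test_words = 2367
omission_errors, insertion_errors = 221, 183
insertion_breakdown = {
    "Affixation": 66,
    "Compounding": 63,
    "Abbreviation": 32,
    "Reduplication": 22,
}

total_errors = omission_errors + insertion_errors
# The category counts should add up to the reported totals.
assert total_errors == 404
assert sum(insertion_breakdown.values()) == insertion_errors

errors_per_word = total_errors / test_words
omission_share = omission_errors / total_errors

print(f"errors per test word: {errors_per_word:.1%}")
print(f"share of errors due to space omission: {omission_share:.1%}")
```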
Results
Questions?
• Thank you for listening…
• Acknowledgement:
– Special thanks to NAACL executive for funding my travel