Designing POS Tag Set for Kannada - Vijayalaxmi - LDC-IL
DESIGNING POS TAG SET FOR KANNADA
Presented by: Vijayalaxmi F. Patil
LDC-IL
CONTENTS
Introduction
Dravidian Languages
Tag set : Meaning and Structure
Kannada Tag set : Category, Type, Attribute
Conclusion
INTRODUCTION
This paper presents the importance and the structure of POS tag set for Kannada, one of the major languages of the Dravidian Language family.
POS tagging is the process of marking up the words in a text or corpus as corresponding to a particular part of speech, based on both the word's definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph.
POS tagging is often the first stage of natural language processing, after which further processing such as chunking and parsing is done. Tags play a vital role in speech recognition, information retrieval and information extraction.
Recent machine learning techniques make use of corpora to acquire high-level language knowledge. This knowledge is estimated from corpora that are usually tagged with the correct part-of-speech labels. Many words occurring in natural language texts are not listed in any catalog or lexicon.
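To make the idea of estimating knowledge from a tagged corpus concrete, here is a minimal Python sketch. It assumes a toy corpus of (token, tag) pairs; the English tokens are placeholders and the function names are hypothetical, not part of the presentation. It counts how often each word carries each tag and falls back to UNK for words never seen in the corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (token, tag) pairs. The English tokens are
# placeholders; the tags (NC, V, AMN, UNK) come from the tag set in this deck.
tagged_corpus = [
    ("river", "NC"), ("flows", "V"), ("slowly", "AMN"),
    ("bank", "NC"), ("bank", "NC"), ("bank", "V"),
]

def tag_counts(corpus):
    """Count how often each token occurs with each tag."""
    counts = defaultdict(Counter)
    for token, tag in corpus:
        counts[token][tag] += 1
    return counts

def most_frequent_tag(counts, token, default="UNK"):
    """Return the tag seen most often for a token, or UNK if it was never seen."""
    if token not in counts:
        return default
    return counts[token].most_common(1)[0][0]

counts = tag_counts(tagged_corpus)
print(most_frequent_tag(counts, "bank"))
print(most_frequent_tag(counts, "mountain"))
```

A most-frequent-tag baseline like this ignores context, which is why real taggers also look at adjacent words.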
DRAVIDIAN LANGUAGES
South Indian languages derive from a common source, and these cognate languages constitute a single family known as the Dravidian family. The Dravidian family comprises about 23 languages and appears to be unrelated to any other known language family. There are more than 40 million speakers of Dravidian languages. The Dravidian languages are classified, on the basis of geography, shared innovations and characteristic features, into three subgroups:
Dravidian Languages: South Dravidian, Central Dravidian, North Dravidian
South Dravidian languages: As the name indicates, these are the languages spoken in the southern part of India. They are eight in number, viz. Kannada, Malayalam, Tamil, Tulu, Kodagu, Badaga, Toda and Kota.
Central Dravidian languages: The languages spoken in the central part of India. They are 12 in number, viz. Telugu, Gondi, Konda, Kui, Kuvi, Pengo, Manda, Kolami, Naiky, Parji, Gadaba Ollari and Gadaba Sillur.
North Dravidian languages: The languages spoken in the northern part of India. They are three in number, viz. Kurukh, Malto and Brahui.
Kannada is spoken predominantly in the state of Karnataka. Its native speakers are called Kannadigas (ಕನ್ನಡಿಗರು, Kannadigaru). It is the 27th most spoken language in the world. It is one of the scheduled languages of India and the official and administrative language of the state of Karnataka.
Based on the recommendations of the Committee of Linguistic Experts, appointed by the Ministry of Culture, the Government of India officially recognized Kannada as a classical language. During later centuries, Kannada, along with other Dravidian languages like Telugu, Tamil and Malayalam, has been greatly influenced by Sanskrit in terms of vocabulary, grammar and literary styles.
Tag set : Meaning and Structure
What is a tag set?
A set of defined tags, i.e. a set of word categories to be applied to the word tokens of a text.
Types of tag set
Flat tag set
Hierarchical tag set
Fine grained tag set
Flat tag set: simply lists the categories applicable to a particular language, without any provision for modularity or feature reusability.
Hierarchical tag set: the categories are structured relative to one another rather than forming a large number of independent categories. A hierarchical tag set contains a small number of Categories; each Category contains a number of Types, and each Type contains Attributes, and so on, in a tree-like structure.
Fine grained tag set: a tag set in which minute distinctions are taken into account, making it accurate for syntactic analysis.
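The difference between a flat and a hierarchical tag set can be sketched as a data structure. The Python example below is illustrative only: it assumes a nested dictionary (Category, then Type, then Attribute list) filled with a few entries from the Kannada tag set in this deck, and the helper names are hypothetical.

```python
# A small slice of a hierarchical tag set: Category -> Type -> Attributes.
# The entries are taken from the Kannada tag set tables; the nesting as a
# Python dict is an assumption made for illustration.
HIERARCHICAL = {
    "N": {  # Noun
        "NV": ["Case Marker", "Post-position", "Negative", "Clitic"],
        "NST": ["Dimension", "Case Marker", "Post-position", "Clitic"],
    },
    "A": {  # Adverb
        "AMN": ["Clitic"],
    },
}

def flat_tags(tagset):
    """Flatten the hierarchy into the plain list a flat tag set would use."""
    return sorted(t for types in tagset.values() for t in types)

def category_of(tagset, type_tag):
    """Walk up the tree: find the category that owns a given type."""
    for category, types in tagset.items():
        if type_tag in types:
            return category
    return None

def attributes_of(tagset, type_tag):
    """Return the attributes defined for a type, or an empty list."""
    category = category_of(tagset, type_tag)
    return tagset[category][type_tag] if category else []

print(flat_tags(HIERARCHICAL))
print(category_of(HIERARCHICAL, "NST"))
print(attributes_of(HIERARCHICAL, "AMN"))
```

The flat view loses the information that NV and NST are both nouns, which is what the hierarchy keeps and what lets attributes such as Clitic be reused across types.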
The present paper is based on a hierarchical tag set.
Preprocessing: A process of normalization of text before tokenization.
Part of speech: Categories [that] group lexical items which perform similar grammatical functions
Lexicon: A list of possible tags for the root forms of all the valid words in a given language.
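The three definitions above can be tied together in a short Python sketch. Everything here is an assumption made for illustration: the normalization steps (Unicode NFC plus whitespace collapsing), the small punctuation set, the whitespace-based tokenizer and the two-entry lexicon with English placeholder words. A real lexicon would be keyed on root forms reached through morphological analysis, which this sketch skips.

```python
import unicodedata

# Assumed punctuation set; the tag set's PU category lists more symbols.
PUNCT = set(",.?;:!")

def preprocess(text):
    """Normalize text before tokenization: NFC form, single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def tokenize(text):
    """Split on whitespace and peel trailing punctuation into its own tokens."""
    tokens = []
    for chunk in text.split():
        trailing = []
        while chunk and chunk[-1] in PUNCT:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens

# Hypothetical lexicon: word form -> set of possible tags.
LEXICON = {"bank": {"NC", "V"}, "slowly": {"AMN"}}

def possible_tags(token):
    """Look up the candidate tags for a token; unseen words get UNK."""
    if token in PUNCT:
        return {"PU"}
    return LEXICON.get(token.lower(), {"UNK"})

tokens = tokenize(preprocess("  bank   slowly, "))
print(tokens)
print([sorted(possible_tags(t)) for t in tokens])
```

A word with more than one candidate tag, such as the placeholder "bank" here, is exactly the case a tagger has to resolve from context.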
KANNADA TAG SET
Category
Noun (N)
Pronoun (P)
Demonstrative (D)
Nominal Modifier (J)
Verb (V)
Adverb (A)
Participle (L)
Particle (C)
Numeral (NUM)
Reduplication (RDP)
Residual (RD)
Unknown (UNK)
Punctuation (PU)
NOUN
Category: Noun (N)
Types and attributes:
Common (NC): Gender, Number, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
Proper (NP): Gender, Number, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
Verbal (NV): Case Marker, Post-position, Negative, Clitic
Spatio-temporal (NST): Dimension, Case Marker, Post-position, Clitic
E.g.
(1) మనుష�ెౕ \NC.hum.pl.nom.0.0.0.0.emp ‘people’
(2) ర�ౕశ�ెూడ�ె \NP.mas.sg.gen.0.0.pp.0.0 ‘with Ramesh’
(3) �ాడువ�ద�ె�ౕ \NV.acc.0.0.emp ‘doing’
(4) అ��యవ�ెగూ \NST.dis.gen.pp.incl ‘till there’
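The noun examples suggest that a tag is the Type label followed by one dot-separated value per attribute, in the order the attributes are listed. The Python sketch below parses a tag on that basis. Treating "0" as "attribute not marked" is an assumption inferred from the examples, and the function and table names are hypothetical.

```python
# Attribute order per noun type, as listed in the noun table.
COMMON = ["Gender", "Number", "Case Marker", "Adverbial suffix",
          "Adjectival suffix", "Post-position", "Negative", "Clitic"]
NOUN_ATTRIBUTES = {
    "NC": COMMON,
    "NP": COMMON,
    "NV": ["Case Marker", "Post-position", "Negative", "Clitic"],
    "NST": ["Dimension", "Case Marker", "Post-position", "Clitic"],
}

def parse_tag(tag):
    """Split a tag such as NV.acc.0.0.emp into its type and attribute values.

    Assumption: values are positional and "0" means the attribute is unmarked.
    """
    type_tag, *values = tag.lstrip("\\").split(".")
    names = NOUN_ATTRIBUTES[type_tag]
    if len(values) != len(names):
        raise ValueError(f"{type_tag} expects {len(names)} values, got {len(values)}")
    return type_tag, {n: (None if v == "0" else v) for n, v in zip(names, values)}

print(parse_tag(r"\NC.hum.pl.nom.0.0.0.0.emp"))
print(parse_tag(r"\NST.dis.gen.pp.incl"))
```

The length check is useful in practice: a tag with too few or too many slots usually signals an annotation slip.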
PRONOUN
Category: Pronoun (P)
Types and attributes:
Pronominal (PRP): Gender, Number, Person, Case Marker, Dimension, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
Reflexive (PRF): Gender, Number, Person, Case Marker, Adverbial suffix, Post-position, Negative, Clitic
Reciprocal (PRC): Gender, Number, Person, Case Marker, Adverbial suffix, Post-position, Negative, Clitic
Wh-Pronoun (PWH): Gender, Number, Person, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
E.g.
(5) అవళ� \PRP.fem.sg.3rd.nom.dis.0.0.0.0.0 ‘she’
(6) �ా�ెౕ \PRF.hum.pl.nom.0.0.0.emp ‘yourself’
(7) పరస"ర \PRC.hum.pl.0.nom.0.0.0.0.0 ‘reciprocal’
(8) #ారు \PWH.hum.0.0.nom.0.0.0.0.0 ‘who’
DEMONSTRATIVE
Category: Demonstrative (D)
Types and attributes:
Absolute (DAB): Dimension
Wh-demonstrative (DWH)
E.g.
(9) ఆ \DAB.dis ‘that’
(10) #ావ \DWH ‘which’
NOMINAL MODIFIER
Category: Nominal Modifier (J)
Types and attributes:
Adjective (JJ): Negative, Adjectival suffix, Clitic
Quantifier (JQ): Gender, Number, Numeral, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Dimension, Negative, Clitic
Intensifier (JINT): Clitic
E.g.
(11) సుందర�ాద \JJ.0.adj.0 ‘beautiful’
(12) అష&�ె�ౕ \JQ.neu.0.nnm.acc.0.0.0.dis.0.emp ‘that much’
(13) బహళ \JINT.0 ‘much’
VERB
Category: Verb (V)
Attributes: Gender, Number, Person, Tense, Causative, Aspect, Mood, Finiteness, Negative, Defective verb, Clitic
E.g.
(14) బరు�ా)*ె+ౕ \V.fem.sg.3rd.fut.n.prg.intr.nfn.n.n.intr ‘will she come?’
(15) ,ను�,)ద-ళ� \V.fem.sg.3rd.pst.n.prg.0.nfn.n.n.0 ‘he will divide’
ADVERB
Category: Adverb (A)
Types and attributes:
Manner (AMN): Clitic
E.g.
(16) .ధన�ా0+ౕ \AMN.emp ‘slowly’
PARTICIPLE
Category: Participle (L)
Types and attributes:
Relative (LRL): Tense, Negative, Adjectival suffix, Post-position, Clitic
Verbal (LV): Tense, Negative, Clitic
Nominal (LN): Gender, Number, Tense, Negative, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Clitic
Conditional (LC): Adjectival suffix, Negative, Clitic
E.g.
(17) బంద \LRL.pst.0.0.0.emp ‘which has come’
(18) 1ెూౕ0 \LV.pst.0 ‘go’
(19) బరదవరు \LN.hum.pl.pst.y.nom.0.0.0.0 ‘those who have not come’
(20) 1ెౕళ2ద-�ె \LC.0.y.0 ‘if not tell’
PARTICLE
Category: Particle (C)
Types, attributes and examples:
Co-ordinating (CCD): Clitic. (21) మతూ), (‘and’) ఆదరూ (‘but’)
Subordinating (CSB): (22) అథ�ా (‘or’)
Interjection (CIN): (23) ఒ6 (‘oh’), అ7ౕ (‘alas’)
(Dis)Agreement (CAGR): (24) 1ౌదు (‘yes’), అల� (‘no’)
Confirmative (CCON): (25) 1ౌద= �ా, అ=ా> (‘isn’t it’)
Delimitive (CDLIM): Clitic. (26) �ాత;, <ెౕవల (‘only’)
Dubitative (CDUB): (27) బహుశః (‘probably’)
Inclusive (CINCL): (28) కూడ (‘also’)
Others (CX)
NUMERAL
Category: Numeral (NUM)
Attributes: Case marker, Clitic, Adverbial suffix, Postposition
Types and examples:
Real (NUMR): (29) 10, 20, 30, 40
Serial (NUMS): (30) 10.5, 25.02
Calendric (NUMC): (31)
Ordinal (NUMO): (32) 3rd, 4th, 20th
REDUPLICATION
Category: Reduplication (RDP)
Attributes: Gender, Number, Person, Case marker, Post-position, Adverbial suffix, Clitic
E.g.
(33) ఒ?ెూ@బ@�ా0 \RDP.hum.pl.0.nom.0.adv.0 ‘one by one’
(34) అవరవ�ెూడ�ె \RDP.hum.pl.3rd.gen.pp.0.0 ‘with them’
RESIDUAL
Category: Residual (RD)
Types:
Foreign Word (RDF)
Symbol (RDS)
E.g.
(35) काम ‘work’
(36) Ink
(37) @ # $ & %
UNKNOWN
Category: Unknown (UNK)
E.g.
(38) యAాయAా ధమBస ‘Sanskrit shloka’
PUNCTUATION
Category: Punctuation (PU)
E.g.
(39) , . / ? “ : ; } [ \ | = + _ /
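With all thirteen categories listed, the tag set can be collected into one lookup table. The Python sketch below does this with the category and type labels from the tables above; the dictionary layout and the function name are hypothetical. It uses an exact lookup on the label rather than prefix matching, because labels such as NUM, PU and RDP share leading letters with N, P and RD.

```python
# Category label -> (category name, type labels), copied from the tables above.
TAGSET = {
    "N": ("Noun", ["NC", "NP", "NV", "NST"]),
    "P": ("Pronoun", ["PRP", "PRF", "PRC", "PWH"]),
    "D": ("Demonstrative", ["DAB", "DWH"]),
    "J": ("Nominal Modifier", ["JJ", "JQ", "JINT"]),
    "V": ("Verb", []),
    "A": ("Adverb", ["AMN"]),
    "L": ("Participle", ["LRL", "LV", "LN", "LC"]),
    "C": ("Particle", ["CCD", "CSB", "CIN", "CAGR", "CCON",
                       "CDLIM", "CDUB", "CINCL", "CX"]),
    "NUM": ("Numeral", ["NUMR", "NUMS", "NUMC", "NUMO"]),
    "RDP": ("Reduplication", []),
    "RD": ("Residual", ["RDF", "RDS"]),
    "UNK": ("Unknown", []),
    "PU": ("Punctuation", []),
}

# Build an exact-match index from every label to its category name.
LABEL_TO_CATEGORY = {}
for label, (name, types) in TAGSET.items():
    LABEL_TO_CATEGORY[label] = name
    for type_label in types:
        LABEL_TO_CATEGORY[type_label] = name

def category_name(tag):
    """Return the category name for a full tag; raises KeyError if unknown."""
    label = tag.lstrip("\\").split(".")[0]
    return LABEL_TO_CATEGORY[label]

print(category_name(r"\RDP.hum.pl.0.nom.0.adv.0"))
print(category_name("NUMO"))
print(category_name("PU"))
```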
ATTRIBUTES AND THEIR VALUES
Attribute: Values
Person \PER: First\1, Second\2, Third\3
Number \NUM: Singular\sg, Plural\pl
Gender \GEN: Masculine\mas, Feminine\fem, Neuter\neu, Human\hum
Case Marker \CSM: Nominative\nom, Accusative\acc, Instrumental\ins, Dative\dat, Ablative\abl, Genitive\gen, Locative\loc
Tense \TNS: Present\prs, Past\pst, Future\fut
Aspect: Imperfect\ipfv, Perfect\prf, Progressive\prog
Mood \MOOD: Interrogative\int, Habitual\hab, Imperative\imp, Optative\opt, Hortative\hort, Debitive\debt, Potential\potn
Finiteness \FIN: Finite\fin, Non-finite\nfn, Infinitive\inf
Dimension \DIM: Proximal\prx, Distal\dst
Clitic \CL: Interrogative\int, Inclusive\incl, Indefiniteness\ind, Emphatic\emp, Comparative\com, Hearsay\hers
Numeral \NML: Cardinal\crd, Ordinal\ord, Non-numeral\nnm
Negative \NEG: Yes\y, No\n
Adverbial suffix \adv
Adjectival suffix \adj
Defective verb \DEF: Yes\y, No\n
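The attribute table can be turned into a small decoder that expands a value code into its full name. The Python sketch below covers a subset of the attributes; the dictionary layout and function names are hypothetical. Codes are looked up per attribute because the same code can occur under more than one attribute, as with "int", which is listed under both Mood and Clitic.

```python
# Value codes per attribute, copied from the attribute table (subset).
ATTRIBUTE_VALUES = {
    "Person": {"1": "First", "2": "Second", "3": "Third"},
    "Number": {"sg": "Singular", "pl": "Plural"},
    "Gender": {"mas": "Masculine", "fem": "Feminine",
               "neu": "Neuter", "hum": "Human"},
    "Case Marker": {"nom": "Nominative", "acc": "Accusative",
                    "ins": "Instrumental", "dat": "Dative",
                    "abl": "Ablative", "gen": "Genitive", "loc": "Locative"},
    "Tense": {"prs": "Present", "pst": "Past", "fut": "Future"},
    "Mood": {"int": "Interrogative", "hab": "Habitual", "imp": "Imperative",
             "opt": "Optative", "hort": "Hortative", "debt": "Debitive",
             "potn": "Potential"},
    "Clitic": {"int": "Interrogative", "incl": "Inclusive",
               "ind": "Indefiniteness", "emp": "Emphatic",
               "com": "Comparative", "hers": "Hearsay"},
}

def decode(attribute, code):
    """Expand a value code for a given attribute; None if the code is not listed."""
    return ATTRIBUTE_VALUES[attribute].get(code)

def attributes_using(code):
    """List every attribute under which a code appears."""
    return sorted(a for a, values in ATTRIBUTE_VALUES.items() if code in values)

print(decode("Case Marker", "acc"))
print(attributes_using("int"))
```

A validator built on such a table would flag codes that are not listed for an attribute.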
CONCLUSION
The use of morphological features is especially helpful for developing a reasonable POS tagger when tagged resources are limited. In POS tagging, one word may have more than one part-of-speech label. Syntactic and semantic parsing of natural language sentences is generally influenced by adequate part-of-speech tagging.