Designing POS Tag Set for Kannada - Vijayalaxmi - LDC-IL

DESIGNING POS TAG SET FOR KANNADA. Presented by: Vijayalaxmi F. Patil, LDC-IL

Post on 03-Feb-2022


DESIGNING POS TAG SET FOR KANNADA

Presented by: Vijayalaxmi F. Patil

LDC-IL


CONTENTS

Introduction

Dravidian Languages

Tag set : Meaning and Structure

Kannada Tag set : Category, Type, Attribute

Conclusion


INTRODUCTION

This paper presents the importance and the structure of POS tag set for Kannada, one of the major languages of the Dravidian Language family.

POS tagging is the process of marking up the words in a text or corpus as corresponding to a particular part of speech, based on both the definition of the word and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph.


POS tagging is often the first stage of natural language processing, after which further processing such as chunking and parsing is done. Tags play a vital role in speech recognition, information retrieval and information extraction.

Recent machine learning techniques make use of corpora to acquire high-level language knowledge. This knowledge is estimated from corpora that are usually tagged with the correct part-of-speech labels. Many words occurring in natural language texts are not listed in any catalog or lexicon.


DRAVIDIAN LANGUAGES

The South Indian languages derive from a common source, and these cognate languages constitute a single family known as the Dravidian family. There are about 23 languages in the Dravidian family, which appears to be unrelated to any other known language family. There are more than 200 million speakers of Dravidian languages. The Dravidian languages are classified on the basis of geography, shared innovations and characteristic features into three subgroups:

South Dravidian
Central Dravidian
North Dravidian

South Dravidian languages: As the name suggests, these are the languages spoken in the southern part of India. They are eight in number, viz. Kannada, Malayalam, Tamil, Tulu, Kodagu, Badaga, Toda and Kota.

Central Dravidian languages: These are the languages spoken in the central part of India. They are 12 in number, viz. Telugu, Gondi, Konda, Kui, Kuvi, Pengo, Manda, Kolami, Naiky, Parji, Gadaba Ollari and Gadaba Sillur.

North Dravidian languages: These are the languages spoken in the northern part of India. They are three in number, viz. Kurukh, Malto and Brahui.

The Kannada language is spoken predominantly in the state of Karnataka, and its native speakers are called Kannadigas (ಕನ್ನಡಿಗರು, Kannadigaru). It is the 27th most spoken language in the world. It is one of the scheduled languages of India and the official and administrative language of the state of Karnataka.

Based on the recommendations of the Committee of Linguistic Experts appointed by the Ministry of Culture, the Government of India officially recognized Kannada as a classical language. During later centuries, Kannada, along with other Dravidian languages such as Telugu, Tamil and Malayalam, has been greatly influenced by Sanskrit in terms of vocabulary, grammar and literary styles.


Tag set : Meaning and Structure

What is a tag set?

A set of defined tags, i.e. a set of word categories to be applied to the word tokens of a text.

Types of tag set

Flat tag set
Hierarchical tag set
Fine-grained tag set

A flat tag set simply lists the categories applicable to a particular language, without any provision for modularity or feature reusability.

A hierarchical tag set is one in which the categories are structured relative to one another, rather than forming a large number of independent categories. It contains a small number of Categories; each Category contains a number of Types, and each Type contains Attributes, and so on, in a tree-like structure.

A fine-grained tag set is one in which minute distinctions are taken into account, and it is accurate in syntactic analysis.
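The contrast between a flat and a hierarchical layout can be sketched in Python. The example below uses the Demonstrative and Adverb categories from the Kannada tag set in this presentation; the dictionary layout and the names HIERARCHICAL_TAGSET, flatten and describe are illustrative assumptions, not part of the tag set specification.

```python
# Category -> Type -> ordered Attribute names, stored as a nested mapping.
# The entries are copied from the Demonstrative and Adverb slides of this deck;
# the layout itself is an assumption made for illustration.
HIERARCHICAL_TAGSET = {
    "D": {
        "name": "Demonstrative",
        "types": {
            "DAB": {"name": "Absolute", "attributes": ["Dimension"]},
            "DWH": {"name": "Wh-demonstrative", "attributes": []},
        },
    },
    "A": {
        "name": "Adverb",
        "types": {
            "AMN": {"name": "Manner", "attributes": ["Clitic"]},
        },
    },
}


def flatten(tagset):
    """Collapse the tree into a flat list of Type codes (structure is lost)."""
    return [code for category in tagset.values() for code in category["types"]]


def describe(tagset):
    """Render the tree as indented lines: Category, then Type with Attributes."""
    lines = []
    for cat_code, category in tagset.items():
        lines.append(f"{category['name']} ({cat_code})")
        for type_code, tag_type in category["types"].items():
            attrs = ", ".join(tag_type["attributes"]) or "no attributes"
            lines.append(f"  {tag_type['name']} ({type_code}): {attrs}")
    return lines


print(flatten(HIERARCHICAL_TAGSET))
print("\n".join(describe(HIERARCHICAL_TAGSET)))
```

The flat list keeps only the tag codes, while the nested form keeps which Category each Type belongs to and which Attributes it carries, so attribute definitions can be reused across Types.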

The present paper is based on a hierarchical tag set.

Preprocessing: A process of normalization of text before tokenization.

Part of speech: Categories that group lexical items which perform similar grammatical functions.

Lexicon: A list of possible tags for the root forms of all the valid words in a given language.


KANNADA TAG SET

Category

Noun (N)
Pronoun (P)
Demonstrative (D)
Nominal Modifier (J)
Verb (V)
Adverb (A)
Participle (L)
Particle (C)
Numeral (NUM)
Reduplication (RDP)
Residual (RD)
Unknown (UNK)
Punctuation (PU)
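Every Type code used in this presentation begins with the code of its Category (for example NC under N, NUMR under NUM, RDF under RD), so the Category of a tag can be recovered by longest-prefix matching. The sketch below assumes that this convention holds for all codes; category_of is a hypothetical helper name.

```python
# Category codes and names as listed on this slide.
CATEGORIES = {
    "N": "Noun",
    "P": "Pronoun",
    "D": "Demonstrative",
    "J": "Nominal Modifier",
    "V": "Verb",
    "A": "Adverb",
    "L": "Participle",
    "C": "Particle",
    "NUM": "Numeral",
    "RDP": "Reduplication",
    "RD": "Residual",
    "UNK": "Unknown",
    "PU": "Punctuation",
}


def category_of(type_code):
    """Return the Category name for a Type code using longest-prefix matching.

    Longest match matters because some codes share a prefix:
    N vs NUM, P vs PU, RD vs RDP.
    """
    matches = [code for code in CATEGORIES if type_code.startswith(code)]
    if not matches:
        raise ValueError(f"no category for code: {type_code}")
    return CATEGORIES[max(matches, key=len)]


for code in ("NC", "NUMR", "PRP", "PU", "RDF", "RDP"):
    print(code, "->", category_of(code))
```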

NOUN

Category: Noun (N)

Types and their attributes:
Common (NC): Gender, Number, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
Proper (NP): Gender, Number, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
Verbal (NV): Case Marker, Post-position, Negative, Clitic
Spatio-temporal (NST): Dimension, Case Marker, Post-position, Clitic

E.g.
(1) మనుష�ెౕ \NC.hum.pl.nom.0.0.0.0.emp ‘people’
(2) ర�ౕశ�ెూడ�ె \NP.mas.sg.gen.0.0.pp.0.0 ‘with Ramesh’
(3) �ాడువ�ద�ె�ౕ \NV.acc.0.0.emp ‘doing’
(4) అ��యవ�ెగూ \NST.dis.gen.pp.incl ‘till there’
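The noun examples can be read mechanically: after the Type code, each dot-separated value fills the next attribute in the order listed for that Type. The following Python sketch decodes the four example tags on that basis. Treating 0 as "attribute not marked" is an assumption inferred from the examples, and the names NOUN_ATTRIBUTES and parse_noun_tag are hypothetical.

```python
# Attribute order per noun Type, as listed on this slide.
FULL = ["gender", "number", "case_marker", "adverbial_suffix",
        "adjectival_suffix", "post_position", "negative", "clitic"]
NOUN_ATTRIBUTES = {
    "NC": FULL,
    "NP": FULL,
    "NV": ["case_marker", "post_position", "negative", "clitic"],
    "NST": ["dimension", "case_marker", "post_position", "clitic"],
}


def parse_noun_tag(tag):
    """Split a tag such as 'NC.hum.pl.nom.0.0.0.0.emp' into (type, features)."""
    # A leading backslash (the separator between word and tag) is tolerated.
    code, *values = tag.lstrip("\\").split(".")
    if code not in NOUN_ATTRIBUTES:
        raise ValueError(f"unknown noun type: {code}")
    names = NOUN_ATTRIBUTES[code]
    if len(values) != len(names):
        raise ValueError(f"{code} expects {len(names)} values, got {len(values)}")
    return code, dict(zip(names, values))


for tag in ("NC.hum.pl.nom.0.0.0.0.emp", "NP.mas.sg.gen.0.0.pp.0.0",
            "NV.acc.0.0.emp", "NST.dis.gen.pp.incl"):
    code, features = parse_noun_tag(tag)
    # Assumption: "0" means the attribute is not marked on this word.
    marked = {name: value for name, value in features.items() if value != "0"}
    print(code, marked)
```

Because the position of a value determines its meaning, a tag with too few or too many values is rejected rather than silently misread.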

PRONOUN

Category: Pronoun (P)

Types and their attributes:
Pronominal (PRP): Gender, Number, Person, Case Marker, Dimension, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic
Reflexive (PRF): Gender, Number, Person, Case Marker, Adverbial suffix, Post-position, Negative, Clitic
Reciprocal (PRC): Gender, Number, Person, Case Marker, Adverbial suffix, Post-position, Negative, Clitic
Wh-Pronoun (PWH): Gender, Number, Person, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Negative, Clitic

E.g.
(5) అవళ� \PRP.fem.sg.3rd.nom.dis.0.0.0.0.0 ‘she’
(6) �ా�ెౕ \PRF.hum.pl.nom.0.0.0.emp ‘yourself’
(7) పరస"ర \PRC.hum.pl.0.nom.0.0.0.0.0 ‘reciprocal’
(8) #ారు \PWH.hum.0.0.nom.0.0.0.0.0 ‘who’

DEMONSTRATIVE

Category: Demonstrative (D)

Types and their attributes:
Absolute (DAB): Dimension
Wh-demonstrative (DWH)

E.g.
(9) ఆ \DAB.dis ‘that’
(10) #ావ \DWH ‘which’

NOMINAL MODIFIER

Category: Nominal Modifier (J)

Types and their attributes:
Adjective (JJ): Negative, Adjectival suffix, Clitic
Quantifier (JQ): Gender, Number, Numeral, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Dimension, Negative, Clitic
Intensifier (JINT): Clitic

E.g.
(11) సుందర�ాద \JJ.0.adj.0 ‘beautiful’
(12) అష&�ె�ౕ \JQ.neu.0.nnm.acc.0.0.0.dis.0.emp ‘that much’
(13) బహళ \JINT.0 ‘much’

VERB

Category: Verb (V)

Attributes: Gender, Number, Person, Tense, Causative, Aspect, Mood, Finiteness, Negative, Defective verb, Clitic

E.g.
(14) బరు�ా)*ె+ౕ \V.fem.sg.3rd.fut.n.prg.intr.nfn.n.n.intr ‘will she come?’
(15) ,ను�,)ద-ళ� \V.fem.sg.3rd.pst.n.prg.0.nfn.n.n.0 ‘he will divide’
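Going in the other direction, a verb tag can be assembled from a set of features by emitting one value per attribute in the order listed for the Verb category. The sketch below is a minimal illustration; the function name build_verb_tag is hypothetical, and filling unspecified attributes with 0 is an assumption based on how the examples use 0.

```python
# Attribute order for the Verb category, as listed on this slide.
VERB_ATTRIBUTES = ["gender", "number", "person", "tense", "causative", "aspect",
                   "mood", "finiteness", "negative", "defective_verb", "clitic"]


def build_verb_tag(**features):
    """Compose a verb tag string from keyword features.

    Attributes that are not supplied are written as "0" (assumed placeholder).
    """
    unknown = set(features) - set(VERB_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown verb attributes: {sorted(unknown)}")
    values = [features.get(name, "0") for name in VERB_ATTRIBUTES]
    return ".".join(["V"] + values)


# Reproduces the tag of example (14).
print(build_verb_tag(gender="fem", number="sg", person="3rd", tense="fut",
                     causative="n", aspect="prg", mood="intr",
                     finiteness="nfn", negative="n", defective_verb="n",
                     clitic="intr"))
```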

ADVERB

Category: Adverb (A)

Types and their attributes:
Manner (AMN): Clitic

E.g.
(16) .ధన�ా0+ౕ \AMN.emp ‘slowly’

PARTICIPLE

Category: Participle (L)

Types and their attributes:
Relative (LRL): Tense, Negative, Adjectival suffix, Post-position, Clitic
Verbal (LV): Tense, Negative, Clitic
Nominal (LN): Gender, Number, Tense, Negative, Case Marker, Adverbial suffix, Adjectival suffix, Post-position, Clitic
Conditional (LC): Adjectival suffix, Negative, Clitic

E.g.
(17) బంద \LRL.pst.0.0.0.emp ‘which has come’
(18) 1ెూౕ0 \LV.pst.0 ‘go’
(19) బరదవరు \LN.hum.pl.pst.y.nom.0.0.0.0 ‘those who have not come’
(20) 1ెౕళ2ద-�ె \LC.0.y.0 ‘if not tell’

PARTICLE

Category: Particle (C)

Types, attributes and examples:
Co-ordinating (CCD): Clitic. (21) మతూ), (‘and’), ఆదరూ (‘but’)
Subordinating (CSB): (22) అథ�ా (‘or’)
Interjection (CIN): (23) ఒ6 (‘oh’), అ7ౕ (‘alas’)
(Dis)Agreement (CAGR): (24) 1ౌదు, (‘yes’), అల� (‘no’)
Confirmative (CCON): (25) 1ౌద= �ా, అ=ా> (‘isn’t it’)
Delimitive (CDLIM): Clitic. (26) �ాత;, <ెౕవల, (‘only’)
Dubitative (CDUB): (27) బహుశః (‘probably’)
Inclusive (CINCL): (28) కూడ (‘also’)
Others (CX)

NUMERAL

Category: Numeral (NUM)

Attributes: Case marker, Clitic, Adverbial suffix, Post-position

Types and examples:
Real (NUMR): (29) 10, 20, 30, 40
Serial (NUMS): (30) 10.5, 25.02
Calendric (NUMC): (31)
Ordinal (NUMO): (32) 3rd, 4th, 20th

REDUPLICATION

Category: Reduplication (RDP)

Attributes: Gender, Number, Person, Case Marker, Post-position, Adverbial suffix, Clitic

E.g.
(33) ఒ?ెూ@బ@�ా0 \RDP.hum.pl.0.nom.0.adv.0 ‘one by one’
(34) అవరవ�ెూడ�ె \RDP.hum.pl.3rd.gen.pp.0.0 ‘with them’

RESIDUAL

Category: Residual (RD)

Types:
Foreign Word (RDF)
Symbol (RDS)

E.g.
(35) काम ‘work’
(36) Ink
(37) @ # $ & %

UNKNOWN

Category: Unknown (UNK)

E.g.
(38) యAాయAా ధమBస ‘Sanskrit shloka’

PUNCTUATION

Category: Punctuation (PU)

E.g.
(39) , . / ? “ : ; } [ \ | = + _ /

ATTRIBUTES AND THEIR VALUES

Attribute: Values

Person \PER: First\1, Second\2, Third\3
Number \NUM: Singular\sg, Plural\pl
Gender \GEN: Masculine\mas, Feminine\fem, Neuter\neu, Human\hum
Case Marker \CSM: Nominative\nom, Accusative\acc, Instrumental\ins, Dative\dat, Ablative\abl, Genitive\gen, Locative\loc
Tense \TNS: Present\prs, Past\pst, Future\fut
Aspect: Imperfect\ipfv, Perfect\prf, Progressive\prog
Mood \MOOD: Interrogative\int, Habitual\hab, Imperative\imp, Optative\opt, Hortative\hort, Debitive\debt, Potential\potn
Finiteness \FIN: Finite\fin, Non-finite\nfn, Infinitive\inf
Dimension \DIM: Proximal\prx, Distal\dst
Clitic \CL: Interrogative\int, Inclusive\incl, Indefiniteness\ind, Emphatic\emp, Comparative\com, Hearsay\hers
Numeral \NML: Cardinal\crd, Ordinal\ord, Non-numeral\nnm
Negative \NEG: Yes\y, No\n
Adverbial suffix\adv
Adjectival suffix\adj
Defective verb \DEF: Yes\y, No\n
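A table of permitted values makes it possible to check tags automatically. The Python sketch below encodes the table and reports any feature whose value is not listed. The keys ASPECT, ADV and ADJ are assumed labels (the table gives no short code for those rows), the function name invalid_values is hypothetical, and accepting 0 as "not marked" is an assumption drawn from the example tags.

```python
# Permitted values per attribute, transcribed from the table on this slide.
# "ASPECT", "ADV" and "ADJ" are assumed keys; the slide gives no code for them.
ALLOWED_VALUES = {
    "PER": {"1", "2", "3"},
    "NUM": {"sg", "pl"},
    "GEN": {"mas", "fem", "neu", "hum"},
    "CSM": {"nom", "acc", "ins", "dat", "abl", "gen", "loc"},
    "TNS": {"prs", "pst", "fut"},
    "ASPECT": {"ipfv", "prf", "prog"},
    "MOOD": {"int", "hab", "imp", "opt", "hort", "debt", "potn"},
    "FIN": {"fin", "nfn", "inf"},
    "DIM": {"prx", "dst"},
    "CL": {"int", "incl", "ind", "emp", "com", "hers"},
    "NML": {"crd", "ord", "nnm"},
    "NEG": {"y", "n"},
    "ADV": {"adv"},
    "ADJ": {"adj"},
    "DEF": {"y", "n"},
}


def invalid_values(features):
    """Return the subset of features whose attribute or value is not in the table."""
    problems = {}
    for attribute, value in features.items():
        if attribute not in ALLOWED_VALUES:
            problems[attribute] = value  # attribute name itself is unknown
        elif value != "0" and value not in ALLOWED_VALUES[attribute]:
            problems[attribute] = value  # value not listed for this attribute
    return problems


print(invalid_values({"GEN": "fem", "NUM": "sg", "PER": "3", "CSM": "nom"}))  # {}
print(invalid_values({"DIM": "dis", "CL": "emp"}))  # {'DIM': 'dis'}
```

Several example tags in the earlier slides use abbreviations that differ from this table (for instance dis rather than dst, 3rd rather than 3, prg rather than prog), so a check of this kind would flag them until one spelling is settled on.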


CONCLUSION

The use of morphological features is especially helpful for developing a reasonable POS tagger when tagged resources are limited. In POS tagging, one word may have more than one part-of-speech label. Syntactic and semantic parsing of natural language sentences is generally influenced by adequate part-of-speech tagging.
