Part-Of-Speech Tagging using Neural Networks
Part-Of-Speech Tagging
using Neural Networks

Ankur Parikh, LTRC
IIIT Hyderabad
[email protected]
Outline

1. Introduction
2. Background and Motivation
3. Experimental Setup
4. Preprocessing
5. Representation
6. Single-neuro tagger
7. Experiments
8. Multi-neuro tagger
9. Results
10. Discussion
11. Future Work
Introduction

POS-Tagging:
The process of assigning a part-of-speech tag to each word of natural-language text, based on both the word's definition and its context.

Uses: parsing of sentences, machine translation, information retrieval, word sense disambiguation, speech synthesis, etc.

Methods:
1. Statistical approach
2. Rule-based approach
Background: Previous Approaches

Much work has been done for Hindi using machine-learning tools such as TNT and CRF.
Trade-off: performance versus training time.
- Lower precision affects later stages of the pipeline.
- For a new domain or a new corpus, parameter tuning is a non-trivial task.
Background: Previous Approaches & Motivation

- Empirically chosen context.
- Effective handling of corpus-based features.
- Need of the hour:
  - Good performance
  - Less training time
  - Multiple contexts
  - Effective exploitation of corpus-based features
- Two approaches and their comparison with TNT and CRF.
- Word-level tagging.
Experimental Setup: Corpus Statistics

Tag set of 25 tags.

| Corpus | Size (in words) | Unseen words (%) |
| --- | --- | --- |
| Training | 187,095 | - |
| Development | 23,565 | 5.33% |
| Testing | 23,281 | 8.15% |
Experimental Setup: Tools and Resources

Tools:
- CRF++
- TNT
- Morfessor Categories-MAP

Resources:
- Universal Word - Hindi Dictionary
- Hindi WordNet
- Morph Analyzer
Preprocessing

- The XC tag is removed (Gadde et al., 2008).
- Lexicon:
  - For each unique word w of the training corpus => ENTRY(t1, …, t24)
  - where tj = c(posj, w) / c(w)
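The lexicon construction above can be sketched as follows. This is a minimal illustration, assuming the tagged corpus is available as (word, tag) pairs; the example words and tags are hypothetical:

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus, tagset):
    """For each unique word w, build ENTRY(t1, ..., tn) with tj = c(posj, w) / c(w)."""
    word_counts = Counter()                 # c(w)
    pair_counts = defaultdict(Counter)      # c(posj, w)
    for word, tag in tagged_corpus:
        word_counts[word] += 1
        pair_counts[word][tag] += 1
    return {word: [pair_counts[word][t] / c_w for t in tagset]
            for word, c_w in word_counts.items()}

# hypothetical toy corpus
corpus = [("ghar", "NN"), ("gaya", "VM"), ("ghar", "NN"), ("ghar", "PSP")]
tagset = ["NN", "VM", "PSP"]
lex = build_lexicon(corpus, tagset)
# lex["ghar"] → [2/3, 0.0, 1/3]
```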
Representation: Encoding & Decoding

- Each word w is encoded as an n-element vector INPUT(t1, t2, …, tn), where n = size of the tag set.
- INPUT(t1, t2, …, tn) comes from the lexicon if the training corpus contains w.
- If w is not in the training corpus:
  - N(w) = number of possible POS tags for w
  - tj = 1/N(w) if posj is a candidate
       = 0 otherwise
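A sketch of this encoding, assuming a `lexicon` dict as described above and a `candidate_tags` map (e.g. derived from a morph analyzer) giving the possible tags of unseen words; both names are illustrative:

```python
def encode_word(word, lexicon, candidate_tags, tagset):
    """Return INPUT(t1, ..., tn) for word w."""
    if word in lexicon:                            # seen in training: lexicon entry
        return lexicon[word]
    cands = candidate_tags.get(word, set(tagset))  # fallback: all tags possible (assumption)
    n_w = len(cands)                               # N(w)
    return [1.0 / n_w if t in cands else 0.0 for t in tagset]

tagset = ["NN", "VM", "PSP"]
vec = encode_word("naya", {}, {"naya": {"NN", "VM"}}, tagset)
# vec → [0.5, 0.5, 0.0]
```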
Representation: Encoding & Decoding

- For each word w, the desired output is encoded as D = (d1, d2, …, dn).
  - dj = 1 if posj is the desired output
       = 0 otherwise
- In testing, for each word w, an n-element vector OUTPUT(o1, …, on) is returned.
  - Result = posj, if oj = max(OUTPUT)
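Both directions can be sketched directly; a minimal illustration with a hypothetical three-tag set:

```python
def desired_output(tag, tagset):
    """D = (d1, ..., dn): dj = 1 for the desired tag, 0 otherwise."""
    return [1.0 if t == tag else 0.0 for t in tagset]

def decode(output, tagset):
    """Result = posj where oj = max(OUTPUT)."""
    j = max(range(len(output)), key=lambda i: output[i])
    return tagset[j]

tagset = ["NN", "VM", "PSP"]
d = desired_output("VM", tagset)   # → [0.0, 1.0, 0.0]
tag = decode([0.1, 0.7, 0.2], tagset)  # → "VM"
```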
Single-neuro tagger: Structure
Single-neuro tagger: Training & Tagging

- Error back-propagation learning algorithm
- Weights are initialized with random values
- Sequential mode
- Momentum term
- Eta = 0.4 and Alpha = 0.1
- In tagging, it can give multiple outputs or a sorted list of all tags.
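A minimal sketch of one sequential-mode weight update with a momentum term, using the slide's Eta (learning rate) and Alpha (momentum); the single weight matrix here is illustrative, not the authors' exact network:

```python
import random

def init_weights(n_in, n_out, scale=0.1):
    """Random initialization, as on the slide (range is an assumption)."""
    return [[random.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

def momentum_step(w, grad, velocity, eta=0.4, alpha=0.1):
    """One sequential-mode update: v <- alpha*v - eta*grad; w <- w + v."""
    for i in range(len(w)):
        for j in range(len(w[i])):
            velocity[i][j] = alpha * velocity[i][j] - eta * grad[i][j]
            w[i][j] += velocity[i][j]

w, v = [[1.0]], [[0.0]]
momentum_step(w, [[0.5]], v)
# w[0][0] → 0.8  (1.0 - 0.4 * 0.5, since the velocity starts at zero)
```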
Experiments: Development Data

| Features | Precision |
| --- | --- |
| Corpus-based and contextual | 93.19% |
| Root of the word | 93.38% |
| Length of the word | 94.04% |
| Handling of unseen words: Root -> Dictionary -> WordNet -> Morfessor, with tj = (c(posj, s) + c(posj, p)) / (c(s) + c(p)) | 95.62% |
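The unseen-word formula in the last row pools tag counts over two units; a sketch, assuming s and p denote two segments of the word (e.g. a suffix and a prefix from the Morfessor segmentation, which is my reading, not stated on the slide), with a uniform fallback added here when both are unseen:

```python
def unseen_entry(s_counts, p_counts, tagset):
    """tj = (c(posj, s) + c(posj, p)) / (c(s) + c(p)) for an unseen word."""
    c_s = sum(s_counts.values())   # c(s): total occurrences of segment s
    c_p = sum(p_counts.values())   # c(p): total occurrences of segment p
    denom = c_s + c_p
    if denom == 0:
        # uniform fallback when neither segment was seen (assumption)
        return [1.0 / len(tagset)] * len(tagset)
    return [(s_counts.get(t, 0) + p_counts.get(t, 0)) / denom for t in tagset]

entry = unseen_entry({"NN": 2}, {"NN": 1, "VM": 1}, ["NN", "VM"])
# entry → [0.75, 0.25]
```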
Development of the system

Multi-neuro tagger: Structure

Multi-neuro tagger: Training

Multi-neuro tagger: Learning curves
Multi-neuro tagger: Results

| Structure | Context | Development | Test |
| --- | --- | --- | --- |
| 97-48-24 | 3 | 95.44% | 91.87% |
| 121-48-24 | 4_prev | 95.64% | 92.05% |
| 121-48-24 | 4_next | 95.66% | 91.95% |
| 145-72-24 | 5 | 95.55% | 92.15% |
| 169-72-24 | 6_prev | 95.56% | 92.14% |
| 169-72-24 | 6_next | 95.54% | 92.14% |
| 193-96-24 | 7 | 95.46% | 92.07% |
Multi-neuro tagger: Comparison

Precision after voting: 92.19%

| Tagger | Development | Test | Training Time |
| --- | --- | --- | --- |
| TNT | 95.18% | 91.58% | 1-2 seconds |
| Multi-neuro tagger | 95.78% | 92.19% | 13-14 minutes |
| CRF | 96.05% | 92.92% | 2-2.5 hours |
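The slides report precision after voting without specifying the scheme; a simple majority vote over the component taggers' outputs, with ties going to the first tagger, might look like:

```python
from collections import Counter

def majority_vote(per_tagger_tags):
    """Combine several taggers' outputs word by word by simple majority."""
    voted = []
    for tags in zip(*per_tagger_tags):   # tags for one word, one per tagger
        counts = Counter(tags)
        top = max(counts.values())
        # ties resolved in favour of the earliest tagger's choice
        voted.append(next(t for t in tags if counts[t] == top))
    return voted

# three hypothetical taggers, two words
result = majority_vote([["NN", "VM"], ["NN", "NN"], ["JJ", "NN"]])
# result → ["NN", "NN"]
```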
Conclusion

- Single versus multi-neuro tagger
- Multi-neuro tagger versus TNT and CRF
- Corpus- and dictionary-based features
- More parameters need to be tuned
- 24^5 = 7,962,624 n-grams, versus only 250,560 weights
- Well suited for Indian languages
Future Work

- Better voting schemes (confidence-point based)
- Finding the right context (probability based)
- Various structures and algorithms:
  - Sequential neural network
  - Convolutional neural network
  - Combination with SVM
Thank You!!
Queries???