Segmentation in Sanskrit texts

Upload: amrith-krishna

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Segmentation in Sanskrit texts
Page 2: Segmentation in Sanskrit texts

देहिनोऽस्मिन्यथा देहे कौमारं यौवनं जरा। तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति॥

देहिनः अस्मिन् यथा देहे कौमारं यौवनं जरा तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति

Page 3: Segmentation in Sanskrit texts
Page 4: Segmentation in Sanskrit texts

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

तथा देहान्तर प्राप्तिः धीरः तत्र न मुह्यति

Page 5: Segmentation in Sanskrit texts

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

Page 6: Segmentation in Sanskrit texts

Page 7: Segmentation in Sanskrit texts

तथा देहान्तरप्राप्तिर्धीरस्तत्र न मुह्यति

रािरािेभ्यः रािमय

witi

PMI Matrix of the un-segmentable token lemmas

P(w1,w2,w3,w4) = P(w1 | <s>)P(w2|w1)P(w3|w2)P(w4|w3)P(</s>|w4)
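
The factorisation above is a standard bigram language model with sentence-boundary markers. A minimal sketch of how such a probability could be computed from counts follows; the function names, the toy corpus and the unsmoothed maximum-likelihood estimate are assumptions for illustration, not details taken from the slides.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        contexts.update(padded[:-1])              # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))   # adjacent pairs
    return contexts, bigrams


def sentence_probability(words, contexts, bigrams):
    """P(w1..wn) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn), without smoothing."""
    padded = ["<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0                            # unseen context
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


# Toy corpus of placeholder tokens (hypothetical data).
corpus = [["a", "b"], ["a", "c"]]
contexts, bigrams = train_bigram(corpus)
p_ab = sentence_probability(["a", "b"], contexts, bigrams)
p_ba = sentence_probability(["b", "a"], contexts, bigrams)
```

A practical model would add smoothing so that unseen word pairs do not force the whole product to zero.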

Page 8: Segmentation in Sanskrit texts

Set (size in sentences)    Micro accuracy    Macro accuracy
Training set (1700)        87.76 %           92.56 %
Testing set (150)          87.82 %           93.56 %
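
The slides do not define the two accuracy measures. A common reading, assumed here, is that micro accuracy pools all words across sentences while macro accuracy averages the per-sentence accuracies. The sketch below illustrates that assumed definition with made-up counts.

```python
def micro_macro_accuracy(per_sentence_counts):
    """per_sentence_counts: list of (correct_words, total_words), one pair per sentence.

    Assumed definitions (not confirmed by the slides):
      micro = all correct words / all words
      macro = mean of the per-sentence accuracies
    """
    total_correct = sum(c for c, _ in per_sentence_counts)
    total_words = sum(t for _, t in per_sentence_counts)
    micro = total_correct / total_words
    macro = sum(c / t for c, t in per_sentence_counts) / len(per_sentence_counts)
    return micro, macro


# Hypothetical counts: a long sentence with 9 of 10 words right, a short one with 1 of 2.
micro, macro = micro_macro_accuracy([(9, 10), (1, 2)])
```

Under this reading the two numbers diverge whenever sentence lengths and per-sentence accuracies are unevenly distributed.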

Page 9: Segmentation in Sanskrit texts

• Treat the problem as a query expansion problem.
• Start with unsegmented tokens.
• At each step a new candidate word is selected and added to the query.
• The query expansion iterates till a complete sentence is output.

[Figure: chunks c1, c2, c3, c4 of the input; Chunk 1 is shown with its column of candidate words w1, w2, ..., wk]

S = c1 + c2 + c3 + c4

C2 = Set of wi, which are candidates for semantically correct segmentation.

Similarly for c2 and c3
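
The query-expansion loop described on this slide can be sketched as a greedy procedure. Everything named here is hypothetical: `expand_query`, the co-occurrence `score`, and the simplification that each chunk contributes exactly one word (a real chunk may split into several words).

```python
def expand_query(seed_words, chunk_candidates, score):
    """Greedy query expansion.

    seed_words: tokens that need no segmentation; they form the initial query.
    chunk_candidates: {chunk_id: [candidate words]}, each list non-empty.
    score(query, word): relevance of `word` given the words already in the query.
    Returns {chunk_id: chosen word}.
    """
    query = list(seed_words)
    remaining = dict(chunk_candidates)
    chosen = {}
    while remaining:
        # Pick the best-scoring candidate over all chunks still unresolved.
        chunk, word = max(
            ((c, w) for c, words in remaining.items() for w in words),
            key=lambda pair: score(query, pair[1]),
        )
        query.append(word)      # the query grows by one word per step
        chosen[chunk] = word
        del remaining[chunk]
    return chosen


# Toy co-occurrence scores (hypothetical data).
cooc = {("x", "a1"): 3, ("x", "b2"): 1, ("a1", "b1"): 5}


def score(query, word):
    return sum(cooc.get((q, word), 0) for q in query)


chosen = expand_query(["x"], {"c1": ["a1", "a2"], "c2": ["b1", "b2"]}, score)
```

In the toy data, `b2` initially outscores `b1`, but once `a1` joins the query `b1` becomes the better choice, which is the effect the iterative expansion is meant to capture.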

Page 10: Segmentation in Sanskrit texts


Page 13: Segmentation in Sanskrit texts

• From the query nodes, reach the most promising candidate word nodes.
• Perform multiple personalised random walks.
• Edge weights: accommodate heterogeneous information.
• Learn a weight for each random walk approach (path) by supervised methods.
• The weighted sum of all the random walk methods gives the most suitable candidate.
• PS: We use 4 lakh (400,000) tagged sentences from the Digital Corpus of Sanskrit.

Language Model (LM) with word lemmas

LM with morphological types

Verb specific Expectancy

Compound word formation patterns
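
As a rough illustration of the scoring described above, the sketch below follows a sequence of edge types from the query nodes and combines the resulting distributions with per-path weights. The graph layout, the function names and the hand-set weights are assumptions; in the slides the weights are learned by supervised methods.

```python
def normalise(row):
    """Turn outgoing edge weights into transition probabilities."""
    total = sum(row.values())
    return {node: w / total for node, w in row.items()} if total else {}


def walk(start, path, edges):
    """Distribution over nodes after following the edge types in `path`.

    start: {node: probability} over the query nodes.
    edges: {edge_type: {source: {target: weight}}} (hypothetical layout).
    """
    dist = dict(start)
    for edge_type in path:
        nxt = {}
        for node, p in dist.items():
            for target, q in normalise(edges[edge_type].get(node, {})).items():
                nxt[target] = nxt.get(target, 0.0) + p * q
        dist = nxt
    return dist


def rank_candidates(start, paths, weights, edges, candidates):
    """Score each candidate by the weighted sum of its per-path walk probabilities."""
    scores = {c: 0.0 for c in candidates}
    for path, weight in zip(paths, weights):
        dist = walk(start, path, edges)
        for c in candidates:
            scores[c] += weight * dist.get(c, 0.0)
    return max(scores, key=scores.get), scores


# Toy graph with two edge types (hypothetical weights).
edges = {
    "LMw": {"q": {"a": 3, "b": 1}},
    "LMt": {"q": {"a": 1, "b": 1}},
}
best, scores = rank_candidates(
    {"q": 1.0}, [("LMw",), ("LMt",)], [0.75, 0.25], edges, ["a", "b"]
)
```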

Page 14: Segmentation in Sanskrit texts

Language Model with words - LMw

LM with morphological types - LMt

Verb specific Expectancy – ViE

Compound word formation patterns

PCRW - Unifying Framework

• Handle free word order.
• Incorporate heterogeneous types of information.
• Bonus: form different relational paths (up to length l) by combining individual edge weights.
• For l = 3, some sample paths that can be formed by combination:
• LMw -> LMt -> LMw
• LMt -> V1E -> LMt
• LMt -> VkE -> LMt
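
To make the path construction concrete, the sketch below enumerates every sequence of edge types up to a given length. The function name is hypothetical, and exhaustive enumeration is an assumption; the slides do not say whether all combinations are used or only a selected subset.

```python
from itertools import product


def relation_paths(edge_types, max_len):
    """All sequences of edge types with length 1..max_len."""
    paths = []
    for length in range(1, max_len + 1):
        paths.extend(product(edge_types, repeat=length))
    return paths


# Three edge types from the slide; V1E stands for one verb-specific edge type.
paths = relation_paths(["LMw", "LMt", "V1E"], 3)
```

With three edge types and l = 3 this yields 3 + 9 + 27 = 39 paths, including `("LMw", "LMt", "LMw")` and `("LMt", "V1E", "LMt")` from the list above. The count grows exponentially with l, which is one reason to keep l small.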

Page 15: Segmentation in Sanskrit texts