
From Pāṇinian Sandhi to Finite State Calculus

Malcolm D. Hyman

Max Planck Institute for the History of Science, Berlin

First International Sanskrit Computational Linguistics Symposium, Paris, 2007

Overview

1. Research context

2. An XML vocabulary for Pāṇinian rules

3. From Pāṇinian rules to an FST

4. Implications: remarks on linguistic description


Research context

Ongoing work on modeling components of Sanskrit grammar according to Pāṇinian principles

nominal inflection

verbal inflection (using Dhātupāṭha)

stem formation (perfect stem, participial stems ...)

morphophonology (sandhi)


Methodology

How closely to follow Pāṇini?

Practical concerns dictate an incremental approach.

We are obliged to interpret Pāṇini.

Research results concerning both Indian grammatical methods and facts of the Sanskrit language will emerge from computational studies.


Building blocks of an XML model

The rules model not only a Pāṇinian sūtra, but also its context and its interpretation.

An XML schema

A sound-based encoding (SLP1)

A regular expression dialect (PCREs)


The SLP1 encoding

Transliteration → SLP1:

a → a    ā → A    i → i    ī → I    u → u    ū → U
ṛ → f    ṝ → F    ḷ → x    ḹ → X
e → e    ai → E   o → o    au → O   *

k → k    kh → K   g → g    gh → G   ṅ → N
c → c    ch → C   j → j    jh → J   ñ → Y
ṭ → w    ṭh → W   ḍ → q    ḍh → Q   ṇ → R
t → t    th → T   d → d    dh → D   n → n
p → p    ph → P   b → b    bh → B   m → m
y → y    r → r    l → l    v → v
ś → S    ṣ → z    s → s    h → h

* anusvāra = M; visarga = H
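A minimal Python sketch of the encoding above, converting lowercase transliterated text to SLP1. The function name `iast_to_slp1` and the longest-match strategy are illustrative assumptions, not part of the slides. A vowel hiatus such as a + i would be misread as the diphthong ai, so this is only a starting point.

```python
import unicodedata

# Transliteration -> SLP1 pairs, copied from the table above.
_PAIRS = {
    "a": "a", "ā": "A", "i": "i", "ī": "I", "u": "u", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "e": "e", "ai": "E", "o": "o", "au": "O",
    "k": "k", "kh": "K", "g": "g", "gh": "G", "ṅ": "N",
    "c": "c", "ch": "C", "j": "j", "jh": "J", "ñ": "Y",
    "ṭ": "w", "ṭh": "W", "ḍ": "q", "ḍh": "Q", "ṇ": "R",
    "t": "t", "th": "T", "d": "d", "dh": "D", "n": "n",
    "p": "p", "ph": "P", "b": "b", "bh": "B", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "ś": "S", "ṣ": "z", "s": "s", "h": "h",
    "ṃ": "M", "ḥ": "H",
}
# Normalize keys so that precomposed and decomposed input both match.
TABLE = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}


def iast_to_slp1(text):
    """Convert lowercase transliteration to SLP1, longest match first."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TABLE:   # digraphs: kh, ai, au, ...
            out.append(TABLE[pair])
            i += 2
        else:                                  # single sound or unknown symbol
            out.append(TABLE.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(iast_to_slp1("guṇaḥ"))       # guRaH
print(iast_to_slp1("dhātupāṭha"))  # DAtupAWa
```

Because every SLP1 sound is one character, the rule patterns that follow can use plain character classes.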


The rule element

8.3.23 mo 'nusvāraḥ

<rule source="m" target="M" rcontext="[@(wb)][@(hal)]" ref="A.8.3.23"/>

(We may need more than one rule to express a sūtra.)
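To make the rule element concrete, here is a hedged Python sketch that reads such a `<rule>` and turns it into an ordinary regular-expression substitution. The expansions given for `@(wb)` and `@(hal)` are assumptions (the slides do not list them), `compile_rule` is a hypothetical helper, and targets that call functions are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Assumed macro expansions: word-boundary symbols and the SLP1 consonants.
MACROS = {
    "wb": " #",
    "hal": "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh",
}


def expand(pattern):
    """Replace each @(name) reference by its macro value."""
    return re.sub(r"@\((\w+)\)", lambda m: MACROS[m.group(1)], pattern)


def compile_rule(rule_xml):
    """Turn one <rule> element into a function from string to string."""
    rule = ET.fromstring(rule_xml)
    source = expand(rule.get("source"))
    target = rule.get("target")              # literal targets only
    lcontext = expand(rule.get("lcontext", ""))
    rcontext = expand(rule.get("rcontext", ""))
    # The left context is captured and re-emitted; the right context is a
    # lookahead, so it is checked but not consumed.
    pattern = re.compile(f"({lcontext})(?:{source})(?={rcontext})")

    def apply(text):
        return pattern.sub(lambda m: m.group(1) + target, text)

    return apply


anusvara = compile_rule(
    '<rule source="m" target="M" rcontext="[@(wb)][@(hal)]" ref="A.8.3.23"/>'
)
print(anusvara("tam gacCati"))  # taM gacCati
print(anusvara("tam atra"))     # tam atra (a vowel follows, so no change)
```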


The macro element

We need some means for translating Pāṇini's metalanguage, e.g. sound classes (pratyāhāras):

<macro name="JaS" value="JBGQDjbgqd" c="voiced stop"/>
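The following Python sketch shows one way a table of macros could be loaded and expanded. Only the `JaS` macro comes from the slide. The `<macros>` wrapper element, the `wb` macro, and the nested `JaS_final` macro are hypothetical, added to show that a macro value may itself refer to other macros.

```python
import re
import xml.etree.ElementTree as ET

MACRO_XML = """
<macros>
  <macro name="JaS" value="JBGQDjbgqd" c="voiced stop"/>
  <macro name="wb" value=" #" c="word boundary (assumed)"/>
  <macro name="JaS_final" value="[@(JaS)][@(wb)]" c="assumed example"/>
</macros>
"""

MACRO_REF = re.compile(r"@\((\w+)\)")


def load_macros(xml_text):
    """Map each macro name to its (still unexpanded) value."""
    root = ET.fromstring(xml_text)
    return {m.get("name"): m.get("value") for m in root.iter("macro")}


def expand(pattern, macros):
    """Expand @(name) references recursively; cyclic macros are not detected."""
    return MACRO_REF.sub(lambda m: expand(macros[m.group(1)], macros), pattern)


macros = load_macros(MACRO_XML)
print(expand("[@(JaS)]", macros))      # [JBGQDjbgqd]
print(expand("@(JaS_final)", macros))  # [JBGQDjbgqd][ #]
```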


The mapping element

1.1.2 adeṅ guṇaḥ

<mapping name="guna" ref="A.1.1.2">
  <map from="@(a)" to="a"/>
  <map from="@(i)" to="e"/>
  <map from="@(u)" to="o"/>
  <map from="@(f)" to="a"/>
  <map from="@(x)" to="a"/>
</mapping>


The function element

<function name="gunate">
  <rule source="[@(a)@(i)@(u)]" target="%(guna($1))"/>
  <rule source="[@(f)@(x)]" target="%(guna($1))%(semivowel($1))"/>
</function>


Applying a function

6.1.87 ād guṇaḥ

<rule source="[@(a)][@(wb)]([@(ik)])" target="!(gunate($1))" ref="A.6.1.87"/>
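The interplay of mapping, function, and rule can be sketched in Python as follows. The `GUNA` table follows the mapping shown earlier in the slides. The `SEMIVOWEL` table (ṛ → r, ḷ → l) and the expansion of the vowel classes to short and long members are assumptions, since the slides name `semivowel` without showing it.

```python
import re

# guna mapping (1.1.2), expanded over short and long vowels in SLP1.
GUNA = {"a": "a", "A": "a", "i": "e", "I": "e", "u": "o", "U": "o",
        "f": "a", "F": "a", "x": "a", "X": "a"}

# Assumed content of the semivowel mapping used by the function element.
SEMIVOWEL = {"f": "r", "F": "r", "x": "l", "X": "l"}


def gunate(vowel):
    """Guna of a vowel, followed by r or l for the vocalic liquids."""
    return GUNA[vowel] + SEMIVOWEL.get(vowel, "")


# 6.1.87: a or A, a word boundary, then an ik vowel (captured as group 1).
AD_GUNAH = re.compile(r"[aA][ #]([iIuUfFxX])")


def apply_6_1_87(text):
    """Replace the whole matched sequence by the guna of the second vowel."""
    return AD_GUNAH.sub(lambda m: gunate(m.group(1)), text)


print(apply_6_1_87("deva indra"))    # devendra
print(apply_6_1_87("hita upadeSa"))  # hitopadeSa
print(apply_6_1_87("mahA fzi"))      # maharzi
```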


Implementing the modeled rules

The XML model captures some of the structure of Pāṇini's grammar. But the obvious serial application of the rules is computationally inefficient.

The rules can be automatically translated into regular expressions for compilation into a finite state transducer using tools such as xfst (Xerox) or fsa (van Noord).

The relation between the underlying strings and the surface strings is a regular relation.


The replace operator

Rules may be translated into regular expressions employing the replace operator (Karttunen 1995).

(a|A)( | #)(a|A) → a
(a|A)( | #)(i|I) → e
(a|A)( | #)(u|U) → o
(a|A)( | #)(f|F) → ar
(a|A)( | #)(x|X) → al


Context-dependent replacement

Documented algorithms exist for the translation of context-dependent replacements into FSTs (Mohri & Sproat 1996).

6.1.109 eṅaḥ padāntād ati

<rule source="a" target="'" lcontext="[@(eN)][@(wb)]" ref="6.1.109"/>

a → ’ / (e|o)( | #)
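For comparison, the same context-dependent replacement can be written as a Python regular expression. This is a sketch, not the compilation to an FST that the slides describe: it relies on a lookbehind, which works here only because the left context (e or o plus one boundary symbol) has a fixed width of two characters.

```python
import re

# a -> ' when preceded by e or o and a word boundary (space or #).
PURVARUPA = re.compile(r"(?<=[eo][ #])a")


def apply_6_1_109(text):
    """Replace word-initial short a by avagraha after word-final e or o."""
    return PURVARUPA.sub("'", text)


print(apply_6_1_109("vane atra"))     # vane 'tra
print(apply_6_1_109("vane AgacCati")) # vane AgacCati (long A is unaffected)
```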


An FST for 6.1.109

6.1.109 eṅaḥ padāntād ati

[Figure: state diagram of the transducer with states s0, s1, s2. Arc labels: "e, o"; "?"; "space, #"; "?, a:'".]
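A three-state transducer of this kind can be simulated directly in Python. The sketch below is one reading of the diagram, stated here as an assumption: s0 is the start state, s1 is reached after e or o, s2 is reached from s1 on a boundary symbol, and in s2 an input a is rewritten as the avagraha. All states are taken to be final, and "?" stands for any other symbol.

```python
def transduce(text):
    """Run the assumed three-state transducer for 6.1.109 over text."""
    state = 0                      # s0: nothing relevant seen
    out = []
    for ch in text:
        if state == 2 and ch == "a":
            out.append("'")        # arc a:' from s2 back to s0
            state = 0
            continue
        out.append(ch)             # every other arc copies its symbol
        if ch in ("e", "o"):
            state = 1              # s1: just read e or o
        elif state == 1 and ch in (" ", "#"):
            state = 2              # s2: e/o followed by a word boundary
        else:
            state = 0
    return "".join(out)


print(transduce("vane atra"))  # vane 'tra
print(transduce("vana atra"))  # vana atra
```

Each input symbol is inspected once, so the running time is linear in the length of the string.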


A composed FST for external sandhi

37 sūtras constitute core rules for external sandhi

XML: 48 rules, 61 macros, 16 mappings, 3 functions

compiled regular expressions are ~268KB

composed transducer has 4,994 states, 417,814 arcs


Comparing two approaches

Serial application of rules:

FORM      SŪTRA

tat ca
tad ca    8.2.39
taj ca    8.4.40, 44
tac ca    8.4.55
tacca
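The serial derivation can be replayed with a few ordered substitutions in Python. The three patterns below are deliberately narrow stand-ins that cover only this example (final t before a boundary, d before c, j before c); they are assumptions for illustration and not general statements of the sūtras.

```python
import re

# (sutra label, pattern, replacement), applied strictly in this order.
STEPS = [
    ("8.2.39", re.compile(r"t(?= )"), "d"),       # word-final stop is voiced
    ("8.4.40, 44", re.compile(r"d(?= c)"), "j"),  # dental becomes palatal
    ("8.4.55", re.compile(r"j(?= c)"), "c"),      # devoicing before voiceless
]


def derive(form):
    """Return the list of (form, sutra) stages of the derivation."""
    history = [(form, None)]
    for sutra, pattern, replacement in STEPS:
        new_form = pattern.sub(replacement, form)
        if new_form != form:
            history.append((new_form, sutra))
            form = new_form
    history.append((form.replace(" ", ""), None))  # join the two words
    return history


for form, sutra in derive("tat ca"):
    print(f"{form:8}{sutra or ''}")
```

Each stage scans the whole string again, which is the inefficiency that a single composed transducer avoids.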


Comparing two approaches

A unique path through the transducer:

<t:t><a:a><t:c><" ":c><c:ε><a:a>
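The path notation can be unpacked mechanically: each arc is an input:output pair, and ε stands for the empty string. The short Python sketch below (the helper `parse_path` is hypothetical) reads the underlying and the surface string off the path shown above.

```python
import re

PATH = '<t:t><a:a><t:c><" ":c><c:ε><a:a>'

# One arc: <input:output>, where the input may be a quoted space.
ARC = re.compile(r'<("[^"]*"|[^:]+):([^>]+)>')


def parse_path(path):
    """Return the list of (input, output) symbol pairs along the path."""
    pairs = []
    for upper, lower in ARC.findall(path):
        upper = upper.strip('"')              # '" "' becomes a plain space
        lower = "" if lower == "ε" else lower  # epsilon emits nothing
        pairs.append((upper, lower))
    return pairs


pairs = parse_path(PATH)
underlying = "".join(u for u, _ in pairs)
surface = "".join(l for _, l in pairs)
print(repr(underlying), "->", repr(surface))  # 'tat ca' -> 'tacca'
```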


Limitations of segmentalism

Segments are atomic, and enumerating them limits linguistic generalization.

Features overlap segments. It was J. R. Firth's insight that “some phonological properties are not uniquely ‘placed’ with respect to particular segments within a larger unit” (Anderson, 1985, 185).

Coarticulation “can be detected in almost every phoneme sequence in normal speech” (Goodglass, 1993, 62).


Positions of the Indian grammarians

Pāṇini moved beyond the vikāra system of earlier linguistic thinkers (Cardona 1965, 311).

Use of abbreviations (pratyāhāras) for sound classes and the principle of sāvarṇya (A. 1.1.50) emphasize featural analysis.

Segments contain subsegments (e.g. /ṛ/ contains r: MBh. 3.452.1 ff.).

Pitch is a property of the syllable (ṚPr. 3.9) or spreads to adjacent consonants (TPr. 1.43).


N-retroflexion in finite state modeling

Non-final /n/ is realized as ṇ after {ṛ, ṝ, r, ṣ} despite intervening vowels, semivowels, gutturals/velars, labials, or anusvāra.

<rule source="n" target="R" lcontext="[fFrz][#@(aw)@(ku)@(pu)M]*" rcontext=".*[@(ac)]" ref="8.4.1-2"/>


N-retroflexion examples

There is a regular relation between a set of underlying and surface strings that includes the following pairs:

UNDERLYING     SURFACE

bṛṃhana        bṛṃhaṇa        ‘making big/strong’
ārabhyamāna    ārabhyamāṇa    ‘being commenced’
niṣanna        niṣaṇṇa        ‘sitting’
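The retroflexion rule can be tried out on SLP1 strings with a Python regular expression. This is a sketch under stated assumptions: the expansions of `@(aw)`, `@(ku)`, `@(pu)`, and `@(ac)` follow the usual reading of those pratyāhāras and are not given in the slides. The left context has variable length, so it is captured and re-emitted rather than written as a lookbehind. The third pair above is left out because its second ṇ is not produced by this rule alone.

```python
import re

AC = "aAiIuUfFxXeEoO"   # vowels (assumed expansion of ac)
AW = AC + "hyvr"        # vowels plus h, y, v, r (assumed expansion of aw)
KU = "kKgGN"            # velars
PU = "pPbBm"            # labials

# Trigger, any run of permitted intervening sounds, then n; a vowel must
# occur somewhere later in the string (the rule's right context).
NATVA = re.compile(f"([fFrz][#{AW}{KU}{PU}M]*)n(?=.*[{AC}])")


def retroflex_n(word):
    """Apply the 8.4.1-2 rule element to a single SLP1 word."""
    return NATVA.sub(lambda m: m.group(1) + "R", word)


print(retroflex_n("bfMhana"))     # bfMhaRa
print(retroflex_n("AraByamAna"))  # AraByamARa
print(retroflex_n("arcana"))      # arcana (the palatal c blocks the rule)
print(retroflex_n("rAmAn"))       # rAmAn (final n is not affected)
```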


A prosody of retroflexion

When R is projected onto the linear phonematic plane, ṇ occurs within its extension (Allen 1951, 943).

bR ṛṃhaṇa

ā-R rabhyamāṇa

ni-R ṣaṇṇa


How to represent length?

/devāt/ ([+long] segment)

/devaːt/ (phoneme of length)

/devaat/ (two phonemes)


Autosegmental approaches to length

d e v a t

[DBL]

d e v a t

C V C V V C


Autosegmental implications

“stability” of suprasegmental units (Goldsmith 1976)

compensatory lengthening (Latin consul → cōsul; cf. epigraphic COS)

Swedish has complementary distribution of vocalic/consonantal length in rime of stressed syllables

long vowels are structurally parallel to diphthongs on the CV tier but not on the segmental tier


Length in Indian grammar

The Pāṇinian Śivasūtras specify only five basic vowels, not distinguishing between short or long (or pluta) vowels. Pāṇini characteristically refers to a-varṇa, etc., that is, the a vowel independent of its length (1.1.69).


The utility of linguistic descriptions

The virtue of particular linguistic descriptions is substantially relative to their purpose. Linear and non-linear descriptions each have advantages.

The Aṣṭādhyāyī is motivated by brevity and explanatory generality. Computational linguistics strives for efficiency and explicitness.

