From Pāṇinian Sandhi to Finite State Calculus
Malcolm D. Hyman
Max Planck Institute for the History of Science, Berlin
First International Sanskrit Computational Linguistics Symposium, Paris, 2007 – p.1
Overview
1. Research context
2. An XML vocabulary for Pāṇinian rules
3. From Pāṇinian rules to an FST
4. Implications: remarks on linguistic description
Research context
Ongoing work on modeling components of Sanskrit grammar according to Pāṇinian principles
nominal inflection
verbal inflection (using Dhātupāṭha)
stem formation (perfect stem, participial stems...)
morphophonology (sandhi)
Methodology
How closely to follow Pāṇini?
Practical concerns dictate an incremental approach.
We are obliged to interpret Pāṇini.
Research results concerning both Indian grammatical methods and facts of the Sanskrit language will emerge from computational studies.
Building blocks of an XML model
The rules model not only a Pāṇinian sūtra, but also its context and its interpretation.
An XML schema
A sound-based encoding (SLP1)
A regular expression dialect (PCREs)
The SLP1 encoding
Each entry: Devanāgarī, transliteration, SLP1

अ a a · आ ā A · इ i i · ई ī I · उ u u · ऊ ū U
ऋ ṛ f · ॠ ṝ F · ऌ ḷ x · ॡ ḹ X
ए e e · ऐ ai E · ओ o o · औ au O *

क k k · ख kh K · ग g g · घ gh G · ङ ṅ N
च c c · छ ch C · ज j j · झ jh J · ञ ñ Y
ट ṭ w · ठ ṭh W · ड ḍ q · ढ ḍh Q · ण ṇ R
त t t · थ th T · द d d · ध dh D · न n n
प p p · फ ph P · ब b b · भ bh B · म m m
य y y · र r r · ल l l · व v v
श ś S · ष ṣ z · स s s · ह h h

* anusvāra = M; visarga = H
The rule element
8.3.23 mo 'nusvāraḥ
<rule source="m" target="M" rcontext="[@(wb)][@(hal)]" ref="A.8.3.23"/>
(We may need more than one rule to express a sūtra.)
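One way to read such a rule element is as a regular-expression substitution with lookaround contexts. The following is only a sketch under assumptions: the macro values for `hal` (consonants) and `wb` (space or `#`) are guesses at what the macros contain, and `expand`, `compile_rule` and `apply_rule` are hypothetical helper names, not part of the XML vocabulary.

```python
import re

# Hypothetical macro table. The names appear in the rule above; the values
# are assumptions (all SLP1 consonants for "hal", space or "#" for "wb").
MACROS = {
    "hal": "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh",
    "wb": " #",
}

def expand(pattern):
    """Replace every @(name) reference with that macro's characters."""
    return re.sub(r"@\((\w+)\)", lambda m: MACROS[m.group(1)], pattern)

def compile_rule(source, target, lcontext="", rcontext=""):
    """Turn rule attributes into a (compiled regex, replacement) pair.

    Python lookbehind must be fixed-width, so this sketch only supports
    left contexts without * or +.
    """
    left = f"(?<={expand(lcontext)})" if lcontext else ""
    right = f"(?={expand(rcontext)})" if rcontext else ""
    return re.compile(left + expand(source) + right), target

def apply_rule(rule, text):
    regex, target = rule
    return regex.sub(target, text)

# 8.3.23: word-final m becomes anusvara (M) before a consonant.
RULE_8_3_23 = compile_rule("m", "M", rcontext="[@(wb)][@(hal)]")
print(apply_rule(RULE_8_3_23, "tam ca"))
```

Because the right context is a lookahead, the boundary and the following consonant are left in place for later rules to inspect.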
The macro element
We need some means for translating Pāṇini's metalanguage, e.g. sound classes (pratyāhāras):
<macro name="JaS" value="JBGQDjbgqd" c="voiced stop"/>
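Macro values like the one above could in principle be derived rather than typed in. The sketch below assumes the macro name is the SLP1 spelling of the pratyāhāra (first sound, `a`, final marker) and computes the class from the fourteen Śivasūtras; `SIVASUTRAS` and `pratyahara` are hypothetical names, and whether the macros were actually generated this way is not stated in the slides.

```python
# The fourteen Sivasutras in SLP1 as (sounds, final marker) pairs.
SIVASUTRAS = [
    ("aiu", "R"), ("fx", "k"), ("eo", "N"), ("EO", "c"),
    ("hyvr", "w"), ("l", "R"), ("YmNRn", "m"), ("JB", "Y"),
    ("GQD", "z"), ("jbgqd", "S"), ("KPCWTcwt", "v"), ("kp", "y"),
    ("Szs", "r"), ("h", "l"),
]

def pratyahara(name):
    """Expand a pratyahara name such as "JaS", "ik" or "ac".

    The first letter is the starting sound and the last letter the marker.
    Sounds are collected from the first occurrence of the starting sound up
    to the first following sutra that ends in the marker. Only the short
    vowels are listed; long vowels are not added here.
    """
    start, marker = name[0], name[-1]
    sounds = []
    collecting = False
    for group, it in SIVASUTRAS:
        if not collecting and start in group:
            collecting = True
            group = group[group.index(start):]
        if collecting:
            sounds.append(group)
            if it == marker:
                return "".join(sounds)
    raise ValueError(f"not a pratyahara: {name}")

print(pratyahara("JaS"))
```

For `JaS` this yields the same ten sounds as the `value` attribute of the macro above.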
The mapping element
1.1.2 adeṅ guṇaḥ
<mapping name="guna" ref="A.1.1.2">
  <map from="@(a)" to="a"/>
  <map from="@(i)" to="e"/>
  <map from="@(u)" to="o"/>
  <map from="@(f)" to="a"/>
  <map from="@(x)" to="a"/>
</mapping>
The function element
<function name="gunate">
  <rule source="[@(a)@(i)@(u)]" target="%(guna($1))"/>
  <rule source="[@(f)@(x)]" target="%(guna($1))%(semivowel($1))"/>
</function>
Applying a function
6.1.87 ād guṇaḥ
<rule source="[@(a)][@(wb)]([@(ik)])" target="!(gunate($1))" ref="A.6.1.87"/>
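The mapping, function and rule for guṇa can be imitated with a dictionary, a small function and a regex callback. This is a sketch, not the implementation behind the slides: expanding `@(i)` to both `i` and `I` (and likewise for the other vowels) is an assumption, and the `semivowel` mapping is hypothetical, since the slides use it without defining it.

```python
import re

# The "guna" mapping (A. 1.1.2), keyed by SLP1 vowel. Treating long and
# short vowels alike is an assumption about how @(i) etc. expand.
GUNA = {"a": "a", "A": "a", "i": "e", "I": "e", "u": "o", "U": "o",
        "f": "a", "F": "a", "x": "a", "X": "a"}

# Hypothetical "semivowel" mapping: r after the guna of f/F, l after x/X.
SEMIVOWEL = {"f": "r", "F": "r", "x": "l", "X": "l"}

def gunate(vowel):
    """Guna substitute of a vowel, plus r or l for the vocalic liquids."""
    return GUNA[vowel] + SEMIVOWEL.get(vowel, "")

# 6.1.87: a or A, a word boundary, then a vowel of the class ik.
RULE_6_1_87 = re.compile(r"[aA][ #]([iIuUfFxX])")

def apply_6_1_87(text):
    """Replace the whole a + boundary + ik sequence by the guna vowel."""
    return RULE_6_1_87.sub(lambda m: gunate(m.group(1)), text)

print(apply_6_1_87("ca iha"))
```

The callback plays the role of `!(gunate($1))`: the captured vowel decides the replacement for the entire matched span.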
Implementing the modeled rules
The XML model captures some of the structure of Pāṇini's grammar. But the obvious serial application of the rules is computationally inefficient.
The rules can be automatically translated into regular expressions for compilation into a finite state transducer using tools such as xfst (Xerox) or fsa (van Noord).
The relation between the underlying strings and the surface strings is a regular relation.
The replace operator
Rules may be translated into regular expressions employing the replace operator (Karttunen 1995).
(a|A)( | #)(a|A) → a
(a|A)( | #)(i|I) → e
(a|A)( | #)(u|U) → o
(a|A)( | #)(f|F) → ar
(a|A)( | #)(x|X) → al
Context-dependent replacement
Documented algorithms exist for the translation of context-dependent replacements into FSTs (Mohri & Sproat 1996).
6.1.109 eṅaḥ padāntād ati
<rule source="a" target="’" lcontext="[@(eN)][@(wb)]" ref="6.1.109"/>
a → ’ / (e|o)( | #) _
An FST for 6.1.109
6.1.109 eṅaḥ padāntād ati
[Figure: transducer with three states s0, s1, s2. Arc labels: "e, o"; "?" (any other symbol); "␣, #" (word boundary); "a:’".]
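A three-state transducer of this kind can be simulated directly. The arc layout below is a reading of the figure labels, not a copy of the original diagram: s0 is the start state, s1 means "just read e or o", s2 means "then read a word boundary", and the only non-identity arc is a:' out of s2. The function names are hypothetical, and the plain apostrophe stands in for the avagraha.

```python
BOUNDARY = {" ", "#"}   # word boundary symbols
TRIGGER = {"e", "o"}    # the class eN

def step(state, ch):
    """Return (next_state, output) for one input symbol."""
    if ch in TRIGGER:
        return 1, ch                 # any state: e/o leads to s1
    if state == 1 and ch in BOUNDARY:
        return 2, ch                 # s1: boundary leads to s2
    if state == 2 and ch == "a":
        return 0, "'"                # s2: the a:' arc
    return 0, ch                     # "?": anything else, back to s0

def transduce(text):
    """Run the transducer over the whole string in a single pass."""
    state, out = 0, []
    for ch in text:
        state, emitted = step(state, ch)
        out.append(emitted)
    return "".join(out)

print(transduce("te atra"))
```

Each input symbol is consumed exactly once, which is the property that makes the compiled transducer attractive compared with repeated rule application.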
A composed FST for external sandhi
37 sūtras constitute core rules for external sandhi
XML: 48 rules, 61 macros, 16 mappings, 3 functions
compiled regular expressions are ~268 KB
composed transducer has 4,994 states, 417,814 arcs
Comparing two approaches
Serial application of rules:
FORM      SŪTRA
tat ca
tad ca    8.2.39
taj ca    8.4.40, 44
tac ca    8.4.55
tacca
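The serial derivation can be replayed as a cascade of substitutions, each rewriting the output of the previous one. The patterns below are toy stand-ins that cover only this one derivation, not the real scope of each sūtra, and the final `join` step (writing the two words together) is a hypothetical label for the last row of the table.

```python
import re

# (label, pattern, replacement). Assumed, deliberately narrow patterns.
STEPS = [
    ("8.2.39", r"t(?= )", "d"),       # word-final t is voiced
    ("8.4.40", r"d(?= c)", "j"),      # d becomes palatal before c
    ("8.4.55", r"j(?= c)", "c"),      # j is devoiced before c
    ("join", r"(?<=c) (?=c)", ""),    # the words are written together
]

def derive(form):
    """Apply the steps in order and record every intermediate form."""
    trace = [(form, "")]
    for label, pattern, replacement in STEPS:
        form = re.sub(pattern, replacement, form)
        trace.append((form, label))
    return trace

for form, label in derive("tat ca"):
    print(f"{form:8} {label}")
```

Every step rescans the whole string, so the cost grows with the number of rules; a composed transducer avoids the intermediate forms altogether.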
Comparing two approaches
A unique path through the transducer:
<t:t><a:a><t:c><" ":c><c:ε><a:a>
Limitations of segmentalism
Segments are atomic, and enumerating them limits linguistic generalization.
Features overlap segments. It was J. R. Firth's insight that “some phonological properties are not uniquely ‘placed’ with respect to particular segments within a larger unit” (Anderson, 1985, 185).
Coarticulation “can be detected in almost every phoneme sequence in normal speech” (Goodglass, 1993, 62).
Positions of the Indian grammarians
Pāṇini moved beyond the vikāra system of earlier linguistic thinkers (Cardona 1965, 311).
Use of abbreviations (pratyāhāras) for sound classes and the principle of sāvarṇya (A.1.1.50) emphasize featural analysis.
Segments contain subsegments (e.g. /ṛ/ contains r: MBh. 3.452.1 ff.).
Pitch is a property of the syllable (ṚPr. 3.9) or spreads to adjacent consonants (TPr. 1.43).
N-retroflexion in finite state modeling
Non-final /n/ is realized as ṇ after {ṛ, ṝ, r, ṣ} despite intervening vowels, semivowels, gutturals/velars, labials, or anusvāra.
<rule source="n" target="R" lcontext="[fFrz][#@(aw)@(ku)@(pu)M]*" rcontext=".*[@(ac)]" ref="8.4.1-2"/>
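The long-distance left context can be tried out with an ordinary regex. Python's lookbehind cannot contain `*`, so this sketch captures the left context in a group and writes it back. The expansions of `@(aw)`, `@(ku)`, `@(pu)` and `@(ac)` are assumptions based on the usual definitions of those classes (with long vowels added), and `retroflex` is a hypothetical name.

```python
import re

VOWELS = "aAiIuUfFxXeEoO"   # assumed expansion of @(ac), long vowels included
# Assumed transparent segments: "#", @(aw) = vowels + h y v r, velars, labials, M.
INTERVENING = "#" + VOWELS + "hyvr" + "kKgGN" + "pPbBm" + "M"

# Trigger, any run of transparent segments, then n; the lookahead mirrors
# rcontext=".*[@(ac)]", so a string-final n with no vowel after it is skipped.
NATI = re.compile(rf"([fFrz][{INTERVENING}]*)n(?=.*[{VOWELS}])")

def retroflex(text):
    """Rewrite n as R (retroflex nasal) inside the retroflexion span."""
    return NATI.sub(r"\g<1>R", text)

print(retroflex("bfMhana"))
```

A dental such as t is not in the transparent class, so it blocks the rule. This single pattern turns only the first n of nizanna into R; the geminate in the surface form needs more than this one rule.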
N-retroflexion examples
There is a regular relation between a set of underlying and surface strings that includes the following pairs:
UNDERLYING     SURFACE
bṛṃhana        bṛṃhaṇa        ‘making big/strong’
ārabhyamāna    ārabhyamāṇa    ‘being commenced’
niṣanna        niṣaṇṇa        ‘sitting’
A prosody of retroflexion
When R is projected onto the linear phonematic plane, ṇ occurs within its extension (Allen 1951, 943).
[Figure: the prosody R marked over bṛṃhaṇa (beginning at ṛ), ārabhyamāṇa (beginning at r) and niṣaṇṇa (beginning at ṣ).]
How to represent length?
/devāt/ ([+long] segment)
/devaːt/ (phoneme of length)
/devaat/ (two phonemes)
Autosegmental approaches to length
d e v a t
[DBL]
d e v a t
C V C V V C
Autosegmental implications
“stability” of suprasegmental units (Goldsmith 1976)
compensatory lengthening (Latin consul → cōsul; cf. epigraphic COS)
Swedish has complementary distribution of vocalic/consonantal length in rime of stressed syllables
long vowels are structurally parallel to diphthongs on the CV tier but not on the segmental tier
Length in Indian grammar
The Pāṇinian Śivasūtras specify only five basic
vowels, not distinguishing between short or long
(or pluta) vowels. Pāṇini characteristically refers
to a-varṇa, etc., that is, the a vowel independent
of its length (1.1.69).
The utility of linguistic descriptions
The virtue of particular linguistic descriptions is substantially relative to their purpose.
Linear and non-linear descriptions each have advantages.
The Aṣṭādhyāyī is motivated by brevity and explanatory generality. Computational linguistics strives for efficiency and explicitness.