hapaxlegomena tipo log i a lengua je
TRANSCRIPT
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
1/10
This article was downloaded by: [Corporacion CINCEL]On: 03 May 2012, At: 07:08Publisher: RoutledgeInforma Ltd Registered in England and Wales Registered Number: 1072954Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Journal of Quantitative LinguisticsPublication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/njql20
Hapax Legomena and Language
TypologyIoan-Iovitz Popescu
a& Gabriel Altmann
b
aBucharest UniversitybLdenscheid
Available online: 07 Oct 2008
To cite this article:Ioan-Iovitz Popescu & Gabriel Altmann (2008): Hapax Legomena and
Language Typology, Journal of Quantitative Linguistics, 15:4, 370-378
To link to this article: http://dx.doi.org/10.1080/09296170802326699
PLEASE SCROLL DOWN FOR ARTICLE
Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions
This article may be used for research, teaching, and private study purposes.
Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expresslyforbidden.
The publisher does not give any warranty express or implied or make anyrepresentation that the contents will be complete or accurate or up to date. Theaccuracy of any instructions, formulae, and drug doses should be independentlyverified with primary sources. Thepublisher shall not be liable for any loss,actions, claims, proceedings, demand, or costs or damages whatsoever orhowsoever caused arising directly or indirectly in connection with or arising out
of the use of this material.
http://www.tandfonline.com/loi/njql20http://www.tandfonline.com/page/terms-and-conditionshttp://www.tandfonline.com/page/terms-and-conditionshttp://dx.doi.org/10.1080/09296170802326699http://www.tandfonline.com/loi/njql20 -
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
2/10
Hapax Legomena and Language Typology*
Ioan-Iovitz Popescu1 and Gabriel Altmann21Bucharest University; 2Lu denscheid
ABSTRACT
Counting word forms one obtains more hapax legomena in highly synthetic languagesthan in highly analytic ones. We propose an index of analytism based exclusively onmechanical counting of word forms in text.
Since the seminal work of V. Skalicka (see esp. 20042006) language
typology is not a classification of languages any more. Though many
linguists are still engaged in describing, collecting and locating isolated
language phenomena, typology can nowadays be seen as the study ofrelationships between language properties. In general, the relationships
are the stronger the more contingent two phenomena are but it can be
expected that even distant phenomena like phonemics and syntax
display some dependences. The consequent continuation of this kind of
typology has become language synergetics (cf. Ko hler, 1986, 2005), which
tries to express all relationships formally and derive them from a
common self-regulation scheme. Needless to say, all relationships in
language are stochastic and the resulting equations are either nonlinear
regression functions or probability distributions. Many of them holdonly for averages.
However, up to now, textual phenomena have not been exploited for
finding any typological relationships because the study of texts in many
languages at once is not as easy as using ready-made grammar textbooks
from which elaborated phenomena can be selected. Perhaps the first
*Address correspondence to: Gabriel Altmann, Stu ttinghauser Ringstr. 44, 58515Lu denscheid. E-mail: [email protected]
Journal of Quantitative Linguistics2008, Volume 15, Number 4, pp. 370378DOI: 10.1080/09296170802326699
0929-6174/08/15040370 2008 Taylor & Francis
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
3/10
attempt to use textual phenomena is made in Popescu et al. (2008), where
a relationship between a function of theh-point (cf. Popescu, 2007) of the
rank-frequency distribution of word forms and the degree of synthetismof language has been found. The study was performed in texts of 20
languages and the relationship does not depend on text length.
In the present contribution we shall try to show a relationship
between hapax legomena of texts and the synthetism/analytism of
language. Logically, if a language is highly synthetic, whatever the text
length, not all forms of each word will occur several times (i.e. more
than once). A great number of forms will occur only once, thus forming
the stock of hapax legomena occupying the last HL ranks. On the
contrary, in highly analytic languages the number of forms is small,
hence all words have a greater chance of being repeated. The situation
would be quite different if we analysed lemmatized texts, in which no
morphology is contained.
This result (showing only the bottom part of the data) simply shows
that in highly synthetic languages with a long hapax legomena sequence
the theoretical function rather underestimates the frequencies trying to
capture the steep decrease in frequencies.
In the case of highly analytic languages (Figure 1c) the theoreticalfunction overestimates the empirical frequencies (not only of hapax
legomena).
In the present work we try to give to these considerations a
quantitative expression, as illustrated in Figures 1a, b, c. For this
purpose, let us denote V the vocabulary of word forms and HL the
number of ranks at which hapax legomena of the rank-frequency
distribution occur, and let us suppose that the rank-frequency distribu-
tion follows Zipfs law. The assumption of the validity of Zipfs law has
been corroborated on innumerable data sets in many languages, so thatsporadic exceptions could be captured by one of the many general-
izations of this law. For the sake of simplicity we shall use the Zipf
function f(r) c/ra in which a and c are fitting parameters, thus gettingrid of the necessity of normalization and truncation at the right-hand
side.
Now, if we fit this function to rank-frequency data, we may expect that
with a good fitting the function achieves f(r) 1 (i.e. the level of hapaxlegomena) exactly in the middle of hapax legomena (i.e. at r V HL/2),
as shown in Figure 1a for a Bulgarian text. But iterative fitting of the Zipffunction may yield different values. Therefore, in the following we will
HAPAX LEGOMENA AND LANGUAGE TYPOLOGY 371
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
4/10
Fig. 1b. Over-Zipfian location of hapax legomena in a highly synthetic language (hereHungarian).
Fig. 1a. Zipfian location of hapax legomena in a balanced language (here Bulgarian).
372 I.-I. POPESCU & G. ALTMANN
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
5/10
take the above central location as a reference for any other fitting and
introduce the indicator
A c
VHL=2a 1
as an expression of analytism of a language. The interval of this indicator
is not yet known, but the relationship to analytism/synthetism is evident:
the greater A, the stronger is the analytism. In order to check ourhypothesis we used the texts analysed in Popescu et al. (2008) and
obtained A-values presented in Table 1 for 100 texts from 20 languages
and in Table 2 for the corresponding language averages. The individual
extreme cases of Table 1 are illustrated in Figure 1b for a highly synthetic
Hungarian text and in Figure 1c for a highly analytic Hawaiian text.
Generally, the Zipf fitting curve is displaced rightwards for analytic
languages and leftwards for synthetic languages with respect to the real
rank-frequency distribution. The displacement depends on the degree of
analytism/synthetism. Clearly, there is a bi-univocal and reciprocalcorrespondence between the analytism indicatorAand its crossing point
Fig. 1c. Under-Zipfian location of hapax legomena in a highly analytic language (hereHawaiian).
HAPAX LEGOMENA AND LANGUAGE TYPOLOGY 373
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
6/10
Table 1. Zipfs function f(r) c/ra fitting to data of 100 texts from 20 languages.
ID V a c R
2
HL
c
VHL=2a
B 01 400 0.6850 41.8602 0.9837 298 0.9507B 02 201 0.5704 17.6950 0.8705 153 1.1292B 03 285 0.5550 20.9975 0.8790 212 1.1798B 04 286 0.6169 23.6917 0.9619 222 0.9790B 05 238 0.6202 22.0499 0.9367 187 1.0090Cz 01 638 0.7473 54.2844 0.9764 517 0.6416Cz 02 543 0.7169 51.9648 0.9767 412 0.8013Cz 03 1274 0.8028 175.4805 0.9832 964 0.8261Cz 04 323 0.6228 23.3822 0.9537 241 0.8562
Cz 05 556 0.8722 77.1944 0.9715 445 0.4864E 01 939 0.7657 145.9980 0.9620 662 1.0783E 02 1017 0.7434 180.1325 0.9661 735 1.4610E 03 1001 0.8179 254.7482 0.9752 620 1.2123E 04 1232 0.8712 385.9532 0.9870 693 1.0449E 05 1495 0.8009 319.1386 0.9822 971 1.2529E 07 1597 0.7568 300.1258 0.9347 1075 1.5416E 13 1659 0.8034 811.1689 0.9800 736 2.5688G 05 332 0.6935 32.8211 0.9646 250 0.8129G 09 379 0.6523 32.5565 0.9626 302 0.9431G 10 301 0.6053 21.8114 0.9402 237 0.9331
G 11 297 0.5895 19.9677 0.9593 232 0.9320G 12 169 0.6062 14.3627 0.9514 141 0.8888G 14 129 0.5755 10.8110 0.9349 107 0.8977G 17 124 0.5515 13.1021 0.9349 84 1.1531H 01 1079 1.2268 214.2708 0.9600 844 0.0749H 02 789 1.1865 122.0057 0.9365 638 0.0824H 03 291 1.2114 44.9653 0.8864 259 0.0950H 04 609 0.9549 74.8581 0.9451 509 0.2753H 05 290 0.8168 30.9795 0.9093 250 0.4784Hw 03 521 0.7932 329.6012 0.9489 255 2.8821Hw 04 744 0.7633 678.1305 0.9154 347 5.3384
Hw 05 680 0.7267 592.6243 0.8742 302 6.2199Hw 06 1039 0.7816 1081.7823 0.9352 500 5.8855I 01 3667 0.7266 509.5979 0.9336 2514 1.7784I 02 2203 0.7488 305.6487 0.9559 1604 1.3468I 03 483 0.7895 56.8099 0.9523 382 0.6427I 04 1237 0.7014 153.3448 0.9385 848 1.3948I 05 512 0.6524 54.5840 0.9293 355 1.2306In 01 221 0.5809 18.2346 0.9486 166 1.0420In 02 209 0.5915 19.1717 0.9583 147 1.0509In 03 194 0.5417 15.6229 0.9565 130 1.1233In 04 213 0.4877 11.9156 0.9574 145 1.0683
(continued)
374 I.-I. POPESCU & G. ALTMANN
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
7/10
Table 1. (Continued).
ID V a c R
2
HL
c
VHL=2a
In 05 188 0.5374 19.4218 0.8843 121 1.4347Kn 003 1833 0.6072 66.4545 0.9775 1373 0.9223Kn 004 720 0.5237 22.1001 0.9699 564 0.9144Kn 005 2477 0.6621 124.5588 0.9105 1784 0.9480Kn 006 2433 0.5809 95.9573 0.9522 1655 1.3181Kn 011 2516 0.5786 77.0267 0.9666 1873 1.0862Lk 01 174 0.6416 23.4838 0.9348 127 1.1474Lk 02 479 0.7731 139.2126 0.9510 302 1.5798Lk 03 272 0.7512 71.8668 0.9527 174 1.4240
Lk 04 116 0.6792 18.7509 0.9801 80 0.9901Lt 01 2211 0.7935 109.3668 0.9078 1792 0.3666Lt 02 2334 0.8047 160.3530 0.9335 1878 0.4729Lt 03 2703 0.6366 109.5291 0.9832 2049 0.9695Lt 04 1910 0.6505 129.2023 0.9463 1359 1.2627Lt 05 909 0.5877 34.1056 0.9713 737 0.8449Lt 06 609 0.5293 19.3370 0.9325 521 0.8726M 01 398 0.7680 185.4091 0.9225 202 2.3386M 02 277 0.8197 123.4636 0.9693 146 1.5787M 03 277 0.7902 147.8281 0.9557 133 2.1571M 04 326 0.8353 137.7184 0.9763 192 1.4664
M 05 514 0.7484 297.2460 0.9306 239 3.3897Mq 01 289 0.8030 240.0615 0.9588 91 2.9102Mq 02 150 0.7440 46.4870 0.9655 86 1.4370Mq 03 301 0.9795 225.2046 0.9856 138 1.0853Mr 001 1555 0.6293 78.3965 0.9815 1128 1.0210Mr 018 1788 0.6685 128.5531 0.9863 1249 1.1470Mr 026 2038 0.6224 101.6971 0.9633 1486 1.1758Mr 027 1400 0.6166 120.0829 0.9456 846 1.7214Mr 288 2079 0.6304 100.2890 0.9683 1534 1.0857R 01 843 0.6720 73.6423 0.9571 606 1.0739R 02 1179 0.7567 115.8007 0.9802 908 0.7930
R 03 719 0.7175 60.8094 0.9778 567 0.7771R 04 729 0.6673 52.4236 0.9798 573 0.8993R 05 567 0.6746 48.1009 0.9743 424 0.9157R 06 432 0.6349 30.3691 0.9350 353 0.8995Rt 01 223 0.8575 123.9533 0.9645 127 1.6008Rt 02 214 0.7469 83.2271 0.9316 128 1.9726Rt 03 207 0.7208 78.6409 0.9465 98 2.0454Rt 04 181 0.7359 60.2092 0.9358 102 1.6749Rt 05 197 0.6917 87.0541 0.9469 73 2.5959Ru 01 422 0.6538 36.1404 0.9604 316 0.9437Ru 02 1240 0.7713 138.5450 0.9915 946 0.8251
(continued)
HAPAX LEGOMENA AND LANGUAGE TYPOLOGY 375
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
8/10
(real or virtual) between the fitting Zipf curve and the hapax legomena
level f(r) 1.Finally, it is worthwhile mentioning that the data of Table 1 reveal a
good linear dependence between the hapax legomena length HLand the
vocabulary V, namely HL 0.7256*V718.6979, as it is further
illustrated in Figure 2. Hence, replacing HL in (1), we get a goodapproximation for the analytism indicator in the form
A 2ac
1:2744V18:6979a 2
An independent measure of analytism/synthetism using e.g. the
Greenberg-Krupa indices (Greenberg, 1960; Krupa, 1965) could tell us
whether the purely morphological definition of this property corresponds
to our purely textual measure which can be won mechanically withoutanalysing each word of a text with regard to its morphological structure.
Table 1. (Continued).
ID V a c R
2
HL
c
VHL=2a
Ru 03 1792 0.7106 158.2659 0.9620 1365 1.0851Ru 04 2536 0.7181 234.3457 0.9571 1850 1.1661Ru 05 6073 0.7826 775.3826 0.9807 4395 1.2063Sl 01 457 0.7467 44.1840 0.9760 364 0.6665Sl 02 603 0.6846 68.9001 0.9823 423 1.1571Sl 03 907 0.7685 115.2402 0.9604 651 0.8651Sl 04 1102 0.9187 334.8100 0.9912 701 0.7633Sl 05 2223 0.7232 240.2785 0.9490 1593 1.2572Sm 01 267 0.8285 177.1858 0.9678 119 2.1315
Sm 02 222 0.7752 123.5355 0.9450 96 2.2641Sm 03 140 0.6858 58.1896 0.8708 75 2.4320Sm 04 153 0.7925 89.0771 0.9563 76 2.0738Sm 05 124 0.7161 46.3093 0.9263 66 1.8312T 01 611 0.7624 120.0367 0.8817 465 1.2995T 02 720 0.7803 144.5780 0.8685 540 1.2297T 03 645 0.7652 167.7334 0.8923 447 1.6447
B, Bulgarian; Cz, Czech; E, English; G, German; H, Hungarian; Hw, Hawaiian; I, Italian;In, Indonesian; Kn, Kannada; Lk, Lakota,; Lt, Latin; M, Maori; Mq, Marquesan; Mr,Marathi; R, Romanian; Rt, Rarotongan; Ru, Russian; Sl, Slovenian; Sm, Samoan; T,
Tagalog.
376 I.-I. POPESCU & G. ALTMANN
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
9/10
-
8/10/2019 Hapaxlegomena Tipo Log i a Lengua Je
10/10
Needless to say, one can perform the same procedure using the Zipf-
Mandelbrot function or a number of other hyperbolic functions.
We adhere to Occams razor and restrict ourselves to the original Zipffunction which yielding this morphological distinctiveness without
touching morphology gets a secondary important corroboration.
REFERENCES
Greenberg, J. H. (1960). A quantitative approach to the morphological typology oflanguages. International Journal of American Linguistics, 26, 178194.
Ko hler, R. (1986). Zur linguistischen Synergetik. Struktur und Dynamik der Lexik.Bochum: Brockmeyer.
Ko hler, R. (ed.) (2005). Synergetic linguistics. In R. Ko hler, G. Altmann & R. G.Piotrowski (Eds), Quantitative Linguistics. An International Handbook (pp. 760775). Berlin: de Gruyter.
Krupa, V. (1965). On quantification of typology. Linguistics, 12, 3136.Popescu, I.-I. (2007). Text ranking by the weight of highly frequent words. In P. Grzybek
& R. Ko hler (Eds), Exact Methods in the Study of Language and Text (pp. 555565). Berlin/New York: Mouton de Gruyter.
Popescu, I.-I., Vidya, M. N., Uhlrova , L., Pustet, R., Mehler, A., Macutek, J., Krupa,V., Ko hler, R., Jayaram, B. D., Grzybek, P., & Altmann, G. (2008). Word
Frequency Studies. Berlin: Mouton de Gruyter (in press).Skalicka, V. (20042006). Souborne d lo IIII (edited by F. C erma k et al.). Praha:
Nakladatelstv Karolinum. [The work contains the Czech translations.]
378 I.-I. POPESCU & G. ALTMANN
Dow
nloa
de
dby
[Corporac
ion
CINCEL]a
t07:0
803May
2012