hapax legomena and language typology

NOTA IMPORTANTE:

El documento que usted ha recibido es slo para fines acadmicos de Docencia e Investigacin.

No se autoriza su venta ni su reproduccin masiva.

UNIVERSIDAD DE CONCEPCIN

SISTEMA DE BIBLIOTECAS SERVICIO BIBLIOGRFICO.

This article was downloaded by: [Universidad De Concepcion]On: 03 May 2013, At: 12:48Publisher: RoutledgeInforma Ltd Registered in England and Wales Registered Number: 1072954Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Quantitative LinguisticsPublication details, including instructions for authors andsubscription information:http://www.tandfonline.com/loi/njql20

Hapax Legomena and LanguageTypologyIoan-Iovitz Popescu a & Gabriel Altmann ba Bucharest Universityb LdenscheidPublished online: 07 Oct 2008.

To cite this article: Ioan-Iovitz Popescu & Gabriel Altmann (2008): Hapax Legomena andLanguage Typology, Journal of Quantitative Linguistics, 15:4, 370-378

To link to this article: http://dx.doi.org/10.1080/09296170802326699

PLEASE SCROLL DOWN FOR ARTICLE

Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions

This article may be used for research, teaching, and private study purposes.Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expresslyforbidden.

The publisher does not give any warranty express or implied or make anyrepresentation that the contents will be complete or accurate or up to date. Theaccuracy of any instructions, formulae, and drug doses should be independentlyverified with primary sources. The publisher shall not be liable for any loss,actions, claims, proceedings, demand, or costs or damages whatsoever orhowsoever caused arising directly or indirectly in connection with or arising outof the use of this material.

Hapax Legomena and Language Typology*

Ioan-Iovitz Popescu1 and Gabriel Altmann21Bucharest University; 2Ludenscheid

ABSTRACT

Counting word forms one obtains more hapax legomena in highly synthetic languagesthan in highly analytic ones. We propose an index of analytism based exclusively onmechanical counting of word forms in text.

Since the seminal work of V. Skalicka (see esp. 20042006) languagetypology is not a classication of languages any more. Though manylinguists are still engaged in describing, collecting and locating isolatedlanguage phenomena, typology can nowadays be seen as the study ofrelationships between language properties. In general, the relationshipsare the stronger the more contingent two phenomena are but it can beexpected that even distant phenomena like phonemics and syntaxdisplay some dependences. The consequent continuation of this kind oftypology has become language synergetics (cf. Kohler, 1986, 2005), whichtries to express all relationships formally and derive them from acommon self-regulation scheme. Needless to say, all relationships inlanguage are stochastic and the resulting equations are either nonlinearregression functions or probability distributions. Many of them holdonly for averages.However, up to now, textual phenomena have not been exploited for

nding any typological relationships because the study of texts in manylanguages at once is not as easy as using ready-made grammar textbooksfrom which elaborated phenomena can be selected. Perhaps the rst

*Address correspondence to: Gabriel Altmann, Stuttinghauser Ringstr. 44, 58515Ludenscheid. E-mail: [email protected]

Journal of Quantitative Linguistics2008, Volume 15, Number 4, pp. 370378DOI: 10.1080/09296170802326699

0929-6174/08/15040370 2008 Taylor & Francis

Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

attempt to use textual phenomena is made in Popescu et al. (2008), wherea relationship between a function of the h-point (cf. Popescu, 2007) of therank-frequency distribution of word forms and the degree of synthetismof language has been found. The study was performed in texts of 20languages and the relationship does not depend on text length.In the present contribution we shall try to show a relationship

between hapax legomena of texts and the synthetism/analytism oflanguage. Logically, if a language is highly synthetic, whatever the textlength, not all forms of each word will occur several times (i.e. morethan once). A great number of forms will occur only once, thus formingthe stock of hapax legomena occupying the last HL ranks. On thecontrary, in highly analytic languages the number of forms is small,hence all words have a greater chance of being repeated. The situationwould be quite dierent if we analysed lemmatized texts, in which nomorphology is contained.This result (showing only the bottom part of the data) simply shows

that in highly synthetic languages with a long hapax legomena sequencethe theoretical function rather underestimates the frequencies trying tocapture the steep decrease in frequencies.In the case of highly analytic languages (Figure 1c) the theoretical

function overestimates the empirical frequencies (not only of hapaxlegomena).In the present work we try to give to these considerations a

quantitative expression, as illustrated in Figures 1a, b, c. For thispurpose, let us denote V the vocabulary of word forms and HL thenumber of ranks at which hapax legomena of the rank-frequencydistribution occur, and let us suppose that the rank-frequency distribu-tion follows Zipfs law. The assumption of the validity of Zipfs law hasbeen corroborated on innumerable data sets in many languages, so thatsporadic exceptions could be captured by one of the many general-izations of this law. For the sake of simplicity we shall use the Zipffunction f(r) c/ra in which a and c are tting parameters, thus gettingrid of the necessity of normalization and truncation at the right-handside.Now, if we t this function to rank-frequency data, we may expect that

with a good tting the function achieves f(r) 1 (i.e. the level of hapaxlegomena) exactly in the middle of hapax legomena (i.e. at rV HL/2),as shown in Figure 1a for a Bulgarian text. But iterative tting of the Zipffunction may yield dierent values. Therefore, in the following we will

HAPAX LEGOMENA AND LANGUAGE TYPOLOGY 371

Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

Fig. 1b. Over-Zipan location of hapax legomena in a highly synthetic language (hereHungarian).

Fig. 1a. Zipan location of hapax legomena in a balanced language (here Bulgarian).

372 I.-I. POPESCU & G. ALTMANN

Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

take the above central location as a reference for any other tting andintroduce the indicator

A cVHL=2a 1

as an expression of analytism of a language. The interval of this indicatoris not yet known, but the relationship to analytism/synthetism is evident:the greater A, the stronger is the analytism. In order to check ourhypothesis we used the texts analysed in Popescu et al. (2008) andobtained A-values presented in Table 1 for 100 texts from 20 languagesand in Table 2 for the corresponding language averages. The individualextreme cases of Table 1 are illustrated in Figure 1b for a highly syntheticHungarian text and in Figure 1c for a highly analytic Hawaiian text.Generally, the Zipf tting curve is displaced rightwards for analyticlanguages and leftwards for synthetic languages with respect to the realrank-frequency distribution. The displacement depends on the degree ofanalytism/synthetism. Clearly, there is a bi-univocal and reciprocalcorrespondence between the analytism indicator A and its crossing point

Fig. 1c. Under-Zipan location of hapax legomena in a highly analytic language (hereHawaiian).


Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

Table 1. Zipfs function f(r) c/ra tting to data of 100 texts from 20 languages.

ID V a c R2 HLc

VHL=2a

B 01 400 0.6850 41.8602 0.9837 298 0.9507B 02 201 0.5704 17.6950 0.8705 153 1.1292B 03 285 0.5550 20.9975 0.8790 212 1.1798B 04 286 0.6169 23.6917 0.9619 222 0.9790B 05 238 0.6202 22.0499 0.9367 187 1.0090Cz 01 638 0.7473 54.2844 0.9764 517 0.6416Cz 02 543 0.7169 51.9648 0.9767 412 0.8013Cz 03 1274 0.8028 175.4805 0.9832 964 0.8261Cz 04 323 0.6228 23.3822 0.9537 241 0.8562Cz 05 556 0.8722 77.1944 0.9715 445 0.4864E 01 939 0.7657 145.9980 0.9620 662 1.0783E 02 1017 0.7434 180.1325 0.9661 735 1.4610E 03 1001 0.8179 254.7482 0.9752 620 1.2123E 04 1232 0.8712 385.9532 0.9870 693 1.0449E 05 1495 0.8009 319.1386 0.9822 971 1.2529E 07 1597 0.7568 300.1258 0.9347 1075 1.5416E 13 1659 0.8034 811.1689 0.9800 736 2.5688G 05 332 0.6935 32.8211 0.9646 250 0.8129G 09 379 0.6523 32.5565 0.9626 302 0.9431G 10 301 0.6053 21.8114 0.9402 237 0.9331G 11 297 0.5895 19.9677 0.9593 232 0.9320G 12 169 0.6062 14.3627 0.9514 141 0.8888G 14 129 0.5755 10.8110 0.9349 107 0.8977G 17 124 0.5515 13.1021 0.9349 84 1.1531H 01 1079 1.2268 214.2708 0.9600 844 0.0749H 02 789 1.1865 122.0057 0.9365 638 0.0824H 03 291 1.2114 44.9653 0.8864 259 0.0950H 04 609 0.9549 74.8581 0.9451 509 0.2753H 05 290 0.8168 30.9795 0.9093 250 0.4784Hw 03 521 0.7932 329.6012 0.9489 255 2.8821Hw 04 744 0.7633 678.1305 0.9154 347 5.3384Hw 05 680 0.7267 592.6243 0.8742 302 6.2199Hw 06 1039 0.7816 1081.7823 0.9352 500 5.8855I 01 3667 0.7266 509.5979 0.9336 2514 1.7784I 02 2203 0.7488 305.6487 0.9559 1604 1.3468I 03 483 0.7895 56.8099 0.9523 382 0.6427I 04 1237 0.7014 153.3448 0.9385 848 1.3948I 05 512 0.6524 54.5840 0.9293 355 1.2306In 01 221 0.5809 18.2346 0.9486 166 1.0420In 02 209 0.5915 19.1717 0.9583 147 1.0509In 03 194 0.5417 15.6229 0.9565 130 1.1233In 04 213 0.4877 11.9156 0.9574 145 1.0683

(continued )


Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

Table 1. (Continued ).

ID V a c R2 HLc

VHL=2a

In 05 188 0.5374 19.4218 0.8843 121 1.4347Kn 003 1833 0.6072 66.4545 0.9775 1373 0.9223Kn 004 720 0.5237 22.1001 0.9699 564 0.9144Kn 005 2477 0.6621 124.5588 0.9105 1784 0.9480Kn 006 2433 0.5809 95.9573 0.9522 1655 1.3181Kn 011 2516 0.5786 77.0267 0.9666 1873 1.0862Lk 01 174 0.6416 23.4838 0.9348 127 1.1474Lk 02 479 0.7731 139.2126 0.9510 302 1.5798Lk 03 272 0.7512 71.8668 0.9527 174 1.4240Lk 04 116 0.6792 18.7509 0.9801 80 0.9901Lt 01 2211 0.7935 109.3668 0.9078 1792 0.3666Lt 02 2334 0.8047 160.3530 0.9335 1878 0.4729Lt 03 2703 0.6366 109.5291 0.9832 2049 0.9695Lt 04 1910 0.6505 129.2023 0.9463 1359 1.2627Lt 05 909 0.5877 34.1056 0.9713 737 0.8449Lt 06 609 0.5293 19.3370 0.9325 521 0.8726M 01 398 0.7680 185.4091 0.9225 202 2.3386M 02 277 0.8197 123.4636 0.9693 146 1.5787M 03 277 0.7902 147.8281 0.9557 133 2.1571M 04 326 0.8353 137.7184 0.9763 192 1.4664M 05 514 0.7484 297.2460 0.9306 239 3.3897Mq 01 289 0.8030 240.0615 0.9588 91 2.9102Mq 02 150 0.7440 46.4870 0.9655 86 1.4370Mq 03 301 0.9795 225.2046 0.9856 138 1.0853Mr 001 1555 0.6293 78.3965 0.9815 1128 1.0210Mr 018 1788 0.6685 128.5531 0.9863 1249 1.1470Mr 026 2038 0.6224 101.6971 0.9633 1486 1.1758Mr 027 1400 0.6166 120.0829 0.9456 846 1.7214Mr 288 2079 0.6304 100.2890 0.9683 1534 1.0857R 01 843 0.6720 73.6423 0.9571 606 1.0739R 02 1179 0.7567 115.8007 0.9802 908 0.7930R 03 719 0.7175 60.8094 0.9778 567 0.7771R 04 729 0.6673 52.4236 0.9798 573 0.8993R 05 567 0.6746 48.1009 0.9743 424 0.9157R 06 432 0.6349 30.3691 0.9350 353 0.8995Rt 01 223 0.8575 123.9533 0.9645 127 1.6008Rt 02 214 0.7469 83.2271 0.9316 128 1.9726Rt 03 207 0.7208 78.6409 0.9465 98 2.0454Rt 04 181 0.7359 60.2092 0.9358 102 1.6749Rt 05 197 0.6917 87.0541 0.9469 73 2.5959Ru 01 422 0.6538 36.1404 0.9604 316 0.9437Ru 02 1240 0.7713 138.5450 0.9915 946 0.8251

(continued )


Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

(real or virtual) between the tting Zipf curve and the hapax legomenalevel f(r) 1.Finally, it is worthwhile mentioning that the data of Table 1 reveal a

good linear dependence between the hapax legomena length HL and thevocabulary V, namely HL 0.7256*V718.6979, as it is furtherillustrated in Figure 2. Hence, replacing HL in (1), we get a goodapproximation for the analytism indicator in the form

A 2ac

1:2744V 18:6979a 2

An independent measure of analytism/synthetism using e.g. theGreenberg-Krupa indices (Greenberg, 1960; Krupa, 1965) could tell uswhether the purely morphological denition of this property correspondsto our purely textual measure which can be won mechanically withoutanalysing each word of a text with regard to its morphological structure.

Table 1. (Continued ).

ID V a c R2 HLc

VHL=2a

Ru 03 1792 0.7106 158.2659 0.9620 1365 1.0851Ru 04 2536 0.7181 234.3457 0.9571 1850 1.1661Ru 05 6073 0.7826 775.3826 0.9807 4395 1.2063Sl 01 457 0.7467 44.1840 0.9760 364 0.6665Sl 02 603 0.6846 68.9001 0.9823 423 1.1571Sl 03 907 0.7685 115.2402 0.9604 651 0.8651Sl 04 1102 0.9187 334.8100 0.9912 701 0.7633Sl 05 2223 0.7232 240.2785 0.9490 1593 1.2572Sm 01 267 0.8285 177.1858 0.9678 119 2.1315Sm 02 222 0.7752 123.5355 0.9450 96 2.2641Sm 03 140 0.6858 58.1896 0.8708 75 2.4320Sm 04 153 0.7925 89.0771 0.9563 76 2.0738Sm 05 124 0.7161 46.3093 0.9263 66 1.8312T 01 611 0.7624 120.0367 0.8817 465 1.2995T 02 720 0.7803 144.5780 0.8685 540 1.2297T 03 645 0.7652 167.7334 0.8923 447 1.6447

B, Bulgarian; Cz, Czech; E, English; G, German; H, Hungarian; Hw, Hawaiian; I, Italian;In, Indonesian; Kn, Kannada; Lk, Lakota,; Lt, Latin; M, Maori; Mq, Marquesan; Mr,Marathi; R, Romanian; Rt, Rarotongan; Ru, Russian; Sl, Slovenian; Sm, Samoan; T,Tagalog.


Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

Table 2. Mean analytism indicator A of 20 languages.

Language Mean A Number of texts

Hungarian 0.2012 5Czech 0.7223 5Latin 0.7982 6Romanian 0.8931 6German 0.9372 7Slovenian 0.9418 5Kannada 1.0378 5Russian 1.0453 5Bulgarian 1.0495 5Indonesian 1.1438 5Marathi 1.2302 5Italian 1.2787 5Lakota 1.2853 4Tagalog 1.3913 3English 1.4514 7Marquesan 1.8108 3Rarotongan 1.9779 5Samoan 2.1465 5Maori 2.1861 5Hawaiian 5.0815 4

Fig. 2. Illustrating the linear dependence between the hapax legomena length HL and thevocabulary V.


Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

Needless to say, one can perform the same procedure using the Zipf-Mandelbrot function or a number of other hyperbolic functions.We adhere to Occams razor and restrict ourselves to the original Zipffunction which yielding this morphological distinctiveness withouttouching morphology gets a secondary important corroboration.

REFERENCES

Greenberg, J. H. (1960). A quantitative approach to the morphological typology oflanguages. International Journal of American Linguistics, 26, 178194.

Kohler, R. (1986). Zur linguistischen Synergetik. Struktur und Dynamik der Lexik.Bochum: Brockmeyer.

Kohler, R. (ed.) (2005). Synergetic linguistics. In R. Kohler, G. Altmann & R. G.Piotrowski (Eds), Quantitative Linguistics. An International Handbook (pp. 760775). Berlin: de Gruyter.

Krupa, V. (1965). On quantication of typology. Linguistics, 12, 3136.Popescu, I.-I. (2007). Text ranking by the weight of highly frequent words. In P. Grzybek

& R. Kohler (Eds), Exact Methods in the Study of Language and Text (pp. 555565). Berlin/New York: Mouton de Gruyter.

Popescu, I.-I., Vidya, M. N., Uhlrova, L., Pustet, R., Mehler, A., Macutek, J., Krupa,V., Kohler, R., Jayaram, B. D., Grzybek, P., & Altmann, G. (2008). WordFrequency Studies. Berlin: Mouton de Gruyter (in press).

Skalicka, V. (20042006). Souborne dlo IIII (edited by F. Cermak et al.). Praha:Nakladatelstv Karolinum. [The work contains the Czech translations.]


Dow

nloa

ded

by [U

nivers

idad D

e Con

cepc

ion] a

t 12:4

8 03 M

ay 20

13

hapax legomena and language typology

Documents