
Journal of Quantitative Linguistics

Publisher: Routledge

Available online: 07 Oct 2008

To cite this article: Ioan-Iovitz Popescu & Gabriel Altmann (2008): Hapax Legomena and Language Typology, Journal of Quantitative Linguistics, 15:4, 370-378

To link to this article: http://dx.doi.org/10.1080/09296170802326699


Hapax Legomena and Language Typology*

Ioan-Iovitz Popescu¹ and Gabriel Altmann²

¹Bucharest University; ²Lüdenscheid

ABSTRACT

Counting word forms, one obtains more hapax legomena in highly synthetic languages than in highly analytic ones. We propose an index of analytism based exclusively on mechanical counting of word forms in text.

Since the seminal work of V. Skalička (see esp. 2004–2006), language typology is no longer a classification of languages. Though many linguists are still engaged in describing, collecting and locating isolated language phenomena, typology can nowadays be seen as the study of relationships between language properties. In general, the more contingent two phenomena are, the stronger the relationship between them, but it can be expected that even distant phenomena like phonemics and syntax display some dependences. The consistent continuation of this kind of typology has become language synergetics (cf. Köhler, 1986, 2005), which tries to express all relationships formally and derive them from a common self-regulation scheme. Needless to say, all relationships in language are stochastic and the resulting equations are either nonlinear regression functions or probability distributions. Many of them hold only for averages.

However, up to now, textual phenomena have not been exploited for finding any typological relationships because the study of texts in many languages at once is not as easy as using ready-made grammar textbooks from which elaborated phenomena can be selected. Perhaps the first attempt to use textual phenomena is made in Popescu et al. (2008), where a relationship between a function of the h-point (cf. Popescu, 2007) of the rank-frequency distribution of word forms and the degree of synthetism of language has been found. The study was performed in texts of 20 languages and the relationship does not depend on text length.

*Address correspondence to: Gabriel Altmann, Stüttinghauser Ringstr. 44, 58515 Lüdenscheid. E-mail: [email protected]

Journal of Quantitative Linguistics 2008, Volume 15, Number 4, pp. 370–378. DOI: 10.1080/09296170802326699

0929-6174/08/15040370 © 2008 Taylor & Francis

In the present contribution we shall try to show a relationship between hapax legomena of texts and the synthetism/analytism of language. Logically, if a language is highly synthetic, then, whatever the text length, not all forms of each word will occur several times (i.e. more than once). A great number of forms will occur only once, thus forming the stock of hapax legomena occupying the last HL ranks. On the contrary, in highly analytic languages the number of forms is small, hence all words have a greater chance of being repeated. The situation would be quite different if we analysed lemmatized texts, in which no morphology is contained.
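The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the tokenizer (lowercasing, treating runs of letters and digits as word forms) and the function names `word_form_frequencies`, `hapax_stats` and `rank_frequency` are assumptions made for the example.

```python
import re
from collections import Counter


def word_form_frequencies(text):
    """Count word forms (not lemmas) in a raw text.

    Assumed tokenizer: lowercase the text and treat runs of letters or
    digits, optionally joined by apostrophes, as word forms.
    """
    tokens = re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())
    return Counter(tokens)


def hapax_stats(text):
    """Return (N, V, HL): tokens, distinct word forms, hapax legomena."""
    freq = word_form_frequencies(text)
    n_tokens = sum(freq.values())
    v = len(freq)
    hl = sum(1 for count in freq.values() if count == 1)
    return n_tokens, v, hl


def rank_frequency(text):
    """Frequencies in descending order; the hapax legomena fill the last HL ranks."""
    return sorted(word_form_frequencies(text).values(), reverse=True)


sample = "the cat saw the dog and the dog saw a bird"
n_tokens, v, hl = hapax_stats(sample)
print(n_tokens, v, hl)
print(rank_frequency(sample))
```

Because inflected forms are counted separately, a text such as "walk walks walked walk" yields three word forms, two of them hapax legomena, whereas a lemmatized version would yield a single repeated lemma.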

This result (Figure 1b, showing only the bottom part of the data) simply shows that in highly synthetic languages with a long hapax legomena sequence the theoretical function rather underestimates the frequencies while trying to capture the steep decrease in frequencies.

In the case of highly analytic languages (Figure 1c) the theoretical function overestimates the empirical frequencies (not only of hapax legomena).

In the present work we try to give these considerations a quantitative expression, as illustrated in Figures 1a, b, c. For this purpose, let us denote by V the size of the vocabulary of word forms and by HL the number of ranks at which hapax legomena of the rank-frequency distribution occur, and let us suppose that the rank-frequency distribution follows Zipf's law. The assumption of the validity of Zipf's law has been corroborated on innumerable data sets in many languages, so that sporadic exceptions could be captured by one of the many generalizations of this law. For the sake of simplicity we shall use the Zipf function $f(r) = c/r^a$, in which a and c are fitting parameters, thus getting rid of the necessity of normalization and truncation at the right-hand side.
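To make the role of the two parameters concrete, the following Python sketch estimates a and c from a list of rank-ordered frequencies. Note the swapped-in technique: it uses ordinary least squares on the log-log form $\log f = \log c - a \log r$, which is an assumption made to keep the example dependency-free; the article itself relies on iterative fitting of the Zipf function, which generally gives different estimates. The function names are hypothetical.

```python
import math


def fit_zipf_loglog(frequencies):
    """Estimate (a, c) in f(r) = c / r**a by least squares on log-log data.

    `frequencies` must be positive and ordered by rank (rank 1 first).
    This is a simple stand-in for iterative nonlinear fitting.
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = -slope                              # exponent of the Zipf function
    c = math.exp(mean_y - slope * mean_x)   # fitted frequency at rank 1
    return a, c


def crossing_rank(a, c):
    """Rank at which the fitted curve reaches f(r) = 1, i.e. r = c**(1/a)."""
    return c ** (1.0 / a)


# Synthetic frequencies that follow f(r) = 100 / r**0.8 exactly.
synthetic = [100.0 / r ** 0.8 for r in range(1, 51)]
a_hat, c_hat = fit_zipf_loglog(synthetic)
print(a_hat, c_hat, crossing_rank(a_hat, c_hat))
```

On real rank-frequency data the long tail of hapax legomena weighs heavily in a log-log regression, so the estimates should be treated as a rough starting point only.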

Fig. 1a. Zipfian location of hapax legomena in a balanced language (here Bulgarian).

Fig. 1b. Over-Zipfian location of hapax legomena in a highly synthetic language (here Hungarian).

Now, if we fit this function to rank-frequency data, we may expect that with a good fitting the function achieves $f(r) = 1$ (i.e. the level of hapax legomena) exactly in the middle of the hapax legomena (i.e. at $r = V - HL/2$), as shown in Figure 1a for a Bulgarian text. But iterative fitting of the Zipf function may yield different values. Therefore, in the following we will take the above central location as a reference for any other fitting and introduce the indicator

$$A = \frac{c}{(V - HL/2)^a} \qquad (1)$$

as an expression of analytism of a language. The interval of this indicator is not yet known, but the relationship to analytism/synthetism is evident: the greater A, the stronger the analytism. In order to check our hypothesis we used the texts analysed in Popescu et al. (2008) and obtained the A-values presented in Table 1 for 100 texts from 20 languages and in Table 2 for the corresponding language averages. The individual extreme cases of Table 1 are illustrated in Figure 1b for a highly synthetic Hungarian text and in Figure 1c for a highly analytic Hawaiian text. Generally, the Zipf fitting curve is displaced rightwards for analytic languages and leftwards for synthetic languages with respect to the real rank-frequency distribution. The displacement depends on the degree of analytism/synthetism. Clearly, there is a bi-univocal and reciprocal correspondence between the analytism indicator A and the crossing point (real or virtual) between the fitting Zipf curve and the hapax legomena level $f(r) = 1$.

Fig. 1c. Under-Zipfian location of hapax legomena in a highly analytic language (here Hawaiian).

Finally, it is worthwhile mentioning that the data of Table 1 reveal a good linear dependence between the hapax legomena length HL and the vocabulary V, namely $HL = 0.7256\,V - 18.6979$, as is further illustrated in Figure 2. Hence, replacing HL in (1), so that $V - HL/2 = (1.2744\,V + 18.6979)/2$, we get a good approximation for the analytism indicator in the form

$$A = \frac{2^a c}{(1.2744\,V + 18.6979)^a} \qquad (2)$$

An independent measure of analytism/synthetism using e.g. the Greenberg-Krupa indices (Greenberg, 1960; Krupa, 1965) could tell us whether the purely morphological definition of this property corresponds to our purely textual measure, which can be obtained mechanically without analysing each word of a text with regard to its morphological structure.

Table 1. Zipf's function $f(r) = c/r^a$ fitted to data of 100 texts from 20 languages. The last column is the indicator $A = c/(V - HL/2)^a$.

ID         V       a          c      R²    HL       A
B 01     400  0.6850    41.8602  0.9837   298  0.9507
B 02     201  0.5704    17.6950  0.8705   153  1.1292
B 03     285  0.5550    20.9975  0.8790   212  1.1798
B 04     286  0.6169    23.6917  0.9619   222  0.9790
B 05     238  0.6202    22.0499  0.9367   187  1.0090
Cz 01    638  0.7473    54.2844  0.9764   517  0.6416
Cz 02    543  0.7169    51.9648  0.9767   412  0.8013
Cz 03   1274  0.8028   175.4805  0.9832   964  0.8261
Cz 04    323  0.6228    23.3822  0.9537   241  0.8562
Cz 05    556  0.8722    77.1944  0.9715   445  0.4864
E 01     939  0.7657   145.9980  0.9620   662  1.0783
E 02    1017  0.7434   180.1325  0.9661   735  1.4610
E 03    1001  0.8179   254.7482  0.9752   620  1.2123
E 04    1232  0.8712   385.9532  0.9870   693  1.0449
E 05    1495  0.8009   319.1386  0.9822   971  1.2529
E 07    1597  0.7568   300.1258  0.9347  1075  1.5416
E 13    1659  0.8034   811.1689  0.9800   736  2.5688
G 05     332  0.6935    32.8211  0.9646   250  0.8129
G 09     379  0.6523    32.5565  0.9626   302  0.9431
G 10     301  0.6053    21.8114  0.9402   237  0.9331
G 11     297  0.5895    19.9677  0.9593   232  0.9320
G 12     169  0.6062    14.3627  0.9514   141  0.8888
G 14     129  0.5755    10.8110  0.9349   107  0.8977
G 17     124  0.5515    13.1021  0.9349    84  1.1531
H 01    1079  1.2268   214.2708  0.9600   844  0.0749
H 02     789  1.1865   122.0057  0.9365   638  0.0824
H 03     291  1.2114    44.9653  0.8864   259  0.0950
H 04     609  0.9549    74.8581  0.9451   509  0.2753
H 05     290  0.8168    30.9795  0.9093   250  0.4784
Hw 03    521  0.7932   329.6012  0.9489   255  2.8821
Hw 04    744  0.7633   678.1305  0.9154   347  5.3384
Hw 05    680  0.7267   592.6243  0.8742   302  6.2199
Hw 06   1039  0.7816  1081.7823  0.9352   500  5.8855
I 01    3667  0.7266   509.5979  0.9336  2514  1.7784
I 02    2203  0.7488   305.6487  0.9559  1604  1.3468
I 03     483  0.7895    56.8099  0.9523   382  0.6427
I 04    1237  0.7014   153.3448  0.9385   848  1.3948
I 05     512  0.6524    54.5840  0.9293   355  1.2306
In 01    221  0.5809    18.2346  0.9486   166  1.0420
In 02    209  0.5915    19.1717  0.9583   147  1.0509
In 03    194  0.5417    15.6229  0.9565   130  1.1233
In 04    213  0.4877    11.9156  0.9574   145  1.0683
In 05    188  0.5374    19.4218  0.8843   121  1.4347
Kn 003  1833  0.6072    66.4545  0.9775  1373  0.9223
Kn 004   720  0.5237    22.1001  0.9699   564  0.9144
Kn 005  2477  0.6621   124.5588  0.9105  1784  0.9480
Kn 006  2433  0.5809    95.9573  0.9522  1655  1.3181
Kn 011  2516  0.5786    77.0267  0.9666  1873  1.0862
Lk 01    174  0.6416    23.4838  0.9348   127  1.1474
Lk 02    479  0.7731   139.2126  0.9510   302  1.5798
Lk 03    272  0.7512    71.8668  0.9527   174  1.4240
Lk 04    116  0.6792    18.7509  0.9801    80  0.9901
Lt 01   2211  0.7935   109.3668  0.9078  1792  0.3666
Lt 02   2334  0.8047   160.3530  0.9335  1878  0.4729
Lt 03   2703  0.6366   109.5291  0.9832  2049  0.9695
Lt 04   1910  0.6505   129.2023  0.9463  1359  1.2627
Lt 05    909  0.5877    34.1056  0.9713   737  0.8449
Lt 06    609  0.5293    19.3370  0.9325   521  0.8726
M 01     398  0.7680   185.4091  0.9225   202  2.3386
M 02     277  0.8197   123.4636  0.9693   146  1.5787
M 03     277  0.7902   147.8281  0.9557   133  2.1571
M 04     326  0.8353   137.7184  0.9763   192  1.4664
M 05     514  0.7484   297.2460  0.9306   239  3.3897
Mq 01    289  0.8030   240.0615  0.9588    91  2.9102
Mq 02    150  0.7440    46.4870  0.9655    86  1.4370
Mq 03    301  0.9795   225.2046  0.9856   138  1.0853
Mr 001  1555  0.6293    78.3965  0.9815  1128  1.0210
Mr 018  1788  0.6685   128.5531  0.9863  1249  1.1470
Mr 026  2038  0.6224   101.6971  0.9633  1486  1.1758
Mr 027  1400  0.6166   120.0829  0.9456   846  1.7214
Mr 288  2079  0.6304   100.2890  0.9683  1534  1.0857
R 01     843  0.6720    73.6423  0.9571   606  1.0739
R 02    1179  0.7567   115.8007  0.9802   908  0.7930
R 03     719  0.7175    60.8094  0.9778   567  0.7771
R 04     729  0.6673    52.4236  0.9798   573  0.8993
R 05     567  0.6746    48.1009  0.9743   424  0.9157
R 06     432  0.6349    30.3691  0.9350   353  0.8995
Rt 01    223  0.8575   123.9533  0.9645   127  1.6008
Rt 02    214  0.7469    83.2271  0.9316   128  1.9726
Rt 03    207  0.7208    78.6409  0.9465    98  2.0454
Rt 04    181  0.7359    60.2092  0.9358   102  1.6749
Rt 05    197  0.6917    87.0541  0.9469    73  2.5959
Ru 01    422  0.6538    36.1404  0.9604   316  0.9437
Ru 02   1240  0.7713   138.5450  0.9915   946  0.8251
Ru 03   1792  0.7106   158.2659  0.9620  1365  1.0851
Ru 04   2536  0.7181   234.3457  0.9571  1850  1.1661
Ru 05   6073  0.7826   775.3826  0.9807  4395  1.2063
Sl 01    457  0.7467    44.1840  0.9760   364  0.6665
Sl 02    603  0.6846    68.9001  0.9823   423  1.1571
Sl 03    907  0.7685   115.2402  0.9604   651  0.8651
Sl 04   1102  0.9187   334.8100  0.9912   701  0.7633
Sl 05   2223  0.7232   240.2785  0.9490  1593  1.2572
Sm 01    267  0.8285   177.1858  0.9678   119  2.1315
Sm 02    222  0.7752   123.5355  0.9450    96  2.2641
Sm 03    140  0.6858    58.1896  0.8708    75  2.4320
Sm 04    153  0.7925    89.0771  0.9563    76  2.0738
Sm 05    124  0.7161    46.3093  0.9263    66  1.8312
T 01     611  0.7624   120.0367  0.8817   465  1.2995
T 02     720  0.7803   144.5780  0.8685   540  1.2297
T 03     645  0.7652   167.7334  0.8923   447  1.6447

B, Bulgarian; Cz, Czech; E, English; G, German; H, Hungarian; Hw, Hawaiian; I, Italian; In, Indonesian; Kn, Kannada; Lk, Lakota; Lt, Latin; M, Maori; Mq, Marquesan; Mr, Marathi; R, Romanian; Rt, Rarotongan; Ru, Russian; Sl, Slovenian; Sm, Samoan; T, Tagalog.
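As a quick plausibility check of the last column of Table 1, the indicator can be recomputed from the tabulated V, HL, a and c. The sketch below is illustrative only: the function names are hypothetical, the three rows are copied from Table 1, and small deviations from the printed A-values are to be expected because a and c are given to four decimals.

```python
def analytism(V, HL, a, c):
    """Indicator (1): A = c / (V - HL/2)**a."""
    return c / (V - HL / 2) ** a


def analytism_approx(V, a, c):
    """Approximation (2), which substitutes HL = 0.7256*V - 18.6979 into (1)."""
    return (2 ** a) * c / (1.2744 * V + 18.6979) ** a


# (ID, V, a, c, HL, A as printed in Table 1)
ROWS = [
    ("B 01", 400, 0.6850, 41.8602, 298, 0.9507),
    ("H 01", 1079, 1.2268, 214.2708, 844, 0.0749),
    ("Hw 05", 680, 0.7267, 592.6243, 302, 6.2199),
]

for text_id, V, a, c, HL, printed in ROWS:
    exact = analytism(V, HL, a, c)
    approx = analytism_approx(V, a, c)
    print(f"{text_id}: A = {exact:.4f} (Table 1: {printed}), approximation (2) = {approx:.4f}")
```

The approximation depends only on V, a and c, so it can be used when HL has not been counted, at the cost of whatever error the linear relation between HL and V introduces for a given text.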



Needless to say, one can perform the same procedure using the Zipf-Mandelbrot function or a number of other hyperbolic functions. We adhere to Occam's razor and restrict ourselves to the original Zipf function, which, by yielding this morphological distinctiveness without touching morphology, gets a secondary important corroboration.

REFERENCES

Greenberg, J. H. (1960). A quantitative approach to the morphological typology of languages. International Journal of American Linguistics, 26, 178–194.

Köhler, R. (1986). Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bochum: Brockmeyer.

Köhler, R. (ed.) (2005). Synergetic linguistics. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds), Quantitative Linguistics. An International Handbook (pp. 760–775). Berlin: de Gruyter.

Krupa, V. (1965). On quantification of typology. Linguistics, 12, 31–36.

Popescu, I.-I. (2007). Text ranking by the weight of highly frequent words. In P. Grzybek & R. Köhler (Eds), Exact Methods in the Study of Language and Text (pp. 555–565). Berlin/New York: Mouton de Gruyter.

Popescu, I.-I., Vidya, M. N., Uhlířová, L., Pustet, R., Mehler, A., Mačutek, J., Krupa, V., Köhler, R., Jayaram, B. D., Grzybek, P., & Altmann, G. (2008). Word Frequency Studies. Berlin: Mouton de Gruyter (in press).

Skalička, V. (2004–2006). Souborné dílo I–III (edited by F. Čermák et al.). Praha: Nakladatelství Karolinum. [The work contains the Czech translations.]

