
On the Syllabic Similarities of Romance Languages

Anca Dinu1 and Liviu P. Dinu2

1 University of Bucharest, Faculty of Foreign Languages, 5-7 Edgar Quinet, 70106, Bucharest, Romania

anca [email protected]

2 University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, 70109, Bucharest, [email protected]

Abstract. In this paper we study the syllabic similarity between Romance languages via rank distance. The results confirm the linguistic theories, adding quantification and rigor.

1 The Syllabic Similarity of Romance Languages

The problem of classifying Romance languages has been intensely studied. Unfortunately, in many studies of this kind, the data for Romanian are incomplete or even missing (as happens, for example, in Ziegler, 2000). Here we study the "syllabic" similarity of Romance languages. The work corpus is formed by the representative vocabularies of the Romance languages (Latin, Romanian, Italian, Spanish, Catalan, French and Portuguese) (Sala, 1988). We syllabified the vocabularies. For each vocabulary we constructed a classification (ranking) of its syllables: in the first position we put the most frequent syllable of the vocabulary, in the second position the next most frequent syllable, and so on.
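The construction of such a classification can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be already syllabified (the paper does not describe the syllabification procedure), each vocabulary word is counted once, ties in frequency are broken alphabetically (a choice made here, not one stated in the paper), and the function name and toy data are hypothetical.

```python
from collections import Counter

def syllable_ranking(syllabified_words):
    """Return the syllable types ordered from most to least frequent.

    `syllabified_words` is an iterable of words, each given as a list of
    syllables. Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(s for word in syllabified_words for s in word)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [syllable for syllable, _ in ordered]

# Toy vocabulary (hypothetical data, not taken from the corpus).
toy_vocabulary = [["ca", "sa"], ["ma", "sa"], ["ca", "ma", "ra"], ["sa", "re"]]
ranking = syllable_ranking(toy_vocabulary)
print(ranking)  # ['sa', 'ca', 'ma', 'ra', 're']
```

Here "sa" occurs three times, "ca" and "ma" twice each, and "ra" and "re" once, so the ranking places "sa" first and resolves the two ties alphabetically.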

The method we applied in investigating the syllabic similarity of Romance languages is the following: each of the seven Romance languages is compared to the other six using rank distance (Dinu, 2003), and each comparison yields a graph. We apply the normalized rank distance to all pairs of such classifications and obtain a series of results which express the "syllabic" similarity between Romance languages. We also investigate the distances between partial classifications. Each graph represents the behavior of the function f∆(i) with i varying between 1 and 561 (the number of distinct syllables in Latin, which is the smallest among the seven vocabularies). The function f∆ expresses the variation of the normalized rank distance between two classifications (see Appendix).

We chose this method for the following reason: when a listener hears a language for the first time, it is hard to believe that he is able to distinguish syntactic constructions or even words. It is more plausible that he can distinguish and individualize syllables; because of this, he is able to say to which language (or family of languages) the language he hears is similar.

A. Gelbukh (Ed.): CICLing 2005, LNCS 3406, pp. 785–788, 2005. © Springer-Verlag Berlin Heidelberg 2005


Table 1. The number of syllables in the Romance languages

             Percentage covered by the first N syllables   No. of syllables
Language     100    200    300    400    500    561        types    tokens
Latin        72%    86%    92%    95%    98%    100%       561      3922
Romanian     63%    74%    80%    84%    87%    90%        1243     6591
Italian      75%    85%    91%    94%    96%    97%        803      7937
Portuguese   69%    84%    91%    95%    97%    98%        693      6152
Spanish      73%    87%    93%    96%    98%    99%        672      7477
Catalan      62%    77%    84%    88%    92%    93%        967      5624
French       48%    61%    67%    72%    76%    78%        1738     5691

In Table 1 we present the number of distinct syllables (types) and the total number of syllables (tokens) for every language analyzed. The frequencies of the syllables are not uniformly distributed in any of the languages. Table 1 shows that the syllables are distributed according to principles of the minimum-effort type (Zipf 1932, Herdan 1966): a relatively small number of distinct syllables covers a large part of the corpus. For most languages, the first 300 syllables (ordered by frequency) cover 80% or more of all the syllables of the corpus (over 90% for Latin, Italian, Portuguese and Spanish); French, at 67%, is the exception. Beyond this number, the percentage increases slowly.

The analysis of the graphs in Fig. 1 and Fig. 2, in which each graph represents the behavior of a given language compared to the other six, enables us to make some observations that would otherwise be inaccessible. If we look at the first 300 syllables, Romanian is closest to Italian, followed by Spanish, Portuguese and Catalan. We observe that the more syllables we consider, the further Italian is from Romanian, whereas Portuguese gets closer (indeed, when all the syllables are taken into consideration, Portuguese is the closest to Romanian). At the same time, if we look at the graph that shows the similarities between Italian and the other Romance languages, we observe that Romanian is the furthest; the closest Romance language to Italian is Spanish, situated at a very short distance. This is in accordance with the common observation that Romanians understand and learn Italian more quickly than Italians understand and learn Romanian.

Moreover, if we analyze the graphs, we observe that Romanian is almost always at the greatest distance from the other languages. This indicates that the evolution of Romanian in a space distant from the Latin nucleus led to bigger differences between Romanian and the rest of the Romance languages than the differences between any other two Romance languages (at least at the phonological level). Our study also reveals that the syllabic distances between Latin and each of the Romance languages analyzed have very close values. Obviously, our study could be further improved. It would be interesting to study whether these distances remain the same in the context of representative texts belonging to the Romance languages.
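The coverage percentages and the type and token counts reported in Table 1 could be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, the input is a flat list of syllable tokens, and the toy data are invented for illustration and do not come from the corpus.

```python
from collections import Counter

def coverage_of_top_k(syllable_tokens, k):
    """Fraction of all tokens accounted for by the k most frequent syllable types."""
    counts = Counter(syllable_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the list of syllable tokens is empty")
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total

# Toy data (hypothetical): 10 tokens, 5 types.
tokens = ["sa"] * 4 + ["ca"] * 2 + ["ma"] * 2 + ["ra", "re"]
n_types = len(set(tokens))
n_tokens = len(tokens)

for k in (1, 3, 5):
    print(k, f"{coverage_of_top_k(tokens, k):.0%}")
# 1 40%
# 3 80%
# 5 100%
```

Even in this tiny example the three most frequent types already cover 80% of the tokens, which is the kind of skew the table documents at a larger scale.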


Fig. 1. The similarity of Romance Languages

2 Appendix: Rank Distance

Fig. 2. The Latin - Romance languages similarity

In natural languages, in the framework of lexical units, the most important information is carried by the first part of the unit (Marcus, 1971). By analogy, the difference in the first positions between two classifications is more important than the difference in the last positions. This was the starting point in the construction of the rank distance (Dinu, 2003):

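Definition 1. The rank distance between two classifications L1 and L2 is:

$$\Delta(L_1, L_2) = \sum_{x \in L_1 \cap L_2} \left| \mathrm{ord}(x \mid L_1) - \mathrm{ord}(x \mid L_2) \right| + \sum_{x \in L_1 \setminus L_2} \mathrm{ord}(x \mid L_1) + \sum_{x \in L_2 \setminus L_1} \mathrm{ord}(x \mid L_2),$$

where ord(x|L) denotes the rank of the element x in the classification L.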
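As a small worked illustration of the definition, consider two hypothetical rankings of three syllables, L1 = (ta, re, la) and L2 = (re, ta, le), and assume that ord(x|L) is simply the 1-based position of x in L. This numbering is an assumption made for the example; the text above does not fix the convention. Under that assumption:

$$
\Delta(L_1, L_2)
= \underbrace{|1 - 2|}_{\text{ta}}
+ \underbrace{|2 - 1|}_{\text{re}}
+ \underbrace{3}_{\text{la} \,\in\, L_1 \setminus L_2}
+ \underbrace{3}_{\text{le} \,\in\, L_2 \setminus L_1}
= 8.
$$

The two shared syllables contribute only their displacement, while each unshared syllable contributes its full rank. If ranks were instead counted from the bottom of the list (the top element receiving the largest value), the same formula would weight disagreements at the top more heavily; which convention the authors used is not stated in this text.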

Let L1 and L2 be two classifications having the same length, n. For each i ∈ {1, 2, ..., n} we define the function f∆ by:

$$f_\Delta(i) \stackrel{\mathrm{def}}{=} \overline{\Delta}(L_1^i, L_2^i) = \frac{\Delta(L_1^i, L_2^i)}{i(i+1)},$$

where $\overline{\Delta}$ denotes the normalized rank distance, and $L_1^i$ and $L_2^i$ are the classifications of length i obtained from the previous classifications by deleting the elements below position i.
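A minimal Python sketch of the rank distance and of f∆ is given below. It is an illustration rather than the authors' implementation: the function names are hypothetical, the toy rankings are invented, and ord(x|L) is assumed to be the 1-based position of x in L.

```python
def rank_distance(l1, l2):
    """Rank distance between two rankings (lists of distinct items).

    Assumption of this sketch: ord(x|L) is the 1-based position of x in L.
    """
    pos1 = {x: i for i, x in enumerate(l1, start=1)}
    pos2 = {x: i for i, x in enumerate(l2, start=1)}
    # Items present in both rankings contribute their displacement.
    total = sum(abs(pos1[x] - pos2[x]) for x in pos1.keys() & pos2.keys())
    # Items present in only one ranking contribute their rank there.
    total += sum(r for x, r in pos1.items() if x not in pos2)
    total += sum(r for x, r in pos2.items() if x not in pos1)
    return total

def f_delta(l1, l2, i):
    """Normalized rank distance between the length-i prefixes of l1 and l2."""
    return rank_distance(l1[:i], l2[:i]) / (i * (i + 1))

def f_delta_curve(l1, l2, n=None):
    """Values f_delta(1), ..., f_delta(n); n defaults to the shorter length."""
    if n is None:
        n = min(len(l1), len(l2))
    return [f_delta(l1, l2, i) for i in range(1, n + 1)]

# Toy rankings (hypothetical data).
lang_a = ["ta", "re", "la"]
lang_b = ["re", "ta", "le"]
curve = f_delta_curve(lang_a, lang_b)
print(curve)
```

For two prefixes of length i with no element in common, the rank distance equals 2(1 + 2 + ... + i) = i(i+1), so dividing by i(i+1) keeps f∆ between 0 and 1. In the toy example the curve starts at 1 (the top syllables differ), drops to 1/3 at i = 2 (same two syllables, swapped) and rises to 2/3 at i = 3.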

References

1. Dinu, L.P. On the classification and aggregation of hierarchies with different constitutive elements. Fundamenta Informaticae, 55, 1, 39-50, 2003.

2. Herdan, G. The Advanced Theory of Language as Choice and Chance. Springer, New York, 1966.

3. Marcus, S. Linguistic structures and generative devices in molecular genetics. Cahiers Ling. Theor. Appl., 11, 77-104, 1974.

4. Sala, M. (coord.) Vocabularul reprezentativ al limbilor romanice. Bucuresti, 1982.

5. Ziegler, A. Word Length in Romance Languages. A Complemental Contribution. Journal of Quantitative Linguistics, 7, 1, 65-68, 2000.

6. Zipf, G.K. Selected Studies of the Principle of Relative Frequencies in Language. Cambridge, Mass., 1932.
