

A cross-linguistic quantitative study of homophony

Jinyun Ke (Dr.)
English Language Institute, University of Michigan

Abstract: Homophony is ubiquitous across languages. It is an important source of ambiguity, which is a distinctive feature of human language. There have been, however, few quantitative investigations of questions such as “do languages have similar degrees of homophony?” and “can the degree of homophony in a language be predicted?”. We report a preliminary attempt to answer these questions. We measure the degree of homophony in two sets of languages, one comprising twenty Chinese dialects and the other three Germanic languages. We find a strong correlation between the degree of homophony and the number of occurring syllable types (which can be taken as an estimate of the size of the phonological resource of a language), or the number of monosyllabic words in the lexicon. Furthermore, the distributional properties of homophony reflect some self-organizing characteristics of language as a system, as illustrated by two pieces of evidence: the first is the correlation between the degree of homophony and the degree of disyllabification in Chinese dialects, and the second is the observation from some languages that homophone pairs tend to belong to different grammatical classes, suggesting that language self-organizes in a way that decreases the chances of ambiguity.

1. Introduction

Human language is full of ambiguity, a ubiquitous phenomenon in which one form can be interpreted as having more than one meaning (i.e. a one-to-many mapping). Homophony is an important source of ambiguity. It refers to two or more words, called “homophones”, “having the same sound, but differing in meaning or derivation”, according to the Oxford English Dictionary (OED). Some examples of homophones are: “chu1 shi4”¹ for “出事” (“accident”), “出示” (“show”) and “出世” (“be born”) in Chinese; “sight”, “site” and “cite” in English; “père”, “pair” and “paire” in French; and “das” and “daß”, “man” and “Mann” in German. Extending the definition from words to forms in general, the English suffix /s/ (including its variants /z/ and /ɪz/) can also be considered homophonous, since the same form has two functions: one forms the plural of nouns, and the other forms the third person singular present tense of verbs.

1 In this paper, the pronunciations of Chinese characters are given in pinyin spelling, with the tone following the syllable.

Homophony is ubiquitous across languages. There are two main sources of homophones: sound change and language contact. Most homophones arise as the result of phonological merger, a type of sound change which is very common in languages. Words become homophonous once the phonetic distinction that kept them apart is lost. In English, for example, “meat” in Middle English was pronounced similarly to “mate” in modern English, but after the Great Vowel Shift it became homophonous with “meet” due to vowel raising, though the written forms still retain the distinction. Chinese is a classic instance in which numerous homophones have come from sound change. In modern Chinese dialects, especially in northern dialects where many mergers have occurred historically, we find many homophonous monosyllabic morphemes which were distinct in earlier stages. For example, in Mandarin, “急” (“anxious”), “疾” (“illness”) and “即” (“immediately”) are homophones with the same pronunciation (ji2) due to the loss of the final consonants “p, t, k” in monosyllables, while these morphemes remain distinct in southern dialects where such sound changes did not happen.

Lexical borrowing is the other main source of homophones. Languages are constantly in contact with other languages, which results in lexical borrowing to various extents. Very often the pronunciation of a foreign word does not fit the phonology of the borrowing language, and the word is adjusted accordingly by native speakers of that language. Occasionally the borrowed word may collide with an existing word, or even with a word borrowed earlier from another language, and homophones are thus created. For example, the two English words “sheik” and “chic” are homophones with the same pronunciation [ʃiːk], but they entered English at different times: according to the OED, the former was borrowed from the Arabic word “shaikh” in the sixteenth century, while the latter is a French loan word from the middle of the nineteenth century.

It has been a common belief that all languages have homophony to different degrees (Anttila, 1989: 184). However, so far there have been few attempts to examine the degree of homophony quantitatively, let alone to compare it across languages. There have been some lists of homophones, e.g., Higgins (1995) for English, and some case studies on the history of individual homophones or near-homophones, e.g., Bloomfield (1933: 396-398) and Malkiel (1979). However, there has been little discussion of the relation between homophony and other parts of the language system. Intuitively, if a language has a large phonological resource to exploit, i.e. a large number of sounds from which to make up words, the language may have fewer homophones. Such a testable hypothesis has been largely unexplored.

In this paper, we report a preliminary study addressing the above hypothesis. We measure the degree of homophony in two sets of available data: one includes twenty Chinese dialects and the other includes three Germanic languages. Section 2 introduces the measures and the results. Section 3 examines the relation between the degree of homophony and the size of the phonological resource, and Section 4 proposes two hypotheses to predict the degree of homophony and shows that the degree of monosyllabicity (i.e. the number of monosyllabic words in the lexicon) is a better indicator of the degree of homophony than the number of occurring syllable types.

While the emergence of homophony is considered unavoidable, the existence of homophony does not seem to be entirely random, but instead reflects some characteristics of self-organization in a language. In Section 5, we show two pieces of evidence in this regard. The first is the correlation between the degree of homophony and the degree of disyllabification in Chinese dialects, and the second is the observation from some languages that homophone pairs tend to belong to different grammatical classes, suggesting that language self-organizes in a way that decreases the chances of ambiguity.


2. Cross-language comparison of the degree of homophony

There are several difficulties in measuring the degree of homophony quantitatively and in making cross-language comparisons. First, as the lexicon of a language is basically an open set and keeps evolving, whether a given word has a homophone depends heavily on the size of the lexicon that is searched for homophones. A lexicon with more entries, and with many ancient words included, will certainly be more likely to contain homophones. Therefore, it is important to ensure the comparability of the lexicons from which homophones are extracted. We will demonstrate how this problem of comparability is dealt with for two sets of data. The first is a set of Chinese dialects, for which the pronunciations of the same set of Chinese monosyllabic morphemes are available. The second comes from CELEX, a large lexical database for three Germanic languages. The details of the two sets of data are described in Sections 2.1 and 2.2.

The second difficulty in deriving a homophone list for a language is a long-standing problem for both lexicologists and semanticists, namely how to distinguish homophony from polysemy. To avoid the difficulty of differentiating polysemes from true homophones, in the study of the three Germanic languages we restrict the scope of our analyses to homophones which have different orthographic forms. We assume that when two homophones are spelled differently, the chance that they share the same etymology is very small². Following this principle of polyseme pruning, we exclude homonyms such as “(river) bank” and “(financial) bank”. Words such as “work” as a noun and “work” as a verb, which have the same meaning but are used as different parts of speech, are not considered homophones either.

We further exclude pairs which are different inflectional forms of one lemma, because these words in fact refer to the same meaning even though they differ in gender, number or tense. For example, in Dutch “raad” and “raadt” are two inflectional forms of the same verb “to guess”, and are pronounced the same. Such pairs are excluded from our homophone lists. The homophone lists derived from the above criteria underestimate the degree of homophony, and it is not yet clear how serious this underestimation is. Nevertheless, this restricted definition of homophony enables us to obtain an estimate of the lower bound of the degree of homophony in these languages. More importantly, these restricted but explicit criteria enable cross-language comparison.

2.1. Degrees of homophony in Chinese dialects

The data for Chinese dialects come from the Dictionary on Computer (DOC), which is an electronic database of the phonological systems of Chinese languages. It is one of the earliest computer databases of languages, first developed in the research group led by Prof. William S-Y Wang at the University of California at Berkeley in 1966, and has been upgraded and maintained through the years (Wang, 1969a; Cheng, 1998). DOC has been a fertile database with historical depth and geographic breadth, having been used in many studies of sound changes in historical Chinese and Chinese dialects (Wang, 1977). These studies constitute the empirical basis for the launch of the theory of lexical diffusion (Wang, 1969b; Chen & Wang, 1975).

2 We note that there are still words which have different spellings but actually come from the same origin, such as “check” and “cheque” in English.


DOC includes the pronunciation of over 2,700 monosyllabic morphemes (or “Chinese characters”, to be more accurate) in 20 modern Chinese dialects (Beijing, Jinan, Xi’an, Taiyuan, Wuhan, Chengdu, Yangzhou, Suzhou, Wenzhou, Hefei, Changsha, Shuangfeng, Nanchang, Meixian, Guangzhou, Yangjiang, Xiamen, Jian’ou, Chaozhou, Fuzhou³), two ancient Chinese rhyme books (Guang Yun of the 7th century, and Zhongyuan Yinyun of the 14th century) and the corresponding borrowings in Japanese and Korean⁴. We use the DOC data to measure the degree of homophony in the 20 Chinese dialects. One advantage of these data is that the semantic range is approximately the same for all dialects, since all dialects use the same set of characters or morphemes. The definition of “wordness” in Chinese is a major problem (it is hard to decide whether a combination of morphemes is a word or a phrase); examining monosyllabic morphemes avoids this problem and enables us to carry out comparisons across dialects.

We note, however, that the obtained measure is only valid for monosyllabic morphemes and does not reflect the complete situation of homophony in modern dialects, as some of these monosyllabic morphemes cannot be used as free morphemes and there are many polysyllabic words in the contemporary dialects. Nevertheless, these are the data to which we have convenient access so far, and the obtained measure provides at least some preliminary comparisons, which can be extended to better coverage when more data on the lexicons of modern dialects become available.

Table 1 gives the number of entries for each dialect in DOC (many characters have multiple pronunciations, so the number of entries varies across dialects and always exceeds the number of characters, i.e. 2,700), and the syllable inventory, i.e., the number of syllable types occurring in these morphemes (Syl), with tone included. The table also shows the results of three measures of the degree of homophony: (1) the number of homophone sets (HomoSet), and the proportion of homophone sets in the total number of sets of morphemes (PropSet); (2) the number of homophone pairs (HomoPair), and the proportion of homophone pairs in the total number of pairs (PropPair); (3) the average number of homophones per syllable (AverHomo). The three measures are computed as follows.

Table 1. The degrees of homophony in 20 modern dialects, sorted according to the number of occurring syllables (Syl).

Dialect      Entries   Syl   HomoSet  PropSet  HomoPair  PropPair  AverHomo per Syl
Taiyuan      3933      828   580      0.70     14581     0.0019    4.75
Wuhan        3947      870   625      0.72     13412     0.0017    4.54
Chengdu      3838      938   657      0.70     11769     0.0016    4.09
Yangzhou     3766      947   642      0.68     11673     0.0016    3.98
Hefei        3693      976   661      0.68     10782     0.0016    3.78
Changsha     4174      981   653      0.67     13548     0.0016    4.26
Suzhou       3967      999   644      0.64     12077     0.0015    3.97
Shuangfeng   4020      1001  672      0.67     10802     0.0013    4.02
Wenzhou      4108      1048  682      0.65     13587     0.0016    3.92
Jinan        3853      1063  732      0.69     9855      0.0013    3.62
Xi’an        3875      1084  745      0.69     9397      0.0013    3.57
Nanchang     3842      1111  732      0.66     8828      0.0012    3.46
Beijing      4111      1125  757      0.67     10564     0.0013    3.66
Jian’ou      4181      1241  780      0.63     10154     0.0012    3.37
Meixian      3848      1304  785      0.60     7539      0.0010    2.95
Yangjiang    3682      1319  800      0.61     6485      0.0010    2.79
Guangzhou    3773      1367  812      0.59     6143      0.0009    2.76
Fuzhou       4398      1413  867      0.61     8639      0.0009    3.11
Chaozhou     4193      1759  919      0.52     5977      0.0007    2.38
Xiamen       5000      1855  993      0.54     8664      0.0007    2.93

3 Among the 20 dialects, the data of 17 dialects are from 汉语方音字汇 (Han4yu3 Fang1yin1 Zi4hui4, “A collection of Character Pronunciation in Chinese Dialects”, abbreviated as Zihui 1989).

4 Japanese and Korean have had heavy contact with Chinese, and there are many Chinese loan words in these two languages. In Japanese there are two main layers of borrowings, called Kan-on and Go-on readings respectively.

1) HomoSet: the number of syllables which have homophones (the 4th column in Table 1). For example, HomoSet(Beijing)=757 and HomoSet(Guangzhou)=812, i.e., Guangzhou has more homophone sets than Beijing. However, the raw count needs to be normalized, because the total number of syllables differs across dialects. The number of syllables with homophones divided by the total number of occurring syllables, i.e., HomoSet/Syl, gives a more indicative measure, namely PropSet (the 5th column). By this measure Beijing has a higher degree of homophony than Guangzhou (PropSet(Beijing)=0.67 and PropSet(Guangzhou)=0.59), which is more consistent with our expectation. Moreover, there is a strong negative correlation between the degree of homophony and the number of syllables, as shown in Fig. 1. The Pearson correlation test gives Corr(PropSet,Syl)=-0.90 (p<0.001).

[Figure 1: scatter plot of the percentage of homophone sets (PropSet, y-axis, 0.40 to 1.00) against the number of syllables (x-axis, 700 to 1900). Fitted power function: $y = 9.1x^{-0.38}$, $R^2 = 0.90$. Labelled points: Taiyuan, Wuhan, Beijing, Jian’ou, Yangjiang, Guangzhou, Fuzhou, Chaozhou, Xiamen.]

Fig. 1. Correlation between the size of the syllable inventory and the degree of homophony in terms of proportion of homophone sets. The power function and the R-squared value for the curve-fitting are given on the top right of the figure.
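The correlation and the curve fit for PropSet can be approximated from the rounded values printed in Table 1. The following is a minimal sketch, not the procedure used in the paper: the text does not say how the power function was fitted, so ordinary least squares on log-log axes is an assumption, and the function names `pearson` and `fit_power_law` are hypothetical. Because Table 1 reports rounded proportions, the numbers obtained need not match the reported values exactly.

```python
import math

# (Syl, PropSet) for the 20 dialects, copied from Table 1 (rounded values).
TABLE1 = [
    (828, 0.70), (870, 0.72), (938, 0.70), (947, 0.68), (976, 0.68),
    (981, 0.67), (999, 0.64), (1001, 0.67), (1048, 0.65), (1063, 0.69),
    (1084, 0.69), (1111, 0.66), (1125, 0.67), (1241, 0.63), (1304, 0.60),
    (1319, 0.61), (1367, 0.59), (1413, 0.61), (1759, 0.52), (1855, 0.54),
]


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log-log axes (an assumption)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    a = math.exp(my - b * mx)
    return a, b


syl = [s for s, _ in TABLE1]
prop_set = [p for _, p in TABLE1]

r = pearson(syl, prop_set)
a, b = fit_power_law(syl, prop_set)
print(f"Pearson r = {r:.2f}")
print(f"power-law fit: y = {a:.2f} * x^{b:.2f}")
```

The same two functions can be applied to the PropPair and AverHomo columns by swapping in the corresponding values from Table 1.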


2) HomoPair: the number of pairs of homophones (the 6th column). HomoPair(Beijing) =10564 and HomoPair(Guangzhou)=6143. After normalization, dividing HomoPair by the total number of pairs of morphemes, we obtain the proportion of homophone pairs as another measure of the degree of homophony, namely, PropPair (the 7th column). Similar to the result of PropSet, Beijing has a higher degree of homophony than Guangzhou (PropPair(Beijing)=0.0013, and PropPair(Guangzhou)=0.0009). For this measure of PropPair, we see an even higher correlation between the degree of homophony and the size of the syllable inventory: Corr(PropPair,Syl)=-0.96 (p<0.001), as shown in Fig. 2.

[Figure 2: scatter plot of the percentage of homophone pairs (PropPair, y-axis, 0.0000 to 0.0020) against the number of syllables (x-axis, 600 to 2000). Fitted power function: $y = 16.9x^{-1.35}$, $R^2 = 0.96$. Labelled points: Taiyuan, Wuhan, Beijing, Jian’ou, Guangzhou, Fuzhou, Chaozhou, Xiamen.]

Fig. 2. Correlation between the size of the syllable inventory and the degree of homophony in terms of proportion of homophone pairs.

3) AverHomo: the average number of homophones per syllable (the 8th column). Again we find that the AverHomo has a high negative correlation with Syl: Corr(AverHomo, Syl)=-0.85 (p<0.001), as shown in Fig. 3.

[Figure 3: scatter plot of the average number of homophones per syllable (AverHomo, y-axis, 1.00 to 5.00) against the number of syllables (x-axis, 600 to 2000). Fitted power function: $y = 807.3x^{-0.77}$, $R^2 = 0.85$. Labelled points: Taiyuan, Wuhan, Beijing, Guangzhou, Fuzhou, Chaozhou, Xiamen.]

Fig. 3. Correlation between the size of the syllable inventory and the degree of homophony in terms of average number of homophones per syllable.
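The three measures can be sketched in a few lines of code. This is an illustration under stated assumptions, not the program used for Table 1: the function name `homophony_measures` and the toy entries are hypothetical, the "total number of pairs" is taken to be the number of unordered pairs of entries, and AverHomo is taken to be the number of entries divided by the number of syllable types. Both readings are inferences that reproduce the rows of Table 1 we checked (for example Beijing and Taiyuan) to within rounding.

```python
from collections import Counter
from math import comb


def homophony_measures(entries):
    """Compute the Table 1 measures for a list of (morpheme, syllable) entries.

    The syllable label is assumed to include the tone, e.g. "ji2".
    """
    per_syllable = Counter(syllable for _, syllable in entries)
    n_entries = len(entries)
    n_syllables = len(per_syllable)

    # A homophone set is a syllable shared by at least two entries.
    homo_set = sum(1 for n in per_syllable.values() if n >= 2)
    # Every unordered pair of entries sharing a syllable is a homophone pair.
    homo_pair = sum(comb(n, 2) for n in per_syllable.values())

    return {
        "Entries": n_entries,
        "Syl": n_syllables,
        "HomoSet": homo_set,
        "PropSet": homo_set / n_syllables,
        "HomoPair": homo_pair,
        "PropPair": homo_pair / comb(n_entries, 2),
        "AverHomo": n_entries / n_syllables,
    }


# Toy data only: eight Mandarin morphemes, not taken from DOC.
TOY_ENTRIES = [
    ("急", "ji2"), ("疾", "ji2"), ("即", "ji2"),
    ("事", "shi4"), ("示", "shi4"), ("世", "shi4"),
    ("出", "chu1"), ("人", "ren2"),
]

measures = homophony_measures(TOY_ENTRIES)
for name, value in measures.items():
    print(name, value)
```

In the toy data two of the four syllables (ji2 and shi4) carry three morphemes each, so each contributes one homophone set and three homophone pairs.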


The three different measures discussed above all exhibit a high negative correlation between the size of the syllable inventory and the degree of homophony. These convergent results from different indices suggest the robustness of the measurement of the degree of homophony. The correlations show that the more syllable types a language has, the smaller the degree of homophony. This correlation conforms to our intuition about the relationship between the size of the phonological resource and the degree of homophony. However, this finding is not supported by the second set of data to be discussed below.

2.2. Degrees of homophony in three Germanic languages

We examine another set of languages for which we have data available for extracting homophones from large lexicons. The data are provided by CELEX, an electronic lexical database developed by the Dutch Centre for Lexical Information of the Max Planck Institute for Psycholinguistics (Baayen et al., 1995). The database contains lexical information including spelling, pronunciation, morphological structure, syntactic information (part of speech and subcategorization) and corpus frequency for three Germanic languages: Dutch, English and German. Table 2 gives some information about the database, including the sizes of the wordform lexicon and the lemma lexicon before and after processing⁵ for the three languages, as well as the sizes of the corpora from which the frequency information is obtained. The table shows that the three lexicons are not comparable in either type of lexicon. For the lemma lexicon, Dutch has more than twice the lemmata of the other two languages; for the wordform lexicon, the word count in English is less than half that of the other two⁶. To deal with this comparability problem, we consider only the 5000 most frequent words and carry out the comparison along frequency bands. We assume that the few thousand most frequent words should be relatively stable regardless of the size of the corpora, provided that the corpora are both sufficiently large and from similar genres.

Table 2. A summary of the three lexicons from CELEX.

           lemma types                        wordform types                     corpus size
           pre-processing  post-processing    pre-processing  post-processing
Dutch      124,136         122,400            381,292         313,270            42.38m
English    52,447          41,535             160,595         77,031             17.9m
German     51,728          51,728             365,530         321,081            6.0m

5 We found some repeated entries in the lexicons for the three languages, and therefore cleaned the lexicons to remove the repeated items.

6 The wordform lexicon includes words like “walk”, “walked” and “walking” as individual items, while the lemma lexicon excludes inflectional word forms; in this case only “walk” is included, not its inflected forms. The ratio between the number of word forms and the number of lemmata gives a rough idea of how the three languages differ in the complexity of their inflectional morphology. The average number of word forms per lemma in German is much higher (321081/51728=6.2) than in Dutch (313270/122400=2.6) and English (77031/41535=1.9); that is, German words have more inflectional forms on average.
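The ratios quoted in footnote 6 follow directly from the post-processing counts in Table 2. A small sketch of the arithmetic, with the hypothetical dictionary name `LEXICONS` introduced only for illustration:

```python
# Post-processing counts from Table 2: (lemma types, wordform types).
LEXICONS = {
    "Dutch": (122400, 313270),
    "English": (41535, 77031),
    "German": (51728, 321081),
}

# Average number of word forms per lemma, rounded to one decimal place.
forms_per_lemma = {
    language: round(wordforms / lemmata, 1)
    for language, (lemmata, wordforms) in LEXICONS.items()
}

for language, ratio in forms_per_lemma.items():
    print(language, ratio)
```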


For each of the three languages, we first sort the words in order of frequency. Then, for each of the 5000 most frequent words, we check whether it has one or more homophones in the whole word list⁷, according to the restricted criteria of homophony stated above.

The first 5000 frequent words are then grouped into 14 frequency bands in a decreasing order of frequencies. The size of the first 10 bands increases by 100 words for each band. In other words, the first band includes the first 100 words, the second band includes the first 200 words, and so on. After the 10th band, the sizes of bands increase by 1000 words, i.e. the 11th band includes the first 1000 words, and the 12th band includes the first 2000 words. For each band, we count the number of words which have at least one homophone. Fig. 4 shows the degree of homophony in the 14 frequency bands, i.e. cumulative proportion of words which have homophones in the given frequency bands, in the three languages.
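The band procedure can be sketched as follows. This is an illustration under stated assumptions rather than the script used for Fig. 4: the names `band_cutoffs` and `cumulative_homophony` are hypothetical, the 14 cumulative cutoffs (100, 200, ..., 1000, then 2000, ..., 5000) are one reading of the band description, and the sound labels in the toy list are placeholders, not CELEX transcriptions. A word counts as having a homophone only if some other entry in the whole list shares its pronunciation but differs in both spelling and lemma, following the restricted criteria of Section 2.

```python
from collections import defaultdict


def band_cutoffs():
    """Assumed cumulative band sizes: 100..1000 by 100, then 2000..5000 by 1000."""
    return list(range(100, 1001, 100)) + list(range(2000, 5001, 1000))


def cumulative_homophony(words, cutoffs):
    """Proportion of the top-N words that have a restricted homophone.

    `words` is a list of (spelling, sound, lemma) tuples sorted by
    decreasing frequency. Homophones are searched in the whole list.
    """
    by_sound = defaultdict(set)
    for spelling, sound, lemma in words:
        by_sound[sound].add((spelling, lemma))

    flags = [
        any(s != spelling and l != lemma for s, l in by_sound[sound])
        for spelling, sound, lemma in words
    ]

    proportions = {}
    for cutoff in cutoffs:
        top = flags[:cutoff]
        proportions[cutoff] = sum(top) / len(top)
    return proportions


# Toy frequency-sorted list; sound labels are arbitrary placeholders.
TOY_WORDS = [
    ("the", "snd_the", "the"),
    ("to", "snd_tu", "to"),
    ("in", "snd_in", "in"),
    ("two", "snd_tu", "two"),
    ("bank", "snd_bank", "bank (river)"),
    ("bank", "snd_bank", "bank (financial)"),
    ("inn", "snd_in", "inn"),
    ("walk", "snd_walk", "walk"),
]

toy_result = cumulative_homophony(TOY_WORDS, [2, 4, 8])
print(toy_result)
```

The two “bank” entries are not counted because they share a spelling, and a pair such as Dutch “raad” and “raadt” would not be counted because both forms belong to the same lemma.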

[Figure 4: degree of homophony (y-axis, 0 to 0.4) against frequency band (x-axis, 0 to 5000), with one curve each for English (Eng), Dutch (Dut) and German (Ger).]

Fig. 4. Degree of homophony in the first 5000 frequent words in three Germanic languages.

We make a few interesting observations from the above figure. First, in all three languages the degree of homophony in the first several frequency bands is much higher than in later frequency bands. In the first frequency band (the first 100 words), 35% of the words in English have homophones, compared with 11% in Dutch and 16% in German. In English, among the 35 words having homophones, 32 belong to the closed-class vocabulary, i.e. function words, such as the articles “the” and “a”, the prepositions “to” and “in”, and the conjunctions “but” and “or”. In fact, in all three languages over 90% of the 100 most frequent words are such function words. It remains to be seen whether other languages also show a high degree of homophony among the most frequent words, and whether these homophone pairs are mostly closed-class words. Furthermore, we find that most of the homophones are monosyllabic words. We therefore posit that there is a correlation between the degree of homophony and monosyllabicity. We will examine this in more detail in a later section.

Second, while there are more homophones in high frequency bands than in low frequency bands, the degree of homophony starts to level off at a certain value after the 12th frequency band, which includes 2000 words. This suggests that if we want to compare the degrees of homophony among different languages, we may only need to examine the high frequency word list up to a certain point, say, the first 2000 words. We see from Fig. 4 that English levels off at the highest degree of homophony (about 10%), while Dutch and German level off at similar, smaller values (about 4%). Why do these three languages show such differences? In the Chinese dialects shown above, we find a negative correlation between the number of syllables and the degree of homophony. Is there such a correlation in the Germanic languages as well? Can we predict the degree of homophony based on this parameter, i.e. the number of syllable types? The following is a preliminary attempt to answer these questions.

7 We note that this way of searching for homophones in the whole word list still depends on the size of the lexicon, but requiring the first member of each pair to be among the 5000 most frequent words at least provides some comparability, as pairs in which both members are infrequent are excluded.

3. Homophony and phonological resource

The capacity to handle a large number of words is considered a defining characteristic of the human species (Deacon, 1997). Most words in a language are arbitrary associations between forms and meanings, which are expressions of the Saussurean Sign (de Saussure, 1910/1983). While the number of meanings seems to be infinite, only a small number of them are lexicalized, and the others are expressed by combinations of words. The forms of lexical items are built by choosing from a finite set of units, which we call the “phonological resource”. The size and characteristics of the components of this finite set should affect the degree of homophony. In the following, we begin by explaining how to measure the phonological resource of a language.

The “phonological resource” refers to the number of possible distinctive forms a language can make use of to construct words or morphemes. It depends not only on the number of sounds, but also on the ways in which the sounds are combined, i.e., the phonotactic constraints. Languages differ considerably in both dimensions. Though the number of sounds that humans can make is infinitely large, as articulation exploits a continuous space in the vocal tract, the actual number of sound categories (called “segments” or “phonemes”) which any individual language uses to distinguish meanings is very limited. According to the UCLA Phonological Segment Inventory Database (UPSID) (Maddieson & Precoda, 1990), the maximum number of segments in an extant language is 141 (in the Khoisan language !Xũ). The average number of segments among languages, however, is much smaller: in UPSID, the average size of the segment inventory is only about 31.

Segments are organized into syllables, and words are constructed by concatenating syllables. An ordinary syllable consists of an obligatory vowel and optional preceding and following consonants. Different combinations of consonants and vowels in one syllable constitute different canonical forms, such as (V), (CV), (CVC), (CVCC), (CCVC), and so on. Languages vary considerably in the number and complexity of legitimate canonical forms. For example, Germanic languages allow large consonant clusters, such as (CCCVCCC) in “scripts” and (CCVCCCC) in “glimpsed” in English, and (CCCVCCCC) as in “abstractst” in Dutch and “strolchst” in German, while the most complex canonical form in Chinese dialects is only (CGVN) (“G” standing for “glide”, and “N” for “nasal”), such as “liang” in Putonghua.
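To make the notion of a canonical form concrete, the sketch below maps a syllable, given as a list of segments, onto its C/V skeleton and then counts the distinct skeletons in a small sample. The function name `canonical_form`, the vowel set and the simplified broad transcriptions are assumptions made for illustration; they are not the CELEX segment inventories, and long vowels or diphthongs would each have to be listed as a single segment.

```python
from collections import Counter

# Hypothetical, deliberately tiny vowel set; a real analysis would use the
# full vowel inventory of the language in question.
VOWELS = {"ɪ", "ɔ", "a", "i", "u", "e", "o"}


def canonical_form(segments, vowels=VOWELS):
    """Return the C/V skeleton of one syllable, e.g. 'CCCVCCC'."""
    return "".join("V" if seg in vowels else "C" for seg in segments)


# Simplified broad transcriptions (assumed), one segment per list item.
SCRIPTS = ["s", "k", "r", "ɪ", "p", "t", "s"]          # English "scripts"
GLIMPSED = ["g", "l", "ɪ", "m", "p", "s", "t"]         # English "glimpsed"
STROLCHST = ["ʃ", "t", "r", "ɔ", "l", "ç", "s", "t"]   # German "strolchst"

syllables = [SCRIPTS, GLIMPSED, STROLCHST, ["s", "ɪ", "t"], ["p", "ɪ", "t"]]
form_counts = Counter(canonical_form(s) for s in syllables)

for form, count in form_counts.items():
    print(form, count)
print("number of canonical forms:", len(form_counts))
```

Distinguishing glides and nasals, as in the (CGVN) notation used for Chinese dialects, would only require further symbol classes in the mapping.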


Table 3 lists the number of consonants and vowels and the number of canonical forms in the three Germanic languages. To ensure a valid comparison between the three languages, the criteria for determining the numbers of phonemes are important. Since the determination of phonemes in a language often has non-unique solutions (Chao, 1934), we adopt the systems used in CELEX, which include larger numbers of segments than commonly agreed analyses. For example, there are 24 vowels in English instead of 20, due to the inclusion of four nasalized vowels which only occur in foreign words. We adopt these systems for the sake of simplicity of comparison. Moreover, as will be shown below, the frequency of syllable types follows a power-law distribution, and a large proportion of segments occur only once or twice; therefore, the syllable types with foreign sounds have a similar status to other infrequent syllable types, and including these foreign sounds in the segment inventory is just like including other infrequent sounds.

Table 3. Segment inventories, number of canonical forms, occurring syllable types, possible and actual CV combinations, and CV exploitation rates in three Germanic languages.

Language   Consonants   Vowels   Canonical forms   Occurring syllables   Possible CVs   Actual CVs   CV exploitation rate
Dutch      23           21       35                9,031                 483            254          53%
English    24           24       41                9,570                 576            412          72%
German     25           34       33                4,225                 850            217          26%

From the inventory of consonants and vowels and the legitimate canonical forms, we see that the relations between the three variables are complex. English has fewer segments than German, but many more types of canonical forms. This could be explained by a trade-off between the number of segments and the ways of combining segments, through which languages achieve a similar size of phonological resource. However, this hypothesis does not hold when Dutch and English are compared: Dutch has fewer segments than English, but also fewer canonical forms, though the difference is small. Since we only have three languages, and these languages are very similar due to their close genetic relationships, it is hard to make more meaningful inferences.

The number of segments and the types of canonical forms may provide a measure of the potential phonological resource in a language. However, each language has a set of specific phonotactic constraints, resulting in many systematic gaps, such as no *[tl-] and *[dl-], as well as many accidental gaps, such as no *[krIp] and *[blIk] in English. Therefore it is hardly possible to obtain an accurate estimate of the number of syllable types based only on the number of consonants and vowels and the number of canonical forms. Jespersen (1933: 623) estimates the number of possible syllable types in English as more than 158,000, when systematic gaps are excluded. According to our calculation, however, the number of occurring syllable types in the CELEX English lexicon is only 9,570. If we assume that CELEX includes a representative number of syllable types, as its lexicon size is sufficiently large (77,031 word forms), we may obtain a rough estimate of the exploitation rate of the phonological resource in English from these two values. Taking the ratio between our number (9,570) and Jespersen’s (158,000), we estimate that the exploitation rate is only about 6%. This shows that the usable phonological resource is far from being fully employed.


We may obtain another measure of the exploitation rate of the phonological resource by examining CV combinations only. The number of possible CV types can be estimated by taking the full combination of all consonants and vowels. As shown in Table 3, the possible CV combinations are far from being fully utilized either. English has the highest rate (72%), and German the lowest (26%). In fact, German has a larger segment inventory than Dutch and English, but German and Dutch have a similar number of actual CV combinations (217 and 254), while English has nearly twice as many (412). This suggests that English has fewer phonotactic constraints than German and Dutch, at least in CV combinations.
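The two exploitation measures discussed above are simple ratios, and they can be recomputed from the figures in Table 3 and from Jespersen’s estimate. The following is a minimal sketch; the variable and function names are ours and purely illustrative.

```python
# Segment counts and attested CV syllables, copied from Table 3.
TABLE_3 = {
    "Dutch": {"consonants": 23, "vowels": 21, "actual_cv": 254},
    "English": {"consonants": 24, "vowels": 24, "actual_cv": 412},
    "German": {"consonants": 25, "vowels": 34, "actual_cv": 217},
}


def cv_exploitation_rate(consonants, vowels, actual_cv):
    """Share of all consonant-vowel combinations that are attested."""
    possible_cv = consonants * vowels  # full C x V combination
    return actual_cv / possible_cv


cv_rates = {lang: cv_exploitation_rate(**row) for lang, row in TABLE_3.items()}

# Whole-syllable estimate for English: occurring syllable types in CELEX
# divided by Jespersen's estimate of the possible syllable types.
english_syllable_rate = 9570 / 158000

for lang, rate in cv_rates.items():
    print(f"{lang}: CV exploitation rate = {rate:.0%}")
print(f"English syllable exploitation rate = {english_syllable_rate:.0%}")
```

The CV measure counts every consonant-vowel pairing as possible, so it ignores phonotactic gaps; it is an upper bound on the resource rather than an estimate of what the language permits.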

Furthermore, we find that the syllables are not utilized in a uniform way. Some syllables appear very frequently, such as in English [lI] (appearing 2850 times), [rI] (2016) and [ə] (1916), while a large proportion of the syllables (about 44%) only occur in one or two words. German and Dutch have similar characteristics. The three most frequent syllables in German are [gə] (4405), [tə] (3349) and [tən] (2845); and in Dutch they are [də] (17848), [tə] (12220) and [xə] (9899).

Figures 5, 6 and 7 show the distribution of the frequency of syllable types in the three languages. All three curves can be fitted by similar power functions, $\mathrm{prob}(f) = C f^{\alpha}$, with $\alpha_{eng} = -1.6$, $\alpha_{dut} = -1.3$ and $\alpha_{ger} = -1.6$, which appear as straight lines in the log-log plane. A power-law distribution is often considered a reflection of the presence of self-organization in a system. The distributional characteristic of syllable frequencies suggests that self-organization may be present in the organization of the lexicon.
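Because a power function is a straight line in the log-log plane, the exponent can be estimated by a linear regression on the logarithms of frequency and probability. The sketch below shows one common way of doing this; the fitting procedure actually used for Figures 5 to 7 is not specified here, so the least-squares method, the function names and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def frequency_distribution(syllable_counts):
    """Map each frequency f to the share of syllable types occurring f times."""
    histogram = Counter(syllable_counts)
    total_types = len(syllable_counts)
    return sorted((f, n / total_types) for f, n in histogram.items())


def fit_power_law(points):
    """Least-squares fit of log(prob) = log(C) + alpha * log(f).

    Returns the pair (alpha, C).
    """
    xs = [math.log(f) for f, _ in points]
    ys = [math.log(p) for _, p in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return alpha, c


# Hypothetical toy data: the number of words each syllable type occurs in.
toy_counts = [1] * 40 + [2] * 14 + [3] * 7 + [4] * 5 + [8] * 2 + [16] * 1
alpha, c = fit_power_law(frequency_distribution(toy_counts))
print(f"alpha = {alpha:.2f}, C = {c:.2f}")
```

A regression on raw log-log points gives every frequency value equal weight, including the sparse high-frequency tail, so the estimate can differ from one obtained with binning or maximum likelihood.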

Fig. 5. Distribution of the frequency of syllable types in the English lexicon. The solid line is the curve for the actual distribution, and the dotted line is the fitted curve with a power law. [Log-log plot; x-axis: frequency of syllables in the English lexicon ($10^{0}$ to $10^{4}$); y-axis: probability ($10^{-4}$ to $10^{0}$).]


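Fig. 6. Distribution of the frequency of syllable types in the Dutch lexicon. [Log-log plot; x-axis: frequency of syllables in the Dutch lexicon ($10^{0}$ to $10^{4}$); y-axis: probability ($10^{-4}$ to $10^{0}$).]

Fig. 7. Distribution of the frequency of syllable types in the German lexicon. [Log-log plot; x-axis: frequency of syllables in the German lexicon ($10^{0}$ to $10^{4}$); y-axis: probability ($10^{-4}$ to $10^{0}$).]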

4. Predicting the degree of homophony

As shown above, languages exploit only a small part of their possible phonological resource. The number of possible syllable types therefore does not give a representative measure of the phonological resource in actual use. Instead, the number of syllable types actually occurring in the contemporary lexicon may serve as a better index. Having chosen this index, we propose the following hypothesis for the relation between the degree of homophony and the phonological resource:

(1) Hypothesis-I: A larger number of syllable types (Syl) predicts a smaller degree of homophony.
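The intuition behind Hypothesis-I can be illustrated with a deliberately crude null model, which is not part of the analysis in this paper: assume that every monosyllabic word picks its syllable uniformly at random from the Syl available types. Under that assumption, the probability that a given word shares its form with at least one other word has a closed form. The lexicon size and the two inventory sizes below are hypothetical.

```python
def expected_homophone_share(n_words, n_forms):
    """Probability that a given word shares its form with at least one other
    word, when each of n_words picks one of n_forms uniformly at random."""
    return 1.0 - (1.0 - 1.0 / n_forms) ** (n_words - 1)


# Hypothetical lexicon of 3,000 monosyllabic words and two hypothetical
# syllable inventory sizes.
for syl in (1000, 2000):
    share = expected_homophone_share(3000, syl)
    print(f"Syl = {syl}: expected share of words with a homophone = {share:.2f}")
```

Real lexicons do not use syllables uniformly, so this sketch only shows the direction of the effect, not its size.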


The hypothesis seems straightforward. If there are more distinctive forms for constructing words, then the chance of two words having the same form, i.e., being homophones, should be smaller. Recall that in our earlier analyses of the degree of homophony in Chinese dialects, we did observe a significant correlation between the degree of homophony and the size of the syllable inventory, as reflected in Figures 1, 2 and 3. However, when we compare the degree of homophony and Syl in the three Germanic languages, such a correlation does not hold: English has the largest Syl (9,570) and the largest degree of homophony (10% in the list of the 5000 most frequent words, as shown in Fig. 4), while German has a much smaller Syl (4,225) than English, but also a smaller degree of homophony (4%).

How can the inconsistency between the observations from these two sets of data be explained? One explanation is that in the case of the Chinese data, only monosyllabic morphemes are examined, and many of them are not real “words” in actual language use (see Section 5.1 on disyllabification in Chinese), while in the case of the Germanic languages, the data are from real lexicons in which words have different lengths.

Intuitively, longer words are less likely to have homophones. For two languages with the same size of syllable inventory, the one which has more long words is expected to have fewer homophones. It has been shown that languages differ a lot in mean word length, and that there is a high negative correlation between mean word length and the size of the segment inventory (Nettle, 1995; Nettle, 1998; Nettle, 1999). Fig. 8 shows the correlation for ten languages, taken from Nettle (1999).

Fig. 8: Relation between the size of segment inventory and the word mean length in ten languages. The curve-fitting power function and the R-squared value are shown in the figure. Adapted from Nettle (1999:146). [Scatter plot; x-axis: number of segments (0 to 200); y-axis: mean word length (2 to 8); fitted curve $y = 17.28\,x^{-0.30}$, $R^2 = 0.67$; languages plotted: !Kung, Vute, Georgian, Thai, Mandarin, Tamasheq, Hawaiian, Italian, Turkish, Hausa.]

As mentioned earlier, longer words are less likely to have homophones. In fact, in the homophone lists of the three languages compiled from CELEX, we find that most of the homophone pairs are monosyllabic words. Therefore, we propose Hypothesis-II for predicting the degree of homophony, as stated below:

(2) Hypothesis-II: A larger number of monosyllabic words (MonoW) predicts a higher degree of homophony.

We analyze MonoW in the three Germanic languages in their lists of the 5000 most frequent words, in the same way as the degree of homophony was analyzed in Section 2.2. As shown in Fig. 9, the proportion of monosyllabic words among the 100 most frequent words is very high for all three languages, especially English and Dutch, where it is over 80%. But the values of MonoW drop quickly and stabilize at different levels: English has a much higher proportion of monosyllabic words (32%) than Dutch (20%) and German (14%). When we examine the correlations between the degrees of homophony and the values of MonoW in different frequency bands, we find that the correlations are all very high in the three languages, i.e., 0.99, 0.96 and 0.98 in English, Dutch and German respectively.

Tsou (1976) reported a similar observation that many examples of homophones are monosyllabic. He predicted that “in disyllabic or polysyllabic morphemes the probability for homophony is decreased geometrically” (ibid:75). Our data support this prediction: in the first 1000 frequent words in German, 44 of the monosyllabic words have homophones, while only 14 disyllabic and polysyllabic words have homophones; in English, all the 135 sets of homophones are monosyllabic words.

[Fig. 9 plot: x-axis: frequency band (0 to 5000); y-axis: degree of monosyllabicity (0.00 to 1.00); series: Eng, Dut, Ger.]

Fig. 9: Degree of monosyllabicity in the first 5000 words in English, Dutch and German.
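The band-wise measure plotted in Fig. 9, and its correlation with the degree of homophony, can be sketched as follows. This is only an illustration under stated assumptions: each band is taken to cover the first n words of the frequency-ranked list, the correlation is taken to be Pearson’s coefficient, and the ranked word list is invented toy data rather than CELEX material.

```python
import math


def cumulative_share(flags, band_ends):
    """Share of True flags among the first n items, for each n in band_ends."""
    return [sum(flags[:n]) / n for n in band_ends]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical frequency-ranked word list: (syllable count, has a homophone).
ranked_words = [
    (1, True), (1, True), (1, False), (1, True), (2, False),
    (1, False), (2, False), (1, True), (3, False), (2, False),
    (2, True), (1, False), (3, False), (2, False), (4, False),
    (2, False), (1, False), (3, False), (2, False), (3, False),
]
band_ends = [5, 10, 15, 20]  # hypothetical band boundaries

is_mono = [syllables == 1 for syllables, _ in ranked_words]
has_homophone = [flag for _, flag in ranked_words]

mono_w = cumulative_share(is_mono, band_ends)
homophony = cumulative_share(has_homophone, band_ends)
print("MonoW by band:    ", [round(v, 2) for v in mono_w])
print("Homophony by band:", [round(v, 2) for v in homophony])
print("Correlation:", round(pearson(mono_w, homophony), 2))
```

Because the bands are cumulative, successive values share most of their words, which by itself tends to make the two series move together.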

Each language has its own characteristic for monosyllabicity. The number of monosyllabic words in a language not only depends on the size of the phonological resource, such as the number of occurring syllable types (Syl), but also on other aspects of the language system, such as the complexity of the morphological system. We expect that if a language has a large number of morphological processes, either inflectional or derivational, the words are likely to be longer, and therefore the language tends to have a smaller proportion of monosyllabic words in its lexicon.

5. Self-organization in homophony

Do homophones cause ambiguity and confusion in daily communication? One common-sense answer to this question is that homophones do not usually affect communication, because the context, such as the neighboring words, helps to disambiguate. However, there still exist situations where the context does not provide enough information for immediate comprehension, and misunderstanding persists for a while until enough information is obtained.

Moreover, many psycholinguistic experiments show that there are differences in the processing of words with and without homophones. For instance, in an experiment where subjects were asked to verify whether the referent of a word was a member of a specified semantic category, false positive rates were higher for words that were homophonous with a category member than for orthographically similar non-homophones. For example, “rows” was more likely than “robs” to be misclassified as a flower (Van Orden, 1987). It has also been found that words with homophonous partners typically take longer to process than those without homophonous partners in lexical decision experiments (e.g. Ferrand & Grainger, 2003).

The homophone interference effect found in these experimental situations may appear insignificant or negligible with respect to actual communicative interactions. Moreover, if any confusion is caused by the presence of homophony, it is the listener who is most affected; how will this effect on the listener affect the fate of the troubling homophones? There are two conceivable ways. First, the confusion in the listener may affect the speaker. When the listener asks for clarification, the speaker will realize that some miscommunication is going on, and it will require extra effort for the speaker to attempt new ways to repair and clarify the situation. Second, the confusion caused by the homophony may remind the listener to avoid using the homophone in the same context in his own speech. Though these may be some spurious processes, they may result in some significant effects in the long run.

Therefore, some words which cause problems will face a de-selection pressure and consequently will be used less and less. This is attested in the many cases of words which have fallen into disuse because they are homophonous with taboo words (Bloomfield, 1933; Stimson, 1966). We consider this a self-organization process in language, and expect that the effects of such homophony avoidance may be detectable in the synchronic distribution of homophones. In the following we consider two phenomena as evidence of the self-organization process.

5.1 Disyllabification in Chinese

The first piece of evidence is the disyllabification phenomenon in the history of Chinese, which has been extensively discussed (e.g. Guo, 1938; Lü, 1963; Dai, 1990; Feng, 1995; Duanmu, 2000). It is generally believed that monosyllabic words were the majority in ancient Chinese, but that the number of disyllabic words has increased a lot over time. In modern Chinese dialects, monosyllabic words make up only a small proportion, and the majority of words are disyllabic. For instance, in Putonghua, only 29% of the words in the frequent word list are monosyllabic. Many words which were monosyllabic in earlier times have become disyllabic. For example, “父” (fu4), which means “father”, is now only used in disyllabic words, such as “父亲” (fu4 qin1) in Putonghua; “睛” (jing1) (“eye” or “eyeball”) has to be embedded in disyllabic compound words, such as “眼睛” (yan3 jing1) (“eye(s)”).

It is widely accepted that there has been a disyllabification process in the history of Chinese, despite the various arguments about when this process started and how extensive it has been (cf. Kennedy 1951/1964). As for the question of why this process happened, there is more controversy (Feng, 1995; Duanmu, 2000). Homophony avoidance has been used as an explanation, i.e. monosyllabic words became disyllabic in order to avoid the confusion caused by homophonous words. An additional monosyllabic word is used to disambiguate an ambiguous word, and the collocation of the two words gradually becomes a fixed expression, and later may become a lexicalized word (see footnote 8). Meanwhile, there are several other hypotheses explaining the disyllabification process, such as speech-tempo constraint (Guo, 1938), grammatical considerations (Li, 1990), morphologization (Dai, 1990), stress constraint (Lu & Duanmu, 1991; Duanmu, 2000), and prosodic constraint (Feng, 1995). The arguments for these competing hypotheses will not be dealt with in this paper. Our main concern here is to provide one piece of evidence for the homophony avoidance hypothesis.

The homophony avoidance hypothesis predicts the following correlations: a smaller phonological inventory implies a larger degree of homophony in monosyllabic morphemes, and consequently a larger degree of disyllabification. In this study we examine this prediction. Lü (1963) speculated about the possible relation between the size of the syllable inventory and the degree of disyllabification, comparing the northern and southern dialects: “because Cantonese has a larger syllable inventory than Putonghua, there should be fewer disyllabic words in Cantonese than in Putonghua” (p.440). There are ample examples of words which have been disyllabified in Putonghua but are still monosyllabic in Cantonese; for instance, “蚊” (“wen2”) (“mosquito”) cannot be used as a free morpheme and has to combine with a suffix to form “蚊子” (“wen2 zi”) in Putonghua, while in Cantonese this morpheme is still used as a monosyllabic word. So far, however, there has been no systematic quantitative comparison of the dialects. In the following we report a study which supports the above hypothesis.

We use the dialect dictionary 汉语方言词汇 (Han4yu3 Fang1yan1 Ci2hui4, “A Collection of Words in Chinese Dialects”, henceforth Cihui (1995)) to estimate the degree of disyllabification in different dialects. The Cihui gives a list of corresponding words for 1236 lexemes in 20 Chinese dialects. We first count the number of monosyllabic words in the whole list, and calculate the proportion of monosyllabic words, denoted as PropMono. The degree of disyllabification (PropDisy1) is estimated as 1-PropMono. Table 4 gives the degrees of disyllabification of the 20 Chinese dialects, as well as the number of syllable types and the degrees of homophony (PropSet) which have been shown in Table 1.

Table 4. Comparison of degrees of homophony and degrees of disyllabification in 20 modern Chinese dialects.

Dialect      Syl    Homo (PropSet)   PropDisy1   PropDisy2
Taiyuan      828    0.70             0.60        0.40
Wuhan        870    0.72             0.62        0.40
Chengdu      938    0.70             0.62        0.42
Yangzhou     947    0.68             0.61        0.40
Hefei        976    0.68             0.61        0.40
Changsha     981    0.67             0.62        0.41
Suzhou       999    0.64             0.61        0.40
Shuangfeng   1001   0.67             0.63        0.43
Wenzhou      1048   0.65             0.53        0.31
Ji’nan       1063   0.69             0.59        0.36
Xi’an        1084   0.69             0.61        0.41
Nanchang     1111   0.66             0.60        0.38
Beijing      1125   0.67             0.62        0.41
Jian’ou      1241   0.63             0.55        0.31
Meixian      1304   0.60             0.60        0.39
Yangjiang    1319   0.61             0.51        0.24
Guangzhou    1367   0.59             0.50        0.24
Fuzhou       1413   0.61             0.51        0.25
Chaozhou     1759   0.52             0.50        0.23
Xiamen       1855   0.54             0.54        0.29

8 In English, there are phenomena similar to the disyllabification in Chinese. For example, in some areas of the United States, there has been a sound change merging [ɛ] and [ɪ], which results in pairs of homophones such as “pen” and “pin”. It is found that these two words are expressed by adding a modifier, for example “ink pen” and “stick pin”, in order to eliminate the possible confusion. Also, to differentiate the second person plural pronoun from the second person singular, the expression “you all” is often used to indicate the plural meaning. These examples show how ambiguity avoidance leads to fixed collocations of individual words. Though so far these collocations have not become lexical items, they may become lexicalized later.

We find a strong and significant negative correlation between the size of the syllable inventory and the degree of disyllabification: Corr(Syl, PropDisy1)=-0.74 (p<0.001), which supports the speculation of Lü (1963). There is also a strong positive correlation between the degree of homophony and the degree of disyllabification: Corr(PropSet, PropDisy1)=0.76. Fig. 10 and Fig. 11 show the relations between the two pairs of variables, together with the curve-fitting functions.
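The two correlations can be recomputed from the values printed in Table 4. The sketch below assumes that Corr denotes Pearson’s coefficient and that the three numeric columns after Syl are PropSet, PropDisy1 and PropDisy2, in that order. Since the table values are rounded to two decimals, the results should land close to, but not necessarily exactly on, the reported coefficients.

```python
import math

# (dialect, Syl, PropSet, PropDisy1, PropDisy2), copied from Table 4.
TABLE_4 = [
    ("Taiyuan", 828, 0.70, 0.60, 0.40),
    ("Wuhan", 870, 0.72, 0.62, 0.40),
    ("Chengdu", 938, 0.70, 0.62, 0.42),
    ("Yangzhou", 947, 0.68, 0.61, 0.40),
    ("Hefei", 976, 0.68, 0.61, 0.40),
    ("Changsha", 981, 0.67, 0.62, 0.41),
    ("Suzhou", 999, 0.64, 0.61, 0.40),
    ("Shuangfeng", 1001, 0.67, 0.63, 0.43),
    ("Wenzhou", 1048, 0.65, 0.53, 0.31),
    ("Ji'nan", 1063, 0.69, 0.59, 0.36),
    ("Xi'an", 1084, 0.69, 0.61, 0.41),
    ("Nanchang", 1111, 0.66, 0.60, 0.38),
    ("Beijing", 1125, 0.67, 0.62, 0.41),
    ("Jian'ou", 1241, 0.63, 0.55, 0.31),
    ("Meixian", 1304, 0.60, 0.60, 0.39),
    ("Yangjiang", 1319, 0.61, 0.51, 0.24),
    ("Guangzhou", 1367, 0.59, 0.50, 0.24),
    ("Fuzhou", 1413, 0.61, 0.51, 0.25),
    ("Chaozhou", 1759, 0.52, 0.50, 0.23),
    ("Xiamen", 1855, 0.54, 0.54, 0.29),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


syl = [row[1] for row in TABLE_4]
prop_set = [row[2] for row in TABLE_4]
prop_disy1 = [row[3] for row in TABLE_4]

r_syl = pearson(syl, prop_disy1)
r_homo = pearson(prop_set, prop_disy1)
print(f"Corr(Syl, PropDisy1)     = {r_syl:.2f}")
print(f"Corr(PropSet, PropDisy1) = {r_homo:.2f}")
```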

Fig. 10. Sizes of syllable inventory versus degrees of disyllabification in 20 Chinese dialects. [Scatter plot; x-axis: number of syllables (700 to 2100); y-axis: degree of disyllabification (0.15 to 0.50); fitted curve $y = 82.9\,x^{-0.78}$, $R^2 = 0.60$; labelled points: xiamen, chaozhou, taiyuan, fuzhou, guangzhou, beijing.]


Fig. 11. Degrees of homophony versus degrees of disyllabification in 20 Chinese dialects. [Scatter plot; x-axis: degree of homophony (PercSet) (0.45 to 0.75); y-axis: degree of disyllabification (0.15 to 0.45); fitted curve $y = 0.82\,x^{1.96}$, $R^2 = 0.60$; labelled points: xiamen, chaozhou, taiyuan, fuzhou, guangzhou, beijing.]

The above method of calculating the degree of disyllabification tends to overestimate it, because the list contains some polysyllabic words which were never monosyllabic in the first place and are polysyllabic in all modern dialects, such as “玻璃” (“bo1 li”) (“glass”). These words did not go through a “disyllabification” process from monosyllabic to disyllabic, and should not be included in the estimation. We therefore modify the measure by taking into account only those lexemes which are expressed by a monosyllabic word in at least one dialect, based on the assumption that the original form for the meaning is likely to have been monosyllabic at earlier stages and has been retained in at least one dialect (as we assume that it is rare for a disyllabic word to become monosyllabic again). Thus we obtain a better measure of the degree of disyllabification, PropDisy2, as shown in the 5th column of Table 4. We again calculate the correlations with the degree of disyllabification. The correlations are higher than those obtained with PropDisy1: Corr(Syl, PropDisy2)=-0.76 (p<0.001), and Corr(PropSet, PropDisy2)=0.78.
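The difference between the two measures can be made concrete with a small sketch. The lexemes, dialect labels and syllable counts below are invented toy data; with the Cihui, the table would hold the 1236 lexemes in 20 dialects. The function names are ours.

```python
# Hypothetical toy data: syllable count of the word used for each lexeme in
# each of three hypothetical dialects A, B and C.
SYLLABLES = {
    "lexeme_1": {"A": 2, "B": 1, "C": 1},
    "lexeme_2": {"A": 2, "B": 2, "C": 1},
    "lexeme_3": {"A": 2, "B": 2, "C": 2},  # polysyllabic everywhere
    "lexeme_4": {"A": 2, "B": 2, "C": 2},  # polysyllabic everywhere
    "lexeme_5": {"A": 1, "B": 1, "C": 1},
}


def prop_disy1(table, dialect):
    """1 - PropMono, computed over all lexemes in the list."""
    counts = [row[dialect] for row in table.values()]
    prop_mono = sum(1 for n in counts if n == 1) / len(counts)
    return 1 - prop_mono


def prop_disy2(table, dialect):
    """Same measure, restricted to lexemes that are expressed by a
    monosyllabic word in at least one dialect."""
    eligible = [row for row in table.values() if min(row.values()) == 1]
    prop_mono = sum(1 for row in eligible if row[dialect] == 1) / len(eligible)
    return 1 - prop_mono


for dialect in ("A", "B", "C"):
    print(dialect,
          round(prop_disy1(SYLLABLES, dialect), 2),
          round(prop_disy2(SYLLABLES, dialect), 2))
```

In the toy data, the lexemes that are polysyllabic in every dialect raise PropDisy1 for all three dialects, while PropDisy2 leaves them out, which mirrors the lower PropDisy2 values in Table 4.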

These high correlations provide a strong argument for the homophony avoidance hypothesis, because the existence of such a correlation is hard to explain by other proposals for disyllabification, such as those based on prosodic or stress constraints. No argument has been made that the prosodic or stress constraint is related to the size of the syllable inventory. However, we do not argue that homophony avoidance is the only, or the most important, mechanism that accounts for disyllabification. We view disyllabification as the result of several mechanisms, of which homophony avoidance is only one. There may have been several stages of disyllabification, due to different mechanisms at work. At the initial stage of the increase in disyllabic words, the homophony avoidance mechanism may have played an important role. Once the disyllabic prosodic structure is well established in the language, new lexical items are more likely to be disyllabic. This may account for the continuous increase in the number of disyllabic words in the last 100 years, especially among words borrowed from other cultures (Masini, 1993).


5.2 Grammatical differentiation between homophones

A pair of homophones sharing the same grammatical class is more likely to cause confusion than a pair of words belonging to different grammatical classes. If so, we may expect to see more pairs of homophones in different grammatical classes than in the same class. Kelly & Ragade (2000) carried out a statistical analysis of English homophones to test this hypothesis. They found that there is no statistical bias against homophone pairs having the same grammatical class. However, many words belong to more than one grammatical class. When the frequency effect is taken into account, the most frequently used grammatical class of a word is statistically biased not to be the same as that of the word’s homophone(s). For example, for the pair of homophones “weight” and “wait”, both can be a noun and a verb, but considering only the most frequent usage, one is a noun and the other is a verb.

Kelly and Ragade carried out two statistical tests. The first is called the “existence constraint test”. Among the 502 homophone pairs in their list, there are 139 pairs of words which are from the same grammatical class. In order to test whether the distribution of homophone pairs is random, Monte Carlo experiments were run for fifty cycles of random word pairings. In each cycle, each word in the analysis was randomly paired with another word. The mean number of pairs with the same grammatical class from these 50 cycles is 140.1. A t-test shows no statistical difference between the estimated value (140.1) and the observed value (139). The results are summarized in the first row of Table 5.

Table 5. Homophone pairs are differentiated in usage in terms of grammatical classes when frequency is taken into account. A summary of the findings given in Kelly & Ragade (2000).

Condition                   Pairs of homophones / total pairs of words   Mean from 50 cycles of random pairing (standard deviation)   t-test between homophone and random pairs
Existence constraint test   139/502                                      140.1 (9.47)                                                 0.84 (p>0.30)
Frequency constraint test   73/253                                       84 (5.94)                                                    13.09 (p<0.0001)

The second test is called the “frequency constraint test”. A subset of 253 homophone pairs from the original 502 pairs was prepared, in which each word was marked only by its most frequently used grammatical class. There are 73 pairs with the same grammatical class. The Monte Carlo experiments were performed again on this set of data, and the estimated mean number of random pairs of words with the same grammatical class was 84. The t-test shows that this value from random pairs is significantly different from the observed value, which suggests that homophone pairs tend to differentiate into different grammatical classes. As Kelly and Ragade explain, “this restriction seems reasonable if one assumes that the usage frequency of a word will be depressed if it has a greater chance of impairing comprehension”.
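The random-pairing procedure and the test statistic can be sketched as follows. Several points are assumptions rather than details confirmed by Kelly and Ragade: random pairing is implemented by shuffling the word list and pairing neighbours, and the t value is read as a one-sample t statistic comparing the 50 random counts with the observed count. Under that reading, the summary figures of the frequency constraint test in Table 5 give a value of about 13.09; the other reported t values are matched only approximately, which may reflect rounding of the published means. All class labels except those for “weight” and “wait” are hypothetical.

```python
import math
import random
import statistics


def same_class_pairs(pairs, word_class):
    """Number of pairs whose two words share a grammatical class."""
    return sum(1 for a, b in pairs if word_class[a] == word_class[b])


def random_pairing_counts(words, word_class, cycles=50, seed=0):
    """Same-class counts for `cycles` random re-pairings of the words."""
    rng = random.Random(seed)
    counts = []
    for _ in range(cycles):
        shuffled = words[:]
        rng.shuffle(shuffled)
        pairs = zip(shuffled[0::2], shuffled[1::2])
        counts.append(same_class_pairs(pairs, word_class))
    return counts


def t_statistic(random_mean, random_sd, cycles, observed):
    """One-sample t statistic of the random counts against the observed count."""
    return (random_mean - observed) / (random_sd / math.sqrt(cycles))


# Toy data: one (most frequent) class label per word; labels are hypothetical.
word_class = {
    "weight": "N", "wait": "V", "sight": "N", "cite": "V",
    "meat": "N", "meet": "V", "pair": "N", "pear": "N",
}
homophone_pairs = [("weight", "wait"), ("sight", "cite"),
                   ("meat", "meet"), ("pair", "pear")]

observed = same_class_pairs(homophone_pairs, word_class)
counts = random_pairing_counts(list(word_class), word_class)
t_toy = t_statistic(statistics.mean(counts), statistics.stdev(counts),
                    len(counts), observed)
print(f"observed = {observed}, random mean = {statistics.mean(counts):.2f}, "
      f"t = {t_toy:.2f}")

# Summary figures of the frequency constraint test in Table 5.
t_frequency = t_statistic(84, 5.94, 50, 73)
print(f"t (frequency constraint test) = {t_frequency:.2f}")
```

A positive t value here means that random pairs share a grammatical class more often than homophone pairs do, i.e. that homophones are differentiated.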

We apply the same tests to the three Germanic homophone lists extracted from CELEX. The homophone pairs are compiled from those obtained in the analyses in Section 2.2, i.e. each pair has at least one word in the list of the first 5000 frequent words. The statistical results of the Monte Carlo experiments are shown in Tables 6, 7 and 8 for English, Dutch and German respectively.

The results for English are the opposite of what Kelly and Ragade reported. The existence constraint test shows that the whole set of homophone pairs is differentiated in grammatical classes in a way significantly different from random data; but the frequency constraint test does not support the differentiation, as homophone pairs share a grammatical class more often (207 pairs) than random pairs of words do (203 on average). The homophone data we compiled are different from those used by Kelly and Ragade, and it is not yet clear what accounts for this discrepancy. In Dutch and German, however, both tests show grammatical differentiation with statistical significance.

Table 6. Statistical results of the differentiation of homophone pairs in usage in terms of grammatical class in English.

Condition                   Pairs of homophones / total pairs of words   Mean from 50 cycles of random pairing (standard deviation)   t-test between homophone and random pairs
Existence constraint test   410/1125                                     426 (13.5)                                                   8.6 (p<0.001)
Frequency constraint test   207/448                                      203 (10.6)                                                   -2.7 (p<0.005)


Table 7. Statistical results of the differentiation of homophone pairs in usage in terms of grammatical class in Dutch.

Condition                   Pairs of homophones / total pairs of words   Mean from 50 cycles of random pairing (standard deviation)   t-test between homophone and random pairs
Existence constraint test   180/499                                      186 (11.4)                                                   3.6 (p<0.001)
Frequency constraint test   55/175                                       71 (6.7)                                                     16.4 (p<0.0001)

Table 8. Statistical results of the differentiation of homophone pairs in usage in terms of grammatical class in German.

Condition                   Pairs of homophones / total pairs of words   Mean from 50 cycles of random pairing (standard deviation)   t-test between homophone and random pairs
Existence constraint test   52/183                                       59 (5.2)                                                     9.5 (p<0.001)
Frequency constraint test   44/125                                       65 (6.9)                                                     21.8 (p<0.001)


6. Conclusions and discussions

Homophony appears as a by-product of incessant sound change and language contact. Its membership is in continuous flux, as new homophones arise and old homophones disappear. However, the existence and distribution of homophones are not random. The degree of homophony in a language may be predictable to some extent. Our study has shown that it is correlated with some parameters of the language, such as the size of the phonological resource, or the degree of monosyllabicity in the lexicon. There are several possible ways to measure the size of the phonological resource, for instance, the size of the segment inventory, the number of possible syllable types, the number of syllable types actually occurring, and so on. The number of occurring syllables shows a strong negative correlation with the degree of homophony in the 20 Chinese dialects, i.e. the more syllables a language has, the fewer monosyllabic homophones it has. However, this correlation does not hold in the three Germanic languages. Instead, the number of monosyllabic words shows a high correlation with the degree of homophony, as the majority of homophones in these languages are monosyllabic.

While the degree of homophony is possibly predictable, the synchronic distribution of homophones may also exhibit certain characteristics which reflect the nature of language as a self-organizing system. Language evolves in a way that ensures efficient communication. When some words often cause ambiguity and confusion, such as words which are homophonous with taboo words, they will disappear, or change (for instance, monosyllabic words may become disyllabic). Also, the statistical results for the Germanic languages suggest that homophone pairs tend to differentiate in grammatical classes, so as to decrease the possibility of ambiguity in communication, though this may not necessarily be true in all languages.

Self-organization has been recognized as a universal mechanism for the evolution of complex systems, in parallel to Darwinian natural selection (Kauffman, 1995). Language is such a self-organizing complex system (Lindblom et al., 1984; Köhler, 1994; Steels, 1998; de Boer, 2001). Köhler (1994) has proposed a general framework called “synergetic linguistics” to examine the various self-organizing features in language. Similar to what we have shown in Section 2.1, where the degree of homophony is shown to be a function of the number of syllable types, he has proposed several hypotheses, such as that the lexicon size is a function of the number of meanings to be coded and of the mean polysemy, and that the word length is a function of lexicon size, of redundancy, of the phonological inventory size, and of frequency.

However, how does the self-organization process progress to result in these various features? The systemic view, which only considers language itself as an abstract self-contained system, will not provide us with a successful framework to look for answers to this question. There is no such “language system” which actually exists and self-organizes. Nor does any individual speaker aim to organize his language as an efficient system, for instance, to construct an optimal lexicon, to put homophones in different grammatical classes, or to restrict monosyllabic words or homophones within a certain degree. The statistical distribution of grammatical differentiation in homophone pairs and the disyllabification process should be explained as the long-term effect of language evolution at the population level, through the iterative local interactions among individual speakers and listeners (Ke, 2004).


While it is hardly possible for empirical studies to examine the long-term effect of such a dynamic process, computational modelling may provide us with a useful tool to investigate these questions. With models, experiments under controlled conditions can be carried out to show the cumulative effect of long-term local interactions among individuals in a language community, and to study the effect of various parameters. This methodology has attracted growing attention in the study of language evolution in recent years (Cangelosi & Parisi, 2001; Wang et al., 2004). One immediate direction for future work on homophony is to simulate the evolution of homophones, from rise to fall, to see whether and how the features of their synchronic distribution can emerge in the model. This may be a fruitful area to explore.

References

Anttila, R. (1989). Historical and Comparative Linguistics. John Benjamins, Amsterdam/Philadelphia, 2nd edition.

Baayen, R. H., Piepenbrock, R. and Gulikers, L. (1995). The CELEX Lexical Database (Release 2) [CD-ROM]. Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania [Distributor].

Bloomfield, L. (1933). Language. The University of Chicago Press.

Cangelosi, A. and Parisi, D. (2001). Simulating the Evolution of Language (ed.). Springer-Verlag, London.

Chao, Y. R. (1934). The non-uniqueness of phonemic solutions of phonetic systems. Bulletin of the Institute of History and Philology, Academia Sinica, 4(4):363-397. Reprinted in Readings in Linguistics, ed. Martin Joos, p38-54.

Chen, M. Y.-C. and Wang, W. S.-Y. (1975). Sound change: actuation and implementation. Language, 51(1):255-281.

Cheng, C-C. (1998). Quantification for understanding language cognition, in Quantitative and Computational Studies on the Chinese Language, 15-30, ed Benjamin K. T'sou, Tom B. Y. Lai, Samuel W. K. Chan, and William S-Y. Wang, City University of Hong Kong Press.

Cihui (1964/1995). 汉语方言词汇 (Han4yu3 Fang1yan2 Ci2hui4, “A Collection of Words in Chinese Dialects”). Beijing Daxue Zhongguo Yuyan Wenxue Xi Yuyanxue Jiaoyanshi, Yuwen Chuban She (北京大学中国语言文学系语言学教研室, 语文出版社).

Dai, X. J. (1990). Historical morphonologization of syntactic words: Evidence from Chinese derived verbs. Diachronica, 7(1):9–46.

Deacon, T. (1997). The Symbolic Species. W. W. Norton and Co., New York.

de Boer, B. (2001). The Origins of Vowel Systems. Oxford University Press.

de Saussure, F. (1910/1983). Course in General Linguistics (Cours de Linguistique Générale). Open Court, LaSalle, IL.

Duanmu, S. (2000). The Phonology of Standard Chinese. Oxford University Press.

Feng, S-L. (1995). Prosodic Structure and Prosodically Constrained Syntax in Chinese. PhD dissertation, University Microfilms, Inc.

Ferrand, L. and Grainger, J. (2003). Homophone interference effects in visual word recognition. The Quarterly Journal of Experimental Psychology: Section A, 56(3):403–419.

Ferrer, R. and Solé, R. V. (2001). Two regimes in the frequency of words and the origin of complex lexicons: Zipf's law revisited. Journal of Quantitative Linguistics.

Guo, S. (1938). Zhongguo yuci zhi tanxing zuoyong (中国语词之弹性作用 The elastic function of Chinese word length). Yanjing Xuebao (燕京学报), 24.

Higgins, J. (1995). Quantifying English homophones and minimal pairs, in Studies in General and English Phonetics: Essays in Honor of Professor J.D. O'Connor. 326-334. Routledge.

Li, N. (1990). Dongci fenlei yanjiu shuolue (动词分类研究说略 A note on the categorization of verbs). Zhongguo Yuwen (中国语文), 4:248–257.


Lindblom, B., MacNeilage, P., and Studdert-Kennedy, M. (1984). Self-organizing processes and the explanation of language universals. In Butterworth, B., Bernard, C., and Dahl, O., editors, Explanations for Language Universals, pages 181–203. Walter de Gruyter & Co.

Lu, B. and Duanmu, S. (1991). A case study of the relation between rhythm and syntax in Chinese. In The Third North America Conference on Chinese Linguistics. Ithaca.

Lü, S. X. (1963). Xiandai hanyu shuangyinjie wenti chutan (现代汉语双音节问题初探 An enquiry into the question of disyllabification in Chinese). Zhongguo Yuwen (中国语文), 1:11–23.

Jespersen, O. (1933). Monosyllabism in English, Biennial Lecture on English Philology, in the British Academy, Nov. 6. Reprinted in Selected Writings of Otto Jespersen, Tokyo: Senjo Publishing, 617-641.

Kauffman, S. A. (1995). At Home in the Universe: the Search for Laws of Self-organization and Complexity. New York: Oxford University Press.

Ke, J-Y. (2004). Self-organization and Language Evolution: System, Population and Individual. Unpublished PhD thesis, City University of Hong Kong.

Kelly, M. H. and Ragade, A. (2000). Grammatical relationships between homonyms: effects on language comprehension and the structure of the English vocabulary. Unpublished manuscript. http://www.sas.upenn.edu/~kellym/homonym.html.

Kennedy, G. A. (1951/1964). The Monosyllabic Myth. Journal of the American Oriental Society, 71(3):161-166. Reprinted in Selected Works of George A. Kennedy, edited by Tien-yi Li, 104-118. New Haven, Conn.: Far Eastern Publications, Yale University.

Köhler, R. (1994). Synergetic Linguistics. In Asher, R. E., editor, The Encyclopedia of Language and Linguistics, pages 4454-4455. Pergamon Press, Oxford, New York, Seoul, Tokyo.

Maddieson, I. and Precoda, K. (1990). Updating UPSID. Journal of the Acoustical Society of America, Suppl. 1, Vol. 86, S19.

Malkiel, Y. (1979). Problems in the diachronic differentiation of near-homophones. Language, 55:1-36.

Masini, F. (1993). The Formation of Modern Chinese Lexicon and Its Evolution toward a National Language: The Period from 1840 to 1898. Journal of Chinese Linguistics, California.

Nettle, D. (1995). Segmental inventory size, word length, and communication efficiency. Linguistics, 33:359–367.

Nettle, D. (1998). Coevolution of phonology and the lexicon in twelve languages of West Africa. Journal of Quantitative Linguistics, 5(3):240–245.

Nettle, D. (1999). Linguistic Diversity. Oxford University Press, Oxford.

Steels, L. (1998). Synthesizing the origins of language and meaning using coevolution, self-organization and level formation. In Hurford, J. R., Studdert-Kennedy, M., and Knight, C., editors, Approaches to the Evolution of Language: Social and Cognitive Bases, pages 384–404. Cambridge University Press, Cambridge.

Stimson, H. (1966). A tabu word in the Peking dialect. Language, 42:285-294.

Tsou, B. K. (1976). Homophony and internal change in Chinese. Computational Analysis of Asian & African Languages, 3:67–86.

Van Orden, G. C. (1987). A rows is a rose: Spelling, sound, and reading. Memory and Cognition, 15:181–198.

Wang, W. S.-Y. (1969a). Project DOC: Its methodological basis. Journal of the American Oriental Society, 90:57–66.

Wang, W. S.-Y. (1969b). Competing changes as a cause of residue. Language, 45(1):9–25.

Wang, W. S.-Y. (1977). The Lexicon in Phonological Change (ed.). Mouton, The Hague.

Wang, W. S.-Y., Ke, J.-Y., and Minett, J. W. (2004). Computer modeling of language evolution. In Huang, C.-R. and Lenders, W., editors, Computational Linguistics and Beyond: Perspectives at the Beginning of the 21st Century. Frontiers in Linguistics (I). Language and Linguistics, Academia Sinica, Taipei.


Zihui (1962/1989). Hanyu Fangyin Zihui (汉语方音字汇 A Collection of Character Pronunciations in Chinese Dialects). Beijing Daxue Zhongguo Yuyan Wenxue Xi Yuyanxue Jiaoyanshi, Wenzi Gaige Chubanshe (北京大学中国语言文学系语言学教研室, 文字改革出版社), 2nd edition, 1989.

Acknowledgements

I would like to thank Profs. Reinhard Köhler, William S-Y Wang and Chin-Chuan Cheng, and the members of the former Language Engineering Laboratory of City University of Hong Kong for their helpful discussions. Also, I am thankful to Volker Dollun, Dinoj Surendran, Lolke Van-Der-Veen and Feng Wang for their help in this study. Special thanks are due to Dr. Christophe Coupé and the support of Laboratoire Dynamique du Langage, Institut des Sciences de l'Homme in Lyon, France.