Quantitative approaches to linguistic similarity
Michael Cysouw
Work-in-Progress, 9 November 2004
Part 1
Distribution of rare characteristics
• Using the WALS-data to approach some perennial questions:
• Are there languages that have many typologically rare characteristics?
• Are there regions that show a relatively high density of rare features?
• Does rarity cluster?
Rarity Index R_i

For $f_i / f_{tot} < 1/n$:

$$R_{f_i} = \frac{n \cdot f_i}{f_{tot}}$$

For $f_i / f_{tot} > 1/n$:

$$R_{f_i} = \frac{1}{n-1}\left(\frac{n \cdot f_i}{f_{tot}} - 1\right) + 1$$

$n$ = number of feature values
$f_i$ = frequency of feature value $i$
$f_{tot}$ = total number of languages included
Order of Object and Verb (by Matthew Dryer)
Computing Rarity Index
• Three feature values (n = 3)
• Frequencies f_i: 640 (OV), 639 (VO), 91 (no preference)
• Total f_tot = 640 + 639 + 91 = 1370
• R_OV = 1.20, R_VO = 1.20, R_nopref = 0.20
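The arithmetic on this slide can be checked with a short script. This is only a sketch: the function name `rarity_index` and the dictionary layout are illustrative assumptions, not part of the WALS program or of the original analysis. The formula is the piecewise definition of the rarity index given above.

```python
def rarity_index(f_i: int, f_tot: int, n: int) -> float:
    """Rarity index of a feature value that occurs in f_i of f_tot languages.

    n is the number of values the feature can take (assumed to be >= 2).
    """
    share = f_i / f_tot
    expected = 1 / n  # share of each value if all n values were equally frequent
    if share < expected:
        # Rarer than expected: index falls between 0 and 1.
        return n * share
    # As frequent as expected or more: index falls between 1 and 2.
    return (n * share - 1) / (n - 1) + 1


# Frequencies from the slide (Order of Object and Verb).
frequencies = {"OV": 640, "VO": 639, "no preference": 91}
total = sum(frequencies.values())

indices = {
    value: rarity_index(f, total, len(frequencies))
    for value, f in frequencies.items()
}

for value, r in indices.items():
    print(f"{value}: {r:.2f}")
```

Both branches give 1 when a value has exactly the expected share 1/n, so the index is continuous and runs from 0 (very rare) to 2 (the only attested value).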
Indices of English
[Figure: two histograms. Panel "All languages": x-axis rarity index (0.0 to 2.0), y-axis number of occurrences (0 to 15000). Panel "English": x-axis rarity index (0.0 to 2.0), y-axis number of occurrences (0 to 30).]
Indices of Indonesian
[Figure: two histograms. Panel "All languages": x-axis rarity index (0.0 to 2.0), y-axis number of occurrences (0 to 15000). Panel "Indonesian": x-axis rarity index (0.0 to 2.0), y-axis number of occurrences (0 to 30).]
Median of Rarity Indices
[Figure: two histograms of the median of rarity indices per language, y-axis Frequency. First panel: x-axis 0.5 to 1.5, y-axis 0 to 1000. Second panel: x-axis 0.2 to 1.0, y-axis 0 to 80.]
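A minimal sketch of how the median of rarity indices per language could be computed. The language names and index values below are made up for illustration and are not WALS data.

```python
from statistics import median

# Hypothetical input: for each language, the rarity index of the value it
# shows for every feature on which it has been coded.
indices_by_language = {
    "Language A": [1.20, 0.20, 1.05, 0.90],
    "Language B": [0.45, 1.60, 0.30],
}

# One summary number per language: the median of its rarity indices.
medians = {
    language: median(values)
    for language, values in indices_by_language.items()
}

for language, m in medians.items():
    print(f"{language}: median rarity index {m:.3f}")
```

A low median suggests that at least half of a language's coded values are rarer than expected, which is why the median is a candidate summary for "languages with many rare characteristics".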
Influence of amount of data
[Figure: x-axis Number of features coded (0 to 140), y-axis Median of rarity indices (0.5 to 1.5).]
Number of indices smaller than one
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than 1 (0 to 50).]
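The two quantities plotted against each other here can be derived per language as sketched below. The data are invented for illustration, and the helper name `count_below` is an assumption, not a function of any existing tool.

```python
def count_below(indices, threshold=1.0):
    """Number of rarity indices strictly smaller than the threshold."""
    return sum(1 for r in indices if r < threshold)


# Hypothetical rarity indices per language (one per coded feature).
indices_by_language = {
    "Language A": [1.20, 0.20, 1.05, 0.90],
    "Language B": [0.45, 1.60, 0.30],
}

# One point per language: (number of features coded, number of indices < 1).
points = {
    language: (len(values), count_below(values))
    for language, values in indices_by_language.items()
}

# The same count with a stricter threshold of 0.5.
rare_below_half = {
    language: count_below(values, threshold=0.5)
    for language, values in indices_by_language.items()
}
```

Plotting the count against the number of features coded keeps well-documented languages from looking rare merely because more of their features were coded.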
Indo-European languages
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than 1 (0 to 50). Labelled languages: German, French, English, Russian, Greek, Irish, Latvian.]
Germanic
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than 1 (0 to 50). Legend: Germanic, Indo-European.]
Slavic
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than 1 (0 to 50). Legend: Slavic, Germanic, Indo-European.]
Going more extreme... (Number of rarity indices smaller than .5)
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than .5 (0 to 20).]
Languages with many rare features
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than .5 (0 to 20). Annotation: Languages with many rare features.]
Caucasian languages
[Figure: x-axis Number of features coded (0 to 140), y-axis Number of rarity indices smaller than .5 (0 to 20).]
Next steps
• Improve the integration of R with the WALS program
• Going from exploring the data to testing hypotheses
• Taking the feature-perspective: which rare features cluster?
Part 2
Quantitative approaches to historical relatedness
• Range of recent papers, using methods from biological phylogenetic reconstruction to infer linguistic family trees
• But: only final part of historical-comparative method is taken up
• Almost all on higher groupings of Indo-European
• But: one should first check validity by applying the methods to agreed upon classifications
Inferring a tree is just one of the many possibilities of quantitative approaches
Dictionaries, etc.
Cognate sets
Regular sound correspondences
Family tree
Infer missing cognates
Testing Holm’s approach (together with Søren Wichmann and David Kamholz)
• Holm’s idea: use etymological dictionary instead of Swadesh-style wordlists
• By counting the number of shared retentions for each pair of languages, he estimates the relative point of split between each pair (dissimilarity estimates)
• In simulations (by Kamholz) the approach seems to work
• We tested the method on Mixe-Zoque data
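The counting step described above can be sketched as follows. The language labels and etymon identifiers are hypothetical, and the sketch covers only the pairwise count of shared retentions. How Holm turns these counts into split estimates is not stated on the slides and is not reproduced here.

```python
from itertools import combinations

# Hypothetical input: for each language, the set of reconstructed etyma
# (from an etymological dictionary) that the language retains.
retentions = {
    "L1": {"e1", "e2", "e3", "e4"},
    "L2": {"e2", "e3", "e5"},
    "L3": {"e1", "e5"},
}

# Number of shared retentions for every pair of languages.
shared = {
    (a, b): len(retentions[a] & retentions[b])
    for a, b in combinations(sorted(retentions), 2)
}

for (a, b), count in shared.items():
    print(f"{a} and {b}: {count} shared retentions")
```

A language with few recorded retentions can only share few of them with any other language, so these raw counts depend on how much is known about each language.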
Interpreting the estimates
Following Holm’s interpretation:
[Figure: tree diagram. Leaf labels: SHM, OlP, SaP, ChZ, AyZ, SoZ, ChisZ, NHM, MM, LM.]
Using ADDTREE on the estimates:
How did it work out?
Holm’s method:
Wichmann (1995)
The Popolucan errors
Holm’s method:
Wichmann (1995)
• Error 1: they are grouped together, because of many shared retentions
• But: there are no shared innovations!
• Error 2: they are grouped with Zoque instead of with Mixe
• Circularity problem: reconstruction depends on tree, and Holm makes tree out of reconstruction
The other two errors
Holm’s method:
Wichmann (1995)
Difficult to place in the tree
[Figure: tree diagram. Leaf labels: SHM, OlP, SaP, ChZ, AyZ, SoZ, ChisZ, NHM, MM, LM.]
Estimates of available knowledge about Mixe-Zoque

Language   Number of dictionary entries
LM         7000
ChisZ      6000
NHM        5600
MM         4100
OlP        4000
SaP        3600
AyZ        2000
ChZ        1600
SoZ         800
SHM         700
Number of retentions depends on available knowledge
[Figure: x-axis Number of retentions (0 to 600), y-axis Estimates of available data (0 to 8000). R² = 0.7497.]
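The R² value is presumably the coefficient of determination of a straight-line fit, which for a single predictor equals the squared Pearson correlation. The sketch below computes it under that assumption. The numbers are toy values and not the Mixe-Zoque counts.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R² of a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Toy values only: x would be the number of retentions per language,
# y the estimate of available data (e.g. dictionary entries).
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])  # points exactly on a line
weak = r_squared([1, 2, 3], [1, 3, 2])
```

An R² close to 1 would mean that the number of retentions found for a language is largely predictable from how much data exists for it, rather than from its history.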
Spread of estimates depends on available knowledge
[Figure: x-axis Spread of estimates, standard deviation (0 to 80), y-axis Estimates of available data (0 to 8000). R² = 0.6197.]
Summary of problems
• Absence of shared innovations is not counted
• The data that enter the analysis (i.e. reconstructed etyma) partly depend on the outcome (i.e. the tree)
• Unbalanced amount of available data distorts the estimates
The End