quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... ·...

34
Quantitative approaches to linguistic similarity Michael Cysouw Work -in-Progress, 9 November 2004

Upload: others

Post on 17-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Quantitative approaches to linguistic similarity

Michael CysouwWork-in-Progress, 9 November 2004

Page 2: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Part 1

Page 3: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Distribution of rare characteristics

• Using the WALS-data to approach some perennial questions:

• Are there languages that have many typologically rare characteristics?

• Are there regions that show a relatively high density of rare features?

• Does rarity cluster?

Page 4: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Rarity Index Ri

Rfi = n ⋅ f if tot

n = number of feature valuesfi = frequency of feature value iftot = total number of languages included

For fi /ftot< 1/n :

Rfi =1n −1

n ⋅ f if tot

−1

+1For fi /ftot > 1/n :

Page 5: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Order of Object and Verb (by Matthew Dryer)

Page 6: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Computing Rarity Index

• Three feature values (n = 3)• Frequencies fi:

640 (OV), 639 (VO), 91 (no preference)• Total ftot = 640 + 639 + 91 = 1370• Rov = 1.20

Rvo = 1.20Rnopref = 0.20

Page 7: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Indices of English

All languagesMAll languagesM

All languages

rarity indexMrarity indexM

rarity index

number of occurencesMnumber of occurencesM

num

ber

of occure

nces

m0.00.0M0.0M

0.0

0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

1.51.5M1.5M

1.5

2.02.0M2.0M

2.0

m00M0M

0

50005000M5000M

5000

1000010000M10000M

10000

1500015000M15000M

15000

EnglishMEnglishM

English

rarity indexMrarity indexM

rarity index

number of occurencesMnumber of occurencesM

num

ber

of occure

nces

m0.00.0M0.0M

0.0

0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

1.51.5M1.5M

1.5

2.02.0M2.0M

2.0

m00M0M

0

55M5M

5

1010M10M

10

1515M15M

15

2020M20M

20

2525M25M

25

3030M30M

30

Page 8: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Indices of Indonesian

All languagesMAll languagesM

All languages

rarity indexMrarity indexM

rarity index

number of occurencesMnumber of occurencesM

num

ber

of occure

nces

m0.00.0M0.0M

0.0

0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

1.51.5M1.5M

1.5

2.02.0M2.0M

2.0

m00M0M

0

50005000M5000M

5000

1000010000M10000M

10000

1500015000M15000M

15000

IndonesianMIndonesianM

Indonesian

rarity indexMrarity indexM

rarity index

number of occurencesMnumber of occurencesM

num

ber

of occure

nces

m0.00.0M0.0M

0.0

0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

1.51.5M1.5M

1.5

2.02.0M2.0M

2.0

m00M0M

0

1010M10M

10

2020M20M

20

3030M30M

30

Page 9: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Median of Rarity Indices

median of rarity indices per languageMmedian of rarity indices per languageM

median of rarity indices per language

FrequencyMFrequencyM

Fre

quency

m0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

1.51.5M1.5M

1.5

m00M0M

0

200200M200M

200

400400M400M

400

600600M600M

600

800800M800M

800

10001000M1000M

1000

median of rarity indices per languageMmedian of rarity indices per languageM

median of rarity indices per language

FrequencyMFrequencyM

Fre

quency

m0.20.2M0.2M

0.2

0.40.4M0.4M

0.4

0.60.6M0.6M

0.6

0.80.8M0.8M

0.8

1.01.0M1.0M

1.0

m00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

Page 10: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Influence of amount of datam00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

1.51.5M1.5M

1.5

Number of features codedMNumber of features codedM

Number of features coded

Median of rarity indicesMMedian of rarity indicesMM

edia

n o

f ra

rity

indic

es

Page 11: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Number of indices smaller than onem00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

1010M10M

10

2020M20M

20

3030M30M

30

4040M40M

40

5050M50M

50

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than 1MNumber of rarity indices smaller than 1M

Num

ber

of ra

rity

indic

es s

malle

r th

an 1

Page 12: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Indo-european languagesm00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

1010M10M

10

2020M20M

20

3030M30M

30

4040M40M

40

5050M50M

50

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than 1MNumber of rarity indices smaller than 1M

Num

ber

of ra

rity

indic

es s

malle

r th

an 1

German, French, English, Russian, Greek, Irish,

Latvian

Page 13: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Germanicm00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

1010M10M

10

2020M20M

20

3030M30M

30

4040M40M

40

5050M50M

50

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than 1MNumber of rarity indices smaller than 1M

Num

ber

of ra

rity

indic

es s

malle

r th

an 1

Germanic

Indo-European

Page 14: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Slavicm00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

1010M10M

10

2020M20M

20

3030M30M

30

4040M40M

40

5050M50M

50

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than 1MNumber of rarity indices smaller than 1M

Num

ber

of ra

rity

indic

es s

malle

r th

an 1 Slavic

Germanic

Indo-European

Page 15: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Going more extreme...(Number of rarity indices smaller than .5)

m00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

55M5M

5

1010M10M

10

1515M15M

15

2020M20M

20

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than .5MNumber of rarity indices smaller than .5M

Num

ber

of ra

rity

indic

es s

malle

r th

an .5

Page 16: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Languages with many rare features

m00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

55M5M

5

1010M10M

10

1515M15M

15

2020M20M

20

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than .5MNumber of rarity indices smaller than .5M

Num

ber

of ra

rity

indic

es s

malle

r th

an .5

Page 17: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Languages with many rare features

Page 18: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Caucasian languagesm00M0M

0

2020M20M

20

4040M40M

40

6060M60M

60

8080M80M

80

100100M100M

100

120120M120M

120

140140M140M

140

m00M0M

0

55M5M

5

1010M10M

10

1515M15M

15

2020M20M

20

Number of features codedMNumber of features codedM

Number of features coded

Number of rarity indices smaller than .5MNumber of rarity indices smaller than .5M

Num

ber

of ra

rity

indic

es s

malle

r th

an .5

Page 19: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Next steps

• Improve the integrations of R with the WALS-programm

• Going from exploring the data to testing hypotheses

• Taking the feature-perspective: which rare features cluster?

Page 20: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Part 2

Page 21: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Quantitative approaches to historical relatedness

• Range of recent papers, using methods from biological phylogenetic reconstruction to infer linguistic family trees

• But: only final part of historical-comparative method is taken up

• Almost all on higher groupings of Indo-European

• But: one should first check validity by applying the methods to agreed upon classifications

Page 22: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Inferring tree is just one of the many possibilities of quantitative approaches

Dictionaries, etc.

Cognate sets

Regular sound correspondences

Family treeInfer missing cognates

Page 23: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Testing Holm’s approach(together with Søren Wichmann and David Kamholz)

• Holm’s idea: use etymological dictionary instead of Swadesh-style wordlists

• By counting the number of shared retentions for each pair of languages, he estimates the relative point of split between each pair (dissimilarity estimates)

• In simulations (by Kamholz) the approach seems to work

• We tested the method on Mixe-Zoque data

Page 24: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Interpreting the estimates

Following Holm’s interpretation:

SHM

OlP SaP

ChZ

AyZ

SoZ

ChisZ

NHM

MM

LM

Using ADDTREE on the estimates:

Page 25: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

How did it work out?

Holm’s method:

Wichmann (1995)

Page 26: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

The Popolucan errors

Holm’s method:

Wichmann (1995)

Page 27: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

The Popolucan errors

• Error 1: they are grouped together, because of many shared retentions

• But: there are no shared innovations!

• Error 2: they are grouped with Zoque instead of with Mixe

• Circularity problem: reconstruction depends on tree, and Holm makes tree out of reconstruction

Page 28: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

The other two errors

Holm’s method:

Wichmann (1995)

Page 29: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Difficult to place in the tree

SHM

OlP SaP

ChZ

AyZ

SoZ

ChisZ

NHM

MM

LM

Page 30: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

LanguageNumber of dictionary

entriesLM 7000

ChisZ 6000NHM 5600MM 4100OlP 4000SaP 3600AyZ 2000ChZ 1600SoZ 800

SHM 700

Estimates of available knowledge about Mixe-Zoque

Page 31: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Number of retentions depends on available knowledge

R2 = 0.7497

0

1000

2000

3000

4000

5000

6000

7000

8000

0 100 200 300 400 500 600

Number of retentions

Est

imate

s o

f avail

ab

le d

ata

Page 32: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Spread of estimates depends on available knowledge

R2 = 0.6197

0

1000

2000

3000

4000

5000

6000

7000

8000

0 10 20 30 40 50 60 70 80

Spread of Estimates (Standart deviation)

Est

imate

s o

f availab

le d

ata

Page 33: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

Summary of problems

• Absence of shared innovations is not counted

• The data that enter in the analysis (i.e. reconstructed etyma) partly depend on the outcome (i.e. the tree)

• Unbalanced amount of available data distorts the estimates

Page 34: Quantitative approaches to linguistic similaritycysouw.de/home/presentations_files/cysouw... · Quantitative approaches to linguistic similarity Michael Cysouw Work-in-Progress, 9

The End