Phylogenetic models and MCMC methods for the reconstruction of language history

Upload: robin-ryder

Post on 11-Jul-2015

TRANSCRIPT

Page 1: Phylogenetic models and MCMC methods for the reconstruction of language history

Phylogenetic models and MCMC methods for the reconstruction of language history

Robin J. Ryder

CEREMADE – Paris Dauphine / CREST – INSEE

Joint work with Geoff K. Nicholls at the Department of Statistics, University of Oxford

www.slideshare.net/robinryder

Page 2: Phylogenetic models and MCMC methods for the reconstruction of language history

Carles li reis, nostre emper[er]e magnes
Set anz tuz pleins ad estet en Espaigne :
Tresqu’en la mer cunquist la tere altaigne.
N’i ad castel ki devant lui remaigne ;
Mur ne citet n’i est remes a fraindre,
Fors Sarraguce, ki est en une muntaigne.

Chanson de Roland, 1r (11th century)

Page 3: Phylogenetic models and MCMC methods for the reconstruction of language history

La plus commune façon d'amollir les coeurs de ceux qu'on a offensez, lors qu'ayant la vengeance en main, ils nous tiennent à leur mercy, c'est de les esmouvoir par submission à commiseration et à pitié.

Montaigne, Essais, I, 1 (1580)

Page 4: Phylogenetic models and MCMC methods for the reconstruction of language history

Tes yeux sont si profonds qu'en me penchant pour boire
J'ai vu tous les soleils y venir se mirer
S'y jeter à mourir tous les désespérés
Tes yeux sont si profonds que j'y perds la mémoire

Aragon, Les Yeux d'Elsa (1942)

Page 5: Phylogenetic models and MCMC methods for the reconstruction of language history

Et la piaule swingue au son du ghetto, on tape à la porte
Chill c'est trop fort ! baisse le son merde ! j'connais
A chaque fois c'est pareil tant pis il faut qu'ça pète
Et profite en traître des nouveaux albums qu'Rod m'achète

Akhénaton, Juste une pression (2005)

Page 6: Phylogenetic models and MCMC methods for the reconstruction of language history

What to expect

Description of the data

Model of language diversification

MCMC for phylogenetic trees

Synthetic studies

Analysis of two data sets

Page 7: Phylogenetic models and MCMC methods for the reconstruction of language history

Indo-European languages

Page 8: Phylogenetic models and MCMC methods for the reconstruction of language history

Indo-European languages

Page 9: Phylogenetic models and MCMC methods for the reconstruction of language history

Language diversification

Languages change in a way comparable to biological species

Similarities between languages indicate that they may be cousins.

Most common model: phylogenetic tree

Page 10: Phylogenetic models and MCMC methods for the reconstruction of language history
Page 11: Phylogenetic models and MCMC methods for the reconstruction of language history

Questions

Topology

Internal ages

Age of the root: 6000-6500 BP or 8000-9500 BP?

(BP=Before Present)

Page 12: Phylogenetic models and MCMC methods for the reconstruction of language history

Core vocabulary

100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...

Borrowing is possible (non-tree-like change), but:

“Easy” to detect

Uncommon

Does not introduce systematic bias

Page 13: Phylogenetic models and MCMC methods for the reconstruction of language history

Data coding

Old English: stierfþ

Old High German: stirbit, touwit

Avestan: miriiete

Old Church Slavonic: umĭretŭ

Latin: moritur

Oscan: ?

Cognacy classes:

1. {stierfþ, stirbit}

2. {touwit}

3. {miriiete, umĭretŭ, moritur}
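The coding step on this slide can be sketched in a few lines. This is only an illustration of the idea, not the authors' pipeline: the dictionary layout, the `encode` helper and the use of `"?"` for missing data are assumptions made for the example. Each cognacy class becomes one binary trait, a language scores 1 if any of its words falls in that class, and a language with no attested word (Oscan here) is coded as missing.

```python
# Hypothetical encoding of the "to die" example from the slide.
words = {
    "Old English": ["stierfþ"],
    "Old High German": ["stirbit", "touwit"],
    "Avestan": ["miriiete"],
    "Old Church Slavonic": ["umĭretŭ"],
    "Latin": ["moritur"],
    "Oscan": None,  # no attested word: treated as missing data
}

cognacy_classes = [
    {"stierfþ", "stirbit"},
    {"touwit"},
    {"miriiete", "umĭretŭ", "moritur"},
]


def encode(words, classes):
    """Return one row per language and one column per cognacy class."""
    matrix = {}
    for language, forms in words.items():
        if forms is None:
            matrix[language] = ["?"] * len(classes)
        else:
            matrix[language] = [
                int(any(form in cls for form in forms)) for cls in classes
            ]
    return matrix


matrix = encode(words, cognacy_classes)
for language, row in matrix.items():
    print(f"{language:20s}", *row)
```

Old High German has two words for the meaning, so its row carries a 1 in two columns; this is why the data are coded per cognacy class rather than per meaning.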

Page 14: Phylogenetic models and MCMC methods for the reconstruction of language history

Constraints

Constraints on parts of the topology

Constraints on some internal ages

We use these constraints to infer rates and other ages

Page 16: Phylogenetic models and MCMC methods for the reconstruction of language history

Description of the model (1)

Traits are born at rate λ

Trait instances die at rate μ

λ and μ are constants
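A minimal simulation sketch of this birth-death mechanism along a single branch, under the assumption that births arrive at total rate λ in the language and that each existing trait dies independently at rate μ. The function name, the parameter values and the event-by-event (Gillespie-style) scheme are choices made for the illustration and do not come from the slides.

```python
import random


def simulate_trait_count(n0, birth_rate, death_rate, duration, rng):
    """Number of traits present at the end of a branch of the given duration."""
    n, t = n0, 0.0
    while True:
        # Total event rate: one birth process plus one death clock per trait.
        total_rate = birth_rate + n * death_rate
        t += rng.expovariate(total_rate)
        if t > duration:
            return n
        if rng.random() < birth_rate / total_rate:
            n += 1  # a new trait is born
        else:
            n -= 1  # one existing trait dies


rng = random.Random(1)
# Illustrative values only: lambda = 1.0, mu = 0.2, branch length 50.
counts = [simulate_trait_count(0, 1.0, 0.2, 50.0, rng) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
print(mean_count)  # should settle near lambda / mu = 5 on a long branch
```

On a long branch the trait count forgets its starting value and fluctuates around λ/μ, which is the quantity the catastrophe parameters are later tied to.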

Page 17: Phylogenetic models and MCMC methods for the reconstruction of language history

Description of the model (2)

Catastrophes occur at rate ρ

At a catastrophe, each trait dies with probability κ and Poiss(ν) traits are born.

λ/μ=ν/κ: the number of traits is constant on average.
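The constraint on this slide follows from a short expectation argument. As a sketch, write N⁻ and N⁺ for the number of traits just before and just after a catastrophe (notation introduced here, not in the slides), and assume the process is at its birth-death equilibrium, where the mean number of traits is λ/μ:

```latex
\begin{align*}
\mathbb{E}[N^{+}] &= (1-\kappa)\,\mathbb{E}[N^{-}] + \nu
                   = (1-\kappa)\,\frac{\lambda}{\mu} + \nu, \\
\mathbb{E}[N^{+}] = \frac{\lambda}{\mu}
  &\iff \nu = \kappa\,\frac{\lambda}{\mu}
   \iff \frac{\lambda}{\mu} = \frac{\nu}{\kappa}.
\end{align*}
```

Each trait survives the catastrophe with probability 1 − κ and Poiss(ν) new traits contribute ν on average, so the mean is unchanged exactly when λ/μ = ν/κ.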

Page 18: Phylogenetic models and MCMC methods for the reconstruction of language history

Description of the model (3)

Observation model: each data point (0s and 1s) is missing with probability ξ

Some traits are not observed and are therefore deleted from the data
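The two observation steps can be sketched as follows. This is a hedged illustration: the `observe` helper, the `"?"` marker, and the rule that a trait is dropped when no language shows an observed 1 are assumptions made for the example; the slides do not spell out the exact registration rule at this point.

```python
import random


def observe(matrix, xi, rng):
    """Mask entries with probability xi, then drop traits never seen as 1."""
    masked = [
        ["?" if rng.random() < xi else value for value in row]
        for row in matrix
    ]
    n_traits = len(matrix[0])
    # Assumed registration rule: keep a trait only if at least one
    # language still displays it after masking.
    kept = [j for j in range(n_traits) if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in kept] for row in masked]


true_data = [
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
rng = random.Random(0)
print(observe(true_data, 0.0, rng))  # no masking: only the all-zero trait is dropped
print(observe(true_data, 0.3, rng))  # some entries replaced by "?"
```

Because dropped traits never reach the data matrix, the likelihood has to condition on a trait having been registered; ignoring this would bias the rate estimates.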

Page 19: Phylogenetic models and MCMC methods for the reconstruction of language history

Registration process

Page 20: Phylogenetic models and MCMC methods for the reconstruction of language history

Registration process

Page 21: Phylogenetic models and MCMC methods for the reconstruction of language history

Registration process

Page 22: Phylogenetic models and MCMC methods for the reconstruction of language history

Registration process

Page 23: Phylogenetic models and MCMC methods for the reconstruction of language history

Posterior distribution

Page 24: Phylogenetic models and MCMC methods for the reconstruction of language history

Likelihood calculations

Page 25: Phylogenetic models and MCMC methods for the reconstruction of language history

Prior distribution on trees

Our main focus is on the root age

We would like the marginal prior on the root age to be (approximately) uniform over (say) 5000-15000 BP

Page 26: Phylogenetic models and MCMC methods for the reconstruction of language history

MCMC moves

Random walk on the parameters

Various moves on the tree (Drummond et al., 2002)
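For the parameter updates, a random-walk Metropolis step might look like the sketch below. It is not the sampler used in the talk: the target is a toy stand-in (an unnormalised Gamma density with shape 3 and rate 2) rather than the phylogenetic posterior, and the multiplicative proposal, step size and function names are assumptions chosen to keep a positive rate such as μ positive. Tree moves are not shown.

```python
import math
import random


def log_target(mu):
    """Toy stand-in for a log-posterior: unnormalised Gamma(shape=3, rate=2)."""
    return 2.0 * math.log(mu) - 2.0 * mu


def random_walk_metropolis(log_target, start, step, n_iter, rng):
    """Random walk on log(x), so proposals for a positive rate stay positive."""
    x = start
    samples = []
    for _ in range(n_iter):
        proposal = x * math.exp(step * rng.gauss(0.0, 1.0))
        # The multiplicative proposal is not symmetric in x; the Hastings
        # correction q(x | x') / q(x' | x) equals proposal / x.
        log_ratio = (
            log_target(proposal) - log_target(x) + math.log(proposal / x)
        )
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples


rng = random.Random(0)
samples = random_walk_metropolis(log_target, 1.0, 0.8, 20000, rng)
kept = samples[2000:]  # discard an initial burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)  # the toy target has mean 3 / 2 = 1.5
```

In a full analysis the same accept/reject logic applies, with `log_target` replaced by the log-posterior of tree and parameters and with the tree moves interleaved with the parameter moves.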

Page 46: Phylogenetic models and MCMC methods for the reconstruction of language history

Checking mixing and convergence

Auto-correlations

Need statistics on the tree

Length of the tree

Root age

Presence/Absence of a few subtrees
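Since a tree cannot be plotted as a trace directly, the check is applied to scalar summaries such as those listed above. The sketch below computes the sample autocorrelation of one such trace; the trace itself is synthetic (an AR(1) series standing in for, say, sampled root ages), and the helper name and parameter values are assumptions made for the example.

```python
import random


def autocorrelation(trace, lag):
    """Sample autocorrelation of a scalar MCMC trace at the given lag."""
    n = len(trace)
    mean = sum(trace) / n
    variance = sum((x - mean) ** 2 for x in trace) / n
    covariance = sum(
        (trace[i] - mean) * (trace[i + lag] - mean) for i in range(n - lag)
    ) / n
    return covariance / variance


# Synthetic stand-in for a trace of a tree statistic: AR(1) with
# coefficient 0.9, i.e. a slowly mixing chain.
rng = random.Random(0)
trace = [0.0]
for _ in range(4999):
    trace.append(0.9 * trace[-1] + rng.gauss(0.0, 1.0))

for lag in (1, 5, 20, 50):
    print(lag, round(autocorrelation(trace, lag), 3))
```

Autocorrelations that decay to zero within a small fraction of the run length suggest adequate mixing for that statistic; slow decay suggests a longer run or better moves are needed.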

Page 47: Phylogenetic models and MCMC methods for the reconstruction of language history

Synthetic data

True tree, ~40 words/language

Consensus tree

Page 48: Phylogenetic models and MCMC methods for the reconstruction of language history

Synthetic data (2)

Death rate (μ)

Page 49: Phylogenetic models and MCMC methods for the reconstruction of language history

Influence of borrowing

True tree, ~40 words/language

Borrowing: 10%

Consensus tree

Page 50: Phylogenetic models and MCMC methods for the reconstruction of language history

Influence of borrowing (2)

True tree, ~40 words/language

Borrowing: 50%

Consensus tree

Page 51: Phylogenetic models and MCMC methods for the reconstruction of language history

Influence of borrowing (3)

Topology is reconstructed correctly

Dates are underestimated for high levels of borrowing

Root age

Death rate (μ)

Borrowing: 50%

Page 52: Phylogenetic models and MCMC methods for the reconstruction of language history

Detecting borrowing

Confirmed: hardly any borrowing!

Page 53: Phylogenetic models and MCMC methods for the reconstruction of language history

Data used

Indo-European languages

Core vocabulary (Swadesh 100 or 200)

Two independent data sets

Dyen et al. (1997): 87 languages, mostly modern

Ringe et al. (2002): 24 languages, mostly ancient

Page 54: Phylogenetic models and MCMC methods for the reconstruction of language history

Constraints

Page 55: Phylogenetic models and MCMC methods for the reconstruction of language history

Cross-validation

Page 68: Phylogenetic models and MCMC methods for the reconstruction of language history

Root age

Page 69: Phylogenetic models and MCMC methods for the reconstruction of language history

Conclusions

Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.

Applicable to a variety of linguistic and cultural data sets

TraitLab: it's free!

Page 70: Phylogenetic models and MCMC methods for the reconstruction of language history

Questions

otázky

spørgsmåler

vragen

questions

Fragen

domande

pytania

questões

întrebări

вопросы

vprašanja

preguntes

preguntas

frågor

vrae

spurningar

quaestiones

ερωτήσεις

въпроси

kesses

spørsmåler

kláusimai

запитанні

سوال

प्रश्न

cwestiwnau