Phylogenetic models and MCMC methods for the reconstruction of language history
Robin J. Ryder
CEREMADE – Paris Dauphine / CREST – INSEE
Joint work with Geoff K. Nicholls at the Department of Statistics, University of Oxford
www.slideshare.net/robinryder
Carles li reis, nostre emper[er]e magnes
Set anz tuz pleins ad estet en Espaigne :
Tresqu’en la mer cunquist la tere altaigne.
N’i ad castel ki devant lui remaigne ;
Mur ne citet n’i est remes a fraindre,
Fors Sarraguce, ki est en une muntaigne.
Chanson de Roland, 1r (11th century)
La plus commune façon d'amollir les coeurs de ceux qu'on a offensez, lors qu'ayant la vengeance en main, ils nous tiennent à leur mercy, c'est de les esmouvoir par submission à commiseration et à pitié.
Montaigne, Essais, I, 1 (1580)
Tes yeux sont si profonds qu'en me penchant pour boire
J'ai vu tous les soleils y venir se mirer
S'y jeter à mourir tous les désespérés
Tes yeux sont si profonds que j'y perds la mémoire
Aragon, Les Yeux d'Elsa (1942)
Et la piaule swingue au son du ghetto, on tape à la porte
Chill c'est trop fort ! baisse le son merde ! j'connais
A chaque fois c'est pareil tant pis il faut qu'ça pète
Et profite en traître des nouveaux albums qu'Rod m'achète
Akhénaton, Juste une pression (2005)
What to expect
Description of the data
Model of language diversification
MCMC for phylogenetic trees
Synthetic studies
Analysis of two data sets
Indo-European languages
Language diversification
Languages change in a way comparable to biological species
Similarities between languages indicate that they may be cousins.
Most common model: phylogenetic tree
Questions
Topology
Internal ages
Age of the root: 6000-6500 BP or 8000-9500 BP?
(BP=Before Present)
Core vocabulary
100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...
Borrowing is possible (non-tree-like change), but:
“Easy” to detect
Uncommon
Does not introduce systematic bias
Data coding
Old English: stierfþ
Old High German: stirbit, touwit
Avestan: miriiete
Old Church Slavonic: umĭretŭ
Latin: moritur
Oscan: ?
Cognacy classes:
1. {stierfþ, stirbit}
2. {touwit}
3. {miriiete, umĭretŭ, moritur}
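The coding above can be sketched as a binary presence/absence matrix, with one column per cognacy class and one row per language. This is a minimal illustration, not the encoding used by any particular tool: the function name `encode` and the use of `'?'` for a language with no recorded word (Oscan here) are assumptions.

```python
# Hypothetical encoding of the "to die" example; all names are illustrative.
LANGUAGES = ["Old English", "Old High German", "Avestan",
             "Old Church Slavonic", "Latin", "Oscan"]

# Each cognacy class lists the languages that have a word in that class.
COGNACY_CLASSES = [
    {"Old English", "Old High German"},            # stierfþ, stirbit
    {"Old High German"},                           # touwit
    {"Avestan", "Old Church Slavonic", "Latin"},   # miriiete, umĭretŭ, moritur
]

MISSING = {"Oscan"}  # no word recorded for this meaning


def encode(languages, classes, missing):
    """Return {language: [cell per class]} with cells 1, 0 or '?'."""
    matrix = {}
    for lang in languages:
        if lang in missing:
            matrix[lang] = ["?"] * len(classes)
        else:
            matrix[lang] = [1 if lang in members else 0 for members in classes]
    return matrix


matrix = encode(LANGUAGES, COGNACY_CLASSES, MISSING)
for lang in LANGUAGES:
    print(f"{lang:20s}", *matrix[lang])
```

Old High German has two words for the meaning, so it carries a 1 in two columns.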
Constraints
Constraints on parts of the topology
Constraints on some internal ages
We use these constraints to infer rates and other ages
Description of the model (1)
Traits are born at rate λ
Trait instances die at rate μ
λ and μ are constants
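With constant rates, the expected number of traits carried by a language has a closed form along a branch. The sketch below assumes traits are born at rate λ per language and each trait instance dies independently at rate μ. The function name and the numerical values are illustrative only.

```python
import math


def expected_traits(n_start, branch_length, lam, mu):
    """Expected number of traits at the end of a branch.

    Each of the n_start traits survives with probability exp(-mu * t).
    Traits born along the branch and still alive at its end number
    (lam / mu) * (1 - exp(-mu * t)) on average.
    """
    survive = math.exp(-mu * branch_length)
    return n_start * survive + (lam / mu) * (1.0 - survive)


lam, mu = 0.05, 0.0002   # illustrative values only (per year)
equilibrium = lam / mu   # long-run average number of traits

print(expected_traits(0, 1000, lam, mu))            # starting from no traits
print(expected_traits(equilibrium, 1000, lam, mu))  # starting at equilibrium
```

Starting from the equilibrium value λ/μ, the expected count does not change along the branch.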
Description of the model (2)
Catastrophes occur at rate ρ
At a catastrophe, each trait dies with probability κ and a Poisson(ν) number of new traits are born.
λ/μ=ν/κ: the number of traits is constant on average.
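The balance condition can be checked in one line. The notation N⁻ and N⁺ for the number of traits just before and just after a catastrophe is introduced here for illustration. The derivation assumes that, between catastrophes, the expected number of traits sits at its equilibrium value λ/μ.

```latex
\begin{align*}
\mathbb{E}[N^{+} \mid N^{-}] &= (1-\kappa)\,N^{-} + \nu, \\
\mathbb{E}[N^{+}] &= (1-\kappa)\,\frac{\lambda}{\mu} + \kappa\,\frac{\lambda}{\mu}
                   = \frac{\lambda}{\mu}
\qquad \text{using } \mathbb{E}[N^{-}] = \frac{\lambda}{\mu},\ \nu = \kappa\,\frac{\lambda}{\mu}.
\end{align*}
```

A catastrophe therefore replaces traits without changing their expected number.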
Description of the model (3)
Observation model: each data point (0s and 1s) is missing with probability ξ
Some traits are not observed and are therefore deleted from the data
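The observation step can be sketched as masking followed by column deletion. This is a minimal sketch under one stated assumption: a trait is kept only if at least one language still shows a 1 after masking. The function name `observe` and the toy matrix are hypothetical.

```python
import random


def observe(matrix, xi, rng):
    """Mask each cell with probability xi, then drop unobserved traits.

    matrix: list of rows (one per language) of 0/1 cells.
    Returns rows whose cells are 0, 1 or '?'.
    Assumption: a trait (column) is kept only if at least one language
    still shows a 1 after masking.
    """
    masked = [["?" if rng.random() < xi else cell for cell in row]
              for row in matrix]
    n_traits = len(matrix[0]) if matrix else 0
    keep = [j for j in range(n_traits)
            if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in keep] for row in masked]


# Toy data: 3 languages, 4 traits; the second trait is absent everywhere.
full = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
]
rng = random.Random(0)
print(observe(full, xi=0.0, rng=rng))  # nothing masked; all-zero column dropped
print(observe(full, xi=0.3, rng=rng))  # some cells replaced by '?'
```

Because deleted traits never reach the data, the likelihood has to condition on a trait having been registered.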
Registration process
Posterior distribution
Likelihood calculations
Prior distribution on trees
Our main focus is on the root age
We would like the marginal prior on the root age to be (approximately) uniform over (say) 5000-15000 BP
MCMC moves
Random walk on the parameters
Various moves on the tree (Drummond et al., 2002)
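The random-walk move on a scalar parameter can be sketched as follows. This is a generic random-walk Metropolis sampler on a toy target, not the sampler used in this work. The tree moves are not shown, and `toy_log_target` is a stand-in for the real log-posterior.

```python
import math
import random


def random_walk_metropolis(log_target, x0, step, n_iter, rng):
    """Sample a scalar parameter with Gaussian random-walk proposals."""
    x, log_p = x0, log_target(x0)
    samples = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


def toy_log_target(x):
    """Standard normal log-density up to a constant (a stand-in target)."""
    return -0.5 * x * x


rng = random.Random(42)
samples = random_walk_metropolis(toy_log_target, x0=0.0, step=1.0,
                                 n_iter=20000, rng=rng)
mean = sum(samples) / len(samples)
print(f"sample mean: {mean:.3f}")
```

For a positive parameter such as a rate, the same move is often applied on the log scale, with the corresponding Jacobian term added to the log target.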
Checking mixing and convergence
Auto-correlations
Need statistics on the tree
Length of the tree
Root age
Presence/Absence of a few subtrees
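Each of these statistics gives a scalar trace whose autocorrelation can be inspected. The sketch below computes the sample autocorrelation at a given lag. The trace values are made up for illustration and are not results from this work.

```python
def autocorrelation(trace, lag):
    """Sample autocorrelation of a scalar MCMC trace at a given lag."""
    n = len(trace)
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace) / n
    cov = sum((trace[i] - mean) * (trace[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var


# Hypothetical trace of sampled root ages (years BP), for illustration only.
root_age_trace = [8100, 8150, 8200, 8180, 8050, 7990, 8020, 8120, 8210, 8160]
for lag in (0, 1, 2):
    print(lag, round(autocorrelation(root_age_trace, lag), 3))
```

For the presence or absence of a subtree, the trace is a sequence of 0s and 1s and the same function applies, provided the subtree is not present (or absent) in every sample.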
Synthetic data
True tree, ~40 words/language
Consensus tree
Synthetic data (2)
Death rate (μ)
Influence of borrowing
True tree, ~40 words/language; borrowing: 10%
Consensus tree
Influence of borrowing (2)
True tree, ~40 words/language; borrowing: 50%
Consensus tree
Influence of borrowing (3)
Topology is reconstructed correctly
Dates are underestimated for high levels of borrowing
Root age and death rate (μ); borrowing: 50%
Detecting borrowing
Confirmed: hardly any borrowing!
Data used
Indo-European languages
Core vocabulary (Swadesh 100 or 200)
Two independent data sets
Dyen et al. (1997): 87 languages, mostly modern
Ringe et al. (2002): 24 languages, mostly ancient
Constraints
Cross-validation
Root age
Conclusions
Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.
Applicable to a variety of linguistic and cultural data sets
TraitLab: it's free!
Questions
otázky
spørgsmåler
vragen
questions
Fragen
domande
pytania
questões
întrebări
вопросы
vprašanja
preguntes
preguntas
frågor
vrae
spurningar
quaestiones
ερωτήσεις
въпроси
kesses
spørsmåler
kláusimai
запитанні
سوال
प्रश्न
cwestiwnau