Inferring phylogenetic models for European and other languages using MML

Jane N. Ooi
18560210
Supervisor: A./Prof. David L. Dowe
Table of Contents

• Motivation and background
• What is a phylogenetic model?
• Phylogenetic trees and graphs
• Types of evolution of languages
• Minimum Message Length (MML)
• Multistate distribution – modelling of mutations
• Results/Discussion
• Conclusion and future work
Motivation

• To study how languages have evolved (phylogeny of languages), e.g. artificial languages, European languages.
• To refine natural language compression methods.
Evolution of languages – What is phylogeny?

• Phylogeny means evolution.

What is a phylogenetic model?

• A phylogenetic tree/graph is a tree/graph showing the evolutionary interrelationships among various species, or other entities, that are believed to have a common ancestor.
Difference between a phylogenetic tree and a phylogenetic graph

• Phylogenetic trees: each child node has exactly one parent node.
• Phylogenetic graphs (new concept): each child node can descend from one or more parent nodes.

[Diagrams: a tree with root X and children Y and Z; a graph in which Z descends from both X and Y.]
Evolution of languages – 3 types of evolution

• Evolution of phonology/pronunciation
• Evolution of written script/spelling
• Evolution of grammatical structures
Words      US        UK
schedule   skedule   shedule
leisure    leezhure  lezhure

English      Malay
Mobile       Mobil
Television   Televisyen
Minimum Message Length (MML)

What is MML?
• A measure of goodness of classification based on information theory.
• Data can be described using "models".
• MML methods favour the "best" description of the data, where "best" = shortest overall message length.
• Two-part message:
  MsgLength = MsgLength(model) + MsgLength(data|model)
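The two-part message can be sketched for a toy case. This example assumes a simple Bernoulli model whose parameter is stated at a fixed cost (a simplification of the full MML parameter-coding scheme; the precise parameter coding is not given on the slides):

```python
import math

def two_part_length(data, p, param_bits=10):
    """Total two-part message length (bits) for binary data:
    the cost of stating the model (here a flat param_bits for the
    Bernoulli parameter p, a simplifying assumption) plus the cost
    of the data encoded under that model."""
    data_bits = sum(-math.log2(p if x else 1 - p) for x in data)
    return param_bits + data_bits

data = [1, 1, 0, 1, 0, 1, 1, 1]
# MML prefers the model giving the shortest *total* length.
print(two_part_length(data, 0.75))  # fits the data better
print(two_part_length(data, 0.5))
```

With the data above, p = 0.75 yields a shorter total message than p = 0.5, so it would be the preferred model under this (simplified) coding.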
Minimum Message Length (MML)

• Degree of similarity between languages can be measured by compressing them in terms of one another.
• Example: Language A, Language B. 3 possibilities:
  – Unrelated – shortest message length when compressed separately.
  – A descended from B – shortest message length when A is compressed in terms of B.
  – B descended from A – shortest message length when B is compressed in terms of A.
Minimum Message Length (MML)

• The best phylogenetic model is the tree/graph that achieves the shortest overall message length.
Modelling mutation between words

Root language
• Equal frequencies for all characters: cost = log(size of alphabet) * no. of chars.
• Some characters occur more frequently than others, e.g. English "x" compared with "a": use a multistate distribution of characters.
Modelling mutation between words

Child languages
• Multistate distribution with 4 states: insert, delete, copy, change.
• Use string alignment techniques to find the best alignment between words.
• A dynamic programming algorithm finds the alignment between strings.
• MML favours the alignment between words that produces the shortest overall message length.
Example:

r e c o m m a n d e r
| | | | | | | | | | |
r e c o m m e n d - -
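The alignment step can be sketched with a standard dynamic-programming edit alignment. Unit operation costs are an assumption here for clarity; the MML version described on the slides would instead weight each operation by its message-length cost:

```python
def align(a, b):
    """Dynamic-programming alignment of two words using the four
    operations copy/change/insert/delete, with unit costs (a
    simplification of the MML-weighted version)."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                       # delete all of a's prefix
    for j in range(1, m + 1):
        cost[0][j] = j                       # insert all of b's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # copy/change
                             cost[i - 1][j] + 1,        # delete
                             cost[i][j - 1] + 1)        # insert
    # Trace back to recover the operation sequence.
    ops, i, j = [], n, m
    while i or j:
        if i and j and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append('copy' if a[i - 1] == b[j - 1] else 'change')
            i, j = i - 1, j - 1
        elif i and cost[i][j] == cost[i - 1][j] + 1:
            ops.append('delete')
            i -= 1
        else:
            ops.append('insert')
            j -= 1
    return cost[n][m], ops[::-1]

print(align("recommander", "recommend"))
```

On the slide's example this recovers one change (a → e) and two deletions (the trailing "e" and "r"), matching the alignment shown above.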
Work to date

• Preliminary model: only copy and change mutations; words of the same length; artificial and European languages.
• Expanded model: copy, change, insert and delete mutations; words of different lengths; artificial and European languages.
Results – Preliminary model

Artificial languages:
• A – random
• B – 5% mutation from A
• C – 5% mutation from B
• Full stop "." marks the end of a string.

     A         B         C
1    asdfge.   assfge.   assfge.
2    zlsdrya.  zlcdrya.  zlchrya.
3    wet.      wet.      wbt.
4    vsert.    vsegt.    vsagt.
…    …         …         …
50
Results – Preliminary model

Possible tree topologies for 3 languages:

[Diagrams: the null hypothesis (X, Y, Z totally unrelated); partially related topologies; fully related topologies, including the expected topology with root X and children Y and Z; one chain topology applies to the expanded model only.]
Results – Preliminary model

Possible graph topologies for 3 languages:

[Diagrams: Z descending from both parents X and Y, shown with non-related parents and with related parents.]
Results – Preliminary model

Results – best tree:

            Language B
           /          \
  Pmut(B,A) ~ 0.051648  Pmut(B,C) ~ 0.049451
         v                v
    Language A       Language C

• Overall message length = 2933.26 bits
• Cost of topology = log(5)
• Cost of fixing the root language (B) = log(3)
• Cost of root language = 2158.7186 bits
• Branch 1: cost of child language (Lang. A), binomial distribution = 392.069784 bits
• Branch 2: cost of child language (Lang. C), binomial distribution = 378.562159 bits
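The reported total can be reconstructed from the stated components. This sketch assumes the logarithms are base 2, consistent with costs being quoted in bits:

```python
import math

# Components of the message length for the tree with B as root
# and A, C as its children, as reported on the slide.
topology  = math.log2(5)   # stating which of the 5 topologies
root_pick = math.log2(3)   # stating which language is the root
root_cost = 2158.7186      # encoding root language B from scratch
branch_a  = 392.069784     # A encoded in terms of B
branch_c  = 378.562159     # C encoded in terms of B

total = topology + root_pick + root_cost + branch_a + branch_c
print(round(total, 2))  # → 2933.26
```

The components sum to the slide's overall message length of 2933.26 bits, confirming the base-2 reading of log(5) and log(3).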
Results – Preliminary model

European languages (30 words):

     English    French     Spanish
1    baby.      bebe.      nene.
2    beach.     plage.     playa.
3    biscuits.  biscuits.  bizocho.
4    cream.     creme.     crema.
…    …          …          …
30
Results – Preliminary modelResults – Preliminary model
FrenchFrench P(from French)~ 0.834297 Pmut(French,Spanish) ~ 0.245174P(from French)~ 0.834297 Pmut(French,Spanish) ~ 0.245174 P(from Spanish P(from Spanish not French) ~ 0.090559 Spanishnot French) ~ 0.090559 Spanish P(from neither)~ 0.075145 P(from neither)~ 0.075145 EnglishEnglish
French
English
Spanish
Cost of parent language (French) =1226.76 bitsCost of language (Spanish) binomial distribution = 734.59 bitsCost of child language (English) trinomial distribution = 537.70 bitsTotal tree cost = log(5) + log(3) + log(2) + 1226.76 + 734.59 + 537.70
= 2503.95 bits
Results – Expanded model

• 16 sets of 4 languages; vocabularies of different lengths.
• A – randomly generated; B – mutated from A; C – mutated from A; D – mutated from B.
• Mutation probabilities: copy – 0.65, change – 0.20, insert – 0.05, delete – 0.10.
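A sketch of how a child vocabulary could be generated under the four-state model with these probabilities. The exact generation procedure used for the experiments is not given on the slides, so the details below (per-character application, uniform replacement letters) are assumptions:

```python
import random

def mutate(word, rng, p_copy=0.65, p_change=0.20, p_insert=0.05, p_delete=0.10):
    """Apply the four-state mutation model to each character of a word,
    using the slide's probabilities by default. Delete takes the
    remaining probability mass."""
    assert abs(p_copy + p_change + p_insert + p_delete - 1.0) < 1e-9
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in word:
        r = rng.random()
        if r < p_copy:
            out.append(ch)                    # copy unchanged
        elif r < p_copy + p_change:
            out.append(rng.choice(alphabet))  # change to a random letter
        elif r < p_copy + p_change + p_insert:
            out.append(rng.choice(alphabet))  # insert a random letter
            out.append(ch)                    # ...then keep the original
        # else: delete (emit nothing)
    return "".join(out)

rng = random.Random(0)
print(mutate("asdfge", rng))
```

Running this over a random parent vocabulary yields child vocabularies with word-length variation, much like the A/B/C/D sets shown on the next slide.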
Results – Expanded model

Example of a set of 4 vocabularies used:

     A          B        C          D
1    awjmv.     afjmv.   wqmv.      afjnv.
2    bauke.     baxke.   auke.      bave.
3    doinet.    domnit.  deoinet.   domnit.
4    eni.       eol.     enc.       eol.
5    foijgnw.   fiogw.   foijnw.    fidgw.
…    …          …        …          …
50
Results – Expanded model

Possible tree structures for 4 languages:

[Diagrams: the null hypothesis (A, B, C, D totally unrelated); partially related structures; fully related structures; and the expected topology, A with children B and C, and D descending from B.]
Results – Expanded model

• Correct tree structure inferred 100% of the time.
• Sample of inferred tree and cost: A with children B and C; D descending from B.
• Language A: size = 383 chars, cost = 1821.121913 bits
Results – Expanded model

Branch A -> B:
• Pr(Delete) = 0.076250, Pr(Insert) = 0.038750, Pr(Mismatch) = 0.186250, Pr(Match) = 0.698750
• 4-state multinomial cost = 930.108894 bits

Branch A -> C:
• Pr(Delete) = 0.071250, Pr(Insert) = 0.038750, Pr(Mismatch) = 0.183750, Pr(Match) = 0.706250
• 4-state multinomial cost = 916.979371 bits

Branch B -> D:
• Pr(Delete) = 0.066580, Pr(Insert) = 0.035248, Pr(Mismatch) = 0.189295, Pr(Match) = 0.708877
• 4-state multinomial cost = 873.869382 bits

Note: each multinomial cost includes an extra cost of log(26) to state the new character for each mismatch and insert.

• Cost of fixing topology = log(7) = 2.81 bits
• Total tree cost = 930.11 + 916.98 + 873.87 + 1821.11 + log(7) + log(4) + log(3) + log(2) = 4549.46 bits
Results – Expanded model

European languages (601 words):

     English   French   German
1    even.     meme.    sogar.
2    eyes.     oeil.    auge.
3    false.    faux.    falsch.
4    fear.     peur.    angst.
…    …         …        …
601
Results – Expanded model

Best tree: French -> English -> German.

• Total cost of this tree = 56807.155 bits
• Cost of fixing topology = log(4) = 2 bits
• Cost of fixing root language (French) = log(3) = 1.585 bits
• Cost of French = no. of chars * log(27) = 21054.64 bits
• Cost of fixing parent/child language (English) = log(2) = 1 bit
• Cost of multistate distribution (French -> English) = 15567.98 bits
  MML inferred probabilities: Pr(Delete) = 0.164322, Pr(Insert) = 0.071429, Pr(Mismatch) = 0.357143, Pr(Match) = 0.407106
• Cost of multistate distribution (English -> German) = 20179.95 bits
  MML inferred probabilities: Pr(Delete) = 0.069480, Pr(Insert) = 0.189866, Pr(Mismatch) = 0.442394, Pr(Match) = 0.298260

Note: an extra cost of log(26) is needed for each mismatch, and log(27) for each insert, to state the new character.
Conclusion

• MML methods have managed to:
  – infer the correct phylogenetic trees/graphs for artificial languages;
  – infer phylogenetic trees/graphs for languages by encoding them in terms of one another.
• We cannot conclude that one language really descended from another; we can only conclude that they are related.
Future work:

• Compression – grammar and vocabulary.
• Compression – phonemes of languages.
• Endangered languages – Indigenous languages.
• Refine the coding scheme:
  – Some characters occur more frequently than others, e.g. English "x" compared with "a".
  – Some characters are more likely to mutate from one language to another.
Questions?
Papers on success of MML

• C. S. Wallace and P. R. Freeman. Single factor analysis by MML estimation. Journal of the Royal Statistical Society, Series B, 54(1):195-209, 1992.
• C. S. Wallace. Multiple factor analysis by MML estimation. Technical Report CS 95/218, Department of Computer Science, Monash University, 1995.
• C. S. Wallace and D. L. Dowe. MML estimation of the von Mises concentration parameter. Technical Report CS 93/193, Department of Computer Science, Monash University, 1993.
• C. S. Wallace and D. L. Dowe. Refinements of MDL and MML coding. The Computer Journal, 42(4):330-337, 1999.
• P. J. Tan and D. L. Dowe. MML inference of decision graphs with multi-way joins. In Proceedings of the 15th Australian Joint Conference on Artificial Intelligence, Canberra, Australia, 2-6 December 2002, Lecture Notes in Artificial Intelligence (LNAI) 2557, pages 131-142. Springer-Verlag, 2002.
• S. L. Needham and D. L. Dowe. Message length as an effective Ockham's razor in decision tree induction. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AI+STATS 2001), Key West, Florida, U.S.A., January 2001, pages 253-260, 2001.
• Y. Agusta and D. L. Dowe. Unsupervised learning of correlated multivariate Gaussian mixture models using MML. In Proceedings of the Australian Conference on Artificial Intelligence 2003, Lecture Notes in Artificial Intelligence (LNAI) 2903, pages 477-489. Springer-Verlag, 2003.
• J. W. Comley and D. L. Dowe. General Bayesian networks and asymmetric languages. In Proceedings of the Hawaii International Conference on Statistics and Related Fields, June 5-8, 2003.
• J. W. Comley and D. L. Dowe. Minimum Message Length, MDL and Generalised Bayesian Networks with Asymmetric Languages, chapter 11, pages 265-294. M.I.T. Press, 2005. [Camera ready copy submitted October 2003.]
• P. J. Tan and D. L. Dowe. MML inference of oblique decision trees. In Proc. 17th Australian Joint Conference on Artificial Intelligence (AI'04), Cairns, Qld., Australia, pages 1082-1088. Springer-Verlag, December 2004.