Inferring phylogenetic models for European and other languages using MML

Jane N. Ooi
18560210
Supervisor: A./Prof. David L. Dowe
Table of Contents

• Motivation and background
• What is a phylogenetic model?
• Phylogenetic trees and graphs
• Types of evolution of languages
• Minimum Message Length (MML)
• Multistate distribution – modelling of mutations
• Results/Discussion
• Conclusion and future work
Motivation

• To study how languages have evolved (phylogeny of languages), e.g. artificial languages, European languages.
• To refine natural language compression methods.
Evolution of languages – What is phylogeny?

• Phylogeny means evolution.

What is a phylogenetic model?

• A phylogenetic tree/graph is a tree/graph showing the evolutionary interrelationships among various species, or other entities, that are believed to have a common ancestor.
Difference between a phylogenetic tree and a phylogenetic graph

• Phylogenetic trees: each child node has exactly one parent node.
• Phylogenetic graphs (new concept): each child node can descend from one or more parent nodes.

[Diagrams: a tree with root X and children Y and Z; a graph in which Z descends from both X and Y.]
Evolution of languages – 3 types of evolution

• Evolution of phonology/pronunciation
• Evolution of written script/spelling
• Evolution of grammatical structures
Words      US        UK
schedule   skedule   shedule
leisure    leezhure  lezhure

English      Malay
Mobile       Mobil
Television   Televisyen
Minimum Message Length (MML)

What is MML?
• A measure of goodness of classification based on information theory.
• Data can be described using "models".
• MML methods favour the "best" description of the data, where "best" = shortest overall message length.
• Two-part message:
  MsgLength = MsgLength(model) + MsgLength(data|model)
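The two-part message can be sketched for a toy case. This example assumes a simple Bernoulli model whose parameter is stated at a fixed cost (a simplification of the full MML parameter-coding scheme; the precise parameter coding is not given on the slides):

```python
import math

def two_part_length(data, p, param_bits=10):
    """Total two-part message length (bits) for binary data:
    the cost of stating the model (here a flat param_bits for the
    Bernoulli parameter p, a simplifying assumption) plus the cost
    of the data encoded under that model."""
    data_bits = sum(-math.log2(p if x else 1 - p) for x in data)
    return param_bits + data_bits

data = [1, 1, 0, 1, 0, 1, 1, 1]
# MML prefers the model giving the shortest *total* length.
print(two_part_length(data, 0.75))  # fits the data better
print(two_part_length(data, 0.5))
```

With the data above, p = 0.75 yields a shorter total message than p = 0.5, so it would be the preferred model under this (simplified) coding.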
Minimum Message Length (MML)

• Degree of similarity between languages can be measured by compressing them in terms of one another.
• Example: Language A, Language B. 3 possibilities:
  – Unrelated – shortest message length when compressed separately.
  – A descended from B – shortest message length when A is compressed in terms of B.
  – B descended from A – shortest message length when B is compressed in terms of A.
Minimum Message Length (MML)

• The best phylogenetic model is the tree/graph that achieves the shortest overall message length.
Modelling mutation between words

Root language
• Equal frequencies for all characters: cost = log(size of alphabet) * no. of chars.
• Some characters occur more frequently than others, e.g. English "x" compared with "a": use a multistate distribution of characters.
Modelling mutation between words

Child languages
• Multistate distribution with 4 states: insert, delete, copy, change.
• Use string alignment techniques to find the best alignment between words.
• A dynamic programming algorithm finds the alignment between strings.
• MML favours the alignment between words that produces the shortest overall message length.
Example:

r e c o m m a n d e r
| | | | | | | | | | |
r e c o m m e n d - -
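The alignment step can be sketched with a standard dynamic-programming edit alignment. Unit operation costs are an assumption here for clarity; the MML version described on the slides would instead weight each operation by its message-length cost:

```python
def align(a, b):
    """Dynamic-programming alignment of two words using the four
    operations copy/change/insert/delete, with unit costs (a
    simplification of the MML-weighted version)."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                       # delete all of a's prefix
    for j in range(1, m + 1):
        cost[0][j] = j                       # insert all of b's prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # copy/change
                             cost[i - 1][j] + 1,        # delete
                             cost[i][j - 1] + 1)        # insert
    # Trace back to recover the operation sequence.
    ops, i, j = [], n, m
    while i or j:
        if i and j and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append('copy' if a[i - 1] == b[j - 1] else 'change')
            i, j = i - 1, j - 1
        elif i and cost[i][j] == cost[i - 1][j] + 1:
            ops.append('delete')
            i -= 1
        else:
            ops.append('insert')
            j -= 1
    return cost[n][m], ops[::-1]

print(align("recommander", "recommend"))
```

On the slide's example this recovers one change (a → e) and two deletions (the trailing "e" and "r"), matching the alignment shown above.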
Work to date

• Preliminary model: only copy and change mutations; words of the same length; artificial and European languages.
• Expanded model: copy, change, insert and delete mutations; words of different lengths; artificial and European languages.
Results – Preliminary model

Artificial languages:
• A – random
• B – 5% mutation from A
• C – 5% mutation from B
• Full stop "." marks the end of a string.

     A         B         C
1    asdfge.   assfge.   assfge.
2    zlsdrya.  zlcdrya.  zlchrya.
3    wet.      wet.      wbt.
4    vsert.    vsegt.    vsagt.
…    …         …         …
50
Results – Preliminary model

Possible tree topologies for 3 languages:

[Diagrams: the null hypothesis (X, Y, Z totally unrelated); partially related topologies; fully related topologies, including the expected topology with root X and children Y and Z; one chain topology applies to the expanded model only.]
Results – Preliminary model

Possible graph topologies for 3 languages:

[Diagrams: Z descending from both parents X and Y, shown with non-related parents and with related parents.]
Results – Preliminary model

Results – best tree:

            Language B
           /          \
  Pmut(B,A) ~ 0.051648  Pmut(B,C) ~ 0.049451
         v                v
    Language A       Language C

• Overall message length = 2933.26 bits
• Cost of topology = log(5)
• Cost of fixing the root language (B) = log(3)
• Cost of root language = 2158.7186 bits
• Branch 1: cost of child language (Lang. A), binomial distribution = 392.069784 bits
• Branch 2: cost of child language (Lang. C), binomial distribution = 378.562159 bits
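The reported total can be reconstructed from the stated components. This sketch assumes the logarithms are base 2, consistent with costs being quoted in bits:

```python
import math

# Components of the message length for the tree with B as root
# and A, C as its children, as reported on the slide.
topology  = math.log2(5)   # stating which of the 5 topologies
root_pick = math.log2(3)   # stating which language is the root
root_cost = 2158.7186      # encoding root language B from scratch
branch_a  = 392.069784     # A encoded in terms of B
branch_c  = 378.562159     # C encoded in terms of B

total = topology + root_pick + root_cost + branch_a + branch_c
print(round(total, 2))  # → 2933.26
```

The components sum to the slide's overall message length of 2933.26 bits, confirming the base-2 reading of log(5) and log(3).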
Results – Preliminary model

European languages (30 words):

     English    French     Spanish
1    baby.      bebe.      nene.
2    beach.     plage.     playa.
3    biscuits.  biscuits.  bizocho.
4    cream.     creme.     crema.
…    …          …          …
30
Results – Preliminary modelResults – Preliminary model
FrenchFrench P(from French)~ 0.834297 Pmut(French,Spanish) ~ 0.245174P(from French)~ 0.834297 Pmut(French,Spanish) ~ 0.245174 P(from Spanish P(from Spanish not French) ~ 0.090559 Spanishnot French) ~ 0.090559 Spanish P(from neither)~ 0.075145 P(from neither)~ 0.075145 EnglishEnglish
French
English
Spanish
Cost of parent language (French) =1226.76 bitsCost of language (Spanish) binomial distribution = 734.59 bitsCost of child language (English) trinomial distribution = 537.70 bitsTotal tree cost = log(5) + log(3) + log(2) + 1226.76 + 734.59 + 537.70
= 2503.95 bits
Results – Expanded model

• 16 sets of 4 languages; vocabularies of different lengths.
• A – randomly generated; B – mutated from A; C – mutated from A; D – mutated from B.
• Mutation probabilities: copy – 0.65, change – 0.20, insert – 0.05, delete – 0.10.
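A sketch of how a child vocabulary could be generated under the four-state model with these probabilities. The exact generation procedure used for the experiments is not given on the slides, so the details below (per-character application, uniform replacement letters) are assumptions:

```python
import random

def mutate(word, rng, p_copy=0.65, p_change=0.20, p_insert=0.05, p_delete=0.10):
    """Apply the four-state mutation model to each character of a word,
    using the slide's probabilities by default. Delete takes the
    remaining probability mass."""
    assert abs(p_copy + p_change + p_insert + p_delete - 1.0) < 1e-9
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in word:
        r = rng.random()
        if r < p_copy:
            out.append(ch)                    # copy unchanged
        elif r < p_copy + p_change:
            out.append(rng.choice(alphabet))  # change to a random letter
        elif r < p_copy + p_change + p_insert:
            out.append(rng.choice(alphabet))  # insert a random letter
            out.append(ch)                    # ...then keep the original
        # else: delete (emit nothing)
    return "".join(out)

rng = random.Random(0)
print(mutate("asdfge", rng))
```

Running this over a random parent vocabulary yields child vocabularies with word-length variation, much like the A/B/C/D sets shown on the next slide.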
Results – Expanded model

Example of a set of 4 vocabularies used:

     A          B        C          D
1    awjmv.     afjmv.   wqmv.      afjnv.
2    bauke.     baxke.   auke.      bave.
3    doinet.    domnit.  deoinet.   domnit.
4    eni.       eol.     enc.       eol.
5    foijgnw.   fiogw.   foijnw.    fidgw.
…    …          …        …          …
50
Results – Expanded model

Possible tree structures for 4 languages:

[Diagrams: the null hypothesis (A, B, C, D totally unrelated); partially related structures; fully related structures; and the expected topology, A with children B and C, and D descending from B.]
Results – Expanded model

• Correct tree structure inferred 100% of the time.
• Sample of inferred tree and cost: A with children B and C; D descending from B.
• Language A: size = 383 chars, cost = 1821.121913 bits
Results – Expanded model

Branch A -> B:
• Pr(Delete) = 0.076250, Pr(Insert) = 0.038750, Pr(Mismatch) = 0.186250, Pr(Match) = 0.698750
• 4-state multinomial cost = 930.108894 bits

Branch A -> C:
• Pr(Delete) = 0.071250, Pr(Insert) = 0.038750, Pr(Mismatch) = 0.183750, Pr(Match) = 0.706250
• 4-state multinomial cost = 916.979371 bits

Branch B -> D:
• Pr(Delete) = 0.066580, Pr(Insert) = 0.035248, Pr(Mismatch) = 0.189295, Pr(Match) = 0.708877
• 4-state multinomial cost = 873.869382 bits

Note: each multinomial cost includes an extra cost of log(26) to state the new character for each mismatch and insert.

• Cost of fixing topology = log(7) = 2.81 bits
• Total tree cost = 930.11 + 916.98 + 873.87 + 1821.11 + log(7) + log(4) + log(3) + log(2) = 4549.46 bits
Results – Expanded model

European languages (601 words):

     English   French   German
1    even.     meme.    sogar.
2    eyes.     oeil.    auge.
3    false.    faux.    falsch.
4    fear.     peur.    angst.
…    …         …        …
601
Results – Expanded model

Best tree: French -> English -> German.

• Total cost of this tree = 56807.155 bits
• Cost of fixing topology = log(4) = 2 bits
• Cost of fixing root language (French) = log(3) = 1.585 bits
• Cost of French = no. of chars * log(27) = 21054.64 bits
• Cost of fixing parent/child language (English) = log(2) = 1 bit
• Cost of multistate distribution (French -> English) = 15567.98 bits
  MML inferred probabilities: Pr(Delete) = 0.164322, Pr(Insert) = 0.071429, Pr(Mismatch) = 0.357143, Pr(Match) = 0.407106
• Cost of multistate distribution (English -> German) = 20179.95 bits
  MML inferred probabilities: Pr(Delete) = 0.069480, Pr(Insert) = 0.189866, Pr(Mismatch) = 0.442394, Pr(Match) = 0.298260

Note: an extra cost of log(26) is needed for each mismatch, and log(27) for each insert, to state the new character.
Conclusion

• MML methods have managed to:
  – infer the correct phylogenetic trees/graphs for artificial languages;
  – infer phylogenetic trees/graphs for languages by encoding them in terms of one another.
• We cannot conclude that one language really descended from another; we can only conclude that they are related.
Future work:

• Compression – grammar and vocabulary.
• Compression – phonemes of languages.
• Endangered languages – Indigenous languages.
• Refine the coding scheme:
  – Some characters occur more frequently than others, e.g. English "x" compared with "a".
  – Some characters are more likely to mutate from one language to another.
Questions?
Papers on success of MML

• C. S. Wallace and P. R. Freeman. Single factor analysis by MML estimation. Journal of the Royal Statistical Society, Series B, 54(1):195-209, 1992.
• C. S. Wallace. Multiple factor analysis by MML estimation. Technical Report CS 95/218, Department of Computer Science, Monash University, 1995.
• C. S. Wallace and D. L. Dowe. MML estimation of the von Mises concentration parameter. Technical Report CS 93/193, Department of Computer Science, Monash University, 1993.
• C. S. Wallace and D. L. Dowe. Refinements of MDL and MML coding. The Computer Journal, 42(4):330-337, 1999.
• P. J. Tan and D. L. Dowe. MML inference of decision graphs with multi-way joins. In Proceedings of the 15th Australian Joint Conference on Artificial Intelligence, Canberra, Australia, 2-6 December 2002, Lecture Notes in Artificial Intelligence (LNAI) 2557, pages 131-142. Springer-Verlag, 2002.
• S. L. Needham and D. L. Dowe. Message length as an effective Ockham's razor in decision tree induction. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AI+STATS 2001), Key West, Florida, U.S.A., January 2001, pages 253-260, 2001.
• Y. Agusta and D. L. Dowe. Unsupervised learning of correlated multivariate Gaussian mixture models using MML. In Proceedings of the Australian Conference on Artificial Intelligence 2003, Lecture Notes in Artificial Intelligence (LNAI) 2903, pages 477-489. Springer-Verlag, 2003.
• J. W. Comley and D. L. Dowe. General Bayesian networks and asymmetric languages. In Proceedings of the Hawaii International Conference on Statistics and Related Fields, June 5-8, 2003.
• J. W. Comley and D. L. Dowe. Minimum Message Length, MDL and Generalised Bayesian Networks with Asymmetric Languages, chapter 11, pages 265-294. M.I.T. Press, 2005. [Camera ready copy submitted October 2003.]
• P. J. Tan and D. L. Dowe. MML inference of oblique decision trees. In Proc. 17th Australian Joint Conference on Artificial Intelligence (AI'04), Cairns, Qld., Australia, pages 1082-1088. Springer-Verlag, December 2004.