automated compositional change detection in saxo ... · - gesta danorum is tendentious, contains...
TRANSCRIPT
Automated Compositional Change Detection in SaxoGrammaticus’ Gesta Danorum
K.L. Nielbo, M.L. Perner, C. Larsen, J. Nielsen and [email protected]
knielbo.github.io
Center for Humanities ComputingAarhus University, Denmark
March 7, 2019
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Outline
1 IntroductionSaxo Grammatricus
2 MethodsDataVector spaceSeeded LDASignal generation
3 ResultsKeyword change pointsDistance matricesChange detection
4 DiscussionSummaryProject development
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Saxo’s Gesta Danorum
Saxo Grammatricus
- A medieval writer (c. 1160 - post 1208) that represent the beginning ofthe modern day historian in Scandinavia.
- Saxo’s history of the Danes Gesta Danorum (“Deeds of the Danes”) isthe single most important written source to Danish history in the 12th
century.
- Gesta Danorum is tendentious, contains elements of fiction, and itscompositions has been an academic subject of debate for more than acentury.
Composition debate
- Debate regarding the bipartite composition Gesta Danorum
1. is the transition between the old mythical and new historical partslocated in book eight, nine, or ten?2. is this transition gradual (continuous) or sudden (point-like)?
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Data
Data set
- all sixteen books of Saxos Danmarkshistorietranslated from Latin by Peter Zeeberg andpublished by Det Danske Sprog- ogLitteraturselskab and G.E.C.Gads Forlag in2000.
Normalization
- books were concatenated and sliced innon-overlapping windows at a size of 50sentences
- unigrams were casefolded and numerals removed
- data-specific frequent words were removed
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Naive baseline model
Figure 1: Geometrical document representation, where each document is a high rankword vector over the full vocabulary.
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Alternative model
Algorithm 1 Classical LDA
1: for each k = 1...T do2: choose φk ∼ Dir(β)3: for each d choose θd ∼ Dir(α)
do4: for each token i = 1...Nd do5: select a zi ∼ Mult(θd )6: select a wi ∼ Mult(φxi )7: end for8: end for9: end for
Figure 2: In LDA each document isrepresented as a low rank dense vector (i.e.,a probability distribution over a small set oflatent topics). Seed words guide the modelsuch that the topics converge a specificdirection.
Jagarlamudi, J., Iii, H. D., & Udupa, R. (2012). Incorporating Lexical Priors into Topic Models. Proceeding EACL 12,204–213
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Document distance and change detection
1. Distance D between every combination of two document slices s1 and s2 iscomputed for the baseline model using cosine distance DC :
DC (s1, s2) =s1 · s2
‖ s1 ‖‖ s2 ‖(1)
and the alternative model using relative entropy DKL:
DKL(s1 | s2) =n∑
i=1
si1 × log2
si1
si2(2)
2. A semantic change signal ∆D was estimated for each model by averagingover the distances from slice s j the preceding slices from s1 . . . s j−1:
∆D(sj ) =1
N
j−1∑i=1
D(sj , si ) (3)
3. Two change detection techniques, a mean- and variance-shift technique,were applied to each signal in order to identify statistically reliable changepoints in their respective mean and variance at an α-level of .01.
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project developmentFigure 3: Keyword/entity counts with a mean-shift model. Notice that ArchbishopAbsalon is introduced in book 14.
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Figure 4: “Document Recurrence Matrix” ∼ distance matrices for baseline (left) andalternative (right) models. Left pattern can be explained by the word burstiness, whilethe right pattern indicates a bipartite structure. Notice books rectangle from book 14-16on the left.
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Figure 5: For the baseline model a no-change model explains more variance R2 = 0.52than the sigmoid model R2 = 0.02. The pattern is reversed for the contrast model, wherethe sigmoid model explains more variance R2 = 0.93 in comparison with the linear modelR2 = 0.86
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
Summary
Findings
- Baseline shows no reliable change
- Alternative show gradual change starting in book 8 (latter part) and endsin book 10
- Greatest rate of change in book 9
- Both models indicate change in book 14
Interpretation
- strongest support for the continuous transition claim
- although the book 14 is the second book dealing with Saxo’scontemporaries, it introduces Archbishop Absalon
- baseline favors text slices that are strictly similar, while the alternative issensitive to relational similar slices
- LDA gives us a simple technique for clustering documents on a set ofhidden topics, which when combined with time series analysis offers greatpotential for historical research
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
*** [HUMlab] data sprints ***
A collaborative approach to research & development
AutomatedCompositional Change
Detection in SaxoGrammaticus’ Gesta
Danorum
K.L. Nielbo, M.L.Perner, C. Larsen, J.
Nielsen and [email protected]
knielbo.github.io
Introduction
Saxo Grammatricus
Methods
Data
Vector space
Seeded LDA
Signal generation
Results
Keyword changepoints
Distance matrices
Change detection
Discussion
Summary
Project development
TAK
knielbo.github.io
slides: http://knielbo.github.io/files/kln dhn19.pdf
& tak til[HUMlab], Copenhagen University Library, South Campus
Royal Library, DenmarkDet Danske Sprog- og Litteraturselskab