automated compositional change detection in saxo ... · - gesta danorum is tendentious, contains...

13
Automated Compositional Change Detection in Saxo Grammaticus’ Gesta Danorum K.L. Nielbo, M.L. Perner, C. Larsen, J. Nielsen and D.Laursen [email protected] knielbo.github.io Center for Humanities Computing Aarhus University, Denmark March 7, 2019

Upload: others

Post on 17-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

Automated Compositional Change Detection in SaxoGrammaticus’ Gesta Danorum

K.L. Nielbo, M.L. Perner, C. Larsen, J. Nielsen and [email protected]

knielbo.github.io

Center for Humanities ComputingAarhus University, Denmark

March 7, 2019

Page 2: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Outline

1 IntroductionSaxo Grammatricus

2 MethodsDataVector spaceSeeded LDASignal generation

3 ResultsKeyword change pointsDistance matricesChange detection

4 DiscussionSummaryProject development

Page 3: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Saxo’s Gesta Danorum

Saxo Grammatricus

- A medieval writer (c. 1160 - post 1208) that represent the beginning ofthe modern day historian in Scandinavia.

- Saxo’s history of the Danes Gesta Danorum (“Deeds of the Danes”) isthe single most important written source to Danish history in the 12th

century.

- Gesta Danorum is tendentious, contains elements of fiction, and itscompositions has been an academic subject of debate for more than acentury.

Composition debate

- Debate regarding the bipartite composition Gesta Danorum

1. is the transition between the old mythical and new historical partslocated in book eight, nine, or ten?2. is this transition gradual (continuous) or sudden (point-like)?

Page 4: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Data

Data set

- all sixteen books of Saxos Danmarkshistorietranslated from Latin by Peter Zeeberg andpublished by Det Danske Sprog- ogLitteraturselskab and G.E.C.Gads Forlag in2000.

Normalization

- books were concatenated and sliced innon-overlapping windows at a size of 50sentences

- unigrams were casefolded and numerals removed

- data-specific frequent words were removed

Page 5: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Naive baseline model

Figure 1: Geometrical document representation, where each document is a high rankword vector over the full vocabulary.

Page 6: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Alternative model

Algorithm 1 Classical LDA

1: for each k = 1...T do2: choose φk ∼ Dir(β)3: for each d choose θd ∼ Dir(α)

do4: for each token i = 1...Nd do5: select a zi ∼ Mult(θd )6: select a wi ∼ Mult(φxi )7: end for8: end for9: end for

Figure 2: In LDA each document isrepresented as a low rank dense vector (i.e.,a probability distribution over a small set oflatent topics). Seed words guide the modelsuch that the topics converge a specificdirection.

Jagarlamudi, J., Iii, H. D., & Udupa, R. (2012). Incorporating Lexical Priors into Topic Models. Proceeding EACL 12,204–213

Page 7: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Document distance and change detection

1. Distance D between every combination of two document slices s1 and s2 iscomputed for the baseline model using cosine distance DC :

DC (s1, s2) =s1 · s2

‖ s1 ‖‖ s2 ‖(1)

and the alternative model using relative entropy DKL:

DKL(s1 | s2) =n∑

i=1

si1 × log2

si1

si2(2)

2. A semantic change signal ∆D was estimated for each model by averagingover the distances from slice s j the preceding slices from s1 . . . s j−1:

∆D(sj ) =1

N

j−1∑i=1

D(sj , si ) (3)

3. Two change detection techniques, a mean- and variance-shift technique,were applied to each signal in order to identify statistically reliable changepoints in their respective mean and variance at an α-level of .01.

Page 8: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project developmentFigure 3: Keyword/entity counts with a mean-shift model. Notice that ArchbishopAbsalon is introduced in book 14.

Page 9: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Figure 4: “Document Recurrence Matrix” ∼ distance matrices for baseline (left) andalternative (right) models. Left pattern can be explained by the word burstiness, whilethe right pattern indicates a bipartite structure. Notice books rectangle from book 14-16on the left.

Page 10: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Figure 5: For the baseline model a no-change model explains more variance R2 = 0.52than the sigmoid model R2 = 0.02. The pattern is reversed for the contrast model, wherethe sigmoid model explains more variance R2 = 0.93 in comparison with the linear modelR2 = 0.86

Page 11: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

Summary

Findings

- Baseline shows no reliable change

- Alternative show gradual change starting in book 8 (latter part) and endsin book 10

- Greatest rate of change in book 9

- Both models indicate change in book 14

Interpretation

- strongest support for the continuous transition claim

- although the book 14 is the second book dealing with Saxo’scontemporaries, it introduces Archbishop Absalon

- baseline favors text slices that are strictly similar, while the alternative issensitive to relational similar slices

- LDA gives us a simple technique for clustering documents on a set ofhidden topics, which when combined with time series analysis offers greatpotential for historical research

Page 12: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

*** [HUMlab] data sprints ***

A collaborative approach to research & development

Page 13: Automated Compositional Change Detection in Saxo ... · - Gesta Danorum is tendentious, contains elements of ction, and its compositions has been an academic subject of debate for

AutomatedCompositional Change

Detection in SaxoGrammaticus’ Gesta

Danorum

K.L. Nielbo, M.L.Perner, C. Larsen, J.

Nielsen and [email protected]

knielbo.github.io

Introduction

Saxo Grammatricus

Methods

Data

Vector space

Seeded LDA

Signal generation

Results

Keyword changepoints

Distance matrices

Change detection

Discussion

Summary

Project development

TAK

[email protected]

knielbo.github.io

slides: http://knielbo.github.io/files/kln dhn19.pdf

& tak til[HUMlab], Copenhagen University Library, South Campus

Royal Library, DenmarkDet Danske Sprog- og Litteraturselskab